About the Role
We are looking for an experienced Senior Site Reliability Engineer (SRE) with a strong background in AWS/GCP cloud platforms to help ensure the reliability, scalability, and performance of our cloud-based systems and applications. The ideal candidate will have hands-on experience in designing, implementing, troubleshooting, and managing AWS infrastructure, along with a passion for automation, continuous improvement, and collaboration with cross-functional teams. The candidate will be part of a diverse team of software engineers and cryptographers and will play a key role in the efficient and effective enablement of the technologies being developed.
What You’ll Do
- Manage and optimize our GCP/AWS infrastructure using Terraform and other standard SRE practices
- Collaborate closely with cross-functional teams to ensure the reliability, scalability, and performance of our systems
- Deploy, support, and monitor new and existing services, platforms, and application stacks
- Use scale testing to measure, tune and optimize system performance
- Enhance, architect, author, and deliver software to improve the availability, scalability and security of SandboxAQ services
- Build and run systems, infrastructure and applications through automation
- Participate in on-call duties
About You
- Strong sense of ownership, customer service, and integrity, demonstrated through clear communication
- Experience in managing and scaling distributed systems in a public, private, or hybrid cloud environment
- Experience with deploying, supporting and supervising new and existing services, platforms, and application stacks
- Excellent troubleshooting and problem solving skills
- Experience with scale testing, disaster recovery, and capacity planning
- Passion for eliminating repetitive manual processes through automation and for improving them iteratively
- Proven ability to write programs in a high-level programming language such as Python, Java, Go, or Perl
- Experience working with CI/CD pipelines
- Experience handling large numbers of diverse systems with configuration management systems such as Puppet, Chef, Ansible, or Salt
- Understanding of the Linux operating system, including the kernel, memory, processes, threads, static and shared libraries, IPC, and signals
- Experience with Kubernetes, Nginx, Envoy, Prometheus, and/or Docker
- Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load balancing strategies
- BS in Computer Science or a related field, or equivalent work experience
Nice-to-haves
- Experience with Bazel
- Experience with FedRAMP/AWS GovCloud