Senior Site Reliability Engineer

About the Role

We are looking for an experienced Senior Site Reliability Engineer (SRE) with a strong background in AWS/GCP cloud platforms to play a key role in ensuring the reliability, scalability, and performance of our cloud-based systems and applications. The ideal candidate will have hands-on experience designing, implementing, troubleshooting, and managing AWS infrastructure, along with a passion for automation, continuous improvement, and collaboration with cross-functional teams. The candidate will join a diverse team of software engineers and cryptographers and help enable the technologies being developed efficiently and effectively.

What You’ll Do

Manage and optimize our GCP/AWS infrastructure using Terraform and other standard SRE tools and practices
Collaborate closely with cross-functional teams to ensure the reliability, scalability, and performance of our systems
Deploy, support, and monitor new and existing services, platforms, and application stacks
Use scale testing to measure, tune and optimize system performance
Enhance, architect, author, and deliver software to improve the availability, scalability and security of SandboxAQ services
Build and run systems, infrastructure and applications through automation
Participate in on-call duties

About You

Strong sense of ownership, customer service, and integrity, demonstrated through clear communication
Experience in managing and scaling distributed systems in a public, private, or hybrid cloud environment
Experience with deploying, supporting and supervising new and existing services, platforms, and application stacks
Excellent troubleshooting and problem solving skills
Experience with scale testing, disaster recovery, and capacity planning
Passion for eliminating repetitive manual processes through automation and improving them iteratively
Proven ability to write programs in a high-level programming language such as Python, Java, Go, or Perl
Experience working with CI/CD pipelines
Experience handling large numbers of diverse systems with configuration management systems such as Puppet, Chef, Ansible, or Salt
Understanding of the Linux operating system, including the kernel, memory, processes, threads, static and shared libraries, IPC, and signals
Experience with Kubernetes, Nginx, Envoy, Prometheus, and/or Docker
Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load-balancing strategies
BS in Computer Science or a related field, or equivalent work experience

Nice-to-haves

Experience with Bazel
Experience with FedRAMP/AWS GovCloud

SandboxAQ