Summary
We have an opening for a Staff SWE - Reliability (Service Reliability) on our Cloud Data Store (CDS) team, which is innovating in cloud storage and rethinking the persistence layer of Temporal in a customized way. We’re solving hard distributed systems problems related to databases and the way storage works for the cloud. The charter of the CDS team is to increase the reliability and scalability, and reduce the COGS, of running Cloud Temporal. See what the CDS team has been working on recently.
What You'll Do
- Be the first SWE for Service Reliability in the CDS team.
- Design, develop, and implement systems to enhance the operational efficiency and effectiveness of the service and, in turn, the CDS team.
- Work within a highly collaborative team – and across team boundaries – to ensure exceptional service reliability during a period of hyper-growth and expansion.
- Implement operational best practices, such as alerting and runbooks, to efficiently manage and maintain a high-scale distributed database system.
- Drive the team towards achieving a high degree of automation.
What You'll Bring
- Experience contributing to complex cross-team engineering efforts focused on cloud, compute, networking, and storage infrastructure.
- At least 10 years of programming experience (Go, Java, or another applicable language) and experience writing concurrent code.
- Deep experience in at least one cloud infrastructure environment (AWS, GCP, or Azure) and familiarity with adjacencies.
- 6+ years of industry experience designing, building, and operating large, highly concurrent, reliable, and scalable distributed systems.
- Excellent collaboration and communication skills, with a strong sense of ownership and integrity demonstrated through clear communication and cross-team collaboration.
- A proactive approach to identifying problems, performance bottlenecks, and areas for improving service reliability.
- Experience driving alignment across an organization and contributing to long-term roadmaps.
- Deep knowledge of SRE principles, including monitoring, alerting, error budgets, fault analysis, failover, and other common reliability engineering concepts.
- Fresh ideas and a beginner's mindset for improving team velocity (time to production while maintaining high quality).
- BS or MS in Computer Science or a related discipline, or equivalent industry experience.
Compensation
- The estimated pay range for this role is $175,000 - $240,000.
- Additionally, this role is eligible to participate in Temporal's equity plan.