What you’ll be doing:
Build platforms, services, or tools that enable engineers to deliver operationally mature services in production
Help teams automate tedious tasks so they can launch new services quickly and operate them efficiently
Work with service owners to proactively design tests, observe results, and create fixes for complex failure scenarios
Help service owners identify and instrument Service Level Objectives and design alerts that follow best practices
Facilitate blameless postmortems and drive effective action items
What you must have: (These are must-haves: skills that someone needs on day 1 in the role and that are required for hiring)
Senior-level experience as a Systems Engineer, Site Reliability Engineer, or Production Engineer
Significant experience with cloud platforms such as Google Cloud Platform, Microsoft Azure, or Amazon Web Services
Fluency in one or more programming languages such as Go, Python, or Java
Ability to debug and optimize code
Experience with incident management platforms like FireHydrant
Ability to coordinate and manage incident response
Skills in defining and instrumenting SLOs and SLIs using query languages and observability tooling
Ability to automate tasks and processes with open source tools
Experience with streaming and database technologies such as Postgres, Kafka, Cassandra, Elasticsearch, etc.
Bonus points!
Previous experience as an SRE or System Engineer
Previous experience with FireHydrant
Previous experience with Backstage or another Developer Experience tool
Familiarity with Chef, Puppet, Ansible, or other configuration management tooling
Familiarity with Kubernetes, Docker, Go, Istio, Terraform, Vault, Google Cloud
The salary range for this position is $122,400 - $180,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus, long-term incentives, and benefits. This range is subject to change.