About the Role:
As a Site Reliability Engineer on the Platform team, you will help us design, scale, and manage our growing AWS-backed infrastructure. Working with an enthusiastic team, you will contribute to scaling our architecture and building a highly available system. We are looking for candidates who have production experience with AWS-based platforms, expertise in automating distributed systems, experience scaling a fast-growing platform and maintaining high availability, and a forward-thinking mindset ready to take on tomorrow's challenges.
What You’ll Do:
- Work with other engineering and product teams to design and build the infrastructure required to deliver new features to customers
- Automate the provisioning, scaling, and management of our infrastructure using Configuration As Code and Configuration Management
- Identify and remove bottlenecks from systems in production
- Ensure 99.99% customer-facing uptime
- Continuously improve the monitoring and alerting capabilities of our platform, enabling us to be proactive instead of reactive
What We’re Looking For:
- 4+ years of professional SRE/DevOps experience and a demonstrated ability to work on high-volume production systems
- Experience with container orchestration frameworks such as Kubernetes or Docker Swarm
- Working knowledge of AWS services and technologies (Redshift, DynamoDB, Kinesis, RDS, ELB, Auto Scaling, Lambda, etc.)
- Experience with infrastructure as code and configuration management (Terraform, Ansible, CloudFormation, Chef, etc.)
- Knowledge of Python, Bash, or other scripting languages. Knowledge of Ruby or Golang is a plus.