About This Job
We are seeking an experienced and highly technical Staff Site Reliability Engineer (SRE) to join our Reliability Engineering team. As a Staff SRE, you will serve as the team's technical lead, developing and implementing innovative solutions to ensure the reliability, scalability, and performance of our critical systems. This is a highly impactful role in which you will help shape our SRE strategy, mentor team members, and drive significant improvements to our infrastructure and operations. You will work closely with cross-functional teams to design, build, and maintain systems that deliver exceptional user experiences and improve the uptime and availability of the company’s products and services.
What You'll Do...
- Lead SRE Strategy: Define the overall technical direction and strategy for SRE at Dutchie, aligning with business goals and ensuring the highest levels of system reliability and stability.
- Technical Leadership: Mentor and guide other engineers on best practices, emerging technologies, and industry trends, fostering a culture of continuous learning and improvement.
- Project Execution: Drive the execution of key SRE projects, ensuring timely delivery, quality, and alignment with business objectives.
- Operational Excellence: Collaborate with development and product teams to optimize system performance, reliability, and scalability.
- Incident Management: Troubleshoot and resolve complex issues in production environments. Lead the resolution of critical incidents, conduct post-incident reviews, identify trends, and implement preventative measures to minimize future disruptions.
- Automation: Champion automation initiatives to streamline processes, reduce manual toil, and improve operational efficiency.
- Performance Optimization: Continuously monitor system capacity and performance, identify bottlenecks, and implement optimization strategies to maximize efficiency and resource utilization.
- Collaboration: Partner with stakeholders across the organization to understand their needs, communicate SRE initiatives, and foster a collaborative environment.
- Mentorship: Provide technical guidance and mentorship to junior SREs, helping them develop their skills and grow professionally.
- Maximize Observability: Drive successful adoption and use of observability tools (Datadog) and logging (Splunk) across the organization. Implement and manage monitoring, alerting and logging systems to ensure early detection of issues.
- Business Continuity: Lead the design and implementation of disaster recovery and business continuity plans.
- Support: Participate in the on-call rotation to ensure 24/7 availability of our systems and services.
What You Bring...
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 10+ years of experience as a Site Reliability Engineer or a related role with a proven track record.
- Strong expertise in cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
- Strong technical expertise and leadership skills.
- Proficient in scripting and automation using languages such as Python, Shell, or Go.
- Solid understanding of networking, security, and infrastructure-as-code principles.
- Experience with observability tools such as Datadog and logging solutions such as Splunk.
- Proven track record of successfully leading incident response efforts and conducting post-mortems.
- Experience in enabling application teams to enhance observability and reliability of their services.
- Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
- Excellent problem-solving and troubleshooting skills.
It's a bonus if you...
- Have a Master's degree in Computer Science, Computer Engineering, or a related field
- Have experience with containerization technologies (e.g., Docker, Kubernetes)
- Have experience with Infrastructure as Code (IaC) tools (e.g., Pulumi, Terraform, CloudFormation)
- Have experience with agile development methodologies (e.g., Scrum, Kanban)
- Hold relevant industry certifications (e.g., CKAD)
You’ll Get…
We are targeting a starting salary of $190,000 based on the intended level for this role. There may be flexibility on individual compensation packages based on candidate skill set, experience, qualifications, and other position-related factors.
In addition to cash compensation, our total rewards package includes:
- Full medical benefits including dental and vision plans to ensure you always have the best care.
- Equity packages in the form of stock options for all employees.
- Technology (hardware, software, reading materials, etc.) allowance
- Flexible vacation and sick days