Staff Site Reliability Engineer

About This Job

We are seeking an experienced and highly technical Staff Site Reliability Engineer (SRE) to join our Reliability Engineering team. As a Staff SRE, you will be the team's technical lead, developing and implementing innovative solutions that ensure the reliability, scalability, and performance of our critical systems. This is a highly impactful role in which you will help shape our SRE strategy, mentor team members, and drive significant improvements to our infrastructure and operations. You will work closely with cross-functional teams to design, build, and maintain systems that deliver exceptional user experiences and improve the uptime and availability of the company’s products and services.

What You'll Do...

Lead SRE Strategy: Define the overall technical direction and strategy for SRE at Dutchie, aligning with business goals and ensuring the highest levels of system reliability and stability.
Technical Leadership: Mentor and guide other engineers on best practices, emerging technologies, and industry trends, fostering a culture of continuous learning and improvement.
Project Execution: Drive the execution of key SRE projects, ensuring timely delivery, quality, and alignment with business objectives.
Operational Excellence: Collaborate with development and product teams to optimize system performance, reliability, and scalability.
Incident Management: Troubleshoot and resolve complex issues in production environments. Lead the resolution of critical incidents, conduct post-incident reviews, identify trends, and implement preventative measures to minimize future disruptions.
Automation: Champion automation initiatives to streamline processes, reduce manual toil, and improve operational efficiency.
Performance Optimization: Continuously monitor system capacity and performance, identify bottlenecks, and implement optimization strategies to maximize efficiency and resource utilization.
Collaboration: Partner with stakeholders across the organization to understand their needs, communicate SRE initiatives, and foster a collaborative environment.
Mentorship: Provide technical guidance and mentorship to junior SREs, helping them develop their skills and grow professionally.
Maximize Observability: Drive successful adoption and use of observability tools (Datadog) and logging (Splunk) across the organization. Implement and manage monitoring, alerting, and logging systems to ensure early detection of issues.
Business Continuity: Lead the design and implementation of disaster recovery and business continuity plans.
Support: Participate in the on-call rotation to ensure 24/7 availability of our systems and services.

What You Bring...

Bachelor's degree in Computer Science, Information Technology, or a related field.
10+ years of experience as a Site Reliability Engineer or in a related role, with a proven track record.
Strong expertise in cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
Strong technical expertise and leadership skills.
Proficient in scripting and automation using languages such as Python, Shell, or Go.
Solid understanding of networking, security, and infrastructure-as-code principles.
Experience with observability tools such as Datadog and logging solutions such as Splunk.
Proven track record of successfully leading incident response efforts and conducting post-mortems.
Experience in enabling application teams to enhance observability and reliability of their services.
Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
Excellent problem-solving and troubleshooting skills.

It's a bonus if you...

Have a Master's degree in Computer Science, Computer Engineering, or a related field
Have experience with containerization technologies (e.g., Docker, Kubernetes)
Have experience with Infrastructure as Code (IaC) tools (e.g., Pulumi, Terraform, CloudFormation)
Have experience with agile development methodologies (e.g., Scrum, Kanban)
Hold relevant industry certifications (e.g., CKAD)

You’ll Get…

We are targeting a starting salary of $190,000 based on the intended level for this role. There may be flexibility on individual compensation packages based on candidate skill set, experience, qualifications, and other position-related factors.

In addition to cash compensation, our total rewards package includes:

Full medical benefits, including dental and vision plans, to ensure you always have the best care.
Equity packages in the form of stock options for all employees.
Technology (hardware, software, reading materials, etc.) allowance
Flexible vacation and sick days

Dutchie