What You’ll Do
- Remotely install, upgrade, operate and maintain bare-metal Kubernetes clusters (up to thousands of nodes each)
- Handle cluster degradation, recovery and resizing using our fleet management tooling
- Perform out-of-hours on-call response for critical incidents as part of a well-balanced on-call rotation
- Work on improving our tooling, automation, and processes for daily operations, alerting, and incident response
- Dive into systems at a low level to solve unique cluster problems and write up your findings
- Assist customers with high-level Kubernetes questions and integration with applications, storage and authentication
- Assist with initial cluster build-outs and validation to help identify failed hardware before customer delivery
- Work closely with our HPC Ops and Datacenter Ops teams on issues that require lower-level expertise or cross-functional solutions
- Mentor and assist less-experienced team members
- Have a voice in our product direction and help us think about how to minimize operational costs and complexity
You
- Are an experienced operations engineer, SRE, sysadmin or similar with a deep knowledge of running Linux clusters and systems
- Are very familiar with running on bare metal (including knowledge of BMCs, kernel drivers, PXE, RAID, VLANs, hypervisors)
- Have a good understanding of containers, virtualisation, and the mechanisms underpinning them
- Have a good understanding of daily operation, bug-fixing and maintenance of Kubernetes
- Have experience in an on-call environment and with incident response
- Can perform incident post-mortems and develop procedures and tooling to prevent root causes from recurring
- Have an excellent ability to learn on the fly and adapt to solve problems
- Are able to work either independently with limited direction, or as part of a team
- Are able to work with customers during incidents via tickets, live messaging, or as part of a larger call
Nice to Have
- Deep Kubernetes experience
- Experience with user-level restrictions and hardening (e.g. AppArmor)
- Experience with network engineering
- Experience with HPC clusters, environments & tooling
- Experience with large-scale AI/ML training clusters
- Experience with machine learning/AI frameworks
- A passion for running your own bare-metal lab
Salary Range Information
Based on market data and other factors, the salary range for this position is approximately €157,170 - €225,990. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.