Senior Kubernetes Operations Engineer

What You’ll Do

Remotely install, upgrade, operate and maintain bare-metal Kubernetes clusters (up to thousands of nodes each)
Handle cluster degradation, recovery and resizing using our fleet management tooling
Perform out-of-hours on-call response for critical incidents as part of a well-balanced on-call rotation
Work on improving our tooling, automation, and processes for daily operations, alerting, and incident response
Dive into systems at a low level to solve unique cluster problems and write up your findings
Assist customers with high-level Kubernetes questions and integration with applications, storage and authentication
Assist with initial cluster build-outs and validation to help identify failed hardware before customer delivery
Work closely with our HPC Ops and Datacenter Ops teams on issues that require lower-level expertise or cross-functional solutions
Mentor and assist less-experienced team members
Have a voice in our product direction and help us think about how to minimize operational costs and complexity

You

Are an experienced operations engineer, SRE, sysadmin or similar with a deep knowledge of running Linux clusters and systems
Are very familiar with running on bare metal (including knowledge of BMCs, kernel drivers, PXE, RAID, VLANs, hypervisors)
Have a good understanding of containers, virtualisation, and the mechanisms underpinning them
Have a good understanding of daily operation, bug-fixing and maintenance of Kubernetes
Have experience in an on-call environment and with incident response
Can perform incident post-mortems and develop procedures and tooling to prevent root causes from recurring
Have an excellent ability to learn on the fly and adapt to solve problems
Are able to work either independently with limited direction, or as part of a team
Are able to work with customers during incidents via tickets, live messaging, or as part of a larger call

Nice to Have

Deep Kubernetes experience
Experience with user-level restrictions and hardening (e.g. AppArmor)
Experience with network engineering
Experience with HPC clusters, environments & tooling
Experience with large-scale AI/ML training clusters
Experience with machine learning/AI frameworks
A passion for running your own bare-metal lab

Salary Range Information

Based on market data and other factors, the salary range for this position is approximately €157,170 - €225,990. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

Lambda