(Additional locations: San Francisco CA, Sunnyvale CA)
The Machine Learning Platform team builds and supports the critical Distributed Training Framework and tools for every machine learning engineer at Cruise. Our goal is to greatly accelerate the development cycle of machine learning models across the whole company, empowering machine learning engineers to focus on improving the car’s safety and performance instead of worrying about their infrastructure. We care about the performance, ease of use, and reliability of our products. We are driven by the success of our partner teams, who rely on our work to build the most advanced driverless cars in the world.
What you'll be doing:
- Design, implement, and deploy platforms and tools that support machine learning model training and evaluation workflows at Cruise.
- Own technical projects from start to finish and be responsible for major technical decisions and tradeoffs. Effectively engage in the team's planning, code reviews, and design discussions.
- Consider the effects of projects across multiple teams and proactively manage prioritization. Work closely with partner teams to ensure they benefit from the systems we build.
- Conduct technical interviews with well-calibrated standards and play an essential role in recruiting activities. Effectively onboard and mentor junior engineers and/or interns.
What you must have:
- 5+ years of experience building large-scale distributed applications with high-quality API design
- Experience with the ML development lifecycle and MLOps
- Strong coding skills in Python or C++
- Experience with distributed training
- Experience with optimizing model training performance
- Experience scaling model training to a large number of GPUs/CPUs or other accelerators
- Passionate about self-driving technology and its potential impact on the world
- BS, MS, or PhD in CS or Math, or equivalent real-world experience
- Can-do attitude and willingness to code
Bonus points!
- Knowledge and experience with machine learning algorithms
- Experience building distributed systems on cloud infrastructure
- Experience with deep learning frameworks such as PyTorch and TensorFlow
- Experience building frameworks with high-quality, lasting APIs
- Understanding of state-of-the-art training optimization algorithms, their performance profiles, and their effects on model convergence
- Experience scaling model performance optimization work across many teams
- Experience with build systems (Bazel, Buck, Blaze, or CMake)
- Experience working with Docker and Kubernetes
The salary range for this position is $183,600 - $270,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus and benefits. These ranges are subject to change.