(Additional locations: San Francisco CA, Sunnyvale CA)
The Machine Learning Platform team builds and supports the critical Distributed Training Framework and tools for every machine learning engineer at Cruise. Our goal is to greatly accelerate the development cycle of machine learning models across the whole company, empowering machine learning engineers to focus on improving the car’s safety and performance instead of worrying about their infrastructure. We care about the performance, ease of use, and reliability of our products. We are driven by the success of our partner teams, who rely on our work to build the most advanced driverless cars in the world.
What you'll be doing:
- Design, implement, and deploy platforms and tools that support machine learning model training and evaluation workflows at Cruise.
- Own technical projects from start to finish and be responsible for major technical decisions and tradeoffs. Effectively engage in the team's planning, code reviews, and design discussions.
- Consider the effects of projects across multiple teams and proactively manage prioritization. Work closely with partner teams to ensure they benefit from the systems we build.
- Conduct technical interviews with well-calibrated standards and play an essential role in recruiting activities. Effectively onboard and mentor junior engineers and/or interns.
What you must have:
- 5+ years of experience building large-scale distributed applications with high-quality API design
- Experience with the ML development lifecycle and MLOps
- Strong coding skills in Python or C++
- Experience with distributed training
- Experience with optimizing model training performance
- Experience scaling model training to a large number of GPUs/CPUs or other accelerators
- Passionate about self-driving technology and its potential impact on the world
- BS, MS, or PhD in CS or Math, or equivalent real-world experience
- Can-do attitude and willingness to code
Bonus points!
- Knowledge and experience with machine learning algorithms
- Experience building distributed systems on cloud infrastructure
- Experience with deep learning frameworks such as PyTorch and TensorFlow
- Experience building frameworks with high-quality, lasting APIs
- Understanding of state-of-the-art training optimization algorithms, their performance profiles, and their effects on model convergence
- Experience scaling model performance optimization work across many teams
- Experience with build systems (Bazel, Buck, Blaze, or CMake)
- Experience working with Docker and Kubernetes
The salary range for this position is $183,600 - $270,000. Compensation will vary depending on location, job-related knowledge, skills, and experience. You may also be offered a bonus and benefits. These ranges are subject to change.