Software Engineer - ML Reliability

AS A BACKEND ENGINEER ON THE CONVERSION ML TEAM, YOU WILL:

Design, implement, and maintain robust ML architecture to ensure high availability, reliability, and performance of ML models in production.
Implement monitoring tools and processes to track the performance of machine learning models in production, identifying any issues or degradation over time.
Provide best practices and run proofs of concept for automated, efficient model operations at large scale.
Lead and participate in incident response efforts, conducting root cause analysis and implementing corrective actions to prevent recurrence.
Create and maintain comprehensive documentation for ML infrastructure, processes, and best practices.
Work closely with cross-functional teams, including data scientists, software engineers, and product managers, to align on goals and deliverables.
Contribute to an “engineering excellence” culture through state-of-the-art tools, risk-driven testing, explainable systems, and code review.
Join a nimble, consistently excellent, and experienced engineering team.

Responsibilities:

Have end-to-end ownership of projects, and collaborate with a small team of world-class engineers with diverse backgrounds.
Ship code multiple times a day, and within seconds see its quantified impact on millions of users and our business's revenue.
Become an authority in Clojure, Go, and the many other cutting-edge open source technologies that maximize our development velocity.

Requirements:

5+ years of software engineering experience.
3+ years of experience in machine learning, software engineering, or reliability engineering, with a focus on production systems.
Solid core CS fundamentals (data structures, algorithms, architecting systems).
Proficiency in Python, Go, or similar programming languages.
Experience with ML frameworks (e.g., TensorFlow, PyTorch) and cloud platforms (e.g., AWS, GCP, Azure).
Experience with ML monitoring tools (e.g., Prometheus, Grafana).
Experience with big data engines such as Trino and Spark is a big plus.
Strong problem-solving skills and the ability to work collaboratively across teams.
Excitement about working on large-scale ML and data systems.
Ability to lead across team and role boundaries to effect large-scale change in culture and systems.
A healthy sense of fun!

Nice to have:

Experience with ML systems for training Transformer models or CTR prediction models.
Experience in AdTech is a strong plus.

Location:

This role is eligible for full-time remote work in one of our entities: CA, CO, ID, IL, FL, GA, MA, MN, MO, NC, NJ, NV, OR, PA, RI, TX, UT, and WA. We are a remote-first company with US hubs in Redwood City, Los Angeles, and NYC.

Liftoff offers all employees a full compensation package that includes equity and health/vision/dental benefits associated with your country of residence. Base compensation will vary based on candidate location and experience. The following are our base salary ranges for this role:

SF Bay Area, NYC, Los Angeles/Orange County: $180,000 - $220,000
Seattle/Olympia, Austin, San Diego, Santa Barbara, Boston: $165,000 - $205,000
All other cities and towns in our approved states: $155,000 - $190,000

We use Covey as part of our hiring and/or promotional process for jobs in NYC, and certain features may qualify it as an AEDT. As part of the evaluation process, we provide Covey with job requirements and candidate-submitted applications. We began using Covey Scout for Inbound on January 22, 2024.

Please see the independent bias audit report covering our use of Covey here.