Thoughts on Michelangelo: Uber's Machine Learning Platform
Last updated:
- ML Use cases at Uber
- Uber's Strategy for predicting ETA
- Pillar 1: Organization + Roles
- Pillar 2: Process + Best practices
- Pillar 3: Technology
- Steps in Uber's ML workflow
- Spreading the use of ML throughout the organization
- Key lessons learned
This is a review/summary of the post "Scaling Machine Learning at Uber with Michelangelo".
ML Use cases at Uber
Recommend menu items and restaurants on Uber Eats
Predict Estimated time of Arrival (ETA)
- For cars (Uber) and meals (Uber Eats)
Predict how many rides will be requested in a given area in the future
Predict which support team should answer a customer question
Detect trip issues (e.g. crashes or kidnappings) based on geocoded data
Uber's Strategy for predicting ETA
"ETAs are notoriously difficult to get right."
They have devised a new approach to predicting ETAs:
Split up the path from driver to rider into segments
Predict the time taken for the car to traverse each segment
Use historical data on ETA errors to correct the estimate for each segment
This strategy has produced large accuracy gains (up to 50%) over previous ways of predicting ETA.
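The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Uber's implementation: the `Segment` type, the distance-over-speed stand-in for the per-segment model, and the additive correction (mean of past errors, in seconds) are all hypothetical, since the post doesn't say how segments are modelled or how the correction is applied.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Segment:
    """One stretch of the route from driver to rider (hypothetical type)."""
    segment_id: str
    length_m: float
    typical_speed_mps: float


def predict_segment_seconds(segment: Segment) -> float:
    # Stand-in for a per-segment model: distance divided by typical speed.
    return segment.length_m / segment.typical_speed_mps


def historical_correction(segment_id: str, past_errors: dict[str, list[float]]) -> float:
    # Assumed correction: mean of past (actual - predicted) errors, in seconds.
    errors = past_errors.get(segment_id, [])
    return mean(errors) if errors else 0.0


def predict_eta_seconds(route: list[Segment], past_errors: dict[str, list[float]]) -> float:
    # Sum the corrected estimate of every segment along the route.
    return sum(
        predict_segment_seconds(seg) + historical_correction(seg.segment_id, past_errors)
        for seg in route
    )


route = [
    Segment("a", length_m=500.0, typical_speed_mps=10.0),  # 50 s before correction
    Segment("b", length_m=1200.0, typical_speed_mps=8.0),  # 150 s before correction
]
past_errors = {"a": [10.0, 20.0]}  # segment "a" has run 15 s late on average
eta = predict_eta_seconds(route, past_errors)
print(f"ETA: {eta:.0f} s")
```

Segments with no error history simply fall back to the uncorrected estimate, which keeps the sketch usable on routes that include new roads.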
Pillar 1: Organization + Roles
Product teams use the ML platform (Michelangelo) and they own their models (end-to-end)
Specialists provide on-demand help to product teams for specific tasks (Computer Vision, NLP, etc.)
Research teams do research and suggest new capabilities for the ML platform.
ML platform teams are made up of engineers who work on the main ML platform, which is used by all product teams.
Pillar 2: Process + Best practices
Launch playbooks are project templates for new models. They provide prebuilt structures for new projects.
Product and platform teams work together to find out if a given problem can be solved using the standard platform or if a custom solution needs to be built.
In order to foster an ML community within the company:
- Annual ML conferences are held internally for all teams working with ML.
- Engineers are encouraged to attend external events, publish papers, etc.
Internal training is done via bootcamps, office hours and workshops.
Pillar 3: Technology
The end-to-end ML workflow runs from the data the models consume all the way to the predictions served to clients. The whole stack is important, not just the part where models are trained and inference is performed.
Patterns from software engineering can and should be applied to code used in machine learning models.
Models are trained iteratively and the speed with which teams iterate defines how many experiments can be made and how good models will be.
It's important to have reusable platforms and modules but there will always be cases when customized solutions need to be made.
- At Uber, they allow the Michelangelo platform to be used whole (covers most cases) but its parts can also be imported into ad-hoc projects that need more customization.
Steps in Uber's ML workflow
"We found that the same workflow applies across a wide array of scenarios, including traditional ML and deep learning; supervised, unsupervised, and semi-supervised learning; online learning; batch, online, and mobile deployments; and time-series forecasting."
Managing data: Standard feature stores help train models faster and ensure consistency between training and inference times.
Training models: Training is now done using a tool called DSW (see below). Models can be supervised, unsupervised, neural nets, tree-based, etc.
Model Evaluation: Testing and comparing different models and hyperparameters. Uber uses Bayesian Hyperparameter Optimization.
Model Management: Models are versioned and reproducible. You can retrieve a trained model and look at performance metrics, etc.
Deployment: The workflow needs to handle both offline and online workloads.
Data Monitoring: Input data needs to be monitored to ensure the models are being fed good data. Ideally, alerts should be sent out when there are issues in the data.
Prediction Monitoring: Ideally, you should log predictions and compare them to actual outcomes. When that's not possible, you must at least monitor distributions of features and/or predictions over time.
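The two monitoring modes in the last step can be sketched like this. It's a minimal example under my own assumptions, not something described in the post: mean absolute error stands in for "compare predictions to actual outcomes", and the Population Stability Index (PSI) is one common way to track a distribution over time. The bin edges, the `eps` smoothing term and the 0.2 alert threshold are illustrative choices.

```python
import math


def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    # Usable only when actual outcomes are eventually observed.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def histogram(values: list[float], edges: list[float]) -> list[float]:
    # Fraction of values in each bin [edges[i], edges[i+1]);
    # values outside the range are clamped into the first or last bin.
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = sum(1 for e in edges[1:-1] if v >= e)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    reference: list[float], recent: list[float], edges: list[float], eps: float = 1e-6
) -> float:
    # PSI = sum over bins of (recent - reference) * ln(recent / reference),
    # with a small eps so that empty bins don't produce log(0).
    ref = histogram(reference, edges)
    rec = histogram(recent, edges)
    return sum((q - p) * math.log((q + eps) / (p + eps)) for p, q in zip(ref, rec))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]   # predictions at launch
shifted = [0.8, 0.85, 0.9, 0.95, 0.9, 0.8, 0.7, 0.9]   # predictions this week

ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb value, tune per model
psi = population_stability_index(reference, shifted, edges)
if psi > ALERT_THRESHOLD:
    print(f"ALERT: prediction distribution drifted (PSI={psi:.2f})")
```

The same `population_stability_index` function can be pointed at an individual input feature instead of the predictions, which covers the data monitoring step as well.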
Spreading the use of ML throughout the organization
"When Michelangelo started, the most urgent and highest impact use cases were some very high scale problems, which led us to build around Apache Spark and Java. This structure worked well for production training and deployment of many models but left a lot to be desired in terms of overhead, flexibility, and ease of use, especially during early prototyping and experimentation."
Although the Michelangelo platform is built around Apache Spark and Java, the ML team at Uber now provides a web front-end where engineers can experiment with and train models that are then passed on to Michelangelo.
It's called Data Science Workbench (it doesn't look like it's open-source).
It's a notebook-like web UI based on React, which allows users to train and interact with models before deploying them on Michelangelo.
It was created to make it easier for people from product teams (not necessarily experienced data scientists) to train and experiment with models.
Key lessons learned
End-users want to use tools they are comfortable with, i.e. they want to define models using Python or R, not Java and Scala.
Broken data is the most common cause of problems in production ML systems. Using standard feature stores helps with this because they are easier to monitor.
Standard libraries and tools don't necessarily work at massive scales. This includes things like Spark and Cassandra.
What users want vs what users need: Heeding immediate user demands guarantees tools are used and generate impact (read: make money). However, platform engineers should also know when to leave space for future enhancements even though users haven't asked for those (yet).
Systems that need to serve predictions offline and online are challenging because there are few tools that address this niche.
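To make the points about standard feature stores and about serving both offline and online a bit more concrete, here is a small sketch of the underlying idea: define each feature once and reuse that definition in the batch (training) path and in the online (serving) path. Everything in it is hypothetical, including the `FEATURES` registry, the feature names and the record fields; a real feature store is a separate service, not an in-memory dict.

```python
from typing import Callable

# Hypothetical in-memory "feature store": feature name -> function of a raw record.
FEATURES: dict[str, Callable[[dict], float]] = {
    "trip_distance_km": lambda r: r["distance_m"] / 1000.0,
    "is_rush_hour": lambda r: 1.0 if r["hour"] in (7, 8, 9, 17, 18, 19) else 0.0,
}


def featurize(record: dict) -> list[float]:
    # Single code path for feature computation, in a fixed (sorted-by-name) order.
    return [FEATURES[name](record) for name in sorted(FEATURES)]


def build_training_matrix(records: list[dict]) -> list[list[float]]:
    # Offline / batch path: featurize a whole historical dataset.
    return [featurize(r) for r in records]


def featurize_request(request: dict) -> list[float]:
    # Online path: featurize one incoming request with the same definitions.
    return featurize(request)


record = {"distance_m": 2500, "hour": 8}
training_row = build_training_matrix([record])[0]
serving_row = featurize_request(record)
print(training_row == serving_row)
```

Because both paths call `featurize`, a change to a feature definition can't silently apply to training but not to serving, and there is one place to attach monitoring.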
References:
-
- It is an open-source tool that makes it easier to train deep learning models (TF, Keras, PyTorch) in a distributed manner, e.g. training a neural net on 4 machines with 4 GPUs each.