Paper Summary: 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com
Please note: This post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
WHAT
A study analyzing the impact of machine learning models from the business perspective, based on ML models in production at Booking.com.
WHY
Because the overwhelming majority of studies of ML systems focus on the technical aspects rather than on the social and/or business metrics associated with them.
HOW
Authors draw on their experience building and deploying around 150 machine learning models at Booking.com, each validated through randomized controlled experiments, and distill six lessons from it.
THE 6 LESSONS LEARNED
1. INCEPTION: MACHINE LEARNING AS A SWISS KNIFE FOR PRODUCT DEVELOPMENT
Authors describe the different model scopes and how they can help in nearly all product types.
2. MODELING: OFFLINE MODEL PERFORMANCE IS JUST A HEALTH CHECK
They show that model gains do not always translate to business gains and give some reasons why that happens.
3. MODELING: BEFORE SOLVING A PROBLEM, DESIGN IT
Authors argue that the end-to-end lifecycle of a model (including the difficulties of defining/creating target variables and obtaining data) needs to be thought through thoroughly before any code is written.
4. DEPLOYMENT: TIME IS MONEY
They analyze the impact of models that are slow at inference time, to the point where the decrease in user experience due to latency can offset any gains made by the model.
5. MONITORING: UNSUPERVISED RED FLAGS
They mention problems in obtaining target variables for scored examples (e.g. only a subset of the population actually produces targets, or targets take a long time to emerge after an example has been scored).
They describe a type of analysis called the RDC (Response Distribution Chart), which is a way to analyze how the predictions of a binary classifier are distributed. Stable models that discriminate positives from negatives should have a plot that is smooth and skewed.
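As a rough illustration (my own sketch, not code from the paper), an RDC is essentially a histogram of the model's predicted scores on live traffic; the data below is simulated:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical predicted probabilities from a binary classifier on live traffic.
# A healthy, discriminative model tends to produce a smooth distribution;
# unexpected spikes or empty regions are red flags worth investigating.
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.beta(2, 8, size=8_000),   # simulated negatives, scores near 0
    rng.beta(8, 2, size=2_000),   # simulated positives, scores near 1
])

plt.hist(scores, bins=50, density=True)
plt.xlabel("Predicted probability")
plt.ylabel("Density")
plt.title("Response Distribution Chart (RDC)")
plt.show()
```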
6. EVALUATION: EXPERIMENT DESIGN SOPHISTICATION PAYS OFF
Authors explain how carefully setting up an experiment with multiple control groups and separate populations (including separating the subsets of the population that can/cannot be treated) gives engineers a high degree of certainty when affirming that one model is better than another (or than the control groups).
NOTES
- Authors claim guests are in a continuous state of cold start: each time they want to book a new travel package, it is a whole new product they are looking for, and these visits are often very far apart in time.
MY 2¢: MY LESSONS
These give a slightly different take on the article.
Machine learning can be used for many different products, in widely different contexts
For example, broad-scope (serving many products) vs narrow-scope (serving a single product) models: creating a set of understandable features for entities (e.g. in a feature store) could be an example of a broad-scope model, as these features can be used in many downstream products, in all sorts of contexts.
From the article, training a model that outputs a user's openness to change could be reused in many products because it models a key aspect of a human being, as sketched below.
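A hypothetical sketch of the reuse idea: a single broad-scope score is computed once and consumed by several unrelated products (all names, thresholds, and logic here are made up for illustration):

```python
from dataclasses import dataclass

# Hypothetical broad-scope model output: one "openness to change" score
# per user, reused by several downstream products.
@dataclass
class UserFeatures:
    user_id: str
    openness_to_change: float  # output of the shared broad-scope model

def rank_destinations(features: UserFeatures, destinations: list[str]) -> list[str]:
    # A recommender might surface more unusual destinations for open users.
    return destinations if features.openness_to_change > 0.5 else destinations[:3]

def pick_banner(features: UserFeatures) -> str:
    # A marketing product can reuse the very same score for messaging.
    return "Try something new!" if features.openness_to_change > 0.5 else "Your favorites await"
```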
Better model metrics do not always translate to better business metrics
We usually measure proxy metrics instead of actual business metrics. This is often done because some metrics are much easier to measure than business outcomes like profit or revenue.
However, even when the relationship looks obvious (e.g. CTR (click-through rate) and conversion), it may be that optimizing the proxy metric does not lead to an increase in the business metric of interest.
Examples include: CTR vs Conversion Rate (in marketing ads), AUC vs Revenue per Customer, RMSE vs Diversity (in recommender systems).
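One way to sanity-check a proxy is to compare, across past experiments, the offline metric gains against the business gains actually observed online. A minimal sketch of that check, with entirely made-up numbers:

```python
import numpy as np

# Hypothetical data: per-experiment offline metric gains (e.g. AUC deltas)
# paired with the online business gains observed in the A/B tests.
offline_gains = np.array([0.010, 0.030, 0.005, 0.020, 0.040, 0.015])
online_gains = np.array([0.20, -0.10, 0.05, 0.00, -0.30, 0.40])  # % conversion uplift

# If the proxy were a faithful stand-in for the business metric,
# this correlation would be strongly positive.
corr = np.corrcoef(offline_gains, online_gains)[0, 1]
print(f"Correlation between offline and online gains: {corr:.2f}")
```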
Response time is very important for web products
Overly complex ML models can and do increase latency, which negatively affects user experience.
It may be the case that the operational latency and/or risk introduced by a new model largely offsets any potential gains it might bring.
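A back-of-the-envelope way to reason about this tradeoff (all numbers here are illustrative assumptions, not figures from the paper):

```python
# Does a model's expected uplift survive its added serving latency?
model_uplift_pct = 0.50           # assumed +0.5% conversion from better predictions
added_latency_ms = 120            # extra inference latency the model introduces
conversion_cost_per_100ms = 0.30  # assumed % conversion lost per 100 ms of latency

latency_penalty_pct = (added_latency_ms / 100) * conversion_cost_per_100ms
net_uplift_pct = model_uplift_pct - latency_penalty_pct
print(f"Net expected uplift: {net_uplift_pct:+.2f}%")  # here: +0.14%
```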
Training on one population and scoring another population leads to problems
Example: training a model whose target variable is only available for users who have actually booked a hotel reservation and ended up actually going there, and then using this model to score every user who enters the website (which obviously includes people who will not book hotels).
This can lead to problems because the validation metrics (AUC, etc.) were obtained on an out-of-sample set containing only people who have booked hotels; the actual real-world performance will likely be very different, because you are scoring a population that includes people who will never book a hotel reservation.
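A small simulation of this selection bias, assuming a single feature drives both booking propensity and the target (simulating lets us peek at labels that, in reality, would be unobservable for non-bookers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulated visitors: one feature drives both "will book" and the modeled target.
n = 20_000
x = rng.normal(size=(n, 1))
books = rng.random(n) < 1 / (1 + np.exp(-3 * x[:, 0]))        # who ends up booking
y = (rng.random(n) < 1 / (1 + np.exp(-x[:, 0]))).astype(int)  # target of interest

# Train and validate only on bookers, the population for which labels exist.
model = LogisticRegression().fit(x[books], y[books])
auc_bookers = roc_auc_score(y[books], model.predict_proba(x[books])[:, 1])

# Score everyone who enters the website: a visibly different population.
auc_all = roc_auc_score(y, model.predict_proba(x)[:, 1])

# The two numbers differ substantially, so a metric validated on bookers
# is not a reliable estimate for the traffic the model will actually score.
print(f"AUC on bookers only: {auc_bookers:.3f}, AUC on all visitors: {auc_all:.3f}")
```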
Experiment setup is important and it's worthwhile to do it precisely
Example: in order to test whether a new model actually delivers better results for the company, you will probably want to set up control groups that do not get exposed to the model's decisions (e.g. a recommender system) in order to better measure causality. This approach is more generally called a randomized controlled trial (RCT).
But this can also be extended: testing multiple models at once, having specific control groups containing only subjects that could be treated, etc. This can also be used to detect implementation problems in the model code itself.
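A hypothetical sketch of such an assignment scheme: deterministic bucketing plus a separate branch for untreatable subjects (the experiment name, splits, and variant labels are all made up):

```python
import hashlib

def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    # Deterministic, roughly uniform assignment: a user always sees the same variant.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def assign(user_id: str, treatable: bool) -> str:
    # Only subjects the model *could* treat enter the comparison, which
    # isolates the model's causal effect from eligibility itself.
    if not treatable:
        return "untreatable"  # analyzed separately, never exposed
    b = bucket(user_id, "new-ranker-v2")
    if b < 40:
        return "control"      # eligible, but the model output is withheld
    elif b < 80:
        return "model-A"
    else:
        return "model-B"
```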
MY 2¢ - INSIGHTS
The complexity of machine learning models is affected by the impact our decisions have on the population
For example: a credit risk model outputs a potential client's default risk.
The action taken with respect to the model output can be very simple: either give this client a credit line or not¹, the objective being to give credit to people who are more likely to pay it back.
This can be thought of as a binary decision which doesn't affect the population. It's regular supervised learning.
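In code, the whole decision policy could be a one-line threshold rule (an illustrative sketch, not an actual credit system):

```python
# The model's score maps to a single binary action, and the decision
# does not feed back into the data-generating process for future clients.
def approve_credit(default_risk: float, threshold: float = 0.2) -> bool:
    # Approve whenever the predicted default risk is below the threshold.
    return default_risk < threshold

print(approve_credit(0.07))  # True: low predicted risk, grant the credit line
```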
Take on the other hand, a machine learning model that recommends the next individual class a student should take to decrease his/her chances of dropping out of university.
In this case, it's not immediately obvious how each individual course recommendation affects the target variable (the student completing the whole degree), and each recommendation impacts the next sample (some sequences of two courses may be especially good/bad).
What you're trying to learn in this case is not a simple function, but a policy that takes a sequence of actions, each of which may impact the next decision you have to make.
The exploration vs exploitation tradeoff is paramount here, as is the need for control groups, randomized trials, etc. This is closer to reinforcement learning.
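A minimal epsilon-greedy sketch of what learning such a policy could look like (my own illustration; a real system would model state, delayed rewards, and course sequences far more carefully):

```python
import random
from collections import defaultdict

# Each recommendable course is an "arm"; the reward (e.g. the student stays
# enrolled the next term) arrives later, and each choice shapes the next one.
class EpsilonGreedyRecommender:
    def __init__(self, courses, epsilon=0.1):
        self.courses = courses
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.values = defaultdict(float)  # running mean reward per course

    def recommend(self):
        if random.random() < self.epsilon:
            return random.choice(self.courses)               # explore
        return max(self.courses, key=lambda c: self.values[c])  # exploit

    def update(self, course, reward):
        # Incremental running-mean update for the chosen arm.
        self.counts[course] += 1
        self.values[course] += (reward - self.values[course]) / self.counts[course]

rec = EpsilonGreedyRecommender(["calculus", "statistics", "programming"])
course = rec.recommend()
rec.update(course, reward=1.0)  # e.g. the student re-enrolled next term
```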
¹ We are assuming all clients get the same credit line amount.