Paper Summary: Hidden Technical Debt in Machine Learning Systems
Please note: this post is mainly intended for my personal use. It is not peer-reviewed work and should not be taken as such.
The authors list several ways in which ML-enabled systems carry extra layers of technical debt compared to regular systems.
Coupling between features
Coupling between features and configurations in models that take feature interaction into consideration.
One way to avoid this is to use simpler models such as linear models where taking one feature out doesn't completely break the model; only makes it slightly less accurate.
Coupling between use contexts
This happens when a model \(M\), trained for some context \(A\), is adapted to work for a slightly different problem \(A'\).
This is done, for example, by multiplying the output of model \(M\) by some number to give a score that can be used for problem \(A'\).
When this happens, the original model \(M\) can no longer be changed or updated without breaking the models derived from it.
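A minimal sketch of this coupling. All names and numbers here (`model_m`, `make_derived`, the 0.8 factor) are hypothetical and only illustrate the idea; they are not from the paper.

```python
from typing import Callable

Scorer = Callable[[float], float]


def model_m(x: float) -> float:
    """Stand-in for model M, trained for context A (illustrative rule)."""
    return min(1.0, max(0.0, 0.1 * x))


def model_m_retrained(x: float) -> float:
    """Stand-in for an updated version of M with a different score scale."""
    return min(1.0, max(0.0, 0.15 * x))


def make_derived(base: Scorer, factor: float) -> Scorer:
    """Build a model for A' by rescaling the base model's output.

    The factor was hand-tuned against one specific version of the base
    model, so it silently stops being valid when the base model changes.
    """
    return lambda x: factor * base(x)


derived_old = make_derived(model_m, factor=0.8)
derived_new = make_derived(model_m_retrained, factor=0.8)

# Same input, same factor, different output: updating M changed A' too.
print(derived_old(4.0), derived_new(4.0))
```

Nothing in the derived model's own code changed, yet its scores moved, which is why the base model ends up frozen in practice.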
Coupling between consumer teams
Undeclared Consumers: this happens when the model maintainer is not aware of other teams using the model's output for business decisions.
This can create problems because updating the model can have nontrivial effects in other parts of the stack.
It can also induce a feedback loop if the model maintainer is unaware that the training data may be affected by the model decisions themselves.
Data dependency debt arises when models use features built from unstable data sources, such as the output of other models, flaky ETL pipelines, etc.
Changes and problems in any code used in building these features may cause models to break or malfunction.
These are some ways to reduce the risk of model problems due to data dependencies:
Use versioned data to keep upstream changes from propagating to downstream systems.
Remove features that aren't in use anymore.
Remove features that don't add much to the model's predictive power (bad features, correlated features, etc.).
Use static analysis tools to build data lineage graphs and see which features can be removed.
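As one illustration of the "correlated features" point above, the sketch below flags highly correlated feature pairs as candidates for removal. The function names and the 0.95 threshold are my own assumptions, and high correlation is only a hint: whether a feature can really go should be confirmed by re-evaluating the model without it.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally sized columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(
    columns: Dict[str, List[float]], threshold: float = 0.95
) -> List[Tuple[str, str]]:
    """Return feature pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(columns), 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


features = {
    "amount": [1.0, 2.0, 3.0, 4.0],
    "amount_doubled": [2.0, 4.0, 6.0, 8.0],  # redundant with "amount"
    "is_weekend": [1.0, 0.0, 1.0, 0.0],
}
print(correlated_pairs(features))
```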
Feedback loops happen when a model's output influences the training data for future versions of the model.
Feedback loops may be direct or hidden (when it is not known whether or how they are happening).
One way to guard against feedback loops is to act on a random subset of the input items (without taking the model score into account) and record their labels, so that these can later be used as an unbiased training set.
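The random-subset idea can be sketched as follows. The function name, the `explore_rate` parameter, and the record layout are assumptions made for illustration; a real system would also need to store the labels observed later for the flagged items.

```python
import random
from typing import Callable, Dict, List, Optional


def decide(
    items: List[dict],
    score_fn: Callable[[dict], float],
    threshold: float = 0.5,
    explore_rate: float = 0.05,
    rng: Optional[random.Random] = None,
) -> List[Dict]:
    """Decide whether to act on each item.

    A random fraction of items is acted on regardless of the model score
    and flagged as "unbiased", so their labels can later form a training
    set that the model's own decisions did not filter.
    """
    rng = rng or random.Random()
    decisions = []
    for item in items:
        if rng.random() < explore_rate:
            # Random subset: the model score is deliberately ignored.
            decisions.append({"item": item, "act": True, "unbiased": True})
        else:
            act = score_fn(item) >= threshold
            decisions.append({"item": item, "act": act, "unbiased": False})
    return decisions


items = [{"id": i, "score": i / 10} for i in range(10)]
decisions = decide(items, lambda it: it["score"], rng=random.Random(0))
unbiased = [d["item"]["id"] for d in decisions if d["unbiased"]]
```

The trade-off is that acting on random items has a cost, so `explore_rate` is usually kept small.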
Messy Pipeline code
The ETL code that takes data from customer-facing systems and builds features out of them can easily become bloated like any other system:
Overly complicated code (joins, filters, exceptions to rules, etc)
Duplicated code (when you change one copy you must also change the others)
Too many abstractions / too few abstractions
Not enough ML Abstractions
The authors claim that there are (as of the writing of the paper) not yet good, mature abstractions for writing ML-based systems, like there are for other types of systems.
I would argue that the Pipeline abstraction (as present in scikit-learn and Spark, for example) is one such example.
Common ML code smells
Plain-old Datatype Smell: using primitive data types to represent model entities
Multiple-language smell: using multiple languages for different system modules
Prototype smell: using research code for production scoring.
I would argue that some of these are not applicable anymore since most people are now using ready-made packages like sklearn, tensorflow, etc.
Also, I would argue that using different languages is only a problem if each module does not have clear, defined responsibilities. This is a more important problem than the language a module is written in.
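To illustrate the plain-old datatype smell listed above, here is a small sketch that replaces a bare float with a typed object. The `Prediction` class and its fields are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """A model score that carries its own context and validation.

    Passing around a bare float loses which model produced it and what
    range it is supposed to be in; this hypothetical type keeps both.
    """

    model_version: str
    score: float  # expected to lie in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")


ok = Prediction(model_version="v1", score=0.3)

try:
    Prediction(model_version="v1", score=1.5)
    rejected = False
except ValueError:
    rejected = True  # invalid scores fail fast instead of propagating
```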
Configuration debt arises when the inference code becomes so complex that it precludes the development of new system features.
The authors treat as configuration all code that defines how inference (as opposed to training) is done. This includes:
Custom rules for certain subsets of items
Hardcuts and other filters for not scoring certain items
Rules defining where to fetch feature data from, fallbacks and timeouts.
In my opinion, this is more of a software engineering issue and can be mitigated with software engineering solutions, such as minimizing the public surface area of system parts, concentrating responsibilities in individual modules as opposed to leaving them all over the code, etc.
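One possible way to concentrate these rules in a single place is sketched below. The `InferenceConfig` class, its field names, and the default values are all assumptions of mine and not something the paper prescribes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class InferenceConfig:
    """All inference-time rules in one place (field names are illustrative)."""

    excluded_segments: FrozenSet[str] = frozenset()  # custom rules for item subsets
    min_amount: float = 0.0  # hard cut: items below this are not scored
    feature_timeout_s: float = 0.2  # timeout when fetching feature data
    fallback_score: Optional[float] = None  # used when features are unavailable


def should_score(item: dict, config: InferenceConfig) -> bool:
    """Apply the filters from the config instead of scattering them in code."""
    if item.get("segment") in config.excluded_segments:
        return False
    return item.get("amount", 0.0) >= config.min_amount


config = InferenceConfig(excluded_segments=frozenset({"internal"}), min_amount=10.0)
print(should_score({"segment": "retail", "amount": 25.0}, config))
```

Because the object is immutable and self-describing, it can also be logged or versioned together with the model.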
Why this matters: ML-enabled systems have their own sources of technical debt, which add to the other types of debt inherent to any kind of system.
ML-enabled systems are becoming more complex and more ubiquitous in all sorts of organizations; many of these now begin to face common challenges that have only started being addressed.
"Not all debt is bad, but all debt needs to be serviced."
CACE: "Adding a new feature \(x_{n+1}\) can cause similar changes, as can removing any feature \(x_j\). No inputs are ever really independent. We refer to this here as the CACE principle: Changing Anything Changes Everything."
DS and MLE roles: "A hybrid research approach where engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce [...] friction significantly."
There are many points in this paper which (although useful and worthwhile to read) are not really related to technical debt, but to general quality assurance of models.
- Things such as monitoring and sanity bounds for model-based actions, while useful and important, don't help with paying off or avoiding debt at all.
General software engineering practices cannot be overlooked when building systems that leverage ML models.
The authors say that there is a tradeoff between using a generic ML package and custom-built systems, as generic packages require lots of glue code. I would say that the best of both worlds would be using an open-source generic ML package. We can reap the benefits of using software that's used by many people (new features, bugfixes, Stack Overflow help) while being able to customize it if need be.
As with other system code, it is always important to delete old code that's not in use anymore. Stale code (e.g. Pipeline, ETL, etc) is a cognitive as well as a maintenance drain.
Use as few features as possible for all models: the more features you use, the more dependencies you create with other systems.
Use typed languages and type-hinting (in dynamic languages) at least for some parts of the supporting code
Log everything including input features, processed features, scores, scoring times, model versions, etc.
Support null values so that models don't break when some features are missing (but make sure you monitor these to take appropriate action)
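The last two points (logging everything and tolerating missing features) can be combined in a small sketch. The function names, the log fields, and the choice of filling missing values with a default of 0.0 are assumptions for illustration; the right imputation depends on the model.

```python
import json
from typing import Dict, List, Optional, Tuple


def prepare_features(
    raw: Dict[str, Optional[float]], expected: List[str], default: float = 0.0
) -> Tuple[Dict[str, float], List[str]]:
    """Fill missing or null features with a default and report which ones."""
    processed: Dict[str, float] = {}
    missing: List[str] = []
    for name in expected:
        value = raw.get(name)
        if value is None:
            missing.append(name)  # keep track so it can be monitored
            processed[name] = default
        else:
            processed[name] = float(value)
    return processed, missing


def scoring_log_line(
    model_version: str,
    raw: dict,
    processed: dict,
    missing: List[str],
    score: float,
    elapsed_ms: float,
) -> str:
    """Serialize one scoring event as a JSON line."""
    return json.dumps(
        {
            "model_version": model_version,
            "input_features": raw,
            "processed_features": processed,
            "missing_features": missing,
            "score": score,
            "scoring_time_ms": elapsed_ms,
        },
        sort_keys=True,
    )


raw = {"age": 30, "income": None}
processed, missing = prepare_features(raw, expected=["age", "income"])
line = scoring_log_line("v1", raw, processed, missing, score=0.42, elapsed_ms=3.5)
```

Monitoring the rate of non-empty `missing_features` over time is what turns the null support into something actionable.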
Specific Types of Technical Debt
Using other models' outputs as features
It is often the case that some teams will want to use other (upstream) models' scores as features for their own models for a quick performance boost.
This is bad because any changes/updates in the upstream model will affect those models, creating coupling between them.
This can make it impossible to retire old models because one can never be sure if someone else is still using them.
One alternative for this is, rather than borrowing the upstream model's score as a feature, borrow that model's features instead.
Policy and Model coupling
It is very often the case that there is downstream code that is used to take actions based on model scores.
For example, a simple rule to give a credit card to everyone whose model score is lower than 0.5 is a type of policy code.
Policy code is usually coupled with the model that produced the score, so changing the model will affect the policy.
One way to avoid too much coupling is to have models output calibrated probabilities, so that any model that produces these scores can be used with the same policy.
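A minimal sketch of this decoupling, using histogram binning as one simple calibration technique. The choice of technique, the function names, and the toy data are my own assumptions; the point is only that the policy sees a probability and never a model-specific raw score.

```python
from bisect import bisect_right
from typing import Callable, List, Optional, Sequence


def fit_histogram_calibrator(
    scores: Sequence[float], labels: Sequence[int], edges: List[float]
) -> Callable[[float], float]:
    """Map a raw score to the observed positive rate of the bin it falls in."""
    totals = [0] * (len(edges) + 1)
    positives = [0] * (len(edges) + 1)
    for score, label in zip(scores, labels):
        bin_index = bisect_right(edges, score)
        totals[bin_index] += 1
        positives[bin_index] += label
    rates: List[Optional[float]] = [
        p / t if t else None for p, t in zip(positives, totals)
    ]

    def calibrate(score: float) -> float:
        rate = rates[bisect_right(edges, score)]
        if rate is None:
            raise ValueError("no calibration data for this score range")
        return rate

    return calibrate


def approve(probability: float, max_risk: float = 0.5) -> bool:
    """Policy code: depends only on a calibrated probability."""
    return probability < max_risk


# Two hypothetical models with very different raw score scales.
calibrate_a = fit_histogram_calibrator([100, 200, 700, 800], [0, 0, 1, 1], [500])
calibrate_b = fit_histogram_calibrator([0.1, 0.2, 0.8, 0.9], [0, 1, 1, 1], [0.5])

print(approve(calibrate_a(150)), approve(calibrate_b(0.15)))
```

Swapping model A for model B requires refitting a calibrator but leaves `approve` untouched.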
Some things that should be tracked during model training include: the code used to generate the training dataset, the actual training dataset snapshot, the precise state (e.g. commit hash) of the model repo when the model was trained, and the seeds used.
This is, in my view, considered debt because it may be necessary to pay it off when we need to retrain the model with more recent data, or explain exactly how it was trained for compliance issues.
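One way to keep track of these items is to write a small manifest next to every trained model. The sketch below is an assumption about how that could look: the `TrainingManifest` fields and the idea of fingerprinting the dataset snapshot with SHA-256 are mine, and the commit hashes are passed in as plain strings rather than read from a repository.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingManifest:
    """What is needed to retrain or explain a model (illustrative fields)."""

    dataset_code_commit: str  # state of the code that generated the dataset
    dataset_sha256: str  # fingerprint of the training dataset snapshot
    model_repo_commit: str  # state of the model repo at training time
    seed: int


def fingerprint(data: bytes) -> str:
    """Content hash of a dataset snapshot, so later changes are detectable."""
    return hashlib.sha256(data).hexdigest()


snapshot = b"id,feature,label\n1,0.5,1\n2,0.1,0\n"  # toy dataset contents
manifest = TrainingManifest(
    dataset_code_commit="abc123",  # placeholder value
    dataset_sha256=fingerprint(snapshot),
    model_repo_commit="def456",  # placeholder value
    seed=42,
)
manifest_json = json.dumps(asdict(manifest), sort_keys=True, indent=2)
```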