- General workflow
- Cached data
- Namespaces and pickled objects
- Training vs. prediction time
- Missing data
- Log predictions
These are just heads-ups, specific to scikit-learn, not a full workflow you can follow.
The problem of how to accurately deliver model predictions in production and at scale is a large subject in and of itself, requiring model monitoring, logging, and more.
Be careful with cached information.
We normally cache models, processed data and other things that take too much time to build.
It's important to have checks to make sure cached data matches what we are actually expecting.
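One lightweight check, as a sketch: store a fingerprint of the inputs that produced a cached artifact (a hash of the source data path and the preprocessing parameters), and refuse to use the cache when it no longer matches. The helper name and paths here are illustrative, not from any particular library.

```python
import hashlib
import json

def cache_key(source_path: str, params: dict) -> str:
    """Fingerprint of the inputs that produced a cached artifact."""
    payload = json.dumps({"source": source_path, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# At cache-write time, store the key alongside the artifact...
saved_key = cache_key("data/train.csv", {"ngrams": 2, "lowercase": True})

# ...and at cache-read time, recompute and compare before trusting the cache.
current_key = cache_key("data/train.csv", {"ngrams": 2, "lowercase": True})
if current_key != saved_key:
    raise RuntimeError("Cached artifact is stale: rebuild it before serving.")
```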
Asserts have their place.
Python is not a statically-typed language so there will be times when simple mistakes that could be caught at compile-time will leak into run-time.
Asserts can check the sanity of your data and results.
Asserts can be used to filter out cases where you have syntactically valid data which makes no sense.
- in other words, cases when your models would still deliver predictions, but probably nonsensical ones.
Asserts help inform the reader about what some specific piece of code does, and what the underlying assumptions are.
- In other words, asserts are useful documentation too.
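For example, a few sanity asserts on incoming data (the column names and bounds are made up for illustration):

```python
def validate_features(row: dict) -> dict:
    """Reject syntactically valid rows that make no semantic sense."""
    assert set(row) >= {"age", "amount"}, f"missing columns: {sorted(row)}"
    assert 0 <= row["age"] <= 130, f"implausible age: {row['age']}"
    assert row["amount"] >= 0, f"negative amount: {row['amount']}"
    return row

validate_features({"age": 42, "amount": 19.99})  # passes silently
```

Each assert doubles as documentation: a reader immediately sees what the code downstream assumes about the data.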
Namespaces and pickled objects
If you pickle a classifier or a pipeline, any custom classes and other resources it references must also be available at inference time.
For example, if you use custom steps in a pipeline or external data when training your model, the modules defining them must be importable wherever predictions are served.
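A minimal sketch of why this matters, using plain `pickle` and a made-up custom step: the pickle stores only the class's qualified name, not its code, so the defining module must be importable at load time.

```python
import pickle

# A custom preprocessing step. pickle stores only its qualified name
# (e.g. "mymodule.DropColumns"), not the class body, so the same module
# must be importable wherever the pickle is loaded.
class DropColumns:
    def __init__(self, columns):
        self.columns = columns

    def transform(self, rows):
        return [{k: v for k, v in row.items() if k not in self.columns}
                for row in rows]

step = DropColumns(columns=["debug_info"])
blob = pickle.dumps(step)

# This only works because DropColumns is defined in this namespace;
# in a fresh process without the class, pickle.loads raises AttributeError.
restored = pickle.loads(blob)
result = restored.transform([{"age": 31, "debug_info": "x"}])  # [{'age': 31}]
```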
Training vs. prediction time
Data preprocessing at prediction time must be exactly the same as at training time.
This is also called training/serving skew.
If you require lots of preprocessing, such as extracting features from text, creating artificial features from incoming data and/or processing categorical data, you need to make sure the exact same process is done at prediction time!
Important: Wrap the whole preprocessing/classification code into just a few methods and call those at training and prediction times.
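A sketch of that idea (the feature names and records are invented for illustration): one `preprocess()` function is the single source of truth, called from both the training script and the prediction service.

```python
def preprocess(record: dict) -> dict:
    """Single source of truth for feature extraction, used at BOTH
    training time and prediction time."""
    text = record.get("text", "")
    return {
        "text_len": len(text),
        "has_email": int("@" in text),
        "amount": float(record.get("amount", 0.0)),
    }

# Training time:
training_records = [{"text": "contact me @work", "amount": "3.5"}, {"text": "hi"}]
X_train = [preprocess(r) for r in training_records]

# Prediction time -- the exact same function, so no training/serving skew:
x = preprocess({"text": "hello @home", "amount": 10})
```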
This is especially true for categorical data and one-hot encoding with `pd.get_dummies()`:
- Categorical data must be encoded into dummy variables using the very same mapping at training and at inference time!
- Always set the `categories` attribute when using `pd.get_dummies()`, to make sure you are encoding categorical data with dummy variables in the very same positions as you did at training time.
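A sketch with pandas, assuming a fixed category list decided at training time: casting to a `pd.Categorical` with explicit `categories` guarantees the same dummy columns, in the same order, even when some categories are absent from the incoming batch.

```python
import pandas as pd

CATEGORIES = ["red", "green", "blue"]  # mapping fixed at training time

def encode_color(series: pd.Series) -> pd.DataFrame:
    # A fixed Categorical produces the same dummy columns in the same
    # positions, even if some categories never appear in this batch.
    cat = pd.Categorical(series, categories=CATEGORIES)
    return pd.get_dummies(cat)

train = encode_color(pd.Series(["red", "blue", "red"]))
serve = encode_color(pd.Series(["green"]))  # "red"/"blue" unseen here
assert list(train.columns) == list(serve.columns)  # identical feature layout
```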
For vectorizing text:
- Unseen text must be vectorized with the vectorizer fitted on the training data!
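For example, with scikit-learn's `CountVectorizer`: fit once on the training data, then call only `transform()` at prediction time, so unseen text lands in the same feature space.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(["cheap pills now", "meeting at noon"])

# At prediction time: transform ONLY -- never fit_transform, which would
# rebuild the vocabulary and silently change what each column means.
X_new = vectorizer.transform(["cheap meeting"])
assert X_new.shape[1] == X_train.shape[1]  # same feature space
```

In practice this means the fitted vectorizer must be persisted and shipped together with the model itself.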
Watch out for library versions
When you train models, you use specific versions of libraries like numpy, pandas or scikit-learn.
If someone runs the very same code you wrote using slightly different library versions, things will break.
Always make evident what library versions you are using:
- pin dependency versions
- print `np.__version__` and so on at the top of notebooks
- define versions in a requirements file
- encode the whole environment (OS-level stuff too) in a Dockerfile or something like that
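For instance, a cell like this at the top of a notebook makes the versions explicit:

```python
import sys

import numpy as np
import pandas as pd
import sklearn

# Make the environment evident to anyone re-running this notebook.
print("python :", sys.version.split()[0])
print("numpy  :", np.__version__)
print("pandas :", pd.__version__)
print("sklearn:", sklearn.__version__)
```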
Watch out for NULLs and missing data
It's very common for there to be missing data at inference time.
Most sklearn classifiers will throw errors if your data has `np.nan`s, so you must remove them.
- Call `fillna()` on each dataframe column with suitable default values, such as:
  - `""` (empty string) for text data
  - `0` or some other marker for numerical data
- Missing categorical data should be handled by giving it its own explicit category (e.g. "missing").
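Putting those defaults together in pandas (the column names are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "comment": ["great product", np.nan],   # text
    "amount": [12.5, np.nan],               # numerical
    "color": ["red", np.nan],               # categorical
})

# Fill each column with a sensible, documented default before scoring.
df["comment"] = df["comment"].fillna("")
df["amount"] = df["amount"].fillna(0)
df["color"] = df["color"].fillna("missing")  # explicit "missing" category
assert not df.isnull().values.any()
```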
Log predictions
You must log all predictions made by a model, whether real-time or batch.
At a very minimum, you must enable logging for:
- Input features (the features for each item processed)
- Model output (i.e. scores, classes, etc.)
- Scoring time (the exact timestamp the model was called)
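A minimal sketch, assuming a JSON-lines log (the record layout is just one reasonable choice, not a standard):

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("predictions")

def log_prediction(features: dict, output: dict) -> dict:
    """Emit one structured record per scored item: inputs, outputs, timestamp."""
    record = {
        "features": features,
        "output": output,
        "scored_at": datetime.now(timezone.utc).isoformat(),
    }
    logger.info(json.dumps(record))  # one JSON object per line
    return record

rec = log_prediction({"text_len": 11, "has_email": 1},
                     {"class": "spam", "score": 0.93})
```

Structured records like these let you later replay inputs, audit individual predictions, and monitor for drift.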