Heads-up for Deploying Scikit-learn Models to Production: Quick Checklist


These are quick heads-ups specific to scikit-learn, not a full end-to-end workflow you can follow.

Accurately delivering model predictions in production and at scale is a large subject in itself, requiring things like model monitoring, logging, etc.

General workflow

[Figure: General workflow for deploying a trained sklearn model into production]

Be careful with cached stuff

We normally cache models, data versions and other artifacts that take too long to build.

It's important to have checks in place to make sure cached data matches what you actually expect.
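One way to sketch such a check: store a hash of the source data alongside the cached artifact when the cache is built, and compare it before reuse. The file layout and key names below are illustrative assumptions, not a prescribed convention.

```python
# Sketch: validate a cached artifact before using it, assuming a hash of the
# source data was recorded in a manifest (hypothetical layout) at cache time.
import hashlib
import json
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Hash a file so we can detect when the data behind a cache has changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def cache_is_valid(data_path: Path, manifest_path: Path) -> bool:
    """Compare the current data hash against the one recorded at cache time."""
    if not manifest_path.exists():
        return False
    manifest = json.loads(manifest_path.read_text())
    return manifest.get("data_sha256") == sha256_of_file(data_path)
```

If the check fails, rebuild the cache instead of trusting it.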

Asserts are not all that bad

Python is not a statically-typed language, so simple mistakes that could be caught at compile time in other languages will sometimes leak into run time.

Asserts can check the sanity of your data and results.

  • Asserts can be used to filter out cases where you have syntactically valid data that makes no sense.

    • in other words, cases when your models would still deliver predictions, but probably nonsensical ones.
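A minimal sketch of such sanity asserts; the column name and value ranges are illustrative assumptions, not part of any real schema:

```python
# Sketch: asserts that catch syntactically valid but nonsensical inputs
# before the model happily (and wrongly) scores them.
import pandas as pd

def check_features(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast on data that would produce nonsensical predictions."""
    assert "age" in df.columns, "missing expected feature column 'age'"
    assert (df["age"] >= 0).all(), "negative ages are valid floats but nonsense"
    assert df["age"].lt(150).all(), "ages above 150 are almost certainly data errors"
    return df
```

Calling this at the entry point of your prediction code turns silent garbage-in/garbage-out into a loud, debuggable failure.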

Namespaces and pickled objects

If you pickle a classifier or a pipeline that uses custom classes or other resources, those must also be available at inference time.

For example, if you use custom steps in a pipeline, or external data or similar resources for training your model, pickle stores only references to your classes, not their code, so the same modules and resources must be importable when the model is loaded for inference.
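To illustrate, here is a sketch of a custom pipeline step; the module path `myproject.transforms` mentioned in the comments is hypothetical, the point being that the class must live in an importable module (not a throwaway notebook cell) in both the training and the serving environment:

```python
# Sketch: a custom pipeline step. When a pipeline containing it is pickled,
# pickle records only a reference like "myproject.transforms.ClipOutliers"
# (hypothetical path), so unpickling at inference time must be able to
# import that exact module and class.
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Illustrative custom step: clips values to a fixed range."""

    def __init__(self, low=-3.0, high=3.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        # Nothing to learn; present for sklearn API compatibility.
        return self

    def transform(self, X):
        return X.clip(self.low, self.high)
```

If the class was defined inline in a notebook, unpickling in a separate serving process will fail with an import error.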

Data preprocessing at prediction time must be exactly the same as at training time

If you require lots of preprocessing, such as extracting features from text, creating artificial features from incoming data and/or processing categorical data, you need to make sure the exact same process is done at prediction time!

My recommendation for this is: wrap all preprocessing code into a single function and call that very same function at both training and prediction time.

  • This is especially true for categorical data and one-hot-encoding with pd.get_dummies():

    • Categorical data must be encoded into dummy variables using the very same mapping at training and at inference time!
    • Always fix the column's categories (e.g. via pd.Categorical with an explicit categories list) before calling pd.get_dummies(), to make sure you are encoding categorical data with dummy variables in the very same positions as you did at training time.
    • Always use dummy_na=True
  • For vectorizing text:

    • Unseen text must be vectorized with the vectorizer fitted on the training data!
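The categorical part of the advice above can be sketched as a single shared preprocessing function; the column name and `TRAINED_CATEGORIES` constant are illustrative assumptions (in practice you would persist the category list alongside the model):

```python
# Sketch: one preprocessing function shared by training and inference,
# with the category list frozen at training time.
import pandas as pd

# Hypothetical: the categories observed at training time, saved with the model.
TRAINED_CATEGORIES = {"color": ["red", "green", "blue"]}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Encode categoricals with the exact column layout seen at training time."""
    df = df.copy()
    for col, cats in TRAINED_CATEGORIES.items():
        # Pinning the categories guarantees the same dummy columns, in the
        # same order, regardless of which values appear in this batch;
        # unseen values become NaN and land in the dummy_na column.
        df[col] = pd.Categorical(df[col], categories=cats)
    return pd.get_dummies(df, columns=list(TRAINED_CATEGORIES), dummy_na=True)
```

The same principle applies to text: persist the fitted vectorizer and call only its transform() (never fit()) at prediction time.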

Watch out for library versions

When you train models, you use specific versions of libraries like numpy, pandas or scikit-learn.

If someone runs the very same code you wrote with slightly different library versions, things may break, or worse, silently behave differently.

  • Always make it evident which library versions you are using:

    • print pandas.__version__, np.__version__ and so on at the top of notebooks
    • define versions in a requirements.txt file
    • encode the whole environment (OS-level dependencies too) in a Dockerfile or similar.
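The first bullet above is trivial but worth showing; a snippet like this at the top of a notebook documents the environment for free:

```python
# Sketch: make the runtime environment explicit at the top of a notebook.
import sys
import numpy as np
import pandas as pd
import sklearn

print("python      :", sys.version.split()[0])
print("numpy       :", np.__version__)
print("pandas      :", pd.__version__)
print("scikit-learn:", sklearn.__version__)
```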

Watch out for NULLs and missing data

It's very common for there to be missing data at inference time.

Most sklearn classifiers will throw errors if your data has Nones or np.nans, so you must handle them.

  • Always use fillna() on each dataframe column with suitable default values such as:

    • "" empty string for text data
    • 0 or some other marker for numerical data
    • Missing categorical data should be handled by pd.get_dummies() with dummy_na=True
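The fillna() advice can be sketched as a small helper; the column names and fill values in `FILL_DEFAULTS` are illustrative assumptions:

```python
# Sketch: per-column defaults applied before scoring.
import pandas as pd

# Hypothetical defaults, chosen per column type.
FILL_DEFAULTS = {
    "comment": "",  # empty string for text data
    "price": 0.0,   # 0 (or some other marker) for numerical data
}

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Replace None/np.nan so downstream sklearn estimators don't raise."""
    return df.fillna(value=FILL_DEFAULTS)
```

Keeping the defaults in one dict makes it easy to apply the exact same fills at training and prediction time.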
