Scikit-Learn Pipeline Examples

Updated for Scikit-learn v0.19 and Keras v2.0.3

WHAT

Pipelines allow you to create a single object that includes all steps from data preprocessing to classification.

View all code in this notebook

WHY

  • Increase reproducibility

  • Make it easier to use cross-validation and other types of model selection.

  • Avoid common mistakes such as leaking data from training sets into test sets (see the sketch after this list).
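
As a minimal sketch of the last point, fitting the scaler inside the pipeline means each cross-validation fold re-fits it on the training portion only (make_classification is used here just as stand-in data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data, just for illustration
X, y = make_classification(n_samples=200, random_state=0)

# leaky: the scaler would see the validation folds before cross-validation
# X_scaled = StandardScaler().fit_transform(X)
# scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=3)

# leak-free: the scaler is re-fit on each training fold only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
scores = cross_val_score(pipeline, X, y, cv=3)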

Pipeline example

Just a classifier and one preprocessing step (data standardization):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# add your data here
X_train,y_train = make_my_dataset()

# Pipeline takes a list of (name, estimator) tuples
pipeline = Pipeline([
    ('scaler',StandardScaler()),
    ('clf', LogisticRegression())
])

# use the pipeline object as you would
# a regular classifier
pipeline.fit(X_train,y_train)
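
Once fitted, the pipeline behaves like any other estimator. For example (assuming you also have a held-out set X_test), a quick sketch:

# predictions go through the same scaler that was fitted on the training data
y_preds = pipeline.predict(X_test)

# individual steps can be inspected by name
fitted_scaler = pipeline.named_steps['scaler']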

Text Classification/NLP

View notebook here

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

# X_train and X_test are lists of strings, each 
# representing one document
# y_train and y_test are vectors of labels
X_train,X_test,y_train,y_test = make_my_dataset()

# this calculates a vector of term frequencies for 
# each document
vect = CountVectorizer()

# this weights each term frequency by its inverse
# document frequency (IDF)
tfidf = TfidfTransformer()

# this is a linear SVM classifier
clf = LinearSVC()

pipeline = Pipeline([
    ('vect',vect),
    ('tfidf',tfidf),
    ('clf',clf)
])

# call fit as you would on any classifier
pipeline.fit(X_train,y_train)

# predict test instances
y_preds = pipeline.predict(X_test)

# calculate f1
mean_f1 = f1_score(y_test, y_preds, average='micro')
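
Side note: CountVectorizer followed by TfidfTransformer can usually be collapsed into a single TfidfVectorizer step; a minimal equivalent sketch:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TfidfVectorizer = CountVectorizer + TfidfTransformer in one step
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LinearSVC())
])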

Cross-Validation (cross_val_score)

View notebook here

Doing cross-validation is one of the main reasons why you should wrap your model steps into a Pipeline.

The recommended method for training a good model is to first cross-validate using a portion of the training set itself to check if you have used a model with too much capacity (i.e. if the model is overfitting the data).

You can cross-validate a whole pipeline using cross_val_score:

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X_train,X_test,y_train,y_test = make_my_dataset()

pipeline = Pipeline([
    ('vect',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('clf',LinearSVC())
])

# Instead of 'just' fitting the pipeline on the training
# data, do cross-validation too so that you know if it's
# overfitting.
# This returns an array of values, each having the score 
# for an individual run.
# - cv=3 means that we're doing 3-fold cross validation
# - You can select any metric to score your pipeline
scores = cross_val_score(pipeline,X_train,y_train,cv=3,
    scoring='f1_micro')
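
# summarize the per-fold scores, e.g. mean and standard deviation
print("f1_micro: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))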

# with the information above, you can be more 
# comfortable to train on the whole dataset
pipeline.fit(X_train,y_train)

y_preds = pipeline.predict(X_test)

mean_f1 = f1_score(y_test, y_preds, average='micro')

Cross-Validation (GridSearchCV)

View notebook here

To cross-validate and select the best parameter configuration at the same time, you can use GridSearchCV.

This allows you to easily test out different hyperparameter configurations, using for example the KFold strategy to split your data into folds, so you can find out whether the model is generalizing well or overfitting.

GridSearchCV allows you to define a ParameterGrid with hyperparameter configuration values to iterate over. All combinations are tested and scored.

In this example, there are 2 x 3 = 6 parameter combinations to test, and each one is trained and scored on a validation fold once per CV split (6 x 3 = 18 fits with cv=3).

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

X_train,X_test,y_train,y_test = make_my_dataset()

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])

# this is where you define the values for
# GridSearchCV to iterate over
param_grid = {
    'vect__max_df':[0.8,0.9,1.0],
    'clf__C':[0.1,1.0]
}

# do 3-fold cross validation for each of the 6 possible
# combinations of the parameter values above
grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid)
grid.fit(X_train,y_train)

# summarize results
print("Best: %f using %s" % (grid.best_score_, 
    grid.best_params_))
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

# now train and predict test instances
# using the best configs found in the CV step
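# one way to do this (sketch): with the default refit=True, GridSearchCV
# has already refit the best configuration on the whole training set,
# so you can predict directly with grid.best_estimator_
y_preds = grid.best_estimator_.predict(X_test)
mean_f1 = f1_score(y_test, y_preds, average='micro')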

Multi-Label Classification

As above, using text classification as an example.

Using a One-vs-Rest meta-classifier.

A meta-classifier is an object that takes any classifier as argument.

In this example, we have OneVsRestClassifier, which trains one copy of the provided classifier for each different label.

This meta-classifier is very often used in multi-label problems, where it's also known as Binary relevance.

import numpy as np

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# X_train and X_test are lists of strings, each 
# representing one document
# Y_train and Y_test are usually lists of lists of labels
X_train,X_test,Y_train,Y_test = make_my_dataset()

all_labels = np.vstack([Y_train,Y_test])

# you need to fit on all labels because you need
# a place for every label
mlb = MultiLabelBinarizer().fit(all_labels)

# there is no data leaking because 'fitting'
# a multilabel binarizer does not really train anything
Y_train_binary = mlb.transform(Y_train)
Y_test_binary = mlb.transform(Y_test)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(),n_jobs=-1)),
])

# use the transformed Y_train here
pipeline.fit(X_train,Y_train_binary)

# the result is also a binary label matrix
Y_preds_binary = pipeline.predict(X_test)

# calculate f1 on the test set
mean_f1 = f1_score(Y_test_binary, Y_preds_binary, average='micro')
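
If you want the predictions back as lists of labels rather than a binary matrix, the fitted MultiLabelBinarizer can invert the transformation; a short sketch:

# each row of the binary prediction matrix becomes a tuple of labels
predicted_labels = mlb.inverse_transform(Y_preds_binary)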

Incompatible Parameter pairs

There are cases where a certain combination of parameters is invalid for a given model.

One example is for the LinearSVC classifier, where you can choose among the following options:

  • the penalty parameter may be 'l1' or 'l2'

  • the dual parameter may be True or False.

But the specific combination penalty='l1' and dual=True is invalid, so you need a way to design the parameter grid so that this particular combination is never used.

You can fix this by defining separate lists in the parameter grid, so that it looks like this:

# add imports here

# add model initialization here

pipeline = Pipeline([
    ('clf', svm.LinearSVC()),
])

param_grid = [
    { 
          "clf__penalty": ["l2"],
          "clf__dual":[False,True]
    },
    { 
          "clf__penalty": ["l1"],
          "clf__dual":[False]
    }    
]

grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid)
grid.fit(X_train,y_train)

# add the rest of the code here
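
If you want to check exactly which combinations such a grid produces, you can enumerate it with ParameterGrid (a quick sanity-check sketch, using the param_grid defined above):

from sklearn.model_selection import ParameterGrid

# 2 combinations from the first dict plus 1 from the second = 3 in total;
# penalty='l1' together with dual=True never appears
for combo in ParameterGrid(param_grid):
    print(combo)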

Keras Model

Heads-up: If you're using a GPU, do not use multithreading (i.e. do not change the n_jobs parameter)

This example includes using Keras' wrappers for the Scikit-learn API, which allow you to define a Keras model and use it within scikit-learn's Pipelines. There are wrappers for classifiers and regressors, depending on your use case.

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier

# load your data
X_train,X_test,y_train,y_test = make_my_dataset()

# create a function that returns a model, taking as parameters the things
# you want to tune using cross-validation and model selection
def create_model(optimizer='adagrad',
                 kernel_initializer='glorot_uniform', 
                 dropout=0.2):
    model = Sequential()
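    # NOTE: depending on your Keras version and data, the first Dense layer
    # may also need an input_dim/input_shape argument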
    model.add(Dense(64,activation='relu',kernel_initializer=kernel_initializer))
    model.add(Dropout(dropout))
    model.add(Dense(1,activation='sigmoid',kernel_initializer=kernel_initializer))

    model.compile(loss='binary_crossentropy',optimizer=optimizer, metrics=['accuracy'])

    return model

# wrap the model using the function you created; since this model is a
# binary classifier, use the classifier wrapper
clf = KerasClassifier(build_fn=create_model,verbose=0)

scaler = StandardScaler()

# create parameter grid, as usual, but note that you can
# vary other model parameters such as 'epochs' (and others 
# such as 'batch_size' too)
param_grid = {
    'clf__optimizer':['rmsprop','adam','adagrad'],
    'clf__epochs':[4,8],
    'clf__dropout':[0.1,0.2],
    'clf__kernel_initializer':['glorot_uniform','normal','uniform']
}

pipeline = Pipeline([
    ('preprocess',scaler),
    ('clf',clf)
])

# if you're not using a GPU, you can set n_jobs to something other than 1
grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid)
grid.fit(X_train, y_train)

# summarize results
print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
means = grid.cv_results_['mean_test_score']
stds = grid.cv_results_['std_test_score']
params = grid.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Manual Cross-Validation (ParameterGrid)

It's very common to use a specific train/test split for your data (e.g. a time-based split, where you order samples by date/time and use values in the past to predict values in the future), and you must stick to this split when doing cross-validation.

You can do a manual grid search using ParameterGrid, manually setting the train and validation sets to use:

from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.pipeline import Pipeline

# add other imports, load data here

# a simple pipeline with just a classifier
pipeline = Pipeline([('clf', LinearSVC())])

parameters = [
    { 
          "clf__penalty": ["l1","l2"]
    },
]

X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2)

# start with minus infinity as your
# current best_score
best_score = float("-inf")

for g in ParameterGrid(parameters):
    pipeline.set_params(**g)

    # here you call fit with whatever data you want
    pipeline.fit(X_train,y_train)

    # again, choose the validation data 
    # yourself
    y_pred_train = pipeline.predict(X_train)    
    y_pred_validation = pipeline.predict(X_validation)

    # I've used f1-score as an example, but you can use
    # any metric you want.
    train_score = f1_score(y_train, y_pred_train,
        average='micro')
    val_score = f1_score(y_validation, y_pred_validation,
        average='micro')

    current_score = val_score

    # show results
    print("training score: {}".format(train_score))
    print("validation score: {}".format(val_score))
    print("grid: {}".format(g))
    print("")

    # update the best_score if needed
    if current_score > best_score:
        best_score = current_score
        best_grid = g
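
After the loop, you can refit the pipeline on your full training data using the best configuration found; a minimal follow-up sketch:

# refit using the best parameter combination found above
pipeline.set_params(**best_grid)
pipeline.fit(X_train, y_train)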

Skip/disable step

Full example: Jupyter notebook

Subclass the classifier or transformer you may want to skip, and add an argument called skip to the constructor.

For example, to define a parameter grid where sometimes TruncatedSVD is enabled and sometimes it isn't:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class SkippableTruncatedSVD(TruncatedSVD):

    # add the "skip" argument and keep the others as in the superclass
    def __init__(self,skip=False,n_components=2, algorithm="randomized", n_iter=5,
                 random_state=None, tol=0.):
        self.skip = skip
        super().__init__(n_components, algorithm, n_iter, random_state, tol)

    # execute if not being skipped
    def fit(self, X, y=None):
        if self.skip:
            return self
        else:
            return super().fit(X,y)

    # execute if not being skipped
    def fit_transform(self, X, y=None):
        if self.skip:
            return X
        else:
            return super().fit_transform(X,y) 

    # execute if not being skipped
    def transform(self, X):
        if self.skip:
            return X
        else:
            return super().transform(X) 

# note the extra parameter 'skip' for our transformer
param_grid = [
    {
        'tfidf__max_features':[100,200,500,1000],
        'svd__skip':[True,False],
        'svd__n_components':[2,5,10,20]
    }
]

# use the subclassed version here
pipeline = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('svd',SkippableTruncatedSVD()),
    ('clf',LogisticRegression())
])

# now use it as you would a normal pipeline
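
For instance, you could plug this pipeline into GridSearchCV as in the earlier examples (a sketch, assuming X_train and y_train are loaded as before):

from sklearn.model_selection import GridSearchCV

# the 'svd__skip' entries in param_grid switch the SVD step on and off
grid = GridSearchCV(pipeline, cv=3, param_grid=param_grid)
grid.fit(X_train, y_train)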