Scikit-Learn Cheatsheet: Reference and Examples

WIP Alert: This is a work in progress. The current information is correct, but more content will probably be added in the future.

train_test_split

import numpy as np
from sklearn.model_selection import train_test_split

# X is a 2d ndarray of shape (n_samples, n_features)
# y is an array of targets with one entry per row of X

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
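For a version that runs end to end, here is a minimal sketch on synthetic data. The data, the `random_state` value, and the choice to stratify are assumptions made for illustration; `random_state` makes the split reproducible and `stratify=y` keeps the class proportions similar in both halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for illustration): 100 samples, 3 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# random_state fixes the shuffle; stratify=y preserves the 50/50 class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Roughly a third of the rows end up in the test set, and every row lands in exactly one of the two sets.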

reshaping 1-d arrays

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

# X is a 1-d ndarray

# you want a COLUMN vector (many samples, 1 feature)
X = X.reshape(-1,1)

# you want a ROW vector (one sample, many features)
X = X.reshape(1,-1)
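To see what the two reshapes do, here is a small sketch with a made-up four-element array (the values are arbitrary). The `-1` tells NumPy to infer that dimension from the array's length.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])  # 1-d array, shape (4,)

column = X.reshape(-1, 1)  # 4 samples, 1 feature
row = X.reshape(1, -1)     # 1 sample, 4 features

print(column.shape)  # (4, 1)
print(row.shape)     # (1, 4)
```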

Quickly calculate evaluation metrics

This also works for precision, recall, AUC, and the other metrics listed in the scikit-learn metrics documentation. For AUC, pass predicted scores or probabilities rather than class labels.

Template: func(y_ground_truth, y_predictions)

from sklearn import metrics

# say you have a trained model, clf

metrics.accuracy_score(y_test, clf.predict(X_test))
# 0.8812312312
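The same call pattern applies to the other metrics. The sketch below uses small hand-written label lists instead of a trained model, so the arrays (including the `y_score` probabilities) are hypothetical; in practice they would come from `clf.predict` and, for AUC, something like `clf.predict_proba`.

```python
from sklearn import metrics

# Hypothetical ground truth, hard predictions, and predicted probabilities of class 1
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
y_score = [0.1, 0.9, 0.4, 0.2, 0.8, 0.7]

# Ground truth always goes first, predictions second
acc = metrics.accuracy_score(y_true, y_pred)    # 5 of 6 labels match
prec = metrics.precision_score(y_true, y_pred)  # 3 true positives, 0 false positives
rec = metrics.recall_score(y_true, y_pred)      # 3 true positives, 1 false negative
auc = metrics.roc_auc_score(y_true, y_score)    # uses scores, not hard labels

print(acc, prec, rec, auc)
```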

Scaling data

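Standardize each feature to zero mean and unit variance.

Don't fit the scaler on the testing data: this amounts to data snooping, because you would be using testing data to drive training. Fit on the training data only, then apply the same transform to both sets.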

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
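One way to make the fit-on-train-only rule hard to break is to wrap the scaler and the model in a pipeline. This is a sketch rather than the only approach: the synthetic data and the choice of `LogisticRegression` as the estimator are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 200 samples, 2 unscaled features, label from feature 0
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 2))
y = (X[:, 0] > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# fit() learns the scaling statistics from X_train only;
# score() reuses those same statistics when transforming X_test
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The scaled training data has mean ~0 and standard deviation ~1 per feature
X_train_scaled = model.named_steps["standardscaler"].transform(X_train)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The scaled test data will generally not have exactly zero mean and unit variance, which is expected: it is transformed with the training set's statistics.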
