Scikit-Learn Cheatsheet: Reference and Examples


Train/test split example

from sklearn.model_selection import train_test_split

# X is a 2-D ndarray (samples x features)
# y is a 1-D array of labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
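
For reproducible (and optionally stratified) splits, train_test_split also accepts random_state and stratify; a minimal sketch:

from sklearn.model_selection import train_test_split

# fix the seed so the split is reproducible;
# stratify=y keeps class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)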

Manual split into train/test sets

import numpy as np

# X is a 2-D ndarray (samples x features)
# y is a 1-D array of labels

# shuffle the indices
indices = np.arange(len(X))
np.random.shuffle(indices)

X = X[indices]
y = y[indices]

# hold out 15% of the data as the test set
num_test_samples = int(0.15 * len(X))

X_train = X[:-num_test_samples]
y_train = y[:-num_test_samples]

X_test = X[-num_test_samples:]
y_test = y[-num_test_samples:]

Reshape 1-d arrays

If you see a warning like the one below, scikit-learn is telling you it expects 2-D input:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

# X is a 1-D ndarray

# you want a COLUMN vector (many samples, 1 feature)
X = X.reshape(-1, 1)

# you want a ROW vector (one sample, many features)
X = X.reshape(1, -1)
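
For concreteness, a quick sketch of what both reshapes produce (shapes shown as comments), using plain NumPy:

import numpy as np

X = np.array([1, 2, 3, 4, 5])  # shape (5,)

X.reshape(-1, 1)  # shape (5, 1): 5 samples, 1 feature
X.reshape(1, -1)  # shape (1, 5): 1 sample, 5 features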

Evaluation metrics

This also works for AUC, precision, recall, etc. (see the full list in the scikit-learn metrics docs).

Template: func(y_ground_truth, y_predictions)

from sklearn import metrics

# say you have a trained model, clf

metrics.accuracy_score(y_test, clf.predict(X_test))
# 0.8812312312
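
The same template works for other metrics; a sketch for a binary classifier (the roc_auc_score line assumes clf implements predict_proba):

from sklearn import metrics

# precision and recall take (y_true, y_pred), just like accuracy_score
metrics.precision_score(y_test, clf.predict(X_test))
metrics.recall_score(y_test, clf.predict(X_test))

# ROC AUC is computed from scores/probabilities, not hard predictions
metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])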

Scaling data

See this post for more information: Feature Scaling: Quick Introduction and Examples using Scikit-learn

Don't fit the scaler on the test data - that amounts to data snooping, because you would be letting the test set influence training.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
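
An equivalent, slightly more compact variant uses fit_transform, which fits on the training data and transforms it in one call; the test set is still only transformed:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit on the training data only; apply the same scaling to both sets
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)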
