Scikit-Learn Cheatsheet: Reference and Examples
Table of Contents
- train test split example
- Manual split into train/test sets
- Reshape 1-d arrays
- Evaluation metrics
- Scaling data
train test split example
from sklearn.model_selection import train_test_split
# X is a 2d ndarray
# y is a column array
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
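The call above shuffles differently on every run. As a minimal sketch of a reproducible split, the example below passes `random_state` and `stratify`, both of which are real `train_test_split` parameters; the toy arrays and the seed value 42 are assumptions made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (assumed for illustration): 100 samples, 3 features, balanced binary labels
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

# random_state fixes the shuffle so the split is repeatable;
# stratify=y keeps the class proportions roughly equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Stratifying is mostly useful for classification with imbalanced classes; for regression targets it is usually left out.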
Manual split into train/test sets
import numpy as np
# X is a 2d ndarray
# y is a column array
# shuffle the indices
indices = np.arange(len(X))
np.random.shuffle(indices)
X = X[indices]
y = y[indices]
# hold out 15% of the data as the test set
num_validation_samples = int(0.15 * len(X))
X_train = X[:-num_validation_samples]
y_train = y[:-num_validation_samples]
X_test = X[-num_validation_samples:]
y_test = y[-num_validation_samples:]
Reshape 1-d arrays
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
# X is a 1-d ndarray
# you want a COLUMN vector (many samples, 1 feature)
X = X.reshape(-1,1)
# you want a ROW vector (one sample, many features)
X = X.reshape(1, -1)
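To make the difference between the two reshapes concrete, here is a small sketch that prints the resulting shapes. The four-element array is an assumption chosen for illustration.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])  # 1-d array, shape (4,)

column = X.reshape(-1, 1)  # 4 samples, 1 feature
row = X.reshape(1, -1)     # 1 sample, 4 features

print(X.shape, column.shape, row.shape)  # (4,) (4, 1) (1, 4)
```

The -1 tells NumPy to infer that dimension from the array's length, so the same two lines work for any 1-d input.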
Evaluation metrics
The same pattern works for AUC, precision, recall, and the other metrics listed in the scikit-learn docs for metrics.
template: func(y_ground_truth, y_predictions)
from sklearn import metrics
# say you have a trained model, clf
metrics.accuracy_score(y_test, clf.predict(X_test))
# 0.8812312312
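As a sketch of the same pattern with other metrics, the example below computes accuracy, precision and recall on a handful of hand-made labels. The label lists are assumptions for illustration and no trained model is involved.

```python
from sklearn import metrics

# hand-made labels, assumed for illustration
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# ground truth first, predictions second
accuracy = metrics.accuracy_score(y_true, y_pred)    # 5 of 6 correct
precision = metrics.precision_score(y_true, y_pred)  # 3 true positives, 0 false positives
recall = metrics.recall_score(y_true, y_pred)        # 3 of 4 positives found

print(accuracy, precision, recall)
```

Argument order matters for asymmetric metrics: swapping the two lists here would exchange the precision and recall values.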
Scaling data
Look at this post for more information: Feature Scaling: Quick Introduction and Examples using Scikit-learn
Don't fit the scaler on the testing data: that amounts to data snooping, because test-set statistics would leak into training. Fit on the training set only, then apply the same transform to both sets.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
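A quick way to check that the scaler was fit on the training data only is to inspect its learned statistics. The sketch below uses tiny made-up arrays (an assumption for illustration); `mean_` is the attribute `StandardScaler` sets during `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tiny made-up data: 3 training samples, 1 test sample, 2 features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

scaler = StandardScaler().fit(X_train)  # statistics come from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)                 # per-feature training means
print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_train_scaled.std(axis=0))   # approximately 1 for each feature
print(X_test_scaled)                # test data scaled with the training statistics
```

The scaled test values need not have zero mean or unit variance; that is expected, since they are expressed relative to the training distribution.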