Scikit-Learn examples: Making Dummy Datasets

Dummy dataset for binary classification

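Simplest possible dummy dataset: 10,000 samples with 25 features. With the default settings only 2 of those features are informative and 2 are redundant (linear combinations of the informative ones); the remaining features are random noise.

This dataset will have roughly equal numbers of 0 and 1 targets.

Informative features are drawn from Gaussian clusters (standard deviation 1 before a random linear transformation) placed around the class centers, and the noise features are samples of a standard Gaussian distribution (mean 0 and standard deviation 1).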

from sklearn.datasets import make_classification

# other options are also available
X, y = make_classification(n_samples=10000, n_features=25)
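
A quick way to confirm what was generated is to look at the array shapes and the class counts. The snippet below is a minimal sketch; the `random_state=0` argument is an addition that is not part of the call above and is used only so that repeated runs return the same data.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state is fixed only so that repeated runs give the same data
X, y = make_classification(n_samples=10000, n_features=25, random_state=0)

print(X.shape)         # one row per sample, one column per feature
print(y.shape)         # one target per sample
print(np.bincount(y))  # number of samples in class 0 and in class 1
```

The two counts will not be exactly equal, because the default flip_y=0.01 reassigns a small share of the targets at random.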

Add more noise to dummy datasets

Generated feature values are drawn from Gaussian distributions, so there will naturally be a little noise, but you can add label noise on top of that if you need to.

Parameter flip_y (default value = 0.01) sets the fraction of samples whose target is reassigned at random. A reassigned target can land on its original value, so with two classes only about half of those samples actually end up flipped (1 when it should be 0 and vice versa).

This makes the classification task harder and lets you test whether a classifier or combination of parameters is robust to noisy labels.
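
To see how many targets flip_y really changes, one option is to generate the same dataset twice, with and without label noise, and compare the targets position by position. This is a rough sketch that assumes `shuffle=False` keeps the rows in the same order in both calls; the names `common`, `y_clean` and `y_noisy` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Assumption: with shuffle=False the rows come out in the same order
# in both calls, so the targets can be compared position by position.
common = dict(n_samples=10000, n_features=25, shuffle=False, random_state=0)

_, y_clean = make_classification(flip_y=0.0, **common)
_, y_noisy = make_classification(flip_y=0.2, **common)

changed = np.mean(y_clean != y_noisy)
print(f"fraction of targets that differ: {changed:.3f}")
```

With flip_y=0.2 about 20% of the targets are reassigned at random, and roughly half of those land on their original value, so the printed fraction should be close to 0.1 rather than 0.2.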

Make the classification harder by making points closer together

Adjust the parameter class_sep (class separator). The default value is 1.0.

The lower this value is, the closer together points from different classes will be, so it will be more difficult for any classifier to separate them.

Set it to a low value (such as 0.1 or 0.2) to see if your classifier can still get a good result.
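
The effect of class_sep can be checked by training the same model on datasets that differ only in that parameter. The sketch below uses LogisticRegression purely as a stand-in classifier. It also sets `n_clusters_per_class=1` and `flip_y=0.0`, which are choices made for this illustration so that class_sep is the only source of difficulty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

scores = {}
for class_sep in (3.0, 1.0, 0.1):
    X, y = make_classification(
        n_samples=10000,
        n_features=25,
        n_clusters_per_class=1,  # one blob per class keeps the problem roughly linear
        flip_y=0.0,              # no label noise, so only class_sep varies
        class_sep=class_sep,
        random_state=0,
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores[class_sep] = model.score(X_test, y_test)

for class_sep, accuracy in scores.items():
    print(f"class_sep={class_sep}: test accuracy {accuracy:.3f}")
```

Swap in your own classifier for LogisticRegression to see how quickly its accuracy degrades as the classes move closer together.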

Adjust how much each feature contributes to the target

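n_informative, n_redundant and n_repeated set how the features contribute to the target value: how many features carry signal, how many are linear combinations of the informative features, and how many are duplicates of other features. Any remaining features are random noise.

Together they can't sum to more than the total number of features, n_features. In the example below n_informative is left at its default of 2.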

X, y = make_classification(n_samples=10000, 
    n_features=25, 
    n_redundant=10, 
    n_repeated=5)
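
The three kinds of features can be inspected directly when `shuffle=False`, because the columns then keep their order: informative first, then redundant, then repeated. The following sketch uses 10 informative features instead of the default 2 to make the check more visible; that value and the variable names are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant, n_repeated = 10, 10, 5

# shuffle=False keeps the column order: informative, redundant, repeated
X, y = make_classification(
    n_samples=10000,
    n_features=25,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    shuffle=False,
    random_state=0,
)

n_useful = n_informative + n_redundant
# Redundant columns are linear combinations of the informative ones,
# so together they have no more rank than the informative columns alone.
rank = np.linalg.matrix_rank(X[:, :n_useful])
print(rank)

# Every repeated column is an exact copy of an informative or redundant column.
repeated_are_copies = all(
    any(np.array_equal(X[:, j], X[:, i]) for i in range(n_useful))
    for j in range(n_useful, n_useful + n_repeated)
)
print(repeated_are_copies)

# Asking for more than n_features columns in total is rejected.
try:
    make_classification(n_features=25, n_informative=15, n_redundant=10, n_repeated=5)
    too_many_rejected = False
except ValueError as exc:
    too_many_rejected = True
    print(exc)
```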

Complete example

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000,
    n_features=25,
    n_redundant=10,
    n_repeated=5,
    weights=[0.2, 0.8], # roughly 20% of the targets will be 0 and 80% will be 1. default is 50/50
    class_sep=0.2, # default value is 1.0. the lower it is, the more difficult the task
    flip_y=0.1) # 10% of samples get a randomly assigned target. default is 0.01
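
The class proportions that result from combining weights and flip_y can be checked with a count of the targets. This is a small sketch using the same parameter values as the complete example; `random_state=0` and the name `class_fractions` are additions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000,
    n_features=25,
    n_redundant=10,
    n_repeated=5,
    weights=[0.2, 0.8],
    class_sep=0.2,
    flip_y=0.1,
    random_state=0,  # fixed only for reproducibility
)

# Share of samples in class 0 and in class 1
class_fractions = np.bincount(y) / len(y)
print(class_fractions)
```

The share of class 0 will be a little above 0.2, because the randomly reassigned targets are split evenly between the two classes and so pull the proportions slightly toward 50/50.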

Dummy dataset for multi-class classification

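Set n_classes to the number of classes you want (default is 2). There must be enough informative features that n_classes * n_clusters_per_class <= 2 ** n_informative. With the defaults (n_informative=2, n_clusters_per_class=2), asking for 10 classes raises a ValueError, so increase n_informative as well:

from sklearn.datasets import make_classification

# 10 classes * 2 clusters per class = 20 clusters, which needs n_informative >= 5 (2 ** 5 = 32)
X, y = make_classification(n_samples=10000, n_features=10, n_informative=5, n_classes=10)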
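
The relationship between n_classes and n_informative can be seen by trying a call that leaves n_informative at its default and one that raises it. This is a minimal sketch; the flag `defaults_rejected` and `random_state=0` are illustrative additions.

```python
import numpy as np
from sklearn.datasets import make_classification

# With the defaults (n_informative=2, n_clusters_per_class=2), ten classes
# need 20 clusters but only 2 ** 2 = 4 are available, so this call is rejected.
try:
    make_classification(n_samples=10000, n_features=10, n_classes=10)
    defaults_rejected = False
except ValueError as exc:
    defaults_rejected = True
    print(exc)

# n_informative=5 allows 2 ** 5 = 32 clusters, enough for 10 * 2 = 20.
X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=5, n_classes=10, random_state=0
)
print(np.unique(y))  # the distinct class labels present in y
```

Another way to satisfy the constraint is to lower n_clusters_per_class, which reduces the number of clusters each class needs.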
