Scikit-Learn examples: Making Dummy Datasets


IMPORTANT The default value for n_informative is 2. This means that even if you set n_features to a large number, only 2 features will be informative unless you also override n_informative!
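
For example, to get 25 informative features you must override both n_informative and n_redundant (a minimal sketch; n_redundant defaults to 2, and informative, redundant and repeated features together may not exceed n_features):

from sklearn.datasets import make_classification

# make every one of the 25 features informative
X, y = make_classification(
    n_samples=10000,
    n_features=25,
    n_informative=25, # default is 2
    n_redundant=0)    # default is 2; must be zeroed so the counts fit into n_features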

Binary classification

Simplest possible dummy dataset: 10,000 samples with 25 features. (Keep the warning above in mind: with the default settings, only 2 of these 25 features will be informative.)

This dataset will have a roughly equal number of 0 and 1 targets.

Each feature is a sample from a canonical Gaussian distribution (mean 0 and standard deviation 1).

from sklearn.datasets import make_classification

# other options are also available
X, y = make_classification(n_samples=10000, n_features=25)
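
To confirm the shape and the roughly even class balance, a quick sanity check (not part of the original example):

import numpy as np

print(X.shape)        # (10000, 25)
print(np.bincount(y)) # roughly [5000, 5000]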

Add noise to target variable

Generated feature values are samples from a Gaussian distribution, so there will naturally be a little noise, but you can increase the amount of noise if you need to.

The parameter flip_y (default value 0.01) defines the probability that the target variable for a sample will be flipped (it becomes 1 when it should be 0 and vice versa).

This makes the classification task harder and enables you to test whether some classifier or combination of parameters is resistant to noisy inputs:

from sklearn.datasets import make_classification

# 10% of the values of Y will be randomly flipped
X, y = make_classification(
    n_samples=10000, 
    n_features=25,
    flip_y=0.1) # the default value for flip_y is 0.01, or 1%
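
One way to see the effect of flip_y is to train a simple classifier and watch the score ceiling drop. The sketch below uses LogisticRegression, which is just an illustrative choice:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=25, flip_y=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# roughly 10% of the labels are wrong, so even a good
# classifier cannot score much above 0.9 here
print(clf.score(X_test, y_test))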

Make classes more similar

Adjust the parameter class_sep (class separation) to make the classification task harder. The default value is 1.0.

The lower this value is, the closer together points from different classes will be, and the harder it will be for any classifier to separate them.

Set it to a low value (such as 0.1 or 0.2) to see if your classifier can still get a good result.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000, 
    n_features=25,
    class_sep=0.1) # the default value for class_sep is 1.0. The lower the value, the harder classification is.
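
To verify that lower class_sep values really do make the task harder, you can compare cross-validated scores across a few settings (a sketch; LogisticRegression is an arbitrary choice of classifier):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for sep in (1.0, 0.5, 0.1):
    X, y = make_classification(
        n_samples=10000, n_features=25, class_sep=sep, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3)
    # accuracy should drop as the classes move closer together
    print(f"class_sep={sep}: mean accuracy={scores.mean():.3f}")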

Feature contribution to target

n_informative, n_redundant and n_repeated let you adjust how many of the features contribute to the target value.

Together, these can't sum up to more than the total number of features, n_features.

Remember: the default value for n_informative is 2!

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, 
    n_features=25, 
    n_redundant=10, # 10 of the 25 features will be linear combinations of the informative features 
    n_repeated=5) # and 5 of the 25 features will be duplicates of informative or redundant features
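
If you need to know which column is which, passing shuffle=False keeps the column order deterministic: informative features first, then redundant, then repeated, with the remaining columns being pure noise (a sketch reusing the example above):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000,
    n_features=25,
    n_redundant=10,
    n_repeated=5,
    shuffle=False)

# column layout with shuffle=False:
# columns 0-1   -> informative (n_informative defaults to 2)
# columns 2-11  -> redundant
# columns 12-16 -> repeated
# columns 17-24 -> pure noise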

Full example

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, 
    n_features=25, 
    n_informative=10,
    n_redundant=10, 
    n_repeated=5,
    weights=[0.2, 0.8], # 20% of the targets will be 0, 80% will be 1. default is 50/50
    class_sep=0.2, # default value is 1.0. the lower it is, the more difficult the task.
    flip_y=0.1) # each sample has a 10% chance of having its target flipped. default is 0.01
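
A quick check (not part of the original example) that the weights were respected:

import numpy as np

# flip_y adds label noise, so the realised proportions
# will be pushed slightly back toward 50/50
print(np.bincount(y) / len(y)) # roughly [0.2, 0.8]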

Multi-class classification

Just set n_classes to the number of classes you want (the default is 2). Note that make_classification requires n_classes * n_clusters_per_class (default 2) to be at most 2**n_informative, so with many classes you also need to raise n_informative above its default of 2:

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000, 
    n_features=10, 
    n_informative=5, # 10 classes * 2 clusters per class requires 2**n_informative >= 20
    n_classes=10)
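
A quick check (not part of the original example) that all 10 classes are present:

import numpy as np

print(np.unique(y)) # [0 1 2 3 4 5 6 7 8 9]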
