Scikit-Learn examples: Making Dummy Datasets
- Binary classification
- Add noise to target variable
- Make classes more similar
- Feature contribution to target
- Full example
- Multi-class classification
IMPORTANT: The default value for n_informative is 2. This means that even if you set n_features to a large number, only 2 features will be informative unless you override the default value for n_informative!
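For example (a minimal sketch; the parameter values are illustrative): to make all 25 features informative you have to say so explicitly, and you also have to zero out n_redundant, which defaults to 2, so the three counts fit within n_features:

from sklearn.datasets import make_classification

# all 25 features carry signal: n_informative + n_redundant + n_repeated
# must not exceed n_features, and n_redundant defaults to 2
X, y = make_classification(n_samples=10000, n_features=25,
                           n_informative=25, n_redundant=0, n_repeated=0)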
Binary classification
Simplest possible dummy dataset: 10,000 samples with 25 features.
This dataset will have an equal number of 0 and 1 targets.
Each feature is sampled from a canonical Gaussian distribution (mean 0 and standard deviation 1).
Keep in mind that, per the note above, only 2 of the 25 features will actually be informative with the defaults.
from sklearn.datasets import make_classification
# other options are also available
X, y = make_classification(n_samples=10000, n_features=25)
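As a quick sanity check (a minimal sketch continuing from the snippet above; numpy is always available alongside scikit-learn):

import numpy as np

print(np.bincount(y))  # roughly [5000, 5000]: an equal number of 0 and 1 targets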
Add noise to target variable
Generated feature values are samples from a Gaussian distribution, so there will naturally be a little noise, but you can increase this if you need to.
The flip_y parameter (default value 0.01) defines the fraction of samples whose class label is assigned at random, so some labels end up flipped (a 1 where it should be 0 and vice versa).
This makes the classification task harder and enables you to test whether some classifier or combination of parameters is resistant to noisy labels:
from sklearn.datasets import make_classification

# ~10% of the labels will be assigned at random
X, y = make_classification(
    n_samples=10000,
    n_features=25,
    flip_y=0.1)  # the default value for flip_y is 0.01, or 1%
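To see the effect, here is a hedged sketch (the classifier and noise levels are illustrative choices, not part of the original example) comparing cross-validated accuracy at increasing noise levels:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for noise in (0.0, 0.1, 0.3):
    X, y = make_classification(n_samples=10000, n_features=25,
                               flip_y=noise, random_state=0)
    clf = RandomForestClassifier(random_state=0)
    acc = cross_val_score(clf, X, y, cv=3).mean()
    print(f"flip_y={noise}: mean accuracy {acc:.3f}")

Accuracy should fall as flip_y grows, since no classifier can recover labels that were assigned at random.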
Make classes more similar
To make the classification task harder, adjust the class_sep (class separation) parameter. The default value is 1.0.
The lower this value, the closer together points from different classes will be, and the more difficult it will be for any classifier to separate them.
Set it to a low value (such as 0.1 or 0.2) to see if your classifier can still get a good result.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000,
    n_features=25,
    class_sep=0.1)  # the default value for class_sep is 1.0; the lower the value, the harder the classification task
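A hedged sketch to illustrate (the classifier and the values are illustrative): train on datasets with decreasing class_sep and watch the test accuracy drop:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

for sep in (1.0, 0.5, 0.1):
    X, y = make_classification(n_samples=10000, n_features=25,
                               class_sep=sep, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(f"class_sep={sep}: test accuracy {clf.score(X_test, y_test):.3f}")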
Feature contribution to target
n_informative, n_redundant and n_repeated let you control how the features relate to the target value.
Together they can't sum up to more than the total number of features, n_features.
Remember that the default value for n_informative is 2!
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000,
                           n_features=25,
                           n_redundant=10,  # 10 of the 25 features will just be linear combinations of the informative ones
                           n_repeated=5)    # and 5 of the 25 will be exact duplicates of informative or redundant features
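If you want to see the repeated columns for yourself, here is a hedged sketch (shuffle=False and random_state are additions for illustration; with shuffle=False, scikit-learn documents that the columns are ordered informative, then redundant, then repeated, then noise):

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=25,
                           n_redundant=10, n_repeated=5,
                           shuffle=False, random_state=0)
n_useful = 2 + 10  # default n_informative plus n_redundant
for i in range(n_useful, n_useful + 5):
    # each repeated column is an exact copy of an informative or redundant column
    twin = next(j for j in range(n_useful) if np.allclose(X[:, i], X[:, j]))
    print(f"column {i} duplicates column {twin}")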
Full example
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000,
                           n_features=25,
                           n_informative=10,
                           n_redundant=10,
                           n_repeated=5,
                           weights=[0.2, 0.8],  # 20% of the targets will be 0, 80% will be 1; the default is 50/50
                           class_sep=0.2,  # default is 1.0; the lower it is, the more difficult the task
                           flip_y=0.1)  # each sample's label is reassigned at random with probability 0.1; the default is 0.01
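A hedged sketch of putting this dataset to work (the split and the classifier are illustrative): train a model and inspect per-class precision and recall, which matters with an 80/20 class imbalance:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=25,
                           n_informative=10, n_redundant=10, n_repeated=5,
                           weights=[0.2, 0.8], class_sep=0.2, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))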
Multi-class classification
Just set n_classes to the number of classes you want (the default is 2). One caveat: make_classification requires n_classes * n_clusters_per_class to be at most 2**n_informative, so with 10 classes the default n_informative=2 is too small and must be raised as well (n_clusters_per_class defaults to 2, so you need 2**n_informative >= 20):
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=10, n_classes=10,
                           n_informative=5)  # 2**5 = 32 >= 10 classes * 2 clusters per class
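As before, a quick check (a minimal sketch continuing from the snippet above) confirms the ten roughly equal-sized classes:

import numpy as np

print(np.unique(y, return_counts=True))  # labels 0-9, roughly 1000 samples each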