- Dummy dataset for binary classification
- Add more noise to dummy datasets
- Make the classification harder by making points closer together
- Adjust how much each feature contributes to the target
- Complete example
- Dummy dataset for multi-class classification
Dummy dataset for binary classification
The simplest possible dummy dataset: 10,000 samples with 25 features. By default only 2 of the features are informative and 2 are redundant; the rest are random noise.
This dataset will have a roughly equal number of 0 and 1 targets.
Feature values are drawn from Gaussian distributions; the noise features follow a standard Gaussian (mean 0, standard deviation 1).
from sklearn.datasets import make_classification

# other options are also available
X, y = make_classification(n_samples=10000, n_features=25)
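As a quick sanity check on the generated data, the sketch below inspects the shapes and the class balance. The `random_state=42` argument is an addition not used in the original call; it is assumed here only to make the output repeatable between runs.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state is fixed so repeated calls return the same data
X, y = make_classification(n_samples=10000, n_features=25, random_state=42)
X_again, y_again = make_classification(n_samples=10000, n_features=25, random_state=42)

print(X.shape, y.shape)  # (10000, 25) (10000,)

# Number of samples per class; both counts should be close to 5000
class_counts = np.bincount(y)
print(class_counts)

# True when both calls produced identical features and targets
same_data = np.array_equal(X, X_again) and np.array_equal(y, y_again)
print(same_data)
```

Fixing the seed is worth doing whenever you compare classifiers, so that every run sees the same dummy data.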
Add more noise to dummy datasets
Generated feature values are samples from a gaussian distribution so there will naturally be a little noise, but you can increase this if you need to.
The flip_y parameter (default value = 0.01) sets the fraction of samples whose target is assigned at random, so some labels end up as 1 when they should be 0 and vice versa.
This makes the classification task harder and lets you test whether a classifier or combination of parameters is robust to label noise.
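To see what flip_y does to the labels, here is a minimal sketch that generates the targets twice, once without label noise and once with flip_y=0.2, and counts how many differ. It assumes shuffle=False, which keeps the samples grouped by class in the same order in both calls so that the two label vectors can be compared position by position.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps the samples in the same class-by-class order in both calls
common = dict(n_samples=10000, n_features=25, shuffle=False, random_state=0)

_, y_clean = make_classification(flip_y=0.0, **common)
_, y_noisy = make_classification(flip_y=0.2, **common)

# Fraction of samples whose label differs between the two datasets
changed = np.mean(y_clean != y_noisy)
print(f"fraction of labels that differ: {changed:.3f}")
```

The fraction printed should be roughly half of flip_y: the affected samples get a random class, and in a binary problem that random class matches the original one about half the time.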
Make the classification harder by making points closer together
Adjust the class_sep (class separation) parameter. The default value is 1.0.
The lower this value is, the closer together points from different classes will be, so the harder it is for any classifier to separate them.
Set it to a low value (such as 0.1 or 0.2) to see if your classifier can still get a good result.
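The effect of class_sep can be illustrated by training the same model on an easy and a hard version of the dataset. This is only a sketch: the choice of LogisticRegression, the 75/25 split, the helper name accuracy_for, and the settings n_clusters_per_class=1 and flip_y=0.0 (used to isolate the effect of class_sep) are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def accuracy_for(class_sep):
    """Train a logistic regression on a dummy dataset and return its test accuracy."""
    # flip_y=0.0 and one cluster per class isolate the effect of class_sep
    X, y = make_classification(
        n_samples=10000,
        n_features=25,
        n_clusters_per_class=1,
        flip_y=0.0,
        class_sep=class_sep,
        random_state=0,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_test, y_test)


acc_default = accuracy_for(1.0)  # default separation
acc_hard = accuracy_for(0.1)     # classes almost on top of each other
print(f"class_sep=1.0: {acc_default:.3f}, class_sep=0.1: {acc_hard:.3f}")
```

The accuracy at class_sep=0.1 should come out clearly lower than at the default value.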
Adjust how much each feature contributes to the target
The parameters n_informative, n_redundant and n_repeated let you set how the features contribute to the target value.
Their sum can't exceed the total number of features (n_features); any features left over are filled with random noise.
X, y = make_classification(n_samples=10000, n_features=25, n_redundant=10, n_repeated=5)
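The sketch below looks at how these feature groups are laid out. It assumes shuffle=False, in which case the columns are stacked as informative, then redundant, then repeated, then noise. It checks that the repeated columns really are copies of earlier ones, and that asking for more features than n_features allows raises a ValueError.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant, n_repeated = 2, 10, 5  # 2 is the default n_informative

# shuffle=False keeps the column order: informative, redundant, repeated, noise
X, y = make_classification(
    n_samples=10000,
    n_features=25,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    shuffle=False,
    random_state=0,
)

start = n_informative + n_redundant
source = X[:, :start]                       # informative + redundant columns
repeated = X[:, start:start + n_repeated]   # columns that should be duplicates

# Every repeated column should be an exact copy of one of the source columns
all_copies = all(
    any(np.array_equal(repeated[:, j], source[:, i]) for i in range(source.shape[1]))
    for j in range(n_repeated)
)
print(all_copies)

# 15 + 10 + 5 = 30 features requested, but only 25 available
try:
    make_classification(n_features=25, n_informative=15, n_redundant=10, n_repeated=5)
    raised = False
except ValueError:
    raised = True
print(raised)
```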
Complete example
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10000,
    n_features=25,
    n_redundant=10,
    n_repeated=5,
    weights=[0.2, 0.8],  # about 20% of the targets will be 0, 80% will be 1. default is 50/50
    class_sep=0.2,  # default value is 1.0. the lower it is, the more difficult the task
    flip_y=0.1,  # about 10% of the samples get a randomly assigned target. default is 0.01
)
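To confirm that weights produces the requested class imbalance, a short check like the following can help. It sets flip_y=0.0 (an assumption made here so that label noise does not blur the proportions) and fixes random_state for repeatability.

```python
import numpy as np
from sklearn.datasets import make_classification

# flip_y=0.0 so that label noise does not blur the requested proportions
X, y = make_classification(
    n_samples=10000,
    n_features=25,
    weights=[0.2, 0.8],
    flip_y=0.0,
    random_state=0,
)

# Share of samples in class 0 and class 1
proportions = np.bincount(y) / len(y)
print(proportions)  # close to [0.2, 0.8]
```

With a non-zero flip_y the proportions drift slightly toward 50/50, because the randomly reassigned labels are drawn uniformly from all classes.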
Dummy dataset for multi-class classification
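Set n_classes to the number of classes you want (default is 2). With this many classes, n_informative also has to be raised above its default of 2:
from sklearn.datasets import make_classification

# 10 classes need more than the default 2 informative features
X, y = make_classification(n_samples=10000, n_features=10, n_classes=10, n_informative=5)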
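One constraint to keep in mind when raising n_classes: make_classification requires n_classes * n_clusters_per_class <= 2**n_informative. The sketch below shows a call with the default n_informative=2 being rejected, and a call with n_informative=5 (a value chosen here for illustration) succeeding.

```python
import numpy as np
from sklearn.datasets import make_classification

# Defaults (n_informative=2, n_clusters_per_class=2) cannot hold 10 classes:
# 10 * 2 = 20 is greater than 2**2 = 4
try:
    make_classification(n_samples=10000, n_features=10, n_classes=10)
    default_call_failed = False
except ValueError:
    default_call_failed = True

# 10 * 2 = 20 <= 2**5 = 32, so n_informative=5 is enough
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_classes=10,
    n_informative=5,
    random_state=0,
)
print(default_call_failed, np.unique(y))
```

Lowering n_clusters_per_class to 1 is another way to satisfy the constraint with fewer informative features.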