SVM and Kernels: The very least every data scientist needs to know

WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.

Algorithm

Find a hyperplane that separates positive and negative points with the largest possible margin, i.e. the largest distance to the points that lie closest to it.

This can be framed as a constrained optimization problem, which can be solved by the method of Lagrange multipliers.
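
Concretely, the standard hard-margin primal problem looks like this (w is the hyperplane's normal vector, b its offset, and (x_i, y_i) the training points with labels y_i in {-1, +1}):

    \min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^2
    \quad \text{subject to} \quad
    y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i

The margin width equals 2 / ||w||, so minimizing ||w|| maximizes the margin; the inequality constraints are exactly where the Lagrange multipliers come in.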

Support vectors are the data points that lie closest to the decision hyperplane.

  • They are also the most "difficult" points to classify.

  • The support vectors alone fully specify the decision function/hyperplane; all other points could be removed without changing it.
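
As a quick illustration, scikit-learn exposes the support vectors of a fitted model directly; a minimal sketch on a toy dataset:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Two well-separated blobs of points, one per class
    X, y = make_blobs(n_samples=100, centers=2, random_state=0)

    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)

    # Only a handful of points end up defining the hyperplane
    print(clf.support_vectors_)   # coordinates of the support vectors
    print(clf.n_support_)         # number of support vectors per class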

What about non-linearly separable points?

  • One of the terms in the function to be optimized (to find the best separating hyperplane) is an inner product between two given input points.

  • Inner products provide some measure of similarity.

  • A kernel is a function that computes the inner product of two points as if they had been mapped into another (often higher-dimensional) space, without ever performing the mapping explicitly. This is known as the kernel trick.
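
A minimal sketch (using scikit-learn's make_circles, a classic non-linearly-separable toy dataset) of how the choice of kernel changes the outcome:

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Concentric circles: no straight line can separate the two classes
    X, y = make_circles(n_samples=500, noise=0.1, factor=0.3, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X_train, y_train)
        print(f"{kernel}: test accuracy = {clf.score(X_test, y_test):.2f}")

    # The RBF kernel implicitly maps points into a space where the
    # circles become separable; the linear kernel cannot do this.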

Parameters

  • C:

    • This parameter controls how heavily the model is penalized for misclassifying training samples.
    • The commonly recommended default value is 1.0.
    • Decrease this value if your data has many noisy observations: a smaller C widens the margin and makes the model more tolerant of misclassified points.
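
A minimal sketch of the effect of C, using a toy dataset with overlapping classes (the specific values are illustrative only):

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # Overlapping blobs, so some points are inevitably misclassified
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # Large C -> narrow margin that tries hard to classify every
        # training point; small C -> wide, more tolerant margin.
        print(f"C={C}: {len(clf.support_vectors_)} support vectors, "
              f"train accuracy = {clf.score(X, y):.2f}")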

Data imbalance

Many SVM implementations achieve better results when you explicitly set class weights if your dataset is imbalanced (i.e. some target classes appear much more frequently than others).

Most implementations provide ways to do that (e.g. sklearn.svm.SVC accepts a 'class_weight' parameter in its constructor, and its fit() method takes per-sample weights via 'sample_weight').
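
For instance, with scikit-learn you can either pass explicit per-class weights or let the library derive them from class frequencies:

    from sklearn.svm import SVC

    # Explicit weights: mistakes on class 1 cost 10x more than on class 0
    clf = SVC(class_weight={0: 1, 1: 10})

    # Or let scikit-learn weight classes inversely to their frequency
    clf = SVC(class_weight="balanced")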

Parameter tuning

SVMs are highly sensitive to changes in their hyperparameters, so you need to perform cross-validation together with some form of grid search to find the parameters that give the best accuracy and generalization for your specific problem.
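
A minimal sketch of such a search using scikit-learn's GridSearchCV (the parameter grid is an illustrative assumption, not a recommendation):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)

    # Illustrative grid; adjust the ranges to your own problem
    param_grid = {
        "C": [0.1, 1, 10, 100],
        "gamma": [0.001, 0.01, 0.1, 1],
    }

    # 5-fold cross-validation over every combination in the grid
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)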

Input Normalization

As with any method that relies on distances between points, you need to normalize your features (i.e. each feature should be rescaled so that all of them span a comparable range, say -1 to 1).

If you're using scikit-learn, you can easily do this via preprocessing.StandardScaler, for instance (it rescales each feature to zero mean and unit variance).
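
A minimal sketch that chains the scaler and the SVM in a single Pipeline, so the scaler is fitted on training data only and applied consistently at prediction time:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The scaler learns mean/variance from the training split only
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))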

Rules of thumb

  • For sparse, high-dimensional data (e.g. text encoded as bag-of-words features), the linear kernel is faster and very accurate.
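
A minimal sketch of a linear SVM on sparse text features (LinearSVC is scikit-learn's implementation specialized for the linear case; the toy documents below are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["cheap pills online", "meeting at noon",
            "win money now", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham (toy data)

    # TfidfVectorizer produces a sparse matrix; LinearSVC handles it natively
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(docs, labels)
    print(model.predict(["free money pills"]))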
