# SVM and Kernels: The very least every data scientist needs to know

**WIP Alert:** This is a work in progress. Current information is correct, but more content may be added in the future.

## Algorithm

Find the hyperplane that separates positive and negative points with the largest possible *margin*, i.e. the largest distance between the hyperplane and the data point closest to it.

This can be framed as a constrained optimization problem, which can be solved with the method of Lagrange multipliers.
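For reference, the standard textbook form of that problem for linearly separable data (the hard-margin case) is sketched here. The symbols are assumptions introduced for illustration: $x_i$ are the input points, $y_i \in \{-1, +1\}$ their labels, $w$ and $b$ the hyperplane's normal vector and offset, and $\alpha_i$ the Lagrange multipliers.

$$
\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n
$$

Applying Lagrange multipliers gives the equivalent dual problem:

$$
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, x_i^\top x_j
\quad \text{subject to} \quad
\alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0
$$

The margin width is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert^2$ maximizes the margin. In the dual, the inputs only appear through the inner products $x_i^\top x_j$.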

## Support vectors are the data points that lie closest to the decision hyperplane

They are also the most "difficult" points to classify.

The support vectors (and only they) can fully specify the decision function/hyperplane.
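One way to check this claim is to rebuild the decision function by hand from the support vectors alone. The sketch below assumes scikit-learn's `SVC` with a linear kernel and a synthetic dataset from `make_blobs`; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data (synthetic, for illustration only).
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Rebuild the decision function using only the support vectors:
# f(x) = sum_i (alpha_i * y_i) * <sv_i, x> + b
# dual_coef_ holds the products alpha_i * y_i, intercept_ holds b.
manual = clf.dual_coef_ @ (clf.support_vectors_ @ X.T) + clf.intercept_

print("support vectors:", len(clf.support_vectors_), "of", len(X), "points")
print("matches decision_function:",
      np.allclose(manual.ravel(), clf.decision_function(X)))
```

None of the other training points are needed to evaluate the classifier once it has been fitted.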

## What about non linearly separable points?

One of the terms in the function to be optimized (to find the best separating hyperplane) is an inner product between two given input points.

Inner products provide some measure of *similarity*.

A **kernel** is a function whose result is the inner product of two points when mapped to another space, without explicitly mapping the points to that space.
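A small numeric check can make this concrete. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D points, chosen here only as an example; `phi` and `poly2_kernel` are hypothetical helper names, not library functions.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Same inner product, computed without ever building phi(a) or phi(b)."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # map to 3-D first, then take the inner product
implicit = poly2_kernel(x, z)      # kernel: stay in the original 2-D space

print(explicit, implicit)
```

Both values agree up to floating-point rounding, even though the second one never leaves the original space.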

## Parameters

**C**:

- This parameter controls how heavily the model is penalized for misclassifying training samples: the larger the value, the heavier the penalty.
- The commonly recommended default value is 1.0.
- You should decrease this value if your data has many noisy observations.
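To get a feel for what C does, you can fit the same model with a few values and compare the results. This is a minimal sketch assuming scikit-learn's `SVC` and a synthetic noisy dataset; the three values of C are arbitrary picks for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data with roughly 10% of the labels flipped to simulate noise.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.1, random_state=0)

results = {}
for C in (0.01, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    results[C] = {
        "n_support": int(clf.n_support_.sum()),  # total number of support vectors
        "train_accuracy": clf.score(X, y),
    }
    print(C, results[C])
```

A smaller C tolerates more misclassified training points, which typically means a wider margin and more support vectors. Training accuracy alone does not tell you which value generalizes best.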

## Data imbalance

Many SVM implementations report **better results** when you explicitly set class weights in case your dataset is unbalanced (i.e. some targets appear much more frequently than others).

Most provide ways to do that (e.g. `sklearn.svm.SVC` has a `class_weight` parameter in its constructor; its `fit()` method separately accepts per-sample weights via `sample_weight`).
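In scikit-learn, class weights are passed when the estimator is constructed. A minimal sketch, assuming a synthetic imbalanced dataset and the built-in `"balanced"` option:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data: roughly 90% of the samples fall in class 0.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# class_weight is a constructor argument. "balanced" scales C for each class
# inversely to that class's frequency in y.
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

print("per-class multipliers of C:", clf.class_weight_)
```

The rare class gets the larger multiplier, so mistakes on it cost more during training. You can also pass an explicit dictionary such as `{0: 1, 1: 10}` instead of `"balanced"`.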

## Parameter tuning

SVM is highly sensitive to changes in hyperparameters, so you need to perform cross-validation and some sort of grid search to find the parameters that give the best accuracy and generalization on your specific problem.
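With scikit-learn, cross-validated grid search could look roughly like the sketch below. The dataset (iris), the grid values, and tuning the RBF kernel's `gamma` alongside C are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fitted on each training fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names each step after its class in lowercase, hence "svc__".
param_grid = {
    "svc__C": [0.1, 1.0, 10.0],
    "svc__gamma": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5)  # 5-fold cross-validation
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best combination, so it is still worth keeping a separate test set for the final evaluation.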

## Input Normalization

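As with any method that uses distances between points to train, you **need** to normalize your features (i.e. each feature should be rescaled so that all of them span a comparable range, say, -1 to 1).

If you're running scikit-learn, you can easily do this via `preprocessing.StandardScaler` (which rescales each feature to zero mean and unit variance) or `preprocessing.MinMaxScaler` (which maps each feature to a fixed range), for instance.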
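Here is a minimal sketch of what scaling does to two features with very different ranges. The tiny array is made up, and `MinMaxScaler` is included as an additional option for when you want a fixed range such as -1 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on very different scales (made-up numbers).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

# StandardScaler: each column ends up with zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is mapped linearly onto the range [-1, 1].
X_range = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))
print(X_range.min(axis=0), X_range.max(axis=0))
```

In a real project, fit the scaler on the training data only and reuse it to transform the test data.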

## Rules of thumb

- For sparse data (e.g. text encoded as bag-of-words features), the linear kernel is faster to train and is usually accurate enough.
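As a rough sketch of the sparse-text case, the snippet below trains a linear-kernel `SVC` on bag-of-words counts. The four documents and their labels are invented toy data, far too small for a real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy corpus: 1 = spam, 0 = not spam.
docs = [
    "cheap pills special offer",
    "limited offer buy cheap watches",
    "meeting agenda for monday",
    "please review the project report",
]
labels = [1, 1, 0, 0]

# CountVectorizer produces a sparse bag-of-words matrix, which SVC accepts directly.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(docs, labels)

features = model.named_steps["countvectorizer"].transform(docs)
predictions = model.predict(["cheap offer on pills"])

print(features.shape, "non-zero entries:", features.nnz)
print(predictions)
```

For large text datasets, scikit-learn also offers `LinearSVC`, a separate linear SVM implementation that tends to scale better than `SVC(kernel="linear")`.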

### References

Video Lectures by Bill Howe:

- Intuition for Support Vector Machines
- Support Vector Machine Example (uses data from the Titanic Kaggle Competition)
- Full Playlist on Youtube

Eric Kim: Everything you always wanted to know about the Kernel Trick but were too afraid to ask

Youtube video: How support vector machines work / How to open a black box (17 min)

- very good video by a Facebook data scientist with nice visualizations explaining how you can view the kernel trick as a "bending" of the feature space.

Felipe