One-Hot Encoding a Feature on a Pandas Dataframe: an Example

When extracting features from a dataset, it is often useful to transform categorical features into vectors so that you can perform vector operations (such as calculating the cosine distance) on them.

Think about it for a second: how would you naïvely calculate the cosine distance between users when their country of origin is the only feature? You need a representation that correctly returns 0 for users who share the same country and 1 (the maximum here) for users who don't.

Take this dataset for example:

+---+----------------+
| id| country        |
+---+----------------+
|  0| russia         |
|  1| germany        |
|  2| australia      |
|  3| korea          |
|  4| germany        |
+---+----------------+

One of the ways to do it is to encode the categorical variable as a one-hot vector, i.e. a vector where only one element is non-zero, or hot.

With one-hot encoding, a categorical feature becomes an array whose size is the number of possible values for that feature, i.e.:

+---+---------------+----------------+------------------+---------------+
| id| country=russia| country=germany| country=australia|  country=korea|
+---+---------------+----------------+------------------+---------------+
|  0|              1|               0|                 0|              0|
|  1|              0|               1|                 0|              0|
|  2|              0|               0|                 1|              0|
|  3|              0|               0|                 0|              1|
|  4|              0|               1|                 0|              0|
+---+---------------+----------------+------------------+---------------+
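To see why this representation gives the distances asked for earlier, here is a minimal sketch that computes the cosine distance between rows of the table above. The helper name cosine_distance and the use of NumPy are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot rows, columns ordered as in the table: russia, germany, australia, korea
russia = [1, 0, 0, 0]
germany = [0, 1, 0, 0]

same = cosine_distance(germany, germany)      # users 1 and 4: same country
different = cosine_distance(russia, germany)  # users 0 and 1: different countries
print(same, different)
```

Two users from the same country have identical vectors, so the distance is 0; users from different countries have orthogonal vectors, so the distance is 1.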

One-hot encoding a column in a Pandas Dataframe

To create a dataset similar to the one used above in Pandas, we could do this:

import pandas as pd

df = pd.DataFrame({'country': ['russia', 'germany', 'australia','korea','germany']})
df

>>
     country
0     russia
1    germany
2  australia
3      korea
4    germany

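pandas provides the very useful get_dummies function (pd.get_dummies), which does what we want. Passing dtype=int keeps the output as 0/1 integers; recent pandas versions return True/False columns by default:

pd.get_dummies(df, prefix=['country'], dtype=int)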

>>
   country_australia  country_germany  country_korea  country_russia
0                  0                0              0               1
1                  0                1              0               0
2                  1                0              0               0
3                  0                0              1               0
4                  0                1              0               0
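If the DataFrame has other columns that should be left alone (such as the id column in the first table), the columns parameter of get_dummies limits the encoding to the columns you name. The following is a small sketch under that assumption; the variable names users and encoded are made up for illustration.

```python
import pandas as pd

# Same data as above, plus an explicit id column that must not be encoded
users = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "country": ["russia", "germany", "australia", "korea", "germany"],
})

# Only "country" is replaced by dummy columns; "id" is passed through unchanged.
# The column name is used as the prefix by default.
encoded = pd.get_dummies(users, columns=["country"], dtype=int)
print(encoded)
```

Each row of the result has exactly one dummy column set to 1, which is what makes the encoding "one-hot".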

Add extra columns for categories that only appear in the test set

You need to tell pandas explicitly if you want it to create dummy columns for categories that never appear in the data (for example, when you one-hot encode a categorical variable that may take values in the test set that were not seen in the training set). Declaring the full list of categories on the column does this.

# say you want a column for "japan" too (it will always be zero, of course)
country_dtype = pd.CategoricalDtype(categories=["australia", "germany", "korea", "russia", "japan"])
df["country"] = df["country"].astype(country_dtype)
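As a fuller sketch of why this matters, the example below encodes a training set and a test set with the same declared categories, so both end up with identical columns. The test_df contents and the variable names are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

train_df = pd.DataFrame({"country": ["russia", "germany", "australia", "korea", "germany"]})
# Hypothetical test set containing a country that never appears in training
test_df = pd.DataFrame({"country": ["japan", "germany"]})

# Declare every category up front so both frames share the same dummy columns
countries = ["australia", "germany", "korea", "russia", "japan"]
country_dtype = pd.CategoricalDtype(categories=countries)

train_encoded = pd.get_dummies(
    train_df.astype({"country": country_dtype}), prefix=["country"], dtype=int
)
test_encoded = pd.get_dummies(
    test_df.astype({"country": country_dtype}), prefix=["country"], dtype=int
)

print(train_encoded.columns.tolist())
print(test_encoded)
```

Without the shared category list, the two frames would have different columns, and a model fitted on one could not be applied directly to the other.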
