One-Hot Encoding a Feature on a Pandas Dataframe: an Example

When extracting features from a dataset, it is often useful to transform categorical features into vectors so that you can do vector operations (such as calculating the cosine distance) on them.

Think about it for a second: how would you naïvely calculate the cosine distance between users when their country of origin is the only feature? You need an encoding that correctly returns 0 for users who share the same country and 1 (the maximum for non-negative vectors) for users who don't.

Take this dataset for example:

+---+----------------+
| id| country        |
+---+----------------+
|  0| russia         |
|  1| germany        |
|  2| australia      |
|  3| korea          |
|  4| germany        |
+---+----------------+

One way to do this is to encode the categorical variable as a one-hot vector, i.e. a vector where only one element is non-zero, or hot.

With one-hot encoding, a categorical feature becomes an array whose size is the number of possible values for that feature:

+---+---------------+----------------+------------------+---------------+
| id| country=russia| country=germany| country=australia|  country=korea|
+---+---------------+----------------+------------------+---------------+
|  0|              1|               0|                 0|              0|
|  1|              0|               1|                 0|              0|
|  2|              0|               0|                 1|              0|
|  3|              0|               0|                 0|              1|
|  4|              0|               1|                 0|              0|
+---+---------------+----------------+------------------+---------------+
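
To check that this encoding gives the distances asked for earlier, here is a minimal sketch using NumPy. The rows are copied from the table above; the helper name cosine_distance is an illustrative choice and not part of any library.

```python
import numpy as np

# One-hot rows from the table above.
# Column order: russia, germany, australia, korea
one_hot = np.array([
    [1, 0, 0, 0],  # id 0: russia
    [0, 1, 0, 0],  # id 1: germany
    [0, 0, 1, 0],  # id 2: australia
    [0, 0, 0, 1],  # id 3: korea
    [0, 1, 0, 0],  # id 4: germany
])


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(1.0 - similarity)


same = cosine_distance(one_hot[1], one_hot[4])       # germany vs germany
different = cosine_distance(one_hot[0], one_hot[1])  # russia vs germany
print(same, different)  # 0.0 1.0
```

Two users from the same country have identical vectors, so their distance is 0. Users from different countries have orthogonal vectors, so their distance is 1.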

One-hot encoding a column in a Pandas Dataframe

To create a dataset similar to the one used above in Pandas, we could do this:

import pandas as pd

df = pd.DataFrame({'country': ['russia', 'germany', 'australia', 'korea', 'germany']})
df

>>
     country
0     russia
1    germany
2  australia
3      korea
4    germany

Pandas provides the very useful get_dummies function, which takes a DataFrame and does what we want:

# dtype=int gives 0/1 columns; pandas 2.0+ returns booleans by default
pd.get_dummies(df, prefix=['country'], dtype=int)

>>
   country_australia  country_germany  country_korea  country_russia
0                  0                0              0               1
1                  0                1              0               0
2                  1                0              0               0
3                  0                0              1               0
4                  0                1              0               0
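
Real datasets usually have more than one column. As a rough sketch, the columns parameter of get_dummies encodes only the columns you name and leaves the others untouched. The age column and its values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [31, 25, 40, 22, 35],  # hypothetical extra feature
    'country': ['russia', 'germany', 'australia', 'korea', 'germany'],
})

# Encode only 'country'; 'age' passes through unchanged.
# The column name is used as the prefix by default.
encoded = pd.get_dummies(df, columns=['country'], dtype=int)
print(encoded.columns.tolist())
```

The untouched columns come first, followed by one new column per country. The original country column is dropped from the result.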
