Pandas Dataframe Examples: Column Operations
Last updated:- Rename column
- Apply function to column names
- Apply function to column
- Create derived column
- Number of NaNs in column
- Get column names
- Get number of columns
- Change column order
- Drop column
- Drop multiple columns
- Append new column
- Check if column exists
- Insert column at specific index
- Convert column to another type
- Convert column to date/datetime
- map example
View all examples on this jupyter notebook
Rename column
To change just a single column name:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
df.rename(columns={'name':'person_name'})
name
becomes person_name
To change multiple column names, it's the same thing, just name them all in the columns
dictionary:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
df.rename(columns={'name':'person_name','age':'age_in_years'})
name
becomes person_name
, age
becomes age_in_years
Apply function to column names
Example: Make all column names uppercase
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
# convert column NAMES to uppercase
df.columns = [col.upper() for col in df.columns]
df
to uppercase, but the data is still the same
Apply function to column
You can also use apply: Pandas Dataframes: Apply Examples
One way to apply a function to a column is to use map()
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
# convert all names to uppercase
df['name'] = df['name'].map(lambda name: name.upper())
upper()
to all names
Create derived column
It's very common to add new columns using derived data. You just need to assign to a new column:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
df['age_times_two']= df['age'] *2
df
like in numpy arrays
You can also use custom elementwise functions to help create the new column:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
df["name_uppercase"] = df['name'].map(lambda name: name.upper())
df
Number of NaNs in column
.isnull()
considers bothnp.NaN
andNone
as null values
Use .isnull()
to check which values are null/NaN and then call .sum()
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,np.nan],
'state': ['ak',np.nan,None]
})
print(df['name'].isnull().sum())
# 0
print(df['age'].isnull().sum())
# 1
print(df['state'].isnull().sum())
# 2
Get column names
Use df.columns.values
:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27],
'state': ['ak','ny','dc']
})
print(df.columns.values)
# ['age' 'name' 'state']
Get number of columns
Use len(df.columns.values)
(ignores the index column):
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27],
'state': ['ak','ny','dc']
})
print(len(df.columns.values))
# 3
Change column order
To reorder columns, just reassign the dataframe with the columns in the order you want:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27],
'state': ['ak','ny','dc']
})
# reassign the dataframe, selecting the
# columns in the order you want
df = df[['name','age','state']]
df
columns in alphabetical order
dataframe with the columns in the order you want
Drop column
To delete a single column use df.drop(columns=['column_name'])
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
df.drop(columns=['age'])
Drop multiple columns
To delete multiple columns, you can pass multiple column names to the columns argument:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
df.drop(columns=['age','name'])
only the index column is left!
Append new column
In order to add a new column to a DataFrame, create a Series and assign it as a new column:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
states = pd.Series(['dc','ca','ny'])
df['state'] = states
df
Check if column exists
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
candidate_names = ['name','gender','age']
for name in candidate_names:
if name in df.columns.values:
print('"{}" is a column name'.format(name))
# outputs:
# "name" is a column name
# "age" is a column name
Insert column at specific index
Call df.insert(<position>,<column_name>, <data>)
on the dataframe:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
column_data = pd.Series(['female','male','male'])
# starts at zero
position = 1
column_name = 'gender'
df.insert(position, column_name, column_data)
1 is the second one, etc
Convert column to another type
Use astype()
. Numpy types (such as 'uint8', 'int32', 'int64', etc) and python types (str
, int
, etc) are supported
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
print(df2['age'].dtype)
# 'int64'
df2['age'] = df2['age'].astype(str)
print(df2['age'].dtype)
# 'object'
df2['age'] = df2['age'].astype(np.uint8)
print(df2['age'].dtype)
# 'uint8'
Convert column to date/datetime
Use pd.to_datetime(string_column)
:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'date_of_birth': ['10/25/2005','10/29/2002','01/01/2001']
})
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])
now of type 'datetime' and you can
perform date arithmetic on it
If your dates are in a custom format, use format
parameter:
View all possible formats here: python strftime formats
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'date_of_birth': ['27/05/2001','16/02/1999','25/09/1998']
})[['name','date_of_birth']]
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'],format='%d/%m/%Y')
displayed in a day/month/year
format
format
parameter so that pandas knows what format your
dates are in
map example
Use .map()
to create derived columns:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie'],
'age': [25,26,27]
})
# add a new column to the dataframe
df['name_uppercase'] = df['name'].map(lambda element: element.upper())
common way to add derived columns
to a dataframe
map vs apply: time comparison
One of the most striking differences between the .map()
and .apply()
functions is that apply() can be used to employ Numpy vectorized functions.
This gives massive (more than 70x) performance gains, as can be seen in the following example:
Time comparison: create a dataframe with 10,000,000 rows and multiply a numeric column by 2
import pandas as pd
import numpy as np
# create a sample dataframe with 10,000,000 rows
df = pd.DataFrame({
'x': np.random.normal(loc=0.0, scale=1.0, size=10000000)
})
(top 5 rows shown only)
Using map function multiply 'x'
column by 2
def multiply_by_two_map(x):
return x*2
df['2x_map'] = df['x'].map(multiply_by_two)
# >>> CPU times: user 14.4 s, sys: 300 ms, total: 14.7 s
# >>> Wall time: 14.7 s
Using apply function multiply 'x'
column by 2
import numpy as np
def multiply_by_two(arr):
return np.multiply(arr,2)
# note the double square brackets around the 'x'!!
# this is because we want to use DataFrame.apply,
# not Series.apply!!
df['2x_apply'] = df[['x']].apply(multiply_by_two)
# >>> CPU times: user 80 ms, sys: 112 ms, total: 192 ms
# >>> Wall time: 188 ms
but the apply version was
70 times faster!