Pandas Dataframe Examples: Column Operations

Last updated:
Table of Contents

View all examples on this jupyter notebook

Rename column

To change just a single column name:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.rename(columns={'name':'person_name'})

original-dataframe BEFORE: original dataframe
         
renamed-dataframe AFTER: name becomes person_name

To change multiple column names, it's the same thing, just name them all in the columns dictionary:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.rename(columns={'name':'person_name','age':'age_in_years'})

original-dataframe BEFORE: original dataframe
         
renamed-dataframe AFTER: name becomes person_name,
age becomes age_in_years

Apply function to column names

Example: Make all column names uppercase

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

# convert column NAMES to uppercase
df.columns = [col.upper() for col in df.columns]

df

original-dataframe BEFORE: original dataframe
         
alt-text-image-2 AFTER: colum names have been converted
to uppercase, but the data is still the same

Apply function to column

Use map():

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

# convert all names to uppercase
df['name'] = df['name'].map(lambda name: name.upper())

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: applied function upper() to all names

Apply function to multiple columns

Another way to think about this is that you are actually applying a function to a row

In other words, apply a single function that takes as parameters elements from 2 (or more) columns at the same time

Use df.apply(lambda row: ... , axis=1)

Example: Find out if column word is in column text:

import pandas as pd

df = pd.DataFrame({
    'text': ['foo bar','bar baz','baz quux','foo quux'],
    'word': ['foo','foo','bar','foo']
})

df['word_is_in_text'] = df[['text','word']].apply(lambda row: row['word'] in row['text'], axis=1)

source-dataframe Source dataframe
         
dataframe-with-extra-column Added a new column that's True
if "word" is contained in "text"

Create derived column

See above if you need information from more than a single column

It's very common to add new columns using derived data. You just need to assign to a new column:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df['age_times_two']= df['age'] *2
df

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: you can apply vectorized functions
like in numpy arrays

You can also use custom elementwise functions to help create the new column:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df["name_uppercase"] = df['name'].map(lambda name: name.upper())
df

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: derived column is appended to the dataframe

Number of NaNs in column

.isnull() considers both np.NaN and None and Nulls

Use .isnull() to check which values are null/NaN and then call .sum()

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,np.nan],
    'state': ['ak',np.nan,None]
})

print(df['name'].isnull().sum())
# 0

print(df['age'].isnull().sum())
# 1

print(df['state'].isnull().sum())
# 2

Column names

Use df.columns.values:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27],
    'state': ['ak','ny','dc']
})

print(df.columns.values)
# ['age' 'name' 'state']

Number of columns

Use len(df.columns.values) (ignores the index column):

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27],
    'state': ['ak','ny','dc']
})

print(len(df.columns.values))
# 3

Change column order

To reorder columns, just reassign the dataframe with the columns in the order you want:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27],
    'state': ['ak','ny','dc']
})

# reassign the dataframe, selecting the
# columns in the order you want
df = df[['name','age','state']]
df

original-dataframe BEFORE: By default, Pandas displays
columns in alphabetical order
         
reordered-dataframe AFTER: But you just need to reassign the
dataframe with the columns in the order you want

Drop column

To delete a single column use df.drop(columns=['column_name'])

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.drop(columns=['age'])

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: Dropped column age

Drop multiple columns

To delete multiple columns, you can pass multiple column names to the columns argument:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.drop(columns=['age','name'])

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: Deleted both columns,
only the index column is left!

Append new column

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

states = pd.Series(['dc','ca','ny'])

df['state'] = states
df

original-dataframe BEFORE: AFTERoriginal dataframe
         
output-dataframe AFTER: Column is added following row order

Add new column

See above: Append new column

Check if column exists

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

candidate_names = ['name','gender','age']

for name in candidate_names:
    if name in df.columns.values:
        print('"{}" is a column name'.format(name))

# outputs:
# "name" is a column name
# "age" is a column name

Insert column at specific index

Call insert(<position>,<column_name>, <data>) on the dataframe:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

column_data = pd.Series(['female','male','male'])

# starts at zero
position = 1
column_name = 'gender'

df.insert(position, column_name, column_data)

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: 0 is the first column,
1 is the second one, etc

Convert column to another type

Use astype(). Numpy types (such as 'uint8', 'int32', 'int64', etc) and python types (str, int, etc) are supported

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

print(df2['age'].dtype)
# 'int64'

df2['age'] = df2['age'].astype(str)
print(df2['age'].dtype)
# 'object'

df2['age'] = df2['age'].astype(np.uint8)
print(df2['age'].dtype)
# 'uint8'

Convert column to date/datetime

Use pd.to_datetime(string_column):

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'date_of_birth': ['10/25/2005','10/29/2002','01/01/2001']
})

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])

original-dataframe BEFORE: column is of type 'object'
         
modified-dataframe AFTER: column 'date_of_birth' is
now of type 'datetime' and you can
perform date arithmetic on it

If your dates are in a custom format, use format parameter:

View all possible formats here: python strftime formats

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'date_of_birth': ['27/05/2001','16/02/1999','25/09/1998']
})[['name','date_of_birth']]

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'],format='%d/%m/%Y')

day-first-date BEFORE: In some countries, dates are often
displayed in a day/month/year
format
         
modified-dataframe AFTER: Just pass the format parameter so that
pandas knows what format your
dates are in

map example

Use .map() to create derived columns:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

# add a new column to the dataframe
df['name_uppercase'] = df4['name'].map(lambda element: element.upper())

source-dataframe BEFORE: original dataframe
         
after-applying-map AFTER: using .map() is a very
common way to add derived columns
to a dataframe

apply example

Both Series and DataFrame have an .apply() function:

Series.apply example

Essentially the same as Series.map():

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie','david'],
    'age': [25,26,27,22],
})[['name', 'age']]

# each element of the age column is a string
# so you can call .upper() on it
df['name_uppercase'] = df['name'].apply(lambda element: element.upper())

source-dataframe BEFORE: original dataframe
         
after-applying-map AFTER: using Series.apply() gives the
exact same result as when we
used Series.map() above

Dataframe.apply example

We can also apply apply on a DataFrame.

Specifically, we can select a single column (note the double square brackets) and use apply on that filtered dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['alice','bob','charlie','david'],
    'age': [25,26,27,22],
})[['name', 'age']]

# note that the lambda function receives an array
# but also returns an array
df['age_times_2'] = df[['age']].apply(lambda arr: np.multiply(arr,2))

source-dataframe BEFORE: original dataframe
         
after-applying-apply AFTER: applied a numpy function to the
whole array of ages

map vs apply

map() apply()
Series functionSeries function
and Dataframe function
Returns new SeriesReturns new dataframe,
possibly with a single column
Can only be applied to a single
column (one element at
a time)
Can be applied to multiple
columns at the same time
Operates on array elements,
one at a time
Operates on whole columns
Very slow, no better than a
Python for loop
Much faster when you can use
numpy vectorized functions

map vs apply: time comparison

One of the most striking differences between the .map() and .apply() functions is that apply() can be used to employ Numpy vectorized functions.

This gives massive (more than 70x) performance gains, as can be seen in the following example:

Time comparison: create a dataframe with 10,000,000 rows and multiply a numeric column by 2

import pandas as pd
import numpy as np

# create a sample dataframe with 10,000,000 rows
df = pd.DataFrame({
    'x': np.random.normal(loc=0.0, scale=1.0, size=10000000)
})

sample-dataframe Sample dataframe for benchmarking
(top 5 rows shown only)

Using map function multiply 'x' column by 2

def multiply_by_two_map(x):
    return x*2

df['2x_map'] = df['x'].map(multiply_by_two)
# >>> CPU times: user 14.4 s, sys: 300 ms, total: 14.7 s
# >>> Wall time: 14.7 s

Using apply function multiply 'x' column by 2

import numpy as np

def multiply_by_two(arr):
    return np.multiply(arr,2)

# note the double square brackets around the 'x'!!
# this is because we want to use DataFrame.apply,
# not Series.apply!!
df['2x_apply'] = df[['x']].apply(multiply_by_two)
# >>> CPU times: user 80 ms, sys: 112 ms, total: 192 ms
# >>> Wall time: 188 ms

map-vs-apply-dataframe Both map and apply yielded identical results
but the apply version was
70 times faster!


See also

Dialogue & Discussion