Pandas Dataframe Examples: Column Operations

Last updated:
Table of Contents

Rename column

To change just a single column name:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.rename(columns={'name':'person_name'})

original-dataframe BEFORE: original dataframe
         
renamed-dataframe AFTER: name becomes person_name

To change multiple column names, it's the same thing, just name them all in the columns dictionary:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.rename(columns={'name':'person_name','age':'age_in_years'})

original-dataframe BEFORE: original dataframe
         
renamed-dataframe AFTER: name becomes person_name,
age becomes age_in_years

Apply function to column

Use map():

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

# convert all names to uppercase
df['name'] = df['name'].map(lambda name: name.upper())

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: applied function upper() to all names

Create derived/calculated column

It's very common to add new columns using derived data. You just need to assign to a new column:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df['age_times_two']= df['age'] *2
df

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: you can apply vectorized functions
like in numpy arrays

You can also use custom elementwise functions to help create the new column:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df["name_uppercase"] = df['name'].map(lambda name: name.upper())
df

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: derived column is appended to the dataframe

Get number of NaNs in column

.isnull() considers both np.NaN and None and Nulls

Use .isnull() to check which values are null/NaN and then call .sum()

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,np.nan],
    'state': ['ak',np.nan,None]
})

print(df['name'].isnull().sum())
# 0

print(df['age'].isnull().sum())
# 1

print(df['state'].isnull().sum())
# 2
Twitter Linkedin YC Hacker News Reddit

Get column names in Dataframe

Use df.columns.values:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27],
    'state': ['ak','ny','dc']
})

print(df.columns.values)
# ['age' 'name' 'state']

Get number of columns in Dataframe

Use len(df.columns.values) (ignores the index column):

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27],
    'state': ['ak','ny','dc']
})

print(len(df.columns.values))
# 3

Change order of columns in Dataframe

To reorder columns, just reassign the dataframe with the columns in the order you want:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27],
    'state': ['ak','ny','dc']
})

# reassign the dataframe, selecting the
# columns in the order you want
df = df[['name','age','state']]
df

original-dataframe BEFORE: By default, Pandas displays
columns in alphabetical order
         
reordered-dataframe AFTER: But you just need to reassign the
dataframe with the columns in the order you want

Delete/drop column from Dataframe

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.drop(columns=['age'],inplace=True)

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: Dropped column age

Delete/drop multiple columns

You can pass multiple column names to the columns argument:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

df.drop(columns=['age','name'],inplace=True)

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: Deleted both columns,
only the index column is left!

Add/append new column to existing Dataframe

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

states = pd.Series(['dc','ca','ny'])

df['state'] = states
df

original-dataframe BEFORE: AFTERoriginal dataframe
         
output-dataframe AFTER: Column is added following row order

Check if column exists in Dataframe

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

candidate_names = ['name','gender','age']

for name in candidate_names:
    if name in df.columns.values:
        print('"{}" is a column name'.format(name))

# outputs:
# "name" is a column name
# "age" is a column name

Insert column at specific index

Call insert(<position>,<column_name>, <data>) on the dataframe:

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

column_data = pd.Series(['female','male','male'])

# starts at zero
position = 1
column_name = 'gender'

df.insert(position, column_name, column_data)

original-dataframe BEFORE: original dataframe
         
output-dataframe * AFTER: 0 is the first column,
1 is the second one, etc*

Convert column to another type

Use astype(). Numpy types (such as 'uint8', 'int32', 'int64', etc) and python types (str, int, etc) are supported

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27]
})

print(df2['age'].dtype)
# 'int64'

df2['age'] = df2['age'].astype(str)
print(df2['age'].dtype)
# 'object'

df2['age'] = df2['age'].astype(np.uint8)
print(df2['age'].dtype)
# 'uint8'

Convert column to date/datetime

Use pd.to_datetime(string_column):

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'date_of_birth': ['10/25/2005','10/29/2002','01/01/2001']
})

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])

original-dataframe BEFORE: column is of type 'object'
         
modified-dataframe AFTER: column 'date_of_birth' is
now of type 'datetime' and you can
perform date arithmetic on it

If your dates are in a custom format, use format parameter:

View all possible formats here: python strftime formats

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'date_of_birth': ['27/05/2001','16/02/1999','25/09/1998']
})[['name','date_of_birth']]

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'],format='%d/%m/%Y')

day-first-date BEFORE: In some countries, dates are often
displayed in a day/month/year
format
         
modified-dataframe AFTER: Just pass the format parameter so that
pandas knows what format your
dates are in

apply VS map

Both .apply() and .map() apply a given function across rows and columns, so what's the difference?

  • .apply() is a Dataframe function and takes whole columns as arguments.

  • .apply() returns a new Dataframe

  • .map() is a Series (column) function, and the function is applied to each element

  • .map() return a new Series

import pandas as pd

df = pd.DataFrame({
    'name': ['alice','bob','charlie'],
    'age': [25,26,27],
    'number_of_children':[1,3,4]
})

# columns 'age' and 'number_of_children' are
# operated on at once, using numpy 
# vectorized functions
df[['age','number_of_children']].apply(lambda row: row *2)

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: Apply operates on the whole column

# apply operates on a single column, and
# processes each element at a time

df['name_upper'] = df['name'].map(lambda element: element.upper())
df

original-dataframe BEFORE: original dataframe
         
output-dataframe AFTER: map operates on each element, so you
can call any method the
object accepts.

Dialogue & Discussion