Pandas DataFrame by Example

Last updated:
Pandas DataFrame by Example
Source
Table of Contents

Pandas is a very versatile tool for data analysis in Python and you must definitely know how to do, at the bare minimum, simple operations on it.

View this notebook for live examples of techniques seen here

Here are a couple of examples to help you quickly get productive using Pandas' main data structure: the DataFrame

Updated for version: 0.20.1

So here are some of the most common things you'll want to do with a DataFrame:

Read CSV file into DataFrame

# needed all around
import pandas as pd

# the 
df = pd.read_csv("data.csv")

sample-pandas-dataframe This is what our sample dataset looks like

To use a column in the file as the dataframe index, use index_col argument:

import pandas as pd

# note that Pandas will NOT warn you if the column you've selected
# is NOT unique!
df = pd.read_csv("data.csv",index_col='MyColumn')

Write DataFrame into CSV file

# simplest possible usage
df.to_csv("data-out.csv")

Omit the index column for a cleaner CSV file:

df.to_csv("data-out-no-index.csv", index=False)
Twitter Linkedin YC Hacker News Reddit

Select rows by position

Just use iloc. Position starts at 0.

The same rules apply as in regular python list slicing

# select the first 2 rows
df.iloc[:2]

# select the last 2 rows
df.iloc[-2:]

Select Rows by index value

Just use loc.

compare this with iloc above

# select rows up to and including the one
# with index=2 (this retrieves 3 rows)
df.loc[:2]

Select rows by column value

# people whose "age" is greater than 30
df[df["age"] > 30]

# people who have more pets than children
df[ df["num_pets"] > df[ "num_children"] ]

Select rows by multiple column values

# people older than 40 who own pets
df[ (df["age"] > 40) & (df["num_pets"] > 0) ] 

Select columns starting with

Select all columns starting with 'n':

df[[colname for colname in df.columns if colname.startswith('n')]]
Twitter Linkedin YC Hacker News Reddit

Select all columns but

Use df.drop(). Note the axis=1 parameter.

Also, the columns must be passed as a list (even if it's a single column you want to exclude from the selection).

# df itself is not modified; a copy is returned instead
df.drop(["age","num_children"],axis=1)

column-dropped A new dataframe is returned, with columns "age" and "num_children" removed.

Drop duplicated rows based on a column's value

For example, say you have a movies dataframe with "title" and "synopsis" columns and you want to drop all movies with duplicate titles:

duplicated_titles = movies_df.duplicated(subset=['title'], keep=False)

# tilde is used to to dataframe subraction!
movies_df = movies_df[~duplicated_titles]

Apply an aggregate function to every column

For every numeric column, what is the average over all rows?

Note that our resultset contains 3 rows (one for each numeric column in the original dataset).

df[["age","num_pets","num_children"]].apply(lambda row: np.mean(row),axis=0)
# >>
# age             36.666667
# num_pets         1.500000
# num_children     1.333333

Apply an aggregate function to every row

The same as above, but note that axis=1 is used.

For every row, what is the sum of all numeric columns?

# note that our resultset will have 6 rows (one for every row in the 
# original dataset)
df[["age","num_pets","num_children"]].apply(lambda row: np.sum(row),axis=1)
# >>
# 0    25
# 1    84
# 2    22
# 3    25
# 4    47
# 5    34
Twitter Linkedin YC Hacker News Reddit

Transform a column

Remember: df[['colname']] returns a new DataFrame, while df['colname'] returns a Series

For numerical data, it's straighforward:

this is requivalent to df[["age"]] * 2

# returns a new dataframe where every age 
# is double its old value
df[["age"]].apply(lambda value: value*2)

For text data (or otherwise non-numerical) you need to cast:

# returns a new dataframe where every name is 
# the old name to uppercase
df[["name"]].apply(lambda value: value.str.upper())

Another way to transform a column is to use map()

map() works on Series objects, apply() works on DataFrame objects

# convert all names to uppercase
df['name'] = df['name'].map(lambda name: name.upper())

Add a new column based on a another one

It's very common to add new columns using derived data. You just need to assign to a new column:

assign() is the immutable version of this method

# df now has an extra column at the end of the dataframe
# which amount to the sum of num_pets and num_children
# for that row

df["pets_and_children"] = df["num_pets"] + df["num_children"]

new-dataframe Derived column "pets_and_children" added to the dataframe.

You can also use custom elementwise functions to help create the new column:

# df now has an extra column at the end, which is just an uppercase
# version of the name (the original column is untouched)
df["name_uppercase"] = df[["name"]].apply(lambda name: name.str.upper())

name uppercase Added column "name_uppercase" based on elementwise application of a function to another columns.

Twitter Linkedin YC Hacker News Reddit

Shuffle the rows in a DataFrame

Method reindex() can be used to reindex your data and, if you pass random indices, you'll have shuffled your data:

# using a random permutation of the original indices
df = df.reindex(np.random.permutation(df.index))

Iterate over all rows in a DataFrame

Using for ... in. This is similar to iterating over Python dictionaries (think iteritems() or items()in Python 3):

for index,row in df.iterrows():
    print("{0} has name: {1}".format(index,row["name"]))
# >>
# 0 has name: john
# 1 has name: mary
# 2 has name: peter
# 3 has name: jeff
# 4 has name: bill
# 5 has name: lisa

Randomly sample rows from a DataFrame

Sampling is commonly used in Machine Learning tasks and many others.

# sample 4 rows from df
random_indices = np.random.choice(df.index.values, 4, replace=False)

# iloc retrieves rows by position, but the dataframe is now smaller
# so use loc instead (loc retrieves rows by their numeric indices)
sampled_df = df.loc[random_indices]

Sort DataFrame by column value

This is pretty self-explanatory:

# sort by "age" column, larger to smaller
df.sort_values("age",ascending=False)

sorted dataframe Sorted by "age", descending.

You can also use multiple columns to break ties:

# sort by column "num_pets" descending and in case there 
# are ties, use "age" ascending to sort those
df.sort_values( ["num_pets","age"], ascending=[False,True] )

sorted-by-num_pets_and age Sorted by "num_pets" (descending), then by "age" (ascending) to break ties.

Custom sort

Your boss has asked you to sort records by age, but put everybody with "N/A" for state at the end.

One way to solve this is to create a new column rank and use that in sorting:

import pandas as pd

df = pd.DataFrame({
    'name':['john','mary','peter','jeff','bill','lisa'],
    'age':[23,78,22,19,12,33],
    'state':['N/A','dc','california','texas','N/A','dc']
})

# use this function to convert a state string to 0 or 1
def state_to_rank(state):
    if state=="N/A":
        return 1
    else:
        return 0

df['rank'] = df['state'].map(lambda x: state_to_rank(x))

df.sort_values(by=['rank','age'])

original-dataframe The real world has
lots of missing data
      
sorted-dataframe-with-custom-order Ages are sorted in each group

Twitter Linkedin YC Hacker News Reddit

Select rows using lambdas

If you need something more complex that the regular df[df['somecolumn'] == 'somevalue'], you can use apply with a lambda function too:

# select rows whose name begins with the letter 'j'
df[df.apply(lambda row: row['name'].startswith('j'),axis=1)]

complex-selection Filter only rows where column "name" starts with 'j'

Split a dataframe by column value

grouped = df.groupby(df["num_pets"])

grouped.groups.keys()
# [0,4,5]

# this is a dataframe containing only data for people
# who have zero pets
df0 = grouped.get_group('0')
Twitter Linkedin YC Hacker News Reddit

Apply multiple aggregation operations on a single GroupBy pass

Say, for instance, ORDER_DATE is a timestamp column. We want to find out the total quantity QTY AND the average UNIT price per day.

Note: we're not using the sample dataframe here

grouped = df.groupby(p7.ORDER_DATE.dt.day).agg({
    "QTY":np.sum,
    "UNIT": np.mean
})

Convert a column to another dtype

Use astype().

print(my_df['my_column'].dtype)
# prints dtype('int32')

my_df['my_column'] = my_df['my_column'].astype('uint8')

print(my_df['my_column'].dtype)
# prints dtype('uint8')

Parse/Convert column to datetime

View all possible formats here: python strftime formats

Use pd.to_datetime(string_column,format='format_string'):

data_dict = {
    'name':['john','mary','peter'],
    "date_of_birth": ['27/05/2002','10/10/1999','01/04/1985']
}

# BEFORE: column 'date_of_birth' is of type string
df = pd.DataFrame(data_dict,columns=['name','date_of_birth'])

# AFTER: column 'date_of_birth' is now of type datetime64[ns]
df['date_of_birth'] = pd.to_datetime(new_df['date_of_birth'],format='%d/%m/%Y')

date-column-before BEFORE: 'date_of_birth' is a string (type 'object')
date-column-after AFTER: 'date_of_birth' is is now type 'datetime'


Verify that the dataframe includes specific values

This is done using the .isin() method, which returns a boolean dataframe to indicate where the passed values match.

# if the method is passed a simple list, it matches
# those values anywhere in the dataframe 
df.isin([2,4])

isin Elements that match the values in the original dataframe become True

# you can also pass a dict or another dataframe
# as argument
df.isin({'num_pets':[4,5]})

isin using dict When a dict is passed, columns must match the dict keys too.

Twitter Linkedin YC Hacker News Reddit

Create an empty Dataframe and append rows

Use append() with ignore_index=True.

# set column names and dtypes
new_df = pd.DataFrame(columns=['col_a','col_b']).astype({'col_a':'float32', 'col_b':'int8'})


# must reassign since the append method does not work in place
new_df = new_df.append({'col_a':5,'col_b':10}, ignore_index=True)
new_df = new_df.append({'col_a':1,'col_b':100}, ignore_index=True)

new_df

dataframe-create-empty-and-append Since ignore_index is set, indices will start at 0

Create from list of dicts

A record is a dict with column names as keys

Another way to create a dataframe is by using a list of dicts with column names as keys, using from_records().

new_df = pd.DataFrame(columns=['id','name'])

data_dict = [
    {'id':1,'name':"john"},
    {'id':2,'name':"mary"},
    {'id':3,'name':"peter"}
]

# must reassign since the append method does not work in place
new_df = new_df.from_records(data_dict)
new_df

pandas-dataframe-create-from-list-of-dicts Since we didin't specify dtypes, they are automatically inferred from the data.


See also

Dialogue & Discussion