Pandas DataFrame: GroupBy Examples

Last updated:
Table of Contents

WIP Alert This is a work in progress. Current information is correct but more content may be added in the future.

For Dataframe usage examples not related to GroupBy, see Pandas Dataframe by Example

View all examples in this post on this notebook: pandas-groupby-post

Concatenate strings in a group

This is called GROUP_CONCAT in databases such as MySQL

In the original dataframe, each row is a tag assignment.

import pandas as pd

df = pd.DataFrame({
    'user_id':[1,2,1,3,3,],
    'content_id':[1,1,2,2,2],
    'tag':['cool','nice','clever','clever','not-bad']
})

df.groupby("content_id")['tag'].apply(lambda tags: ','.join(tags))

After the operation, we have one row per content_id and all tags are joined with ','.

source-dataframe Source dataframe
    
tags-by-content All tags given to each content

Number of of unique column values per group

How many unique users have tagged each movie?

import pandas as pd

df = pd.DataFrame({
    'user_id':[1,2,1,3,3,],
    'content_id':[1,1,2,2,2],
    'tag':['cool','nice','clever','clever','not-bad']
})

df.groupby("container_id")["user_id"].nunique().to_frame()

source-dataframe Source dataframe
      
tags-by-content How many users tagged each content?

Sort groupby results

Turn the GroupBy object into a regular dataframe by calling .to_frame() and then reindex with reset_index(), then you call sort_values() as you would a normal DataFrame:

import pandas as pd

df = pd.DataFrame({
    'value':[20.45,22.89,32.12,111.22,33.22,100.00,99.99],
    'product':['table','chair','chair','mobile phone','table','mobile phone','table']
})

df1 = df.groupby('product')['value'].sum().to_frame().reset_index()

df2 = df.groupby('product')['value'].sum().to_frame().reset_index().sort_values(by='value')

original-dataframe Original Dataframe
      
dataframe-1 Total value for each product:
df1 has the default ordering

dataframe-2 Total value for each product:
df2 has been ordered by value

GroupBy Size Plot

For many more examples on how to plot data directly from Pandas see: Pandas Dataframe: Plot Examples with Matplotlib and Pyplot

If you have matplotlib installed, you can call .plot() directly on the output of methods on GroupBy objects, such as sum(), size(), etc.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'value':[20.45,22.89,32.12,111.22,33.22,100.00,99.99],
    'product':['table','chair','chair','mobile phone','table','mobile phone','table']
})

plt.clf()
df.groupby('product').size().plot(kind='bar')
plt.show()

Original dataframe Original dataframe
      
groupby-dataframe-size Pandas knows how to plot quantities

GroupBy Sum Plot

As above

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'value':[20.45,22.89,32.12,111.22,33.22,100.00,99.99],
    'product':['table','chair','chair','mobile phone','table','mobile phone','table']
})

plt.clf()
df.groupby('product').sum().plot(kind='bar')
plt.show()

Original dataframe Original dataframe
      
groupby-dataframe-size Pandas automatically sets axes and legends too

Flatten hierarchical indices created by agg

It's useful to execute multiple aggregations in a single pass using the DataFrameGroupBy.agg() method.

But the result is a dataframe with hierarchical columns, which are not very easy to work with.

You can flatten multiple aggregations on a single columns using the following procedure:

import pandas as pd

df = pd.DataFrame({
    'value':[20.45,22.89,32.12,111.22,33.22,100.00,99.99],
    'product':['table','chair','chair','mobile phone','table','mobile phone','table']
})

# call
grouped_df = df.groupby('product').agg({'value':['min','max','mean']})

At this point, join together the columns, with '_' in between and the reset the index:

grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]
grouped_df = grouped_df.reset_index()

original-dataframe Original Dataframe
      
dataframe-1 The agg() method creates a hierarchical index
dataframe-2 Much better: Join hierarchical columns together
using '_' and then call reset_index()
for a simpler index structure

Iterate over dataframe groups

Object returned by the call to groupby() can be used as an iterator:

for key,group_df in df.groupby('product'):

    # `key` contains the name of the grouped element 
    # i.e. 'chair', 'mobile phone', 'table'

    # `group_df` is a normal dataframe
    # containing only the data referring to the key

    print("the group for product '{}' has {} rows".format(key,len(group_df)))  

# >>>
# the group for product 'chair' has 2 rows
# the group for product 'mobile phone' has 2 rows
# the group for product 'table' has 3 rows

References

Dialogue & Discussion