Pandas Dataframe examples: Plotting Histograms


All code is available in this Jupyter notebook

Histogram of column values

You can also use NumPy's arange to generate the bin edges automatically: np.arange(<start>,<stop>,<step>)
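As a minimal sketch of that idea (the variable names here are illustrative, not from the original notebook), note that np.arange excludes the stop value, so the stop has to be pushed past the last edge you want to keep:

```python
import numpy as np

# np.arange excludes the stop value, so extend it by one step
# to keep the final edge (100) in the result.
start, last_edge, step = 0, 100, 20
bins = np.arange(start, last_edge + step, step)

print(bins.tolist())  # [0, 20, 40, 60, 80, 100]
```

The resulting array can be passed as the bins argument in place of a hand-written list.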

Example: Plot histogram of values in column "age"

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'name':['john','mary','peter','jeff','bill','lisa','jose'],
    'age':[23,78,22,19,45,33,20],
    'gender':['M','F','M','M','M','F','M'],
    'state':['california','dc','california','dc','california','texas','texas'],
    'num_children':[2,0,0,3,2,1,4],
    'num_pets':[5,1,0,5,2,2,3]
})

df[['age']].plot(kind='hist',bins=[0,20,40,60,80,100],rwidth=0.8)
plt.show()

Figure: Source dataframe
Figure: The most common age group is between 20 and 40 years old
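To check the bar heights without reading them off the plot, the counts can be computed directly. This is a small sketch using np.histogram, which matplotlib's hist relies on for binning; each bin includes its left edge and excludes its right edge, except the last bin, which includes both:

```python
import numpy as np

ages = [23, 78, 22, 19, 45, 33, 20]   # the 'age' column from the example
edges = [0, 20, 40, 60, 80, 100]      # same bin edges as the plot

counts, _ = np.histogram(ages, bins=edges)

# print one line per bin, e.g. "20-40: 4"
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```

The age 20 falls into the 20-40 bin rather than 0-20, which is why that bin holds four of the seven rows.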

Group large values into a single bucket

In other words, collect all values beyond a given point into a single bucket, labeled "+∞":

Example: group values larger than 100 into a separate bucket:

Figure: Sample data used. Note that there are rows where age is larger than 100!

import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# install via pip install faker
from faker import Faker
fake = Faker()

df = pd.DataFrame([{
    'name': fake.name(),
    'age': random.randint(1, 120),  # generate age from 1 to 120 (inclusive)
} for _ in range(100)])  # dataframe should have 100 rows

# define the bins
step_size     = 20
max_tick      = 100
original_bins = np.arange(0, max_tick + 1, step_size)  # 0, 20, ..., 100
# one extra bin, wide enough to hold every age above max_tick (up to 120)
new_bins      = np.append(original_bins, [max_tick + step_size + 1])
max_value     = max(original_bins)

# function to format the label: edges past max_value get the overflow label
def format_text(current_value, max_value, label):
    return label if current_value > max_value else str(int(current_value))

ax = df[['age']].plot(kind='hist', bins=new_bins, rwidth=0.8)

# place one tick on every bin edge, then relabel the last edge as "+∞";
# fixing the tick positions first keeps the labels aligned with the edges
ax.set_xticks(new_bins)
ax.set_xticklabels([format_text(x, max_value, '+∞') for x in new_bins])

plt.show()

Figure: If you naively plot the histogram with bins that stop at 100,
rows with larger ages will be missing!

Figure: Use the provided code to add an extra
bucket that holds all values that didn't fit into the original bins!
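The extra bin only catches values up to its right edge. An alternative sketch, which is not part of the original code, clips the values before binning so that anything beyond the last edge still lands in the overflow bucket, however large it is. The sample ages below are hypothetical:

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 18, 37, 64, 99, 101, 115, 150])  # hypothetical sample

step_size = 20
max_tick = 100
# edges 0, 20, ..., 100 plus one overflow edge at 120
edges = np.arange(0, max_tick + 2 * step_size, step_size)

# cap every value at the last edge so nothing falls outside the bins
clipped = ages.clip(upper=edges[-1])

counts, _ = np.histogram(clipped.to_numpy(), bins=edges)
print(counts.tolist())  # [2, 1, 0, 1, 1, 3]
```

The clipped series can then be plotted with kind='hist' and bins=edges, and the last tick relabeled "+∞" as before. Note that a value of exactly 100 would also land in the overflow bucket, because each bin includes its left edge.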
