Pandas Dataframe examples: Plotting Histograms


All code is available in this Jupyter notebook

Histogram of column values

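Pass bins to control the bucket edges. You can use NumPy's arange to create evenly spaced bins automatically: np.arange(<start>, <stop>, <step>). Note that <stop> itself is excluded from the result.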
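As a minimal sketch of how np.arange produces bin edges (the start, stop and step values here are assumptions chosen for illustration, not taken from the original notebook):

```python
import numpy as np

# Hypothetical range for illustration: ages from 0 to 100 in steps of 20.
start, stop, step = 0, 100, 20

# np.arange excludes its stop argument, so extend it by one step
# to get a closing edge at 100.
bins = np.arange(start, stop + step, step)

print(bins.tolist())  # [0, 20, 40, 60, 80, 100]
```

Six edges define five buckets. The resulting array can be passed as bins= to a pandas histogram call such as df[['age']].plot(kind='hist', bins=bins).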

Example: Plot histogram of values in column "age"

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({


Figure: Source dataframe
Figure: Age by bins. The most common age group is between 20 and 40 years old

Group large values

Values above a given threshold can be collected into a single bucket, labeled "+∞", so that they are not dropped from the plot.

Example: group values larger than 100 into a separate bucket:

Figure: Sample data used. Note that there are rows where age is larger than 100!

import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# sample data: 100 rows with a random age from 1 to 120
df = pd.DataFrame([{
    'age': random.randint(1, 120),
} for _ in range(100)])

# define the bins
step_size     = 20
max_tick      = 100
original_bins = np.arange(0, max_tick + 1, step_size)  # 0, 20, ..., 100
# extra edge at 121 so that every age above 100 falls into one last bucket
new_bins      = np.append(original_bins, [max_tick + step_size + 1])
max_value     = max(original_bins)

# function to format the label: ticks past the last regular edge show the label
def format_text(current_value, max_value, label):
    return label if current_value > max_value else int(current_value)

ax = df[['age']].plot(kind='hist', bins=new_bins, rwidth=0.8)

# fix the tick positions at the bin edges before relabeling them
ax.set_xticks(new_bins)
ax.set_xticklabels([format_text(x, max_value, '+∞') for x in new_bins])

plt.show()

Figure: Naive plot. If you naively plot the histogram with only the original bins from 0 to 100, some of the data will be missing!
Figure: Plot after truncating values. Values larger than 100 are grouped into the new bucket
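To check the effect numerically rather than by eye, the sketch below counts values per bucket with np.histogram. It uses a small hypothetical list of ages (not the random sample above) and swaps in a different technique from the example: Series.clip, which caps any value above the last edge so that it still lands in the overflow bucket.

```python
import numpy as np
import pandas as pd

# Hypothetical ages for illustration; three of them exceed 100.
df = pd.DataFrame({'age': [5, 25, 45, 65, 85, 101, 110, 150]})

bins = np.arange(0, 101, 20)  # edges: 0, 20, 40, 60, 80, 100

# Naive binning: values above the last edge are not counted at all.
naive_counts, _ = np.histogram(df['age'], bins=bins)

# Add one overflow edge at 120, then cap larger values at that edge
# so they fall into the last bucket instead of being dropped.
overflow_bins = np.append(bins, bins[-1] + 20)
clipped = df['age'].clip(upper=overflow_bins[-1])
grouped_counts, _ = np.histogram(clipped, bins=overflow_bins)

print(naive_counts.sum())       # 5
print(grouped_counts.tolist())  # [1, 1, 1, 1, 1, 3]
```

The naive histogram accounts for only 5 of the 8 rows, while the clipped version accounts for all of them. Clipping is mostly useful when the largest value is not known in advance; if it is known, widening the last bin edge as in the example above has the same effect.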
