# Pandas Dataframe examples: Plotting Histograms

All code available on this jupyter notebook

## Histogram of column values

You can pass the bin edges as an explicit list, or use numpy arange to generate them: np.arange(<start>,<stop>,<step>). Note that <stop> itself is excluded from the result.
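As a quick illustration of generating bin edges with np.arange, here is a minimal sketch. The variable names `edges_without_stop` and `edges_with_stop` are made up for this example; the point is that the stop value is exclusive, so it has to be pushed past the last edge you want.

```python
import numpy as np

# stop=100 is excluded, so the last edge is 80
edges_without_stop = np.arange(0, 100, 20)
print(edges_without_stop.tolist())  # [0, 20, 40, 60, 80]

# use a stop just past 100 to get 100 as the final edge
edges_with_stop = np.arange(0, 101, 20)
print(edges_with_stop.tolist())  # [0, 20, 40, 60, 80, 100]
```

The second array is equivalent to writing bins=[0,20,40,60,80,100] by hand.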

Example: Plot histogram of values in column "age"

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
    'age': [23, 78, 22, 19, 45, 33, 20],
    'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
    'state': ['california', 'dc', 'california', 'dc', 'california', 'texas', 'texas'],
    'num_children': [2, 0, 0, 3, 2, 1, 4],
    'num_pets': [5, 1, 0, 5, 2, 2, 3]
})

df[['age']].plot(kind='hist',bins=[0,20,40,60,80,100],rwidth=0.8)
plt.show()

Figure: Source dataframe

Figure: The most common age group is between 20 and 40 years old
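The reading of the plot can be checked numerically. The sketch below counts the same ages with numpy.histogram, which uses the same bin convention as the plot: each bin includes its left edge and excludes its right edge, except the last bin, which includes both. The names `ages`, `edges` and `counts` are chosen for this example only.

```python
import numpy as np

ages = [23, 78, 22, 19, 45, 33, 20]
edges = [0, 20, 40, 60, 80, 100]

# counts[i] is the number of ages falling between edges[i] and edges[i+1]
counts, _ = np.histogram(ages, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

The age 20 sits exactly on an edge and is counted in the 20-40 bin, not in 0-20, which is why that bin holds four of the seven rows.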

## Group large values into a single bucket

In other words, collect all values past a given point into a single bucket, labeled "+∞":

Example: group values larger than 100 into a separate bar. Note that the sample data used here contains rows where age is larger than 100!

import pandas as pd
import random
import matplotlib.pyplot as plt
import numpy as np

# install via pip install faker
from faker import Faker
fake = Faker()

df = pd.DataFrame([{
    'name': fake.name(),
    'age': random.randint(1, 120),  # generate age from 1 to 120
} for i in range(0, 100)])  # dataframe should have 100 rows

# define the bins
step_size          = 20
max_tick           = 100
original_bins      = np.arange(0, max_tick+1, step_size)
new_bins           = np.append(original_bins, [max_tick+step_size+1])
max_value          = max(original_bins)

# function to format the label: edges past the last original bin get the overflow label
def format_text(current_value, max_value, label):
    return label if current_value > max_value else str(current_value)

df[['age']].plot(kind='hist', bins=new_bins, rwidth=0.8)

# put one tick on every bin edge before relabeling, so ticks and labels stay aligned
plt.gca().set_xticks(new_bins)
plt.gca().set_xticklabels([format_text(x, max_value, '+∞') for x in new_bins])

plt.show()

Figure: If you naively plot the histogram with bins that stop at 100, the values above 100 will be missing!

Figure: Use the provided code to add an extra bar that holds all values that didn't fit into the original bins.