Pandas Dataframe Examples: Duplicated Data

All code is available in this Jupyter notebook

Show duplicated rows

I.e. show rows that are duplicates of rows appearing earlier in the dataframe.

import pandas as pd

df = pd.DataFrame({
    'title': ['foo','bar','baz','baz','foo','bar'],
    'contents':[
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.',
        'Consectetur adipiscing elit.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2010,2011,2000,2005,2010,2011]
})

df[df.duplicated()]
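
Rows 4 and 5 are the later occurrences of the foo/2010 and bar/2011 rows, so the output should look roughly like this:

#   title                     contents  year
# 4   foo  Lorem ipsum dolor sit amet.  2010
# 5   bar  Lorem ipsum dolor sit amet.  2011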

[Image: source dataframe; rows 4 and 5 duplicate rows 0 and 1]

[Image: showing only the duplicated rows with df[df.duplicated()]]

Show duplicated rows including original rows

Use df[df.duplicated(keep=False)].

This shows both the duplicates and the original rows they repeat.

import pandas as pd

df = pd.DataFrame({
    'title': ['foo','bar','baz','baz','foo','bar'],
    'contents':[
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.',
        'Consectetur adipiscing elit.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2010,2011,2000,2005,2010,2011]
})

df[df.duplicated(keep=False)]
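
All four rows involved in duplication are returned, originals included:

#   title                     contents  year
# 0   foo  Lorem ipsum dolor sit amet.  2010
# 1   bar  Lorem ipsum dolor sit amet.  2011
# 4   foo  Lorem ipsum dolor sit amet.  2010
# 5   bar  Lorem ipsum dolor sit amet.  2011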

[Image: source dataframe]

[Image: using keep=False means that both the duplicates and the original rows are displayed]

List duplicates

See above: Show duplicated rows

Number of duplicated rows

In other words, the number of rows that are duplicates of other existing rows.

Use len(df[df.duplicated()])

import pandas as pd

df = pd.DataFrame({
    'title': ['foo','bar','baz','baz','foo','bar'],
    'contents':[
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.',
        'Consectetur adipiscing elit.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2010,2011,2000,2005,2010,2011]
})

len(df[df.duplicated()])
# >>> 2
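
A slightly more idiomatic way to get the same count is to sum the boolean mask directly, since each True counts as 1:

df.duplicated().sum()
# >>> 2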

Duplicate count

See above: number of duplicated rows

Drop duplicates

Use df.drop_duplicates().

By default (keep='first'), the first occurrence of each duplicated row is kept and the later ones are dropped, so the result depends on how the dataframe is sorted.

import pandas as pd

df = pd.DataFrame({
    'title': ['bar','bar','baz','baz','foo','foo'],
    'contents':[
        'Sed mollis tempor accumsan.',
        'Sed mollis tempor accumsan.',
        'Nullam et feugiat turpis, non condimentum dolor.',
        'Aenean eu aliquam nunc.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2010,2010,2005,2005,2011,2011]
})

df.drop_duplicates()
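
Row 1 (a full duplicate of row 0) and row 5 (a full duplicate of row 4) are dropped, so the output should look roughly like this:

#   title                                          contents  year
# 0   bar                       Sed mollis tempor accumsan.  2010
# 2   baz  Nullam et feugiat turpis, non condimentum dolor.  2005
# 3   baz                           Aenean eu aliquam nunc.  2005
# 4   foo                       Lorem ipsum dolor sit amet.  2011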

[Image: source dataframe; by default, rows are considered duplicates if values in all columns (except the index) are identical]

[Image: AFTER: the duplicated rows have been dropped]

Drop duplicates based on some columns only

In other words, use only a subset of columns to define which rows are duplicates.

import pandas as pd

df = pd.DataFrame({
    'title': ['bar','bar','baz','baz','foo','foo'],
    'contents':[
        'Sed mollis tempor accumsan.',
        'Sed mollis tempor accumsan.',
        'Nullam et feugiat turpis, non condimentum dolor.',
        'Aenean eu aliquam nunc.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2010,2010,2005,2005,2011,2011]
})

# only consider title and year when defining duplicates
df.drop_duplicates(subset=['title','year'])
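
Since contents is ignored, row 3 now also counts as a duplicate of row 2, leaving one row per (title, year) pair:

#   title                                          contents  year
# 0   bar                       Sed mollis tempor accumsan.  2010
# 2   baz  Nullam et feugiat turpis, non condimentum dolor.  2005
# 4   foo                       Lorem ipsum dolor sit amet.  2011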

[Image: source dataframe; if we only take columns title and year into account, these rows are duplicates]

[Image: passing only title and year as the subset parameter gives a more aggressive duplicate filter, which ends up dropping more rows]

Drop all rows that have duplicates or are duplicates

I.e. if there are two identical rows, this will drop both from the dataframe.

Use df.drop_duplicates(keep=False)

import pandas as pd

df = pd.DataFrame({
    'title': ['bar','bar','baz','baz','foo','foo'],
    'contents':[
        'Sed mollis tempor accumsan.',
        'Sed mollis tempor accumsan.',
        'Nullam et feugiat turpis, non condimentum dolor.',
        'Aenean eu aliquam nunc.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2010,2010,2005,2005,2011,2011]
})

df.drop_duplicates(keep=False)
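
The two bar rows and the two foo rows are exact copies of each other, so all four are dropped; only the two distinct baz rows survive:

#   title                                          contents  year
# 2   baz  Nullam et feugiat turpis, non condimentum dolor.  2005
# 3   baz                           Aenean eu aliquam nunc.  2005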

[Image: source dataframe; these rows either have duplicates or are duplicates themselves]

[Image: setting keep=False means that neither the original rows nor the duplicates are kept]

Mark duplicate rows with flag column

Often, it's better to mark bad rows in a dataframe than to delete them.

This makes it easier to trace what was changed between the raw data and the processed dataframe.

Use df.assign() to create a new is_duplicate column:

import pandas as pd

df = pd.DataFrame({
    'title': ['bar','bar','baz','baz','foo','foo'],
    'contents':[
        'Sed mollis tempor accumsan.',
        'Sed mollis tempor accumsan.',
        'Nullam et feugiat turpis, non condimentum dolor.',
        'Aenean eu aliquam nunc.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2010,2010,2005,2005,2011,2011]
})

# assign() does not modify the dataframe in place
df = df.assign(
    is_duplicate=lambda d: d.duplicated()
).sort_values(['title','contents','year']).reset_index(drop=True)

df
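
After sorting, each flagged row sits directly below the row it duplicates, so the output should look roughly like this:

#   title                                          contents  year  is_duplicate
# 0   bar                       Sed mollis tempor accumsan.  2010         False
# 1   bar                       Sed mollis tempor accumsan.  2010          True
# 2   baz                           Aenean eu aliquam nunc.  2005         False
# 3   baz  Nullam et feugiat turpis, non condimentum dolor.  2005         False
# 4   foo                       Lorem ipsum dolor sit amet.  2011         False
# 5   foo                       Lorem ipsum dolor sit amet.  2011          True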

[Image: source dataframe]

[Image: using a helper column aids in keeping track of data lineage; note that all columns were used to define duplicates]

Flag duplicate rows

See above: Mark duplicate rows with flag column

Arbitrary keep criterion

There's no out-of-the-box way to do this. One approach is to sort the dataframe so that, within each group of duplicates, the row you want to keep comes last, and then use drop_duplicates(keep='last').

Example: drop duplicated rows, keeping the most recent values according to column year:

import pandas as pd

df = pd.DataFrame({
    'title': ['bar','bar','baz','baz','foo','foo'],
    'contents':[
        'Sed mollis tempor accumsan.',
        'Sed mollis tempor accumsan.',
        'Nullam et feugiat turpis, non condimentum dolor.',
        'Aenean eu aliquam nunc.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2009,2019,2005,2005,2015,1995]
})

df.sort_values(
  ['title','contents','year']
).drop_duplicates(subset=['title','contents'],keep='last')
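
Within each (title, contents) group the most recent year now comes last, so keep='last' retains it: bar keeps 2019, foo keeps 2015, and the two distinct baz rows are untouched:

#   title                                          contents  year
# 1   bar                       Sed mollis tempor accumsan.  2019
# 3   baz                           Aenean eu aliquam nunc.  2005
# 2   baz  Nullam et feugiat turpis, non condimentum dolor.  2005
# 4   foo                       Lorem ipsum dolor sit amet.  2015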

[Image: source dataframe; these rows have the same title and contents, but one is newer]

[Image: only the most recent information for each duplicated group was kept]
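
If sorting the whole dataframe feels roundabout, here is a sketch of an alternative (assuming the maximum of year is what "most recent" means here): pick each group's row with idxmax().

import pandas as pd

df = pd.DataFrame({
    'title': ['bar','bar','baz','baz','foo','foo'],
    'contents':[
        'Sed mollis tempor accumsan.',
        'Sed mollis tempor accumsan.',
        'Nullam et feugiat turpis, non condimentum dolor.',
        'Aenean eu aliquam nunc.',
        'Lorem ipsum dolor sit amet.',
        'Lorem ipsum dolor sit amet.'
    ],
    'year':[2009,2019,2005,2005,2015,1995]
})

# for each (title, contents) group, idxmax() returns the index label
# of the row with the largest year; loc then selects exactly those rows
df.loc[df.groupby(['title','contents'])['year'].idxmax()]

This yields the same four rows as the sort-based approach above.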