Pandas Dataframe Examples: Duplicated Data
- Show duplicated rows
- Show duplicated rows including original rows
- List duplicates
- Number of duplicated rows
- Duplicate count
- Drop duplicates
- Drop duplicates based on some columns only
- Drop all rows that have duplicates or are duplicates
- Mark duplicate rows with flag column
- Flag duplicate rows
- Arbitrary keep criterion
All of the code is available in this Jupyter notebook.
Show duplicated rows
I.e. show rows that are duplicates of existing rows.
import pandas as pd
df = pd.DataFrame({
'title': ['foo','bar','baz','baz','foo','bar'],
'contents':[
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.',
'Consectetur adipiscing elit.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2010,2011,2000,2005,2010,2011]
})
df[df.duplicated()]
With the default keep='first', df.duplicated() only flags the later occurrences: the first occurrence of each duplicated row is not shown.
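For reference, the output should look roughly like this: rows 4 and 5 are flagged because they repeat rows 0 and 1.
  title                     contents  year
4   foo  Lorem ipsum dolor sit amet.  2010
5   bar  Lorem ipsum dolor sit amet.  2011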
Show duplicated rows including original rows
Use df[df.duplicated(keep=False)]. This shows both the duplicated and the original rows.
import pandas as pd
df = pd.DataFrame({
'title': ['foo','bar','baz','baz','foo','bar'],
'contents':[
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.',
'Consectetur adipiscing elit.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2010,2011,2000,2005,2010,2011]
})
df[df.duplicated(keep=False)]
keep=False means that both the duplicates and the original rows are displayed.
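Roughly the expected output: both the originals (rows 0 and 1) and their duplicates (rows 4 and 5) appear.
  title                     contents  year
0   foo  Lorem ipsum dolor sit amet.  2010
1   bar  Lorem ipsum dolor sit amet.  2011
4   foo  Lorem ipsum dolor sit amet.  2010
5   bar  Lorem ipsum dolor sit amet.  2011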
List duplicates
See above: Show duplicated rows
Number of duplicated rows
In other words, the number of rows that are duplicates of other existing rows.
Use len(df[df.duplicated()])
import pandas as pd
df = pd.DataFrame({
'title': ['foo','bar','baz','baz','foo','bar'],
'contents':[
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.',
'Consectetur adipiscing elit.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2010,2011,2000,2005,2010,2011]
})
len(df[df.duplicated()])
# >>> 2
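An equivalent and arguably more idiomatic one-liner sums the boolean mask directly (True counts as 1), with no intermediate dataframe:
df.duplicated().sum()
# >>> 2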
Duplicate count
See above: number of duplicated rows
Drop duplicates
Results will depend on how the dataframe is sorted.
Use df.drop_duplicates(). The first occurrence of each duplicated row will be kept and the later duplicates will be dropped.
import pandas as pd
df = pd.DataFrame({
'title': ['bar','bar','baz','baz','foo','foo'],
'contents':[
'Sed mollis tempor accumsan.',
'Sed mollis tempor accumsan.',
'Nullam et feugiat turpis, non condimentum dolor.',
'Aenean eu aliquam nunc.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2010,2010,2005,2005,2011,2011]
})
df.drop_duplicates()
Rows count as duplicates if the values in all columns (except the index) are identical.
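Roughly the expected output: rows 1 and 5 are dropped because they repeat rows 0 and 4.
  title                                          contents  year
0   bar                       Sed mollis tempor accumsan.  2010
2   baz  Nullam et feugiat turpis, non condimentum dolor.  2005
3   baz                           Aenean eu aliquam nunc.  2005
4   foo                       Lorem ipsum dolor sit amet.  2011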
Drop duplicates based on some columns only
In other words, use only a subset of columns to define which rows are duplicates.
import pandas as pd
df = pd.DataFrame({
'title': ['bar','bar','baz','baz','foo','foo'],
'contents':[
'Sed mollis tempor accumsan.',
'Sed mollis tempor accumsan.',
'Nullam et feugiat turpis, non condimentum dolor.',
'Aenean eu aliquam nunc.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2010,2010,2005,2005,2011,2011]
})
# only consider title and year when defining duplicates
df.drop_duplicates(subset=['title','year'])
If two rows have the same values in columns title and year, they are treated as duplicates, even if their contents differ. By passing only title and year to the subset parameter, we get a more aggressive duplicate filter, which ends up dropping more rows.
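Roughly the expected output: rows 1, 3 and 5 are dropped, since each matches an earlier row on title and year (note that row 3 has different contents but is dropped anyway).
  title                                          contents  year
0   bar                       Sed mollis tempor accumsan.  2010
2   baz  Nullam et feugiat turpis, non condimentum dolor.  2005
4   foo                       Lorem ipsum dolor sit amet.  2011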
Drop all rows that have duplicates or are duplicates
I.e. if there are two identical rows, this will drop both from the dataframe.
Use df.drop_duplicates(keep=False)
import pandas as pd
df = pd.DataFrame({
'title': ['bar','bar','baz','baz','foo','foo'],
'contents':[
'Sed mollis tempor accumsan.',
'Sed mollis tempor accumsan.',
'Nullam et feugiat turpis, non condimentum dolor.',
'Aenean eu aliquam nunc.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2010,2010,2005,2005,2011,2011]
})
df.drop_duplicates(keep=False)
This drops rows that have duplicates or ARE duplicates themselves: keep=False keeps neither the original rows nor the duplicates.
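Roughly the expected output: only the two baz rows survive, because they are the only rows with no exact duplicate.
  title                                          contents  year
2   baz  Nullam et feugiat turpis, non condimentum dolor.  2005
3   baz                           Aenean eu aliquam nunc.  2005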
Mark duplicate rows with flag column
Often, it's better to mark bad rows in a dataframe instead of deleting them.
This helps keep track of the modifications applied between the raw data and the final, processed dataframe. Use df.assign() to create a new is_duplicate column:
import pandas as pd
df = pd.DataFrame({
'title': ['bar','bar','baz','baz','foo','foo'],
'contents':[
'Sed mollis tempor accumsan.',
'Sed mollis tempor accumsan.',
'Nullam et feugiat turpis, non condimentum dolor.',
'Aenean eu aliquam nunc.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2010,2010,2005,2005,2011,2011]
})
# assign doesn't happen in place; reassign the result
df = df.assign(
    is_duplicate=lambda d: d.duplicated()
).sort_values(['title', 'contents', 'year']).reset_index(drop=True)
df
Flagging instead of dropping makes it easier to keep track of data lineage. Note that all columns were used to define duplicates here.
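df.duplicated() also accepts the subset parameter, so you can flag duplicates based on some columns only. A minimal sketch (the choice of title and year here is just for illustration):
df = df.assign(
    is_duplicate=lambda d: d.duplicated(subset=['title', 'year'])
)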
Flag duplicate rows
See above: Mark duplicate rows with flag column
Arbitrary keep criterion
There's no out-of-the-box way to do this, so one approach is to sort the dataframe so that the correct value for each duplicate group comes last, and then use drop_duplicates(keep='last').
Example: drop duplicated rows, keeping the values that are more recent according to column year:
import pandas as pd
df = pd.DataFrame({
'title': ['bar','bar','baz','baz','foo','foo'],
'contents':[
'Sed mollis tempor accumsan.',
'Sed mollis tempor accumsan.',
'Nullam et feugiat turpis, non condimentum dolor.',
'Aenean eu aliquam nunc.',
'Lorem ipsum dolor sit amet.',
'Lorem ipsum dolor sit amet.'
],
'year':[2009,2019,2005,2005,2015,1995]
})
df.sort_values(
['title','contents','year']
).drop_duplicates(subset=['title','contents'],keep='last')
Rows with the same title and contents are duplicates, but one is newer than the other; after sorting by year, only the most recent row for each duplicated group is kept.
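If the keep criterion can be expressed as a max (or min) per group, an alternative sketch that avoids the explicit sort uses groupby plus idxmax (assuming year has no missing values):
# pick, for each (title, contents) group, the row whose year is largest
df.loc[df.groupby(['title', 'contents'])['year'].idxmax()]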