Pandas Query Examples: SQL-like queries in dataframes


Pandas v1.x used. See all examples in this Jupyter notebook

Why use query

Because it enables you to create views and filters in place, without intermediate variables.

It is also faster than plain Python for numerical operations.1

import pandas as pd
import numpy as np

# using filters needs two steps
# one to assign the dataframe to a variable
df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK',np.nan]
})

# another one to perform the filter
df[df['country']=='USA']

But with query you can define the dataframe and filter it in a single step, with no intermediate variables holding on to memory:

# this is equivalent to the code above
# and uses no intermediate variables
pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK',np.nan]
}).query("country == 'USA'")
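Because query returns a new dataframe, calls can also be chained to build up a filter step by step. A minimal sketch, reusing the columns from the examples in this post:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country': ['USA', 'UK', np.nan],
    'age': [23, 45, 45]
})

# each query returns a new dataframe, so they can be chained
result = df.query('age > 30').query("country == 'UK'")
print(result['name'].tolist())  # ['david']
```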

Query with python variable

Use @variable_name:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK',np.nan],
    'age':[23,45,45]
})

target_age = 45

df.query('age == @target_age')
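query also accepts inplace=True, which modifies the dataframe itself instead of returning a filtered copy. A sketch using the same dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'age': [23, 45, 45]
})

target_age = 45

# inplace=True filters the dataframe itself and returns None
df.query('age == @target_age', inplace=True)
print(df['name'].tolist())  # ['david', 'anna']
```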

OR operator

Just use or. Don't forget parentheses.

import pandas as pd

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK', 'USA'],
    'age':[23,45,45]
})

df.query("(name=='john') or (country=='UK')")
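For comparison, the same filter written with plain boolean indexing needs the `|` operator and explicit parentheses around each condition:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country': ['USA', 'UK', 'USA'],
    'age': [23, 45, 45]
})

# boolean-mask equivalent of the query above
result = df[(df['name'] == 'john') | (df['country'] == 'UK')]
print(result['name'].tolist())  # ['john', 'david']
```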

[Before: source dataframe. After: only rows where name is 'john' or country is 'UK']

AND operator

Just use and. Don't forget parentheses.

import pandas as pd

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK', 'USA'],
    'age':[23,45,45]
})

df.query("(country=='USA') and (age==23)")
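query also understands not, so a compound condition can be negated directly. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country': ['USA', 'UK', 'USA'],
    'age': [23, 45, 45]
})

# negate the compound condition with `not`
result = df.query("not (country == 'USA' and age == 23)")
print(result['name'].tolist())  # ['david', 'anna']
```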

[Before: source dataframe with all rows. After: only one row has country 'USA' and age 23]

Column in array

Put the values in a Python list and use in @myvar:

import pandas as pd

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK', 'USA'],
    'age':[23,45,45]
})

names_array = ['john','anna']

df.query('name in @names_array')
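The list can also be written inline in the query string, without a separate Python variable:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country': ['USA', 'UK', 'USA'],
    'age': [23, 45, 45]
})

# the list literal can live inside the query string itself
result = df.query("name in ['john', 'anna']")
print(result['name'].tolist())  # ['john', 'anna']
```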

[Before: source dataframe. After: rows where name in ['john', 'anna']]

Column not in array

Same as above: put the values in a Python list and use not in @myvar:

import pandas as pd

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK', 'USA'],
    'age':[23,45,45]
})

invalid_array = ['anna']

df.query('name not in @invalid_array')
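The equivalent without query uses Series.isin together with the ~ (negation) operator:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country': ['USA', 'UK', 'USA'],
    'age': [23, 45, 45]
})

invalid_array = ['anna']

# boolean-mask equivalent of "name not in @invalid_array"
result = df[~df['name'].isin(invalid_array)]
print(result['name'].tolist())  # ['john', 'david']
```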

[Before: source dataframe with all rows. After: rows where name not in ['anna']]

Column name with spaces

Wrap the column name in backticks (`). Backtick quoting requires pandas 0.25 or newer.

import pandas as pd

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country of birth':['USA','UK', 'USA'],
    'age':[23,45,45]
})

df.query('`country of birth` == "UK"')
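Backticks compose with the other features shown above, for example a Python variable referenced via @ (a sketch on the same data):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country of birth': ['USA', 'UK', 'USA'],
    'age': [23, 45, 45]
})

target = 'USA'

# backtick-quoted column name combined with an @variable
result = df.query('`country of birth` == @target')
print(result['name'].tolist())  # ['john', 'anna']
```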

[Before: source dataframe; one of the column names has spaces in it. After: rows where country of birth equals 'UK']

Where column is null

Use .isnull()

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK',np.nan]
})

df.query('country.isnull()')
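If the default numexpr engine complains about the .isnull() method call (behavior varies across pandas versions), you can force the slower pure-Python engine with the engine parameter:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country': ['USA', 'UK', np.nan]
})

# engine='python' evaluates the expression with plain Python,
# which always supports method calls like .isnull()
result = df.query('country.isnull()', engine='python')
print(result['name'].tolist())  # ['anna']
```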

[Before: original dataframe. After: equivalent to "where country is NULL"]

Where column is NOT null

Use .notnull()

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name':['john','david','anna'],
    'country':['USA','UK',np.nan]
})

df.query('country.notnull()')
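A well-known alternative trick relies on NaN comparing unequal to itself: country == country is only True for non-null values, so it selects the same rows without a method call:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['john', 'david', 'anna'],
    'country': ['USA', 'UK', np.nan]
})

# NaN != NaN, so this keeps only rows where country is not null
result = df.query('country == country')
print(result['name'].tolist())  # ['john', 'david']
```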

[Before: original dataframe. After: equivalent to "where country is NOT NULL"]

Column in python array

See above: Column in array

Column not in python array

See above: Column not in array


1 It uses numexpr under the hood: https://github.com/pydata/numexpr
