Pandas Dataframes: Apply Examples
- Apply example
- Apply example, custom function
- Take multiple columns as parameters
- Apply function to row
- Apply function to column
- Return multiple columns
- Apply function in parallel
- Vectorization and Performance
- map vs apply
WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.
Pandas version 1.0+ used.
All code available online on this jupyter notebook
Apply example
To apply a function to a dataframe column, use df['my_col'].apply(function), where the function takes one element and returns another value.
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie','david'],
'age': [25,26,27,22],
})[['name', 'age']]
# each element of the name column is a string
# so you can call .upper() on it
df['name_uppercase'] = df['name'].apply(lambda element: element.upper())
[Figure: source dataframe]
[Figure: after applying map]
Apply example, custom function
To apply a custom function to a column, you just need to define a function that takes one element and returns a new value:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie','david'],
'age': [25,26,27,22],
})
# function that takes one value, returns one value
def first_letter(input_str):
    return input_str[:1]
# pass just the function name to apply
df['first_letter'] = df['name'].apply(first_letter)
[Figure: source dataframe]
[Figure: custom function to apply]
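If the custom function needs more than one argument, the extra ones can be passed through apply. The following is a minimal sketch; first_n_letters and the new column names are hypothetical and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'bob', 'charlie', 'david'],
    'age': [25, 26, 27, 22],
})

# hypothetical helper: takes one element plus an extra parameter n
def first_n_letters(input_str, n):
    return input_str[:n]

# extra positional arguments go in the args tuple
df['first_two'] = df['name'].apply(first_n_letters, args=(2,))

# extra keyword arguments are forwarded to the function as well
df['first_three'] = df['name'].apply(first_n_letters, n=3)
```

Both forms call the function once per element; which one to use is mostly a matter of readability.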
Take multiple columns as parameters
Double square brackets return another dataframe instead of a series
To apply a single function using multiple columns, select columns using double square brackets ([[]]) and use axis=1:
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie','david'],
'age': [25,26,27,22],
})
# define a function that takes two values, returns 1 value
def concatenate(value_1, value_2):
    return str(value_1) + "--" + str(value_2)
# note the use of DOUBLE SQUARE BRACKETS!
df['concatenated'] = df[['name','age']].apply(
    lambda row: concatenate(row['name'], row['age']), axis=1)
[Figure: source dataframe]
[Figure: dataframe with new concatenated column (takes two columns and concatenates them as strings)]
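To see why the double square brackets matter, it can help to inspect what each selection returns and what the function receives when axis=1 is used. This is a small sketch; the variable names are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'bob', 'charlie', 'david'],
    'age': [25, 26, 27, 22],
})

single = df['name']     # one pair of brackets: a Series
double = df[['name']]   # two pairs of brackets: a dataframe with one column

print(type(single).__name__)
print(type(double).__name__)

# with axis=1 the function receives one row at a time as a Series,
# whose index holds the selected column labels
row_labels = df[['name', 'age']].apply(lambda row: '|'.join(row.index), axis=1)
print(row_labels.iloc[0])
```

Because each row arrives as a Series indexed by column label, the lambda in the example above can look up row['name'] and row['age'] directly.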
Apply function to row
To apply a function to a full row instead of a column, use axis=1 and call apply on the dataframe itself:
Example: Sum all values in each row:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'value1': [1,2,3,4,5],
'value2': [5,4,3,2,1],
'value3': [10,20,30,40,50],
'value4': [99,99,99,99,np.nan],
})
# takes a whole row (a Series) and returns a single value
def sum_all(row):
    return np.sum(row)
# note that apply was called on the dataframe itself, not on columns
df['sum_all'] = df.apply(lambda row: sum_all(row), axis=1)
[Figure: source dataframe with observations (observations for one sample)]
[Figure: dataframe with new column based on row application (values in the row, with numpy.sum)]
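For a simple reduction like this one, the row-wise apply can be compared against the dataframe's own sum method. The sketch below assumes the same data as the example above. In both cases the missing value in the last row should be skipped, since np.sum on a pandas Series defers to the Series' own sum.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'value1': [1, 2, 3, 4, 5],
    'value2': [5, 4, 3, 2, 1],
    'value3': [10, 20, 30, 40, 50],
    'value4': [99, 99, 99, 99, np.nan],
})

# row-wise apply: the function is called once per row
row_sums_apply = df.apply(lambda row: np.sum(row), axis=1)

# built-in alternative: one call for the whole dataframe
row_sums_builtin = df.sum(axis=1)

print(row_sums_apply.equals(row_sums_builtin))
```

When a built-in method exists for the operation, it is usually the simpler choice; apply with axis=1 is more useful when the per-row logic has no built-in equivalent.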
Apply function to column
Just use apply on the column, as shown in the Apply example section above.
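A related case is a function that needs the whole column at once rather than one element at a time. A minimal sketch, assuming a hypothetical value_range function: calling apply on the dataframe with the default axis=0 passes each column to the function as a Series.

```python
import pandas as pd

df = pd.DataFrame({
    'value1': [1, 2, 3, 4, 5],
    'value2': [5, 4, 3, 2, 1],
    'value3': [10, 20, 30, 40, 50],
})

# hypothetical function: takes a whole column (a Series), returns one value
def value_range(column):
    return column.max() - column.min()

# axis=0 is the default, so each column is passed to the function in turn
ranges = df.apply(value_range)
print(ranges)
```

The result is a Series with one entry per column, indexed by column name.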
Return multiple columns
To apply a function to a column and return multiple values so that you can create multiple columns, return a pd.Series with the values instead:
Example: produce two values from a function and assign to two columns
import numpy as np
import pandas as pd
df = pd.DataFrame({
'name': ['alice','bob','charlie','david','edward'],
'age': [25,26,27,22,np.nan],
})
# takes one value, returns a Series with two values
def times_two_times_three(value):
    value_times_2 = value*2
    value_times_3 = value*3
    return pd.Series([value_times_2,value_times_3])
# note that apply was called on age column
df[['times_2','times_3']] = df['age'].apply(times_two_times_three)
[Figure: source dataframe with columns]
[Figure: dataframe with two new columns, both returned by apply]
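When the function works on rows rather than on a single column, another option is to return a plain list and ask apply to expand it into columns with result_type='expand'. This is an alternative sketch, not the approach used above; the function name is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'bob', 'charlie', 'david'],
    'age': [25, 26, 27, 22],
})

# hypothetical row-level function: returns two values as a list
def age_times_two_times_three(row):
    return [row['age'] * 2, row['age'] * 3]

# result_type='expand' turns each returned list into separate columns
df[['times_2', 'times_3']] = df.apply(
    age_times_two_times_three, axis=1, result_type='expand')
```

The expanded result has one column per list element, and they are assigned to the new column names in order.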
Apply function in parallel
If you need to perform costly operations on a dataframe (e.g. text preprocessing), you can split the work across multiple cores to decrease the running time:
import multiprocessing
import numpy as np
import pandas as pd
# how many cores do you have?
NUM_CORES=8
# replace load_large_dataframe() with your dataframe
df = load_large_dataframe()
# split the dataframe into chunks, depending on how many cores you have
df_chunks = np.array_split(df, NUM_CORES)
# this is a function that takes one dataframe chunk and returns
# the processed chunk (for example, adding processed columns)
def process_df(input_df):
    # copy the dataframe to prevent mutation in place
    output_df = input_df.copy()
    # apply a function to every row *in this chunk*
    # (replace some_function with your own row-level function)
    output_df['new_column'] = output_df.apply(some_function, axis=1)
    return output_df
with multiprocessing.Pool(NUM_CORES) as pool:
    # process each chunk in a separate core and merge the results
    full_output_df = pd.concat(pool.map(process_df, df_chunks), ignore_index=True)
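The split, process and merge steps can be checked on a small dataframe without starting any worker processes. The sketch below is sequential on purpose: the list comprehension stands in for pool.map, and the data, the chunk count and the age_next_year column are made up for illustration. It splits row positions and selects them with iloc, which is one possible alternative to calling np.array_split on the dataframe directly.

```python
import numpy as np
import pandas as pd

NUM_CHUNKS = 3  # stands in for NUM_CORES

df = pd.DataFrame({
    'name': ['alice', 'bob', 'charlie', 'david', 'edward'],
    'age': [25, 26, 27, 22, 30],
})

# split the row positions into chunks, then select each chunk with iloc
position_chunks = np.array_split(np.arange(len(df)), NUM_CHUNKS)
df_chunks = [df.iloc[positions] for positions in position_chunks]

def process_df(input_df):
    # copy the chunk so the original dataframe is not mutated
    output_df = input_df.copy()
    # hypothetical row-level operation, applied to every row in this chunk
    output_df['age_next_year'] = output_df.apply(
        lambda row: row['age'] + 1, axis=1)
    return output_df

# sequential stand-in for pool.map(process_df, df_chunks)
chunked_result = pd.concat(
    [process_df(chunk) for chunk in df_chunks], ignore_index=True)

# processing the whole dataframe at once should give the same result
whole_result = process_df(df)
print(chunked_result.equals(whole_result))
```

Once the chunked result matches the unchunked one, the list comprehension can be swapped for the pool. On platforms where worker processes are started by spawning, the pool code generally needs to sit under an if __name__ == '__main__': guard.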
Vectorization and Performance
TODO
map vs apply
map() | apply() |
---|---|
Series function | Series function and Dataframe function |
Returns new Series | Returns a new Series or dataframe, depending on the caller and on what the function returns |
Can only be applied to a single column (one element at a time) | Can be applied to multiple columns at the same time |
Operates on array elements, one at a time | Operates on whole columns or rows |
Calls a Python function once per element, so it is roughly as slow as a Python for loop | Can be much faster when the function uses numpy vectorized operations on whole columns or rows |
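The rows of the table can be illustrated with a short sketch. The data and variable names are made up for illustration, and the dict passed to map is just one of the input types that Series.map accepts.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['alice', 'bob', 'charlie', 'david'],
    'age': [25, 26, 27, 22],
})

# on a single column, map and apply give the same result for a plain function
upper_map = df['name'].map(lambda s: s.upper())
upper_apply = df['name'].apply(lambda s: s.upper())

# map also accepts a dict; values missing from the dict become NaN
initials = df['name'].map({'alice': 'A', 'bob': 'B'})

# apply on a dataframe hands whole columns (or rows) to the function
column_max = df[['age']].apply(lambda col: col.max())

print(upper_map.equals(upper_apply))
print(initials)
print(column_max)
```

For element-wise work on one column the two are largely interchangeable; apply becomes necessary once the function needs a whole row or a whole column.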