Pandas for Large Data

Memory usage

  • Use the most memory-efficient dtypes you can, such as 'uint8', 'float32' and 'category' (rather than 'object'); see the dtype sketch after this list

    • Use df.dtypes to see what your dataframe's types look like
  • Delete objects you will no longer use (see the sketch on freeing memory after this list):

    • del my_variable deletes the reference to the object (del is a statement, not a function)
    • gc.collect() (after import gc) explicitly runs garbage collection, freeing memory that is no longer referenced.
  • Perform large, complex joins in parts to avoid exploding memory usage

    import numpy as np
    import pandas as pd

    # split second_df's row positions into roughly equal-sized chunks
    indexes = np.linspace(0, len(second_df), num=10, dtype=np.int32)

    # merge one chunk at a time, appending each result to my_df
    for i in range(len(indexes) - 1):
        # .iloc slices are positional and end-exclusive, so consecutive chunks do not overlap
        chunk = second_df.iloc[indexes[i]:indexes[i + 1], :]
        my_df = pd.concat(
            [
                my_df,  # the dataframe being built up (assumed to already exist)
                pd.merge(
                    left=pd.merge(
                        left=chunk,
                        right=third_df,
                        how='left',
                        on='foreign_key'
                    ),
                    right=fourth_df,
                    how='left',
                    on='other_foreign_key'
                )
            ]
        )
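
A minimal sketch of the dtype advice above. The dataframe and its columns ('age', 'score', 'country') are made up for illustration; the point is to measure memory before and after downcasting:

    import numpy as np
    import pandas as pd

    # hypothetical example data: pandas defaults to int64, float64 and object
    df = pd.DataFrame({
        'age': np.random.randint(0, 100, size=1_000_000),
        'score': np.random.rand(1_000_000),
        'country': np.random.choice(['br', 'us', 'de'], size=1_000_000)
    })

    print(df.dtypes)
    print(df.memory_usage(deep=True).sum())  # total bytes before downcasting

    # downcast to smaller, more specialized dtypes
    df['age'] = df['age'].astype('uint8')
    df['score'] = df['score'].astype('float32')
    df['country'] = df['country'].astype('category')

    print(df.memory_usage(deep=True).sum())  # total bytes after downcasting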
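
And a short sketch of freeing memory explicitly once an intermediate dataframe is no longer needed (intermediate_df is a placeholder name):

    import gc

    # once intermediate_df is no longer needed:
    del intermediate_df  # del is a statement: it removes the reference
    gc.collect()         # ask the garbage collector to reclaim the memory now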
    

Speed

  • Use vectorized methods instead of .map() and .apply()

    • Pandas has very efficient ways to merge, combine and do operations across columns and rows. Use those methods whenever possible, resorting to .map(), .apply() or explicit loops only when needed; see the sketch below.
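
A quick sketch of the difference; the dataframe and its 'price' and 'quantity' columns are hypothetical:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'price': np.random.rand(1_000_000) * 100,
        'quantity': np.random.randint(1, 10, size=1_000_000)
    })

    # slow: .apply() calls a Python function once per row
    df['total'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)

    # fast: vectorized column arithmetic runs in optimized native code
    df['total'] = df['price'] * df['quantity']

    # vectorized conditional instead of mapping a Python function over each value
    df['large_order'] = np.where(df['total'] > 500, 'yes', 'no')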
