Using Command-line Tools for Text Data Preprocessing: Examples and Reference


WIP Alert: This is a work in progress. The current information is correct, but more content may be added in the future.

You can do most of your data preprocessing with standard command-line tools that are available on both Linux and macOS systems.

These tools are time-tested and naturally support stream processing: the output of one step flows into the next step as it is produced.

They are fast.
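
As a rough illustration of the streaming behaviour described above, here is a minimal sketch. The input lines are made up for the example: tr lowercases each line as it arrives and head stops the pipeline after the first two lines.

```shell
# Three stages connected by pipes; each one reads the previous stage's output.
# The names fed in by printf are illustrative sample data only.
result=$(printf 'Alice\nBOB\nCharlie\n' | tr '[:upper:]' '[:lower:]' | head -n 2)

# Show what came out of the pipeline.
printf '%s\n' "$result"
```

No intermediate file is written at any point; the data only ever lives in the pipes between the commands.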

Example: Clean text with sed and tr

TODO

Example: Sample a CSV file with head, tail, and shuf

This approach works with large files because the data is streamed through the tools rather than loaded into memory all at once.

Given file.csv:

"name","age"
"alice",23
"bob",33
"charlie",19
"david",26
"eugene",39
"fay",27
  • Sample 2 lines from the CSV file, skipping the header line, and write the output to sampled-data.csv:

    $ tail -n +2 file.csv | shuf -n 2 > sampled-data.csv
    
  • Get the header line from the original CSV file:

    $ head -n 1 file.csv > header.csv
    
  • Join the header and the sampled data into file-sampled.csv:

    $ cat header.csv sampled-data.csv > file-sampled.csv
    

The output file file-sampled.csv now looks something like this (the sampled rows vary between runs, since shuf picks them at random):

"name","age"
"fay",27
"david",26
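
The three steps above can also be combined so that no intermediate header.csv or sampled-data.csv is needed. The following is a minimal sketch, not the only way to do it. It assumes a shuf command is available (it is part of GNU coreutils and may need to be installed separately on macOS), and it uses a scratch directory created with mktemp, which is an addition for the sake of the example.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Example input modeled on file.csv above.
cat > file.csv <<'EOF'
"name","age"
"alice",23
"bob",33
"charlie",19
"david",26
"eugene",39
"fay",27
EOF

# Group two commands so their combined output goes to one file:
# first the header line, then 2 randomly chosen data rows.
{
  head -n 1 file.csv
  tail -n +2 file.csv | shuf -n 2
} > file-sampled.csv

cat file-sampled.csv
```

The braces group the two commands so that a single redirection captures both outputs in order, header first.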

Example: Call command-line tools from Jupyter Notebooks

TODO
