Using Command-line Tools for Text Data Preprocessing: Examples and Reference

Last updated: 17 Nov 2019

Table of Contents

Example: Clean text with sed and tr
Example: Sample csv file with head tail and shuf
Example: Call command-line tools from Jupyter Notebooks

WIP Alert: This is a work in progress. The current information is correct, but more content may be added in the future.

You can do most of your data preprocessing with native command-line tools, which are available on both Linux and macOS.

These tools are time-tested and naturally support stream processing: the output of one step flows into the next step as it is produced.
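
To illustrate this streaming behaviour, here is a minimal sketch. The function name `first_rows` and the `row-` prefix are made up for the example. `seq` starts producing a million numbers, but `head` exits once it has printed the requested number of lines, so the pipeline normally finishes without generating the whole sequence.

```shell
# first_rows N: print the first N lines of a long generated stream.
# (first_rows is a hypothetical helper, used only for illustration.)
first_rows() {
  seq 1 1000000 |       # producer: emits numbers one per line
    sed 's/^/row-/' |   # transformer: prefixes each line as it arrives
    head -n "$1"        # consumer: stops reading after N lines
}

first_rows 3
```

No stage here needs to see all of its input before emitting output. A tool such as `sort` is different, because it has to read everything first.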

They are fast.

Example: Clean text with sed and tr

TODO
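
A minimal sketch of the kind of cleanup this section's title suggests, assuming the goal is to lowercase text, drop punctuation and normalise whitespace. The function name `clean_text` and the specific steps are illustrative assumptions, not the author's intended example.

```shell
# clean_text: read text on stdin, write a normalised version to stdout.
# (clean_text is a hypothetical helper; adjust the steps to your data.)
clean_text() {
  tr '[:upper:]' '[:lower:]' |   # lowercase everything
    tr -d '[:punct:]' |          # delete punctuation characters
    sed -e 's/[[:space:]]\{1,\}/ /g' -e 's/^ //' -e 's/ $//'   # collapse and trim whitespace
}

printf '  Hello,   World!  \n' | clean_text
```

Because the function only reads stdin and writes stdout, it can sit in the middle of a longer pipeline.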

Example: Sample csv file with head tail and shuf

This approach can be used with large files, because the full file is not loaded into memory all at once.

Given file.csv:

"name","age"
"alice",23
"bob",33
"charlie",19
"david",26
"eugene",39
"fay",27

Sample 2 lines from the csv file, skipping the header line, and write the output to sampled-data.csv:
```
$ tail -n +2 file.csv | shuf -n 2 > sampled-data.csv
```
Get the header line from the original csv file:
```
$ head -n 1 file.csv > header.csv
```

Join the header and the sampled data into file-sampled.csv:

$ cat header.csv sampled-data.csv > file-sampled.csv

The output file file-sampled.csv now looks something like this (the sampled rows vary from run to run):

"name","age"
"fay",27
"david",26

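The three steps above can also be combined so that no intermediate files are needed. This is a sketch, not the only way to do it. The function name `sample_csv` is made up for the example, and the block recreates file.csv so it can be run on its own. On macOS, `shuf` may need to be installed separately, for example as part of GNU coreutils.

```shell
# Recreate the example input so the snippet is self-contained.
cat > file.csv <<'EOF'
"name","age"
"alice",23
"bob",33
"charlie",19
"david",26
"eugene",39
"fay",27
EOF

# sample_csv FILE N: print the header line, then N randomly chosen data lines.
# (sample_csv is a hypothetical helper, used only for illustration.)
sample_csv() {
  head -n 1 "$1"                   # header line, passed through unchanged
  tail -n +2 "$1" | shuf -n "$2"   # random sample of the remaining lines
}

sample_csv file.csv 2 > file-sampled.csv
```

The function writes to stdout, so the result can be redirected to a file as shown or piped into another tool.
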
Example: Call command-line tools from Jupyter Notebooks

TODO

Other info

Adam Drake: Command-line Tools can be 235x Faster than your Hadoop Cluster
Adam Drake: Big Data Small Machine
- In my opinion, a better title would be "Big Data - Single Machine" instead

Felipe 09 Nov 2019 17 Nov 2019 gnu macos unix linux command-line data-science
