Using Command-line Tools for Text Data Preprocessing: Examples and Reference
- Example: Clean text with sed and tr
- Example: Sample a CSV file with head, tail, and shuf
- Example: Call command-line tools from Jupyter Notebooks
WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.
You can do most of your text data preprocessing with native command-line tools, available on both Linux and macOS systems.
These tools are time-tested and fast, and they naturally support stream processing: the output of one step flows as input to the next step as it is produced.
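As a small sketch of such a pipeline (corpus.txt is a hypothetical input file), the following counts the ten most frequent words in a text file; every stage except sort starts emitting output before the previous stage has finished:
$ tr -cs '[:alpha:]' '\n' < corpus.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -10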
Example: Clean text with sed and tr
TODO
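As a sketch of what this example might cover (raw.txt and clean.txt are hypothetical file names): lowercase the text with tr, delete punctuation with tr -d, and squeeze runs of whitespace into single spaces with sed.
$ tr '[:upper:]' '[:lower:]' < raw.txt | tr -d '[:punct:]' | sed -E 's/[[:space:]]+/ /g' > clean.txt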
Example: Sample a CSV file with head, tail, and shuf
This approach can be used with large files, since the full file is never loaded into memory all at once.
Given file.csv:
"name","age"
"alice",23
"bob",33,
"charlie",19
"david",26
"eugene",39
"fay",27
Sample 2 lines from the CSV file, skipping the header line, and write the output to sampled-data.csv:
$ tail -n +2 file.csv | shuf -n 2 > sampled-data.csv
Get the header line from the original CSV file:
$ head -1 file.csv > header.csv
Join the header and the sampled data into file-sampled.csv:
$ cat header.csv sampled-data.csv > file-sampled.csv
The output file file-sampled.csv now looks like this:
"name","age"
"fay",27
"david",26
Example: Call command-line tools from Jupyter Notebooks
TODO
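As a sketch of what this example might show, using the same file.csv as above: in a Jupyter notebook (via IPython), prefixing a line with ! runs it in a shell, and its output can be captured in a Python variable:
!head -1 file.csv
row_count = !wc -l < file.csv
Alternatively, the %%bash cell magic (placed on the first line of a cell) runs the entire cell as a shell script:
%%bash
tail -n +2 file.csv | shuf -n 2 > sampled-data.csv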
Other info
Adam Drake: Command-line Tools can be 235x faster than your Hadoop cluster
Adam Drake: Big Data Small Machine
- In my opinion, a better title would be "Big Data - Single Machine" instead