- Example: Clean text with sed and tr
- Example: Sample csv file with head tail and shuf
- Example: Call command-line tools from Jupyter Notebooks
WIP Alert: This is a work in progress. The current information is correct, but more content may be added in the future.
You can do most of your data preprocessing with native command-line tools, which are available on both Linux and macOS.
These tools are time-tested and naturally support stream processing: the output of one step flows into the next step as it is produced.
They are fast.
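As a rough illustration of stream processing, the sketch below chains three standard tools with pipes. The number range and the digit being searched for are arbitrary choices made for this example, not something from a real dataset.

```shell
# seq emits the numbers 1..100000, one per line.
# grep passes through only the lines that contain the digit 7.
# wc -l counts the lines it receives (tr strips the padding some
# wc implementations add around the number).
# Each stage consumes lines as the previous stage produces them,
# so no intermediate file is written.
count=$(seq 1 100000 | grep 7 | wc -l | tr -d ' ')
echo "$count"
```

The same pattern applies to real data: replace `seq` with `cat file.csv` (or any command that writes lines to standard output) and swap in whatever filters you need.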
Example: Clean text with sed and tr
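A minimal sketch of what such a cleaning pipeline might look like is shown below. The input string and the specific cleaning steps (lowercasing, removing punctuation, collapsing spaces, trimming) are assumptions chosen for illustration; adapt them to your own data.

```shell
# Hypothetical raw input: mixed case, punctuation, irregular spacing.
raw_text='  Hello,   World!  This is   SOME text...  '

cleaned=$(printf '%s\n' "$raw_text" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -d '[:punct:]' \
  | tr -s ' ' \
  | sed -e 's/^ *//' -e 's/ *$//')
# Steps, in order:
#   tr '[:upper:]' '[:lower:]'  converts uppercase letters to lowercase
#   tr -d '[:punct:]'           deletes punctuation characters
#   tr -s ' '                   squeezes runs of spaces into one space
#   sed ...                     strips leading and trailing spaces

echo "$cleaned"
```

To clean a whole file instead of a single string, feed it in with `cat input.txt |` (a hypothetical file name) in place of the `printf` line and redirect the result to a new file.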
Example: Sample csv file with head tail and shuf
This approach works with large files because the tools process the input as a stream instead of loading the full file into memory at once.
Given this file.csv:
"name","age"
"alice",23
"bob",33
"charlie",19
"david",26
"eugene",39
"fay",27
Sample 2 lines from the CSV file, skipping the header line, and write the output to sampled-data.csv:
$ tail -n +2 file.csv | shuf -n 2 > sampled-data.csv
Get the header line from the original CSV file:
$ head -n 1 file.csv > header.csv
Join the header and the sampled data:
$ cat header.csv sampled-data.csv > file-sampled.csv
The output file file-sampled.csv now looks like this (the sampled rows will differ from run to run):
"name","age"
"fay",27
"david",26
Example: Call command-line tools from Jupyter Notebooks
- In my opinion, a better title would be "Big Data - Single Machine" instead