WIP Alert: This is a work in progress. The current information is correct, but more content may be added in the future.
TODO mention client and cluster mode
TODO mention cores
A job is a sequence of stages, triggered by an action.
A stage is a sequence of tasks that can all be run together without a shuffle. For example, using .read to read a file from disk and then running .filter requires no shuffle, so both steps fit in a single stage.
A task is a single operation on an RDD.
A partition is a logical chunk of your input data. It can be processed by a single executor core.
If you have 4 data partitions and you have 4 executor cores, you can process everything in parallel, in a single pass.
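The relationship between partitions and cores can be sketched as simple arithmetic. The snippet below is a minimal illustration, not Spark code: the helper name passes_needed is hypothetical, and it assumes each executor core works on exactly one partition at a time and that all partitions take roughly the same time to process.

```python
def passes_needed(num_partitions: int, executor_cores: int) -> int:
    """Estimate how many rounds of parallel work are needed.

    Assumption (not from Spark itself): each core handles one partition
    at a time, and all partitions take about equally long.
    """
    if num_partitions < 0:
        raise ValueError("num_partitions must not be negative")
    if executor_cores <= 0:
        raise ValueError("executor_cores must be positive")
    # Ceiling division using integers only: round up to a whole pass.
    return (num_partitions + executor_cores - 1) // executor_cores


# 4 partitions on 4 cores: everything runs in parallel, in a single pass.
print(passes_needed(4, 4))  # 1
# 5 partitions on 4 cores: the fifth partition waits for a second pass.
print(passes_needed(5, 4))  # 2
```

Under these assumptions, a partition count that is a multiple of the core count keeps every core busy in every pass, while one extra partition adds a whole pass in which most cores sit idle.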