Spark Architecture Overview: Clusters, Jobs, Stages, Tasks, etc

WIP Alert: This is a work in progress. Current information is correct but more content may be added in the future.

Cluster

A cluster is the set of machines (nodes) that your Spark application runs on. A cluster manager (Spark standalone, YARN, Mesos or Kubernetes) allocates resources on those machines, and the driver and executors run on them.

Driver

The driver is the process that runs your application's main() function: it creates the SparkContext/SparkSession, builds the graph of stages for each job and schedules tasks on the executors. In client mode the driver runs on the machine you submit the application from; in cluster mode it runs on one of the cluster's nodes.

Executor

An executor is a worker process that runs on a cluster node and executes the tasks the driver assigns to it. Each executor has a number of cores, and each core runs one task at a time, so an executor with 4 cores can run up to 4 tasks concurrently.

Job

A job is a sequence of stages, triggered by an action such as .count(), .collect(), .foreach() or .write(). (Some operations that are not strictly actions, like .sortBy() and reading files with schema inference, also launch jobs because they need to scan the data up front.)
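
As a minimal sketch (meant to be pasted into spark-shell; the file name people.txt is made up), the transformations below are lazy and nothing runs until the .count() action triggers a job:

```scala
// paste into spark-shell, which provides `sc` (the SparkContext)
val lines = sc.textFile("people.txt")   // lazy: no job is launched yet
val longLines = lines
  .map(_.trim)                          // lazy transformation
  .filter(_.length > 10)                // lazy transformation

val n = longLines.count()               // action: this call triggers a job
println(s"long lines: $n")
```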

Stage

A stage is a sequence of tasks that can all be run together, without a shuffle. For example: reading a file from disk with .read, then running .map and .filter on it requires no shuffle, so all of it fits in a single stage.
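
To illustrate (again a sketch for spark-shell, with a made-up words.txt), everything up to reduceByKey is shuffle-free and stays in one stage; the shuffle introduced by reduceByKey starts a second stage, which shows up as an extra indentation level in toDebugString:

```scala
// paste into spark-shell, which provides `sc`
val counts = sc.textFile("words.txt")   // stage 1: read the file
  .flatMap(_.split("\\s+"))             // stage 1: narrow, no shuffle needed
  .map(word => (word, 1))               // stage 1: narrow, no shuffle needed
  .reduceByKey(_ + _)                   // shuffle: a second stage starts here

// the indentation in the lineage below marks the shuffle (stage) boundary
println(counts.toDebugString)
counts.collect()                        // action: runs both stages as one job
```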

Task

A task is the smallest unit of work: it runs a stage's operations (like .map or .filter) against a single partition of the data, on a single executor core.
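
One way to see this (a small spark-shell sketch): with 4 partitions, the stage below runs as 4 tasks, each executing the same code against its own partition:

```scala
// paste into spark-shell, which provides `sc`
val rdd = sc.parallelize(1 to 100, numSlices = 4)   // 4 partitions => 4 tasks per stage

val perPartitionSums = rdd
  .mapPartitionsWithIndex { (partitionId, nums) =>
    // this body is what one task executes, against one partition
    Iterator((partitionId, nums.sum))
  }
  .collect()                                        // action: launches the 4 tasks

perPartitionSums.foreach { case (id, sum) => println(s"partition $id -> sum $sum") }
```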

Shuffle

A shuffle redistributes data across partitions (and usually across executors). It is required by wide operations such as reduceByKey, groupBy, join or repartition, it involves writing data to disk and sending it over the network, and it marks the boundary between stages.

Partition

A partition is a logical chunk of your input data. Each partition is processed by a single executor core.

If you have 4 data partitions and 4 executor cores, everything can be processed in parallel, in a single pass.
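
For example (a sketch assuming spark-shell was started with --master local[4], i.e. a single executor with 4 cores):

```scala
// start the shell with 4 cores:  spark-shell --master "local[4]"
val rdd = sc.parallelize(1 to 1000, numSlices = 4)
println(rdd.getNumPartitions)             // 4 partitions: one wave of 4 parallel tasks

val morePartitions = rdd.repartition(8)   // 8 partitions on 4 cores: two waves of tasks
println(morePartitions.getNumPartitions)
morePartitions.count()                    // action: triggers the job
```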

