Spark Architecture Overview: Clusters, Jobs, Stages, Tasks, etc


WIP Alert: This is a work in progress. Current information is correct, but more content may be added in the future.

Cluster

A cluster is a group of machines (nodes) whose resources are pooled and handed out by a cluster manager such as YARN, Kubernetes, Mesos or Spark's standalone manager. A Spark application runs on the cluster as one driver process plus a set of executor processes distributed across the nodes.

Driver

The driver is the process that runs your application's main() method: it creates the SparkContext/SparkSession, breaks your program down into jobs, stages and tasks, and schedules those tasks on the executors. In client mode the driver runs on the machine you submitted the application from; in cluster mode it runs inside the cluster, on one of the worker nodes.
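
As a minimal sketch (the object name and application name below are made up), everything inside main() runs in the driver process:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: everything in main() runs in the driver process.
// In client mode this JVM runs on the machine you launched from;
// in cluster mode the cluster manager starts it on a worker node.
object DriverExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("driver-example")
      .getOrCreate()

    // Transformations and actions defined here are planned by the driver
    // and executed as tasks on the executors.

    spark.stop()
  }
}
```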

Executor

An executor is a worker process (a JVM) launched on a node of the cluster. It runs the tasks the driver assigns to it and can cache data in memory or on disk. Each executor has a number of cores, and each core runs one task at a time, so the total number of executor cores bounds how many tasks can run in parallel.
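
For illustration, a hypothetical sizing of 4 executors with 4 cores each (so up to 16 tasks running at once) could be requested when building the session; the exact values below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sizing: 4 executors x 4 cores = up to 16 tasks running in parallel.
val spark = SparkSession.builder()
  .appName("executor-cores-example")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "4g")
  .getOrCreate()
```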

Job

A job is a sequence of stages, triggered by an action such as .count(), .collect(), .foreach() or .save(). (A few operations that are not actions in the strict sense, such as sortBy() or reading a file with schema inference, also launch jobs behind the scenes.)
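
A minimal sketch, assuming an existing SparkSession named spark and a made-up input path:

```scala
// Transformations are lazy; only the action at the end triggers a job.
val lines  = spark.read.textFile("hdfs:///some/input/path")   // made-up path, no job yet
val errors = lines.filter(_.contains("ERROR"))                 // transformation: still no job
val n      = errors.count()                                    // action: a job is submitted
```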

Stage

A stage is a sequence of operations that can all be run together, without a shuffle; it is executed as a set of parallel tasks, one per partition. For example, using .read to read a file from disk and then running .map and .filter requires no shuffle, so it all fits in a single stage.

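A sketch of such a single-stage pipeline, again assuming a SparkSession named spark and a made-up path:

```scala
// textFile -> map -> filter involve no shuffle, so Spark pipelines them into one stage:
// each task reads one partition of the file and applies map and filter to it.
val n = spark.sparkContext
  .textFile("hdfs:///some/input/path")
  .map(_.toLowerCase)
  .filter(_.contains("error"))
  .count()   // one job, one stage
```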

Task

A task is the work done on a single partition within a stage: one executor core runs the stage's chain of operations (e.g. .map, then .filter) on one partition of the data, so a stage over 200 partitions is executed as 200 tasks.

Shuffle

A shuffle is the redistribution of data across partitions, usually moving records between executors over the network so that all records with the same key end up in the same partition. Wide operations such as groupByKey, reduceByKey, join and repartition require a shuffle, and every shuffle marks a stage boundary.
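
As a sketch (same assumptions as above: a SparkSession named spark and a made-up path), a wide operation like reduceByKey forces a shuffle and therefore a second stage:

```scala
val pairs = spark.sparkContext
  .textFile("hdfs:///some/input/path")
  .map(line => (line.split(",")(0), 1))   // narrow: stays in the first stage

val countsByKey = pairs
  .reduceByKey(_ + _)   // wide: equal keys must meet in one partition, so data is shuffled
  .collect()            // action: one job with two stages, split at the shuffle
```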

Partition

A partition is a logical chunk of your input data. It can be processed by a single executor core.

If you have 4 data partitions and 4 executor cores, you can process everything in parallel, in a single pass.
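
A rough illustration, assuming a SparkSession named spark and a made-up path, of inspecting and changing the number of partitions to match those 4 cores:

```scala
val rdd = spark.sparkContext.textFile("hdfs:///some/input/path")

println(rdd.getNumPartitions)   // number of partitions = number of tasks in the stage

// Repartition to 4 so that the 4 executor cores can each process one partition at once.
// repartition() shuffles the data; coalesce() can reduce partitions without a full shuffle.
val rebalanced = rdd.repartition(4)
```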

