Apache Spark Architecture Overview: Jobs, Stages, Tasks, etc.
A Cluster is a group of JVMs (nodes) connected by the network, each of which runs Spark, either in Driver or Worker roles.
The Driver is one of the nodes in the Cluster.
The driver does not run computations itself (no filter, map, etc.).
It plays the role of a master node in the Spark cluster.
When you call collect() on an RDD or Dataset, the whole dataset is sent to the Driver. This is why you should be careful when calling collect() on large datasets: it can exhaust the Driver's memory.
Executors are JVMs that run on Worker nodes.
These are the JVMs that actually run Tasks on data Partitions.
A Job is a sequence of Stages, triggered by an Action such as .count() or .collect().
A Stage is a sequence of Tasks that can all be run together, in parallel, without a shuffle.
For example: using .read to read a file from disk, then running .filter, can be done without a shuffle, so it fits in a single Stage.
A Task is a single operation (e.g. .filter) applied to a single Partition.
Each Task is executed as a single thread in an Executor!
If your dataset has 2 Partitions, an operation such as filter() will trigger 2 Tasks, one for each Partition.
A Shuffle refers to an operation where data is re-partitioned across a Cluster.
join and any operation that ends with ByKey (e.g. groupByKey, reduceByKey) will trigger a Shuffle.
It is a costly operation because a lot of data can be sent via the network.
A Partition is a logical chunk of your RDD/Dataset.
Data is split into Partitions so that each Executor can operate on a single part, enabling parallelization.
Each Partition can be processed by a single Executor core.
For example: if you have 4 data Partitions and 4 Executor cores, you can process each Stage in parallel, in a single pass.
Job vs Stage
A Job is a sequence of Stages.
A Job is started when an Action such as .count() or .collect() is called.
Stage vs Task
A Stage is a sequence of Tasks that don't require a Shuffle in-between.
The number of Tasks in a Stage equals the number of Partitions your dataset has.