Apache Spark Architecture Overview: Jobs, Stages, Tasks, etc.
Cluster
A Cluster is a group of JVMs (nodes) connected by a network, each of which runs Spark in either the Driver or a Worker role.
Driver
The Driver is one of the nodes in the Cluster.
The Driver does not run computations (filter, map, reduce, etc.).
It plays the role of a master node in the Spark cluster.
When you call collect() on an RDD or Dataset, all of the data is sent to the Driver. This is why you should be careful when calling collect().
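As a minimal sketch of why collect() deserves caution (assuming an existing SparkSession named spark, e.g. in spark-shell), the snippet below contrasts count(), which only ships a single number back, with collect(), which ships every element to the Driver JVM:

    import spark.implicits._   // assumes an existing SparkSession named spark

    val ds = (1 to 1000000).toDS()   // a distributed Dataset[Int]

    // count() runs on the Executors and ships only a single Long to the Driver.
    val n: Long = ds.count()

    // collect() ships every element back to the Driver JVM; fine for small
    // results, but it can cause OutOfMemoryError on large Datasets.
    val everything: Array[Int] = ds.collect()

    println(s"count = $n, collected ${everything.length} elements on the Driver")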
Executor
Executors are JVMs that run on Worker nodes.
These are the JVMs that actually run Tasks on data Partitions.
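How many Executor JVMs you get, and how many Task threads each one runs, is controlled by configuration. The following is illustrative only: the configuration keys are standard, but the right values (and whether spark.executor.instances applies) depend on your cluster manager (YARN, Kubernetes, standalone, etc.):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("executor-sizing-sketch")
      .config("spark.executor.instances", "4")   // 4 Executor JVMs
      .config("spark.executor.cores", "2")       // 2 Task threads per Executor
      .config("spark.executor.memory", "4g")     // heap size per Executor JVM
      .getOrCreate()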
Job
A Job is a sequence of Stages, triggered by an Action such as count(), foreachRDD(), collect(), read(), or write().
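As a sketch (assuming an existing SparkSession named spark and a hypothetical input path), the pipeline below is built up lazily, and a Job is only launched when the count() Action at the end is called:

    import spark.implicits._   // assumes an existing SparkSession named spark

    // map and filter are lazy transformations: nothing is computed
    // while the pipeline is being built.
    val longLineLengths = spark.read
      .textFile("/tmp/example.txt")     // hypothetical input path
      .map(line => line.length)
      .filter(len => len > 10)

    // The count() Action triggers a Job, which appears in the Spark UI.
    val numLongLines: Long = longLineLengths.count()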
Stage
A Stage is a sequence of Tasks that can all be run together, in parallel, without a shuffle.
For example: using .read to read a file from disk, then running .map and .filter, can all be done without a shuffle, so it fits in a single Stage. A sketch of this follows below.
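Here is that idea as a minimal sketch, using the RDD API for illustration (assuming an existing SparkContext named sc and a hypothetical file path):

    // map and filter are narrow transformations: each output partition depends
    // on exactly one input partition, so no shuffle is needed.
    val shortLines = sc.textFile("/tmp/example.txt")   // hypothetical path
      .map(_.trim)
      .filter(_.length < 80)

    // count() launches one Job with a single Stage: each Task reads its own
    // partition, trims, filters, and counts, with no data exchange.
    println(shortLines.count())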
Task
A Task is a single operation (.map or .filter) applied to a single Partition.
Each Task is executed as a single thread in an Executor!
If your dataset has 2 Partitions, an operation such as filter() will trigger 2 Tasks, one for each Partition.
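A quick sketch of that Partition-to-Task relationship (assuming an existing SparkContext named sc):

    // Create an RDD with exactly 2 Partitions.
    val numbers = sc.parallelize(1 to 100, numSlices = 2)
    println(numbers.getNumPartitions)   // 2

    // filter is applied partition by partition; when the count() action runs,
    // Spark schedules one Task per Partition, i.e. 2 Tasks here.
    val evens = numbers.filter(_ % 2 == 0)
    println(evens.count())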
Shuffle
A Shuffle refers to an operation where data is re-partitioned across a Cluster.
join and any operation that ends with ByKey (such as reduceByKey or groupByKey) will trigger a Shuffle.
It is a costly operation because a lot of data can be sent over the network.
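As an illustration (RDD API, assuming an existing SparkContext named sc), reduceByKey has to bring all values for the same key to the same Partition, so the lineage printed by toDebugString shows a ShuffledRDD at the Stage boundary:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

    // reduceByKey must bring every value for the same key to the same Partition,
    // which means moving records across the network: a Shuffle.
    val totals = pairs.reduceByKey(_ + _)

    // The lineage shows the shuffle boundary (a ShuffledRDD) between Stages.
    println(totals.toDebugString)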
Partition
A Partition is a logical chunk of your RDD/Dataset.
Data is split into Partitions so that each Executor can operate on a single part, enabling parallelization.
Each Partition can be processed by a single Executor core.
For example: if you have 4 data Partitions and 4 Executor cores, you can process all of a Stage's Tasks in parallel, in a single pass.
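A small sketch of how the number of Partitions controls parallelism (assuming an existing SparkContext named sc and, hypothetically, 4 Executor cores available):

    val data = sc.parallelize(1 to 1000000, numSlices = 16)
    println(data.getNumPartitions)    // 16

    // repartition(4) shuffles the data into 4 Partitions; with 4 Executor cores,
    // each Stage's 4 Tasks can then run fully in parallel, in a single pass.
    val fourPartitions = data.repartition(4)
    println(fourPartitions.getNumPartitions)   // 4
    println(fourPartitions.count())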
Job vs Stage
A Job is a sequence of Stages.
A Job is started when there is an Action such as count(), collect(), saveAsTextFile(), etc.
Stage vs Task
A Stage is a sequence of Tasks that don't require a Shuffle in-between.
The number of Tasks in a Stage depends on the number of Partitions your dataset has.
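Putting the pieces together, here is a hedged end-to-end sketch (assuming an existing SparkContext named sc and a hypothetical file): one count() Action creates one Job; the reduceByKey introduces a shuffle and therefore a second Stage; and each Stage runs as many Tasks as its data has Partitions:

    val wordCounts = sc.textFile("/tmp/example.txt", minPartitions = 4)  // hypothetical path
      .flatMap(_.split("\\s+"))      // Stage 1: narrow transformations,
      .map(word => (word, 1))        // run as 4 Tasks (one per input Partition)
      .reduceByKey(_ + _)            // shuffle boundary starts Stage 2

    // One Action = one Job; Stage 2 runs as many Tasks as wordCounts has
    // Partitions after the shuffle.
    println(wordCounts.count())
    println(wordCounts.getNumPartitions)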