Apache Spark Architecture Overview: Jobs, Stages, Tasks, etc.
A Cluster is a group of JVMs (nodes) connected by the network, each of which runs Spark, either in Driver or Worker roles.
The Driver is one of the nodes in the Cluster.
The driver does not run computations itself (no filter, map, etc.).
It plays the role of a master node in the Spark cluster.
When you call collect() on an RDD or Dataset, the whole dataset is sent to the Driver. This is why you should be careful when calling collect() on large datasets: it can exhaust the Driver's memory.
Executors are JVMs that run on Worker nodes.
These are the JVMs that actually run Tasks on data Partitions.
A Job is a sequence of Stages, triggered by an Action such as .count() or .collect().
A Stage is a sequence of Tasks that can all be run together, in parallel, without a shuffle.
For example: using .read to read a file from disk, then running .filter, can be done without a shuffle, so it fits in a single Stage.
A Task is a single operation (e.g. .filter) applied to a single Partition.
Each Task is executed as a single thread in an Executor!
If your dataset has 2 Partitions, an operation such as filter() will trigger 2 Tasks, one for each Partition.
A Shuffle refers to an operation where data is re-partitioned across a Cluster.
join and any operation that ends with ByKey (e.g. groupByKey, reduceByKey) will trigger a Shuffle.
It is a costly operation because a lot of data can be sent via the network.
A Partition is a logical chunk of your RDD/Dataset.
Data is split into Partitions so that each Executor can operate on a single part, enabling parallelization.
Each Partition can be processed by a single Executor core.
For example: if you have 4 data Partitions and 4 Executor cores, you can process each Stage in parallel, in a single pass.
Job vs Stage
A Job is a sequence of Stages.
A Job is started when an Action such as .count() or .collect() is called.
Stage vs Task
A Stage is a sequence of Tasks that don't require a Shuffle in-between.
The number of Tasks in a Stage equals the number of Partitions your dataset has.