Spark Dataframe Examples: Pivot and Unpivot Data


Examples use Spark version 2.4.3 and the Scala API

View all examples on a jupyter notebook here: pivot-unpivot.ipynb

Pivot vs Unpivot

Here's a rough explanation for what both operations achieve:

  • Pivot: Turn rows into columns.

  • Unpivot: Turn columns into rows.
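Before the Spark examples, a plain-Scala sketch (no Spark, standard collections only) may help build intuition for what pivoting computes; the data mirrors the example below, and the nested `Map` plays the role of the pivoted columns.

```scala
// Plain-Scala sketch of a pivot: group rows by salesperson, then spread each
// device into its own "column" (a Map key) holding the summed amount.
val rows = List(
  ("john", "notebook", 2),
  ("gary", "notebook", 3),
  ("john", "small phone", 2),
  ("mary", "small phone", 3),
  ("john", "large phone", 3),
  ("john", "camera", 3)
)

val pivoted: Map[String, Map[String, Int]] =
  rows.groupBy(_._1).map { case (person, personRows) =>
    person -> personRows.groupBy(_._2).map { case (device, deviceRows) =>
      device -> deviceRows.map(_._3).sum
    }
  }
```

Spark's `.pivot()` does the same grouping and spreading, but also fills the missing (salesperson, device) combinations with nulls.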

Pivot with .pivot()

val sourceDf = Seq(
    ("john", "notebook", 2),
    ("gary", "notebook", 3),
    ("john", "small phone", 2),
    ("mary", "small phone", 3),
    ("john", "large phone", 3),
    ("john", "camera", 3)
).toDF("salesperson","device", "amount sold")

val pivotedDf = sourceDf
    .groupBy("salesperson")
    .pivot("device")
    .sum("amount sold")

original-dataframe (BEFORE): each fact is on a separate row.
pivoted-dataframe (AFTER PIVOTING): each distinct device gets its own column and aggregates (in this case, sums) are shown.

Unpivot with selectExpr and stack

Unpivot is the inverse transformation for pivot.

There's no equivalent dataframe operator for the unpivot operation; we must use selectExpr() along with the stack built-in function.

Template: df.selectExpr("<row_label_column>", "stack(<n>, <col1_label>, <col1_value>, <col2_label>, <col2_value>, ...)"), where <n> is the number of columns being unpivoted (stack emits one output row per label/value pair).
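To see what the stack expression does to a single wide row, here is a plain-Scala sketch (no Spark, standard collections only): stack emits one (label, value) pair per unpivoted column, and the null cells then have to be dropped explicitly, just as in the dataframe version.

```scala
// Plain-Scala sketch of stack(4, l1, v1, ..., l4, v4) applied to one wide row:
// it yields up to 4 narrow (label, value) rows; None plays the role of SQL NULL.
case class Wide(salesperson: String, camera: Option[Int], largePhone: Option[Int],
                notebook: Option[Int], smallPhone: Option[Int])

val gary = Wide("gary", camera = None, largePhone = None, notebook = Some(3), smallPhone = None)

val stacked: List[(String, String, Int)] =
  List(("camera", gary.camera), ("large phone", gary.largePhone),
       ("notebook", gary.notebook), ("small phone", gary.smallPhone))
    // keep only the non-null cells, as the isNotNull filter does in Spark
    .collect { case (device, Some(amount)) => (gary.salesperson, device, amount) }
```

One tuple survives here because gary only has a non-null value in the notebook column.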

// using Option[Int] to help Spark's type inference: every column needs at
// least one Some(...) cell, otherwise its type is inferred as None.type
val sourceDf = Seq(
    ("john", Some(3), Some(3), Some(2), Some(2)),
    ("gary", None,    None,    Some(3), None),
    ("mary", None,    None,    None,    Some(3))
).toDF("salesperson", "camera", "large phone", "notebook", "small phone")

// select the row label column (salesperson) and unpivot the other columns with stack;
// column names containing spaces must be escaped with backticks
val unpivotedDf = sourceDf
    .selectExpr("salesperson",
        "stack(4, 'camera', camera, 'large phone', `large phone`, 'notebook', notebook, 'small phone', `small phone`)")
    .withColumnRenamed("col0", "device") // default name of the label column is col0
    .withColumnRenamed("col1", "amount_sold") // default name of the value column is col1
    .filter($"amount_sold".isNotNull) // stack keeps null cells, so we must remove them explicitly

original-dataframe (BEFORE): data is spread across columns.
unpivoted-dataframe (AFTER UNPIVOTING): a row has been created for each combination of the row label and the stacked columns.
