- Pivot vs Unpivot
- Pivot with .pivot()
- Unpivot with selectExpr and stack
- Heads-up: Pivot with no value columns trigger a Spark action
Examples use Spark version 2.4.3 and the Scala API
View all examples on a jupyter notebook here: pivot-unpivot.ipynb
Pivot vs Unpivot
|Turn tall dataframes |
into wide dataframes
|Wide dataframes |
into tall dataframes
|Turn rows into columns||Turn columns into rows|
Here's a rough explanation for what both operations achieve:
Pivot: Turn rows into columns.
Unpivot: Turn columns into rows.
Pivot with .pivot()
val sourceDf = Seq( ("john", "notebook", 2), ("gary", "notebook", 3), ("john", "small phone", 2), ("mary", "small phone", 3), ("john", "large phone", 3), ("john", "camera", 3) ).toDF("salesperson","device", "amount sold") val sourceDf = unpivotedDf .groupBy("salesperson") .pivot("device") .sum("amount sold")
gets a column and aggregates (in this
case sum) are shown
Unpivot with selectExpr and stack
Unpivot is the inverse transformation for pivot.
There's no equivalent dataframe operator for the unpivot operation, we must use
selectExpr() along with the
df.selectExpr("stack(<n_columns>, <col0_label>, <col0_value>, <col1_label>, <col1_value> ...)")
// using Option[Int] to help spark's type inference val sourceDf = Seq( ("gary",None, None, Some(3),None), ("mary",None, None, None, Some(3)), ("john",Some(3),Some(3),Some(2),Some(2)) ).toDF("salesperson", "camera", "large phone", "notebook", "small phone") // select the row label column (salesperson) and apply unpivot in the other columns val unpivotedDf = sourceDf .selectExpr("salesperson","stack(4,'camera',camera,'large_phone',large_phone,'notebook',notebook,'small_phone',small_phone)") .withColumnRenamed("col0","device") // default name of this column is col0 .withColumnRenamed("col1","amount_sold") // default name of this column is col1 .filter($"amount_sold".isNotNull) // must explicitly remove nulls
combination of the row label and the
Heads-up: Pivot with no value columns trigger a Spark action
The pivot operation turns row values into column headings.
If you call method
pivot with a
pivotColumn but no
values, Spark will need to trigger an action1 because it can't otherwise know what are the values that should become the column headings.
In order to avoid an action to keep your operations lazy, you need to provide the values you want to pivot over, by passing the