Spark DataFrame UDFs: Scala Examples

Last updated:
Table of Contents

User-Defined functions are functions that operate on a DataFrame's column.

For instance, if you have a Column that represents an age feature, you could create an UDF that multiplies an Int by 2, and evaluate that UDF over that particular Column.

UDFs operate on Columns while regular RDD/Dataframe functions (map, filter, etc) operate on Rows

A Sample DataFrame

For our examples let's use a DataFrame with two columns like the one below:

> df.show()

+---+-------------+
| id|         text|
+---+-------------+
|  0|          foo|
|  1|      foo bar|
|  2|  hello world|
|  3|great success|
+---+-------------+

> df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- text: string (nullable = true)

HEADS-UP All examples use Spark 2.x unless otherwise noted!

Example: get string length

Next is a very simple example: replace a String column with a Long column representing the text length (using the sample dataframe above)

import org.apache.spark.sql.functions.udf

def strLength(inputString: String): Long = inputString.size.toLong

// we use the method name followed by a "_" to indicate we want a reference
// to the method, not call it
val strLengthUdf = udf(strLength _)

val df2 = df.select(strLengthUdf(df("text")))

then show it:

> df2.show()

+---+---------+
|   UDF(text) |
+-------------+
|            3|
|            7|
|           11|
|           14|
+---+---------+

Spark.ml Pipelines are udfs

Spark.ml Pipelines are all written in terms of udfs. Since they operate column-wise rather than row-wise, they are prime candidates for transforming a DataSet by adding columns, modifying features, and so on.

Look at how Spark's MinMaxScaler is just a wrapper for a udf.

Dialogue & Discussion