Spark DataFrame UDFs: Examples using Scala and Python

WIP Alert: This is a work in progress. Current information is correct, but more content will probably be added in the future.

What are User-Defined Functions?

They are functions that operate on a DataFrame's Columns. For instance, if you have a Column that represents an age feature, you could create a UDF that multiplies an Int by 2 and evaluate that UDF over that particular Column.

UDFs operate on Columns, while regular RDD functions (map, filter, etc.) operate on Rows.
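
For instance, the age example above could look like this (a minimal sketch, assuming a DataFrame df with an Int column named age):

import org.apache.spark.sql.functions.udf

// a udf that multiplies its Int argument by 2
val timesTwo = udf { x: Int => x * 2 }

// the udf is evaluated over a Column, not over individual Rows
val doubled = df.select(timesTwo(df("age")))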

A Sample DataFrame

For our examples let's use a DataFrame with two columns like the one below:

> df.show()

+---+-------------+
| id|         text|
+---+-------------+
|  0|          foo|
|  1|      foo bar|
|  2|  hello world|
|  3|great success|
+---+-------------+

> df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- text: string (nullable = true)
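
If you want to follow along, here is one minimal way to create this DataFrame (assuming a SparkSession named spark, e.g. inside spark-shell):

import spark.implicits._

val df = Seq(
  (0, "foo"),
  (1, "foo bar"),
  (2, "hello world"),
  (3, "great success")
).toDF("id", "text")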

HEADS-UP: All examples use Spark 2.x unless otherwise noted!

Scala example: replace a String column with a Long column representing the text length

import org.apache.spark.sql.functions.udf

def strLength(inputString: String): Long = inputString.size.toLong

// we use the method name followed by a "_" to indicate we want a reference
// to the method, not call it
val strLengthUdf = udf(strLength _)

val df2 = df.select(strLengthUdf(df("text")))

Then show it:

> df2.show()

+-------------+
|    UDF(text)|
+-------------+
|            3|
|            7|
|           11|
|           14|
+-------------+
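
The auto-generated column name UDF(text) isn't very descriptive; you can give the new Column a name of your choice with alias (text_length here is just an example name):

val df3 = df.select(strLengthUdf(df("text")).alias("text_length"))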

Spark.ml Pipelines are largely written in terms of udfs

Since they operate column-wise rather than row-wise, they are prime candidates for transforming a Dataset by adding columns, modifying features, and so on.

Look at how Spark's MinMaxScaler, for example, is essentially just a wrapper around a udf.
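
As a rough illustration of that pattern, here is a simplified sketch (not the actual MinMaxScaler code) that rescales the sample df's id column into the [0, 1] range:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{max, min, udf}

// first compute the column's statistics
val Row(colMin: Int, colMax: Int) = df.agg(min("id"), max("id")).head

// then wrap the element-wise rescaling logic in a udf and use it to add
// a new column, much like a transformer's transform() method does
val rescale = udf { x: Int => (x - colMin).toDouble / (colMax - colMin) }

val rescaled = df.withColumn("id_scaled", rescale(df("id")))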

Python example: multiply an Int by two

In PySpark you also wrap a regular function with udf; note that you should pass the return type explicitly (it defaults to a string type otherwise). A minimal sketch, doubling the id column of the sample df:
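
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap a plain Python function in a udf, declaring the return type explicitly
times_two = udf(lambda i: i * 2, IntegerType())

df2 = df.select(times_two(df["id"]))

As in the Scala example, the resulting column gets an auto-generated name, which you can override with alias.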
