Spark DataFrame UDFs: Examples using Scala and Python

WIP ALERT: This is a Work in Progress.

What are User-Defined Functions?

They are functions that operate on a DataFrame's Columns. For instance, if you have a Column that represents an age feature, you could create a UDF that multiplies an Int by 2 and evaluate that UDF over that particular Column.

UDFs operate on Columns, while regular RDD functions (map, filter, etc.) operate on Rows.

A Sample DataFrame

For our examples let's use a DataFrame with two columns like the one below:


> df.show()

+---+-------------+
| id|         text|
+---+-------------+
|  0|          foo|
|  1|      foo bar|
|  2|  hello world|
|  3|great success|
+---+-------------+

> df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- text: string (nullable = true)
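
If you want to follow along, here is one way to build this sample DataFrame (a minimal Scala sketch; it assumes a SparkSession named spark is in scope):

    // build the sample DataFrame from a Seq of tuples
    val df = spark.createDataFrame(Seq(
      (0, "foo"),
      (1, "foo bar"),
      (2, "hello world"),
      (3, "great success")
    )).toDF("id", "text")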

Python example: multiply an Int by two

A minimal sketch using pyspark.sql.functions.udf: define the UDF, then evaluate it over the id column (the output column name id_times_two is just for illustration):
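
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    # the UDF: takes an Int and multiplies it by two
    times_two = udf(lambda i: i * 2, IntegerType())

    # evaluate the UDF over the "id" column, adding a new column
    df.withColumn("id_times_two", times_two(df["id"])).show()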

Scala example: replace a String column with a Long column representing the text length

Here the UDF is just a call to udf; reusing the existing column name in withColumn replaces the original column (a minimal sketch):
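
    import org.apache.spark.sql.functions.udf

    // the UDF: maps a String to its length, as a Long
    val textLength = udf { (s: String) => s.length.toLong }

    // writing to the existing name "text" replaces the String column
    val df2 = df.withColumn("text", textLength(df("text")))
    df2.printSchema()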

Pipelines are written in terms of UDFs

Since they operate column-wise rather than row-wise, UDFs are prime candidates for transforming a Dataset by adding columns, modifying features, and so on.

Look at how Spark's MinMaxScaler, for instance, is essentially just a wrapper around a UDF.
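
A heavily simplified sketch of that pattern (not the actual Spark source: the real MinMaxScalerModel operates on Vector columns and learns originalMin and originalMax during fit(); the feature column below is hypothetical):

    import org.apache.spark.sql.functions.udf

    val min = 0.0           // target range lower bound
    val max = 1.0           // target range upper bound
    val originalMin = 2.0   // in the real scaler, learned from the data in fit()
    val originalMax = 8.0

    // the transformer's core is a rescaling UDF...
    val reScale = udf { (x: Double) =>
      ((x - originalMin) / (originalMax - originalMin)) * (max - min) + min
    }

    // ...applied to the input column via withColumn
    val scaled = df.withColumn("scaledFeature", reScale(df("feature")))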
