Spark DataFrame UDFs: Examples using Scala and Python

WIP Alert: This is a work in progress. Current information is correct, but more content will probably be added in the future.

What are User-Defined Functions?

They are functions that operate on a DataFrame's Columns. For instance, if you have a Column that represents an age feature, you could create a UDF that multiplies an Int by 2 and evaluate that UDF over that particular Column.

UDFs operate on Columns, while regular RDD functions (map, filter, etc.) operate on Rows.
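
For instance, the age example above could look like this (a minimal sketch, assuming a DataFrame df with an Int column named age):

import org.apache.spark.sql.functions.udf

// a udf that multiplies its Int argument by 2
val timesTwo = udf { x: Int => x * 2 }

// the udf is evaluated over a Column, not over individual Rows
val doubled = df.select(timesTwo(df("age")))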

A Sample DataFrame

For our examples let's use a DataFrame with two columns like the one below:

> df.show()

+---+-------------+
| id|         text|
+---+-------------+
|  0|          foo|
|  1|      foo bar|
|  2|  hello world|
|  3|great success|
+---+-------------+

> df.printSchema()

root
 |-- id: integer (nullable = false)
 |-- text: string (nullable = true)
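
If you want to follow along, here is one minimal way to create this DataFrame (assuming a SparkSession named spark, e.g. inside spark-shell):

import spark.implicits._

val df = Seq(
  (0, "foo"),
  (1, "foo bar"),
  (2, "hello world"),
  (3, "great success")
).toDF("id", "text")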

HEADS-UP: All examples use Spark 2.x unless otherwise noted!

Scala example: replace a String column with a Long column representing the text length

import org.apache.spark.sql.functions.udf

def strLength(inputString: String): Long = inputString.size.toLong

// we use the method name followed by a "_" to indicate we want a reference
// to the method, not call it
val strLengthUdf = udf(strLength _)

val df2 = df.select(strLengthUdf(df("text")))

Then show it:

> df2.show()

+-------------+
|    UDF(text)|
+-------------+
|            3|
|            7|
|           11|
|           14|
+-------------+
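
The auto-generated column name UDF(text) isn't very descriptive; you can give the new Column a name of your choice with alias (text_length here is just an example name):

val df3 = df.select(strLengthUdf(df("text")).alias("text_length"))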

Spark.ml Pipelines are largely written in terms of udfs

Since they operate column-wise rather than row-wise, they are prime candidates for transforming a Dataset by adding columns, modifying features, and so on.

Look at how Spark's MinMaxScaler, for example, is essentially just a wrapper around a udf.
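
As a rough illustration of that pattern, here is a simplified sketch (not the actual MinMaxScaler code) that rescales the sample df's id column into the [0, 1] range:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{max, min, udf}

// first compute the column's statistics
val Row(colMin: Int, colMax: Int) = df.agg(min("id"), max("id")).head

// then wrap the element-wise rescaling logic in a udf and use it to add
// a new column, much like a transformer's transform() method does
val rescale = udf { x: Int => (x - colMin).toDouble / (colMax - colMin) }

val rescaled = df.withColumn("id_scaled", rescale(df("id")))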

Python example: multiply an Int by two

In PySpark you also wrap a regular function with udf; note that you should pass the return type explicitly (it defaults to a string type otherwise). A minimal sketch, doubling the id column of the sample df:
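
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap a plain Python function in a udf, declaring the return type explicitly
times_two = udf(lambda i: i * 2, IntegerType())

df2 = df.select(times_two(df["id"]))

As in the Scala example, the resulting column gets an auto-generated name, which you can override with alias.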
