Spark DataFrame UDFs: Examples using Scala and Python
WIP Alert: This is a work in progress. Current information is correct, but more content will probably be added in the future.
What are User-Defined Functions?
They are functions that operate on a DataFrame's columns. For instance, if you have a Column that represents an age feature, you could create a UDF that multiplies an Int by 2 and evaluate that UDF over that particular Column.
UDFs operate on Columns, while regular RDD functions (map, filter, etc.) operate on Rows.
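To make the distinction concrete, here is a minimal sketch of both styles, assuming a hypothetical DataFrame df whose first column is the integer age feature from the example above (not the sample DataFrame shown next):
import org.apache.spark.sql.functions.udf

// column-wise: a udf that doubles an Int, evaluated over the "age" Column
val doubleUdf = udf((age: Int) => age * 2)
val doubled = df.select(doubleUdf(df("age")))

// row-wise: the same transformation as an RDD map over Rows
// (getInt(0) assumes age is the first column)
val doubledRdd = df.rdd.map(row => row.getInt(0) * 2)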
A Sample DataFrame
For our examples let's use a DataFrame with two columns like the one below:
> df.show()
+---+-------------+
| id| text|
+---+-------------+
| 0| foo|
| 1| foo bar|
| 2| hello world|
| 3|great success|
+---+-------------+
> df.printSchema()
root
|-- id: integer (nullable = false)
|-- text: string (nullable = true)
HEADS-UP: All examples use Spark 2.x unless otherwise noted!
Scala example: replace a String column with a Long column representing the text length
import org.apache.spark.sql.functions.udf
def strLength(inputString: String): Long = inputString.size.toLong
// we use the method name followed by a "_" to indicate we want a reference
// to the method, not to call it
val strLengthUdf = udf(strLength _)
val df2 = df.select(strLengthUdf(df("text")))
Then show it:
> df2.show()
+---------+
|UDF(text)|
+---------+
|        3|
|        7|
|       11|
|       13|
+---------+
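The auto-generated UDF(text) column name is not very readable; one option (this is the standard Column API, nothing UDF-specific) is to give the result an explicit name:
val df3 = df.select(strLengthUdf(df("text")).as("textLength"))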
Spark.ml Pipelines are all written in terms of udfs
Since they operate column-wise rather than row-wise, they are prime candidates for transforming a Dataset by adding columns, modifying features, and so on.
Look at how Spark's MinMaxScaler is just a wrapper for a udf.
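To illustrate the pattern, here is a hypothetical, heavily simplified sketch of what such a wrapper can look like. The real MinMaxScaler is a full ML Transformer operating on Vector columns; this sketch just scales a single Double column to [0, 1]:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// hypothetical helper: min-max scale a Double column, the way a
// Transformer's transform() would wrap its logic in a udf
def minMaxScale(df: DataFrame, col: String, min: Double, max: Double): DataFrame = {
  // degenerate case: if all values are equal, map them to the midpoint
  val scale = udf((x: Double) => if (max == min) 0.5 else (x - min) / (max - min))
  df.withColumn(col + "_scaled", scale(df(col)))
}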
Python example: multiply an Int by two
Here is a minimal sketch of how this can look in PySpark (assumptions: we reuse the sample DataFrame's integer id column, and we define the UDF with a lambda rather than a class):
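from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# multiply an integer by two; unlike in Scala, the return type
# must be declared explicitly (it defaults to StringType)
multiply_by_two = udf(lambda i: i * 2, IntegerType())

df2 = df.select(multiply_by_two(df["id"]))
df2.show()
As in the Scala example, the UDF is evaluated over a Column with select.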