Spark DataFrame UDFs: Scala Examples
User-Defined Functions (UDFs) are functions that operate on a DataFrame's columns.
For instance, if you have a Column that represents an age feature, you could create a UDF that multiplies an Int by 2, and evaluate that UDF over that particular Column.
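Here's a minimal sketch of that idea (the DataFrame name people and the column name age are just assumptions for illustration):
import org.apache.spark.sql.functions.udf

// assumes a DataFrame `people` with an integer "age" column
val doubleAge = udf((age: Int) => age * 2)
val doubled = people.withColumn("age_doubled", doubleAge(people("age")))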
UDFs operate on Columns, while regular RDD/DataFrame functions (map, filter, etc.) operate on Rows.
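To make the contrast concrete, here's a quick sketch using the sample DataFrame introduced in the next section:
import spark.implicits._
import org.apache.spark.sql.functions.udf

// row-wise: map sees the whole Row (as a typed Dataset operation)
val viaMap = df.map(row => row.getString(1).length)

// column-wise: a udf only sees the column it is applied to
val len = udf((s: String) => s.length)
val viaUdf = df.select(len($"text"))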
A Sample DataFrame
For our examples let's use a DataFrame with two columns like the one below:
> df.show()
+---+-------------+
| id|         text|
+---+-------------+
|  0|          foo|
|  1|      foo bar|
|  2|  hello world|
|  3|great success|
+---+-------------+
> df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- text: string (nullable = true)
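If you want to follow along, here's one way to build this exact DataFrame (assuming a SparkSession in scope as spark):
import spark.implicits._

val df = Seq(
  (0, "foo"),
  (1, "foo bar"),
  (2, "hello world"),
  (3, "great success")
).toDF("id", "text")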
HEADS-UP All examples use Spark 2.x unless otherwise noted!
Example: get string length
Next is a very simple example: replace a String column with a Long column representing the text length (using the sample DataFrame above).
import org.apache.spark.sql.functions.udf
def strLength(inputString: String): Long = inputString.size.toLong
// we use the method name followed by a "_" to get a reference to
// the method rather than calling it
val strLengthUdf = udf(strLength _)
val df2 = df.select(strLengthUdf(df("text")))
Then show it:
> df2.show()
+---------+
|UDF(text)|
+---------+
|        3|
|        7|
|       11|
|       14|
+---------+
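The default column name UDF(text) isn't very readable. You can define the same udf inline with an anonymous function and alias the result (the name textLength is just a choice for this example):
val strLenUdf = udf((s: String) => s.length.toLong)
val df3 = df.select(strLenUdf(df("text")).as("textLength"))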
Spark.ml Pipelines are udfs
Spark.ml Pipelines are all written in terms of udfs. Since they operate column-wise rather than row-wise, they are prime candidates for transforming a Dataset by adding columns, modifying features, and so on. Look at how Spark's MinMaxScaler is just a wrapper for a udf.
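As a small taste of that, here's a minimal MinMaxScaler run over a toy single-feature DataFrame (the column names here are assumptions for this sketch); note how transform simply appends a column, exactly the shape of applying a udf in a select:
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

// assumes a SparkSession in scope as `spark`
val data = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0)),
  (1, Vectors.dense(5.0)),
  (2, Vectors.dense(10.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// fit() learns the column min/max; transform() adds the scaled column
val scaled = scaler.fit(data).transform(data)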