Spark DataFrame UDFs: Scala Examples
User-defined functions (UDFs) are functions that operate on a DataFrame's columns.
For instance, if you have a Column that represents an age feature, you could create a UDF that multiplies an Int by 2, and evaluate that UDF over that particular Column.
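A minimal sketch of that idea (hypothetical: assumes a DataFrame named people with an integer age column):
import org.apache.spark.sql.functions.udf

// double every value in the (hypothetical) "age" column
val doubleAge = udf((age: Int) => age * 2)
val doubled = people.withColumn("age_doubled", doubleAge(people("age")))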
UDFs operate on Columns, while regular RDD/DataFrame functions (map, filter, etc.) operate on Rows.
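To make the contrast concrete, here is a row-wise equivalent of the string-length UDF below, written with map (a sketch, assuming a SparkSession named spark for the implicit encoders and the sample DataFrame df from the next section):
import spark.implicits._

// map receives whole Rows, not individual Columns
val lengths = df.map(row => row.getAs[String]("text").length.toLong)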
A Sample DataFrame
For our examples let's use a DataFrame with two columns like the one below:
> df.show()
+---+-------------+
| id| text|
+---+-------------+
| 0| foo|
| 1| foo bar|
| 2| hello world|
| 3|great success|
+---+-------------+
> df.printSchema()
root
|-- id: integer (nullable = false)
|-- text: string (nullable = true)
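One way to build this DataFrame yourself (a sketch, assuming a SparkSession named spark):
import spark.implicits._

val df = Seq(
  (0, "foo"),
  (1, "foo bar"),
  (2, "hello world"),
  (3, "great success")
).toDF("id", "text")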
HEADS-UP: All examples use Spark 2.x unless otherwise noted!
Example: get string length
Next is a very simple example: replace a String column with a Long column representing the text length (using the sample DataFrame above).
import org.apache.spark.sql.functions.udf
def strLength(inputString: String): Long = inputString.size.toLong
// we use the method name followed by a "_" to indicate we want a
// reference to the method rather than calling it
val strLengthUdf = udf(strLength _)
val df2 = df.select(strLengthUdf(df("text")))
Then show it:
> df2.show()
+---------+
|UDF(text)|
+---------+
|        3|
|        7|
|       11|
|       14|
+---------+
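The auto-generated column name UDF(text) is not very readable. You can also build the UDF from an anonymous function and alias the resulting column (a sketch, equivalent to the example above):
// equivalent UDF from an anonymous function, with a friendlier column name
val strLengthUdf2 = udf((s: String) => s.length.toLong)
val df3 = df.select(strLengthUdf2(df("text")).alias("text_length"))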
Spark.ml Pipelines are UDFs
Spark.ml Pipeline stages are largely written in terms of UDFs. Since they operate column-wise rather than row-wise, they are prime candidates for transforming a Dataset by adding columns, modifying features, and so on. Look at how Spark's MinMaxScaler is just a wrapper for a UDF.
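To illustrate the pattern (a hypothetical minimal sketch, not MinMaxScaler's actual implementation), here is a custom Transformer that wraps the strLength UDF from above as a pipeline stage:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical pipeline stage: appends a "text_length" column
// computed by a UDF, hardcoding the input column name "text"
class StrLengthTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("strLength"))

  private val strLengthUdf = udf((s: String) => s.length.toLong)

  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("text_length", strLengthUdf(col("text")))

  override def transformSchema(schema: StructType): StructType =
    schema.add(StructField("text_length", LongType, nullable = false))

  override def copy(extra: ParamMap): StrLengthTransformer = defaultCopy(extra)
}
A real stage like MinMaxScaler adds configurable input/output column params on top, but the core transform is the same column-wise idea.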