Spark Date/Datetime Functions: Examples

Last updated:
Spark Date/Datetime Functions: Examples
Source
Table of Contents

Spark version 2.4.3 used

All code available on this jupyter notebook

Examples on how to use common date/datetime-related function on Spark SQL

Parse date string

Use to_date(Column) from org.apache.spark.sql.functions

import org.apache.spark.sql.functions.to_date

// source dataframe
val df = Seq(
    ("notebook","2019-01-01"),
    ("notebook", "2019-01-10"),
    ("small_phone", "2019-01-15"),
    ("small_phone", "2019-01-30")
).toDF("device", "purchase_date").sort("device","purchase_date")

// column "purchase_date" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_date,StringType))

// call to_date passing the date string column
val df2 = df.withColumn("purchase_date",to_date($"purchase_date"))

df2.dtypes
// >>> Array((device,StringType), (purchase_date,DateType))

Parse datetime string

This can be used with date strings too.

Use to_timestamp(Column).

import org.apache.spark.sql.functions.to_timestamp

// source dataframe
val df = Seq(
    ("notebook","2019-01-01 00:00:00"),
    ("notebook", "2019-01-10 13:00:00"),
    ("small_phone", "2019-01-15 12:00:00"),
    ("small_phone", "2019-01-30 09:30:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

// column "purchase_time" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_time,StringType))

// call to_date passing the datetime string column
val df2 = df.withColumn("purchase_time",to_date($"purchase_time"))

df2.dtypes
// >>> Array((device,StringType), (purchase_time,TimestampType))

Parse timestamp string, custom format

See Java SimpleDateFormat for ways to define the format

If you use alternative datetime formats, you can pass it as an extra argument to to_datetime.

Otherwise, it will result in null values.

import org.apache.spark.sql.functions.to_timestamp

val df = Seq(
    ("notebook","27/12/2019 12:00"),
    ("notebook", "01/12/2019 00:00"),
    ("small_phone", "23/01/2019 12:00"),
    ("small_phone", "27/12/2019 12:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

// providing the format like this prevents null values
df.withColumn("purchase_time",to_timestamp($"purchase_time","d/M/y H:m"))

source-dataframe Source dataframe with string columns
         
badly-converted-dates Calling to_timestamp without a
custom format gave you
null values in this case
correctly-parsed-dataframe Correctly parsed values when
you provide a custom format

Convert timestamp to date

You can also use to_date(Column) on a Timestamp column:

import java.sql.Timestamp
import org.apache.spark.sql.functions.to_date

val df = Seq(
    ("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
    ("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")

// call to_date as you would on a string column
df.withColumn("purchase_date",to_date($"purchase_time"))

dataframe-with-extra-date-column Derived column purchase_date was created by calling to_date
on a Timestamp column, namely purchase_time

Convert date to timestamp with zero hours

Again, you can you to_timestamp(Column) on columns of type Date too:

import java.sql.Date
import org.apache.spark.sql.functions.to_timestamp

val df = Seq(
    ("notebook",Date.valueOf("2019-01-29")),
    ("notebook", Date.valueOf("2019-01-01")),
    ("small_phone", Date.valueOf("2019-01-15")),
    ("small_phone", Date.valueOf("2019-01-01"))
).toDF("device", "purchase_date").sort("device","purchase_date")


df.withColumn("purchase_time",to_timestamp($"purchase_date"))

dataframe-with-extra-column Added a new column purchase_time by calling to_timestamp
on existing column purchase_date

Custom date/timestamp formatting

See Java SimpleDateFormat for ways to define the format

Use date_format(Column, formatStr) for formatting both Timestamp and Date columns to String:

import java.sql.Timestamp
import org.apache.spark.sql.functions.date_format

val df = Seq(
    ("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
    ("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")

df.withColumn("formatted_purchase_time",date_format($"purchase_time","y-MM"))

dataframe-with-formatted-column Created a new column by using "y-MM" format

Subtract/add days to date

HEADS-UP the output is always of type Date even if the inputs aren't

Use date_add(Column, num_days) date_sub(Column, num_days) to add and subtract days from the given Column.

Can be used on Date, Timestamp of String columns (when string is a valid date string)

import org.apache.spark.sql.functions.date_add

// note that the dates are just strings here
val df = Seq(
    ("notebook","2019-01-29 12:00:00"),
    ("notebook", "2019-01-01 00:00:00"),
    ("small_phone","2019-01-15 23:00:00"),
    ("small_phone", "2019-01-01 09:00:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

df.withColumn("plus_2_days",date_add($"purchase_time",2))

dataframe-with-days-added Created a column by adding 2 days to column
purchase_time.
Note that the time information is lost in the process!!

Datediff

HEADS-UP this functiona accepts Timestamps but the time part is ignored when comparing diff between days!

Performs the following arithmetic: first date - second date:

import org.apache.spark.sql.functions.datediff

// note that the dates are just strings
val df = Seq(
    ("notebook","2019-01-29", "2019-02-10"),
    ("notebook", "2019-01-01","2019-01-15"),
    ("small_phone","2019-01-15","2019-01-05"),
    ("small_phone", "2019-01-01","2019-01-20")
).toDF("device", "purchase_date", "arrival_date").sort("device","purchase_date")

df.withColumn("days_to_arrive",datediff($"arrival_date",$"purchase_date"))

source-dataframe Source dataframe: one column indicating
when a product was purchased and
another one with the date it
arrived at the buyer's house.
         
dataframe-with-diff-column-added Using datediff to calculate
the number of days it took
for a package to arrive.
Negative values here probably
indicate a bug somewhere
in the source system code

Dialogue & Discussion