Spark SQL Date/Datetime Function Examples


Spark version 2.4.6 used

All code is available in this Jupyter notebook

Examples of how to use common date/datetime-related functions in Spark SQL.

to_date example

Use to_date(Column) from org.apache.spark.sql.functions.

import org.apache.spark.sql.functions.to_date

// source dataframe
val df = Seq(
    ("notebook","2019-01-01"),
    ("notebook", "2019-01-10"),
    ("small_phone", "2019-01-15"),
    ("small_phone", "2019-01-30")
).toDF("device", "purchase_date").sort("device","purchase_date")

// column "purchase_date" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_date,StringType))

// call to_date passing the date string column
val df2 = df.withColumn("purchase_date",to_date($"purchase_date"))

df2.dtypes
// >>> Array((device,StringType), (purchase_date,DateType))

to_date, custom date format

Pass a format string compatible with Java SimpleDateFormat.

import org.apache.spark.sql.functions.to_date

val df = Seq(
    ("notebook","27/12/2019"),
    ("notebook", "01/12/2019"),
    ("small_phone", "23/01/2019"),
    ("small_phone", "27/12/2019")
).toDF("device", "purchase_date").sort("device","purchase_date")

// parse string into Date
df.withColumn("purchase_date",to_date($"purchase_date", "dd/MM/yyyy"))

BEFORE: column of type String with dates in a custom format
AFTER: column parsed to type Date

to_timestamp example

Use to_timestamp(StringColumn).

import org.apache.spark.sql.functions.to_timestamp

// source dataframe
val df = Seq(
    ("notebook","2019-01-01 00:00:00"),
    ("notebook", "2019-01-10 13:00:00"),
    ("small_phone", "2019-01-15 12:00:00"),
    ("small_phone", "2019-01-30 09:30:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

// column "purchase_time" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_time,StringType))

// call to_timestamp passing the datetime string column
val df2 = df.withColumn("purchase_time",to_timestamp($"purchase_time"))

df2.dtypes
// >>> Array((device,StringType), (purchase_time,TimestampType))

to_timestamp, custom datetime format

Pass a format string compatible with Java SimpleDateFormat.

Otherwise, values that cannot be parsed result in null.

import org.apache.spark.sql.functions.to_timestamp

val df = Seq(
    ("notebook","27/12/2019 12:00"),
    ("notebook", "01/12/2019 00:00"),
    ("small_phone", "23/01/2019 12:00"),
    ("small_phone", "27/12/2019 12:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

// providing the format like this prevents null values
df.withColumn("purchase_time",to_timestamp($"purchase_time","d/M/y H:m"))

Source dataframe with string columns
Calling to_timestamp without a custom format yields null values in this case
Correctly parsed values when you provide a custom format

Timestamp to date

Use to_date(TimestampColumn).

import java.sql.Timestamp
import org.apache.spark.sql.functions.to_date

val df = Seq(
    ("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
    ("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")

// call to_date as you would on a string column
df.withColumn("purchase_date",to_date($"purchase_time"))

Derived column purchase_date was created by calling to_date on the Timestamp column purchase_time

Date to timestamp at zero hours

Use to_timestamp(DateColumn) to convert a date to a timestamp with the same date, at midnight.

import java.sql.Date
import org.apache.spark.sql.functions.to_timestamp

val df = Seq(
    ("notebook",Date.valueOf("2019-01-29")),
    ("notebook", Date.valueOf("2019-01-01")),
    ("small_phone", Date.valueOf("2019-01-15")),
    ("small_phone", Date.valueOf("2019-01-01"))
).toDF("device", "purchase_date").sort("device","purchase_date")

df.withColumn("purchase_time",to_timestamp($"purchase_date"))

Added a new column purchase_time by calling to_timestamp on the existing column purchase_date

Format timestamp

See Java SimpleDateFormat for ways to define the format.

Use date_format(Column, formatStr) to apply custom formatting to both Timestamp and Date columns, producing a String:

import java.sql.Timestamp
import org.apache.spark.sql.functions.date_format

val df = Seq(
    ("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
    ("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")

df.withColumn("formatted_purchase_time",date_format($"purchase_time","y-MM"))

Created a new column by using the "y-MM" format

Format date

The same date_format function can be used to apply custom formatting to Date columns too.
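
For instance, a minimal sketch reusing the format from the previous example (the data here is illustrative):

import java.sql.Date
import org.apache.spark.sql.functions.date_format

val df = Seq(
    ("notebook", Date.valueOf("2019-01-29")),
    ("small_phone", Date.valueOf("2019-01-15"))
).toDF("device", "purchase_date")

// date_format accepts Date columns just like Timestamp columns
df.withColumn("formatted_purchase_date", date_format($"purchase_date", "y-MM"))
// e.g. 2019-01-29 is formatted as the string "2019-01"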

Get hour from timestamp

There is a function called hour (along with others such as minute, second, etc.):

import org.apache.spark.sql.functions.hour

val df = Seq(
    ("foo", "2019-01-01 01:00:00.000"), 
    ("bar", "2019-01-01 12:30:00.000"), 
    ("baz", "2019-01-01 23:01:00.000") 
).toDF("col1", "some_timestamp")

df.withColumn("hour", hour($"some_timestamp"))

BEFORE: dataframe with a timestamp column
AFTER: the hour extracted from the timestamp

Subtract/add days to date

HEADS-UP: the output is always of type Date, even if the inputs aren't!

Use date_add(Column, num_days) or date_sub(Column, num_days) to add days to or subtract days from the given column.

These can be used on Date, Timestamp, or String columns (when the string is a valid date string).

import org.apache.spark.sql.functions.date_add

// note that the dates are just strings here
val df = Seq(
    ("notebook","2019-01-29 12:00:00"),
    ("notebook", "2019-01-01 00:00:00"),
    ("small_phone","2019-01-15 23:00:00"),
    ("small_phone", "2019-01-01 09:00:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

df.withColumn("plus_2_days",date_add($"purchase_time",2))

Created a column by adding 2 days to column purchase_time. Note that the time information is lost in the process!

Datediff example

HEADS-UP: this function accepts Timestamps, but the time part is ignored when computing the difference in days!

It performs the following arithmetic: first argument minus second argument, in days:

import org.apache.spark.sql.functions.datediff

// note that the dates are just strings
val df = Seq(
    ("notebook","2019-01-29", "2019-02-10"),
    ("notebook", "2019-01-01","2019-01-15"),
    ("small_phone","2019-01-15","2019-01-05"),
    ("small_phone", "2019-01-01","2019-01-20")
).toDF("device", "purchase_date", "arrival_date").sort("device","purchase_date")

df.withColumn("days_to_arrive",datediff($"arrival_date", $"purchase_date"))

Source dataframe: one column indicating when a product was purchased and another with the date it arrived at the buyer's house.
Using datediff to calculate the number of days it took for a package to arrive. Negative values here probably indicate a bug somewhere in the source system code.

Difference in days

See the Datediff example above.

Difference in seconds

Convert datetime to unix timestamp using unix_timestamp and subtract:

import org.apache.spark.sql.functions.unix_timestamp

val df = Seq(
    ("foo", "2019-01-01 00:00:00", "2019-01-01 01:00:00"), // 1 hour apart
    ("bar", "2019-01-01 00:00:00", "2019-01-02 00:00:00"), // 24 hours apart
    ("baz", "2019-01-01 00:00:00", "2019-01-07 00:00:00")  // 7 days apart
).toDF("col1", "purchase_time", "arrival_time").sort("col1","purchase_time")

df
  .withColumn("diff_in_seconds", unix_timestamp($"arrival_time") - unix_timestamp($"purchase_time"))

BEFORE: a dataframe with two datetime columns
AFTER: the difference in seconds between purchase_time and arrival_time

Difference in milliseconds

Cast Timestamp to Double, subtract and multiply by 1000:

import org.apache.spark.sql.functions.to_timestamp

val df = Seq(
    ("foo", "2019-01-01 00:00:00.000", "2019-01-01 00:00:00.400"), // 400 millis apart
    ("bar", "2019-01-01 00:00:00.000", "2019-01-01 00:00:00.650"), // 650 millis apart
    ("baz", "2019-01-01 00:00:00.000", "2019-01-01 00:01:00.000")  // 1 minute apart
).toDF("col1", "time_before", "time_after")
.withColumn("time_before", to_timestamp($"time_before")) // string to timestamp
.withColumn("time_after", to_timestamp($"time_after")) // string to timestamp


// cast to double, subtract, multiply by 1000, cast back to long
df
.withColumn("diff_millis", $"time_after".cast("double") - $"time_before".cast("double"))
.withColumn("diff_millis", ($"diff_millis"*1000).cast("long"))

BEFORE: source dataframe has two timestamp columns with millisecond precision
AFTER: created a new column with the difference between the two timestamp columns (in milliseconds)

Subtract from timestamp

If you use date_sub (or date_add) on timestamps, they get truncated to the date at zero hours.

To subtract from Timestamps, use $"colname" - expr("INTERVAL ...").

Example: subtract 24 hours from a timestamp:

import org.apache.spark.sql.functions.expr

val df = Seq(
    ("foo", "2019-10-10 00:45:00"), 
    ("bar", "2019-10-10 12:34:56"), 
    ("baz", "2019-10-10 23:59:00") 
).toDF("col1", "timestamp")

df
.withColumn("timestamp_minus_24_hours", $"timestamp" - expr("INTERVAL 24 HOURS"))

To subtract 24 hours from a timestamp column, use $"col" - expr("INTERVAL 24 HOURS")

Current timestamp

Use current_timestamp:

import org.apache.spark.sql.functions.current_timestamp

val df = Seq(
    "foo",
    "bar",
    "baz"
).toDF("col1")

df
  .withColumn("now", current_timestamp)

Create a column with the current timestamp, down to the milliseconds

Current date

Use current_date to get today's date:

import org.apache.spark.sql.functions.current_date

val df = Seq(
    "foo",
    "bar",
    "baz"
).toDF("col1")

df
  .withColumn("today", current_date)

Use the aptly named current_date to get today's date
