Spark SQL Date/Datetime Function Examples

Spark SQL Date/Datetime Function Examples

Last updated:
Spark SQL Date/Datetime Function Examples
Source
Table of Contents

Spark version 2.4.8 used

All code available on this jupyter notebook

Examples on how to use common date/datetime-related function on Spark SQL

For stuff related to date arithmetic, see Spark SQL date/time Arithmetic examples: Adding, Subtracting, etc

to_date example

Use to_date(Column) from org.apache.spark.sql.functions.

import org.apache.spark.sql.functions.to_date

// source dataframe
val df = Seq(
    ("notebook","2019-01-01"),
    ("notebook", "2019-01-10"),
    ("small_phone", "2019-01-15"),
    ("small_phone", "2019-01-30")
).toDF("device", "purchase_date").sort("device","purchase_date")

// column "purchase_date" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_date,StringType))

// call to_date passing the date string column
val df2 = df.withColumn("purchase_date",to_date($"purchase_date"))

df2.dtypes
// >>> Array((device,StringType), (purchase_date,DateType))

to_date, custom date format

Pass a format string compatible with Java SimpleDateFormat

import org.apache.spark.sql.functions.to_date

val df = Seq(
    ("notebook","27/12/2019"),
    ("notebook", "01/12/2019"),
    ("small_phone", "23/01/2019"),
    ("small_phone", "27/12/2019")
).toDF("device", "purchase_date").sort("device","purchase_date")

// parse string into Date
df.withColumn("purchase_date",to_date($"purchase_date", "dd/MM/yyyy"))

source-dataframe-string-type BEFORE: column of type String with
dates in custom format
  
after-column-parsed-to-date-type AFTER: column parsed to type Date

to_timestamp example

Use to_timestamp(StringColumn).

import org.apache.spark.sql.functions.to_timestamp

// source dataframe
val df = Seq(
    ("notebook","2019-01-01 00:00:00"),
    ("notebook", "2019-01-10 13:00:00"),
    ("small_phone", "2019-01-15 12:00:00"),
    ("small_phone", "2019-01-30 09:30:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

// column "purchase_time" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_time,StringType))

// call to_date passing the datetime string column
val df2 = df.withColumn("purchase_time",to_timestamp($"purchase_time"))

df2.dtypes
// >>> Array((device,StringType), (purchase_time,TimestampType))

to_timestamp, custom datetime format

Pass a format string compatible with Java SimpleDateFormat

Otherwise, it will result in null values.

import org.apache.spark.sql.functions.to_timestamp

val df = Seq(
    ("notebook","27/12/2019 12:00"),
    ("notebook", "01/12/2019 00:00"),
    ("small_phone", "23/01/2019 12:00"),
    ("small_phone", "27/12/2019 12:00")
).toDF("device", "purchase_time").sort("device","purchase_time")

// providing the format like this prevents null values
df.withColumn("purchase_time",to_timestamp($"purchase_time","d/M/y H:m"))

source-dataframe Source dataframe with string columns
         
badly-converted-dates Calling to_timestamp without a
custom format gave you
null values in this case
correctly-parsed-dataframe Correctly parsed values when
you provide a custom format

timestamp to date

use to_date(TimestampColumn)

import java.sql.Timestamp
import org.apache.spark.sql.functions.to_date

val df = Seq(
    ("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
    ("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")

// call to_date as you would on a string column
df.withColumn("purchase_date",to_date($"purchase_time"))

dataframe-with-extra-date-column Derived column purchase_date was created by calling to_date
on a Timestamp column, namely purchase_time

date to timestamp at zero hours

Use to_timestamp(DateColumn) to convert a date to a timestamp with the same date, at midnight.

import java.sql.Date
import org.apache.spark.sql.functions.to_timestamp

val df = Seq(
    ("notebook",Date.valueOf("2019-01-29")),
    ("notebook", Date.valueOf("2019-01-01")),
    ("small_phone", Date.valueOf("2019-01-15")),
    ("small_phone", Date.valueOf("2019-01-01"))
).toDF("device", "purchase_date").sort("device","purchase_date")

df.withColumn("purchase_time",to_timestamp($"purchase_date"))

dataframe-with-extra-column Added a new column purchase_time by calling to_timestamp
on existing column purchase_date

Format timestamp

See Java SimpleDateFormat for ways to define the format

Use date_format(Column, formatStr) to apply custom formatting to both Timestamp and Date columns to String:

import java.sql.Timestamp
import org.apache.spark.sql.functions.date_format

val df = Seq(
    ("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
    ("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
    ("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")

df.withColumn("formatted_purchase_time",date_format($"purchase_time","y-MM"))

dataframe-with-formatted-column Created a new column by using "y-MM" format

Format date

custom timestamp formatting can be used for applying custom formatting to Date columns too.

Get hour from timestamp

There is a method called hour (also others such as minute, second, etc)

import org.apache.spark.sql.functions.hour

val df = Seq(
    ("foo", "2019-01-01 01:00:00.000"), 
    ("bar", "2019-01-01 12:30:00.000"), 
    ("baz", "2019-01-01 23:01:00.000") 
).toDF("col1", "some_timestamp")

df.withColumn("hour", hour($"some_timestamp"))

before Dataframe with a timestamp column
  
after-apllying-hour-method After extracting the hour from
the timestamp

Current timestamp

Use current_timestamp

import org.apache.spark.sql.functions.current_timestamp

val df = Seq(
    ("foo"), 
    ("bar"), 
    ("baz") 
).toDF("col1")

df
  .withColumn("now", current_timestamp)

current-timestamp Create a column with the current timestamp, down to
the milliseconds

Current date

Use current_date to get today's date:

import org.apache.spark.sql.functions.current_date

val df = Seq(
    ("foo"), 
    ("bar"), 
    ("baz") 
).toDF("col1")

df
  .withColumn("today", current_date)

current-date Use the aptly named current_date to
get today's date.

Start of the week

It's often useful to group data by the week in which it occurred (and then do group-bys and other aggregations).

To get the beginning of the week, use this helper function (dayNameToIndex) together with date_format:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// dummy dataframe for testing
val df = Seq(
    ("2018-12-28"), 
    ("2019-01-01"), 
    ("2019-01-04") 
).toDF("source_date")

// turn a day name (like "Wed") to its position on the week (e.g. 3)
def dayNameToIndex(col: Column) : Column = {
    when(col.isNull, null)
       .when(col === "Sun", 0)
       .when(col === "Mon", 1)
       .when(col === "Tue", 2)
       .when(col === "Wed", 3)
       .when(col === "Thu", 4)
       .when(col === "Fri", 5)
       .when(col === "Sat", 6)
}

// need to use expr because the number of days to subtract is a column value
df
  .withColumn("day_index", dayNameToIndex(date_format(col("source_date"), "E")))
  .withColumn("week_start", expr("date_sub(source_date, day_index)"))
  .drop("day_index") // dont need this

before-only-date BEFORE: a column with a date
  
after-with-week-start-column AFTER: new column with the
start of the week of source_date

Dialogue & Discussion