Spark SQL Date/Datetime Function Examples
- to_date example
- to_date, custom date format
- to_timestamp example
- to_timestamp, custom datetime format
- timestamp to date
- date to timestamp at zero hours
- Format timestamp
- Format date
- Get hour from timestamp
- Current timestamp
- Current date
- Start of the week
Spark version 2.4.8 used
All code is available in this Jupyter notebook.
Examples of how to use common date/datetime-related functions in Spark SQL.
For stuff related to date arithmetic, see Spark SQL date/time Arithmetic examples: Adding, Subtracting, etc
to_date example
Use to_date(Column) from org.apache.spark.sql.functions.
import org.apache.spark.sql.functions.to_date
// source dataframe
val df = Seq(
("notebook","2019-01-01"),
("notebook", "2019-01-10"),
("small_phone", "2019-01-15"),
("small_phone", "2019-01-30")
).toDF("device", "purchase_date").sort("device","purchase_date")
// column "purchase_date" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_date,StringType))
// call to_date passing the date string column
val df2 = df.withColumn("purchase_date",to_date($"purchase_date"))
df2.dtypes
// >>> Array((device,StringType), (purchase_date,DateType))
to_date, custom date format
Pass a format string compatible with Java SimpleDateFormat:
import org.apache.spark.sql.functions.to_date
val df = Seq(
("notebook","27/12/2019"),
("notebook", "01/12/2019"),
("small_phone", "23/01/2019"),
("small_phone", "27/12/2019")
).toDF("device", "purchase_date").sort("device","purchase_date")
// parse string into Date
df.withColumn("purchase_date",to_date($"purchase_date", "dd/MM/yyyy"))
Column purchase_date goes from String (dates in a custom format) to Date.
to_timestamp example
Use to_timestamp(StringColumn).
import org.apache.spark.sql.functions.to_timestamp
// source dataframe
val df = Seq(
("notebook","2019-01-01 00:00:00"),
("notebook", "2019-01-10 13:00:00"),
("small_phone", "2019-01-15 12:00:00"),
("small_phone", "2019-01-30 09:30:00")
).toDF("device", "purchase_time").sort("device","purchase_time")
// column "purchase_time" is of type string
df.dtypes
// >>> Array((device,StringType), (purchase_time,StringType))
// call to_timestamp passing the datetime string column
val df2 = df.withColumn("purchase_time",to_timestamp($"purchase_time"))
df2.dtypes
// >>> Array((device,StringType), (purchase_time,TimestampType))
to_timestamp, custom datetime format
Pass a format string compatible with Java SimpleDateFormat. Otherwise, parsing will result in null values.
import org.apache.spark.sql.functions.to_timestamp
val df = Seq(
("notebook","27/12/2019 12:00"),
("notebook", "01/12/2019 00:00"),
("small_phone", "23/01/2019 12:00"),
("small_phone", "27/12/2019 12:00")
).toDF("device", "purchase_time").sort("device","purchase_time")
// providing the format like this prevents null values
df.withColumn("purchase_time",to_timestamp($"purchase_time","d/M/y H:m"))
Without a custom format, to_timestamp would have given you null values in this case; the fix is to provide a custom format.
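For comparison, a minimal sketch (same dataframe as above) of what happens when no format string is given:
// without a custom format, these "dd/MM/yyyy HH:mm" strings don't match the
// default "yyyy-MM-dd HH:mm:ss"-style pattern, so every value parses to null
df.withColumn("purchase_time", to_timestamp($"purchase_time"))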
timestamp to date
Use to_date(TimestampColumn) to extract the date part of a timestamp:
import java.sql.Timestamp
import org.apache.spark.sql.functions.to_date
val df = Seq(
("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")
// call to_date as you would on a string column
df.withColumn("purchase_date",to_date($"purchase_time"))
Column purchase_date was created by calling to_date on a Timestamp column, namely purchase_time.
date to timestamp at zero hours
Use to_timestamp(DateColumn) to convert a date to a timestamp with the same date, at midnight:
import java.sql.Date
import org.apache.spark.sql.functions.to_timestamp
val df = Seq(
("notebook",Date.valueOf("2019-01-29")),
("notebook", Date.valueOf("2019-01-01")),
("small_phone", Date.valueOf("2019-01-15")),
("small_phone", Date.valueOf("2019-01-01"))
).toDF("device", "purchase_date").sort("device","purchase_date")
df.withColumn("purchase_time",to_timestamp($"purchase_date"))
Column purchase_time was created by calling to_timestamp on the existing column purchase_date.
Format timestamp
Use date_format(Column, formatStr) to apply custom formatting to both Timestamp and Date columns, turning them into Strings. See Java SimpleDateFormat for ways to define the format:
import java.sql.Timestamp
import org.apache.spark.sql.functions.date_format
val df = Seq(
("notebook",Timestamp.valueOf("2019-01-29 12:00:00")),
("notebook", Timestamp.valueOf("2019-01-01 00:00:00")),
("small_phone", Timestamp.valueOf("2019-01-15 23:00:00")),
("small_phone", Timestamp.valueOf("2019-01-01 09:00:00"))
).toDF("device", "purchase_time").sort("device","purchase_time")
df.withColumn("formatted_purchase_time",date_format($"purchase_time","y-MM"))
"y-MM"
format
Format date
The same date_format function used above for timestamps can be used to apply custom formatting to Date columns too.
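A minimal sketch, along the lines of the timestamp example above (same notebook setup assumed):
import java.sql.Date
import org.apache.spark.sql.functions.date_format
val df = Seq(
("notebook", Date.valueOf("2019-01-29")),
("small_phone", Date.valueOf("2019-01-15"))
).toDF("device", "purchase_date")
// date_format accepts Date columns just like Timestamp columns
df.withColumn("formatted_purchase_date", date_format($"purchase_date", "y-MM"))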
Get hour from timestamp
There is a function called hour (and others such as minute, second, etc.):
import org.apache.spark.sql.functions.hour
val df = Seq(
("foo", "2019-01-01 01:00:00.000"),
("bar", "2019-01-01 12:30:00.000"),
("baz", "2019-01-01 23:01:00.000")
).toDF("col1", "some_timestamp")
df.withColumn("hour", hour($"some_timestamp"))
Column hour contains the hour extracted from the timestamp.
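The same pattern works for the other field extractors; a short sketch pulling minute and second from the same dataframe:
import org.apache.spark.sql.functions.{minute, second}
df
.withColumn("minute", minute($"some_timestamp"))
.withColumn("second", second($"some_timestamp"))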
Current timestamp
Use current_timestamp to get the current time:
import org.apache.spark.sql.functions.current_timestamp
val df = Seq(
("foo"),
("bar"),
("baz")
).toDF("col1")
df
.withColumn("now", current_timestamp)
Note the milliseconds in the generated now column.
Current date
Use current_date to get today's date:
import org.apache.spark.sql.functions.current_date
val df = Seq(
("foo"),
("bar"),
("baz")
).toDF("col1")
df
.withColumn("today", current_date)
Start of the week
It's often useful to group data by the week in which it occurred (and then do group-bys and other aggregations).
To get the beginning of the week, use this helper function (dayNameToIndex) together with date_format:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._
// dummy dataframe for testing
val df = Seq(
("2018-12-28"),
("2019-01-01"),
("2019-01-04")
).toDF("source_date")
// turn a day name (like "Wed") to its position on the week (e.g. 3)
def dayNameToIndex(col: Column) : Column = {
when(col.isNull, null)
.when(col === "Sun", 0)
.when(col === "Mon", 1)
.when(col === "Tue", 2)
.when(col === "Wed", 3)
.when(col === "Thu", 4)
.when(col === "Fri", 5)
.when(col === "Sat", 6)
}
// need to use expr because the number of days to subtract is a column value
df
.withColumn("day_index", dayNameToIndex(date_format(col("source_date"), "E")))
.withColumn("week_start", expr("date_sub(source_date, day_index)"))
.drop("day_index") // dont need this
Column week_start holds the start of the week of source_date.
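With week_start in place, the week-level group-bys mentioned above are straightforward; a minimal sketch counting rows per week:
df
.withColumn("day_index", dayNameToIndex(date_format(col("source_date"), "E")))
.withColumn("week_start", expr("date_sub(source_date, day_index)"))
.groupBy("week_start")
.count()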