I am trying to use Spark SQL to query a table where user-entered dates don't follow any strict formatting, so you see entries such as "10/4/21" and "10/04/21", for example. My query attempt was
select CAST(unix_timestamp(thedate, "MM/d/yy") AS TIMESTAMP) FROM table
and this works for dates like "10/4/21" but not for "10/04/21" (and vice versa: "MM/dd/yy" works for "10/04/21" but not for "10/4/21"). Is there any way to write the query so that it works for either date format?
>Solution :
You can convert thedate using both formats and wrap the results in a COALESCE: unix_timestamp returns NULL when the string does not match the given pattern, so COALESCE returns the result from whichever format parses successfully.
sql = """
SELECT thedate,
Cast(COALESCE(unix_timestamp(thedate, "MM/d/yy"),
unix_timestamp(thedate, "MM/dd/yy")
) AS TIMESTAMP) AS thedate_astimestamp
FROM (SELECT Explode(Array('10/4/21', '10/04/21', '10/11/21')) AS thedate)
"""
spark.sql(sql).show()
Output
+--------+-------------------+
| thedate|thedate_astimestamp|
+--------+-------------------+
| 10/4/21|2021-10-04 00:00:00|
|10/04/21|2021-10-04 00:00:00|
|10/11/21|2021-10-11 00:00:00|
+--------+-------------------+
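The COALESCE approach is really just "try each format in order and keep the first one that parses". As a minimal plain-Python sketch of the same fallback idea (using only the stdlib datetime module, not Spark, and with a hypothetical parse_date helper), it looks like this:

```python
from datetime import datetime

# Candidate formats, tried in order -- this mirrors the COALESCE in the SQL above.
# Note: strptime's %m and %d happen to accept both zero-padded and unpadded
# numbers, so one format often suffices in plain Python; the list is kept to
# illustrate the fallback pattern itself.
FORMATS = ["%m/%d/%y", "%m/%d/%Y"]

def parse_date(s):
    """Return a datetime for the first matching format, or None
    (analogous to unix_timestamp returning NULL on a parse failure)."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue  # this format didn't match; try the next one
    return None

for raw in ["10/4/21", "10/04/21", "10/11/21"]:
    print(raw, "->", parse_date(raw))
```

Both "10/4/21" and "10/04/21" come back as the same date, and a string that matches no format yields None, which is exactly the NULL that COALESCE skips over in the SQL version.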