
Why am I not able to convert a string column to a date format in PySpark?

I have a column whose values are in the "20130623" format, and I am trying to convert it to dd-mm-YYYY. I have looked at various posts online, including here, but I found only the solution below:

from datetime import datetime
df = df2.withColumn("col_name", datetime.utcfromtimestamp(int("col_name")).strftime('%d-%m-%y'))

However, it throws an error saying that the input should be an int, not a string. I tried converting it with the int() function, but that does not help either.

Below is the error that I see when converting:

invalid literal for int() with base 10: 'col_name'

I am not sure whether it is treating col_name itself as the string, or the column's value as the string.
Please suggest how I can do this, or the best way to get the required output.

Note: I cannot use pandas in my environment.

Thank you.

>Solution :

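withColumn() expects a Spark Column expression as its second argument, so you cannot use Python's datetime there. In your snippet, int("col_name") is evaluated by Python on the literal string "col_name" before Spark is involved at all, which is why the error mentions 'col_name' rather than a value from the column.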
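To see why the error names 'col_name' instead of a date value, it may help to run the inner expression on its own, without Spark. The sketch below uses only the standard library; the variable names error_message and as_epoch are introduced here for illustration. It also shows a second problem with the original attempt: a timestamp function reads 20130623 as seconds since the epoch, not as a yyyyMMdd date.

```python
from datetime import datetime, timezone

# Python evaluates int("col_name") eagerly, on the literal text "col_name",
# before withColumn() is ever called, so no column value is involved.
try:
    int("col_name")
except ValueError as exc:
    error_message = str(exc)

# Even with a real value, 20130623 is interpreted as seconds since the epoch,
# not as a yyyyMMdd date, so the result lands in 1970.
as_epoch = datetime.fromtimestamp(int("20130623"), tz=timezone.utc)

print(error_message)
print(as_epoch.year)
```

This is the same message as in the question, which suggests the failure happens in plain Python and not inside Spark.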

First parse the string to a date using to_date, and then format the date as a string using date_format:

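from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuse the active session if one exists (for example in a notebook), otherwise create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data=[["20130623"]], schema=["date_str"])

# Parse the yyyyMMdd string into a real date column
df = df.withColumn("date", F.to_date("date_str", format="yyyyMMdd"))
# Render the date as a dd-MM-yyyy string
df = df.withColumn("date_ddMMyyyy", F.date_format("date", format="dd-MM-yyyy"))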

df.show(truncate=False)
df.printSchema()

Output:

+--------+----------+-------------+
|date_str|date      |date_ddMMyyyy|
+--------+----------+-------------+
|20130623|2013-06-23|23-06-2013   |
+--------+----------+-------------+

root
 |-- date_str: string (nullable = true)
 |-- date: date (nullable = true)
 |-- date_ddMMyyyy: string (nullable = true)

The value 2013-06-23 shown in the date column is only Spark's default string representation (yyyy-MM-dd) of a date value; the column itself has the date type, as the schema shows.
