my dataframe looks like this:
| accountId | income | dateOfOrder
| 123 | 60000 | 56347264327_01_20200110
| 321 | 52000 | 54346262452_01_20200218
I want to take the header dateOfOrder and change it to acct_order_dt and only use the last 8 characters which are dates in yyyymmdd. I want to preserve the order of this pyspark dataframe.
I am currently using this method but I dont think it’s preserving the order:
sample_data = sample_data.withColumn("acct_order_dt", to_date(substring(col("dateOfOrder"),-8,8), "yyyyMMdd")).drop("dateOfOrder")
>Solution :
To achieve this, you can use the withColumn
method and the substring function. Here’s an example code in PySpark:
from pyspark.sql.functions import substring
df = df.withColumn("acct_order_dt", substring(df["dateOfOrder"], -8, 8))
df = df.drop("dateOfOrder")
df = df.selectExpr("accountId", "income", "acct_order_dt")
The first line uses withColumn
to add a new column acct_order_dt
to the DataFrame. The substring
function is used to extract the last 8 characters of the dateOfOrder
column and store them in acct_order_dt
.
scala apache Spark snippet :
import org.apache.spark.sql.functions._
val df2 = df.withColumn("acct_order_dt", substring(col("dateOfOrder"), -8, 8))
.drop("dateOfOrder")
.select("accountId", "income", "acct_order_dt")
since you got the 8 characters you can apply any sql function of your choice to convert your required date format