PySpark Error: cannot resolve '`timestamp`'


I have to find the exact hour that most checkins occur in the Yelp dataset, but I'm running up against this error for some reason. Here is my code so far:

from pyspark.sql.functions import udf
from pyspark.sql.functions import explode
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType,StringType
from pyspark.sql import functions as F

checkin = spark.read.json('yelp_academic_dataset_checkin.json.gz')
datesplit = udf(lambda x: x.split(','), ArrayType(StringType()))
dates ='business_id', datesplit('date').alias('dates')).withColumn('checkin_date', explode('dates'))
dates ="checkin_date")
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))).show(truncate=0)

And the error:

Py4JJavaError: An error occurred while calling o1112.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [checkin_date];;
'Project [date_trunc(checkin_date, to_timestamp('timestamp, Some(yyyy-MM-dd HH:mm:ss 'UTC')), Some(Etc/UTC)) AS checkin_date#190]
+- Project [checkin_date#176]
   +- Project [business_id#6, dates#172, checkin_date#176]
      +- Generate explode(dates#172), false, [checkin_date#176]
         +- Project [business_id#6, <lambda>(date#7) AS dates#172]
            +- Relation[business_id#6,date#7] json

dates is just a Spark DataFrame with a single column, "checkin_date", containing only datetime strings, so I'm not sure why this isn't working.

>Solution :

The error means that in the following line of code you are referencing a column named timestamp, and no such column exists in your DataFrame (its only column is checkin_date):

dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'")))

Indeed, here is the signature of the to_timestamp function:

pyspark.sql.functions.to_timestamp(col, format=None)

The first argument is the column to parse, the second is the format string. I am assuming you are trying to parse a date and then truncate it. Say you want to truncate the date to the month level; the correct way to do it would be:

dates.withColumn("checkin_date", F.date_trunc('month',
                   F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'")))
