Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

PySpark Error: cannot resolve '`timestamp`'

I have to find the exact hour that most checkins occur in the Yelp dataset but I’m running up against this error for some reason. Here is my code so far:

from pyspark.sql.functions import udf
from pyspark.sql.functions import explode
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType,StringType
from pyspark.sql import functions as F

square_udf_int = udf(lambda z: square(z), IntegerType())

checkin = spark.read.json('yelp_academic_dataset_checkin.json.gz')
datesplit = udf(lambda x: x.split(','),ArrayType(StringType()))
checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
datesplit = udf(lambda x: x.split(','),ArrayType(StringType()))
dates = checkin.select('business_id',datesplit('date').alias('dates')).withColumn('checkin_date',explode('dates'))
dates = dates.select("checkin_date")
dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))).show(truncate=0)

And the error:

Py4JJavaError: An error occurred while calling o1112.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [checkin_date];;
'Project [date_trunc(checkin_date, to_timestamp('timestamp, Some(yyyy-MM-dd HH:mm:ss 'UTC')), Some(Etc/UTC)) AS checkin_date#190]
+- Project [checkin_date#176]
   +- Project [business_id#6, dates#172, checkin_date#176]
      +- Generate explode(dates#172), false, [checkin_date#176]
         +- Project [business_id#6, <lambda>(date#7) AS dates#172]
            +- Relation[business_id#6,date#7] json

dates is just a Spark dataframe with one column named: "checkin_date" with only datetimes, so I’m not sure why this isn’t working.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

The error you obtain simply means that in the following line of code, you are trying to access a column named timestamp and that it does not exist.

dates.withColumn("checkin_date", F.date_trunc('checkin_date',
                   F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'")))

Indeed, here is the signature of the to_timestamp function:

pyspark.sql.functions.to_timestamp(col, format=None)

The first argument is the column, the second is the format. I am assuming you are trying to parse a date and then truncate it. Let’s say you want to truncate the date to the month level. The correct way to do it would be:

dates.withColumn("checkin_date", F.date_trunc('month',
                   F.to_timestamp('checkin_date', "yyyy-MM-dd HH:mm:ss 'UTC'")))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading