Dataframe column name with $$ failing in filter condition with parse error

I have a dataframe with columns named "firstname$$" and "lastname$$":

+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages         |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James      |          |Smith     |[Java, Scala, C++]|OH   |M     |
|Anna       |Rose      |          |[Spark, Java, C++]|NY   |F     |
|Julia      |          |Williams  |[CSharp, VB]      |OH   |F     |
|Maria      |Anne      |Jones     |[CSharp, VB]      |NY   |M     |
|Jen        |Mary      |Brown     |[CSharp, VB]      |NY   |M     |
|Mike       |Mary      |Williams  |[Python, VB]      |OH   |M     |
+-----------+----------+----------+------------------+-----+------+

I am trying to filter this dataframe for firstname$$ = "James" and am getting the following error:

ParseException                            Traceback (most recent call last)
d:\Users\sample.ipynb in <cell line: 1>
----> 1 df_sam1 = df_sam.where("firstname$$ == 'James'")
      2 df_sam1.show()

...

ParseException:
Syntax error at or near '$'(line 1, pos 9)

== SQL ==
firstname$$ == 'James'
---------^^^

Here is the code I am trying:

df_sam1 = df_sam.where("firstname$$ == 'James'")
df_sam1.show()

I tried renaming the column to firstname and I am still getting the same error.

Can anyone help me resolve this issue?

Thanks,
Bab

>Solution :

In Spark SQL, the $ character is not allowed in unquoted column names when using SQL expression strings. To work around this, you can either use the DataFrame API functions for filtering:

from pyspark.sql.functions import col

df_sam1 = df_sam.filter(col("firstname$$") == 'James')
df_sam1.show()

or rename the column:

df_sam_renamed = (
    df_sam.withColumnRenamed("firstname$$", "firstname")
          .withColumnRenamed("lastname$$", "lastname")
)

df_sam1 = df_sam_renamed.where("firstname == 'James'")
df_sam1.show()