I have a dataframe with columns named "lastname$$" and "firstname$$":
+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James | |Smith |[Java, Scala, C++]|OH |M |
|Anna |Rose | |[Spark, Java, C++]|NY |F |
|Julia | |Williams |[CSharp, VB] |OH |F |
|Maria |Anne |Jones |[CSharp, VB] |NY |M |
|Jen |Mary |Brown |[CSharp, VB] |NY |M |
|Mike |Mary |Williams |[Python, VB] |OH |M |
+-----------+----------+----------+------------------+-----+------+
I am trying to filter this dataframe on firstname$$ = "James" and am getting the following error:
ParseException                            Traceback (most recent call last)
d:\Users\sample.ipynb Cell 3 in <cell line: 1>()
----> 1 df_sam1 = df_sam.where("firstname$$ == 'James'")
      2 df_sam1.show()

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\pyspark\sql\dataframe.py:2077, in DataFrame.filter(self, condition)
   2076 if isinstance(condition, str):
-> 2077     jdf = self._jdf.filter(condition)

ParseException: 
Syntax error at or near '$'(line 1, pos 9)

== SQL ==
firstname$$ == 'James'
---------^^^
Here is the code I am running:
df_sam1 = df_sam.where("firstname$$ == 'James'")
df_sam1.show()
I tried renaming the column to firstname and am still getting the same error.
Can anyone help me resolve this issue?
Thanks,
Bab
>Solution :
In Spark SQL, the $ character is not allowed in column names when they are referenced inside SQL expression strings, which is why the parser fails at that position. To work around this, you can either use the DataFrame API functions for filtering:
from pyspark.sql.functions import col
df_sam1 = df_sam.filter(col("firstname$$") == 'James')
df_sam1.show()
or rename the column:
df_sam_renamed = (df_sam
    .withColumnRenamed("firstname$$", "firstname")
    .withColumnRenamed("lastname$$", "lastname"))
df_sam1 = df_sam_renamed.where("firstname == 'James'")
df_sam1.show()