I have a dataframe with columns named "lastname$$" and "firstname$$":
+-----------+----------+----------+------------------+-----+------+
|firstname$$|middlename|lastname$$|languages |state|gender|
+-----------+----------+----------+------------------+-----+------+
|James | |Smith |[Java, Scala, C++]|OH |M |
|Anna |Rose | |[Spark, Java, C++]|NY |F |
|Julia | |Williams |[CSharp, VB] |OH |F |
|Maria |Anne |Jones |[CSharp, VB] |NY |M |
|Jen |Mary |Brown |[CSharp, VB] |NY |M |
|Mike |Mary |Williams |[Python, VB] |OH |M |
+-----------+----------+----------+------------------+-----+------+
I am trying to filter this dataframe on firstname$$ = "James" and am getting the following error:
ParseException                            Traceback (most recent call last)
d:\Users\sample.ipynb Cell 3 in <cell line: 1>()
----> 1 df_sam1 = df_sam.where("firstname$$ == 'James'")
      2 df_sam1.show()

File D:\SparkApp\spark-3.3.1-bin-hadoop3\python\pyspark\sql\dataframe.py:2077, in DataFrame.filter(self, condition)
   2076 if isinstance(condition, str):
-> 2077     jdf = self._jdf.filter(condition)

ParseException: 
Syntax error at or near '$'(line 1, pos 9)

== SQL ==
firstname$$ == 'James'
---------^^^
Here is the code I am running:
df_sam1 = df_sam.where("firstname$$ == 'James'")
df_sam1.show()
I tried renaming the column to firstname and am still getting the same error.
Can anyone help me resolve this issue?
Thanks,
Bab
>Solution :
In Spark SQL, the $ character is not allowed in column names when they are referenced inside SQL expression strings, which is why the parser fails at that position. To work around this, you can either use the DataFrame API functions for filtering:
from pyspark.sql.functions import col
df_sam1 = df_sam.filter(col("firstname$$") == 'James')
df_sam1.show()
or rename the column:
df_sam_renamed = (df_sam
    .withColumnRenamed("firstname$$", "firstname")
    .withColumnRenamed("lastname$$", "lastname"))
df_sam1 = df_sam_renamed.where("firstname == 'James'")
df_sam1.show()