Chaining joins in Pyspark

I am trying to join multiple dataframes in PySpark in one chained operation. The join key column has the same name in all of them. The code snippet:

columns_summed = [i for i in df_summed.columns if i != "buildingBlock_id"]
columns_concat = [i for i in df_concat.columns if i != "buildingBlock_id"]
columns_indicator = [i for i in df_indicator_fields.columns if i != "buildingBlock_id"]
columns_takeone = [i for i in df_takeone.columns if i != "buildingBlock_id"]
columns_minmax = [i for i in df_minmax.columns if i != "buildingBlock_id"]

df_all_joined = (df_summed.alias("df1").join(df_concat,df_summed.buildingBlock_id == df_concat.buildingBlock_id, "left")
                          .join(df_indicator_fields,df_summed.buildingBlock_id == df_indicator_fields.buildingBlock_id, "left")
                          .join(df_takeone,df_summed.buildingBlock_id == df_takeone.buildingBlock_id, "left")
                          .join(df_minmax,df_summed.buildingBlock_id == df_minmax.buildingBlock_id, "left")
                          .select("df1.buildingBlock_id", *columns_summed
                          , *columns_concat
                          , *columns_indicator
                          , *columns_takeone
                          , *columns_minmax
                          )
                          )

Now, when I try to display the joined dataframe using:

display(df_all_joined)

I’m getting the following error:


AnalysisException: Reference 'df1.buildingBlock_id' is ambiguous, could be: df1.buildingBlock_id, df1.buildingBlock_id.

Why am I getting this error even though I specified where the key column should come from?

> Solution:

You should specify the join key as a list of column names:

.join(df_concat,['buildingBlock_id'], "left")

When the columns you are joining on share a name, this form keeps only a single copy of the key in the result. For a left join, the surviving values come from the left side; the duplicate column from df_concat is dropped.

If you join with an equality expression instead, both copies of the column remain in the joined dataframe, so any later reference to buildingBlock_id matches more than one column, which raises the "ambiguous reference" AnalysisException.
