PySpark toPandas ValueError: Found non-unique column index

I get the following error when I try to convert a PySpark DataFrame to a pandas DataFrame with the toPandas method. I don't understand the reason for the error:

 ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_64705/3870041712.py in <module>
----> 1 df_who.limit(10).toPandas()

/opt/miniforge/miniforge/envs/jupyterlab/lib/python3.7/site-packages/pyspark/sql/dataframe.py in toPandas(self)
   2130                     if len(batches) > 0:
   2131                         table = pyarrow.Table.from_batches(batches)
-> 2132                         pdf = table.to_pandas()
   2133                         pdf = _check_dataframe_convert_date(pdf, self.schema)
   2134                         return _check_dataframe_localize_timestamps(pdf, timezone)

/opt/miniforge/miniforge/envs/jupyterlab/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

/opt/miniforge/miniforge/envs/jupyterlab/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

/opt/miniforge/miniforge/envs/jupyterlab/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    786 
    787     _check_data_column_metadata_consistency(all_columns)
--> 788     columns = _deserialize_column_index(table, all_columns, column_indexes)
    789     blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    790 

/opt/miniforge/miniforge/envs/jupyterlab/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _deserialize_column_index(block_table, all_columns, column_indexes)
    901 
    902     # ARROW-1751: flatten a single level column MultiIndex for pandas 0.21.0
--> 903     columns = _flatten_single_level_multiindex(columns)
    904 
    905     return columns

/opt/miniforge/miniforge/envs/jupyterlab/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _flatten_single_level_multiindex(index)
   1142         # Cheaply check that we do not somehow have duplicate column names
   1143         if not index.is_unique:
-> 1144             raise ValueError('Found non-unique column index')
   1145 
   1146         return pd.Index(

ValueError: Found non-unique column index

Solution:

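According to the error, your DataFrame contains a repeated column name. Spark allows two columns to share a name (this commonly happens after a join where both sides have a column with the same name), but the PyArrow conversion used by toPandas raises this ValueError when the column names are not unique. Check the columns of the PySpark DataFrame (df_who.columns) and rename or drop the repeated ones before calling toPandas.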
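The check described above can be sketched in plain Python, since df.columns on a PySpark DataFrame is an ordinary list of strings. The helper names find_duplicate_columns and dedupe_column_names are hypothetical (they are not part of PySpark), and the numeric-suffix renaming scheme is only one possible choice. The example column names are made up for illustration.

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]


def dedupe_column_names(columns):
    """Return a list of unique names, adding _1, _2, ... to repeated names."""
    used = set()
    result = []
    for name in columns:
        candidate = name
        suffix = 0
        # Keep incrementing the suffix until the name is not taken yet.
        while candidate in used:
            suffix += 1
            candidate = f"{name}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


# Illustrative column list (made up); with Spark you would pass df_who.columns.
example_columns = ["id", "country", "id", "year"]
print(find_duplicate_columns(example_columns))
print(dedupe_column_names(example_columns))

# Assumed usage with the DataFrame from the question (requires a Spark session):
#   print(find_duplicate_columns(df_who.columns))
#   df_who = df_who.toDF(*dedupe_column_names(df_who.columns))
#   df_who.limit(10).toPandas()
```

DataFrame.toDF(*cols) returns a new DataFrame with the given column names applied positionally, so it can rename columns even when two of them currently share a name. If the duplicates come from a join, dropping or aliasing one side's column before the join is usually the cleaner fix.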
