Home Finding offending entries when pivot fails with a ValueError: Index contains duplicate entries, cannot reshape

Questions

Finding offending entries when pivot fails with a ValueError: Index contains duplicate entries, cannot reshape

July 11, 2024

I have a dataframe with 3 columns and datatypes: Datetime (Datetime dtype), Area (string), Value (float).

I want to pivot it so I get separate columns for each unique entry in Area.

df_pivot = pd.pivot(df,index='Datetime',columns='Area','values='Value')

Now I get the following error

ValueError: Index contains duplicate entries, cannot reshape.

I drop duplicates without subsetting and try again but get the same result.

I try to find the offending entries with no success:

duplicates = df[df.duplicated(subset=['Datetime','Area','Value'], keep=False)]

This returns an empty dataframe

So I try resetting the index and repeating subsetting with the nex index:

duplicates = df[df.duplicated(subset=['Datetime','Area','Value','index'], keep=False)]

I get the same result, an empty dataframe and an unpivottable df.

Is there some other way to find the offending entries?

>Solution :

You need find duplicates per Datetime and Area, not for all 3 columns, because new index and new column is created by Datetime and Area columns:

df = pd.DataFrame(
    {
    "Datetime" : [1,1,2,3,3],
    "Value" : [1,2,3,4,5],
    "Area" : ["D", "D", "D", "E", "E"]
    }
 )

Get duplicates per both columns:

duplicates = df[df.duplicated(subset=['Datetime','Area'], keep=False)]
print (duplicates)
   Datetime  Value Area
0         1      1    D
1         1      2    D
3         3      4    E
4         3      5    E

Rmeove duplicates per both columns, keep first values is possible by DataFrame.drop_duplicates:

print (df.drop_duplicates(subset=['Datetime','Area']))
   Datetime  Value Area
0         1      1    D
2         2      3    D
3         3      4    E

All toghether:

df_pivot = (df.drop_duplicates(subset=['Datetime','Area'])
              .pivot(index='Datetime',columns='Area',values='Value'))
print (df_pivot)
Area        D    E
Datetime          
1         1.0  NaN
2         3.0  NaN
3         NaN  4.0

Another idea is aggregate duplicated values, e.g. by sum:

df_pivot = df.pivot_table(index='Datetime',columns='Area',values='Value', aggfunc='sum')
print (df_pivot)
Area        D    E
Datetime          
1         3.0  NaN
2         3.0  NaN
3         NaN  9.0