Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Finding offending entries when pivot fails with a ValueError: Index contains duplicate entries, cannot reshape

I have a dataframe with 3 columns and datatypes: Datetime (Datetime dtype), Area (string), Value (float).

I want to pivot it so I get separate columns for each unique entry in Area.

df_pivot = pd.pivot(df,index='Datetime',columns='Area','values='Value')

Now I get the following error

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ValueError: Index contains duplicate entries, cannot reshape.

I drop duplicates without subsetting and try again but get the same result.

I try to find the offending entries with no success:

duplicates = df[df.duplicated(subset=['Datetime','Area','Value'], keep=False)]

This returns an empty dataframe

So I try resetting the index and repeating subsetting with the nex index:

duplicates = df[df.duplicated(subset=['Datetime','Area','Value','index'], keep=False)]

I get the same result, an empty dataframe and an unpivottable df.

Is there some other way to find the offending entries?

>Solution :

You need find duplicates per Datetime and Area, not for all 3 columns, because new index and new column is created by Datetime and Area columns:

df = pd.DataFrame(
    {
    "Datetime" : [1,1,2,3,3],
    "Value" : [1,2,3,4,5],
    "Area" : ["D", "D", "D", "E", "E"]
    }
 )

Get duplicates per both columns:

duplicates = df[df.duplicated(subset=['Datetime','Area'], keep=False)]
print (duplicates)
   Datetime  Value Area
0         1      1    D
1         1      2    D
3         3      4    E
4         3      5    E

Rmeove duplicates per both columns, keep first values is possible by DataFrame.drop_duplicates:

print (df.drop_duplicates(subset=['Datetime','Area']))
   Datetime  Value Area
0         1      1    D
2         2      3    D
3         3      4    E

All toghether:

df_pivot = (df.drop_duplicates(subset=['Datetime','Area'])
              .pivot(index='Datetime',columns='Area',values='Value'))
print (df_pivot)
Area        D    E
Datetime          
1         1.0  NaN
2         3.0  NaN
3         NaN  4.0

Another idea is aggregate duplicated values, e.g. by sum:

df_pivot = df.pivot_table(index='Datetime',columns='Area',values='Value', aggfunc='sum')
print (df_pivot)
Area        D    E
Datetime          
1         3.0  NaN
2         3.0  NaN
3         NaN  9.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading