Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas force reindex duplicate axis

I want reindex dataframe’s rows by specific column, let’s say that dataframe contain collumn colX. That’s mean I need call method df.set_index(colX)

Consider df.head() is

colX colY
2.71 foo1
3.14 foo2
6.9 foo3
6.9 foo4
9.6 foo5

When I reindex it

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df=df.set_index('colX')
df2=df.reindex(index=[9.6,6.9,6.9,3.14,2.71])

I expect the result df2.head() is

colX colY
9.6 foo5
6.9 foo3 # duplicate index in colX
6.9 foo4 # duplicate index in colX
3.14 foo2
2.71 foo1

Or this

colX colY
9.6 foo5
6.9 foo4 # duplicate index in colX
6.9 foo3 # duplicate index in colX
3.14 foo2
2.71 foo1

Reindex won’t work if there’s duplicate axis.

ValueError: cannot reindex from a duplicate axis

Note: df was created by df=pd.read_csv('foobar.csv')

>Solution :

Error means there is at least one duplicated values, so reindex raise error, because working only with unique values in list passed to reindex.


Solution with helper column created by GroupBy.cumcount and DataFrame.merge with default inner join:

df['g'] = df.groupby('colX').cumcount()

L = [9.6,6.9,6.9,3.14,2.71]

df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)
   colX  g
0  9.60  0
1  6.90  0
2  6.90  1
3  3.14  0
4  2.71  0

df = df.merge(df11, on=['colX','g'])
print (df)
   colX  colY  g
0  2.71  foo1  0
1  3.14  foo2  0
2  6.90  foo3  0
3  6.90  foo4  1
4  9.60  foo5  0

Or with reindex:

df['g'] = df.groupby('colX').cumcount()

L = [9.6,6.9,6.9,3.14,2.71]

df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)

df=df.set_index(['colX', 'g']).reindex(df11.set_index(['colX','g']).index)
print (df)
        colY
colX g      
9.60 0  foo5
6.90 0  foo3
     1  foo4
3.14 0  foo2
2.71 0  foo1

Wrong first solution:

df=df.set_index('colX')
df2=df.loc[[9.6,6.9,6.9,3.14,2.71]]
print (df2)
      colY
colX      
9.60  foo5
6.90  foo3 > combinations of 6.9 (2 times in original * 2 times in list = 4)
6.90  foo4
6.90  foo3
6.90  foo4
3.14  foo2
2.71  foo1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading