I want reindex dataframe’s rows by specific column, let’s say that dataframe contain collumn colX. That’s mean I need call method df.set_index(colX)
Consider df.head() is
colX colY
2.71 foo1
3.14 foo2
6.9 foo3
6.9 foo4
9.6 foo5
When I reindex it
df=df.set_index('colX')
df2=df.reindex(index=[9.6,6.9,6.9,3.14,2.71])
I expect the result df2.head() is
colX colY
9.6 foo5
6.9 foo3 # duplicate index in colX
6.9 foo4 # duplicate index in colX
3.14 foo2
2.71 foo1
Or this
colX colY
9.6 foo5
6.9 foo4 # duplicate index in colX
6.9 foo3 # duplicate index in colX
3.14 foo2
2.71 foo1
Reindex won’t work if there’s duplicate axis.
ValueError: cannot reindex from a duplicate axis
Note: df was created by df=pd.read_csv('foobar.csv')
>Solution :
Error means there is at least one duplicated values, so reindex raise error, because working only with unique values in list passed to reindex.
Solution with helper column created by GroupBy.cumcount and DataFrame.merge with default inner join:
df['g'] = df.groupby('colX').cumcount()
L = [9.6,6.9,6.9,3.14,2.71]
df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)
colX g
0 9.60 0
1 6.90 0
2 6.90 1
3 3.14 0
4 2.71 0
df = df.merge(df11, on=['colX','g'])
print (df)
colX colY g
0 2.71 foo1 0
1 3.14 foo2 0
2 6.90 foo3 0
3 6.90 foo4 1
4 9.60 foo5 0
Or with reindex:
df['g'] = df.groupby('colX').cumcount()
L = [9.6,6.9,6.9,3.14,2.71]
df11 = pd.DataFrame({'colX': L})
df11['g'] = df11.groupby('colX').cumcount()
print (df11)
df=df.set_index(['colX', 'g']).reindex(df11.set_index(['colX','g']).index)
print (df)
colY
colX g
9.60 0 foo5
6.90 0 foo3
1 foo4
3.14 0 foo2
2.71 0 foo1
Wrong first solution:
df=df.set_index('colX')
df2=df.loc[[9.6,6.9,6.9,3.14,2.71]]
print (df2)
colY
colX
9.60 foo5
6.90 foo3 > combinations of 6.9 (2 times in original * 2 times in list = 4)
6.90 foo4
6.90 foo3
6.90 foo4
3.14 foo2
2.71 foo1