ValueError due to duplicate axis when replacing values in a pandas DataFrame

March 30, 2023

I have one dataset, df, including nodes (N and T) and indicators assigned to nodes (IND_N and IND_T):

         N        T  IND_N  IND_T
0     John     Mark      1      0
1     Mike     John      2      1
2  Stephan    Simon      1      0
3    Laura  Stephan      1      1
4     Matt    Simon      3      0
5    Simon     Joey      0      2

I split the dataset into two, one (df1) with nodes that keep the indicators from df, the other one (df2) with indicators replaced by a dummy value.

df1 (keeps indicators from df)

         N      T  IND_N  IND_T
0     John   Mark      1      0
1  Stephan  Simon      1      0
2    Simon   Joey      0      2

df2 (please note that, after splitting, I assigned a dummy value -1 to all the indicators in df2)

       N        T  IND_N  IND_T
0  Laura  Stephan     -1     -1
1   Matt    Simon     -1     -1
2   Mike     John     -1     -1

Some nodes in df2 can also appear in df1. To avoid a node being in both datasets with different indicators (e.g., Simon in the example above), I want every node that is in both df2 and df1 to keep its original indicator (i.e., the one from df1). I then want to recombine the two datasets to get the final output:

df_out

         N        T  IND_N  IND_T
0     John     Mark      1      0
1  Stephan    Simon      1      0
2    Simon     Joey      0      2
3    Laura  Stephan     -1      1
4     Matt    Simon     -1      0
5     Mike     John     -1      1

Following the solution proposed here, I got the following error:

ValueError: cannot reindex from a duplicate axis
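
For context, pandas raises this error when it has to reindex or align data against an index that contains repeated labels. The following is a minimal sketch, assuming the duplicates come from node names that occur more than once (as Simon does in column T of df). The Series `ind_t` is illustrative and not part of the original code:

```python
import pandas as pd

# Indicators keyed by node name; 'Simon' appears twice, as in column T of df.
ind_t = pd.Series(
    [0, 1, 0, 1, 0, 2],
    index=["Mark", "John", "Simon", "Stephan", "Simon", "Joey"],
)

try:
    # Reindexing needs one value per label, which is ambiguous for 'Simon'.
    ind_t.reindex(["Stephan", "Simon", "John"])
    raised = False
except ValueError as exc:
    raised = True
    # The exact wording of the message differs between pandas versions.
    print(exc)
```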

I tried to fix it as follows:

temp = df_unlabel[values]
temp.update(df_label[values].set_index(col, inplace=True))
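
One detail in this attempt is worth checking: `set_index(..., inplace=True)` modifies the frame and returns `None`, so `update` is handed `None` rather than a DataFrame. A small sketch of that behaviour, using a made-up frame `labelled` as a stand-in for `df_label`:

```python
import pandas as pd

# Hypothetical stand-in for df_label; column names follow the example above.
labelled = pd.DataFrame({"N": ["John", "Stephan", "Simon"], "IND_N": [1, 1, 0]})

# With inplace=True the call changes the frame it is called on and returns None.
returned = labelled.copy().set_index("N", inplace=True)
print(returned)

# Without inplace, a new DataFrame indexed by N is returned and can be passed on.
indexed = labelled.set_index("N")
print(type(indexed).__name__)
```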

After checking the values in the final table (df_out), I found that no dummy values were assigned (they were replaced again by the original ones).

I’d appreciate your help to fix this error in order to get the final output.
Happy to provide more info if needed.

>Solution :

You can use a mapping dict:

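import pandas as pd

# Map each node in df1 to its indicator, then add a catch-all regex as the default
dmap = pd.concat([df1.set_index('N')['IND_N'], df1.set_index('T')['IND_T']]).to_dict()
dmap.update({'.*': -1})

# Known nodes get their df1 indicator; any other name matches '.*' and becomes -1
df2[['IND_N', 'IND_T']] = df2[['N', 'T']].replace(dmap, regex=True).values
out = pd.concat([df1, df2], axis=0, ignore_index=True)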

Output:

>>> out
         N        T  IND_N  IND_T
0     John     Mark      1      0
1  Stephan    Simon      1      0
2    Simon     Joey      0      2
3    Laura  Stephan     -1      1
4     Matt    Simon     -1      0
5     Mike     John     -1      1

>>> dmap
{'John': 1, 'Stephan': 1, 'Simon': 0, 'Mark': 0, 'Joey': 2, '.*': -1}
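
If you would rather not depend on regex matching, a similar result can probably be obtained with `Series.map` and `fillna`. This is only a sketch under the assumptions of the example above: df1 and df2 are rebuilt by hand here, and the names `lookup`, `node_col` and `ind_col` are illustrative.

```python
import pandas as pd

# Example frames rebuilt from the tables in the question.
df1 = pd.DataFrame({
    "N": ["John", "Stephan", "Simon"],
    "T": ["Mark", "Simon", "Joey"],
    "IND_N": [1, 1, 0],
    "IND_T": [0, 0, 2],
})
df2 = pd.DataFrame({
    "N": ["Laura", "Matt", "Mike"],
    "T": ["Stephan", "Simon", "John"],
    "IND_N": [-1, -1, -1],
    "IND_T": [-1, -1, -1],
})

# Node -> indicator lookup taken from df1 only.
lookup = pd.concat(
    [df1.set_index("N")["IND_N"], df1.set_index("T")["IND_T"]]
).to_dict()

# Nodes found in the lookup get their df1 indicator; all others fall back to -1.
for node_col, ind_col in [("N", "IND_N"), ("T", "IND_T")]:
    df2[ind_col] = df2[node_col].map(lookup).fillna(-1).astype(int)

out = pd.concat([df1, df2], ignore_index=True)
print(out)
```

As with the dict above, a node that carries different indicators in N and T of df1 ends up with whichever value is written to the dict last, so this assumes each node has a single consistent indicator.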