How can I compare values in a dataframe row and select one of them?

I want to use Pandas to create a new column in a dataframe in which it compares 2 columns in a row in the df, if the values are different to choose an specific column, if the values are equal to chose any of them, if one of them in NaN, to choose the other one (if both are NaN it will already be cover). I was using Numpy select and Pandas where, but can’t make it work, and I don’t want to use (yet) a for cycle.

If i got this dataframe:

df1 = pd.DataFrame([["A", "B"], ["C", "D"], [np.nan, "A"], ["B", np.nan], [np.nan, np.nan], ["E", "E"]], columns=["col1", "col2"])
df1
  col1 col2
0    A    B
1    C    D
2  NaN    A
3    B  NaN
4  NaN  NaN
5    E    E

I want that the behavior is like this:

  • In row 0 because (A != B) & (A != np.nan) & (B != np.nan) I want to choose the value from col1.
  • In row 1 because (C != D) & (C != np.nan) & (D != np.nan) I want to choose the value from col1 (same as before).
  • In row 2 because (NaN != A) & (A != np.nan) & (NaN == np.nan) I want to choose the value from col 2.
  • In row 3 because (B != NaN) & (NaN == np.nan) & (B != np.nan) I want to choose the value from col 1.
  • In row 4 because (NaN == NaN) & (NaN == np.nan) & (NaN == np.nan) I want to choose the value from col1 (either would work actually).
  • In row 5 because (E == E) & (E != np.nan) & (E != np.nan) I want to choose the value form col 1 (same as before).

In this case I should obtain something like:

  col1 col2 col3
0    A    B    A
1    C    D    C
2  NaN    A    A
3    B  NaN    B
4  NaN  NaN  NaN
5    E    E    E

To achieve this I was doing this with np.select:

df1 = pd.DataFrame([["A", "B"], ["C", "D"], [np.nan, "A"], ["B", np.nan], [np.nan, np.nan], ["E", "E"]], columns=["col1", "col2"])
conditions = [
    ((df['col1'] == df['col2']) & (df['col1'] != np.nan)),
    ((df['col1'] == df['col2']) & (df['col1'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] != np.nan) & (df['col2'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] == np.nan) & (df['col2'] != np.nan)),
]
choices = [df['col1'], np.nan, df['col1'], df['col2']]
df['condition'] = np.select(conditions, choices, default="WHAT?")
print(df)

But the result is:

  col1 col2 condition
0    A    B     WHAT?
1    C    D     WHAT?
2  NaN    A     WHAT?
3    B  NaN     WHAT?
4  NaN  NaN     WHAT?
5    E    E         E

So, I don’t understand what am I doing wrong.

SEMI-EDIT: While writing this I noticed that I was missing a case in choices, so I updated the code to this:

Hope someone can help me, thanks

df1 = pd.DataFrame([["A", "B"], ["C", "D"], [np.nan, "A"], ["B", np.nan], [np.nan, np.nan], ["E", "E"]], columns=["col1", "col2"])
conditions = [
    ((df['col1'] == df['col2']) & (df['col1'] != np.nan)),
    ((df['col1'] == df['col2']) & (df['col1'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] != np.nan) & (df['col2'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] == np.nan) & (df['col2'] != np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] != np.nan) & (df['col2'] != np.nan)),

]
choices = [df['col1'], np.nan, df['col1'], df['col2'], df["col1"]]
df['condition'] = np.select(conditions, choices, default="WHAT?")
print(df)

The result was this one:

  col1 col2 condition
0    A    B         A
1    C    D         C
2  NaN    A       NaN
3    B  NaN         B
4  NaN  NaN       NaN
5    E    E         E

To ilustrate a little bit more, I change choices to: [1, 2, 3, 4, 5]. This is the result:

  col1 col2 condition
0    A    B         5
1    C    D         5
2  NaN    A         5
3    B  NaN         5
4  NaN  NaN         5
5    E    E         1

It always enters in the fifth case, which is ok for row 0 and 1, but row 2 should enter case 4, row 3 should enter case 3, and row 4 should enter case 2. Row 5 is correct with case 1.

Hope I made my self clear, thanks.

EDIT: someone commented and then delete something that worked for me. This person pointed out that if you do this: print(np.nan == np.nan) it will return False, so my logic will always fail. Therefore, I must user notnull() and isna() in the conditionals.

>Solution :

np.nan != np.nan NaNs are never equal to each other so your logic is not doing what you think it is: df['col1'] != np.nan. Trying doing print(np.nan == np.nan) and it will always return False. You want to use df['col1'].isna() to find null values and df['col1'].notnull() to find non-null values.

conditions = [
    ((df['col1'] == df['col2']) & (df['col1'].notnull())),
    ((df['col1'] == df['col2']) & (df['col1'].isna())),
    ((df['col1'] != df['col2']) & (df['col1'].notnull()) & (df['col2'].isna())),
    ((df['col1'] != df['col2']) & (df['col1'].isna()) & (df['col2'].notnull())),
    ((df['col1'] != df['col2']) & (df.notnull().all(1))),# not sure if this is the logic you actually want

]
choices = [df['col1'], np.nan, df['col1'], df['col2'], df["col1"]]
df['condition'] = np.select(conditions, choices, default="WHAT?")

  col1 col2 condition
0    A    B         A
1    C    D         C
2  NaN    A         A
3    B  NaN         B
4  NaN  NaN     WHAT?
5    E    E         E

Leave a Reply