Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How can I compare values in a dataframe row and select one of them?

I want to use Pandas to create a new column in a dataframe in which it compares 2 columns in a row in the df, if the values are different to choose an specific column, if the values are equal to chose any of them, if one of them in NaN, to choose the other one (if both are NaN it will already be cover). I was using Numpy select and Pandas where, but can’t make it work, and I don’t want to use (yet) a for cycle.

If i got this dataframe:

df1 = pd.DataFrame([["A", "B"], ["C", "D"], [np.nan, "A"], ["B", np.nan], [np.nan, np.nan], ["E", "E"]], columns=["col1", "col2"])
df1
  col1 col2
0    A    B
1    C    D
2  NaN    A
3    B  NaN
4  NaN  NaN
5    E    E

I want that the behavior is like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  • In row 0 because (A != B) & (A != np.nan) & (B != np.nan) I want to choose the value from col1.
  • In row 1 because (C != D) & (C != np.nan) & (D != np.nan) I want to choose the value from col1 (same as before).
  • In row 2 because (NaN != A) & (A != np.nan) & (NaN == np.nan) I want to choose the value from col 2.
  • In row 3 because (B != NaN) & (NaN == np.nan) & (B != np.nan) I want to choose the value from col 1.
  • In row 4 because (NaN == NaN) & (NaN == np.nan) & (NaN == np.nan) I want to choose the value from col1 (either would work actually).
  • In row 5 because (E == E) & (E != np.nan) & (E != np.nan) I want to choose the value form col 1 (same as before).

In this case I should obtain something like:

  col1 col2 col3
0    A    B    A
1    C    D    C
2  NaN    A    A
3    B  NaN    B
4  NaN  NaN  NaN
5    E    E    E

To achieve this I was doing this with np.select:

df1 = pd.DataFrame([["A", "B"], ["C", "D"], [np.nan, "A"], ["B", np.nan], [np.nan, np.nan], ["E", "E"]], columns=["col1", "col2"])
conditions = [
    ((df['col1'] == df['col2']) & (df['col1'] != np.nan)),
    ((df['col1'] == df['col2']) & (df['col1'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] != np.nan) & (df['col2'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] == np.nan) & (df['col2'] != np.nan)),
]
choices = [df['col1'], np.nan, df['col1'], df['col2']]
df['condition'] = np.select(conditions, choices, default="WHAT?")
print(df)

But the result is:

  col1 col2 condition
0    A    B     WHAT?
1    C    D     WHAT?
2  NaN    A     WHAT?
3    B  NaN     WHAT?
4  NaN  NaN     WHAT?
5    E    E         E

So, I don’t understand what am I doing wrong.

SEMI-EDIT: While writing this I noticed that I was missing a case in choices, so I updated the code to this:

Hope someone can help me, thanks

df1 = pd.DataFrame([["A", "B"], ["C", "D"], [np.nan, "A"], ["B", np.nan], [np.nan, np.nan], ["E", "E"]], columns=["col1", "col2"])
conditions = [
    ((df['col1'] == df['col2']) & (df['col1'] != np.nan)),
    ((df['col1'] == df['col2']) & (df['col1'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] != np.nan) & (df['col2'] == np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] == np.nan) & (df['col2'] != np.nan)),
    ((df['col1'] != df['col2']) & (df['col1'] != np.nan) & (df['col2'] != np.nan)),

]
choices = [df['col1'], np.nan, df['col1'], df['col2'], df["col1"]]
df['condition'] = np.select(conditions, choices, default="WHAT?")
print(df)

The result was this one:

  col1 col2 condition
0    A    B         A
1    C    D         C
2  NaN    A       NaN
3    B  NaN         B
4  NaN  NaN       NaN
5    E    E         E

To ilustrate a little bit more, I change choices to: [1, 2, 3, 4, 5]. This is the result:

  col1 col2 condition
0    A    B         5
1    C    D         5
2  NaN    A         5
3    B  NaN         5
4  NaN  NaN         5
5    E    E         1

It always enters in the fifth case, which is ok for row 0 and 1, but row 2 should enter case 4, row 3 should enter case 3, and row 4 should enter case 2. Row 5 is correct with case 1.

Hope I made my self clear, thanks.

EDIT: someone commented and then delete something that worked for me. This person pointed out that if you do this: print(np.nan == np.nan) it will return False, so my logic will always fail. Therefore, I must user notnull() and isna() in the conditionals.

>Solution :

np.nan != np.nan NaNs are never equal to each other so your logic is not doing what you think it is: df['col1'] != np.nan. Trying doing print(np.nan == np.nan) and it will always return False. You want to use df['col1'].isna() to find null values and df['col1'].notnull() to find non-null values.

conditions = [
    ((df['col1'] == df['col2']) & (df['col1'].notnull())),
    ((df['col1'] == df['col2']) & (df['col1'].isna())),
    ((df['col1'] != df['col2']) & (df['col1'].notnull()) & (df['col2'].isna())),
    ((df['col1'] != df['col2']) & (df['col1'].isna()) & (df['col2'].notnull())),
    ((df['col1'] != df['col2']) & (df.notnull().all(1))),# not sure if this is the logic you actually want

]
choices = [df['col1'], np.nan, df['col1'], df['col2'], df["col1"]]
df['condition'] = np.select(conditions, choices, default="WHAT?")

  col1 col2 condition
0    A    B         A
1    C    D         C
2  NaN    A         A
3    B  NaN         B
4  NaN  NaN     WHAT?
5    E    E         E
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading