Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Compare multiple columns in the dataframe and return column names with different values in a new column

Consider this df:

data = [{'name': 'Muhammad', 'age_x': 20, 'city_x': 'Karachi', 'age_y': 20, 'city_y': 'Karachi'},
        {'name': 'Ali', 'age_x': 19, 'city_x': 'Lahore', 'age_y': 30, 'city_y': None},
        {'name': 'Ahmed', 'age_x': 25, 'city_x': 'Islamabad', 'age_y': None, 'city_y': 'Islamabad'}]
            
df = pd.DataFrame(data)
cols = {'age_x':'age_y', 'city_x':'city_y'}

The idea is to compare multiple column pairs (age_x and age_y, city_x and city_y in this example) and return the column name(s) if the value is different. There are many columns to check that so would be good to use the cols dictionary in the solution. The column names to return are the ones with ‘_y’. So the expected result is:

    name     age_x  city_x    age_y city_y    diff
0   Muhammad 20     Karachi   20.0  Karachi   None
1   Ali      19     Lahore    30.0  None      age_y, city_y
2   Ahmed    25     Islamabad NaN   Islamabad age_y

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

Use apply on axis=1 and a list comprehension derived from cols:

df["diff"] = df.apply(
    lambda row: ", ".join([y for x, y in cols.items() if row[x] != row[y]]) or None,
    axis=1,
)
       name  age_x     city_x  age_y     city_y           diff
0  Muhammad     20    Karachi   20.0    Karachi           None
1       Ali     19     Lahore   30.0       None  age_y, city_y
2     Ahmed     25  Islamabad    NaN  Islamabad          age_y

You can also get cols from df instead of typing it manually:

df = df.sort_index(axis=1)  # sort columns if not already sorted
cols = {
    x: y
    for x, y in zip(df.filter(regex=r"_x$").columns, df.filter(regex=r"_y$").columns)
}
{'age_x': 'age_y', 'city_x': 'city_y'}
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading