I would like to compare multiple columns in a data frame and add a new column that tells me which columns are different for each row.
for example, for this dataframe i want to compare a1 to a2 and b1 to b2:
a1 b1 a2 b2
0 1 2 1 2
1 1 2 1 3
2 1 2 3 4
the output should be something like:
a1 b1 a2 b2 diff
0 1 2 1 2
1 1 2 1 3 'b1-b2'
2 1 2 3 4 'a1-a2,b1-b2'
this is what i have so far:
import numpy as np
import pandas as pd
data = [{'a1': 1, 'b1': 2, 'a2':1, 'b2':2},
{'a1':1, 'b1': 2, 'a2': 1, 'b2':3},
{'a1':1, 'b1': 2, 'a2':3 , 'b2':4}]
df = pd.DataFrame(data)
compare = [('a1','a2'),('b1','b2')]
comp_result = np.array([(df[x[0]] != df[x[1]]) for x in compare])
comp_result is a list of lists of True/False values for each of the comparisons but i am not sure how to use that to create the "diff" column.
>Solution :
Fast one-liner without loops:
col_groups = [c.columns for _, c in df.groupby(df.columns.str[0], axis=1)]
df['diff'] = pd.Series(np.sum([(df[l] != df[r]).map({True: f'{l}-{r}',False:''}) + ',' for l, r in col_groups], axis=0)).str.strip(',')
Output:
>>> df
a1 b1 a2 b2 diff
0 1 2 1 2
1 1 2 1 3 b1-b2
2 1 2 3 4 a1-a2,b1-b2