I have 2 dataframes:
df1 = pd.DataFrame({"id1": ["A", "B", "C", "D"], "id2": ["1", "2", "2", "1"]})
df2 = pd.DataFrame({"id1": ["A", "B", "F", "C", "D", "E"], "id2": ["1", "1", "2", "1", "1", "2"]})
I want to find the rows which are different, here is the code I tried:
vals = set(df2.iloc[:, :2].apply(tuple, 1)) \ - set(df1.apply(tuple, 1))
The result is the following which gives me the rows which are different between the 2 dataframes:
{('B', '1'), ('E', '2'), ('C', '1'), ('F', '2')}
I would like to get an additional output which tells me which rows of df2 have been modified or not (Y for yes and N for No) :
| id1 | Modified_ID | |
|---|---|---|
| 0 | A | N |
| 1 | B | Y |
| 2 | F | N |
| 3 | C | Y |
| 4 | D | N |
| 5 | E | N |
>Solution :
You can merge and build a boolean mask that returns True if a value didn’t change (or was expanded) and map Y/N values according to it:
df2 = df2.merge(df1, on='id1', how='left', suffixes=('_',''))
df2['Modified_ID'] = (df2['id2_'].eq(df2['id2']) | df2['id2'].isna()).map({True:'N', False:'Y'})
df2 = df2.drop(columns=['id2_','id2'])
Output:
id1 Modified_ID
0 A N
1 B Y
2 F N
3 C Y
4 D N
5 E N