Suppose this is my df:
import pandas as pd
df = pd.DataFrame({'accuracy': [0.773, 0.841, 0.862, 0.874, 0.883, 0.913],
                   'code': [('D',), ('D', 'F'), ('B', 'D', 'F'),
                            ('B', 'F', 'K'), ('B', 'F', 'I', 'K'),
                            ('F', 'I', 'K')]})
df
   accuracy          code
0     0.773          (D,)
1     0.841        (D, F)
2     0.862     (B, D, F)
3     0.874     (B, F, K)
4     0.883  (B, F, I, K)
5     0.913     (F, I, K)
I would like to add a column named dropped whose value is the item in code from the previous row that is no longer present in the current row.
Expected:
   accuracy          code dropped
0     0.773          (D,)       -
1     0.841        (D, F)       -
2     0.862     (B, D, F)       -
3     0.874     (B, F, K)       D
4     0.883  (B, F, I, K)       -
5     0.913     (F, I, K)       B
>Solution :
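This is straightforward with sets and shift: convert each tuple to a set, then subtract the current row's set from the previous row's set:
# Convert each tuple to a set so rows can be compared with set difference
s = df['code'].apply(set)
# Previous row's set minus current row's set; the first row is compared against an empty set
df['dropped'] = s.shift(fill_value=set()) - s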
Output:
   accuracy          code dropped
0     0.773          (D,)      {}
1     0.841        (D, F)      {}
2     0.862     (B, D, F)      {}
3     0.874     (B, F, K)     {D}
4     0.883  (B, F, I, K)      {}
5     0.913     (F, I, K)     {B}
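To see why the subtraction gives the dropped items, it can be checked by hand for a single pair of rows. The following is a minimal sketch in plain Python, without pandas; the names previous_row, current_row and first are only illustrative:

```python
# Rows 2 and 3 of the example data, converted to sets
previous_row = set(('B', 'D', 'F'))
current_row = set(('B', 'F', 'K'))

# Set difference keeps the items of the previous row that are absent from the current row
dropped = previous_row - current_row
print(dropped)  # {'D'}

# The first row has no predecessor; shift(fill_value=set()) supplies an empty set there,
# so nothing can be reported as dropped
first = set() - set(('D',))
print(first)  # set()
```

Note that K, which appears in the current row but not in the previous one, is ignored: the difference is one-directional.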
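If you need the exact format from the question (and have at most one dropped item per row):
s = df['code'].apply(set)
# Take the single dropped item from each set; rows with nothing dropped become NaN, then '-'
df['dropped'] = (s.shift(fill_value=set()).sub(s)
                  .apply(list).str[0].fillna('-')
                )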
Output:
   accuracy          code dropped
0     0.773          (D,)       -
1     0.841        (D, F)       -
2     0.862     (B, D, F)       -
3     0.874     (B, F, K)       D
4     0.883  (B, F, I, K)       -
5     0.913     (F, I, K)       B
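If more than one item can be dropped in a single row, taking the first element of the list would silently discard the rest, and since sets are unordered, which one survives is arbitrary. One possible variant, sketched here as an assumption rather than as part of the original answer, is to sort the dropped items and join them into a string. The names codes and previous are hypothetical helpers introduced for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'accuracy': [0.773, 0.841, 0.862, 0.874, 0.883, 0.913],
    'code': [('D',), ('D', 'F'), ('B', 'D', 'F'),
             ('B', 'F', 'K'), ('B', 'F', 'I', 'K'),
             ('F', 'I', 'K')],
})

codes = df['code'].tolist()
# Pair each row with its predecessor; the first row is paired with an empty tuple
previous = [()] + codes[:-1]

# Sort the dropped items for a deterministic order, join them, and fall back to '-' when empty
df['dropped'] = [
    ', '.join(sorted(set(prev) - set(cur))) or '-'
    for prev, cur in zip(previous, codes)
]

print(df['dropped'].tolist())  # ['-', '-', '-', 'D', '-', 'B']
```

For the example data this gives the same column as above; a row that lost both B and D would show "B, D" instead of a single arbitrary item.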