I have the below dataframe in python with column names "Order_number" and "item_id" which represents the order number and the items consisting in that order. I need to fetch order numbers that consist of item_ids A AND B so my result should be Order Number 12345 and 84573
| Order_number | item_id |
|---|---|
| 12345 | A |
| 12345 | B |
| 34235 | B |
| 43543 | B |
| 84573 | A |
| 84573 | B |
| 45846 | A |
I have the below code but I am getting an error:
df[df['item_id'] == 'A' & df['item_id'] == 'B']
TypeError: Cannot perform 'rand_' with a dtyped [object] array and scalar of type [bool]
>Solution :
You can groupby.apply a lambda that checks if the unique "item_id"s include both "A" and "B" for each "Order_number"; then filter the ones that do:
out = df.groupby('Order_number')['item_id'].apply(lambda x: {'A','B'}.issubset(x.unique().tolist())).pipe(lambda x: x.index[x]).tolist()
Another option is to use groupby.any twice; once for "A" and again for "B" to create boolean Series objects that return True if an item_id exists for an Order_number; then since we want both to exist, we use & and filter the "Order_number"s:
out = (df['item_id'].eq('A').groupby(df['Order_number']).any() & df['item_id'].eq('B').groupby(df['Order_number']).any()).pipe(lambda x: x.index[x].tolist())
Output:
[12345, 84573]