I have a dataframe with the format shown below, where each row represents a snapshot in time of a particular specimen.
As time moves forward, a specimen can move from type 1 to type 2, but not from type 2 to type 1. There are other types such as 3, 4, and 5, but I figured if I know how to deal with 1 and 2, I can make it work for the others as well.
The data contains errors where a specimen transitions from 2 to 1, and my goal is to find those rows and produce a set of IDs.
For example, given the table below, the output should be 456, since its type went from 2 to 1 as time passed, which is an error.
| ID (not unique) | Snapshot Month (YYYYMMDD) | Type (1, 2, 3, 4, 5) |
|---|---|---|
| 123 | 20210131 | 1 |
| 123 | 20210521 | 2 |
| 456 | 20210131 | 2 |
| 456 | 20210521 | 1 |
What I have tried is to sort by ID, then by Snapshot Month, and then, slicing the dataframe by Type and looping, find the maximum date for each ID where Type is 1 and the minimum date for each ID where Type is 2, merge the two on ID, and check whether the maximum is greater than the minimum.
But not only is this unpythonic, it is also inefficient (loops). I wonder if there are better ways?
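For reference, the merge-style idea described above can be expressed without an explicit loop using two grouped aggregations. This is a minimal sketch assuming the column names `ID`, `Snapshot Month`, and `Type` from the table, with the sample data reconstructed from the question:

```python
import pandas as pd

# Sample data matching the table in the question (column names are assumptions)
df = pd.DataFrame({
    'ID': [123, 123, 456, 456],
    'Snapshot Month': [20210131, 20210521, 20210131, 20210521],
    'Type': [1, 2, 2, 1],
})

# Latest month each ID was observed as Type 1
max_t1 = df[df['Type'] == 1].groupby('ID')['Snapshot Month'].max()
# Earliest month each ID was observed as Type 2
min_t2 = df[df['Type'] == 2].groupby('ID')['Snapshot Month'].min()

# An ID is an error if it appears as Type 1 *after* it first appeared as Type 2
both = pd.concat([max_t1, min_t2], axis=1, keys=['max_t1', 'min_t2']).dropna()
error_ids = both[both['max_t1'] > both['min_t2']].index.tolist()
print(error_ids)  # [456]
```

The `dropna()` keeps only IDs that appear with both types, so IDs that never changed type are ignored.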
>Solution :
Sorting your data by ID and snapshot, grouping by ID, then doing a shift might help you:
# sort the dataframe by ID and Snapshot Month
df.sort_values(['ID', 'Snapshot Month'], inplace=True)
# then group by ID
grouped = df.groupby('ID')
# create a "Previous Type" column holding each row's preceding type within its ID
df['Previous Type'] = grouped['Type'].shift(1)
# flag the rows where the type went from 2 back to 1
errors = df[(df['Type'] == 1) & (df['Previous Type'] == 2)]
# keep only the unique IDs
error_ids = errors['ID'].unique()