I have a excel file containing three columns as shown below,
| ID | Name | Date |
|---|---|---|
| 117 | Laspringe | 2019-04-08 |
| 117 | Laspringe (FT) | 2020-06-16 |
| 117 | Laspringe (Ftp) | 2020-07-24 |
| 999 | Angelo | 2020-04-15 |
| 999 | Angelo(FT) | 2021-03-05 |
| 999 | Angelo(Ftp) | 2021-09-13 |
| 999 | Angelo | 2022-02-20 |
I wanted to find out that based on each ID which has the name changed from original name and changed back to the same original name. For example Angelo is changed to Angelo(FT), Angelo(Ftp) and changed back to original Angelo.
Whereas Laspringe is not changed back to the original name.
Is it possible to find out which of the ID’s have changed the name back to original using python ??
Expecting the result to be like,
| ID |
|---|
| 999 |
>Solution :
A simple way might be to check if the Name has any duplicate per group:
s = df.duplicated(['ID', 'Name']).groupby(df['ID']).any()
out = s[s].index.tolist()
Output: [999]
If you can have duplicates on successive dates (A -> A -> B shouldn’t be a match):
s = (df
.sort_values(by='Date')
.groupby('ID')['Name']
.agg(lambda s: s[s.ne(s.shift())].duplicated().any())
)
out = s[s].index.tolist()
The two code will behave differently on this input:
ID Name Date
0 117 Laspringe 2019-04-08
1 117 Laspringe 2019-04-09 # duplicated but no intermediate name
2 117 Laspringe (FT) 2020-06-16
3 117 Laspringe (Ftp) 2020-07-24
4 999 Angelo 2020-04-15
5 999 Angelo(FT) 2021-03-05
6 999 Angelo(Ftp) 2021-09-13
7 999 Angelo 2022-02-29