I have a dataframe that looks like this:
| Name | ID |
|---|---|
| A | 1 |
| B | 2 |
| A | 1 |
| C | 3 |
| B | 3 |
| D | 3 |
| E | 1 |
| F | 2 |
As you can see for some IDs there are multiple names, I would like to change the ID so that there is a unique ID for each new instance of a name, the resulting ID column would ideally look like this:
Unfortunately I cannot use ngroup() because there are over 35,000 IDs.
| Name | ID | ID_new |
|---|---|---|
| A | 1 | 1_1 |
| B | 2 | 2_1 |
| A | 1 | 1_1 |
| C | 3 | 3_1 |
| B | 2 | 2_1 |
| D | 3 | 3_2 |
| E | 1 | 1_2 |
| F | 2 | 2_2 |
All help is appreciated!
I have used .ngroup() + 1 but as I said there are too many IDs, as well I have used cumcount() + 1 but this makes the number after the ‘_’ go up by one each time resulting in non-unique IDs.
>Solution :
Use factorize per groups in lambda function in GroupBy.transform and join with ID by Series.str.cat:
f = lambda x: pd.factorize(x)[0] + 1
s = df.groupby('ID')['Name'].transform(f).astype(str)
df['ID_new'] = df['ID'].astype(str).str.cat(s, sep='_')
print (df)
Name ID ID_new
0 A 1 1_1
1 B 2 2_1
2 A 1 1_1
3 C 3 3_1
4 B 2 2_1
5 D 3 3_2
6 E 1 1_2
7 F 2 2_2