I have a dataframe with a ton of rows and columns. The following dataframe is just a simplified version of it.
dct = {'day': ['Mon', 'Tue', 'Wed', 'Wed', 'Thur', 'Fri', 'Fri', 'Sat', 'Sun'],
'id': ['1', '2', '3', '4', '5',
'6', '7', '8', '9']}
df = pd.DataFrame(dct)
day id
0 Mon 1
1 Tue 2
2 Wed 3
3 Wed 4
4 Thur 5
5 Fri 6
6 Fri 7
7 Sat 8
8 Sun 9
I want to change the ids to match the duplicated days (in this scenario) but still be consecutive.
day id
0 Mon 1
1 Tue 2
2 Wed 3
3 Wed 3
4 Thur 4
5 Fri 5
6 Fri 5
7 Sat 6
8 Sun 7
So far i managed to change the ids but they are not consecutive and i’m kinda stuck right now.
m = df['day'].duplicated()
df['id'] = df['id'].mask(m).ffill()
day id
0 Mon 1
1 Tue 2
2 Wed 3
3 Wed 3
4 Thur 5
5 Fri 6
6 Fri 6
7 Sat 8
8 Sun 9
>Solution :
You can subtract a cumsum of the duplicated values:
# ensure using numeric dtype
df['id'] = df['id'].astype(int)
# subtract duplicated cumsum
df['id'] -= df['day'].duplicated().cumsum()
Output:
day id
0 Mon 1
1 Tue 2
2 Wed 3
3 Wed 3
4 Thur 4
5 Fri 5
6 Fri 5
7 Sat 6
8 Sun 7
Intermediates:
day id duplicated cumsum id-cumsum
0 Mon 1 False 0 1
1 Tue 2 False 0 2
2 Wed 3 False 0 3
3 Wed 4 True 1 3
4 Thur 5 False 1 4
5 Fri 6 False 1 5
6 Fri 7 True 2 5
7 Sat 8 False 2 6
8 Sun 9 False 2 7