I have a large df. I need to count groups of the ‘years’ column if the year restarts to 1.
I have tried
df["count"] = df.groupby((df["year"]>=df["year"].shift(1)).cumsum()).cumcount()+1
but this doesn’t give the expected result. Below is an example df and the expected output.
year month alpha
1 0 1.24036
1 1 1.17023
1 0 1.24036
1 1 1.17023
1 0 1.24036
1 1 1.17023
2 0 1.24036
2 1 1.17023
2 0 1.24036
2 1 1.17023
2 0 1.24036
2 1 1.17023
1 0 1.24036
1 1 1.17023
1 0 1.24036
1 1 1.17023
1 0 1.24036
1 1 1.17023
2 0 1.24036
2 1 1.17023
2 0 1.24036
2 1 1.17023
2 0 1.24036
2 1 1.17023
Expected output:
year month alpha count
1 0 1.24036 1
1 1 1.17023 1
1 0 1.24036 1
1 1 1.17023 1
1 0 1.24036 1
1 1 1.17023 1
2 0 1.24036 1
2 1 1.17023 1
2 0 1.24036 1
2 1 1.17023 1
2 0 1.24036 1
2 1 1.17023 1
1 0 1.24036 2
1 1 1.17023 2
1 0 1.24036 2
1 1 1.17023 2
1 0 1.24036 2
1 1 1.17023 2
2 0 1.24036 2
2 1 1.17023 2
2 0 1.24036 2
2 1 1.17023 2
2 0 1.24036 2
2 1 1.17023 2
>Solution :
You don’t really want a groupby.cumcount
, you want to enumerate the groups.
You can use:
m = df['year'].eq(1) & df['year'].shift().ne(1)
df['count'] = m.cumsum()
Or, if you want to ensure having 1 as the first count even if the year does not necessarily start with 1:
df['count'] = m.cumsum() + ~m.iloc[0]
output:
year month alpha count
0 1 0 1.24036 1
1 1 1 1.17023 1
2 1 0 1.24036 1
3 1 1 1.17023 1
4 1 0 1.24036 1
5 1 1 1.17023 1
6 2 0 1.24036 1
7 2 1 1.17023 1
8 2 0 1.24036 1
9 2 1 1.17023 1
10 2 0 1.24036 1
11 2 1 1.17023 1
12 1 0 1.24036 2
13 1 1 1.17023 2
14 1 0 1.24036 2
15 1 1 1.17023 2
16 1 0 1.24036 2
17 1 1 1.17023 2
18 2 0 1.24036 2
19 2 1 1.17023 2
20 2 0 1.24036 2
21 2 1 1.17023 2
22 2 0 1.24036 2
23 2 1 1.17023 2