My DataFrame is:
import pandas as pd
df = pd.DataFrame(
{
'a': [-3, -1, -2, -5, 10, -3, -13, -3, -2, 1, 2, -100],
}
)
Expected output:
a
0 -3
1 -1
2 -2
3 -5
Logic:
I want to return the longest streak of negative numbers. If several streaks tie for the longest, I want the first one. In df there are two negative streaks of length 4, so the first one is returned.
This is my attempt. I would like a second opinion, because idxmax() can be tricky in some scenarios.
import numpy as np
df['sign'] = np.sign(df.a)
df['sign_streak'] = df.sign.ne(df.sign.shift(1)).cumsum()
m = df.sign.eq(-1)
group_sizes = df.groupby('sign_streak').size()
largest_group = group_sizes.idxmax()
largest_group_df = df[df['sign_streak'] == largest_group]
>Solution :
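Your code works for this example, but note that group_sizes counts streaks of every sign, so a longer positive streak would be selected instead (the mask m is computed but never used). You can restrict the count to negative values and avoid the intermediate columns: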
# get sign
s = np.sign(df['a'])
# form groups of successive identical sign
g = s.ne(s.shift()).cumsum()
# keep only negative, get size per group and first group with max size
out = df[g.eq(df[s.eq(-1)].groupby(g).size().idxmax())]
Or, since you don’t really care about the 0/+ difference:
# negative numbers
m = df['a'].lt(0)
# form groups
g = m.ne(m.shift()).cumsum()
out = df[g.eq(df[m].groupby(g).size().idxmax())]
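Note: idxmax returns the label of the first occurrence of the maximum, so it is always fine if you want the first match. The groups are numbered in order of appearance, so the earliest streak wins a tie.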
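As a small check of that note, here is a sketch that rebuilds the negative-streak sizes for the question's data and looks at how the tie is resolved. The names `sizes` and `first_longest` are only illustrative and do not appear in the snippets above.

```python
import pandas as pd

df = pd.DataFrame({'a': [-3, -1, -2, -5, 10, -3, -13, -3, -2, 1, 2, -100]})

# True where the value is negative
m = df['a'].lt(0)
# a new group id starts every time the mask flips
g = m.ne(m.shift()).cumsum()

# size of each negative streak, indexed by group id
# (groups 1 and 3 both have 4 rows, group 5 has 1 row)
sizes = df[m].groupby(g).size()

# idxmax picks the first label holding the maximum, i.e. group 1
first_longest = sizes.idxmax()
print(sizes)
print(first_longest)
```

Group 1 covers rows 0 to 3, which is why the first streak is the one selected for the question's df.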
Output:
a
0 -3
1 -1
2 -2
3 -5
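If you need this in more than one place, the second approach can be wrapped in a small helper. This is only a sketch: the function name `longest_negative_streak`, its `col` parameter, and the choice to return an empty frame when there are no negative values are assumptions, not part of the original answer. The guard is there because idxmax raises a ValueError on an empty Series.

```python
import pandas as pd

def longest_negative_streak(df, col='a'):
    """Return the rows of the first longest run of negative values in `col`."""
    # True where the value is negative
    m = df[col].lt(0)
    if not m.any():
        # no negative values: idxmax would raise on an empty Series,
        # so return an empty frame with the same columns (assumed behaviour)
        return df.iloc[0:0]
    # a new group id starts every time the mask flips
    g = m.ne(m.shift()).cumsum()
    # count rows per negative group only, then take the first maximum
    sizes = df[m].groupby(g).size()
    return df[g.eq(sizes.idxmax())]

# the question's data: two negative streaks of length 4, the first one wins
df = pd.DataFrame({'a': [-3, -1, -2, -5, 10, -3, -13, -3, -2, 1, 2, -100]})
out = longest_negative_streak(df)
print(out)

# a longer positive streak must not be selected
mixed = pd.DataFrame({'a': [1, 2, 3, 4, 5, -1, -2, 6]})
out_mixed = longest_negative_streak(mixed)
print(out_mixed)

# no negative values at all
none = longest_negative_streak(pd.DataFrame({'a': [1, 2, 3]}))
print(none.empty)
```

The second call illustrates why the sizes are computed on df[m] rather than on the whole frame: the five positive rows form the longest streak overall, but only the two negative rows are returned.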