Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Only keep the IDs based on the observations of the other variable

Current table

ID   date   sender   sum_sender 
A    Jan20     3         37
A    Feb20     7         37
A    Mar20     12        37
A    Apr20     15        37
B    Mar20     1         26
B    May20     10        26
B    Jun20     15        26
...
Y    Jan21     10        47
Y    Feb21     12        47
Y    Mar21     20        47
Y    Apr21     5         47

I have a panel-time series with many IDs. How do I only keep rows of observations with 10 highest values of sum_sender?

so if i want to keep the observations with 2 highest sum_sender values

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

desired table

ID   date   sender   sum_sender 
A    Jan20     3         37
A    Feb20     7         37
A    Mar20     12        37
A    Apr20     15        37
Y    Jan21     10        47
Y    Feb21     12        47
Y    Mar21     20        47
Y    Apr21     5         47

>Solution :

Use nlargest:

N = 10
out = df.loc[df.groupby('ID')['sum_sender'].nlargest(N).index.levels[1]]

Example for N=2 with your sample:

>>> df.loc[df.groupby('ID')['sum_sender'].nlargest(N).index.levels[1]]
  ID   date  sender  sum_sender
0  A  Jan20       3          37
1  A  Feb20       7          37
4  B  Mar20       1          26
5  B  May20      10          26
7  Y  Jan21      10          47
8  Y  Feb21      12          47

Update

If you need the top 10 of sum_sender independently of ID, you can simple use:

>>> df.nlargest(columns='sum_sender', n=10)
   ID   date  sender  sum_sender
7   Y  Jan21      10          47
8   Y  Feb21      12          47
9   Y  Mar21      20          47
10  Y  Apr21       5          47
0   A  Jan20       3          37
1   A  Feb20       7          37
2   A  Mar20      12          37
3   A  Apr20      15          37
4   B  Mar20       1          26
5   B  May20      10          26

Update 2
Try:

>>> df.loc[df['ID'].isin(df.groupby('ID').max().nlargest(2, 'sum_sender').index)]
   ID   date  sender  sum_sender
0   A  Jan20       3          37
1   A  Feb20       7          37
2   A  Mar20      12          37
3   A  Apr20      15          37
7   Y  Jan21      10          47
8   Y  Feb21      12          47
9   Y  Mar21      20          47
10  Y  Apr21       5          47
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading