Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Calculate sliding window median for Pandas df

This is the follow-up question to link

So, I have the following df:

df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                'col2':[10,15,20,25,30,35,40,45,50,55],
                'col3':['A','A','A','A','B','B','B','B','B','B']})

I would like to calculate the median values for col1 and col2 for every 'window' rows. To do that I have the following function:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

def calculate_median_val(df, window):
    return df.groupby(np.arange(len(df))//window)['col1','col2'].median()

There are two problems with this function:

  1. The returned df doesn’t contain col3. I understand we can’t calculate the median on strings, but how can I make sure it remains? The expected df is as follows:

if window = 2: The expected returned df is:

   col1 col2 col3
0   1.5 12.5 A
1   3.5 22.5 A 
2   5.5 32.5 B
3   7.5 42.5 B
4   9.5 52.5 B

However, if window=3, then in col3 second window, there will be 'A','B','B'. Ideally, I would like to keep 'B' as it is the majority, but it is not very critical right now, it can be 'A' or 'B'.

  1. When I run the function above, it gives a future warning:

    FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

I think this is due to passing more than 1 argument to calculate median values.

>Solution :

You can pass GroupBy.agg with specify function for col1 like GroupBy.first or GroupBy.last:

df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                'col2':[10,15,20,25,30,35,40,45,50,55],
                'col3':['A','A','A','A','B','B','B','B','B','B']})

def calculate_median_val(df, window):
    return (df.groupby(np.arange(len(df))//window)
               .agg({'col1':'median', 'col2':'median', 'col3':'first'}))

print (calculate_median_val(df, 2))
   col1  col2 col3
0   1.5  12.5    A
1   3.5  22.5    A
2   5.5  32.5    B
3   7.5  42.5    B
4   9.5  52.5    B

If need prioritize B is possible create ordered categoricals and aggregate by max:

def calculate_median_val(df, window):
    df['col3'] = pd.Categorical(df['col3'], ordered=True, categories=['A','B'])
    return (df.groupby(np.arange(len(df))//window)
              .agg({'col1':'median', 'col2':'median', 'col3':'max'}))

print (calculate_median_val(df, 3))

   col1  col2 col3
0     2    15    A
1     5    30    B
2     8    45    B
3    10    55    B
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading