Calculate sliding window median for Pandas df

March 23, 2023

This is the follow-up question to link

So, I have the following df:

df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                'col2':[10,15,20,25,30,35,40,45,50,55],
                'col3':['A','A','A','A','B','B','B','B','B','B']})

I would like to calculate the median values for col1 and col2 for every 'window' rows. To do that I have the following function:

def calculate_median_val(df, window):
    return df.groupby(np.arange(len(df))//window)['col1','col2'].median()

There are two problems with this function:

The returned df doesn’t contain col3. I understand we can’t calculate the median on strings, but how can I make sure it remains? The expected df is as follows:

if window = 2: The expected returned df is:

   col1 col2 col3
0   1.5 12.5 A
1   3.5 22.5 A 
2   5.5 32.5 B
3   7.5 42.5 B
4   9.5 52.5 B

However, if window=3, then in col3 second window, there will be 'A','B','B'. Ideally, I would like to keep 'B' as it is the majority, but it is not very critical right now, it can be 'A' or 'B'.

When I run the function above, it gives a future warning:

FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.

I think this is due to passing more than 1 argument to calculate median values.

>Solution :

You can pass GroupBy.agg with specify function for col1 like GroupBy.first or GroupBy.last:

df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
                'col2':[10,15,20,25,30,35,40,45,50,55],
                'col3':['A','A','A','A','B','B','B','B','B','B']})

def calculate_median_val(df, window):
    return (df.groupby(np.arange(len(df))//window)
               .agg({'col1':'median', 'col2':'median', 'col3':'first'}))

print (calculate_median_val(df, 2))
   col1  col2 col3
0   1.5  12.5    A
1   3.5  22.5    A
2   5.5  32.5    B
3   7.5  42.5    B
4   9.5  52.5    B

If need prioritize B is possible create ordered categoricals and aggregate by max:

def calculate_median_val(df, window):
    df['col3'] = pd.Categorical(df['col3'], ordered=True, categories=['A','B'])
    return (df.groupby(np.arange(len(df))//window)
              .agg({'col1':'median', 'col2':'median', 'col3':'max'}))

print (calculate_median_val(df, 3))

   col1  col2 col3
0     2    15    A
1     5    30    B
2     8    45    B
3    10    55    B