This is the follow-up question to link
So, I have the following df:
df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
'col2':[10,15,20,25,30,35,40,45,50,55],
'col3':['A','A','A','A','B','B','B','B','B','B']})
I would like to calculate the median values for col1 and col2 for every 'window' rows. To do that I have the following function:
def calculate_median_val(df, window):
return df.groupby(np.arange(len(df))//window)['col1','col2'].median()
There are two problems with this function:
- The returned
dfdoesn’t containcol3. I understand we can’t calculate the median on strings, but how can I make sure it remains? The expecteddfis as follows:
if window = 2: The expected returned df is:
col1 col2 col3
0 1.5 12.5 A
1 3.5 22.5 A
2 5.5 32.5 B
3 7.5 42.5 B
4 9.5 52.5 B
However, if window=3, then in col3 second window, there will be 'A','B','B'. Ideally, I would like to keep 'B' as it is the majority, but it is not very critical right now, it can be 'A' or 'B'.
-
When I run the function above, it gives a future warning:
FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
I think this is due to passing more than 1 argument to calculate median values.
>Solution :
You can pass GroupBy.agg with specify function for col1 like GroupBy.first or GroupBy.last:
df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
'col2':[10,15,20,25,30,35,40,45,50,55],
'col3':['A','A','A','A','B','B','B','B','B','B']})
def calculate_median_val(df, window):
return (df.groupby(np.arange(len(df))//window)
.agg({'col1':'median', 'col2':'median', 'col3':'first'}))
print (calculate_median_val(df, 2))
col1 col2 col3
0 1.5 12.5 A
1 3.5 22.5 A
2 5.5 32.5 B
3 7.5 42.5 B
4 9.5 52.5 B
If need prioritize B is possible create ordered categoricals and aggregate by max:
def calculate_median_val(df, window):
df['col3'] = pd.Categorical(df['col3'], ordered=True, categories=['A','B'])
return (df.groupby(np.arange(len(df))//window)
.agg({'col1':'median', 'col2':'median', 'col3':'max'}))
print (calculate_median_val(df, 3))
col1 col2 col3
0 2 15 A
1 5 30 B
2 8 45 B
3 10 55 B