This is the follow-up question to link
So, I have the following df
:
df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
'col2':[10,15,20,25,30,35,40,45,50,55],
'col3':['A','A','A','A','B','B','B','B','B','B']})
I would like to calculate the median values for col1
and col2
for every 'window'
rows. To do that I have the following function:
def calculate_median_val(df, window):
return df.groupby(np.arange(len(df))//window)['col1','col2'].median()
There are two problems with this function:
- The returned
df
doesn’t containcol3
. I understand we can’t calculate the median on strings, but how can I make sure it remains? The expecteddf
is as follows:
if window = 2
: The expected returned df
is:
col1 col2 col3
0 1.5 12.5 A
1 3.5 22.5 A
2 5.5 32.5 B
3 7.5 42.5 B
4 9.5 52.5 B
However, if window=3
, then in col3
second window, there will be 'A','B','B'
. Ideally, I would like to keep 'B'
as it is the majority, but it is not very critical right now, it can be 'A'
or 'B'
.
-
When I run the function above, it gives a future warning:
FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
I think this is due to passing more than 1 argument to calculate median values.
>Solution :
You can pass GroupBy.agg
with specify function for col1
like GroupBy.first
or GroupBy.last
:
df = pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],
'col2':[10,15,20,25,30,35,40,45,50,55],
'col3':['A','A','A','A','B','B','B','B','B','B']})
def calculate_median_val(df, window):
return (df.groupby(np.arange(len(df))//window)
.agg({'col1':'median', 'col2':'median', 'col3':'first'}))
print (calculate_median_val(df, 2))
col1 col2 col3
0 1.5 12.5 A
1 3.5 22.5 A
2 5.5 32.5 B
3 7.5 42.5 B
4 9.5 52.5 B
If need prioritize B
is possible create ordered categoricals and aggregate by max
:
def calculate_median_val(df, window):
df['col3'] = pd.Categorical(df['col3'], ordered=True, categories=['A','B'])
return (df.groupby(np.arange(len(df))//window)
.agg({'col1':'median', 'col2':'median', 'col3':'max'}))
print (calculate_median_val(df, 3))
col1 col2 col3
0 2 15 A
1 5 30 B
2 8 45 B
3 10 55 B