I need to assign scores to each of the values in many columns of a pandas dataframe, depending on the percentile score range each value falls between.
I have created a function:
import pandas as pd
import numpy as np
def get_percentiles(x, percentile_array):
percentile_array = np.sort(np.array(percentile_array))
if x < x.quantile(percentile_array[0]) < 0:
return 1
elif (x >= x.quantile(percentile_array[0]) & (x < x.quantile(percentile_array[1]):
return 2
elif (x >= x.quantile(percentile_array[1]) & (x < x.quantile(percentile_array[2]):
return 3
elif (x >= x.quantile(percentile_array[2]) & (x < x.quantile(percentile_array[3]):
return 4
else:
return 5
Sample data:
df = pd.DataFrame({'col1' : [1,10,5,9,15,4],
'col2' : [4,10,15,19,3,2],
'col3' : [10,5,6,9,1,24]})
When I try to run the function using apply:
percentile_array = [0.05, 0.25, 0.5, 0.75]
df.apply(lambda x : get_percentiles(x, percentile_array), result_type = 'expand')
I get below error:
Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
The expected output is new dataframe with 3 columns that has the scores between 1 and 5 depending on which percentile range each value in each column falls under
>Solution :
IIUC, you could use rank to compute the percentile (per column), then pandas.cut to bin the values to your reference:
percentile_array = [0.05, 0.25, 0.5, 0.75]
bins = [-np.inf]+percentile_array+[np.inf]
labels = [1, 2, 3, 4, 5]
out = (df.rank(pct=True)
.apply(lambda c: pd.cut(c, bins=bins, labels=labels))
)
Alternatively, with numpy.searchsorted:
percentile_array = [0.05, 0.25, 0.5, 0.75]
bins = [-np.inf]+percentile_array+[np.inf]
out = pd.DataFrame(np.searchsorted(bins, df.rank(pct=True)),
columns=df.columns, index=df.index)
Output:
col1 col2 col3
0 2 3 5
1 5 4 3
2 3 5 3
3 4 5 4
4 5 3 2
5 3 2 5
Intermediate:
df.rank(pct=True)
col1 col2 col3
0 0.166667 0.500000 0.833333
1 0.833333 0.666667 0.333333
2 0.500000 0.833333 0.500000
3 0.666667 1.000000 0.666667
4 1.000000 0.333333 0.166667
5 0.333333 0.166667 1.000000