I have a DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({
"realization_id": np.repeat([0, 1], 6),
"sample_size": np.tile([0, 1, 2], 4),
"num_obs": np.tile(np.repeat([25, 100], 3), 2),
"accuracy": [0.8, 0.7, 0.8, 0.6, 0.7, 0.5, 0.6, 0.7, 0.8, 0.7, 0.9, 0.7],
"prob": [0.94, 0.96, 0.95, 0.98, 0.93, 0.92, 0.90, 0.92, 0.95, 0.9, 0.91, 0.92]
})
df["accum_max_prob"] = df.groupby(["realization_id", "num_obs"])["prob"].cummax()
And I want to know how to create a column with this output:
df["desired_accuracy"] = [0.8, 0.7, 0.7, 0.6, 0.6, 0.6, 0.6, 0.7, 0.8, 0.7, 0.9, 0.7]
Each entry of desired_accuracy equals the accuracy value that corresponds to the row where the highest prob has been achieved so far by group (that is why I create accum_max_prob).
So, for example: the first value is 0.8 because there is no data prior to that, but then the next one is 0.7 because the prob of the second row is greater than the first. The third value stays the same, because the third prob is lower than the second one, so it does not update desired_accuracy. For each pair (realization_id, num_obs) the criteria resets.
How can I do it in a vectorized fashion using Pandas?
>Solution :
It looks like:
df['desired_accuracy'] = df['accuracy'].mask(df['prob'].lt(df['accum_max_prob'])).ffill()
Output:
realization_id sample_size num_obs accuracy prob accum_max_prob desired_accuracy
0 0 0 25 0.8 0.94 0.94 0.8
1 0 1 25 0.7 0.96 0.96 0.7
2 0 2 25 0.8 0.95 0.96 0.7
3 0 0 100 0.6 0.98 0.98 0.6
4 0 1 100 0.7 0.93 0.98 0.6
5 0 2 100 0.5 0.92 0.98 0.6
6 1 0 25 0.6 0.90 0.90 0.6
7 1 1 25 0.7 0.92 0.92 0.7
8 1 2 25 0.8 0.95 0.95 0.8
9 1 0 100 0.7 0.90 0.90 0.7
10 1 1 100 0.9 0.91 0.91 0.9
11 1 2 100 0.7 0.92 0.92 0.7