I have a DataFrame (df_value_bin) with Multiindex which is the binned values, like this:
import pandas as pd
import numpy as np
np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'value'])
a_bins = np.arange(-3, 4, 1)
b_bins = np.arange(-2, 4, 2)
df['a_bins'] = pd.cut(df['a'], bins=a_bins)
df['b_bins'] = pd.cut(df['b'], bins=b_bins)
df_value_bin = df.groupby(['a_bins','b_bins']).agg({'value':'mean'})
Here’s the quickview of df_value_bin:
value
a_bins b_bins
(-3, -2] (-2, 0] -0.417606
(0, 2] -0.267035
(-2, -1] (-2, 0] -0.296727
(0, 2] -0.112280
(-1, 0] (-2, 0] 0.459780
(0, 2] 0.131588
(0, 1] (-2, 0] 0.110268
(0, 2] 0.287755
(1, 2] (-2, 0] 0.254337
(0, 2] -0.627460
(2, 3] (-2, 0] -0.075165
(0, 2] -0.589709
Then, I want to get the closest value of df_value_bin when giving some a and b.
Let’s say a=1.5 and b=-1, then we should get value=0.254337.
Attempt 1
I can generate the boolean mask for a_bins and b_bins:
a_test = 1.5
b_test = -1
boolean_a = df_value_bin.index.get_level_values('a_bins').categories.contains(a_test)
boolean_b = df_value_bin.index.get_level_values('b_bins').categories.contains(b_test)
print(boolean_a, boolean_b) # Output: [False False False False True False] [ True False]
However, I have no idea of using the masks to select the row …
Attempt 2
I can get the index directly:
index_a = np.digitize(a_test, a_bins, right=True)
index_b = np.digitize(b_test, b_bins, right=True)
print(index_a, index_b) # Output: 5 1
Again, I don’t know how to use the index to select the row directly.
Notes
It seems the second method should be quicker as it uses np.digitize().
If you have any idea to complete it or other better methods, please feel free to answer!
>Solution :
You can just index with numbers in this case:
df_value_bin.loc[(1.5, -1)]
Output (ignore value, which is generated randomly, look at the Name):
value 0.047439
Name: ((1, 2], (-2, 0]), dtype: float64