Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to get the closest the column value by the bin Multiindex?

I have a DataFrame (df_value_bin) with Multiindex which is the binned values, like this:

import pandas as pd
import numpy as np

np.random.seed(100)
df = pd.DataFrame(np.random.randn(100, 3), columns=['a', 'b', 'value'])

a_bins = np.arange(-3, 4, 1)
b_bins = np.arange(-2, 4, 2)

df['a_bins'] = pd.cut(df['a'], bins=a_bins)
df['b_bins'] = pd.cut(df['b'], bins=b_bins)
df_value_bin = df.groupby(['a_bins','b_bins']).agg({'value':'mean'})

Here’s the quickview of df_value_bin:

                     value
a_bins   b_bins           
(-3, -2] (-2, 0] -0.417606
         (0, 2]  -0.267035
(-2, -1] (-2, 0] -0.296727
         (0, 2]  -0.112280
(-1, 0]  (-2, 0]  0.459780
         (0, 2]   0.131588
(0, 1]   (-2, 0]  0.110268
         (0, 2]   0.287755
(1, 2]   (-2, 0]  0.254337
         (0, 2]  -0.627460
(2, 3]   (-2, 0] -0.075165
         (0, 2]  -0.589709

Then, I want to get the closest value of df_value_bin when giving some a and b.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Let’s say a=1.5 and b=-1, then we should get value=0.254337.

Attempt 1

I can generate the boolean mask for a_bins and b_bins:

a_test = 1.5
b_test = -1

boolean_a = df_value_bin.index.get_level_values('a_bins').categories.contains(a_test)
boolean_b = df_value_bin.index.get_level_values('b_bins').categories.contains(b_test)

print(boolean_a, boolean_b) # Output: [False False False False  True False] [ True False]

However, I have no idea of using the masks to select the row …

Attempt 2

I can get the index directly:

index_a = np.digitize(a_test, a_bins, right=True)
index_b = np.digitize(b_test, b_bins, right=True)

print(index_a, index_b) # Output: 5 1

Again, I don’t know how to use the index to select the row directly.

Notes

It seems the second method should be quicker as it uses np.digitize().
If you have any idea to complete it or other better methods, please feel free to answer!

>Solution :

You can just index with numbers in this case:

df_value_bin.loc[(1.5, -1)]

Output (ignore value, which is generated randomly, look at the Name):

value    0.047439
Name: ((1, 2], (-2, 0]), dtype: float64
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading