Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Slicing MultiIndex pandas Dataframe

I am working with awkward arrays and dumping information to pandas dataframe with multiindex:

>>> import awkward as ak
>>> import pandas as pd

>>> ak_arr = ak.Array([
...     {
...       'jet_pt': [2.33e+05, 1.1e+04, 1.47e+05, 1.33e+04, 1.73e+05, 1.07e+04], 
...       'jet_num': 6, 
...       'bb_dR': [0.83e-01, 0.56e-01, 0.98e-01, 0.32e-01, 0.21e-01, 0.66e-01], 
...       'hh_m': 3.25e+05
...     }, 
...     {
...       'jet_pt': [1.48e+05, 2.06e+04, 9.93e+04, 1.29e+04], 
...       'jet_num': 4, 
...       'bb_dR': [0.12e-1, 0.32e-01, 0.45e-01, 0.76e-01, 0.33e-01, 0.54e-01], 
...       'hh_m': 2.87e+05
...     }
... ])
>>> ak_arr
<Array [{jet_pt: [...], ...}, {...}] type='2 * {jet_pt: var * float64, jet_...'>

>>> df = ak.to_dataframe(ak_arr, how='outer')
>>> df
                  jet_pt  jet_num  bb_dR      hh_m
entry subentry                                    
0     0         233000.0      6    0.083  325000.0
      1          11000.0      6    0.056  325000.0
      2         147000.0      6    0.098  325000.0
      3          13300.0      6    0.032  325000.0
      4         173000.0      6    0.021  325000.0
      5          10700.0      6    0.066  325000.0
1     0         148000.0      4    0.012  287000.0
      1          20600.0      4    0.032  287000.0
      2          99300.0      4    0.045  287000.0
      3          12900.0      4    0.076  287000.0
      4              NaN      4    0.033  287000.0
      5              NaN      4    0.054  287000.0

I would like to know:

  1. How can I select jet_pt using the column jet_num to get the number of jets in each entry, so if jet_num is 6, then jet_t will have 6 subentries and so on? Wanted result:
                  jet_pt
entry subentry                                    
0     0         233000.0
      1          11000.0
      2         147000.0
      3          13300.0
      4         173000.0
      5          10700.0
1     0         148000.0
      1          20600.0
      2          99300.0
      3          12900.0

I can accomplish this result with:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

jet_num = df['jet_num'].max(level=0)
jet_z = df['jet_z'].groupby(level=0).apply(lambda x: x[:jet_num[x.name]]).droplevel(0)

but it feels inefficient to me.

  1. How can I select the column bb_dR but only the first 4th elements in subentry? Wanted result:
                bb_dR
entry subentry                                    
0     0         0.083
      1         0.056
      2         0.098
      3         0.032
1     0         0.012
      1         0.032
      2         0.045
      3         0.076

Again, I can achieve the wanted result by doing:

df['bb_dR'].groupby(level=0).apply(lambda x: x[:4]).droplevel(0)

but still think there is a better way.

  1. Similar to 2, how can I select just the first row in entry and subentry? Wanted result:
                  hh_m
entry subentry                                    
0     0           325000.0
1     0           287000.0

I think for 3, it would also be useful to drop entry and subentry. Thanks in advance.

>Solution :

answer1

cond = df.index.get_level_values(1) < df['jet_num']
out1 = df.loc[cond, ['jet_pt']]

out1

                  jet_pt
entry subentry          
0     0         233000.0
      1          11000.0
      2         147000.0
      3          13300.0
      4         173000.0
      5          10700.0
1     0         148000.0
      1          20600.0
      2          99300.0
      3          12900.0

answer2

out2 = df.loc[(slice(None), slice(0, 3)), ['bb_dR']]

out2

                bb_dR
entry subentry       
0     0         0.083
      1         0.056
      2         0.098
      3         0.032
1     0         0.012
      1         0.032
      2         0.045
      3         0.076

answer3

out3 = df.loc[(slice(None), 0), ['hh_m']]

out3

                    hh_m
entry subentry          
0     0         325000.0
1     0         287000.0

If your multi-index does not have integer locations like 0, 1, use groupby + cumcount. In the case of answer1, using cumcount results in the following code:

cond = df.groupby(level=0).cumcount() < df['jet_num']
out1 = df.loc[cond, ['jet_pt']]

Example Code

import pandas as pd
nan = float('nan')
df = pd.DataFrame({'jet_pt': {(0, 0): 233000.0, (0, 1): 11000.0, (0, 2): 147000.0, (0, 3): 13300.0, (0, 4): 173000.0, (0, 5): 10700.0, (1, 0): 148000.0, (1, 1): 20600.0, (1, 2): 99300.0, (1, 3): 12900.0, (1, 4): nan, (1, 5): nan}, 'jet_num': {(0, 0): 6, (0, 1): 6, (0, 2): 6, (0, 3): 6, (0, 4): 6, (0, 5): 6, (1, 0): 4, (1, 1): 4, (1, 2): 4, (1, 3): 4, (1, 4): 4, (1, 5): 4}, 'bb_dR': {(0, 0): 0.083, (0, 1): 0.056, (0, 2): 0.098, (0, 3): 0.032, (0, 4): 0.021, (0, 5): 0.066, (1, 0): 0.012, (1, 1): 0.032, (1, 2): 0.045, (1, 3): 0.076, (1, 4): 0.033, (1, 5): 0.054}, 'hh_m': {(0, 0): 325000.0, (0, 1): 325000.0, (0, 2): 325000.0, (0, 3): 325000.0, (0, 4): 325000.0, (0, 5): 325000.0, (1, 0): 287000.0, (1, 1): 287000.0, (1, 2): 287000.0, (1, 3): 287000.0, (1, 4): 287000.0, (1, 5): 287000.0}}).rename_axis(['entry', 'subentry'])
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading