Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

index compatibility of dataframe with multiindex result from apply on group

We have to apply an algorithm to columns in a dataframe, the data has to be grouped by a key and the result shall form a new column in the dataframe. Since it is a common use-case we wonder if we have chosen a correct approach or not.

Following code reflects our approach to the problem in a simplified manner.

import numpy as np
import pandas as pd

np.random.seed(42)

N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)

df = pd.DataFrame(data)

This generates a DataFrame as follows.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

    key         x
0     0  0.969585
1     1  0.775133
2     1  0.939499
3     1  0.894827
4     1  0.597900
..  ...       ...
95   53  0.036887
96   54  0.609564
97   55  0.502679
98   56  0.051479
99   56  0.278646

Application of exemplary methods on the DataFrame groups.

def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)

def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index # preserve index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const) # perform some pandas incompatible magic
    return pd.Series(y, index=index) # reconstruct index

g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))

When assigning a new column to the result df['y'] = y_per_g it will throw a TypeError.

TypeError: incompatible index of inserted column with frame index

Thus a compatible multiindex needs to be introduced first.

df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)

Which yields the intended result.

       key         x    y
index                    
0        0  0.969585  6.9
1        1  0.775133  6.0
2        1  0.939499  6.1
3        1  0.894827  6.4
4        1  0.597900  6.6
...    ...       ...  ...
95      53  0.036887  6.0
96      54  0.609564  6.0
97      55  0.502679  6.5
98      56  0.051479  6.0
99      56  0.278646  6.1

Now we wonder if there is a more straight forward way of dealing with the index and if we generally have chosen a favorable approach.

>Solution :

Use Series.droplevel for remove first level of MultiIndex, so same index like df, so assign working well:

g = df.groupby('key')
df['y']  = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print (df)

    key         x    y
0     0  0.969585  6.9
1     1  0.775133  6.0
2     1  0.939499  6.1
3     1  0.894827  6.4
4     1  0.597900  6.6
..  ...       ...  ...
95   53  0.036887  6.0
96   54  0.609564  6.0
97   55  0.502679  6.5
98   56  0.051479  6.0
99   56  0.278646  6.1

[100 rows x 3 columns]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading