Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Capture all unique information by group

I want to create a unique dataset of fruits. I don’t know all the types (e.g. colour store, price) that could be under each fruit. For each type, there could also be duplicate rows. Is there a way to detect all possible duplicates and capture all unique informoation in a fully generalisable way?

   type    val       detail
0 fruit    apple
1 colour   green     greenish
2 colour   yellow    
3 store    walmart    usa
4 price    10
5 NaN
6 fruit    banana
7 colour   yellow
8 fruit    pear
9 fruit    jackfruit
...

Expected Output

   fruit      colour            store    price       detail           ...
0  apple     [green, yellow ]  [walmart]  [10]      [greenish, usa] 
1  banana     [yellow]           NaN      NaN
2  pear        NaN               NaN      NaN    
3  jackfruit   NaN               NaN      NaN    

I tried. But this does not get close to the expected output. It does not show the colum names either.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df.groupby("type")["val"].agg(size=len, set=lambda x: set(x))
0 fruit   {"apple",...}
1 colour  ...

>Solution :

First is created fruit column with val values if type is fruit, replace non matched values to NaNs and forward filling missing values, then pivoting by DataFrame.pivot_table with custom function for unique values without NaNs and then flatten MultiIndex:

m = df['type'].eq('fruit')

df['fruit'] = df['val'].where(m).ffill()

df1 = (df.pivot_table(index='fruit',columns='type', 
                      aggfunc=lambda x: list(dict.fromkeys(x.dropna())))
        .drop('fruit', axis=1, level=1))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df1)
          detail_colour detail_price detail_store       val_colour val_price  \
fruit                                                                          
apple        [greenish]           []        [usa]  [green, yellow]      [10]   
banana               []          NaN          NaN         [yellow]       NaN   
jackfruit           NaN          NaN          NaN              NaN       NaN   
pear                NaN          NaN          NaN              NaN       NaN   

           val_store  
fruit                 
apple      [walmart]  
banana           NaN  
jackfruit        NaN  
pear             NaN  
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading