Home pandas — multiple rows per observation with repeated and non-repeated values

Questions

pandas — multiple rows per observation with repeated and non-repeated values

January 23, 2022

I have data where each observation has a unique id and month, but multiple observations have the same id or month.

Currently the data has more than one observation per row because the each observation gets an additional row for every continent the observation has. Instead of the continents being grouped together in a list or exploded into multiple columns, all the values for the observation are repeated over multiple rows except for the continent column, which will have a unique value for each row in an observation.

    id  month   continent   species color
0   51451   jan africa  penguin red
1   51451   jan s america   penguin red
2   51451   feb n america   penguin red
3   51451   feb asia    penguin red
4   68321   jul asia    lion    blue
5   464316  jul africa  monkey  blue
6   51451   oct africa  dog grey
7   51451   oct s america   dog grey
8   51451   oct europe  dog grey

The data just needs to be tidy. Collapsing the multiple unique continent values per observation into a list will probably work best:

    id  month   continent   species color
0   51451   jan "[""africa"", ""s america""]"   penguin red
1   51451   feb "[""n america"", ""asia""]" penguin red
2   68321   jul "[""asia""]"    lion    blue
3   464316  jul "[""africa""]"  monkey  blue
4   51451   oct ['africa', 's america', 'europe']   dog grey

I have a feeling GroupBy is the answer here,

df.groupby(['id', 'month']). ...

but can’t find an attribute that works here. .unique() only applies to SeriesGroupBy, obviously. (Right?) Is there an attribute that could work here? Or is there a different direction than GroupBy to try? .agg(list) gets close, and would just require some clean up:

df = df.groupby(['id','month']).agg(list)

df[['species', 'color']] = df[['species', 'color']].applymap(
    lambda x: x[0] if (len(np.unique(x))==1) else x
)

This seems a little clunky, though. .agg(lambda x: list(set(x))) gets rid of a bit of the work, but still requires going back and popping the last two columns to get the desired output:

df[['species', 'color']] = df[['species', 'color']].applymap(
    lambda x: x.pop()
)

>Solution :

If I understand you correctly you want the result under continent to be a list and the results under species and colour to be strings:

f = lambda arr: ','.join(np.unique(arr))
df.groupby(['id','month']).agg({'continent':'unique','species':f,'color':f})

id     month                                            
51451  feb              [n america, asia]  penguin   red
       jan            [africa, s america]  penguin   red
       oct    [africa, s america, europe]      dog  grey
68321  jul                         [asia]     lion  blue
464316 jul                       [africa]   monkey  blue