I have data where each observation has a unique id and month, but multiple observations have the same id or month.
Currently the data has more than one observation per row because the each observation gets an additional row for every continent the observation has. Instead of the continents being grouped together in a list or exploded into multiple columns, all the values for the observation are repeated over multiple rows except for the continent column, which will have a unique value for each row in an observation.
id month continent species color
0 51451 jan africa penguin red
1 51451 jan s america penguin red
2 51451 feb n america penguin red
3 51451 feb asia penguin red
4 68321 jul asia lion blue
5 464316 jul africa monkey blue
6 51451 oct africa dog grey
7 51451 oct s america dog grey
8 51451 oct europe dog grey
The data just needs to be tidy. Collapsing the multiple unique continent values per observation into a list will probably work best:
id month continent species color
0 51451 jan "[""africa"", ""s america""]" penguin red
1 51451 feb "[""n america"", ""asia""]" penguin red
2 68321 jul "[""asia""]" lion blue
3 464316 jul "[""africa""]" monkey blue
4 51451 oct ['africa', 's america', 'europe'] dog grey
I have a feeling GroupBy is the answer here,
df.groupby(['id', 'month']). ...
but can’t find an attribute that works here. .unique() only applies to SeriesGroupBy, obviously. (Right?) Is there an attribute that could work here? Or is there a different direction than GroupBy to try? .agg(list) gets close, and would just require some clean up:
df = df.groupby(['id','month']).agg(list)
df[['species', 'color']] = df[['species', 'color']].applymap(
lambda x: x[0] if (len(np.unique(x))==1) else x
)
This seems a little clunky, though. .agg(lambda x: list(set(x))) gets rid of a bit of the work, but still requires going back and popping the last two columns to get the desired output:
df[['species', 'color']] = df[['species', 'color']].applymap(
lambda x: x.pop()
)
>Solution :
If I understand you correctly you want the result under continent to be a list and the results under species and colour to be strings:
f = lambda arr: ','.join(np.unique(arr))
df.groupby(['id','month']).agg({'continent':'unique','species':f,'color':f})
id month
51451 feb [n america, asia] penguin red
jan [africa, s america] penguin red
oct [africa, s america, europe] dog grey
68321 jul [asia] lion blue
464316 jul [africa] monkey blue