Groupby specific to list of concatenated items

Consider the following Python Pandas DataFrame:

| ID   | time    | vector       | other |
| ---- | ------- | ------------ | ----- |
| 0    | 81219   | [N1, N2, N3] | blue  |
| 0    | 32323   | [N1, N4]     | green |
| 1    | 423     | [N2, N3]     | red   | 
| 1    | 4232    | [N1]         | blue  |
| 1    | 42424   | [N2, N2]     | red   |
| 2    | 42422   | [N3, N5]     | blue  |

I want to create a grouped DataFrame that contains, for each ID, a vector column with the concatenation of that ID's vectors.

Example of the result:


| ID   | vector               |
| ---- | -------------------- |
| 0    | [N1, N2, N3, N1, N4] |
| 1    | [N2, N3, N1, N2, N2] | 
| 2    | [N3, N5]             |

> Solution:

One option is to use `groupby.sum`, since `+` concatenates Python lists:

```python
out = df.groupby('ID', as_index=False)['vector'].sum()
```

However, if you have many lists, the repeated sum is quadratic: each `+` builds a new list and copies every element accumulated so far.
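A minimal sketch of why this is quadratic, assuming plain Python lists (this mirrors what summing lists amounts to; it is not pandas internals):

```python
from functools import reduce
from operator import add

lists = [['N1', 'N2'], ['N3'], ['N4', 'N5']]

# Each `add` allocates a fresh list and copies everything accumulated
# so far, so concatenating n elements this way costs O(n^2) copies.
combined = reduce(add, lists, [])
print(combined)  # ['N1', 'N2', 'N3', 'N4', 'N5']
```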

In this case, prefer `itertools.chain`:

```python
from itertools import chain

out = (df.groupby('ID', as_index=False)['vector']
         .agg(lambda x: list(chain.from_iterable(x)))
      )
```

Output:

```
   ID                vector
0   0  [N1, N2, N3, N1, N4]
1   1  [N2, N3, N1, N2, N2]
2   2              [N3, N5]
```
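As a further alternative (a sketch, not part of the original answer), `DataFrame.explode` can flatten the lists into one row per element and then regroup; it produces the same result, at the cost of materializing the exploded rows:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 0, 1, 1, 1, 2],
                   'vector': [['N1', 'N2', 'N3'], ['N1', 'N4'], ['N2', 'N3'],
                              ['N1'], ['N2', 'N2'], ['N3', 'N5']]})

# explode turns each list element into its own row; agg(list) then
# reassembles one concatenated list per ID
out = (df.explode('vector')
         .groupby('ID', as_index=False)['vector']
         .agg(list))
print(out)
```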

Timing

Let’s use 10k repeats of the example and compare `sum` and `itertools.chain`:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 0, 1, 1, 1, 2],
                   'vector': [['N1', 'N2', 'N3'], ['N1', 'N4'], ['N2', 'N3'],
                              ['N1'], ['N2', 'N2'], ['N3', 'N5']]})
df = pd.concat([df]*10000, ignore_index=True)
```

```python
%%timeit
df.groupby('ID', as_index=False)['vector'].sum()
# this is very slow!
# 1.86 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```python
%%timeit
from itertools import chain
df.groupby('ID', as_index=False)['vector'].agg(lambda x: list(chain.from_iterable(x)))
# much faster!
# 8.02 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```