Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to group by one column or another in pandas

I have a table like:

    col1    col2
0   1       a
1   2       b
2   2       c
3   3       c
4   4       d

I’d like rows to be grouped together if they have a matching value in col1 or col2. That is, I’d like something like this:

> (
    df
    .groupby(set('col1', 'col2'))  # Made-up syntax
    .ngroup())
0  0
1  1
2  1
3  1
4  2

Is there a way to do this with pandas?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

This is not easy to achieve simply with pandas. Indeed, two far away groups can become connected when two items are connected in the second group.

You can approach this using graph theory. Find the connected components using edges formed by the two (or more) groups. A python library for this is networkx:

import networkx as nx

g1 = df.groupby('col1').ngroup()
g2 = 'a'+df.groupby('col2').ngroup().astype(str)

# make graph and get connected components to form a mapping dictionary
G = nx.from_edgelist(zip(g1, g2))
d = {k:v for v,s in enumerate(nx.connected_components(G)) for k in s}

# find common group
group = g1.map(d)

df.groupby(group).ngroup()

output:

0    0
1    1
2    1
3    1
4    2
dtype: int64

graph:

enter image description here

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading