Comparing 2 columns group by group in pandas or python

I currently have a dataset where I am unsure how to check whether the groups have similar values. Here is a sample of my dataset:

type   value
a       1
a       2
a       3
a       4

b       2
b       3
b       4
b       5

c       1
c       3
c       4



d       2
d       3
d       4
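
For anyone who wants to reproduce this, the sample above could be built as a DataFrame roughly like this. The column names `type` and `value` come from the sample; the construction itself is only an illustration, and the variable name `df` is an assumption.

```python
import pandas as pd

# Sample data from the question: one row per (type, value) pair.
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# Quick check: the values collected per type.
print(df.groupby("type")["value"].agg(list))
```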


I want to know which types are similar, in the sense that all the values in one type are present in another type. For example, type d has values 2, 3, 4 and type a has values 1, 2, 3, 4,
so these are 'similar' (or can be considered the same), and I would like the output to tell me that d is similar to a.

The expected output should look like this:

type   value            similarity
a       1         a is similar to b and d
a       2
a       3
a       4

b       2         b is similar to a and d
b       3
b       4
b       5

c       1         c is similar to a 
c       3
c       4



d       2         d is similar to a and b
d       3
d       4


I'm not sure if this can be done in Python or pandas, but guidance would be really appreciated, as I'm really lost and not sure where to begin.

The output also does not have to be what I put as an example here; it can just be another CSV that tells me which types are similar.

Solution:

I would use set operations.
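
As a small illustration of the set operations involved (a sketch using the values of types a and d from the question, not part of the solution itself), intersection gives the shared values and `issubset` checks whether one group is fully contained in another:

```python
a = {1, 2, 3, 4}  # values of type a in the sample
d = {2, 3, 4}     # values of type d in the sample

common = a & d               # intersection: values present in both groups
print(sorted(common))        # [2, 3, 4]
print(d.issubset(a))         # True: every value of d also appears in a
print(a.issubset(d))         # False: 1 is in a but not in d
```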

Assuming similarity means at least N items in common:

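from itertools import combinations

import pandas as pd

# df is the DataFrame from the question (columns 'type' and 'value')
N = 3  # minimum number of shared values for two types to count as similar

# one set of values per type
s = df.groupby('type')['value'].agg(set)

# boolean Series indexed by (type_1, type_2): True if the pair shares at least N values
out = pd.Series([len(a & b) >= N for a, b in combinations(s, 2)],
                index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))

# add the mirrored pairs, keep only the True ones, then build one sentence per type
similarity = (
    pd.concat([out, out.swaplevel()])
      .loc[lambda x: x].reset_index(-1)
      .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# write the sentence on the first row of each type only
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)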

Output:

   type  value               similarity
0     a      1  a is similar to b, c, d
1     a      2                      NaN
2     a      3                      NaN
3     a      4                      NaN
4     b      2     b is similar to d, a
5     b      3                      NaN
6     b      4                      NaN
7     b      5                      NaN
8     c      1        c is similar to a
9     c      3                      NaN
10    c      4                      NaN
11    d      2     d is similar to a, b
12    d      3                      NaN
13    d      4                      NaN
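
Since the question mentions that a separate CSV listing the similar types would also be fine, here is a minimal sketch of that variant using a plain loop and the same threshold rule. The column names `type_1`, `type_2` and `n_shared` are made up for illustration, and the sample data is rebuilt so the snippet runs on its own.

```python
from itertools import combinations

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

N = 3  # minimum number of shared values (same threshold as above)

# One set of values per type.
sets = df.groupby("type")["value"].agg(set)

# Collect every unordered pair of types that shares at least N values.
rows = []
for (t1, s1), (t2, s2) in combinations(sets.items(), 2):
    shared = s1 & s2
    if len(shared) >= N:
        rows.append({"type_1": t1, "type_2": t2, "n_shared": len(shared)})

pairs = pd.DataFrame(rows)

# With no path, to_csv returns the CSV text; pass a file name to write a file.
print(pairs.to_csv(index=False))
```

Each pair appears once here (a,b but not b,a), which keeps the file short; the mirrored rows can be added with `pd.concat` if both directions are needed.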

Assuming similarity means one set is a subset of the other:

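from itertools import combinations

import pandas as pd

# df is the DataFrame from the question (columns 'type' and 'value')
# one set of values per type
s = df.groupby('type')['value'].agg(set)

# boolean Series indexed by (type_1, type_2): True if either set contains the other
out = pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
                index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))

# add the mirrored pairs, keep only the True ones, then build one sentence per type
similarity = (
    pd.concat([out, out.swaplevel()])
      .loc[lambda x: x].reset_index(-1)
      .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# write the sentence on the first row of each type only
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)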

Output:

   type  value            similarity
0     a      1  a is similar to c, d
1     a      2                   NaN
2     a      3                   NaN
3     a      4                   NaN
4     b      2     b is similar to d
5     b      3                   NaN
6     b      4                   NaN
7     b      5                   NaN
8     c      1     c is similar to a
9     c      3                   NaN
10    c      4                   NaN
11    d      2  d is similar to a, b
12    d      3                   NaN
13    d      4                   NaN
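
The subset rule above is symmetric: a is reported as similar to d because d is contained in a. If only the direction described in the question matters (all values of one type are present in another), a directional variant might look like the sketch below. The dictionary name `contained` is hypothetical, and the sample data is rebuilt so the snippet runs on its own.

```python
from itertools import permutations

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# One set of values per type.
sets = df.groupby("type")["value"].agg(set)

# For each ordered pair (t1, t2), record t2 when every value of t1 is in t2.
contained = {}
for (t1, s1), (t2, s2) in permutations(sets.items(), 2):
    if s1 <= s2:  # s1 is a subset of s2
        contained.setdefault(t1, []).append(t2)

print(contained)  # {'c': ['a'], 'd': ['a', 'b']}
```

Types a and b do not appear as keys because neither is fully contained in any other type. Two types with identical values would each list the other.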