Comparing 2 columns group by group in pandas or python

February 1, 2023

I currently have a dataset where I am unsure how to check whether the groups have similar values. Here is a sample of my dataset:

type   value
a       1
a       2
a       3
a       4

b       2
b       3
b       4
b       5

c       1
c       3
c       4



d       2
d       3
d       4

I want to know which types are similar, in the sense that all the values in one type are present in another type. For example, type d has values 2, 3, 4 and type a has values 1, 2, 3, 4,
so these are ‘similar’ (or can be considered the same), and I would like the output to tell me that d is similar to a.

The expected output should look like this:


type   value            similarity
a       1         a is similar to b and d
a       2
a       3
a       4

b       2         b is similar to a and d
b       3
b       4
b       5

c       1         c is similar to a 
c       3
c       4



d       2         d is similar to a and b
d       3
d       4

I'm not sure if this can be done in Python or pandas, but guidance is really appreciated as I’m really lost and not sure where to begin.

The output also does not have to match the example I put here; it can just be another CSV that tells me which types are similar.

>Solution :

I would use set operations.
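The code in this answer expects a DataFrame named df with the columns type and value. As a minimal sketch, assuming the sample from the question is the whole dataset, it can be built as follows, and collapsing each type to a Python set makes the comparisons one-liners:

```python
import pandas as pd

# Sample data copied from the question (assumed to be the full dataset)
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# One set of values per type, indexed by type
s = df.groupby("type")["value"].agg(set)

# Set operators then express both notions of similarity
shared = s["a"] & s["b"]       # values that a and b have in common
print(len(shared))             # size of the overlap
print(s["d"] <= s["a"])        # True: every value of d is also in a
print(s["a"] <= s["b"])        # False: 1 is in a but not in b
```

The `&` operator gives the intersection and `<=` tests for a subset, which are the two building blocks used in the rest of the answer.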

Assuming similarity means at least N items in common:

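from itertools import combinations

import pandas as pd

# df is the question's DataFrame with columns 'type' and 'value'
N = 3  # minimum number of shared values for two types to count as similar

# one set of values per type
s = df.groupby('type')['value'].agg(set)

# boolean Series indexed by (type_1, type_2): True if the pair shares >= N values
out = (pd.Series([len(a & b) >= N for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

# mirror the pairs so every type appears on the left, keep the True ones,
# then build one message per type
similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# write the message only on the first row of each type
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)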

Output:

   type  value               similarity
0     a      1  a is similar to b, c, d
1     a      2                      NaN
2     a      3                      NaN
3     a      4                      NaN
4     b      2     b is similar to d, a
5     b      3                      NaN
6     b      4                      NaN
7     b      5                      NaN
8     c      1        c is similar to a
9     c      3                      NaN
10    c      4                      NaN
11    d      2     d is similar to a, b
12    d      3                      NaN
13    d      4                      NaN
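
If it is not obvious which N to pick, it may help to look at the overlap size of every pair first. The following is only a sketch on the question's sample data; the name overlap is introduced here for illustration:

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

s = df.groupby("type")["value"].agg(set)

# Number of shared values for each unordered pair of types;
# tuple keys become a two-level MultiIndex
overlap = pd.Series(
    {(x, y): len(s[x] & s[y]) for x, y in combinations(s.index, 2)}
)

print(overlap.sort_values(ascending=False))
```

On the sample, the pairs (b, c) and (c, d) share two values and every other pair shares three, which is why N = 3 leaves exactly those two pairs out.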

Assuming similarity means one set is the subset of the other:

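from itertools import combinations

import pandas as pd

# df is the question's DataFrame with columns 'type' and 'value'
# one set of values per type
s = df.groupby('type')['value'].agg(set)

# boolean Series indexed by (type_1, type_2): True if either set contains the other
out = (pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

# mirror the pairs so every type appears on the left, keep the True ones,
# then build one message per type
similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# write the message only on the first row of each type
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)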

Output:

   type  value            similarity
0     a      1  a is similar to c, d
1     a      2                   NaN
2     a      3                   NaN
3     a      4                   NaN
4     b      2     b is similar to d
5     b      3                   NaN
6     b      4                   NaN
7     b      5                   NaN
8     c      1     c is similar to a
9     c      3                   NaN
10    c      4                   NaN
11    d      2  d is similar to a, b
12    d      3                   NaN
13    d      4                   NaN
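
Since the question says a separate CSV listing the similar types would also be fine, here is a rough sketch of that variant. It assumes the directional reading of the question (all values of one type are present in another type); the column name contained_in and the variable names are made up for this example:

```python
from itertools import permutations

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

s = df.groupby("type")["value"].agg(set)

# Ordered pairs (x, y) where every value of x also appears in y
pairs = pd.DataFrame(
    [(x, y) for x, y in permutations(s.index, 2) if s[x] <= s[y]],
    columns=["type", "contained_in"],
)

# With no path, to_csv returns the CSV as a string;
# pass a file name instead to write it to disk
csv_text = pairs.to_csv(index=False)
print(csv_text)
```

Unlike the messages above, this keeps the direction of the relation: d is listed as contained in a, but a is not listed as contained in d.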