I have a dataset and I am unsure how to check whether its groups have similar values. Here is a sample of my dataset:
type value
a 1
a 2
a 3
a 4
b 2
b 3
b 4
b 5
c 1
c 3
c 4
d 2
d 3
d 4
I want to know which types are similar, in the sense that all the values in one type are present in another type. For example, type d has values 2, 3, 4 and type a has values 1, 2, 3, 4, so these are 'similar' or can be considered the same, and I would like the output to tell me that d is similar to a.
The expected output should look like this:
type value similarity
a 1 A is similar to B and D
a 2
a 3
a 4
b 2 b is similar to a and d
b 3
b 4
b 5
c 1 c is similar to a
c 3
c 4
d 2 d is similar to a and b
d 3
d 4
I am not sure whether this can be done in Python or pandas, but guidance is really appreciated, as I'm really lost and not sure where to begin.
The output also does not have to match the example I gave here; it can just be another CSV that tells me which types are similar and
>Solution :
I would use set operations.
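As a minimal sketch of the set operations involved, using the values from the question's sample (the variable names here are only illustrative):

```python
# Values per type, copied from the question's sample data
a = {1, 2, 3, 4}
b = {2, 3, 4, 5}
c = {1, 3, 4}
d = {2, 3, 4}

shared_ab = a & b        # intersection: values present in both a and b
d_in_a = d.issubset(a)   # True when every value of d also appears in a
c_in_b = c.issubset(b)   # False here, because 1 is in c but not in b

print(shared_ab, d_in_a, c_in_b)
```

The two approaches that follow apply these same operations to every pair of types.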
Assuming similarity means at least N items in common:
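from itertools import combinations
import pandas as pd

# df is the DataFrame from the question, with columns 'type' and 'value'
N = 3  # minimum number of shared values for two types to count as similar

# one set of values per type
s = df.groupby('type')['value'].agg(set)
# one boolean per unordered pair of types
out = pd.Series(
    [len(a & b) >= N for a, b in combinations(s, 2)],
    index=pd.MultiIndex.from_tuples(combinations(s.index, 2)),
)
# mirror the pairs so each type sees all its partners, keep the True ones,
# then build one message per type
similarity = (
    pd.concat([out, out.swaplevel()])
    .loc[lambda x: x].reset_index(-1)
    .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
# write the message on the first row of each type only
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)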
Output:
type value similarity
0 a 1 a is similar to b, c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d, a
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
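Since the question says the result could also be a separate CSV of similar types, here is a hedged sketch of that variant. It rebuilds the sample data so it runs on its own; the column names type_1, type_2 and shared are my own choice, and the threshold N = 3 is the same assumption as above.

```python
from itertools import combinations
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'type': list('aaaabbbbcccddd'),
    'value': [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

N = 3  # assumed threshold: minimum number of shared values

# one set of values per type
sets = df.groupby('type')['value'].agg(set)

# one row per unordered pair of types, with the size of their intersection
rows = [
    {'type_1': t1, 'type_2': t2, 'shared': len(sets[t1] & sets[t2])}
    for t1, t2 in combinations(sets.index, 2)
]
pairs = pd.DataFrame(rows)

# keep only the pairs that meet the threshold
similar_pairs = pairs[pairs['shared'] >= N].reset_index(drop=True)

# to_csv returns a string when no path is given; pass a path to write a file
csv_text = similar_pairs.to_csv(index=False)
print(csv_text)
```

This keeps one row per similar pair instead of a sentence on the first row of each group, which may be easier to filter or join later.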
Assuming similarity means one set is a subset of the other:
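from itertools import combinations
import pandas as pd

# df is the DataFrame from the question, with columns 'type' and 'value'
# one set of values per type
s = df.groupby('type')['value'].agg(set)
# one boolean per unordered pair: True if either set contains the other
out = pd.Series(
    [a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
    index=pd.MultiIndex.from_tuples(combinations(s.index, 2)),
)
# mirror the pairs so each type sees all its partners, keep the True ones,
# then build one message per type
similarity = (
    pd.concat([out, out.swaplevel()])
    .loc[lambda x: x].reset_index(-1)
    .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)
# write the message on the first row of each type only
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)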
Output:
type value similarity
0 a 1 a is similar to c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
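The subset check above is symmetric, so a is reported as similar to d even though only d's values are all contained in a. If the direction matters, as the question's wording ("all the values in one type are present in another type") might suggest, a one-directional variant could look like the sketch below. This is my reading of the question, not something it states outright, and the name contained_in is illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'type': list('aaaabbbbcccddd'),
    'value': [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# one set of values per type
sets = df.groupby('type')['value'].agg(set)

# for each type t, list the other types whose values include all of t's values
contained_in = {
    t: [u for u in sets.index if u != t and sets[t] <= sets[u]]
    for t in sets.index
}

print(contained_in)
```

With the sample data, only c and d end up with non-empty lists, because a and b each hold a value that no other type has all of.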