How to calculate common elements between list column rows

April 29, 2024

Is it possible to calculate the number of common items in a list column between a row and the previous row, inside a method chain?
My code below throws an error: TypeError: unhashable type: 'list'

import pandas as pd

df = pd.DataFrame({
    'x':[1,2,3,4],
    'list_column': [
        ['apple', 'banana', 'cherry'],
        ['banana', 'cherry'],
        ['cherry', 'date', 'fig'],
        ['orange']
    ]
})

res = len(set(df.loc[1,'list_column']) & set(df.loc[0,'list_column']))
res

df=(df
     .assign(
         list_length=lambda x: x['list_column'].str.len(),
         nr_common=lambda x: (set(x['list_column']) & set(x['list_column'].shift(1))).len() 
         )
)

df

Solution:

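Calling set() on the whole column tries to hash each list it contains, which is what raises the TypeError. I would first convert each list to a set, then use diff: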

df.assign(sets=lambda d: d['list_column'].apply(set),
          common=lambda d: d['sets']-d['sets'].diff(),
          n_common=lambda d: d['common'].str.len(),
         )

Output:

   x              list_column                     sets            common  n_common
0  1  [apple, banana, cherry]  {apple, cherry, banana}               NaN       NaN
1  2         [banana, cherry]         {cherry, banana}  {banana, cherry}       2.0
2  3      [cherry, date, fig]      {date, cherry, fig}          {cherry}       1.0
3  4                 [orange]                 {orange}                {}       0.0
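
Why subtracting the diff gives the intersection: for sets, diff computes current - previous, which is the items that are new in the current row. Removing those new items from the current row leaves exactly the items shared with the previous row. A minimal plain-Python sketch of that identity, using the same data as above (the variable names are illustrative only, and pandas is not involved here):

```python
rows = [
    ['apple', 'banana', 'cherry'],
    ['banana', 'cherry'],
    ['cherry', 'date', 'fig'],
    ['orange'],
]
sets = [set(r) for r in rows]

n_common = []
# Pair each row with the one before it; the first row has no predecessor.
for prev, curr in zip(sets, sets[1:]):
    new_items = curr - prev        # what a set difference with the previous row yields
    common = curr - new_items      # drop the new items, keep the shared ones
    assert common == curr & prev   # same result as a direct intersection
    n_common.append(len(common))

print(n_common)  # [2, 1, 0]
```

These counts match rows 1 to 3 of the n_common column; row 0 has no previous row, hence the NaN there.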

If you don’t want the intermediates:

df.assign(n_common=lambda d: (s:=d['list_column'].apply(set)).sub(s.diff()).str.len())

Or with a custom function:

def common_set(s):
    s = s.apply(set)
    return [len(a&b) for a,b in zip(s, s.shift(fill_value=set()))]

df.assign(n_common=lambda d: common_set(d['list_column']))
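
For reference, here is a self-contained sketch of the custom-function variant that can be run as is. It is a rewrite under assumptions, not the exact code above: the helper name count_common_with_previous is made up for illustration, and the previous-row sets are built with a plain list instead of shift.

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'list_column': [
        ['apple', 'banana', 'cherry'],
        ['banana', 'cherry'],
        ['cherry', 'date', 'fig'],
        ['orange'],
    ],
})

def count_common_with_previous(col):
    # Hypothetical helper: convert every list to a set once.
    sets = [set(items) for items in col]
    # The first row is compared against an empty set.
    previous = [set()] + sets[:-1]
    return [len(a & b) for a, b in zip(sets, previous)]

out = df.assign(
    n_common=lambda d: count_common_with_previous(d['list_column'])
)
print(out['n_common'].tolist())  # [0, 2, 1, 0]
```

Note that this variant reports 0 for the first row, while the diff-based version reports NaN there. Which one is preferable depends on whether "no previous row" should count as "nothing in common".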