Is it possible to calculate, within a method chain, the number of common items in a list column between each row and the previous row?
My code below throws an error: TypeError: unhashable type: 'list'
import pandas as pd
df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'list_column': [
        ['apple', 'banana', 'cherry'],
        ['banana', 'cherry'],
        ['cherry', 'date', 'fig'],
        ['orange']
    ]
})
res = len(set(df.loc[1,'list_column']) & set(df.loc[0,'list_column']))
res
df = (df
    .assign(
        list_length=lambda x: x['list_column'].str.len(),
        nr_common=lambda x: (set(x['list_column']) & set(x['list_column'].shift(1))).len()
    )
)
df
Solution:
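For context, the error in the question appears to come from calling set() on the whole column: set() iterates over the Series and tries to hash each element, and lists are not hashable. A minimal sketch that reproduces this without pandas (the names rows, row_sets and error_message are illustrative, not from the question):

```python
# Each element of the column is itself a list; set() has to hash every element it is given.
rows = [['apple', 'banana', 'cherry'], ['banana', 'cherry']]

try:
    set(rows)  # fails for the same reason as set(x['list_column'])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Converting each row on its own works, because the strings inside are hashable.
row_sets = [set(r) for r in rows]
n_common = len(row_sets[1] & row_sets[0])
print(n_common)
```

So the conversion to sets has to happen per row, which is what the approaches here do.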
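I would first convert all lists to sets, then use diff. On a column of sets, diff subtracts the previous row's set from the current one, leaving the items that are new in each row; subtracting that result from the current set leaves the items shared with the previous row: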
df.assign(sets=lambda d: d['list_column'].apply(set),
          common=lambda d: d['sets']-d['sets'].diff(),
          n_common=lambda d: d['common'].str.len(),
         )
Output:
   x              list_column                     sets            common  n_common
0  1  [apple, banana, cherry]  {apple, cherry, banana}               NaN       NaN
1  2         [banana, cherry]         {cherry, banana}  {banana, cherry}       2.0
2  3      [cherry, date, fig]      {date, cherry, fig}          {cherry}       1.0
3  4                 [orange]                 {orange}                {}       0.0
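The set arithmetic behind this can be checked on a single pair of rows in plain Python. This is only a sketch of the identity current - (current - previous) == current & previous, using rows 1 and 2 of the example; the variable names are illustrative:

```python
previous = {'banana', 'cherry'}        # row 1
current = {'cherry', 'date', 'fig'}    # row 2

new_items = current - previous         # what the diff step yields for row 2
common = current - new_items           # removing the new items leaves the shared ones

# The result is the same as a direct intersection.
assert common == current & previous
print(len(common))  # 1
```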
If you don’t want the intermediate columns (the := operator requires Python 3.8 or later):
df.assign(n_common=lambda d: (s:=d['list_column'].apply(set)).sub(s.diff()).str.len())
Or with a custom function:
def common_set(s):
    # convert each list to a set, then intersect with the previous row's set
    s = s.apply(set)
    return [len(a&b) for a,b in zip(s, s.shift(fill_value=set()))]
df.assign(n_common=lambda d: common_set(d['list_column']))
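If relying on shift with a set fill value feels fragile, the same count can be computed with an explicit loop. The helper below is a hypothetical alternative (count_common_with_previous is not part of pandas or of the answer above); it accepts any iterable of lists:

```python
def count_common_with_previous(lists):
    """Return, for each list, how many distinct items it shares with the list before it."""
    counts = []
    previous = set()  # the first row has no predecessor, so it shares nothing
    for items in lists:
        current = set(items)
        counts.append(len(current & previous))
        previous = current
    return counts


rows = [
    ['apple', 'banana', 'cherry'],
    ['banana', 'cherry'],
    ['cherry', 'date', 'fig'],
    ['orange'],
]
print(count_common_with_previous(rows))  # [0, 2, 1, 0]
```

Because iterating over a Series yields its values, this should also fit into the method chain as df.assign(n_common=lambda d: count_common_with_previous(d['list_column'])). Note that it gives 0 for the first row, where the diff-based version gives NaN.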