I have a dataframe df1 like this:
A B
AA [a,b,c,d]
BB [a,f,g,c]
CC [a,b,l,m]
And another one as df2 like:
C D
XX [a,b,c,n]
YY [a,m,r,s]
UU [e,h,I,j]
I want to map column C of df2 to column A of df1 based on the highest number of common elements between the lists in df2['D'] and df1['B'], and null if there is no overlap.
The result df will look like:
C D A common_items
XX [a,b,c,n] AA [a,b,c]
YY [a,m,r,s] CC [a,m]
UU [e,h,I,j] Null Null
After spending a lot of time with itertools, np operations, and pd.merge with 'inner', the closest I have come is:
np.intersect1d(df2.D, df1.B)
keys = ['B', 'D']
intersection = df1.merge(df2[keys], on=keys)
Any solution based on the number of common elements between the two columns of the different dataframes, mapping the source df1['A'] to the target df2['C']?
[a,b,c,d] etc. are lists of strings.
Data:
df1.to_dict('list'):
{'A': ['AA', 'BB', 'CC'],
'B': [['a', 'b', 'c', 'd'], ['a', 'f', 'g', 'c'], ['a', 'b', 'l', 'm']]}
df2.to_dict('list'):
{'C': ['XX', 'YY', 'UU'],
'D': [['a', 'b', 'c', 'n'], ['a', 'm', 'r', 's'], ['e', 'h', 'I', 'j']]}
Anything on Pyspark/Pandas would be really helpful.
> Solution:
You can convert the lists to sets, compute the pairwise intersections with numpy, and keep the best match per row:
import numpy as np

# lists -> sets, as object arrays so numpy can broadcast the & operator
b = df1['B'].apply(set).to_numpy()
d = df2['D'].apply(set).to_numpy()
# compute pairwise intersections
common = d[:,None] & b
# get largest intersection per row
vlen = np.vectorize(len)
idx = np.argmax(vlen(common), axis=1)
# assign intersections and original ID
df2['common_items'] = common[np.arange(len(d)), idx]
df2['A'] = np.where(df2['common_items'].str.len()>0,
                    df1['A'].to_numpy()[idx], None)
Output:
    C             D common_items     A
0  XX  [a, b, c, n]    {b, a, c}    AA
1  YY  [a, m, r, s]       {a, m}    CC
2  UU  [e, h, I, j]           {}  None
Intermediates:
# common
array([[{'b', 'a', 'c'}, {'a', 'c'}, {'b', 'a'}],
       [{'a'}, {'a'}, {'a', 'm'}],
       [set(), set(), set()]], dtype=object)
# vlen(common)
array([[3, 2, 2],
       [1, 1, 2],
       [0, 0, 0]])
# idx
array([0, 2, 0])
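For reference, a self-contained version of the above using the question's data (assuming the UU row contains an uppercase 'I', as in the expected output, so it shares nothing with df1):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['AA', 'BB', 'CC'],
                    'B': [['a', 'b', 'c', 'd'], ['a', 'f', 'g', 'c'], ['a', 'b', 'l', 'm']]})
df2 = pd.DataFrame({'C': ['XX', 'YY', 'UU'],
                    'D': [['a', 'b', 'c', 'n'], ['a', 'm', 'r', 's'], ['e', 'h', 'I', 'j']]})

# lists -> sets, as object arrays so numpy broadcasts the & (intersection) operator
b = df1['B'].apply(set).to_numpy()
d = df2['D'].apply(set).to_numpy()

# (len(df2), len(df1)) matrix of pairwise intersections
common = d[:, None] & b

# column index of the largest intersection for each df2 row
vlen = np.vectorize(len)
idx = np.argmax(vlen(common), axis=1)

df2['common_items'] = common[np.arange(len(d)), idx]
df2['A'] = np.where(df2['common_items'].str.len() > 0,
                    df1['A'].to_numpy()[idx], None)
```

Note that np.argmax picks the first maximum, so ties go to the earlier df1 row; the np.where guard is what turns a zero-size "best" intersection into None.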
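An alternative without numpy broadcasting, which may be easier to translate to PySpark's crossJoin: cross-merge the two frames, score each pair by overlap size, and keep the best-scoring df1 row per df2 row. A sketch, assuming pandas >= 1.2 for `how='cross'`:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['AA', 'BB', 'CC'],
                    'B': [['a', 'b', 'c', 'd'], ['a', 'f', 'g', 'c'], ['a', 'b', 'l', 'm']]})
df2 = pd.DataFrame({'C': ['XX', 'YY', 'UU'],
                    'D': [['a', 'b', 'c', 'n'], ['a', 'm', 'r', 's'], ['e', 'h', 'I', 'j']]})

# every (df2 row, df1 row) pair
cross = df2.merge(df1, how='cross')
cross['common_items'] = [set(d) & set(b) for d, b in zip(cross['D'], cross['B'])]
cross['n'] = cross['common_items'].str.len()

# best-scoring df1 row for each df2 row (first one wins on ties)
best = cross.loc[cross.groupby('C', sort=False)['n'].idxmax()].copy()
# no overlap at all -> null out the match
best.loc[best['n'] == 0, ['A', 'common_items']] = None

out = best.drop(columns=['B', 'n']).reset_index(drop=True)
```

The cross merge is O(len(df1) * len(df2)) rows, same as the broadcast matrix above, so both approaches trade memory for vectorized scoring.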