I have a dataframe df1 like this:
A B
AA [a,b,c,d]
BB [a,f,g,c]
CC [a,b,l,m]
And another one as df2 like:
C D
XX [a,b,c,n]
YY [a,m,r,s]
UU [e,h,I,j]
I want to map column C of df2 to column A of df1 based on the highest number of common elements between the lists in df2['D'] and df1['B'], and null if there is no overlap.
The result df will look like:
C D A common_items
XX [a,b,c,n] AA [a,b,c]
YY [a,m,r,s] CC [a,m]
UU [e,h,I,j] Null Null
After spending a lot of time with itertools, np operations, and pd.merge with 'inner', the closest I have come is:
np.intersect1d(df2.D, df1.B)
keys = ['B', 'D']
intersection = df1.merge(df2[keys], on=keys)
Any solution based on the number of common elements between the two columns of the different dataframes, mapping the source df1['A'] to the target df2['C']?
[a,b,c,d] etc. are lists of strings.
Data:
df1.to_dict('list'):
{'A': ['AA', 'BB', 'CC'],
'B': [['a', 'b', 'c', 'd'], ['a', 'f', 'g', 'c'], ['a', 'b', 'l', 'm']]}
df2.to_dict('list'):
{'C': ['XX', 'YY', 'UU'],
'D': [['a', 'b', 'c', 'n'], ['a', 'm', 'r', 's'], ['e', 'h', 'I', 'j']]}
Anything on Pyspark/Pandas would be really helpful.
> Solution:
You can convert the lists to sets, compute the pairwise intersections with numpy, and keep the best match per row:
import numpy as np

# lists -> sets, as object arrays so numpy can broadcast the & operator
b = df1['B'].apply(set).to_numpy()
d = df2['D'].apply(set).to_numpy()
# compute pairwise intersections
common = d[:,None] & b
# get largest intersection per row
vlen = np.vectorize(len)
idx = np.argmax(vlen(common), axis=1)
# assign intersections and original ID
df2['common_items'] = common[np.arange(len(d)), idx]
df2['A'] = np.where(df2['common_items'].str.len()>0,
                    df1['A'].to_numpy()[idx], None)
Output:
    C             D common_items     A
0  XX  [a, b, c, n]    {b, a, c}    AA
1  YY  [a, m, r, s]       {a, m}    CC
2  UU  [e, h, I, j]           {}  None
Intermediates:
# common
array([[{'b', 'a', 'c'}, {'a', 'c'}, {'b', 'a'}],
       [{'a'}, {'a'}, {'a', 'm'}],
       [set(), set(), set()]], dtype=object)
# vlen(common)
array([[3, 2, 2],
       [1, 1, 2],
       [0, 0, 0]])
# idx
array([0, 2, 0])
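For reference, a self-contained version of the above using the question's data (assuming the UU row contains an uppercase 'I', as in the expected output, so it shares nothing with df1):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['AA', 'BB', 'CC'],
                    'B': [['a', 'b', 'c', 'd'], ['a', 'f', 'g', 'c'], ['a', 'b', 'l', 'm']]})
df2 = pd.DataFrame({'C': ['XX', 'YY', 'UU'],
                    'D': [['a', 'b', 'c', 'n'], ['a', 'm', 'r', 's'], ['e', 'h', 'I', 'j']]})

# lists -> sets, as object arrays so numpy broadcasts the & (intersection) operator
b = df1['B'].apply(set).to_numpy()
d = df2['D'].apply(set).to_numpy()

# (len(df2), len(df1)) matrix of pairwise intersections
common = d[:, None] & b

# column index of the largest intersection for each df2 row
vlen = np.vectorize(len)
idx = np.argmax(vlen(common), axis=1)

df2['common_items'] = common[np.arange(len(d)), idx]
df2['A'] = np.where(df2['common_items'].str.len() > 0,
                    df1['A'].to_numpy()[idx], None)
```

Note that np.argmax picks the first maximum, so ties go to the earlier df1 row; the np.where guard is what turns a zero-size "best" intersection into None.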
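An alternative without numpy broadcasting, which may be easier to translate to PySpark's crossJoin: cross-merge the two frames, score each pair by overlap size, and keep the best-scoring df1 row per df2 row. A sketch, assuming pandas >= 1.2 for `how='cross'`:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['AA', 'BB', 'CC'],
                    'B': [['a', 'b', 'c', 'd'], ['a', 'f', 'g', 'c'], ['a', 'b', 'l', 'm']]})
df2 = pd.DataFrame({'C': ['XX', 'YY', 'UU'],
                    'D': [['a', 'b', 'c', 'n'], ['a', 'm', 'r', 's'], ['e', 'h', 'I', 'j']]})

# every (df2 row, df1 row) pair
cross = df2.merge(df1, how='cross')
cross['common_items'] = [set(d) & set(b) for d, b in zip(cross['D'], cross['B'])]
cross['n'] = cross['common_items'].str.len()

# best-scoring df1 row for each df2 row (first one wins on ties)
best = cross.loc[cross.groupby('C', sort=False)['n'].idxmax()].copy()
# no overlap at all -> null out the match
best.loc[best['n'] == 0, ['A', 'common_items']] = None

out = best.drop(columns=['B', 'n']).reset_index(drop=True)
```

The cross merge is O(len(df1) * len(df2)) rows, same as the broadcast matrix above, so both approaches trade memory for vectorized scoring.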