Map columns of two dataframes based on the array intersection of their individual columns and the highest common-element match (PySpark/Pandas)

I have a dataframe df1 like this:

A   B
AA  [a,b,c,d]
BB  [a,f,g,c]
CC  [a,b,l,m]

And another one as df2 like:

C   D
XX  [a,b,c,n]
YY  [a,m,r,s]
UU  [e,h,I,j]

I want to map column C of df2 to column A of df1 based on the highest element overlap between the items of df2['D'] and df1['B'], with null if there is no overlap.

The result df will look like:

C      D         A   common_items
XX  [a,b,c,n]   AA    [a,b,c]
YY  [a,m,r,s]   CC    [a,m]
UU  [e,h,I,j]   Null    Null

After spending a lot of time with itertools, np operations, and pd.merge with 'inner', the closest I have come is:

np.intersect1d(df2.D, df1.B)

keys = ['B', 'D']
intersection = df1.merge(df2[keys], on=keys)

I am looking for a solution based on the number of common elements between the two columns of the different dataframes, mapping the source df1['A'] to the target df2['C'].
[a,b,c,d] etc. are lists of strings.

Data:

df1.to_dict('list'):
{'A': ['AA', 'BB', 'CC'],
 'B': [['a', 'b', 'c', 'd'], ['a', 'f', 'g', 'c'], ['a', 'b', 'l', 'm']]}


df2.to_dict('list'):
{'C': ['XX', 'YY', 'UU'],
 'D': [['a', 'b', 'c', 'n'], ['a', 'm', 'r', 's'], ['e', 'h', 'I', 'j']]}
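For anyone wanting to reproduce this, the two inputs can be rebuilt from the dicts above (a reproduction snippet, not part of the original question):

```python
import pandas as pd

# rebuild the example inputs from the dicts above
df1 = pd.DataFrame({'A': ['AA', 'BB', 'CC'],
                    'B': [['a', 'b', 'c', 'd'],
                          ['a', 'f', 'g', 'c'],
                          ['a', 'b', 'l', 'm']]})
df2 = pd.DataFrame({'C': ['XX', 'YY', 'UU'],
                    'D': [['a', 'b', 'c', 'n'],
                          ['a', 'm', 'r', 's'],
                          ['e', 'h', 'I', 'j']]})
```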

Anything in PySpark or Pandas would be really helpful.

Solution:

You can convert the lists to sets, compute the pairwise intersections with broadcasting, and get the best match:

import numpy as np

b = df1['B'].apply(set).to_numpy()
d = df2['D'].apply(set).to_numpy()

# compute pairwise intersections
common = d[:,None] & b

# get largest intersection per row
vlen = np.vectorize(len)
idx = np.argmax(vlen(common), axis=1)

# assign intersections and original ID
df2['common_items'] = common[np.arange(len(d)), idx]
df2['A'] = np.where(df2['common_items'].str.len()>0,
                    df1['A'].to_numpy()[idx], None)

Output:

    C             D common_items     A
0  XX  [a, b, c, n]    {b, a, c}    AA
1  YY  [a, m, r, s]       {a, m}    CC
2  ZZ  [m, b, c, d]    {b, d, c}    AA
3  UU  [e, h, I, j]           {}  None

Intermediates:

# common
array([[{'b', 'a', 'c'}, {'a', 'c'}, {'b', 'a'}],
       [{'a'}, {'a'}, {'a', 'm'}],
       [{'b', 'd', 'c'}, {'c'}, {'b', 'm'}],
       [set(), set(), set()]], dtype=object)

# vlen(common)
array([[3, 2, 2],
       [1, 1, 2],
       [3, 1, 2],
       [0, 0, 0]])

# idx
array([0, 2, 0, 0])
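A pure-pandas alternative closer to the asker's `merge` attempt is to explode both list columns to one row per element and inner-join on the element, then count matches per pair (a sketch; the `item` column name is my own, not from the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['AA', 'BB', 'CC'],
                    'B': [['a', 'b', 'c', 'd'],
                          ['a', 'f', 'g', 'c'],
                          ['a', 'b', 'l', 'm']]})
df2 = pd.DataFrame({'C': ['XX', 'YY', 'UU'],
                    'D': [['a', 'b', 'c', 'n'],
                          ['a', 'm', 'r', 's'],
                          ['e', 'h', 'I', 'j']]})

# one row per list element, then inner-join on the element
e1 = df1.explode('B').rename(columns={'B': 'item'})
e2 = df2.explode('D').rename(columns={'D': 'item'})
pairs = e2.merge(e1, on='item')

# collect shared elements per (C, A) pair and keep the best A per C
counts = (pairs.groupby(['C', 'A'])['item'].agg(list)
               .reset_index(name='common_items'))
counts['n'] = counts['common_items'].str.len()
best = counts.sort_values('n', ascending=False).drop_duplicates('C')

# left-join back so rows with no overlap (UU) get NaN
out = df2.merge(best[['C', 'A', 'common_items']], on='C', how='left')
```

Unlike the broadcast-of-sets approach, this stays in tabular operations throughout, which tends to scale better when the frames are large but the vocabularies overlap sparsely.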