Is there an alternative to pandas' merge_asof when joining on multiple columns?

I need to merge two data frames on three key columns: one contains strings, one integers, and one floats. The string and integer columns must match exactly, while for the float column I want each row of the left data frame to pick up the values from the right data frame row whose float key is nearest. I thought pandas' merge_asof function would fit, but as far as I can tell it can only join on a single numerical column.

I have this simple solution using a cross merge, but it won't scale to the real problem, where each data frame has ~2 million rows.

import pandas as pd
import numpy as np
import itertools

df1 = pd.DataFrame(data = itertools.product(['a', 'b', 'c'],
                                            [0, 1, 2, 3, 4],
                                            [0, 0.25, 0.5, 0.75, 1]), 
                   columns = ['key1', 'key2', 'key3'])

df2 = df1.copy()
df2['key3'] = np.random.rand(len(df2))
df2['values'] = np.random.rand(len(df2))

# Cross join: pairs every row of df1 with every row of df2 (len(df1) * len(df2) rows).
merged = df1.merge(df2, how = 'cross')

# Keep only the pairs whose exact-match keys agree.
merged = merged[(merged['key1_x'] == merged['key1_y']) &
                (merged['key2_x'] == merged['key2_y'])].copy()
merged['key3_diff'] = np.abs(merged['key3_x'] - merged['key3_y'])

# For each left row, pick the pair with the smallest key3 distance.
idx = merged.groupby(['key1_x', 'key2_x', 'key3_x'])['key3_diff'].idxmin()

res = merged.loc[idx, ['key1_x', 'key2_x', 'key3_x', 'values']].rename(
    columns = {'key1_x': 'key1', 'key2_x': 'key2', 'key3_x': 'key3'})

Is there any alternative besides implementing this calculation sequentially (maybe with numba and jit)?

Solution:

If I understand correctly, try something like this:

# Join on key1 and key2 only: within each (key1, key2) group this pairs
# every key3 in df1 with every key3 in df2 (a per-group Cartesian product).
dfm = df1.merge(df2, on=['key1', 'key2'])

# Distance between the left and right key3 values, used to find the nearest match.
dfm = dfm.eval('key_diff = abs(key3_y - key3_x)')

# For each df1 record (key1, key2, key3_x), find the row with the smallest distance.
idx = dfm.groupby(['key1', 'key2', 'key3_x'])['key_diff'].idxmin().to_numpy()

# Filter merged data frame
dfm.reindex(idx)

Output:

    key1  key2  key3_x    key3_y    values  key_diff
3      a     0    0.00  0.155915  0.993490  0.155915
7      a     0    0.25  0.206727  0.620721  0.043273
11     a     0    0.50  0.705011  0.942236  0.205011
19     a     0    0.75  0.705943  0.124545  0.044057
20     a     0    1.00  0.979249  0.089052  0.020751
..   ...   ...     ...       ...       ...       ...
350    c     4    0.00  0.269154  0.437154  0.269154
355    c     4    0.25  0.269154  0.437154  0.019154
362    c     4    0.50  0.714185  0.927274  0.214185
367    c     4    0.75  0.714185  0.927274  0.035815
373    c     4    1.00  0.908999  0.078522  0.091001

[75 rows x 6 columns]
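
For frames with millions of rows, even the per-group Cartesian product above can get large. As a possible alternative, merge_asof itself accepts a `by` argument for columns that must match exactly, plus `direction='nearest'` for the closest key on either side. The sketch below reuses the toy data from the question; the seeded generator and the `key3_right` column name are assumptions added for illustration, and performance on the real data has not been measured here.

```python
import itertools

import numpy as np
import pandas as pd

# Same toy frames as in the question; the seed is only there for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(
    data=itertools.product(['a', 'b', 'c'],
                           [0, 1, 2, 3, 4],
                           [0, 0.25, 0.5, 0.75, 1]),
    columns=['key1', 'key2', 'key3'],
)
df2 = df1.copy()
df2['key3'] = rng.random(len(df2))
df2['values'] = rng.random(len(df2))

# merge_asof requires both frames to be sorted by the asof key.
left = df1.sort_values('key3')
# Rename the right-hand key (hypothetical name) so the matched value is kept.
right = df2.rename(columns={'key3': 'key3_right'}).sort_values('key3_right')

nearest = pd.merge_asof(
    left, right,
    left_on='key3', right_on='key3_right',
    by=['key1', 'key2'],      # exact match on the string and integer keys
    direction='nearest',      # closest float key, above or below
).sort_values(['key1', 'key2', 'key3'], ignore_index=True)

print(nearest.head())
```

The result has one row per row of df1, with `values` and `key3_right` taken from the nearest df2 row in the same (key1, key2) group. If a group is missing from df2, the right-hand columns come back as NaN for those rows.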