Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Optimize Pandas assignation based on second dataframe

I have two dataframes, df and df2. I want to include on df an extra column on df called names. This column should contain the name of the row of df that is defined on df2 based on the conjunction of the group names of df and df2 and when this condition is met: (df[val2] >= df2[val] >= df[val2]).

Therefore, given this inputs:

df

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  val1 val2 group
0   1   2   a1
1   3   4   a1
2   10  12  a1
3   23  25  a1
4   1   2   a2
5   3   4   a2
6   10  12  a2
7   23  25  a2

df2

   val name group
0   1   abc a1
1   4   def a1
2   12  ghi a1
3   26  jkl a1
4   2   p1  a2
5   4   p2  a2
6   12  p3  a2
7   26  p4  a2

The desired output would be:

df

  val1 val2 group name
0   1   2   a1  abc
1   3   4   a1  def
2   10  12  a1  ghi
3   23  25  a1  
4   1   2   a2  p1
5   3   4   a2  p2
6   10  12  a2  p3
7   23  25  a2

As said, I manage to obtain the correct result, but my code is not efficient at all (in this small example, it does not really matter, but when running this snippet on an example with thousands of registers, you can tell that it is quite slow). Is there any way to boost this operation?
PS: bellow is my current code with a working example:

d = {'val1': [1, 3, 10, 23, 1, 3, 10, 23], 'val2': [2, 4, 12, 25, 2, 4, 12, 25], 'group': ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"]}
d2 = {'val': [1, 4, 12, 26, 2, 4, 12, 26], 'name': ['abc', 'def', 'ghi', 'jkl', 'p1', 'p2', 'p3', 'p4'], 'group': ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"]}
df = pd.DataFrame(data=d)
df2 = pd.DataFrame(data=d2)

names = set(df["group"])
df["name"] = ""

for name in names:
    df_it = df[df["group"] == name]
    df2_it = df2[df2["group"] == name]
    for it in df_it.index:
        df2_it_2 = df2_it[(df2["val"] >= \
                           df_it["val1"][it]) &
                          (df2["val"] <=
                           df_it["val2"][it])]
        
        if not df2_it_2.empty:
            df["name"][it] = df2_it_2["name"].iloc[0]

>Solution :

You can perform a merge_asof to have val ≥ val1, then mask the "name"s for which the merged val is greater than val2:

(pd.merge_asof(df.reset_index().sort_values(by='val1'),
               df2.sort_values(by='val'),
               direction='forward',
               by='group', left_on='val1', right_on='val')
   .set_index('index').sort_index()
   .assign(name=lambda d: d['name'].mask(d['val'].gt(d['val2'])))
   .drop(columns='val')
)

Another option using a second join, this might be more flexible for many columns to add:

cols = ['name']
df.join(
 pd.merge_asof(df.reset_index().sort_values(by='val1'),
               df2.sort_values(by='val'),
               direction='forward',
               by='group', left_on='val1', right_on='val')
   .loc[lambda d: d['val'].le(d['val2']), ['index']+cols]
   .set_index('index')
)

output:

       val1  val2 group name
index                       
0         1     2    a1  abc
1         3     4    a1  def
2        10    12    a1  ghi
3        23    25    a1  NaN
4         1     2    a2   p1
5         3     4    a2   p2
6        10    12    a2   p3
7        23    25    a2  NaN
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading