Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

What is the best way to merge two dataframes that one of them has date ranges and the other one has date WITHOUT any shared columns?

I have two DataFrames:

import pandas as pd

df1 = pd.DataFrame(
    {
        'date': ['2024-01-01','2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08', '2024-01-09', '2024-01-10', '2024-01-11', '2024-01-12', '2024-01-13'],
        'price': list(range(13))
    }
)

df2 = pd.DataFrame(
    {
        'start': ['2024-01-01', '2024-01-03', '2024-01-10'],
        'end': ['2024-01-03', '2024-01-08', '2024-01-12'],
        'id': ['a', 'b', 'c']
    }
)

And this is the expected output. I want to add id to df1:

         date  price   id
0  2024-01-01      0  NaN
1  2024-01-02      1    a
2  2024-01-03      2    a
3  2024-01-04      3    b
4  2024-01-05      4    b
5  2024-01-06      5    b
6  2024-01-07      6    b
7  2024-01-08      7    b
8  2024-01-09      8  NaN
9  2024-01-10      9  NaN
10 2024-01-11     10    c
11 2024-01-12     11    c
12 2024-01-13     12  NaN

The process is like this. Let me give you an example for row 1 of the output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

a) The date is 2024-01-02. Look up df2. There is a range for each row of df2. This date is between the first row of df2. Note that start is exclusive and end is inclusive.

b) Get the id from the identified row in df2 and put in the output.

Since there are no common columns between these two DataFrames, I have used a loop to get the output. It works but I am not sure if it is the best way:

df1['date'] = pd.to_datetime(df1.date)
df2[['start', 'end']] = df2[['start', 'end']].apply(pd.to_datetime)
for idx, row in df2.iterrows():
    start = row['start']
    end = row['end']
    id = row['id']

    df1.loc[df1.date.between(start, end, inclusive='right'), 'id'] = id

Any suggestions?

>Solution :

Another options, use pd.cut:

bins = pd.IntervalIndex.from_arrays(df2["start"], df2["end"])
df1["id"] = pd.cut(df1["date"], bins).map(dict(zip(bins, df2["id"])))

print(df1)

Prints:

         date  price   id
0  2024-01-01      0  NaN
1  2024-01-02      1    a
2  2024-01-03      2    a
3  2024-01-04      3    b
4  2024-01-05      4    b
5  2024-01-06      5    b
6  2024-01-07      6    b
7  2024-01-08      7    b
8  2024-01-09      8  NaN
9  2024-01-10      9  NaN
10 2024-01-11     10    c
11 2024-01-12     11    c
12 2024-01-13     12  NaN
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading