Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Generate Boolean Matrix of overlapping intervals

I have a dataframe where two columns represent the start and end points of intervals on a real number line. I want to generate a third column as a list of the indices of rows which said row has any overlap with. I’m having difficulty creating a inequality boolean matrix for this natively in pandas. I assume logic like this s1<=e2 and e1>=s2 will do the trick, but I don’t know how to effectively broadcast it.

As a toy example I’m hoping for a simple way to at least generate a 5×5 boolean matrix (with all True down the diagonal) given this dataframe:

import pandas as pd
intervals_df = pd.DataFrame({"Starts":[0,1,5,10,15,20],"Ends":[4,2,9,14,19,24]})

Starts  Ends
0   0   4
1   1   2
2   5   9
3   10  14
4   15  19
5   20  24

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

The condition for the two intervals (s1,e1) and (s2,e2) to intersect is max(s1,s2) <= min(e1,e2). So you can do a cross merge (this is the broadcast), calculate the condition, the pivot:

d = (intervals_df.reset_index()
       .merge(intervals_df.reset_index(), how='cross')
       .assign(cond=lambda x: x.filter(like='Starts').max(axis=1) <= x.filter(like='Ends').min(axis=1))
        .pivot('index_x', 'index_y', 'cond')
    )

You would get:

index_y      0      1      2      3      4      5
index_x                                          
0         True   True  False  False  False  False
1         True   True  False  False  False  False
2        False  False   True  False  False  False
3        False  False  False   True  False  False
4        False  False  False  False   True  False
5        False  False  False  False  False   True

Or you can make do with numpy’s broadcasting:

starts = intervals_df[['Starts']].to_numpy()
ends = intervals_df[['Ends']].to_numpy()

np.maximum(starts, starts.T) <= np.minimum(ends, ends.T)

Output:

array([[ True,  True, False, False, False, False],
       [ True,  True, False, False, False, False],
       [False, False,  True, False, False, False],
       [False, False, False,  True, False, False],
       [False, False, False, False,  True, False],
       [False, False, False, False, False,  True]])
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading