Making rows NaN based on many conditions

If I have a dataframe with some index and some value as follows:

import pandas as pd
from random import random

my_index = []
my_vals = []
for i in range(1000):
    my_index.append(i+random())
    my_vals.append(random())

df_vals = pd.DataFrame({'my_index': my_index, 'my_vals': my_vals})

I also have a second dataframe with the columns start and end. Each row must be read as a closed interval, so the first row is the interval from 1 to 4 (including both 1 and 4). It is the following dataframe:

df_intervals = pd.DataFrame({'start': [1, 7, 54, 73, 136, 235, 645, 785, 968], 'end': [4, 34, 65, 90, 200, 510, 700, 805, 988]})

I would like to set every value in the my_vals column of df_vals to NaN if the row's index (my_index) does not fall into one of the intervals specified in the df_intervals dataframe. What is the best way to do this automatically rather than specifying each condition manually?

(My actual data set has more than 9 intervals; this is just example data.)

EDIT: in my actual data these indices are not strictly integers; they can also be random floats.

Solution:

I believe this is a possible solution:

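def index_in_range(index, df):
    # Return True if index lies in any closed interval [start, end] of df.
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True

    return False

# Keep my_vals where my_index falls inside an interval, otherwise use NaN.
df_vals['my_vals'] = df_vals.apply(lambda row: row['my_vals'] if index_in_range(row['my_index'], df_intervals) else float('nan'), axis=1)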

To accomplish this without using a lambda function, you can do the following:

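def index_in_range(index, df):
    # Return True if index lies in any closed interval [start, end] of df.
    for index_, row in df.iterrows():
        if (index >= row['start']) and (index <= row['end']):
            return True

    return False

# Overwrite my_vals in place for every row whose my_index is outside all intervals.
for index_, row in df_vals.iterrows():
    if not index_in_range(row['my_index'], df_intervals):
        df_vals.at[index_, 'my_vals'] = float('nan')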

Output:

     my_index   my_vals
0           0       NaN
1           1  0.126647
2           2  0.769215
3           3  0.819891
4           4  0.674466
..        ...       ...
995       995       NaN
996       996       NaN
997       997       NaN
998       998       NaN
999       999       NaN
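
Both versions above loop over every row of df_vals and, for each one, over every row of df_intervals, which can become slow on larger data. As an alternative that is not part of the original answer, here is a minimal sketch that builds the same inclusive "inside any interval" test with NumPy broadcasting and applies it with Series.where. The helper name mask_outside_intervals and the small sample frames are made up for illustration; the column names follow the question.

```python
import numpy as np
import pandas as pd


def mask_outside_intervals(df_vals, df_intervals):
    """Return a copy of df_vals with my_vals set to NaN wherever my_index
    lies outside every closed interval [start, end] in df_intervals.

    Hypothetical helper, not taken from the original answer.
    """
    idx = df_vals['my_index'].to_numpy()[:, None]   # shape (n_rows, 1)
    starts = df_intervals['start'].to_numpy()       # shape (n_intervals,)
    ends = df_intervals['end'].to_numpy()           # shape (n_intervals,)

    # Broadcasting compares every index with every interval at once;
    # any(axis=1) is True when at least one interval contains the index.
    inside = ((idx >= starts) & (idx <= ends)).any(axis=1)

    out = df_vals.copy()
    # Series.where keeps values where the mask is True and inserts NaN elsewhere.
    out['my_vals'] = out['my_vals'].where(inside)
    return out


# Small made-up sample with float indices, including both interval endpoints.
df_intervals = pd.DataFrame({'start': [1, 7, 54], 'end': [4, 34, 65]})
df_vals = pd.DataFrame({'my_index': [0.5, 1.0, 3.2, 4.0, 4.1, 60.7],
                        'my_vals': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

result = mask_outside_intervals(df_vals, df_intervals)
print(result)
```

The intermediate boolean array has one entry per (row, interval) pair, so memory grows with the product of the two lengths. For very many intervals, a sorted-interval lookup may be a better fit than this sketch.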