Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Identify the matching range from a list of valid range

I am having a csv file , which contains a range of numbers in a column. I also have list of valid range. I need identify the possible range for the matching numbers in csv.

valid_range = ["0-1", "1-2", "0-5", "5-10", "10-15", "15-20", ">20","mixed"]

csv data :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

What I have:

enter image description here

What I need:

enter image description here

Anything greater than 20 will have ">20" and mixed will have the same.

df = {‘Gender’: {0: ‘male’,
1: ‘male’,
2: ‘male’,
3: ‘female’,
4: ‘male’,
5: ‘male’,
6: ‘male’,
7: ‘male’,
8: ‘male’,
9: ‘mixed’,
10: ‘female’},
‘Range’: {0: ‘mixed’,
1: ‘mixed’,
2: ‘5-6’,
3: ‘4-5’,
4: ‘5-6′,
5: ’12-14’,
6: ‘9-10’,
7: ‘mixed’,
8: ’17-18′,
9: ‘mixed’,
10: ‘>20’}}

>Solution :

If you don’t have overlap, you can rely on the upper bound.

Then use of a bit or regex magic and pandas.cut:

# non overlapping ranges
valid_range = ["0-5", "5-10", "10-15", "15-20", ">20", "mixed"]

# define bins (could also be coded as list directly!)
bins = (pd.Series(valid_range[:-2]).str.extract('(\d+)$', expand=False)
          .astype(int).to_list()
        )
bins = [0]+bins+[float('inf')]
# [0, 5, 10, 15, 20, inf]

df['Valid_Range'] = (           # replace ">x" with upper bound + 1
 pd.cut(df['Range'].str.replace('>.*', f'{bins[-1]+1}', regex=True)
                  # extract trailing digits 
                  .str.extract('(\d+)$', expand=False).astype(float),
        bins=bins, labels=valid_range[:-1])
    .values.add_categories(valid_range[-1])  # cut output is Categorical
    .fillna(valid_range[-1])  # fill nans that correspond to initial "mixed"
)

output:

    Gender  Range Valid_Range
0     male  mixed       mixed
1     male  mixed       mixed
2     male    5-6        5-10
3   female    4-5         0-5
4     male    5-6        5-10
5     male  12-14       10-15
6     male   9-10        5-10
7     male  mixed       mixed
8     male  17-18       15-20
9    mixed  mixed       mixed
10  female    >20         >20
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading