Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to split a value in a dataframe if the value contains a digit?

I have a dataframe which looks like below (in reality much bigger):

df = pd.DataFrame([
    [-0.531, '30 mg', 0],
    [1.49, '70 kg', 1],
    [-1.3826, 'food delivery', 2],
    [0.814, '80 degrees', ' '],     
    [-0.22, ' ', 4],
    [-1.11,  '70 grams', '  '],         
], columns='Power Value Stage'.split(), index=pd.date_range('2000-01-01','2000-01-06'))  

Now I’m adding a new column named Unit to the dataframe which actually does a split on the column Value. However, it seems to literally split everything even if the values don’t make sense. For example values like food delivery don’t need to be split.

I only want the values to be split if str[0] is a digit AND if str[1] is <= 5 characters long. I think I’m really close however I got stuck. This is my code:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df['Unit'] = df['Value'].str.split(' ').str[-1] if str[0].isdigit() and len(str[1]) is <= 5

This is my desired output when I do print(df):

             Power          Value Stage      Unit
2000-01-01 -0.5310          30 mg     0        mg
2000-01-02  1.4900          70 kg     1        kg
2000-01-03 -1.3826  food delivery     2  
2000-01-04  0.8140     80 degrees         
2000-01-05 -0.2200                    4
2000-01-06 -1.1100       70 grams           grams

This is my output:

    df['Unit'] = df['Value'].str.split(' ').str[-1] if str[0].isdigit() and len(str[1]) is <= 5
                                                                                            ^
SyntaxError: invalid syntax

>Solution :

You can use a regex:

df['Unit'] = df['Value'].str.extract(r'\d+\s*(\w{,5})\b').fillna('')[0]

output:

             Power          Value Stage   Unit
2000-01-01 -0.5310          30 mg     0     mg
2000-01-02  1.4900          70 kg     1     kg
2000-01-03 -1.3826  food delivery     2       
2000-01-04  0.8140     80 degrees             
2000-01-05 -0.2200                    4       
2000-01-06 -1.1100       70 grams        grams
regex:
\d+        # one or more digits
\s*        # optional spaces
(\w{,5})   # up to 5 word characters in capturing group
\b         # word boundary

If you want to ensure the unit is letters only, replace \w{,5} with [a-zA-Z]{,5}

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading