Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Deleting characters in arrays while converting rows to integer in DataFrame

I have a dataframe like below

enter image description here

Each row has a different length.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I’m trying to run the code below

for i in range(len(df)):
  df['ColumnA'][i] = df['ColumnA'][i].astype(int)

I’m getting an error because some rows have like "64B", "64A" string values.

For example,

df['ColumnA'][37]

Output:

array(['34', '35', '36', '38', '39', '40', '41', '56', '58', '59', '60',
       '61', '62', '62A', '62B', '63', '64', '65', '88', '90', '94', '98'],
      dtype='<U3')

You can easily see the ’62A’, ’62B’ string values.

My goal is to convert all strings to int (e.g from 62A to 62). Deleting the characters next to it.

>Solution :

You can use the sub function in the builtin regex module to replace every non-digit character (\D selects non-digits in a regex) with a blank "".

A list comprehension will let you apply that same method to every item in the list.

The last component you need is the apply method in pandas that will apply the same function to every item in a series.

import re

import pandas as pd


# sample data
df = pd.Series([
    [61, '62a', 70, 'z8z8z'],
    [61, '62hello', 70],
], name='columnA').to_frame()

# show sample data
df
#>                 columnA
#> 0  [61, 62a, 70, z8z8z]
#> 1     [61, 62hello, 70]

# select the column and apply the function to each element
df.columnA.apply(
    lambda row: [
        int(re.sub("\D", "", str(x)))  # cast to string, replace any non-digit character, cast to int
        for x in row  # for every item in the list
    ]
)
#> 0    [61, 62, 70, 88]
#> 1        [61, 62, 70]
#> Name: columnA, dtype: object

This will result in a list stored as each item in the Series "columnA". If you want these to be numpy arrays instead, you can wrap the list comprehension in a call to np.array.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading