Is there an option to extract a pattern from strings in pandas?

I need to extract histology codes from strings in pandas.

For example:

Histologie (ICD-O-2): 8500/2.
Histologie (ICD-O-3): 8842/8
(ICD-O-2): 8522/2. Histo

To:

8500/2
8842/8
8522/2

The original strings are written in many different ways.
For that reason I want to search each string for the pattern of four digits, a separator character, and one more digit (0000/0).

Thanks everyone for your help.

>Solution :

Assuming your column is named "col", you can use a simple regex:

df['code'] = df['col'].str.extract(r'(\d{4}/\d)')

output:

                             col    code
0  Histologie (ICD-O-2): 8500/2.  8500/2
1   Histologie (ICD-O-3): 8842/8  8842/8
2       (ICD-O-2): 8522/2. Histo  8522/2

regex:

\d{4}  # match 4 digits
/      # match a literal /
\d     # match one digit
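
For anyone who wants to try this end to end, here is a minimal, self-contained sketch. The first three rows come from the question; the fourth row ("no code here") is a made-up example added only to show what happens when nothing matches. Passing expand=False is a choice made here so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample data: the first three strings are from the question,
# the last one is a hypothetical row without any code.
df = pd.DataFrame({
    "col": [
        "Histologie (ICD-O-2): 8500/2.",
        "Histologie (ICD-O-3): 8842/8",
        "(ICD-O-2): 8522/2. Histo",
        "no code here",
    ]
})

# One capture group: four digits, a literal slash, one digit.
# expand=False makes str.extract return a Series.
df["code"] = df["col"].str.extract(r"(\d{4}/\d)", expand=False)

print(df)
```

Rows where the pattern is not found get NaN in the new column, so they can be filtered afterwards with df["code"].notna().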

If you need to ensure that the codes are independent words (e.g., 12345/6a shouldn't match):

df['code'] = df['col'].str.extract(r'(\b\d{4}/\d\b)')

or, if non-digit characters are allowed to touch the code (only adjacent digits should prevent a match):

df['code'] = df['col'].str.extract(r'(?:\D|^)(\d{4}/\d)(?:\D|$)')
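
To see how the variants differ, the sketch below runs three patterns over a few edge cases. Only the first sample string is from the question; the other three are hypothetical inputs chosen to stress the boundaries. The third pattern is written here without a trailing \b, on the assumption that letters really should be allowed to touch the code on both sides.

```python
import pandas as pd

samples = pd.Series([
    "Histologie (ICD-O-2): 8500/2.",  # from the question
    "12345/6a",                        # hypothetical: five digits, trailing letter
    "abc8500/2def",                    # hypothetical: letters touching the code
    "8500/22",                         # hypothetical: extra trailing digit
])

patterns = {
    # four digits, slash, one digit, anywhere in the string
    "plain": r"(\d{4}/\d)",
    # the code must be delimited by word boundaries
    "word_boundary": r"(\b\d{4}/\d\b)",
    # the code may touch non-digits, but not other digits
    "no_adjacent_digits": r"(?:\D|^)(\d{4}/\d)(?:\D|$)",
}

# One column per pattern; NaN marks "no match".
result = pd.DataFrame({
    name: samples.str.extract(pat, expand=False)
    for name, pat in patterns.items()
})

print(result)
```

The plain pattern picks up 2345/6 from 12345/6a and 8500/2 from 8500/22, which is usually not wanted. The word-boundary version rejects both but also rejects abc8500/2def, while the last version accepts abc8500/2def and still rejects the two inputs with extra digits.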