I need to extract histology codes from strings in pandas.
For example:
Histologie (ICD-O-2): 8500/2.
Histologie (ICD-O-3): 8842/8
(ICD-O-2): 8522/2. Histo
To:
8500/2
8842/8
8522/2
The original strings are written in many different ways.
For that reason I want to search each string for the pattern of four digits, a separator character, and one digit (e.g., 0000/0).
Thanks everyone for your help.
>Solution :
Assuming your column is named "col", you can use a simple regex:
df['code'] = df['col'].str.extract(r'(\d{4}/\d)')
output:
                             col    code
0  Histologie (ICD-O-2): 8500/2.  8500/2
1   Histologie (ICD-O-3): 8842/8  8842/8
2       (ICD-O-2): 8522/2. Histo  8522/2
regex:
\d{4} # match 4 digits
/ # match a literal /
\d # match one digit
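For anyone who wants to try this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the three example strings in the question, plus one made-up row without a code (an assumption, added only to show what a non-match looks like). Passing expand=False is optional and simply makes str.extract return a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Sample data: the three strings from the question plus one
# hypothetical row that contains no code at all.
df = pd.DataFrame({
    "col": [
        "Histologie (ICD-O-2): 8500/2.",
        "Histologie (ICD-O-3): 8842/8",
        "(ICD-O-2): 8522/2. Histo",
        "no code here",
    ]
})

# One capture group: four digits, a literal slash, one digit.
# expand=False returns a Series rather than a one-column DataFrame.
df["code"] = df["col"].str.extract(r"(\d{4}/\d)", expand=False)

codes = df["code"].tolist()
print(df)
```

Rows where the pattern is not found end up as NaN in the new column, so they are easy to filter out afterwards with df['code'].notna().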
If you need to ensure that the codes are independent words (e.g., 12345/6a shouldn't match):
df['code'] = df['col'].str.extract(r'(\b\d{4}/\d\b)')
Or, if non-digit characters are allowed to touch the code (e.g., x8500/2y should match, but 12345/6 should not):
df['code'] = df['col'].str.extract(r'(?:\D|^)(\d{4}/\d)(?:\D|$)')
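To see how the variants differ, here is a rough comparison on a few edge cases. The sample strings and the helper name extract_codes are made up for illustration. Note that the third pattern is written here without a \b after the last digit, on the assumption that letters should be allowed to follow the code directly.

```python
import pandas as pd

# Hypothetical edge cases, not taken from the original data.
samples = pd.Series(["8500/2", "12345/6a", "x8500/2y", "18500/23"])

patterns = {
    "simple": r"(\d{4}/\d)",                  # matches anywhere
    "word": r"(\b\d{4}/\d\b)",                # code must be a standalone word
    "nondigit": r"(?:\D|^)(\d{4}/\d)(?:\D|$)",  # only digits may not touch the code
}

def extract_codes(series, pattern):
    """Hypothetical helper: return extracted codes as a list, None if no match."""
    found = series.str.extract(pattern, expand=False)
    return [None if pd.isna(value) else value for value in found]

results = {name: extract_codes(samples, pat) for name, pat in patterns.items()}

for name, codes in results.items():
    print(name, codes)
```

The simple pattern pulls 2345/6 out of 12345/6a and 8500/2 out of 18500/23, which is usually not what you want. The word-boundary version rejects both of those, but also rejects x8500/2y because letters count as word characters. The non-digit version rejects the two strings with extra digits and still accepts x8500/2y.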