I want to clean up some TNM entries. Here is an example:
df <- structure(list(TNM = c("pT3 N0 (0/13)", "pT3 N2b (21/45l)", "pT3 N0 (0/32 LK)"
)), class = "data.frame", row.names = c(NA, -3L))
TNM
1 pT3 N0 (0/13)
2 pT3 N2b (21/45l)
3 pT3 N0 (0/32 LK)
So far I have this, but it only extracts the count from the first row:
library(dplyr)
library(stringr)
df %>%
mutate(TNM = str_remove_all(TNM, '\\,|\\;|\\.'),
TNM = str_replace_all(TNM, ' ', ''),
TNM = str_replace_all(TNM, "x", "X")) %>%
mutate(N_count = str_extract(TNM, '\\(\\d+\\/\\d+\\)'))
TNM N_count
1 pT3N0(0/13) (0/13)
2 pT3N2b(21/45l) <NA>
3 pT3N0(0/32LK) <NA>
This works:
library(dplyr)
library(stringr)
df %>%
mutate(TNM = str_remove_all(TNM, '\\,|\\;|\\.'),
TNM = str_replace_all(TNM, ' ', ''),
TNM = str_replace_all(TNM, "x", "X")) %>%
mutate(N_count = str_extract(TNM, '\\(\\d+\\/\\d+\\)|\\(\\d+\\/\\d+\\w\\)|\\(\\d+\\/\\d+\\w+\\)'))
TNM N_count
1 pT3N0(0/13) (0/13)
2 pT3N2b(21/45l) (21/45l)
3 pT3N0(0/32LK) (0/32LK)
Is there a way to shorten this regex?
'\\(\\d+\\/\\d+\\)|\\(\\d+\\/\\d+\\w\\)|\\(\\d+\\/\\d+\\w+\\)'
>Solution :
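The three alternatives in your pattern differ only in what comes before the closing parenthesis: no word characters, exactly one, or one or more.
You can drop the alternation and match zero or more word characters with \\w* instead. The forward slash does not need to be escaped: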
\\(\\d+/\\d+\\w*\\)
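As a quick check, here is a minimal sketch that applies the shortened pattern to the cleaned values from the question (spaces already removed). The variable names tnm and n_count are chosen only for this illustration:

```r
library(stringr)

# Cleaned TNM values as produced by the question's mutate() steps
tnm <- c("pT3N0(0/13)", "pT3N2b(21/45l)", "pT3N0(0/32LK)")

# \\w* allows zero or more word characters before the closing parenthesis
n_count <- str_extract(tnm, "\\(\\d+/\\d+\\w*\\)")
n_count
```

All three rows should now yield a value instead of NA, so the pattern can replace the alternation in the str_extract() call as is.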
To also match values with a space before the suffix, such as (0/32 LK), without accepting a bare trailing space as in (21/45 ), you can add an optional group that matches optional whitespace followed by 1+ word characters:
\\(\\d+/\\d+(?:\\s*\\w+)?\\)
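The following sketch tries this variant on the uncleaned strings from the question, plus the bare-trailing-space case "(21/45 )" mentioned above. The names raw, pattern and result are assumptions made for this example:

```r
library(stringr)

# Original, uncleaned values plus one entry with only a trailing space
raw <- c("pT3 N0 (0/13)", "pT3 N2b (21/45l)", "pT3 N0 (0/32 LK)", "(21/45 )")

# Optional non-capturing group: optional whitespace, then 1+ word characters
pattern <- "\\(\\d+/\\d+(?:\\s*\\w+)?\\)"

result <- str_extract(raw, pattern)
result
```

The first three entries should match with their suffix intact, while the last one should return NA, because whitespace inside the group is only accepted when word characters follow it.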