Using gsub to replace matches with same number of characters

Is it possible to use gsub to replace each character of a match with another character? I have read and tried solutions from many questions without success, because they were very specific to the example being used. Two that looked promising but ultimately did not get me there are:

gsub-replace-regex-match-with-regex-replacement-string

replace-pattern-with-one-space-per-character-in-perl

What I am looking for is a general way to do the following. I have a list of regexes, which I combine into a single regular expression of the form

pattern <- "[0-9]{3,}|[a-z]{3,}|..."

Given a string such as

x <- "1234 abc 12 a 123456"

I would like to get back from gsub

"#### ### 12 a ######"

instead of

"# # 12 a #"

I have used gsub with the perl arg set to TRUE, and experimented with an online regex tool, using things like \G and lookarounds, but I cannot figure it out.

The reason I am looking for a way to do this with gsub (I realise it is easy to do in other ways) is to use it as a method of censoring certain words and matches such as dates, phone numbers and email addresses in a dplyr pipeline. The function I have works fine, except that any replacement is fixed, and I would like to replace each matching character, rather than each matching substring.

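# Replace every match of any pattern in .words with the fixed string
# .replacement in the selected columns. Needs dplyr attached for %>%.
filter_words <- function(.data, .words, .replacement, ...) {
  # Each alternative gets a leading word boundary: \bw1|\bw2|...
  pattern <- paste0("\\b", .words, collapse = "|")
  .data %>% dplyr::mutate(
    dplyr::across(
      c(...),
      ~ gsub(pattern, .replacement, .x, ignore.case = TRUE, perl = TRUE)
    )
  )
}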

I did try using a package called mgsub for the mgsub_censor function it provides. This does work, but it is several orders of magnitude slower than what I already have, so not really practical for large datasets.

I did try creating a custom gsub function able to accept a function (that could return a string consisting of the same number of characters as each match) as the replacement argument. It worked fine for a single string, but failed to work in a pipe.
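For reference, a gsub-like helper of the kind described above could be sketched in base R with gregexpr and the regmatches replacement function. This is only a minimal sketch, not the function from the original attempt: the names censor_matches and mask are hypothetical, and it assumes the input is a character vector without NA values.

```r
# Hypothetical helper: replace every character of each regex match with `mask`.
# Assumes `x` is a character vector with no NA values.
censor_matches <- function(x, pattern, mask = "#") {
  # Locate all matches in every element of x
  m <- gregexpr(pattern, x, perl = TRUE)
  # Swap each match for a run of `mask` of the same length
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(hits) strrep(mask, nchar(hits))
  )
  x
}

result <- censor_matches("1234 abc 12 a 123456", "[0-9]{3,}|[a-z]{3,}")
```

Because it takes and returns a character vector, a helper like this can be called on a column inside dplyr::mutate in the same way as gsub.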

Solution:

You can pass a function as the replacement argument of stringr::str_replace_all and use strrep to repeat the # symbol once for each character of a match.

x <- "1234 abc 12 a 123456"
pattern <- "[0-9]{3,}|[a-z]{3,}"

stringr::str_replace_all(x, pattern, function(m) strrep('#', nchar(m)))
#[1] "#### ### 12 a ######"
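To use the same idea in the dplyr pipeline from the question, the fixed replacement in filter_words can be swapped for a replacement function. The following is a rough sketch under a few assumptions: censor_words and its .mask argument are hypothetical names, the patterns in .words are assumed to be valid ICU regular expressions (stringr's engine, not PCRE), and the example data frame is made up for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical variant of filter_words that masks each matched character.
censor_words <- function(.data, .words, ..., .mask = "#") {
  # Combine the patterns behind one word boundary and ignore case,
  # mirroring ignore.case = TRUE in the original gsub call
  pattern <- regex(
    paste0("\\b(?:", paste(.words, collapse = "|"), ")"),
    ignore_case = TRUE
  )
  .data %>%
    mutate(across(
      c(...),
      # The replacement function receives the matches and returns
      # a mask of equal length for each one
      ~ str_replace_all(.x, pattern, function(m) strrep(.mask, nchar(m)))
    ))
}

# Made-up example data
df <- data.frame(
  id = 1:2,
  note = c("call 1234 now", "abc and 12"),
  stringsAsFactors = FALSE
)
censored <- censor_words(df, c("[0-9]{3,}", "[a-z]{3,}"), note)
```

Here every word of three or more letters and every run of three or more digits in the note column is masked, while "12" is left alone because it is too short to match.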