
unable to translate '…' to a wide string

It looks to me like R introduced a new error in version 4.3.0, which breaks a lot of my web-scrapers. I only found one mention of the change, but don’t really understand the blog post.

In essence, this code fails on newer versions of R, but older versions do some internal conversion that seems to work:

text <- "\xa0 x"
gsub("x", "u", text)
#> Warning in gsub("x", "u", text): unable to translate '<a0> x' to a wide string
#> Error in gsub("x", "u", text): input string 1 is invalid

Created on 2023-07-13 with reprex v2.0.2

Is there any way to remove these special characters before doing string operations? Note that I do not know which characters specifically fail, since the real strings I’m working with are too long to check.

Solution:

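It’s an encoding issue: text contains the byte \xa0, which on its own is not a valid sequence in R's native encoding (UTF-8 on most current setups), so the regular expression functions reject the whole string.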

Converting to UTF-8, assuming the input is actually ISO-8859-1 (Latin-1):

text_utf8 <- iconv(text, from = "ISO-8859-1", to = "UTF-8")
gsub("x", "u", text_utf8)

This produces "\u00a0 u": the byte \xa0 becomes a non-breaking space (U+00A0), and x is replaced by u.

R 4.3.0 changelog says: "Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8)."

Alternatively, you can treat the input as a raw sequence of bytes. The invalid byte is then passed through unchanged to the output:

gsub("x", "u", text, useBytes = TRUE)

gives '\xa0 u'
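
Since the question mentions not knowing which strings fail, here is a minimal sketch of one way to find and repair them before any string operations. It uses validUTF8() from base R to flag the offending elements and re-encodes only those. The assumption that the invalid strings are really ISO-8859-1 is mine, as are the variable names bad, fixed, and result and the second sample string; adjust the source encoding to whatever the scraped pages actually use.

```r
# Sample input: the string from the question plus a hypothetical valid one.
text <- c("\xa0 x", "plain x")

# validUTF8() returns one logical per element: TRUE if the bytes are valid UTF-8.
bad <- !validUTF8(text)

# Re-encode only the invalid elements.
# ASSUMPTION: the invalid strings are ISO-8859-1 (Latin-1).
fixed <- text
fixed[bad] <- iconv(text[bad], from = "ISO-8859-1", to = "UTF-8")

# The regular expression functions now accept every element.
result <- gsub("x", "u", fixed)
print(result)
```

If the source encoding is unknown, iconv() also accepts a sub argument that substitutes non-convertible bytes instead of returning NA, though how invalid input is detected can vary by platform.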
