perl replace non UTF-8 characters or binary contents with whitespace

I have a file with non-ASCII characters.

$ org od -t c -t x1 -A d tmp.txt
0000000   S   o   -   c   a   l   l   e   d    217    204   l   a   b
         53  6f  2d  63  61  6c  6c  65  64  f4  8f  b1  84  6c  61  62
0000016   e   l   e   d    217    204   p   a   t   t   e   r   n   s
         65  6c  65  64  f4  8f  b1  84  70  61  74  74  65  72  6e  73
0000032    217    204   c   a   n       b   e    217    204   u   s
         f4  8f  b1  84  63  61  6e  20  62  65  f4  8f  b1  84  75  73
0000048   e   d    217    204   w   i   t   h    217    204   s   i
         65  64  f4  8f  b1  84  77  69  74  68  f4  8f  b1  84  73  69
0000064   n   g   l   e   ,        217    204   d   o   u   b   l   e
         6e  67  6c  65  2c  20  f4  8f  b1  84  64  6f  75  62  6c  65
0000080   ,        217    204   a   n   d    217    204   t   r   i
         2c  20  f4  8f  b1  84  61  6e  64  f4  8f  b1  84  74  72  69
0000096   p   l   e    217    204   b   l   a   n   k   s   .
         70  6c  65  f4  8f  b1  84  62  6c  61  6e  6b  73  2e

As you can see, \x{f4}\x{8f}\x{b1}\x{84} occurs several times. I want to replace each occurrence with whitespace. According to this, I tried:

s/\x{f4}\x{8f}\x{b1}\x{84}/ /g;
tr/\x{f4}\x{8f}\x{b1}\x{84}/ /;

It doesn’t work.
But if I remove these two lines from the script:

use utf8;
use open qw( :std :encoding(UTF-8) );

It works. Why?

I suspect that it is because Perl only deals with characters, and \x{f4}\x{8f}\x{b1}\x{84} is not regarded as a character. Is there a way to remove \x{f4}\x{8f}\x{b1}\x{84}, or any other binary content or non-UTF-8 characters, with Perl?

>Solution :

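While the file may contain the bytes "\x{f4}\x{8f}\x{b1}\x{84}", you have decoded them: the :encoding(UTF-8) layer set by use open turns those four bytes into a single character. You actually have "\x{10FC44}", or "\N{U+10FC44}" if you prefer. As such, you’d need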

tr/\N{U+10FC44}/ /
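
To see why the byte pattern stops matching once the input is decoded, here is a minimal sketch. It uses Encode::decode as a stand-in for what the :encoding(UTF-8) layer does on read; the sample string and the helper names replace_bytes and replace_code_point are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The bytes as they sit in the file (an undecoded byte string).
my $bytes = "called\xF4\x8F\xB1\x84lab";

# Roughly what the script sees when reading through :encoding(UTF-8): decoded text.
my $text = decode('UTF-8', $bytes);

# Hypothetical helper: the byte-oriented substitution from the question.
sub replace_bytes {
    my ($s) = @_;
    $s =~ s/\x{f4}\x{8f}\x{b1}\x{84}/ /g;
    return $s;
}

# Hypothetical helper: the code-point-oriented transliteration.
sub replace_code_point {
    my ($s) = @_;
    $s =~ tr/\N{U+10FC44}/ /;
    return $s;
}

# Four bytes collapse into one character after decoding.
printf "%d bytes, %d characters\n", length($bytes), length($text);
```

This prints "13 bytes, 10 characters". On $bytes, replace_bytes yields "called lab"; on $text it changes nothing, because the decoded string holds the single character U+10FC44 rather than the characters U+00F4, U+008F, U+00B1 and U+0084. replace_code_point is the one that works on $text.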

It’s a private-use Code Point. To replace all 137,468 private-use Code Points, you can use

s/\p{General_Category=Private_Use}/ /g
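
If you want to check the count yourself, a brute-force sketch like the following should do. The sub name count_private_use is hypothetical, and the loop simply tests every code point up to U+10FFFF against the property.

```perl
use strict;
use warnings;

# Count the code points whose General_Category is Private_Use.
sub count_private_use {
    my $count = 0;
    for my $cp (0 .. 0x10FFFF) {
        # Surrogates are skipped defensively; they are category Cs, not Co.
        next if $cp >= 0xD800 && $cp <= 0xDFFF;
        $count++ if chr($cp) =~ /\p{General_Category=Private_Use}/;
    }
    return $count;
}

print count_private_use(), "\n";
```

It should print 137468: 6,400 in U+E000..U+F8FF plus 65,534 in each of planes 15 and 16 (the last two code points of each plane are noncharacters, not private-use).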

General_Category can be abbreviated to Gc.
Private_Use can be abbreviated to Co.
General_Category= can be omitted.
So these are equivalent:

s/\p{Gc=Private_Use}/ /g
s/\p{Private_Use}/ /g
s/\p{Co}/ /g

Co makes me think of "control", so maybe it’s best to avoid that one. (Control characters are identified by the Control aka Cc general category.)
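
A quick way to convince yourself that the spellings are interchangeable, and that Cc is a different thing, is a sketch along these lines. The sample string and the helper blank_out are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample: two private-use characters and one control character (tab).
my $sample = "a\x{E000}b\x{10FC44}c\td";

# The four spellings listed above, compiled once each.
my @forms = (
    qr/\p{General_Category=Private_Use}/,
    qr/\p{Gc=Private_Use}/,
    qr/\p{Private_Use}/,
    qr/\p{Co}/,
);

# Replace every match of $re in a copy of $s with a space.
sub blank_out {
    my ($re, $s) = @_;
    $s =~ s/$re/ /g;
    return $s;
}

my @results   = map { blank_out($_, $sample) } @forms;
my $all_agree = !grep { $_ ne "a b c\td" } @results;

# \p{Cc} is a different category: it hits the tab and leaves private-use characters alone.
my $controls_blanked = blank_out(qr/\p{Cc}/, $sample);

print $all_agree ? "all four forms agree\n" : "forms differ\n";
```

This prints "all four forms agree". The \p{Cc} version, by contrast, only replaces the tab, which is why mixing up Co and Cc would silently do the wrong thing.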
