perl replace non UTF-8 characters or binary contents with whitespace

September 7, 2022

I have a file with non-ASCII characters.

$ org od -t c -t x1 -A d tmp.txt
0000000   S   o   -   c   a   l   l   e   d    217    204   l   a   b
         53  6f  2d  63  61  6c  6c  65  64  f4  8f  b1  84  6c  61  62
0000016   e   l   e   d    217    204   p   a   t   t   e   r   n   s
         65  6c  65  64  f4  8f  b1  84  70  61  74  74  65  72  6e  73
0000032    217    204   c   a   n       b   e    217    204   u   s
         f4  8f  b1  84  63  61  6e  20  62  65  f4  8f  b1  84  75  73
0000048   e   d    217    204   w   i   t   h    217    204   s   i
         65  64  f4  8f  b1  84  77  69  74  68  f4  8f  b1  84  73  69
0000064   n   g   l   e   ,        217    204   d   o   u   b   l   e
         6e  67  6c  65  2c  20  f4  8f  b1  84  64  6f  75  62  6c  65
0000080   ,        217    204   a   n   d    217    204   t   r   i
         2c  20  f4  8f  b1  84  61  6e  64  f4  8f  b1  84  74  72  69
0000096   p   l   e    217    204   b   l   a   n   k   s   .
         70  6c  65  f4  8f  b1  84  62  6c  61  6e  6b  73  2e

As you can see, the byte sequence \x{f4}\x{8f}\x{b1}\x{84} occurs several times. I want to replace each occurrence with whitespace. According to this, I tried:

s/\x{f4}\x{8f}\x{b1}\x{84}/ /g;
tr/\x{f4}\x{8f}\x{b1}\x{84}/ /;

It doesn’t work.
But if I remove these two lines from the script:

use utf8;
use open qw( :std :encoding(UTF-8) );

It works. Why?

I suspect this is because Perl only deals with characters, and \x{f4}\x{8f}\x{b1}\x{84} is not regarded as a character. Is there a way to remove \x{f4}\x{8f}\x{b1}\x{84}, or any other binary content or non-UTF-8 characters, with Perl?

>Solution :

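While the file may contain the bytes "\x{f4}\x{8f}\x{b1}\x{84}", you have decoded them (that is what the :encoding(UTF-8) layer does). Your string actually holds the single character "\x{10FC44}", or "\N{U+10FC44}" if you prefer, so the four-byte pattern no longer matches. As such, you’d need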

tr/\N{U+10FC44}/ /
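
A minimal sketch of the difference, assuming the core Encode module stands in for the :encoding(UTF-8) input layer and a short in-memory string stands in for tmp.txt (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw( decode );

# The raw bytes as they sit in the file: "a", F4 8F B1 84, "b".
my $bytes = "a\xF4\x8F\xB1\x84b";

# Roughly what a decoding input layer hands to the program: characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d elements, chars: %d elements\n",
    length($bytes), length($chars);
printf "middle character: U+%04X\n", ord(substr($chars, 1, 1));

# The four-byte pattern only matches the undecoded string.
my $byte_hit = $bytes =~ /\xF4\x8F\xB1\x84/ ? 1 : 0;
my $char_hit = $chars =~ /\xF4\x8F\xB1\x84/ ? 1 : 0;

# The code point form matches the decoded string.
(my $fixed = $chars) =~ tr/\N{U+10FC44}/ /;
```

This is why removing the two pragma lines made the original substitution work: without decoding, the program sees four separate bytes rather than one character.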

It’s a private-use Code Point. To replace all 137,468 private-use Code Points, you can use

s/\p{General_Category=Private_Use}/ /g
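
The count of 137,468 can be checked against the three private-use ranges that Unicode defines. The sketch below adds them up and spot-checks two code points against Perl's own \p{Co} tables; the range list is written out by hand here and is not something the answer itself supplies:

```perl
use strict;
use warnings;

# The three private-use ranges defined by Unicode (inclusive bounds).
my @ranges = (
    [ 0xE000,   0xF8FF   ],   # Private Use Area (BMP)
    [ 0xF0000,  0xFFFFD  ],   # Supplementary Private Use Area-A
    [ 0x100000, 0x10FFFD ],   # Supplementary Private Use Area-B
);

my $total = 0;
for my $range (@ranges) {
    my ($low, $high) = @$range;
    $total += $high - $low + 1;
}

# Spot-check: U+10FC44 is private-use, U+F900 (just past the BMP range) is not.
my $is_private  = chr(0x10FC44) =~ /\p{Co}/ ? 1 : 0;
my $not_private = chr(0xF900)   =~ /\p{Co}/ ? 1 : 0;

print "$total private-use code points\n";
```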

General_Category can be abbreviated to Gc.
Private_Use can be abbreviated to Co.
General_Category= can be omitted.
So these are equivalent:

s/\p{Gc=Private_Use}/ /g

s/\p{Private_Use}/ /g

s/\p{Co}/ /g

Co makes me think of "control", so maybe it’s best to avoid that one. (Control characters are identified by the Control aka Cc general category.)
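
Putting it together, a minimal sketch of the whole flow. It assumes an in-memory string in place of tmp.txt, and an explicit :encoding(UTF-8) layer on the open call in place of the `use open` pragma from the question; both are stand-ins chosen to keep the example self-contained:

```perl
use strict;
use warnings;
use utf8;

# Stand-in for tmp.txt: the first bytes of the sample, held in memory.
my $file_bytes = "So-called\xF4\x8F\xB1\x84labeled\xF4\x8F\xB1\x84patterns\n";

# The explicit layer plays the role of `use open qw( :std :encoding(UTF-8) )`.
open my $fh, '<:encoding(UTF-8)', \$file_bytes
    or die "cannot open in-memory file: $!";

my @cleaned;
while (my $line = <$fh>) {
    chomp $line;
    # Replace every private-use code point with a space.
    $line =~ s/\p{Private_Use}/ /g;
    push @cleaned, $line;
}
close $fh;

print "$_\n" for @cleaned;
```

With a real file, the in-memory reference would be replaced by the file name, and the rest should carry over unchanged.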