I have to read a badly encoded string from a remote service and cannot figure out how to recover the correct value in C# or JavaScript. I can change neither the values in the service nor the way they are saved in the DB, but I need to display them correctly.
Bad string: Adrián José
Correct string: Adrián José
The error can be undone since the fixed value can be obtained using tools such as https://www.iosart.com/tools/charset-fixer or in Notepad++ by changing the Encoding from ANSI to UTF-8.
So far, I have this solution in JS (client side), but I would rather not use the deprecated escape() function, and I would like to do the fix on the server side.
var badString = "Adrián José";
var fixedString = decodeURIComponent(escape(badString)); // "Adrián José"
I tried to play with the Encoding class in C# (like here), but couldn’t find a valid combination.
var badString = "Adrián José";
var origEnco = Encoding.UTF8;
var targetEnco = Encoding.Default;
byte[] utfBytes = origEnco.GetBytes(badString);
byte[] isoBytes = Encoding.Convert(origEnco, targetEnco, utfBytes);
string fixedString = targetEnco.GetString(isoBytes); // "Adrián José"
What am I missing? How do the character set fixer or Notepad++ work?
>Solution :
For your provided example, this code works and outputs "Adrián José" as expected:
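using System.Text;
// The text was originally UTF-8 but was decoded as Windows-1252, so that step is reversed here.
var currentEncoding = Encoding.GetEncoding("Windows-1252");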
var targetEncoding = Encoding.UTF8;
string input = "Adrián José";
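// Encoding the bad string as Windows-1252 recovers the original bytes; decoding those as UTF-8 gives the intended text.
string output = targetEncoding.GetString(currentEncoding.GetBytes(input));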
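If you’re using .NET Core/.NET 5+, Windows-1252 is not available until you register the code pages provider, so add this somewhere in your code (I usually do it at the top of my Main method). Depending on your target framework, you may also need to install System.Text.Encoding.CodePages from NuGet: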
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
While this provides the result you’re interested in, I don’t know if it will work for all instances of your bad text.
If you can, I would fix the problem at the source, rather than trying to fix it once you have the incorrectly-encoded string.
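If the fix also has to stay on the client side, the same idea can be sketched in JavaScript without escape(). This is only a sketch under some assumptions: the text really was UTF-8 that got decoded as Windows-1252, and a TextDecoder is available (modern browsers and recent Node.js). The names fixMojibake and CP1252_EXTRA are made up for this example. Each character is mapped back to its Windows-1252 byte, and the bytes are then decoded strictly as UTF-8. If either step fails, the input is returned unchanged, so strings that are already correct should pass through as they are.

```javascript
// Windows-1252 differs from Latin-1 only in the 0x80-0x9F range.
// This maps those Unicode code points back to their Windows-1252 byte.
const CP1252_EXTRA = new Map([
  [0x20AC, 0x80], [0x201A, 0x82], [0x0192, 0x83], [0x201E, 0x84],
  [0x2026, 0x85], [0x2020, 0x86], [0x2021, 0x87], [0x02C6, 0x88],
  [0x2030, 0x89], [0x0160, 0x8A], [0x2039, 0x8B], [0x0152, 0x8C],
  [0x017D, 0x8E], [0x2018, 0x91], [0x2019, 0x92], [0x201C, 0x93],
  [0x201D, 0x94], [0x2022, 0x95], [0x2013, 0x96], [0x2014, 0x97],
  [0x02DC, 0x98], [0x2122, 0x99], [0x0161, 0x9A], [0x203A, 0x9B],
  [0x0153, 0x9C], [0x017E, 0x9E], [0x0178, 0x9F],
]);

// Hypothetical helper: tries to undo "UTF-8 read as Windows-1252".
function fixMojibake(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code <= 0xFF) {
      // ASCII and Latin-1 characters map to the byte with the same value.
      // C1 controls (0x80-0x9F) are passed through too, since some decoders
      // emit them for the bytes Windows-1252 leaves undefined.
      bytes[i] = code;
    } else if (CP1252_EXTRA.has(code)) {
      bytes[i] = CP1252_EXTRA.get(code);
    } else {
      // Not representable in Windows-1252, so this is not that kind of mojibake.
      return text;
    }
  }
  try {
    // fatal: true makes invalid UTF-8 throw instead of producing U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return text;
  }
}
```

With the string from the question, fixMojibake("Adrián José") returns "Adrián José". Because invalid input is returned as-is, calling it on the already correct "Adrián José" leaves it unchanged.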