Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Strange characters found in XML file and PowerShell output after exporting from Excel: ​

I have an XML file that I’m trying to read with PowerShell. However when I read it, the output of some of the XML objects have the following characters in them: ​

I simply downloaded an XML file I needed from a third-party, which opens in Excel. Then I grab the columns I need and paste them into a new Excel Workbook. Then I map the fields with an XML Schema and then export it as an XML file, which I then use for scripting.

In the Excel spreadsheet my data looks clean, but then when I export it and run the PS script, these strange characters appear in the output. The characters even appear in the actual XML file after exporting. What am I doing wrong?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I tried using -Encoding UTF8, but I’m relatively new to PowerShell and am not sure how to appropriately apply it to my script. Appreciate any help!

PowerShell

$xmlpath = 'Path\To\The\File.xml'

[xml]$xmldata = (Get-Content $xmlpath)

$xmldata.applications.application.name

Example of Output

​ABC_DEF_GHI​.com​​
​JKL_MNO_PQRS​.com​
TUV_WXY_Z.com
AB_CD_EF_GH​.com

>Solution :

This is a prime example of why you shouldn’t use the idiom [xml]$xmldata = (Get-Content $xmlpath) – as convenient as it is.[1] The problem is indeed one of character encoding: your file is UTF-8-encoded, but Windows PowerShell’s Get-Content cmdlet interprets it as ANSI-encoded in the absence of a BOM – this answer explains the encoding part in detail.Thanks, choroba.

Instead, to ensure that the XML file’s character encoding is interpreted correctly, use the following:

($xmlData = [xml]::new()).Load((Convert-Path $xmlPath))

This delegates interpretation of the character encoding to the System.Xml.XmlDocument.Load .NET API method, which not only assumes the proper default for XML (UTF-8), but also respects any explicit encoding specification as part of the XML declaration, if present (e.g.,
<?xml version="1.0" encoding="iso-8859-1"?>)

See also:


[1] If you happen to know the encoding of the input file ahead of time, you can get away with using Get-Content‘s -Encoding parameter in your original approach ([xml]$xmldata = (Get-Content -Encoding utf8 $xmlpath), but the .Load()-based approach is much more robust.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading