Checking for well-formed XML in a DataFrame column

I have a DataFrame with XML stored in a string column. Before I can process it further, I need to check that the XML is well-formed. My current approach uses a UDF, but it fails with an error.

Code:

from lxml import etree
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

def wellformedness(xml):
    wellformed = True
    try:
        doc = etree.fromstring(xml)
    except:
        wellformed = False
    return wellformed

udf_wellformedness = F.udf(wellformedness, BooleanType())
df.withColumn('Wellformed', udf_wellformedness('MyColumn'))
df2 = df.filter(~df["Wellformed"])

Error:
AnalysisException: Cannot resolve column name "Wellformed" among (MyColumn, MyColumn2, MyColumn3);

What am I doing wrong? And can this be done more efficiently?

Solution:

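withColumn does not modify df in place. It returns a new DataFrame, and your code discards that result, so df never gets a Wellformed column and the filter on the following line cannot resolve it. Assign the result back to df: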

df = df.withColumn('Wellformed', udf_wellformedness('MyColumn'))

If you must validate the XML with a custom Python function, a UDF is the way to go.
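Putting the fix together, here is a minimal sketch of how the check and the column assignment could look. It rests on a few assumptions that are not in the question: it swaps lxml for the standard library's xml.etree.ElementTree so the checker has no extra dependency, it treats NULL cells as not well-formed, and the names is_well_formed and add_wellformed_column are hypothetical helpers. It also catches the parser's specific error instead of using a bare except, so unrelated failures are not silently reported as malformed XML. If you stay with lxml, the corresponding exception should be etree.XMLSyntaxError.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml):
    """Return True if `xml` parses as well-formed XML, False otherwise."""
    if xml is None:
        # Assumption: a NULL cell counts as not well-formed.
        return False
    try:
        ET.fromstring(xml)
    except ET.ParseError:
        # Raised by the stdlib parser for malformed or empty input.
        return False
    return True


def add_wellformed_column(df, source_col="MyColumn", target_col="Wellformed"):
    """Return a NEW DataFrame with a boolean well-formedness column."""
    # Imported lazily so the checker above can be used without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    check = F.udf(is_well_formed, BooleanType())
    # withColumn returns a new DataFrame; the caller must keep the result.
    return df.withColumn(target_col, check(F.col(source_col)))
```

With this in place, the usage would be df = add_wellformed_column(df), followed by df2 = df.filter(~df["Wellformed"]) to keep only the rows whose XML failed the check.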
