Why isn't PySpark writing all schema keys to the output JSON when a custom schema is defined?

I’m trying to read approximately 2,000 files from an S3 bucket, parse the data in each file, and then write the parsed output to another bucket.

Each file contains an array of JSON objects, e.g.:

[
    {"region_code": "UK", "city": "London", "country_name": "England", ...etc.}
]

I have the following schema for my input data. These are the keys I want to grab out of each file:

    from pyspark.sql.types import (
        StructType, StructField, StringType, ArrayType, IntegerType
    )

    schema = StructType([
        StructField('region_code', StringType(), True),
        StructField('country_code', StringType(), True),
        StructField('city', StringType(), True),
        StructField('last_update', StringType(), True),
        StructField('latitude', StringType(), True),
        StructField('tags', ArrayType(StringType(), True), True),
        StructField('area_code', StringType(), True),
        StructField('country_name', StringType(), True),
        StructField('hostnames', ArrayType(StringType(), True), True),
        StructField('org', StringType(), True),
        StructField('asn', StringType(), True),
        StructField('isp', StringType(), True),
        StructField('longitude', StringType(), True),
        StructField('domains', ArrayType(StringType(), True), True),
        StructField('ip_str', StringType(), True),
        StructField('os', StringType(), True),
        StructField('ports', ArrayType(IntegerType(), True), True)
    ])

I read all the files from the input S3 folder:

    df = spark.read.option("multiLine", "true").schema(schema).json(
        "s3a://bucket-name/test-folder/*"
    )

and then write the output to another S3 bucket:

    df.write.format('json').save("s3a://bucket-name/test/outputdata")

However, when I look at the output data, some records are missing keys and I’m not sure why. For example:

[
    {"city": "London", "country_name": "England", ...etc.},  # no 'region_code' key
    {"region_code": "UK", "country_name": "England", ...etc.}  # no 'city' key
]

I thought the schema defined the structure of the DataFrame. I’m assuming that when a key is missing from the output, the key was either absent or null in the input data. But in that case, wouldn’t the schema define the key anyway and just return null?
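
For what it’s worth, I can reproduce the DataFrame side of this locally (a minimal sketch with made-up sample data): a record missing a key still gets that column in the DataFrame, just as null:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Two-field schema for brevity
    small_schema = StructType([
        StructField('region_code', StringType(), True),
        StructField('city', StringType(), True),
    ])

    # One JSON record with no 'city' key
    sample = spark.sparkContext.parallelize(['{"region_code": "UK"}'])
    check_df = spark.read.schema(small_schema).json(sample)
    check_df.show()
    # +-----------+----+
    # |region_code|city|
    # +-----------+----+
    # |         UK|null|
    # +-----------+----+

So the DataFrame itself does carry the key as null; the keys must be disappearing at write time.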

Solution:

Spark drops null fields by default when writing JSON files. You can force Spark to write all fields, even when they’re null, by setting the ignoreNullFields option to False. See the JSON data source options in the Spark documentation for other settings.

    df.write.option('ignoreNullFields', False) \
        .json("s3a://bucket-name/test/outputdata")
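
To sanity-check the difference, you can write a null-bearing row both ways and inspect the raw output lines. A minimal sketch, using placeholder /tmp paths:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    small_schema = StructType([
        StructField('region_code', StringType(), True),
        StructField('city', StringType(), True),
    ])
    # One record where 'city' is null
    check_df = spark.read.schema(small_schema).json(
        spark.sparkContext.parallelize(['{"region_code": "UK"}'])
    )

    # Default behaviour: null fields are dropped from the written JSON
    check_df.write.mode('overwrite').json('/tmp/out-default')
    print(spark.read.text('/tmp/out-default').first().value)
    # {"region_code":"UK"}

    # ignoreNullFields=False: null fields are kept
    check_df.write.mode('overwrite') \
        .option('ignoreNullFields', False).json('/tmp/out-all-keys')
    print(spark.read.text('/tmp/out-all-keys').first().value)
    # {"region_code":"UK","city":null}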