Why isn't PySpark writing all schema keys to the output JSON when a custom schema is defined?

I’m trying to read approximately 2,000 files from an S3 bucket, parse the data in each file, and then write the parsed output to another bucket.

Each file contains an array of JSON objects, e.g.:

[
    {"region_code": "UK", "city": "London", "country_name": "England", ...etc.}
]

I have the following schema for my input data. These are the keys I want to grab out of each file:

    from pyspark.sql.types import (
        StructType, StructField, StringType, ArrayType, IntegerType
    )

    schema = StructType([
        StructField('region_code', StringType(), True),
        StructField('country_code', StringType(), True),
        StructField('city', StringType(), True),
        StructField('last_update', StringType(), True),
        StructField('latitude', StringType(), True),
        StructField('tags', ArrayType(StringType(), True), True),
        StructField('area_code', StringType(), True),
        StructField('country_name', StringType(), True),
        StructField('hostnames', ArrayType(StringType(), True), True),
        StructField('org', StringType(), True),
        StructField('asn', StringType(), True),
        StructField('isp', StringType(), True),
        StructField('longitude', StringType(), True),
        StructField('domains', ArrayType(StringType(), True), True),
        StructField('ip_str', StringType(), True),
        StructField('os', StringType(), True),
        StructField('ports', ArrayType(IntegerType(), True), True)
    ])

I read all the files from the input S3 folder:

    df = spark.read.option("multiLine", "true").schema(schema).json(
        "s3a://bucket-name/test-folder/*"
    )

and then write the output to another S3 bucket:

    df.write.format('json').save("s3a://bucket-name/test/outputdata")

However, when I look at the output data, some records are missing keys and I’m not sure why. For example:

[
    {"city": "London", "country_name": "England", ...etc.},  # no 'region_code' key
    {"region_code": "UK", "country_name": "England", ...etc.}  # no 'city' key
]

I thought the schema defined the structure of the DataFrame. I’m assuming that when a key is missing from the output, the key was either absent or null in the input data. But in that case, wouldn’t the schema define the key anyway and just return null?
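
For what it’s worth, I can reproduce the DataFrame side of this locally (a minimal sketch with made-up sample data): a record missing a key still gets that column in the DataFrame, just as null:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # Two-field schema for brevity
    small_schema = StructType([
        StructField('region_code', StringType(), True),
        StructField('city', StringType(), True),
    ])

    # One JSON record with no 'city' key
    sample = spark.sparkContext.parallelize(['{"region_code": "UK"}'])
    check_df = spark.read.schema(small_schema).json(sample)
    check_df.show()
    # +-----------+----+
    # |region_code|city|
    # +-----------+----+
    # |         UK|null|
    # +-----------+----+

So the DataFrame itself does carry the key as null; the keys must be disappearing at write time.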

Solution:

Spark drops null fields by default when writing JSON files. You can force Spark to write all fields, even when they’re null, by setting the ignoreNullFields option to False. See the JSON data source options in the Spark documentation for other settings.

    df.write.option('ignoreNullFields', False) \
        .json("s3a://bucket-name/test/outputdata")
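
To sanity-check the difference, you can write a null-bearing row both ways and inspect the raw output lines. A minimal sketch, using placeholder /tmp paths:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()
    small_schema = StructType([
        StructField('region_code', StringType(), True),
        StructField('city', StringType(), True),
    ])
    # One record where 'city' is null
    check_df = spark.read.schema(small_schema).json(
        spark.sparkContext.parallelize(['{"region_code": "UK"}'])
    )

    # Default behaviour: null fields are dropped from the written JSON
    check_df.write.mode('overwrite').json('/tmp/out-default')
    print(spark.read.text('/tmp/out-default').first().value)
    # {"region_code":"UK"}

    # ignoreNullFields=False: null fields are kept
    check_df.write.mode('overwrite') \
        .option('ignoreNullFields', False).json('/tmp/out-all-keys')
    print(spark.read.text('/tmp/out-all-keys').first().value)
    # {"region_code":"UK","city":null}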