Can't extract value from <> need struct type but got string;

February 7, 2022

I have some nested json that I have parallelized and spat out as a json. A complete record would look like:

{
   "id":"1",
   "type":"site",
   "attributes":{
      "description":"Number 1 Park",
      "activeInactive":{
         "text":"Active",
         "colour":"#4CBB17"
      },
      "lastUpdated":"2019-12-05T08:51:39"
   },
   "relationships":{
      "region":{
         "data":{
            "type":"region",
            "id":"1061",
            "meta":{
               "displayValue":"Park Region"
            }
         }
      }
   }
}

However, the data is pending a data cleanse and currently the region field is not populated.

{
   "id":"1",
   "type":"site",
   "attributes":{
      "description":"Number 1 Park",
      "activeInactive":{
         "text":"Active",
         "colour":"#4CBB17"
      },
      "lastUpdated":"2019-12-05T08:51:39"
   },
   "relationships":{
      "region":{
         "data": null
         }
      }
   }
}

The data element will be null if the relationship doesn’t exist (i.e. it is an orphaned site).

I run this JSON into a spark dataframe via an RDD. The schema of the dataframe is:

attributes:struct
    activeInactive:struct
       colour:string
       text:string
    description:string
    lastUpdated:string
id:string
relationships:struct
    region:struct
       data:string

I get errors when coding for region using df.select(col('relationships.region.data.meta.displayValue')) as if the nested fields were there rather than data as per the topic heading. I’m going to assume this is because of the conflict with the dataframe’s schema.

The question is how can I make this more dynamic and still obtain the displayValue as and when this is populated without needing to revisit the code?

>Solution :

While reading a json file, you can impose the schema on the output dataframe using this syntax:

df = spark.read.json("<path to json file>", schema = <schema object>)

This way the data field will still show you null, but it’s gonna be StructType() with a complete nested structure.
Based on the data snippet that was provided the applicable schema object looks like this:

schemaObject = StructType([
  StructField('id', StringType(), True),
  StructField('type', StringType(), True),
  StructField('attributes', StructType([
    StructField('descrption', StringType(), True),
    StructField('activeInactive', StructType([
      StructField('text', StringType(), True),
      StructField('colour', StringType(), True)
    ]), True),
    StructField('lastUpdated', StringType(), True)
  ]), True),
  StructField('relationships'StructType([
    StructField('region', StructType([
      StructField('data', StructType([
        StructField('type', StringType(), True),
        StructField('id', StringType(), True),
        StructField('meta', StructType([
          StructField('displayValue', StringType(), True)
        ]), True)
      ]), True)
    ]), True)
  ]), True)
])