Can't extract value from <>: need struct type but got string

I have some nested JSON that I have parallelized and written back out as JSON. A complete record looks like:

{
   "id":"1",
   "type":"site",
   "attributes":{
      "description":"Number 1 Park",
      "activeInactive":{
         "text":"Active",
         "colour":"#4CBB17"
      },
      "lastUpdated":"2019-12-05T08:51:39"
   },
   "relationships":{
      "region":{
         "data":{
            "type":"region",
            "id":"1061",
            "meta":{
               "displayValue":"Park Region"
            }
         }
      }
   }
}

However, the data is pending a data cleanse and currently the region field is not populated.

{
   "id":"1",
   "type":"site",
   "attributes":{
      "description":"Number 1 Park",
      "activeInactive":{
         "text":"Active",
         "colour":"#4CBB17"
      },
      "lastUpdated":"2019-12-05T08:51:39"
   },
   "relationships":{
      "region":{
         "data":null
      }
   }
}

The data element will be null if the relationship doesn’t exist (i.e. it is an orphaned site).

I load this JSON into a Spark DataFrame via an RDD. The inferred schema of the DataFrame is:

attributes:struct
    activeInactive:struct
       colour:string
       text:string
    description:string
    lastUpdated:string
id:string
relationships:struct
    region:struct
       data:string

I get errors when selecting the region with df.select(col('relationships.region.data.meta.displayValue')), because the code treats the nested fields as present, whereas data has been inferred as a string (as per the topic heading). I assume this is because of the conflict with the DataFrame's inferred schema.

The question is how can I make this more dynamic and still obtain the displayValue as and when this is populated without needing to revisit the code?

>Solution :

While reading a json file, you can impose the schema on the output dataframe using this syntax:

df = spark.read.json("<path to json file>", schema = <schema object>)

This way the data field will still show null, but it will be a StructType() with the complete nested structure.
Based on the provided data snippet, the applicable schema object looks like this:

from pyspark.sql.types import StructType, StructField, StringType

schemaObject = StructType([
  StructField('id', StringType(), True),
  StructField('type', StringType(), True),
  StructField('attributes', StructType([
    StructField('description', StringType(), True),
    StructField('activeInactive', StructType([
      StructField('text', StringType(), True),
      StructField('colour', StringType(), True)
    ]), True),
    StructField('lastUpdated', StringType(), True)
  ]), True),
  StructField('relationships', StructType([
    StructField('region', StructType([
      StructField('data', StructType([
        StructField('type', StringType(), True),
        StructField('id', StringType(), True),
        StructField('meta', StructType([
          StructField('displayValue', StringType(), True)
        ]), True)
      ]), True)
    ]), True)
  ]), True)
])