Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Building a view from a JSON stringified array of objects in the original table column

Given an original table in Athena that is similar to this contrived example below:

Original Table:

a b cmsmessage c
{"asset": {"metadata": {"item": [{"key": "author", "value": "Rob"}, { "key": "id", "value": "123"}, {"key": "version", "value": "1"}]}}}
{"asset": {"metadata": {"item": [{ "key": "id", "value": "456"}, {"key": "author", "value": "John"}, {"key": "version", "value": "3"}]}}}
{"asset": {"metadata": {"item": [{"key": "version", "value": "2"}, {"key": "author", "value": "Sally"}, {"key": "id", "value": "789" }]}}}

I am trying to create a View from the stringified JSON in the cmsmessage column:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Desired Resultant View:

id author version
123 Rob 1
456 John 3
789 Sally 3

As you can see in the "Desired Resultant View" I am essentially wanting to extract the data from each object of the item array based on the value of the key value.

[{"key": "author", "value": "Rob"}, { "key": "id", "value": "123"}, {"key": "version", "value": "1"}]


I’ve seen/tried various examples that access items in an array by index, e.g. [0] such as the examples in Athena docs. However the index position in the array may vary given the json/string in the cmsmessage column of the original table.


My Failed attempt:

Below shows my failed attempt:

WITH dataset AS 
(
   SELECT *
   FROM (VALUES 
     ('{"asset": {"metadata": {"item": [{"key": "author", "value": "Rob"}, { "key": "id", "value": "123"}, {"key": "version", "value": "1"}]}}}'),
     ('{"asset": {"metadata": {"item": [{ "key": "id", "value": "456"}, {"key": "author", "value": "John"}, {"key": "version", "value": "3"}]}}}'),
     ('{"asset": {"metadata": {"item": [{"key": "version", "value": "2"}, {"key": "author", "value": "Sally"}, {"key": "id", "value": "789" }]}}}')
  ) AS t (cmsmessage)
)

SELECT
    json_extract_scalar(objArray, '$.key') as _keys,
    json_extract_scalar(objArray, '$.value') as _values
FROM dataset

CROSS JOIN UNNEST(CAST(json_extract(cmsmessage, '$.asset.metadata.item') as array(json))) as t (objArray)

>Solution :

Arguably the easiest option would be using json_query which has better support for JSON path syntax:

SELECT json_query(cmsmessage, 'lax $.asset.metadata.item[*]?(@.key=="id").value') AS id
    , json_query(cmsmessage, 'lax $.asset.metadata.item[*]?(@.key=="author").value') AS author
    , json_query(cmsmessage, 'lax $.asset.metadata.item[*]?(@.key=="version").value') AS version
FROM dataset;

Should work for Athena engine version 3 which is based on Trino.

For version 2 you can achieve similar with array manipulation.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading