Given an original table in Athena that is similar to this contrived example below:
Original Table:
| a | b | cmsmessage | c |
|---|---|---|---|
| … | … | {"asset": {"metadata": {"item": [{"key": "author", "value": "Rob"}, { "key": "id", "value": "123"}, {"key": "version", "value": "1"}]}}} | … |
| … | … | {"asset": {"metadata": {"item": [{ "key": "id", "value": "456"}, {"key": "author", "value": "John"}, {"key": "version", "value": "3"}]}}} | … |
| … | … | {"asset": {"metadata": {"item": [{"key": "version", "value": "2"}, {"key": "author", "value": "Sally"}, {"key": "id", "value": "789" }]}}} | … |
I am trying to create a View from the stringified JSON in the cmsmessage column:
Desired Resultant View:
| id | author | version |
|---|---|---|
| 123 | Rob | 1 |
| 456 | John | 3 |
| 789 | Sally | 3 |
As you can see in the "Desired Resultant View" I am essentially wanting to extract the data from each object of the item array based on the value of the key value.
[{"key": "author", "value": "Rob"}, { "key": "id", "value": "123"}, {"key": "version", "value": "1"}]
I’ve seen/tried various examples that access items in an array by index, e.g. [0] such as the examples in Athena docs. However the index position in the array may vary given the json/string in the cmsmessage column of the original table.
My Failed attempt:
Below shows my failed attempt:
WITH dataset AS
(
SELECT *
FROM (VALUES
('{"asset": {"metadata": {"item": [{"key": "author", "value": "Rob"}, { "key": "id", "value": "123"}, {"key": "version", "value": "1"}]}}}'),
('{"asset": {"metadata": {"item": [{ "key": "id", "value": "456"}, {"key": "author", "value": "John"}, {"key": "version", "value": "3"}]}}}'),
('{"asset": {"metadata": {"item": [{"key": "version", "value": "2"}, {"key": "author", "value": "Sally"}, {"key": "id", "value": "789" }]}}}')
) AS t (cmsmessage)
)
SELECT
json_extract_scalar(objArray, '$.key') as _keys,
json_extract_scalar(objArray, '$.value') as _values
FROM dataset
CROSS JOIN UNNEST(CAST(json_extract(cmsmessage, '$.asset.metadata.item') as array(json))) as t (objArray)
>Solution :
Arguably the easiest option would be using json_query which has better support for JSON path syntax:
SELECT json_query(cmsmessage, 'lax $.asset.metadata.item[*]?(@.key=="id").value') AS id
, json_query(cmsmessage, 'lax $.asset.metadata.item[*]?(@.key=="author").value') AS author
, json_query(cmsmessage, 'lax $.asset.metadata.item[*]?(@.key=="version").value') AS version
FROM dataset;
Should work for Athena engine version 3 which is based on Trino.
For version 2 you can achieve similar with array manipulation.