Using JQ for statistical aggregation of key values in a large data set

I have a complex use case that I can currently mostly solve with jq, but I am wondering whether jq is no longer the right tool for the job, or whether there is a way to generalize my filter. The use case is to build a statistical aggregate of sorted key-value counts over a large data set of newline-delimited JSON. The issue I am running into is that the number of keys in my data set makes the generated filter too large for jq to compile.

Below is a sample input; note that it includes arrays.

Input: (input.json)

{
  "content": {
    "user": {
      "name": "Bill",
      "items": [
        {
          "name": "shovel",
          "details": {
            "heavy": false
          }
        },
        {
          "name": "hammer",
        }
      ],
      "details": {
        "age": 60
      }
    }
  }
}
{
  "content": {
    "user": {
      "name": "Bill",
      "items": [
        {
          "name": "shovel",
          "details": {
            "heavy": false
          }
        },
        {
          "name": "hammer",
        }
      ],
      "details": {
        "age": 21
      }
    }
  }
}
{
  "content": {
    "user": {
      "name": "Alice",
      "items": [
        {
          "name": "hammer",
          "details": {
            "heavy": true
          }
        },
        {
          "name": "shovel",
        }
      ],
      "details": {
        "age": 30
      }
    }
  }
}

My desired output is below; essentially, it is a listing of every key in the structure, including array indices, with a count of each key's values sorted by count.

Output:

{
  "stats": {
    "user.name": {
      "Bill": 2,
      "Alice": 1
    },
    "user.items.0.name": {
      "shovel": 2,
      "hammer": 1
    },
    "user.items.1.name": {
      "hammer": 2,
      "shovel": 1
    },
    "user.items.0.details.heavy": {
      "true": 1,
      "": 2,
    },
    "user.items.1.details.heavy": {
      "true": 1,
      "": 2
    },
    "user.details.age": {
      "30": 1,
      "62": 1,
      "21": 1
    }
  }
}

Current Problematic Solution:

Currently, I first get a list of all keys in the JSON input (content.user.name, content.user.items.1.name, etc.) and use that list to construct a jq filter.

For context, here is the jq filter I use to get the keys:
select(objects)|=[.] | map( .content | paths(scalars)) | map(join(".")) | unique
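
For example, invoked per document (not slurped), it can be run like this, assuming jq 1.6 or later so that join stringifies the numeric array indices; for each of the sample records above it should print one array of keys:

cat input.json | jq -c 'select(objects)|=[.] | map(.content | paths(scalars)) | map(join(".")) | unique'
["user.details.age","user.items.0.details.heavy","user.items.0.name","user.items.1.name","user.name"]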

The current aggregation filter looks like this (for just a single content.user.name aggregate calculation):

cat input.json | jq -c -s '{"stats": {"user.name": (map(.content."user"?."name"?) 
| del(..|nulls) | map(. | tostring) 
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1)) 
| to_entries | sort_by(.value) | reverse | from_entries)}}'
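
For the sample records above, that command should produce something like:

{"stats":{"user.name":{"Bill":2,"Alice":1}}}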

So to add more aggregate calculations, I use this template:

(newlines added for legibility)

"{KEY}": (map(.content.{KEY})
| del(..|nulls) | map(. | tostring)
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1))
| to_entries | sort_by(.value) | reverse | from_entries)

Filter including content.user.details.age:

cat input.json | jq -c -s '{"stats": {"user.name": (map(.content."user"?."name"?) 
| del(..|nulls) | map(. | tostring) 
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1)) 
| to_entries | sort_by(.value) | reverse | from_entries),
"user.details.age": (map(.content."user"?."details"?."age"?) 
| del(..|nulls) | map(. | tostring) 
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1)) 
| to_entries | sort_by(.value) | reverse | from_entries)}}'
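
With the sample input this prints something like the following (the relative order of the age entries is arbitrary here, since each value appears only once):

{"stats":{"user.name":{"Bill":2,"Alice":1},"user.details.age":{"30":1,"21":1,"60":1}}}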

So the size of my filter grows linearly with the number of keys in the data set, which means that for large data sets the filter becomes too large for jq to compile. I'm not sure if I have been staring at this for too long, but I am no longer sure this is a problem best solved by jq. Even if I shrink the per-key aggregation template, I am still capped at some number of keys by the maximum filter size, and I can't find a way to iterate over the original keys so that the template can be reused for each of them. That would also mean recalculating the keys for every newline-delimited JSON document, which isn't optimal, but I don't see a way around that either.

TL;DR

I want to aggregate key values by key from some newline-delimited JSON.

Some pseudocode of what would be the ideal solution is below, but I cannot get it to work.

get keys:
select(objects)|=[.] | map( .content | paths(scalars)) | map(join(".")) | unique

iterate through all keys and run:
"{KEY}": (map(.content.{KEY})
| del(..|nulls) | map(. | tostring)
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1))
| to_entries | sort_by(.value) | reverse | from_entries)

Anyone have any ideas?

Solution:

You could use the --stream option to read in the big input in smaller pieces. In streaming mode, jq emits [path, value] pairs for every scalar, so no per-key filter has to be generated: 1 | truncate_stream(inputs) strips the leading "content" component from each path, the has(1) check keeps only the [path, value] pairs (and skips the closing [path]-only events), and the reduce increments a counter keyed by the dot-joined path and the stringified value.

jq --stream -n '
  {stats: (reduce (1 | truncate_stream(inputs)) as $i ({};
    if ($i | has(1)) then ."\($i[0] | join("."))"."\($i[1])" += 1 else . end
  ))}
' input.json
{
  "stats": {
    "user.name": {
      "Bill": 2,
      "Alice": 1
    },
    "user.items.0.name": {
      "shovel": 2,
      "hammer": 1
    },
    "user.items.0.details.heavy": {
      "false": 2,
      "true": 1
    },
    "user.items.1.name": {
      "hammer": 2,
      "shovel": 1
    },
    "user.details.age": {
      "60": 1,
      "21": 1,
      "30": 1
    }
  }
}
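
If you also want each per-key object sorted by descending count, as in the filters from the question, one option (a sketch along the same lines, not a tested drop-in) is to pipe the reduce result through map_values before wrapping it in the stats object:

jq --stream -n '
  {stats: (reduce (1 | truncate_stream(inputs)) as $i ({};
    if ($i | has(1)) then ."\($i[0] | join("."))"."\($i[1])" += 1 else . end
  ) | map_values(to_entries | sort_by(.value) | reverse | from_entries))}
' input.json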