Using JQ for statistical aggregation of key values in a large data set

I have a complex use case that I can currently mostly solve with jq, but I am wondering whether jq is no longer the right tool for the job, or whether there is a way to generalize my filter. The use case is to build a statistical aggregate of sorted key-value counts over a large data set of newline-delimited JSON. The issue I am running into is that the number of keys in my data set makes the generated filter too large for jq to compile.

Below is a sample input; note that it includes arrays.

Input: (input.json)

{
  "content": {
    "user": {
      "name": "Bill",
      "items": [
        {
          "name": "shovel",
          "details": {
            "heavy": false
          }
        },
        {
          "name": "hammer",
        }
      ],
      "details": {
        "age": 60
      }
    }
  }
}
{
  "content": {
    "user": {
      "name": "Bill",
      "items": [
        {
          "name": "shovel",
          "details": {
            "heavy": false
          }
        },
        {
          "name": "hammer",
        }
      ],
      "details": {
        "age": 21
      }
    }
  }
}
{
  "content": {
    "user": {
      "name": "Alice",
      "items": [
        {
          "name": "hammer",
          "details": {
            "heavy": true
          }
        },
        {
          "name": "shovel",
        }
      ],
      "details": {
        "age": 30
      }
    }
  }
}

My desired output is below; essentially, it is a listing of every key in the structure, including array indices, with a count of each key's values sorted by count.

Output:

{
  "stats": {
    "user.name": {
      "Bill": 2,
      "Alice": 1
    },
    "user.items.0.name": {
      "shovel": 2,
      "hammer": 1
    },
    "user.items.1.name": {
      "hammer": 2,
      "shovel": 1
    },
    "user.items.0.details.heavy": {
      "true": 1,
      "": 2,
    },
    "user.items.1.details.heavy": {
      "true": 1,
      "": 2
    },
    "user.details.age": {
      "30": 1,
      "62": 1,
      "21": 1
    }
  }
}

Current Problematic Solution:

Currently, I first get a list of all keys in the JSON input (content.user.name, content.user.items.1.name, etc.) and use that list to construct a jq filter.

For context, here is the jq filter I use to get the keys:
select(objects)|=[.] | map( .content | paths(scalars)) | map(join(".")) | unique
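
For example, invoked per document (not slurped), it can be run like this, assuming jq 1.6 or later so that join stringifies the numeric array indices; for each of the sample records above it should print one array of keys:

cat input.json | jq -c 'select(objects)|=[.] | map(.content | paths(scalars)) | map(join(".")) | unique'
["user.details.age","user.items.0.details.heavy","user.items.0.name","user.items.1.name","user.name"]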

The current aggregation filter looks like this (for just a single content.user.name aggregate calculation):

cat input.json | jq -c -s '{"stats": {"user.name": (map(.content."user"?."name"?) 
| del(..|nulls) | map(. | tostring) 
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1)) 
| to_entries | sort_by(.value) | reverse | from_entries)}}'
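
For the sample records above, that command should produce something like:

{"stats":{"user.name":{"Bill":2,"Alice":1}}}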

So to add more aggregate calculations, I use this template:

(newlines added for legibility)

"{KEY}": (map(.content.{KEY})
| del(..|nulls) | map(. | tostring)
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1))
| to_entries | sort_by(.value) | reverse | from_entries)

Filter including content.user.details.age:

cat input.json | jq -c -s '{"stats": {"user.name": (map(.content."user"?."name"?) 
| del(..|nulls) | map(. | tostring) 
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1)) 
| to_entries | sort_by(.value) | reverse | from_entries),
"user.details.age": (map(.content."user"?."details"?."age"?) 
| del(..|nulls) | map(. | tostring) 
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1)) 
| to_entries | sort_by(.value) | reverse | from_entries)}}'
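
With the sample input this prints something like the following (the relative order of the age entries is arbitrary here, since each value appears only once):

{"stats":{"user.name":{"Bill":2,"Alice":1},"user.details.age":{"30":1,"21":1,"60":1}}}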

So the size of my filter grows linearly with the number of keys in the data set, which means that for large data sets the filter becomes too large for jq to compile. I'm not sure if I have been staring at this for too long, but I am no longer sure this is a problem best solved by jq. Even if I shrink the per-key aggregation template, I am still capped at some number of keys by the maximum filter size, and I can't find a way to iterate over the original keys so that the template can be reused for each of them. That would also mean recalculating the keys for every newline-delimited JSON document, which isn't optimal, but I don't see a way around that either.

TL;DR

I want to aggregate key values by key from some newline-delimited JSON.

Some pseudocode of what would be the ideal solution is below, but I cannot get it to work.

get keys:
select(objects)|=[.] | map( .content | paths(scalars)) | map(join(".")) | unique

iterate through all keys and run:
"{KEY}": (map(.content.{KEY})
| del(..|nulls) | map(. | tostring)
| reduce .[] as $i ( {}; setpath([$i]; getpath([$i]) + 1))
| to_entries | sort_by(.value) | reverse | from_entries)

Anyone have any ideas?

Solution:

You could use the --stream option to read in the big input in smaller pieces. In streaming mode, jq emits [path, value] pairs for every scalar, so no per-key filter has to be generated: 1 | truncate_stream(inputs) strips the leading "content" component from each path, the has(1) check keeps only the [path, value] pairs (and skips the closing [path]-only events), and the reduce increments a counter keyed by the dot-joined path and the stringified value.

jq --stream -n '
  {stats: (reduce (1 | truncate_stream(inputs)) as $i ({};
    if ($i | has(1)) then ."\($i[0] | join("."))"."\($i[1])" += 1 else . end
  ))}
' input.json
{
  "stats": {
    "user.name": {
      "Bill": 2,
      "Alice": 1
    },
    "user.items.0.name": {
      "shovel": 2,
      "hammer": 1
    },
    "user.items.0.details.heavy": {
      "false": 2,
      "true": 1
    },
    "user.items.1.name": {
      "hammer": 2,
      "shovel": 1
    },
    "user.details.age": {
      "60": 1,
      "21": 1,
      "30": 1
    }
  }
}
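
If you also want each per-key object sorted by descending count, as in the filters from the question, one option (a sketch along the same lines, not a tested drop-in) is to pipe the reduce result through map_values before wrapping it in the stats object:

jq --stream -n '
  {stats: (reduce (1 | truncate_stream(inputs)) as $i ({};
    if ($i | has(1)) then ."\($i[0] | join("."))"."\($i[1])" += 1 else . end
  ) | map_values(to_entries | sort_by(.value) | reverse | from_entries))}
' input.json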