How to efficiently parse large JSON files in Python?

Reason for editing: The question has been flagged as a duplicate of another one about handling large files with Pandas. I’m reopening it because I’m asking for recommendations on how to do this. Note that I never mentioned Pandas in my description, and the answer I got from a fellow user helped me achieve what I was trying to do. The same goes for users who try the same methods I tried but never had the insight to make use of the answer provided.


Summary:
I am currently working on a project where I need to parse extremely large JSON files (over 10 GB) in Python, and I am looking for ways to optimize the performance of my parsing code. I have tried the built-in json module, but it takes too long to load the entire file into memory. I am wondering whether there are alternative libraries or techniques that experienced developers have used to handle such large JSON files in Python.

Explanation:
I am working on a project where I need to analyze and extract data from very large JSON files. The files are too large to be loaded into memory all at once, so I need to find an efficient way to parse them. I have tried using the built-in json module in Python, but it is taking a long time to load the file into memory. I have also tried using ijson and jsonlines, but the performance is still not satisfactory. I am looking for suggestions on alternative libraries or techniques that could help me optimize my parsing code and speed up the process.

Example of the JSON:

{
  "orders": [
    {
      "order_id": "1234",
      "date": "2022-05-10",
      "total_amount": 245.50,
      "customer": {
        "name": "John Doe",
        "email": "johndoe@example.com",
        "address": {
          "street": "123 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        }
      },
      "items": [
        {
          "product_id": "6789",
          "name": "Widget",
          "price": 20.00,
          "quantity": 5
        },
        {
          "product_id": "2345",
          "name": "Gizmo",
          "price": 15.50,
          "quantity": 4
        }
      ]
    },
    {
      "order_id": "5678",
      "date": "2022-05-09",
      "total_amount": 175.00,
      "customer": {
        "name": "Jane Smith",
        "email": "janesmith@example.com",
        "address": {
          "street": "456 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        },
        "phone": "555-555-1212"
      },
      "items": [
        {
          "product_id": "9876",
          "name": "Thingamajig",
          "price": 25.00,
          "quantity": 3
        },
        {
          "product_id": "3456",
          "name": "Doodad",
          "price": 10.00,
          "quantity": 10
        }
      ]
    },
    {
      "order_id": "9012",
      "date": "2022-05-08",
      "total_amount": 150.25,
      "customer": {
        "name": "Bob Johnson",
        "email": "bjohnson@example.com",
        "address": {
          "street": "789 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        },
        "company": "ABC Inc."
      },
      "items": [
        {
          "product_id": "1234",
          "name": "Whatchamacallit",
          "price": 12.50,
          "quantity": 5
        },
        {
          "product_id": "5678",
          "name": "Doohickey",
          "price": 7.25,
          "quantity": 15
        }
      ]
    }
  ]
}

Version:
Python 3.8

Here’s what I tried:

# Attempt 1: the built-in json module, which loads the whole file into memory at once.
import json

with open('large_file.json') as f:
    data = json.load(f)

# Attempt 2: ijson, which streams low-level parse events instead of loading the whole document.
import ijson

filename = 'large_file.json'
with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix.endswith('.name'):
            print(value)

# Attempt 3: jsonlines, which reads one JSON object per line. Note that this
# only works if the file is in JSON Lines format; it will fail on a single
# pretty-printed document like the example above.
import jsonlines

filename = 'large_file.json'
with open(filename, 'r') as f:
    reader = jsonlines.Reader(f)
    for obj in reader:
        print(obj)

>Solution :

You could try Pandas, since Pandas can also read JSON, or you could even try SQLite, which can parse JSON, store JSON in columns, and query JSON. But I would recommend Pandas, as it is easier to use and has more documentation online. You could do it like this in Pandas:

import pandas as pd

# Read the JSON file into a DataFrame.
df = pd.read_json("your-filename.json")
print(df)
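
Note that pd.read_json as written still loads the entire file into memory, which is exactly the problem with a 10 GB input. If you can first convert the file to JSON Lines format (one order object per line), pandas can read it in fixed-size chunks instead. A minimal sketch, assuming a hypothetical orders.jsonl converted from the data above:

import pandas as pd

# Stream a JSON Lines file in fixed-size chunks ("orders.jsonl" is a
# hypothetical conversion of the example data, one order per line).
total = 0.0
for chunk in pd.read_json("orders.jsonl", lines=True, chunksize=10_000):
    # Each chunk is a DataFrame of up to 10,000 orders, so memory use
    # stays bounded regardless of the overall file size.
    total += chunk["total_amount"].sum()
print(total)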

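And if you want to try the SQLite route instead, here is a minimal sketch of storing each order as a JSON document and querying it with SQLite’s built-in JSON functions. This assumes your SQLite build includes the JSON1 functions (the default in recent Python releases), and the loading step is illustrative only:

import json
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (doc TEXT)")

# For illustration only: a real 10 GB file would need to be streamed
# into the table (e.g. with ijson) rather than json.load()ed at once.
with open("large_file.json") as f:
    data = json.load(f)

conn.executemany(
    "INSERT INTO orders (doc) VALUES (?)",
    ((json.dumps(order),) for order in data["orders"]),
)
conn.commit()

# json_each() expands the items array of every order into rows, and
# json_extract() pulls individual fields out of each item.
query = """
    SELECT json_extract(item.value, '$.name') AS name,
           SUM(json_extract(item.value, '$.quantity')) AS quantity
    FROM orders, json_each(orders.doc, '$.items') AS item
    GROUP BY name
"""
for name, quantity in conn.execute(query):
    print(name, quantity)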