Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Quickest way of loading, portions of json responses. Into a pd.DataFrame object

So i have a series containing a lot of json responses. And it is quite a list of big jsons.

Display sample

{0: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityCpfRequest","main-installation-date":"22/09/2021","userid":"c46528a7-988d-465e-9f12-25426a8b1808@tunnel.msging.net","full-name":"Claudenice lĂ´bo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"Silvaclaudenice71@gmail.com","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhĂŁ","plan-value":null,"alternative-installation-period-day":"manhĂŁ"}', 1: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityStart","main-installation-date":"22/09/2021","userid":"c46528a7-988d-465e-9f12-25426a8b1808@tunnel.msging.net","full-name":"Claudenice lĂ´bo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"Silvaclaudenice71@gmail.com","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhĂŁ","plan-value":null,"alternative-installation-period-day":"manhĂŁ"}', 2: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"cpfValidationTrue","main-installation-date":"22/09/2021","userid":"c46528a7-988d-465e-9f12-25426a8b1808@tunnel.msging.net","full-name":"Claudenice lĂ´bo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"Silvaclaudenice71@gmail.com","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhĂŁ","plan-value":null,"alternative-installation-period-day":"manhĂŁ"}', 3: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityCpfRequest","main-installation-date":"22/09/2021","userid":"c46528a7-988d-465e-9f12-25426a8b1808@tunnel.msging.net","full-name":"Claudenice lĂ´bo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"Silvaclaudenice71@gmail.com","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhĂŁ","plan-value":null,"alternative-installation-period-day":"manhĂŁ"}', 4: '{"city":"Campinas","bot-origin":null,"campaign-source":null,"lastState":"productAvailabilityStart","main-installation-date":"22/09/2021","userid":"c46528a7-988d-465e-9f12-25426a8b1808@tunnel.msging.net","full-name":"Claudenice lĂ´bo da silva","alternative-installation-date":"23/09/2021","chosen-product":"Internet","bank":null,"postalcode":"13056015","due-date":"20","cpf":"30979696836","origin-link":null,"payment":"boleto","state":"SP","api-orders-hash-id":null,"email":"Silvaclaudenice71@gmail.com","plan-name":null,"userphone":"19 98715-0491","plan-offer":null,"completed-address":"13056015 - AV FERNANDO PAOLIERI, 182 - JARDIM PLANALTO DE VIRACOPOS, Campinas - SP","type-of-person":"CPF","type-of-product":"Residencial","main-installation-period-day":"manhĂŁ","plan-value":null,"alternative-installation-period-day":"manhĂŁ"}'}

I had a few issues, trying to load only portions (of the json object) efficiently and quickly. Ran into issues such as too much memory usage (when running pandas functions). And too slow processing.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

So I made the following code

import orjson

def dataset_extras(extras #Series being passed,*args # List of keys you want to unload):
    l = [] 
    for i in extras:
        l.append({arg : orjson.loads(i).get(arg) for arg in args})
    return pd.DataFrame.from_records(l)

dataset_extras(df.Extras,'city','campaign-source','api-orders-hash-id')
# Sample of Call

This time i managed to circumvent, a lot of the performance issues. But I was wondering if there was a even more efficient way of transforming portions of a series of json responses, into a pd.DataFrame(). Would appreciate some feedback on a way I could improve this code.

>Solution :

As commented, you can probably optimize things quite a bit by not parsing JSON over and over again for each arg:

def dataset_extras(
    json_strings,
    keys,
):
    records = []
    for json_string in json_strings:
        datum = orjson.loads(json_string)
        records.append({key: datum.get(key) for key in keys})
    return pd.DataFrame.from_records(records)


x = dataset_extras(df.Extras, ["city", "campaign-source", "api-orders-hash-id"])

Another approach might be to build the df from a dict-of-lists. You’ll have to measure if this is faster than from_records.

def dataset_extras(
    json_strings,
    keys,
):
    columns = {col: [] for col in keys}
    for json_string in json_strings:
        datum = orjson.loads(json_string)
        for key in keys:
            columns[key].append(datum.get(key))
    return pd.DataFrame(columns)
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading