Automate fractal-like nested JSON normalization

The problem:

I have 100+ JSON files with a fractal-like structure of lists of dicts. The width and the height of the data structure vary a lot from one JSON to another. Each label is a part of a sentence.

test = [
    {
        "label": "I",
        "children": [
            {
                "label": "want",
                "children": [
                    {
                        "label": "a",
                        "children": [
                            {"label": "coffee"},
                            {"label": "big", "children": [{"label": "piece of cake"}]},
                        ],
                    }
                ],
            },
            {"label": "need", "children": [{"label": "time"}]},
            {"label": "like",
                "children": [{"label": "italian", "children": [{"label": "pizza"}]}],
            },
        ],
    },
    {
        "label": "We",
        "children": [
            {"label": "are", "children": [{"label": "ok"}]},
            {"label": "will", "children": [{"label": "rock you"}]},
        ],
    },
]

I want to automate the normalization of these JSON files to obtain this kind of output:

sentences = [
'I want a coffee', 
'I want a big piece of cake', 
'I need time', 
'I like italian pizza', 
'We are ok',
'We will rock you',
] 

It’s really like the os.walk function that returns each "path".


What I tried:

  • pandas.json_normalize, but it needs predefined meta and record_path arguments to work with complex hierarchies;

  • jsonpath_ng with parse('[*]..label'), but I couldn't find a way to make this work;

  • a flatten function like the one in this post, which produces:

{'0label': 'I',
 '0children_0label': 'want',
 '0children_0children_0label': 'a',
 '0children_0children_0children_0label': 'coffee',
 '0children_0children_0children_1label': 'big',
 '0children_0children_0children_1children_0label': 'piece of cake',
 '0children_1label': 'need',
 '0children_1children_0label': 'time',
 '0children_2label': 'like',
 '0children_2children_0label': 'italian',
 '0children_2children_0children_0label': 'pizza',
 '1label': 'We',
 '1children_0label': 'are',
 '1children_0children_0label': 'ok',
 '1children_1label': 'will',
 '1children_1children_0label': 'rock you'}

I tried to split the keys to identify the hierarchy, but I have an indexing problem. For example, I don't understand why some keys like '1children_0label' contain '0label' and not the '1label' index that should refer to {'1label': 'We'}.

  • while loops that output a list of 'levels', each one a list of tuples containing the number of children (the nodes at level n+1) and the label. It was meant to be the first step towards recreating the final output, but I couldn't work this out either.
levels = []
stack = list(test)  # nodes at the current depth
while stack:
    idx = []        # (number of children, label) for each node at this depth
    children = []   # every node of the next depth, in order
    for d in stack:
        n = len(d.get('children', []))
        idx.append((n, d['label']))
        children.extend(d.get('children', []))

    stack = children
    levels.append(idx)

print(levels)

Output :

[[(3, 'I'), (2, 'We')], [(1, 'want'), (1, 'need'), (1, 'like'), (1, 'are'), (1, 'will')], [(2, 'a'), (0, 'time'), (1, 'italian'), (0, 'ok'), (0, 'rock you')], [(0, 'coffee'), (1, 'big'), (0, 'pizza')], [(0, 'piece of cake')]]
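
On the indexing question about the flattened keys shown earlier: judging only from that dump, each key seems to be a chain of list indexes, one per nesting level, so '1children_0label' would read as test[1]['children'][0]['label']. Under that reading, the '0' before 'label' is the position inside the children list of 'We', not a root index. The sketch below assumes exactly the key format visible in the dump; `index_path` and `sentences_from_flat` are hypothetical helper names, and `flat` is a subset of the keys above.

```python
import re

# A subset of the flattened keys from the dump above.
flat = {
    "0label": "I",
    "0children_1label": "need",
    "0children_1children_0label": "time",
    "1label": "We",
    "1children_0label": "are",
    "1children_0children_0label": "ok",
    "1children_1label": "will",
    "1children_1children_0label": "rock you",
}


def index_path(key):
    """'1children_0label' -> (1, 0): one list index per nesting level."""
    # Assumption: every index is directly followed by 'children_' or 'label'.
    return tuple(int(i) for i in re.findall(r"(\d+)(?:children_|label)", key))


def sentences_from_flat(flat):
    """Rebuild one sentence per leaf from the flattened dict."""
    labels = {index_path(k): v for k, v in flat.items()}
    # A path is an inner node if it is the parent of some other path.
    parents = {path[:-1] for path in labels}
    result = []
    for path in sorted(labels):
        if path not in parents:
            # Join the labels of every prefix of the leaf's path.
            words = [labels[path[:i]] for i in range(1, len(path) + 1)]
            result.append(" ".join(words))
    return result


print(index_path("1children_0label"))
print(sentences_from_flat(flat))
```

The parent of '1children_0label' is therefore the key with path (1,), which is '1label'.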

Please help.

Solution:

You can try recursion:

def get_sentences(o):
    if isinstance(o, dict):
        if "children" in o:
            for item in get_sentences(o["children"]):
                yield o["label"] + " " + item
        else:
            yield o["label"]
    elif isinstance(o, list):
        for v in o:
            yield from get_sentences(v)


print(list(get_sentences(test)))

Prints:

[
    "I want a coffee",
    "I want a big piece of cake",
    "I need time",
    "I like italian pizza",
    "We are ok",
    "We will rock you",
]
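
If some of the files are nested deeply enough to hit Python's recursion limit, the same walk can be done with an explicit stack. This is only a sketch of an alternative, not part of the answer above: `iter_sentences`, its `sep` parameter, and `sample` are illustrative names, and `sample` is a shortened version of the data in the question.

```python
def iter_sentences(nodes, sep=" "):
    """Yield one joined sentence per root-to-leaf path, without recursion."""
    # Each stack entry is (node, labels collected on the way down to it).
    # Nodes are pushed in reverse so they are popped in document order.
    stack = [(node, ()) for node in reversed(nodes)]
    while stack:
        node, prefix = stack.pop()
        path = prefix + (node["label"],)
        children = node.get("children")
        if children:
            stack.extend((child, path) for child in reversed(children))
        else:
            # No "children" key (or an empty list): this node ends a sentence.
            yield sep.join(path)


sample = [
    {
        "label": "I",
        "children": [
            {"label": "need", "children": [{"label": "time"}]},
            {
                "label": "like",
                "children": [{"label": "italian", "children": [{"label": "pizza"}]}],
            },
        ],
    },
    {"label": "We", "children": [{"label": "are", "children": [{"label": "ok"}]}]},
]

print(list(iter_sentences(sample)))
# ['I need time', 'I like italian pizza', 'We are ok']
```

One difference in behavior: a node whose "children" list is empty is treated as a leaf here, whereas the recursive version yields nothing for it.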