Split PDF based on a value and merge dictionaries inside a list in Python

August 30, 2022

I want to split a PDF based on a value that appears on every page, so that each value ends up in its own PDF file. I currently have the following list, which pairs each value with the page it appears on:

l = [
    {'abr': '123 ', 'page': 1},
    {'abr': '125 ', 'page': 2},
    {'abr': '125 ', 'page': 3},
    {'abr': '140 ', 'page': 4},
    {'abr': '142 ', 'page': 5},
]

I want to "merge" the dicts so that every "abr" is unique inside the list and all pages of that specific abr are collected in a list.

I thought of something like the following:

l = [
    {'abr': '123 ', 'page': [1]},
    {'abr': '125 ', 'page': [2, 3]},
    {'abr': '140 ', 'page': [4]},
    {'abr': '142 ', 'page': [5]},
]

That's because I need a for loop over every abr in which I can get all of its pages, so I can do something like:

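from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy PyPDF2 (< 3.0) names

pdf = PdfFileReader(path)
for abr in l:
    pdf_writer = PdfFileWriter()
    for page in abr["page"]:
        # getPage() is zero-based; this assumes the page numbers in l are 1-based
        pdf_writer.addPage(pdf.getPage(page - 1))

    # assumed naming scheme: one output file per abr
    output_filename = f"{abr['abr'].strip()}.pdf"
    with open(output_filename, 'wb') as out:
        pdf_writer.write(out)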

Is there a good, simple way to do this? Or does anyone have a better way to structure the data, or an easier way to do the split?

>Solution :

If you're open to a new structure: since you look things up by abr anyway, you can make the abr the key of a dict (a defaultdict in this case) whose values are the lists of pages.

from collections import defaultdict

l = [
    {'abr': '123 ', 'page': 1},
    {'abr': '125 ', 'page': 2},
    {'abr': '125 ', 'page': 3},
    {'abr': '140 ', 'page': 4},
    {'abr': '142 ', 'page': 5},
]

abrs = defaultdict(list)

for d in l:
    abrs[d["abr"]].append(d["page"])

print(abrs)  # contains {'123 ': [1], '125 ': [2, 3], '140 ': [4], '142 ': [5]}

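from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy PyPDF2 (< 3.0) names

pdf = PdfFileReader(path)
for abr, pages in abrs.items():  # get key and value from dict
    pdf_writer = PdfFileWriter()
    for page in pages:  # iterate through pages
        # getPage() is zero-based; this assumes the page numbers in l are 1-based
        pdf_writer.addPage(pdf.getPage(page - 1))

    # assumed naming scheme: one file per abr, so outputs don't overwrite each other
    output_filename = f"{abr.strip()}.pdf"
    with open(output_filename, 'wb') as out:
        pdf_writer.write(out)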

PS. If you don't need abr inside the loop (for example, to name the output file), you can iterate through abrs.values() instead to get only the pages.
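
If the list-of-dicts layout from the question is still needed, it can probably be rebuilt from the grouped mapping. The following is only a minimal sketch on the same sample data: the names grouped, merged and filenames are made up for illustration, the "<abr>.pdf" naming pattern is an assumption, and a plain dict with setdefault stands in for defaultdict.

```python
l = [
    {'abr': '123 ', 'page': 1},
    {'abr': '125 ', 'page': 2},
    {'abr': '125 ', 'page': 3},
    {'abr': '140 ', 'page': 4},
    {'abr': '142 ', 'page': 5},
]

# Group pages by abr; a regular dict keeps insertion order (Python 3.7+).
grouped = {}
for d in l:
    grouped.setdefault(d['abr'], []).append(d['page'])

# Rebuild the list-of-dicts layout proposed in the question.
merged = [{'abr': abr, 'page': pages} for abr, pages in grouped.items()]

# Hypothetical output names: strip the trailing space from each abr.
filenames = [f"{abr.strip()}.pdf" for abr in grouped]

print(merged)
print(filenames)
```

Stripping the abr matters mainly for the filenames, since the sample values carry a trailing space; the dict keys themselves are left untouched so they still match the original data.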