I want to split a PDF based on a value that appears on every page. Every value should end up in its own PDF file. I currently have the following list, which pairs each value with its page:
l = [
{'abr': '123 ', 'page': 1},
{'abr': '125 ', 'page': 2},
{'abr': '125 ', 'page': 3},
{'abr': '140 ', 'page': 4},
{'abr': '142 ', 'page': 5},
]
I want to "merge" the dicts so that every "abr" is unique inside the list and every page belonging to that abr is collected in a list.
I thought of something like the following:
l = [
{'abr': '123 ', 'page': [1]},
{'abr': '125 ', 'page': [2, 3]},
{'abr': '140 ', 'page': [4]},
{'abr': '142 ', 'page': [5]},
]
That's because I need a for loop over every abr in which I can get every page, so I can do something like:
pdf = PdfFileReader(path)
for abr in l:
    pdf_writer = PdfFileWriter()
    for page in abr["page"]:
        pdf_writer.addPage(pdf.getPage(page))
    with open(output_filename, 'wb') as out:
        pdf_writer.write(out)
Is there a good / simple way to do this, does anyone have a better way to structure the data, or is there an easier way to split the PDF?
>Solution :
If you’re open to a new structure: since you use the abr as a key anyway, you can make it the key of a dict (or a defaultdict in this case).
from collections import defaultdict
from PyPDF2 import PdfFileReader, PdfFileWriter  # class names used by PyPDF2 before 3.0
l = [
{'abr': '123 ', 'page': 1},
{'abr': '125 ', 'page': 2},
{'abr': '125 ', 'page': 3},
{'abr': '140 ', 'page': 4},
{'abr': '142 ', 'page': 5},
]
abrs = defaultdict(list)
for d in l:
    abrs[d["abr"]].append(d["page"])
print(abrs) # contains {'123 ': [1], '125 ': [2, 3], '140 ': [4], '142 ': [5]}
pdf = PdfFileReader(path)
for abr, pages in abrs.items(): # get key and value from dict
    pdf_writer = PdfFileWriter()
    for page in pages: # iterate through pages
        pdf_writer.addPage(pdf.getPage(page)) # note: getPage() is zero-based
    with open(output_filename, 'wb') as out: # needs a different name per abr
        pdf_writer.write(out)
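Two things to watch in the loop above: getPage() counts pages from zero, and a single output_filename would be overwritten on every pass. Here is a minimal sketch of a helper that turns the grouped dict into per-file instructions. It assumes the page numbers in l start at 1 and that naming each file after the stripped abr is acceptable; plan_outputs, first_page and the "<abr>.pdf" naming are hypothetical, not something from the question.

```python
def plan_outputs(abrs, first_page=1):
    """Map each output filename to the zero-based page indices it should hold."""
    plan = {}
    for abr, pages in abrs.items():
        # Hypothetical naming scheme: strip the trailing space and add ".pdf".
        filename = f"{abr.strip()}.pdf"
        # Shift to zero-based indices (assumes stored pages start at first_page).
        plan[filename] = [page - first_page for page in pages]
    return plan


abrs = {'123 ': [1], '125 ': [2, 3], '140 ': [4], '142 ': [5]}
plan = plan_outputs(abrs)
print(plan)
# {'123.pdf': [0], '125.pdf': [1, 2], '140.pdf': [3], '142.pdf': [4]}
```

Each entry can then drive the writer loop: open the filename for writing and pass every index to pdf.getPage(). If your page numbers are already zero-based, call plan_outputs(abrs, first_page=0).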
PS. You might not need both abr and pages from abrs.items(); you can iterate through abrs.values() instead to get only the pages.
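If you would rather keep the exact list-of-dicts shape from the question, the same grouping can be converted back into it. A small sketch (merge_by_abr is a made-up name; it relies on plain dicts keeping insertion order, which holds on Python 3.7+):

```python
def merge_by_abr(records):
    """Collapse records sharing an 'abr' into one dict with a list of pages."""
    merged = {}
    for record in records:
        # setdefault creates the list the first time an abr is seen.
        merged.setdefault(record["abr"], []).append(record["page"])
    # Rebuild the list-of-dicts layout, in order of first appearance.
    return [{"abr": abr, "page": pages} for abr, pages in merged.items()]


l = [
    {'abr': '123 ', 'page': 1},
    {'abr': '125 ', 'page': 2},
    {'abr': '125 ', 'page': 3},
    {'abr': '140 ', 'page': 4},
    {'abr': '142 ', 'page': 5},
]
merged = merge_by_abr(l)
print(merged)
```

With this, the original nested loop over abr["page"] from the question works unchanged.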