How to merge and deduplicate lists in Python efficiently?

I’d like to merge and deduplicate several lists in Python efficiently. There are 10 lists/pd.Series, each with approx. 200k string elements (they do NOT all have the same length). Each list is already sorted, and the lists have ~90% overlapping elements. How can I EFFICIENTLY merge and deduplicate them?

I can use the following code to merge and sort, but is there a more efficient way? Performance is important here (I'd prefer not to use Cython due to the overhead).

targetList = list()
for l in lists:
    targetList += l
# note: list.sort() returns None, so assigning its result loses the list;
# sorted() returns a new sorted list instead
targetList = sorted(set(targetList))

I also know that I can linearly merge all the lists (preserving order) and then linearly deduplicate with a hashset (and the two steps can actually be combined).
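For reference, the combined linear merge-and-deduplicate step described above could be sketched like this for two sorted lists, then folded over all ten. The helper name `merge_dedup` is hypothetical, not a standard function:

```python
from functools import reduce

def merge_dedup(a, b):
    """Linearly merge two sorted lists, skipping duplicates (hypothetical helper)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            v = a[i]; i += 1
        elif b[j] < a[i]:
            v = b[j]; j += 1
        else:  # equal: take one copy, advance both
            v = a[i]; i += 1; j += 1
        if not out or out[-1] != v:
            out.append(v)
    # append whichever tail remains, still skipping duplicates
    for rest, k in ((a, i), (b, j)):
        for v in rest[k:]:
            if not out or out[-1] != v:
                out.append(v)
    return out

# fold over all lists:
# merged = reduce(merge_dedup, lists)
```

Each pairwise pass is linear, though folding pairwise over k lists re-scans early elements; a k-way merge avoids that.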


However, there is no built-in function for such a list merge, and I fear my own linear-time code might add enough overhead to end up slower than an O(N log N) approach built from simple built-in functions. I'm using Python 3.9.

I'm aware of this post on deduplication, but my question has specific features (sorted inputs, heavy overlap) that I think leave room for optimization.

Solution:

You can achieve efficient merging and deduplication by using heapq.merge to merge the sorted iterables and then applying a linear deduplication step.
Here is an example:

import heapq

# Assuming 'lists' is a list of sorted lists/pd.Series
merged = list(heapq.merge(*lists))

# Linear deduplication: keep each element only when it differs from its predecessor
# (guarded so an empty input doesn't raise IndexError)
deduplicated = [merged[0]] + [v for prev, v in zip(merged, merged[1:]) if prev != v] if merged else []

heapq.merge is lazy and avoids building intermediate lists, which keeps memory usage low for large datasets. The overall time complexity is O(N log k), where N is the total number of elements across all lists and k is the number of input lists.
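As a variant, the deduplication can be folded into the merge in a single lazy pass with itertools.groupby, which groups consecutive equal elements of the sorted stream; the function name `merge_unique` here is just an illustrative wrapper:

```python
import heapq
from itertools import groupby

def merge_unique(lists):
    """Merge sorted iterables and drop duplicates in one lazy pass.

    groupby collapses runs of equal elements, which is exactly
    deduplication when the input stream is sorted.
    """
    return [key for key, _ in groupby(heapq.merge(*lists))]
```

This avoids materializing the intermediate merged list entirely, which may help given the ~90% overlap, since the output is much smaller than the N merged elements.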
