I’d like to merge and deduplicate several lists in Python efficiently. There are 10 lists/pd.Series, each with approx. 200k string elements (they DO NOT have the same length). Each of them is already sorted, and the lists have ~90% overlapping elements. How can I EFFICIENTLY MERGE and DEDUPLICATE them?
I can use the following code to merge and sort, but is there a more efficient way? Performance is important here (I’d prefer not to use Cython due to the overhead).
targetList = list()
for l in lists:
    targetList += l
targetList = list(set(targetList))
targetList.sort()  # sort() works in place and returns None, so don't reassign the result
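As a baseline, the same thing can be written more compactly (and the final sort only touches the unique elements, which matters given the ~90% overlap):

from itertools import chain

# Set union across all lists, then one sort over the M unique values,
# so the cost is O(N) for the union plus O(M log M) for the sort
targetList = sorted(set(chain.from_iterable(lists)))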
I also know that I could first linearly merge all the lists (order preserving) and then linearly deduplicate with a hash set (actually, the two steps can be combined; see the sketch below).
However, there is no built-in function for such a list merge, and I worry that my own code, despite its linear complexity, might carry enough constant overhead to end up slower than an O(N log N) approach built from simple built-in functions. I’m using Python 3.9.
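Roughly, the hand-rolled version I have in mind looks like this (just an illustrative sketch; since the merged stream comes out sorted, skipping equal heads can replace the hash set):

def kway_merge_dedup(lists):
    # Naive k-way merge of sorted lists with on-the-fly dedup: repeatedly
    # take the smallest current head, then advance every list past that
    # value so no duplicate ever reaches the output.
    # Roughly O(U * k + N) for U unique values across k lists.
    idx = [0] * len(lists)
    out = []
    while True:
        best = None
        for i, lst in enumerate(lists):
            if idx[i] < len(lst) and (best is None or lst[idx[i]] < best):
                best = lst[idx[i]]
        if best is None:  # every list is exhausted
            break
        out.append(best)
        for i, lst in enumerate(lists):
            while idx[i] < len(lst) and lst[idx[i]] == best:
                idx[i] += 1
    return out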
I’m aware of this post on deduplication, but my question has specific features (pre-sorted inputs, heavy overlap) that I think leave room for optimization.
>Solution:
You can achieve efficient merging and deduplication by using heapq.merge to merge the sorted iterables and then applying a linear deduplication step.
Here is an example:
import heapq

# Assuming 'lists' is a list of sorted lists/pd.Series
merged = list(heapq.merge(*lists))

# Linear deduplication: the merged sequence is sorted, so duplicates are
# adjacent; merged[:1] (instead of [merged[0]]) also handles an empty input
deduplicated = merged[:1] + [value for prev, value in zip(merged, merged[1:]) if prev != value]
heapq.merge returns a lazy iterator, so the merge itself never builds intermediate per-pair lists, which keeps memory overhead low for large datasets. The overall time complexity is O(N * log(k)), where N is the total number of elements across all lists and k is the number of input lists (here k = 10).
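Since the merged stream is sorted, you can also deduplicate lazily with itertools.groupby and skip the intermediate merged list entirely (a small variant on the above):

import heapq
from itertools import groupby

# groupby collapses runs of equal values; on a sorted stream that is
# exactly deduplication, and nothing is materialized until the final list
deduplicated = [key for key, _ in groupby(heapq.merge(*lists))]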