I’d like to merge and deduplicate several lists in Python efficiently. There are 10 lists/pd.Series, each with approx. 200k string elements (they DO NOT have the same length). Each of them is already sorted, and the lists have ~90% overlapping elements. How can I EFFICIENTLY MERGE and DEDUPLICATE them?
I can use the following code to merge and sort, but is there a more efficient way? Performance is important here (I’d prefer not to use Cython due to the overhead).
targetList = list()
for l in lists:
    targetList += l
targetList = list(set(targetList))
targetList.sort()  # sort() works in place and returns None, so don't reassign the result
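As a baseline, the same thing can be written more compactly (and the final sort only touches the unique elements, which matters given the ~90% overlap):

from itertools import chain

# Set union across all lists, then one sort over the M unique values,
# so the cost is O(N) for the union plus O(M log M) for the sort
targetList = sorted(set(chain.from_iterable(lists)))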
I also know that I could first linearly merge all the lists (order preserving) and then linearly deduplicate with a hash set (actually, the two steps can be combined; see the sketch below).
However, there is no built-in function for such a list merge, and I worry that my own code, despite its linear complexity, might carry enough constant overhead to end up slower than an O(N log N) approach built from simple built-in functions. I’m using Python 3.9.
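Roughly, the hand-rolled version I have in mind looks like this (just an illustrative sketch; since the merged stream comes out sorted, skipping equal heads can replace the hash set):

def kway_merge_dedup(lists):
    # Naive k-way merge of sorted lists with on-the-fly dedup: repeatedly
    # take the smallest current head, then advance every list past that
    # value so no duplicate ever reaches the output.
    # Roughly O(U * k + N) for U unique values across k lists.
    idx = [0] * len(lists)
    out = []
    while True:
        best = None
        for i, lst in enumerate(lists):
            if idx[i] < len(lst) and (best is None or lst[idx[i]] < best):
                best = lst[idx[i]]
        if best is None:  # every list is exhausted
            break
        out.append(best)
        for i, lst in enumerate(lists):
            while idx[i] < len(lst) and lst[idx[i]] == best:
                idx[i] += 1
    return out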
I’m aware of this post on deduplication, but my question has specific features (pre-sorted inputs, heavy overlap) that I think leave room for optimization.
>Solution:
You can achieve efficient merging and deduplication by using heapq.merge to merge the sorted iterables and then applying a linear deduplication step.
Here is an example:
import heapq

# Assuming 'lists' is a list of sorted lists/pd.Series
merged = list(heapq.merge(*lists))

# Linear deduplication: the merged sequence is sorted, so duplicates are
# adjacent; merged[:1] (instead of [merged[0]]) also handles an empty input
deduplicated = merged[:1] + [value for prev, value in zip(merged, merged[1:]) if prev != value]
heapq.merge returns a lazy iterator, so the merge itself never builds intermediate per-pair lists, which keeps memory overhead low for large datasets. The overall time complexity is O(N * log(k)), where N is the total number of elements across all lists and k is the number of input lists (here k = 10).
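Since the merged stream is sorted, you can also deduplicate lazily with itertools.groupby and skip the intermediate merged list entirely (a small variant on the above):

import heapq
from itertools import groupby

# groupby collapses runs of equal values; on a sorted stream that is
# exactly deduplication, and nothing is materialized until the final list
deduplicated = [key for key, _ in groupby(heapq.merge(*lists))]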