Python: What is a faster way to check if items in a large list are not in another large list?

Advertisements

I have two lists:

  • old_data has a length of 180,000.
  • updated_data has a length of 184,000.
  • Both lists contain IDs which are string values.
  • There is a lot of overlap in IDs stored within the lists but I need to know which ones are no longer stored in updated_data so that I can say those IDs are not longer active in an SQL server.

Therefore, I need to check for any items in old_data that are not in updated_data and save them to a separate list, let’s call it inactive_data.

I have the following code currently which is highly time-consuming and inefficient:

# Initialize list for no-longer active IDs to be saved into.
inactive_data = []

# Iteratively check if all IDs in old data are still in updated data.
# If they are not, add them to the list.
for Id in old_data:
    
    if Id not in updated_data:
        
        inactive_data.append(Id)

To run this code, it took approx. 45 minutes. I would to know how I could drastically reduce the time. If you have any suggestions please let me know!

>Solution :

The usual and easy way is to turn one or both of the list into a set.

An alternate way is to sort both lists and compare them element by element. Once they’re ordered, the comparison is quick because it’s O(n).

Leave a Reply Cancel reply