Python: Iterate through Big Array with BigInts, find first duplicate and printout the indexes of the duplicate Values

For a cryptography course at my university, I need to compare a lot of SHA hashes which are saved in an array. I need to compare the values at the array indexes.

There are duplicates in the array – I already checked that by comparing the length of a set of the array with the length of the array itself.

Now I need the indexes of the duplicated values. I found a lot of solutions for checking for duplicates, but only for short arrays. My array has a length of 3 million, and the values at each index are around this size: 864205495604807476120572616017955259175325408501.


I have written a nested loop (coming from Java and trying to learn Python). Here is my code:

counter_outer = 0
while counter_outer < len(hash_value_array):
    counter_inner = counter_outer + 1
    while counter_inner < len(hash_value_array):
        if hash_value_array[counter_outer] == hash_value_array[counter_inner]:
            print(f"*****FOUND MATCH *****")
            print(f"Message [{counter_outer}] Hashvalue has same Value as Message [{counter_inner}]")
            safe_index1 = counter_outer
            safe_index2 = counter_inner
            counter_outer = len(hash_value_array)
            break
        else:
            print("------NO Match-----")
        counter_inner += 1
    counter_outer += 1

As you can imagine … this takes ages.

Important for me is that I need the indexes where the duplicates are – not the values. So for example, if there is an 898 at index 100 and an 898 at index 1000001, I only need as output: 100, 1000001.

Any suggestions?

>Solution :

You can do something along these lines in Python:

Assume this list of 5 signatures (they could be ints or strings, but I have used strings):

li = ['864205495604807476120572616017955259175325408501',
      '864205495604807476120572616017955259175325408502',
      '864205495604807476120572616017955259175325408503',
      '864205495604807476120572616017955259175325408501',
      '864205495604807476120572616017955259175325408502']

You can make a dict of lists, with each list holding the indexes of one value's occurrences:

idx={}
for i, sig in enumerate(li):
    idx.setdefault(sig, []).append(i)

You can then find the duplicates like so:

for sig, v in idx.items():
    if len(v) > 1:
        print(f'{sig}: {v}')

Prints:

864205495604807476120572616017955259175325408501: [0, 3]
864205495604807476120572616017955259175325408502: [1, 4]

If you make li 3,000,000 entries, that runs in about 550 milliseconds on my computer, and it would likely be similar on yours.
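Since the original question only needs the first duplicate, a single pass with a dict can also stop as soon as a hash repeats, rather than collecting all duplicates first. A minimal sketch (`first_duplicate` is a hypothetical helper name, not from the answer above):

```python
def first_duplicate(hashes):
    """Return (i, j), the indexes of the first repeated value, or None."""
    seen = {}  # hash value -> index of its first occurrence
    for j, sig in enumerate(hashes):
        if sig in seen:
            return seen[sig], j  # earlier index first, as in the question
        seen[sig] = j
    return None

li = ['864205495604807476120572616017955259175325408501',
      '864205495604807476120572616017955259175325408502',
      '864205495604807476120572616017955259175325408503',
      '864205495604807476120572616017955259175325408501',
      '864205495604807476120572616017955259175325408502']

print(first_duplicate(li))  # (0, 3)
```

This still does only one pass over the list, so its worst case is the same as building the full dict, but on average it finishes earlier.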


But to be honest, I don't understand why this is so much faster.

Yours is super slow because it has O(n²) complexity from the nested while loops: for each element, you scan the rest of the array. The method I showed here loops over the entire list only once, not 3 million times!
