Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Dictionary of dictionaries – Removing duplications in only certain keys/values

I have a dictionary of dictionaries, a sample is below:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

I want to remove duplicates in my_dictonary if the value in "Name" AND "Age" is the same. It does not matter which one is removed (there could be many that are the same, I only want one version to remain though).

So in our example above, the output would be:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

{'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
 '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
 '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}

As Nick, 39 was duplicated despite having a different country.

Is there an easy/efficient way of doing this? I have several million rows.

>Solution :

Track seen records, for example:

my_dictionary = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

seen = set()
result = {}
for k, v in my_dictionary.items():
    if (v['Name'], v['Age']) not in seen:
        result[k] = v
        seen.add((v['Name'], v['Age']))

print(result)

Output:

{
    '0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'}, 
    '1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'}, 
    '2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}
}

Edit note: Using set() (which uses a hash-table) for tracking leads to the overall complexity of O(n) for n rows.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading