I have a dictionary of dictionaries; a sample is below:
my_dictionary = {
"0": {"Name": "Nick", "Age": 39, "Country": "UK"},
"1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
"2": {"Name": "Dave", "Age": 23, "Country": "UK"},
"3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
"4": {"Name": "Nick", "Age": 39, "Country": "France"},
}
I want to remove duplicates from my_dictionary when both the "Name" AND "Age" values are the same. It does not matter which entry is removed (there could be many entries with the same pair), as long as exactly one of them remains.
So in our example above, the output would be:
{'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
'2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}}
The entries for Nick, 39 count as duplicates even though their countries differ.
Is there an easy/efficient way of doing this? I have several million rows.
>Solution :
Track the (Name, Age) pairs you have already seen and keep only the first record for each pair, for example:
my_dictionary = {
"0": {"Name": "Nick", "Age": 39, "Country": "UK"},
"1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
"2": {"Name": "Dave", "Age": 23, "Country": "UK"},
"3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
"4": {"Name": "Nick", "Age": 39, "Country": "France"},
}
seen = set()  # (Name, Age) pairs already encountered
result = {}
for k, v in my_dictionary.items():
    key = (v['Name'], v['Age'])
    if key not in seen:
        # First record with this Name/Age pair: keep it
        result[k] = v
        seen.add(key)
print(result)
Output:
{
'0': {'Name': 'Nick', 'Age': 39, 'Country': 'UK'},
'1': {'Name': 'Steve', 'Age': 19, 'Country': 'Spain'},
'2': {'Name': 'Dave', 'Age': 23, 'Country': 'UK'}
}
Edit note: set() is backed by a hash table, so each membership test and insertion takes O(1) time on average. The whole pass is therefore O(n) for n rows.
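If you need this in more than one place, the same seen-set idea can be wrapped in a small helper. This is only a sketch: the function name dedupe_by_fields and its fields parameter are made up for illustration, and it assumes every inner dictionary contains all of the listed fields with hashable values.

```python
def dedupe_by_fields(records, fields=("Name", "Age")):
    """Return a new dict keeping the first record seen for each combination of `fields`.

    Hypothetical helper: assumes every record has all `fields` and that
    their values are hashable.
    """
    seen = set()
    result = {}
    for key, record in records.items():
        # Build the tuple that identifies a duplicate, e.g. ("Nick", 39)
        signature = tuple(record[f] for f in fields)
        if signature not in seen:
            seen.add(signature)
            result[key] = record
    return result


sample = {
    "0": {"Name": "Nick", "Age": 39, "Country": "UK"},
    "1": {"Name": "Steve", "Age": 19, "Country": "Spain"},
    "2": {"Name": "Dave", "Age": 23, "Country": "UK"},
    "3": {"Name": "Nick", "Age": 39, "Country": "Hong Kong"},
    "4": {"Name": "Nick", "Age": 39, "Country": "France"},
}

deduped = dedupe_by_fields(sample)
print(list(deduped))  # keys of the surviving records
```

Note that the result holds references to the original inner dictionaries rather than copies, so the extra memory is mostly the set of tuples plus the new outer dictionary.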