Associate a list of values with each set element

March 9, 2022

I’m trying to come up with the best solution for the following problem:

I have a list of filenames, and associated with each filename is an ID; these IDs are non-unique, meaning that several filenames might be associated with one ID.

So I could pack my data up as: (ID, [filename1, filename2,…])

The problem is that I would like to work with the IDs as a set, since I need to group them and compute differences and intersections against another, predefined grouping of these IDs. The operations need to be reasonably fast, since I have about a million IDs.

But I don't know of a way to keep each ID associated with its list of filenames while treating the ID as an element of a set. Is this possible with sets, or is there a set extension that enables it?

Solution:

It sounds like your data looks something like the sample data below. If so, the code shows how to use a hash table to do what you’re asking. The hash table can be either a Python dict (keyed on ID, with a list of filenames as the associated value) or simply a set of IDs if that’s what you really want (though, as others have suggested in the comments, a dict is probably the best solution).

from collections import defaultdict

files = [
    {'filename':'foo101', 'id':1},
    {'filename':'foo102', 'id':1},
    {'filename':'foo103', 'id':1},
    {'filename':'foo201', 'id':2},
    {'filename':'foo202', 'id':2},
    {'filename':'foo301', 'id':3},
    {'filename':'foo401', 'id':4},
]
# Group filenames by ID; defaultdict(list) creates the empty list on first access.
fileDict = defaultdict(list)
for d in files:
    fileDict[d['id']].append(d['filename'])
for fileId, fileNames in fileDict.items():
    print(fileId, fileNames)
# Iterating a dict yields its keys, so this builds the set of IDs.
idSet = set(fileDict)
print(idSet)

Sample output:

1 ['foo101', 'foo102', 'foo103']
2 ['foo201', 'foo202']
3 ['foo301']
4 ['foo401']
{1, 2, 3, 4}

The above code uses a defaultdict(list) for convenience, but you could also use a regular dict as follows:

files = [
    {'filename':'foo101', 'id':1},
    {'filename':'foo102', 'id':1},
    {'filename':'foo103', 'id':1},
    {'filename':'foo201', 'id':2},
    {'filename':'foo202', 'id':2},
    {'filename':'foo301', 'id':3},
    {'filename':'foo401', 'id':4},
]
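fileDict = {}
for d in files:
    # Create the list the first time an ID is seen.
    if d['id'] not in fileDict:
        fileDict[d['id']] = []
    fileDict[d['id']].append(d['filename'])
for fileId, fileNames in fileDict.items():
    print(fileId, fileNames)
# Iterating a dict yields its keys, so this builds the set of IDs.
idSet = set(fileDict)
print(idSet)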
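
Since the question asks about intersections and differences against another grouping of IDs, here is a minimal sketch of how that might look with the dict approach. It relies on the fact that dict.keys() returns a set-like view, so the set operators can be applied to it without building a separate set first. The otherIds grouping is made up for illustration, and the sample data is the same as above:

```python
from collections import defaultdict

# Same sample data as in the answer above.
files = [
    {'filename': 'foo101', 'id': 1},
    {'filename': 'foo102', 'id': 1},
    {'filename': 'foo103', 'id': 1},
    {'filename': 'foo201', 'id': 2},
    {'filename': 'foo202', 'id': 2},
    {'filename': 'foo301', 'id': 3},
    {'filename': 'foo401', 'id': 4},
]
fileDict = defaultdict(list)
for d in files:
    fileDict[d['id']].append(d['filename'])

# Hypothetical predefined grouping of IDs (an assumption, not from the question).
otherIds = {2, 3, 5}

# The keys view supports set operators directly; each result is a plain set.
commonIds = fileDict.keys() & otherIds        # IDs present in both
onlyInFiles = fileDict.keys() - otherIds      # IDs that only have files
onlyInOther = otherIds.difference(fileDict)   # IDs with no files at all

# The filenames stay reachable through the dict for any resulting ID.
commonFiles = {fileId: fileDict[fileId] for fileId in commonIds}
```

With this sample data, commonIds is {2, 3}, onlyInFiles is {1, 4}, and onlyInOther is {5}. Each operation is a hash lookup per element, so it should scale to around a million IDs, though I have not timed it on data of that size.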