I want to find the most similar list in a list of lists, as fast as possible.
The list I am searching for:
[1,2,3,4]
The list of lists:
[[1],[2],[1,2],[1,2,3,4,5,6],[1,2,3],[1,2,3,4,5]]
Most similar result:
[1,2,3]
I tried to do this with common Python operators, but it is too slow on my data. I have about 2 million lists to search through.
>Solution :
The following function returns the lists that are most similar in length:
def most_similar_acc_length(my_list, range_of_lists, length_range):
    """Most similar lists according to length.

    Parameters
    ----------
    my_list : The list of interest
    range_of_lists : List of lists in which we search for the most similar to 'my_list'
    length_range : Maximum allowed difference in length from 'my_list'

    Returns
    -------
    List of the lists whose length is within 'length_range' of len(my_list)
    """
    sim_lists = [x for x in range_of_lists
                 if len(my_list) - length_range <= len(x) <= len(my_list) + length_range]
    return sim_lists
If we try it on the lists you shared with length_range=1, we get:
range_of_lists=[[1],[2],[1,2],[1,2,3,4,5,6],[1,2,3],[1,2,3,4,5]]
my_list=[1,2,3,4]
sim_list=most_similar_acc_length(my_list, range_of_lists, 1)
Output
[[1, 2, 3], [1, 2, 3, 4, 5]]
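If the same collection is searched many times, scanning every list on each query may be the slow part. Here is a minimal sketch, assuming the lists can be grouped by length once up front. The names `build_length_index` and `candidates_by_length` are hypothetical and not part of the answer above.

```python
from collections import defaultdict

def build_length_index(range_of_lists):
    """Group the lists by their length (hypothetical helper, built once)."""
    index = defaultdict(list)
    for x in range_of_lists:
        index[len(x)].append(x)
    return index

def candidates_by_length(my_list, index, length_range):
    """Return the lists whose length is within length_range of len(my_list)."""
    n = len(my_list)
    result = []
    # Only the buckets for the relevant lengths are visited.
    for length in range(max(0, n - length_range), n + length_range + 1):
        result.extend(index.get(length, []))
    return result

range_of_lists = [[1], [2], [1, 2], [1, 2, 3, 4, 5, 6], [1, 2, 3], [1, 2, 3, 4, 5]]
index = build_length_index(range_of_lists)
print(candidates_by_length([1, 2, 3, 4], index, 1))
# [[1, 2, 3], [1, 2, 3, 4, 5]]
```

Note that the candidates come back ordered by length rather than in their original order, which can change which list wins a tie in a later scoring step.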
Second step
Once we have the lists of similar length, a second function picks the one that matches my_list in the most positions:
def most_similar_list(my_list, range_of_lists, length_range):
    # We start with a first selection of similar lists in terms of length
    sim_list = most_similar_acc_length(my_list, range_of_lists, length_range)
    # Nothing to compare if no list passes the length filter
    if not sim_list:
        return None
    # For each candidate, count the positions where its value equals
    # the value of my_list at the same position
    max_list = [sum(a == b for a, b in zip(x, my_list)) for x in sim_list]
    # Index of the first candidate with the highest number of matches
    ind_max = max_list.index(max(max_list))
    return sim_list[ind_max]
Let’s try this function:
range_of_lists=[[1],[2],[1,2],[1,2,3,4,5,6],[1,2,3],[1,2,3,4,5]]
my_list=[1,2,3,4]
similar_list=most_similar_list(my_list, range_of_lists, 1)
similar_list
Output
[1, 2, 3, 4, 5]
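To see why [1, 2, 3, 4, 5] wins here, it can help to look at the match count of each candidate. The sketch below also shows a one-pass variant that filters and scores lazily instead of building intermediate lists. `match_count` and `most_similar_single_pass` are hypothetical names used for illustration, and this has not been benchmarked on 2 million lists.

```python
def match_count(x, my_list):
    """Number of positions where x and my_list hold the same value."""
    return sum(a == b for a, b in zip(x, my_list))

def most_similar_single_pass(my_list, range_of_lists, length_range):
    """Hypothetical one-pass variant: filter by length and score lazily."""
    n = len(my_list)
    candidates = (x for x in range_of_lists if abs(len(x) - n) <= length_range)
    # max() returns the first candidate with the highest score;
    # default=None covers the case where no list passes the length filter.
    return max(candidates, key=lambda x: match_count(x, my_list), default=None)

range_of_lists = [[1], [2], [1, 2], [1, 2, 3, 4, 5, 6], [1, 2, 3], [1, 2, 3, 4, 5]]
my_list = [1, 2, 3, 4]
print(match_count([1, 2, 3], my_list))        # 3
print(match_count([1, 2, 3, 4, 5], my_list))  # 4
print(most_similar_single_pass(my_list, range_of_lists, 1))  # [1, 2, 3, 4, 5]
```

With this scoring, a longer list that contains the whole search list as a prefix scores higher than a shorter one, which is why [1, 2, 3, 4, 5] is returned rather than [1, 2, 3].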