I want to compute the difference between every element in a Python list of equally long lists and put it into a Numpy array.
When I say difference between two lists I mean the number of differences between corresponding elements between the two lists.
Here’s an example difference function using a list comprehension:
def list_difference(list_a, list_b):
len_lists = len(list_a)
assert len_lists == len(list_b), "Lists must be the same length."
return sum([list_a[i] != list_b[i] for i in range(len_lists)])
Then I call the difference function on every pair in my list of lists and put it into a numpy array.
You could call it the outer difference of the list of lists, just like the outer product.
I do it in naive for loops:
import numpy as np
import time
sequences = [
["A", "A", "A", "B", "C"],
["B", "A", "B", "A", "B"],
["B", "A", "C", "C", "B"],
["B", "A", "C", "C", "C"],
]
start = time.time()
n_seq = len(sequences)
dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
for col in range(n_seq):
if row >= col:
continue
dists[row, col] = list_difference(sequences[row], sequences[col])
dists += dists.T
print(dists)
print(f"Time: {time.time() - start} seconds")
The result is
[[0. 4. 4. 3.]
[4. 0. 2. 3.]
[4. 2. 0. 1.]
[3. 3. 1. 0.]]
Time: 0.0003669261932373047 seconds
This example is quick enough on my computer (best time of three above).
However, running this on 64 lists (sequences) each with length 1008 it takes 293.572190 seconds, which is a while.
Is there a faster way to do this?
Attempts:
1 I tried putting the inner for loop in a list comprehension:
dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
dist_row = [list_difference(sequences[row], sequences[col]) for col in range(n_seq) if row >= col]
dists[row, n_seq-row-1:] = dist_row
dists += dists.T
But it actually made it slower, taking 0.000523 seconds (0.70 times faster).
On my larger dataset it takes 319.2168769 seconds (0.92 times faster).
2 I wondered if doing the for loops in native Python and then copying to Numpy at the end would help.
_dists = [ [0]*n_seq for i in range(n_seq)]
for row in range(n_seq):
for col in range(n_seq):
if col <= row:
continue
_dists[row][col] = list_difference(sequences[row], sequences[col])
dists = np.array(_dists)
dists += dists.T
This takes 0.0001862 seconds, which is about twice as fast as the original code.
On my larger dataset the speedup isn’t as significant at 229.945951 seconds (1.28 times faster), but still something.
3 Just a thought that there’s probably a way to do the two outer for loops directly in Numpy.
>Solution :
A much cleaner, simpler, and faster way to do it would be to use numpy broadcasting:
sequences = np.array(sequences)
dists = (sequences[:, None] != sequences).sum(axis=2)
Output:
>>> dists
array([[0, 4, 4, 3],
[4, 0, 2, 3],
[4, 2, 0, 1],
[3, 3, 1, 0]])