Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Fast Python outer difference of list

I want to compute the difference between every element in a Python list of equally long lists and put it into a Numpy array.

When I say difference between two lists I mean the number of differences between corresponding elements between the two lists.
Here’s an example difference function using a list comprehension:

def list_difference(list_a, list_b):
    len_lists = len(list_a)
    assert len_lists == len(list_b), "Lists must be the same length."
    return sum([list_a[i] != list_b[i] for i in range(len_lists)])

Then I call the difference function on every pair in my list of lists and put it into a numpy array.
You could call it the outer difference of the list of lists, just like the outer product.
I do it in naive for loops:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import numpy as np
import time

sequences = [
    ["A", "A", "A", "B", "C"],
    ["B", "A", "B", "A", "B"],
    ["B", "A", "C", "C", "B"],
    ["B", "A", "C", "C", "C"],
]

start = time.time()
n_seq = len(sequences)
dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    for col in range(n_seq):
        if row >= col:
            continue
        dists[row, col] = list_difference(sequences[row], sequences[col])
dists += dists.T
print(dists)
print(f"Time: {time.time() - start} seconds")

The result is

[[0. 4. 4. 3.]
 [4. 0. 2. 3.]
 [4. 2. 0. 1.]
 [3. 3. 1. 0.]]
Time: 0.0003669261932373047 seconds

This example is quick enough on my computer (best time of three above).
However, running this on 64 lists (sequences) each with length 1008 it takes 293.572190 seconds, which is a while.
Is there a faster way to do this?

Attempts:

1 I tried putting the inner for loop in a list comprehension:

dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    dist_row = [list_difference(sequences[row], sequences[col]) for col in range(n_seq) if row >= col]
    dists[row, n_seq-row-1:] = dist_row
dists += dists.T

But it actually made it slower, taking 0.000523 seconds (0.70 times faster).
On my larger dataset it takes 319.2168769 seconds (0.92 times faster).

2 I wondered if doing the for loops in native Python and then copying to Numpy at the end would help.

_dists = [ [0]*n_seq for i in range(n_seq)]
for row in range(n_seq):
    for col in range(n_seq):
        if col <= row:
            continue
        _dists[row][col] = list_difference(sequences[row], sequences[col])
dists = np.array(_dists)
dists += dists.T

This takes 0.0001862 seconds, which is about twice as fast as the original code.
On my larger dataset the speedup isn’t as significant at 229.945951 seconds (1.28 times faster), but still something.

3 Just a thought that there’s probably a way to do the two outer for loops directly in Numpy.

>Solution :

A much cleaner, simpler, and faster way to do it would be to use numpy broadcasting:

sequences = np.array(sequences)
dists = (sequences[:, None] != sequences).sum(axis=2)

Output:

>>> dists
array([[0, 4, 4, 3],
       [4, 0, 2, 3],
       [4, 2, 0, 1],
       [3, 3, 1, 0]])
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading