Home Fast Python outer difference of list

Questions

Fast Python outer difference of list

February 8, 2022

I want to compute the difference between every element in a Python list of equally long lists and put it into a Numpy array.

When I say difference between two lists I mean the number of differences between corresponding elements between the two lists.
Here’s an example difference function using a list comprehension:

def list_difference(list_a, list_b):
    len_lists = len(list_a)
    assert len_lists == len(list_b), "Lists must be the same length."
    return sum([list_a[i] != list_b[i] for i in range(len_lists)])

Then I call the difference function on every pair in my list of lists and put it into a numpy array.
You could call it the outer difference of the list of lists, just like the outer product.
I do it in naive for loops:

import numpy as np
import time

sequences = [
    ["A", "A", "A", "B", "C"],
    ["B", "A", "B", "A", "B"],
    ["B", "A", "C", "C", "B"],
    ["B", "A", "C", "C", "C"],
]

start = time.time()
n_seq = len(sequences)
dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    for col in range(n_seq):
        if row >= col:
            continue
        dists[row, col] = list_difference(sequences[row], sequences[col])
dists += dists.T
print(dists)
print(f"Time: {time.time() - start} seconds")

The result is

[[0. 4. 4. 3.]
 [4. 0. 2. 3.]
 [4. 2. 0. 1.]
 [3. 3. 1. 0.]]
Time: 0.0003669261932373047 seconds

This example is quick enough on my computer (best time of three above).
However, running this on 64 lists (sequences) each with length 1008 it takes 293.572190 seconds, which is a while.
Is there a faster way to do this?

Attempts:

1 I tried putting the inner for loop in a list comprehension:

dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    dist_row = [list_difference(sequences[row], sequences[col]) for col in range(n_seq) if row >= col]
    dists[row, n_seq-row-1:] = dist_row
dists += dists.T

But it actually made it slower, taking 0.000523 seconds (0.70 times faster).
On my larger dataset it takes 319.2168769 seconds (0.92 times faster).

2 I wondered if doing the for loops in native Python and then copying to Numpy at the end would help.

_dists = [ [0]*n_seq for i in range(n_seq)]
for row in range(n_seq):
    for col in range(n_seq):
        if col <= row:
            continue
        _dists[row][col] = list_difference(sequences[row], sequences[col])
dists = np.array(_dists)
dists += dists.T

This takes 0.0001862 seconds, which is about twice as fast as the original code.
On my larger dataset the speedup isn’t as significant at 229.945951 seconds (1.28 times faster), but still something.

3 Just a thought that there’s probably a way to do the two outer for loops directly in Numpy.

>Solution :

A much cleaner, simpler, and faster way to do it would be to use numpy broadcasting:

sequences = np.array(sequences)
dists = (sequences[:, None] != sequences).sum(axis=2)

Output:

>>> dists
array([[0, 4, 4, 3],
       [4, 0, 2, 3],
       [4, 2, 0, 1],
       [3, 3, 1, 0]])

numpy

byMR

Published February 08, 2022

Add a comment

Split String at First Occurrence of an Integer using R

byMR

February 8, 2022

Questions

How to slow down sprite animation while maintaining 60 fps in pygame?

byMR

February 8, 2022

Questions

C++: Wrapping C style array into unique_ptr leads to "double free or corruption"

byMR

February 8, 2022

Questions

Python variable won't update outside of module

byMR

February 8, 2022

Questions

Create Week Date Range column based on the service date

byMR

February 8, 2022

Questions

filter array based on another nested array

byMR

February 8, 2022

Fast Python outer difference of list

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Like this:

Leave a ReplyCancel reply

Read more

Split String at First Occurrence of an Integer using R

How to slow down sprite animation while maintaining 60 fps in pygame?

C++: Wrapping C style array into unique_ptr leads to "double free or corruption"

Python variable won't update outside of module

Create Week Date Range column based on the service date

filter array based on another nested array

Keep Up to Date with the Most Important News

Fast Python outer difference of list

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Share this:

Like this:

Leave a ReplyCancel reply

Keep Up to Date with the Most Important News

Read more

Split String at First Occurrence of an Integer using R

How to slow down sprite animation while maintaining 60 fps in pygame?

C++: Wrapping C style array into unique_ptr leads to "double free or corruption"

Python variable won't update outside of module

Create Week Date Range column based on the service date

filter array based on another nested array

Discover more from Dev solutions