Create 'correlation matrix' for two lists to check if the values have something in common

Could someone please help me out with the following?

I have one dataframe with two columns: products and webshops (n x 2) with n products. Now I would like to obtain a binary (n x n) matrix with all products listed as the indices and all products listed as the column names. Then each cell should contain a 1 or 0 denoting whether the product in the index and column name came from the same webshop.

The following code is returning what I would like to achieve.

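import numpy as np
import pandas as pd

# df_title: the (n x 2) dataframe with the columns products and webshops
dist = np.empty((len(df_title), len(df_title)), int)

for i in range(len(df_title)):
    for j in range(len(df_title)):
        # 1 if products i and j come from the same webshop, else 0
        dist[i][j] = df_title.values[i][1] == df_title.values[j][1]
df = pd.DataFrame(dist)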

However, this code is already quite slow for n = 1624, so I was wondering whether someone has an idea for a faster algorithm.

Thanks!

>Solution :

It seems you are only interested in the element at position 1 of every row (the webshop), so storing that column in a temporary variable makes the lookup easier and avoids the repeated df_title.values[i][1] indexing:

lookup = df_title.values[:, 1]

Also, since you want to interpret the resulting matrix as a boolean matrix, you should probably specify dtype=bool (1 byte per element) instead of dtype=int (typically 8 bytes per element), which cuts memory consumption by a factor of 8.

dist = np.empty((len(df_title), len(df_title)), dtype=bool)
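
To make the memory difference concrete, here is a small sketch that compares the two dtypes for n = 1624, the size mentioned in the question. It uses np.int64 explicitly, because the width of plain int depends on the platform; treating it as 8 bytes is an assumption about a typical 64-bit setup.

```python
import numpy as np

n = 1624  # number of products mentioned in the question

# Explicit np.int64 rather than plain int, whose width depends on the platform.
as_int = np.empty((n, n), dtype=np.int64)
as_bool = np.empty((n, n), dtype=bool)

int_bytes = as_int.nbytes    # n * n * 8 bytes
bool_bytes = as_bool.nbytes  # n * n * 1 byte
ratio = int_bytes // bool_bytes

print(int_bytes, bool_bytes, ratio)
```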

Your matrix is symmetric about the diagonal, so you only need to compute half of it. Also, if i == j, we know the corresponding entry should be True.

lookup = df_title.values[:, 1]
n = len(lookup)
dist = np.empty((n, n), dtype=bool)

for i in range(n):
    # diagonal: every product shares a webshop with itself
    dist[i, i] = True
    for j in range(i + 1, n):
        # symmetric along diagonal: fill (i, j) and (j, i) at once
        dist[i, j] = dist[j, i] = lookup[i] == lookup[j]
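
The snippet above relies on df_title from the question, which is not shown in full. As a self-contained sketch, the following wraps a variant that visits each pair (i, j) only once in a function and runs it on a made-up four-row frame. The column names, the values, and the function name same_shop_loop are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real df_title has 1624 rows.
df_title = pd.DataFrame({
    "product": ["tv_a", "tv_b", "tv_c", "tv_d"],
    "webshop": ["shop1", "shop2", "shop1", "shop2"],
})

def same_shop_loop(df):
    """Fill the matrix by visiting each unordered pair (i, j) once."""
    lookup = df.values[:, 1]  # webshop of every row
    n = len(lookup)
    dist = np.empty((n, n), dtype=bool)
    for i in range(n):
        dist[i, i] = True  # diagonal
        for j in range(i + 1, n):
            # mirror the result across the diagonal
            dist[i, j] = dist[j, i] = lookup[i] == lookup[j]
    return dist

dist_loop = same_shop_loop(df_title)
print(dist_loop.astype(int))
```

This roughly halves the number of comparisons, but the work is still done in a Python-level loop.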

Using NumPy broadcasting, you can reduce all of that to a single line of code that is orders of magnitude faster than the double for-loop solution:

lookup = df_title.values[:, 1]
dist = lookup[None, :] == lookup[:, None]
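
For completeness, here is a minimal end-to-end sketch of the broadcasting approach that also produces the labelled 0/1 matrix the question asks for. The sample frame and its column names ("product", "webshop") are assumptions; the real df_title may use different names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's df_title.
df_title = pd.DataFrame({
    "product": ["tv_a", "tv_b", "tv_c"],
    "webshop": ["shop1", "shop2", "shop1"],
})

lookup = df_title["webshop"].to_numpy()

# Shapes (1, n) and (n, 1) broadcast to (n, n), so every webshop
# is compared with every other webshop in one vectorised step.
same_shop = lookup[None, :] == lookup[:, None]

# Label rows and columns with the product names and convert to 0/1.
result = pd.DataFrame(
    same_shop.astype(int),
    index=df_title["product"],
    columns=df_title["product"],
)
print(result)
```

Note that the broadcast version computes the full matrix rather than half of it; the speed-up comes from running the comparison in compiled NumPy code instead of a Python loop.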