Flatten only part of a dataframe shape for Euclidean calculation?

February 7, 2022

I have a data frame with shape:

(20,30,1024)

I want to find the Euclidean distance between every entry and every other entry in the dataframe, ideally non-redundantly (i.e. not computing the distance between rows 1 and 5 and then again between rows 5 and 1), although I'm not at that stage yet. I have this code:

from scipy.spatial.distance import pdist, squareform

distances = pdist(df_test, metric='euclidean')
dist_matrix = squareform(distances)

print(dist_matrix)

The error says:

A 2-dimensional array must be passed.

So I guess I want to convert my matrix from shape (20,30,1024) to (20,30720), and then calculate the pdist/squareform between the rows (i.e. 20 rows of vectors that are 30720 in length).

I know that I can use df_test[0:20].flatten().tolist()

But that flattened my matrix completely; the output shape was (1,614400).

Can someone show me how to convert a shape from (20,30,1024) to (20,30720), or tell me if I'm not going about this the right way?

The end goal is to calculate the Euclidean distance between all non-redundant pairs in a data set, but the data set is big, so I need to do it as efficiently as possible, without duplicating calculations.

>Solution :

The most straightforward way to reshape that I can think of, according to how you described the problem, is:

df_test.values.reshape(20, -1)

Calling .values retrieves the DataFrame's data as a NumPy array, and .reshape does the rest. Since you need a 2D array, you give the size of the first dimension (20 in your case) and pass -1 for the second, which tells NumPy to work out that size itself: it multiplies the remaining dimensions of the original 3D array, here 30 × 1024 = 30720. Note that a pandas DataFrame is always two-dimensional, so if df_test is in fact already a NumPy array (the (20,30,1024) shape suggests it is), drop .values and call df_test.reshape(20, -1) directly.
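
As a rough sketch of the whole pipeline, the snippet below reshapes a 3D array and passes the result to pdist. The name arr and the random values are stand-ins for illustration, not the data from the question, and the array is assumed to be a plain NumPy array. It also suggests that the non-redundancy concern may already be covered: pdist returns a condensed vector holding one distance per unordered pair of rows, n(n-1)/2 values in total, and squareform only expands that vector into the symmetric matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stand-in data with the shape from the question (random values, illustrative only).
rng = np.random.default_rng(0)
arr = rng.random((20, 30, 1024))

# Keep the first axis and merge the other two: (20, 30, 1024) -> (20, 30720).
flat = arr.reshape(arr.shape[0], -1)

# Condensed form: one Euclidean distance per unordered pair of rows.
distances = pdist(flat, metric="euclidean")

# Expand to a symmetric 20 x 20 matrix with zeros on the diagonal.
dist_matrix = squareform(distances)

print(flat.shape)         # (20, 30720)
print(distances.shape)    # (190,)
print(dist_matrix.shape)  # (20, 20)
```

If memory is the main constraint, working with the condensed distances vector directly and skipping squareform avoids storing each distance twice.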