I have a data frame with shape:
(20,30,1024)
I want to find the Euclidean distance between every entry and every other entry in the dataframe (ideally non-redundantly, i.e. don’t find the distance of row 1 and 5….and then row 5 and 1 but not there yet). I have this code:
from scipy.spatial.distance import pdist,squareform
distances = pdist(df_test,metric='euclidean')
dist_matrix = squareform(distances)
print(dist_matrix)
The error says:
A 2-dimensional array must be passed.
So I guess I want to convert my matrix from shape (20,30,1024) to (20,30720), and then calculate the pdist/squareform between the rows (i.e. 20 rows of vectors that are 30720 in length).
I know that I can use test_df[0:20].flatten().tolist()
But that completely flattened my matrix, the output shape was (1,614400).
Can someone show me how to convert a shape from (20,30,1024) to (20,3072), or if i’m not going about this the right way?
The ultimate end goal is to calculate Euclidean distance between all non-redundant pairs in a data set, but the data set is big, so I need to do it as efficiently as possible/not duplicating calculations.
>Solution :
The most straightforward way to reshape that I can think of, according to how you described the problem, is:
df_test.values.reshape(20, -1)
By calling .values, you are retrieving your dataframe data as a numpy array. From there, .reshape finishes your job. Since you need a 2D-array, you provide the size of the first dimension (in your case, 20), and by passing -1 Numpy will calculate the size of the second dimension for you (in this case it will multiply the remaining dimension sizes in the original 3D-array)