I’m new to NLP and text analysis. I have a dataframe of tokens and their tf-idf scores from some text data I am working with. For example:
input df=

|article |token1|token2|token3|token4|token5|
|article1|.00   |.04   |.03   |.00   |.10   |
|article2|.07   |.00   |.14   |.04   |.00   |
The tokens are in alphabetical order. I’m trying to compute the correlation between each pair of adjacent columns and append the results to the dataframe as a new row. The output would look something like this:
desired output df=

|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00    |.04    |.03    |.00    |.10    |
|article2|.07    |.00    |.14    |.04    |.00    |
|Corr    |Corr1-2|Corr2-3|Corr3-4|Corr4-5|NaN    |
I know that I could use df.corr(), but that won’t yield the expected output. I would think that looping over columns could get there, but I’m not really sure where to start. Does anyone have an idea on how to achieve this?
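    # Move the article names into the index so only numeric columns remain
    df2 = df.set_index('article')
    # Shift the columns left by one so each column is paired with its right-hand neighbour
    df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
    print(df2)

              token1  token2  token3  token4  token5
    article
    article1    0.00    0.04    0.03    0.00     0.1
    article2    0.07    0.00    0.14    0.04     0.0
    Corr       -1.00   -1.00    1.00   -1.00     NaN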
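
Since the question mentions looping over columns, here is a minimal self-contained sketch that rebuilds the sample data from the question and compares the `corrwith`/`shift` approach with an explicit loop over adjacent column pairs. The names `scores`, `vectorized`, `looped`, and `result` are only illustrative and do not appear in the original post. With just two rows, every Pearson correlation is necessarily +1 or -1, so real data with more articles would give more informative values.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "article": ["article1", "article2"],
    "token1": [0.00, 0.07],
    "token2": [0.04, 0.00],
    "token3": [0.03, 0.14],
    "token4": [0.00, 0.04],
    "token5": [0.10, 0.00],
})

# Keep only the numeric tf-idf columns; article names become the index.
scores = df.set_index("article")

# Vectorized version: shift the columns left by one so that each column
# is aligned with its right-hand neighbour, then correlate column by column.
vectorized = scores.corrwith(scores.shift(-1, axis=1))

# Loop version: walk over adjacent column pairs explicitly.
cols = list(scores.columns)
looped = pd.Series(np.nan, index=cols)  # last column has no neighbour, stays NaN
for left, right in zip(cols, cols[1:]):
    looped.loc[left] = scores[left].corr(scores[right])

# Append the correlations as a new row, leaving the original frame untouched.
result = scores.copy()
result.loc["Corr"] = vectorized
print(result)
```

Both versions should agree. The loop is easier to adapt (for example, to a different correlation method per pair), while the vectorized form is shorter and avoids Python-level iteration when there are many token columns.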