I’m new to NLP and text analysis. I have a dataframe of tokens and their tf-idf scores from some text data I am working with. For example:
input
df=
|article |token1|token2|token3|token4|token5|
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
The tokens are in alphabetical order. I’m trying to compute the correlation between each pair of adjacent columns across the dataframe and append the results to the dataframe as a new row. The output would look something like this:
desired output
df=
|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00 |.04 |.03 |.00 |.10 |
|article2|.07 |.00 |.14 |.04 |.00 |
|Corr |Corr1-2|Corr2-3|Corr3-4|Corr4-5|NaN |
I know that I could use df.corr(), but that won’t yield the expected output. I would think that looping over columns could get there, but I’m not really sure where to start. Does anyone have an idea on how to achieve this?
>Solution :
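Use corrwith against a copy of the dataframe shifted one column to the left, so that each column is correlated with its right-hand neighbour: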
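# Move 'article' into the index so only the numeric token columns remain.
df2 = df.set_index('article')
# The last column has no right-hand neighbour, so its entry is NaN.
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
print(df2)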
          token1  token2  token3  token4  token5
article
article1    0.00    0.04    0.03    0.00     0.1
article2    0.07    0.00    0.14    0.04     0.0
Corr       -1.00   -1.00    1.00   -1.00     NaN
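Since the question mentions looping over columns, here is a minimal self-contained sketch that rebuilds the sample dataframe and compares the shift-based one-liner with an explicit loop over adjacent column pairs. The variable names (scores, shifted, looped, result) are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "article": ["article1", "article2"],
    "token1": [0.00, 0.07],
    "token2": [0.04, 0.00],
    "token3": [0.03, 0.14],
    "token4": [0.00, 0.04],
    "token5": [0.10, 0.00],
})

# Keep only the numeric token columns; 'article' becomes the index.
scores = df.set_index("article")

# Shift-based approach from the answer: each column is compared with
# the values of the column to its right.
shifted = scores.corrwith(scores.shift(-1, axis=1))

# Equivalent explicit loop over adjacent column pairs.
cols = list(scores.columns)
looped = pd.Series(np.nan, index=cols)  # last column stays NaN
for left, right in zip(cols, cols[1:]):
    looped[left] = scores[left].corr(scores[right])

# Append the correlations as a new row without modifying 'scores'.
result = scores.copy()
result.loc["Corr"] = looped
print(result)
```

Note that with only two articles every Pearson correlation is necessarily +1 or -1 (or NaN), because any two points lie on a straight line. The values become informative once the dataframe has more rows.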