How to calculate correlation between adjacent columns throughout a dataframe and add it to the dataframe?

I’m new to NLP and text analysis. I have a dataframe of tokens and their tf-idf scores from some text data I am working with. For example:

input 
df=
|article |token1|token2|token3|token4|token5|
|article1|.00   |.04   |.03   |.00   |.10   |
|article2|.07   |.00   |.14   |.04   |.00   |

The tokens are in alphabetical order; I’m trying to get the correlation between adjacent columns throughout the dataframe and append it to the dataframe. The output would look something like this:

desired output
df=
|article |token1 |token2 |token3 |token4 |token5 |
|article1|.00    |.04    |.03    |.00    |.10    |
|article2|.07    |.00    |.14    |.04    |.00    |
|Corr    |Corr1-2|Corr2-3|Corr3-4|Corr4-5|NaN    |

I know that I could use df.corr(), but that won’t yield the expected output. I would think that looping over columns could get there, but I’m not really sure where to start. Does anyone have an idea on how to achieve this?

>Solution :

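Set article as the index so that only the numeric token columns remain, then call corrwith against a copy of the frame shifted one column to the left. Each column is then correlated with its right-hand neighbour, and the last column gets NaN because it has no neighbour: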

df2 = df.set_index('article')
df2.loc['Corr'] = df2.corrwith(df2.shift(-1, axis=1))
print(df2)

Output:

          token1  token2  token3  token4  token5
article                                         
article1    0.00    0.04    0.03    0.00     0.1
article2    0.07    0.00    0.14    0.04     0.0
Corr       -1.00   -1.00    1.00   -1.00     NaN
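
Since the question mentions looping over columns, here is a minimal self-contained sketch that rebuilds the example frame and computes the adjacent-column correlations both ways: with corrwith on a shifted copy, as above, and with an explicit loop over neighbouring column pairs. The variable names (scores, corr_vectorised, corr_loop, result) are illustrative and not from the original post.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    {
        "article": ["article1", "article2"],
        "token1": [0.00, 0.07],
        "token2": [0.04, 0.00],
        "token3": [0.03, 0.14],
        "token4": [0.00, 0.04],
        "token5": [0.10, 0.00],
    }
)

# Keep only the numeric token columns by moving 'article' into the index.
scores = df.set_index("article")

# Vectorised version: shift values one column to the left, so the column
# labelled 'token1' holds token2's values, and so on. corrwith then pairs
# columns by label, giving corr(token1, token2), corr(token2, token3), ...
corr_vectorised = scores.corrwith(scores.shift(-1, axis=1))

# Loop version: walk over neighbouring column pairs explicitly.
cols = list(scores.columns)
corr_loop = pd.Series(float("nan"), index=cols)
for left, right in zip(cols[:-1], cols[1:]):
    corr_loop[left] = scores[left].corr(scores[right])

# Append the correlations as a new row labelled 'Corr'.
result = scores.copy()
result.loc["Corr"] = corr_vectorised
print(result)
```

Both versions should give the same values. Note that with only two articles, every Pearson correlation between two non-constant columns is exactly 1 or -1, which is why the Corr row above contains nothing in between; with more rows the values will vary.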
