Word frequency over time: how to count word frequency by date?

I have a data frame that looks like this:

date text
201901 Thank you for helping me
201902 You are amazing
201902 For helping with this

My aim is to count how often each word occurs for each date, so that the result eventually looks like this:

date thank you for helping me are amazing with this for
201901 1 1 1 1 1 0 0 0 0 0
201902 0 1 1 1 0 1 1 1 1 1

The actual data set has the same structure but contains millions of lines of text, so I am wondering how to automate this process in R without typing out all of those lines by hand.

Solution:

Using R and tidyverse:

df <- data.frame(date = c(201901, 201902, 201902),
                 text = c("Thank you for helping me", "You are amazing", "For helping with this"))

library(tidyverse)

If you want your data as a table of counts:

df %>%
  separate_rows(text, sep = " ") %>%
  mutate(text = tolower(text)) %>%
  table()

Output:

        text
date     amazing are for helping me thank this with you
  201901       0   0   1       1  1     1    0    0   1
  201902       1   1   1       1  0     0    1    1   1

If you want your output as a tibble:

df %>%
  separate_rows(text, sep = " ") %>%
  mutate(text = tolower(text)) %>%
  table() %>%
  as_tibble() %>%
  pivot_wider(names_from = text, values_from = n)

Output:

# A tibble: 2 x 10
  date   amazing   are `for` helping    me thank  this  with   you
  <chr>    <int> <int> <int>   <int> <int> <int> <int> <int> <int>
1 201901       0     0     1       1     1     1     0     0     1
2 201902       1     1     1       1     0     0     1     1     1
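
Since the real data has millions of lines, it may be worth keeping the counts in long format (one row per date and word) before widening. The following is a minimal sketch of that variant and not part of the answer above: it swaps table() for dplyr's count() and lets pivot_wider() fill missing date/word combinations with 0. The names word and word_counts are chosen here for illustration, and splitting on "\\s+" assumes words are separated by whitespace only (punctuation is not stripped).

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  date = c(201901, 201902, 201902),
  text = c("Thank you for helping me", "You are amazing", "For helping with this")
)

word_counts <- df %>%
  mutate(text = tolower(text)) %>%          # treat "For" and "for" as the same word
  separate_rows(text, sep = "\\s+") %>%     # one row per word (whitespace-separated)
  count(date, word = text) %>%              # long format: date, word, n
  pivot_wider(names_from = word,            # one column per word
              values_from = n,
              values_fill = 0)              # absent date/word pairs become 0

word_counts
```

With this approach date keeps its original numeric type instead of being converted to character, and the intermediate result of count() can be filtered (for example, to drop rare words) before pivot_wider() is called.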

Edit: converted everything to lowercase to match your desired output, and added the resulting output.

Edit 2: added a version that returns a tibble, in case you want to work with the result further.
