Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Unable to filter data frame by column value

I created a data frame of reviews from a website. The three columns are date, rating, and text. I want to only see 1 and 5 star reviews. I have tried everything below and get roughly the same error

df %>% filter(Rating = '1 star', Rating = '5 star')

df$Rating

df[df$rating == '1 star',]

None have worked. Here’s the full code. The bit with the df is at the very bottom:

library(rvest)
library(tidyverse)

# Create url object ---------------------------------
url = "https://www.yelp.com/biz/24th-st-pizzeria-san-antonio?osq=Worst+Restaurant"

# Convert url to html object ------------------------
page <- read_html(url)

# Number of pages -----------------------------------
pageNums = page %>%
  html_elements(xpath = "//div[@class=' border-color--default__09f24__NPAKY text-align--center__09f24__fYBGO']") %>%
  html_text() %>%
  str_extract('of.*') %>% 
  str_remove('of ') %>% 
  as.numeric() 

# Create page sequence ------------------------------
pageSequence <- seq(from=0, to=(pageNums * 10)-10, by = 10)

# Create empty vectors to store data ----------------
review_date_all = c()
review_rating_all = c()
review_text_all = c()

# Create for loop -----------------------------------
for (i in pageSequence){
  if (i==0){
    page <- read_html(url) 
  } else {
    page <- read_html(paste0(url, '&start=', i))
  }
  
  # Review date ----
  review_dates <- page %>%
    html_elements(xpath = "//*[@class=' css-chan6m']") %>%
    html_text() %>%
    .[str_detect(., "^\\d+[/]\\d+[/]\\d{4}$")]
  
  # Review Rating ----
  review_ratings <- page %>%
    html_elements(xpath = "//div[starts-with(@class, ' review')]") %>%
    html_elements(xpath = ".//div[contains(@aria-label, 'rating')]") %>%
    html_attr('aria-label') %>%
    str_remove('rating')
  
  # Review text ----
  review_text = page %>%
    html_elements(xpath = "//p[starts-with(@class, 'comment')]") %>%
    html_text()
  
  # For each page, append these to appropriate vectors----
  review_date_all = append(review_date_all, review_dates)
  review_rating_all = append(review_rating_all, review_ratings)
  review_text_all = append(review_text_all, review_text)
}

# Create data frame ---------------------------------
df <- data.frame('Date' = review_date_all,
                 'Rating' = review_rating_all,
                 'Text'= review_text_all)
View(df)

What am I overlooking?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

There’s an issue with the Rating values in your df. There’s an extra space at the end of every rating.

So you need to do something like this:

df1 <- df %>%
  filter(Rating == '1 star ' | Rating == '5 star ')

You can also remove the trailing whitespace using stringr library as follows:

library(stringr)
df1 <- df %>%
  mutate(Rating = str_squish(Rating)) %>%
  filter(Rating == '1 star' | Rating == '5 star')
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading