I web-scraped a table online that wasn't actually structured as a table. I managed to separate the values into multiple rows, but for future reference I would like to know a more efficient way to do this for larger data sets.
I also managed to get everything into one column, but the entire code is wildly inefficient. Any suggestions for improvement?
library(rvest)
library(tidyverse)  # loads dplyr and tidyr, so a separate library(dplyr) call is not needed
url = "https://www.ncsl.org/research/health/state-laws-and-legislation-related-to-biologic-medications-and-substitution-of-biosimilars.aspx"
webpage = read_html(url)
mandatory_2014 = webpage %>%
  html_element(css = "#dnn_ctr84472_HtmlModule_lblContent > div > table:nth-child(15)") %>%
  html_table()
mandatory_2014 = data.frame(mandatory_2014)
df = mandatory_2014 %>%
  mutate(X1 = strsplit(X1, "\n\n\t\t\t")) %>%
  unnest(X1) %>%
  mutate(X2 = strsplit(X2, "\n\n\t\t\t")) %>%
  unnest(X2) %>%
  mutate(X3 = strsplit(X3, "\n\n\t\t\t")) %>%
  unnest(X3)
df = df[-c(2)]                                  # drop the second column
df = stack(df)                                  # stack the remaining columns into one
df = df[-c(2)]                                  # drop the "ind" column that stack() adds
df = data.frame(df[!duplicated(df), ])          # keep unique values only
df = rename(df, States = df..duplicated.df....)
Solution:
This may be done more easily in base R: unlist the columns into a single character vector, replace each run (+) of \n and \t characters with a single comma, and remove everything starting from the opening ( of the parenthetical note. Then split the string into individual elements with either strsplit or scan (using , as the delimiter), apply trimws to remove any remaining leading/trailing spaces, and convert the result to a data.frame column.
out <- data.frame(States = trimws(scan(text = sub("\\s+\\(.*", "",
         gsub("(\\n+\\t+)", ",", mandatory_2014)), what = "", sep = ",")))
Output:
> out
States
1 Florida
2 Kansas
3 Kentucky
4 Massachusetts
5 Minnesota
6 Mississippi
7 Nevada
8 New Jersey
9 New York
10 Pennsylvania
11 Puerto Rico
12 Rhode Island
13 Washington
14 West Virginia
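For completeness, the strsplit route mentioned above would look roughly like this. It is only a sketch: it assumes mandatory_2014 is the scraped data.frame from the question, and that the parenthetical note stripped by sub() appears only at the end of a cell's text.
vec <- unlist(mandatory_2014, use.names = FALSE)   # flatten the columns to one character vector
vec <- gsub("\\n+\\t+", ",", vec)                  # collapse runs of newlines/tabs to a single comma
vec <- sub("\\s+\\(.*", "", vec)                   # drop a trailing " (...)" note, if present
out2 <- data.frame(States = trimws(unlist(strsplit(vec, ","))))
As with the scan() version, trimws() removes any stray whitespace left around the state names after splitting on the commas.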