I web-scraped a table online that wasn't actually structured as a table. I managed to separate the values into multiple rows, but for future reference I would like to know a more efficient way to do this for larger data sets.
I also managed to get everything into one column, but the entire code is wildly inefficient. Any suggestions for improvement?
library(rvest)
library(tidyverse)  # loads dplyr and tidyr, so a separate library(dplyr) call is not needed
url = "https://www.ncsl.org/research/health/state-laws-and-legislation-related-to-biologic-medications-and-substitution-of-biosimilars.aspx"
webpage = read_html(url)
mandatory_2014 = webpage %>%
  html_element(css = "#dnn_ctr84472_HtmlModule_lblContent > div > table:nth-child(15)") %>%
  html_table()
mandatory_2014 = data.frame(mandatory_2014)
df = mandatory_2014 %>%
  mutate(X1 = strsplit(X1, "\n\n\t\t\t")) %>%
  unnest(X1) %>%
  mutate(X2 = strsplit(X2, "\n\n\t\t\t")) %>%
  unnest(X2) %>%
  mutate(X3 = strsplit(X3, "\n\n\t\t\t")) %>%
  unnest(X3)
df = df[-c(2)]                                  # drop the second column
df = stack(df)                                  # stack the remaining columns into one
df = df[-c(2)]                                  # drop the "ind" column that stack() adds
df = data.frame(df[!duplicated(df), ])          # keep unique values only
df = rename(df, States = df..duplicated.df....)
Solution:
This may be done more easily in base R: unlist the columns into a single character vector, replace each run (+) of \n and \t characters with a single comma, and remove everything starting from the opening ( of the parenthetical note. Then split the string into individual elements with either strsplit or scan (using , as the delimiter), apply trimws to remove any remaining leading/trailing spaces, and convert the result to a data.frame column.
out <- data.frame(States = trimws(scan(text = sub("\\s+\\(.*", "",
         gsub("(\\n+\\t+)", ",", mandatory_2014)), what = "", sep = ",")))
Output:
> out
States
1 Florida
2 Kansas
3 Kentucky
4 Massachusetts
5 Minnesota
6 Mississippi
7 Nevada
8 New Jersey
9 New York
10 Pennsylvania
11 Puerto Rico
12 Rhode Island
13 Washington
14 West Virginia
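For completeness, the strsplit route mentioned above would look roughly like this. It is only a sketch: it assumes mandatory_2014 is the scraped data.frame from the question, and that the parenthetical note stripped by sub() appears only at the end of a cell's text.
vec <- unlist(mandatory_2014, use.names = FALSE)   # flatten the columns to one character vector
vec <- gsub("\\n+\\t+", ",", vec)                  # collapse runs of newlines/tabs to a single comma
vec <- sub("\\s+\\(.*", "", vec)                   # drop a trailing " (...)" note, if present
out2 <- data.frame(States = trimws(unlist(strsplit(vec, ","))))
As with the scan() version, trimws() removes any stray whitespace left around the state names after splitting on the commas.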