Most efficient way to separate character columns into rows and combine multiple columns into one column in R

UPDATED

I web scraped a table online that wasn’t actually structured as a table. I managed to separate the characters into multiple rows, but for future reference, would like to know of a more efficient way to do this for larger data sets.

I also was able to get everything into one column, but the entire code is wildly inefficient. Any suggestions for improvement?

library(rvest)
library(tidyverse)
library(dplyr)

url = "https://www.ncsl.org/research/health/state-laws-and-legislation-related-to-biologic-medications-and-substitution-of-biosimilars.aspx"
webpage=read_html(url)

mandatory_2014 = webpage %>% 
  html_element(css = "#dnn_ctr84472_HtmlModule_lblContent > div > table:nth-child(15)") %>% 
  html_table()
mandatory_2014 = data.frame(mandatory_2014)

df = mandatory_2014 %>% 
  mutate(X1=strsplit(X1, "\n\n\t\t\t")) %>% 
  unnest(X1) %>% 
  mutate(X2=strsplit(X2, "\n\n\t\t\t")) %>% 
  unnest(X2) %>% 
  mutate(X3=strsplit(X3, "\n\n\t\t\t")) %>% 
  unnest(X3)
df = df[-c(2)]
df = stack(df)
df = df[-c(2)]
df = data.frame(df[!duplicated(df),])
df = rename(df, States = df..duplicated.df....)

>Solution :

This may be easier in base R: unlist the columns into a character vector, replace each run of newlines and tabs (\n+\t+) with a single comma, and strip everything from the first ( onward. Then split the string on the comma delimiter with either strsplit or scan, apply trimws to remove any remaining leading or trailing whitespace, and store the result as a data.frame column.

out <- data.frame(States = trimws(scan(text = sub("\\s+\\(.*", "",
   gsub("(\\n+\\t+)", ",", mandatory_2014)), what="", sep=",")))

-output

> out
           States
1         Florida
2          Kansas
3        Kentucky
4   Massachusetts
5       Minnesota
6     Mississippi
7          Nevada
8      New Jersey
9        New York
10   Pennsylvania
11    Puerto Rico
12   Rhode Island
13     Washington
14  West Virginia 
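
The same steps can also be sketched with strsplit instead of scan. This is only an illustration on a made-up stand-in for the scraped table: the `toy` data frame, its values, and the name `out_toy` are assumptions, not part of the original data. It splits each cell first and strips the parenthetical afterwards, so a note in parentheses only trims its own entry rather than the rest of the cell.

```r
# Toy stand-in for the scraped one-row table; the values are invented
# to mimic the "\n\n\t\t\t" separators and a trailing "(...)" note.
toy <- data.frame(
  X1 = "Florida\n\n\t\t\tKansas\n\n\t\t\tKentucky (note)",
  X2 = "Nevada\n\n\t\t\tNew Jersey",
  X3 = "Washington\n\n\t\t\tWest Virginia",
  stringsAsFactors = FALSE
)

# 1. Flatten every cell into a plain character vector.
cells <- unlist(toy, use.names = FALSE)

# 2. Split each cell on runs of newlines followed by tabs.
pieces <- unlist(strsplit(cells, "\\n+\\t+"))

# 3. Drop a parenthetical note from each piece, then trim whitespace.
pieces <- trimws(sub("\\s+\\(.*", "", pieces))

# 4. Keep the unique, non-empty values as a single column.
out_toy <- data.frame(
  States = unique(pieces[nzchar(pieces)]),
  stringsAsFactors = FALSE
)
print(out_toy)
```

Because everything here is a vectorised base R call on a character vector, it avoids the repeated unnest steps, which build every combination of the split columns before the duplicates are removed.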