I have a data frame with one column, the content of which has been extracted from a messy PDF table:
my_df <- structure(list(value = c("Jon Doe Managing Director My Company Elk View IL (312) 726-1578 email5@email.com",
"John Smith Director Acme Corp Springfield IA (111) 111-1111 email1@email.com",
"Mike Jones Manager MyCo inc Jonestown MN (111) 111-1111 email2@email.com",
"Dorothy Baker CEO Our Company Inc Philadelphia PA (111) 111-111 email3@email.com"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
I’m trying to separate into multiple columns. Some of the values, such as job titles and phone numbers have space within them, so I need to separate by multiple spaces or tabs.
I’d like to use separate from tidyverse if possible, so the basic code may look like:
pdf_list_df |>
separate(
value,
c(
"First Name",
"Last Name",
"Job Title",
"Company Name",
"City",
"State",
"Phone Number",
"Email Address"
)
)
I’m just stuck as to what regex or options would do this. I see solutions on here for other languages but not R. Thanks.
>Solution :
my_df %>%
separate(value,
into = c(
"First Name",
"Last Name",
"Job Title",
"Company Name",
"City",
"State",
"Phone Number",
"Email Address"
),
sep = "\\s{2,}")
# A tibble: 4 × 8
`First Name` `Last Name` `Job Title` `Company Name` City State `Phone Number` `Email Address`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Jon Doe Managing Director My Company Elk View IL (312) 726-1578 email5@email.com
2 John Smith Director Acme Corp Springfield IA (111) 111-1111 email1@email.com
3 Mike Jones Manager MyCo inc Jonestown MN (111) 111-1111 email2@email.com
4 Dorothy Baker CEO Our Company Inc Philadelphia PA (111) 111-111 email3@email.com
The regex is pretty simple here: since you have multiple whitespace characters between the elements you want to extract the splitting pattern here simply matches at least two but possibly more white space characters. Conveniently the ‘words’ that form a unit, such as "Our Company Inc" only have one whitespace between them so the split pattern does not match here.