Home Separate a string vector into columns by multiple spaces or tabs, not single spaces

Questions

Separate a string vector into columns by multiple spaces or tabs, not single spaces

September 23, 2022

I have a data frame with one column, the content of which has been extracted from a messy PDF table:

my_df <- structure(list(value = c("Jon         Doe          Managing Director                                           My Company                                                   Elk View            IL      (312) 726-1578      email5@email.com", 
"John        Smith           Director                                                    Acme Corp                       Springfield          IA      (111) 111-1111      email1@email.com", 
"Mike          Jones           Manager              MyCo inc                                                        Jonestown        MN      (111) 111-1111      email2@email.com", 
"Dorothy       Baker          CEO                                           Our Company Inc                                              Philadelphia       PA      (111) 111-111      email3@email.com"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

I’m trying to separate into multiple columns. Some of the values, such as job titles and phone numbers have space within them, so I need to separate by multiple spaces or tabs.

I’d like to use separate from tidyverse if possible, so the basic code may look like:

pdf_list_df |>
  separate(
    value,
    c(
      "First Name",
      "Last Name",
      "Job Title",
      "Company Name",
      "City",
      "State",
      "Phone Number",
      "Email Address"
    )
  )

I’m just stuck as to what regex or options would do this. I see solutions on here for other languages but not R. Thanks.

>Solution :

my_df %>%
  separate(value,
          into =     c(
            "First Name",
            "Last Name",
            "Job Title",
            "Company Name",
            "City",
            "State",
            "Phone Number",
            "Email Address"
          ),
          sep = "\\s{2,}")
# A tibble: 4 × 8
  `First Name` `Last Name` `Job Title`       `Company Name`  City         State `Phone Number` `Email Address` 
  <chr>        <chr>       <chr>             <chr>           <chr>        <chr> <chr>          <chr>           
1 Jon          Doe         Managing Director My Company      Elk View     IL    (312) 726-1578 email5@email.com
2 John         Smith       Director          Acme Corp       Springfield  IA    (111) 111-1111 email1@email.com
3 Mike         Jones       Manager           MyCo inc        Jonestown    MN    (111) 111-1111 email2@email.com
4 Dorothy      Baker       CEO               Our Company Inc Philadelphia PA    (111) 111-111  email3@email.com

The regex is pretty simple here: since you have multiple whitespace characters between the elements you want to extract the splitting pattern here simply matches at least two but possibly more white space characters. Conveniently the ‘words’ that form a unit, such as "Our Company Inc" only have one whitespace between them so the split pattern does not match here.