Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

R: Selecting Multiple Columns Based on Conditions

I have the following data in R:

id = 1:5
col1 = c("john", "henry", "adam", "jenna", "Phone: 222 2222")
col2 = c("river B8C 9L4", "Field U9H 5E2 PP", "NA", "ocean A1B 5H1 dd", "dave")
col3 = c("Phone: 111 1111 111", "steve", "forest K0Y 1U9 hu2", "NA", "NA")
col4 = c("matt", "peter", "Phone: 333 333 1113", "Phone: 444 111 1153", "kevin")
my_data = data.frame(id, col1, col2, col3, col4)

      id            col1             col2                col3                col4
1  1            john    river B8C 9L4 Phone: 111 1111 111                matt
2  2           henry Field U9H 5E2 PP               steve               peter
3  3            adam               NA  forest K0Y 1U9 hu2 Phone: 333 333 1113
4  4           jenna ocean A1B 5H1 dd                  NA Phone: 444 111 1153
5  5 Phone: 222 2222             dave                  NA               kevin

I am trying to accomplish the following task – I would like to create a new dataset with the following columns. For each row:

  • Step 1: id (trivial, this is always the first column)
  • Step 2: A column with the phone number
  • Step 3: A column that satisfies the following condition '(([A-Z] ?[0-9]){3})|.', '\\1'
  • Step 4: Once Step 1 – Step 3 has been completed, I would like to combine all names into a single column

Here is a sample of the desired output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  id              name             address                phone
1  1         john matt       river B8C 9L4  Phone: 111 1111 111
2  2 henry steve peter    Field U9H 5E2 PP                   NA
3  3              adam  forest K0Y 1U9 hu2  Phone: 333 333 1113
4  4             jenna  ocean A1B 5H1 dd    Phone: 444 111 1153
5  5        dave kevin                  NA      Phone: 222 2222

Here is the code I have written:

my_data$col1[grep("Phone", my_data$col1)]
my_data$col2[grep("Phone", my_data$col2)]
my_data$col3[grep("Phone", my_data$col3)]
my_data$col4[grep("Phone", my_data$col4)]

my_data$col1[grep( '(([A-Z] ?[0-9]){3})|.', '\\1' , my_data$col1)]
my_data$col2[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col2)]
my_data$col3[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col3)]
my_data$col4[grep('(([A-Z] ?[0-9]){3})|.', '\\1', my_data$col4)]

Based on the above code, I was thinking on identifying which of the columns meet the condition in each step, and then using the COLASCE statement in dplyr to create the final dataset. But I think this might be a very long way of accomplishing this problem.

Can someone please suggest a faster way to solve this problem?

Thanks!

>Solution :

Try this:

tmp <- apply(my_data[,-1], 1, function(z) { z <- z[!is.na(z) & z != "NA"]; ind <- dplyr::case_when(grepl("^Phone:", z) ~ "phone", grepl("(([A-Z] ?[0-9]){3})", z) ~ "address", TRUE ~ "name"); data.frame(lapply(split(z, ind), paste, collapse = " ")); })
tmp
# [[1]]
#         address      name               phone
# 1 river B8C 9L4 john matt Phone: 111 1111 111
# [[2]]
#            address              name
# 1 Field U9H 5E2 PP henry steve peter
# [[3]]
#              address name               phone
# 1 forest K0Y 1U9 hu2 adam Phone: 333 333 1113
# [[4]]
#            address  name               phone
# 1 ocean A1B 5H1 dd jenna Phone: 444 111 1153
# [[5]]
#         name           phone
# 1 dave kevin Phone: 222 2222

With this, we can combine them (I prefer dplyr::bind_rows or data.table::rbindlist).

dplyr::bind_rows(tmp)
#              address              name               phone
# 1      river B8C 9L4         john matt Phone: 111 1111 111
# 2   Field U9H 5E2 PP henry steve peter                <NA>
# 3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5               <NA>        dave kevin     Phone: 222 2222
cbind(my_data[,1,drop=FALSE], dplyr::bind_rows(tmp))
#   id            address              name               phone
# 1  1      river B8C 9L4         john matt Phone: 111 1111 111
# 2  2   Field U9H 5E2 PP henry steve peter                <NA>
# 3  3 forest K0Y 1U9 hu2              adam Phone: 333 333 1113
# 4  4   ocean A1B 5H1 dd             jenna Phone: 444 111 1153
# 5  5               <NA>        dave kevin     Phone: 222 2222
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading