Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Find the maximum number in an R dataframe column of strings

For each cell in a particuar column of a dataframe (which here we will simply name as df), I want to find the value of maximum and minimum number that is originally represented as a string, embedded in a string. Any commas present in the cell have no special significance. These numbers should not be a percentage, so if for example 50% appears then 50 is to be excluded from consideration. The relevant column of the dataframe looks something like this:

| particular_col_name | 
| ------------------- | 
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15|

So two new columns should be created with title ‘maximum_number’ and ‘minimum number, and in the case of the first row the former should say 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Similarly, the second row should put 40 and 4 into the new columns.

I have tried a couple of methods (e.g. str_extract_all, regmatches, strsplit), within the dplyr ‘mutate’ operator, but they either give error messages (particularly regarding the input column particular_col_name) or do not output the data in an appropriate format for the maximum and minimum values to be easily identified.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Any help on this would be most appreciated please.

>Solution :

library(tidyverse)

tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  )
) %>%
  mutate(
    numbers = particular_col_name %>% map(~ {
      .x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
    }),
    min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
    max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
  ) %>%
  select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#>   particular_col_name                                                  min   max
#>   <chr>                                                              <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
#> 2 Second_Row_50%, number40. Number 4. number_15                          4    40
#> 3 20% 30%                                                               NA    NA

Created on 2022-02-22 by the reprex package (v2.0.0)

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading