A column of strings contains one or more numbers separated by commas or by a range (specifically, something like 5551-6 meaning 5551 through 5556). I need to expand the range to be in the same comma separated form as the other rows.
I have a work-around, but I am not happy with the complexity of it.
Data Source
I have a character column of NAICS codes from this table. Which is the sheet named "Table 2.11" of the excel file available on the linked page.
The problem can be reproduced using the following sample data:
data0 <- tibble::tibble("2022 NAICS" = c("31-33", "331", "332, 333", "6113", "6114-7", "6118"))
Problem
Some of the rows contain multiple codes (for instance, 313, 314). This is fine. I can separate these each into their own row with separate_longer_delim(NAICS, delim = ", "). One row contains 6114-7, which would be the NAICS codes 6114 through 6117. I would like to convert 6114-7 to the same form as the other rows. Since there is only one instance of the problem, I know I could explicitly replace that string, but I would like to do it dynamically.
Desired Result
Convert NAICS codes given as a range (6114-7) to comma separated values as a string (6114, 6115, 6116, 6117).
Attempt
I have a work-around, but I’m sure there is a more elegant solution. Ideally a solution that can also handle a different number of digits, such as turning 31-33 into 31, 32, 33, which also appears in the data. I removed them in the attempt below.
The thought process in the mutate below is to find any rows similar to 6114-7, and if found, paste a string which is formed by collapsing a sequence whose endpoints are extracted from the original value. If a match is not found, NAICS is set equal to itself, no changes are made.
library(tidyverse)
data1 <- data0 |>
rename(NAICS = "2022 NAICS") |>
filter(!is.na(NAICS), NAICS != "NA") |>
filter(str_detect(NAICS,"(\\d\\d)-(\\d\\d)",negate = TRUE)) |>
rowwise() |>
mutate(
NAICS = ifelse(
str_detect(NAICS, "(\\d+)-(\\d)"),
paste(
as.numeric(
str_extract(NAICS, "(\\d+)", group = 1)
):(
round(
as.numeric(
str_extract(NAICS, "(\\d+)", group = 1)
),
digits = -1
) +
as.numeric(
str_extract(NAICS, "-(\\d)", group = 1)
)
),
collapse = ", "
),
NAICS
)
)
# using data0 provided above, data1 is:
# A tibble: 5 × 1
# Rowwise:
# NAICS
# <chr>
# 1 331
# 2 332, 333
# 3 6113
# 4 6114, 6115, 6116, 6117
# 5 6118
>Solution :
You could do:
unlist(lapply(strsplit(data0$`2022 NAICS`, '-'), \(x) {
if(length(x) == 2) {
x <- as.numeric(x)
if(x[2] < 10) x[2] <- 10 * floor(x[1]/10) + x[2]
paste(seq(as.numeric(x[1]), as.numeric(x[2])), collapse = ', ')
} else x
}))
#> [1] "31, 32, 33" "331" "332, 333"
#> [4] "6113" "6114, 6115, 6116, 6117" "6118"