Stripping select characters from the beginning and end of a string in R

November 30, 2022

Sorry if this has been answered, but I couldn’t quite seem to find an answer that addressed this particular issue. Here is a small sample of the data I’m using:

precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27

In the precinct_no column, how can I strip everything except for the middle six digits? I don’t want the four digits in the beginning, the town names at the end, or the dashes. Just those middle six digits. I need to do this for about 3,000 rows.

>Solution :

If the format is consistent across your ~3000 rows, perhaps this will work:

df <- read.table(text = "precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27",
header = TRUE, sep = ",")

df$precinct_no = gsub("[^-]*-(\\d{6})-.*", "\\1", df$precinct_no)
df
#>   precinct_no newsom_count dahle_count difference
#> 1      100000            5          18        -13
#> 2      100090          567         622        -55
#> 3      100120            0           0          0
#> 4      100150            0           0          0
#> 5      105000          572         538         34
#> 6      105040          609         582         27

^{Created on 2022-11-30 with reprex v2.0.2}

Explanation:

"[^-]*-(\\d{6})-.*"

[^-]*- match one or more characters that aren’t "-", then a single "-"

(\\d{6}) capture this ‘middle’ group (6 digits)

-.*" match a single "-", followed by any number of any characters