Sorry if this has been answered, but I couldn’t quite seem to find an answer that addressed this particular issue. Here is a small sample of the data I’m using:
precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27
In the precinct_no column, how can I strip everything except for the middle six digits? I don’t want the four digits in the beginning, the town names at the end, or the dashes. Just those middle six digits. I need to do this for about 3,000 rows.
>Solution :
If the format is consistent across your ~3000 rows, perhaps this will work:
df <- read.table(text = "precinct_no,newsom_count,dahle_count,difference
0001-100000-SAN PASQUAL,5,18,-13
0002-100090-SAN PASQUAL,567,622,-55
0003-100120-SAN PASQUAL,0,0,0
0004-100150-SAN PASQUAL,0,0,0
0005-105000-RANCHO BERNARDO,572,538,34
0006-105040-RANCHO BERNARDO,609,582,27",
header = TRUE, sep = ",")
df$precinct_no = gsub("[^-]*-(\\d{6})-.*", "\\1", df$precinct_no)
df
#> precinct_no newsom_count dahle_count difference
#> 1 100000 5 18 -13
#> 2 100090 567 622 -55
#> 3 100120 0 0 0
#> 4 100150 0 0 0
#> 5 105000 572 538 34
#> 6 105040 609 582 27
Created on 2022-11-30 with reprex v2.0.2
Explanation:
"[^-]*-(\\d{6})-.*"
[^-]*- match one or more characters that aren’t "-", then a single "-"
(\\d{6}) capture this ‘middle’ group (6 digits)
-.*" match a single "-", followed by any number of any characters