Standardizing address formatting in R

I'm cleaning a medium-sized data set in R (provided to me) that includes address information. I need to remove everything that follows the ZIP code, but I'm unsure how to do so because that trailing text varies from row to row. Below is a sample:

addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2  75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C")

In essence, I need to transform these addresses into the following format:

"515 DUMMY 1 75253",
"1000 DUMMY 2  75211",
"3948 DUMMY 3 75217",
"4545 DUMMY 4 75217"

Thanks in advance for any help you may be able to provide.

Solution:

A classic regex approach would look something like the code below. I've added one more address that starts with a 5-digit house number to make sure we don't over-remove.

addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2  75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C",
               "45454 DUMMY 4 75217 MAP 68C")
sub("^(.+)\\b(\\d{5})\\b.*", "\\1\\2", addresses)
# [1] "515 DUMMY 1 75253"   "1000 DUMMY 2  75211" "3948 DUMMY 3 75217"  "4545 DUMMY 4 75217"  "45454 DUMMY 4 75217"

Regex:

"^(.+)\\b(\\d{5})\\b.*"
 ^^^^^                    something at the beginning of string,
                          so that we don't false-trigger on a 5-digit
                          house address (a little fragile)
      ^^^        ^^^      word boundaries
         ^^^^^^^^         exactly five digits ([0-9])
                    ^^    anything else (discarded)

The (...) are capture groups, and "\\1\\2" in the replacement puts those two groups back, so everything after the ZIP code is dropped.
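
To see why the pattern is "a little fragile", here is a small sketch with made-up addresses (they are not from the question's data). Because `.+` is greedy, the match keeps everything up to the last standalone five-digit token, and `sub()` returns its input unchanged when the pattern does not match at all.

```r
# Hypothetical edge cases, invented for illustration only.
edge_cases <- c("12 MAIN ST 75217 BOX 10001 X",  # two standalone 5-digit tokens
                "12 MAIN ST",                    # no 5-digit token at all
                "12 MAIN ST 752170 69Q")         # 6 digits, so \\b...\\b rejects it

pattern <- "^(.+)\\b(\\d{5})\\b.*"
cleaned <- sub(pattern, "\\1\\2", edge_cases)

# grepl() with the same pattern shows which rows the substitution matched;
# rows that did not match come back from sub() exactly as they went in.
matched <- grepl(pattern, edge_cases)
print(data.frame(edge_cases, cleaned, matched))
```

In the first row the second five-digit token wins, so "BOX 10001" is kept and only the trailing "X" is dropped. If the real data can contain such rows, the unmatched and suspicious ones probably need a manual look.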

Quick edit: I don't like having to double-backslash everything, so in R 4.0.0 or later, which supports "raw strings", we can do

sub(r"{^(.+)\b(\d{5})\b.*}", r"{\1\2}", addresses)

I think it makes it a little easier to read, though we still need to mentally discard the leading/trailing braces (we can also use r"(..)" or r"[..]", optionally with matching dashes as in r"-(..)-").
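
As a quick check, the sketch below compares the escaped pattern with its raw-string spellings. It assumes R 4.0.0 or later, and the variable names are only for illustration. The delimiter pairs used are the ones documented for R raw strings: parentheses, square brackets, and braces, optionally with matching dashes.

```r
# Requires R >= 4.0.0, where raw string constants were introduced.
escaped     <- "^(.+)\\b(\\d{5})\\b.*"

# The same pattern written as raw strings with each delimiter pair.
raw_brace   <- r"{^(.+)\b(\d{5})\b.*}"
raw_bracket <- r"[^(.+)\b(\d{5})\b.*]"
raw_paren   <- r"(^(.+)\b(\d{5})\b.*)"
# Dashes help when the text itself could contain the closing sequence.
raw_dashed  <- r"--(^(.+)\b(\d{5})\b.*)--"

# Every spelling should produce the very same character string.
all_same <- all(vapply(
  list(raw_brace, raw_bracket, raw_paren, raw_dashed),
  function(p) identical(p, escaped),
  logical(1)
))
print(all_same)
```

Since the strings are identical, the choice between escaped and raw spelling is purely about readability; `sub()` receives the same pattern either way.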
