Parse DataFrame of postal addresses – remove country and unit number

I have a dataframe with a column of postal addresses that I parsed with GoogleV3 from geopy.geocoders. The output of geolocator.geocode, however, includes the country name and the unit number, and I want neither.

How can I do it?

I have tried:

test_add['clean address'] = test_add.apply(lambda x: x['clean address'][:-5], axis = 1)

and

def remove_units(X):
    X = X.split()
    X_new = [x for x in X if not x.startswith("#")]
    return ' '.join(X_new)

test_add['parsed addresses'] = test_add['clean address'].apply(remove_units)

It works for:

import pandas as pd

data = ["941 Thorpe St, Rock Springs, WY 82901, USA",
    "2809 Harris Dr, Antioch, CA 94509, USA",
    "7 Eucalyptus, Newport Coast, CA 92657, USA",
    "725 Mountain View St, Altadena, CA 91001, USA",
    "1966 Clinton Ave #234, Calexico, CA 92231, USA",
    "431 6th St, West Sacramento, CA 95605, USA",
    "5574 Old Goodrich Rd, Clarence, NY 14031, USA",
    "Valencia Way #1234, Valley Center, CA 92082, USA"]
test_df = pd.DataFrame(data, columns=['parsed addresses'])

but I get an error, "AttributeError: 'float' object has no attribute 'split'", when I use a larger dataframe with 150k such addresses.

Ultimately, I require only street number, street name, city, state and zipcode.

Solution:

Another possible solution:

test_df['parsed addresses'].str.replace(r',\D+$|\s#\d+', '', regex=True)

EXPLANATION

  • , and # match themselves literally.
  • \D means a non-digit character.
  • \D+ means one or more non-digit characters.
  • $ means end of string.
  • | means logical OR.
  • \s means a whitespace character.
  • \d+ means one or more digit characters.
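
To see the two alternatives at work, here is a minimal sketch that applies the same pattern to a single address with Python's built-in re module. The first branch, ,\D+$, matches a comma followed only by non-digits up to the end of the string (the trailing ", USA"); the second, \s#\d+, matches a unit number such as " #234". The variable names are illustrative only.

```python
import re

# Same pattern as in the answer: trailing ", <non-digits>" OR " #<digits>"
pattern = re.compile(r',\D+$|\s#\d+')

address = "1966 Clinton Ave #234, Calexico, CA 92231, USA"

# findall lists the pieces the pattern matches, left to right
print(pattern.findall(address))   # [' #234', ', USA']

# sub removes those pieces
print(pattern.sub('', address))   # 1966 Clinton Ave, Calexico, CA 92231
```

Note that ", CA 92231, USA" is not removed as a whole: \D+ cannot cross the digits of the zipcode, so only the final ", USA" satisfies the first branch.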

For a more comprehensive treatment of regex, please see Regular Expression HOWTO.

Output for the sample data:

0       941 Thorpe St, Rock Springs, WY 82901
1           2809 Harris Dr, Antioch, CA 94509
2       7 Eucalyptus, Newport Coast, CA 92657
3    725 Mountain View St, Altadena, CA 91001
4        1966 Clinton Ave, Calexico, CA 92231
5       431 6th St, West Sacramento, CA 95605
6    5574 Old Goodrich Rd, Clarence, NY 14031
7       Valencia Way, Valley Center, CA 92082
Name: parsed addresses, dtype: object
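
On the AttributeError from the question: Python raises it when .split is called on a float, which suggests that some rows in the 150k-row dataframe hold a missing value (NaN) rather than a string. That is an assumption, since the larger data is not shown. The .str.replace call above leaves missing values as NaN instead of raising. If a plain function with .apply is preferred, a guarded version might look like the sketch below; clean_address and the sample list are hypothetical names, not part of the original code.

```python
import re

# Pattern from the answer: trailing ", <non-digits>" or " #<digits>"
PATTERN = re.compile(r',\D+$|\s#\d+')

def clean_address(value):
    """Strip country and unit number; pass non-strings (e.g. NaN) through."""
    if not isinstance(value, str):
        return value  # assumption: missing addresses should stay missing
    return PATTERN.sub('', value)

# Hypothetical sample with one missing entry, as a failed geocode might leave
sample = ["Valencia Way #1234, Valley Center, CA 92082, USA", float('nan')]
cleaned = [clean_address(v) for v in sample]

print(cleaned[0])  # Valencia Way, Valley Center, CA 92082
print(cleaned[1])  # nan
```

Such a function could be passed to the column's .apply in place of remove_units.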