Masking the Zip Codes

I’m taking a course and I need to solve the following assignment:
"In this part, you should write a for loop, updating the df_users dataframe.

Go through each user, and update their zip code, to Safe Harbor specifications:

If the user is from a zip code for which the “Geographic Subdivision” is less than or equal to 20,000, change the zip code in df_users to ‘0’ (as a string)
Otherwise, zip should be only the first 3 numbers of the full zip code
Do all this by directly updating the zip column of the df_users DataFrame
Hints:

This will be several lines of code, looping through the DataFrame, getting each zip code, checking the geographic subdivision with the population in zip_dict, and setting the zip_code accordingly.
Be very aware of your variable types when working with zip codes here."

Here you can find all the data necessary to understand the context:

https://raw.githubusercontent.com/DataScienceInPractice/Data/master/

assignment: ‘A4’

data_files: user_dat.csv, zip_pop.csv

After cleaning the data from user_dat.csv, leaving only the columns ‘age’, ‘zip’ and ‘gender’, and creating a dictionary from zip_pop.csv that contains the population for the first 3 digits of every zipcode, I wrote this code:

# Loop through the dataframe to get each zipcode
for zipcode in df_users['zip']:
    # Check whether the first 3 digits of the zipcode correspond to a population of 20,000 or fewer
    if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
        # If so, change the zipcode value to the string '0'.
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
    else:
        # Otherwise, keep only the first 3 digits of the zipcode.
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = zipcode[:len(zipcode) - 2]

This code only half works, and I don’t understand why.
It changes the zipcode to ‘0’ when the population is 20,000 or fewer, and it also truncates the first zipcodes (up to the ones that start with ‘078’), but then it raises this error:

KeyError Traceback (most recent call last)
/var/folders/95/4vh4zhc1273fgmfs4wyntxn00000gn/T/ipykernel_44758/1429192050.py in <module>
1 for zipcode in df_users['zip']:
----> 2 if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
3 df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
4 else:
5 df_users.loc[df_users['zip'] == zipcode, 'zip'] = str(zipcode[:len(zipcode) - 2])

KeyError: '0'

I gather that the problem is in the last line of code, because I ran the code one line at a time and every line worked until I added that last one. If I just print the zipcodes instead of running that last line, it also works!

Can anyone help me understand why my code is wrong?

>Solution :

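You’re modifying a collection of values (i.e. df_users['zip']) whilst you’re iterating over it. This is a common anti-pattern. The most likely sequence of events is this: once a row has been rewritten to ‘078’, a later row that shared the same original zip code is read back as ‘078’ rather than as the full zip code. Slicing two more characters off ‘078’ leaves ‘0’, which is not a key in zip_dict, hence KeyError: '0'. If a loop is absolutely required, you could iterate over df_users['zip'].unique() instead. That creates a copy of all the unique zip codes, which avoids the current error and means you aren’t redoing work when you encounter a duplicate zipcode.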
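A minimal sketch of the loop-based approach, using made-up sample data (the zip codes and populations below are hypothetical, not taken from the assignment files). It assumes the zip column holds five-character strings and that zip_dict is keyed by the first three characters:

```python
import pandas as pd

# Hypothetical sample data; the real df_users and zip_dict come from the assignment files.
df_users = pd.DataFrame({"zip": ["07801", "07801", "10001", "03601"]})
zip_dict = {"078": 50000, "100": 1500000, "036": 12000}

# unique() returns a separate array of the original values, so the loop
# never reads back a zip code that has already been rewritten.
for zipcode in df_users["zip"].unique():
    prefix = zipcode[:3]
    # Small subdivisions are masked as '0'; otherwise keep the 3-digit prefix.
    new_zip = "0" if zip_dict[prefix] <= 20000 else prefix
    df_users.loc[df_users["zip"] == zipcode, "zip"] = new_zip

print(df_users["zip"].tolist())
```

Both ‘07801’ rows are updated in a single pass, so the duplicate is never sliced a second time.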

If a loop is not required, there are better (more idiomatic pandas) ways to approach the problem. I would suggest something like this (untested):

zip_start = df_users['zip'].str[:-2]
df_users['zip'] = zip_start.where(zip_start.map(zip_dict) > 20000, other="0")
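To try the vectorized version in isolation, here is a small self-contained sketch with made-up data (the zip codes and populations are hypothetical). It uses .str[:3] rather than .str[:-2]; the two agree for five-character zip codes:

```python
import pandas as pd

# Hypothetical sample data; the real df_users and zip_dict come from the assignment files.
df_users = pd.DataFrame({"zip": ["07801", "07801", "10001", "03601", "99999"]})
zip_dict = {"078": 50000, "100": 1500000, "036": 12000}

# First three characters of every zip code, computed once for the whole column.
zip_start = df_users["zip"].str[:3]

# Look up each prefix's population; prefixes missing from zip_dict become NaN.
population = zip_start.map(zip_dict)

# Keep the prefix where the population exceeds 20,000; everything else becomes '0'.
df_users["zip"] = zip_start.where(population > 20000, other="0")

print(df_users["zip"].tolist())
```

Note that a prefix missing from zip_dict (‘999’ here) maps to NaN, and NaN > 20000 is False, so that row is masked as ‘0’ instead of raising a KeyError. Whether that is the desired behaviour depends on the assignment.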