R: Adding a column to a data frame based on a column in another data frame causes several duplicate rows

Advertisements

I want to take the state name from one data frame, and put it in another based on the county fips code. Below is a snippet of the two

> head(arr)
  state_abb fips_state_county_code
1        CO                   8001
2        CO                   8001
3        CO                   8001
4        CO                   8001
5        CO                   8001
6        CO                   8001

> head(countyname)
    county       countyname
245   8001     Adams County
246   8003   Alamosa County
247   8005  Arapahoe County
248   8007 Archuleta County
249   8009      Baca County
250   8011      Bent County

When I use left_join, I get a ton of duplicates

xcountyname= left_join(arr, countyname, by =c("county" = "fips_state_county_code")) 

> head(xcountyname)
  county   countyname state_abb
1   8001 Adams County        CO
2   8001 Adams County        CO
3   8001 Adams County        CO
4   8001 Adams County        CO
5   8001 Adams County        CO
6   8001 Adams County        CO

It should only be adding the state column, but it has now gone from 535 rows to over 71k rows. Is there a better way to do this so that the state only fills in the existing rows of the xcountyname data frame?

Edit: arr is a subset of another dataset, that’s why there are duplicates. But it still stands that I only want the rows with the state name to transfer with the existing fips codes in xcountyname

>Solution :

You only want one state abb for each fips, so use distinct()

left_join(
  distinct(arr %>% select(state_abb, county=fips_state_county_code)),
  countyname
)

Leave a ReplyCancel reply