Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Advice for fixing data set made long by broken column?

I really hope that title makes sense. I’ve been given a data set pulled from an online source that comes out a bit broken. The file should have four columns and look like this:

ID Number Address Builder Status
352TRM 123 Street Rd I Build Houses Complete
486JRT 4321 Road St We Build Houses Complete

Instead, the ID Number column is missing and the value of the ID Number is listed in a separate row above the rest of the data like this:

Address Builder Status
352TRM
123 Street Rd I Build Houses Complete
486JRT
4321 Road St We Build Houses Complete

I done research and I think that essentially I need to make a long data set into a wide one. I see that there are lots of functions like spread() and pivot_wider().

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

For those functions I think it should be something like this:

data %>% pivot_wider(names_from = Address, values_from = Address)

but the issue is that some of the addresses are repeated (apartment complexes) and I get a "Values are not uniquely identified" error.

I’m not sure where to go next. Is this problem not fixable with reshaping if the addresses aren’t all unique? Is there another option I should research? Thank you for your advice!

>Solution :

You’re on the right track, but there are three more things you need:

  1. You need to tell pivot_wider() which column to put each value from Address into — e.g., that we want "352TRM" to end up in ID Number but "123 Street Rd" to end up in Address. This is the purpose of the names_from argument. Since the ID numbers are all in rows where the other columns are NA, we can create a names_from column based on that.
  2. Columns that aren’t specified in names_from or values_from and that you want in the same row post-pivoting should have the same values. We can accomplish this by filling NAs with the values from the row below, using tidyr::fill(.direction = "up").
  3. The above two are enough for your example data… but if there are repeated addresses in your actual data, you’ll want to assign a unique ID to each set of rows that should be combined in the result.
library(tidyverse)

data %>% 
  mutate(
    Row = (row_number() + 1) %/% 2,     # unique destination ID
    Variable = if_else(is.na(Builder), "ID Number", "Address")
  ) %>% 
  fill(Builder, Status, .direction = "up") %>% 
  pivot_wider(names_from = Variable, values_from = Address) %>%
  select(!Row)

#> # A tibble: 2 x 4
#>   Builder         Status   `ID Number` Address      
#>   <chr>           <chr>    <chr>       <chr>        
#> 1 I Build Houses  Complete 352TRM      123 Street Rd
#> 2 We Build Houses Complete 486JRT      4321 Road St

Created on 2022-03-06 by the reprex package (v2.0.1)

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading