Advice for fixing data set made long by broken column?

March 7, 2022

I really hope that title makes sense. I’ve been given a data set pulled from an online source that comes out a bit broken. The file should have four columns and look like this:

ID Number	Address	Builder	Status
352TRM	123 Street Rd	I Build Houses	Complete
486JRT	4321 Road St	We Build Houses	Complete

Instead, the ID Number column is missing and the value of the ID Number is listed in a separate row above the rest of the data like this:

Address	Builder	Status
352TRM
123 Street Rd	I Build Houses	Complete
486JRT
4321 Road St	We Build Houses	Complete

I done research and I think that essentially I need to make a long data set into a wide one. I see that there are lots of functions like spread() and pivot_wider().

For those functions I think it should be something like this:

data %>% pivot_wider(names_from = Address, values_from = Address)

but the issue is that some of the addresses are repeated (apartment complexes) and I get a "Values are not uniquely identified" error.

I’m not sure where to go next. Is this problem not fixable with reshaping if the addresses aren’t all unique? Is there another option I should research? Thank you for your advice!

>Solution :

You’re on the right track, but there are three more things you need:

You need to tell pivot_wider() which column to put each value from Address into — e.g., that we want "352TRM" to end up in ID Number but "123 Street Rd" to end up in Address. This is the purpose of the names_from argument. Since the ID numbers are all in rows where the other columns are NA, we can create a names_from column based on that.
Columns that aren’t specified in names_from or values_from and that you want in the same row post-pivoting should have the same values. We can accomplish this by filling NAs with the values from the row below, using tidyr::fill(.direction = "up").
The above two are enough for your example data… but if there are repeated addresses in your actual data, you’ll want to assign a unique ID to each set of rows that should be combined in the result.

library(tidyverse)

data %>% 
  mutate(
    Row = (row_number() + 1) %/% 2,     # unique destination ID
    Variable = if_else(is.na(Builder), "ID Number", "Address")
  ) %>% 
  fill(Builder, Status, .direction = "up") %>% 
  pivot_wider(names_from = Variable, values_from = Address) %>%
  select(!Row)

#> # A tibble: 2 x 4
#>   Builder         Status   `ID Number` Address      
#>   <chr>           <chr>    <chr>       <chr>        
#> 1 I Build Houses  Complete 352TRM      123 Street Rd
#> 2 We Build Houses Complete 486JRT      4321 Road St

^{Created on 2022-03-06 by the reprex package (v2.0.1)}