I really hope that title makes sense. I’ve been given a data set pulled from an online source that comes out a bit broken. The file should have four columns and look like this:
| ID Number | Address | Builder | Status |
|---|---|---|---|
| 352TRM | 123 Street Rd | I Build Houses | Complete |
| 486JRT | 4321 Road St | We Build Houses | Complete |
Instead, the ID Number column is missing and the value of the ID Number is listed in a separate row above the rest of the data like this:
| Address | Builder | Status |
|---|---|---|
| 352TRM | ||
| 123 Street Rd | I Build Houses | Complete |
| 486JRT | ||
| 4321 Road St | We Build Houses | Complete |
I done research and I think that essentially I need to make a long data set into a wide one. I see that there are lots of functions like spread() and pivot_wider().
For those functions I think it should be something like this:
data %>% pivot_wider(names_from = Address, values_from = Address)
but the issue is that some of the addresses are repeated (apartment complexes) and I get a "Values are not uniquely identified" error.
I’m not sure where to go next. Is this problem not fixable with reshaping if the addresses aren’t all unique? Is there another option I should research? Thank you for your advice!
>Solution :
You’re on the right track, but there are three more things you need:
- You need to tell
pivot_wider()which column to put each value fromAddressinto — e.g., that we want"352TRM"to end up inID Numberbut"123 Street Rd"to end up inAddress. This is the purpose of thenames_fromargument. Since the ID numbers are all in rows where the other columns areNA, we can create anames_fromcolumn based on that. - Columns that aren’t specified in
names_fromorvalues_fromand that you want in the same row post-pivoting should have the same values. We can accomplish this by fillingNAs with the values from the row below, usingtidyr::fill(.direction = "up"). - The above two are enough for your example data… but if there are repeated addresses in your actual data, you’ll want to assign a unique ID to each set of rows that should be combined in the result.
library(tidyverse)
data %>%
mutate(
Row = (row_number() + 1) %/% 2, # unique destination ID
Variable = if_else(is.na(Builder), "ID Number", "Address")
) %>%
fill(Builder, Status, .direction = "up") %>%
pivot_wider(names_from = Variable, values_from = Address) %>%
select(!Row)
#> # A tibble: 2 x 4
#> Builder Status `ID Number` Address
#> <chr> <chr> <chr> <chr>
#> 1 I Build Houses Complete 352TRM 123 Street Rd
#> 2 We Build Houses Complete 486JRT 4321 Road St
Created on 2022-03-06 by the reprex package (v2.0.1)