I have a dataset of U.S. cities that includes 400 variables (with both character and numeric values). I want each row to be a unique city, but the dataset contains duplicate rows for many cities. In these duplicate rows, the values that are missing in one row are non-missing in the other, and vice versa. The unique id for each city is geocode.
This is an example of what my data look like now:
geocode x y
8049 NA 7
8049 2 NA
2045 NA 9
2045 3 NA
And this is what I want it to look like:
geocode x y
8049 2 7
2045 3 9
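For reference, a minimal reproducible version of the example above could be built as follows (x and y stand in for the 400 real variables; this is an assumed construction for illustration, not the original data):

library(dplyr)

# Toy data matching the example above
df <- tibble(
  geocode = c(8049L, 8049L, 2045L, 2045L),
  x = c(NA, 2L, NA, 3L),
  y = c(7L, NA, 9L, NA)
)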
I’ve seen similar questions asked before, but none of the existing solutions have worked for me. This is what I’ve tried:
vars <- ls(df)
vars <- vars[vars != "geocode" & vars != "State_name"]
max <- function(x) max(x, na.rm = TRUE)
df = df %>%
  group_by(geocode) %>%
  mutate_at(vars, max)
However, I get the following error:
Error in `mutate()`:
...
Caused by error in `max()`:
! unused argument (na.rm = T)
And if I remove na.rm = TRUE from the function, I get the following error:
Error: C stack usage is too close to the limit
Any idea what I can do?
I’ve also tried limiting the vars list to only numeric variables, and I still get the same error.
Solution:
One way could be a quasi "rowwise coalesce":
library(dplyr)

# With summarise_all(), the function receives one column's values per group;
# coalesce() splices them and returns the first non-missing value.
coalesce_all_columns <- function(df) {
  coalesce(!!!as.list(df))
}

df %>%
  group_by(geocode) %>%
  summarise_all(coalesce_all_columns)
geocode x y
<int> <int> <int>
1 2045 3 9
2 8049 2 7
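As a side note, summarise_all() is superseded in dplyr 1.0+; if you prefer the current interface, the same idea can be written with across() (a sketch, equivalent in behavior to the code above):

df %>%
  group_by(geocode) %>%
  # apply the coalescing helper to every non-grouping column
  summarise(across(everything(), coalesce_all_columns))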
Another way could be using pivoting:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-geocode) %>%
  na.omit() %>%
  pivot_wider(names_from = name, values_from = value)
geocode y x
<int> <int> <int>
1 8049 7 2
2 2045 9 3
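One caveat with the pivoting approach: since your 400 variables mix character and numeric columns, pivot_longer() will refuse to stack them into a single value column. A possible workaround (a sketch, assuming you are willing to stack everything as character and re-infer the types afterwards) is:

library(dplyr)
library(tidyr)

df %>%
  # convert all stacked values to character so mixed types can share one value column
  pivot_longer(-geocode, values_transform = list(value = as.character)) %>%
  na.omit() %>%
  pivot_wider(names_from = name, values_from = value) %>%
  # re-infer types for the columns that came back from the pivot as character
  mutate(across(where(is.character), ~ type.convert(.x, as.is = TRUE)))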