Say I have the following data:
df <- structure(list(treat = structure(1:4, levels = c("Control", "Treatment 1",
"Treatment 2", "Treatment 3"), class = "factor"),
female_n = c(314709L, 10456L, 10481L, 10455L), female_mean = c(0.506,
0.506, 0.504, 0.5), female_sd = c(0.5, 0.5, 0.5, 0.5), birth_year_n = c(314709L,
10456L, 10481L, 10455L), birth_year_mean = c(1973.74, 1973.654,
1973.486, 1973.766), birth_year_sd = c(16.867, 16.997, 16.869,
16.89), provided_phone_no_n = c(314709L, 10456L, 10481L,
10455L), provided_phone_no_mean = c(0.656, 0.666, 0.663,
0.647), provided_phone_no_sd = c(0.475, 0.472, 0.473, 0.478
), dem_n = c(314709L, 10456L, 10481L, 10455L), dem_mean = c(0.48,
0.474, 0.482, 0.478), dem_sd = c(0.5, 0.499, 0.5, 0.5), rep_n = c(314709L,
10456L, 10481L, 10455L), rep_mean = c(0.136, 0.141, 0.142,
0.138), rep_sd = c(0.343, 0.348, 0.349, 0.345), uaf_n = c(314709L,
10456L, 10481L, 10455L), uaf_mean = c(0.363, 0.365, 0.357,
0.363), uaf_sd = c(0.481, 0.481, 0.479, 0.481)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I want to add in a new *_se
column which takes as inputs the existing *_n
and *_sd
columns for each of the variable groups in my data. I.e. one for each of female_*
, birth_year_*
, provided_phone_no_*
, dem_*
, rep_*
, and uaf_*
.
Attempting to do this, I think mutate(across())
is probably the right helper function, but i’m having some issues substringing {.col}
and getting R to recognise it as a column name. This is my attempt so far:
df %>%
mutate(
across(ends_with("_sd"),
list(
se = ~.x / sqrt(!!ensym("{str_replace(.col, '_sd', '_n')}"))
)
)
The above returns the error:
Error in `ensym()`:
! `arg` must be a symbol
Backtrace:
1. ... %>% ...
10. rlang::abort(message = message)
Can anyone see where i’m going wrong here?
>Solution :
This is really the ideal situation to pivot your data to a long format and then pivot back to a wide format:
library(tidyr)
library(dplyr)
df |>
pivot_longer(cols = -treat,
names_pattern = "(.*)_(.*)",
names_to = c("grp", ".value")) |>
mutate(se = sd / sqrt(n)) |>
pivot_wider(names_from = grp,
values_from = n:se,
names_glue = "{grp}_{.value}",
names_vary = "slowest")
#> # A tibble: 4 × 25
#> treat female_n female_mean female_sd female_se birth_year_n birth_year_mean
#> <fct> <int> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 Control 314709 0.506 0.5 0.000891 314709 1974.
#> 2 Alexand… 10456 0.506 0.5 0.00489 10456 1974.
#> 3 Politic… 10481 0.504 0.5 0.00488 10481 1973.
#> 4 Mark yo… 10455 0.5 0.5 0.00489 10455 1974.
#> # ℹ 18 more variables: birth_year_sd <dbl>, birth_year_se <dbl>,
#> # provided_phone_no_n <int>, provided_phone_no_mean <dbl>,
#> # provided_phone_no_sd <dbl>, provided_phone_no_se <dbl>, dem_n <int>,
#> # dem_mean <dbl>, dem_sd <dbl>, dem_se <dbl>, rep_n <int>, rep_mean <dbl>,
#> # rep_sd <dbl>, rep_se <dbl>, uaf_n <int>, uaf_mean <dbl>, uaf_sd <dbl>,
#> # uaf_se <dbl>
If you really wanted to do it in wide format here is one possible solution:
library(dplyr)
library(stringr)
df |>
mutate(across(ends_with("_sd"), \(x) {
x / sqrt(pick(all_of(str_replace(cur_column(), "_sd", "_n")))[[1]])
},
.names = "{str_remove(.col, '_sd')}_se"
))