Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

R: Tidyverse – Merging only the dublicate columns in a tibble

Some columns of my tibble are torn into two columns. I would like to merge them back together. The dublicate columns have the same name and read_delim() adds "…2" and "…3" to have identical column names. There shouldn’t be two numerical values in a dublicate column, but it would be nice, if the code could handle this exception (there the mean of both would be nice). It frequently occurs, that both dublicate columns contain NAs. Some columns occur only once (like Date&Time, PYRANO#1, …). "Date&Time" is the only consistent column without NAs.

The data looks like this:

head(df)

A tibble: 6 × 10

  | ------------------- | ------------------- | ------------------ | -----------------
  | `Date&Time`         |`SNOWDEPTH#1#HS...2` |`SNOWDEPTH#1#HS...3`| `PYRANO#1#RSWR…`
  | <dttm>              |               <dbl> |           <dbl>    |        <dbl>
  |:-------------------:|:-------------------:|:------------------:|:----------------- 
1 | 1997-11-19 16:30:00 |                   0 |               NA   |            NA
2 | 1997-11-19 17:00:00 |                  NA |               10   |            NA
3 | 1997-11-19 17:30:00 |                   9 |               NA   |            NA
4 | 1997-11-19 18:00:00 |                  NA |               NA   |            NA
5 | 1997-11-19 18:30:00 |                   9 |               NA   |            NA
6 | 1997-11-19 19:00:00 |                   9 |               NA   |            NA

# with 6 more variables: `MODEL_SNOWPACK#1#SWE` <dbl>,
#   `THERMO_HYGRO#1#TA_30MIN_MEAN...6` <dbl>,
#   `THERMO_HYGRO#1#TA_30MIN_MEAN...7` <dbl>,
#   `IRTHERMO#1#TSS_30MIN_MEAN...8` <dbl>,
#   `IRTHERMO#1#TSS_30MIN_MEAN...9` <dbl>,
#   `SNOWTHERMO#1#TS0_30MIN_MEAN` <dbl>

I would like to use a for-loop to loop through many of these files, but unfortunately the dublicate columns aren’t always the same. Ideally the code should find dublicate columns and merge them automatically.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

What I have tried so far:

substr(colnames(df), 1, 7)

[1] "Date&Ti" "SNOWDEP" "SNOWDEP" "PYRANO#" "MODEL_S" "THERMO_" "THERMO_"
[8] "IRTHERM" "IRTHERM" "SNOWTHE"

df %>% 
      group_by(., substr(colnames(.), 1, 7), na.rm=TRUE) %>% 
      summarise_all()

Error in group_by():
! Problem adding computed columns.
Caused by error in mutate():
! Problem while computing ..1 = substr(colnames(.), 1, 7).
..1 must be size 407400 or 1, not 10.
Run rlang::last_error() to see where the error occurred.

Thanks a lot for your help!

>Solution :

A possible solution is to turn it into a long format, remove the "…" (etc.) and then transform it back to wide format with a function:

library(tidyverse)

df |>
  pivot_longer(-date) |>
  mutate(name = str_remove(name, "\\.\\.\\.\\d")) |>
  pivot_wider(values_fn = ~ mean(., na.rm = TRUE))

Output:

# A tibble: 6 × 2
   date `SNOWDEPTH#1#HS`
  <int>            <dbl>
1     1                0
2     2               10
3     3                9
4     4              NaN
5     5                9
6     6                9

And some data:

df <- tibble(date = 1:6,
             `SNOWDEPTH#1#HS...2` = c(0, NA, 9, NA, 9, 9),
             `SNOWDEPTH#1#HS...3` = c(NA, 10, NA, NA, NA, NA)) 
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading