Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Replace value in column by previous value having pattern

Let’s say we have data frame x defined as

x <- data.frame(a = c('Start : 20220101', '1', '1', '1', 'Start : 20220102', '2', '2', 'Start : 20220103', '3', '3'),
           b = c(NA, 200, 200, 200, NA, 200, 200, NA, 200, 200),
           c = c(NA, 1, 3, 5, NA, 2, 4, NA, 3, 5))

                  a   b  c
1  Start : 20220101  NA NA
2                 1 200  1
3                 1 200  3
4                 1 200  5
5  Start : 20220102  NA NA
6                 2 200  2
7                 2 200  4
8  Start : 20220103  NA NA
9                 3 200  3
10                3 200  5

I need to replace column a‘s value by previous Start : ...‘s ... which indicates it’s full date.

My desired output might make by problem more clear.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

         a     b     c
1 20220101   200     1
2 20220101   200     3
3 20220101   200     5
4 20220102   200     2
5 20220102   200     4
6 20220103   200     3
7 20220103   200     5

Data x always have patern with Start : YMD and D follows.

Original x have more than 10^8 rows, so I think it need to be very efficient.

Any help would be grateful.

What I tried is

library(dplyr)
library(data.table)
library(readr)

x %>%
  mutate(d = floor((rleid(a)+1)/2))  %>%
  group_by(d) %>%
  mutate(a = first(parse_number(a))) %>%
  na.omit() %>%
  ungroup %>%
  select(-d)

>Solution :

Here is one data.table solution with zoo::na.locf to fill NA values.

library(data.table)

setDT(x)
#Change all the a values to NA except the ones that start with "Start"
x[, a := replace(a, !grepl('^Start', a), NA)]
#Remove "Start" from a so only the date remains. 
x[, a := sub('Start\\s*:\\s*', '', a)]
#Replace NA with latest non-NA values.
zoo::na.locf(x)

#          a   b c
#1: 20220101 200 1
#2: 20220101 200 3
#3: 20220102 200 3
#4: 20220102 200 2
#5: 20220102 200 4
#6: 20220103 200 4
#7: 20220103 200 3

A tidyverse solution for the same would be –

library(dplyr)
library(tidyr)

x %>%
  mutate(a = replace(a, !grepl('^Start', a), NA)) %>%
  fill(everything(), .direction = "downup") %>%
  mutate(a = sub('Start\\s*:\\s*', '', a))
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading