Optimize loop for index creation in R

I have a data frame with records of trawl stations in different regions. I need to create a sequential index that increments every time the region changes from one row to the next. I've implemented it with a for loop, but with about 60000 records it is very slow. Any idea how to do it faster?
Please note that I cannot simply group by region, because I need to keep the regions in the order they were sampled, and a region that is revisited later must get a new index.
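
To illustrate why plain grouping is not enough, here is a minimal sketch. The vector x and the names by_group and wanted are made up for this illustration and are not part of the real data.

```r
# Hypothetical toy vector: region "A" is sampled, then "B", then "A" again
x <- c("A", "A", "B", "B", "A")

# Grouping by region (here via factor codes) gives both runs of "A" the same id
by_group <- as.integer(factor(x))
by_group
#> [1] 1 1 2 2 1

# What is needed instead: a new id for every run, in sampling order
wanted <- c(1L, 1L, 2L, 2L, 3L)
```

The second visit to "A" has to become 3, which an id derived from the region label alone cannot express.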

Here is my solution:

# Create dataframe
df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),
                 date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))

# create index column
df$region_id <- 1

# loop through each row to check if different from previous row. If different the id changes.
for(ii in 2:nrow(df)){
  if(df$region[ii]!=df$region[ii-1]) {
    df$region_id[ii] <- df$region_id[ii-1]+1
  } else {
    df$region_id[ii] <- df$region_id[ii-1]
  }
}

And I get something like this:

   region       date region_id
1       A 2020-03-20         1
2       A 2020-03-21         1
3       A 2020-03-23         1
4       B 2020-03-25         2
5       B 2020-03-27         2
6       B 2020-03-29         2
7       B 2020-03-31         2
8       B 2020-04-02         2
9       B 2020-04-04         2
10      B 2020-04-06         2
11      B 2020-04-08         2
12      C 2020-04-10         3
13      C 2020-04-12         3
14      A 2020-04-14         4
15      A 2020-04-16         4
16      A 2020-04-18         4
17      A 2020-04-20         4
18      A 2020-04-22         4
19      A 2020-04-24         4
20      A 2020-04-26         4
21      C 2020-04-28         5
22      C 2020-04-30         5
23      C 2020-05-02         5

For 56373 records this takes:

  user  system elapsed 
  36.70    0.20   36.91 

Any help will be much appreciated.

Thanks

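Solution:

Here is a base R way with rle. rle() run-length encodes the region column into consecutive runs; replacing each run's value with its position in the sequence and then inverting the encoding gives the index.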

df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),
                 date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))

r <- rle(df$region)
r$values <- seq_along(r$values)
inverse.rle(r)
#>  [1] 1 1 1 2 2 2 2 2 2 2 2 3 3 4 4 4 4 4 4 4 5 5 5

df$region_id <- inverse.rle(r)
df
#>    region       date region_id
#> 1       A 2020-03-20         1
#> 2       A 2020-03-21         1
#> 3       A 2020-03-23         1
#> 4       B 2020-03-25         2
#> 5       B 2020-03-27         2
#> 6       B 2020-03-29         2
#> 7       B 2020-03-31         2
#> 8       B 2020-04-02         2
#> 9       B 2020-04-04         2
#> 10      B 2020-04-06         2
#> 11      B 2020-04-08         2
#> 12      C 2020-04-10         3
#> 13      C 2020-04-12         3
#> 14      A 2020-04-14         4
#> 15      A 2020-04-16         4
#> 16      A 2020-04-18         4
#> 17      A 2020-04-20         4
#> 18      A 2020-04-22         4
#> 19      A 2020-04-24         4
#> 20      A 2020-04-26         4
#> 21      C 2020-04-28         5
#> 22      C 2020-04-30         5
#> 23      C 2020-05-02         5

Created on 2023-03-15 with reprex v2.0.2
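
The same index can probably also be built without rle, by flagging the rows where the region differs from the previous row and taking a cumulative sum of those flags. This is only a sketch of that alternative, not part of the original answer. It assumes the data frame has at least one row and that region contains no NA values; the names n and changed are introduced here for illustration.

```r
# Same example data as above (date column omitted, it is not needed here)
df <- data.frame(region = c(rep("A", 3), rep("B", 8), rep("C", 2),
                            rep("A", 7), rep("C", 3)))

n <- nrow(df)

# TRUE for the first row and for every row whose region differs from the row before
# (assumes n >= 1 and no NA in region)
changed <- c(TRUE, df$region[-1] != df$region[-n])

# Each TRUE starts a new run, so the running count of TRUEs is the run index
df$region_id <- cumsum(changed)

# Cross-check against the rle-based result
r <- rle(df$region)
r$values <- seq_along(r$values)
stopifnot(all(df$region_id == inverse.rle(r)))
```

One point to keep in mind: rle() expects an atomic vector, so if region is stored as a factor (the default for data.frame() before R 4.0.0), convert it first with as.character().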


Edit

The data.table package also provides rleid, which does this in a single call. This will be advantageous with larger data frames.

library(data.table)

dt1 <- as.data.table(df)
dt1[, region_id := rleid(region)]
dt1
#>     region       date region_id
#>  1:      A 2020-03-20         1
#>  2:      A 2020-03-21         1
#>  3:      A 2020-03-23         1
#>  4:      B 2020-03-25         2
#>  5:      B 2020-03-27         2
#>  6:      B 2020-03-29         2
#>  7:      B 2020-03-31         2
#>  8:      B 2020-04-02         2
#>  9:      B 2020-04-04         2
#> 10:      B 2020-04-06         2
#> 11:      B 2020-04-08         2
#> 12:      C 2020-04-10         3
#> 13:      C 2020-04-12         3
#> 14:      A 2020-04-14         4
#> 15:      A 2020-04-16         4
#> 16:      A 2020-04-18         4
#> 17:      A 2020-04-20         4
#> 18:      A 2020-04-22         4
#> 19:      A 2020-04-24         4
#> 20:      A 2020-04-26         4
#> 21:      C 2020-04-28         5
#> 22:      C 2020-04-30         5
#> 23:      C 2020-05-02         5
#>     region       date region_id

Created on 2023-03-15 with reprex v2.0.2
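
To see how the two approaches behave at the size mentioned in the question, something along these lines could be timed locally. This is a sketch under assumptions: the data frame big is synthetic (random regions, arbitrary seed), so the run structure will differ from the real trawl data, and no timings are claimed here. The data.table part only runs if that package is installed.

```r
set.seed(1)    # arbitrary seed, only for reproducibility of the synthetic data
n <- 56373     # row count quoted in the question

# Synthetic stand-in for the real data (assumption: three region labels)
big <- data.frame(region = sample(c("A", "B", "C"), n, replace = TRUE),
                  stringsAsFactors = FALSE)

# Time the base R rle approach
t_rle <- system.time({
  r <- rle(big$region)
  r$values <- seq_along(r$values)
  id_rle <- inverse.rle(r)
})
cat("rle elapsed (s):", t_rle[["elapsed"]], "\n")

# Time data.table::rleid, if the package is available, and check both agree
if (requireNamespace("data.table", quietly = TRUE)) {
  t_dt <- system.time(id_dt <- data.table::rleid(big$region))
  cat("rleid elapsed (s):", t_dt[["elapsed"]], "\n")
  stopifnot(all(id_rle == id_dt))
}
```

For more stable numbers, repeat the timing several times, since a single system.time() call on a vector of this size can be dominated by noise.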
