Optimize loop for index creation in R

I have a dataframe with records of trawl stations in different regions. I need to create a sequential index that changes every time the region changes. I’ve implemented it with a for loop, but I have about 60000 records, so it’s super slow. Any idea on how to do it faster?
Please note that I cannot simply group by region, because I need to keep the region in the order they were sampled.

Here is my solution:

# Create dataframe
df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))

# create index column
df$region_id <- 1

# loop through each row to check if different from previous row. If different the id changes.
for(ii in 2:nrow(df)){
  if(df$region[ii]!=df$region[ii-1]) {
    df$region_id[ii] <- df$region_id[ii-1]+1
  } else {
    df$region_id[ii] <- df$region_id[ii-1]
  }
}

And I get something like this:

   region       date region_id
1       A 2020-03-20         1
2       A 2020-03-21         1
3       A 2020-03-23         1
4       B 2020-03-25         2
5       B 2020-03-27         2
6       B 2020-03-29         2
7       B 2020-03-31         2
8       B 2020-04-02         2
9       B 2020-04-04         2
10      B 2020-04-06         2
11      B 2020-04-08         2
12      C 2020-04-10         3
13      C 2020-04-12         3
14      A 2020-04-14         4
15      A 2020-04-16         4
16      A 2020-04-18         4
17      A 2020-04-20         4
18      A 2020-04-22         4
19      A 2020-04-24         4
20      A 2020-04-26         4
21      C 2020-04-28         5
22      C 2020-04-30         5
23      C 2020-05-02         5

For 56373 records this takes:

  user  system elapsed 
  36.70    0.20   36.91 

Any help will be much appreciated.

Thanks

>Solution :

Here is a base R way with rle.

df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),
                 date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))

r <- rle(df$region)
r$values <- seq_along(r$values)
inverse.rle(r)
#>  [1] 1 1 1 2 2 2 2 2 2 2 2 3 3 4 4 4 4 4 4 4 5 5 5

df$region_id <- inverse.rle(r)
df
#>    region       date region_id
#> 1       A 2020-03-20         1
#> 2       A 2020-03-21         1
#> 3       A 2020-03-23         1
#> 4       B 2020-03-25         2
#> 5       B 2020-03-27         2
#> 6       B 2020-03-29         2
#> 7       B 2020-03-31         2
#> 8       B 2020-04-02         2
#> 9       B 2020-04-04         2
#> 10      B 2020-04-06         2
#> 11      B 2020-04-08         2
#> 12      C 2020-04-10         3
#> 13      C 2020-04-12         3
#> 14      A 2020-04-14         4
#> 15      A 2020-04-16         4
#> 16      A 2020-04-18         4
#> 17      A 2020-04-20         4
#> 18      A 2020-04-22         4
#> 19      A 2020-04-24         4
#> 20      A 2020-04-26         4
#> 21      C 2020-04-28         5
#> 22      C 2020-04-30         5
#> 23      C 2020-05-02         5
က
#> Error in eval(expr, envir, enclos): object 'က' not found

Created on 2023-03-15 with reprex v2.0.2


Edit

With package data.table there is also rleid. This will advantageous with bigger data.frames.

library(data.table)

dt1 <- as.data.table(df)
dt1[, region_id := rleid(region)]
dt1
#>     region       date region_id
#>  1:      A 2020-03-20         1
#>  2:      A 2020-03-21         1
#>  3:      A 2020-03-23         1
#>  4:      B 2020-03-25         2
#>  5:      B 2020-03-27         2
#>  6:      B 2020-03-29         2
#>  7:      B 2020-03-31         2
#>  8:      B 2020-04-02         2
#>  9:      B 2020-04-04         2
#> 10:      B 2020-04-06         2
#> 11:      B 2020-04-08         2
#> 12:      C 2020-04-10         3
#> 13:      C 2020-04-12         3
#> 14:      A 2020-04-14         4
#> 15:      A 2020-04-16         4
#> 16:      A 2020-04-18         4
#> 17:      A 2020-04-20         4
#> 18:      A 2020-04-22         4
#> 19:      A 2020-04-24         4
#> 20:      A 2020-04-26         4
#> 21:      C 2020-04-28         5
#> 22:      C 2020-04-30         5
#> 23:      C 2020-05-02         5
#>     region       date region_id

Created on 2023-03-15 with reprex v2.0.2

Leave a Reply