I am working with the R programming language.
Suppose I have the following problem:
- There is a coin where if it lands head then the probability of the next flip being heads is 0.6 (and if tails then the next flip being tails is also 0.6)
- There are 100 students in a class
- Each student flips this coin a random number of times
- The last flip of student_n does not influence the first flip of student_n+1 (i.e. when the next student flips the coin, the first flip has 0.5 probability of heads or tails, but the next flip for this student depends on the previous flip)
I am trying to write R code to represent this problem.
First I defined the variables:
library(dplyr)
library(stringr)
# generate data
set.seed(123)
ids <- 1:100
student_id <- sample(ids, 100000, replace = TRUE)
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)
Next, I tried to write the flipping process:
for (i in 2:length(coin_result)) {
if (student_id[i] != student_id[i-1]) {
coin_result[i] <- sample(c("H", "T"), 1)
} else if (coin_result[i-1] == "H") {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
} else {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
}
}
#tidy up
my_data <- data.frame(student_id, coin_result)
my_data <- my_data[order(my_data$student_id),]
Finally, I tried to verify the results:
my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence)
Even though the code ran, I don’t think my code is correct – when I look at the results:
# A tibble: 4 x 2
Sequence n
<chr> <int>
1 HH 23810
2 HT 25043
3 TH 25042
4 TT 26005
I think if I was correct, HH should have been significantly greater than HT , and TT should have been significantly greater than TH.
Can someone please tell me if I have done this correctly and how to correct it?
Thanks!
>Solution :
I think you need to sort the student_id vector before the loop, so that your comparison of student_id[i] != student_id[i-1] would be valid. Otherwise, it’s not catching consecutive flips from the same student.
The result seems to make sense, where HH and TT together occupies 60.4% of the total flips.
library(tidyverse)
set.seed(123)
ids <- 1:100
# only the following line was changed, all other lines are same as your code
student_id <- sort(sample(ids, 100000, replace = TRUE))
coin_result <- character(1000)
coin_result[1] <- sample(c("H", "T"), 1)
for (i in 2:length(coin_result)) {
if (student_id[i] != student_id[i-1]) {
coin_result[i] <- sample(c("H", "T"), 1)
} else if (coin_result[i-1] == "H") {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.6, 0.4))
} else {
coin_result[i] <- sample(c("H", "T"), 1, prob = c(0.4, 0.6))
}
}
#tidy up
my_data <- data.frame(student_id, coin_result)
my_data <- my_data[order(my_data$student_id),]
my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result)), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence)
# A tibble: 4 × 2
Sequence n
<chr> <int>
1 HH 29763
2 HT 19782
3 TH 19775
4 TT 30580