Can someone explain why split is FALSE in the test set?

June 22, 2022

split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

>Solution :

I assume this code comes from the caTools documentation. Try running the first line on its own and looking at the result; it should start to make sense.

Basically, caTools::sample.split creates a random logical vector with one entry per element of the vector you pass in (so one per row of the dataset), holding TRUE and FALSE in roughly the given ratio. Let’s take the iris dataset as an example (it has 150 rows):

library(caTools)
split = sample.split(iris$Sepal.Length, SplitRatio = 2/3)

The result is a logical vector of 150 items, about 2/3 of them TRUE and 1/3 FALSE.
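
If you want to check this yourself, here is a minimal sketch. It assumes the caTools package is installed; the seed value is arbitrary and only there so the run is repeatable.

```r
library(caTools)  # provides sample.split (assumed to be installed)

set.seed(123)     # arbitrary seed, only so the result is repeatable
split = sample.split(iris$Sepal.Length, SplitRatio = 2/3)

class(split)      # "logical"
length(split)     # 150, one entry per row of iris
head(split)       # the first few TRUE/FALSE flags
table(split)      # how many FALSE and TRUE values, roughly 50 and 100
```

Note that split holds no data from iris at all; it is only a vector of flags saying which rows were picked.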

Next, you use the subset function twice: the rows i of iris where split[i] == TRUE become the training set, and the rows i where split[i] == FALSE become the test set.

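That is why you use split == TRUE for the training set and split == FALSE for the test set. FALSE does not mean anything is wrong with those rows; it simply marks the rows that were not picked for training, so they are the ones left over for testing.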
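
Putting the two steps together on iris, as a rough sketch (again assuming caTools is installed; the seed is arbitrary, and test_set_alt is just an illustrative name):

```r
library(caTools)  # assumed to be installed

set.seed(123)     # arbitrary seed for a repeatable split
split = sample.split(iris$Sepal.Length, SplitRatio = 2/3)

training_set = subset(iris, split == TRUE)   # rows flagged TRUE
test_set     = subset(iris, split == FALSE)  # all the remaining rows

# Every row lands in exactly one of the two sets
nrow(training_set) + nrow(test_set) == nrow(iris)   # TRUE

# The same test rows via plain logical indexing with the negated vector
test_set_alt = iris[!split, ]
all(rownames(test_set_alt) == rownames(test_set))   # TRUE
```

Because the two conditions are exact opposites, the training and test sets never overlap and together cover the whole dataset.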