
How to create a test and train dataset in R by specifying the range in the data set instead of using set.seed() function and probability?

I am a noob at programming, sorry if this is a silly question.

My supervisor doesn’t seem to trust the set.seed() function in R, because every seed value yields a different output (with different test and train sets). She therefore asked me to specify the row range for my training and test datasets.

I am fitting a binary logistic regression model in R with a sample size of 1790. There are 8 independent variables in my model. I want to do a 70/30 split for train and test data. I did it using these lines of code the first time:

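RLV <- read.csv(file.choose(), header = TRUE)

# Fix the random number generator so the same split is drawn on every run
set.seed(123)
# Assign each row to group 1 (about 70%) or group 2 (about 30%)
index <- sample(2, nrow(RLV), replace = TRUE, prob = c(0.7, 0.3))
train <- RLV[index == 1, ]
test <- RLV[index == 2, ]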

But if I change 123 to, say, 1234, the output is similar but not exactly the same as the previous one (and yes, I know that’s the point).
My supervisor wants me to train using the data obtained on Day 1 and Day 2 and test (validate) using the data from Day 3 (that was my initial plan as well).

Thus after intense brainstorming I came up with these lines of code…

RLV <- read.csv(file.choose(), header = TRUE)

# Rows 1 to 1253 go to training, rows 1254 to 1790 go to testing
train <- RLV[1:1253, ]
test <- RLV[1254:1790, ]
head(test)

I want all the rows from 1 to 1253 (all columns too) in my train dataset and all the rows from 1254 to 1790 in my test (validation) dataset.

I checked using the head function and it does seem to work. But I am on the fence here. Can someone please clarify how this works, or whether it’s even right (lol)? I just want to complete this project without any hassle.

Thanks a bunch.

Solution:

As you said: It does work. head() shows you the first six rows of a dataframe. So you should get rows 1254, 1255, 1256, 1257, 1258, and 1259 from your ‘test set’ after head(test).
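
If you want more than a visual check, something like the following sketch could confirm the split. It uses a made-up data frame in place of your RLV file, so the columns id and y are assumptions for illustration only:

```r
# Hypothetical stand-in for RLV: 1790 rows, columns invented for illustration
RLV <- data.frame(id = 1:1790, y = rep(c(0, 1), length.out = 1790))

train <- RLV[1:1253, ]
test  <- RLV[1254:1790, ]

nrow(train)              # number of rows in the training set
nrow(test)               # number of rows in the test set
head(rownames(test))     # original row numbers of the first six test rows
nrow(train) / nrow(RLV)  # share of all rows that ended up in the training set
```

Because the subset keeps the original row names, rownames(test) tells you which rows of RLV ended up in the test set.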

It works because if you index a dataframe with [,], everything before the comma specifies row restrictions and everything after the comma specifies column restrictions. You indexed by row number. It would also be possible to index by a logical vector. For example, RLV[RLV$Day %in% 1:2,] would give you all cases from RLV where (the hypothetical) column Day holds the value 1 or 2.
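
To illustrate the logical-vector variant, here is a minimal sketch. It assumes your data has a column that records the day; the column name Day and the per-day row counts below are invented for the example and are not taken from your file:

```r
# Hypothetical data: the Day column and the row counts per day are assumptions
RLV <- data.frame(
  Day = rep(1:3, times = c(600, 653, 537)),
  y   = rep(c(0, 1), length.out = 1790)
)

in_train <- RLV$Day %in% 1:2  # logical vector: TRUE for Day 1 and Day 2 rows

train <- RLV[in_train, ]      # rows where in_train is TRUE, all columns
test  <- RLV[!in_train, ]     # the remaining rows (Day 3), all columns

table(train$Day)              # row counts per day in the training set
table(test$Day)               # row counts per day in the test set
```

Selecting by the value of a column does not depend on the order of the rows, whereas RLV[1:1253, ] only gives you Day 1 and Day 2 if the file happens to be sorted by day.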

If this doesn’t answer your question(s), please specify what you mean by "how this works" and "if it’s even right" 😉
