Caret rfe() error "there should be the same number of samples in x and y"

I am getting the error "there should be the same number of samples in x and y" when calling rfe() from caret. Others have posted about this error on this site, but their solutions have not worked for me. An abbreviated version of my dataset is below.

x_train is here:

x_train <- structure(list(laterality = c("Left", "Right", "Right", "Right", 
"Left", "Left", "Left", "Left", "Left", "Right"), age = c(66L, 
56L, 69L, 49L, 60L, 70L, 58L, 53L, 59L, 64L), insurance = c("MEDICARE", 
"UNITED", "MEDICARE", "UNITED", "COMMERCIAL", "MEDICARE", "AETNA", 
"AETNA", "OXFORD", "MEDICARE_MANAGED"), employment = c("Retired", 
"FullTime", "Retired", "FullTime", "Disabled", "SelfEmployed", 
"Retired", "FullTime", "FullTime", "Disabled"), sex = c("Female", 
"Male", "Female", "Female", "Female", "Female", "Male", "Male", 
"Female", "Male"), race = c("WhiteorCaucasian", "WhiteorCaucasian", 
"WhiteorCaucasian", "WhiteorCaucasian", "WhiteorCaucasian", "WhiteorCaucasian", 
"Other", "BlackorAfricanAmerican", "WhiteorCaucasian", "WhiteorCaucasian"
), ethnicity = c("NotHispanicorLatino", "NotHispanicorLatino", 
"NotHispanicorLatino", "NotHispanicorLatino", "NotHispanicorLatino", 
"NotHispanicorLatino", "NotHispanicorLatino", "NotHispanicorLatino", 
"NotHispanicorLatino", "NotHispanicorLatino"), bmi = c(22.3, 
33, 34.3, 36, 30, 20, 29.5, 33.4, 26.5, 34.2), PreferredLanguage = c("English", 
"English", "English", "English", "English", "English", "English", 
"English", "English", "English"), married = c("Married", "Married", 
"Married", "Married", "Married", "Married", "Divorced", "Single", 
"Married", "Married"), RadiographSevere = c("No", "No", "No", 
"No", "No", "No", "No", "No", "No", "No"), HxAnxietyDepression = c("No", 
"No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"), SurgeryYear = c(2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L
), operativetime = c(82L, 79L, 85L, 76L, 84L, 86L, 67L, 75L, 
72L, 100L), HipApproach = c("Anterior", "Posterior", "Posterior", 
"Posterior", "Posterior", "Anterior", "Posterior", "Posterior", 
"Posterior", "Posterior")), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))


y_train is here:

y_train <- structure(list(POD1AverageNrsScoreCut = c("[0,5)", "[0,5)", "[0,5)", 
                                          "[0,5)", "[5,10)", "[0,5)", "[0,5)", "[5,10)", "[0,5)", "[0,5)"
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))


Code I am using for rfe is here:

library(caret)
control <- rfeControl(functions = rfFuncs, # random forest
                      method = "repeatedcv", # repeated cv
                      repeats = 3, # number of repeats
                      number = 10) # number of folds

result_rfe <- rfe(x = x_train, y = y_train, sizes = c(1:30), rfeControl = control)

Solution:

Your outcome has two classes, given as interval labels, but y_train is a one-column data.table rather than a vector. Try passing it as a factor vector, y = as.factor(unlist(y_train)). That worked for me:

library(caret)

control <- rfeControl(functions = rfFuncs, # random forest
                      method = "repeatedcv", # repeated cv
                      repeats = 3, # number of repeats
                      number = 10) # number of folds

# Flatten the one-column table into a factor vector with one value per row of x_train
y_factor <- as.factor(unlist(y_train))

result_rfe <- rfe(x = x_train, y = y_factor, sizes = 1:30, rfeControl = control)

Output:

> result_rfe

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 3 times) 

Resampling performance over subset size:

 Variables Accuracy Kappa AccuracySD KappaSD Selected
         1  0.06667     0     0.2537       0         
         2  0.06667     0     0.2537       0         
         3  0.30000     0     0.4661       0         
         4  0.20000     0     0.4068       0         
         5  0.36667     0     0.4901       0         
         6  0.40000     0     0.4983       0         
         7  0.43333     0     0.5040       0         
         8  0.53333     0     0.5074       0        *
         9  0.30000     0     0.4661       0         
        10  0.33333     0     0.4795       0         
        11  0.20000     0     0.4068       0         
        12  0.26667     0     0.4498       0         
        13  0.06667     0     0.2537       0         
        14  0.13333     0     0.3457       0         
        15  0.20000     0     0.4068       0         

The top 5 variables (out of 8):
   insurance, laterality, HipApproach, employment, ethnicity

Note: I don’t know whether this is what you expected, since I don’t know the context of your data or your approach.
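
As for why the original call fails, here is a minimal base-R sketch. It uses a toy one-column data frame as a stand-in for y_train (the column name and values are illustrative, not the real data). length() on a data frame returns its number of columns, so a 10-row, one-column outcome has length 1, which cannot match the 10 rows of x.

```r
# Toy stand-in for y_train: one column, 10 rows (illustrative values only).
y_df <- data.frame(score = rep(c("[0,5)", "[5,10)"), times = c(8, 2)))

length(y_df)  # 1: length() of a data frame counts columns
nrow(y_df)    # 10: the number of samples

# unlist() flattens the single column into a plain vector;
# as.factor() marks it as a classification outcome.
y_vec <- as.factor(unlist(y_df))

length(y_vec)  # 10: now one outcome value per sample
```

A quick check such as stopifnot(nrow(x_train) == length(y)) before calling rfe() should catch this kind of shape mismatch early.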

Original answer:
Subscript out of bounds error in caret's rfe function
