Variable lengths differ error message in R

I am running a logistic regression on the spam dataset from https://hastie.su.domains/ElemStatLearn/. The dependent variable is in the last column, which is named V58 after I import the data into RStudio. But I get the error message "variable lengths differ (found for ‘V1’)".

I’ve checked for NAs using na.omit, and I tried removing the V1 column to see if that fixes the issue, but I get the same message. I tried using the old cv.glm function instead, but that does not work either. I also tried referring to the dependent variable as spam.data$V58. I am stuck. What am I missing? Below is the code I have so far, just to fit the model.

spam.data <- data_frame(read.table(datapathname))

dim(spam.data)
str(spam.data)
summary(spam.data)

set.seed(2718)
row.number = sample(1:nrow(spam.data), 0.7*nrow(spam.data))
train = spam.data[row.number,]
test = spam.data[-row.number,]
dim(train)
dim(test)

model.logistic = glm(as.factor(spam.data[58])~., data=train, family=binomial) #The error gets thrown here.

summary(model.logistic)

I should also say that I created the data set by copying the data from the website into a text file and reading it into R from that file. The info on the site says we should get 4601 rows and 58 columns, and that is indeed the size of the dataset we get.

Any ideas?

Solution:

There are two mistakes in the call that fits the logistic model:

  • The model is fit on train (data=train), but the outcome is taken from spam.data. These two data frames have different numbers of observations, which is what "variable lengths differ" is reporting. Instead, use train[58] as the outcome.
  • When you convert the outcome with as.factor, a factor with a single level NA is returned. The cause is not the dbl type of the column: train[58] is a one-column data frame rather than a plain vector, and as.factor does not convert that the way you expect. Therefore, first flatten the outcome to a vector with unlist(train[58]).
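The difference between single-bracket and double-bracket indexing can be checked on a small example. This is a minimal sketch on a made-up data frame (df and its columns a and b are assumptions for illustration, not part of the spam data):

```r
# Hypothetical three-row data frame standing in for train.
df <- data.frame(a = c(0, 1, 1), b = c(2.5, 3.1, 4.7))

# Single brackets keep the data frame structure: one column, so length 1.
single <- df[1]
is.data.frame(single)
length(single)

# Double brackets (or unlist) extract the column as a plain numeric vector.
vec <- df[[1]]
is.numeric(vec)
length(vec)

# as.factor on the plain vector gives the two expected levels.
outcome <- as.factor(vec)
levels(outcome)
```

Here length(single) is 1 while length(vec) is 3, which shows how an outcome built from a one-column data frame can end up with a different length from the predictors.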

With these two changes, the model worked for me:

model.logistic = glm(as.factor(unlist(train[58]))~., data=train, family=binomial)
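A possible alternative is to convert the outcome column inside the data frame and refer to it by name in the formula. As far as I can tell, the dot in a formula expands to every column of data that is not named on the left-hand side, so an expression such as as.factor(unlist(train[58])) may leave V58 among the predictors, whereas naming the column keeps it out. The sketch below uses simulated data, since the spam file is not reproduced here; the data frame toy, its columns V1 to V3, and the coefficients used to simulate the outcome are all assumptions for illustration.

```r
# Simulated stand-in for the spam data: two predictors and a 0/1 outcome (V3).
set.seed(2718)
n <- 200
toy <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
toy$V3 <- rbinom(n, size = 1, prob = plogis(0.8 * toy$V1 - 0.5 * toy$V2))

# 70/30 split, following the question's approach.
row.number <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
train <- toy[row.number, ]
test <- toy[-row.number, ]

# Convert the outcome in place, then name it on the left-hand side.
train$V3 <- as.factor(train$V3)
model.logistic <- glm(V3 ~ ., data = train, family = binomial)

# Inspect which terms ended up in the model.
coef.names <- names(coef(model.logistic))
print(coef.names)
```

With the real data the same pattern would be train$V58 <- as.factor(train$V58) followed by glm(V58 ~ ., data = train, family = binomial).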