Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

What does the summary function do to the output of regsubsets?

Let me preface this by saying that I do think this question is a coding question, not a statistics question. It would almost surely be closed over at Stats.SE.

The leaps package in R has a useful function for model selection called regsubsets which, for any given size of a model, finds the variables that produce the minimum residual sum of squares. Now I am reading the book Linear Models with R, 2nd Ed., by Julian Faraway. On pages 154-5, he has an example of using the AIC for model selection. The complete code to reproduce the example runs like this:

data(state)
statedata = data.frame(state.x77, row.names=state.abb)
require(leaps)
b = regsubsets(Life.Exp~.,data=statedata)
rs = summary(b)
rs$which
AIC = 50*log(rs$rss/50) + (2:8)*2
plot(AIC ~ I(1:7), ylab="AIC", xlab="Number of Predictors")

The rs$which command produces the output of the regsubsets function and allows you to select the model once you’ve plotted the AIC and found the number of parameters that minimizes the AIC. But here’s the problem: while the typed-up example works fine, I’m having trouble with the wrong number of elements in the array when I try to use this code and adapt it to other data. For example:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

require(faraway)
data(odor, package='faraway')
b=regsubsets(odor~temp+gas+pack+
              I(temp^2)+I(gas^2)+I(pack^2)+
              I(temp*gas)+I(temp*pack)+I(gas*pack),data=odor)
rs=summary(b)
rs$which
AIC=50*log(rs$rss/50) + (2:10)*2

produces a warning message:

Warning message:
In 50 * log(rs$rss/50) + (2:10) * 2 :
  longer object length is not a multiple of shorter object length

Sure enough, length(rs$rss)=8, but length(2:10)=9. Now what I need to do is model selection, which means I really ought to have an RSS value for each model size. But if I choose b$rss in the AIC formula, it doesn’t work with the original example!

So here’s my question: what is summary() doing to the output of the regsubsets() function? The number of RSS values is not only not the same, but the values themselves are not the same.

>Solution :

regsubsets has a nvmax parameter to control the "maximum size of subsets to examine". By default this is 8. If you increase it to 9 or higher, your code works.

Please note though, that the 50 in your AIC formula is the sample size (i.e. 50 states in statedata). So for your second example, this should be nrow(odor), so 15.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading