# How do I use the means of the columns of a matrix as prediction values in a linear regression in R?

Problem Statement: Some near infrared spectra on 60 samples of gasoline and corresponding octane numbers can be found by `data(gasoline, package="pls")`. Compute the mean value for each frequency and predict the response for the best model using the five different methods from Question 4.

Note: This is Exercise 11.5 in Linear Models with R, 2nd Ed., by Julian Faraway. Also, the "five different methods from Question 4" are: linear regression with all predictors, linear regression with variables selected using AIC, principal component regression, partial least squares, and ridge regression.

My Work So Far: We do

```
require(pls)
data(gasoline, package="pls")
test_index = seq(1,nrow(gasoline),10)
train_index = 1:nrow(gasoline)
train_index = train_index[!train_index %in% test_index]
train_gas = gasoline[train_index,]
test_gas = gasoline[test_index,]
lmod = lm(octane~NIR,train_gas)
```

So far, so good. However, if I look at a summary of the model, I find that 348 coefficients are not defined because of singularities. (Why?) Moreover, massaging the mean values of the columns of the `NIR` matrix (the predictors) into an acceptable data frame is proving difficult.
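The singularities have a simple arithmetic explanation: with only 54 training rows, `lm()` can estimate at most 54 coefficients, so the remaining 402 − 54 = 348 come back `NA`. A quick self-contained check (re-creating the split from above):

```
library(pls)
data(gasoline, package = "pls")
test_index <- seq(1, nrow(gasoline), 10)
train_gas <- gasoline[-test_index, ]  # 54 training rows
lmod <- lm(octane ~ NIR, train_gas)
length(coef(lmod))      # 402: intercept + 401 NIR columns
sum(is.na(coef(lmod)))  # 348: coefficients lm() could not estimate
```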

My Question: How can I get to the point where the highly-fussy `predict` function will let me do something like this:

```
new_data = apply(train_gas$NIR, 2, mean)
*some code here*
predict(lmod, new_data)
```

?

Incidentally, as I have done a significant amount of moderating on Stats.SE, I can assert positively that this question would be closed on Stats.SE as being off-topic. It’s a "programming or data request", and hence unwelcome on Stats.SE.

I have also looked up a few related questions on SO, but nothing seems to fit exactly.

### Solution

This does seem pretty CrossValidated-ish to me … `gasoline` is a rather odd object, containing a ‘column’ (element) that is a 401-column matrix:

```
'data.frame':	60 obs. of  2 variables:
 $ octane: num  85.3 85.2 88.5 83.4 87.9 ...
 $ NIR   : 'AsIs' num [1:60, 1:401] -0.0502 -0.0442 -0.0469 -0.0467 -0.0509 ...
```

However, the fundamental problem is that this is a p >> n problem: there are 60 observations and 401 predictors, so a standard linear regression just won't make sense here, and you probably want a penalized approach like lasso or ridge (e.g. via `glmnet`). This is also why you get the undefined coefficients: without some kind of penalization, you can't estimate 402 coefficients (401 predictor slopes plus an intercept) from 60 observations.
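For the penalized route, here is a minimal ridge sketch with `glmnet` (assuming that package is installed; `alpha = 0` selects the ridge penalty, and the cross-validated result will vary slightly from run to run):

```
library(glmnet)
data(gasoline, package = "pls")
x <- unclass(gasoline$NIR)           # strip the "AsIs" class to get a plain matrix
y <- gasoline$octane
cvfit <- cv.glmnet(x, y, alpha = 0)  # cross-validation chooses the penalty lambda
pred <- predict(cvfit, newx = matrix(colMeans(x), nrow = 1), s = "lambda.min")
```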

That said, if we do want to hack this into a shape where we can fit the linear model and call `predict()` (however ill-advised):

```
NIR <- gasoline$NIR
class(NIR) <- "matrix" ## override "AsIs" class
g2 <- data.frame(octane = gasoline$octane, NIR)
dim(g2) ## 60 402 - now this is a 'regular' data frame
```
```
## using train_index from above
train_gas <- g2[train_index,]
lmod = lm(octane~., train_gas)
## drop first column (response); use `lapply()` to maintain list structure
new_data <- as.data.frame(lapply(train_gas[-1], mean))
predict(lmod, new_data)
##        1
## 87.16019
## Warning message:
## In predict.lm(lmod, new_data) :
##   prediction from a rank-deficient fit may be misleading
```

A slightly more direct approach (but still ugly) is to fit the model to the original weird structure and construct a prediction frame that matches that weird structure, i.e.

```
## note: train_gas was overwritten above, so go back to the original matrix-column data
pp <- data.frame(NIR = I(matrix(colMeans(gasoline$NIR[train_index, ]), nrow = 1)))
```
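Put together, a self-contained sketch of this matrix-structure route (re-deriving the split from the question) looks like:

```
library(pls)
data(gasoline, package = "pls")
test_index <- seq(1, nrow(gasoline), 10)
train_gas <- gasoline[-test_index, ]    # original matrix-column structure
lmod <- lm(octane ~ NIR, train_gas)     # model fitted to that structure
## prediction frame whose NIR 'column' is a 1 x 401 matrix, matching the fit
pp <- data.frame(NIR = I(matrix(colMeans(train_gas$NIR), nrow = 1)))
p <- predict(lmod, newdata = pp)        # same rank-deficiency warning as before
```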

If you were willing to forgo `predict()` you could do it like this:

```
## colMeans() works on the data-frame predictors too (response dropped with [-1]);
## NA coefficients simply drop out of the sum
sum(na.omit(coef(lmod) * c(1, colMeans(train_gas[-1]))))
```