**Problem Statement:** Some near infrared spectra on 60 samples of gasoline and corresponding octane numbers can be found by `data(gasoline, package="pls")`.

Compute the mean value for each frequency and predict the response for the best model using the five different methods from Question 4.

**Note:** This is Exercise 11.5 in *Linear Models with R, 2nd Ed.,* by Julian Faraway. Also, the "five different methods from Question 4" are: linear regression with all predictors, linear regression with variables selected using AIC, principal component regression, partial least squares, and ridge regression.

**My Work So Far:** We do

```
require(pls)
data(gasoline, package="pls")
test_index = seq(1,nrow(gasoline),10)
train_index = 1:nrow(gasoline)
train_index = train_index[!train_index %in% test_index]
train_gas = gasoline[train_index,]
test_gas = gasoline[test_index,]
lmod = lm(octane~NIR,train_gas)
```

So far, so good. However, if I look at a summary of the model, I find that 348 coefficients are not defined because of singularities. (Why?) Moreover, massaging the mean values of the columns of the `NIR` matrix (the predictors) into an acceptable data frame is proving difficult.

**My Question:** How can I get to the point where the highly fussy `predict` function will let me do something like this:

```
new_data = apply(train_gas$NIR, 2, mean)
*some code here*
predict(lmod, new_data)
```

?

Incidentally, as I have done a significant amount of moderating on Stats.SE, I can assert positively that this question would be closed on Stats.SE as being off-topic. It’s a "programming or data request", and hence unwelcome on Stats.SE.

I have also looked up a few related questions on SO, but nothing seems to fit exactly.

### Solution:

This does seem pretty CrossValidated-ish to me … `gasoline` is a rather odd object, containing a ‘column’ (element) that is a 401-column matrix:

```
'data.frame': 60 obs. of 2 variables:
$ octane: num 85.3 85.2 88.5 83.4 87.9 ...
$ NIR : 'AsIs' num [1:60, 1:401] -0.0502 -0.0442 -0.0469 -0.0467 -0.0509 ...
```

However, the fundamental problem is that this is a p >> n problem: there are 60 observations and 401 predictors. Thus, a standard linear regression probably just won’t make sense – you probably want to use a penalized approach like LASSO/ridge (e.g., `glmnet`). This is why you get the undefined coefficients: without some kind of penalization, you can’t estimate 402 coefficients (ncol + 1 for the intercept) from 60 observations …
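For concreteness, a minimal sketch of the ridge option via `glmnet` (the `cv.glmnet`/`alpha = 0` setup and the variable names here are my illustration, not part of the original answer):

```r
library(glmnet)
data(gasoline, package = "pls")

X <- unclass(gasoline$NIR)  # strip the "AsIs" class; plain 60 x 401 matrix
y <- gasoline$octane

## cross-validated ridge regression (alpha = 0); the penalty makes the
## p >> n problem estimable
cvfit <- cv.glmnet(X, y, alpha = 0)

## predict the response at the mean spectrum
predict(cvfit, newx = matrix(colMeans(X), nrow = 1), s = "lambda.min")
```

(`alpha = 1` would give the LASSO instead; `cv.glmnet` picks the penalty strength by cross-validation.)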

However, if we do want to hack this into a shape where we can do the linear model and prediction (however ill-advised):

```
NIR <- gasoline$NIR
class(NIR) <- "matrix" ## override "AsIs" class
g2 <- data.frame(octane = gasoline$octane, NIR)
dim(g2) ## 60 402 - now this is a 'regular' data frame
```

```
## using train_index from above
train_gas <- g2[train_index,]
lmod = lm(octane~., train_gas)
## drop first column (response); use `lapply()` to maintain list structure
new_data <- as.data.frame(lapply(train_gas[-1], mean))
predict(lmod, new_data)
## 1
## 87.16019
## Warning message:
## In predict.lm(lmod, new_data) :
## prediction from a rank-deficient fit may be misleading
```
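As a sanity check (my observation, not from the original answer): because the fit includes an intercept, the least-squares prediction at the column means of the training predictors must equal the mean of the training response, so the value above can be reproduced without `predict()`:

```r
## same train/test split as in the question
data(gasoline, package = "pls")
test_index <- seq(1, nrow(gasoline), 10)
train_octane <- gasoline$octane[-test_index]

## OLS with an intercept: yhat at colMeans(X) equals mean(y) on the training set
mean(train_octane)  # should match the 87.16019 predicted above
```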

A **slightly** more direct approach (but still ugly) is to fit the model to the original weird structure (the question’s `lmod <- lm(octane ~ NIR, train_gas)`, with `train_gas` still holding `NIR` as a matrix column) and construct a prediction frame that matches it, i.e.

```
pp <- data.frame(NIR=I(matrix(colMeans(train_gas$NIR), nrow = 1)))
```
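For completeness, here is that route end to end (a self-contained sketch; it re-fits the question’s original matrix-column model rather than the flattened `octane ~ .` one):

```r
data(gasoline, package = "pls")
test_index <- seq(1, nrow(gasoline), 10)
train_gas  <- gasoline[-test_index, ]  # equivalent to the question's split

## fit against the matrix-valued NIR column, as in the question
lmod <- lm(octane ~ NIR, train_gas)

## prediction frame whose NIR element is a 1-row matrix, mirroring gasoline
pp <- data.frame(NIR = I(matrix(colMeans(train_gas$NIR), nrow = 1)))
predict(lmod, pp)  # same value (and same rank-deficiency warning) as before
```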

If you were willing to forgo `predict()`, you could do it like this:

```
## coef(lmod) has NAs for the 348 inestimable coefficients; na.omit() drops them
sum(na.omit(coef(lmod) * c(1, colMeans(train_gas$NIR))))
```