For R, in the MASS::boxcox function, I often get the dreaded "'data' must be a data.frame, environment, or list" error erroneously

I’m running R 4.4.1 and MASS 7.3-61 on a 14-inch MacBook Pro (Nov 2023) running macOS 14.6.

Here’s some reproducible code as a MWE:

require(MASS)

rm(list=ls())  # Clear workspace
result = tryCatch({  # Clear plots
    dev.off()
}, warning = function(w) {
}, error = function(e) {
}, finally = {
})

set.seed(42)
x = rnorm(5)
y = rnorm(5)

df = data.frame(x, y)

lmod = lm(y~x, data=df)
boxcox(lmod)

This produces the error:

Error in model.frame.default(formula = y ~ x, data = df, drop.unused.levels = TRUE) : 
  'data' must be a data.frame, environment, or list

The variable df is clearly a data frame, so this error message is totally erroneous:

> class(df)
[1] "data.frame"
> is.data.frame(df)
[1] TRUE

I’m obviously specifying the model correctly, so that cause is not relevant. If I try the traceback() function, it yields the following:

16: stop("'data' must be a data.frame, environment, or list")
15: model.frame.default(formula = y ~ x + cat, data = df, drop.unused.levels = TRUE)
14: stats::model.frame(formula = y ~ x + cat, data = df, drop.unused.levels = TRUE)
13: eval(mf, parent.frame())
12: eval(mf, parent.frame())
11: lm(formula = y ~ x + cat, data = df, y = TRUE, qr = TRUE)
10: eval(call, parent.frame())
9: eval(call, parent.frame())
8: update.default(object, y = TRUE, qr = TRUE, ...)
7: update(object, y = TRUE, qr = TRUE, ...)
6: boxcox.lm(lmod, plotit = TRUE)
5: boxcox(lmod, plotit = TRUE) at test_boxcox.R#20
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("~/Projects/non_repo_data/test_boxcox.R")

But going through the stats::model.frame.default function’s source code did not reveal this stop command anywhere. I’m at a total loss for understanding why this is happening, or even whence the error is arising. Definitely feels like a bug, though.

Solution:

tl;dr you have to name your data frame something other than df, so that it doesn’t collide with a built-in R object.

The error itself arises from line 526 of src/library/stats/R/models.R.

This is arguably a bug, or at least an "infelicity" (sensu Bill Venables), in MASS::boxcox, but it is also an illustration of why it’s good to avoid name overlaps between your variables and built-in objects.
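
To make the name overlap concrete, here is a minimal sketch (not part of the original example) showing that the symbol df resolves to different objects depending on where lookup starts. The environment user_env and the toy data are made up for illustration; stats::df itself is the built-in density function of the F distribution.

```r
# A private environment holding a user data frame named `df`
# (`user_env` is a hypothetical name used only in this sketch).
user_env <- new.env(parent = globalenv())
assign("df", data.frame(x = 1:3, y = c(2, 4, 6)), envir = user_env)

# Look up the same symbol from two different starting points.
seen_by_user  <- eval(quote(class(df)), user_env)
seen_by_stats <- eval(quote(class(df)), asNamespace("stats"))

seen_by_user   # "data.frame"
seen_by_stats  # "function" (this is stats::df)
```

Any code whose lookup path reaches the stats namespace before your own environment will see the function rather than the data frame.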

Continuing with your example:

dff <- df  ## rename your data frame
lmod <- lm(y~x, data=dff)
boxcox(lmod)

Error in boxcox.default(lmod) : response variable must be positive

This error happens because you constructed a slightly inappropriate example (which was fine for showing what you wanted).

lmod <- lm(abs(y)~x, data=dff)
boxcox(lmod)  ## works

We can get a hint of what’s going on by looking at the output of traceback():

12: stop("'data' must be a data.frame, environment, or list")
11: model.frame.default(formula = y ~ x, data = df, drop.unused.levels = TRUE)
10: stats::model.frame(formula = y ~ x, data = df, drop.unused.levels = TRUE)
9: eval(mf, parent.frame())
8: eval(mf, parent.frame())
7: lm(formula = y ~ x, data = df, y = TRUE, qr = TRUE)
6: eval(call, parent.frame())
5: eval(call, parent.frame())
4: update.default(object, y = TRUE, qr = TRUE, ...)
3: update(object, y = TRUE, qr = TRUE, ...)
2: boxcox.lm(lmod)
1: boxcox(lmod)
  • boxcox is calling update() to make sure the fitted model has all the components it needs (especially the stored QR decomposition)
  • update() is re-calling lm()
  • lm() is calling model.frame()
  • by the time we get there, model.frame() is being evaluated in an environment where it sees the built-in df (in the stats package) before it sees your data frame (in .GlobalEnv).
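
The chain in the list above can be imitated without MASS. This is only a sketch under stated assumptions: refit_like_boxcox is a made-up helper, and its enclosing environment is set to the stats namespace as a stand-in for a package function whose lookup path reaches stats before .GlobalEnv. It is not how boxcox.lm is actually written.

```r
set.seed(42)
x <- rnorm(5)
y <- rnorm(5)
df  <- data.frame(x, y)   # same name as the function stats::df
dff <- df                 # same data under a non-colliding name

# Hypothetical helper: refits a model via update(), as in the traceback.
refit_like_boxcox <- function(object) update(object, y = TRUE, qr = TRUE)
# Assumption for this sketch: make symbol lookup start in namespace:stats.
environment(refit_like_boxcox) <- asNamespace("stats")

lmod_df  <- lm(y ~ x, data = df)
lmod_dff <- lm(y ~ x, data = dff)

# update() re-evaluates the stored call `lm(..., data = df)` in the helper's
# frame, where `df` resolves to the function stats::df, so the refit fails.
res_df <- tryCatch(refit_like_boxcox(lmod_df), error = function(e) e)
inherits(res_df, "error")
conditionMessage(res_df)

# `dff` has nothing to collide with. When this script is run at top level,
# lookup falls through to .GlobalEnv and the refit is expected to succeed.
res_dff <- tryCatch(refit_like_boxcox(lmod_dff), error = function(e) e)
class(res_dff)
```

The exact wording of the error message may differ between R versions; on the setup described in the question it should match the one reported there.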

It would take just a little more work than I feel like doing right now to establish exactly what all those parent.frame() invocations are seeing. From within the lm() call (you can get there by setting options(error = recover)), you can see that the enclosing environment of the parent frame, parent.frame()$enclos, is <environment:base>. I’m not quite sure how we get from there to <environment: namespace:stats>, which is where we’re getting df from …