Summarise before collecting in arrow using strings for column names

Say I want to summarise a column in an arrow table prior to collecting (because the actual dataset is larger than memory). I could do something like this:

library(arrow)
library(dplyr)

arrow_table(mtcars) %>%
  summarise(mean(mpg)) %>%
  collect()

# A tibble: 1 × 1
#     `mean(mpg)`
#           <dbl>
#   1        20.1

Now, say I want to do this programmatically and the column name is provided as a string. In regular (i.e., non-arrow) dplyr, I could use across and all_of like this:

foo_regular <- function(x){
  mtcars %>% 
    summarise(across(all_of(x), mean)) %>% 
    collect()
}

foo_regular("mpg")

#        mpg
# 1 20.09062

But how do I do this in arrow? The same approach fails:

foo_arrow <- function(x){
  arrow_table(mtcars) %>%
    summarise(across(all_of(x), mean)) %>%
    collect()
}

foo_arrow("mpg")

# Warning: Error in summarize_eval(names(exprs)[i], exprs[[i]], ctx, length(.data$group_by_vars) >  : 
# Expression across(all_of(x), mean) is not an aggregate expression or is not supported in Arrow; pulling data into R
# Error:
#   ! Problem while computing `..1 = across(all_of(x), mean)`.
# Caused by error in `across()`:
#   ! Can't subset columns that don't exist.
# ✖ Column `mpg` doesn't exist.
# Run `rlang::last_error()` to see where the error occurred.

Clearly, arrow can compute the mean of that column before collect(), since my first code chunk does exactly that. But how do I specify the column name as a string? As I say, the actual dataset is massive, so pulling the data into R first isn't an option.

Solution:

In the most recent released version of Arrow (9.0.0.1), across() is not yet implemented, but it has been implemented in the most recent development version, and so should be in the upcoming release (10.0.0).
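As a rough sketch, you could check which version of arrow is installed before relying on across(). The 10.0.0 threshold is taken from the expectation stated above, not from a confirmed release note, and has_across is just an illustrative variable name:

```r
# Compare the installed arrow version against the release expected to ship across().
# "10.0.0" is an assumption based on the statement above, not a confirmed changelog entry.
installed <- utils::packageVersion("arrow")
has_across <- installed >= "10.0.0"

if (has_across) {
  message("arrow ", installed, ": across() is expected to work inside summarise()")
} else {
  message("arrow ", installed, ": spell out the columns manually or install a nightly build")
}
```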

For the moment, you can either install a nightly version of arrow via arrow::install_arrow(nightly = TRUE), which will successfully run your code example, or manually specify the columns/functions to summarise() without using across().
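One way to spell out the column manually while still passing its name as a string is to convert the string to a symbol and inject it with `!!`. This is only a sketch of that workaround, not necessarily what the arrow developers would recommend. foo_arrow_sym is a hypothetical helper name, and the approach relies on rlang's injection operators, which are resolved before arrow translates the expression:

```r
library(arrow)
library(dplyr)

# Hypothetical helper: summarise one column whose name arrives as a string,
# without using across().
foo_arrow_sym <- function(x) {
  col <- rlang::sym(x)  # turn the string "mpg" into the symbol mpg

  arrow_table(mtcars) %>%
    # After injection arrow sees the plain expression mean(mpg),
    # and `!!x :=` names the result column after the input string.
    summarise(!!x := mean(!!col)) %>%
    collect()
}

foo_arrow_sym("mpg")
```

Because the expression arrow receives is the same `mean(mpg)` as in the first example, the aggregation should still happen before collect().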
