R/arrow summarizing on variable columns

I have a large-ish parquet file that I’m referencing via arrow::open_dataset. I’d like to get the max value of one or more of its columns, where I don’t know a priori which columns (or how many). In general, this sounds like "programming with dplyr" (assuming arrow 10 and its recent support for dplyr::across), but I can’t get it to work.

library(arrow)
library(dplyr)

write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")
vars <- c("a")
open_dataset("quux.parquet") %>%
  summarize(across(sym(vars), ~ max(.))) %>%
  collect()
# # A tibble: 1 x 1
#       a
#   <dbl>
# 1     9

When vars has length 2 or more, I assume I need syms or similar, but that fails:

vars <- c("a", "b")
open_dataset("quux.parquet") %>%
  summarize(across(all_of(syms(vars)), ~ max(.))) %>%
  collect()
# Error: Must subset columns with a valid subscript vector.
# x Subscript has the wrong type `list`.
# i It must be numeric or character.

How do I lazily (not load all data) find the max of multiple columns in an arrow dataset?

I suspect that the correct answer in dplyr will be some form of syms, and whether arrow supports that is the next question. I’m not tied to the dplyr mechanisms: if there’s a method using ds$NewScan() or similar, I’m amenable.

Solution:

Is this the kind of thing you’re after – using tidyselect’s all_of function?

library(arrow)
library(dplyr)

write_parquet(data.frame(a=c(1,9), b=c(2,10), d=c("q","r")), "quux.parquet")

vars <- c("a", "d")

open_dataset("quux.parquet") %>%
  summarize(across(all_of(vars), ~ max(.))) %>%
  collect()
#> # A tibble: 1 × 2
#>       a d    
#>   <dbl> <chr>
#> 1     9 r

See https://tidyselect.r-lib.org/reference/index.html for the different tidyselect functions you may also want to check out.
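For the two-numeric-column case from the question, the same pattern should apply. The error message above says the subscript must be numeric or character, and syms(vars) returns a list of symbols, so passing the plain character vector to all_of is the likely fix. A minimal sketch, assuming the same toy data and a temporary file path (path and res are names chosen for this example, not from the original post):

```r
library(arrow)
library(dplyr)

# Same toy data as in the question, written to a temporary file.
path <- tempfile(fileext = ".parquet")
write_parquet(data.frame(a = c(1, 9), b = c(2, 10), d = c("q", "r")), path)

# Plain character vector of column names; no sym()/syms() wrapping.
vars <- c("a", "b")

res <- open_dataset(path) %>%
  summarize(across(all_of(vars), ~ max(.))) %>%  # still a lazy query here
  collect()                                      # data is materialized only now

res
# one row: a is 9, b is 10
```

Because vars is an ordinary character vector, it can be built at run time (for example from user input) before the query is constructed.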
