How to write a function with tidy eval when using the "arrow" R package (arrow::open_dataset) and dplyr verbs?

What I’m trying to do

I’m trying to write a function that uses dplyr verbs and takes an Arrow dataset (opened with arrow::open_dataset) as its first argument and a column of that dataset as its second. Because I need to pass the column as a string (my actual use case is a Shiny app), I’m using the syntax .data[[.metric]]. Below is an image of the error I’m getting and some code to reproduce it. Any help or insight is appreciated.

Image of error message

[image not included]

Code to reproduce error

# install.packages(c("dplyr", "ggplot2", "arrow"))
library(dplyr)

arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")

dat <- arrow::open_dataset("sample_data.parquet")

glimpse(dat)

get_metric <- function(.data, .metric) {
  
  .data %>%
    group_by(manufacturer, cyl) %>% 
    summarize(
      new_col = sum(.data[[.metric]], na.rm = T)
    ) %>% 
    ungroup() 
}

get_metric(dat, "cty") %>% collect()

Additional code that works but doesn’t use arrow as much, so it is not ideal for speed

In this code I call collect() before the tidy eval step, so it is essentially regular dplyr code. It runs, but it is slower than the code I had working before I extracted it into a function.

get_metric2 <- function(.data, .metric) {
  
  .data %>%
    collect() %>% 
    group_by(manufacturer, cyl) %>% 
    summarize(
      new_col = sum(.data[[.metric]], na.rm = T)
    ) %>% 
    ungroup() 
}

get_metric2(dat, "cty")

Solution:

Convert the string to a symbol with rlang::sym() and unquote it with !! instead of indexing the .data pronoun.

library(dplyr)

arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")
dat <- arrow::open_dataset("sample_data.parquet")

get_metric <- function(.data, .metric) {
  # Convert the column name from a string to a symbol
  .metric <- rlang::sym(.metric)
  .data %>%
    group_by(manufacturer, cyl) %>%
    summarize(
      # !! unquotes the symbol so it is used as a bare column name
      new_col = sum(!!.metric, na.rm = TRUE)
    ) %>%
    ungroup()
}

get_metric(dat, "cty") %>%
  collect()
# # A tibble: 32 × 3
#    manufacturer   cyl new_col
#    <chr>        <int>   <int>
#  1 audi             4     153
#  2 audi             6     148
#  3 audi             8      16
#  4 chevrolet        8     191
#  5 chevrolet        4      41
#  6 chevrolet        6      53
#  7 dodge            4      18
#  8 dodge            6     225
#  9 dodge            8     243
# 10 ford             8     197
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
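
To see what sym() and !! are doing, it can help to build the expression outside of dplyr and print it. This is only a small sketch using rlang on its own; the variable names metric, metric_sym and captured are made up for illustration, and "cty" stands in for whatever string would come from the Shiny input.

```r
library(rlang)

# A column name arriving as a string (hypothetical value, e.g. from a Shiny input).
metric <- "cty"

# sym() turns the string into a symbol, i.e. a bare column name.
metric_sym <- sym(metric)

# expr() captures the call without evaluating it; !! splices the symbol in.
captured <- expr(sum(!!metric_sym, na.rm = TRUE))

print(captured)
# sum(cty, na.rm = TRUE)
```

The printed call is the same as if cty had been typed directly inside summarize(), which is the form that get_metric() above hands to the Arrow query.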