Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Error package arrowR : read_parquet/open_dataset "Couldn't deserialize thrift: TProtocolException: Exceeded size limit"

My institution is slowly transiting from SAS to R, most of the code is written in arrow/dplyr or data.table, using the .parquet format as its main storing format.
In my own work I am usually dealing with storing and analysing data from 1 million to 10 million rows and up to 150-200 columns.
Parquet format is great for this kind of usage, but an unusual error has been occuring recently, and I couldn’t find any ressource on the internet :

library(arrow)
library(tidyverse)

open_dataset(data_error)
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'. 
Is this a 'parquet' file?: 
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit

The same would happen with the function read_parquet.

What is data_error ?

data_error is just a typical data.frame, extracted from a bigger data source (let’s call it data_clean), through a few data.table processes and saved by write_parquet unpartitionned. Please not that this error does not occur if the parquet file is partionnized.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

This error first happened on a program on data.table that I didn’t write, and I’m not familiar enough with data.table to understand the underlying issue.

repex :

library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table 
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L)) 
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter 
dt[x == 989L]
# Save in parquet format
write_parquet(dt, "example_error.parquet")
# Error
dt_error <- open_dataset("example_error.parquet")

Thank you all for your help !

>Solution :

The culprit is that once you call dt[x == 989L], an index is created in the data.table.

set.seed(1L)
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
attr1 <- attributes(dt)
dt[x == 989L]
attr2 <- attributes(dt)
str(attr1)
# List of 4
#  $ names            : chr [1:2] "x" "y"
#  $ row.names        : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
#  $ class            : chr [1:2] "data.table" "data.frame"
#  $ .internal.selfref:<externalptr> 
str(attr2)
# List of 5
#  $ names            : chr [1:2] "x" "y"
#  $ row.names        : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
#  $ class            : chr [1:2] "data.table" "data.frame"
#  $ .internal.selfref:<externalptr> 
#  $ index            : int(0) 
#   ..- attr(*, "__x")= int [1:10000000] 17660 25871 28519 270694 275019 419020 437190 615578 628622 739696 ...

Notice the addition of the index attribute.

The default action of arrow is to store attributes; one nice side-effect of this is that dt_ok will actually be of class data.table:

head(dt_ok) |> collect() # assuming dplyr::collect is visible?
#        x         y
#    <int>     <num>
# 1: 24388 0.4023457
# 2: 59521 0.9142361
# 3: 43307 0.2847435
# 4: 69586 0.3440578
# 5: 11571 0.1822614
# 6: 25173 0.8130521

The file size is also adversely affected (not sure if you are aware of this):

file.info(list.files(pattern = "*parquet"))
#                            size isdir mode               mtime               ctime               atime  uid  gid uname grname
# example_error.parquet 209818297 FALSE  664 2024-02-12 11:34:05 2024-02-12 11:34:05 2024-02-12 11:34:05 1000 1000    r2     r2
# example_ok.parquet     25744071 FALSE  664 2024-02-12 11:33:58 2024-02-12 11:33:58 2024-02-12 11:33:58 1000 1000    r2     r2

Clearly the _error file has something more. The normal efficiency of binary-data-storage in parquet files is not afforded to R attributes, so it makes sense that 10Mi values in a vector stored less-efficiently would take up that space.

If we remove the index, the problem goes away. One way to remove the index is to manually set the order:

write_parquet(dt, "example_error.parquet")
setorder(dt)
names(attributes(dt))
# [1] "names"             "row.names"         "class"             ".internal.selfref"
write_parquet(dt, "example_ok2.parquet")
open_dataset("example_error.parquet")
# Error in open_dataset("example_error.parquet") : 
#   IOError: Error creating dataset. Could not read schema from '.../example_error.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '.../example_error.parquet': Couldn't deserialize thrift: TProtocolException: Exceeded size limit
open_dataset("example_ok2.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata

My immediate thought is that this is a bug, perhaps due to the size of the attribute. For demonstration, if we instead repeat this with 100 rows, we have no problem.

set.seed(1L)
dt = data.table(x = sample(1e5L, 1e2L, TRUE), y = runif(100L))
write_parquet(dt, "small1.parquet")
open_dataset("small1.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata
dt[x == 8229L]
#        x       y
#    <int>   <num>
# 1:  8229 0.62041
str(attributes(dt))
# List of 5
#  $ names            : chr [1:2] "x" "y"
#  $ row.names        : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
#  $ class            : chr [1:2] "data.table" "data.frame"
#  $ .internal.selfref:<externalptr> 
#  $ index            : int(0) 
#   ..- attr(*, "__x")= int [1:100] 35 50 21 78 98 33 9 77 88 94 ...
write_parquet(dt, "small2.parquet")
open_dataset("small1.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata

I suggest (request, even) that you submit a bug report.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading