Error in R package arrow: read_parquet/open_dataset "Couldn't deserialize thrift: TProtocolException: Exceeded size limit"


My institution is slowly transitioning from SAS to R; most of the code is written with arrow/dplyr or data.table, using the .parquet format as its main storage format.
In my own work I am usually storing and analysing data of 1 to 10 million rows and up to 150-200 columns.
The parquet format is great for this kind of usage, but an unusual error has been occurring recently, and I couldn't find any resource about it on the internet:

library(arrow)
library(tidyverse)

open_dataset(data_error)
Error in `open_dataset()`:
! IOError: Error creating dataset. Could not read schema from 'path/example.parquet'. 
Is this a 'parquet' file?: 
Could not open Parquet input source 'path/example.parquet':
Couldn't deserialize thrift: TProtocolException: Exceeded size limit

The same would happen with the function read_parquet.

What is data_error?

data_error is just a typical data.frame, extracted from a bigger data source (let's call it data_clean) through a few data.table operations and saved unpartitioned with write_parquet. Please note that this error does not occur if the parquet file is written partitioned.
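For completeness, here is roughly what I mean by the partitioned write (the directory name and partitioning column are just illustrative, not my real data); written this way, the output opens fine:

library(arrow)
library(data.table)
# Same kind of data as in the reprex below
dt <- data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(1e7L))
dt[x == 989L]                                    # the filter from the reprex below
dt[, grp := x %% 10L]                            # made-up partitioning column, for illustration
# Partitioned write: one subdirectory per value of grp
write_dataset(dt, "example_partitioned", partitioning = "grp")
open_dataset("example_partitioned")              # opens without the thrift error, as noted above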

This error first happened in a data.table program that I didn't write, and I'm not familiar enough with data.table to understand the underlying issue.

Reprex:

library(arrow)
library(data.table)
# Seed
set.seed(1L)
# Big enough data.table 
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L)) 
# Save in parquet format
write_parquet(dt, "example_ok.parquet")
# Readable
dt_ok <- open_dataset("example_ok.parquet")
# Simple filter 
dt[x == 989L]
# Save in parquet format
write_parquet(dt, "example_error.parquet")
# Error
dt_error <- open_dataset("example_error.parquet")

Thank you all for your help!

Solution:

The culprit is that once you call dt[x == 989L], data.table creates a secondary index and attaches it to the table as an attribute.

set.seed(1L)
dt = data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(100L))
attr1 <- attributes(dt)
dt[x == 989L]
attr2 <- attributes(dt)
str(attr1)
# List of 4
#  $ names            : chr [1:2] "x" "y"
#  $ row.names        : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
#  $ class            : chr [1:2] "data.table" "data.frame"
#  $ .internal.selfref:<externalptr> 
str(attr2)
# List of 5
#  $ names            : chr [1:2] "x" "y"
#  $ row.names        : int [1:10000000] 1 2 3 4 5 6 7 8 9 10 ...
#  $ class            : chr [1:2] "data.table" "data.frame"
#  $ .internal.selfref:<externalptr> 
#  $ index            : int(0) 
#   ..- attr(*, "__x")= int [1:10000000] 17660 25871 28519 270694 275019 419020 437190 615578 628622 739696 ...

Notice the addition of the index attribute.

By default, arrow stores R attributes in the file's metadata; one nice side-effect of this is that dt_ok, once collected, will actually be of class data.table:

head(dt_ok) |> collect() # assuming dplyr::collect is available
#        x         y
#    <int>     <num>
# 1: 24388 0.4023457
# 2: 59521 0.9142361
# 3: 43307 0.2847435
# 4: 69586 0.3440578
# 5: 11571 0.1822614
# 6: 25173 0.8130521
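A quick sanity check (a small addition, not in the original output) that the attributes really do round-trip:

head(dt_ok) |> collect() |> class()
# [1] "data.table" "data.frame"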

The file size is also adversely affected (not sure if you are aware of this):

file.info(list.files(pattern = "*parquet"))
#                            size isdir mode               mtime               ctime               atime  uid  gid uname grname
# example_error.parquet 209818297 FALSE  664 2024-02-12 11:34:05 2024-02-12 11:34:05 2024-02-12 11:34:05 1000 1000    r2     r2
# example_ok.parquet     25744071 FALSE  664 2024-02-12 11:33:58 2024-02-12 11:33:58 2024-02-12 11:33:58 1000 1000    r2     r2

Clearly the _error file is carrying something extra. R attributes are written into the file metadata rather than as column data, so they don't get parquet's normal efficient binary encoding; it makes sense that 10 million index values stored less efficiently take up that much space.
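Putting a number on it (this just re-uses the file sizes shown above):

# The difference between the two files is essentially the serialized index
sizes <- file.info(c("example_ok.parquet", "example_error.parquet"))$size
diff(sizes) / 1e6
# [1] 184.0742   # ~184 MB of extra metadata carried by the _error file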

If we remove the index, the problem goes away. One way to remove the index is to manually set the order:

write_parquet(dt, "example_error.parquet")
setorder(dt)
names(attributes(dt))
# [1] "names"             "row.names"         "class"             ".internal.selfref"
write_parquet(dt, "example_ok2.parquet")
open_dataset("example_error.parquet")
# Error in open_dataset("example_error.parquet") : 
#   IOError: Error creating dataset. Could not read schema from '.../example_error.parquet'. Is this a 'parquet' file?: Could not open Parquet input source '.../example_error.parquet': Couldn't deserialize thrift: TProtocolException: Exceeded size limit
open_dataset("example_ok2.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata
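If you would rather not change the row order, another option (not from the answer above, but standard data.table tooling) is to drop the secondary indices directly, and optionally stop data.table from creating them automatically:

# Remove all secondary indices in place (assumes dt still carries an index)
setindex(dt, NULL)
names(attributes(dt))
# [1] "names"             "row.names"         "class"             ".internal.selfref"

# Prevent automatic index creation on filters like dt[x == 989L]
options(datatable.auto.index = FALSE)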

My immediate thought is that this is a bug, perhaps due to the size of the attribute. For demonstration, if we instead repeat this with 100 rows, we have no problem.

set.seed(1L)
dt = data.table(x = sample(1e5L, 1e2L, TRUE), y = runif(100L))
write_parquet(dt, "small1.parquet")
open_dataset("small1.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata
dt[x == 8229L]
#        x       y
#    <int>   <num>
# 1:  8229 0.62041
str(attributes(dt))
# List of 5
#  $ names            : chr [1:2] "x" "y"
#  $ row.names        : int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
#  $ class            : chr [1:2] "data.table" "data.frame"
#  $ .internal.selfref:<externalptr> 
#  $ index            : int(0) 
#   ..- attr(*, "__x")= int [1:100] 35 50 21 78 98 33 9 77 88 94 ...
write_parquet(dt, "small2.parquet")
open_dataset("small1.parquet")
# FileSystemDataset with 1 Parquet file
# x: int32
# y: double
# See $metadata for additional Schema metadata
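To put rough numbers on the size-of-the-attribute hypothesis, here is a sketch using base R's serialize() (only an approximation of what ends up in the file metadata):

# Serialized size of the index attribute: tiny for 100 rows...
length(serialize(attributes(dt)$index, NULL))
# ...but around 40 MB for the 1e7-row table from the earlier example
big <- data.table(x = sample(1e5L, 1e7L, TRUE), y = runif(1e7L))
big[x == 989L]                       # recreate the secondary index
length(serialize(attributes(big)$index, NULL)) / 1e6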

I suggest (request, even) that you submit a bug report.
