Read CSV files using pyarrow

May 10, 2024

I'm trying to read CSV files with the Python library pyarrow, but reading fails because some fields contain "\N" as their value (it marks a null value).
The problem is that I can't get pyarrow to treat this value as null while reading.

Here is my code:

from pyarrow import csv

parse_options = csv.ParseOptions(delimiter=chr(1))
read_options = csv.ReadOptions(column_names=columns)
convert_options = csv.ConvertOptions(column_types=schema_table, include_columns=columns, include_missing_columns=True, null_values=True)

with hdfs.open_input_file("path") as f:
    csv_file = csv.read_csv(f, read_options=read_options, parse_options=parse_options, convert_options=convert_options)

The error I get:

ArrowInvalid: In CSV column #59: CSV conversion error to int64: invalid value '\N'

When I tried a file that has nothing between the separators (empty fields), I had no problem.

Many thanks!

>Solution :

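All you have to do is tell pyarrow to interpret "\N" as null by including it in the null_values parameter of your convert_options. null_values expects a list of strings, not True. Write the value as a raw string (r'\N') or escape the backslash ('\\N'): a plain '\N' is a syntax error in Python 3, because \N starts a \N{...} Unicode name escape.

convert_options = csv.ConvertOptions(column_types=schema_table,
                                     include_columns=columns,
                                     include_missing_columns=True,
                                     # raw string keeps the backslash literal
                                     null_values=[r'\N'])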

Hope it helps.