I'm trying to read CSV files with the Python library pyarrow, but reading fails because some fields contain "\N" as their value (it stands for a null value).
The problem is that I can't get the reader to treat this value as null.
Here is my code:
from pyarrow import csv

# The fields are separated by the \x01 control character.
parse_options = csv.ParseOptions(delimiter=chr(1))
read_options = csv.ReadOptions(column_names=columns)
convert_options = csv.ConvertOptions(column_types=schema_table, include_columns=columns, include_missing_columns=True, null_values=True)
# hdfs is a filesystem object created elsewhere (presumably a pyarrow.fs.HadoopFileSystem).
with hdfs.open_input_file("path") as f:
    csv_file = csv.read_csv(f, read_options=read_options, parse_options=parse_options, convert_options=convert_options)
The error I get:
ArrowInvalid: In CSV column #59: CSV conversion error to int64: invalid value '\N'
When I tried a file that has nothing between the separators, I had no problem.
Many thanks!
>Solution :
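All you have to do is tell pyarrow to interpret "\N" as null by including it in the null_values parameter of your convert_options. null_values expects a list of strings, not a boolean. Write the marker as a raw string (r'\N'): in Python 3 a plain '\N' is a syntax error, because \N starts a named Unicode escape.
convert_options = csv.ConvertOptions(column_types=schema_table,
                                     include_columns=columns,
                                     include_missing_columns=True,
                                     null_values=[r'\N'])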
Hope it helps.