Read CSV files using pyarrow

I'm trying to read CSV files with the Python library pyarrow, but reading fails because some fields contain "\N" as their value (it means that the value is null).
The problem is that I can't get pyarrow to skip this value while reading.

Here is my code:

from pyarrow import csv

parse_options = csv.ParseOptions(delimiter=chr(1))
read_options = csv.ReadOptions(column_names=columns)
convert_options = csv.ConvertOptions(column_types=schema_table, include_columns=columns, include_missing_columns=True, null_values=True)

with hdfs.open_input_file("path") as f:
    csv_file = csv.read_csv(f, read_options=read_options, parse_options=parse_options, convert_options=convert_options)

The error I get:

ArrowInvalid: In CSV column #59: CSV conversion error to int64: invalid value '\N'

When I try a file that has no value between the separators, I have no problem.

Many thanks!

>Solution :

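All you have to do is tell pyarrow to interpret "\N" as null by listing it in the null_values parameter of your convert_options. null_values expects a list of strings, not a boolean. Write the marker as a raw string (r'\N'), because a bare '\N' is a malformed escape sequence in Python 3 and raises a SyntaxError. Passing your own list replaces pyarrow's default null markers, so include the empty string as well if empty fields should still be read as null.

convert_options = csv.ConvertOptions(column_types=schema_table, 
                                 include_columns=columns, 
                                 include_missing_columns=True, 
                                 # '' keeps empty fields null; r'\N' is the literal backslash-N marker
                                 null_values=['', r'\N'])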

Hope it helps.
