Cannot read csv file in pandas read_csv

I am trying to read a CSV file with pandas. The file, which I downloaded from LinkedIn Campaign Manager, seems to be in a strange format. Can you help me read it normally? Here is the code:

import glob
import os
import pandas as pd
path = r'C:\Users\FilePath' # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))
dfAllDataLI = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

Here is the error:

UnicodeDecodeError                        Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11340/2382686370.py in <module>
      3 path = r'C:\Users\pchauh04\OneDrive - dentsu\Dokumente\Wiencom DB\Data\Data_New\Linkedin' # use your path
      4 all_files = glob.glob(os.path.join(path, "*.csv"))
----> 5 dfAllDataLI = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
      6 dfAllDataLI = dfAllDataLI.fillna('')
      7 

c:\Users\pchauh04\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

c:\Users\pchauh04\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    292     ValueError: Indexes have overlapping values: ['a']
    293     """
--> 294     op = _Concatenator(
    295         objs,
    296         axis=axis,

c:\Users\pchauh04\Anaconda3\lib\site-packages\pandas\core\reshape\concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    346             objs = [objs[k] for k in keys]
...
c:\Users\pchauh04\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

c:\Users\pchauh04\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Solution:

The file has 5 non-CSV rows before the column header.

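Fortunately, read_csv lets you skip those lines with skiprows. You’ll also need to specify the text encoding (the file is UTF-16LE, not UTF-8; the 0xff byte at position 0 in the error is consistent with the UTF-16 little-endian byte order mark, FF FE) and the separator (the file is tab-separated):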

import pandas as pd

df = pd.read_csv('csv file.csv', skiprows=5, encoding='utf-16le', sep='\t')
print(df.columns)

This prints:

Index(['Start Date (in UTC)', 'Account Name', 'Campaign Group Name',
       'Campaign Group ID', 'Campaign Name', 'Campaign ID', 'Campaign Type',
       'Campaign Start Date', 'Campaign Group Start Date', 'Campaign End Date',
       'Total Budget', 'Clicks', 'Impressions', 'Average CPM', 'Average CPC',
       'Avg. Last Day Reach', 'Video Completions'],
      dtype='object')
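
To tie this back to the loop in the question, here is a minimal sketch that applies the same settings to every CSV file in a folder. The function names read_linkedin_export and read_all_exports are hypothetical helpers made up for illustration, and the sketch assumes every export in the folder has the same layout as the file above (5 preamble rows, UTF-16LE, tab-separated). If your files differ, adjust the arguments.

```python
import glob
import os

import pandas as pd


def read_linkedin_export(csv_path, preamble_rows=5):
    """Read one export file using the settings from the answer above.

    Assumes 5 non-CSV rows before the header, UTF-16LE text, tab separators.
    """
    return pd.read_csv(
        csv_path,
        skiprows=preamble_rows,
        encoding="utf-16le",
        sep="\t",
    )


def read_all_exports(folder):
    """Concatenate every *.csv file in `folder` into a single DataFrame."""
    # sorted() only makes the row order reproducible between runs.
    all_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not all_files:
        raise FileNotFoundError(f"no CSV files found in {folder!r}")
    frames = [read_linkedin_export(f) for f in all_files]
    return pd.concat(frames, ignore_index=True)
```

With the path from the question, the call would be dfAllDataLI = read_all_exports(path). Building a list of frames first, rather than passing a generator to pd.concat, makes it easier to find the offending file if one of them fails to parse.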