Ignoring delimiter while reading CSV files from URLs – Python

I have some URLs for downloading CSV files.

import pandas as pd
import io
import requests

url1 = 'https://www.ons.gov.uk/generator?format=csv&uri=/economy/economicoutputandproductivity/output/timeseries/' + 'k22a' + '/diop'

url2 = 'https://www.ons.gov.uk/generator?format=csv&uri=/economy/economicoutputandproductivity/output/timeseries/' + 'k24c' + '/diop'

s = requests.get(url1).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')))

When I use url1, there is a ',' inside the 4th record, but some URLs (such as url2) don't have this unexpected separator. This causes

ParserError: Error tokenizing data. C error: Expected 1 fields in line
5, saw 2
when I try to merge the CSV files into a single dataframe. How do I ignore these unexpected separators? The first seven records are to be deleted anyway, but I still get this error.

This solution suggests pre-parsing each line before converting to CSV. Since I have many such URLs and don't know which unexpected delimiters might be encountered in the future, I'm not sure how to debug this. Can pre-parsing before converting to CSV work? How do I implement it so that it also covers other separators encountered in the future?

>Solution :

Since you don’t need the metadata, just skip it using the skiprows parameter of read_csv. As a nice side effect, you’ll also have the correct dtypes automatically:

url = url1
N = 7

s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')), header=0, skiprows=range(1, N+1))

Output:

  Title  IOP: C:MANUFACTURING: CVMSA
0  1948                         25.2
1  1949                         27.0
2  1950                         29.0
3  1951                         29.9
4  1952                         28.4
...

If you don’t even need headers:

url = url1
N = 8

s = requests.get(url).content
c = pd.read_csv(io.StringIO(s.decode('utf-8')), header=None, skiprows=N)

Output:

      0     1
0  1948  25.2
1  1949  27.0
2  1950  29.0
3  1951  29.9
4  1952  28.4
...
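Since the original goal was to merge several such files into a single dataframe, the same skiprows approach can be wrapped in a small helper and the results joined on the year column. A minimal sketch, with in-memory CSV text standing in for the decoded response bodies of url1 and url2 (the `load_series` helper, the column names, and the sample data are illustrative, not part of the original answer):

```python
import io
import pandas as pd

N = 7  # number of metadata rows between the header and the data

def load_series(csv_text, name):
    # Keep the header row (row 0), skip the N metadata rows after it
    df = pd.read_csv(io.StringIO(csv_text), header=0, skiprows=range(1, N + 1))
    df.columns = ['Year', name]
    return df

# Stand-ins for s.decode('utf-8') from each URL
sample1 = ("Title,IOP: C:MANUFACTURING: CVMSA\n" + "meta,x\n" * 7
           + "1948,25.2\n1949,27.0\n")
sample2 = ("Title,Other series\n" + "meta,x\n" * 7
           + "1948,10.1\n1949,11.5\n")

# Merge the two series into one dataframe keyed on Year
merged = load_series(sample1, 'k22a').merge(load_series(sample2, 'k24c'), on='Year')
print(merged)
```

In the real case, `csv_text` would be `requests.get(url).content.decode('utf-8')` for each URL; because the metadata rows (where the stray ',' appears) are skipped before parsing, the tokenizing error never triggers.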