pd.DataFrame(table) adds extra None columns

This may seem like a strange way to deal with a CSV file, but it is a study task: I need to open the CSV file as text, read the lines, build a list, and then create a pandas DataFrame from that list.

import pandas as pd
with open ('file.csv', 'r') as f:
    lst = f.readlines()

for idx, line in enumerate(lst):
    lst[idx] = line.strip('\n')

header = lst[0].replace('"', '').split(",")
for idx, line in enumerate(lst[1:]):
    lst[idx] = line.split(',')

df = pd.DataFrame(data  = lst, columns = header)

ValueError: 5 columns passed, passed data had 39 columns

It crashes because pd.DataFrame adds (?) a bunch of Nones at the end of each row.
I checked this by running it without specifying 'columns'.
Please help me understand where these Nones come from.

Solution:

The problem is in how the split data rows are written back into the list before the DataFrame is built. Going through the steps one by one:

  1. Reading the File: You open the file and read its lines into a
    list. This part is fine.
  2. Stripping Newline Characters: You remove the trailing newline from
    each line. This is also fine.
  3. Processing the Header: Splitting the header works. Removing the
    double quotes (") is only needed if the header actually contains
    them.
  4. Processing the Data Rows: This is where the bug is. You iterate
    over lst[1:], but enumerate() starts counting at 0, so each split
    row is written to lst[idx], one position before the line it came
    from. The header in lst[0] is overwritten by the first data row,
    and the last element of lst is never touched, so it stays an
    unsplit string. When that list is passed to pd.DataFrame, the
    leftover string appears to be treated as a sequence of single
    characters. That would explain the 39 columns: the shorter rows
    are padded with None to match its length.
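
The index shift in step 4 is easy to reproduce without pandas. The following is a minimal sketch on a made-up three-line file (the contents of `lst` are an assumption, not your data); it runs the same loop as in the question and then a variant that uses `enumerate(..., start=1)` so the indices line up:

```python
# Hypothetical file contents, newlines already stripped (not the real data)
lst = ['a,b,c', '1,2,3', '4,5,6']

# Same loop as in the question: idx starts at 0, but line starts at lst[1]
for idx, line in enumerate(lst[1:]):
    lst[idx] = line.split(',')

# The header slot was overwritten and the last element is still a string
print(lst)  # [['1', '2', '3'], ['4', '5', '6'], '4,5,6']

# One possible fix: start counting at 1 so idx matches the position of line
fixed = ['a,b,c', '1,2,3', '4,5,6']
for idx, line in enumerate(fixed[1:], start=1):
    fixed[idx] = line.split(',')

print(fixed)  # ['a,b,c', ['1', '2', '3'], ['4', '5', '6']]
```

With the second variant the data rows are `fixed[1:]`, and the header still has to be split separately. Building a new list instead of modifying the old one in place avoids the index bookkeeping altogether.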
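
A corrected version of the script, which builds a separate list for the data rows instead of writing back into the original one:

import pandas as pd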

with open('file.csv', 'r') as f:
    lines = f.readlines()

# Remove newline characters and strip quotes if needed
lines = [line.strip('\n').replace('"', '') for line in lines]

# Split the header
header = lines[0].split(',')

# Split the data rows
data = [line.split(',') for line in lines[1:]]

# Create the DataFrame
df = pd.DataFrame(data, columns=header)

This script should load the CSV file into a DataFrame as long as the fields are plain values. If the CSV contains quoted fields with commas inside, the simple split(',') approach will break those fields apart (and replace('"', '') removes the quotes that mark them), so you would need a real CSV parser such as the one built into Pandas (pandas.read_csv()) or Python's csv module.
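
If the task allows it, the standard library's csv module handles quoted fields while still letting you build the list yourself. The sketch below uses a made-up two-line CSV held in a string (the `text` value is an assumption for illustration); with a real file you would pass the open file object to `csv.reader` instead of `io.StringIO`:

```python
import csv
import io

# Hypothetical CSV text: the first field of the data row contains a comma
text = 'name,city\n"Smith, John",Boston\n'

# Naive splitting cuts the quoted field in two and keeps the quote characters
naive = [line.split(',') for line in text.splitlines()]
print(naive[1])  # ['"Smith', ' John"', 'Boston']

# csv.reader keeps the quoted field together and drops the quotes
parsed = list(csv.reader(io.StringIO(text)))
print(parsed)  # [['name', 'city'], ['Smith, John', 'Boston']]

# First row is the header, the rest are data rows
header, *rows = parsed
```

The resulting `rows` and `header` can then be passed to pd.DataFrame(rows, columns=header) in the same way as in the script above.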
