pd.DataFrame(table) adds extra None columns

December 28, 2023

This may seem like a strange way to deal with a CSV file, but it is a study task: I need to open the CSV file as text, read the lines, build a list, and then create a pandas DataFrame from that list.

import pandas as pd
with open ('file.csv', 'r') as f:
    lst = f.readlines()

for idx, line in enumerate(lst):
    lst[idx] = line.strip('\n')

header = lst[0].replace('"', '').split(",")
for idx, line in enumerate(lst[1:]):
    lst[idx] = line.split(',')

df = pd.DataFrame(data  = lst, columns = header)

ValueError: 5 columns passed, passed data had 39 columns

It crashes because pd.DataFrame adds (?) a bunch of Nones at the end of each row.
I checked this by running it without specifying ‘columns’.
Please help me understand where these Nones come from.

>Solution :

The issue you’re encountering is related to how you’re processing the CSV file lines and subsequently trying to construct a Pandas DataFrame. Let’s break down the steps and see where the problem might be:

Reading the File: You correctly open the CSV file and read its lines into a list.
Stripping Newline Characters: You remove the newline character from each line. This is also done correctly.
Processing the Header: You process the header correctly. Removing the double quotes (") is only needed if the header actually contains them.
Processing the Data Rows: Here’s where the issue originates. You iterate over lst[1:] but assign the split lines back to lst[idx]. enumerate() starts counting at 0 for the slice, so idx is off by one: the first data row overwrites the header at lst[0], and the last element of lst is never touched, so it remains an unsplit string. When pandas builds the DataFrame, it treats that string as a sequence of characters, so that row is as wide as the line is long (most likely the 39 in your error message), and every other row is padded with None to match.
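
To make the off-by-one visible, here is a minimal sketch that runs the same loop on a small in-memory list instead of the file. The sample lines are made up for illustration; only the loop itself comes from the question. It also shows one possible minimal fix, passing start=1 to enumerate().

```python
# Hypothetical sample standing in for the stripped lines of file.csv
lst = ['"a","b","c"', '1,2,3', '4,5,6']

header = lst[0].replace('"', '').split(',')

# Same loop as in the question: idx counts from 0, not from 1,
# so lst[0] is overwritten and the last element is never split
for idx, line in enumerate(lst[1:]):
    lst[idx] = line.split(',')

print(lst)     # [['1', '2', '3'], ['4', '5', '6'], '4,5,6']

# The leftover string counts one "column" per character
widths = [len(row) for row in lst]
print(widths)  # [3, 3, 5]

# One possible minimal fix: keep the indices aligned with start=1
fixed = ['"a","b","c"', '1,2,3', '4,5,6']
for idx, line in enumerate(fixed[1:], start=1):
    fixed[idx] = line.split(',')
rows = fixed[1:]  # data rows only, header excluded
print(rows)    # [['1', '2', '3'], ['4', '5', '6']]
```

With the indices aligned, every row has the same width as the header, so passing rows and header to pd.DataFrame should no longer raise the ValueError.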

import pandas as pd

with open('file.csv', 'r') as f:
    lines = f.readlines()

# Remove newline characters and strip quotes if needed
lines = [line.strip('\n').replace('"', '') for line in lines]

# Split the header
header = lines[0].split(',')

# Split the data rows
data = [line.split(',') for line in lines[1:]]

# Create the DataFrame
df = pd.DataFrame(data, columns=header)

This script should correctly process the CSV file into a DataFrame. If your CSV contains quoted fields with commas inside them, this simple split will break those fields apart, and you will need a real CSV parser such as pandas.read_csv() or Python’s csv module.
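
If the task still requires building the list yourself, the csv module can do the splitting while respecting quotes. The following is a rough sketch, assuming a small made-up sample held in io.StringIO in place of the real file; with an actual file you would open it with newline='' and pass the file object to csv.reader instead.

```python
import csv
import io

# Hypothetical sample with a comma inside a quoted field
sample = io.StringIO('name,city\n"Smith, John",Paris\nDoe,"Lyon"\n')

reader = csv.reader(sample)
header = next(reader)  # first row is the header
data = list(reader)    # remaining rows, quotes already removed

print(header)  # ['name', 'city']
print(data)    # [['Smith, John', 'Paris'], ['Doe', 'Lyon']]

# For comparison, a plain split breaks the quoted field in two
naive = '"Smith, John",Paris'.split(',')
print(len(naive))  # 3
```

The resulting header and data can be passed to pd.DataFrame(data, columns=header) in the same way as in the script above.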