I am writing a tool that pulls data from an API and saves it as CSV. I want to run this daily and update my CSV, removing duplicates. Currently the data is pulled and stored fine; however, I have two problems:
- There is a duplicate header row somewhere in my data that isn't being removed by drop_duplicates()
- If I delete the data.csv file and run the tool twice, the data.csv file changes in size by ~5,000 KB and I don't know why
Here is the relevant code.
def fetch_data_from_api(self):
    resource_dump_data = ""
    for idx, resource in enumerate(self.package["result"]["resources"]):
        if resource["datastore_active"]:
            url = self.base_url + "/datastore/dump/" + resource["id"]
            resource_dump_data += requests.get(url).text
    return pd.read_csv(io.StringIO(resource_dump_data))

def load_data_to_main(self, new_data):
    if os.path.isfile(self.CSV_FILE_PATH):
        existing_data = pd.read_csv(self.CSV_FILE_PATH)
        new_entries = pd.concat([existing_data, new_data])
    else:
        new_entries = new_data
    new_entries = new_entries.drop_duplicates()
    new_entries.to_csv(self.CSV_FILE_PATH, index=False)

def main():
    extract = Extract()
    new_data = extract.fetch_data_from_api()
    extract.load_data_to_main(new_data)
    print("Complete!")

if __name__ == "__main__":
    main()
The images below show the change in file size and the duplicate header row:
(https://i.stack.imgur.com/alyBL.png)(https://i.stack.imgur.com/7oGT9.png)(https://i.stack.imgur.com/FrPhP.png)
>Solution :
It looks like 1) fetch_data_from_api builds one big CSV string by concatenating the raw text of every datastore dump and reads it into a pandas DataFrame, and 2) that new data is passed to load_data_to_main(), where it is concatenated with an existing DataFrame (itself read from a CSV file). Concatenating two DataFrames that each have a correct header produces only one header, so my guess is that one of the two DataFrames (existing_data or new_data) must already contain the duplicate header row when it is read in — most likely because each resource dump carries its own header line, and every header after the first gets parsed as an ordinary data row. If so, drop_duplicates() won't delete that stray header: as a row of strings it differs from every other row, so it is not considered a duplicate.
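A minimal reproduction of that failure mode (the two-column data and names here are made up for the demo, not the asker's real schema):

```python
import io

import pandas as pd

# Simulate two datastore dumps, each carrying its own header line, glued
# together as raw text the way fetch_data_from_api does with `+=`.
dump_a = "id,name\n1,alpha\n2,beta\n"
dump_b = "id,name\n3,gamma\n"

combined = pd.read_csv(io.StringIO(dump_a + dump_b))

# The second "id,name" line is parsed as an ordinary data row, and it also
# forces the whole `id` column to be read as strings instead of integers.
deduped = combined.drop_duplicates()
print((deduped["id"] == "id").any())  # the stray header row survives: True
```

Note the side effect on dtypes: with a header token sitting in the column, every value is read as a string, which can also change how the numbers are written back out by to_csv.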
Is the API dump duplicating the header in each CSV data dump?
If so, you could try reading all the CSV data in without a header (header=None), and then I believe dropping duplicates will result in only one header row being present.
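Alternatively, the embedded headers can be avoided at the source: parse each dump's text into its own DataFrame before combining, so read_csv consumes every dump's header line. This is a sketch, not the question's code — the helper name parse_dumps is mine, and the network call is factored out so the parsing step stands alone:

```python
import io

import pandas as pd


def parse_dumps(dump_texts):
    """Parse each resource dump's CSV text separately, then combine.

    Each call to read_csv consumes that dump's header line, so no header
    can end up embedded in the combined data. (Hypothetical helper.)
    """
    frames = [pd.read_csv(io.StringIO(text)) for text in dump_texts]
    return pd.concat(frames, ignore_index=True)


# In fetch_data_from_api this would replace the string concatenation,
# roughly:
#     texts = [requests.get(url).text for url in dump_urls]
#     return parse_dumps(texts)
```

This also keeps numeric columns numeric, since no stray header token forces them to strings.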