Python Pandas Duplicate Header Row and Increasing File Size

I am writing a tool that pulls data from an API and saves it as CSV. I want to run this daily and update my CSV, removing duplicates. The data is pulled and stored fine, but I have two problems:

  • There is a duplicate header row somewhere in my data that isn’t being removed by drop_duplicates()
  • If I delete the data.csv file and run the tool twice, the file changes in size by ~5,000 KB, and I don’t know why

Here is the relevant code.

    import io
    import os

    import pandas as pd
    import requests


    class Extract:
        # (the constructor that sets base_url, package and CSV_FILE_PATH
        # is not shown here)

        def fetch_data_from_api(self):
            resource_dump_data = ""
            for resource in self.package["result"]["resources"]:
                if resource["datastore_active"]:
                    url = self.base_url + "/datastore/dump/" + resource["id"]
                    resource_dump_data += requests.get(url).text
            return pd.read_csv(io.StringIO(resource_dump_data))

        def load_data_to_main(self, new_data):
            if os.path.isfile(self.CSV_FILE_PATH):
                existing_data = pd.read_csv(self.CSV_FILE_PATH)
                new_entries = pd.concat([existing_data, new_data])
            else:
                new_entries = new_data

            new_entries = new_entries.drop_duplicates()
            new_entries.to_csv(self.CSV_FILE_PATH, index=False)

def main():
    extract = Extract()
    new_data = extract.fetch_data_from_api()
    extract.load_data_to_main(new_data)
    print("Complete!")

if __name__ == "__main__":
    main()

The images below show the change in file size and the duplicate header row:


(https://i.stack.imgur.com/alyBL.png)
(https://i.stack.imgur.com/7oGT9.png)
(https://i.stack.imgur.com/FrPhP.png)

> Solution:

It looks like (1) fetch_data_from_api returns a pandas DataFrame parsed from CSV text, and (2) this new data is passed to load_data_to_main(), where it is concatenated with an existing DataFrame that is itself read from a CSV file. Concatenating two DataFrames that each have a correct header produces only one header, so my guess is that one of the two DataFrames (existing_data or new_data) already contains the duplicate header row as data when it is read in. If so, drop_duplicates() won’t delete it, because the stray header is treated as a regular row that differs from every other row.
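This is consistent with how fetch_data_from_api builds resource_dump_data: it concatenates the raw CSV text of every resource, so each dump after the first contributes its own header line as an ordinary data row. A minimal sketch of the effect (with made-up column names, not the asker's real schema), alongside a per-resource parse that avoids it:

```python
import io

import pandas as pd

# Two simulated API dumps, each carrying its own header line
# (hypothetical columns, standing in for the real datastore dumps).
dump_1 = "id,value\n1,10\n2,20\n"
dump_2 = "id,value\n3,30\n4,40\n"

# Concatenating the raw text first (as fetch_data_from_api does) embeds
# the second dump's header line as a data row; drop_duplicates() keeps it
# because the row ("id", "value") differs from every real data row.
df_text = pd.read_csv(io.StringIO(dump_1 + dump_2))

# Parsing each dump separately and concatenating DataFrames avoids the
# embedded header entirely.
df_clean = pd.concat(
    [pd.read_csv(io.StringIO(d)) for d in (dump_1, dump_2)],
    ignore_index=True,
)
```

With the raw-text approach, df_text carries five rows (including the stray header row, which survives drop_duplicates()); the per-dump parse yields the expected four.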

Is the API dump itself duplicating headers in the CSV data?

If so, you could try reading the CSV text without a header (header=None). Every header line then becomes an ordinary data row, the repeated headers are exact duplicates of each other, and dropping duplicates should leave only one header row, which you can promote back to column names.
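A minimal sketch of that idea, again with hypothetical column names rather than the asker's real schema:

```python
import io

import pandas as pd

# Simulated CSV text containing a repeated header line (hypothetical columns).
raw = "id,value\n1,10\nid,value\n2,20\n"

# With header=None every line is a data row, so the repeated header lines
# are exact duplicates of each other and drop_duplicates keeps only one.
df = pd.read_csv(io.StringIO(raw), header=None).drop_duplicates()

# Promote the surviving header row back to column names; the remaining
# values are strings at this point and may need an astype() afterwards.
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True)
```

Note that because the columns hold mixed header/data text during the header=None read, pandas leaves them as strings, so numeric columns need an explicit conversion afterwards.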
