I am writing a tool that pulls data from an API and saves it as CSV. I want to run this daily and update my CSV, removing duplicates. Currently the data is pulled and stored fine; however, I have two problems:
- There is a duplicate header row somewhere in my data that isn't being removed by drop_duplicates()
- If I delete the data.csv file and run the tool twice, the data.csv file changes in size by ~5,000 KB and I don't know why
Here is the relevant code.
def fetch_data_from_api(self):
    resource_dump_data = ""
    for idx, resource in enumerate(self.package["result"]["resources"]):
        if resource["datastore_active"]:
            url = self.base_url + "/datastore/dump/" + resource["id"]
            resource_dump_data += requests.get(url).text
    return pd.read_csv(io.StringIO(resource_dump_data))

def load_data_to_main(self, new_data):
    if os.path.isfile(self.CSV_FILE_PATH):
        existing_data = pd.read_csv(self.CSV_FILE_PATH)
        new_entries = pd.concat([existing_data, new_data])
    else:
        new_entries = new_data
    new_entries = new_entries.drop_duplicates()
    new_entries.to_csv(self.CSV_FILE_PATH, index=False)

def main():
    extract = Extract()
    new_data = extract.fetch_data_from_api()
    extract.load_data_to_main(new_data)
    print("Complete!")

if __name__ == "__main__":
    main()
The images below show the change in file size and the duplicate header row:
(https://i.stack.imgur.com/alyBL.png)(https://i.stack.imgur.com/7oGT9.png)(https://i.stack.imgur.com/FrPhP.png)
>Solution :
It looks like 1) fetch_data_from_api builds one big CSV string by concatenating the raw text of every datastore dump and reads it into a pandas DataFrame, and 2) that new data is passed to load_data_to_main(), where it is concatenated with an existing DataFrame (itself read from a CSV file). Concatenating two DataFrames that each have a correct header produces only one header, so my guess is that one of the two DataFrames (existing_data or new_data) must already contain the duplicate header row when it is read in — most likely because each resource dump carries its own header line, and every header after the first gets parsed as an ordinary data row. If so, drop_duplicates() won't delete that stray header: as a row of strings it differs from every other row, so it is not considered a duplicate.
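A minimal reproduction of that failure mode (the two-column data and names here are made up for the demo, not the asker's real schema):

```python
import io

import pandas as pd

# Simulate two datastore dumps, each carrying its own header line, glued
# together as raw text the way fetch_data_from_api does with `+=`.
dump_a = "id,name\n1,alpha\n2,beta\n"
dump_b = "id,name\n3,gamma\n"

combined = pd.read_csv(io.StringIO(dump_a + dump_b))

# The second "id,name" line is parsed as an ordinary data row, and it also
# forces the whole `id` column to be read as strings instead of integers.
deduped = combined.drop_duplicates()
print((deduped["id"] == "id").any())  # the stray header row survives: True
```

Note the side effect on dtypes: with a header token sitting in the column, every value is read as a string, which can also change how the numbers are written back out by to_csv.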
Is the API dump duplicating the header in each CSV data dump?
If so, you could try reading all the CSV data in without a header (header=None), and then I believe dropping duplicates will result in only one header row being present.
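Alternatively, the embedded headers can be avoided at the source: parse each dump's text into its own DataFrame before combining, so read_csv consumes every dump's header line. This is a sketch, not the question's code — the helper name parse_dumps is mine, and the network call is factored out so the parsing step stands alone:

```python
import io

import pandas as pd


def parse_dumps(dump_texts):
    """Parse each resource dump's CSV text separately, then combine.

    Each call to read_csv consumes that dump's header line, so no header
    can end up embedded in the combined data. (Hypothetical helper.)
    """
    frames = [pd.read_csv(io.StringIO(text)) for text in dump_texts]
    return pd.concat(frames, ignore_index=True)


# In fetch_data_from_api this would replace the string concatenation,
# roughly:
#     texts = [requests.get(url).text for url in dump_urls]
#     return parse_dumps(texts)
```

This also keeps numeric columns numeric, since no stray header token forces them to strings.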