pandas creates 2 copies of files in a loop

I have a dataframe like as below

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
                   'customer': rng.choice(list('ACD'),size=(5)),
                   'region': rng.choice(list('PQRS'),size=(5)),
                   'dumeel': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'target': rng.choice([0,1],size=(5))

I am trying to split the dataframe based on Customer and store it in a folder. Not necessary to understand the full code. The issue is in the last line.

i = 0
for k, v in df.groupby(['Customer']):
    LM = k.split('@')[0]
    i = i+1
    unique_cust_names = '_'.join(v['Customer'].astype(str).unique())
    unique_ids = '_'.join(v['Id'].astype(str).unique())
    unique_location = '_'.join(v['dumeel'].astype(str).unique())
    filename = '_'.join([unique_ids, unique_cust_names, unique_location, LM])
    with pd.ExcelWriter(f"{filename}.xlsx", engine='xlsxwriter') as writer:
    wb = load_workbook(filename = 'format_sheet.xlsx')
    sheet_from =wb.worksheets[0]
    wb1 = load_workbook(filename = f"{filename}.xlsx")
    sheet_to = wb1.worksheets[0]
    copy_styles(sheet_from, sheet_to)
    tab = Table(displayName = "Table1", ref = "A1:" + get_column_letter(wb1.worksheets[0].max_column) + str(wb1.worksheets[0].max_row) )
    style = TableStyleInfo(name="TableStyleMedium2", showFirstColumn=False, showLastColumn=False, showRowStripes=True, showColumnStripes=False)
    tab.tableStyleInfo = style
    #wb1.worksheets[0]"{filename}.xlsx")"test_files/" + f"{filename}.xlsx")  # issue is here
print("Total number of customers to be emailed is ", i)

Though the code works fine, the issue is in the below line I guess"test_files/" + f"{filename}.xlsx")  # issue is here

This creates two copies of files.. One in the current folder as jupyter notebook file and other one inside the test_files folder.

For ex: I see two files called test1.xlsx one in the current folder and one inside the test_files folder (path is test_files/test1.xlsx)

How can I avoid this?

I expect my output to generate/save only 1 file for each customer inside the test_files folder?

>Solution :

The issue is happening because you are referencing 2 different file names one with the prefix "test_files/" and once without it. Best way to handle it will be to define file name as follows

dir_filename = "test_files/" + f"{filename}.xlsx"

and then reference it in the following places

with pd.ExcelWriter(f"{filename}.xlsx", engine='xlsxwriter') as writer:
wb1 = load_workbook(filename = f"{filename}.xlsx")
##"test_files/" + f"{filename}.xlsx") 

Hope it helps

Leave a Reply