Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

zip_longest to filter dataframe and output multiple csv files

I have a dataframe like as shown below

ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ANZ,ABE,2021,0
5,ANZ,ABE,2021,1
6,ANZ,ABQ,2021,1
7,ANZ,ABW,2021,1
8,AUS,ABO,2020,1
9,KOREA,ABR,2019,0

data = pd.read_clipboard(sep=',')

My obejctive is to

a) Filter the dataframe by year>=2021 and output==1

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

b) Generate multiple csv files for each unique combination of region and supplier. For example, data for ANZ and AB should be stored in seperate file. Similarly, KOREA and ABR data should be stored in seperate file.. This has to be done for each unique combination of region and supplier

So, I tried the below

column_name = "region"
col_name = "supplier"
region_values = data[column_name].unique()
supplier_values = data[col_name].unique() 

for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == i[0] & Year>=2021 & output == 1 & {col_name} == i[1]")
    output_path = ATTACHMENT_DIR / f"{i}_ge_2021.csv"
    data_output.to_csv(output_path, index=False)

However, this results in error as shown below

KeyError: ‘i’

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call
last) ~\Anaconda3\lib\site-packages\pandas\core\computation\scope.py
in resolve(self, key, is_local)
205 # e.g., df[df > 0]
–> 206 return self.temps[key]
207 except KeyError as err:

KeyError: ‘i’

The above exception was the direct cause of the following exception:

UndefinedVariableError Traceback (most recent call
last)
C:\Users\aksha~1\AppData\Local\Temp/ipykernel_31264/2689222803.py in

1 for i in itertools.zip_longest(subregion_values,disti_values,fillvalue="ANZ"):
—-> 2 data_output = data.query(f"{column_name} == i[0] & Year>=2021 & output == 1 & {col_name} == i1")

I expect my output to have 5 csv files generated in the folder because there are 5 unique combination of region and supplier that meet filter criteria of year and output

update – zip_longest – incorrect output

enter image description here

>Solution :

Use @ for pass variables to query, for columns names are correct f-strings:

#i, j are same like i[0], i[1]
for i, j in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == @i & year>=2021 & output == 1 & {col_name} == @j")

Your solution also workign with @:

for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == @i[0] & year>=2021 & output == 1 & {col_name} == @i[1]")

Also is possible use f-strings for variables, but need pass repr for representation of i variables:

for i in itertools.zip_longest(region_values,supplier_values,fillvalue="ANZ"):
    data_output = data.query(f"{column_name} == {repr(i[0])} & year>=2021 & output == 1 & {col_name} == {repr(i[1])}")
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading