Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Elegant way to query dataframe based on nested OR and nested AND

I have a dataframe like as shown below

ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ASEAN,ABQ,2021,1
5,ASEAN,ABE,2021,2
6,ASEAN,ABQ,2021,3
7,UK,ABW,2021,8
8,UK,ABO,2020,1
9,UK,ABR,2019,0

I would like to do the below

a) Filter the dataframe based on Region = UK and (Supplier = ABW or output >=1 or year = 2021)

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

b) Filter the dataframe based on Region = ANZ and (Supplier = ABC or output >1 or year = 2021)

c) Filter the dataframe based on Region = ASEAN and (Supplier = ABE or output >1 or year = 2021)

So, I tried the below

df_ANZ = df[(df['Region']=='ANZ') & ((df['Supplier']=='ABC') | (df['output']>1) | (df['year']==2021))]
df_UK = df[(df['Region']=='UK') & ((df['Supplier']=='ABW') | (df['output']>=1) | (df['year']==2021))]
df_ASEAN = df[(df['Region']=='ASEAN') & ((df['Supplier']=='ABE') | (df['output']>1) | (df['year']==2021))]
df_ANZ.append(df_UK).append(df_ASEAN)

But the problem is, I have similar such criteria for around 10 regions. Writing 10 lines for each region may not be elegant.

Is there any efficient and elegant way to do this for a big dataframe with 5 million rows?

I expect my output to be like below

ID,Region,Supplier,year,output
1,ANZ,AB,2021,1
2,ANZ,ABC,2022,1
3,ANZ,ABC,2022,1
4,ASEAN,ABQ,2021,1
5,ASEAN,ABE,2021,2
6,ASEAN,ABQ,2021,3
7,UK,ABW,2021,8

>Solution :

Create tuples for Region with Supplier, so possible first filter in list comprehension and then join mask by OR in np.logical_or.reduce:

tups = [('ANZ','ABC'),('UK','ABW'),('ASEAN','ABE')]

m = [(df['Region']==a) & ((df['Supplier']==b) | (df['output']>1) | (df['year']==2021)) 
      for a, b in tups]

df = df[np.logical_or.reduce(m)]

print (df)
   ID Region Supplier  year  output
0   1    ANZ       AB  2021       1
1   2    ANZ      ABC  2022       1
2   3    ANZ      ABC  2022       1
3   4  ASEAN      ABQ  2021       1
4   5  ASEAN      ABE  2021       2
5   6  ASEAN      ABQ  2021       3
6   7     UK      ABW  2021       8
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading