Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to subset pandas df based on two columns?

I need to subset rows of df based on two columns (c1 and c2 columns) which have strings.
I need to be able to return rows where one value in c1 is associated with only 2 different values in c2. col4-6 are irrelevant for subsetting and just need to be returned.

Code to recreate df 
df = pd.DataFrame({"": [0,1,2,3,4,5,6,7,8],
                     "c1": ["ABC", "ABC", "dfg", "dfg", "dfg","dfg","ghj","ghj","ghj"], 
                     "c2": ["delta", "delta", "alpha", "bravo", "alpha","bravo","bravo","delta","alpha"], 
                     "c3": [1, 2, 2, 3, 5,6,3,3,3], 
                     "col4": [786, 787, 777, 775, 767,715,772,712,712], 
                     "col5": [10, 11, 13, 12, 13,12,14,12,12], 
                     "col6": [1,2,4, 3, 4,3, 5, 8,8]})
 df
     c1      c2     c3   col4 col5 col6
 0   ABC     delta   1    786  10   1
 1   ABC     delta   2    787  11   2
 2   dfg     alpha   2    777  13   4
 3   dfg     bravo   3    775  12   3
 4   dfg     alpha   5    767  13   4
 5   dfg     bravo   6    715  12   3
 6   ghj     bravo   3    772  14   5
 7   ghj     delta   3    712  12   8
 8   ghj     alpha   3    712  12   8

Answer df should be:

 finaldf
     c1      c2     c3   col4 col5 col6
 2   dfg     alpha   2    777  13   4
 3   dfg     bravo   3    775  12   3
 4   dfg     alpha   5    767  13   4
 5   dfg     bravo   6    715  12   3

What if the rows where one value in c1 is associated with 2 and 3 different values in c2 is of interest like in df below?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

 finaldf
     c1      c2     c3   col4 col5 col6
 2   dfg     alpha   2    777  13   4
 3   dfg     bravo   3    775  12   3
 4   dfg     alpha   5    767  13   4
 5   dfg     bravo   6    715  12   3
 6   ghj     bravo   3    772  14   5
 7   ghj     delta   3    712  12   8
 8   ghj     alpha   3    712  12   8

I think some kind of groupby and transform operation could help achieve this.

>Solution :

here is one way to do it

#using transform, return a boolean on unique values matching the count
# and filter the dataframe
df[df.groupby('c1')['c2'].transform(lambda x: x.nunique()==2)]
c1  c2  c3  col4    col5    col6
2   2   dfg     alpha   2   777     13  4
3   3   dfg     bravo   3   775     12  3
4   4   dfg     alpha   5   767     13  4
5   5   dfg     bravo   6   715     12  3
df[df.groupby('c1')['c2'].transform(lambda x: x.nunique()>=2)]
c1  c2  c3  col4    col5    col6
2   2   dfg     alpha   2   777     13  4
3   3   dfg     bravo   3   775     12  3
4   4   dfg     alpha   5   767     13  4
5   5   dfg     bravo   6   715     12  3
6   6   ghj     bravo   3   772     14  5
7   7   ghj     delta   3   712     12  8
8   8   ghj     alpha   3   712     12  8
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading