How to keep only the rows of a Pandas DataFrame whose value in a given column occurs often enough

I have a Pandas DataFrame with categorical data in one of its columns. Calling value_counts() on that column gives something like:

HR                          176
Coding                       81
Reject                       74
Database Administration      21
Finance                      17
Project Management           16
Sales                        15
DevOps                       13
Core Electronics             10
Networking                   10
Medical Science               9
Core Mechanical               8
Web Development               4
Puzzles                       3
behavioural                   3
not a question                2
civil engineering             1
Mathematics                   1
Finance, Medical Science      1
Sales, HR                     1

I’d like to keep only the categories whose count is >= some threshold (e.g. 10). All the smaller categories should be lumped into a separate "Other" category, i.e. the result should look like:

HR                          176
Coding                       81
Reject                       74
Other                        33
Database Administration      21
Finance                      17
Project Management           16
Sales                        15
DevOps                       13
Core Electronics             10
Networking                   10

I’ve done this in the past by hacking together a defaultdict(int) and keeping only the entries where count >= threshold. Is there a canonical Pandas way of achieving the same thing?

Solution:

Is this the answer you’re looking for?

Pandas: Selecting rows based on value counts of a particular column
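
The idea named in that title, keeping only the rows whose category is frequent enough, might look roughly like the sketch below. It is not taken from the linked answer; the column name "topic", the sample data, and the threshold are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; "topic" is an assumed column name.
df = pd.DataFrame({"topic": ["HR"] * 4 + ["Coding"] * 3 + ["Puzzles"] * 1})
threshold = 3  # assumed cut-off for this toy example

# Map each row's category to the number of times that category occurs.
row_counts = df["topic"].map(df["topic"].value_counts())

# Keep only the rows whose category meets the threshold.
kept = df[row_counts >= threshold]
print(kept["topic"].value_counts())
```

Note that this drops the rare rows entirely rather than relabelling them.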

Otherwise, maybe this is what you want:

import pandas as pd

data = pd.DataFrame(
    [["researcher", 150], ["politician", 15], ["builder", 1], ["teacher", 5]],
    columns=["category", "count"],
)
filter_value = 10
# .copy() makes independent frames, so adding a column does not trigger SettingWithCopyWarning
d1 = data[data["count"] >= filter_value].copy()
d2 = data[data["count"] < filter_value].copy()
d1["tag"] = "filter_passed"
d2["tag"] = "Others"
data = pd.concat([d1, d2])
>>> data
     category  count            tag
0  researcher    150  filter_passed
1  politician     15  filter_passed
2     builder      1         Others
3     teacher      5         Others
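
The snippet above tags rows of a table that already holds counts. When the categories are still in a raw column, as in the question, one possible approach is to relabel the rare values with Series.where and count again. This is a minimal sketch, not a definitive recipe; the column name "topic", the sample data, and the threshold are assumptions.

```python
import pandas as pd

# Hypothetical sample data; "topic" is an assumed column name.
df = pd.DataFrame(
    {"topic": ["HR"] * 12 + ["Coding"] * 10 + ["Puzzles"] * 3 + ["Mathematics"]}
)
threshold = 10

counts = df["topic"].value_counts()
# Categories that occur often enough to keep their own label.
frequent = counts[counts >= threshold].index

# Series.where keeps values where the condition is True and
# replaces all other values with "Other".
df["topic"] = df["topic"].where(df["topic"].isin(frequent), "Other")

print(df["topic"].value_counts())
```

With this toy data, "Puzzles" and "Mathematics" both fall below the threshold, so their four rows end up counted under "Other".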