Is there a way to drop rows with low frequency given a threshold value?

With the following data, I would like to remove the rows with low frequency. Suppose I pick 2 as the threshold: how can I keep only the rows whose frequency is greater than 2 (or, in general, greater than some threshold x)?

import pandas as pd

# Rows are users, columns are retailers; each cell is a purchase count.
data = [[1, 0, 0, 0, 1], [0, 1, 1, 0, 2],
        [3, 1, 0, 1, 0], [0, 1, 1, 0, 0],
        [2, 0, 0, 0, 0], [1, 0, 2, 2, 0],
        [0, 3, 0, 1, 3], [0, 0, 1, 0, 0],
        [0, 0, 3, 0, 1], [0, 0, 1, 0, 0]]

df = pd.DataFrame(data, columns=[1608, 1772, 1790, 1791, 1799],
                  index=[1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009],
                  dtype=int)
df.index.name = 'user'

print(df)

      1608  1772  1790  1791  1799
user                              
1000     1     0     0     0     1
1001     0     1     1     0     2
1002     3     1     0     1     0
1003     0     1     1     0     0
1004     2     0     0     0     0
1005     1     0     2     2     0
1006     0     3     0     1     3
1007     0     0     1     0     0
1008     0     0     3     0     1
1009     0     0     1     0     0

Desired output:
I want to keep only the rows where the user (row index) has bought from 3 or more retailers (columns). Note that this is the number of non-zero entries in the row, not the row total. For example, user 1000 is dropped because they bought from only two retailers, while user 1001 is kept because they bought from three.

The actual data has 2 million rows and 30K columns.

user   1608  1772  1790  1791  1799
1001     0     1     1     0     2
1002     3     1     0     1     0
1005     1     0     2     2     0
1006     0     3     0     1     3

I have done some reading on the site but have not come across an example similar to my problem. I would much appreciate your help.

Solution:

We can count the non-zero values along the columns axis (that is, per row), then keep only the rows whose count is >= 3.

df[df.ne(0).sum(axis=1).ge(3)]

      1608  1772  1790  1791  1799
user                              
1001     0     1     1     0     2
1002     3     1     0     1     0
1005     1     0     2     2     0
1006     0     3     0     1     3
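
Since the question also asks about an arbitrary threshold x, the one-liner above can be wrapped in a small helper. This is only a minimal sketch: the function name `keep_rows_with_min_nonzero` and its `threshold` parameter are made up for illustration and are not part of pandas. It splits the expression into its steps so that each one can be inspected separately.

```python
import pandas as pd


def keep_rows_with_min_nonzero(df: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """Keep rows that have at least `threshold` non-zero entries (hypothetical helper)."""
    nonzero_mask = df.ne(0)                        # True where the cell is non-zero
    retailers_per_user = nonzero_mask.sum(axis=1)  # number of True values in each row
    return df[retailers_per_user.ge(threshold)]    # boolean-index the original frame


# Same sample data as in the question.
data = [[1, 0, 0, 0, 1], [0, 1, 1, 0, 2],
        [3, 1, 0, 1, 0], [0, 1, 1, 0, 0],
        [2, 0, 0, 0, 0], [1, 0, 2, 2, 0],
        [0, 3, 0, 1, 3], [0, 0, 1, 0, 0],
        [0, 0, 3, 0, 1], [0, 0, 1, 0, 0]]
df = pd.DataFrame(data, columns=[1608, 1772, 1790, 1791, 1799],
                  index=range(1000, 1010), dtype=int)
df.index.name = "user"

result = keep_rows_with_min_nonzero(df, 3)
print(result.index.tolist())  # [1001, 1002, 1005, 1006]
```

Here "greater than 2" from the question is expressed as "at least 3"; for a strict comparison against x, `.gt(x)` could be used in place of `.ge(threshold)`. The sketch assumes the data fits in memory as a dense DataFrame, which may not hold for the 2 million by 30K case mentioned in the question.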