Is there a way to drop rows with low frequency given a threshold value

December 28, 2021

With the following data, I would like to remove the rows with low frequency. Suppose I pick 2 as a threshold, how can I only keep rows whose frequency is greater than 2, or threshold x.

data = [[1, 0, 0, 0, 1 ], [0, 1, 1, 0, 2], 
       [3, 1, 0, 1, 0], [0, 1, 1, 0, 0],
      [2, 0,0 , 0, 0], [1, 0, 2, 2, 0], 
      [0, 3, 0, 1, 3], [0, 0, 1, 0, 0], 
      [0, 0, 3, 0, 1], [0, 0, 1, 0, 0]] 
      
df = pd.DataFrame(data, columns =[1608, 1772, 1790, 1791, 1799], 
                   index = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009],
                                           dtype = int) 
df.index.name = 'user'

print(df)

user    1608  1772  1790  1791  1799
1000     1     0     0     0     1
1001     0     1     1     0     2
1002     3     1     0     1     0
1003     0     1     1     0     0
1004     2     0     0     0     0
1005     1     0     2     2     0
1006     0     3     0     1     3
1007     0     0     1     0     0
1008     0     0     3     0     1
1009     0     0     1     0     0

Desired output:
I will only keep rows where the user (row id) has bought from 3 or more retailers (columns). Note that this is not the total of each row. For example, user 1000 will be dropped since he only bought from two retailers, but user 1001 will be kept, he bought from 3 retailers.

Actual data has 2 Million rows and 30K columns

user   1608  1772  1790  1791  1799
1001     0     1     1     0     2
1002     3     1     0     1     0
1005     1     0     2     2     0
1006     0     3     0     1     3

I have done some reading on the site and don’t seem to come across an example much similar to my problem. Much appreciate your help.

>Solution :

We can count the non zero values along the columns axis, then filter the rows which have count >= 3

df[df.ne(0).sum(1).ge(3)]

      1608  1772  1790  1791  1799
user                              
1001     0     1     1     0     2
1002     3     1     0     1     0
1005     1     0     2     2     0
1006     0     3     0     1     3