With the following data, I would like to remove the rows with low frequency. Suppose I pick 2 as a threshold, how can I only keep rows whose frequency is greater than 2, or threshold x.
data = [[1, 0, 0, 0, 1 ], [0, 1, 1, 0, 2],
[3, 1, 0, 1, 0], [0, 1, 1, 0, 0],
[2, 0,0 , 0, 0], [1, 0, 2, 2, 0],
[0, 3, 0, 1, 3], [0, 0, 1, 0, 0],
[0, 0, 3, 0, 1], [0, 0, 1, 0, 0]]
df = pd.DataFrame(data, columns =[1608, 1772, 1790, 1791, 1799],
index = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009],
dtype = int)
df.index.name = 'user'
print(df)
user 1608 1772 1790 1791 1799
1000 1 0 0 0 1
1001 0 1 1 0 2
1002 3 1 0 1 0
1003 0 1 1 0 0
1004 2 0 0 0 0
1005 1 0 2 2 0
1006 0 3 0 1 3
1007 0 0 1 0 0
1008 0 0 3 0 1
1009 0 0 1 0 0
Desired output:
I will only keep rows where the user (row id) has bought from 3 or more retailers (columns). Note that this is not the total of each row. For example, user 1000 will be dropped since he only bought from two retailers, but user 1001 will be kept, he bought from 3 retailers.
Actual data has 2 Million rows and 30K columns
user 1608 1772 1790 1791 1799
1001 0 1 1 0 2
1002 3 1 0 1 0
1005 1 0 2 2 0
1006 0 3 0 1 3
I have done some reading on the site and don’t seem to come across an example much similar to my problem. Much appreciate your help.
>Solution :
We can count the non zero values along the columns axis, then filter the rows which have count >= 3
df[df.ne(0).sum(1).ge(3)]
1608 1772 1790 1791 1799
user
1001 0 1 1 0 2
1002 3 1 0 1 0
1005 1 0 2 2 0
1006 0 3 0 1 3