Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Counting occurrences based on another column in a dataframe

In a database, I have data that looks like this (just a snippet):

itemId   userId     action       likeorDislike     timestamp
i1       u1         rate         0                 2021-06-09 10:43:57.827 UTC
i1       u1         rate         1                 2021-06-10 10:43:57.827 UTC
i1       u2         rate         1                 2021-06-09 11:43:57.827 UTC
i1       u3         rate         1                 2021-06-09 12:43:57.827 UTC
i2       u6         rate         1                 2021-06-09 10:43:57.827 UTC
i2       u6         rate         0                 2021-08-09 10:43:57.827 UTC
i2       u1         rate         0                 2021-06-12 10:43:57.827 UTC
i4       u1         rate         1                 2021-06-09 10:45:57.827 UTC
i4       u1         rate         1                 2021-06-09 10:48:57.827 UTC
i4       u3         rate         1                 2021-06-09 10:45:58.827 UTC
i1       u5         select                         2021-06-09 10:45:58.827 UTC

I want to compute a score for each itemId, which is the number of 1‘s over the sum of likeorDislike for that particular item, but only take into consideration the first attempt of a user to rate that particular item.

itemId     score
i1         0.66                 
i2         0.5
i4         1

What I did:
Only kept the data that interests me (ordered by timestamp, cleaned etc.), but I don’t know how to compute the above scores now:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

itemId     likeOrDislike     
i1         0      
i1         1 
i2         1
i1         1
i2         0
i4         1
i4         1

I also wrote a SQL query that does compute these scores but doesn’t take into consideration the timestamp.

%%bigquery counting_answers_new
SELECT trueAnswers.itemId, IFNULL(falseAnswers.answerCount, 0) as falseAnswerCount, trueAnswers.answerCount as trueAnswerCount, IFNULL((trueAnswers.answerCount/(falseAnswers.answerCount+trueAnswers.answerCount)), 1) as score
FROM 
 ( SELECT itemId, likeorDislike, action, COUNT(*) as answerCount
 FROM `mydata` 
 GROUP BY itemId, likeorDislike, action
 HAVING action='rate' AND likeorDislike=false
 ORDER BY itemId) as falseAnswers
 RIGHT JOIN
 ( SELECT itemId, likeorDislike, acton, COUNT(*) as answerCount
 FROM `mydata` 
 GROUP BY itemId, likeorDislike, action
 HAVING action='rate' AND likeorDislike=true
 ORDER BY itemId) as trueAnswers
 ON falseAnswers.itemId = trueAnswers.itemId

I would like to do it in pandas and proceed from the dataframe with two columns that I have above because I already cleaned it/filtered it. I know how to count occurrences of 0’s and 1’s, but how do I do that pro item, and how do I compute these scores?

>Solution :

It seems like you just want the mean of the itemId’s likeOrDislike. In pandas you could do this:

df[["itemId", "likeOrDislike"]].groupby("itemId").mean()
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading