Pandas Avoiding Summation of Duplicates

October 8, 2022

I am working with the Pandas Library and learning more about it with data manipulation and analysis. With this Dataframe ITEM

  ItemNo1 ItemNo2 ItemNo3 ItemNo4 
0   Paper   Paper    Rock   Paper
1 Scissor Scissor    Rock     NaN
2    Rock    Rock    Rock   Paper
3 Scissor   Paper Scissor   Paper
4    Rock   Paper Scissor    Rock

I’d like to create a Dataframe with two columns. Item and Unique Item Row Count. The unique item row count should only have the unique row values of Item. Meaning if there are three of the same item in the row, it will only take into account the first instance for count. This would mean the following output:


  Item UniqueItemRowCount
0 Paper                 4
1 Rock                  4
2 Scissor               3

There will be 4 for rock because for index 0 and 1 there is one rock. For index 2 and 4 there is already one Rock found so the rest are ignored/not added for the sum.

My main issue is trying to create a lambda function for this as I cannot make it properly get the sum. I have also tried to make the values NaN (null) if there is a duplicate found in the row but this does not work either.

>Solution :

Use DataFrame.melt with DataFrame.drop_duplicates for count duplicates per rows and then count values by Series.value_counts:

df1 = (df.melt()
         .drop_duplicates()['value']
         .value_counts()
         .rename_axis('Item')
         .reset_index(name='UniqueItemRowCount'))
print (df1)
      Item  UniqueItemRowCount
0     Rock                   4
1  Scissor                   3
2    Paper                   3