Python: Fastest way to get the number of duplicates of each unique row in a pandas DataFrame full of duplicates

So I have a pandas dataframe full of repeats. I’m trying to count how many times each row has been repeated and put that into a column.

Here’s my code, which I found on Stack Overflow:

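def getUniqCounts2(dupDF):
    # getUniq is a helper that is not shown here; it presumably returns the
    # unique rows of dupDF with a fresh 0..n-1 index.
    countARR = []
    uniqDF = getUniq(dupDF)
    for i in range(len(uniqDF)):
        # Count the rows of dupDF that match the i-th unique row.
        # This scans the whole of dupDF once per unique row.
        df2 = len(dupDF[(dupDF["variable1"] == uniqDF["variable1"][i]) &
                        (dupDF["variable2"] == uniqDF["variable2"][i])])
        print(df2)
        countARR.append(df2)
    uniqDF["count"] = countARR
    return uniqDF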

Someone did recommend the pivot table function, but the problem is that the resulting DF has a column that is only partially filled even though the input data frame had all columns filled.

Now I have a DF of 1.5M rows. I just want to get the unique rows and the number of repetitions found per row. What would be the fastest way to do this?

Solution:

Suppose your DataFrame is named df. You can count the occurrences of each unique row by grouping on all of its columns with groupby() and taking the size of each group:

counts = df.groupby(df.columns.tolist()).size().reset_index(name='count')

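This should be much more efficient than the loop in the question, which filters the whole DataFrame once per unique row, whereas groupby() counts every group in a single pass.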
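To make the result concrete, here is a minimal sketch of the groupby approach on a tiny frame. The sample data is made up for illustration; only the column names variable1 and variable2 come from the question's code.

```python
import pandas as pd

# Hypothetical sample data; the column names follow the question's code.
df = pd.DataFrame({
    "variable1": ["a", "a", "b", "a", "b", "c"],
    "variable2": [1, 1, 2, 1, 2, 3],
})

# Group on every column so that each group is one distinct row,
# then count the rows in each group and store the result in a 'count' column.
counts = (
    df.groupby(df.columns.tolist())
      .size()
      .reset_index(name="count")
)
print(counts)
```

Each unique row appears once in counts, and the count column sums to len(df). One thing to watch for: by default groupby() drops rows whose key columns contain NaN, so if your data has missing values you may want to pass dropna=False (available in pandas 1.1 and later). On those versions, df.value_counts() is another option that should give the same counts, sorted by frequency.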
