Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas: check if count of occurence of each element of a dataframe column is equal to the element occurence count in another dataframe

Giving two dataframes df1:

col1 
-----
1
1
1
2
3
1
1
2

and df2:

colA | colB
------------
1    | 2
1    | 4
2    | 1
3    | 1
5    | 1
4    | 5

I want to return True if the count of the occurence of every element e in col1 in df1 is equal to the count of the occurence of e in both colA and colB in df2.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

In the example above for df1 and df2 it should return True.

Explanation:

count of occurence of 1 in col1 in df1= 5 = count of occurence of 1 in colA and colB in df2.

count of occurence of 2 in col1 = 2 = count of occurence of 2 in colA and colB in df2.

count of occurence of 3 in col1 = 1 = count of occurence of 3 in colA and colB in df2

Logic:
The idea is to groupby the elements of col1 in df1 and count the occurences of each one of it then go search the count of their occurence in colA and colB in df2.

I’ve tried the following code:

def records_check(df1, df2):
    if df1.groupby(['col1']).size() == df2.groupby(['colA', 'colA']).size():
        return True
    else:
        return False

But I’ve got this:

ValueError: Can only compare identically-labeled Series objects

What is the most efficient way to achieve this especially when dealing with huge data please?

>Solution :

You can use value_counts on col1 and the stacked version of df2, then compare the outputs:

c1 = df1['col1'].value_counts(sort=False)
c2 = df2.stack().value_counts(sort=False)
# if more columns
# c2 = df2[['colA', 'colB']].stack().value_counts()

out = c1.eq(c2.reindex_like(c1)).all()

variant:

# get values and counts for each dataset
n1, c1 = np.unique(df1['col1'], return_counts=True)
n2, c2 = np.unique(df2[['colA', 'colB']].to_numpy().ravel(), return_counts=True)

# compare the counts for existing values in `df1['col1']`
np.array_equal(c1, c2[np.isin(n2, n1)])

Output: True

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading