Home Pandas: check if count of occurence of each element of a dataframe column is equal to the element occurence count in another dataframe

Questions

Pandas: check if count of occurence of each element of a dataframe column is equal to the element occurence count in another dataframe

byMR

October 25, 2023

Giving two dataframes df1:

col1 
-----
1
1
1
2
3
1
1
2

and df2:

colA | colB
------------
1    | 2
1    | 4
2    | 1
3    | 1
5    | 1
4    | 5

I want to return True if the count of the occurence of every element e in col1 in df1 is equal to the count of the occurence of e in both colA and colB in df2.

In the example above for df1 and df2 it should return True.

Explanation:

count of occurence of 1 in col1 in df1= 5 = count of occurence of 1 in colA and colB in df2.

count of occurence of 2 in col1 = 2 = count of occurence of 2 in colA and colB in df2.

count of occurence of 3 in col1 = 1 = count of occurence of 3 in colA and colB in df2

Logic:
The idea is to groupby the elements of col1 in df1 and count the occurences of each one of it then go search the count of their occurence in colA and colB in df2.

I’ve tried the following code:

def records_check(df1, df2):
    if df1.groupby(['col1']).size() == df2.groupby(['colA', 'colA']).size():
        return True
    else:
        return False

But I’ve got this:

ValueError: Can only compare identically-labeled Series objects

What is the most efficient way to achieve this especially when dealing with huge data please?

>Solution :

You can use value_counts on col1 and the stacked version of df2, then compare the outputs:

c1 = df1['col1'].value_counts(sort=False)
c2 = df2.stack().value_counts(sort=False)
# if more columns
# c2 = df2[['colA', 'colB']].stack().value_counts()

out = c1.eq(c2.reindex_like(c1)).all()

numpy variant:

# get values and counts for each dataset
n1, c1 = np.unique(df1['col1'], return_counts=True)
n2, c2 = np.unique(df2[['colA', 'colB']].to_numpy().ravel(), return_counts=True)

# compare the counts for existing values in `df1['col1']`
np.array_equal(c1, c2[np.isin(n2, n1)])

Output: True