Giving two dataframes df1:
col1
-----
1
1
1
2
3
1
1
2
and df2:
colA | colB
------------
1 | 2
1 | 4
2 | 1
3 | 1
5 | 1
4 | 5
I want to return True if the count of the occurence of every element e in col1 in df1 is equal to the count of the occurence of e in both colA and colB in df2.
In the example above for df1 and df2 it should return True.
Explanation:
count of occurence of 1 in col1 in df1= 5 = count of occurence of 1 in colA and colB in df2.
count of occurence of 2 in col1 = 2 = count of occurence of 2 in colA and colB in df2.
count of occurence of 3 in col1 = 1 = count of occurence of 3 in colA and colB in df2
Logic:
The idea is to groupby the elements of col1 in df1 and count the occurences of each one of it then go search the count of their occurence in colA and colB in df2.
I’ve tried the following code:
def records_check(df1, df2):
if df1.groupby(['col1']).size() == df2.groupby(['colA', 'colA']).size():
return True
else:
return False
But I’ve got this:
ValueError: Can only compare identically-labeled Series objects
What is the most efficient way to achieve this especially when dealing with huge data please?
>Solution :
You can use value_counts on col1 and the stacked version of df2, then compare the outputs:
c1 = df1['col1'].value_counts(sort=False)
c2 = df2.stack().value_counts(sort=False)
# if more columns
# c2 = df2[['colA', 'colB']].stack().value_counts()
out = c1.eq(c2.reindex_like(c1)).all()
numpy variant:
# get values and counts for each dataset
n1, c1 = np.unique(df1['col1'], return_counts=True)
n2, c2 = np.unique(df2[['colA', 'colB']].to_numpy().ravel(), return_counts=True)
# compare the counts for existing values in `df1['col1']`
np.array_equal(c1, c2[np.isin(n2, n1)])
Output: True