When I hash the same email address in two DataFrames, I get different hashes.
These two dataframes, df1 and df2, each contain a column of email addresses which need to be hashed, so the hashes can be compared when they are inner joined, like this:
import pandas as pd
### Boring part to import the data ###
# define table 1 as df1
df1 = pd.DataFrame([[2, 'random.email@fjjd.com'], [6, 'different.email@u8888ueend.dskjs'], [7, 'random.email@dsju.c'], [8, 'same.email@dkicj.c'], [200, 'different.email@cjhs.oo'], [18, 'random.email@siidjd.dd'], [19, 'random.email@dsjds.j']])
df1 = df1.set_axis(['ID1', 'email 1'], axis=1)
# define table 2 as df2
df2 = pd.DataFrame([[100, 'new.email@jsscd.d'], [6, 'different.email@uueend.dskjs'], [99, 'new.email@djjsd.conm'], [10, 'new.email@jhhs.co'], [115, 'new.email@dsjjdsds.cod'], [116, 'new.email@dsjkjds.ckk'], [8, 'same.email@dkicj.c'], [200, 'different.email@jdsjd.co']])
df2 = df2.set_axis(['ID2', 'email 2'], axis=1)
### End part to import the data ###
### Fun part now... ###
# hash the emails in each row of df1?
df1['hash 1'] = pd.util.hash_pandas_object(df1['email 1'].astype(str))
# hash the emails in each row of df2?
df2['hash 2'] = pd.util.hash_pandas_object(df2['email 2'].astype(str))
# perform an inner join of df1 and df2 about their IDs, ID1 and ID2 respectively
df3 = pd.merge(df1, df2, how='inner', left_on='ID1', right_on='ID2')
# add an email comparison column
df3['same email'] = df3['email 1'] == df3['email 2']
# add a hash comparison column
df3['same hash'] = df3['hash 1'] == df3['hash 2']
# print the table...
print(df3)
The result shows that while the email addresses in row 1 are identical (as far as I can tell) the hashes are not the same:
ID1 email 1 hash 1 ID2 email 2 hash 2 same email same hash
0 6 different.email@u8888ueend.dskjs 18381560226251184406 6 different.email@uueend.dskjs 16113553761483526335 False False
1 8 same.email@dkicj.c 5780217243550696535 8 same.email@dkicj.c 6939369575697951555 True False
2 200 different.email@cjhs.oo 13252009090739560311 200 different.email@jdsjd.co 1942861278265138167 False False
Why are these hashes of the same email address from different DataFrames different to one another?
>Solution :
According to the documentation, pd.util.hash_pandas_object includes the index in the hash computation by default (index=True). So when the same email sits at different index labels in the two DataFrames (row 3 of df1 and row 6 of df2 here), the resulting hashes differ.
You can try:
df1["hash 1"] = pd.util.hash_pandas_object(df1["email 1"].astype(str), index=False)
df2["hash 2"] = pd.util.hash_pandas_object(df2["email 2"].astype(str), index=False)
Then the result will be:
ID1 email 1 hash 1 ID2 email 2 hash 2
0 6 different.email@u8888ueend.dskjs 5185970979410096600 6 different.email@uueend.dskjs 18338061231746973003
1 8 same.email@dkicj.c 9881121729072933860 8 same.email@dkicj.c 9881121729072933860
2 200 different.email@cjhs.oo 742268446511091656 200 different.email@jdsjd.co 775994242592712264
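To see the effect of the index in isolation, here is a minimal sketch. The Series s1 and s2 are made up for illustration; they hold the same string under the index labels 3 and 6, mirroring where that email sits in df1 and df2:

```python
import pandas as pd

# Same string, stored under different index labels (illustrative data)
s1 = pd.Series(["same.email@dkicj.c"], index=[3])
s2 = pd.Series(["same.email@dkicj.c"], index=[6])

# Default behaviour (index=True): the index label is mixed into the hash
with_index_1 = pd.util.hash_pandas_object(s1)
with_index_2 = pd.util.hash_pandas_object(s2)

# index=False: only the values are hashed
no_index_1 = pd.util.hash_pandas_object(s1, index=False)
no_index_2 = pd.util.hash_pandas_object(s2, index=False)

index_changes_hash = with_index_1.iloc[0] != with_index_2.iloc[0]
values_match_without_index = no_index_1.iloc[0] == no_index_2.iloc[0]

print("differs with index:", index_changes_hash)
print("matches without index:", values_match_without_index)
```

With index=False the hash depends only on the value, which is what you want when comparing rows that come from different DataFrames.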
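Another option is the built-in hash function. Note that Python salts string hashes per process unless PYTHONHASHSEED is set, so these values are only comparable within a single run: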
df1["hash 1"] = df1["email 1"].apply(hash)
df2["hash 2"] = df2["email 2"].apply(hash)
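If the hashes need to stay the same across runs or machines, one alternative (not part of the answer above, just a sketch) is a digest from the standard library's hashlib. The helper name sha256_hex and the one-row DataFrames are assumptions made for illustration:

```python
import hashlib

import pandas as pd


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a string (hypothetical helper)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Illustrative one-row frames using the matching email from the question
df1 = pd.DataFrame({"ID1": [8], "email 1": ["same.email@dkicj.c"]})
df2 = pd.DataFrame({"ID2": [8], "email 2": ["same.email@dkicj.c"]})

# The digest depends only on the string, not on the index or the process
df1["hash 1"] = df1["email 1"].astype(str).apply(sha256_hex)
df2["hash 2"] = df2["email 2"].astype(str).apply(sha256_hex)

print(df1["hash 1"].iloc[0] == df2["hash 2"].iloc[0])
```

This is slower than hash_pandas_object on large columns because it runs in Python per row, but the output is stable and independent of row position.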