Why do identical values produce different hashes in different pandas DataFrames?

When I hash the same email address in two DataFrames, I get different hashes.

The two DataFrames, df1 and df2, each contain a column of email addresses. I need to hash these so the hashes can be compared after an inner join, like this:

import pandas as pd

### Boring part to import the data ###

# define table 1 as df1
df1 = pd.DataFrame([[2, 'random.email@fjjd.com'], [6, 'different.email@u8888ueend.dskjs'], [7, 'random.email@dsju.c'], [8, 'same.email@dkicj.c'], [200, 'different.email@cjhs.oo'], [18, 'random.email@siidjd.dd'], [19, 'random.email@dsjds.j']])
df1 = df1.set_axis(['ID1', 'email 1'], axis=1)

# define table 2 as df2
df2 = pd.DataFrame([[100, 'new.email@jsscd.d'], [6, 'different.email@uueend.dskjs'], [99, 'new.email@djjsd.conm'], [10, 'new.email@jhhs.co'], [115, 'new.email@dsjjdsds.cod'], [116, 'new.email@dsjkjds.ckk'], [8, 'same.email@dkicj.c'], [200, 'different.email@jdsjd.co']])
df2 = df2.set_axis(['ID2', 'email 2'], axis=1)

### End part to import the data ###

### Fun part now... ###

# hash the emails in each row of df1?
df1['hash 1'] = pd.util.hash_pandas_object(df1['email 1'].astype(str))  

# hash the emails in each row of df2?
df2['hash 2'] = pd.util.hash_pandas_object(df2['email 2'].astype(str)) 

# perform an inner join of df1 and df2 about their IDs, ID1 and ID2 respectively
df3 = pd.merge(df1, df2, how='inner', left_on='ID1', right_on='ID2') 

# add an email comparison column
df3['same email'] = df3['email 1'] == df3['email 2']

# add a hash comparison column
df3['same hash'] = df3['hash 1'] == df3['hash 2']

# print the table...
print(df3)
 

The result shows that while the email addresses in row 1 are identical (as far as I can tell), the hashes are not:

   ID1                           email 1                hash 1  ID2                       email 2                hash 2  same email  same hash
0    6  different.email@u8888ueend.dskjs  18381560226251184406    6  different.email@uueend.dskjs  16113553761483526335       False      False
1    8                same.email@dkicj.c   5780217243550696535    8            same.email@dkicj.c   6939369575697951555        True      False
2  200           different.email@cjhs.oo  13252009090739560311  200      different.email@jdsjd.co   1942861278265138167       False      False

Why are these hashes of the same email address from different DataFrames different to one another?

>Solution :

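According to the documentation, pd.util.hash_pandas_object includes the index in the hash computation by default (index=True). The hash of each row therefore depends on both its value and its index label. Here, same.email@dkicj.c sits at index 3 in df1 and index 6 in df2, so the two hashes differ even though the strings are identical.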

You can try:

df1["hash 1"] = pd.util.hash_pandas_object(df1["email 1"].astype(str), index=False)
df2["hash 2"] = pd.util.hash_pandas_object(df2["email 2"].astype(str), index=False)

Then the result will be:

   ID1                           email 1               hash 1  ID2                       email 2                hash 2
0    6  different.email@u8888ueend.dskjs  5185970979410096600    6  different.email@uueend.dskjs  18338061231746973003
1    8                same.email@dkicj.c  9881121729072933860    8            same.email@dkicj.c   9881121729072933860
2  200           different.email@cjhs.oo   742268446511091656  200      different.email@jdsjd.co    775994242592712264
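
To see the effect of the index in isolation, here is a minimal sketch that hashes the same string at two different index labels (3 and 6, matching the row positions in df1 and df2). The variable names are only illustrative:

```python
import pandas as pd

# The same string stored under different index labels
s1 = pd.Series(["same.email@dkicj.c"], index=[3])
s2 = pd.Series(["same.email@dkicj.c"], index=[6])

# Default (index=True): the index label is mixed into each row's hash
with_index_1 = pd.util.hash_pandas_object(s1)
with_index_2 = pd.util.hash_pandas_object(s2)

# index=False: only the values are hashed
values_only_1 = pd.util.hash_pandas_object(s1, index=False)
values_only_2 = pd.util.hash_pandas_object(s2, index=False)

hashes_differ_with_index = bool(with_index_1.iloc[0] != with_index_2.iloc[0])
hashes_match_without_index = bool(values_only_1.iloc[0] == values_only_2.iloc[0])
print(hashes_differ_with_index, hashes_match_without_index)
```

With index=False the two hashes agree because only the string contributes to the result.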

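Another option is Python's built-in hash function, which depends only on the value. Note that string hashes from hash() are only comparable within a single Python process: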

df1["hash 1"] = df1["email 1"].apply(hash)
df2["hash 2"] = df2["email 2"].apply(hash)
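
If the hashes need to be stored or compared across separate runs, the built-in hash is not a good fit, because Python randomizes string hashing per process unless PYTHONHASHSEED is fixed. One possible alternative (a swapped-in technique, not part of the original answer) is a digest from the standard library's hashlib. The helper name stable_email_hash below is hypothetical:

```python
import hashlib


def stable_email_hash(email: str) -> str:
    """Return a SHA-256 hex digest that is the same in every Python process."""
    # Encode explicitly so the digest does not depend on platform defaults
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


first = stable_email_hash("same.email@dkicj.c")
second = stable_email_hash("same.email@dkicj.c")
other = stable_email_hash("different.email@cjhs.oo")
print(first == second, first == other)
```

Under that assumption it could be applied per row in the same way as the built-in, for example df1["email 1"].apply(stable_email_hash).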
