Why are these hashes of the same values different between different Pandas DataFrames?

When I hash the same email address in two DataFrames, I get different hashes.

These two dataframes, df1 and df2, each contain a column of email addresses which need to be hashed, so the hashes can be compared when they are inner joined, like this:

import pandas as pd

### Boring part to import the data ###

# define table 1 as df1
df1 = pd.DataFrame([[2, 'random.email@fjjd.com'], [6, 'different.email@u8888ueend.dskjs'], [7, 'random.email@dsju.c'], [8, 'same.email@dkicj.c'], [200, 'different.email@cjhs.oo'], [18, 'random.email@siidjd.dd'], [19, 'random.email@dsjds.j']])
df1 = df1.set_axis(['ID1', 'email 1'], axis=1)

# define table 2 as df2
df2 = pd.DataFrame([[100, 'new.email@jsscd.d'], [6, 'different.email@uueend.dskjs'], [99, 'new.email@djjsd.conm'], [10, 'new.email@jhhs.co'], [115, 'new.email@dsjjdsds.cod'], [116, 'new.email@dsjkjds.ckk'], [8, 'same.email@dkicj.c'], [200, 'different.email@jdsjd.co']])
df2 = df2.set_axis(['ID2', 'email 2'], axis=1)

### End part to import the data ###

### Fun part now... ###

# hash the emails in each row of df1?
df1['hash 1'] = pd.util.hash_pandas_object(df1['email 1'].astype(str))  

# hash the emails in each row of df2?
df2['hash 2'] = pd.util.hash_pandas_object(df2['email 2'].astype(str)) 

# perform an inner join of df1 and df2 about their IDs, ID1 and ID2 respectively
df3 = pd.merge(df1, df2, how='inner', left_on='ID1', right_on='ID2') 

# add an email comparison column
df3['same email'] = df3['email 1'] == df3['email 2']

# add a hash comparison column
df3['same hash'] = df3['hash 1'] == df3['hash 2']

# print the table...
print(df3)
 

The result shows that although the email addresses in row 1 are identical (as far as I can tell), their hashes are not:

   ID1                           email 1                hash 1  ID2                       email 2                hash 2  same email  same hash
0    6  different.email@u8888ueend.dskjs  18381560226251184406    6  different.email@uueend.dskjs  16113553761483526335       False      False
1    8                same.email@dkicj.c   5780217243550696535    8            same.email@dkicj.c   6939369575697951555        True      False
2  200           different.email@cjhs.oo  13252009090739560311  200      different.email@jdsjd.co   1942861278265138167       False      False

Why are these hashes of the same email address from different DataFrames different to one another?

Solution:

According to the documentation, pd.util.hash_pandas_object includes the index in the hash computation by default (index=True). So when the same email sits at different index labels, the hashes differ. Here, same.email@dkicj.c is at index 3 in df1 but at index 6 in df2.

You can try:

df1["hash 1"] = pd.util.hash_pandas_object(df1["email 1"].astype(str), index=False)
df2["hash 2"] = pd.util.hash_pandas_object(df2["email 2"].astype(str), index=False)

Then the result will be:

   ID1                           email 1               hash 1  ID2                       email 2                hash 2
0    6  different.email@u8888ueend.dskjs  5185970979410096600    6  different.email@uueend.dskjs  18338061231746973003
1    8                same.email@dkicj.c  9881121729072933860    8            same.email@dkicj.c   9881121729072933860
2  200           different.email@cjhs.oo   742268446511091656  200      different.email@jdsjd.co    775994242592712264
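
To see the effect of the index in isolation, here is a minimal sketch that hashes one identical string stored under two different index labels. The Series names and the labels 3 and 6 are illustrative assumptions chosen to mirror the row positions in df1 and df2; they are not part of the original code.

```python
import pandas as pd

# The same string stored under two different index labels
# (3 and 6 are illustrative, mirroring the positions in df1 and df2).
s1 = pd.Series(["same.email@dkicj.c"], index=[3])
s2 = pd.Series(["same.email@dkicj.c"], index=[6])

# Default (index=True): the index label is combined into each row's hash.
with_index_1 = pd.util.hash_pandas_object(s1)
with_index_2 = pd.util.hash_pandas_object(s2)

# index=False: only the values are hashed.
values_only_1 = pd.util.hash_pandas_object(s1, index=False)
values_only_2 = pd.util.hash_pandas_object(s2, index=False)

print(with_index_1.iloc[0] == with_index_2.iloc[0])    # False
print(values_only_1.iloc[0] == values_only_2.iloc[0])  # True
```

With index=False the hash depends only on the value, so equal emails compare equal regardless of where they sit in each DataFrame.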

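Another option is Python's built-in hash function. Note that string hashes are randomized per interpreter process unless PYTHONHASHSEED is set, so the values are comparable only within a single run: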

df1["hash 1"] = df1["email 1"].apply(hash)
df2["hash 2"] = df2["email 2"].apply(hash)
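
Where the hashes must match across separate runs or machines, a digest from hashlib is one alternative. The following is a minimal sketch, assuming SHA-256 is acceptable for the use case; sha256_hex is a hypothetical helper name, not part of pandas.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a string, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same input always yields the same digest, in any process.
digest_a = sha256_hex("same.email@dkicj.c")
digest_b = sha256_hex("same.email@dkicj.c")
print(digest_a == digest_b)  # True

# Assumed usage with the DataFrames from the question:
# df1["hash 1"] = df1["email 1"].apply(sha256_hex)
# df2["hash 2"] = df2["email 2"].apply(sha256_hex)
```

The digest is a 64-character hex string rather than an integer, so the hash columns would hold strings.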