R: Why does digest(algo = "sha1") produce a different answer from sha1()?

# what I need
x <- 111111
consistent_output <- hash_function(as.character(x))

I understand there must be a good reason for this, but I am confused about why, in the digest library, sha1() and digest(, algo = "sha1") produce different results. I need to choose a method that gives the same results on all machines. I need to pass six to ten digits as a character string and always get the same output from a one-way hash function.

  1. Could the results still differ if the R script is run on a 32-bit system versus a 64-bit system? I write scripts on a 64-bit Linux machine, but they may well need to be executed on various Windows computers.

  2. What explains the different results below, and is there a difference in future-proofness if I choose either digest(x, algo = "sha1") or sha1(x) for my script?

library(digest)

> digest("111111", algo = "sha1")
[1] "f807e8107b0ee536b79044938ac2497845f43c71"
> sha1("111111")
[1] "e6975dc20e721b2a5cfa6f0d834b2bf8287ab592"

When I say future-proofness, I mean that if I run the same function on the same input 10 years from now, I want to get the same output.

Many thanks, and apologies if my question is too simple; I am not from a computer science background.

Solution:

The sha1() function adds attributes to the object that you pass in, so that the settings used to create the hash are recorded with it. These attributes are then serialized together with the data, which is why its result differs from a plain digest() call.

The digest() function also serializes data by default. This means that the object as it is stored in R is hashed, not just the value itself. You can see the serialized value with:

serialize("111111", NULL)
#  [1] 58 0a 00 00 00 03 00 04 02 01 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00 00 10 00
# [29] 00 00 01 00 04 00 09 00 00 00 06 31 31 31 31 31 31

So there are a lot more bytes there than just the six characters of the string. If you are only working with string values, you can skip the serialization step:

digest("111111", algo = "sha1", serialize = FALSE)
[1] "3d4f2bf07dc1be38b20cd6e46949a1071f9d0e3d"

That matches what you would get at the Linux command line:

echo -n "111111" | sha1sum
3d4f2bf07dc1be38b20cd6e46949a1071f9d0e3d  -

and also matches online calculators like http://www.sha1-online.com/
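To make the byte-level difference concrete, here is a small sketch comparing what gets hashed in each case. It assumes only that the digest package is installed; the variable names are made up for illustration.

```r
library(digest)

x <- "111111"

# The bytes an external tool such as sha1sum sees: just the six characters.
raw_bytes <- charToRaw(x)
raw_bytes
# [1] 31 31 31 31 31 31

# The bytes R produces when it serializes the object (header, type info, then the string).
serialized_bytes <- serialize(x, NULL)
length(serialized_bytes) > length(raw_bytes)
# [1] TRUE

# With serialize = FALSE, hashing the string or its raw bytes gives the same result.
hash_from_string <- digest(x, algo = "sha1", serialize = FALSE)
hash_from_raw <- digest(raw_bytes, algo = "sha1", serialize = FALSE)
identical(hash_from_string, hash_from_raw)
# [1] TRUE
```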

So if you need compatibility with non-R systems, I would recommend digest() with serialize = FALSE, since other languages will not serialize values the way R does. If you only have to worry about R, then it would probably be safer to use sha1(), keeping in mind that it includes extra information in the object, which affects the resulting hash.
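For the use case in the question (six to ten digits in, always the same hash out), a minimal sketch might look like the following. The name hash_id is hypothetical and not part of digest. One assumption worth flagging: if the input starts out as a number, as.character() can return scientific notation for some values (for example, as.character(100000) gives "1e+05"), so the sketch formats numbers with sprintf("%.0f", ...) before hashing.

```r
library(digest)

# Hypothetical helper (not part of the digest package):
# hash whole-number IDs via their plain digit strings.
hash_id <- function(x) {
  # Numbers are formatted without scientific notation, e.g. 100000 -> "100000".
  if (is.numeric(x)) x <- sprintf("%.0f", x)
  x <- enc2utf8(as.character(x))
  # Hash each element separately, without R serialization.
  vapply(x, digest, character(1), algo = "sha1", serialize = FALSE,
         USE.NAMES = FALSE)
}

hash_id(111111)
# [1] "3d4f2bf07dc1be38b20cd6e46949a1071f9d0e3d"
hash_id("111111")
# [1] "3d4f2bf07dc1be38b20cd6e46949a1071f9d0e3d"
```

Because only the characters themselves are hashed, the result should not depend on the operating system or on whether R is 32-bit or 64-bit. IDs with leading zeros would need to be passed as strings, since a number cannot carry them.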
