Meaning behind 'thefuzz' / 'rapidfuzz' similarity metric when comparing strings

When using thefuzz in Python to calculate a simple ratio between two strings, a result of 0 means they are totally different while a result of 100 represents a 100% match. What do intermediate results mean? Does a result of 82, say, mean that the two strings are 82% similar? Or is it just an abstract idea of "bigger is better"?

The documentation is sadly lacking in any detail to answer this question, so far as I can tell.

>Solution :

There are a bunch of string matching algorithms that have been developed over the last… hundred years or so. I believe the string matching algorithm under the hood of this library is InDel.

InDel is a variation of the much more common Levenshtein distance algorithm. Levenshtein distance essentially counts the minimum number of insertions, deletions, and substitutions needed to get from the first string to the second string.
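To make the definition concrete, here is a minimal sketch of Levenshtein distance using the standard dynamic-programming approach. The function name `levenshtein_distance` is my own for illustration; it is not the library's API, and the library's actual implementation is far more optimized.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning i characters into an empty string costs i deletions
        for j, ch_b in enumerate(b, start=1):
            substitution_cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete ch_a
                curr[j - 1] + 1,                  # insert ch_b
                prev[j - 1] + substitution_cost,  # substitute (or keep)
            ))
        prev = curr
    return prev[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_distance("cat", "cut"))         # 1
```

Note that "cat" to "cut" costs a single substitution here.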

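With InDel, only insertions and deletions are counted, so a substitution costs two operations (one deletion plus one insertion). The ratio is calculated by dividing the number of insertions and deletions by the combined length of both strings, and then subtracting the result from 1. The library reports this value scaled to the 0 to 100 range. So the closer to 1 (100 as reported), the closer the match, as it took fewer insertions and deletions to get from one string to the other. A result of 82 therefore means, roughly, that the required insertions and deletions amount to 18% of the combined length of the two strings.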
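The calculation described here can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the helper names `lcs_length`, `indel_distance`, and `indel_ratio` are hypothetical, and the sketch relies on the identity that the InDel distance equals the combined length minus twice the length of the longest common subsequence.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of `a` and `b`."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Number of insertions and deletions needed to turn `a` into `b`."""
    # Every character outside the common subsequence is either
    # deleted from `a` or inserted from `b`.
    return len(a) + len(b) - 2 * lcs_length(a, b)


def indel_ratio(a: str, b: str) -> float:
    """Similarity on a 0-100 scale: 100 * (1 - distance / combined length)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # assumption: two empty strings count as identical
    return 100.0 * (1 - indel_distance(a, b) / total)


# One inserted "!" out of 14 + 15 = 29 characters.
print(indel_distance("this is a test", "this is a test!"))      # 1
print(round(indel_ratio("this is a test", "this is a test!")))  # 97
# A substitution counts as one deletion plus one insertion.
print(indel_distance("cat", "cut"))                             # 2
```

The 97 in the first example is 100 * (1 - 1/29), rounded to the nearest integer.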

The real question you have to determine for yourself is how far away from 1 (a perfect match) you are willing to go and still treat two strings as the same. Likely no matter what you choose, you will end up with some false positives and false negatives.
