A fast and simple Levenshtein distance function in C++

Posted by Martti Ylioja on September 2, 2020

I needed a fast and simple Levenshtein distance function for a private optical character reader (OCR) based invoice handling project. The intended use-case was to compare the similarity between spellings of Finnish words.

My implementation is based on the classic Wagner–Fischer algorithm, but it completely avoids dynamic memory allocation, and uses only a modest amount of stack space. The computation is arranged in such a way, that only one single row of the distance matrix needs to be kept in memory.

Verified by Wagner-Fischer

The repository contains also an implementation of the original Wagner-Fischer algorithm. It's sort of self-verifying because it can apply the computed shortest edit sequence to the original source word, and see that the result matches the expected target word. I use it as a "gold standard" to verify the correctness of my own adaptation of the algorithm. A simple test program is included.

Detailed explanation

For a clear and detailed step by step, explanation of the Wagner–Fischer algorithm, I recommend these slides by Ben Langmead. It's one of the best explanations I have ever seen on the subject.

The original paper: The String-to-String Correction Problem, Robert A Wagner, Michael J.Fischer Journal of the ACM, 21:168-178, January 1974. https://doi.org/10.1145/321796.321811 is also quite readable.

To the code repository