In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.
The bitap algorithm (also known as the shift-or, shift-and or Baeza-Yates–Gonnet algorithm) is an approximate string matching algorithm. The algorithm tells whether a given text contains a substring which is "approximately equal" to a given pattern, where approximate equality is defined in terms of Levenshtein distance — if the substring and pattern are within a given distance k of each other, then the algorithm considers them equal.
C++ implementation of bitap algorithm is given below:
#include
#include
The call to procedure is as follows:
int res = match_bitap(stringOne, stringTwo, 0);
where stringOne and stringTwo are strings whose match (Levenshtein distance) is to be determined, and 0 is location in stringOne where matching procedure should start. Return value is Levenshtein distance, as described above.
3 comments:
What is license of this code? Do you allow to use it in GPL projects?
Original licence for C code (which was basis for this C++ implementation) is GPL, so feel free to use it in GPL projects.
Hi there, nice to see a full implementation of Bitap in C++. However, I have one correction and one question to make. Correction: the function does not seem to be returning the edit distance, as stated in the post. Rather, it returns the position of the best match within the text being searched. Question: how exactly can one change the tolerated edit distance? Say, for example, one wants the tolerated to be 2 edits (where inserts, substitutions and deletions all have cost of 1)?
Post a Comment