Calculate minimum edit distance between strings using Levenshtein distance, Find minimum edit distance between two words
Find minimum edit distance between two strings with Levenshtein distance
Question:
Find
the minimum edit distance in transforming the word DOG to COW using Levenshtein
distance, ie., insertion = deletion =1 and substitution = 2.
Solution:
Minimum
edit distance algorithm finds the minimum number of editing operations
(insertion, deletion, substitution) required to convert one string into another
with the help of dynamic programming concept.
In
this exercise, we supposed to use Levenshtein distance
while finding the distance between the words DOG and COW. Each cell in the
distance matrix contains the distance between two strings. For instance, the
cell intersect at i, j (distance[i, j]) contains the distance between first i
characters of the target and the first j characters of the source. The value
for each cell is calculated as per the equation shown below for Levenshtein
distance;
Stepbystep
process is as follows;
Step 1: Draw the edit
distance matrix. In this, each word is preceded by # symbol which marks the
empty string.



Step 2: Find the
editdistance values using minimum edit distance algorithm to convert # (row side) to #COW (column side) and populate appropriate
cells with the calculated distance.
Number of
operations required to convert;
·
# to # is
0.
·
# to #C is 1.
That is, one insertion (no change with #, insert C)
·
# to #CO is
2. That is, two insertions (no change with #, insert C and O)
·
# to #COW
is 3. That is, three insertions (no change with #, insert C, O and W)



Step 3: Find the
editdistance values using minimum edit distance algorithm to convert # (column
side) to #DOG (row side) and
populate appropriate cells with the calculated distance.
Number of
operations required to convert
·
# to # is 0.
·
# to #D is 1. [one insertion (no change
with #, insert D]
·
# to #DO is 2. [two insertions (no change
#, insert D and O]
·
# to #DOG is 3. [three insertions (no
change #, insert D, O and G]



So far, we have
found the minimum edit distance for 7 subproblems.
[#  # = 0, # 
#C = 1, #  #CO = 2, #  #COW = 3, #  #D = 1, #  #DO = 2, and #  #DOG = 3]


Step 4: From this step
onward, we try to find the cost for a subproblem by finding the minimum
cost of three subproblems and add 1 with that if the characters intersect at
that cell are different. If the intersecting characters are same, then we add
0 with the diagonal cell value.
[Note: we have used A as the name for this matrix and
included the index numbers for easy understanding]
Use the following
table as a key to find the cost.
If
the row character ≠ column character,
The
cost of the intersecting cell = min(replace, delete, insert)+2. (image below shows the replacement cost as 1. In this exercise we used 2 as per Levenshtein distance)
If
the row character = column character,
The
cost of the intersecting cell = cost of the Replace cell


Subproblem: #D ➔ #C.
Intersecting
cell: A[1,
1]
Same intersecting
characters? NO.
Cost(A[1, 1]) = Min(A[0, 0], A[0, 1], A[1, 0]) + 2
= A[0, 0] + 2
=
0 + 2 = 2



Subproblem: #D ➔ #CO.
Intersecting
cell: A[1,
2]
Same intersecting
characters? NO.
Cost(A[1, 2]) = Min(A[0, 1], A[0, 2], A[1, 1]) + 2
=
A[0, 1] + 2
= 1 + 2 = 3



Subproblem: #D ➔ #COW.
Intersecting
cell: A[1,
3]
Same intersecting
characters? NO.
Cost(A[1, 3]) = Min(A[0, 2], A[0, 3], A[1, 2]) + 2
= A[0, 2] + 2
= 2 + 2 = 4



Subproblem: #DO ➔ #C.
Intersecting
cell: A[2,
1]
Same intersecting
characters? NO.
Cost(A[2, 1]) = Min(A[1, 0], A[1, 1], A[2, 0]) + 2
= A[1, 0] + 2
= 1 + 2 = 3



Subproblem: #DO ➔ #CO.
Intersecting
cell: A[2, 2]
Same
intersecting characters? YES
Cost(A[2, 2]) = Min(A[1, 1], A[1, 2], A[2, 1]) + 0
= A[1, 1] + 0
= 2 + 0 = 2



Subproblem: #DO ➔ #COW.
Intersecting
cell: A[2,
3]
Same intersecting
characters? NO.
Cost(A[2, 3]) = Min(A[1, 2], A[1, 3], A[2, 2]) + 2
= A[2, 2] + 2
= 2 + 2 = 4



Subproblem: #DOG ➔ #C.
Intersecting
cell: A[3,
1]
Same intersecting
characters? NO.
Cost(A[3, 1]) = Min(A[2, 0], A[2, 1], A[3, 0]) + 2
= A[2, 0] + 2
= 2 + 2 = 4



Subproblem: #DOG ➔ #CO.
Intersecting
cell: A[3,
2]
Same intersecting
characters? NO
Cost(A[3, 2]) = Min(A[2, 1], A[2, 2], A[3, 1]) + 2
= A[2, 2] + 2
= 2 + 2 = 4



Subproblem: #DOG ➔ #COW.
Intersecting
cell: A[3,
3]
Same intersecting
characters? NO.
Cost(A[3, 3]) = Min(A[2, 2], A[2, 3], A[3, 2]) + 2
= A[2, 2] + 2
= 2 + 2 = 4


Result:
The
distance between two strings DOG and COW as per Minimum Edit Distance using
Levenshtein distance is 4.
***********
No comments:
Post a Comment