Saturday, April 4, 2020

Find minimum edit distance between strings

Find minimum edit distance between two words, minimum edit distance solved exercise, how to use minimum edit distance to find the distance between two strings? how to use dynamic programming for finding edit distance?

Find minimum edit distance between two strings


Question:
Find the minimum edit distance in transforming the word DOG to COW using insertion, deletion, and substitution cost as 1.

Solution:
Minimum edit distance is the minimum number of editing operations (insertion, deletion, substitution) required to convert one string into another. The solution given here uses Basic Minimum Edit Distance which uses cost for all operations as 1.
Each cell in the distance matrix contains the distance between two strings. For instance, the cell intersect at i, j (distance[i, j]) contains the distance between first i characters of the target and the first j characters of the source. The value for each cell is calculated as per the equation shown below;

Step 1: Draw the edit distance matrix. In this, each word is preceded by # symbol which marks the empty string.

#
C
O
W
#




D




O




G






Step 2: Find the edit-distance values using minimum edit distance algorithm to convert # (row side) to #COW (column side) and populate appropriate cells with the calculated distance.
Number of operations required to convert;
  • # to # is 0.
  • # to #C is 1. That is, one insertion (no change with #, insert C)
  • # to #CO is 2. That is, two insertions (no change with #, insert C and O)
  • # to #COW is 3. That is, three insertions (no change with #, insert C, O and W)

#
C
O
W
#
0
1
2
3
D




O




G






Step 3: Find the edit-distance values using minimum edit distance algorithm to convert # (column side) to #DOG (row side) and populate appropriate cells with the calculated distance.
Number of operations required to convert
  • # to # is 0.
  • # to #D is 1. [one insertion (no change with #, insert D]
  • # to #DO is 2. [two insertions (no change #, insert D and O]
  • # to #DOG is 3. [three insertions (no change #, insert D, O and G]

#
C
O
W
#
0
1
2
3
D
1



O
2



G
3







So far, we have found the minimum edit distance for 7 sub-problems.
[# - # = 0, # - #C = 1, # - #CO = 2, # - #COW = 3, # - #D = 1, # - #DO = 2, and # - #DOG = 3]

Step 4: From this step onward, we try to find the cost for a sub-problem by finding the minimum cost of three sub-problems and add 1 with that if the characters intersect at that cell are different. If the intersecting characters are same, then we add 0 with the diagonal cell value.
[Note: we have used A as the name for this matrix and included the index numbers for easy understanding]

From this step onward, use the following table as a key to find the cost.

If the row character ≠ column character,
  • The cost of the intersecting cell = min(replace, delete, insert) + 1.
If the row character = column character,
  • The cost of the intersecting cell = cost of the Replace cell
  [Note: If row = column, then the cost of the intersecting cell is the cost of the sub-problem, ie., the cell at diagonal above]
Edit distance for Row 1

Sub-problem: #D #C.
Intersecting cell: A[1, 1]
Same intersecting characters? NO.
Cost(A[1, 1])       = Min(A[0, 0], A[0, 1], A[1, 0]) + 1
                             = A[0, 0] + 1
                             = 0 + 1 = 1
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1


2
O
2



3
G
3





Sub-problem: #D #CO.
Intersecting cell: A[1, 2]
Same intersecting characters? NO.
Cost(A[1, 2])       = Min(A[0, 1], A[0, 2], A[1, 1]) + 1
                             = A[1, 1] + 1
                             = 1 + 1 = 2
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2

2
O
2



3
G
3





Sub-problem: #D #COW.
Intersecting cell: A[1, 3]
Same intersecting characters? NO.
Cost(A[1, 3])       = Min(A[0, 2], A[0, 3], A[1, 2]) + 1
                             = A[1, 2] + 1
                             = 2 + 1 = 3
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2
3
2
O
2



3
G
3



Edit distance for Row 2


Sub-problem: #DO #C.
Intersecting cell: A[2, 1]
Same intersecting characters? NO.
Cost(A[2, 1])       = Min(A[1, 0], A[1, 1], A[2, 0]) + 1
                             = A[1, 0] + 1
                             = 1 + 1 = 2
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2
3
2
O
2
2


3
G
3





Sub-problem: #DO #CO.
Intersecting cell: A[2, 2]
Same intersecting characters? YES
Cost(A[2, 2])       = Min(A[1, 1], A[1, 2], A[2, 1]) + 0
                             = A[1, 1] + 0
                             = 1 + 0 = 1
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2
3
2
O
2
2
1

3
G
3





Sub-problem: #DO #COW.
Intersecting cell: A[2, 3]
Same intersecting characters? NO.
Cost(A[2, 3])       = Min(A[1, 2], A[1, 3], A[2, 2]) + 1
                             = A[2, 2] + 1
                             = 1 + 1 = 2
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2
3
2
O
2
2
1
2
3
G
3



Edit distance for Row 3


Sub-problem: #DOG #C.
Intersecting cell: A[3, 1]
Same intersecting characters? NO.
Cost(A[3, 1])       = Min(A[2, 0], A[2, 1], A[3, 0]) + 1
                             = A[2, 0] + 1
                             = 2 + 1 = 3
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2
3
2
O
2
2
1
2
3
G
3
3




Sub-problem: #DOG #CO.
Intersecting cell: A[3, 2]
Same intersecting characters? NO
Cost(A[3, 2])       = Min(A[2, 1], A[2, 2], A[3, 1]) + 1
                             = A[2, 2] + 1
                             = 1 + 1 = 2
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2
3
2
O
2
2
1
2
3
G
3
3
2



Sub-problem: #DOG #COW.
Intersecting cell: A[3, 3]
Same intersecting characters? NO.
Cost(A[3, 3])       = Min(A[2, 2], A[2, 3], A[3, 2]) + 1
                             = A[2, 2] + 1
                             = 1 + 1 = 2
A

0
1
2
3


#
C
O
W
0
#
0
1
2
3
1
D
1
1
 2
3
2
O
2
2
1
2
3
G
3
3
2
2
The last cell (A[3, 3]) holds the minimum edit distance between the given strings DOG and COW. 

Alignment:

The alignment between DOG and COW is as follows;



D                 O                 G

|                 |                 |

C                 O                 W

Substitute           No          Substitute

Cost = 1           change        Cost = 1

*************


Go to NLP Glossary


Find minimum edit distance between two words, 

minimum edit distance solved exercise with basic algorithm, 

how to use minimum edit distance with basic distance to find the distance between two strings? 

how to use dynamic programming for finding edit distance between strings?

Calculate the minimum edit distance between two strings using simple algorithm

How to decide whether two strings are close or not in spelling using  minimum edit distance


No comments:

Post a Comment

Featured Content

Multiple choice questions in Natural Language Processing Home

MCQ in Natural Language Processing, Quiz questions with answers in NLP, Top interview questions in NLP with answers Multiple Choice Que...

All time most popular contents

data recovery