Monday, March 5, 2012

Edit distance - implementation

The Problem Statement:

Edit Distance (also called Levenshtein distance)is a classic Dynamic Programming Problem. You are given a source string (say, of length m)and a target string (say of length n) plus a series of allowed transformations and their corresponding costs. You are required to find the minimum cost required to convert the source string into the target string using only the given transformations. Let us take a temporary string "temp" to illustrate things clearly. Initially we will keep our "temp" string empty and at the end of the algorithm we will have "temp" containing the same contents as the target string. Lets use i and j as the indices into the source string and the "temp" string. Set i=j=0 initially. The allowed transformations are generally:

  1. insert - add a character(say 'c') to the "temp" string ie.. put temp[j]=c and increment j.

  2. replace - replacing a character of source string by any another character (say 'c') ie.. we will insert c into "temp" string and increment both i and j. Note that we only incremented j in (1).

  3. delete - deleting a character from source string ie.. increment i only.

  4. copy - copy a character from source to "temp"(generally involves no cost).


Cormen's text (no 1 here) also mentions this problem in its exercises but he has given two more transformations called

  • twiddle - exchange next two characters ie.. copy both but they appear in opposite order in the target string.

  • kill - kill the remainder of source string ie set i=m-1.


Lets assume that the costs for copy, insert, replace and delete are C,I,R and D respectively.  Also, for simplicity we will leave out the "twiddle" and "kill" operations. If you want to consider these two operations, you may refer the Instructor’s Manual by Thomas H. Cormen, Clara Lee and Erica Lin to accompany the cormen's text which gives the pseudocode involving "twiddle" and "kill" functions as well. Lets now look at a few examples to understand the problem statement more clearly.

Examples:

  1. Let source string="at" and target string="mat". Here we can simply insert "m" at the beginning of "temp"and then copy 'a' and 't' from source to "temp". So the minimum cost required= I+(2*C)

  2. let source="cat" , target = "mat" then cost= R+(2*C),

  3. let source="cat" , target = "at" then cost= D+(2*C),

  4. let source="cat" , target = "cat" then cost= 0(zero),

  5. let source="cat" , target = empty string, then cost= 3*I ie.. insert all 3 characters.


As is generally the case , lets assume that copy operation requires no cost.Lets now take a slightly tougher example(from wikipedia article):

Suppose, source= "kitten" and target="sitting",there is no way to do it with fewer than three edits:

  • kitten → sitten (replace 'k' by 's')

  • sitten → sittin (replace 'e' by 'i')

  • sittin → sitting (insert of 'g' at the end).


Recursive Code:

Below is a recursive code to find the edit distance of two given strings. You may try different inputs on the ideone link for my recursive solution.

[sourcecode language="cpp"]

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

#define IC 1 /*cost to insert*/
#define RC 1 /*cost to replace*/
#define DC 1 /*cost to delete*/

int minimum(int a, int b, int c)
{
int min = c;

if( a < b )
{
if( a < c )
{
min = a;
}
}
else
{
if( b < c )
{
min = b;
}
}

return min;
}

/*
red- recursive edit distance

*/
int red(char *x, char *y)
{
int d,e,f;

/* Base cases */

if(*x==0)return strlen(y);
if(*y== 0)return strlen(x);

/* Recurse */

if(*x==*y)
d=red(x+1,y+1);
else
d=RC+red(x+1,y+1); /*replace*/
e=IC+red(x,y+1); /*insert*/
f=DC+red(x+1,y); /*delete*/

return minimum(d,e,f);
}

int main()
{
char source[]="kitten";
char target[]="sitting";

printf("Minimum cost to convert %s into %s is %d\n",source,target,red(source,target));

return 0;
}
[/sourcecode]

The DP solution:

Here is the idoene link for the DP code. Play with it if you find the DP code difficult to understand !!

[sourcecode language="cpp"]

#include<stdio.h>
#include<stdlib.h>
#include<string.h>

#define IC 1 /*cost to insert*/
#define RC 1 /*cost to replace*/
#define DC 1 /*cost to delete*/

int minimum(int a, int b, int c)
{
int min = c;

if( a < b )
{
if( a < c )
{
min = a;
}
}
else
{
if( b < c )
{
min = b;
}
}

return min;
}

int edit_distance(char* s1, char* s2)
{
int d;
int m = strlen(s1);
int n = strlen(s2);
int dp[m+1][n+1];
for (int i = 0; i <= m; i++) {
dp[i][0] = i;
}
for (int j = 0; j <= n; j++) {
dp[0][j] = j;
}

for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
if (s1[i-1] == s2[j-1])
d = 0;
else
d = RC;
dp[i][j] = minimum(dp[i-1][j-1] + d, /*replace*/
dp[i-1][j]+IC, /*insert*/
dp[i][j-1]+DC); /*delete*/
}
}

return dp[m][n];
}

int main()
{
char source[]="kitten";
char target[]="sitting";

printf("Minimum cost to convert %s into %s is %d\n",source,target,edit_distance(source,target));

return 0;
}

[/sourcecode]

Try understanding how the DP matrix will look like. For the above wiki example, here is the DP matrix:




























































































kitten
0123456
s1123456
i2212345
t3321234
t4432123
i5543223
n6654332
g7765443

You can also try different inputs to see what the result could be here. Both time and space complexity of DP solution is O(nm). The sequence of operations performed can also be printed but for that we need to use an additional O(mn) space which will save the operation performed along the way and the sequence can be found recursively moving backwards. See the Instructor’s Manual by Thomas H. Cormen, Clara Lee and Erica Lin to accompany the cormen's text if you find any difficulty. Note that the space requirement can be improved to O(n), as in each iteration only requires element from the current and previous row but in this case retracing the path to find the operations used and their order is not possible. Also note that there may be more than one of these paths and the above solution chooses one of them.

Modifications:

  1. Substring Matching

  2. LCS - make the cost of replacement greater than that of an insertion plus a deletion.

  3. Maximum Monotone Subsequence


For an explaination on what modifications are needeed in each section refer Programming Challenges by Steven Skiena( no 3 here).



References: 

Here are a few materials that discuss the algo/implementation:



3 comments:

  1. [...] you want to find all the LCSs see the pseudocode at wiki. Lastly, lcs can also be found using the edit distance algo by making the cost of replacement greater than that of an insertion plus a [...]

    ReplyDelete
  2. Dear Shiv,

    I have attempted your code using the following cases, but it haven't yielded any result after more than 15 minutes of execution.
    source: "Thou shalt not kill"
    destination: "You should not murder"

    ReplyDelete
  3. Hey Mohamed !
    Thanks for bringing it to my notice. I am really sorry I have not been able to fix it for I am pretty busy these days. Will do the same asap.

    ReplyDelete