## Tuesday, February 7, 2012

### LCS and LIS .....

LONGEST COMMON SUBSEQUENCE

LCS stands for Longest common  subsequence . It is a classic computer science problem, the basis of file comparison programs such as diff, and has applications in bioinformatics. If you are studying LCS for the first time , I would recommend   you to read  it from CLRS- Introduction to Algorithms . Charles Liesersen (one of the authors of the book) himself explains it here  . I must confess he explains it pretty well in this video . You may also  visit the corresponding wiki article  Its a pretty long article but they have mentioned everything  very neatly .

LCS Problem Statement: Given two sequences, find the length of longest subsequence present in both of them. A subsequence is a sequence that appears in the same relative order, but not necessarily contiguous. For example, “abc”, “abg”, “bdf”, “aeg”, ‘”acefg”, .. etc are subsequences of “abcdefg”. So a string of length n has 2^n different possible subsequences.

Examples:
LCS for input Sequences “ABCDGH” and “AEDFHR” is “ADH” of length 3.
LCS for input Sequences “AGGTAB” and “GXTXAYB” is “GTAB” of length 4.

Still not clear with the problem formulation / algo ?? look at the animations here.

Here  is a code of mine that finds the LCS of two strings given as input .

Recursive code:

[sourcecode language="cpp"]

int max(int a,int b)
{
if(a<b)return b;
return a;
}
int recursive_lcs_length(char * x, char * y)
{
if (*x == '\0' || *y == '\0') return 0;
else if (*x == *y) return 1 + recursive_lcs_length(x+1, y+1);
else return max(recursive_lcs_length(x+1,y), recursive_lcs_length(x,y+1));
}
[/sourcecode]

Recursive code takes O(2^n) time in the worst case- the two strings have no matching characters, so the last line is always executed. Its very inefficient. Lets now see the DP code.

Dynamic Programming Code (Bottom-up):

[sourcecode language="cpp"]
#include <stdio.h>
#include <string.h>
#define MAX 100
char X[MAX],Y[MAX];
int i,j,m,n,c[MAX][MAX],b[MAX][MAX];

int LCSlength()

{
m=strlen(X);
n=strlen(Y);
for (i=1;i<=m;i++) c[i]=0;
for (j=0;j<=n;j++) c[j]=0;
for (i=1;i<=m;i++)
for (j=1;j<=n;j++) {
if (X[i-1]==Y[j-1]) {
c[i][j]=c[i-1][j-1]+1;
b[i][j]=1;
}
else if (c[i-1][j]>=c[i][j-1]) {
c[i][j]=c[i-1][j];
b[i][j]=2;
}
else {
c[i][j]=c[i][j-1];
b[i][j]=3;
}
}
return c[m][n];
}

void printLCS(int i,int j)

{
if (i==0 || j==0) return;
if (b[i][j]==1) {
printLCS(i-1,j-1);
printf("%c",X[i-1]);

}
else if (b[i][j]==2)
printLCS(i-1,j);
else
printLCS(i,j-1);
}

int main()

{
while (1) {
gets(X);
if (feof(stdin)) break;
gets(Y);
printf("LCS length -> %d\n",LCSlength());
printLCS(m,n);
printf("\n");
}
}
[/sourcecode]

The DP code above moves bottom-up. It solves all the subproblems. Sometimes it may not be required to solve all the subproblems but only a few of them. This is what the memoization techenique does. Look at the code using memoization below:

Memoization Technique (Top-down):

[sourcecode language="cpp"]
int max(int a,int b)
{
if(a<b)return b;
return a;
}

int lcs_memo(char* x, int n, char* y,int m)
{
int memo[n+1][m+1];
int result=0;
for(int i=0;i<=n;i++)
for(int j=0;j<=m;j++)
memo[i][j]=-1;

/*base cases*/
if(m==0)return 0;
if(n==0)return 0;

if(memo[n][m]!=-1) return memo[n][m];

/*if not calculated, calculate now*/
if(x[n]==y[m])
result=1+lcs_memo(x,n-1,y,m-1);
else
result=max(lcs_memo(x,n-1,y,m),lcs_memo(x,n,y,m-1));

/*save result*/
memo[n][m]=result;

return result;
}

[/sourcecode]

The LCS code above (both DP and memoization techniques) takes O(n-square) time and O(n-square) space . It is possible to write a little more efficient code for LCS . The time complexity is the same (or probably a little higher in terms of bigger constants) but the space complexity can be reduced to be linear . Check the code  here . Note that the LCS is not unique. If you want to find all the LCSs see the pseudocode at wiki. Lastly, lcs can also be found using the edit distance algo by making the cost of replacement greater than that of an insertion plus a deletion.

LONGEST INCREASING SUBSEQUENCE

LIS, I would say , is a cousin of LCS  It stands for   Longest increasing subsequence.  The problem statement is as follows:

Given a sequence , find the largest subset such that for every i < j, ai < aj.

This wiki article explains the problem statement  pretty nicelyAn interesting way of finding the LIS is to use the LCS algorithm.

1. Make a sorted copy of the given sequence (say A), denoted as B. This would take O(nlog(n)) time.

2. Use LCS on A and B which takes O(n2) time.

Alternatively you may solve LIS without using the LCS algo , The straight-forward DP approach is explained nicely by the algorithmist . The LIS problem comes in two different flavours :

1. Find the length of LIS ,

2. print the LIS .

If you need only the length  its the easy part . The second part is a little trickier . Also ,the LIS implementation can be done in O(n-square) or 0(n log n ) time .Here is an O(n log n) code to find the length of LIS.

[sourcecode language="cpp"]
#include<iostream>
#include<set>
#include<vector>
using namespace std;

int LIS(vector<int> A)
{
int N = A.size(),i;
set<int> s;
set<int>::iterator k;
for (i=0;i<N;i++)
{
if (s.insert(A[i]).second)
{
k = s.find(A[i]);
k++;
if (k!=s.end())
s.erase(k);
}
}
return s.size();
}
[/sourcecode]

The above code runs in O(n log k) time where k is the length of the LIS found . To  get the LIS itself ,  we need to maintain a previous array which stores the index of the previous element in the LIS. Here’s a C++ implementation. It returns the LIS as an array.

[sourcecode language="cpp"]
#include<iostream>
#include<map>
#include<vector>
using namespace std;

typedef pair < int , int > PI;

vector<int> LIS(vector<int> A)
{
int N = A.size(),i,j=-1,t;
vector<int> pre(N,-1),res;
map<int,int> m;
map<int,int>::iterator k,l;
for (i=0;i<N;i++)
{
if (m.insert(PI(A[i],i)).second)
{
k = m.find(A[i]);
l = k;
k++;
if (l==m.begin())
pre[i]=-1;
else
{
l--;
pre[i]=l->second;
}
if (k!=m.end())
m.erase(k);
}
}
k=m.end();
k--;
j = k->second;
while (j!=-1)
{
res.push_back(A[j]);
j = pre[j];
}
reverse (res.begin(),res.end());
return res;
}
[/sourcecode]

If you are not very familiar with STL (in C++) , you may go for the O(n-square) solution . Here is a C code I generally use .

[sourcecode language="cpp"]
#include<stdio.h>
#include<string.h>
void lis(int str[],int n)
{
int len[n],prev[n],i,j,max=-1,k,b[n],a=0;
for(i=0;i<n;i++){
len[i]=1;
prev[i]=-1;
}
for(i=1;i<n;i++)
for(j=0;j<i;j++){
if(str[i]>str[j]&& len[i]<(len[j]+1)){
len[i]=len[j]+1;
prev[i]=j;

}
}
for(i=0;i<n;i++)
if(len[i]>max)
{ max=len[i]; k=i;}
b[a]=str[k];
for(i=prev[k];i>=0;i=prev[k])
{b[++a]=str[i];k=i;}
for(i=max-1;i>=0;i--)
printf("%d ",b[i]);
}
main()
{
int str[]={5,9,4};
lis(str,sizeof(str)/4);
return 0;
}

[/sourcecode]

Play with my codes if you dont understand it  . Try different inputs .  If you  still dont get it , feel free to drop a comment .

References:

1. I came to know of the the LIS codes (O(n log n) versions)  presented above from here .

2. Art of programming (viz - no. 2 here).

3. geeksforgeeks.org article for LCS

4. http://www.ics.uci.edu/~eppstein/161/960229.html

1. Hey Mr.Cool Congrats for ur first blog:)

2. Nice blog! :D

3. oohhh !!! thanks for the compliment :) and a special one for "Mr. COOL" !!!

4. thanks ...

5. nice blog..
Post another articles for nice DP problems with good explaination

6. Sure , I will try doing that..thanks for the suggestion :)

7. [...] LCS - make the cost of replacement greater than that of an insertion plus a deletion. [...]

8. i think the dp code will take more than n*n running time... n*n time it ll take to build the table itself... please clarify..

9. sorry i mean it ll take more than m*n running time..

10. Hi Rahul,

You are right right, building the LCS table will take O(m*n) time. And printLCS a linear time, in worst case O(m+n), so effectively it becomes O(m*n)+O(m+n). The second factor there can be left out. And that is why we say it takes O(m*n). That is how asymptotic analysis works. Does that make it clear?

11. yeah.. actually i was reading asymptotic analysis at that time... so, i mistook your asymptotic tight bound (order notation) for big oh (O) notation.. i understood your code.. only later i realised order notation is generally used for efficiency of algorithms...:) i was hoping you wont read my comment and i would delete it before you read it..:)
Anyways, very good explanation bhaiya.. Dynamic programming was the first topic i read in design of algorithms(apart from recursive and divide and conquer that I am generally comfortable with)... And your blog was a great motivation.. :) Keep Writing...

12. Good to see the interest, keep it going. And remember DP is tough. So keep attacking it to hack the tricks !