**LONGEST COMMON SUBSEQUENCE**

LCS stands for

**Longest common subsequence**. It is a classic computer science problem, the basis of file comparison programs such as diff, and has applications in bioinformatics. If you are studying LCS for the first time , I would recommend you to read it from CLRS- Introduction to Algorithms . Charles Liesersen (one of the authors of the book) himself explains it here . I must confess he explains it pretty well in this video . You may also visit the corresponding wiki article Its a pretty long article but they have mentioned everything very neatly .

**LCS Problem Statement**

*:*Given two sequences, find the length of longest subsequence present in both of them. A subsequence is a sequence that appears in the same relative order, but not necessarily contiguous. For example, “abc”, “abg”, “bdf”, “aeg”, ‘”acefg”, .. etc are subsequences of “abcdefg”. So a string of length n has 2^n different possible subsequences.

**Examples:**

LCS for input Sequences “ABCDGH” and “AEDFHR” is “ADH” of length 3.

LCS for input Sequences “AGGTAB” and “GXTXAYB” is “GTAB” of length 4.

Still not clear with the problem formulation / algo ?? look at the animations here.

Here is a code of mine that finds the LCS of two strings given as input .

**Recursive code:**

[sourcecode language="cpp"]

int max(int a,int b)

{

if(a<b)return b;

return a;

}

int recursive_lcs_length(char * x, char * y)

{

if (*x == '\0' || *y == '\0') return 0;

else if (*x == *y) return 1 + recursive_lcs_length(x+1, y+1);

else return max(recursive_lcs_length(x+1,y), recursive_lcs_length(x,y+1));

}

[/sourcecode]

Recursive code takes O(2^n) time in the worst case- the two strings have no matching characters, so the last line is always executed. Its very inefficient. Lets now see the DP code.

**Dynamic Programming Code (Bottom-up):**

[sourcecode language="cpp"]

#include <stdio.h>

#include <string.h>

#define MAX 100

char X[MAX],Y[MAX];

int i,j,m,n,c[MAX][MAX],b[MAX][MAX];

int LCSlength()

{

m=strlen(X);

n=strlen(Y);

for (i=1;i<=m;i++) c[i][0]=0;

for (j=0;j<=n;j++) c[0][j]=0;

for (i=1;i<=m;i++)

for (j=1;j<=n;j++) {

if (X[i-1]==Y[j-1]) {

c[i][j]=c[i-1][j-1]+1;

b[i][j]=1;

}

else if (c[i-1][j]>=c[i][j-1]) {

c[i][j]=c[i-1][j];

b[i][j]=2;

}

else {

c[i][j]=c[i][j-1];

b[i][j]=3;

}

}

return c[m][n];

}

void printLCS(int i,int j)

{

if (i==0 || j==0) return;

if (b[i][j]==1) {

printLCS(i-1,j-1);

printf("%c",X[i-1]);

}

else if (b[i][j]==2)

printLCS(i-1,j);

else

printLCS(i,j-1);

}

int main()

{

while (1) {

gets(X);

if (feof(stdin)) break;

gets(Y);

printf("LCS length -> %d\n",LCSlength());

printLCS(m,n);

printf("\n");

}

}

[/sourcecode]

The DP code above moves bottom-up. It solves all the subproblems. Sometimes it may not be required to solve all the subproblems but only a few of them. This is what the memoization techenique does. Look at the code using memoization below:

**Memoization Technique (Top-down):**

[sourcecode language="cpp"]

int max(int a,int b)

{

if(a<b)return b;

return a;

}

int lcs_memo(char* x, int n, char* y,int m)

{

int memo[n+1][m+1];

int result=0;

for(int i=0;i<=n;i++)

for(int j=0;j<=m;j++)

memo[i][j]=-1;

/*base cases*/

if(m==0)return 0;

if(n==0)return 0;

/*see if already calculated*/

if(memo[n][m]!=-1) return memo[n][m];

/*if not calculated, calculate now*/

if(x[n]==y[m])

result=1+lcs_memo(x,n-1,y,m-1);

else

result=max(lcs_memo(x,n-1,y,m),lcs_memo(x,n,y,m-1));

/*save result*/

memo[n][m]=result;

return result;

}

[/sourcecode]

The LCS code above (both DP and memoization techniques) takes O(n-square) time and O(n-square) space . It is possible to write a little more efficient code for LCS . The time complexity is the same (or probably a little higher in terms of bigger constants) but the space complexity can be reduced to be linear . Check the code here . Note that the LCS is not unique. If you want to find all the LCSs see the pseudocode at wiki. Lastly, lcs can also be found using the edit distance algo by making the cost of replacement greater than that of an insertion plus a deletion.

**LONGEST INCREASING SUBSEQUENCE**

LIS, I would say , is a cousin of LCS It stands for

**Longest increasing subsequence.**The problem statement is as follows:

**Given a sequence , find the largest subset such that for every i < j, a**_{i}< a_{j}.This wiki article explains the problem statement pretty nicely

**.**An interesting way of finding the LIS is to

*.*

**use the LCS algorithm**- Make a sorted copy of the given sequence (say
*A)*, denoted as*B*. This would take*O*(*n*log(*n*)) time. - Use LCS on
*A*and*B which takes**O*(*n*^{2}) time.

Alternatively you may solve LIS without using the LCS algo , The straight-forward DP approach is explained nicely by the algorithmist . The LIS problem comes in two different flavours :

- Find the length of LIS ,
- print the LIS .

If you need only the length its the easy part . The second part is a little trickier . Also ,the LIS implementation can be done in O(n-square) or 0(n log n ) time .Here is an

**.**

*O(n log n) code to find the length of LIS*[sourcecode language="cpp"]

#include<iostream>

#include<set>

#include<vector>

using namespace std;

int LIS(vector<int> A)

{

int N = A.size(),i;

set<int> s;

set<int>::iterator k;

for (i=0;i<N;i++)

{

if (s.insert(A[i]).second)

{

k = s.find(A[i]);

k++;

if (k!=s.end())

s.erase(k);

}

}

return s.size();

}

[/sourcecode]

The above code runs in O(n log k) time where k is the length of the LIS found . To

*, we need to maintain a previous array which stores the index of the previous element in the LIS. Here’s a C++ implementation. It returns the LIS as an array.*

**get the LIS itself**[sourcecode language="cpp"]

#include<iostream>

#include<map>

#include<vector>

using namespace std;

typedef pair < int , int > PI;

vector<int> LIS(vector<int> A)

{

int N = A.size(),i,j=-1,t;

vector<int> pre(N,-1),res;

map<int,int> m;

map<int,int>::iterator k,l;

for (i=0;i<N;i++)

{

if (m.insert(PI(A[i],i)).second)

{

k = m.find(A[i]);

l = k;

k++;

if (l==m.begin())

pre[i]=-1;

else

{

l--;

pre[i]=l->second;

}

if (k!=m.end())

m.erase(k);

}

}

k=m.end();

k--;

j = k->second;

while (j!=-1)

{

res.push_back(A[j]);

j = pre[j];

}

reverse (res.begin(),res.end());

return res;

}

[/sourcecode]

If you are not very familiar with STL (in C++) , you may go for the O(n-square) solution . Here is a

*I generally use .*

**C code**[sourcecode language="cpp"]

#include<stdio.h>

#include<string.h>

void lis(int str[],int n)

{

int len[n],prev[n],i,j,max=-1,k,b[n],a=0;

for(i=0;i<n;i++){

len[i]=1;

prev[i]=-1;

}

for(i=1;i<n;i++)

for(j=0;j<i;j++){

if(str[i]>str[j]&& len[i]<(len[j]+1)){

len[i]=len[j]+1;

prev[i]=j;

}

}

for(i=0;i<n;i++)

if(len[i]>max)

{ max=len[i]; k=i;}

b[a]=str[k];

for(i=prev[k];i>=0;i=prev[k])

{b[++a]=str[i];k=i;}

for(i=max-1;i>=0;i--)

printf("%d ",b[i]);

}

main()

{

int str[]={5,9,4};

lis(str,sizeof(str)/4);

return 0;

}

[/sourcecode]

Play with my codes if you dont understand it . Try different inputs . If you still dont get it , feel free to drop a comment .

**References:**

- I came to know of the the LIS codes (O(n log n) versions) presented above from here .
- Art of programming (viz - no. 2 here).
- geeksforgeeks.org article for LCS
- http://www.ics.uci.edu/~eppstein/161/960229.html

Hey Mr.Cool Congrats for ur first blog:)

ReplyDeleteNice blog! :D

ReplyDeleteoohhh !!! thanks for the compliment :) and a special one for "Mr. COOL" !!!

ReplyDeletethanks ...

ReplyDeletenice blog..

ReplyDeletePost another articles for nice DP problems with good explaination

Sure , I will try doing that..thanks for the suggestion :)

ReplyDelete[...] LCS - make the cost of replacement greater than that of an insertion plus a deletion. [...]

ReplyDeletei think the dp code will take more than n*n running time... n*n time it ll take to build the table itself... please clarify..

ReplyDeletesorry i mean it ll take more than m*n running time..

ReplyDeleteHi Rahul,

ReplyDeleteYou are right right, building the LCS table will take O(m*n) time. And printLCS a linear time, in worst case O(m+n), so effectively it becomes O(m*n)+O(m+n). The second factor there can be left out. And that is why we say it takes O(m*n). That is how asymptotic analysis works. Does that make it clear?

yeah.. actually i was reading asymptotic analysis at that time... so, i mistook your asymptotic tight bound (order notation) for big oh (O) notation.. i understood your code.. only later i realised order notation is generally used for efficiency of algorithms...:) i was hoping you wont read my comment and i would delete it before you read it..:)

ReplyDeleteAnyways, very good explanation bhaiya.. Dynamic programming was the first topic i read in design of algorithms(apart from recursive and divide and conquer that I am generally comfortable with)... And your blog was a great motivation.. :) Keep Writing...

Good to see the interest, keep it going. And remember DP is tough. So keep attacking it to hack the tricks !

ReplyDelete