Tuesday, September 17, 2013

z-algorithm for pattern matching

String algorithms are a traditional area of study in computer science and there is a wide variety of standard algorithms available. The more algorithms you know, the more easier it gets to work with them :) As the title of the post says, we will try to explore the z-algorithm today. Lets first formulate the problem statement and a couple of definitions. To start with lets look at the below mentioned question (from interviewstreet):

QUESTION

For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings “abc” and “abd” is 2, while the similarity of strings “aaa” and “aaab” is 3.

Calculate the sum of similarities of a string S with each of it’s suffixes.

Sample Input:
2
ababaa
aa

Sample Output:
11
3

Explanation:
For the first case, the suffixes of the string are “ababaa”, “babaa”, “abaa”, “baa”, “aa” and “a”. The similarities of each of these strings with the string “ababaa” are 6,0,3,0,1,1 respectively. Thus the answer is 6 + 0 + 3 + 0 + 1 + 1 = 11.

For the second case, the answer is 2 + 1 = 3.

SOLUTION 1:

The most intuitive way (brute force)  to solve this problem is simple, take out every suffix of the given string and match it with the given string to find the number of matching characters.Now add all such numbers for every suffix and there you have it solved.

[sourcecode language="cpp"]
#include <string>
#include <iostream>
#include <algorithm>
using namespace std;
long long similarity(string s, int k){
long long i;
for(i=0;k+i<s.size();i++){
if(s[i]!=s[k+i]) break;
}
return i;
}

int main() {
string s; long long sum;
getline(cin,s);
sum=s.size();
for(long long i=1;i<s.size();i++){
sum+=similarity(s,i);
}
cout<<sum<<endl;
return 0;
}

[/sourcecode]

But this is costly as we are not using one result to calculate the next one.  So we are throwing away some valuable info and calculating things from scratch everytime. If you try the above solution, it will get you a TLE (time limit exceeded). To get this in time, we need to come up with a better algorithm. Lets discuss this better algorithm to solve this- the Z-algorithm.  The z-algorithm gives you a Z-array which is exactly what we need in this question to sum up and print the answer- The similarity of each suffix to the prefix. A tutorial for the z-algorithm can be found in video at:

  1. http://www.youtube.com/watch?v=MFK0WYeVEag   and

  2. http://www.youtube.com/watch?v=NVJ_ELSbbew


The algorithm is explained quite nicely in these videos.

SOLUTION 2:

 Once you understand the z-algorithm, its easy to guess that the solution to this question is nothing but sum of all the z values. Lets see the code now:

[sourcecode language="cpp"]

#include <string>
#include <cstdio>
#include <vector>
#include <iostream>
#include <algorithm>
using namespace std;

long long compute_z(string s){
long long sum=0;
int n = s.length();
vector<int> z(n,0);

long long L = 0, R = 0;
for (long long i = 1; i < n; i++) {
if (i > R) {
L = R = i;
while (R < n && s[R-L] == s[R]) R++;
z[i] = R-L; R--;
}

else {

int k = i-L;
if (z[k] < R-i+1) z[i] = z[k];
else {
L = i;
while (R < n && s[R-L] == s[R]) R++;
z[i] = R-L; R--;
}
}
sum+=z[i];
}
return sum;
}

int main() {
int t; string s; long long sum;
cin>>t;getchar();
while(t--){
getline(cin,s);
cout<<compute_z(s)+s.size()<<endl;
}
return 0;
}

[/sourcecode]

And yes this code is accepted :)

In case you didn't notice yet,

  • Z-algorithms can be used to find a given pattern (P) in a given text (T) in linear time. To do the same,  you need to pass the string(P+S) to the function compute_z(). If you have a z-value>=pattern at an index 'i' . the pattern exists in the text at that index 'i'.  So, the z-algorithm helps you find all occurances of of pattern in the text.

  • Another advantage of Z-algorithm is that though KMP has same time complexity as Z-Algorithm, Z-algorithm is a lot more easy to understand, remember and code :)

5 comments:

  1. Nice post :)
    Add this link too http://www.youtube.com/watch?v=PLxduXcg3Ms
    - n00b no. 1

    ReplyDelete
  2. I thought I should explain the algo itself but then this video was too good. Nothing more required!

    ReplyDelete
  3. Reblogged this on TeCnIQz and commented:
    exactly i was searching

    ReplyDelete