CodeChef is a non-commercial competitive programming community
Login
Username (New User? Signup) Password (Forgot Password?)
Signup
Login or
Signup with
Connect
Note
  • Publicize your achievements on your Facebook Wall.
  • Challenge your friends or ask them for help.

Site Navigation

  • PRACTICE
    • Easy
    • Medium
    • Hard
    • Challenge
    • Peer
  • COMPETE
    • February Long Contest
    • January CookOff
    • January Long Contest
  • DISCUSS
    • Wiki
    • Forums
    • Blog
    • Twitter
  • COMMUNITY
    • CodeChef Meetups
    • Campus Chapters
    • Host your Contest
    • User Groups
    • CodeChef TechTalks
    • All Educational Initiatives
    • Event Calendar
  • HELP
    • Frequently Asked Questions
    • FAQ for problem setters
    • Problem Setting
    • Ranks
    • Tutorials
  • ABOUT
    • About CodeChef
    • Team CodeChef
    • Press Room
    • CodeChef Financials
    • CodeChef Sponsorships
    • CEO's Corner
    • Contact Us
    • About Directi
Home » Wiki » Tutorial for The LCS Problem Revisited

Tutorial for The LCS Problem Revisited

The LCS Problem Revisited was one of the problems in the November contest.

The Longest Common Subsequence problem is a very well known problem, and has a classic example of a dynamic programming solution. This problem seems to be more complex - while we can easily compute the length of the LCS in the standard way, now we have to count them - and only the distinct ones, which is the crucial part. A subsequence may occur several times in different places, but we are only meant to count it once.

 

The standard LCS problem


Let's start by looking at how to solve the standard LCS problem - ie to find out the length of the longest common subsequence of two strings.

Given two strings, we split things into three cases:

Case 1: We don't use the first character of the first string.
Case 2: We don't use the first character of the second string.
Case 3: We match up the first character of each string.

The third case is obviously only possible if the first characters match.

Once we have decided which of these options we are going to use, we have reduced the problem to one of a smaller size: we want to find the LCS of the remaining strings, which are suffixes of the original strings.

The terminating condition for the DP occurs when one of the strings is empty - in that case, the length of the LCS is 0.

This results in the following code:

[code]
// lcs is an array of dimensions [M+1][N+1]
// s is first string of length M
// t is second string of length N
for (int i=M; i>=0; i--)  // using the ith character onwards of first string
for (int j=N; j>=0; j--) {   // using the jth character onwards of second string
    if (i==M || j==N) lcs[i][j]=0;   // empty string
    else {
        lcs[i][j] = max(lcs[i+1][j],lcs[i][j+1]);   // first two cases
        
        if (s[i]==t[j]) lcs[i][j] = max(lcs[i][j], lcs[i+1][j+1] + 1);   // third case
    }
}
[/code]

This code runs in O(MN) time, which is very fast with the lengths of the strings restricted to just 1000.

Now lcs[0][0] will hold the length of the longest common subsequence of both strings. This is the first number we are required to print out.

 

Counting the LCS


Counting the number of distinct LCS looks a little more difficult. If we tried to construct the algorithm in a similar way to the above, by adding together the number of solutions of all subproblems, we run into a few issues. Firstly, we are going to count some things twice (eg removing the first character of string one, then removing the second character of string two, and vice versa, will both be counted separately). Also, this isn't going to detect duplicates - it will count a subsequence more than once if it occurs several times in the string.

In order to get around this, we use the one constraint we haven't looked at yet - there are only 26 possible characters that could occur. Rather than looping through the strings to choose which character we should use first (which may include the same letter twice), we loop through all possible letters. We thus split things into 26 possible cases:

Case 1: How many subsequences of the right length (lcs[0][0]) begin with 'a'?
Case 2: How many subsequences of the right length (lcs[0][0]) begin with 'b'?
..
Case 26: How many subsequences of the right length (lcs[0][0]) begin with 'z'?

Let's suppose we are considering Case 1. Since we know that we are going to start with the letter 'a', we may as well use the first occurrence of 'a' that occurs in both strings. Now we consider the remaining parts of the strings - and want to count the number of subsequences of length one smaller than we started with.

Again, the terminating condition is when one string is empty, where the answer is 0.

This turns into the following code:

[code]
// lcscount is an array of dimensions [M+1][N+1]
// s is first string of length M
// t is second string of length N

for (int i=M; i>=0; i--)   // using the ith character onwards of first string
for (int j=N; j>=0; j--) {   // using the jth character onwards of second string
    if (i==M || j==N) lcscount[i][j]=0;   // empty string
    else {
    	lcscount[i][j]=0;
        for (char k='a'; k<='z'; k++) {   // use k as the first letter
            int x = /* first position in first string (from i onwards) that is a k */
            int y = /* first position in second string (from j onwards) that is a k */            
            if (/* x and y both exist*/) {
                if (lcs[x][y] == lcs[i][j] - 1) {   // make sure we can get a solution of the right length
                    lcscount[i][j] = (lcscount[i][j] + lcscount[x][y])%23102009;   // add on solutions
                }
            }
        }
    }
}
[/code]

 

Precomputing the next positions


The only problem we have remaining are the bits commented out in the above code - finding the first position of a certain character in each suffix of the input string. Simply looping through the string at each point isn't going to work - we might have to loop through up to 1000 characters, and 1000*1000*1000 is going to be too slow.

However, we can get around this by precomputing all of this information, using a third dynamic programming approach (but a very simple one).

We do so for each string by considering each character separately - if the string starts with this character, we have our answer. Otherwise, we skip this character.

Example code for the first string only:

[code]
// next is an array of dimensions [M+1][26]
// s is first string of length 

for (int i=0; i<26; i++) {
    next[M][i] = M;   // set the 'next' occurrence of each character in an empty string to something impossible to signify it doesn't exist
}

for (int i=M-1; i>=0; i--)
for (int j=0; j<26; j++) {
    if (s[i]=='a'+j) {   // first character matches, so position = i
        next[i][j] = i;
    } else {   // first character doesn't match, so we skip this character
        next[i][j] = next[i+1][j];
    }
}
[/code]

By doing the same thing for the second string, and inserting these results into the previous DP code, we end up with a full solution.

 

Summary


While dynamic programming often looks scary to beginners, as long as you approach it in a step-by-step manner, the resulting code tends to be very short and sweet. Just make sure you know exactly what you want your subproblems to be, and you'll have a solution before you know it.


Comments

  • Login or Register to post a comment.

Hi, i used the following

srirams @ 11 Nov 2009 05:02 PM

Hi, i used the following algo. Incidentally, there was a paper on the same topic. However, the algo gives wrong answer ;) .

Here's the pseudocode. First we find the LCS and use it to count the distinct LCS.cin>>A>>B;
m=sz(A);n=sz(B);
REP(i,m) X[i+1]=A[i];
REP(j,n) Y[j+1]=B[j];
int L[m+1][n+1];
REP(i,m+1) L[i][0]=0;
REP(j,n+1) L[0][j]=0;
for(i=1;i<=m;i++) {
for(j=1;j<=n;j++) {
if(X[i]==Y[j])
L[i][j]=L[i-1][j-1]+1;
else
L[i][j]=max(L[i-1][j],L[i][j-1]);

}
}
LL D[m+1][n+1];
REP(j,n+1) {
REP(i,m+1) {
if(i==0 || j==0) D[i][j]=1;
else {
D[i][j]=0;
if(X[i]==Y[j]) {D[i][j]=D[i-1][j-1];}  // Number of distinct LCS of D[i][j] is same as the no of distinct LCS from [i-1,j-1]
else {
if(L[i][j]==L[i-1][j]) D[i][j]=(D[i][j]+D[i-1][j]);
if(L[i][j-1]==L[i][j]) D[i][j]=(D[i][j]+D[i][j-1]);
if(L[i-1][j-1]==L[i][j]) D[i][j]=(D[i][j]-D[i-1][j-1]); // We have added twice,so subtract once

}
}
D[i][j]%=23102009;
}
}
printf("%d% ldn",L[m][n],D[m][n]%23102009);

Can you tell me why this code is wrong?

Link to the paper on the same problem:- http://arxiv.org/pdf/cs/0301034

 

I never realized that reading

prodigyaj @ 11 Nov 2009 06:41 PM

I never realized that reading tutorials is interesting even after you have solved the problem :)

@sriram: May be the way you

flying_ant @ 11 Nov 2009 11:08 PM

@sriram: May be the way you are taking the Modulus is the problem in your code ? It can go to negative, so.. you can add the given MOD value to it, before taking the modulus, like D[i][j] = (D[i][j] + 23102009) % 23102009;

The culprit is the following

raptor @ 12 Nov 2009 01:57 AM

The culprit is the following line:

D[i][j]%=23102009;

note that  the MOD can go negative in the above line , in that case you should add  23102009 before doing mod .

You can do the thing as stated above by Anil Kishore or just add this line to your code :

if(D[i][j]<0)D[i][j]+=  23102009;

Can somone explain how it can

sppraveen @ 12 Nov 2009 01:21 PM

Can somone explain how it can go Negative. I could not understand. D[i][j] matrix never goes negative. Hence a positive no % positive no should always produce a positive result. What am i missing?

sriram's code (which is being

triplem @ 13 Nov 2009 05:53 AM

sriram's code (which is being discussed) contains a subtraction:

[code]if(L[i-1][j-1]==L[i][j]) D[i][j]=(D[i][j]-D[i-1][j-1]);[/code]

ALso when i programmed on

codegambler @ 14 Nov 2009 05:35 PM

ALso when i programmed on this problem the count exceeded in long long also ans so when you do

if(L[i][j]==L[i-1][j]) D[i][j]=(D[i][j]+D[i-1][j]);
if(L[i][j-1]==L[i][j]) D[i][j]=(D[i][j]+D[i][j-1]);
if(L[i-1][j-1]==L[i][j]) D[i][j]=(D[i][j]-D[i-1][j-1]);

 

then value on D[*][*] will exceed and become -ve. Hence when at last you calculated for answer it gave right answer for small input but would give -ve or unpredictable in large input

That wouldn't happen in

triplem @ 16 Nov 2009 06:10 AM

That wouldn't happen in sriram's case because D[i][j] at every stage is always less than the modulus, and 2*23102009 easily fits inside an integer. You would only get into that problem if you forgot to reduce the value after each addition.

for input

shiplu @ 6 Dec 2009 01:20 AM

for input like:

abc

xyz

 

srirams code gives 0 1 and i think this is the correct answer.

but judge output for this input is 0 0 ( why ).

The judge output is 0 1. Why

triplem @ 6 Dec 2009 03:00 AM

The judge output is 0 1. Why do you think it is 0 0?

Because i got a wrong answer

shiplu @ 6 Dec 2009 08:04 PM

Because i got a wrong answer in the contest, for this reason and i handled it explicitly by printing 0 and got AC.

Well, my AC code prints 0 1

triplem @ 7 Dec 2009 01:57 AM

Well, my AC code prints 0 1 in all of those cases.

hi

maheshnarala @ 13 Oct 2010 09:07 PM

hi

CodeChef is a non-commercial competitive programming community
  • About CodeChef
  • About Directi
  • CEO's Corner
  • C-Programming
  • Programming Languages
  • Contact Us
© 2009 Directi Group. All Rights Reserved. CodeChef uses SPOJ © by Sphere Research Labs
In order to report copyright violations of any kind, send in an email to copyright@codechef.com
CodeChef a product of Directi
The time now is:
CodeChef - A Platform for Aspiring Programmers

CodeChef was created as a platform to help programmers make it big in the world of computer programming. At CodeChef we work hard to revive the geek in you by hosting programming contests on a monthly basis. We also aim to have training sessions and events related to online programming for programmers around the world. Apart from providing a platform for programming competitions, CodeChef also has various tutorials and forum discussions to help those who are new to the world of computer programming.

Practice Section - A Place to hone your 'Computer Programming Skills'

Try your hand at one of our many practice problems and submit your solution in a language of your choice. Our judge accepts solutions in over 35+ programming languages. Online programming was never this much fun! Receive points, and move up through the CodeChef ranks. Use our practice section to better prepare yourself for the multiple programming competitions that take place through-out the month on CodeChef.

Compete - Monthly Programming Contests and Cook-offs

Here is where you can show off your computer programming skills. Take part in our 10 day long monthly programming contests and the shorter format Cook-off programming contests. Put yourself up for recognition and win great prizes. Prizes worth up to Rs.20,000 and $700 are up for grabs every month along with lots more CodeChef goodies.

Discuss

Are you new to computer programming? Do you need help with algorithms? Then be part of CodeChefs Forums and interact with all our programmers love helping out other programmers and share their ideas.

CodeChef Community

As part of our Educational initiative, we give institutes the opportunity to associate with CodeChef in the form of Campus Chapters. Hosting online programming competitions is not the only feature on CodeChef. Be a part of the CodeChef community through CodeChef meetups and techtalks. You can also host a programming contest for your institute on CodeChef and be a guest author on our blog.

Domain Name Registration, Web hosting, and Website Design provided by BigRock.com