Tuesday, July 26, 2016

Lintcode 79: longest common substring

July 26 2016

Work on lintcode: longest common substring.

Problem statement:

Given two strings, find the longest common substring.
Return the length of it.
Note
The characters in substring should occur continiously in original string. This is different with subsequnce.

Algorithm Study

Study the blog - longest common substring (60 minutes reading first time/ 20 minutes review every 6 months) written by a facebook engineer, Ider Zheng. Julia likes the article written in Chinese, it is a well-written and very good thinking process about the problem solving.

Julia likes to repeat the process here in her own words.

1. Brute force solution -> from O(n4) to O(n3), great analysis in the above blog:

Good thinking addes value to your coding practice:

Warmup with a brute force solution:

Time complexity - O(n4)

For example, a string s1 = "abcdefg",

One way to think about brute force:

The length of string is 7.

How many substring in s1? Guess?

Substring - start position and end position, two variables. Each one has O( n ) choices. Total is O(n2) choices.

More detail, start position can be any i from 0 to 6, and then end position starts from i to n-1. The variation formula, Sum = (n - 1) + ( n - 2) + ... + 1 = (n-1) n / 2, so the total is O(n2);

Try to reduce brute force variation from O(n2) to O(n), instead of letting start position and end position
both varies, just work on start position only.

Small improvement based on brute force solution

Time complexity - O(n3)

Second way to think about using brute force:
The start position of substring is from 0 to n-1, so considering the start position, start a new search.

So, the total of search is O(n).

 Longest common substring - start from start-position, and then  compare both of chars are equal, if yes, continue, record length and compare to maximum length, else then break the search.

1. C++ code:

Code is from the blog written by a facebook engineer.

The time complexity is O( n). There is duplicated calculation.

Will write C# practice very soon.

Dynamic programming - optimal time complexity O(n2)


2. Use Dynamic programming, in the above blog, the analysis is very helpful.

Work on dynamic programming, improve time complexity from O(n4) or O(n3) to O(n2), using memorization, space O(n2), bottom up approach.

The idea is to find the formula of DP - dynamic programming.

table T(i, j) - common substrings, using end position as a variable.


one is in s1, ending at position s1[i];
one is in s2, ending at position s2[j].

We know that if T(i, j) >0, then, s1[i] = s2[j];
if(s1[i+1] == s2[j+1]), then, T[i+1, j+1] = T[i,j]+1,
otherwise, T[i+1,j+1] = 0.

Will think about to put together a graph here to explain the idea as well.

2. C++ code:


From blog: C++ code
DP solution, time complexity O(nm), space O(nm)


3. further improvement: C++ code
DP solution, time complexity O(nm), space O(nm) -> O(n+m) -> O(1)
because the recurrence formula tells us that the current position only relies on diagonal position - left-up corner ( i - 1, j - 1)
https://gist.github.com/jianminchen/0061dcf562bd0bdb091301241c38730f

From blog

Julia's practice:
1. brute force solution C#
A. first writing, static analysis catching 1 bug, left 2 bugs for debugging. (not so good!)

B. fix all bugs - final presentation: C#

2. Dynamic programming solution using C#:


Highlights of code writing and execution:

1. static analysis - find bugs
change made: line 56 - 61, if the longest common substring is with length 1 and also start from row 0 or col 0.

line 67 - 72

2. Test case failed on line longestCommonSubstring("abc1def","1ghijkl") - should be "1",

change made: move end1 variable to line 49, and set variable from line 56 - 61, line 67 - line 72

2nd version: add comments


3. DP solution with space reduction: O( nm ) -> O( n+m )  -> O( 1 )

practice later.

* Design issues:
         *
         * 4 variables - memo, longest, end1, searchFirstRowCol
         * 1. memorization using two dimension array - memo
         * 2. variable int longest - get maximum length
         * 3. variable int end1    - string s1 - end position - s1's substring end position
         * 4. variable bool searchFirstRowCol - check first row and first col to update maximum length

DP problems:

Follow up after 8 months


March 17, 2017

1. Read code review on longest common substring algorithm.
2. Read wiki article about "Algorithm Implementation/Strings/Longest common substring"
3. Blog formatting to make it more readable. 
4. Code review:

Ashton and String Hackerrank


Coding practice is like sports - I don't feel fear when I am on court. That's where I feel at home.

No comments:

Post a Comment