Saturday, March 5, 2016

HackerRank: Bear Steady Gene (II)

March 4, 2016

  Problem statement:

https://www.hackerrank.com/contests/hourrank-6/challenges/bear-and-steady-gene  

A gene is represented as a string of length  (where  is divisible by ), composed of the letters , and . It is considered to be steady if each of the four letters occurs exactly  times. For example,  and are both steady genes.


Julia's 1st practice:

https://gist.github.com/jianminchen/80723bae951328a690bb

score 15 out of 50, there are 2 run time error, failed a few of test cases.

The algorithm ends up in time complexity O(n^2), close to brute force solution - O(n^2)

C# code implementation to study:

Readable code, with some analysis.

https://gist.github.com/jianminchen/fae9142eff9a6f9643fc

Work on two pointers, sliding window, so time complexity is O(2n) = O(n).

Let us go over two pointers algorithm here using example:
GAAATAAA,
A - count of A, denoted as cA = 6, n/4 = 2, so we have to change 4 of A to other thing.
A - 6 -2  = 4 , 4 of change
C - 0 - 2 = -2
T - 1 -2  = -1
G - 1-2  = -1
So, at least minimum is 4, but a substring containing 4 of A, shortest one is AAAA. But "AAAA" is not a substring of "GAAATAAA"

G  A A A T A A A
0
start
end
Let us find the substring starting from 0, but will include all 4 of A, and then, rest of string will not include any char of "ACGT" more than n/4.
GAAATA, string length is 6.
start - 0, index of 0
end - A, index of 5

continue to move start to next one, 1, then, G is moved out from substring, adjust count of chars.
AAATA can be the substring, so the length is min (6, 5) = 5. Do not need to move end pointer.

next step, start = 2, missing one of A, and then, end has to move to next one until the string fits into requirement.

Every thing should be covered in 20 minutes, problem reading, the design of algorithm, the coding.

So, Julia had second practice, and the code scores 50 out of 50 this time.

https://gist.github.com/jianminchen/61dfe437f82edb9793fc

Conclusion:
After coding, Julia understands the algorithm much better.

1. count of array for substring, cB=[4, -2, -1, -1], so in other words, substring has to contain at least 4 'A', C, G, T does not matter.

So, using CB array, sliding windows of substring GAAATA, each time, end index is moving forward, tracking the substring's char by taking off from CB; in other words, GAAATA, CB will be
[0, -2, -2, -2].

Now, it is easy to understand that GAAATA will be added to the solution set, the length is 6.
line 51, if (cBAllLessThan1(cB)) 
<-add substring to solution set, compare length<- easy to understand now.

2. the code in https://gist.github.com/jianminchen/fae9142eff9a6f9643fc has discussion:
switch(line[0]) {
case 'A': A--; break;
case 'C': C--; break;
case 'G': G--; break;
case 'T': T--; break;
}
Julia removed those detail discussion, but her code takes more time. Because she created a bug in her writing. Debugging takes time.

3. Julia's practice in 2015, sliding windows, two pointers:
http://juliachencoding.blogspot.ca/search/label/slide%20window

April 27, 2016
change C# program to make variables more meaningful, matching the design:
code -> GENES <- global string array
function name matches design of two pointers moving.

https://gist.github.com/jianminchen/b8263048c297473319c23836e9468c14

searchStrArray -> searchStrNumber, more meaningful.
https://gist.github.com/jianminchen/b23c4f606a101b9aeec71eff3268db32

No comments:

Post a Comment