Course Schedule III

Today, I was thinking about a question on Leetcode, Course Schedule III.

The problem first appeared on its weekly online contest. I went to the contest but did not manage to solve it there. Let me rephrase the question here. You have a list of courses and, for each course, you have two numbers: duration and deadline. Duration is the time it takes to complete the course. Deadline is the time you have to finish the course. Then, you have to decide how many courses you are able to take at most, assuming that you can not take two courses at the same time.

Hence, this question is a weird version of the greedy scheduling problems: a hybrid of activity scheduling and deadlines. Activity scheduling has durations but not deadlines. However, this one has deadlines. Deadline scheduling always goes for the earliest possible deadline and it does not work for this problem since if the task with the earliest deadline takes a lot of time that should have been used for more courses, that would not be optimal. For example, if we have

Course 1: duration 10, deadline 10;

Course 2: duration 2, deadline 11;

Course 3: duration 2, deadline 11.

If we go for Course 1, which has the earliest deadline, we waste our time take the long course 1 and the time could have been used to take course 2 and 3, a plan that allows us to finish two courses.

Thus, in activity scheduling, we always go for the shortest activity, so here, what about alway taking the shortest course first? That does not work either.

Course 1: duration 10, deadline 32

Course 2:  duration 11, deadline 11

Course 3: duration 11, deadline 22

In this case, if we use our strategy, we will go for Course 1 first since it has the shortest duration. However, that means we can not take Course 2 since, after course 1, we will be at time 10, and there is not enough time to finish course 2 by time 11. However, we should be able to finish all three courses if we take them as course 2, 3, and then 1. Thus, this strategy does not work either.

None of the strategy works. It seems we have to find a “trade-off” between the deadline and the duration. My thinking during the contest ends here.

When I look at the answer, I find this is what the answer does. It checks each course by deadline and keeps track of the selected courses up to the current index. If we can take the next course, we take it; otherwise, we see if we can abandon the longest course taken and take the next course. If the alternative still does not work, that means the new course is longer than any course we have taken before so we do take it.

That is the answer to the problem but why does it work? I need some insights. After thinking about it for a few days, I realize tonight that this is based on a recursive relation between the optimal of a solution and the optimal plus possibly a new course. All courses selected before course k must be finished by the deadline of course k. That is the same for k + 1. We also have to make sure that the total duration is optimized, which means, if there are multiple ways to accomplish the same number of courses before the deadline, we have to make sure that the duration used up is minimum. Hence, two conditions:

  1. Max courses.
  2. Min duration for the same number of courses.

Thus, if the new course is good and can be finished by the new deadline, we are good since we have one more course. Otherwise, we still keep the same amount of course so we have to make sure that duration is minimum. The new course, although can not be added, might still be taken by taking away the longest course. Thus, if the new course is shorter than the longest course, we take it and drop the longest current course.

What we must learn is that we should never forget that a lot of things are recursive. We should always make a try to find recursive relations. Greedy, DP, DFS, BFS are all based on recursive relations. The reason why I did not solve this problem during the contest is that I only tried to use those existing solutions to directly solve this problem, instead of going one level further into their fundamentals, recursive relation.

Max Points on a Line

Today I am solving an algorithm problem on Leetcode: Max Points on a Line. This is tagged as a hard problem and it is indeed not a easy one. I did not manage to solve it before I look at the answer so I think it worth reflecting a bit so as to learn from this problem.

I should state the problem again. The problem asks us to write an algorithm that gives the max number of collinear points among a set of points. Hence, the input is an array of Point objects, Point[], and the output is an integer. Each point object contains two attributes, x and y. We need to find the max number of points among the array of points that are on the same line.

I started solving the problem with dynamic programming since I noticed that it is a optimization problem, finding the max number of points that are on the same line. Then I went on looking for the optimal substructure: the relation between the optimal solutions for subsets of the problem to the original problem. In our case, this will be a relation between the max collinear points of all the points with the max collinear points of some subsets of the points. I started thinking about relation between the optimal for a full array and the optimal for all but the last one. The relation does not seem to exist because there might be ties for the max number of collinear points, which means, there might be two subsets of points on the same line that have the same number of points. In this case, if the last point is on one of the lines, there will be no way to guarantee we can choose the one that is good for the next point, which will be predicting the future. Also, what makes it worse, the last point can also be on a line that is not the best in the subset. In this case, the optimal substructure totally disappeared. Hence, this does not seem to work.

Then, I tried another dynamic programming approach, looking for the relation between the max number of collinear of points on the same line that contains the new point with that of the subset. I am only looking for the lines that contains the last point, and at the end, I will calculate the max of the whole array. This is a common trick of DP problem: when we are in trouble looking for optimal substructure directly between the answers, we take a step back and look for the max that contains the new member. Then, we have to get the max of all, instead of just returning the last element. An orthodox example of this is the longest increasing subsequence. In this case, there is a pretty good relation: the max collinear line that contains the new point is either the max of any subset before plus this point if it is on the line, or two, when it is not on any of them. For a while, I thought it was correct. I realized that I should record the max lines to identify if the new point is on any line of the subset. This leads to the same problem as before: if there are ties, the information about the tied lines will be missing. Then, if the new point is a part of the tied lines, we will get the wrong answer. Since in DP problems, all the answers depend on each other, the whole problem will be wrong if we get a subproblem wrong. Thus, at this point, I have to say that DP does not work for this problem, despite that it is an optimization problem, a strong indicate of a DP solution.

Then, driving by the desire to get something accepted, I also tried brutal force. Then DP solutions tried above aim for a complexity of O(n^2); however, brutal force is O(n^3). The idea is pretty simple: try all pairs of points and see how many points are on the line of line determined by the two points. Trying all pairs take O(n^2) and checking all other points gives O(n) so the total takes O(n^3). The idea is simple but it has many edge cases.

  1. Two points may be on the same vertical or horizontal line.
  2. Two points may overlap. What’s worse, two neighboring points in the array may overlap. In this case, the “pair” that we started with overlaps.

Those edge cases give us too many if statements and logical pains. I struggled a while with it and give up. It does not worth it spending too much time on a solution that will eventually fail with timing out.

Finally I checked the solution and find out that I was in the wrong direction. I forgot about maps. The solution fix one of the points, check all other points and see how many of them are on the same line. It identify each line passing through the point with the slope of the line. It creates a map between the slope and the number of points with that slope. When it checks all the points, it classifies the points with the slope. That is such an easy approach.

Where I got wrong is that I failed to identify the classifying nature of the problem. Like I said above, I totally forgot about maps. When you stand at some point, looking at all other points, a finite number of points, around you, you can easily say that a point has a certain slope. Especially we want the max number of points on a line, a basic idea will be to classify all the points as they are on the same line. Although the idea is simple, I would say the problem is indeed a hard one since the classification nature is not obvious, or even elusive since the slope is a continuous value which usually does not welcome classifying. We can classify it just because the number of points is finite. Classification are mostly indicated by some discrete nature of the problem, such as finding the duplicate number in an array. However, here we have a plane where each point has great freedom of being where it wants to be. It is indeed hard to think of maps when solving a problems with such a great freedom. Anyway, at least, I should learn that DP, divide and conquer and brutal force are not the only ways to solve a optimization problem. There is one more way, simpler and more elegant, classification.