Median of 2 sorted arrays

Sunday, May 4th, 2008

A nice young man wrote about working on an algorithm problem after being inspired to think thusly by his new copy of The Algorithm Design Manual. Coincidentally, I recently received my copy of the book as well. I was less inspired than Mr. Torreborre, probably because I’m very lazy. I thank him for presenting his interview question, because that got me to thinking yesterday morning. How would I find the median element of the merger of two sorted lists without actually doing a (linear) merge?

I’m now typing in what’s been going through my head, mostly while cooking breakfast for my kids. I contemplated trying to explain it to them, but I decided against it. Over the course of the explanation they would certainly concoct a variety of theories about why I was trying to punish them in such a strange and tedious way. I haven’t gone to search for a result in the book or on the web; maybe that’ll be obvious after you read this. I’ll check later.

First, I’ll say the median element of a sorted list a[first] … a[last] is a[first + floor((last-first)/2)]. In other words, it’s the element right in the middle (rounding down – arbitrarily – if the list has an even number of elements). So for this problem, I’ve got two lists, a and b, and I want to know what element would be smack in the middle of the list resulting from their merger.
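To pin down the "rounding down" convention, here is a minimal sketch in Python. It assumes inclusive bounds and picks the lower of the two middle elements when the length is even; the function names are mine, invented for illustration.

```python
def median_index(first: int, last: int) -> int:
    """Index of the middle element of a[first..last], inclusive.

    For an even number of elements this picks the lower of the two
    middle positions (the "rounding down" choice).
    """
    if last < first:
        raise ValueError("empty range")
    return first + (last - first) // 2


def median_of_sorted(a):
    """Median element of a whole sorted list, using 0-based indexing."""
    return a[median_index(0, len(a) - 1)]


print(median_of_sorted([10, 20, 30]))      # odd length: the true middle
print(median_of_sorted([10, 20, 30, 40]))  # even length: the lower middle
```

Writing the index as an offset from first means the same expression works whether the list is numbered from 0 or from 1.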

I went around in confused circles for a while before hitting upon (what I think to be) a good way to view the problem. I know what the size of the merged list would be – it’s the sum of the sizes of a and b. Thus I know exactly where the median value will be in the result. However, all I know about the result list is that it’s a muddle of values from the two original lists. For any index i in the merged list, all I can say is that the value will be from either a or b.
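For reference, this is roughly what the thing I'm trying to avoid looks like: a sketch of the plain linear walk, assuming 0-based Python lists and the lower-middle convention for even sizes. The names median_slot and median_by_linear_merge are made up for this example. Note that the slot is known from the sizes alone, before any value is examined.

```python
def median_slot(a, b):
    """Position of the median in the hypothetical merged list (0-based)."""
    n = len(a) + len(b)
    if n == 0:
        raise ValueError("both lists are empty")
    return (n - 1) // 2


def median_by_linear_merge(a, b):
    """Walk both sorted lists in merge order until the median slot is reached.

    This is the O(n) baseline, useful mainly for checking cleverer versions.
    """
    target = median_slot(a, b)
    i = j = 0
    while True:
        # Take from a when b is exhausted, or when a's next value is not larger.
        take_a = j >= len(b) or (i < len(a) and a[i] <= b[j])
        value = a[i] if take_a else b[j]
        if i + j == target:
            return value
        if take_a:
            i += 1
        else:
            j += 1
```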

Now, there are a few cheap things I can do to examine my lists. Looking at an element at a given position is cheap – well, at least, I’m going to declare it to be cheap. So, for example, I can look at the median of either source list, or at the first element, or the last. Another thing I can do, less cheaply, is to find where a number would go in one of the source lists. That’s a binary search, an O(log₂ n) operation (with n being the list length).
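In Python, the "where would this number go" question is what the standard bisect module answers. A small sketch, with wrapper names of my own choosing:

```python
from bisect import bisect_left, bisect_right


def count_below(sorted_list, value):
    """How many elements are strictly smaller than value (binary search)."""
    return bisect_left(sorted_list, value)


def count_at_or_below(sorted_list, value):
    """How many elements are smaller than or equal to value (binary search)."""
    return bisect_right(sorted_list, value)


b = [2, 4, 4, 7, 9]
print(count_below(b, 5))        # 5 would be inserted at this index
print(count_below(b, 4))        # insertion point before the run of 4s
print(count_at_or_below(b, 4))  # insertion point after the run of 4s
```

The two variants differ only when the value is already present, which is exactly where off-by-one mistakes like to hide.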

If I think about a couple of ordered lists of numbers (with no known bounds on the values in the lists), it’s clear that one list may have values larger than any value in the other, or smaller than any value in the other. If I look at the smaller of the two last values in my source lists (that is, a[last] and b[last]), and then find where it would go in the other list, I now know something really interesting: I know exactly what values occupy the positions at the end of the merged list! That is, if a[last] is smaller than b[last], and I find that it would go just below the last 100 elements of b, then I know that the last 100 elements of the merged array have to be the last 100 elements of b. Of course I can make the same discovery at the bottom ends of the lists.
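Here is a sketch of that observation for the top end, assuming two non-empty sorted Python lists. The function name settled_tail and its return shape are hypothetical, chosen just for this illustration.

```python
from bisect import bisect_right


def settled_tail(a, b):
    """Return (owner, count): the last `count` slots of the merged list
    are known to hold the last `count` elements of list `owner`.

    Assumes both lists are sorted and non-empty.
    """
    if a[-1] <= b[-1]:
        # Everything in b strictly greater than a's last value sorts after all of a.
        return "b", len(b) - bisect_right(b, a[-1])
    # Otherwise the roles are reversed.
    return "a", len(a) - bisect_right(a, b[-1])


print(settled_tail([1, 3, 5], [2, 4, 6, 8, 10]))
```

In the example, 5 is the smaller of the two last values and it falls below 6, 8 and 10, so the top three merged slots belong to b. A mirror-image function using bisect_left on the first elements would settle the bottom end.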

In the degenerate case, one list might contain nothing but values completely beyond the end of the other list. In that case, I can immediately find the median of the combined lists because the merge is easy. If that’s not the case, then with about 2 log n operations (one binary search at each end) I can snip off some of both lists. Now, some portion of the low and high ends of my hypothetical merged list is no longer muddled – I know that each of those portions contains values from just one of the two lists. In fact, I can now see what my goal is: I need to get to the point where the median slot – the position smack in the middle of the hypothetical merged list – is not muddled.
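The degenerate case is easy enough to sketch. The code below assumes sorted Python lists and the lower-middle median convention; median_if_disjoint is a name I made up, and returning None for the muddled case is just one possible design.

```python
def median_if_disjoint(a, b):
    """If one sorted list lies entirely at or below the other, return the
    median of their merger directly; otherwise return None.
    """
    if not a and not b:
        raise ValueError("both lists are empty")
    if a and b and a[-1] > b[0]:
        if b[-1] > a[0]:
            return None  # the lists overlap, so the merge is still muddled
        a, b = b, a      # b lies entirely at or below a, so swap roles
    # Here the merged list is simply a followed by b.
    k = (len(a) + len(b) - 1) // 2
    return a[k] if k < len(a) else b[k - len(a)]


print(median_if_disjoint([1, 2, 3], [4, 5, 6, 7]))
print(median_if_disjoint([1, 5], [2, 3]))
```

The check costs two comparisons of end values, which is why it seems worth trying before doing any searching.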

So now I have two sublists of the original lists, which represent the “muddled middle” of the hypothetical merged result. Hmm. I’m not liking this approach, because I’m not lopping off sufficiently big chunks of the problem. Well, they might be big, but I’m not forcing any bigness; I’m just whittling the ends of the lists down, and the rapidity with which the lists get smaller is entirely dependent on the values in the lists. I need to get a little more radical. (I do still like checking to see whether one list lies completely beyond the other, however, at least for now. It’s cheap to do.)

Another thing I can do cheaply is look at the median of one of my lists, and then I can see where it’d go in the other list. If I do that, then I will definitely lop off half of one list. I’ll still have a muddled result, except now I know a little bit more: I now have four lists, not two, and by looking at the combined sizes I’ll know that two of those can be forgotten. Now that’s looking good, because on every iteration I’m throwing away half of one list and at least some of the other. I’ll always pick the larger list to be the one I cut in half of course. Eventually, I’ll get to the point where one of the two lists lies completely beyond the other, and then I’m home free.
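A rough sketch of how this might look in Python follows. It is one possible reading of the idea, not a tested production routine: it assumes sorted lists, the lower-middle median convention, and it stops when one range empties or the pivot lands exactly on the median slot, rather than checking explicitly for one list lying beyond the other. The name merged_median and the variable names are mine.

```python
from bisect import bisect_left


def merged_median(a, b):
    """Median of the merger of two sorted lists, without merging them."""
    n = len(a) + len(b)
    if n == 0:
        raise ValueError("both lists are empty")
    k = (n - 1) // 2            # rank of the target within what remains
    alo, ahi = 0, len(a)        # remaining window of a is a[alo:ahi]
    blo, bhi = 0, len(b)        # remaining window of b is b[blo:bhi]
    while True:
        # Always cut the larger remaining window in half.
        if ahi - alo < bhi - blo:
            a, alo, ahi, b, blo, bhi = b, blo, bhi, a, alo, ahi
        if bhi == blo:
            return a[alo + k]   # nothing left to interleave
        m = alo + (ahi - alo) // 2
        pivot = a[m]
        j = bisect_left(b, pivot, blo, bhi)   # where the pivot would go in b
        r = (m - alo) + (j - blo)             # pivot's rank in the merged window
        if k == r:
            return pivot
        if k < r:
            ahi, bhi = m, j                   # forget the pivot and everything above it
        else:
            alo, blo = m + 1, j               # forget the pivot and everything below it
            k -= r + 1


print(merged_median([1, 3, 5, 7], [2, 4, 6]))
```

The pivot splits each list in two, giving the four pieces; comparing the pivot's rank r with the target rank k says which two pieces can be forgotten. Each pass discards at least half of the larger window, and each pass costs one binary search.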

Coding up something like this makes me feel uncomfortable. I fear “off by one” errors the way some people fear spiders. Sitting here now I can conjure up a vision of working through the day on this stupid thing. I know that I really should try. One thing that’s clear is that it’d be silly to do it in a fun language like Erlang, because it all depends on it being O(1) to look at values in the lists. (Well I guess I could merge two Erlang binaries, treating them as arrays of integers, but compounding my off-by-one fears with the need to essentially code up my own array indexing routines really freaks me out.) I’ll try not to be lazy and do this in Java or something boring like that, at least.

Later…
Well I just checked this blog by a smart person and he does this a totally ‘nuther way. I don’t have enough glucose left in my blood right now to figure out which is better, but I bet his is because he doesn’t have to do any searches. He has a weird definition of the median though in the case of lists with an even number of elements. I’ve always thought that the median value has to actually be in the data set, because otherwise there can be an arbitrary number of median values. I’m no statistician however. It probably makes no difference at all for this problem.
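To make the difference concrete, here is a small sketch. It assumes the other definition is the usual statistical one that averages the two middle values for an even-sized list; I haven't confirmed that this is exactly what the other post uses. The statistics module is in Python's standard library, and lower_median is my own helper.

```python
import statistics


def lower_median(values):
    """Median that is always an actual member of the data set."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


data = [1, 2, 3, 4]
print(lower_median(data))           # an element of the data
print(statistics.median_low(data))  # the library's equivalent
print(statistics.median(data))      # average of the two middle values
```

For odd-sized lists the definitions agree, so the choice only matters when the combined size is even.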

Later still…
I’ve been putting together a model cannon kit whose instructions are basically “glue it together”. My eyes are sore. Anyway, to the extent that I can get a lame approximation to the stuff I wrote above working, I think it’s still interesting. In the case of a significant disparity in list sizes, mine converges really quickly on the median because it chews up the shorter list very quickly. When the lists are pretty dense and about the same size, it takes about log n iterations. I’m not sure how to figure the bounding function – maybe it’s log n but maybe not (I suspect the latter but I’m dumb).