Archive for December, 2007

HTML Escaping in Javascript

Saturday, December 29th, 2007

The Prototype library augments the native “String” class with an “escapeHTML” method. It’s handy for scrubbing user-supplied text before it goes onto the page:

  $('someSpan').update(randomString.escapeHTML()); 

I found recently that the only characters the function “escapes” are ‘&’, ‘<’, and ‘>’. I found that out the hard way, having assumed it’d also escape single- and double-quote characters. (Why did I assume that? I’m dumb maybe, but I figure it’s nice for building HTML attribute values. Whatever.)
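To make the attribute problem concrete, here is a minimal sketch. The “escapeBasic” function is a hypothetical stand-in that escapes only the three characters mentioned above; it is not the library’s code, and the input string is made up for illustration.

```javascript
// Hypothetical stand-in: escapes only &, < and >, as described above.
function escapeBasic(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// A made-up user-supplied string containing a double quote.
var userText = '" onmouseover="alert(1)';

// Dropping it into an attribute value: the quote passes through untouched,
// so it closes the attribute early.
var html = '<input value="' + escapeBasic(userText) + '">';
// html is: <input value="" onmouseover="alert(1)">
```

The quote ends the “value” attribute and the rest of the string becomes a new attribute, which is exactly why I wanted quotes escaped too.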

The implementation of “escapeHTML” (on Firefox, at least, and maybe Opera) is pretty odd, and clearly something done for performance. What the library does is to create a div element with a text node in it, and keep that around. A call to the “escapeHTML” function is handled by setting the “data” attribute of the text node to the string being escaped. The return value is then the “innerHTML” property of the div. Thus the function leverages the internals of “innerHTML” and has the translation done in the (presumably) fast compiled innards of the browser. Here’s a rendition of what it looks like; it’s not exactly like this but this is (I think) equivalent:


  escapeHTML: (function() {
    var rv = function(s) {
      var self = arguments.callee;
      self.text.data = s;
      return self.div.innerHTML;
    };
    rv.div = document.createElement('div');
    rv.text = document.createTextNode('');
    rv.div.appendChild(rv.text);
    return rv;
  })()

The simple-minded version used for WebKit and IE looks like this:


  escapeHTML: function(s) {
    return s
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;');
  }

Whether or not I’m right in thinking that the library’s “escapeHTML” function should deal with quotes, I need something that does. I can’t leverage the “innerHTML” trick, because the browser apparently disagrees with me about quotes. I could just use an extended version of the repeated-“replace” version. There’s another way, however:


  escapeHTML: (function() {
    var MAP = {
      '&': '&amp;',
      '<': '&lt;',
      '>': '&gt;',
      '"': '&#34;',
      "'": '&#39;'
    };
    var repl = function(c) { return MAP[c]; };
    return function(s) {
      return s.replace(/[&<>'"]/g, repl);
    };
  })()

The “replace” function (native to Javascript “String” objects) can take a function as its second argument. The function is passed the results of each match. Thus this version above matches on any of the characters I want to escape, and then translates them to the HTML entities via a little map object. It seems like this would be better in the case of “clean” strings, because the pattern would match nothing and so the string would be returned unchanged.
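As a quick usage sketch, here is the same idea pulled out into a standalone function (the real thing would be a “String” method; the free-standing “escapeHTML” variable and the sample strings are just for illustration):

```javascript
// Standalone version of the single-regex approach, for illustration only.
var escapeHTML = (function() {
  var MAP = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&#34;',
    "'": '&#39;'
  };
  // "replace" calls this once per match, passing the matched character.
  var repl = function(c) { return MAP[c]; };
  return function(s) {
    return s.replace(/[&<>'"]/g, repl);
  };
})();

var clean = escapeHTML('plain text');
// clean is 'plain text' - nothing matched, so nothing changed

var dirty = escapeHTML('<a href="x">Tom & Jerry\'s</a>');
// dirty is '&lt;a href=&#34;x&#34;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;'
```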

In order to see what the performance realities were, I started hacking up a little test page. After a while, I got tired of editing and re-editing to see how different tinkerings with the different function versions would affect the timing. What I wanted was a setup to make a little table: different test datasets (short strings, long strings, clean strings, dirty strings) across the top, and the different “escapeHTML” versions at the left. The cells would be the average time of a pass over each string in each test list. Because I’m silly I wanted the table to be filled in as the tests ran, so the numbers would show up while I stared at the screen.

To do the little incremental animation thing made my head hurt for a while. I took a break to cook some muffins (orange cranberry – here’s a tip: when making orange muffins or any sort of orange quick bread, use frozen concentrated orange juice instead of plain juice, and a teeny bit of orange oil) and while I was filling up the muffin cups I had an idea. I think it’s something that people who’ve been using functional languages would probably think of as being perfectly obvious.

Prototype has an “inject” routine, which is like “foldl” in … well, in something; Scheme maybe? I know that in the Erlang “lists” module it’s “lists:foldl”. The function is a method available on any Enumerable object, like an array. It takes two arguments (well, three, but ignore that for now), the first being an initial value and the second being an iterator function. The “inject” mechanism passes the initial value and the first element of the collection to the function. The function does something, and returns a value. That’s passed on in the next iteration, with the second element of the list, and so on. The final return value of “inject” is whatever the last invocation of the iterator returns.
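For anyone who hasn’t bumped into it, here is roughly what “inject” does, sketched as a plain function over arrays. This “inject” is my own hypothetical stand-in, not Prototype’s implementation; it only mirrors the behavior described above (initial value first, then an iterator that gets the running value, the element, and the index).

```javascript
// Hypothetical stand-in for "inject", written against plain arrays.
function inject(list, initial, iterator) {
  var memo = initial;
  for (var i = 0; i < list.length; i++) {
    // Whatever the iterator returns is handed to the next iteration.
    memo = iterator(memo, list[i], i);
  }
  return memo;
}

// Arithmetic reduction: add up a list of numbers.
var total = inject([1, 2, 3, 4], 0, function(sum, n) {
  return sum + n;
});
// total is 10

// Building up an array from the source values.
var doubled = inject([1, 2, 3], [], function(acc, n) {
  acc.push(n * 2);
  return acc;
});
// doubled is [2, 4, 6]
```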

Common pedagogical examples of “inject” do stuff like compute an arithmetic reduction over a list of numbers, or build up an array from a function that interprets values in the source array. In my case, what I wanted was a function, one that I could call to start the sequence of test runs. In other words, I needed to have something that would iterate across the cells of my table, running the “escapeHTML” versions on each list of test strings.

My muffin epiphany was that I could use nested “inject” loops backwards through the table to build up a function that would run the test for its cell, and then invoke the function built on the previous iteration, which is the one for the next cell in running order. (That’s why I’d go backwards through the array: the last function to run would have to be the first function I built.) Thus the “inject” process would build up a chain of functions wrapped around each other like Russian nesting dolls.
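Stripped of the table and the timing, the nesting-doll trick looks something like this. It’s a sketch that uses the native array “reduce” in the role of “inject”; the “cells” list and the “log” array are made up for the example.

```javascript
var log = [];
var cells = ['a', 'b', 'c'];

// Walk the list backwards. Each step wraps the function built so far
// ("next") inside a new function that does its own work first.
var runAll = cells.slice().reverse().reduce(function(next, name) {
  return function() {
    log.push(name);      // stand-in for "run the test for this cell"
    if (next) next();    // then hand off to the cell that comes after
  };
}, null);

// The function that pops out the top belongs to the first cell.
runAll();
// log is now ['a', 'b', 'c']
```

The very first function built (for ‘c’) gets null as its “next”, which is what ends the chain.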

That ended up as a somewhat general-purpose facility that will take a list of functions (with names) and a list of test datasets (with names), and create a table in a selected container, and then fill in the table test by test:


function timings(container, datasets, functions, repeats) {
  // first make the empty table, with headings and left-side labels
  $(container).update(
    "<table><tr><th></th>#{headings}</tr>#{rows}</table>".interpolate({
      headings: datasets.collect(function(sl) {
        return "<th>#{name}</th>".interpolate(sl);
      }).join(''),
      rows: functions.collect(function(fp, rowNum) {
        return "<tr><td>#{name}</td>#{timings}</tr>".interpolate({
          name: fp.name,
          timings: datasets.collect(function(x, colNum) {
            return "<td id='#{r}_#{c}'></td>".interpolate({r: rowNum, c: colNum});
          }).join('')
        });
      }).join('')
    })
  );
  // now make the nested series of functions, and call the one that pops
  // out the top
  functions.reverse(false).inject(null, function(ff, fl, rowNum) {
    return datasets.reverse(false).inject(ff, function(ff, sl, colNum) {
      var tdid = "#{r}_#{c}".interpolate({
        r: functions.length - rowNum - 1,
        c: datasets.length - colNum - 1
      });
      return function() {
        $(tdid).addClassName('pending');
        setTimeout(function() {
          // average the time for some runs over this test dataset
          var avgTime = $R(1, (repeats || 10)).inject(0, function(avg, n) {
            var start = new Date().getTime();
            sl.list.each(fl.fn);
            return (avg * (n - 1) + (new Date().getTime() - start)) / n;
          });
          // show the average in milliseconds with two decimal places
          $(tdid).update(avgTime.toFixed(2)).removeClassName('pending');
          // call the next function in the chain - this is the one passed
          // in as the first argument by the "inject" mechanism
          ff && ff();
        }, 250);
      };
    });
  })();
}

Note that the per-cell functions operate via timeout. That’s so that the browser has a chance to catch its breath and update the display between test runs. (The 250ms delay was just a random guess, but delays like 10ms were too short.) You can see the results of my fooling around here. What I found was kind of interesting.
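One detail in the code above that may not be obvious is the averaging: the inner “inject” keeps a running mean instead of summing everything and dividing at the end. Here’s that arithmetic on its own, as a sketch with made-up sample times (the “runningMean” name is mine):

```javascript
// Running mean: after n samples, avg holds the mean of the first n.
function runningMean(samples) {
  var avg = 0;
  for (var n = 1; n <= samples.length; n++) {
    // Undo the previous division, add the new sample, divide by the new count.
    avg = (avg * (n - 1) + samples[n - 1]) / n;
  }
  return avg;
}

var mean = runningMean([10, 20, 30]);
// steps: 10, then (10 * 1 + 20) / 2 = 15, then (15 * 2 + 30) / 3 = 20
```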

First, the “weird” way that Prototype does the escaping (via the “innerHTML” trick) is pretty consistent in performance. It’s slower than the “fun” approach (the single regex with a replace function) when strings are clean, I guess because the browser regex code is pretty well optimized, and the pattern is simple. All of the regex routines get slow on larger strings that require scrubbing. The “hybrid” code on that test page is the “innerHTML” trick with a pre-check via regex.

The “Weird 2” function, which is always slower than “Weird”, is the Prototype “innerHTML” version implemented as in the actual library. Specifically, the div and text node are attached as attributes to the function object itself, and then accessed via “arguments.callee”. In my version (“Weird”), I just used a closure for the elements. Why would the closure be faster?

Oh, and the “multi” ones are the multiple-replace routines.

For my purposes, I’m pretty sure most of the strings that my pages escape are clean. Thus I think I’ll introduce the single-regex version somehow in my world.

Finally, while typing this in, I needed a place to escape the code samples. I think I wrote a C program to do that once a long time ago, but I couldn’t find it. Thus I typed in this page to do it. Maybe people with fancy IDEs have built-in tools for that.

The dark side of {{ }}

Saturday, December 22nd, 2007

Despite having typed in an enormous amount of Java code over the past seven or eight years, my “discovery” of


return new Foo() {{
  setThingy(whatever);
}};

was embarrassingly recent. Because it lets me work more with values than with statements, I like it a lot, and I try to think in terms of using it.

For those still unaware of this surprisingly old feature of the language, it’s possible in general to add anonymous blocks to class definitions. Everybody knows about


static { foo = new BigFoo(); }

Without the static keyword, the block is treated as an appendage to the constructor code. (I guess they run in declaration order, but I don’t care about that at the moment.) Thus the notation

new T() {{ code; }}

means to construct an anonymous subclass of T with some random code that runs right after the no-args constructor for T. (Of course you can also do that when the constructor does have arguments.)

The dark side of that trick is probably obvious to everybody except me, because the dark side has crept in and wasted a lot of my time recently. The code that you type into that {{ }} block is of course code that’s treated as being in the lexical context of the subclass, not that of your surrounding code. Durr. If the initialization statements refer to method names or variable names that are visible in the base class, then the statements are interpreted as being in those terms. That’s really nice when you’re using this to populate a HashMap


return new HashMap() {{
  for (DungHeap dh : allHeaps()) put(dh.getName(), dh);
}};

However, if the base class you’re subclassing has some protected member variables, watch out! Any final variables in the surrounding code that you reference in the initialization block will be shadowed if they share a name with one of those members. Same goes for utility methods you’d like to call.

This got me just today with a context variable called “headers”. I was initializing a new MimeMessage instance. Guess what’s a protected member variable of a MimeMessage instance?

I still like that notation, for now at least.

Just what is this Javascript object you handed me?

Saturday, December 8th, 2007

Modern Javascript “frameworks” like Prototype, jQuery, and Dojo compete to provide flexible, expressive tools, which is a great thing for Javascript developers. It’s truly astounding how effective these relatively simple and concise tools can be.

One of the challenges faced by toolkit designers stems from the malleability of the Javascript environment – the same malleability that makes the tools possible in the first place. In type-happy (I could have written “stodgy”) languages like Java, the question, “what is this thing?” that code behind an API might ask is always easy to answer; in fact as a Java programmer you don’t really ask the question much unless you’ve done things pretty badly. In Javascript, however, there’s never a good answer. Well, there is – the thing is an Object – but it doesn’t do you much good.

So anyway, coding away the other day I came across an interesting bug, one that turns out to be shared by Prototype, jQuery, and Dojo (and maybe others). It was a case of the toolkit code doing something to check to see if a thing passed through an API was of a particular type so that some work could be done with it. In this case, the toolkit code (Prototype) needed to know whether a value was an Array, because if so then something different had to be done than if the value were a simple scalar. Instead of approaching that situation with a pragmatic, dynamic “duck typing” technique, the code took an approach that a Java programmer would recognize.

Specifically, the fault was in code that transformed a “Hash” object into an HTTP query string. The client code has a property list of parameter names and values, which is to be used in a GET URL or an XMLHttpRequest. The Prototype service has to deal with the fact that a single parameter name may be associated with a list of values. (Recall that an HTML form can have more than one input field with the same name.) The Prototype convention here is that the “toQueryString” routine looks for parameters in the Hash object whose values are arrays:


  // this is 1.5.1 code - it's different in 1.6
  // but the problem remains (for now)
  toQueryString: function(obj) {
    var parts = [];
    parts.add = arguments.callee.addPair;

    this.prototype._each.call(obj, function(pair) {
      if (!pair.key) return;
      var value = pair.value;

      if (value && typeof value == 'object') {
        if (value.constructor == Array) value.each(function(value) {
          parts.add(pair.key, value);
        });
        return;
      }
      parts.add(pair.key, value);
    });

    return parts.join('&');
  },
  // ...

(The “addPair” routine that’s hooked up to the return-value array “parts” just builds a “name=value” string with appropriate URL encoding.) See how the code works? There in the loop code where it handles each key/value pair in the Hash object, it checks each value to see if its type is “object”. (The Javascript “typeof” is pretty lame.) If it’s an object, then a further test is made to see if the value is an array. That test compares the “constructor” attribute of the value object to the global Array function.
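To show what that logic produces, here is a simplified stand-in written against plain objects. It is not the library’s code: the “toQueryString” and “addPair” below are my own sketch, I’m assuming “addPair” amounts to “encodeURIComponent” on both halves, and the parameter names are made up.

```javascript
// Simplified sketch of the logic above, using a plain object instead of a Hash.
function toQueryString(params) {
  var parts = [];
  // Assumed behavior of "addPair": URL-encode the name and the value.
  function addPair(key, value) {
    parts.push(encodeURIComponent(key) + '=' + encodeURIComponent(value));
  }
  for (var key in params) {
    if (!Object.prototype.hasOwnProperty.call(params, key)) continue;
    var value = params[key];
    if (value && typeof value == 'object') {
      // Same test as the quoted code: only "real" arrays get expanded...
      if (value.constructor == Array) {
        for (var i = 0; i < value.length; i++) addPair(key, value[i]);
      }
      // ...and any other kind of object contributes nothing at all.
      continue;
    }
    addPair(key, value);
  }
  return parts.join('&');
}

var qs = toQueryString({ color: ['red', 'blue'], q: 'a b' });
// qs is 'color=red&color=blue&q=a%20b'
```

Note the second comment: an object that fails the constructor test is dropped without a peep, which is what makes the bug so quiet.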

What’s the problem with this? The code is doing something that seems reasonable: if the constructor of the object is the Array function, then it’s an array, right? Sure, this is Javascript, so if the client code has done something stupid like reassign the “constructor” attribute or even reassign the global variable name “Array”, it’ll break, but that would be my fault. Setting that aside, what could possibly go wrong?

First I’ll be philosophical: does that code really need to know for sure that the value was constructed by the Array function? All it really does when it determines that the value is an array is take advantage of the “each” function it expects to find there. I think that a pure (funny word in this case) duck-typing approach would be to check to see if the value object has an attribute called “each” that references a function.
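A duck-typing check along those lines could be as small as this. It’s a sketch: “hasEach” is a name I made up, and the “listLike” object stands in for whatever the client code hands over.

```javascript
// Ask the value what it can do, not who constructed it.
function hasEach(value) {
  return value != null && typeof value.each === 'function';
}

// A made-up object that was never near the Array constructor
// but can iterate over its contents just fine.
var listLike = {
  items: [1, 2],
  each: function(fn) {
    for (var i = 0; i < this.items.length; i++) fn(this.items[i]);
  }
};

var seen = [];
if (hasEach(listLike)) {
  listLike.each(function(v) { seen.push(v); });
}
// seen is [1, 2]
```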

Now on to the practical issue here, the reason that the code is buggy as written. The subtle assumption made in the code is that every array object will have been constructed by the function pointed to by the global variable “Array”. Conversely, it assumes that any object not constructed by the function pointed to by the global variable “Array” must not be a real array. OK then, outside of shenanigans that break things on purpose (reassigning “Array”), how is that assumption problematic?

Well, I’ll tell you. A web application can be spread across multiple separate windows. The window objects are linked together into a single DOM graph. Code can “see” up and down the graph into other windows, and functions can be called across window boundaries. In particular, an array can be passed from code on one page into a function on another. The kicker is that every page has its own global context (which means that the term “global” is sort of questionable), and that includes a “global” variable called “Array.” Oops.
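The effect is easy to reproduce outside a browser. This sketch assumes Node.js and uses its “vm” module to create a second global context, standing in for a second window. The last two checks are alternatives that don’t care which context built the array; “Array.isArray” is a later addition to the language and isn’t something the libraries discussed here could rely on.

```javascript
var vm = require('vm');

// An array created in a separate global context (think: another window).
var foreign = vm.runInNewContext('[1, 2, 3]');

// It behaves like an array in every way that matters...
var length = foreign.length;                        // 3

// ...but it was built by the other context's Array function.
var byConstructor = foreign.constructor == Array;   // false
var byInstanceof = foreign instanceof Array;        // false

// Checks that don't depend on which "Array" did the constructing:
var byToString =
  Object.prototype.toString.call(foreign) === '[object Array]';  // true
var byIsArray = Array.isArray(foreign);             // true
```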

Javascript is not Java. You can’t really put much stock in anything your code might find out about a mysterious parameter, so why compound the problem by asking indirect questions? Just because an object was not apparently constructed by the function you think has to be the one and only “right” one doesn’t mean it won’t work just fine.

This page is a simple example of the problem. Prototype will be fixed soon thanks to the attentions of world-famous Prototype contributor “savetheclocktower.” Dojo and jQuery have similar issues but I haven’t reported them yet. Specifically, jQuery uses the “foo.constructor == Array” deal all over the place. Dojo seems to use “instanceof Array”, which suffers from exactly the same problem. I don’t think that expecting the libraries to work properly on values passed across page boundaries is unreasonable.