HTML Escaping in JavaScript

The Prototype library augments the native “String” class with an “escapeHTML” method. It’s handy for scrubbing user-supplied text before it goes onto the page:

  $('someSpan').update(randomString.escapeHTML()); 

I found recently that the only characters the function “escapes” are ‘&’, ‘<’, and ‘>’. I found that out the hard way, having assumed it’d also escape single- and double-quote characters. (Why did I assume that? I’m dumb maybe, but I figure it’s nice for building HTML attribute values. Whatever.)
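To make the attribute problem concrete, here is a small sketch. The function name “escapeBasic” and the sample input are made up for illustration; the function only mimics the three-character behavior described above and does not call Prototype itself.

```javascript
// Hypothetical stand-in for an escaper that handles only &, < and >.
function escapeBasic(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Made-up user input containing a double quote.
var userValue = 'x" onclick="alert(1)';

// The quote passes through untouched, so it closes the attribute early
// and the rest of the input becomes a second attribute.
var html = '<input value="' + escapeBasic(userValue) + '">';
console.log(html); // <input value="x" onclick="alert(1)">
```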

The implementation of “escapeHTML” (on Firefox, at least, and maybe Opera) is pretty odd, and clearly something done for performance. What the library does is create a div element with a text node in it, and keep that around. A call to the “escapeHTML” function is handled by setting the “data” property of the text node to the string being escaped. The return value is then the “innerHTML” property of the div. Thus the function leverages the internals of “innerHTML” and has the translation done in the (presumably) fast compiled innards of the browser. Here’s a rendition of what it looks like; it’s not exactly like this but this is (I think) equivalent:


  escapeHTML: (function() {
    var rv = function(s) {
      var self = arguments.callee;
      self.text.data = s;
      return self.div.innerHTML;
    };
    rv.div = document.createElement('div');
    rv.text = document.createTextNode('');
    rv.div.appendChild(rv.text);
    return rv;
  })()

The simple-minded version used for WebKit and IE looks like this:


  escapeHTML: function(s) {
    return s
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;');
  }

Whether or not I’m right in thinking that the library’s “escapeHTML” function should deal with quotes, I need something that does. Clearly I can’t leverage the “innerHTML” trick, because the browser apparently disagrees with me about quotes. I could clearly just use an extended version of the repeated-“replace” version. There’s another way, however:


  escapeHTML: (function() {
    var MAP = {
      '&': '&amp;',
      '<': '&lt;',
      '>': '&gt;',
      '"': '&#34;',
      "'": '&#39;'
    };
    var repl = function(c) { return MAP[c]; };
    return function(s) {
      return s.replace(/[&<>'"]/g, repl);
    };
  })()

The “replace” function (native to JavaScript “String” objects) can take a function as its second argument. The function is called once per match and is passed the matched text. Thus the version above matches on any of the characters I want to escape, and then translates them to the HTML entities via a little map object. It seems like this would be better in the case of “clean” strings, because the pattern would match nothing and so the string would be returned unchanged.
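A quick way to see what “replace” hands to the callback is to record the calls. This is only an illustrative sketch: the names “ENTITIES”, “lookup” and “calls” are mine, and the sample strings are made up.

```javascript
// Entity map, same five characters as the version above.
var ENTITIES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&#34;', "'": '&#39;' };

// Record every call. With no capture groups in the pattern, replace()
// passes the matched text first and the match offset second.
var calls = [];
function lookup(match, offset) {
  calls.push([match, offset]);
  return ENTITIES[match];
}

var dirty = 'a<b & "c"'.replace(/[&<>'"]/g, lookup);
var clean = 'nothing to escape'.replace(/[&<>'"]/g, lookup);

console.log(dirty);                         // a&lt;b &amp; &#34;c&#34;
console.log(calls.length);                  // 4 (the clean string added none)
console.log(clean === 'nothing to escape'); // true
```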

In order to see what the performance realities were, I started hacking up a little test page. After a while, I got tired of editing and re-editing to see how different tinkerings with the different function versions would affect the timing. What I wanted was a setup to make a little table: different test datasets (short strings, long strings, clean strings, dirty strings) across the top, and the different “escapeHTML” versions at the left. The cells would be the average time of a pass over each string in each test list. Because I’m silly I wanted the table to be filled in as the tests ran, so the numbers would show up while I stared at the screen.

To do the little incremental animation thing made my head hurt for a while. I took a break to cook some muffins (orange cranberry – here’s a tip: when making orange muffins or any sort of orange quick bread, use frozen concentrated orange juice instead of plain juice, and a teeny bit of orange oil) and while I was filling up the muffin cups I had an idea. I think it’s something that people who’ve been using functional languages would probably think of as being perfectly obvious.

Prototype has an “inject” routine, which is like “foldl” in … well, in something; Scheme maybe? I know that in the Erlang “lists” module it’s “lists:foldl”. The function is a method available on any Enumerable object, like an array. It takes two arguments (well, three, but ignore that for now), the first being an initial value and the second being an iterator function. The “inject” mechanism passes the initial value and the first element of the collection to the function. The function does something, and returns a value. That’s passed on in the next iteration, with the second element of the list, and so on. The final return value of “inject” is whatever the last invocation of the iterator returns.
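Here is a rough sketch of the folding behavior just described. It does not use Prototype: the “inject” helper below is my own stand-in built on the native “reduce” method of arrays, which takes its arguments in the opposite order (callback first, initial value second).

```javascript
// Stand-in for Prototype's inject: same left fold, written with the
// native Array.prototype.reduce.
function inject(list, initial, iterator) {
  return list.reduce(function(acc, item, index) {
    return iterator(acc, item, index);
  }, initial);
}

// Arithmetic reduction: 0+1=1, 1+2=3, 3+3=6, 6+4=10.
var sum = inject([1, 2, 3, 4], 0, function(acc, n) { return acc + n; });

// The nesting shows the order: each result feeds the next call.
var trace = inject(['a', 'b', 'c'], 'init', function(acc, s) {
  return '(' + acc + ' ' + s + ')';
});

console.log(sum);   // 10
console.log(trace); // (((init a) b) c)
```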

Common pedagogical examples of “inject” do stuff like compute an arithmetic reduction over a list of numbers, or build up an array from a function that interprets values in the source array. In my case, what I wanted was a function, one that I could call to start the sequence of test runs. In other words, I needed to have something that would iterate across the cells of my table, running the “escapeHTML” versions on each list of test strings.

My muffin epiphany was that I could use nested “inject” loops backwards through the table to build up a function that would run the test for its cell, and then invoke the function from the previous cell. (That’s why I’d go backwards through the array: the last function to run would have to be the first function I built.) Thus the “inject” process would build up a chain of functions wrapped around each other like Russian nesting dolls.
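Stripped of the table and the timing, the nesting-doll idea can be sketched like this. It is a simplification under my own assumptions: a single flat list instead of rows and columns, native “reduce” standing in for “inject”, made-up step names, and direct calls where the real thing would wait on a timeout.

```javascript
// Hypothetical step names standing in for table cells.
var steps = ['cell 0', 'cell 1', 'cell 2'];
var log = [];

// Walk the list backwards, wrapping each step around the function built so
// far. The last step is built first and has nothing to call afterwards.
var runAll = steps.slice().reverse().reduce(function(next, name) {
  return function() {
    log.push(name);    // stand-in for "run the test for this cell"
    if (next) next();  // then hand off to the following cell
  };
}, null);

runAll();
console.log(log.join(', ')); // cell 0, cell 1, cell 2
```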

That ended up as a somewhat general-purpose facility that will take a list of functions (with names) and a list of test datasets (with names), and create a table in a selected container, and then fill in the table test by test:


function timings(container, datasets, functions, repeats) {
  // first make the empty table, with headings and left-side labels
  $(container).update(
    "<table><tr><th></th>#{headings}</tr>#{rows}</table>".interpolate({
      headings: datasets.collect(function(sl) {
        return "<th>#{name}</th>".interpolate(sl);
      }).join(''),
      rows: functions.collect(function(fp, rowNum) {
        return "<tr><td>#{name}</td>#{timings}</tr>".interpolate({
          name: fp.name,
          timings: datasets.collect(function(x, colNum) {
            return "<td id='#{r}_#{c}'></td>".interpolate({r: rowNum, c: colNum});
           }).join('')
        });
      }).join('')
    })
  );
  // now make the nested series of functions, and call the one that pops
  // out the top
  functions.reverse(false).inject(null, function(ff, fl, rowNum) {
    return datasets.reverse(false).inject(ff, function(ff, sl, colNum) {
      var tdid = "#{r}_#{c}".interpolate({
        r: functions.length - rowNum - 1,
        c: datasets.length - colNum - 1
      });
      return function() {
        $(tdid).addClassName('pending');
        setTimeout(function() {
          // average the time for some runs over this test dataset
          var avgTime = $R(1, (repeats || 10)).inject(0, function(avg, n) {
            var start = new Date().getTime();
            sl.list.each(fl.fn);
            return (avg * (n - 1) + (new Date().getTime() - start)) / n;
          });
          // toFixed(2) rounds to two decimals and keeps the leading zero,
          // so averages under 0.10 ms still display correctly
          $(tdid).update(
            avgTime.toFixed(2)
          ).removeClassName('pending');
          // call the next function in the chain - this is the one passed
          // in as the first argument by the "inject" mechanism
          ff && ff();
        }, 250);
      };
    });
  })();
}
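The averaging inside the innermost “inject” above is an incremental mean: each pass folds the newest sample into the average so far. A minimal sketch of just that rule, using native “reduce” and made-up sample values (the name “runningAverage” is mine):

```javascript
// Same update rule as the inject callback above:
// newAvg = (oldAvg * (n - 1) + sample) / n, with n counting from 1.
function runningAverage(samples) {
  return samples.reduce(function(avg, sample, i) {
    var n = i + 1;
    return (avg * (n - 1) + sample) / n;
  }, 0);
}

// Made-up timings in milliseconds: 4, then (4 + 8) / 2 = 6,
// then (6 * 2 + 6) / 3 = 6.
console.log(runningAverage([4, 8, 6])); // 6
```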

Note that the per-cell functions operate via timeout. That’s so that the browser has a chance to catch its breath and update the display between test runs. (The 250ms delay was just a random guess, but delays like 10ms were too short.) You can see the results of my fooling around here. What I found was kind of interesting.

First, the “weird” way that Prototype does the escaping (via the “innerHTML” trick) is pretty consistent in performance. It’s slower than the “fun” approach (the single regex with a replace function) when strings are clean, I guess because the browser regex code is pretty well optimized, and the pattern is simple. All of the regex routines get slow on larger strings that require scrubbing. The “hybrid” code on that test page is the “innerHTML” trick with a pre-check via regex.
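I can only guess at the shape of the “hybrid” here, but the pre-check idea is roughly the following. The names “makeHybrid” and “multiReplace” are hypothetical, and the slow path is a plain multi-replace so the sketch runs without a DOM; on the test page that role is played by the “innerHTML” trick.

```javascript
// Build a "hybrid" escaper: a cheap regex test first, and the expensive
// escaper (passed in as slowEscape) only when something needs escaping.
function makeHybrid(slowEscape) {
  var NEEDS_ESCAPE = /[&<>]/;  // no g flag, so test() keeps no state
  return function(s) {
    return NEEDS_ESCAPE.test(s) ? slowEscape(s) : s;
  };
}

// Stand-in slow path that counts how often it is actually used.
var slowCalls = 0;
function multiReplace(s) {
  slowCalls++;
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

var hybrid = makeHybrid(multiReplace);
console.log(hybrid('plain text')); // plain text
console.log(hybrid('1 < 2'));      // 1 &lt; 2
console.log(slowCalls);            // 1
```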

The “Weird 2” function, which is always slower than “Weird”, is the Prototype “innerHTML” version implemented as in the actual library. Specifically, the div and text node are attached as attributes to the function object itself, and then accessed via “arguments.callee”. In my version (“Weird”), I just used a closure for the elements. Why would the closure be faster?

Oh, and the “multi” ones are the multiple-replace routines.
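For reference, a multiple-replace routine extended to cover quotes might look something like the sketch below. The name “escapeMulti” is made up, and I’m not claiming this is exactly what the test page runs.

```javascript
// Repeated-replace escaper extended to quotes. The ampersand must go
// first, or the '&' in the entities produced by the later steps would
// itself be escaped a second time.
function escapeMulti(s) {
  return s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&#34;')
    .replace(/'/g, '&#39;');
}

console.log(escapeMulti('<a href="x">Tom & Jerry\'s</a>'));
// &lt;a href=&#34;x&#34;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;
```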

For my purposes, I’m pretty sure most of the strings that my pages escape are clean. Thus I think I’ll introduce the single-regex version somehow in my world.

Finally, while typing this in, I needed a place to escape the code samples. I think I wrote a C program to do that once a long time ago, but I couldn’t find it. Thus I typed in this page to do it. Maybe people with fancy IDEs have built-in tools for that.
