Misusing HTML Entities

Author: skirtle First posted: 28-May-2020 Last updated: 28-May-2020

JavaScript JSON Encoding Escaping Unicode

How do you write a non-breaking space in JavaScript or JSON?

Lose 10 points if you said &nbsp;.

The Problem

Let's imagine we're working in an HTML templating language and we have a template that looks something like this:

    <button>{{ text }}</button>

Currently the text is set to 'Save changes' but it keeps wrapping on the space, leaving 'Save' and 'changes' on separate lines.

This problem should be solved using CSS but, for the sake of discussion, let's try to replace the space in the text with a non-breaking space.

You might be tempted to try something like this:

    text = 'Save&nbsp;changes'

Assuming the template handles escaping properly we'll most likely end up with a button that reads Save&nbsp;changes.

Oh dear. The &nbsp; isn't being converted into a non-breaking space, we're just getting it output literally as text.

At this point you might reach for whatever mechanism the templating language provides for inserting HTML content instead. Taking Vue.js as an example, you could do something like this:

    <button v-html="text"></button>

If the template is part of a third-party component then changing it like this won't be practical but, even if you can change it, this is still completely the wrong way to go about inserting a non-breaking space.

To be clear, there's nothing wrong with using &nbsp; in HTML. If the &nbsp; appeared directly in the template it'd be fine. That isn't what we have here. This is a JavaScript string that represents plain text and there shouldn't be any HTML within it.

Understanding &nbsp;

The sequence &nbsp; isn't some magic incantation for inserting a non-breaking space. It's just an HTML entity, equivalent to &#160; or &#xA0;. These are all ways to tell the HTML parser that you want Unicode character 160, usually written in the form U+00A0.

Importantly, it is the HTML parser that interprets the entity. Until it reaches that parser we don't have a non-breaking space, we have the 6 separate characters &, n, b, s, p and ;.

We can see this by checking the length of the string in JavaScript:

    '&nbsp;'.length // => 6

There's no need to represent it this way. A JavaScript string can contain a non-breaking space as a single character. However, trying to include that character directly in the source code poses 3 problems:

1. How do you type a character that isn't on a standard keyboard?
2. How will anyone maintaining the code be able to distinguish visually between a non-breaking space and a normal space?
3. Files contain bytes, not characters, and the non-breaking space character is outside the ASCII range. We're going to have to commit to a specific character encoding (e.g. UTF-8) and then hope we can convince all the relevant tooling to use that encoding.

In practice, we can dodge all of that by writing it using an escape sequence instead:

    '\u00a0'.length // => 1

Even though the escape sequence involves 6 characters, the resulting string only contains the single non-breaking space character. It is important to appreciate that the escaping used here is part of the string literal syntax used to create the string and is not actually a feature of the resulting string. It's the JavaScript parser that evaluates this escape sequence and it doesn't matter whether the string subsequently goes through an HTML parser.

Just to reinforce that point, we can use the same technique to create strings containing other, less exotic characters. Consider the capital letter A. That's Unicode character U+0041. Obviously you'd normally just write that as 'A' but it can also be written as '\u0041'. The resulting strings are identical.
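
One way to convince yourself that the escape is resolved by the JavaScript parser is to compare a few different spellings of the same character. This is only an illustrative sketch and the variable names are made up for the example:

```javascript
// Three ways to produce the same single-character string.
const fromUnicodeEscape = '\u00a0'
const fromHexEscape = '\xa0'
const fromCharCode = String.fromCharCode(160)

console.log(fromUnicodeEscape === fromHexEscape) // true
console.log(fromUnicodeEscape === fromCharCode) // true
console.log(fromUnicodeEscape.charCodeAt(0).toString(16)) // a0

// The same holds for ordinary characters.
console.log('\u0041' === 'A') // true
```

None of these strings remembers how it was written. Once the literal has been evaluated, only the character remains.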

If we use text = 'Save\u00a0changes' in our earlier example then everything will work fine. It doesn't matter whether the templating language applies HTML encoding to the text or not, either way we'll end up with the correct character being used.
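
To see why the escaping step makes no difference, here's a rough sketch of the kind of HTML escaping a templating language might apply to interpolated text. The escapeHtml function is a hypothetical stand-in written for this example, not Vue's actual implementation:

```javascript
// Hypothetical stand-in for a template engine's text escaping.
function escapeHtml (str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// The entity's ampersand is escaped, so the browser ends up
// displaying the six characters of the entity literally.
console.log(escapeHtml('Save&nbsp;changes')) // Save&amp;nbsp;changes

// The real character contains nothing that needs escaping,
// so it passes through untouched.
const escaped = escapeHtml('Save\u00a0changes')
console.log(escaped === 'Save\u00a0changes') // true
```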

Furthermore, because we're using the actual character, anything else that encounters that string will be able to understand it correctly too. For example:

- A length check will give the correct length.
- Searching and filtering won't need to worry about matching HTML entities.
- While an HTML entity can be chopped in two during truncation, that can't happen using the actual character.
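
Each of those points can be checked with a few lines of JavaScript. The variable names below are just for illustration:

```javascript
const withEntity = 'Save&nbsp;changes'
const withCharacter = 'Save\u00a0changes'

// Length: the entity counts as 6 characters rather than 1.
console.log(withEntity.length) // 17
console.log(withCharacter.length) // 12

// Searching: 'sp' wrongly matches inside the entity name.
console.log(withEntity.includes('sp')) // true
console.log(withCharacter.includes('sp')) // false

// Truncation: cutting at 8 characters splits the entity in two.
console.log(withEntity.slice(0, 8)) // Save&nbs
console.log(withCharacter.slice(0, 8) === 'Save\u00a0cha') // true
```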

An Example in Vue

Let's suppose we want to write a formatting function that automatically changes normal spaces to non-breaking spaces. In Vue we might include it in the template like this:

    <button>{{ spaceToNbsp(text) }}</button>

Other templating languages will typically have an equivalent syntax.

As you might expect, trying to implement spaceToNbsp like this won't work:

    spaceToNbsp (str) {
      return str.replace(/ /g, '&nbsp;')
    }

As with the earlier example, it'll end up with the &nbsp; being treated literally.

To get it working it'd need to be this:

    spaceToNbsp (str) {
      return str.replace(/ /g, '\u00a0')
    }
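
The working version is easy to try outside of Vue. This sketch pulls it out into a standalone function purely so that it can be run on its own:

```javascript
// Standalone copy of the formatter, outside of any component.
function spaceToNbsp (str) {
  return str.replace(/ /g, '\u00a0')
}

const result = spaceToNbsp('Save all changes')

console.log(result.length) // 16, the same as the input
console.log(result.includes(' ')) // false, no normal spaces remain
console.log(result === 'Save\u00a0all\u00a0changes') // true
```

In a Vue component the function would typically live in the component's methods so that the template can call it.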

Other Characters

Non-breaking spaces aren't the only characters to be unnecessarily encoded as HTML entities. If you're working with JavaScript strings and find yourself tempted to include any HTML entity you should consider using the actual character instead. For example, with accented characters such as é:

    text = 'caf&eacute;'

Depending on your keyboard layout it may be tricky to type é directly and, as before, keeping everything as ASCII can help to avoid problems with character encodings. But &eacute; has all the same problems as &nbsp;. In this case it's Unicode character U+00E9, so we can write it as:

    text = 'caf\u00e9'
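
As with the non-breaking space, comparing the two strings shows the difference. The variable names are only for illustration:

```javascript
const withEntity = 'caf&eacute;'
const withEscape = 'caf\u00e9'

// The entity is 8 separate characters, the escape yields just 1.
console.log(withEntity.length) // 11
console.log(withEscape.length) // 4

// The final character really is U+00E9.
console.log(withEscape.charCodeAt(3).toString(16)) // e9
```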

Granted, a named entity is easier to understand but it's a moot point because it creates the wrong string.

JSON

Much the same applies to JSON. My heart always sinks when I see JSON data like this:

    [
      {
        "name": "Th&eacute;r&egrave;se"
      },
      ...
    ]

Why are there HTML entities lurking in the data? Most likely this can be traced back to a character encoding issue that was bodged into submission using HTML entities rather than fixing it properly. Somewhere, something needed setting to UTF-8 but instead we get this travesty.

Improperly encoded data like this makes implementing client-side searching really tricky. Some serious hoop-jumping is going to be required to ensure searches for cut or rave don't match and searches for Thérèse do. Searching is difficult enough as it is with Unicode equivalence, case-sensitivity, accents and locales to consider without having to handle nonsensical HTML entities.

JSON supports the same \uXXXX escaping notation as JavaScript:

    [
      {
        "name": "Th\u00e9r\u00e8se"
      },
      ...
    ]
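
Here's a small sketch of how the two forms behave once parsed. String.raw is used so that the backslashes reach JSON.parse intact instead of being consumed by the JavaScript parser, and the variable names are made up for the example:

```javascript
// The raw text contains backslash-u sequences, just like a JSON file would.
const json = String.raw`[{ "name": "Th\u00e9r\u00e8se" }]`
const people = JSON.parse(json)

// The JSON parser has turned each escape into a single character.
console.log(people[0].name.length) // 7
console.log(people[0].name === 'Th\u00e9r\u00e8se') // true

// With entities in the data, a naive search matches text that is
// hidden inside the entity names.
const badPeople = JSON.parse('[{ "name": "Th&eacute;r&egrave;se" }]')
console.log(badPeople[0].name.includes('cut')) // true
console.log(people[0].name.includes('cut')) // false
```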

In practice you're probably using a standard JSON library and it may not support escaping those characters. It shouldn't matter. JSON is expected to be transferred in UTF-8 so you just need to get all your streams, pipes and channels configured correctly and all will be well.
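
JavaScript's own JSON.stringify is an example of this: it leaves non-ASCII characters as they are. If ASCII-only output is ever genuinely needed, one possible approach is to escape the characters after serialising. The toAsciiJson helper below is a hypothetical sketch, not a standard function:

```javascript
const data = [{ name: 'Th\u00e9r\u00e8se' }]

// JSON.stringify outputs the accented characters directly.
console.log(JSON.stringify(data).includes('\\u')) // false

// Hypothetical helper: escape every UTF-16 code unit outside the
// ASCII range. In JSON.stringify output such characters can only
// appear inside strings, so replacing them globally is safe.
function toAsciiJson (value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0')
  )
}

const asciiJson = toAsciiJson(data)
console.log(asciiJson) // [{"name":"Th\u00e9r\u00e8se"}]

// Parsing it gives back exactly the same string.
console.log(JSON.parse(asciiJson)[0].name === data[0].name) // true
```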