JSON and Lisp-Like Queries

How can we express data queries using JSON? There are plenty of options, but by borrowing a few ideas from Lisp we can create a simple, powerful syntax that's also convenient to use.

A single query term would be written as a JSON array, starting with the operator:

    ["=", "name", "John"]
  

More complex expressions can then be formed using the "AND" and "OR" operators:

    ["AND",
      ["=", "name", "John"],
      [">=", "age", 21]
    ]
  

There's an interactive example at the end.

I'm going to explain the motivation behind it and give some examples of how specific queries might be written in this format. This isn't a formal specification, nor have I included a detailed analysis of the pros and cons of the approach; it's just an introduction to the ideas involved.

Background

JSON has become almost ubiquitous as the response format of choice for AJAX requests. Increasingly it is also used as the body of a POST request, providing a more structured way to pass the details of the request than the traditional flat set of x-www-form-urlencoded parameters.

For an application using JSON in both directions it's only a matter of time before a query or filtering expression is going to need encoding as JSON. You could just bundle it up into a string, such as "(name = 'John') AND (age >= 21)", but then you'd need to write your own parser. A more natural way would be to use JSON's data structures to encode the AST (abstract syntax tree) of the query.

It's not uncommon to see implementations like this:

    {
      "operator": "AND",

      "values": [
        {
          "field": "name",
          "operator": "=",
          "value": "John"
        }, {
          "field": "age",
          "operator": ">=",
          "value": 21
        }
      ]
    }
  

Lots of boilerplate for not much query.

MongoDB uses a JSON-based query syntax and the equivalent query would be:

    {
      "name": "John",
      "age": {"$gte": 21}
    }
  

That doesn't look too bad; it's much more concise than the previous attempt. The operators "$eq" and "$and" are both implicit in such a simple query, and written out in full it would look like this:

    {
      "$and": [
        {"name": {"$eq": "John"}},
        {"age": {"$gte": 21}}
      ]
    }
  

The Mongo syntax shares some similarities with JsonLogic, a project which describes a way to store logical expressions in JSON. In JsonLogic, object keys are used to specify the operator and the arguments are written as an array, i.e. {"operator": [arg1, arg2, ...]}. Fields in the data are accessed using the "var" operator and the array wrapper can be omitted if there's only a single argument. Thus our example query could be written as:

    {"and": [
      {"==": [{"var": "name"}, "John"]},
      {">=": [{"var": "age"}, 21]}
    ]}
  

While the syntax for accessing fields is somewhat long-winded, it's worth noting that it does provide a simple way to compare fields within the data. For example, if we're searching a music database looking for eponymous albums we could use something like this:

    {"==": [
      {"var": "title"},
      {"var": "artist"}
    ]}
  

JsonLogic supports various operators, including basic arithmetic:

    {"+": [1, 2]} // => 3, in case you were wondering
  

As "+" is associative it can be extended to more than 2 arguments:

    {"+": [1, 2, 3]} // => 6
  

Having the operator at the start of an expression is known as Polish prefix notation and it's a key feature of Lisp.

A Bit of a Lisp

Most programming languages feature both infix operators and function invocations in some form, allowing for calculations to be performed in one of two ways:

    1 + 2     // Infix operator
    sum(1, 2) // Function invocation
  

An immediate benefit of the function call is that it can take advantage of the associativity of addition to support more arguments, sum(1, 2, 3), whereas with the infix operator you would need to repeat yourself.

Lisp forgoes infix operators altogether and instead uses lists to represent expressions. An expression list is wrapped in parentheses and begins with the operator, so the simple sum above would be written (+ 1 2 3). Unlike most programming languages, there is no need for a concept of operator precedence because the brackets are always included.

While writing (+ 1 2 3) wouldn't be valid JSON, we can transmogrify it a bit to get ["+", 1, 2, 3]. It's not quite as clean as the original Lisp but it's still not bad. All expressions have the same format, ["operator", ...arguments...]. By introducing suitable operators such as "=", ">=" and "AND" we can create the expression we saw at the beginning:

    ["AND",
      ["=", "name", "John"],
      [">=", "age", 21]
    ]
  

My own experience of working with this format is that it's really convenient for building queries in JavaScript. The standard array methods push and concat are usually sufficient, but for composing more complex queries some simple utility functions or even a dedicated query builder class can prove helpful.
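As an illustration, here's a minimal sketch of that kind of composition; the and/eq helpers at the end are hypothetical, not from any particular library:

```javascript
// Building a query incrementally with standard array methods.

// Start with an "AND" containing one clause
const query = ["AND", ["=", "name", "John"]];

// push appends a single clause in place...
query.push([">=", "age", 21]);

// ...while concat merges in clauses built elsewhere
const locationClauses = [["=", "city", "London"], ["=", "country", "UK"]];
const fullQuery = query.concat(locationClauses);

// Tiny (hypothetical) helpers can make call sites read more naturally
const and = (...clauses) => ["AND", ...clauses];
const eq = (field, value) => ["=", field, value];
const builtQuery = and(eq("name", "John"), [">=", "age", 21]);
```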

One thing to note is that the format doesn't use JSON objects at all, only arrays. That leaves plenty of scope for extending the syntax to use objects should the need arise.

Is anyone else using this? Possibly many people but, as a concrete example, PuppetDB's AST query language uses the same central idea.

What follows are some of my own ideas for how to take it further. They might not all be appropriate in your circumstances but if you are keen to try this query format you might find some of it useful.

Variadic Operators

A key benefit of a Lisp-like syntax is that associative operators can support a variable number of arguments, such as in our earlier example of ["+", 1, 2, 3].

While supporting more than 2 arguments may be obvious, we can also go the other way:

    ["+", 15] // => 15
  

We can take the pattern even further, removing all the arguments:

    ["+"] // => 0
  

This may seem unnecessary, and perhaps it is, but when complex queries are built up in many stages it can remove some of the burden if the degenerate cases are handled in the most logically consistent way.

A similar but slightly more useful example comes when working with "AND" and "OR". Again these are associative so there's no difficulty extending them to more arguments:

    ["AND",
      ["=", "name", "John"],
      [">=", "age", 21],
      ["=", "city", "London"],
      ... // etc.
    ]
  

The operator "AND" returns true unless one of the arguments evaluates to false, so in the edge case we get:

    ["AND"] // => true
  

The operator "OR" is the opposite: it returns false unless one of the arguments evaluates to true:

    ["OR"] // => false
  

Even if you can't see any point in supporting empty argument lists, it may just fall out naturally when you come to implement your operators. At the very least you should try to support more than 2 arguments for associative operators or you'll have lost out on one of the most useful features of the syntax.
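To see why the empty cases can fall out naturally: if each associative operator is implemented as a fold over its arguments, seeded with that operator's identity value, then zero arguments simply yields the identity. A minimal evaluator sketch (literal arguments only, no field access):

```javascript
// Each associative operator is a fold over its (already literal)
// arguments, seeded with the operator's identity value. With no
// arguments the fold simply returns the identity.

function evaluate(expr) {
  if (!Array.isArray(expr)) return expr; // literal value
  const [op, ...args] = expr;
  const values = args.map(evaluate);
  switch (op) {
    case "AND": return values.reduce((acc, v) => acc && v, true);
    case "OR":  return values.reduce((acc, v) => acc || v, false);
    case "+":   return values.reduce((acc, v) => acc + v, 0);
    default: throw new Error("Unknown operator: " + op);
  }
}

evaluate(["AND"]);        // => true
evaluate(["OR"]);         // => false
evaluate(["+", 1, 2, 3]); // => 6
```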

Accessing Fields

Let's consider a simple field match:

    ["=", "name", "John"]
  

The two arguments are not symmetrical. The first, "name", is treated as a field name whereas the second, "John", is treated as the value of the field. While this is useful as a shortcut for the most common cases it doesn't allow us to handle the title = artist example we saw earlier for comparing two fields with each other.

We can solve this problem in a similar fashion to JsonLogic's "var" operator. Here I've chosen to call the operator "GET" but you could call it whatever you want:

    ["=", "title", ["GET", "artist"]]
  

As before the first argument is implicitly a "GET" but we can expand that to make everything explicit:

    ["=", ["GET", "title"], ["GET", "artist"]]
  

Taking field access a little further we could consider how we'd work with nested data. In the following example the sample data is written in JSON but there's nothing about the query syntax that limits it to searching collections of JSON documents. The data could just as easily be in a SQL database and in that context nesting might translate into following a foreign key.

    {
      "id": 8585,

      "user": {
        "name": "John",
        "age": 39
      },

      ...
    }
  

A simple way to handle it might be to use a dot to create a path:

    ["=", "user.name", "John"]
  

Maybe that's good enough for what you need, but it would cause problems if the field names could contain dots. A safer way would be to use multiple arguments with "GET":

    ["=", ["GET", "user", "name"], "John"]
  

The two approaches can be combined, allowing dots as separators in the simple string version but treating them as literal dots in the "GET" version.
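A sketch of the combined approach, where dots split a plain string into a path but segments given to "GET" are taken literally (the function names here are illustrative):

```javascript
// Walk a list of path segments into nested data, stopping safely
// if an intermediate value is missing.
function getPath(record, segments) {
  return segments.reduce(
    (value, key) => (value == null ? undefined : value[key]),
    record
  );
}

// Resolve a field reference: explicit "GET" segments are used as-is,
// so literal dots in field names stay safe; a plain string is treated
// as dot-separated shorthand.
function resolveField(record, field) {
  if (Array.isArray(field) && field[0] === "GET") {
    return getPath(record, field.slice(1));
  }
  return getPath(record, field.split("."));
}

const doc = { id: 8585, user: { name: "John", age: 39 } };
resolveField(doc, "user.name");             // => "John"
resolveField(doc, ["GET", "user", "name"]); // => "John"
```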

Building Strings

In the last section we started to unpick the asymmetry in the arguments of this query:

    ["=", "name", "John"]
  

In the same way that the "GET" operator allowed us to expand the first argument into a nested expression, we can introduce a "STR" operator (inspired by Clojure) to do the same thing for the second argument:

    ["=", "name", ["STR", "John"]]
  

While not terribly useful in this case, this makes it possible to replace the implicit field access of the first argument with a literal value for comparison:

    ["=", ["STR", "John"], ["STR", "John"]] // => true, "John" = "John"
  

When written out as full expressions the order of the arguments no longer matters. We could write the earlier query either way around:

    ["=", ["GET", "name"], ["STR", "John"]]
  
    ["=", ["STR", "John"], ["GET", "name"]]
  

How much practical value the STR operator has can depend on how other operators are implemented. It can also be used to perform type conversion and concatenation:

    // => "John is 39"
    ["STR",
      ["GET", "name"],
      " is ",
      ["GET", "age"]
    ]
  

The primary use case for building strings like this is for manipulating the response rather than as a query expression. We'll see more about that later.
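As a sketch of how "STR" might be implemented, assuming a "GET" that reads fields from the current record: each argument is evaluated, converted to a string and the results joined, which covers both quoting a single value and concatenation:

```javascript
// A minimal evaluator sketch supporting only "GET" and "STR".

function evaluate(expr, record) {
  if (!Array.isArray(expr)) return expr; // literal value
  const [op, ...args] = expr;
  switch (op) {
    case "GET":
      return record[args[0]];
    case "STR":
      // Evaluate, stringify and concatenate each argument
      return args.map(arg => String(evaluate(arg, record))).join("");
    default:
      throw new Error("Unknown operator: " + op);
  }
}

const person = { name: "John", age: 39 };
evaluate(["STR", ["GET", "name"], " is ", ["GET", "age"]], person);
// => "John is 39"
```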

Arrays

Ideally all arrays in the query should be treated as expressions. If an operator requires an array as an argument it might be tempting to make a special case and to allow something like this:

    ["OPERATOR", "field", [1, 2, 3, 4]]
  

If the final argument is always an array then it's trivial to drop the brackets and pass each item as a separate argument:

    ["OPERATOR", "field", 1, 2, 3, 4]
  

In cases where that isn't possible the simplest solution is to insert a suitable operator:

    ["OPERATOR", "field", ["ARRAY", 1, 2, 3, 4]]
  

Here the "ARRAY" operator is used to create the array [1, 2, 3, 4]. The equivalent Lisp operator is called quote but that doesn't translate well here. A name such as "LIST" would be a sensible alternative to "ARRAY".
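A sketch of how "ARRAY" might be evaluated: each argument is itself evaluated as an expression, and the results are collected into a plain array rather than being interpreted as an operator call:

```javascript
// A minimal evaluator sketch supporting "ARRAY" and "+".

function evaluate(expr) {
  if (!Array.isArray(expr)) return expr; // literal value
  const [op, ...args] = expr;
  switch (op) {
    case "ARRAY":
      // Produce a literal array from the evaluated arguments
      return args.map(arg => evaluate(arg));
    case "+":
      return args.map(arg => evaluate(arg)).reduce((a, b) => a + b, 0);
    default:
      throw new Error("Unknown operator: " + op);
  }
}

evaluate(["ARRAY", 1, 2, ["+", 1, 3]]); // => [1, 2, 4]
```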

Response Fields

So far all the examples have been written from the perspective of writing predicates to filter a data set down to a list of matching results. How might it affect the way in which those results are returned? We'll assume a JSON response format.

If you just want to return the results in their entirety then that's fine; it might be all you need. But it wouldn't be unusual to want to specify which fields come back:

    {
      "fields": ["id", "name", "age"],
      "query": [...],

      ... // other stuff, e.g. paging
    }
  

Here the "query" is our Lisp-like query expression for determining which rows/documents to return, and the "fields" then determine which columns/fields/properties to return for each result. The "fields" array in the example above is just a list of three field names, not an expression; "id" is simply a field name, not an operator. The response format might be something like this:

    [
      [8585, "John", 39],
      [8586, "Jane", 29],
      ...
    ]
  

Using arrays rather than objects to hold each result is quite efficient as it saves on repeating the keys. The order of the entries in the arrays is the same as the order specified in the "fields".
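That projection step can be sketched in a few lines; this version handles only plain field names, not expressions:

```javascript
// Project each matching record into an array of values, in the
// order given by the "fields" list.
function project(results, fields) {
  return results.map(record => fields.map(field => record[field]));
}

const results = [
  { id: 8585, name: "John", age: 39, city: "London" },
  { id: 8586, name: "Jane", age: 29, city: "Leeds" }
];

project(results, ["id", "name", "age"]);
// => [[8585, "John", 39], [8586, "Jane", 29]]
```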

To take this to the next level we can introduce query expressions in the "fields". Let's start with something simple, accessing nested fields:

    // We could support dot notation for accessing nested fields...
    ["id", "user.name", "user.age"]

    // ...or we could write it using "GET"
    ["id", ["GET", "user", "name"], ["GET", "user", "age"]]
  

The "STR" operator could be used to perform concatenation:

    // => [8585, "John Smith"]
    [
      "id",
      ["STR",
        ["GET", "firstName"],
        " ",
        ["GET", "lastName"]
      ]
    ]
  

Using other operators we could extract whatever values we need from the data without having to return all of the relevant fields:

    {
      "fields": [
        "id",
        "name",
        ["AND",
          [">=", "age", 21],
          ["<", "income", 27000],
          [">", "rent", 500],
          ["=", "city", "London"],
          ["=", "country", "UK"]
        ]
      ],

      ...
    }
  

That third 'field' isn't a filtering condition; it's just used to calculate the value of the last field in the response arrays. The response might look something like this:

    [
      [8585, "John", true],
      [8586, "Jane", false],
      ...
    ]
  

Information such as a total results count or other query meta can easily be added by wrapping the results array in a JSON object.

Custom Fields

As queries get more advanced it can be desirable to reuse sections of a query in multiple places. Perhaps you want to use one of your field expressions within the search criteria, or maybe for sorting purposes. Probably the simplest solution is to put a name on it, like this:

    {
      "fields": [...],
      "query": [...],

      "customFields": {
        "turningAngle": ["-", "direction2", "direction1"]
      },

      ...
    }
  

Here the custom field "turningAngle" is calculated using the values of the fields "direction1" and "direction2". It can then be used in the "fields" or "query" just like any other field. Custom fields can be used instead of expressions in the "fields" but personally I like to have support for both techniques.
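As a sketch, custom field resolution can be a lookup performed before ordinary field access; here only the "-" operator needed for the "turningAngle" example is implemented, with bare strings treated as implicit field references:

```javascript
// Resolve a field name: custom fields take priority and are evaluated
// as expressions; anything else is read straight off the record.
function getField(record, name, customFields) {
  if (name in customFields) {
    return evaluate(customFields[name], record, customFields);
  }
  return record[name];
}

function evaluate(expr, record, customFields) {
  if (!Array.isArray(expr)) return expr; // literal value
  const [op, ...args] = expr;
  switch (op) {
    case "-":
      // Bare string arguments are implicit field references
      return getField(record, args[0], customFields) -
             getField(record, args[1], customFields);
    default:
      throw new Error("Unknown operator: " + op);
  }
}

const customFields = {
  turningAngle: ["-", "direction2", "direction1"]
};

const record = { direction1: 30, direction2: 75 };
getField(record, "turningAngle", customFields); // => 45
```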

The interactive example in the next section includes an example of using custom fields to calculate HSL values from RGB. The custom fields are used like variables to store intermediate values in calculations. We could continue to expand the query syntax until it starts to look a lot like a full programming language, perhaps including features like custom operators (think functions) written in terms of other operators. I don't know how much value that would provide for a query syntax but it's reassuring to have Lisp sitting there in the background as a source of inspiration if the need to take things further should arise.

Interactive Example

The operators supported by this example are:

  • GET and STR
  • AND, OR and NOT
  • IF, which provides if/else support. Syntax: ["IF", cond1, value1, ..., condN, valueN, elseValue]
  • =, <, >, <= and >=
  • +, *, -, / and MOD
  • MEAN, MEDIAN, MAX and MIN
  • ABS, CEIL, FLOOR, ROUND, SIGN and TRUNC
  • STARTS_WITH, CONTAINS and INDEX_OF
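Of those, "IF" has the least obvious shape, so here's a sketch of how its cond/value pairs might be handled (operating on already-evaluated arguments):

```javascript
// Check cond/value pairs in order; a trailing odd argument, if
// present, acts as the else value.
function evaluateIf(args) {
  for (let i = 0; i + 1 < args.length; i += 2) {
    if (args[i]) return args[i + 1];
  }
  return args.length % 2 === 1 ? args[args.length - 1] : undefined;
}

evaluateIf([true, "yes", "no"]);  // => "yes"
evaluateIf([false, "yes", "no"]); // => "no"
```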

The data for this example is in the form:

    [
      {name: 'AliceBlue', hex: 'F0F8FF', red: 240, green: 248, blue: 255},
      {name: 'AntiqueWhite', hex: 'FAEBD7', red: 250, green: 235, blue: 215},
      ...
    ]
  

The 'Quick Demos' allow you to run some example queries without having to type them in yourself. Some of them are quite complicated, probably more complicated than anything you'll ever need. They're there to show what can be done; don't let them put you off.

Quick Demos:
 

Thank you to those who provided feedback on earlier drafts of this article.
