If you have been following the work on the JavaScript Binary AST, you probably realized that we haven’t been communicating much recently. Sorry about that, we’ve been busy iterating on it!

In this blog entry, I’d like to talk to you about our current main focus: file compression.

About the JavaScript Binary AST

The JavaScript Binary AST is a TC39 proposal for the future of JavaScript, designed by Mozilla, Facebook, Bloomberg, Cloudflare. Our main objective is to introduce a new way of sending JavaScript source code to make loading JS much faster, without requiring web developers to make any change in their code.

Why file size matters

There are several potential bottlenecks involved in loading JS. To make a long story short:

the code may be slow to download/load from disk;
the code may be slow to parse;
the code may be slow to compile to bytecode;
the code may be slow to execute.

As we have detailed in previous blog posts:

point 2 is tremendousy improved by BinAST (and other proposals);
point 3 is something we are working on;
point 4 is besides the scope of BinAST.

This leaves us with 1. To make JavaScript loading faster, we also need to make the number of bytes transmitted smaller. In other words, we need compression.

The Text AST

Nowadays, the de facto standard for compression of JavaScript source code is Brotli. It’s an extremely good tool, which provides both fast compression, fast decompression and a high compression ratio. In fact, Brotli is already custom-tailored to compress JavaScript.

However, out of the box, Brotli doesn’t compress BinAST too well.

Consider the following JS code:

function identity(x) {
    return x;
}

This representation is very concise already, and Brotli already has in its builtin dictionary words such as function or return, which make its representation even more concise. In fact, Brotli reduces this snippet to 34 bytes, and would do a much better job on a larger codebase.

Now consider the in-memory representation (or “AST”) of the same code:

EagerFunctionDeclaration:
    isAsync: false
    isGenerator: false
    name:
        BindingIdentifier:
            name: "identity"
    length: 1
    directives: []
    contents:
        FunctionOrMethodContents:
            isThisCaptured: false
            parameterScope:
                AssertedParameterScope:
                    paramNames:
                        - AssertedParameterName:
                            name: "x"
                            isCaptured: false
                    hasDirectEval: false
                    isSimpleParameterList: true
            params:
                FormalParameters:
                    items:
                        - BindingIdentifier:
                            name: "x"
                    rest: null
            bodyScope:
                AssertedVarScope:
                    declaredNames: []
                    hasDirectEval: false
            body:
                - ReturnStatement:
                    expression:
                        IdentifierExpression:
                            name: "x"

That is… quite a little bit longer. That’s because we are showing here the result of two steps of transformation on the snippet above:

parsing;
static analysis (yes, there’s static analysis during JavaScript startup).

Since both operations are quite costly, we wish to transmit this entire data structure, or a variant thereof. Maybe we can apply Brotli directly?

After Brotli, that’s a 9x file size increase, which makes things much slower and expensive (in terms of electricity) to download/load from cache.

So, this format is not acceptable.

Towards the Binary AST

Now, the BinAST spec specifies the order of fields, so we can already simplify our representation

EagerFunctionDeclaration:
    - false
    - false
    - BindingIdentifier:
        "identity"
    - 1
    - []
    - FunctionOrMethodContents:
        - false
        - AssertedParameterScope:
            -
                - AssertedParameterName:
                    "x"
                    false
            - false
            - true
        - FormalParameters:
            -
                - BindingIdentifier:
                    "x"
            - null
        - AssertedVarScope:
            - []
            - false
        -
            - ReturnStatement:
                IdentifierExpression:
                    "x"

While we’re there, let’s replace false with 0 and true with 1.

EagerFunctionDeclaration:
    - 0
    - 0
    - BindingIdentifier:
        "identity"
    - 1
    - []
    - FunctionOrMethodContents:
        - 0
        - AssertedParameterScope:
            -
                - AssertedParameterName:
                    "x"
                    0
            - 0
            - 1
        - FormalParameters:
            -
                - BindingIdentifier:
                    "x"
            - null
        - AssertedVarScope:
            - []
            - 0
        -
            - ReturnStatement:
                IdentifierExpression:
                    "x"

We can already see improvements:

Now, let’s change a bit the format of lists, to prefix them with their length and avoid nesting them behind empty -.

EagerFunctionDeclaration:
    - 0
    - 0
    - BindingIdentifier:
        "identity"
    - 1
    - 0
    - FunctionOrMethodContents:
        - 0
        - AssertedParameterScope:
            - 1
            - AssertedParameterName:
                "x"
                0
            - 0
            - 1
        - FormalParameters:
            - 1
            - BindingIdentifier:
                "x"
            - null
        - AssertedVarScope:
            - 0
            - 0
        - 1
        - ReturnStatement:
            IdentifierExpression:
                "x"

We now have a representation with three constructions:

numbers;
node names;
literal strings.

Let’s see what we can do about strings. BinAST is not a minifier, so we keep all strings, identifiers, etc. unchanged, but we can move them out of the way and into a header, as follows:

strings:
    concat:
        "identityx"
    len:
        - 8
        - 1

Field concat contains a single string, obtained by concatenating all individual strings used in the file, and we maintain a separate list of string lengths. This has several benefits:

strings may be parsed ahead of time and/or in parallel from the rest of the file;
strings only need to be validated and optimized for memory use once, regardless of the number of occurrences in the file;
strings only need to appear once in the file, so the cost of repeating a string is negligible.

Just as we had removed field names, let’s get rid of concat and len:

strings:
    - "identityx"
    - 2
    - 8
    - 1

This gives us the following representation:

strings:
    - "identityx"
    - 2
    - 8
    - 1
EagerFunctionDeclaration:
    - 0
    - 0
    - BindingIdentifier:
        0
    - 1
    - 0
    - FunctionOrMethodContents:
        - 0
        - AssertedParameterScope:
            - 1
            - AssertedParameterName:
                1
                0
            - 0
            - 1
        - FormalParameters:
            - 1
            - BindingIdentifier:
                1
            - null
        - AssertedVarScope:
            - 0
            - 0
        - 1
        - ReturnStatement:
            IdentifierExpression:
                1

Now, let’s take a look at these node names. We still need the node names to tell us what we’re looking at and to deduce the number of fields and their nature. By using the same trick as Brotli, we can introduce the names of nodes in a shared dictionary and repace them with numbers in the file, obtaining the rather more concise

0:
    - "identityx"
    - 2
    - 8
    - 1
0:
    - 0
    - 0
    - 1:
        0
    - 1
    - 0
    - 2:
        - 0
        - 3:
            - 1
            - 4:
                1
                0
            - 0
            - 1
        - 5:
            - 1
            - 1:
                1
            - 6
        - 7:
            - 0
            - 0
        - 1
        - 8:
            9:
                1

Also, since the parser knows the grammar, we do not need the nesting information, so we could easily rewrite this:

- 0
- identityx
- 2
- 8
- 1
- 0
- 0
- 0
- 1
- 0
- 1
- 0
- 2
- 0
- 3
- 1
- 4
- 1
- 0
- 0
- 1
- 5
- 1
- 1
- 1
- 6
- 7
- 0
- 0
- 1
- 8
- 9
- 1

Or, in binary format:

0x69, 0x64, 0x65, 0x6e, 0x74, 0x69, 0x74, 0x79, 0x78, 0x00, 0x02, 0x08, 0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x02, 0x00, 0x03, 0x01, 0x04, 0x01, 0x00, 0x00, 0x01, 0x05, 0x01, 0x01, 0x01, 0x06, 0x07, 0x00, 0x00, 0x01, 0x08, 0x09, 0x01

This was, indeed, the first version of our Binary AST.

Let’s see how it behaves:

That’s much better!

Also, while moving entirely to binary makes the format harder to debug, it is also much faster to parse. Indeed, loading a single byte and comparing it to a set of bytes to determine how to interpret what follows is simpler and faster than loading "FunctionOrMethodContents" and comparing it to a set of strings for the same reason.

BinAST 0.1 vs. the Real World

With this format, we can attempt to try and compress real-world JavaScript files using BinAST 0.1.

Our first benchmark is fb_bench, a large sample of minified JavaScript published by Facebook for use by developers of VMs and JavaScript code analysis tools.

While the result is not awful, we still see a post-compression size increase of 16%. This is in part due to the fact that fb_bench is minified (BinAST 0.1 works better on non-minified code), but then, so is most of the JS on the web.

In other words, we cannot simply ship files with BinAST 0.1 + Brotli. In fact, we have spent the best part of last year iterating through successive versions of BinAST to improve this result.

To be continued…

That’s enough code for one blog post. In the second part of this series, I’ll be showcasing the technique we currently use to compress BinAST (the so-called “Entropy format”) and the results we currently achieve.

Il y a du thé renversé au bord de la table !