The Abstract Syntax Tree

drBearhands

2021-11-19

Introduction

One of the first steps a compiler must take is parse the human-readable program format, usually a plain text file, and put it into a structure the compiler can work with; the Abstract Syntax Tree (abbreviated to AST).

Here’s a simple example with code on the left and AST on the right:

f 2
The AST for code sample on the left The AST for code sample on the left
and a slightly more complex example:
( \x -> x + x ) 2

The AST for code sample on the left The AST for code sample on the left

in this post I wish to define the AST for my language based on requirements set forth previously.

Underlying litterature

I’m primarily basing myself on the following work:

Of course, I’m also using a lot of transitive knowledge and things I picked up here and there, but have no specific source for.

A default AST

To define how an AST might look, we must specify a language grammar. This is often done using BNF (Backus-Naur Form) notation. In languages like Haskell and Idris it is incredibly easy to express it directly in a type, so I will do that instead.

I start from the AST given in [Peyton Jones & Lester 1992] p.17, translated to Idris:

data Expr a
  = EVar String
  | ENum Int
  | ECtr Int Int
  | EAp (Expr a) (Expr a)
  | ELet IsRec (List (a, Expr a)) (Expr a)
  | ECase (Expr a) (List (Alt a))           
  | ELam (List a) (Expr a)

data Alt a = (Int, List a, Expr a)

Here, the type-variable a is the “binder” of values. Generally this will be a String representing the name of a variable, but different types could be used as well.

EVar represents named variables, such as x.

ENum is for number primitives: 1, 2, 42, etc.

ECtr is less intuitive, when we define a type, such as data MaybeInt = Just Int | Nothing, each constructor can be given a number, called the tag. The tag can be used internally by case expressions to match on the right constructor. The other Int argument to ECtr is the arity: how many arguments the constructor takes. In the example Just has arity 1 and Nothing has arity 0.

EAp is function application. For instance, f x would parse to EAp (EVar "f") (EVar "x").

ELet is for let ... in expressions, where the (a, Expr a) values are the variable definitions.

ECase is a case expression. The Alt a type has a tag value to match, with a list of constructor arguments, and an expression.

ELam is for lambda functions.


There are a few changes I wish to make based mostly on preference.

First, we can determine whether a let-expression is recursive, e.g. let xs = 1 : xs in ..., by static analysis of the enclosed definitions. [Peyton Jones & Lester 1992] does this, more or less, in 6.8.3. I prefer not to have a specific field for it so that I have a single source of truth. I will remove the IsRec field. I suspect the book only uses this field to avoid having to explain dependency analysis first.

Second, numbers are just constructors that should really check their priviledge. Defaulting to int is a fantastic source of bugs. By giving numbers special treatment we are also overlooking more generic optimizations. ENum goes.

Third, I’m taking a hint from Elm style guidelines and replacing builtin types by more meaningful custom ones.

After those basic changes we are left with:

data Expr a
  = EVar VarID
  | ECtr CtrID
  | EAp (Expr a) (Expr a)
  | ELet (List (Def a)) (Expr a)
  | ECase (Expr a) (List (Alt a))           
  | ELam (List a) (Expr a)

data Alt a = MkAlt CtrID (List VarID) (Expr a)

data Def a = MkDef a (Expr a)

I have left the various IDs intentionally undefined.

Making the AST hollistic

To consider what changes are necessary to support hollistic programming, we must go back to the language’s goals.

If we wish to define the whole system in a language, rather than one particular program, we must start considering changes in code over time. Typically, it is the responsibility of the developer to ensure that changes to not break compatibility with persistent systems, such as databases. If such systems are generated by the compiler, it is the compiler that must be able to prove compatibility with some previous version. Immutability helps a great deal here. Just as Elm uses immutability to quickly determine what areas of a DOM need updating, so too we wish to have immutability in program definitions themselves, so we can correctly determine what is compatible. It would be nice if we had some kind of “metapurity”: make the act of programming itself pure.

To achieve this, I will copy the identify-by-hash approach of unison. Where names are stateful and may point to different functions over time, hashes are stateless. If a function changes, so does its hash.

Of course, programming with hashes directly is infeasible for programmers. We need a human-compatible language, let’s call it the display language. In most other programming languages, the display language is the source of truth. The core language AST is derived from it during compilation. There is a missed opportunity here.

A lot of effort is being wasted by developers fighting about pure display concerns: tabs vs spaces, camelCase vs PascalCase vs snake_case vs kebab-case, operators vs functions, import definitions, naming, alignment, egyptian brackets… after removing names as identifiers, none of these concerns change a program’s AST in the core language. If we make the core language the source of truth, and define the display language as (isomorphic to) the core language plus some display information (e.g. variable names), we can do away with those issues entirely.

Now, back to the AST


Many naming issues are already resolved by resorting to hashes, but variables may also be bound by let-expressions, lambdas, and supercombinators. For there, we don’t really need hashes. We can use a de Bruijn index, slightly adapted to work with let-expressions and supercombinators as well (both can be converted to lambda functions, but that seems counter-productive).

I will have to distinguish between local variables, using de Bruijn indices, and global variables. Binder types have become superfluous, we just need to know how many binders a binding-expression has.

data Expr
  = ELVar Nat               ||| de Bruijn index of locally-bound variable
  | EGVar Hash              ||| Hash of globally bound supercombinator
  | ECtr CtrID                    
  | EAp Expr Expr
  | ELet (List Expr) Expr   ||| 1st arg is definitions, second is the "in" part
  | ECase Expr (List Alt)           
  | ELam Nat (Expr a)       ||| Only need to know number of arguments

data Alt = MkAlt CtrID Expr ||| # arguments derived from CtrID

Dependent typing

This leaves us with 2 problems: CtrID is not defined, and we are missing a way to use types as values.

Let’s start with CtrID. We usually write type declarations as disjunctions of conjunctions: the type is either this constructor with these arguments, or that constructor with those arguments, etc. From [Martin-Löf 1973] we get some information about disjunctions in Martin-Löf type theories:

\(A \oplus B\) represents their disjunction

\[ A \vee B\]

a proof of which consists of either a proof of \(A\) (together with the information that it is \(A\) that has been proved) or a proof of \(B\) (together with information that it is \(B\) that has been proved).

CtrID is that “information that it is \(x\) that has been proved”, also called a tag. I won’t include arity in CtrID as it can be derived from type definitions in a statically typed language. However, we can’t resort to using Int, or other purely algebraic notions, to defined the tag. Consider the following equivalent type definitions:

data IsNice
  = Yes
  | No
data IsNice
  = No
  | Yes

This change of position should make no difference to the evaluation result. However, without names, the algebraic representation of both types is:

\[ I \oplus I \]

(where \(I\) is the unit type, which you might better know as ())

For a concrete example, if we were to use the definition of constructor expressions of [Peyton Jones & Lester 1992], the first version would result in ECtr 0 0 for Yes, and the second version would result in ECtr 0 0 for No. If these constructor values were stored in a database, code post-update would interpret Yes values that were stored pre-update as No.

For similar reasons, we need to store the name of the type:

data IsDangerous
  = Yes
  | No
data IsSafe
  = Yes
  | No

Not something you want to switch around because of a code update! However, this is a consideration for the display language. We can consider the constructor names to include type name, and be IsDangerous.Yes and IsSafe.Yes, which are clearly different.

In short, the CtrID should be a full name, a string for now, which I will abstract to Tag. I will probably switch to something fancier in the future, as here too lies the danger of bickering about names and other inane matters.

That leaves us with:

data Expr
  = ELVar Nat
  | EGVar Hash
  | ECtr Tag
  | EAp Expr Expr
  | ELet (List Expr) Expr
  | ECase Expr (List Alt)           
  | ELam Nat (Expr a)

data Alt = MkAlt Tag Expr

Note that Alt does not contain information to determine what type a tag belongs to, and so cannot derive how many arguments it has. However, that information is known in the encompassing case expression.


But what about types? Again, we look to [Martin-Löf 1973]:

With the sequence of universes, we achieve that every type is at the same time an object and, as such, has itself got a type.

That means a type itself also has a contructor tag for it’s own type. A type name can be expressed with ECtr.

What else do we need to express the terms of dependent typing? Let’s look at the definition of preterms in [Atkey 2018] for a language of booleans, and see if there’s anything we’re missing.

\[ \begin{array}{lrl} M, N, O & \mathrel{::=} & x \\ & | & \lambda x \overset{\pi}{:} S.M^T \\ & | & \mathrm{App}_{(x \overset{\pi}{:} S) T} (M, N) \\ & | & (M, N)_{x \overset{\pi}{:} S . T} \\ & | & \ast \\ & | & \mathrm{fst}_{x \overset{\pi}{:} S . T} (M) \\ & | & \mathrm{snd}_{x \overset{\pi}{:} S . T} (M) \\ & | & \mathrm{let}_{x \overset{\pi}{:} S . T} (x, y) = M \, \mathrm{in} \, N \\ & | & \mathrm{let}_I \ast = M \, \mathrm{in} \, N \\ & | & \mathrm{true} \\ & | & \mathrm{false} \\ & | & \mathrm{ElimBool}_{(z)T}(M_t, M_f, N) \\ & | & \mathrm{Bool} \\ & | & (x \overset{\pi}{:} M) \to N \end{array} \]

(I adapted the typesetting a bit because I would rather split the various options by line and I’m not worried about running out of paper.)

I don’t wish to implement the type calculus just yet, so I will ignore the types, and assume that if we can express all these preterms, we will be able to express terms in a dependent typed system. Going through them one by one, ignoring type annotations.

Now, the question arises whether arrows aren’t just a special case of ECtr. After all, types are constructors (of universes), as we’ve discussed. Is “arrow” a type that would be better off in the prelude than as a native type constructor? I’m inclined to say no. It seem disproportionately difficult to compile anything without a native notion of functions, and it would be impossible for the compiler to say what the type of a lambda function is.

So, let’s add arrows to our expressions type:

data Expr
  = ELVar Nat
  | EGVar Hash
  | ECtr Tag
  | EAp Expr Expr
  | ELet (List Expr) Expr
  | ECase Expr (List Alt)           
  | ELam Nat (Expr a)
  | EArr Expr Expr

data Alt = MkAlt Tag Expr

And with that my initial AST is complete. I say initial, because I have no doubt I will realise I have made a huge mistake somewhere in the future.

previous