The Abstract Syntax Tree
2021-11-19
Introduction
One of the first steps a compiler must take is parse the human-readable program format, usually a plain text file, and put it into a structure the compiler can work with; the Abstract Syntax Tree (abbreviated to AST).
Here’s a simple example with code on the left and AST on the right:
and a slightly more complex example:in this post I wish to define the AST for my language based on requirements set forth previously.
Underlying litterature
I’m primarily basing myself on the following work:
[Peyton Jones & Lester 1992] Implementing Functional Languages: a tutorial. The title of this book is self-explanatory, though I should mention it covers lazy languages specifically.
[Martin-Löf 1973] An intuitionistic theory of types: predicative part. This article first describes the Martin-Löf type system, that is, dependent types. It explains how propositions about values in a program can be expressed themselves as types and terms.
[Atkey 2018] Syntax and Semantics of Quantitative Type Theory. This article describes quantitative type theory, which marries dependent types with linear values and erasure.
Of course, I’m also using a lot of transitive knowledge and things I picked up here and there, but have no specific source for.
A default AST
To define how an AST might look, we must specify a language grammar. This is often done using BNF (Backus-Naur Form) notation. In languages like Haskell and Idris it is incredibly easy to express it directly in a type, so I will do that instead.
I start from the AST given in [Peyton Jones & Lester 1992] p.17, translated to Idris:
data Expr a
= EVar String
| ENum Int
| ECtr Int Int
| EAp (Expr a) (Expr a)
| ELet IsRec (List (a, Expr a)) (Expr a)
| ECase (Expr a) (List (Alt a))
| ELam (List a) (Expr a)
data Alt a = (Int, List a, Expr a)
Here, the type-variable a
is the “binder” of values.
Generally this will be a String
representing the name of a
variable, but different types could be used as well.
EVar
represents named variables, such as
x
.
ENum
is for number primitives: 1
,
2
, 42
, etc.
ECtr
is less intuitive, when we define a type, such as
data MaybeInt = Just Int | Nothing
, each constructor can be
given a number, called the tag. The tag can be used internally by case
expressions to match on the right constructor. The other
Int
argument to ECtr
is the arity: how many
arguments the constructor takes. In the example Just
has
arity 1 and Nothing
has arity 0
.
EAp
is function application. For instance,
f x
would parse to
EAp (EVar "f") (EVar "x")
.
ELet
is for let ... in
expressions, where
the (a, Expr a)
values are the variable definitions.
ECase
is a case expression. The Alt a
type
has a tag value to match, with a list of constructor arguments, and an
expression.
ELam
is for lambda functions.
There are a few changes I wish to make based mostly on preference.
First, we can determine whether a let-expression is recursive,
e.g. let xs = 1 : xs in ...
, by static analysis of the
enclosed definitions. [Peyton
Jones & Lester 1992] does this, more or less, in 6.8.3. I prefer
not to have a specific field for it so that I have a single source of
truth. I will remove the IsRec
field. I suspect the book
only uses this field to avoid having to explain dependency analysis
first.
Second, numbers are just constructors that should really check their
priviledge. Defaulting to int
is a fantastic source of
bugs. By giving numbers special treatment we are also overlooking more
generic optimizations. ENum
goes.
Third, I’m taking a hint from Elm style guidelines and replacing builtin types by more meaningful custom ones.
After those basic changes we are left with:
data Expr a
= EVar VarID
| ECtr CtrID
| EAp (Expr a) (Expr a)
| ELet (List (Def a)) (Expr a)
| ECase (Expr a) (List (Alt a))
| ELam (List a) (Expr a)
data Alt a = MkAlt CtrID (List VarID) (Expr a)
data Def a = MkDef a (Expr a)
I have left the various IDs intentionally undefined.
Making the AST hollistic
To consider what changes are necessary to support hollistic programming, we must go back to the language’s goals.
If we wish to define the whole system in a language, rather than one particular program, we must start considering changes in code over time. Typically, it is the responsibility of the developer to ensure that changes to not break compatibility with persistent systems, such as databases. If such systems are generated by the compiler, it is the compiler that must be able to prove compatibility with some previous version. Immutability helps a great deal here. Just as Elm uses immutability to quickly determine what areas of a DOM need updating, so too we wish to have immutability in program definitions themselves, so we can correctly determine what is compatible. It would be nice if we had some kind of “metapurity”: make the act of programming itself pure.
To achieve this, I will copy the identify-by-hash approach of unison. Where names are stateful and may point to different functions over time, hashes are stateless. If a function changes, so does its hash.
Of course, programming with hashes directly is infeasible for programmers. We need a human-compatible language, let’s call it the display language. In most other programming languages, the display language is the source of truth. The core language AST is derived from it during compilation. There is a missed opportunity here.
A lot of effort is being wasted by developers fighting about pure display concerns: tabs vs spaces, camelCase vs PascalCase vs snake_case vs kebab-case, operators vs functions, import definitions, naming, alignment, egyptian brackets… after removing names as identifiers, none of these concerns change a program’s AST in the core language. If we make the core language the source of truth, and define the display language as (isomorphic to) the core language plus some display information (e.g. variable names), we can do away with those issues entirely.
Now, back to the AST
Many naming issues are already resolved by resorting to hashes, but variables may also be bound by let-expressions, lambdas, and supercombinators. For there, we don’t really need hashes. We can use a de Bruijn index, slightly adapted to work with let-expressions and supercombinators as well (both can be converted to lambda functions, but that seems counter-productive).
I will have to distinguish between local variables, using de Bruijn indices, and global variables. Binder types have become superfluous, we just need to know how many binders a binding-expression has.
data Expr
= ELVar Nat ||| de Bruijn index of locally-bound variable
| EGVar Hash ||| Hash of globally bound supercombinator
| ECtr CtrID
| EAp Expr Expr
| ELet (List Expr) Expr ||| 1st arg is definitions, second is the "in" part
| ECase Expr (List Alt)
| ELam Nat (Expr a) ||| Only need to know number of arguments
data Alt = MkAlt CtrID Expr ||| # arguments derived from CtrID
Dependent typing
This leaves us with 2 problems: CtrID
is not defined,
and we are missing a way to use types as values.
Let’s start with CtrID
. We usually write type
declarations as disjunctions of conjunctions: the type is either this
constructor with these arguments, or that constructor with those
arguments, etc. From [Martin-Löf 1973] we get some information about
disjunctions in Martin-Löf type theories:
\(A \oplus B\) represents their disjunction
\[ A \vee B\]
a proof of which consists of either a proof of \(A\) (together with the information that it is \(A\) that has been proved) or a proof of \(B\) (together with information that it is \(B\) that has been proved).
CtrID
is that “information that it is \(x\) that has been proved”, also called a
tag. I won’t include arity in CtrID
as it can be derived
from type definitions in a statically typed language. However, we can’t
resort to using Int
, or other purely algebraic notions, to
defined the tag. Consider the following equivalent type definitions:
This change of position should make no difference to the evaluation result. However, without names, the algebraic representation of both types is:
\[ I \oplus I \]
(where \(I\) is the unit type, which
you might better know as ()
)
For a concrete example, if we were to use the definition of
constructor expressions of [Peyton
Jones & Lester 1992], the first version would result in
ECtr 0 0
for Yes
, and the second version would
result in ECtr 0 0
for No
. If these
constructor values were stored in a database, code post-update would
interpret Yes
values that were stored pre-update as
No
.
For similar reasons, we need to store the name of the type:
Not something you want to switch around because of a code update!
However, this is a consideration for the display language. We can
consider the constructor names to include type name, and be
IsDangerous.Yes
and IsSafe.Yes
, which are
clearly different.
In short, the CtrID should be a full name, a string for now, which I
will abstract to Tag
. I will probably switch to something
fancier in the future, as here too lies the danger of bickering about
names and other inane matters.
That leaves us with:
data Expr
= ELVar Nat
| EGVar Hash
| ECtr Tag
| EAp Expr Expr
| ELet (List Expr) Expr
| ECase Expr (List Alt)
| ELam Nat (Expr a)
data Alt = MkAlt Tag Expr
Note that Alt
does not contain information to determine
what type a tag belongs to, and so cannot derive how many arguments it
has. However, that information is known in the encompassing case
expression.
But what about types? Again, we look to [Martin-Löf 1973]:
With the sequence of universes, we achieve that every type is at the same time an object and, as such, has itself got a type.
That means a type itself also has a contructor tag for it’s own type.
A type name can be expressed with ECtr
.
What else do we need to express the terms of dependent typing? Let’s look at the definition of preterms in [Atkey 2018] for a language of booleans, and see if there’s anything we’re missing.
\[ \begin{array}{lrl} M, N, O & \mathrel{::=} & x \\ & | & \lambda x \overset{\pi}{:} S.M^T \\ & | & \mathrm{App}_{(x \overset{\pi}{:} S) T} (M, N) \\ & | & (M, N)_{x \overset{\pi}{:} S . T} \\ & | & \ast \\ & | & \mathrm{fst}_{x \overset{\pi}{:} S . T} (M) \\ & | & \mathrm{snd}_{x \overset{\pi}{:} S . T} (M) \\ & | & \mathrm{let}_{x \overset{\pi}{:} S . T} (x, y) = M \, \mathrm{in} \, N \\ & | & \mathrm{let}_I \ast = M \, \mathrm{in} \, N \\ & | & \mathrm{true} \\ & | & \mathrm{false} \\ & | & \mathrm{ElimBool}_{(z)T}(M_t, M_f, N) \\ & | & \mathrm{Bool} \\ & | & (x \overset{\pi}{:} M) \to N \end{array} \]
(I adapted the typesetting a bit because I would rather split the various options by line and I’m not worried about running out of paper.)
I don’t wish to implement the type calculus just yet, so I will ignore the types, and assume that if we can express all these preterms, we will be able to express terms in a dependent typed system. Going through them one by one, ignoring type annotations.
\(x\): a variable. Corresponds to
ELVar
.\(\lambda x \overset{\pi}{:} S.M^T\): lambda functions. Corresponds to
ELam
.\(\mathrm{App}_{(x \overset{\pi}{:} S) T} (M, N)\): function application. Corresponds to
EAp
.\((M, N)_{x \overset{\pi}{:} S . T}\): pairs. Covered by
ECtr
andEAp
.\(\ast\): the unit value. Covered by
ECtr
.\(\mathrm{fst}_{x \overset{\pi}{:} S . T} (M)\): the
fst
function over tuples. Covered byEGVar
, as it is a function.\(\mathrm{snd}_{x \overset{\pi}{:} S . T} (M)\): the
snd
function over tuples.\(\mathrm{let}_{x \overset{\pi}{:} S . T} (x, y) = M \, \mathrm{in} \, N\): a let … in expression for tuples. Covered by
ELet
.\(\mathrm{let}_I \ast = M \, \mathrm{in} \, N\): a let … in expression for units. Covered by
ELet
.\(\mathrm{true}\). Covered by
ECtr
.\(\mathrm{false}\). Covered by
ECtr
.\(\mathrm{ElimBool}_{(z)T}(M_t, M_f, N)\): a way to eliminate booleans. We essentially do this with
ECase
expressions.\((x \overset{\pi}{:} M) \to N\): function type declaration. This one is missing!
Now, the question arises whether arrows aren’t just a special case of
ECtr
. After all, types are constructors (of universes), as
we’ve discussed. Is “arrow” a type that would be better off in the
prelude than as a native type constructor? I’m inclined to say no. It
seem disproportionately difficult to compile anything without a native
notion of functions, and it would be impossible for the compiler to say
what the type of a lambda function is.
So, let’s add arrows to our expressions type:
data Expr
= ELVar Nat
| EGVar Hash
| ECtr Tag
| EAp Expr Expr
| ELet (List Expr) Expr
| ECase Expr (List Alt)
| ELam Nat (Expr a)
| EArr Expr Expr
data Alt = MkAlt Tag Expr
And with that my initial AST is complete. I say initial, because I have no doubt I will realise I have made a huge mistake somewhere in the future.