Building a Lexer - Part I February 18, 2024
Starting to build a lexer for my language.
My goal for 2024 is to implement a simple programming language that compiles to WebAssembly and that I can use to solve at least one Advent of Code problem. In this post, we will implement a simple lexer for the language.
A lexer is a program that collects sequences of meaningful characters
into chunks. These chunks are called tokens; they are annotated with
a type and later used as input for a parser. For example, 3 would be
a token representing a number. Characters that don't matter for
the language can be ignored (e.g. ' ', '\t', '\n'). Characters that
are unexpected are reported as errors.
Take a look at the following program, which should print Hello World!.
(println "Hello World!")
The left parenthesis ( denotes the beginning of an expression in
Lisp. That means it is a meaningful character and at the same time the
first token. The second token, println, is a meaningful sequence of
characters that references a function. Such references are called
symbols. The space between println and " is ignored and not a
token, as it does not matter for the execution of our program. "Hello World!" is the third token, a string. The right parenthesis
) is the fourth and last token and denotes the end of an expression.
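Written out, the token stream for this program would look roughly like this (the names are informal, the actual Token type comes later):

LParen   (
Symbol   println
String   "Hello World!"
RParen   )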
A note on existing libraries
There are countless tools and libraries to implement lexers. IntelliJ uses JFlex for custom languages, in Rust there is logos, which focuses on speed and ergonomics (and seems to nail it).
I like implementing things from scratch to know how they work. That's not possible in all cases and circumstances, but in this case it is, and that's why I am trying not to use any library that might make my life easier. At least on the first pass.
With a Lisp, it would probably be possible and rather simple to skip lexing completely and just implement a pest grammar or use a parser combinator library like nom, combine or chumsky. I think I might revisit logos and pest in a later post and see how much simpler they make my life.
From a usage perspective, I still feel like pest is one of the most elegant ways to implement a parser.
Implementing a lexer
This section is divided into three parts. First, a Span is introduced
to store the position of Tokens. Then a Token type is defined to
represent the various tokens in this language, and finally a Lexer
with the two primitive operations advance and peek is built on top of both.
Tracking the position of tokens is important for displaying good error
messages. Most compilers I have seen in the wild use a type called
Span that stores the start and end positions of tokens.
#[derive(Debug, Eq, PartialEq, PartialOrd, Ord)]
struct Span {
    start: u32,
    end: u32,
}
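To make that concrete, here is a small sketch of the spans I would expect for the Hello World program above. It assumes zero-based character offsets with an exclusive end; the post does not pin that convention down, so treat it as my assumption, and the helper function exists only for illustration.

// Spans for `(println "Hello World!")`, offsets counted from 0, end exclusive.
fn hello_world_spans() -> [Span; 4] {
    [
        Span { start: 0, end: 1 },   // `(`
        Span { start: 1, end: 8 },   // `println`
        Span { start: 9, end: 23 },  // `"Hello World!"`
        Span { start: 23, end: 24 }, // `)`
    ]
}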
There are other ways to represent this information. For example, the
Elm compiler uses a Region type, which instead stores line
and column information for the starting and ending positions of a
token. I have used this type in my previous attempts and it works
well, but I don't know enough about compiler development to see the
trade-offs between Spans and Regions, which is why this time I
will just be trying out Spans.
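For comparison, a Region-style representation would look roughly like this; it is a sketch of the idea, not Elm's actual definition, and the names Position and Region are just placeholders.

// Line/column based position tracking, as opposed to flat byte offsets.
struct Position {
    line: u32,
    column: u32,
}

struct Region {
    start: Position,
    end: Position,
}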
Ok, now let’s declare a Token type.
#[derive(Debug)]
enum Token {
    LParen(Span),
    RParen(Span),
    Symbol(Span, String),
    Number(Span, f64),
}
Simple: there are four kinds of tokens: (, ), symbols and
numbers. I have deliberately left out strings, because these four
tokens should suffice to get a simple calculator-like language to
compile to WASM. Since WASM does not support strings natively, adding
strings is more complicated than working with numbers. Since I don't
have much experience, my goal is to get the whole pipeline to work and
then move on to more complicated matters.
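As a sanity check, this is the token sequence I expect the finished lexer to eventually produce for a tiny expression like (+ 1 2). The spans again assume zero-based offsets with an exclusive end, and the helper function is only for illustration.

// Expected tokens for the input `(+ 1 2)`.
fn plus_one_two_tokens() -> Vec<Token> {
    vec![
        Token::LParen(Span { start: 0, end: 1 }),
        Token::Symbol(Span { start: 1, end: 2 }, "+".to_string()),
        Token::Number(Span { start: 3, end: 4 }, 1.0),
        Token::Number(Span { start: 5, end: 6 }, 2.0),
        Token::RParen(Span { start: 6, end: 7 }),
    ]
}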
Now let's move on to modeling a lexer. A lexer needs to be able to
consume characters, track position information and report errors. In
Kotlin I would just define a data class Lexer and store the input as a
String. In Rust, however, returning a character at a position from a
String or &str is pretty finicky and I have yet to fully
understand these types. So I have essentially stolen the design from
the gleam compiler. Gleam is a functional
language that compiles to BEAM (the Erlang VM) and to JavaScript.
They define a lexer as an Iterator that produces items of
LexResult and takes an Iterator of chars as input.
#[derive(Debug)]
enum Error {
    BadChar(u32, char),
}

type LexResult = Result<(Span, Token), Error>;
#[derive(Debug)]
struct Lexer<T: Iterator<Item = char>> {
    input: T,
    pos: u32,
    pos_start: u32,
    char0: Option<char>,
}
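The post has not defined a constructor yet, so here is a minimal sketch of how I would build a Lexer from an iterator of characters; Lexer::new is my own addition, not part of the gleam design.

impl<T: Iterator<Item = char>> Lexer<T> {
    // Hypothetical constructor: start at position 0 with an empty peek buffer.
    fn new(input: T) -> Self {
        Lexer {
            input,
            pos: 0,
            pos_start: 0,
            char0: None,
        }
    }
}

It would be used like let lexer = Lexer::new("(+ 1 2)".chars());.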
char0 is used to allow peeking at the next char in the input
without “moving the iterator forward”. Imagine you are at the end of
reading a number digit by digit.
3.14ab
   ^
If there were no way to peek at the next character, we would need to
advance the Iterator by one character, determine that a is not a
digit, and lose a in the process, because we cannot backtrack and
there is no way to store a.
With char0 though, we can implement a peek operation that asks
the input for the next character and stores it.
impl<T: Iterator<Item = char>> Lexer<T> {
    fn peek(&mut self) -> Option<char> {
        if self.char0.is_none() {
            self.char0 = self.input.next();
        }
        self.char0
    }
Now, when we actually want to advance and take the next character from
the input, we first try to take the value from char0 and if it does
not exist, we get the next value from our input.
    fn advance(&mut self) -> Option<char> {
        self.pos += 1;
        // Take the buffered char if there is one, otherwise pull the next one from the input.
        self.char0.take().or_else(|| self.input.next())
    }
}
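To check that the two primitives behave the way I described them, here is a small usage sketch based on the 3.14ab example. It relies on the hypothetical Lexer::new from above, and the assertions encode my assumptions rather than anything from the original design.

fn main() {
    let mut lexer = Lexer::new("3.14ab".chars());

    // peek looks at the next character without consuming it,
    // so calling it twice yields the same result.
    assert_eq!(lexer.peek(), Some('3'));
    assert_eq!(lexer.peek(), Some('3'));

    // advance consumes characters and moves the position forward.
    assert_eq!(lexer.advance(), Some('3'));
    assert_eq!(lexer.advance(), Some('.'));
    assert_eq!(lexer.peek(), Some('1'));
}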
Conclusion
That concludes the first part of implementing a lexer. This article
laid the groundwork by structuring the basic parts: errors, position
information and tokens. The next part will focus on using advance
and peek to implement lexing the tokens we defined in this post
(symbols, numbers, ( and )). It will also tackle the
implementation of Iterator<Item = LexResult> for the Lexer to
allow using it later in the parsing phase, and it will contain some
tests to make sure the lexer works as intended.
Stay tuned, cheers.