CS 320: Programming Languages — OCaml Interpreter

CS 320 · Fall 2023

The Parser — Recursive Descent over a BNF Grammar

How the OCaml interpreter turns source text into an AST without a parser generator.

The parser is the bridge between human syntax and the AST that the evaluator consumes. I built it entirely by hand—no lex/yacc, no regex, no parser combinators. Just pattern matching and recursion. This forced me to understand tokenization deeply.

The tokenizer

A token is one of four things: NEWLINE, STRING of string, NUMBER of int, or SYMBOL of string.

type token =
  | NEWLINE
  | STRING of string
  | NUMBER of int
  | SYMBOL of string

The tokenizer walks the source character by character, maintaining a line number for error reporting. Three helper functions classify characters:

let is_space (c : char): bool =
  c = ' ' || c = '\t'

let is_digit (c : char): bool =
  match c with | '0' .. '9' -> true | _ -> false

let is_alpha (c : char): bool =
  match c with | 'a' .. 'z' | 'A' .. 'Z' -> true | _ -> false

When we hit a digit or letter, we need to consume the entire number or symbol. For numbers, we handle negation by looking ahead for a - and then parsing the integer:

let tokenize_number line_num (ls : char list): (int * char list, lexer_err) result =
  let rec helper ls acc =
    match ls with
    | [] -> Ok(acc, [])
    | ch::rst ->
       if ch = '\n' || is_space ch then
         Ok(acc, ls)
       else if is_digit ch then
         (* int_of_char returns the character code, so subtract the code of '0'
            to get the digit's numeric value *)
         helper rst @@ (acc * 10) + (int_of_char ch - int_of_char '0')
       else Error(InvalidChar(line_num, ch))
  in helper ls 0

This accumulates digits in base 10: each step multiplies the accumulator by 10 and adds the numeric value of the new digit. When we hit a space or newline, we stop. Symbols work the same way but allow underscores and are case-sensitive.
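The accumulation step can be checked in isolation. This digits_to_int fold mirrors the helper's arithmetic (the name is mine, for illustration; it is not part of the interpreter):

```ocaml
(* Mirrors tokenize_number's accumulation: multiply by 10, add the digit value.
   digits_to_int is a hypothetical helper, not interpreter code. *)
let digits_to_int (ds : char list) : int =
  List.fold_left
    (fun acc ch -> acc * 10 + (int_of_char ch - int_of_char '0'))
    0 ds

let () = assert (digits_to_int ['3'; '2'; '0'] = 320)
```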

Strings are quoted and must close. We validate that each character inside is alphabetic:

let tokenize_string line_num (ls : char list): (string * char list, lexer_err) result =
  let rec helper ls acc =
    match ls with
    | [] -> Error(UnclosedString(line_num))
    | ch::rst ->
       if ch = '\"' then
         Ok(acc, rst)
       else if is_alpha ch then
         helper rst (acc ^ String.make 1 ch)
       else
         Error(InvalidChar(line_num, ch))
  in helper ls ""
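The snippets below chain these result values with and_then. Its definition isn't shown in this section; my assumption is that it's Result.bind with the function as the first argument, so that `res |> and_then @@ fun x -> ...` reads left to right:

```ocaml
(* Assumed definition of and_then: bind on result, function first, so it
   composes with |> and @@ as used throughout the tokenizer and parser. *)
let and_then (f : 'a -> ('b, 'e) result) (res : ('a, 'e) result)
            : ('b, 'e) result =
  match res with
  | Ok x -> f x
  | Error e -> Error e

let () =
  assert ((Ok 2 |> and_then @@ fun n -> Ok (n + 1)) = Ok 3);
  assert ((Error "boom" |> and_then @@ fun n -> Ok (n + 1)) = Error "boom")
```

Because `@@` binds tighter than `|>`, the idiom `res |> and_then @@ fun x -> ...` parses as `and_then (fun x -> ...) res`, which is why every chained step in the code below works without extra parentheses.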

The main tokenization loop dispatches on the first character:

let tokenize_source (src : string): (token list, lexer_err) result =
  let rec helper line_num ls acc =
    match ls with
    | [] -> Ok(List.rev acc)
    | ch::rst ->
       match ch with
       | '\n' -> helper (line_num + 1) rst (NEWLINE :: acc)
       | '\"' -> tokenize_string line_num rst
                 |> and_then @@ fun (s, rst) ->
                 helper line_num rst (STRING(s) :: acc)
       | '-'  -> tokenize_number line_num rst
                 |> and_then @@ fun (n, rst) ->
                 helper line_num rst (NUMBER(-1 * n) :: acc)
       | ch when is_digit ch
              -> tokenize_number line_num (ch::rst)
                 |> and_then @@ fun (n, rst) ->
                 helper line_num rst (NUMBER(n) :: acc)
       | ch when is_alpha ch
              -> tokenize_symbol line_num ls
                 |> and_then @@ fun (s, rst) ->
                 helper line_num rst (SYMBOL(s) :: acc)
       | ch when is_space ch -> helper line_num rst acc
       | ch -> Error(UnknownChar(line_num, ch))
  in helper 1 (char_list_of_string src) []

The result is a flat list of tokens. At this point, the grammar structure is still implicit; the parser will reify it.

The recursive-descent parser

The parser is a set of mutually recursive functions that consume tokens and produce AST nodes. The key insight: each function tries to match a specific production rule, and if it succeeds, it returns the parsed node plus the remaining tokens.

The main parsing functions have this type:

parse_com : int -> token list -> (com * int * token list, parse_err) result

It returns the parsed command, the current line number, and leftover tokens. Line numbers are threaded through for error reporting.

Constants are simple: numbers become Int, strings become String, and symbols become Name(var) after validation (variables must start with a lowercase letter):

let make_var line_num (s : string): (var, parse_err) result =
  if String.length s <> 0 then
     match String.get s 0 with
     | 'a' .. 'z' -> Ok(Var(s))
     | _ -> Error(InvalidVar(line_num, SYMBOL(s)))
  else Error(InvalidVar(line_num, SYMBOL(s)))
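A quick check of the validation rule, with the surrounding types stubbed out so it runs standalone (the interpreter's real var, token, and parse_err definitions carry more constructors):

```ocaml
(* Stub types so the validator runs standalone; simplified stand-ins only. *)
type var = Var of string
type token = SYMBOL of string
type parse_err = InvalidVar of int * token

let make_var line_num (s : string) : (var, parse_err) result =
  if String.length s <> 0 then
    match String.get s 0 with
    | 'a' .. 'z' -> Ok (Var s)
    | _ -> Error (InvalidVar (line_num, SYMBOL s))
  else Error (InvalidVar (line_num, SYMBOL s))

let () =
  assert (make_var 1 "x" = Ok (Var "x"));
  assert (make_var 1 "Foo" = Error (InvalidVar (1, SYMBOL "Foo")))
```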

Parsing commands is pattern matching on the first token:

match ls with
| SYMBOL("Push")  ::fst::rst ->
   parse_const line_num fst |> and_then @@ fun x ->
   Ok(Push(x), line_num, rst)
| SYMBOL("Quit")  ::rst -> Ok(Quit, line_num, rst)
| SYMBOL("Add")   ::rst -> Ok(Add, line_num, rst)
| ...

Zero-arity commands like Quit and Add consume a single token. Unary commands like Push consume the next token as an argument.

Blocks: BeginEnd and IfThenElse parse nested command lists. The grammar looks like:

BeginEnd ::= "Begin" NEWLINE prog "End" NEWLINE
IfThenElse ::= "IfThen" NEWLINE prog "Else" NEWLINE prog "End" NEWLINE

In code:

| SYMBOL("Begin") ::rst ->
   consume_newline line_num false rst
   |> and_then @@ fun (line_num, rst) ->
   parse_com_list line_num (SYMBOL("End")) rst
   |> and_then @@ fun (blk, line_num, rst) ->
   Ok(BeginEnd(blk), line_num, rst)

parse_com_list recursively parses commands until it hits a terminator token:

and parse_com_list (line_num : int) (terminator : token) (ls : token list)
                 : (prog * int * token list, parse_err) result =
    match ls with
    | fst::rst when fst = terminator -> Ok([], line_num, rst)
    | _  -> parse_com line_num ls
              |> and_then @@ fun (fst, line_num, ls') ->
            consume_newline line_num true ls'
              |> and_then @@ fun (line_num, ls'') ->
            parse_com_list line_num terminator ls''
              |> and_then @@ fun (rst, line_num, ls''') ->
            Ok((fst::rst, line_num, ls'''))

Each command is parsed, a newline is consumed, and the rest of the program is parsed; the parsed command is then consed onto the front of the result. (Strictly speaking this is not tail-recursive, since the cons happens after the recursive call returns, but the input programs here are short.)
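The same recursion shape, stripped of line numbers and error detail, works on any list. This take_until is an illustrative stand-in, not part of the parser:

```ocaml
(* Consume elements until a sentinel value, returning the prefix and the rest.
   Mirrors parse_com_list's shape: recurse, then cons onto the front. *)
let rec take_until (sentinel : 'a) (ls : 'a list)
                  : ('a list * 'a list, string) result =
  match ls with
  | [] -> Error "unexpected end of input"
  | x :: rst when x = sentinel -> Ok ([], rst)
  | x :: rst ->
     (match take_until sentinel rst with
      | Ok (body, rest) -> Ok (x :: body, rest)
      | Error e -> Error e)

let () = assert (take_until 0 [1; 2; 3; 0; 4] = Ok ([1; 2; 3], [4]))
```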

Mutual recursion: The Fun_Mutual construct allows multiple function definitions. The parser recursively consumes (function name, parameter, body) triples, each body terminated by a Mut token:

and parse_functionslist (line_num: int) (ls: token list):
                ((var*var*prog) list * int * token list, parse_err) result =
    match ls with
    | SYMBOL(funcName)::SYMBOL(varName)::rst ->
      make_var line_num funcName |> and_then @@ fun f ->
      make_var line_num varName  |> and_then @@ fun f2 ->
        consume_newline line_num false rst
        |> and_then @@ fun (line_num, rst) ->
          (match parse_com_list line_num (SYMBOL("Mut")) rst with
          | Ok(prog1, line_num1, rst1) ->
              parse_functionslist line_num1 rst1
              |> and_then @@ fun(fList, line_num, rst) ->
                Ok((f, f2, prog1)::fList, line_num, rst)
          | _ -> Error(UnexpectedEOF(line_num)))
    | _ -> Error(UnexpectedEOF(line_num))

Example: tokenize and parse

Source:

Push 5
Push 3
Add
Quit

Tokens:

[SYMBOL("Push"), NUMBER(5), NEWLINE,
 SYMBOL("Push"), NUMBER(3), NEWLINE,
 SYMBOL("Add"), NEWLINE,
 SYMBOL("Quit"), NEWLINE]

AST:

[Push(Int(5)),
 Push(Int(3)),
 Add,
 Quit]

This is a valid program: it pushes 5, pushes 3, adds them (8 on the stack), and quits (printing "8").
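To close the loop, here is a minimal stack evaluator for just these four commands. This is my sketch of the semantics described above, not the course's evaluator:

```ocaml
type const = Int of int
type com = Push of const | Add | Quit

(* Evaluate a program against a stack of ints; Quit returns the stack top as
   the "printed" result. A sketch covering only this four-command fragment. *)
let rec eval (stack : int list) (prog : com list) : string option =
  match prog, stack with
  | Quit :: _, top :: _ -> Some (string_of_int top)
  | Push (Int n) :: rest, _ -> eval (n :: stack) rest
  | Add :: rest, a :: b :: stk -> eval ((a + b) :: stk) rest
  | _ -> None  (* empty program, stack underflow, etc. *)

let () =
  assert (eval [] [Push (Int 5); Push (Int 3); Add; Quit] = Some "8")
```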

What I learned

Hand-writing the lexer and parser forced me to see that tokenization and parsing are separate concerns. The tokenizer knows nothing about grammar—it just produces a flat stream of tokens. The parser then imposes structure by recognizing patterns. That separation is elegant, and it's why parsers can be so small: they don't reimplement character-by-character logic.

The Result type monad (and_then) was essential. Every function could fail at any point, and threading error information through deeply nested calls is tedious without monadic composition. By the end, I was chaining five or six and_then calls in a row, and it was readable.

The tricky part was handling newlines correctly. The grammar requires them after every command, but not always before terminators. consume_newline takes a required flag so we can be flexible—if there's a terminator next, we don't demand a newline.
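For concreteness, here is how that flag could look. The token and error types are simplified stand-ins, and the real consume_newline may differ in details such as line counting:

```ocaml
type token = NEWLINE | SYMBOL of string
type parse_err = MissingNewline of int

(* Sketch: eat one NEWLINE, bumping the line count; if `required` is false,
   tolerate its absence (e.g. when a terminator follows directly). *)
let consume_newline (line_num : int) (required : bool) (ls : token list)
                   : (int * token list, parse_err) result =
  match ls with
  | NEWLINE :: rst -> Ok (line_num + 1, rst)
  | _ when not required -> Ok (line_num, ls)
  | _ -> Error (MissingNewline line_num)

let () =
  assert (consume_newline 3 true [NEWLINE; SYMBOL "End"] = Ok (4, [SYMBOL "End"]));
  assert (consume_newline 3 false [SYMBOL "End"] = Ok (3, [SYMBOL "End"]))
```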