Notes on building lexers

The following notes detail my preferred strategy for building UTF-8 lexers in C++. This strategy uses little specific to C++, and could be easily adapted to C with minimal effort.

If you are unfamiliar with Unicode or the idea of a DFA and how it can be used to implement regular languages, you may want to read about them before proceeding.

A lexer class will have the following instance variables:

state — An unsigned integer representing the current finite state of the lexer. Begins at zero, which represents the “drifting” state outside of any particular lexeme.
s — A std::string or other suitable expandable string structure which accumulates characters or bytes of input for placement into the next token. Begins empty.
c — A 32-bit integer value which represents either a byte or a Unicode codepoint. Unicode does not use negative values and is capped at 0x10FFFF, so it doesn't particularly matter whether you make this signed or not.
buf_next, buf_end — Pointers to the next character and the end of the buffer respectively.
midchar — True if a potentially valid but incomplete UTF-8-encoded character was found at the end of the current buffer. The requisite number of extra bytes will be taken from the next buffer and this will then be set to false.
residual — A six byte array for the assembly of characters spanning two input buffers. This is used with midchar. The first half of the character is placed at the start and the second half is placed afterwords. residual_length indicates the length of the first half.

Binary Handling: The following fields form part of an optional extension to allow the lexer to handle binary data. You may want a binary-capable lexer, for instance, if you are handling netstrings or other length-prefix constructs which permit the incorporation of binary data into otherwise textual documents. The lexer must be switched into and out of binary mode by appropriate state handlers.

bytemode — A boolean value which indicates whether the input buffer is being read in binary mode. If this is false, the input buffer is being read in UTF-8 mode.
bytes_remaining — Counts how many bytes remain of a length-prefixed construct. Typically, a state handler would decrement this upon every call, and disable byte mode (and change state, and perhaps reissue) once it reaches zero.

The class should also have the following methods. Methods beginning with underscores are protected.

Reset() — resets the lexer to its state at instantiation time.
SetBuffer(const char *buf, unsigned buf_len) — sets the input buffer held in memory which the lexer should use. This must remain allocated until a subsequent call to SetBuffer or Reset or until the lexer is destroyed or reaches the end of the buffer.
GetNextToken(Token &) — retrieves the next token and writes it into the passed token structure, or returns an EOF error.
This function is a loop which repeatedly calls _Advance and _StateDispatch(Token &), thereby dispatching over every character in the input. It may dispatch repeatedly over the same input character if it the character is reissued.
GetFinalToken(Token &) — this submits the EOF symbol to the lexer and retrieves any resulting token, if any. For example, lexing the string “lexeme1 lexeme2”, assuming lexing rules for any typical programming language, will not yield the latter lexeme until GetFinalToken is called, because only then can the lexer know that “lexeme2” is not merely the start of a longer lexeme to be continued in a subsequent input buffer passed to it.
_StateDispatch(Token &) — switch(state) block which dispatches to the internal method implementing that state. The supplied token reference is forwarded. This method, and the state handlers called by it, are expected to return zero if a token was not yielded, one if a token was yielded (in which case all functions must return back to GetNextToken, which itself must return to the caller), or a negative value in the event of an error.
_Advance() — the lexer advances in the input buffer, and c changes. If the lexer is in bytemode it must advance by one byte, and 0 ≤ c ≤ 255; otherwise, c contains a Unicode codepoint. The method may error if it has reached the end of the buffer, or if it is not in bytemode and there is insufficient data in the buffer to yield a Unicode codepoint. If malformed UTF-8 is encountered, the lexer should advance its position in the buffer by one byte and set c to zero. (Because zero is not a valid Unicode codepoint, it can be used by state handlers to test for an error condition. A typical Unicode lexer will itself error when faced with a zero value, so this policy essentially delegates error handling to state handlers.)
_Reissue() — causes a character to be reissued, meaning that it is dispatched again when _StateDispatch runs rather than advancing to the next character. This is useful when changing states. For example, suppose that you are lexing a language in which both numbers and words can appear; in the zero (“drifting”) state, you examine the current character and change states based on its classification as a digit or a letter. However, because the character in question forms the first character of a lexeme, it must be handled by its appropriate state handler, so that it can be properly recorded for inclusion in the generated token. Calling _Reissue allows a character to be dispatched again to a different state than the one which it was originally dispatched to. _Reissue must not be called more than once between calls to _StateDispatch.
_IsEnd() — true if at end of buffer.

The following psuedocode provides a general outline.

struct Lexer:
  void Reset()
    ...
  void SetBuffer(const char *buf, unsigned buf_len)
    ...
  int GetNextToken(tok)
    while !_IsEnd() || reissue
      if _reissue: _reissue := false
      else
        _Advance()         (return if fail)
      _StateDispatch(tok)  (return if != 0)

    return 'EOF Error'

  int GetFinalToken(tok)
    if done: return 'EOF Error'
    done := true
    c := 0
    error_code := _StateDispatch(tok)   (return if error)
    if state ≠ 0: return 'Unexpected Error'
    return 0 if error_code = 1 [Token Yielded], 'EOF Error' otherwise

  bool _IsEnd(): ...
  bool _Reissue(): reissue := true

  int _Advance()
    if _IsEnd(): return 'EOF Error'
    if bytemode
      _c = *buf_next++
    else
      var unsigned v
      var int nr
      if midchar
        extra := 6 - residual_len
        memcpy(residual + residual_len, buf_next, extra)
        num_bytes_read := utf8_to_utf32(residual, 6, out v)
          (return 'Invalid Argument Error' if fail)

        nr := nr - residual_len
        midchar := false
      else
        nr := utf8_to_utf32(buf_next, buf_end-buf_next, out v)
        if nr = 'EOF Error'
          midchar := true
          residual_len := buf_end - buf_next
          memcpy(residual, buf_next, residual_len)
          buf_next := buf_end
          return 'EOF Error'
        else if nr < 0
          return nr

      buf_next := buf_next + nr
      c := v

      if c = '\n'
        linenum := linenum + 1
        colnum := 1
      else
        colnum := colnum + 1

      return 0

  int _StateDispatch(tok)
    switch state
      case 0: return _SDefault(tok)
      case ...: return ...(tok)
      ...

      default: UNREACHABLE (assert)

  bool done, reissue, midchar
  unsigned state, linenum, column, c, residual_len
  char residual[6]
  const char *buf_next, *buf_end