October 16, 2014

Make a Lexer with Go

Language parsing has always been very interesting to me. There are lots of different ways to parse languages. Today we will implement a simple lexer.

Now, strictly speaking, a lexer is a program that analyzes a string or sentence, producing a list of tokens that represent its contents. Often, a lexer generates tokens and a parser then consumes those tokens, building a normalized AST of the code or sentence being parsed.

Check out this answer on Stack Overflow for a much better understanding on the difference between lexing and parsing.

The Go standard library provides an excellent example of a lexer to work from in its text/template package. The code is straightforward and fairly easy to understand. We can see that it works under the following premise:

  1. Starts by lexing text until it reaches an opening "{{"
  2. Lexes inside the action with various functions to determine keywords and expressions
  3. Lexes the closing "}}" and returns to lexing text until EOF
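
We can observe this split between plain text and actions by using the exported text/template/parse package, which drives that lexer. This is a rough, minimal demonstration of my own (not something we need for the rest of this post):


package main

import (
	"fmt"
	"text/template/parse"
)

func main() {
	// Parse a tiny template; the lexer/parser pair splits it into plain
	// text nodes and action nodes delimited by "{{" and "}}".
	trees, err := parse.Parse("demo", "Hello, {{.Name}}! Bye.", "{{", "}}")
	if err != nil {
		panic(err)
	}

	for _, node := range trees["demo"].Root.Nodes {
		fmt.Printf("%T: %s\n", node, node)
	}
}

Each action node printed here corresponds to a region the lexer scanned between the opening and closing delimiters.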

Make a simple lexer that can parse an English sentence

At its very core, a sentence in English has: whitespace, words, and punctuation. Obviously, a sentence is much more complex than that, with each word having a type and different punctuation marks conveying different inflections and meanings. However, we will simplify the problem to keep the example concise.

First, let's define our tokens:


type tokenType int

const (
	tokenWord tokenType = iota
	tokenWhitespace
	tokenPunctuation
	tokenEof
)

var names = map[tokenType]string{
	tokenWord:        "WORD",
	tokenWhitespace:  "SPACE",
	tokenPunctuation: "PUNCTUATION",
	tokenEof:         "EOF",
}

We have defined four tokens: a token for words, a token for whitespace, a token for punctuation, and an "EOF" token signifying the end of the input.

Since this lexer will rely on regular expressions to match tokens, let's define those next:


var wordRegexp = regexp.MustCompile(`[A-Za-z]+`)
var whitespaceRegexp = regexp.MustCompile(`\s+`)
var punctuationRegexp = regexp.MustCompile(`[\p{P}\p{S}]+`)

These are pretty straightforward regular expressions: the first matches a run of letters, the second matches a run of whitespace (including newlines and tabs), and the last matches a run of punctuation or symbol characters.
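
As a quick sanity check (not part of the lexer itself), we can probe each pattern against a sample string. Assuming the three variables above are in scope, each FindString call returns the leftmost match:


fmt.Println(wordRegexp.FindString("Hello, world!"))           // Hello
fmt.Printf("%q\n", whitespaceRegexp.FindString("one \t two")) // " \t "
fmt.Println(punctuationRegexp.FindString("wait... what?"))    // ...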

Next, we will define our token structure and a String() method, which pretty-prints a token when we pass it to fmt.Println:


type token struct {
	value     string
	pos       int
	tokenType tokenType
}

func (tok token) String() string {
	return fmt.Sprintf("{%s '%s' %d}", names[tok.tokenType], tok.value, tok.pos)
}

Now that we have our tokens and regexes defined, let's start on the lexer itself.


type stateFn func(*lexer) stateFn

type lexer struct {
	start  int // The position of the last emission
	pos    int // The current position of the lexer
	input  string
	tokens chan token
	state  stateFn
}

// next returns the character at the current position and advances the lexer.
// It returns the empty string once the input is exhausted, still advancing
// the position so that backup works correctly after reading past the end.
func (l *lexer) next() (val string) {
	if l.pos >= len(l.input) {
		l.pos++
		return ""
	}

	val = l.input[l.pos : l.pos+1]

	l.pos++

	return
}

// backup steps the lexer back one position.
func (l *lexer) backup() {
	l.pos--
}

// peek returns the next character without consuming it.
func (l *lexer) peek() (val string) {
	val = l.next()

	l.backup()

	return
}

// emit sends a token containing everything from the last emission up to the
// current position, then marks the current position as the last emission.
func (l *lexer) emit(t tokenType) {
	val := l.input[l.start:l.pos]
	tok := token{val, l.start, t}
	l.tokens <- tok
	l.start = l.pos
}

As you can see, we've defined a function type called stateFn. These stateFns are what actually perform the lexing, with each stateFn returning the next stateFn to run, or nil once it has reached the end of the input.
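
If the pattern is unfamiliar, here is a minimal, self-contained illustration of the idea, separate from the lexer itself: each state function does a bit of work and returns the next state to run, or nil to stop.


package main

import "fmt"

type counter struct{ n int }

type stateFn func(*counter) stateFn

// countdown prints the current value and keeps returning itself as the next
// state until the counter reaches zero.
func countdown(c *counter) stateFn {
	if c.n == 0 {
		fmt.Println("done")
		return nil
	}

	fmt.Println(c.n)
	c.n--

	return countdown
}

func main() {
	c := &counter{n: 3}
	for state := countdown; state != nil; {
		state = state(c)
	}
}

The tokenize method we define later drives our lexer with exactly this kind of loop.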

Then, we define our lexer with its four methods:

next(): returns the character at the current position and then advances the lexer by one
backup(): moves the lexer's position back by one
peek(): returns the next character without moving the lexer's position
emit(): sends a token whose contents run from l.start up to the current position, then sets l.start to the current position
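
To see these methods in action, here is a small sketch, assuming the types above are in scope; the channel is buffered here only so that emit does not block without a running receiver:


l := &lexer{input: "Hi!", tokens: make(chan token, 1)}

fmt.Println(l.peek()) // prints "H"; the position is unchanged
l.next()              // consume "H"
l.next()              // consume "i"

l.emit(tokenWord)
fmt.Println(<-l.tokens) // {WORD 'Hi' 0}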

Now that we have our lexer, we can create the actual lexing functions. First, let's define a word lexing function:


func lexWord(l *lexer) stateFn {
	matched := wordRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenWord)

	return lexData
}

We can see this function matches the first word (as defined by our regular expression) starting at the input's current position. It emits this word as a token and then returns a yet-to-be-defined stateFn.

Let's define our whitespace and punctuation lexing functions:


func lexPunctuation(l *lexer) stateFn {
	matched := punctuationRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenPunctuation)

	return lexData
}

func lexWhitespace(l *lexer) stateFn {
	matched := whitespaceRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenWhitespace)

	return lexData
}

These are both very similar to our lexWord function. Now that we have the main workers created, let's define the lexData function:


func lexData(l *lexer) stateFn {
	v := l.peek()
	switch {
	case v == "":
		l.emit(tokenEof)
		return nil

	case punctuationRegexp.MatchString(v):
		return lexPunctuation

	case whitespaceRegexp.MatchString(v):
		return lexWhitespace
	}

	return lexWord
}

We can see that lexData is responsible for deciding which lexing function to use. It peeks at the next character, checks it against EOF, punctuation, and whitespace in turn, and defaults to lexWord if none of those match.

Finally, we define a tokenize method and a newLexer function to kick things off:


func (l *lexer) tokenize() {
	for l.state = lexData; l.state != nil; {
		l.state = l.state(l)
	}
}

func newLexer(input string) *lexer {
	return &lexer{input: input, tokens: make(chan token)}
}

We can see when we call l.tokenize(), the lexer will start with lexData and continue lexing until the current state is nil.

Now, in our main function, we can try out our lexer by instantiating it and then sending it off in its own goroutine to do its work:


func main() {
	lex := newLexer("This   is  a test-aculous test, sir...")

	go lex.tokenize()

	for {
		tok := <-lex.tokens
		fmt.Println(tok)
		if tok.tokenType == tokenEof {
			break
		}
	}
}

View the entire source code here.

If we now try out our program, we get the following output:

{WORD 'This' 0}
{SPACE '   ' 4}
{WORD 'is' 7}
{SPACE '  ' 9}
{WORD 'a' 11}
{SPACE ' ' 12}
{WORD 'test' 13}
{PUNCTUATION '-' 17}
{WORD 'aculous' 18}
{SPACE ' ' 25}
{WORD 'test' 26}
{PUNCTUATION ',' 30}
{SPACE ' ' 31}
{WORD 'sir' 32}
{PUNCTUATION '...' 35}
{EOF '' 38}

Pretty neat! We can easily add additional tokens by defining a new token type, a regular expression, and a stateFn.
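
For example, a number token might look something like the following sketch: a hypothetical tokenNumber would be added to the const block and the names map above, and lexData would get a numberRegexp.MatchString(v) case that returns lexNumber.


// numberRegexp and lexNumber follow the same pattern as the other tokens;
// tokenNumber is a hypothetical addition to the const block above.
var numberRegexp = regexp.MustCompile(`[0-9]+`)

func lexNumber(l *lexer) stateFn {
	matched := numberRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenNumber)

	return lexData
}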

We can also use other means of matching which may be faster than regular expressions. For instance, text/template uses strings.HasPrefix to decide which lexing function to use, and then it iterates over each character rather than matching using regular expressions.
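
As a rough illustration of that style (and not how text/template is actually written), here is a regexp-free variant of lexWhitespace that consumes one character at a time; a strings.HasPrefix check could dispatch on multi-character markers in the same way. It assumes the lexer type above is in scope and that the strings package is imported.


// lexWhitespaceNoRegexp keeps consuming characters as long as the next one is
// one of the listed whitespace characters, then emits the run as a token.
func lexWhitespaceNoRegexp(l *lexer) stateFn {
	for strings.ContainsAny(l.peek(), " \t\n\r") {
		l.next()
	}
	l.emit(tokenWhitespace)

	return lexData
}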

Additional Resources

Lexical Scanning in Go
An excellent talk by Rob Pike on creating a lexer

Tags: go, lexer, programming
