Oct 16 '14
Make a Lexer with Go
Language parsing has always been very interesting to me. There are lots of different ways to parse languages. Today we will implement a simple lexer.
Strictly speaking, a lexer is a program that analyzes a string or sentence and produces a list of tokens representing its contents. Often, a lexer generates tokens and a parser then consumes those tokens, building a normalized AST of the code or sentence being parsed.
The Go standard library provides an excellent example of a lexer to work from in its text/template package. The code is straightforward and fairly easy to understand. We can see that it works under the following premise:
- Starts by lexing text until it reaches an opening "{{"
- Lexes inside the action with various functions to determine keywords and expressions
- Lexes the closing "}}" and returns to lexing text until EOF
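Roughly speaking, for a small template like "Hello, {{.Name}}!", that lexer produces a stream of items along these lines (the names here are illustrative; the real item types are unexported inside text/template/parse):

TEXT         "Hello, "
LEFT DELIM   "{{"
FIELD        ".Name"
RIGHT DELIM  "}}"
TEXT         "!"
EOF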
Make a simple lexer that can parse an English sentence
At its very core, a sentence in English has: whitespace, words, and punctuation. Obviously, a sentence is much more complex than that, with words having a type and different punctuations causing different inflections and meanings. However, we will simplify the problem initially to keep the example concise.
First, let's define our tokens:
type tokenType int

const (
	tokenWord tokenType = iota
	tokenWhitespace
	tokenPunctuation
	tokenEof
)

var names = map[tokenType]string{
	tokenWord:        "WORD",
	tokenWhitespace:  "SPACE",
	tokenPunctuation: "PUNCTUATION",
	tokenEof:         "EOF",
}
We have defined four tokens: a token for words, a token for whitespace, a token for punctuation, and an "EOF" token signifying the end of the input.
Now, since this is a lexer, we should have a few regular expressions:
var wordRegexp = regexp.MustCompile("[A-Za-z]+")
var whitespaceRegexp = regexp.MustCompile("[\\s]+")
var punctuationRegexp = regexp.MustCompile("[\\p{P}\\p{S}]+")
These are pretty straightforward regular expressions: the first matches a run of letters, the second matches a run of whitespace (including newlines and tabs), and the last matches a run of punctuation or symbol characters.
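As a quick, illustrative check (assuming the variables above along with the fmt and regexp imports), each pattern grabs the matching run from the front of a string:

fmt.Println(wordRegexp.FindString("Hello, world"))     // "Hello"
fmt.Println(whitespaceRegexp.FindString(" \t\nnext"))  // " \t\n"
fmt.Println(punctuationRegexp.FindString("?! really")) // "?!"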
Next, we will define our token structure and a String() method, which will pretty-print the token when we pass it to fmt.Println:
type token struct {
	value     string
	pos       int
	tokenType tokenType
}

func (tok token) String() string {
	return fmt.Sprintf("{%s '%s' %d}", names[tok.tokenType], tok.value, tok.pos)
}
Now that we have our tokens and regexes defined, let's start on the lexer itself.
type stateFn func(*lexer) stateFn

type lexer struct {
	start  int // The position of the last emission
	pos    int // The current position of the lexer
	input  string
	tokens chan token
	state  stateFn
}

func (l *lexer) next() (val string) {
	if l.pos >= len(l.input) {
		l.pos++
		return ""
	}
	val = l.input[l.pos : l.pos+1]
	l.pos++
	return
}

func (l *lexer) backup() {
	l.pos--
}

func (l *lexer) peek() (val string) {
	val = l.next()
	l.backup()
	return
}

func (l *lexer) emit(t tokenType) {
	val := l.input[l.start:l.pos]
	tok := token{val, l.start, t}
	l.tokens <- tok
	l.start = l.pos
}
As you can see, we've created a named function type called stateFn. These stateFns are what actually perform the lexing, with each stateFn returning another stateFn, or nil once it has reached the end of input.
Then, we define our lexer with its four methods (a short walkthrough follows the list):
- next() - advances the lexer's position by one and returns the character it just consumed
- backup() - moves the lexer's position back by one
- peek() - returns the next character without moving the lexer's position
- emit() - emits a token whose contents run from l.start up to the current position
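To get a feel for how these fit together, here is a small, hypothetical walkthrough (it uses a buffered channel so that emit doesn't block; the real newLexer defined later uses an unbuffered one):

l := &lexer{input: "Hi!", tokens: make(chan token, 4)}

fmt.Println(l.peek()) // "H" - peek looks ahead without moving pos
fmt.Println(l.next()) // "H" - next consumes a character; pos is now 1
fmt.Println(l.next()) // "i" - pos is now 2
l.emit(tokenWord)     // sends {WORD 'Hi' 0} and moves start up to pos

fmt.Println(<-l.tokens) // {WORD 'Hi' 0}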
Now that we have our lexer, we can create the actual lexing functions. First, let's define a word lexing function:
func lexWord(l *lexer) stateFn {
	matched := wordRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenWord)
	return lexData
}
We can see this function matches the first word (as defined by our regular expression) starting at the input's current position. It emits this word as a token and then returns a yet-to-be-defined stateFn.
Let's define our whitespace and punctuation lexing functions:
func lexPunctuation(l *lexer) stateFn {
	matched := punctuationRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenPunctuation)
	return lexData
}

func lexWhitespace(l *lexer) stateFn {
	matched := whitespaceRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenWhitespace)
	return lexData
}
These are both very similar to our lexWord function. Now that we have the main workers created, let's define the lexData function:
func lexData(l *lexer) stateFn {
	v := l.peek()
	switch {
	case v == "":
		l.emit(tokenEof)
		return nil
	case punctuationRegexp.MatchString(v):
		return lexPunctuation
	case whitespaceRegexp.MatchString(v):
		return lexWhitespace
	}
	return lexWord
}
We can see that lexData is responsible for deciding which lexing function to use. It peeks at the next character and checks it for EOF, punctuation, and whitespace, defaulting to the lexWord function if none of those match.
Finally, we define a helper tokenize method and a newLexer function to help kick things off:
func (l *lexer) tokenize() {
	for l.state = lexData; l.state != nil; {
		l.state = l.state(l)
	}
}

func newLexer(input string) *lexer {
	return &lexer{0, 0, input, make(chan token), nil}
}
We can see that when we call l.tokenize(), the lexer will start with lexData and continue lexing until the current state is nil.
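The lexer above signals completion with an explicit EOF token, which the main function below relies on. A common alternative (just a sketch, not what this program does) is to close the channel once lexing finishes, so callers can range over the tokens:

func (l *lexer) tokenize() {
	for l.state = lexData; l.state != nil; {
		l.state = l.state(l)
	}
	close(l.tokens) // lets consumers use `for tok := range l.tokens`
}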
Now, in our main function, we can try out our lexer by first instantiating it and then sending the lexer off in its own goroutine to complete its work:
func main() {
	lex := newLexer("This   is  a test-aculous test, sir...")
	go lex.tokenize()

	for {
		tok := <-lex.tokens
		fmt.Println(tok)
		if tok.tokenType == tokenEof {
			break
		}
	}
}
View the entire source code here.
If we now try out our program, we get the following output:
{WORD 'This' 0}
{SPACE '   ' 4}
{WORD 'is' 7}
{SPACE '  ' 9}
{WORD 'a' 11}
{SPACE ' ' 12}
{WORD 'test' 13}
{PUNCTUATION '-' 17}
{WORD 'aculous' 18}
{SPACE ' ' 25}
{WORD 'test' 26}
{PUNCTUATION ',' 30}
{SPACE ' ' 31}
{WORD 'sir' 32}
{PUNCTUATION '...' 35}
{EOF '' 38}
Pretty neat! We can easily add additional tokens by defining a new token, a regular expression, and a stateFn.
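For instance (a sketch following the same structure, not part of the original program), teaching the lexer about numbers would mean a new tokenNumber constant in the const block and names map, a digit pattern, one more stateFn, and an extra case in lexData's switch:

// Added to the existing const block (after tokenEof): tokenNumber
// Added to the names map: tokenNumber: "NUMBER",

var numberRegexp = regexp.MustCompile("[0-9]+")

func lexNumber(l *lexer) stateFn {
	matched := numberRegexp.FindString(l.input[l.pos:])
	l.pos += len(matched)
	l.emit(tokenNumber)
	return lexData
}

// And one more case in lexData's switch:
//	case numberRegexp.MatchString(v):
//		return lexNumber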
We can also use other means of matching, which may be faster than regular expressions. For instance, text/template uses strings.HasPrefix to decide which lexing function to use, and then it iterates over each character rather than matching with regular expressions.
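As a rough illustration of that idea (an alternative sketch, assuming the unicode package is imported; not how the code above works), whitespace could be lexed character by character without a regular expression:

func lexWhitespaceNoRegexp(l *lexer) stateFn {
	for {
		v := l.next()
		if v == "" {
			l.backup() // undo the step past the end of input
			break
		}
		if !unicode.IsSpace(rune(v[0])) {
			l.backup() // not whitespace; leave it for the next stateFn
			break
		}
	}
	l.emit(tokenWhitespace)
	return lexData
}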
Additional Resources
- Lexical Scanning in Go - an excellent talk by Rob Pike on creating a lexer