Lexer Generator
The lexer generator converts regular expression patterns into a deterministic finite automaton (DFA) that tokenizes input text.
How It Works
- Parsing: The
.lfile is parsed to extract patterns, actions, and declarations. - NFA Construction: Each regex pattern is converted to an NFA using Thompson's construction.
- NFA Combination: All pattern NFAs are combined with epsilon transitions.
- DFA Construction: The combined NFA is converted to a DFA using subset construction.
- DFA Minimization: The DFA is minimized to reduce state count.
- Code Generation: The DFA transition table is emitted in the target language.
Input Format
Lexer specifications use the Flex .l file format:
%{
/* C/Python/Java prologue code */
%}
/* Definitions */
DIGIT [0-9]
LETTER [a-zA-Z]
%%
/* Rules */
{DIGIT}+ { return NUMBER; }
{LETTER}+ { return IDENTIFIER; }
%%
/* Epilogue code */
Output
The generated lexer provides:
- A
Lexerclass (or struct in C) - A
next_token()function that returns the next token - Token type constants
- Access to matched text via
yytext - Test driver (optional) for immediate testing
Test Driver
When enabled, the generated lexer includes a test driver that lets you verify tokenization:
# Python
python lexer.py "3 + 4 * 2"
# C
./lexer "hello world"
# Java
java Lexer "test input"
See Test Drivers for details.
Using with Parser
The lexer's output feeds into the parser. Ensure token names match between your .l and .y files:
[0-9]+ { return NUMBER; } /* Must match %token NUMBER in parser */
See Combining Lexer and Parser for complete integration examples.
Chapters
- Lexer File Format - Complete
.lfile syntax - Regular Expressions - Supported regex syntax
- Actions and Return Values - Writing token actions
- Start Conditions - Conditional lexing