OpenLexer
OpenLexer is a lexer and parser generator written in Rust. It reads .l (Flex-compatible) and .y (Bison-compatible) specification files and generates lexers and parsers in C, Java, or Python.
Components
- Lexer Generator: Converts regular expression patterns to DFA-based lexers using Thompson construction and subset construction algorithms.
- Parser Generator: Builds LALR(1) parsing tables from context-free grammars. Supports GLR parsing for ambiguous grammars.
- Code Generation: Outputs standalone lexer and parser code in C, Java, or Python.
Supported Platforms
- Windows (x64)
- Linux (x64, ARM64)
- macOS (x64, ARM64)
File Formats
OpenLexer uses the standard Flex/Bison file formats:
.lfiles: Lexer specifications with regex patterns and actions.yfiles: Grammar specifications with production rules and semantic actions
Basic Usage
# Generate a Python lexer from calc.l
openlexer gen-lexer --lexer calc.l --lang python --output ./
# Generate a Java parser from calc.y
openlexer gen-parser --parser calc.y --lang java --output ./
# For Java: compile and run
javac Lexer.java Parser.java
java Parser "3 + 4 * 2"
Key Features
- Smart File Organization: Generated code follows language-specific best practices
- Java: One public class per file, automatic lexer detection
- C: Flexible compilation with preprocessor controls
- Python: Module-based imports
- Complete Integration: Lexer and parser work seamlessly together
- Standalone or Combined: Each component can work independently or together
- Built-in Test Drivers: Generated code includes test scaffolding
Documentation Structure
- Getting Started - Installation and first steps
- Lexer Generator - Writing lexer specifications
- Parser Generator - Writing grammar specifications
- GLR Parsing - Handling ambiguous grammars
- Output Languages - Language-specific details
- Examples - Complete working examples
License
MIT License
Getting Started
This guide will help you install OpenLexer and generate your first lexer and parser.
What You'll Learn
- How to install OpenLexer
- How to write a lexer specification (
.lfile) - How to write a grammar specification (
.yfile) - How to generate and use the output code
Prerequisites
- Basic understanding of regular expressions
- Familiarity with context-free grammars (helpful but not required)
- A compiler for your target language (gcc, javac, or python)
Overview
OpenLexer follows a two-stage process similar to Flex/Bison:
┌─────────────┐ ┌─────────────┐
│ calc.l │──────│ Lexer │──────▶ lexer.c/java/py
│ (patterns) │ │ Generator │
└─────────────┘ └─────────────┘
┌─────────────┐ ┌─────────────┐
│ calc.y │──────│ Parser │──────▶ parser.c/java/py
│ (grammar) │ │ Generator │
└─────────────┘ └─────────────┘
The lexer converts input text into tokens using regular expressions. The parser recognizes the structure of those tokens using grammar rules.
Next Steps
- Installation - Install OpenLexer on your system
- Quick Start - Build a calculator in 5 minutes
Installation
Prerequisites
- Rust toolchain (1.70 or later)
- Cargo package manager
Build from Source
git clone https://github.com/magi8101/openlexer.git
cd openlexer
cargo build --release
The binary will be at target/release/openlexer (or openlexer.exe on Windows).
Install via Cargo
cargo install --path .
This installs openlexer to ~/.cargo/bin/.
Verify Installation
openlexer --version
openlexer --help
WebAssembly Build
To build for web browsers:
cargo install wasm-pack
wasm-pack build --target web
The output will be in the pkg/ directory.
Platform Notes
Windows
No additional dependencies required. The MSVC toolchain is recommended.
Linux
Ensure build-essential or equivalent is installed for linking.
macOS
Xcode command line tools must be installed:
xcode-select --install
Quick Start
This guide walks through creating a calculator lexer and parser.
Step 1: Create the Lexer Specification
Create calc.l:
%{
/* Calculator lexer */
%}
%%
[0-9]+ { return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"(" { return LPAREN; }
")" { return RPAREN; }
[ \t\n]+ { /* skip whitespace */ }
. { return ERROR; }
%%
Step 2: Create the Grammar Specification
Create calc.y:
%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN
%left PLUS MINUS
%left TIMES DIVIDE
%%
expr:
expr PLUS expr { $$ = $1 + $3; }
| expr MINUS expr { $$ = $1 - $3; }
| expr TIMES expr { $$ = $1 * $3; }
| expr DIVIDE expr { $$ = $1 / $3; }
| LPAREN expr RPAREN { $$ = $2; }
| NUMBER { $$ = $1; }
;
%%
Step 3: Generate Code
Option A: Python
# Generate Python lexer and parser
openlexer gen-lexer --lexer calc.l -L python -o ./
openlexer gen-parser --parser calc.y -L python -o ./
This produces lexer.py and parser.py.
Use the generated code:
# Test the parser directly
python parser.py "3 + 4 * 2"
# Output: Result: 11
Or create main.py:
from parser import parse
result = parse("3 + 4 * 2")
print(f"Result: {result}")
Option B: Java
# Generate Java lexer and parser
openlexer gen-lexer --lexer calc.l -L java -o ./
openlexer gen-parser --parser calc.y -L java -o ./
This produces Lexer.java and Parser.java.
Compile and run:
# Compile both files
javac Lexer.java Parser.java
# Run parser (auto-detects Lexer.class)
java Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]
# Input: "3 + 4 * 2"
# Result: 11
Option C: C
# Generate C lexer and parser
openlexer gen-lexer --lexer calc.l -L c -o ./
openlexer gen-parser --parser calc.y -L c -o ./
This produces lexer.c and parser.c.
Compile and run:
# Compile and link
gcc -o calc parser.c -lm
# Run
./calc "3 + 4 * 2"
# Output: Result: 11
Next Steps
- Lexer File Format - Full lexer specification syntax
- Grammar File Format - Full grammar specification syntax
- Calculator Example - Complete calculator implementation
Lexer Generator
The lexer generator converts regular expression patterns into a deterministic finite automaton (DFA) that tokenizes input text.
How It Works
- Parsing: The
.lfile is parsed to extract patterns, actions, and declarations. - NFA Construction: Each regex pattern is converted to an NFA using Thompson's construction.
- NFA Combination: All pattern NFAs are combined with epsilon transitions.
- DFA Construction: The combined NFA is converted to a DFA using subset construction.
- DFA Minimization: The DFA is minimized to reduce state count.
- Code Generation: The DFA transition table is emitted in the target language.
Input Format
Lexer specifications use the Flex .l file format:
%{
/* C/Python/Java prologue code */
%}
/* Definitions */
DIGIT [0-9]
LETTER [a-zA-Z]
%%
/* Rules */
{DIGIT}+ { return NUMBER; }
{LETTER}+ { return IDENTIFIER; }
%%
/* Epilogue code */
Output
The generated lexer provides:
- A
Lexerclass (or struct in C) - A
next_token()function that returns the next token - Token type constants
- Access to matched text via
yytext - Test driver (optional) for immediate testing
Test Driver
When enabled, the generated lexer includes a test driver that lets you verify tokenization:
# Python
python lexer.py "3 + 4 * 2"
# C
./lexer "hello world"
# Java
java Lexer "test input"
See Test Drivers for details.
Using with Parser
The lexer's output feeds into the parser. Ensure token names match between your .l and .y files:
[0-9]+ { return NUMBER; } /* Must match %token NUMBER in parser */
See Combining Lexer and Parser for complete integration examples.
Chapters
- Lexer File Format - Complete
.lfile syntax - Regular Expressions - Supported regex syntax
- Actions and Return Values - Writing token actions
- Start Conditions - Conditional lexing
Lexer File Format (.l)
The lexer specification file has three sections separated by %%:
DEFINITIONS
%%
RULES
%%
USER CODE
Definitions Section
The definitions section contains:
Prologue Code
Code enclosed in %{ and %} is copied directly to the output:
%{
#include <stdio.h>
int line_count = 0;
%}
Named Patterns
Named patterns can be referenced in rules using {name}:
DIGIT [0-9]
ALPHA [a-zA-Z]
ALNUM [a-zA-Z0-9]
ID {ALPHA}{ALNUM}*
Start Condition Declarations
Declare exclusive (%x) or inclusive (%s) start conditions:
%x COMMENT
%s STRING
Rules Section
Each rule has a pattern and an action:
%%
{ID} { return IDENTIFIER; }
{DIGIT}+ { return NUMBER; }
"/*" { BEGIN(COMMENT); }
<COMMENT>"*/" { BEGIN(INITIAL); }
<COMMENT>. { /* skip */ }
[ \t\n]+ { /* skip whitespace */ }
%%
Rule Syntax
[<start_condition>]pattern { action }
- Patterns match from left to right
- Longer matches take priority
- Earlier rules break ties
- Actions are code blocks that can return tokens
Special Variables
yytext: The matched text (string)yyleng: Length of matched textyylineno: Current line number (if enabled)
User Code Section
The third section is copied verbatim to the end of the output file:
%%
int main() {
while (yylex() != 0) {
printf("Token: %s\n", yytext);
}
return 0;
}
Complete Example
%{
/* Token definitions */
#define NUMBER 1
#define PLUS 2
#define MINUS 3
%}
DIGIT [0-9]
%%
{DIGIT}+ { return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
[ \t\n]+ { /* skip */ }
. { fprintf(stderr, "Unknown: %s\n", yytext); }
%%
Regular Expressions
OpenLexer supports the following regular expression syntax, compatible with Flex.
Character Classes
| Syntax | Description |
|---|---|
. | Any character except newline |
[abc] | Character class: a, b, or c |
[a-z] | Range: lowercase letters |
[^abc] | Negated class: not a, b, or c |
[a-zA-Z0-9_] | Combined ranges |
Predefined Classes
| Syntax | Equivalent |
|---|---|
[:alpha:] | [a-zA-Z] |
[:digit:] | [0-9] |
[:alnum:] | [a-zA-Z0-9] |
[:space:] | [ \t\n\r\f\v] |
[:upper:] | [A-Z] |
[:lower:] | [a-z] |
Use inside character classes: [[:alpha:]_]
Quantifiers
| Syntax | Description |
|---|---|
* | Zero or more |
+ | One or more |
? | Zero or one |
{n} | Exactly n |
{n,} | n or more |
{n,m} | Between n and m |
Anchors
| Syntax | Description |
|---|---|
^ | Start of line |
$ | End of line |
Grouping and Alternation
| Syntax | Description |
|---|---|
(ab) | Group |
a|b | Alternation: a or b |
Escape Sequences
| Syntax | Description |
|---|---|
\n | Newline |
\t | Tab |
\r | Carriage return |
\\ | Literal backslash |
\. | Literal dot |
\* | Literal asterisk |
Literal Strings
Double-quoted strings match literally:
"while" { return WHILE; }
"==" { return EQ; }
"++" { return INCREMENT; }
Named Pattern References
Reference definitions with braces:
/* Definition */
DIGIT [0-9]
%%
/* Rule using definition */
{DIGIT}+ { return NUMBER; }
Examples
/* Integer literal */
[0-9]+ { return INTEGER; }
/* Floating point */
[0-9]+\.[0-9]+ { return FLOAT; }
/* Identifier */
[a-zA-Z_][a-zA-Z0-9_]* { return IDENTIFIER; }
/* C-style string */
\"([^"\\]|\\.)*\" { return STRING; }
/* Single-line comment */
"//".* { /* skip */ }
Actions and Return Values
Each lexer rule has an action block that executes when the pattern matches.
Basic Actions
Return a token type:
[0-9]+ { return NUMBER; }
"+" { return PLUS; }
Skip input without returning:
[ \t\n]+ { /* skip whitespace */ }
Available Variables
yytext
The matched text as a string:
[a-z]+ { printf("Matched: %s\n", yytext); return WORD; }
yyleng
Length of the matched text:
[a-z]+ { printf("Length: %d\n", yyleng); return WORD; }
yylval
Semantic value passed to the parser. Set in the action:
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
[0-9]+\.[0-9]+ { yylval.dval = atof(yytext); return FLOAT; }
Multi-Statement Actions
Use braces for multiple statements:
[0-9]+ {
int val = atoi(yytext);
if (val > 1000) {
fprintf(stderr, "Warning: large number\n");
}
yylval.ival = val;
return NUMBER;
}
Language-Specific Actions
C Actions
[a-z]+ {
yylval.str = strdup(yytext);
return IDENTIFIER;
}
Python Actions
[a-z]+ {
self.yylval = self.yytext
return self.IDENTIFIER
}
Java Actions
[a-z]+ {
yylval = yytext;
return IDENTIFIER;
}
Token Constants
Token constants should match those declared in the grammar file:
%{
/* From parser.h or defined here */
#define NUMBER 258
#define PLUS 259
#define MINUS 260
%}
%%
[0-9]+ { return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
%%
Error Handling
The . pattern matches any single character. Use it as a catch-all:
. {
fprintf(stderr, "Unexpected character: %s\n", yytext);
return ERROR;
}
Start Conditions
Start conditions allow the lexer to switch between different sets of rules. This is useful for handling comments, strings, and other context-dependent lexing.
Declaring Start Conditions
Use %x for exclusive conditions or %s for inclusive conditions:
%x COMMENT
%x STRING
%s SPECIAL
- Exclusive (
%x): Only rules with this condition are active. - Inclusive (
%s): Rules with this condition AND rules without conditions are active.
Using Start Conditions
Applying to Rules
Prefix a rule with the condition name in angle brackets:
<COMMENT>. { /* inside comment */ }
<STRING>[^"]+ { /* inside string */ }
Multiple Conditions
Specify multiple conditions separated by commas:
<COMMENT,STRING>. { /* in comment or string */ }
Initial Condition
Rules without a condition apply in the INITIAL state:
[a-z]+ { return WORD; } /* applies in INITIAL */
Or explicitly:
<INITIAL>[a-z]+ { return WORD; }
Switching Conditions
Use BEGIN(condition) to switch:
"/*" { BEGIN(COMMENT); }
<COMMENT>"*/" { BEGIN(INITIAL); }
Example: C-Style Comments
%x COMMENT
%%
"/*" { BEGIN(COMMENT); }
<COMMENT>"*/" { BEGIN(INITIAL); }
<COMMENT>\n { yylineno++; }
<COMMENT>. { /* skip comment content */ }
%%
Example: String Literals
%x STRING
%%
\" { BEGIN(STRING); string_buf_ptr = string_buf; }
<STRING>\" {
BEGIN(INITIAL);
*string_buf_ptr = '\0';
yylval.str = strdup(string_buf);
return STRING_LITERAL;
}
<STRING>\\n { *string_buf_ptr++ = '\n'; }
<STRING>\\t { *string_buf_ptr++ = '\t'; }
<STRING>\\\\ { *string_buf_ptr++ = '\\'; }
<STRING>\\. { *string_buf_ptr++ = yytext[1]; }
<STRING>[^\\\"]+ {
char *p = yytext;
while (*p) *string_buf_ptr++ = *p++;
}
%%
Example: Nested Comments
For languages with nested comments, use a counter:
%x COMMENT
%{
int comment_depth = 0;
%}
%%
"(*" { comment_depth++; BEGIN(COMMENT); }
<COMMENT>"(*" { comment_depth++; }
<COMMENT>"*)" {
if (--comment_depth == 0) BEGIN(INITIAL);
}
<COMMENT>. { /* skip */ }
<COMMENT>\n { yylineno++; }
%%
Parser Generator
The parser generator converts context-free grammars into LALR(1) parsing tables and generates parser code.
How It Works
- Parsing: The
.yfile is parsed to extract tokens, precedence, and grammar rules. - Grammar Analysis: First and Follow sets are computed.
- LR(0) Items: Canonical LR(0) item sets are constructed.
- LALR(1) Tables: Lookahead information is computed and tables are built.
- Conflict Resolution: Shift/reduce and reduce/reduce conflicts are resolved using precedence.
- Code Generation: The parsing table and driver are emitted in the target language.
Input Format
Grammar specifications use the Bison .y file format:
%{
/* Prologue code */
%}
/* Token and precedence declarations */
%token NUMBER PLUS MINUS
%left PLUS MINUS
%%
/* Grammar rules */
expr: expr PLUS expr
| expr MINUS expr
| NUMBER
;
%%
/* Epilogue code */
Output
The generated parser provides:
- A
Parserclass (or struct in C) - A
parse()function that returns the parse result - Action and goto tables
- Semantic action execution
GLR Mode
For ambiguous grammars, add %glr-parser to enable GLR parsing:
%glr-parser
%%
expr: expr MINUS expr
| NUMBER
;
%%
Chapters
- Grammar File Format - Complete
.yfile syntax - Tokens and Precedence - Declaring tokens and resolving conflicts
- Semantic Actions - Attaching code to rules
- Error Handling - Parser error recovery
Grammar File Format (.y)
The grammar specification file has three sections separated by %%:
DECLARATIONS
%%
RULES
%%
USER CODE
Declarations Section
Prologue Code
Code in %{ and %} is copied to the output:
%{
#include <stdio.h>
#include <math.h>
extern int yylex();
void yyerror(const char *s);
%}
Token Declarations
Declare terminal symbols with %token:
%token NUMBER
%token PLUS MINUS TIMES DIVIDE
%token LPAREN RPAREN
Tokens with types:
%token <ival> NUMBER
%token <str> IDENTIFIER
Union Declaration
Define the semantic value type:
%union {
int ival;
double dval;
char *str;
}
Type Declarations
Assign types to non-terminals:
%type <dval> expr term factor
Precedence Declarations
Declare operator precedence and associativity:
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%nonassoc UMINUS
%left: Left-associative%right: Right-associative%nonassoc: Non-associative (error on chaining)
Later declarations have higher precedence.
Start Symbol
Specify the grammar's start symbol:
%start program
If omitted, the left-hand side of the first rule is used.
Rules Section
Production rules define the grammar:
%%
program:
statement_list
;
statement_list:
statement
| statement_list statement
;
statement:
expr SEMICOLON { printf("Result: %d\n", $1); }
| error SEMICOLON { yyerrok; }
;
expr:
expr PLUS expr { $$ = $1 + $3; }
| expr MINUS expr { $$ = $1 - $3; }
| NUMBER { $$ = $1; }
;
%%
Rule Syntax
nonterminal:
production1 { action1 }
| production2 { action2 }
;
Empty Productions
Use /* empty */ or nothing:
optional_items:
/* empty */
| item_list
;
User Code Section
Auxiliary functions copied to the output:
%%
void yyerror(const char *s) {
fprintf(stderr, "Error: %s\n", s);
}
int main() {
return yyparse();
}
Tokens and Precedence
Token Declaration
All terminals must be declared:
%token NUMBER IDENTIFIER STRING_LITERAL
%token PLUS MINUS TIMES DIVIDE
%token IF ELSE WHILE FOR
Token Values
By default, tokens are assigned sequential integer values starting from 256. To specify values:
%token NUMBER 258
%token PLUS 259
%token MINUS 260
Literal Tokens
Single-character tokens can be used directly in rules:
expr: expr '+' expr
| expr '-' expr
| NUMBER
;
The character's ASCII value is the token value.
Precedence and Associativity
Associativity
%left PLUS MINUS /* left-to-right */
%left TIMES DIVIDE /* higher precedence than + - */
%right POWER /* right-to-left */
%nonassoc LT GT EQ /* a < b < c is an error */
Precedence Levels
Declarations later in the file have higher precedence:
%left OR /* lowest */
%left AND
%left EQ NE
%left LT GT LE GE
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER /* highest binary */
%right UMINUS /* unary minus */
Context-Dependent Precedence
Use %prec to override a rule's precedence:
expr: MINUS expr %prec UMINUS { $$ = -$2; }
;
This uses UMINUS precedence instead of MINUS.
Conflict Resolution
Shift/Reduce Conflicts
When the parser can either shift or reduce:
- If the lookahead token has higher precedence than the rule, shift.
- If the rule has higher precedence, reduce.
- If equal precedence, use associativity:
- Left: reduce
- Right: shift
- Nonassoc: error
Example: Dangling Else
%nonassoc LOWER_THAN_ELSE
%nonassoc ELSE
stmt: IF expr stmt %prec LOWER_THAN_ELSE
| IF expr stmt ELSE stmt
;
Reduce/Reduce Conflicts
When two rules can reduce the same input, the first rule in the file wins. This usually indicates a grammar problem.
Token Types
With %union, assign types to tokens:
%union {
int ival;
double dval;
char *str;
}
%token <ival> INTEGER
%token <dval> FLOAT
%token <str> IDENTIFIER STRING
Semantic Actions
Semantic actions are code blocks attached to grammar rules that execute when the rule is reduced.
Basic Syntax
expr: expr PLUS expr { $$ = $1 + $3; }
| NUMBER { $$ = $1; }
;
Value Stack References
$$: The result value (left-hand side)$1,$2,$3, ...: Values of right-hand side symbols
/* Rule: factor -> LPAREN expr RPAREN */
factor: LPAREN expr RPAREN { $$ = $2; }
/* $1=LPAREN, $2=expr, $3=RPAREN */
;
Typed Actions
With %union, values have specific types:
%union {
int ival;
double dval;
struct node *ast;
}
%type <dval> expr term factor
%%
expr: expr PLUS term { $$ = $1 + $3; }
;
Multi-Statement Actions
Braces contain the full action:
stmt: PRINT expr SEMICOLON {
printf("Value: %g\n", $2);
$$ = $2;
}
;
Mid-Rule Actions
Actions can appear in the middle of a rule:
compound: LBRACE { enter_scope(); } stmt_list RBRACE { leave_scope(); }
;
Mid-rule actions count as symbols:
rule: A { mid_action(); } B C
/* $1=A, $2=mid_action result, $3=B, $4=C */
;
AST Construction
Building abstract syntax trees:
%union {
struct node *ast;
}
%type <ast> expr term factor
%%
expr: expr PLUS term {
$$ = make_binary_node(OP_ADD, $1, $3);
}
| term {
$$ = $1;
}
;
Language-Specific Actions
C
expr: expr PLUS expr { $$ = $1 + $3; }
;
Python
expr: expr PLUS expr { self.yyval = self.yyvs[-3] + self.yyvs[-1] }
;
Java
expr: expr PLUS expr { yyval = yyvs[yyvsp-2] + yyvs[yyvsp]; }
;
Empty Actions
Rules without explicit actions get a default that copies $1 to $$:
term: factor /* implicit: { $$ = $1; } */
;
Error Handling
The error Token
The special token error represents a syntax error:
stmt: expr SEMICOLON
| error SEMICOLON { yyerrok; }
;
When an error occurs, the parser pops states until it can shift error, then continues parsing.
yyerror Function
The yyerror function is called on syntax errors:
C
void yyerror(const char *msg) {
fprintf(stderr, "line %d: %s\n", yylineno, msg);
}
Python
def yyerror(self, msg):
print(f"line {self.yylineno}: {msg}", file=sys.stderr)
Java
void yyerror(String msg) {
System.err.println("line " + yylineno + ": " + msg);
}
Error Recovery
yyerrok
Clears the error state, allowing normal parsing to resume:
stmt: error SEMICOLON { yyerrok; printf("Recovered at ;\n"); }
;
yyclearin
Discards the current lookahead token:
stmt: error SEMICOLON { yyerrok; yyclearin; }
;
YYERROR
Force an error from within an action:
expr: expr DIVIDE expr {
if ($3 == 0) {
yyerror("division by zero");
YYERROR;
}
$$ = $1 / $3;
}
;
YYABORT
Abort parsing immediately:
stmt: QUIT { YYABORT; }
;
YYACCEPT
Accept the input immediately:
program: statement_list END { YYACCEPT; }
;
Error Recovery Strategies
Statement-Level Recovery
Recover at statement boundaries:
program: stmt_list
;
stmt_list: stmt_list stmt
| stmt
;
stmt: valid_statement SEMICOLON
| error SEMICOLON { yyerrok; }
;
Block-Level Recovery
Recover at block boundaries:
block: LBRACE stmt_list RBRACE
| LBRACE error RBRACE { yyerrok; }
;
Parenthesis Recovery
expr: LPAREN expr RPAREN
| LPAREN error RPAREN { yyerrok; $$ = 0; }
| NUMBER
;
Error Messages
Provide context in error messages:
void yyerror(const char *msg) {
fprintf(stderr, "Error at line %d near '%s': %s\n",
yylineno, yytext, msg);
}
GLR Parsing
GLR (Generalized LR) parsing handles ambiguous and non-deterministic grammars that LALR(1) cannot process.
When to Use GLR
Use GLR parsing when your grammar has:
- Inherent ambiguity that cannot be resolved with precedence
- Multiple valid parse trees for the same input
- Complex language constructs requiring more than one token of lookahead
Enabling GLR Mode
Add %glr-parser to the declarations section:
%glr-parser
%%
expr: expr MINUS expr
| NUMBER
;
%%
How GLR Works
Graph-Structured Stack (GSS)
Instead of a single parse stack, GLR maintains a graph of stacks:
[0] ─── expr ─── [3] ─── MINUS ─── [4] ─── expr ─── [7]
│ │
└─────── expr ─────────────────────────────┘
When a conflict occurs, GLR forks the stack and explores both paths.
Shared Packed Parse Forest (SPPF)
Multiple parse trees are represented compactly:
expr
/ | \
expr - expr
| / \
1 expr - expr
| |
2 3
For 1 - 2 - 3, the SPPF captures both:
(1 - 2) - 31 - (2 - 3)
Conflict Handling
In standard LALR(1), conflicts are resolved statically. In GLR:
- Shift/Reduce: Both shift and reduce occur on separate stack branches
- Reduce/Reduce: Both reductions occur
Invalid branches are pruned when they encounter errors.
Performance
GLR parsing is O(n^3) in the worst case for highly ambiguous grammars. For mostly unambiguous grammars with occasional conflicts, it approaches O(n) LALR performance.
Chapters
- Ambiguous Grammars - Writing grammars with intentional ambiguity
- GLR Algorithm - Technical details of the implementation
Ambiguous Grammars
Ambiguous grammars have multiple valid parse trees for the same input.
Expression Ambiguity
Without precedence or associativity, binary operators are ambiguous:
%glr-parser
%%
expr: expr MINUS expr
| expr PLUS expr
| expr TIMES expr
| NUMBER
;
%%
For 1 - 2 - 3:
Parse tree 1 (left-associative):
expr
/ | \
expr - expr
/ \ |
expr - 3
| \
1 2
Result: (1 - 2) - 3 = -4
Parse tree 2 (right-associative):
expr
/ | \
expr - expr
| / | \
1 expr - expr
| |
2 3
Result: 1 - (2 - 3) = 2
Dangling Else
The classic ambiguity in C-like languages:
%glr-parser
%%
stmt: IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
| other_stmt
;
%%
For if a then if b then s1 else s2:
if a then (if b then s1 else s2)- else binds to inner ifif a then (if b then s1) else s2- else binds to outer if
Type or Expression
In C, (x)(y) could be:
- A cast of
yto typex - A function call of
xwith argumenty
%glr-parser
%%
expr: LPAREN type RPAREN expr
| expr LPAREN expr RPAREN
| IDENTIFIER
;
%%
Resolving Ambiguity at Runtime
With GLR, you can defer disambiguation to semantic actions:
%glr-parser
%%
primary: IDENTIFIER {
if (is_type($1)) {
$$ = make_type_node($1);
} else {
$$ = make_var_node($1);
}
}
;
%%
Comparison with LALR
| Feature | LALR(1) | GLR |
|---|---|---|
| Ambiguous grammars | Error | Supported |
| Multiple parse trees | No | Yes (SPPF) |
| Performance | O(n) | O(n) to O(n^3) |
| Conflict handling | Static | Dynamic |
Best Practices
- Use LALR with precedence when possible (more efficient)
- Use GLR when the grammar is inherently ambiguous
- Keep the ambiguous portion of the grammar small
- Test with known ambiguous inputs
GLR Algorithm
This chapter describes the GLR parsing algorithm implemented in OpenLexer.
Data Structures
Graph-Structured Stack (GSS)
The GSS is a directed acyclic graph where:
- Nodes represent parser states
- Edges represent stack entries (symbols and semantic values)
#![allow(unused)] fn main() { struct GssNode { state: usize, links: Vec<GssEdge>, } struct GssEdge { predecessor: NodeId, symbol: Symbol, value: SemanticValue, } }
GSS Operations
Push: Add a new node with an edge to the current node.
Reduce: Follow edges backward by the rule length, then push the result.
Merge: When two paths reach the same state, merge them instead of duplicating.
Shared Packed Parse Forest (SPPF)
The SPPF compactly represents all parse trees:
#![allow(unused)] fn main() { enum SppfNode { Symbol { symbol: String, span: (usize, usize), }, Packed { rule: usize, children: Vec<NodeId>, }, Ambiguous { alternatives: Vec<NodeId>, }, } }
Algorithm Steps
1. Initialization
Create the initial GSS node with state 0.
GSS: [state 0]
Active: { node_0 }
2. Per-Token Processing
For each input token:
pending_shifts = {}
pending_reductions = {}
for each active node:
action = get_action(node.state, token)
if action contains SHIFT:
add to pending_shifts
if action contains REDUCE:
add to pending_reductions
3. Reduction Phase
Process all pending reductions:
while pending_reductions not empty:
(node, rule) = pop from pending_reductions
for each path of length rule.rhs_len from node:
base_node = end of path
goto_state = goto[base_node.state, rule.lhs]
if node with goto_state exists in GSS:
merge edge into existing node
else:
create new node with goto_state
check for new reductions from the new/merged node
4. Shift Phase
Process all pending shifts:
new_active = {}
for each (node, shift_state) in pending_shifts:
if node with shift_state exists:
merge
else:
create new node
add to new_active
active = new_active
5. Termination
After processing all tokens:
- If any active node can accept, parsing succeeds
- If no active nodes remain, parsing fails
Conflict Actions
The glr_conflict_actions table stores all actions at conflict points:
#![allow(unused)] fn main() { struct ParsingTable { // Standard LALR action table (winner of conflicts) action: HashMap<(usize, String), Action>, // All actions at conflicts (for GLR) glr_conflict_actions: HashMap<usize, HashMap<String, Vec<Action>>>, } }
During GLR parsing, both actions are explored when a conflict exists.
Path Enumeration
Finding all paths of length N in the GSS:
#![allow(unused)] fn main() { fn enumerate_paths(node: NodeId, length: usize) -> Vec<Path> { if length == 0 { return vec![Path::new(node)]; } let mut paths = vec![]; for edge in gss[node].links { for subpath in enumerate_paths(edge.predecessor, length - 1) { paths.push(subpath.prepend(edge)); } } paths } }
Complexity
- Time: O(n^3) worst case, O(n) for unambiguous input
- Space: O(n^2) for GSS, O(n^3) for full SPPF
The practical performance depends on the degree of ambiguity in the grammar and input.
Output Languages
OpenLexer generates lexers and parsers in three target languages:
Language Selection
Use the --lang option:
openlexer gen-lexer --lexer rules.l --lang c --output ./
openlexer gen-parser --parser grammar.y --lang java --output ./
Output Files
| Language | Lexer Output | Parser Output |
|---|---|---|
| C | lexer.c | parser.c |
| Java | Lexer.java | Parser.java |
| Python | lexer.py | parser.py |
File Organization
Each language has specific file organization requirements and best practices:
- Java: Requires one public class per file; generated files are designed to work standalone or together
- C: Can be compiled separately or combined; use preprocessor flags to control test drivers
- Python: Generated files are importable modules
📖 See File Organization Guide for detailed instructions on organizing generated code.
Common Interface
All generated lexers provide:
- Token type constants
- A method to get the next token
- Access to the matched text
All generated parsers provide:
- A parse method that drives the parsing loop
- Semantic value stack management
- Error reporting hooks
Integration
The lexer and parser integrate through a common interface:
Input Text
│
▼
┌─────────┐
│ Lexer │ ──▶ Token Stream
└─────────┘
│
▼
┌─────────┐
│ Parser │ ──▶ Parse Result / AST
└─────────┘
Each language chapter describes the specific integration method.
C Output
Generated Files
lexer.c- Lexer implementation with DFA tablesparser.c- Parser implementation with LALR tables
Lexer Interface
/* Initialize lexer with input */
void lex_init(const char *input);
/* Get next token, returns token type */
int yylex(void);
/* Current matched text */
extern char *yytext;
/* Length of matched text */
extern int yyleng;
/* Current line number (if enabled) */
extern int yylineno;
Parser Interface
/* Parse input, returns 0 on success */
int yyparse(void);
/* Error reporting callback */
void yyerror(const char *msg);
/* Semantic value passed from lexer */
extern YYSTYPE yylval;
Integration Example
#include <stdio.h>
#include "lexer.c"
#include "parser.c"
int main(int argc, char **argv) {
if (argc < 2) {
fprintf(stderr, "Usage: %s <input>\n", argv[0]);
return 1;
}
lex_init(argv[1]);
if (yyparse() == 0) {
printf("Parse successful\n");
return 0;
} else {
printf("Parse failed\n");
return 1;
}
}
Compilation
gcc -o myparser main.c -Wall
Or compile separately:
gcc -c lexer.c -o lexer.o
gcc -c parser.c -o parser.o
gcc -c main.c -o main.o
gcc -o myparser lexer.o parser.o main.o
Memory Management
The C lexer uses a static buffer for yytext. For dynamic allocation:
char *token_copy = strdup(yytext);
/* ... use token_copy ... */
free(token_copy);
YYSTYPE Union
When using %union in the grammar:
typedef union YYSTYPE {
int ival;
double dval;
char *str;
} YYSTYPE;
extern YYSTYPE yylval;
Set yylval in lexer actions:
[0-9]+ { yylval.ival = atoi(yytext); return NUMBER; }
Access in parser actions:
expr: expr PLUS expr { $$ = $1 + $3; }
Platform Compatibility
The generated C code is C99 compliant and works on:
- GCC (Linux, macOS, MinGW)
- Clang (Linux, macOS)
- MSVC (Windows)
Java Output
File Organization
Important: Java requires exactly ONE public class per .java file, and the filename must match the public class name.
Generated Files
- Lexer Only: Generates
Lexer.javacontainingpublic class Lexer - Parser Only: Generates
Parser.javacontainingpublic class Parser - Both Together: Generate as separate files and compile together
# Generate both
openlexer gen-lexer --lexer calc.l -L java -o output/
openlexer gen-parser --parser calc.y -L java -o output/
# Compile both
javac output/Lexer.java output/Parser.java
# Run (Parser auto-detects Lexer.class)
java -cp output Parser "3 + 4 * 2"
Lexer Interface
public class Lexer {
// Token constants
public static final int TOKEN_EOF = 0;
public static final int TOKEN_ERROR = 1;
public static final int TOKEN_NUMBER = 2;
public static final int TOKEN_PLUS = 3;
// ...
// Token class for lexer-parser integration
public static class Token {
public final int type; // Token type ID
public final String text; // Matched lexeme
public final int pos; // Position in input
public Token(int type, String text, int pos) {
this.type = type;
this.text = text;
this.pos = pos;
}
}
// Constructor
public Lexer(String input);
// Get next token (returns Token object)
public Token nextToken();
// Get token name
public static String tokenName(int type);
}
Parser Interface
public class Parser {
// Parse input string (auto-detects external Lexer if available)
public static int parse(String input);
// Main method for testing
public static void main(String[] args);
}
Automatic Lexer Detection
The generated Parser automatically detects if a compiled Lexer.class is available:
- With external Lexer: Uses reflection to call
Lexer.nextToken()- Output:
[Using external Lexer.class]
- Output:
- Without external Lexer: Falls back to inline lexer
- Output:
[Using inline lexer]
- Output:
This allows the Parser to work standalone OR with a separate Lexer.
Integration Example
Method 1: Using Generated Test Drivers
# Lexer only
javac Lexer.java
java Lexer "3 + 4 * 2"
# Parser with external Lexer
javac Lexer.java Parser.java
java Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]
# Parser standalone (no Lexer.class)
javac Parser.java
java Parser "3 + 4 * 2"
# Output: [Using inline lexer]
Method 2: Custom Integration
public class Main {
public static void main(String[] args) {
if (args.length < 1) {
System.err.println("Usage: java Main <expression>");
System.exit(1);
}
String input = args[0];
// Tokenize with Lexer
Lexer lexer = new Lexer(input);
Lexer.Token token;
System.out.println("Tokens:");
while ((token = lexer.nextToken()).type != Lexer.TOKEN_EOF) {
System.out.printf(" %s: \"%s\"\n",
Lexer.tokenName(token.type), token.text);
}
// Parse
int result = Parser.parse(input);
System.out.println("Result: " + result);
}
}
Compilation and Execution
# Compile all files
javac Lexer.java Parser.java Main.java
# Run
java Main "3 + 4 * 2"
Semantic Values
The lexer's getValue() returns an Object. Cast as needed:
public int nextToken() {
// In NUMBER rule:
this.value = Integer.parseInt(this.text);
return NUMBER;
}
In parser actions, values are accessed through the semantic stack.
Error Handling
public class ParseException extends Exception {
private int line;
private int column;
private String token;
public ParseException(String message, int line, int column, String token) {
super(message);
this.line = line;
this.column = column;
this.token = token;
}
// Getters...
}
Reading from Files
String content = new String(Files.readAllBytes(Paths.get(filename)));
Lexer lexer = new Lexer(content);
Reading from InputStream
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
StringBuilder sb = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
sb.append(line).append("\n");
}
Lexer lexer = new Lexer(sb.toString());
Package Declaration
The generated code does not include a package declaration by default. Add one manually or modify the output if needed.
Java Version Compatibility
The generated code is compatible with Java 8 and later. No external dependencies are required.
Python Output
Generated Files
lexer.py- Lexer class with DFA tablesparser.py- Parser class with LALR tables
Lexer Interface
class Lexer:
# Token constants
EOF = 0
NUMBER = 1
PLUS = 2
# ...
def __init__(self, input_string: str):
"""Initialize lexer with input string."""
def next_token(self) -> int:
"""Return the next token type."""
@property
def text(self) -> str:
"""Return the matched text."""
@property
def value(self):
"""Return the semantic value."""
Parser Interface
class Parser:
def __init__(self, lexer: Lexer):
"""Initialize parser with lexer."""
def parse(self):
"""Parse input and return result."""
Integration Example
#!/usr/bin/env python3
import sys
from lexer import Lexer
from parser import Parser
def main():
if len(sys.argv) < 2:
print("Usage: python main.py <expression>", file=sys.stderr)
sys.exit(1)
lexer = Lexer(sys.argv[1])
parser = Parser(lexer)
try:
result = parser.parse()
print(f"Result: {result}")
except Exception as e:
print(f"Parse error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()
Execution
python main.py "3 + 4 * 2"
Semantic Values
Set in lexer actions:
def next_token(self):
# In NUMBER rule:
self._value = int(self._text)
return self.NUMBER
Error Handling
class ParseError(Exception):
def __init__(self, message, line=None, column=None, token=None):
super().__init__(message)
self.line = line
self.column = column
self.token = token
Reading from Files
with open(filename, 'r') as f:
content = f.read()
lexer = Lexer(content)
parser = Parser(lexer)
result = parser.parse()
Reading from stdin
import sys
content = sys.stdin.read()
lexer = Lexer(content)
parser = Parser(lexer)
result = parser.parse()
Type Hints
The generated code includes type hints compatible with Python 3.6+:
def next_token(self) -> int:
...
def parse(self) -> Any:
...
Dependencies
No external dependencies. Uses only the Python standard library.
Python Version Compatibility
- Python 3.6+
- Uses f-strings and type hints
- No walrus operator (3.8+) for broader compatibility
File Organization Guide for Generated Code
Overview
OpenLexer generates lexer and parser code in C, Java, and Python. This guide explains how to properly organize and use the generated files.
Java File Organization
The "One Public Class Per File" Rule
Java requires that each .java file contains only ONE public class, and the filename must match that class name.
Generating Lexer Only
openlexer gen-lexer --lexer grammar.l -L java -o output/
Output: output/Lexer.java
- Contains:
public class Lexer - Includes: Token class, lexer logic, test driver
Compilation & Usage:
javac Lexer.java
java Lexer "3 + 4 * 2"
Generating Parser Only
openlexer gen-parser --parser grammar.y -L java -o output/
Output: output/Parser.java
- Contains:
public class Parser - Includes: Inline lexer, parser logic, test driver
- Can run standalone or detect external Lexer.class
Compilation & Usage:
javac Parser.java
java Parser "3 + 4 * 2"
Generating Both Lexer and Parser
Method 1: Separate Files (Recommended)
openlexer gen-lexer --lexer grammar.l -L java -o output/
openlexer gen-parser --parser grammar.y -L java -o output/
Output:
output/Lexer.java- public class Lexeroutput/Parser.java- public class Parser
Compilation & Usage:
# Compile both
javac Lexer.java Parser.java
# Run parser (automatically detects and uses Lexer.class)
java Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]
Method 2: Combined File (Advanced)
If you need both in one file for deployment, make one class non-public:
// File: Lexer.java
public class Lexer {
// ... lexer code ...
}
class Parser { // Note: not public!
// ... parser code ...
}
Compile and run:
javac Lexer.java
java Lexer # or java Parser if Parser is public
Token Interface
Lexer Token Format
All Java lexers return a Token object with consistent structure:
public static class Token {
public final int type; // Token type constant
public final String text; // Lexeme text
public final int pos; // Position in input
public Token(int type, String text, int pos) {
this.type = type;
this.text = text;
this.pos = pos;
}
}
Parser-Lexer Integration
The Parser automatically detects and uses an external Lexer if available:
-
With external Lexer:
- Parser calls
Lexer.nextToken()via reflection - Displays:
[Using external Lexer.class]
- Parser calls
-
Without external Lexer:
- Parser uses its inline lexer
- Displays:
[Using inline lexer]
C File Organization
Generating Lexer
openlexer gen-lexer --lexer grammar.l -L c -o output/
Output: output/lexer.c
- Includes headers, lexer logic, test driver
Compilation & Usage:
gcc -o lexer lexer.c
./lexer "3 + 4 * 2"
Generating Parser
openlexer gen-parser --parser grammar.y -L c -o output/
Output: output/parser.c
- Includes inline lexer, parser logic, test driver
Compilation & Usage:
gcc -o parser parser.c
./parser "3 + 4 * 2"
Combining Lexer and Parser
# Generate both
openlexer gen-lexer --lexer grammar.l -L c -o output/
openlexer gen-parser --parser grammar.y -L c -o output/
# Method 1: Compile separately and link
gcc -c lexer.c -o lexer.o
gcc -c parser.c -o parser.o
gcc lexer.o parser.o -o myapp
# Method 2: Combine into one compilation unit
cat lexer.c parser.c > combined.c
gcc -o myapp combined.c
Suppressing Test Drivers:
Use preprocessor flags to disable test/main functions:
# Lexer without test driver
gcc -DLEXER_NO_MAIN -DLEXER_NO_TEST -c lexer.c
# Parser without test driver
gcc -DPARSER_NO_MAIN -c parser.c
Python File Organization
Generating Lexer
openlexer gen-lexer --lexer grammar.l -L python -o output/
Output: output/lexer.py
- Lexer class, test functions
Usage:
python lexer.py "3 + 4 * 2"
# Or import in your code:
from lexer import Lexer
lex = Lexer("3 + 4 * 2")
for token in lex.tokenize():
print(token)
Generating Parser
openlexer gen-parser --parser grammar.y -L python -o output/
Output: output/parser.py
- Parser and inline Lexer class
Usage:
python parser.py "3 + 4 * 2"
# Or import:
from parser import parse
result = parse("3 + 4 * 2")
Combining Both
# Generate both
openlexer gen-lexer --lexer grammar.l -L python -o output/
openlexer gen-parser --parser grammar.y -L python -o output/
# Import and use together:
from lexer import Lexer
from parser import Parser
lex = Lexer("3 + 4 * 2")
parser = Parser(lex)
result = parser.parse()
Best Practices
1. Separate Generation
Generate lexer and parser in separate commands for better modularity.
2. Version Control
- Commit .l and .y grammar files
- Add generated files to .gitignore or commit them depending on your workflow
3. Build Scripts
Create a Makefile or build script:
# Makefile
all: lexer parser
lexer:
openlexer gen-lexer --lexer grammar.l -L java -o build/
parser:
openlexer gen-parser --parser grammar.y -L java -o build/
cd build && javac Lexer.java Parser.java
clean:
rm -rf build/
run:
cd build && java Parser "3 + 4 * 2"
4. Package Structure (Java)
For larger projects, use Java packages:
mkdir -p src/com/example/parser
openlexer gen-lexer --lexer grammar.l -L java -o src/com/example/parser/
# Edit generated files to add: package com.example.parser;
5. Testing
Use the built-in test drivers during development:
# Test lexer
java Lexer "(10 + 20) * 3"
# Test parser (with or without external lexer)
java Parser "(10 + 20) * 3"
Troubleshooting
Java: "Public class X must be in a file named X.java"
Solution: Ensure file name matches the public class name, or make the class non-public.
Java: "Class Lexer not found" when running Parser
Causes:
- Lexer.java not compiled
- Lexer.class not in same directory
- CLASSPATH not set correctly
Solution:
# Compile both in same directory
javac Lexer.java Parser.java
# Or set CLASSPATH
javac -cp /path/to/classes Parser.java
java -cp /path/to/classes Parser
Java: Multiple public classes error
Solution: Only ONE public class per .java file. Make others non-public:
public class Lexer { ... }
class Parser { ... } // No 'public'
C: Undefined reference to lexer functions
Solution: Ensure proper linking:
gcc lexer.c parser.c -o myapp
# Or link separately
gcc -c lexer.c && gcc -c parser.c && gcc lexer.o parser.o -o myapp
Advanced: Custom Integration
Java: Manual Lexer-Parser Integration
// Main.java
public class Main {
public static void main(String[] args) {
Lexer lexer = new Lexer("3 + 4 * 2");
Parser parser = new Parser();
// Custom token loop
Lexer.Token tok;
while ((tok = lexer.nextToken()).type != Lexer.TOKEN_EOF) {
// Process tokenparser.consumeToken(tok);
}
int result = parser.getResult();
System.out.println("Result: " + result);
}
}
Python: Custom Integration
# main.py
from lexer import Lexer, TokenType
from parser import Parser
def custom_parse(input_str):
lex = Lexer(input_str)
tokens = list(lex.tokenize())
# Filter or transform tokens
filtered = [t for t in tokens if t.type != TokenType.WHITESPACE]
# Parse
parser = Parser()
return parser.parse_tokens(filtered)
result = custom_parse("3 + 4 * 2")
print(f"Result: {result}")
Summary
| Language | Lexer File | Parser File | Test Command |
|---|---|---|---|
| Java | Lexer.java | Parser.java | javac *.java && java Parser "input" |
| C | lexer.c | parser.c | gcc *.c -o app && ./app "input" |
| Python | lexer.py | parser.py | python parser.py "input" |
Key Points:
- Java: One public class per file
- Parser can detect and use external Lexer automatically
- Use preprocessor flags in C to disable test drivers
- Python files can be imported as modules
Calculator Example
A complete calculator supporting arithmetic operations with proper precedence.
Lexer Specification (calc.l)
%{
/* Calculator Lexer */
%}
%%
[0-9]+ { return NUMBER; }
[0-9]+\.[0-9]+ { return FLOAT; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"%" { return MOD; }
"^" { return POWER; }
"(" { return LPAREN; }
")" { return RPAREN; }
[ \t]+ { /* skip whitespace */ }
\n { return NEWLINE; }
. { return ERROR; }
%%
Grammar Specification (calc.y)
%{
/* Calculator Parser */
#include <stdio.h>
#include <math.h>
%}
%union {
double dval;
}
%token <dval> NUMBER FLOAT
%token PLUS MINUS TIMES DIVIDE MOD POWER
%token LPAREN RPAREN NEWLINE ERROR
%type <dval> expr term factor primary
%left PLUS MINUS
%left TIMES DIVIDE MOD
%right POWER
%right UMINUS
%%
input:
/* empty */
| input line
;
line:
NEWLINE
| expr NEWLINE { printf("= %g\n", $1); }
| error NEWLINE { yyerrok; }
;
expr:
expr PLUS term { $$ = $1 + $3; }
| expr MINUS term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term:
term TIMES factor { $$ = $1 * $3; }
| term DIVIDE factor {
if ($3 == 0) {
yyerror("division by zero");
$$ = 0;
} else {
$$ = $1 / $3;
}
}
| term MOD factor { $$ = fmod($1, $3); }
| factor { $$ = $1; }
;
factor:
primary POWER factor { $$ = pow($1, $3); }
| primary { $$ = $1; }
;
primary:
NUMBER { $$ = $1; }
| FLOAT { $$ = $1; }
| LPAREN expr RPAREN { $$ = $2; }
| MINUS primary %prec UMINUS { $$ = -$2; }
;
%%
void yyerror(const char *s) {
fprintf(stderr, "Error: %s\n", s);
}
Generate Code
openlexer gen-lexer --lexer calc.l --lang c --output ./
openlexer gen-parser --parser calc.y --lang c --output ./
Main Program (main.c)
#include <stdio.h>
#include <string.h>
/* Include generated code */
#include "lexer.c"
#include "parser.c"
int main() {
char line[1024];
printf("Calculator (type expressions, Ctrl+C to exit)\n");
while (1) {
printf("> ");
fflush(stdout);
if (fgets(line, sizeof(line), stdin) == NULL) {
break;
}
lex_init(line);
yyparse();
}
return 0;
}
Build and Run
gcc -o calc main.c -lm -Wall
./calc
Example Session
Calculator (type expressions, Ctrl+C to exit)
> 3 + 4
= 7
> 2 * 3 + 4
= 10
> 2 + 3 * 4
= 14
> (2 + 3) * 4
= 20
> 2 ^ 3 ^ 2
= 512
> -5 + 3
= -2
> 10 / 0
Error: division by zero
= 0
Notes
- Power (
^) is right-associative:2^3^2 = 2^9 = 512 - Unary minus has highest precedence
- Parentheses override precedence
- Division by zero is caught with error message
MiniJava Example
A subset of Java for teaching compilers. This example shows a more complex grammar.
Lexer Specification (minijava.l)
%{
/* MiniJava Lexer */
%}
%x COMMENT
%%
"class" { return CLASS; }
"public" { return PUBLIC; }
"static" { return STATIC; }
"void" { return VOID; }
"main" { return MAIN; }
"String" { return STRING; }
"extends" { return EXTENDS; }
"return" { return RETURN; }
"int" { return INT; }
"boolean" { return BOOLEAN; }
"if" { return IF; }
"else" { return ELSE; }
"while" { return WHILE; }
"System.out.println" { return PRINTLN; }
"length" { return LENGTH; }
"true" { return TRUE; }
"false" { return FALSE; }
"this" { return THIS; }
"new" { return NEW; }
[a-zA-Z_][a-zA-Z0-9_]* { return IDENTIFIER; }
[0-9]+ { return INTEGER_LITERAL; }
"{" { return LBRACE; }
"}" { return RBRACE; }
"[" { return LBRACKET; }
"]" { return RBRACKET; }
"(" { return LPAREN; }
")" { return RPAREN; }
";" { return SEMICOLON; }
"," { return COMMA; }
"." { return DOT; }
"=" { return ASSIGN; }
"&&" { return AND; }
"<" { return LT; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"!" { return NOT; }
"//".* { /* skip single-line comment */ }
"/*" { BEGIN(COMMENT); }
<COMMENT>"*/" { BEGIN(INITIAL); }
<COMMENT>. { /* skip */ }
<COMMENT>\n { /* skip */ }
[ \t\n\r]+ { /* skip whitespace */ }
. { return ERROR; }
%%
Grammar Specification (minijava.y)
%{
/* MiniJava Parser */
%}
%token CLASS PUBLIC STATIC VOID MAIN STRING EXTENDS RETURN
%token INT BOOLEAN IF ELSE WHILE PRINTLN LENGTH
%token TRUE FALSE THIS NEW
%token IDENTIFIER INTEGER_LITERAL
%token LBRACE RBRACE LBRACKET RBRACKET LPAREN RPAREN
%token SEMICOLON COMMA DOT ASSIGN
%token AND LT PLUS MINUS TIMES NOT
%token ERROR
%left AND
%left LT
%left PLUS MINUS
%left TIMES
%right NOT
%left DOT LBRACKET
%%
Program:
MainClass ClassDeclarationList
;
MainClass:
CLASS IDENTIFIER LBRACE
PUBLIC STATIC VOID MAIN LPAREN STRING LBRACKET RBRACKET IDENTIFIER RPAREN
LBRACE Statement RBRACE
RBRACE
;
ClassDeclarationList:
/* empty */
| ClassDeclarationList ClassDeclaration
;
ClassDeclaration:
CLASS IDENTIFIER LBRACE VarDeclarationList MethodDeclarationList RBRACE
| CLASS IDENTIFIER EXTENDS IDENTIFIER LBRACE VarDeclarationList MethodDeclarationList RBRACE
;
VarDeclarationList:
/* empty */
| VarDeclarationList VarDeclaration
;
VarDeclaration:
Type IDENTIFIER SEMICOLON
;
MethodDeclarationList:
/* empty */
| MethodDeclarationList MethodDeclaration
;
MethodDeclaration:
PUBLIC Type IDENTIFIER LPAREN FormalParameterList RPAREN
LBRACE VarDeclarationList StatementList RETURN Expression SEMICOLON RBRACE
;
FormalParameterList:
/* empty */
| FormalParameter FormalParameterRest
;
FormalParameter:
Type IDENTIFIER
;
FormalParameterRest:
/* empty */
| FormalParameterRest COMMA FormalParameter
;
Type:
INT LBRACKET RBRACKET
| BOOLEAN
| INT
| IDENTIFIER
;
StatementList:
/* empty */
| StatementList Statement
;
Statement:
LBRACE StatementList RBRACE
| IF LPAREN Expression RPAREN Statement ELSE Statement
| WHILE LPAREN Expression RPAREN Statement
| PRINTLN LPAREN Expression RPAREN SEMICOLON
| IDENTIFIER ASSIGN Expression SEMICOLON
| IDENTIFIER LBRACKET Expression RBRACKET ASSIGN Expression SEMICOLON
;
Expression:
Expression AND Expression
| Expression LT Expression
| Expression PLUS Expression
| Expression MINUS Expression
| Expression TIMES Expression
| Expression LBRACKET Expression RBRACKET
| Expression DOT LENGTH
| Expression DOT IDENTIFIER LPAREN ExpressionList RPAREN
| INTEGER_LITERAL
| TRUE
| FALSE
| IDENTIFIER
| THIS
| NEW INT LBRACKET Expression RBRACKET
| NEW IDENTIFIER LPAREN RPAREN
| NOT Expression
| LPAREN Expression RPAREN
;
ExpressionList:
/* empty */
| Expression ExpressionRest
;
ExpressionRest:
/* empty */
| ExpressionRest COMMA Expression
;
%%
Generate Code
openlexer gen-lexer --lexer minijava.l --lang java --output ./
openlexer gen-parser --parser minijava.y --lang java --output ./
Sample MiniJava Program
class Factorial {
public static void main(String[] args) {
System.out.println(new Fac().compute(10));
}
}
class Fac {
public int compute(int num) {
int result;
if (num < 1)
result = 1;
else
result = num * this.compute(num - 1);
return result;
}
}
Limitations
MiniJava is a teaching language. It does not support:
- Interfaces
- Overloading
- Static fields
- Access modifiers other than
public - Floating point numbers
- Strings (except in
mainsignature)
Command Line Interface
Synopsis
openlexer <command> [options]
Commands
gen-lexer
Generate a lexer from a .l file.
openlexer gen-lexer --lexer <file.l> --lang <language> --output <dir>
Options:
| Option | Description |
|---|---|
--lexer <file> | Input lexer specification file |
--lang <lang> | Target language: c, java, python |
--output <dir> | Output directory |
Example:
openlexer gen-lexer --lexer calc.l --lang python --output ./build
# Java: Generate Lexer.java
openlexer gen-lexer -l calc.l -L java -o ./build
# C: Generate without test driver
openlexer gen-lexer -l calc.l -L c -o ./build --no-test
Generated Files: lexer.c (C), Lexer.java (Java), lexer.py (Python)
gen-parser
Generate a parser from a .y file.
openlexer gen-parser --parser <file.y> --lang <language> --output <dir>
Options:
| Option | Description |
|---|---|
--parser <file> | Input grammar specification file |
--lang <lang> | Target language: c, java, python |
--output <dir> | Output directory |
Example:
openlexer gen-parser --parser calc.y --lang c --output ./build
# Java: Compile lexer and parser together (Parser auto-detects Lexer.class)
openlexer gen-lexer -l calc.l -L java -o ./build
openlexer gen-parser --parser calc.y -L java -o ./build
javac ./build/Lexer.java ./build/Parser.java
java -cp ./build Parser "3 + 4 * 2"
Generated Files: parser.c (C), Parser.java (Java), parser.py (Python)
help
Show help information.
openlexer help
openlexer --help
openlexer <command> --help
version
Show version information.
openlexer --version
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid arguments |
Output
The generator prints progress information to stdout:
Parsing lexer specification...
Building NFA (23 states)...
Converting to DFA (15 states)...
Minimizing DFA (12 states)...
Generating Python code...
Written: ./build/lexer.py
Errors are printed to stderr:
Error: Invalid regex at line 5: unclosed bracket
Combining Lexer and Parser
Generate both in a single directory:
mkdir build
openlexer gen-lexer --lexer calc.l --lang c --output build
openlexer gen-parser --parser calc.y --lang c --output build
The generated files can then be compiled together.
API Reference
This chapter documents the internal Rust API for programmatic use.
Lexer Generator
LexerSpec
Represents a parsed .l file.
#![allow(unused)] fn main() { pub struct LexerSpec { pub prologue: String, pub epilogue: String, pub definitions: Vec<(String, String)>, pub rules: Vec<LexerRule>, pub start_conditions: Vec<StartCondition>, } impl LexerSpec { /// Parse a lexer specification from source text pub fn parse(input: &str) -> Result<LexerSpec>; } }
LexerRule
A single lexer rule (pattern + action).
#![allow(unused)] fn main() { pub struct LexerRule { pub pattern: String, pub action: RuleAction, pub conditions: Vec<String>, } }
Code Generation
#![allow(unused)] fn main() { /// Generate lexer code from a specification pub fn generate_code(spec: &LexerSpec, lang: &str) -> Result<String>; /// Target languages pub enum TargetLanguage { C, Java, Python, } }
Parser Generator
Grammar
Represents a parsed .y file.
#![allow(unused)] fn main() { pub struct Grammar { pub prologue: String, pub epilogue: String, pub tokens: Vec<String>, pub precedence: Vec<PrecedenceLevel>, pub rules: Vec<Rule>, pub start_symbol: Option<String>, pub union_def: Option<String>, pub glr_mode: bool, } impl Grammar { /// Parse a grammar specification from source text pub fn parse(input: &str) -> Result<Grammar>; } }
Rule
A grammar production rule.
#![allow(unused)] fn main() { pub struct Rule { pub lhs: String, pub rhs: Vec<String>, pub action: Option<String>, pub precedence: Option<String>, } }
ParsingTable
LALR(1) parsing tables.
#![allow(unused)] fn main() { pub struct ParsingTable { pub states: Vec<LrState>, pub action: HashMap<(usize, String), Action>, pub goto: HashMap<(usize, String), usize>, pub glr_conflict_actions: HashMap<usize, HashMap<String, Vec<Action>>>, } impl ParsingTable { /// Build parsing tables from a grammar pub fn build(grammar: &Grammar) -> Result<ParsingTable>; } }
Action
Parser actions.
#![allow(unused)] fn main() { pub enum Action { Shift(usize), // Shift and go to state Reduce(usize), // Reduce by rule number Accept, // Accept input } }
GLR Parser
GlrParser
Runtime GLR parser using Graph-Structured Stack.
#![allow(unused)] fn main() { pub struct GlrParser { table: GlrTable, gss: Gss, forest: ParseForest, } impl GlrParser { /// Create a new GLR parser from a parsing table pub fn new(table: GlrTable) -> GlrParser; /// Parse a token stream pub fn parse(&mut self, tokens: Vec<Token>) -> Result<Vec<ParseTree>>; } }
Token
Input token for GLR parsing.
#![allow(unused)] fn main() { pub struct Token { pub kind: String, pub value: String, pub span: (usize, usize), } }
Error Types
#![allow(unused)] fn main() { pub enum Error { /// Invalid regex syntax InvalidRegex(String), /// Grammar error GrammarError(String), /// Unsupported target language InvalidLanguage(String), /// Parse table construction failed TableError(String), /// I/O error IoError(std::io::Error), } }
Example Usage
use openlexer::lexgen; use openlexer::parsegen; fn main() -> openlexer::Result<()> { // Generate lexer let lexer_spec = lexgen::parse_lexer_spec(include_str!("calc.l"))?; let lexer_code = lexgen::generate_code(&lexer_spec, "python")?; std::fs::write("lexer.py", lexer_code)?; // Generate parser let grammar = parsegen::parse_grammar(include_str!("calc.y"))?; let parser_code = parsegen::generate_code(&grammar, "python")?; std::fs::write("parser.py", parser_code)?; Ok(()) }
Test Drivers
OpenLexer generates test drivers - standalone runnable code that lets you verify your lexer or parser works correctly without writing any integration code.
What Are Test Drivers?
A test driver is a main() function (or equivalent) appended to your generated code that:
- Accepts input from command-line arguments or stdin
- Runs the lexer/parser on that input
- Prints detailed output showing tokens or parse results
- Helps you debug patterns and grammar rules
Enabling Test Drivers
Command Line
Test drivers are included by default. To exclude them:
# Include test driver (default)
openlexer --lexer rules.l --lang python -o lexer.py
# Exclude test driver
openlexer --lexer rules.l --lang python -o lexer.py --no-test
# Generate ONLY the test driver
openlexer --test-driver --lang python -o test_driver.py
GUI
In the GUI, check the "Test Driver" option in the Lexer or Parser options panel before generating.
Using Test Drivers
Python
# Test with command-line expressions
python lexer.py "3 + 4 * 2"
python lexer.py "hello" "world" "123"
# Interactive testing from Python
from lexer import test, test_all
test('3 + 4 * 2')
test_all('1+2', '(3*4)', 'x = 10')
Output:
Input: '3 + 4 * 2'
NUMBER | '3' | pos=0 line=1 col=1
PLUS | '+' | pos=2 line=1 col=3
NUMBER | '4' | pos=4 line=1 col=5
TIMES | '*' | pos=6 line=1 col=7
NUMBER | '2' | pos=8 line=1 col=9
C
# Compile with test driver
gcc -o lexer lexer.c
# Test expressions
./lexer "3 + 4 * 2"
./lexer "hello" "123"
# Compile WITHOUT test driver (for library use)
gcc -DLEXER_NO_MAIN -c lexer.c -o lexer.o
Output:
Input: "3 + 4 * 2"
type=1 | "3"
type=2 | "+"
type=1 | "4"
type=3 | "*"
type=1 | "2"
type=0 | ""
Java
# Compile
javac Lexer.java
# Test expressions
java Lexer "3 + 4 * 2"
java Lexer "hello" "123"
Test Driver API
Python Test Driver Functions
| Function | Description |
|---|---|
test(expr) | Tokenize one expression, print tokens, return list |
test_all(*exprs) | Test multiple expressions, return list of results |
C Test Driver
The C test driver is wrapped in #ifndef LEXER_NO_MAIN so you can disable it:
// To compile as a library without main():
#define LEXER_NO_MAIN
#include "lexer.c"
// Your code that uses Lexer API
Java Test Driver
The Java test driver adds a main method to the Lexer class:
public static void main(String[] args) {
// Tests each argument
}
Parser Test Drivers
Parser test drivers integrate both lexer and parser:
# Python
python parser.py "3 + 4 * 2"
# C
./parser "3 + 4 * 2"
# Java
java Parser "3 + 4 * 2"
Parser test drivers show:
- Tokenization phase output
- Parsing phase with shift/reduce actions
- Final parse result or error messages
Customizing Test Behavior
Python
Modify the test driver in the generated file:
if __name__ == '__main__':
# Add your custom test cases
test('my custom expression')
# Or read from file
with open('input.txt') as f:
test(f.read())
C
Override the default expressions in the else branch of main():
} else {
// Your custom default tests
test("custom expression 1");
test("custom expression 2");
}
Best Practices
- Use test drivers during development - They're the fastest way to verify patterns
- Remove for production - Use
--no-testwhen building release versions - Keep test expressions - Save useful test cases in a script or file
- Check edge cases - Test empty input, long input, Unicode, special characters
Example Workflows
Python Workflow
# 1. Generate lexer with test driver
openlexer --lexer calc.l --lang python -o lexer.py
# 2. Test your patterns
python lexer.py "3+4" "(10-2)/4" "x=y*z"
# 3. Fix any issues, regenerate, retest
openlexer --lexer calc.l --lang python -o lexer.py
python lexer.py "3+4"
# 4. Generate production version without test driver
openlexer --lexer calc.l --lang python -o lexer.py --no-test
C Workflow
# 1. Generate lexer with test driver
openlexer --lexer calc.l --lang c -o lexer.c
# 2. Compile the lexer
gcc -o lexer lexer.c -Wall
# 3. Test your patterns
./lexer "3+4"
./lexer "(10-2)/4"
./lexer "x=y*z"
# 4. Fix any issues, regenerate, recompile, retest
openlexer --lexer calc.l --lang c -o lexer.c
gcc -o lexer lexer.c -Wall
./lexer "3+4"
# 5. Generate production version without test driver
openlexer --lexer calc.l --lang c -o lexer.c --no-test
# 6. Compile as library (for use in your project)
gcc -DLEXER_NO_MAIN -c lexer.c -o lexer.o
Complete C Example Program:
// main.c - Using the generated lexer in your own program
#define LEXER_NO_MAIN // Exclude the built-in test driver
#include "lexer.c"
#include <stdio.h>
#include <stdlib.h>
void process_file(const char* filename) {
FILE* f = fopen(filename, "r");
if (!f) {
fprintf(stderr, "Cannot open file: %s\n", filename);
return;
}
// Read entire file
fseek(f, 0, SEEK_END);
long size = ftell(f);
fseek(f, 0, SEEK_SET);
char* buffer = malloc(size + 1);
fread(buffer, 1, size, f);
buffer[size] = '\0';
fclose(f);
// Tokenize
Lexer lexer;
lexer_init(&lexer, buffer);
Token token;
int count = 0;
printf("Tokens in %s:\n", filename);
do {
token = lexer_next(&lexer);
printf(" [%3d] type=%-3d \"", count++, token.type);
for (int i = 0; i < token.length; i++) {
putchar(token.start[i]);
}
printf("\"\n");
} while (token.type != TOKEN_EOF);
printf("Total: %d tokens\n\n", count);
free(buffer);
}
int main(int argc, char** argv) {
if (argc < 2) {
printf("Usage: %s <file1> [file2] ...\n", argv[0]);
printf(" %s --expr \"expression\"\n", argv[0]);
return 1;
}
for (int i = 1; i < argc; i++) {
if (strcmp(argv[i], "--expr") == 0 && i + 1 < argc) {
// Tokenize expression directly
Lexer lexer;
lexer_init(&lexer, argv[++i]);
Token token;
printf("Expression: \"%s\"\n", argv[i]);
do {
token = lexer_next(&lexer);
printf(" type=%-3d \"", token.type);
for (int j = 0; j < token.length; j++) {
putchar(token.start[j]);
}
printf("\"\n");
} while (token.type != TOKEN_EOF);
} else {
process_file(argv[i]);
}
}
return 0;
}
Build and run:
# Compile your program with the lexer
gcc -o myprogram main.c -Wall
# Process files
./myprogram source1.txt source2.txt
# Or tokenize an expression
./myprogram --expr "3 + 4 * 2"
Java Workflow
# 1. Generate lexer with test driver
openlexer --lexer calc.l --lang java -o Lexer.java
# 2. Compile the lexer
javac Lexer.java
# 3. Test your patterns
java Lexer "3+4"
java Lexer "(10-2)/4"
java Lexer "x=y*z"
# 4. Fix any issues, regenerate, recompile, retest
openlexer --lexer calc.l --lang java -o Lexer.java
javac Lexer.java
java Lexer "3+4"
# 5. Generate production version without test driver
openlexer --lexer calc.l --lang java -o Lexer.java --no-test
Complete Java Example Program:
// Calculator.java - Using the generated lexer in your own program
import java.io.*;
import java.nio.file.*;
import java.util.*;
public class Calculator {
public static void main(String[] args) {
if (args.length == 0) {
System.out.println("Usage: java Calculator <expression>");
System.out.println(" java Calculator --file <filename>");
System.out.println(" java Calculator --interactive");
return;
}
Calculator calc = new Calculator();
if (args[0].equals("--file") && args.length > 1) {
calc.processFile(args[1]);
} else if (args[0].equals("--interactive")) {
calc.interactiveMode();
} else {
// Process each argument as an expression
for (String expr : args) {
calc.tokenize(expr);
}
}
}
public void tokenize(String expression) {
System.out.println("Input: \"" + expression + "\"");
Lexer lexer = new Lexer(expression);
List<Token> tokens = new ArrayList<>();
try {
for (Token token : lexer.tokenize()) {
tokens.add(token);
System.out.printf(" %-12s | %-15s | pos=%d%n",
token.type.name(),
"\"" + token.text + "\"",
token.pos);
}
} catch (LexerException e) {
System.err.println("Lexer error: " + e.getMessage());
}
System.out.println("Total: " + tokens.size() + " tokens\n");
}
public void processFile(String filename) {
try {
String content = Files.readString(Path.of(filename));
System.out.println("=== File: " + filename + " ===\n");
tokenize(content);
} catch (IOException e) {
System.err.println("Cannot read file: " + filename);
System.err.println(e.getMessage());
}
}
public void interactiveMode() {
Scanner scanner = new Scanner(System.in);
System.out.println("OpenLexer Interactive Mode");
System.out.println("Enter expressions to tokenize (empty line to quit):\n");
while (true) {
System.out.print("> ");
String line = scanner.nextLine();
if (line.isEmpty()) {
System.out.println("Goodbye!");
break;
}
tokenize(line);
}
scanner.close();
}
}
Build and run:
# Compile your program (assumes Lexer.java is in same directory)
javac Calculator.java Lexer.java
# Tokenize expressions
java Calculator "3 + 4 * 2"
java Calculator "hello" "world"
# Process a file
java Calculator --file source.txt
# Interactive mode
java Calculator --interactive
> 3 + 4
NUMBER | "3" | pos=0
PLUS | "+" | pos=2
NUMBER | "4" | pos=4
Total: 3 tokens
> quit
Goodbye!
IDE Integration
VS Code Tasks
Create .vscode/tasks.json:
{
"version": "2.0.0",
"tasks": [
{
"label": "Generate & Test Lexer",
"type": "shell",
"command": "openlexer --lexer ${file} --lang python -o lexer.py && python lexer.py \"test\"",
"group": "build"
}
]
}
Makefile
LANG ?= c
lexer.$(LANG): rules.l
openlexer --lexer $< --lang $(LANG) -o $@
test: lexer.$(LANG)
ifeq ($(LANG),c)
gcc -o lexer $< && ./lexer "test expression"
else ifeq ($(LANG),java)
javac $< && java Lexer "test expression"
else
python $< "test expression"
endif
clean:
rm -f lexer lexer.c lexer.py Lexer.java *.class
Combining Lexer and Parser
This guide explains how to generate and use a complete lexer+parser system with OpenLexer.
Overview
A typical language implementation has two phases:
- Lexical Analysis (Lexer): Convert character stream → token stream
- Syntactic Analysis (Parser): Convert token stream → parse tree/AST
Source Code → [Lexer] → Tokens → [Parser] → Parse Tree → [Your Code]
Method 1: GUI Combined Tab
The easiest way to generate both together:
- Open the GUI:
cargo run --bin openlexer-gui --features gui - Click the Combined tab
- Enter your lexer spec in the left panel
- Enter your grammar in the middle panel
- Click Generate Combined
- Download or copy the output
The output contains both lexer and parser in one file with section markers.
Method 2: Command Line
# Generate both at once
openlexer --lexer calc.l --parser calc.y --lang python -o output/
# This creates:
# output/lexer.py - The lexer
# output/parser.py - The parser (imports lexer)
Method 3: Separate Generation
# Generate lexer
openlexer --lexer calc.l --lang python -o lexer.py
# Generate parser
openlexer --parser calc.y --lang python -o parser.py
Token Coordination
Critical: The tokens in your .y file must match those returned by your .l file.
Lexer (.l file)
%%
[0-9]+ { return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"(" { return LPAREN; }
")" { return RPAREN; }
[ \t\n]+ { /* skip whitespace */ }
%%
Parser (.y file)
%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN
%%
expr: expr PLUS term
| expr MINUS term
| term
;
term: term TIMES factor
| term DIVIDE factor
| factor
;
factor: NUMBER
| LPAREN expr RPAREN
;
%%
The token names must match exactly!
Language-Specific Integration
Java Integration
File Organization: Java requires one public class per file.
# Generate both
openlexer gen-lexer --lexer calc.l -L java -o src/
openlexer gen-parser --parser calc.y -L java -o src/
# This creates:
# src/Lexer.java - public class Lexer
# src/Parser.java - public class Parser
Compilation:
# Compile both (Parser auto-detects Lexer)
javac src/Lexer.java src/Parser.java
# Run parser
java -cp src Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]
# Input: "3 + 4 * 2"
# Result: 11
Custom Integration:
import java.util.*;
public class Calculator {
public static void main(String[] args) {
Scanner sc = new Scanner(System.in);
while (true) {
System.out.print("calc> ");
String line = sc.nextLine();
if (line.equals("quit")) break;
try {
// Tokenize
Lexer lex = new Lexer(line);
System.out.print("Tokens: ");
Lexer.Token tok;
while ((tok = lex.nextToken()).type != Lexer.TOKEN_EOF) {
System.out.print(tok.text + " ");
}
System.out.println();
// Parse and evaluate
int result = Parser.parse(line);
System.out.println("= " + result);
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
}
}
}
}
Python Integration
# calc.py - Using generated lexer and parser together
from lexer import Lexer, TokenType
from parser import Parser
def evaluate(expression):
"""Evaluate a mathematical expression."""
# Phase 1: Tokenize
lexer = Lexer(expression)
tokens = list(lexer.tokenize())
# Phase 2: Parse
parser = Parser(tokens)
ast = parser.parse()
# Phase 3: Evaluate (your code)
return evaluate_ast(ast)
# Example usage
result = evaluate("3 + 4 * 2")
print(f"Result: {result}") # Output: 11
C Integration
// calc.c - Using generated lexer and parser together
#include "lexer.h"
#include "parser.h"
int evaluate(const char* expression) {
// Phase 1: Initialize lexer
Lexer lexer;
lexer_init(&lexer, expression);
// Phase 2: Initialize parser with lexer
Parser parser;
parser_init(&parser, &lexer);
// Phase 3: Parse and evaluate
int result = parser_parse(&parser);
return result;
}
int main() {
printf("3 + 4 * 2 = %d\n", evaluate("3 + 4 * 2"));
return 0;
}
Java Integration
// Calculator.java - Using generated lexer and parser together
public class Calculator {
public static int evaluate(String expression) {
// Phase 1: Tokenize
Lexer lexer = new Lexer(expression);
List<Token> tokens = lexer.tokenize();
// Phase 2: Parse
Parser parser = new Parser(tokens);
ParseTree tree = parser.parse();
// Phase 3: Evaluate
return evaluate(tree);
}
public static void main(String[] args) {
System.out.println("3 + 4 * 2 = " + evaluate("3 + 4 * 2"));
}
}
Parser Calling Lexer Directly
In some generated code, the parser calls the lexer automatically:
# Parser internally calls lexer.next_token() as needed
parser = Parser(input_string) # Lexer created internally
result = parser.parse()
Check your generated code's constructor signature to see which style is used.
Semantic Actions
Connect lexer output to parser semantic values:
Passing Token Values
In the lexer, the matched text is available via yytext:
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
In the parser, access values with $1, $2, etc.:
expr: expr PLUS term { $$ = $1 + $3; }
| NUMBER { $$ = $1; }
;
Complete Example: Calculator
calc.l (Lexer)
%{
/* Calculator lexer */
%}
%%
[0-9]+ { return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"(" { return LPAREN; }
")" { return RPAREN; }
[ \t\n]+ { /* skip */ }
. { return ERROR; }
%%
calc.y (Parser)
%{
/* Calculator parser */
%}
%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN ERROR
%left PLUS MINUS
%left TIMES DIVIDE
%%
input: expr { printf("Result: %d\n", $1); }
;
expr: expr PLUS expr { $$ = $1 + $3; }
| expr MINUS expr { $$ = $1 - $3; }
| expr TIMES expr { $$ = $1 * $3; }
| expr DIVIDE expr { $$ = $1 / $3; }
| LPAREN expr RPAREN { $$ = $2; }
| NUMBER { $$ = $1; }
;
%%
Build and Run
# Generate both
openlexer --lexer calc.l --parser calc.y --lang c -o calc/
# Compile
cd calc
gcc -o calculator lexer.c parser.c main.c
# Run
./calculator "3 + 4 * 2"
# Output: Result: 11
Troubleshooting
"Unknown token" errors
Cause: Parser received a token not in %token declaration
Fix: Ensure all tokens returned by lexer are declared in parser
"Syntax error" on valid input
Cause: Usually whitespace/newlines not being skipped
Fix: Add whitespace rule to lexer:
[ \t\n]+ { /* skip whitespace */ }
Parser stuck or infinite loop
Cause: Lexer not returning EOF token
Fix: Ensure lexer returns EOF/END_OF_FILE when input exhausted
Token value is wrong
Cause: Semantic value (yylval) not being set
Fix: In lexer action, set the value:
[0-9]+ { yylval.intval = atoi(yytext); return NUMBER; }
Best Practices
- Define tokens in one place - Use
%tokenin parser, reference in lexer - Test lexer first - Use test driver to verify tokens before parser work
- Handle all input - Include catch-all rule for unexpected characters
- Skip whitespace explicitly - Don't rely on implicit handling
- Use precedence - Resolve ambiguity with
%left,%right,%nonassoc - Start simple - Get basic grammar working, then add complexity
Architecture and Internals
This page explains how OpenLexer works internally. Understanding these concepts helps you write better specifications and debug complex issues.
System Overview
┌─────────────────────────────────────────────────────────────┐
│ OpenLexer │
├─────────────────────────────┬───────────────────────────────┤
│ Lexer Generator │ Parser Generator │
│ (lexgen) │ (parsegen) │
├─────────────────────────────┼───────────────────────────────┤
│ 1. Parse .l file │ 1. Parse .y file │
│ 2. Compile regexes → NFA │ 2. Build grammar structure │
│ 3. NFA → DFA conversion │ 3. Compute FIRST/FOLLOW sets │
│ 4. DFA minimization │ 4. Build LR(0) item sets │
│ 5. Generate code │ 5. Build LALR(1) tables │
│ │ 6. Generate code │
└─────────────────────────────┴───────────────────────────────┘
Lexer Generation Pipeline
Stage 1: Parsing the .l File
The lexer spec parser (lexgen::rules) extracts:
- Definitions: Named patterns like
DIGIT [0-9] - Start conditions:
%x COMMENT,%s STRING - Rules: Pattern-action pairs
- Prologue/Epilogue: User code to embed
#![allow(unused)] fn main() { // Internal representation struct LexerSpec { rules: Vec<Rule>, // Pattern → Action mappings definitions: HashMap<String, String>, start_conditions: Vec<StartCondition>, prologue: String, epilogue: String, } }
Stage 2: Regex → NFA (Thompson's Construction)
Each regex pattern is converted to a Nondeterministic Finite Automaton:
Regex: a(b|c)*
ε ε
┌──────► 2 ──► 3 ─┐
│ │b │ε
│ ▼ │
1 ──┤ ε 4 ────────┼──► 5
a │ │c │
│ ▼ │
└──────► 6 ──► 7 ─┘
ε
Key operations:
- Concatenation: Chain NFAs sequentially
- Alternation (
|): Parallel paths with ε-transitions - Kleene star (
*): Loop back with ε-transitions - Character classes: Single state with multiple outgoing edges
Stage 3: NFA → DFA (Subset Construction)
The combined NFA is converted to a Deterministic Finite Automaton:
- Compute ε-closure of start state → initial DFA state
- For each input symbol, compute move then ε-closure
- New state sets become new DFA states
- Repeat until no new states
NFA States {1,2,3} on input 'a' → NFA States {4,5,6,7}
↓
DFA State D2
Stage 4: DFA Minimization (Hopcroft's Algorithm)
The DFA is minimized to reduce state count:
- Partition states into accepting/non-accepting
- Refine partitions based on transitions
- Merge equivalent states
This reduces memory usage and improves runtime performance.
Stage 5: Code Generation
The DFA is emitted as:
- Transition table: 2D array
table[state][char] → next_state - Accept table:
accept[state] → token_type or -1 - Lexer driver: Loop that walks the DFA
Parser Generation Pipeline
Stage 1: Parsing the .y File
The grammar parser (parsegen::grammar) extracts:
- Token declarations:
%token NUMBER PLUS - Precedence rules:
%left PLUS MINUS,%right POWER - Grammar rules: Productions with semantic actions
- Start symbol: First rule's LHS or explicit
%start
#![allow(unused)] fn main() { // Internal representation struct Grammar { terminals: Vec<Symbol>, nonterminals: Vec<Symbol>, rules: Vec<Production>, precedence: Vec<PrecedenceLevel>, start: Symbol, } }
Stage 2: FIRST and FOLLOW Sets
FIRST(X): Set of terminals that can begin strings derived from X
FIRST(expr) = { NUMBER, LPAREN, MINUS }
FOLLOW(X): Set of terminals that can appear after X
FOLLOW(expr) = { PLUS, MINUS, TIMES, RPAREN, $ }
These sets drive lookahead decisions in the parser.
Stage 3: LR(0) Item Sets
An LR(0) item is a production with a dot showing parsing progress:
expr → expr • + term (we've seen expr, expecting +)
expr → expr + • term (we've seen expr +, expecting term)
expr → expr + term • (complete, ready to reduce)
The parser builds the canonical collection of item sets:
State I0:
S' → • expr
expr → • expr + term
expr → • term
term → • NUMBER
State I1: (after seeing expr)
S' → expr •
expr → expr • + term
Stage 4: LALR(1) Tables
LALR(1) adds lookahead to resolve conflicts:
ACTION Table: state × terminal → shift/reduce/accept/error
GOTO Table: state × nonterminal → next state
Example:
State 5, lookahead '+': SHIFT to state 7
State 5, lookahead '$': REDUCE by rule expr → expr + term
Stage 5: Conflict Resolution
Shift/Reduce Conflicts: Resolved by precedence and associativity
%left PLUS MINUS /* Left-associative, same precedence */
%left TIMES DIVIDE /* Higher precedence than +/- */
%right POWER /* Right-associative */
Reduce/Reduce Conflicts: First matching rule wins (or use GLR)
Stage 6: Code Generation
The parser is emitted as:
- ACTION table: Encodes shift/reduce/accept/error
- GOTO table: State transitions on nonterminals
- Reduction logic: Switch statement executing semantic actions
- Stack management: State and value stacks
GLR Parsing
When %glr-parser is specified, conflicts create multiple parse paths:
┌─ Parse path A (shift)
Input: "a - b - c" ──┤
└─ Parse path B (reduce)
All paths are explored in parallel. Invalid paths die when they encounter errors. If multiple paths succeed, the grammar is ambiguous.
DFA Transition Table Encoding
The generated transition table uses efficient encoding:
Dense Table (Default)
// Full 256-entry table per state
int table[NUM_STATES][256] = {
{ -1, -1, ..., 5, 5, ..., 3, 3, ... }, // state 0
{ -1, -1, ..., 2, 2, ..., 7, 7, ... }, // state 1
};
Optimized Encoding (When Enabled)
- Row compression: Factor out common default transitions
- Column compression: Merge equivalent character columns
- Direct-coded DFA: Generate switch statements instead of tables
Token Matching Strategy
OpenLexer uses maximal munch (longest match) with rule priority:
- Try all patterns from current position
- Select the longest matching pattern
- On ties, prefer the pattern defined first in the
.lfile - Execute the associated action
- Advance position and repeat
Memory Layout
Generated Lexer
┌────────────────────────────┐
│ Input buffer (user string) │
├────────────────────────────┤
│ Position tracking │
│ (pos, line, column) │
├────────────────────────────┤
│ Start condition stack │
├────────────────────────────┤
│ Token result buffer │
└────────────────────────────┘
Generated Parser
┌────────────────────────────┐
│ State stack │
│ (current parsing states) │
├────────────────────────────┤
│ Value stack │
│ (semantic values for $1,$2)│
├────────────────────────────┤
│ Lookahead token │
├────────────────────────────┤
│ Error recovery state │
└────────────────────────────┘
Performance Characteristics
| Operation | Lexer | Parser |
|---|---|---|
| Time complexity | O(n) | O(n) for LALR, O(n³) worst-case for GLR |
| Space complexity | O(states × alphabet) | O(states × symbols) |
| Typical states | 50-500 | 100-2000 |
Error Handling
Lexer Errors
When no pattern matches:
- Report position with line/column
- Skip one character (or configurable recovery)
- Continue lexing
Parser Errors
When no valid action exists:
- Report token and expected tokens
- Pop states until error recovery possible
- Skip tokens until synchronization point
- Resume parsing
File Format References
The .l and .y formats are compatible with Flex and Bison respectively, with these extensions:
OpenLexer Extensions to Flex
%unicode- Enable Unicode mode- Full Unicode character class support (
\p{L},\p{Greek})
OpenLexer Extensions to Bison
%glr-parser- Enable GLR parsing- Multi-language output (Python, Java, C)
Internal Module Structure
openlexer/
├── src/
│ ├── lib.rs # Public API
│ ├── error.rs # Error types
│ ├── lexgen/ # Lexer generator
│ │ ├── mod.rs # Public interface
│ │ ├── regex.rs # Regex parser
│ │ ├── nfa.rs # NFA construction
│ │ ├── dfa.rs # DFA conversion + minimization
│ │ ├── rules.rs # .l file parser
│ │ └── codegen.rs # Code generation
│ └── parsegen/ # Parser generator
│ ├── mod.rs # Public interface
│ ├── grammar.rs # .y file parser
│ ├── first.rs # FIRST/FOLLOW computation
│ ├── lalr.rs # LALR(1) table construction
│ └── codegen.rs # Code generation
Comparison with Flex/Bison
This chapter compares OpenLexer with GNU Flex and Bison.
Feature Comparison
| Feature | Flex | Bison | OpenLexer |
|---|---|---|---|
| Lexer generation | Yes | No | Yes |
| Parser generation | No | Yes | Yes |
| C output | Yes | Yes | Yes |
| Java output | JFlex | CUP/ANTLR | Yes |
| Python output | PLY | PLY | Yes |
| GLR parsing | No | Yes | Yes |
| LALR(1) | N/A | Yes | Yes |
| Start conditions | Yes | N/A | Yes |
| Precedence | N/A | Yes | Yes |
| Unicode | Limited | N/A | Planned |
| Web (WASM) | No | No | Yes |
File Format Compatibility
Lexer Files (.l)
OpenLexer supports most Flex syntax:
Supported:
- Definitions section with
%{ %}code blocks - Named patterns
- Rules with actions
- Start conditions (
%x,%s) - Character classes
[a-z] - Predefined classes
[:alpha:] - Quantifiers
*,+,?,{n,m} - Alternation
| - Grouping
() - Literal strings
"..."
Not supported:
%optiondirectivesyymore(),yyless()REJECT- Multiple input buffers
- Interactive mode
Grammar Files (.y)
OpenLexer supports most Bison syntax:
Supported:
%tokendeclarations%uniontype declarations%typenon-terminal types%left,%right,%nonassocprecedence%startsymbol%preccontextual precedence%glr-parserGLR mode- Semantic actions with
$$,$1, etc. errortoken for recovery
Not supported:
%defineoptions%codeblocks (use%{ %}instead)%destructor%printer- Lookahead correction (
%define lr.type ielr) - Named references (
$nameinstead of$1)
Migration Guide
From Flex
- Remove
%optionlines - Replace
yymore()with manual buffer handling - Replace
REJECTwith explicit rules - Test start condition behavior
From Bison
- Replace
%definewith equivalent features - Move
%codeblocks to%{ %} - Test GLR behavior if used
- Verify error recovery patterns
Performance
OpenLexer generates table-driven lexers and parsers similar to Flex/Bison. Performance characteristics:
- Lexer: O(n) where n is input length
- LALR Parser: O(n) for unambiguous grammars
- GLR Parser: O(n) to O(n^3) depending on ambiguity
The generated code size is comparable to Flex/Bison output.
When to Use OpenLexer
Use OpenLexer when:
- You need multi-language output
- You want a single cross-platform tool
- You need integrated lexer + parser generation
- You want GLR parsing
Use Flex/Bison when:
- You need maximum C performance optimization
- You require specific Flex/Bison features not in OpenLexer
- You have existing build infrastructure for Flex/Bison