OpenLexer

OpenLexer is a lexer and parser generator written in Rust. It reads .l (Flex-compatible) and .y (Bison-compatible) specification files and generates lexers and parsers in C, Java, or Python.

Components

  • Lexer Generator: Converts regular expression patterns to DFA-based lexers using Thompson construction and subset construction algorithms.
  • Parser Generator: Builds LALR(1) parsing tables from context-free grammars. Supports GLR parsing for ambiguous grammars.
  • Code Generation: Outputs standalone lexer and parser code in C, Java, or Python.

Supported Platforms

  • Windows (x64)
  • Linux (x64, ARM64)
  • macOS (x64, ARM64)

File Formats

OpenLexer uses the standard Flex/Bison file formats:

  • .l files: Lexer specifications with regex patterns and actions
  • .y files: Grammar specifications with production rules and semantic actions

Basic Usage

# Generate a Python lexer from calc.l
openlexer gen-lexer --lexer calc.l --lang python --output ./

# Generate a Java parser from calc.y 
openlexer gen-parser --parser calc.y --lang java --output ./

# For Java: compile and run
javac Lexer.java Parser.java
java Parser "3 + 4 * 2"

Key Features

  • Smart File Organization: Generated code follows language-specific best practices
    • Java: One public class per file, automatic lexer detection
    • C: Flexible compilation with preprocessor controls
    • Python: Module-based imports
  • Complete Integration: Lexer and parser work seamlessly together
  • Standalone or Combined: Each component can work independently or together
  • Built-in Test Drivers: Generated code includes test scaffolding

Documentation Structure

License

MIT License

Getting Started

This guide will help you install OpenLexer and generate your first lexer and parser.

What You'll Learn

  1. How to install OpenLexer
  2. How to write a lexer specification (.l file)
  3. How to write a grammar specification (.y file)
  4. How to generate and use the output code

Prerequisites

  • Basic understanding of regular expressions
  • Familiarity with context-free grammars (helpful but not required)
  • A compiler for your target language (gcc, javac, or python)

Overview

OpenLexer follows a two-stage process similar to Flex/Bison:

┌─────────────┐      ┌─────────────┐
│   calc.l    │──────│  Lexer      │──────▶ lexer.c/java/py
│ (patterns)  │      │  Generator  │
└─────────────┘      └─────────────┘

┌─────────────┐      ┌─────────────┐
│   calc.y    │──────│  Parser     │──────▶ parser.c/java/py
│ (grammar)   │      │  Generator  │
└─────────────┘      └─────────────┘

The lexer converts input text into tokens using regular expressions. The parser recognizes the structure of those tokens using grammar rules.

Next Steps

Installation

Prerequisites

  • Rust toolchain (1.70 or later)
  • Cargo package manager

Build from Source

git clone https://github.com/magi8101/openlexer.git
cd openlexer
cargo build --release

The binary will be at target/release/openlexer (or openlexer.exe on Windows).

Install via Cargo

cargo install --path .

This installs openlexer to ~/.cargo/bin/.

Verify Installation

openlexer --version
openlexer --help

WebAssembly Build

To build for web browsers:

cargo install wasm-pack
wasm-pack build --target web

The output will be in the pkg/ directory.

Platform Notes

Windows

No additional dependencies required. The MSVC toolchain is recommended.

Linux

Ensure build-essential or equivalent is installed for linking.

macOS

Xcode command line tools must be installed:

xcode-select --install

Quick Start

This guide walks through creating a calculator lexer and parser.

Step 1: Create the Lexer Specification

Create calc.l:

%{
/* Calculator lexer */
%}

%%

[0-9]+          { return NUMBER; }
"+"             { return PLUS; }
"-"             { return MINUS; }
"*"             { return TIMES; }
"/"             { return DIVIDE; }
"("             { return LPAREN; }
")"             { return RPAREN; }
[ \t\n]+        { /* skip whitespace */ }
.               { return ERROR; }

%%

Step 2: Create the Grammar Specification

Create calc.y:

%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN

%left PLUS MINUS
%left TIMES DIVIDE

%%

expr:
    expr PLUS expr    { $$ = $1 + $3; }
  | expr MINUS expr   { $$ = $1 - $3; }
  | expr TIMES expr   { $$ = $1 * $3; }
  | expr DIVIDE expr  { $$ = $1 / $3; }
  | LPAREN expr RPAREN { $$ = $2; }
  | NUMBER            { $$ = $1; }
  ;

%%

Step 3: Generate Code

Option A: Python

# Generate Python lexer and parser
openlexer gen-lexer --lexer calc.l -L python -o ./
openlexer gen-parser --parser calc.y -L python -o ./

This produces lexer.py and parser.py.

Use the generated code:

# Test the parser directly
python parser.py "3 + 4 * 2"
# Output: Result: 11

Or create main.py:

from parser import parse

result = parse("3 + 4 * 2")
print(f"Result: {result}")

Option B: Java

# Generate Java lexer and parser
openlexer gen-lexer --lexer calc.l -L java -o ./
openlexer gen-parser --parser calc.y -L java -o ./

This produces Lexer.java and Parser.java.

Compile and run:

# Compile both files
javac Lexer.java Parser.java

# Run parser (auto-detects Lexer.class)
java Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]
#         Input: "3 + 4 * 2"
#         Result: 11

Option C: C

# Generate C lexer and parser
openlexer gen-lexer --lexer calc.l -L c -o ./
openlexer gen-parser --parser calc.y -L c -o ./

This produces lexer.c and parser.c.

Compile and run:

# Compile and link
gcc -o calc parser.c -lm

# Run
./calc "3 + 4 * 2"
# Output: Result: 11

Next Steps

Lexer Generator

The lexer generator converts regular expression patterns into a deterministic finite automaton (DFA) that tokenizes input text.

How It Works

  1. Parsing: The .l file is parsed to extract patterns, actions, and declarations.
  2. NFA Construction: Each regex pattern is converted to an NFA using Thompson's construction.
  3. NFA Combination: All pattern NFAs are combined with epsilon transitions.
  4. DFA Construction: The combined NFA is converted to a DFA using subset construction.
  5. DFA Minimization: The DFA is minimized to reduce state count.
  6. Code Generation: The DFA transition table is emitted in the target language.

Input Format

Lexer specifications use the Flex .l file format:

%{
/* C/Python/Java prologue code */
%}

/* Definitions */
DIGIT   [0-9]
LETTER  [a-zA-Z]

%%

/* Rules */
{DIGIT}+        { return NUMBER; }
{LETTER}+       { return IDENTIFIER; }

%%

/* Epilogue code */

Output

The generated lexer provides:

  • A Lexer class (or struct in C)
  • A next_token() function that returns the next token
  • Token type constants
  • Access to matched text via yytext
  • Test driver (optional) for immediate testing

Test Driver

When enabled, the generated lexer includes a test driver that lets you verify tokenization:

# Python
python lexer.py "3 + 4 * 2"

# C
./lexer "hello world"

# Java  
java Lexer "test input"

See Test Drivers for details.

Using with Parser

The lexer's output feeds into the parser. Ensure token names match between your .l and .y files:

[0-9]+   { return NUMBER; }  /* Must match %token NUMBER in parser */

See Combining Lexer and Parser for complete integration examples.

Chapters

Lexer File Format (.l)

The lexer specification file has three sections separated by %%:

DEFINITIONS
%%
RULES
%%
USER CODE

Definitions Section

The definitions section contains:

Prologue Code

Code enclosed in %{ and %} is copied directly to the output:

%{
#include <stdio.h>
int line_count = 0;
%}

Named Patterns

Named patterns can be referenced in rules using {name}:

DIGIT       [0-9]
ALPHA       [a-zA-Z]
ALNUM       [a-zA-Z0-9]
ID          {ALPHA}{ALNUM}*

Start Condition Declarations

Declare exclusive (%x) or inclusive (%s) start conditions:

%x COMMENT
%s STRING

Rules Section

Each rule has a pattern and an action:

%%

{ID}            { return IDENTIFIER; }
{DIGIT}+        { return NUMBER; }
"/*"            { BEGIN(COMMENT); }
<COMMENT>"*/"   { BEGIN(INITIAL); }
<COMMENT>.      { /* skip */ }
[ \t\n]+        { /* skip whitespace */ }

%%

Rule Syntax

[<start_condition>]pattern    { action }
  • Patterns match from left to right
  • Longer matches take priority
  • Earlier rules break ties
  • Actions are code blocks that can return tokens

Special Variables

  • yytext: The matched text (string)
  • yyleng: Length of matched text
  • yylineno: Current line number (if enabled)

User Code Section

The third section is copied verbatim to the end of the output file:

%%

int main() {
    while (yylex() != 0) {
        printf("Token: %s\n", yytext);
    }
    return 0;
}

Complete Example

%{
/* Token definitions */
#define NUMBER 1
#define PLUS 2
#define MINUS 3
%}

DIGIT   [0-9]

%%

{DIGIT}+    { return NUMBER; }
"+"         { return PLUS; }
"-"         { return MINUS; }
[ \t\n]+    { /* skip */ }
.           { fprintf(stderr, "Unknown: %s\n", yytext); }

%%

Regular Expressions

OpenLexer supports the following regular expression syntax, compatible with Flex.

Character Classes

SyntaxDescription
.Any character except newline
[abc]Character class: a, b, or c
[a-z]Range: lowercase letters
[^abc]Negated class: not a, b, or c
[a-zA-Z0-9_]Combined ranges

Predefined Classes

SyntaxEquivalent
[:alpha:][a-zA-Z]
[:digit:][0-9]
[:alnum:][a-zA-Z0-9]
[:space:][ \t\n\r\f\v]
[:upper:][A-Z]
[:lower:][a-z]

Use inside character classes: [[:alpha:]_]

Quantifiers

SyntaxDescription
*Zero or more
+One or more
?Zero or one
{n}Exactly n
{n,}n or more
{n,m}Between n and m

Anchors

SyntaxDescription
^Start of line
$End of line

Grouping and Alternation

SyntaxDescription
(ab)Group
a|bAlternation: a or b

Escape Sequences

SyntaxDescription
\nNewline
\tTab
\rCarriage return
\\Literal backslash
\.Literal dot
\*Literal asterisk

Literal Strings

Double-quoted strings match literally:

"while"     { return WHILE; }
"=="        { return EQ; }
"++"        { return INCREMENT; }

Named Pattern References

Reference definitions with braces:

/* Definition */
DIGIT   [0-9]

%%

/* Rule using definition */
{DIGIT}+    { return NUMBER; }

Examples

/* Integer literal */
[0-9]+                      { return INTEGER; }

/* Floating point */
[0-9]+\.[0-9]+              { return FLOAT; }

/* Identifier */
[a-zA-Z_][a-zA-Z0-9_]*      { return IDENTIFIER; }

/* C-style string */
\"([^"\\]|\\.)*\"           { return STRING; }

/* Single-line comment */
"//".*                      { /* skip */ }

Actions and Return Values

Each lexer rule has an action block that executes when the pattern matches.

Basic Actions

Return a token type:

[0-9]+      { return NUMBER; }
"+"         { return PLUS; }

Skip input without returning:

[ \t\n]+    { /* skip whitespace */ }

Available Variables

yytext

The matched text as a string:

[a-z]+      { printf("Matched: %s\n", yytext); return WORD; }

yyleng

Length of the matched text:

[a-z]+      { printf("Length: %d\n", yyleng); return WORD; }

yylval

Semantic value passed to the parser. Set in the action:

[0-9]+      { yylval.ival = atoi(yytext); return NUMBER; }
[0-9]+\.[0-9]+  { yylval.dval = atof(yytext); return FLOAT; }

Multi-Statement Actions

Use braces for multiple statements:

[0-9]+      {
    int val = atoi(yytext);
    if (val > 1000) {
        fprintf(stderr, "Warning: large number\n");
    }
    yylval.ival = val;
    return NUMBER;
}

Language-Specific Actions

C Actions

[a-z]+      {
    yylval.str = strdup(yytext);
    return IDENTIFIER;
}

Python Actions

[a-z]+      {
    self.yylval = self.yytext
    return self.IDENTIFIER
}

Java Actions

[a-z]+      {
    yylval = yytext;
    return IDENTIFIER;
}

Token Constants

Token constants should match those declared in the grammar file:

%{
/* From parser.h or defined here */
#define NUMBER 258
#define PLUS   259
#define MINUS  260
%}

%%

[0-9]+  { return NUMBER; }
"+"     { return PLUS; }
"-"     { return MINUS; }

%%

Error Handling

The . pattern matches any single character. Use it as a catch-all:

.       {
    fprintf(stderr, "Unexpected character: %s\n", yytext);
    return ERROR;
}

Start Conditions

Start conditions allow the lexer to switch between different sets of rules. This is useful for handling comments, strings, and other context-dependent lexing.

Declaring Start Conditions

Use %x for exclusive conditions or %s for inclusive conditions:

%x COMMENT
%x STRING
%s SPECIAL
  • Exclusive (%x): Only rules with this condition are active.
  • Inclusive (%s): Rules with this condition AND rules without conditions are active.

Using Start Conditions

Applying to Rules

Prefix a rule with the condition name in angle brackets:

<COMMENT>.      { /* inside comment */ }
<STRING>[^"]+   { /* inside string */ }

Multiple Conditions

Specify multiple conditions separated by commas:

<COMMENT,STRING>.   { /* in comment or string */ }

Initial Condition

Rules without a condition apply in the INITIAL state:

[a-z]+      { return WORD; }   /* applies in INITIAL */

Or explicitly:

<INITIAL>[a-z]+     { return WORD; }

Switching Conditions

Use BEGIN(condition) to switch:

"/*"            { BEGIN(COMMENT); }
<COMMENT>"*/"   { BEGIN(INITIAL); }

Example: C-Style Comments

%x COMMENT

%%

"/*"            { BEGIN(COMMENT); }
<COMMENT>"*/"   { BEGIN(INITIAL); }
<COMMENT>\n     { yylineno++; }
<COMMENT>.      { /* skip comment content */ }

%%

Example: String Literals

%x STRING

%%

\"              { BEGIN(STRING); string_buf_ptr = string_buf; }
<STRING>\"      { 
    BEGIN(INITIAL);
    *string_buf_ptr = '\0';
    yylval.str = strdup(string_buf);
    return STRING_LITERAL;
}
<STRING>\\n     { *string_buf_ptr++ = '\n'; }
<STRING>\\t     { *string_buf_ptr++ = '\t'; }
<STRING>\\\\    { *string_buf_ptr++ = '\\'; }
<STRING>\\.     { *string_buf_ptr++ = yytext[1]; }
<STRING>[^\\\"]+ {
    char *p = yytext;
    while (*p) *string_buf_ptr++ = *p++;
}

%%

Example: Nested Comments

For languages with nested comments, use a counter:

%x COMMENT

%{
int comment_depth = 0;
%}

%%

"(*"            { comment_depth++; BEGIN(COMMENT); }
<COMMENT>"(*"   { comment_depth++; }
<COMMENT>"*)"   { 
    if (--comment_depth == 0) BEGIN(INITIAL);
}
<COMMENT>.      { /* skip */ }
<COMMENT>\n     { yylineno++; }

%%

Parser Generator

The parser generator converts context-free grammars into LALR(1) parsing tables and generates parser code.

How It Works

  1. Parsing: The .y file is parsed to extract tokens, precedence, and grammar rules.
  2. Grammar Analysis: First and Follow sets are computed.
  3. LR(0) Items: Canonical LR(0) item sets are constructed.
  4. LALR(1) Tables: Lookahead information is computed and tables are built.
  5. Conflict Resolution: Shift/reduce and reduce/reduce conflicts are resolved using precedence.
  6. Code Generation: The parsing table and driver are emitted in the target language.

Input Format

Grammar specifications use the Bison .y file format:

%{
/* Prologue code */
%}

/* Token and precedence declarations */
%token NUMBER PLUS MINUS
%left PLUS MINUS

%%

/* Grammar rules */
expr: expr PLUS expr
    | expr MINUS expr
    | NUMBER
    ;

%%

/* Epilogue code */

Output

The generated parser provides:

  • A Parser class (or struct in C)
  • A parse() function that returns the parse result
  • Action and goto tables
  • Semantic action execution

GLR Mode

For ambiguous grammars, add %glr-parser to enable GLR parsing:

%glr-parser

%%

expr: expr MINUS expr
    | NUMBER
    ;

%%

Chapters

Grammar File Format (.y)

The grammar specification file has three sections separated by %%:

DECLARATIONS
%%
RULES
%%
USER CODE

Declarations Section

Prologue Code

Code in %{ and %} is copied to the output:

%{
#include <stdio.h>
#include <math.h>
extern int yylex();
void yyerror(const char *s);
%}

Token Declarations

Declare terminal symbols with %token:

%token NUMBER
%token PLUS MINUS TIMES DIVIDE
%token LPAREN RPAREN

Tokens with types:

%token <ival> NUMBER
%token <str> IDENTIFIER

Union Declaration

Define the semantic value type:

%union {
    int ival;
    double dval;
    char *str;
}

Type Declarations

Assign types to non-terminals:

%type <dval> expr term factor

Precedence Declarations

Declare operator precedence and associativity:

%left PLUS MINUS
%left TIMES DIVIDE
%right POWER
%nonassoc UMINUS
  • %left: Left-associative
  • %right: Right-associative
  • %nonassoc: Non-associative (error on chaining)

Later declarations have higher precedence.

Start Symbol

Specify the grammar's start symbol:

%start program

If omitted, the left-hand side of the first rule is used.

Rules Section

Production rules define the grammar:

%%

program:
    statement_list
    ;

statement_list:
    statement
    | statement_list statement
    ;

statement:
    expr SEMICOLON      { printf("Result: %d\n", $1); }
    | error SEMICOLON   { yyerrok; }
    ;

expr:
    expr PLUS expr      { $$ = $1 + $3; }
    | expr MINUS expr   { $$ = $1 - $3; }
    | NUMBER            { $$ = $1; }
    ;

%%

Rule Syntax

nonterminal:
    production1     { action1 }
    | production2   { action2 }
    ;

Empty Productions

Use /* empty */ or nothing:

optional_items:
    /* empty */
    | item_list
    ;

User Code Section

Auxiliary functions copied to the output:

%%

void yyerror(const char *s) {
    fprintf(stderr, "Error: %s\n", s);
}

int main() {
    return yyparse();
}

Tokens and Precedence

Token Declaration

All terminals must be declared:

%token NUMBER IDENTIFIER STRING_LITERAL
%token PLUS MINUS TIMES DIVIDE
%token IF ELSE WHILE FOR

Token Values

By default, tokens are assigned sequential integer values starting from 256. To specify values:

%token NUMBER 258
%token PLUS 259
%token MINUS 260

Literal Tokens

Single-character tokens can be used directly in rules:

expr: expr '+' expr
    | expr '-' expr
    | NUMBER
    ;

The character's ASCII value is the token value.

Precedence and Associativity

Associativity

%left PLUS MINUS      /* left-to-right */
%left TIMES DIVIDE    /* higher precedence than + - */
%right POWER          /* right-to-left */
%nonassoc LT GT EQ    /* a < b < c is an error */

Precedence Levels

Declarations later in the file have higher precedence:

%left OR              /* lowest */
%left AND
%left EQ NE
%left LT GT LE GE
%left PLUS MINUS
%left TIMES DIVIDE
%right POWER          /* highest binary */
%right UMINUS         /* unary minus */

Context-Dependent Precedence

Use %prec to override a rule's precedence:

expr: MINUS expr %prec UMINUS   { $$ = -$2; }
    ;

This uses UMINUS precedence instead of MINUS.

Conflict Resolution

Shift/Reduce Conflicts

When the parser can either shift or reduce:

  1. If the lookahead token has higher precedence than the rule, shift.
  2. If the rule has higher precedence, reduce.
  3. If equal precedence, use associativity:
    • Left: reduce
    • Right: shift
    • Nonassoc: error

Example: Dangling Else

%nonassoc LOWER_THAN_ELSE
%nonassoc ELSE

stmt: IF expr stmt %prec LOWER_THAN_ELSE
    | IF expr stmt ELSE stmt
    ;

Reduce/Reduce Conflicts

When two rules can reduce the same input, the first rule in the file wins. This usually indicates a grammar problem.

Token Types

With %union, assign types to tokens:

%union {
    int ival;
    double dval;
    char *str;
}

%token <ival> INTEGER
%token <dval> FLOAT
%token <str> IDENTIFIER STRING

Semantic Actions

Semantic actions are code blocks attached to grammar rules that execute when the rule is reduced.

Basic Syntax

expr: expr PLUS expr    { $$ = $1 + $3; }
    | NUMBER            { $$ = $1; }
    ;

Value Stack References

  • $$: The result value (left-hand side)
  • $1, $2, $3, ...: Values of right-hand side symbols
/* Rule: factor -> LPAREN expr RPAREN */
factor: LPAREN expr RPAREN  { $$ = $2; }
       /* $1=LPAREN, $2=expr, $3=RPAREN */
    ;

Typed Actions

With %union, values have specific types:

%union {
    int ival;
    double dval;
    struct node *ast;
}

%type <dval> expr term factor

%%

expr: expr PLUS term    { $$ = $1 + $3; }
    ;

Multi-Statement Actions

Braces contain the full action:

stmt: PRINT expr SEMICOLON {
        printf("Value: %g\n", $2);
        $$ = $2;
    }
    ;

Mid-Rule Actions

Actions can appear in the middle of a rule:

compound: LBRACE { enter_scope(); } stmt_list RBRACE { leave_scope(); }
    ;

Mid-rule actions count as symbols:

rule: A { mid_action(); } B C
    /* $1=A, $2=mid_action result, $3=B, $4=C */
    ;

AST Construction

Building abstract syntax trees:

%union {
    struct node *ast;
}

%type <ast> expr term factor

%%

expr: expr PLUS term {
        $$ = make_binary_node(OP_ADD, $1, $3);
    }
    | term {
        $$ = $1;
    }
    ;

Language-Specific Actions

C

expr: expr PLUS expr    { $$ = $1 + $3; }
    ;

Python

expr: expr PLUS expr    { self.yyval = self.yyvs[-3] + self.yyvs[-1] }
    ;

Java

expr: expr PLUS expr    { yyval = yyvs[yyvsp-2] + yyvs[yyvsp]; }
    ;

Empty Actions

Rules without explicit actions get a default that copies $1 to $$:

term: factor    /* implicit: { $$ = $1; } */
    ;

Error Handling

The error Token

The special token error represents a syntax error:

stmt: expr SEMICOLON
    | error SEMICOLON   { yyerrok; }
    ;

When an error occurs, the parser pops states until it can shift error, then continues parsing.

yyerror Function

The yyerror function is called on syntax errors:

C

void yyerror(const char *msg) {
    fprintf(stderr, "line %d: %s\n", yylineno, msg);
}

Python

def yyerror(self, msg):
    print(f"line {self.yylineno}: {msg}", file=sys.stderr)

Java

void yyerror(String msg) {
    System.err.println("line " + yylineno + ": " + msg);
}

Error Recovery

yyerrok

Clears the error state, allowing normal parsing to resume:

stmt: error SEMICOLON   { yyerrok; printf("Recovered at ;\n"); }
    ;

yyclearin

Discards the current lookahead token:

stmt: error SEMICOLON   { yyerrok; yyclearin; }
    ;

YYERROR

Force an error from within an action:

expr: expr DIVIDE expr {
        if ($3 == 0) {
            yyerror("division by zero");
            YYERROR;
        }
        $$ = $1 / $3;
    }
    ;

YYABORT

Abort parsing immediately:

stmt: QUIT { YYABORT; }
    ;

YYACCEPT

Accept the input immediately:

program: statement_list END { YYACCEPT; }
    ;

Error Recovery Strategies

Statement-Level Recovery

Recover at statement boundaries:

program: stmt_list
    ;

stmt_list: stmt_list stmt
    | stmt
    ;

stmt: valid_statement SEMICOLON
    | error SEMICOLON { yyerrok; }
    ;

Block-Level Recovery

Recover at block boundaries:

block: LBRACE stmt_list RBRACE
    | LBRACE error RBRACE { yyerrok; }
    ;

Parenthesis Recovery

expr: LPAREN expr RPAREN
    | LPAREN error RPAREN { yyerrok; $$ = 0; }
    | NUMBER
    ;

Error Messages

Provide context in error messages:

void yyerror(const char *msg) {
    fprintf(stderr, "Error at line %d near '%s': %s\n",
            yylineno, yytext, msg);
}

GLR Parsing

GLR (Generalized LR) parsing handles ambiguous and non-deterministic grammars that LALR(1) cannot process.

When to Use GLR

Use GLR parsing when your grammar has:

  • Inherent ambiguity that cannot be resolved with precedence
  • Multiple valid parse trees for the same input
  • Complex language constructs requiring more than one token of lookahead

Enabling GLR Mode

Add %glr-parser to the declarations section:

%glr-parser

%%

expr: expr MINUS expr
    | NUMBER
    ;

%%

How GLR Works

Graph-Structured Stack (GSS)

Instead of a single parse stack, GLR maintains a graph of stacks:

     [0] ─── expr ─── [3] ─── MINUS ─── [4] ─── expr ─── [7]
      │                                          │
      └─────── expr ─────────────────────────────┘

When a conflict occurs, GLR forks the stack and explores both paths.

Shared Packed Parse Forest (SPPF)

Multiple parse trees are represented compactly:

         expr
        / | \
      expr - expr
       |     / \
       1   expr - expr
            |     |
            2     3

For 1 - 2 - 3, the SPPF captures both:

  • (1 - 2) - 3
  • 1 - (2 - 3)

Conflict Handling

In standard LALR(1), conflicts are resolved statically. In GLR:

  • Shift/Reduce: Both shift and reduce occur on separate stack branches
  • Reduce/Reduce: Both reductions occur

Invalid branches are pruned when they encounter errors.

Performance

GLR parsing is O(n^3) in the worst case for highly ambiguous grammars. For mostly unambiguous grammars with occasional conflicts, it approaches O(n) LALR performance.

Chapters

Ambiguous Grammars

Ambiguous grammars have multiple valid parse trees for the same input.

Expression Ambiguity

Without precedence or associativity, binary operators are ambiguous:

%glr-parser

%%

expr: expr MINUS expr
    | expr PLUS expr
    | expr TIMES expr
    | NUMBER
    ;

%%

For 1 - 2 - 3:

Parse tree 1 (left-associative):

    expr
   / | \
 expr - expr
 / \     |
expr -  3
 |    \
 1     2

Result: (1 - 2) - 3 = -4

Parse tree 2 (right-associative):

    expr
   / | \
expr  - expr
  |    / | \
  1  expr - expr
       |     |
       2     3

Result: 1 - (2 - 3) = 2

Dangling Else

The classic ambiguity in C-like languages:

%glr-parser

%%

stmt: IF expr THEN stmt
    | IF expr THEN stmt ELSE stmt
    | other_stmt
    ;

%%

For if a then if b then s1 else s2:

  • if a then (if b then s1 else s2) - else binds to inner if
  • if a then (if b then s1) else s2 - else binds to outer if

Type or Expression

In C, (x)(y) could be:

  • A cast of y to type x
  • A function call of x with argument y
%glr-parser

%%

expr: LPAREN type RPAREN expr
    | expr LPAREN expr RPAREN
    | IDENTIFIER
    ;

%%

Resolving Ambiguity at Runtime

With GLR, you can defer disambiguation to semantic actions:

%glr-parser

%%

primary: IDENTIFIER {
        if (is_type($1)) {
            $$ = make_type_node($1);
        } else {
            $$ = make_var_node($1);
        }
    }
    ;

%%

Comparison with LALR

FeatureLALR(1)GLR
Ambiguous grammarsErrorSupported
Multiple parse treesNoYes (SPPF)
PerformanceO(n)O(n) to O(n^3)
Conflict handlingStaticDynamic

Best Practices

  1. Use LALR with precedence when possible (more efficient)
  2. Use GLR when the grammar is inherently ambiguous
  3. Keep the ambiguous portion of the grammar small
  4. Test with known ambiguous inputs

GLR Algorithm

This chapter describes the GLR parsing algorithm implemented in OpenLexer.

Data Structures

Graph-Structured Stack (GSS)

The GSS is a directed acyclic graph where:

  • Nodes represent parser states
  • Edges represent stack entries (symbols and semantic values)
#![allow(unused)]
fn main() {
struct GssNode {
    state: usize,
    links: Vec<GssEdge>,
}

struct GssEdge {
    predecessor: NodeId,
    symbol: Symbol,
    value: SemanticValue,
}
}

GSS Operations

Push: Add a new node with an edge to the current node.

Reduce: Follow edges backward by the rule length, then push the result.

Merge: When two paths reach the same state, merge them instead of duplicating.

Shared Packed Parse Forest (SPPF)

The SPPF compactly represents all parse trees:

#![allow(unused)]
fn main() {
enum SppfNode {
    Symbol {
        symbol: String,
        span: (usize, usize),
    },
    Packed {
        rule: usize,
        children: Vec<NodeId>,
    },
    Ambiguous {
        alternatives: Vec<NodeId>,
    },
}
}

Algorithm Steps

1. Initialization

Create the initial GSS node with state 0.

GSS: [state 0]
Active: { node_0 }

2. Per-Token Processing

For each input token:

pending_shifts = {}
pending_reductions = {}

for each active node:
    action = get_action(node.state, token)
    if action contains SHIFT:
        add to pending_shifts
    if action contains REDUCE:
        add to pending_reductions

3. Reduction Phase

Process all pending reductions:

while pending_reductions not empty:
    (node, rule) = pop from pending_reductions
    
    for each path of length rule.rhs_len from node:
        base_node = end of path
        goto_state = goto[base_node.state, rule.lhs]
        
        if node with goto_state exists in GSS:
            merge edge into existing node
        else:
            create new node with goto_state
        
        check for new reductions from the new/merged node

4. Shift Phase

Process all pending shifts:

new_active = {}

for each (node, shift_state) in pending_shifts:
    if node with shift_state exists:
        merge
    else:
        create new node
    
    add to new_active

active = new_active

5. Termination

After processing all tokens:

  • If any active node can accept, parsing succeeds
  • If no active nodes remain, parsing fails

Conflict Actions

The glr_conflict_actions table stores all actions at conflict points:

#![allow(unused)]
fn main() {
struct ParsingTable {
    // Standard LALR action table (winner of conflicts)
    action: HashMap<(usize, String), Action>,
    
    // All actions at conflicts (for GLR)
    glr_conflict_actions: HashMap<usize, HashMap<String, Vec<Action>>>,
}
}

During GLR parsing, both actions are explored when a conflict exists.

Path Enumeration

Finding all paths of length N in the GSS:

#![allow(unused)]
fn main() {
fn enumerate_paths(node: NodeId, length: usize) -> Vec<Path> {
    if length == 0 {
        return vec![Path::new(node)];
    }
    
    let mut paths = vec![];
    for edge in gss[node].links {
        for subpath in enumerate_paths(edge.predecessor, length - 1) {
            paths.push(subpath.prepend(edge));
        }
    }
    paths
}
}

Complexity

  • Time: O(n^3) worst case, O(n) for unambiguous input
  • Space: O(n^2) for GSS, O(n^3) for full SPPF

The practical performance depends on the degree of ambiguity in the grammar and input.

Output Languages

OpenLexer generates lexers and parsers in three target languages:

Language Selection

Use the --lang option:

openlexer gen-lexer --lexer rules.l --lang c --output ./
openlexer gen-parser --parser grammar.y --lang java --output ./

Output Files

LanguageLexer OutputParser Output
Clexer.cparser.c
JavaLexer.javaParser.java
Pythonlexer.pyparser.py

File Organization

Each language has specific file organization requirements and best practices:

  • Java: Requires one public class per file; generated files are designed to work standalone or together
  • C: Can be compiled separately or combined; use preprocessor flags to control test drivers
  • Python: Generated files are importable modules

📖 See File Organization Guide for detailed instructions on organizing generated code.

Common Interface

All generated lexers provide:

  • Token type constants
  • A method to get the next token
  • Access to the matched text

All generated parsers provide:

  • A parse method that drives the parsing loop
  • Semantic value stack management
  • Error reporting hooks

Integration

The lexer and parser integrate through a common interface:

Input Text
    │
    ▼
┌─────────┐
│  Lexer  │ ──▶ Token Stream
└─────────┘
    │
    ▼
┌─────────┐
│ Parser  │ ──▶ Parse Result / AST
└─────────┘

Each language chapter describes the specific integration method.

C Output

Generated Files

  • lexer.c - Lexer implementation with DFA tables
  • parser.c - Parser implementation with LALR tables

Lexer Interface

/* Initialize lexer with input */
void lex_init(const char *input);

/* Get next token, returns token type */
int yylex(void);

/* Current matched text */
extern char *yytext;

/* Length of matched text */
extern int yyleng;

/* Current line number (if enabled) */
extern int yylineno;

Parser Interface

/* Parse input, returns 0 on success */
int yyparse(void);

/* Error reporting callback */
void yyerror(const char *msg);

/* Semantic value passed from lexer */
extern YYSTYPE yylval;

Integration Example

#include <stdio.h>
#include "lexer.c"
#include "parser.c"

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <input>\n", argv[0]);
        return 1;
    }
    
    lex_init(argv[1]);
    
    if (yyparse() == 0) {
        printf("Parse successful\n");
        return 0;
    } else {
        printf("Parse failed\n");
        return 1;
    }
}

Compilation

gcc -o myparser main.c -Wall

Or compile separately:

gcc -c lexer.c -o lexer.o
gcc -c parser.c -o parser.o
gcc -c main.c -o main.o
gcc -o myparser lexer.o parser.o main.o

Memory Management

The C lexer uses a static buffer for yytext. For dynamic allocation:

char *token_copy = strdup(yytext);
/* ... use token_copy ... */
free(token_copy);

YYSTYPE Union

When using %union in the grammar:

typedef union YYSTYPE {
    int ival;
    double dval;
    char *str;
} YYSTYPE;

extern YYSTYPE yylval;

Set yylval in lexer actions:

[0-9]+  { yylval.ival = atoi(yytext); return NUMBER; }

Access in parser actions:

expr: expr PLUS expr  { $$ = $1 + $3; }

Platform Compatibility

The generated C code is C99 compliant and works on:

  • GCC (Linux, macOS, MinGW)
  • Clang (Linux, macOS)
  • MSVC (Windows)

Java Output

File Organization

Important: Java requires exactly ONE public class per .java file, and the filename must match the public class name.

Generated Files

  • Lexer Only: Generates Lexer.java containing public class Lexer
  • Parser Only: Generates Parser.java containing public class Parser
  • Both Together: Generate as separate files and compile together
# Generate both
openlexer gen-lexer --lexer calc.l -L java -o output/
openlexer gen-parser --parser calc.y -L java -o output/

# Compile both
javac output/Lexer.java output/Parser.java

# Run (Parser auto-detects Lexer.class)
java -cp output Parser "3 + 4 * 2"

Lexer Interface

public class Lexer {
    // Token constants
    public static final int TOKEN_EOF = 0;
    public static final int TOKEN_ERROR = 1;
    public static final int TOKEN_NUMBER = 2;
    public static final int TOKEN_PLUS = 3;
    // ...
    
    // Token class for lexer-parser integration
    public static class Token {
        public final int type;      // Token type ID
        public final String text;   // Matched lexeme
        public final int pos;       // Position in input
        
        public Token(int type, String text, int pos) {
            this.type = type;
            this.text = text;
            this.pos = pos;
        }
    }
    
    // Constructor
    public Lexer(String input);
    
    // Get next token (returns Token object)
    public Token nextToken();
    
    // Get token name
    public static String tokenName(int type);
}

Parser Interface

public class Parser {
    // Parse input string (auto-detects external Lexer if available)
    public static int parse(String input);
    
    // Main method for testing
    public static void main(String[] args);
}

Automatic Lexer Detection

The generated Parser automatically detects if a compiled Lexer.class is available:

  1. With external Lexer: Uses reflection to call Lexer.nextToken()
    • Output: [Using external Lexer.class]
  2. Without external Lexer: Falls back to inline lexer
    • Output: [Using inline lexer]

This allows the Parser to work standalone OR with a separate Lexer.

Integration Example

Method 1: Using Generated Test Drivers

# Lexer only
javac Lexer.java
java Lexer "3 + 4 * 2"

# Parser with external Lexer
javac Lexer.java Parser.java
java Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]

# Parser standalone (no Lexer.class)
javac Parser.java
java Parser "3 + 4 * 2"
# Output: [Using inline lexer]

Method 2: Custom Integration

public class Main {
    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("Usage: java Main <expression>");
            System.exit(1);
        }
        
        String input = args[0];
        
        // Tokenize with Lexer
        Lexer lexer = new Lexer(input);
        Lexer.Token token;
        
        System.out.println("Tokens:");
        while ((token = lexer.nextToken()).type != Lexer.TOKEN_EOF) {
            System.out.printf("  %s: \"%s\"\n", 
                Lexer.tokenName(token.type), token.text);
        }
        
        // Parse
        int result = Parser.parse(input);
        System.out.println("Result: " + result);
    }
}

Compilation and Execution

# Compile all files
javac Lexer.java Parser.java Main.java

# Run
java Main "3 + 4 * 2"

Semantic Values

The lexer's getValue() returns an Object. Cast as needed:

public int nextToken() {
    // In NUMBER rule:
    this.value = Integer.parseInt(this.text);
    return NUMBER;
}

In parser actions, values are accessed through the semantic stack.

Error Handling

public class ParseException extends Exception {
    private int line;
    private int column;
    private String token;
    
    public ParseException(String message, int line, int column, String token) {
        super(message);
        this.line = line;
        this.column = column;
        this.token = token;
    }
    
    // Getters...
}

Reading from Files

String content = new String(Files.readAllBytes(Paths.get(filename)));
Lexer lexer = new Lexer(content);

Reading from InputStream

BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
StringBuilder sb = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
    sb.append(line).append("\n");
}
Lexer lexer = new Lexer(sb.toString());

Package Declaration

The generated code does not include a package declaration by default. Add one manually or modify the output if needed.

Java Version Compatibility

The generated code is compatible with Java 8 and later. No external dependencies are required.

Python Output

Generated Files

  • lexer.py - Lexer class with DFA tables
  • parser.py - Parser class with LALR tables

Lexer Interface

class Lexer:
    # Token constants
    EOF = 0
    NUMBER = 1
    PLUS = 2
    # ...
    
    def __init__(self, input_string: str):
        """Initialize lexer with input string."""
        
    def next_token(self) -> int:
        """Return the next token type."""
        
    @property
    def text(self) -> str:
        """Return the matched text."""
        
    @property
    def value(self):
        """Return the semantic value."""

Parser Interface

class Parser:
    def __init__(self, lexer: Lexer):
        """Initialize parser with lexer."""
        
    def parse(self):
        """Parse input and return result."""

Integration Example

#!/usr/bin/env python3
import sys
from lexer import Lexer
from parser import Parser

def main():
    if len(sys.argv) < 2:
        print("Usage: python main.py <expression>", file=sys.stderr)
        sys.exit(1)
    
    lexer = Lexer(sys.argv[1])
    parser = Parser(lexer)
    
    try:
        result = parser.parse()
        print(f"Result: {result}")
    except Exception as e:
        print(f"Parse error: {e}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()

Execution

python main.py "3 + 4 * 2"

Semantic Values

Set in lexer actions:

def next_token(self):
    # In NUMBER rule:
    self._value = int(self._text)
    return self.NUMBER

Error Handling

class ParseError(Exception):
    def __init__(self, message, line=None, column=None, token=None):
        super().__init__(message)
        self.line = line
        self.column = column
        self.token = token

Reading from Files

with open(filename, 'r') as f:
    content = f.read()

lexer = Lexer(content)
parser = Parser(lexer)
result = parser.parse()

Reading from stdin

import sys

content = sys.stdin.read()
lexer = Lexer(content)
parser = Parser(lexer)
result = parser.parse()

Type Hints

The generated code includes type hints compatible with Python 3.6+:

def next_token(self) -> int:
    ...

def parse(self) -> Any:
    ...

Dependencies

No external dependencies. Uses only the Python standard library.

Python Version Compatibility

  • Python 3.6+
  • Uses f-strings and type hints
  • No walrus operator (3.8+) for broader compatibility

File Organization Guide for Generated Code

Overview

OpenLexer generates lexer and parser code in C, Java, and Python. This guide explains how to properly organize and use the generated files.

Java File Organization

The "One Public Class Per File" Rule

Java requires that each .java file contains only ONE public class, and the filename must match that class name.

Generating Lexer Only

openlexer gen-lexer --lexer grammar.l -L java -o output/

Output: output/Lexer.java

  • Contains: public class Lexer
  • Includes: Token class, lexer logic, test driver

Compilation & Usage:

javac Lexer.java
java Lexer "3 + 4 * 2"

Generating Parser Only

openlexer gen-parser --parser grammar.y -L java -o output/

Output: output/Parser.java

  • Contains: public class Parser
  • Includes: Inline lexer, parser logic, test driver
  • Can run standalone or detect external Lexer.class

Compilation & Usage:

javac Parser.java
java Parser "3 + 4 * 2"

Generating Both Lexer and Parser

Method 1: Separate Files (Recommended)

openlexer gen-lexer --lexer grammar.l -L java -o output/
openlexer gen-parser --parser grammar.y -L java -o output/

Output:

  • output/Lexer.java - public class Lexer
  • output/Parser.java - public class Parser

Compilation & Usage:

# Compile both
javac Lexer.java Parser.java

# Run parser (automatically detects and uses Lexer.class)
java Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]

Method 2: Combined File (Advanced)

If you need both in one file for deployment, make one class non-public:

// File: Lexer.java
public class Lexer {
    // ... lexer code ...
}

class Parser {  // Note: not public!
    // ... parser code ...
}

Compile and run:

javac Lexer.java
java Lexer  # or java Parser if Parser is public

Token Interface

Lexer Token Format

All Java lexers return a Token object with consistent structure:

public static class Token {
    public final int type;      // Token type constant
    public final String text;   // Lexeme text
    public final int pos;       // Position in input
    
    public Token(int type, String text, int pos) {
        this.type = type;
        this.text = text;
        this.pos = pos;
    }
}

Parser-Lexer Integration

The Parser automatically detects and uses an external Lexer if available:

  1. With external Lexer:

    • Parser calls Lexer.nextToken() via reflection
    • Displays: [Using external Lexer.class]
  2. Without external Lexer:

    • Parser uses its inline lexer
    • Displays: [Using inline lexer]

C File Organization

Generating Lexer

openlexer gen-lexer --lexer grammar.l -L c -o output/

Output: output/lexer.c

  • Includes headers, lexer logic, test driver

Compilation & Usage:

gcc -o lexer lexer.c
./lexer "3 + 4 * 2"

Generating Parser

openlexer gen-parser --parser grammar.y -L c -o output/

Output: output/parser.c

  • Includes inline lexer, parser logic, test driver

Compilation & Usage:

gcc -o parser parser.c
./parser "3 + 4 * 2"

Combining Lexer and Parser

# Generate both
openlexer gen-lexer --lexer grammar.l -L c -o output/
openlexer gen-parser --parser grammar.y -L c -o output/

# Method 1: Compile separately and link
gcc -c lexer.c -o lexer.o
gcc -c parser.c -o parser.o
gcc lexer.o parser.o -o myapp

# Method 2: Combine into one compilation unit
cat lexer.c parser.c > combined.c
gcc -o myapp combined.c

Suppressing Test Drivers:

Use preprocessor flags to disable test/main functions:

# Lexer without test driver
gcc -DLEXER_NO_MAIN -DLEXER_NO_TEST -c lexer.c

# Parser without test driver
gcc -DPARSER_NO_MAIN -c parser.c

Python File Organization

Generating Lexer

openlexer gen-lexer --lexer grammar.l -L python -o output/

Output: output/lexer.py

  • Lexer class, test functions

Usage:

python lexer.py "3 + 4 * 2"

# Or import in your code:
from lexer import Lexer
lex = Lexer("3 + 4 * 2")
for token in lex.tokenize():
    print(token)

Generating Parser

openlexer gen-parser --parser grammar.y -L python -o output/

Output: output/parser.py

  • Parser and inline Lexer class

Usage:

python parser.py "3 + 4 * 2"

# Or import:
from parser import parse
result = parse("3 + 4 * 2")

Combining Both

# Generate both
openlexer gen-lexer --lexer grammar.l -L python -o output/
openlexer gen-parser --parser grammar.y -L python -o output/

# Import and use together:
from lexer import Lexer
from parser import Parser

lex = Lexer("3 + 4 * 2")
parser = Parser(lex)
result = parser.parse()

Best Practices

1. Separate Generation

Generate lexer and parser in separate commands for better modularity.

2. Version Control

  • Commit .l and .y grammar files
  • Add generated files to .gitignore or commit them depending on your workflow

3. Build Scripts

Create a Makefile or build script:

# Makefile
all: lexer parser

lexer:
	openlexer gen-lexer --lexer grammar.l -L java -o build/

parser:
	openlexer gen-parser --parser grammar.y -L java -o build/
	cd build && javac Lexer.java Parser.java

clean:
	rm -rf build/

run:
	cd build && java Parser "3 + 4 * 2"

4. Package Structure (Java)

For larger projects, use Java packages:

mkdir -p src/com/example/parser
openlexer gen-lexer --lexer grammar.l -L java -o src/com/example/parser/
# Edit generated files to add: package com.example.parser;

5. Testing

Use the built-in test drivers during development:

# Test lexer
java Lexer "(10 + 20) * 3"

# Test parser (with or without external lexer)
java Parser "(10 + 20) * 3"

Troubleshooting

Java: "Public class X must be in a file named X.java"

Solution: Ensure file name matches the public class name, or make the class non-public.

Java: "Class Lexer not found" when running Parser

Causes:

  1. Lexer.java not compiled
  2. Lexer.class not in same directory
  3. CLASSPATH not set correctly

Solution:

# Compile both in same directory
javac Lexer.java Parser.java

# Or set CLASSPATH
javac -cp /path/to/classes Parser.java
java -cp /path/to/classes Parser

Java: Multiple public classes error

Solution: Only ONE public class per .java file. Make others non-public:

public class Lexer { ... }
class Parser { ... }  // No 'public'

C: Undefined reference to lexer functions

Solution: Ensure proper linking:

gcc lexer.c parser.c -o myapp
# Or link separately
gcc -c lexer.c && gcc -c parser.c && gcc lexer.o parser.o -o myapp

Advanced: Custom Integration

Java: Manual Lexer-Parser Integration

// Main.java
public class Main {
    public static void main(String[] args) {
        Lexer lexer = new Lexer("3 + 4 * 2");
        Parser parser = new Parser();
        
        // Custom token loop
        Lexer.Token tok;
        while ((tok = lexer.nextToken()).type != Lexer.TOKEN_EOF) {
            // Process tokenparser.consumeToken(tok);
        }
        
        int result = parser.getResult();
        System.out.println("Result: " + result);
    }
}

Python: Custom Integration

# main.py
from lexer import Lexer, TokenType
from parser import Parser

def custom_parse(input_str):
    lex = Lexer(input_str)
    tokens = list(lex.tokenize())
    
    # Filter or transform tokens
    filtered = [t for t in tokens if t.type != TokenType.WHITESPACE]
    
    # Parse
    parser = Parser()
    return parser.parse_tokens(filtered)

result = custom_parse("3 + 4 * 2")
print(f"Result: {result}")

Summary

LanguageLexer FileParser FileTest Command
JavaLexer.javaParser.javajavac *.java && java Parser "input"
Clexer.cparser.cgcc *.c -o app && ./app "input"
Pythonlexer.pyparser.pypython parser.py "input"

Key Points:

  • Java: One public class per file
  • Parser can detect and use external Lexer automatically
  • Use preprocessor flags in C to disable test drivers
  • Python files can be imported as modules

Calculator Example

A complete calculator supporting arithmetic operations with proper precedence.

Lexer Specification (calc.l)

%{
/* Calculator Lexer */
%}

%%

[0-9]+          { return NUMBER; }
[0-9]+\.[0-9]+  { return FLOAT; }
"+"             { return PLUS; }
"-"             { return MINUS; }
"*"             { return TIMES; }
"/"             { return DIVIDE; }
"%"             { return MOD; }
"^"             { return POWER; }
"("             { return LPAREN; }
")"             { return RPAREN; }
[ \t]+          { /* skip whitespace */ }
\n              { return NEWLINE; }
.               { return ERROR; }

%%

Grammar Specification (calc.y)

%{
/* Calculator Parser */
#include <stdio.h>
#include <math.h>
%}

%union {
    double dval;
}

%token <dval> NUMBER FLOAT
%token PLUS MINUS TIMES DIVIDE MOD POWER
%token LPAREN RPAREN NEWLINE ERROR

%type <dval> expr term factor primary

%left PLUS MINUS
%left TIMES DIVIDE MOD
%right POWER
%right UMINUS

%%

input:
    /* empty */
    | input line
    ;

line:
    NEWLINE
    | expr NEWLINE          { printf("= %g\n", $1); }
    | error NEWLINE         { yyerrok; }
    ;

expr:
    expr PLUS term          { $$ = $1 + $3; }
    | expr MINUS term       { $$ = $1 - $3; }
    | term                  { $$ = $1; }
    ;

term:
    term TIMES factor       { $$ = $1 * $3; }
    | term DIVIDE factor    { 
        if ($3 == 0) {
            yyerror("division by zero");
            $$ = 0;
        } else {
            $$ = $1 / $3;
        }
    }
    | term MOD factor       { $$ = fmod($1, $3); }
    | factor                { $$ = $1; }
    ;

factor:
    primary POWER factor    { $$ = pow($1, $3); }
    | primary               { $$ = $1; }
    ;

primary:
    NUMBER                  { $$ = $1; }
    | FLOAT                 { $$ = $1; }
    | LPAREN expr RPAREN    { $$ = $2; }
    | MINUS primary %prec UMINUS { $$ = -$2; }
    ;

%%

void yyerror(const char *s) {
    fprintf(stderr, "Error: %s\n", s);
}

Generate Code

openlexer gen-lexer --lexer calc.l --lang c --output ./
openlexer gen-parser --parser calc.y --lang c --output ./

Main Program (main.c)

#include <stdio.h>
#include <string.h>

/* Include generated code */
#include "lexer.c"
#include "parser.c"

int main() {
    char line[1024];
    
    printf("Calculator (type expressions, Ctrl+C to exit)\n");
    
    while (1) {
        printf("> ");
        fflush(stdout);
        
        if (fgets(line, sizeof(line), stdin) == NULL) {
            break;
        }
        
        lex_init(line);
        yyparse();
    }
    
    return 0;
}

Build and Run

gcc -o calc main.c -lm -Wall
./calc

Example Session

Calculator (type expressions, Ctrl+C to exit)
> 3 + 4
= 7
> 2 * 3 + 4
= 10
> 2 + 3 * 4
= 14
> (2 + 3) * 4
= 20
> 2 ^ 3 ^ 2
= 512
> -5 + 3
= -2
> 10 / 0
Error: division by zero
= 0

Notes

  • Power (^) is right-associative: 2^3^2 = 2^9 = 512
  • Unary minus has highest precedence
  • Parentheses override precedence
  • Division by zero is caught with error message

MiniJava Example

A subset of Java for teaching compilers. This example shows a more complex grammar.

Lexer Specification (minijava.l)

%{
/* MiniJava Lexer */
%}

%x COMMENT

%%

"class"         { return CLASS; }
"public"        { return PUBLIC; }
"static"        { return STATIC; }
"void"          { return VOID; }
"main"          { return MAIN; }
"String"        { return STRING; }
"extends"       { return EXTENDS; }
"return"        { return RETURN; }
"int"           { return INT; }
"boolean"       { return BOOLEAN; }
"if"            { return IF; }
"else"          { return ELSE; }
"while"         { return WHILE; }
"System.out.println" { return PRINTLN; }
"length"        { return LENGTH; }
"true"          { return TRUE; }
"false"         { return FALSE; }
"this"          { return THIS; }
"new"           { return NEW; }

[a-zA-Z_][a-zA-Z0-9_]*  { return IDENTIFIER; }
[0-9]+                   { return INTEGER_LITERAL; }

"{"             { return LBRACE; }
"}"             { return RBRACE; }
"["             { return LBRACKET; }
"]"             { return RBRACKET; }
"("             { return LPAREN; }
")"             { return RPAREN; }
";"             { return SEMICOLON; }
","             { return COMMA; }
"."             { return DOT; }
"="             { return ASSIGN; }
"&&"            { return AND; }
"<"             { return LT; }
"+"             { return PLUS; }
"-"             { return MINUS; }
"*"             { return TIMES; }
"!"             { return NOT; }

"//".*          { /* skip single-line comment */ }
"/*"            { BEGIN(COMMENT); }
<COMMENT>"*/"   { BEGIN(INITIAL); }
<COMMENT>.      { /* skip */ }
<COMMENT>\n     { /* skip */ }

[ \t\n\r]+      { /* skip whitespace */ }
.               { return ERROR; }

%%

Grammar Specification (minijava.y)

%{
/* MiniJava Parser */
%}

%token CLASS PUBLIC STATIC VOID MAIN STRING EXTENDS RETURN
%token INT BOOLEAN IF ELSE WHILE PRINTLN LENGTH
%token TRUE FALSE THIS NEW
%token IDENTIFIER INTEGER_LITERAL
%token LBRACE RBRACE LBRACKET RBRACKET LPAREN RPAREN
%token SEMICOLON COMMA DOT ASSIGN
%token AND LT PLUS MINUS TIMES NOT
%token ERROR

%left AND
%left LT
%left PLUS MINUS
%left TIMES
%right NOT
%left DOT LBRACKET

%%

Program:
    MainClass ClassDeclarationList
    ;

MainClass:
    CLASS IDENTIFIER LBRACE
        PUBLIC STATIC VOID MAIN LPAREN STRING LBRACKET RBRACKET IDENTIFIER RPAREN
        LBRACE Statement RBRACE
    RBRACE
    ;

ClassDeclarationList:
    /* empty */
    | ClassDeclarationList ClassDeclaration
    ;

ClassDeclaration:
    CLASS IDENTIFIER LBRACE VarDeclarationList MethodDeclarationList RBRACE
    | CLASS IDENTIFIER EXTENDS IDENTIFIER LBRACE VarDeclarationList MethodDeclarationList RBRACE
    ;

VarDeclarationList:
    /* empty */
    | VarDeclarationList VarDeclaration
    ;

VarDeclaration:
    Type IDENTIFIER SEMICOLON
    ;

MethodDeclarationList:
    /* empty */
    | MethodDeclarationList MethodDeclaration
    ;

MethodDeclaration:
    PUBLIC Type IDENTIFIER LPAREN FormalParameterList RPAREN
        LBRACE VarDeclarationList StatementList RETURN Expression SEMICOLON RBRACE
    ;

FormalParameterList:
    /* empty */
    | FormalParameter FormalParameterRest
    ;

FormalParameter:
    Type IDENTIFIER
    ;

FormalParameterRest:
    /* empty */
    | FormalParameterRest COMMA FormalParameter
    ;

Type:
    INT LBRACKET RBRACKET
    | BOOLEAN
    | INT
    | IDENTIFIER
    ;

StatementList:
    /* empty */
    | StatementList Statement
    ;

Statement:
    LBRACE StatementList RBRACE
    | IF LPAREN Expression RPAREN Statement ELSE Statement
    | WHILE LPAREN Expression RPAREN Statement
    | PRINTLN LPAREN Expression RPAREN SEMICOLON
    | IDENTIFIER ASSIGN Expression SEMICOLON
    | IDENTIFIER LBRACKET Expression RBRACKET ASSIGN Expression SEMICOLON
    ;

Expression:
    Expression AND Expression
    | Expression LT Expression
    | Expression PLUS Expression
    | Expression MINUS Expression
    | Expression TIMES Expression
    | Expression LBRACKET Expression RBRACKET
    | Expression DOT LENGTH
    | Expression DOT IDENTIFIER LPAREN ExpressionList RPAREN
    | INTEGER_LITERAL
    | TRUE
    | FALSE
    | IDENTIFIER
    | THIS
    | NEW INT LBRACKET Expression RBRACKET
    | NEW IDENTIFIER LPAREN RPAREN
    | NOT Expression
    | LPAREN Expression RPAREN
    ;

ExpressionList:
    /* empty */
    | Expression ExpressionRest
    ;

ExpressionRest:
    /* empty */
    | ExpressionRest COMMA Expression
    ;

%%

Generate Code

openlexer gen-lexer --lexer minijava.l --lang java --output ./
openlexer gen-parser --parser minijava.y --lang java --output ./

Sample MiniJava Program

class Factorial {
    public static void main(String[] args) {
        System.out.println(new Fac().compute(10));
    }
}

class Fac {
    public int compute(int num) {
        int result;
        if (num < 1)
            result = 1;
        else
            result = num * this.compute(num - 1);
        return result;
    }
}

Limitations

MiniJava is a teaching language. It does not support:

  • Interfaces
  • Overloading
  • Static fields
  • Access modifiers other than public
  • Floating point numbers
  • Strings (except in main signature)

Command Line Interface

Synopsis

openlexer <command> [options]

Commands

gen-lexer

Generate a lexer from a .l file.

openlexer gen-lexer --lexer <file.l> --lang <language> --output <dir>

Options:

OptionDescription
--lexer <file>Input lexer specification file
--lang <lang>Target language: c, java, python
--output <dir>Output directory

Example:

openlexer gen-lexer --lexer calc.l --lang python --output ./build

# Java: Generate Lexer.java  
openlexer gen-lexer -l calc.l -L java -o ./build

# C: Generate without test driver
openlexer gen-lexer -l calc.l -L c -o ./build --no-test

Generated Files: lexer.c (C), Lexer.java (Java), lexer.py (Python)

gen-parser

Generate a parser from a .y file.

openlexer gen-parser --parser <file.y> --lang <language> --output <dir>

Options:

OptionDescription
--parser <file>Input grammar specification file
--lang <lang>Target language: c, java, python
--output <dir>Output directory

Example:

openlexer gen-parser --parser calc.y --lang c --output ./build

# Java: Compile lexer and parser together (Parser auto-detects Lexer.class)
openlexer gen-lexer -l calc.l -L java -o ./build
openlexer gen-parser --parser calc.y -L java -o ./build
javac ./build/Lexer.java ./build/Parser.java
java -cp ./build Parser "3 + 4 * 2"

Generated Files: parser.c (C), Parser.java (Java), parser.py (Python)

help

Show help information.

openlexer help
openlexer --help
openlexer <command> --help

version

Show version information.

openlexer --version

Exit Codes

CodeMeaning
0Success
1General error
2Invalid arguments

Output

The generator prints progress information to stdout:

Parsing lexer specification...
Building NFA (23 states)...
Converting to DFA (15 states)...
Minimizing DFA (12 states)...
Generating Python code...
Written: ./build/lexer.py

Errors are printed to stderr:

Error: Invalid regex at line 5: unclosed bracket

Combining Lexer and Parser

Generate both in a single directory:

mkdir build
openlexer gen-lexer --lexer calc.l --lang c --output build
openlexer gen-parser --parser calc.y --lang c --output build

The generated files can then be compiled together.

API Reference

This chapter documents the internal Rust API for programmatic use.

Lexer Generator

LexerSpec

Represents a parsed .l file.

#![allow(unused)]
fn main() {
pub struct LexerSpec {
    pub prologue: String,
    pub epilogue: String,
    pub definitions: Vec<(String, String)>,
    pub rules: Vec<LexerRule>,
    pub start_conditions: Vec<StartCondition>,
}

impl LexerSpec {
    /// Parse a lexer specification from source text
    pub fn parse(input: &str) -> Result<LexerSpec>;
}
}

LexerRule

A single lexer rule (pattern + action).

#![allow(unused)]
fn main() {
pub struct LexerRule {
    pub pattern: String,
    pub action: RuleAction,
    pub conditions: Vec<String>,
}
}

Code Generation

#![allow(unused)]
fn main() {
/// Generate lexer code from a specification
pub fn generate_code(spec: &LexerSpec, lang: &str) -> Result<String>;

/// Target languages
pub enum TargetLanguage {
    C,
    Java,
    Python,
}
}

Parser Generator

Grammar

Represents a parsed .y file.

#![allow(unused)]
fn main() {
pub struct Grammar {
    pub prologue: String,
    pub epilogue: String,
    pub tokens: Vec<String>,
    pub precedence: Vec<PrecedenceLevel>,
    pub rules: Vec<Rule>,
    pub start_symbol: Option<String>,
    pub union_def: Option<String>,
    pub glr_mode: bool,
}

impl Grammar {
    /// Parse a grammar specification from source text
    pub fn parse(input: &str) -> Result<Grammar>;
}
}

Rule

A grammar production rule.

#![allow(unused)]
fn main() {
pub struct Rule {
    pub lhs: String,
    pub rhs: Vec<String>,
    pub action: Option<String>,
    pub precedence: Option<String>,
}
}

ParsingTable

LALR(1) parsing tables.

#![allow(unused)]
fn main() {
pub struct ParsingTable {
    pub states: Vec<LrState>,
    pub action: HashMap<(usize, String), Action>,
    pub goto: HashMap<(usize, String), usize>,
    pub glr_conflict_actions: HashMap<usize, HashMap<String, Vec<Action>>>,
}

impl ParsingTable {
    /// Build parsing tables from a grammar
    pub fn build(grammar: &Grammar) -> Result<ParsingTable>;
}
}

Action

Parser actions.

#![allow(unused)]
fn main() {
pub enum Action {
    Shift(usize),    // Shift and go to state
    Reduce(usize),   // Reduce by rule number
    Accept,          // Accept input
}
}

GLR Parser

GlrParser

Runtime GLR parser using Graph-Structured Stack.

#![allow(unused)]
fn main() {
pub struct GlrParser {
    table: GlrTable,
    gss: Gss,
    forest: ParseForest,
}

impl GlrParser {
    /// Create a new GLR parser from a parsing table
    pub fn new(table: GlrTable) -> GlrParser;
    
    /// Parse a token stream
    pub fn parse(&mut self, tokens: Vec<Token>) -> Result<Vec<ParseTree>>;
}
}

Token

Input token for GLR parsing.

#![allow(unused)]
fn main() {
pub struct Token {
    pub kind: String,
    pub value: String,
    pub span: (usize, usize),
}
}

Error Types

#![allow(unused)]
fn main() {
pub enum Error {
    /// Invalid regex syntax
    InvalidRegex(String),
    
    /// Grammar error
    GrammarError(String),
    
    /// Unsupported target language
    InvalidLanguage(String),
    
    /// Parse table construction failed
    TableError(String),
    
    /// I/O error
    IoError(std::io::Error),
}
}

Example Usage

use openlexer::lexgen;
use openlexer::parsegen;

fn main() -> openlexer::Result<()> {
    // Generate lexer
    let lexer_spec = lexgen::parse_lexer_spec(include_str!("calc.l"))?;
    let lexer_code = lexgen::generate_code(&lexer_spec, "python")?;
    std::fs::write("lexer.py", lexer_code)?;
    
    // Generate parser
    let grammar = parsegen::parse_grammar(include_str!("calc.y"))?;
    let parser_code = parsegen::generate_code(&grammar, "python")?;
    std::fs::write("parser.py", parser_code)?;
    
    Ok(())
}

Test Drivers

OpenLexer generates test drivers - standalone runnable code that lets you verify your lexer or parser works correctly without writing any integration code.

What Are Test Drivers?

A test driver is a main() function (or equivalent) appended to your generated code that:

  1. Accepts input from command-line arguments or stdin
  2. Runs the lexer/parser on that input
  3. Prints detailed output showing tokens or parse results
  4. Helps you debug patterns and grammar rules

Enabling Test Drivers

Command Line

Test drivers are included by default. To exclude them:

# Include test driver (default)
openlexer --lexer rules.l --lang python -o lexer.py

# Exclude test driver
openlexer --lexer rules.l --lang python -o lexer.py --no-test

# Generate ONLY the test driver
openlexer --test-driver --lang python -o test_driver.py

GUI

In the GUI, check the "Test Driver" option in the Lexer or Parser options panel before generating.

Using Test Drivers

Python

# Test with command-line expressions
python lexer.py "3 + 4 * 2"
python lexer.py "hello" "world" "123"

# Interactive testing from Python
from lexer import test, test_all

test('3 + 4 * 2')
test_all('1+2', '(3*4)', 'x = 10')

Output:

Input: '3 + 4 * 2'
  NUMBER       | '3'             | pos=0 line=1 col=1
  PLUS         | '+'             | pos=2 line=1 col=3
  NUMBER       | '4'             | pos=4 line=1 col=5
  TIMES        | '*'             | pos=6 line=1 col=7
  NUMBER       | '2'             | pos=8 line=1 col=9

C

# Compile with test driver
gcc -o lexer lexer.c

# Test expressions
./lexer "3 + 4 * 2"
./lexer "hello" "123"

# Compile WITHOUT test driver (for library use)
gcc -DLEXER_NO_MAIN -c lexer.c -o lexer.o

Output:

Input: "3 + 4 * 2"
  type=1   | "3"
  type=2   | "+"
  type=1   | "4"
  type=3   | "*"
  type=1   | "2"
  type=0   | ""

Java

# Compile
javac Lexer.java

# Test expressions
java Lexer "3 + 4 * 2"
java Lexer "hello" "123"

Test Driver API

Python Test Driver Functions

FunctionDescription
test(expr)Tokenize one expression, print tokens, return list
test_all(*exprs)Test multiple expressions, return list of results

C Test Driver

The C test driver is wrapped in #ifndef LEXER_NO_MAIN so you can disable it:

// To compile as a library without main():
#define LEXER_NO_MAIN
#include "lexer.c"

// Your code that uses Lexer API

Java Test Driver

The Java test driver adds a main method to the Lexer class:

public static void main(String[] args) {
    // Tests each argument
}

Parser Test Drivers

Parser test drivers integrate both lexer and parser:

# Python
python parser.py "3 + 4 * 2"

# C
./parser "3 + 4 * 2"

# Java
java Parser "3 + 4 * 2"

Parser test drivers show:

  • Tokenization phase output
  • Parsing phase with shift/reduce actions
  • Final parse result or error messages

Customizing Test Behavior

Python

Modify the test driver in the generated file:

if __name__ == '__main__':
    # Add your custom test cases
    test('my custom expression')
    
    # Or read from file
    with open('input.txt') as f:
        test(f.read())

C

Override the default expressions in the else branch of main():

} else {
    // Your custom default tests
    test("custom expression 1");
    test("custom expression 2");
}

Best Practices

  1. Use test drivers during development - They're the fastest way to verify patterns
  2. Remove for production - Use --no-test when building release versions
  3. Keep test expressions - Save useful test cases in a script or file
  4. Check edge cases - Test empty input, long input, Unicode, special characters

Example Workflows

Python Workflow

# 1. Generate lexer with test driver
openlexer --lexer calc.l --lang python -o lexer.py

# 2. Test your patterns
python lexer.py "3+4" "(10-2)/4" "x=y*z"

# 3. Fix any issues, regenerate, retest
openlexer --lexer calc.l --lang python -o lexer.py
python lexer.py "3+4"

# 4. Generate production version without test driver
openlexer --lexer calc.l --lang python -o lexer.py --no-test

C Workflow

# 1. Generate lexer with test driver
openlexer --lexer calc.l --lang c -o lexer.c

# 2. Compile the lexer
gcc -o lexer lexer.c -Wall

# 3. Test your patterns
./lexer "3+4"
./lexer "(10-2)/4"
./lexer "x=y*z"

# 4. Fix any issues, regenerate, recompile, retest
openlexer --lexer calc.l --lang c -o lexer.c
gcc -o lexer lexer.c -Wall
./lexer "3+4"

# 5. Generate production version without test driver
openlexer --lexer calc.l --lang c -o lexer.c --no-test

# 6. Compile as library (for use in your project)
gcc -DLEXER_NO_MAIN -c lexer.c -o lexer.o

Complete C Example Program:

// main.c - Using the generated lexer in your own program
#define LEXER_NO_MAIN  // Exclude the built-in test driver
#include "lexer.c"

#include <stdio.h>
#include <stdlib.h>

void process_file(const char* filename) {
    FILE* f = fopen(filename, "r");
    if (!f) {
        fprintf(stderr, "Cannot open file: %s\n", filename);
        return;
    }
    
    // Read entire file
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    
    char* buffer = malloc(size + 1);
    fread(buffer, 1, size, f);
    buffer[size] = '\0';
    fclose(f);
    
    // Tokenize
    Lexer lexer;
    lexer_init(&lexer, buffer);
    
    Token token;
    int count = 0;
    
    printf("Tokens in %s:\n", filename);
    do {
        token = lexer_next(&lexer);
        printf("  [%3d] type=%-3d \"", count++, token.type);
        for (int i = 0; i < token.length; i++) {
            putchar(token.start[i]);
        }
        printf("\"\n");
    } while (token.type != TOKEN_EOF);
    
    printf("Total: %d tokens\n\n", count);
    free(buffer);
}

int main(int argc, char** argv) {
    if (argc < 2) {
        printf("Usage: %s <file1> [file2] ...\n", argv[0]);
        printf("       %s --expr \"expression\"\n", argv[0]);
        return 1;
    }
    
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--expr") == 0 && i + 1 < argc) {
            // Tokenize expression directly
            Lexer lexer;
            lexer_init(&lexer, argv[++i]);
            
            Token token;
            printf("Expression: \"%s\"\n", argv[i]);
            do {
                token = lexer_next(&lexer);
                printf("  type=%-3d \"", token.type);
                for (int j = 0; j < token.length; j++) {
                    putchar(token.start[j]);
                }
                printf("\"\n");
            } while (token.type != TOKEN_EOF);
        } else {
            process_file(argv[i]);
        }
    }
    
    return 0;
}

Build and run:

# Compile your program with the lexer
gcc -o myprogram main.c -Wall

# Process files
./myprogram source1.txt source2.txt

# Or tokenize an expression
./myprogram --expr "3 + 4 * 2"

Java Workflow

# 1. Generate lexer with test driver
openlexer --lexer calc.l --lang java -o Lexer.java

# 2. Compile the lexer
javac Lexer.java

# 3. Test your patterns
java Lexer "3+4"
java Lexer "(10-2)/4"
java Lexer "x=y*z"

# 4. Fix any issues, regenerate, recompile, retest
openlexer --lexer calc.l --lang java -o Lexer.java
javac Lexer.java
java Lexer "3+4"

# 5. Generate production version without test driver
openlexer --lexer calc.l --lang java -o Lexer.java --no-test

Complete Java Example Program:

// Calculator.java - Using the generated lexer in your own program
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class Calculator {
    
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java Calculator <expression>");
            System.out.println("       java Calculator --file <filename>");
            System.out.println("       java Calculator --interactive");
            return;
        }
        
        Calculator calc = new Calculator();
        
        if (args[0].equals("--file") && args.length > 1) {
            calc.processFile(args[1]);
        } else if (args[0].equals("--interactive")) {
            calc.interactiveMode();
        } else {
            // Process each argument as an expression
            for (String expr : args) {
                calc.tokenize(expr);
            }
        }
    }
    
    public void tokenize(String expression) {
        System.out.println("Input: \"" + expression + "\"");
        
        Lexer lexer = new Lexer(expression);
        List<Token> tokens = new ArrayList<>();
        
        try {
            for (Token token : lexer.tokenize()) {
                tokens.add(token);
                System.out.printf("  %-12s | %-15s | pos=%d%n",
                    token.type.name(),
                    "\"" + token.text + "\"",
                    token.pos);
            }
        } catch (LexerException e) {
            System.err.println("Lexer error: " + e.getMessage());
        }
        
        System.out.println("Total: " + tokens.size() + " tokens\n");
    }
    
    public void processFile(String filename) {
        try {
            String content = Files.readString(Path.of(filename));
            System.out.println("=== File: " + filename + " ===\n");
            tokenize(content);
        } catch (IOException e) {
            System.err.println("Cannot read file: " + filename);
            System.err.println(e.getMessage());
        }
    }
    
    public void interactiveMode() {
        Scanner scanner = new Scanner(System.in);
        System.out.println("OpenLexer Interactive Mode");
        System.out.println("Enter expressions to tokenize (empty line to quit):\n");
        
        while (true) {
            System.out.print("> ");
            String line = scanner.nextLine();
            
            if (line.isEmpty()) {
                System.out.println("Goodbye!");
                break;
            }
            
            tokenize(line);
        }
        
        scanner.close();
    }
}

Build and run:

# Compile your program (assumes Lexer.java is in same directory)
javac Calculator.java Lexer.java

# Tokenize expressions
java Calculator "3 + 4 * 2"
java Calculator "hello" "world"

# Process a file
java Calculator --file source.txt

# Interactive mode
java Calculator --interactive
> 3 + 4
  NUMBER       | "3"             | pos=0
  PLUS         | "+"             | pos=2
  NUMBER       | "4"             | pos=4
Total: 3 tokens

> quit
Goodbye!

IDE Integration

VS Code Tasks

Create .vscode/tasks.json:

{
    "version": "2.0.0",
    "tasks": [
        {
            "label": "Generate & Test Lexer",
            "type": "shell",
            "command": "openlexer --lexer ${file} --lang python -o lexer.py && python lexer.py \"test\"",
            "group": "build"
        }
    ]
}

Makefile

LANG ?= c

lexer.$(LANG): rules.l
	openlexer --lexer $< --lang $(LANG) -o $@

test: lexer.$(LANG)
ifeq ($(LANG),c)
	gcc -o lexer $< && ./lexer "test expression"
else ifeq ($(LANG),java)
	javac $< && java Lexer "test expression"
else
	python $< "test expression"
endif

clean:
	rm -f lexer lexer.c lexer.py Lexer.java *.class

Combining Lexer and Parser

This guide explains how to generate and use a complete lexer+parser system with OpenLexer.

Overview

A typical language implementation has two phases:

  1. Lexical Analysis (Lexer): Convert character stream → token stream
  2. Syntactic Analysis (Parser): Convert token stream → parse tree/AST
Source Code → [Lexer] → Tokens → [Parser] → Parse Tree → [Your Code]

Method 1: GUI Combined Tab

The easiest way to generate both together:

  1. Open the GUI: cargo run --bin openlexer-gui --features gui
  2. Click the Combined tab
  3. Enter your lexer spec in the left panel
  4. Enter your grammar in the middle panel
  5. Click Generate Combined
  6. Download or copy the output

The output contains both lexer and parser in one file with section markers.

Method 2: Command Line

# Generate both at once
openlexer --lexer calc.l --parser calc.y --lang python -o output/

# This creates:
#   output/lexer.py   - The lexer
#   output/parser.py  - The parser (imports lexer)

Method 3: Separate Generation

# Generate lexer
openlexer --lexer calc.l --lang python -o lexer.py

# Generate parser
openlexer --parser calc.y --lang python -o parser.py

Token Coordination

Critical: The tokens in your .y file must match those returned by your .l file.

Lexer (.l file)

%%
[0-9]+      { return NUMBER; }
"+"         { return PLUS; }
"-"         { return MINUS; }
"*"         { return TIMES; }
"/"         { return DIVIDE; }
"("         { return LPAREN; }
")"         { return RPAREN; }
[ \t\n]+    { /* skip whitespace */ }
%%

Parser (.y file)

%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN

%%
expr: expr PLUS term
    | expr MINUS term
    | term
    ;

term: term TIMES factor
    | term DIVIDE factor
    | factor
    ;

factor: NUMBER
      | LPAREN expr RPAREN
      ;
%%

The token names must match exactly!

Language-Specific Integration

Java Integration

File Organization: Java requires one public class per file.

# Generate both
openlexer gen-lexer --lexer calc.l -L java -o src/
openlexer gen-parser --parser calc.y -L java -o src/

# This creates:
#   src/Lexer.java  - public class Lexer
#   src/Parser.java - public class Parser

Compilation:

# Compile both (Parser auto-detects Lexer)
javac src/Lexer.java src/Parser.java

# Run parser
java -cp src Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]
#         Input: "3 + 4 * 2"
#         Result: 11

Custom Integration:

import java.util.*;

public class Calculator {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        
        while (true) {
            System.out.print("calc> ");
            String line = sc.nextLine();
            if (line.equals("quit")) break;
            
            try {
                // Tokenize
                Lexer lex = new Lexer(line);
                System.out.print("Tokens: ");
                Lexer.Token tok;
                while ((tok = lex.nextToken()).type != Lexer.TOKEN_EOF) {
                    System.out.print(tok.text + " ");
                }
                System.out.println();
                
                // Parse and evaluate
                int result = Parser.parse(line);
                System.out.println("= " + result);
            } catch (Exception e) {
                System.err.println("Error: " + e.getMessage());
            }
        }
    }
}

Python Integration

# calc.py - Using generated lexer and parser together

from lexer import Lexer, TokenType
from parser import Parser

def evaluate(expression):
    """Evaluate a mathematical expression."""
    # Phase 1: Tokenize
    lexer = Lexer(expression)
    tokens = list(lexer.tokenize())
    
    # Phase 2: Parse
    parser = Parser(tokens)
    ast = parser.parse()
    
    # Phase 3: Evaluate (your code)
    return evaluate_ast(ast)

# Example usage
result = evaluate("3 + 4 * 2")
print(f"Result: {result}")  # Output: 11

C Integration

// calc.c - Using generated lexer and parser together

#include "lexer.h"
#include "parser.h"

int evaluate(const char* expression) {
    // Phase 1: Initialize lexer
    Lexer lexer;
    lexer_init(&lexer, expression);
    
    // Phase 2: Initialize parser with lexer
    Parser parser;
    parser_init(&parser, &lexer);
    
    // Phase 3: Parse and evaluate
    int result = parser_parse(&parser);
    
    return result;
}

int main() {
    printf("3 + 4 * 2 = %d\n", evaluate("3 + 4 * 2"));
    return 0;
}

Java Integration

// Calculator.java - Using generated lexer and parser together

public class Calculator {
    public static int evaluate(String expression) {
        // Phase 1: Tokenize
        Lexer lexer = new Lexer(expression);
        List<Token> tokens = lexer.tokenize();
        
        // Phase 2: Parse
        Parser parser = new Parser(tokens);
        ParseTree tree = parser.parse();
        
        // Phase 3: Evaluate
        return evaluate(tree);
    }
    
    public static void main(String[] args) {
        System.out.println("3 + 4 * 2 = " + evaluate("3 + 4 * 2"));
    }
}

Parser Calling Lexer Directly

In some generated code, the parser calls the lexer automatically:

# Parser internally calls lexer.next_token() as needed
parser = Parser(input_string)  # Lexer created internally
result = parser.parse()

Check your generated code's constructor signature to see which style is used.

Semantic Actions

Connect lexer output to parser semantic values:

Passing Token Values

In the lexer, the matched text is available via yytext:

[0-9]+   { yylval = atoi(yytext); return NUMBER; }

In the parser, access values with $1, $2, etc.:

expr: expr PLUS term  { $$ = $1 + $3; }
    | NUMBER          { $$ = $1; }
    ;

Complete Example: Calculator

calc.l (Lexer)

%{
/* Calculator lexer */
%}

%%
[0-9]+      { return NUMBER; }
"+"         { return PLUS; }
"-"         { return MINUS; }
"*"         { return TIMES; }
"/"         { return DIVIDE; }
"("         { return LPAREN; }
")"         { return RPAREN; }
[ \t\n]+    { /* skip */ }
.           { return ERROR; }
%%

calc.y (Parser)

%{
/* Calculator parser */
%}

%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN ERROR

%left PLUS MINUS
%left TIMES DIVIDE

%%
input: expr { printf("Result: %d\n", $1); }
     ;

expr: expr PLUS expr   { $$ = $1 + $3; }
    | expr MINUS expr  { $$ = $1 - $3; }
    | expr TIMES expr  { $$ = $1 * $3; }
    | expr DIVIDE expr { $$ = $1 / $3; }
    | LPAREN expr RPAREN { $$ = $2; }
    | NUMBER           { $$ = $1; }
    ;
%%

Build and Run

# Generate both
openlexer --lexer calc.l --parser calc.y --lang c -o calc/

# Compile
cd calc
gcc -o calculator lexer.c parser.c main.c

# Run
./calculator "3 + 4 * 2"
# Output: Result: 11

Troubleshooting

"Unknown token" errors

Cause: Parser received a token not in %token declaration

Fix: Ensure all tokens returned by lexer are declared in parser

"Syntax error" on valid input

Cause: Usually whitespace/newlines not being skipped

Fix: Add whitespace rule to lexer:

[ \t\n]+    { /* skip whitespace */ }

Parser stuck or infinite loop

Cause: Lexer not returning EOF token

Fix: Ensure lexer returns EOF/END_OF_FILE when input exhausted

Token value is wrong

Cause: Semantic value (yylval) not being set

Fix: In lexer action, set the value:

[0-9]+   { yylval.intval = atoi(yytext); return NUMBER; }

Best Practices

  1. Define tokens in one place - Use %token in parser, reference in lexer
  2. Test lexer first - Use test driver to verify tokens before parser work
  3. Handle all input - Include catch-all rule for unexpected characters
  4. Skip whitespace explicitly - Don't rely on implicit handling
  5. Use precedence - Resolve ambiguity with %left, %right, %nonassoc
  6. Start simple - Get basic grammar working, then add complexity

Architecture and Internals

This page explains how OpenLexer works internally. Understanding these concepts helps you write better specifications and debug complex issues.

System Overview

┌─────────────────────────────────────────────────────────────┐
│                      OpenLexer                              │
├─────────────────────────────┬───────────────────────────────┤
│        Lexer Generator      │       Parser Generator        │
│          (lexgen)           │         (parsegen)            │
├─────────────────────────────┼───────────────────────────────┤
│ 1. Parse .l file            │ 1. Parse .y file              │
│ 2. Compile regexes → NFA    │ 2. Build grammar structure    │
│ 3. NFA → DFA conversion     │ 3. Compute FIRST/FOLLOW sets  │
│ 4. DFA minimization         │ 4. Build LR(0) item sets      │
│ 5. Generate code            │ 5. Build LALR(1) tables       │
│                             │ 6. Generate code              │
└─────────────────────────────┴───────────────────────────────┘

Lexer Generation Pipeline

Stage 1: Parsing the .l File

The lexer spec parser (lexgen::rules) extracts:

  • Definitions: Named patterns like DIGIT [0-9]
  • Start conditions: %x COMMENT, %s STRING
  • Rules: Pattern-action pairs
  • Prologue/Epilogue: User code to embed
#![allow(unused)]
fn main() {
// Internal representation
struct LexerSpec {
    rules: Vec<Rule>,           // Pattern → Action mappings
    definitions: HashMap<String, String>,
    start_conditions: Vec<StartCondition>,
    prologue: String,
    epilogue: String,
}
}

Stage 2: Regex → NFA (Thompson's Construction)

Each regex pattern is converted to a Nondeterministic Finite Automaton:

Regex: a(b|c)*

        ε         ε
    ┌──────► 2 ──► 3 ─┐
    │        │b       │ε
    │        ▼        │
1 ──┤   ε   4 ────────┼──► 5
 a  │        │c       │
    │        ▼        │
    └──────► 6 ──► 7 ─┘
             ε

Key operations:

  • Concatenation: Chain NFAs sequentially
  • Alternation (|): Parallel paths with ε-transitions
  • Kleene star (*): Loop back with ε-transitions
  • Character classes: Single state with multiple outgoing edges

Stage 3: NFA → DFA (Subset Construction)

The combined NFA is converted to a Deterministic Finite Automaton:

  1. Compute ε-closure of start state → initial DFA state
  2. For each input symbol, compute move then ε-closure
  3. New state sets become new DFA states
  4. Repeat until no new states
NFA States {1,2,3} on input 'a' → NFA States {4,5,6,7}
                                 ↓
                         DFA State D2

Stage 4: DFA Minimization (Hopcroft's Algorithm)

The DFA is minimized to reduce state count:

  1. Partition states into accepting/non-accepting
  2. Refine partitions based on transitions
  3. Merge equivalent states

This reduces memory usage and improves runtime performance.

Stage 5: Code Generation

The DFA is emitted as:

  • Transition table: 2D array table[state][char] → next_state
  • Accept table: accept[state] → token_type or -1
  • Lexer driver: Loop that walks the DFA

Parser Generation Pipeline

Stage 1: Parsing the .y File

The grammar parser (parsegen::grammar) extracts:

  • Token declarations: %token NUMBER PLUS
  • Precedence rules: %left PLUS MINUS, %right POWER
  • Grammar rules: Productions with semantic actions
  • Start symbol: First rule's LHS or explicit %start
#![allow(unused)]
fn main() {
// Internal representation
struct Grammar {
    terminals: Vec<Symbol>,
    nonterminals: Vec<Symbol>,
    rules: Vec<Production>,
    precedence: Vec<PrecedenceLevel>,
    start: Symbol,
}
}

Stage 2: FIRST and FOLLOW Sets

FIRST(X): Set of terminals that can begin strings derived from X

FIRST(expr) = { NUMBER, LPAREN, MINUS }

FOLLOW(X): Set of terminals that can appear after X

FOLLOW(expr) = { PLUS, MINUS, TIMES, RPAREN, $ }

These sets drive lookahead decisions in the parser.

Stage 3: LR(0) Item Sets

An LR(0) item is a production with a dot showing parsing progress:

expr → expr • + term    (we've seen expr, expecting +)
expr → expr + • term    (we've seen expr +, expecting term)
expr → expr + term •    (complete, ready to reduce)

The parser builds the canonical collection of item sets:

State I0:
  S' → • expr
  expr → • expr + term
  expr → • term
  term → • NUMBER

State I1: (after seeing expr)
  S' → expr •
  expr → expr • + term

Stage 4: LALR(1) Tables

LALR(1) adds lookahead to resolve conflicts:

ACTION Table: state × terminal → shift/reduce/accept/error
GOTO Table:   state × nonterminal → next state

Example:

State 5, lookahead '+': SHIFT to state 7
State 5, lookahead '$': REDUCE by rule expr → expr + term

Stage 5: Conflict Resolution

Shift/Reduce Conflicts: Resolved by precedence and associativity

%left PLUS MINUS      /* Left-associative, same precedence */
%left TIMES DIVIDE    /* Higher precedence than +/- */
%right POWER          /* Right-associative */

Reduce/Reduce Conflicts: First matching rule wins (or use GLR)

Stage 6: Code Generation

The parser is emitted as:

  • ACTION table: Encodes shift/reduce/accept/error
  • GOTO table: State transitions on nonterminals
  • Reduction logic: Switch statement executing semantic actions
  • Stack management: State and value stacks

GLR Parsing

When %glr-parser is specified, conflicts create multiple parse paths:

                     ┌─ Parse path A (shift)
Input: "a - b - c" ──┤
                     └─ Parse path B (reduce)

All paths are explored in parallel. Invalid paths die when they encounter errors. If multiple paths succeed, the grammar is ambiguous.

DFA Transition Table Encoding

The generated transition table uses efficient encoding:

Dense Table (Default)

// Full 256-entry table per state
int table[NUM_STATES][256] = {
    { -1, -1, ..., 5, 5, ..., 3, 3, ... },  // state 0
    { -1, -1, ..., 2, 2, ..., 7, 7, ... },  // state 1
};

Optimized Encoding (When Enabled)

  • Row compression: Factor out common default transitions
  • Column compression: Merge equivalent character columns
  • Direct-coded DFA: Generate switch statements instead of tables

Token Matching Strategy

OpenLexer uses maximal munch (longest match) with rule priority:

  1. Try all patterns from current position
  2. Select the longest matching pattern
  3. On ties, prefer the pattern defined first in the .l file
  4. Execute the associated action
  5. Advance position and repeat

Memory Layout

Generated Lexer

┌────────────────────────────┐
│ Input buffer (user string) │
├────────────────────────────┤
│ Position tracking          │
│ (pos, line, column)        │
├────────────────────────────┤
│ Start condition stack      │
├────────────────────────────┤
│ Token result buffer        │
└────────────────────────────┘

Generated Parser

┌────────────────────────────┐
│ State stack                │
│ (current parsing states)   │
├────────────────────────────┤
│ Value stack                │
│ (semantic values for $1,$2)│
├────────────────────────────┤
│ Lookahead token            │
├────────────────────────────┤
│ Error recovery state       │
└────────────────────────────┘

Performance Characteristics

OperationLexerParser
Time complexityO(n)O(n) for LALR, O(n³) worst-case for GLR
Space complexityO(states × alphabet)O(states × symbols)
Typical states50-500100-2000

Error Handling

Lexer Errors

When no pattern matches:

  1. Report position with line/column
  2. Skip one character (or configurable recovery)
  3. Continue lexing

Parser Errors

When no valid action exists:

  1. Report token and expected tokens
  2. Pop states until error recovery possible
  3. Skip tokens until synchronization point
  4. Resume parsing

File Format References

The .l and .y formats are compatible with Flex and Bison respectively, with these extensions:

OpenLexer Extensions to Flex

  • %unicode - Enable Unicode mode
  • Full Unicode character class support (\p{L}, \p{Greek})

OpenLexer Extensions to Bison

  • %glr-parser - Enable GLR parsing
  • Multi-language output (Python, Java, C)

Internal Module Structure

openlexer/
├── src/
│   ├── lib.rs          # Public API
│   ├── error.rs        # Error types
│   ├── lexgen/         # Lexer generator
│   │   ├── mod.rs      # Public interface
│   │   ├── regex.rs    # Regex parser
│   │   ├── nfa.rs      # NFA construction
│   │   ├── dfa.rs      # DFA conversion + minimization
│   │   ├── rules.rs    # .l file parser
│   │   └── codegen.rs  # Code generation
│   └── parsegen/       # Parser generator
│       ├── mod.rs      # Public interface
│       ├── grammar.rs  # .y file parser
│       ├── first.rs    # FIRST/FOLLOW computation
│       ├── lalr.rs     # LALR(1) table construction
│       └── codegen.rs  # Code generation

Comparison with Flex/Bison

This chapter compares OpenLexer with GNU Flex and Bison.

Feature Comparison

FeatureFlexBisonOpenLexer
Lexer generationYesNoYes
Parser generationNoYesYes
C outputYesYesYes
Java outputJFlexCUP/ANTLRYes
Python outputPLYPLYYes
GLR parsingNoYesYes
LALR(1)N/AYesYes
Start conditionsYesN/AYes
PrecedenceN/AYesYes
UnicodeLimitedN/APlanned
Web (WASM)NoNoYes

File Format Compatibility

Lexer Files (.l)

OpenLexer supports most Flex syntax:

Supported:

  • Definitions section with %{ %} code blocks
  • Named patterns
  • Rules with actions
  • Start conditions (%x, %s)
  • Character classes [a-z]
  • Predefined classes [:alpha:]
  • Quantifiers *, +, ?, {n,m}
  • Alternation |
  • Grouping ()
  • Literal strings "..."

Not supported:

  • %option directives
  • yymore(), yyless()
  • REJECT
  • Multiple input buffers
  • Interactive mode

Grammar Files (.y)

OpenLexer supports most Bison syntax:

Supported:

  • %token declarations
  • %union type declarations
  • %type non-terminal types
  • %left, %right, %nonassoc precedence
  • %start symbol
  • %prec contextual precedence
  • %glr-parser GLR mode
  • Semantic actions with $$, $1, etc.
  • error token for recovery

Not supported:

  • %define options
  • %code blocks (use %{ %} instead)
  • %destructor
  • %printer
  • Lookahead correction (%define lr.type ielr)
  • Named references ($name instead of $1)

Migration Guide

From Flex

  1. Remove %option lines
  2. Replace yymore() with manual buffer handling
  3. Replace REJECT with explicit rules
  4. Test start condition behavior

From Bison

  1. Replace %define with equivalent features
  2. Move %code blocks to %{ %}
  3. Test GLR behavior if used
  4. Verify error recovery patterns

Performance

OpenLexer generates table-driven lexers and parsers similar to Flex/Bison. Performance characteristics:

  • Lexer: O(n) where n is input length
  • LALR Parser: O(n) for unambiguous grammars
  • GLR Parser: O(n) to O(n^3) depending on ambiguity

The generated code size is comparable to Flex/Bison output.

When to Use OpenLexer

Use OpenLexer when:

  • You need multi-language output
  • You want a single cross-platform tool
  • You need integrated lexer + parser generation
  • You want GLR parsing

Use Flex/Bison when:

  • You need maximum C performance optimization
  • You require specific Flex/Bison features not in OpenLexer
  • You have existing build infrastructure for Flex/Bison