Combining Lexer and Parser
This guide explains how to generate and use a complete lexer+parser system with OpenLexer.
Overview
A typical language implementation has two phases:
- Lexical Analysis (Lexer): Convert character stream → token stream
- Syntactic Analysis (Parser): Convert token stream → parse tree/AST
Source Code → [Lexer] → Tokens → [Parser] → Parse Tree → [Your Code]
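The two phases can be sketched by hand for a tiny subset of the calculator language. This is an illustrative sketch only, not the code OpenLexer generates: the names TOKEN_SPEC, tokenize, and parse are hypothetical, and only NUMBER, PLUS, and TIMES are handled.

```python
import re

# Hypothetical token table for illustration; not OpenLexer's generated output.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("PLUS", r"\+"),
    ("TIMES", r"\*"),
    ("SKIP", r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Phase 1: character stream -> list of (type, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    tokens.append(("EOF", ""))  # explicit end-of-input marker
    return tokens


def parse(tokens):
    """Phase 2: token stream -> parse tree built from nested tuples."""
    pos = 0

    def factor():
        nonlocal pos
        kind, text = tokens[pos]
        if kind != "NUMBER":
            raise SyntaxError(f"expected NUMBER, got {kind}")
        pos += 1
        return ("NUMBER", int(text))

    def term():
        nonlocal pos
        node = factor()
        while tokens[pos][0] == "TIMES":
            pos += 1
            node = ("TIMES", node, factor())
        return node

    def expr():
        nonlocal pos
        node = term()
        while tokens[pos][0] == "PLUS":
            pos += 1
            node = ("PLUS", node, term())
        return node

    tree = expr()
    if tokens[pos][0] != "EOF":
        raise SyntaxError(f"unexpected token {tokens[pos][0]}")
    return tree


print(parse(tokenize("3 + 4 * 2")))
# ('PLUS', ('NUMBER', 3), ('TIMES', ('NUMBER', 4), ('NUMBER', 2)))
```

The parser never looks at characters and the lexer never looks at grammar rules; the token list is the only interface between them.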
Method 1: GUI Combined Tab
The easiest way to generate both together:
- Open the GUI:
cargo run --bin openlexer-gui --features gui
- Click the Combined tab
- Enter your lexer spec in the left panel
- Enter your grammar in the middle panel
- Click Generate Combined
- Download or copy the output
The output contains both lexer and parser in one file with section markers.
Method 2: Command Line
# Generate both at once
openlexer --lexer calc.l --parser calc.y --lang python -o output/
# This creates:
# output/lexer.py - The lexer
# output/parser.py - The parser (imports lexer)
Method 3: Separate Generation
# Generate lexer
openlexer --lexer calc.l --lang python -o lexer.py
# Generate parser
openlexer --parser calc.y --lang python -o parser.py
Token Coordination
Critical: The tokens in your .y file must match those returned by your .l file.
Lexer (.l file)
%%
[0-9]+ { return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"(" { return LPAREN; }
")" { return RPAREN; }
[ \t\n]+ { /* skip whitespace */ }
%%
Parser (.y file)
%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN
%%
expr: expr PLUS term
| expr MINUS term
| term
;
term: term TIMES factor
| term DIVIDE factor
| factor
;
factor: NUMBER
| LPAREN expr RPAREN
;
%%
The token names must match exactly!
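A mismatch can be caught before generating anything with a small script. The sketch below is not part of OpenLexer; the helper names are hypothetical, and it assumes token names are uppercase identifiers, that lexer actions use the form `return NAME;`, and that `%token` lines contain only names.

```python
import re


def returned_tokens(lexer_spec):
    """Collect token names that appear in `return NAME;` actions of a .l spec."""
    return set(re.findall(r"\breturn\s+([A-Z_][A-Z0-9_]*)\s*;", lexer_spec))


def declared_tokens(grammar_spec):
    """Collect token names listed on %token lines of a .y spec."""
    names = set()
    for line in grammar_spec.splitlines():
        line = line.strip()
        if line.startswith("%token"):
            names.update(line.split()[1:])
    return names


def check_tokens(lexer_spec, grammar_spec):
    """Return (returned but undeclared, declared but never returned)."""
    returned = returned_tokens(lexer_spec)
    declared = declared_tokens(grammar_spec)
    return sorted(returned - declared), sorted(declared - returned)


lexer_spec = '''
[0-9]+ { return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
'''
grammar_spec = '''
%token NUMBER PLUS TIMES
%%
expr: NUMBER ;
'''
print(check_tokens(lexer_spec, grammar_spec))
# (['MINUS'], ['TIMES'])
```

In practice the two strings would be read from the .l and .y files. An empty pair of lists means the names line up.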
Language-Specific Integration
Java Integration
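File Organization: Java allows at most one public top-level class per source file, and the file name must match that class, so the lexer and parser are generated as separate files.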
# Generate both
openlexer gen-lexer --lexer calc.l -L java -o src/
openlexer gen-parser --parser calc.y -L java -o src/
# This creates:
# src/Lexer.java - public class Lexer
# src/Parser.java - public class Parser
Compilation:
# Compile both (Parser auto-detects Lexer)
javac src/Lexer.java src/Parser.java
# Run parser
java -cp src Parser "3 + 4 * 2"
# Output: [Using external Lexer.class]
# Input: "3 + 4 * 2"
# Result: 11
Custom Integration:
import java.util.*;

public class Calculator {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        while (true) {
            System.out.print("calc> ");
            // Stop cleanly on end of input instead of throwing from nextLine()
            if (!sc.hasNextLine()) break;
            String line = sc.nextLine();
            if (line.equals("quit")) break;
            try {
                // Tokenize
                Lexer lex = new Lexer(line);
                System.out.print("Tokens: ");
                Lexer.Token tok;
                while ((tok = lex.nextToken()).type != Lexer.TOKEN_EOF) {
                    System.out.print(tok.text + " ");
                }
                System.out.println();
                // Parse and evaluate
                int result = Parser.parse(line);
                System.out.println("= " + result);
            } catch (Exception e) {
                System.err.println("Error: " + e.getMessage());
            }
        }
    }
}
Python Integration
# calc.py - Using generated lexer and parser together
from lexer import Lexer, TokenType
from parser import Parser

def evaluate(expression):
    """Evaluate a mathematical expression."""
    # Phase 1: Tokenize
    lexer = Lexer(expression)
    tokens = list(lexer.tokenize())
    # Phase 2: Parse
    parser = Parser(tokens)
    ast = parser.parse()
    # Phase 3: Evaluate (evaluate_ast is your own tree-walking code, not generated)
    return evaluate_ast(ast)

# Example usage
result = evaluate("3 + 4 * 2")
print(f"Result: {result}")  # Prints: Result: 11
C Integration
// calc.c - Using generated lexer and parser together
#include <stdio.h>   // printf
#include "lexer.h"
#include "parser.h"

int evaluate(const char* expression) {
    // Phase 1: Initialize lexer
    Lexer lexer;
    lexer_init(&lexer, expression);
    // Phase 2: Initialize parser with lexer
    Parser parser;
    parser_init(&parser, &lexer);
    // Phase 3: Parse and evaluate
    int result = parser_parse(&parser);
    return result;
}

int main(void) {
    printf("3 + 4 * 2 = %d\n", evaluate("3 + 4 * 2"));
    return 0;
}
Java Integration
// Calculator.java - Using generated lexer and parser together
import java.util.List;

public class Calculator {
    public static int evaluate(String expression) {
        // Phase 1: Tokenize
        Lexer lexer = new Lexer(expression);
        List<Token> tokens = lexer.tokenize();
        // Phase 2: Parse
        Parser parser = new Parser(tokens);
        ParseTree tree = parser.parse();
        // Phase 3: Evaluate
        // evaluate(ParseTree) is your own tree-walking overload, not generated
        return evaluate(tree);
    }

    public static void main(String[] args) {
        System.out.println("3 + 4 * 2 = " + evaluate("3 + 4 * 2"));
    }
}
Parser Calling Lexer Directly
In some generated code, the parser calls the lexer automatically:
# Parser internally calls lexer.next_token() as needed
parser = Parser(input_string) # Lexer created internally
result = parser.parse()
Check your generated code's constructor signature to see which style is used.
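For generated Python code, one way to check the constructor style without reading the source is the standard library's inspect module. The two classes below are hypothetical stand-ins for the two styles; the class and parameter names in your generated code may differ.

```python
import inspect


# Hypothetical stand-ins for two generated parser styles; real class and
# parameter names depend on your generated code.
class TokenListParser:
    def __init__(self, tokens):
        self.tokens = tokens


class SelfLexingParser:
    def __init__(self, input_string):
        self.input_string = input_string


def constructor_params(cls):
    """Return the constructor's parameter names, excluding self."""
    return list(inspect.signature(cls).parameters)


print(constructor_params(TokenListParser))   # ['tokens']
print(constructor_params(SelfLexingParser))  # ['input_string']
```

A parameter that takes a token list means you run the lexer yourself; a parameter that takes the raw input means the parser creates its own lexer.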
Semantic Actions
Connect lexer output to parser semantic values:
Passing Token Values
In the lexer, the matched text is available via yytext:
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
In the parser, access values with $1, $2, etc.:
expr: expr PLUS term { $$ = $1 + $3; }
| NUMBER { $$ = $1; }
;
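The $n notation can be read as positions on a value stack that the parser keeps alongside its symbols. The following sketch models a single reduction under that assumption; the function names are hypothetical and this is not OpenLexer's generated code.

```python
def reduce(value_stack, rhs_length, action):
    """Pop the values of the right-hand-side symbols, run the action, push $$."""
    start = len(value_stack) - rhs_length
    rhs_values = value_stack[start:]
    del value_stack[start:]
    value_stack.append(action(*rhs_values))


# Models the rule: expr: expr PLUS term { $$ = $1 + $3; }
def add_action(v1, _v2, v3):
    return v1 + v3


# $1 = 3 (value of expr), $2 = "+" (PLUS carries no useful value), $3 = 8 (value of term)
stack = [3, "+", 8]
reduce(stack, 3, add_action)
print(stack)  # [11]
```

The value the lexer stores in yylval for NUMBER is what first lands on this stack, which is why a lexer action that forgets to set it produces wrong results further up the tree.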
Complete Example: Calculator
calc.l (Lexer)
%{
/* Calculator lexer */
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
"+" { return PLUS; }
"-" { return MINUS; }
"*" { return TIMES; }
"/" { return DIVIDE; }
"(" { return LPAREN; }
")" { return RPAREN; }
[ \t\n]+ { /* skip */ }
. { return ERROR; }
%%
calc.y (Parser)
%{
/* Calculator parser */
%}
%token NUMBER PLUS MINUS TIMES DIVIDE LPAREN RPAREN ERROR
%left PLUS MINUS
%left TIMES DIVIDE
%%
input: expr { printf("Result: %d\n", $1); }
;
expr: expr PLUS expr { $$ = $1 + $3; }
| expr MINUS expr { $$ = $1 - $3; }
| expr TIMES expr { $$ = $1 * $3; }
| expr DIVIDE expr { $$ = $1 / $3; }
| LPAREN expr RPAREN { $$ = $2; }
| NUMBER { $$ = $1; }
;
%%
Build and Run
# Generate both
openlexer --lexer calc.l --parser calc.y --lang c -o calc/
# Compile
cd calc
gcc -o calculator lexer.c parser.c main.c
# Run
./calculator "3 + 4 * 2"
# Output: Result: 11
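The result is 11 rather than 14 because the second %left line (TIMES DIVIDE) binds tighter than the first (PLUS MINUS). The sketch below reproduces that grouping with precedence climbing, which is a different technique from the table-driven conflict resolution a generated parser would use. It works on an already-tokenized list, ignores parentheses, and uses floor division, which agrees with C integer division only for non-negative operands.

```python
import operator

# Binding strength mirrors the order of the %left lines: later lines bind tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}
APPLY = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,
}


def evaluate(tokens):
    """Evaluate a list such as [3, "+", 4, "*", 2] by precedence climbing."""
    pos = 0

    def primary():
        nonlocal pos
        value = tokens[pos]
        pos += 1
        return value

    def climb(min_prec):
        nonlocal pos
        left = primary()
        while pos < len(tokens) and PRECEDENCE[tokens[pos]] >= min_prec:
            op = tokens[pos]
            pos += 1
            # Left associativity: the right operand may only contain
            # operators that bind strictly tighter than op.
            right = climb(PRECEDENCE[op] + 1)
            left = APPLY[op](left, right)
        return left

    return climb(1)


print(evaluate([3, "+", 4, "*", 2]))    # 11
print(evaluate([10, "-", 4, "-", 3]))   # 3
```

The second call shows the effect of %left itself: 10 - 4 - 3 groups as (10 - 4) - 3.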
Troubleshooting
"Unknown token" errors
Cause: Parser received a token not in %token declaration
Fix: Ensure all tokens returned by lexer are declared in parser
"Syntax error" on valid input
Cause: Usually whitespace/newlines not being skipped
Fix: Add whitespace rule to lexer:
[ \t\n]+ { /* skip whitespace */ }
Parser stuck or infinite loop
Cause: Lexer not returning EOF token
Fix: Ensure lexer returns EOF/END_OF_FILE when input exhausted
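The contract can be sketched as follows: once the input is exhausted, every further call returns the EOF token, so a consumer loop always has a condition to stop on. The Lexer class here is a hypothetical whitespace-splitting stand-in, not the generated lexer.

```python
class Lexer:
    """Hypothetical lexer that splits on whitespace, for illustration only."""

    def __init__(self, text):
        self.words = text.split()
        self.pos = 0

    def next_token(self):
        if self.pos >= len(self.words):
            # Returned on every call after the input is exhausted
            return ("EOF", "")
        word = self.words[self.pos]
        self.pos += 1
        return ("WORD", word)


def count_tokens(lexer):
    """Consume tokens until EOF; terminates because EOF is always delivered."""
    count = 0
    while lexer.next_token()[0] != "EOF":
        count += 1
    return count


print(count_tokens(Lexer("3 + 4")))  # 3
```

If next_token instead returned None or raised at end of input, a loop written against the EOF token would fail or never finish.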
Token value is wrong
Cause: Semantic value (yylval) not being set
Fix: In lexer action, set the value:
[0-9]+ { yylval.intval = atoi(yytext); return NUMBER; }
Best Practices
- Define tokens in one place: Use %token in the parser, reference those names in the lexer
- Test lexer first: Use a test driver to verify tokens before parser work
- Handle all input: Include a catch-all rule for unexpected characters
- Skip whitespace explicitly: Don't rely on implicit handling
- Use precedence: Resolve ambiguity with %left, %right, %nonassoc
- Start simple: Get a basic grammar working, then add complexity