doc/internal/compiler.md
This document describes mruby's compilation pipeline for developers working on the parser, code generator, or bytecode format.
Read this if you are: adding new syntax or modifying the parser,
debugging codegen issues (wrong registers, missing opcodes),
working with the .mrb binary format, or understanding how Ruby
constructs map to bytecode.
Ruby source
|
v
Lexer/Parser (parse.y)
|
v
AST (mrb_ast_node)
|
v
Code Generator (codegen.c)
|
v
Bytecode (mrb_irep)
|
v
VM execution -or- .mrb binary file
The lexer and parser are combined in a single Lrama/Bison grammar
file: mrbgems/mruby-compiler/core/parse.y.
The parser maintains extensive state in mrb_parser_state. Among other
things, this state determines how +/- are interpreted (sign vs operator)
and whether newlines are significant.

The parser produces an AST using two node types. Node metadata includes
node_type, lineno, and filename_index.

Key node types include NODE_SCOPE (new variable scope),
NODE_STMTS (statement sequence), NODE_IF, NODE_WHILE,
NODE_CALL (method call), NODE_DEF (method definition),
NODE_CLASS, NODE_RESCUE, NODE_ENSURE, etc. See
mrbgems/mruby-compiler/core/node.h for the complete list.
Local variables are tracked per-scope during parsing:
- local_add(sym): register a new local variable in the current scope
- local_var_p(sym): check if a symbol is a local variable (affects
  whether an identifier is parsed as a method call or variable
  reference)

The code generator (mrbgems/mruby-compiler/core/codegen.c) walks
the AST and emits bytecode into mrb_irep structures.
Each lexical scope (method, block, class body) has its own
codegen_scope:
codegen_scope
+-- sp current register index (stack pointer)
+-- pc current instruction count
+-- nlocals number of local variables
+-- nregs maximum register index used
+-- lv local variable list
+-- iseq[] instruction sequence (grows dynamically)
+-- pool[] literal pool (strings, numbers)
+-- syms[] symbol table (method/variable names)
+-- reps[] child ireps (nested methods/blocks)
+-- catch_table[] exception handler entries
+-- loop current loop context stack
+-- prev parent scope
+-- mscope true if method/module/class scope
Scopes nest for blocks, method definitions, and class/module bodies.
Each scope produces one mrb_irep.
The code generator uses a simple stack-based register allocator.
Register 0 holds self.
push() increments sp and tracks the high-water mark in nregs.
pop() decrements sp. The allocator is linear: it does not
reuse temporaries within an expression.
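The push/pop bookkeeping can be sketched in a few lines of standalone C. This is a minimal model, not the code in codegen.c: reg_scope, reg_scope_init, reg_push and reg_pop are hypothetical names, and starting sp at nlocals is an assumption about where temporaries begin.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for codegen_scope: only the two
   fields the allocator touches. */
typedef struct {
  int sp;     /* next free register index */
  int nregs;  /* high-water mark: registers the irep must reserve */
} reg_scope;

/* Assumption: temporaries start right above the locals, and nlocals
   already counts self in register 0. */
void reg_scope_init(reg_scope *s, int nlocals) {
  s->sp = nlocals;
  s->nregs = nlocals;
}

/* Claim the next register, raise the high-water mark if needed,
   and return the claimed index. */
int reg_push(reg_scope *s) {
  int r = s->sp++;
  if (s->sp > s->nregs) s->nregs = s->sp;
  return r;
}

/* Release the most recently claimed register. nregs is never lowered. */
void reg_pop(reg_scope *s) {
  assert(s->sp > 0);
  s->sp--;
}
```

After two pushes and two pops on a scope with three locals, sp is back at 3 while nregs stays at 5, which is the value the finished irep would need for its register count.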
Instructions are emitted via helper functions:
- genop_0(opcode): no operands
- genop_1(opcode, a): one operand (auto-extends with OP_EXT1
  if a > 255)
- genop_2(opcode, a, b): two operands (auto-extends with
  OP_EXT1/2/3 as needed)
- genop_3(opcode, a, b, c): three operands
- genop_W(opcode, a): 24-bit operand
- genop_2S(opcode, a, b): one 8-bit + one 16-bit operand

The code generator performs limited peephole optimizations, such as
removing redundant OP_MOVE instructions and combining consecutive
literal loads. Optimization is disabled at jump targets and when
no_optimize is set in the compilation context.
Loop constructs (while, until, for, blocks) push a
loopinfo structure that tracks jump destinations:
- pc0: destination for next
- pc1: destination for redo
- pc2: destination for break

Loop types (LOOP_NORMAL, LOOP_BLOCK, LOOP_FOR, LOOP_BEGIN,
LOOP_RESCUE) determine how break/next/redo behave.
The compiled bytecode is stored in mrb_irep (Instruction
REPresentation):
mrb_irep
+-- iseq[] instruction sequence (mrb_code array)
+-- pool[] literal pool (mrb_irep_pool entries)
+-- syms[] symbol table (mrb_sym array)
+-- reps[] child ireps (nested scopes)
+-- lv[] local variable names (for debugging)
+-- nlocals local variable count
+-- nregs register count (locals + temporaries)
+-- ilen instruction count
+-- plen pool entry count
+-- slen symbol count
+-- rlen child irep count
+-- clen catch handler count
+-- debug_info source file/line mapping
Pool entries store constants referenced by instructions:
| Type | Tag | Description |
|---|---|---|
| IREP_TT_STR | 0 | Dynamic string (heap allocated) |
| IREP_TT_SSTR | 2 | Static string (read-only) |
| IREP_TT_INT32 | 1 | 32-bit integer |
| IREP_TT_INT64 | 3 | 64-bit integer |
| IREP_TT_FLOAT | 5 | Floating-point number |
| IREP_TT_BIGINT | 7 | Arbitrary-precision integer |
The code generator deduplicates pool entries: identical strings and equal numeric values share the same pool index.
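Deduplication amounts to a find-or-append lookup. The sketch below shows the idea for integer literals only; int_pool, POOL_MAX and pool_intern_int are hypothetical names, and the fixed-size array and linear scan are simplifications, not a description of how codegen.c stores its pool.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_MAX 16  /* arbitrary cap for this sketch */

/* Hypothetical integer-only literal pool. */
typedef struct {
  int64_t vals[POOL_MAX];
  size_t len;
} int_pool;

/* Return the pool index for v. An existing equal entry is reused;
   otherwise v is appended. Returns -1 when the pool is full. */
int pool_intern_int(int_pool *p, int64_t v) {
  for (size_t i = 0; i < p->len; i++) {
    if (p->vals[i] == v) return (int)i;
  }
  if (p->len == POOL_MAX) return -1;
  p->vals[p->len] = v;
  return (int)p->len++;
}
```

Interning 42, then 7, then 42 again yields indexes 0, 1 and 0, and the pool ends up with two entries.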
Exception handler entries are appended after the instruction sequence in memory:
mrb_irep_catch_handler
+-- type MRB_CATCH_RESCUE (0) or MRB_CATCH_ENSURE (1)
+-- begin[4] start PC of protected range
+-- end[4] end PC of protected range
+-- target[4] jump target when handler fires
During exception unwinding, handlers are searched in reverse order (last to first) for the current PC.
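A reverse search over such a table might look like the following standalone sketch. Several details are assumptions rather than facts taken from the VM: the 4-byte fields are treated as big-endian, the protected range is tested as half-open (begin <= pc < end), and inner handlers are assumed to be recorded after the outer ones that enclose them. The names catch_handler, make_handler and find_handler are hypothetical; consult the VM source for the exact boundary test.

```c
#include <stddef.h>
#include <stdint.h>

enum { CATCH_RESCUE = 0, CATCH_ENSURE = 1 };  /* values of the type field */

/* Mirrors the field layout listed for mrb_irep_catch_handler. */
typedef struct {
  uint8_t type;
  uint8_t begin[4];
  uint8_t end[4];
  uint8_t target[4];
} catch_handler;

/* Assumption: 32-bit fields are stored high byte first. */
uint32_t unpack_u32(const uint8_t b[4]) {
  return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
         ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

void pack_u32(uint8_t b[4], uint32_t v) {
  b[0] = (uint8_t)(v >> 24);
  b[1] = (uint8_t)(v >> 16);
  b[2] = (uint8_t)(v >> 8);
  b[3] = (uint8_t)v;
}

/* Build one table entry from plain integers. */
catch_handler make_handler(uint8_t type, uint32_t begin, uint32_t end,
                           uint32_t target) {
  catch_handler h;
  h.type = type;
  pack_u32(h.begin, begin);
  pack_u32(h.end, end);
  pack_u32(h.target, target);
  return h;
}

/* Walk the table from the last entry to the first and return the first
   handler whose range covers pc, or NULL if none does. */
const catch_handler *find_handler(const catch_handler *tab, size_t clen,
                                  uint32_t pc) {
  for (size_t i = clen; i-- > 0; ) {
    uint32_t b = unpack_u32(tab[i].begin);
    uint32_t e = unpack_u32(tab[i].end);
    if (pc >= b && pc < e) return &tab[i];  /* assumed boundary test */
  }
  return NULL;
}
```

With an outer handler at index 0 and a nested one at index 1, searching from the end means the innermost handler covering the PC is found first.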
Standard instructions use 8-bit operands. When a value exceeds 255, extension prefixes widen operands to 16 bits:
| Prefix | Effect |
|---|---|
| OP_EXT1 | First operand (a) becomes 16-bit |
| OP_EXT2 | Second operand (b) becomes 16-bit |
| OP_EXT3 | Both a and b become 16-bit |
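To make the prefix selection concrete, here is a small standalone encoder for a two-operand instruction. It is a sketch under stated assumptions: the DEMO_* opcode numbers are placeholders and not mruby's real values (those live in include/mruby/ops.h), encode_bb and put_operand are hypothetical helpers, and 16-bit operands are assumed to be written high byte first.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder opcode numbers for illustration only. */
enum {
  DEMO_OP      = 0x10,
  DEMO_OP_EXT1 = 0xF1,
  DEMO_OP_EXT2 = 0xF2,
  DEMO_OP_EXT3 = 0xF3
};

/* Append one operand as 8 or 16 bits (high byte first, by assumption). */
size_t put_operand(uint8_t *out, size_t n, uint16_t v, int wide) {
  if (wide) out[n++] = (uint8_t)(v >> 8);
  out[n++] = (uint8_t)(v & 0xff);
  return n;
}

/* Encode a two-operand instruction, adding an extension prefix only when
   an operand exceeds 255. out must have room for 6 bytes.
   Returns the number of bytes written. */
size_t encode_bb(uint8_t *out, uint8_t op, uint16_t a, uint16_t b) {
  int wide_a = a > 0xff;
  int wide_b = b > 0xff;
  size_t n = 0;
  if (wide_a && wide_b) out[n++] = DEMO_OP_EXT3;
  else if (wide_b)      out[n++] = DEMO_OP_EXT2;
  else if (wide_a)      out[n++] = DEMO_OP_EXT1;
  out[n++] = op;
  n = put_operand(out, n, a, wide_a);
  n = put_operand(out, n, b, wide_b);
  return n;
}
```

Encoding operands (1, 2) takes 3 bytes with no prefix, (300, 2) takes 5 bytes with the EXT1 placeholder in front, and (300, 1000) takes 6 bytes with the EXT3 placeholder.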
Instruction formats:
| Format | Layout | Size |
|---|---|---|
| Z | opcode only | 1 byte |
| B | opcode + a(8) | 2 bytes |
| BB | opcode + a(8) + b(8) | 3 bytes |
| BBB | opcode + a(8) + b(8) + c(8) | 4 bytes |
| BS | opcode + a(8) + b(16) | 4 bytes |
| BSS | opcode + a(8) + b(16) + c(16) | 6 bytes |
| S | opcode + a(16) | 3 bytes |
| W | opcode + a(24) | 4 bytes |
See opcode.md for the full instruction table.
OP_ENTER encodes a method's argument layout in a 24-bit value
(W format). The bit fields are defined by the MRB_ARGS_* macros:
Bit 23 no-block flag
Bits 18-22 required argument count (5 bits, 0-31)
Bits 13-17 optional argument count (5 bits, 0-31)
Bit 12 rest argument flag (*args)
Bits 7-11 post-rest argument count (5 bits, 0-31)
Bits 2-6 keyword argument count (5 bits, 0-31)
Bit 1 keyword rest flag (**kwargs)
Bit 0 block argument flag (&block)
Example: def foo(a, b=1, *rest, &block) produces an aspec with
1 required, 1 optional, rest flag set, and block flag set.
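The arithmetic behind that example can be checked with a short standalone sketch that packs the fields at the bit positions listed above. pack_aspec and the two accessor functions are hypothetical helpers written for illustration; mruby itself composes the value from the MRB_ARGS_* macros. Bit 23 (the no-block flag) is left clear here.

```c
#include <stdint.h>

/* Pack an argument spec using the documented bit positions.
   Counts are masked to 5 bits; flags are treated as booleans. */
uint32_t pack_aspec(unsigned req, unsigned opt, int rest, unsigned post,
                    unsigned key, int kwrest, int block) {
  return ((uint32_t)(req  & 0x1f) << 18) |
         ((uint32_t)(opt  & 0x1f) << 13) |
         (rest   ? (uint32_t)1 << 12 : 0) |
         ((uint32_t)(post & 0x1f) << 7) |
         ((uint32_t)(key  & 0x1f) << 2) |
         (kwrest ? (uint32_t)1 << 1 : 0) |
         (block  ? (uint32_t)1 : 0);
}

/* Field accessors for the two leading counts. */
unsigned aspec_req(uint32_t a) { return (a >> 18) & 0x1f; }
unsigned aspec_opt(uint32_t a) { return (a >> 13) & 0x1f; }
```

For def foo(a, b=1, *rest, &block), pack_aspec(1, 1, 1, 0, 0, 0, 1) gives 0x40000 + 0x2000 + 0x1000 + 0x1 = 0x43001.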
The presym system pre-allocates symbol IDs at build time for frequently used method names and operators. This avoids runtime string interning for common symbols.
Generated by lib/mruby/presym.rb, the presym table maps symbol
names to compile-time constants:
| Macro | Example | Symbol |
|---|---|---|
| MRB_SYM(name) | MRB_SYM(initialize) | :initialize |
| MRB_SYM_B(name) | MRB_SYM_B(map) | :map! |
| MRB_SYM_Q(name) | MRB_SYM_Q(nil) | :nil? |
| MRB_SYM_E(name) | MRB_SYM_E(name) | :name= |
| MRB_OPSYM(op) | MRB_OPSYM(add) | :+ |
| MRB_IVSYM(name) | MRB_IVSYM(name) | :@name |
| MRB_CVSYM(name) | MRB_CVSYM(count) | :@@count |
| MRB_GVSYM(name) | MRB_GVSYM(stdout) | :$stdout |
Precompiled bytecode is stored in the RITE binary format:
Header: "RITE" magic + version ("0400") + CRC + size
Section IREP: instruction sequences, pools, symbols
Section DBG: debug info (optional, filename/line mapping)
Section LVAR: local variable names (optional)
Footer: "END\0"
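A tool that wants to reject obviously wrong input before handing it to the loader could check the first bytes itself. The sketch below is an assumption-laden illustration, not the check performed in src/load.c: it assumes the magic occupies bytes 0-3 and the four-character version bytes 4-7, and rite_header_matches is a hypothetical helper. It does not validate the CRC, size, or any section.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 when buf begins with "RITE" followed by the expected
   four-character version string, 0 otherwise.
   Assumed offsets: magic at 0, version at 4. */
int rite_header_matches(const unsigned char *buf, size_t len,
                        const char *version) {
  if (len < 8 || strlen(version) != 4) return 0;
  return memcmp(buf, "RITE", 4) == 0 &&
         memcmp(buf + 4, version, 4) == 0;
}
```

A caller would pass the version string it was built against, for example "0400", and fall back to an error path when the function returns 0.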
Loading functions:
- mrb_load_irep(mrb, bin): load and execute from byte array
- mrb_load_irep_buf(mrb, buf, len): load with explicit size
  (safer)
- mrb_read_irep(mrb, bin): load without executing (returns
  mrb_irep*)
- mrb_load_irep_file(mrb, fp): load from file

The mrbc command-line tool performs ahead-of-time compilation:
mrbc -o output.mrb source.rb # binary format
mrbc -Boutput source.rb # C array format
| Limit | Value |
|---|---|
| Max nesting depth | 256 (MRB_CODEGEN_LEVEL_MAX) |
| Max local variables | 255 (uint16 nlocals) |
| Max symbols per irep | 65535 |
| Max operand (standard) | 255 (8-bit) |
| Max operand (extended) | 65535 (16-bit) |
| File | Contents |
|---|---|
| mrbgems/mruby-compiler/core/parse.y | Lrama/Bison grammar |
| mrbgems/mruby-compiler/core/y.tab.c | Generated parser |
| mrbgems/mruby-compiler/core/codegen.c | Code generator |
| mrbgems/mruby-compiler/core/node.h | AST node types |
| include/mruby/irep.h | IRep structure definition |
| include/mruby/compile.h | Compiler context API |
| include/mruby/ops.h | Opcode definitions |
| src/load.c | Binary format loader |
| src/dump.c | Binary format writer |
| lib/mruby/presym.rb | Presym table generator |