beps/docs/proposals/BEP-007-mir/context/gemini-mir-research.md
So I have a compiler with these phases: Lexer -> Parser (CST) -> AST (view into CST) -> HIR -> THIR -> Codegen (VM bytecode). Now I'm running into a problem. I want to desugar constructs like `for x in array` loops into something more codegen-friendly, and I also have match statements that need some desugaring. I have added the for-loop desugaring into while loops at the AST -> HIR transformation. But for match, the typechecker needs to know about all match arms to do exhaustiveness analysis and so on, so I can't desugar it into if statements. I was basically trying to mimic Rust, which in the playground has a "Show HIR" button where I can see desugared for loops. Match statements are not desugared there, only at the MIR level. But even at MIR, Rust still does analyses like borrow checking. The problem with analyzing desugared code is that when reporting errors you have to show the user their source code, not your compiler's desugared code. So if I desugar at the HIR layer, my only option is to carry both the source nodes and the desugared nodes: I use the source for typechecking and errors, and the desugared form for codegen. If there's an error in the desugared code, I guess I can use synthetic spans that refer back to the user's code so errors point at it, not at my desugared code. Another solution I thought about is to add another layer after THIR called DHIR (Desugared HIR) or something like that, where errors can no longer happen. But I wonder: if at some point I need a CFG-based MIR where errors can be found, I'm going to run into the same problem. How do other compilers deal with this? What would be the best approach?
This is a classic compiler architecture dilemma: The trade-off between IR simplicity (desugaring early) and diagnostic fidelity (keeping original structure).
You have correctly identified the standard solution (synthetic spans) and the standard frustration (showing errors on code that "doesn't exist" in the user's file).
Here is how production compilers (like Rust, Swift, and Roslyn) typically solve this, and how you should fit it into your pipeline.
Your intuition about "synthetic spans" is correct. This is not just a backup option; it is the industry standard way to handle this.
You do not need to carry both the source node and the desugared node in the AST/HIR. You carry the desugared node, but you carefully assign it the Span (file location) of the original source construct.
Example: Desugaring a for loop
Suppose the user writes:
// Span: 10..25
for x in arr { ... }
When you desugar this into a while loop in your AST->HIR phase:
- The `while` node inherits the span of the `for` keyword (or the whole loop).
- The hidden `arr.next()` call inherits the span of `arr` or `in`.

If a Type Error occurs in the desugared code:
If arr is not an iterator, the type checker (running on HIR/THIR) will complain that "type Array does not have method .next()". Because the method call node has the span of arr, the error highlights arr in the user's source code. The user sees: "Error: arr is not iterable"—which is perfectly intelligible, even though the user never wrote .next().
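A minimal sketch of this span inheritance during desugaring. The `Span`, `Hir`, and `desugar_for` names here are hypothetical, not from any particular compiler:

```rust
// Hypothetical AST/HIR types -- names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Span {
    start: u32,
    end: u32,
}

#[derive(Debug)]
enum Hir {
    // A method call synthesized by the desugaring (e.g. `arr.next()`).
    MethodCall { recv: String, method: String, span: Span },
    While { cond: Box<Hir>, body: Vec<Hir>, span: Span },
}

// Desugar `for x in arr { body }` into a while loop.
// `for_span` = span of the whole `for` loop; `arr_span` = span of `arr`.
fn desugar_for(arr: &str, body: Vec<Hir>, for_span: Span, arr_span: Span) -> Hir {
    // The synthetic `.next()` call inherits the span of `arr`, so a
    // "not an iterator" error underlines `arr` in the user's file.
    let next_call = Hir::MethodCall {
        recv: arr.to_string(),
        method: "next".to_string(),
        span: arr_span,
    };
    // The synthetic `while` inherits the span of the whole `for` loop.
    Hir::While {
        cond: Box::new(next_call),
        body,
        span: for_span,
    }
}

fn main() {
    let hir = desugar_for(
        "arr",
        vec![],
        Span { start: 10, end: 25 },
        Span { start: 19, end: 22 },
    );
    if let Hir::While { cond, span, .. } = &hir {
        assert_eq!(*span, Span { start: 10, end: 25 });
        if let Hir::MethodCall { span, .. } = cond.as_ref() {
            // Errors on `.next()` point at `arr`, not at generated code.
            assert_eq!(*span, Span { start: 19, end: 22 });
        }
    }
}
```

A useful extension is an optional "desugaring origin" tag on `Span`, so diagnostics can append a note such as "this error originates in a `for` loop desugaring."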
You should split your desugaring into two distinct buckets. You are currently trying to do everything at once, which is causing the friction with match.
**Phase: AST -> HIR**

Strategy: Desugar things that map purely structurally and don't require type information.

- Examples: `for` loops, `while let`, `if let`, `unless`, self-assign operators (`+=`).
- These lower into `loop`, `match`, and assignment immediately. This simplifies your HIR so your type checker has fewer node types to handle.

**Phase: THIR -> MIR (or THIR -> Bytecode)**

Strategy: Keep these high-level during type checking, then lower them once you have types.

- Examples: `match` expressions, closures (sometimes), async/await.
- During type checking, `match` exists as a tree node, so exhaustiveness analysis runs directly on the `match` node.
- During lowering, compile `match` into a Decision Tree, a chain of `if`/`else` statements, or a CFG (Control Flow Graph) of jumps/branches.

You mentioned: THIR -> Codegen (VM Bytecode).
The jump from a Typed Tree (THIR) directly to VM Bytecode is often too large if you have complex control flow like match. This is where MIR (Mid-level IR) comes in. In your nomenclature, "DHIR" is essentially MIR.
I strongly recommend inserting a CFG-based IR (Control Flow Graph) after THIR but before Bytecode.
The Pipeline:

1. **AST -> HIR**: Desugar `for` loops. Everything is a tree. Names are resolved.
2. **HIR -> THIR**: Type check. `match` is still a high-level node.
3. **THIR -> MIR**: Lower `match` into jumps/gotos. Flatten nested expressions.
4. **MIR -> Bytecode**: Straightforward codegen.
You asked: "If I desugar... and find errors, I have to show the user his source code."
This is the hardest part of compiler UX. Here is how you handle it:
Scenario: You have lowered match into a complex web of GOTO statements in MIR. You run a borrow checker on that MIR and find a use-after-free error.
- The offending MIR statement (e.g., `move _1`) holds the span of the variable usage in the source code. You highlight that.
- The error message reads "use of moved value: `x`" (points to user code).
- Optionally, you add a note like "this error originates in the desugaring of a `for` loop." (You can track a "Desugaring Origin" in your span data structure.)

Summary of the recommended split:

- Desugar `for` loops in AST -> HIR. Ensure the generated nodes point to the source text spans. This keeps your type checker simple (it doesn't need to know what a `for` loop is).
- Keep `match` in HIR and THIR. Do your exhaustiveness checking on the high-level representation.
- In MIR, lower `match` into a Decision Tree or If-Else chain.

Next Step:
Would you like to see a concrete example of how to implement the Span propagation for a for loop desugaring to ensure error messages point to the right place?
Okay, I want to understand exactly what an MIR looks like. So, for Rust, I remember reading a couple of things:
Also in the Rust playground I can see the MIR.
Given this program:
enum Status {
    Active(i32),
    Idle
}

fn print(s: &Status) {
    for _ in 0..10 {
        let s = match s {
            Status::Active(n) => String::from("active"),
            Status::Idle => String::from("idle")
        };
    }
}
It shows me this MIR:
// WARNING: This output format is intended for human consumers only
// and is subject to change without notice. Knock yourself out.
// HINT: See also -Z dump-mir for MIR at specific points during compilation.
fn print(_1: &Status) -> () {
debug s => _1;
let mut _0: ();
let mut _2: std::ops::Range<i32>;
let mut _3: std::ops::Range<i32>;
let mut _4: std::ops::Range<i32>;
let mut _5: std::option::Option<i32>;
let mut _6: &mut std::ops::Range<i32>;
let mut _7: isize;
let mut _9: isize;
scope 1 {
debug iter => _4;
let _8: std::string::String;
let _10: &i32;
scope 2 {
debug s => _8;
}
scope 3 {
debug n => _10;
}
}
bb0: {
_3 = std::ops::Range::<i32> { start: const 0_i32, end: const 10_i32 };
_2 = <std::ops::Range<i32> as IntoIterator>::into_iter(move _3) -> [return: bb1, unwind continue];
}
bb1: {
_4 = move _2;
goto -> bb2;
}
bb2: {
_6 = &mut _4;
_5 = <std::ops::Range<i32> as Iterator>::next(copy _6) -> [return: bb3, unwind continue];
}
bb3: {
_7 = discriminant(_5);
switchInt(move _7) -> [0: bb6, 1: bb5, otherwise: bb4];
}
bb4: {
unreachable;
}
bb5: {
_9 = discriminant((*_1));
switchInt(move _9) -> [0: bb8, 1: bb7, otherwise: bb4];
}
bb6: {
return;
}
bb7: {
_8 = <String as From<&str>>::from(const "idle") -> [return: bb9, unwind continue];
}
bb8: {
_10 = &(((*_1) as Active).0: i32);
_8 = <String as From<&str>>::from(const "active") -> [return: bb9, unwind continue];
}
bb9: {
drop(_8) -> [return: bb2, unwind continue];
}
}
alloc2 (size: 4, align: 1) {
69 64 6c 65 │ idle
}
alloc1 (size: 6, align: 1) {
61 63 74 69 76 65 │ active
}
fn Status::Active(_1: i32) -> Status {
let mut _0: Status;
bb0: {
_0 = Status::Active(move _1);
return;
}
}
// MIR FOR CTFE
fn Status::Active(_1: i32) -> Status {
let mut _0: Status;
bb0: {
_0 = Status::Active(move _1);
return;
}
}
This is an excellent example of MIR. You’ve struck gold with this snippet because it perfectly demonstrates control flow flattening, which is exactly what you need to solve your match and for loop desugaring problems.
Here is the breakdown of exactly what you are looking at and how it solves the architectural problems you described.
Unlike your AST or HIR (which are trees), MIR is a Control Flow Graph (CFG). This means the concept of "nesting" (nested ifs, nested loops) is completely gone.
**Locals** (`_0`, `_1`...): All variables are declared at the top.

- `_0`: Always the return value.
- `_1` ... `_n`: Arguments to the function.
- `_n+1` ...: Temporary variables generated by the compiler. Notice how many there are (`_2` to `_10`). MIR is very explicit about memory.

**Basic Blocks** (`bb0`, `bb1`...): The code is sliced into chunks. A "Basic Block" is a sequence of straight-line code that always ends with a "Terminator" (a jump, a switch, or a return): `goto`, `switchInt`, `return`, `unreachable`.

**How `match` is Desugared (The `switchInt`)**

You mentioned you were stuck on how to desugar `match` while keeping semantics. Look at `bb5` in your output. This is the heart of your `match s { ... }`.
bb5: {
// 1. Read the "tag" or variant index of the enum
_9 = discriminant((*_1));
// 2. Branch based on that integer
switchInt(move _9) -> [0: bb8, 1: bb7, otherwise: bb4];
}
- `discriminant`: This is a special MIR instruction that extracts the integer value of the enum variant (e.g., `Active` = 0, `Idle` = 1).
- `switchInt`: This is the low-level version of `match`. It takes an integer and jumps to a specific block:
  - `0` (`Active`) -> Go to `bb8` (where the code for `Active` lives).
  - `1` (`Idle`) -> Go to `bb7` (where the code for `Idle` lives).
  - `otherwise` -> Go to `bb4` (which contains `unreachable`, handling the "impossible" case).

Why this helps you: By the time you get to MIR, you don't need to know about "patterns" or "exhaustiveness." You already checked that in THIR. Now you just need raw integer comparisons and jumps. This is incredibly friendly for Codegen.
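A minimal sketch of this lowering, assuming hypothetical `BasicBlock`/`Terminator` types (this mirrors what rustc does conceptually, not its actual API):

```rust
// Lowering a checked `match` into discriminant-read + switchInt blocks.
// All names here are illustrative.
#[derive(Debug, PartialEq)]
enum Terminator {
    Goto(usize),                                         // unconditional jump
    SwitchInt { targets: Vec<usize>, otherwise: usize }, // the lowered `match`
    Unreachable,
}

#[derive(Debug)]
struct BasicBlock {
    stmts: Vec<String>, // placeholder statements, e.g. "_7 = discriminant(_5)"
    terminator: Terminator,
}

/// Lower a match with `n_arms` variants. Returns the id of the start block.
fn lower_match(blocks: &mut Vec<BasicBlock>, n_arms: usize, join: usize) -> usize {
    let start = blocks.len();
    // Reserve the start block; its terminator is patched once arm blocks exist.
    blocks.push(BasicBlock {
        stmts: vec!["_d = discriminant(_scrutinee)".to_string()],
        terminator: Terminator::Unreachable,
    });
    // One block per arm, each ending with a jump to the join block.
    let targets: Vec<usize> = (0..n_arms)
        .map(|arm| {
            let id = blocks.len();
            blocks.push(BasicBlock {
                stmts: vec![format!("/* code for arm {arm} */")],
                terminator: Terminator::Goto(join),
            });
            id
        })
        .collect();
    // The "impossible" case: exhaustiveness was already proven in THIR.
    let otherwise = blocks.len();
    blocks.push(BasicBlock { stmts: vec![], terminator: Terminator::Unreachable });
    blocks[start].terminator = Terminator::SwitchInt { targets, otherwise };
    start
}

fn main() {
    // bb0 acts as the join block where control flow continues after the match.
    let mut blocks = vec![BasicBlock { stmts: vec![], terminator: Terminator::Goto(0) }];
    let start = lower_match(&mut blocks, 2, 0);
    assert_eq!(start, 1);
    assert_eq!(blocks.len(), 5); // join + start + 2 arms + unreachable
    match &blocks[start].terminator {
        Terminator::SwitchInt { targets, otherwise } => {
            assert_eq!(targets, &vec![2, 3]);
            assert_eq!(*otherwise, 4);
        }
        _ => panic!("expected SwitchInt"),
    }
}
```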
**How `for` is Desugared (The Loop)**

You can see the `for` loop has been completely exploded into the Iterator protocol.

- Setup (`bb0`): Creates the range `0..10` (`_3`) and converts it into an iterator (`_2`).
- Loop header (`bb2`): This is where the loop "starts" on every iteration.
- The check (`bb3`): This is the desugared condition.
bb3: {
// Call .next() on the iterator
_5 = <...>::next(copy _6) -> ...;
// Check if result is Some (1) or None (0)
_7 = discriminant(_5);
switchInt(move _7) -> [0: bb6, 1: bb5, otherwise: bb4];
}
- `0` (None) -> Go to `bb6` (exit loop / return).
- `1` (Some) -> Go to `bb5` (execute loop body / the match).

You are currently struggling because you are trying to do two incompatible things in one IR:

- Type checking (which wants `match` to be a tree).
- Codegen (which wants `match` to be jumps).

**The Solution: Add the MIR Phase**
You should insert a phase between THIR and Codegen.
**Phase 1: THIR (Tree-based, Typed)**

- Keep `match` as a complex node: `Match { scrutinee, arms: [...] }`.
- Run your exhaustiveness checking here, so you can report errors like "missing arm `Status::Idle`".

**Phase 2: MIR Construction (The "Lowering" Step)**

To lower a `match`:

1. Create a start block for the `match` (like `bb5`).
2. Emit a `discriminant` instruction.
3. Create one block per arm (like `bb7`, `bb8`).
4. Emit a `switchInt` in the start block pointing to those arm blocks.

**Phase 3: Codegen**

- `switchInt` becomes a `jump_table` or `cmp`/`jmp` instructions in your VM.
- `goto` becomes a `jmp`.

Revisiting your worry about errors:
"If there's error in desugared code... errors point at it, not at my desugared code."
In MIR, every statement carries a Span.
- In `bb3`, the call to `next()` is generated by the compiler.
- You propagate the span of the `for` keyword (or the `in` keyword) from the original AST to this `next()` call instruction in the MIR.
- If the analysis of `next()` fails (e.g., the borrow checker says the iterator is moved), the error engine looks at the MIR instruction, grabs the span (which points to `for`), and prints: "Error at line X: Borrow of moved value in `for` loop".

The user never sees `bb3` or `next()`. They just see their loop highlighted.

Do not try to desugar `match` inside your AST or HIR. It is too early.

- Keep `match` structured until THIR is finished.
- Then, in MIR, lower `match` into switches and loops into gotos.

Interesting. I know CPython uses a CFG as well, can you show me an example?
CPython absolutely uses a CFG, though it exposes it a bit differently than Rust. In Python, the CFG generation happens deep inside the compiler (in compile.c) just before the final bytecode is emitted.
While Rust's MIR is a distinct intermediate language you can dump and read, Python's CFG is usually invisible unless you disassemble the final bytecode. However, the structure is exactly the same: Basic Blocks connected by Jumps.
Here is an example using Python's dis (disassembler) module, annotated to show you the hidden Control Flow Graph.
Let's use a loop with a conditional, similar to your Rust example, to see how the control flow splits and merges.
def generic_example(x):
    while x > 0:
        if x % 2 == 0:
            print("even")
        else:
            print("odd")
        x -= 1
If you run `import dis; dis.dis(generic_example)`, you get the linear bytecode (the opcodes shown below are from the Python 3.7 era; `SETUP_LOOP` was removed in 3.8, and newer versions emit different instructions). I have grouped the instructions below into Basic Blocks (BB) so you can see the graph structure that the compiler built.
Note: The >> arrows indicate jump targets (entry points to a basic block).
**BB0: Loop Header**

This is the entry point. It immediately checks the loop condition.

  2           0 SETUP_LOOP              50 (to 52)   # Pushes a block for 'break' handling
        >>    2 LOAD_FAST                0 (x)
              4 LOAD_CONST               1 (0)
              6 COMPARE_OP               4 (>)
              8 POP_JUMP_IF_FALSE       50           # TERMINATOR: If False, jump to BB4 (End)

**BB1: The `if` Condition**

We entered the loop. Now we check the `if`.

  3          10 LOAD_FAST                0 (x)
             12 LOAD_CONST               2 (2)
             14 BINARY_MODULO
             16 LOAD_CONST               1 (0)
             18 COMPARE_OP               2 (==)
             20 POP_JUMP_IF_FALSE       32           # TERMINATOR: If False, jump to BB3 (Else)

**BB2: The `then` Branch**

The "True" path of the `if` statement.

  4          22 LOAD_GLOBAL              0 (print)
             24 LOAD_CONST               3 ('even')
             26 CALL_FUNCTION            1
             28 POP_TOP
             30 JUMP_ABSOLUTE           40           # TERMINATOR: Jump to the merge point

**BB3: The `else` Branch & Merge**

Notice this block handles the `else` logic and then falls through into the code that happens after the `if`/`else` (the decrement).

  6    >>    32 LOAD_GLOBAL              0 (print)   # Target of the "else" jump
             34 LOAD_CONST               4 ('odd')
             36 CALL_FUNCTION            1
             38 POP_TOP

  7    >>    40 LOAD_FAST                0 (x)       # Target of the "then" jump (merge point)
             42 LOAD_CONST               5 (1)
             44 INPLACE_SUBTRACT
             46 STORE_FAST               0 (x)
             48 JUMP_ABSOLUTE            2           # TERMINATOR: Jump back to BB0 (Loop Header)

**BB4: Loop Exit**

Where POP_JUMP_IF_FALSE sends us when `x > 0` is finally false.

       >>    50 POP_BLOCK                            # Clean up loop scope
       >>    52 LOAD_CONST               0 (None)
             54 RETURN_VALUE
Two things to notice:

- The jumps are raw bytecode offsets (`POP_JUMP_IF_FALSE n` means "goto index n").
- CPython builds the basic blocks internally, but throws away the `bb0`, `bb1` labels when it flattens everything into this linear bytecode.

You don't need to output fancy graphviz diagrams or text like Rust does. For your compiler, your "MIR" can simply be a list of BasicBlock objects:
struct BasicBlock {
    instructions: Vec<Instruction>,
    terminator: Terminator, // Jump, Branch, Return
}

struct MirFunction {
    blocks: Vec<BasicBlock>, // Just a flat list!
}
When you generate code (VM bytecode), you just:

1. Iterate over the blocks.
2. Record a mapping of BlockID -> BytecodeOffset as you emit each block.
3. Patch every jump instruction to use the BytecodeOffset you calculated.

This separates the complex logic of "how flow works" (MIR) from the boring logic of "counting bytes" (Codegen).
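The three-step emission above can be sketched as a two-pass function. The instruction set here is made up for illustration, and jump operands are instruction indices; a real VM with variable-width operands would track byte offsets instead:

```rust
// Two-pass bytecode emission: lay out blocks, record offsets, patch jumps.
#[derive(Clone, Debug, PartialEq)]
enum Inst {
    Nop,                // stand-in for real work
    Jump(usize),        // operand: BlockId before patching, offset after
    JumpIfFalse(usize), // same convention
    Return,
}

struct Block {
    code: Vec<Inst>,
}

fn emit(blocks: &[Block]) -> Vec<Inst> {
    let mut out = Vec::new();
    let mut offsets = Vec::new(); // BlockId -> offset of the block's first instruction
    // Pass 1: lay blocks out linearly, remembering where each one starts.
    for b in blocks {
        offsets.push(out.len());
        out.extend(b.code.iter().cloned());
    }
    // Pass 2: rewrite every jump operand from BlockId to real offset.
    for inst in &mut out {
        match inst {
            Inst::Jump(t) | Inst::JumpIfFalse(t) => *t = offsets[*t],
            _ => {}
        }
    }
    out
}

fn main() {
    // bb0: Nop, JumpIfFalse(bb2); bb1: Nop, Jump(bb0); bb2: Return
    let blocks = vec![
        Block { code: vec![Inst::Nop, Inst::JumpIfFalse(2)] },
        Block { code: vec![Inst::Nop, Inst::Jump(0)] },
        Block { code: vec![Inst::Return] },
    ];
    let code = emit(&blocks);
    assert_eq!(code[1], Inst::JumpIfFalse(4)); // bb2 starts at index 4
    assert_eq!(code[3], Inst::Jump(0));        // bb0 starts at index 0
}
```

Because the block list is flat, forward jumps need no special handling: pass 1 discovers every block's offset before pass 2 touches any operand.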
Additional question: Can this be used for try-catch, or is that handled elsewhere?
Yes, absolutely. In fact, MIR (or any CFG-based IR) is the best place to model exception handling.
Handling try-catch in a tree-based structure (AST/HIR) is deceptively simple: it looks like just another block. But in terms of control flow, exceptions are effectively "goto-from-anywhere." Modeling this in a CFG is standard practice.
There are two main ways compilers handle this in the CFG. You should choose the one that fits your VM architecture.
**Option 1: Explicit Unwind Edges in the Graph**

In this model, every Basic Block that contains a function call (or an instruction that might throw) has two exit paths:

- The normal path: where execution continues if the call returns successfully.
- The unwind path: where execution jumps if the call throws.
If you look back at the Rust MIR output I showed you earlier, notice the unwind continue annotations?
_2 = <...>::into_iter(move _3) -> [return: bb1, unwind continue];
This instruction says: "If this call succeeds, goto bb1. If it panics/throws, continue bubbling the exception up the stack."
If this code were inside a try block (or catch_unwind in Rust), the unwind target would point to a specific Cleanup Block (your catch handler) instead of just "continue."
You would modify your Terminator or Call instruction to hold two targets:
enum Terminator {
    Goto(BlockId),
    Return,
    // A call that might crash/throw
    Call {
        func: String,
        args: Vec<Operand>,
        destination: BlockId,     // Normal path
        cleanup: Option<BlockId>, // "Catch" path (if inside a try block)
    }
}
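Sketched concretely, assuming hypothetical helper names and dropping `args` for brevity, lowering `try { danger() } catch { handle_error() }` with this terminator looks like:

```rust
// Building the CFG for a try/catch using the Call terminator's cleanup edge.
// All names here are illustrative.
type BlockId = usize;

#[derive(Debug, PartialEq)]
enum Terminator {
    Return,
    Call { func: String, destination: BlockId, cleanup: Option<BlockId> },
}

struct Block {
    terminator: Terminator,
}

/// Lower `try { danger() } catch { handle_error() }`.
/// Returns the id of the entry block of the try body.
fn lower_try(blocks: &mut Vec<Block>) -> BlockId {
    let try_bb = blocks.len();
    let catch_bb = try_bb + 1;
    let join_bb = try_bb + 2;
    // Try body: on success continue to the join block, on unwind jump to catch.
    blocks.push(Block {
        terminator: Terminator::Call {
            func: "danger".to_string(),
            destination: join_bb,
            cleanup: Some(catch_bb),
        },
    });
    // Catch handler: call handle_error(), then continue to the join block.
    blocks.push(Block {
        terminator: Terminator::Call {
            func: "handle_error".to_string(),
            destination: join_bb,
            cleanup: None,
        },
    });
    // Join: normal and exceptional paths merge here.
    blocks.push(Block { terminator: Terminator::Return });
    try_bb
}

fn main() {
    let mut blocks = Vec::new();
    let entry = lower_try(&mut blocks);
    match &blocks[entry].terminator {
        // The unwind edge points at the catch block, the normal edge at the join.
        Terminator::Call { destination, cleanup, .. } => {
            assert_eq!(*destination, 2);
            assert_eq!(*cleanup, Some(1));
        }
        _ => panic!("expected Call"),
    }
}
```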
**Option 2: Exception Tables (Side Metadata)**

This is likely the approach you want if you are building a VM.

Instead of drawing explicit arrows in the graph for every single instruction that might fail, you generate "ranges" of bytecode that are protected.
The Exception Table usually looks like this:
| Start PC | End PC | Handler PC | Exception Type |
|---|---|---|---|
| 0x0010 | 0x0050 | 0x0090 | ValueError |
| 0x0010 | 0x0050 | 0x00A5 | IOError |
How the VM uses it:
When an exception is thrown at PC 0x0020:
1. "Is `0x0020` inside `0x0010`...`0x0050`?" -> Yes.
2. "Does the thrown type match `ValueError`?" -> Yes.
3. Jump to `0x0090` (the Handler).

**Mapping try-catch to MIR**

Let's assume a simplified Python-like syntax:
try:
    danger()
    print("Success")
catch Error:
    handle_error()
print("Done")
Here is how this looks as a Control Flow Graph:
**BB0 (Start of Try):**

- Call `danger()`.
- If `danger()` throws, the VM/Runtime looks up the table and jumps to BB2.
- Otherwise: Goto BB1.

**BB1 (Rest of Try - "Success"):**

- Print "Success".
- Goto BB3 (skip the catch block).

**BB2 (The Catch Block):**

- Call `handle_error()`.
- Goto BB3.

**BB3 (Join/Finally):**

- Print "Done".
- Return.

**Where does this logic live?**

It lives in MIR, but the specific implementation details affect Codegen.

- In HIR/THIR, `TryCatch` is just a node with a body and a handler.
- In MIR, you create explicit basic blocks for the `try` body and the `catch` body.
- You track which blocks belong to which `try` blocks in a list of "Landing Pads" or "Exception Scopes."
- At codegen time, you emit one exception-table row per protected `try`.

Recommendation for your compiler: Use the Table-Based Approach. It keeps your MIR graph cleaner (you don't need double edges on every call) and it is the standard for virtually all bytecode interpreters (JVM, CLR, Python). The VM does the heavy lifting of checking the table when an exception actually occurs.
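The table lookup the VM performs when an exception is actually thrown can be sketched like this (field names are hypothetical; a real VM orders rows inner-to-outer so the first match wins):

```rust
// A VM-side exception table lookup over [start_pc, end_pc) ranges.
struct TableEntry {
    start_pc: usize,
    end_pc: usize,     // exclusive
    handler_pc: usize,
    exc_type: &'static str,
}

/// Find the handler for an exception of `exc_type` thrown at `pc`.
/// Rows for inner try blocks are listed first, so the first match wins.
fn find_handler(table: &[TableEntry], pc: usize, exc_type: &str) -> Option<usize> {
    table
        .iter()
        .find(|e| pc >= e.start_pc && pc < e.end_pc && e.exc_type == exc_type)
        .map(|e| e.handler_pc)
}

fn main() {
    // Same shape as the example table above.
    let table = [
        TableEntry { start_pc: 0x0010, end_pc: 0x0050, handler_pc: 0x0090, exc_type: "ValueError" },
        TableEntry { start_pc: 0x0010, end_pc: 0x0050, handler_pc: 0x00A5, exc_type: "IOError" },
    ];
    assert_eq!(find_handler(&table, 0x0020, "ValueError"), Some(0x0090));
    assert_eq!(find_handler(&table, 0x0020, "IOError"), Some(0x00A5));
    // Outside every protected range: keep unwinding up the call stack.
    assert_eq!(find_handler(&table, 0x0060, "ValueError"), None);
}
```

If no row matches, the VM pops the current frame and repeats the lookup in the caller's table, which is exactly the "continue bubbling up the stack" behavior from the unwind model.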
TryCatch is just a node with a body and a handler.catch body.try body.try blocks in a list of "Landing Pads" or "Exception Scopes."try.Recommendation for your compiler: Use the Table-Based Approach. It keeps your MIR graph cleaner (you don't need double edges on every call) and it is the standard for virtually all bytecode interpreters (JVM, CLR, Python). The VM does the heavy lifting of checking the table when an exception actually occurs.