docs/design/mono/jiterpreter.md
The jiterpreter is a just-in-time compiler for the Mono Interpreter component of the .NET WASM Runtime. It's split into three components: the trace compiler, the interp_entry thunk compiler, and the jit_call thunk compiler.
The jiterpreter is configured via Mono runtime options, defined in options-def.h. The key options are as follows:
- `jiterpreter-traces-enabled`: Enables the trace compiler. On by default for single-threaded builds; disabled if threading is active.
- `jiterpreter-interp-entry-enabled`: Enables interp_entry thunks. Only relevant for AOT.
- `jiterpreter-jit-call-enabled`: Enables jit_call thunks. Only relevant for AOT.
- `jiterpreter-stats-enabled`: Enables collecting and printing statistics like the number of traces compiled, the number of bytes generated, and which interpreter opcodes caused failures. Statistics printing may need to be manually triggered using `INTERNAL.jiterpreter_dump_stats()`.
- `jiterpreter-backward-branches-enabled`: Configures whether backward branches will be handled in traces. If disabled, backward branches will be handled by the interpreter.
- `jiterpreter-eliminate-null-checks`: Configures whether null check elimination is enabled. This optimization may cause correctness issues in rare cases if you hit a bug, especially if your code throws NullReferenceExceptions during normal execution.
- `jiterpreter-wasm-bytes-limit`: Limits the total number of wasm bytes the jiterpreter will generate. Once this limit is hit, it shuts off in order to avoid exhausting browser memory.

To set runtime options when using the .NET runtime directly, use the `withRuntimeOptions` configuration method:
```js
const runtime = await dotnet
    .withRuntimeOptions(["--jiterpreter-stats-enabled"])
    .create();
```
Or disable the jiterpreter's trace compiler with:

```js
const runtime = await dotnet
    .withRuntimeOptions(["--no-jiterpreter-traces-enabled"])
    .create();
```
When using Blazor (or Microsoft.NET.Sdk.WebAssembly), you can use the MSBuild property `BlazorWebAssemblyJiterpreter` as a convenient shorthand to configure whether the jiterpreter is enabled. You can also use `BlazorWebAssemblyRuntimeOptions` to set specific options directly. At present, the jiterpreter only functions in Blazor applications that have been published. When running with debugging enabled, it will be inactive.
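For example, in the application's project file (the option string shown is illustrative):

```xml
<!-- In the application's .csproj -->
<PropertyGroup>
  <!-- Disable the jiterpreter entirely -->
  <BlazorWebAssemblyJiterpreter>false</BlazorWebAssemblyJiterpreter>
  <!-- Or pass specific runtime options instead -->
  <BlazorWebAssemblyRuntimeOptions>--jiterpreter-stats-enabled</BlazorWebAssemblyRuntimeOptions>
</PropertyGroup>
```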
The trace compiler operates in multiple phases, starting first during code generation for "tiered" (optimized) interpreter methods and then finishing during actual execution of interpreted code.
While the interpreter is generating code, we select candidate locations that may be good places to create a trace, using multiple heuristics to eliminate locations that are less likely to be valuable. At each candidate location, we insert a "prepare point" opcode that will maintain statistics during execution and eventually trigger compilation of the trace. The heuristics are important because each prepare point opcode adds a small amount of overhead to the method and a large method may have many prepare points. Each prepare point has a unique index with an associated entry in the "trace info" table maintained by the jiterpreter.
During execution, each time a prepare point is hit, we increment its hit counter. If the hit counter reaches a set threshold we will invoke the trace compiler, and if trace compilation is successful we transform the prepare point into a "monitor point". If compilation fails or the trace is rejected for some reason, we patch the opcode to become a NOP, reducing its cost slightly.
Trace compilation occurs synchronously as soon as the threshold is hit for a given prepare point and produces a function pointer that can be used to enter the trace directly. Compilation will fail and "reject" a trace if it doesn't meet key criteria - for example, traces that are too short are rejected because the overhead of entering and exiting the trace makes them harmful to performance. Upon successful compilation, the MINT_TIER_PREPARE_JITERPRETER trace_index opcode is patched to become MINT_TIER_MONITOR_JITERPRETER trace_index.
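The prepare-point progression described above (count hits, compile synchronously at the threshold, then patch the opcode to a monitor point or a NOP) can be sketched as a small state machine. Class, field, and threshold names here are illustrative, not the runtime's actual identifiers:

```typescript
// Simplified model of the prepare-point lifecycle. The real implementation
// lives in the interpreter and jiterpreter.ts; names and the threshold
// value here are illustrative only.
type Opcode = "PREPARE" | "MONITOR" | "NOP";

const HIT_THRESHOLD = 1000; // illustrative; the real threshold is configurable

class TraceInfo {
    hitCount = 0;
    opcode: Opcode = "PREPARE";

    // Called each time the interpreter executes this prepare point.
    onHit(compile: () => boolean): Opcode {
        if (this.opcode !== "PREPARE")
            return this.opcode;
        if (++this.hitCount >= HIT_THRESHOLD) {
            // Compile synchronously; on success the opcode becomes a monitor
            // point, on failure/rejection it is patched to a NOP.
            this.opcode = compile() ? "MONITOR" : "NOP";
        }
        return this.opcode;
    }
}

// Example: a trace that compiles successfully once the threshold is reached.
const info = new TraceInfo();
for (let i = 0; i < HIT_THRESHOLD; i++)
    info.onHit(() => true);
console.log(info.opcode); // "MONITOR"
```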
The trace info table contains a hit counter and "penalty total" for each unique candidate location. During execution when the interpreter hits a monitor point it executes the trace in a special monitoring mode by passing a JiterpreterCallInfo structure into the trace. All traces have special monitoring support compiled in that will update this structure when exiting early, either due to an abnormal condition (like an exception being thrown) or due to a trace compilation failure (an unsupported opcode, etc.) These abnormal exits store an estimated number of opcodes executed along with a backward branch flag, and the combination of these two values are used to compute a "penalty value" for the current trace execution. We maintain a running hit counter for monitoring mode along with a running sum of the penalty values.
After a trace is executed a set number of times in monitoring mode, we compare the average penalty value against a threshold and decide whether to disable the trace (turning the entry point into a NOP) or keep it permanently. This allows us to cull traces that have undesirable behavior we couldn't detect at compile time (for example, conditionally performing a function call 99% of the time right at the beginning of the trace, which forces us to return control to the interpreter.) If the trace survives the monitoring phase, we patch the MINT_TIER_MONITOR_JITERPRETER trace_index opcode into a MINT_TIER_ENTER_JITERPRETER function_pointer opcode to allow more efficient execution of the trace.
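The monitoring decision amounts to comparing an average penalty against a cutoff after a fixed number of monitored executions. A sketch, with illustrative field names and thresholds (the real logic lives in the jiterpreter's C and TypeScript sources):

```typescript
// Simplified model of the monitoring phase. Names and thresholds are
// illustrative, not the runtime's actual identifiers.
const MONITORING_EXECUTIONS = 100; // monitored runs before deciding
const MAX_AVERAGE_PENALTY = 0.2;   // illustrative cutoff

class MonitoredTrace {
    executions = 0;
    penaltyTotal = 0;
    decision: "ENTER" | "NOP" | undefined;

    // Each monitored execution reports a penalty derived from the estimated
    // opcodes executed before an early exit plus the backward-branch flag.
    record(penalty: number): void {
        this.executions++;
        this.penaltyTotal += penalty;
        if (this.executions >= MONITORING_EXECUTIONS && this.decision === undefined) {
            const average = this.penaltyTotal / this.executions;
            // Keep the trace (patch to an enter opcode) only if the average
            // penalty stays under the threshold; otherwise turn it into a NOP.
            this.decision = average <= MAX_AVERAGE_PENALTY ? "ENTER" : "NOP";
        }
    }
}

const t = new MonitoredTrace();
for (let i = 0; i < MONITORING_EXECUTIONS; i++)
    t.record(0.05); // mostly clean executions, low penalty
console.log(t.decision); // "ENTER"
```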
For traces which survive the monitoring phase, interpreter execution will hit a MINT_TIER_ENTER_JITERPRETER opcode, read the function pointer out of it, and directly invoke the function pointer. The trace returns a displacement (in bytes) from the current location, which allows returning control to any location in the method.
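A sketch of how the interpreter consumes that returned displacement (all names here are illustrative):

```typescript
// The trace returns a displacement in bytes relative to the current
// instruction pointer, so any location in the method - before or after the
// trace - can become the resume point. Names are illustrative.
type Trace = (frame: number, pLocals: number, cinfo: number) => number;

function executeEnterOpcode(ip: number, trace: Trace,
                            frame: number, pLocals: number): number {
    const displacement = trace(frame, pLocals, /* cinfo */ 0);
    // A negative displacement resumes before the trace (e.g. a backward
    // branch bailout); zero resumes at the trace entry point itself.
    return ip + displacement;
}

// A fake trace that bails out with displacement -34, like the br.s bailout
// in the annotated trace later in this document.
const fakeTrace: Trace = () => -34;
console.log(executeEnterOpcode(100, fakeTrace, 0, 0)); // 66
```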
The trace compiler is designed to operate sequentially in two simple passes in order to reduce the amount of time and memory consumed by JIT compilation.
The first pass scans sequentially through the interpreter opcodes for a method, starting from the prepare point, transforming each encountered opcode into one of: a sequence of WebAssembly instructions implementing it, a bailout that returns control to the interpreter, or (for unsupported opcodes) an abort that ends the trace at that point.
During this first pass the compiler also records control flow information, keeping track of branches and branch targets in its "CFG" which will be used later to construct the WASM structures necessary for loops and other control flow. During compilation a running estimate is maintained of the trace's size, because web browsers impose an arbitrary 4KB limit on the total size of a synchronously compiled module, including the size of things like function names and type information. If we get too close to the 4KB limit, trace compilation will end at the current location. During this phase all of the generated WebAssembly code is appended into a scratch buffer, with buffer offsets recorded in the CFG.
The second pass generates the final WebAssembly module including metadata like types, imported and exported functions, etc. The actual executable WebAssembly code is stitched together from segments of the scratch buffer based on the "segments" recorded in the CFG. Segments come in three types:
For traces containing backward branches, each trace begins with a small "dispatch table" which performs a forward branch to a specific destination determined by a dispatch index. Upon trace entry the dispatch index points to the top of the trace, but when a backwards branch occurs we set a specific dispatch index and always jump to the dispatch table. This is necessary due to WebAssembly's heavily constrained flow control model that does not allow arbitrary jumps and encodes jumps based on nesting depths instead of as branches targeting specific code offsets.
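As an illustration, a trace with one backward-branch target might be shaped roughly like this hand-written sketch (not generated output; labels, indices, and nesting are illustrative):

```wat
;; Illustrative sketch of the dispatch-table shape. On entry $dispatch is 0,
;; so control falls through to the top of the trace. A backward branch sets
;; $dispatch and jumps back to the dispatch table, which forwards control to
;; the correct branch target via a forward branch.
(loop $dispatch_table
  (block $target1
    (block $top
      ;; forward to $top (index 0) or $target1 (index 1)
      (br_table $top $target1 (local.get $dispatch))
    ) ;; $top: start of the trace body
    ;; ... code before the backward-branch target ...
  ) ;; $target1: the backward-branch target
  ;; ... loop body; a backward branch is encoded as:
  ;; (local.set $dispatch (i32.const 1)) (br $dispatch_table)
)
```

This indirection is what lets a structured control-flow format express "jump backward to an arbitrary offset": every backward edge funnels through the single loop header.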
Opcodes fall into a few large categories:
When determining whether to keep or reject a trace, we rely on "opcode values": each opcode is assigned a value that approximates how much productive work it will perform inside the resulting trace. These values are defined in jiterpreter-opcode-values.h. Opcodes like branches are assigned a low value due to the overhead associated with running them in a trace (sometimes worse than the interpreter for reasons described below), while opcodes like SIMD instructions or field accesses with eliminated null checks are assigned a high value because they are significantly faster in traces than in the interpreter. The sum of a trace's opcode values is compared against a threshold at the end of trace compilation to decide whether to keep it.
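The keep-or-reject computation amounts to a weighted sum compared against a threshold. A sketch with made-up values and names (the real table is in jiterpreter-opcode-values.h):

```typescript
// Illustrative opcode-value table; the real values live in
// jiterpreter-opcode-values.h and the names here are invented.
const OPCODE_VALUES: Record<string, number> = {
    "branch": 1,  // low: branches carry trace overhead
    "ldind": 2,
    "simd": 8,    // high: much faster in traces than in the interpreter
};
const KEEP_THRESHOLD = 10; // illustrative

function shouldKeepTrace(opcodes: string[]): boolean {
    const total = opcodes.reduce(
        (sum, op) => sum + (OPCODE_VALUES[op] ?? 0), 0);
    return total >= KEEP_THRESHOLD;
}

console.log(shouldKeepTrace(["ldind", "branch"]));        // false: too little work
console.log(shouldKeepTrace(["simd", "simd", "branch"])); // true
```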
Transforming interpreter opcodes into WebAssembly opcodes comes with a few key performance considerations:
To examine the generated WebAssembly for traces inside a specific method, you can add a substring of the method name to instrumentedMethodNames in jiterpreter.ts. At runtime when traces are generated, the full name of the containing method will be checked against all the substrings in the list and any matches will be generated in 'instrumented mode', where all the processed interpreter opcodes are recorded and dumped to the console along with the raw generated wasm bytes (you can paste those into a hex editor and save the resulting .wasm file, then run it through the analysis tool of your choice.) An example hand-annotated trace is below:
| Interp Op | Generated Wasm | Notes |
|---|---|---|
| | (module (memory $memory0 (import "m" "h") 1) | The module for each trace has to import the WebAssembly heap to access memory. |
| | (table $table0 (import "f" "f") 1 funcref) | Each trace also needs to import the WebAssembly function pointer table in order to perform indirect calls (though these are rarely used). |
| | (export "SequenceEqual:ae" (func $func0)) | We export traces with a short name to make it easier to identify them in the debugger. |
| | (func $func0 (param $frame i32) (param $pLocals i32) (param $cinfo i32) (result i32) | ptrdiff_t trace (void *frame, void *pLocals, JiterpreterCallInfo *cinfo); |
| | (local $var3 i32) (local $var4 i32) (local $cknull_ptr i32) (local $var6 i32) (local $var7 i32) (local $var8 i32) (local $var9 i64) (local $var10 i64) (local $var11 f32) (local $var12 f64) | Traces have a fixed set of locals used for storing temporary state like the back-branch flag or cached lhs/rhs for arithmetic operations. A critical one here is cknull_ptr, used for null checks. Names here are added for clarity; there are no names in the .wasm module due to the 4KB size limit. |
| | block $label3 | A block starts here at the top of the trace in order to allow us to jump to a branch target later on. WebAssembly branches target a specific block (based on a numeric depth) and either jump to its top (for loops) or its bottom (for all other block types). |
| ldind_off.i4 0, 104 -> 128 | block $label0 | |
| | local.get $pLocals | |
| | i32.load | |
| | local.tee $cknull_ptr | |
| | br_if $label0 | |
| | i32.const 6 | |
| | return | |
| | end $label0 | Most memory operations like ldind_off begin with a null check that loads the pointer from a local, then performs a bailout (the i32.const; return pair here) if the check fails. The resulting known-not-null pointer is stored in cknull_ptr for later use. Note that in this case, the i32.load has no offset, because we're loading arg0. |
| ... | local.get $pLocals | |
| | local.get $cknull_ptr | |
| | local.get $pLocals | |
| | i32.load offset=104 | |
| | i32.add | |
| | i32.load align=1 | |
| | i32.store offset=128 | Now that we have a known-non-null base pointer to perform an indirect load, we compute the address dynamically and then load from it, storing the result in the dreg (#128). This is the equivalent of *(pLocals + 128) = *(*(pLocals + 0) + *(pLocals + 104)). |
| ldind_off.i4 8, 104 -> 136 | block $label1 | |
| | local.get $pLocals | |
| | i32.load offset=8 | |
| | local.tee $cknull_ptr | |
| | br_if $label1 | |
| | i32.const 14 | |
| | return | |
| | end $label1 | |
| | local.get $pLocals | |
| | local.get $cknull_ptr | |
| | local.get $pLocals | |
| | i32.load offset=104 | |
| | i32.add | |
| | i32.load align=1 | |
| | i32.store offset=136 | Another ldind_off; nothing new is happening here. Note that the ptrdiff_t returned in this bailout is different - it's the offset of the opcode that failed. |
| bne.un.i4.s 128, 136 | block $label2 | |
| | local.get $pLocals | |
| | i32.load offset=128 | |
| | local.get $pLocals | |
| | i32.load offset=136 | |
| | i32.ne | |
| | i32.eqz | |
| | br_if $label2 | |
| | br $label3 | |
| | end $label2 | The entire conditional branch is contained inside a block - we will branch to the end of the block if the conditional check fails, in order to skip the actual branch operation. The block begins by computing the branch condition (by loading two locals and performing an ne comparison), then if the condition value is 0 (eqz) we jump to the end of the block, which skips the actual branch. You can see the branch targets label3, which will skip the remainder of this basic block and go to our branch target. |
| add.i4.imm 104 -> 104 | local.get $pLocals | |
| | local.get $pLocals | |
| | i32.load offset=104 | |
| | i32.const 4 | |
| | i32.add | |
| | i32.store offset=104 | Note that we load pLocals here twice. This is because wasm store operations expect the destination pointer to precede the value, and wasm has no dup or swap instructions to allow us to rearrange the stack later. Since this is an imm arithmetic operation, we inline the immediate as an i32.const. |
| bgt.un.i4.s 112, 104 | block $label4 | |
| | local.get $pLocals | |
| | i32.load offset=112 | |
| | local.get $pLocals | |
| | i32.load offset=104 | |
| | i32.gt_u | |
| | i32.eqz | |
| | br_if $label4 | |
| | i32.const 0 | |
| | return | |
| | end $label4 | Here we have another conditional branch, but it targets a location outside of the trace. So instead of a branch forward or backward within the trace, we have a bailout with a displacement of 0. (If everything were working ideally, this would have generated a trace with a backward branch in it - something must have happened to prevent that.) The zero displacement means the interpreter will resume execution at the trace entry point, essentially restarting the trace for another loop iteration. |
| ldind_off.i4 0, 112 -> 128 | block $label5 | |
| | local.get $pLocals | |
| | i32.load | |
| | local.tee $cknull_ptr | |
| | br_if $label5 | |
| | i32.const 46 | |
| | return | |
| | end $label5 | |
| | local.get $pLocals | |
| | local.get $cknull_ptr | |
| | local.get $pLocals | |
| | i32.load offset=112 | |
| | i32.add | |
| | i32.load align=1 | |
| | i32.store offset=128 | We've now seen ldind_off twice at this point. Note that while we're loading arg0 again, the null check is not eliminated - when we cross a branch the jiterpreter discards any optimization data to avoid introducing errors. |
| ldind_off.i4 8, 112 -> 136 | block $label6 | |
| | local.get $pLocals | |
| | i32.load offset=8 | |
| | local.tee $cknull_ptr | |
| | br_if $label6 | |
| | i32.const 54 | |
| | return | |
| | end $label6 | |
| | local.get $pLocals | |
| | local.get $cknull_ptr | |
| | local.get $pLocals | |
| | i32.load offset=112 | |
| | i32.add | |
| | i32.load align=1 | |
| | i32.store offset=136 | Similarly, a human examining this trace might say 'we don't need to check local 8 for null, we already checked it', but since we crossed a branch afterward that information was discarded. |
| ceq.i4 128, 136 -> 32 | local.get $pLocals | |
| | local.get $pLocals | |
| | i32.load offset=128 | |
| | local.get $pLocals | |
| | i32.load offset=136 | |
| | i32.eq | |
| | i32.store offset=32 | As we approach the end of the trace, we can now recognize the rough shape of what the trace was doing - the first two ldind_offs leading into a bne_un_i4 were the equivalent of if (pLhs[i] != pRhs[i]) break;. After that, the add_i4_imm increased the loop offset by 4, and the bgt_un_i4 returned to the top of the loop body as long as the comparison size exceeded the loop offset. But what's going on here? Examining the source for SequenceEqual will reveal the truth. |
| ... | ... | SequenceEqual's scalar search loop ends with /* Do final compare as sizeof(nuint) from end rather than start */ result = (LoadNUInt(ref first, lengthToExamine) == LoadNUInt(ref second, lengthToExamine));. We can now see that the previous two ldind_off operations are the two LoadNUint calls, and the ceq_i4 is performing the comparison and storing its result into result. |
| br.s | i32.const -34 | |
| | return | This unconditional branch travels to a point before the trace, so we return a negative displacement and bail out. |
| ret.i4.imm | end $label3 | |
| | i32.const 74 | |
| | return | We've reached the end of the trace, and we can see a few things: the label3 block ends, which means this is the branch target for the earlier conditional branch - an early-out exit of the loop when the comparison inside the loop fails. Return opcodes are handled by the interpreter, so this ret becomes a bailout with a fixed displacement. |
| | i32.const 78 | |
| | return | All traces end with a final bailout indicating that execution ran off the end of the trace. In this case we can tell by looking that the final bailout is unreachable, but the jiterpreter currently doesn't perform that sort of analysis (it would only save a few bytes anyway). |
Interpreter entry points that meet some basic criteria (number of arguments, etc) are instrumented to notify the jiterpreter each time they are hit. After a certain number of hits, they are added to the "jit queue" and will be compiled asynchronously in small batches. If a specific entry point is hit an even larger number of times, the queue will immediately be flushed to compile it.
The compiled thunks are simple functions that imitate the behavior of the generic interp_entry wrapper for a given target method, eliminating some of the indirect function calls and branching on types that would normally happen. Arguments are mapped from the native WASM calling convention to the interpreter stack so that execution can begin. The behavior of the thunks is meant to match the semantics of interp_entry 1:1.
There is also a small optimization in this path that attempts to detect whether the interp_entry thunk is being used to call a delegate, and in cases where the same delegate is invoked repeatedly we are able to cache the target method instead of performing an expensive delegate invoke lookup every call.
The do_jit_call interpreter operation is instrumented to record hit counts for each call site. After a certain number of hits, a call site is added to the "jit queue" and the queue will be flushed asynchronously in small batches. If a specific call site is hit a large number of times, the queue will immediately be flushed.
The compiled jit call thunks are simple functions that load function arguments from the interpreter stack and pass them directly to native code with the WASM calling convention. The thunks also perform exception handling, using native WebAssembly Exceptions if possible for improved performance. The behavior of the thunks is meant to match the semantics of mini_get_gsharedvt_out_sig_wrapper 1:1.
Where possible jit call thunks will be "direct", bypassing the by-reference wrappers typically used for AOT->native transitions and calling the target function directly with arguments passed by-value. This is possible when all of the arguments are simple types and the wrapper is not needed for some special reason.
At startup the jiterpreter will attempt to compile two small WebAssembly modules in order to detect support for WebAssembly Exception Handling and WebAssembly SIMD. While both of these features are widely supported, they were not available in the WebAssembly MVP so we need to detect their availability before trying to use them. If they are unavailable, code generation will adapt to use fallbacks (for exception handling, a JavaScript helper, and for SIMD, scalar implementations in C or C#.)
The jiterpreter's compiler detects various errors during compilation and reports them to the browser console. If more than a small number of compile errors occur during execution, the jiterpreter will automatically be disabled to avoid having serious impacts on application performance or stability.