docs/design/coreclr/jit/ryujit-tutorial.md
For context, the .NET runtime has been around since about the turn of the millennium. It is a virtual machine that supports the execution of a number of languages, primarily C#, Visual Basic, and F#. It has been standardized through Ecma and then ISO. There have been a number of implementations of the CLI, the dominant one being installed with the Windows operating system. That implementation has served as the basis for a standalone implementation that is now available in open source. RyuJIT is the re-architected JIT for .NET.
We wanted something with "JIT" in the name, and the idea of a dragon came to mind because of the Dragon book that we all know and love. So – we just adapted the name of the Japanese sea dragon, Ryujin.
The original JIT was developed for a precursor to .NET, around the turn of the millennium. It was developed quickly, and was designed for x86 only. It has evolved over time, and is still evolving. The front-end has been extended to support SSA and value numbering, while the original register allocation and code generation phases have been fully replaced.
First and foremost, the evolution of RyuJIT is constrained by the requirement to maintain compatibility with previous JITs. This is especially important for a just-in-time compiler that has such a huge installed base. We need to ensure that the behavior of existing programs will be preserved across updates. Sometimes this even means adding "quirks" to ensure that even invalid programs continue to behave the same way. Beyond that, the objective is to get the best possible performance with a minimum of compile time. Both the design of the RyuJIT IR and the constraints on the scope of optimizations are aimed at keeping as close to a linear compile time as possible. Finally, while the original JIT was quite x86-oriented, we now have a broader set of platforms to support, as well as an ever-widening variety of application scenarios.
In this document and elsewhere, you will see the .NET runtime often referred to as the VM, for virtual machine, the EE, for execution engine, or the CLR, for common language runtime. They are largely used interchangeably, though some might argue the terms have subtle differences.
The JIT and the VM each implement an interface that abstracts the dependencies they share. The VM invokes methods on the ICorJitCompiler interface to compile methods, and the JIT calls back on the ICorJitInfo interface to get information about types and methods.
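To make the call pattern concrete, here is a deliberately toy sketch of the two-interface arrangement. The names and signatures below are invented for illustration (the real ICorJitCompiler and ICorJitInfo declarations live in the runtime's corjit.h and corinfo.h headers and are far larger); only the direction of the calls matches the description above.

```cpp
#include <string>

// Toy stand-in for ICorJitInfo: the JIT calls back on this to query the VM.
struct JitInfo {
    virtual std::string getMethodName(int methodHandle) = 0;
    virtual ~JitInfo() {}
};

// Toy stand-in for ICorJitCompiler: the VM calls in on this to compile.
struct JitCompiler {
    virtual bool compileMethod(JitInfo& info, int methodHandle) = 0;
    virtual ~JitCompiler() {}
};

// The VM implements the info side...
struct ToyVm : JitInfo {
    std::string getMethodName(int) override { return "Main"; }
};

// ...and the JIT implements the compiler side, calling back for metadata.
struct ToyJit : JitCompiler {
    bool compileMethod(JitInfo& info, int methodHandle) override {
        return !info.getMethodName(methodHandle).empty();
    }
};

// The VM drives a compilation through the compiler interface.
bool DemoCompile() {
    ToyVm vm;
    ToyJit jit;
    return jit.compileMethod(vm, 1);
}
```

The point is only the shape of the dependency: neither side links directly against the other's internals; each sees only the abstract interface.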
This is the 10,000 foot view of RyuJIT. It takes in MSIL (aka CIL) in the form of bytecodes, and the Importer phase transforms these to the intermediate representation used in the JIT. The IR operations are called "GenTrees", as in "trees for code generation". This format is preserved across the bulk of the JIT, with some changes in form and invariants along the way. Eventually, the code generator produces a low-level intermediate representation called InstrDescs, which simply captures the instruction encodings, while the final mappings are done to produce the actual native code and associated tables.
Note: Various details in what follows are outdated and should be mostly taken to be of historical interest. For more up-to-date information about the phases and their details, please see the overview document.
This is the more detailed view, and shows all of the significant phases of RyuJIT. The ones in green are responsible for creating and normalizing the IR and associated data structures. The phases in orange are the optimization phases, and the phases in purple are the lower-level, back-end phases.
The initial phases of RyuJIT set up the IR in preparation for the optimization phases. This includes importing the IL, that is, producing GenTrees IR from the incoming MSIL, inlining methods when considered profitable, normalizing the representation, analyzing the flowgraph, specifying the execution order and identifying the important local variables for optimization.
The optimization phases of RyuJIT are based on liveness analysis, SSA and value numbering. These are used to perform loop invariant code hoisting, copy propagation, common subexpression elimination, assertion propagation, and range check elimination. SSA is used to uniquely identify the values of lclVars, while value numbering is used to identify trees that compute the same value for a given execution.
While the phases up to this point are largely target-independent, the back-end is responsible for performing the transformations necessary to produce native code. The Rationalization pass is intended to be a transitional phase that takes the existing IR, and turns it into a form that is easier to analyze and reason about. Over time, we expect that this IR form will be used across the JIT. The next phase, Lowering, is responsible for fully exposing the control flow and register requirements. This allows the register allocation to be performed on a rather more abstract IR than is usually used for that purpose. It then annotates the IR with the register numbers, and the code generator walks the IR in linear execution order to produce the instructions. These are kept in a low-level format called InstrDescs, which simply allows the emitter to fix up relative addresses and produce native code and tables.
The RyuJIT IR can be described at a high level as follows:
- The Compiler object is the primary data structure of the JIT. Each method is represented as a doubly-linked list of BasicBlock objects. The Compiler object points to the head of this list with the fgFirstBB link, as well as having additional pointers to the end of the list, and other distinguished locations.
- BasicBlock nodes contain a doubly-linked list of statements (Statement), with no internal control flow. The BasicBlock also contains the dataflow information, when available.
- GenTree nodes represent the operations and statements of the method being compiled.
- LclVarDsc represents a local variable, argument or JIT-created temp. It has a gtLclNum, which is the identifier usually associated with the variable in the JIT and its dumps. The LclVarDsc contains the type, use count, weighted use count, frame or register assignment, etc. These are often referred to simply as "lclVars". They can be tracked (lvTracked), in which case they participate in dataflow analysis, and have a distinct index (lvVarIndex) to allow the use of dense bit vectors. Only non-address-taken lclVars participate in liveness analysis, though aliased variables can participate in value numbering.
- Each Statement node points to its root expression tree.
The GenTree is the primary data structure of the JIT. It is used to represent the expressions for each statement. Some distinguishing features of this IR are that, while an operation has links to its operands, they do not have a link to their parent expression. Furthermore, during the initial phases of the JIT, the nodes are only ordered implicitly by the canonical traversal order of trees. The initial construction of the IR ensures that any ordering dependencies are obeyed by the canonical traversal order, and any subsequent optimizations must ensure that any visible ordering constraints are obeyed.
```
▌ Statement (top level) (IL 0x01D
│ ┌──▌ const int 1
│ ┌──▌ & int
│ │ └──▌ lclVar int V08
│ ┌──▌ + int
│ │ └──▌ lclVar int V06
└──▌ = int
└──▌ lclVar int V06
```
From the example we'll look at later: count = count + (bits & 1)
This is the dump of an expression tree for a single statement. It takes a little while to get used to reading the expression tree dumps, which are printed with the children indented from the parent, and, for binary operators, with the first operand below the parent and the second operand above, at least in the front-end. This statement is extracted from the PopCount example we're going to walk through later. V06 is the local variable with lclNum 6, which is the "count" variable in the source. V08 is the "bits" variable. One thing to notice about this is that the context of the lclVar nodes is not clear without examining their parent. That is, two of the lclVar nodes here are uses, while the bottom one is a definition (it is the left-hand-side of the assignment above it).
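The implicit ordering can be illustrated with a small sketch. This is not the real GenTree layout, just a toy binary node: the canonical traversal visits a node's operands before the node itself, with the first operand (op1, printed below the parent in the dump) before the second (op2, printed above it).

```cpp
#include <string>
#include <vector>

// Toy tree node: links go from parent to operands only, as in the
// front-end IR; there is no parent or sibling link.
struct Node {
    std::string label;
    Node* op1;
    Node* op2;
};

// Canonical traversal order: operands evaluate before their parent,
// and op1 evaluates before op2.
static void ExecOrder(const Node* n, std::vector<std::string>& out) {
    if (n == nullptr) return;
    ExecOrder(n->op1, out);
    ExecOrder(n->op2, out);
    out.push_back(n->label);
}

// Builds the statement from the dump above, count = count + (bits & 1),
// and returns its implicit execution order as a space-separated string.
std::string SampleExecOrder() {
    Node one  {"const 1",    nullptr, nullptr};
    Node bits {"lclVar V08", nullptr, nullptr};
    Node andN {"&",          &bits,   &one};
    Node cnt  {"lclVar V06", nullptr, nullptr};
    Node add  {"+",          &cnt,    &andN};
    Node dst  {"lclVar V06", nullptr, nullptr};  // LHS of the assignment
    Node asg  {"=",          &dst,    &add};
    std::vector<std::string> order;
    ExecOrder(&asg, order);
    std::string s;
    for (size_t i = 0; i < order.size(); i++) {
        if (i != 0) s += " ";
        s += order[i];
    }
    return s;
}
```

Walking the sample yields the nodes in the order the dump implies: the two lclVar uses and the constant first, then the &, the +, and finally the assignment.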
The GenTrees IR is being evolved to address the context issue mentioned in the last slide. We would like to ensure that all data flows are from child to parent. Examples where this is currently violated are in the assignment, or GT_ASG node, where the data flows from the right hand side of the assignment "over" the parent assignment, to the value being defined on the left hand side. Similarly, the child of a GT_ADDR node may appear to be a use or a definition of the value, when instead its address is being taken. Finally, the comma node is inserted in the tree purely as an ordering constraint, to enable a non-flow computation, or a store to a temporary variable, to be inserted in the midst of the evaluation of an expression.
The Rationalizer phase, which transforms these constructs to a more rational form, is currently run just prior to the back-end, but over time we expect to move it much earlier.
In the back-end, the execution order is made explicit via gtNext and gtPrev links:
- The gtPrev and gtNext links are the definitive specification of execution order (it is no longer derivable from a tree walk).
- Each BasicBlock contains a single linked list of nodes.
- Statement nodes are eliminated.
- GT_IL_OFFSET nodes convey source (IL) mapping info.

There are a number of modalities in the RyuJIT IR, but one of the more significant is that when the IR is first imported, the nodes are linked only from parent to child. Later in the JIT, the execution order is made explicit by the addition of gtPrev and gtNext links on each node. This eases analysis of the IR, but also makes updates a bit more cumbersome.
```
┌──▌ lclVar V03
──▌ []
└──▌ lclVar V00

┌──▌ indir
│ │ ┌──▌ const 16
│ └──▌ +
│ │ ┌──▌ const 2
│ │ ┌──▌ <<
│ │ │ └──▌ lclVar V03
│ └──▌ +
│ └──▌ lclVar V00
──▌ comma
└──▌ arrBndsChk
  ├──▌ arrLen
  │ └──▌ lclVar V00
  └──▌ lclVar V03

┌──▌ lclVar V00
┌──▌ lea(b+8)
┌──▌ indir
├──▌ lclVar V03
──▌ arrBndsChk

┌──▌ lclVar V00
├──▌ lclVar V03
┌──▌ lea(b+(i*4)+16)
──▌ indir
```
This example shows the evolution of the IR from the importer, through the front-end, and finally the back-end. Initially, an array reference is imported as a GT_INDEX node, represented here by the brackets, with the array object (V00) and the index (V03) as its children. In order to optimize this, we want to be able to expose the full addressing expression, as well as the bounds check. Note that the bounds check is inserted below a comma node. This allows this transformation to be made "in place". Also, note that the execution order is generally "left to right", where the "left" node of a binary operator is actually shown below it in tree order. In the final form, you can see that the addressing expression has been transformed to a "load effective address", and the array bounds check has been split. In the backend, although the tree links are still shown, we dump the graphs in execution order, as that is the primary view of the IR in the backend.
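As a sketch of what the lowered form computes: assuming the layout implied by the dump above (a 16-byte offset to the first element of an int[] on x64, and 4-byte elements), the expanded arr[i] amounts to a range check followed by a simple address computation. The helper below is illustrative only; the offsets are read off the dump, not taken from any runtime header.

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative expansion of arr[i] for an int[], matching the shape of the
// lowered dump: arrBndsChk first, then an address of the form b + (i*4) + 16.
uint64_t ElementAddress(uint64_t arrayBase, uint64_t length, uint64_t index) {
    if (index >= length) {
        // arrBndsChk: fail before any memory access is performed
        throw std::out_of_range("IndexOutOfRangeException");
    }
    return arrayBase + (index << 2) + 16;   // lea(b + (i*4) + 16)
}
```

For example, with a base of 0x1000 and index 3, the computed address is 0x1000 + 12 + 16; the shift by 2 corresponds to the "const 2" operand of the << node in the front-end tree.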
The JIT only tracks the value of two kinds of entities:
- Local variables – including user-defined arguments and locals, as well as JIT-generated temps. These can have multiple definitions and uses.
- Tree nodes – these are always single-def, single-use.
The JIT also does some tracking of heap values.
The JIT limits the number of lclVars for which it computes liveness. The limit is currently 512 for most targets, which is sufficient to include all of the lclVars for most methods. These are the only lclVars that will participate in optimization or register allocation. A lclVar whose address has been taken will never be tracked. They are also the lclVars for which GC ranges will be reported. Although the CLR has a precise GC model, the JIT can either report ranges over which a location (i.e. a register or stack location) contains a GC variable, or report a location as a GC value for the full method scope, and then ensure that its value either represents a valid heap location or null.
The liveness analysis is a simple, traditional liveness analysis. Once complete, the bbLiveIn and bbLiveOut sets are available on each BasicBlock, and the data flow bits are set on each GenTree node that represents a lclVar reference, to indicate whether it is a use, a def, or an update (a partial definition, e.g. GT_LCL_FLD nodes that do not define the entire variable), and whether it is a last use. The size of the bit vector is the number of tracked variables, and can be changed with different JIT build options, though not currently dynamically.
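The analysis itself is the textbook backward dataflow: a block's liveOut is the union of its successors' liveIn sets, and its liveIn is use ∪ (liveOut − def), iterated to a fixpoint. A minimal sketch, using a single 64-bit word as the bit vector (the real JIT sizes its vectors to the number of tracked lclVars, up to the tracking limit):

```cpp
#include <cstdint>
#include <vector>

// One bit per tracked variable, packed into a single word for this sketch.
struct Block {
    uint64_t use = 0;                 // vars read before any write in the block
    uint64_t def = 0;                 // vars written in the block
    uint64_t liveIn = 0, liveOut = 0;
    std::vector<int> succs;           // indices of successor blocks
};

// Classic backward liveness: iterate until no set changes.
void ComputeLiveness(std::vector<Block>& blocks) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = (int)blocks.size() - 1; i >= 0; i--) {
            Block& b = blocks[i];
            uint64_t out = 0;
            for (int s : b.succs) out |= blocks[s].liveIn;
            uint64_t in = b.use | (out & ~b.def);
            if (out != b.liveOut || in != b.liveIn) {
                b.liveOut = out;
                b.liveIn = in;
                changed = true;
            }
        }
    }
}

// Two-block demo: block 0 defines var 0 and falls into block 1, which uses it.
// The variable should be live out of block 0 but not live into it.
bool DemoLiveness() {
    std::vector<Block> blocks(2);
    blocks[0].def = 1;                // def of var 0 (bit 0)
    blocks[0].succs = {1};
    blocks[1].use = 1;                // use of var 0
    ComputeLiveness(blocks);
    return blocks[1].liveIn == 1 && blocks[0].liveOut == 1 && blocks[0].liveIn == 0;
}
```

The reverse iteration order is just a heuristic to converge faster; the fixpoint is the same in any order.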
The SSA implementation constructs pruned SSA, using the liveness information. It doesn't change the lclNum of the lclVar nodes, but simply annotates the node with the SSA name. That is, an SSA name can be considered to be the pair formed by the lclNum and the SSA number. The JIT has no "out of SSA" translation, so all optimizations must preserve conventional SSA form – that is they may not move references to an SSA name across a PHI node that references it.
The RyuJIT value numbering implementation is somewhat unusual in that it takes advantage of the type safety of .NET by not invalidating the value number of a heap-based field reference on an arbitrary heap write, unless the write is to the same field. While this doesn't give the full fidelity of alias-based analysis, it is quite effective. This is done by tagging indirections with a chain of descriptors for the fields referenced at each level of indirection below the current node. In this way, each node has the full chain of fields that it references. Value numbering performs symbolic evaluation of many operations, and can therefore tag equivalent expressions with the same value number. This is used by all of the front-end optimizations.
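The core mechanism can be sketched as hash-consing: a tuple of (operator, operand value numbers) maps to a value number, so structurally equivalent computations receive the same number, and commutative operators canonicalize their operand order first. This is only a toy illustration of the idea, not the JIT's actual ValueNumStore.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <tuple>

// Toy value-number table: identical (op, vn1, vn2) keys get identical VNs.
struct VNTable {
    using Key = std::tuple<char, uint32_t, uint32_t>;
    std::map<Key, uint32_t> map;
    uint32_t next = 1;

    uint32_t ForFunc(char op, uint32_t vn1, uint32_t vn2) {
        // Commutative operators: canonicalize operand order so that
        // a + b and b + a hash to the same key.
        if ((op == '+' || op == '*') && vn2 < vn1) std::swap(vn1, vn2);
        auto it = map.find(Key{op, vn1, vn2});
        if (it != map.end()) return it->second;
        map[Key{op, vn1, vn2}] = next;
        return next++;
    }
};

// Demo: a + b and b + a share a VN; a - b gets a distinct one.
bool DemoVN() {
    VNTable t;
    uint32_t a = 100, b = 200;      // pretend VNs of two operands
    uint32_t sum1 = t.ForFunc('+', a, b);
    uint32_t sum2 = t.ForFunc('+', b, a);
    uint32_t diff = t.ForFunc('-', a, b);
    return sum1 == sum2 && sum1 != diff;
}
```

Symbolic evaluation of constants (e.g. folding '+' over two constant VNs) layers naturally on top of this table; the sketch omits it.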
The inliner re-invokes the importer for each method that is considered a suitable candidate. Along the way, it may determine that the method cannot, or should not, be inlined, at which point it abandons the constructed IR, and leaves the callsite as-is. Otherwise, it inserts the newly created IR at the callsite, adds the local variables of the called method to the caller, and fixes up the arguments and returns. This phase has recently been significantly refactored, and enhancements are in progress. There is a design document online that describes the overall plan.
The execution order within each block is determined (by fgSetBlockOrder()) and made explicit via gtNext and gtPrev links. This is the same diagram as before, but with additional links to indicate execution order. The statements have a link to the first node in the list, and each node has a link to its previous and next node in the linear order.
The nodes of each BasicBlock are in a single linked-list:

```
┌──▌ lclVar V03
──▌ =
└──▌ lclVar V00

┌──▌ lclVar V03
──▌ st.lclVar V00

┌──▌ const 0
├──▌ lclVarAddr V00
├──▌ const 16
──▌ initBlk

┌──▌ const 0
│ ┌──▌ lclVar V00
├──▌ addr
├──▌ const 16
──▌ initBlk
```
```
▌ Statement (IL 0x093...0x09D)
│ ┌──▌ lclVar long V09
│ ┌──▌ indir long
│ │ ┌──▌ lclFld float V10 [+0]
│ │ ┌──▌ addr byref
│ │ ├──▌ lclVar long V101
│ │ ┌──▌ = long
│ │ ├──▌ lclVar long V101
│ ├──▌ comma long
└──▌ = long

▌ Statement (IL 0x093...0x09D)
│ ┌──▌ lclVar long V09
│ │ { ▌ stmtExpr (embedded)
│ │ { │ ┌──▌ &lclFld V10 [+0]
│ │ { └──▌ st.lclVar long V101
│ ├──▌ lclVar long V101
└──▌ storeIndir long
```
This needs to be updated, as we no longer have embedded statements.
```csharp
public static int PopCount(ulong bitVectorArg)
{
    int count = 0;
    const int bitWidth = sizeof(ulong) * 8;
    ulong bitVector = bitVectorArg;
    for (int i = 0; i < bitWidth; i++)
    {
        count += ((int)bitVector & 1);
        bitVector >>= 1;
    }
    return count;
}
```
The sample I'm going to walk through implements support for pop count (counting the number of '1' bits in a 64-bit value).
We're going to start by assuming that we have a method with a known signature that implements PopCount. Here's the implementation we're going to use. It simply takes the input value, and repeatedly ANDs it with one and shifts it right. We're first going to simply recognize the name and signature, and replace the method call with a simple PopCnt IR node. Then, we're going to add code to pattern match the loop. Note that this is not necessarily the approach one would take, because the loop is quite an inefficient implementation - but the alternative (a sequence of ands, ors and multiplies) creates a complex tree that would be problematic to recognize for our simple example.
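For reference, the branch-free alternative alluded to above is the classic SWAR popcount, built from masks, shifts, adds and a multiply. It computes the same result as the loop, but as a single large expression tree, which is why the loop form is the easier pattern-matching target:

```cpp
#include <cstdint>

// Classic SWAR popcount for a 64-bit value: sum bits in 2-bit groups,
// then 4-bit groups, then bytes, then add all byte counts via a multiply.
int PopCountSwar(uint64_t v) {
    v = v - ((v >> 1) & 0x5555555555555555ULL);                        // 2-bit sums
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL); // 4-bit sums
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                        // per-byte sums
    return (int)((v * 0x0101010101010101ULL) >> 56);                   // total in top byte
}
```

For the value 0x1122334455667788 this returns 26, matching what the C# loop above computes.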
```
set DOTNET_JitDump=Main
set DOTNET_JitDumpAscii=0
set DOTNET_JitDumpFg=Main
set DOTNET_JitDumpFgDot=1
set DOTNET_JitDumpFgFile=Main
set DOTNET_JitDumpFgPhase=OPT-CHK
{BinaryDir}\CoreRun.exe PopCount.exe 1122334455667788 > jitdump.out.0
```
The first thing we're going to do is to take a look at the IR for the sample. We have a simple Main method that reads in an unsigned long value, and calls the PopCount method. We set a bunch of environment variables to get some useful dumps. The first says that we want the dump for the method named "Main". We are also going to set the Ascii option to zero, to get a slightly better tree dump. We also want to dump the flowgraph in dot format, to a file named Main.dot, and we want to do it after the range check optimization phase, since that's where we're eventually going to put our pattern matching code. The graph to the right is the result of running the dot file through graphviz.
Getting Ready: I added/changed some foundational stuff (e.g. GenTree::IsIntegralConst(1)).
- Recognize "Intrinsic" (SampleStep1 shelveset)
- Add Pattern Recognition (SampleStep2 shelveset)
Here is an example dump in tree order (shown with DOTNET_JitDumpAscii=0)
```
STMT00000 (IL ???... ???)
[000067] -AC-G------- ▌ call help void HELPER.CORINFO_HELP_ARRADDR_ST
[000047] ------------ arg0 ├──▌ lclVar ref V03 loc2
[000048] ------------ arg1 ├──▌ const int 0
[000063] -A---------- arg2 └──▌ box ref
[000061] ------------ │ ┌──▌ lclVar ref V04 tmp0
[000062] -A---------- └──▌ comma ref
[000049] ------------ │ ┌──▌ lclVar long V01 loc0
[000060] -A---------- └──▌ = long
[000059] -------N---- └──▌ indir long
[000057] ------------ │ ┌──▌ const long 8
[000058] ------------ └──▌ + byref
[000056] ------------ └──▌ lclVar ref V04 tmp0
```
And here is the same statement later, in the back-end, dumped in linear order:

```
***** BB06, stmt 21 (top level)
( 49, 31) [000068] ------------ ▌ stmtExpr void (top level) (IL ???... ???)
N199 ( 1, 1) [000047] ------------ │ ┌──▌ lclVar ref V03 loc2 u:3 rdi REG rdi RV $540
N201 ( 5, 4) [000281] DA--------L- arg0 SETUP │ ┌──▌ st.lclVar ref V11 tmp7 d:3 rcx (last use) REG rcx RV
N203 ( 0, 0) [000288] ----------L- arg1 SETUP │ ├──▌ argPlace int REG NA $40
( 8, 7) [000396] ------------ │ │ { ▌ stmtExpr void (embedded) (IL ???... ???)
N205 ( 1, 1) [000056] ------------ │ │ { │ ┌──▌ lclVar ref V04 tmp0 u:3 rax REG rax RV $1cf
N207 ( 2, 2) [000421] ------------ │ │ { │ ┌──▌ lea(b+8) byref REG NA
N209 ( 3, 2) [000049] ----G------- │ │ { │ ├──▌ lclVar long (AX) V01 loc0 REG rcx $348
N211 ( 8, 7) [000397] ----G------- │ │ { └──▌ storeIndir long REG NA
N213 ( 1, 1) [000061] ------------ │ │ ┌──▌ lclVar ref V04 tmp0 u:3 rax (last use) REG rax RV $1cf
N215 ( 19, 15) [000285] DA--GO----L- arg2 SETUP │ ├──▌ st.lclVar ref V12 tmp8 d:3 r8 REG r8 RV
N223 ( 1, 1) [000282] ------------ │ │ ┌──▌ lclVar ref V03 loc2 u:3 rdi REG rdi RV $540
N225 ( 1, 1) [000418] ------------ arg0 in rcx │ ├──▌ putarg_reg ref REG rcx
N227 ( 3, 2) [000286] ------------ │ │ ┌──▌ lclVar ref V12 tmp8 u:3 r8 (last use) REG r8 RV $2d3
N229 ( 3, 2) [000419] ------------ arg2 in r8 │ ├──▌ putarg_reg ref REG r8
N231 ( 1, 1) [000048] ------------ │ │ ┌──▌ const int 0 REG rdx $40
N233 ( 1, 1) [000420] ------------ arg1 in rdx │ ├──▌ putarg_reg int REG rdx
N241 ( 49, 31) [000067] -ACXGO------ └──▌ call help void HELPER.CORINFO_HELP_ARRADDR_ST $1d1
```
This needs to be updated, as we no longer have statements in the back-end.