docs/design/coreclr/jit/lsra-detail.md
This document provides additional detail on the linear scan register allocator (LSRA) in RyuJIT. It is expected that the reader has already read the RyuJIT Overview document.
Register allocation is performed using a linear scan register allocation
scheme, implemented by the LinearScan class.
Physical registers are represented by the RegRecord class.
Values (lclVar references and non-lclVar GenTree nodes that
produce a value) are represented by the Interval class.
Interval (rather poorly named, since it really is
multiple "intervals") consists of one or more live ranges, each
of which has a list of RefPositions (termed "use positions"
in much of the literature) that identify nodes at which the
value is referenced, as well as the register requirements for
that reference.References to both physical (RegRecord) and virtual
(Interval) registers are represented by the
RefPosition class. It contains information about the location,
type and other requirements of the reference.
There are four main phases to LSRA:
Preparation
The order of allocation of the BasicBlocks is determined:
This attempts to ensure that at least one predecessor of a block is allocated before it.
When not optimizing, the layout order of the blocks is used.
Note that the order doesn't affect correctness, as the
location of lclVars across block boundaries is fixed up
as necessary by the resolution phase. When not optimizing,
lclVars are not enregistered, so there is no benefit to
using a different order.
An Interval is built for each register-candidate lclVar.
A RegRecord is built for each physical register.
Constructing RefPositions
The RefPositions are built in a linear traversal of all the
GenTree nodes
Iterating over the BasicBlocks in the order determined
above, a RefTypeBB RefPosition is created at each
new BasicBlock. This is the signal to the register
allocator to re-establish variable locations at the
boundary.
Iterating over the nodes within the block in execution order:
Two locations (LsraLocation, which is an unsigned int) are assigned to each node.
The first is the virtual location at which its operands are used.
The second is the virtual location at which its target register(s) are defined.
This allows an instruction to use the same register
as both a source and target (where the source is not
marked delayRegFree.
An exception is multi-reg local stores of multi-reg sources. For these, the code generator will read each source register, and then move it, if needed, to the destination register. These nodes have 2*N locations where N is the number of registers, so that the liveness can be reflected accordingly.
For each node, RefPositions are built to reflect the uses,
definitions and kills of any registers involved in the
evaluation of the node.
The allocation phase iterates over the RefPosition list:
At the beginning of each new block the register assignment of
each lclVar Interval is updated, either based on incoming
argument locations (for the first block in the method),
or based on the location at the end of the selected predecessor block.
Iteration proceeds according to ascending order of RefPositions.
This means that the register assignment is performed in linear
execution order within a block.
The status of an Interval is updated as a RefPosition is
encountered for it.
Splitting or spilling an Interval doesn't involve creating a new
one. Instead, the RefPosition simply gets a new assignment, and
is either marked for reload/copy or its location is simply
updated in the incoming map. This differs from other linear-scan
allocators, where separate intervals are constructed for this case.
The resolution phase has two parts:
The RefPositions are walked again to write back the register
assignments to the GenTree nodes.
BasicBlock boundaries.Then, if we are optimizing (i.e. enregistering lclVars) the block boundaries are handled:
For fork edges (the source block has multiple targets, but each target has only that one source), any required resolution is placed at the individual target(s).
For join edges (a single target block has multiple sources, but each source has only that one target), any required resolution is placed at the individual source(s).
Critical edges require more complicated handling, and may require splitting of the edge for placement of resolution.
The RyuJIT backend components (register allocation and code generation)
are somewhat unusual (compared to other compilers) in that the IR nodes
do not map directly to target
instructions. Instead, the Lowering phase of the JIT first performs
IL transformations to ensure that the operations match the semantic
level of the target such that all registers are explicitly represented
in the IL:
It is the job of the Lowering phase to transform the IR such that:
The nodes are in LIR form (i.e. all expression trees have been
linearized, and the execution order of the nodes within a BasicBlock
is specified by the gtNext and gtPrev links).
Any node that will be evaluated as part of its parent (consumer)
node has been identified. These are known as "contained" nodes, and
are identified by setting the GTF_CONTAINED flag in the
gtFlags field of the node.
Similarly, nodes that represent values that could be referenced from
memory by their consumer (i.e. via an addressing mode) are marked as
"reg-optional" (LIR::Flags::RegOptional)
All unused values (nodes that produce a result that is not consumed)
are identified (LIR::Flags::UnusedValue flag is set in gtLIRFlags).
Code can be generated without any context from the parent (consumer) of each node.
There are three types of value for which registers are allocated:
Local variables, lclVars, which appear in the IR as
GT_LCL_VAR for uses or GT_STORE_LCL_VAR for
definitions.
Values produced by any other IR node. These are known as "tree temps", and never have more than a single use.
"Internal registers" that are required for the evaluation of a node.
The register requirements are determined by the LinearScan::BuildNode()
method which is called for each node in order to build its
RefPositions. (It in turn may invoke a node-specific LinearScan::BuildXXX() method).
The register lifetimes must obey the following lifetime model:
First, any internal registers are defined.
Next, any source registers are used (and are then freed if they are
last use and are not identified as delayRegFree).
Next, the internal registers are used (and are then freed).
Next, any registers in the kill set for the instruction are killed.
Next, the destination register(s) are defined.
Finally, any delayRegFree source registers are freed.
There are several things to note about this order:
The lifetime of internal registers will always overlap a use (that is, they may never
get the same register as a source), but they do not overlap the lifetime of a destination register (so they may get the same register as a target unless they are marked delayRegFree).
Internal registers are never live beyond the node.
The delayRegFree annotation is used to identify register uses that must not
be allocated the same register as a register target for the same node. This is
primarily used for instructions that are
only available in a Read-Modify-Write form. That is, the destination
register is one of the sources. In this case, we must not use the
same register for the non-RMW operand as for the destination. Thus, the non-RMW operand
is marked with the delayRegFree annotation.
After LSRA, the graph has the following properties:
The _gtRegNum of each tree node (GetRegNum()) contains the allocated register,
if any. Nodes that produce multiple registers are similarly
assigned, via extended register number fields. If the node does not
produce a value directly (i.e. it is either of void type, or it is
evaluated as part of its parent) its _gtRegNum is set to REG_NA.
In most cases, this register must satisfy the constraints
specified for each RefPosition by the BuildNode methods.
In some cases, this is difficult:
If a lclVar node currently lives in some register, it may not be desirable to move it (i.e. its current location may be desirable for future uses, e.g. if it's a callee save register, but needs to be in a specific arg register for a call).
In other cases there may be conflicts on the restrictions placed by the defining node and the node which consumes it
If such a node is constrained to a single fixed register (e.g. an arg register, or a return from a call), then LSRA is free to annotate the node with a different register. The code generator must issue the appropriate move.
However, if such a node is constrained to a set of registers,
as in the case of x86 instructions which require a byte-addressable register,
and its current location does not satisfy that requirement, LSRA
must insert a GT_COPY node between the node and its parent.
The _gtRegNum on the GT_COPY node must satisfy the register
requirement of the parent.
GenTree::gtRsvdRegs has a set of registers used for internal temps.
These must satisfy the constraints specified by the associated RefPositions.
A tree node is marked GTF_SPILL if the tree node must be spilled by
the code generator after it has been evaluated. The value will no longer be
live in the register, except in some cases involving EH-write-thru vars, see below.
A tree node is marked GTF_SPILLED if it is a lclVar that must be
reloaded prior to use.
The register (_gtRegNum) on the node indicates the register to
which it must be reloaded.
For lclVar nodes, since the uses and defs are distinct tree nodes, it is always possible to annotate the node with the register to which the variable must be reloaded.
For other nodes, since they represent both the def and use, if
the value must be reloaded to a different register than the one specified
on the tree node (which is the one used when the node is evaluated), LSRA must
insert a GT_RELOAD node to specify the register to which it
should be reloaded.
If a node has both GTF_SPILL and GTF_SPILLED, the tree node is reloaded prior to using
it (GTF_SPILLED) and spilled after it is evaluated (GTF_SPILL).
GTF_SPILL and GTF_SPILLED on uses,
since a def never needs to reload an old value. However, for EH-write-thru variable
defs, this combination of flags has a special meaning. A def of an EH-write-thru variable is
always written to the stack. However, if it is also marked GTF_SPILLED it remains live in the
register, in addition to being written to the stack. This is somewhat counter-intuitive since
normally the reloading (GTF_SPILLED) takes place prior to evaluation of the node.Note that GT_COPY and GT_RELOAD nodes are inserted immediately after the
instruction that must be copied or reloaded. However, the reload or copy
isn't actually generated until the code generator is generating code for
the consuming node.
Local variable table (LclVarDsc):
LclVarDsc::lvRegister is set to true if a local variable has the
same register assignment for its entire lifetime.
LclVarDsc::_lvRegNum is initialized to its initial register
assignment.
genFnPrologCalleeRegArgs() will move it.Codegen will set _lvRegNum to its current value as it processes
the trees, since a variable can be assigned different registers
over its lifetimes.
This section describes the phases of the LinearScan allocator (as
well as supporting components) in more depth.
Compiler::lvaMarkLocalVars
Compiler::fgLocalVarLiveness
This computes the BasicBlock::bbLiveIn and
BasicBlock::bbLiveOut sets that are used by LSRA.
It does this for any lclVar marked lvTracked, even if it is
not a register candidate. The dataflow information is also used
for GC info, but it may be the case that some of these should
not be marked during the pre-LSRA dataflow analysis.
LinearScan::identifyCandidates
This mostly duplicates what is done in
Compiler::lvaMarkLocalVars(). There are differences in what
constitutes a register candidate vs. what is tracked, but we
should probably handle them in Compiler::lvaMarkLocalVars()
when it is called after Lowering.
It sets the lvLRACandidate flag on lclVars that are going
to be register candidates.
The determination of the order in which the register allocator handles
blocks is done during the building of
RefPositions. However, it is a logically distinct component. Its
objective is to identify a sequence of BasicBlocks for allocation
that satisfies the following properties:
Each block comes after at least one of its predecessors, ideally the one on the edge with the greatest weight.
Blocks that enter EH regions have no predecessor. All live-in vars are on the stack.
The order of the BasicBlocks is captured in the blockSequence member of LinearScan.
Other implementations of linear scan register allocation aim to ensure that a block is immediately preceded by a predecessor block. This is not as big a consideration in our implementation because we reset the variable locations to those at the end of the most frequent successor.
It begins with the first block, and then adds successor blocks to the ready list, in sorted order, where the block weight is the sorting criterion.
After a block is added to the sequence list,
findPredBlockForLiveIn() is called to determine which predecessor to
use to set the register location of live-in lclVars, which may not be the
same as the previous block in the blockSequence. Its bbNum is captured in
the LsraBlockInfo.
The block ordering pass also identifies whether and where there are
critical edges. This also captured in the LsraBlockInfo and is used by the resolution phase.
Intervals are built for lclVars up-front. These are maintained in an array,
localVarIntervals which is indexed by the lvVarIndex (not the varNum, since
we never allocate registers for non-tracked lclVars). Other intervals (for tree temps and
internal registers) are constructed as the relevant node is encountered. Intervals for
lclVars that are live into an exception region are marked isWriteThru.
The building of RefPositions is done via a traversal of the nodes, using the blockSequence
constructed as described above. This traversal invokes LinearScan::BuildNode() for each
node, which builds RefPositions according to the liveness model described above:
First, we create RefPositions to define any internal registers that are required.
As we create new Intervals for these, we add the definition RefPosition to an array,
internalDefs. This allows us to create the corresponding uses of these internal
registers later.
Then we create RefPositions for each use in the instruction.
A use of a register candidate lclVar becomes a RefTypeUse RefPosition on the
Interval associated with the lclVar.
For tree-temp operands (including non-register-candidate lclVars), we may have one of 3 situations:
A contained immediate requires no registers, so no RefPositions are created.
A contained memory operand or addressing mode will cause RefPositions to be
created for any (non-contained) base or index registers.
A single RefPosition is created for non-contained nodes.
In order to build these uses, we need to find the Interval associated with the
tree-temp. This is done via the defList, which contains the RefTypeDef RefPosition
for all tree temps whose def has been encountered but for which we
have not yet seen the use. This is a simple list on the assumption that the distance
between defs and uses of tree temps is rarely very great.
When we have an instruction that will overwrite one of its sources, such as RMW
operands common on x86 and x64,
we need to ensure that the other source(s) isn't/aren't given the same register as the
target. For this, we annotate those use RefPositions with delayRegFree.
Next we create the uses of the internal registers, using the internalDefs array.
This is cleared before the next instruction is handled.
Next, any registers in the kill set for the instruction are killed. This is performed
by buildKillPositionsForNode(), which takes a kill mask that is node-specific and
either provided directly by the buildXXX() method for the node, or by a getKillSetForXXX()
method. There is a debug-only method, getKillSetForNode() which is only used for validation.
Finally, we create RefTypeDef RefPositions for any registers that are defined by
the node.
For a STORE_LCL_VAR of a write-thru lclVar, the RefPosition is marked writeThru.
There is some special handling for GT_PUTARG_REG nodes whose source is a non-last-use lclVar.
At build time, we mark such intervals as isSpecialPutArg.
At allocation time, if the lclVar is already in the argument register, but it has another use
prior to the call, we don't want to reassign that register to the tree temp interval, since that
would require us to spill the lclVar and reload it for the next use.
Instead, we retain the assignment of the register to the lclVar, and mark that RegRecord as
isBusyUntilNextKill so that it isn't reused if the lclVar goes dead before the call.
(Otherwise, if the lclVar is in a different register, or if its next use is after
the call, we clear the isSpecialPutArg flag on the interval.) Here is
a case from Microsoft.Win32.OAVariantLib:ChangeType(System.Variant,System.Type,short,System.Globalization.CultureInfo):System.Variant in System.Private.CoreLib.dll (I've edited the
dump to make it easier to see the structure):
N037 t16 = ┌──▌ LCL_VAR ref V04 arg3
N039 t127 = ┌──▌ PUTARG_REG ref REG rcx
N041 t128 = │ ┌──▌ LCL_VAR ref V04 arg3 (last use)
N043 t129 = │ ┌──▌ LEA(b+0) byref
N045 t130 = │ ┌──▌ IND long
N047 t131 = │ ┌──▌ LEA(b+72) long
N049 t132 = │ ┌──▌ IND long
N051 t133 = │ ┌──▌ LEA(b+40) long
N053 t134 = ├──▌ IND long REG NA
N055 t17 = ▌ CALLV ind int System.Globalization.CultureInfo.get_LCID $242
The PUTARG_REG at location 39 uses V04, and wants it in RCX. Because it is not a last use,
the tree temp interval I14 is marked isSpecialPutArg when it is built. At allocation time, we
allocate V04 to RCX so we leave the isSpecialPutArg flag and mark rcx as isBusyUntilNextKill
(this shows as "Busy" in the dump after the last use of V04 has been encountered).
This allows the use of V04 by the indirection at location 45
to use the value in RCX, and since that register is busy, we don't incorrectly free it even
though it is a last use of V04.
────────────────────────────────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
Loc RP# Name Type Action Reg │rax │rcx │rdx │rbx │rbp │rsi │rdi │r8 │r9 │
────────────────────────────────┼────┼────┼────┼────┼────┼────┼────┼────┼────┤
39.#15 V4 Use Keep rcx │C12i│V4 a│ │V1 a│V3 a│V0 a│V2 a│ │ │
40.#16 rcx Fixd Keep rcx │C12i│V4 a│ │V1 a│V3 a│V0 a│V2 a│ │ │
40.#17 I14 Def PtArg rcx │C12i│V4 a│ │V1 a│V3 a│V0 a│V2 a│ │ │
45.#18 V4 Use * Keep rcx │C12i│V4 a│ │V1 a│V3 a│V0 a│V2 a│ │ │
46.#19 I15 Def Alloc rax │I15a│Busy│ │V1 a│V3 a│V0 a│V2 a│ │ │
49.#20 I15 Use * Keep rax │I15a│Busy│ │V1 a│V3 a│V0 a│V2 a│ │ │
50.#21 I16 Def Alloc rax │I16a│Busy│ │V1 a│V3 a│V0 a│V2 a│ │ │
55.#22 rcx Fixd Keep rcx │I16a│Busy│ │V1 a│V3 a│V0 a│V2 a│ │ │
55.#23 I14 Use * PtArg rcx │I16a│Busy│ │V1 a│V3 a│V0 a│V2 a│ │ │
55.#24 I16 Use * Keep rax │I16a│Busy│ │V1 a│V3 a│V0 a│V2 a│ │ │
56.#25 rax Kill Keep rax │ │Busy│ │V1 a│V3 a│V0 a│V2 a│ │ │
A RefTypeBB RefPosition marks the beginning of a block, at which the incoming live
variables are set to their locations at the end of the selected predecessor.
If there are live-in variables with no location in the
selected predecessor, a RefTypeDummyDef is created for each of these.
LSRA_BLOCK_BOUNDARY_LAYOUT) or 0x200 (LSRA_BLOCK_BOUNDARY_ROTATE) settings of DOTNET_JitStressRegs.At the end of a block, for any exposed uses that do not have downstream
RefPositions (e.g. variables that are live across the backedge, so there is no
last use), a RefTypeExposedUse is created.
During this phase, preferences are set:
Cross-interval preferences are expressed via the relatedInterval field of Interval
When a use is encountered, it is preferenced to the target Interval for the
node, if that is deemed to be profitable. During register selection, it tries to
find a register that will fit the requirements of that preferenced Interval.
Then, when the use RefPosition is assigned a register, the
register preference of the target Interval is updated.
This addresses the scenario when the lclVar has already
(at a previous definition) been assigned a register, and we want to try to use that
register again, as well as the case where it has yet to be assigned a register.
This area has room for improvement, (see Improving Preferencing for specific issues related to this.).
Register preferences are set:
When the use or definition of a value must use a fixed register, due to instruction
or ABI constraints. In this case, the register may be added to the registerPreferences,
depending on whether it conflicts with existing preferences
(the heuristics for determining this are in LinearScan::mergeRegisterPreferences()).
When a register is killed at a point where a lclVar is live, that register is removed
from the registerPreferences of the lclVar Interval.
The register preferences of a related Interval may also be updated.
The allocation phase iterates over the blocks in BlockSequence order, and then over the
linear ordering of nodes within each BasicBlock:
RefPositions in forward order, performing linear scan
allocation. The algorithm below is a modified version of that
specified in [3].
It is a high-level abstraction of the algorithm as implemented:LinearScanAllocation(List<RefPosition> refPositions)
{
List<Interval> active = {};
Location currentLoc = Start;
foreach refPosition in RefPositions
{
if (refPosition.Location > currentLoc)
{
freeRegisters();
currentLoc = refPosition.Location;
}
if (refPosition is BasicBlockBoundary)
{
ProcessBlockBoundaryAlloc();
continue;
}
Interval currentInterval = refPosition.Interval;
if (currentInterval->hasValidAssignment())
{
refPosition->setReg(currentInterval->physReg);
}
// Find a register for currentInterval.
// AllocateBusyReg() will take into account whether it must have
// a register in this position, and may otherwise choose not to
// allocate if existing intervals have higher priority.
if (!TryAllocateFreeReg(refPosition))
AllocateBusyReg(refPosition);
}
}
hasValidAssignment() is not currently an actual method. If the
Interval currently has an assigned register, and it meets the
requirements of refPosition it is used. Otherwise, a new register
may be assigned.
BB_UNITY_WEIGHT, however
plenty of room remains for improving the allocation of
parameters Issue #7999TryAllocateFreeReg() iterates over the registers, attempting to find
the best free register (if any) to allocate:
It uses a set of scoring criteria to evaluate the "goodness" of the register that includes:
Whether it covers the lifetime of the Interval and/or the
Interval to which it is preferenced, if any
Whether it is in the register preference set for the
Interval
Whether it is not only available but currently unassigned
(i.e. this register is NOT currently assigned to an Interval
which is not currently live, but which previously occupied
that register).
Currently it doesn't take encoding size into account. Issue #7996 tracks this.
It always uses the same order for iterating over the registers. The jit32 register allocator used a different ordering for tree temps than for lclVars. It's unclear if this matters for LSRA, but Issue #8000 tracks this question.
AllocateBusyReg() iterates over all the registers trying to find the
best register to spill (it must only be called if tryAllocateFreeReg()
was unable to find one):
It takes into account a number of heuristics including:
The distance to the next use of the Interval being
spilled
The relative weight of the Intervals.
Whether the RefPosition being allocated, or the one
potentially being spilled, is reg-optional
Both tryAllocateFreeReg() and allocateBusyReg() currently fully evaluate the "goodness"
of each register, except in certain cases.
It will always spill an Interval either at its most recent
use, or at the entry to the current block.
It is quite likely that combining TryAllocateFreeReg() and
AllocateBusyReg() would be more effective, see
Merge Allocation of Free and Busy Registers
Resolution
Since register allocation is performed over a linearized list of nodes, there is no guarantee that register assignments on off-path edges will be consistent. The resolution phase is responsible for performing moves, as needed to match up the register assignments across edges.
All of the critical edges are handled first.These are the most problematic, as there
is no single location at which the moves can be added. This is done in a pass over all
BasicBlocks with an outgoing critical edge, handling all critical edges from that block.
The approach taken is:
diffResolutionSet).sameResolutionSet), but different from the end of this block.
For these, we can insert a move at the end of this block, as long as they
don't write to any of the registers read by the diffResolutionSet as those
must remain live into the split block.The actual resolution, for all edge types, is done by resolveEdge().
Based on the ResolveType, it either inserts the move at the top or bottom
of the block.
The algorithm for resolution can be found in [2], though note
that there is a typo: in the last 'if' statement, it should be
"if b != loc(pred(b))" instead of "if b = loc(pred(b))":
First, any variables that are in registers on the incoming edge, but need to be on the stack, are stored, freeing those registers
Next, any register-to-register moves are done, using a standard algorithm.
Finally, any stack to register moves are done.
Resolution of exception edges
When DOTNET_EnableEHWriteThru == 0, any variable that's
live in to an exception region is always referenced on the stack.
Code generation (genGenerateCode)
Code generation utilizes the registers specified on the instructions, as gtRegNum.
At block boundaries, the code generator calls the register
allocator to update the locations of variables
(recordVarLocationsAtStartOfBB())
Spills and reloads are generated as annotated in gtFlags
(GTF_SPILL, GTF_SPILLED, respectively).
GC tracking is performed during codegen.
This is the same bbLiveIn set as used elsewhere in the JIT but is
recomputed prior to register allocation because Lowering creates new
variables.
This is a temporary single-instance data structure used during the
construction of Intervals.
This is a common base class for Interval, which represent virtual
registers, and RegRecords, which represent the physical registers. It
contains:
A structure that represents a variable, optimizer temp, or tree temp,
that is a candidate for enregistering. Its RefPositions reflect
actual uses (RefTypeUse) or defs (RefTypeDef) in the code
stream, entry "definitions" (RefTypeParamDef and
RefTypeZeroInit) or convey boundary information
(RefTypeDummyDef, RefTypeExpUse). It also contains:
Register preference information:
Set of preferred registers
An Interval to which it is related by assignment or copy.
The assigned register and its RegRecord
During allocation, this may change, and reflects the current
assignment, if any, at the most recent RefPosition
If the Interval is spilled or becomes inactive, it will retain
the RegRecord, but its assignedReg field will be set to
REG_NA.
A special type of Interval is used to track the status of the upper half
of a vector register in the case where calls do not preserve the upper half
of the callee-save vector registers. This is true for 256-bit vectors on
AVX and for 128-bit registers on Arm64 NEON. These Intervals are
marked with isUpperVector, and their relatedInterval points to the associated
large vector Interval. This is an imperfect workaround for this special type
of partial register tracking, which is costly to support in full generality.
RegRecords for physical registers share a common base class with
Intervals, but represent actual registers instead of virtual
registers. They have a list of RefPositions for code locations at
which a fixed physical register is required, and during allocation they
maintain a pointer to the current Interval to which the physical
register is assigned.
RegRecord RefPositions are RefTypeKill or RefTypeFixedReg. The former is
inserted at a point where a register is killed, e.g. by a call. The
latter are inserted at a point where a specific register must be used or
defined. These are generally due to calls or instructions with implicit
fixed registers.
The representation of TYP_DOUBLE registers on 32-bit Arm is complicated by the
fact that they overlap two TYP_FLOAT registers. The handling of this
case could be improved. See Support for Allocating Consecutive Registers.
A RefPosition represents a single use, def or update of an
enregisterable variable or temporary or physical register. It contains
The Interval or RegRecord to which this reference belongs.
The next RefPosition (in code order) for the Interval or RegRecord.
This is used when determining which register to spill.
Register assignment
Prior to allocation, this represents the set of valid registers for this reference.
After allocation, this is a single element register mask including only the allocated register (or two registers in the ARM implementation for longs).
The type of RefPosition:
RefTypeDef is a pure definition of an Interval .
RefTypeUse is a pure use of an Interval .
RefTypeKill is a location at which a physical register is
killed. These only exist on RegRecords, not on Intervals
RefTypeBB is really just a marker in the list of RefPositions,
where the register allocator needs to record the register
locations at block boundaries. It is not associated with an
Interval or RegRecord.
RefTypeFixedReg is a RefPosition on a RegRecord that marks a
position at which it is required by some Interval. This allows
the register allocator to determine the range in which a
register is free.
RefTypeExpUse is an Interval RefPosition inserted just prior to
a loop backedge, that allows the register allocator to identify
the actual end of the live range
RefTypeParamDef is an Interval RefPosition that represents the
incoming "definition" of a parameter (register or stack)
RefTypeDummyDef is inserted at a block into which a variable is
live, but was not live out of the previous block (in traversal
order).
RefTypeZeroInit is an Interval RefPosition that represents the
position at entry at which a variable will be initialized to
zero.
RefTypeUpperVectorSave is a RefPosition for an upper vector Interval
that is inserted prior to a call that will kill the upper vector if
it is currently occupying a register. The Interval is then marked with
isPartiallySpilled.
RefTypeUpperVectorRestore is a RefPosition for an upper vector Interval
that is inserted prior to a use of the associated vector, or at the end
of a block. This is only done in the case where the RefPosition might
have become partially spilled.
The tree nodes are updated during the writeback traversal with the
register that has been assigned to it (if any). If the associated
RefPosition is marked spillAfter, the GTF_SPILL flag is set. If it is
marked "reload" then the GTF_SPILLED flag is set.
The GT_LCL_VAR nodes are updated with the current "active" register
for the variable. For Intervals that represent incoming parameters, this
must be set to the initial register assignment for the parameter (this
may differ from the incoming register for register parameters; the
prolog generation code will move it as needed).
In the event that a tree-node (non-lclVar) is spilled, and the original assigned register is not available at the time it is reloaded, a GT_RELOAD node is inserted in the graph, and is annotated with the register to which it should be reloaded just prior to use.
This mapping records the location of each live variable at the entry and
exit of each block. It is recorded during the writeback phase
(resolveRegisters()) and is used to add resolution moves, as well as to
record the variable locations at the beginning of each block during code
generation.
The following dumps and debugging modes are provided:
Prior to register allocation
During Interval & RefPosition creation - buildIntervals()
defList.
This corresponds to all the nodes that have defined values
that have not yet been consumed.After buildIntervals()
Intervals and their RefPositionsRefPositionsDuring allocation
After allocation
RefPositions, sequentially, and grouped for lclVar
IntervalsDuring resolution
After resolution
RefPosition was allocated.The implementation uses the DOTNET_JitStressRegs environment variable.
The following are the stress modes associated with this variable. For
the most part they can be combined, though in some cases the values are
exclusive:
Limit the number of registers available for allocation (at most one of these can be specified):
Limit to callee-regs (0x1)
Limit to caller-save regs (0x2)
Limit to a reduced set of both callee and caller-save regs (0x3)
Note that some instructions require a large number of registers,
and may require special handling (this is captured in the
DEBUG-only field RefPosition::minRegCandidateCount, which is
set during buildRefPositionsForNode())
Modify register selection heuristics
Interval spans a call (0x8)Modify the order in which the basic blocks are traversed:
Extend lifetimes to the entire method (0x80)
Modify the starting location (register/stack) for each variable at block boundaries (the default is to use the weightiest predecessor block that has already been allocated):
Always insert a GTF_RELOAD above a use of a spilled register (0x400).
Always spill (0x800). This mode is not fully functional, as there are cases where spill isn't actually supported (it should be possible to simply not spill in such cases).
Never allocate a register for a RefPosition marked regOptional (0x1000).
There are many assertions in LinearScan. The following are the most
effective at identifying issues (i.e. they frequently show up in bugs):
LinearScan::BuildNode() method (these are
architecture-specific, and can be found in lsraxarch.cpp, lsraarm64.cpp and lsraarm.cpp).
BuildXXX method for this
node is not building RefPositions for all of its uses (which is what consume has been set to).LinearScan::checkLastUses() which
does a reverse iteration over the RefPositions for a block to verify that bbLiveIn is consistent with bbLiveOut updated with the defs and last uses:
assert(!foundDiff);assert(farthestRefPhysRegRecord != nullptr);
At the end of write-back (resolveRegisters()), verifyFinalAllocation() runs. It doesn't do a
lot of validation, but it prints the final allocation (including final spill placement), so is
useful for tracking down correctness issues.
The potential enhancements to the JIT, some of which are referenced in this document, can generally be found by searching for LSRA in open issues. The ones that are focused on JIT throughput are labeled JITThroughput.
This is #39457. As described there, the challenge is to do this without impacting the common path. This should also include cleaning up the allocating of consecutive registers for TYP_DOUBLE for Arm32 #8758.
This is captured as #9399 Consider merging allocating free & busy regs.
Currently the register allocator will always allocate an available register, even if it only meets the minimum requirements for the current node. This results in some really bad worst-case behavior where a single scratch register is available, and is immediately spilled because it is killed prior to the next use. Often this same register will be used and spilled many times before a better register becomes free.
The alternative approach under consideration is to combine free and busy register allocation
(tryAllocateFreeReg() and allocateBusyReg()) such that a busy register will be spilled if there
are no suitable free registers, and the current Interval has greater weight than the Interval
occupying the register. This must be accompanied by some improvements in the efficiency of the
checks, so as not to degrade throughput. This is currently a work-in-progress (https://github.com/CarolEidt/runtime/tree/CombineAlloc), and needs
further work to eliminate diffs and improve throughput.
This would make it possible to spill a register for a higher weight lclVar rather than "settling"
for a register that's a poor fit for its requirements. This is probably the best approach to
address Issues #6824:
Heuristics for callee saved reg allocation and
#8846:
Let variables within a loop use register first.
The following issues are related:
Issues #6806 and #6825 track improvement of spill placement.
Issue #6705 tracks the possibility of short-circuiting this evaluation. Making such an improvement should probably be done in conjunction with this work.
Issue #13466: Inefficient register allocation in simple method which dereferences a span
This is not yet captured as a GitHub issue.
This would be best done after Merge Allocation of Free and Busy Registers.
The idea would be to add support to change the weight given to the various selection heuristics according to a configuration specification, allowing them to be auto-tuned.
The scoring could be done using perf scores, or even using actual performance measurements, though running actual benchmarks on each configuration could be quite costly.
In order to enable this capability without impacting throughput, it is likely that the configurability would be added as an alternate path in the register allocator, leaving the default path as-is.
There are a couple of distinct opportunities here:
On Arm64, there are intrinsics that require the register allocator to allocate continuous registers. This could be considered a generalization of the support for doubles on Arm32, though some refactoring is probably required to ensure that it doesn't hurt throughput for single-register allocation. This is Issue #39457.
Other intrinsics, as well as long multiply on X86 and Arm32, require allocating multiple register targets, but do not require that they are contiguous. Now that we have support for multi-reg structs, this should be relatively straightforward.
This is captured as Issue #8019 Consider pre-allocating high-frequency lclVars. A fix for this might also address #13466.
The idea here is to ensure that high frequency lclVars aren't forced to use less-than-optimal registers (or worse, spilled), by allocating them ahead of the linear scan.
This requires a mechanism for recording the location of these lclVars. The LinearScan
implementation doesn't keep a reservation table for the physical registers; instead it
maintains the current status of each register (i.e. which Interval it is allocated to)
as it proceeds during allocation or write-back. While keeping a reservation table may be
a reasonable approach for Tier 1 or higher
(see Spanning trees for physical registers),
an alternate approach would be to only perform
pre-allocation at block granularity, including block boundaries (for which the variable
locations are recorded in the varToRegMaps). Without an interference graph or a reservation
table, it will likely be limited to a small number of the highest frequency lclVars.
One strategy would be to do something along the lines of (appropriate hand-waving applies):
Set aside a subset of the registers for pre-allocation. This might include all of the callee-save registers, and a small number of caller-save registers. Some tuning would likely be required to determine exactly what and how many to reserve.
After buildIntervals() has completed,for each register candidate local Interval,
in sorted order, assign a register that corresponds to its registerPreferences. If such
a register is not available, stop pre-allocating. This becomes its assignedInterval
(which will be inactive at the start of alloaction).
During allocateRegisters() as each new block is encountered, instead of simply copying
the predecessor varToRegMap, iterate over the most frequently lclVars in the union of the
live-in, uses and defs, and displace any Intervals that are occupying registers that
would be more profitably used by the high-frequencly lclVars, weighing spill costs.
This is captured as Issue #9909.
When the register allocator performs resolution across block boundaries, it may split critical edges (edges from a block with multiple successors to a block with multiple predecessors). This can seriously impact performance, and is especially bad when the edge is a loop backedge.
One option would be to avoid edge splitting altogether. Instead, we would construct CriticalEdgeSets that
would include the set of critical edges that have any edges in common.
For these, once we have allocated the first block that contains either an in-edge or an out-edge in such
a set, the variable locations on that edge would be fixed, and future blocks that have edges in that
set would ensure that the variable locations match. This would eliminate not just backedge splitting,
but all the extra branches currently inserted for resolution. It remains to be seen whether this would
outweigh the impact of cases where more resolution moves would be required.
I have an old experimental branch where I started working on this: https://github.com/CarolEidt/runtime/tree/NoEdgeSplitting. It was ported from the coreclr to the runtime repo, but not validated in any significant way. Initial experience showed that this resulted in more regressions than improvements.
A less aggressive approach would be to make minor modifications to reduce the need for split edges:
When selecting a predecessor for the fall-through block for a loop backedge, instead of using the actual predecessor, use the non-backedge predecessor of the loop head. This will mean that both successors of the loop will use the same mapping, reducing the need to split the edge.
When a RefTypeExpUse is encountered, if it is at a loop backedge, the variable
is in a register, and the loop head has the variable on the stack, spill it.
Spilling preemptively will enable the spill to be performed at an actual reference,
rather than at the block boundary.
This approach has been implemented experimentally in https://github.com/CarolEidt/runtime/tree/AvoidSplittingBackedge. Running crossgen diffs across frameworks and benchmarks for X64 Windows shows a delta of -2392 (-0.007%) with 317 methods improved and 66 regressed. Further analysis of the regressions is needed.
Issues #8552 and #40264 may be related.
When DOTNET_EnableEHWriteThru is set, some performance regressions are observed. When an EH write-thru variable (i.e. one that is live into an exception region) is defined, its value is
always stored, in addition to potentially remaining live in a register. This increases register
pressure which may result in worse code.
Further investigation is needed, but the following mitigations may be effective (here the
term "EH Var" means a lclVar marked lvLiveInOutOfHndlr):
Adjust the heuristics:
For determining whether an EH var should be a candidate for register allocation, e.g. if the defs outweigh the uses.
An initial investigation might only consider an EH var as a register candidate if it has a single use. One complication is that we sometimes generate better code for a non-register-candidate local than one that is always spilled (we don't support RegOptional defs).
Thus, it would be better to identify before building intervals whether we should consider it a candidate, but the problem with that is that we don't necessarily know at that
time whether there is a single def. A possible approach:
isSingleDef flag to Interval.writeThru interval:
isSingleDef, allocate as usual.RegOptional, don't allocate.For determining when a definition of an EH var should be only stored to the stack, rather than also remaining live in the register.
lclVar to the stack only at those boundaries.The original issue to enable EH WriteThru is #6212. It remains open pending the resolution of the performance regressions.
The idea here is to avoid spilling at a use if the value on the stack is already the correct value.
Issues that this might address include:
Currently the register allocator doesn't track whether a local variable has the same value on the stack as in a register. The support for "write-thru" EH variables (variables live across exception boundaries) has added the capability to liveness analysis and code generation (in addition to the register allocator) to handle variables that are live in both registers and on the stack. This support could be further leveraged to avoid spilling single-def variables to memory if they have already been spilled at their definition.
Extending such support to more generally track whether there is already a valid stack copy involves more work. Fully general support would require such information at block boundaries, but it might be worth investigating whether it would be worthwhile and cheaper to simply track this information within a block.
This would involve identifying Intervals whose values are cheaper to recompute than to spill
and reload. Without SSA form, this would probably be easiest to do when there's a single def.
Issue #6264.
Issues #6862 and #6863 track the proposal to support "folding" of operations using a tree temp when the defining operation supports read-modify-write (RMW) to memory. This involves supporting the possibility of a def being reg-optional, as well as its use, so that it need never occupy a register.
I have an old experimental branch: https://github.com/CarolEidt/coreclr/tree/RegOptDef where I started working on this, and it is in the process of being ported to the runtime repo.
Issue #6358
tracks the problem that Lowering currently has
to select a single operand to be reg-optional, even if either
operand could be. This requires some additional state because
LSRA can't easily navigate from one use to the other to
communicate whether the first operand has been assigned a
register.
Issue #9896.
Issues #36454, #11260 and #12945 involve preferencing for HW intrinsics.
Issue #11959 also has a pointer to some methods that could benefit from improved preferencing.
Issue #13090 involves a case where anti-preferencing might be useful.
Issue #10296 may also be related to preferencing, if it is still an issue.
This has not yet been opened as a github issue.
Making SSA form available to LSRA would:
Allow splitting of Intervals where SSA names are not related by a phi node.
Ease potential future support for "General SSA" (where sources and dests of phi nodes may overlap) as existing support for resolution at block boundaries could be easily extended to introduce minimal copies for translating out of General SSA form.
This has not yet been opened as a github issue.
LLVM has extended their linear scan register allocator with something it
calls "Greedy Register Allocation"[6,7]. This uses a priority queue for the
order of allocation (sorted by decreasing spill cost), and a B+ tree to
represent each physical register. I think that using the B+ trees for
physical registers would be an improvement over the current PhysRegs,
and we may want to experiment with changing the allocation order as
well. It would not be necessary to significantly modify the process of
creating Intervals, nor the resolution phase.
Def/use conflicts arise when the producing and conusming nodes each have register requirements,
and they conflict. The current mechanism, in which the register assignment of one of the
RefPositions is changed, can lead to problems because there's then
no associated RefTypeFixedReg for that reference. This is Issue #10196.
A related issue is Issue #7966, which captures an issue with propagating a fixed-register use to its definition, if the value is defined by a node with delay-free operands.
Currently, the register allocator builds the RefPositions for the entire method at once.
In Min-Opts and Tier 0, or any time there are no local variables available for enregistration,
there is no value in doing so, as no values remain live beyond the top-level node of an
expression. Since the JIT doesn't do instruction scheduling or other reordering of nodes
after Lowering, it is generally the case that the live-ranges for the values produced by
nodes of disjoint top-level trees do not overlap.
Performing the entire allocation (building, allocation and write-back) on a single expression at a time, and maintaining pools of data structures for reuse, could improve throughput and reduce memory consumption.
As the register allocator builds the RefPositions, it keeps track of outstanding values in the
form of a defList that holds all of the tree temp values that have been defined but not yet used.
Once this is empty, the register allocator could process the current list of RefPositions and then
start over.
Issue #6690 proposes to build RefPositions incrementally, which is part of this item.
It is unclear whether it would be beneficial, but if we could keep track of the variables that are only used within a block (presumably true of many introduced temps), we may find that we could continue to limit the number of variables whose liveness is tracked across blocks, keeping an expanded set only for transient liveness. Issue #7992.
Note that this would only improve JIT throughput for optimized code.
The VarToRegMap incurs non-trivial JIT-time overhead.
Issue #8013 addresses
the question of whether there is an alternative that would have better
performance. This would also improve JIT throughput only for optimized code.
Issue #7998 suggests evluating the throughput cost of updating the preferences at each kill site.
Issue #9767 captures the issue that the
"spill always" stress mode, LSRA_SPILL_ALWAYS, DOTNET_JitStressRegs=0x800 doesn't work properly.
Issue #6261 has to do with RegOptional
RefPositions that are marked as copyReg or moveReg. See the notes on this issue;
I don't think such cases should arise, but there may be some cleanup needed here.
Issue #5793 suggests adding a stress mode that allocates registers for multi-reg nodes in the reverse of the ABI requirements.
Issue #10691 suggests adding a stress mode that deliberately trashes registers that are not currently occupied (e.g. at block boundaries).
<a id="1"></a> Boissinot, B. et al "Fast liveness checking for ssa-form programs," CGO 2008, pp. 35-44. http://portal.acm.org/citation.cfm?id=1356058.1356064&coll=ACM&dl=ACM&CFID=105967773&CFTOKEN=80545349
<a id="2"></a> Boissinot, B. et al, "Revisiting Out-of-SSA Translation for Correctness, Code Quality and Efficiency," CGO 2009, pp. 114-125. http://portal.acm.org/citation.cfm?id=1545006.1545063&coll=ACM&dl=ACM&CFID=105967773&CFTOKEN=80545349
<a id="3"></a>Wimmer, C. and Mössenböck, D. "Optimized Interval Splitting in a Linear Scan Register Allocator," ACM VEE 2005, pp. 132-141. http://portal.acm.org/citation.cfm?id=1064998&dl=ACM&coll=ACM&CFID=105967773&CFTOKEN=80545349
<a id="4"></a> Wimmer, C. and Franz, M. "Linear Scan Register Allocation on SSA Form," ACM CGO 2010, pp. 170-179. http://portal.acm.org/citation.cfm?id=1772979&dl=ACM&coll=ACM&CFID=105967773&CFTOKEN=80545349
<a id="5"></a> Traub, O. et al "Quality and Speed in Linear-scan Register Allocation," SIGPLAN '98, pp. 142-151. http://portal.acm.org/citation.cfm?id=277650.277714&coll=ACM&dl=ACM&CFID=105967773&CFTOKEN=80545349
<a id="6"></a> Olesen, J. "Greedy Register Allocation in LLVM 3.0," LLVM Project Blog, Sept. 2011. http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html (Last retrieved July 2020)
<a id="7"/></a> Yatsina, M. "LLVM Greedy Register Allocator," LLVM Dev Meeting, April 2018. https://llvm.org/devmtg/2018-04/slides/Yatsina-LLVM%20Greedy%20Register%20Allocator.pdf (Last retrieved July 2020)