docs/design/coreclr/jit/longs-on-32bit-arch.md
The challenge here is to model long operations in a way that reflects register lifetimes for the lo and hi halves of a 64-bit integer accurately enough that we can achieve good register allocation.
The current liveness model supports:
Previously (eons ago) when we were working on arm32, the model had all but the last def interfering with the uses and internal registers (i.e. they went live during the first location for the node).
Considerations for an expanded model:
This is the approach currently being taken. In the initial implementation, the reordering
of nodes to support the required "valid tree traversal" property of the IR was causing
the code to be incorrect because for instructions like add the carry bit needs to be
immediately consumed by the computation ofthe hi bits.
In order to preserve the current decomposition approach and not change the fundamental tree ordering properties, it is necessary to evaluate sub-expressions into temps, to keep the lo and hi computations together.
There are concerns about this, because it requires generating a number of extra temps in the case of nested expressions. However, mikedn has done some experimentation here that indicates that this approach may not be as problematic as we feared.
This basic idea is that whenever we need to decompose hi/lo operations but keep them together (i.e. can’t preserve the tree traversal/linear order invariants), we create a temp.
The idea here would be to retain TYP_LONG nodes in the IR, and to find a way to extend
the liveness model used by Lowering, LSRA and CodeGen to ensure good register allocation.
o
In the table below, there are several operations that have different register lifetime characteristics. Each is modeled with a sequence of "Locations" for which the changes in register lifetimes can be viewed as happening at the same time. All lifetimes participating in a given Location are considered to interfere. Note that the simplest model is that all the uses happen at one location, and then the def(s) happen at the next Location (i.e. the def does not interfere with the uses). The model becomes more complicated when we have constraints such as RMW (read-modify-write) operands, and even more so when we are actually modeling multiple target instructions with a single IR node.
To avoid even more complication than is already inherent in this issue, we will assume that the evaluation order is fixed during Lowering (and in these examples we are assuming that the lo operand is evaluated first).
In future we may want to consider the implications of relaxing that constraint, though note that the Decomposition approach also requires a predetermined evaluation order.
| Operation | Location 1 | Location 2 | Location 3 |
|---|---|---|---|
| x = y | use y.lo | def x.lo | |
| use y.hi | def x.hi | ||
| x = y + z | use y.lo | def x.lo | def x.hi |
| use z.lo | use y.hi | ||
| y.hi live | use z.hi | ||
| z.hi live | use z.hi | ||
| x = *p | use p | def x.hi | |
| (non-ldp) | def x.lo | ||
| x = *p | use p | def x.lo | |
| (ldp) | def x.hi | ||
| x = *(p+i*8) | use p | def x.hi | |
| (non-ldp) | use i | ||
| def x.lo | |||
| x = call | use args | def x.lo | |
| kill (end) | def x.hi |
Both the non-ldp (load pair - available on Arm but not x86) cases take
advantage of the fact that all the sources must remain live
with the first def. The same can be done in the non-ldp case of x = *p.
However, that approach doesn't reduce the Location requirement for the x = y + z case,
where, if we assume that all inputs must be live for the first def, then we keep the lo
registers live longer than necessary, and therefore can't reuse them for the x.lo (or
other) results.
TO DO:
It would be really nice to break the “valid tree traversal” restriction, and allow arbitrary (or at least more flexible) reordering of nodes when in linear form.