docs/design/coreclr/jit/hot-cold-splitting.md
This document describes the current state of hot/cold splitting in the JIT.
Hot/Cold splitting is an optimization that splits code into frequently-executed ("hot") and rarely-executed ("cold") parts, and places them in separate memory regions. Increased hot code density better leverages spatial locality, improving application performance via fewer instruction cache misses, less OS paging, and fewer TLB misses.
The JIT previously supported hot/cold splitting for AOT-compiled NGEN images in .NET Framework. With Crossgen2 support in progress (and no existing support for splitting dynamically-generated code), JIT support has not been tested since retiring .NET Framework -- thus, there are likely regressions. Furthermore, the JIT never supported splitting functions with certain features, like exception handling or switch tables. Finally, with ARM64 code generation being a newer addition to the JIT, hot/cold splitting was never implemented for the architecture. These limitations significantly inhibit the applicability of hot/cold splitting.
The below sections describe various improvements made to the JIT's hot/cold splitting support to remove such limitations.
Without runtime support for hot/cold splitting in .NET as of summer 2022, testing the JIT's existing hot/cold splitting
support is not as simple as turning the feature on. A new "fake" splitting mode, enabled by the
DOTNET_JitFakeProcedureSplitting environment variable, removes this dependency on runtime support. This mode allows
the JIT to execute its hot/cold splitting workflow without changing the runtime's behavior. This workflow proceeds as
follows:
- The JIT looks for a split point in Compiler::fgDetermineFirstColdBlock, as usual.
- In Compiler::eeAllocMem, the JIT requests one memory buffer from the host (either Crossgen2 or the VM) for the
  entire function; this is unlike normal splitting, where separate buffers are allocated for the hot/cold sections.

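
The single-buffer arrangement can be pictured with a small sketch. The names below (FakeSplitLayout, layoutFakeSplit) are hypothetical and are not JIT APIs. The sketch also assumes that the cold section is placed directly after the hot section inside the one buffer, which the text above does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only: these names are not part of the JIT.
struct FakeSplitLayout
{
    std::uint8_t* hotStart;  // start of the hot section
    std::uint8_t* coldStart; // start of the cold section
};

// Carve one host-provided buffer into a hot and a cold section.
// Assumption: the cold section directly follows the hot section.
inline FakeSplitLayout layoutFakeSplit(std::uint8_t* buffer,
                                       std::size_t   bufferSize,
                                       std::size_t   hotSize,
                                       std::size_t   coldSize)
{
    assert(buffer != nullptr);
    assert(hotSize + coldSize <= bufferSize); // both sections must fit in the buffer
    return FakeSplitLayout{buffer, buffer + hotSize};
}
```

Under this assumption the two sections stay contiguous in memory, so the host sees an ordinary unsplit function while the JIT still exercises its split code paths.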
While enabling fake-splitting also enables opts.compProcedureSplitting, there is no guarantee the JIT will fake-split
a function unless Compiler::fgDetermineFirstColdBlock finds a splitting point; without PGO data, the JIT's heuristics
may be too conservative for extensive testing. To aid regression testing, the JIT also has a stress-splitting mode now,
under DOTNET_JitStressProcedureSplitting. When opts.compProcedureSplitting and stress-splitting are both enabled,
the JIT splits every function after its first basic block; in other words, fgFirstColdBlock is always
fgFirstBB->bbNext. The rest of the hot/cold splitting workflow is the same: The JIT emits instructions to handle the
split code sections and, if fake-splitting, utilizes only one memory buffer.
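
The difference between heuristic and stress split-point selection can be sketched as follows. The Block type and chooseFirstColdBlock are hypothetical stand-ins, not the JIT's BasicBlock or Compiler::fgDetermineFirstColdBlock, and the non-stress branch is a deliberately naive placeholder for the real heuristics.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the JIT's basic block type.
struct Block
{
    Block* next      = nullptr;
    bool   rarelyRun = false; // placeholder for profile or heuristic data
};

// Returns the first block of the cold section, or nullptr if there is none.
inline Block* chooseFirstColdBlock(Block* first, bool stressSplitting)
{
    assert(first != nullptr);

    if (stressSplitting)
    {
        // Stress mode: everything after the first block is treated as cold.
        // This is nullptr for a single-block function.
        return first->next;
    }

    // Naive placeholder heuristic: the cold section is the trailing run of
    // rarely-run blocks. The first block is never considered cold.
    Block* firstCold = nullptr;
    for (Block* b = first->next; b != nullptr; b = b->next)
    {
        if (b->rarelyRun)
        {
            if (firstCold == nullptr)
            {
                firstCold = b;
            }
        }
        else
        {
            firstCold = nullptr; // a hot block ends any candidate cold run
        }
    }
    return firstCold;
}
```

With blocks a, b, c in order and only c marked rarely run, the placeholder heuristic picks c, while stress mode picks b regardless of the markings.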
When used in tandem, fake-splitting and stress-splitting have strong potential to reveal regressions in the JIT's
hot/cold splitting functionality without runtime support. As such, a new rolling test job in the
runtime-jit-experimental pipeline, jit_stress_splitting, runs
all dotnet/runtime tests with fake-splitting and stress-splitting enabled.
After devising strategies for testing the JIT independently of runtime support for splitting, achieving functional parity for ARM64 became a priority. While initial splitting prototypes in Crossgen2 target x64, the JIT can achieve some correctness with hot/cold splitting on ARM64 by leveraging fake-splitting alone.
Most of the JIT's hot/cold splitting workflow is architecture-independent; only code generation is ARM64-specific. The majority of implementation work here is thus related to emitting various long pseudo-instructions:
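- Conditional jumps between the hot and cold sections use the long pseudo-instruction IF_LARGEJMP. For example,
  branch condition, target becomes the following:

      branch !condition, pc+1
      branch target

- Normally, the JIT loads a constant with a single PC-relative ldr instruction. However, this is not possible from the
  cold section: Because the cold section is arbitrarily far away, the target address cannot be determined relative to
  the PC. Instead, the JIT emits an IF_LARGELDC pseudo-instruction, which expands in one of a few ways:
  - Compute the constant's page address with an adrp instruction, then load the constant with an ldr instruction.
    (Final sequence: adrp + ldr)
  - Load the constant as above, then move it to the destination register with an fmov instruction.
    (Final sequence: adrp + ldr + fmov)
  - Compute the constant's address with an adrp instruction followed by an add instruction, and load the constant
    with an ld1 instruction. (Final sequence: adrp + add + ld1)

Aside from these pseudo-instructions, hot/cold splitting required a few other tweaks to ARM64 code generation: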
- … IMAGE_REL_ARM64_BRANCH26.

There is no concept of chained unwind info on ARM64; instead, the JIT generates unwind info for each function fragment, regardless of its hot/cold status. While this should not have any immediate implications for JIT work around hot/cold splitting, this does affect the feature's implementation in Crossgen2 and the VM. On x64, the Crossgen2 splitting prototype uses chained unwind info to differentiate between cold main body fragments and cold EH funclets (see below for details on EH splitting). This comparison is not possible on ARM64 -- the JIT may have to pass more information to the host when generating unwind info on ARM64 to indicate if a cold fragment is a funclet.
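
Returning to the adrp-based load sequences listed earlier, the arithmetic behind an adrp + add pair can be sketched as below. The helper names (AdrpAddParts, splitForAdrp, resolveAdrpAdd) are hypothetical and are not JIT or emitter APIs. The sketch relies only on the architectural behavior of adrp, which adds a signed count of 4 KiB pages to the page of the current PC, and it ignores the limited range of the encoded immediate.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only: these names are not part of the JIT.
struct AdrpAddParts
{
    std::int64_t  pageDelta; // signed number of 4 KiB pages encoded by adrp
    std::uint32_t lo12;      // low 12 bits supplied by add (or as a load offset)
};

// Split a target address into the two pieces an adrp + add pair encodes.
inline AdrpAddParts splitForAdrp(std::uint64_t pc, std::uint64_t target)
{
    const std::uint64_t pageMask  = ~std::uint64_t{0xFFF};
    const std::int64_t  pageBytes =
        static_cast<std::int64_t>((target & pageMask) - (pc & pageMask));
    return AdrpAddParts{pageBytes / 4096, static_cast<std::uint32_t>(target & 0xFFF)};
}

// Recompute the address the way the hardware would: page of PC, plus the
// page delta, plus the low 12 bits.
inline std::uint64_t resolveAdrpAdd(std::uint64_t pc, AdrpAddParts parts)
{
    const std::uint64_t page =
        (pc & ~std::uint64_t{0xFFF}) + static_cast<std::uint64_t>(parts.pageDelta * 4096);
    return page + parts.lo12;
}
```

Because only the page delta depends on the PC, the low 12 bits can be carried by either an add or the offset of the following load, which is consistent with the adrp + ldr and adrp + add + ld1 variants above.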
An EH funclet is a "mini-function" for handling or filtering exceptions; for example, for a conventional "try/catch" expression, the catch block becomes a funclet (this is true for finally/fault/filter/etc. blocks as well). The JIT places EH funclets contiguously in memory, adjacent to the main function body. Because of the prevalence of exception handling in .NET programs, enabling splitting of EH funclets massively expands this optimization's applicability.
Because EH funclets immediately follow the main function body, the JIT can easily split such functions without breaking existing invariants: all funclets are placed in the cold section together.
This approach may not be the most performant implementation: Splitting funclets individually could yield better spatial locality. However, this would require re-arranging the order of funclets (currently, there is no specific order imposed), and significantly altering unwind info generation, thus breaking many invariants in the host. This approach enables splitting in many more scenarios without breaking existing invariants or introducing architecture-specific workarounds. However, if the JIT supports splitting functions multiple times in the future, we should revisit this.
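
One way to picture this approach is as a single split index over the function's layout. The sketch below is an assumption-laden illustration, not the JIT's implementation: it assumes the layout is the main body followed by all funclets, that funclets always end up in the cold section as a group, and that the split falls at the first funclet when the main body has no split point. The names firstColdIndex and kNoSplit are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel meaning "no split point".
constexpr std::size_t kNoSplit = static_cast<std::size_t>(-1);

// Blocks are laid out as [main body ... | funclets ...]; the result indexes
// into that layout and marks the first cold block.
inline std::size_t firstColdIndex(std::size_t mainBodyCount,
                                  std::size_t funcletBlockCount,
                                  std::size_t mainBodySplit)
{
    if (mainBodySplit != kNoSplit)
    {
        // A split inside the main body makes the rest of the body and every
        // funclet cold. The first block always stays hot.
        assert(mainBodySplit > 0 && mainBodySplit < mainBodyCount);
        return mainBodySplit;
    }
    if (funcletBlockCount > 0)
    {
        // Assumption: with no split in the main body, the funclets alone
        // form the cold section.
        return mainBodyCount;
    }
    return kNoSplit; // nothing to split
}
```

Since the function is still split exactly once, the funclets keep their existing order and the host continues to see one hot and one cold region.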
In the absence of PGO data, the JIT assumes exceptions occur rarely; this justifies moving handlers to the cold section.
Because finally blocks execute regardless of an exception occurring, it may be detrimental to make these handlers
cold. Thus, Compiler::fgCloneFinally
copies the finally block to the hot section, provided it is not too large. Once runtime support for splitting matures,
we should revisit this optimization to ensure the JIT is not too sparse or overzealous in its usage.
As of writing, support for hot/cold splitting in Crossgen2 on x64 is in progress. While some future tasks are JIT-specific
and will not require runtime support to begin work, many will require close collaboration. See the dotnet/runtimelab
hot/cold splitting prototype for
runtime-specific tasks.