docs/design/coreclr/jit/object-stack-allocation.md
This document describes work to enable object stack allocation in .NET Core.
In .NET instances of reference types are allocated on the garbage-collected heap. Such allocations have performance overhead at garbage collection time. The allocator also has to ensure that the memory is fully zero-initialized. If the lifetime of an object is bounded by the lifetime of the allocating method, the allocation may be moved to the stack. The benefits of this optimization:
Object stack allocation is implemented in various Java runtimes. This optimization is more important for Java since it doesn't have value types.
roslyn #2104 Compiler should optimize "alloc temporary small object" to "alloc on stack"
runtime #4584 CLR/JIT should optimize "alloc temporary small object" to "alloc on stack" automatically
An object is said to escape a method if it can be accessed after the method's execution has finished. An object allocation can be moved to the stack safely only if the object doesn't escape the allocating method.
Several escape algorithms have been implemented in different Java implementations. Of the 3 algorithms listed in references, [1] is the most precise and most expensive (it is based on connection graphs) and was used in the context of a static Java compiler, [3] is the least precise and cheapest (it doesn't track references through assignments of fields) and was used in MSR's Marmot implementation. [2] is between [1] and [3] both in analysis precision and cost. It was used in Java HotSpot.
Effectiveness of object stack allocation depends in large part on whether escape analysis is done inter-procedurally. With intra-procedural analysis only, the compiler has to assume that arguments escape at all non-inlined call sites, which blocks many stack allocations. In particular, assuming that 'this' argument always escapes hurts the optimization. [4] describes an approach that handle objects that only escape on some paths by promoting them to the heap "just in time" as control reaches those paths.
There are several choices for where escape analysis can be performed:
Pros:
Cons:
Possible approaches to interprocedural analysis in the jit:
Pros:
Cons:
Pros:
Cons:
The results of escape analysis in the linker may be communicated to the jit by injecting an intrinsic call right before or after newobj for the object that was determined to be non-escaping. Note that assemblies may lose verifiability with this approach. An alternative is to annotate parameters with escape information so that the annotations can be verified by the jit with local analysis.
If the methods whose info was used for interprocedural escape analysis are allowed to change after the analysis, the jit either needs to inline those methods or there should be a mechanism to immediately revoke methods with stack allocated objects that relied on that analysis.
The jit is responsible for reporting references to heap-allocated objects to the GC. With stack-allocated objects present in a method a reference at a particular GC-safe point may be in one of 3 states:
All GC fields of stack-allocated objects have to be reported to the GC, the same as for fields of stack-allocated value classes.
When a field of an object is modified the jit may need to issue a write barrier:
@echesakovMSFT implemented a prototype in 2016.
The goal of the prototype was to have the optimization working end-to-end with a number of simplifications:
We ran the prototype on System.Private.CoreLib via crossgen in 2016 and objects at 21 allocation sites were moved to the stack.
We also ran an experiment where we changed the algorithm to optimistically assume that no arguments escape. The goal was to get an upper bound on the number of potential stack allocations.
@AndyAyersMS recently resurrected @echesakovMSFT work and used it to prototype stack allocation of a simple delegate that's directly invoked. It exposed a number of things that need to be done in the jit to generate better code for stack-allocated objects. The details are in comments of runtime #4584.
We did some analysis of Roslyn csc self-build to see where this optimization may be beneficial. One hot place was found in GreenNode.WriteTo. This object allocation accounts for 8.17% of all object allocations in this scenario. The number is not as impressive as a percentage of all allocated bytes: 0.67% (6.24 Mb out of 920.1 Mb) but it's just a single static allocation. Below is the portion of the call graph the escape analysis will have to consider when proving this allocation is not escaping. Green arrows correspond to the call sites that are inlined and red arrows correspond to the call sites that are not inlined.
We will implement the optimization in the jit first and get as many cases as possible that don't require deep interprocedural analysis, e.g., the delegate cases from the prototype mentioned above. We will also try to take advantage of the more suitable order of method processing in higher-tier jit.
The jit work includes removing the restrictions in @echesakovMSFT prototype, making escape analysis more sophisticated, making changes for producing better code for stack-allocated objects (some of which @AndyAyersMS discovered while working on his prototype), and updating inlining heuristics to help with object stack allocation.
To get the maximum benefit from the optimization we will likely have to augment the jit analysis with more information. The information may come from manual annotations or from a tool analysis. ILLink or the upcoming CPAOT may be appropriate places for ahead-of-time escape analysis. Self-contained applications will benefit the most from ILLink analysis but framework assemblies can also be analyzed and annotated even though cross-assembly calls will have to be processed conservatively.
In this context an algorithm similar to [1] can be used to get the most accurate escape results.
The cost of adding some infrastructure to ILLink (call graph, proper IR, etc.) will be amortized if we do other IL-to-IL optimizations in the future. Also, we may be able to reuse the infrastructure from other projects, i.e., ILSpy.
[4] Lukas Stadler at al. Partial Escape Analysis and Scalar Replacement for Java