docs/design/coreclr/jit/inline-size-estimates.md
Note this is work in progress....
Inlining is a heuristic-based optimization. There are some cases where it's obvious that a particular inline is good or bad for performance, but more typically, things are not that clear. In those cases where the benfit is in doubt, the compiler will rely on heurstics, typically based on various estimates of an inline's size impact, speed impact, or other important factors.
In this writeup we consider approaches to what should be the simplest of these estimates: the size impact of an inline.
There are a number of interesting facets to the inline size estimate
problem, but let's start with some generalities. Suppose we have some
size estimate for the caller C (ignoring for the moment exactly what
we are measuring), say CallerSize, and some size estimate for the
callee E, say CalleeSize. Now suppose we are contemplating what
would happen if E is inlined into C to create C'. We'd like some
sort of size estimate CallerSize'. The simplest estimate is that
CallerSize' = CallerSize + CalleeSize.
(1) `CallerSize'` = `CallerSize + CalleeSize`
However, calling conventions impose some additional code overhead on both the caller and callee. The caller must set up arguments in registers or on the stack, and if there is a return value, might need to move it or store it somewhere. It also might need to spill values that are stored in caller-saved registers around the call.
The callee must set up a stack frame and possibly spill callee-save registers it intends to use, and then ultimately must undo all this.
When we inline E into C, none of this calling convention setup is
necessary. So one can imagine that there is some additional size
savings from inlining, one that depends on the calling convention and
the number and kind of arguments passed from caller to callee.
(2) CallerSize' = CallerSize + CalleeSize - Overhead
Note that it's entirely possible that Overhead > CalleeSize, so
that CallerSize' < CallerSize, that is the inline not only results
in faster code but also in smaller code. Indeed this state of affairs
is increasingly common with that advent of modern programming styles
that emphasize building functionality out of lots of small procedures.
Alternatively, we can compute the size impact as the size change of C because of the inline of E:
(2) SizeImpact = CallerSize - CallerSize'
or with a bit of math,
(2) SizeImpact = Overhead - CalleeSize
Now let's look at some actual data to see how well this simple model
for SizeImpact holds up. The data below shows the measured size
impact of a single inline into various calling methods in mscorlib.
This data was obtained by first compiling all the methods without
inlining to obtain a baseline size as measured by the encoded
instruction size in bytes. The inliner was then modified so it would
only do a single inline per caller, with the candidate chosen by a
parameter k that could be varied externally. mscorlib was then
repeatedly compiled with varying values of k, and the inlines
performed and the subsequent modified caller sizes recorded. This data
was then permuted and histogrammed to show the variety of SizeImpact
values for a fixed E. Since the calling convention and number and
kind of parameters is fixed across all observations for a fixed E,
the better model would predict SizeImpact is the same in every
case.
In this first example, case E is System.Object:.ctor(), which is
close to the simplest possible callee -- it takes one argument and has
no return value. In all the observed cases the SizeImpact is
negative, but it's far from the same in every case.
Inlining for System.Object:.ctor():this (size 6)
Instances 1160 Mean SizeImpact -12.53 Min -52 Max -5
Distribution
less than -50: 1
[-50,-41]: 0 1 1 0 0 1 0 0 1 1
[-40,-31]: 0 1 0 0 1 0 0 3 1 2
[-30,-21]: 1 3 2 2 1 8 76 5 22 18
[-20,-11]: 38 23 98 7 29 4 8 27 14 21
[-10, -1]: 1 362 373 0 0 3 0 0 0 0
[ 0, 9]: 0 0 0 0 0 0 0 0 0 0
[ 10, 19]: 0 0 0 0 0 0 0 0 0 0
[ 20, 29]: 0 0 0 0 0 0 0 0 0 0
[ 30, 39]: 0 0 0 0 0 0 0 0 0 0
[ 40, 49]: 0 0 0 0 0 0 0 0 0 0
greater than 49: 0
It should be evident from this set of observations that SizeImpact
cannot be completely characterized by a simple formula like (2). For
this inlinee, inlining always saves at least 5 bytes, roughly half the time
it saves 8 or 9 bytes, on average it saves about 12.5 bytes, but it often
saves considerably more.
Other inlinees show similar spreads in SizeImpact. Here we see a case
where sometimes the SizeImpact is negative and other times it's
positive.
Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
Instances 42 SizeImpact Mean 11.33 Min -20 Max 28
Distribution
less than -50: 0
[-50,-41]: 0 0 0 0 0 0 0 0 0 0
[-40,-31]: 0 0 0 0 0 0 0 0 0 0
[-30,-21]: 0 0 0 0 0 0 0 0 0 0
[-20,-11]: 1 0 0 0 0 0 0 1 0 0
[-10, -1]: 0 2 0 0 0 0 0 0 1 2
[ 0, 9]: 0 0 1 1 0 2 0 1 4 1
[ 10, 19]: 1 0 1 0 4 1 2 1 2 6
[ 20, 29]: 0 0 2 0 2 0 0 0 3 0
[ 30, 39]: 0 0 0 0 0 0 0 0 0 0
[ 40, 49]: 0 0 0 0 0 0 0 0 0 0
greater than 49: 0
Not all inlinee SizeImpacts exhibit such wide distributions. Some spread
just a little:
Inlining for System.Environment:GetResourceString(ref):ref (size 15)
Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6
Distribution
less than -50: 0
[-50,-41]: 0 0 0 0 0 0 0 0 0 0
[-40,-31]: 0 0 0 0 0 0 0 0 0 0
[-30,-21]: 0 0 0 0 0 0 0 0 0 0
[-20,-11]: 0 0 0 0 0 0 0 0 0 0
[-10, -1]: 0 0 0 0 0 0 0 7 0 7
[ 0, 9]:2212 3 0 5 0 0 4 0 0 0
[ 10, 19]: 0 0 0 0 0 0 0 0 0 0
[ 20, 29]: 0 0 0 0 0 0 0 0 0 0
[ 30, 39]: 0 0 0 0 0 0 0 0 0 0
[ 40, 49]: 0 0 0 0 0 0 0 0 0 0
greater than 49: 0
and some not at all:
Inlining for System.DateTime:get_Ticks():long:this (size 15)
Instances 129 SizeImpact Mean 0.00 Min 0 Max 0
Distribution
less than -50: 0
[-50,-41]: 0 0 0 0 0 0 0 0 0 0
[-40,-31]: 0 0 0 0 0 0 0 0 0 0
[-30,-21]: 0 0 0 0 0 0 0 0 0 0
[-20,-11]: 0 0 0 0 0 0 0 0 0 0
[-10, -1]: 0 0 0 0 0 0 0 0 0 0
[ 0, 9]: 129 0 0 0 0 0 0 0 0 0
[ 10, 19]: 0 0 0 0 0 0 0 0 0 0
[ 20, 29]: 0 0 0 0 0 0 0 0 0 0
[ 30, 39]: 0 0 0 0 0 0 0 0 0 0
[ 40, 49]: 0 0 0 0 0 0 0 0 0 0
greater than 49: 0
So it appears there must be other -- perhaps many -- factors that
influence the SizeImpact of a particular inline. The question is,
what factors are important, how do we measure them, and how do they
they combine to influence the overall result? The remainder of this
document will look into some of the ways we can answer this question.
Before we move on, however, we should address what kind of information will actually be available to observe and use in building a heuristic model.
It's not realistic to feed the heuristic on the actual encoded size of
methods. While compiling a method C obtaining the encoded size of
C is problematic. It's also quite likely that the encoded size of
the inlinee E is unknown, except possibly in some AOT scenarios
where one could generally arrange for methods to be compiled in
bottom-up order (that is, callees before callers). While one could
imagine spending compile time doing experimental compilations of C's
and E's to see what code size results, this is probably too costly
to be practical.
Second, even if we could obtain the actual size of prospective inline
candidates, we might not want to use this data. The final code
sequence emitted by the compiler depends intimately on details of the
target architecture, runtime conventions (ABIs), and capabilities of the
compiler phases that run after inlining. If we allow feedback into hey
heuristics by incorporating data from these "downstream" sources, we
introduce various forms of coupling that have important
consequences. For example, it would be challenging to work on a
downstream phase (say, register allocation) if some trivial change to
allocation in an inlinee E perturbs inlining decisions for other
methods like C. Likewise we may or may not want to allow runtime
conventions to influence inlining -- if our application can run cross
platform (say in both Windows and Linux) we might not want the various
versions to exhibit vastly different inlining patterns.
For that reason, we intend to restrict the information we can use in forming heuristics to things that are generally true of the caller and callee, and the general capabilities of the downstream phases. For the caller and callee this largely means information that can be obtained by inspecting either the input to the compiler, or the compiler's IR in the stages leading up to inlining, and perhaps (if available) some carefully chosen information obtained from actual compilation of the callee.
At the same time, we still intend for the predictions made by the heuristics to be indicative of the final binary size. We have no other reliable source of truth to guide us.
Given all this, we can restate the central question as: how best can the compiler estimate the ultimate size impact of an inline while restricting itself to considering features that are generally true of caller, callee, and the capabilities of the compiler?
The tried-and-true approach is to build the heuristic manually. There are two prongs to the approach: the first is case analysis of actual behavior, and the second is modelling based on the compiler writer's experience and intuition.
It's always instructive to look at actual behavior, so let's consider
some of the cases. From the table above for System.Object:.ctor() we
see a number of cases where there were large savings in byte size. The
most significant of these is the following:
Inline System.Object:.ctor():this
into System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
CalleeSize = 6
CallerSize = 72
CallerSize' = 24
SizeImpact = -48
where System.Object:.ctor():this is simply:
; System.Object:.ctor():this
; Total bytes of code 6, prolog size 5
0F1F440000 nop
C3 ret
As an aside, one might wonder why there are 5 bytes of nops here -- they are put there to support the rejit feature used by the application insights framework.
The caller's source code shows it simply delegates immediately to another method to initialize the object:
UnmanagedMemoryAccessor(
SafeBuffer buffer, Int64 offset,
Int64 capacity, FileAccess access) {
Initialize(buffer, offset, capacity, access);
}
Here's the code before the inline:
; BEFORE
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 72, prolog size 10
4156 push r14
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC20 sub rsp, 32
488BF1 mov rsi, rcx
488BFA mov rdi, rdx
498BD8 mov rbx, r8
498BE9 mov rbp, r9
488BCE mov rcx, rsi
E800000000 call System.Object:.ctor():this
448B742470 mov r14d, dword ptr [rsp+70H]
4489742470 mov dword ptr [rsp+70H], r14d
488BCE mov rcx, rsi
488BD7 mov rdx, rdi
4C8BC3 mov r8, rbx
4C8BCD mov r9, rbp
488D0500000000 lea rax, [(reloc)]
4883C420 add rsp, 32
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
415E pop r14
48FFE0 rex.jmp rax
and here's the code after the inline:
; AFTER
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 24, prolog size 5
0F1F440000 nop
90 nop
8B442428 mov eax, dword ptr [rsp+28H]
89442428 mov dword ptr [rsp+28H], eax
488D0500000000 lea rax, [(reloc)]
48FFE0 rex.jmp rax
In both cases this method ends up tail calling to Initialize (via the
rex.jmp), passing it the exact set of arguments that
was passed to the method.
In the before case, the un-inlined call changes the method from a leaf to a non-leaf. Leaf functions don't need to set up their own frame, while non-leaves do. So some of the extra code in the prolog and epilog is the frame setup. Beyond that, however, the un-inlined callee forces the compiler to move the arguments that will be ultimately be passed to the tail-called method out of the volatile registers and into preserved registers before the call, then back again to again to the proper argument registers after the call. And in order to use the callee saves these registers must be pushed in the prolog and popped in the epilog.
In the after case, no frame setup is required, no registers need to be
preserved across the call, so no registers need saving and
restoring. The two movs that remain appear to be gratuitous. The
first `nop' is there for rejit padding. Not sure why the second one is
there.
So we can now see why this particular case has such extreme
SizeImpact: the un-inlined call triggers frame creation and a fair
amount of shuffling to accommodate the potential side effects of the
call.
In the data above, the inline of System.Object:.ctor() typically
saves either 8 or 9 bytes of code. Let's look at one such case.
Inline System.Object:.ctor():this
into System.Security.SecurityElement:.ctor():this
CalleeSize = 6
CallerSize = 15
CallerSize' = 6
SizeImpact = -9
Here are before and after listings for the caller:
; BEFORE
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 15, prolog size 5
0F1F440000 nop
488D0500000000 lea rax, [(reloc)]
48FFE0 rex.jmp rax
; AFTER
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 6, prolog size 5
0F1F440000 nop
C3 ret
In this case the call site was initially handled via a tail call that
required a 7 byte lea and a 3 byte rex.jmp. With the inline all
that's left is a single byte ret. This case covers 154 of the 187
instances where inlining System.Object:.ctor() saved 9 bytes.
Now let's look at a set of examples where an inline can either decrease or increase the caller size. Here's the good case:
Inline System.IntPtr:op_Explicit(long):long
into System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
CalleeSize = 35
CallerSize = 50
CallerSize' = 22
SizeImpact = -28
Here's the callee code:
SIGH -- turns out the above has the wrong callee size -- there are several overloaded versions and my script picks up the wrong one. true callee is only 9 bytes long:
; System.IntPtr:op_Explicit(long):long
; Total bytes of code 9, prolog size 5
0F1F440000 nop
488BC1 mov rax, rcx
C3 ret
At any rate the SizeImpact is still correct.
Here's the before caller code:
; BEFORE
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 50, prolog size 12
55 push rbp
57 push rdi
56 push rsi
4883EC20 sub rsp, 32
488D6C2430 lea rbp, [rsp+30H]
488BF1 mov rsi, rcx
8BFA mov edi, edx
498BC8 mov rcx, r8
E800000000 call System.IntPtr:op_Explicit(long):long
4C8BC0 mov r8, rax
8BD7 mov edx, edi
488BCE mov rcx, rsi
488D0500000000 lea rax, [(reloc)]
488D65F0 lea rsp, [rbp-10H]
5E pop rsi
5F pop rdi
5D pop rbp
48FFE0 rex.jmp rax // tail call WriteInt64
and the after caller code:
; AFTER
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 22, prolog size 10
55 push rbp
4883EC20 sub rsp, 32
488D6C2420 lea rbp, [rsp+20H]
E800000000 call System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
90 nop
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
Somewhat confusingly, the inline has stopped the jit from making a tail call, and the size savings should be even greater than they are. There appears to be some kind of code generation issue within the jit. The IL for the callee is
IL_0000 0f 00 ldarga.s 0x0
IL_0002 7b 8f 02 00 04 ldfld 0x400028F
IL_0007 6e conv.u8
IL_0008 2a ret
and the JIT rejects the tail call because
Rejecting tail call late for call [000008]: Local address taken
so presumably the ldarga.s in the inlinee is causing trouble.
Now let's see if that same inlinee can cause a size increase. Here's a case:
(NOTE this is calling a different overload..., need to get this straghtened out)
Inline System.IntPtr:op_Explicit(long):long [35]
into EventData:get_DataPointer():long:this [18] size 35 delta 17
CalleeSize = 35
CallerSize = 18
CallerSize' = 35
SizeImpact = 17
; BEFORE
; EventData:get_DataPointer():long:this
; Total bytes of code 18, prolog size
0F1F440000 nop
488B09 mov rcx, qword ptr [rcx]
488D0500000000 lea rax, [(reloc)]
48FFE0 rex.jmp rax
; AFTER
; EventData:get_DataPointer():long:this
; Total bytes of code 35, prolog size 5
4883EC28 sub rsp, 40
488B11 mov rdx, qword ptr [rcx]
33C9 xor rcx, rcx
48894C2420 mov qword ptr [rsp+20H], rcx
488D4C2420 lea rcx, bword ptr [rsp+20H]
E800000000 call System.IntPtr:.ctor(long):this
488B442420 mov rax, qword ptr [rsp+20H]
4883C428 add rsp, 40
C3 ret
Here the un-inlined case made a tail call. The inlined case was unable to do the same optimization for some reason, though it looks like it should have been able to do so as well.