docs/design/coreclr/jit/Perf-Score.md
A new JIT feature that estimates the dynamic execution cost of a method.
Developers who work on the JIT code base.
Currently, JIT developers generate "Asm Diffs" when making a change to the JIT to see whether their change improves the JITted code. We have a tool called "jit-diffs" that lets a developer quickly find the methods for which two different JITs generate different assembly code.
The "jit-analyze" tool sorts this information and lists the methods with the largest decreases and increases in code size. For many methods, smaller code size usually means faster execution and larger code size means slower execution. But for complex methods that contain loops, code size is not a good metric of code quality. For such methods we want the JIT to improve the code generated for the innermost loop, and we are willing to pay extra cost outside the loop. Total method code size therefore does not properly classify regressions or improvements for changes to code inside loops.
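To make the loop argument concrete, here is a toy calculation. The block weights and unit instruction costs below are invented for illustration; they are not the JIT's actual numbers:

```python
# Illustrative only: invented block weights and a unit cost per instruction.
# A hot loop body dominates the dynamic cost of a method, so trading one
# instruction inside the loop for two outside it is a win, even though the
# total code size grows.

LOOP_WEIGHT = 100.0   # estimated executions of the loop body
PROLOG_WEIGHT = 1.0   # code outside the loop runs once

def perf_score(instrs_in_loop, instrs_outside):
    # Toy model: every instruction costs 1.0, scaled by its block's weight.
    return instrs_in_loop * LOOP_WEIGHT + instrs_outside * PROLOG_WEIGHT

before = perf_score(10, 5)   # 10 instructions in the loop, 5 outside
after  = perf_score(9, 7)    # hoisted: 1 fewer inside, 2 more outside

assert after < before        # estimated dynamic cost improves...
assert (9 + 7) > (10 + 5)    # ...even though total instruction count grew
```

Under a code-size-only metric the "after" version looks like a regression; under a weighted dynamic-cost metric it is correctly classified as an improvement.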
Instead of using total code size as the metric, it is better to have the JIT produce an estimate of the dynamic execution cost of a method. This estimated dynamic execution cost is called the "Perf Score" of the method. The "jit-analyze" tool would also be changed to use the Perf Score instead of code size when determining the list of methods to display.
When making a performance change that impacts a benchmark, you should verify that the computed Perf Score and the benchmark's execution time improve in a similar way.
Computing a highly accurate value for the actual dynamic execution cost of a method on a particular CPU is a very hard problem. The difficulty arises because all modern hardware executes instructions using a pipelined, superscalar architecture: each instruction's execution is partitioned into a series of steps called a pipeline, and the CPU runs multiple instruction pipelines concurrently. On modern hardware there can easily be twenty or more instructions executing at one time. To accurately compute the actual dynamic execution cost, a software model of this process would need to be implemented, and it would be large and complex.
The Perf Score feature does NOT propose to accurately model the actual dynamic execution cost. Instead, we propose to return an estimate of it, making several simplifying assumptions that allow us to more quickly and easily produce a reasonable estimate to use as the Perf Score.
Hardware vendors, such as Intel and ARM, provide information for assembly language programmers and compiler writers. They typically provide the following details for each instruction:

- The instruction's latency: the number of cycles before its result is available to a dependent instruction.
- The instruction's throughput: how frequently an independent instance of the instruction can be issued.
For the Perf Score implementation, we will add a method, getInsExecutionCharacteristics, that returns the insExecutionCharacteristics for each instruction. We will use the returned information, along with the BasicBlock weight, to calculate a weighted execution cost for each instruction. This implementation will not try to model the actual def/use latencies; instead it will use a simplified model to estimate the instruction latency. If, in the future, we decide to add an instruction scheduler phase to the JIT, we can revisit this area; we believe it will be straightforward to add at that time.
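The simplified model can be sketched as follows. This Python model only mirrors the shape of the C++ design described above; the field names, the per-instruction numbers in the table, and the throughput-only cost formula are assumptions made for illustration, not the JIT's actual data or formula:

```python
from dataclasses import dataclass

@dataclass
class InsExecutionCharacteristics:
    # The two values a hardware vendor's optimization guide typically supplies.
    throughput: float  # average cycles between issues of independent instances
    latency: float     # cycles until the result is available to a consumer

# Toy lookup table with invented values; real data would come from vendor
# documentation for the target CPU.
INS_TABLE = {
    "add":  InsExecutionCharacteristics(throughput=0.25, latency=1.0),
    "imul": InsExecutionCharacteristics(throughput=1.0,  latency=3.0),
    "div":  InsExecutionCharacteristics(throughput=6.0,  latency=20.0),
}

def get_ins_execution_characteristics(ins):
    # Models the proposed getInsExecutionCharacteristics method.
    return INS_TABLE[ins]

def block_perf_score(instructions, block_weight):
    # Simplified model: charge each instruction its throughput cost and
    # ignore def/use latency chains, then scale by how often the block runs.
    cost = 0.0
    for ins in instructions:
        cost += get_ins_execution_characteristics(ins).throughput
    return cost * block_weight

score = block_perf_score(["add", "imul", "div"], block_weight=100.0)
```

Summing over every instruction in every BasicBlock of the method yields the method's Perf Score; ignoring latency chains trades accuracy for a model that is cheap to compute and easy to maintain.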
We also want the Perf Score to take into account the amount of code generated by the JIT for the method, so that code size still influences the metric.
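One way the size term could be folded in is sketched below. The additive form and the 0.1 multiplier are invented for this illustration, not the JIT's actual formula; the point is only that a small size-proportional term breaks ties in favor of smaller code without swamping the dynamic-cost estimate:

```python
# Hypothetical sketch: combine the weighted dynamic-cost estimate with a
# small code-size term. SIZE_WEIGHT is an invented constant.
SIZE_WEIGHT = 0.1

def total_perf_score(dynamic_cost, code_size_bytes):
    # Between two methods with equal estimated dynamic cost, the smaller
    # one receives the better (lower) score.
    return dynamic_cost + SIZE_WEIGHT * code_size_bytes

# Equal dynamic cost, different sizes: the smaller method scores better.
assert total_perf_score(907.0, 64) < total_perf_score(907.0, 80)
```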
We will modify the "jit-analyze" tool to use the Perf Score to identify the methods that have the largest regressions and improvements.