hphp/doc/hands-on/lesson4.md
In this step, we're going to hunt for a potential performance win.
Our main tool in this hunt is profiling. One way to do that is to run Linux perf on a machine with an HHVM webserver under load. HHVM even outputs a "perf-PID.map" file that can be used to symbolize stack traces in JIT-ed Hack code. That said, it's easiest to read perf's numbers for C++ helpers, and many performance opportunities in HHVM come from optimizing C++ functions that are used to implement the Hack runtime.
In a sample perf trace I took, I saw that ~0.2% CPU is spent doing object-to-array-like casts (e.g. convObjToVec, convObjToDict, convObjToKeyset).
What makes these helpers a good potential target for optimization? First off, there is some opportunity there: if we could eliminate most of the cost of these helpers, it could be a 0.2% gCPU win. Second, the opportunity is relatively easy to realize. It's difficult to implement general-purpose code to cast an object to an array-like. That's why HHVM implements these casts by calling a C++ helper. But after having read enough of web code written in Hack, I can make an educated guess that most object-to-array casts are casts on a collections objects. These objects are secretly backed by arrays, which makes it possible for us to generate faster code to do the cast in this case.
When we have a potential perf idea, it's important to write a small test case and check that the opportunity shows up. Let's put the following code into ~/php/convert.php:
<?hh
<<__EntryPoint>>
function main() {
$x = Vector { 17, 34 };
var_dump(vec($x));
}
Now, when we run this code with tracing enabled, we must do so at the highest optimization level! If we don't, we might go down a rabbit hole trying to make an optimization that is irrelevant for our production use case. To see how HHVM compiles the code above, we have to run it with both HHBBC (the bytecode-to-bytecode optimizer) and RetranslateAll (the two-phase JIT) enabled:
TRACE=printir:1 hphp/tools/hhvm_wrapper.php -c -r 2 ~/php/convert.php | less -RN
Then we have to jump to the TransOptimize translation for main. Here's the relevant part:
B4: [profCount=2] (preds B0)
--- bc main(id 1078142128)@5, fp 0, fixupFP 0, spOff 1, [profTrans=0]
ColFromArray Vector
(12) t5:Obj=HH\Vector = NewColFromArray<HH\Vector> vec(0x8018d4e0)=Vanilla
Main:
0x3280000e: mov $0x8018d4e0, %edi
0x32800013: callq 0xa712ff0 <HPHP::collections::allocFromArrayVector(HPHP::ArrayData*)>
0x32800018: mov %rax, %rbx
--- bc main(id 1078142128)@24, fp 0, fixupFP 0, spOff 3, [profTrans=0]
CastVec
(22) t6:Vec = ConvObjToVec t5:Obj=HH\Vector -> B5<Catch>
Main:
0x3280001b: mov %rbx, %rdi
0x3280001e: callq 0xa260c00 <HPHP::jit::convObjToVecHelper(HPHP::ObjectData*)>
-> B6
B6: [profCount=2] (preds B4)
(23) StStk<IRSPOff -4> t1:StkPtr, t6:Vec
Main:
0x32800023: movb $0x17, -0x38(%rbp)
0x32800027: movq %rax, -0x40(%rbp)
The important thing here is that, even though we know that t5 is an HH\Vector in instruction (22) ConvObjToVec, we still generate machine code for it that makes a call to a C++ function convObjToVecHelper. We can do better!
We last looked at the TRACE=printir:1 output in the first lesson. Since then, we've learned enough about HHVM to tease its output apart in more detail. This tracing is useful because it simultaneously displays three levels of code representation:
Here's the Hack bytecode annotation for the first IR op above. First, we see the function and bytecode offset where this bytecode came from, a sort of annotation from the bytecode back to source code. Then, we see the bytecode's name and immediates:
--- bc main(id 1078142128)@5, fp 0, fixupFP 0, spOff 1, [profTrans=0]
ColFromArray Vector
Here's the IR op. Like the bytecode, the IR op takes immediates - the parameters inside the angle brackets. Unlike the bytecode, the IR takes inputs and outputs inline. HHVM's IR operates on static, single-assignment (SSA) temporary values. For this op, the input SSATmp is a constant, so we hide its ID; the output SSATmp is not constant, so we show that it's "t5", of type "Obj=HH\Vector". At the IR level, we're no longer implicitly operating on a stack machine - all stack operations are explicit!
(12) t5:Obj=HH\Vector = NewColFromArray<HH\Vector> vec(0x8018d4e0)=Vanilla
Here's the machine code that we generate for this IR op. Machine code can be placed in multiple regions of memory, depending on whether we think blocks at the machine code level are likely to be executed. If they are, we put the code in "Main"; otherwise, we put it in "Cold" or "Frozen". The machine code for this op is simple because the heavy lifting is done in C++:
Main:
0x3280000e: mov $0x8018d4e0, %edi
0x32800013: callq 0xa712ff0 <HPHP::collections::allocFromArrayVector(HPHP::ArrayData*)>
0x32800018: mov %rax, %rbx
After the ColFromArray bytecode comes the CastVec bytecode. We compile this bytecode to two IR ops. ConvObjToVec is doing the main logic of the Cast op, and, again, is implemented mostly in C++. StStk ("store stack") pushes the final result of the cast onto the stack. Unlike the other IR ops here, StStk is compiled directly to machine code. It's simple enough that we don't need to shell out to C++ for it.
If we look at the bytecode for these operations, we'll see that the IR trace above skipped several bytecodes. You can see for yourself by running this command, which is like our command above, except that:
TRACE=printir:1 and add --hdf Eval.DumpHhas=1 to get a bytecode rather than an IR dump.-r 2 (which expands to --retranslate-all 2) because RetranslateAll results in more optimized IR generation - it doesn't affect the bytecode!hphp/tools/hhvm_wrapper.php -c --hdf Eval.DumpHhas=1 ~/php/convert.php | grep -v srcloc
Here's what I get if I run that command:
.function{} [unique persistent "__EntryPoint"("""v:0:{}""")] (4,7) <"" N > main() {
.declvars _0;
Vec @A_0
ColFromArray Vector
AssertRATL _0 Uninit
AssertRATStk 0 Obj=HH\Vector
PopL _0
NullUninit
NullUninit
AssertRATL _0 Obj=HH\Vector
PushL _0
CastVec
FCallFuncD <SkipRepack SkipCoeffectsCheck> 1 1 "" "" - "" "var_dump"
AssertRATStk 0 InitNull
PopC
Null
AssertRATL _0 Uninit
RetC
}
In the bytecode, in between creating the Vector and casting it to a vec, we store it to local 0 (via PopL), then unset the local and push it back onto the stack (via PushL). Check bytecode.specification to confirm your understanding! There are also some type assertion bytecodes (AssertRATL, "assert repo-authoritative type for a local", and AssertRATStk, for the stack), but it makes sense that these bytecodes don't compile to machine code. They just add in type information at compile time.
It makes sense that the bytecode stores the Vector to a local - after all, that's what the source code does! The question is, why don't the IR ops match the bytecode above exactly? If the IR were to implement the exact same operations that the bytecode above claims to do, then we'd have several IR ops for each bytecode in this sequence:
Compiler engineers love abbreviations! "Loads" and "stores" are just reads and writes to memory. Because interacting with memory is ubiquitous in Hack code, we have tons of IR ops that use "Ld" or "St" in their names. We also use shorthand for many other ops, like CreateAFWH, DecRefNZ, JmpZ, and Shr. These names seem opaque at first, but they make reading code easier once you are versed in the domain. If you're ever confused about what an IR op does, consult ir.specification.
These are sensible and correct translations of the bytecode above into IR. They're also wasteful. The ColFromArray, PopL, PushL, and CastVec are pushing and popping a single Vector value. If the net effect of all of these ops is to do a cast on the Vector's SSATmp, it's better if we just do that and skip the intermediate stack and local stores and loads. HHVM produces this more optimized code in two steps:
There are actually two places we do the analysis needed to make these optimizations. First, the irgen ("IR generation") code for each bytecode uses a forward-only analysis called FrameState, which eliminates operations loads that can be trivially removed (e.g. a LdStk that comes right after a StStk). FrameState can eliminate redundant loads in straight-line code, but we need to apply a more complex fixed-point algorithm to handle branches and loops. load-elim and store-elim do these optimizations in general, after irgen is complete.
ir.specification is actually a kind of executable
documentation. We use this script
to process the lines beginning with | in this file to produce C++ code
defining a table of IR ops, the single source of truth for the IR. Because the
documentation is the implementation, it's always up to date! Let's look at a
few examples:
| ConvObjToVec, D(Vec), S(Obj), PRc|CRc
| StStk<offset>, ND, S(StkPtr) S(Cell), NF
| CheckType<T>, DRefineS(0), S(Cell), B|P
The first line says that ConvObToVec is an IR op taking one SSATmp input that must be a subtype of TObj, and returning a subtype of TVec. The last bit of that line defines the op's "flags" - in this case, it consumes a refcount on the input and produces one on the output. That is: this IR op takes care of the inc-ref and dec-ref required by the CastVec bytecode. For a complete list of flags, look arlier in the file.
The second line says that StStk takes two inputs - a stack pointer and an arbitrary Hack value (Cell == "mixed") - and doesn't return anything (ND == "no destination"). That should make sense, as StStk simply writes to memory. StStk doesn't have any additional flags, but it does have an immediate: the stack offset, which is a constant offset off of the stack pointer. (Recall that an immediate is an argument to an instruction that is "immediately" available at compile time, as opposed to inputs, which are only available at runtime.) The immediates in ir.specification are just comments; we define the immediates for each op in code at the end of the file extra-data.h. If you look for StStk there, you'll see that it takes an immediate of type IRSPRelOffsetData.
The third line says that CheckType takes one input, an arbitrary Hack value, and returns an output. The return type of this op isn't fixed; instead, it must be computed at compile time based on the input, and based on the immediate type T that a given CheckType is checking for. For example, we might compile a CheckType<Bool> op, which takes a Cell input, checks that it's a bool, and returns a refined SSATmp of type TBool. If the runtime type check fails, then instead of returning the result, we'd branch (the "B" flag) to a "taken" block.
A key difference between Hack bytecode and HHVM's IR is that each IR op places type constraints on its inputs, and produces typed outputs. The "CastVec" bytecode will take any type of input from the stack and "do what vec cast does" for that input. That could mean leaving the input unchanged (if it's a vec), doing a cast (if it's a dict, keyset, or object), or throwing an error (all other types). By contrast, the ConvObjToVec IR op will only implement vec cast behavior for object inputs. It's up to our irgen code to only generate IR ops whose type constraints are satisfied.
In debug builds, after completing irgen for some block of code, we have assertions to check these type constraints. The only invariants that hold in HHVM are the ones that we actively check in debug mode!
We've already seen some examples of JIT types in the sources and destinations of each op in ir.specification. These examples are instances of the C++ class jit::Type, defined in hphp/runtime/vm/jit/type.h. A JIT type constraints what a Hack value could be at runtime. Unlike the Hack type system, the JIT type system is sound, modulo bugs in HHVM: it provides actual guarantees about runtime values, and without these guarantees, we wouldn't be able to produce any nontrivial machine code.
Each instance of a JIT type represents a subset of values that match that type. For example:
That's right - JIT types can be constants! Like most compilers, HHVM does constant folding: if an IR instruction has constant inputs, and if it's side effect free, HHVM may compute its output at compile time. We check if an IR instruction can be const-folded by checking if all of that instruction's source SSATmps have types of constant values.
Based on the previous lesson, on refcounting, you may have a guess as to why we could have constant subtypes of TInt, TStr, TVec, etc., but not of TObj. There are even some types, like TInitNull, which always represent a constant!
The set of all JIT types forms a mathematical structure called a lattice.
That means that JIT types support two key operations: union and intersection.
The union | of two JIT types A and B is the type of values that could be an A
or a B. The intersection & of A and B is the type of values that must be both
an A and a B. Here are a few examples:
TCell | TInt = TCell // A TCell could be any Hack value, so this union could be, too.
TCell & TInt = TInt // A TCell and an int must be an int.
TStr & TInt = TBottom // No value is an int and a string. TBottom is "the empty set".
TInt | TStr = (Int|Str) // jit::Type can represent this union...
(Int|Str) | TInitNull = (Int|Str|InitNull) // ...and this one.
// jit::Type can represent any union of Hack DataTypes (i.e. "top-level" Hack types)!
TInt=3 | TInt=5 = TInt // We can't represent two constants, so we expand this union.
TInt=3 & TInt=5 = TBottom // No value is ever both the int 3 and the int 5.
We use the lattice operations to bound the type of values at compile time:
There's a lot more to learn about this type system, but it's best done by
reading code. Luckily, jit::Type is one of the most heavily-tested parts of
HHVM. In addition to exercising this code via the end-to-end tests in the
hphp/test directory, we have
a battery of unit tests which double as examples of usage and expected behavior.
Phew! That was a lot of information. Take five and step away from the computer. Go for a walk! Get a cup of coffee!
...but before you do, kick off a build of HHVM =)
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
It's time to apply what we've learned to optimize the example from the start of this lesson - a simple vec cast on a Vector. At a high level, here's what we have to do:
Let's go through these items together. For item 1, we can simply search in the
hphp/runtime directory for "emitCastVec". Here's the code that I found.
At the rev you're working at, the code may be a bit different, but the
high-level structure is probably similar:
void emitCastVec(IRGS& env) {
auto const src = popC(env);
auto const raise = [&](const char* type) {
auto const message =
makeStaticString(folly::sformat("{} to vec conversion", type));
gen(env, ThrowInvalidOperation, cns(env, message));
return cns(env, TBottom);
};
push(
env,
[&] {
if (src->isA(TVec)) return src;
if (src->isA(TArrLike)) return gen(env, ConvArrLikeToVec, src);
if (src->isA(TClsMeth)) return raise("ClsMeth");
if (src->isA(TObj)) return gen(env, ConvObjToVec, src);
if (src->isA(TNull)) return raise("Null");
if (src->isA(TBool)) return raise("Bool");
if (src->isA(TInt)) return raise("Int");
if (src->isA(TDbl)) return raise("Double");
if (src->isA(TStr)) return raise("String");
if (src->isA(TFunc)) return raise("Func");
if (src->isA(TRes)) return raise("Resource");
PUNT(CastVecUnknown);
}()
);
}
What's going on here? The C++ lambda inside the "push" call returns an SSATmp
that's the result of doing a cast. TVec, TArrLike, TObj, etc. are all JIT
types; most of them represent values of some DataType, but TArrLike is a
special union type that just means - you guessed it! - TVec | TDict | TKeyset.
In the TVec, TArrLike, and TObj cases, we produce IR to actually do the cast.
In all of the other cases, here, the cast is guaranteed to throw, so we emit a
"terminal" (the flag "T" in ir.specification) op ThrowInvalidOperation. This
op is guaranteed to halt execution, so we can return a dummy SSATmp of type
TBottom, since we know that value is unreachable.
When compiling a bytecode to IR, it's critical to keep track of refcounting. This bytecode pops one value - the "src" SSATmp - from the stack, and pushes one value - the return value of the lambda - onto it. Based on the standard refcounting semantics, we must dec-ref the stack input and inc-ref the output.But we don't see any DecRef or IncRef IR ops here! That's because the ConvArrLikeToVec and ConvObjToVec IR ops handle the refcounting for us. We can confirm that by reading their ir.specification entries. The CRc|PRc flags on these ops mean that they consume a refcount on their input and produce one on their output. If we replace these ops with ones that don't handle refcounting, we'll have to generate refcounting IR ops ourselves.
For item 2, we must check the conditions under which our optimization applies. Since we're trying to optimize vec casts on Vector objects, we only have to consider the TObj case, but we can't apply our optimization to all objects. We must use the JIT type system to check for a more specific type here.
Let's make this change by editing the TObj line alone. We can insert another C++ lambda to give us some syntactic space to make this logic more complex. Then, we can use the jit::Type's specialized constructor, jit::Type::SubObj, to produce a type to compare against here:
if (src->isA(TObj)) return [&]{
auto const TVector = Type::SubObj(c_Vector::classof());
if (src->isA(TVector)) {
// TODO: We can optimize this case!
}
return gen(env, ConvObjToVec, src);
}();
We create and immediately invoke a closure here as a way to pack complex control flow in an expression position. It's the same reason that we create and invoke the outer closure, the one that produces the output that we push onto the stack. You could consider this pattern a "cute" "trick", or you could refactor this code to use named helper methods instead. The choice is yours!
Make these edits and check that HHVM still compiles. (You may need to include "hphp/runtime/ext/collections/ext_collections-vector.h" at the top of the file, too.) On to item 3! We need to find an IR op to extract the vec that backs a vector. There are a few ways to look for this op:
The first approach quickly leads us to:
| LdColVec, D(Vec), S(Obj), NF
Load the vec array backing a collection instance in S0. S0 must be a Vector
or ImmVector, and that specific object type must be known at compile time.
That's exactly what we need! However, we now have to generate the appropriate refcounting IR ops. Unlike ConvObjToVec, LdColVec has no special flags - it simply does a load. In order to generate any IR ops here, we need to use the magical templated "gen" helper. This helper takes a variadic inputs and builds and IR instruction from those inputs. In order, its inputs are:
To get this optimization working, we must gen an LdColVec, an IncRef of the result, and a DecRef of the input. LdColVec and IncRef don't need any of the three optional arguments above. DecRef has an associated immediate, but no other optional arguments. As a result, this code should work for us:
if (src->isA(TObj)) return [&]{
auto const TVector = Type::SubObj(c_Vector::classof());
if (src->isA(TVector)) {
auto const result = gen(env, LdColVec, src);
gen(env, IncRef, result);
gen(env, DecRef, DecRefData(), src);
return result;
}
return gen(env, ConvObjToVec, src);
}();
It's important that we do the IncRef here before the DecRef. Make sure you understand why!
When you've got this code compiling, rerun the TRACE=printir:1 command with the test file and your new HHVM. You should see that the C++ call in ConvObjToVec has been replaced with an offset load. Our code generation for this case is much improved!
When writing a compiler optimization, you should always try to make it as general as possible. If your optimization only applies in certain special cases, then users of your language may see large performance differences due to small edits to their code, which is a frustrating experience! Further, performance measurement can be noisy, and the more cases your optimization covers, the greater the chance of measuring an unambiguous improvement on real benchmarks.
Compiler engineers are greedy. They want to optimize as many cases as possible. Compiler engineers are also lazy. They may give up at the point at which it becomes hard to extend an optimization further. As a result of these warring impulses, compilers exhibit stepwise behavior. Ubiquitous, simple cases are heavily optimized. A variety of common cases are handled somewhat well. The general case may be slow as molasses.
There are a variety of ways to extend this optimization. We'll look at simpler extensions in the exercises, but the most important extension here is to take advantage of late type information.
HHVM includes a variety of optimization passes. These passes run after irgen is complete. They take an IR unit that implements some Hack source code, and modify it in place to produce better IR that implements the same code. These passes can produce tighter type bounds on values than we originally had at irgen time; for example, if we inline a function, we may get a better type for its return value, which recursively results in tighter types for SSATmps computed based on that value.
If we can do an IR optimization without introducing new control flow, we can take advantage of these late types by moving the optimization to simplify.cpp. The simplifier is run on every IR op after every optimization pass. If there's any opportunity to make the optimization, the simplifier will find it. The simplifier's contract is, uh, simple...it processes a single IR instruction, and if it's possible to replace it with zero or more IR instructions, it does so.
When can we simplify an operation down to zero IR instructions? When we can const-fold it! Take a look at simplifyXorInt for an example. One of the cases we optimize here is xor-ing an SSATmp with itself. If we do that, we can return cns(env, 0). Since this result is a constant SSATmp, we don't need to execute any IR instructions to compute it. The simplifier will remove the XorInt op and replace its result with the constant.
To move the optimization to the simplifier, we need to:
Here's what I've got for the simplify code (as a diff: D34433352):
SSATmp* simplifyConvObjToVec(State& env, const IRInstruction* inst) {
auto const src = inst->src(0);
auto const TVector = Type::SubObj(c_Vector::classof());
if (src->isA(TVector)) {
auto const result = gen(env, LdColVec, src);
gen(env, IncRef, result);
gen(env, DecRef, DecRefData(), src);
return result;
}
return nullptr;
}
It's basically the same as the irgen version, but by doing it in the right place, we optimize more cases! Make this change, get it to compile, and verify that you can see the optimization on the test case. Then kick off diff testing and benchmarking!
Your first exercise is to think about how else we could extend this optimization. Do so before reading on!
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
For all of these extensions, be sure to create a small test case that shows the current, suboptimal code generation first. Then make the optimization and confirm that the test case is improved!