docs/runtime/ir-caching.md
One of the largest pain points for Enso users at the moment is that the engine has to precompile the entire standard library on every project load. In essence, this is because while the current parser (rewritten in Rust) is fast, the compiler passes that follow remain abysmally slow and incredibly demanding. The obvious way to improve this is to take the compiler passes out of the equation entirely, by serializing their IR output.
To that end, we want to serialize the Enso IR to a format that can later be read back in, bypassing the parser and many compiler passes entirely. Furthermore, by moving the boundary at which this serialization takes place to the end of the compiler pipeline, we bypass most of the compilation work and further improve startup performance.
Using classical Java serialization turned out to be unsuitably slow. Rather than switching to another serialization framework that does the same thing, only faster, we decided in PR-8207 to create our own persistence framework that radically changes the way we can read the caches. Rather than loading all the megabytes of stored data, it reads them lazily, on demand.
Use the following commands to generate the Javadoc for the org.enso.persist package:

```bash
enso$ find lib/java/persistance/src/main/java/ | grep java$ | xargs /graalvm-24/bin/javadoc -d target/javadoc/ --snippet-path lib/java/persistance/src/test/java/
enso$ links target/javadoc/index.html
```
In order to maximize the benefits of this process, we want to serialize the IR
as late in the compiler pipeline as possible. This means serializing it just
before the code generation step that generates Truffle nodes (before the
RuntimeStubsGenerator and IrToTruffle run).
This serialization should take place in an offloaded thread so that it doesn't block the compiler from continuing.
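The offloading idea can be sketched with a plain JDK executor. This is a hypothetical helper, not the actual Enso code; the class and method names are invented for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: serialization is handed off to a single background thread so the
// compiler pipeline never blocks on disk I/O.
public class OffloadedSerializer {
    private final ExecutorService pool =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "ir-cache-writer");
            t.setDaemon(true); // never keep the VM alive just to write caches
            return t;
        });

    /** Schedules the (hypothetical) serialization job and returns immediately. */
    public Future<?> serializeLater(Runnable writeIr) {
        return pool.submit(writeIr);
    }

    public static void main(String[] args) throws Exception {
        OffloadedSerializer s = new OffloadedSerializer();
        Future<?> done = s.serializeLater(() -> System.out.println("IR written"));
        done.get(); // only this demo waits; the compiler would carry on
    }
}
```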
Doing this naïvely, however, means that we can inadvertently end up serializing
the entire module graph. This is due to the BindingsMap, which contains a
reference to the associated runtime.Module, from which there is a reference to
the ModuleScope. The ModuleScope may then reference other runtime.Modules
which all contain IR.Modules. Therefore, done in a silly fashion, we end up
serializing the entire reachable module graph. This is not what we want.
The Persistance.Pool write method allows an additional writeReplace function to be associated with it. The IR caching system uses such a function to perform the following modifications just before ProcessingPass.Metadata instances are stored down:

- BindingsMap and its child types are able to contain an unlinked module pointer, case class ModulePointer(qualifiedName: List[String]), in place of a Module.
- The MetadataStorage type that holds the BindingsMap is mutable, so it might be tempting to update it in place, but relying on the writeReplace mechanism is safer, as it only changes the format of the object being written down, rather than modifying objects of the live IR, which are potentially shared with other parts of the system.

Having done this, we have broken any links that the IR may hold between modules, and can serialize each module individually.
It may be safer to duplicate the IR before handing it to serialization, but
it shouldn't be necessary if the writeReplace function is written correctly.
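The writeReplace idea can be illustrated with stand-in types. Module and ModulePointer below are hypothetical Java mirrors of the Scala classes named above, not the real runtime types.

```java
import java.util.List;

// Sketch: before an object graph is written, every live Module reference is
// swapped for an unlinked ModulePointer, so serialization cannot chase the
// whole reachable module graph.
public class WriteReplaceSketch {
    record Module(List<String> qualifiedName) {}        // stands in for runtime.Module
    record ModulePointer(List<String> qualifiedName) {} // unlinked reference

    /** The replacement applied to each object just before it is written down. */
    static Object writeReplace(Object obj) {
        if (obj instanceof Module m) {
            return new ModulePointer(m.qualifiedName()); // break the link
        }
        return obj; // everything else is written unchanged
    }

    public static void main(String[] args) {
        Object replaced = writeReplace(new Module(List.of("Standard", "Base")));
        System.out.println(replaced instanceof ModulePointer); // prints: true
    }
}
```

Note that the live Module object is never mutated; only the serialized form changes, which is exactly why this is safer than updating the metadata in place.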
The serialized IR needs to be stored in a location tied to the library it serializes. At the same time, we also want to be able to ship cached IR with libraries. This leads to a two-pronged solution where we check two locations for the cache.
- With the library distribution: as libraries can have a hidden .enso directory, we can use a path within it for caching. This should be $package/.enso/cache/ir/enso-$version/, and can be accessed by extending the pkg library to be aware of the cache directories. This location is used for .bindings and suggestion caches.
- Per user: system shared library locations may not be writeable, so we need a fallback out-of-line cache that is used if the first one is not writeable. This cache lives under $ENSO_DATA (whose location can be obtained from the RuntimeDistributionManager), at the path $ENSO_DATA/cache/ir/$namespace/$libraryName/$version/enso-$version/.

The per-user location is used for storing .ir cache files for individual Enso modules. Each IR file sits in a directory modelled after its module path, in a file named after the module itself with the extension .ir (e.g. the IR for Standard.Base.Data.Vector is stored in Standard/Base/Data/Vector.ir).
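The mapping from a qualified module name to the per-user cache path can be sketched as follows. This is a hypothetical helper following the layout described above, not an actual Enso API.

```java
import java.nio.file.Path;

// Sketch: build the per-user .ir cache path
//   $ENSO_DATA/cache/ir/$namespace/$libraryName/$version/enso-$version/<module path>.ir
public class CachePaths {
    static Path irCachePath(Path ensoData, String namespace, String libraryName,
                            String libVersion, String ensoVersion,
                            String[] modulePath) {
        Path dir = ensoData.resolve("cache").resolve("ir")
            .resolve(namespace).resolve(libraryName)
            .resolve(libVersion).resolve("enso-" + ensoVersion);
        // directories modelled after the module path, last segment names the file
        for (int i = 0; i < modulePath.length - 1; i++) {
            dir = dir.resolve(modulePath[i]);
        }
        return dir.resolve(modulePath[modulePath.length - 1] + ".ir");
    }

    public static void main(String[] args) {
        System.out.println(irCachePath(Path.of("/enso-data"), "Standard", "Base",
            "0.0.0-dev", "2024.1",
            new String[] {"Standard", "Base", "Data", "Vector"}));
    }
}
```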
An associated metadata file is located right next to the corresponding cache file. Storage of the IR only takes place if the intended location for that IR is empty.
The metadata is used for integrity checking of the cached IR, to prevent loading corrupted or out-of-date data from the cache. Because an engine can only load IR created by its own version, and cached IR is located in a directory named after the engine version, this format need not be forward compatible.
It is a JSON file as follows:

```typescript
{
  sourceHash: String;       // The hash of the corresponding source file.
  blobHash: String;         // The hash of the blob.
  compilationStage: String; // The compilation stage of the IR.
}
```

All hashes are encoded in SHA-1 format, for performance reasons. The engine version is encoded in the cache path, and hence does not need to be explicitly specified in the metadata.
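Computing such a SHA-1 digest needs nothing beyond the JDK; the helper below is a standalone sketch, not code taken from the engine.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: hex-encoded SHA-1 digest, the format the metadata hashes use.
public class Sha1Hash {
    static String sha1Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hash some source text exactly as one would hash a source file's bytes.
        System.out.println(sha1Hex("IO.println \"Hello!\"".getBytes(StandardCharsets.UTF_8)));
    }
}
```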
There are two static methods in the Persistance class to help create a byte[] from a single object and then read it back. The array is identified by a header, i.e. 12 bytes of overhead before the actual data starts. The following versioning is recommended when making a change:

- when changing the Persistance implementation, change the first four bytes of the built-in header;
- when changing a Persistance.writeObject method, change its ID.

That way the same version of Enso will recognize its .ir files, while different versions of Enso will realize that the files aren't in a suitable form.
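A header like this can be checked cheaply before any deserialization happens. The concrete layout below (four magic bytes plus an 8-byte version stamp, totalling 12 bytes) is an assumption for illustration, not the layout read from the Persistance sources.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch: a 12-byte header = 4 magic bytes + 8-byte version stamp (assumed layout).
public class HeaderCheck {
    static final byte[] MAGIC = {'e', 'n', 's', 'o'}; // hypothetical magic bytes

    static ByteBuffer writeHeader(long versionStamp) {
        ByteBuffer b = ByteBuffer.allocate(12);
        b.put(MAGIC);
        b.putLong(versionStamp);
        b.flip();
        return b;
    }

    /** Rejects a blob whose magic or version stamp doesn't match this engine. */
    static boolean headerMatches(ByteBuffer blob, long expectedStamp) {
        byte[] magic = new byte[4];
        blob.get(magic);
        return Arrays.equals(magic, MAGIC) && blob.getLong() == expectedStamp;
    }

    public static void main(String[] args) {
        ByteBuffer header = writeHeader(2024L);
        System.out.println(headerMatches(header, 2024L)); // prints: true
    }
}
```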
Every Persistance class has a unique identifier. In order to keep definitions consistent, one should not attempt to use smaller IDs than previously assigned. One should also not delete any Persistance classes.
Additionally, PerMap.serialVersionUID provides a seed for the version stamp calculated from all Persistance classes. Increasing serialVersionUID will invalidate all caches.
Loading the IR is a multi-stage process that involves performing integrity checks on the loaded cache. It works as follows:

1. Look in the library distribution's .enso/cache folder. If there is a .bindings file, assume it is up to date and use it.
2. If there is no such file in the library distribution, look in the per-user directory under $ENSO_DATA for the module's .ir file.
3. Deserialize the .ir file. If deserialization fails in any way, immediately fall back to parsing the source file.
4. Callers of Persistance.Pool.read provide their own readResolve function. Such a function gets a chance to change and replace each object read in with a variant appropriate to the whole compiler environment.

The main subtlety here is handling the dependencies between modules. We need to ensure that, when loading multiple cached libraries, we properly handle them one by one. Doing this is as simple as hooking into Compiler::parseModule and setting AFTER_STATIC_PASSES as the compilation state after loading the module. This will tie into the current ImportsResolver and ExportsResolver, which are run in an un-gated fashion in Compiler::run.
Unlike classical Java deserialization, only registered Persistance subclasses may participate in deserialization, making it much safer and less vulnerable.
For a cache to be usable, the following properties need to be satisfied:

- sourceHash must match the hash of the corresponding source file;
- blobHash must match the hash of the corresponding .ir file.

If either of these checks fails, the cache file should be deleted where possible, or ignored if it is in a read-only location.
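The usability check above amounts to two string comparisons. The Metadata record below is a hypothetical stand-in for the parsed metadata JSON.

```java
// Sketch: a cache is usable only if both recomputed hashes match the metadata.
public class IntegrityCheck {
    record Metadata(String sourceHash, String blobHash) {} // stand-in for the JSON

    static boolean isUsable(Metadata meta, String actualSourceHash, String actualBlobHash) {
        return meta.sourceHash().equals(actualSourceHash)
            && meta.blobHash().equals(actualBlobHash);
    }

    public static void main(String[] args) {
        Metadata meta = new Metadata("aaa", "bbb");
        System.out.println(isUsable(meta, "aaa", "bbb")); // prints: true
        System.out.println(isUsable(meta, "aaa", "ccc")); // prints: false (stale blob)
    }
}
```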
It is important, as part of this, that we fail into a working state under all circumstances. This means that at no point should this mechanism be exposed to the user in any visible way, other than the fact that they may see the actual files on disk.
Integrity checking does not cover the situation where a cached module imports a module whose cache has been invalidated. For example, module A uses a method foo from module B, and a successful compilation resulted in IR caches for both A and B. Later, someone modified module B by renaming method foo to bar. If we only compared source hashes, B's IR would be regenerated while A's would be loaded from the cache, thus failing to notice the method rename until a complete cache invalidation was forced.
Therefore, the compiler performs an additional check, invalidating a module's cache if any of its imported modules have been invalidated.
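This transitive check is essentially a reachability walk over reversed import edges. The sketch below uses a plain string-keyed graph; the module names and graph representation are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: a module's cache is invalid if any module it (transitively) imports
// has an invalid cache.
public class TransitiveInvalidation {
    static Set<String> invalidate(Map<String, List<String>> imports,
                                  Set<String> directlyInvalid) {
        // Reverse the edges: for each imported module, record who imports it.
        Map<String, List<String>> importers = new HashMap<>();
        imports.forEach((m, deps) -> deps.forEach(
            d -> importers.computeIfAbsent(d, k -> new ArrayList<>()).add(m)));
        // Walk outwards from the directly invalidated modules.
        Deque<String> queue = new ArrayDeque<>(directlyInvalid);
        Set<String> invalid = new HashSet<>(directlyInvalid);
        while (!queue.isEmpty()) {
            for (String importer : importers.getOrDefault(queue.pop(), List.of())) {
                if (invalid.add(importer)) queue.push(importer);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        // A imports B; B's source changed, so both caches must be invalidated.
        Map<String, List<String>> imports = Map.of("A", List.of("B"), "B", List.of());
        System.out.println(invalidate(imports, Set.of("B"))); // contains both A and B
    }
}
```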
There are two main elements that need to be tested as part of this feature:

- The persistance project comes with its own unit tests.
- The runtime-parser project adds tests of various core classes used during IR serialization, like Scala List, checks of the laziness of Scala Seq, and needs BindingsMap to work properly.
- End-to-end tests should set $ENSO_DATA to a temporary directory and then directly interact with the filesystem. Caching should be disabled for existing tests. This will require adding additional runtime options for debugging, but also constructing the DistributionManager on context creation (removing RuntimeDistributionManager).

Import and export resolution is one of the more expensive elements in the initial pipeline. It is also an element that does not change for released library components, as we do not expect users to modify them. During the initial compilation stage we iteratively parse (or load cached IR for) each module, perform import resolution on it, followed by export resolution, and repeat the process with any dependent modules discovered along the way. Calculating such a transitive closure is an expensive and repetitive process. By caching bindings per library we are able to skip that process completely and discover all necessary modules of the library in a single pass.
The bindings are serialized along with the library caches in a file with a
.bindings suffix.
Furthermore, the storage of .ir files uses lazy Seq references to separate the general part of the IR tree from the elements representing method bodies. As such, the compiler can process the structure of .ir files while avoiding loading the IR for methods that aren't being executed.
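The core of this laziness is a memoizing reference that defers deserialization until first access. The class below is a minimal Java sketch of that idea (the real framework uses lazy Scala Seq references inside the .ir blob, not this type).

```java
import java.util.function.Supplier;

// Sketch: a memoizing lazy reference; the loader (e.g. a deserializer over a
// slice of the .ir blob) runs at most once, on first access.
public class LazyRef<T> {
    private Supplier<T> loader;
    private T value;

    public LazyRef(Supplier<T> loader) { this.loader = loader; }

    public synchronized T get() {
        if (loader != null) {   // first access: deserialize and memoize
            value = loader.get();
            loader = null;      // drop the loader so its buffers can be collected
        }
        return value;
    }

    public static void main(String[] args) {
        int[] loads = {0};
        LazyRef<String> body = new LazyRef<>(() -> { loads[0]++; return "method body IR"; });
        body.get();
        body.get();
        System.out.println(loads[0]); // prints: 1 — deserialized only once
    }
}
```

Dropping the loader after the first read also hints at the GC experiments mentioned below: once a method body has been used for code generation, nothing needs to keep the underlying buffers alive.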
The Persistance framework gives us laziness opportunities and we should use them more:

- have a single blob with all IRs per library and read only the parts that are needed;
- experiment with GC: release parts of the IR once they are no longer needed (after code generation and the like);
- make the .ir files smaller where possible.
The use of Persistance has already sped up the execution time of simple
IO.println "Hello!" by 16% - let's use it to speed things up even more.