Unbounded Recursion

The JVM (and hence, GraalVM) do not have support for segmented stacks, and hence do not allow for computation of unbounded recursion - if you make too many recursive function calls you can cause your stack to overflow. Quite obviously, this is a big problem for a functional language where recursion is the primary construct for looping.

There are two main categories of solution for working with unbounded recursion:

Segmented Stacks: If you have the ability to allocate stacks on the heap you can allocate the stack in segments as it grows, meaning that the upper limit on the size of your stack is
Continuation Passing Style (CPS): A program in CPS is one in which the flow of control is passed explicitly as a function of one argument (the continuation). The significant benefit of this is that it means that all calls are made in tail position, and hence no new stack frame needs to be allocated.

This document contains the details of designs and experiments for allowing the use of unbounded recursion in Enso on GraalVM.

A Baseline
Emulating Stack Segmentation with Threads
Avoiding Stack Usage via a CPS Transform
- The CPS Transform
- A Hybrid Approach
Linearised Representations
Alternatives
Open Questions

A Baseline

In an ideal world, we'd like the performance of Enso's recursive calls to approximate that of Haskell, which can be made to have fairly optimal performance for a functional language. Basic measurements for a Haskell program that sums the numbers up to 1 million are as follows:

Non-TCO: 20-25 ms/op
TCO: 0.8-1 ms/op

All benchmarks in the sections below are written in pure Java rather than in Enso itself. This is to allow us to estimate the maximum theoretical performance possible when executing on the JVM. They have been run on GraalVM 19.1.0, and perform the same summation of integers. They have a variable threshold, listed in the results as inputSize.

Emulating Stack Segmentation with Threads

As each new thread has its own stack, we can exploit this to emulate the notion of split stacks as used in many functional programming languages. The basic idea is to work out when you're about to run out of stack space,

When to Spawn a Thread

One of the main problems with this approach is that you want to make as much use out of the stack for a given thread as possible. However, it is very difficult to get an accurate idea of when a stack may be about to overflow. There are two main approaches:

Conservative Counting: You can explicitly maintain a counter that records the depth of your call stack.
Catching the Overflow: When a thread on the JVM overflows, it throws a StackOverflowError, thus giving information as to when you've run out of stack space.

It may at first be apparent that you can rely on some other details of how JVM stacks are implemented, but the JVM spec is very loose with regards to what it permits as a valid stack implementation. This means that from a specification perspective there is very little that could actually be relied upon.

Conservative Counting

A naive and obvious solution is to maintain a counter that tracks the depth of your call stack. This would allow you to make a conservative estimate of the amount of stack you have remaining, and spawn a new thread at some threshold.

Of course, the main issue with this is that the stacks you have available become significantly under-utilised as the threshold has to be set such that overflow is impossible.

We did some brief testing to experiment with the 'depth limit' to find a rough estimate for how much utilisation we could get out of the thread stacks before they overflowed. In practice this seemed to be around 2000, though some runs could have it set higher. Using this value gave the following results.

Benchmark                 (inputSize)  Mode  Cnt    Score    Error  Units
Main.testCountedExecutor          100  avgt    5    0.001 ±  0.001  ms/op
Main.testCountedExecutor         1000  avgt    5    0.008 ±  0.004  ms/op
Main.testCountedExecutor        10000  avgt    5    0.951 ±  0.095  ms/op
Main.testCountedExecutor        50000  avgt    5    7.279 ±  2.476  ms/op
Main.testCountedExecutor       100000  avgt    5   12.790 ±  1.101  ms/op
Main.testCountedExecutor      1000000  avgt    5  107.034 ±  2.076  ms/op

As is obvious this is quite slow when compared to the Haskell case, with around a 5x slowdown. A significant amount of the time appears to be spent on OS-level context switches, as the smaller cases that fit into the stack of a single thread are approximately equal to Haskell. It is hence possible that a method that reduces the cost of context switching could make this approach feasible.

Catching the Overflow

Though it is heavily recommended against by the Java documentation, it is indeed possible to catch the StackOverflowError. While this provides accurate info about when you run out of stack space, it has one major problem: you may not unwind enough to have enough stack space to spawn a new thread.

The following is a potential algorithm that ignores this problem for the moment:

Each recursive call is wrapped in a try {} catch (StackOverflowError e) {} block in order to detect when the stack overflows.
All side-effecting operations must take place within a single Java frame.
When the stack overflows, a StackOverflowError is thrown at frame creation.
This can be caught, with control-flow entering the catch block.
A new thread is spawned to continue the computation.

This works because the StackOverflowError is thrown when the attempt to create the new stack frame is made. This means that in the failure case none of the function body has executed so we can safely resume on a new thread.

The main issue with this design is ensuring that there is enough stack space after the unwind to the catch block. If there isn't enough, then it proves impossible to spawn a new thread and this doesn't work.

The benchmarks listed here implement this algorithm without actually performing any significant computation.

Benchmark            (inputSize)  Mode  Cnt    Score    Error  Units
Main.testSOExecutor          100  avgt    5   ≈ 10⁻⁴           ms/op
Main.testSOExecutor         1000  avgt    5    0.003 ±  0.001  ms/op
Main.testSOExecutor        10000  avgt    5    0.031 ±  0.001  ms/op
Main.testSOExecutor        50000  avgt    5    3.927 ±  0.477  ms/op
Main.testSOExecutor       100000  avgt    5    7.724 ±  0.239  ms/op
Main.testSOExecutor      1000000  avgt    5  104.719 ± 11.411  ms/op

This performs slightly better than the conservative option discussed above. As we're guaranteed total utilisation of the stack of each thread we spawn less threads and hence reduce the context switching overhead. Nevertheless, this is still very slow compared to Haskell baseline.

Thread Pools

While a thread pool is conventionally seen as a way to amortise the cost of spawning threads, this approach to recursion requires far more threads than is really feasible to keep around in a pool, so we've not explored that approach.

Project Loom

If project loom's coroutines and / or fibres were stable, these would likely help somewhat by reducing the thread creation overhead that is primarily down to OS-level context switches.

However, Loom doesn't currently seem like a viable solution to this approach as it is not only far from stable, but also has no guarantee that it will actually make it into the JVM.

Avoiding Stack Usage via a CPS Transform

Transforming recursive calls into CPS allows us to avoid the need for using the stack instead of trying to augment it. This could be implemented as a global transformation, or as a local one only for recursive calls.

Benchmark     (inputSize)  Mode  Cnt   Score   Error  Units
Main.testCPS          100  avgt    5   0.001 ±  0.001  ms/op
Main.testCPS         1000  avgt    5   0.014 ±  0.004  ms/op
Main.testCPS        10000  avgt    5   0.197 ±  0.038  ms/op
Main.testCPS        50000  avgt    5   1.075 ±  0.269  ms/op
Main.testCPS       100000  avgt    5   2.258 ±  0.310  ms/op
Main.testCPS      1000000  avgt    5  27.002 ±  2.059  ms/op

The CPS-based approach is very much a trade-off. The code that is actually being executed is more complex, showing an order of magnitude slowdown in the cases where the execution profile fits into a single stack. However, once the input size grows to the point that additional stack segments are needed, the execution performance is within spitting distance of the Haskell code.

The CPS Transform

While it is tempting to perform the CPS transform globally for the whole program, this has some major drawbacks:

As shown above, the code becomes an order of magnitude slower within the space of a single stack.
It may be difficult to maintain a mapping from the original code to the CPS'd execution. This would greatly impact our ability to use the debugging and introspection tools which are necessary for implementing Enso Studio.

As a result, an ideal design would involve only performing the CPS transformation on code which is actually recursive. While you can detect this statically via whole-program analysis, you can also track execution on the program stack in a thread-safe manner and perform the transformation at runtime (e.g. private static ThreadLocal<Boolean> isExecuting;).

A Hybrid Approach

As we clearly don't want to CPS transform the program globally, we need some mechanism by which we can rewrite only when necessary. As discussed above, we could do this via a dynamic runtime analysis, but we could also potentially make use of the Java stack at least in part.

The hybrid approach works as follows:

Execute the code using standard recursion on the Java stack until we catch a StackOverflowError.
Spawn a new thread to rewrite the original code to CPS, and then continue execution in that style.

This avoids the CPS overhead as much as possible (when the computation fits into the Java stack), but allows for unbounded recursion in the general case. The performance profile is as follows.

Benchmark        (inputSize)  Mode  Cnt   Score    Error  Units
Main.testHybrid          100  avgt    5  ≈ 10⁻⁴           ms/op
Main.testHybrid         1000  avgt    5   0.003 ±  0.001  ms/op
Main.testHybrid        10000  avgt    5   0.013 ±  0.003  ms/op
Main.testHybrid        50000  avgt    5   0.069 ±  0.013  ms/op
Main.testHybrid       100000  avgt    5   1.765 ±  0.056  ms/op
Main.testHybrid      1000000  avgt    5  25.961 ±  2.775  ms/op

This hybrid implementation makes things faster overall, with some particularly good performance wins for the smaller cases.

An open question for this is how you work out exactly what code to CPS transform at the point of the stack overflow. In the simply-recursive case this is trivial, but it may require some more sophisticated tracing in the case of mutually-recursive functions.

Linearised Representations

While not something that we could feasibly do at the moment, one of the potential solutions for this is to statically compile the language to a linearised representation. Rather than trying to implement the CPS transform in a Truffle interpreter not designed for it, we could instead compile Enso to a low-level IR format which has no stack frames, and instead just uses jumps.

Whether we write this IR ourselves or use an existing one implemented as a Truffle language, such as WASM bytecode (currently very experimental) or LLVM IR (much more tried and tested), this would provide a number of benefits:

The IR output by the compiler phase need not be fed into the truffle interpreter for said IR.
We gain more flexibility.
We can still support interoperation with foreign languages through Truffle.

However, such an approach also has some major downsides:

We do not have the time to pursue such an approach in the short term.
Such an approach would require significantly more work, as generating linear IR representations is not as simple as generating a high-level truffle node IR.
Such an approach adds quite a lot of complexity to the compiler pipeline, which, is currently tied quite strongly into the Truffle language life-cycle.

Alternatives

At the current time there are no apparent alternatives to the three approaches discussed above. While it would be ideal for the JVM to have native support for stack segmentation on the heap, this would likely be an in-depth and significant amount of work to add, with no guarantee that it would be accepted into main.

Open Questions

The following are questions for which we don't yet have answers:

Are there any ways to instrument a JVM thread to detect when it's about to stack overflow?
Based on our investigation, what would your recommendation be for us to proceed?