design/coroutines.md
In the past, Flow implemented an actor model by shipping its own compiler, which extended the C++ language with a few additional keywords. While still supported, this approach is deprecated in favor of standard C++20 coroutines.
Coroutines are meant to be simple, look like serial code, and be easy to reason about. A simple coroutine function can look like this:
```cpp
Future<double> simpleCoroutine() {
    double begin = now();
    co_await delay(1.0);
    co_return now() - begin;
}
```
This document assumes some familiarity with Flow. As of today, actors and coroutines can be freely mixed, but new code should be written using coroutines.
For detailed performance analysis, benchmarking results, and optimization techniques, see COROUTINE_PERF_ANALYSIS.md. The key summary is that C++20 coroutines show pattern-dependent performance (see the benchmark table near the end of this document).
It is important to understand that C++ coroutine support doesn't change anything in Flow: coroutines are not a replacement for Flow, they merely replace the actor compiler with a plain C++ compiler. This means that the network loop, all Flow types, the RPC layer, and the simulator all remain unchanged. A coroutine simply returns a special SAV<T> which holds a handle to the coroutine.
As defined in the C++20 standard, a function is a coroutine if its body contains at least one co_await, co_yield,
or co_return statement. However, in order for this to work, the return type needs an underlying coroutine
implementation. Flow provides these for the following types:
- Future<T> is the primary type we use for coroutines. A coroutine returning Future<T> is allowed to co_await other coroutines and can co_return a single value. co_yield is not implemented by this type.
- Future<Void>. Void-futures are what a user would probably expect Future<> to be (it has this type for historical reasons and to provide compatibility with old Flow ACTORs). A coroutine with return type Future<Void> must not return a value: it either runs until the end of its body, or it terminates early with a plain co_return.
- Generator<T> can return a stream of values. However, it can't co_await other coroutines. Generators are useful for streams where the values are lazily computed but don't involve any IO.
- AsyncGenerator<T> is similar to Generator<T> in that it can return a stream of values, but in addition it can also co_await other coroutines. Because of that, it is slightly less efficient than Generator<T>. AsyncGenerator<T> should be used whenever values should be lazily generated AND need IO. It is an alternative to PromiseStream<T>: a PromiseStream<T> can be more efficient, but an AsyncGenerator<T> is easier to use correctly.

A more detailed explanation of Generator<T> and AsyncGenerator<T> can be found further down.
In actor compiled code we were able to use the keywords choose and when to wait on a
statically known number of futures and execute corresponding code. Something like this:
```cpp
choose {
    when(wait(future1)) {
        // do something
    }
    when(Foo f = wait(foo())) {
        // do something else
    }
}
```
Since choose-when is actor-compiler functionality, we can't use it with C++ coroutines. For most coroutine conversions, prefer race(...) and then branch on the returned std::variant. This keeps the control flow explicit and usually maps more directly to what the coroutine will do after the wait. Choose is still available for cases where the old choose-when shape is the clearest fit, or where you need its ordered evaluation behavior described below.
For example, this actor pattern:
```cpp
choose {
    when(R res = wait(f1)) {
        return res;
    }
    when(wait(timeout(...))) {
        throw io_timeout();
    }
}
```
should usually become:
```cpp
auto result = co_await race(f1, timeout(...));
if (result.index() == 0) {
    co_return std::get<0>(result);
}
throw io_timeout();
```
Choose remains useful when you specifically want callback-style handling of the winner:
```cpp
co_await Choose()
    .When(future1, [](Void const&) {
        // do something
    })
    .When(foo(), [](Foo const& f) {
        // do something else
    })
    .run();
```
While Choose and choose behave very similarly, there are some minor differences between the two, explained below.
In the above example there is one potentially important difference between the old and the new style: in the old actor code, the statement when(Foo f = wait(foo())) is only executed, and foo() therefore only called, if future1 is not ready. Depending on the intent of the statement, this could be desirable. Since Choose::When is a normal method, foo() will be evaluated whether or not an earlier future was already ready.
This can be worked around by passing a lambda that returns a Future instead:
```cpp
co_await Choose()
    .When(future1, [](Void const&) {
        // do something
    })
    .When([]() { return foo(); }, [](Foo const& f) {
        // do something else
    })
    .run();
```
The implementation of When will guarantee that this lambda will only be executed if all previous
When calls didn't receive a ready future.
In FDB we sometimes see this pattern:
```cpp
loop {
    choose {
        when(RequestA req = waitNext(requestAStream.getFuture())) {
            wait(handleRequestA(req));
        }
        when(RequestB req = waitNext(requestBStream.getFuture())) {
            wait(handleRequestB(req));
        }
        //...
    }
}
```
This is not possible to do with Choose. However, this is done deliberately as the above is
considered an antipattern: This means that we can't serve two requests concurrently since the loop
won't execute until the request has been served. Instead, this should be written like this:
```cpp
state ActorCollection actors(false);
loop {
    choose {
        when(RequestA req = waitNext(requestAStream.getFuture())) {
            actors.add(handleRequestA(req));
        }
        when(RequestB req = waitNext(requestBStream.getFuture())) {
            actors.add(handleRequestB(req));
        }
        //...
        when(wait(actors.getResult())) {
            // this only makes sure that errors are thrown correctly
            UNREACHABLE();
        }
    }
}
```
And so the above can easily be rewritten using Choose:
```cpp
ActorCollection actors(false);
while (true) {
    co_await Choose()
        .When(requestAStream.getFuture(), [&actors](RequestA const& req) {
            actors.add(handleRequestA(req));
        })
        .When(requestBStream.getFuture(), [&actors](RequestB const& req) {
            actors.add(handleRequestB(req));
        })
        .When(actors.getResult(), [](Void const&) {
            UNREACHABLE();
        })
        .run();
}
```
## choose-when

When porting actor code, use this rule of thumb:

- race(...) when the choose picks one winner and the code then branches on which future completed.
- Choose() when you need ordered when evaluation, lazy creation of later futures, or when the callback style is genuinely clearer than branching on a std::variant.
- quorum, waitForAll, waitForAllReady, or operator|| when they express the intent better than either race or Choose.

race(...) is usually the best direct replacement for patterns like:
```cpp
choose {
    when(R res = wait(f1)) {
        return res;
    }
    when(S res = wait(f2)) {
        return use(res);
    }
}
```
which becomes:
```cpp
auto result = co_await race(f1, f2);
if (result.index() == 0) {
    co_return std::get<0>(result);
}
co_return use(std::get<1>(result));
```
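Branching on the winner is plain std::variant handling. The following standalone sketch (no Flow types; RaceResult, use, and handleWinner are illustrative names) shows the shape of the code after the co_await:

```cpp
#include <cassert>
#include <string>
#include <variant>

// Stand-in for the value produced by `co_await race(f1, f2)`:
// index 0 means f1 finished first, index 1 means f2 did.
using RaceResult = std::variant<int, std::string>;

int use(const std::string& s) {
    return static_cast<int>(s.size()); // placeholder for real post-processing
}

int handleWinner(const RaceResult& result) {
    if (result.index() == 0) {
        return std::get<0>(result); // like: co_return std::get<0>(result);
    }
    return use(std::get<1>(result)); // like: co_return use(std::get<1>(result));
}
```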
However, choose-when is often overkill, and other facilities like quorum and operator|| should be used instead. For example, this:
```cpp
choose {
    when(R res = wait(f1)) {
        return res;
    }
    when(wait(timeout(...))) {
        throw io_timeout();
    }
}
```
Should be written like this:
```cpp
co_await (f1 || timeout(...));
if (f1.isReady()) {
    co_return f1.get();
}
throw io_timeout();
```
(The above could also be packed into a helper function in genericactors.actor.h).
With C++ coroutines we introduce two new basic types in Flow: Generator<T> and AsyncGenerator<T>. A generator is a special type of coroutine that can return multiple values. Despite their similar names, Generator<T> and AsyncGenerator<T> implement different interfaces and serve very different purposes.
Generator<T> conforms to the input_iterator concept, so it can be used like a normal iterator (with the exception that copying the iterator has different semantics). This also means it can be used with the new ranges library introduced in C++20.
AsyncGenerator<T> implements the () operator which returns a new value every time it is called. However, this value
HAS to be waited for (dropping it and attempting to call () again will result in undefined behavior!). This semantic
difference allows an author to mix co_await and co_yield statements in a coroutine returning AsyncGenerator<T>.
Since generators can produce infinitely long streams, they are useful in places where we'd otherwise write a more complex in-line loop. For example, consider the code in masterserver.actor.cpp that is responsible for generating version numbers. The logic for this currently lives in a long function. With a Generator<T> it can be isolated in one simple coroutine (which can be a direct member of MasterData). A simplified version of such a generator could look as follows:
```cpp
Generator<Version> MasterData::versionGenerator() {
    auto prevVersion = lastEpochEnd;
    auto lastVersionTime = now();
    while (true) {
        auto t1 = now();
        Version toAdd =
            std::max<Version>(1,
                              std::min<Version>(SERVER_KNOBS->MAX_READ_TRANSACTION_LIFE_VERSIONS,
                                                SERVER_KNOBS->VERSIONS_PER_SECOND * (t1 - lastVersionTime)));
        lastVersionTime = t1;
        co_yield prevVersion + toAdd;
        prevVersion += toAdd;
    }
}
```
Now that the logic to compute versions is separated, MasterData can simply create an instance of Generator<Version> by calling auto vGenerator = MasterData::versionGenerator(); (and possibly storing it as a class member). It can then access the current version with *vGenerator and advance to the next version by incrementing the iterator (++vGenerator).
AsyncGenerator<T> should be used in some places where we used promise streams before (though not all of them, this
topic is discussed a bit later). For example:
```cpp
template <class T, class F>
AsyncGenerator<T> filter(AsyncGenerator<T> gen, F pred) {
    while (gen) {
        auto val = co_await gen();
        if (pred(val)) {
            co_yield val;
        }
    }
}
```
Note how much simpler this function is compared to the old flow function:
```cpp
ACTOR template <class T, class F>
Future<Void> filter(FutureStream<T> input, F pred, PromiseStream<T> output) {
    loop {
        try {
            T nextInput = waitNext(input);
            if (pred(nextInput))
                output.send(nextInput);
        } catch (Error& e) {
            if (e.code() == error_code_end_of_stream) {
                break;
            } else
                throw;
        }
    }
    output.sendError(end_of_stream());
    return Void();
}
```
A FutureStream can be converted into an AsyncGenerator by using a simple helper function:
```cpp
template <class T>
AsyncGenerator<T> toGenerator(FutureStream<T> stream) {
    while (true) {
        try {
            co_yield co_await stream;
        } catch (Error& e) {
            if (e.code() == error_code_end_of_stream) {
                co_return;
            }
            throw;
        }
    }
}
```
Generator<T> can be used like an input iterator. This means that it can also be used with std::ranges. Consider the following coroutine:
```cpp
// returns base^0, base^1, base^2, ...
Generator<double> powersOf(double base) {
    double curr = 1;
    while (true) {
        co_yield curr;
        curr *= base;
    }
}
```
We can use this now to generate views. For example:
```cpp
for (auto v : generatorRange(powersOf(2))
            | std::ranges::views::filter([](auto v) { return v > 10; })
            | std::ranges::views::take(10)) {
    fmt::print("{}\n", v);
}
```
The above prints the first ten powers of two greater than 10 (16 through 8192).
One major difference between async generators and tasks (coroutines returning only one value through a Future) is the execution policy: an async generator immediately suspends when it is called and only runs when the next value is requested, while a task immediately starts executing.
This is a conscious design decision. Lazy execution makes it much simpler to reason about memory ownership. For example, the following is ok:
```cpp
Generator<StringRef> randomStrings(int minLen, int maxLen) {
    Arena arena;
    auto buffer = new (arena) uint8_t[maxLen + 1];
    while (true) {
        auto sz = deterministicRandom()->randomInt(minLen, maxLen + 1);
        for (int i = 0; i < sz; ++i) {
            buffer[i] = deterministicRandom()->randomAlphaNumeric();
        }
        co_yield StringRef(buffer, sz);
    }
}
```
The above coroutine returns a stream of random strings. The memory is owned by the coroutine and so it always returns
a StringRef and then reuses the memory in the next iteration. This makes this generator very cheap to use, as it only
does one allocation in its lifetime. With eager execution, this would be much harder to write (and reason about): the
coroutine would immediately generate a string and then eagerly compute the next one when the string is retrieved.
However, in Flow a co_yield is guaranteed to suspend the coroutine until the value has been consumed (this is not generally a guarantee with co_yield -- C++ coroutines give the implementer a great degree of freedom over decisions like this).
Flow provides another mechanism to send streams of messages between actors: PromiseStream<T>. In fact,
AsyncGenerator<T> uses PromiseStream<T> internally. So when should one be used over the other?
As a general rule of thumb: whenever possible, use Generator<T>; where that isn't possible, use AsyncGenerator<T> if in doubt.
For pure computation it almost never makes sense to use a PromiseStream<T> (the only exception is if computation
can be expensive enough that co_await yield() becomes necessary). Generator<T> is more lightweight and therefore
usually more efficient. It is also easier to use.
When it comes to IO, things get a bit more tricky. Assume we want to scan a file on disk, reading it in 4k blocks. This can be done quite elegantly using a coroutine:
```cpp
AsyncGenerator<Standalone<StringRef>> blockScanner(Reference<IAsyncFile> file) {
    auto sz = co_await file->size();
    decltype(sz) offset = 0;
    constexpr decltype(sz) blockSize = 4 * 1024;
    while (offset < sz) {
        Arena arena;
        auto block = new (arena) uint8_t[blockSize];
        auto toRead = std::min(sz - offset, blockSize);
        auto r = co_await file->read(block, toRead, offset);
        co_yield Standalone<StringRef>(StringRef(block, r), arena);
        offset += r;
    }
}
```
The problem with the above generator is that we only start reading when it is invoked. If consuming a block sometimes takes a long time (for example because it has to be written somewhere), every call pays the full disk read latency.
What if we want to hide this latency? In other words: what if we want to improve throughput and end-to-end latency by prefetching?
Doing this with a generator, while not trivial, is possible. But here it might be easier to use a PromiseStream
(we can even reuse the above generator):
```cpp
Future<Void> blockScannerWithPrefetch(Reference<IAsyncFile> file,
                                      PromiseStream<Standalone<StringRef>> promise,
                                      FlowLock lock) {
    auto generator = blockScanner(file);
    while (generator) {
        {
            FlowLock::Releaser _(co_await lock.take());
            try {
                promise.send(co_await generator());
            } catch (Error& e) {
                promise.sendError(e);
                co_return;
            }
        }
        // give the caller an opportunity to take the lock
        co_await yield();
    }
}
```
With the above the caller can control the prefetching dynamically by taking the lock if the queue becomes too full.
By default, a coroutine runs until it is either done (it reaches the end of the function body or a co_return statement, or it throws an exception) or until the last Future<T> object referencing it is dropped. The second case, cancellation, is implemented as follows:

- If the reference count of the underlying SAV reaches 0 while the coroutine is suspended, the coroutine is immediately resumed and actor_cancelled is thrown within it (this allows the coroutine to do some cleanup work).
- Any further co_await expr will immediately throw actor_cancelled.

However, some coroutines aren't safe to cancel. This usually concerns disk IO operations. With ACTOR we could either use a return type of void or the UNCANCELLABLE keyword to change this behavior: in that case, calling Future<T>::cancel() would be a no-op and dropping all futures wouldn't cause cancellation.
With C++ coroutines, neither option works:

- UNCANCELLABLE is not a C++ keyword, so supporting it would require some preprocessing.
- Providing a promise_type for void isn't a good idea, as this would make any void-function potentially a coroutine.

However, this can also be seen as an opportunity: uncancellable actors are always a bit tricky to use, since the caller must keep alive all memory that the uncancellable coroutine might reference until it is done. So whenever someone calls such a coroutine, they need to be extra careful. Yet a caller might not even know that the coroutine they call is uncancellable.
We address this problem with the following definition:
Definition:
A coroutine is uncancellable if the first argument (or the second, if the coroutine is a class-member) is of type
Uncancellable
The definition of Uncancellable is trivial: struct Uncancellable {}; -- it is simply used as a marker. So now, if
a user calls an uncancellable coroutine, it will be obvious on the caller side. For example the following is never
uncancellable:
```cpp
co_await foo();
```
But this one is:
```cpp
co_await bar(Uncancellable());
```
## Porting ACTORs to C++ Coroutines

If you have an existing ACTOR, you can port it to a C++ coroutine by following these steps:

1. Remove the ACTOR keyword.
2. If the actor is marked UNCANCELLABLE, remove that keyword and make the first argument Uncancellable. If the return type of the actor is void, make it Future<Void> instead and add an Uncancellable as the first argument.
3. Remove the state modifiers from local variables.
4. Replace wait(expr) with co_await expr.
5. Replace waitNext(expr) with co_await expr.
6. Replace choose-when statements, preferring race(...) for first-ready branching; use Choose only when you need ordered when semantics or callback-style handling.

In addition, the following things should be looked out for:
Consider this code:
```cpp
Local foo;
wait(bar());
...
```
foo will be destroyed right after the wait-expression. However, after making this a coroutine:
```cpp
Local foo;
co_await bar();
...
```
foo will stay alive until we leave the scope. This is better (it is more intuitive and follows standard C++), but in some weird corner cases code might depend on the semantics that locals are destroyed when we call into wait. Look out for places where destructors do semantically important work (as in FlowLock::Releaser).
In flow/genericactors.actor.h we have a number of useful helpers. Some of them are also useful with C++ coroutines, while others only add unnecessary overhead. Look out for the latter and remove calls to them. The most important ones are success and store.
```cpp
wait(success(f));
```

becomes

```cpp
co_await f;
```

and

```cpp
wait(store(v, f));
```

becomes

```cpp
v = co_await f;
```
In certain places we use locals just to work around actor compiler limitations. Since locals use up space in the coroutine frame, they should be removed wherever it makes sense (but only if it doesn't make the code less readable!).
For example:
```cpp
Foo f = wait(foo);
bar(f);
```

might become

```cpp
bar(co_await foo);
```
Using co_await in an exception handler is a compilation error in C++, although this was legal in ACTORs. There is no general best way of addressing this, but usually it's quite easy to move the co_await expression out of the catch-block.
One place where we use this pattern a lot is in our transaction retry loop:
```cpp
state ReadYourWritesTransaction tr(db);
loop {
    try {
        Value v = wait(tr.get(key));
        tr.set(key2, val2);
        wait(tr.commit());
        return Void();
    } catch (Error& e) {
        wait(tr.onError(e));
    }
}
```
Luckily, with coroutines we can do one better and generalize the retry loop. The above could look like this:

```cpp
co_await db.run([&](ReadYourWritesTransaction* tr) -> Future<Void> {
    Value v = co_await tr->get(key);
    tr->set(key2, val2);
    co_await tr->commit();
});
```
A possible implementation of Database::run would be:
```cpp
template <std::invocable<ReadYourWritesTransaction*> Fun>
Future<Void> Database::run(Fun fun) {
    ReadYourWritesTransaction tr(*this);
    Future<Void> onError;
    while (true) {
        if (onError.isValid()) {
            co_await onError;
            onError = Future<Void>();
        }
        try {
            co_await fun(&tr);
            co_return;
        } catch (Error& e) {
            onError = tr.onError(e);
        }
    }
}
```
With actors, we often see the following pattern:
```cpp
struct Foo : IFoo {
    ACTOR static Future<Void> bar(Foo* self) {
        // use `self` here to access members of `Foo`
    }

    Future<Void> bar() override {
        return bar(this);
    }
};
```
This boilerplate is necessary because ACTORs can't be class members: the actor compiler generates another struct and moves the code there, so this would point to the actor state and not to the class instance.
With C++ coroutines this limitation goes away, so a cleaner (and slightly more efficient) implementation of the above is:
```cpp
struct Foo : IFoo {
    Future<Void> bar() override {
        // `this` can be used like in any non-coroutine. `co_await` can be used.
    }
};
```
There is one very subtle and hard to spot difference between ACTOR and a coroutine: the way some local variables are
initialized. Consider the following code:
```cpp
struct SomeStruct {
    int a;
    bool b;
};

ACTOR Future<Void> someActor() {
    // beginning of body
    state SomeStruct someStruct;
    // rest of body
}
```
For state variables, the actor compiler generates the following code to initialize SomeStruct someStruct:

```cpp
someStruct = SomeStruct();
```

This, however, is different from what one might expect, since the default constructor is explicitly invoked and the struct is therefore value-initialized. This means that if the code is translated to:
```cpp
Future<Void> someActor() {
    // beginning of body
    SomeStruct someStruct;
    // rest of body
}
```
initialization will be different. The exact equivalent instead would be something like this:
```cpp
Future<Void> someActor() {
    // beginning of body
    SomeStruct someStruct{}; // auto someStruct = SomeStruct();
    // rest of body
}
```
If the struct SomeStruct initialized its primitive members explicitly (for example with int a = 0; and bool b = false;), this would be a non-issue, and explicit member initialization is probably the right fix here. Sadly, UBSAN doesn't seem to catch these kinds of subtle bugs.
Another difference is that a state variable might be initialized twice: once at the creation of the actor using the default constructor, and a second time at the point in the code where the variable is initialized. With C++ coroutines we now get the expected behavior, which is better, but it is nonetheless a potential behavior change.
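The difference is easy to demonstrate without coroutines at all; actorStyleInit is an illustrative helper showing what the actor compiler's assignment gives you:

```cpp
#include <cassert>

struct SomeStruct {
    int a;
    bool b;
};

// `someStruct = SomeStruct();` value-initializes the struct, so its primitive
// members are zeroed. A plain `SomeStruct s;` default-initializes instead:
// `a` and `b` then hold indeterminate values, and reading them is undefined
// behavior.
SomeStruct actorStyleInit() {
    SomeStruct s{}; // equivalent to `auto s = SomeStruct();`
    return s;
}
```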
### state Variables Inside Blocks

The actor compiler hoists all state variables into the actor's state struct, regardless of C++ block scope. This means a state variable declared inside an if, else, for, or try block lives for the entire actor lifetime. In a coroutine, these become regular C++ locals that follow normal scoping rules.
This is a source of subtle bugs. Consider:
```cpp
ACTOR Future<Void> example() {
    if (someCondition) {
        state Future<Void> background = longRunningTask();
    }
    // In ACTOR code, `background` is still alive here — it was hoisted.
    wait(delay(100.0));
    return Void();
}
```
A naive conversion:
```cpp
Future<Void> example() {
    if (someCondition) {
        Future<Void> background = longRunningTask();
    }
    // BUG: `background` was destroyed at the `}` above, cancelling longRunningTask()!
    co_await delay(100.0);
}
```
The fix is to move the variable to function scope:
```cpp
Future<Void> example() {
    Future<Void> background;
    if (someCondition) {
        background = longRunningTask();
    }
    // `background` is still alive — correct.
    co_await delay(100.0);
}
```
Rule: When removing state from a variable, check whether it is declared inside a block. If so, move the
declaration to function scope.
### const& Parameters

C++20 coroutines only store a reference in the coroutine frame for const& parameters; they do not copy the argument. If the caller passes a temporary (e.g. a default argument value, or a local that goes out of scope), the reference dangles after the first suspend point.
```cpp
// DANGEROUS: if caller passes a temporary, `key` dangles after first co_await
Future<Void> doSomething(Key const& key) {
    co_await delay(1.0);
    fmt::print("{}\n", key.toString()); // potential use-after-free
}
```
The fix is to copy const& parameters to locals before the first co_await:
```cpp
Future<Void> doSomething(Key const& key) {
    Key keyCopy = key; // safe copy before any suspend
    co_await delay(1.0);
    fmt::print("{}\n", keyCopy.toString()); // OK
}
```
Rule: Copy all const& parameters to local variables before the first co_await.
### .actor.h Files

When a function is converted from ACTOR to a coroutine, any forward declarations in .actor.h files must have the ACTOR keyword removed. The actor compiler automatically adds const& to all parameters in ACTOR declarations. If you also write const& explicitly, the generated code will contain const& const&, which is a compile error.
```cpp
// workloads.actor.h — WRONG: ACTOR + const& = double const&
ACTOR Future<Void> foo(Database const& cx);

// workloads.actor.h — CORRECT: remove ACTOR since foo() is now a coroutine
Future<Void> foo(Database const& cx);
```
Converted files should be renamed from .actor.cpp to .cpp (or .actor.h to .h) since they no longer need the
actor compiler. Both fdbserver and flowbench use fdb_find_sources() in their CMakeLists.txt, which
automatically picks up files by glob, so the rename is usually sufficient without any CMake changes.
- Rename .actor.cpp to .cpp.
- Remove ACTOR from all function definitions.
- Remove UNCANCELLABLE; add Uncancellable as the first parameter instead.
- Remove state from all local variable declarations.
- Was a state variable declared inside a block (if/else/for/try)? If so, move it to function scope.
- Replace wait(expr) with co_await expr. Replace waitNext(expr) with co_await expr.
- Replace return expr with co_return expr. Replace return Void() with co_return.
- Replace choose/when by preferring race(...); use Choose only for patterns that do not map cleanly to race.
- Simplify wait(success(f)) → co_await f; wait(store(v, f)) → v = co_await f.
- Check const& parameters: copy them to a local before the first co_await.
- Remove ACTOR from any forward declarations of the converted functions in .actor.h files.

Through optimization and profiling analysis, C++20 coroutine performance has improved, reducing the gap with ACTOR-generated code from ~10% to 3-8% depending on workload patterns.
| Benchmark Type | ACTOR Performance | Coroutine Performance | Gap | Status |
| --- | --- | --- | --- | --- |
| NET2/4096 | 2.67M/s | 2.41M/s | -8.5% | Target for optimization |
| YIELD/4096 | 7.45M/s | 13.6M/s | +83% | Coroutines much faster ✅ |
| DELAY/4096 | 1.44M/s | 5.22M/s | +260% | Coroutines much faster ✅ |
| CALLBACK/1024/64 | 50.9M/s | 8.7M/s (some patterns) | -82% | Mixed results |
Coroutines excel in frame-reuse patterns (YIELD, DELAY) where a single coroutine is suspended/resumed many times.
Coroutines lag in allocation-heavy patterns (NET2) where many short-lived coroutines are created and destroyed.
- Original analysis: coroutines had 39.13% CPU overhead in final_suspend() that actors completely avoid.
- ACTORs (2.67M/s): 43.31% CPU in direct ActorCallback::fire().
- Coroutines (2.41M/s): 35.61% CPU in QuorumCallback plus other overhead, ~75% total.
- Implementation: moved SAV cleanup from final_suspend() to return_value() to match actor completion timing.
- Result: eliminated final_suspend() overhead from performance profiles (39.13% → 0.21% CPU usage).
Based on comprehensive Linux profiling of optimized coroutines:

- Compiler optimization hints: __attribute__((hot)), __attribute__((always_inline)), __attribute__((flatten))
- Branch prediction hints: [[likely]], [[unlikely]]
- Architectural changes: moving SAV cleanup from final_suspend() to return_value()
- Custom FastAllocator forcing: attempted to force frames into smaller buckets
- Frame packing: __attribute__((packed)), pointer bit-packing
- Aggressive final_suspend() bypass: attempted to skip SAV operations entirely
To generate complete actor vs coroutine performance comparison reports (matching the historical format):

```shell
cd build_output  # or your build directory
python3 ../contrib/benchmark_comparison.py
```

Output: a complete comparison across all benchmark types.
Requirements:
Usage: the tool automatically runs the benchmarks and generates a comparison report in the format matching the historical coroutine optimization reports.