design/coroutines.md
In the past, Flow implemented an actor model by shipping its own compiler, which extended the C++ language with a few additional keywords. While still supported, this approach is deprecated in favor of standard C++20 coroutines.
Coroutines are meant to be simple, look like serial code, and be easy to reason about. A simple coroutine function can look like this:
```cpp
Future<double> simpleCoroutine() {
    double begin = now();
    co_await delay(1.0);
    co_return now() - begin;
}
```
This document assumes some familiarity with Flow. As of today, actors and coroutines can be freely mixed, but new code should be written using coroutines.
For detailed performance analysis, benchmarking results, and optimization techniques, see COROUTINE_PERF_ANALYSIS.md. The key summary is that C++20 coroutines show pattern-dependent performance (see the benchmark table near the end of this document).
It is important to understand that C++ coroutine support doesn't change anything in Flow: coroutines are not a replacement for Flow, they merely replace the actor compiler with a plain C++ compiler. This means that the network loop, all Flow types, the RPC layer, and the simulator all remain unchanged. A coroutine simply returns a special SAV<T> which holds a handle to the coroutine.
As defined in the C++20 standard, a function is a coroutine if its body contains at least one co_await, co_yield,
or co_return statement. However, in order for this to work, the return type needs an underlying coroutine
implementation. Flow provides these for the following types:
- Future<T> is the primary type we use for coroutines. A coroutine returning Future<T> is allowed to co_await other coroutines and can co_return a single value. co_yield is not implemented by this type.
- Future<Void>. Void-futures are what a user would probably expect Future<> to be (it has this type for historical reasons and to provide compatibility with old Flow ACTORs). A coroutine with return type Future<Void> must not return a value: it either runs until the end of its body, or it terminates early with a plain co_return.
- Generator<T> can return a stream of values. However, it can't co_await other coroutines. Generators are useful for streams where the values are lazily computed but don't involve any IO.
- AsyncGenerator<T> is similar to Generator<T> in that it can return a stream of values, but in addition it can also co_await other coroutines. Because of that, it is slightly less efficient than Generator<T>. AsyncGenerator<T> should be used whenever values should be lazily generated AND need IO. It is an alternative to PromiseStream<T>: a PromiseStream<T> can be more efficient, but an AsyncGenerator<T> is easier to use correctly.

A more detailed explanation of Generator<T> and AsyncGenerator<T> can be found further down.
In actor compiled code we were able to use the keywords choose and when to wait on a
statically known number of futures and execute corresponding code. Something like this:
```cpp
choose {
    when(wait(future1)) {
        // do something
    }
    when(Foo f = wait(foo())) {
        // do something else
    }
}
```
Since choose-when is actor-compiler functionality, we can't use it with C++ coroutines. For most coroutine conversions, prefer race(...) and then branch on the returned std::variant. This keeps the control flow explicit and usually maps more directly to what the coroutine will do after the wait. Choose is still available for cases where the old choose-when shape is the clearest fit, or where you need its ordered evaluation behavior described below.
For example, this actor pattern:
```cpp
choose {
    when(R res = wait(f1)) {
        return res;
    }
    when(wait(timeout(...))) {
        throw io_timeout();
    }
}
```
should usually become:
```cpp
auto result = co_await race(f1, timeout(...));
if (result.index() == 0) {
    co_return std::get<0>(result);
}
throw io_timeout();
```
Choose remains useful when you specifically want callback-style handling of the winner:
```cpp
co_await Choose()
    .When(future1, [](Void const&) {
        // do something
    })
    .When(foo(), [](Foo const& f) {
        // do something else
    })
    .run();
```
While Choose and choose behave very similarly, there are some minor differences between the two, explained below.
In the above example there is one potentially important difference between the old and the new style: in the old actor code, the statement when(Foo f = wait(foo())) is only executed, and foo() therefore only called, if future1 is not ready. Depending on the intent of the statement, this could be desirable. Since Choose::When is a normal method, foo() will be evaluated whether or not an earlier future was already ready.
This can be worked around by passing a lambda that returns a Future instead:
```cpp
co_await Choose()
    .When(future1, [](Void const&) {
        // do something
    })
    .When([]() { return foo(); }, [](Foo const& f) {
        // do something else
    })
    .run();
```
The implementation of When will guarantee that this lambda will only be executed if all previous
When calls didn't receive a ready future.
In FDB we sometimes see this pattern:
```cpp
loop {
    choose {
        when(RequestA req = waitNext(requestAStream.getFuture())) {
            wait(handleRequestA(req));
        }
        when(RequestB req = waitNext(requestBStream.getFuture())) {
            wait(handleRequestB(req));
        }
        //...
    }
}
```
This is not possible to do with Choose. However, this is done deliberately as the above is
considered an antipattern: This means that we can't serve two requests concurrently since the loop
won't execute until the request has been served. Instead, this should be written like this:
```cpp
state ActorCollection actors(false);
loop {
    choose {
        when(RequestA req = waitNext(requestAStream.getFuture())) {
            actors.add(handleRequestA(req));
        }
        when(RequestB req = waitNext(requestBStream.getFuture())) {
            actors.add(handleRequestB(req));
        }
        //...
        when(wait(actors.getResult())) {
            // this only makes sure that errors are thrown correctly
            UNREACHABLE();
        }
    }
}
```
And so the above can easily be rewritten using Choose:
```cpp
ActorCollection actors(false);
while (true) {
    co_await Choose()
        .When(requestAStream.getFuture(), [&actors](RequestA const& req) {
            actors.add(handleRequestA(req));
        })
        .When(requestBStream.getFuture(), [&actors](RequestB const& req) {
            actors.add(handleRequestB(req));
        })
        .When(actors.getResult(), [](Void const&) {
            UNREACHABLE();
        })
        .run();
}
```
## choose-when

When porting actor code, use this rule of thumb:

- race(...) when the choose picks one winner and the code then branches on which future completed.
- Choose() when you need ordered when evaluation, lazy creation of later futures, or when the callback style is genuinely clearer than branching on a std::variant.
- quorum, waitForAll, waitForAllReady, or operator|| when they express the intent better than either race or Choose.

race(...) is usually the best direct replacement for patterns like:
```cpp
choose {
    when(R res = wait(f1)) {
        return res;
    }
    when(S res = wait(f2)) {
        return use(res);
    }
}
```
which becomes:
```cpp
auto result = co_await race(f1, f2);
if (result.index() == 0) {
    co_return std::get<0>(result);
}
co_return use(std::get<1>(result));
```
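Branching on the winner is plain std::variant handling. The following standalone sketch (no Flow types; RaceResult, use, and handleWinner are illustrative names) shows the shape of the code after the co_await:

```cpp
#include <cassert>
#include <string>
#include <variant>

// Stand-in for the value produced by `co_await race(f1, f2)`:
// index 0 means f1 finished first, index 1 means f2 did.
using RaceResult = std::variant<int, std::string>;

int use(const std::string& s) {
    return static_cast<int>(s.size()); // placeholder for real post-processing
}

int handleWinner(const RaceResult& result) {
    if (result.index() == 0) {
        return std::get<0>(result); // like: co_return std::get<0>(result);
    }
    return use(std::get<1>(result)); // like: co_return use(std::get<1>(result));
}
```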
However, choose-when is often overkill, and other facilities like quorum and operator|| should be used instead. For example, this:
```cpp
choose {
    when(R res = wait(f1)) {
        return res;
    }
    when(wait(timeout(...))) {
        throw io_timeout();
    }
}
```
Should be written like this:
```cpp
co_await (f1 || timeout(...));
if (f1.isReady()) {
    co_return f1.get();
}
throw io_timeout();
```
(The above could also be packed into a helper function in genericactors.actor.h).
With C++ coroutines we introduce two new basic types in Flow: Generator<T> and AsyncGenerator<T>. A generator is a special type of coroutine that can return multiple values. Despite their similar names, Generator<T> and AsyncGenerator<T> implement different interfaces and serve very different purposes.
Generator<T> conforms to the input_iterator concept, so it can be used like a normal iterator (with the exception that copying the iterator has different semantics). This also means it can be used with the new ranges library introduced in C++20.
AsyncGenerator<T> implements the () operator which returns a new value every time it is called. However, this value
HAS to be waited for (dropping it and attempting to call () again will result in undefined behavior!). This semantic
difference allows an author to mix co_await and co_yield statements in a coroutine returning AsyncGenerator<T>.
Since generators can produce infinitely long streams, they are useful in places where we'd otherwise write a more complex in-line loop. For example, consider the code in masterserver.actor.cpp that is responsible for generating version numbers. The logic for this currently lives in a long function. With a Generator<T> it can be isolated in one simple coroutine (which can be a direct member of MasterData). A simplified version of such a generator could look as follows:
```cpp
Generator<Version> MasterData::versionGenerator() {
    auto prevVersion = lastEpochEnd;
    auto lastVersionTime = now();
    while (true) {
        auto t1 = now();
        Version toAdd =
            std::max<Version>(1,
                              std::min<Version>(SERVER_KNOBS->MAX_READ_TRANSACTION_LIFE_VERSIONS,
                                                SERVER_KNOBS->VERSIONS_PER_SECOND * (t1 - lastVersionTime)));
        lastVersionTime = t1;
        co_yield prevVersion + toAdd;
        prevVersion += toAdd;
    }
}
```
Now that the logic to compute versions is separated, MasterData can simply create an instance of Generator<Version> by calling auto vGenerator = MasterData::versionGenerator(); (and possibly storing it as a class member). It can then access the current version with *vGenerator and advance to the next version by incrementing the iterator (++vGenerator).
AsyncGenerator<T> should be used in some places where we used promise streams before (though not all of them, this
topic is discussed a bit later). For example:
```cpp
template <class T, class F>
AsyncGenerator<T> filter(AsyncGenerator<T> gen, F pred) {
    while (gen) {
        auto val = co_await gen();
        if (pred(val)) {
            co_yield val;
        }
    }
}
```
Note how much simpler this function is compared to the old flow function:
```cpp
ACTOR template <class T, class F>
Future<Void> filter(FutureStream<T> input, F pred, PromiseStream<T> output) {
    loop {
        try {
            T nextInput = waitNext(input);
            if (pred(nextInput))
                output.send(nextInput);
        } catch (Error& e) {
            if (e.code() == error_code_end_of_stream) {
                break;
            } else
                throw;
        }
    }
    output.sendError(end_of_stream());
    return Void();
}
```
A FutureStream can be converted into an AsyncGenerator by using a simple helper function:
```cpp
template <class T>
AsyncGenerator<T> toGenerator(FutureStream<T> stream) {
    while (true) {
        try {
            co_yield co_await stream;
        } catch (Error& e) {
            if (e.code() == error_code_end_of_stream) {
                co_return;
            }
            throw;
        }
    }
}
```
Generator<T> can be used like an input iterator. This means that it can also be used with std::ranges. Consider the following coroutine:
```cpp
// returns base^0, base^1, base^2, ...
Generator<double> powersOf(double base) {
    double curr = 1;
    while (true) {
        co_yield curr;
        curr *= base;
    }
}
```
We can use this now to generate views. For example:
```cpp
for (auto v : generatorRange(powersOf(2))
            | std::ranges::views::filter([](auto v) { return v > 10; })
            | std::ranges::views::take(10)) {
    fmt::print("{}\n", v);
}
```
The above prints the first ten powers of two greater than 10 (16 through 8192).
One major difference between async generators and tasks (coroutines returning only one value through a Future) is the execution policy: an async generator immediately suspends when it is called and only runs when the next value is requested, while a task immediately starts executing.
This is a conscious design decision. Lazy execution makes it much simpler to reason about memory ownership. For example, the following is ok:
```cpp
Generator<StringRef> randomStrings(int minLen, int maxLen) {
    Arena arena;
    auto buffer = new (arena) uint8_t[maxLen + 1];
    while (true) {
        auto sz = deterministicRandom()->randomInt(minLen, maxLen + 1);
        for (int i = 0; i < sz; ++i) {
            buffer[i] = deterministicRandom()->randomAlphaNumeric();
        }
        co_yield StringRef(buffer, sz);
    }
}
```
The above coroutine returns a stream of random strings. The memory is owned by the coroutine and so it always returns
a StringRef and then reuses the memory in the next iteration. This makes this generator very cheap to use, as it only
does one allocation in its lifetime. With eager execution, this would be much harder to write (and reason about): the
coroutine would immediately generate a string and then eagerly compute the next one when the string is retrieved.
However, in Flow a co_yield is guaranteed to suspend the coroutine until the value has been consumed (this is not generally a guarantee with co_yield -- C++ coroutines give the implementer a great degree of freedom over decisions like this).
Flow provides another mechanism to send streams of messages between actors: PromiseStream<T>. In fact,
AsyncGenerator<T> uses PromiseStream<T> internally. So when should one be used over the other?
As a general rule of thumb: whenever possible, use Generator<T>; where that isn't possible, use AsyncGenerator<T> if in doubt.
For pure computation it almost never makes sense to use a PromiseStream<T> (the only exception is if computation
can be expensive enough that co_await yield() becomes necessary). Generator<T> is more lightweight and therefore
usually more efficient. It is also easier to use.
When it comes to IO, things get a bit more tricky. Assume we want to scan a file on disk, reading it in 4k blocks. This can be done quite elegantly using a coroutine:
```cpp
AsyncGenerator<Standalone<StringRef>> blockScanner(Reference<IAsyncFile> file) {
    auto sz = co_await file->size();
    decltype(sz) offset = 0;
    constexpr decltype(sz) blockSize = 4 * 1024;
    while (offset < sz) {
        Arena arena;
        auto block = new (arena) uint8_t[blockSize];
        auto toRead = std::min(sz - offset, blockSize);
        auto r = co_await file->read(block, toRead, offset);
        co_yield Standalone<StringRef>(StringRef(block, r), arena);
        offset += r;
    }
}
```
The problem with the above generator is that we only start reading when it is invoked. If consuming a block sometimes takes a long time (for example because it has to be written somewhere), every call pays the full disk read latency.
What if we want to hide this latency? In other words: what if we want to improve throughput and end-to-end latency by prefetching?
Doing this with a generator, while not trivial, is possible. But here it might be easier to use a PromiseStream
(we can even reuse the above generator):
```cpp
Future<Void> blockScannerWithPrefetch(Reference<IAsyncFile> file,
                                      PromiseStream<Standalone<StringRef>> promise,
                                      FlowLock lock) {
    auto generator = blockScanner(file);
    while (generator) {
        {
            FlowLock::Releaser _(co_await lock.take());
            try {
                promise.send(co_await generator());
            } catch (Error& e) {
                promise.sendError(e);
                co_return;
            }
        }
        // give the caller an opportunity to take the lock
        co_await yield();
    }
}
```
With the above the caller can control the prefetching dynamically by taking the lock if the queue becomes too full.
By default, a coroutine runs until it is either done (it reaches the end of the function body or a co_return statement, or it throws an exception) or until the last Future<T> object referencing it is dropped. The second case, cancellation, is implemented as follows:

- If the reference count of the underlying SAV reaches 0 while the coroutine is suspended, the coroutine is immediately resumed and actor_cancelled is thrown within it (this allows the coroutine to do some cleanup work).
- Any further co_await expr will immediately throw actor_cancelled.

However, some coroutines aren't safe to cancel. This usually concerns disk IO operations. With ACTOR we could either use a return type of void or the UNCANCELLABLE keyword to change this behavior: in that case, calling Future<T>::cancel() would be a no-op and dropping all futures wouldn't cause cancellation.
With C++ coroutines, neither option works:

- UNCANCELLABLE is not a C++ keyword, so supporting it would require some preprocessing.
- Providing a promise_type for void isn't a good idea, as this would make any void-function potentially a coroutine.

However, this can also be seen as an opportunity: uncancellable actors are always a bit tricky to use, since the caller must keep alive all memory that the uncancellable coroutine might reference until it is done. So whenever someone calls such a coroutine, they need to be extra careful. Yet a caller might not even know that the coroutine they call is uncancellable.
We address this problem with the following definition:
Definition:
A coroutine is uncancellable if the first argument (or the second, if the coroutine is a class-member) is of type
Uncancellable
The definition of Uncancellable is trivial: struct Uncancellable {}; -- it is simply used as a marker. So now, if
a user calls an uncancellable coroutine, it will be obvious on the caller side. For example the following is never
uncancellable:
```cpp
co_await foo();
```
But this one is:
```cpp
co_await bar(Uncancellable());
```
## Porting ACTORs to C++ Coroutines

If you have an existing ACTOR, you can port it to a C++ coroutine by following these steps:

1. Remove the ACTOR keyword.
2. If the actor is marked UNCANCELLABLE, remove that keyword and make the first argument Uncancellable. If the return type of the actor is void, make it Future<Void> instead and add an Uncancellable as the first argument.
3. Remove the state modifiers from local variables.
4. Replace wait(expr) with co_await expr.
5. Replace waitNext(expr) with co_await expr.
6. Replace choose-when statements, preferring race(...) for first-ready branching; use Choose only when you need ordered when semantics or callback-style handling.

In addition, the following things should be looked out for:
Consider this code:
```cpp
Local foo;
wait(bar());
...
```
foo will be destroyed right after the wait-expression. However, after making this a coroutine:
```cpp
Local foo;
co_await bar();
...
```
foo will stay alive until we leave the scope. This is better (it is more intuitive and follows standard C++), but in some weird corner cases code might depend on the semantics that locals are destroyed when we call into wait. Look out for places where destructors do semantically important work (as in FlowLock::Releaser).
In flow/genericactors.actor.h we have a number of useful helpers. Some of them are also useful with C++ coroutines, while others only add unnecessary overhead. Look out for the latter and remove calls to them. The most important ones are success and store.
```cpp
wait(success(f));
```

becomes

```cpp
co_await f;
```

and

```cpp
wait(store(v, f));
```

becomes

```cpp
v = co_await f;
```
In certain places we use locals just to work around actor compiler limitations. Since locals use up space in the coroutine frame, they should be removed wherever it makes sense (but only if it doesn't make the code less readable!).
For example:
```cpp
Foo f = wait(foo);
bar(f);
```

might become

```cpp
bar(co_await foo);
```
Using co_await in an exception handler is a compilation error in C++, although this was legal in ACTORs. There is no general best way of addressing this, but usually it's quite easy to move the co_await expression out of the catch-block.
One place where we use this pattern a lot is in our transaction retry loop:
```cpp
state ReadYourWritesTransaction tr(db);
loop {
    try {
        Value v = wait(tr.get(key));
        tr.set(key2, val2);
        wait(tr.commit());
        return Void();
    } catch (Error& e) {
        wait(tr.onError(e));
    }
}
```
Luckily, with coroutines we can do one better and generalize the retry loop. The above could look like this:

```cpp
co_await db.run([&](ReadYourWritesTransaction* tr) -> Future<Void> {
    Value v = co_await tr->get(key);
    tr->set(key2, val2);
    co_await tr->commit();
});
```
A possible implementation of Database::run would be:
```cpp
template <std::invocable<ReadYourWritesTransaction*> Fun>
Future<Void> Database::run(Fun fun) {
    ReadYourWritesTransaction tr(*this);
    Future<Void> onError;
    while (true) {
        if (onError.isValid()) {
            co_await onError;
            onError = Future<Void>();
        }
        try {
            co_await fun(&tr);
            co_return;
        } catch (Error& e) {
            onError = tr.onError(e);
        }
    }
}
```
With actors, we often see the following pattern:
```cpp
struct Foo : IFoo {
    ACTOR static Future<Void> bar(Foo* self) {
        // use `self` here to access members of `Foo`
    }

    Future<Void> bar() override {
        return bar(this);
    }
};
```
This boilerplate is necessary because ACTORs can't be class members: the actor compiler generates another struct and moves the code there, so this would point to the actor state and not to the class instance.
With C++ coroutines this limitation goes away, so a cleaner (and slightly more efficient) implementation of the above is:
```cpp
struct Foo : IFoo {
    Future<Void> bar() override {
        // `this` can be used like in any non-coroutine. `co_await` can be used.
    }
};
```
There is one very subtle and hard to spot difference between ACTOR and a coroutine: the way some local variables are
initialized. Consider the following code:
```cpp
struct SomeStruct {
    int a;
    bool b;
};

ACTOR Future<Void> someActor() {
    // beginning of body
    state SomeStruct someStruct;
    // rest of body
}
```
For state variables, the actor compiler generates the following code to initialize SomeStruct someStruct:

```cpp
someStruct = SomeStruct();
```

This, however, is different from what one might expect, since the default constructor is explicitly invoked and the struct is therefore value-initialized. This means that if the code is translated to:
```cpp
Future<Void> someActor() {
    // beginning of body
    SomeStruct someStruct;
    // rest of body
}
```
initialization will be different. The exact equivalent instead would be something like this:
```cpp
Future<Void> someActor() {
    // beginning of body
    SomeStruct someStruct{}; // auto someStruct = SomeStruct();
    // rest of body
}
```
If the struct SomeStruct initialized its primitive members explicitly (for example with int a = 0; and bool b = false;), this would be a non-issue, and explicit member initialization is probably the right fix here. Sadly, UBSAN doesn't seem to catch these kinds of subtle bugs.
Another difference is that a state variable might be initialized twice: once at the creation of the actor using the default constructor, and a second time at the point in the code where the variable is initialized. With C++ coroutines we now get the expected behavior, which is better, but it is nonetheless a potential behavior change.
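The difference is easy to demonstrate without coroutines at all; actorStyleInit is an illustrative helper showing what the actor compiler's assignment gives you:

```cpp
#include <cassert>

struct SomeStruct {
    int a;
    bool b;
};

// `someStruct = SomeStruct();` value-initializes the struct, so its primitive
// members are zeroed. A plain `SomeStruct s;` default-initializes instead:
// `a` and `b` then hold indeterminate values, and reading them is undefined
// behavior.
SomeStruct actorStyleInit() {
    SomeStruct s{}; // equivalent to `auto s = SomeStruct();`
    return s;
}
```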
### state Variables Inside Blocks

The actor compiler hoists all state variables into the actor's state struct, regardless of C++ block scope. This means a state variable declared inside an if, else, for, or try block lives for the entire actor lifetime. In a coroutine, these become regular C++ locals that follow normal scoping rules.
This is a source of subtle bugs. Consider:
```cpp
ACTOR Future<Void> example() {
    if (someCondition) {
        state Future<Void> background = longRunningTask();
    }
    // In ACTOR code, `background` is still alive here — it was hoisted.
    wait(delay(100.0));
    return Void();
}
```
A naive conversion:
```cpp
Future<Void> example() {
    if (someCondition) {
        Future<Void> background = longRunningTask();
    }
    // BUG: `background` was destroyed at the `}` above, cancelling longRunningTask()!
    co_await delay(100.0);
}
```
The fix is to move the variable to function scope:
```cpp
Future<Void> example() {
    Future<Void> background;
    if (someCondition) {
        background = longRunningTask();
    }
    // `background` is still alive — correct.
    co_await delay(100.0);
}
```
Rule: When removing state from a variable, check whether it is declared inside a block. If so, move the
declaration to function scope.
### const& Parameters

C++20 coroutines only store a reference in the coroutine frame for const& parameters; they do not copy the argument. If the caller passes a temporary (e.g. a default argument value, or a local that goes out of scope), the reference dangles after the first suspend point.
```cpp
// DANGEROUS: if caller passes a temporary, `key` dangles after first co_await
Future<Void> doSomething(Key const& key) {
    co_await delay(1.0);
    fmt::print("{}\n", key.toString()); // potential use-after-free
}
```
The fix is to copy const& parameters to locals before the first co_await:
```cpp
Future<Void> doSomething(Key const& key) {
    Key keyCopy = key; // safe copy before any suspend
    co_await delay(1.0);
    fmt::print("{}\n", keyCopy.toString()); // OK
}
```
Rule: Copy all const& parameters to local variables before the first co_await.
### .actor.h Files

When a function is converted from ACTOR to a coroutine, any forward declarations in .actor.h files must have the ACTOR keyword removed. The actor compiler automatically adds const& to all parameters in ACTOR declarations. If you also write const& explicitly, the generated code will contain const& const&, which is a compile error.
```cpp
// workloads.actor.h — WRONG: ACTOR + const& = double const&
ACTOR Future<Void> foo(Database const& cx);

// workloads.actor.h — CORRECT: remove ACTOR since foo() is now a coroutine
Future<Void> foo(Database const& cx);
```
Converted files should be renamed from .actor.cpp to .cpp (or .actor.h to .h) since they no longer need the
actor compiler. Both fdbserver and flowbench use fdb_find_sources() in their CMakeLists.txt, which
automatically picks up files by glob, so the rename is usually sufficient without any CMake changes.
- Rename .actor.cpp to .cpp.
- Remove ACTOR from all function definitions.
- Remove UNCANCELLABLE; add Uncancellable as the first parameter instead.
- Remove state from all local variable declarations.
- Was a state variable declared inside a block (if/else/for/try)? If so, move it to function scope.
- Replace wait(expr) with co_await expr. Replace waitNext(expr) with co_await expr.
- Replace return expr with co_return expr. Replace return Void() with co_return.
- Replace choose/when by preferring race(...); use Choose only for patterns that do not map cleanly to race.
- Simplify wait(success(f)) → co_await f; wait(store(v, f)) → v = co_await f.
- Check const& parameters: copy them to a local before the first co_await.
- Remove ACTOR from any forward declarations of the converted functions in .actor.h files.

Through optimization and profiling analysis, C++20 coroutine performance has improved, reducing the gap with ACTOR-generated code from ~10% to 3-8% depending on workload patterns.
| Benchmark Type | ACTOR Performance | Coroutine Performance | Gap | Status |
| --- | --- | --- | --- | --- |
| NET2/4096 | 2.67M/s | 2.41M/s | -8.5% | Target for optimization |
| YIELD/4096 | 7.45M/s | 13.6M/s | +83% | Coroutines much faster ✅ |
| DELAY/4096 | 1.44M/s | 5.22M/s | +260% | Coroutines much faster ✅ |
| CALLBACK/1024/64 | 50.9M/s | 8.7M/s (some patterns) | -82% | Mixed results |
Coroutines excel in frame-reuse patterns (YIELD, DELAY) where a single coroutine is suspended/resumed many times.
Coroutines lag in allocation-heavy patterns (NET2) where many short-lived coroutines are created and destroyed.
- Original analysis: coroutines had 39.13% CPU overhead in final_suspend() that actors completely avoid.
- ACTORs (2.67M/s): 43.31% CPU in direct ActorCallback::fire().
- Coroutines (2.41M/s): 35.61% CPU in QuorumCallback plus other overhead, ~75% total.
- Implementation: moved SAV cleanup from final_suspend() to return_value() to match actor completion timing.
- Result: eliminated final_suspend() overhead from performance profiles (39.13% → 0.21% CPU usage).
Based on comprehensive Linux profiling of optimized coroutines:

- Compiler optimization hints: __attribute__((hot)), __attribute__((always_inline)), __attribute__((flatten))
- Branch prediction hints: [[likely]], [[unlikely]]
- Architectural changes: moving SAV cleanup from final_suspend() to return_value()
- Custom FastAllocator forcing: attempted to force frames into smaller buckets
- Frame packing: __attribute__((packed)), pointer bit-packing
- Aggressive final_suspend() bypass: attempted to skip SAV operations entirely
To generate complete actor vs coroutine performance comparison reports (matching the historical format):

```shell
cd build_output  # or your build directory
python3 ../contrib/benchmark_comparison.py
```

Output: a complete comparison across all benchmark types.
Requirements:
Usage: the tool automatically runs the benchmarks and generates a comparison report in the format matching the historical coroutine optimization reports.