rfcs/experimental/parallel_phase_for_task_arena/README.md
In oneTBB, there has never been an API that allows users to block worker threads within the arena. This design choice was made to preserve the composability of the application. Before PR#1352, workers moved to the thread pool to sleep as soon as no arena had active demand. PR#1352, however, introduced a delayed leave behavior to the library: threads now block inside an arena for an implementation-defined duration when there is no active demand across all arenas. This change significantly improved performance for various applications on high-thread-count systems. The main idea is that, usually, after one parallel computation ends, another starts after some time. The delayed leave behavior is a heuristic that exploits this pattern, covering most cases within the implementation-defined duration.
However, the new behavior is not a perfect fit for all scenarios:
So there are two related problems, each with a different resolution:
Let's tackle these problems one by one.
Let's consider both "Delayed leave" and "Fast leave" as two different states in a state machine.
This raises a question that we need to answer:
To answer this question, the following scenarios should be considered:
oneTBB itself can only guess when the ideal time to release threads from the arena is. Therefore, it makes a best effort to preserve and enhance performance without compromising composability guarantees (this is how delayed leave is implemented).
As discussed above, there are cases where this heuristic does not work well, so customers who want to further tune this aspect of oneTBB's behavior should be able to do so.
This problem can be considered from another angle. Essentially, if the user can indicate where parallel computation ends, they can also indicate where it starts.
With this approach, the user not only releases threads when necessary but also specifies a programmable block where worker threads should expect new work coming regularly to the executing arena.
Let's add a new state to the existing state machine to represent the "Parallel Phase".
NOTE: The "Fast leave" state is colored grey just for simplicity of the chart. Let's assume the arena was created with "Delayed leave"; the logic demonstrated below applies to "Fast leave" as well.
This state diagram leads to several questions:
The extended state machine aims to answer these questions.
Let's consider the semantics that an API for explicit parallel phases can provide:
Summary of API changes:
Changes to the `task_arena` class and the `this_task_arena` namespace:

```cpp
class task_arena {
    enum class leave_policy : /* unspecified type */ {
        automatic = /* unspecified */,
        fast = /* unspecified */,
    };

    task_arena(int max_concurrency = automatic, unsigned reserved_for_masters = 1,
               priority a_priority = priority::normal,
               leave_policy a_leave_policy = leave_policy::automatic);

    task_arena(const constraints& constraints_, unsigned reserved_for_masters = 1,
               priority a_priority = priority::normal,
               leave_policy a_leave_policy = leave_policy::automatic);

    void initialize(int max_concurrency, unsigned reserved_for_masters = 1,
                    priority a_priority = priority::normal,
                    leave_policy a_leave_policy = leave_policy::automatic);

    void initialize(constraints a_constraints, unsigned reserved_for_masters = 1,
                    priority a_priority = priority::normal,
                    leave_policy a_leave_policy = leave_policy::automatic);

    void start_parallel_phase();
    void end_parallel_phase(bool with_fast_leave = false);

    class scoped_parallel_phase {
        scoped_parallel_phase(task_arena& ta, bool with_fast_leave = false);
    };
};

namespace this_task_arena {
    void start_parallel_phase();
    void end_parallel_phase(bool with_fast_leave = false);
}
```
The parallel phase continues until each previous `start_parallel_phase` call to the same arena has a matching `end_parallel_phase` call.
Let's introduce an RAII scoped object that will help manage this contract.
If the end of the parallel phase is not indicated by the user, it ends automatically when the last public reference to the arena is removed (i.e., the `task_arena` has been destroyed or, for an implicitly created arena, the thread that owns it has completed). This ensures correctness is preserved (threads will not be retained forever).
The following code snippets show how the new API can be used.
```cpp
void task_arena_leave_policy_example() {
    tbb::task_arena ta{tbb::task_arena::automatic, 1,
                       tbb::task_arena::priority::normal,
                       tbb::task_arena::leave_policy::fast};
    ta.execute([]() {
        // Parallel computation
    });
    // A different parallel runtime is used here,
    // so it is preferred that worker threads are not retained
    // in the arena at this point.
    #pragma omp parallel for
    for (int i = 0; i < work_size; ++i) {
        // Computation
    }
}
```
```cpp
void parallel_phase_example() {
    tbb::this_task_arena::start_parallel_phase();
    tbb::parallel_for(0, work_size, [] (int idx) {
        // User-defined body
    });
    // Some serial computation
    tbb::parallel_for(0, work_size, [] (int idx) {
        // User-defined body
    });
    tbb::this_task_arena::end_parallel_phase(/*with_fast_leave=*/true);
    // A different parallel runtime (for example, OpenMP) is used here,
    // so it is preferred that worker threads are not retained
    // in the arena at this point.
    #pragma omp parallel for
    for (int i = 0; i < work_size; ++i) {
        // Computation
    }
}
```
```cpp
void scoped_parallel_phase_example() {
    tbb::task_arena ta;
    {
        // Start of the parallel phase
        tbb::task_arena::scoped_parallel_phase phase{ta, /*with_fast_leave=*/true};
        ta.execute([]() {
            // Parallel computation
        });
        // Serial computation
        ta.execute([]() {
            // Parallel computation
        });
    } // End of the parallel phase
    // A different parallel runtime (for example, OpenMP) is used here,
    // so it is preferred that worker threads are not retained
    // in the arena at this point.
    #pragma omp parallel for
    for (int i = 0; i < work_size; ++i) {
        // Computation
    }
}
```
Alternative approaches were also considered.
We could express this state machine as a complete graph and provide a low-level interface that gives direct control over state transitions.
We considered this approach too low-level. It also leaves an open question: how should concurrent changes of the state be managed?
Retaining worker threads should be implemented with care because it might introduce performance problems if:
To implement the proposed feature, the following changes were made:

- Added the `thread_leave_manager` class to `r1::arena`, which is responsible for managing the state of workers' arena-leaving behavior.
- Added `r1::enter_parallel_phase(d1::task_arena_base*, std::uintptr_t)` - used to communicate the start of a parallel phase to the library.
- Added `r1::exit_parallel_phase(d1::task_arena_base*, std::uintptr_t)` - used to communicate the end of a parallel phase to the library.

The `thread_leave_manager` class implements the state machine described in this proposal. Specifically, it controls when worker threads are allowed to be retained in the arena. `thread_leave_manager` is initialized with a state that determines the default behavior for workers leaving the arena. To support the `start/end_parallel_phase` API, it provides functionality to override the default state with the "Parallel Phase" state. It also keeps track of the number of active parallel phases.
The following sequence diagram illustrates the interaction between the user and the `thread_leave_manager` during the execution of parallel phases. It shows how the `thread_leave_manager` manages the state transitions when using `start/end_parallel_phase`.
Some open questions remain:

- A `scoped_parallel_phase` object can be created only for an already existing `task_arena`. Should it be possible for `this_task_arena` as well?
- How should `this_task_arena` behave when a calling thread doesn't yet have any associated arena?
- What should be done if using a `scoped_parallel_phase` object is not acceptable?

The following conditions need to be met for the feature to move from experimental to fully supported: