Back to Onetbb

API to simplify creation of task arenas constrained to NUMA nodes

rfcs/supported/numa_support/create-numa-arenas.md

2023.0.010.3 KB
Original Source

API to simplify creation of task arenas constrained to NUMA nodes

This RFC describes an API to ease creation of a one-per-NUMA-node set of task arenas.

Motivation

The code example in the overarching RFC for NUMA support shows the likely pattern of using task arenas to distribute computation across all NUMA domains on a system. Let's take a closer look at the part where arenas are created and initialized.

c++
  std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
  std::vector<tbb::task_arena> arenas(numa_nodes.size());
  std::vector<tbb::task_group> task_groups(numa_nodes.size());

  // initialize each arena, each constrained to a different NUMA node
  for (int i = 0; i < numa_nodes.size(); i++)
    arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]), 0);

The first line obtains a vector of NUMA node IDs for the system. Then, a vector of the same size is created to store tbb::task_arena objects, each constrained to one of the NUMA nodes. Another vector holds task_group instances used later to submit and wait for completion of the work in each of the arenas - it is necessary because task_arena does not provide any work synchronization API. Finally, the loop over all NUMA nodes initializes associated task arenas with proper constraints.

While not incomprehensible, the code is quite verbose and arguably too explicit for the typical scenario of creating a set of arenas across all available NUMA domains. There is also risk of subtle issues. The default constructor of task_arena reserves a slot for an application thread. The arena initialization at the last line explicitly overwrites it to 0 to allow TBB worker threads taking all the slots, however this nuance might be unknown and easy to miss, potentially resulting in underutilization of CPU resources.

Implemented API

A special function is introduced to create the set of task arenas, one per NUMA node on the system. The initialization code equivalent to the example above would be:

c++
  std::vector<tbb::task_arena> arenas = tbb::create_numa_task_arenas();
  std::vector<tbb::task_group> task_groups(arenas.size());

The rest of the code in that example might be rewritten with the API for waiting in a task arena:

c++
  // enqueue work to all but the first arena, using the task groups to track work
  for (int i = 1; i < arenas.size(); i++)
    arenas[i].enqueue(
      [] { tbb::parallel_for(0, N, [](int j) { f(w); }); },
      task_groups[i]
    );

  // directly execute the work to completion in the remaining arena
  arenas[0].execute([] {
    tbb::parallel_for(0, N, [](int j) { f(w); });
  });

  // join the other arenas to wait on their task groups
  for (int i = 1; i < arenas.size(); i++)
    arenas[i].wait_for(task_groups[i]);

Public API

The function has the following signature:

c++
// Defined in tbb/task_arena.h

namespace tbb {
    std::vector<tbb::task_arena> create_numa_task_arenas(
        task_arena::constraints constraints_ = {},
        unsigned reserved_slots = 0
    };
}

It optionally takes a constraints argument to change default arena settings such as maximal concurrency (the upper limit on the number of threads), core type etc., except for the numa_id value. The main goal of this API is to achieve simplicity in the creation of NUMA-bound task arenas, so passing constraints with an explicitly set numa_id won't result in an error, and there is no clear advantage to providing an error indication. Therefore, the numa_id value of constraints is ignored. The second optional argument allows to override the number of reserved slots, which by default is 0 (unlike the task_arena construction default of 1) for the reasons described in the introduction.

These arena parameters were selected for pre-setting because there appear to be practical use cases to modify it uniformly for the whole arena set - e.g., to suppress the use of hyper-threading or to reserve a slot for a dedicated application thread. For other arena parameters, such as priorities and thread leave policy, no obvious use cases are seen for uniformly changing default values; it can be addressed on demand.

The function returns a std::vector of created arenas. The arenas should not be initialized, in order to allow changing certain arena settings before the use.

Possible implementation

c++
std::vector<tbb::task_arena> create_numa_task_arenas(
    tbb::task_arena::constraints other_constraints,
    unsigned reserved_slots)
{
    std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas;
    arenas.reserve(numa_nodes.size());
    for (tbb::numa_node_id nid : numa_nodes) {
        other_constraints.numa_id = nid;
        arenas.emplace_back(other_constraints, reserved_slots);
    }
    return arenas;
}

Shortcomings and downsides

The following critics was provided in the RFC discussion:

  • It might be confusing that a single constraints object is used to generate multiple arenas, especially with part of it (numa_id) being ignored.
  • The API addresses just one, albeit important, use case for creating a set of arenas.

See the "universal function" alternative below for related considerations.

Considered alternatives

Sub-classing task_arena

The earlier proposal PR #1559 also aimed to simplify the typical usage pattern of NUMA arenas, with possibility to extend to other similar cases.

It suggested to add a new class derived from task_arena which would have "only necessary methods to allow submission and waiting of a parallel work", by selectively exposing methods of task_arena and also adding a method to wait for work completion. Instances of such class could only be created via a factory function that would instantiate a ready-to-use arena for each of the NUMA domains.

c++
class constrained_task_arena : protected task_arena {
public:
    using task_arena::is_active;
    using task_arena::terminate;
    using task_arena::max_concurrency;

    using task_arena::enqueue;
    using task_arena::execute; // not in the original proposal

    void wait();
    friend std::vector<constrained_task_arena> initialize_numa_constrained_arenas();
};

In the code example used in this document, that API (with the method execute also exposed) would fully eliminate explicit use of task_group:

c++
  std::vector<tbb::constrained_task_arena> arenas =
    tbb::initialize_numa_constrained_arenas();

  // enqueue work to all but the first arena
  for (int i = 1; i < arenas.size(); i++)
    arenas[i].enqueue([] {
      tbb::parallel_for(0, N, [](int j) { f(w); });
    });

  // directly execute the work to completion in the remaining arena
  arenas[0].execute([] {
    tbb::parallel_for(0, N, [](int j) { f(w); });
  });

  // join the other arenas to wait on their task groups
  for (int i = 1; i < arenas.size(); i++)
    arenas[i].wait();

While suggesting more concise and error-protected API for the problem at question, that approach has its downsides:

  • Adding a special "flavour" of task_arena potentially increases the library learning curve and might create confusion about which class to use in which conditions, and how these interoperate.
  • It seems very specialized, capable to only address specific and quite narrow set of use cases.
  • Arenas are pre-initialized and so could not be adjusted after creation.

This API, instead, aims at a single usability aspect with incremental improvements/extensions to the existing oneTBB classes and usage patterns, leaving other aspects to complementary proposals. Specifically, the Waiting in a task arena RFC improves the joint use of a task arena and a task group to submit and wait for work. Combined, these extensions will provide only a slightly more verbose solution for the NUMA use case, while being more flexible and having greater potential for other useful extensions and applications.

Universal function to create an arena set

In the discussion of this RFC, it was suggested to generalize the function for creating a set of arenas with various prescribed characteristics, beyond only NUMA-bound arenas. It would take a "constraint generator" type which would represent multiple arena constraint objects according to given options, and each of these constraints would be used to create an arena.

Generally, such generalized API needs some way to describe which arena parameters should vary across the arena set and how, and which should remain uniform. Possible ways for that are to use some descriptive "language" and/or to generate parameter sets by a function.

In the suggestion, construction arguments of a constraint generator (including value sets and named patterns) serve as the description language. Another way could be to add named patterns as possible data values for task_arena::constraints; for example, tbb::task_arena::constraints c{ .numa_id = tbb::task_arena::iterate } would serve as a pattern for generating a set of constraints bound to NUMA domains.

Using a function to generate arena parameter sets seems even more flexible comparing to a description language, as that function would be not limited in which parameters to alter, and how.

This approach would necessarily be more verbose because of the need to describe or generate a set of constraints. That could however be mitigated for common use cases with presets provided by the library allowing for a reasonably concise code. For example:

c++
  std::vector<tbb::task_arena> arenas = tbb::create_task_arenas(tbb::iterate_over_numa_ids);

where iterate_over_numa_ids is a predefined variable of a type accepted by the function.

Yet it is questionable if such flexibility is useful at the moment, and so if it justifies extra complexity. As of now, we know just a few arena set patterns with potential practical usage: besides NUMA arenas, we can think of splitting CPU resources by core type and of creating a set of arenas with different priorities. We do not however recall any requests to simplify these use cases.

Future Extensions

  • Consider extending API to support pre-setting arena priority and experimental worker thread leave policy.
  • Revisit the universal function approach if more use cases arise that would benefit from such flexibility.