Taskflow: A General-purpose Task-parallel Programming System

Release 3.10.0 (2025/05/01)

Release Summary

This release improves scheduling performance through optimized work-stealing threshold tuning and a constrained decentralized buffer. It also introduces index-range-based parallel-for and parallel-reduction algorithms and modifies subflow tasking behavior to significantly enhance the performance of recursive parallelism.

Download

Taskflow 3.10.0 can be downloaded from here.

System Requirements

To use Taskflow v3.10.0, you need a compiler that supports C++17:

GNU C++ Compiler at least v8.4 with -std=c++17
Clang C++ Compiler at least v6.0 with -std=c++17
Microsoft Visual Studio at least v19.27 with /std:c++17
Apple Clang Xcode Version at least v12.0 with -std=c++17
Nvidia CUDA Toolkit and Compiler (nvcc) at least v11.1 with -std=c++17
Intel C++ Compiler at least v19.0.1 with -std=c++17
Intel DPC++ Clang Compiler at least v13.0.0 with -std=c++17

Taskflow works on Linux, Windows, and macOS.

Attention: Although Taskflow primarily supports C++17, you can enable C++20 compilation with -std=c++20 to achieve better performance from new C++20 features.

New Features

Taskflow Core

optimized work-stealing loop with an adaptive breaking strategy
optimized shut-down signal detection using decentralized variables
optimized memory layout of node by combining successors and predecessors together
changed the default notifier to use the atomic notification algorithm under C++20
added debug mode to the Windows CI in GitHub Actions
added index range-based parallel-for algorithm (#551)

// initialize data1 and data2 to 10 using two different approaches
std::vector<int> data1(100), data2(100);

// Approach 1: initialize data1 using explicit index range
taskflow.for_each_index(0, 100, 1, [&](int i){ data1[i] = 10; });

// Approach 2: initialize data2 using tf::IndexRange
tf::IndexRange<int> range(0, 100, 1);
taskflow.for_each_by_index(range, [&](tf::IndexRange<int>& subrange){
  for(int i=subrange.begin(); i<subrange.end(); i+=subrange.step_size()) {
    data2[i] = 10;
  }
});
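
To make the second approach more concrete, the following standalone sketch imitates, on a single thread and without Taskflow, how a callable might receive consecutive subranges of one index range. The type SubrangeSketch and the functions split_index_range and fill_by_subranges are hypothetical names introduced only for illustration; the way Taskflow actually partitions a range depends on its partitioner and is not shown here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a 1D subrange (not a Taskflow type).
struct SubrangeSketch {
  int begin;
  int end;
  int step;
};

// Splits [beg, end) with a positive step into consecutive subranges,
// each covering at most `chunk` iterations.
inline std::vector<SubrangeSketch> split_index_range(int beg, int end, int step, int chunk) {
  assert(step > 0 && chunk > 0);  // this sketch handles ascending ranges only
  std::vector<SubrangeSketch> parts;
  for (int b = beg; b < end; b += step * chunk) {
    int e = b + step * chunk;
    parts.push_back({b, e < end ? e : end, step});
  }
  return parts;
}

// Mirrors the loop body of Approach 2: each subrange is iterated independently,
// so the subranges could be handed to different workers.
inline std::vector<int> fill_by_subranges(int n, int chunk) {
  std::vector<int> data(n, 0);
  for (const auto& sub : split_index_range(0, n, 1, chunk)) {
    for (int i = sub.begin; i < sub.end; i += sub.step) {
      data[i] = 10;
    }
  }
  return data;
}
```

Because every subrange carries its own begin, end, and step, the inner loop is identical no matter how the range was divided, which is what lets the callable stay unaware of the partitioning.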

added index range-based parallel-reduction algorithm (#654)

std::vector<double> data(100000);
double res = 1.0;
taskflow.reduce_by_index(
  // index range
  tf::IndexRange<size_t>(0, data.size(), 1),
  // final result
  res,
  // local reducer
  [&](tf::IndexRange<size_t> subrange, std::optional<double> running_total) {
    // the running total is empty the first time a worker invokes the local reducer
    double residual = running_total ? *running_total : 0.0;
    for(size_t i=subrange.begin(); i<subrange.end(); i+=subrange.step_size()) {
      data[i] = 1.0;
      residual += data[i];
    }
    printf("partial sum = %lf\n", residual);
    return residual;
  },
  // global reducer
  std::plus<double>()
);
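
The interplay between the local reducer and the global reducer can be easier to follow in a single-threaded emulation. The sketch below does not use Taskflow. The function emulate_reduce_by_index, its round-robin assignment of chunks to simulated workers, and the helper sum_of_ones are assumptions made purely for illustration, and the emulation assumes the initial value of the result takes part in the final reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical helper (not a Taskflow API): emulates on one thread how a
// two-level index-range reduction combines per-worker partial results.
template <typename T, typename Local, typename Global>
T emulate_reduce_by_index(std::size_t beg, std::size_t end, std::size_t chunk,
                          std::size_t num_workers, T init, Local local, Global global) {
  assert(chunk > 0 && num_workers > 0);
  // One running total per simulated worker; empty until the worker gets a chunk.
  std::vector<std::optional<T>> partial(num_workers);
  std::size_t w = 0;
  for (std::size_t b = beg; b < end; b += chunk) {
    std::size_t e = std::min(end, b + chunk);
    // The first chunk a worker sees arrives with an empty running total.
    partial[w] = local(b, e, partial[w]);
    w = (w + 1) % num_workers;  // round-robin assignment, chosen for simplicity
  }
  // Merge every worker's partial result into the final result.
  T result = init;
  for (const auto& p : partial) {
    if (p) {
      result = global(result, *p);
    }
  }
  return result;
}

// Mirrors the example above: sets n elements to 1.0 and sums them,
// starting from an initial result of 1.0.
inline double sum_of_ones(std::size_t n) {
  std::vector<double> data(n);
  return emulate_reduce_by_index<double>(
    0, n, 16, 4, 1.0,
    [&](std::size_t b, std::size_t e, std::optional<double> running_total) {
      double residual = running_total ? *running_total : 0.0;
      for (std::size_t i = b; i < e; ++i) {
        data[i] = 1.0;
        residual += data[i];
      }
      return residual;
    },
    std::plus<double>());
}
```

The optional running total spares the caller from supplying an identity element for the local reducer: a worker starts from whatever the local reducer chooses when no total exists yet, and only the global reducer ever touches the shared result.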

added static keyword to the executor creation in taskflow benchmarks
added waiter test to detect over-subscription issues
added tf::Executor::num_waiters (C++20 only) for querying the number of non-stealing workers
added tf::make_module_task to the algorithm collection (see Module Algorithm)
added tf::Runtime::is_cancelled to query if the parent taskflow is cancelled
added tf::Runtime to async tasking to simplify designs of recursive parallelism (see Runtime Tasking)

Utilities

added tf::IndexRange for index range-based parallel-for algorithm
added tf::distance to calculate the number of iterations in an index range
added tf::is_index_range_invalid to check whether the given index range is invalid
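
As a rough illustration of what these two utilities compute, the sketch below implements an iteration count and a validity check for a range described by a begin index, an end index, and a step. The names sketch_is_range_invalid and sketch_distance are hypothetical and the code is written from the descriptions above; the signatures and exact behavior of tf::distance and tf::is_index_range_invalid may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: a range is invalid when stepping from beg can never
// reach end, i.e. the step is zero or points away from end.
inline bool sketch_is_range_invalid(int beg, int end, int step) {
  return (step == 0 && beg != end) ||
         (beg < end && step <= 0) ||
         (beg > end && step >= 0);
}

// Hypothetical helper: number of iterations in [beg, end) with the given step,
// rounding up when the step does not divide the span evenly.
inline std::size_t sketch_distance(int beg, int end, int step) {
  if (beg == end) {
    return 0;  // empty range, regardless of step
  }
  assert(!sketch_is_range_invalid(beg, end, step));  // precondition
  return static_cast<std::size_t>((end - beg + step + (step > 0 ? -1 : 1)) / step);
}
```

For example, under these definitions the range from 0 to 10 with step 3 visits 0, 3, 6, and 9, so its distance is 4, while the same bounds with step -1 are reported as invalid.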

Bug Fixes

fixed the compilation error of CLI11 due to version incompatibility (#672)
fixed the compilation error of template deduction on packaged_task (#657)
fixed the MSVC compilation error due to macro clash with std::min and std::max (#670)
fixed the runtime error due to the use of latch in tf::Executor::Executor (#667)
fixed the compilation error due to incorrect const qualifier used in algorithms (#673)
fixed the TSAN error when using find-if algorithm tasks with closure wrapper (#675)
fixed the task trait bug in incorrect detection for subflow and runtime tasks (#679)
fixed the infinite steal caused by incorrect num_empty_steals (#681)

Breaking Changes

corrected the terminology by replacing 'dependents' with 'predecessors'
- tf::Task::num_predecessors (previously tf::Task::num_dependents)
- tf::Task::for_each_predecessor (previously tf::Task::for_each_dependent)
- tf::Task::num_strong_dependencies (previously tf::Task::num_strong_dependents)
- tf::Task::num_weak_dependencies (previously tf::Task::num_weak_dependents)
disabled the support for tf::Subflow::detach due to multiple intricate and unresolved issues:
- the execution logic of detached subflows is inherently difficult to reason about
- detached subflows can incur excessive memory consumption, especially in recursive workloads
- detached subflows lack a safe mechanism for life-cycle control and graph cleanup
- detached subflows have limited practical benefits for most use cases
- detached subflows can be re-implemented using taskflow composition
changed the default behavior of tf::Subflow to no longer retain its task graph after join
- default retention can incur significant memory consumption (#674)
- users must explicitly call tf::Subflow::retain to retain a subflow after join

tf::Taskflow taskflow;
tf::Executor executor;

taskflow.emplace([&](tf::Subflow& sf){
  sf.retain(true);  // retain the subflow after join for visualization
  auto A = sf.emplace([](){ std::cout << "A\n"; });
  auto B = sf.emplace([](){ std::cout << "B\n"; });
  auto C = sf.emplace([](){ std::cout << "C\n"; });
  A.precede(B, C);  // A runs before B and C
});  // subflow implicitly joins here

executor.run(taskflow).wait();

// The subflow graph is now retained and can be visualized using taskflow.dump(...)
taskflow.dump(std::cout);

disabled the support for tf::cudaFlow and tf::cudaFlowCapturer
- introduced a cleaner interface tf::cudaGraph directly atop CUDA Graph (see GPU Tasking)
- tf::cudaGraph has an interface similar to tf::cudaFlow; existing code can be migrated as follows:

// programming tf::cudaGraph is consistent with Nvidia CUDA Graph but offers a simpler
// and more intuitive interface by abstracting away low-level CUDA Graph boilerplate.
tf::cudaGraph cg;
cg.kernel(...);  // same as cudaFlow/cudaFlowCapturer

// unlike cudaFlow/cudaFlowCapturer, you need to explicitly instantiate an executable
// CUDA graph now and submit it to a stream for execution
tf::cudaGraphExec exec(cg);
tf::cudaStream stream;
stream.run(exec).synchronize();
