
Compile Taskflow with CUDA

Install CUDA Compiler

To compile Taskflow with CUDA code, you need the nvcc compiler, which ships with the CUDA Toolkit. Please visit the official Downloading CUDA Toolkit page.
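Before going further, it can help to confirm that nvcc is reachable from your shell. The following is a minimal sketch; check_nvcc is a hypothetical helper name used only for illustration, and it relies on nothing beyond the standard `command -v` lookup and `nvcc --version`.

```shell
# Hypothetical helper: report whether nvcc is reachable on PATH.
# Prints the nvcc version banner when found, or a hint otherwise.
check_nvcc() {
  if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
  else
    echo "nvcc not found; install the CUDA Toolkit or add its bin directory to PATH"
    return 1
  fi
}
```

Calling check_nvcc returns a non-zero status when the compiler is missing, so a build script can stop early.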

Compile Source Code Directly

Taskflow's GPU programming interface for CUDA is tf::cudaGraph. Consider the following simple.cu program that launches a single kernel function to output a message:

#include <taskflow/taskflow.hpp>
#include <taskflow/cudaflow.hpp>

int main(int argc, const char** argv) {

  // create a CUDA graph with a single-threaded task
  tf::cudaGraph cg;
  cg.single_task([] __device__ () { printf("hello CUDA Graph!\n"); });

  // instantiate an executable CUDA graph and run it through a stream
  tf::cudaStream stream;
  tf::cudaGraphExec exec(cg);
  stream.run(exec).synchronize();

  return 0;
}


The easiest way to compile Taskflow with CUDA code (e.g., cudaFlow, kernels) is to use nvcc:

~$ nvcc -std=c++17 -I path/to/taskflow/ --extended-lambda simple.cu -o simple

~$ ./simple

hello CUDA Graph!

Compile Source Code Separately

Large GPU applications often compile a program into separate objects and link them together to form an executable or a library. With Taskflow, you can compile your CPU code and GPU code separately using nvcc and other compilers (such as g++ and clang++). Consider the following example, which defines two tasks in two separate source files (main.cpp and cudaflow.cpp):

// main.cpp
#include <taskflow/taskflow.hpp>

tf::Task make_cudaflow(tf::Taskflow& taskflow);  // defined in cudaflow.cpp

int main() {

  tf::Executor executor;
  tf::Taskflow taskflow;

  tf::Task task1 = taskflow.emplace([](){ std::cout << "main.cpp!\n"; })
                           .name("cpu task");

  tf::Task task2 = make_cudaflow(taskflow);

  task1.precede(task2);

  executor.run(taskflow).wait();

  return 0;
}


// cudaflow.cpp
#include <taskflow/taskflow.hpp>
#include <taskflow/cudaflow.hpp>

tf::Task make_cudaflow(tf::Taskflow& taskflow) {
  return taskflow.emplace([](){
    // create a CUDA graph with a single-threaded task
    tf::cudaGraph cg;
    cg.single_task([] __device__ () { printf("hello CUDA Graph!\n"); });

    // instantiate an executable CUDA graph and run it through a stream
    tf::cudaStream stream;
    tf::cudaGraphExec exec(cg);
    stream.run(exec).synchronize();
  }).name("gpu task");
}


Compile each source to an object (g++ as an example):

~$ g++ -std=c++17 -I path/to/taskflow -c main.cpp -o main.o
~$ nvcc -std=c++17 --extended-lambda -x cu -I path/to/taskflow \
        -dc cudaflow.cpp -o cudaflow.o
~$ ls
# now we have the two compiled .o objects, main.o and cudaflow.o
main.o cudaflow.o

The --extended-lambda option tells nvcc to generate GPU code for lambdas defined with __device__. The -x cu option tells nvcc to treat the input files as .cu files containing both CPU and GPU code. By default, nvcc treats .cpp files as CPU-only code. This option is required for nvcc to generate device code here, and it is also a handy way to avoid renaming source files in larger projects. The -dc option tells nvcc to generate relocatable device code for later linking.

You may also need to specify the target architecture with the -arch option so that nvcc generates code for a compatible SM architecture. For instance, the following command compiles the device code for GPUs of compute capability 7.5 or later:

~$ nvcc -std=c++17 --extended-lambda -x cu -arch=sm_75 -I path/to/taskflow \
        -dc cudaflow.cpp -o cudaflow.o
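If you are unsure which value to pass to -arch, one option is to derive it from the compute capability of your device. The helper below is a minimal sketch: cap_to_arch is a hypothetical name, not part of Taskflow or the CUDA Toolkit, and it only covers the plain sm_XY form.

```shell
# Hypothetical helper: turn a compute capability such as "7.5"
# into an nvcc -arch value such as "sm_75" by dropping the dot.
cap_to_arch() {
  printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Assumption: recent NVIDIA drivers can report the capability with
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Whether that query field is available depends on your driver version.
```

For example, `cap_to_arch 7.5` yields sm_75, the value used in the command above.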

Linking the compiled objects with nvcc works just like linking with a regular compiler: replace the compiler name with nvcc, and it takes care of all the necessary steps:

~$ nvcc main.o cudaflow.o -o main

# run the main program
~$ ./main
main.cpp!
hello CUDA Graph!

You can choose to use a compiler other than nvcc for the final link step. Since your CPU compiler does not know how to link CUDA device code, you have to add a step in your build to have nvcc link the CUDA device code, using the option -dlink:

~$ nvcc -o gpuCode.o -dlink main.o cudaflow.o

This step links all the device object code and places it into gpuCode.o.

Note: This step does not link the CPU object code; it discards the CPU object code in main.o and cudaflow.o.

To complete the link to an executable, you can use, for example, ld or g++.

# replace /usr/local/cuda/lib64 with your own CUDA library installation path
~$ g++ -pthread gpuCode.o main.o cudaflow.o \
       -L /usr/local/cuda/lib64/ -lcudart -o main

# run the main program
~$ ./main
main.cpp!
hello CUDA Graph!

We give g++ all of the objects again because it needs the CPU object code, which is not in gpuCode.o. The device code stored in the original objects, main.o and cudaflow.o, does not conflict with the code in gpuCode.o. g++ ignores device code because it does not know how to link it, and the device code in gpuCode.o is already linked and ready to go.

Note: This intentional ignorance is extremely useful in large builds where intermediate objects may contain both CPU and GPU code. In this case, we just let the GPU and CPU linkers each do their own job, noting that the CPU linker is always the last one we run. The CUDA Runtime API library is automatically linked when we use nvcc for linking, but we must explicitly link it (-lcudart) when using another linker.
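The separate-compilation steps in this section can be gathered into one script. The following is a sketch under stated assumptions rather than an official build recipe: TASKFLOW_DIR, CUDA_LIB_DIR, DRY_RUN, run, and build_all are hypothetical names introduced for illustration. By default the script only prints the commands. It places -lcudart after the object files, an order that many linkers expect.

```shell
#!/bin/sh
# Hypothetical settings; adjust them to your own installation.
TASKFLOW_DIR="${TASKFLOW_DIR:-path/to/taskflow}"
CUDA_LIB_DIR="${CUDA_LIB_DIR:-/usr/local/cuda/lib64}"
DRY_RUN="${DRY_RUN:-1}"   # 1: only print commands, 0: print and execute

# Print a command, and execute it only when DRY_RUN=0.
run() {
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Compile CPU and GPU sources, link device code, then link the executable.
build_all() {
  run g++ -std=c++17 -I "$TASKFLOW_DIR" -c main.cpp -o main.o &&
  run nvcc -std=c++17 --extended-lambda -x cu -I "$TASKFLOW_DIR" \
      -dc cudaflow.cpp -o cudaflow.o &&
  run nvcc -o gpuCode.o -dlink main.o cudaflow.o &&
  run g++ -pthread gpuCode.o main.o cudaflow.o \
      -L "$CUDA_LIB_DIR" -lcudart -o main
}
```

Calling build_all with DRY_RUN=1 lists the four commands in order, which is a cheap way to review the pipeline before running it with DRY_RUN=0.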