Taskflow: A General-purpose Task-parallel Programming System


tf::cudaGraphBase< Creator, Deleter > Class Template Reference

class to create a CUDA graph with unique ownership

#include <taskflow/cuda/cuda_graph.hpp>


Public Types

| using | base_type = std::unique_ptr<std::remove_pointer_t<cudaGraph_t>, Deleter> | base std::unique_ptr type |

Public Member Functions

| template<typename... ArgsT> | cudaGraphBase (ArgsT &&... args) | constructs a cudaGraph object by passing the given arguments to the CUDA graph creator |
| | cudaGraphBase (cudaGraphBase &&)=default | constructs a cudaGraph from the given rhs using move semantics |
| cudaGraphBase & | operator= (cudaGraphBase &&)=default | assigns the rhs to *this using move semantics |
| size_t | num_nodes () const | queries the number of nodes in a native CUDA graph |
| size_t | num_edges () const | queries the number of edges in a native CUDA graph |
| bool | empty () const | queries if the graph is empty |
| void | dump (std::ostream &os) | dumps the CUDA graph in DOT format to the given output stream |
| cudaTask | noop () | creates a no-operation task |
| template<typename C> cudaTask | host (C &&callable, void *user_data) | creates a host task that runs a callable on the host |
| template<typename F, typename... ArgsT> cudaTask | kernel (dim3 g, dim3 b, size_t s, F f, ArgsT... args) | creates a kernel task |
| cudaTask | memset (void *dst, int v, size_t count) | creates a memset task that fills untyped data with a byte value |
| cudaTask | memcpy (void *tgt, const void *src, size_t bytes) | creates a memcpy task that copies untyped data in bytes |
| template<typename T, std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void > * = nullptr> cudaTask | zero (T *dst, size_t count) | creates a memset task that sets a typed memory block to zero |
| template<typename T, std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void > * = nullptr> cudaTask | fill (T *dst, T value, size_t count) | creates a memset task that fills a typed memory block with a value |
| template<typename T, std::enable_if_t<!std::is_same_v< T, void >, void > * = nullptr> cudaTask | copy (T *tgt, const T *src, size_t num) | creates a memcopy task that copies typed data |
| template<typename C> cudaTask | single_task (C c) | runs a callable with only a single kernel thread |
| template<typename I, typename C, typename E = cudaDefaultExecutionPolicy> cudaTask | for_each (I first, I last, C callable) | applies a callable to each dereferenced element of the data array |
| template<typename I, typename C, typename E = cudaDefaultExecutionPolicy> cudaTask | for_each_index (I first, I last, I step, C callable) | applies a callable to each index in the range with the step size |
| template<typename I, typename O, typename C, typename E = cudaDefaultExecutionPolicy> cudaTask | transform (I first, I last, O output, C op) | applies a callable to a source range and stores the result in a target range |
| template<typename I1, typename I2, typename O, typename C, typename E = cudaDefaultExecutionPolicy> cudaTask | transform (I1 first1, I1 last1, I2 first2, O output, C op) | creates a task to perform parallel transforms over two ranges of items |

Detailed Description

template<typename Creator, typename Deleter>
class tf::cudaGraphBase< Creator, Deleter >

class to create a CUDA graph with unique ownership

Template Parameters

| Creator | functor to create the graph (used in constructor) |
| Deleter | functor to delete the graph (used in destructor) |

This class wraps a cudaGraph_t handle with std::unique_ptr to ensure proper resource management and automatic cleanup.
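
The ownership pattern described above can be sketched without CUDA. The following is a minimal illustration, not Taskflow's implementation: `MockGraph`, `MockCreator`, `MockDeleter`, `HandleBase`, and `live_handles` are hypothetical names standing in for `cudaGraph_t` and its creator and deleter functors, and deriving from `base_type` is an assumption of this sketch.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for an opaque C handle such as cudaGraph_t.
struct MockGraph { int id; };
using MockGraph_t = MockGraph*;

// Counts live handles so that creation and cleanup can be observed.
inline int& live_handles() { static int n = 0; return n; }

// Hypothetical creator functor: allocates a handle from its arguments.
struct MockCreator {
  MockGraph_t operator()(int id) const {
    ++live_handles();
    return new MockGraph{id};
  }
};

// Hypothetical deleter functor: releases a handle.
struct MockDeleter {
  void operator()(MockGraph_t g) const {
    assert(g != nullptr);  // unique_ptr never calls the deleter on null
    --live_handles();
    delete g;
  }
};

// Owns the handle through std::unique_ptr, mirroring the base_type alias.
template <typename Creator, typename Deleter>
class HandleBase
    : public std::unique_ptr<std::remove_pointer_t<MockGraph_t>, Deleter> {
 public:
  using base_type =
      std::unique_ptr<std::remove_pointer_t<MockGraph_t>, Deleter>;

  // Forwards the arguments to the creator and takes ownership of the result.
  template <typename... ArgsT>
  explicit HandleBase(ArgsT&&... args)
      : base_type(Creator{}(std::forward<ArgsT>(args)...), Deleter{}) {}

  // Move-only, like the class documented here.
  HandleBase(HandleBase&&) = default;
  HandleBase& operator=(HandleBase&&) = default;
};

using MockHandle = HandleBase<MockCreator, MockDeleter>;
```

Constructing `MockHandle h(7)` calls the creator once, moving `h` transfers the handle without creating a second one, and the deleter runs exactly once when the final owner goes out of scope.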

Constructor & Destructor Documentation

cudaGraphBase()

template<typename Creator, typename Deleter>

template<typename... ArgsT>

tf::cudaGraphBase< Creator, Deleter >::cudaGraphBase(ArgsT &&... args)

inline explicit

constructs a cudaGraph object by passing the given arguments to the CUDA graph creator

Parameters

| args | arguments to pass to the CUDA graph creator |

Member Function Documentation

copy()

template<typename Creator, typename Deleter>

template<typename T, std::enable_if_t<!std::is_same_v< T, void >, void > *>

cudaTask tf::cudaGraphBase< Creator, Deleter >::copy(T *tgt, const T *src, size_t num)

creates a memcopy task that copies typed data

Template Parameters

| T | element type (non-void) |

Parameters

| tgt | pointer to the target memory block |
| src | pointer to the source memory block |
| num | number of elements to copy |

Returns a tf::cudaTask handle

A copy task transfers num*sizeof(T) bytes of data from a source location to a target location. The transfer can go in any direction between CPU and GPU memory.
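
To make the byte count concrete, here is a host-only sketch of the relationship between a typed copy and an untyped byte copy. `copy_bytes` and `copy_reference` are hypothetical helper names, and the code runs on the CPU with `std::memcpy`; it only illustrates the `num*sizeof(T)` arithmetic and is not how the GPU task is built.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Number of bytes that a typed copy of `num` elements transfers.
template <typename T>
constexpr std::size_t copy_bytes(std::size_t num) {
  return num * sizeof(T);
}

// Host-side reference: a typed copy expressed as an untyped byte copy.
// The enable_if constraint mirrors the "non-void T" requirement above.
template <typename T,
          std::enable_if_t<!std::is_same_v<T, void>, void>* = nullptr>
void copy_reference(T* tgt, const T* src, std::size_t num) {
  static_assert(std::is_trivially_copyable_v<T>,
                "a byte-wise copy needs a trivially copyable T");
  if (num == 0) return;  // nothing to copy; avoids memcpy on null pointers
  assert(tgt != nullptr && src != nullptr);
  std::memcpy(tgt, src, copy_bytes<T>(num));
}
```

For example, copying three `double` values moves `3 * sizeof(double)` bytes, which is the count an equivalent untyped memcpy call would need.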

dump()

template<typename Creator, typename Deleter>

void tf::cudaGraphBase< Creator, Deleter >::dump(std::ostream &os)

dumps the CUDA graph in DOT format to the given output stream

Parameters

| os | target output stream |

fill()

template<typename Creator, typename Deleter>

template<typename T, std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void > *>

cudaTask tf::cudaGraphBase< Creator, Deleter >::fill(T *dst, T value, size_t count)

creates a memset task that fills a typed memory block with a value

Template Parameters

| T | element type (size of T must be either 1, 2, or 4) |

Parameters

| dst | pointer to the destination device memory area |
| value | value to fill for each element of type T |
| count | number of elements |

Returns a tf::cudaTask handle

A fill task fills the first count elements of type T with value in the device memory area pointed to by dst. The value to fill is interpreted as type T rather than as a byte.
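
The difference between a typed fill and a byte-wise memset is easiest to see on the host. The sketch below uses hypothetical names (`fill_reference`, `memset_reference`) and substitutes `std::is_trivially_copyable_v` for Taskflow's `is_pod_v`; it models the semantics only and does not touch device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Host-side reference for a typed fill: every element of type T receives
// `value`. The size constraint mirrors the documented 1, 2, or 4 bytes.
template <typename T,
          std::enable_if_t<std::is_trivially_copyable_v<T> &&
                               (sizeof(T) == 1 || sizeof(T) == 2 ||
                                sizeof(T) == 4),
                           void>* = nullptr>
void fill_reference(T* dst, T value, std::size_t count) {
  assert(dst != nullptr || count == 0);
  for (std::size_t i = 0; i < count; ++i) {
    dst[i] = value;
  }
}

// Host-side reference for a byte-wise memset: every byte receives `v`.
inline void memset_reference(void* dst, int v, std::size_t count) {
  if (count == 0) return;
  assert(dst != nullptr);
  std::memset(dst, v, count);
}
```

Filling `std::uint16_t` elements with `0x1234` stores `0x1234` in each element, whereas a memset with the byte `0x12` over the same memory leaves each element equal to `0x1212`.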

for_each()

template<typename Creator, typename Deleter>

template<typename I, typename C, typename E>

cudaTask tf::cudaGraphBase< Creator, Deleter >::for_each(I first, I last, C callable)

applies a callable to each dereferenced element of the data array

Template Parameters

| I | iterator type |
| C | callable type |
| E | execution policy (default tf::cudaDefaultExecutionPolicy) |

Parameters

| first | iterator to the beginning (inclusive) |
| last | iterator to the end (exclusive) |
| callable | a callable object to apply to the dereferenced iterator |

Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

    for(auto itr = first; itr != last; itr++) {
      callable(*itr);
    }

for_each_index()

template<typename Creator, typename Deleter>

template<typename I, typename C, typename E>

cudaTask tf::cudaGraphBase< Creator, Deleter >::for_each_index(I first, I last, I step, C callable)

applies a callable to each index in the range with the step size

Template Parameters

| I | index type |
| C | callable type |
| E | execution policy (default tf::cudaDefaultExecutionPolicy) |

Parameters

| first | beginning index (inclusive) |
| last | ending index (exclusive) |
| step | step size |
| callable | the callable to apply to each index in the range |

Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

    // step is positive [first, last)
    for(auto i=first; i<last; i+=step) {
      callable(i);
    }

    // step is negative [first, last)
    for(auto i=first; i>last; i+=step) {
      callable(i);
    }
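
The two loops above can be folded into one sequential host-side reference that picks the comparison from the sign of the step. This is a sketch of the documented semantics only: `for_each_index_reference` and `visited_indices` are hypothetical names, and the code runs sequentially on the CPU rather than in parallel on a GPU.

```cpp
#include <cassert>
#include <vector>

// Sequential host-side reference for the index loop. It visits first,
// first+step, ... and stops before reaching or passing last.
template <typename I, typename C>
void for_each_index_reference(I first, I last, I step, C callable) {
  assert(step != 0);  // a zero step would never terminate
  if (step > 0) {
    for (I i = first; i < last; i += step) callable(i);
  } else {
    for (I i = first; i > last; i += step) callable(i);
  }
}

// Collects the visited indices so that they can be inspected.
inline std::vector<int> visited_indices(int first, int last, int step) {
  std::vector<int> out;
  for_each_index_reference(first, last, step,
                           [&out](int i) { out.push_back(i); });
  return out;
}
```

With `first = 0`, `last = 10`, `step = 3` the visited indices are 0, 3, 6, 9; with `first = 10`, `last = 0`, `step = -4` they are 10, 6, 2. In both cases `last` itself is never visited.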

host()

template<typename Creator, typename Deleter>

template<typename C>

cudaTask tf::cudaGraphBase< Creator, Deleter >::host(C &&callable, void *user_data)

creates a host task that runs a callable on the host

Template Parameters

| C | callable type |

Parameters

| callable | a callable object that takes no arguments and returns nothing (i.e., convertible to std::function<void()>) |
| user_data | a pointer to the user data |

Returns a tf::cudaTask handle

A host task can only execute CPU-specific functions and cannot make any CUDA calls (e.g., cudaMalloc).

kernel()

template<typename Creator, typename Deleter>

template<typename F, typename... ArgsT>

cudaTask tf::cudaGraphBase< Creator, Deleter >::kernel(dim3 g, dim3 b, size_t s, F f, ArgsT... args)

creates a kernel task

Template Parameters

| F | kernel function type |
| ArgsT | kernel function parameters type |

Parameters

| g | configured grid |
| b | configured block |
| s | configured shared memory size in bytes |
| f | kernel function |
| args | arguments to forward to the kernel function by copy |

Returns a tf::cudaTask handle

memcpy()

template<typename Creator, typename Deleter>

cudaTask tf::cudaGraphBase< Creator, Deleter >::memcpy(void *tgt, const void *src, size_t bytes)

creates a memcpy task that copies untyped data in bytes

Parameters

| tgt | pointer to the target memory block |
| src | pointer to the source memory block |
| bytes | number of bytes to copy |

Returns a tf::cudaTask handle

A memcpy task transfers bytes of data from a source location to a target location. The transfer can go in any direction between CPU and GPU memory.

memset()

template<typename Creator, typename Deleter>

cudaTask tf::cudaGraphBase< Creator, Deleter >::memset(void *dst, int v, size_t count)

creates a memset task that fills untyped data with a byte value

Parameters

| dst | pointer to the destination device memory area |
| v | value to set for each byte of specified memory |
| count | size in bytes to set |

Returns a tf::cudaTask handle

A memset task fills the first count bytes of the device memory area pointed to by dst with the byte value v.

noop()

template<typename Creator, typename Deleter>

cudaTask tf::cudaGraphBase< Creator, Deleter >::noop()

creates a no-operation task

Returns a tf::cudaTask handle

An empty node performs no operation during execution, but can be used for transitive ordering. For example, a phased execution graph with 2 groups of n nodes with a barrier between them can be represented using an empty node and 2*n dependency edges, rather than no empty node and n^2 dependency edges.
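
The edge-count argument above can be checked with a small host-side model. The sketch below is a plain graph model, not the Taskflow or CUDA API: `Edge`, `direct_edges`, and `barrier_edges` are hypothetical names, and the node numbering (phase A is `0..n-1`, phase B is `n..2n-1`, the barrier is `2n`) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<std::size_t, std::size_t>;  // (from, to) node ids

// Connects every node of phase A directly to every node of phase B.
inline std::vector<Edge> direct_edges(std::size_t n) {
  std::vector<Edge> edges;
  for (std::size_t a = 0; a < n; ++a) {
    for (std::size_t b = 0; b < n; ++b) {
      edges.emplace_back(a, n + b);
    }
  }
  assert(edges.size() == n * n);
  return edges;
}

// Routes the same ordering through one extra empty node with id 2*n.
inline std::vector<Edge> barrier_edges(std::size_t n) {
  const std::size_t barrier = 2 * n;
  std::vector<Edge> edges;
  for (std::size_t a = 0; a < n; ++a) edges.emplace_back(a, barrier);
  for (std::size_t b = 0; b < n; ++b) edges.emplace_back(barrier, n + b);
  assert(edges.size() == 2 * n);
  return edges;
}
```

For `n = 4` the direct wiring needs 16 edges while the barrier wiring needs 8. The two are equal at `n = 2`, and the barrier form is smaller for every larger `n`.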

single_task()

template<typename Creator, typename Deleter>

template<typename C>

cudaTask tf::cudaGraphBase< Creator, Deleter >::single_task(C c)

runs a callable with only a single kernel thread

Template Parameters

| C | callable type |

Parameters

| c | callable to run by a single kernel thread |

Returns a tf::cudaTask handle

transform() [1/2]

template<typename Creator, typename Deleter>

template<typename I, typename O, typename C, typename E>

cudaTask tf::cudaGraphBase< Creator, Deleter >::transform(I first, I last, O output, C op)

applies a callable to a source range and stores the result in a target range

Template Parameters

| I | input iterator type |
| O | output iterator type |
| C | unary operator type |
| E | execution policy (default tf::cudaDefaultExecutionPolicy) |

Parameters

| first | iterator to the beginning of the input range |
| last | iterator to the end of the input range |
| output | iterator to the beginning of the output range |
| op | unary operator to apply to each element in the range |

Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

    while (first != last) {
      *output++ = op(*first++);
    }

transform() [2/2]

template<typename Creator, typename Deleter>

template<typename I1, typename I2, typename O, typename C, typename E>

cudaTask tf::cudaGraphBase< Creator, Deleter >::transform(I1 first1, I1 last1, I2 first2, O output, C op)

creates a task to perform parallel transforms over two ranges of items

Template Parameters

| I1 | first input iterator type |
| I2 | second input iterator type |
| O | output iterator type |
| C | binary operator type |
| E | execution policy (default tf::cudaDefaultExecutionPolicy) |

Parameters

| first1 | iterator to the beginning of the first input range |
| last1 | iterator to the end of the first input range |
| first2 | iterator to the beginning of the second input range |
| output | iterator to the beginning of the output range |
| op | binary operator to apply to transform each pair of items in the two input ranges |

Returns a tf::cudaTask handle

This method is equivalent to the parallel execution of the following loop on a GPU:

    while (first1 != last1) {
      *output++ = op(*first1++, *first2++);
    }
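
Both transform overloads follow the same element-wise pattern as `std::transform`, so their expected results can be checked on the host before moving to the GPU. The sketch below is a sequential CPU reference under that assumption; `transform_reference` is a hypothetical name and the vector-based interface is chosen for brevity, not taken from Taskflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unary form: out[i] = op(in[i]) for every element of the input range.
template <typename T, typename Op>
std::vector<T> transform_reference(const std::vector<T>& in, Op op) {
  std::vector<T> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(), op);
  return out;
}

// Binary form: out[i] = op(a[i], b[i]). The length of the first range
// decides how many pairs are processed, as in the loop above.
template <typename T, typename Op>
std::vector<T> transform_reference(const std::vector<T>& a,
                                   const std::vector<T>& b, Op op) {
  assert(b.size() >= a.size());  // the second range must not run out
  std::vector<T> out(a.size());
  std::transform(a.begin(), a.end(), b.begin(), out.begin(), op);
  return out;
}
```

Doubling `{1, 2, 3}` with the unary form gives `{2, 4, 6}`, and adding `{1, 2, 3}` to `{10, 20, 30}` with the binary form gives `{11, 22, 33}`.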

zero()

template<typename Creator, typename Deleter>

template<typename T, std::enable_if_t< is_pod_v< T > &&(sizeof(T)==1||sizeof(T)==2||sizeof(T)==4), void > *>

cudaTask tf::cudaGraphBase< Creator, Deleter >::zero(T *dst, size_t count)

creates a memset task that sets a typed memory block to zero

Template Parameters

| T | element type (size of T must be either 1, 2, or 4) |

Parameters

| dst | pointer to the destination device memory area |
| count | number of elements |

Returns a tf::cudaTask handle

A zero task zeroes the first count elements of type T in the device memory area pointed to by dst.


The documentation for this class was generated from the following files: