cpp/4_CUDA_Libraries/cubDeviceTransform/README.md
This sample demonstrates cub::DeviceTransform in its N-input / M-output form. A single device-wide call reads from N input sequences and writes to M output sequences, driven by a user-provided op that returns a cuda::std::tuple of M values. Two cases are shown: N=3 inputs producing 1 output, and N=2 inputs producing 2 outputs (sum and difference in one fused pass).
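The data flow described above can be modeled on the CPU. The following is a minimal C++17 sketch of the semantics only, not the CUB call or the sample's source: `transform_n_to_m`, `scatter`, and `sum_and_difference` are hypothetical names introduced for illustration, `std::tuple` and `std::vector` stand in for `cuda::std::tuple` and device sequences, and the loop runs sequentially on the host. As a simplification of this sketch, the op always returns a tuple, even when there is a single output.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical helper: store element I of `result` at index i of output I,
// for every I in 0..M-1.
template <typename OutTuple, typename ResultTuple, std::size_t... I>
void scatter(OutTuple& outputs, const ResultTuple& result, std::size_t i,
             std::index_sequence<I...>)
{
  ((std::get<I>(outputs)[i] = std::get<I>(result)), ...);
}

// Hypothetical CPU reference model of an N-input / M-output transform.
// `inputs` and `outputs` are tuples of references to sequences (for example
// built with std::tie); `op` takes N values and returns a tuple of M values.
template <typename InTuple, typename OutTuple, typename Op>
void transform_n_to_m(const InTuple& inputs, OutTuple outputs,
                      std::size_t num_items, Op op)
{
  constexpr std::size_t M = std::tuple_size<OutTuple>::value;
  for (std::size_t i = 0; i < num_items; ++i) {
    // Gather the i-th element of every input and call op exactly once.
    auto result =
        std::apply([&](const auto&... in) { return op(in[i]...); }, inputs);
    static_assert(std::tuple_size<decltype(result)>::value == M,
                  "op must return one value per output sequence");
    // Write each returned value to its own output sequence.
    scatter(outputs, result, i, std::make_index_sequence<M>{});
  }
}

// Usage: the 2-input / 2-output case (sum and difference in one pass).
std::pair<std::vector<int>, std::vector<int>>
sum_and_difference(const std::vector<int>& a, const std::vector<int>& b)
{
  assert(a.size() == b.size());
  std::vector<int> sum(a.size()), diff(a.size());
  transform_n_to_m(std::tie(a, b), std::tie(sum, diff), a.size(),
                   [](int x, int y) { return std::make_tuple(x + y, x - y); });
  return {sum, diff};
}
```

Because each input element is read once and both results are written in the same loop iteration, the sketch shows why the fused form avoids a second pass over the inputs; the device-wide call applies the same idea in parallel.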
CCCL 3.3, CUB Device Algorithms, Fused Elementwise Transforms, Counting Iterators
SM 7.0, SM 7.5, SM 8.0, SM 8.6, SM 8.9, SM 9.0, SM 10.0, SM 11.0, SM 12.0
Linux, Windows
x86_64, aarch64
cub::DeviceTransform::Transform
cuda::counting_iterator, cuda::std::tuple
cudaDeviceSynchronize, cudaGetDeviceProperties
CCCL 3.3 or newer, fetched automatically via CPM at configure time (pinned to v3.3.3). To use a local checkout instead, pass -DCCCL_SOURCE_DIR=/path/to/cccl.
Download and install the CUDA Toolkit for your platform. Make sure the dependencies listed in the Dependencies section above are installed.