This sample demonstrates how to add a custom GPU operation to PyTorch using the cuda.core API. It implements a simple square operation (y = x²) to show the complete workflow from CUDA kernel to PyTorch integration with autograd support.
```bash
cd python/3_FrameworkInterop/customPyTorchKernel
pip install -r requirements.txt
```
Windows users: The default torch wheel on PyPI for Windows is CPU-only and will cause torch.cuda.is_available() to return False. Install a CUDA-enabled build from PyTorch's wheel index before (or after) the command above:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu128
```
Replace cu128 with the wheel suffix matching your installed CUDA driver (e.g. cu121, cu124, cu126, cu128). The driver's CUDA version must be >= the wheel's bundled runtime.
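The compatibility rule above can be sketched as a small check. This is an illustration only: the helper names `wheel_cuda_version` and `driver_supports_wheel` are hypothetical, the suffix is assumed to have the form `cu` + two-digit major + one-digit minor (as in the examples listed above), and the driver's CUDA version is assumed to be a string such as the one shown in the `nvidia-smi` header.

```python
import re


def wheel_cuda_version(suffix: str) -> tuple[int, int]:
    """Parse a wheel suffix such as 'cu128' into (major, minor).

    Assumes a two-digit major and a one-digit minor version.
    """
    match = re.fullmatch(r"cu(\d{2})(\d)", suffix)
    if match is None:
        raise ValueError(f"unrecognized wheel suffix: {suffix!r}")
    return int(match.group(1)), int(match.group(2))


def driver_supports_wheel(driver_cuda: str, suffix: str) -> bool:
    """Return True if the driver's CUDA version is >= the wheel's runtime."""
    major, minor = (int(part) for part in driver_cuda.split(".")[:2])
    return (major, minor) >= wheel_cuda_version(suffix)


print(driver_supports_wheel("12.4", "cu121"))  # True
print(driver_supports_wheel("12.4", "cu128"))  # False
```

With a driver reporting CUDA 12.4, the `cu121` and `cu124` wheels would qualify under this rule, while `cu126` and `cu128` would not.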
```bash
# Basic usage
python customPyTorchKernel.py

# Test with more elements
python customPyTorchKernel.py --size 1000000

# Use specific GPU
CUDA_VISIBLE_DEVICES=1 python customPyTorchKernel.py
```
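A point that is easy to miss with `CUDA_VISIBLE_DEVICES`: the variable renumbers devices inside the process, so with `CUDA_VISIBLE_DEVICES=1` the selected physical GPU appears as device 0. The sketch below illustrates that remapping using only the standard library. The function name `visible_device_map` is hypothetical, and the sketch assumes the variable holds comma-separated integer indices (it can also hold GPU UUIDs, which are not handled here).

```python
import os


def visible_device_map(env=None):
    """Map in-process device indices to physical GPU indices.

    Returns None when CUDA_VISIBLE_DEVICES is unset, meaning every GPU
    is visible in its default order.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Position in the list becomes the device index seen by the process.
    physical = [int(token) for token in value.split(",") if token.strip()]
    return dict(enumerate(physical))


print(visible_device_map({"CUDA_VISIBLE_DEVICES": "1"}))    # {0: 1}
print(visible_device_map({"CUDA_VISIBLE_DEVICES": "2,0"}))  # {0: 2, 1: 0}
```

The `VAR=value command` form is POSIX shell syntax. On Windows, set the variable first (`set CUDA_VISIBLE_DEVICES=1` in cmd, or `$env:CUDA_VISIBLE_DEVICES = "1"` in PowerShell) and then run the script.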
The sample runs three tests. All three should pass, confirming that the custom operator works correctly with PyTorch's autograd system.
The sample demonstrates:

- torch.autograd.Function

The code is self-documenting, with inline comments explaining each step.
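For readers new to `torch.autograd.Function`, the general pattern for a square operation might look like the sketch below. It is not the sample's code: the class name `Square` is hypothetical, and plain PyTorch arithmetic stands in for the custom kernel launch so that the example also runs on a CPU-only install.

```python
import torch


class Square(torch.autograd.Function):
    """y = x**2 with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Save the input: the backward pass needs x to compute 2 * x.
        ctx.save_for_backward(x)
        # Stand-in for the custom kernel launch (assumption for this sketch).
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: dL/dx = dL/dy * dy/dx = grad_output * 2x
        return grad_output * 2 * x


x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()

print(y.detach().tolist())  # [1.0, 4.0, 9.0]
print(x.grad.tolist())      # [2.0, -4.0, 6.0]
```

The operation is invoked through `Square.apply` rather than by instantiating the class, which is how autograd records the custom backward in the graph.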