guide/src/faq.md
# Frequently Asked Questions

This page covers many of the questions people often have when they first encounter this project, so that they can all be addressed in one place.
## Why not use rustc's built-in LLVM PTX backend?

Good question! There are a good number of reasons:
So overall, the LLVM PTX backend is fit for smaller kernels/projects/proofs of concept. It is, however, not fit for compiling an entire language (core is very big) together with dependencies and more. The end goal is for Rust to be usable in place of CUDA C/C++ with the same (or better!) performance and features; therefore, we must take advantage of every optimization NVCC has over us.
Short answer: no.

Long answer: there are a couple of things that make this impossible:

- NVVM IR is a subset of LLVM IR; the only integer types it supports are i1, i8, i16, i32, and i64.
This required special handling in the codegen backend to convert these "irregular" types into vector types.

## Why is there so much `unsafe`?

This is probably the most asked question by far, so let's break it down in detail.
TL;DR: there are things we fundamentally cannot check, but that does not mean we cannot still prevent a lot of the problems we *can* check.
Yes, it is true that GPU kernels usually contain much more `unsafe` than CPU code, but why is that?
The reason is that CUDA's entire model is not based on safety in any way; there are almost zero safety nets in CUDA. Rust is the polar opposite of this model: everything is safe unless it involves invariants the compiler cannot check. Let's take a look at some of the invariants we face here.
Take this program as an example, written in CUDA C++:
```cpp
__global__ void kernel(int* buf, int* other)
{
    int idx = threadIdx.x;
    buf[idx] = other[idx];
}

int main(void)
{
    int N = 50;
    int *a, *b, *d_a, *d_b;

    a = (int*)malloc(N * sizeof(int));
    b = (int*)malloc(N * sizeof(int));
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, N * sizeof(int));

    for (int i = 0; i < N; i++) {
        a[i] = 0;
        b[i] = 2;
    }

    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    kernel<<<1, N>>>(d_a, d_b);

    cudaMemcpy(a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(b, d_b, N * sizeof(int), cudaMemcpyDeviceToHost);

    /* do something with the data */

    cudaFree(d_a);
    cudaFree(d_b);
    free(a);
    free(b);
}
```
You may think this looks innocent enough; it's a short and understandable program. But if you really think about it, it is a minefield of things that could go wrong. Let's list the big ones:
- `buf` could be too small; reading beyond allocated memory is undefined behavior.
- `other` could also be too small.
- The `cudaMalloc`, `cudaMemcpy`, kernel launch, or `cudaFree` calls could have errored, and we don't handle those errors, we simply ignore them.

This goes to show that CUDA C/C++, and CUDA overall, rely on shifting the burden of correctness from the API to the developer.
However, Rust uses the completely opposite design model: the compiler verifies as much as it can, and the burden is only shifted to the developer, behind `unsafe`, when it is absolutely essential.
This creates a big problem for us: it is very difficult (and sometimes impossible) to prove correctness statically when wrapping how CUDA works. We can solve many of the points above with RAII and a high-level wrapper, but some things we fundamentally cannot prove. The most common place this shows up is the CPU-GPU boundary, e.g. launching kernels.
Firstly, we cannot verify that the PTX we are calling is sound: that it has no data races, writes into the right buffers, does not rely on undocumented invariants, and does not write invalid data to buffers. This alone makes launching kernels permanently unsafe.
Second, CUDA does zero validation of kernel parameter mismatches; it will simply segfault on you, or even keep going and produce invalid data (or cause the kernel to invoke undefined behavior). This is a design flaw in CUDA itself; we have no control over it and no fully reliable way to fix it, so we must shift this burden of correctness to the developer.
Moreover, the CUDA GPU kernel model is entirely based on trust: trusting each thread to index into the correct place in buffers, trusting the caller of the kernel to uphold dimension invariants, etc. This is, once again, completely incompatible with how Rust does things. We can provide wrappers to calculate an index that always works, and macros to index a buffer automatically, but indexing in complex ways is a core operation in CUDA, and it is impossible for us to prove that whatever the developer is doing is correct.
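The "index that always works" boils down to one formula. Here is a CPU-side sketch (plain Rust, no CUDA; `global_index` is an illustrative stand-in, not a real `cuda_std` API) showing why unique per-thread indices guarantee non-overlapping access:

```rust
// CPU-side sketch (plain Rust, no CUDA) of the classic global-index
// formula that index helpers compute for you. The function name is
// illustrative, not a real cuda_std API.
fn global_index(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    block_idx * block_dim + thread_idx
}

fn main() {
    // A launch like <<<2 blocks, 4 threads>>> covers indices 0..8,
    // each exactly once, so per-thread indexing never overlaps.
    let mut seen = [false; 8];
    for block in 0..2u32 {
        for thread in 0..4u32 {
            let i = global_index(block, 4, thread) as usize;
            assert!(!seen[i], "no two threads may share an index");
            seen[i] = true;
        }
    }
    assert!(seen.iter().all(|&s| s));
    println!("all 8 elements covered exactly once");
}
```

Complex, hand-rolled indexing schemes are exactly where this guarantee breaks down, which is why the compiler cannot verify them for you.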
Finally, we would love to be able to use mutable references as kernel parameters, but this would be unsound. Because each kernel function is effectively invoked many times in parallel with the same parameters, the mutable reference would be aliased, which rustc considers undefined behavior (per its aliasing rules). So raw pointers, or slightly-less-unsafe wrappers, must be used instead. However, they are usually only needed for the initial buffer indexing; after that, you can turn the element into a mutable reference just fine (because you indexed in a way that guarantees no other thread will touch that element). Also note that shared references can be used as parameters just fine.
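To make the aliasing point concrete, here is a CPU-side sketch (plain Rust, no CUDA; `fill_per_thread` is an illustrative name, not a real API) of the pattern described above: every simulated "thread" receives the same raw pointer, but only materializes a `&mut` to the single element its unique index owns.

```rust
// CPU-side sketch (plain Rust, no CUDA) of the raw-pointer-then-&mut
// pattern kernels use: the pointer is shared, but each "thread" only
// creates a mutable reference to the one element its index owns.
fn fill_per_thread(buf: &mut [i32]) {
    let ptr = buf.as_mut_ptr();
    // Simulate one GPU "thread" per element, each with a unique index.
    for idx in 0..buf.len() {
        // SAFETY: indices are unique, so no two "threads" ever hold a
        // mutable reference to the same element at once.
        let elem: &mut i32 = unsafe { &mut *ptr.add(idx) };
        *elem = idx as i32 * 2;
    }
}

fn main() {
    let mut buf = vec![0i32; 4];
    fill_per_thread(&mut buf);
    assert_eq!(buf, [0, 2, 4, 6]);
}
```

A `&mut [i32]` parameter, by contrast, would hand every thread an aliased mutable reference to the whole buffer at once, which is exactly what Rust's aliasing rules forbid.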
Now that we have outlined why this is the case: why is using Rust a benefit if we still need `unsafe`?
Well, it's simple: eliminating most of the things a developer needs to think about to have a safe program is still far safer than leaving everything to the developer.
By using Rust, we eliminate:
- many common indexing mistakes (through helpers like `thread::index`);
- and countless other problems with things like graphs, streams, devices, etc.
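The error-handling point in particular is worth a sketch. The following is illustrative only: `CudaError` and `device_alloc` are hypothetical stand-ins, not the real cust API. The pattern it shows is that errors become values which `?` forces you to deal with, instead of status codes you can silently ignore as the C++ example does.

```rust
// Illustrative sketch: `CudaError` and `device_alloc` are hypothetical
// stand-ins, not the real cust API.
#[derive(Debug)]
struct CudaError(&'static str);
type CudaResult<T> = Result<T, CudaError>;

fn device_alloc(len: usize) -> CudaResult<Vec<i32>> {
    if len == 0 {
        return Err(CudaError("invalid allocation size"));
    }
    Ok(vec![0; len])
}

fn run() -> CudaResult<()> {
    // Unlike the C++ example, a failed allocation cannot be ignored:
    // `?` either unwraps the value or propagates the error upward.
    let _buf = device_alloc(50)?;
    Ok(())
}

fn main() {
    assert!(run().is_ok());
    assert!(device_alloc(0).is_err());
    println!("errors are values, not ignorable status codes");
}
```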
So, just because we cannot solve every problem with CUDA safety does not mean we cannot solve a lot of them and ease the developer's burden of correctness.
Besides, using Rust only adds safety; it does not make CUDA any more unsafe. In other words, there is only safety to be gained by using Rust.
The reasoning for this is the same as the reasoning for using CUDA over OpenGL/Vulkan compute shaders:
- Because of `rustc_codegen_ssa`'s less-than-ideal codegen, the optimizations performed by LLVM and libNVVM are needed.
- ...which `rustc_codegen_nvvm` does, and which can be used for profiling kernels in something like NSight Compute.

Moreover, CUDA is the primary tool used in big computing industries such as VFX and scientific computing. It is therefore much easier for CUDA C++ users to pick up Rust for GPU computing if most of the concepts are still the same. Plus, we can interface with existing CUDA code by compiling it to PTX and then linking it with our Rust code using the CUDA linker API (which is exposed in a high-level wrapper in cust).
## Why the driver API and not the runtime API?

Simply put, the driver API provides better control over concurrency, contexts, and module management, and overall offers finer performance control than the runtime API.
Let's break this down into the main new concepts the driver API introduces.
### Contexts

The first big difference in the driver API is that CUDA context management is explicit rather than implicit.
Contexts are similar to CPU processes: they manage all of the resources, streams, allocations, etc. associated with the operations performed inside them.
The driver API provides control over these contexts: you can create new contexts and drop them at any time, as opposed to the runtime API, which works off an implicit context that is destroyed on device reset. This is a problem for larger applications, because a newly integrated piece of CUDA code could call device reset when it finishes, causing all further uses of CUDA to fail.
### Modules

Modules are the second big difference in the driver API. Modules are similar to shared libraries: they contain all of the globals and functions (kernels) inside a PTX/cubin file. The driver API is language-agnostic; it works purely off PTX/cubin files. To explain why this is important, we need to briefly cover what PTX and cubin files are.
PTX is a low-level, assembly-like language and the penultimate step before the machine code the GPU actually executes. It is human-readable, and you can dump it from a CUDA C++ program with `nvcc ./file.cu --ptx`. This PTX is then optimized and lowered into a final format called SASS (the GPU's native assembly) and packaged into a cubin (CUDA binary) file.
Driver API modules can be loaded from PTX, cubin, or fatbin files. If a module is loaded as PTX, the driver API will JIT-compile the PTX to cubin and then cache it. You can also compile PTX to cubin yourself using `ptx-compiler` and cache it.
This pipeline provides much better control over which functions you actually load and cache. You can separate different functions into different modules that you load dynamically (and even dynamically reload). This can yield considerable performance benefits when dealing with a lot of functions.
### Streams

Streams are (one of) CUDA's ways of dispatching multiple kernels in parallel. You can loosely think of them as in-order work queues, a bit like OS threads: kernels dispatched one after another on a particular stream execute sequentially on the GPU, which is helpful when a kernel relies on a previous kernel's result.
The CUDA runtime API operates on a single default, globally shared stream. This causes a lot of issues for large programs or libraries that need to dispatch many kernels at the same time as efficiently as possible.
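The in-order behavior of a single stream can be sketched with a CPU analogy (plain Rust channels, not the CUDA API; `run_stream` is an illustrative name):

```rust
use std::sync::mpsc;
use std::thread;

// CPU-side analogy (plain Rust, not the CUDA API): a stream behaves
// like an in-order work queue. Items submitted to one queue run in
// submission order, while separate queues may run concurrently.
fn run_stream(items: Vec<u32>) -> Vec<u32> {
    let (tx, rx) = mpsc::channel::<u32>();
    // The "stream": a worker that executes items strictly in the
    // order they were received.
    let worker = thread::spawn(move || rx.iter().collect::<Vec<_>>());
    for item in items {
        tx.send(item).unwrap(); // like launching one kernel on this stream
    }
    drop(tx); // close the queue so the worker finishes
    worker.join().unwrap()
}

fn main() {
    // Within a single stream, execution order matches submission order.
    assert_eq!(run_stream(vec![0, 1, 2, 3]), [0, 1, 2, 3]);
    println!("in-order within one stream");
}
```

With multiple streams (multiple independent queues), unrelated work no longer has to serialize behind one global queue, which is the control the driver API gives you.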
## Should I use OpenCL or CUDA?

This is a complex issue with many arguments on both sides, so I will give you both sides as well as my opinion.
Pros for using OpenCL over CUDA:
Cons for using OpenCL over CUDA:
Pros for using CUDA over OpenCL:
- Existing CUDA C/C++ code can be linked together with cust- or `rustc_codegen_nvvm`-generated PTX by using the CUDA linker APIs, which are exposed in cust, allowing for incremental switching to Rust.

Cons for using CUDA over OpenCL:
## What is the difference between cust and RustaCUDA?

Cust is a fork of RustaCUDA that changes a lot of things inside of it and adds new features that RustaCUDA does not have.
The most significant changes (this list is not complete!) are:
- Errors can be propagated with `?`.
- Raw bindings come from `cust_raw`, so we can ensure updates for the latest CUDA features.
- `vek` linear algebra types for grid/block dimensions, with `DeviceCopy` implemented for them, have been added under the `vek` feature.

Changes that are currently in progress but not done/experimental:
Just like RustaCUDA, cust makes no assumptions about which language was used to generate the PTX/cubin. It could be C, C++, Futhark, or, best of all, Rust!
## What does cust mean?

Cust's name is literally just Rust + CUDA mashed together in a horrible way. Or you can pretend it stands for custard, if you really like custard.