guide/src/nvvm/debugging.md
When you try to compile an entire language for a completely different type of hardware, stuff is bound to break. In this section we will cover how to debug ICEs, segfaults, and more.
Segfaults are usually caused in one of two ways:
The first case can be debugged in two ways:
Building the codegen backend in debug mode and using RUSTC_LOG="rustc_codegen_nvvm=trace" ($env:RUSTC_LOG = "rustc_codegen_nvvm=trace"; if using PowerShell).
Note that this will dump a LOT of output, and when I say a LOT, I am not joking, so please pipe this to a file.
This will give you a detailed summary of almost every action the codegen backend has taken. You can examine the final few log entries to
see what the codegen backend was doing right before it segfaulted. This is usually straightforward because the logs are detailed.
Building LLVM 7 with debug assertions. This, coupled with logging, should give all the info needed to debug a segfault. It should get LLVM to abort with an assertion message whenever something bad happens.
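Since the trace log can be very large, it may help to look only at its tail. The following is a minimal sketch, not part of the project: `last_lines` is a hypothetical helper that keeps only the final `n` lines of whatever reader you give it (for example a `BufReader` over the file you piped the log into), so the whole log never has to sit in memory.

```rust
use std::collections::VecDeque;
use std::io::{self, BufRead};

/// Returns the last `n` lines read from `reader`.
/// Hypothetical helper for inspecting the end of a trace log.
fn last_lines<R: BufRead>(reader: R, n: usize) -> io::Result<Vec<String>> {
    if n == 0 {
        return Ok(Vec::new());
    }
    // Ring buffer holding at most `n` lines at any time.
    let mut tail: VecDeque<String> = VecDeque::with_capacity(n);
    for line in reader.lines() {
        let line = line?; // fails on I/O errors or invalid UTF-8
        if tail.len() == n {
            tail.pop_front(); // drop the oldest line to make room
        }
        tail.push_back(line);
    }
    Ok(tail.into_iter().collect())
}
```

Wrapping `std::fs::File::open(...)` in a `std::io::BufReader` and passing it to `last_lines` with a small `n` should show the last actions logged before the crash.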
The latter case is a bit worse.
Segfaults in libNVVM generally happen because we gave libNVVM something it did not expect. In an ideal world, libNVVM would just throw a validation error, but it wouldn't be an LLVM-based library if it threw friendly errors ;). libNVVM has been known to segfault on things like:
i1, i8, i16, i32, or i64 in function signatures (see int_replace.rs).
Generally there is no good way to debug these failures other than hoping libNVVM throws a validation error (which will cause an ICE).
I have created a tiny tool that runs llvm-extract on an LLVM IR file to try to isolate segfaulting functions. It works to some degree,
and I will add it to the project soon.
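Until that tool is available, the extraction step can be approximated by hand. The sketch below is not the tool mentioned above; `extract_args` and `extract_function` are hypothetical helpers that only build and run a single `llvm-extract` invocation. It assumes `llvm-extract` is on your `PATH` and comes from an LLVM version that can read the IR in question.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Builds the argument list for one `llvm-extract` invocation.
/// Hypothetical helper; the file names are supplied by the caller.
fn extract_args(input_ll: &str, function: &str, output_ll: &str) -> Vec<String> {
    vec![
        format!("--func={}", function), // extract this function from the module
        "-S".to_string(),               // write textual IR instead of bitcode
        input_ll.to_string(),
        "-o".to_string(),
        output_ll.to_string(),
    ]
}

/// Runs `llvm-extract` (assumed to be on PATH) and returns its exit status.
fn extract_function(input_ll: &str, function: &str, output_ll: &str) -> io::Result<ExitStatus> {
    Command::new("llvm-extract")
        .args(extract_args(input_ll, function, output_ll))
        .status()
}
```

Extracting one function at a time and feeding each result back through the failing step is one way to narrow down which function triggers the crash.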
Miscompilations are rare but annoying. They usually result in one of two things happening:
InvalidPtx error). Run ptxas on the .ptx
file to get a more informative error message. This is rare, but the most common cause is declaring
invalid extern functions (just grep for extern in the PTX file and check whether there are odd functions
that aren't CUDA syscalls like vprintf, malloc, free, etc.).
RUSTFLAGS="--emit=llvm-ir" and find crate_name.ll in target/nvptx64-nvidia-cuda/<debug/release>/deps/ and attach it in any bug report.
That should give you an idea of who is responsible for the miscompilation: us, LLVM, or NVVM. This should allow you to isolate the cause, file a bug report to LLVM/NVIDIA, and generate different IR to avoid it.
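The "grep for extern" check described above can also be scripted. This is a rough sketch under a few assumptions: it only looks at lines that begin with `.extern .func`, the allowlist contains just the three syscalls named in this section (it is not an exhaustive list), and `extern_func_name` and `suspicious_externs` are hypothetical names, not part of the project.

```rust
/// Syscalls named in this guide as expected externs.
/// Assumption: illustrative allowlist, not exhaustive.
const KNOWN_SYSCALLS: &[&str] = &["vprintf", "malloc", "free"];

/// Extracts the function name from a line starting with `.extern .func`.
/// Returns `None` for any other line (including `.extern` variables).
fn extern_func_name(line: &str) -> Option<&str> {
    let rest = line.trim_start().strip_prefix(".extern")?.trim_start();
    let rest = rest.strip_prefix(".func")?.trim_start();
    // Skip an optional return declaration such as `(.param .b32 retval)`.
    let rest = match rest.strip_prefix('(') {
        Some(after_open) => after_open.split_once(')')?.1.trim_start(),
        None => rest,
    };
    // The name ends at the first character that cannot be part of an identifier.
    let end = rest
        .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_' || c == '$'))
        .unwrap_or(rest.len());
    if end == 0 {
        None
    } else {
        Some(&rest[..end])
    }
}

/// Lists extern functions in `ptx` that are not in the allowlist.
fn suspicious_externs(ptx: &str) -> Vec<&str> {
    ptx.lines()
        .filter_map(extern_func_name)
        .filter(|name| !KNOWN_SYSCALLS.contains(name))
        .collect()
}
```

Reading the .ptx file with `std::fs::read_to_string` and passing its contents to `suspicious_externs` gives a short list of names worth checking by hand. An empty list does not prove the PTX is valid.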
If that doesn't work, then it might be a bug inside of CUDA itself, but that should be very rare. The best way to debug that (and really the only way) is to set up the crate for debug and check whether the problem still happens there. Then you can run your executable under Nsight Compute, go to the source tab, and examine the SASS (basically an assembly level lower than PTX) to see if ptxas miscompiled it.
If you set up the codegen backend for debug, it should give you a mapping from Rust code to SASS which should hopefully help to see what exactly is breaking.
Here is an example of the screen you should see: