troubleshooting/TROUBLESHOOTING_HANGS.md
This guide covers various approaches to debugging Java processes that hang due to native code issues in libnd4j.
When a native crash occurs in JNI code, the Java process often appears to "hang" instead of crashing outright. This happens because:
# Attach to a running process
sudo gdb -p <process-id>
# Once in GDB
(gdb) thread apply all bt
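When an interactive session is inconvenient (for example on a CI machine), the same backtraces can be captured non-interactively. The following is a minimal sketch that relies on GDB's `-batch` and `-ex` options; the function name `dump_native_backtraces` and the default output file name are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write backtraces of all threads of a running process to a file.
# Usage: dump_native_backtraces <process-id> [output-file]
dump_native_backtraces() {
    local pid="$1"
    local out="${2:-gdb-backtrace-${pid}.txt}"

    # Accept only a plain decimal process id.
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: dump_native_backtraces <process-id> [output-file]" >&2
            return 2
            ;;
    esac

    # -batch makes gdb exit once the -ex commands have run,
    # which detaches from the process and lets it continue.
    gdb -p "$pid" -batch \
        -ex "info threads" \
        -ex "thread apply all bt" > "$out" 2>&1
}
```

Calling `dump_native_backtraces <process-id>` a few times, several seconds apart, and comparing the files can help distinguish a true deadlock from slow progress. Depending on the system's ptrace restrictions, the call may need to run under `sudo`, as in the attach command above.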
gdb -p <process-id>
Key GDB commands:
thread apply all bt - Get backtraces from all threads
info threads - List all threads
thread <number> - Switch to a specific thread
bt - Show backtrace of current thread
mvn test -Dtest.prefix="valgrind --tool=memcheck"
The platform-tests/bin/java script provides special handling for Valgrind:
--track-origins=yes: Track the origins of uninitialized values
--keep-stacktraces=alloc-and-free: Maintain allocation/free stacktraces
--error-limit=no: Show all errors
-Djava.compiler=NONE
mvn clean install -Dlibnd4j.sanitize=ON -Dlibnd4j.sanitizers="address,undefined,float-divide-by-zero,float-cast-overflow"
Key ASAN Features (from CMakeLists.txt):
-fsanitize=address
-static-libasan
-ftls-model=local-dynamic
export LD_PRELOAD=/usr/lib/gcc/x86_64-linux-gnu/*/libasan.so
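One caveat with the `LD_PRELOAD` export above: Bash does not perform pathname expansion on the right-hand side of an assignment, so the `*` may reach the dynamic loader literally, and more than one GCC version directory may match. The sketch below resolves the glob to a single path first. The function name `find_libasan` is hypothetical, and picking the highest version directory is an assumption about which library is wanted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of one libasan.so below a GCC library directory.
# The default directory is the one used in the export above; pass another as $1.
find_libasan() {
    local base="${1:-/usr/lib/gcc/x86_64-linux-gnu}"
    local candidate

    # The glob is expanded by the for loop. If nothing matches, the pattern
    # stays literal and the existence check filters it out.
    for candidate in "$base"/*/libasan.so; do
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
        fi
    done | sort -V | tail -n 1   # version sort, keep the highest GCC directory
}
```

With the function defined, `export LD_PRELOAD="$(find_libasan)"` sets the variable to one concrete path. It is worth checking that the result is non-empty before launching the JVM, since an empty `LD_PRELOAD` silently disables the preload.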
Important Notes:
For CUDA-specific issues:
compute-sanitizer --tool memcheck ./your-program
Or attach to running process:
compute-sanitizer --tool memcheck --attach-pid <process-id>
Features:
# Basic sanitizer build
mvn clean install -Dlibnd4j.sanitize=ON
# With specific sanitizers
mvn clean install -Dlibnd4j.sanitize=ON -Dlibnd4j.sanitizers="address,undefined,float-divide-by-zero,float-cast-overflow"
# Enable CUDA debugging symbols
mvn clean install -Dlibnd4j.chip=cuda -Dlibnd4j.cuda=cudnn -Dlibnd4j.build=debug
Systematic Approach:
Log Collection:
Build Considerations:
Memory Access Violations:
CUDA Synchronization Issues:
Resource Leaks: