Debugging and Troubleshooting

Guides

Tools

  • Debug Tools

  • torch-distributed-gpu-test.py - a torch.distributed diagnostics script that checks that all GPUs in the cluster (one or many nodes) can talk to each other and allocate GPU memory.

  • NicerTrace - an improved version of Python's trace module, with multiple additional flags added to the constructor and more useful output.
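
For context on what NicerTrace builds on, here is a minimal sketch of the standard-library `trace` module that it is described as improving. It does not show NicerTrace's own constructor flags or output, which are not listed here. The `gcd` function and the `count_executed_lines` helper are hypothetical names used only for illustration.

```python
import trace


def gcd(a, b):
    """Toy function to trace (hypothetical example, not part of the tools above)."""
    while b:
        a, b = b, a % b
    return a


def count_executed_lines(func, *args):
    """Run func under the stdlib tracer and return (result, line counts).

    count=True records how often each line runs; trace=False suppresses the
    line-by-line printout that trace.Trace would otherwise write to stdout.
    """
    tracer = trace.Trace(count=True, trace=False)
    result = tracer.runfunc(func, *args)
    # counts maps (filename, line_number) -> number of times the line executed
    counts = tracer.results().counts
    return result, counts


if __name__ == "__main__":
    result, counts = count_executed_lines(gcd, 48, 18)
    print("result:", result)
    for (filename, lineno), hits in sorted(counts.items()):
        print(f"{filename}:{lineno} executed {hits} time(s)")
```

Passing `trace=True` instead prints each line as it executes, which is the mode most useful for finding where a program hangs or takes an unexpected branch.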