(serve-multi-node-gpu-troubleshooting)=

# Multi-node GPU troubleshooting

This guide helps you diagnose and resolve common issues when deploying multi-node GPU workloads on KubeRay, particularly for large language model (LLM) serving with vLLM.

## Systematic debugging approach

When you encounter issues with multi-node GPU serving, use this systematic approach to isolate the problem:

1. **Test on different platforms.** Compare behavior between different platforms to determine whether the issue is specific to one environment.
2. **Vary hardware configurations.** Test with different GPU types, for example A100s versus H100s, to identify hardware-specific issues.
3. **Use minimal reproducers.** Create simplified test cases that isolate specific components, such as NCCL or model loading; see the sketch after this list.
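
For instance, a minimal NCCL reproducer can be as small as a single `all_reduce` between two processes. The following sketch (a hypothetical `minimal_nccl_test.py`, launched with `torchrun --nproc_per_node=2 minimal_nccl_test.py`) exercises NCCL in isolation from the rest of the serving stack:

```python
"""Minimal two-GPU NCCL reproducer (a sketch; launch with torchrun)."""
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce exercises the NCCL communicator end to end.
    tensor = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(tensor)
    print(f"rank {dist.get_rank()}: all_reduce result = {tensor.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```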

## Duplicate GPU resource counting

### Symptoms

- `ray status` shows duplicate GPU resources, for example, 24 GPUs when the cluster only has 16 GPUs.

### Root cause

The Ray head pod is incorrectly scheduled on a GPU worker node, causing resource accounting issues.

### Solution

Configure the head pod to use zero GPUs in your RayCluster specification:

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: my-cluster
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
      num-gpus: "0"  # Ensure head pod doesn't claim GPU resources.
    # ... other head group configuration
```
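
After applying the updated specification, you can confirm that Ray reports the expected GPU count. A minimal sketch, run from a pod or node that can reach the cluster and has `ray` installed:

```python
import ray

# Connect to the running cluster rather than starting a new local one.
ray.init(address="auto")

resources = ray.cluster_resources()
print(f"GPUs visible to Ray: {resources.get('GPU', 0)}")
```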

## NCCL failures on H100 instances

### Symptoms

NCCL initialization or topology detection fails when serving across multiple H100 nodes.

### Root cause

An outdated `aws-ofi-plugin` in the container image causes NCCL topology detection to fail on H100 instances.

### Solution

Update the `aws-ofi-plugin` in your container image to a recent version and rebuild the image.
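
To confirm which network plugin NCCL actually loads, you can enable NCCL's debug logging before any communicator is created. The following is a minimal sketch; `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL settings, but where you set them (container environment, KubeRay pod spec, or code) depends on your deployment:

```python
import os

# Make NCCL log its initialization and network plugin selection.
# Set these before any NCCL communicator is created.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
```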

## NCCL diagnostic script

If you continue to experience issues after following this guide, use the following diagnostic script to identify NCCL-related issues in your multi-node GPU setup:

```python
#!/usr/bin/env python3
"""
NCCL Diagnostic Script for Multi-Node GPU Serving

This script helps identify NCCL configuration issues that can cause
multi-node GPU serving failures. Run this script on each node to verify
NCCL functionality before deploying distributed workloads.

Usage: python3 multi-node-nccl-check.py
"""
import os
import sys
import socket
import torch
from datetime import datetime


def log(msg):
    """Log messages with a timestamp for better debugging."""
    timestamp = datetime.now().strftime("%H:%M:%S")
    print(f"[{timestamp}] {msg}", flush=True)


def print_environment_info():
    """Print relevant environment information for debugging."""
    log("=== Environment Information ===")
    log(f"Hostname: {socket.gethostname()}")
    log(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'not set')}")

    # Print all NCCL-related environment variables.
    nccl_vars = [var for var in os.environ if var.startswith('NCCL_')]
    if nccl_vars:
        log("NCCL Environment Variables:")
        for var in sorted(nccl_vars):
            log(f"  {var}: {os.environ[var]}")
    else:
        log("No NCCL environment variables set")


def check_cuda_availability():
    """Verify CUDA is available and functional."""
    log("\n=== CUDA Availability Check ===")

    if not torch.cuda.is_available():
        log("ERROR: CUDA not available")
        return False

    device_count = torch.cuda.device_count()
    log(f"CUDA device count: {device_count}")
    log(f"PyTorch version: {torch.__version__}")

    # Check NCCL availability in PyTorch.
    try:
        import torch.distributed as dist
        log(f"PyTorch NCCL available: {dist.is_nccl_available()}")
    except Exception as e:
        log(f"Error checking NCCL availability: {e}")

    return True


def test_individual_gpus():
    """Test that each GPU is working individually."""
    log("\n=== Individual GPU Tests ===")

    for gpu_id in range(torch.cuda.device_count()):
        log(f"\n--- Testing GPU {gpu_id} ---")
        try:
            torch.cuda.set_device(gpu_id)
            device = torch.cuda.current_device()
            log(f"Device {device}: {torch.cuda.get_device_name(device)}")

            # Print device properties.
            props = torch.cuda.get_device_properties(device)
            log(f"  Compute capability: {props.major}.{props.minor}")
            log(f"  Total memory: {props.total_memory / 1024**3:.2f} GB")

            # Test basic CUDA operations.
            log("  Testing basic CUDA operations...")
            tensor = torch.ones(1000, device=f'cuda:{gpu_id}')
            result = tensor.sum()
            log(f"  Basic CUDA test passed: sum = {result.item()}")

            # Test cross-GPU operations if multiple GPUs are available.
            if torch.cuda.device_count() > 1:
                log("  Testing cross-GPU operations...")
                try:
                    other_gpu = (gpu_id + 1) % torch.cuda.device_count()
                    test_tensor = torch.randn(10, 10, device=f'cuda:{gpu_id}')
                    tensor_copy = test_tensor.to(f'cuda:{other_gpu}')
                    log(f"  Cross-GPU copy successful: GPU {gpu_id} -> GPU {other_gpu}")
                except Exception as e:
                    log(f"  Cross-GPU copy failed: {e}")

            # Test memory allocation.
            log("  Testing large memory allocations...")
            try:
                large_tensor = torch.zeros(1000, 1000, device=f'cuda:{gpu_id}')
                log("  Large memory allocation successful")
                del large_tensor
            except Exception as e:
                log(f"  Large memory allocation failed: {e}")

        except Exception as e:
            log(f"ERROR testing GPU {gpu_id}: {e}")
            import traceback
            log(f"Traceback:\n{traceback.format_exc()}")


def test_nccl_initialization():
    """Test NCCL initialization and basic operations."""
    log("\n=== NCCL Initialization Test ===")

    try:
        import torch.distributed as dist

        # Set up a single-process NCCL environment.
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '29500'
        os.environ['RANK'] = '0'
        os.environ['WORLD_SIZE'] = '1'

        log("Attempting single-process NCCL initialization...")
        dist.init_process_group(
            backend='nccl',
            rank=0,
            world_size=1,
        )
        log("Single-process NCCL initialization successful!")

        # Test a basic NCCL operation.
        if torch.cuda.is_available():
            device = torch.cuda.current_device()
            tensor = torch.ones(10, device=device)
            # This is a no-op with world_size=1 but exercises NCCL.
            dist.all_reduce(tensor)
            log("NCCL all_reduce test successful!")

        dist.destroy_process_group()
        log("NCCL cleanup successful!")

    except Exception as e:
        log(f"NCCL initialization failed: {e}")
        import traceback
        log(f"Full traceback:\n{traceback.format_exc()}")


def main():
    """Main diagnostic routine."""
    log("Starting NCCL Diagnostic Script")
    log("=" * 50)

    print_environment_info()

    if not check_cuda_availability():
        sys.exit(1)

    test_individual_gpus()
    test_nccl_initialization()

    log("\n" + "=" * 50)
    log("NCCL diagnostic script completed")
    log("If you encountered errors, check the specific error messages above")
    log("and refer to the troubleshooting guide for solutions.")


if __name__ == "__main__":
    main()
```
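
To gather the same diagnostics from every GPU node at once, one option is to wrap the script in a Ray task. The following is a sketch that assumes the script is available as `multi-node-nccl-check.py` in the working directory on each worker:

```python
import socket
import subprocess

import ray

ray.init(address="auto")


@ray.remote(num_gpus=1)
def run_nccl_check():
    # Run the diagnostic script on whichever GPU node this task lands on.
    result = subprocess.run(
        ["python3", "multi-node-nccl-check.py"],
        capture_output=True,
        text=True,
    )
    return socket.gethostname(), result.stdout


# Adjust the task count and resource request to your cluster; tasks schedule
# wherever a free GPU is available, so two tasks may land on the same node.
futures = [run_nccl_check.remote() for _ in range(2)]
for hostname, output in ray.get(futures):
    print(f"--- {hostname} ---\n{output}")
```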