examples/guides/vllm-qwen35-27b-fp8.md
This guide explains how to deploy PentAGI with a fully local LLM setup using vLLM and Qwen3.5-27B-FP8. This configuration enables complete independence from cloud API providers while maintaining high performance for autonomous penetration testing workflows.
Qwen3.5-27B is a state-of-the-art dense language model from Alibaba Cloud with 27 billion parameters fully active on every token. It features a hybrid architecture combining:
This model is particularly well-suited for PentAGI's multi-agent workflows due to its:
FP8 W8A8 hardware acceleration requires GPUs with Compute Capability ≥ 8.9 (Ada Lovelace, Hopper, or Blackwell architectures). On older GPUs like Ampere (A100, A6000, RTX 3090), FP8 falls back to W8A16 mode via Marlin kernels with reduced performance.
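If you are unsure which path your hardware will take, a quick check with PyTorch (already a vLLM dependency) reports the compute capability per GPU. This helper is illustrative only and not a required step of the guide:

```python
# Illustrative check: report each GPU's compute capability so you know
# whether FP8 runs natively (>= 8.9, W8A8) or falls back to W8A16 Marlin kernels.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    mode = "W8A8 (native FP8)" if (major, minor) >= (8, 9) else "W8A16 (Marlin fallback)"
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} - CC {major}.{minor} -> {mode}")
```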
| Configuration | Total VRAM | Max Context | FP8 Mode | Status |
|---|---|---|---|---|
| 2× RTX 5090 (64 GB) | 64 GB | ≤131k | W8A8 | Good |
| 4× RTX 5090 (128 GB) | 128 GB | 262k (native) | W8A8 | Tested (~30 GB/GPU) |
| 1× H100 SXM (80 GB) | 80 GB | 262k | W8A8 | Single GPU |
| 2× H100 SXM (160 GB) | 160 GB | 262k | W8A8 | Excellent |
| 4× A100 80GB (320 GB) | 320 GB | 262k | W8A16 | Slower fallback |
Install CUDA toolkit and verify installation:
nvidia-smi
nvcc --version
Install Python package manager (uv recommended for faster installation):
curl -LsSf https://astral.sh/uv/install.sh | sh
IMPORTANT: The qwen3_5 architecture is not recognized in stable vLLM releases. You must use the nightly build until vLLM v0.17.0 is released.
Option 1: Using uv (recommended)
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Option 2: Using pip
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Option 3: Docker (alternative)
docker pull vllm/vllm-openai:nightly
python -c "import vllm; print(vllm.__version__)"
The following configuration has been tested and optimized for 4× RTX 5090 GPUs with ~30 GB VRAM usage per GPU at --gpu-memory-utilization 0.75:
| Parameter | Value | Explanation |
|---|---|---|
| --model | Qwen/Qwen3.5-27B-FP8 | HuggingFace model identifier |
| --tensor-parallel-size | 4 | Number of GPUs (1 shard per GPU) |
| --max-model-len | 262144 | Native context window size |
| --max-num-batched-tokens | 4096 | Optimal for low inter-token latency in chat |
| --block-size | 128 | Matches FP8 quantization block size |
| --gpu-memory-utilization | 0.75 | VRAM allocation ratio (adjust as needed) |
| --language-model-only | flag | Skip vision encoder → +2-4 GB for KV cache |
| --enable-prefix-caching | flag | Cache repeated system prompts |
| --reasoning-parser | qwen3 | Enable Qwen3.5 reasoning/thinking mode parser |
| --tool-call-parser | qwen3_xml | Prevents infinite !!!! bug with long contexts |
| --attention-backend | FLASHINFER | Best for Ada/Hopper/Blackwell GPUs |
| --speculative-config | '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' | Enable multi-token prediction (MTP) speculative decoding |
| -O3 | flag | Maximum optimization via torch.compile |
For Single GPU (H200, B200, B300):
vllm serve Qwen/Qwen3.5-27B-FP8 \
--max-model-len 262144 \
--max-num-batched-tokens 4096 \
--block-size 128 \
--gpu-memory-utilization 0.75 \
--language-model-only \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_xml \
--attention-backend FLASHINFER \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
-O3 \
--host 127.0.0.1 \
--port 8000
For Multi-GPU (4× RTX 5090):
NCCL_P2P_DISABLE=1 vllm serve Qwen/Qwen3.5-27B-FP8 \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--max-num-batched-tokens 4096 \
--block-size 128 \
--gpu-memory-utilization 0.75 \
--language-model-only \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_xml \
--attention-backend FLASHINFER \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}' \
-O3 \
--host 127.0.0.1 \
--port 8000
Multi-GPU Note: The NCCL_P2P_DISABLE=1 environment variable is required for Blackwell GPUs (RTX 5090) with tensor parallelism > 1 to prevent NCCL hangs. Update nvidia-nccl-cu12 to version 2.27.3+ for additional stability.
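Once the server finishes loading (either variant), you can confirm it is reachable and serving the expected model ID before pointing PentAGI at it. A minimal sketch using the openai Python package (assumed to be installed separately; any OpenAI-compatible client works):

```python
# Quick readiness check against vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy")  # vLLM needs no real key by default
print([m.id for m in client.models.list().data])  # expect ["Qwen/Qwen3.5-27B-FP8"]
```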
To disable the thinking mode at the server level (can still be enabled per-request):
vllm serve Qwen/Qwen3.5-27B-FP8 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
# ... other parameters
Best Practice: In multi-turn conversations, the historical model output should only include the final output and not the thinking content (<think>...</think> tags). This is automatically handled by vLLM's Jinja2 chat template, but if you're implementing custom conversation handling, ensure thinking tags are stripped from message history.
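If you do implement custom conversation handling, the sketch below shows one way to drop thinking content from assistant turns before they are sent back as history. The regex-based helper is illustrative and not part of PentAGI:

```python
import re

# Matches a complete <think>...</think> block, including trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(messages: list[dict]) -> list[dict]:
    """Return a copy of the history with <think> blocks removed from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant" and isinstance(msg.get("content"), str):
            msg = {**msg, "content": THINK_RE.sub("", msg["content"]).strip()}
        cleaned.append(msg)
    return cleaned
```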
After starting the vLLM server, verify it's working correctly with these test requests.
curl "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.5-27B-FP8",
"messages": [{"role": "user", "content": "hey! what is the weather in Moscow?"}],
"temperature": 1.0,
"top_k": 20,
"top_p": 0.95,
"min_p": 0.0,
"presence_penalty": 1.5,
"repetition_penalty": 1.0
}'
Expected: Response includes <think> tags with reasoning process.
curl "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.5-27B-FP8",
"messages": [{"role": "user", "content": "hey! what is the weather in Beijing?"}],
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8,
"min_p": 0.0,
"presence_penalty": 1.5,
"repetition_penalty": 1.0,
"chat_template_kwargs": {"enable_thinking": false}
}'
Expected: Direct response without <think> tags.
curl "http://127.0.0.1:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.5-27B-FP8",
"messages": [{"role": "user", "content": "hey! what is the weather in New York?"}],
"temperature": 1.0,
"top_k": 40,
"top_p": 1.0,
"min_p": 0.0,
"presence_penalty": 2.0,
"repetition_penalty": 1.0,
"chat_template_kwargs": {"enable_thinking": false}
}'
Expected: Creative/diverse responses without thinking tags.
If all tests return valid JSON responses with appropriate content, your vLLM server is ready for PentAGI integration.
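Because PentAGI drives the model through OpenAI-style function calling, it is also worth checking that the qwen3_xml parser returns structured tool calls rather than raw text. A hedged sketch using the openai Python package; the nmap_scan tool definition is a made-up example for testing, not an actual PentAGI tool:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy")

# Hypothetical tool definition used only to exercise the tool-call parser.
tools = [{
    "type": "function",
    "function": {
        "name": "nmap_scan",
        "description": "Run an nmap scan against a target host",
        "parameters": {
            "type": "object",
            "properties": {
                "target": {"type": "string"},
                "ports": {"type": "string"},
            },
            "required": ["target"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B-FP8",
    messages=[{"role": "user", "content": "Scan localhost on port 80"}],
    tools=tools,
)
# With the qwen3_xml parser working, this prints structured tool calls, not free text.
print(resp.choices[0].message.tool_calls)
```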
The Qwen team provides official recommendations for sampling parameters optimized for different use cases:
| Mode | temp | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking, general tasks | 1.0 | 0.95 | 20 | 1.5 |
| Thinking, coding (WebDev) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (Instruct), general | 0.7 | 0.8 | 20 | 1.5 |
| Non-thinking (Instruct), reasoning | 1.0 | 1.0 | 40 | 2.0 |
Additional parameters:
- repetition_penalty=1.0 for all modes
- max_tokens=32768 for most tasks
- max_tokens=81920 for complex math/coding tasks

These parameters are already applied in the PentAGI provider configuration files referenced below.
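Note that top_k, min_p, and repetition_penalty are vLLM extensions to the OpenAI API, so with the openai Python client they have to go through extra_body rather than named arguments. A minimal sketch for the "Thinking, general tasks" preset (values taken from the table above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B-FP8",
    messages=[{"role": "user", "content": "Outline a recon plan for a web target."}],
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    max_tokens=32768,
    # vLLM-specific sampling extensions are not part of the OpenAI SDK signature.
    extra_body={"top_k": 20, "min_p": 0.0, "repetition_penalty": 1.0},
)
print(resp.choices[0].message.content)
```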
PentAGI includes pre-configured provider files for Qwen3.5-27B-FP8 with optimized sampling parameters for different agent roles.
Two provider configurations are available:
With Thinking Mode (default): examples/configs/vllm-qwen3.5-27b-fp8.provider.yml
- Keeps <think> tags enabled for primary agents (primary_agent, assistant, adviser, refiner, generator)
- Uses temp=0.6 for coding agents (coder, installer, pentester)

Without Thinking Mode: examples/configs/vllm-qwen3.5-27b-fp8-no-think.provider.yml
- Disables thinking via chat_template_kwargs
- Uses temp=0.7 for general tasks, temp=1.0 for reasoning

When adding the provider in the PentAGI settings, use:
- Name: vLLM Qwen3.5-27B-FP8 (or any custom name)
- Type: Custom
- URL: http://127.0.0.1:8000/v1 (or your vLLM server address)
- API Key: dummy (vLLM doesn't require authentication by default)

Test the provider by creating a simple flow, for example: "Scan localhost port 80"

Based on internal testing with 4× RTX 5090 GPUs and 10 concurrent requests:
| Metric | Value |
|---|---|
| Prompt Processing Speed | ~13,000 tokens/sec |
| Completion Generation Speed | ~650 tokens/sec |
| Concurrent Flows | 12 flows simultaneously with stable performance |
| VRAM Usage | ~30 GB per GPU (at 0.75 utilization) |
| Context Window | Full 262K tokens supported |
These benchmarks demonstrate that Qwen3.5-27B-FP8 provides excellent throughput for running multiple PentAGI flows in parallel, making it suitable for production deployments.
Cause: Using stable vLLM release instead of nightly.
Solution: Install vLLM nightly build:
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Cause: Blackwell GPUs (RTX 5090) require P2P communication to be disabled when using tensor parallelism.
Solution: Set environment variable before starting vLLM:
export NCCL_P2P_DISABLE=1
Also update NCCL library:
pip install --upgrade nvidia-nccl-cu12
enable_thinking Parameter Ignored

Cause: The parameter must be passed inside chat_template_kwargs, not at the root level.
Solution: Use correct JSON structure:
{
"messages": [...],
"chat_template_kwargs": {"enable_thinking": false}
}
!!!! Generation on Long Contexts

Cause: Using the qwen3_coder parser with long contexts triggers a known bug.
Solution: Switch to XML parser:
--tool-call-parser qwen3_xml
Cause: Insufficient VRAM for chosen context length.
Solution: Reduce --max-model-len or --gpu-memory-utilization:
# Reduce context window
--max-model-len 131072
# Or reduce VRAM allocation
--gpu-memory-utilization 0.7
Cause: num_speculative_tokens > 1 is unstable in current nightly builds.
Solution: Use only 1 speculative token:
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":1}'
Qwen3.5-27B natively supports 262K tokens. For tasks requiring longer context (up to 1,010,000 tokens), you can enable YaRN (Yet another RoPE extensioN) scaling.
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.5-27B-FP8 \
--hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' \
--max-model-len 1010000 \
# ... other parameters
Important Notes:
- Adjust factor based on typical context length (e.g., factor=2.0 for 524K tokens); a small calculation sketch follows below.

Pre-configured provider files for this deployment are located in examples/configs/.
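The YaRN factor scales the native window linearly, so you can derive the value you need from your typical context length. The rounding-to-0.5 policy below is an illustrative choice, not an official Qwen recommendation:

```python
import math

NATIVE_CONTEXT = 262_144  # Qwen3.5-27B native window

def yarn_factor(target_context: int) -> float:
    """Smallest scaling factor (in 0.5 steps) that covers the target context."""
    return max(1.0, math.ceil(target_context / NATIVE_CONTEXT * 2) / 2)

print(yarn_factor(524_288))    # 2.0 -> matches the 524K example above
print(yarn_factor(1_010_000))  # 4.0 -> matches the 1,010,000-token configuration
```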