docs/provider-config/cerebras.mdx
Cerebras delivers the world's fastest AI inference through their revolutionary wafer-scale chip architecture. Unlike traditional GPUs that shuttle model weights from external memory, Cerebras stores entire models on-chip, eliminating bandwidth bottlenecks and achieving speeds of up to 2,600 tokens per second, often 20x faster than GPUs.
Website: https://cloud.cerebras.ai/
Cline supports the following Cerebras models:
- zai-glm-4.7 - Highly capable general-purpose model on Cerebras (up to 1,000 tokens/s), competitive with leading proprietary models on coding tasks.
- gpt-oss-120b - Intelligent general-purpose model with up to 3,000 tokens/s.
- qwen-3-235b-a22b-instruct-2507 - Advanced instruction-following model.

Cerebras has fundamentally reimagined AI hardware architecture to solve the inference speed problem:
Traditional GPUs use separate chips for compute and memory, forcing them to constantly shuttle model weights back and forth. Cerebras built the world's largest AI chip, a wafer-scale engine that stores entire models on-chip. No external memory, no bandwidth bottlenecks, no waiting.
Cerebras discovered that faster inference enables smarter AI. Modern reasoning models generate thousands of tokens as "internal monologue" before answering. On traditional hardware, this takes too long for real-time use. Cerebras makes reasoning models fast enough for everyday applications.
Unlike other speed optimizations that sacrifice accuracy, Cerebras maintains full model quality while delivering unprecedented speed. You get the intelligence of frontier models with the responsiveness of lightweight ones.
Learn more about Cerebras's technology in their blog posts:
Cerebras offers specialized plans for developers:
The qwen-3-coder-480b-free model provides access to high-performance inference at no cost, which is unique among speed-focused providers.
Reasoning models like qwen-3-235b-a22b-thinking-2507 can complete complex multi-step reasoning in under a second, making them practical for interactive development workflows.
Qwen3-Coder models are specifically optimized for programming tasks, delivering performance comparable to Claude Sonnet 4 and GPT-4.1 in coding benchmarks.
Works with any OpenAI-compatible tool: Cursor, Continue.dev, Cline, or any other editor that supports OpenAI endpoints.
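Because the endpoint is OpenAI-compatible, a minimal sketch needs only the Python standard library. The base URL `https://api.cerebras.ai/v1` and the model name below are assumptions to verify against Cerebras's own API documentation before use:

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible base URL; confirm in the Cerebras API docs.
CEREBRAS_BASE_URL = "https://api.cerebras.ai/v1"


def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def complete(payload: dict, api_key: str) -> dict:
    """POST the payload to the chat completions endpoint and parse the JSON reply."""
    req = urllib.request.Request(
        f"{CEREBRAS_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Model name taken from the list above; swap in whichever model you use.
    payload = build_chat_request("gpt-oss-120b", "Write a haiku about speed.")
    key = os.environ.get("CEREBRAS_API_KEY")
    if key:
        reply = complete(payload, key)
        print(reply["choices"][0]["message"]["content"])
```

Any tool that lets you override the OpenAI base URL and API key can point at the same endpoint, which is why editors like Cline work without provider-specific code.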