# llama.cpp
llama.cpp is a C/C++ inference engine for deploying large language models locally. It's lightweight and doesn't require Python, CUDA, or other heavy server infrastructure. llama.cpp uses the GGUF file format, which supports quantized model weights and memory-mapping so weights can be loaded quickly without copying the entire file into RAM.
> [!TIP]
> Browse the Hub for models already available in GGUF format.
Convert any Transformers model to GGUF format with the `convert_hf_to_gguf.py` script.

```bash
python3 convert_hf_to_gguf.py ./models/openai/gpt-oss-20b \
  --outfile gpt-oss-20b.gguf
```
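The script converts a local copy of the checkpoint. A minimal sketch, assuming the gpt-oss-20b model from the example above and using `huggingface_hub` to download it into the expected directory first:

```python
from huggingface_hub import snapshot_download

# Download the Transformers checkpoint to the local path used by the
# conversion command above (the target directory is an assumption here).
snapshot_download(
    repo_id="openai/gpt-oss-20b",
    local_dir="./models/openai/gpt-oss-20b",
)
```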
Deploy the model locally from the command line with `llama-cli` or start a web UI with `llama-server`. Add the `-hf` flag to indicate the model is from the Hub.

```bash
# interactive command-line chat
llama-cli -hf ggml-org/gpt-oss-20b-GGUF

# web UI and HTTP server
llama-server -hf ggml-org/gpt-oss-20b-GGUF
```
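`llama-server` also exposes an OpenAI-compatible HTTP API. A minimal sketch for querying it, assuming the server started above is running on the default port 8080:

```python
import requests

# Send a chat completion request to the local llama-server instance.
# The port and request shape follow llama-server defaults; adjust if you
# started the server with different options.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Explain the GGUF format in one sentence."}
        ],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```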
The conversion script relies on Transformers to interpret the checkpoint:

1. [`AutoConfig.from_pretrained`] loads the model's config.json file to extract metadata.
2. [`AutoTokenizer.from_pretrained`] extracts the vocabulary and tokenizer configuration.
3. Based on the `architectures` field in the config, the script selects a converter class from its internal registry. The registry maps Transformers architecture names (like [`LlamaForCausalLM`]) to corresponding converter classes.
4. The selected converter maps Transformers tensor names (for example, `model.layers.0.self_attn.q_proj.weight`) to GGUF tensor names, transforms tensors, and packages the vocabulary (sketched below).
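To make the last step concrete, here is a simplified sketch of the kind of name remapping a converter performs. The mapping rules are illustrative examples for Llama-style layouts, not the actual llama.cpp converter code:

```python
import re

# Illustrative (not exhaustive) rules mapping Transformers tensor names to
# GGUF-style names, e.g. model.layers.0.self_attn.q_proj.weight -> blk.0.attn_q.weight
RULES = [
    (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.weight$", r"blk.\1.attn_q.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.k_proj\.weight$", r"blk.\1.attn_k.weight"),
    (r"^model\.layers\.(\d+)\.self_attn\.v_proj\.weight$", r"blk.\1.attn_v.weight"),
    (r"^model\.embed_tokens\.weight$", r"token_embd.weight"),
]

def to_gguf_name(hf_name: str) -> str:
    """Return a GGUF-style tensor name for a Transformers tensor name."""
    for pattern, replacement in RULES:
        if re.match(pattern, hf_name):
            return re.sub(pattern, replacement, hf_name)
    raise KeyError(f"no mapping rule for {hf_name}")

print(to_gguf_name("model.layers.0.self_attn.q_proj.weight"))  # blk.0.attn_q.weight
```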