skills/mlops/inference/llama-cpp/references/hub-discovery.md
Use URL-only workflows first. Do not require hf or API clients just to find GGUF files, choose a quant, or build a llama-server command.
Search:
https://huggingface.co/models?apps=llama.cpp&sort=trending
Search with text:
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
Search with size bounds:
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
Repo local-app view:
https://huggingface.co/<repo>?local-app=llama.cpp
Repo tree API:
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
Repo file tree:
https://huggingface.co/<repo>/tree/main
Start from the models page with apps=llama.cpp.
Use:
- search=<term> for model family names such as Qwen, Gemma, Phi, or Mistral
- num_parameters=min:0,max:24B or similar if the user has hardware limits
- sort=trending when the user wants popular repos right now
Do not start with random GGUF repos if the user has not chosen a model family yet. Search first, shortlist second.
Example: https://huggingface.co/models?search=Qwen&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
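The search URLs above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name build_search_url and its arguments are illustrative assumptions, not part of any Hugging Face client.

```python
from urllib.parse import urlencode

MODELS_URL = "https://huggingface.co/models"


def build_search_url(term=None, max_params=None, sort="trending"):
    """Build a Hub search URL limited to llama.cpp-compatible repos.

    Hypothetical helper: the query keys mirror the URLs in this document.
    """
    params = {}
    if term:
        params["search"] = term
    params["apps"] = "llama.cpp"
    if max_params:
        # Same bound syntax as the hand-written URLs, e.g. "24B".
        params["num_parameters"] = f"min:0,max:{max_params}"
    params["sort"] = sort
    # Keep ":" and "," literal so the result matches the URLs shown here.
    return f"{MODELS_URL}?{urlencode(params, safe=':,')}"


print(build_search_url("Qwen", "24B"))
```

Calling build_search_url() with no arguments yields the plain trending listing, which is the suggested starting point when the user has not named a model family.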
Open:
https://huggingface.co/<repo>?local-app=llama.cpp
Extract, in order:
- The "Use this model" snippet, if it is visible as text
- The Hardware compatibility section from the fetched page text or HTML
- Any flags the snippet includes, such as --jinja
Treat the HF local-app snippet as the source of truth when it is visible.
Do this by fetching the URL and reading the response text, not by assuming what the UI would render in a browser. If the fetched page source does not expose Hardware compatibility, say that the section was not text-visible and fall back to the tree API plus generic guidance from quantization.md.
Open:
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
Treat the JSON response as the source of truth for repo inventory.
Keep entries where:
- type is file
- path ends with .gguf
Use these fields:
- path for the filename and subdirectory
- size for the byte size
- lfs.size to confirm the LFS payload size
Separate files into:
- quantized model files, such as Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
- projector files matching mmproj-*.gguf
- files under BF16/
Ignore unless the user asks:
- README.md
Use https://huggingface.co/<repo>/tree/main only as a human fallback if the API endpoint fails or the user wants the web view.
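The filtering and grouping rules above can be sketched in a few lines of standard-library Python. The function names are hypothetical, the sample entries use made-up sizes and a made-up BF16 filename, and preferring lfs.size over size when both are present is one reasonable reading of the field guidance, not a documented rule.

```python
import json
from urllib.request import urlopen


def tree_api_url(repo):
    """Return the recursive tree API URL for a repo such as 'org/name'."""
    return f"https://huggingface.co/api/models/{repo}/tree/main?recursive=true"


def fetch_tree(repo):
    """Fetch the repo inventory as parsed JSON (network call, not run here)."""
    with urlopen(tree_api_url(repo)) as resp:
        return json.load(resp)


def classify_ggufs(entries):
    """Keep .gguf file entries and split them into the three groups."""
    groups = {"quants": [], "projectors": [], "bf16": []}
    for entry in entries:
        path = entry.get("path", "")
        if entry.get("type") != "file" or not path.endswith(".gguf"):
            continue  # skips directories, README.md, and other non-GGUF files
        # Assumption: use the LFS payload size when the entry carries one.
        size = (entry.get("lfs") or {}).get("size", entry.get("size"))
        name = path.rsplit("/", 1)[-1]
        if name.startswith("mmproj-"):
            key = "projectors"
        elif path.startswith("BF16/"):
            key = "bf16"
        else:
            key = "quants"
        groups[key].append((path, size))
    return groups


# Illustrative entries only: sizes and the BF16 filename are placeholders.
sample = [
    {"type": "file", "path": "README.md", "size": 10},
    {"type": "file", "path": "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf", "size": 135,
     "lfs": {"size": 1000}},
    {"type": "file", "path": "mmproj-F16.gguf", "size": 135, "lfs": {"size": 500}},
    {"type": "directory", "path": "BF16"},
    {"type": "file", "path": "BF16/example-BF16.gguf", "size": 135,
     "lfs": {"size": 2000}},
]
print(classify_ggufs(sample))
```

In real use, classify_ggufs(fetch_tree("<repo>")) would replace the sample list; the network call is kept in its own function so the grouping logic can be checked offline.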
Preferred order:
llama-server -hf <repo>:<QUANT>
llama-server --hf-repo <repo> --hf-file <filename.gguf>
llama-cli -hf <repo>:<QUANT>
Use the exact-file form when the repo uses custom labels or nonstandard naming that could make :<QUANT> ambiguous.
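As a rough sketch of the choice between the two llama-server forms, the hypothetical helper below emits the exact-file form whenever a filename is supplied and the shorter -hf form otherwise. Only the flags already listed above are used; the function name and signature are assumptions for illustration.

```python
import shlex


def server_command(repo, quant=None, filename=None):
    """Return a llama-server command line for a Hub repo.

    Hypothetical helper: pass filename for the exact-file form,
    or quant for the shorter <repo>:<QUANT> form.
    """
    if filename:
        args = ["llama-server", "--hf-repo", repo, "--hf-file", filename]
    elif quant:
        args = ["llama-server", "-hf", f"{repo}:{quant}"]
    else:
        raise ValueError("pass either quant or filename")
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


print(server_command("unsloth/Qwen3.6-35B-A3B-GGUF", quant="UD-Q4_K_M"))
print(server_command("unsloth/Qwen3.6-35B-A3B-GGUF",
                     filename="Qwen3.6-35B-A3B-UD-Q4_K_M.gguf"))
```

Giving filename priority means a caller that has already confirmed the exact file on the tree API never falls back to a label that might be ambiguous.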
Example repo: unsloth/Qwen3.6-35B-A3B-GGUF
Use these URLs:
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF?local-app=llama.cpp
https://huggingface.co/api/models/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main?recursive=true
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main
On the local-app page, the hardware compatibility section can expose entries such as:
- UD-IQ4_XS - 17.7 GB
- UD-Q4_K_S - 20.9 GB
- UD-Q4_K_M - 22.1 GB
- UD-Q5_K_M - 26.5 GB
- UD-Q6_K - 29.3 GB
- Q8_0 - 36.9 GB
On the tree API, you can confirm exact filenames such as:
- Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
- Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
- Qwen3.6-35B-A3B-UD-Q6_K.gguf
- Qwen3.6-35B-A3B-Q8_0.gguf
- mmproj-F16.gguf
Good final output for this repo:
Repo: unsloth/Qwen3.6-35B-A3B-GGUF
Recommended quant from HF: UD-Q4_K_M (22.1 GB)
llama-server: llama-server --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF --hf-file Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Other GGUFs:
- Qwen3.6-35B-A3B-UD-Q5_K_M.gguf - 26.5 GB
- Qwen3.6-35B-A3B-UD-Q6_K.gguf - 29.3 GB
- Qwen3.6-35B-A3B-Q8_0.gguf - 36.9 GB
Projector:
- mmproj-F16.gguf - 899 MB
- Do not normalize repo-specific labels such as UD-Q4_K_M to Q4_K_M unless the page itself does.
- mmproj files are projector weights for multimodal models, not the main language model checkpoint.
- For generic quant guidance, see quantization.md.
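The final output layout shown for the example repo can be produced by a small formatter. This is only a sketch: format_report and its parameters are hypothetical, and sizes are passed in as display strings already read from the page rather than computed from byte counts.

```python
def format_report(repo, recommended, filename, others, projectors=()):
    """Render the final summary in the layout used in this document.

    recommended is a (label, size) pair; others and projectors are
    sequences of (filename, size) pairs. Labels are printed verbatim.
    """
    label, size = recommended
    lines = [
        f"Repo: {repo}",
        f"Recommended quant from HF: {label} ({size})",
        f"llama-server: llama-server --hf-repo {repo} --hf-file {filename}",
        "Other GGUFs:",
    ]
    lines += [f"- {name} - {sz}" for name, sz in others]
    if projectors:
        lines.append("Projector:")
        lines += [f"- {name} - {sz}" for name, sz in projectors]
    return "\n".join(lines)


print(format_report(
    "unsloth/Qwen3.6-35B-A3B-GGUF",
    ("UD-Q4_K_M", "22.1 GB"),
    "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf",
    [("Qwen3.6-35B-A3B-UD-Q5_K_M.gguf", "26.5 GB"),
     ("Qwen3.6-35B-A3B-UD-Q6_K.gguf", "29.3 GB"),
     ("Qwen3.6-35B-A3B-Q8_0.gguf", "36.9 GB")],
    [("mmproj-F16.gguf", "899 MB")],
))
```

Labels such as UD-Q4_K_M pass through unchanged, so the formatter never rewrites a repo-specific quant name.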