docs/content/faq.md
+++
disableToc = false
title = "FAQ"
weight = 24
icon = "quiz"
url = "/faq/"
+++
Here are answers to some of the most common questions.
Most gguf-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open an issue. However, be cautious about downloading models from the internet directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, and models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
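As a rough sketch, a downloaded gguf file can be exposed to LocalAI with a small model config placed in the models directory. The model name and file name below are made-up examples, not files shipped with LocalAI:

```yaml
# models/luna.yaml - hypothetical example
name: luna
parameters:
  model: luna-ai-llama2-uncensored.Q4_K_M.gguf  # gguf file placed in the models directory
```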
LocalAI stores downloaded models in the following locations by default:
- `./models` (relative to the current working directory)
- `/models` (inside the container, typically mounted to `./models` on the host)
- `~/.localai/models` (in your home directory)

You can customize the model storage location using the `LOCALAI_MODELS_PATH` environment variable or the `--models-path` command-line flag. This is useful if you want to store models outside your home directory for backup purposes, or to avoid filling up your home directory with large model files.
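For example, assuming the binary is named `local-ai` and using a placeholder path, the storage location can be changed like this:

```bash
# Point LocalAI at a custom models directory (the path is just an example)
LOCALAI_MODELS_PATH=/mnt/storage/localai-models local-ai

# Equivalent command-line flag
local-ai --models-path /mnt/storage/localai-models
```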
Model sizes vary significantly depending on the model and quantization level.

Quantization levels (smaller files, slightly reduced quality):

- Q4_K_M: ~75% of the original size
- Q4_K_S: ~60% of the original size
- Q2_K: ~50% of the original size
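To check how much space your downloaded models actually use, you can inspect the models directory. The path below assumes the default `./models` location:

```bash
# Show the size of each downloaded model file
du -sh ./models/*
```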
LocalAI applies a set of defaults when loading models with the llama.cpp backend; one of these is mirostat sampling, which achieves better results but slows down inference. You can disable it by setting `mirostat: 0` in the model config file. See also the advanced section ({{%relref "advanced/advanced-usage" %}}) for more information and this issue.
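A minimal sketch of a model config disabling mirostat might look like this (the model and file names are placeholders, and field placement can vary between LocalAI versions):

```yaml
# models/my-model.yaml - hypothetical example
name: my-model
parameters:
  model: my-model.Q4_K_M.gguf
mirostat: 0   # disable mirostat sampling to speed up inference
```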
LocalAI is a multi-model solution that doesn't focus on a specific model type (e.g., llama.cpp or alpaca.cpp); it handles all of these internally for faster inference, and it is easy to set up locally and to deploy to Kubernetes.
There are a few situations in which this could occur. Some tips are:

- Disable `mmap` in the model config file so the model is loaded entirely into memory (see the config sketch further down).
- `--threads` should match the number of physical cores. For instance, if your CPU has 4 physical cores, you would ideally allocate at most 4 threads to a model.
- Set `DEBUG=true`. This gives more information, including stats on the token inference speed.
- Set `"stream": true` in the request to see how fast the model is responding.

If your client uses the OpenAI API and supports setting a different base URL to send requests to, you can point it at the LocalAI endpoint. This allows you to use LocalAI with every application that was built to work with OpenAI, without changing the application!
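Any OpenAI-compatible client can simply point its base URL at LocalAI. As an illustration (the model name and port are assumptions), the curl call below targets the local endpoint and also sets `"stream": true`, as suggested in the tips above:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```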
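As for the performance tips above, a minimal model config sketch applying the threads and mmap suggestions could look like this (names and values are placeholders; adjust them to your hardware):

```yaml
# models/my-model.yaml - hypothetical example
name: my-model
parameters:
  model: my-model.Q4_K_M.gguf
threads: 4      # match the number of physical CPU cores
mmap: false     # load the whole model into memory instead of memory-mapping it
```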
There is GPU support, see {{%relref "features/GPU-acceleration" %}}.
localai-webui and chatbot-ui are available in the examples section and can be set up as per the instructions. However, as LocalAI is an API, you can already plug it into existing projects that provide UI interfaces to OpenAI's APIs. There are several already on GitHub, and they should be compatible with LocalAI already (as it mimics the OpenAI API).
Yes, see the examples!
Enable the debug mode by setting DEBUG=true in the environment variables. This will give you more information on what's going on.
You can also specify --debug in the command line.
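For example (assuming the binary is named `local-ai`):

```bash
# Enable debug output via the environment variable...
DEBUG=true local-ai

# ...or via the command-line flag
local-ai --debug
```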
This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.
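If you want to raise the context size, it can be set per model in the model config file. A minimal sketch is shown below; the names and value are examples, and the chosen context length must be supported by the model itself:

```yaml
# models/my-model.yaml - hypothetical example
name: my-model
context_size: 4096   # must not exceed what the model supports
parameters:
  model: my-model.Q4_K_M.gguf
```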
Your CPU probably does not support certain instructions that are compiled into the pre-built binaries by default. If you are running in a container, try setting `REBUILD=true` and disabling the CPU instructions that are not compatible with your CPU. For instance: `CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" make build`
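When running the container image, the rebuild can be triggered through environment variables. A sketch follows; the image tag, port, and mount path are examples and should be adapted to your setup:

```bash
docker run -p 8080:8080 \
  -e REBUILD=true \
  -e CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" \
  -v "$PWD/models:/models" \
  localai/localai:latest
```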