This guide describes a localization translation practice: deploy a translation-specific small model locally, expose it as an OpenAI-compatible service, and configure it for Lina to translate localization entries in batches.
This approach suits translating large numbers of system entries, plugin text, menus, collection titles, and field labels. Unlike online models, a local model is not subject to an external API's requests-per-minute (RPM), tokens-per-minute (TPM), or concurrency limits, so you can tune concurrency to what the machine and model can handle.
This guide uses:
- Model: tencent/HY-MT1.5-1.8B-GGUF
- Inference service: llama-server

:::info{title=Note}
HY-MT1.5-1.8B is a translation-specific small model. It is better suited to short entries, UI text, and batch translation. General chat models are not recommended as the first choice for localization tasks.
:::
Before starting, prepare:
- A machine that can run llama-server.
- A NocoBase instance that can reach llama-server.

On macOS, you can install it with Homebrew:
brew install llama.cpp
You can also use a prebuilt llama.cpp binary or build it from source. Either way, the only requirement is that the llama-server command is available.
Start the service with the GGUF model from Hugging Face:
llama-server \
-hf tencent/HY-MT1.5-1.8B-GGUF:Q4_K_M \
--host 0.0.0.0 \
--port 8000 \
-c 2048 \
-np 4
| Parameter | Description |
|---|---|
| -hf | Load the model from Hugging Face. |
| --host | Listening address. Use 127.0.0.1 for local testing or 0.0.0.0 for container or remote access. |
| --port | HTTP service port. |
| -c | Context length. Localization entries are usually short, so 2048 is usually enough. |
| -np | Number of parallel slots. Adjust according to machine performance. |
:::info{title=Tip}
If server resources are limited, start with -np 1 or -np 2, then increase gradually after verifying stability.
:::
After llama-server starts, check service health:
curl http://127.0.0.1:8000/health
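Because the model may still be downloading or loading right after startup, it can help to poll the health endpoint instead of calling it once. The following is a minimal sketch, assuming /health returns a non-2xx status until the service is ready; wait_for_health is a hypothetical helper, not part of llama.cpp.

```shell
# Hypothetical helper: poll a health URL once per second until it
# answers with a 2xx status or the timeout (in seconds) expires.
wait_for_health() {
  url=$1
  timeout=${2:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt.
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}
```

For example, `wait_for_health http://127.0.0.1:8000/health 120 && echo ready` waits up to two minutes before giving up.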
Then test translation through the OpenAI-compatible API:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tencent/HY-MT1.5-1.8B-GGUF:Q4_K_M",
"messages": [
{
"role": "user",
"content": "Translate the following text into Chinese. Output only the translated result without any additional explanation:\n\nSave"
}
]
}'
If you start from a local model file, change the model field to the actual model name returned or configured by the service.
:::warning{title=Note}
If a request does not respond for a long time, the model may be too slow, concurrency may be too high, or context may be too large. Lower -np and NocoBase translation concurrency first, then observe response time.
:::
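To make slow responses visible while tuning, the curl test above can be wrapped in a small helper with a request timeout. This is only a sketch: json_escape, build_payload, translate_entry, and the TRANSLATE_TIMEOUT variable are hypothetical names introduced for illustration, and the escaping covers only backslashes, double quotes, and line breaks.

```shell
# Hypothetical helper: escape a string for use inside a JSON string.
# Handles backslashes, double quotes, and newlines only.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# Build a chat completions payload with the same prompt as the curl test.
build_payload() {
  model=$1
  target_language=$2
  text=$3
  prompt="Translate the following text into ${target_language}. Output only the translated result without any additional explanation:

${text}"
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$(json_escape "$model")" "$(json_escape "$prompt")"
}

# Send one entry to the service; curl exits non-zero if the request
# takes longer than TRANSLATE_TIMEOUT seconds (default 30).
translate_entry() {
  base_url=$1
  shift
  curl -sS --max-time "${TRANSLATE_TIMEOUT:-30}" \
    "${base_url}/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$@")"
}
```

A call such as `translate_entry http://127.0.0.1:8000/v1 "tencent/HY-MT1.5-1.8B-GGUF:Q4_K_M" Chinese "Save"` then either prints the JSON response or fails after the timeout, which is a quick signal that -np or the concurrency setting should be lowered.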
Go to System Settings -> AI Employees -> LLM service and add an LLM service.
Example configuration:
| Setting | Example |
|---|---|
| Provider | OpenAI (completions) |
| Title | HY-MT Local |
| Base URL | http://127.0.0.1:8000/v1 |
| API Key | If llama-server has no authentication, use a placeholder such as dummy. |
| Enabled Models | Select tencent/HY-MT1.5-1.8B-GGUF:Q4_K_M, or enter the actual model name. |
After configuration, use Test flight to verify the model.
:::info{title=Tip}
If NocoBase runs in Docker, 127.0.0.1 points to the container itself, so it usually cannot reach a service running on the host. Use the host IP, the container network address, or host.docker.internal instead.
:::
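The choice of host can be sketched as a small helper. It assumes the common convention that Docker containers contain a /.dockerenv file, which is a heuristic rather than a guarantee; llm_base_url is a hypothetical name.

```shell
# Hypothetical helper: print a Base URL for the LLM service, picking
# host.docker.internal when the script appears to run inside Docker.
llm_base_url() {
  port=${1:-8000}
  if [ -f /.dockerenv ]; then
    host=host.docker.internal
  else
    host=127.0.0.1
  fi
  printf 'http://%s:%s/v1\n' "$host" "$port"
}

llm_base_url 8000
```

On Linux hosts, host.docker.internal is typically only resolvable if the container is started with `--add-host=host.docker.internal:host-gateway`, so verify the printed URL with curl from inside the container before saving it in NocoBase.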
Go to System Settings -> AI Employees -> AI employees, open Lina, and switch to Model settings.
1. Enable dedicated model configuration.
2. Select the HY-MT model from the LLM service configured above in Models.

After this, Lina uses this model for localization translation tasks, which prevents users or tasks from switching to general chat models.
For details, see Configure AI Employee Models.
Localization translation task concurrency is controlled by AI_LOCALIZATION_CONCURRENCY:
AI_LOCALIZATION_CONCURRENCY=10
Rules:
- Default: 10
- Minimum: 1
- Maximum: 20

The best concurrency depends on CPU, GPU, memory, model quantization, and llama-server -np. If the default concurrency causes issues:

1. Set AI_LOCALIZATION_CONCURRENCY=1 and verify single-entry translation.
2. Gradually increase llama-server -np and AI_LOCALIZATION_CONCURRENCY to 2 or 4.

:::warning{title=Note}
Do not set concurrency too high at the beginning. If concurrency exceeds actual model capacity, tasks may become slower due to queuing, timeouts, or service stalls.
:::
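One way to keep the two settings aligned while tuning is to derive the NocoBase value from the number of llama-server slots. The sketch below assumes a valid range of 1 to 20 for AI_LOCALIZATION_CONCURRENCY and treats matching it to -np as a starting point, not a rule; clamp_concurrency and NP are hypothetical names.

```shell
# Hypothetical helper: keep a concurrency value inside the assumed
# 1..20 range for AI_LOCALIZATION_CONCURRENCY.
clamp_concurrency() {
  value=$1
  min=1
  max=20
  if [ "$value" -lt "$min" ]; then value=$min; fi
  if [ "$value" -gt "$max" ]; then value=$max; fi
  printf '%s\n' "$value"
}

# NP is the value passed to llama-server -np.
NP=2
echo "AI_LOCALIZATION_CONCURRENCY=$(clamp_concurrency "$NP")"
```

Starting from the same number on both sides means requests are less likely to queue inside llama-server, which makes response times easier to interpret when you raise the values step by step.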
Go to System Management -> Localization Management.
1. Click Synchronize to ensure entries are synchronized.
2. Choose a translation mode:
   - Incremental translation: translate entries that do not have translations yet.
   - Selected translation: translate selected entries in the table.
   - Full translation: translate all entries in the current language.
3. Choose the entry scope:
   - All
   - Built-in entries: system and plugin entries.
   - Custom entries: route names, collection and field names, and UI content.

Start with Selected translation for a few entries to verify output style and speed before running incremental or full translation.
Lina builds requests from entries and reference translations. For short entries, existing reference translations are included to keep terminology consistent.
Prompt semantics are similar to:
Refer to the following translation:
{source_term} is translated as {target_term}
Translate the following text into {target_language}. Output only the translated result without any additional explanation:
{source_text}
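To reproduce this prompt when testing with curl, the template can be filled in by a small helper. This is a sketch of the template above, not Lina's actual implementation; build_prompt is a hypothetical name, and the placement of blank lines between the sections is an assumption.

```shell
# Hypothetical helper: fill in the prompt template.
# Arguments: target language, source text, and optionally one
# reference pair (source term, target term).
build_prompt() {
  target_language=$1
  source_text=$2
  source_term=${3:-}
  target_term=${4:-}
  # Only emit the reference section when both terms are given.
  if [ -n "$source_term" ] && [ -n "$target_term" ]; then
    printf 'Refer to the following translation:\n%s is translated as %s\n\n' \
      "$source_term" "$target_term"
  fi
  printf 'Translate the following text into %s. Output only the translated result without any additional explanation:\n\n%s\n' \
    "$target_language" "$source_text"
}

build_prompt French "Save changes" Save Enregistrer
```

Sending the printed text as the user message lets you check how much a reference term changes the output for a short entry.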
Check whether llama-server received requests. View service logs or call /v1/chat/completions with curl.
If the model receives requests but does not return, reduce:
- AI_LOCALIZATION_CONCURRENCY
- llama-server -np
- llama-server -c

Local translation models are usually more stable than general chat models. If explanations still appear, test the same prompt with curl first to verify the model's output style.
You can also translate shorter entries first or reduce sampling parameters such as temperature.
Check:
- Whether the Base URL ends with /v1.
- Whether llama-server is still running.

After AI translation finishes, review the results before publishing.
Reset system built-in entry translations to restore defaults. To contribute default translations for the system and official plugins, see Translation Contribution.