This article explains how to handle locale-specific text in your Typesense collections.
To enable language-specific text handling, specify the locale when defining your
collection's schema using the locale parameter in the field definition:
{
"name": "posts",
"fields": [
{ "name": "title", "type": "string", "locale": "vi" },
{ "name": "description", "type": "string", "locale": "en" }
]
}
Each field can have its own locale setting, allowing you to handle multilingual content within the same document.
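As a minimal sketch, the same schema can be assembled programmatically. The helper `localized_field` below is hypothetical (it is not part of any Typesense client); it only builds a dictionary with the same shape as the JSON above, which you could then hand to whichever client you use to create the collection.

```python
import json


def localized_field(name, locale=None, field_type="string"):
    """Return a field definition, adding "locale" only when one is given.

    Hypothetical helper for illustration; not a Typesense API.
    """
    field = {"name": name, "type": field_type}
    if locale is not None:
        field["locale"] = locale
    return field


# Same schema as the JSON example: each field carries its own locale.
posts_schema = {
    "name": "posts",
    "fields": [
        localized_field("title", "vi"),
        localized_field("description", "en"),
    ],
}

print(json.dumps(posts_schema, indent=2))
```

Leaving `locale` out of the call produces a field without the key, which keeps the default behavior.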
When no locale is specified for a field, Typesense treats it as English (en)
by default. This affects how the field's text is tokenized, and therefore how searches against that field behave.
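One way to keep the default from applying by accident is to audit a schema for string fields that have no explicit locale. This is a sketch only: `effective_locale` and `fields_on_default_locale` are hypothetical helper names, and the `"en"` default is taken from the statement above rather than from the Typesense API.

```python
DEFAULT_LOCALE = "en"  # assumption from the text: no locale means English handling


def effective_locale(field):
    """Locale that applies to a field definition (explicit or default)."""
    return field.get("locale") or DEFAULT_LOCALE


def fields_on_default_locale(schema):
    """Names of string-typed fields that rely on the default locale."""
    return [
        f["name"]
        for f in schema["fields"]
        if f["type"].startswith("string") and not f.get("locale")
    ]


schema = {
    "name": "posts",
    "fields": [
        {"name": "title", "type": "string", "locale": "vi"},
        {"name": "summary", "type": "string"},
        {"name": "views", "type": "int32"},
    ],
}

print(fields_on_default_locale(schema))  # ['summary']
```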
Typesense supports any valid two-letter ISO 639-1 language code, so you are not limited to the commonly documented languages. You can use any standard language code, such as:
- vi for Vietnamese
- id for Indonesian
- ms for Malay
- nl for Dutch
- pl for Polish

The locale parameter accepts any valid ISO 639-1 code and will use the appropriate ICU rules for that language.
When you specify a locale, the field definition takes this form:
{
"fields": [
{ "name": "title", "type": "string", "locale": "ANY-VALID-ISO-CODE" }
]
}
Where ANY-VALID-ISO-CODE can be any standard two-letter language code.
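A quick client-side shape check can catch typos before a schema is submitted. The sketch below uses a hypothetical helper, `looks_like_iso_639_1`, and only verifies that the value is two lowercase ASCII letters. It does not confirm that the code is actually assigned in ISO 639-1, nor does it reflect any validation Typesense itself performs.

```python
import re

# Two lowercase ASCII letters: the shape of an ISO 639-1 code.
_TWO_LETTER = re.compile(r"[a-z]{2}")


def looks_like_iso_639_1(code):
    """Return True if `code` has the shape of a two-letter language code.

    Shape check only; an unassigned pair such as "zz" also passes.
    """
    return isinstance(code, str) and _TWO_LETTER.fullmatch(code) is not None


for candidate in ["vi", "VI", "vie", ""]:
    print(repr(candidate), looks_like_iso_639_1(candidate))
```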
While any ISO 639-1 code works, here are some commonly used languages with their codes:
| Language | Code | Notes |
|---|---|---|
| Arabic | ar | Right-to-left script |
| Chinese | zh | Supports both simplified and traditional |
| Dutch | nl | |
| English | en | Default if no locale specified |
| French | fr | |
| German | de | |
| Hindi | hi | |
| Indonesian | id | |
| Italian | it | |
| Japanese | ja | |
| Korean | ko | |
| Malay | ms | |
| Polish | pl | |
| Portuguese | pt | |
| Russian | ru | Cyrillic script |
| Greek | el | Greek script |
| Spanish | es | |
| Thai | th | |
| Turkish | tr | |
| Vietnamese | vi | |
Different language families receive specialized handling:
- CJK Languages (Chinese, Japanese, Korean)
- Right-to-Left Scripts (Arabic, Hebrew)
- Languages with Special Characters

Some languages also receive special word segmentation handling.
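pre_segmented_query
For languages with complex word boundaries or when you need more control over
tokenization, Typesense offers the pre_segmented_query parameter. When it is set to true, Typesense splits the search query on spaces only instead of using its built-in, locale-aware tokenizer, so you decide where the word boundaries fall.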
Example usage:
{
"q": "pre tokenized query",
"pre_segmented_query": true
}
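The request above can be produced from a token list that your own segmenter has already computed. This is a sketch under assumptions: `pre_segmented_search` is a hypothetical helper, the `title` field is illustrative, and the tokens are presumed to come from a tokenizer of your choosing. It only builds the parameter dictionary and does not contact a server.

```python
def pre_segmented_search(tokens, query_by):
    """Build search parameters for a query that is already split into tokens.

    Tokens are joined with single spaces, since a pre-segmented query is
    expected to be space-separated. Empty or whitespace-only tokens are dropped.
    """
    cleaned = [t.strip() for t in tokens if t and t.strip()]
    return {
        "q": " ".join(cleaned),
        "query_by": query_by,
        "pre_segmented_query": True,
    }


params = pre_segmented_search(["pre", "tokenized", "query"], "title")
print(params)
```

A token that itself contains a space would be treated as two words, so segment fully before calling the helper.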
When using this feature, ensure that the query you send is already segmented into space-separated tokens.
Always Specify the Locale
Set the locale on every field whose content is not English:
{
"fields": [
{ "name": "title_vi", "type": "string", "locale": "vi" },
{ "name": "title_th", "type": "string", "locale": "th" }
]
}
Field Naming Convention
Include the language code in field names for multilingual content:
{
"title_en": "English Title",
"title_vi": "Tiêu đề tiếng Việt",
"title_th": "ชื่อเรื่องภาษาไทย"
}
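With this naming convention, application code can derive the field name from the user's language. The sketch below is illustrative: `localized_field_name` and `pick_localized` are hypothetical helpers, and the English fallback is an assumption rather than anything Typesense does for you.

```python
def localized_field_name(base, locale):
    """Compose a field name such as "title_vi" from a base name and a language code."""
    return f"{base}_{locale}"


def pick_localized(document, base, locale, fallback="en"):
    """Return the value for the requested language, else the fallback language."""
    for code in (locale, fallback):
        value = document.get(localized_field_name(base, code))
        if value:
            return value
    return None


doc = {
    "title_en": "English Title",
    "title_vi": "Tiêu đề tiếng Việt",
    "title_th": "ชื่อเรื่องภาษาไทย",
}

print(localized_field_name("title", "vi"))  # title_vi
print(pick_localized(doc, "title", "ja"))   # English Title
```

The same `localized_field_name` result can be used as the field to search against for a given user language.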
Consider Diacritic Handling
For European languages, consider whether your users are likely to search with or without diacritics.
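To get a feel for how often this matters for your data, you can compare text with and without its diacritics. The following is a sketch using Python's standard `unicodedata` module. `strip_diacritics` is a hypothetical helper for analysing your own queries, not a description of how Typesense processes text internally.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD).

    Letters with no decomposition (for example "đ" or "ø") are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["café", "Zürich", "español"]:
    print(word, "->", strip_diacritics(word))
```

Running a sample of real user queries through such a function shows how many of them are typed without diacritics that the indexed text contains.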
Custom Tokenization
Consider using pre_segmented_query for languages where default tokenization might not be optimal.