Stemming

Stemming is a technique that helps handle variations of words during search. When stemming is enabled, a search for one form of a word will also match other grammatical forms of that word. For example:

Searching for "run" would match "running", "runs", "ran"
Searching for "walk" would match "walking", "walked", "walks"
Searching for "company" would match "companies"

Typesense provides two approaches to handle word variations:

Basic Stemming

Basic stemming uses the Snowball stemmer algorithm to automatically detect and handle word variations. Being rules-based, it works well for common word patterns in the configured language, but may produce unintended side effects with brand names, proper nouns, and locations. Since these rules are designed primarily for common nouns, applying them to specialized content like company names or locations can sometimes degrade search relevance.

To enable basic stemming for a field, set "stem": true in your collection schema:

bash

curl "http://localhost:8108/collections" -X POST \
-H "Content-Type: application/json" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
  "name": "companies",
  "fields": [
    {"name": "description", "type": "string", "stem": true}
  ]
}'


The language used for stemming is automatically determined from the locale parameter of the field. For example, setting "locale": "fr" will use French-specific stemming rules.
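As an illustration of combining the two field parameters, the sketch below writes a schema that sets both "locale" and "stem" on one field. The collection name produits, the file name french_schema.json, and the create_collection helper are assumptions made for this example; the endpoint and headers are the same ones used in the schema example above.

```bash
# Hypothetical schema: a "produits" collection whose description field
# is stemmed with French rules ("locale": "fr" plus "stem": true).
cat > french_schema.json <<'EOF'
{
  "name": "produits",
  "fields": [
    {"name": "description", "type": "string", "locale": "fr", "stem": true}
  ]
}
EOF

# Hypothetical helper: POST a schema file to a local Typesense server.
create_collection() {
  curl "http://localhost:8108/collections" -X POST \
    -H "Content-Type: application/json" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    --data-binary @"$1"
}
```

With a server running, `create_collection french_schema.json` sends the schema.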

Custom Stemming Dictionaries

For cases where you need more precise control over word variations, or when dealing with irregular forms that algorithmic stemming can't handle well, you can use stemming dictionaries. These allow you to define exact mappings between words and their root forms.
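To make the idea of an exact word-to-root mapping concrete, here is a toy lookup in shell. It is only a conceptual sketch and not how Typesense implements dictionaries internally; the file toy_roots.txt and the root_of function are invented for this illustration.

```bash
# Toy lookup table in "word root" form (not Typesense's dictionary format).
cat > toy_roots.txt <<'EOF'
people person
children child
geese goose
EOF

# Print the mapped root for a word, or the word itself when no mapping exists.
root_of() {
  awk -v w="$1" '
    $1 == w { print $2; found = 1; exit }
    END     { if (!found) print w }
  ' toy_roots.txt
}
```

Here `root_of people` and `root_of person` both print `person`. Reducing the indexed text and the query to the same root is what lets either form match the other.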

Pre-made Dictionaries

Typesense provides a pre-made English plurals dictionary that handles common singular/plural variations. You can download it here.

This dictionary is particularly useful when you need reliable handling of English plural forms without the potential side effects of algorithmic stemming.

Creating a Stemming Dictionary

First, create a JSONL file with your word mappings:

json

{"word": "people", "root": "person"}
{"word": "children", "root": "child"}
{"word": "geese", "root": "goose"}
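If your mappings already live in a spreadsheet or word list, the JSONL file can be generated rather than written by hand. The sketch below assumes a tab-separated input file named pairs.tsv (a hypothetical name) with one word and its root per line, and assumes the words contain no quotes or backslashes that would need JSON escaping.

```bash
# Hypothetical input: one "word<TAB>root" pair per line.
printf 'people\tperson\nchildren\tchild\ngeese\tgoose\n' > pairs.tsv

# Emit one JSON object per line (JSONL) in the {"word": ..., "root": ...} shape.
# Lines that do not have exactly two tab-separated fields are skipped.
awk -F '\t' 'NF == 2 {
  printf "{\"word\": \"%s\", \"root\": \"%s\"}\n", $1, $2
}' pairs.tsv > dictionary.jsonl
```

The resulting dictionary.jsonl contains the same three lines as the hand-written example.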

Then upload it using the stemming dictionary API:

bash

curl "http://localhost:8108/stemming/dictionaries/import?id=irregular-plurals" \
-X POST \
-H "Content-Type: application/json" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
--data-binary @dictionary.jsonl


Sample Response

json

{
  "id": "irregular-plurals",
  "words": [
    {"root": "person", "word": "people"},
    {"root": "child", "word": "children"},
    {"root": "goose", "word": "geese"}
  ]
}


Using a Stemming Dictionary

To use a stemming dictionary, specify it in your collection schema using the stem_dictionary parameter:

bash

curl "http://localhost:8108/collections" -X POST \
-H "Content-Type: application/json" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" -d '{
  "name": "companies",
  "fields": [
    {"name": "title", "type": "string", "stem_dictionary": "irregular-plurals"}
  ]
}'


:::tip Understanding Stemming Options
When configuring a field for stemming:

- Using "stem": true alone applies the default Snowball stemmer algorithm.
- Using "stem_dictionary": "dictionary_name" automatically enables stemming ("stem": true is implied).
- When both options are set explicitly on the same field, dictionary stemming takes precedence.

If you specify only stem_dictionary, "stem": true still appears in your schema automatically, because basic stemming is enabled by default whenever dictionary stemming is configured.
:::
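To check the effect of a dictionary, you can search the field for one form of a word and see whether documents containing the other form come back. The sketch below assumes the companies collection from the example above exists and holds documents; search_url is a hypothetical helper that only builds the request URL, and it does no URL encoding, so it suits simple single-word queries.

```bash
# Hypothetical helper: build a search URL for a collection, query, and field.
# Arguments: collection name, query term, field to search (query_by).
search_url() {
  printf 'http://localhost:8108/collections/%s/documents/search?q=%s&query_by=%s\n' \
    "$1" "$2" "$3"
}

# Hypothetical helper: run the search against a local Typesense server.
search() {
  curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "$(search_url "$1" "$2" "$3")"
}
```

With a server running, `search companies person title` should also return documents whose title contains "people" if the irregular-plurals dictionary is applied to that field.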

Managing Dictionaries

Retrieve a Dictionary

bash

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/stemming/dictionaries/irregular-plurals"


List All Dictionaries

bash

curl -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
"http://localhost:8108/stemming/dictionaries"


Sample Response

json

{
  "dictionaries": ["irregular-plurals", "company-terms"]
}
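A deployment script might use this response to check that a dictionary exists before creating a collection that references it. The sketch below works on a saved copy of the sample response so it can be tried offline; list_response.json and has_dictionary are hypothetical names, and the plain-text match is a shortcut that a JSON-aware tool would handle more robustly.

```bash
# Save the sample list response locally (in practice, redirect the curl output).
printf '%s\n' '{"dictionaries": ["irregular-plurals", "company-terms"]}' > list_response.json

# Succeed if the quoted dictionary ID appears in the saved response.
has_dictionary() {
  grep -q "\"$1\"" list_response.json
}

if has_dictionary "irregular-plurals"; then
  echo "dictionary found"
fi
```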


Best Practices

- Start with Basic Stemming: For most use cases, basic stemming with the appropriate locale setting will handle common word variations well.
- Use Dictionaries for Exceptions: Add stemming dictionaries when you need to handle:
  - Domain-specific variations
  - Cases where basic stemming doesn't give desired results
- Language-Specific Considerations: Remember that basic stemming behavior changes based on the locale parameter. Set this appropriately for your content's language.