Spell correction, also known as auto-correction, typo tolerance, "Did you mean?", and so on, is software functionality that suggests alternatives to, or automatically corrects, the text you have typed. The concept of correcting typed text dates back to the 1960s, when computer scientist Warren Teitelman, who also invented the "undo" command, introduced a philosophy of computing called D.W.I.M., or "Do What I Mean." Instead of programming computers to accept only perfectly formatted instructions, Teitelman argued, they should be programmed to recognize obvious mistakes.
The first well-known product to provide spell correction functionality was Microsoft Word 6.0, released in 1993.
There are a few ways to do spell correction, but there is no purely programmatic way to convert a mistyped "ipone" into "iphone" with decent quality. In most cases the system has to rely on a dataset, such as a dictionary of correctly spelled words.
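As a rough illustration of a dataset-based approach, with a dictionary as the dataset, the Python sketch below ranks words from a tiny hand-made dictionary by Levenshtein distance to the mistyped input. The word list, the `suggest` helper, and its `max_edits` cutoff are assumptions made for this example; it does not show how Manticore implements suggestions internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_edits=2):
    """Return (candidate, distance) pairs within max_edits, closest first."""
    scored = [(w, levenshtein(word, w)) for w in dictionary]
    close = [(w, d) for w, d in scored if d <= max_edits]
    return sorted(close, key=lambda pair: (pair[1], pair[0]))


# Hypothetical dictionary; a real one would be built from your own data.
DICTIONARY = ["iphone", "ipod", "phone", "stone"]
print(suggest("ipone", DICTIONARY))  # the closest candidate is ('iphone', 1)
```

Several dictionary words sit within two edits of "ipone", which is why the quality of the result depends on the dataset and on extra signals such as how often each word occurs.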
Manticore provides the fuzzy search option and the commands CALL QSUGGEST and CALL SUGGEST that can be used for automatic spell correction purposes.
The Fuzzy Search feature allows for more flexible matching by accounting for slight variations or misspellings in the search query. It works similarly to a normal SELECT SQL statement or a /search JSON request but provides additional parameters to control the fuzzy matching behavior.
NOTE: The `fuzzy` option requires Manticore Buddy. If it doesn't work, make sure Buddy is installed.
NOTE: The `fuzzy` option is not available for multi-queries.
SELECT
...
MATCH('...')
...
OPTION fuzzy={0|1}
[, distance=N]
[, preserve={0|1}]
[, layouts='{be,bg,br,ch,de,dk,es,fr,uk,gr,it,no,pt,ru,se,ua,us}']
Note: When conducting a fuzzy search via SQL, the MATCH clause should not contain any full-text operators except the phrase search operator and should only include the words you intend to match.
<!-- intro -->
SELECT * FROM mytable WHERE MATCH('someting') OPTION fuzzy=1, layouts='us,ua', distance=2;
Example of a more complex Fuzzy search query with additional filters:
SELECT * FROM mytable WHERE MATCH('someting') AND (category='books' AND price < 20) OPTION fuzzy=1;
POST /search
{
"table": "test",
"query": {
"bool": {
"must": [
{
"match": {
"*": "ghbdtn"
}
}
]
}
},
"options": {
"fuzzy": true,
"layouts": ["us", "ru"],
"distance": 2
}
}
+------+-------------+
| id | content |
+------+-------------+
| 1 | something |
| 2 | some thing |
+------+-------------+
2 rows in set (0.00 sec)
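The JSON request above sends "ghbdtn" together with the `us` and `ru` layouts. The sketch below illustrates the general idea of a layout mismatch by mapping each US QWERTY key to the character on the same physical key in the Russian JCUKEN layout. The mapping table and the `remap_us_to_ru` name are assumptions for illustration (lowercase keys only); this is not Manticore's own code.

```python
# Characters in the same physical key positions (lowercase keys only).
US_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

US_TO_RU = str.maketrans(US_KEYS, RU_KEYS)


def remap_us_to_ru(text: str) -> str:
    """Re-read text typed on a US layout as if the Russian layout had been active."""
    return text.translate(US_TO_RU)


print(remap_us_to_ru("ghbdtn"))  # привет
```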
SELECT * FROM mytable WHERE MATCH('hello wrld') OPTION fuzzy=1, preserve=1;
POST /search
{
"table": "test",
"query": {
"bool": {
"must": [
{
"match": {
"*": "hello wrld"
}
}
]
}
},
"options": {
"fuzzy": true,
"preserve": 1
}
}
+------+-------------+
| id | content |
+------+-------------+
| 1 | hello wrld |
| 2 | hello world |
+------+-------------+
2 rows in set (0.00 sec)
POST /search
{
"table": "table_name",
"query": {
<full-text query>
},
"options": {
"fuzzy": {true|false}
[,"layouts": ["be","bg","br","ch","de","dk","es","fr","uk","gr","it","no","pt","ru","se","ua","us"]]
[,"distance": N]
[,"preserve": {0|1}]
}
}
Note: If you use the query_string, be aware that it does not support full-text operators except the phrase search operator. The query string should consist solely of the words you wish to match.
- `fuzzy`: Turns fuzzy search on or off.
- `distance`: Sets the Levenshtein distance for matching. The default is `2`.
- `preserve`: `0` or `1` (default: `0`). When set to `1`, keeps words that don't have fuzzy matches in the search results (e.g., "hello wrld" returns both "hello wrld" and "hello world"). When set to `0`, only returns words with successful fuzzy matches (e.g., "hello wrld" returns only "hello world"). Particularly useful for preserving short words or proper nouns that may not exist in Manticore Search.
- `layouts`: Keyboard layouts for detecting typing errors caused by keyboard layout mismatches (e.g., typing "ghbdtn" instead of "привет" when using the wrong layout). Manticore compares character positions across different layouts to suggest corrections. At least 2 layouts are required to detect mismatches effectively. No layouts are used by default. Use an empty string `''` (SQL) or array `[]` (JSON) to turn this off. Supported layouts include:
  - `be` - Belgian AZERTY layout
  - `bg` - Standard Bulgarian layout
  - `br` - Brazilian QWERTY layout
  - `ch` - Swiss QWERTZ layout
  - `de` - German QWERTZ layout
  - `dk` - Danish QWERTY layout
  - `es` - Spanish QWERTY layout
  - `fr` - French AZERTY layout
  - `uk` - British QWERTY layout
  - `gr` - Greek QWERTY layout
  - `it` - Italian QWERTY layout
  - `no` - Norwegian QWERTY layout
  - `pt` - Portuguese QWERTY layout
  - `ru` - Russian JCUKEN layout
  - `se` - Swedish QWERTY layout
  - `ua` - Ukrainian JCUKEN layout
  - `us` - American QWERTY layout

Both commands, `CALL QSUGGEST` and `CALL SUGGEST`, are accessible via SQL and support querying both local (plain and real-time) and distributed tables. The syntax is as follows:
CALL QSUGGEST(<word or words>, <table name> [,options])
CALL SUGGEST(<word or words>, <table name> [,options])
options: N as option_name[, M as another_option, ...]
These commands provide all suggestions from the dictionary for a given word. They work only on tables with infixing enabled and dict=keywords. They return the suggested keywords, Levenshtein distance between the suggested and original keywords, and the document statistics of the suggested keyword.
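When these statements are generated by application code, the `N as option_name` form can be assembled programmatically. The helper below is a hypothetical sketch: the `build_suggest_call` name and its backslash-based quoting are assumptions for this example. It only builds the SQL text and does not talk to a server.

```python
def quote(value: str) -> str:
    """Wrap a string in single quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_suggest_call(text, table, command="QSUGGEST", **options):
    """Build CALL SUGGEST/QSUGGEST text with 'value as option' pairs."""
    if command not in ("SUGGEST", "QSUGGEST"):
        raise ValueError("command must be SUGGEST or QSUGGEST")
    args = [quote(text), quote(table)]
    for name, value in options.items():
        if not name.isidentifier():
            raise ValueError(f"invalid option name: {name!r}")
        # String options (e.g. search_mode) are quoted; the rest are integers.
        rendered = quote(value) if isinstance(value, str) else str(int(value))
        args.append(f"{rendered} as {name}")
    return f"CALL {command}({', '.join(args)})"


print(build_suggest_call("bagg with tasel", "products", sentence=1, limit=3))
# CALL QSUGGEST('bagg with tasel', 'products', 1 as sentence, 3 as limit)
```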
If the first parameter contains multiple words, then:

- `CALL QSUGGEST` will return suggestions only for the last word, ignoring the rest.
- `CALL SUGGEST` will return suggestions only for the first word.

That's the only difference between them. Several options are supported for customization:
| Option | Description | Default |
|---|---|---|
| limit | Returns N top matches | 5 |
| max_edits | Keeps only dictionary words with a Levenshtein distance less than or equal to N | 4 |
| result_stats | Provides Levenshtein distance and document count of the found words | 1 (enabled) |
| delta_len | Keeps only dictionary words with a length difference less than N | 3 |
| max_matches | Number of matches to keep | 25 |
| reject | Rejected words are matches that are not better than those already in the match queue. They are put in a rejected queue, which is reset whenever a candidate actually makes it into the match queue. This parameter defines the size of the rejected queue (as `reject*max(max_matches,limit)`). If the rejected queue fills up, the engine stops looking for potential matches | 4 |
| result_line | Alternate display mode that returns all suggestions, distances, and docs as one row each | 0 |
| non_char | Do not skip dictionary words containing non-alphabetic symbols | 0 (skip such words) |
| sentence | Returns the original sentence along with the last word replaced by the matched one. | 0 (do not return the full sentence) |
| force_bigrams | Forces the use of bigrams (2-character n-grams) instead of trigrams for all word lengths, which can improve matching for words with transposition errors | 0 (use trigrams for words ≥6 characters) |
| search_mode | Refines suggestions by performing searches on the index. Accepts 'phrase' for exact phrase matching or 'words' for bag-of-words matching. When enabled, adds a found_docs column showing document counts and re-ranks results by found_docs descending, then by distance ascending. | N/A (disabled by default) |
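
To make the interplay of `max_edits`, `delta_len`, `limit`, and `reject` more concrete, the sketch below applies the filters from the table to a hand-made candidate list and computes the rejected-queue size from the formula in the `reject` row. The candidate data, function names, and final ordering are hypothetical; the engine's actual candidate generation and ranking are not modeled here.

```python
def reject_queue_size(reject=4, max_matches=25, limit=5):
    """Size of the rejected queue: reject * max(max_matches, limit)."""
    return reject * max(max_matches, limit)


def filter_candidates(word, candidates, max_edits=4, delta_len=3, limit=5):
    """candidates: (keyword, distance, docs) tuples with precomputed distances."""
    kept = [
        (kw, dist, docs)
        for kw, dist, docs in candidates
        if dist <= max_edits                      # max_edits: distance <= N
        and abs(len(kw) - len(word)) < delta_len  # delta_len: length difference < N
    ]
    # Assumed ordering for this sketch: closest first, then most documents.
    kept.sort(key=lambda c: (c[1], -c[2]))
    return kept[:limit]


print(reject_queue_size())  # 100 with the default values

# Hypothetical candidates for the input "tasel".
candidates = [("tassel", 1, 1), ("weasel", 2, 3), ("tasselled", 4, 1)]
print(filter_candidates("tasel", candidates))
```

In this sketch "tasselled" passes the default `max_edits` check but is dropped by `delta_len`, because it is four characters longer than the input.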
To show how it works, let's create a table and add a few documents to it.
create table products(title text) min_infix_len='2';
insert into products values (0,'Crossbody Bag with Tassel'), (0,'microfiber sheet set'), (0,'Pet Hair Remover Glove');
In the example below, the mistyped word "crossbUdy" gets corrected to "crossbody". By default, `CALL SUGGEST`/`CALL QSUGGEST` return:

- `distance` - the Levenshtein distance, i.e. how many edits were needed to convert the given word into the suggestion
- `docs` - the number of documents containing the suggested word

To disable the display of these statistics, use the option `0 as result_stats`.
call suggest('crossbudy', 'products');
+-----------+----------+------+
| suggest | distance | docs |
+-----------+----------+------+
| crossbody | 1 | 1 |
+-----------+----------+------+
If the first parameter is not a single word, but multiple, then CALL SUGGEST will return suggestions only for the first word.
call suggest('bagg with tasel', 'products');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| bag | 1 | 1 |
+---------+----------+------+
If the first parameter is not a single word, but multiple, then `CALL QSUGGEST` will return suggestions only for the last word.
CALL QSUGGEST('bagg with tasel', 'products');
+---------+----------+------+
| suggest | distance | docs |
+---------+----------+------+
| tassel | 1 | 1 |
+---------+----------+------+
Adding 1 as sentence makes CALL QSUGGEST return the entire sentence with the last word corrected.
CALL QSUGGEST('bag with tasel', 'products', 1 as sentence);
+-------------------+----------+------+
| suggest | distance | docs |
+-------------------+----------+------+
| bag with tassel | 1 | 1 |
+-------------------+----------+------+
The `1 as result_line` option changes how the suggestions are displayed. Instead of showing each suggestion in a separate row, it returns all suggestions, all distances, and all docs counts as one row each. Here's an example to demonstrate this:
CALL QSUGGEST('bagg with tasel', 'products', 1 as result_line);
+----------+--------+
| name | value |
+----------+--------+
| suggests | tassel |
| distance | 1 |
| docs | 1 |
+----------+--------+
The force_bigrams option can help with words that have transposition errors, such as "ipohne" vs "iphone". By using bigrams instead of trigrams, the algorithm can better handle character transpositions.
CALL SUGGEST('ipohne', 'products', 1 as force_bigrams);
+--------+----------+------+
| suggest| distance | docs |
+--------+----------+------+
| iphone | 2 | 1 |
+--------+----------+------+
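The effect of a transposition on n-gram overlap can be seen in a small sketch. It assumes plain, unpadded character n-grams, which may differ from how the engine builds them internally; the `ngrams` helper is introduced only for this example.

```python
def ngrams(word: str, n: int) -> set:
    """All contiguous character n-grams of a word (no padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


typo, target = "ipohne", "iphone"

shared_trigrams = ngrams(typo, 3) & ngrams(target, 3)
shared_bigrams = ngrams(typo, 2) & ngrams(target, 2)

print(sorted(shared_trigrams))  # []
print(sorted(shared_bigrams))   # ['ip', 'ne']
```

In this toy model the swapped letters leave the two words with no trigram in common, while the bigrams at the unchanged edges of the word still overlap, so the correct spelling can remain a candidate.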
The search_mode option enhances suggestions by performing actual searches on the index to count how many documents contain each suggested phrase or combination of words. This helps rank suggestions based on real document relevance rather than just dictionary statistics.
The option accepts two values:
- `'phrase'` - Performs exact phrase searches. For example, when suggesting "bag with tassel", it searches for the exact phrase `"bag with tassel"` and counts documents containing these words as an adjacent phrase.
- `'words'` - Performs bag-of-words searches. For example, when suggesting "bag with tassel", it searches for `bag with tassel` (without quotes) and counts documents containing all these words, regardless of order or intervening words.

NOTE: The `search_mode` option only works when `sentence` is enabled (i.e., when the input contains multiple words). For single-word queries, `search_mode` is ignored.
NOTE: Performance consideration: Each suggestion candidate triggers a separate search query against the index. If you need to evaluate many candidates, consider using a lower `limit` value to reduce the number of searches performed.
When search_mode is enabled, results include a found_docs column showing the document count for each suggestion, and results are re-ranked by found_docs descending, then by distance ascending.
CALL QSUGGEST('bag with tasel', 'products', 1 as sentence, 'phrase' as search_mode);
+-------------------+----------+------+-------------+
| suggest | distance | docs | found_docs |
+-------------------+----------+------+-------------+
| bag with tassel | 1 | 13 | 10 |
| bag with tazer | 2 | 27 | 3 |
+-------------------+----------+------+-------------+
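The re-ranking rule can be expressed as a two-key sort. The sketch below applies it to the rows from the example output above; the `Suggestion` tuple and the `rerank` function are assumptions for illustration, not part of any Manticore client.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    suggest: str
    distance: int
    docs: int
    found_docs: int


def rerank(rows):
    """Order by found_docs descending, then by distance ascending."""
    return sorted(rows, key=lambda r: (-r.found_docs, r.distance))


rows = [
    Suggestion("bag with tazer", 2, 27, 3),
    Suggestion("bag with tassel", 1, 13, 10),
]
for row in rerank(rows):
    print(row.suggest, row.found_docs, row.distance)
```

Note that "bag with tazer" has the higher dictionary `docs` count, yet "bag with tassel" ranks first because more documents actually contain it as a phrase.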
-- With phrase matching: finds exact phrases only
CALL QSUGGEST('test carp', 'products', 1 as sentence, 'phrase' as search_mode);
-- With words matching: finds documents with all words regardless of order
CALL QSUGGEST('test carp', 'products', 1 as sentence, 'words' as search_mode);
-- Phrase mode results:
+----------------+----------+------+-------------+
| suggest | distance | docs | found_docs |
+----------------+----------+------+-------------+
| test car | 1 | 17 | 5 |
| test carpet | 2 | 19 | 4 |
+----------------+----------+------+-------------+
-- Words mode results (more matches for "test carpet" due to word separation):
+----------------+----------+------+-------------+
| suggest | distance | docs | found_docs |
+----------------+----------+------+-------------+
| test carpet | 2 | 19 | 19 |
| test car | 1 | 17 | 5 |
+----------------+----------+------+-------------+
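The gap between the two result sets comes from how each mode matches documents. As a toy model (plain whitespace tokenization, no stemming or other text processing, and hypothetical helper names), the two checks could look like this:

```python
def matches_phrase(document: str, query: str) -> bool:
    """True if the query words appear adjacent and in the same order."""
    doc, q = document.lower().split(), query.lower().split()
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))


def matches_words(document: str, query: str) -> bool:
    """True if every query word occurs somewhere in the document."""
    return set(query.lower().split()) <= set(document.lower().split())


docs = ["test carpet cleaning", "test the carpet", "carpet test"]
for d in docs:
    print(d, matches_phrase(d, "test carpet"), matches_words(d, "test carpet"))
```

Only the first document satisfies the phrase check, while all three satisfy the words check, which mirrors why `found_docs` for "test carpet" is higher in words mode.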
Understanding the difference:
- Phrase mode (`'phrase'`): Searches for exact sequences. The query `"test carpet"` matches only documents where these words appear together in that exact order (e.g., "test carpet cleaning" matches, but "test the carpet" or "carpet test" do not).
- Words mode (`'words'`): Requires all words to be present in the document; order doesn't matter. The query `test carpet` matches any document containing both "test" and "carpet" anywhere (e.g., "test the carpet", "test red carpet", "carpet test" all match).

[Image: how CALL SUGGEST works in a little web app]
<!-- proofread -->