docs/scoring.md
tf-idf.knn distance metric, highlighted here.tf-idf scores with knn distance numbers.bm25 scoring for exact searches.When it comes to scoring a document hit for a specific field, BM25 scoring mechanism requires the following stats:
The scoring formula followed in BM25 is
\sum_{i}^n IDF(q_i) {{f(q_i,D) * (k1 + 1)}\over{f(q_i,D) + k1 * (1-b+b*{{fieldLen}\over{avgFieldLen}})}}
$IDF(q_i)$ here refers to Inverse Document Frequency, which talks about how rare (and hence rich in information) is a particular query term $q_i$ across all the documents in the index, which is calculated as
\ln(1 + {{docTotal - docTerm + 0.5}\over{docTerm + 0.5}})
Coming back to the BM25 scoring, $f(q_i,D)$ refers to the frequency of the query term in document $D$. The entire equation has certain multipliers
Bleve v2.5.0 updated the indexMapping construct with the concept of scoringModel. This is a global (meaning applicable to all fields) setting which drives which scoring algorithm to apply while scoring the document hits. Supported scoring models are defined here
For instance, while defining the index mapping for the data model that's been decided by the user, following snippet can be referred to enable BM25
indexMapping := bleve.NewIndexMapping()
indexMapping.TypeField = "type"
indexMapping.DefaultAnalyzer = "en"
indexMapping.ScoringModel = "bm25"
During search time there's no explicit change involved, unless the user wants to perform a global scoring.
Let's say that the user has a dataset which is quite large (let's say 3 million) and to have good throughput, they create 3 shards (with the same index mapping) for the "index". Each of these shards can be bleve.Index type and while performing a search over the entire dataset, a bleve.IndexAlias can be created over which a search can be performed. This parallelizes things pretty well, both on the indexing path and the search path.
The concept of global scoring is applicable when the index is "sharded" (similar to above situation). This is because each index has data which is disjoint, and thereby while performing the scoring on document hits on each of them, the value of stats is not complete at a global level, since we're doing a search over the entire dataset using the bleve.IndexAlias. For eg: docTotal value while scoring the document hits would be 1 million which is incorrect at a global level.
So in order to keep the scoring roughly same across varying count of the number of shards involved, we provide a mechanism to enable "global scoring". In this type of search, an initial roundtrip is performed to gather and aggregate the stats necessary for the scoring mechanism and in the second phase, the actual search is performed. So naturally this comes at a cost of latency. As a reference here's how the user can go about with it
multiPartIndex := bleve.NewIndexAlias(shard1, shard2)
// set the alias with the same index mapping which both the shards use.
err = multiPartIndex.SetIndexMapping(indexMapping)
if err != nil {
return err
}
ctx := context.Background()
ctx = context.WithValue(ctx, search.SearchTypeKey, search.GlobalScoring)
res, err := multiPartIndex.SearchInContext(ctx, searchRequest)
A note here is that, this would only matter if the relative order of the document hits vary quite a bit (vs single shard case). This would be possible when the shard count increases quite a bit, in low doc count situations or if there is a heavy skew in the data distribution amongst the shards for some reason. Ideally the shards are created when the data is quite large and each of them index the same amount of data - in which case the scores won't fluctuate much to affect the relative hit order and the user can choose to avoid the global scoring mechanism altogether.
TF-IDF is the default scoring mechanism involved (for backward compatibility reasons) and requires no change from the user at index or search time to avail it.
The scoring formula involved is
\sum_{i}^n f(q_i, D) * {{1}\over{\sqrt{fieldLen}}} * IDF(q_i)
where $IDF(q_i)$ is
1 + {{docTotal}\over{1 + docTerm}}
Note: TF-IDF formula doesn't accommodate logic for score saturation due to term frequency or fieldLen. So, it's recommended to use BM25 scoring by explicitly setting it in the index mapping.