backend/onyx/document_index/opensearch/README.md
Opensearch has 2 phases, a Search phase and a Fetch phase. The Search phase works by getting the document scores on each
shard separately, then typically a fetch phase grabs all of the relevant fields/data for returning to the user. There is also
an intermediate phase (seemingly built specifically to handle hybrid search queries) which can run in between as a processor.
References:
https://docs.opensearch.org/latest/search-plugins/search-pipelines/search-processors/
https://docs.opensearch.org/latest/search-plugins/search-pipelines/normalization-processor/
https://docs.opensearch.org/latest/query-dsl/compound/hybrid/
Hybrid queries are basically parallel queries that each run through their own Search phase and do not interact in any way.
They also run across all the shards. It is not entirely clear what happens if a combination pipeline is not specified for them,
perhaps the scores are just summed.
When the normalization processor is applied to keyword/vector hybrid searches, documents that show up due to keyword match may not also have showed up in the vector search and vice versa. In these situations, it just receives a 0 score for the missing query component. Opensearch does not run another phase to recapture those missing values. The impact of this is that after normalizing, the missing scores are 0 but this is a higher score than if it actually received a non-zero score.
This may not be immediately obvious so an explanation is included here. If it got a non-zero score instead, it must be lower than all of the other scores of the list (otherwise it would have shown up). Therefore it would impact the normalization and push the other scores higher so that it's not only the lowest score still, but now it's a differentiated lowest score. This is not strictly the case in a multi-node setup but the high level concept approximately holds. So basically the 0 score is a form of "minimum value clipping".
Embedding models do not have a uniform distribution from 0 to 1. The values typically cluster strongly around 0.6 to 0.8 but also varies between models and even the query. It is not a safe assumption to pre-normalize the scores so we also cannot apply any additive or multiplicative boost to it. i.e. if results of a doc cluster around 0.6 to 0.8 and I give a 50% penalty to the score, it doesn't bring a result from the top of the range to 50th percentile, it brings it under the 0.6 and is now the worst match. Same logic applies to additive boosting.
So these boosts can only be applied after normalization. Unfortunately with Opensearch, the normalization processor runs last
and only applies to the results of the completely independent Search phase queries. So if a time based boost (a separate
query which filters on recently updated documents) is added, it would not be able to introduce any new documents
to the set (since the new documents would have no keyword/vector score or already be present) since the 0 scores on keyword
and vector would make the docs which only came because of time filter very low scoring. This can however make some of the lower
scored documents from the union of all the Search phase documents to show up higher and potentially not get dropped before
being fetched and returned to the user. But there are other issues of including these:
So while it is possible to apply time based boosting at the normalization stage (or specifically to the keyword score), we have decided it is better to not apply it during the OpenSearch query.
Because of these limitations, Onyx in code applies further refinements, boostings, etc. based on OpenSearch providing an initial filtering. The impact of time decay and boost should not be so big that we would need orders of magnitude more results back from OpenSearch.
Within the Search phase, there are optional steps like Rescore but these are not useful for the combination/normalization
work that is relevant for the hybrid search. Since the Rescore happens prior to normalization, it's not able to provide any
meaningful operations to the query for our usage.
Because the Title is included in the Contents for both embedding and keyword searches, the Title scores are very low relative to the actual full contents scoring. It is seen as a boost rather than a core scoring component. Time decay works similarly.