examples/nbc-headlines/3_search.ipynb
This notebooks explore a few different ways one could combine FTS5 and vector search results, when querying
FTS5 and
sqlite-vec virtual table.
This dataset is a small list of headines scraped from NBC News, found in the ./1_scrape.ipynb notebook.
To see how the fts_articles and vec_articles tables were created, see the ./3_search.ipynb notebook.
.open tmp-artic2.db
.load ../../dist/vec0
.load ./lembed0
insert into lembed_models(name, model)
values (
'default',
lembed_model_from_file('snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf')
);
select vec_version(), lembed_version();
A simple FTS query on the fts_articles virutal table can be made like so:
.param set query planned parenthood
select
rowid,
headline,
rank
from fts_articles
where headline match :query
order by rank
limit 10;
The rank column is the negative BM25 score of the query + document.
A KNN vector search can be made on the vec_articles virtual table like so:
.param set query planned parenthood
select
article_id,
articles.headline,
distance
from vec_articles
left join articles on articles.rowid = vec_articles.article_id
where headline_embedding match lembed(:query)
and k = 10;
The distance column is the L2 distance between the query vector and the headline embedding.
The rest of this notebook explore different ways of combining these FTS5 and vector search results.
The core queries are similar, and only really different on different JOIN or ORDER BY techniques.
In many search-engine cases, you may way to display keyword matches first, and supplement the rest wih with vector search results. This makes some intuitive sense — keyword matches are what uses expect, but you'll want to display more result if there are only a few matching documents.
.param set query abortion bans
.param set k 10
with fts_matches as (
select
rowid as article_id,
row_number() over (order by rank) as rank_number,
rank as score
from fts_articles
where headline match :query
limit :k
),
vec_matches as (
select
article_id,
row_number() over (order by distance) as rank_number,
distance as score
from vec_articles
where
headline_embedding match lembed(:query)
and k = :k
order by distance
),
combined as (
select 'fts' as match_type, * from fts_matches
union all
select 'vec' as match_type, * from vec_matches
),
final as (
select
articles.id,
articles.headline,
combined.*
from combined
left join articles on articles.rowid = combined.article_id
)
select * from final;
We do this with a verbose CTE: one step for the FTS5 query, another for the vector search, one to "combine" the results with a UNION ALL, and one last one to LEFT JOIN back to the base articles table to get the headline.
Here we have 5 FTS results and 10 additional vector results. This seems pretty natural, a fallback to vector search when keywords matches lack a bit.
One note: this example doesn't do any de-duplication, so you may get the same results twice. So you may want to add a DISTINCT or GROUP BY somehwere to handle that.
Reciprocal Rank Fusion is another combination technique, where matches that are both FTS matches and vector matches are ranked higher than other. The CTE logic is a bit more involved, but can still be represented in a few steps:
.param set query abortion ban
.param set k 10
.param set rrf_k 60
.param set weight_fts 1.0
.param set weight_vec 1.0
with vec_matches as (
select
article_id,
row_number() over (order by distance) as rank_number,
distance
from vec_articles
where
headline_embedding match lembed(:query)
and k = :k
),
fts_matches as (
select
rowid,
row_number() over (order by rank) as rank_number,
rank as score
from fts_articles
where headline match :query
limit :k
),
final as (
select
articles.id,
articles.headline,
vec_matches.rank_number as vec_rank,
fts_matches.rank_number as fts_rank,
coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts
+ coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
as combined_rank,
vec_matches.distance as vec_distance,
fts_matches.score as fts_score
from fts_matches
full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
order by combined_rank desc
)
select * from final;
The first two CTE steps are identical to the "keyword-first" approach, just a normal FTS5 + vector KNN queries.
The combination CTE step is more involved, and is described in detail in this "Hybrid Search" Supabase docs page. What's nice about this approach is that you can configure the "weights" of FTS or vector results with a normal SQL parameter.
In this query, we can see the top result "Trump signals support for a national 15-week abortion ban" was neither a top FTS result or vector result — only ranked 2 and 3 respectively.
But since it appeared in both the FTS and vector results, it's ranked higher than others, same with "Mitch McConnell shies away from supporting national abortion ban". The rest of the results are
FTS + vector results interwoven together, pretty nice!
Here we use FTS5 results are the "source truth", but we re-order them based on semantic similarity between
.param set query abortion ban
.param set k 10
with fts_matches as (
select
rowid,
row_number() over (order by rank) as fts_rank_number,
rank as score
from fts_articles
where headline match :query
limit :k
),
final as (
select
articles.id,
articles.headline,
fts_matches.*
from fts_matches
left join articles on articles.rowid = fts_matches.rowid
order by vec_distance_cosine(lembed(:query), lembed(articles.headline))
)
select * from final;