Back to Sqlite Vec

NBC News Headlines: Exploring Hybrod FTS5 + Vector Search

examples/nbc-headlines/3_search.ipynb

0.1.96.3 KB
Original Source

NBC News Headlines: Exploring Hybrod FTS5 + Vector Search

This notebooks explore a few different ways one could combine FTS5 and vector search results, when querying FTS5 and sqlite-vec virtual table.

This dataset is a small list of headines scraped from NBC News, found in the ./1_scrape.ipynb notebook. To see how the fts_articles and vec_articles tables were created, see the ./3_search.ipynb notebook.

sql
.open tmp-artic2.db

.load ../../dist/vec0
.load ./lembed0

insert into lembed_models(name, model)
  values (
    'default',
    lembed_model_from_file('snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf')
  );

select vec_version(), lembed_version();

Full-text Search Only

A simple FTS query on the fts_articles virutal table can be made like so:

sql
.param set query planned parenthood

select
  rowid,
  headline,
  rank
from fts_articles
where headline match :query
order by rank
limit 10;

The rank column is the negative BM25 score of the query + document.

Vector Search Only

A KNN vector search can be made on the vec_articles virtual table like so:

sql
.param set query planned parenthood

select
  article_id,
  articles.headline,
  distance
from vec_articles
left join articles on articles.rowid = vec_articles.article_id
where headline_embedding match lembed(:query)
  and k = 10;

The distance column is the L2 distance between the query vector and the headline embedding.

The rest of this notebook explore different ways of combining these FTS5 and vector search results. The core queries are similar, and only really different on different JOIN or ORDER BY techniques.

Combination Technique #1: Keyword-first

In many search-engine cases, you may way to display keyword matches first, and supplement the rest wih with vector search results. This makes some intuitive sense — keyword matches are what uses expect, but you'll want to display more result if there are only a few matching documents.

sql
.param set query abortion bans
.param set k 10


with fts_matches as (
  select
    rowid as article_id,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_articles
  where headline match :query
  limit :k
),
vec_matches as (
  select
    article_id,
    row_number() over (order by distance) as rank_number,
    distance as score
  from vec_articles
  where
    headline_embedding match lembed(:query)
    and k = :k
  order by distance
),
combined as (
  select 'fts' as match_type, * from fts_matches
  union all
  select 'vec' as match_type, * from vec_matches
),
final as (
  select
    articles.id,
    articles.headline,
    combined.*
  from combined
  left join articles on articles.rowid = combined.article_id
)
select * from final;


We do this with a verbose CTE: one step for the FTS5 query, another for the vector search, one to "combine" the results with a UNION ALL, and one last one to LEFT JOIN back to the base articles table to get the headline.

Here we have 5 FTS results and 10 additional vector results. This seems pretty natural, a fallback to vector search when keywords matches lack a bit.

One note: this example doesn't do any de-duplication, so you may get the same results twice. So you may want to add a DISTINCT or GROUP BY somehwere to handle that.

Combination Technique #2: Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion is another combination technique, where matches that are both FTS matches and vector matches are ranked higher than other. The CTE logic is a bit more involved, but can still be represented in a few steps:

sql
.param set query abortion ban


.param set k 10
.param set rrf_k 60
.param set weight_fts 1.0
.param set weight_vec 1.0

with vec_matches as (
  select
    article_id,
    row_number() over (order by distance) as rank_number,
    distance
  from vec_articles
  where
    headline_embedding match lembed(:query)
    and k = :k
),
fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as rank_number,
    rank as score
  from fts_articles
  where headline match :query
  limit :k
),
final as (
  select
    articles.id,
    articles.headline,
    vec_matches.rank_number as vec_rank,
    fts_matches.rank_number as fts_rank,
    coalesce(1.0 / (:rrf_k + fts_matches.rank_number), 0.0) * :weight_fts
    + coalesce(1.0 / (:rrf_k + vec_matches.rank_number), 0.0) * :weight_vec
      as combined_rank,
    vec_matches.distance as vec_distance,
    fts_matches.score as fts_score
  from fts_matches
  full outer join vec_matches on vec_matches.article_id = fts_matches.rowid
  join articles on articles.rowid = coalesce(fts_matches.rowid, vec_matches.article_id)
  order by combined_rank desc
)
select * from final;


The first two CTE steps are identical to the "keyword-first" approach, just a normal FTS5 + vector KNN queries.

The combination CTE step is more involved, and is described in detail in this "Hybrid Search" Supabase docs page. What's nice about this approach is that you can configure the "weights" of FTS or vector results with a normal SQL parameter.

In this query, we can see the top result "Trump signals support for a national 15-week abortion ban" was neither a top FTS result or vector result — only ranked 2 and 3 respectively. But since it appeared in both the FTS and vector results, it's ranked higher than others, same with "Mitch McConnell shies away from supporting national abortion ban". The rest of the results are FTS + vector results interwoven together, pretty nice!

Combination Technique #3: Re-rank by semantics

Here we use FTS5 results are the "source truth", but we re-order them based on semantic similarity between

sql
.param set query abortion ban
.param set k 10


with fts_matches as (
  select
    rowid,
    row_number() over (order by rank) as fts_rank_number,
    rank as score
  from fts_articles
  where headline match :query
  limit :k
),
final as (
  select
    articles.id,
    articles.headline,
    fts_matches.*
  from fts_matches
  left join articles on articles.rowid = fts_matches.rowid
  order by vec_distance_cosine(lembed(:query), lembed(articles.headline))
)
select * from final;