examples/nbc-headlines/2_build.ipynb
vec0 indexes
Using the dataset built in the previous ./1_scrape.ipynb notebook,
this notebook will enrich that dataset with a full-text search index and a semantic search index,
using FTS5,
sqlite-vec, and
sqlite-lembed.
This example will use pure SQL for everything. You can do the same exact thing in Python/JavaScript/Go/Rust/etc., or use
your own embeddings providers like Ollama/llamafile/OpenAI/etc. The core mechanics of FTS5 and sqlite-vec will remain the same.
We will use the Snowflake Arctic Embed v1.5 embeddings model to generate embeddings.
.open tmp-artic2.db
Creating a full-text search index is as simple as 3 SQL commands! We already have the headlines stored in the articles
table under the headline column, so it's just a matter of initializing the FTS5 virtual table and inserting the data.
create virtual table fts_articles using fts5(
headline,
content='articles', content_rowid='id'
);
insert into fts_articles(rowid, headline)
select rowid, headline
from articles;
insert into fts_articles(fts_articles) values('optimize');
By convention we name the FTS5 table fts_articles, where the fts_ prefix says "this virtual table is full-text search of the articles table". We are only searching the headline column, the rest can be ignored.
Here we are using the "external content tables"
feature in FTS5 tables, which will avoid storing the headlines a 2nd time, since they already exist in the articles table.
This part isn't required, but saves us a bit of storage.
We also use the 'optimize' command
to keep things tidy. This doesn't do much on such a small dataset, but is important to remember for larger tables!
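One caveat with external content tables: FTS5 does not automatically track changes to the underlying articles table. That's fine for this one-shot build, but if articles were ever updated afterwards, the standard pattern from the FTS5 documentation is a set of triggers that mirror every change into the index. A sketch (trigger names are our own; the `'delete'` command syntax is FTS5's):

```sql
-- Keep fts_articles in sync with future changes to articles.
create trigger articles_ai after insert on articles begin
  insert into fts_articles(rowid, headline) values (new.id, new.headline);
end;
create trigger articles_ad after delete on articles begin
  insert into fts_articles(fts_articles, rowid, headline)
    values ('delete', old.id, old.headline);
end;
create trigger articles_au after update on articles begin
  insert into fts_articles(fts_articles, rowid, headline)
    values ('delete', old.id, old.headline);
  insert into fts_articles(rowid, headline) values (new.id, new.headline);
end;
```

Note the special `'delete'` insert: external content tables need the old column values to remove a row from the index, since they don't store the text themselves.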
select *
from fts_articles
where headline match 'planned parenthood'
limit 10;
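FTS5 also gives us relevance ranking and match highlighting for free. A sketch using the built-in rank column (BM25, lower is more relevant) and the highlight() auxiliary function, which wraps matched terms, here in square brackets:

```sql
-- Rank matches by BM25 relevance and highlight the matched terms.
select
  highlight(fts_articles, 0, '[', ']') as headline,  -- 0 = first indexed column
  rank
from fts_articles
where headline match 'planned parenthood'
order by rank
limit 10;
```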
"Semantic index" in this case is just a fancy way of saying "vector store", which we will do with a sqlite-vec vec0 virtual table.
Now, sqlite-vec just stores vectors, it doesn't generate embeddings for us. There are hundreds of different remote APIs or local inference runtimes you can use to generate embeddings,
but here we will use sqlite-lembed to keep everything local and everything in pure SQL.
We will need to choose an embeddings model in the GGUF format,
since sqlite-lembed uses llama.cpp under the hood.
Here we will use Snowflake/snowflake-arctic-embed-m-v1.5,
where we can find a GGUF version here.
This model is small-ish (436MB at full size, 118MB with Q8_0 quantization), and is trained on fairly recent data, so it understands
recent events like "COVID-19" or "Kamala Harris".
You can download a .gguf quantized version of this model with:
wget https://huggingface.co/asg017/sqlite-lembed-model-examples/resolve/main/snowflake-arctic-embed-m-v1.5/snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf
And we can configure sqlite-lembed to use this model like so:
.load ./lembed0
.load ../../dist/vec0
insert into lembed_models(name, model) values
('default', lembed_model_from_file('./snowflake-arctic-embed-m-v1.5.d70deb40.f16.gguf'));
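Before embedding the whole table, it's worth a quick sanity check that the model loaded correctly. A sketch using sqlite-vec's vec_length() helper to confirm the output dimensions:

```sql
-- Embed one string and count the dimensions of the resulting vector.
-- For this Snowflake model, this should report 768.
select vec_length(lembed('hello world'));
```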
It's embeddings time! We can use the lembed() function, which takes in text and returns a vector representation of that text,
as an embedding BLOB that we can insert directly into a vec0 virtual table.
We'll declare this new vec_articles table, using the vec_ prefix as convention. This matches the fts_articles table above.
The Snowflake embedding model generates vectors with 768 dimensions, which we store as-is.
Embedding and inserting into this vector store is as easy as a single INSERT INTO and lembed() call.
create virtual table vec_articles using vec0(
article_id integer primary key,
headline_embedding float[768]
);
insert into vec_articles(article_id, headline_embedding)
select
rowid,
lembed(headline)
from articles;
This took ~13 minutes for ~14,500 embeddings on my older 2019 MacBook, but newer computers with better CPUs will finish quicker (it took 2m20s on my newer M1 Mac Mini).
Once the vec_articles table is ready, we can perform a KNN query like so:
select
articles.headline,
vec_articles.distance
from vec_articles
left join articles on articles.rowid = vec_articles.article_id
where headline_embedding match lembed('planned parenthood')
and k = 10;
The vectors in the vec_articles table take up a lot of space. A vector with 768 dimensions takes up 768 * 4 = 3072 bytes, or around 45MB of space for these ~14,500 entries.
That's a lot — the original text dataset was only ~4MB!
If you want to make the database smaller, there are a number of quantization and other compression techniques that trade accuracy for space.
Here's an example of performing binary quantization
on this dataset, storing 768-dimensional bit-vectors instead of floating-point vectors, a 32x size reduction, at the expense of accuracy.
We'll keep the current SQLite database as-is, and instead make a copy into a new SQLite database file, and change the vec_articles table
to store bit-vectors instead.
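To see what binary quantization actually does, here's a small illustration with a hand-written 8-dimensional vector (bit-vectors in sqlite-vec need a dimension count divisible by 8, so 8 dimensions pack into a single byte). Roughly, each positive float becomes a 1 bit and everything else becomes a 0 bit:

```sql
-- Quantize an 8-dimensional float vector down to a single byte of bits.
select vec_quantize_binary('[1.5, -0.5, 0.0, 2.0, -1.0, 0.5, -2.0, 1.0]');
```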
First, we'll make a copy of the current database into a new file:
vacuum into 'tmp-artic2.slim.db';
Now we'll make a connection to this new file, and drop the old vec_articles table that contains the large float[768] vectors.
attach database 'tmp-artic2.slim.db' as slim;
drop table slim.vec_articles;
Now we can create a new vec0 table, storing bit[768] vectors instead!
We can insert the original float[768] from the main.vec_articles table (original table),
calling vec_quantize_binary() to convert the floats to bits.
create virtual table slim.vec_articles using vec0(
article_id integer primary key,
headline_embedding bit[768]
);
insert into slim.vec_articles(article_id, headline_embedding)
select
article_id,
vec_quantize_binary(headline_embedding)
from main.vec_articles;
Then we can VACUUM the new slim database to shrink the file, reclaiming the pages freed by dropping the old vec0 table.
vacuum slim;
And there we have it! This file is 7.1MB, a large reduction from the original 53MB database.
KNN queries look nearly identical — the only change is wrapping the query vector in vec_quantize_binary().
select
slim.articles.headline,
slim.vec_articles.distance
from slim.vec_articles
left join slim.articles on slim.articles.rowid = slim.vec_articles.article_id
where headline_embedding match vec_quantize_binary(lembed('planned parenthood'))
and k = 10;
You'll notice the results differ slightly from the full-precision query above. Some results are ordered differently, some are missing.
The distance in this binary KNN search is Hamming distance, not the default L2 distance.
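Hamming distance just counts the number of bit positions where two bit-vectors differ. A sketch using sqlite-vec's vec_distance_hamming() on two hand-written 8-dimensional vectors that disagree in exactly one position:

```sql
-- These two vectors quantize to bit patterns differing in one position,
-- so the Hamming distance between them should be 1.
select vec_distance_hamming(
  vec_quantize_binary('[1.0, 1.0, 1.0,  1.0, -1.0, -1.0, -1.0, -1.0]'),
  vec_quantize_binary('[1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0]')
);
```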