Quickstart

notebooks/quickstart.ipynb

python
import shutil

import lance
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset

Creating datasets

Via pyarrow, it's really easy to create Lance datasets

Create a dataframe

python
df = pd.DataFrame({"a": [5]})
df

Write it to Lance

python
shutil.rmtree("/tmp/test.lance", ignore_errors=True)

dataset = lance.write_dataset(df, "/tmp/test.lance")
dataset.to_table().to_pandas()

Converting from parquet

python
shutil.rmtree("/tmp/test.parquet", ignore_errors=True)
shutil.rmtree("/tmp/test.lance", ignore_errors=True)

tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, "/tmp/test.parquet", format='parquet')

parquet = pa.dataset.dataset("/tmp/test.parquet")
parquet.to_table().to_pandas()

Write to Lance in one line

python
dataset = lance.write_dataset(parquet, "/tmp/test.lance")
python
# make sure it's the same
dataset.to_table().to_pandas()

Versioning

We can append rows

python
df = pd.DataFrame({"a": [10]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="append")

dataset.to_table().to_pandas()

We can overwrite the data and create a new version

python
df = pd.DataFrame({"a": [50, 100]})
tbl = pa.Table.from_pandas(df)
dataset = lance.write_dataset(tbl, "/tmp/test.lance", mode="overwrite")
python
dataset.to_table().to_pandas()

The old version is still there

python
dataset.versions()
python
lance.dataset('/tmp/test.lance', version=1).to_table().to_pandas()
python
lance.dataset('/tmp/test.lance', version=2).to_table().to_pandas()

We can create tags

python
dataset.tags.create("stable", 2)
dataset.tags.create("nightly", 3)
dataset.tags.list()

which can be checked out

python
lance.dataset('/tmp/test.lance', version="stable").to_table().to_pandas()

Vectors

Data preparation

For this tutorial let's use the Sift 1M dataset:

  • Download ANN_SIFT1M from: http://corpus-texmex.irisa.fr/
  • Direct link should be ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
  • Download and then unzip the tarball
python
!rm -rf sift* vec_data.lance
!wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
!tar -xzf sift.tar.gz

Convert it to Lance

python
from lance.vector import vec_to_table

uri = "vec_data.lance"

with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    # each fvecs record is a 4-byte int32 dimension (here 128) followed by
    # 128 float32s, so view the file as rows of 129 words and drop the
    # dimension-prefix column
    data = np.frombuffer(buf, dtype="<f4").reshape((1000000, 129))[:, 1:]
    dd = dict(zip(range(1000000), data))

table = vec_to_table(dd)
lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)
python
uri = "vec_data.lance"
sift1m = lance.dataset(uri)

KNN (no index)

Sample 100 vectors as query vectors

python
import duckdb
# if this segfaults make sure duckdb v0.7+ is installed
samples = duckdb.query("SELECT vector FROM sift1m USING SAMPLE 100").to_df().vector
samples

Call nearest neighbors (no ANN index here)

python
import time

start = time.time()
tbl = sift1m.to_table(columns=["id"], nearest={"column": "vector", "q": samples[0], "k": 10})
end = time.time()

print(f"Time(sec): {end-start}")
print(tbl.to_pandas())

Without an index, this scans the whole dataset and computes the distance to every vector.

For real-time serving we can do much better with an ANN index
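Conceptually, the flat scan does something like the following. This is a minimal numpy sketch of brute-force KNN on random stand-in vectors (not the SIFT data or Lance's actual implementation):

```python
import numpy as np

# toy flat (brute-force) KNN: exact distance to every row, then top-k
rng = np.random.default_rng(0)
vectors = rng.random((1000, 128), dtype=np.float32)  # stand-in dataset
q = rng.random(128, dtype=np.float32)                # stand-in query
k = 10

# L2 distance from q to every vector, then keep the k smallest
dists = np.linalg.norm(vectors - q, axis=1)
topk = np.argsort(dists)[:k]
print(topk, dists[topk])
```

The cost is linear in the dataset size, which is why the latency above grows with the number of rows.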

Build index

Now let's build an index. Lance currently supports IVF_PQ, IVF_HNSW_PQ, and IVF_HNSW_SQ indexes.

NOTE If you'd rather not wait for the index build, you can download a version with the index pre-built from here and skip the next cell

python
%%time

sift1m.create_index(
    "vector",
    index_type="IVF_PQ", # IVF_PQ, IVF_HNSW_PQ and IVF_HNSW_SQ are supported
    num_partitions=256,  # IVF
    num_sub_vectors=16,  # PQ
)

NOTE If you're trying this on your own data, make sure (vector dimension / num_sub_vectors) % 8 == 0; otherwise index creation will take much longer than expected due to SIMD misalignment
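That alignment check can be written out explicitly. Here's a hypothetical helper (not part of the Lance API) for sanity-checking a PQ configuration before building:

```python
# hypothetical helper: check that each PQ sub-vector's dimension is a
# whole multiple of 8 floats, so SIMD lanes stay aligned at index build
def pq_alignment_ok(dim: int, num_sub_vectors: int) -> bool:
    return dim % num_sub_vectors == 0 and (dim // num_sub_vectors) % 8 == 0

print(pq_alignment_ok(128, 16))  # config above: 128 / 16 = 8, aligned
print(pq_alignment_ok(128, 24))  # 128 not evenly divisible by 24
```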

Try nearest neighbors again with ANN index

Let's look for nearest neighbors again

python
sift1m = lance.dataset(uri)
python
import time

tot = 0
for q in samples:
    start = time.time()
    tbl = sift1m.to_table(nearest={"column": "vector", "q": q, "k": 10})
    end = time.time()
    tot += (end - start)

print(f"Avg(sec): {tot / len(samples)}")
print(tbl.to_pandas())

NOTE on performance: your actual numbers will vary with your storage. These numbers were measured on the local disk of an M2 MacBook Air. If you're querying S3 directly, an HDD, or a network drive, performance will be slower.

The latency vs recall is tunable via:

  • nprobes: how many IVF partitions to search
  • refine_factor: determines how many vectors are retrieved during re-ranking
python
%%time

sift1m.to_table(
    nearest={
        "column": "vector",
        "q": samples[0],
        "k": 10,
        "nprobes": 10,
        "refine_factor": 5,
    }
).to_pandas()

q => sample vector

k => how many neighbors to return

nprobes => how many partitions (in the coarse quantizer) to probe

refine_factor => controls "re-ranking". If k=10 and refine_factor=5, then Lance retrieves the 50 nearest neighbors via the ANN index, re-sorts them using actual distances, and returns the top 10. This improves recall without sacrificing too much performance
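The re-ranking step can be sketched in a few lines of numpy. This is a toy illustration of the assumed two-stage behavior on random data, with a fake stage-1 candidate set standing in for the ANN index:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.random((1000, 128), dtype=np.float32)
q = rng.random(128, dtype=np.float32)
k, refine_factor = 10, 5

# stage 1: pretend the ANN index returned k * refine_factor rough candidates
candidates = rng.choice(len(vectors), size=k * refine_factor, replace=False)

# stage 2: re-sort the 50 candidates by exact L2 distance, keep the top 10
exact = np.linalg.norm(vectors[candidates] - q, axis=1)
order = np.argsort(exact)[:k]
top_ids, top_dists = candidates[order], exact[order]
print(top_ids, top_dists)
```

Only k * refine_factor exact distances are computed, so the refinement stays cheap relative to a full scan.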

NOTE the latencies above include file I/O, as Lance currently doesn't hold anything in memory. Along with index-building speed, a purely in-memory version of the dataset would make the biggest impact on performance.

Features and vectors can be retrieved together

Usually we have other feature or metadata columns that need to be stored and fetched together. If you're managing the data and the index separately, you have to do a bunch of annoying plumbing to put it all together. With Lance, it's a single call

python
tbl = sift1m.to_table()
tbl = tbl.append_column("item_id", pa.array(range(len(tbl))))
tbl = tbl.append_column("revenue", pa.array((np.random.randn(len(tbl))+5)*1000))
tbl.to_pandas()
python
sift1m = lance.write_dataset(tbl, uri, mode="overwrite")
python
sift1m.to_table(columns=["revenue"], nearest={"column": "vector", "q": samples[0], "k": 10}).to_pandas()