docs/src/integrations/duckdb.md
Lance datasets can be queried in SQL with DuckDB, an in-process OLAP relational database. Using DuckDB means you can write complex SQL queries (that may not yet be supported in Lance), without needing to move your data out of Lance.
!!! note
This integration is done via a DuckDB extension, whose source code and latest documentation (via README.md) is available
here.
To ensure you see the most up-to-date examples and syntax, check out the repo and the
DuckDB extension
documentation page.
pip install pylance duckdb pyarrow
We're now ready to begin querying Lance using DuckDB! First, install the extension.
=== "SQL"
```sql
INSTALL lance FROM community;
LOAD lance;
```
=== "Python"
```python
import duckdb
duckdb.sql(
"""
INSTALL lance FROM community;
LOAD lance;
"""
)
```
???+ info "Update extensions"
If you already have the extension installed locally, run the following command to update it to the
latest version:
UPDATE EXTENSIONS;
All examples below reuse a small dataset with three rows (duck, horse, dragon)
and a vector column with representative values. In the real world, you'd have
a high-dimensional array generated by an embedding model, and a much larger Lance dataset.
Use DuckDB's COPY ... TO ... to materialize query results as a Lance dataset.
=== "SQL"
```sql
COPY (
SELECT *
FROM (
VALUES
('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
) AS t(animal, noise, vector)
) TO './lance_duck.lance' (FORMAT lance, mode 'overwrite');
```
=== "Python"
```python
import duckdb
duckdb.sql(
"""
COPY (
SELECT *
FROM (
VALUES
('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
) AS t(animal, noise, vector)
) TO './lance_duck.lance' (FORMAT lance, mode 'overwrite');
"""
)
```
Now that the Lance dataset is written, let's query it using SQL in DuckDB.
=== "SQL"
```sql
SELECT *
FROM './lance_duck.lance'
LIMIT 5;
```
=== "Python"
```python
import duckdb
r1 = duckdb.sql(
"""
SELECT *
FROM './lance_duck.lance'
LIMIT 5;
"""
)
print(r1)
```
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
│ horse │ neigh │ [0.3, 0.1, 0.5] │
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
???+ info "Query S3 paths directly"
To access object store URIs (such as s3://...), configure a TYPE LANCE secret.
```sql
CREATE SECRET (
TYPE LANCE,
PROVIDER credential_chain,
SCOPE 's3://bucket/'
);
SELECT *
FROM 's3://bucket/path/to/dataset.lance'
LIMIT 5;
```
When you ATTACH a directory as a Lance namespace, you can create new datasets
using CREATE TABLE or CREATE TABLE AS SELECT. The dataset is written to
<namespace_root>/<table_name>.lance.
=== "SQL"
```sql
ATTACH './lance_ns' AS lance_ns (TYPE LANCE);
CREATE TABLE lance_ns.main.duck_animals AS
SELECT *
FROM (
VALUES
('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
) AS t(animal, noise, vector);
```
=== "Python"
```python
import duckdb
duckdb.sql(
"""
ATTACH './lance_ns' AS lance_ns (TYPE LANCE);
CREATE TABLE lance_ns.main.duck_animals AS
SELECT *
FROM (
VALUES
('duck', 'quack', [0.9, 0.7, 0.1]::FLOAT[]),
('horse', 'neigh', [0.3, 0.1, 0.5]::FLOAT[]),
('dragon', 'roar', [0.5, 0.2, 0.7]::FLOAT[])
) AS t(animal, noise, vector);
"""
)
```
You can then query the namespace as follows:
SELECT count(*) FROM lance_ns.main.duck_animals;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 3 │
└──────────────┘
You can perform vector search on a column. This returns the _distance
(smaller is closer, so sort in ascending order for nearest neighbors). The example vector here is similar to the query "duck".
=== "SQL"
```sql
SELECT animal, noise, vector, _distance
FROM lance_vector_search(
'./lance_duck.lance',
'vector',
[0.8, 0.7, 0.2]::FLOAT[],
k = 1,
prefilter = true
)
ORDER BY _distance ASC;
```
=== "Python"
```python
import duckdb
r2 = duckdb.sql(
"""
SELECT animal, noise, vector, _distance
FROM lance_vector_search(
'./lance_duck.lance',
'vector',
[0.8, 0.7, 0.2]::FLOAT[],
k = 1,
prefilter = true
)
ORDER BY _distance ASC;
"""
)
print(r2)
```
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
└─────────┴─────────┴─────────────────┘
Run keyword-based BM25 search as shown below. This returns a _score, which
is sorted in descending order to get the most relevant results.
=== "SQL"
```sql
SELECT animal, noise, vector, _score
FROM lance_fts(
'./lance_duck.lance',
'animal',
'the brave knight faced the dragon',
k = 1,
prefilter = true
)
ORDER BY _score DESC;
```
=== "Python"
```python
import duckdb
r3 = duckdb.sql(
"""
SELECT animal, noise, vector, _score
FROM lance_fts(
'./lance_duck.lance',
'animal',
'the brave knight faced the dragon',
k = 1,
prefilter = true
)
ORDER BY _score DESC;
"""
)
print(r3)
```
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
Hybrid search combines vector and FTS scores, returning a _hybrid_score in addition
to _distance / _score. To get the most relevant results, sort in descending order.
=== "SQL"
```sql
SELECT animal, noise, vector, _hybrid_score, _distance, _score
FROM lance_hybrid_search(
'./lance_duck.lance',
'vector',
[0.8, 0.7, 0.2]::FLOAT[],
'animal',
'the duck surprised the dragon',
k = 2,
prefilter = false,
alpha = 0.5,
oversample_factor = 4
)
ORDER BY _hybrid_score DESC;
```
=== "Python"
```python
import duckdb
r4 = duckdb.sql(
"""
SELECT animal, noise, vector, _hybrid_score, _distance, _score
FROM lance_hybrid_search(
'./lance_duck.lance',
'vector',
[0.8, 0.7, 0.2]::FLOAT[],
'animal',
'the duck surprised the dragon',
k = 2,
prefilter = false,
alpha = 0.5,
oversample_factor = 4
)
ORDER BY _hybrid_score DESC;
"""
)
print(r4)
```
This returns:
┌─────────┬─────────┬─────────────────┐
│ animal │ noise │ vector │
│ varchar │ varchar │ float[] │
├─────────┼─────────┼─────────────────┤
│ duck │ quack │ [0.9, 0.7, 0.1] │
│ dragon │ roar │ [0.5, 0.2, 0.7] │
└─────────┴─────────┴─────────────────┘
!!! warning
DuckDB treats column as a keyword in some contexts. It's recommended to
use text_column / vector_column as column names for the Lance extension.
Check out the lance-duckdb project
for the latest source code, and go through README.md for the latest API docs.
Additional pages are listed below.
sql.md lists the current SQL surface supported by this extension. It's recommended to refer to this page for the most up-to-date information.
cloud.md lists the current supported backends that allow you to access data on various cloud providers.
s3://... (also accepts s3a://... and s3n://..., normalized to s3://...)gs://...az://...oss://...hf://...