docs/documentation/indexing/columnar.mdx
By default, all non-text and non-JSON fields are indexed using ParadeDB's columnar format.
This enables fast filtering pushdown, Top K ordering, and
aggregates over these fields. For example, in the following index definition, rating and id are columnar indexed
because they are integers, whereas description is not because it is text.
from django.db import connection
from paradedb.indexes import BM25Index
with connection.schema_editor() as schema_editor:
schema_editor.add_index(
MockItem,
BM25Index(
fields={
"id": {},
"description": {},
"rating": {},
},
key_field="id",
name="search_idx",
),
)
from sqlalchemy import Index
from paradedb.sqlalchemy import indexing
idx = Index(
"search_idx",
indexing.BM25Field(MockItem.id),
indexing.BM25Field(MockItem.description),
indexing.BM25Field(MockItem.rating),
postgresql_using="bm25",
postgresql_with={"key_field": "id"},
)
with engine.begin() as conn:
idx.create(conn)
ActiveRecord::Base.connection.add_bm25_index(
:mock_items,
fields: {
id: {},
description: {},
rating: {}
},
key_field: :id,
name: :search_idx
)
To enable columnar indexing for text and JSON fields, cast the field to a tokenizer with columnar set to true.
from django.db import connection
from paradedb.indexes import BM25Index
with connection.schema_editor() as schema_editor:
schema_editor.add_index(
MockItem,
BM25Index(
fields={
"id": {},
"description": {
"tokenizer": "unicode_words",
"named_args": {"columnar": True},
},
"rating": {},
},
key_field="id",
name="search_idx",
),
)
from sqlalchemy import Index
from paradedb.sqlalchemy import indexing
idx = Index(
"search_idx",
indexing.BM25Field(MockItem.id),
indexing.BM25Field(
MockItem.description,
tokenizer=indexing.tokenize.from_config(
{
"tokenizer": "unicode_words",
"named_args": {"columnar": True},
}
),
),
indexing.BM25Field(MockItem.rating),
postgresql_using="bm25",
postgresql_with={"key_field": "id"},
)
with engine.begin() as conn:
idx.create(conn)
ActiveRecord::Base.connection.add_bm25_index(
:mock_items,
fields: {
id: {},
description: {
tokenizer: :unicode_words,
named_args: { columnar: true }
},
rating: {}
},
key_field: :id,
name: :search_idx
)
Columnar defaults to false for all tokenizers besides literal and
literal normalized, which default to
true and do not require an explicit setting.
The reason is that tokenized fields can represent large documents and would be expensive to store column-wise, whereas literal and literal normalized fields are typically single-value and much more compact.
<Note> The columnar field stores the raw text value regardless of the tokenizer. For example, if `Hello world` is split into tokens `hello` and `world`, the columnar value remains `Hello world`.This is important because operations like filtering and sorting require the original field value, not the tokens.
</Note> <Note> Internally, Tantivy refers to columnar fields as fast fields. Our [legacy docs](/legacy/indexing/create-index) also refer to these fields as fast. </Note>