Back to Paradedb

Columnar Storage

docs/documentation/indexing/columnar.mdx

0.23.34.2 KB
Original Source

By default, all non-text and non-JSON fields are indexed using ParadeDB's columnar format. This enables fast filtering pushdown, Top K ordering, and aggregates over these fields. For example, in the following index definition, rating and id are columnar indexed because they are integers, whereas description is not because it is text.

<CodeGroup> ```sql SQL CREATE INDEX search_idx ON mock_items USING bm25 (id, description, rating) WITH (key_field = 'id'); ```
python
from django.db import connection
from paradedb.indexes import BM25Index

with connection.schema_editor() as schema_editor:
    schema_editor.add_index(
        MockItem,
        BM25Index(
            fields={
                "id": {},
                "description": {},
                "rating": {},
            },
            key_field="id",
            name="search_idx",
        ),
    )
python
from sqlalchemy import Index
from paradedb.sqlalchemy import indexing

idx = Index(
    "search_idx",
    indexing.BM25Field(MockItem.id),
    indexing.BM25Field(MockItem.description),
    indexing.BM25Field(MockItem.rating),
    postgresql_using="bm25",
    postgresql_with={"key_field": "id"},
)

with engine.begin() as conn:
    idx.create(conn)
ruby
ActiveRecord::Base.connection.add_bm25_index(
  :mock_items,
  fields: {
    id: {},
    description: {},
    rating: {}
  },
  key_field: :id,
  name: :search_idx
)
</CodeGroup>

To enable columnar indexing for text and JSON fields, cast the field to a tokenizer with columnar set to true.

<CodeGroup> ```sql SQL CREATE INDEX search_idx ON mock_items USING bm25 (id, (description::pdb.unicode_words('columnar=true')), rating) WITH (key_field = 'id'); ```
python
from django.db import connection
from paradedb.indexes import BM25Index

with connection.schema_editor() as schema_editor:
    schema_editor.add_index(
        MockItem,
        BM25Index(
            fields={
                "id": {},
                "description": {
                    "tokenizer": "unicode_words",
                    "named_args": {"columnar": True},
                },
                "rating": {},
            },
            key_field="id",
            name="search_idx",
        ),
    )
python
from sqlalchemy import Index
from paradedb.sqlalchemy import indexing

idx = Index(
    "search_idx",
    indexing.BM25Field(MockItem.id),
    indexing.BM25Field(
        MockItem.description,
        tokenizer=indexing.tokenize.from_config(
            {
                "tokenizer": "unicode_words",
                "named_args": {"columnar": True},
            }
        ),
    ),
    indexing.BM25Field(MockItem.rating),
    postgresql_using="bm25",
    postgresql_with={"key_field": "id"},
)

with engine.begin() as conn:
    idx.create(conn)
ruby
ActiveRecord::Base.connection.add_bm25_index(
  :mock_items,
  fields: {
    id: {},
    description: {
      tokenizer: :unicode_words,
      named_args: { columnar: true }
    },
    rating: {}
  },
  key_field: :id,
  name: :search_idx
)
</CodeGroup> <Note> The `columnar` option for tokenizers is available in versions `0.22.0` and above. </Note>

Columnar defaults to false for all tokenizers besides literal and literal normalized, which default to true and do not require an explicit setting.

The reason is that tokenized fields can represent large documents and would be expensive to store column-wise, whereas literal and literal normalized fields are typically single-value and much more compact.

<Note> The columnar field stores the raw text value regardless of the tokenizer. For example, if `Hello world` is split into tokens `hello` and `world`, the columnar value remains `Hello world`.

This is important because operations like filtering and sorting require the original field value, not the tokens.

</Note> <Note> Internally, Tantivy refers to columnar fields as fast fields. Our [legacy docs](/legacy/indexing/create-index) also refer to these fields as fast. </Note>