Verify Index Integrity - Paradedb

ParadeDB provides amcheck-style index verification functions to detect corruption and validate the structural integrity of BM25 indexes. These functions are useful for:

Proactive corruption detection before issues become critical
Validating index health after hardware failures or unexpected shutdowns
Verifying backup integrity
Debugging index-related issues

Basic Verification

The pdb.verify_index function performs structural integrity checks on a BM25 index:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx'); ```

python

from paradedb import paradedb_verify_index

paradedb_verify_index("search_idx")

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(engine, "search_idx")

ruby

ParadeDB.paradedb_verify_index("search_idx")

</CodeGroup>

This returns a table with three columns:

Column	Type	Description
`check_name`	text	Name of the verification check
`passed`	boolean	Whether the check passed
`details`	text	Additional information about the check result

Example Output

               check_name               | passed |                    details
----------------------------------------+--------+-----------------------------------------------
 search_idx: schema_valid               | t      | Index schema loaded successfully
 search_idx: index_readable             | t      | Index reader opened successfully
 search_idx: checksums_valid            | t      | All segment checksums validated successfully
 search_idx: segment_metadata_valid     | t      | 3 segments validated successfully

Heap Reference Validation

To verify that all indexed entries still exist in the heap table, use the heapallindexed option:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true); ```

python

from paradedb import paradedb_verify_index

paradedb_verify_index("search_idx", heapallindexed=True)

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
)

ruby

ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true
)

</CodeGroup>

This adds an additional check that validates every indexed ctid (tuple identifier) references a valid row in the table. This is particularly useful for detecting index entries that reference deleted or non-existent rows.

<Warning> The `heapallindexed` option can be slow on large indexes as it must verify every document. Consider using `sample_rate` for quick spot checks on large indexes. </Warning>

Options

Sampling for Large Indexes

For large indexes, you can check a random sample of documents instead of all documents:

<CodeGroup> ```sql SQL -- Check 10% of documents SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, sample_rate := 0.1 ); ```

python

from paradedb import paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    sample_rate=0.1,
)

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    sample_rate=0.1,
)

ruby

ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  sample_rate: 0.1
)

</CodeGroup>

Progress Reporting

For long-running verifications, enable progress reporting to see status updates:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true ); ```

python

from paradedb import paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    report_progress=True,
)

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    report_progress=True,
)

ruby

ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  report_progress: true
)

</CodeGroup>

Progress messages are emitted via PostgreSQL's NOTICE channel.

Verbose Mode

For detailed logging including segment-by-segment progress and resume hints, enable verbose mode:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true, verbose := true ); ```

python

from paradedb import paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    report_progress=True,
    verbose=True,
)

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    report_progress=True,
    verbose=True,
)

ruby

ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  report_progress: true,
  verbose: true
)

</CodeGroup>

Stop on First Error

To stop verification immediately when the first error is found (similar to pg_amcheck --on-error-stop):

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', on_error_stop := true); ```

python

from paradedb import paradedb_verify_index

paradedb_verify_index("search_idx", on_error_stop=True)

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    on_error_stop=True,
)

ruby

ParadeDB.paradedb_verify_index(
  "search_idx",
  on_error_stop: true
)

</CodeGroup>

Parallel Verification

A single verify_index call processes segments sequentially within one PostgreSQL backend. For very large indexes, you can distribute verification across multiple database connections by specifying which segments each connection should check using the segment_ids parameter. This allows you to utilize multiple CPU cores by running verification in parallel processes.

Listing Segments

First, list all segments in the index:

<CodeGroup> ```sql SQL SELECT * FROM pdb.index_segments('search_idx'); ```

python

from paradedb import paradedb_index_segments

paradedb_index_segments("search_idx")

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_index_segments(engine, "search_idx")

ruby

ParadeDB.paradedb_index_segments("search_idx")

</CodeGroup>

 partition_name | segment_idx | segment_id | num_docs | num_deleted | max_doc
----------------+-------------+------------+----------+-------------+---------
 search_idx     |           0 | b7e661af   |    10000 |           0 |   10000
 search_idx     |           1 | b4fc1b40   |    10000 |           0 |   10000
 search_idx     |           2 | 9894b412   |    10000 |           0 |   10000
 search_idx     |           3 | 4d0168d6   |     5000 |           0 |    5000

Verifying Specific Segments

Then verify specific segments using the segment_ids parameter:

<CodeGroup> ```sql SQL -- Worker 1: Verify even segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[0, 2] );

-- Worker 2: Verify odd segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[1, 3] );


```python Django
from paradedb import paradedb_verify_index

# Worker 1: Verify even segments
paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    segment_ids=[0, 2],
)

# Worker 2: Verify odd segments
paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    segment_ids=[1, 3],
)

python

from paradedb.sqlalchemy import diagnostics

# Worker 1: Verify even segments
diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    segment_ids=[0, 2],
)

# Worker 2: Verify odd segments
diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    segment_ids=[1, 3],
)

ruby

# Worker 1: Verify even segments
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  segment_ids: [0, 2]
)

# Worker 2: Verify odd segments
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  segment_ids: [1, 3]
)

</CodeGroup>

Automation Example

Distribute verification across N workers:

<CodeGroup> ```sql SQL -- Get segments for worker 0 (of 4 workers) SELECT array_agg(segment_idx) AS segments FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0;

-- Run verification with those segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ( SELECT array_agg(segment_idx) FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0 ) );


```python Django
from paradedb import paradedb_index_segments, paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    segment_ids=[
        row["segment_idx"]
        for row in paradedb_index_segments("search_idx")
        if row["segment_idx"] % 4 == 0
    ],
)

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    segment_ids=[
        row["segment_idx"]
        for row in diagnostics.paradedb_index_segments(engine, "search_idx")
        if row["segment_idx"] % 4 == 0
    ],
)

ruby

ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  segment_ids: ParadeDB.paradedb_index_segments("search_idx").filter_map { |row|
    row["segment_idx"] if row["segment_idx"] % 4 == 0
  })

</CodeGroup>

Verifying All BM25 Indexes

To verify all BM25 indexes in the database at once:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_all_indexes(); ```

python

from paradedb import paradedb_verify_all_indexes

paradedb_verify_all_indexes()

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_all_indexes(engine)

ruby

ParadeDB.paradedb_verify_all_indexes()

</CodeGroup>

Filtering by Pattern

Filter indexes by schema or name pattern (using SQL LIKE syntax):

<CodeGroup> ```sql SQL -- Verify indexes in the 'public' schema only SELECT * FROM pdb.verify_all_indexes(schema_pattern := 'public');

-- Verify indexes matching a name pattern SELECT * FROM pdb.verify_all_indexes(index_pattern := 'search_%');

-- Combine filters SELECT * FROM pdb.verify_all_indexes( schema_pattern := 'app_%', index_pattern := '%_idx', heapallindexed := true );


```python Django
from paradedb import paradedb_verify_all_indexes

# Verify indexes in the 'public' schema only
paradedb_verify_all_indexes(schema_pattern="public")

# Verify indexes matching a name pattern
paradedb_verify_all_indexes(index_pattern="search_%")

# Combine filters
paradedb_verify_all_indexes(
    schema_pattern="app_%",
    index_pattern="%_idx",
    heapallindexed=True,
)

python

from paradedb.sqlalchemy import diagnostics

# Verify indexes in the 'public' schema only
diagnostics.paradedb_verify_all_indexes(
    engine,
    schema_pattern="public",
)

# Verify indexes matching a name pattern
diagnostics.paradedb_verify_all_indexes(
    engine,
    index_pattern="search_%",
)

# Combine filters
diagnostics.paradedb_verify_all_indexes(
    engine,
    schema_pattern="app_%",
    index_pattern="%_idx",
    heapallindexed=True,
)

ruby

# Verify indexes in the 'public' schema only
ParadeDB.paradedb_verify_all_indexes(schema_pattern: "public")

# Verify indexes matching a name pattern
ParadeDB.paradedb_verify_all_indexes(index_pattern: "search_%")

# Combine filters
ParadeDB.paradedb_verify_all_indexes(
  schema_pattern: "app_%",
  index_pattern: "%_idx",
  heapallindexed: true
)

</CodeGroup>

Listing All BM25 Indexes

To see all BM25 indexes in the database with summary statistics:

<CodeGroup> ```sql SQL SELECT * FROM pdb.indexes(); ```

python

from paradedb import paradedb_indexes

paradedb_indexes()

python

from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_indexes(engine)

ruby

ParadeDB.paradedb_indexes()

</CodeGroup>

 schemaname |  tablename  |   indexname   | indexrelid | num_segments | total_docs
------------+-------------+---------------+------------+--------------+------------
 public     | products    | products_idx  |      16421 |            3 |      50000
 public     | documents   | documents_idx |      16435 |            5 |     125000
 app        | articles    | articles_idx  |      16448 |            2 |      10000

Function Reference

`pdb.verify_index`

Verifies a single BM25 index.

Parameter	Type	Default	Description
`index`	regclass	(required)	The index to verify
`heapallindexed`	boolean	`false`	Check that all indexed ctids exist in the heap
`sample_rate`	float	`NULL`	Fraction of documents to check (0.0-1.0). NULL = check all
`report_progress`	boolean	`false`	Emit progress messages
`verbose`	boolean	`false`	Emit detailed segment-level progress and resume hints
`on_error_stop`	boolean	`false`	Stop on first error found
`segment_ids`	int[]	`NULL`	Specific segment indices to check. NULL = all segments

`pdb.verify_all_indexes`

Verifies all BM25 indexes in the database.

Parameter	Type	Default	Description
`schema_pattern`	text	`NULL`	Filter by schema name (SQL LIKE pattern). NULL = all
`index_pattern`	text	`NULL`	Filter by index name (SQL LIKE pattern). NULL = all
`heapallindexed`	boolean	`false`	Check that all indexed ctids exist in the heap
`sample_rate`	float	`NULL`	Fraction of documents to check (0.0-1.0). NULL = check all
`report_progress`	boolean	`false`	Emit progress messages
`on_error_stop`	boolean	`false`	Stop on first error found

`pdb.index_segments`

Lists all segments in a BM25 index.

Parameter	Type	Default	Description
`index`	regclass	(required)	The index to inspect

Returns:

Column	Type	Description
`partition_name`	text	Name of the index partition
`segment_idx`	int	Segment index (use with `segment_ids` parameter)
`segment_id`	text	Tantivy segment UUID
`num_docs`	bigint	Number of live documents
`num_deleted`	bigint	Number of deleted documents
`max_doc`	bigint	Maximum document ID

`pdb.indexes`

Lists all BM25 indexes in the database.

Returns:

Column	Type	Description
`schemaname`	text	Schema containing the index
`tablename`	text	Table the index is on
`indexname`	text	Name of the index
`indexrelid`	oid	OID of the index
`num_segments`	int	Number of Tantivy segments
`total_docs`	bigint	Total documents across all segments