Back to Paradedb

Verify Index Integrity

docs/documentation/indexing/verify-index.mdx

0.23.316.0 KB
Original Source

ParadeDB provides amcheck-style index verification functions to detect corruption and validate the structural integrity of BM25 indexes. These functions are useful for:

  • Proactive corruption detection before issues become critical
  • Validating index health after hardware failures or unexpected shutdowns
  • Verifying backup integrity
  • Debugging index-related issues

Basic Verification

The pdb.verify_index function performs structural integrity checks on a BM25 index:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx'); ```
python
from paradedb import paradedb_verify_index

paradedb_verify_index("search_idx")
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(engine, "search_idx")
ruby
ParadeDB.paradedb_verify_index("search_idx")
</CodeGroup>

This returns a table with three columns:

ColumnTypeDescription
check_nametextName of the verification check
passedbooleanWhether the check passed
detailstextAdditional information about the check result

Example Output

               check_name               | passed |                    details
----------------------------------------+--------+-----------------------------------------------
 search_idx: schema_valid               | t      | Index schema loaded successfully
 search_idx: index_readable             | t      | Index reader opened successfully
 search_idx: checksums_valid            | t      | All segment checksums validated successfully
 search_idx: segment_metadata_valid     | t      | 3 segments validated successfully

Heap Reference Validation

To verify that all indexed entries still exist in the heap table, use the heapallindexed option:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true); ```
python
from paradedb import paradedb_verify_index

paradedb_verify_index("search_idx", heapallindexed=True)
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
)
ruby
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true
)
</CodeGroup>

This adds an additional check that validates every indexed ctid (tuple identifier) references a valid row in the table. This is particularly useful for detecting index entries that reference deleted or non-existent rows.

<Warning> The `heapallindexed` option can be slow on large indexes as it must verify every document. Consider using `sample_rate` for quick spot checks on large indexes. </Warning>

Options

Sampling for Large Indexes

For large indexes, you can check a random sample of documents instead of all documents:

<CodeGroup> ```sql SQL -- Check 10% of documents SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, sample_rate := 0.1 ); ```
python
from paradedb import paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    sample_rate=0.1,
)
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    sample_rate=0.1,
)
ruby
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  sample_rate: 0.1
)
</CodeGroup>

Progress Reporting

For long-running verifications, enable progress reporting to see status updates:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true ); ```
python
from paradedb import paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    report_progress=True,
)
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    report_progress=True,
)
ruby
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  report_progress: true
)
</CodeGroup>

Progress messages are emitted via PostgreSQL's NOTICE channel.

Verbose Mode

For detailed logging including segment-by-segment progress and resume hints, enable verbose mode:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, report_progress := true, verbose := true ); ```
python
from paradedb import paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    report_progress=True,
    verbose=True,
)
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    report_progress=True,
    verbose=True,
)
ruby
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  report_progress: true,
  verbose: true
)
</CodeGroup>

Stop on First Error

To stop verification immediately when the first error is found (similar to pg_amcheck --on-error-stop):

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_index('search_idx', on_error_stop := true); ```
python
from paradedb import paradedb_verify_index

paradedb_verify_index("search_idx", on_error_stop=True)
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    on_error_stop=True,
)
ruby
ParadeDB.paradedb_verify_index(
  "search_idx",
  on_error_stop: true
)
</CodeGroup>

Parallel Verification

A single verify_index call processes segments sequentially within one PostgreSQL backend. For very large indexes, you can distribute verification across multiple database connections by specifying which segments each connection should check using the segment_ids parameter. This allows you to utilize multiple CPU cores by running verification in parallel processes.

Listing Segments

First, list all segments in the index:

<CodeGroup> ```sql SQL SELECT * FROM pdb.index_segments('search_idx'); ```
python
from paradedb import paradedb_index_segments

paradedb_index_segments("search_idx")
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_index_segments(engine, "search_idx")
ruby
ParadeDB.paradedb_index_segments("search_idx")
</CodeGroup>
 partition_name | segment_idx | segment_id | num_docs | num_deleted | max_doc
----------------+-------------+------------+----------+-------------+---------
 search_idx     |           0 | b7e661af   |    10000 |           0 |   10000
 search_idx     |           1 | b4fc1b40   |    10000 |           0 |   10000
 search_idx     |           2 | 9894b412   |    10000 |           0 |   10000
 search_idx     |           3 | 4d0168d6   |     5000 |           0 |    5000

Verifying Specific Segments

Then verify specific segments using the segment_ids parameter:

<CodeGroup> ```sql SQL -- Worker 1: Verify even segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[0, 2] );

-- Worker 2: Verify odd segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ARRAY[1, 3] );


```python Django
from paradedb import paradedb_verify_index

# Worker 1: Verify even segments
paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    segment_ids=[0, 2],
)

# Worker 2: Verify odd segments
paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    segment_ids=[1, 3],
)
python
from paradedb.sqlalchemy import diagnostics

# Worker 1: Verify even segments
diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    segment_ids=[0, 2],
)

# Worker 2: Verify odd segments
diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    segment_ids=[1, 3],
)
ruby
# Worker 1: Verify even segments
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  segment_ids: [0, 2]
)

# Worker 2: Verify odd segments
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  segment_ids: [1, 3]
)
</CodeGroup>

Automation Example

Distribute verification across N workers:

<CodeGroup> ```sql SQL -- Get segments for worker 0 (of 4 workers) SELECT array_agg(segment_idx) AS segments FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0;

-- Run verification with those segments SELECT * FROM pdb.verify_index('search_idx', heapallindexed := true, segment_ids := ( SELECT array_agg(segment_idx) FROM pdb.index_segments('search_idx') WHERE segment_idx % 4 = 0 ) );


```python Django
from paradedb import paradedb_index_segments, paradedb_verify_index

paradedb_verify_index(
    "search_idx",
    heapallindexed=True,
    segment_ids=[
        row["segment_idx"]
        for row in paradedb_index_segments("search_idx")
        if row["segment_idx"] % 4 == 0
    ],
)
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_index(
    engine,
    "search_idx",
    heapallindexed=True,
    segment_ids=[
        row["segment_idx"]
        for row in diagnostics.paradedb_index_segments(engine, "search_idx")
        if row["segment_idx"] % 4 == 0
    ],
)
ruby
ParadeDB.paradedb_verify_index(
  "search_idx",
  heapallindexed: true,
  segment_ids: ParadeDB.paradedb_index_segments("search_idx").filter_map { |row|
    row["segment_idx"] if row["segment_idx"] % 4 == 0
  })
</CodeGroup>

Verifying All BM25 Indexes

To verify all BM25 indexes in the database at once:

<CodeGroup> ```sql SQL SELECT * FROM pdb.verify_all_indexes(); ```
python
from paradedb import paradedb_verify_all_indexes

paradedb_verify_all_indexes()
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_verify_all_indexes(engine)
ruby
ParadeDB.paradedb_verify_all_indexes()
</CodeGroup>

Filtering by Pattern

Filter indexes by schema or name pattern (using SQL LIKE syntax):

<CodeGroup> ```sql SQL -- Verify indexes in the 'public' schema only SELECT * FROM pdb.verify_all_indexes(schema_pattern := 'public');

-- Verify indexes matching a name pattern SELECT * FROM pdb.verify_all_indexes(index_pattern := 'search_%');

-- Combine filters SELECT * FROM pdb.verify_all_indexes( schema_pattern := 'app_%', index_pattern := '%_idx', heapallindexed := true );


```python Django
from paradedb import paradedb_verify_all_indexes

# Verify indexes in the 'public' schema only
paradedb_verify_all_indexes(schema_pattern="public")

# Verify indexes matching a name pattern
paradedb_verify_all_indexes(index_pattern="search_%")

# Combine filters
paradedb_verify_all_indexes(
    schema_pattern="app_%",
    index_pattern="%_idx",
    heapallindexed=True,
)
python
from paradedb.sqlalchemy import diagnostics

# Verify indexes in the 'public' schema only
diagnostics.paradedb_verify_all_indexes(
    engine,
    schema_pattern="public",
)

# Verify indexes matching a name pattern
diagnostics.paradedb_verify_all_indexes(
    engine,
    index_pattern="search_%",
)

# Combine filters
diagnostics.paradedb_verify_all_indexes(
    engine,
    schema_pattern="app_%",
    index_pattern="%_idx",
    heapallindexed=True,
)
ruby
# Verify indexes in the 'public' schema only
ParadeDB.paradedb_verify_all_indexes(schema_pattern: "public")

# Verify indexes matching a name pattern
ParadeDB.paradedb_verify_all_indexes(index_pattern: "search_%")

# Combine filters
ParadeDB.paradedb_verify_all_indexes(
  schema_pattern: "app_%",
  index_pattern: "%_idx",
  heapallindexed: true
)
</CodeGroup>

Listing All BM25 Indexes

To see all BM25 indexes in the database with summary statistics:

<CodeGroup> ```sql SQL SELECT * FROM pdb.indexes(); ```
python
from paradedb import paradedb_indexes

paradedb_indexes()
python
from paradedb.sqlalchemy import diagnostics

diagnostics.paradedb_indexes(engine)
ruby
ParadeDB.paradedb_indexes()
</CodeGroup>
 schemaname |  tablename  |   indexname   | indexrelid | num_segments | total_docs
------------+-------------+---------------+------------+--------------+------------
 public     | products    | products_idx  |      16421 |            3 |      50000
 public     | documents   | documents_idx |      16435 |            5 |     125000
 app        | articles    | articles_idx  |      16448 |            2 |      10000

Function Reference

pdb.verify_index

Verifies a single BM25 index.

ParameterTypeDefaultDescription
indexregclass(required)The index to verify
heapallindexedbooleanfalseCheck that all indexed ctids exist in the heap
sample_ratefloatNULLFraction of documents to check (0.0-1.0). NULL = check all
report_progressbooleanfalseEmit progress messages
verbosebooleanfalseEmit detailed segment-level progress and resume hints
on_error_stopbooleanfalseStop on first error found
segment_idsint[]NULLSpecific segment indices to check. NULL = all segments

pdb.verify_all_indexes

Verifies all BM25 indexes in the database.

ParameterTypeDefaultDescription
schema_patterntextNULLFilter by schema name (SQL LIKE pattern). NULL = all
index_patterntextNULLFilter by index name (SQL LIKE pattern). NULL = all
heapallindexedbooleanfalseCheck that all indexed ctids exist in the heap
sample_ratefloatNULLFraction of documents to check (0.0-1.0). NULL = check all
report_progressbooleanfalseEmit progress messages
on_error_stopbooleanfalseStop on first error found

pdb.index_segments

Lists all segments in a BM25 index.

ParameterTypeDefaultDescription
indexregclass(required)The index to inspect

Returns:

ColumnTypeDescription
partition_nametextName of the index partition
segment_idxintSegment index (use with segment_ids parameter)
segment_idtextTantivy segment UUID
num_docsbigintNumber of live documents
num_deletedbigintNumber of deleted documents
max_docbigintMaximum document ID

pdb.indexes

Lists all BM25 indexes in the database.

Returns:

ColumnTypeDescription
schemanametextSchema containing the index
tablenametextTable the index is on
indexnametextName of the index
indexrelidoidOID of the index
num_segmentsintNumber of Tantivy segments
total_docsbigintTotal documents across all segments