Weaviate Metrics

This document is the single source of truth for Prometheus metrics exposed by Weaviate. It explains what we measure and why, how to use the metrics, and how we keep the set lean and cost‑effective.

Purpose

  • Provide a canonical list of metrics, their meaning, and intended usage
  • Standardize how teams interpret and build dashboards/alerts
  • Control cost and label cardinality by separating operational from analytical needs

Source of truth

  • This file (docs/metrics.md) is authoritative. Any metric changes (add/modify/deprecate) must be reflected here in the correct section.
  • Category and Usage Status here define where a metric should live and how it should be used.

Usage categories

  • 🎯 Active (dashboard): core metrics suitable for dashboards; use stable, bounded labels
  • ⚙️ Active (operational): health/run-state and background processes; sample where possible
  • 🚨 Alerting: minimal, symptom-based alerts with low cardinality
  • 📊 Analytical (could be moved out of Prometheus): debugging/analysis; avoid long retention/high cardinality in Prometheus
  • ‼️ Can be deprecated: candidates for removal; consumers should migrate off
  • 🗑️ Deprecated: removed from codebase; documented for one release cycle to aid migration; remove from dashboards/alerts and drop recording rules

Cost and cardinality guidance

  • Prefer counters/gauges with a small, bounded label set
  • Avoid per-tenant/per-class/per-route label explosions unless essential for operations
  • Move exploratory or wide-label analytics to logs, traces, or external stores
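
To make the label guidance concrete, the sketch below estimates an upper bound on the number of time series a single metric can produce by multiplying the number of distinct values per label. The deployment figures (50 collections, 3 shards, 4 operations, 12 buckets) are made-up assumptions, and `estimate_series` is a hypothetical helper, not part of Weaviate.

```python
from math import prod
from typing import Mapping


def estimate_series(label_cardinalities: Mapping[str, int],
                    metric_type: str = "gauge",
                    buckets: int = 0) -> int:
    """Upper bound on the time series produced by one metric.

    label_cardinalities maps each label name to its number of distinct values.
    buckets is the number of histogram bucket series, including +Inf.
    """
    combos = prod(label_cardinalities.values())  # an empty label set gives 1
    if metric_type == "histogram":
        # A Prometheus histogram exposes one series per bucket,
        # plus the _sum and _count series.
        return combos * (buckets + 2)
    return combos


# Hypothetical deployment figures, for illustration only.
labels = {"class_name": 50, "shard_name": 3, "operation": 4}
print(estimate_series(labels))                           # 600
print(estimate_series(labels, "histogram", buckets=12))  # 8400
```

The product is an upper bound because not every label combination occurs in practice, but it shows how quickly per-class and per-shard labels multiply, especially on histograms.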

Change management

  • Adding: include type, labels, category, and justification for labels
  • Changing labels: call out cardinality impact and migration steps
  • Deprecating: move to ‼️ Can be deprecated, keep for one minor release, then remove
  • Alerting: document thresholds and runbook links in dashboards, not here
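
The "Adding" rule above can be pictured as a small review check. This is only a sketch: `MetricProposal`, `review`, and the example metric name are hypothetical and do not exist in the Weaviate codebase. The allowed types are the four metric types that appear in this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Metric types used in this document.
ALLOWED_TYPES = {"Counter", "Gauge", "Histogram", "Summary"}


@dataclass
class MetricProposal:
    name: str
    metric_type: str
    category: str
    labels: List[str] = field(default_factory=list)
    label_justification: Dict[str, str] = field(default_factory=dict)


def review(p: MetricProposal) -> List[str]:
    """Return a list of problems; an empty list means the entry is complete."""
    problems = []
    if p.metric_type not in ALLOWED_TYPES:
        problems.append(f"unknown type: {p.metric_type}")
    if not p.category.strip():
        problems.append("missing category")
    for label in p.labels:
        # Every label needs a non-empty justification.
        if not p.label_justification.get(label, "").strip():
            problems.append(f"label '{label}' has no justification")
    return problems


# Hypothetical metric, for illustration only.
proposal = MetricProposal(
    name="example_queue_depth",
    metric_type="Gauge",
    category="Active (operational)",
    labels=["node"],
)
print(review(proposal))  # ["label 'node' has no justification"]
```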

🎯 Active (dashboard)

Batch Operations

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| batch_durations_ms | Duration in ms of a single batch | Histogram | class_name, operation, shard_name | ❌ High |

Object Operations

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| object_count | Number of currently ongoing async operations | Gauge | class_name, shard_name | ❌ High |

Query Operations

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| concurrent_queries_count | Number of concurrently running query operations | Gauge | class_name, query_type | ❌ High |
| requests_total | Number of all requests made | Gauge | api, class_name, query_type, status | ❌ High |
| queries_durations_ms | Duration of queries in milliseconds | Histogram | class_name, query_type | ❌ High |
| queries_filtered_vector_durations_ms | Duration of queries in milliseconds | Summary | class_name, operation, shard_name | ❌ High |
| query_dimensions_total | Vector dimensions used by read-queries involving vectors | Counter | class_name, operation, query_type | ❌ High |

LSM Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| lsm_active_segments | Number of currently present segments per shard | Gauge | class_name, path, shard_name, strategy | ❌ High |
| lsm_memtable_size | Size of memtable by path | Gauge | class_name, path, shard_name, strategy | ❌ High |

System Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| async_operations_running | Number of currently ongoing async operations | Gauge | class_name, operation, path, shard_name | ❌ High |

Queue Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| queue_size | Number of records in the queue | Gauge | class_name, shard_name | ❌ High |

Vector Index Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| vector_index_tombstones | Number of active vector index tombstones | Gauge | class_name, shard_name | ❌ High |
| vector_index_tombstone_cleaned | Total number of deleted objects that have been cleaned up | Counter | class_name, shard_name | ❌ High |
| vector_index_tombstone_unexpected_total | Total number of unexpected tombstones found | Counter | class_name, operation, shard_name | ❌ High |
| vector_index_operations | Total number of mutating operations on the vector index | Gauge | class_name, operation, shard_name | ❌ High |
| vector_index_size | The size of the vector index | Gauge | class_name, shard_name | ❌ High |
| vector_segments_sum | Total segments in a shard if quantization enabled | Gauge | class_name, shard_name | ❌ High |
| vector_dimensions_sum | Total dimensions in a shard | Gauge | class_name, shard_name | ❌ High |
| vector_index_durations_ms | Duration of typical vector index operations (insert, delete) | Summary | class_name, operation, shard_name, step | ❌ High |

Startup Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| startup_progress | Ratio (percentage) of startup progress for a particular component in a shard | Gauge | class_name, operation, shard_name | ❌ High |
| startup_diskio_throughput | Disk I/O throughput in bytes per second | Summary | class_name, operation, shard_name | ❌ High |

Tombstone Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| tombstone_find_local_entrypoint | Total number of tombstone delete local entrypoint calls | Counter | class_name, shard_name | ❌ High |
| tombstone_find_global_entrypoint | Total number of tombstone delete global entrypoint calls | Counter | class_name, shard_name | ❌ High |

Text-to-Vector (T2V) Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| t2v_concurrent_batches | Number of batches currently running | Gauge | vectorizer | - Low |
| t2v_batch_queue_duration_seconds | Time of a batch spent in specific portions of the queue | Histogram | operation, vectorizer | - Low |
| t2v_request_duration_seconds | Duration of an individual request to the vectorizer | Histogram | vectorizer | - Low |
| t2v_tokens_in_batch | Number of tokens in a user-defined batch | Histogram | vectorizer | - Low |
| t2v_tokens_in_request | Number of tokens in an individual request sent to the vectorizer | Histogram | vectorizer | - Low |
| t2v_rate_limit_stats | Rate limit stats for the vectorizer | Gauge | stat, vectorizer | - Low |
| t2v_repeat_stats | Why batch scheduling is repeated | Gauge | stat, vectorizer | - Low |
| t2v_requests_per_batch | Number of requests required to process an entire (user) batch | Histogram | vectorizer | - Low |

Index Shard Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_index_shards_total | Total number of shards per index status | Gauge | status | - Low |
| weaviate_index_shard_status_update_duration_seconds | Time taken to update shard status in seconds | Histogram | status | - Low |

Auto Schema Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_auto_tenant_total | Total number of tenants processed | Counter | - | - Low |
| weaviate_auto_tenant_duration_seconds | Time spent in auto tenant operations | Histogram | operation | - Low |

⚙️ Active (operational)

Vector Index Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| vector_index_tombstone_cycle_end_timestamp_seconds | Unix epoch timestamp of the end of the last tombstone cleanup cycle | Gauge | class_name, shard_name | ❌ High |
| vector_index_tombstone_cycle_progress | Ratio (percentage) of the progress of the current tombstone cleanup cycle | Gauge | class_name, shard_name | ❌ High |
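
Because vector_index_tombstone_cycle_end_timestamp_seconds is a Unix epoch timestamp, the time since the last completed cleanup cycle is the current time minus the gauge value. The sketch below shows that arithmetic. The helper names and the one-hour threshold are illustrative assumptions, not recommended values.

```python
import time
from typing import Optional


def seconds_since(cycle_end_timestamp: float, now: Optional[float] = None) -> float:
    """Age in seconds of a Unix epoch timestamp gauge value."""
    now = time.time() if now is None else now
    # Clamp at zero to tolerate small clock skew between nodes.
    return max(0.0, now - cycle_end_timestamp)


def is_stale(cycle_end_timestamp: float, max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """True when the last cycle ended longer ago than max_age_seconds."""
    return seconds_since(cycle_end_timestamp, now) > max_age_seconds


# Hypothetical sample: the gauge reads 1_700_000_000 and "now" is 2 hours later.
gauge_value = 1_700_000_000.0
now = gauge_value + 2 * 3600
print(seconds_since(gauge_value, now))            # 7200.0
print(is_stale(gauge_value, 3600.0, now))         # True
```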

Tenant Offload Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| tenant_offload_operation_duration_seconds | Duration of tenant offload operations | Histogram | operation, status | ❌ High |

Module Usage Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_<module>_operation_latency_seconds | Latency of usage operations in seconds | Histogram | operation | - Low |
| weaviate_<module>_uploaded_file_size_bytes | Size of the last uploaded usage file in bytes | Gauge | - | - Low |

Shard Load Limiter Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| database_shards_loading | Number of shards currently loading | Gauge | - | - Low |
| database_shards_waiting_for_permit_to_load | Number of shards waiting for permit to load | Gauge | - | - Low |

Replication Engine Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_replication_pending_operations | Number of replication operations pending processing | Gauge | node | - Low |
| weaviate_replication_ongoing_operations | Number of replication operations currently in progress | Gauge | node | - Low |
| weaviate_replication_complete_operations | Number of successfully completed replication operations | Counter | node | - Low |
| weaviate_replication_failed_operations | Number of failed replication operations | Counter | node | - Low |
| weaviate_replication_cancelled_operations | Number of cancelled replication operations | Counter | node | - Low |
| weaviate_replication_engine_running_status | Replication engine running status (0: not running, 1: running) | Gauge | node | - Low |
| weaviate_replication_engine_producer_running_status | Replication engine producer running status (0: not running, 1: running) | Gauge | node | - Low |
| weaviate_replication_engine_consumer_running_status | Replication engine consumer running status (0: not running, 1: running) | Gauge | node | - Low |
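
The three replication counters (complete, failed, cancelled) only ever increase, so a useful signal comes from their increase over a window rather than their raw value. The sketch below computes that increase from consecutive samples, treating any drop as a counter reset (the same convention Prometheus uses), and derives a failure ratio. The ratio definition and the sample values are assumptions for illustration.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across ordered samples; a drop is treated as a reset to 0."""
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the
        # current value is the whole increase for that step.
        increase += (cur - prev) if cur >= prev else cur
    return increase


def failure_ratio(complete: Sequence[float], failed: Sequence[float],
                  cancelled: Sequence[float]) -> float:
    """Failed operations as a share of all finished operations in the window."""
    bad = counter_increase(failed)
    total = counter_increase(complete) + bad + counter_increase(cancelled)
    return bad / total if total else 0.0


# Hypothetical samples scraped over one window.
print(counter_increase([10, 15, 3, 8]))                 # 13.0 (5 + 3 after reset + 5)
print(failure_ratio([100, 140], [2, 12], [0, 0]))       # 0.2
```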

Distributed Task Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_distributed_tasks_running | Number of active distributed tasks running per namespace | Gauge | namespace | ❌ High |

HTTP Server Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| http_request_duration_seconds | Time (in seconds) spent serving requests | Histogram | method, route, status_code | ❌ High |
| http_request_size_bytes | Size (in bytes) of the request received | Histogram | method, route | ❌ High |
| http_response_size_bytes | Size (in bytes) of the response sent | Histogram | method, route | ❌ High |
| http_requests_inflight | Current number of inflight requests | Gauge | method, route | ❌ High |

gRPC Server Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| grpc_server_request_duration_seconds | Time (in seconds) spent serving requests | Histogram | grpc_service, method, status | ❌ High |
| grpc_server_request_size_bytes | Size (in bytes) of the request received | Histogram | grpc_service, method | ❌ High |
| grpc_server_response_size_bytes | Size (in bytes) of the response sent | Histogram | grpc_service, method | ❌ High |
| grpc_server_requests_inflight | Current number of inflight requests | Gauge | grpc_service, method | ❌ High |

Cluster Store Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_cluster_store_fsm_apply_duration_seconds | Time to apply cluster store FSM state in local node | Histogram | nodeID | - Low |
| weaviate_cluster_store_fsm_apply_failures_total | Total failure count of cluster store FSM state apply in local node | Counter | nodeID | - Low |
| weaviate_cluster_store_raft_last_applied_index | Current applied index of a raft cluster in local node | Gauge | nodeID | - Low |
| weaviate_cluster_store_fsm_last_applied_index | Current applied index of cluster store FSM in local node | Gauge | nodeID | - Low |
| weaviate_cluster_store_fsm_startup_applied_index | Previous applied index of the cluster store FSM in local node | Gauge | nodeID | - Low |

Schema Management Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_schema_collections | Number of collections per node | Gauge | nodeID | - Low |
| weaviate_schema_shards | Number of shards per node with corresponding status | Gauge | nodeID, status | - Low |

Runtime Config Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_runtime_config_last_load_success | Whether the last attempt to load the runtime config was successful | Gauge | - | - Low |
| weaviate_runtime_config_hash | Hash value of the currently active runtime configuration | Gauge | sha256 | - Low |

🚨 Alerting

Query Operations

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| queries_durations_ms | Duration of queries in milliseconds | Histogram | class_name, query_type | ❌ High |
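
Since queries_durations_ms is a histogram, latency alerts are usually based on a quantile estimated from its cumulative buckets. The sketch below approximates that estimate with linear interpolation inside the matching bucket, in the spirit of Prometheus's `histogram_quantile`. The bucket bounds and counts are invented for illustration and are not Weaviate's actual bucket layout.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket and at least one finite bucket.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical bucket bounds (ms) and cumulative counts.
sample = [(10, 50), (50, 80), (100, 95), (500, 100), (math.inf, 100)]
print(histogram_quantile(0.5, sample))   # 10.0
print(histogram_quantile(0.95, sample))  # close to 100 ms with these counts
```

The estimate is only as precise as the bucket layout, so a threshold that sits between two bucket bounds cannot be resolved more finely than the interpolation allows.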

📊 Analytical (could be moved out of Prometheus)

Vector Index Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| vector_index_maintenance_durations_ms | Duration of a sync or async vector index maintenance operation | Summary | class_name, operation, shard_name | ❌ High |

Module Usage Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_<module>_operations_total | Total number of module operations | Counter | operation, status | - Low |
| weaviate_<module>_resource_count | Number of resources tracked by module | Gauge | resource_type | - Low |

πŸ› Active (debugging)

Batch Operations

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| batch_size_bytes | Size of a raw batch request in bytes | Summary | api | - Low |
| batch_size_objects | Number of objects in a batch | Summary | - | - Low |
| batch_size_tenants | Number of unique tenants referenced in a batch | Summary | - | - Low |
| batch_delete_durations_ms | Duration in ms of a single delete batch | Summary | class_name, operation, shard_name | ❌ High |
| batch_objects_processed_total | Number of objects processed in a batch | Counter | class_name, shard_name | ❌ High |
| batch_objects_processed_bytes | Number of bytes processed in a batch | Counter | class_name, shard_name | ❌ High |

LSM Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| lsm_bitmap_buffers_usage | Number of bitmap buffers used by size | Counter | operation, size | - Low |

File I/O Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| file_io_writes_total_bytes | Total number of bytes written to disk | Summary | operation, strategy | - Low |
| file_io_reads_total_bytes | Total number of bytes read from disk | Summary | operation | - Low |
| mmap_operations_total | Total number of mmap operations | Counter | operation, strategy | - Low |
| mmap_proc_maps | Number of entries in /proc/self/maps | Gauge | - | - Low |

Schema Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| schema_writes_seconds | Duration of schema writes (which always involve the leader) | Summary | type | - Low |
| schema_reads_local_seconds | Duration of local schema reads that do not involve the leader | Summary | type | - Low |
| schema_reads_leader_seconds | Duration of schema reads that are passed to the leader | Summary | type | - Low |
| schema_wait_for_version_seconds | Duration of waiting for a schema version to be reached | Summary | type | - Low |

‼️ Can be deprecated

Object Operations

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| objects_durations_ms | Duration of an individual object operation | Summary | class_name, operation, shard_name, step | ❌ High |

Query Operations

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| query_dimensions_combined_total | Vector dimensions used by read-queries, aggregated across all classes and shards | Counter | - | - Low |

System Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| concurrent_goroutines | Number of concurrently running goroutines | Gauge | class_name, query_type | ❌ High |

LSM Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| lsm_objects_bucket_segment_count | Number of segments per shard in the objects bucket | Gauge | class_name, path, shard_name, strategy | ❌ High |
| lsm_compressed_vecs_bucket_segment_count | Number of segments per shard in the vectors_compressed bucket | Gauge | class_name, path, shard_name, strategy | ❌ High |
| lsm_segment_objects | Number of objects/entries of segment by level | Gauge | class_name, level, path, shard_name, strategy | ❌ High |
| lsm_segment_size | Size of segment by level and unit | Gauge | class_name, level, path, shard_name, strategy, unit | ❌ High |
| lsm_segment_count | Number of segments by level | Gauge | class_name, level, path, shard_name, strategy | ❌ High |
| lsm_segment_unloaded | Number of unloaded segments | Gauge | class_name, path, shard_name, strategy | ❌ High |
| lsm_memtable_durations_ms | Time in ms for a bucket operation to complete | Summary | class_name, operation, path, shard_name, strategy | ❌ High |

Queue Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| queue_disk_usage | Disk usage of the queue | Gauge | class_name, shard_name | ❌ High |
| queue_paused | Whether the queue is paused | Gauge | class_name, shard_name | ❌ High |
| queue_count | Number of queues | Gauge | class_name, shard_name | ❌ High |
| queue_partition_processing_duration_ms | Duration in ms of a single partition processing | Histogram | class_name, shard_name | ❌ High |

Vector Index Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| vector_index_queue_insert_count | Number of insert operations added to the vector index queue | Counter | class_name, shard_name, target_vector | ❌ High |
| vector_index_queue_delete_count | Number of delete operations added to the vector index queue | Counter | class_name, shard_name, target_vector | ❌ High |
| vector_index_tombstone_cleanup_threads | Number of threads in use to clean up tombstones | Gauge | class_name, shard_name | ❌ High |
| vector_index_tombstone_cycle_start_timestamp_seconds | Unix epoch timestamp of the start of the current tombstone cleanup cycle | Gauge | class_name, shard_name | ❌ High |

Startup Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| startup_durations_ms | Duration of individual startup operations in ms | Summary | class_name, operation, shard_name | ❌ High |

Backup/Restore Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| backup_restore_ms | Duration of a backup restore | Summary | backend_name, class_name | ❌ High |
| backup_restore_class_ms | Duration of restoring a class | Summary | class_name | ❌ High |
| backup_restore_init_ms | Startup phase of a backup restore | Summary | backend_name, class_name | ❌ High |
| backup_restore_from_backend_ms | File transfer stage of a backup restore | Summary | backend_name, class_name | ❌ High |
| backup_store_to_backend_ms | File transfer stage of a backup store | Summary | backend_name, class_name | ❌ High |
| bucket_pause_durations_ms | Bucket pause durations | Summary | bucket_dir | - Low |
| backup_restore_data_transferred | Total number of bytes transferred during a backup restore | Counter | backend_name, class_name | ❌ High |
| backup_store_data_transferred | Total number of bytes transferred during a backup store | Counter | backend_name, class_name | ❌ High |

Shard Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| shards_loaded | Number of shards loaded | Gauge | - | - Low |
| shards_unloaded | Number of shards not loaded | Gauge | - | - Low |
| shards_loading | Number of shards in process of loading | Gauge | - | - Low |
| shards_unloading | Number of shards in process of unloading | Gauge | - | - Low |

Schema Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| schema_tx_opened_total | Total number of opened schema transactions | Counter | ownership | - Low |
| schema_tx_closed_total | Total number of closed schema transactions | Counter | ownership, status | - Low |
| schema_tx_duration_seconds | Mean duration of a tx by status | Summary | ownership, status | - Low |

Tombstone Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| tombstone_reassign_neighbors | Total number of tombstone reassign neighbor calls | Counter | class_name, shard_name | ❌ High |
| tombstone_delete_list_size | Delete list size of tombstones | Gauge | class_name, shard_name | ❌ High |

Tokenizer Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| tokenizer_duration_seconds | Duration of a tokenizer operation | Histogram | tokenizer | - Low |
| tokenizer_requests_total | Number of tokenizer requests | Counter | tokenizer | - Low |
| tokenizer_initialize_duration_seconds | Duration of a tokenizer initialization operation | Histogram | tokenizer | - Low |
| token_count_total | Number of tokens processed | Counter | tokenizer | - Low |
| token_count_per_request | Number of tokens processed per request | Histogram | tokenizer | - Low |

Module/External API Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| weaviate_module_requests_total | Number of module requests to external APIs | Counter | api, op | ❌ High |
| weaviate_module_request_duration_seconds | Duration of an individual request to a module external API | Histogram | api, op | ❌ High |
| weaviate_module_requests_per_batch | Number of items in a batch | Histogram | api, op | ❌ High |
| weaviate_module_request_size_bytes | Size (in bytes) of the request sent to an external API | Histogram | api, op | ❌ High |
| weaviate_module_response_size_bytes | Size (in bytes) of the response received from an external API | Histogram | api, op | ❌ High |
| weaviate_vectorizer_request_tokens | Number of tokens in the request sent to an external vectorizer | Histogram | api, inout | ❌ High |
| weaviate_module_request_single_count | Number of single-item external API requests | Counter | api, op | ❌ High |
| weaviate_module_request_batch_count | Number of batched module requests | Counter | api, op | ❌ High |
| weaviate_module_error_total | Number of OpenAI errors | Counter | endpoint, module, op, status_code | ❌ High |
| weaviate_module_call_error_total | Number of module errors (related to external calls) | Counter | endpoint, module, status_code | ❌ High |
| weaviate_module_response_status_total | Number of API response statuses | Counter | endpoint, op, status | ❌ High |
| weaviate_module_batch_error_total | Number of batch errors | Counter | class_name, operation | ❌ High |

Tenant Offload Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| tenant_offload_fetched_bytes_total | Total bytes fetched during tenant offload operations | Counter | - | - Low |
| tenant_offload_transferred_bytes_total | Total bytes transferred during tenant offload operations | Counter | - | - Low |

Checksum Metrics

| Name | Description | Type | Labels | High Cardinality |
| --- | --- | --- | --- | --- |
| checksum_validation_duration_seconds | Duration of checksum validation | Summary | - | - Low |
| checksum_bytes_read | Number of bytes read during checksum validation | Summary | - | - Low |

🗑️ Deprecated

| Name | Description | Type | Labels | Reason | Removed In |
| --- | --- | --- | --- | --- | --- |
| lsm_bloom_filters_duration_ms | Duration of bloom filter operations | Summary | class_name, operation, shard_name, strategy | Removed due to high CPU cost and synchronization on hot path during segment reads; no demonstrated value | v1.31 (PR #9057) |