docs/guides/vm-architectures/README.md
The complexity of any monitoring system is not an end in itself. It is a direct response to two questions: what risks are we protecting against, and how much performance do we need? This guide is designed to help you choose an architecture that precisely matches your answers.
It's a common mistake to think of availability as a simple number. In reality, availability is a guarantee against a specific level of risk. For example, 99.9% ("three nines") availability allows for about 44 minutes of downtime per month. Before chasing higher nines, ask yourself: is this level of downtime acceptable for your defined risks? Remember that each additional 'nine' of availability often comes with an exponential increase in both system complexity and operational cost.
The scope of the failure you are designing for is your "blast radius". Before choosing an architecture, you must first define the blast radius you need to withstand.
It is also crucial to distinguish between two fundamental goals:
The architectures in this guide are simply different combinations of these two approaches, designed to handle a specific blast radius.
Each subsequent section of this guide presents an architecture designed to handle a specific blast radius, moving from the most straightforward setup to the most resilient.
Recommended for: Pet projects, development/test stages, and non-critical systems monitoring.
Installation guide reference: VictoriaMetrics Single
Key characteristics: A single instance that does everything: it ingests, stores, and serves metrics.
Pros:
Cons:
Schema:
In this simplest setup, any single-node failure leads to temporary data unavailability or loss until the instance restarts or storage is restored. There are no built-in redundancy or replication layers.
For this setup, you can increase availability by using backup and restore mechanisms at various levels: hardware, virtualization, persistent volume management, or the application itself. VictoriaMetrics provides backup tools (vmbackup and vmrestore) for the application level.
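As a hedged sketch of the application-level option, vmbackup can periodically upload the data directory to object storage. The bucket, schedule, image tag, and volume names below are illustrative, not a complete recipe:

```yaml
# Sketch only: periodic backups of a single-node instance with vmbackup.
# The job must be able to mount the same data volume as the vmsingle pod
# (e.g., run on the same node or use a ReadWriteMany volume).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vmsingle-backup
spec:
  schedule: "0 * * * *"            # hourly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: vmbackup
              image: victoriametrics/vmbackup:latest
              args:
                # data directory of the single-node instance (shared volume)
                - -storageDataPath=/victoria-metrics-data
                # ask the running instance for a consistent snapshot first
                - -snapshot.createURL=http://vmsingle:8428/snapshot/create
                # backup destination; s3://, gs://, azblob:// and fs:// are supported
                - -dst=s3://my-backups/vmsingle
              volumeMounts:
                - name: storage
                  mountPath: /victoria-metrics-data
          volumes:
            - name: storage
              persistentVolumeClaim:
                claimName: vmsingle-data
```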
Recommended for: Systems of any scale hosted in a single availability zone
Installation guide reference: VictoriaMetrics Cluster
High availability implementation: HA VictoriaMetrics Cluster
Key characteristics: This is a complete VictoriaMetrics cluster, commonly running in a single Kubernetes cluster. Each cluster component - vminsert, vmselect, and vmstorage - runs in multiple copies (replicas). The data is also replicated and sharded across vmstorage nodes using the -replicationFactor setting on vminsert. See the official documentation to determine the optimal replication factor for your needs.
Pros:
Cons:
Schema:
When building a resilient cluster, several replication options are available.
Path A: Application-Level Replication. This approach is enabled by setting the -replicationFactor=N flag, where N is the desired number of replicas. It makes vminsert responsible for writing N copies of the data to different vmstorage nodes.
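A minimal sketch of the relevant flags; component names, replica counts, and hostnames are illustrative, not a complete manifest:

```yaml
# Sketch only: application-level replication with -replicationFactor=2.
# vminsert writes every sample to 2 different vmstorage nodes.
vminsert:
  args:
    - -replicationFactor=2
    - -storageNode=vmstorage-0:8400
    - -storageNode=vmstorage-1:8400
    - -storageNode=vmstorage-2:8400
# vmselect is told about the same factor, so it can tolerate up to
# replicationFactor-1 unavailable vmstorage nodes without marking results partial.
vmselect:
  args:
    - -replicationFactor=2
    - -storageNode=vmstorage-0:8401
    - -storageNode=vmstorage-1:8401
    - -storageNode=vmstorage-2:8401
```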
Pros:
Cons:
Path B: Storage-Level Replication (The Cloud Provider Way). In this model, the VictoriaMetrics replication factor is set to 1, and durability of the vmstorage data is provided by cloud-managed, internally replicated volumes (e.g., AWS EBS volumes replicated within an AZ, or Google zonal Persistent Disks).
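For illustration, a sketch of delegating durability to the cloud volume layer; the storage class, provisioner choice, and sizes are assumptions to adapt to your provider:

```yaml
# Sketch only: vminsert keeps a single copy (-replicationFactor=1) and the
# vmstorage volumes rely on the cloud provider's internal replication.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vmstorage-ssd
provisioner: ebs.csi.aws.com   # AWS EBS CSI driver; volumes are replicated within the AZ
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vmstorage-data-vmstorage-0   # claim used by one vmstorage StatefulSet replica
spec:
  storageClassName: vmstorage-ssd
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 200Gi
```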
Pros:
Cons:
In a large, distributed system, partial failures are a common occurrence. A critical choice is how your read path should behave when only partial data can be retrieved.
Path A: Allow Partial Responses (Focus on Availability). By default, if a vmstorage node is down, vmselect continues returning results from the healthy vmstorage nodes. If the number of vmstorage nodes that fail to respond reaches or exceeds the replicationFactor, the response will have the "isPartial" field set to true.
Pros:
Cons:
Path B: Deny Partial Responses (Focus on Consistency). You can configure vmselect with the -search.denyPartialResponse flag. If vmselect cannot fetch a complete result from all vmstorage nodes that hold the requested data (according to the replication factor), it returns an error instead of a partial result.
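The corresponding flag, shown as an illustrative args fragment:

```yaml
# Sketch only: fail queries instead of returning partial data.
vmselect:
  args:
    - -search.denyPartialResponse=true
# Individual queries can also opt in per request by passing the
# deny_partial_response=1 query arg while keeping the flag at its default.
```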
Pros:
Cons:
Once you have a vmagent sending data to the storage component (vmsingle or a cluster), you face your first important trade-off: what should vmagent do when the storage is temporarily unavailable? This choice is a trade-off between higher availability (not losing data) and lower resource consumption (not using disk). By default, vmagent acts as a durable queue: it persists compressed unsent data to the local filesystem. The size of the queue is controlled via `-remoteWrite.maxDiskUsagePerURL` and can be estimated in advance.
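A hedged sketch of this default disk-backed queue; the URL, buffer path, and size limit are illustrative:

```yaml
# Sketch only: vmagent persists unsent data on disk until the storage recovers.
vmagent:
  args:
    - -remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write
    # directory for buffering pending data while the remote storage is unreachable
    - -remoteWrite.tmpDataPath=/vmagent-remotewrite-data
    # cap the on-disk queue per remote storage; the oldest data is dropped beyond it
    - -remoteWrite.maxDiskUsagePerURL=10GiB
```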
Path A: Stateful Mode (Most Reliable). By default, the operator uses ephemeral storage for the vmagent queue. In production, we recommend explicitly configuring a PersistentVolumeClaim (PVC) for vmagent so the buffer is stored on a persistent disk and survives pod restarts. See the documentation on on-disk persistence for details.
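A minimal sketch, assuming the VictoriaMetrics operator's VMAgent resource; the URL and storage size are illustrative:

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: vmagent
spec:
  remoteWrite:
    - url: http://vminsert:8480/insert/0/prometheus/api/v1/write
  # run vmagent as a StatefulSet and keep its queue on a PersistentVolumeClaim
  statefulMode: true
  statefulStorage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 20Gi
```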
Pros:
Cons:
For Enterprise users, queueing can be offloaded to an external message broker such as Kafka; in that case, vmagent can both write to and read from Kafka.
Path B: Ephemeral Buffering (with tmpfs). For maximum performance, the vmagent buffer directory can be mounted as a tmpfs volume, which is physically stored in the node's RAM. In Kubernetes, this is configured via emptyDir: { medium: "Memory" }.
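A sketch of the tmpfs-backed buffer; names and the size limit are illustrative:

```yaml
# Sketch only: the queue lives in RAM and is lost on pod eviction or node restart.
containers:
  - name: vmagent
    args:
      - -remoteWrite.tmpDataPath=/vmagent-buffer
    volumeMounts:
      - name: buffer
        mountPath: /vmagent-buffer
volumes:
  - name: buffer
    emptyDir:
      medium: Memory
      sizeLimit: 1Gi
```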
Pros:
Cons:
Blast radius: Cluster
Recommended for: Large-scale workloads or services with high SLA requirements that must survive the complete failure of a datacenter or an Availability Zone (AZ).
High availability implementation: VictoriaMetrics Multi-Regional Setup
Key characteristics: The core principle of this architecture is to run two or more independent, self-contained VictoriaMetrics clusters (from the Single AZ section) in separate failure domains, such as different Availability Zones or geographic regions. A global, stateless layer is responsible for routing write and read traffic to these clusters. Each participating AZ must be provisioned to handle the entire workload if another AZ fails.
The topology of the VictoriaMetrics clusters does not change with the multi-AZ approach: whether the setup is Active-Active or Active-Passive, the schema stays the same.
To ensure reliability, vmagent implements the bulkhead pattern: each destination URL configured via -remoteWrite.url is assigned a dedicated data queue and an isolated pool of workers. This isolates the data streams, ensuring that if one storage destination becomes slow or unavailable, it does not impact data delivery to the others.
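As an illustration, a global vmagent duplicating traffic to two AZ-local clusters; hostnames are assumptions:

```yaml
# Sketch only: each -remoteWrite.url gets its own queue and worker pool,
# so a slow or unavailable AZ does not block delivery to the healthy one.
vmagent:
  args:
    - -remoteWrite.url=http://vminsert.az-a:8480/insert/0/prometheus/api/v1/write
    - -remoteWrite.url=http://vminsert.az-b:8480/insert/0/prometheus/api/v1/write
```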
Pros:
Cons:
Schema:
Blast radius: Availability zone
Primary region failure (Active-Passive): switchover in minutes; stale reads until DNS/load balancer/BGP reroute.
Single AZ/cluster failure (Active-Active): seamless reroute; read results may temporarily differ between clusters if cross-AZ replication lags.
Cross-region link failure:
Recommended for: Systems that require extra reliability and scalability across multiple regions and zones.
Key characteristics: This architecture is built on two main ideas - cells, and the separation of the routing and storage paths.
First, we have logical groups of Availability Zones (AZs). Think of these as our data pods. Inside these groups, we deploy our basic clusters. The data within these groups can be distributed in two ways:
Inside each storage cell, the VictoriaMetrics cluster is configured with -replicationFactor=1. High availability is achieved by the global routing layer replicating data across multiple cells, not within a single cell or cluster.
Next, we have a separate, stateless layer of routing cells. Their only purpose is to manage traffic. They accept all incoming data and queries and intelligently route them to the correct storage groups. This separation of routing and storage is key to the design.
For complete disaster recovery, this entire cell-based architecture is duplicated in a second geographic region.
Pros:
Cons / Trade-offs:
Schema:
A global, stateless layer of routing cells (vmagent, vmauth) sits on top. It routes traffic to several logical groups of storage cells. Each storage group contains multiple AZs, and data is replicated or sharded across them. There are several approaches to implementing it.
When you build a system that spans multiple AZs or regions, you face a fundamental choice: how to read the data? The answer to this question will define the trade-offs in your architecture between data completeness, query speed, and cost. Your choice of how to write data directly impacts how you can read it. Let's look at two pairs of write/read strategies.
In this model, your primary goal is to obtain as complete and consistent data as possible for every query, even if some storage cells are lagging behind.
Write Path: vmagent shards data across your storage cells. Fault tolerance is configured via -remoteWrite.shardByURL and -remoteWrite.shardByURLReplicas (for example, writing each time series to 3 out of 4 cells). Redundancy is achieved across cells, not within a cell. This provides resilience against cell failures while saving storage compared to full copies.
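An illustrative fragment of such a write path; cell endpoints are assumptions:

```yaml
# Sketch only: shard series across four storage cells, keeping each series
# on 3 of the 4 cells for redundancy.
vmagent:
  args:
    - -remoteWrite.url=http://vminsert.cell-1:8480/insert/0/prometheus/api/v1/write
    - -remoteWrite.url=http://vminsert.cell-2:8480/insert/0/prometheus/api/v1/write
    - -remoteWrite.url=http://vminsert.cell-3:8480/insert/0/prometheus/api/v1/write
    - -remoteWrite.url=http://vminsert.cell-4:8480/insert/0/prometheus/api/v1/write
    - -remoteWrite.shardByURL=true          # spread series across the URLs instead of copying to all
    - -remoteWrite.shardByURLReplicas=3     # each series is written to 3 of the 4 cells
```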
Read Path: You use a two-level vmselect system. A global vmselect receives user queries; it then queries the local vmselects in each of your storage cells and merges the results. Exposing the local vmselects to the global one is necessary because the global layer usually cannot connect to vmstorage in a cell directly, especially when the cell runs in Kubernetes: vmstorage exposes no HTTP endpoint for querying, and exposing it via NodePort is generally not a good practice in production.
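A hedged sketch of the two-level read path; hostnames and ports are illustrative:

```yaml
# Sketch only: local vmselects expose the cluster-native protocol, and the
# global vmselect treats them as its "storage nodes".
vmselect-local:            # one per storage cell
  args:
    - -storageNode=vmstorage-0:8401
    - -storageNode=vmstorage-1:8401
    - -clusternativeListenAddr=:8401   # accept queries from the upper-level vmselect
vmselect-global:
  args:
    - -storageNode=vmselect.cell-1:8401
    - -storageNode=vmselect.cell-2:8401
```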
Schema:
Global vmselect -> Local vmselects (in each cell)
Pros:
Cons:
In this model, your primary goal is to provide users with the fastest possible response, accepting certain risks associated with data freshness.
Write Path: Here you face another choice. To make the first_available read path work, every storage cell must contain a full copy of all data. This is achieved by configuring the global vmagent to replicate 100% of the write traffic to every storage cell, i.e., by listing every storage cell URL in the -remoteWrite.url flags. If some storage cells are missing from that list, the data on those cells will be incomplete, which affects the completeness of results on the read path.
Read Path: A global vmauth directs the user to the first available cell.
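An illustrative vmauth configuration for this read path; URLs are assumptions:

```yaml
# Sketch only: route each query to the first cell that responds.
unauthorized_user:
  url_prefix:
    - http://vmselect.cell-1:8481/select/0/prometheus/
    - http://vmselect.cell-2:8481/select/0/prometheus/
  load_balancing_policy: first_available
```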
Schema:
Global vmauth -> Cell -> vmselect
Pros:
Cons:
Each storage cell is an independent -remoteWrite.url target with its own vmagent queue, so lag in a single cell can cause it to serve stale results under the first_available policy. If vmauth sends a user to this cell while its queue is not empty, that user will receive stale data (data that is not 100% fresh). Automation could be used to disable reads from cells that are lagging behind.
Just like the read path, your alerting strategy in a hyperscale setup also involves critical trade-offs.
Path A: Local vmalert (Fast Evaluation, High Traffic). In this model, you deploy vmalert inside each storage cell.
How it works: Each vmalert queries its local vmselect for data. This is very fast and efficient. It then sends its firing alerts to a global Alertmanager cluster, which is likely located in the compute cells.
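A minimal sketch of the per-cell deployment; endpoints and the rule path are illustrative:

```yaml
# Sketch only: vmalert evaluates rules against the cell-local vmselect and
# sends firing alerts to the global Alertmanager.
vmalert:
  args:
    - -datasource.url=http://vmselect.local:8481/select/0/prometheus
    - -notifier.url=http://alertmanager.global:9093
    - -rule=/etc/vmalert/rules/*.yaml
```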
Pros:
Cons:
Path B: Global vmalert (Consistent Alerts, Higher Latency). In this model, you move vmalert out of the storage cells and into the global compute cells.
How it works: The global vmalert instances query the same entry point as users (either the global vmselect or vmauth). This provides them with a comprehensive view of all data. They then send alerts to their local Alertmanager instances in the same compute cell.
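The same sketch with the datasource moved to the global entry point; endpoints are illustrative:

```yaml
# Sketch only: vmalert evaluates rules against the global read layer, so every
# rule sees data from all cells, then notifies the local Alertmanager.
vmalert:
  args:
    - -datasource.url=http://vmselect.global:8481/select/0/prometheus
    - -notifier.url=http://alertmanager.local:9093
    - -rule=/etc/vmalert/rules/*.yaml
```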
Pros:
Cons:
Blast radius: Region / Cell
Single node failure within a cell: degraded performance in that cell; global system continues normally.
Single cell failure:
Path A (Global vmselect): queries still complete but slower (merging from healthy cells).
Path B (First-available vmauth): queries are routed to healthy cells; stale data is possible if a write lag exists.
Region outage: the duplicated architecture in the standby region takes over, resulting in temporary degradation until the reroute is completed.
Recommended for: Companies of any scale that need to serve multiple internal teams or external customers with separate data. Each tenant may have different requirements for data isolation and performance.
Another use case is applying different retention periods to different tenants, which is described in this guide: VictoriaMetrics Cluster
Key characteristics: This architecture introduces a logical layer of multitenancy on top of the physical architectures mentioned before.
How it works:
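As an illustration: in the cluster version, tenants are identified by an accountID (and optionally a projectID) in the request path, so a vmauth in front of the cluster can map each authenticated client to its tenant prefix. The tokens and IDs below are assumptions:

```yaml
# Sketch only: per-tenant routing on the write path via vmauth.
users:
  - bearer_token: "team-a-token"
    url_prefix: http://vminsert:8480/insert/1/prometheus/   # tenant accountID=1
  - bearer_token: "team-b-token"
    url_prefix: http://vminsert:8480/insert/2/prometheus/   # tenant accountID=2
```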
This multitenancy approach gives us another trade-off in the isolation implementation.
Schema:
Path A: Shared resources. We have a single, shared pool of all cluster components.
Pros:
Cons:
Path B: Dedicated processing layer. For very important tenants, we can create a separate, dedicated layer of vmagents, vmselect, vminsert, and other components in use.
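For example, the same vmauth layer could send a high-priority tenant to its own dedicated query pool while the rest share common components; the names and tenant IDs are illustrative:

```yaml
# Sketch only: dedicated vs shared vmselect pools per tenant.
users:
  - bearer_token: "premium-tenant-token"
    url_prefix: http://vmselect-dedicated:8481/select/10/prometheus/
  - bearer_token: "standard-tenant-token"
    url_prefix: http://vmselect-shared:8481/select/20/prometheus/
```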
Pros:
Cons: