High Availability - Vector

Failure Modes

Before making a system highly available, you must understand the various ways it can fail.

Hardware Failures

Disk Failure

Disk failures are treated as process failures since Vector will exit if the disk is unavailable when Vector needs it. Vector will fail to start if the disk is unavailable at boot time, and Vector will exit if the disk is unavailable at runtime.

Vector does not need a disk during runtime unless you’re using a component that requires it, such as disk buffers or the file source and sink.

Node Failure

Mitigate individual node failures by distributing Vector, and its load balancer, across multiple nodes. Front the Vector nodes with a load balancer that will failover when a Vector node becomes unreachable. You should have enough capacity to distribute the tolerated node failure across your available nodes. A good rule of thumb is that no single node should process more than 33% of your data. Finally, automated self-healing should detect and replace unhealthy nodes.

Data Center Failure

Mitigate data center failures by deploying Vector across multiple Availability Zones. For uninterrupted failover, you should have enough capacity in other Availability Zones to handle the loss of any single Availability Zone. Vector’s high-performance design goal exists to make it economically possible to over-provision in this manner to handle failover seamlessly.

Software Failures

Vector Process Failure

Mitigate Vector process failures similarly to node failures: Vector’s load balancer will failover when a Vector process becomes unreachable. Automated platform-level self-healing restarts the process or replaces the node. This is achieved with a platform-level supervisor, such as a controller in Kubernetes.

Data Receive Failure

Mitigate data receive failures with Vector’s end-to-end acknowledgements feature. When enabled, Vector only responds to the clients when data durably persists. Such as writing the data to the disk buffer or a downstream service.

This feature should be paired with disk buffers or streaming sinks to ensure acknowledgements are timely.

Data Processing Failure

Mitigate data processing failures with Vector’s dropped events routing. remap transforms have a dropped output channel that routes dropped events through another pipeline. This flexible design allows the routing of dropped events like any other event.

We recommend routing dropped events immediately to a backup destination to prioritize durability. Then, once data is persisted, you can inspect it, correct the error, and replay it.

The following is an example configuration that routes failed events from a remap transform to a aws_s3 sink:

yaml

sources:
  input:
    type: "datadog_agent"

transforms:
  parsing:
    inputs: ["input"]
    type: "remap"
    reroute_dropped: true
    source: "..."

sinks:
  analysis:
    inputs: ["parsing"]
    type: "datadog_logs"

  backup:
    inputs: ["parsing.dropped"] # dropped events from the `parsing` transform
    type: "aws_s3"

{{< info >}} Support for dropped channels in other components is forthcoming. {{< /info >}}

Data Send Failure

Mitigate data send failures with Adaptive Request Concurrency (ARC) and buffers.

ARC automatically scales down the number of outgoing connections when Vector cannot send data. ARC minimizes the amount of in-flight data, appropriately applies backpressure, and prevents the stampede effect that often prohibits downstream services from recovering.

Buffers absorb back pressure when a service cannot accept data, insulating upstream clients from backpressure, and durably persisting data. Buffers replay data to the destination upon recovery.

Total System Failures

{{< warning >}} The mitigation tactics discussed in this section are advanced tactics that should only be used in environments with stringent availability requirements where the cost tradeoffs are worth it. {{< /warning >}}

Mitigate Vector system failures by failing over to intra-network standbys. To recover Vector, standbys should be within the same network to avoid sending data over the public internet for security and cost reasons.

{{< info >}} The disaster recovery section covers entire site/region failures, and mitigation is achieved with your broader disaster recovery plan. {{< /info >}}

Load Balancer System Failure

Mitigate total load balancer failures by failing over to your standby load balancer. Failover is typically achieved with service discovery or DNS.

Aggregator System Failure

Mitigate total aggregator failures by failing over to your standby aggregator. Failover is typically achieved with service discovery and DNS.

Strategy

Taking the above failure modes, we end up with the following strategy to achieve high availability.

Contain The Blast Radius

Deploy Vector within each network boundary to contain the blast radius of any individual aggregator failure. This mitigates a single point of failure.

{{< info >}} More detail can be found in the working with network boundaries section in the deployment architecture document. {{< /info >}}

Mitigate Hardware Failures

Mitigate hardware failures by distributing your aggregators and load balancers across multiple nodes and availability zones:

Deploy at least one node in each availability zone.
Capacity in your availability zones should handle any single availability zone failure.

These actions mitigate individual node failure and data center failure.

Mitigate Software Failures

Mitigate software failures by configuring your Vector instances as follows:

Able sources should enable end-to-end acknowledgements.
Sinks should implement disk or memory buffers that overflow to disk.
Transforms should implement dropped event routing to your system of record.

These actions mitigate Vector process failures, data receive failures, data processing failures, and data send failures.

Mitigate Total System Failures

{{< warning >}} This is an advanced tactic that we only recommend for production environments with stringent availability requirements. {{< /warning >}}

Mitigate Vector system failures by failing over to intra-network standbys. Failover is typically achieved with service discovery or DNS.

Agents should failover to your standby load balancers.
Load balancers should failover to your standby aggregators.
Aggregators should failover to your standby destinations.
Autoscaling should be used with your standbys to reduce cost.

These actions mitigate load balancer failure and aggregator failure.

Disaster Recovery & Data Loss

Internal Disaster Recovery

Vector is an infrastructure-level tool designed to route internal observability data. It implements a shared-nothing architecture and does not manage state that should be replicated or transferred to your disaster recovery (DR) site. If the systems that Vector collects data from fail, then so should Vector.

Therefore, if your entire network or region fails, Vector should fail with it and be installed in your DR site as part of your broader DR plan.

External Disaster Recovery

Vector can play an essential role in your disaster recovery (DR) plan for external systems. For example, if you’re using a managed destination, such as Datadog, Vector can facilitate automatic routing data to your Datadog DR site via Vector’s circuit breakers feature.

Enterprise Vector

Datadog Observability Pipelines is Vector’s enterprise offering designed to meet the challenges of deploying and managing Vector at scale. Out-of-the-box monitoring, configuration change management, and expert support simplify operations of Vector in demanding environments.