docs/streaming-routing.md
Streaming routing controls how Netdata child nodes connect to parent nodes when multiple parents are available. It handles three key operations: initial parent selection, connection management, and failover.
:::info
This feature requires configuring streaming in netdata.conf. See Streaming Configuration for setup instructions.
:::
When a child node starts, it queries all configured parents simultaneously to determine the best connection:
flowchart LR
A("**Start**
Child node startup") --> B("**Query all parents**
Parallel HTTP requests")
B --> C("**Data recency
delta < 1min?**
Compare timestamps")
C -->|Multiple| D("**Random select**
Load balancing")
C -->|One best| E("**Connect to
most recent**
Data continuity")
C -->|No data| D
D --> F("**Connect**
Streaming active")
E --> F
%% Style definitions matching the reference
classDef alert fill: #ffeb3b, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
classDef neutral fill: #f9f9f9, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
classDef complete fill: #4caf50, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
classDef database fill: #2196F3, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
%% Apply styles
class A alert
class B neutral
class C neutral
class D database
class E database
class F complete
How it works:
Example:
# In child's stream.conf
[stream]
enabled = yes
destination = parent-a:19999 parent-b:19999 parent-c:19999
api key = YOUR_API_KEY
With this configuration:
Child Node startup:
│
├─→ Parent A (has historical data) ✓ Selected (random between A & B)
├─→ Parent B (has historical data)
└─→ Parent C (no historical data) ← Lower priority
Once connected, the child maintains a persistent connection:
:::info
Netdata automatically replicates missing historical data when reconnection occurs. Data is only lost if:
memory mode = ram (metrics stored in memory) ANDFor persistent data, use memory mode = dbengine.
:::
:::warning
Children do not automatically reconnect to their original parent after failover. This prevents connection flapping but requires manual intervention for load redistribution.
:::
When the active connection fails, the child repeats the parent selection process:
flowchart LR
A("**Failed**
Connection lost") --> B("**Wait 5-X sec**
Randomized delay")
B --> C("**Query all parents**
Re-evaluate options")
C --> D("**Select best
by recency**
Data continuity priority")
D --> E("**Try connect**
Attempt connection")
E --> F("**OK?**
Connection test")
F -->|No| B
F -->|Yes| G("**Stream**
Active monitoring")
%% Style definitions matching the reference
classDef alert fill: #ffeb3b, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
classDef neutral fill: #f9f9f9, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
classDef complete fill: #4caf50, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
classDef database fill: #2196F3, stroke: #000000, stroke-width: 3px, color: #000000, font-size: 18px
%% Apply styles
class A alert
class B neutral
class C neutral
class E neutral
class D database
class F database
class G complete
:::note
Unlike traditional round-robin failover, Netdata re-evaluates all parents on each attempt. This means a child might connect to a different parent than expected if data states have changed.
:::
Failover Example:
Normal: Child → Parent A
Failure: Child ✗ Parent A (connection lost)
Child → Parent B (immediate failover)
Recovery: Parent A comes back online
Child → Parent B (stays connected - no automatic switch)
| Behavior | Description | Impact |
|---|---|---|
| Data Recency Priority | Selects parent with most recent data (lowest time delta) | Minimizes gap in historical data |
| Parallel Parent Query | Queries all parents simultaneously via HTTP | Fast parent selection, no sequential delays |
| Sticky Connections | No automatic rebalancing after failover | Requires manual intervention to redistribute load |
| Smart Failover | Re-evaluates all parents on each connection attempt | May connect to different parent based on current data state |
| Connection Persistence | Maintains connection until failure occurs | Prevents unnecessary reconnections and data gaps |
| No Health Checks | Doesn't proactively test parent availability | Discovers failures only when connection breaks |
| Randomized Delays | Reconnection waits random time (5s to configured maximum) | Prevents thundering herd during mass reconnections |
[stream]
# Streaming targets (space-separated list)
# Order doesn't matter - selection is based on data recency
destination = parent1:19999 parent2:19999 parent3:19999
# Reconnection delay - randomized between 5 and this value (seconds)
# Default: 5, Minimum: 5
reconnect delay seconds = 5
# Initial connection timeout
timeout seconds = 60
For larger deployments:
Child Nodes ──→ Parent Proxies ──→ Ultimate Parents
(forward only) (store & analyze)
Configure intermediate parents as proxies to distribute load without storage overhead.
The Netdata Streaming function (under the "Functions" tab) provides:
# Check journal for streaming-related messages
journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep stream
# Test each parent
nc -zv parent-a 19999
nc -zv parent-b 19999
:::note Troubleshooting
If a child connects to an unexpected parent, check the data retention on all parents. The child prefers parents that already have its historical data.
:::
| Scenario | What Happens | Why |
|---|---|---|
| Parent A fails | Child switches to Parent B | Automatic failover |
| All parents fail | Child cycles through list every second | Continuous retry |
| Parent A recovers | Child stays on Parent B | No automatic rebalancing |
| New child starts | Randomly selects from parents with data | Load distribution |
:::caution Maintenance Planning
When taking a parent offline for maintenance, its children will fail over to other parents and won't automatically return. Plan capacity accordingly.
:::