docs/welcome-to-netdata.md
Netdata is a distributed, real-time observability platform that monitors metrics and logs from systems and applications, built on a foundation designed to seamlessly extend to distributed tracing. It collects data at per-second granularity, stores it at (or as close as possible to) the edge where it is generated, and provides automated dashboards, machine-learning anomaly detection, and AI-powered analysis without requiring configuration or specialized skills.
Instead of centralizing the data, Netdata distributes the monitoring code to each system, keeping data local while providing unified access. This architecture enables linear scaling to millions of metrics per second and terabytes of logs, automated root cause analysis, a faster user experience, and a significantly lower total cost of ownership.
We have designed this platform for operations teams, sysadmins, DevOps engineers, and SREs who need comprehensive, real-time, low-latency visibility into their infrastructure and applications. Netdata is opinionated — it collects everything, visualizes everything, and runs machine learning anomaly detection on everything, with several innovations that make modern observability accessible to lean teams, without the need for specialized skills.
The system consists of three components: Netdata Agents installed on every monitored system, optional Netdata Parents that aggregate and retain data from multiple Agents, and Netdata Cloud for unified access across the infrastructure.

Here is how Netdata compares to typical industry practice:
| Aspect | Netdata | Industry Standard |
|---|---|---|
| **Real-Time Monitoring** | | |
| Data granularity | 1 second | 10-60 seconds |
| Collection to visualization | 1 second | 30+ seconds |
| Time to first dashboard | 10 seconds | Hours to days |
| **Automation** | | |
| Configuration required | Minimal to none | Extensive |
| ML anomaly detection | All metrics, automatically | Selected metrics, manually |
| Pre-configured alerts | 400+ out of the box | Built from scratch |
| **Efficiency** | | |
| Storage per sample | 0.6 bytes (high-resolution tier) | 2-16 bytes |
| Agent CPU usage | 5% of a single core | 10-30% of a single core |
| Scalability | Linear, unlimited | Exponential complexity |
| **Coverage** | | |
| Metrics collected | Everything available | Manually selected |
| Built-in collectors | 800+ integrations | Basic system metrics |
| Hardware monitoring | Comprehensive | Limited or none |
| Live monitoring | Processes, network connections, and more | Limited or none |
:::note
Netdata keeps the observability data at the edge (Netdata Agents), or as close to the edge as possible (Netdata Parents).
:::
:::tip
Keeping data at the edge eliminates egress charges, ensures compliance by default, and transforms observability from an unpredictable cost center into a fixed operational expense while delivering sub-second query performance.
:::
Implementation: Each Netdata Agent is a complete monitoring system with collection, storage, query engine, visualization, machine learning, and alerting. This isn't just an agent that ships data elsewhere — it's a full observability stack. The distributed architecture provides:
:::note
Most observability solutions are selective, to control cost, complexity, and the time and skills required to set them up. Organizations are frequently instructed to select only what is important to them, based on their understanding and needs.
This creates two fundamental problems:
:::
Netdata's design allows it to capture everything exposed by systems and applications — every metric, every log entry, every piece of telemetry available.
The comprehensive approach ensures:
:::note
Most observability solutions collect data every 10-60 seconds with additional pipeline delays of seconds to minutes, making them statistical analysis tools rather than real-time monitoring. This forces engineers to SSH into servers for accurate, timely data during incidents.
:::
Netdata collects everything per second and has a fixed one-second collection-to-visualization latency. Netdata works on a beat: every sample must be collected on time. Delays in data collection indicate that the monitored component or application is under stress, and Netdata shows gaps on the charts. This strict real-time approach delivers:
:::note
Most observability solutions require users to learn query languages, manually build dashboards, and understand metric types before they can visualize data. This prerequisite knowledge and configuration work becomes the biggest barrier to effective monitoring.
:::
Most of our infrastructure components are common: operating systems, databases, web servers, message brokers, containers, storage devices, network devices, and so on. We all use the same finite set of components, plus a few custom applications.
Netdata dashboards are an algorithm, not a configuration. Each Netdata chart is a full analytical tool, offering a 360° view of the data and its sources. With simple point-and-click, you can slice and dice any dataset, gaining a clear picture of what’s available and where it comes from. Netdata provides single-node, multi-node, and infrastructure-level dashboards automatically. All metrics are organized in a meaningful manner with a universal table of contents that dynamically adapts to the data available, providing instant access to every metric. This approach delivers:
Netdata, contrary to most observability solutions, is optimized for lightweight storage operations. Three storage tiers are updated in parallel (per-second, per-minute, per-hour). The high-resolution tier needs 0.6 bytes per sample on disk (Gorilla compression + ZSTD). The lower-resolution tiers need 6 and 18 bytes per sample respectively, while maintaining the ability to provide the same min, max, average, and anomaly rate the high-resolution tier provides. Data are written to append-only files and are never reorganized on disk (Write Once Read Many — WORM). Writes are spread evenly over time: Netdata Agents write at 5 KiB/s, and Netdata Parents aggregating 1M metrics/s write at 1 MiB/s across all tiers.
Netdata implements a custom time-series database optimized for the specific patterns of system metrics:
This efficient storage architecture delivers years of data in gigabytes rather than terabytes, with predictable I/O patterns and linear scaling of storage requirements with infrastructure size.
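To put these numbers in perspective, here is a back-of-the-envelope calculation as a minimal Python sketch. The per-sample sizes are the figures quoted above; the metric count and the assumption of full-year retention in every tier are hypothetical, chosen only to make the arithmetic concrete:

```python
# Back-of-the-envelope storage math using the per-sample figures quoted above.
# Assumptions (not Netdata defaults): 2,000 metrics per node and one year of
# retention in every tier -- real deployments typically keep far less in tier 0.

SECONDS_PER_YEAR = 365 * 24 * 3600

tiers = {
    "per-second": (1, 0.6),       # (samples per second, bytes per sample on disk)
    "per-minute": (1 / 60, 6.0),
    "per-hour":   (1 / 3600, 18.0),
}

metrics = 2_000
for name, (rate, bytes_per_sample) in tiers.items():
    total = metrics * rate * SECONDS_PER_YEAR * bytes_per_sample
    print(f"{name:>10}: {total / 1024**3:6.2f} GiB/year for {metrics} metrics")

# per-second: ~35.2 GiB/year, per-minute: ~5.9 GiB/year, per-hour: ~0.3 GiB/year
```

Even with per-second data kept for a full year, a 2,000-metric node stays in the tens of gigabytes, which is consistent with the "gigabytes rather than terabytes" claim above.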
:::info
Log management has become one of the largest cost drivers in observability, with organizations spending millions on storage and processing infrastructure. Many resort to aggressive filtering and sampling just to make costs manageable, inevitably losing critical information when they need it most.
:::
Netdata takes a fundamentally different approach by leveraging the systemd journal format, the native log format of Linux systems. This edge-based approach provides enterprise-grade capabilities without the enterprise costs:
Netdata also ships log2journal, which converts any text, JSON, or logfmt logs into structured journal entries.

Where traditional solutions sample 5,000 log entries to generate field statistics on their dashboards, Netdata starts sampling at 1 million entries, providing 200x more accurate insights into log patterns. The result is enterprise-grade log management capabilities — field statistics, histogram breakdowns, full-text search, time-based filtering — all while keeping logs at the edge where they're generated, eliminating the massive costs of centralized log infrastructure.
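As a rough illustration of the conversion step, here is a toy Python sketch of the log2journal idea: reading JSON log lines and emitting systemd Journal Export Format (FIELD=VALUE pairs, entries separated by a blank line), suitable for piping into a journal ingestion helper such as Netdata's systemd-cat-native. The real log2journal is a native tool with far more capable pipelines; the field names, defaults, and the absence of any field-name validation here are purely illustrative:

```python
# Toy log2journal-style converter: JSON log lines -> Journal Export Format.
# Illustrative only -- no field-name validation, no multiline handling.
import json
import sys

for line in sys.stdin:
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = {"MESSAGE": line.rstrip("\n")}   # fall back to raw text

    entry = {
        "MESSAGE": str(record.get("msg", record.get("MESSAGE", ""))),
        "PRIORITY": str(record.get("level_num", 6)),   # default: info
    }
    # promote remaining keys to uppercase journal fields
    for k, v in record.items():
        if k not in ("msg", "MESSAGE", "level_num"):
            entry[k.upper()] = str(v)

    # one FIELD=VALUE pair per line, blank line terminates the entry
    sys.stdout.write("".join(f"{k}={v}\n" for k, v in entry.items()) + "\n")
```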
:::note
On Windows, Netdata queries Windows Event Logs (WEL), Event Tracing for Windows (ETW), and TraceLogging (TL) via the Event Log.
:::
ML is the simplest way to model the behavior of our systems and applications. When done properly, ML can reliably detect anomalies, surface correlations between components and applications, provide valuable information about cascading effects during a crisis, identify the blast radius of issues, and even detect infrastructure-level issues independently of the configured alerts.
Netdata democratizes ML and AI by making it automatic and universal — no configuration is required. The system trains 18 k-means models per metric, each using a different time window, and requires unanimous agreement before flagging an anomaly. This achieves a theoretical false positive rate of 10^-36 (1% per model, raised to the power of 18 models) while remaining sensitive to real issues:
Note: Netdata's ML focuses on detecting behavioral anomalies in metrics using their last 2 days of data. It is optimized for reliability rather than sensitivity, and may miss slow (over days or weeks) infrastructure degradation or certain types of long-term anomalies (weekly, monthly, etc.). However, it typically detects most types of abnormal behavior that break services.
For more information, see Netdata's ML Accuracy, Reliability and Sensitivity.
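To make the consensus mechanism concrete, here is a conceptual Python sketch. This is not Netdata's actual implementation (which is native C inside the Agent); the model count (6 instead of 18, for brevity), the features, window sizes, and thresholds are all illustrative:

```python
# Consensus-based anomaly detection sketch: several k-means models, each
# trained on a different trailing window, must ALL flag a sample before it is
# called anomalous. With a ~1% outlier rate per model, unanimity drives the
# false-positive rate toward 0.01^N (0.01^18 ~ 1e-36 in Netdata's case).
import numpy as np
from sklearn.cluster import KMeans

def features(window: np.ndarray) -> np.ndarray:
    # value + first difference: simple features that work well for metrics
    return np.column_stack([window[1:], np.diff(window)])

def train_ensemble(series, n_models=6, clusters=2, pct=99.0):
    ensemble = []
    for i in range(1, n_models + 1):
        window = series[-(len(series) * i // n_models):]  # different lookback per model
        X = features(window)
        km = KMeans(n_clusters=clusters, n_init=10).fit(X)
        dists = np.min(km.transform(X), axis=1)           # distance to nearest centroid
        ensemble.append((km, np.percentile(dists, pct)))  # per-model anomaly threshold
    return ensemble

def anomalous(ensemble, prev, curr):
    x = np.array([[curr, curr - prev]])
    # unanimous vote: every model must consider the point an outlier
    return all(np.min(km.transform(x)) > thr for km, thr in ensemble)

# toy usage: a noisy sine wave, then a sudden jump
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 60, 7200)) + rng.normal(0, 0.05, 7200)
models = train_ensemble(series)
print(anomalous(models, series[-1], series[-1] + 0.01))  # normal step -> False
print(anomalous(models, series[-1], series[-1] + 5.0))   # huge jump   -> True
```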
Netdata introduces a significant shift in the troubleshooting process with its unsupervised, real-time anomaly detection system. The "Anomaly Advisor" transforms troubleshooting:
This approach still requires interpretation skills but dramatically simplifies the investigation process compared to traditional methods (the aha! moment is within the first 30-50 results).
:::note
Most monitoring solutions focus on aggregate metrics and business-level alerts, often missing component failures until they cascade into service outages. This approach leads to alert fatigue from false positives and missed issues from incomplete coverage.
:::
Netdata takes a fundamentally different approach: templated alerts that monitor individual component and application instances. Each alert watches a single instance, building a comprehensive safety net where every component has its own watchdog. This granular approach ensures:
:::tip
Netdata ships with hundreds of pre-configured alerts, many intentionally silent by default. These silent alerts monitor important but non-critical conditions that should be reviewed but shouldn't wake engineers at 3am. This pragmatic approach balances comprehensive monitoring with operational sanity.
:::
For Netdata, scalability is inherent to the architecture, not an add-on. Designed to be fully distributed, Netdata achieves linear scalability through:
Netdata thrives as part of a vibrant open-source community with 1.5 million downloads per day. The platform integrates seamlessly with existing tools and standards:
Netdata can operate independently or alongside your existing observability stack. Whether you use Prometheus, Grafana, OpenTelemetry, or centralized log aggregators, Netdata enhances visibility without disrupting existing workflows.
Typically, organizations deploying Netdata need to:
- Convert text logs with log2journal and centralize them using typical systemd-journald methodologies

Netdata will automatically provide:
:::tip
Custom dashboards are supported but optional. Netdata provides single-node, multi-node, and infrastructure-level dashboards automatically.
:::
Netdata configurations are infrastructure-as-code friendly, and provisioning systems can be used to automate deployment on large infrastructures. A complete Netdata deployment is usually achieved within a few days.
Netdata is committed to having best-in-class resource utilization. Wasted resources are considered bugs and are addressed with high priority.
Based on extensive real-world deployments and independent academic validation, Netdata maintains a minimal resource footprint:
| Resource | Standalone 5k metrics/s | Child 5k metrics/s | Parent 1M metrics/s |
|---|---|---|---|
| CPU | 5% of a single core | 3% of a single core | ~10 cores total |
| Memory | 200 MB | 150 MB | ~40 GB |
| Network | None | <1 Mbps to Parent | ~100 Mbps inbound |
| Storage Capacity | 3 GiB (configurable) | None | As needed |
| Storage Throughput | 5 KiB/s write | None | 1 MiB/s write |
| Retention | 1 year (configurable) | None | As needed |
:::info
The University of Amsterdam study found Netdata to be the most energy-efficient monitoring solution, with the lowest CPU overhead, memory usage, and execution time impact among compared tools.
For more information, see Netdata's impact on resources.
:::
Please also see the Netdata Enterprise Evaluation Guide and Netdata's Security and Privacy Design.
Without dedicated monitoring staff, teams need systems that work without constant attention. Netdata's automatic operation enables teams to:
At scale, traditional monitoring becomes expensive and complex. Netdata's architecture enables organizations to:
Modern infrastructure changes constantly. Netdata enables teams to:
The opposite is true — edge architecture eliminates most management overhead.
Traditional centralized systems require:
Netdata's edge approach provides:
The architecture also delivers operational benefits:
<details>
<summary>Why not use existing databases?</summary>
Existing time-series databases couldn't meet the requirements for edge deployment:
The "thousands of databases" concern misunderstands the architecture. These aren't databases you manage — they're autonomous components that manage themselves. It's like worrying about managing thousands of log files when you use syslog — the system handles it.
In practice, organizations using Netdata routinely achieve multi-million samples/second, highly-available observability infrastructure without even noticing the complexity this would normally imply. The complexity isn't moved — it's eliminated through design.
</details>

<details>
<summary>Isn't collecting 'everything' fundamentally wasteful?</summary>

The opposite is true — Netdata is the most energy-efficient monitoring solution available.
The University of Amsterdam study confirmed Netdata uses significantly fewer resources than selective monitoring solutions. Despite collecting everything at per-second resolution, our optimized design and streamlined code make Netdata more efficient, not less.
The real question is: What's the business impact when critical troubleshooting data isn't available during a crisis?
Consider:
The business case for complete coverage:
Selective monitoring creates a paradox: you must predict what will break to know what to monitor, but if you could predict it, you'd prevent it. Complete coverage eliminates this guessing game while actually reducing resource consumption through better engineering.
</details>

<details>
<summary>Does complete coverage create analysis paralysis?</summary>

Structure prevents paralysis — Netdata organizes data hierarchically, not as an unstructured pool.
Unlike monitoring solutions that present metrics as a flat list, Netdata uses intelligent hierarchical organization:
This means:
The real insight: Comprehensive data empowers different engineering approaches
Some engineers thrive with complete visibility — they can trace issues across subsystems, understand cascading failures, and prevent future problems. Others prefer simpler "is it working?" dashboards. Netdata supports both:
The philosophy isn't "more data is better" — it's "the right data should always be available." Hierarchical organization ensures engineers can work at their preferred depth without being overwhelmed by information they don't currently need.
Organizations report that engineers who initially felt overwhelmed quickly adapt once they experience finding that one critical metric that solved a major incident — the metric they wouldn't have thought to collect in advance.
</details>

<details>
<summary>Is per-second granularity actually useful or just marketing?</summary>

Per-second is for engineers, not business metrics — it matches the speed at which systems actually operate.
Consider the reality of modern systems:
Per-second is the standard for engineering tools
When engineers debug with console tools, they never use 10-second or minute averages. Why? Because averaging hides critical details:
Netdata was designed as a unified console replacement
Think of Netdata as the evolution of `top`, `iostat`, `netstat`, and hundreds of other console tools — but with:
This is true tool consolidation: instead of jumping between dozens of console commands during an incident, engineers have one unified view at the resolution that matters. When a service degrades, you need to see the exact second it started, not a minute-average that obscures the trigger.
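As a quick illustration of how averaging obscures exactly this kind of trigger, consider the following small Python example (the workload numbers are made up):

```python
# Why averaging hides incidents: a 3-second CPU spike to 100% is obvious at
# per-second resolution but almost invisible once averaged over a minute.
import numpy as np

cpu = np.full(300, 20.0)          # five minutes of ~20% CPU, sampled per second
cpu[120:123] = 100.0              # a 3-second saturation event

per_minute = cpu.reshape(-1, 60).mean(axis=1)   # 60-second averages

print(f"per-second view : peak {cpu.max():.0f}%")            # 100%
print(f"per-minute view : peak {per_minute.max():.0f}%")     # 24% -- spike gone
```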
Immediate feedback is crucial for effective operations
When engineers make infrastructure changes, they need to see the impact immediately:
This instant feedback loop dramatically accelerates problem resolution. Engineers can rapidly iterate through potential fixes, seeing results within seconds rather than waiting for averaged data that might hide whether the intervention actually helped.
For business metrics, minute or hourly aggregations make sense. But for infrastructure monitoring and tuning, per-second granularity is the foundation of effective troubleshooting.
</details>

<details>
<summary>What about the observer effect? How do you guarantee per-second collection isn't impacting application performance?</summary>

Netdata's default collection frequencies are carefully configured to avoid impacting monitored applications.
The goal is simple: collect all metrics at the maximum possible frequency without affecting performance. This means:
Thoughtfully configured defaults:
Performance-first defaults:
User control:
This isn't about blindly collecting everything every second regardless of impact. It's about being intelligent enough to collect each metric at the optimal frequency for that specific data source and use case, defaulting to configurations that have been proven safe across thousands of production deployments.
The University of Amsterdam study confirmed this approach works: despite comprehensive collection, Netdata has the lowest performance impact on monitored applications among the compared monitoring solutions.
</details>

<details>
<summary>Why systemd-journal instead of industry standards like Elasticsearch/Splunk?</summary>

systemd-journal IS the industry standard — it's already installed and running on every Linux system.
The question misframes the choice. systemd-journal isn't competing with Elasticsearch/Splunk — it's the native log format they all read from. The real question is: why move data when you can query it directly?
Understanding the trade-offs:
| Approach | Storage Footprint | Query Performance | Indexing Strategy |
|---|---|---|---|
| Loki | 1/4 to 1/2 of original logs | Slow (brute force scan after metadata filtering) | Limited metadata indexing |
| Elasticsearch/Splunk | 2-5x larger than original logs | Fast full-text search | Word-level reverse indexing |
| systemd-journal | ~Equal to original logs | Fast field-value queries | Forward indexing of all field values |
systemd-journal provides a balanced approach:
But systemd-journal is actually superior in critical ways:
Furthermore, direct file access isn't a security risk — it's a security advantage. Access control is enforced by the operating system itself through native filesystem permissions. There's no query server to hack, no additional authentication layer to misconfigure, and no database permissions to manage. Multi-tenancy and log isolation work through the same filesystem permission model that has provided reliable security for decades.
What Netdata adds: systemd-journal is powerful but lacks the visualization and analysis layer. Netdata provides:
The insight: instead of copying logs to expensive centralized systems, why not build better tools on the robust foundation already present in every Linux system? This eliminates data movement, reduces infrastructure costs, provides superior security, and delivers faster queries through native file access — all while maintaining the distributed architecture that makes modern infrastructure manageable.
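To illustrate "querying logs where they are" in practice, here is a minimal Python sketch using the python-systemd bindings (the `systemd-python` package on PyPI). This is not how Netdata reads the journal internally — its implementation is native C — and the unit name is hypothetical:

```python
# Query the local systemd journal directly -- no log shipping, no central server.
# Requires the python-systemd bindings (PyPI package: systemd-python).
from datetime import datetime, timedelta

from systemd import journal

j = journal.Reader()
j.this_boot()                                  # restrict to the current boot
j.add_match(_SYSTEMD_UNIT="nginx.service")     # indexed field-value match
j.seek_realtime(datetime.now() - timedelta(hours=1))

for entry in j:
    # each entry is a dict of structured fields, not a raw text line
    print(entry["__REALTIME_TIMESTAMP"], entry.get("MESSAGE", ""))
```

Access control for this query is plain filesystem permissions on the journal files, which is exactly the security model described above.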
</details>

Netdata represents a fundamental rethink of monitoring architecture. By processing data at the edge, automating configuration, maintaining real-time resolution, applying ML universally, and making data accessible to everyone, it solves core monitoring challenges that have persisted for decades.
The result is a monitoring system that deploys in minutes, scales to any size, adapts automatically to change, and delivers insights traditional tools can’t — all while staying open source and community-driven.
Whether you're monitoring a single server or a global infrastructure, Netdata's design philosophy creates a monitoring system that works with you rather than demanding constant attention.