docs/NIDL-Framework.md
The Netdata NIDL (Nodes, Instances, Dimensions, Labels) framework is the foundational data model that underpins Netdata's approach to observability. It defines how metrics are structured, collected, stored, and presented, enabling an interactive and intuitive analysis experience without the need for a query language.
For users, NIDL transforms raw data into an explorable, multi-dimensional view of their infrastructure. For developers, NIDL provides a strict set of guidelines for metric design, ensuring that collected data automatically translates into meaningful and actionable dashboards.
This document serves as both an introduction to the NIDL framework for all Netdata users and a fundamental guide for developers contributing to Netdata's data collection.
Imagine your infrastructure's performance data as a complex, multi-dimensional cube. Traditional monitoring tools often require you to learn a specialized query language to extract insights from this cube. Netdata, through the NIDL framework, provides intuitive controls to slice, dice, and examine this cube from any angle using simple dropdown menus.
Every metric collected by Netdata is organized according to these four components:
- **Nodes**: The monitored systems themselves, i.e., each server, VM, or device running a Netdata Agent.
- **Instances**: The entities a metric is collected for. In a `disk.io` chart, instances would be individual disk devices like `sda` and `sdb`. In a `containers.cpu` chart, instances would be individual container IDs or names.
- **Dimensions**: The individual time-series within a chart, such as a CPU chart with `user`, `system`, `iowait`, and `idle` as dimensions.
- **Labels**: Key-value metadata attached to instances, such as `kubernetes_namespace=production`, `device_type=ssd`, or `environment=staging`.

Every Netdata chart is an interactive analytical tool. Above each graph, you'll find dropdown menus corresponding to Nodes, Instances, Dimensions, and Labels. These menus are not just for filtering; they provide real-time statistics to guide your investigation:
┌───────────┬───────┬────────┬───────────┬────────────┬────────┬────────────┐
│ group by ▼│aggr. ▼│nodes ▼ │instances ▼│dimensions ▼│labels ▼│time aggr. ▼│
└───────────┴───────┴────────┴───────────┴────────────┴────────┴────────────┘
┌───────────────────────────────────────────────────────────────────────────┐
│ ▒▒▒▒▒░░░▒▒▒▒▒▒░░░░▒▒▒ Anomaly ribbon (anomaly rates over time) │
├───────────────────────────────────────────────────────────────────────────┤
│ ╱╲ ╱╲ │
│ ╱ ╲ ╱ ╲ GRAPH │
│ ╱ ╲╱ ╲ │
│ ╱ ╲_______________ │
├───────────────────────────────────────────────────────────────────────────┤
│ ░░░░█░░░░░ Info ribbon (gaps, resets, partial data) │
└───────────────────────────────────────────────────────────────────────────┘
X-axis (time)
─────────────────────────────────────────────────────────────────────────────
Dimension1: 12.3k │ Dimension2: 8.9k │ Dimension3: 5.6k │ Dimension4: 2.3k
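To make the four components concrete, here is one collected sample decomposed along the NIDL axes. The field names and values are illustrative only, not Netdata's internal schema:

```python
# Illustrative only: a single collected sample, decomposed along NIDL axes.
sample = {
    "node": "web-server-01",    # Node: which monitored system reported it
    "context": "disk.io",       # the chart this sample belongs to
    "instance": "sda",          # Instance: which disk device
    "dimension": "read",        # Dimension: which facet of the metric
    "labels": {                 # Labels: metadata for filtering and grouping
        "device_type": "ssd",
        "mount_point": "/var",
    },
    "value": 1234.5,            # bytes/s at this point in time
}

# The dropdowns slice along these axes: filtering on
# labels["device_type"] == "ssd" keeps this sample, and grouping by
# "instance" buckets it under "sda".
```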
Each dropdown doubles as a table, displaying real-time statistics for every item it lists.
This rich context turns every chart into a starting point for investigation. For example, group by `kubernetes_namespace` to see CPU usage per namespace, then apply an average aggregation to understand typical consumption.

NIDL extends beyond individual charts to enable powerful custom dashboards through simple drag-and-drop operations:
Example: Create a Kubernetes dashboard by dragging CPU, memory, and network charts, then setting one to group by namespace, another by pod, and a third showing node-level aggregations - all from the same underlying metrics.
NIDL makes deliberate UX choices to maintain simplicity and clarity. These aren't technical limitations - the query engine could handle more complexity - but rather design decisions that keep the interface intuitive:
Uniform Aggregation per Chart: All dimensions in a chart use the same aggregation function. For example, you cannot show "min of mins" and "max of maxes" in the same chart. This keeps the mental model simple: one chart, one aggregation.
Single Context per Chart: Each chart displays metrics from one context only. While the engine can query multiple contexts simultaneously, combining them would require complex UI controls that could overwhelm users.
Continuous Evolution: While NIDL controls already enhance and simplify advanced analytics for the vast majority of use cases, Netdata is continuously evolving. Like all monitoring solutions, we identify areas for improvement and actively work to address them.
We've identified several areas where Netdata dashboards can be enhanced to cover even more sophisticated use cases without compromising NIDL's simplicity:
Virtual Contexts: Enable custom calculations across multiple contexts, appearing as new charts on dashboards - bringing complex correlations to the same simple interface
Advanced Query Mode: Introduce an optional query editor for power users, thoughtfully integrated to preserve the default NIDL experience
Persistent Virtual Metrics: Allow complex calculations to be saved as new time-series, making advanced analytics reusable and shareable
These improvements are part of Netdata's commitment to making infrastructure monitoring both powerful and accessible. We continuously refine the balance between capability and simplicity based on real-world usage and community feedback.
For Netdata developers, understanding and adhering to the NIDL framework is paramount. In Netdata, metric design IS dashboard design. The choices made during data collection directly determine the usability and clarity of the automatically generated dashboards. There is no separate dashboard configuration step to correct poorly structured metrics.
To ensure your collected metrics integrate seamlessly with the NIDL framework and produce meaningful charts, adhere to the following principles:
Each Netdata context (which corresponds to a single chart on the dashboard) must contain only one type of instance. Mixing different types of entities within the same context will lead to confusing dropdown menus and meaningless aggregations.
Correct Example:
Context: mysql.db.queries
Instances: database1, database2, database3
Dimensions: select, insert, update, delete
Explanation: All instances are of type "database".
Incorrect Example:
Context: mysql.queries
Instances: server1, database1, table1 // Mixed instance types
Dimensions: select, insert, update, delete
Explanation: Mixing server, database, and table instances in one context makes the "Instances" dropdown unusable for comparison or drill-down.
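The rule can be expressed as a small check. This is a hypothetical validator to illustrate the principle, not a real Netdata API:

```python
# Hypothetical validator (not a Netdata API): a context is coherent only
# when every instance in it is the same type of entity.
def check_uniform_instance_type(context, instance_types):
    """instance_types maps instance name -> entity type ('server', 'database', ...)."""
    types = set(instance_types.values())
    if len(types) > 1:
        raise ValueError(f"{context}: mixed instance types {sorted(types)}")
    return True

# Correct: every instance in mysql.db.queries is a database.
check_uniform_instance_type(
    "mysql.db.queries",
    {"database1": "database", "database2": "database", "database3": "database"},
)

# Incorrect: servers, databases, and tables mixed in one context would
# make the "Instances" dropdown meaningless, so this raises ValueError.
try:
    check_uniform_instance_type(
        "mysql.queries",
        {"server1": "server", "database1": "database", "table1": "table"},
    )
except ValueError as err:
    print(err)
```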
All dimensions within a single chart (context) must be logically related, share the same unit, and make sense when aggregated together.
Correct Example:
Context: system.cpu
Dimensions: user, system, iowait, idle
Unit: percentage
Explanation: All dimensions represent parts of CPU time and sum to 100%.
Incorrect Example:
Context: system.health
Dimensions: cpu_percent, free_memory_mb, disk_io_ops
Explanation: These dimensions have different units and represent unrelated metrics, making aggregation or comparison within the same chart meaningless.
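The same-unit rule can also be sketched as a check. The helper name is ours, purely for illustration:

```python
# Sketch of the rule as a check (not a Netdata API): a chart is valid
# only when all of its dimensions share one unit.
def dimensions_share_unit(dims):
    """dims maps dimension name -> unit string."""
    return len(set(dims.values())) == 1

# Correct: every CPU-time dimension is a percentage, so they can be
# stacked and summed meaningfully (they add up to 100%).
assert dimensions_share_unit({
    "user": "percentage", "system": "percentage",
    "iowait": "percentage", "idle": "percentage",
})

# Incorrect: three unrelated units in one chart; aggregating them
# would produce a meaningless number.
assert not dimensions_share_unit({
    "cpu_percent": "percentage",
    "free_memory_mb": "MiB",
    "disk_io_ops": "operations/s",
})
```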
For hierarchical data (e.g., a database server, its databases, its tables, its indexes), create separate contexts for each level of the hierarchy. Do not attempt to combine different hierarchical levels into a single context.
Example: MySQL Monitoring
Instead of one mysql.operations context trying to cover everything, create distinct contexts:
- `mysql.operations`: Instances are the database servers themselves.
- `mysql.db.operations`: Instances are individual databases (e.g., `users_db`, `orders_db`).
- `mysql.table.operations`: Instances are individual tables within databases.
- `mysql.index.operations`: Instances are individual indexes within tables.

Each of these contexts will generate its own chart, ensuring that the "Instances" dropdown for each chart is clean and coherent (e.g., the `mysql.db.operations` chart's "Instances" dropdown will only list databases, not servers or tables).
Families in Netdata are used to group charts (contexts) on the dashboard, not to define instance hierarchies within a single chart. You can use families to organize related contexts (e.g., a "MySQL" family containing all mysql.* contexts, or a "Connections" family grouping all connection-related contexts across different services).
Use labels to provide meaningful metadata that enables flexible filtering and grouping. Labels should be consistent across instances and provide valuable context for analysis.
Example: For container metrics, useful labels might include:
- `kubernetes_pod_name`
- `kubernetes_namespace`
- `docker_image`
- `environment`

These labels allow users to slice their data by specific pods, namespaces, or environments, enhancing the analytical power of the chart.
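A rough sketch of what label-based grouping does under the hood. The data and helper are illustrative, not Netdata's query engine:

```python
# Illustrative container samples: (instance, labels, value).
# Integer values keep the arithmetic exact; names are made up.
samples = [
    ("web-1", {"kubernetes_namespace": "production", "docker_image": "nginx"}, 12),
    ("web-2", {"kubernetes_namespace": "production", "docker_image": "nginx"}, 9),
    ("job-1", {"kubernetes_namespace": "batch", "docker_image": "worker"}, 6),
]

def group_by_label(rows, label):
    """Sum values per distinct value of one label; this is the effect of
    the 'group by' dropdown when you pick a label instead of instances."""
    totals = {}
    for _instance, labels, value in rows:
        key = labels.get(label, "unknown")
        totals[key] = totals.get(key, 0) + value
    return totals

by_namespace = group_by_label(samples, "kubernetes_namespace")
# by_namespace == {"production": 21, "batch": 6}
```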
Context Design:
- Context: `disk.io`
- Instances: `sda`, `sdb`, `sdc` (individual disk devices)
- Dimensions: `read`, `write` (bytes/s)
- Labels: `device_type=ssd`, `mount_point=/var`

What this enables: Users can compare I/O across disks, see read/write patterns per disk, filter by device type, and group by mount point.
Context Design:
- Context: `containers.cpu`
- Instances: `container1`, `container2`, `container3` (individual containers)
- Dimensions: `user`, `system` (percentage)
- Labels: `image=nginx`, `namespace=production`, `pod=web-server`

What this enables: Users can compare CPU usage across containers, identify system vs. user CPU consumption, filter by image type, and group by Kubernetes namespace or pod.
Instead of one generic metric, separate by concerns and hierarchical levels:
`app.requests`
- Instances: `endpoint1`, `endpoint2`
- Dimensions: `success`, `client_error`, `server_error` (requests/s)

`app.response_time`
- Instances: `endpoint1`, `endpoint2`
- Dimensions: `p50`, `p95`, `p99` (milliseconds)

`app.active_connections`
- Instances: `server1`, `server2`
- Dimensions: `active`, `idle` (connections)

When ingesting metrics from other observability solutions (e.g., Prometheus), it's common to encounter multi-dimensional metrics that combine several instance types or hierarchical levels into a single metric.
Prometheus-style (example):
`mysql_operations{server="prod1", database="db1", table="users", operation="read"} = 1234`
This single metric contains enough information to extract server-level, database-level, and table-level views. With NIDL, you could technically import this as-is and use the dropdown menus to aggregate by different labels.
However, this misses the bigger picture of dashboard design.
Consider how you'd structure a dashboard if designing it by hand:
Each section tells its own story with metrics appropriate to that level.
Netdata's Best Practice: Create this natural structure through separate contexts:
- `mysql.operations` - Server-level view with server-specific metrics
- `mysql.db.operations` - Database-level view combining:
- `mysql.table.operations` - Detailed table-level metrics

This approach delivers:
The key insight: Pre-aggregation isn't just about performance - it's about creating a well-structured dashboard where each section has a clear purpose and tells a complete story.
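The fan-out described above can be sketched in a few lines. This is illustrative only, not Netdata's actual Prometheus importer:

```python
# Sketch: split one multi-level Prometheus-style metric into one view
# per hierarchy level, pre-aggregating once at collection time.
raw = [
    # labels and value of each mysql_operations{...} sample
    ({"server": "prod1", "database": "db1", "table": "users",  "operation": "read"},  1234),
    ({"server": "prod1", "database": "db1", "table": "orders", "operation": "read"},   567),
    ({"server": "prod1", "database": "db2", "table": "items",  "operation": "write"},   89),
]

def roll_up(rows, level):
    """Sum values per (instance-at-level, operation)."""
    out = {}
    for labels, value in rows:
        key = (labels[level], labels["operation"])
        out[key] = out.get(key, 0) + value
    return out

server_view   = roll_up(raw, "server")    # feeds mysql.operations
database_view = roll_up(raw, "database")  # feeds mysql.db.operations
table_view    = roll_up(raw, "table")     # feeds mysql.table.operations
# server_view == {("prod1", "read"): 1801, ("prod1", "write"): 89}
```

Each rolled-up view becomes its own context, so every chart's "Instances" dropdown lists only one kind of entity.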
This guide walks through the thought process of designing metrics for a new collector, using database servers (PostgreSQL) and application servers (WebSphere) as examples.
Question to answer: What are the major functional areas of this application that need monitoring?
Example - PostgreSQL:
Example - WebSphere:
Decision: Flat structure (<10 families) or Tree structure (>10 families)?
PostgreSQL - Flat Structure:
- connections
- queries
- databases
- tables
- replication
WebSphere - Tree Structure:
- `jvm/memory`
- `jvm/gc`
- `jvm/threads`
- `web/servlets`
- `web/sessions`
- `connections/jdbc`
- `connections/jms`
Rules:
For each family/subfamily:
Example - web/servlets:
- ✓ `servlet.requests` → Instance: each servlet
- ✓ `servlet.response_time` → Instance: each servlet
- ✓ `servlet.errors` → Instance: each servlet
- ? `servlet.total_count` → Instance: server (put first as summary)
- ✗ `session.count` → Wrong topic! (move to `web/sessions`)
For each group of related metrics with the same instance type:
Identify shared characteristics:
Create contexts:
Example - PostgreSQL Tables:
Context: postgres.table.operations
Instances: users_table, orders_table, products_table
Dimensions: select, insert, update, delete
Unit: operations/s
Title: "Table Operations"
Context: postgres.table.size
Instances: users_table, orders_table, products_table
Dimensions: size, indexes_size
Unit: bytes
Title: "Table Size"
Common mistake to avoid:
WRONG - Mixed dimensions per instance:
Context: postgres.table.operations
Instance: users_table → Dimensions: select, insert, update
Instance: orders_table → Dimensions: select, delete // Missing insert, update!
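One defensive way to avoid this mistake is to normalize every instance to the full dimension set at collection time, reporting zero for operations that did not occur. The helper below is a sketch, not a Netdata API:

```python
# Every instance in the context must expose the same dimensions.
DIMENSIONS = ["select", "insert", "update", "delete"]

def normalize(collected):
    """collected maps instance -> {dimension: value}, possibly with gaps;
    the result exposes all four dimensions for every instance."""
    return {
        instance: {dim: values.get(dim, 0) for dim in DIMENSIONS}
        for instance, values in collected.items()
    }

normalized = normalize({
    "users_table":  {"select": 120, "insert": 15, "update": 7},
    "orders_table": {"select": 80, "delete": 2},  # insert/update absent
})
# normalized["orders_table"] == {"select": 80, "insert": 0, "update": 0, "delete": 2}
```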
For each context, verify:
Summary metrics (different instance type):
Rate metrics:
Complex hierarchies:
Family: tables
1. postgres.table.count
Instance: server
Dimensions: total_tables
Unit: tables
Title: "Total Tables"
(Summary chart - goes first)
2. postgres.table.operations
Instances: each table
Dimensions: select, insert, update, delete
Unit: operations/s
Title: "Table Operations"
3. postgres.table.size
Instances: each table
Dimensions: data_size, indexes_size
Unit: bytes
Title: "Table Size"
4. postgres.table.maintenance
Instances: each table
Dimensions: vacuum_time, analyze_time
Unit: seconds
Title: "Table Maintenance"
Following this guide ensures your collector creates a coherent, navigable dashboard that tells clear stories about each aspect of the monitored application.
Netdata's highly efficient storage engine (0.5 bytes per sample on the high-resolution tier) is crucial for the NIDL framework's success. This efficiency allows Netdata to:
This means that the NIDL discipline of creating separate contexts for each level is not just about clarity; it's also a performance optimization. By doing the "heavy lifting" once at collection time, Netdata ensures fast dashboards and instant responses for users.
The NIDL framework is the backbone of Netdata's "no query language needed" philosophy. It empowers users with intuitive, interactive data exploration capabilities. However, this power comes with a critical responsibility for developers: to design metrics thoughtfully and adhere strictly to NIDL principles.
By embracing the NIDL framework, collector developers are not merely writing data collection code; they are designing the entire observability experience. Their discipline in defining coherent contexts, consistent instances, related dimensions, and meaningful labels directly translates into clear, actionable, and automatically generated dashboards that empower every Netdata user.