content/platform/monitoring/influxdata-platform/tools/measurements-internal.md
By default, InfluxDB generates internal metrics and saves them to the _internal database.
Use these metrics to monitor InfluxDB and InfluxDB Enterprise and to create alerts to notify you when problems arise.
_internal database in production
InfluxData does not recommend using the _internal database in a production cluster.
It creates unnecessary overhead, particularly for busy clusters, that can overload an already loaded cluster.
Metrics stored in the _internal database primarily measure workload performance
and should only be tested in non-production environments.
To disable the _internal database, set store-enabled
to false under the [monitor] section of your InfluxDB configuration file.
# ...
[monitor]
# ...
# Whether to record statistics internally.
store-enabled = false
# ...
To monitor InfluxDB _internal metrics in a production cluster, use Telegraf
and the influxdb input plugin
to capture these metrics from the InfluxDB /debug/vars endpoint and store them
in an external InfluxDB monitoring instance.
For more information, see Configure a Watcher of Watchers.
{{% note %}}
When using the "Watcher of Watchers (WoW)" configuration, InfluxDB
metric field keys are prepended with influxdb_, but are otherwise identical
to those listed below.
{{% /note %}}
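To get a feel for what the /debug/vars endpoint exposes before wiring up Telegraf, the following minimal sketch groups a response by measurement name. The sample payload and its shape (entries holding `name`, `tags`, and `values`) are assumptions for illustration, and `group_by_measurement` is a hypothetical helper, not part of InfluxDB or Telegraf.

```python
import json

# Hypothetical, trimmed sample of a /debug/vars response.
# The entry shape (name / tags / values) is an assumption for this sketch.
SAMPLE = """
{
  "httpd::8086": {
    "name": "httpd",
    "tags": {"bind": ":8086"},
    "values": {"req": 42, "writeReq": 30}
  },
  "database:_internal": {
    "name": "database",
    "tags": {"database": "_internal"},
    "values": {"numMeasurements": 12, "numSeries": 340}
  },
  "cmdline": ["influxd"]
}
"""


def group_by_measurement(payload: str) -> dict:
    """Collect the fields of each statistics entry under its measurement name."""
    grouped = {}
    for entry in json.loads(payload).values():
        # Skip entries (such as "cmdline") that are not statistics objects.
        if not isinstance(entry, dict) or "name" not in entry or "values" not in entry:
            continue
        grouped.setdefault(entry["name"], []).append(
            {"tags": entry.get("tags", {}), "fields": entry["values"]}
        )
    return grouped


stats = group_by_measurement(SAMPLE)
print(sorted(stats))  # measurement names found in the sample
```

In a real deployment, Telegraf performs this collection for you; the sketch is only meant to show how measurements, tags, and fields relate in the raw output.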
Use the InfluxDB OSS Monitor dashboard
or the InfluxDB Enterprise Monitor dashboard
to visualize InfluxDB _internal metrics.
{{% truncate %}}
Measurement statistics related to the Anti-Entropy (AE) engine in InfluxDB Enterprise clusters.
The number of bytes received by the data node.
The total number of anti-entropy jobs that have resulted in errors.
The total number of jobs executed by the data node.
The number of active (currently executing) jobs.
The cluster measurement tracks statistics related to the clustering features of the data nodes in InfluxDB Enterprise.
The tags on the series indicate the source host of the stat.
The number of internal requests made to copy a shard from one data node to another.
The number of read requests from other data nodes in the cluster.
The number of remote node requests made to find measurements on this node that match a particular regular expression. Indicates a SELECT from a regex initiated on a different data node, which then sent an internal request to this node. There is not currently a statistic tracking how many queries with a regex, instead of a fixed measurement, were initiated on a particular node.
The number of remote node requests for information about the fields and associated types, and tag keys of measurements on this data node.
The number of internal requests for iterator cost.
Tracks the number of open connections being handled by the data node (including counting logical connections multiplexed onto a single yamux connection).
The number of internal requests to delete a shard from this data node.
Exclusively incremented by use of the influxd-ctl remove shard command.
The total number of internal write requests from a remote node that failed.
It's the cousin of InfluxDB shard stat writeReqErr.
A write request over HTTP is received by Node A. Node A does not have the shard locally,
so it creates an internal request to Node B instructing what to write and to which shard.
If Node B receives the request and anything goes wrong, Node B increments its own writeShardFail.
Depending on what went wrong, in most circumstances Node B would also increment its writeReqErr stat inherited from InfluxDB OSS.
If Node A had the shard locally, there would be no internal request to write data
to a remote node, so writeShardFail would not be incremented.
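The Node A / Node B scenario can be modeled as a toy simulation. This is a sketch under stated assumptions: the `Node` class and `http_write` function are hypothetical, and the model always increments writeReqErr alongside writeShardFail, although the text says this only happens in most circumstances.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy data node that tracks the two counters discussed above."""
    name: str
    local_shards: set = field(default_factory=set)
    writeShardFail: int = 0
    writeReqErr: int = 0

    def handle_internal_write(self, shard_id: int, fails: bool) -> bool:
        """Handle an internal write request sent by another node."""
        if fails:
            self.writeShardFail += 1
            # Simplification: the real node increments this "in most circumstances".
            self.writeReqErr += 1
            return False
        return True


def http_write(receiver: Node, owner: Node, shard_id: int, remote_fails: bool = False) -> bool:
    """Write received over HTTP by `receiver`; forwarded to `owner` if the shard is not local."""
    if shard_id in receiver.local_shards:
        return True  # local write: no internal request, writeShardFail is untouched
    return owner.handle_internal_write(shard_id, remote_fails)


node_a = Node("A", {1})
node_b = Node("B", {2})
http_write(node_a, node_b, shard_id=2, remote_fails=True)  # forwarded to B and fails there
http_write(node_a, node_b, shard_id=1, remote_fails=True)  # served locally by A
print(node_b.writeShardFail, node_b.writeReqErr)
```

Only the forwarded write touches Node B's counters; the local write never creates an internal request.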
The number of points in every internal write request from any remote node, regardless of success.
The number of internal write requests from a remote data node, regardless of success.
The measurement statistics related to continuous queries (CQs).
The total number of continuous queries that executed but failed.
The total number of continuous queries that executed successfully. Note that this value may be incremented in some cases where a CQ is initiated but does not actually run, for example, due to misconfigured resample intervals.
The current number of measurements in the specified database.
The series cardinality values are estimates, based on HyperLogLog++ (HLL++). The numbers returned by the estimates when there are thousands or millions of measurements or series should be accurate within a relatively small margin of error.
The current series cardinality of the specified database. The series cardinality values are estimates, based on HyperLogLog++ (HLL++). The numbers returned by the estimates when there are thousands or millions of measurements or series should be accurate within a relatively small margin of error.
The hh measurement statistics track events resulting in new hinted handoff (HH)
processors in InfluxDB Enterprise.
The hh measurement has one additional tag:
path - The path to the durable hinted handoff queue on disk.
The number of initial write requests handled by the hinted handoff engine for a remote node.
Subsequent write requests to this node, destined for the same remote node, do not increment this statistic.
This statistic resets to 0 upon restart of influxd, regardless of the state
the last time the process was alive. It is incremented when the HH "supersystem"
is instructed to enqueue a write for the node, the "subsystem" for the destination
node doesn't exist and has to be created, and the "subsystem" is created successfully.
If HH files are on disk for a remote node at process startup, the branch that
increments this stat will not be reached.
The number of write requests for each point in the initial request to the hinted handoff engine for a remote node.
The hh_database measurement aggregates all hinted handoff queues for a single database and node.
This allows accurate reporting of total queue size for a single database to a target node.
The hh_database measurement has two additional tags:
db - The name of the database
node - The node identifier
The size, in bytes, of points read from the hinted handoff queue and sent to its destination data node.
Note that if the data node process is restarted while there is data in the HH queue,
bytesRead may settle to a number larger than bytesWritten.
Hinted handoff writes occur in concurrent batches as determined by the
retry-concurrency setting.
If an individual write succeeds, the metric is incremented.
If any write out of the whole batch fails, the entire batch is considered unsuccessful,
and every part of the batch will be retried later. This was not the intended behavior of this stat.
The other situation where bytesRead could be larger would be after a restart of the process.
Say at startup there were 1000 bytes still enqueued in HH from the previous run of the process.
Immediately after a restart, both bytesRead and bytesWritten are set to zero.
Assuming HH is properly depleted, and no future writes require HH, then the stats will read 1000 bytes read and 0 bytes written.
{{% note %}} Resets to zero after crash or restart, even if the HH queue was non-empty. {{% /note %}}
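The restart scenario above (1000 bytes left in the queue) can be traced with a small model. This is an illustrative sketch, not InfluxDB code: `HintedHandoffStats` is a hypothetical class in which the queue contents persist on disk while the two counters live only in process memory.

```python
class HintedHandoffStats:
    """Toy model: queue bytes persist on disk, counters reset with the process."""

    def __init__(self, bytes_on_disk: int = 0):
        self.queue_bytes = bytes_on_disk  # survives restarts
        self.bytesRead = 0                # reset on every process start
        self.bytesWritten = 0             # reset on every process start

    def enqueue(self, n: int) -> None:
        """Write n bytes into the queue."""
        self.queue_bytes += n
        self.bytesWritten += n

    def drain(self, n: int) -> int:
        """Read up to n bytes from the queue and send them to the destination."""
        sent = min(n, self.queue_bytes)
        self.queue_bytes -= sent
        self.bytesRead += sent
        return sent

    def restart(self) -> "HintedHandoffStats":
        """Simulate a process restart: counters are zeroed, queued data remains."""
        return HintedHandoffStats(bytes_on_disk=self.queue_bytes)


before = HintedHandoffStats()
before.enqueue(1000)       # 1000 bytes queued by the previous run
after = before.restart()   # both counters are back at zero
after.drain(1000)          # the queue is fully depleted after the restart
print(after.bytesRead, after.bytesWritten)
```

After the restart, the model ends with 1000 bytes read and 0 bytes written, matching the description.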
The total number of bytes written to the hinted handoff queue. Note that this statistic only tracks bytes written during the lifecycle of the current process. Upon restart or a crash, this statistic resets to zero, even if the hinted handoff queue was not empty.
The total number of bytes remaining in the hinted handoff queue. This statistic should accurately and absolutely track the number of bytes of encoded data waiting to be sent to the remote node.
This statistic should remain correct across restarts, unlike bytesRead and bytesWritten (see #780).
The total number of segments in the hinted handoff queue. The HH queue is a sequence of 10MB "segment" files.
This is a coarse-grained statistic that roughly represents the amount of data queued for a remote node.
The queueDepth values can give you a sense of when a queue is growing or shrinking.
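As a rough illustration of reading queueDepth over time, the sketch below classifies a series of samples and converts a depth into an approximate upper bound on queued bytes. The helper names are hypothetical, and treating "10MB" as 10 MiB is an assumption.

```python
SEGMENT_BYTES = 10 * 1024 * 1024  # assumption: a "10MB" segment is 10 MiB


def queue_trend(depth_samples: list) -> str:
    """Compare the first and last queueDepth samples in a window."""
    if len(depth_samples) < 2:
        raise ValueError("need at least two samples")
    delta = depth_samples[-1] - depth_samples[0]
    if delta > 0:
        return "growing"
    if delta < 0:
        return "shrinking"
    return "steady"


def approx_queued_bytes(queue_depth: int) -> int:
    """Coarse upper bound: every segment is assumed to be full."""
    return queue_depth * SEGMENT_BYTES


print(queue_trend([2, 3, 5]), approx_queued_bytes(5))
```

Because segments are rarely all full, the byte figure is only a ceiling; queueBytes is the accurate measure.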
The number of writes blocked because the number of concurrent HH requests exceeds the limit.
The number of writes dropped from the HH queue because the write appeared to be corrupted.
The total number of write requests that succeeded in writing a batch to the destination node.
The total number of write requests that failed in writing a batch of data from the hinted handoff queue to the destination node.
The total number of points successfully written from the HH queue to the destination node fr
The total number of write batch requests enqueued into the hinted handoff queue.
The total number of points enqueued into the hinted handoff queue.
Available in InfluxDB Enterprise 1.9.8 and later.
The hh_node measurement stores hinted handoff statistics for all queues (shards) for a given node.
The hh_node measurement has one additional tag:
node - The destination node for the recorded metrics.
Total bytes of disk space used by all hinted handoff queues for a single node. Tracks the disk usage of all hinted handoff queues for a given node (not the bytes waiting to be processed). Due to the implementation of the hinted handoff queue, a lag occurs between when bytes are processed and when they're removed from the disk.
queueTotalSize is used to determine when a node's hinted handoff queue has reached the
maximum size configured in the hinted-handoff max-size parameter.
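One way to use queueTotalSize for alerting is to compare it with the configured max-size. The following is a minimal sketch: the helper names, the 10 GiB limit, and the 80% threshold are all assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def hh_queue_utilization(queue_total_size: int, max_size: int) -> float:
    """Fraction of the configured max-size currently used on disk."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return queue_total_size / max_size


def near_limit(queue_total_size: int, max_size: int, threshold: float = 0.8) -> bool:
    """True when disk usage has reached the (hypothetical) alert threshold."""
    return hh_queue_utilization(queue_total_size, max_size) >= threshold


max_size = 10 * GIB  # hypothetical max-size value
print(hh_queue_utilization(5 * GIB, max_size), near_limit(9 * GIB, max_size))
```

Because of the lag between processing and removal from disk, a high value here does not necessarily mean the same number of bytes is still waiting to be sent.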
The hh_processor measurement stores statistics for a single queue (shard).
In InfluxDB Enterprise, there is a hinted handoff processor on each data node.
The hh_processor measurement has two additional tags:
node - The destination node for the recorded metrics.
path - The path to the durable hinted handoff queue on disk.
{{% note %}}
The hh_processor statistics against a host are only accurate for the lifecycle of the current process.
If the process crashes or restarts, bytesRead and bytesWritten are reset to zero, even if the HH queue was non-empty.
{{% /note %}}
The size, in bytes, of points read from the hinted handoff queue and sent to its destination data node.
Note that if the data node process is restarted while there is data in the HH queue,
bytesRead may settle to a number larger than bytesWritten.
Hinted handoff writes occur in concurrent batches as determined by the
retry-concurrency setting.
If an individual write succeeds, the metric is incremented.
If any write out of the whole batch fails, the entire batch is considered unsuccessful,
and every part of the batch will be retried later.
This was not the intended behavior of this stat.
The other situation where bytesRead could be larger would be after a restart of the process.
Say at startup there were 1000 bytes still enqueued in HH from the previous run of the process.
Immediately after a restart, both bytesRead and bytesWritten are set to zero.
Assuming HH is properly depleted, and no future writes require HH, then the stats
will read 1000 bytes read and 0 bytes written.
{{% note %}} Resets to zero after crash or restart, even if the HH queue was non-empty. {{% /note %}}
The total number of bytes written to the hinted handoff queue. Note that this statistic only tracks bytes written during the lifecycle of the current process. Upon restart or a crash, this statistic resets to zero, even if the hinted handoff queue was not empty.
The total number of bytes remaining in the hinted handoff queue. This statistic should accurately and absolutely track the number of bytes of encoded data waiting to be sent to the remote node.
This statistic should remain correct across restarts, unlike bytesRead and bytesWritten
(see #780).
The total number of segments in the hinted handoff queue. The HH queue is a sequence of 10MB "segment" files.
This is a coarse-grained statistic that roughly represents the amount of data queued for a remote node.
The queueDepth values can give you a sense of when a queue is growing or shrinking.
The number of writes blocked because the number of concurrent HH requests exceeds the limit.
The number of writes dropped from the HH queue because the write appeared to be corrupted.
The total number of write requests that succeeded in writing a batch to the destination node.
The total number of write requests that failed in writing a batch of data from the hinted handoff queue to the destination node.
The total number of points successfully written from the HH queue to the destination node fr
The total number of write batch requests enqueued into the hinted handoff queue.
The total number of points enqueued into the hinted handoff queue.
The httpd measurement stores fields related to the InfluxDB HTTP server.
The number of HTTP requests that were aborted due to authentication being required, but not supplied or incorrect.
The number of HTTP responses due to client errors, with a 4XX HTTP status code.
The number of Flux query requests served.
The duration (wall-time), in nanoseconds, spent executing Flux query requests.
The sum of all bytes returned in Flux query responses.
The number of times the InfluxDB HTTP server served the /ping HTTP endpoint.
The number of points dropped by the storage engine.
The number of points accepted by the HTTP /write endpoint, but unable to be persisted.
The number of points successfully accepted and persisted by the HTTP /write endpoint.
The number of read requests to the Prometheus /read endpoint.
The number of write requests to the Prometheus /write endpoint.
The number of query requests.
The total query request duration, in nanoseconds (ns).
The total number of bytes returned in query responses.
The total number of panics recovered by the HTTP handler.
The total number of HTTP requests served.
The number of currently active requests.
The duration (wall time), in nanoseconds, spent inside HTTP requests.
The number of HTTP responses due to server errors.
The number of status requests served using the HTTP /status endpoint.
The number of values (fields) successfully accepted and persisted by the HTTP /write endpoint.
The number of write requests served using the HTTP /write endpoint.
The number of currently active write requests.
The total number of bytes of line protocol data received by write requests, using the HTTP /write endpoint.
The duration (wall time), in nanoseconds, of write requests served using the /write HTTP endpoint.
The queryExecutor statistics are related to the usage of the Query Executor of the InfluxDB engine.
The number of active queries currently being handled.
The number of queries executed (started).
The number of queries that have finished executing.
The duration (wall time), in nanoseconds, of every query executed. If one query took 1000 ns from start to finish, and another query took 500 ns from start to finish and ran before the first query finished, the statistic would increase by 1500.
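The accumulation described above can be checked with a few lines of arithmetic. This sketch assumes each query is represented as a hypothetical `(start_ns, end_ns)` pair; overlapping queries are counted in full, not deduplicated.

```python
def total_query_duration_ns(queries) -> int:
    """Sum the wall-time duration of every query; overlaps are counted twice."""
    return sum(end - start for start, end in queries)


# The first query runs for 1000 ns; the second runs for 500 ns
# and finishes while the first is still executing.
queries = [(0, 1000), (200, 700)]
total = total_query_duration_ns(queries)
print(total)
```

The total is 1500 ns even though only 1000 ns of wall-clock time elapsed, so this statistic can grow faster than real time under concurrent load.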
The number of panics recovered by the Query Executor.
The rpc measurement statistics are related to the use of RPC calls within InfluxDB Enterprise clusters.
The number of idle multiplexed streams across all live TCP connections.
The current number of live TCP connections to other nodes.
The current number of live multiplexed streams across all live TCP connections.
The total number of RPC calls made to remote nodes.
The total number of RPC failures, which are RPCs that did not recover.
The total number of RPC bytes read.
The total number of RPC calls that retried at least once.
The total number of RPC bytes written.
The total number of single-use connections opened using Dial.
The number of single-use connections currently open.
The total number of TCP connections that have been established.
The total number of streams established.
The runtime measurement statistics include a subset of MemStats records, which hold statistics about the Go memory allocator.
The runtime statistics can be useful to determine poor memory allocation strategies and related performance issues.
The Go runtime package contains operations that interact with Go's runtime system, including functions used to control goroutines. It also includes the low-level type information used by the Go reflect package.
The currently allocated number of bytes of heap objects.
The cumulative number of freed heap objects.
The size, in bytes, of all heap objects.
The number of bytes of idle heap objects.
The number of bytes in in-use spans.
The number of allocated heap objects.
The number of bytes of physical memory returned to the OS.
The number of bytes of heap memory obtained from the OS. Measures the amount of virtual address space reserved for the heap.
The number of pointer lookups performed by the runtime. Primarily useful for debugging runtime internals.
The total number of heap objects allocated. The total number of live objects is Mallocs - Frees.
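Go's runtime.MemStats documentation defines the number of live heap objects as the cumulative allocations minus the cumulative frees. A minimal sketch of that arithmetic follows; the sample values are made up, and `live_heap_objects` is a hypothetical helper.

```python
def live_heap_objects(mallocs: int, frees: int) -> int:
    """Live objects = cumulative allocations minus cumulative frees."""
    if frees > mallocs:
        raise ValueError("frees cannot exceed mallocs")
    return mallocs - frees


# Made-up sample values for the Mallocs and Frees fields.
sample = {"Mallocs": 1_250_000, "Frees": 1_100_000}
live = live_heap_objects(sample["Mallocs"], sample["Frees"])
print(live)
```

A live-object count that climbs steadily across samples can hint at a leak or a growing working set.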
The number of completed GC (garbage collection) cycles.
The total number of goroutines.
The cumulative duration, in nanoseconds, of GC (garbage collection) pauses.
The total number of bytes of memory obtained from the OS. Measures the virtual address space reserved by the Go runtime for the heap, stacks, and other internal data structures.
The total number of bytes allocated for heap objects. This statistic does not decrease when objects are freed.
The shard measurement statistics are related to working with shards in InfluxDB OSS and InfluxDB Enterprise.
The size, in bytes, of the shard, including the size of the data directory and the WAL directory.
The number of fields created.
The type of index: inmem or tsi1.
The number of series created.
The number of bytes written to the shard.
The number of requests to write points that were dropped from a write.
Also, httpd.pointsWrittenDropped is incremented when a point is dropped from a write
(see #780).
The number of requests to write points that failed to be written due to errors.
The number of points written successfully.
The total number of write requests.
The total number of write requests that failed due to errors.
The total number of successful write requests.
The subscriber measurement statistics are related to the usage of InfluxDB subscriptions.
The number of subscriptions that failed to be created.
The total number of points that were successfully written to subscribers.
The total number of batches that failed to be written to subscribers.
The tsm1_cache measurement statistics are related to the usage of the TSM cache.
The following query example calculates various useful measurements related to the TSM cache.
SELECT
max(cacheAgeMs) / 1000.000 AS CacheAgeSeconds,
max(memBytes) AS MaxMemBytes, max(diskBytes) AS MaxDiskBytes,
max(snapshotCount) AS MaxSnapShotCount,
(last(cachedBytes) - first(cachedBytes)) / (last(WALCompactionTimeMs) - first(WALCompactionTimeMs)) * 1000.000 AS CompactedBytesPerSecond,
last(cachedBytes) AS CachedBytes,
(last(cachedBytes) - first(cachedBytes))/300 AS CacheThroughputBytesPerSecond
FROM _internal.monitor.tsm1_cache
WHERE time > now() - 1h
GROUP BY time(5m), path
The duration, in milliseconds, since the cache was last snapshotted at sample time. This statistic indicates how busy the cache is. Large numbers indicate a cache which is idle with respect to writes.
The total number of bytes that have been written into snapshots.
This statistic is updated during the creation of a snapshot.
The purpose of this statistic is to allow calculation of cache throughput between any two instances of time.
The ratio of the difference between two samples of this statistic divided by the
interval separating the samples is a measure of the cache throughput (more accurately,
the rate at which data is being snapshotted). When combined with the diskBytes
and memBytes statistics, it can also be used to calculate the rate at which data
is entering the cache and the rate at which it is being purged from the cache.
If the entry rate exceeds the exit rate for a sustained period of time,
there is an issue that needs to be addressed.
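The throughput calculation described above is a difference of two cachedBytes samples divided by the time between them. A minimal sketch, with made-up sample values and a hypothetical helper name:

```python
def cache_throughput_bytes_per_s(first_cached: int, last_cached: int, interval_s: float) -> float:
    """Rate at which data is being snapshotted between two cachedBytes samples."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (last_cached - first_cached) / interval_s


# Two made-up cachedBytes samples taken 5 minutes (300 s) apart.
throughput = cache_throughput_bytes_per_s(3_000_000, 6_000_000, 300)
print(throughput)
```

This matches a 5-minute grouping window, where the difference between the last and first samples is divided by 300 seconds.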
The size, in bytes, of on-disk snapshots.
The size, in bytes, of in-memory cache.
The current level (number) of active snapshots. In a healthy system, this number should be between 0 and 1. A system experiencing transient write errors might expect to see this number rise.
The duration, in milliseconds, that the commit lock is held while compacting snapshots.
The expression (cachedBytes - diskBytes) / WALCompactionTimeMs provides an indication
of how fast the WAL logs are being committed to TSM files.
The ratio of the difference between the start and end WALCompactionTimeMs values
for an interval divided by the length of the interval provides an indication of
how much of the maximum cache throughput is being consumed.
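The interval ratio can be computed from two samples of the commit-lock duration. This is a sketch with made-up values; `commit_lock_fraction` is a hypothetical helper, and both arguments are assumed to be in milliseconds.

```python
def commit_lock_fraction(start_ms: int, end_ms: int, interval_ms: int) -> float:
    """Share of the interval during which the commit lock was held."""
    if interval_ms <= 0:
        raise ValueError("interval must be positive")
    return (end_ms - start_ms) / interval_ms


# Made-up samples: the lock was held for 1500 ms during a 5-minute interval.
fraction = commit_lock_fraction(start_ms=12_000, end_ms=13_500, interval_ms=300_000)
print(fraction)
```

A value approaching 1.0 would mean the lock is held for nearly the whole interval, leaving little headroom for additional cache throughput.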
The total number of writes dropped due to timeouts.
The total number of writes that failed.
The total number of successful writes.
The tsm1_engine measurement statistics are related to the usage of a TSM storage
engine with compressed blocks.
The duration (wall time), in nanoseconds, spent in cache compactions.
The number of cache compactions that have failed due to errors.
The total number of cache compactions that have ever run.
The number of cache compactions that are currently running.
The duration (wall time), in nanoseconds, spent in full compactions.
The total number of TSM full compactions that have failed due to errors.
The current number of pending TSM full compactions.
The total number of TSM full compactions that have ever run.
The number of TSM full compactions currently running.
The duration (wall time), in nanoseconds, spent in TSM level 1 compactions.
The total number of TSM level 1 compactions that have failed due to errors.
The current number of pending TSM level 1 compactions.
The total number of TSM level 1 compactions that have ever run.
The number of TSM level 1 compactions that are currently running.
The duration (wall time), in nanoseconds, spent in TSM level 2 compactions.
The number of TSM level 2 compactions that have failed due to errors.
The current number of pending TSM level 2 compactions.
The total number of TSM level 2 compactions that have ever run.
The number of TSM level 2 compactions that are currently running.
The duration (wall time), in nanoseconds, spent in TSM level 3 compactions.
The number of TSM level 3 compactions that have failed due to errors.
The current number of pending TSM level 3 compactions.
The total number of TSM level 3 compactions that have ever run.
The number of TSM level 3 compactions that are currently running.
The duration (wall time), in nanoseconds, spent during TSM optimize compactions.
The total number of TSM optimize compactions that have failed due to errors.
The current number of pending TSM optimize compactions.
The total number of TSM optimize compactions that have ever run.
The number of TSM optimize compactions that are currently running.
The tsm1_filestore measurement statistics are related to the usage of the TSM file store.
The size, in bytes, of disk usage by the TSM file store.
The total number of files in the TSM file store.
The tsm1_wal measurement statistics are related to the usage of the TSM Write Ahead Log (WAL).
The current size, in bytes, of the segment disk.
The size, in bytes, of the segment disk.
The number of writes that failed due to errors.
The number of writes that succeeded.
The write measurement statistics are about writes to the data node, regardless of the source of the write.
The total number of points requested to be written to this data node.
Incoming writes have to make it through a couple of checks before reaching this
point (points parse correctly, correct authentication provided, etc.).
After these checks, this statistic should be incremented regardless of source
(HTTP, UDP, _internal stats, OpenTSDB plugin, etc.).
The total number of points received for write by this node and then enqueued into hinted handoff for the destination node.
The total number of point requests that have been attempted to be written into a shard on the same (local) node.
The total number of points received for write by this node but needed to be forwarded into a shard on a remote node.
The pointReqRemote statistic is incremented immediately before the remote write attempt,
which only happens if HH doesn't exist for that node.
If the write attempt fails, InfluxDB checks again whether HH exists and, if so, adds the point to HH instead.
This statistic does not distinguish between requests that are directly written to
the destination node versus enqueued into the hinted handoff queue for the destination node.
Number of points written to the HTTP /write endpoint and persisted successfully.
The total number of batches of points requested to be written to this node.
The total number of batches of points that failed to be sent to the subscription dispatcher.
The total number of batches of points that were successfully sent to the subscription dispatcher.
Number of values (fields) written to the HTTP /write endpoint and persisted successfully.
The total number of write requests for points that have been dropped due to timestamps not matching any existing retention policies.
The total number of batches of points that were not successfully written, due to a failure to write to a local or remote shard.
The total number of batches of points written at the requested consistency level.
The total number of batches of points written to at least one node but did not meet the requested consistency level.
The total number of write requests that failed to complete within the default write timeout duration.
This could indicate severely reduced or contentious disk I/O or a congested network to a remote node.
For a single write request that comes in over HTTP or another input method, writeTimeout
will be incremented by 1 if the entire batch is not written within the timeout period,
regardless of whether the points within the batch can be written locally or remotely.