List of Metrics

There are two types of metrics in Alluxio: cluster-wide aggregated metrics and per-process detailed metrics.

  • Cluster metrics are collected and calculated by the leading master and displayed in the metrics tab of the web UI. These metrics are designed to provide a snapshot of the cluster state and the overall amount of data and metadata served by Alluxio.

  • Process metrics are collected by each Alluxio process and exposed in a machine-readable format through any configured sinks. Process metrics are highly detailed and are intended to be consumed by third-party monitoring tools. Users can then view fine-grained dashboards with time-series graphs of each metric, such as data transferred or the number of RPC invocations.

Metrics in Alluxio have the following format for master node metrics:

Master.[metricName].[tag1].[tag2]...

Metrics in Alluxio have the following format for non-master node metrics:

[processType].[metricName].[tag1].[tag2]...[hostName]

There is generally an Alluxio metric for every RPC invocation, to Alluxio or to the under store.

Tags are additional pieces of metadata for the metric such as user name or under storage location. Tags can be used to further filter or aggregate on various characteristics.
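As an illustration of the naming format above, the following minimal sketch splits a metric name into its parts. It assumes that every tag has the form `key:value`, that neither tag values nor the host name contain a literal dot, and that only non-master metrics end with a host name. The function `parse_metric_name` and the sample user and host names are hypothetical and are not part of Alluxio.

```python
def parse_metric_name(full_name):
    """Split a dotted metric name into process type, metric name, tags and host.

    Assumptions (not confirmed by Alluxio): tags look like "key:value" and
    no tag value or host name contains a literal dot.
    """
    parts = full_name.split(".")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"not a metric name: {full_name!r}")

    process_type, metric_name = parts[0], parts[1]
    rest = parts[2:]
    tags = {}
    host = None
    for index, segment in enumerate(rest):
        if ":" in segment:
            # A tag such as "User:alice"
            key, value = segment.split(":", 1)
            tags[key] = value
        elif index == len(rest) - 1 and process_type != "Master":
            # Non-master metrics end with the host name
            host = segment
        else:
            raise ValueError(f"unexpected segment {segment!r} in {full_name!r}")

    return {
        "process": process_type,
        "metric": metric_name,
        "tags": tags,
        "host": host,
    }
```

For example, `parse_metric_name("Worker.BytesReadDirect.User:alice.host-1")` yields the process type `Worker`, the metric name `BytesReadDirect`, the tag `User` with value `alice`, and the host `host-1`.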

Cluster Metrics

Workers and clients send metrics data to the Alluxio master through heartbeats. The heartbeat intervals are defined by the properties alluxio.master.worker.heartbeat.interval and alluxio.user.metrics.heartbeat.interval, respectively.

Bytes metrics are values aggregated from workers or clients. Bytes throughput metrics are calculated on the leading master: each throughput value equals the corresponding bytes counter value divided by the metrics record time, and is shown in bytes per minute.
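The throughput calculation can be sketched as follows. This is only an illustration of the arithmetic: the function name `bytes_per_minute` is hypothetical, and the record time is assumed to be given in seconds, which the text above does not specify.

```python
def bytes_per_minute(byte_count, record_seconds):
    """Return throughput in bytes per minute.

    byte_count:     value of a bytes counter
    record_seconds: how long the counter has been recording, in seconds
                    (the unit is an assumption made for this sketch)
    """
    if record_seconds <= 0:
        raise ValueError("record time must be positive")
    record_minutes = record_seconds / 60.0
    return byte_count / record_minutes
```

For instance, a counter that reached 600 bytes over 120 seconds of recording corresponds to 300 bytes per minute.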

<table class="table table-striped"> <tr><th>Name</th><th>Type</th><th>Description</th></tr> {% for item in site.data.table.cluster-metrics %} <tr> <td><a class="anchor" name="{{ item.metricName }}"></a> {{ item.metricName }}</td> <td>{{ item.metricType }}</td> <td>{{ site.data.table.en.cluster-metrics[item.metricName] }}</td> </tr> {% endfor %} </table>

Process Metrics

Metrics shared by all Alluxio server and client processes.

<table class="table table-striped"> <tr><th>Name</th><th>Type</th><th>Description</th></tr> {% for item in site.data.table.process-metrics %} <tr> <td><a class="anchor" name="{{ item.metricName }}"></a> {{ item.metricName }}</td> <td>{{ item.metricType }}</td> <td>{{ site.data.table.en.process-metrics[item.metricName] }}</td> </tr> {% endfor %} </table>

Server Metrics

Metrics shared by the Alluxio server processes.

<table class="table table-striped"> <tr><th>Name</th><th>Type</th><th>Description</th></tr> {% for item in site.data.table.server-metrics %} <tr> <td><a class="anchor" name="{{ item.metricName }}"></a> {{ item.metricName }}</td> <td>{{ item.metricType }}</td> <td>{{ site.data.table.en.server-metrics[item.metricName] }}</td> </tr> {% endfor %} </table>

Master Metrics

Default master metrics:

<table class="table table-striped"> <tr><th>Name</th><th>Type</th><th>Description</th></tr> {% for item in site.data.table.master-metrics %} <tr> <td><a class="anchor" name="{{ item.metricName }}"></a> {{ item.metricName }}</td> <td>{{ item.metricType }}</td> <td>{{ site.data.table.en.master-metrics[item.metricName] }}</td> </tr> {% endfor %} </table>

Dynamically generated master metrics:

| Metric Name | Description |
|---|---|
| Master.CapacityTotalTier{TIER_NAME} | Total capacity in tier {TIER_NAME} of the Alluxio file system in bytes |
| Master.CapacityUsedTier{TIER_NAME} | Used capacity in tier {TIER_NAME} of the Alluxio file system in bytes |
| Master.CapacityFreeTier{TIER_NAME} | Free capacity in tier {TIER_NAME} of the Alluxio file system in bytes |
| Master.UfsSessionCount-Ufs:{UFS_ADDRESS} | The total number of currently opened UFS sessions connected to the given {UFS_ADDRESS} |
| Master.{UFS_RPC_NAME}.UFS:{UFS_ADDRESS}.UFS_TYPE:{UFS_TYPE}.User:{USER} | The details of the UFS RPC operations performed by the current master |
| Master.PerUfsOp{UFS_RPC_NAME}.UFS:{UFS_ADDRESS} | The aggregated number of UFS {UFS_RPC_NAME} operations run on UFS {UFS_ADDRESS} by the leading master |
| Master.{LEADING_MASTER_RPC_NAME} | The duration statistics of RPC calls exposed on the leading master |

Worker Metrics

Default worker metrics:

<table class="table table-striped"> <tr><th>Name</th><th>Type</th><th>Description</th></tr> {% for item in site.data.table.worker-metrics %} <tr> <td><a class="anchor" name="{{ item.metricName }}"></a> {{ item.metricName }}</td> <td>{{ item.metricType }}</td> <td>{{ site.data.table.en.worker-metrics[item.metricName] }}</td> </tr> {% endfor %} </table>

Dynamically generated worker metrics:

| Metric Name | Description |
|---|---|
| Worker.UfsSessionCount-Ufs:{UFS_ADDRESS} | The total number of currently opened UFS sessions to connect to the given {UFS_ADDRESS} |
| Worker.{RPC_NAME} | The duration statistics of RPC calls exposed on workers |

Client Metrics

Each client metric is recorded with its local hostname, or with alluxio.user.app.id if that property is configured. If alluxio.user.app.id is configured, multiple clients can be combined into a logical application.

<table class="table table-striped"> <tr><th>Name</th><th>Type</th><th>Description</th></tr> {% for item in site.data.table.client-metrics %} <tr> <td><a class="anchor" name="{{ item.metricName }}"></a> {{ item.metricName }}</td> <td>{{ item.metricType }}</td> <td>{{ site.data.table.en.client-metrics[item.metricName] }}</td> </tr> {% endfor %} </table>

Fuse Metrics

Fuse is a long-running Alluxio client. Depending on how it is launched, Fuse metrics appear as:

  • client metrics when the Fuse client is launched in a standalone AlluxioFuse process.
  • worker metrics when the Fuse client is embedded in the AlluxioWorker process.

Fuse metrics include:

<table class="table table-striped"> <tr><th>Name</th><th>Type</th><th>Description</th></tr> {% for item in site.data.table.fuse-metrics %} <tr> <td><a class="anchor" name="{{ item.metricName }}"></a> {{ item.metricName }}</td> <td>{{ item.metricType }}</td> <td>{{ site.data.table.en.fuse-metrics[item.metricName] }}</td> </tr> {% endfor %} </table>

The Fuse reading/writing file counts can be used as indicators of Fuse application pressure. If a large number of concurrent reads/writes occur in a short period of time, each read/write operation may take longer to finish.

When a user or an application runs a filesystem command under the Fuse mount point, the operating system processes and translates the command, which triggers the related Fuse operations exposed in AlluxioFuse. The number of times each operation is called and the duration of each call are recorded dynamically under the metric name Fuse.<FUSE_OPERATION_NAME>.

The important Fuse metrics include:

| Metric Name | Description |
|---|---|
| Fuse.readdir | The duration metrics of listing a directory |
| Fuse.getattr | The duration metrics of getting the metadata of a file |
| Fuse.open | The duration metrics of opening a file for read or overwrite |
| Fuse.read | The duration metrics of reading a part of a file |
| Fuse.create | The duration metrics of creating a file for write |
| Fuse.write | The duration metrics of writing a file |
| Fuse.release | The duration metrics of closing a file after read or write. Note that release is async, so Fuse threads will not wait for release to finish |
| Fuse.mkdir | The duration metrics of creating a directory |
| Fuse.unlink | The duration metrics of removing a file or a directory |
| Fuse.rename | The duration metrics of renaming a file or a directory |
| Fuse.chmod | The duration metrics of modifying the mode of a file or a directory |
| Fuse.chown | The duration metrics of modifying the user and/or group ownership of a file or a directory |

Fuse related metrics include:

  • Client.TotalRPCClients shows the total number of RPC clients that are in use or can be used to connect to a master or worker for operations.
  • Worker metrics with the Direct keyword. When Fuse is embedded in the worker process, it can go through the worker's internal API to read from or write to this worker. The related metrics end with Direct. For example, Worker.BytesReadDirect shows how many bytes this worker serves to its embedded Fuse client for reads.
  • If alluxio.user.block.read.metrics.enabled=true is configured, Client.BlockReadChunkRemote will be recorded. This metric shows the duration statistics of reading data from remote workers via gRPC.

Client.TotalRPCClients and Fuse.TotalCalls are good indicators of the current load of Fuse applications. If applications (e.g. TensorFlow) are running on top of Alluxio Fuse but these two metrics show much lower values than before, the training job may be stuck in Alluxio.
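One way to turn this observation into an automated check is sketched below. It compares a current sample of the two metrics against an earlier baseline sample. The function `may_be_stuck`, the dictionary layout of the samples, and the default `drop_ratio` of 0.5 are all hypothetical choices for illustration; a suitable threshold depends on the workload.

```python
WATCHED = ("Client.TotalRPCClients", "Fuse.TotalCalls")


def may_be_stuck(baseline, current, drop_ratio=0.5):
    """Return True when every watched metric fell below drop_ratio * baseline.

    baseline, current: dicts mapping metric name to a sampled value.
    drop_ratio: hypothetical threshold; 0.5 means "less than half of before".
    """
    for name in WATCHED:
        before = baseline[name]
        now = current.get(name, 0)
        # A non-positive baseline gives nothing to compare against, and a
        # value that stayed at or above the threshold means no large drop.
        if before <= 0 or now >= before * drop_ratio:
            return False
    return True
```

With a baseline of 40 RPC clients and 1000 calls, a current sample of 4 RPC clients and 50 calls would be flagged, while a sample where only one of the two metrics dropped would not.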

Process Common Metrics

The following metrics are collected on each instance (Master, Worker or Client).

JVM Attributes

| Metric Name | Description |
|---|---|
| name | The name of the JVM |
| uptime | The uptime of the JVM |
| vendor | The current JVM vendor |

Garbage Collector Statistics

| Metric Name | Description |
|---|---|
| PS-MarkSweep.count | Total number of mark and sweep |
| PS-MarkSweep.time | The time used to mark and sweep |
| PS-Scavenge.count | Total number of scavenge |
| PS-Scavenge.time | The time used to scavenge |

Memory Usage

Alluxio provides overall and detailed memory usage information. Detailed memory usage information of code cache, compressed class space, metaspace, PS Eden space, PS old gen, and PS survivor space is collected in each process.

A subset of the memory usage metrics is listed below:

| Metric Name | Description |
|---|---|
| total.committed | The amount of memory in bytes that is guaranteed to be available for use by the JVM |
| total.init | The amount of memory in bytes that is available for use by the JVM |
| total.max | The maximum amount of memory in bytes that is available for use by the JVM |
| total.used | The amount of memory currently used in bytes |
| heap.committed | The amount of memory from heap area guaranteed to be available |
| heap.init | The amount of memory from heap area available at initialization |
| heap.max | The maximum amount of memory from heap area that is available |
| heap.usage | The amount of memory from heap area currently used in GB |
| heap.used | The amount of memory from heap area that has been used |
| pools.Code-Cache.used | Used memory of collection usage from the pool from which memory is used for compilation and storage of native code |
| pools.Compressed-Class-Space.used | Used memory of collection usage from the pool from which memory is used for class metadata |
| pools.PS-Eden-Space.used | Used memory of collection usage from the pool from which memory is initially allocated for most objects |
| pools.PS-Survivor-Space.used | Used memory of collection usage from the pool containing objects that have survived the garbage collection of the Eden space |
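These values can be combined into derived figures. The sketch below computes the fraction of the maximum heap that is in use from heap.used and heap.max. It assumes that both values are reported in the same unit and have already been collected into a dictionary keyed by metric name; the function `heap_utilization` is hypothetical. The guard for a non-positive maximum is there because a JVM may report an undefined maximum as -1.

```python
def heap_utilization(metrics):
    """Return heap.used / heap.max as a fraction, or None if max is undefined.

    metrics: dict mapping metric name to value, with heap.used and heap.max
             assumed to be in the same unit.
    """
    used = metrics["heap.used"]
    maximum = metrics["heap.max"]
    if maximum <= 0:
        # An undefined maximum (for example -1) makes the ratio meaningless
        return None
    return used / maximum
```

A process reporting heap.used of 512 and heap.max of 2048 has a utilization of 0.25.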

ClassLoading Statistics

| Metric Name | Description |
|---|---|
| loaded | The total number of classes loaded |
| unloaded | The total number of unloaded classes |

Thread Statistics

| Metric Name | Description |
|---|---|
| count | The current number of live threads |
| daemon.count | The current number of live daemon threads |
| peak.count | The peak live thread count |
| total_started.count | The total number of threads started |
| deadlock.count | The number of deadlocked threads |
| deadlock | The call stack of each thread involved in a deadlock |
| new.count | The number of threads with new state |
| blocked.count | The number of threads with blocked state |
| runnable.count | The number of threads with runnable state |
| terminated.count | The number of threads with terminated state |
| timed_waiting.count | The number of threads with timed_waiting state |