doc/radosgw/metrics.rst
The Ceph Object Gateway uses :ref:`Perf Counters` to track metrics. The counters can be labeled (:ref:`Labeled Perf Counters`). When counters are labeled, they are stored in Ceph Object Gateway-specific caches.
These metrics can be sent to the time series database Prometheus to visualize a cluster-wide view of usage data (for example, number of S3 put operations on a specific bucket) over time.
.. contents::
The following metrics related to S3 or Swift operations are tracked per Ceph Object Gateway.
.. list-table:: Ceph Object Gateway Op Metrics
   :widths: 25 25 75
   :header-rows: 1
There are three different sections in the output of the ``counter dump`` and ``counter schema`` commands that show the op metrics and their information.
The sections are ``rgw_op``, ``rgw_op_per_user``, and ``rgw_op_per_bucket``.
The counters in the ``rgw_op`` section reflect the totals of each op metric for a given Ceph Object Gateway.
The counters in the ``rgw_op_per_user`` and ``rgw_op_per_bucket`` sections are labeled counters of op metrics for a user or bucket respectively.
Information about op metrics can be seen in the ``rgw_op`` sections of the output of the ``counter schema`` command.
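Both commands can be issued through the Ceph Object Gateway's admin socket. A minimal sketch, assuming a default admin socket location (the daemon name and socket path shown here are illustrative and depend on your deployment)::

    ceph daemon /var/run/ceph/ceph-client.rgw.<gateway-name>.asok counter schema
    ceph daemon /var/run/ceph/ceph-client.rgw.<gateway-name>.asok counter dump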
To view op metrics in the Ceph Object Gateway, go to the ``rgw_op`` sections of the output of the ``counter dump`` command::
"rgw_op": [
{
"labels": {},
"counters": {
"put_obj_ops": 2,
"put_obj_bytes": 5327,
"put_obj_lat": {
"avgcount": 2,
"sum": 2.818064835,
"avgtime": 1.409032417
},
"get_obj_ops": 5,
"get_obj_bytes": 5325,
"get_obj_lat": {
"avgcount": 2,
"sum": 0.003000069,
"avgtime": 0.001500034
},
...
"list_buckets_ops": 1,
"list_buckets_lat": {
"avgcount": 1,
"sum": 0.002300000,
"avgtime": 0.002300000
}
}
},
]
Op metrics can also be tracked per-user or per-bucket. These metrics are exported to Prometheus with the labels ``Bucket = {name}`` or ``User = {userid}``::
"rgw_op_per_bucket": [
...
{
"labels": {
"Bucket": "bucket1"
},
"counters": {
"put_obj_ops": 2,
"put_obj_bytes": 5327,
"put_obj_lat": {
"avgcount": 2,
"sum": 2.818064835,
"avgtime": 1.409032417
},
"get_obj_ops": 5,
"get_obj_bytes": 5325,
"get_obj_lat": {
"avgcount": 2,
"sum": 0.003000069,
"avgtime": 0.001500034
},
...
"list_buckets_ops": 1,
"list_buckets_lat": {
"avgcount": 1,
"sum": 0.002300000,
"avgtime": 0.002300000
}
}
},
...
]
:ref:`rgw-multitenancy` allows the use of buckets and users with the same name,
if they are created under different tenants. If a user or bucket lies under a
tenant, a label for the tenant in the form ``Tenant = {tenantid}`` is added to
the metric.
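For illustration only (the per-user label names below are an assumption modeled on the per-bucket example above, and the user and tenant IDs are hypothetical), a per-user entry for a tenanted user could look like::

    "rgw_op_per_user": [
        {
            "labels": {
                "User": "user1",
                "Tenant": "tenant1"
            },
            ...
        },
        ...
    ]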
In a large system with many users and buckets, it may not be tractable to export all metrics to Prometheus. For that reason, the collection of these labeled metrics is disabled by default.
Once enabled, the working set of tracked users and buckets is constrained to limit memory and database usage. As a result, the collection of these labeled metrics will not always be reliable.
User & Bucket Counter Caches
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To track op metrics by user, the Ceph Object Gateway config value ``rgw_user_counters_cache`` must be set to ``true``.
To track op metrics by bucket, the Ceph Object Gateway config value ``rgw_bucket_counters_cache`` must be set to ``true``.
These config values are set in Ceph via the command ``ceph config set client.rgw rgw_{user,bucket}_counters_cache true``.
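For example, a minimal sketch that enables both caches (the ``client.rgw`` target assumes the gateway reads the default ``client.rgw`` configuration section; adjust it to match your deployment)::

    ceph config set client.rgw rgw_user_counters_cache true
    ceph config set client.rgw rgw_bucket_counters_cache true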
Since the op metrics are labeled perf counters, they live in memory. If the Ceph Object Gateway is restarted or crashes, all counters in the Ceph Object Gateway, whether in a cache or not, are lost.
User & Bucket Counter Cache Size & Eviction
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Both ``rgw_user_counters_cache_size`` and ``rgw_bucket_counters_cache_size`` can be used to set the number of entries in each cache.
Counters are evicted from a cache once the number of counters in the cache is greater than the cache size config variable. The counters that are evicted are the least recently used (LRU).
For example, if the number of buckets exceeded ``rgw_bucket_counters_cache_size`` by 1 and the counters labeled ``bucket1`` were the least recently updated, the counters for ``bucket1`` would be evicted from the cache. If S3 operations tracked by the op metrics were done on ``bucket1`` after eviction, all of the metrics in the cache for ``bucket1`` would start again at 0.
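As a sketch, both caches could be sized explicitly (the value 10000 is illustrative; choose a size that suits your cluster)::

    ceph config set client.rgw rgw_user_counters_cache_size 10000
    ceph config set client.rgw rgw_bucket_counters_cache_size 10000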
Cache sizing can depend on a number of factors. These factors include:
#. Number of users in the cluster
#. Number of buckets in the cluster
#. Memory usage of the Ceph Object Gateway
#. Disk and memory usage of Prometheus.
To help calculate the Ceph Object Gateway's memory usage of a cache, it should be noted that each cache entry, encompassing all of the op metrics, is 1360 bytes. This is an estimate and subject to change if metrics are added or removed from the op metrics list.
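For example, assuming the 1360-byte estimate above, a cache sized at 10,000 entries would use roughly::

    10000 entries x 1360 bytes/entry = 13,600,000 bytes, or about 13.6 MB per cache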
To get metrics from a Ceph Object Gateway into the time series database Prometheus, the ceph-exporter daemon must be running and configured to scrape the Ceph Object Gateway's admin socket.
The ceph-exporter daemon scrapes the Ceph Object Gateway's admin socket at a regular interval, defined by the config variable ``exporter_stats_period``.
Prometheus has a configurable interval in which it scrapes the exporter (see: https://prometheus.io/docs/prometheus/latest/configuration/configuration/).
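A minimal sketch of adjusting how frequently ceph-exporter scrapes the admin socket (the 5-second period is only an example)::

    ceph config set global exporter_stats_period 5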
The following Ceph Object Gateway op metrics related settings can be set via ``ceph config set client.rgw CONFIG_VARIABLE VALUE``.
.. confval:: rgw_user_counters_cache

.. confval:: rgw_user_counters_cache_size

.. confval:: rgw_bucket_counters_cache

.. confval:: rgw_bucket_counters_cache_size
The following are notable ceph-exporter settings that can be set via ``ceph config set global CONFIG_VARIABLE VALUE``.
.. confval:: exporter_stats_period