========================
 mClock Config Reference
========================

.. index:: mclock; configuration

QoS support in Ceph is implemented using a queuing scheduler based on `the dmClock algorithm`_. See the :ref:`dmclock-qos` section for more details.

To make the usage of mclock more user-friendly and intuitive, mclock config profiles are introduced. The mclock profiles mask the low level details from users, making it easier to configure and use mclock.

The following input parameters are required for an mclock profile to configure the QoS related parameters:

* the total capacity (IOPS) of each OSD (determined automatically - see `OSD Capacity Determination (Automated)`_)

* the max sequential bandwidth capacity (MiB/s) of each OSD - see the ``osd_mclock_max_sequential_bandwidth_[hdd|ssd]`` options

* an mclock profile type to enable

Using the settings in the specified profile, an OSD determines and applies the lower-level mclock and Ceph parameters. The parameters applied by the mclock profile make it possible to tune the QoS between client I/O and background operations in the OSD.
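For example, once a cluster is up, the profile in effect and the parameters derived from it can be inspected through the config interface (``osd.0`` below is a placeholder for any OSD id):

```shell
# Query the mclock profile currently in effect for a given OSD.
ceph config get osd.0 osd_mclock_profile

# Inspect one of the low-level parameters the profile derived,
# e.g. the weight allocated to the client class.
ceph config show osd.0 osd_mclock_scheduler_client_wgt
```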

.. index:: mclock; mclock clients

mClock Client Types
-------------------

The mclock scheduler handles requests from different types of Ceph services. Each service can be considered a type of client from mclock's perspective. Depending on the type of requests handled, mclock clients are classified into the buckets shown in the table below:

+------------------------+--------------------------------------------------------------+
| Client Type            | Request Types                                                |
+========================+==============================================================+
| Client                 | I/O requests issued by external clients of Ceph              |
+------------------------+--------------------------------------------------------------+
| Background recovery    | Internal recovery requests                                   |
+------------------------+--------------------------------------------------------------+
| Background best-effort | Internal backfill, scrub, snap trim and PG deletion requests |
+------------------------+--------------------------------------------------------------+

The mclock profiles allocate parameters like reservation, weight and limit (see :ref:`dmclock-qos`) differently for each client type. The next sections describe the mclock profiles in greater detail.

.. index:: mclock; profile definition

mClock Profiles - Definition and Purpose
----------------------------------------

An mclock profile is “a configuration setting that, when applied on a running Ceph cluster, enables the throttling of the operations (IOPS) belonging to different client classes (background recovery, scrub, snaptrim, client op, osd subop)”.

The mclock profile uses the capacity limits and the mclock profile type selected by the user to determine the low-level mclock resource control configuration parameters and apply them transparently. Additionally, other Ceph configuration parameters are also applied. Please see sections below for more information.

The low-level mclock resource control parameters are the reservation, limit, and weight that provide control of the resource shares, as described in the :ref:`dmclock-qos` section.

.. index:: mclock; profile types

mClock Profile Types
--------------------

mclock profiles can be broadly classified into *built-in* and *custom* profiles:

Built-in Profiles
~~~~~~~~~~~~~~~~~

Users can choose between the following built-in profile types:

.. note:: The values mentioned in the tables below represent the proportion of the total IOPS capacity of the OSD allocated for the service type.

* ``balanced`` (default)
* ``high_client_ops``
* ``high_recovery_ops``

balanced (default)
^^^^^^^^^^^^^^^^^^

The balanced profile is the default mClock profile. This profile allocates equal reservation/priority to client operations and background recovery operations. Background best-effort ops are given a lower reservation and therefore take longer to complete when there are competing operations. This profile helps meet the normal/steady-state requirements of the cluster. This is the case when external client performance requirements are not critical and there are other background operations that still need attention within the OSD.

But there might be instances that necessitate giving higher allocations to either client ops or recovery ops. In order to deal with such a situation, the alternate built-in profiles may be enabled by following the steps mentioned in the next sections.

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 50%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 50%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | 90%   |
+------------------------+-------------+--------+-------+
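As an illustrative sketch (the 315 IOPS figure below is a hypothetical OSD capacity, not a value taken from this document), the absolute per-class reservations implied by the balanced profile can be computed as follows:

```shell
# Hypothetical total OSD capacity in IOPS; substitute the value reported by
# "ceph config show osd.N osd_mclock_max_capacity_iops_hdd" on your cluster.
CAPACITY=315

# balanced profile: 50% reservation each for the client and
# background recovery classes.
awk -v cap="$CAPACITY" 'BEGIN {
    printf "client reservation:   %.1f IOPS\n", cap * 0.50
    printf "recovery reservation: %.1f IOPS\n", cap * 0.50
}'
```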

high_client_ops
^^^^^^^^^^^^^^^

This profile optimizes client performance over background activities by allocating more reservation and limit to client operations as compared to background operations in the OSD. This profile, for example, may be enabled to provide the needed performance for I/O intensive applications for a sustained period of time at the cost of slower recoveries. The table shows the resource control parameters set by the profile:

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 60%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 40%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | 70%   |
+------------------------+-------------+--------+-------+

high_recovery_ops
^^^^^^^^^^^^^^^^^

This profile optimizes background recovery performance as compared to external clients and other background operations within the OSD. This profile, for example, may be enabled by an administrator temporarily to speed-up background recoveries during non-peak hours. The table shows the resource control parameters set by the profile:

+------------------------+-------------+--------+-------+
| Service Type           | Reservation | Weight | Limit |
+========================+=============+========+=======+
| client                 | 30%         | 1      | MAX   |
+------------------------+-------------+--------+-------+
| background recovery    | 70%         | 2      | MAX   |
+------------------------+-------------+--------+-------+
| background best-effort | MIN         | 1      | MAX   |
+------------------------+-------------+--------+-------+

.. note:: Across the built-in profiles, internal background best-effort clients of mclock include "backfill", "scrub", "snap trim", and "pg deletion" operations.

Custom Profile
~~~~~~~~~~~~~~

This profile gives users complete control over all the mclock configuration parameters. It should be used with caution and is meant for advanced users who understand mclock and the related Ceph configuration options.

.. index:: mclock; shard config for HDD clusters

.. _mclock-hdd-cfg:

OSD Shard Configuration For HDD Based Clusters With mClock
----------------------------------------------------------

Each OSD is configured with one or more shards to perform tasks. Each shard comprises a unique queue to handle various types of OSD specific operations like client I/O, recovery, scrub and so on. The scheduling of these operations in the queue is performed by a scheduler - in this case the mClock scheduler.

For HDD based OSDs, the number of shards is controlled by the :confval:`osd_op_num_shards_hdd` configuration option. Items are queued and dequeued by one or more worker threads, and this is controlled by the :confval:`osd_op_num_threads_per_shard_hdd` configuration option.

As described in :ref:`dmclock-qos-caveats`, the number of OSD shards employed determines the impact of the mClock queue. In general, a lower number of shards increases the impact of the mClock queues with respect to scheduling accuracy, provided there are enough worker threads per shard to help process the items in the mClock queue.

Based on tests performed at scale with small objects in the range [1 KiB - 256 KiB] on a HDD based cluster (192 OSDs, 8 nodes, 150 Million objects), it was found that scheduling with mClock was not optimal with multiple OSD shards. For example, in this cluster with multiple OSD node failures, the client throughput was found to be inconsistent across test runs coupled with multiple reported slow requests. For more details see https://tracker.ceph.com/issues/66289. With multiple shards, the situation was exacerbated when MAX limit was allocated to both client and background recovery class of operations. During the OSD failure phase, since both client and recovery ops were in direct competition to utilize the full bandwidth of OSDs, there was no predictability with respect to the throughput of either class of services.

However, the same test with a single OSD shard and with multiple worker threads yielded significantly better results in terms of consistency of client and recovery throughput across multiple test runs. Please refer to the tracker above for more details. For sanity, the same test executed using this shard configuration with large objects in the range [1 MiB - 256 MiB] yielded similar results.

Therefore, as an interim measure until the issue with multiple OSD shards (or multiple mClock queues per OSD) is investigated and fixed, the following change to the default HDD OSD shard configuration is made:

+---------------------------------------------+------------------+----------------+
| Config Option                               | Old Default      | New Default    |
+=============================================+==================+================+
| :confval:`osd_op_num_shards_hdd`            | 5                | 1              |
+---------------------------------------------+------------------+----------------+
| :confval:`osd_op_num_threads_per_shard_hdd` | 1                | 5              |
+---------------------------------------------+------------------+----------------+
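The shard configuration in effect can be verified per OSD; a minimal check, assuming a running cluster and an HDD-based ``osd.0``:

```shell
# Confirm the HDD shard defaults in effect on a given OSD.
ceph config show osd.0 osd_op_num_shards_hdd
ceph config show osd.0 osd_op_num_threads_per_shard_hdd
```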

.. index:: mclock; built-in profiles

mClock Built-in Profiles - Locked Config Options
------------------------------------------------

The below sections describe the config options that are locked to certain values in order to ensure that the mClock scheduler is able to provide predictable QoS.

mClock Config Options
~~~~~~~~~~~~~~~~~~~~~

.. important:: These defaults cannot be changed using any of the config subsystem commands like ``config set``, nor via the ``config daemon`` or ``config tell`` interfaces. Although the above command(s) report success, the mclock QoS parameters are reverted to their respective built-in profile defaults.

When a built-in profile is enabled, the mClock scheduler calculates the low level mclock parameters [reservation, weight, limit] based on the profile enabled for each client type. The mclock parameters are calculated based on the max OSD capacity provided beforehand. As a result, the following mclock config parameters cannot be modified when using any of the built-in profiles:

* :confval:`osd_mclock_scheduler_client_res`
* :confval:`osd_mclock_scheduler_client_wgt`
* :confval:`osd_mclock_scheduler_client_lim`
* :confval:`osd_mclock_scheduler_background_recovery_res`
* :confval:`osd_mclock_scheduler_background_recovery_wgt`
* :confval:`osd_mclock_scheduler_background_recovery_lim`
* :confval:`osd_mclock_scheduler_background_best_effort_res`
* :confval:`osd_mclock_scheduler_background_best_effort_wgt`
* :confval:`osd_mclock_scheduler_background_best_effort_lim`
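A minimal sketch of the revert behaviour described above, assuming a running cluster with a built-in profile active on ``osd.0``:

```shell
# Attempting to modify a locked parameter appears to succeed...
ceph config set osd.0 osd_mclock_scheduler_client_res 0.5

# ...but the running value is reverted to the built-in profile default,
# which can be confirmed by inspecting the option afterwards.
ceph config show osd.0 osd_mclock_scheduler_client_res
```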

Recovery/Backfill Options
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning:: The recommendation is to not change these options as the built-in profiles are optimized based on them. Changing these defaults can result in unexpected performance outcomes.

The following recovery and backfill related Ceph options are overridden to mClock defaults:

* :confval:`osd_max_backfills`
* :confval:`osd_recovery_max_active`
* :confval:`osd_recovery_max_active_hdd`
* :confval:`osd_recovery_max_active_ssd`

The following table shows the mClock defaults, which are the same as the current defaults. This is done to maximize the performance of the foreground (client) operations:

+----------------------------------------+------------------+----------------+
| Config Option                          | Original Default | mClock Default |
+========================================+==================+================+
| :confval:`osd_max_backfills`           | 1                | 1              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active`     | 0                | 0              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_hdd` | 3                | 3              |
+----------------------------------------+------------------+----------------+
| :confval:`osd_recovery_max_active_ssd` | 10               | 10             |
+----------------------------------------+------------------+----------------+

The above mClock defaults can be modified, if necessary, by enabling :confval:`osd_mclock_override_recovery_settings` (default: ``false``). The steps for this are discussed in the `Steps to Modify mClock Max Backfills/Recovery Limits`_ section.

Sleep Options
~~~~~~~~~~~~~

If any mClock profile (including "custom") is active, the following Ceph config sleep options are disabled (set to 0):

* :confval:`osd_recovery_sleep`
* :confval:`osd_recovery_sleep_hdd`
* :confval:`osd_recovery_sleep_ssd`
* :confval:`osd_recovery_sleep_hybrid`
* :confval:`osd_recovery_sleep_degraded`
* :confval:`osd_recovery_sleep_degraded_hdd`
* :confval:`osd_recovery_sleep_degraded_ssd`
* :confval:`osd_recovery_sleep_degraded_hybrid`
* :confval:`osd_scrub_sleep`
* :confval:`osd_delete_sleep`
* :confval:`osd_delete_sleep_hdd`
* :confval:`osd_delete_sleep_ssd`
* :confval:`osd_delete_sleep_hybrid`
* :confval:`osd_snap_trim_sleep`
* :confval:`osd_snap_trim_sleep_hdd`
* :confval:`osd_snap_trim_sleep_ssd`
* :confval:`osd_snap_trim_sleep_hybrid`

The above sleep options are disabled to ensure that the mclock scheduler is able to determine when to pick the next op from its operation queue and transfer it to the operation sequencer. This results in the desired QoS being provided across all its clients.
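A quick way to confirm this, assuming a running cluster, is to inspect a couple of the sleep options on an OSD; with any mClock profile active the reported value is 0:

```shell
# Verify that the sleep options are zeroed while an mClock profile is active.
ceph config show osd.0 osd_recovery_sleep
ceph config show osd.0 osd_scrub_sleep
```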

.. index:: mclock; enable built-in profile

Steps to Enable mClock Profile
------------------------------

As already mentioned, the default mclock profile is set to ``balanced``. The other values for the built-in profiles include ``high_client_ops`` and ``high_recovery_ops``.

If there is a requirement to change the default profile, then the option :confval:osd_mclock_profile may be set during runtime by using the following command:

.. prompt:: bash #

   ceph config set osd.N osd_mclock_profile <value>

For example, to change the profile to allow faster recoveries on "osd.0", the following command can be used to switch to the high_recovery_ops profile:

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_profile high_recovery_ops

.. note:: The custom profile is not recommended unless you are an advanced user.

And that's it! You are ready to run workloads on the cluster and check if the QoS requirements are being met.

Switching Between Built-in and Custom Profiles
----------------------------------------------

There may be situations requiring switching from a built-in profile to the custom profile and vice-versa. The following sections outline the steps to accomplish this.

Steps to Switch From a Built-in to the Custom Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following command can be used to switch a specific OSD (``osd.N``) to the custom profile:

.. prompt:: bash #

   ceph config set osd.N osd_mclock_profile custom

For example, to change the profile to custom on all OSDs, the following command can be used:

.. prompt:: bash #

   ceph config set osd osd_mclock_profile custom

After switching to the custom profile, the desired mClock configuration option may be modified. For example, to change the client reservation IOPS ratio for a specific OSD (say osd.0) to 0.5 (or 50%), the following command can be used:

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_scheduler_client_res 0.5

.. important:: Care must be taken to change the reservations of other services, like recovery and background best effort, accordingly, to ensure that the sum of the reservations does not exceed the maximum proportion (1.0) of the IOPS capacity of the OSD.
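A quick arithmetic check of this constraint can be scripted; the three reservation values below are hypothetical examples, not recommendations:

```shell
# Hypothetical per-class reservations for the custom profile.
CLIENT_RES=0.5
RECOVERY_RES=0.3
BEST_EFFORT_RES=0.1

# The sum must not exceed 1.0 (the whole IOPS capacity of the OSD);
# a non-zero exit status flags an over-committed configuration.
awk -v c="$CLIENT_RES" -v r="$RECOVERY_RES" -v b="$BEST_EFFORT_RES" 'BEGIN {
    sum = c + r + b
    printf "total reservation: %.1f\n", sum
    exit (sum > 1.0) ? 1 : 0
}'
```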

.. tip:: The reservation and limit parameter allocations are per-shard, based on the type of backing device (HDD/SSD) under the OSD. See :confval:`osd_op_num_shards_hdd` and :confval:`osd_op_num_shards_ssd` for more details.

Steps to Switch From the Custom Profile to a Built-in Profile
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Switching from the custom profile to a built-in profile requires an intermediate step of removing the custom settings from the central config database for the changes to take effect.

The following sequence of commands can be used to switch to a built-in profile:

#. Set the desired built-in profile using:

.. prompt:: bash #

 ceph config set osd osd_mclock_profile <built-in profile>

For example, to set the built-in profile to high_client_ops on all OSDs, run the following command:

.. prompt:: bash #

 ceph config set osd osd_mclock_profile high_client_ops

#. Determine the existing custom mClock configuration settings in the central config database using the following command:

.. prompt:: bash #

 ceph config dump

#. Remove the custom mClock configuration settings determined in the previous step from the central config database:

.. prompt:: bash #

 ceph config rm osd <mClock Configuration Option>

For example, to remove the configuration option :confval:`osd_mclock_scheduler_client_res` that was set on all OSDs, run the following command:

.. prompt:: bash #

 ceph config rm osd osd_mclock_scheduler_client_res

#. After all existing custom mClock configuration settings have been removed from the central config database, the configuration settings pertaining to ``high_client_ops`` will come into effect. For example, to verify the settings on osd.0 use:

.. prompt:: bash #

 ceph config show osd.0

Switch Temporarily Between mClock Profiles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To switch between mClock profiles on a temporary basis, the following commands may be used to override the settings:

.. warning:: This section is for advanced users or for experimental testing. The recommendation is to not use the below commands on a running cluster as it could have unexpected outcomes.

.. note:: The configuration changes made on an OSD using the below commands are ephemeral and are lost when it restarts. It is also important to note that config options overridden using the below commands cannot be modified further using the ``ceph config set osd.N ...`` command; any such change will not take effect until the OSD is restarted. This is intentional, as per the config subsystem design. However, further ephemeral modifications can still be made using the commands mentioned below.

#. Run the injectargs command as shown to override the mclock settings:

.. prompt:: bash #

 ceph tell osd.N injectargs '--<mClock Configuration Option>=<value>'

For example, the following command overrides the :confval:`osd_mclock_profile` option on osd.0:

.. prompt:: bash #

 ceph tell osd.0 injectargs '--osd_mclock_profile=high_recovery_ops'

#. An alternate command that can be used is:

.. prompt:: bash #

 ceph daemon osd.N config set <mClock Configuration Option> <value>

For example, the following command overrides the :confval:`osd_mclock_profile` option on osd.0:

.. prompt:: bash #

 ceph daemon osd.0 config set osd_mclock_profile high_recovery_ops

The individual QoS-related config options for the custom profile can also be modified ephemerally using the above commands.

Steps to Modify mClock Max Backfills/Recovery Limits
----------------------------------------------------

This section describes the steps to modify the default max backfills or recovery limits if the need arises.

.. warning:: This section is for advanced users or for experimental testing. The recommendation is to retain the defaults as-is on a running cluster, as modifying them could have unexpected performance outcomes. The values may be modified only if the cluster is unable to cope with (or shows poor performance under) the default settings, or for performing experiments on a test cluster.

.. important:: The max backfill/recovery options that can be modified are listed in the `Recovery/Backfill Options`_ section. The modification of the mClock default backfills/recovery limit is gated by the :confval:`osd_mclock_override_recovery_settings` option, which is set to ``false`` by default. Attempting to modify any default recovery/backfill limit without setting the gating option will reset that option back to the mClock defaults, along with a warning message logged in the cluster log. Note that it may take a few seconds for the default value to come back into effect. Verify the limit using the ``config show`` command as shown below.

#. Set the :confval:`osd_mclock_override_recovery_settings` config option on all OSDs to ``true`` using:

.. prompt:: bash #

 ceph config set osd osd_mclock_override_recovery_settings true

#. Set the desired max backfill/recovery option using:

.. prompt:: bash #

 ceph config set osd osd_max_backfills <value>

For example, the following command modifies the :confval:`osd_max_backfills` option on all OSDs to 5.

.. prompt:: bash #

 ceph config set osd osd_max_backfills 5

#. Wait for a few seconds and verify the running configuration for a specific OSD using:

.. prompt:: bash #

 ceph config show osd.N | grep osd_max_backfills

For example, the following command shows the running configuration of :confval:`osd_max_backfills` on osd.0.

.. prompt:: bash #

 ceph config show osd.0 | grep osd_max_backfills

#. Reset the :confval:`osd_mclock_override_recovery_settings` config option on all OSDs to ``false`` using:

.. prompt:: bash #

 ceph config set osd osd_mclock_override_recovery_settings false

OSD Capacity Determination (Automated)
--------------------------------------

The OSD capacity in terms of total IOPS is determined automatically during OSD initialization. This is achieved by running the OSD bench tool and overriding the default value of the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option depending on the device type. No other action/input is expected from the user to set the OSD capacity.

.. note:: If you wish to manually benchmark OSD(s) or manually tune the Bluestore throttle parameters, see the `Steps to Manually Benchmark an OSD (Optional)`_ section.

You may verify the capacity of an OSD after the cluster is brought up by using the following command:

.. prompt:: bash #

   ceph config show osd.N osd_mclock_max_capacity_iops_[hdd, ssd]

For example, the following command shows the max capacity for "osd.0" on a Ceph node whose underlying device type is SSD:

.. prompt:: bash #

   ceph config show osd.0 osd_mclock_max_capacity_iops_ssd

Mitigation of Unrealistic OSD Capacity From Automated Test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In certain conditions, the OSD bench tool may show unrealistic or inflated results depending on the drive configuration and other environment related conditions. To mitigate the performance impact due to this unrealistic capacity, a couple of threshold config options, depending on the OSD's device type, are defined and used:

* :confval:`osd_mclock_iops_capacity_threshold_hdd` = 500
* :confval:`osd_mclock_iops_capacity_threshold_ssd` = 80000

The following automated step is performed:

Fallback to using default OSD capacity (automated)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If OSD bench reports a measurement that exceeds the above threshold values depending on the underlying device type, the fallback mechanism reverts to the default value of :confval:`osd_mclock_max_capacity_iops_hdd` or :confval:`osd_mclock_max_capacity_iops_ssd`. The threshold config options can be reconfigured based on the type of drive used. Additionally, a cluster warning is logged in case the measurement exceeds the threshold. For example::

  2022-10-27T15:30:23.270+0000 7f9b5dbe95c0  0 log_channel(cluster) log [WRN]
  : OSD bench result of 39546.479392 IOPS exceeded the threshold limit of
  25000.000000 IOPS for osd.1. IOPS capacity is unchanged at 21500.000000
  IOPS. The recommendation is to establish the osd's IOPS capacity using other
  benchmark tools (e.g. Fio) and then override
  osd_mclock_max_capacity_iops_[hdd|ssd].
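The fallback decision itself is a simple comparison; the standalone sketch below uses the IOPS figures from the example log message as inputs (the 25000 threshold shown there is a reconfigured value, not the documented default):

```shell
# Sketch of the fallback check: if the measured IOPS exceed the threshold
# for the device type, keep the configured default capacity instead.
MEASURED=39546.479392   # result reported by OSD bench
THRESHOLD=25000.0       # e.g. a reconfigured osd_mclock_iops_capacity_threshold_hdd
DEFAULT_CAPACITY=21500.0

awk -v m="$MEASURED" -v t="$THRESHOLD" -v d="$DEFAULT_CAPACITY" 'BEGIN {
    if (m > t)
        printf "capacity unchanged at %.1f IOPS (bench result %.1f exceeded threshold)\n", d, m
    else
        printf "capacity set to %.1f IOPS\n", m
}'
```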

If the default capacity doesn't accurately represent the OSD's capacity, the following additional step is recommended to address this:

Run custom drive benchmark if defaults are not accurate (manual)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the default OSD capacity is not accurate, the recommendation is to run a custom benchmark using your preferred tool (e.g. Fio) on the drive and then override the ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option as described in the `Set or Override Max IOPS Capacity of an OSD`_ section.

This step is highly recommended until an alternative mechanism is available.

Steps to Manually Benchmark an OSD (Optional)
---------------------------------------------

.. note:: These steps are only necessary if you want to override the OSD capacity already determined automatically during OSD initialization. Otherwise, you may skip this section entirely.

.. tip:: If you have already determined the benchmark data and wish to manually override the max osd capacity for an OSD, you may skip to the `Set or Override Max IOPS Capacity of an OSD`_ section.

Any existing benchmarking tool (e.g. Fio) can be used for this purpose. In this case, the steps use the Ceph OSD Bench command described in the next section. Regardless of the tool/command used, the steps outlined further below remain the same.

As already described in the :ref:`dmclock-qos` section, the number of shards and the bluestore's throttle parameters have an impact on the mclock op queues. Therefore, it is critical to set these values carefully in order to maximize the impact of the mclock scheduler.

:Number of Operational Shards: We recommend using the default number of shards as defined by the configuration options :confval:`osd_op_num_shards`, :confval:`osd_op_num_shards_hdd`, and :confval:`osd_op_num_shards_ssd`. In general, a lower number of shards will increase the impact of the mclock queues.

:Bluestore Throttle Parameters: We recommend using the default values as defined by :confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`. But these parameters may also be determined during the benchmarking phase as described below.

OSD Bench Command Syntax
~~~~~~~~~~~~~~~~~~~~~~~~

The :ref:`osd-subsystem` section describes the OSD bench command. The syntax used for benchmarking is shown below:

.. prompt:: bash #

   ceph tell osd.N bench [TOTAL_BYTES] [BYTES_PER_WRITE] [OBJ_SIZE] [NUM_OBJS]

where:

* ``TOTAL_BYTES``: Total number of bytes to write
* ``BYTES_PER_WRITE``: Block size per write
* ``OBJ_SIZE``: Bytes per object
* ``NUM_OBJS``: Number of objects to write
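These parameters are related by simple arithmetic: the bench issues ``TOTAL_BYTES / BYTES_PER_WRITE`` writes in total. A quick sanity check, using the byte counts from the example invocation used later in this document:

```shell
# Derive the number of writes OSD bench will issue from the byte counts.
TOTAL_BYTES=12288000
BYTES_PER_WRITE=4096

echo "number of writes: $(( TOTAL_BYTES / BYTES_PER_WRITE ))"
```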

Benchmarking Test Steps Using OSD Bench
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The steps below use the default shards and detail the steps used to determine the correct bluestore throttle values (optional).

#. Bring up your Ceph cluster and log in to the Ceph node hosting the OSDs that you wish to benchmark.

#. Run a simple 4KiB random write workload on an OSD using the following commands:

.. note:: Before running the test, caches must be cleared to get an accurate measurement.

For example, if you are running the benchmark test on osd.0, run the following commands:

.. prompt:: bash #

 ceph tell osd.0 cache drop

.. prompt:: bash #

 ceph tell osd.0 bench 12288000 4096 4194304 100

#. Note the overall throughput (IOPS) obtained from the output of the osd bench command. This value is the baseline throughput (IOPS) when the default bluestore throttle options are in effect.

#. If the intent is to determine the bluestore throttle values for your environment, then set the two options, :confval:`bluestore_throttle_bytes` and :confval:`bluestore_throttle_deferred_bytes`, to 32 KiB (32768 bytes) each to begin with. Otherwise, you may skip to the next section.

#. Run the 4KiB random write test as before using OSD bench.

#. Note the overall throughput from the output and compare the value against the baseline throughput recorded in step 3.

#. If the throughput doesn't match the baseline, increment the bluestore throttle options by 2x and repeat steps 5 through 7 until the obtained throughput is very close to the baseline value.

For example, during benchmarking on a machine with NVMe SSDs, a value of 256 KiB for both bluestore throttle and deferred bytes was determined to maximize the impact of mclock. For HDDs, the corresponding value was 40 MiB, where the overall throughput was roughly equal to the baseline throughput. Note that in general for HDDs, the bluestore throttle values are expected to be higher when compared to SSDs.
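The doubling schedule in the steps above can be sketched as a simple progression starting at 32 KiB; the 256 KiB stopping point below is only the NVMe SSD example cited above, not a general recommendation:

```shell
# Print the candidate bluestore throttle values produced by repeated
# doubling, starting from 32 KiB up to the 256 KiB NVMe example.
THROTTLE=32768        # 32 KiB starting point
TARGET=262144         # 256 KiB, the NVMe SSD example value

while [ "$THROTTLE" -le "$TARGET" ]; do
    echo "try bluestore_throttle_bytes = bluestore_throttle_deferred_bytes = $THROTTLE"
    THROTTLE=$(( THROTTLE * 2 ))
done
```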

.. _override_max_iops_capacity:

Set or Override Max IOPS Capacity of an OSD
-------------------------------------------

The steps in this section may be performed to set or override the max IOPS capacity of an OSD. The ``osd_mclock_max_capacity_iops_[hdd, ssd]`` option for an OSD can be overridden by running a command of the following form:

.. prompt:: bash #

   ceph config set osd.N osd_mclock_max_capacity_iops_[hdd,ssd] <value>

For example, the following command sets the max capacity for a specific OSD (say "osd.0") whose underlying device type is HDD to 350 IOPS:

.. prompt:: bash #

   ceph config set osd.0 osd_mclock_max_capacity_iops_hdd 350

Alternatively, you may specify the max capacity for OSDs within the Ceph configuration file under the respective [osd.N] section. See :ref:`ceph-conf-settings` for more details.

Global Override of Max IOPS Capacity for multiple OSDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The max IOPS capacity of multiple OSDs may be overridden by a global config specification. This section shows the steps to globally override the individually scoped values in the mon store.

.. note:: The examples use :confval:`osd_mclock_max_capacity_iops_hdd` and
          the steps are also applicable for SSD based OSDs in which case
          the option to use is :confval:`osd_mclock_max_capacity_iops_ssd`.

Below are the steps to override the individually scoped IOPS capacity values with a global value. Note that an individual value is set by an OSD after it runs the usual startup benchmark.

#. Run the following command to verify the individual values set for the OSDs in the central config database:

.. prompt:: bash #

  ceph config dump | grep osd_mclock_max_capacity_iops

::

  WHO     MASK  LEVEL  OPTION                            VALUE       RO
  osd.0         basic  osd_mclock_max_capacity_iops_hdd  379.197568    
  osd.1         basic  osd_mclock_max_capacity_iops_hdd  400.903575    
  osd.2         basic  osd_mclock_max_capacity_iops_hdd  398.303428    
  osd.3         basic  osd_mclock_max_capacity_iops_hdd  419.035854    

#. If there are no individual values reported, skip to the next step. Otherwise, remove all the individual values reported in the previous step with a command of the following form (where 'x' is the OSD id):

.. prompt:: bash #

  ceph config rm osd.x osd_mclock_max_capacity_iops_hdd

#. Confirm that the ceph config dump command from step 1 does not show any individual values.

#. Set the global value of osd_mclock_max_capacity_iops_hdd with a command of the following form:

.. prompt:: bash #

  ceph config set global osd_mclock_max_capacity_iops_hdd 111

#. Confirm that the global option is set by running:

.. prompt:: bash #

  ceph config dump | grep osd_mclock_max_capacity_iops

::

  global        basic  osd_mclock_max_capacity_iops_hdd  111.000000    

#. Confirm that the global setting is now in effect for any OSD that no longer has a specific per-OSD central config setting:

.. prompt:: bash #

  ceph config show osd.0 | grep osd_mclock_max_capacity_iops_hdd

::

  osd_mclock_max_capacity_iops_hdd                 111.000000                               mon

.. index:: mclock; config settings

mClock Config Options
---------------------

.. confval:: osd_mclock_profile
.. confval:: osd_mclock_max_capacity_iops_hdd
.. confval:: osd_mclock_max_capacity_iops_ssd
.. confval:: osd_mclock_max_sequential_bandwidth_hdd
.. confval:: osd_mclock_max_sequential_bandwidth_ssd
.. confval:: osd_mclock_force_run_benchmark_on_init
.. confval:: osd_mclock_skip_benchmark
.. confval:: osd_mclock_override_recovery_settings
.. confval:: osd_mclock_iops_capacity_threshold_hdd
.. confval:: osd_mclock_iops_capacity_low_threshold_hdd
.. confval:: osd_mclock_iops_capacity_threshold_ssd
.. confval:: osd_mclock_iops_capacity_low_threshold_ssd

.. _the dmClock algorithm: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf