doc/io-scheduler.md
IO scheduler uses rate-limiter to throttle the amount of data it dispatches into the disk.
The scheduler's main equation that models the disk behavior is
$$ \frac {bandwidth_r} {bandwidth_{r_{max}}} + \frac {bandwidth_w} {bandwidth_{w_{max}}} + \frac {iops_r} {iops_{r_{max}}} + \frac {iops_w} {iops_{w_{max}}} \le 1.0 $$
Let's say that
$$ m_b = \frac {bandwidth_{r_{max}}} {bandwidth_{w_{max}}} \ \ m_o = \frac {iops_{r_{max}}} {iops_{w_{max}}} $$
Since
$$ bandwidth_x = \frac {d}{dt} bytes_x \ \ iops_x = \frac {d}{dt} ios_x $$
The main equation turns into
$$ \frac {d}{dt} ( \frac {bytes_r + m_b \times bytes_w} {bandwidth_{r_{max}}} + \frac {ios_r + m_o \times ios_w} {iops_{r_{max}}} ) \le 1.0 $$
Requests are asigned a 2d value of {1, bytes} for reads and {m<sub>o</sub>, m<sub>b</sub> * bytes} for writes called "tickets"
The "normalization" operation is defined as
$$ N(ticket) = \frac {ticket_0}{iops_{r_{max}}} + \frac {ticket_1}{bandwidth_{r_{max}}} $$
With that the main equation turns into
$$ \frac {d}{dt} \sum_{ticket} N(ticket) \le 1.0 $$
The time-derivative limitation is then implemented with the token bucket algorithm
The algorithm creates token bucket with the refill rate of 1.0 and each request wants to carry the fractional token value of tokens = N(ticket) with ticket defined above
The bucket additionally requires the limit parameter which is the maximum number of tokens the bucket may hold. This value is calculated using the io_latency_goal parameter, to be the amount of tokens accumulated for the io_latency_goal duration
Let's assume we need to reduce the rate of request run-time. This means that we want to
$$ \frac {d}{dt} ( \frac {bytes_r + m_b \times bytes_w} {bandwidth_{r_{max}} \times \alpha} + \frac {ios_r + m_o \times ios_w} {iops_{r_{max}} \times \beta} ) \le 1.0 \ \ \alpha \le 1.0 \ \beta \le 1.0 $$
There are many ways of selecting which of IOPS or bandwidth to reduce, the "general" slowing down may assume that
$$ \alpha = \beta = \frac {1}{\gamma} , \ \ \ \gamma \ge 1.0 $$
The main equation then turns into
$$ \frac {d}{dt} \sum_{ticket} \gamma \times N(ticket) \le 1.0 $$
which in turn means, that we can just multiply the request cost (in tokens) by some number above 1.0
Let's say that
d<sub>i</sub> -- the amount of requests dispatched at tick i p<sub>i</sub> -- the amount of requests processed by disk at tick i c<sub>i</sub> -- the amount of requests completed by reactor loop at tick i
We can observe d<sub>i</sub> and c<sub>i</sub> in the dispatcher, but not the p<sub>i</sub>, because we don't have direct access to disks' queues
After n ticks we have
D<sub>n</sub> -- total amount of requests dispatched, P<sub>n</sub> -- total amount of requests processed, C<sub>n</sub> -- total amount of requests completed,
$$ D_n = \sum_{i=0}^n d_i $$
$$ P_n = \sum_{i=0}^n p_i $$
$$ C_n = \sum_{i=0}^n c_i $$
$$ d_i \ge p_i \Rightarrow D_n \ge P_n$$
$$ p_i \ge c_i \Rightarrow P_n \ge C_n $$
Note that D<sub>n</sub> > P<sub>n</sub> means that disk is falling behind. Our goal is to make sure the disk doesn't do it and doesn't accumulate the queue, i.e. D<sub>n</sub> = P<sub>n</sub>, but we cannot observe P<sub>n</sub> directly, only C<sub>n</sub>.
Next
$$ D_n - P_n = Qd_n \ge 0 $$
$$ P_n - C_n = Qc_n \ge 0 $$
$$ D_n - C_N = Qd_n + Qc_n \ge 0 $$
Qd<sub>n</sub> is the queue accumulated in disk. We try to avoid it. Qc<sub>n</sub> is the accumulated delta between processed (by disk) and completed (by reactor) requests. We cannot reliably say that it goes to zero over time, because there's always some unknown amount of processed but not yet completed requests. So the above formula contains two unknowns and we cannot solve it. Let's try the other way
$$ \frac {D_n} {P_n} = Rd_n \ge 1 $$
$$ \frac {P_n} {C_n} = Rc_n \ge 1 $$
$$ \frac {D_n} {C_n} = Rd_n \times Rc_n \ge 1 $$
Rd<sub>n</sub> is the ratio of dispatched to processed requests. If it's 1, we're OK, if it's greater, disk is accumulating the queue, we try to avoid it. Rc<sub>n</sub> is the ratio between processed (by disk) and completed (by reactor) requests. It has a not immediately apparent, but very pretty property
$$ \lim_{n\to\infty} Rc_n = \lim_{n\to\infty} \frac {P_n} {C_n} = \lim_{n\to\infty} \frac {C_n + Qc_n} { C_n} = \lim_{n\to\infty} (1 + \frac {Qc_n} {C_n} ) = 1 + \lim_{n\to\infty} \frac {Qc_n} {C_n}$$
The Qc<sub>n</sub> doesn't grow over time. It's non-zero, but it's upper bound by some value. Respectively
$$ \lim_{n\to\infty} Rc_n = 1 $$
$$ \lim_{n\to\infty} \frac {D_n} {C_n} = \lim_{n\to\infty} ( Rd_n \times Rc_n ) = \lim_{n\to\infty} Rd_n $$
IOW -- we can say if the disk is accumulating the queue or not by observing the dispatched-to-completed (to completed, not processed) over a long enough time