docs/layouts/shortcodes/generated/expert_fault_tolerance_section.html
| Key | Default | Type | Description |
|---|---|---|---|
| | (none) | Integer | The size of the IO executor pool used by the cluster to execute blocking IO operations (Master as well as TaskManager processes). By default, it uses 4 * the number of CPU cores (hardware contexts) that the cluster process has access to. Increasing the pool size allows more IO operations to run concurrently. |
| | 10 s | Duration | The pause made after a registration attempt caused an exception (other than timeout). |
| | 100 ms | Duration | Initial registration timeout between cluster components. |
| | 30 s | Duration | Maximum registration timeout between cluster components. |
| | 30 s | Duration | The pause made after the registration attempt was refused. |
| | 30 s | Duration | The shutdown timeout for cluster services like executors. |
| | 10 s | Duration | Time interval between heartbeat RPC requests from the sender to the receiver side. |
| | 2 | Integer | The number of consecutive failed heartbeat RPCs until a heartbeat target is marked as unreachable. Failed heartbeat RPCs can be used to detect dead targets faster because they no longer receive the RPCs. The detection time is heartbeat.interval * heartbeat.rpc-failure-threshold. In environments with a flaky network, setting this value too low can produce false positives. In this case, we recommend increasing this value, but not higher than heartbeat.timeout / heartbeat.interval. The mechanism can be disabled by setting this option to -1. |
| | 50 s | Duration | Timeout for requesting and receiving heartbeats for both sender and receiver sides. |
| | "region" | String | This option specifies how the job computation recovers from task failures. Accepted values are: |
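
The arithmetic in the heartbeat and IO pool descriptions can be worked through with a small sketch. It is an illustration only, not part of the cluster code: the function names are hypothetical, and it assumes that the 10 s and 50 s rows correspond to `heartbeat.interval` and `heartbeat.timeout`, the two options named in the failure-threshold description.

```python
from datetime import timedelta

# Defaults copied from the table. The mapping of the 10 s and 50 s rows to
# heartbeat.interval and heartbeat.timeout is an assumption of this sketch.
HEARTBEAT_INTERVAL = timedelta(seconds=10)  # heartbeat.interval
HEARTBEAT_TIMEOUT = timedelta(seconds=50)   # heartbeat.timeout
RPC_FAILURE_THRESHOLD = 2                   # heartbeat.rpc-failure-threshold


def rpc_failure_detection_time(interval, threshold):
    """Return heartbeat.interval * heartbeat.rpc-failure-threshold.

    Returns None when the mechanism is disabled (threshold == -1).
    """
    if threshold == -1:
        return None
    if threshold < 1:
        # Rejecting other non-positive values is a choice made for this
        # sketch; the table does not say how such values are handled.
        raise ValueError("threshold must be -1 or a positive integer")
    return interval * threshold


def max_recommended_threshold(timeout, interval):
    """Return the recommended upper bound: heartbeat.timeout / heartbeat.interval."""
    return timeout // interval  # timedelta // timedelta yields an int


def default_io_pool_size(cpu_cores):
    """Return the default IO executor pool size: 4 * the number of CPU cores."""
    return 4 * cpu_cores


if __name__ == "__main__":
    detection = rpc_failure_detection_time(HEARTBEAT_INTERVAL, RPC_FAILURE_THRESHOLD)
    print(detection.total_seconds())  # 20.0
    print(max_recommended_threshold(HEARTBEAT_TIMEOUT, HEARTBEAT_INTERVAL))  # 5
    print(default_io_pool_size(8))  # 32
```

With the defaults, a dead target is marked as unreachable after about 20 s of failed heartbeat RPCs, and the failure threshold can be raised to at most 5 before it exceeds the recommended bound.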