Expert Fault Tolerance Section - Flink

Key	Default	Type	Description

cluster.io-pool.size

| (none) | Integer | The size of the IO executor pool used by the cluster to execute blocking IO operations (in both Master and TaskManager processes). By default, it is 4 * the number of CPU cores (hardware contexts) that the cluster process has access to. Increasing the pool size allows more IO operations to run concurrently. |
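The default described above can be sketched as a small calculation. This is only an illustration of the "4 * CPU cores" rule, not Flink's own code: the function name `default_io_pool_size` is hypothetical, and `os.cpu_count()` is used as a stand-in for the core count the cluster process actually sees, which may differ (for example inside a container).

```python
import os
from typing import Optional


def default_io_pool_size(cpu_cores: Optional[int] = None) -> int:
    """Return 4 * cores, the documented default when cluster.io-pool.size is unset.

    Hypothetical helper for illustration only.
    """
    if cpu_cores is None:
        # Assumption: os.cpu_count() approximates the cores visible to the process.
        cpu_cores = os.cpu_count() or 1
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be positive")
    return 4 * cpu_cores


# A process that has access to 8 cores would default to 32 IO threads.
print(default_io_pool_size(8))
```

Setting cluster.io-pool.size explicitly overrides this computed default.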

cluster.registration.error-delay

cluster.registration.initial-timeout

cluster.registration.max-timeout

cluster.registration.refused-registration-delay

cluster.services.shutdown-timeout

heartbeat.interval

heartbeat.rpc-failure-threshold

| 2 | Integer | The number of consecutive failed heartbeat RPCs after which a heartbeat target is marked as unreachable. Failed heartbeat RPCs can be used to detect dead targets faster because such targets no longer receive the RPCs. The detection time is heartbeat.interval * heartbeat.rpc-failure-threshold. In environments with a flaky network, setting this value too low can produce false positives. In that case, we recommend increasing this value, but not higher than heartbeat.timeout / heartbeat.interval. The mechanism can be disabled by setting this option to -1. |
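The two formulas in this description can be worked through with concrete numbers. The sketch below is a plain calculation, not a Flink API. The helper names are hypothetical, and the interval (10000 ms) and timeout (50000 ms) are illustrative values chosen for the example rather than values taken from this section. Only the threshold of 2 comes from the table.

```python
from typing import Optional


def rpc_failure_detection_time_ms(interval_ms: int, threshold: int) -> Optional[int]:
    """Detection time = heartbeat.interval * heartbeat.rpc-failure-threshold.

    Returns None when the mechanism is disabled (threshold == -1).
    """
    if threshold == -1:
        return None
    if threshold < 1:
        # Assumption of this sketch: other non-positive values are rejected.
        raise ValueError("threshold must be -1 or a positive integer")
    return interval_ms * threshold


def max_recommended_threshold(timeout_ms: int, interval_ms: int) -> int:
    """Largest integer threshold not higher than heartbeat.timeout / heartbeat.interval."""
    return timeout_ms // interval_ms


interval_ms = 10_000  # illustrative heartbeat.interval
timeout_ms = 50_000   # illustrative heartbeat.timeout

print(rpc_failure_detection_time_ms(interval_ms, 2))   # 20000
print(max_recommended_threshold(timeout_ms, interval_ms))  # 5
```

With these example values, a dead target is flagged after 20 seconds of failed RPCs, and the threshold could be raised to at most 5 before it exceeds the recommended bound.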

heartbeat.timeout

jobmanager.execution.failover-strategy

| "region" | String | This option specifies how the job computation recovers from task failures. Accepted values are:

'full': Restarts all tasks to recover the job.
'region': Restarts all tasks that could be affected by the task failure. More details can be found in the Flink documentation on task failure recovery.
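The difference between the two accepted values can be illustrated with a toy model. This is a deliberately simplified sketch and not how Flink is implemented: the function `tasks_to_restart` and the region names are hypothetical, regions are supplied as precomputed groups of tasks, and the model does not cover how Flink derives those groups or whether tasks outside the failed task's group are also affected.

```python
from typing import Dict, Set


def tasks_to_restart(strategy: str, failed_task: str,
                     regions: Dict[str, Set[str]]) -> Set[str]:
    """Toy model of jobmanager.execution.failover-strategy.

    'full'   -> every task in the job is restarted.
    'region' -> only the group containing the failed task is restarted
                (simplification: the group is assumed to be the affected set).
    """
    all_tasks: Set[str] = set().union(*regions.values())
    if failed_task not in all_tasks:
        raise ValueError(f"unknown task: {failed_task}")
    if strategy == "full":
        return all_tasks
    if strategy == "region":
        for tasks in regions.values():
            if failed_task in tasks:
                return set(tasks)
    raise ValueError(f"unsupported failover strategy: {strategy}")


# Hypothetical job with two independent groups of tasks.
regions = {"r1": {"source", "map"}, "r2": {"sink"}}

print(sorted(tasks_to_restart("full", "map", regions)))    # ['map', 'sink', 'source']
print(sorted(tasks_to_restart("region", "map", regions)))  # ['map', 'source']
```

In this model, 'region' limits the restart to the tasks grouped with the failed one, while 'full' always restarts the whole job.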