Expert Fault Tolerance Section

docs/layouts/shortcodes/generated/expert_fault_tolerance_section.html

| Key | Default | Type | Description |
| --- | --- | --- | --- |
| cluster.io-pool.size | (none) | Integer | The size of the IO executor pool used by the cluster to execute blocking IO operations (Master as well as TaskManager processes). By default, it uses 4 * the number of CPU cores (hardware contexts) that the cluster process has access to. Increasing the pool size allows more IO operations to run concurrently. |
| cluster.registration.error-delay | 10 s | Duration | The pause made after a registration attempt caused an exception (other than a timeout). |
| cluster.registration.initial-timeout | 100 ms | Duration | Initial registration timeout between cluster components. |
| cluster.registration.max-timeout | 30 s | Duration | Maximum registration timeout between cluster components. |
| cluster.registration.refused-registration-delay | 30 s | Duration | The pause made after a registration attempt was refused. |
| cluster.services.shutdown-timeout | 30 s | Duration | The shutdown timeout for cluster services such as executors. |
| heartbeat.interval | 10 s | Duration | Time interval between heartbeat RPC requests from the sender to the receiver side. |
| heartbeat.rpc-failure-threshold | 2 | Integer | The number of consecutive failed heartbeat RPCs until a heartbeat target is marked as unreachable. Failed heartbeat RPCs can be used to detect dead targets faster because they no longer receive the RPCs. The detection time is heartbeat.interval * heartbeat.rpc-failure-threshold. In environments with a flaky network, setting this value too low can produce false positives. In this case, we recommend increasing this value, but not higher than heartbeat.timeout / heartbeat.interval. The mechanism can be disabled by setting this option to -1. |
| heartbeat.timeout | 50 s | Duration | Timeout for requesting and receiving heartbeats for both sender and receiver sides. |
| jobmanager.execution.failover-strategy | "region" | String | This option specifies how the job computation recovers from task failures. Accepted values are 'full' (restarts all tasks to recover the job) and 'region' (restarts all tasks that could be affected by the task failure). More details can be found here. |
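The default for cluster.io-pool.size is described above as 4 * the number of CPU cores available to the cluster process. The following is a minimal Python sketch of that arithmetic only, not Flink's own code. The function name `default_io_pool_size` is hypothetical, and `os.cpu_count()` is used as a stand-in that reports the machine's logical cores, which may differ from what a containerized process can actually use.

```python
import os


def default_io_pool_size(cpu_cores=None):
    """Return 4 * cpu_cores, the default described for cluster.io-pool.size.

    Hypothetical helper for illustration; not part of Flink.
    """
    if cpu_cores is None:
        # Assumption: approximate "cores the process has access to" with the
        # machine's logical core count. os.cpu_count() may return None.
        cpu_cores = os.cpu_count() or 1
    return 4 * cpu_cores


if __name__ == "__main__":
    print(default_io_pool_size(8))  # 32
```

On an 8-core host this gives a pool of 32 threads. Setting the option explicitly replaces this computed default.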
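The table gives an initial and a maximum registration timeout (100 ms and 30 s by default) but does not state how the timeout grows between the two. The sketch below assumes, purely for illustration, that the timeout doubles after each timed-out attempt and is capped at the maximum. The function name `registration_timeouts_ms` and the doubling rule are assumptions, not documented behavior.

```python
def registration_timeouts_ms(initial_ms=100, max_ms=30_000, attempts=10):
    """List the timeout used for each registration attempt.

    Assumption: the timeout doubles after every timed-out attempt and is
    capped at max_ms. Defaults mirror cluster.registration.initial-timeout
    and cluster.registration.max-timeout from the table.
    """
    timeouts = []
    current = initial_ms
    for _ in range(attempts):
        timeouts.append(current)
        current = min(2 * current, max_ms)  # never exceed the maximum
    return timeouts


if __name__ == "__main__":
    print(registration_timeouts_ms())
    # [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 30000]
```

Under this assumption the cap is reached on the tenth attempt, after which every further attempt waits the full 30 s.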
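The description of heartbeat.rpc-failure-threshold contains two formulas: the detection time is heartbeat.interval * heartbeat.rpc-failure-threshold, and the recommended upper bound for the threshold is heartbeat.timeout / heartbeat.interval. A small Python sketch of both follows, using the defaults from the table. The function names are hypothetical, and rounding the upper bound down to a whole number is an assumption made because the option is an Integer.

```python
def detection_time_s(interval_s=10.0, rpc_failure_threshold=2):
    """Seconds until a target is marked unreachable via failed heartbeat RPCs.

    Returns None when the mechanism is disabled (threshold of -1).
    """
    if rpc_failure_threshold == -1:
        return None
    return interval_s * rpc_failure_threshold


def max_recommended_threshold(interval_s=10.0, timeout_s=50.0):
    """Largest threshold not exceeding heartbeat.timeout / heartbeat.interval."""
    # Assumption: round down, since the option takes an integer value.
    return int(timeout_s // interval_s)


if __name__ == "__main__":
    print(detection_time_s())           # 20.0
    print(max_recommended_threshold())  # 5
```

With the defaults, a dead target is detected after 20 s through failed RPCs, compared with the 50 s heartbeat timeout, and the threshold should not be raised above 5.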