Many transient issues can occur when moving data. One class of issues is an unresponsive Source or Destination. In this case, the fix is often a simple restart.
Airbyte aims to make restarts as automated as possible and uses a heartbeating mechanism to do so.
Airbyte monitors for responses from Sources and Destinations, interpreting these as 'heartbeats'. If a Source or Destination does not heartbeat within a certain time frame, Airbyte triggers a heartbeat error and automatically restarts the job.
Heartbeats are a final catch-all mechanism. Errors are expected to be transient and should resolve automatically. If they do not, it is likely a sign of a more serious issue. In these cases, Airbyte takes the more conservative approach: it restarts the job to avoid a seemingly endless job and to highlight the existence of a potential issue to users.
Possible reasons for a heartbeat error:
In general,
Airbyte Cloud has the same heartbeat monitoring and alerting as Airbyte Open Source.
If these issues show up on Airbyte Cloud,
The platform considers both RECORD and STATE messages emitted by the source as source heartbeats.
The Airbyte platform has a process that monitors when the last heartbeat was sent. If the time since that heartbeat reaches a threshold,
the platform fails the synchronization attempt. The failure is attributed to the source, with the message
"The source is unresponsive." Internally, the error has a heartbeat timeout type, which is not displayed in the UI.
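The source heartbeat check described above can be sketched roughly as follows. This is a minimal illustration, not Airbyte's actual implementation: the class name `SourceHeartbeatMonitor`, its methods, and the injectable `clock` parameter are all hypothetical, and the sketch assumes the timer starts when the monitor is created.

```python
import time
from typing import Callable

# Message types that count as source heartbeats, per the text above.
HEARTBEAT_MESSAGE_TYPES = {"RECORD", "STATE"}


class SourceHeartbeatMonitor:
    """Hypothetical monitor used only to illustrate the idea."""

    def __init__(
        self,
        max_seconds_between_messages: float = 3 * 60 * 60,  # 3 hours
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = max_seconds_between_messages
        self._clock = clock
        # Assumption: counting starts when monitoring begins.
        self._last_beat = clock()

    def on_message(self, message_type: str) -> None:
        # Only RECORD and STATE messages refresh the heartbeat;
        # any other message type is ignored.
        if message_type in HEARTBEAT_MESSAGE_TYPES:
            self._last_beat = self._clock()

    def is_unresponsive(self) -> bool:
        # True once the time since the last heartbeat reaches the threshold.
        return self._clock() - self._last_beat >= self._threshold
```

A caller would invoke `on_message` for every message read from the source and poll `is_unresponsive` periodically. Passing a fake `clock` makes the behavior easy to check without waiting for real time to pass.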
The heartbeat can be configured in the flags.yaml file through two entries:
- heartbeat-max-seconds-between-messages: the maximum time allowed between two messages. The default is 3 hours.
- heartbeat.failSync: if true, a sync fails when a missed heartbeat is detected. If false, no sync fails because of a missed heartbeat. The default value is true.

Adding a heartbeat to the destination similar to the one at the source is not straightforward, since there isn't a constant stream of messages from the destination to the platform. Instead, we have implemented something more akin to a timeout. The platform monitors whether any call to the destination has taken longer than a specified amount of time. If such a delay occurs, the platform considers the destination to have timed out.
The timeout can be configured in the flags.yaml file through two entries:
- destination-timeout-max-seconds: if the platform detects a call to the destination that exceeds this duration, it considers the destination to have timed out. The default timeout value is 24 hours.
- destination-timeout.failSync: if enabled (true by default), a detected destination timeout causes the platform to fail the sync. If disabled, the platform logs a message and allows the sync to continue.

When the platform fails a sync due to a destination timeout, the UI displays the message: "The destination is unresponsive."
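
The destination timeout and its fail-or-log behavior can be illustrated with a small sketch. Everything named here (`DestinationTimeoutMonitor`, `DestinationTimeoutError`, `start_call`, `end_call`, `check`) is hypothetical and chosen for illustration; the sketch only mirrors the behavior described above and does not reflect the platform's real code.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("destination_timeout_sketch")


class DestinationTimeoutError(Exception):
    """Hypothetical error type used only in this sketch."""


class DestinationTimeoutMonitor:
    """Hypothetical tracker for the duration of an in-flight destination call."""

    def __init__(
        self,
        max_seconds: float = 24 * 60 * 60,  # 24 hours
        fail_sync: bool = True,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_seconds = max_seconds
        self._fail_sync = fail_sync
        self._clock = clock
        self._call_started_at: Optional[float] = None

    def start_call(self) -> None:
        # Record when a call to the destination begins.
        self._call_started_at = self._clock()

    def end_call(self) -> None:
        # The call returned, so there is nothing left to time out.
        self._call_started_at = None

    def has_timed_out(self) -> bool:
        if self._call_started_at is None:
            return False
        return self._clock() - self._call_started_at > self._max_seconds

    def check(self) -> None:
        if not self.has_timed_out():
            return
        if self._fail_sync:
            # Mirrors failSync = true: the sync is failed.
            raise DestinationTimeoutError("The destination is unresponsive.")
        # Mirrors failSync = false: log and let the sync continue.
        logger.warning(
            "Destination call exceeded %s seconds; sync continues.",
            self._max_seconds,
        )
```

Unlike the source heartbeat, this check measures how long a single call has been in flight, which is why it works even when the destination emits no steady stream of messages.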