Dapr 1.17.9

docs/release_notes/v1.17.9.md

This update contains the following bug fix:

Workflow retention purge fails on Azure Cosmos DB when customStatus is not persisted

Problem

A completed workflow whose customStatus row does not exist in the actor state store cannot be purged by the retention reminder when the state store is Azure Cosmos DB. The retention reminder fires every second indefinitely and the workflow stays in the Completed state past its configured retention TTL.

Impact

Any deployment using the Azure Cosmos DB state store (state.azure.cosmosdb) for the workflow actor state store is affected whenever a workflow reaches a terminal state without a customStatus row persisted in the store.

This includes:

  • Workflows that were first saved by a daprd version predating customStatus and are now running on an upgraded sidecar.
  • Workflows whose customStatus row was removed out of band (manual cleanup, partial restore from backup, etc.).
  • Workflows scheduled but never advanced past the initial state, so no history delta ever triggered the customStatus upsert.

Visible symptoms include:

  • A completed workflow stays in the Completed state past the configured stateRetentionPolicy.anyTerminal TTL.
  • The scheduler retains the retentioner reminder and re-fires it once per second indefinitely.
  • The dapr_runtime_workflow_operation_count{operation=purge_workflow,status=failed} metric increments once per second per affected workflow.
  • daprd logs the message "failed to invoke scheduled actor reminder named: retention due to: transaction failed" once per second per affected workflow.
  • The workflow recovers only after an operator manually deletes the retentioner job from the scheduler.

Root Cause

GetPurgeRequest in pkg/runtime/wfengine/state/state.go unconditionally emitted a delete for the customStatus key alongside the metadata, inbox, and history deletes, regardless of whether the row was actually persisted.

The Azure Cosmos DB state store (components-contrib/state/azure/cosmosdb) translates state.TransactionalStore.Multi into a single Cosmos transactional batch. Cosmos batches are atomic: if any operation in the batch fails, the whole batch is rolled back, the failing operation reports its own error, and the remaining operations report FailedDependency. A delete for a row that does not exist returns NotFound and therefore aborts the batch. The state component's "tolerate NotFound on etag-less delete" path applies only to single-operation calls, not to batched ones, so the purge transaction was rolled back and the workflow state stayed in place.
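The all-or-nothing behavior described above can be illustrated with a minimal in-memory sketch. This is not the Cosmos SDK or the Dapr component: fakeStore, deleteBatch, errNotFound, and the key names are hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound status returned when a delete
// targets a row that does not exist (hypothetical, for illustration only).
var errNotFound = errors.New("not found")

// fakeStore is a hypothetical in-memory stand-in for a transactional store.
type fakeStore map[string]string

// deleteBatch applies all deletes atomically: if any key is missing,
// the batch fails and no row is removed.
func (s fakeStore) deleteBatch(keys []string) error {
	// Validate every operation first so a failure leaves the store untouched.
	for _, k := range keys {
		if _, ok := s[k]; !ok {
			return fmt.Errorf("delete %q: %w", k, errNotFound)
		}
	}
	for _, k := range keys {
		delete(s, k)
	}
	return nil
}

func main() {
	// The customStatus row was never written for this workflow.
	store := fakeStore{"metadata": "m", "inbox": "i", "history": "h"}

	// Unconditional purge: the missing customStatus row aborts the whole batch.
	err := store.deleteBatch([]string{"metadata", "inbox", "history", "customStatus"})
	fmt.Println("with customStatus delete:", err, "| rows left:", len(store))

	// Purge that skips the missing row: the batch commits.
	err = store.deleteBatch([]string{"metadata", "inbox", "history"})
	fmt.Println("without customStatus delete:", err, "| rows left:", len(store))
}
```

In this sketch the first call leaves all three rows in place, which mirrors why the workflow state survived each purge attempt.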

The retentioner reminder is created with a failure policy of Constant{Interval: 1s, MaxRetries: nil} (retry every second, forever), so the scheduler retried the same doomed batch once per second indefinitely.
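To make the "retry forever" semantics concrete, here is a small sketch of a constant-interval policy in which a nil retry cap means unlimited retries. The constantPolicy type and its nextDelay method are hypothetical and only mirror the field names quoted above; they are not the scheduler's actual types.

```go
package main

import (
	"fmt"
	"time"
)

// constantPolicy is a hypothetical mirror of the failure policy described
// above. A nil MaxRetries means there is no retry cap.
type constantPolicy struct {
	Interval   time.Duration
	MaxRetries *uint32
}

// nextDelay returns the wait before the given retry attempt (1-based)
// and whether that retry is allowed at all.
func (p constantPolicy) nextDelay(attempt uint32) (time.Duration, bool) {
	if p.MaxRetries != nil && attempt > *p.MaxRetries {
		return 0, false
	}
	return p.Interval, true
}

func main() {
	// Same shape as the retentioner reminder: 1s interval, no cap.
	unlimited := constantPolicy{Interval: time.Second}

	// A capped policy, shown only for contrast.
	three := uint32(3)
	capped := constantPolicy{Interval: time.Second, MaxRetries: &three}

	for _, attempt := range []uint32{1, 3, 4, 1_000_000} {
		_, okUnlimited := unlimited.nextDelay(attempt)
		_, okCapped := capped.nextDelay(attempt)
		fmt.Printf("attempt %d: unlimited=%v capped=%v\n", attempt, okUnlimited, okCapped)
	}
}
```

With no cap, a deterministic failure such as the rolled-back purge batch never exhausts its retries, which matches the once-per-second symptom.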

Solution

State now tracks customStatusPersisted as an explicit flag. The flag is set from the bulk-get ETag at load time and updated in ResetChangeTracking to reflect which upserts the most recent save committed. GetPurgeRequest consults this flag and emits the customStatus delete only when the row is known to exist in the store.
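The idea behind the fix can be sketched roughly as follows. This is a heavily reduced illustration, not the code in pkg/runtime/wfengine/state/state.go: the workflowState type, its methods, the customStatusDirty field, and the key names are all assumptions made for the example.

```go
package main

import "fmt"

// workflowState is a hypothetical, heavily reduced stand-in for the
// workflow engine's state type.
type workflowState struct {
	customStatusPersisted bool // true only when the row is known to exist in the store
	customStatusDirty     bool // true when the pending save will upsert customStatus
}

// loadFrom records whether the bulk-get returned an ETag for customStatus.
// Assumption: an absent ETag means the row does not exist.
func (s *workflowState) loadFrom(customStatusETag *string) {
	s.customStatusPersisted = customStatusETag != nil
}

// resetChangeTracking runs after a successful save. If that save upserted
// customStatus, the row now exists in the store.
func (s *workflowState) resetChangeTracking() {
	if s.customStatusDirty {
		s.customStatusPersisted = true
	}
	s.customStatusDirty = false
}

// purgeKeys lists the keys to delete, adding customStatus only when the
// row is known to exist.
func (s *workflowState) purgeKeys() []string {
	keys := []string{"metadata", "inbox", "history"}
	if s.customStatusPersisted {
		keys = append(keys, "customStatus")
	}
	return keys
}

func main() {
	s := &workflowState{}
	s.loadFrom(nil) // the bulk-get returned no ETag for customStatus
	fmt.Println("never persisted:", s.purgeKeys())

	s.customStatusDirty = true // a save upserts customStatus
	s.resetChangeTracking()
	fmt.Println("after upsert:", s.purgeKeys())
}
```

Tracking the observation explicitly, rather than probing the store at purge time, keeps the purge a single batch while still avoiding a delete that the store would reject.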

Workflows already stuck in this state on existing 1.17 deployments recover automatically once the sidecar is upgraded to 1.17.9. The next time the retention reminder fires after the restart, daprd reloads the workflow state from Cosmos, infers from the absent ETag that the customStatus row is missing, and omits the delete, so Cosmos accepts the batch. The reminder drains and the workflow is purged in the normal way. No operator intervention or manual scheduler delete is required after the upgrade.