Dapr 1.17.7

This update contains the following bug fixes:

Workflow GetWorkItems gRPC stream torn down when the history payload exceeds the max body size
Workflows orphaned or not purged after scheduler pod restart under load
Workflow inbox accumulates duplicate completion events under pod migration, driving an SDK spin loop
Sentry fails to start with "unsupported key type" when the issuer key is Ed25519 or RSA
Kafka in-flight pub/sub messages abandoned during graceful shutdown
Kafka bulk subscriber partial batches flushed early after a count-based flush
RabbitMQ pub/sub subscription restart kills sibling subscriptions and amplifies duplicate processing
Long actor drainOngoingCallTimeout blocks placement dissemination and resets the placement stream
Workflow activity body executes multiple times in parallel during scheduler or daprd reconnect
daprd and scheduler crash with "could not determine host IP address" on IPv6-only clusters
No observability for workflow payload proximity to the gRPC max body size before stalling
Scheduler stops delivering job triggers after periodic etcd compaction under workflow load
Workflows freeze in RUNNING after placement rebalance
daprd shuts down and stays unready after a brief scheduler outage

Workflow GetWorkItems gRPC stream torn down when the history payload exceeds the max body size

Problem

When the proto-encoded WorkItem that the orchestrator sends to a connected SDK worker on the GetWorkItems gRPC stream grew larger than the dapr API gRPC server's MaxSendMsgSize (which is the same as --max-body-size, default 4 MiB), the underlying stream.Send returned ResourceExhausted and the entire stream was cancelled. Every other workflow that happened to be pending on the same stream was cancelled along with it, and the SDK reconnected only to repeat the same failure on the same offending workflow.

Impact

Any long-running workflow whose accumulated history (PastEvents + NewEvents + propagated history) crossed the configured --max-body-size could trigger a stream tear-down loop. Visible symptoms included:

The offending workflow appeared frozen with no diagnostic in its history.
Other workflows that shared the worker's stream were repeatedly cancelled mid-execution and replayed.
The SDK logged repeated reconnects to the dapr sidecar.
An activity dispatched with a very large PropagatedHistory could exhibit the same tear-down on the activity work-item path.

Root Cause

Neither the orchestrator nor the durabletask gRPC executor measured the size of the WorkItem proto before pushing it onto the stream. Once the message reached stream.Send, gRPC enforced MaxSendMsgSize and aborted the entire GetWorkItems server stream with ResourceExhausted. Because the failure was at the transport layer, the runtime had no place to record a structured signal back to the user, and there was no terminal state for an orchestration that could not legally be dispatched.

Solution

The orchestrator now precomputes the proto size of the WorkItem it is about to dispatch and compares it to a 95% safety threshold of --max-body-size (the headroom covers the engine's WorkflowStarted event injection plus gRPC framing overhead). If the threshold would be crossed:

For a workflow dispatch, runWorkflow short-circuits before the work item is handed to the durabletask scheduler.
For an activity dispatch, callActivity short-circuits before the activity actor is invoked, and the parent workflow is stalled.

Either path appends an ExecutionStalled event to the workflow's history with the new StalledReason value PAYLOAD_SIZE_EXCEEDED and transitions the workflow into the existing STALLED state. The orchestrator's stallable lock is held until the actor is deactivated, so the next activation re-evaluates: if the operator has purged or terminated the workflow, or restarted daprd with a larger --max-body-size, the workflow resumes; otherwise it re-stalls without disturbing other instances on the stream.

Workflows orphaned or not purged after scheduler pod restart under load

Problem

When a scheduler pod is killed during workflow execution under load, some workflows become orphaned: they remain in RUNNING state with no further execution, or they reach a terminal state but are never purged despite a configured retention policy. dapr workflow history shows nothing abnormal, execution simply stops. dapr workflow list reports the affected completed workflows as much older than the configured retention window.

Impact

Any deployment running workflows with a multi-replica scheduler is affected when scheduler pods restart during load. This is most visible during routine operations such as Kubernetes rolling updates, node drains, or OOM-driven scheduler restarts.

Root Cause

The actor state-store transaction that persists workflow state was not coordinated with the gRPC call that registers the corresponding wake-up reminder in the scheduler service. These are two independent operations against two different systems with no atomic boundary between them.

When a scheduler pod was killed mid-RPC, the state save had typically completed and the reminder Create was lost. The reminder failure policy retries an already-persisted reminder forever; it cannot recover a reminder whose Create RPC never reached durable storage.

For completed workflows, the retention path was particularly fragile: the workflow's firing reminder was deleted before the retention reminder was created. If the retention Create then failed, no reminder remained to drive a retry, leaving the workflow terminal-but-not-purged.

Solution

Three changes close the loss windows:

In-process retry on reminder creation. Every reminder Create now retries with bounded exponential backoff (up to 60 seconds total) before returning to the caller. Retries reuse the same reminder Name; the scheduler's overwrite-by-name semantics keep them idempotent. A typical scheduler-pod failover completes in seconds, so the retry transparently heals the failure without surfacing it to the workflow.
Retention reminder created before deletion. In the completion path, the retention reminder is now registered before the workflow's own reminders are deleted. If the retention Create still fails after the in-process retry, the firing reminder remains alive and its failure-policy retry brings execution back to the completion path.
Idempotent retention recovery on re-fire. When a reminder fires for a workflow whose state is already terminal but whose inbox is empty, the runtime now re-issues the retention reminder Create. The retention reminder name is deterministic, so this is a safe overwrite rather than a duplicate. This recovers workflows whose completion was persisted in a prior run but whose retention reminder Create was lost.

The retention reminder's due time is now anchored to the workflow's actual completion time rather than the moment of the Create call, so retries converge on a single reminder at a stable due time instead of pushing retention back on every retry.

Workflow inbox accumulates duplicate completion events under pod migration, driving an SDK spin loop

Problem

When the workflow actor on one pod was cancelled mid-flight (typically during a rolling deployment) after dispatching an activity but before its state save committed, the activity actor still completed normally and posted its TaskCompleted event back to the workflow actor's inbox. On the next workflow activation, the orchestrator re-yielded the same ScheduleTask because its replay state did not yet reflect the dispatch, so the activity actor ran a second time and posted a second TaskCompleted for the same taskScheduledId. The same shape applied to TaskFailed, TimerFired, and child-workflow completions delivered through the inbox.

The language SDK's process_event handlers for these event kinds silently return when no matching pending task is found, producing zero new actions, so dapr re-fired the wake-up reminder against the same un-cleared inbox and the cycle repeated.

Impact

Any deployment running workflows whose hosting pods are restarted during load is affected. This is most visible during routine operations such as Kubernetes rolling updates or node drains.

Visible symptoms include:

A workflow appears stuck in RUNNING while its persisted history grows steadily with full activity payloads.
Sidecar logs show repeated dropping duplicate event: executionStarted warnings on the dapr side, paired with thousands of Ignoring unexpected taskCompleted event with ID = N warnings on the SDK side for the same instance.
An activity executes more times than the workflow function calls it, because the activity actor re-runs each time the orchestrator re-yields the schedule.

Root Cause

Two layers were missing safeguards.

First, the workflow actor's addWorkflowEvent (the inbox-write boundary called by the activity actor and by sub-workflow completion delivery) did not deduplicate task-resolution events. A redelivered completion was appended to the inbox, persisted, and a new wake-up reminder was created, even when the same resolution was already committed to history or queued in the inbox from an earlier delivery.

Second, the orchestrator's callActivities did not check whether the activity it was about to dispatch had already resolved. When the orchestrator re-yielded a ScheduleTask because its replay state was missing the corresponding TaskScheduled (e.g. after a partial save was lost on cancellation), the activity actor was invoked again, ran the activity body again, and posted yet another TaskCompleted to the inbox. The two layers compounded: the inbox grew because the dispatch produced new completions, the orchestrator re-ran because the inbox grew, and the SDK silently spun on the unmatched events.

Solution

Two complementary checks were added in the workflow actor, both backed by a shared dedup helper:

Inbox-write dedup in addWorkflowEvent. A TaskCompleted / TaskFailed / TimerFired / ChildWorkflowInstance{Completed,Failed} whose correlator (taskScheduledId or timerId) already appears in either state.History or state.Inbox is dropped before it reaches state.AddToInbox, the transactional save, and the new-event reminder. EventRaised and ExecutionTerminated are intentionally excluded: EventRaised is a user signal that may legitimately repeat, and ExecutionTerminated is idempotent.
Dispatch-skip in callActivities. Before invoking the activity actor for a TaskScheduled, the workflow actor checks whether a matching TaskCompleted or TaskFailed for the same taskScheduledId is already in state.History or state.Inbox. If it is, the dispatch is suppressed; the orchestrator's stale re-yield no longer triggers a second activity run.

The underlying engine in durabletask-go was hardened in lockstep: runtimestate.AddEvent now also rejects a resolution event whose correlator is already present, providing defence in depth for any caller that bypasses the actor's inbox-write path. The Stalled-clear logic runs only on a successful add, so a duplicate-rejection error preserves a prior stalled state.

After upgrading, persisted histories from older daprd versions that already accumulated duplicates are silently truncated on next workflow load (the duplicate entries are not re-added to the in-memory OldEvents), so the upgrade is one-way for that state.

Sentry fails to start with "unsupported key type" when the issuer key is Ed25519 or RSA

Problem

Operators who downgraded a control plane from 1.18 back to 1.17 saw dapr-sentry crash on startup with:

fatal: error creating CA: failed to get CA bundle: failed to verify CA bundle: unsupported key type ed25519.PrivateKey

The same failure mode also rejected RSA-keyed issuer bundles. The crash is hit before sentry serves any traffic, so every sidecar that depends on sentry for its identity certificate stops being able to obtain or rotate one.

Impact

Any 1.17 control plane whose dapr-trust-bundle secret was generated by, or migrated through, a newer Dapr release that issues Ed25519 (or RSA) issuer keys is affected. In practice this includes:

Downgrade from Dapr 1.18 to 1.17 against the same cluster.
Existing 1.17 deployments where the issuer key was rotated or replaced with an Ed25519 / RSA key by the operator.

Sentry crash-loops, no new mTLS identities are issued, and existing certificates are not rotated. Sidecars whose certs have not yet expired keep working; sidecars that come up fresh, restart, or hit cert expiry start failing to obtain identities.

Root Cause

dapr/kit's crypto/pem.EncodePrivateKey (used by sentry to re-encode the issuer key it just decoded from the trust bundle) only matched *ecdsa.PrivateKey and *ed25519.PrivateKey in its type switch. ed25519.PrivateKey is itself a []byte alias rather than a struct, so the *ed25519.PrivateKey case never matched a real Ed25519 key. RSA private keys were never listed at all.

When sentry called EncodePrivateKey on an Ed25519 or RSA issuer key it fell through to the default branch and returned unsupported key type %T, which the CA initialiser surfaced as a fatal error.

Solution

dapr/kit's EncodePrivateKey now matches ed25519.PrivateKey (value form) and *rsa.PrivateKey alongside *ecdsa.PrivateKey. All three round-trip through PKCS#8 unchanged. Dapr 1.17.7 picks up this fix by bumping github.com/dapr/kit to v0.17.1, which also includes table-driven roundtrip tests for ECDSA P-256, RSA-2048, and Ed25519 to guard the regression.

No operator action is required beyond upgrading sentry to 1.17.7. Existing trust bundles are read as-is; the issuer key is not regenerated.

Kafka in-flight pub/sub messages abandoned during graceful shutdown

Problem

When a sidecar received SIGTERM, Kafka pub/sub subscriptions tore down their consumer group session before the messages already fetched from the broker had been delivered to the application. The contrib retry loop observed context canceled, the runtime logged Too many failed attempts at processing Kafka message ... Error: context canceled, and the broker handed the same offsets to whichever consumer won the rebalance.

Impact

Any deployment running Kafka pub/sub through a multi-replica subscriber was affected on rolling restarts, node drains, or any other graceful-shutdown event. Visible symptoms included:

Repeated Too many failed attempts at processing Kafka message and kafka: tried to use a consumer group that was closed errors during shutdown.
The application processed the same message twice across pods (once via a partial in-flight call that got cancelled, again after rebalance).
Latency-sensitive workloads (e.g. financial transactions) experienced retry-driven tail latency on every pod restart.

Root Cause

The runtime's Subscription.Stop() set its closed flag immediately on entry, which caused the handler closure to reject any further deliveries from contrib with errors.New("subscription is closed"). Contrib treated that as an error and retried inside an already-closing session, eventually giving up and surrendering the partition to the rebalance.

The "in-flight" definition was also too narrow: only handlers already inside the closure counted, while messages that contrib had pulled from the broker but not yet handed to the handler were considered absent and got the rejection path.

Solution

A new pubsub.PausableSubscriber capability lets the runtime ask a component to stop fetching from the broker without tearing down the consumer group session. On graceful shutdown the runtime now:

Pauses the underlying component (Kafka's implementation calls Sarama's PauseAll, which stops broker fetches while keeping the session and partition assignments alive).
Leaves closed=false during a bounded drain window so handlers continue delivering buffered messages to the application via postman.
Polls an inflight counter with a stable-quiet predicate (100 ms of consecutive zero readings on the paused path) so the drain does not seal in the sub-millisecond gap between handler return and the next claim-buffer read.
Caps the drain at 30 seconds so a misbehaving application that keeps returning RETRY cannot block StopAllSubscriptionsForever and prevent the block-shutdown timer from starting.
On ceiling hit, force-cancels the subscription context so stuck handlers' HTTP/gRPC calls error out via context propagation rather than running indefinitely.
Falls back to the previous close-first behavior for non-pausable components and non-graceful Stop calls.

The components-contrib Kafka component additionally gates consumerGroup.Close() on the last subscription exiting (so multi-topic pubsubs no longer race a sibling subscription's reload into a closed group) and demotes the Too many failed attempts log to debug when the cause is shutdown rather than real retry exhaustion.

Kafka bulk subscriber partial batches flushed early after a count-based flush

Problem

When a Kafka bulk subscriber's buffer filled to maxMessagesCount and was flushed before its maxAwaitDurationMs window had elapsed, the await ticker continued firing on its original schedule. Any subsequent partial batch was then flushed within (often well under) one period of the count-based flush instead of waiting for a fresh maxAwaitDurationMs window from the moment the buffer was last drained.

Impact

Any deployment using Kafka bulk pub/sub subscriptions with both maxMessagesCount and maxAwaitDurationMs configured was affected. Visible symptoms included:

Partial batches delivered to the application much sooner than the configured maxAwaitDurationMs after a count-based flush.
Effective batch sizes lower than expected during steady-state traffic, because the await window was shortened by however much of the original window had already elapsed before the count threshold was hit.
Workloads tuned to amortize per-batch overhead (large bulk handlers, batched downstream writes) seeing more invocations than the configuration implied.

Root Cause

In ConsumeClaim, the bulk path used a single time.Ticker constructed from maxAwaitDurationMs to trigger time-based flushes. When the count threshold (len(messages) >= maxMessagesCount) was reached and flushBulkMessages was called, the ticker was not reset. The next tick still fired at its original wall-clock schedule, so a partial batch arriving just after a count-flush was eligible for flush after only the residual portion of the original ticker period rather than a full maxAwaitDurationMs.

Solution

After a count-based flush in ConsumeClaim, the await ticker is now reset to a fresh maxAwaitDurationMs window via ticker.Reset, anchoring the next time-based flush to the moment of the count-flush. Go 1.23+ guarantees that Ticker.Reset discards any tick that was queued before the call, so no stale tick can fire immediately after the reset and short-circuit the new window. Partial batches now consistently wait a full maxAwaitDurationMs after the most recent flush, regardless of whether that flush was triggered by the count threshold or the timer.

RabbitMQ pub/sub subscription restart kills sibling subscriptions and amplifies duplicate processing

Problem

On a pubsub.rabbitmq/v1 subscription restart — most commonly triggered by an app health check transitioning from unhealthy back to healthy, but also by any streaming-subscription stop/start or runtime-driven re-subscribe path — the RabbitMQ broker rejected the new subscription with AMQP 530 NOT_ALLOWED - attempt to reuse consumer tag '<queue-name>'. Because all RabbitMQ subscriptions on a single daprd shared the same broker connection state, that rejection cascaded into every sibling subscription, tearing them down even when their own subscriptions were healthy.

Impact

Any deployment running pubsub.rabbitmq/v1 with two or more subscriptions on a single daprd was affected on any event that triggered a subscription restart. The most common triggers were Dapr app health-check flaps, streaming subscription stop/start, and bursts of dapr-scheduler or dapr-placement re-elections that knocked the AMQP connection over.

Visible symptoms included:

Repeated NOT_ALLOWED - attempt to reuse consumer tag '<queue-name>' lines in daprd logs immediately followed by channel/connection is not open errors on sibling subscriptions.
A leaking RabbitMQ connection count (the reconnect counter stepping up multiple times per minute) alongside cross-topic Exception (503) unexpected command received lines in the same log window.
Logs in which an error attributed to one topic referenced a different topic's queue name — a side-effect of the cascading tear-down on a shared broker connection, not a Dapr binding mismatch.
Duplicate processing of in-flight messages. RabbitMQ + Dapr pub/sub is at-least-once: every cascading tear-down opens a redelivery window for messages whose handler completed but whose ack did not reach the broker before the connection died. This bug expanded that window from "real broker or network failures" to "any subscription restart", so handlers without idempotent side-effects observed the same message processed two or more times during a storm.

Root Cause

The RabbitMQ pub/sub component reused the same broker-side consumer identifier for every subscription attempt on a given queue, and did not explicitly tell the broker to release the prior registration when a subscription restarted. The second registration therefore collided with the still-live first one on the broker, which raised a connection-level exception that disrupted every other RabbitMQ subscription sharing that connection.

Solution

The RabbitMQ pub/sub component now uses a unique broker-side consumer identifier per subscription attempt and explicitly releases the prior registration with the broker on subscription teardown. A subscription restart no longer collides with a stale registration, sibling subscriptions on the same connection keep running, and the amplified redelivery window closes back to "real connection failures only".

Application code does not need to change. Handlers should remain idempotent because Dapr + RabbitMQ pub/sub is at-least-once even with this fix in place — a real broker or network failure between handler completion and ack will still cause a redelivery — but the everyday subscription-restart path no longer manufactures redeliveries.

Long actor drainOngoingCallTimeout blocks placement dissemination and resets the placement stream

Problem

When an application supplied an actor drainOngoingCallTimeout that met or exceeded the daprd-side placement dissemination timeout (default 30 seconds), drain held the placement LOCK -> UPDATE -> UNLOCK round long enough for daprd to reset its own placement stream. The most visible trigger was the Python SDK shipping a 60-second drain default while the runtime defaulted dissemination to 30 seconds, but any explicit override above the dissemination budget produced the same behaviour.

In addition, per-actor-type drainOngoingCallTimeout configured via entitiesConfig was parsed but never applied: only the top-level drainOngoingCallTimeout reached the placement layer, so per-type values silently fell back to the global drain.

Impact

Routine placement disseminations (a peer joining or leaving) ran for the full configured drain, exceeding the dissemination budget.
daprd reset its placement stream mid-round, halting hosted actors until the stream re-established and the next dissemination round completed.
Per-type drain settings configured by users had no effect on actual drain behaviour.

Root Cause

The configured drain timeout was passed straight into placement.SetDrainOngoingCallTimeout with no validation against the dissemination budget; daprd then waited the full drain in inflight.CancelClaimsForTypes during the UPDATE phase. On the per-entity path, api.TranslateEntityConfig parsed drainOngoingCallTimeout into the domain EntityConfig but actors.RegisterHosted only forwarded the global value to placement.

Solution

api.ClampDrainOngoingCallTimeout now bounds the configured drain at registration time. When drain meets or exceeds the dissemination timeout, the runtime logs a warning naming the source (global config or entities=<types>) and clamps the value to disseminationTimeout * 0.8, floored at DefaultOngoingCallTimeout (2 seconds). The clamp applies to both the global drain in actors.RegisterHosted and the per-entity drain in api.TranslateEntityConfig; the inverted-conditional bugs in TranslateEntityConfig are also corrected as part of this change.

Per-actor-type drainOngoingCallTimeout is now wired through to placement. A new placement.SetEntityDrainOngoingCallTimeouts plumbs an entityDrainTimeouts map onto Inflight, and the lock loop's handleCancelTypes groups in-flight claims by actor type and drains each type in parallel against its own timer, falling back to the global drain when no per-type override is set. Total drain wall-clock for a round is therefore bounded by the largest per-type drain rather than the sum.

Workflow activity body executes multiple times in parallel during scheduler or daprd reconnect

Problem

When the gRPC WatchJobs stream between daprd and the scheduler reconnected while an activity was in flight (typical triggers: a scheduler pod restart, a daprd pod restart, an application pod restart, or a network blip), the same activity reminder could be redelivered several times in a tight burst before the connection stabilised. Each redelivery that reached daprd's reminder handler dispatched a fresh ActivityWorkItem to the application's SDK worker, so the activity callback ran 2..N times concurrently for a single scheduled activity.

The workflow's logical state remained correct because the orchestrator deduplicated the resulting TaskCompleted events, but the user-facing application code observed real parallel executions of an activity that the workflow had only scheduled once.

Impact

Any deployment whose application or scheduler pods restart during workflow activity execution is affected. This is most visible during routine operations such as Kubernetes rolling updates, node drains, or OOM-driven restarts, and during transient scheduler-daprd network instability.

Visible symptoms include:

The same activity body running concurrently in the application, sometimes mutating the same downstream resources multiple times.
SDK telemetry showing repeated activity invocations (often with error: operation aborted) for a single workflow taskScheduledId.

Root Cause

The activity actor lock is context-scoped: when the gRPC stream that delivered the activity reminder dropped, the actor's reminder context was cancelled, the in-flight executeActivity returned ctx.Err(), and the actor lock released. The next reminder retry could then acquire the lock and call a.scheduler(ctx, wi) again, pushing a second ActivityWorkItem into the durabletask worker queue. The work-item worker had no per-activity dedup, so each work item was streamed to the SDK and ran the activity callback.

Solution

The activity actor now keeps an in-memory inflight guard per (activityActorID, taskExecutionId). The first reminder for a given activity scheduling becomes the owner: it dispatches the work item, waits on the SDK callback, publishes the result back to the workflow actor, and finalises the inflight entry. Concurrent reminder retries become followers: they observe the owner's outcome via the inflight entry and surface that outcome to their caller, so the scheduler acks SUCCESS for each retry without re-dispatching the activity to the SDK.

The cache key includes the durabletask TaskExecutionId so that retries of the same scheduling share the entry while a new workflow run that reuses the same instance ID (for example after a Purge followed by a fresh ScheduleNewWorkflow) gets a fresh entry. Cached outcomes are released after a 60 second TTL.

daprd and scheduler crash with "could not determine host IP address" on IPv6-only clusters

Problem

GetHostAddress() only worked on IPv4 networks: it dialed 8.8.8.8 (IPv4 only) and the fallback only accepted addresses where To4() != nil. On IPv6-only clusters, daprd and the scheduler crashed at startup with "could not determine host IP address".

Impact

Any deployment running on an IPv6-only network was unable to start daprd or the scheduler.

Root Cause

The UDP dial target was hardcoded to 8.8.8.8:80, which requires IPv4 connectivity. The fallback interface enumeration filtered for To4() != nil, excluding all IPv6 addresses. Several downstream call sites also built host:port strings with plain concatenation (host + ":" + port), which produces invalid addresses for IPv6 literals (missing brackets).

Solution

Replaced the hardcoded 8.8.8.8 probe with UDP dials to RFC documentation addresses (192.0.2.1 for IPv4 and 2001:db8::1 for IPv6). The interface fallback now considers both IPv4 and IPv6 addresses with priority ordering: public IPv4, IPv6 GUA, private IPv4/CGNAT, IPv6 ULA, and link-local as last resort.

Call sites that consume the address were updated to use net.JoinHostPort (actor placement, IsActorLocal) and RFC 7239 IPv6 quoting in the Forwarded header (direct messaging).

No observability for workflow payload proximity to the gRPC max body size before stalling

Problem

The orchestrator stalls workflows whose payloads would exceed 95% of the dapr API gRPC server's --max-body-size on the way to the SDK (the PAYLOAD_SIZE_EXCEEDED precheck added earlier in this release). Operators had no way to see how close payloads were to that threshold until a workflow actually tripped it. By then the workflow was stuck in STALLED and the only remediation was to raise --max-body-size and restart daprd, or force purge the workflow all together.

Impact

Capacity planning was reactive: operators could not predict which workflows or activities were trending toward the stall before they tripped.
Post-incident triage required reading sidecar logs to confirm that the stall fired on a genuinely oversize payload rather than another failure mode.
Fleets running the same workflow with different --max-body-size settings could not be compared on a single dashboard.

Root Cause

The precheck computed proto.Size(...) of the workflow and activity payloads inline with the stall decision, but the value was never exported. There was no metric describing the size distribution of payloads sent to the SDK.

Solution

Two new histograms are exported from DefaultWorkflowMonitoring, recorded from the same code path that performs the stall precheck, so the value fed into the threshold comparison is the value exported as a metric:

dapr_runtime_workflow_payload_size_ratio, tagged by app_id, namespace, workflow_name. Recorded once per workflow dispatch.
dapr_runtime_workflow_activity_payload_size_ratio, tagged by app_id, namespace, workflow_name, activity_name. Recorded once per activity dispatch, including the dispatch that trips the stall.

Both express payload size as a fraction of the configured --max-body-size rather than absolute bytes. The ratio is portable across daprds with different --max-body-size settings, scales beyond the absolute-size distribution's 4 GiB ceiling, and makes proximity-to-stall queries trivial: histogram_quantile(0.99, ...) > 0.95 fires before the next stall.

Buckets are concentrated near the 0.95 stall threshold (0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 1.0, 1.5, 2.0). Recording is skipped when --max-body-size is not configured, since the ratio is undefined without a limit.

Scheduler stops delivering job triggers after periodic etcd compaction under workflow load

Problem

Under sustained workflow load, the scheduler service would intermittently stop delivering job triggers to connected daprds. The pod stayed healthy: its embedded etcd still served reads and writes, its gRPC API still returned, and the leader did not change. The only observable difference was that new jobs (workflow timers, activity dispatches) created on the scheduler were never streamed to the daprds that had subscribed for them. The only known remediation was to restart the scheduler pod.

Impact

Any deployment running workflows at a sufficiently high job-churn rate to outrun the embedded etcd's compaction retention could be affected. This was most visible on long-running scheduler pods with many concurrent workflows; short-lived deployments or low-throughput jobs rarely tripped it.

Visible symptoms included:

Workflows that scheduled correctly but never fired their next activity or timer.
dapr workflow history showing the orchestration paused at a TaskScheduled with no corresponding TaskCompleted.
The scheduler's etcd metrics showing normal write throughput while the informer goroutine was idle.

Root Cause

The scheduler's job pipeline is driven by an informer that opens a single long-lived etcd Watch against the jobs prefix and re-emits each event onto an internal channel for the queue to consume.

Three weaknesses in that watch interacted under workflow write rates:

The watcher's tracked revision only advanced when an event actually arrived. With periodic compaction at a 10-minute retention, an idle watcher's cursor could fall behind the compaction window; the next event then returned ErrCompacted rather than being delivered.
The receive arm of the informer's select only ranged over WatchResponse.Events. It did not distinguish a closed channel from a delivery, and never inspected Canceled, Err(), or CompactRevision on the response. When any of those happened (a compaction race, a server-side cancel, a fatal client error), the goroutine kept selecting against a dead channel and silently delivered nothing.
Several embedded etcd defaults were inherited from etcd's general-purpose settings rather than tuned for the high-write, high-watch profile workflows produce: SnapshotCount=10000 (high WAL/snapshot churn), BackendBatchInterval=50ms / BackendBatchLimit=5000 (more frequent fsyncs than necessary), 10-minute periodic compaction (unpredictable MVCC growth), and an unset MaxTxnOps (etcd default of 128, below the size of fan-out transactions).

Solution

The scheduler's informer watch is now hardened against all three weaknesses:

The watch is opened with WithProgressNotify, so the etcd client advances the watcher's internal revision cursor on idle watches and resumes from a fresh revision on transient reconnects, keeping the cursor inside the compaction retention window in the common case.
The informer's receive arm now distinguishes a closed channel via the receive's ok value, and explicitly checks Canceled, Err(), and CompactRevision on each WatchResponse. Any of these now logs and returns, unwinding to the cron leadership loop which performs a fresh SyncBase + SyncUpdates cycle instead of leaving a silently-dead watcher in place.

The scheduler's embedded etcd defaults are retuned for the workflow read/write profile:

Setting	Old	New
`--etcd-snapshot-count`	`10000`	`100000`
`--etcd-compaction-mode`	`periodic`	`revision`
`--etcd-compaction-retention`	`10m`	`1000000`
`--etcd-backend-batch-limit`	`5000`	`10000`
`--etcd-backend-batch-interval`	`50ms`	`100ms`
`--etcd-max-txn-ops` (new flag)	(etcd's `128`)	`10000`

CLI flag defaults and helm chart values are kept in sync; the StatefulSet passes every flag explicitly so K8s and standalone scheduler behave the same.

The helm chart's cluster.storageSize default is also raised from 1Gi to 16Gi for fresh installs. Because StatefulSet.spec.volumeClaimTemplates is immutable in Kubernetes, the chart uses a new dapr_scheduler.storageSize helper that detects the live StatefulSet via lookup and pins storage to the existing PVC size on upgrade; fresh installs and offline helm template fall through to the new default. Existing 1Gi clusters are not disrupted; operators wanting more space on an existing install must expand the PVC directly on the cluster.

Workflows freeze in RUNNING after placement rebalance

Problem

When placement rebalances actors (during a placement pod restart, leader election, node drain, or network blip), two daprds can briefly believe they each own the same workflow actor. Each daprd loaded the workflow's state, appended its own changes locally, and saved back to the state store without coordinating with the other. Whichever daprd committed second races with the first's history rows at the same offset. The workflow was left with a TaskScheduled but no matching TaskCompleted: its history was missing the event that the activity had already produced.

Impact

Any deployment running workflows on a multi-replica daprd is affected whenever placement loses its leader. This is most visible during routine operations such as placement pod rolling updates, node drains, OOM-driven restarts, or transient daprd-placement network instability.

Even after placement stabilises, the affected workflow does not recover.

Visible symptoms include:

dapr workflow list shows a workflow stuck in RUNNING with no recent activity.
dapr workflow history shows the orchestration paused on a TaskScheduled with no following TaskCompleted.
The activity body itself ran successfully and its result reached addWorkflowEvent; the row only disappeared after the colliding save.
dapr scheduler list shows no pending reminders for the affected workflow, because the wake-up reminder was consumed by the colliding save's transaction.
The only known remediation was to terminate and restart the workflow.

Root Cause

The workflow state save path performed a blind upsert on every row. When daprd A loaded at row version E1, modified state locally, and saved, the state store bumped the version to E2. If daprd B's load began before daprd A's commit was visible, B's state-store snapshot also saw E1; its later save wrote without a version check. Both daprds computed history-NNNNNN offsets against the same history each thought was authoritative, so the two writes targeted the same row. The later write replaced the earlier write byte-for-byte, dropping whatever event the earlier daprd had appended at that offset.

Solution

Workflow state saves now use optimistic concurrency on a single version anchor: the metadata row's state-store ETag.

LoadWorkflowState captures the metadata row's ETag returned by the state store on every load and caches it alongside the rest of the state.
GetSaveRequest attaches that ETag to the metadata TransactionalUpsert inside the transactional save request. Other rows in the same transaction (history events, custom status, signatures) stay blind, but the state-store Multi is atomic, so a mismatch on metadata aborts the entire transaction without touching any row.
When a save returns ETagMismatch, the orchestrator invalidates its cached state and surfaces the error as recoverable. The existing reminder-retry path re-fires the operation, the next load reads fresh state with the new ETag, and the retry succeeds against the up-to-date base state.
After a successful save, the orchestrator refreshes the metadata ETag via a single-key Get so the next save has the correct token without paying for a full state reload.

daprd shuts down and stays unready after a brief scheduler outage

Problem

When the scheduler control plane was unreachable at the moment daprd opened (or re-opened) its WatchHosts stream, the first response from the scheduler could be a gRPC error code such as Unavailable, Internal, or a server-side Canceled rather than a hosts list. The runtime's WatchHosts loop returned that error directly instead of treating it as a transient peer issue, so the failure propagated up the runtime's top-level RunnerCloserManager. Because that manager treats any runner exit as terminal for the whole process, daprd then tore down the actor runtime, the workflow backend, and the gRPC servers.

If a closer on that path was waiting on a control-plane peer that was also still unreachable (placement, the state store, an actor halt), the shutdown did not complete cleanly. daprd stayed in the process table with its readiness probe failing on dapr is not ready: [grpc-internal-server grpc-api-server], but never exited cleanly enough for Kubernetes to restart it on its own.

Impact

Any deployment whose scheduler pods can experience a brief outage at the same time daprd happens to re-open its WatchHosts stream was affected. This was most visible during routine operations that disrupt the scheduler and placement control planes simultaneously, including rolling control-plane updates, node drains, OOM-driven control-plane restarts, and transient daprd-control-plane network instability.

Visible symptoms included:

daprd logging Actor runtime shutting down, Placement client shutting down, and Dapr is shutting down shortly after the control plane became reachable again, despite no SIGTERM or explicit shutdown trigger from the operator.
Sidecar readiness probes failing with dapr is not ready: [grpc-internal-server grpc-api-server] and not recovering.

Root Cause

pkg/runtime/scheduler/internal/watchhosts/watchhosts.go only recognised two gRPC status codes from the first Recv() on a freshly opened WatchHosts stream: Unimplemented (the old-server compatibility path) and Canceled (which it interpreted as the daprd-side context being cancelled). Every other code returned from the function.

Scheduler.Run wrapped that loop in a plain RunnerManager, so the error reached the runtime's top-level RunnerCloserManager unmodified. The top-level manager treats any runner exit as terminal: every sibling runner was cancelled (actors.Run, wfengine.Run, jobsManager.Run, both gRPC servers), and the closer chain ran before the control plane finished stabilising. If the state store, placement loop, or actor table closers blocked waiting on a peer that was also still unreachable, the process stalled with its servers already torn down.

Solution

WatchHosts.Run now treats any non-Unimplemented error from the first Recv() as a transient peer issue. The connection is closed and the loop reconnects after a one second pause, matching the existing behaviour for errors observed on the long-lived second Recv(). The loop only returns when daprd's own context is cancelled (real shutdown), so a brief scheduler outage no longer cascades into a runtime tear-down.