Dapr 1.17.4

This update contains bug fixes:

Pulsar pub/sub ignores processMode from component metadata and lacks async backpressure

Problem

The Pulsar pub/sub component ignored the processMode parameter when set in component metadata (YAML). The parameter was only read from subscription request metadata, so users who configured processMode: async or processMode: sync in the component YAML were silently running in the default mode. Additionally, async mode spawned an unbounded number of goroutines per message with no concurrency limit.

Impact

Applications that configured processMode in the Pulsar component YAML were not running in the expected processing mode. Users who set processMode: sync thinking they had synchronous, ordered processing were actually running in async mode.

In async mode, every incoming message spawned a new goroutine with no upper bound. Under high message rates, this caused unbounded unacked messages (~30k observed in production), excessive memory usage, and potential OOM crashes. The maxConcurrentHandlers metadata field controlled a channel buffer size but did not limit actual concurrent goroutines.

Root Cause

The processMode field was missing from the pulsarMetadata struct, so it was never parsed from component metadata. It was only read from the per-subscription request metadata, which most users do not set.

In async mode, a shared err variable across goroutines caused a data race, and maxConcurrentHandlers set to 0 caused a deadlock instead of falling back to a default value.

Solution

The processMode parameter is now correctly read from component metadata, with per-subscription metadata able to override it. Invalid values are rejected at initialization time.

Async mode now enforces a concurrency limit that applies backpressure when all handler slots are full, preventing unbounded goroutine growth. Setting maxConcurrentHandlers to 0 falls back to the default (100) instead of deadlocking.

Additionally, a data race in async mode was fixed, and graceful shutdown now waits for in-flight handlers before returning.
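A minimal sketch of the bounded-concurrency pattern described above, using a semaphore channel to cap in-flight handlers and block (apply backpressure) when all slots are busy. The names `handleAsync`, `handler`, and `defaultMaxConcurrentHandlers` are illustrative, not the component's actual identifiers.

```go
package pubsub

import (
	"context"
	"sync"
)

const defaultMaxConcurrentHandlers = 100

// handleAsync caps the number of concurrent message handlers with a semaphore
// channel. Acquiring a slot blocks when all slots are in use, so the consumer
// loop stops pulling new messages (backpressure) instead of spawning an
// unbounded number of goroutines.
func handleAsync(ctx context.Context, messages <-chan []byte, maxConcurrentHandlers int, handler func(context.Context, []byte) error) {
	if maxConcurrentHandlers <= 0 {
		// Fall back to the default instead of deadlocking on a zero-capacity channel.
		maxConcurrentHandlers = defaultMaxConcurrentHandlers
	}
	sem := make(chan struct{}, maxConcurrentHandlers)

	var wg sync.WaitGroup
	for {
		select {
		case <-ctx.Done():
			wg.Wait() // graceful shutdown: wait for in-flight handlers
			return
		case msg, ok := <-messages:
			if !ok {
				wg.Wait()
				return
			}
			sem <- struct{}{} // blocks when all handler slots are full
			wg.Add(1)
			go func(m []byte) {
				defer wg.Done()
				defer func() { <-sem }()
				// Each goroutine keeps its own error value, avoiding a shared-variable data race.
				err := handler(ctx, m)
				_ = err // the real component would nack or log here
			}(msg)
		}
	}
}
```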

Cross-app workflow stuck in PENDING when the first action is a remote activity or child workflow call to an offline app

Problem

When scheduling a workflow whose first action is a remote activity call (using TargetAppId) or a remote child workflow call (using SubOrchestratorAppID) to an application that is not yet online, the workflow gets stuck in PENDING state indefinitely. The ScheduleNewWorkflow API call blocks until the remote application becomes available.

If any local action (such as a timer, a local activity, or a local child workflow) is placed before the remote call, the workflow transitions to RUNNING immediately as expected and waits gracefully for the remote application.

Impact

Any cross-app workflow deployment where the workflow application starts before the remote activity or child workflow application is affected. This is common during:

  • Rolling deployments where the remote application takes longer to become ready than the workflow application.
  • Scale-from-zero scenarios where the remote application has not yet started.
  • Microservice architectures where service startup order is not guaranteed.

The blocked ScheduleNewWorkflow call creates back-pressure on the calling application. The workflow cannot make progress and its status remains PENDING, even though the workflow logic itself is valid.

Root Cause

In runWorkflow(), activities and child workflow messages are dispatched to their target applications before the workflow state is saved to the state store. When the target application is offline, these dispatch calls block waiting for the remote app to become reachable. Because the state is never saved, the OrchestratorStarted event that transitions the workflow from PENDING to RUNNING is never persisted. The workflow appears stuck in PENDING, and the start reminder retries the full execution on each attempt, repeating the same blocking dispatch.

When a local action (such as a timer) is the first action, the workflow yields before reaching the remote call. The timer is a local operation that succeeds, allowing the state to be saved and the workflow to transition to RUNNING. On the subsequent execution (when the timer fires), the remote dispatch may block, but the workflow is already in RUNNING state.

Solution

Two changes address this issue:

  1. Per-dispatch timeout: Each activity dispatch and child workflow message dispatch now uses a short timeout (2 seconds). These dispatches are one-way fire-and-forget messages to the target actor, so they complete in milliseconds when the target app is reachable. The short timeout ensures the actor lock is released quickly when an app is offline, allowing status queries and reminder retries to proceed without delay. All dispatches are attempted even if some fail, so that successfully dispatched items are not blocked by a single unreachable app.

  2. Pre-save on first remote execution: On the first execution of a workflow that contains remote activities or child workflows (detected by checking for TargetAppId in pending task and message routers), if any dispatch fails, the runtime saves the workflow state before returning. Events corresponding to failed dispatches (TaskScheduled for activities, SubOrchestrationInstanceCreated for child workflows) are excluded from the saved history so the retry can regenerate them. Events for successfully dispatched items are preserved to avoid re-dispatching them. The inbox is preserved so the existing reminder retries the full orchestrator execution. This transitions the workflow to RUNNING and releases the actor lock so that status queries are not blocked.
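A minimal sketch of the per-dispatch timeout from item 1, assuming a hypothetical dispatch callback that delivers the fire-and-forget message to the target actor; the 2-second bound and the "attempt all, collect failures" behavior mirror the description above, but the names are illustrative.

```go
package workflow

import (
	"context"
	"time"
)

// dispatchMessage is a stand-in for the real call that delivers a
// fire-and-forget activity or child-workflow message to its target actor.
type dispatchMessage func(ctx context.Context) error

const dispatchTimeout = 2 * time.Second

// dispatchAll attempts every pending dispatch with its own short timeout and
// returns the indices of the ones that failed. A single unreachable app only
// costs dispatchTimeout and does not block the remaining dispatches.
func dispatchAll(ctx context.Context, dispatches []dispatchMessage) []int {
	var failed []int
	for i, dispatch := range dispatches {
		dctx, cancel := context.WithTimeout(ctx, dispatchTimeout)
		err := dispatch(dctx)
		cancel()
		if err != nil {
			failed = append(failed, i)
		}
	}
	return failed
}
```

The failed indices correspond to the events that item 2 excludes from the saved history (TaskScheduled for activities, SubOrchestrationInstanceCreated for child workflows) so the retry can regenerate them.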

Dapr forwards hop-by-hop HTTP headers when proxying service invocation requests

Problem

When a client sends hop-by-hop HTTP headers such as Connection, Upgrade, HTTP2-Settings, TE, Keep-Alive, Trailer, Proxy-Authorization, and Proxy-Connection to the Dapr sidecar, Dapr forwards them to the upstream application or HTTPEndpoint. This violates RFC 7230 Section 6.1, which requires intermediaries (proxies) to remove hop-by-hop headers before forwarding a message.

Impact

This affects all HTTP service invocation paths: local invocation, remote invocation, HTTPEndpoint invocation, dapr-app-id header invocation, and direct URL invocation.

The most visible failure occurs when an HTTP client configured with HTTP_2 sends Upgrade: h2c, Connection: Upgrade, and HTTP2-Settings headers to the Dapr sidecar. Dapr forwards these to the upstream HTTPS endpoint, which rejects the request with:

http2: invalid Upgrade request header: ["h2c"]

Other upstream servers may silently accept the leaked headers, but the behavior is still incorrect per the HTTP specification and may cause subtle issues with proxies, load balancers, or HTTP/2-strict servers. The same issue also applied to response headers: if an upstream application set hop-by-hop headers on its response, Dapr forwarded them back to the caller.

Root Cause

The InternalMetadataToHTTPHeader function, which converts internal metadata to outgoing HTTP request headers, did not filter out hop-by-hop headers. It forwarded all headers except trace headers, Content-Type, Content-Length, and gRPC binary metadata. Similarly, the copyHeader function, which copies upstream response headers back to the caller, performed a naive copy of all headers with no filtering.

Solution

A new filter function identifies the standard hop-by-hop headers per RFC 7230 Section 6.1: Connection, Keep-Alive, Proxy-Connection, Transfer-Encoding, Upgrade, HTTP2-Settings, TE, Trailer, Proxy-Authorization, and Proxy-Authenticate. This filter is applied to request and response paths. End-to-end headers such as Accept, Authorization, Content-Type, and custom headers (X-*) are unaffected and continue to be forwarded normally.
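A minimal sketch of such a filter in Go using the standard net/http header type; the function name is illustrative and the list below is the fixed set named above, not necessarily the exact Dapr implementation.

```go
package invoke

import "net/http"

// hopByHopHeaders are the hop-by-hop fields per RFC 7230 Section 6.1;
// an intermediary must not forward them to the next hop.
var hopByHopHeaders = []string{
	"Connection",
	"Keep-Alive",
	"Proxy-Connection",
	"Transfer-Encoding",
	"Upgrade",
	"HTTP2-Settings",
	"TE",
	"Trailer",
	"Proxy-Authorization",
	"Proxy-Authenticate",
}

// stripHopByHopHeaders removes hop-by-hop headers in place before the request
// (or response) headers are forwarded. End-to-end headers such as Accept,
// Authorization, Content-Type, and X-* custom headers are left untouched.
func stripHopByHopHeaders(h http.Header) {
	for _, name := range hopByHopHeaders {
		h.Del(name)
	}
}
```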

Workflow events lost during ContinueAsNew under high event volume

Problem

When many external events are raised concurrently against a workflow that uses ContinueAsNew with preserveUnprocessedEvents, events can be silently lost or processed multiple times. The workflow completes with fewer events than were actually sent, its counter jumps ahead skipping events, or the same event is delivered to the workflow function more than once.

Impact

Any workflow that uses the ContinueAsNew pattern with preserveUnprocessedEvents (or WithKeepUnprocessedEvents in the Go SDK) is affected when receiving a burst of external events. The harder the workflow is driven (more events raised in parallel), the more likely events are to be lost or duplicated.

For simple counter workflows, this manifests as event loss (the counter skips ahead). For coordination workflows such as a semaphore pattern, duplicate event delivery causes the same request to be dispatched multiple times, leading to runaway workflow state where the workflow's bookkeeping becomes inconsistent and the workflow cannot recover.

Root Cause

Two related issues cause event loss and duplicate processing when the workflow engine's ContinueAsNew tight-loop exceeds the iteration limit (20):

  1. State corruption via shared pointer: The engine mutates the workflow's runtime state in place during ContinueAsNew iterations. If the loop exceeds the iteration limit and fails, the in-memory cached state is left in a corrupted state; the workflow's input counter has jumped ahead to reflect iterations that were never persisted. On retry, the workflow resumes from the corrupted counter value instead of the last persisted value, causing it to skip events.

  2. Duplicate event delivery from stale inbox: When partial ContinueAsNew progress is saved after hitting the iteration limit, the workflow's inbox (containing all original events) was preserved unchanged. On retry, all original events were re-delivered as new events alongside the carryover events already saved in history. This caused the workflow to buffer both sets, processing the same events multiple times. For workflows implementing coordination patterns (e.g. a semaphore), this resulted in duplicate dispatches and runaway workflow state.

Solution

Two changes address this issue:

  1. State snapshot and restore on failure: Before executing the workflow, the orchestrator saves a snapshot of the runtime state. The engine still mutates the in-memory workflow state during execution, but if execution fails for any reason, the orchestrator restores the snapshot so the actor's cached state remains consistent with what was last persisted to the state store. Retries therefore start from the correct state and all events are processed.

  2. Inbox replacement with carryover events: When partial ContinueAsNew progress is saved, unprocessed carryover events (buffered EventRaised events from the engine's ContinueAsNew state) are moved from history to the inbox, and the stale original inbox is discarded. On retry, only the unprocessed carryover events are delivered as new events, preventing duplicate processing.
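A minimal sketch of the snapshot-and-restore idea from item 1, with a hypothetical simplified runtimeState type; the real engine's state carries full history and inbox events, and this only illustrates restoring the cached state when execution fails so retries start from what was last persisted.

```go
package wfengine

// runtimeState is a simplified stand-in for the workflow's cached runtime state.
type runtimeState struct {
	History []string
	Inbox   []string
}

// clone returns a deep copy so the restored snapshot is unaffected by whatever
// the engine mutated during a failed execution.
func (s *runtimeState) clone() *runtimeState {
	return &runtimeState{
		History: append([]string(nil), s.History...),
		Inbox:   append([]string(nil), s.Inbox...),
	}
}

// executeWithSnapshot runs the engine against the cached state, restoring a
// pre-execution snapshot if the run fails, so the actor's cached state stays
// consistent with what was last persisted to the state store.
func executeWithSnapshot(cached *runtimeState, run func(*runtimeState) error) error {
	snapshot := cached.clone()
	if err := run(cached); err != nil {
		*cached = *snapshot
		return err
	}
	return nil
}
```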

Go 1.25.9 security update

Problem

Multiple vulnerabilities were identified in the Go standard library used by Dapr 1.17.3 (Go 1.25.8), affecting the go command, compiler, and the archive/tar, crypto/tls, crypto/x509, html/template, and os packages.

Impact

Applications compiled with Go versions prior to 1.25.9 are potentially affected by these vulnerabilities.

Root Cause

The vulnerabilities are in the Go standard library and compiler, and are not specific to Dapr code.

Solution

Upgraded the Go toolchain from 1.25.8 to 1.25.9 across all modules and Docker images in the repository.

Workflow and actor operations freeze when a slow sidecar delays placement dissemination

Problem

When one daprd sidecar in a namespace is slow to respond during placement table dissemination (for example, due to GC pressure, high actor load, or network latency), all other sidecars in the same namespace experience frozen workflow and actor operations. Scheduling new workflows, invoking actors, and running dapr workflow terminate or dapr workflow purge all hang indefinitely. After restarting the app, workflows may appear as RUNNING but no activities execute, and get_workflow_state returns nothing.

Impact

Any deployment with multiple replicas using actors or workflows is affected. A single slow sidecar can freeze all actor and workflow operations across every other sidecar in the namespace for as long as the placement server takes to time out the slow peer (8 seconds by default), plus the sidecar's own 5-second timeout. In practice, this creates a 10-15 second window in which:

  • All actor method invocations hang
  • All workflow scheduling, termination, and purge operations hang
  • Actor reminders stop firing
  • Applications stop responding to health probes and are declared unhealthy by Kubernetes
  • With a workflow SDK client connected, the daprd process enters a zombie state where it can neither serve requests nor exit

The issue is particularly severe during rolling updates, where pod cycling causes repeated dissemination rounds that can trigger the timeout multiple times in succession.

Root Cause

When the daprd sidecar receives a LOCK order from the placement server during actor table dissemination, it blocks all in-flight actor operations (the "inflight lock") until it receives the corresponding UNLOCK. The placement server only sends UNLOCK after every sidecar in the namespace has responded to each phase (LOCK, UPDATE, UNLOCK). If one sidecar is slow, no other sidecar receives UPDATE or UNLOCK.

The daprd sidecar has a 5-second timeout for this wait. When the timeout fired, it killed the entire placement subsystem by sending a fatal error through the internal error channel, which terminated the placement connection permanently. The sidecar never reconnected. Actor operations that were queued during the LOCK phase were silently abandoned, and any new actor operations hung indefinitely because the placement client was dead.

Solution

The dissemination timeout now closes the placement stream and triggers an automatic reconnection, instead of killing the placement subsystem. After the timeout:

  1. The sidecar closes its stream to the placement server
  2. All actors are halted and the routing table is cleared
  3. The sidecar reconnects to placement and re-registers
  4. A new dissemination round completes and actor operations resume

This matches the existing recovery behavior for network disconnections (EOF, connection reset), which already reconnected successfully. The timeout was the only error path that did not reconnect.
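A minimal sketch of this recover-and-reconnect flow, with hypothetical connect, serveStream, haltAllActors, and clearTable helpers; it only illustrates treating the dissemination timeout as a stream-level failure that feeds the existing reconnect loop rather than a fatal error.

```go
package placement

import (
	"context"
	"time"
)

// client is a stand-in for the sidecar's placement client.
type client struct {
	connect       func(ctx context.Context) error // dials placement and re-registers
	serveStream   func(ctx context.Context) error // returns on disconnect or dissemination timeout
	haltAllActors func()
	clearTable    func()
}

// run keeps the placement connection alive: any stream failure, including the
// dissemination timeout, halts actors, clears the routing table, and
// reconnects instead of tearing down the placement subsystem for good.
func (c *client) run(ctx context.Context) {
	for ctx.Err() == nil {
		if err := c.connect(ctx); err != nil {
			time.Sleep(time.Second) // simple backoff before retrying the dial
			continue
		}
		_ = c.serveStream(ctx) // blocks until the stream ends (EOF, reset, or timeout)
		c.haltAllActors()
		c.clearTable()
		// loop: reconnect, re-register, and wait for a fresh dissemination round
	}
}
```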

Additionally, this release fixes two related issues in the dissemination subsystem:

  • The placement readiness flag was set immediately after reconnecting, before the new dissemination round completed. This caused the daprd to report healthy to metadata queries while actor operations were still blocked. The readiness flag is now only set after the first successful UNLOCK.
  • Recycled disseminator objects from the internal pool could carry a stale timeout version counter from their previous use, potentially causing a valid timeout to be incorrectly ignored. The counter is now reset on reuse.

Scheduled jobs stop firing for longer than necessary after a scheduler pod restart in a multi-node cluster

Problem

When a scheduler pod in a multi-node cluster restarts (due to a rollout, crash, or node migration), some or all scheduled jobs stop firing for longer than necessary. The sidecar's metadata endpoint continues to report three connected scheduler addresses, but no job triggers are delivered to the application. Jobs remain stalled until another cluster event, such as a subsequent scheduler restart or leadership change, triggers a fresh connection cycle.

Impact

Any deployment running the scheduler service with three or more replicas is affected. During routine operations that cause a scheduler pod to restart, such as Kubernetes rolling updates, node drains, or OOM kills, applications stop receiving scheduled job triggers. The jobs remain registered but do not fire until an unrelated cluster event happens to re-establish the connections.

Root Cause

The sidecar maintains a streaming connection to each scheduler pod for receiving job triggers. These connections are managed by a shared runner that runs all per-scheduler connectors concurrently. When any single connector encountered an error, such as the replaced scheduler pod briefly accepting and then closing the streaming connection during startup, the runner cancelled all connectors, including those with healthy connections to the other scheduler pods.

After the connections were torn down, no reconnection was attempted. The sidecar's host-watching mechanism, which discovers scheduler addresses, had no reason to emit a new event because the scheduler cluster membership had not changed. The sidecar remained disconnected from all schedulers until an unrelated event, such as another scheduler restart or leadership election, caused the host watcher to cycle and re-establish connections.

Solution

Each per-scheduler streaming connector now retries independently on failure with a half-second backoff, instead of returning an error that tears down all sibling connections. A transient failure on one scheduler connection no longer affects the healthy connections to other scheduler pods. The connector keeps retrying until the scheduler pod becomes available or the connection is explicitly closed by a cluster membership change.
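A minimal sketch of the per-connector retry loop, assuming a hypothetical connectAndReceive callback for one scheduler address; each connector retries on its own with a 500 ms backoff and only stops when its context is cancelled by a cluster membership change or shutdown.

```go
package scheduler

import (
	"context"
	"time"
)

const reconnectBackoff = 500 * time.Millisecond

// runConnector maintains the streaming connection to a single scheduler
// address. Failures are retried locally with a short backoff instead of being
// returned to the shared runner, so one flapping scheduler pod cannot tear
// down the healthy connections to its siblings.
func runConnector(ctx context.Context, addr string, connectAndReceive func(ctx context.Context, addr string) error) {
	for {
		err := connectAndReceive(ctx, addr)
		if ctx.Err() != nil {
			// Context cancelled by a cluster membership change or shutdown.
			return
		}
		_ = err // transient failure: wait and retry this connector only
		select {
		case <-time.After(reconnectBackoff):
		case <-ctx.Done():
			return
		}
	}
}
```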