Dapr 1.15.6

This update includes bug fixes:

Fix Actor memory leak
Fix Workflow state store contention
Update Max Concurrent Workflow Operations
Remove client-side rate limiter from injector
Allow non-stream Workflow Schedule
Fix Placement reconnection during leader shutdown
Scheduler connect after Placement dissemination
Fix Scheduler Deadlock under high load

Fix Actor memory leak

Problem

Actor or Workflow workloads would cause the daprd process to consume more and more memory over time.

Impact

When Actors or Workflows ran for long enough, the daprd process would consume all available memory, leading to an OOM crash.

Root cause

The Actor locking mechanism would not release lock object memory after use when the daprd was calling a remote Actor.

Solution

Defer Actor message locking to only the daprd which is hosting that Actor ID to ensure the lock object memory is always released after use.
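As a rough illustration of the idea, and not Dapr's actual implementation, the sketch below takes a per-Actor lock only when the Actor is hosted locally, and drops the lock entry once nobody references it. All names (`lockTable`, `invoke`, `hostedLocally`) are hypothetical, and the reference-counted cleanup is an assumption about one way to guarantee release.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a per-actor lock plus a count of goroutines holding or waiting on it.
type entry struct {
	mu   sync.Mutex
	refs int
}

// lockTable is a hypothetical per-actor-ID lock registry.
type lockTable struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func newLockTable() *lockTable {
	return &lockTable{entries: make(map[string]*entry)}
}

// Lock acquires the lock for actorID and returns a function that releases it.
// The entry is deleted once no goroutine references it, so the map cannot
// grow without bound.
func (t *lockTable) Lock(actorID string) (unlock func()) {
	t.mu.Lock()
	e, ok := t.entries[actorID]
	if !ok {
		e = &entry{}
		t.entries[actorID] = e
	}
	e.refs++
	t.mu.Unlock()

	e.mu.Lock()

	return func() {
		e.mu.Unlock()
		t.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(t.entries, actorID)
		}
		t.mu.Unlock()
	}
}

// Size reports how many lock entries are currently retained.
func (t *lockTable) Size() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.entries)
}

// invoke takes the per-actor lock only when this host owns the actor.
// Calls to remote actors are forwarded without allocating a lock entry,
// because the hosting side is responsible for locking.
func invoke(t *lockTable, actorID string, hostedLocally bool, call func()) {
	if !hostedLocally {
		call()
		return
	}
	unlock := t.Lock(actorID)
	defer unlock()
	call()
}

func main() {
	t := newLockTable()
	invoke(t, "actor-1", true, func() {
		fmt.Println("entries while a local call holds the lock:", t.Size())
	})
	invoke(t, "actor-2", false, func() {
		fmt.Println("entries during a remote call:", t.Size())
	})
	fmt.Println("entries after both calls:", t.Size())
}
```

The point of the sketch is that a remote call never creates lock state on the calling side, so there is nothing left behind to leak.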

Fix Workflow state store contention

Problem

In high throughput scenarios, Workflows would see contention on Workflow state store operations, which would process slowly or block.

Impact

Degraded performance of Workflow operations, leading to increased latency and potential timeouts.

Root cause

Circular Placement locking of Actor Workflow state operations.

Solution

Remove Placement locking in Workflow state operations as it is already locked at the Actor level.

Update Max Concurrent Workflow Operations

Problem

Workflows would experience degraded performance in high throughput scenarios.

Impact

Workflows would be limited in the number of concurrent operations they could handle, leading to increased latency and potential timeouts.

Root cause

The default value for maxConcurrentWorkflowInvocations and maxConcurrentActivityInvocations was set to 1000, which could lead to performance issues in high throughput scenarios.

Solution

Increase the default value for maxConcurrentWorkflowInvocations and maxConcurrentActivityInvocations to the maximum int32 value (2,147,483,647), allowing for more concurrent operations to be processed without contention.
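To see why a concurrency cap matters, here is a minimal sketch of a counting semaphore that bounds in-flight operations. It is a generic illustration and not how daprd applies these settings. `runLimited` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runLimited runs n tasks with at most limit in flight at once and returns
// the highest number of tasks observed running concurrently.
func runLimited(n, limit int, task func()) int64 {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64

	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks once limit tasks are running
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }() // free the slot after the task finishes

			cur := atomic.AddInt64(&inFlight, 1)
			// Record the peak with a compare-and-swap loop.
			for {
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			task()
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	peak := runLimited(100, 2, func() {})
	fmt.Println("peak concurrency with limit 2:", peak)
}
```

With a low limit, extra work queues behind the semaphore regardless of available capacity. Raising the limit removes that artificial ceiling.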

Remove client-side rate limiter from injector

Problem

The injector was using a client-side rate limiter that could cause throttling issues in production environments with high-throughput scenarios.

Impact

Degraded performance and potential service disruptions when the injector was under heavy load.

Root cause

The client-side rate limiter in the injector was unnecessarily restricting operations, causing bottlenecks in high-traffic scenarios.

Solution

Remove the client-side rate limiter from the injector to allow for better performance and eliminate unnecessary throttling in production environments.

Allow non-stream Workflow Schedule

Problem

Attempting to schedule a Workflow on a daprd with no Workflow listener stream would hang.

Impact

Attempting to schedule a Workflow on a daprd with no Workflow listener stream would hang indefinitely.

Root cause

The daprd would wait until a Workflow listener stream was established before processing the schedule request.

Solution

Allow scheduling of Workflows without requiring a Workflow listener stream to be established first. This allows for immediate processing of the schedule request.

Fix Placement reconnection during leader shutdown

Problem

Daprd would fail to find the Placement leader host.

Impact

Actors (or Workflows) would fail to continue to function.

Root cause

The DNS record set that the daprd used to round robin the Placement hosts would not be refreshed when the leader host was shut down.

Solution

Add a retry mechanism to the daprd to refresh the DNS record set for the Placement hosts, ensuring that it can always find the current leader host even during leader shutdown events.
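The retry-with-refresh idea can be sketched roughly as follows. This is an illustration under assumptions, not the daprd code: `connectWithRefresh`, `resolve`, `dial`, `errNotLeader` and the host names are all hypothetical, and a real implementation would likely wait between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLeader is a hypothetical error returned by a host that is not the leader.
var errNotLeader = errors.New("not the leader")

// connectWithRefresh tries to reach a leader. On every attempt it calls
// resolve again instead of reusing the address list from the first lookup,
// so hosts that appear after a leader shutdown are picked up.
func connectWithRefresh(resolve func() ([]string, error), dial func(addr string) error, attempts int) (string, error) {
	lastErr := errors.New("no addresses resolved")
	for i := 0; i < attempts; i++ {
		addrs, err := resolve() // fresh lookup on each attempt
		if err != nil {
			lastErr = err
			continue
		}
		for _, addr := range addrs {
			if err := dial(addr); err != nil {
				lastErr = err
				continue
			}
			return addr, nil
		}
	}
	return "", fmt.Errorf("no leader found after %d attempts: %w", attempts, lastErr)
}

func main() {
	lookups := 0
	// Stand-in resolver: the second lookup returns an extra host.
	// A real resolver might wrap net.LookupHost.
	resolve := func() ([]string, error) {
		lookups++
		if lookups == 1 {
			return []string{"placement-0:50005"}, nil
		}
		return []string{"placement-0:50005", "placement-1:50005"}, nil
	}
	dial := func(addr string) error {
		if addr == "placement-1:50005" {
			return nil
		}
		return errNotLeader
	}
	addr, err := connectWithRefresh(resolve, dial, 3)
	fmt.Println(addr, err, "lookups:", lookups)
}
```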

Scheduler connect after Placement dissemination

Problem

Connecting a Workflow stream when there is work queued in the system would cause a large number of errors.

Impact

Connecting a Workflow stream when there is work queued in the system would cause a large number of errors, leading to degraded performance and potential timeouts on existing Workflows.

Root cause

Daprd would connect to Scheduler to receive Workflow work before Placement had disseminated.

Solution

Ensure that the daprd connects to the Scheduler only after Placement has disseminated, preventing errors and ensuring that the daprd is ready to process Workflow work without contention.
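The ordering constraint can be pictured as a simple gate. The sketch below is a generic illustration, not the daprd implementation: `connectAfterReady`, the `ready` and `stop` channels, and `errStopped` are hypothetical names, with `ready` standing in for "Placement has disseminated".

```go
package main

import (
	"errors"
	"fmt"
)

// errStopped is returned when shutdown is requested before the gate opens.
var errStopped = errors.New("stopped before ready")

// connectAfterReady blocks until ready is closed and only then calls connect.
// It gives up without connecting if stop is closed first.
func connectAfterReady(ready, stop <-chan struct{}, connect func() error) error {
	select {
	case <-ready:
		return connect()
	case <-stop:
		return errStopped
	}
}

func main() {
	ready := make(chan struct{})
	stop := make(chan struct{})

	// Simulate Placement finishing dissemination in the background.
	go func() { close(ready) }()

	err := connectAfterReady(ready, stop, func() error {
		fmt.Println("connecting to Scheduler")
		return nil
	})
	fmt.Println("result:", err)
}
```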

Fix Scheduler Deadlock under high load

Problem

Under high load, such as high Workflow throughput, a Scheduler would deadlock.

Impact

Workflows and the Jobs API would fail to trigger.

Root cause

The Scheduler would not retry or handle transient etcd errors related to high contention.

Solution

Add retry logic to the Scheduler to handle transient etcd errors, preventing deadlocks and ensuring that Workflows and Jobs can be triggered successfully under high load conditions.
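For readers unfamiliar with the pattern, retrying transient errors might look roughly like the sketch below. It is a generic example with exponential backoff and makes no claim about how the Scheduler classifies or retries etcd errors. `retryTransient` and `errTransient` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient marks errors that are safe to retry (hypothetical sentinel).
var errTransient = errors.New("transient error")

// retryTransient calls op until it succeeds, returns a non-transient error,
// or maxAttempts is reached. The wait doubles after each failed attempt.
// It returns the number of attempts made and the final error, if any.
func retryTransient(op func() error, maxAttempts int, initial time.Duration) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	wait := initial
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = op()
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errTransient) {
			return attempt, err // permanent failure: do not retry
		}
		if attempt < maxAttempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	attempts, err := retryTransient(func() error {
		calls++
		if calls < 3 {
			return errTransient // simulate contention on the first two calls
		}
		return nil
	}, 5, time.Millisecond)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

Capping the number of attempts keeps a persistent failure from retrying forever, while the backoff avoids adding load during contention.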