docs/proposals/004-scalability-benchmarking.md
Users of Argo CD want to know how to scale it, what configuration tweaks and deployment options are available, and how far they can push resources (in terms of the number of supported applications, Git repositories, managed clusters, etc.).
While the Argo CD documentation discusses options to scale up, the actual process is not clear and, as articulated in this thread, is oftentimes a point of confusion for users.
By running large-scale benchmarking, we aim to help the Argo CD community with the following:
We propose hosting the tooling in an argo-cd-benchmarking repo under argoproj-labs so that anyone can easily replicate the benchmarks, and so that development of the procedures can happen outside of the Argo CD release lifecycle (unlike the current gen-resources hack in the main project).

The initial members for this proposal and their affiliations are:
| Name | Company |
|---|---|
| Andrew Lee | AWS |
| Carlos Santana | AWS |
| Nicholas Morey | Akuity |
| Nima Kaviani | AWS |
With the introduction of the proposed Scalability SIG, the members participating in the proposal may change.
Any community member is welcome to participate in the work for this proposal, either by joining the Scalability SIG or by contributing to the proposed argoproj-labs/argo-cd-benchmarking repository containing the tooling.
We propose to create the argo-cd-benchmarking repo under argoproj-labs and add the authors of this proposal as maintainers.

Each use case will cover a specific topology with N permutations based on the key scalability factors. They will be measured using the metrics given in the proposal.
We intend to focus on two key topologies: one Argo CD per cluster, and one Argo CD managing all remote clusters. All other variations are extensions of these two.
The exact key scalability factors used in each permutation will be determined once we begin testing.
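As a sketch of what such a test matrix could look like (the factor names and values below are hypothetical placeholders, not the final matrix), the runs for a given topology can be enumerated as a cross product of the scalability factors:

```python
from itertools import product

# Hypothetical scalability factors; as noted above, the real values
# will be chosen once testing begins.
factors = {
    "applications": [100, 1000, 10000],
    "managed_clusters": [1, 10, 100],
    "git_repositories": [1, 50, 500],
}

# Each permutation is one benchmark run for the topology under test.
permutations = [
    dict(zip(factors, combo)) for combo in product(*factors.values())
]

print(len(permutations))  # 27 runs for this example matrix
```

Even this small example yields 27 runs per topology, which is why the tooling needs to make runs cheap to set up and repeat.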
The remote-clusters topology will capture the impact of network throughput on performance.
This instance will manage only namespace-level resources, to determine the impact of monitoring cluster-scoped resources (related to the effect of resource churn in the cluster).
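One way to configure such an instance is through the documented fields of an Argo CD cluster secret, where `namespaces` restricts the instance to the listed namespaces and `clusterResources: "false"` disables management of cluster-scoped resources (the names, namespaces, and server URL below are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: benchmark-cluster            # placeholder
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: benchmark-cluster            # placeholder
  server: https://benchmark.example.com  # placeholder
  # Restrict this instance to namespace-level resources in these namespaces.
  namespaces: team-a,team-b
  # Do not manage cluster-scoped resources.
  clusterResources: "false"
```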
There is already some tooling in the Argo CD repository for scalability testing. We plan to build on that effort and push the boundaries of the testing further.
Once we have the benchmarking tooling, we can determine whether the simulated components skew the results compared to the real world.
AWS intends to provide the infrastructure required to benchmark large-scale scenarios.
There is no intention to change the security model of Argo CD and therefore this project has no direct security considerations.