Gitaly on Kubernetes

Tier: Free, Premium, Ultimate
Offering: GitLab.com, GitLab Self-Managed, GitLab Dedicated

Introduced in GitLab 17.3 as an experiment.
Changed from experiment to beta in GitLab 17.10.
Changed from beta to limited availability in GitLab 18.2.
Changed from limited availability to general availability in GitLab 18.11.

[!disclaimer]

Running Gitaly on Kubernetes has availability trade-offs, so consider these trade-offs when planing a production environment and set expectations accordingly. This document describes and provides guidance on how to minimize, and plan for existing limitations.

Gitaly on Kubernetes has been evaluated by the Gitaly team and determined to be a safe way to deploy Gitaly. The rest of this document details best practices for doing so.

Timeline

Gitaly on Kubernetes is generally available as of GitLab 18.11. GitLab does not guarantee compatibility with specific managed Kubernetes offerings from cloud providers (such as Amazon EKS, Google GKE, or Azure AKS). You should validate your specific environment before deploying to production.

Context

By design, Gitaly (non-Cluster) is a single point of failure service (SPoF). Data is sourced and served from a single instance. For Kubernetes, when the StatefulSet pod rotates (for example, during upgrades, node maintenance, or eviction), the rotation causes service disruption for data served by the pod or instance.

In a Cloud Native Hybrid setup (Gitaly VM), the Linux package (Omnibus) masks the problem by:

Upgrading the Gitaly binary in-place.
Performing a graceful reload.

The same approach doesn't fit a container-based lifecycle where a container or pod needs to fully shutdown and start as a new container or pod.

Gitaly Cluster (Praefect) solves the data and service high-availability aspect by replicating data across instances. However, Gitaly Cluster (Praefect) is unsuited to run in Kubernetes because of existing issues and design constraints that are augmented by a container-based platform.

To support a Cloud Native deployment, Gitaly (non-Cluster) is the only option. By leveraging the right Kubernetes and Gitaly features and configuration, you can minimize service disruption and provide a good user experience.

Requirements

The information on this page assumes:

Kubernetes version equal to or greater than 1.29.
Kubernetes node runc version equal to or greater than 1.1.9.
Kubernetes node cgroup v2. Native, hybrid v1 mode is not supported. Only systemd-style cgroup structure is supported (Kubernetes default).
Pod access to node mountpoint /sys/fs/cgroup.
containerd version 2.1.0 or later.
Pod init container (init-cgroups) access to root user file system permissions on /sys/fs/cgroup. Used to delegate the pod cgroup to the Gitaly container (user git, UID 1000).
The cgroups file system is not mounted with the nsdelegate flag. For more information, see Gitaly issue 6480.

Guidance

When running Gitaly in Kubernetes, you must:

Address pod disruption.
Address resource contention and saturation.
Optimize pod rotation time.
Monitor disk usage

Enable `cgroup_writable` field in containerd

Cgroup support in Gitaly requires writable access to cgroups for unprivileged containers. containerd v2.1.0 introduced the cgroup_writable configuration option. When enabled, this option ensures that the cgroups file system is mounted with read/write permissions.

To enable this field, perform the following steps on the nodes where Gitaly will be deployed. If Gitaly is already deployed, then the pods must be recreated after the configuration is modified.

Modify the containerd configuration file located at /etc/containerd/config.toml to include the cgroup_writable field:

toml

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
cgroup_writable = true

Restart the Kubelet and containerd services:
shell
```
sudo systemctl restart kubelet
sudo systemctl restart containerd
```
These commands might mark the node as NotReady if the services take a long time to restart.

Address pod disruption

A pod can rotate for many reasons. Understanding and planing the service lifecycle helps minimize disruption.

For example, with Gitaly, a Kubernetes StatefulSet rotates on spec.template object changes, which can happen during Helm Chart upgrades (labels, or image tag) or pod resource requests or limits updates.

This section focuses on common pod disruption cases and how to address them.

Schedule maintenance windows

Because the service is not highly available, certain operations can cause brief service outages. Scheduling maintenance windows signals potential service disruption and helps set expectations. You should use maintenance windows for:

GitLab Helm chart upgrades and reconfiguration.
Gitaly configuration changes.
Kubernetes node maintenance windows. For example, upgrades and patching. Isolating Gitaly into its own dedicated node pool might help.

Use `PriorityClass`

Use PriorityClass to assign Gitaly pods higher priority compared to other pods, to help with node saturation pressure, eviction priority, and scheduling latency:

Create a priority class:

yaml

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gitlab-gitaly
value: 1000000
globalDefault: false
description: "GitLab Gitaly priority class"

Assign the priority class to Gitaly pods:

yaml

gitlab:
  gitaly:
    priorityClassName: gitlab-gitaly

Signal node autoscaling to prevent eviction

Node autoscaling tooling adds and removes Kubernetes nodes as needed to schedule pods and optimize cost.

During downscaling events, the Gitaly pod can be evicted to optimize resource usage. Annotations are usually available to control this behavior and exclude workloads. For example, with Cluster Autoscaler:

yaml

gitlab:
  gitaly:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Address resource contention and saturation

Gitaly service resource usage can be unpredictable because of the indeterminable nature of Git operations. Not all repositories are the same and size heavily influences performance and resource usage, especially for monorepos.

In Kubernetes, uncontrolled resource usage can lead to Out Of Memory (OOM) events, which forces the platform to terminate the pod and kill all its processes. Pod termination raises two important concerns:

Data/Repository corruption
Service disruption

This section focuses on reducing the scope of impact and protecting the service as a whole.

Constrain Git processes resource usage

Isolating Git processes provides safety in guaranteeing that a single Git call can't consume all service and pod resources.

Gitaly can use Linux Control Groups (cgroups) to impose smaller, per repository quotas on resource usage.

You should maintain cgroup quotas below the overall pod resource allocation. CPU is not critical because it only slows down the service. However, memory saturation can lead to pod termination. A 1 GiB memory buffer between pod request and Git cgroup allocation is a safe starting point. Sizing the buffer depends on traffic patterns and repository data.

For example, with a pod memory request of 15 GiB, 14 GiB is allocated to Git calls:

yaml

gitlab:
  gitaly:
    cgroups:
      enabled: true
      # Total limit across all repository cgroups, excludes Gitaly process
      memoryBytes: 15032385536 # 14GiB
      cpuShares: 1024
      cpuQuotaUs: 400000 # 4 cores
      # Per repository limits, 50 repository cgroups
      repositories:
        count: 50
        memoryBytes: 7516192768 # 7GiB
        cpuShares: 512
        cpuQuotaUs: 200000 # 2 cores

For more information, see Gitaly configuration documentation.

Right size Pod resources

Sizing the Gitaly pod is critical and reference architectures provide some guidance as a starting point. However, different repositories and usage patterns consume varying degrees of resources. You should monitor resource usage and adjust accordingly over time.

Memory is the most sensitive resource in Kubernetes because running out of memory can trigger pod termination. Isolating Git calls with cgroups helps to restrict resource usage for repository operations, but that doesn't include the Gitaly service itself. In line with the previous recommendation on cgroup quotas, add a buffer between overall Git cgroup memory allocation and pod memory request to improve safety.

A pod Guaranteed Quality of Service class is preferred (resource requests match limits). With this setting, the pod is less susceptible to resource contention and is guaranteed to never be evicted based on consumption from other pods.

Example resource configuration:

yaml

gitlab:
  gitaly:
    resources:
      requests:
        cpu: 4000m
        memory: 15Gi
      limits:
        cpu: 4000m
        memory: 15Gi

    init:
      resources:
        requests:
          cpu: 50m
          memory: 32Mi
        limits:
          cpu: 50m
          memory: 32Mi

Configure concurrency limiting

You can use concurrency limits to help protect the service from abnormal traffic patterns. For more information, see concurrency configuration documentation and how to monitor limits.

Isolate Gitaly pods

When running multiple Gitaly pods, you should schedule them in different nodes to spread out the failure domain. This can be enforced using pod anti affinity. For example:

yaml

gitlab:
  gitaly:
    antiAffinity: hard

Optimize pod rotation time

This section covers areas of optimization to reduce downtime during maintenance events or unplanned infrastructure events by reducing the time it takes the pod to start serving traffic.

Persistent Volume permissions

As the size of data grows (Git history and more repositories), the pod takes more and more time to start and become ready.

During pod initialization, as part of the persistent volume mount, the file system permissions and ownership are explicitly set to the container uid and gid. This operation runs by default and can significantly slow down pod startup time because the stored Git data contains many small files.

This behavior is configurable with the fsGroupChangePolicy attribute. Use this attribute to perform the operation only if the volume root uid or gid mismatches the container spec:

yaml

gitlab:
  gitaly:
    securityContext:
      fsGroupChangePolicy: OnRootMismatch

Health probes

The Gitaly pod starts serving traffic after the readiness probe succeeds. The default probe times are conservative to cover most use cases. Reducing the readinessProbe initialDelaySeconds attribute triggers probes earlier, which accelerates pod readiness. For example:

yaml

gitlab:
  gitaly:
    statefulset:
      readinessProbe:
        initialDelaySeconds: 2
        periodSeconds: 10
        timeoutSeconds: 3
        successThreshold: 1
        failureThreshold: 3

Gitaly graceful shutdown timeout

By default, when terminating, Gitaly grants a 1 minute timeout for in-flight requests to complete. While beneficial at first glance, this timeout:

Slows down pod rotation.
Reduces availability by rejecting requests during the shutdown process.

A better approach in a container-based deployment is to rely on client-side retry logic. You can reconfigure the timeout by using the gracefulRestartTimeout field. For example, to grant a 1 second graceful timeout:

yaml

gitlab:
  gitaly:
    gracefulRestartTimeout: 1

Monitor disk usage

Monitor disk usage regularly for long-running Gitaly containers because log file growth can cause storage issues if log rotation is not enabled.

Migrate to Gitaly on Kubernetes

To migrate existing repositories from non-Kubernetes Gitaly nodes to Gitaly on Kubernetes:

Deploy your Gitaly on Kubernetes nodes and add them as new repository storages. in the GitLab Admin area. Configure storage weights so that all new repositories are created on the new repository storages. This prevents new projects from being created on the old repository storages while the migration is in progress.
Use the repository move API to move existing repositories to the new storages. GitLab repositories can be associated with projects, groups, and snippets, and each type has a separate API. For complete instructions, see moving repositories managed by GitLab.

Each repository is made read-only for the duration of the move and is not writable until the move is complete.

Timeline

Context

Requirements

Guidance

Enable cgroup_writable field in containerd

Address pod disruption

Schedule maintenance windows

Use PriorityClass

Signal node autoscaling to prevent eviction

Address resource contention and saturation

Constrain Git processes resource usage

Right size Pod resources

Configure concurrency limiting

Isolate Gitaly pods

Optimize pod rotation time

Persistent Volume permissions

Health probes

Gitaly graceful shutdown timeout

Monitor disk usage

Migrate to Gitaly on Kubernetes

Enable `cgroup_writable` field in containerd

Use `PriorityClass`