doc/administration/geo/_index.md
{{< details >}}
{{< /details >}}
Geo is the solution for widely distributed development teams and for providing a warm-standby as part of a disaster recovery strategy. Geo is not an out-of-the-box high availability (HA) solution.
[!warning] Geo undergoes significant changes from release to release. Upgrades are supported and documented, but you should ensure that you're using the right version of the documentation for your installation.
To make sure you're using the right version of the documentation, go to the Geo page on GitLab.com and choose the appropriate release from the Switch branch/tag dropdown list. For example, v15.7.6-ee.
Fetching large repositories can take a long time for teams and runners located far from a single GitLab instance.
Geo provides local caches that can be placed geographically close to remote teams and can serve read requests. This can reduce the time it takes to clone and fetch large repositories, speeding up development and increasing the productivity of your remote teams.
Geo secondary sites transparently proxy write requests to the primary site. All Geo sites can be configured to respond to a single GitLab URL, to deliver a consistent, seamless, and comprehensive experience whichever site the user lands on.
Geo uses a set of defined terms that are described in the Geo Glossary. Be sure to familiarize yourself with those terms.
Implementing Geo addresses several use cases. This section provides some of the intended use cases and highlights their benefits.
Geo as a disaster recovery solution gives you a warm-standby secondary site in a different region from your primary site. Data is continuously synchronized to the secondary site, which keeps it up to date. In the event of a disaster, such as a data center outage, network outage, or hardware failure, you can fail over to a fully operational secondary site. You can test your disaster recovery processes and infrastructure with planned failovers.
Benefits:
Establish Geo secondary sites geographically closer to your remote teams to provide local caches that accelerate read operations. You can have multiple Geo secondary sites, each tailored to synchronize only the projects your remote teams need. Transparent proxying and geographic routing with a unified URL ensure a consistent and seamless developer experience.
Benefits:
You can configure your CI/CD runners to clone from Geo secondary sites. You can tailor your secondary sites to match the needs of the runner workload and don't need to mirror the primary site. Supported read requests are served with cached data on the secondary site, and requests are transparently forwarded to the primary site when the data on the secondary site is stale or not available.
Benefits:
You can use Geo to migrate to new infrastructure. If you move your GitLab instance to a new server or data center, use Geo to migrate your GitLab data to the new instance in the background while your old instance continues to serve your users. Any changes to your active GitLab data are copied to your new instance, so there's no data loss during the cutover.
You cannot use Geo to migrate a PostgreSQL database from one operating system to another. See Upgrading operating systems for PostgreSQL.
Benefits:
You can also use Geo to migrate GitLab Self-Managed to GitLab Dedicated. A migration to GitLab Dedicated is similar to an infrastructure migration.
For more information, see migrate to GitLab Dedicated with Geo.
Benefits:
Geo is not designed to address every use case. This section provides examples of use cases where Geo is not an appropriate solution.
While Geo's selective synchronization functionality allows you to restrict projects that are synchronized to secondary sites, it was designed to reduce cross-region traffic and storage requirements, not to enforce export compliance. You must independently determine your legal obligations with regard to privacy, cybersecurity, and applicable trade control laws on an ongoing basis, based on the solution and documentation. Both the solution and the documentation are subject to change.
Geo read-only secondary site functionality is not a first-class feature, and might not be supported in the future. You should not rely on this functionality for access control purposes. GitLab provides authentication and authorization controls that better serve this purpose.
Geo is not a solution for zero downtime upgrades. You must upgrade the primary Geo site before upgrading secondary sites.
Geo replicates corruption on the primary site to all secondary sites. To protect against malicious or unintentional corruption, you should complement Geo with backups.
Geo is designed to be an active-passive, high-availability solution. It operates an eventually consistent synchronization model, which means that secondary sites are not tightly synchronized with the primary site. Secondary sites follow the primary with a small delay, which can result in a small amount of data loss after a disaster. Failover to a secondary site in the event of a disaster requires human intervention. However, large parts of the process of promoting a secondary site to become a primary site are automated by the GitLab Environment Toolkit (GET), provided you deploy all your sites using GET.
Geo should not be confused with Gitaly Cluster (Praefect). For more information about the difference between Geo and Gitaly Cluster (Praefect), see Comparison to Geo.
This is a brief summary of how Geo works in your GitLab environment. For more details, see the Geo development documentation.
Your Geo instance can be used for cloning and fetching projects, in addition to reading any data. This makes working with large repositories over large distances much faster.
When Geo is enabled, the:
Keep in mind that:
The following diagram illustrates the underlying architecture of Geo.
In this diagram:
From the perspective of a user performing Git operations:
From the perspective of a user browsing the GitLab UI, or using the API:
To simplify the diagram, some necessary components are omitted.
The omitted components include `gitlab-shell` and `gitlab-workhorse`.

A secondary site needs two different PostgreSQL databases:
The secondary sites also run an additional daemon: Geo Log Cursor.
The following are required to run Geo:
Additionally, check the GitLab minimum requirements, and use the latest version of GitLab for a better experience.
Because Geo adds a tracking database and replication metadata on top of the base GitLab installation, plan for at least 40 GB of disk space per site for a minimal Geo deployment with no repository data. See the storage requirements for more details.
The following table lists basic ports that must be open between the primary and secondary sites for Geo. To simplify failovers, you should open ports in both directions.
| Source site | Source port | Destination site | Destination port | Protocol |
|---|---|---|---|---|
| Primary | Any | Secondary | 80 | TCP (HTTP) |
| Primary | Any | Secondary | 443 | TCP (HTTPS) |
| Secondary | Any | Primary | 80 | TCP (HTTP) |
| Secondary | Any | Primary | 443 | TCP (HTTPS) |
| Secondary | Any | Primary | 5432 | TCP |
| Secondary | Any | Primary | 5000 | TCP (HTTPS) |
See the full list of ports used by GitLab in Package defaults.
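The firewall rules in the table can be spot-checked from either site with a plain TCP connection test. The following is a minimal sketch, not a GitLab tool: the `Rule` type, `GEO_RULES`, `ports_to_check`, and `can_connect` are hypothetical names introduced for illustration, and a successful TCP handshake only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket
from typing import NamedTuple


class Rule(NamedTuple):
    """One row of the firewall table (hypothetical helper type)."""
    source: str
    destination: str
    port: int


# Destination ports copied from the firewall table in this section.
GEO_RULES = [
    Rule("primary", "secondary", 80),
    Rule("primary", "secondary", 443),
    Rule("secondary", "primary", 80),
    Rule("secondary", "primary", 443),
    Rule("secondary", "primary", 5432),
    Rule("secondary", "primary", 5000),
]


def ports_to_check(local_role: str) -> list[int]:
    """Return the destination ports that a site with this role must reach."""
    return [rule.port for rule in GEO_RULES if rule.source == local_role]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a plain TCP connection; return True if the handshake succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, on a secondary site you could loop over `ports_to_check("secondary")` and call `can_connect` with the address of your primary site. If you open ports in both directions to simplify failovers, run the same check from the primary site as well.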
[!warning] For PostgreSQL replication between Geo sites, you must use private network connections, such as internal VPC peering. Never expose PostgreSQL ports to the internet. Exposing PostgreSQL ports to the internet can result in unauthorized access with full write permissions to your GitLab database, potentially compromising your entire GitLab instance and all associated data.
Additionally:
- Load balancers must pass through the `Connection` and `Upgrade` hop-by-hop headers. See the web terminal integration guide for more details.
- If you use `HTTPS` for external/internal URLs, it is not necessary to open port 80 in the firewall.

HTTP requests from any Geo secondary site to the primary Geo site use the Internal URL of the primary Geo site. If this is not explicitly defined in the primary Geo site settings in the Admin area, the public URL of the primary site is used.
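The fallback rule for the Internal URL can be expressed in a few lines. This is an illustrative sketch only: `effective_internal_url` is a hypothetical function, not part of GitLab, and it assumes that an unset or blank Internal URL counts as "not explicitly defined".

```python
from typing import Optional


def effective_internal_url(internal_url: Optional[str], external_url: str) -> str:
    """Return the URL a secondary site would use to reach the primary site."""
    # Assumption: None or a blank string means the Internal URL is not defined.
    if internal_url and internal_url.strip():
        return internal_url.strip()
    # Fall back to the public URL of the primary site.
    return external_url
```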
Prerequisites:
To update the internal URL of the primary Geo site:
The tracking database instance is used as metadata to control what needs to be updated on the local instance. For example:
Because the replicated database instance is read-only, we need this additional database instance for each secondary site.
This daemon:
When something is marked to be updated in the tracking database instance, asynchronous jobs running on the secondary site execute the required operations and update the state.
This new architecture allows GitLab to be resilient to connectivity issues between the sites. It doesn't matter how long the secondary site is disconnected from the primary site as it is able to replay all the events in the correct order and become synchronized with the primary site again.
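The idea of replaying events in order after a disconnection can be illustrated with a toy model. The sketch below is not the Geo Log Cursor implementation: `Event`, `Cursor`, and their fields are hypothetical, and the event log is assumed to be a list of entries with monotonically increasing IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    id: int        # position in the log, assumed to increase monotonically
    resource: str  # hypothetical identifier of the changed item


@dataclass
class Cursor:
    last_processed_id: int = 0
    pending: list[str] = field(default_factory=list)

    def catch_up(self, event_log: list[Event]) -> int:
        """Process every event newer than the cursor, in ID order."""
        new_events = sorted(
            (e for e in event_log if e.id > self.last_processed_id),
            key=lambda e: e.id,
        )
        for event in new_events:
            # Mark the resource as needing an update; a separate job
            # would perform the actual synchronization later.
            self.pending.append(event.resource)
            self.last_processed_id = event.id
        return len(new_events)
```

Because the cursor remembers only the last processed ID, the length of the disconnection does not matter in this model: the next `catch_up` call picks up every event that was recorded in the meantime.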
[!warning] These known issues reflect only the latest version of GitLab. If you are using an older version, additional issues might exist.
- `https://user:[email protected]`. For more information, see how to use a Geo Site.
- `registry.example.com`. Secondary site container registries are intended only for disaster recovery. Users should not be routed to them, especially not for pushes, because the data is not propagated to the primary site.
- Using `--depth` over SSH against a secondary site does not work and hangs indefinitely if the secondary site is not up to date at the time the request is initiated. This is due to problems related to translating Git SSH to Git HTTPS during proxying. For more information, see issue 391980. A new workflow that does not involve this translation step is now available for Linux-packaged GitLab Geo secondary sites, and can be enabled with a feature flag. For more details, see the comment in issue 454707. The fix for Cloud Native GitLab Geo secondary sites is tracked in issue 5641.

There is a complete list of all GitLab data types and replicated data types.
After installing GitLab on the secondary sites and performing the initial configuration, see the following documentation for post-installation information.
For information on configuring Geo, see Set up Geo.
For information on configuring Geo with Object storage, see Geo with Object storage.
For more information on how to replicate the container registry, see Container registry for a secondary site.
For an example of how to set up a single, location-aware URL with AWS Route53 or Google Cloud DNS, see Set up a unified URL for Geo sites.
For more information on configuring Single Sign-On (SSO), see Geo with Single Sign-On (SSO).
For more information on configuring LDAP, see Geo with Single Sign-On (SSO) > LDAP.
For more information on tuning Geo, see Tuning Geo.
For more information, see Pausing and resuming replication.
When a secondary site is set up, it starts replicating missing data from the primary site in a process known as backfill. You can monitor the synchronization process on each Geo site from the primary site's Geo Nodes dashboard in your browser.
Failures that happen during a backfill are scheduled to be retried at the end of the backfill.
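The "retry failures at the end" behavior can be sketched as a two-phase loop. This is a simplified illustration, not how GitLab schedules backfill jobs: the `backfill` function, the `sync` callback, and the `max_retries` parameter are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable


def backfill(
    items: Iterable[str],
    sync: Callable[[str], bool],
    max_retries: int = 1,
) -> list[str]:
    """Sync every item once, then retry the failures; return items still failing."""
    failed: deque[str] = deque()
    for item in items:
        if not sync(item):
            # Defer the retry so one failing item does not block the rest.
            failed.append(item)

    for _ in range(max_retries):
        if not failed:
            break
        still_failed: deque[str] = deque()
        while failed:
            item = failed.popleft()
            if not sync(item):
                still_failed.append(item)
        failed = still_failed
    return list(failed)
```

Deferring retries keeps the first pass moving through the full set of items, so a transient failure on one item delays only that item.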
For information on how to update your Geo sites to the latest GitLab version, see Upgrading the Geo sites.
For more information on Geo security, see Geo security review.
For more information on removing a Geo site, see Removing secondary Geo sites.
To find out how to disable Geo, see Disabling Geo.
Geo stores structured log messages in a geo.log file.
For more information on how to access and consume Geo logs, see the Geo section in the log system documentation.
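Structured logs are convenient to summarize with a short script. The sketch below assumes that each line of `geo.log` is a single JSON object and that entries carry a `severity` field. Both are assumptions to verify against your own log output, and `count_by_severity` is a hypothetical helper, not a GitLab utility.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_severity(lines: Iterable[str]) -> Counter:
    """Count log entries per severity, skipping lines that are not JSON objects."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Ignore lines that are not valid JSON.
            continue
        if isinstance(entry, dict):
            # "severity" is an assumed field name; entries without it
            # are counted under "UNKNOWN".
            counts[entry.get("severity", "UNKNOWN")] += 1
    return counts
```

To use it, open the log file for your installation and pass the file object to `count_by_severity`, because a file object yields one line at a time.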
For information on using Geo in disaster recovery situations to mitigate data loss and restore services, see Disaster Recovery.
For answers to common questions, see the Geo FAQ.