docs/server/features/archiving.md
KurrentDB 25.0 introduced the initial release of Archiving: a new major feature to reduce costs and increase scalability of a KurrentDB cluster.
Future releases of KurrentDB will build on and improve this feature.
A license key is required to use this feature.
KurrentDB databases can become very large. Typical KurrentDB deployments require low latency, high throughput access to the data, and so employ large, expensive volumes attached to each node in the cluster. The size of the database can be controlled by deleting data, but only if that data is no longer needed. Often, a large proportion of the data can become old enough that, although it is still required for occasional reads, it is not read frequently, and need not be read quickly. Until now, this 'cold' data has necessarily been stored on the same volumes as the hot data, taking up space and adding to the expense of running a high performance cluster.
With the new Archiving feature, data is uploaded to cheaper storage such as Amazon S3 and then can be removed from the volumes attached to the cluster nodes. The volumes can be correspondingly smaller and cheaper. The nodes are all able to read the archive, and when a read request from a client requires data that is stored in the archive, the node retrieves that data from the archive transparently to the client.
The extra copy of the data in S3 also acts as a backup that is kept up to date as each chunk of data is written to the log.
::: warning
Read requests that read the archive will have comparatively high latency and, at the moment, can cause other reads to be queued.
:::
A backup taken from one node can generally be restored to any other node, but this is not the case with the Archiver Node. The Archiver Node must be restored from a backup that was taken from the Archiver Node itself. Scavenging the archive is not yet implemented. Once it is, restoring the Archiver Node from a backup taken from a different node would carry a risk that the Archiver Node does not completely scavenge the archive. In that case, a scavenge with threshold = -1 would need to be run to restore normal operation.
When a node starts up, it checks to see if the archive has newer data than it has locally, and if so downloads that data from the archive.
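The startup check can be pictured roughly as follows. This is a minimal Python sketch, not KurrentDB's implementation: it assumes that the local log and the archive can each be summarized by the number of their last completed chunk, and `chunks_to_download` is a hypothetical helper name introduced only for illustration.

```python
def chunks_to_download(local_last_chunk: int, archive_last_chunk: int) -> list[int]:
    """Return the chunk numbers held by the archive but not yet held locally.

    Both arguments are the number of the last completed chunk (-1 if none).
    This numbering scheme is an assumption made for the sketch.
    """
    if archive_last_chunk <= local_last_chunk:
        # The node is already at least as up to date as the archive.
        return []
    return list(range(local_last_chunk + 1, archive_last_chunk + 1))
```

A node whose last local chunk is 5, starting against an archive whose last chunk is 8, would fetch chunks 6, 7 and 8 under these assumptions.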
Sample configuration:
The following settings are required on all nodes (including the Archiver Node) to enable archiving:
Licensing:
  LicenseKey: <your key>
Archive:
  Enabled: true
  RetainAtLeast:
    Days: 30
    LogicalBytes: 500000000
  StorageType: S3 # or GCP / Azure (examples below)
  S3:
    Region: eu-west-1
    Bucket: kurrentdb-cluster-123-archive
::: warning
Do not use the same archive bucket for multiple clusters, and do not run more than one Archiver Node in a single cluster.
:::
Additionally, this must be placed on the Archiver Node:
ReadOnlyReplica: true
Archiver: true
The Archiver Node is a read-only replica and does not participate in quorum activities. It must be separate from the main cluster nodes. For example, if you have a three-node cluster, you will need a fourth node to be the archiver. If you already have a read-only replica as a fourth node, you can use it as the Archiver Node.
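The rules above (archiving enabled on every node, exactly one Archiver Node, and that node being a read-only replica) can be checked mechanically before deployment. The following is a minimal Python sketch under stated assumptions: each node's configuration has already been parsed into a dictionary, `ReadOnlyReplica` and `Archiver` are top-level keys, and `check_cluster` is a hypothetical helper, not part of KurrentDB.

```python
def check_cluster(nodes: dict[str, dict]) -> list[str]:
    """Return a list of problems found in a cluster's archiving configuration.

    `nodes` maps a node name to its parsed settings, for example
    {"Archive": {"Enabled": True}, "ReadOnlyReplica": True, "Archiver": True}.
    """
    problems = []

    # Exactly one node in the cluster may act as the Archiver Node.
    archivers = [name for name, cfg in nodes.items() if cfg.get("Archiver")]
    if len(archivers) != 1:
        problems.append(f"expected exactly one Archiver Node, found {len(archivers)}")

    # The Archiver Node must be a read-only replica.
    for name in archivers:
        if not nodes[name].get("ReadOnlyReplica"):
            problems.append(f"{name}: the Archiver Node must be a read-only replica")

    # Archiving must be enabled on every node, including the Archiver Node.
    for name, cfg in nodes.items():
        if not cfg.get("Archive", {}).get("Enabled"):
            problems.append(f"{name}: Archive.Enabled must be true")

    return problems
```

An empty result means none of these three rules is violated. The sketch does not check licensing, bucket uniqueness across clusters, or storage credentials.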
RetainAtLeast is the retention policy for what to retain in the local volume for each node. It does not affect which chunks are uploaded to the archive by the Archiver Node (which will upload all completed committed chunks). If a chunk contains any data less than RetainAtLeast:Days old, then it will not be removed locally. If a chunk contains any data that is within RetainAtLeast:LogicalBytes of the tail of the log (strictly: the scavenge point of the current scavenge) then it will not be removed locally. The metrics described below can be useful to determine sensible values for the retention policy.
On startup, up to MaxMemTableSize events can be read from the log. It is recommended to keep at least this much data locally for faster startup.
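The two RetainAtLeast rules combine so that a chunk is kept locally if either rule applies. The following minimal Python sketch illustrates that logic only. The `Chunk` fields, the function name, and the exact boundary conventions are assumptions made for illustration and are not taken from the KurrentDB source.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    newest_record_age_days: float  # age of the most recent record in the chunk
    end_position: int              # logical log position just past the chunk's last byte


def may_remove_locally(chunk: Chunk, scavenge_point: int,
                       retain_days: int, retain_logical_bytes: int) -> bool:
    """Return True only if neither RetainAtLeast rule requires keeping the chunk."""
    # Rule 1: the chunk contains data younger than RetainAtLeast:Days.
    has_recent_data = chunk.newest_record_age_days < retain_days
    # Rule 2: the chunk contains data within RetainAtLeast:LogicalBytes
    # of the scavenge point of the current scavenge.
    near_tail = chunk.end_position > scavenge_point - retain_logical_bytes
    return not (has_recent_data or near_tail)
```

With the sample policy (Days: 30, LogicalBytes: 500000000) and a scavenge point at position 2,000,000,000, a chunk ending at position 1,000,000,000 whose newest record is 90 days old would be eligible for local removal, while a chunk ending at 1,800,000,000 would be kept regardless of its age.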
StorageType must be set to S3, GCP or Azure. Other cloud providers may be supported in the future. Please contact us if you are interested.
Example:
StorageType: S3
S3:
  Region: eu-west-1
  Bucket: kurrentdb-cluster-123-archive
The KurrentDB nodes authenticate with S3 by looking for credentials from the standard providers. Please see the documentation for S3 in general and .NET in particular.
Example:
StorageType: GCP
GCP:
  Bucket: kurrentdb-cluster-123-archive
The KurrentDB nodes authenticate with Google Cloud using Application Default Credentials. Please see the documentation for GCP in general and .NET in particular.
The basic configuration format is as follows:
StorageType: Azure
Azure:
  Container: kurrentdb-cluster-123-archive
  Authentication: <authentication method> # Default / ConnectionString / SystemAssignedIdentity / UserAssignedIdentity
The following Authentication methods are supported:
Default: the service URL must be supplied with the ConnectionStringOrServiceUrl configuration option. Microsoft does not recommend this authentication method for production use. You can use it with the Azure CLI (among other methods) to quickly test if your setup works.
StorageType: Azure
Azure:
  Container: kurrentdb-cluster-123-archive
  Authentication: Default
  ConnectionStringOrServiceUrl: https://your-storage-account.blob.core.windows.net/
ConnectionString: the connection string must be supplied with the ConnectionStringOrServiceUrl configuration option. This authentication method is suitable when your KurrentDB cluster runs outside Azure.
StorageType: Azure
Azure:
  Container: kurrentdb-cluster-123-archive
  Authentication: ConnectionString
  ConnectionStringOrServiceUrl: DefaultEndpointsProtocol=https;AccountName=<your-storage-account>;AccountKey=<your-account-key>;EndpointSuffix=core.windows.net
SystemAssignedIdentity: the service URL must be supplied with the ConnectionStringOrServiceUrl configuration option. This authentication method is suitable when your KurrentDB cluster runs in virtual machines inside Azure.
StorageType: Azure
Azure:
  Container: kurrentdb-cluster-123-archive
  Authentication: SystemAssignedIdentity
  ConnectionStringOrServiceUrl: https://your-storage-account.blob.core.windows.net/
UserAssignedIdentity: the service URL must be supplied with the ConnectionStringOrServiceUrl configuration option, and the client ID of the user-assigned managed identity must be supplied with the UserAssignedClientId configuration option. This authentication method is suitable when your KurrentDB cluster runs in virtual machines inside Azure.
StorageType: Azure
Azure:
  Container: kurrentdb-cluster-123-archive
  Authentication: UserAssignedIdentity
  UserAssignedClientId: 2d8e2e8c-8b17-4d63-8c20-3b7e8a7cbb6b
  ConnectionStringOrServiceUrl: https://your-storage-account.blob.core.windows.net/
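The option requirements of the four Azure authentication methods can be summarized in a small pre-flight check. This is a minimal Python sketch that assumes the Azure section has already been parsed into a dictionary and that Authentication is always set explicitly. `missing_azure_options` is a hypothetical helper, and treating Container as required is inferred from the examples rather than stated by the documentation.

```python
AZURE_AUTH_METHODS = {
    "Default",
    "ConnectionString",
    "SystemAssignedIdentity",
    "UserAssignedIdentity",
}


def missing_azure_options(azure: dict) -> list[str]:
    """Return the option names that the chosen Authentication method still needs."""
    method = azure.get("Authentication")
    if method not in AZURE_AUTH_METHODS:
        raise ValueError(f"unsupported Authentication method: {method!r}")

    # Every example supplies a container and a connection string or service URL.
    required = ["Container", "ConnectionStringOrServiceUrl"]
    # Only the user-assigned managed identity additionally needs a client ID.
    if method == "UserAssignedIdentity":
        required.append("UserAssignedClientId")

    return [name for name in required if not azure.get(name)]
```

An empty list means every option needed by the chosen method is present. The sketch does not verify that the value is a service URL rather than a connection string, or that the credentials actually work.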
The metrics relevant to Archiving in particular are kurrentdb_logical_chunk_read_distribution_bucket and kurrentdb_io_record_read_duration_seconds_bucket described in the metrics documentation.
The panels are available in the Events Served section of the miscellaneous panels dashboard.
This initial release has several limitations that we intend to improve in future releases.
Work to improve the following limitations is in progress:
Work to improve the following limitations is planned: