operator/docs/enhancements/zone_aware_replication.md
By default, data is transparently replicated across the whole pool of service instances, regardless of whether these instances are all running within the same availability zone (or data center, or rack) or in different ones. Storing multiple replicas for a given data within the same availability zone poses a risk for data loss if there is an outage affecting various nodes within a zone or a full zone outage. There is support for zone-aware data replication in Loki. When enabled, replicas for the given data are guaranteed to span across different availability zones.
A zone represents a logical failure domain. It is common for Kubernetes clusters to span multiple zones for increased availability. While the exact definition of a zone is left to infrastructure implementations, common properties of a zone include very low network latency within a zone, no-cost network traffic within a zone, and failure independence from other zones.
The following sections describe a set of APIs in form of Custom Resource Definitions (CRD) that enable users of LokiStack resources to support:
Zone-aware replication is the replication of data across failure domains. Avoiding data loss during a domain outage is the motivation to introduce a zone-aware component deployment and enable Loki's zone-aware data replication capabilities.
The following enhancement proposal describes the required API additions and changes in the Loki Operator to add zone-aware data replication support.
The following API changes introduce a new spec to enable configuration for Loki's data replication properties.
Note: The new replicationSpec introduces a factor field that is a replacement for the old replicationFactor field. Moving forward the old field is officially deprecated and will be removed in future CRD versions.
import "github.com/prometheus/prometheus/model/labels"
// LokiStackSpec defines the desired state of LokiStack
type LokiStackSpec struct {
...
// ReplicationFactor defines the policy for log stream replication. (Deprecated: Please use replication.factor instead. This field will be removed in future versions of this CRD)
//
// +optional
// +kubebuilder:validation:Optional
// +operator-sdk:csv:customresourcedefinitions:type=spec,xDescriptors="urn:alm:descriptor:com.tectonic.ui:number",displayName="Replication Factor"
ReplicationFactor int32 `json:"replicationFactor,omitempty"`
// Replication defines the configuration for Loki data replication
//
// +optional
// +kubebuilder:validation:Optional
Replication *ReplicationSpec `json:"replication,omitempty"`
}
type ReplicationSpec struct {
// Factor defines the policy for log stream replication.
//
// +optional
// +kubebuilder:validation:Optional
Factor int32 `json:"factor,omitempty"`
// Zone is the key that defines a topology in the Nodes' labels
//
// +required
// +kubebuilder:validation:Required
Zones []ZoneSpec
}
// ZoneSpec defines the spec to support zone-aware component deployments.
type ZoneSpec struct {
// MaxSkew describes the maximum degree to which Pods can be unevenly distributed
//
// +required
// +kubebuilder:default:="1"
MaxSkew int `json:"maxSkew,omitempty"`
// Topologykey is the key that defines a topology in the Nodes' labels
//
// +required
// +kubebuilder:validation:Required
Topologykey string `json:"topologyKey,omitempty"`
}
The following manifest represents a full example of a LokiStack with zone-aware data replication turned on using the topology.kubernetes.io/zone node label as a key to spread pods across zones and a replication factor of three:
apiVersion: loki.grafana.com/v1beta1
kind: LokiStack
metadata:
name: lokistack-dev
spec:
size: 1x.small
storage:
secret:
name: test
type: s3
storageClassName: gp3-csi
replication:
factor: 3
zones:
- topologyKey: topology.kubernetes.io/zone
maxSkew: 1
The Loki components can be divided into the Write Path (Distributor, Ingester) and the Read Path (Query frontend, Querier, Ingester). In the first pass of the feature implementation, we will not distinguish between the two paths. See the Drawbacks section for more details on this.
In this step, all components and their replicas are configured to spread over different zones. This can be done by making use of the PodTopologySpreadConstraint feature in Kubernetes.
Commonly used node labels to identify failure-domains are
Kubernetes makes a few assumptions about the structure of zones and regions:
The user needs to be aware of the node labels set to identify the different topology domains. Node read operations are generally made available to admin/developer users in OCP. This should be provided in the Lokistack CR topology key so that the podTopologySpreadConstraint can use this to schedule the pods accordingly.
These are some of the checks needed before the deployment happens:
If both these conditions are satisfied, deploy the lokistack components using the PodTopologySpreadConstraint definition in the deployments/statefulsets Pod template spec.
The second step is to identify the domain the pod is scheduled in and then add this value into the Loki configuration file (loki-config.yaml). The configurations required for adding zone-aware information in the Loki configuration is discussed here(see Loki config). There is no easy way to implement this since the Kubernetes Downward-API does not support exposing node labels within containers. (see Kubernetes issue)
A couple of solutions to implement passing the node labels to the Loki pods are listed below:
The loki-operator watches the pods when they are scheduled. If the pod.spec.topologySpreadConstraints.topologyKey is set, then the operator extracts the topology key and value from the node where the pod is scheduled, and sets it as an annotation for the pod by patching the pod. For reference consider:
Here, the plan is to introduce an Admission Mutating Webhook, that would watch the pods/binding sub-resource for each Loki pods. This webhook can update the pod annotations to add the topology key-value pair(s) when it is being scheduled on a node.
Catch API calls to pods/binding sub-resource using a webhook:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
name: mutating-pod-webhook-configuration
webhooks:
- clientConfig:
service:
name: webhook-service
namespace: system
path: /pod-binding-loki-grafana-com-v1
name: podbinding.loki.grafana.com
objectSelector:
matchLabels:
"app.kubernetes.io/instance": "lokistack-dev"
rules:
- apiGroups:
- ""
apiVersions:
- v1
resources:
- bindings
- pods/binding
operations:
- CREATE
Decoding the binding request provides the target node to read the topology labels from (e.g. aws-node-0):
{
"name": "sample-request-0",
"namespace": "default",
"operation": "CREATE",
"userInfo": {
"username": "system:kube-scheduler",
"uid": "uid:system:kube-scheduler",
"groups": ["system:authenticated"]
},
"object": {
"kind": "Binding",
"apiVersion": "v1",
"metadata": {
"name": "sample-request-0",
"namespace": "default",
},
"target": {
"kind": "Node",
"name": "aws-node-0"
}
}
}
Finally injecting the node topology labels into each pod as annotations:
func (wh *mutatingWebhook) Handle(ctx context.Context, request admission.Request) admission.Response {
binding := &v1.Binding{}
// Decode the /bind request
err := wh.decoder.DecodeRaw(request.Object, binding)
if err != nil {...}
if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
// 1. Read the current Pod specification
key := client.ObjectKey{Namespace: binding.ObjectMeta.Namespace, Name: binding.ObjectMeta.Name}
pod := &v1.Pod{}
if err := wh.client.Get(ctx, key, pod); err != nil {...}
// get the topology keys from the pods where the pod.spec.topologySpreadConstraints.topologyKey is set
kv, err := getTopologyKeyValue(ctx, wh.client, binding.Target.Name, pod.spec.topologySpreadConstraints.topologyKey)
if err != nil {...}
if pod.Annotations == nil {
pod.Annotations = make(map[string]string)
}
// 2. Add topology keys to the Pod's annotations
for k, v := range kv {
if k == topologyKey {
pod.Annotations[k] = v
}
}
// 3. Update the Pod
return wh.client.Update(ctx, pod)
}); err != nil {...}
}
Some reference implementations on how to use the a pods/binding sub-resource mutating webhook can be found here:
The pod annotation (having the zone information) obtained as a result of the previous step, can then be used in the container as an ENV variable and be used in the loki-config.yaml
In this approach we introduce a script in the application container which runs a loop waiting for the pod annotation value to be non-empty. Using the downwardAPI volume approach lets us know the updated value of the pod annotation, without any pod restart. (See https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#store-pod-fields)
Once the domain value information is collected via the volume, it can be used to update a ENV variable that is used in the loki-config.yaml. The loki application is then started. In this way we can be sure that the loki application will have the domain information
This is how the expected individual pod spec will look after the topology key annotation is added to the pod
kind: Pod
apiVersion: v1
metadata:
name: ingester
annotations:
topology.kubernetes.io/zone: zone-a
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
ingester: pod
containers:
- name: ingester
env:
- name: ZONE
value: ""
command: ["sh", "-c"]
args:
- while true; do
if [[ -e /etc/podinfo/annotations ]]; then
if [[ -s /etc/podinfo/annotations ]]; then
echo -en '\n\n'; cat /etc/podinfo/annotations;
ZONE=$(cat /etc/podinfo/annotations);
else
echo "Empty File"; fi; fi;
sleep 5;
done;
volumeMounts:
- name: podinfo
mountPath: /etc/podinfo
volumes:
- name: podinfo
downwardAPI:
items:
- path: "annotations"
fieldRef:
fieldPath: metadata.annotations['topology.kubernetes.io/zone']
In this approach we introduce a conditional init container which is used only if zone-aware is enabled. The Init container has the task of checking for the pod annotation to be set to the topologykey. Once this value is succesfully set, the main application container is started, and an ENV variable is set that is used in the loki-config.yaml. The loki application is then started. In this way we can be sure that the loki application will have the domain information
kind: Pod
apiVersion: v1
metadata:
name: ingester
annotations:
topology.kubernetes.io/zone: zone-a
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
ingester: pod
initContainers:
- name: init-envval
image: busybox:1.28
command: ['sh', '-c', "until cat /etc/podinfo/annotations; do echo waiting for toppologykey; sleep 2; done"]
volumeMounts:
- name: podinfo
mountPath: /etc/podinfo
containers:
- name: ingester
env:
- name: ZONE
valueFrom:
fieldRef:
fieldPath: metadata.annotations['topology.kubernetes.io/zone']
volumes:
- name: podinfo
downwardAPI:
items:
- path: "annotations"
fieldRef:
fieldPath: metadata.annotations['topology.kubernetes.io/zone']
The container can then be started by passing the configuration via the cli flag instance_availability_zone: $ZONE
PodTopologyConstraint in Kubernetes is able to handle multiple topology keys and can schedule the pods as expected. But, there is no documentation on how Cortex/Loki handles multiple topology keys. Until further information is found, a simple proposal is to concatenate the values of the different topology keys and create the $ZONE variable for the loki configuration. For example:
# node labels
topology.kubernetes.io/zone: "zone-a"
kubernetes.io/hostname: "ip-172-20-114-199.ec2.internal"
# possible concatenation of the values
ZONE: "zone-a-ip-172-20-114-199.ec2.internal"
As suggested in the Design section there is no separate enabling of the Read and the Write path in the initial pass of the feature implementation. The benefits of enabling this separately is that, during live migrations we can first turn on write path zone awareness, leave it enabled for the max query lookback period, and then enable read path zone awareness. This means that at the time we turn on the read path zone awareness all ingesters that should have data for a stream will have all the data for that stream. If you enable the read path zone awareness earlier, then potentially data would be missing in queries. At this point the loki-operator deploys all the components in an all or nothing fashion. So the users should expect missing queries in the data for the max query lookback period. To avoid missing the queries, the operator needs to update the deployment process to two stages - first the components in the write path are deployed, wait the max query lookback period(currently 30s), followed by the deployment of the components in the read path.
By just using the topology key to deploy the replicas in the different zones, we only ensure that 2 pods of different zones are not in the same node. But we cant control 2 replica pods in the same zone getting deployed on the same node. To prevent this we might want to let the user input zone details that can be used as a NodeSelector
To enable zone-aware replication for the write path and the read path:
-ingester.availability-zone CLI flag (or its respective YAML config option)-distributor.zone-awareness-enabled CLI flag (or its respective YAML config option). Please be aware this configuration option should be set to distributors, queriers and rulers.
https://github.com/grafana/rollout-operatorWithout zone-aware replication, the LokiStack pods are scheduled on different nodes within the same or different availability zones. There are no hard requirements on the number of zones within the cluster. In case a zone fails, the pods are expected to be rescheduled on a different zone, but they will not be successful because the PV cannot be moved automatically, and we cannot create new pods in a different zone which can use the old PVs. At this point a manual interventation is the only way to fix the issue, by deleting the old PVCs so that new PVs are created that can be used in the new zone.
If there are more than floor(replication_factor/2) zones with failing instances, reads and writes would not be possible and there would be a high chance of data loss. According to Cortex, the minimum number of zones should be equal to the replication factor. Which means the following for our production t-shirt sizes:
1x.small has a replication factor of 2 & all components have 2 replicas. In order to replicate a 1x.small LokiStack across zones, there has to be at least 2 zones available in the cluster. Each replica of the LokiStack pods will be scheduled on a node in a different zone. For example: zone-a and zone-b will have 1 replica each of distributor, ingester, querier, query-frontend, gateway, index-gateway, and ruler.
The issue arises when we only have 2 zones, when one of them fails, the write-path cannot hold on if data ingestion rate is high, in turn data loss is possible. This can be overcome by only enabling zone-aware replication if the number of existing zones are replication_factor + 1 (3 in this case, which is also the default for Cortex implementations). This solution is possible since most public cloud providers have 3 availability zones per region.
Switching a 1x.small LokiStack to being zone-aware: setting replication.topology.enabled in the LokiStack CR to true should be sufficient to restart the pods and reschedule them on different zones (given that the requirements are met).
1x.medium has a replication factor of 3, with the ingester and querier having 3 replicas, while the distributor, query-frontend, gateway, index-gateway, and ruler having only 2 replicas. This means 1 zone will have only 1 querier and 1 ingester that will communicate with components (e.g. distributors) in other zones . In this case, losing one zone cannot be tolerated, and relying on having a 4th zone isn't always possible. In this case it is preferable to suggest using a replication factor of 2 instead of the default set to 3.
lokiconfig.yaml should be updated to include the zone_awareness_enabled=true. The instance_availability_zone will be different in the pods of different zones.