docs/en/setup/backend/backend-cluster.md
In many production environments, the backend needs to support distributed aggregation, high throughput and provide high availability (HA) to maintain robustness, so you always need to setup CLUSTER management in product env. Otherwise, you would face metrics inaccurate.
core/gRPCHost is listening on 0.0.0.0 for quick start as the single mode for most cases.
Besides the Kubernetes coordinator, which is using the cloud-native mode to establish cluster, all other coordinators
requires core/gRPCHost updated to real IP addresses or take reference of internalComHost and internalComPort in each
coordinator doc.
NOTICE, cluster management doesn't provide a service discovery mechanism for agents and probes. We recommend agents/probes using gateway to load balancer to access OAP clusters.
There are various ways to manage the cluster in the backend. Choose the one that best suits your needs.
In the application.yml file, there are default configurations for the aforementioned coordinators under the
section cluster. You can specify any of them in the selector property to enable it.
NOTICE, Before you set up the cluster, please read the Query Cluster Nodes API to understand how to verify the cluster node list. If the nodes don't match the expectation, the cluster is not working properly, there could be many feature impacts, e.g. the metrics could be inaccurate and the alarms could not be triggered correctly.
The required backend clusters are deployed inside Kubernetes. See the guides in Deploy in kubernetes.
Set the selector to kubernetes.
cluster:
selector: ${SW_CLUSTER:kubernetes}
# other configurations
Meanwhile, the OAP cluster requires the pod's UID which is laid at metadata.uid as the value of the system environment variable SKYWALKING_COLLECTOR_UID
containers:
# Original configurations of OAP container
- name: {{ .Values.oap.name }}
image: {{ .Values.oap.image.repository }}:{{ required "oap.image.tag is required" .Values.oap.image.tag }}
# ...
# ...
env:
# Add metadata.uid as the system environment variable, SKYWALKING_COLLECTOR_UID
- name: SKYWALKING_COLLECTOR_UID
valueFrom:
fieldRef:
fieldPath: metadata.uid
Read the complete helm for more details.
NOTICE
In all the following coordinators, oap.internal.comm.host:oap.internal.comm.port is registered as the ID
and address for the current OAP node. By default, because they are same in all OAP nodes, the registrations are conflicted,
and (may) show as one registered node, which actually would be the node itself. In this case, the cluster mode is NOT working.
Please check the registered nodes on your coordinator servers, to make the registration information unique for every node. You could have two options
core/gRPCHost(oap.internal.comm.host) and core/gRPCPort(oap.internal.comm.port) for internal,
and setup external communication channels for data reporting and query.internalComHost and internalComPort in the config to provide a unique host and port for every OAP node. This
host name port should be accessible for other OAP nodes.Zookeeper is a very common and widely used cluster coordinator. Set the cluster/selector to zookeeper in the yml to enable it.
Required Zookeeper version: 3.5+
cluster:
selector: ${SW_CLUSTER:zookeeper}
# other configurations
hostPort is the list of zookeeper servers. Format is IP1:PORT1,IP2:PORT2,...,IPn:PORTnenableACL
enable Zookeeper ACL to
control access to its znode.schema is Zookeeper ACL schemas.expression is a expression of ACL. The format of the expression is specific to
the schema.hostPort, baseSleepTimeMs and maxRetries are settings of Zookeeper curator client.Note:
Zookeeper ACL is enabled and /skywalking exists, you must ensure that SkyWalking has CREATE, READ
and WRITE permissions. If /skywalking does not exist, it will be created by SkyWalking, and all permissions to the
specified user will be granted. Simultaneously, znode grants READ permission to anyone.schema as digest, the password of the expression is set in clear text.In some cases, the OAP default gRPC host and port in the core are not suitable for internal communication among the OAP
nodes, such as the default host(0.0.0.0) should not be used in cluster mode.
The following settings are provided to set the host and port manually, based on your own LAN env.
cluster:
selector: ${SW_CLUSTER:zookeeper}
...
zookeeper:
namespace: ${SW_NAMESPACE:""}
hostPort: ${SW_CLUSTER_ZK_HOST_PORT:localhost:2181}
#Retry Policy
baseSleepTimeMs: ${SW_CLUSTER_ZK_SLEEP_TIME:1000} # initial amount of time to wait between retries
maxRetries: ${SW_CLUSTER_ZK_MAX_RETRIES:3} # max number of times to retry
internalComHost: ${SW_CLUSTER_INTERNAL_COM_HOST:172.10.4.10}
internalComPort: ${SW_CLUSTER_INTERNAL_COM_PORT:11800}
# Enable ACL
enableACL: ${SW_ZK_ENABLE_ACL:false} # disable ACL in default
schema: ${SW_ZK_SCHEMA:digest} # only support digest schema
expression: ${SW_ZK_EXPRESSION:skywalking:skywalking}
Recently, the Consul system has become more and more popular, and many companies and developers now use Consul as their service discovery solution. Set the cluster/selector to consul in the yml to enable it.
cluster:
selector: ${SW_CLUSTER:consul}
...
consul:
serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
# Consul cluster nodes, example: 10.0.0.1:8500,10.0.0.2:8500,10.0.0.3:8500
hostPort: ${SW_CLUSTER_CONSUL_HOST_PORT:localhost:8500}
aclToken: ${SW_CLUSTER_CONSUL_ACLTOKEN:""}
internalComHost: ${SW_CLUSTER_INTERNAL_COM_HOST:""}
internalComPort: ${SW_CLUSTER_INTERNAL_COM_PORT:-1}
Same as the Zookeeper coordinator,
in some cases, the OAP default gRPC host and port in the core are not suitable for internal communication among the OAP
nodes, such as the default host(0.0.0.0) should not be used in cluster mode.
The following settings are provided to set the host and port manually, based on your own LAN env.
Set the cluster/selector to etcd in the yml to enable it. The Etcd client uses v3 protocol with the CoreOS official library. Only the v3 protocol is supported for Etcd.
cluster:
selector: ${SW_CLUSTER:etcd}
# other configurations
etcd:
# etcd cluster nodes, example: 10.0.0.1:2379,10.0.0.2:2379,10.0.0.3:2379
endpoints: ${SW_CLUSTER_ETCD_ENDPOINTS:localhost:2379}
namespace: ${SW_CLUSTER_ETCD_NAMESPACE:/skywalking}
serviceName: ${SW_CLUSTER_ETCD_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
authentication: ${SW_CLUSTER_ETCD_AUTHENTICATION:false}
user: ${SW_CLUSTER_ETCD_USER:}
password: ${SW_CLUSTER_ETCD_PASSWORD:}
Same as the Zookeeper coordinator,
in some cases, the OAP default gRPC host and port in the core are not suitable for internal communication among the OAP
nodes, such as the default host(0.0.0.0) should not be used in cluster mode.
The following settings are provided to set the host and port manually, based on your own LAN env.
Set the cluster/selector to nacos in the yml to enable it.
Nacos 2.x is required.
cluster:
selector: ${SW_CLUSTER:nacos}
# other configurations
Nacos supports authentication by username or accessKey. Empty means that there is no need for authentication. Extra config is as follows:
nacos:
username:
password:
accessKey:
secretKey:
Same as the Zookeeper coordinator,
in some cases, the OAP default gRPC host and port in the core are not suitable for internal communication among the OAP
nodes, such as the default host(0.0.0.0) should not be used in cluster mode.
The following settings are provided to set the host and port manually, based on your own LAN env.