Load Balancing Streaming Servers (2022-05-16)
Written by Winlin, Azusachino, Benjamin
When our business workload exceeds the capacity of a streaming server, we have to balance the load. Normally this is solved by clustering, but clustering is not the only way. The concept of Load Balancing is also linked to emerging terms such as Service Discovery, and a LoadBalancer is an indispensable part of any cloud service. In short, this problem is complicated, and many people ask me about it frequently, so here I'm going to discuss it systematically.
<!--truncate-->If you already have answers to the questions below and understand the mechanisms behind them, you may skip this article:
Well, let's discuss load and load balancing in detail.
Before addressing load balancing, let's define load. For a server, load is the resource consumption caused by handling client requests: as requests grow, consumption grows, and a serious imbalance may make the service unavailable. For example, all clients behave abnormally when the CPU reaches 100%.
For streaming servers, load is caused by streaming clients consuming server resources. Load is generally evaluated from the following perspectives:
What are the consequences of high load? It directly leads to system issues such as increased latency, lag, or even denial of service. The consequences of overload usually occur as a chain of events. For example:
In light of the above, the first and foremost step in reducing system load is to measure the load, especially to detect overload. This also needs further clarification.
When the load exceeds the system's capacity, overload happens. This is actually a complicated issue despite its simple description, for instance:
In conclusion, for CPU load and capacity, to know when overload happens, both the streaming server's available CPU resources and the CPU resources in use must be identified and measured:
Whether network bandwidth is overloaded is judged by the following criteria. Streaming bandwidth consumption is mainly determined by the bitrate, expressed in Kbps or Mbps, which stands for how much data is transmitted every second.
Disk load is related to video recording, log rotation, and STW (stop-the-world) pauses caused by a slow disk:
As for RAM, the main concerns are leaks and buffering. This is relatively simple, and can be handled by monitoring the system's RAM usage.
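As a rough illustration of how bitrate translates into bandwidth load, the numbers below (a 2 Mbps stream, a 1 Gbps NIC, 20% headroom) are assumptions, not SRS limits:

```python
# Rough capacity estimate for network bandwidth (assumed numbers):
# a 2 Mbps live stream served from a server with a 1 Gbps NIC.
bitrate_mbps = 2    # per-stream bitrate
nic_gbps = 1        # NIC capacity
headroom = 0.8      # keep 20% headroom to avoid saturating the NIC

max_players = int(nic_gbps * 1000 * headroom / bitrate_mbps)
print(max_players)  # 400 concurrent players before the NIC is the bottleneck
```

Beyond that point, every extra player degrades all players, which is why bandwidth must be measured and capped, not just observed.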
In addition to general resource consumption, there are some additional factors that affect load or load balancing in a streaming server, including:
These are certainly not purely load and load balancing problems. For example, to serve weak clients, WebRTC developed SVC and Simulcast. Some problems can be solved by client-side retry: a server under high load can be forced to shut down, and its clients will retry and migrate to the next server.
There is also a tendency to avoid streaming servers entirely and instead use slices such as HLS/DASH/CMAF, which are then served by a web server, making all of the above problems irrelevant. However, slicing protocols can only achieve about 3 seconds of delay at best, and commonly more than 5 seconds. It is impractical to count on a slice server for 1 to 3 seconds of delay, let alone the 500ms to 1 second low latency of live streaming, the 200ms to 500ms of RTC calls, or control scenarios within 100ms. These scenarios can only be implemented with a streaming server, regardless of whether the stream is carried over TCP or UDP.
We need to consider these issues when designing the system. For example, WebRTC should try to avoid coupling between streams and rooms, that is, the streams in a room must be distributed to multiple servers, rather than limited to one server. The more restrictions on these services, the more unfavorable it is for load balancing.
Let's talk about the load and overload conditions in SRS:
Special note on SRS's single-process design. This is actually a deliberate choice, a trade-off around performance optimization. On one hand, multiprocessing can improve processing capability; on the other hand, it increases system complexity, and it makes the overall load hard to evaluate. For instance, what is the CPU overload threshold of an 8-core multiprocess streaming server? Is it 640%? No, because the processes do not consume CPU evenly. To even out per-process consumption, we would have to implement a process-level load balancing scheme, which is an even more complicated problem.
At present, the single-process design of SRS can adapt to most scenarios. For live streaming, the Edge can use the multi-process REUSEPORT mechanism to listen on the same port to achieve multi-core consumption. RTC can use a certain number of ports. In cloud-native scenarios, you can run SRS in Docker and start multiple K8s Pods. These options require less effort.
Note: Except for cloud services that are sensitive to cost, customization is always an option, at the price of increased complexity. To the best of my knowledge, several well-known cloud vendors have implemented multiprocess versions based on SRS. We are working together to open source the multiprocessing capability to improve load capacity within an affordable complexity range. For details, please refer to Threading #2188.
Now that we understand the load of streaming servers, it's time to think about how to balance it.
Round Robin is a very simple load balancing strategy: every time a client requests a service, the scheduler picks the next server from the server list and returns it to the client:

```
server = servers[pos++ % servers.length()]
```
This strategy is more effective if each request is relatively balanced, for example, web requests are generally completed in a short time. In this way, it is very easy to add and delete servers, go online and offline, upgrade and isolate services.
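The polling strategy above can be sketched as a minimal scheduler; the server addresses are placeholders:

```python
class RoundRobin:
    """Minimal round-robin scheduler, as in the one-liner above."""

    def __init__(self, servers):
        self.servers = servers
        self.pos = 0

    def next(self):
        # Pick the next server, wrapping around at the end of the list.
        server = self.servers[self.pos % len(self.servers)]
        self.pos += 1
        return server

rr = RoundRobin(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print([rr.next() for _ in range(4)])
# ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1']
```

Note that the scheduler only balances the number of requests, not their duration, which is exactly why it falls short for long-lived streaming connections.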
Due to the long-lived connections of streaming media, the polling strategy is less useful: some requests last longer than others, which causes load imbalance. Still, the strategy works well when there are only a small number of requests.
In the Edge Cluster of SRS, the round-robin approach is also used when looking for the upstream Edge servers, with the assumption that the number of streams and the service time are relatively balanced. We consider this a reasonable and appropriate assumption for open source applications. Essentially, this is the load balancing strategy for the upstream Edge servers, which solves the SRS origin overload problem when all requests always go back to one server. As shown below:
In an SRS Origin Cluster, the Edge will also select an Origin server when pushing the stream for the first time, which uses the Round Robin strategy too. This is essentially the load balancing of the Origin server, which solves the problem of overloading the Origin server. As shown below:
In production business deployment, instead of simply using Round Robin, there is a scheduling service that collects the data from these servers, evaluates the load, and then picks an upstream edge server with lower load or higher service quality. As shown below:
Then how do we solve the load balancing problem of Edge? It relies on the Frontend Load Balancing strategy, which is the system on the frontend access point. We will talk about the commonly used methods below.
In the Round Robin part above, we focused on load balancing within the service, while the situation is a bit different for servers that directly face clients, generally called the Frontend Load Balancer:
In fact, there is no difference between DNS and HTTP DNS in terms of scheduling capability; many DNS and HTTP DNS systems even share the same decision-making system, because they solve the same problem: how to use the user's IP and other information (such as RTT or more probing data) to allocate a more appropriate node (usually the nearest one, while also considering cost).
DNS is the basis of the Internet. It can be considered a name translator. For example, when we PING the server of SRS, it resolves ossrs.io into the IP address 182.92.233.108. There is no load balancing capability here, because there is just one server; DNS only does name resolution:
```bash
ping ossrs.io
PING ossrs.io (182.92.233.108): 56 data bytes
64 bytes from 182.92.233.108: icmp_seq=0 ttl=64 time=24.350 ms
```
What DNS does in streaming-media load balancing is return different server IPs according to the client's IP, and the DNS system itself is also distributed. DNS records can be overridden in the /etc/hosts file; if there is no such entry, the domain is resolved by LocalDNS (usually configured in the system or obtained automatically).
This means DNS can withstand very large concurrency, because resolution is not provided by one centralized DNS server but by a distributed system. This is also why a new record has a TTL, or expiration time: a modified record only takes effect after that time. In practice it all depends on the policy of each DNS server, and operations such as DNS hijacking and tampering can sometimes cause load imbalance.
Hence HTTP DNS. DNS can be considered the basic network service provided by ISPs, while HTTP DNS can be implemented by streaming platform developers themselves. It is a name service exposed as an HTTP API that you can call to resolve names, for example:
```bash
curl http://your-http-dns-service/resolve?domain=ossrs.io
["182.92.233.108"]
```
Since this service is provided by yourself, you decide when to update the mapping behind the name. You can thus achieve more precise load balancing, and also use HTTPS to prevent tampering and hijacking.
Note: The your-http-dns-service of HTTP-DNS can itself use a set of IPs or a DNS domain name, because its own load is relatively well balanced.
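The resolve endpoint can be sketched as a small function; the region table and the second IP are hypothetical, and a real service would consult live load and RTT data:

```python
import json

# Hypothetical region table; a real HTTP-DNS service would consult
# load and RTT measurements rather than a static map.
SERVERS_BY_REGION = {
    "cn": ["182.92.233.108"],
    "us": ["203.0.113.10"],  # documentation IP, for illustration only
}

def resolve(domain, client_region):
    """Return candidate IPs for a domain, preferring the client's region."""
    ips = SERVERS_BY_REGION.get(client_region) or SERVERS_BY_REGION["cn"]
    return json.dumps(ips)

print(resolve("ossrs.io", "cn"))  # ["182.92.233.108"]
```

Because the platform controls this API, it can change the answer per client and per moment, which plain DNS with its TTL cannot do as precisely.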
SRS supports Vhost, which is generally used by CDN platforms to isolate multiple customers. Each customer can have its own domain name, such as:
```
vhost customer.a {
}
vhost customer.b {
}
```
If users push streams to the same server IP but use different vhosts, then they are different streams. When played back, they are also different streams, with different URL addresses, for example:

```
rtmp://ip/live/livestream?vhost=customer.a
rtmp://ip/live/livestream?vhost=customer.b
```

Note: Of course, you can use the DNS system to map the IP to different domain names, so that you can use the domain name directly in the URL.
In fact, Vhost can also be used for load balancing of multiple origin servers, because in Edge, different customers can be distributed to different origin servers, so that the capabilities of the origin server can be expanded without using Origin Cluster:
```
vhost customer.a {
    cluster {
        mode    remote;
        origin  server.a;
    }
}
vhost customer.b {
    cluster {
        mode    remote;
        origin  server.b;
    }
}
```
Different vhosts actually share the same Edge nodes, but the Upstream and Origin can be isolated. Of course, the same can be done with Origin Cluster; in that case there are multiple origin centers, which is somewhat similar to the goal of Consistent Hash.
In the scenario where Vhost isolates users, the configuration file can become much more complicated. There is a simpler strategy to achieve the same job: Consistent Hash.
For example, a hash can be computed from the URL of the stream requested by the user and used to determine which Upstream or Origin to visit, achieving the same isolation and load reduction.
In practice there are already such solutions available online, so the scheme is definitely feasible. However, SRS does not implement this capability; you need to implement it yourself.
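Since SRS leaves this to you, here is a minimal sketch of hashing a stream URL onto a ring of origins; the server names are placeholders:

```python
import hashlib
from bisect import bisect

class ConsistentHash:
    """Map a stream URL to an upstream server on a hash ring (sketch)."""

    def __init__(self, servers, replicas=100):
        # Each server gets `replicas` virtual points on the ring,
        # which smooths the distribution across servers.
        self.ring = sorted(
            (self._hash("%s#%d" % (s, i)), s)
            for s in servers for i in range(replicas)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def pick(self, stream_url):
        # First ring point clockwise from the URL's hash.
        idx = bisect(self.keys, self._hash(stream_url)) % len(self.ring)
        return self.ring[idx][1]

ch = ConsistentHash(["origin-a", "origin-b", "origin-c"])
server = ch.pick("rtmp://ip/live/livestream")
# The same URL always maps to the same origin, so all viewers of a
# stream share one upstream; adding a server only remaps ~1/N of streams.
```

Compared to per-customer vhost configuration, the mapping here is computed, so no configuration grows with the number of customers.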
In fact, Vhost or Consistent Hash can also be used along with Redirect to accomplish more complex load balancing.
302 is a redirect, and it can actually be used for load balancing. For example, when a request reaches a server through scheduling and the server finds its own load too high, it can redirect the request to another server, as shown in the following figure:
Note: Not only does HTTP support 302; RTMP supports it too, and the SRS Origin Cluster is implemented this way. However, 302 there is mainly used for stream service discovery, not for load balancing.
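The load-driven redirect described above can be sketched as a pure decision function for an HTTP-FLV edge; the load metric, threshold, and peer address are all assumptions:

```python
LOAD_LIMIT = 0.8                    # assumed overload threshold
PEER = "http://edge2.example.com"   # hypothetical peer edge

def handle_play(path, current_load):
    """Return (status, location) for a play request on an HTTP-FLV edge."""
    if current_load > LOAD_LIMIT:
        # Overloaded: 302 the player to a peer edge; HTTP-FLV and HLS
        # clients follow redirects transparently.
        return 302, PEER + path
    return 200, None  # serve the stream locally

print(handle_play("/live/livestream.flv", 0.95))
# (302, 'http://edge2.example.com/live/livestream.flv')
```

The key property is that the overloaded server itself sheds load, without waiting for the frontend scheduler to notice.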
Since RTMP also supports 302, we can use 302 to achieve load rebalancing within the service. If the load of one Upstream is too high, the stream will be scheduled to other nodes with several 302 jumps.
Generally speaking, on a Frontend Server only HTTP streams support 302, such as HTTP-FLV or HLS. RTMP 302 requires support on the client side, which is generally absent, so it is infeasible there.
In addition, some UDP-based streaming protocols support 302, such as RTMFP (a Flash P2P protocol designed by Adobe), but it is rarely used now.
WebRTC currently has no 302 mechanism and generally relies on the Frontend Server's proxy to achieve load balancing of the servers behind it. QUIC, as the future HTTP/3 standard, will certainly support basic features like 302. WebRTC will gradually support WebTransport (based on QUIC), so this capability will also become available in the future.
SRS Edge is essentially a Frontend Server and solves the following problems:
Round Robin when connecting to the Upstream Edge, to achieve the Upstream's load balancing. As shown below:
Special Note:
Since Edge itself is a Frontend Server, it is generally unnecessary to place Nginx or another LB in front of it to increase system capacity, because Edge itself exists to solve the capacity problem, and only Edge can solve the problem of merging back to the origin.
Note: Merging back to the origin means that the same stream will only be returned to the origin once. For example, if 1000 players are connected to the Edge, the Edge will only get 1 stream from the Upstream, rather than 1000 streams. This is different from the transparent Proxy.
Of course, sometimes we still need to place Nginx or LB in front, for example:
In addition, no other servers should be placed in front of Edge, and services should be provided directly by Edge.
SRS Origin Cluster, different from the Edge Cluster, is mainly to expand the origin server’s capability:
The SRS Origin Cluster should not be accessed directly; it relies on the Edge Cluster to provide external services, for two simple reasons:
Note: In fact, the Edge could also query the stream-location service before accessing the origin server, and connect directly to the origin that owns the stream. But the stream might be switched away, so a stream-relocation process is still required.
The whole process is quite simple, as shown below:
Since the stream is always on only one origin server, the HLS slice will also be generated by one origin without extra synchronization. Generally we could use shared storage, or use on_hls to send slices to the cloud storage.
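For the on_hls approach, the HTTP callback can notify a handler of each new slice, which then uploads it to cloud storage. A sketch of the relevant config follows; the callback URL is hypothetical:

```
vhost __defaultVhost__ {
    hls {
        enabled on;
    }
    http_hooks {
        enabled on;
        # Called each time an HLS slice is generated, so the handler
        # can upload the ts file to cloud storage.
        on_hls http://your-server/api/v1/hls;
    }
}
```

This avoids shared storage entirely: each origin pushes its own slices, and the player fetches them from the cloud storage or CDN.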
Note: Another way to achieve this is dual-stream hot backup. There are usually two different streams, and you must implement the backup mechanism yourself. This is generally very complicated for HLS, SRT, and WebRTC, and SRS does not support it.
From the aspect of load balancing, the origin cluster fits the job as a scheduler.
Round Robin when the Edge returns to the origin server.
The load of WebRTC falls only on the origin servers. Load balancing at the edge has little relevance to the WebRTC service model, because the ratio of publishers to viewers of a stream in WebRTC is close to one, rather than the 1-to-10k of the live-streaming scenario. In other words, the edge solves the problem of massive viewing; when publishing and viewing are similar, there is no need for load balancing at the edge (in live streaming you can use the edge for access and redirect).
Since WebRTC does not implement 302 redirect, there is no need to deploy an edge (for access). For example, in a common Load Balancing scenario there are 10 SRS origin servers behind one VIP (that is, the same VIP is returned to every client). Which SRS origin server does the client end up with? It depends entirely on the Load Balancing strategy, and it is impossible to insert an edge to perform an RTMP-302-style redirect as in live streaming.
Therefore, the load balancing of WebRTC cannot be solved by Edge at all, whereas it relies on the Origin Cluster. Generally speaking, in RTC, this is called Cascade, that is, all nodes are equal, but connected with different levels of Routing to increase the load capacity. As shown below:
This is an essential difference from the Origin Cluster. There is no media transmission between servers in the Origin Cluster, and RTMP 302 is used to instruct Edge redirecting to a specified origin server. The load of the origin server is predictable and manageable, as one origin server only owns a limited number of Edges.
The Origin Cluster strategy is not suitable for WebRTC, because the client needs to connect directly to the origin server, and then the load of the origin server is not manageable at this time. For example, in a conference of 100 people, each stream will be subscribed by 100 people. At this time, users are distributedly connected to different origin servers. Hence, establishing connections among all the origin servers is required, in order to push one's own stream and get others’.
Note: This example is a rather rare case. Generally speaking, a 100-person interactive conference will use the MCU mode, in which the server will merge different streams into one, or selectively forward several streams. The internal logic of such a server is very complicated.
In fact, one-to-one calls are the common WebRTC scenario, accounting for about 80% of business cases. Here everyone publishes one stream and plays one stream. In the typical case, users can simply connect to a nearby origin server; but when users are in different geographical locations, such as different regions or countries, cascading between origin servers improves call quality.
Under the cascading architecture of origin servers, users access the HTTPS API using DNS or HTTP DNS, and the origin server's IP is returned in the SDP. This is the opportunity for load balancing: return an origin server that is close to the user and lightly loaded.
In addition, how do we cascade multiple origin servers? If users are in a similar region, they can be dispatched to one origin server to avoid cascading, which saves internal transmission bandwidth (a worthwhile and effective optimization when there are many one-to-one calls in the same area). At the same time, this reduces the schedulability of the load, especially when a conference grows into a multi-person one.
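The "close and lightly loaded" selection can be sketched as follows; the origin list, its region/load schema, and the IPs are all hypothetical:

```python
def pick_origin(origins, client_region):
    """Pick the least-loaded origin in the client's region, falling back
    to the globally least-loaded one. Schema is an assumption."""
    candidates = [o for o in origins if o["region"] == client_region] or origins
    return min(candidates, key=lambda o: o["load"])

origins = [
    {"ip": "10.0.0.1", "region": "cn", "load": 0.7},
    {"ip": "10.0.0.2", "region": "cn", "load": 0.3},
    {"ip": "10.0.0.3", "region": "us", "load": 0.1},
]
print(pick_origin(origins, "cn")["ip"])  # 10.0.0.2
```

The chosen IP would then be written into the SDP answer, so the client connects straight to that origin.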
Therefore, in a conference, distinguishing one-to-one from multi-person conferences, or limiting the number of participants in it is actually very helpful for load balancing. It's easier to schedule and achieve load-balance if you are aware this is a one-to-one conference ahead of time. Unfortunately, project managers are generally not interested in this opinion.
Remark: Note that the cascading feature has not yet been implemented in SRS. Only a prototype exists, and it has not been committed to the repository.
In particular, let's talk about some WebRTC-related protocols, such as TURN, ICE, and QUIC.
ICE is not actually a transmission protocol; it is more of an identification protocol, generally referring to Binding Request and Response, which contain IP and priority information to identify addresses and channels for multi-channel selection, such as choosing between 4G and WiFi when both are good enough. It is also used as the session heartbeat, so the client keeps sending ICE messages.
Therefore, ICE has no effect on load balancing, but it can be used to identify sessions, similar to QUIC's ConnectionID, so it can play a role in identifying sessions when passing through Load Balancing, especially when the client's network switches.
The TURN protocol is actually very unfriendly to Cloud Native, because it needs to allocate a range of ports and uses ports to distinguish users. This is practical in private networks, where ports can be assumed unlimited, but ports on the cloud are often limited.
Note: Of course, TURN can also multiplex a single port without actually assigning ports. This forgoes direct TURN-to-TURN communication and goes through the SFU instead, which is not a problem for SRS.
The real value of TURN is downgrading to TCP: some enterprise firewalls do not support UDP, so clients can only use TCP, via TURN's TCP function. Of course, you can also host TCP directly; for example, mediasoup already supports this, but SRS does not yet.
QUIC's 0-RTT connection is friendlier: the client caches something like an SSL ticket and can skip the handshake. For load balancing, QUIC is more effective because it has a ConnectionID: even when the client changes address and network, the Load Balancer can still tell which backend service handles the connection. Of course, this also makes server load harder to migrate.
In fact, such a complex set of protocols and systems as WebRTC is quite messy and disgusting. Since the 100ms-level delay is a hard indicator, UDP and a complex set of congestion control protocols must be used, and encryption is also indispensable. Some people claim that Cloud Native's RTC is the future, which introduces more problems like port multiplexing, load balancing, long-connection and restart upgrades, as well as the structure that has been turned upside down, and HTTP/3 and QUIC to spoil the game...
Perhaps for the load balancing of WebRTC, there is one word that is most applicable: there is no difficulty in the world, as long as you are willing to give up.
The premise of Load Balancing is knowing the load, which depends heavily on data collection and computation. Prometheus serves this purpose: it continuously collects various data and evaluates them against its set of rules. Prometheus is essentially a time-series database.
System load, which is also essentially a series of time series data, changes over time.
For example, Prometheus has a node_exporter, which provides the relevant timing information of the host node, such as CPU, disk, network, memory, etc., which can be used to compute service load.
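For example, a scheduler could compute CPU usage from node_exporter metrics via Prometheus' HTTP query API. The PromQL below is a common pattern for node_exporter (not SRS-specific), and the prometheus:9090 address is an assumption:

```python
import urllib.parse

# PromQL: percentage of CPU in use, averaged across cores over the
# last minute, derived from node_exporter's idle-time counter.
query = '100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100'

# Build the Prometheus HTTP API request a scheduler would poll.
url = "http://prometheus:9090/api/v1/query?" + urllib.parse.urlencode({"query": query})
print(url)
```

The scheduler then feeds the returned value into its overload threshold, exactly the measurement step discussed at the start of this article.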
Many services own a corresponding exporter. For example, redis_exporter collects Redis load data, nginx-exporter collects Nginx load data.
At present, SRS has not implemented its own srs-exporter, but it will be implemented in the future. For details, please refer to #2899.
At SRS, our goal is to establish a non-profit, open-source community dedicated to creating an all-in-one, out-of-the-box, open-source video solution for live streaming and WebRTC online services.
Additionally, we offer a Cloud service for those who prefer to use cloud service instead of building from scratch. Our cloud service features global network acceleration, enhanced congestion control algorithms, client SDKs for all platforms, and some free quota.
To learn more about our cloud service, click here.
Welcome for more discussion at discord.