trunk/3rdparty/srs-docs/blog/2024-10-18-Hidden-Flaws-of-SRS.md
In open-source streaming server projects, SRS is widely used and has been maintained for a long time, starting in 2013. Over the years, various issues have been addressed, but some remain challenging to solve. This article summarizes and shares these insights.
Some people think SRS has many features and is complete, but it is far from a perfect open-source project and never will be. Anyone can contribute to the SRS project at any time, and participation is always welcome.
The cluster capability is a significant weakness of SRS. There have been ongoing debates and design revisions regarding the types of clusters it actually supports.
SRS has supported Edge clusters since version 2.0 in 2015, claiming to be capable of large-scale RTMP/FLV distribution. However, after a decade of development, it was found that very few users actually utilize Edge, especially considering the lack of support for protocols like SRT and WebRTC. There is a notable disparity in capabilities between open-source project users and what video cloud services require in terms of media streaming clusters.
In my assessment, SRS doesn’t necessarily need to support large-scale distribution but rather focus on small to medium-sized clusters. However, even in smaller clusters, it’s essential to support thousands or tens of thousands of concurrent connections. Moreover, it’s crucial to have robust cluster capabilities for mainstream protocols, not just limited to RTMP/FLV.
When discussing media streaming clusters, apart from the media processing server SRS, there are also scheduling and operational systems involved. Even with RTMP URL access systems, there is still DNS resolution and scheduling. While WebRTC signaling typically uses HTTPS, separate from media transport, clusters inherently involve scheduling and distribution capabilities.
Firstly, SRS lacks cluster scheduling capabilities as it’s primarily an open-source media server. To build a million-concurrent-user media streaming cluster, one needs to address scheduling independently. CDNs typically use DNS or HTTP-DNS scheduling. Scheduling essentially determines the system’s capacity utilization and operational costs.
Secondly, the cluster architecture could be tree-based (source-edge) or graph-based (forwarding or cascading) but there isn’t a definitive answer. In China, due to multiple operators, tree architecture is common for handling massive playback issues in live streaming. However, for building a global network or scenarios with massive cold streams (like network conferences), a tree structure is inadequate.
Lastly, load balancing for streaming servers, including load collection and overload protection, is usually balanced internally within the system. While scheduling acts as the first line of load balancing defense, internal load balancing is crucial. Although SRS has implemented circuit breaking, it hasn’t yet implemented load collection and balancing, like redirecting overloaded traffic to other servers through 302 redirects. For more on load balancing, refer to the Load Balancing Streaming Servers summary.
Below is the source-edge architecture of SRS, where multiple Edge servers retrieve streams from the Origin server. Each stream is fetched from the Origin only once per Edge, and Edges can also fetch streams from other Edges, creating a multilevel Edge structure to support massive playback:
SRS(Edge)
SRS(Origin) --RTMP/FLV---> SRS(Edge) --RTMP/FLV--> Massive Playback
SRS(Edge)
Note: This Edge architecture was initially derived from Adobe AMS. However, AMS used a modified RTMP protocol with custom messages for cluster management, while SRS uses a standard RTMP protocol, achieving a standard RTMP client implementation.
Next is the source server cluster architecture of SRS, where multiple source servers connect in a mesh, exchanging stream information. They use RTMP 302 redirection to direct clients to the target source server, dispersing streams to multiple sources to support more streams:
SRS(Origin)
+(MESH) ---RTMP---> SRS(Edge) --RTMP/FLV--> Streaming or Playback
SRS(Origin)
In reality, this source server cluster isn’t an ideal solution, especially for the WebRTC protocol. Since SRS is single-threaded and WebRTC protocol performance is notably low, each SRS source server can support around 300 WebRTC connections. Enabling WebRTC and RTMP protocol conversion may support around 40 connections (with audio transcoding). While live streaming protocols may support 3000–5000 concurrent users, WebRTC is on a different scale, necessitating new source server cluster solutions.
The new source server cluster architecture of SRS uses Proxy servers to forward traffic to backend source servers. Proxies are stateless and horizontally scalable, addressing many streaming scenarios:
SRS(Origin)
SRS(Origin) ---> SRS(Proxy) --> Streaming or Playback
SRS(Origin)
Considering the WebRTC protocol, using Proxy to connect 100 SRS Origins, even with WebRTC to RTMP protocol conversion, can support up to 4000 streams, with each stream accommodating up to 300 viewers. This setup can handle a maximum of 4000 streams or support 100 streams with 30K viewers each.
Note: SRS also supports Dynamic Forwarding, actively forwarding streams to different backend services, essentially acting as a proxy. The difference lies in SRS Proxy being implemented in Go, relying on Redis for statelessness, facilitating system integration and various business scheduling logic. On the other hand, SRS Forwarding is a feature implemented by SRS, supporting only the RTMP protocol, and making modifications can be complex.
The new source server cluster can be used in conjunction with Edge servers, still supporting numerous RTMP streams and massive playback for each stream:
SRS(Origin) SRS(Edge)
SRS(Origin) ---> SRS(Proxy) --> SRS(Edge) --> Streaming or Playback
SRS(Origin) SRS(Edge)
Apart from top-tier video clouds and CDNs requiring millions of streaming and playback capabilities, in general applications, handling thousands of streams and playback is the scale most business scenarios can achieve. For smaller applications, WebRTC relies on clusters to address performance bottlenecks, while other protocols focus on avoiding single points of failure to improve availability.
Cluster implementation for RTMP/FLV protocols is relatively straightforward. For Edge servers, RTMP/FLV is stateless, so if issues arise, switching to the next source server is usually sufficient. Clients may experience minor content shifts or repetitions, but these are generally imperceptible. When saying stateless, it means these states can be discarded, and errors can be resolved through retries. If an Edge server encounters issues, clients simply retry, relying on client retries for error recovery.
Implementing a cluster for HLS protocol is challenging, especially in handling stream switching. HLS protocol can handle client stream reconnections correctly, using DISCONTINUITY tags to ensure continuous segments. However, if a source server crashes or restarts, SRS currently lacks persistent storage and recovery of segment information, leading to M3U8 content regeneration and client disruptions. Similar issues arise when streams migrate between source servers. HLS clusters need to address these exceptional scenarios.
Another challenging aspect of HLS protocol is data statistics like connection numbers. Since HLS is segment-based and lacks a concept of connections, SRS introduced HLS_CTX to achieve connection statistics through QueryString. However, for HLS clusters, Edge servers need to implement HLS distribution and connection statistics, a feature not yet available in SRS. If using Nginx for Edge HLS distribution, connection statistics support needs to be added to Nginx.
For SRT and WebRTC, SRS currently only supports Proxy, enabling stream scheduling to different Origins, but Edge servers do not support these two protocols. This limitation means that if a stream has a large number of viewers, SRT or WebRTC protocols cannot be used. Addressing how to achieve comprehensive protocol stack support on Edge servers is a challenging aspect of clusters.
Typically, K8s or HELM are used to manage clusters, including deployment, upgrades, scaling, shrinking, and rollbacks. SRS clusters consider K8s support, with a critical capability being Gracefully Quit, where ports are closed upon exit, and services are restarted only after existing connections gradually close, enabling smooth upgrades. Gracefully Quit is easily implemented for TCP protocols, but UDP protocol support is still pending.
For end-to-end tracing, SRS passes connection IDs and other information in the RTMP protocol and supports the OpenTelemetry standard protocol, integrating with Tencent Cloud’s APM. However, APM lacks an open-source platform like Prometheus, and integrating with different APM cloud services requires some effort. Additionally, SRS Proxy does not yet support APM instrumentation extensively. SRS’s own APM instrumentation is limited, providing only essential information and not comprehensive coverage.
Using Go to implement Proxy poses potential performance bottlenecks. As Go handles media traffic, it may not achieve the same high performance as C++, especially on multi-core CPUs, potentially leading to a 30% performance loss. While this issue may not be significant in small clusters, most scenarios do not require such high performance levels. RUST could be an alternative option, but maintaining C++, Go, and RUST simultaneously increases community maintenance complexity and incurs high costs.
Regarding clusters, there are considerations around multi-process or multi-threading issues.
Multi-core processing or multi-threading is essential for servers as they typically support multiple CPU cores, enabling a single machine to offer more services externally. From a technical standpoint, there are various solutions:
SRS has traditionally been single-process, single-threaded, akin to a single-process version of Nginx, with the addition of coroutines for concurrent processing. Coroutines are implemented using the StateThreads library, which has been modified to support thread-local functionality for operation in a multi-threaded environment. Despite experimenting and analyzing thread-local handling for a media architecture over the years, SRS has not adopted a thread-local approach but rather a different multi-threaded architecture that is still in the planning stage:
In SRS 5.0, StateThreads were restructured to support thread-local functionality and initiated a main thread and subthreads to transition the architecture into a multi-threaded model. However, various issues arose during subsequent stages, leading to a default return to a single-threaded architecture in SRS 6.0. The future may see the removal of multi-threading capabilities as they become less critical with the new Proxy and Edge architectures. If Proxy continues to enhance its capabilities, encompassing various protocols and Edge functionality, it will gradually evolve into a Proxy+Origin cluster, fully resolving the multi-threading challenge.
Additionally, we explored another potential architecture where specific capabilities are distributed across different threads, like using separate threads for WebRTC encryption and decryption. However, this approach transforms into a typical multi-threaded program rather than a thread-local architecture, resulting in performance overhead from locks and reduced stability — not an ideal direction.
In essence, Go can be considered a traditional multi-threaded architecture, not a thread-local architecture, which is why Go also requires locks, contributing to its lower performance. For media streaming clusters, Go may not be the best language as performance losses increase costs for media servers. While C++ is prone to crashing issues, which can’t be fundamentally resolved, RUST might be a potential option, albeit requiring research and experimentation.
The stability issues in medium to large C++ projects are indeed a pain point, especially when introducing SharedPtr for Source object release.
The memory leakage issue in SRS took approximately 10 years to resolve, as detailed in the Source cleanup discussion.
In reality, it wasn’t exactly a leak but rather a failure to release caches. Initially, SRS focused on handling a small number of RTMP streams with a large number of viewers, prioritizing the release of client connections over Stream Source objects. This approach was chosen because most memory was consumed by client connections, while Stream Source objects occupied minimal memory. By not releasing Stream Source objects, the system was simplified, avoiding many memory issues and ensuring high stability.
However, if there were frequent URL changes for streams or a high volume of publishing with few viewers, rapid memory escalation became evident.
Releasing Stream Source objects requires addressing various object references to Stream Source. Since SRS supports protocols like RTMP, HTTP-FLV, HLS, SRT, WebRTC, DASH, GB28181, and their conversions, scenarios involving mutual references arise.
Consider an RTMP client pushing to SRS, creating a Stream Source for each URL:
srs_error_t SrsRtmpConn::stream_service_cycle() {
SrsLiveSource* source = NULL;
_srs_sources->fetch_or_create(req, server, &source);
}
When starting a stream, an HTTP mount is created to enable HTTP FLV playback:
srs_error_t SrsHttpStreamServer::http_mount(SrsLiveSource* s, SrsRequest* r) {
entry = new SrsLiveEntry(mount);
entry->source = s;
entry->stream = new SrsLiveStream(s, r, entry->cache);
mux.handle(mount, entry->stream);
}
srs_error_t SrsLiveStream::do_serve_http(ISrsHttpResponseWriter* w, ISrsHttpMessage* r) {
SrsLiveConsumer* consumer = NULL;
source->create_consumer(consumer);
}
If the Source remains valid, business logic can be easily implemented. However, if cleanup is needed, various scenarios must be considered, significantly increasing the risk of dangling pointers:
In practice, C++ 11’s shared pointer, with some advanced features, not only manages memory through reference counting but also includes:
After implementing simplified smart pointers, Stream Source is managed using timers for checking and releasing. This standard pointer usage involves releasing the Source smart pointer when the reference count reaches zero:
srs_error_t SrsLiveSourceManager::notify(int event, srs_utime_t interval, srs_utime_t tick) {
std::map< std::string, SrsSharedPtr<SrsLiveSource> >::iterator it;
for (it = pool.begin(); it != pool.end();) {
SrsSharedPtr<SrsLiveSource>& source = it->second;
source->cycle();
if (source->stream_is_dead()) {
pool.erase(it++); // Free source smart ptr
}
To address circular reference issues, comments are used to avoid circular references. Directly using raw pointers can help, for instance:
// Source holds and releases hub
class SrsLiveSource {
SrsOriginHub* hub;
}
class SrsOriginHub : public ISrsReloadHandler {
// Because source references to this object, so we should directly use the source ptr.
SrsLiveSource* source_;
}
Another scenario involves two objects triggering destruction, but their lifecycles aren’t independent. For instance, in the RTC context, a session like SrsRtcConnection contains SrsRtcTcpConn. When the session times out, the TCP connection must be closed. Conversely, when the TCP connection closes, it must trigger session destruction:
// Session contains the TCP connection
class SrsRtcConnection {
SrsRtcNetworks* networks_;
}
class SrsRtcTcpNetwork: public ISrsRtcNetwork {
SrsSharedResource<SrsRtcTcpConn> owner_;
}
// TCP connection directly uses the session's raw pointer, as the session's lifecycle is shorter
class SrsRtcTcpConn {
// Because session references to this object, we should directly use the session ptr.
SrsRtcConnection* session_;
}
// When TCP disconnects, trigger session expiration and destruction
srs_error_t SrsRtcTcpConn::cycle() {
// Only remove session when network is established, because client might use other UDP network.
if(session_ && session_->tcp()->is_establelished()) {
session_->tcp()->set_state(SrsRtcNetworkStateClosed);
session_->expire();
}
}
Dealing with GB28181 is more complex, where a session includes both Sip and Media components. Although there are no circular reference issues, the session may be created from Sip, and Sip and Media may need to update the session:
class SrsGbSession {
SrsSharedResource<SrsGbSipTcpConn> sip_;
SrsSharedResource<SrsGbMediaTcpConn> media_;
}
class SrsGbSipTcpConn {
// The owner session object; we use the raw pointer and avoid freeing it.
SrsGbSession* session_;
}
class SrsGbMediaTcpConn {
// The owner session object; we use the raw pointer and avoid freeing it.
SrsGbSession* session_;
}
srs_error_t SrsGbSipTcpConn::bind_session(SrsSipMessage* msg, SrsGbSession** psession) {
SrsSharedResource<SrsGbSession>* session = dynamic_cast<SrsSharedResource<SrsGbSession>*>(_srs_gb_manager->find_by_id(device));
// Create a Session object from the SIP channel
if (!session) {
raw_session = new SrsGbSession();
session = new SrsSharedResource<SrsGbSession>(raw_session);
_srs_gb_manager->add_with_id(device, session);
}
// Update the SIP channel object in the Session.
raw_session->on_sip_transport(*wrapper_);
}
srs_error_t SrsGbMediaTcpConn::bind_session(uint32_t ssrc, SrsGbSession** psession) {
SrsSharedResource<SrsGbSession>* session = dynamic_cast<SrsSharedResource<SrsGbSession>*>(_srs_gb_manager->find_by_fast_id(ssrc));
SrsGbSession* raw_session = (*session).get();
// Update the Media channel object in the Session.
raw_session->on_media_transport(*wrapper_);
}
In contrast to RTC, for GB, the SIP and Media channels, when released, only free their connection objects and don’t trigger GB Session destruction. Instead, GB Session is destroyed through timeouts.
In practical applications, SRS’s smart pointers are used in limited scenarios, without applying advanced features or syntactic sugar. This approach enhances the maintainability of the solution.
In most cases, errors will eventually occur in areas prone to issues, making error handling crucial for system operation. Errors and logging are often intertwined, with a common practice being to log error information in the logs. However, errors and logging are two distinct matters.
Logging involves recording system actions and can be used to troubleshoot issues. Errors represent a specific type of behavior, and the relationship between errors and logs includes:
Errors may not always occur, but issues like irregular frame rates can still arise. Therefore, logs typically need to indicate critical paths without including errors to assess system normalcy:
[2024-09-27 10:32:46.245][INFO][58467][f951w1of] RTMP client ip=127.0.0.1:57591, fd=13
[2024-09-27 10:32:46.246][INFO][58467][f951w1of] complex handshake success
[2024-09-27 10:32:46.247][INFO][58467][f951w1of] connect app, tcUrl=rtmp://localhost:1935/live, pageUrl=, swfUrl=, schema=rtmp, vhost=localhost, port=1935, app=live, args=null
[2024-09-27 10:32:46.247][INFO][58467][f951w1of] protocol in.buffer=0, in.ack=0, out.ack=0, in.chunk=128, out.chunk=128
[2024-09-27 10:32:46.247][INFO][58467][f951w1of] client identified, type=fmle-publish, vhost=localhost, app=live, stream=livestream, param=, duration=0ms
[2024-09-27 10:32:46.247][INFO][58467][f951w1of] connected stream, tcUrl=rtmp://localhost:1935/live, pageUrl=, swfUrl=, schema=rtmp, vhost=__defaultVhost__, port=1935, app=live, stream=livestream, param=, args=null
[2024-09-27 10:32:46.247][INFO][58467][f951w1of] new live source, stream_url=/live/livestream
[2024-09-27 10:32:46.248][INFO][58467][f951w1of] source url=/live/livestream, ip=127.0.0.1, cache=1/2500, is_edge=0, source_id=/
[2024-09-27 10:32:46.250][INFO][58467][f951w1of] start publish mr=0/350, p1stpt=20000, pnt=5000, tcp_nodelay=0
[2024-09-27 10:32:46.251][INFO][58467][f951w1of] got metadata, width=768, height=320, vcodec=7, acodec=10
[2024-09-27 10:32:46.251][INFO][58467][f951w1of] 46B video sh, codec(7, profile=High, level=3.2, 768x320, 0kbps, 0.0fps, 0.0s)
[2024-09-27 10:32:46.251][INFO][58467][f951w1of] 4B audio sh, codec(10, profile=LC, 2channels, 0kbps, 44100HZ), flv(16bits, 2channels, 44100HZ)
[2024-09-27 10:32:46.253][INFO][58467][f951w1of] RTMP2RTC: Init audio codec to 10(AAC)
[2024-09-27 10:32:48.385][INFO][58467][f951w1of] cleanup when unpublish
For server logs, it’s essential to include IDs, such as session-level IDs. This enables quick identification of specific logs when serving multiple clients or when aggregating logs. SRS refers to this mechanism as “traceable logs,” allowing all logs related to a specific ID to be easily found. Due to SRS’s coroutine-based design, a session like an RTMP push connection may contain one or multiple coroutines. Hence, when logging, passing the ID isn’t necessary as it’s automatically obtained:
srs_error_t SrsRtmpConn::do_cycle() {
srs_trace("RTMP client ip=%s:%d, fd=%d", ip.c_str(), port, srs_netfd_fileno(stfd));
}
void SrsProtocol::print_debug_info() {
srs_trace("protocol in.buffer=%d, in.ack=%d, out.ack=%d, in.chunk=%d, out.chunk=%d", in_buffer_length,
in_ack_size.window, out_ack_size.window, in_chunk_size, out_chunk_size);
}
// This log macro definition automatically fetches the coroutine ID from get_id().
#define srs_trace(msg, ...) srs_logger_impl(SrsLogLevelTrace, NULL, _srs_context->get_id(), msg, ##__VA_ARGS__)
SRS errors contain stack information and error details, resembling Go’s error and wrap mechanism. With this approach, the error object alone provides the complete error context:
[2024-09-27 10:40:30.836][INFO][62805][l9067692] RTMP client ip=127.0.0.1:57942, fd=15
[2024-09-27 10:40:30.837][INFO][62805][l9067692] client identified, type=fmle-publish, vhost=localhost, app=live, stream=livestream, param=, duration=0ms
[2024-09-27 10:40:30.838][ERROR][62805][l9067692][35] serve error code=1028(StreamBusy)(Stream already exists or busy) : service cycle : rtmp: stream service : rtmp: stream /live/livestream is busy
thread [62805][l9067692]: do_cycle() [./src/app/srs_app_rtmp_conn.cpp:263][errno=35]
thread [62805][l9067692]: service_cycle() [./src/app/srs_app_rtmp_conn.cpp:457][errno=35]
thread [62805][l9067692]: acquire_publish() [./src/app/srs_app_rtmp_conn.cpp:1078][errno=35](Resource temporarily unavailable)
In SRS, errors are objects that provide complete context and can be sent through a tracing system, like OpenTelemetry APM, for full-chain error display. This necessitates that SRS errors are wrapped rather than directly returned as integers:
srs_error_t SrsRtmpConn::do_cycle() {
if ((err = service_cycle()) != srs_success) {
err = srs_error_wrap(err, "service cycle");
}
}
srs_error_t SrsRtmpConn::service_cycle() {
if (!srs_is_system_control_error(err)) {
return srs_error_wrap(err, "rtmp: stream service");
}
}
srs_error_t SrsRtmpConn::acquire_publish(SrsSharedPtr<SrsLiveSource> source) {
if (!source->can_publish(info->edge)) {
return srs_error_new(ERROR_SYSTEM_STREAM_BUSY, "rtmp: stream %s is busy", req->get_stream_url().c_str());
}
}
It’s common practice not to return error objects but to directly log errors, which can be a less sophisticated approach. For instance, low-level function errors that can be ignored might lead to log flooding if directly printed, prompting the function to return error codes instead. This can result in losing stack information. By returning error objects, the application layer can decide whether to discard or ignore errors, log warnings, log errors, or send alerts to monitoring systems. Directly logging errors fundamentally limits error handling possibilities.
Challenges with SRS’s error mechanism include:
Initially, SRS used a globally incremented integer for logging, but later switched to generating random string IDs to reduce the likelihood of ID conflicts when logs are centrally collected. Besides IDs, timestamps are included, and logs are typically searched within a limited timeframe, such as one or three hours. The probability of random string ID conflicts is almost negligible.
SRS 6.0 now supports OpenTelemetry APM for application performance monitoring, full-chain logging, and errors. Since APM requires integration with platforms, SRS integrates with Tencent Cloud’s APM. The APM protocol stack involves HTTP/2 and Protobuf, with SRS implementing the Protobuf protocol but supporting only the HTTP/1 protocol. While most APM platforms support integration using HTTP/1 + Protobuf, it’s not the default method. The use of newer protocols and inconsistent authentication mechanisms across cloud providers limits APM applications.
In practice, system OpenAPIs are more commonly used than errors and logs, as detailed below.
API refers to the system’s external interface, where in a narrow sense, API generally refers to HTTP API. In reality, logs and errors can be considered as a broader form of API, including system configurations, HTTP APIs, Prometheus Exporter, and more.
SRS has supported HTTP API and Callback since its early days, making it easy to integrate with business systems. Apart from supporting queries for stream and client information, it also enables actions like kicking off streams. Callback includes callbacks for events such as streaming, playback, recording, HLS, etc. Here are some areas that still need improvement:
It is evident that providing APIs and Callbacks is not sufficient; processing of this data is necessary, supporting comprehensive queries, and even graphical representation. For operations and maintenance, monitoring alerts and historical query support are needed. These can be achieved through Prometheus and Grafana using SRS’s Exporter:
+-----+ +-----------+ +---------+
| SRS +--Exporter-->--| Prometheus +-->--+ Grafana +
+-----+ (HTTP) +-----------+ +---------+
Prometheus is a comprehensive monitoring and operations system that fetches data from SRS through a standardized Exporter API, stores it in its own time-series database, and uses its query language PromQL for querying. Grafana enables graphical representation. This approach eliminates the need for each user to convert HTTP APIs and Callbacks for their own monitoring and operations systems. By utilizing SRS’s Prometheus Exporter functionality, you can set up a graphical monitoring and operations dashboard within 10 minutes. For more details, refer to SRS Exporter. Here are some areas where the Exporter can be improved:
Apart from data metrics, configuration can also be considered part of the interface, determining how the system behaves. SRS uses the configuration file method similar to nginx and supports reloading. Due to users managing configuration files, SRS ensures strong compatibility. Configuration files from SRS 1.0 can still be used in subsequent versions; whenever there are changes like renaming, SRS ensures compatibility with both old and new names after parsing the configuration. Here are some areas where configuration can be improved:
Comparing configuration files and environment variable configurations, below is an example configuration file example.conf:
listen 1935;
http_api {
enabled on;
}
rtc_server {
enabled on;
candidate 192.168.3.82;
}
vhost __defaultVhost__ {
rtc {
enabled on;
}
}
docker run --rm -it -p 1985:1985 -p 8000:8000 \
-v $(pwd)/example.conf:/usr/local/srs/conf/docker.conf \
ossrs/srs:5
Describing the configuration file content in the documentation requires users to copy content, create and edit files, and then specify the file during startup, which is cumbersome and prone to errors. In contrast, using environment variables, everything can be done with a simple copy-paste, reducing the likelihood of user errors:
docker run --rm -it -p 1985:1985 -p 8000:8000 \
-e SRS_LISTEN=1935 \
-e SRS_HTTP_API_ENABLED=on \
-e SRS_RTC_SERVER_ENABLED=on \
-e CANDIDATE="192.168.3.82" \
-e SRS_VHOST_RTC_ENABLED=on \
ossrs/srs:5
Maintaining a comprehensive API requires significant effort and time, especially when adding new features to ensure that both the API and documentation are promptly updated. SRS’s API is far from perfect and requires further effort.
Streaming protocols are crucial capabilities of streaming servers, and the protocol capabilities required on servers differ from general media processing. For example, the discontinuation of ffserver was due to FFmpeg’s I/O design primarily focusing on clients, making it unsuitable for handling high concurrency as a server. For instance, SRS does not use a WebRTC codebase to implement the WebRTC protocol because SFU servers require lightweight protocol stacks without the need for media processing, device management, signaling, and algorithms.
SRS natively supports protocols like RTMP, HTTP-FLV, HLS, and WebRTC, ensuring high stability and maintainability. SRT is integrated with the libsrt library for stability, but it lacks perfection in protocol conversion and performance. Protocols like DASH and GB28181 have inherent flaws leading to various issues. HDS and RTSP are outdated streaming protocols, but RTSP still finds application in certain scenarios in the AI era.
Initially, SRS only supported common live streaming protocols: RTMP, HTTP-FLV, and HLS, along with live clusters such as Edge and Origin clusters. With the evolution of live streaming, protocols like WebRTC and SRT are increasingly used, significantly impacting the entire system architecture, especially in protocol conversion.
Initially, protocol conversion only required support for RTMP to HLS. Although OBS can stream using HLS, it is not commonly used. In reality, SRS also supports streaming via POST HTTP-FLV, similar to POST HLS, but FLV streaming is not widely supported by CDNs and common clients, making this method less popular. However, SRS uses the Stream Caster structure to support this unique stream input and then converts it to RTMP; for instance, PUSH MPEGTS over UDP follows this method.
Initially, both SRT and GB28181 used the Stream Caster structure; after SRS receives the stream, it initiates an RTMP Client to stream to localhost, somewhat resembling an external method of receiving these two protocols’ streams. These protocols encounter different issues; starting with SRT:
The situation with GB28181 is different; in SRS 5.0, GB28181 protocol stack was completely re-implemented. Typically, GB28181 streams to SRS for viewing via WebRTC without plugins, addressing issues such as:
The RTSP protocol was removed from SRS, initially supported through the Stream Caster method for RTSP stream ingestion. Main issues include:
Note: SRS is willing to support the RTSP protocol in the future, but the long-term assessment indicates that the primary streaming protocols will remain RTMP, WebRTC, and SRT, rather than RTSP. The community’s willingness and resources to support and maintain a declining RTSP protocol remain uncertain.
SRS’s RTMP implementation is undoubtedly the most comprehensive, addressing several issues but also having areas for improvement:
SRS supports WebRTC as a core protocol because it is the only option for streaming protocols supported by browsers. However, there are several areas where WebRTC implementation in SRS can be improved:
SRS’s segmented protocols like HLS, DASH, and HDS, with HLS being the most widely used, HDS nearly obsolete, and DASH having some users but facing certain issues:
MP4 can also be considered a protocol commonly used in recording files; LLHLS also utilizes MP4, especially for HLS HEVC, as Safari only supports MP4 segmented files, not TS segmented files. MP4 compatibility issues are more common than with RTMP, occasionally leading to various unexpected problems. It is recommended to use FFmpeg for recording and transcoding, while relying on SRS for stream reception and distribution only.
SRS’s quality assurance mechanisms include unit tests based on gtest, black-box testing based on srs-bench, and Code Review. These mechanisms are interconnected through GitHub Actions, automatically triggered for every Pull Request, Commit, and Release.
These tests are highly effective, often leading to Pull Requests failing due to logic issues caused by code modifications in other areas. It seems that the test cases are not as effective, essentially confirming common-sense judgments like 1+1=2. In reality, errors are seldom introduced in the modified areas but rather surface in other parts due to the intricate relationships between different sections of code. As code grows, these interdependencies often lead to such issues.
Programmers typically have great confidence in their own code but lack confidence in others’ code. Few developers voluntarily write test cases, and when asked to add unit tests during Code Review, most acknowledge the necessity of testing, especially after realizing how tests can uncover logic issues. Personally, even when recognizing the effectiveness of testing, voluntarily writing tests remains a challenging task, making it difficult to adhere to the practice of writing tests before functional code.
I believe that one of the most crucial roles of Code Review is to require the submission of test code. Of course, the primary function of Code Review is to integrate a contributor’s code into one’s own codebase, necessitating an understanding of the code’s purpose and intent, and exploring better implementation options. Common issues include developers unknowingly duplicating existing code functions or failing to leverage existing code for improvements due to unfamiliarity with the current codebase. Additionally, contributors often prioritize habitual implementations over the most maintainable solutions.
In Code Review, open-source communities typically adhere to a single rule: one Pull Request should contain only one feature or bug fix. Due to limited time in open-source communities, including multiple features in a single change makes explaining the code to different individuals challenging, requiring detailed explanations of each function’s purpose, making the review process difficult. Conversely, in a corporate setting, almost every change includes some “convenient” modifications, leading to ineffective Code Reviews and significant code quality issues.
Moreover, although companies have ample time for coding, the urgency from clients often results in a rush to deploy numerous features and changes, leading to Pull Requests containing multiple features. Companies spend less time on code than open-source communities, failing to allocate sufficient time to weigh the most maintainable solutions. This is why open-source communities are more likely to produce high-quality code, whereas companies are more prone to lower-quality code, despite having top-tier programmers.
This is not to say that open-source communities always produce high-quality code; Code Review standards can easily slip, especially when introducing significant code changes or large features. The cases of SRS’s GB28181 and RTSP are typical examples where, due to substantial changes and time constraints, insufficient Review time led to a compromise in code quality. This resulted in numerous bug reports, necessitating the rewriting of existing code and the diligent addition of tests and tools.
It is essential to emphasize that the issue is not with the low quality of code submitted by other developers but rather the lack of sufficient time spent during Code Review, resulting in lower code quality post-merge. For instance, when reviewing GB28181 code, I did not invest time in thoroughly understanding the GB28181 protocol, reviewing each line of code meticulously, demanding comprehensive testing, or conducting stress and black-box testing. This was simply negligence on my part, failing to enforce strict Review standards, for which I take full responsibility.
There are significant issues with SRS’s testing, primarily inadequate coverage. Despite a 53% coverage rate, some core functionalities remain uncovered, such as the Edge cluster. At times, there is a tendency towards laziness; for example, even with the development of the Proxy cluster, there was no test coverage. Although the importance of test coverage is well understood, at times, it is challenging to implement, with the thought that the feature is still unstable and testing can be added once it stabilizes.
Code quality is independent of programming language; without effective code quality standards and strict adherence to these standards, any language can produce low-quality code.
It’s evident that Go is not the future of streaming servers due to several reasons:
Life goes on, projects need to progress, new capabilities must be developed, and we cannot wait; action must be taken. Is RUST a suitable path forward? It might be worth exploring if RUST addresses the key challenges faced by streaming servers and if the issues introduced by RUST are manageable.
Firstly, let’s revisit some key themes mentioned earlier:
Apart from these issues, there are additional considerations that haven’t been discussed yet, which I will summarize below.
In C++, the following code snippet does not have dangling pointer issues:
int x = 100;
std::thread t([&]() {
std::cout << "x is " << x << std::endl;
});
t.join();
However, if the thread does not release as expected and references a variable after its local scope, it can lead to dangling pointer issues. This is a real problem that I have encountered. On the other hand, in RUST:
let x = 100;
thread::spawn(move || {
println!("x is {}", x);
}).join().unwrap();
This problem does not exist in RUST, as the compiler ensures that variables must be moved into the thread. RUST restricts sharing regular variables between threads due to potential risks. Although sharing in this case is not problematic, as the variable will be valid until the thread ends, it is not always the case, especially when continually maintaining a project and adding logic incrementally. Dangling pointer issues are quite common, and I have often seen them during crash troubleshooting in real projects.
RUST, like any other language, has its strengths and weaknesses. It solves some problems while introducing others:
For server programs, supporting asynchronous IO (async) is essential. RUST’s async and ST are quite similar in this regard. Most async IO operations are based on polling, such as Linux epoll. Nginx, for instance, directly uses epoll. When reading and writing, partial writes may occur, indicating that the buffer is full, requiring readiness polling before writing again. Dealing with these issues for every read and write can lead to cumbersome application logic, resulting in callback hell. Go and ST create application-level coroutines, while RUST also uses spawn to execute a future. Comparing the code snippets:
//////////////////////////////////////////////////////////////////////////////
// Go
listener, _ := net.Listen("tcp", "0.0.0.0:8080")
for {
conn, _ := listener.Accept()
go handleTCP(conn)
}
// Goroutine handling the TCP connection
func handleTCP(conn net.Conn) {
defer conn.Close()
buf := make([]byte, 1024)
n, _ := conn.Read(buf)
}
//////////////////////////////////////////////////////////////////////////////
// ST(State Threads)
int fd = socket(AF_INET, SOCK_STREAM, 0);
::bind(fd, (const sockaddr*)&addr, sizeof(addr)); // addr is the listening address
::listen(fd, 10);
st_netfd_t stfd = st_netfd_open_socket(fd);
for {
st_netfd_t client = st_accept(stfd, NULL, NULL, ST_UTIME_NO_TIMEOUT);
st_thread_create(serverTCP, client, 0, 0);
}
// ST coroutine serving the TCP connection
void* serverTCP(void* arg) {
st_netfd_t client = (st_netfd_t)arg;
char buf[1024];
int n = st_read(client, buf, sizeof(buf), ST_UTIME_NO_TIMEOUT);
}
//////////////////////////////////////////////////////////////////////////////
// RUST async
let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
loop {
let (socket, _) = listener.accept().await.unwrap();
tokio::spawn(processTCP(socket));
}
// Handling the TCP connection using async
async fn processTCP(mut socket: tokio::net::TcpStream) {
let mut buf = vec![0; 1024];
let n = socket.read(&mut buf).await;
}
RUST Runtime Support: RUST’s runtime supports both single-threaded and multi-threaded environments, allowing tasks to potentially move between different threads when using spawn and await. In contrast, ST only supports thread-local operations, requiring the application layer to handle how tasks are passed, which is not yet supported. Go’s M and N model provides a sophisticated multi-threaded scheduler implemented at the language level, offering the highest level of support.
Performance Analysis: ST relies entirely on assembly for coroutine switching without a multithreading mechanism, with a codebase of around 5,000 lines. On the other hand, RUST’s tokio runtime delivers high performance without the impact of garbage collection. In the realm of high-performance servers, RUST tokio emerges as a suitable solution, especially given its third-party async library that allows for customization and modification. In terms of maintainability, RUST tokio outperforms ST.
Compatibility: Both RUST and Go excel in terms of compatibility. ST in SRS requires manual assembly for coroutine switching, necessitating adaptation to hardware registers and function call mechanisms when encountering new hardware and CPU chips. Particularly, ST currently lacks compatibility with Windows SEH exception mechanisms. RUST and Go support a wide range of systems and CPUs without compatibility issues.
Streaming media falls within the realm of media technology, which can vary across different countries. Some countries may regulate media, limiting the availability of services in the media sector. However, the global user and developer base of SRS confirms its practical value. Our efforts can continue to have a positive impact on this planet, making it worthwhile to persist in our endeavors. It’s unfortunate that nginx-rtmp ceased maintenance, despite having a larger user base compared to SRS.
The architecture of SRS primarily focuses on maintainability. Rather than adding more features, it is essential to listen to diverse voices from different countries and regions and explore various technological solutions and languages. I believe RUST could be an interesting pursuit. Even if it may not ultimately materialize, the process can yield many intriguing ideas, similar to how SRS incorporates many concepts from Go. Additionally, without in-person interactions, building an effective open-source community is challenging. It’s important to participate in open-source conferences and exchanges in different countries to foster such a community.
Maintaining SRS involves encountering numerous fascinating technical challenges, engaging discussions, and diverse viewpoints. While these joys may not translate into monetary gains or serve as a means of livelihood, solely focusing on survival can be stifling. Conversely, even if these joys do not directly contribute to survival, they can still warm the heart and bring fulfillment.