Replies: 12 comments 2 replies
-
See #2188 (comment)
-
See #2188 (comment)
-
See #2188 (comment)
-
So, does it support multi-threading for WebRTC now? After starting multiple processes and enabling port reuse, only one core is used on a multi-core machine.
-
I've been working with Node.js for almost a year now, and its model is very similar to SRS's coroutines plus multithreading, so I can basically see the future of SRS's multithreading. Its simplicity is quite good, which is what we want. We can't take Go as the reference for multithreading, because Go is true multithreading: threads share data, so Go actually needs locks, whereas each of Node.js's worker threads is effectively single-threaded, with no locks or similar synchronization. Multithreading without thread synchronization is much easier to maintain. I still insist on splitting threads by business, with streams staying in one thread. This doesn't solve performance issues, but it does solve some CPU-intensive and freezing issues, such as:
After solving these problems, stability will improve; sometimes the system really is affected by these factors. I don't want multithreading for streams, because from the perspective of ease of use it would, for example, double the maintenance cost of the API for clustering, require scheduling logic (however simple), add troubleshooting steps, and make load impossible to evaluate simply. All of these factors would greatly reduce the maintainability of the whole project. The only advantage of multithreading for streams is better use of multiple cores, which can instead be achieved through cascading (to be supported in the future) and business-level scheduling. If you feel that running one process per machine is too wasteful, you can use multiple ports or run multiple Pods. In any case, if you have reached the point of caring about raw performance, you must have a large business volume of tens or even hundreds of thousands; if a business of that scale has no R&D capability of its own, it is headed for trouble either way.
-
Update on 2023.07.18: SRS 5.0 and ST now support multi-threading, but it is not used in the streaming architecture, as it would add unnecessary complexity and hinder system monitoring through Prometheus. The optimal and unified architecture is a proxy cluster: creating a local proxy cluster to leverage multiple CPUs is a better solution than multi-threading. Nevertheless, we plan to use multi-threading in the future for disk writing, to prevent blocking IO such as logging, see #3647.
-
Remark
For now, let's put the multi-threading preparation on hold. Although ST already supports multi-threading and the RTC multi-threading work is almost complete, several factors make me reconsider whether multi-threading is necessary at this stage.
First of all, the multi-threading branch has been deleted from the SRS repository, but it is still preserved in my repository feature/threads, which mainly includes the following commits:
The main reasons for reconsidering multi-threading support are:
However, simplifying ST and improving its performance can still be considered for merging, including:
Summary
SRS's support for multi-threading is a significant architectural upgrade, essentially aimed at addressing performance issues.
Regarding performance issues, the following points can be expanded:
Why is this issue important?
Therefore, the multi-threading architecture can be considered a revolution after the multi-coroutine architecture, but this time it is a self-revolution.
Arch
The previous SRS single-threaded architecture (SRS/1/2/3/4):
The ultimate goal architecture is horizontally scalable Hybrid threads, also known as low-lock multi-threaded structure (SRS/v5.0.3):
The disadvantages of this architecture:
Threads can be disabled with `-no-threads`, but one may sometimes forget to change this option and cause problems. Solution: by default, only one SRTP thread is enabled, allowing a long enough transition and improvement period.
Communication Mechanism
There are two ways for threads to communicate: the first is a locked chan, and the second is passing an fd; the second can be built on top of the first.
Both methods should avoid passing audio and video data. They can carry it, of course, but not efficiently. For example, you can start a transcoding thread and communicate with it over a chan, since that does not require much concurrency.
SRS will have multiple ST threads that communicate through chans, but they do not pass audio and video data, only coordination messages.
Currently, SRS's thread communication is implemented with pipes to avoid locks. Be aware that this is a low-throughput mechanism and should not carry audio and video packets directly. It is mainly used between the Master (API) thread and the Hybrid (service) thread, for example with Hybrid returning the SDP to the API.
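The pipe-based channel described above can be sketched as follows. This is a hypothetical illustration (PipeChannel is our name, not an SRS class): a pipe carries small, length-prefixed coordination messages between two native threads, such as the Hybrid thread returning an SDP answer to the Master (API) thread. Media packets never cross it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unistd.h>

// A one-way cross-thread message channel built on a pipe.
class PipeChannel {
public:
    PipeChannel() { ::pipe(fds_); }
    ~PipeChannel() { ::close(fds_[0]); ::close(fds_[1]); }

    // Writer side: pipe writes below PIPE_BUF bytes are atomic,
    // so small coordination messages need no extra lock.
    void send(const std::string& msg) {
        uint32_t len = static_cast<uint32_t>(msg.size());
        ::write(fds_[1], &len, sizeof(len));
        ::write(fds_[1], msg.data(), len);
    }

    // Reader side: blocks until a whole message is available.
    std::string recv() {
        uint32_t len = 0;
        ::read(fds_[0], &len, sizeof(len));
        std::string msg(len, '\0');
        ::read(fds_[0], &msg[0], len);
        return msg;
    }

private:
    int fds_[2];  // fds_[0] = read end, fds_[1] = write end
};
```

Because each message is copied through the kernel, this is fine for an occasional SDP exchange but, as noted above, too slow for audio and video packets.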
Thread Types
Each thread will have its own ST, and ST is thread-local, i.e., each thread's ST is independent and isolated.
In the end, there will be several types of threads:
Milestones
4.0 will not enable multi-threading, maintaining single-threaded capabilities.
5.0 will implement most of the multi-threading capabilities, including improving ST's thread-local capabilities. However, Hybrid will only default to 1 thread, and although the process has multiple threads, the overall difference from the previous single-thread is not significant.
6.0 will enable as many threads as there are CPU cores by default, completing the entire multi-threaded architecture transformation.
Differences from Go
Go's multi-threading overhead is too high, and its performance is not sufficient, as it is designed for general services.
With multiple cores, say 16, Go spends roughly 5 of them on switching, because even when chan is used there are locks and data copies between threads.
In addition, Go is genuinely multi-threaded, requiring constant attention to contention and thread switching, while each SRS thread still runs genuinely single-threaded logic. Go is more complicated to use, while SRS keeps the simplicity of single-threading.
SRS is a multi-threaded and coroutine-based architecture optimized for business, essentially still single-threaded, with threads being essentially unrelated.
Relationship with Source
A single ST thread will have multiple sources.
A source, which is a push stream and its corresponding consumer (playback), is only in one ST thread.
In this way, both push and play are completed in a single ST thread, without the need for locks or switching.
Since the client's URL is unknown when connecting, it is also unknown which stream it belongs to, so it may be accepted by the wrong ST thread, requiring FD migration.
Migrating FD between multiple threads is relatively simple. The difficulty lies in ST, which needs to support multi-threading and consider rebuilding the FD in the new ST thread's epoll when migrating FD. However, this is not particularly difficult, and it is much easier than multi-process.
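The migration step described above can be sketched like this (migrate_fd is our name, a hypothetical helper): the fd itself is process-wide, so "migration" is only moving its registration from the accepting thread's epoll instance to the owning thread's.

```cpp
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>

// Moves an fd's readiness monitoring from one epoll instance to another.
bool migrate_fd(int from_epoll, int to_epoll, int fd) {
    // 1) The accepting thread detaches the fd from its own epoll.
    if (epoll_ctl(from_epoll, EPOLL_CTL_DEL, fd, nullptr) != 0) {
        return false;
    }
    // 2) The fd number is handed to the target thread (e.g. over the
    //    pipe/chan mechanism), which then rebuilds it in its own epoll.
    struct epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(to_epoll, EPOLL_CTL_ADD, fd, &ev) == 0;
}
```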
Why not Multi-process
FD migration between processes is too difficult to implement, and inter-process communication is neither as easy nor as efficient as communication between threads.
The reason Nginx uses multi-process is that it has no need for FD migration between processes; as a result, for live streaming, NginxRTMP has processes push streams to each other, which is too difficult to maintain.
Without migration, audio and video packets must be forwarded between workers; migrating the FD by stream is clearly better and more suitable for streaming media.
Thread Local
Each thread has its own ST, similar to the Envoy threading model, using the C++ thread_local keyword to mark variables.
I wrote an example SRS: thread-local.cpp, with the following results:
It can be used to modify global variables:
Including global pointers:
The addresses and values of these pointers are different in each thread.
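A minimal sketch in the spirit of thread-local.cpp (this is not the original file): thread_local gives every thread its own copy of a "global", so a write in one thread is invisible to all others.

```cpp
#include <cassert>
#include <thread>

// Every thread gets a fresh copy initialized to 100.
thread_local int tl_counter = 100;

// Bumps the counter in a brand-new thread and reports what that thread saw.
int bump_in_new_thread() {
    int seen = 0;
    std::thread t([&] {
        tl_counter += 1;    // touches only the new thread's copy
        seen = tl_counter;  // 101 inside the new thread
    });
    t.join();
    return seen;
}
```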
GCC __thread
GCC has extended the keyword __thread, which has the same effect as C++11's thread_local.
A multi-threaded version of ST has been implemented before, using gcc's __thread keyword, referring to toffaletti and ST#19.
UDP Binding
RTC's UDP is connectionless, so multiple threads can each open an fd on the same port through `SO_REUSEPORT` to receive packets sent to that port. The kernel performs a five-tuple binding: once it delivers a client's packets to a certain listen fd, it keeps delivering to that fd. Refer to udp-client and udp-server:
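The binding can be sketched as follows (the helper name is ours, not from the linked examples): each thread opens its own UDP socket with SO_REUSEPORT and binds the same port; the kernel then spreads clients across the fds and keeps each client's packets arriving on the same fd.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Opens a UDP socket bound to the given port with SO_REUSEPORT enabled,
// so several threads can each call this for the same port.
int udp_bind_reuseport(uint16_t port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &yes, sizeof(yes));
    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (sockaddr*)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```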
UDP Migration
If we receive a client packet from a certain fd, such as 3, and find that this client should be received by another fd, such as 4, we can use connect to bind the delivery relationship.
Refer to the example udp-connect-client.cpp and udp-connect-server.cpp. The server receives the packet and continuously uses other fds to connect. The performance is different on different platforms.
A CentOS 7 server listening on `0.0.0.0:8000`, as shown below, can migrate twice:
A CentOS 7 server bound to a fixed address, such as eth0 or lo, will not migrate:
A Mac server, regardless of which address is bound, migrates once:
After discussing with @wasphin: rather than migrating, we hope that after connect the 5-tuple is bound to this fd, so that other fds will not receive the client's packets.
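The connect-based binding hoped for above can be sketched like this (the helper name is ours, not from the linked examples): once recvfrom() on the shared listen fd reveals a client's address, the owning thread opens its own fd on the same local port and connect()s it to the client, asking the kernel to route that 5-tuple to this fd from then on.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Creates a per-client fd on the same local port and connect()s it to the
// client's address. On UDP, connect() sends no packet; it only records the
// peer so the kernel can bind the 5-tuple to this fd.
int adopt_udp_client(uint16_t local_port, const sockaddr_in* client) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &yes, sizeof(yes));
    sockaddr_in local = {};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(local_port);
    if (bind(fd, (sockaddr*)&local, sizeof(local)) != 0 ||
        connect(fd, (const sockaddr*)client, sizeof(*client)) != 0) {
        close(fd);
        return -1;
    }
    return fd;  // later packets from this client should arrive here
}
```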
In this case, a more suitable thread model is:
This model is actually a hybrid model:
This hybrid model does not depend on UDP connect, but when connect does work, performance will be very high.
In addition, the encryption and decryption problem can also be solved by a similar hybrid model:
What's special is the disk IO thread, which will definitely use the queue to send messages:
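A sketch of such a queue (our names, not SRS's real writer): service threads push log/record messages and never block on the disk; the single disk IO thread drains the queue and performs the actual writes.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// A blocking message queue between service threads and the disk IO thread.
class DiskWriteQueue {
public:
    // Called by any service thread; cheap, never touches the disk.
    void push(std::string msg) {
        std::lock_guard<std::mutex> lk(mu_);
        q_.push(std::move(msg));
        cv_.notify_one();
    }

    // Called only by the disk IO thread; blocks until a message arrives.
    std::string pop() {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        std::string msg = std::move(q_.front());
        q_.pop();
        return msg;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
};
```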
In the early days, we will still pass packets between multiple threads and divide different threads according to the business. As the evolution progresses, we will gradually eliminate the communication and dependencies between threads and turn them into independent threads that do not rely on each other, achieving higher performance.