WebRTC Conference

Version 49 (Adrian Georgescu, 07/10/2017 09:41 am)

h1. SylkServer WebRTC Video Conference

https://webrtc-test.sipthor.net

h2. Design

Two types of conferences are supported: ad-hoc conferences and moderated conferences.

h3. Ad-hoc conferences

An ad-hoc conference is a conference where all participants have the same status and no one controls what the other participants are doing. The participants are rendered in a matrix of up to 3x3, depending on how many participants are in the room. The layout switches automatically for everybody as participants join or leave.

The conference room has a fixed total bitrate configured on the server, which can be specified per room or globally with the max_bitrate setting in webrtcgateway.ini (see below). This bitrate is shared by all participants in the room: the more participants there are, the less bitrate each of them uses for the video stream they send, keeping the total room usage constant at the value configured by max_bitrate. SylkServer adjusts the bitrate per participant automatically as participants join or leave the room, by dividing the available bitrate among the participants. As a result, each participant sends a fraction of max_bitrate (determined by the number of participants in the room) and always receives a combined total of max_bitrate from all the other participants, no matter how many participants are in the room. The formula to compute the bitrate per participant is shown below:

<pre>
participant_send_bitrate = max_bitrate / max(number_of_participants - 1, 1)
</pre>

Using this formula we can make sure that each participant always receives max_bitrate traffic in incoming video streams, independent of the number of participants. The traffic sent and received by each party can be expressed as follows (where N is the number of participants and N > 1):

<pre>
participant_sent_traffic     = max_bitrate / (N - 1)
participant_received_traffic = max_bitrate

sylkserver_sent_traffic      = max_bitrate * N            (participant_received_traffic * N)
sylkserver_received_traffic  = max_bitrate * N / (N - 1)  (participant_sent_traffic     * N)
</pre>
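
The formulas above can be checked with a few lines of Python. This is only an illustrative sketch, not code taken from SylkServer. The function name @adhoc_traffic@ and the dictionary keys are made up for this example.

```python
def adhoc_traffic(max_bitrate, participants):
    """Return per-participant and server traffic (bits/s) for an ad-hoc room."""
    # Each stream is received by the other N - 1 participants; max(..., 1)
    # keeps the division defined for a room with a single participant.
    send = max_bitrate / max(participants - 1, 1)
    received = send * (participants - 1)
    return {
        "participant_sent": send,
        "participant_received": received,
        "sylkserver_received": send * participants,
        "sylkserver_sent": received * participants,
    }


if __name__ == "__main__":
    # Default max_bitrate with a full 3x3 matrix (9 participants).
    for name, value in adhoc_traffic(2016000, 9).items():
        print(f"{name:22s} {value:>12.0f} bits/s")
```

For any N > 1 the value of @participant_received@ stays equal to max_bitrate, which is the property the design relies on.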

h3. Moderated conferences

A moderated conference is a conference where a moderator decides the flow of the conference. The moderator is the first participant to join the conference. The moderator can see a list of all the participants, can select 1 or 2 of them to be the active speakers, and can mute other participants (audio and/or video). The moderator can also change the active speakers at any time.

The other participants will see the selected active speakers in full-sized video and the remaining participants as thumbnails. They cannot choose which participant to watch. The conference view in their browser is controlled by the moderator, who decides which active speakers everybody else sees in full-sized video.

The active speakers selected by the moderator will have their bitrate set to either max_bitrate (for 1 active speaker) or max_bitrate/2 (for 2 active speakers), while everybody else will have their bitrate set to a low value (64Kb/s), just enough to have them represented as small thumbnails on the other participants' screens.
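
The bitrate assignment described above can be sketched as follows. This is a hypothetical helper, not the SylkServer implementation. The names @moderated_bitrates@ and @THUMBNAIL_BITRATE@ are invented here, and 64Kb/s is assumed to mean 64000 bits/s.

```python
THUMBNAIL_BITRATE = 64000  # bits/s; assumed meaning of the 64Kb/s quoted above


def moderated_bitrates(max_bitrate, participants, active_speakers):
    """Map every participant to a send bitrate limit in bits/s."""
    if not 1 <= len(active_speakers) <= 2:
        raise ValueError("a moderated room has 1 or 2 active speakers")
    # Active speakers share max_bitrate; everyone else gets the thumbnail rate.
    speaker_bitrate = max_bitrate // len(active_speakers)
    return {
        name: speaker_bitrate if name in active_speakers else THUMBNAIL_BITRATE
        for name in participants
    }


if __name__ == "__main__":
    room = ["alice", "bob", "carol", "dave"]
    print(moderated_bitrates(2016000, room, {"alice", "bob"}))
```

When the moderator changes the active speakers, the same function can be called again to obtain the new limits for the whole room.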

h2. Features

h3. Ad-hoc conferences

Ad-hoc conferences are best suited for conversations with family and friends, since the bitrate is managed automatically and no dedicated person is needed to control the flow of the conference. However, they can also be used for any other video conference with a free-flowing type of discussion, where any participant can jump into the conversation at any time.

h3. Moderated conferences

Moderated conferences are best suited for a business environment, where participants have to make some sort of presentation in front of the other participants and a moderator is assigned to control the flow of the conference and give the microphone to the appropriate participant, while the others just watch the active speaker. They can also be used for a conference with 2 active participants who are having a public debate on a subject, while every other participant just watches and possibly asks questions.

h2. Configuration

SylkServer allows the maximum bitrate and video codec to be configured, globally or per room, with the following settings in the webrtcgateway.ini file:

<pre>
; Maximum video bitrate allowed per sender in a room in bits/s. This value is
; applied to any room that doesn't define its own. The value is any integer
; number between 64000 and 4194304. Default value is 2016000 (~2Mb/s).
; max_bitrate = 2016000

; The video codec to be used by all participants in a room. This value is
; applied to any room that doesn't define its own.
; Possible values are: h264, vp8 and vp9. Default is vp9.
; video_codec = vp9
</pre>
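
As an illustration of the limits documented in these comments, the sketch below parses such settings with Python's standard @configparser@ module and validates them. It is not how SylkServer itself reads its configuration. The section name @[General]@, the function name and the sample text are assumptions made for this example.

```python
from configparser import ConfigParser

VALID_CODECS = {"h264", "vp8", "vp9"}

# Hypothetical sample; the real section layout of webrtcgateway.ini may differ.
SAMPLE = """
[General]
max_bitrate = 2016000
video_codec = vp9
"""


def load_video_settings(text, section="General"):
    """Return (max_bitrate, video_codec), applying the documented defaults."""
    parser = ConfigParser()
    parser.read_string(text)
    max_bitrate = parser.getint(section, "max_bitrate", fallback=2016000)
    video_codec = parser.get(section, "video_codec", fallback="vp9").lower()
    # Enforce the ranges stated in the configuration comments.
    if not 64000 <= max_bitrate <= 4194304:
        raise ValueError(f"max_bitrate out of range: {max_bitrate}")
    if video_codec not in VALID_CODECS:
        raise ValueError(f"unsupported video_codec: {video_codec}")
    return max_bitrate, video_codec


if __name__ == "__main__":
    print(load_video_settings(SAMPLE))
```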

h2. Client support

Firefox and Chrome browsers are supported. We have not tested the Edge browser, but it recently added WebRTC support, so it might work. The standalone Electron application is supported; it just has to be rebuilt with the new content of the site. Mobile devices will be supported, but the mobile applications need to be rebuilt as well. Note that, for the moment, mobile devices cannot work when H264 is configured as the codec, because of a compatibility problem with this codec that we have not yet figured out.

h2. Things that were explored

In order to implement bandwidth management and CPU load optimizations we explored a couple of things, some of which proved fruitful, while others were abandoned or proved not very helpful for our goal.

The original idea we started with was to have each client send two video streams, one low resolution and one high resolution, and let the other participants switch between them based on their need (use the high resolution video if the participant was viewed in full, or the low resolution video if the participant was displayed as a thumbnail). As we progressed, we quickly discovered that this setup was a lot more complicated to manage than we had anticipated. Every participant would open 2 sessions to the conference room just to publish their low and high resolution streams, which made them appear duplicated in the conference. Special means were needed to associate two such distinct sessions coming from the same device and present them as a single entity. This had to be done in each client, which meant that older clients would not be able to deal with this setup and would automatically display every participant duplicated.

In addition, this setup would increase the upload bandwidth of each participant to as much as 1.5 times that of a single stream, going against the idea of reducing the used bandwidth.

The advantages of this model were the reduced download bandwidth and the reduced CPU utilization that resulted from only having to process one high resolution video stream, while all the other video streams would be low resolution. These advantages were overshadowed by the higher upload bandwidth being used, by the more complicated room management required to deal with devices connecting twice per participant, and by the inability of older devices to join such a conference room.

While we were working on this we also ran into a technical limitation in Firefox, which was unable to provide 2 video streams of different resolutions at the same time. When we tried to obtain 2 video streams, one low resolution and one high resolution, the moment we requested the second stream with a different resolution, the first stream's resolution was updated to match the second and we ended up with 2 streams of the same resolution. This was a limitation in Firefox that we could not overcome, so at this point, in addition to the issues mentioned above, we were also facing the prospect of dropping Firefox support and having our solution work only with Chrome.

While we were contemplating our choices, we discovered that there was a mechanism by which a WebRTC client could be constrained to limit its sending bandwidth, and that this mechanism could be employed dynamically during a call to make the device's sending bitrate high or low as desired, without any need to renegotiate the session. This mechanism uses REMB (Receiver Estimated Maximum Bitrate) packets, which are control packets sent through RTCP that make a browser adjust its send bitrate on the fly as requested. The good news was that both Chrome and Firefox supported this. This bit of information changed everything, and we realized we could use it to build a better solution that was a lot less complicated and more effective.
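
To make the mechanism more concrete, the sketch below builds and parses a REMB packet following the layout in the "RTCP message for Receiver Estimated Maximum Bitrate" Internet-Draft: an RTCP payload-specific feedback packet (PT 206, FMT 15) that carries the requested bitrate as an 18-bit mantissa and a 6-bit exponent. It only illustrates the wire format. In a real deployment the media server generates these packets, and the function names below are made up for this example.

```python
import struct


def encode_remb(sender_ssrc, bitrate, media_ssrcs):
    """Build an RTCP REMB packet asking the sender to use at most `bitrate` bits/s."""
    # Scale the bitrate down until it fits the 18-bit mantissa.
    exponent, mantissa = 0, bitrate
    while mantissa > 0x3FFFF:
        mantissa >>= 1
        exponent += 1
    length = 4 + len(media_ssrcs)  # packet size in 32-bit words, minus one
    header = struct.pack(
        "!BBHII4s",
        (2 << 6) | 15,  # version 2, no padding, FMT 15
        206,            # payload-specific feedback
        length,
        sender_ssrc,
        0,              # media source SSRC, unused by REMB
        b"REMB",
    )
    bitrate_word = (len(media_ssrcs) << 24) | (exponent << 18) | mantissa
    ssrcs = b"".join(struct.pack("!I", ssrc) for ssrc in media_ssrcs)
    return header + struct.pack("!I", bitrate_word) + ssrcs


def decode_remb_bitrate(packet):
    """Extract the bitrate (bits/s) from a REMB packet built as above."""
    (word,) = struct.unpack_from("!I", packet, 16)
    return (word & 0x3FFFF) << ((word >> 18) & 0x3F)


if __name__ == "__main__":
    packet = encode_remb(sender_ssrc=1, bitrate=252000, media_ssrcs=[0x1234])
    print(packet.hex(), decode_remb_bitrate(packet))
```

Because the mantissa has only 18 bits, large bitrates are rounded down to the nearest representable value. This has no practical effect on the limits discussed here.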

At the same time we realized that the initial model used by the WebRTC client was not very useful for a large category of uses, namely users having a group video chat with friends or family. In that model the client would display one participant in full and the others as thumbnails, and let the user switch which participant to view by clicking on a thumbnail. In a group chat, however, users do not want to click a thumbnail to switch to another participant and only see one participant at a time. Instead they would like to see all participants at the same time.

As a result of all this, we decided to give up on the original idea of 2 streams of different resolutions per participant and completely change our model. We came up with the 2 models mentioned before: the ad-hoc conference model and the moderated conference model.

h3. The ad-hoc conference model

The ad-hoc conference model is meant for a group chat with friends or family, where one expects to see all the other participants on the screen at the same time and any participant can jump into the conversation at any time. In this model we decided to display all participants in a matrix, so everyone is visible at all times. Initially the matrix is just 1x1 when there are just 1 or 2 people in the room, but it can grow up to a 3x3 matrix that can accommodate up to 10 participants (9 + yourself). This model fits well with the idea of using REMB to limit the send bitrate, because the more participants are on screen, the smaller their video is. This aligns perfectly with the idea of having a constant room bitrate shared by all participants: the more participants, the lower their bitrate, but also the smaller their video frame on screen, which compensates for the reduced quality of their video stream.

In order to compare the bandwidth used by this model with the original model we attempted (the one with 2 video streams per participant), let's consider the bitrate used by an HD stream (1280x720 @30fps). This bitrate is ~2.0-2.4Mb/s; let's call it B. We have found that for a thumbnail sized video stream of 320x240 pixels at 30fps, the bitrate requirement was still very high, in the range of B/3 to B/2. As a result, in the original model each participant had to send anywhere between 1.3*B and 1.5*B. At the same time, because only one participant was big on screen and all others were thumbnails, each participant would receive B + (N-1)*B/2 = B*(N+1)/2, where N is the number of participants. In the ad-hoc conference model, as mentioned before, each participant receives B and sends B/(N-1).

In order to compare these numbers, let's consider B = 2Mb/s and N = 9.

In the original model, each participant would have sent 1.5*2 = 3Mb/s and would have received 2*(9+1)/2 = 10Mb/s.
In the ad-hoc conference model, with B being set as the room maximum bitrate, each participant would send 2/(9-1) = 0.25Mb/s and would receive 2Mb/s.

These numbers show that the ad-hoc conference model with controlled bitrate per participant is a lot more effective as far as bandwidth management goes, compared to the original model we started with, being 5-12 times more efficient in the amount of data sent/received.
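
The comparison above can be reproduced with a short script. This is a sketch that uses the formulas from the text, with the upper bound of 1.5*B for the upload in the original model. The function names are invented for this example.

```python
def original_model(b, n):
    """(sent, received) per participant with two streams per participant."""
    sent = 1.5 * b              # HD stream plus a thumbnail stream of up to B/2
    received = b * (n + 1) / 2  # formula used in the text above
    return sent, received


def adhoc_model(b, n):
    """(sent, received) per participant with a shared room bitrate B."""
    return b / (n - 1), b


if __name__ == "__main__":
    b, n = 2, 9  # Mb/s and number of participants, as in the example above
    old_sent, old_received = original_model(b, n)
    new_sent, new_received = adhoc_model(b, n)
    print(f"sent:     {old_sent} vs {new_sent} Mb/s ({old_sent / new_sent:.0f}x)")
    print(f"received: {old_received} vs {new_received} Mb/s ({old_received / new_received:.0f}x)")
```

The two ratios, 12x for sent traffic and 5x for received traffic, are where the 5-12 times figure comes from.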

In addition, the ad-hoc conference model provides a much better user experience by allowing all participants to be visible on screen at once.

Another thing we noticed with Chrome, while using VP8 as a codec, was that with more than 3 participants in a room, Chrome started to dynamically adjust the resolution of the video being sent, fluctuating between HD and VGA resolution, depending on the bitrate it was allowed to use and the amount of movement in the encoded video stream. In addition to the resolution changes we also noticed changes to the sent frame rate. This was an added bonus: with more participants in a room, which imposes a lower bitrate per participant, we expected Chrome to do this more often, and thus we could achieve not only improved network bandwidth usage, but also reduced CPU usage.

Unfortunately Firefox did not have this behavior. Firefox would maintain the resolution requested when the stream started for the whole duration of the call, regardless of the bitrate limitation imposed on it. To compensate for this, we tried to request resolution adjustments based on the number of participants in the room, in order to reduce the CPU usage as the number of participants increases. However, this did not prove successful, because it does not yield reliable results. Sometimes Firefox will switch resolutions without a problem. Other times the camera will attempt to switch resolutions and will not reopen at the new resolution, which results in the video stream not being sent anymore (it freezes on the last frame before the resolution change was attempted). This happens randomly and we could not determine what causes it or how to fix it. It is also worth mentioning that we noticed this problem with Firefox running on OS X on a MacBook Pro with a built-in camera. We do not know if a similar problem exists for external cameras or on other operating systems (Linux or Windows).

The idea of exploring this feature is still open, however, as it would be a much better solution than Chrome's automatic resolution adjustment if it could be made to yield reliable and consistent results. Chrome switches resolution based on factors other than just bitrate, and it does not seem to do it often enough to be effective. In addition, we have found that with VP9 as a codec, Chrome would not lower the resolution of a video stream even when the bitrate is as low as 256Kb/s; it would only lower the frame rate. While this could still be effective in reducing CPU load, it is not as effective as reducing both resolution and frame rate. We assume this is because VP9 is a more effective codec than VP8 and is able to maintain the HD resolution even at lower bitrates.

h3. The moderated conference model

Since the ad-hoc model is not best suited for every application, we also considered the moderated conference model. In this model a moderator controls the flow of the conference. The moderator is the first participant that joins the conference. The moderator can see a list of all the participants, decide who is the active speaker and mute audio/video per participant when needed. In this model only 1 or at most 2 participants can be active speakers at a time, and who they are is decided by the moderator. Participants cannot select which other participants they see on screen. This is decided by the moderator, who selects the active participants that will be shown on everyone's screens, while all others are shown as thumbnails.

With 1 active speaker, the conference is suitable for cases where people need to give a speech or show a presentation for others to watch. In this case the moderator simply switches the active participant, giving the next speaker their stage time.

With 2 active speakers at the same time, the conference can be used, for example, for a public debate, where the active speakers debate the subject while the rest of the participants just watch, or ask questions if needed.

In this model, each active speaker will have their bitrate limited to max_bitrate / number_of_active_speakers, while everyone else will just have a very low bitrate value (64Kb/s) so they can be displayed as thumbnails.

Considering B the bitrate for an HD stream @30fps, N the number of participants in the conference and AS the number of active speakers:

 * Each active speaker will send B/AS
 * Everyone else will send a constant 64Kb/s
 * Everyone in the room will receive B + (N-AS)*64Kb/s

For B=2Mb/s, N=10, AS=2 we have:

 * Each active speaker sends 1Mb/s
 * Everyone else sends 64Kb/s = 0.064Mb/s
 * Everyone in the room will receive 2Mb/s + (10-2) * 0.064Mb/s = 2.512Mb/s

As can be seen, these numbers show that the moderated conference model is also a lot more efficient than the original model with 2 streams per participant.
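
For a direct comparison, the receive bitrate of both models can be computed for the same room. This is a sketch based on the formulas given in this document. The helper names are invented, and bitrates are expressed in Kb/s to keep the arithmetic in integers.

```python
THUMBNAIL_KBPS = 64  # bitrate of a participant who is not an active speaker


def moderated_received_kbps(b_kbps, participants, active_speakers):
    """Moderated model: B + (N - AS) * 64Kb/s."""
    return b_kbps + (participants - active_speakers) * THUMBNAIL_KBPS


def original_received_kbps(b_kbps, participants):
    """Original two-stream model: B * (N + 1) / 2."""
    return b_kbps * (participants + 1) / 2


if __name__ == "__main__":
    moderated = moderated_received_kbps(2000, 10, 2)
    original = original_received_kbps(2000, 10)
    print(f"moderated: {moderated / 1000} Mb/s, original: {original / 1000} Mb/s")
```

For B=2Mb/s and N=10, the original model would require 11Mb/s of download bandwidth per participant, more than 4 times the 2.512Mb/s needed in the moderated model.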

h3. Mobile device considerations

Because mobile devices have both more limited resources and more limited screen space, we are considering the following technique for small mobile devices:

For both ad-hoc and moderated conferences, the mobile client will only display 1 or at most 2 participants in full view. For a moderated conference they are already decided by the moderator, while for an ad-hoc conference the user can select 1-2 of the participants to be seen. For the other participants the device will pause their video streams and not show thumbnails for them, but instead show them as static icons or just display them in a list. By doing this, the mobile device not only prevents screen clutter, allowing for a more efficient use of the limited screen space, but also dramatically reduces its CPU usage, because it does not need to receive and decode the other participants' video streams just to display them as thumbnails.

By using this technique, a mobile device will only have to decode and display 1 or at most 2 video streams, which is fully within the device's processing capabilities, regardless of how many participants are in the conference room.

h2. Measurements

These load measurements were done on a MacBook Pro 15" with a 2.3GHz Intel Core i7 CPU, with 7 participants in the room, each using 336Kb/s. They show the CPU usage of the Firefox web browser under these conditions, for the specified video codecs and resolutions, which are used by all participants:

<pre>
 * H264/VGA - 150% CPU
 * H264/HD  - 250% CPU
 * VP9/VGA  - 220% CPU
 * VP9/HD   - 350% CPU
</pre>

As far as CPU utilization goes, the most efficient codec is H264 (presumably because it has hardware accelerated support on a lot of devices), followed by VP9, with VP8 last.

In a conference with 2 participants both sending HD video (1280x720 @30fps), on the same laptop mentioned above we noticed the following CPU load values in Firefox:

<pre>
 * VP8  - 130% CPU
 * VP9  - 100-110% CPU
 * H264 - 50-70% CPU
</pre>

h2. Conclusions

We consider that the ad-hoc and moderated conference models offer much better results than the original two-streams-per-participant idea. Not only do they offer a better and more natural user interface, they also allow for more control from the server, which can decide both the codec to be used and the bitrate limit per room, thus controlling the quality of the call in a single place.

For now we consider a room with a 2Mb/s bitrate limit using VP9 to be the best compromise between quality and resources being used as well as support across all devices. At the moment we cannot recommend H264 despite the huge improvement it would provide in CPU usage, especially for mobile clients, because we have found some compatibility issues for the mobile clients, where the mobile client would display a green screen for any incoming video stream with H264.

h2. Remaining tasks

 * SylkServer: control and feedback interface for moderator
 * Sylk WebRTC client: add control for moderator
 * Janus: patch to request full frames when a paused video is resumed
 * Rebuild the mobile version
 * Package new versions of modified software

h2. Software that was modified

In order to implement the bandwidth management and CPU load optimizations the following software was modified:

# sylkserver https://github.com/AGProjects/sylkserver
# sylk-webrtc https://github.com/AGProjects/sylk-webrtc
# sylkrtc.js https://github.com/AGProjects/sylkrtc.js
# python-application https://github.com/AGProjects/python-application
# python-sipsimple https://github.com/AGProjects/python-sipsimple