WebRTC Conference

Version 56 (Tijmen de Mes, 07/10/2017 10:10 am) → Version 57/59 (Adrian Georgescu, 07/15/2017 01:02 pm)

h1. SylkServer WebRTC Video Conference

https://webrtc.sipthor.net https://webrtc-test.sipthor.net

h2. Goal

The current SylkServer implementation of video rooms lacks several features that it needs in order to be a useful application. The most prominent ones relate to resource usage: bandwidth and CPU usage must be optimised to make the application effective on the variety of devices and networks where these resources are scarce (e.g. mobile devices and networks). Another missing feature is the ability to manage who the active speaker is.

We have enhanced the video room conference application to address these issues.

h2. Design

Two types of conferences are being supported: ad-hoc conferences and moderated conferences.

h3. Ad-hoc conferences

An ad-hoc conference is a conference where all participants have the same status and no one controls what the other participants are doing. The participants are rendered in a matrix of up to 3x3, depending on how many participants are in the room. The layout switches automatically for everybody as participants join or leave.

The conference room has a fixed total bitrate configured by the server, which can be specified per room or globally with the max_bitrate setting in webrtcgateway.ini (see below). This bitrate is shared by all participants in the room: the more participants there are, the less bitrate each one uses for the video stream they send, keeping the total room usage constant at the value configured by max_bitrate. SylkServer adjusts the bitrate per participant automatically as participants join or leave the room, by dividing the available bitrate among the participants. The end result is that each participant sends a fraction of max_bitrate (determined by the number of participants in the room) and always receives a combined total of max_bitrate from the other participants, no matter how many participants are in the room. The formula to compute the bitrate per participant is shown below:

participant_send_bitrate = max_bitrate / max(number_of_participants - 1, 1)

Using this formula we can make sure that each participant always receives max_bitrate traffic in incoming video streams, independent of the number of participants. The traffic sent/received by each party can be expressed as follows (where N is the number of participants and N>1):

participant_sent_traffic = max_bitrate / (N - 1)
participant_received_traffic = max_bitrate

sylkserver_sent_traffic = max_bitrate * N (participant_received_traffic * N)
sylkserver_received_traffic = max_bitrate * N / (N - 1) (participant_sent_traffic * N)
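
The formulas above can be checked with a small Python sketch. The function names are made up for illustration and are not part of SylkServer; the example value for max_bitrate is the documented default of 2016000 bits/s.

```python
def participant_send_bitrate(max_bitrate, participants):
    """Bitrate each participant is allowed to send, in bits/s."""
    return max_bitrate / max(participants - 1, 1)


def room_traffic(max_bitrate, participants):
    """Per-participant and server traffic for a room with N > 1 participants."""
    if participants < 2:
        raise ValueError("traffic totals are only defined for N > 1")
    sent = participant_send_bitrate(max_bitrate, participants)
    received = sent * (participants - 1)  # one stream from everyone else
    return {
        "participant_sent": sent,
        "participant_received": received,
        "server_sent": received * participants,
        "server_received": sent * participants,
    }


# Example: default max_bitrate with 7 participants in the room
for name, value in room_traffic(2016000, 7).items():
    print(name, value)
```

With 7 participants each one sends 336000 bits/s, while the received traffic per participant stays at max_bitrate.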

h3. Moderated conferences

A moderated conference is a conference where a moderator can decide the flow of the conference. The moderator is the first participant to join the conference. The moderator has the ability to see a list with all the participants, can select 1 or 2 of them to be the active speakers and also has the ability to mute other participants (audio and/or video). The moderator can also change the active speakers at any time.

The other participants will see the selected active speakers in full-sized video and the other participants as thumbnails. They will not be able to choose which other participant to watch, the conference view in their browser will be controlled by the moderator that decides who is the active speaker that everybody else sees on their screen in full-sized video.

The active speakers selected by the moderator will have their bitrate set to either max_bitrate (for 1 active speaker) or max_bitrate/2 (for 2 active speakers), while everybody else will have their bitrate set to a low value (64Kb/s), just enough to have them represented in small thumbnails on the other participants' screens.
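
As a rough illustration of this allocation rule, the sketch below maps each participant to the send bitrate it would be limited to. The function name, the participant names and the use of 64000 bits/s for the thumbnail bitrate are assumptions made for the example; this is not the SylkServer implementation.

```python
THUMBNAIL_BITRATE = 64000  # bits/s; assumes 64Kb/s means 64000 bits/s


def moderated_bitrates(max_bitrate, participants, active_speakers):
    """Return {participant: send bitrate limit} for a moderated room."""
    active = set(active_speakers) & set(participants)
    if not 1 <= len(active) <= 2:
        raise ValueError("expected 1 or 2 active speakers present in the room")
    speaker_bitrate = max_bitrate // len(active)
    return {
        p: speaker_bitrate if p in active else THUMBNAIL_BITRATE
        for p in participants
    }


room = ["alice", "bob", "carol", "dave"]  # hypothetical participants
print(moderated_bitrates(2016000, room, ["bob"]))
print(moderated_bitrates(2016000, room, ["bob", "dave"]))
```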

h2. Features

h3. Ad-hoc conferences

Ad-hoc conferences are best suited for conversations with family/friends, since bandwidth/bitrate is managed automatically and does not involve a dedicated person to control the flow of the conference. However they can also be used for any other video conferences that imply a free-flowing type of discussion where any participant can jump into the conversation at any time.

h3. Moderated conferences

Moderated conferences are best suited for a business environment, where participants have to make some sort of presentation in front of the other participants and a moderator is assigned to control the flow of the conference and give the microphone to the appropriate participant, while the others just watch the active speaker. They can also be used for a conference with 2 active participants who are having a public debate on a subject, while every other participant just watches and possibly asks questions.

h2. Configuration

SylkServer allows the maximum bitrate and video codec to be configured, globally or per room, with the following settings in the webrtcgateway.ini file:

; Maximum video bitrate allowed per sender in a room in bits/s. This value is
; applied to any room that doesn't define its own. The value is any integer
; number between 64000 and 4194304. Default value is 2016000 (~2Mb/s).
; max_bitrate = 2016000

; The video codec to be used by all participants in a room. This value is
; applied to any room that doesn't define its own.
; Possible values are: h264, vp8 and vp9. Default is vp9.
; video_codec = vp9
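
The documented ranges and defaults can be expressed as a small validation sketch using Python's standard configparser module. The section name General and the function name are assumptions made for this example; the actual layout of webrtcgateway.ini and the way SylkServer reads it may differ.

```python
from configparser import ConfigParser

DEFAULT_BITRATE = 2016000
DEFAULT_CODEC = "vp9"
VALID_CODECS = {"h264", "vp8", "vp9"}

# Hypothetical file content; the section name is an assumption.
SAMPLE = """
[General]
max_bitrate = 1024000
; video_codec = vp9
"""


def load_video_settings(text, section="General"):
    """Parse and validate max_bitrate and video_codec from ini text."""
    parser = ConfigParser()
    parser.read_string(text)
    max_bitrate = parser.getint(section, "max_bitrate", fallback=DEFAULT_BITRATE)
    video_codec = parser.get(section, "video_codec", fallback=DEFAULT_CODEC).lower()
    if not 64000 <= max_bitrate <= 4194304:
        raise ValueError("max_bitrate must be between 64000 and 4194304")
    if video_codec not in VALID_CODECS:
        raise ValueError("video_codec must be one of h264, vp8, vp9")
    return max_bitrate, video_codec


print(load_video_settings(SAMPLE))
```

Settings that are commented out or missing fall back to the defaults quoted in the comments above.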

h2. Client support

Firefox and Chrome are supported. We have not tested Edge, but it recently added WebRTC support, so it might work. The standalone Electron application is supported; it just has to be rebuilt with the new content of the site. Mobile devices will be supported, but the mobile clients need to be rebuilt as well. Note that mobile devices currently do not work when H264 is configured as the codec, because of a compatibility problem with this codec that we have not yet figured out.

h2. Things that were explored

In order to implement bandwidth management and CPU load optimizations we have explored a couple of things, some of which proved fruitful, while others were abandoned or proved to be not very helpful for our goal.

The original idea we started with was to have each client send two video streams, one low resolution and one high resolution, and let the other participants switch between them based on their need (use the high resolution video if the participant was viewed in full, or the low resolution video if they were displayed as a thumbnail). As we progressed, we quickly discovered that this setup was a lot more complicated to manage than we had anticipated. Every participant would open 2 sessions to the conference room just to publish their low and high resolution streams, which made them appear duplicated in the conference. Special means needed to be employed to associate two such distinct sessions coming from the same device and present them as a single entity. This had to be done in each client, which meant that older clients would not be able to deal with this setup and would automatically display every participant duplicated.

In addition this setup would increase the upload bandwidth of each participant 1.5 times, going against the idea of reducing the used bandwidth.

The advantages of this model were the reduced download bandwidth and the reduced CPU utilization that result from only having to process one high resolution video stream while all the other video streams are low resolution. These advantages were overshadowed by the higher upload bandwidth, by the more complicated room management required to deal with devices connecting twice per participant, and by the inability of older devices to join such a conference room.

While we were working on this we also ran into a technical limitation in Firefox, which was unable to provide 2 video streams of different resolutions at the same time. When we tried to obtain 2 video streams, one low resolution and one high resolution, the moment we requested the second stream with a different resolution, the first stream's resolution was updated to match the second and we ended up with 2 streams with the same resolution. This was a limitation in Firefox that we couldn't overcome, so at this point, in addition to the issues mentioned above with this model, we were also facing the prospect of dropping Firefox support and having our solution work only with Chrome.

While we were contemplating our choices here we discovered that there was a mechanism by which a WebRTC client could be constrained to limit its sending bandwidth, and that this mechanism could be employed dynamically during a call to make the device's sending bitrate high or low as desired without any need to renegotiate the session. This mechanism uses REMB packets, which are control packets sent through RTCP that make a browser adjust its send bitrate on the fly as requested. The good news was that both Chrome and Firefox supported this. This bit of information changed everything and we realized we could use it to build a better solution, which was a lot less complicated and more effective.

Chrome's reactions to REMB packets are displayed in the image below. Here we tested with limits ranging from 4Mb/s to 100Kb/s.

p=. !REMB-reactions.png!
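
To illustrate how a bitrate limit is carried by a REMB packet, here is a short Python sketch that follows the packet layout from the REMB Internet-Draft: an RTCP payload-specific feedback message (PT=206, FMT=15) with the bitrate stored as an 18-bit mantissa and a 6-bit exponent. The function names and SSRC values are made up for the example, and this is not code from SylkServer; in a real deployment these packets are presumably generated by the media server rather than by application code.

```python
import struct


def encode_remb(bitrate, sender_ssrc, media_ssrcs):
    """Build a REMB RTCP packet limiting the given SSRCs to bitrate (bits/s)."""
    mantissa, exponent = bitrate, 0
    while mantissa > 0x3FFFF:  # mantissa must fit in 18 bits
        mantissa >>= 1         # low bits are dropped, so the value rounds down
        exponent += 1
    length = 4 + len(media_ssrcs)  # packet length in 32-bit words, minus one
    packet = struct.pack("!BBH", 0x8F, 206, length)  # V=2, P=0, FMT=15, PT=206
    packet += struct.pack("!II4s", sender_ssrc, 0, b"REMB")
    packet += struct.pack(
        "!I", (len(media_ssrcs) << 24) | (exponent << 18) | mantissa
    )
    packet += b"".join(struct.pack("!I", ssrc) for ssrc in media_ssrcs)
    return packet


def decode_remb_bitrate(packet):
    """Extract the bitrate (bits/s) from a REMB packet."""
    (word,) = struct.unpack("!I", packet[16:20])
    return (word & 0x3FFFF) << ((word >> 18) & 0x3F)


packet = encode_remb(2016000, sender_ssrc=1, media_ssrcs=[0x1234])
print(len(packet), decode_remb_bitrate(packet))
```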

At the same time we realized that the initial model used by the WebRTC client, where in a conference room the client would display one participant in full and the others as thumbnails, and then let the user switch which participant to view by clicking on a thumbnail, was not very useful for a large category of uses, namely users having a group video chat with friends/family. In this case users do not want to click a thumbnail to switch to another participant and only see one participant at a time; they would like to see all participants at the same time.

As a result of all this, we decided to give up on the original idea with 2 streams of a different resolution per participant and completely change our model. We came up with the 2 models mentioned before: the ad-hoc conference model and the moderated conference model.

h3. The ad-hoc conference model

The ad-hoc conference model is meant to be used for a group chat with friends/family where one expects to see all the other participants on the screen at the same time and any participant can jump into the conversation at any time. In this model we decided to display all participants in a matrix, so everyone is visible at all times. Initially the matrix is just 1x1 when there are just 1 or 2 people in the room, but it can grow up to a 3x3 matrix that can accommodate up to 10 participants (9+yourself). This model fits well with the idea of using REMB to limit the send bitrate, because the more participants are on screen, the smaller their video is, which aligns perfectly with the idea of having a constant room bitrate that is shared by all participants: the more participants, the lower their bitrate and also the smaller their video frame on screen, which compensates for the reduced quality of their video stream.
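
One possible way to derive the matrix size from the number of participants is sketched below. The text only fixes the end points (1x1 for 1 or 2 people, 3x3 for up to 9 other participants); the square-root rule used for the sizes in between, the function name and the assumption that your own video is not part of the matrix are guesses made for illustration.

```python
import math


def grid_side(participants, max_side=3):
    """Side length of the square matrix showing the other participants."""
    others = max(participants - 1, 1)  # assumes your own video is not in the grid
    return min(max_side, math.ceil(math.sqrt(others)))


for n in (1, 2, 3, 5, 6, 10):
    side = grid_side(n)
    print(n, "participants ->", side, "x", side)
```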

In order to compare the bandwidth used by this model and by the original model we attempted (the one with 2 video streams per participant), let's consider the bitrate used by an HD stream (1280x720 @30fps). This bitrate is ~2.0-2.4Mb/s; let's call it B. We found that for a thumbnail-sized video stream of 320x240 pixels at 30 fps, the bitrate requirement was still very high, in the range of B/3 to B/2. As a result, in the original model each participant had to send anywhere between 1.3*B and 1.5*B. At the same time, because only one participant was big on screen and all others were thumbnails, each participant would receive B + (N-1)*B/2 = B*(N+1)/2, where N is the number of participants. In the ad-hoc conference model, as mentioned before, each participant receives B and sends B/(N-1).

In order to compare these numbers, let's consider B = 2Mb/s and N = 9.

In the original model, each participant would have sent 1.5*2 = 3Mb/s and would have received 2*(9+1)/2 = 10Mb/s.
In the ad-hoc conference model, with B being set as the room maximum bitrate, each participant would send 2/(9-1) = 0.25Mb/s and would receive 2Mb/s.

These numbers show that the ad-hoc conference model with controlled bitrate per participant is a lot more effective as far as bandwidth management goes than the original model we started with: it uses 12 times less bandwidth for the data sent and 5 times less for the data received.
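
The comparison can be reproduced with a few lines of Python. The function names are illustrative only; the thumbnail stream is taken at the upper bound of B/2 used in the text.

```python
def original_model(b, n, thumb_fraction=0.5):
    """(sent, received) per participant with two streams per participant."""
    sent = b + b * thumb_fraction
    received = b + (n - 1) * b * thumb_fraction
    return sent, received


def adhoc_model(b, n):
    """(sent, received) per participant with a shared room bitrate."""
    return b / (n - 1), b


B, N = 2.0, 9  # Mb/s, participants
orig_sent, orig_received = original_model(B, N)
adhoc_sent, adhoc_received = adhoc_model(B, N)
print("original:", orig_sent, orig_received)
print("ad-hoc:", adhoc_sent, adhoc_received)
print("ratios:", orig_sent / adhoc_sent, orig_received / adhoc_received)
```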

In addition the ad-hoc conference also provides a much better user experience allowing all participants to be visible on screen at once.

Another thing we noticed with Chrome, while using VP8 as a codec, was that with more than 3 participants in a room, Chrome started to dynamically adjust the resolution of the video being sent, fluctuating between HD and VGA resolution, depending on the bitrate it was allowed to use and the amount of movement in the encoded video stream. In addition to the resolution changes we also noticed changes to the sent frame rate. This was an added bonus: with more participants in a room imposing a lower bitrate per participant, we expected Chrome to do this more often, and thus we could achieve not only improved network bandwidth usage, but also reduced CPU usage.

Unfortunately Firefox did not have this behavior. Firefox would maintain the resolution requested when the stream started for the whole duration of the call, regardless of the bitrate limitation being imposed on it. In order to compensate for this we tried to request resolution adjustments based on the number of participants in the room, to reduce the CPU usage when the number of participants increases. However this did not prove successful, because it does not yield reliable results. Sometimes Firefox switches resolutions without a problem; other times the camera attempts to switch resolutions and does not reopen at the new resolution, which results in the video stream not being sent anymore (it freezes on the last frame before the resolution change was attempted). This happens randomly and we could not determine what causes it or how to fix it. It is also worth mentioning that we noticed this problem with Firefox running on OSX on a Macbook Pro with a built-in camera. We do not know if a similar problem exists for external cameras or on different operating systems (Linux or Windows).

The idea of exploring this feature however is still open, as it would be a much better solution than Chrome's automatic resolution adjustment if it yielded reliable and consistent results. Chrome switches resolution based on other factors than just bitrate and it doesn't seem to do it often enough to be effective. In addition we have found that with VP9 as a codec, Chrome would not lower the resolution of a video stream even when the bitrate is as low as 256Kb/s; it would only lower the frame rate. While this could still be effective in reducing CPU load, it is not as effective as reducing both resolution and frame rate. We assume this is because VP9 is a more effective codec than VP8 and is able to maintain the HD resolution even at lower bitrates.

h3. The moderated conference model

Since the ad-hoc model is not best suited for every application, we also considered the moderated conference model. In this model a moderator controls the flow of the conference. The moderator is the first participant that joins the conference. The moderator is able to see a list with all the participants, decide who is the active speaker and mute audio/video per participant when needed. In this model only 1 or at most 2 participants can be active speakers at a time, and who they are is decided by the moderator. Participants cannot select which other participants they see on screen. This is decided by the moderator, who selects the active participants that will be shown on everyone's screens, while all others are shown as thumbnails.

With 1 active speaker the conference is suitable for cases like when some people need to give a speech or show a presentation for others to watch. In this case the moderator simply switches the active participant by giving the next speaker their stage time.

With 2 active speakers at the same time, the conference can be used for example for having a public debate on a subject, where the active speakers debate the subject while the rest of the participants just watch the debate, or ask questions if needed.

In this model, each active speaker will have their bitrate limited to max_bitrate / number_of_active_speakers, while everyone else will just have a very low bitrate value (64Kb/s) so they can be displayed as thumbnails.

Considering B the bitrate for an HD stream @30fps, N the number of participants in the conference and AS the number of active speakers:

Each active speaker will send B/AS
Everyone else will send a constant 64Kb/s
Everyone in the room will receive B + (N-AS)*64Kb/s

For B=2Mb/s, N=10, AS=2 we have:
Each active speaker sends 1Mb/s
Everyone else sends 64Kb/s = 0.064Mb/s
Everyone in the room will receive 2Mb/s + (10-2) * 0.064Mb/s = 2.512Mb/s

As can be seen, these numbers show that the moderated conference model is also a lot more efficient than the original model with 2 streams per participant.
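
To put the two models side by side, the sketch below computes the received traffic for the example above and for the original two-stream model, using the B*(N+1)/2 formula given earlier for that model. The function names are illustrative, and values are in Kb/s with 1Mb/s taken as 1000Kb/s.

```python
def moderated_received(b, n, active_speakers, thumbnail=64):
    """Traffic received per participant: B + (N-AS)*thumbnail."""
    return b + (n - active_speakers) * thumbnail


def original_received(b, n):
    """Traffic received per participant in the two-stream model: B*(N+1)/2."""
    return b * (n + 1) / 2


B, N, AS = 2000, 10, 2  # Kb/s, participants, active speakers
print("active speaker sends:", B / AS, "Kb/s")
print("moderated model, received:", moderated_received(B, N, AS), "Kb/s")
print("original model, received:", original_received(B, N), "Kb/s")
```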

h3. Mobile device considerations

Because mobile devices have both more limited resources and more limited screen space available, we consider using the following technique for small mobile devices:

For both ad-hoc and moderated conferences, the mobile client will only display 1 or at most 2 participants in full view. For a moderated conference they are already decided by the moderator, while for an ad-hoc conference the user can select 1-2 of the participants to be seen. For the other participants the device will pause their video streams and not show thumbnails for them, but instead show them as static icons or just display them in a list. By doing this, the mobile device not only prevents screen clutter, allowing for a more efficient use of the limited screen space, but by pausing the other participants' video streams it also dramatically reduces its CPU usage, because it does not need to receive and decode their video streams just to display them as thumbnails.

By using this technique, a mobile device will only have to deal with decoding and displaying 1 or at most 2 video streams, which is fully within the device's processing capabilities, regardless of how many participants are in the conference room.
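
The selection logic described above can be sketched as follows. The function and participant names are hypothetical; the sketch only decides which streams would be played and which would be paused, and does not show how a client would actually pause a stream.

```python
def split_streams(participants, selected, max_full_view=2):
    """Return (streams to play in full view, streams to pause)."""
    if not 1 <= len(selected) <= max_full_view:
        raise ValueError("select 1 or at most 2 participants")
    unknown = set(selected) - set(participants)
    if unknown:
        raise ValueError("not in the room: %s" % ", ".join(sorted(unknown)))
    play = [p for p in participants if p in selected]
    pause = [p for p in participants if p not in selected]
    return play, pause


room = ["alice", "bob", "carol", "dave", "erin"]  # hypothetical participants
play, pause = split_streams(room, ["carol", "alice"])
print("play:", play)
print("pause:", pause)
```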

h2. Measurements

These load measurements were done on a Macbook Pro 15" with a 2.3GHz Intel Core i7 CPU, with 7 participants in the room, each using 336Kb/s. The measurements show the CPU usage of the Firefox web browser under these conditions, for the specified video codecs and resolutions, which are used by all participants:

* H264/VGA - 150% CPU
* H264/HD - 250% CPU
* VP9/VGA - 220% CPU
* VP9/HD - 350% CPU

As far as CPU utilization goes, the most efficient codec is H264 (presumably because it has hardware accelerated support on a lot of devices), followed by VP9 and last is VP8.

In a conference with 2 participants both sending HD video (1280x720 @30fps), on the same laptop mentioned above we noticed the following CPU load values in Firefox:

* VP8 - 130% CPU
* VP9 - 100-110% CPU
* H264 - 50-70% CPU

h2. Conclusions

We consider that the ad-hoc and moderated conference models offer much better results than the original two-streams-per-participant idea. Not only do they offer a better and more natural user interface, they also allow for more control from the server, which can decide both the codec to be used and the bitrate limit per room, thus controlling the quality of the call in a single place.

For now we consider a room with a 2Mb/s bitrate limit using VP9 to be the best compromise between quality and resources being used as well as support across all devices. At the moment we cannot recommend H264 despite the huge improvement it would provide in CPU usage, especially for mobile clients, because we have found some compatibility issues for the mobile clients, where the mobile client would display a green screen for any incoming video stream with H264.

h2. Remaining tasks

* SylkServer: control and feedback interface for moderator
* Sylkrtc.js: use control and feedback interface for moderator
* Sylk WebRTC client: add control for moderator
* Janus: patch to request full frames when a paused video is resumed
* Rebuild the mobile version
* Package new versions of modified software
* Deploy SylkServer with the SIP2SIP online service

h2. Software that was modified

In order to implement the bandwidth management and CPU load optimizations the following software was modified:

# sylkserver https://github.com/AGProjects/sylkserver
# sylk-webrtc https://github.com/AGProjects/sylk-webrtc
# sylkrtc.js https://github.com/AGProjects/sylkrtc.js
# python-application https://github.com/AGProjects/python-application
# python-sipsimple https://github.com/AGProjects/python-sipsimple