This article by our CEO, Jerod Venema, was originally published by Spiceworks: “How To Analyze and Interpret Network Data for Real-Time Video.”
Latency is a problem when dealing with real-time communication. Traditional CDNs have methods to handle last-mile quality issues, but those methods aren’t options for real-time streaming because they introduce network latency, which kills communication. In this first article of a two-part series, Jerod Venema, CEO and co-founder, LiveSwitch, discusses how to analyze the problem.
One of my product owners recently came up with an acronym we now use internally at LiveSwitch: CNCS. This came about after discussing simulcast, SVC, RED packets, and various other mechanisms for mitigating poor-quality networks, and debating how effective each option is under specific conditions. It stands for Crappy Network Condition Support, a key element in how LiveSwitch operates.
One of the hardest challenges when dealing with real-time (<200ms) communication is latency. Traditional CDNs have a number of techniques they can use to deal with last-mile quality issues. From pushing data to the edge and caching it, to client-side buffering, to pre-encoding the data into multiple quality profiles, there are multiple solutions that work around an end user’s crappy wifi. For real-time streaming, none of these options is workable (beyond a very limited degree) because they all introduce network latency, and latency kills communication. (As I write this, I just received a “regular” phone call from a friend of mine; the lag was about 1.5 seconds between us, and we had to redial because it was intolerable.)
So what do we do with this? If we cannot buffer, cannot fix the end user’s network, have to keep the latency under 200ms, and we still have to make the communication high quality, what are our options?
Identify the Potential Sources of Lag in the System
There are a lot of components in play in the pipeline when it comes to real-time communication. Here’s a quick view of what happens just to send data from your camera to the server:
[access the camera]->[receive an image]->[scale the image]->[convert the image to the right format]->[encode the frame]->[packetize the encoded data]->[wrap the packet]->[encrypt packet]->[place packet on the network card]->[network card transmits to wifi access point]->[wifi access point transmits to router]->[router transmits to server over public internet]->[TURN(S) server]->[YOU HAVE ARRIVED]
In the flowchart above, every arrow is a potential handoff point where a queue can form, and every item in [square brackets] is a potential point of latency. We then have to do the whole thing in reverse on the receiver side. Once we understand this flow, we can start to identify places where problems can occur, get feedback from the system on those problems, and adjust appropriately.
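A useful first step is simply to timestamp each handoff and log the deltas so you can see where the milliseconds actually go. Here is a minimal sketch; the stage names and the StageTimer helper are illustrative, not LiveSwitch APIs:

```typescript
// Hypothetical per-stage timing helper; the stage names mirror the flow
// above, and none of this is a LiveSwitch API.
type Stage =
  | "capture" | "scale" | "convert" | "encode"
  | "packetize" | "encrypt" | "send";

class StageTimer {
  private marks = new Map<Stage, number>();

  // Record the moment a frame enters a stage.
  enter(stage: Stage): void {
    this.marks.set(stage, performance.now());
  }

  // Record the moment it leaves, returning the elapsed milliseconds.
  exit(stage: Stage): number {
    const start = this.marks.get(stage);
    if (start === undefined) return 0;
    const elapsed = performance.now() - start;
    // Anything eating more than a few ms of a 33ms frame budget deserves a look.
    if (elapsed > 5) {
      console.warn(`${stage} took ${elapsed.toFixed(1)}ms`);
    }
    return elapsed;
  }
}
```

Logged over a few seconds of live traffic, numbers like these make it obvious whether the budget is being eaten locally (scale, convert, encode) or at the network handoff at the end.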
I would like to note that audio basically behaves the same as video for the purposes of our discussion, and that there can be other video sources besides a camera; for our purposes, we will assume a camera video feed. There are absolutely differences (audio cannot be NACKed, we would degrade a screenshare differently than a camera feed due to its use case, etc.), but we will deal with those another day.
I would also like to note that, despite the title of this article, we will review lag in the whole system because, critically, you cannot adjust for the network *only on the network* and be successful. You have to treat the entire system as a whole.
Finally, remember that everything in the list above happens really, really fast. At 30fps, each frame has a budget of 33.3 milliseconds.
Step 1: Camera access
A surprising point of latency is the actual camera access itself. Starting the camera in hardware takes a few seconds, so right out of the gate, we have a momentary pause to deal with. This is ok, as it is a one-time startup cost. But it is worth noting that modern cameras also typically (but not always!) provide timestamps related to the images they capture (as do audio devices).
As a result, it is possible to get data from a camera with a timestamp that is out of sync with the data from your microphone! This can be an unexpected source of lip-sync problems. Unfortunately, there are no excellent ways to detect it. The best mechanism we have found is to use pre-recorded audio and video, with known data embedded in each, and simulate the audio/video device. The downside is that this leaves out the actual physical device in question, which can lead to false positives.
The true solution to this problem (provided you are indeed getting timestamps from the underlying hardware) lies in ensuring that your users utilize both a microphone and camera from the same physical device, which are designed to work together and receive their timestamps from the exact same underlying source.
The only other solution is using timestamps from the OS. This can be viable, again assuming you replace both audio and video timestamps with a single source of time, and assuming the timestamps are applied as early as possible in both capture pipelines. It is a pretty rough solution and should only be used in a worst-case scenario, but it is at least better than the “offset slider of doom.”
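As a rough illustration of that fallback, the idea is simply to overwrite whatever the devices report with one monotonic clock, applied as close to the capture callback as possible. The frame type and stamping function below are hypothetical stand-ins for wherever your platform hands you raw frames:

```typescript
// Hypothetical capture frame type; in practice this is whatever your
// platform hands you in its raw audio and video callbacks.
interface CapturedFrame {
  kind: "audio" | "video";
  timestampMs: number; // device-provided, possibly from two different clocks
  data: ArrayBuffer;
}

// A single monotonic source of time shared by both capture paths.
const sharedClock = (): number => performance.now();

// Overwrite the device timestamp as early as possible after capture.
// This only helps if BOTH the audio and the video path go through the same
// stamping point; otherwise you just trade one skew for another.
function stampWithSharedClock(frame: CapturedFrame): CapturedFrame {
  return { ...frame, timestampMs: sharedClock() };
}
```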
Step 2: Receiving the image
This seems like a silly point to include. Why on earth would we care about the time it takes to “receive an image” locally in code? The problem here lies in scale. If we were dealing with a 320×240 image, no problem on any PC today. But with a 4k or even a 1080p image, it takes time to load up into a buffer. At 4k, that is (24/8) * 3840 * 2160 bytes or about 24.9MB per frame (don’t forget, we’re moving a frame every 33 milliseconds).
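To put numbers on it, here is the same arithmetic spelled out, assuming uncompressed 24-bit RGB frames:

```typescript
// Raw (uncompressed) frame sizes, assuming 24-bit RGB.
const bytesPerPixel = 24 / 8;                    // 3 bytes per pixel
const frame4k = bytesPerPixel * 3840 * 2160;     // 24,883,200 bytes ≈ 24.9 MB
const frame1080p = bytesPerPixel * 1920 * 1080;  // 6,220,800 bytes ≈ 6.2 MB

// At 30fps that is roughly 746 MB/s of raw 4k data moving through the
// pipeline before the encoder ever touches it.
const rawBytesPerSecond4k = frame4k * 30;

console.log({ frame4k, frame1080p, rawBytesPerSecond4k });
```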
Thus, when dealing with images at high resolution, you are better off working directly with pointers and leaving the images on the GPU if at all possible. As the frame rate and quality increase, frames can drop on the sender side here if the pipeline fails to keep up. We treat this as a sender-side dropped frame count (see the sketch after this list), and it indicates either:
1. The CPU or GPU is unable to keep up with processing the volume of data we are pumping through, or
2. The server is informing us that we have losses occurring on the network and need to send fewer frames.
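Here is a sketch of what that counter might look like; the FrameDropTracker below is illustrative, not the LiveSwitch implementation:

```typescript
// Illustrative sender-side dropped-frame tracking; not the LiveSwitch
// implementation. Count frames the camera delivered vs. frames the rest of
// the pipeline actually accepted, and report the gap periodically.
class FrameDropTracker {
  private captured = 0;
  private processed = 0;

  onFrameCaptured(): void {
    this.captured++;
  }

  onFrameProcessed(): void {
    this.processed++;
  }

  // Call this once per second (for example); it returns a snapshot and resets.
  report(): { captured: number; dropped: number } {
    const snapshot = {
      captured: this.captured,
      dropped: this.captured - this.processed,
    };
    this.captured = 0;
    this.processed = 0;
    return snapshot;
  }
}
```

Telling the two causes apart comes from correlating this counter with other signals: local CPU/GPU load on one hand, and bandwidth or loss feedback from the server on the other.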
The next step is scaling the image, which will be addressed in the next article in this series.