Web Real Time Communication (WebRTC)
Enabling human communication via voice and video (Real-Time Communication) was a major challenge for the web. WebRTC allows web browsers not only to request resources from backend servers, but also to exchange real-time information directly with other users' browsers. This enables applications such as video conferencing, file transfer, chat and desktop sharing without the need for internal or external plugins. Put simply, WebRTC enables peer-to-peer communication.
Historically, RTC has been corporate and complex, requiring expensive audio and video technologies to be licensed or developed in house. Integrating RTC technology with existing content, data and services has been difficult and time consuming, particularly on the web.
Gmail video chat became popular in 2008, and in 2011 Google introduced Hangouts, which used the Google Talk service (as did Gmail). Google bought GIPS, a company that had developed many components required for RTC, such as codecs and echo cancellation techniques. Google open sourced the technologies developed by GIPS and engaged with the relevant standards bodies at the IETF and W3C to ensure industry consensus. In May 2011, Ericsson built the first implementation of WebRTC.
Why do we need WebRTC?
- There has been no free, high-quality, complete solution that enables real-time communication in the browser; WebRTC provides one.
- Many web services already use RTC, but need downloads, native apps or plugins. These include Skype, Facebook (which uses Skype) and Google Hangouts (which uses the Google Talk plugin). Downloading, installing and updating plugins can be complex, error prone and annoying. WebRTC does not require any plugins.
- WebRTC is already integrated with best-of-breed voice and video engines that have been deployed on millions of endpoints over the last eight-plus years, and Google does not charge royalties for WebRTC.
WhatsApp, Facebook Messenger, appear.in and platforms such as TokBox now use WebRTC, and it is supported by Google Chrome, Firefox, Opera and Microsoft Edge.
WebRTC applications need to do several things:
- Get streaming audio, video or other data.
- Get network information such as IP addresses and ports, and exchange this with other WebRTC clients (known as peers) to enable connection, even through NATs and firewalls (a sketch of this candidate exchange follows this list).
- Coordinate signaling communication to report errors and initiate or close sessions.
- Exchange information about media and client capability, such as resolution and codecs.
- Communicate streaming audio, video or data.
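As a rough sketch of the second and third points: the IP address/port combinations discovered for a peer surface as ICE candidate events on an RTCPeerConnection (described below) and must be relayed to the other peer over whatever signaling channel the application chooses. The signalingChannel below is only an illustrative stand-in (here a WebSocket to a hypothetical server); WebRTC itself does not prescribe the signaling transport.

// Signaling transport chosen by the application; the URL is a placeholder.
const signalingChannel = new WebSocket('wss://example.com/signaling');

// A STUN server lets the browser discover its public address behind a NAT.
const peerConnection = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
});

// Each discovered IP/port combination (ICE candidate) is handed to the
// application, which forwards it to the remote peer via signaling.
peerConnection.onicecandidate = event => {
  if (event.candidate) {
    signalingChannel.send(JSON.stringify({ candidate: event.candidate }));
  }
};

// Candidates received from the remote peer are added to the connection.
signalingChannel.onmessage = message => {
  const data = JSON.parse(message.data);
  if (data.candidate) {
    peerConnection.addIceCandidate(data.candidate);
  }
};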
To communicate streaming data, WebRTC implements three APIs:
1. MediaStream (getUserMedia)
- Get access to data streams, such as from the user's camera and microphone.
- Available in Chrome, Firefox, Opera and Edge.
2. RTCPeerConnection
- Enables audio or video calling, with facilities for encryption and bandwidth management.
- Supported by Chrome (on desktop and on Android), Opera (on desktop and in the latest Android beta) and Firefox.
3. RTCDataChannel
- Enables peer-to-peer communication of generic data.
- Supported by Chrome, Opera and Firefox.
MediaStream (getUserMedia)
The getUserMedia() method prompts the user for permission to use a media input which produces a MediaStream with tracks containing the requested types of media. That stream can include a video track (produced by either a hardware or virtual video source such as a camera, video recording device, screen sharing service, and so forth), an audio track (similarly, produced by a physical or virtual audio source like a microphone, A/D converter, or the like), and possibly other track types.
It returns a Promise that resolves to a MediaStream object. If the user denies permission, or matching media is not available, then the promise is rejected with PermissionDeniedError or NotFoundError respectively.
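For illustration, a minimal video-only call to the promise-based navigator.mediaDevices.getUserMedia() might look like the sketch below; the #preview element id is an assumption for the example, not part of the API.

// Request a video-only stream; the browser prompts the user for camera access.
navigator.mediaDevices.getUserMedia({ video: true, audio: false })
  .then(stream => {
    console.log(stream.getAudioTracks()); // [] - no audio was requested
    console.log(stream.getVideoTracks()); // one MediaStreamTrack from the webcam
    // Show a local preview in a <video> element assumed to exist in the page
    // as <video id="preview" autoplay>.
    document.querySelector('#preview').srcObject = stream;
  })
  .catch(error => {
    // Rejected if the user denies permission or no matching device exists.
    console.error('getUserMedia failed:', error.name);
  });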
In the above example there is no audio, so stream.getAudioTracks() returns an empty array and stream.getVideoTracks() returns an array of one MediaStreamTrack representing the stream from the webcam. Each MediaStreamTrack has a kind ('video' or 'audio') and a label (something like 'FaceTime HD Camera (Built-in)'), and represents one or more channels of either audio or video. In this case, there is only one video track and no audio, but it is easy to imagine use cases where there are more: for example, a chat application that gets streams from the front camera, rear camera, microphone and a 'screenshared' application.
Each MediaStream has an input, which might be a MediaStream generated by navigator.mediaDevices.getUserMedia(), and an output, which might be passed to a video element or an RTCPeerConnection.
RTCPeerConnection
The RTCPeerConnection interface represents a WebRTC connection between the local computer and a remote peer. It provides methods to connect to a remote peer, maintain and monitor the connection, and close the connection once it's no longer needed.
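A sketch of the caller's side of that process follows. The startCall and handleAnswer helpers, the remoteVideo element and the signalingChannel are illustrative assumptions (the channel could be the WebSocket from the earlier sketch); the answering peer would run the mirror-image code, creating an answer instead of an offer.

// Caller side: attach local media, create an offer and send it via signaling.
async function startCall(peerConnection, signalingChannel, remoteVideo) {
  // Add local camera and microphone tracks so they can be sent to the peer.
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  stream.getTracks().forEach(track => peerConnection.addTrack(track, stream));

  // Render whatever media the remote peer sends back.
  peerConnection.ontrack = event => {
    remoteVideo.srcObject = event.streams[0];
  };

  // The offer describes this client's media capabilities (codecs, resolutions, ...).
  const offer = await peerConnection.createOffer();
  await peerConnection.setLocalDescription(offer);
  signalingChannel.send(JSON.stringify({ offer }));
}

// When the remote peer's answer arrives over signaling, apply it to complete the handshake.
async function handleAnswer(peerConnection, answer) {
  await peerConnection.setRemoteDescription(answer);
}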
RTCDataChannel
The RTCDataChannel interface represents a network channel which can be used for bidirectional peer-to-peer transfers of arbitrary data. Every data channel is associated with an RTCPeerConnection, and each peer connection can have up to a theoretical maximum of 65,534 data channels (the actual limit may vary from browser to browser).
To create a data channel and ask a remote peer to join you, call the RTCPeerConnection's createDataChannel() method. The peer being invited to exchange data receives a datachannel event (which has type RTCDataChannelEvent) to let it know the data channel has been added to the connection.
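A minimal sketch of both roles, assuming an RTCPeerConnection has already been set up on each side ('chat' is just an arbitrary channel label chosen for the example):

// Offering peer: create the channel before the offer/answer exchange.
const chatChannel = peerConnection.createDataChannel('chat');
chatChannel.onopen = () => chatChannel.send('hello from the offerer');
chatChannel.onmessage = event => console.log('received:', event.data);

// Answering peer: the new channel is announced via the datachannel event.
peerConnection.ondatachannel = event => {
  const channel = event.channel;
  channel.onopen = () => channel.send('hello back');
  channel.onmessage = e => console.log('received:', e.data);
};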
Security
There are several ways a real-time communication application or plugin might compromise security.
- Unencrypted media or data might be intercepted en route between browsers, or between a browser and a server.
- An application might record and distribute video or audio without the user knowing.
- Malware or viruses might be installed alongside an apparently innocuous plugin or application.
WebRTC has several features to avoid these problems:
- WebRTC implementations use secure protocols such as DTLS and SRTP.
- Encryption is mandatory for all WebRTC components, including signaling mechanisms.
- WebRTC is not a plugin: its components run in the browser sandbox rather than in a separate process, do not require separate installation, and are updated whenever the browser is updated.
- Camera and microphone access must be granted explicitly, and when the camera or microphone is running this is clearly shown by the user interface.