VoIP

TL;DR

An extensive journey through VoIP systems, with copious references. Time to read the whole article: roughly 1.5 hours.


Introduction

This article is intended as a comprehensive survey of the data transport phase of real-time voice communication over IP networks. We do not cover the signaling or call setup phases. Books on the real-time data traffic in VoIP systems are described in the following subsections. A number of books exist which cover a broader range of issues within both IP telephony and VoIP:


[@Uyless00:Voice; @Gilbert98:Voice; @Hers00:IP; @Marcus99:Voice].

Minoli focuses solely on voice delivery over IP networks in [@Mino98:Delivering], whilst Hardy looks at VoIP service quality in [@Hardy03:Voip_Service_Quality] and Raake covers "Speech Quality of VoIP: Assessment and Prediction" in [@raake06:speech]. See also [@01:_overv_pstn_compar_voice_over_ip; @Pacifici86:Issues; @Sinha:B98; @chapter:ld_celp_coding_mobile_radio; @chapter:speech_802.11_adhoc].


This post

The primary goal of a VoIP system is to transport human voice from microphone to loudspeaker/headset in a timely and quality-aware manner, in order to deliver an interactive conversation quality that is acceptable to most users. It should be stressed that there is no inherent problem in transporting human voice over any kind of packet switched network. The main difficulty arises when there is either a high packet loss rate on some element of the path or a high end-to-end delay. The first difficulty can arise when the network becomes congested, there is a routing problem, or one or more links has a high packet loss rate (due to the link itself, as opposed to other traffic on the link, which would be congestion).

In the case of congestion, both the network and the end-systems should take appropriate actions to deliver good quality voice; a number of schemes to address this are given in this paper. Since there is little end-to-end regulation on the Internet, admission control has to be done by the end-systems or intermediary nodes, which must estimate the current conditions. Note that many ISPs do not provide different priorities for different traffic classes.

In the case of routing problems, the packets may not be able to find a path from the source to the destination, and the existence of a path may change during the VoIP session (i.e., during a call). Other than using source routing, there is generally little that end users can do to address this problem beyond selecting one or more ISPs that offer a suitable service level agreement (SLA).

If a specific link is the source of high packet loss, and this is the first or the last link, the user might utilize another available link[^1] or increase the redundancy of their communication so that the receiver can use this redundancy to recover sufficient data to provide acceptable quality.


From mouth to ear (via the Internet)

Figure [fig:voice_journey] shows the typical end-to-end path and associated processing for the audio stream in a voice call. In this section we will assume that the signaling process has already occurred to set up the call. We examine the path of speech from the caller's mouth (at the sender) to the callee's ear (at the receiver) for a typical voice stream. A similar path is assumed for the audio from the callee's mouth to the caller's ear, with the corresponding roles of sender (callee) and receiver (caller).

As a result of the call signaling, or based upon a prior agreement between the caller and callee, a particular speech coding scheme is agreed upon. Unlike encoding and decoding with pulse code modulation over a circuit switched communication path, for a packet switched network the end systems collect audio samples for some period of time to generate a speech frame. This period of time defines the audio frame duration and is generally referred to as the packetization time. Each audio frame is typically 10-20 ms in duration, depending upon the choice of encoding. Each speech frame is encoded for transmission over a suitable transport protocol.

To put this packetization time into context, we can compare it to the durations of various speech phenomena. Mark D. Skowronski, in his lecture notes for his course in Automatic Speech Processing, gives these as: "10 $\mu$s: smallest difference detectable by auditory system (localization), 3 ms: shortest phoneme (plosive burst), 10 ms: glottal pulse period, 100 ms: average phoneme duration, 4 s: exhale period during speech." [@Skowronski2003]

Given these speech phenomena durations and a 10 ms packetization time, an average phoneme will be spread over roughly 10 speech frames, while the shortest phoneme could fit within a single speech frame. We will return to these times when we consider the effects of a missing frame at the receiver (see subsection [subsect:Packet_level_error_control]).

In the following subsections we will start with an audio signal emitted from the speaker’s mouth and captured by a microphone, then follow the audio stream until it is emitted by a speaker as audio destined for the listener’s ear. We will begin by assuming that the audio content that is to be communicated is the result of a human speaker and that the audio is to be reproduced for a human listener. Later we will examine what happens when we relax this assumption.


Input devices

Commonplace these days is the use of USB headsets, where the microphone is in front of the speaker’s mouth and the user wears headphones over their ears. The use of such headsets eliminates analogue feedback from the headphones to the microphone, avoiding the need to do acoustic echo cancellation. If a speaker phone or a traditional handset is used, it will be necessary to perform acoustic echo cancellation — this is most easily done in the subsequent digital signal processing (hence we will ignore this issue for now).


Analogue to digital conversion

The human speaker generates analogue sound pressure waves. These pressure waves are detected by a microphone, which produces an electrical signal that is sampled at some sampling rate; these samples are digitized by an analog to digital converter (ADC). The sampling process converts a continuous analogue signal into discrete analogue samples. Each sample is quantized and converted to a digital value. The quantization can be linear or logarithmic [^2]. The number of bits used to represent the sample as a digital value is referred to as the resolution of the ADC. As a result of the sampling, quantization, and conversion to digital format, the voice signal is now a stream of digital values, with a new value produced at the selected sampling rate.
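To make the logarithmic option concrete, here is a minimal Python sketch of $\mu$-law companding (the compression curve used by G.711 in North America and Japan). The formula and the constant $\mu = 255$ are standard, but the uniform 8-bit quantizer shown is a simplification of the actual G.711 code mapping:

```python
import math

MU = 255.0  # mu-law parameter used by G.711


def mu_law_compress(x: float) -> float:
    """Map a linear sample x in [-1, 1] to a companded value in [-1, 1]."""
    sign = 1.0 if x >= 0 else -1.0
    return sign * math.log(1.0 + MU * abs(x)) / math.log(1.0 + MU)


def quantize(y: float, bits: int = 8) -> int:
    """Uniformly quantize the companded value to a signed integer code."""
    levels = 2 ** (bits - 1) - 1
    return round(y * levels)


# Quiet and loud samples end up closer together in code space,
# giving finer resolution to low-amplitude speech.
for x in (0.01, 0.1, 1.0):
    print(x, quantize(mu_law_compress(x)))
```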


Speech coding

Given the stream of digitized audio samples, a local application encodes the voice and packetizes it for transmission. It is important to understand that in the case of voice over IP the result of the encoding process is an encoded speech frame that will be encapsulated in a packet, not a continuous stream of digital values sent over a circuit switched connection to the receiver. The coder/decoder is generally referred to as a CODEC. The input to the speech coding process is speech samples and the output is coded speech frames.

The objective in speech coding is to encode the speech in such a way that the audio information can be faithfully reproduced at the receiver. Traditionally, a secondary objective of this speech coding process was to reduce the amount of data that needs to be communicated to the receiver, i.e., to compress the data. Simple encoding schemes process each sample separately, while more complex schemes exploit properties of the signal, and even more complex schemes exploit knowledge of speech phenomena (how sounds are produced by humans) and the perception of the decoded signal. Simple coding schemes such as ITU's G.711 utilize A-law or $\mu$-law logarithmic conversion at an 8 kHz sampling rate to produce a 64 kbit/s data stream that is simply packetized with a fixed number of samples in each packet. GSM 06.10 (ETS 300 961) encoding utilizes regular pulse excitation - long term prediction (RPE-LTP) coding to reduce the data rate to 13 kbit/s. This data stream is again packetized into fixed sized packets and transmitted.
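The payload these rates imply per packet follows from a one-line calculation; the sketch below simply evaluates bitrate times packetization time for the two codecs just mentioned:

```python
def payload_octets(bitrate_bps: int, packetization_ms: int) -> float:
    """Octets of encoded speech produced per packetization interval."""
    return bitrate_bps * (packetization_ms / 1000.0) / 8.0


# G.711 at 64 kbit/s and GSM 06.10 at 13 kbit/s, 20 ms packetization:
print(payload_octets(64_000, 20))  # 160.0 octets per frame
print(payload_octets(13_000, 20))  # 32.5 -> GSM emits 260-bit frames,
                                   #        padded to 33 octets in practice
```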

Today the trend is to exploit higher data rates (if they are available) to increase audio quality with the use of CODECs such as Extended Adaptive Multi-Rate Wideband (AMR-WB+) Audio Codec [@IETF:RFC4352] and Variable-Rate Multimode Wideband (VMR-WB) Extension Audio Codec[@IETF:RFC4348; @IETF:RFC4424]. A motivation for these CODECs is the encoding and decoding of both speech and audio – as increasingly a communication session includes audio that is not simply speech. For example, AMR-WB+ supports both monaural (or monophonic) and stereo audio. The resulting monaural audio data rates range from 5.2-36 kbps, while stereo audio results in data rates ranging from 6.2-48 kbps.

One of the authors (Maguire) believes that a future trend will be toward personalized speech synthesizers, so that the quality of the perceived speech is independent of the data rate of the communication path between the sender and receiver. Such a scheme will be possible because of the increased storage available in end devices and their increasingly personal nature. The latter means that 'my' phone will have a very good model of my speech, hence it can construct a very good speech synthesizer for 'my' speech. The only information that will need to be transmitted between the sender and receiver will be the marked-up output of the recognizer, used as input to the synthesizer.


A step in this direction is Google's Lyra codec, which targets high-quality voice calls at bitrates as low as 3 kbps: https://www.cnx-software.com/2021/02/28/lyra-audio-codec-enables-3-kbps-bitrate-for-high-quality-voice-calls/

Packet level error control

Because the speech frames are encapsulated in packets (generally using the Real Time Protocol (RTP)) and generally transmitted to the receiver using a transport protocol that does not provide for retransmission of missing packets, the receiver must be able to recognize when there are missing packets (and hence potentially missing speech frames) and deal with this loss in some way. This raises four questions:

  1. Why not use a transport protocol that provides for retransmission of missing packets?
  2. How does the receiver know that a packet is missing?
  3. What can the receiver do about missing packets?
  4. What can the sender do to assist the receiver, i.e., to increase the probability that the receiver will get sufficient information to re-generate the speech input?

The short answers to these questions are:

  1. Why not use a transport protocol that provides for retransmission of missing packets? Answer: The delay to request a retransmission and get a new copy of the missing packet is equal to the round trip delay between the receiver and the sender, plus some additional processing time. This added delay is likely to increase the total delay beyond the desired delay bound. Details will be described in section [subsect:artifact1_delay].
  2. How does the receiver know that a packet is missing? Answer: The receiver can recognize that a packet is missing when there is a gap in the sequence numbers of the received packets.
  3. What can the receiver do about missing packets? Answer: The receiver can attempt to estimate what was in the missing packet, replay the previous packet, stretch the content it has (for example, by time-warping), play comfort noise, or play silence. Details will be addressed in section [subsect:frame_level_error_control_PLC].
  4. What can the sender do to assist the receiver, i.e., to increase the probability that the receiver will get sufficient information to re-generate the speech input? Answer: The sender can add redundancy, so that the probability that the receiver lacks the information to determine what to play is decreased. Extra data frames may be generated from the samples. Lost data is not uncommon in IP networks, so by producing copies, the probability of some number arriving at the destination is increased. This is particularly the case where voice is transmitted over wireless networks. Two common schemes are Forward Error Correction (FEC) and Multiple Description Coding (MDC); a toy FEC example is sketched after this list. Note this step is optional, hence it is shown as a dashed box in Figure [fig:voice_journey]. We will return to these techniques in section [subsect:forward_error_correction].
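As a toy illustration of media-independent FEC, the following sketch protects a group of equal-length "speech frames" with a single XOR parity frame, letting the receiver rebuild any one lost frame in the group. The frame sizes and group length are illustrative; real schemes (e.g. Reed-Solomon codes) tolerate more losses per group:

```python
from functools import reduce


def xor_parity(frames: list[bytes]) -> bytes:
    """XOR equal-length frames together to form one parity frame."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), frames)


def recover(received: dict[int, bytes], parity: bytes, group: int) -> bytes:
    """Reconstruct the single missing frame of a group from the parity."""
    assert len(received) == group - 1, "XOR parity repairs at most one loss"
    return xor_parity(list(received.values()) + [parity])


frames = [bytes([i] * 4) for i in range(3)]   # three 4-octet "speech frames"
parity = xor_parity(frames)
lost = 1                                      # pretend frame 1 was dropped
got = {i: f for i, f in enumerate(frames) if i != lost}
assert recover(got, parity, group=3) == frames[lost]
```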

Voice Activity Detection (VAD) and silence suppression

Many applications implement Voice Activity Detection (VAD), so that if the volume of the audio is below some threshold, the application can perform silence suppression to avoid transmitting any content while the speaker is silent. Silence suppression avoids generating unnecessary traffic, but causes some problems at the receiver, as the receiver must:

(1) Decide what audio, if any, should be played where there is silence, and
(2) Cope with the fact that the arrival of voice frames is no longer quasi-periodic.

Some applications, such as Skype, perform voice activity detection but do not employ silence suppression; instead they simply send packets with empty payloads when a person is not speaking. The reasons for this are (1) to transport comfort noise when one party is not speaking, addressing the first issue above, as this comfort noise can be played by the receiver; (2) to maintain the UDP bindings at NATs along the path of the RTP packets, so that these packets can continue to pass through the NAT; and (3) when TCP is used to transport the voice, continuing to send TCP segments maintains TCP's congestion window. The last two reasons are advantages of maintaining the quasi-periodic flow of packets.
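A crude energy-threshold VAD can be sketched in a few lines. The fixed threshold below is purely illustrative; production VADs adapt the threshold to the measured noise floor and add hangover time so that trailing low-energy phonemes are not clipped:

```python
def frame_energy(samples: list[int]) -> float:
    """Mean squared amplitude of one frame of linear PCM samples."""
    return sum(s * s for s in samples) / len(samples)


def is_speech(samples: list[int], threshold: float = 1.0e6) -> bool:
    """Crude VAD: declare speech when frame energy exceeds a fixed threshold."""
    return frame_energy(samples) > threshold


# 160 samples = one 20 ms frame at 8 kHz; the quiet frame is suppressed.
silence = [10] * 160
loud = [5000] * 160
print(is_speech(silence), is_speech(loud))  # False True
```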

Synchronization information

Timing information is needed for the voice samples to be faithfully replayed for the listener on the receiving system. For this to take place, either the underlying communications system needs to be synchronous (as is the case in traditional circuit switched telephony) or the timing of the original speech frames needs to be recorded and the receiver has to play out the speech frames with the same relative timing as when they were recorded.

Since the speech frames are encoded into a block of data before transmission, it is possible to use a single timestamp for this whole block to enable the receiver to estimate the proper relative timing for playout of the encoded speech frame. This timestamp is derived from a local media clock (for audio, typically incrementing at the sampling rate). Details of this timestamp and how it is transmitted along with the encoded voice frame are given in section [subsect:rtp].

Sequencing information

In order to deal with the fact that some of the IP packets may be lost along the way, it is useful to add a sequence number to each packet containing encoded speech data. This sequence number can be used to easily resequence packets if they arrive out of order, to detect missing packets, and, together with the timestamp, to determine that silence suppression has been done by the sender. Details of this sequence number and how it is transmitted along with the encoded voice frame are given in section [subsect:rtp]; a minimal sketch follows.
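As a preview of that section, the fixed 12-octet RTP header of RFC 3550 carries exactly the sequence number and timestamp just discussed. A minimal packing sketch (payload type 0 is PCMU, i.e. G.711 $\mu$-law; the SSRC value is illustrative):

```python
import struct


def rtp_header(seq: int, timestamp: int, ssrc: int,
               payload_type: int = 0) -> bytes:
    """Pack the fixed 12-octet RTP header (RFC 3550 field layout).

    Byte 0: version=2, no padding, no extension, no CSRCs.
    Byte 1: marker=0 plus the 7-bit payload type.
    """
    vpxcc = 2 << 6
    mpt = payload_type & 0x7F
    return struct.pack("!BBHII", vpxcc, mpt, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


# Consecutive 20 ms G.711 frames: seq advances by 1, timestamp by 160 samples.
hdr = rtp_header(seq=42, timestamp=42 * 160, ssrc=0x1234ABCD)
assert len(hdr) == 12
```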

Readying the operating system

In this section we discuss the operation of the terminals themselves. Exactly how the voice is handled is somewhat dependent on the input devices and types of terminal in use. For devices that have an operating system, i.e. system software running in privileged mode, it is usually the responsibility of the operating system to handle the voice data (samples in this case). Device drivers are the system software modules that interface with the devices and manage the transfer of data when a device is reading or writing. The device driver manages buffers both on the device and in the system, ensuring that underflows and overflows do not occur.

This makes the operation easy for the application: in the case of recording voice, the application instructs the operating system to record from a particular device, and notifications are handled by the system. However, since the data must be copied between the device and the application, this could involve unnecessary interrupts to the operating system for every block of samples. Therefore some systems use a mechanism called Direct Memory Access (DMA), in which some setup time is sacrificed to provide faster and more efficient transfers: once a DMA transfer has been established, data can move without the intervention of the CPU. At this stage the operating system is ready to receive the block of coded samples (plus optional redundant data) with the RTP header completed. We have omitted some of the other field descriptions for conciseness.

Addressing information

There are two types of addresses that are important for the media stream in VoIP. The first is the IP address of the interface to which packets containing speech frames should be sent. The second is the combination of the protocol and the port number; this combination is used to deliver these frames to the correct application.

The IP header includes a checksum to enable the receiver to determine if there has been an error in the header, enabling the receiver to check that the data has been received on the proper interface. Additionally, the transport header has a checksum to ensure that data is delivered to the correct instance of a running application and that the data has not been altered during its transmission across the network.

In the case of UDP as a transport protocol, there has been some work to enable the delivery of UDP datagrams that may have an error in some of the user data. This enables datagrams to be delivered to applications even though some of the user data may be damaged; otherwise the datagram would be discarded, even though the application might be able to make some use of the data that is actually delivered. This protocol is referred to as "The Lightweight User Datagram Protocol (UDP-Lite)"; for further details see [@IETF:RFC3828].

Another transport protocol that might be used is the Datagram Congestion Control Protocol (DCCP) – which provides bidirectional unicast of congestion-controlled unreliable datagrams. DCCP has its own sequence numbering scheme and DCCP also provides feedback about loss rate to the sender, via DCCP acknowledgment options. For further details about DCCP and its interaction with RTP see section 17 of RFC 4340[@IETF:RFC4340].

Packet construction

The format of a VoIP packet is as shown in Figure [fig:packet_format] (not to scale). The minimum IPv4 header is 20 octets long (while an IPv6 header is 40 octets long), the UDP header is 8 octets long, and the RTP header is 12 octets, for a total of 40 (or 60) octets of header information. Optional headers can further swell the header size. Voice payloads tend to be on the order of 160 octets (for 20 ms of speech encoded with G.711). Additionally there may be link layer headers; for example, the Ethernet header is 14 octets long while a WiFi header is 58 octets. There may also be additional media access control layer and link layer overhead, such as minimum interframe intervals in the case of shared media. Sending encoded voice traffic in IP packets is rather inefficient (due to the large overhead for a small amount of user payload), unless some form of header compression or multiplexing with other traffic is used. Despite this, if silence suppression is done there is a very large gain in efficiency for a conversation, as network traffic is only generated when there is information to convey; for a conversation this may eliminate 35% of the total traffic that a circuit-switched call would have generated [^3].

[fig:packet_format] (packet format figure): IPv4/v6 header (20/40 octets) | UDP header (8 octets) | RTP header (12 octets) | CODEC output (160 octets). Headers total 40/60 octets; payload 160 octets.
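The efficiency implied by these numbers is easy to check; the sketch below computes the payload fraction for the IPv4 and IPv6 cases (link layer headers excluded, which would lower the figures further):

```python
def voip_efficiency(payload: int, ip_hdr: int = 20, udp_hdr: int = 8,
                    rtp_hdr: int = 12) -> float:
    """Fraction of each packet that is actual speech payload."""
    return payload / (payload + ip_hdr + udp_hdr + rtp_hdr)


print(f"IPv4: {voip_efficiency(160):.0%}")             # 160/200 -> 80%
print(f"IPv6: {voip_efficiency(160, ip_hdr=40):.0%}")  # 160/220 -> 73%
```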

Networking

Routing

For the routing of IP voice data, the normal IP routing mechanisms apply. Path information from a company, home, or university network is provided to the backbone using interior link-state routing protocols such as IS-IS or OSPF. In the backbone network, routes are determined by peering agreements and the inter-domain routing protocol BGP. In an MPLS network, voice traffic may be given its own label switched path if the particular operator has sufficient traffic for this to be worthwhile.

Header compression

A balance between the payload and header overhead is desirable: not to transport too little voice data relative to the protocol headers, and not to transport too much payload in case the entire packet is lost. Therefore, header compression schemes for IP telephony have been developed; these are particularly desirable when wireless links are being used. The basic idea is to exploit the redundancy between successive headers: often only a few fields change, by small amounts, and these differences can be coded rather than the whole header. State is important in header compression, and the compressor and decompressor must stay synchronized for this method to work. Usually a full header is sent after a period of time to ensure the sender and receiver are fully synchronized.
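The following sketch illustrates only the core idea, not any particular standard such as cRTP or ROHC: send periodic full headers, and otherwise send only first-order differences of the fields, which for a steady voice flow are small constants:

```python
def compress(headers, full_every=50):
    """Emit ('FULL', seq, ts) periodically, otherwise ('DELTA', dseq, dts).

    Compressor and decompressor must stay synchronized on the last
    context; the periodic full header lets the decompressor resynchronize
    after loss.
    """
    prev = None
    for i, (seq, ts) in enumerate(headers):
        if prev is None or i % full_every == 0:
            yield ("FULL", seq, ts)
        else:
            yield ("DELTA", seq - prev[0], ts - prev[1])
        prev = (seq, ts)


# A steady 20 ms G.711 flow: the deltas are always (1, 160) and code compactly.
flow = [(n, n * 160) for n in range(5)]
print(list(compress(flow, full_every=4)))
```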

Framing

The packets are now passed to the network access processing stage. Further addressing information is needed to identify the recipient of the packet on the local network, i.e. the other end of the link. Therefore extra local network information is added, commonly referred to as the link address. In the examples we will use WiFi Medium Access Control and Ethernet link addresses. The link addresses are used to identify the recipient, confirm the sender is who they claim to be and, in the case of WiFi, send back an acknowledgment.

Link access

We will consider two technologies for network access: Ethernet and 802.11. Access differs slightly between these technologies, so they are given separate sections; in this paragraph we consider only their common functionality. Both must listen to the medium before a transmission in order to avoid collisions, since a collision results in the bits on the medium (and at the receiver) being garbled. Listening before transmitting is known as Carrier Sense Multiple Access (CSMA). All the stations must listen to the "carrier", and potentially many can attempt an access at any one time. Both Ethernet and WiFi have a waiting scheme if the channel is busy, randomized in order to inhibit transmissions colliding. Subsequent transmission failures result in further doubling of this (still randomized) waiting time, hence the scheme is known as exponential backoff; a sketch follows. Otherwise, assuming the medium is free, the frame is transmitted.
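The backoff rule itself is simple; this sketch uses classic Ethernet-style truncated binary exponential backoff in units of slot times (WiFi's contention window follows the same doubling idea with different constants):

```python
import random


def backoff_slots(attempt: int, max_exp: int = 10) -> int:
    """Pick a random wait, in slot times, after the given failed attempt.

    The window doubles with each failure (truncated at 2**max_exp slots),
    hence the name (truncated binary) exponential backoff.
    """
    window = 2 ** min(attempt, max_exp)
    return random.randrange(window)


for attempt in range(1, 5):
    print(attempt, backoff_slots(attempt))
```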

Ethernet

Ethernet uses CSMA/CD, where the CD stands for collision detection, which means it can detect if the frame has been mangled by simultaneous transmissions. This is done by the sending station comparing the frame "on the wire" with a locally kept copy; where they differ, the sender can deduce that a collision occurred. If this is the case, a retransmission is scheduled after the mandatory randomized waiting time. No explicit acknowledgment by the recipient is needed when a frame is correctly received. This functionality is somewhat rooted in the past, when Ethernet really was a shared medium (hence the name). In recent years, however, switched Ethernet has become much more commonplace, where each machine has a direct connection to an Ethernet switch. This allows for much higher throughput and no collisions, since the wires are not shared. Ethernet speeds have increased from 10 to 100 to 1000 to 10,000 million bits per second. Once the frame has been carried across the Ethernet it is in the memory of the Ethernet switch, ready to be moved to the next element; the Ethernet link layer header will have been removed.

IEEE 802.11

IEEE 802.11 uses a variant of CSMA called CSMA/CA, where the CA signifies collision avoidance. WiFi terminals cannot detect during transmission whether a frame was garbled by a collision. One of the reasons is that it is impossible for the sending station to transmit and listen simultaneously, as in the Ethernet case; this is an artifact of radio transmission. Therefore, in the WiFi case, positive acknowledgments are used to indicate whether frames were received by the receiving station. If this acknowledgment frame is not received, the sender assumes the frame was lost and retransmits it. This is repeated a number of times, typically 4-7; if all retransmission attempts fail, the frame is discarded and it is then the responsibility of the higher protocol layers to take further action. For real-time traffic such as Internet telephony, no further action is taken: UDP was not designed for reliable data transfer and hence, unlike TCP, has no retransmission capabilities. It is also possible that a frame is received but contains errors, i.e. bits that have been changed by interference; to detect this, a link layer checksum is computed and verified.

IEEE 802.11 access points

WiFi usually operates in infrastructure mode, which means an access point acts as the bridge between the wireless and wired networks. Access points have the same rights as the other stations when it comes to network transmissions; that is, by default, they have no priority access to the wireless network. Often, however, they are better at receiving weak signals than their wireless clients, because they have better reception electronics and antennae. They also have internal queuing mechanisms, so when the wireless network is busy and they have frames to deliver downstream, these can be temporarily stored within the unit. Modern access points also have the ability to differentiate types of traffic, voice and data for example, and certain traffic types can be prioritized in the queuing process. IEEE 802.11 access points connect to the switched Ethernet just as other computers in the local network do. Once the frame has been carried across the wireless network it is in the memory of the access point, ready to be moved to the next element; the physical and link layer headers are removed.

Additionally, unlicensed-spectrum technologies are more prone to disturbances and loss of frames [@byoung-jo99:_at_t_labs]. However, 802.11 provides link layer retransmission, which can alleviate frame loss on wireless access links to some degree at the expense of a little delay. Other sources of problems for IP-based voice are heavy traffic loads on shared links, poorly dimensioned links, long-delay link technologies (e.g. satellite links) and misconfigured equipment.

IP gateways and onwards

Once the Ethernet or IEEE 802.11 frame has been received by the switch, its link layer header is removed and it is then the responsibility of the IP infrastructure to move the packet towards the receiver. The first major switching point is often the IP gateway. The gateway is often seen as the border between a cluster of computers in an organization and the Internet; it is usually not possible to administer machines on the outside of the IP gateway. An administrative organization such as a university or ISP will connect to other entities via an Internet eXchange (or "IX"). Dual circular fiber rings running at tens of gigabits per second are common; these tend to be geographically located, one per city for example. The packet must pass the gateway to find its route over the Internet.

Core networking

One of the technology remnants from ATM is layer 2 switching: Multi Protocol Label Switching (MPLS) is a carrier technology for IP packets. Basically, MPLS switches on labels that are added to IP packets at the ingress of an MPLS network. IP packets that belong to a call are all labeled identically and switched over a dedicated path; therefore link dimensioning for IP telephony becomes much simpler using MPLS. In an MPLS network the label switch routers sit at the center of the network and the label edge routers at its extreme points. Even so, it is possible that some voice packets will be lost, due to congestion in the routers, discarding algorithms such as RED, or link problems.

Tiering and peering

A tier 1 network is an IP network which connects to the entire Internet via settlement-free interconnection, also known as peering. These networks are also known as transit-free, because they do not receive a full transit table from any other network. A tier 2 network is one that peers with some networks but purchases IP transit to reach at least some portion of the Internet. Finally, a tier 3 network is one which solely purchases transit from other networks to reach the Internet.

Technically speaking, peering is the voluntary interconnection of administratively separate Internet networks for exchanging traffic between the networks' customers. Settlement-free means that neither party pays the other for the traffic; each instead derives income from its own customers. Peering requires physical interconnection of the networks and an exchange of routing information through the Border Gateway Protocol (BGP). Peering agreements can vary from simple informal arrangements to extensive contracts.

Once the packet leaves a local ISP or company, it is usually the responsibility of these larger networks to carry the voice data. Within these networks, both the degree of multiplexing and the capacity increase.

Routing principles

IP packets are transported across the Internet based on two pieces of information: the address in the IP header of the packet, and tables of information contained in network routers, maintained by a separate set of Internet protocols. In most cases the information within the protocol header is not changed; however, the information within the routers' tables may change. Routers are often connected to more than one link. For a particular packet, the router decides which output link should be selected in order for it to be best forwarded towards its destination. This is achieved by matching the address in the IP header of the packet against the router's table and choosing the entry with the longest match. Each entry in the table has an associated link, and the packet is output on that link. Where there is no match, there is a default route and the packet is sent there; in simple routers a default and a backup route is all that is needed. This process is repeated until the final router is reached, and it is then that router's responsibility to deliver the packet to the destination computer or terminal.

Network Address Translators (NATs) and private addresses

In order to support the large number of Internet hosts, the idea of private IP addresses was introduced: not every host needs a globally unique IP address. The idea is that a host can use a non-global address inside the local network; communication with the outside world will look as though the data all came from the same host. A network address translator (NAT) is a computer that converts the local addresses to its global address and vice versa, essentially holding a mapping in its memory. Internet telephony packets have to pass through this process if the originating computer has a private address (typically 192.168.X.X or 10.X.X.X).

Long distance links (network group)

We have discussed the local link technologies; however, there are also technologies for carrying frames over longer distances. A particular link technology sits on top of a physical technology. For long distances this tends to be optical fiber, since light in fiber suffers far less attenuation than electrical signals in copper. Fiber optics have also become relatively cheap with the availability of semiconductor lasers and erbium-doped fiber amplifiers. The optical fiber used in undersea cables is chosen for its exceptional clarity, permitting runs of more than 100 kilometers between repeaters to minimize the number of amplifiers and the distortion they cause. The first transatlantic telephone cable to use optical fiber was TAT-8, which went into operation in 1988.

There are two dominant optical standards, SDH and SONET. Both are widely used today: SONET in the U.S. and Canada and SDH in the rest of the world. Both can be used to encapsulate earlier digital transmission standards, such as the PDH standard, or used directly to support either ATM or so-called Packet over SONET/SDH (POS) networking. SDH and SONET are not protocols per se but carriers for voice and data traffic. Developments in using lasers of different frequencies (or colors) at the same time, known as wavelength-division multiplexing (WDM), have brought about large bandwidth increases in recent years.

ATM

Asynchronous Transfer Mode (ATM, sometimes referred to as B-ISDN) is a wide-area switching technique that uses asynchronous time division multiplexing. It encodes data and voice into small fixed-size cells and provides data link layer services, running over layer 1 (physical) links in the ISO layer model. By using a small cell size (53 bytes) the design suits both data and real-time voice traffic. ATM uses a connection-oriented model and establishes a virtual circuit between endpoints. One of the reasons for using a small cell size is to reduce jitter in the multiplexing of data streams. Low jitter is important when carrying voice traffic: evenly spaced data streams do not need sophisticated playout schemes, nor do they suffer loss (silence or distortion) in the voice stream. A discussion of the role of ATM in telephony and IP networks is available in [@Mainwaring:A00].

Other works on queuing analysis, jitter guarantees and dimensioning buffers in ATM networks include [@mammeri99:delay; @Land9702:Multiplexing; @Biersack:C92; @Blef9808:Dimensioning; @He0003:Queueing; @Yang95:novel; @Zahirazami0502:Channel; @Walke:A01].

Importantly, during the 1990s IP and ATM were competing technologies, with ATM keeping voice foremost in its multiservice solution. The ATM Forum proposed five different circuit emulation services, depending on the capacities required. Although both IP and ATM were technically viable for both voice and data, the flexible data transport structure of IP, the development of the HTTP protocol, and lower hardware costs effectively sealed ATM's fate in favor of IP.

[@mammeri99:delay] Mammeri, "Delay Jitter Guarantee for Real-Time Communications with ATM Network"

[@Land9702:Multiplexing] Landry and Stavrakakis, "Multiplexing ATM Traffic Streams with Time-Scale-Dependent Arrival Processes"

[@Biersack:C92] Biersack, "Performance Evaluation of Forward Error Correction in an ATM Environment"

[@Blef9808:Dimensioning] Blefari-Melazzi et al., "Dimensioning of Playout Buffers for Real-Time Services in a B-ISDN"

[@He0003:Queueing] He and Sohraby, "On the Queueing Analysis of Dispersed Periodic Messages"

[@Yamamoto0997:Impact] Yamamoto and Beerends, "Impact of Network Performance Parameters on the End-to-End Perceived Speech Quality"

[@Yang95:novel] Yang and Tsang, "A Novel Approach to Estimating the Cell Loss Probability in an ATM Multiplexer Loaded with Homogeneous On-Off Sources"

[@Zahirazami0502:Channel] Zahirazami et al., "Channel Loss and Queuing Loss Tradeoffs in Voice Transmission over ATM Switching Systems"

[@Walke:A01] Walke et al., "IP over Wireless Mobile ATM - Guaranteed Wireless QoS by HiperLAN/2"

Receiver operation

The role of the receiver in an Internet telephony system is, at a relatively high level of abstraction, to recreate the original spoken voice stream as accurately as possible, and with the highest possible quality, for the listener. This is the responsibility of the telephony application, which in turn interfaces with the operating system, networking and sound software & hardware.

Frame reception

Assuming a frame for our voice application is in the memory of either the Ethernet switch or the access point, the following takes place. Once the switch or access point has sent the frame to us, a hardware triggered interrupt alerts the network software driver that it should respond. Usually the network card's driver is in listen mode unless it is actually transmitting. The interrupt routine readies the operating system to receive the frame by allocating memory to hold it and placing locks on the shared resources. Once the frame is received, the link layer checksum is calculated and, if correct, the frame is delivered to the next layer of processing (IP). If the packet was received by an IEEE 802.11 card, an acknowledgement is sent back to the access point. The IEEE 802.11 and Ethernet drivers can then delete their local copy of the frame.

IP processing

Simplistically, the destination IP address is used to route each packet toward the target terminal. The link layer device drivers deliver the frame to the IP layer. There are two versions of the IP protocol, versions 4 and 6; depending on the format of the received frame, the device driver selects the software for the appropriate IP version. Most network drivers (and operating systems) now support both versions, and the networking software is nearly always part of the operating system. The IP code has to check that the packet is indeed destined for this computer. It then verifies the checksum to ensure the header has not been changed since the sender passed the packet to the link layer at their end; note that the IP checksum covers only the header, not the contents. If the checksum fails, the data is discarded and no further action is taken. The IP header is then removed, and the remaining data, called a datagram, is delivered to the next layer, which here is UDP.

UDP processing

The next step is for the UDP software to process the packet. UDP also has a checksum, but this one is applied to the whole datagram. If the checksum fails then the datagram should be discarded; in the Internet telephony case there is the possibility of retaining the packet, as many of the bits may still be useful to the application, and the UDP-Lite variant mentioned earlier was proposed for exactly this purpose. Recall also that the job of the UDP layer is to demultiplex the data to the right program: it is almost certain that more than one program is running on the host, and the correct one must receive the datagram's data. The UDP port field is used to demultiplex the data at the receiver to the correct application; before the data is finally delivered, the UDP header is removed.
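In socket API terms, this demultiplexing is what happens when the application binds the UDP port agreed upon during call setup; the OS then delivers matching datagrams to that socket. A minimal receive sketch (the port number is illustrative):

```python
import socket

RTP_PORT = 49170  # illustrative only; the real port is agreed during call setup

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", RTP_PORT))  # the OS now demultiplexes to us by port

datagram, sender = sock.recvfrom(2048)  # blocks; UDP header already stripped
# 'datagram' now holds the RTP header plus the encoded speech frame,
# ready for RTP processing and the jitter buffer.
```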

Receiver packet level error control

At this point, if either FEC or MDC was applied, lost packets can be reconstructed. In media-dependent FEC, if one packet was lost, the lower bitrate version carried in the next packet can be extracted and used instead. In media-independent FEC, the combination of received data and redundancy frames can be used to recreate the original flow; as stated, whether this is possible depends on the losses and the amount of redundancy applied. MDC needs at least one description of the data to start reconstructing the original; more descriptions produce better quality and lower distortion. After this redundancy phase, more packets are available for further processing.

Dejittering

Packets transmitted over the Internet may arrive at the receiver with inter-packet spacings that differ from those with which they were sent. This is a problem for the listener, who would otherwise hear the speech replayed in a time distorted manner. It can be a problem for the speech decoder too, as some decoders have time sensitive state that needs to be maintained. To counter this problem, a solution is to temporarily store the packets in a holding buffer. The buffer can be used to absorb the time differences and, using the RTP timestamp, recreate the original timing. The buffer can either be of fixed size or be allowed to change over time. In the fixed case, a sensible size needs to be chosen before the session starts. In the variable case, the application makes a tradeoff between additional loss and additional delay: loss results from too short a buffer, and delay results from too conservative a buffer length.

Temporarily holding the frames in a buffer is necessary not only for correctly timed playout, but also for techniques such as FEC/MDC decoding or finding frames for packet loss concealment.
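Adaptive buffers need a measure of the delay variation to size against; RFC 3550 defines an exponentially smoothed interarrival jitter estimate, sketched below with hypothetical arrival times expressed in RTP timestamp ticks:

```python
def update_jitter(jitter: float, transit: int, prev_transit: int) -> float:
    """One step of the RFC 3550 interarrival jitter estimator (gain 1/16).

    transit = arrival_time - rtp_timestamp for a packet; any fixed clock
    offset between sender and receiver cancels in the transit difference.
    """
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0


# (arrival, rtp_timestamp) pairs, both in 8 kHz ticks (20 ms = 160 ticks).
packets = [(0, 0), (168, 160), (312, 320), (520, 480)]
jitter, prev_transit = 0.0, None
for arrival, ts in packets:
    transit = arrival - ts
    if prev_transit is not None:
        jitter = update_jitter(jitter, transit, prev_transit)
    prev_transit = transit
print(jitter)  # ~4.4 ticks; a buffer might be sized at a multiple of this
```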

RTP processing

The synchronization source identifier (SSRC) field is used to locate the correct RTP flow within a session.

Frame level error control: packet loss concealment (PLC)

Packet loss concealment (PLC) is the name associated with the processing done at a receiver to counter lost IP datagrams (packets). This term is slightly misleading, as the actual signal processing is performed on the frames of speech data that were contained in the payloads of the UDP datagrams. Packet loss concealment is typically a receiver-based technique; however, there have been suggestions to perform the processing within the network [@le1000:Active]. PLC computes fill-in samples for the gaps in the stream caused by lost packets; it has been shown that playing something is almost always perceived as better than an audible gap. Broadly, techniques can be categorised as insertion, interpolation and regeneration. Insertion is where samples are inserted into the gaps left by missing packets. Options include silence frames, spliced frames (so-called zero length), samples with noise that fades, extrapolation where the voiced frames are extended to cover the gap, or samples generated by pitch cycle waveform repetition. Interpolation-based schemes work by "stretching" the received packets to cover the gaps caused by lost packets. The G.711 concealment algorithm (G.711 Appendix I) works by constantly calculating the pitch frequency of the received frames; if a frame is lost, the receiver reproduces a sample based on the waveform thus far and successively lowers the amplitude with each following frame. After five 20 ms packets the amplitude reaches zero, and in effect no more concealment is audible. Regeneration is where a totally new sample is created from what has been experienced before. The effectiveness of regeneration schemes is highly dependent on the quality of the past (and sometimes future) estimations.
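A toy insertion-style concealment in the spirit of the ramp-down just described: repeat the last good frame and attenuate it with each consecutive loss. Real implementations such as G.711 Appendix I repeat pitch periods rather than whole frames; the frame values here are illustrative:

```python
def conceal(last_frame: list[int], consecutive_losses: int,
            fade_steps: int = 5) -> list[int]:
    """Fill one lost frame by repeating the last good frame, attenuated.

    Amplitude falls linearly with each consecutive loss and reaches zero
    (silence) after fade_steps lost frames, roughly mimicking the
    ramp-down described above.
    """
    gain = max(0.0, 1.0 - consecutive_losses / fade_steps)
    return [int(s * gain) for s in last_frame]


last = [1000, -800, 600, -400]   # last successfully decoded frame
for k in range(1, 6):            # five consecutive losses -> silence
    print(k, conceal(last, k))
```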

Speech decoding

The term "speech coding" is typically reserved for operations more complex than companding. In more complex coding, such as the source-model linear predictive schemes LPC, iLBC and G.729, the residuals and LPC coefficients are converted back to raw PCM samples. Since iLBC coding was devised for the Internet, there is no state carried over between consecutive iLBC blocks, reducing the time dependence mentioned earlier.

Audible playout

In order to play out the speech, the raw samples must be passed to the operating system, because hardware control is available only via the operating system. Typically a device is set up in terms of data rate, buffer size (number of samples), sampling rate (hertz), number of channels and volume. On Windows based systems there is a device independent interface to do this, known as DirectX, a collection of programming interfaces for multimedia applications (the X meaning all of the subsystems). For VoIP, the DirectSound interface is used.

Some formats, WAV for example, contain a header with this information, known as RIFF. If the device (or its software interface) matches the specification for playback, it will be initialised for receiving the data. The device then converts the stream into analogue form for output to loudspeakers or headphones. Volume controls can be either digital or analogue, and the voice may be mixed with other streams.

Subjective experience

Although the computer controlled part of the journey is complete, the voice still has to be heard, understood and experienced. The complete voice path should be effectively constructed and optimised with this stage in mind. Humans are relatively adaptive and can accept, and adapt to, relatively poor conditions. Intuitively, we are better at adapting to consistently poor conditions than to variable quality; this implies that poor quality can be accepted provided no further large quality changes take place.

VoIP applications

[@Schu9207:Voice] "Voice Communication Across the Internet: A Network Voice Terminal"

[@Sisa9809:Multimedia] "The Multimedia Internet Terminal (MInT)"

Sicsophone reduces delay through a novel receiver buffering scheme. The solution uses the low-level features of audio hardware and a specialized jitter buffer playout algorithm. Using the sound card memory directly eliminates intermediate buffering. A statistics-based approach for inserting packets into the audio buffers is used in conjunction with a scheme for inhibiting unnecessary fluctuations in the system. For comparison, the performance of the playout algorithm is presented against idealized playout conditions, and mouth-to-ear delay measurements are given for selected VoIP applications. The proposed mechanism is shown to save hundreds of milliseconds on the end-to-end path [@Hagsand03:Low].

VAT (Visual Audio Tool) [@Jaco9207:vat] is a well known VoIP tool that implements a playout buffer similar to the one described, including a circular buffer to hold the packets before playout.

Meylan and Boutremans, "Realisation of an Adaptive Audio Tool" [@Meylan0004:April]: "Real-time audio over the best effort Internet often suffers from packet loss. At this time, Forward Error Correction (FEC) seems to be an efficient way to attenuate the impact of loss."

[@Bolo9803:Adding] "Adding Voice to a Distributed Game on the Internet"

The authors examine issues related to adding voice between participants in virtual environments. They consider in particular a special kind of DVE, namely distributed games over the Internet, and cover all stages of voice manipulation, including voice generation (with emphasis on echo cancellation), voice transmission (with emphasis on RTP and packetization), and voice restitution (with emphasis on spatial rendition and on synchronization between voice and visual cues). They also consider implementation issues, and illustrate these with the MiMaze game and the FreePhone audio tool, both developed at INRIA.

FreePhone: performance evaluation of end points.

[@marjamaki9912:performance] "Performance Evaluation of an IP Voice Terminal"

[@Jiang2002:QoS] "QoS Evaluation of VoIP End-Points"

[@Neug9906:How] "How elastic are real applications?" From the abstract: "Programs are typically developed with little or no consideration of their performance under different system loads or the effect they may have on other processes competing for the same resources. To an extent, this stems from the 'virtual machine' approach promoted by most mainstream operating systems. With operating systems which offer mechanisms for fine-grained control of resource allocations it becomes apparent that a central policy for allocating potentially scarce resources is not sufficient. We are currently developing a toolkit which allows programmers to systematically examine and assess the performance behavior of a wide range of applications under different resource allocations by determining the applications' utility curves. We argue that such a toolkit is useful for the development of adaptive applications as well as for the implementation of global resource management policies. In particular, we argue that this is necessary for the application of economic models to the area of resource management, as proposed by some researchers."

Tools such as the Robust Audio Tool (RAT) came from UCL in London in 1995 [@Hard9506:Reliable]. RAT, with its simple redundancy scheme of sending one compressed version of a packet in the following one, was a simple example of utilizing redundancy. RAT was intended for both group and one-to-one conferencing. Somewhat surprisingly, RAT and VIC are still being maintained as part of the AVATS project (formerly SUMOVER) at UCL, London.

In the late 1990s, a tool called Freephone was developed by the Rodeo group at INRIA in Sophia Antipolis, France, which implemented FEC mechanisms [@freephone]. At that time all such applications were UNIX based, as this was the only (open) operating system for Internet applications. Freephone adapts to network conditions: it includes a rate control mechanism (adaptation to available bandwidth), a FEC-based error recovery scheme (adaptation to, and recovery from, the loss process in the network), and an adaptive playout adjustment scheme which adapts to delay variations.

Rosenberg et al. in [@Rose0003:Integrating] looked at combining playout algorithms with FEC schemes. Transport of real-time voice traffic on the Internet is difficult for two reasons: network loss and network jitter. Network loss is handled primarily through a variety of forward error correction (FEC) algorithms and local repair operations at the receiver, while jitter is compensated for by means of adaptive playout buffer algorithms at the receiver. Traditionally, these two mechanisms had been investigated in isolation. Rosenberg et al. show the interactions between adaptive playout buffer algorithms and FEC, demonstrate the need for coupling, propose a number of novel playout buffer algorithms which provide this coupling, and demonstrate their effectiveness through simulations based on both network models and real network traces.

Perhaps the best known IP telephony application is Skype, which revolutionized telephony upon its introduction in 2003. It is a cross-platform solution that became successful partly by embracing recent technological developments, and because it provided good, free and easy voice communication [@world:_asses_skypes_networ_impac; @economist06:end]. The technological developments it embraced were: Internet-specific speech coding, a firewall bypass solution, a scalable call establishment system, and an intuitive graphical user interface. Skype has continued to add functionality such as inter-operability with the telephone system, a payment scheme, and conferencing capabilities. More recently the developers have added video and SMS capabilities, and Skype is available on some networked televisions.

Within the Skype network there are two classes of nodes: normal nodes and super nodes. We will first discuss normal nodes. A normal node is typically a home owner's PC, usually behind a home firewall and/or an ISP's NAT. These nodes typically have a private IP address allocated to them. A private address is not globally routable and is drawn from specific ranges for which Internet routers do not forward traffic. The CPU processing capability of normal nodes is also assumed to be somewhat limited.

Super nodes, on the other hand, are well connected machines and must possess a public IP address. A typical example might be a UNIX computer on a university network. Due to their connectivity and processing capabilities, super nodes perform routing and forwarding of Skype signaling messages. The load on a super node is carefully monitored so that Skype message processing does not interfere with the normal operation of its host. Usually users are not aware that their computer has been elected to super node status. The software distribution for normal and super nodes is actually identical, with different routines being invoked after initialization. Super nodes also forward login requests on behalf of normal nodes that cannot reach the login server.

On its first invocation, a normal node uses a pre-configured list of permanent super nodes; it then receives an update of more recent super nodes. The directory of Skype users is decentralized: Skype uses its Global Index technology to find a user, with messages encrypted using 256-bit AES. To locate a user, the procedure is as follows. A normal node sends a request to one super node if it does not itself know the location of the callee. That super node responds with four additional nodes to be queried if the person is not found. The normal node then queries these four nodes. If the user is still not found, an exchange occurs again with the same super node, which responds with eight new (and different) nodes. This is repeated several times until the user is found. Here we have assumed for simplicity that the normal node has a public address; where it has a private address, this negotiation is done by a super node on the normal node's behalf. Search results are also cached at intermediate nodes for subsequent searches.

Non-standardized solutions need to use protocol translation services if they are to inter-operate with existing solutions. Protocol translation involves taking a message from one protocol and generating a (near) equivalent message in the second protocol. We briefly mentioned some known translators for H.323 and SIP in the previous section. For a closed protocol, the developers themselves must create a translator for the desired interoperability. There have been many publications and presentations on the Skype protocol; the prestige of being the first to reverse engineer a closed (and widely used) protocol often acts as an incentive for such efforts. Some of these can be found in [@suh:_charac_detec_skype_relay_traff; @04:_analy_skype_peer_peer_inter_telep_protoc]. A more basic introduction to the operation of Skype, at a somewhat higher level, can be found in [@s.a.:_skype_explain].

Several researchers have tried to reverse engineer the Skype protocol. These include Biondi and Desclaux's presentation [@biondi6.:_silver_needl_in_skype], interestingly entitled "Silver needle in the Skype", as well as Suh et al. [@suh:_charac_detec_skype_relay_traff; @04:_analy_skype_peer_peer_inter_telep_protoc; @guha06:_exper_study_skype_peer_peer_voip_system].

End system control of VoIP


End-systems also have to deal with the implications of different traffic types existing on the network. The effect of this is that the voice stream becomes distorted relative to the original, and it is normally the receiver that reconstructs the original pattern of words and gaps for the listener. In simple terms, the end-systems have to resynchronize the time variations introduced by the network. The end-systems have to be engineered "intelligently", that is, as working systems that can adapt to the changing conditions over Internet paths.

The end systems play an important role in VoIP systems. At the sender they are responsible for the sampling, coding, packetization, optionally adding protection to the voice stream should any packets be lost, and of course actually transmitting the data. The receiver is responsible for receiving, de-packetizing, removing any jitter introduced by the network, removing extra data if sent, decoding, and playing the samples to the listener. These actions within the receiver are among the most researched areas within real-time voice communication.

The early 1990s produced a surge in packet audio playout research. One of the first efforts to implement a voice application on an IP network with an adaptive buffer playout strategy was NeVoT [@Schu9207:Voice].

The playout algorithm implemented in Sicsophone is almost identical to NeVoT's. It uses a variation estimate similar to the one given earlier, but makes a slight distinction between the first packet in a talkspurt and subsequent ones: the playout of the first packet is delayed longer due to the lack of information on the network state after the silence period.

Our work shares their choice of a ring buffer for buffering packets, except that we perform the copying using DMA transfers directly, rather than copying the data from the application to the operating system.

The use of a ring buffer in Sicsophone is identical to that described in [@Schu9207:Voice], where the authors motivate their choice of a circular buffer for performance reasons.

The operating system is the supervisor that coordinates the reception, depacketization, error recovery, multiplexing, and playout of the incoming voice stream(s).

Operating systems and VoIP

UNIX

[@reed98:new] "A new audio device driver abstraction".
[@Kouv9701:Overcoming] "Overcoming Workstation Scheduling Problems in a Real-Time Audio Tool".
[@rizzo97:freebsd] "The FreeBSD Audio Driver".
[@Meylan0004:April] "Realisation of an Adaptive Audio Tool".
[@martin2000:200mhz] "A 200MHz 0.25W Packet Audio Terminal Processor for Voice-over-Internet Protocol Applications".
[@Chan9906:Hardware] "Hardware and Software Architecture of a Packet Telephony Appliance".

Luigi Rizzo describes a generic sound card driver for FreeBSD [@rizzo97:freebsd]. Aspects of it resemble our work, in particular the handling of timers, DMA transfers, and buffer size allocation. The driver includes hooks for use by VoIP applications; one such example is a select() call which can be scheduled to return only when a certain amount of data is ready for consumption.

Kouvelas and Hardman [@Kouv9701:Overcoming] keep the flow of audio constant during operating system load by using buffering in the audio hardware. They also look at reducing the amount of buffering in the application by keeping the application buffers as small as possible. In our case we try to eliminate application buffering entirely by using only the hardware buffers.

Mobile phones

Android

Android held about 75% of the mobile market share (2021).

iOS

iOS held about 25% of the mobile market share (2021).

Windows

Windows Mobile

Mostly defunct as of 2021.

Algorithms for buffer sizing

Coping with the variable delays over Internet paths, whilst maintaining acceptable interactivity in a real-time conversation subject to speech codec limits, is essentially what a buffer sizing algorithm needs to achieve.

Naylor and Kleinrock in 1982 investigated buffering considerations for stream traffic on a packet-switched network [@Nayl82:Stream]. They used delay estimates (prediction) to eliminate delay differences. Their idea is to use previous stream delays to estimate the range of the delays currently incurred. The essence is how to discard the $k$ largest samples from a set of the last $m$ samples in order to estimate $D$, the waiting time at the destination. They provide rules of thumb for choosing $m$ and $k$ and show the suitability of the choice on empirical delay distributions from the ARPANET. They show it is still necessary to deal with discontinuities even after smoothing.
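A minimal sketch of this discard-$k$-of-$m$ rule, with illustrative values for $m$ and $k$ (not the paper's own):

```python
# Keep the last m one-way delay samples, discard the k largest, and use the
# largest remaining value as the destination waiting time D.
def estimate_waiting_time(delay_samples, m=40, k=4):
    recent = delay_samples[-m:]                # the last m samples
    kept = sorted(recent)[:len(recent) - k]    # discard the k largest
    return kept[-1]                            # largest remaining delay -> D

delays_ms = [48, 52, 45, 160, 51, 49, 55, 47, 90, 50] * 4
print(estimate_waiting_time(delays_ms))        # 90: the 160 ms spikes are discarded
```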

Gopal analyzed a previous scheme suggested by Barberis and Pazzaglia (1980). It included a non-stationary buffer length distribution and mean values as a function of time in the talkspurt [@Gopa8402:Playout].

Due to variations in network delay, a stream of voice packets entering a data network with deterministic interarrival times may not have deterministic interdeparture times at the destination. Two playout schemes designed to remove such variations in delay are considered. Analytic results for the performance of these two schemes are obtained, and numerical examples show the effect of the coefficient of variation of the interdeparture time on performance.

The design and simulation of such jitter-absorbing buffers are covered in [@Kansal01:Jitter-free]. Much of the early voice focus was on solutions, mostly theoretical, for buffer design and sizing [@Barb8002:Optimal; @Barb8102:Buffer]. Work by Naylor and Kleinrock described general design methodologies for jitter-absorbing buffers [@Nayl82:Stream]. Buffer sizing from a control theory perspective has been looked at in [@linc0205:jitter].


Analysis of the buffer performance using probabilistic methods can be found in [@Mans0108:Jitter].

The authors study jitter control in networks with guaranteed quality of service (QoS) from a competitive analysis point of view: they propose on-line algorithms that control jitter and compare their performance to the best possible (an off-line algorithm) for any given arrival sequence.

For delay jitter, where the goal is to minimize the difference between the delay times of different packets, they show that a simple on-line algorithm using a buffer of $B$ slots guarantees the same delay jitter as the best off-line algorithm using buffer space $B/2$. They prove that the guarantees made by their on-line algorithm hold even for simple distributed implementations, where the total buffer space is distributed along the path of the connection, provided that the input stream satisfies a certain simple property. For rate jitter, where the goal is to minimize the difference between inter-arrival times, they develop an on-line algorithm using a buffer of size $2B + h$ for any $h \ge 1$, and compare its jitter to the jitter of an optimal off-line algorithm using buffer size $B$. They prove that their algorithm guarantees that the difference is bounded by a term proportional to $B/h$.

[@Moon9508:Packet] Packet Audio Playout Delay Adjustment: Performance Bounds and Algorithms.

[@Jha96:Continuous] Continuous Media Playback and Jitter Control

[@Anandakumar01:adaptive] An adaptive voice playout method for VOP
applications.

[@matic2002:voice] Predictive Playout Delay Adaptation for Voice over
Internet.


Comparison of solutions

This section concentrates on solutions which perform comparative studies between two or more algorithms or simulations.

[@Wang9906:Comparison] Comparison of Adaptive Internet Multimedia Applications

[@Cho9408:Reconstruction] Performance analysis of reconstruction algorithms for packet voice communications.

Moon et al. [@Ramj9406:Adaptive] present four different playout algorithms for packet audio. All calculate estimates of the network delay and jitter as averages over the measured packets. The authors study jitter spikes in traces but do not adapt the buffer size to these spikes; the work and results are very similar to those presented in [@RAMJ_94_Adaptive_Infocom].
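The estimator underlying most of these algorithms is an exponentially weighted moving average of the delay and its variation (after Ramjee et al.). A minimal sketch, where the value of `ALPHA` and the safety factor of four are the commonly cited choices:

```python
# Exponentially weighted estimates of network delay (d_hat) and delay
# variation (v_hat), updated from the measured delay n_i of each packet.
ALPHA = 0.998002  # weighting commonly cited for this estimator

def update(d_hat, v_hat, n_i):
    d_hat = ALPHA * d_hat + (1 - ALPHA) * n_i
    v_hat = ALPHA * v_hat + (1 - ALPHA) * abs(d_hat - n_i)
    return d_hat, v_hat

def playout_delay(d_hat, v_hat):
    # Playout point chosen at the start of a talkspurt; packets later in
    # the same talkspurt keep the same offset.
    return d_hat + 4 * v_hat
```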

Pinto and Christensen, in two papers [@Pinto99:Talkspurt; @pinto9910:algorithm], describe an algorithm for jitter compensation based on a target packet loss rate. Their "gap based" approach compares the current playout time with the arrival time and calculates a gap for both early and late packets. They compare the current playout delay, for any particular talkspurt in progress, with an optimal playout delay. This optimal theoretical delay is defined as the minimum amount of delay which, added to the creation time of each packet, would result in the talkspurt being played out at the given loss rate. Their calculation of the optimal playout is similar to the one described in this paper.

[@Alva9309:Voice] Voice synchronization in packet switching networks.

A new algorithm is presented that incorporates a novel adaptive scheme and a special control-packet based time measurement, and that works with the CEPT and DoD standard vocoders. The first method uses measurement packets emitted at the beginning of each silence interval and returned by the receiver; a second packet, the reference packet, is then sent, containing the estimate of the mean network delay. The second method remembers the packet with the lowest delay (greatest delay until playout).

[@agrawal98:use] Use of Statistical Methods to Reduce Delays for Media Playback Buffering.
[@matic2000:optimal] Optimal delay buffer for VoIP applications.
[@matic2002:predictive] Predictive Playout Delay Adaptation for Voice over Internet.
[@sreenan00:delay] Delay Reduction Techniques for Playout Buffering.
[@Ramos03:moving] A Moving Average Predictor for Playout Delay Control in VoIP.
[@Jeske2001:Adaptive] Adaptive Play-Out Algorithms For Voice Packets.

[@kuo2000:delivering] Delivering Voice over the Internet.

[@leon99:adaptive] An adaptive predictor for media playout buffering.

An Algorithm for Playout of Packet Voice Based on Adaptive Adjustment of Talkspurt Silence Periods.

[@Hodson2000:Skew] Skew Detection and Compensation for Internet Audio
Applications.

[@Fujimoto0211:Adaptive] Adaptive playout buffer algorithm for enhancing perceived quality of streaming applications.

[@Ergul2001:Novel] A Novel Adaptive Playout Algorithm For Voice Over IP Applications and Performance Over WANs.

[@Fujimoto0106:playout] Playout control for streaming applications by statistical delay analysis.

[@peng01:control] The Control and Algorithm of Audio Dynamic Buffer.

[@roccetti98:design] Design, Development and Experimentation of an Adaptive Mechanism for Reliable Packetized Audio for Use over the Internet.

Loss/delay considered together

[@Moon9801:Correlation] [@Jian0006:Modeling]

Works which do include voice in the design of packet-switched voice networks include:

[@Lieb0006:Tradeoffs] “Tradeoffs in Designing Networks with End-to-End
Statistical QoS Guarantees”

[@Aras9401:Real] Real-time communications in packet-switched networks

[@Garr9302:Joint] Joint Source/Channel Coding of Statistically Multiplexed Real-Time Services on Packet Networks

A. P. Bernard's 1998 UCLA dissertation "Source-Channel Coding of Speech" [@Bernard:T98] addresses the design of source and channel coding techniques for voice. An adaptive multi-rate (AMR) transmission system is presented that switches between operating modes depending on channel conditions. It uses variable bit rate embedded source encoders and rate-compatible channel coders providing unequal error protection. The main concept is the use of a rate-compatible punctured trellis code (RCPT) to obtain unequal error protection via progressive puncturing of symbols in a trellis. Results are presented in which the rate-compatible punctured convolutional code is compared with and without bit-interleaved coded modulation. The coder displays a wide range of bit error sensitivities and is used in combination with rate-compatible punctured channel codes providing adequate levels of protection; it also operates over a wide range of channel conditions with graceful performance degradation as the channel signal-to-noise ratio decreases.

Mixing of voice and data can occur at three different levels. [@Chen8806:Integrated] Integrated voice/data switching.

[@yletyinen98:voice] Voice packet interarrival jitter over IP switching

[@CaSa98:predictive] Predictive loss pattern queue management for Internet routers.

[@Fulton1998:Delay] Delay jitter first-order and second-order statistical functions of general traffic on high-speed multimedia network.

[@Land9702:Multiplexing] “Multiplexing ATM traffic streams with time-scale-dependent arrival processes”

[@song00:new] “A New Queue Discipline for Various Delay and Jitter Requirements in Real-Time Packet-Switched Networks”

Network Synchronization [@Bier9606:Intra] “Intra- and Inter-Stream Synchronization for Stored Multimedia Streams”

[@yuang96:novel] A Novel Intra-media Synchronization Mechanism for Multimedia Communication

[@Huan9511:Multimedia] Multimedia synchronization for live presentation using the N-buffer approach

[@Jeff9211:Adaptive] Adaptive, best-effort delivery of digital audio and
video across packet switched networks,

[@Tien9901:Intelligent] “Intelligent Voice Smoother for Silence-Suppressed Voice over Internet”

[@Tucker87:packet] “Packet-speech Multiplexer”

[@Figu9510:Leave] Leave-in-Time: A New Service Discipline for Real-Time Communications in a Packet-Switching Data Network

[@Kofman96:Loss] Loss Probabilities and Delay and Jitter Distributions In A Finite Buffer Queue With Heterogeneous Batch Markovian Arrival Process

[@Sing9405:Jitter] Jitter and Clock Recovery for Periodic Traffic in Broadband Packet Networks

Wireless access

[[sec:wireless access]]{#sec:wireless access
label=”sec:wireless access”}

This section covers wireless access technologies for IP voice users. Freeing the handset physically from the wired infrastructure is both natural for the user and allows mobility and flexibility in usage. This freedom, however, puts more constraints on the system in terms of QoS, security, and resource usage. Due to strict frequency regulation of radio transmissions, the technological favourite for VoIP is to use unlicensed spectrum in the 2.4 and 5.0 GHz bands.

History and 802.11 extensions

The IEEE 802.11 standard belongs to the 802 family, a series of standards developed by the IEEE to define specifications for local and metropolitan area networking, mainly at the data-link and physical layers of the OSI reference model. In 1997 the IEEE released the first version of the 802.11 standard, whose purpose was to provide wireless connectivity between different devices in a local area, with a maximum transmission rate of 2 Mbps. Two years later a revision appeared, which included two new extensions using new modulation schemes to provide rates up to 11 Mbps in the 2.4 GHz frequency band (802.11b) and 54 Mbps in the 5 GHz band (802.11a). Further extensions are being released, addressing aspects such as security, higher transmission rates, and quality of service (QoS). The following table summarizes the current extensions to the 802.11 standard (some of which are still in draft state) and their main features:

  Extension   Main features
  ----------- --------------------------------------------------------------------------
  802.11a     High-speed WLAN standard in the 5 GHz band; supports 54 Mbps
  802.11b     Standard for the 2.4 GHz band; supports 11 Mbps
  802.11e     Quality of Service enhancements
  802.11f     Inter-access point communication
  802.11g     Modulation technique for the 2.4 GHz band, achieving a rate of 54 Mbps
  802.11h     Management of the 5 GHz band for use in Europe and the Asia Pacific region
  802.11i     Addresses security weaknesses

  : 802.11 sub-standards

Eriksson et al. discuss the challenges of voice over IP over wireless in [@eriksson00:_ip].

802.11 layers

The standard defines a set of medium access control (MAC) and physical layers (PHY) specifications for wireless connectivity. The following figure shows the relation between these layers and the OSI reference model:

[Figure: the 802.11 MAC and PHY layers in relation to the OSI reference model]{#fig:802.11-and-OSI width="45%"}

As shown in the figure, the MAC is a sublayer of the data-link layer, which offers its service to the logical link control sublayer, a common interface for all the IEEE 802 standards. This common interface permits a heterogeneous interconnection of different types of devices by abstracting their underlying media technology.

In order to distinguish between data units of different layers, we follow the convention of using 'packets' for the data units at the IP and higher layers, whilst using the word 'frames' for the 802.11b data units.


Physical layer

The physical layer (PHY) specifies low-level communication parameters such as the radio technology, frequencies, channel bandwidth, modulation schemes, and transmission rates. The three units shown at the bottom of figure 1{reference-type="ref" reference="fig:802.11-and-OSI"} represent the three different radio technologies that the original 802.11 standard defined for wireless connectivity. IR stands for InfraRed, FHSS for Frequency Hopping Spread Spectrum, and DSSS for Direct Sequence Spread Spectrum.

Whilst all these technologies support transmission rates of 1 and 2 Mbps, practically no known vendors sell IR compliant products. Also, frequency hopping products are few in comparison with the ones that use direct sequence. An explanation for the dominance of DSSS products is the development of the 802.11b extension that enhanced the basic DSSS mode with additional transmission rates of 5.5 and 11 Mbps.

For spread spectrum techniques the standard defines the S-band ISM range (2.4-2.5 GHz) as the frequency range to use. This is because the regulatory authorities permit the unlicensed use of the ISM (Industrial, Scientific, and Medical) frequency bands provided that the emitting power of the devices is low.

For direct sequence mode, the standard divides the band into 14 different channels whose mapping to frequencies is shown in table 2{reference-type=”ref” reference=”fig:IEEE-802.11b-channels”}:

::: {#fig:IEEE-802.11b-channels}
  Channel (1-7)     1       2       3       4       5       6       7
  ----------------- ------- ------- ------- ------- ------- ------- -------
  Frequency (GHz)   2.412   2.417   2.422   2.427   2.432   2.437   2.442

  Channel (8-14)    8       9       10      11      12      13      14
  ----------------- ------- ------- ------- ------- ------- ------- -------
  Frequency (GHz)   2.447   2.452   2.457   2.462   2.467   2.472   2.484

  : IEEE 802.11b channels
:::

However, this channel distribution causes adjacent channel interference, since the bandwidth used by 802.11b stations is around 22 MHz while there is only 5 MHz of separation between two adjacent channels. Thus, only channels 1, 6 and 11 are separated enough to be used in nearby locations without interference. This poses a significant challenge when deploying a WLAN with adequate coverage while keeping adjacent channel interference and co-channel interference low. Figure 2{reference-type="ref" reference="fig:Adjacent-channel-interference"} shows how the frequency band used by each channel overlaps the frequency bands of the neighbouring channels.
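A quick sketch of this overlap arithmetic (the ~22 MHz occupied bandwidth is the usual figure for DSSS; channel 14, which sits apart from the 5 MHz grid, is omitted):

```python
# Which 802.11b channels overlap, assuming ~22 MHz of occupied bandwidth
# centred on each channel and 5 MHz spacing between channel centres.
def centre_freq_mhz(channel):
    return 2412 + 5 * (channel - 1)    # valid for channels 1-13

def overlap(ch_a, ch_b, width_mhz=22):
    return abs(centre_freq_mhz(ch_a) - centre_freq_mhz(ch_b)) < width_mhz

print(overlap(1, 6))   # False: 25 MHz apart, usable in neighbouring cells
print(overlap(1, 4))   # True: 15 MHz apart, these channels interfere
```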

[Figure: adjacent channel overlap in the 2.4 GHz band]{#fig:Adjacent-channel-interference width="45%"}

Different modulation schemes are used to achieve the standard bitrates of 1, 2, 5.5 and 11 Mbps. The standard specifies BPSK for 1 Mbps, QPSK for 2 Mbps, CCK BPSK for 5.5 Mbps and CCK QPSK for 11 Mbps. The denser encodings used at the highest rates, however, are more prone to interference and therefore have less range than the lower rates. A trade-off thus exists between data throughput and distance, and we will see an illustration of this in our experiments.

In order to accomplish interoperability between devices that use different transmission techniques, the PHY layer prepends a physical header to data frames that is always transmitted at 1 Mbps.

Architecture

Before describing the two different architectures defined in the 802.11 standard, we describe the four components that an 802.11 network may consist of: an access point, stations, the wireless medium, and a distribution system.

  • Station (STA): An electronic device capable of communicating wirelessly with other stations in range.
  • Access point (AP): A special wireless station whose main purpose is to provide the wireless stations with access to the Internet.
  • Distribution system (DS): In order to provide larger coverage areas, several access points may be used. The distribution system is the logical component of an 802.11 network that allows the different access points to track the location of the wireless stations.
  • Wireless medium: While in wired networks a cable is the physical medium used to carry frames from the sender station to the receiver, in wireless networks the physical medium is the air.

Infrastructure networking

This is the most common mode that the IEEE 802.11b standard defines to build wireless LANs. An infrastructure network relies on an access point to provide connectivity to the wireless stations. The stations must be in range of the AP; however, it is not required that the stations are in range of each other. Stations that are not in range of each other are usually referred to as hidden nodes (see section 4.18{reference-type="ref" reference="sub:Hidden-node-problem"}). It is possible to extend the size of a wireless LAN by interconnecting several APs, thus permitting the wireless stations to roam between adjacent cells.

Adhoc (independent) networking

The standard provides a mechanism to create small, usually short-lived networks, with only end stations. A common usage of this is when a connection between two or more stations is desired (file sharing for instance) and no AP is available. However, it is also possible that one of the stations acts as a bridge between the WLAN and the wired world, providing external connectivity to the other stations.

IEEE 802.11 Medium Access Control (MAC)

The main functions of the medium access control (MAC) layer are to coordinate the stations' access to the medium and to define the mapping of physical layer signals to and from link frames. It also manages bitrate selection and supports different operational modes such as the RTS/CTS handshake. We describe the MAC layer in detail in section 4.7{reference-type="ref" reference="sub:IEEE-802.11-MAC"}; the following sections describe the 802.11b MAC features relevant to the measurements we conducted.

MAC access modes

The Distributed Coordination Function (DCF) is the access mode defined in the 802.11 protocol to provide unsynchronized, contention-based access to the medium through the CSMA/CA protocol (described in section 4.13{reference-type="ref" reference="sub:Contention-based-in-DCF"}). The unsynchronized access to the medium results in random delays between each frame transmission, which may be problematic for real-time traffic.

The Point Coordination Function (PCF) is an optional access mode which enables synchronized transmission of data frames. In this mode, the AP polls the wireless stations, granting them access to the medium for a short period of time. The AP then moves to the next station in the poll list, and thus all stations obtain a slot of time in which to transmit data. Although this access mode seems suitable for real-time communication, it is simply not supported by many 802.11 devices. Further information about the PCF mode can be found in the 802.11 standard.

HCCA

802.11e includes amendments to the MAC layer to support QoS. The HCF (Hybrid Coordination Function) Controlled Channel Access (HCCA) operates similarly to the Point Coordination Function. However, in contrast to PCF, in which the interval between two beacon frames is divided into a contention-free period (CFP) and a contention period (CP), HCCA allows CFPs to be initiated at almost any time during a CP.

A CFP is called a Controlled Access Phase (CAP) in 802.11e. A CAP is initiated by the AP, whenever it wants to send a frame to a station, or receive a frame from a station, in a contention free manner. In fact, the CFP is a CAP too. During a CAP, the Hybrid Coordinator (HC) — which is also the AP — controls the access to the medium. During the CP, all stations function in EDCA.

The other difference from the PCF is that Traffic Classes (TC) and Traffic Streams (TS) are defined. This means that the HC is not limited to per-station queuing and can provide a kind of per-session service. Also, the HC can coordinate these streams or sessions in any fashion it chooses (not just round-robin). Moreover, the stations provide information about the lengths of their queues for each Traffic Class (TC). The HC can use this information to give priority to one station over another, or to better adjust its scheduling mechanism. Another difference is that stations are given a TXOP: they may send multiple packets in a row, for a given time period selected by the HC. During the CP, the HC allows stations to send data by sending CF-Poll frames.

HCCA is generally considered the most advanced (and complex) coordination function. With the HCCA, QoS can be configured with great precision. QoS-enabled stations have the ability to request specific transmission parameters (data rate, jitter, etc.) which should allow advanced applications like VoIP and video streaming to work more effectively on a Wi-Fi network.

HCCA support is not mandatory for 802.11e APs. In fact, few (if any) APs currently available are enabled for HCCA. Nevertheless, implementing the HCCA does not require much overhead, as it basically uses the existing DCF mechanism for channel access (no change to DCF or EDCA operation is needed). In particular, the station-side implementation is very simple, as stations only need to be able to respond to poll messages. On the AP side, however, a scheduler and queuing mechanism are needed. Given that APs are already better equipped than station transceivers, this should not be a problem either.

EDCA

With EDCA (Enhanced Distributed Channel Access), high priority traffic has a higher chance of being sent than low priority traffic: a station with high priority traffic waits a little less before it sends its packet, on average, than a station with low priority traffic. This is accomplished by using a shorter contention window (CW) and shorter arbitration inter-frame space (AIFS) for higher priority packets. In addition, EDCA provides contention-free access to the channel for a period called a Transmit Opportunity (TXOP). A TXOP is a bounded time interval during which a station can send as many frames as possible (as long as the duration of the transmissions does not extend beyond the maximum duration of the TXOP). If a frame is too large to be transmitted in a single TXOP, it should be fragmented into smaller frames. The use of TXOPs reduces the problem of low rate stations gaining an inordinate amount of channel time in the legacy 802.11 DCF MAC. A TXOP time interval of 0 means it is limited to a single MSDU or MMPDU. The levels of priority in EDCA are called access categories (ACs). Default EDCA Parameters for each AC:

         AC              CWmin   CWmax   AIFSN   Max TXOP
  --------------------- ------- ------- ------- ----------
  Background (AC_BK)      31      1023     7        0
  Best Effort (AC_BE)     31      1023     3        0
  Video (AC_VI)           15       31      2     3.008 ms
  Voice (AC_VO)            7       15      2     1.504 ms
  Legacy DCF              15      1023     2        0
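To make the table concrete, here is a small sketch of how these parameters translate into channel-access waiting times, assuming the 802.11b DSSS values of a 10 µs SIFS and 20 µs slots:

```python
# Pre-transmission wait per access category: a fixed AIFS
# (SIFS + AIFSN slots) plus a random backoff drawn from [0, CWmin].
import random

SIFS_US, SLOT_US = 10, 20          # DSSS values

EDCA = {  # AC: (CWmin, CWmax, AIFSN)
    "AC_BK": (31, 1023, 7),
    "AC_BE": (31, 1023, 3),
    "AC_VI": (15, 31, 2),
    "AC_VO": (7, 15, 2),
}

def access_wait_us(ac):
    cw_min, _cw_max, aifsn = EDCA[ac]
    aifs = SIFS_US + aifsn * SLOT_US                # fixed deferral
    backoff = random.randint(0, cw_min) * SLOT_US   # contention backoff
    return aifs + backoff

for ac in EDCA:
    print(ac, access_wait_us(ac), "us")   # voice typically waits the least
```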

ACs map directly from Ethernet-level Class of Service (CoS) priority levels:

  Priority   802.1D Priority   802.1D Designation   Access Category
  ---------- ----------------- -------------------- -----------------
  Lowest             1                 BK             AC_BK
                     2                 Spare          AC_BK
                     0                 BE             AC_BE
                     3                 EE             AC_BE
                     4                 CL             Video (AC_VI)
                     5                 VI             Video (AC_VI)
                     6                 VO             Voice (AC_VO)
  Highest            7                 NC             Voice (AC_VO)

The purpose of QoS is to protect high priority data from low priority data, but there are scenarios in which data needs to be protected from other data of the same priority. For example, suppose a network can accommodate only ten voice calls and an eleventh call is made. Admission control in EDCA addresses this type of problem. The AP publishes the available bandwidth in beacons, so clients can check the available bandwidth before adding traffic that the network cannot accommodate. Wi-Fi Multimedia (WMM) certified APs must be enabled for EDCA and TXOP. All other enhancements of the 802.11e amendment are optional.

Positive acknowledgments

The 802.11 standard defines a positive acknowledgment scheme in order to provide some reliability for wireless transmissions. All data frames must be acknowledged. When a station has properly received a data frame it sends back an ACK frame, so that the sender knows of the successful delivery. If the ACK frame does not arrive, the sender assumes that the frame was not delivered and retransmits it. The drawback of the ACK mechanism is the overhead it adds to the communication. However, ACK frames are necessary since link conditions are highly variable in wireless networks.

Interframe spacing

The 802.11 standard defines four interframe spacings to prioritize the transmission of certain frames. They are used to ensure that atomic operations, such as the frame-acknowledgment pair or the RTS/CTS handshake, are not interrupted; they are also used to give preference to contention-free traffic over contention-based traffic when both exist. When the medium is busy, all frames have to wait until it becomes idle before they can be transmitted. The frames with the highest priority then gain access to the channel, as they are assigned a shorter interframe space. Figure 3{reference-type="ref" reference="fig:Interframe-spacing"} shows the different interframe spacings.

[Figure: 802.11 interframe spacing]{#fig:Interframe-spacing width="30%"}

The SIFS (Short InterFrame Space) is used for high priority frames such as acknowledgments or CTS frames, which must be sent immediately after the corresponding frame. It has a duration of 10 $\mu s$ in DSSS. The PIFS (PCF InterFrame Space) is used to give higher priority to PCF contention-free traffic over DCF contention-based traffic. It has a duration of 30 $\mu s$ in DSSS. The DIFS (DCF InterFrame Space) is the minimum time that the medium must be idle before attempting a contention-based transmission. It has a duration of 50 $\mu s$ in DSSS. The EIFS (Extended InterFrame Space) (not shown) is a variable-length interval that is used when a frame is received with errors.
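These fixed spacings dominate the per-packet overhead for small voice frames. A back-of-the-envelope sketch (the 192 µs long-preamble PLCP overhead, the 14-byte ACK, the ACK rate, and the header sizes are assumptions for illustration; backoff and retransmissions are ignored):

```python
# Rough airtime of one G.711 VoIP packet over 802.11b at 11 Mbps.
DIFS_US, SIFS_US, PLCP_US = 50, 10, 192   # DSSS timings; long preamble

def airtime_us(payload_bytes, rate_mbps):
    data = PLCP_US + payload_bytes * 8 / rate_mbps
    ack = PLCP_US + 14 * 8 / rate_mbps    # ACK assumed at the same rate
    return DIFS_US + data + SIFS_US + ack

# 20 ms of G.711 (160 bytes) + RTP/UDP/IP (40 bytes) + LLC/MAC (~34 bytes)
print(round(airtime_us(160 + 40 + 34, 11)), "us per packet")   # ~624 us
```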

Contention using DCF – backoff mechanism

The DCF mode is based on the carrier sense multiple access with collision avoidance (CSMA/CA) protocol. It makes the stations sense the medium to determine whether it is busy or idle before they attempt to transmit a frame. However, if several stations are waiting for the medium to become idle and then transmit simultaneously, a collision will occur unless there is some mechanism to deal with the contention.

In order to resolve contention between several stations, CSMA/CA defines an exponential backoff algorithm. The backoff mechanism works as follows:

  1. Every time a station attempts to transmit, it waits for the medium to be idle for a DIFS (or an EIFS, if the previous transmission was not successful). Then follows an interval called the contention window, which is divided into slots of 20 $\mu s$. Each station chooses a random number of slots and waits for those slots to elapse; the station that selected the lowest number of slots thus accesses the medium first. The timing of the backoff mechanism is shown in figure 3{reference-type="ref" reference="fig:Interframe-spacing"}.
  2. While the winning station occupies the channel, the other stations suspend the backoff procedure until the medium is idle again. However, the stations resuming the contention do not choose a new random number of slots. Instead, they wait out the slots that remained from the previous contention. In this way, the stations that lost the contention have a higher probability of gaining access to the medium than the one that just transmitted.
  3. The random number of slots is chosen from the interval [0, CW], CW being the size of the contention window. The mechanism is called exponential because after a failed transmission a station must double the size of the contention window. This is designed to reduce the collision probability in a loaded network.
  4. Collisions occur when two stations select the same number of slots. A collision is only detected by the lack of the corresponding ACK, since wireless stations cannot listen to the medium whilst they are transmitting.

The contention window size is a power of 2 minus 1, starting at 31. It doubles after each failed transmission until the 5th retransmission, where it is limited to 1023 slots for further retransmission attempts. The contention window reverts to the minimum size after a successful transmission of a frame.
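A minimal sketch of this window-doubling schedule, using the slot and window values given above:

```python
# DCF exponential backoff: the contention window starts at 31 slots,
# doubles after each failed attempt, and is capped at 1023 slots.
import random

CW_MIN, CW_MAX, SLOT_US = 31, 1023, 20

def backoff_us(retry):
    cw = min((CW_MIN + 1) * 2 ** retry - 1, CW_MAX)
    return random.randint(0, cw) * SLOT_US

for retry in range(7):
    print(retry, backoff_us(retry), "us")   # windows: 31, 63, ..., 1023, 1023
```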

Retry counters

The retry counters set a limit to the maximum number of retransmissions allowed per frame before it is discarded by the MAC layer. Some cards define different counters depending on the packet size.

Error detection and recovery

When a station does not receive the corresponding ACK, it assumes that the frame was lost and tries to retransmit it. Thus detection and recovery at the MAC layer occur at the source. However, after reaching the maximum number of retransmissions the MAC layer discards the frame. It is therefore the responsibility of higher layers to perform detection and recovery where reliability is required (TCP, for instance).

Carrier sensing functions

Carrier sensing is a method invoked by the MAC layer to ascertain whether the medium is busy. It can be either physical or virtual. Physical carrier sensing is performed by the physical layer and it reports the state of the medium to the MAC layer. Virtual carrier sensing is made through the Network Allocation Vector (NAV), which is the expected time that the atomic transmission of a frame will maintain the channel busy. The NAV is calculated from the duration field of the 802.11 header of the existing frame in the air.
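A small sketch of the virtual carrier sensing bookkeeping (class and method names are ours, for illustration):

```python
# Virtual carrier sensing: the NAV is set from the duration field of
# overheard frames; the medium is treated as busy until the NAV expires.
class NAV:
    def __init__(self):
        self.busy_until_us = 0

    def on_overheard_frame(self, now_us, duration_us):
        # Only ever extend the reservation; it never shrinks early.
        self.busy_until_us = max(self.busy_until_us, now_us + duration_us)

    def medium_idle(self, now_us):
        return now_us >= self.busy_until_us
```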

Bitrate selection

The bitrate selection has a critical role in wireless systems as it directly affects the frame error rate for a given signal-to-noise ratio. The 802.11 standard intentionally left the bitrate selection mechanism unspecified. Thus, vendors have freedom to implement their own mechanisms. However, some drivers permit the user to select specific bitrates.

Hidden node problem. RTS/CTS mechanism

In a wireless network, transmission ranges have fuzzy boundaries, unlike a wired network where all stations can reach all others. It may happen that two stations are each in range of a third but not in range of each other, because of some obstacle or simply because they are too far apart.

Figure 4{reference-type="ref" reference="fig:Hidden-node-problem"} reflects this situation: the pairs of nodes A - B and B - C are in range. However, A cannot hear what C transmits, nor can C hear what A transmits. Thus, A and C may simultaneously start a transmission, causing a collision at B, because they cannot sense the medium as busy.

[Figure: the hidden node problem]{#fig:Hidden-node-problem width="45%"}

In order to alleviate this problem, the IEEE standard provides a mechanism to reserve the medium: the Request To Send (RTS) and Clear To Send (CTS) messages. The procedure works as follows: A sends an RTS frame to B, which responds with a CTS. Although C cannot hear the RTS, it will receive the CTS, thus learning that a transmission is about to be in progress. The CTS frame has a duration field that informs all stations in range of the time that the medium will be busy. Then, after receiving the CTS, A transmits the data frame while C waits for the transmission to finish.

RTS frames are very small (20 bytes) and are less likely to collide than a large data frame (which can be around 1500 bytes). However, when a small data frame is to be transmitted this mechanism does not offer any significant advantage. Additionally, the RTS/CTS reservation adds some overhead to the communication, thus reducing the overall throughput of the network.
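In practice this trade-off is usually exposed as an RTS threshold: frames longer than the threshold use the RTS/CTS exchange, shorter ones are sent directly (the 500-byte value below is an arbitrary illustrative choice):

```python
# Decide per frame whether the RTS/CTS reservation is worth its overhead.
RTS_THRESHOLD_BYTES = 500   # illustrative; typically a driver setting

def use_rts_cts(frame_bytes):
    return frame_bytes > RTS_THRESHOLD_BYTES

print(use_rts_cts(1500))   # True: large data frame, reserve the medium
print(use_rts_cts(200))    # False: small VoIP frame, send directly
```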

Here we concentrate on IEEE 802.11 physical and MAC layer access for VoIP. Our focus is on access using wireless technologies rather than relaying voice, and we deal mainly with the IEEE 802.11 suite of protocols, primarily because these are the ones we have access to.

Anastasi et al. measured the performance of IEEE 802.11b ad hoc networks [@anastasi2004:IEEE], specifically the range of the end-terminals, the impact of different data rates, and their variability. They observed that the transmission range was highly dependent on the data rate up to 100 m, whilst the physical carrier sensing range was independent of rate up to 200 m. Unlike their results in ad hoc mode, we did not observe different ranges for different rates at up to 320 meters. Even at 400 meters there was no conclusive dependency of range on data rate.

Hertrich looked at mixed traffic (including real-time voice) in IEEE 802.11 networks [@hertrich03:_exper_perfor_evaluat]. He used a MAC booster and, by tailoring it, could alter the number of retransmissions for different positions to achieve the required throughput. We did not try to change the number of transmissions. This work is similar to ours in that he considered the environment important; however, he used VoIP and MPEG4, while we used VoIP and TCP. Additionally, Hertrich focused on the home, whereas we focused on an office environment. Nevertheless, we also found that certain positions did not permit any communication to take place at all.

Dimitriou et al. address issues that can make the deployment of multimedia communications difficult in 802.11 networks [@dimitriou03:inter_telep_wlans]. They cite interference and users moving out of range as limiting factors for good VoIP quality in WLANs. They suggest the use of smart speech coding (including an enhanced version of G.711 coding developed by their company) to make the speech more resilient to loss.

Hoene et al. examined the effect of motion on the performance of wireless links through a series of experiments with moving nodes [@Hoene2003:measuring]. They conclude that other factors such as modulation type, quality of power supply, environmental setup, and number of retransmissions may have greater impact on 802.11b performance than the motion itself. In general the greater the speed of the terminal the lower the correlation of loss events. In our experiments the nodes were not moving, i.e., movement only occurred between measurements; thus movement should only decrease the observed losses.


Radio aspects

One of the more active research areas has been wireless voice services. Focus has mainly been on throughput and capacity issues of IEEE 802.11 networks. Casetti et al. present a framework that assumes variable rate speech coders at rates of 64 kb/s, 13 kb/s, and 8 kb/s [@casetti04:_improv]. Their rates are determined by an end-to-end control mechanism based on measurements of packet delay and loss rates. Another approach is to look at the MAC protocol directly. Dong et al. propose and examine selective error checking (SEC) at the MAC layer of 802.11 [@dong04:_selec_mac_ieee]. They make use of the fact that speech bits can tolerate errors, but should be protected for optimal quality reproduction. Simulation results showed that the speech quality can be substantially improved by modifying the MAC layer with SEC to suit the Narrow-Band Adaptive MultiRate (NB-AMR) codec.

Filali looks at a MAC tuning approach [@filali04:_dynam]. He exploits the properties of multimedia applications in IEEE 802.11-based wireless networks by limiting the number of retransmissions of a data frame by a source until the reception of a link-level acknowledgment from the destination.

In 2005, the IEEE approved QoS enhancements for local area network applications called IEEE 802.11e. Garg et al. examine using the IEEE 802.11e protocol for voice applications [@garg03:_using_ieee]. The Enhanced Distributed Coordination Function (EDCF) has been proposed as a MAC protocol. EDCF assigns incoming packets at each node to four different priority classes called access categories (AC). Each AC has its own channel access function. This is in contrast to the standard Distributed Coordination Function (DCF), where all packets use the same access function to the channel. Separate access functions for the different categories means assigning deferral times, minimum contention windows, and the number of back-off stages per type of service.

Garg et al. looked at 802.11e’s ability to fulfill the goals of improved QoS and higher channel efficiency. They investigated the response of the protocol to options in the protocol parameters and showed that the Hybrid Coordination Function (HCF) reduces channel contention and provides improved channel utilization. Both MAC coordination functions, EDCF and HCF, are sensitive to protocol parameters which are dependent on the scheduling algorithms. They conclude that further investigations need to be conducted.

Kawata et al. propose a dynamic Point Coordinator Function (PCF) for improved capacity [@kawata05:_using]. They suggest two new media access schemes, dynamic point coordination function (DPCF) and modified DPCF (DPCF2). The claim is that the capacity of VoIP traffic can be increased by up to 20% in 802.11b networks. They show how a significant improvement in the end-to-end delay with mixed VoIP and data traffic can be achieved. Delay is maintained at approximately 100 ms in heavily loaded traffic conditions, whilst at 60 ms in normal traffic conditions.

Lindgren et al. [@lindgren03:_qualit_servic_schem_ieee] evaluate four mechanisms for providing service differentiation in IEEE 802.11
networks. The evaluated schemes are the PCF of IEEE 802.11, EDCF of IEEE 802.11e extension, Distributed Fair Scheduling (DFS), and Blackburst. Using simulation they looked at throughput, medium utilization, collision rate, average access delay, and delay distribution for a variable load of real time and background traffic. The simulations showed that the best performance is achieved by Blackburst. PCF and EDCF are also able to provide good service differentiation. DFS can provide relative differentiation and consequently avoids starvation of low priority traffic.

Currently voice occupies relatively little of the IP wireless access capacity and the majority of voice traffic is carried by the cellular networks. Research in combining these two has been published within the context of voice roaming [@calvagna03:_wifi_gprs; @marsh506:Design]. Exploring voice quality in IP networks continues to be an active research area [@TRITA-EE_2006:016; @varela05:_study_effec_fec_voice_traff].

WiMAX

Cross-layer methods

A new design principle is being applied to wireless VoIP systems: cross-layer design [@Matt0306:Source; @Poppe0401:Choosing; @aguiar03:_chann_sched_voip_mpeg4_chann_predic]. (Was Application Level Framing (ALF) an early example of cross-layer design?) Wireless networks have rejuvenated the interest in cross-layer design, which refers to the sharing of information across different network layers to increase the efficiency of the whole system. By not observing the strict layering of classic network design, it may be possible to couple elements that are closer than the layer boundaries suggest.

Cross layer and joint channel/source coding

The mobile handset is a good example of successful cross-layer design. One concrete example is joint channel and source coding.

802.11 and quality mechanisms

Hoene et al., "Voice over IP: Improving the Quality over Wireless LAN by Adopting a Booster Mechanism -- An Experimental Approach" [@Hoene0108:Voice].

VoIP and handovers

[[subsect:voip_handovers]]{#subsect:voip_handovers label=”subsect:voip_handovers”}

Voice quality can suffer if there are radio coverage problems, interference from external sources, or excessive network load. The range for good quality varies from a few metres to a hundred metres depending on the equipment in use, obstacles, interference sources, and so on. Therefore the second scenario is to switch calls between the local wireless and cellular infrastructures in order to provide call continuity outside the coverage area of the wireless LAN. As mentioned, mobile phones and PDAs are now available with both cellular and 802.11 interfaces. This provides an option for switching to the cellular network when needed. Alternatively, if local wireless coverage is detected during a cellular call, a switch to the local network is possible, thus freeing cellular resources and potentially avoiding the cellular operator's tariffs. Entering a home or office area is a typical scenario in which a cellular call could be transferred to the local 802.11 network. The procedure of switching an ongoing call from one technology to another is known as a handover or handoff. Ideally the user should be unaware of the change; if this is the case it is known as a seamless handover. The current technological barriers to seamless handovers are the configuration and connection establishment mechanisms rather than the switching of the voice stream itself. Switching a voice stream means receiving two parallel streams at the same terminal over different networks. Once the streams run in parallel to the terminal, the initial stream can be stopped and the new voice stream played to the caller instead.

Voice Call Continuity (VCC) has so far been the standardized way to handle these kinds of handovers. VCC provides the infrastructure for performing vertical handovers, but suitable triggers are still needed in order to perform them in a timely fashion. As of Release 8, the 3GPP has proposed a generic framework for media service handovers called IMS Service Continuity, which applies the same principles to other media services.

As call quality is paramount, the timing of handovers from the WLAN to the cellular network is important. In the case of radio problems there might be insufficient time to initiate and start a call to the cellular network. In the case of handover due to the onset of congestion, the handover success depends on the rates of the other flows. This is due to the time needed to estimate the call quality and if need be, to initiate a cellular-based call. In the other case where a user would move out of the coverage area, there should be time to schedule the handover. The speed and path of the user movement can be tracked to estimate whether the user is moving out of coverage. In this case there is a design tradeoff: To maintain connectivity in the coverage area as long as possible to minimise the frequency of handovers on the one hand, or to reduce the probability of poor quality and switch early on the other. Therefore more conservative or aggressive switching algorithms can be envisaged.

One solution is to use the 802.11 network where possible, but to hand a call over to the cellular network when the link conditions are insufficient to support good quality, as stipulated in the problem statement. How to schedule this handover has been addressed in paper H. Real-world voice handovers typically need time to initialise the parallel technology being switched to. As calls to the public phone network take in the order of five seconds to set up, estimation of deteriorating quality conditions in the 802.11 network must anticipate (at least) this interval ahead of the handover. The relation of this work to the problem statement lies in the heterogeneity of the systems and in providing good speech quality to the users.
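A hedged sketch of such an anticipating trigger (the threshold, the linear trend extrapolation, and the names are our illustrative assumptions, not the mechanism of paper H):

```python
# Trigger a WLAN-to-cellular handover early enough that the cellular leg
# (~5 s call setup, per the text) is up before quality becomes unacceptable.
SETUP_TIME_S = 5.0       # anticipated cellular call-setup time
MOS_THRESHOLD = 3.0      # illustrative minimum acceptable quality estimate

def should_handover(current_mos, mos_slope_per_s):
    """current_mos: latest quality estimate; mos_slope_per_s: observed trend."""
    predicted = current_mos + mos_slope_per_s * SETUP_TIME_S
    return predicted < MOS_THRESHOLD

print(should_handover(3.7, -0.3))   # True: quality predicted to fall to 2.2
print(should_handover(4.2, -0.1))   # False: predicted 3.7, still acceptable
```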

In paper \cite{}{=latex} an automated handover mechanism was implemented on a PDA running Windows CE. The call quality is estimated in the terminal based on network measurements, and a third-party application is signalled that the current call should be transferred from the 802.11 network to the cellular network. The handover was triggered when the quality fell below a quality threshold. The implementation allowed automatic roaming from 802.11 to GSM networks. The goal of the implementation was to show proof of concept, as well as to judge differences in the speech quality at the time of handover.

There is still much research to be done in the voice handover area, including monitoring the network conditions at the handset. As alluded to earlier, tight integration achieves the best results, and in the case of dual-radio phones, prediction of impending problems is the key criterion. Not included in this research is the possibility of making use of tracking, i.e., estimating the position or path of the user. This would greatly influence the decision of whether to switch a call to an alternate technology.

Pahlavan et al. present an overview of handover issues with a focus on hybrid mobile data networks [@pahlavan00:_handof]. They propose a neural network solution for handovers between 802.11 and GPRS networks and show its performance to be good. The E-Model as standardized by the ITU-T allows the prediction of voice quality based on network QoS parameters [@G.107]. However, it is not useful for our purposes because it does not take the signal strength and delay jitter into account. Recent work by Hoene et al. proposes a real-time implementation of PESQ called PESQlite [@Hoene2005:Thesis]. It reduces complexity by making simplifications to the PESQ algorithm, e.g. using constant-length test samples and no time alignment of the degraded samples. Our off-line method has a slightly different purpose: to obtain a mapping between consecutive packet loss and the PESQ MOS score. Dimitriou et al. cite interference and users moving out of range as limiting factors for good VoIP quality in WLANs [@dimitriou03:inter_telep_wlans].

Their solution is to use better speech coding, and they suggest an enhanced version of G.711 to make the speech more resilient to loss. Kashihara and Oie developed a WLAN handover scheme for VoIP that makes use of MAC-layer information on the number of retransmissions of the voice packets [@kashihara05:_handov_manag_number_retries_voip_wlans]. If this number exceeds a certain threshold, the system switches to multi-path transmission of the packets. As soon as one of the WLAN interfaces reaches a stable condition, it can be used for single-path transmission. Fitzpatrick et al. propose a transport layer handover mechanism using the stream control transmission protocol (SCTP) [@fitzpatrick06:_approac_trans_layer_handov_voip_wlan]. The mechanism uses the multi-homing feature of SCTP and measures network performance metrics by sending probes. Handover decisions are based on speech quality estimations utilizing the ITU-T's E-Model.

Handovers are essentially a matter of layer-2 (L2) management.

Mobility support

Paul company

Operator Issues

Iyad VoIP company

Ericsson

A historical view of VoIP from the operators

In short, it is not hard to see that a battle between, or a merger of, these two communication paradigms will occur; the Internet 'drivers' are keen to capture a percentage of the world's voice market and use the Internet as the carrier for voice, whilst the telecom manufacturers are attempting to put Web and data services onto their handheld devices, currently under the guise of 3G. From the operators' point of view, it is very important to save money by using only IP networks.

The telecom industry has also advanced whilst the Internet was becoming part of our lifestyles, namely in the widespread arrival of mobile telephony. Clunky, old analog phones are being replaced at a tremendous rate by small, sleek, fashionable digital phones, transforming the way we communicate and finally freeing us from the twisted copper pair.

The outcome is likely to be, of course, some sharing of the market, and the stark differences between data and telecom have, in fact, become more blurred. The traditional telecom operators own, operate, or buy capacity on data networks, giving them an opportunity to use cheaper network infrastructures and to save money by merging two or three different infrastructures (voice, data, cable TV) onto one cheaper alternative, namely the IP infrastructure. This means that voice is carried on a data network.

3GPP

The Internet revolution initially bypassed the traditional telecommunications equipment manufacturers and operators. However, the 3rd Generation Partnership Project (3GPP), established in 1998, brought together a number of commercial, organizational and standardization bodies to work on integrating IP into their solutions for mobile communication. 3GPP has already standardized the use of an IP based core network. Today telecommunication companies are deploying the 3GPP IP Multimedia Subsystem (IMS) to merge Internet technologies with mobile networks. So called 'Release 5' enables operators to upgrade their existing telecommunication equipment and allows a smooth transition to IP technology. IMS is based upon the Session Initiation Protocol (SIP). The upcoming 3GPP Long Term Evolution (LTE) standard will use IP in both the access and core networks to carry data and voice traffic.

Currently, local wireless IP voice services have not reached significant market penetration, as current handsets and infrastructure are dominated by the telecommunication industry's 2nd and 3rd generation standard solutions. There can be voice quality issues with today's data-centric LAN technologies, mainly due to coverage and heavy load situations. These are discussed in the next section.

IMS

Voice over LTE

https://en.wikipedia.org/wiki/Voice_over_LTE

Cellular calls.

Local wireless and operators

Generic Access Network (GAN), formerly known as UMA (Unlicensed Mobile Access), is one possibility for providing seamless roaming between local and wide area networks [@ETSI:GAN]. GAN allows voice, data, and IMS/SIP applications to be accessed from a mobile phone. The operation of GAN is as follows: once a local wireless network is detected (e.g. Bluetooth or 802.11), the handset initiates a secure IP connection through the local network to a gateway in the operator's network. A GAN server makes the handset appear as if it were connected to a new base station. Thus, when the handset moves from a cellular to an 802.11 network, it appears to the core network as if the handset is simply associated with a different base station. There is GAN support for 2nd and 3rd generation cellular technologies.


Gateways

Bluetooth

VoLTE

Wifi 6

User aspects

This section is divided into two parts, quantifying quality and some standardized approaches for calculating it. As tools and methods have been developed for the telephony industry, it seems natural to re-use them for Internet telephony where appropriate. We will introduce two standardized methods for estimating VoIP quality as they are used within this dissertation. For a more in-depth treatment of objective and subjective methods consult [@raake06:speech]. From the user’s perspective, there should be no major quality difference between telephony being carried by the Internet and a regular telephony network.

Quantifying quality

Although most people have a good feeling for what good quality (or, more accurately, fidelity) means during electronic communication, it is not straightforward to translate this into measurable parameters of a system. First, the system we are dealing with is a distributed system, and each component has its own individual attributes. Second, people are involved in the assessments, adding inevitable human variation. Third, people are adaptable, so ratings tend to change over time. Finally, the situations differ from environment to environment.

The simplest form of quality rating for speech would be something descriptive, for example 'EXCELLENT' for a speech sequence that was almost glitch-free, down to 'POOR' for one that was barely understandable. Different words could be used, or any number of intervals between the extreme choices; however, studies have shown that, in a descriptive setting, three intermediary steps are reasonable. Numerically it is somewhat easier to obtain a finer scale, but more than ten intervals often leads to fuzziness between the intervals.

Humans and subjective topics

In this section we consider subjective quality scores, i.e. those where some form of human test has been conducted. Where humans are not included in the tests, the results may be found in other sections.

[@Su9906:Factors] Su, Srivastava, and Yao, Investigating factors influencing QoS of Internet phone.
[@Hamdi1297:Fresh] Hamdi et al., Fresh Packet First Scheduling for Voice Traffic in Congested Networks.
[@watson0010:good] Watson and Sasse, The good, the bad, and the muffled: the impact of different degradations on internet speech.
[@wilson0012:Investigating] Wilson and Sasse, Investigating the Impact of Audio Degradations on Users: Subjective vs. Objective Assessment Methods.
[@wilson0009:Do] Do Users Always Know What's Good For Them? Utilizing Physiological Responses to Assess Media Quality.
[@DeVl00:Quality] De Vleeschauwer et al., Quality Bounds for Packetized Voice Transport.
[@sun00:end-to-speech] Sun et al., End-to-end Speech Quality Analysis for VoIP.
[@sun00:voip] Sun et al., VoIP Speech Quality Simulation and Evaluation.
[@Jans0004:Delay] Jansson et al., Delay and Distortion Bounds for Packetized Voice Calls of Traditional PSTN Quality.
[@aldini1099:comparing] Aldini et al., Comparing the QoS of Internet Audio Mechanisms via Formal Methods.
[@Jian0205:Speech] Jiang and Schulzrinne, Speech Recognition Performance as an Effective Perceived Quality Predictor.
[@cole01:voip] Cole and Rosenbluth, Voice over IP Performance Monitoring.
[@Moha0104:Integrating] Mohamed et al., Integrating Network Measurements and Speech Quality Subjective Scores for Control Purposes.
[@Hoene0304:Importance] Hoene et al., On the Importance of a VoIP Packet.
[@Hoene1003:Impact] Hoene, Impact of Single Frame Loss Events.
[@sun02:perceived] Sun and Ifeachor, Perceived Speech Quality Prediction for Voice over IP-based Networks.

Measuring quality

Determining an accurate quantitative measure for human speech fidelity is desirable, but impossible. The best one can achieve is a qualitative rating that has been established in a rigorous and controlled manner; typically, test listeners rate, for example, speech coder performance under controlled auditory conditions. This can be expensive and time-consuming. There are tools and methods that map qualitative assessments to quantitative values, but they will always be, to some degree, approximate. If one can show, however, that there is reasonable correlation between the qualitative and quantitative results, and under what conditions the correlation holds, then this solution may be acceptable to some users. Some objective tools, such as those which use signal processing techniques, have shown this correlation and hence have found acceptance within the community. Developers can therefore claim, with some degree of confidence, that their techniques give results close to those real people would give.

Quality tolerances

When human speech is uttered, the time taken from when the pressure waves leave the mouth to the sensation of hearing is a fraction of a second for a nearby speaker. We have evolved to expect, and actually need, to hear our own voice, in order to be sure that we are saying what we really intend to. Human speech and hearing have, however, developed through face-to-face meetings, so extra visual or body cues are available when uncertainty is present. An example of such ‘understanding’ is when a language is being spoken that we do not understand: we can sometimes guess the meaning from gestures, facial expressions, and intonation.

Impaired speech, on the other hand, requires extra concentration from the listener: we are not used to processing distorted or missing segments, and visual and auditory cues become more difficult to interpret. Communicating with people from afar is somewhat similar: we do not receive the original speech samples, and visual cues are harder to see.

In IP voice communication systems visual cues are non-existent, making intelligibility all the more important. In order for speakers to hear their own voice, a very short delay is introduced between capturing the voice and replaying it to the speaker. This is particularly applicable when using headsets. The introduced delay is on the order of 5 ms.

As far as the delay in the system is concerned, it is obviously desirable to keep it below some maximum, which is on the order of half a second. Delay is discussed from a networking perspective below. Recent results have shown that delay is not as significant as once postulated, at least in VoIP systems; traditional telephony standards have been much stricter with respect to delay budgets [@G.114]. If one is not in a highly interactive conversation, higher delays can be tolerated than those suggested by telecommunication standards. This is particularly true when people use computers: users expect delays (operating system hiccups), and their delay expectations of the communication system become correspondingly relaxed.

If users are engaged in quick voice exchanges, delays will frustrate their conversational style. How to introduce the factor of interactivity into an objective quality measure is therefore still under research. The following studies have looked at conversational interactivity [@varela2005:varela; @05:_study_of_relat_between_subjec; @Froehlich1004:Elements; @Reichl:C04hot; @Hammer:C05the]. The last reference in this list examines the potential impact of interactivity on the perceived quality of Internet telephony services.

Where delays and losses are experienced at the same time, it has been shown that the influence of losses is much more significant with respect to the perception of quality degradation than the influence of delay. This implies that people are able to make a transition from highly interactive scenarios to a more measured communication style. In fact this transition appears to be somewhat bilinear, that is, the quality degradation from an interactive mode to a simplex conversation mode occurs in two linear steps, with the break at about 400 ms. Varying delays can be disturbing, due to the listener not being allowed to settle into a single mode of operation. For more information on the influence of delay on Internet telephony see [@Boutremans0312:Delay].

Quality and noise

The quality of voice communication actually depends on many (independent) factors. The effect of noise, be it in the electrical circuitry or in the surrounding environment, can be a determining factor in the perceived quality.

The quality of the components is a key issue in voice systems. Lower quality components can leave voice sounding thin, i.e. lacking bass. Background noise caused by poor grounding or shielding of the analogue components is frequently experienced as a low frequency hum in the system. Internet telephony systems that use on-board sound cards can introduce noise of this nature into the signal. USB headsets are helpful in this respect, and they also alleviate the need for echo suppression.

The environment is another factor: a noise source can be remote (distant from the speaker) or local (close to the speaker). In the remote case, the non-speech parts of the signal should be suppressed so as not to interfere with the spectral analysis of the voice processing. Undesirable noises of similar frequencies and volumes will otherwise be encoded into the signal, sent, and reproduced for the listener. Listening to a remote speaker in the presence of background noise is often more difficult than when the background noise is local.

Research in the signal processing field has studied the issue of noise in systems [@sallberg:analog_circuit_implem]. Important speech parameters such as intelligibility, clearness, and naturalness can be improved by signal processing using digital, analog, or hybrid solutions. A robust, low complexity speech enhancement algorithm has been proposed to demonstrate the respective advantages of purely digital, purely analog, and hybrid digital-analog implementations in [@sallberg:speec_enhan_implem].

In terms of testing systems with controlled noise, the ITU conducts tests with standardized background noises using the modulated noise reference unit (MNRU) [@ITU:P810]. Typically, well defined patterns of fixed modulated noise are presented at the beginning of each test. Each sample represents an example distortion corresponding to a five grade impairment scale (excellent to poor). The MNRU has been used extensively in subjective performance evaluations of conventional telephone and wide-band voice systems.

The ITU-T E-model

The E-model is intended as an off-line planning tool, although due to its simple form it has found application in on-line assessment as well. Network planners can input parameters from a system and obtain a numerical value (between 0 and 100) representing an estimate of the perceived quality. One important point of the E-model is that loss, delay, speech coding, and echo parameters are combined linearly to calculate the so called impairments that result in the score; the E-model assumes the parameters are independent. Another important (selling) point of the E-model is that the numerical scores correlate well with subjective tests, indicating that this estimation is indeed possible. Since the linear combination is simple, and most of the parameters are easily measurable, the E-model has been popular for a number of years.

The E-model also indicates how network impairments and speech coding can be combined to give an approximate estimate of voice quality. It is important to state that there are many tunable parameters included in the model, 19 in fact, not including the different speech encodings and loss concealment methods. Interestingly, jitter is not explicitly included as an input parameter. As jitter can affect whether packets arrive in time for playout or not, late packets for a real-time audio application are akin to network loss or delay, which are included in the model.

The ITU-T E-model was first proposed in 1998 [@G.107]. Some important IETF standards were also first published in the same year namely RTSP [@IETF:RFC2326] and SDP [@IETF:RFC2327].

::: {#table:Rmodel_user}
  User satisfaction               R-value   MOS score
  ------------------------------- --------- -----------
  Very satisfied                  90        4.3
  Satisfied                       80        4.0
  Some users dissatisfied         70        3.6
  Many users dissatisfied         60        3.1
  Nearly all users dissatisfied   50        2.6

  : The ITU’s E-model and MOS scores
:::

Table 3{reference-type=”ref” reference=”table:Rmodel_user”} shows scalar values known as the R-value derived from the computational model. They are relatively consistent with subjective scores, i.e. real user estimations of the speech quality, shown by their respective mean opinion scores (MOS). Mean opinion scores are derived by replaying samples to a naïve set of listeners who rank the quality on a scale from 5 (best) to 1 (worst). The R-value is defined as shown in equation [eqn:rvalue]{reference-type=”ref” reference=”eqn:rvalue”}.

$$\begin{aligned}
R = R_o - I_s - I_d - I_{e-eff} + A
\label{eqn:rvalue}\end{aligned}$$

$R$ = rating value\
$R_o$ = signal to noise ratio (noise sources)\
$I_s$ = voice impairments to the signal (side-tones and quantization distortion)\
$I_d$ = delay impairments\
$I_{e-eff}$ = packet loss impairment (including random packet losses)\
$A$ = advantage factor (compensation of ‘other’ factors)\

Each of the impairment factors is calculated and subtracted from the maximum of 100 to obtain the R-value. The impairment due to delay is denoted by $I_d$. Two cases are defined: $I_d = 0$ if the absolute delay ($T_a$) is less than 100 ms, i.e. no impairment, and an increasing $I_d$ if the delay is over 100 ms. A number of amendments have been made to incorporate non-random losses into the model [@ITU01054:prediction; @ITU0304:E-model]. The effect of packet loss on the R-value is given by the $I_{e-eff}$ term, which is defined in the E-model as:

$$\begin{aligned}
I_{e-eff} = I_e + (95 - I_e) \cdot \frac{P_{pl}}{P_{pl} + B_{pl}}
\end{aligned}$$

$P_{pl}$ = packet loss probability
$B_{pl}$ = packet loss robustness

For G.711, $I_e = 0$. This means for situations without loss, G.711 provides the best speech quality. The advantage factor $A$, is a value that indicates how tolerant users can be when using telecommunication equipment. It can be seen as a willingness to trade quality for operational convenience. One example is with mobile telephony, where users accept lower quality since they have the luxury of being mobile. One other example could be an advantage factor, as mentioned, where higher delays are tolerated when using a computer as a communicating device rather than a telephone.
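
To make the calculation concrete, the following Python sketch combines the R-value equation and the $I_{e-eff}$ formula above with the standard G.107 mapping from R to an estimated MOS. The default parameter values (Ro, Is, and the Bpl for G.711 without loss concealment) are illustrative approximations of commonly cited figures, not authoritative G.107 defaults.

```python
def ie_eff(ie, bpl, ppl):
    """Effective equipment impairment:
    I_e-eff = I_e + (95 - I_e) * Ppl / (Ppl + Bpl), with Ppl, Bpl in percent."""
    return ie + (95.0 - ie) * ppl / (ppl + bpl)

def r_value(ro=94.8, i_s=1.4, i_d=0.0, ie=0.0, bpl=4.3, ppl=0.0, a=0.0):
    """R = Ro - Is - Id - Ie-eff + A; the impairments are assumed independent."""
    return ro - i_s - i_d - ie_eff(ie, bpl, ppl) + a

def r_to_mos(r):
    """The standard G.107 conversion from an R-value to an estimated MOS."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

# Example: approximate G.711 values (Ie = 0) with 2% random packet loss
r = r_value(ppl=2.0)
print(f"R = {r:.1f}, estimated MOS = {r_to_mos(r):.2f}")
```

Comparing the printed value against Table 3 gives a quick sanity check of the mapping between R-values and user satisfaction.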

Perceptual Evaluation of Speech Quality (PESQ)

[Figure: the functional units of PESQ]{#fig:pesq_structure width=”30%”}


  PESQ (MOS) score   Linguistic quality   Degradation
  ------------------ -------------------- -------------
  4.5                Excellent            None
  4                  Good                 
  3.5                Good / Fair          Moderate
  3                  Fair                 
  2.5                Fair / Poor          Severe
  2                  Poor                 
  1                  Bad                  

  : PESQ, MOS and linguistic terms

Although the E-model is popular for estimating quality from network parameters, it has shortcomings. As we have seen, the bursty effects of packet loss on speech quality are not well addressed in the E-model. The ITU therefore later developed a scheme that improves on the E-model by estimating the impact of speech coding and losses on the speech signal itself. The solution, “Perceptual Evaluation of Speech Quality”, more commonly known as PESQ, addresses these issues [@ITU:p862].

The idea is to estimate the degradation due to coding and loss on a speech sample using a model of the human auditory system. Figure 5{reference-type=”ref” reference=”fig:pesq_structure”} shows the functional units of PESQ. A reference speech signal is transmitted through a network, resulting in a quality degradation corresponding to the coding used and the network losses. PESQ analyzes both the reference and the degraded signal and calculates their representations in the perceptual domain based on a psychoacoustic model. The disturbance between the original and degraded speech signals is calculated by a quality estimation algorithm, and a corresponding subjective mean opinion score (MOS) is derived. The evaluation of speech quality using PESQ is performed off-line due to its computational complexity. Assuming a 20 ms packetization and an eight second sample, a sequence corresponds to 400 packets. As an indication of the time needed to compute a PESQ score, a sequence with ten losses requires approximately two seconds of processing time for G.711 coded speech on a Pentium III computer. G.711 yields the maximum PESQ score (4.5) in the absence of loss; however, it is particularly sensitive to packet loss, even when concealment is used.

PESQ’s validity has been shown by its ratings being sufficiently correlated with subjective ratings, as discussed in the introduction to this section. More recent research correlating PESQ with subjective scores shows that some small transformations are needed to better align PESQ with MOS [@Rix:C03comparison].

Other measures

Recent work by Hoene et al. proposes a real-time implementation of PESQ called PESQlite [@Hoene2005:Thesis]. The motivation is that standard PESQ is too computationally expensive for real-time use. PESQlite therefore reduces the complexity by making simplifications to the PESQ algorithm, e.g. by using constant length test samples and omitting time alignment of the degraded samples. PESQlite is currently only available for G.711 coding.

Another alternative for an objective measure is to use machine speech recognition as a MOS predictor [@Jian0205:Speech]. The technique uses a word recognition ratio metric to reliably predict perceived quality. The relative word recognition ratio, obtained by dividing the absolute word recognition ratio by its value at 0% loss, is speaker-independent, whereas the absolute word recognition ratio of a speech recognizer is speaker-dependent. The results show that human and machine based recognition techniques are correlated, although not linearly. It has also been found that the human-based word recognition ratio does not degrade linearly once packet loss exceeds 10%, due to performance limits of the codec.

P.OLQA was selected in September 2010 to form the new ITU-T voice quality testing standard, P.863. P.OLQA is the next-generation voice quality testing algorithm for fixed, mobile, and IP based networks.

Objective methods for speech quality

Essentially, PESQ-style methods estimate speech quality based on a psycho-acoustic model of human perception by comparing the degraded speech sample with its clean version in the perceptual domain. As far as related work is concerned, Lakaniemi et al. looked at combining small scale measurements with subjective MOS scores [@Laka0106:Subjective], and [@conway0204:ip_itu_t] looked at passive measurements using PESQ evaluation techniques. Hoene et al. suggest a new perceptual model for adaptive VoIP applications that takes into account real-time factors such as delay spikes or changes in the coding mode [@hoene2004:spects]. Wilson and Sasse look at the larger picture of objective versus subjective techniques in speech degradation assessment [@wilson0012:Investigating].

Speech encoding

Human speech has a fundamental frequency in the range of 85-155 Hz for men and 165-255 Hz for women, with higher tones or harmonics audible up to around 10 kHz. Faithfully reproducing this full frequency range would require a sampling rate of at least twice the highest frequency. In a voice transmission system, the speech is sampled and then digitized according to the quality required (or the restrictions) of the transmission system. In a system such as traditional telephony, the available channel bandwidth is not sufficient to faithfully accommodate the full frequency range of human speech.

The vocoder was invented in the late 1930s as an implementation of a model of the human sound production system. Vocoders are often known as analysis-synthesis systems: the input speech is passed through a multiband filter, and each filter output is passed through an envelope follower. The signals from the envelope followers are transmitted, and the decoder applies the amplitude controlled signals to corresponding filters in the synthesizer. The main motivation for this type of system was to cryptographically encode the signals during transmission. Delta modulation appeared in 1952; it is the simplest form of differential pulse-code modulation (DPCM), where the difference between successive samples is encoded into a one bit stream. Also in the 1950s, the Lincoln Laboratory at MIT conducted a study of pitch detection in speech, which led to vocoders designed to reduce the speech bandwidth. The first LPC ideas came about in 1966 from work done at NTT in Japan, and in the late 1960s early real-time versions of LPC coders were implemented. The first workable LPC encoder was the US government’s LPC-10 coder, developed in the early 1980s [@Trem8204:Government]; the ten in LPC-10 signifies the number of coefficients it used. 1964 saw the standardization of PCM waveform coding for fixed telecommunication networks; the implications of this choice are still with us today.

Well before the modern Internet was devised, people were investigating alternatives to the traditional telephony system for carrying voice. The earliest accounts of packet switched voice can be found in the signal processing community, where researchers and engineers were looking for computationally efficient methods of compressing voice for transmission over low bandwidth links. In fact, advances in low data rate coders and the deployment of a distributed packet switched network led to some of the earliest findings [@Magi73:Adaptive]. The details of the networking are often omitted, but the idea was to block-code voice for transmission; much of the focus was on LPC and entropy methods. Blankenship et al. described the Lincoln Laboratory digital voice terminal system in a technical note published in 1975. Accounts of the early days of vocoder work can be found in [@gray.05:_voip], and the small amount of networking in [@cohen99:_realt_networ_packet_voice].

Moving forward a number of years, warped LPC, a variant of LPC in which the spectral representation of the signal is modified, was first proposed in 1980. Warping reduces the bitrate required for a given level of perceived audio quality/intelligibility. In 1985 the Code-Excited Linear Predictive (CELP) codec was introduced [@schroeder85:_code_excit_linear_predic_celp]. The ITU’s G.729 was standardized in 1996 [@ITU:g729], and in 1997 the Enhanced Full Rate (EFR) codec was standardized. More recently, intelligent multimode terminals have appeared that can adapt their configuration to different rates, quality, and robustness; these use the adaptive multirate (AMR) codec, standardized in 1998. For an account of the early vocoder research, consult [@Gold7712:Digital].

Pulse Code Modulation (PCM)

In narrowband telephony, the frequency bandwidth is restricted to 3100 Hz, ranging from 300 to 3400 Hz. Voice in the fixed telephony system therefore has to be reduced from its original range to about one third of it. The lower bound of the human range is below that of the telephony system. This is not as problematic as it may seem, due to the perceptual system’s ability to reconstruct the lower tones from their overtones. Traditional telephony does not use the low frequencies, as they are very hard to reproduce with inexpensive loudspeakers.

Quantizing the sampled waveform can be done either with constant steps between the sample levels or with non-constant steps; such systems are known as linear and non-linear quantizers respectively. From a 12 bit linear input signal, an 8 bit companded signal can be produced which has a similar signal to noise ratio to the original. Non-linear quantization has the advantage that the quantization performance is independent of the signal loudness; its disadvantage is lower accuracy for large amplitude signals. Two (similar) examples of non-linear quantizing encodings are the A-law and $\mu$-law companders. There are three main methods of implementing the $\mu$-law algorithm (a sketch of the third approach follows the list):

  1. One is using an amplifier with non-linear gain to achieve companding entirely in the analogue domain.
  2. The second is to use an analogue to digital converter with quantization levels that match the $\mu$-law algorithm.
  3. The third is to convert the 12 bit linearly quantized representation to $\mu$-law coding entirely in the digital domain.
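
As a minimal sketch of the third, fully digital approach, the continuous $\mu$-law companding curve can be applied to normalized samples; note that deployed G.711 actually uses a piecewise-linear (segmented) approximation of this curve.

```python
import numpy as np

MU = 255.0  # mu-law companding parameter

def mulaw_compress(x):
    """Compand linear samples in [-1, 1]: sign(x) * ln(1 + mu|x|) / ln(1 + mu)."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mulaw_expand(y):
    """Inverse companding back to the linear domain."""
    return np.sign(y) * ((1.0 + MU) ** np.abs(y) - 1.0) / MU

# Map linear samples to 8-bit codes (0..255) and back again
x = np.linspace(-1.0, 1.0, 9)
codes = np.round((mulaw_compress(x) + 1.0) * 127.5)
x_hat = mulaw_expand(codes / 127.5 - 1.0)
print(np.max(np.abs(x - x_hat)))  # absolute error is largest for large amplitudes
```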

In Europe, A-law coding is used. The A-law algorithm provides a slightly larger dynamic range than the $\mu$-law version, at the cost of worse proportional distortion for small signals. By convention, A-law is used on an international connection if at least one of the countries involved uses it. The G.711 standard encapsulates the A-law and $\mu$-law formats into a single standard [@ITU:g711]. G.711’s simplicity and good signal to noise ratio make it the default choice in the wired telecommunications infrastructure.

Adaptive differential pulse-code modulation (ADPCM)

Differential (or delta) pulse-code modulation (DPCM) encodes the PCM values as differences between the current and a predicted value. An algorithm predicts the next sample based on previous samples, and the encoder transmits only the difference between this prediction and the actual value. If the prediction is reasonable, fewer bits are needed to represent the same information. For speech, this type of encoding reduces the number of bits required per sample by about 25% compared to PCM. Adaptive DPCM (ADPCM) is a variant of DPCM that varies the size of the quantization step, allowing a further reduction of the required bandwidth for a given signal-to-noise ratio. The rate of ADPCM is 32 kb/s.
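
A first-order DPCM encoder/decoder pair can be sketched as follows. This is an illustration, not G.726: it uses a fixed predictor (the previous reconstructed sample) and a fixed quantizer step, whereas ADPCM adapts both.

```python
import numpy as np

def dpcm_encode(samples, bits=4):
    """Quantize the difference between each sample and the prediction,
    tracking the decoder's reconstruction to avoid drift."""
    step = 2.0 / (1 << bits)                       # fixed step over [-1, 1]
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    pred, codes = 0.0, []
    for s in samples:
        q = int(np.clip(np.round((s - pred) / step), lo, hi))
        codes.append(q)
        pred += q * step                            # decoder-side state
    return codes

def dpcm_decode(codes, bits=4):
    step = 2.0 / (1 << bits)
    pred, out = 0.0, []
    for q in codes:
        pred += q * step
        out.append(pred)
    return out
```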

Low bit rate models

Speech that is sampled and encoded using A or $\mu$-law at 8000 samples per second, with 8 bit resolution for each sample, produces a data rate of 64 kb/s. Current speech coding techniques can produce encoded voice at rates as low as 16 kb/s that is nearly indistinguishable in quality from the 64 kb/s codec. We will discuss some of these schemes shortly; however, it is first necessary to explain how humans produce speech, in order to understand the technique known as source-filter modeling.

Human production of sounds

The lungs produce a stream of air that enters the vocal tract, which comprises the pharynx, mouth, and nasal cavities. There are essentially two types of sounds: voiced and unvoiced. Voiced sounds such as /a/ or /e/ are produced by the vocal cords. Unvoiced sounds come in two types: the first is fricatives, such as /s/, /sh/, or /f/, which are produced when the vocal tract is constricted; the second is plosives, such as /p/, /k/, or /t/, which are produced when the end of the vocal tract is closed, pressure is built up, and then released suddenly. There are additional types of sounds, such as the nasal /n/, but we omit these from the following discussion.

Voiced and unvoiced segments

In order to encode and transmit speech at low bit rates, it is necessary to differentiate between the voiced and unvoiced sounds. As we will see, these sounds constitute different parts of a source-filter model, and are actually transmitted separately. Several techniques are available to separate them (a sketch computing these features follows the list):

  • Spectral flatness: the geometric mean of the power spectrum divided by the arithmetic mean. Unvoiced frames (typically 20 ms long) are flatter than voiced frames. The spectral flatness can also be measured within a specified sub-band of frequencies as well as across the whole frequency band.
  • Energy: the sum of the squared values of the sampled frame. Voiced frames have greater energies than unvoiced frames.
  • Zero crossing points: counting the sign changes in the signal; voiced frames exhibit fewer crossing points than unvoiced frames.
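
A sketch of all three features for a single frame is given below; the decision thresholds, which in practice are tuned and often combined with hysteresis, are omitted.

```python
import numpy as np

def frame_features(frame):
    """Voiced/unvoiced features for one frame (e.g. 160 samples = 20 ms at 8 kHz)."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + 1e-12
    # Spectral flatness: geometric mean over arithmetic mean of the power spectrum
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    energy = float(np.sum(frame ** 2))
    zero_crossings = int(np.sum(frame[:-1] * frame[1:] < 0))
    return flatness, energy, zero_crossings
```

Unvoiced frames tend towards a flatness near one, low energy, and many zero crossings; voiced frames show the opposite pattern.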

Source-filter models

The most popular technique within source-filter models is based on linear predictive coding (LPC). The basic idea is to model the speech generator as the human vocal system, described in the previous section: a simple buzzer at the end of a tube. The space between the vocal cords (the glottis) produces the buzz, which is characterized by its intensity and frequency (pitch). The vocal tract (the throat and the mouth) forms the tube, which is characterized by its resonances, known as formants.

The parametric coding process

Low bit rate coders estimate the formants, remove their effects from the speech signal, and then estimate the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the remaining signal after the subtraction of the filtered signal is called the residue. The formants and the residue can then be transmitted to recreate the voice at the receiver. Another term for this process is vocoding, a contraction of the words voice and coding.

Decoding or synthesizing the speech signal is done by reversing the process. The buzz parameters are used together with the residue to create a source signal. The formants are used to create a filter (which is the tube), and the source is run through the filter reproducing the original speech. The spectral information is well suited for vector quantization. Compression algorithms often differ in how the residuals are treated. Typically 30 bits are used to code the 10 coefficients for basic LPC quality, and up to 18 coefficients can be used for improved fidelity.
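
The analysis side can be sketched as follows, assuming a single pre-windowed frame; production coders add pre-emphasis, windowing, and quantization of the coefficients.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Autocorrelation method with the Levinson-Durbin recursion.
    Returns A(z) = [1, a1, ..., a_order] and the final prediction error."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    a, err = np.zeros(order + 1), r[0]
    a[0] = 1.0
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])) / err  # reflection coefficient
        new = a.copy()
        new[1:i] = a[1:i] + k * a[1:i][::-1]
        new[i] = k
        a, err = new, err * (1.0 - k * k)
    return a, err

def lpc_residue(frame, a):
    """Inverse filtering: what remains after the formants are removed."""
    return np.convolve(frame, a)[: len(frame)]
```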

Code excited linear prediction (CELP)

In an attempt to improve on the robotic sound of early LPC schemes, a number of improvements were made that have led to the methods used in modern codecs (see section 8.10 {reference-type=”ref” reference=”subsec:modern_codecs”}). Multi-excitation linear predictive coding (MELPC) is based on LPC but, instead of using a periodic pulse train for the voiced segments and white noise for the unvoiced segments, it uses mixed periodic and aperiodic pulses, a pulse dispersion filter, and spectral enhancement. The multi-pulse linear predictive coder (MPLPC) is an analysis-by-synthesis approach in which each excitation vector consists of a number of pulses whose amplitudes have been derived from a closed loop optimisation. CELP uses a codebook of excitation sequences as the excitation, rather than the multi-pulses of MPLPC; the optimum sequence is chosen to minimise the distortion between the derived signal and the original. At the decoder, the sequence of excitation signals is passed through a long term filter and an LPC vocal tract filter to produce a block of reconstructed samples. The bitrate of CELP coders is usually in the range of 5 to 15 kb/s.
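
The codebook search at the heart of CELP can be illustrated with a toy analysis-by-synthesis loop over a random codebook; real coders use adaptive plus fixed algebraic codebooks, perceptual weighting, and fast search structures. The LPC polynomial `a` is as returned by the `lpc_coefficients` sketch above.

```python
import numpy as np
from scipy.signal import lfilter

def celp_search(target, codebook, a):
    """Choose the excitation vector and gain whose output through the
    synthesis filter 1/A(z) is closest to the target subframe."""
    best_i, best_gain, best_err = 0, 0.0, np.inf
    for i, exc in enumerate(codebook):
        synth = lfilter([1.0], a, exc)          # run excitation through 1/A(z)
        gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
        err = float(np.sum((target - gain * synth) ** 2))
        if err < best_err:
            best_i, best_gain, best_err = i, gain, err
    return best_i, best_gain

# Example: 512 random excitation vectors for a 40-sample subframe
codebook = np.random.randn(512, 40)
```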

Transform coders

Transform coding tries to draw the best from the waveform tracking techniques used in PCM encoders while also including models of human speech production, as the source-filter models do. Knowledge of the speech signal is used to select which information to discard in order to lower the bandwidth of the signal. Transform coding derives its name from the frequency domain transforms whose coefficients are coded in a manner suitable for voice. Different transforms have been suggested for speech compression; we briefly consider just two, the Karhunen-Loève transform (KLT) and the discrete cosine transform (DCT). The KLT offers optimal coding performance (in terms of minimum square error) if the input samples are Gaussian distributed and the coefficients are scalar quantized; however, it is difficult to implement and its performance is signal dependent. The DCT is signal independent, but sub-optimal (compared to the KLT) in that it cannot completely de-correlate the transform coefficients. The DCT is nevertheless attractive, since computationally efficient algorithms exist to compute it and it retains the formant structure of the speech. The bitrate of transform coders is in the range of 10-20 kb/s, but they can produce better fidelity speech.
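
A minimal transform-coding sketch using the DCT is shown below; it simply discards high-order coefficients, whereas a real coder would quantize them with a perceptually motivated bit allocation.

```python
import numpy as np
from scipy.fft import dct, idct

def dct_frame_codec(frame, keep=32):
    """Transform a frame, keep only the first `keep` DCT coefficients,
    and reconstruct an approximation of the frame."""
    c = dct(frame, norm="ortho")
    c[keep:] = 0.0                 # crude stand-in for coefficient quantization
    return idct(c, norm="ortho")

# A 160-sample (20 ms at 8 kHz) frame represented by 32 coefficients
frame = np.random.randn(160)
approx = dct_frame_codec(frame)
```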

GSM, G.729 and iLBC codecs

GSM networks employ an LPC-based speech encoding technique called Code-Excited Linear Predictive (CELP) coding. The significant difference between CELP and plain LPC is that the excitation signals are not simply generated according to a voiced or unvoiced decision, but are taken from stored codebooks; two types of codebooks, fixed and adaptive, are used in conjunction to code the signal. ETSI’s GSM has defined voice codecs at different rates, ranging from 6 kb/s (half-rate) to 13 kb/s (full-rate). GSM was further enhanced in the mid-1990s by the GSM-EFR (enhanced full-rate) codec, a 12.2 kb/s codec that uses a full-rate GSM channel. GSM is one of the preferred speech coding schemes for wide area radio links. EFR is a fixed rate codec; however, some GSM networks now use Adaptive Multi-Rate (AMR) coding [@Bessette:A01]. AMR uses link adaptation to select from one of eight different bit rates depending on the instantaneous link conditions.

G.729 is another example of an LPC-based encoder, again a CELP codec. The coded stream consists of the linear prediction coefficients, the excitation codebook indices, and gain parameters. Technically it is known as a conjugate structure algebraic code excited linear prediction (CS-ACELP) scheme. The standard rate of G.729 is 8 kb/s; it takes 10 ms input frames and produces 80 bit output frames, and it includes a 5 ms lookahead, giving a 15 ms algorithmic delay. Annex B of the recommendation (G.729B [@ITU:g729b]) describes a silence compression scheme and a voice activity detection scheme. It also has a discontinuous transmission module, which estimates the background noise at the sender and can use a comfort noise generator at the receiver. G.729 is popular within VoIP applications due to its low data rate and the features just mentioned; a Skype call initiated on the Internet and terminating at a PSTN connection uses G.729 for the Internet part of the path. G.729 was developed by the University of Sherbrooke (Canada), the Nippon Telegraph and Telephone Corporation of Japan, and France Telecom in 1995.

The iLBC encoder from Global IP Solutions is a block-independent LPC coder [@s.v.02:_ilbc]. Whereas conventional LPC schemes have memory that leads to error propagation when packets are lost, iLBC encodes each frame as a separate block. It therefore has a controlled response to packet loss and exhibits robustness similar to that of PCM with packet loss concealment [@ITU:g711i]. The CPU requirements of iLBC are comparable to those of G.729A, but it yields higher basic quality. Although a narrow-band speech coder, iLBC uses the full 4 kHz spectrum rather than the 300-3400 Hz range of traditional telephony codecs, thus producing better fidelity. iLBC is popular in PC to PC communication and is found in tools such as Skype and GoogleTalk.

A book on speech coding and synthesis was published by Kleijn [@Kleijn:B95] (one of the creators of the iLBC codec). One of the first papers on IEEE 802.11 and VoIP was published in the same year [@visser95:_voice_data_trans]. As for speech coding, the first G.729 standard was released in 1996 [@ITU:g729]. As noted earlier, G.729 is an 8 kb/s LPC-based coder still used in many VoIP applications today, including the Skype application when using IP to telephony services, e.g. the SkypeIn and SkypeOut services. As the load on the Internet grew, studies of error recovery were published [@Bolot:C97; @IETF:RFC2733].

A good reference for speech coding algorithms is the 1994 review article [@Spanias9410:Speech]. More comprehensive books include Kleijn’s speech coding and synthesis from 1995 [@Kleijn:B95] and Oppenheim’s older classic [@dspapp]. For low bitrates in particular, consult [@561156]. Design considerations for speech coders for packet networks can be found in [@Lefebvre:C04ab].


Future directions for networked voice

Quality

Moving on to just one topic outside this dissertation, we believe that higher fidelity telephony should be available in the near future. Although the technology for transporting bits has improved, the media stream itself has not changed since the introduction of 64 kb/s voice many decades ago. From the user’s perspective, the voice quality of a 13 kb/s stream is actually worse than that of traditional telephony; however, we are prepared to pay this cost in order to have mobile telephony, and, of course, the operator can squeeze more calls out of the system without substantial investment.

The drive to reduce bitrates for calls has been to multiplex more calls onto capacity constrained links. However, as ever more capacity becomes available in both cellular and Internet technologies, the time is right for a new type of voice experience. One example would be to use higher fidelity than we are currently used to. This might be stereo voice; it would require headsets, but many mobile users already use such devices to listen to music.

3D telephony

Going one step further is 3D telephony. This enhances the experience for the listener by capturing binaural signals at the speaker, optionally rendering them in 3D space, and replaying the enhanced signal to the listener. Capturing the signals at the speaker can be done by placing small microphones on the outside of the headset, somewhat as noise canceling headsets do today.

Steps such as these would represent a new domain for telephony that has thus far been the preserve of specific environments such as audio conferencing. 3D telephony is very much under investigation, and significant challenges remain, particularly in the domain of noise cancellation, whether at the sender, the receiver, or both.

Rate friendly VoIP

Competing traffic

  • P2P traffic
  • Upgraded links
  • Our result that quality is improving

Differentiation

In telecommunications, triple play service is a marketing term for the provisioning, over a single broadband connection, of two bandwidth-intensive services, broadband Internet access and television, together with the latency-sensitive telephone service. Triple play focuses on supplier convergence rather than on solving technical issues or defining a common standard. However, standards like G.hn might deliver all these services over a common technology.


DOCSIS

The Data Over Cable Service Interface Specification (DOCSIS) is an international telecommunications standard that permits the addition of high-bandwidth data transfer to an existing cable television (CATV) system. It is used by many cable television operators to provide Internet access over their existing hybrid fiber-coaxial (HFC) infrastructure. The version numbers are sometimes prefixed with simply “D” instead of “DOCSIS” (e.g. D3 for DOCSIS 3).


A survey such as this reveals topics still to be addressed in the area. Ultimately, it is users who decide the perceived quality.

[^1]: In the case of wireless links this may mean that the user has to physically move to a location with better link properties.

[^2]: Most computers today utilize linear ADCs to digitize audio using 16, 24, or 32 bits per sample at typical sampling rates of 8 kHz, 16 kHz, 22 kHz, 44.1 kHz, 48 kHz, 96 kHz, or 192 kHz. Traditional telephony systems use eight-bit A-law or $\mu$-law logarithmic conversion at an 8 kHz sampling rate to support a large dynamic range.

[^3]: The estimate of 35% traffic savings is based upon a long term average of 24 or more simultaneous calls[@Cisco:7934].


References
