---
author:
- Gerald Maguire and Ian Marsh
bibliography:
- allinone.bib
- voip-chip.bib
date: May 2002
title: "Voice over IP: A research and historical perspective"
---

# Introduction {#sect:introduction}
This book is intended as a comprehensive survey of the data transport
phase of real-time voice communication over IP networks; we do not
cover the signaling or call setup phases. Books on the real-time data
traffic in VoIP systems are described in the following subsections. A
number of books cover a broader range of issues within
both IP telephony and VoIP
[@Uyless00:Voice; @Gilbert98:Voice; @Hers00:IP; @Marcus99:Voice]. Minoli
focuses solely on delivering voice over IP networks in
[@Mino98:Delivering], whilst Hardy looks at VoIP service quality in
[@Hardy03:Voip_Service_Quality] and Raake at "Speech Quality of VoIP:
Assessment and Prediction" in [@raake06:speech]. Other books on VoIP
include
[@01:_overv_pstn_compar_voice_over_ip; @Pacifici86:Issues; @Sinha:B98; @chapter:ld_celp_coding_mobile_radio; @chapter:speech_802.11_adhoc].
## This book
The primary goal of a VoIP system is to transport human voice from
microphone to loudspeaker or headset in a timely and quality-aware manner,
in order to deliver an interactive conversation quality that is
acceptable to most users. It should be stressed here that there is no
inherent problem in transporting human voice over any kind of packet
switched network. The main difficulty arises when there is either a high
packet loss rate on some element of the path or a high end-to-end delay.
The first difficulty can arise when the network becomes congested, there
is a routing problem, or one or more links has a high packet loss rate
(due to the link itself, as opposed to congestion caused by others using
the link).
In the case of congestion, both the network and the end-systems should take
appropriate actions to deliver good quality voice; a number of schemes
to address this are given in this book. Since there is little
end-to-end regulation on the Internet, admission control has to be done
by the end-systems or intermediary nodes, which must estimate the current
conditions. Note that many ISPs do not provide different priorities
for different traffic classes.
In the case of routing problems, the packets may not be able to find a
path from the source to the end destination. The existence of a path may
change during the VoIP session (i.e., during a call). Other than using
source routing, there is generally little that the end users can do to
address this problem beyond selecting one or more ISPs that offer a
suitable service level agreement (SLA).
If a specific link is the source of high packet loss, and this is
the first or the last link, the user might utilize another available
link[^1] or increase the redundancy of their communication in order
to allow the receiver to utilize this redundancy to recover sufficient
data to provide acceptable quality.
# From mouth to ear via the Internet {#sect:From_Mouth_to_Ear_via_the_Internet}
Figure [fig:voice_journey]{reference-type="ref"
reference="fig:voice_journey"} shows the typical end-to-end path and
associated processing for the audio stream in a voice call. We assume
that the signaling process has already occurred to set up the call. In
this section we examine the path of speech from the caller’s mouth (at
the sender) to the callee’s ear (at the receiver) for a typical voice
stream. A similar path is assumed for the audio from the callee’s mouth
to the caller’s ear, with the corresponding roles of sender (callee) and
receiver (caller).
As a result of the call signaling or based upon a prior agreement
between the caller and callee, a particular speech coding style is
agreed upon. Unlike the case of encoding and decoding using pulse code
modulation over a circuit switched communication path, for a packet
switched network the end systems will collect audio samples for some
period of time to generate a speech frame. This period of time defines
the audio frame duration, and is generally referred to as the
packetization time. Each audio frame is typically 10-20 ms in duration
— depending upon the choice of encoding. Each speech frame is encoded
for transmission over a suitable transport protocol.
To put this packetization time into context, we can compare it to the
durations of various speech phenomena. Mark D. Skowronski, in his
lecture notes for his course in Automatic Speech Processing, gives these
as: "10 $\mu$s: smallest difference detectable by auditory system
(localization), 3 ms: shortest phoneme (plosive burst), 10 ms: glottal
pulse period, 100 ms: average phoneme duration, 4 s: exhale period
during speech."[@Skowronski2003]
Given these speech phenomena durations and a 10 ms packetization time,
an average phoneme will be spread over roughly 10 speech frames, while
the shortest phoneme could fit within a single speech frame. We will
return to these times when we consider what the effects of a missing
frame at the receiver might be (see subsection
[subsect:Packet_level_error_control]{reference-type="ref"
reference="subsect:Packet_level_error_control"}).
In the following subsections we will start with an audio signal emitted
from the speaker’s mouth and captured by a microphone, then follow the
audio stream until it is emitted by a loudspeaker as audio destined for the
listener’s ear. We will begin by assuming that the audio content that is
to be communicated is the result of a human speaker and that the audio
is to be reproduced for a human listener. Later we will examine what
happens when we relax this assumption.
## Input devices
Commonplace these days is the use of USB headsets, where the microphone
is in front of the speaker’s mouth and the user wears headphones over
their ears. The use of such headsets eliminates analogue feedback from
the headphones to the microphone, avoiding the need to do acoustic echo
cancellation. If a speaker phone or a traditional handset is used, it
will be necessary to perform acoustic echo cancellation — this is most
easily done in the subsequent digital signal processing (hence we will
ignore this issue for now).
## Analogue to digital conversion
The human speaker generates analogue sound pressure waves. These
pressure waves are detected by a microphone, which produces an electrical
signal that is sampled at some sampling rate; these samples are
digitized by an analogue-to-digital converter (ADC). The sampling process
converts a continuous analogue signal into discrete analogue samples.
Each sample is quantized and converted to a digital value. The
quantization can be linear or logarithmic[^2]. The number of bits used
to represent the sample as a digital value is referred to as the
resolution of the ADC. As a result of the sampling, quantization, and
conversion to digital format, the voice signal is now a stream of digital
values, with a new value produced at the selected sampling rate.
## Speech coding {#subsec:speech_coding}
Given the stream of digitized audio samples, a local application encodes
the voice and packetizes it for transmission. It is important to
understand that in the case of voice over IP, the result of the
encoding process is an encoded speech frame that will be encapsulated
in a packet, not a continuous stream of digital values sent over a
circuit switched connection to the receiver. The coder/decoder is
generally referred to as a CODEC. The input to the speech coding process
is speech samples and the output is coded speech frames.
The objective in speech coding is to encode the speech in such a way
that the audio information can be faithfully reproduced at the receiver.
Traditionally, a secondary objective of this speech coding process was
to reduce the amount of data that needs to be communicated to the
receiver, i.e., to compress the data. Simple encoding schemes process
each sample separately, while more complex schemes exploit properties of
the signal, and even more complex schemes exploit knowledge of speech
phenomena (how sounds are produced by humans) and the perception of the
decoded signal. Simple coding schemes such as ITU’s G.711 utilize A-law
or $\mu$-law logarithmic conversion at an 8 kHz sampling rate to produce
a 64 kbits per second data stream that is simply packetized with a fixed
number of samples in each packet. GSM 06.10 (ETS 300 961) encoding, by
contrast, utilizes regular pulse excitation with long term prediction
(RPE-LTP) coding to reduce the data rate to 13 kbits per second. This
data stream is again packetized into fixed sized packets and
transmitted.
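To make the per-sample logarithmic conversion concrete, the following
Python sketch implements the classic $\mu$-law compression step of G.711
(our own illustration of the standard algorithm, not code from any
particular CODEC implementation; the constants are the conventional bias
and clipping values):

```python
# Illustrative G.711 mu-law encoding of a signed 16-bit linear PCM sample.
BIAS = 0x84    # conventional bias added before the segment search
CLIP = 32635   # clip level so that sample + BIAS fits in 15 bits

def mulaw_encode(sample: int) -> int:
    """Encode one signed 16-bit sample to one 8-bit mu-law octet."""
    sign = 0x80 if sample < 0 else 0x00
    if sample < 0:
        sample = -sample
    sample = min(sample, CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit above bit 7
    exponent, mask = 7, 0x4000
    while (sample & mask) == 0 and exponent > 0:
        exponent -= 1
        mask >>= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    # The result is bit-inverted, as specified by G.711
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

# 20 ms of speech at 8 kHz = 160 samples -> 160 mu-law octets per packet
frame = [mulaw_encode(s) for s in range(160)]
assert len(frame) == 160
assert mulaw_encode(0) == 0xFF   # digital silence encodes to 0xFF
```

Each 16-bit sample is thus reduced to 8 bits, which at 8 kHz gives the
64 kbits per second stream mentioned above.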
Today the trend is to exploit higher data rates (if they are available)
to increase audio quality with the use of CODECs such as Extended
Adaptive Multi-Rate Wideband (AMR-WB+) Audio Codec [@IETF:RFC4352] and
Variable-Rate Multimode Wideband (VMR-WB) Extension Audio
Codec[@IETF:RFC4348; @IETF:RFC4424]. A motivation for these CODECs is
the encoding and decoding of both speech and audio – as increasingly a
communication session includes audio that is not simply speech. For
example, AMR-WB+ supports both monaural (or monophonic) and stereo
audio. The resulting monaural audio data rates range from 5.2-36 kbps,
while stereo audio results in data rates ranging from 6.2-48 kbps.
One of the authors (Maguire) believes that a future trend will be toward
personalized speech synthesizers – so that the quality of the perceived
speech is independent of the data rate of the communication path between
the sender and receiver. Such a scheme will be possible because of the
increased storage available in the end devices and the increasingly
personal nature of the end devices. The latter means that ‘my’ phone will
have a very good model of my speech, hence it can construct a very good
speech synthesizer for ‘my’ speech. The only information that will need
to be transmitted between the sender and receiver will be the marked up
output of the recognizer as input to the synthesizer.
## Packet level error control {#subsect:Packet_level_error_control}
Because the speech frames are encapsulated in packets (generally using
the Real-time Transport Protocol (RTP)) and generally transmitted to the
receiver using a transport protocol that does not provide for
retransmission of missing packets, the receiver must be able to
recognize when there are missing packets (and hence potentially missing
speech frames) and deal with this loss in some way. This raises four
questions:

- Why not use a transport protocol that provides for retransmission of
  missing packets?

- How does the receiver know that a packet is missing?

- What can the receiver do about missing packets?

- What can the sender do to assist the receiver, i.e., to increase the
  probability that the receiver will get sufficient information to
  re-generate the speech input?
The short answers to these questions are:

- Why not use a transport protocol that provides for retransmission of
  missing packets? Answer: The delay to request a retransmission and
  get the new copy of the missing packet is equal to the round trip
  delay between the receiver and the sender, plus some additional
  processing time. This added delay is likely to increase the total
  delay beyond the desired delay bound. Details will be described in
  section [subsect:artifact1_delay]{reference-type="ref"
  reference="subsect:artifact1_delay"}.

- How does the receiver know that a packet is missing? Answer: The
  receiver can recognize that a packet is missing when there is a gap
  in the sequence numbers of the received packets.

- What can the receiver do about missing packets? Answer: The receiver
  can attempt to estimate what was in the missing packet, replay the
  previous packet, stretch the content it has (for example, by
  time-warping), play comfort noise, or play silence. Details will be
  addressed in section
  [subsect:frame_level_error_control_PLC]{reference-type="ref"
  reference="subsect:frame_level_error_control_PLC"}.

- What can the sender do to assist the receiver, i.e., to increase the
  probability that the receiver will get sufficient information to
  re-generate the speech input? Answer: The sender can add redundancy,
  so that the probability that the receiver lacks the information
  needed to determine what to play is decreased. Extra data frames may
  be generated from the samples. Lost data is not uncommon in IP
  networks, so by producing copies, the probability of some number
  arriving at the destination is increased. This is particularly the
  case where voice is transmitted over wireless networks. Two common
  schemes employed are Forward Error Correction (FEC) and Multiple
  Description Coding (MDC). Note this step is optional, hence it is
  shown as a dashed box in Figure
  [fig:voice_journey]{reference-type="ref"
  reference="fig:voice_journey"}. We will return to these techniques in
  section [subsect:forward_error_correction]{reference-type="ref"
  reference="subsect:forward_error_correction"}.
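The sequence-number gap check described in the second answer can be
sketched in a few lines of Python. This is our own illustration, not
code from any particular VoIP stack; it assumes RTP's 16-bit sequence
numbers, whose wrap-around the modular arithmetic must handle:

```python
# Detecting missing packets from 16-bit RTP sequence numbers.
def seq_gap(expected: int, received: int) -> int:
    """Number of packets missing before `received`, given the next
    expected sequence number; 0 means the packet is the expected one.
    Computed modulo 2**16 so the wrap-around 65535 -> 0 is handled."""
    return (received - expected) & 0xFFFF

expected = 65534
for seq in (65534, 65535, 1):        # packet 0 is lost at the wrap-around
    gap = seq_gap(expected, seq)
    if gap > 0:
        print(f"{gap} packet(s) missing before sequence number {seq}")
    expected = (seq + 1) & 0xFFFF
```

A receiver would feed each arriving packet's sequence number through such
a check before handing the payload to the playout buffer.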
## Voice Activity Detection (VAD) and silence suppression {#subsect:vad}
Many applications implement Voice Activity Detection (VAD), so that if
the volume of the audio is below some threshold, the application
can perform silence suppression to avoid transmitting any content while
the speaker is silent. Silence suppression avoids generating unnecessary
traffic, but causes some problems at the receiver, which must
(1) decide what audio, if any, should be played where there is silence and
(2) cope with the fact that the arrival of voice frames is no longer
quasi-periodic.
Some applications, such as Skype, do voice activity detection but do not
employ silence suppression; instead they simply send packets with
empty payloads when a person is not speaking. The reasons for this are
(1) to transport comfort noise when one party is not speaking,
addressing the first issue above, as this comfort noise can now be
played by the receiver; (2) to maintain the UDP bindings at NATs along
the path of the RTP packets, so that these packets can continue to pass
through the NAT; and (3) when TCP is used to transport the voice,
continuing to send TCP segments maintains TCP’s congestion window. The
last two reasons are advantages of maintaining the quasi-periodic flow of
packets.
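The simplest form of VAD is an energy threshold on each frame. The
sketch below is purely illustrative (real VADs, such as the one in GSM,
are considerably more elaborate, and the threshold value here is an
arbitrary assumption):

```python
# Minimal energy-based voice activity detection for one speech frame.
def is_speech(frame, threshold=1000.0) -> bool:
    """frame: linear PCM samples for one packetization interval.
    Returns True if the frame's average power exceeds the threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy >= threshold

silence = [0] * 160                                    # 20 ms of silence at 8 kHz
tone = [2000 if i % 2 else -2000 for i in range(160)]  # loud artificial signal

assert not is_speech(silence)   # below threshold: suppress or send empty payload
assert is_speech(tone)          # above threshold: encode and transmit
```

A sender applying silence suppression would simply skip transmitting
(or send an empty payload) whenever `is_speech` returns False.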
## Synchronization information {#subsect:synchronisation_information}
Timing information is needed for the voice samples to be faithfully
replayed for the listener on the receiving system. For this to
take place, either the underlying communications system needs to be
synchronous (as is the case in traditional circuit switched telephony) or
the timing of the original speech frames needs to be recorded and the
receiver has to play out the speech frames with the same relative timing
as when they were recorded.
Since the speech frames are encoded into a block of data before
transmission it is possible to use a single timestamp for this whole
block to enable the receiver to estimate the proper relative timing for
playout of this encoded speech frame. This timestamp is derived from a
local timer which increments at the packetization rate. Details of this
timestamp and how it can be transmitted along with the encoded voice
frame are given in section
[subsect:rtp]{reference-type="ref"
reference="subsect:rtp"}.
## Sequencing information {#subsect:sequencing_information}
In order to deal with the fact that some of the IP packets may be lost
along the way it is useful to add a sequence number to each packet
containing encoded speech data. This sequence number can be used to
easily resequence packets if they arrive out of order, to detect missing
packets, and together with the timestamp to determine that silence
suppression has been done by the sender. Details of this sequence number
and how it can be transmitted along with the encoded voice frame are
given in section [subsect:rtp]{reference-type="ref"
reference="subsect:rtp"}.
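To see how the timestamp and sequence number travel together, the
following Python sketch packs the fixed 12-octet RTP header using the
field layout of RFC 3550 (an illustration under that assumption, not
code from any particular implementation; payload type 0 is G.711
$\mu$-law):

```python
# Constructing the fixed 12-octet RTP header (RFC 3550 layout).
import struct

def rtp_header(seq: int, timestamp: int, ssrc: int, payload_type: int = 0) -> bytes:
    version = 2
    first_byte = version << 6           # no padding, no extension, no CSRCs
    second_byte = payload_type & 0x7F   # marker bit clear
    return struct.pack("!BBHII", first_byte, second_byte,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

# For G.711 at 8 kHz with 20 ms packets, the sequence number advances by 1
# per packet while the timestamp advances by 160 (one per audio sample).
hdr = rtp_header(seq=1, timestamp=160, ssrc=0x12345678)
assert len(hdr) == 12
```

The encoded speech frame would be appended after these 12 octets before
the packet is handed to UDP.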
## Readying the operating system
In this section we discuss the operation of the terminals themselves.
Exactly how the voice is handled depends somewhat on the input devices
and types of terminal in use. For devices that have an operating system,
i.e., system software running in privileged mode, it is usually the
responsibility of the operating system to handle the voice data (samples
in this case). Device drivers are system software modules that interface
with the devices and manage the transfer of data when a device is
reading or writing. The device driver manages buffers both on the device
and in the system, ensuring that underflows or overflows do not occur.

This makes operation easy for the application: in the case of recording
voice, the application instructs the operating system to read from a
particular device and notifications are handled by the system. However,
copying the data between the application and the device could involve
unnecessary interrupts to the operating system for every block of
samples. Therefore some systems use a mechanism called Direct Memory
Access (DMA), in which some setup time is sacrificed to provide faster
and more effective transfers: data transfer can occur without the
intervention of the CPU once the DMA transfer has been established. At
this stage the operating system is ready to receive the block of coded
samples (plus optional redundant data) with the RTP header completed. We
have omitted some of the other field descriptions for conciseness.
## Addressing information
There are two types of addresses that are important for the media stream
in VoIP. The first is the IP address of the interface to which packets
containing speech frames should be sent. The second is the combination of
the transport protocol and the port number; this combination is used to
deliver these frames to the correct application.

The IP header includes a checksum to enable the receiver to
determine if there has been an error in the header, enabling the
receiver to check that the data has been received on the proper
interface. Additionally, the transport header has a checksum to ensure
that data is delivered to the correct instance of a running application
and that the data has not been altered during its transmission across
the network.
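Both of these checksums use the same ones'-complement Internet checksum,
specified in RFC 1071. As an illustration (our own sketch, not code from
any networking stack), it can be computed as follows; the final test
vector is the worked example from RFC 1071 itself:

```python
# Ones'-complement Internet checksum (RFC 1071), used by IPv4, UDP and TCP.
def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                            # pad to an even length
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # sum 16-bit big-endian words
        total = (total & 0xFFFF) + (total >> 16)   # fold any carry back in
    return ~total & 0xFFFF                         # ones' complement of the sum

# Test vector from RFC 1071 section 3
assert internet_checksum(b"\x00\x01\xf2\x03\xf4\xf5\xf6\xf7") == 0x220D
```

A receiver verifies a header by summing it including the checksum field;
a correct header sums to zero.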
In the case of UDP as a transport protocol, there has been some work to
enable the delivery of UDP datagrams that may have an error in some of
the user data. This enables datagrams to be delivered to applications
even though some of the user data may be damaged; otherwise the whole
datagram would be discarded — even though the application might be able
to make some use of the data that is actually delivered. This protocol
is referred to as "The Lightweight User Datagram Protocol (UDP-Lite)";
for further details see [@IETF:RFC3828].
Another transport protocol that might be used is the Datagram Congestion
Control Protocol (DCCP) – which provides bidirectional unicast of
congestion-controlled unreliable datagrams. DCCP has its own sequence
numbering scheme and DCCP also provides feedback about loss rate to the
sender, via DCCP acknowledgment options. For further details about DCCP
and its interaction with RTP see section 17 of RFC 4340[@IETF:RFC4340].
## Packet construction
The format of a VoIP packet is as shown in Figure
[fig:packet_format]{reference-type="ref"
reference="fig:packet_format"} (not to scale). The minimum IPv4 header
is 20 octets long (while an IPv6 header is 40 octets long), the UDP
header is 8 octets long, and the RTP header is 12 octets — for a total
of 40 (or 60) octets of header information. Optional headers can further
swell the header size. Voice payloads tend to be on the order of 160
octets (for 20 ms of speech encoded with G.711). Additionally there may
be link layer headers; for example, the Ethernet frame header is 14
octets long while a WiFi header is 58 octets. There may also be
additional media access control layer and link layer overhead — such as
minimum interframe intervals in the case of shared media. Sending
encoded voice traffic in IP packets is rather inefficient (due to the
large overhead for a small amount of user payload), unless some form of
header compression or multiplexing with other traffic is used. Despite
this, if silence suppression is done there is a very large gain in
efficiency for a conversation, as network traffic is only generated when
there is information to convey; for a conversation this may eliminate
35% of the total traffic that a circuit-switched call would have
generated [^3].
Figure: VoIP packet format (not to scale). IPv4/IPv6 header (20/40
octets) | UDP header (8 octets) | RTP header (12 octets) | CODEC output
(160 octets); the headers total 40/60 octets.
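A back-of-the-envelope check of these overhead figures, for 20 ms of
G.711 speech (160 payload octets) over IPv4 and IPv6, can be written as:

```python
# Header overhead vs. payload efficiency for a G.711 VoIP packet.
RTP, UDP, IPV4, IPV6, PAYLOAD = 12, 8, 20, 40, 160

for name, ip in (("IPv4", IPV4), ("IPv6", IPV6)):
    headers = ip + UDP + RTP
    efficiency = PAYLOAD / (headers + PAYLOAD)   # payload share of the packet
    print(f"{name}: {headers} header octets, efficiency {efficiency:.0%}")
```

Only 80% of each IPv4 packet (and roughly 73% of each IPv6 packet) is
actual voice data, even before link layer headers are counted, which is
why header compression is attractive.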
For the routing of IP voice data, the normal IP routing mechanisms
apply. Path information from a company, home, or university network is
provided to the backbone using interior link-state routing protocols
such as IS-IS or OSPF. In the backbone network, routes are determined by
peer agreements and the inter-domain routing protocol BGP. In an MPLS
network, voice traffic may be given its own label switched path if the
particular operator has sufficient traffic for it to be worthwhile.
## Header compression
A balance between the payload and header overhead is desirable: not to
transport too little voice data relative to the protocol headers, but
also not to transport too much payload in case the entire packet is
lost. Therefore, header compression schemes for IP telephony have been
developed; they are particularly desirable when wireless links are being
used. The basic idea is to exploit the redundancy in successive headers:
often only a few fields change between consecutive packets, and these
differences can be coded rather than the whole header. Header
compression is stateful, and the compressor and decompressor must remain
synchronized for the method to work. Usually a full header is sent after
a period of time to ensure the sender and receiver are fully
synchronized.
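The core idea can be illustrated with a toy delta encoder (a deliberate
simplification; real schemes such as ROHC, RFC 3095, are far more
sophisticated and also compress the deltas themselves):

```python
# Toy illustration of header compression: send only the fields that
# changed relative to the previous packet's header.
def diff_header(prev: dict, cur: dict) -> dict:
    """Return only the fields of `cur` that differ from `prev`."""
    return {k: v for k, v in cur.items() if prev.get(k) != v}

h1 = {"src": "10.0.0.1", "dst": "10.0.0.2", "seq": 100, "ts": 16000}
h2 = {"src": "10.0.0.1", "dst": "10.0.0.2", "seq": 101, "ts": 16160}

delta = diff_header(h1, h2)
assert delta == {"seq": 101, "ts": 16160}   # addresses need not be resent
```

The decompressor reconstructs the full header by applying the delta to
its stored copy of the previous header, which is why both sides must
stay synchronized.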
## Framing
The packets are now passed to the network access processing stage.
Further addressing information is needed to identify the recipient of
this packet on the local network, i.e., the other end of the link.
Therefore extra local network information is added, commonly referred to
as the link address. In the examples we will use WiFi Medium Access
Control and Ethernet link addresses. The link addresses are used
to identify the recipient, confirm the sender is who they claim to be,
and, in the case of WiFi, send back an acknowledgment.
## Link access
We will consider two technologies for network access: Ethernet and
802.11. Access differs slightly between these technologies, so they are
given separate sections; here we consider only their common
functionality. Both must listen to the medium before a transmission in
order to avoid collisions, which would result in the bits on the medium
(and at the receiver) being garbled. Listening before transmitting is
known as Carrier Sense Multiple Access (CSMA). All the stations must
listen to the "carrier" and potentially many can attempt an access at
any one time. Both Ethernet and WiFi have a waiting scheme for when the
channel is busy. The waiting time is randomized in order to inhibit
transmissions colliding. Subsequent transmission failures result in
further doubling of this (still randomized) waiting time, hence the
scheme is known as exponential backoff. Otherwise, assuming the medium
is free, the frame is transmitted.
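The randomized exponential backoff just described can be sketched as
follows (an illustration of the general scheme; the contention window
values shown are typical 802.11-style parameters, used here as
assumptions rather than a claim about any specific standard revision):

```python
# Randomized exponential backoff: the contention window doubles after
# each failed attempt, and the station waits a random number of slots.
import random

def backoff_slots(attempt: int, cw_min: int = 16, cw_max: int = 1024) -> int:
    """Random wait (in slot times) before retry number `attempt` (1-based)."""
    window = min(cw_min * (2 ** (attempt - 1)), cw_max)
    return random.randrange(window)   # uniform in [0, window)

for attempt in range(1, 5):
    window = min(16 * 2 ** (attempt - 1), 1024)
    print(f"attempt {attempt}: window {window} slots, chose {backoff_slots(attempt)}")
```

The randomization makes it unlikely that two stations that collided will
pick the same retry instant again; the doubling spreads stations out
further under heavy load.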
### Ethernet
Ethernet uses CSMA/CD, where the CD stands for collision detection,
meaning it can detect if the frame has been mangled by simultaneous
transmissions. This is done by the sending station comparing the frame
"on the wire" with a locally kept copy; where they differ, the sender
can deduce that a collision occurred. If this is the case, a
retransmission is scheduled after the mandatory randomized waiting time.
No explicit acknowledgment by the recipient is needed when a frame is
correctly received. This functionality is somewhat rooted in the past,
when Ethernet really was a shared medium (hence the name). In recent
years, however, switched Ethernet has become much more commonplace, where
each machine has a direct connection to an Ethernet switch. This allows
for much higher throughput and no collisions, since the wires are not
shared. Ethernet speeds have increased from 10 to 100 to 1000 to 10,000
million bits per second. Once the frame has been carried across the
Ethernet it is in the memory of the Ethernet switch, ready to be moved
to the next element; the Ethernet link layer header will have been
removed.
### IEEE 802.11
IEEE 802.11 uses a variant of CSMA called CSMA/CA, where the CA
signifies collision avoidance. WiFi terminals cannot detect during
transmission whether a frame was garbled by a collision; one reason is
that it is impossible for the sending station to transmit and listen
simultaneously as in the Ethernet case, an artifact of radio
transmission. Therefore, in the WiFi case positive acknowledgments are
used to indicate whether frames were received by the receiving station.
If the acknowledgment frame is not received, the sender will assume the
frame was lost and retransmit it. This is repeated a number of times,
typically 4-7. If all retransmission attempts fail, the frame is
discarded and it is then the responsibility of the higher protocol
layers to take further action. For real-time traffic such as Internet
telephony, no further action is taken; one reason is that UDP is used,
which was not designed for reliable data transfer and hence, unlike TCP,
has no retransmission capabilities. It is also possible that a frame is
received but contains errors, e.g., bits changed by interference; to
detect this case, a link layer checksum is computed and verified.
### IEEE 802.11 access points
WiFi usually operates in infrastructure mode, which means an access point
acts as the bridge between the wireless and wired networks. Access
points have the same rights as the other stations when it comes to
network transmissions; that is, by default, they have no priority access
to the wireless network. Often, however, they are better at receiving
weak signals than their wireless clients, because they have better
reception electronics and antennas. They also have internal queuing
mechanisms, so when the wireless network is busy and they have frames to
deliver downstream, the frames can be temporarily stored within the
unit. Modern access points also have the ability to differentiate types
of traffic, voice and data for example, and certain traffic types can be
prioritized in the queuing process. IEEE 802.11 access points connect to
the switched Ethernet in the same way as other computers in the local
network. Once the frame has been carried across the wireless network it
is in the memory of the access point, ready to be moved to the next
element; the physical and link layer headers are removed.
Additionally, non-licensed spectrum technologies are more prone to
disturbances and loss of frames [@byoung-jo99:_at_t_labs]. However,
802.11 provides link layer retransmission protection that can
alleviate frame loss on wireless access links to some degree, at the
expense of a little delay. Other sources of problems for IP-based voice
are heavy traffic loads on shared links, poorly dimensioned links,
long-delay link technologies (e.g., satellite links), and misconfigured
equipment.
## IP gateways and onwards
Once the Ethernet or IEEE 802.11 frame has been received by the switch,
its link layer header is removed and it is then the responsibility of
the IP infrastructure to move the packet towards the receiver. The first
major switching point is often the IP gateway. The gateway is often seen
as the border between a cluster of computers in an organization and the
Internet; it is usually not possible to administer machines on the
outside of the IP gateway. An administrative organization such as a
university or ISP will connect to other entities via an Internet
eXchange (or "IX"). Dual circular fiber rings running at tens of
gigabits per second are common. These tend to be geographically located,
one per city for example. The packet must pass the gateway to find its
route over the Internet.
## Core networking
One of the technology remnants from ATM is layer 2 switching:
Multi-Protocol Label Switching (MPLS) is a carrier technology for IP
packets. Basically, MPLS switches on labels that are added to IP packets
at the ingress of an MPLS network. IP packets that belong to a call are
all labeled identically and switched over a dedicated path; therefore
link dimensioning for IP telephony becomes much simpler using MPLS. In
an MPLS network, label switch routers sit at the center of the network
and label edge routers at its extreme points. Even so, it is possible
that some of the voice packets will be lost, due to congestion in the
routers, discarding algorithms such as RED, or link problems.
## Tiering and peering
A tier 1 network is an IP network which connects to the entire Internet
via settlement-free interconnection, also known as peering. These
networks are also known as transit-free, because they do not receive a
full transit table from any other network. A tier 2 network is one that
peers with some networks but purchases IP transit to reach at least some
portion of the Internet. Finally, a tier 3 network is one which solely
purchases transit from other networks to reach the Internet.

Technically speaking, peering is the voluntary interconnection of
administratively separate Internet networks for exchanging traffic
between the networks’ customers. Settlement-free means that neither
party pays the other for the traffic; each earns income from its own
customers. Peering requires physical interconnection of the networks and
an exchange of routing information through the Border Gateway Protocol
(BGP). Peering agreements vary from simple informal agreements to
detailed contracts.
Once the packet leaves a local ISP or company, it is usually the
responsibility of these larger networks to carry the voice data. Within
these networks, both the amount of multiplexing and the capacity
increase.
## Routing principles
IP packets are transported across the Internet based on two pieces of
information: the address in the IP header of the packet, and tables of
information in network routers that are maintained by a separate set of
Internet protocols. In most cases the information within the protocol
header is not changed; the information within the routers’ tables,
however, may change. Routers are often connected to more than one link.
For a particular packet, the router decides which output link should be
selected in order for it to be best forwarded towards its destination.
This is achieved by matching the destination address in the IP header of
the packet against the router’s forwarding table and choosing the entry
with the longest match. Each entry in the table has an associated link,
and the packet is output on that link. Where there is no match, the
packet is sent on a default route. In simple routers, a default and a
backup route are all that is needed. This process is repeated until the
final router is reached, and it is then that router’s responsibility to
deliver the packet to the destination computer or terminal.
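The longest-match rule can be made explicit with a small Python sketch.
Real routers use specialized data structures (tries, TCAMs); a linear
scan over illustrative example prefixes is shown only to demonstrate the
rule itself:

```python
# Longest-prefix-match forwarding over a toy routing table.
import ipaddress

TABLE = [   # (prefix, outgoing link) -- illustrative example entries
    (ipaddress.ip_network("0.0.0.0/0"), "default"),
    (ipaddress.ip_network("192.0.2.0/24"), "link-1"),
    (ipaddress.ip_network("192.0.2.128/25"), "link-2"),
]

def lookup(dst: str) -> str:
    """Return the outgoing link for destination address `dst`."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, link) for net, link in TABLE if addr in net]
    # Choose the most specific (longest) matching prefix
    return max(matches, key=lambda entry: entry[0].prefixlen)[1]

assert lookup("192.0.2.200") == "link-2"   # /25 beats /24
assert lookup("192.0.2.1") == "link-1"
assert lookup("8.8.8.8") == "default"      # no specific match: default route
```

The `0.0.0.0/0` entry plays the role of the default route mentioned
above: it matches every address but loses to any more specific prefix.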
## Network Address Translators (NATs) and private addresses
In order to support the large number of Internet hosts, the idea of a
private IP addresses was introduced. Each host does not a unique IP
version address. The idea is that a host can use a non-global address
inside the network. Communication with the outside world will look like
the data all came from the same host. A network address translator (NAT)
is a computer that converts the local addresses to its global address
and vice versa. It basically holds a mapping in its memory. Internet
telephony packets have to pass through this process if the originating
computer has a private address (typically 192.168.X.X or 10.0.X.X).
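A minimal sketch of the mapping a NAT holds in memory; the class name,
port range and data structures are illustrative assumptions, and real
NATs also handle mapping timeouts and per-protocol details:

```python
class Nat:
    """Toy port-translating NAT: (private IP, private port) <-> public port."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000          # illustrative public port pool
        self.out = {}                   # (priv_ip, priv_port) -> public_port
        self.back = {}                  # public_port -> (priv_ip, priv_port)

    def outbound(self, priv_ip, priv_port):
        """Rewrite an outgoing packet's source address and port."""
        key = (priv_ip, priv_port)
        if key not in self.out:         # allocate a mapping on first use
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.out[key])

    def inbound(self, public_port):
        """Look up where an incoming packet should be delivered."""
        return self.back.get(public_port)
```

Repeated packets from the same internal address and port reuse the same
mapping, which is what allows replies (and RTP streams) to find their
way back in.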
Long distance links
We have discussed the local link technologies; however, there are also
technologies for carrying frames over longer distances. A particular
link technology sits on top of a physical technology. Over long
distances this is usually optical fiber, since light in fiber suffers
far lower attenuation than electrical signals in copper. Fiber links
have also become relatively cheap as semiconductor lasers and
Erbium-doped fiber amplifiers have become available. The optical fiber
used in undersea cables is chosen for its exceptional clarity,
permitting runs of more than 100 kilometers between repeaters to
minimize the number of amplifiers and the distortion they cause. The
first transatlantic telephone cable to use optical fiber was TAT-8,
which went into operation in 1988.
There are two dominant optical standards, SDH and SONET. Both SDH and
SONET are widely used today: SONET in the U.S. and Canada and SDH in the
rest of the world. Both SONET and SDH can be used to encapsulate earlier
digital transmission standards, such as the PDH standard, or used
directly to support either ATM or so-called Packet over SONET/SDH (POS)
networking. SDH and SONET are not protocols per se but are carriers for
voice and data traffic. Developments in using lasers of different
frequencies (or colors) at the same time, known as wavelength-division
multiplexing (WDM), have brought about large bandwidth increases in
recent years.
ATM
Asynchronous Transfer Mode (ATM, sometimes referred to as B-ISDN) is a
wide-area switching technique that uses asynchronous time-division
multiplexing. It encodes data and voice into small fixed-size cells and
provides data link layer services, running over layer 1 (physical)
links in the ISO layer model. The small cell size (53 bytes) suits both
ordinary datagrams and real-time voice. ATM uses a connection-oriented
model and establishes a virtual circuit between endpoints. One of the
reasons for using a small cell size is to reduce jitter when
multiplexing data streams. Dejittering traffic is important when
carrying voice: evenly spaced streams do not need sophisticated playout
schemes, nor do they suffer loss (silence or distortion) in the voice
stream. A discussion of the role of ATM in telephony and IP networks is
given in [@Mainwaring:A00].
Other works on queuing analysis, jitter guarantees and dimensioning
buffers in ATM networks include
[@mammeri99:delay; @Land9702:Multiplexing; @Biersack:C92; @Blef9808:Dimensioning; @He0003:Queueing; @Yang95:novel; @Zahirazami0502:Channel; @Walke:A01].
Importantly, in this decade IP and ATM were competing technologies, with
ATM keeping voice foremost in its multiservice solution. The ATM Forum
proposed five different circuit emulation services, depending on the
capacities required. Although both IP and ATM were technically viable
for both voice and data, the more flexible data transport structure of
IP, together with the development of the HTTP protocol and lower
hardware costs, effectively decided the competition in IP's favor.
[@mammeri99:delay] "Delay Jitter Guarantee for Real-Time Communications
with ATM network"
[@Land9702:Multiplexing] Landry and Stavrakakis published "Multiplexing
ATM traffic streams with time-scale-dependent arrival processes"
[@Biersack:C92] Biersack, "Performance Evaluation of Forward Error
Correction in an ATM Environment". Abstract: "If the packet loss rate in
a network is higher than the loss rate requested by an application, the
transport protocol must make up for the difference in loss rate. In high
bandwidth-delay-product networks, the latency introduced by
retransmission-based error recovery schemes may be too high for
applications with latency constraints. In this case, forward error
correction (FEC) can be used. FEC allows recovery from loss without
retransmission. The amount of loss recovered strongly depends on the
loss behavior of the network. FEC works best if losses are dispersed in
time. We use simulation to study the loss behavior of an output-buffered
cell multiplexer for three different traffic scenarios. Our results show
how the loss behavior of the multiplexer is affected by the traffic mix
and the statistics of the sources. The more bursty the sources, the
higher the loss rate and the higher the probability that losses will
occur in bursts. Based on simulation results, we develop a mathematical
model for the performance of FEC, when applied to multiplexed traffic,
and compute the effectiveness of FEC for the three traffic scenarios.
FEC is not effective for the two homogeneous traffic scenarios. However,
FEC reduces the loss rate for the video sources by several orders of
magnitude for a heterogeneous traffic scenario consisting of video and
burst sources."
[@Blef9808:Dimensioning] Blefari-Melazzi et al., "Dimensioning of
playout buffers for real-time services in a B-ISDN".
[@He0003:Queueing] He and Sohraby, "On the Queueing Analysis of
Dispersed Periodic Messages".
[@Yamamoto0997:Impact] Yamamoto and Beerends, "Impact of network
performance parameters on the end-to-end perceived speech quality".
[@Yang95:novel] Yang and Tsang, "A Novel Approach To Estimating The Cell
Loss Probability In An ATM Multiplexer Loaded With Homogeneous On-Off
Sources".
[@Zahirazami0502:Channel] Zahirazami et al., "Channel loss and queuing
loss tradeoffs in voice transmission over ATM switching systems".
[@Walke:A01] Walke et al., "IP over wireless mobile ATM-guaranteed
wireless QoS by HiperLAN/2".
Receiver operation: an overview
The role of the receiver in an Internet telephony system is, at a
relatively high level of abstraction, to recreate the original spoken
voice stream as accurately as possible and with the highest possible
quality for the listener. This is the responsibility of the telephony
application, which in turn interfaces with the operating system,
networking and sound software and hardware.
Frame reception
Assuming a frame for our voice application is either in the memory of
the Ethernet switch or in the access point, the following takes place.
Once the switch or access point has sent the frame to us, a hardware
triggered interrupt will alert the network software driver that it
should respond. Usually the network card’s driver is in listen mode
unless it is actually transmitting. The interrupt routine will ready the
operating system to receive the frame by allocating memory to hold it
and placing locks on the shared resources. Once received, the link layer
checksum is calculated and if correct, the frame is delivered to the
next layer of processing (IP). If the packet was received by an IEEE
802.11 card, an acknowledgement is sent back to the access point. The
IEEE 802.11 and Ethernet drivers can then delete their local copy of the
frame.
IP processing
Simplistically the destination IP address is used to route each packet
toward the target terminal.
The link layer device drivers expect to deliver the frame to the IP
layer. There are two versions of the IP protocol, versions 4 and 6.
Depending on the format of the received frame, the device driver will
select the software for the appropriate IP version. Most network drivers
(and operating systems) now support both these versions. Note that the
networking software is nearly always part of the operating system. The
IP code has to check that the packet is indeed destined for this
computer. It then calculates the checksum to ensure nothing has been
changed since the sender passed the packet to the link layer at their
end; the IP checksum covers only the header, not the payload. If the
checksum fails, the packet is discarded and no further action is taken.
The IP header is then removed, and the remaining data, called a
datagram, is delivered to the next layer, which is UDP.
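The header verification step uses the standard Internet checksum (a
16-bit one's-complement sum, RFC 1071). The sketch below is a readable
illustration rather than an optimised in-kernel implementation; a useful
property is that recomputing the checksum over a header that already
contains a correct checksum field yields zero:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 16-bit one's-complement checksum (odd length padded with 0)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]   # add next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)  # fold carry back in
    return ~total & 0xFFFF                       # one's complement
```

A receiver verifies a header by summing it including the transmitted
checksum field; a result of zero means the header is intact.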
UDP processing
The next step is for the UDP software to process the packet. It also has
a checksum that is calculated, but this one is applied to the whole
datagram. If the checksum fails then the datagram should be discarded.
In the Internet telephony case there is the possibility to retain the
packet as many of the bits may be useful to the application. An
alternative to UDP, known as UDP-lite, has been proposed for this
purpose. Recall also that the job of the UDP layer is to demultiplex the
data to the right program: it is almost certain that more than one
program is running on the host, and the correct one must receive the
datagram's data. The UDP port field identifies the receiving
application, and the UDP header is removed before the data is finally
delivered.
Receiver packet level error control
At this point, if either FEC or MDC was applied, lost packets can be
reconstructed. In media-dependent FEC, if one packet was lost, the
lower-bitrate version carried in the next packet can be extracted and
used instead. In media-independent FEC, the combination of received and
redundancy frames can be used to recreate the original flow. As stated,
whether this is possible depends on the losses and the amount of
redundancy applied. MDC needs at least one description of the data to
start reconstructing the original; more descriptions produce better
quality and lower distortion. After this redundancy phase, more packets
are available for further processing.
Dejittering
Packets transmitted over the Internet could arrive at a receiver with
inter-packet spacings which are not as they were sent. This is a problem
for the listener, who will hear the speech replayed in a time distorted
manner. It can be a problem for the speech decoder too, as some decoders
have time sensitive states that need to be maintained with respect to
time. To counter this problem a solution is to temporarily store the
packets in a holding buffer. The buffer can be used to absorb the time
differences and using the RTP timestamp recreate the original timing.
The buffer can either be a fixed size or be allowed to change over time.
In the fixed case, a sensible size needs to be chosen before the session
starts. In the variable case, the application makes a tradeoff between
additional loss and additional delay: loss results from buffers that are
too short, and delay results from a buffer length that is too
conservative. Temporarily holding the frames in a buffer is necessary
not only for correctly timed playout, but also for techniques such as
FEC/MDC decoding or finding frames for packet loss concealment.
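A minimal fixed-delay dejittering sketch, assuming an 8000 Hz narrowband
codec and an illustrative 60 ms playout delay: the receiver maps the
first packet's RTP timestamp onto its own clock and schedules every
later packet from that reference, so the original spacing is recreated
and packets arriving after their slot are treated as lost:

```python
RATE = 8000   # RTP timestamp units per second (assumed narrowband codec)
D = 0.060     # fixed playout delay in seconds (a tuning assumption)

def playout_times(packets):
    """packets: list of (rtp_timestamp, arrival_time_seconds), in order.
    Returns the playout time per packet, or None if it arrived too late."""
    t0_rtp, t0_arr = packets[0]          # reference: the first packet
    out = []
    for ts, arr in packets:
        # Recreate the sender's spacing from the RTP timestamps,
        # shifted D seconds after the first arrival.
        play = t0_arr + D + (ts - t0_rtp) / RATE
        out.append(play if arr <= play else None)
    return out
```

With packets sent 20 ms apart (160 timestamp units at 8000 Hz), jittery
arrivals are absorbed as long as no packet is delayed by more than D
relative to the first one.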
RTP processing
The source identification field (SSRC) is used to locate the correct RTP
flow within a session.
Frame level error control (PLC) {#subsect:frame_level_error_control_PLC}
Packet loss concealment is the name associated with the processing done
at a receiver to counter lost IP datagrams (packets). This term is
slightly misleading, as the actual signal processing is performed on
frames of speech data which were contained in the payloads of UDP
datagrams. Packet loss concealment is typically a receiver-based
technique, however there have been suggestions to perform the processing
within the network [@le1000:Active]. PLC computes fill-in samples for
gaps in the stream caused by lost packets; it has been shown that
playing something is perceived as better than an audible gap. Broadly,
techniques can be categorised as insertion, interpolation and
regeneration. Insertion is where samples are inserted into the gaps left
by missing packets. Options include silence frames, spliced frames
(so-called zero length), samples with noise that fades, extrapolation
where the voiced frames are extended to cover the gap, or samples
generated from pitch cycle waveform repetition.
Interpolation-based schemes work by "stretching" the received packets to
cover the gaps caused by lost packets. The G.711 concealment algorithm
G.711i works by constantly calculating the pitch frequency of the
received frames, and if a frame is lost the receiver reproduces a sample
based on the waveform thus far and successively lowers the amplitude
with each following frame. After five 20ms packets, the amplitude will
be zero and in effect no more concealment is audible. Regeneration is
where a totally new sample is created from what has been experienced
before. The effectiveness of regeneration schemes is highly dependent on
the quality of the past (and sometimes future) estimations.
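The fade-out behaviour described for G.711 concealment can be sketched
as pitch-cycle repetition with a gain that decays by a fixed step per
consecutive lost frame; this illustrates the idea only and is not the
exact G.711 Appendix I algorithm (which also overlap-adds and refines
its pitch estimate):

```python
def conceal(history, pitch_period, frame_len, losses_so_far):
    """Fill one lost frame by repeating the last pitch cycle of `history`.
    The gain drops 20% per consecutive lost frame (an illustrative rate),
    so output decays to silence after five lost frames."""
    cycle = history[-pitch_period:]              # last pitch cycle of samples
    gain = max(0.0, 1.0 - 0.2 * losses_so_far)
    return [cycle[n % pitch_period] * gain for n in range(frame_len)]
```

The first concealed frame is played at full amplitude; by the fifth
consecutive loss the gain has reached zero, matching the behaviour
described above where concealment becomes inaudible.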
Speech decoding
The term "speech coding" is typically reserved for operations more
complex than companding. In source-model coders based on linear
prediction, such as LPC, iLBC and G.729, the residuals and LPC
coefficients are converted back to raw PCM samples. Since iLBC was
devised for the Internet there is no hangover between consecutive iLBC
blocks, reducing the time dependence mentioned earlier.
Audible playout
In order to playout the speech, the raw samples must be passed to the
operating system. This is because hardware control is available only via
the operating system. Typically a device is set up in terms of bits per
sample, sampling rate (hertz), number of channels and volume. On
Windows-based systems there is a device-independent interface to do
this, known as DirectX. DirectX is a collection of programming
interfaces for multimedia applications (the X referring to all of the
subsystems); for VoIP the DirectSound component is used.
Some formats, WAV for example, contain a header with this information,
known as RIFF. If the device (or its software interface) matches the
specification for playback, it will be initialised for receiving the
data.
The device then converts the stream into an analogue format for output
to loudspeakers or headphones. Volume controls can be either digital or
analogue and the voice may be mixed with other streams.
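Reading the playback parameters from a WAV file's RIFF header can be
sketched with Python's standard wave module, which parses the header and
exposes exactly the fields a playback device must be configured with:

```python
import io
import wave

def playback_params(wav_bytes: bytes):
    """Return (channels, sample_width_bytes, sample_rate) from WAV data."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate()
```

For a typical narrowband telephony file this would return one channel,
two bytes (16 bits) per sample and an 8000 Hz sampling rate.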
Subjective experience
Although the computer controlled part of the journey is complete, the
voice still has to be heard, understood and experienced. The complete
voice path should be effectively constructed and optimised for this
stage. Humans seem to be relatively adaptive and can accept, and adapt
to, relatively poor conditions. It seems intuitive that we are better at
adapting to constantly poor conditions than to variable quality. This
implies poor quality can be accepted provided no further large quality
changes take place.
VoIP applications
[@Schu9207:Voice] "Voice Communication Across the Internet: A Network
Voice Terminal" [@Sisa9809:Multimedia] "The Multimedia Internet Terminal
(MInT)"
Sicsophone reduces delay through a novel receiver buffering scheme
[@Hagsand03:Low]. The solution uses the low-level features of audio
hardware and a specialized jitter buffer playout algorithm. Using the
sound card memory directly eliminates intermediate buffering. A
statistical approach for inserting packets into the audio buffers is
used in conjunction with a scheme for inhibiting unnecessary
fluctuations in the system. The authors compare the performance of the
playout algorithm against idealized playout conditions, and give
mouth-to-ear delay measurements for selected VoIP applications to
indicate overall system performance; the proposed mechanism is shown to
save hundreds of milliseconds on the end-to-end path.
VAT (Visual Audio Tool) [@Jaco9207:vat] is a well known VoIP tool that
implements a playout buffer similar to the one described, including a
circular buffer to hold the packets before playout.
[@Meylan0004:April] Meylan and Boutremans, "Realisation of an Adaptive
Audio Tool": "Real-time audio over the best effort Internet often
suffers from packet loss. At this time, Forward Error Correction (FEC)
seems to be an efficient way to attenuate the impact of loss."
[@Bolo9803:Adding] "Adding Voice to a Distributed Game on the
Internet". The authors examine issues related to adding voice between
participants in distributed virtual environments (DVEs), in particular a
special kind of DVE, namely distributed games over the Internet. They
consider all stages of voice manipulation, including voice generation
(with emphasis on echo cancellation), voice transmission (with emphasis
on RTP and packetization), and voice restitution (with emphasis on
spatial rendition and on synchronization between voice and visual cues).
They also consider implementation issues, illustrated with the MiMaze
game and the FreePhone audio tool, both developed at INRIA.
[@marjamaki9912:performance] "Performance evaluation of an IP voice
terminal" [@Jiang2002:QoS] "QoS Evaluation Of VoIP End-Points"
[@Neug9906:How] "How elastic are real applications?" From the abstract:
"Programs are typically developed with little or no consideration of
their performance under different system loads or the effect they may
have on other processes competing for the same resources. To an extent,
this stems from the 'virtual machine' approach promoted by most
mainstream operating systems. With operating systems which offer
mechanisms for fine-grained control of resource allocations it becomes
apparent that a central policy for allocating potentially scarce
resources is not sufficient. We are currently developing a toolkit which
allows programmers to systematically examine and assess the performance
behavior of a wide range of applications under different resource
allocations by determining the applications' utility curves. We argue
that such a toolkit is useful for the development of adaptive
applications as well as for the implementation of global resource
management policies. In particular, we argue that this is necessary for
the application of economic models to the area of resource management,
as proposed by some researchers."
Tools such as the Robust Audio Tool (RAT) came from UCL in London in
1995 [@Hard9506:Reliable]. RAT, with its simple redundancy scheme of
sending a compressed version of each packet in the following one, was an
early example of utilizing redundancy. RAT was intended for both group
and one-to-one conferencing. Somewhat surprisingly, RAT and VIC are
still maintained today as part of the AVATS project (formerly SUMOVER)
at UCL, London.
In the late 1990s, a tool called Freephone was developed by the Rodeo
group at INRIA in Sophia Antipolis, France which implemented FEC
mechanisms [@freephone]. At that time all the applications were UNIX
based, as this was the only (open) operating system for Internet
applications. Freephone adapts to network conditions, it includes a rate
control mechanism (adaptation to available bandwidth), a FEC-based error
recovery scheme (adaptation to and recovery from loss process in the
network), and an adaptive playout adjustment scheme which adapts to
delay variations.
Rosenberg et al. in [@Rose0003:Integrating] looked at combining
target-based playout algorithms with FEC schemes, and propose a number
of new playout algorithms based on this coupling. From the abstract:
"Transport of real time voice traffic on the Internet is difficult for
two reasons: network loss and network jitter. There has been substantial
research in developing protocols and algorithms to combat these
problems. Network loss is handled primarily through a variety of
different forward error correction (FEC) algorithms and local repair
operations at the receiver. Jitter has been compensated for by means of
adaptive playout buffer algorithms at the receiver. Traditionally, these
two mechanisms have been investigated in isolation. In this paper, we
show the interactions between adaptive playout buffer algorithms and
FEC, and demonstrate the need for coupling. We propose a number of novel
playout buffer algorithms which provide this coupling, and demonstrate
their effectiveness through simulations based on both network models and
real network traces."
Perhaps the best known IP telephony application is Skype. It
revolutionized telephony upon its introduction in 2003. It is a
cross-platform solution that became successful partly by embracing
recent technological developments, and because it provided good, free
and easy voice communication
[@world:_asses_skypes_networ_impac; @economist06:end]. The technological
developments it embraced were: Internet-specific speech coding, a
firewall bypass solution, a scalable call establishment system, and an
intuitive graphical user interface. Skype has continued to add
functionality such as inter-operability with the telephone system, a
payment scheme, and conferencing capabilities. More recently, the
developers have added video and SMS capabilities, and Skype is available
on some networked televisions.
Within the Skype network there are two classes of nodes: normal nodes
and super nodes. We will first discuss normal nodes. Normal nodes are
typically a home owner’s PC and are usually behind a home firewall
and/or an ISP’s NAT. These nodes typically have a private IP address
allocated to them. A private address is not globally routable and is
defined by specific address ranges which routers do not forward outside
the local network. The CPU processing power of normal nodes is also
assumed to be somewhat limited.
Super nodes, on the other hand, are well connected machines and must
possess a public IP address. A typical example might be a UNIX computer
on a university network. Due to their connectivity and processing
capabilities, super nodes perform routing and forwarding of Skype
signaling messages. The load on a super node is carefully monitored so
that Skype message processing does not interfere with the normal
operation of its host. Usually users are unaware that their computer has
been elected to super node status. The software distribution for normal
and super nodes is actually identical, with different routines being
invoked after initialization. The super nodes also forward login
requests on behalf of the normal nodes, if the normal nodes cannot reach
the login server.
On the first invocation of Skype, a normal node uses a pre-configured
list of permanent super nodes; it then receives an update of more recent
super nodes. The directory of Skype users is decentralized: Skype uses
its Global Index technology to find a user, with messages encrypted
using 256-bit AES. In order to locate a user the procedure is as
follows. If a normal node does not itself know the location of the
callee, it sends a request to one super node. That super node responds
with four additional nodes to query if the callee is not found. The
normal node then queries these four nodes. If the user is still not
found, the normal node contacts the same super node again, which
responds with eight new (and different) nodes. This is repeated several
times until the user is found. Here we have assumed for simplicity that
the normal node has a public address; where it has a private address,
this negotiation is done by a super node on the normal node's behalf.
Search results are also cached at intermediate nodes for subsequent
searches.
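The expanding search can be caricatured as follows; the node names,
fan-out schedule and query interface below are purely illustrative,
since Skype's actual wire protocol is closed and known only through
reverse engineering:

```python
def locate(user, entry_node, query, max_rounds=5):
    """Toy expanding search. `query(node, user)` returns either
    ('found', address) or ('try', [more candidate nodes])."""
    candidates = [entry_node]
    for _ in range(max_rounds):            # bounded number of rounds
        next_round = []
        for node in candidates:
            status, payload = query(node, user)
            if status == "found":
                return payload             # callee located
            next_round.extend(payload)     # widen the search
        candidates = next_round or [entry_node]
    return None                            # give up after max_rounds
```

Each unsuccessful round widens the candidate set, mirroring the
four-then-eight node expansion described above.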
Non-standardized solutions need to use protocol translation services if
they are to inter-operate with existing solutions. Protocol translation
involves taking a message from one protocol and generating a (near)
equivalent message in the second protocol. We briefly mentioned some
names of known translators for H.323 and SIP in the previous section.
For a closed protocol the developers themselves must create a translator
for the desired interoperability. There have been many publications and
presentations on the Skype protocol. Prestige in being the first to
reverse engineer a closed (and widely used) protocol often acts as an
incentive for such efforts. Some of these can be found in
[@suh:_charac_detec_skype_relay_traff; @04:_analy_skype_peer_peer_inter_telep_protoc].
A more basic introduction to the operation of Skype at a somewhat higher
level can be found in [@s.a.:_skype_explain].
Some researchers have tried to reverse engineer the Skype protocol;
these include Biondi and Desclaux's presentation
[@biondi6.:_silver_needl_in_skype], interestingly entitled "Silver
needle in the Skype". Others include Suh et al.
[@suh:_charac_detec_skype_relay_traff; @04:_analy_skype_peer_peer_inter_telep_protoc; @guha06:_exper_study_skype_peer_peer_voip_system].
End system control of VoIP
[[sect:end_system_control_of_voip]]{#sect:end_system_control_of_voip
label="sect:end_system_control_of_voip"}
End-systems also have to deal with the implications of different traffic
types existing on the network. The effect of this is that the voice
stream arrives distorted relative to the original, and it is the
receiver that normally reconstructs the original pattern of words and
gaps for the listener. In simple terms, the end-systems have to
resynchronize the time variations introduced by the network. The
end-systems have to be engineered "intelligently", that is, to create
working systems that can adapt to the changing conditions over Internet
paths.
The end systems play an important role in VoIP systems. At the sender
they are responsible for the sampling, coding, packetization, optionally
adding protection to the voice stream should any packets be lost, and of
course, actually transmitting the data. The receiver is responsible for
removing any jitter introduced by the network as well as receiving,
de-packetizing, removing extra data if sent, decoding and playing the
samples to the listener. These actions within the receiver are among the
most researched areas in real-time voice communication.
The early 90’s produced a surge in packet audio playout research. One of
the first efforts to implement a voice application on an IP network with
an adaptive buffer playout strategy was NeVoT [@Schu9207:Voice].
The playout algorithm implemented in Sicsophone is almost identical to
NeVoT's. NeVoT uses a variation estimate similar to the one given
earlier, but makes a slight distinction between the first packet in a
talkspurt and subsequent ones: the playout of the first packet is
delayed longer, due to the lack of information about the network state
after the silence period. Sicsophone also shares NeVoT's choice of a
ring buffer for holding packets, motivated in [@Schu9207:Voice] by
performance reasons, but performs the copying with DMA transfers
directly rather than copying the data from the application to the
operating system.
The operating system is the supervisor that coordinates the reception,
depacketization, error recovery, multiplexing, and playout of the
incoming voice stream(s).
Operating systems and VoIP
UNIX
[@reed98:new] "A new audio device driver abstraction"
[@Kouv9701:Overcoming] "Overcoming Workstation Scheduling Problems in a
Real-Time Audio Tool" [@rizzo97:freebsd] "The FreeBSD Audio Driver"
[@Meylan0004:April] "Realisation of an Adaptive Audio Tool"
[@martin2000:200mhz] "A 200Mhz 0.25W Packet Audio Terminal Processor for
Voice-over-Internet Protocol Applications" [@Chan9906:Hardware]
"Hardware and Software Architecture of a Packet Telephony Appliance"
Luigi Rizzo describes a generic sound card driver for FreeBSD
[@rizzo97:freebsd]. Aspects of it resemble the Sicsophone work described
above, in particular the handling of timers, DMA transfer and buffer
size allocation. The driver includes hooks for VoIP applications; one
example is a select() call which can be scheduled to return only when a
certain amount of data is ready for consumption.
Kouvelas and Hardman in [@Kouv9701:Overcoming] keep the flow of audio
constant under operating system load by using buffering in the audio
hardware. They also reduce the amount of buffering in the application by
keeping the application buffers as small as possible; Sicsophone tries
to eliminate them entirely by using only the hardware buffers.
Windows
Windows mobile
Algorithms for buffer sizing
Coping with the variable delays over Internet paths, whilst maintaining
acceptable interactivity in a real-time conversation subject to speech
codec limits, is essentially what a buffer sizing algorithm needs to do.
Naylor and Kleinrock in 1982 investigated buffering considerations in
the case of stream traffic on a packet switched network
[@Nayl82:Stream]. They used delay estimates (prediction) to eliminate
delay differences. Their idea is to use previous stream delays to
estimate the range of delays currently being incurred. The essence is
how to discard the $k$ largest samples from the last $m$ samples in
order to estimate $D$, the waiting time at the destination. They provide
some rules of thumb for
choosing $m$ and $k$ and show the suitability of the choice on empirical
delay distributions from the ARPANET. They show it is still necessary to
deal with discontinuities even after smoothing.
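Their rule can be sketched as: keep the last m observed network delays,
discard the k largest as outliers, and take the maximum of the remainder
as the destination waiting time D (the parameter values below are
illustrative, not their recommended settings):

```python
def waiting_time(delays, m=40, k=4):
    """Naylor-Kleinrock style estimate: D = max of the last m delay
    samples after the k largest have been discarded as outliers."""
    recent = sorted(delays[-m:])        # last m delay samples, ascending
    kept = recent[:len(recent) - k]     # drop the k largest
    return kept[-1]                     # D = max of the remainder
```

Larger k makes the estimate less sensitive to delay spikes at the cost
of more packets arriving after their playout point, which is exactly the
loss-versus-delay tradeoff discussed above.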
Gopal analyzed an earlier scheme suggested by Barberis and Pazzaglia
(1980). It included a non-stationary buffer length distribution and mean
values as a function of time within the talkspurt [@Gopa8402:Playout].
From the abstract: "Due to variations in network delay, a stream of
voice packets with deterministic interarrival times into a data network
may not have deterministic interdeparture times at the destination. Two
playout schemes which are designed to remove such variations in delay
are considered. Analytic results for the performance of these two
schemes are obtained. Numerical examples showing the effect of the
coefficient of variation of interdeparture time on performance are
presented."
The design and simulation of a jitter-free buffering scheme is presented
in [@Kansal01:Jitter-free]. Much of the early voice focus was on
solutions, mostly theoretical, for buffer design and sizing
[@Barb8002:Optimal; @Barb8102:Buffer]. Work by Naylor and Kleinrock
described general design methodologies for jitter-absorbing buffers
[@Nayl82:Stream]. Buffer sizing from a control theory perspective has
been examined in [@linc0205:jitter], and adaptive approaches in
[@Fujimoto0211:Adaptive].
Analysis of buffer performance using probabilistic methods can be found
in [@Mans0108:Jitter]. The authors study jitter control in networks with
guaranteed quality of service (QoS) from the competitive analysis point
of view: they propose on-line algorithms that control jitter and compare
their performance to the best possible (by an off-line algorithm) for
any given arrival sequence. For delay jitter, where the goal is to
minimize the difference between delay times of different packets, they
show that a simple on-line algorithm using a buffer of $B$ slots
guarantees the same delay jitter as the best off-line algorithm using
buffer space $B/2$. They prove that the guarantees made by their on-line
algorithm hold even for simple distributed implementations, where the
total buffer space is distributed along the path of the connection,
provided that the input stream satisfies a certain simple property. For
rate jitter, where the goal is to minimize the difference between
inter-arrival times, they develop an on-line algorithm using a buffer of
size $2B+h$ for any $h \ge 1$, and compare its jitter to the jitter of
an optimal off-line algorithm using buffer size $B$. They prove that
their algorithm guarantees that the difference is bounded by a term
proportional to $B/h$.
[@Moon9508:Packet] "Packet Audio Playout Delay Adjustment: Performance
Bounds and Algorithms".
[@Jha96:Continuous] "Continuous Media Playback and Jitter Control".
[@Anandakumar01:adaptive] "An adaptive voice playout method for VOP
applications".
[@matic2002:voice] "Predictive Playout Delay Adaptation for Voice over
Internet".
Comparison of solutions
This section concentrates on solutions which perform comparative studies
between two or more algorithms or simulations.
[@Wang9906:Comparison] Comparison of Adaptive Internet Multimedia
Applications
[@Cho9408:Reconstruction] Performance analysis of reconstruction
algorithms for packet voice communications.
Moon et al. [@Ramj9406:Adaptive] present four different playout
algorithms for packet audio. All calculate an estimate of the network
delay and jitter as an average over all the packets measured. The
authors study jitter spikes in traces but do not adapt the buffer
size to these spikes; the work and results are very similar to those
presented in [@RAMJ_94_Adaptive_Infocom].
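The estimators in this family of algorithms share a common core: an
autoregressive average of the network delay plus a multiple of the delay
variation. A minimal sketch is given below; the smoothing constant and the
four-times-variation safety margin are the values commonly quoted for these
algorithms, not taken from this text:

```python
ALPHA = 0.998002  # smoothing constant commonly quoted for this estimator

class PlayoutEstimator:
    def __init__(self, first_delay):
        self.d = first_delay   # smoothed network delay estimate (ms)
        self.v = 0.0           # smoothed delay variation estimate (ms)

    def update(self, delay):
        """Fold one packet's measured network delay into the estimates."""
        self.d = ALPHA * self.d + (1 - ALPHA) * delay
        self.v = ALPHA * self.v + (1 - ALPHA) * abs(self.d - delay)

    def playout_delay(self):
        """Playout point: mean delay plus a safety margin for jitter."""
        return self.d + 4 * self.v

est = PlayoutEstimator(first_delay=50.0)
for measured in (52.0, 48.0, 55.0, 50.0):
    est.update(measured)
print(est.playout_delay())   # slightly above the 50 ms mean delay
```

The playout delay is recomputed only at talkspurt boundaries, so the silence
periods absorb the adjustment without audible artifacts.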
Pinto and Christensen in two papers
[@Pinto99:Talkspurt; @pinto9910:algorithm] describe an algorithm for
jitter compensation based on the target packet loss rate. Their "gap
based" approach compares the current playout time with the arrival time
and calculates a gap for both early and late packets. They compare the
current playout delay, for any particular talkspurt in progress, with an
optimal playout delay. This optimal theoretical delay is defined as the
minimum amount of delay that must be added to the creation time of each
packet to play out the talkspurt at the given loss rate.
Their calculation of the optimal playout is similar to the one described
in this paper.
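The optimal per-talkspurt delay just defined amounts to a percentile of the
packets' network delays. The sketch below is an illustrative computation of
that quantity, not Pinto and Christensen's exact procedure:

```python
import math

def optimal_playout_delay(network_delays, target_loss):
    """Smallest per-talkspurt playout delay meeting the target late-loss rate.

    A packet is lost if its network delay exceeds the chosen playout delay,
    so we pick the (n - allowed_late)-th largest delay as the deadline.
    """
    ordered = sorted(network_delays)
    allowed_late = math.floor(target_loss * len(ordered))
    return ordered[len(ordered) - 1 - allowed_late]

delays = [48, 50, 51, 52, 55, 60, 61, 63, 70, 95]   # ms, one talkspurt
print(optimal_playout_delay(delays, target_loss=0.1))  # only the 95 ms packet late
print(optimal_playout_delay(delays, target_loss=0.0))  # no packet may be late
```

Since the full delay distribution is only known after the talkspurt ends, this
optimum is computable offline and serves as the benchmark against which the
online "gap based" algorithm is compared.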
[@Alva9309:Voice] Voice synchronization in packet switching networks.
A new algorithm that incorporates a novel adaptive scheme and a special
control-packet-based time measurement, working with the CEPT and DoD
vocoder standards. The first method uses measurement packets emitted at
the beginning of each silence interval and returned by the receiver.
Then a second packet, the reference packet, is sent, containing the
estimate of the mean network delay. The second method remembers the
packet with the lowest delay (greatest delay until playout).
[@agrawal98:use] Use of Statistical Methods to Reduce Delays for Media
Playback Buffering. [@matic2000:optimal] Optimal delay buffer for VoIP
applications. [@matic2002:predictive] Predictive Playout Delay
Adaptation for Voice over Internet. [@sreenan00:delay] Delay Reduction
Techniques for Playout Buffering. [@Ramos03:moving] A Moving Average
Predictor for Playout Delay Control in VoIP. [@Jeske2001:Adaptive]
Adaptive Play-Out Algorithms For Voice Packets.
[@kuo2000:delivering] Delivering Voice over the Internet.
[@leon99:adaptive] An adaptive predictor for media playout buffering.
An Algorithm for Playout of Packet Voice Based on Adaptive Adjustment of
Talkspurt Silence Periods.
[@Hodson2000:Skew] Skew Detection and Compensation for Internet Audio
Applications.
[@Fujimoto0211:Adaptive] Adaptive playout buffer algorithm for enhancing
perceived quality of streaming applications,
[@Ergul2001:Novel] A Novel Adaptive Playout Algorithm For Voice Over IP
Applications and Performance Over WANs.
[@Fujimoto0106:playout] Playout control for streaming applications by
statistical delay analysis.
[@peng01:control] The Control and Algorithm of Audio Dynamic Buffer.
[@roccetti98:design] Design, Development and Experimentation of an
Adaptive Mechanism for Reliable Packetized Audio for Use over the
Internet.
Loss/delay considered together
[@Moon9801:Correlation] [@Jian0006:Modeling]
Works which do consider voice in the design of packet-switched
networks include
[@Lieb0006:Tradeoffs] "Tradeoffs in Designing Networks with End-to-End
Statistical QoS Guarantees"
[@Aras9401:Real] Real-time communications in packet-switched networks
[@Garr9302:Joint] Joint Source/Channel Coding of Statistically
Multiplexed Real-Time Services on Packet Networks
A. P. Bernard's thesis "Source-Channel Coding of Speech" (UCLA, 1998)
[@Bernard:T98] addresses the design of source and channel coding
techniques for voice. It presents an adaptive multi-rate (AMR)
transmission system that switches between operating modes depending on
channel conditions. The system uses variable bit rate embedded
source encoders and rate-compatible channel coders providing unequal
error protection. The main concept is the use of a rate-compatible
punctured trellis code (RCPT) to obtain unequal error protection via
progressive puncturing of symbols in a trellis. Results are presented
in which the rate-compatible punctured convolutional code is compared
with and without bit-interleaved coded modulation. The coder displays a
wide range of bit error sensitivities and is used in combination with
rate-compatible punctured channel codes providing adequate levels of
protection. It also operates over a wide range of channel conditions
with graceful performance degradation as the channel signal-to-noise
ratio decreases.
[@Chen8806:Integrated] Integrated voice/data switching: mixing of voice
and data at three different levels.
[@yletyinen98:voice] Voice packet interarrival jitter over IP switching
[@CaSa98:predictive] Predictive loss pattern queue management for
Internet routers.
[@Fulton1998:Delay] Delay jitter first-order and second-order
statistical functions of general traffic on high-speed multimedia
network.
[@Land9702:Multiplexing] "Multiplexing ATM traffic streams with
time-scale-dependent arrival processes"
[@song00:new] "A New Queue Discipline for Various Delay and Jitter
Requirements in Real-Time Packet-Switched Networks"
Network Synchronization [@Bier9606:Intra] "Intra- and Inter-Stream
Synchronization for Stored Multimedia Streams"
[@yuang96:novel] A Novel Intra-media Synchronization Mechanism for
Multimedia Communication
[@Huan9511:Multimedia] Multimedia synchronization for live presentation
using the N-buffer approach
[@Jeff9211:Adaptive] Adaptive, best-effort delivery of digital audio and
video across packet switched networks,
[@Tien9901:Intelligent] "Intelligent Voice Smoother for
Silence-Suppressed Voice over Internet"
[@Tucker87:packet] "Packet-speech Multiplexer"
[@Figu9510:Leave] Leave-in-Time: A New Service Discipline for Real-Time
Communications in a Packet-Switching Data Network
[@Kofman96:Loss] Loss Probabilities and Delay and Jitter Distributions
In A Finite Buffer Queue With Heterogeneous Batch Markovian Arrival
Process
[@Sing9405:Jitter] Jitter and Clock Recovery for Periodic Traffic in
Broadband Packet Networks
Wireless access
This section covers wireless access technologies for IP voice users.
Freeing the handset physically from the wired infrastructure is both
natural for the user and allows mobility and flexibility in usage. This
freedom, however, puts more constraints on the system in terms of QoS,
security and resource usage. Due to strict licensing requirements on
radio transmissions, the technological favorite for wireless VoIP is to
use unlicensed spectrum in the 2.4 and 5.0 GHz bands.
History and 802.11 extensions
The IEEE 802.11 standard belongs to the 802 family, a series of
standards developed by the IEEE to define specifications for local and
metropolitan area networking, mainly at the data-link and physical
layers of the OSI reference model. In 1997 the IEEE released the first
version of the 802.11 standard, whose purpose was to provide wireless
connectivity between different devices in a local area, with a maximum
transmission rate of 2 Mbps. Two years later a revision appeared. It
included two new extensions which used new modulation schemes to provide
rates up to 11 Mbps at the 2.4 GHz frequency band (802.11b) and 54 Mbps
at the 5 GHz band (802.11a). Further extensions are being released,
addressing aspects such as security, higher transmission rates, and quality
of service (QoS). Eriksson et al. discuss "The challenges of voice over
IP over wireless" [@eriksson00:_ip]. The following table summarizes the
current extensions to the 802.11 standard (some of which are still in
draft state) and their main features:

  Extension   Main features
  ----------- -------------------------------------------------------------------------
  802.11a     High-speed WLAN standard in the 5 GHz band; supports 54 Mbps
  802.11b     Standard for the 2.4 GHz band; supports 11 Mbps
  802.11e     Quality of service (QoS) enhancements
  802.11f     Inter-access point communication
  802.11g     Modulation technique for the 2.4 GHz band, achieving a rate of 54 Mbps
  802.11h     Management of the 5 GHz band for use in Europe and the Asia Pacific region
  802.11i     Addresses security weaknesses

  : 802.11 sub-standards
802.11 layers
The standard defines a set of medium access control (MAC) and physical
layers (PHY) specifications for wireless connectivity. The following
figure shows the relation between these layers and the OSI reference
model:
[Figure 1: relation between the 802.11 layers and the OSI reference
model.]{#fig:802.11-and-OSI}
As shown in the picture, the MAC is a sublayer of the data-link layer,
which offers its service to the logical link control sublayer, a common
interface for all the IEEE 802 standards. This common interface permits
a heterogeneous interconnection of different types of devices by
abstracting their underlying media technology.
In order to distinguish between data units of different layers, we will
follow the convention of using the term ‘packets’ for the data units at
the IP and higher layers, whilst using the word ‘frames’ for the 802.11b
data units.
Physical layer
The physical layer (PHY) specifies low-level communication parameters
such as the radio technology, frequencies, channel bandwidth, modulation
schemes, and transmission rates. The three units shown at the bottom of
figure [1](#fig:802.11-and-OSI) represent the three different radio
technologies that the original 802.11 standard determined for wireless
connectivity. IR stands for InfraRed, FHSS for Frequency Hopping Spread
Spectrum, and DSSS for Direct Sequence Spread Spectrum.
Whilst all these technologies support transmission rates of 1 and 2
Mbps, practically no known vendors sell IR compliant products. Also,
frequency hopping products are few in comparison with the ones that use
direct sequence. An explanation for the dominance of DSSS products is
the development of the 802.11b extension that enhanced the basic DSSS
mode with additional transmission rates of 5.5 and 11 Mbps.
For spread spectrum techniques the standard defines the S-band ISM
range (2.4-2.5 GHz) as the frequency band to use. This is because the
regulatory authorities permit the unlicensed use of the ISM (Industrial,
Scientific, and Medical) frequency bands provided that the emitting
power of the devices is low.
For direct sequence mode, the standard divides the band into 14
different channels whose mapping to frequencies is shown in table
[2](#fig:IEEE-802.11b-channels):

::: {#fig:IEEE-802.11b-channels}
  Channel           1       2       3       4       5       6       7
  ----------------- ------- ------- ------- ------- ------- ------- -------
  Frequency (GHz)   2.412   2.417   2.422   2.427   2.432   2.437   2.442

  Channel           8       9       10      11      12      13      14
  ----------------- ------- ------- ------- ------- ------- ------- -------
  Frequency (GHz)   2.447   2.452   2.457   2.462   2.467   2.472   2.484

  : [[fig:IEEE-802.11b-channels]]{label="fig:IEEE-802.11b-channels"}IEEE 802.11b channels
:::
However, this channel distribution causes adjacent channel interference,
since the bandwidth used by 802.11b stations is around 22 MHz and
there is only 5 MHz of separation between two adjacent channels. Thus,
only channels 1, 6 and 11 are separated enough to be used in close
locations without interference. This poses a significant challenge to
deploy a WLAN with an adequate coverage while keeping the values of
adjacent channel interference and co-channel interference low. Figure
[2](#fig:Adjacent-channel-interference) shows how the frequency
band used by each channel overlaps the frequency band of the neighboring
channels.
[Figure 2: overlap of the frequency bands used by neighboring 2.4 GHz
channels.]{#fig:Adjacent-channel-interference}
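The channel plan described above can be checked with a short sketch: centre
frequencies sit 5 MHz apart starting at 2.412 GHz (channel 14, used in Japan,
sits apart at 2.484 GHz), while each DSSS transmission occupies roughly 22 MHz:

```python
CHANNEL_WIDTH_MHZ = 22   # approximate width of a DSSS transmission

def centre_mhz(channel):
    """Centre frequency of a 2.4 GHz channel in MHz."""
    if channel == 14:
        return 2484              # channel 14 is offset from the 5 MHz grid
    return 2407 + 5 * channel    # channels 1-13

def overlap(ch_a, ch_b):
    """True if the two channels' 22 MHz-wide bands overlap."""
    return abs(centre_mhz(ch_a) - centre_mhz(ch_b)) < CHANNEL_WIDTH_MHZ

print(centre_mhz(1), centre_mhz(6), centre_mhz(11))  # 2412 2437 2462
print(overlap(1, 2))   # True: adjacent channels interfere
print(overlap(1, 6))   # False: 25 MHz apart, the classic non-overlapping set
```

This is why multi-cell deployments in the same area conventionally reuse only
channels 1, 6 and 11.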
Different modulation schemes are used to achieve the standard bitrates
of 1, 2, 5.5 and 11 Mbps. The standard determines the use of BPSK for 1
Mbps, QPSK for 2 Mbps, CCK BPSK for 5.5 Mbps and CCK QPSK for 11 Mbps.
The denser encodings used to transmit at the highest rates, however,
are more prone to interference and therefore have less range than the
lower rates. A trade-off thus exists between data throughput and
distance, and we will see an illustration of this in our experiments.
In order to accomplish interoperability between devices that use
different transmission techniques, the PHY layer prepends a physical
header to data frames that is always transmitted at 1 Mbps.
Architecture
Before describing the two different architectures defined in the
802.11 standard, we describe the four components that an 802.11 network
may consist of: an access point, stations, the wireless medium, and a
distribution system.
-   Station (STA): An electronic device capable of communicating
    wirelessly with other stations in range.

-   Access point (AP): A special wireless station whose main purpose is
    to provide the wireless stations with access to the Internet.

-   Distribution system (DS): In order to provide larger coverage areas,
    several access points may be used. The distribution system is the
    logical component of an 802.11 network that allows the different
    access points to track the location of the wireless stations.

-   Wireless medium: While in wired networks a cable is the physical
    medium used to carry frames from the sender station to the receiver,
    in wireless networks the physical medium is the air.
Infrastructure networking
This is the most common mode that the IEEE 802.11b standard defines to
build wireless LANs. An infrastructure network relies on an access point
to provide connectivity to the wireless stations. The stations must be
in range of the AP; however, it is not required that the stations be in
range of each other. Stations that are not in range of each other are
usually referred to as hidden nodes (see section
[4.18](#sub:Hidden-node-problem)). It is possible to extend the size
of a wireless LAN by interconnecting several APs, thus permitting the
wireless stations to roam between adjacent cells.
Adhoc (independent) networking
The standard provides a mechanism to create small, usually short-lived
networks, with only end stations. A common usage of this is when a
connection between two or more stations is desired (file sharing for
instance) and no AP is available. However, it is also possible that one
of the stations acts as a bridge between the WLAN and the wired world,
providing external connectivity to the other stations.
IEEE 802.11 Medium Access Control (MAC) {#sub:IEEE-802.11-MAC}
The main functions of the medium access control (MAC) layer are to
coordinate the stations to gain access to the medium and to define the
mapping of physical layer signals to/from link frames. It also manages
the bitrate selection and supports different operational modes like the
RTS/CTS handshake. The following sections describe the relevant 802.11b
MAC features for the measurements we conducted.
MAC access modes
The Distributed Coordination Function (DCF) is the access mode defined
in the 802.11 protocol to provide unsynchronized, contention-based
access to the medium, through the CSMA/CA protocol (described in section
[4.13](#sub:Contention-based-in-DCF)). The unsynchronized access to
the medium results in random delays between each frame transmission,
which may be problematic for real-time traffic.
The Point Coordination Function (PCF) is an optional access mode which
enables synchronized transmission of data frames. In this mode, the AP
polls the wireless stations granting them access to the medium for a
short period of time. Then, the AP moves to the next station in the poll
list and thus all the stations obtain a slot of time to transmit data.
Although this access mode seems suitable for real-time communication, it
is simply not supported by many 802.11 devices. Further information
about the PCF mode can be found in the 802.11 standard.
HCCA
802.11e includes amendments to the MAC layer to support QoS. HCCA
(HCF Controlled Channel Access) operates similarly to the Point
Coordination Function.
HCF (Hybrid Coordination Function)
However, in contrast to PCF, in which the interval between two beacon
frames is divided into a contention-free period (CFP) and a contention
period (CP), the HCCA allows CFPs to be initiated at almost any time
during a CP.
A CFP is called a Controlled Access Phase (CAP) in 802.11e. A CAP is
initiated by the AP, whenever it wants to send a frame to a station, or
receive a frame from a station, in a contention free manner. In fact,
the CFP is a CAP too. During a CAP, the Hybrid Coordinator (HC) — which
is also the AP — controls the access to the medium. During the CP, all
stations function in EDCA.
The other difference with the PCF is that Traffic Class (TC) and Traffic
Streams (TS) are defined. This means that the HC is not limited to
per-station queuing and can provide a kind of per-session service. Also,
the HC can coordinate these streams or sessions in any fashion it
chooses (not just round-robin). Moreover, the stations provide
information about the lengths of their queues for each Traffic Class
(TC). The HC can use this information to give priority to one station
over another, or to better adjust its scheduling mechanism. Another
difference is that stations are granted a TXOP: they may send multiple
packets in a row, for a time period selected by the HC. During the CP,
the HC allows stations to send data by sending CF-Poll frames.
HCCA is generally considered the most advanced (and complex)
coordination function. With the HCCA, QoS can be configured with great
precision. QoS-enabled stations have the ability to request specific
transmission parameters (data rate, jitter, etc.) which should allow
advanced applications like VoIP and video streaming to work more
effectively on a Wi-Fi network.
HCCA support is not mandatory for 802.11e APs. In fact, few (if any) APs
currently available are enabled for HCCA. Nevertheless, implementing the
HCCA does not require much overhead, as it basically uses the existing
DCF mechanism for channel access (no change to DCF or EDCA operation is
needed). In particular, the station side implementation is very simple
as stations only need to be able to respond to poll messages. On the AP
side, however, a scheduler and queuing mechanism is needed. Given that
APs are already equipped better than station transceivers, this should
not be a problem either.
EDCA
With EDCA (Enhanced Distributed Channel Access), high priority traffic
has a higher chance of being sent than low priority traffic: a station
with high priority traffic waits a little less before it sends its
packet, on average, than a station with low priority traffic. This is
accomplished by using a shorter contention window (CW) and shorter
arbitration inter-frame space (AIFS) for higher priority packets. In
addition, EDCA provides contention-free access to the channel for a
period called a Transmit Opportunity (TXOP). A TXOP is a bounded time
interval during which a station can send as many frames as possible (as
long as the duration of the transmissions does not extend beyond the
maximum duration of the TXOP). If a frame is too large to be transmitted
in a single TXOP, it should be fragmented into smaller frames. The use
of TXOPs reduces the problem of low rate stations gaining an inordinate
amount of channel time in the legacy 802.11 DCF MAC. A TXOP time
interval of 0 means it is limited to a single MSDU or MMPDU. The levels
of priority in EDCA are called access categories (ACs). Default EDCA
Parameters for each AC:
  AC                    CWmin   CWmax   AIFSN   Max TXOP
  --------------------- ------- ------- ------- ----------
  Background (AC_BK)    31      1023    7       0
  Best Effort (AC_BE)   31      1023    3       0
  Video (AC_VI)         15      31      2       3.008 ms
  Voice (AC_VO)         7       15      2       1.504 ms
  Legacy DCF            15      1023    2       0
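The AIFSN column translates into actual waiting times through the relation
AIFS[AC] = SIFS + AIFSN[AC] × slot time. A quick sketch, using the DSSS
timings quoted elsewhere in this chapter (SIFS = 10 µs, slot = 20 µs):

```python
SIFS_US = 10   # short inter-frame space, DSSS
SLOT_US = 20   # slot time, DSSS

# AIFSN values from the default EDCA parameter table above.
AIFSN = {"AC_BK": 7, "AC_BE": 3, "AC_VI": 2, "AC_VO": 2}

def aifs_us(ac):
    """Arbitration inter-frame space for an access category, in microseconds."""
    return SIFS_US + AIFSN[ac] * SLOT_US

for ac, n in AIFSN.items():
    print(ac, aifs_us(ac))   # voice/video wait 50 us, best effort 70, background 150
```

Voice and video thus contend after a gap equal to the legacy DIFS (50 µs),
while background traffic must let the medium stay idle three times as long
before it may even start counting down its backoff slots.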
ACs map directly from Ethernet-level Class of Service (CoS) priority
levels:
  Priority   802.1D Priority   802.1D Designation   Access Category
  ---------- ----------------- -------------------- -----------------
  Lowest     1                 BK                   AC_BK
             2                 Spare                AC_BK
             0                 BE                   AC_BE
             3                 EE                   AC_BE
             4                 CL                   AC_VI
             5                 VI                   AC_VI
             6                 VO                   AC_VO
  Highest    7                 NC                   AC_VO
The purpose of QoS is to protect high priority data from low priority
data, but there are scenarios in which traffic of a given priority needs
to be protected from other traffic of the same priority. For example,
suppose a network can accommodate only 10 voice calls and an eleventh
call is made. Admission control in EDCA addresses this type of problem:
the AP publishes the available bandwidth in beacons, and the clients can
check the available bandwidth before adding traffic to the network that
cannot be accommodated. Wi-Fi Multimedia (WMM) certified APs must be
enabled for EDCA and TXOP. All other enhancements of the 802.11e
amendment are optional.
Positive acknowledgments {#sub:Positive-acknowledgements}
The 802.11 standard defines a positive acknowledgment scheme in order to
provide some reliability for wireless transmissions. All data frames
must be acknowledged. When a station has properly received a data frame
it sends back an ACK frame to the sender so that the sender knows of the
successful delivery. If for some reason the ACK frame does not arrive at
the sender it will assume that the packet was not delivered and the
sender will retransmit it. The drawback of the ACK mechanism is the
overhead that it adds to the communication. However, ACK frames are
necessary since link conditions are highly variable in wireless
networks.
Interframe spacing
The 802.11 standard defines four interframe spacings to prioritize the
transmission of certain frames. They are used to ensure that atomic
operations such as the frame-acknowledgment pair or the RTS/CTS
handshake are not interrupted; they are also used to provide preference
to contention-free traffic over contention-based traffic when both
exist. When the medium is busy all the frames have to wait until it
becomes idle, in order to be transmitted. Then, the frames with the
highest priority gain access to the channel as they are assigned a
shorter interframe space. Figure
[3](#fig:Interframe-spacing) shows the different interframe
spacings.
[Figure 3: the 802.11 interframe spacings.]{#fig:Interframe-spacing}
The SIFS (Short InterFrame Space) is used for high priority frames such
as acknowledgments or CTS frames, which must be sent immediately after
the corresponding frame. It has a duration of 10 $\mu s$ in DSSS. The
PIFS (PCF InterFrame Space) is used to give higher priority to PCF
contention-free traffic over DCF contention-based traffic. It has a
duration of 30 $\mu s$ in DSSS. The DIFS (DCF InterFrame Space) is the
minimum time that the medium must be idle before attempting a
contention-based transmission. It has a duration of 50 $\mu s$ in DSSS.
The EIFS (Extended InterFrame Space) (not shown) is a variable-length
interval that is used when a frame is received with errors.
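The three fixed spacings quoted above are related through the 20 µs DSSS slot
time: PIFS is one slot longer than SIFS, and DIFS two slots longer. A quick
consistency check of the quoted values:

```python
SIFS = 10                 # us, DSSS value from the text
SLOT = 20                 # us, DSSS slot time

PIFS = SIFS + SLOT        # PCF traffic waits one slot more than an ACK
DIFS = SIFS + 2 * SLOT    # DCF traffic waits two slots more than an ACK

assert (SIFS, PIFS, DIFS) == (10, 30, 50)
print(SIFS, PIFS, DIFS)   # 10 30 50
```

The ordering SIFS < PIFS < DIFS is what gives acknowledgments priority over
contention-free traffic, and contention-free traffic priority over ordinary
contention-based transmissions.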
Contention using DCF – backoff mechanism {#sub:Contention-based-in-DCF}
The DCF mode is based on the carrier sense multiple access with
collision avoidance (CSMA/CA) protocol. It makes the stations sense the
medium to determine whether it is busy or idle before they attempt to
transmit a frame. However, if several stations are waiting for the
medium to be idle and they transmit simultaneously, a collision will
occur unless there is some mechanism to deal with the contention.
In order to resolve contention between several stations, CSMA/CA defines
an exponential backoff algorithm. The backoff mechanism works as
follows:
-   Every time a station attempts to transmit, it waits for the medium
    to be idle during a DIFS (or an EIFS, if the previous transmission
    was not successful) period. There then follows an interval called
    the contention window, which is divided into slots of 20 $\mu s$.
    The stations choose a random number of slots and wait for those
    slots to elapse; the station that selected the lowest number of
    slots thus accesses the medium first. The timing of the backoff
    mechanism is shown in figure [3](#fig:Interframe-spacing).

-   While the winning station occupies the channel, the other stations
    suspend the backoff procedure until the medium is idle again.
    However, the stations resuming the contention do not choose a random
    number of slots again. Instead, they wait for the slots that
    remained from the previous contention. In this way, the stations
    that lost the contention have a higher probability of gaining access
    to the medium than the one that just transmitted.

-   The random number of slots is chosen from the interval [0, CW], CW
    being the size of the contention window. The mechanism is called
    exponential because after a failed transmission a station must
    double the size of the contention window. This is designed to reduce
    the collision probability in a loaded network.

-   Collisions occur when two stations select the same number of slots.
    A collision is only detected by the lack of the corresponding ACK,
    since the wireless stations cannot listen to the medium whilst they
    are transmitting.
The contention window size is a power of 2 minus 1, starting at 31. It
doubles after each failed transmission until the 5th retransmission,
where it is limited to 1023 slots for further retransmission attempts.
The contention window reverts to the minimum size after a successful
transmission of a frame.
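The window-doubling rules can be illustrated with a toy simulation. This is a
sketch of the contention logic only, not the standard's full state machine
(slot countdown resumption and timing are abstracted away):

```python
import random

CW_MIN, CW_MAX = 31, 1023   # contention window bounds, powers of 2 minus 1

class Station:
    def __init__(self, name):
        self.name = name
        self.cw = CW_MIN

    def draw(self):
        """Pick a random backoff from the current contention window [0, CW]."""
        return random.randint(0, self.cw)

def contend(stations):
    """One contention round: return the winner, or None on a collision."""
    draws = {s: s.draw() for s in stations}
    lowest = min(draws.values())
    winners = [s for s, d in draws.items() if d == lowest]
    if len(winners) == 1:
        winners[0].cw = CW_MIN                 # success: window reverts to minimum
        return winners[0]
    for s in winners:                          # collision: no ACK is seen,
        s.cw = min(2 * s.cw + 1, CW_MAX)       # so the window doubles (31 -> 63 -> ...)
    return None

random.seed(1)
stations = [Station("A"), Station("B"), Station("C")]
for _ in range(5):
    winner = contend(stations)
    print(winner.name if winner else "collision")
```

Note that `2 * cw + 1` keeps the window at a power of 2 minus 1 (31, 63, 127,
..., 1023), matching the progression described above.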
Retry counters
The retry counters set a limit to the maximum number of retransmissions
allowed per frame before it is discarded by the MAC layer. Some cards
define different counters depending on the packet size.
Error detection and recovery
When a station does not receive the corresponding ACK it assumes that
the frame was lost and it tries to retransmit it again. Thus detection
and recovery at the MAC layer occur at the source.
However, after reaching the maximum number of retransmissions the MAC
layer discards the frame. It is therefore the responsibility of higher
layers (TCP, for instance) to perform detection and recovery where
reliability is required.
Carrier sensing functions
Carrier sensing is a method invoked by the MAC layer to ascertain
whether the medium is busy. It can be either physical or virtual.
Physical carrier sensing is performed by the physical layer and it
reports the state of the medium to the MAC layer. Virtual carrier
sensing is made through the Network Allocation Vector (NAV), which is
the expected time that the atomic transmission of a frame will maintain
the channel busy. The NAV is calculated from the duration field of the
802.11 header of the existing frame in the air.
Bitrate selection
The bitrate selection has a critical role in wireless systems as it
directly affects the frame error rate for a given signal-to-noise ratio.
The 802.11 standard intentionally left the bitrate selection mechanism
unspecified. Thus, vendors have freedom to implement their own
mechanisms. However, some drivers permit the user to select specific
bitrates.
Hidden node problem. RTS/CTS mechanism {#sub:Hidden-node-problem}
In a wireless network transmission ranges have fuzzy boundaries, unlike
in a wired network where all the stations can reach all others. It may
happen that two stations are in range with a third but not in range with
each other, because of some obstacle or simply because they are too far
apart.
Figure [4](#fig:Hidden-node-problem) reflects this situation: the pairs
of nodes A – B and B – C are in range. However, A cannot hear what C
transmits, nor can C hear what A transmits. Thus, A and C may
simultaneously start a transmission, causing a collision at B, because
they cannot sense the medium as busy.
[Figure 4: the hidden node problem.]{#fig:Hidden-node-problem}
In order to alleviate this problem the IEEE standard provides a
mechanism to reserve the medium: the Request To Send (RTS) and Clear To
Send (CTS) messages. The procedure works as follows: A sends an RTS
frame to B, which responds with a CTS. Although C cannot hear the RTS,
it will receive the CTS, thus learning that a transmission is about to
take place. The CTS frame has a duration field that informs all the
stations in range about the time that the medium will be busy. Then,
after receiving the CTS, A transmits the data frame while C waits for
the transmission to finish.
RTS frames are very small (20 bytes) and are less likely to be
interrupted than a large data frame (which can be around 1500 bytes).
However, when a small data frame is to be transmitted this mechanism
does not offer any significant advantage. Additionally, the RTS/CTS
reservation adds some overhead to the communication thus reducing the
overall throughput of the network.
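The trade-off just described can be made concrete with a back-of-the-envelope
calculation. The frame sizes are those quoted above; the rates (control frames
at the 1 Mbps base rate, data at 11 Mbps) and the omission of preambles,
inter-frame gaps and the ACK are simplifying assumptions:

```python
CTRL_RATE = 1e6      # b/s: control frames are sent at the base rate
DATA_RATE = 11e6     # b/s: data frames at the 802.11b top rate

def airtime_us(size_bytes, rate_bps):
    """Time on air for a frame of the given size, in microseconds."""
    return size_bytes * 8 / rate_bps * 1e6

def overhead_fraction(data_bytes):
    """Cost of the RTS (20 B) + CTS (14 B) handshake relative to the data frame."""
    handshake = airtime_us(20, CTRL_RATE) + airtime_us(14, CTRL_RATE)
    return handshake / airtime_us(data_bytes, DATA_RATE)

print(round(overhead_fraction(1500), 2))  # modest relative cost for a large frame
print(round(overhead_fraction(80), 2))    # handshake dwarfs a small voice frame
```

For a typical short VoIP packet the handshake takes several times longer than
the payload itself, which is why RTS/CTS is usually enabled only above a size
threshold and is unattractive for voice traffic.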
Here we concentrate on the IEEE 802.11 physical and MAC layer access for
VoIP. Our focus is on access using wireless technologies rather than
relaying voice. We mainly deal with the IEEE 802.11 suite of protocols,
primarily because these are the systems we have access to.
Anastasi et al. measured the performance of IEEE 802.11b ad hoc networks
[@anastasi2004:IEEE], specifically the range of the end-terminals, the
impact of different data rates and their variability. They observed that
the transmission range was highly dependent on the data rate up to 100m,
whilst the physical carrier sensing range was independent of rate up to
200m. Unlike their results in ad hoc mode, we did not observe different
ranges for different rates at up to 320 meters. Even at 400 meters there
was no conclusive dependency of range on data rate.
Hertrich looked at mixed traffic (including real-time voice) in IEEE
802.11 networks [@hertrich03:_exper_perfor_evaluat]. He used a MAC
booster and, by tailoring it, could alter the number of retransmissions
for different positions to achieve the required throughput. We did not
try to change the number of transmissions. This work is similar to ours
in that he considered the environment as important, however he used VoIP
and MPEG4, while we used VoIP and TCP. Additionally, Hertrich focused on
the home, whereas we focused on an office environment. Nevertheless we
found that certain positions also did not permit any communication to
take place at all.
Dimitriou et al. address issues that can make the deployment of
multimedia communications difficult in 802.11 networks
[@dimitriou03:inter_telep_wlans]. They cite interference and users
moving out of range as limiting factors for good VoIP quality in WLANs.
They suggest the use of smart speech coding (including an enhanced
version of the G.711 coding developed by their company) to make the
speech more resilient to loss.
Hoene et al. examined the effect of motion on the performance of
wireless links through a series of experiments with moving nodes
[@Hoene2003:measuring]. They conclude that other factors such as
modulation type, quality of power supply, environmental setup, and
number of retransmissions may have greater impact on 802.11b performance
than the motion itself. In general the greater the speed of the terminal
the lower the correlation of loss events. In our experiments the nodes
were not moving, i.e., movement only occurred between measurements; thus
movement should only decrease the observed losses.
Radio aspects
One of the more active research areas has been in wireless voice
services. Focus has mainly been in the areas of throughput and capacity
issues of IEEE 802.11 networks. Casetti et al. present a framework that
assumes variable rate speech coders at rates of 64 kb/s, 13 kb/s, and 8
kb/s [@casetti04:_improv]. Their rates are determined by an end to end
control mechanism, based on measurements of packet delay and loss rates.
Another approach is to look at the MAC protocol directly. Dong et al.
propose and examine selective error checking (SEC) at the MAC layer of
802.11 [@dong04:_selec_mac_ieee]. They make use of the fact that speech
bits can tolerate errors, but should be protected for optimal quality
reproduction. Simulation results showed that the speech quality can be
substantially improved by modifying the MAC layer with SEC to suit the
Narrow-Band Adaptive MultiRate (NB-AMR).
Filali looks at a MAC tuning approach [@filali04:_dynam]. He exploits
the properties of multimedia applications in IEEE 802.11-based wireless
networks by limiting the number of retransmissions of a data frame by a
source until the reception of a link-level acknowledgment from the
destination.
In 2005, the IEEE approved QoS service enhancements for local area
network applications called IEEE 802.11e. Garg et al. examine using the
IEEE 802.11e protocol for voice applications [@garg03:_using_ieee]. The
Enhanced Distributed Coordination Function (EDCF) has been proposed as a
MAC protocol. EDCF assigns four different priority classes for incoming
packets at each node which are called access categories (AC). Each AC
has its own channel access function. This is in contrast to the standard
Distributed Coordination Function (DCF) where packets all use the same
access function to the channel. Separate access functions for the
different categories mean assigning different delay times, minimum
contention windows, and numbers of back-off stages to each type of
service.
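A minimal sketch of how these per-category parameters shape channel access is given below. The AIFSN and contention-window values are the commonly cited 802.11e EDCA defaults (for aCWmin = 15), and the 802.11b timing constants are assumptions for illustration:

```python
# EDCF/EDCA access categories: each AC gets its own arbitration inter-frame
# space (AIFS) and contention window, so higher-priority traffic waits less
# before contending for the channel. Values are the commonly cited 802.11e
# defaults; treat them as illustrative assumptions.

SLOT_US, SIFS_US = 20, 10  # assumed 802.11b slot and SIFS times (microseconds)

ACCESS_CATEGORIES = {
    "AC_VO": {"aifsn": 2, "cwmin": 3,  "cwmax": 7},     # voice
    "AC_VI": {"aifsn": 2, "cwmin": 7,  "cwmax": 15},    # video
    "AC_BE": {"aifsn": 3, "cwmin": 15, "cwmax": 1023},  # best effort
    "AC_BK": {"aifsn": 7, "cwmin": 15, "cwmax": 1023},  # background
}

def mean_access_wait_us(ac: str) -> float:
    """AIFS plus the mean initial backoff, in microseconds."""
    p = ACCESS_CATEGORIES[ac]
    aifs = SIFS_US + p["aifsn"] * SLOT_US
    mean_backoff = (p["cwmin"] / 2) * SLOT_US
    return aifs + mean_backoff

for ac in ACCESS_CATEGORIES:
    print(ac, mean_access_wait_us(ac))
# AC_VO 80.0, AC_VI 120.0, AC_BE 220.0, AC_BK 300.0
```

The ordering of the mean waits shows the differentiation mechanism: voice frames, on average, access the channel well before background traffic.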
Garg et al. looked at 802.11e’s ability to fulfill the goals of improved
QoS and higher channel efficiency. They investigated the response of the
protocol to options in the protocol parameters and showed that the
Hybrid Coordination Function (HCF) reduces channel contention and
provides improved channel utilization. Both MAC coordination functions,
EDCF and HCF, are sensitive to protocol parameters which are dependent
on the scheduling algorithms. They conclude that further investigations
need to be conducted.
Kawata et al. propose a dynamic Point Coordination Function (PCF) for
improved capacity [@kawata05:_using]. They suggest two new media access
schemes, dynamic point coordination function (DPCF) and modified DPCF
(DPCF2). The claim is that the capacity of VoIP traffic can be increased
by up to 20% in 802.11b networks. They show how a significant
improvement in the end-to-end delay with mixed VoIP and data traffic can
be achieved. Delay is maintained at approximately 100 ms in heavily
loaded traffic conditions, and at approximately 60 ms in normal traffic
conditions.
Lindgren et al. [@lindgren03:_qualit_servic_schem_ieee] evaluate four
mechanisms for providing service differentiation in IEEE 802.11
networks. The evaluated schemes are the PCF of IEEE 802.11, EDCF of IEEE
802.11e extension, Distributed Fair Scheduling (DFS), and Blackburst.
Using simulation they looked at throughput, medium utilization,
collision rate, average access delay, and delay distribution for a
variable load of real time and background traffic. The simulations
showed that the best performance is achieved by Blackburst. PCF and EDCF
are also able to provide good service differentiation. DFS can provide
relative differentiation and consequently avoids starvation of low
priority traffic.
Currently voice occupies relatively little of the IP wireless access
capacity and the majority of voice traffic is carried by the cellular
networks. Research in combining these two has been published within the
context of voice roaming [@calvagna03:_wifi_gprs; @marsh506:Design].
Exploring voice quality in IP networks continues to be an active
research area
[@TRITA-EE_2006:016; @varela05:_study_effec_fec_voice_traff].
WiMAX
Cross-layer methods {#sect:cross_layer}
A newer design principle being applied to wireless VoIP systems is
cross-layer design
[@Matt0306:Source; @Poppe0401:Choosing; @aguiar03:_chann_sched_voip_mpeg4_chann_predic];
Application Level Framing (ALF) can be seen as an early example of this
approach. Wireless networks have rejuvenated the interest in cross-layer
design. Cross-layer refers to the sharing of information across
different network layers to increase the efficiency of the whole system.
By relaxing the strict layering of classic network design it may be
possible to couple elements more closely than the layer boundaries
suggest.
Cross layer and joint channel/source coding
The mobile handset is a good example of successful cross-layer design;
one concrete instance is joint source and channel coding.
802.11 and quality mechanisms
Hoene et al. take an experimental approach to improving VoIP quality
over wireless LANs by adopting a booster mechanism [@Hoene0108:Voice].
VoIP and handovers {#subsect:voip_handovers}
Voice quality can suffer if there are radio coverage problems,
interference from external sources, and excessive network load. The
range for good quality varies from a few metres to a hundred metres
depending on the equipment in use, obstacles, interference sources, and
so on. Therefore the second scenario is to switch calls between the
local wireless and cellular infrastructures in order to provide call
continuity outside the coverage area of the wireless LAN. As mentioned,
mobile phones and PDAs are now available with both cellular and 802.11
interfaces. This provides an option for switching to the cellular
network when needed. Alternatively, if local wireless coverage is
detected during a cellular call, a switch to the local network is
possible, thus freeing cellular resources and potentially avoiding the
cellular operator’s tariffs. Entering a home or office area are typical
scenarios in which a cellular call could be transferred to the local
802.11 network. The procedure of switching an ongoing call from one
technology to another is known as a handover or handoff. Ideally the
user should be unaware of the change, if this is the case it is known as
a seamless handover. The current technological barriers for seamless
handovers are the configuration and connection establishment mechanisms
rather than the switching of the voice stream. Switching a voice stream
means receiving two parallel streams at the same terminal over different
networks. Once both streams are running, the initial stream can be
stopped and the new voice stream played to the caller instead.
Voice Call Continuity (VCC) is so far the standardized way to handle
these kinds of handovers. VCC provides the infrastructure for performing
vertical handovers, but suitable triggers are still needed in order to
perform them in a timely fashion. As of Release 8, 3GPP has proposed a
generic framework for media service handovers called IMS Service
Continuity, which applies the same principles to other media services.
As call quality is paramount, the timing of handovers from the WLAN to
the cellular network is important. In the case of radio problems there
might be insufficient time to initiate and start a call to the cellular
network. In the case of handover due to the onset of congestion, the
handover success depends on the rates of the other flows. This is due to
the time needed to estimate the call quality and if need be, to initiate
a cellular-based call. In the other case where a user would move out of
the coverage area, there should be time to schedule the handover. The
speed and path of the user movement can be tracked to estimate whether
the user is moving out of coverage. In this case there is a design
tradeoff: To maintain connectivity in the coverage area as long as
possible to minimise the frequency of handovers on the one hand, or to
reduce the probability of poor quality and switch early on the other.
Therefore more conservative or aggressive switching algorithms can be
envisaged.
One solution is to use the 802.11 network where possible, but to
handover a call to the cellular network when the link conditions are
insufficient to support good quality as stipulated in the problem
statement. How to schedule this handover has been addressed in paper
H. Real-world voice handovers typically need time to initialise a
parallel technology to switch to. As calls to the public phone network
take in the order of five seconds to setup, estimation of deteriorating
quality conditions in the 802.11 network must anticipate (at least) this
interval ahead of the handover. The relation of this work to the problem
statement is in the heterogeneity of the systems and providing good
speech quality to the users.
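One way to realise this anticipation is to extrapolate a smoothed quality estimate over the call-setup interval and hand over when the prediction drops below an acceptability threshold. The sketch below is a hypothetical trigger under assumed constants, not the mechanism of any cited implementation:

```python
# Sketch of a predictive handover trigger: extrapolate a smoothed (EWMA)
# quality estimate forward by the cellular call-setup time (~5 s) and
# initiate handover if the predicted quality falls below a threshold.
# All constants are illustrative assumptions.

SETUP_TIME_S = 5.0   # assumed time to establish the fallback cellular call
THRESHOLD = 3.0      # assumed minimum acceptable MOS-like quality score
ALPHA = 0.3          # EWMA smoothing factor

class HandoverPredictor:
    def __init__(self):
        self.smoothed = None
        self.prev = None

    def update(self, quality: float, interval_s: float) -> bool:
        """Feed one quality sample; return True if a handover should be
        initiated now so it completes before quality becomes unacceptable."""
        if self.smoothed is None:
            self.smoothed = quality
        else:
            self.prev = self.smoothed
            self.smoothed = ALPHA * quality + (1 - ALPHA) * self.smoothed
        if self.prev is None:
            return False  # need at least two samples to estimate a trend
        slope = (self.smoothed - self.prev) / interval_s
        predicted = self.smoothed + slope * SETUP_TIME_S
        return predicted < THRESHOLD

pred = HandoverPredictor()
for q in [4.2, 4.1, 3.9, 3.6, 3.2]:   # quality decaying sample by sample
    trigger = pred.update(q, interval_s=1.0)
print(trigger)  # True: the extrapolated quality breaches the threshold
```

A conservative algorithm would lower ALPHA or THRESHOLD to switch later; an aggressive one would do the opposite, matching the design trade-off described above.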
One implementation ran an automated handover mechanism on a PDA running
Windows CE. The call quality is estimated in the terminal based on
network measurements, and a third-party application is signalled that
the current call should be transferred from the 802.11 network to the
cellular network. The handover was triggered when the quality fell below
a quality threshold. The implementation allowed automatic roaming from
802.11 to GSM networks. The goal of the implementation was to show proof
of concept, as well as to judge differences in the speech quality at the
time of handover.
There is still much research to be done in the voice handover area,
including monitoring the network conditions at the handset. As alluded
to earlier, tight integration achieves the best results, and in the case
of dual-radio phones, prediction of impending problems is the key
criterion. Not included in this research is the possibility of making
use of tracking, i.e. estimating the position or path of the user, which
would greatly influence the decision of whether to switch a call to an
alternate technology.
Pahlavan et al. present an overview of handover issues with a focus on
hybrid mobile data networks [@pahlavan00:_handof]. They propose a neural
network solution for handovers between 802.11 and GPRS networks and show
that it performs well. The E-Model as standardized by the
ITU-T allows for the prediction of voice quality based on network QoS
parameters [@G.107]. However, it is not useful for our purposes because
it does not take the signal strength and delay jitter into account.
Recent work by Hoene et al. proposes a real-time implementation of PESQ
called PESQlite [@Hoene2005:Thesis]. It reduces the complexity by
simplifying the PESQ algorithm, e.g. using constant-length test samples
and omitting time alignment of the degraded samples. Our off-line method
has a slightly different purpose: to obtain a mapping between
consecutive packet loss and the PESQ MOS score. Dimitriou et al. cite
interference and users moving out of range as limiting factors for good
VoIP quality in WLANs [@dimitriou03:inter_telep_wlans]. Their solution
is to use better speech coding, and they suggest an enhanced version of
G.711 to make the speech more resilient to loss. Kashihara
and Oie developed a WLAN handover scheme for VoIP that makes use of
MAC-layer information on the number of retransmissions of the voice
packets [@kashihara05:_handov_manag_number_retries_voip_wlans]. If this
number exceeds a certain threshold, the system switches to multi-path
transmission of the packets. As soon as one of the WLAN interfaces
reaches a stable condition, it can be used for single-path transmission.
Fitzpatrick et al. propose a transport-layer handover mechanism using
the stream control transmission protocol (SCTP)
[@fitzpatrick06:_approac_trans_layer_handov_voip_wlan]. The mechanism
uses the multi-homing feature of SCTP and measures the network
performance metrics by sending probes. Handover decisions are based on
speech quality estimations utilizing the ITU-T’s E-Model.
Mobility support
Operator Issues {#sect:operator_issues}
A historical view of VoIP from the operators
In short it is not hard to see that a battle between, or merger of,
these two communication paradigms will occur; the Internet ‘drivers’ are
keen to capture a percentage of the world’s voice market and use the
Internet as the carrier for voice, whilst the telecom manufacturers are
attempting to put Web and data services onto their handheld devices,
currently under the guise of 3G. From an operator’s point of view, it is
very important to save money by using only IP networks.
The telecom industry has also advanced whilst the Internet was becoming
part of our lifestyles, namely in the widespread arrival of mobile
telephony. Clunky, old analog phones are being replaced at a tremendous
rate by small, sleek, fashionable digital phones, transforming the way
we communicate allowing us to be finally free of the twisted copper
pair.
The outcome is likely to be, of course, some sharing of the market, with
the stark differences between data and telecom becoming, in fact, more
blurred. The traditional telecom operators own, operate, or buy capacity
on data networks, giving them an opportunity to use cheaper network
infrastructures and to save money by merging two or three different
infrastructures (voice, data, cable TV) onto one, cheaper alternative,
namely the IP infrastructure. This means that voice is carried on a data
network.
3GPP
The Internet revolution initially bypassed the traditional
telecommunications equipment manufacturers and operators. However, the
3rd Generation Partnership Project (3GPP), established in 1998, brought
together a number of commercial, organizational and standardization
bodies to work on integrating IP into their solutions for mobile
communication. 3GPP has already standardized the use of an IP based core
network. Today telecommunication companies are deploying the 3GPP IP
Multimedia Subsystem (IMS) to merge Internet technologies with mobile
networks. So called ‘Release 5’ enables operators to upgrade their
existing telecommunication equipment and allows a smooth transition to
IP technology. IMS is based upon the Session Initiation Protocol
(SIP). The upcoming 3GPP Long Term Evolution (LTE) standard will use IP
in both the access and core networks to carry data and voice traffic.
Currently local wireless IP voice services have not reached significant
market penetration, as current handsets and infrastructure are dominated
by the telecommunication industry’s 2nd and 3rd generation standard
solutions. There can be voice quality issues with today’s data-centric
wireless LAN technologies, mainly due to coverage and heavy load
situations. These are discussed in the next section.
Local wireless and operators
Generic Access Network (GAN), formerly known as UMA (Unlicensed Mobile
Access), is one possibility to provide seamless roaming between local
and wide area networks [@ETSI:GAN]. GAN allows voice, data, and IMS/SIP
applications to be accessed from a mobile phone. The operation of GAN is
as follows: Once a local wireless network is detected (e.g. Bluetooth or
802.11) the handset initiates a secure IP connection through the local
network to a gateway in the operator’s network. A GAN server makes the
handset appear as if it were connected to a new base station. Thus, when
the handset moves from a cellular to an 802.11 network, it appears to the
core network as if the handset is simply associated with a different
base station. There is GAN support for 2nd and 3rd generation cellular
technologies.
Gateways
User aspects {#Quality_aspects_of_Internet_telephony}
This section is divided into two parts, quantifying quality and some
standardized approaches for calculating it. As tools and methods have
been developed for the telephony industry, it seems natural to re-use
them for Internet telephony where appropriate. We will introduce two
standardized methods for estimating VoIP quality as they are used within
this book. For a more in-depth treatment of objective and
subjective methods consult [@raake06:speech]. From the user’s
perspective, there should be no major quality difference between
telephony being carried by the Internet and a regular telephony network.
Quantifying quality
Although most people have a good sense of what good quality (or more
accurately, fidelity) means during electronic communication, it is not
straightforward to translate this into measurable parameters of a
system. First, the system we are dealing with is a distributed system,
and each component has its own individual attributes. Second, people are
involved in the assessments, adding inevitable human variation. Third,
people are adaptable, so ratings tend to change over time. Finally, the
situations differ from environment to environment.
The simplest form of quality rating for speech would be something
descriptive, for example ‘EXCELLENT’ for a speech sequence that was
almost glitch-free down to ‘POOR’ for one that was barely
understandable. Different words could be used, or any number of
intervals between the extreme choices; however, studies have shown that,
in a descriptive setting, three intermediary steps are reasonable.
Numerically it is somewhat easier to obtain a finer scale; however, more
than ten intervals often leads to fuzziness between the intervals.
Humans and subjective topics
In this section we consider subjective quality scores, where some form
of human testing has been conducted. Results that do not involve human
listeners are covered in other sections.
The following studies consider subjective aspects of quality:

- Su, Srivastava, and Yao, "Investigating factors influencing QoS of Internet phone" [@Su9906:Factors]
- Hamdi et al., "Fresh Packet First Scheduling for Voice Traffic in Congested Networks" [@Hamdi1297:Fresh]
- Watson and Sasse, "The Good, the Bad, and the Muffled: the Impact of Different Degradations on Internet Speech" [@watson0010:good]
- Wilson and Sasse, "Investigating the Impact of Audio Degradations on Users: Subjective vs. Objective Assessment Methods" [@wilson0012:Investigating]
- Wilson and Sasse, "Do Users Always Know What’s Good For Them? Utilizing Physiological Responses to Assess Media Quality" [@wilson0009:Do]
- De Vleeschauwer et al., "Quality Bounds for Packetized Voice Transport" [@DeVl00:Quality]
- Sun et al., "End-to-end Speech Quality Analysis for VoIP" [@sun00:end-to-speech]
- Sun et al., "VoIP Speech Quality Simulation and Evaluation" [@sun00:voip]
- Jansson et al., "Delay and Distortion Bounds for Packetized Voice Calls of Traditional PSTN Quality" [@Jans0004:Delay]
- Aldini et al., "Comparing the QoS of Internet Audio Mechanisms via Formal Methods" [@aldini1099:comparing]
- Jiang and Schulzrinne, "Speech Recognition Performance as an Effective Perceived Quality Predictor" [@Jian0205:Speech]
- Cole and Rosenbluth, "Voice over IP Performance Monitoring" [@cole01:voip]
- Mohamed et al., "Integrating Network Measurements and Speech Quality Subjective Scores for Control Purposes" [@Moha0104:Integrating]
- Hoene et al., "On the Importance of a VoIP Packet" [@Hoene0304:Importance]
- Hoene, "Impact of Single Frame Loss Events" [@Hoene1003:Impact]
- Sun and Ifeachor, "Perceived Speech Quality Prediction for Voice over IP-based Networks" [@sun02:perceived]
Measuring quality
Determining an accurate quantitative measure for human speech fidelity
is desirable, but impossible. The best one can achieve is a qualitative
rating that has been established in a rigorous and controlled manner.
Typically test listeners and controlled auditory conditions are used for
people to rate speech coder performance for example. It can be expensive
and time-consuming. There are tools and methods that map qualitative
assessments to quantitative values, however they will always be, to some
degree, approximate. If one can show, however, that there is reasonable
correlation between the qualitative and quantitative results, and under
what conditions the correlation holds, then this solution may be
acceptable to some users. Some objective tools, such as those which use
signal processing techniques, have shown this correlation and hence have
found acceptance within the community. Developers of such tools can
therefore claim, with some degree of confidence, that their techniques
give results close to those that real listeners would.
Quality tolerances
When human speech is uttered, the time taken from when the pressure
waves leave the mouth to the sensation of hearing is a fraction of a
second for a nearby speaker. We have evolved to expect, and actually
need, to hear our own voice, in order to be sure that we are saying what
we really want to. Human speech and hearing have, however, developed
through face-to-face communication, so extra visual and body cues are
available when uncertainty is present. An example of such
‘understanding’ is when a language is being spoken that we do not
understand: we can sometimes guess the meaning from gestures, facial
expressions, and intonation.
On the other hand, impaired speech requires extra concentration from the
listener: we are not used to processing distorted or missing segments,
and visual and auditory cues are more difficult to interpret.
Communicating with people from afar is somewhat similar: we do not
receive the original speech signal, and visual cues are harder to see.
In IP voice communication systems visual cues are non-existent, making
intelligibility all the more important. In order for speakers to hear
their own voice, a very short delay is introduced between capturing the
voice and replaying it to the speaker. This is particularly applicable
when using headsets. The introduced delay is in the order of 5 ms.
As far as the delay in the system is concerned, it is obviously
desirable to keep it below some maximum. This is in the order of half a
second. Delay is discussed from a networking perspective below. Recent
results have shown that delay is not as significant as once postulated,
at least in VoIP systems. Traditional telephony standards have been much
stricter with respect to delay budgets [@G.114]. If one is not in a
highly interactive conversation, then higher delays can be tolerated
than those suggested by telecommunication standards. This is
particularly true in situations where people use computers: users
already expect delays (operating system hiccups), and therefore their
delay expectations of the communication system are also relaxed.
If users are engaged in quick voice exchanges, delays will frustrate
their conversational style. Therefore, introducing the factor of
interactivity into an objective quality measure is still under research.
The following studies have looked at conversational interactivity
[@varela2005:varela; @05:_study_of_relat_between_subjec; @Froehlich1004:Elements; @Reichl:C04hot; @Hammer:C05the].
The last reference in this list examines the potential impact of
interactivity on the perceived quality of Internet telephony services.
Where delays and losses are experienced at the same time, it has been
shown that the influence of losses is much more significant with respect
to the perception of quality degradation than the influence of delay.
This implies that people are able to make a transition from highly
interactive scenarios to a more measured communication style. In fact
this transition appears to be somewhat bilinear, that is, the quality
degradation from an interactive mode to a simplex conversation mode
occurs in two linear steps, with the break at about 400 ms. Varying
delays can be disturbing, due to the listener not being allowed to
settle into a single mode of operation. For more information on the
influence of delay on Internet telephony see [@Boutremans0312:Delay].
Quality and noise
The quality of voice communication actually depends on many
(independent) factors. The effect of noise, be it in the electrical
circuitry, or in the surrounding environment can be a determining factor
in the perceived quality.
The quality of the components is a key issue in voice systems. Lower
quality components can leave voice sounding thin, i.e. lacking bass in
the speech. Background noise, caused by poor grounding or shielding of
the analogue components is frequently experienced as low frequency
humming in the system. Internet telephony systems that use on-board
sound cards can introduce noise of this nature into the signal. USB
headsets are helpful, and they also alleviate the need for echo
suppression.
The environment is another factor: whether a noise source is remote
(distant from the speaker) or local (close by). In the remote case,
the non-speech parts of the voice should be suppressed so as not to
interfere with the spectral analysis of the voice processing.
Undesirable noises from similar frequencies and volumes will be encoded
into the signal, sent, and reproduced for the listener. Often listening
to a remote speaker in the presence of background noise is more
difficult than when background noise is present locally.
Research in the signal processing field has studied the issue of noise
in systems [@sallberg:analog_circuit_implem]. Important speech
parameters such as the intelligibility, clearness, or naturalness of
speech can be improved by signal processing using digital, analog, or
hybrid solutions. A robust, low complexity, speech enhancement algorithm
has been proposed to show the advantages of a purely digital, purely
analog, and a hybrid digital-analog implementation in
[@sallberg:speec_enhan_implem].
In terms of testing systems with controlled noise, the ITU conducts
tests with standardized background noises, generated by the modulated
noise reference unit (MNRU) [@ITU:P810]. Typically well defined noise
patterns of fixed modulated noise are presented at the beginning of each
test. Each sample represents an example distortion corresponding to a
five grade impairment scale (excellent to poor). The MNRU has been used
extensively in subjective performance evaluations of conventional
telephone and wide-band voice systems.
The ITU-T E-model
The E-model is intended as an off-line planning tool, but due to its
simple form it has found application in on-line assessments as well.
Network planners can input parameters from a system and obtain a
numerical value (between 0 and 100) representing an estimate of the
perceived quality.
One important point of the E-model is that loss, delay, speech coding,
and echo parameters are combined linearly to calculate the so called
impairments that result in the score. The E-model assumes the
parameters are independent. Another important (selling) point of the
E-model is that the numerical scores correlate well with subjective
tests, indicating that this estimation is indeed possible. Since the
linear combination is simple, and most of the parameters are easily
measurable, the E-model has been popular for a number of years.
The E-model also indicates how network impairments and speech coding can
be combined to give an approximate estimate of voice quality. It is
important to state that there are many tunable parameters included in
the model, 19 in fact, not including the different speech encodings and
loss concealment methods. Interestingly, jitter is not explicitly
included as an input parameter. As jitter can affect whether packets
arrive in time for playout or not, late packets for a real-time audio
application are akin to network loss or delay, which are included in the
model.
The ITU-T E-model was first proposed in 1998 [@G.107]. Some important
IETF standards were also first published in the same year namely RTSP
[@IETF:RFC2326] and SDP [@IETF:RFC2327].
::: {#table:Rmodel_user}
| User satisfaction             | R-value | MOS score |
|-------------------------------|---------|-----------|
| Very satisfied                | 90      | 4.3       |
| Satisfied                     | 80      | 4.0       |
| Some users dissatisfied       | 70      | 3.6       |
| Many users dissatisfied       | 60      | 3.1       |
| Nearly all users dissatisfied | 50      | 2.6       |

: The ITU’s E-model and MOS scores
:::
Table 3{reference-type="ref"
reference="table:Rmodel_user"} shows scalar values known as the R-value
derived from the computational model. They are relatively consistent
with subjective scores, i.e. real user estimations of the speech
quality, shown by their respective mean opinion scores (MOS). Mean
opinion scores are derived by replaying samples to a naïve set of
listeners who rank the quality on a scale from 5 (best) to 1 (worst).
The R-value is defined as shown in equation
[eqn:rvalue]{reference-type="ref"
reference="eqn:rvalue"}.
$$\begin{aligned}
R = R_o - I_s - I_d - I_{e-eff} + A
\label{eqn:rvalue}\end{aligned}$$
$R$ = rating value
$R_o$ = signal-to-noise ratio (noise sources)
$I_s$ = voice impairments to the signal (side-tones and quantization
distortion)
$I_d$ = delay and equipment impairments
$I_{e-eff}$ = packet loss impairment (including random packet losses)
$A$ = advantage factor (compensation for ‘other’ factors)
Each of the factors is calculated and subtracted from the maximum of 100
to obtain the R-value. The impairment due to the delay is denoted by
$I_d$. Two different values are defined: $I_d = 0$ if the absolute delay
($T_a$) is less than 100 ms, i.e. no impairment, or an increasing $I_d$
if the delay is over 100 ms. A number of amendments have been made to
incorporate non-random losses into the model
[@ITU01054:prediction; @ITU0304:E-model]. The effect of packet loss on
the R-value is given by the $I_{e-eff}$ term. $I_{e-eff}$ is defined
in the E-model as:
$$\begin{aligned}
I_{e-eff} = I_e + (95 - I_e) \cdot \frac{P_{pl}}{P_{pl} + B_{pl}}\end{aligned}$$
$P_{pl}$ = packet loss probability
$B_{pl}$ = packet loss robustness
For G.711, $I_e = 0$. This means for situations without loss, G.711
provides the best speech quality. The advantage factor $A$, is a value
that indicates how tolerant users can be when using telecommunication
equipment. It can be seen as a willingness to trade quality for
operational convenience. One example is with mobile telephony, where
users accept lower quality since they have the luxury of being mobile.
Another example, as mentioned, is that higher delays are tolerated when
using a computer as a communication device rather than a telephone.
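The R-value computation and the standard mapping from R to an estimated MOS can be sketched as follows. The defaults ($R_o = 93.2$, and $B_{pl} = 4.3$ for G.711) are commonly used values; treat this as an illustrative sketch rather than a complete G.107 implementation:

```python
# Minimal sketch of the E-model R-value and its mapping to MOS.
# Only the loss term I_e-eff is computed from inputs; Ro, Is and Id are
# fixed at commonly used defaults and are assumptions of this sketch.

def ie_eff(ie: float, ppl: float, bpl: float) -> float:
    """Effective equipment impairment for packet loss probability
    ppl (in percent) and loss robustness factor bpl."""
    return ie + (95 - ie) * ppl / (ppl + bpl)

def r_value(ppl: float, ie: float = 0.0, bpl: float = 4.3,
            ro: float = 93.2, i_s: float = 0.0, i_d: float = 0.0,
            a: float = 0.0) -> float:
    """R = Ro - Is - Id - Ie-eff + A, per the equation above."""
    return ro - i_s - i_d - ie_eff(ie, ppl, bpl) + a

def r_to_mos(r: float) -> float:
    """The standard G.107 mapping from R to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# G.711 (Ie = 0, assumed Bpl = 4.3) with no loss, delay, or other impairment:
print(round(r_to_mos(r_value(ppl=0.0)), 2))  # 4.41
```

Evaluating `r_to_mos` at R = 90, 80, 70, 60, and 50 reproduces the MOS column of the user-satisfaction table above to one decimal place.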
Perceptual Evaluation of Speech Quality (PESQ) {#subsect:perceptual_evaluation_of_speech_quality}
{#fig:pesq_structure width="30%"}
| PESQ MOS | Linguistic equivalent | Quality degradation |
|----------|-----------------------|---------------------|
| 4.5      | Excellent             | None                |
| 4        | Good                  |                     |
| 3.5      | Good/Fair             | Moderate            |
| 3        | Fair                  |                     |
| 2.5      | Fair/Poor             | Severe              |
| 2        | Poor                  |                     |
| 1        | Bad                   |                     |
Although the E-model is popular for estimating quality using network
parameters, it has shortcomings. As we have seen, the bursty effects of
packet loss on speech quality are not well addressed in the E-model. A
later development by the ITU was to develop a scheme that could improve
on the E-model by estimating the impact of speech coding and losses on
the original speech signal itself. The solution, the "Perceptual
Evaluation of Speech Quality", more commonly known as PESQ, addresses
these issues [@ITU:p862].
The idea is to estimate the degradation of the coding and loss on a
speech sample using a model of the human auditory system. Figure
5{reference-type="ref"
reference="fig:pesq_structure"} shows the functional units of PESQ. A
reference speech signal is transmitted through a network that results in
a quality degradation corresponding to the coding used and the network
losses. PESQ analyzes both the reference and the degraded signal and
calculates their representation in the perceptual domain based on a
psychoacoustic model. The disturbance between the original and the
degraded speech signals is calculated by a quality estimation algorithm
and a corresponding subjective mean opinion score (MOS) is derived. The
evaluation of speech quality using PESQ is performed off-line due to its
computational complexity. If one assumes a 20 ms packetization and an
eight second sample, the sequence would then be 400 packets. As an
indication of the time needed to compute a PESQ score, a sequence with
ten losses requires approximately two seconds of processing time for
G.711 coded speech on a Pentium III computer. G.711 yields the maximum
PESQ score (4.5) in the absence of loss; however, it is particularly
sensitive to packet loss even when concealment is used.
PESQ’s validity has been established by its ratings being sufficiently
correlated with subjective ratings, as we discussed in the introduction
of this section. More recent research correlating PESQ with subjective
scores shows that some small transformations are needed to better align
PESQ with MOS [@Rix:C03comparison].
Other measures
Recent work by Hoene et al. proposes a real-time variant of PESQ called
PESQlite [@Hoene2005:Thesis]. The motivation is that PESQ itself is
computationally too expensive for real-time use. PESQlite therefore
reduces the complexity by simplifying the PESQ algorithm, e.g. by using
constant-length test samples and omitting time alignment of the degraded
samples. PESQlite is currently only available for G.711 coding.
One other alternative for an objective measure is to use machine speech
recognition as a MOS predictor [@Jian0205:Speech]. The technique uses a
word recognition ratio metric to reliably predict perceived quality.
This ratio is speaker-independent, whereas the absolute word recognition
ratio of a speech recognizer is speaker dependent. The relative word
recognition ratio is obtained by dividing the absolute word recognition
ratio by its value at 0% loss. The results show that human- and
machine-based recognition techniques are correlated, although not
linearly. It has also been found that the human-based word recognition
ratio does not degrade linearly once packet loss exceeds 10%, due to
performance limits of the codec.
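The normalization step above can be sketched in a few lines (a hypothetical illustration; the function name and figures are ours, not taken from [@Jian0205:Speech]):

```python
def relative_recognition_ratio(absolute_ratio, ratio_at_zero_loss):
    """Normalize a recognizer's absolute word recognition ratio by its
    value at 0% packet loss, yielding a speaker-independent metric."""
    return absolute_ratio / ratio_at_zero_loss

# Example: a recognizer scoring 0.72 under loss and 0.90 with no loss
print(relative_recognition_ratio(0.72, 0.90))  # 0.8
```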
P.OLQA was selected in September 2010 to form the basis of the new ITU-T
voice quality testing standard, P.863. P.OLQA is the next-generation
voice quality testing algorithm for fixed, mobile and IP-based networks.
Objective methods for speech quality {#sect:objective}
Essentially, PESQ methods estimate the speech quality based on a
psycho-acoustic model of human perception by comparing the degraded
speech sample with its clean version in the perceptual domain. Regarding
related work, Lakaniemi et al. looked at combining small-scale
measurements with subjective MOS scores [@Laka0106:Subjective], while
[@conway0204:ip_itu_t] looked at passive measurements using PESQ
evaluation techniques. Hoene et al. suggest a new perceptual model for
adaptive VoIP applications that takes into account real-time factors
such as delay spikes or changes in the coding mode [@hoene2004:spects].
Wilson and Sasse look at the larger picture of objective versus
subjective techniques in speech degradation assessment
[@wilson0012:Investigating].
Speech encoding {#sec:speech_coding}
Human speech occupies a fundamental frequency in the range of 85-155 Hz
for men and 165-255 Hz for women. Higher tones or harmonics can be heard
up to 10 kHz. Faithfully reproducing this full frequency range would
require a sampling rate of at least twice the highest frequency. In a
voice transmission system, the speech is sampled and then digitized
according to the quality requirements (or restrictions) of the
transmission system. In a system such as traditional telephony, the
channel capacity is not sufficient to faithfully accommodate human
speech’s full frequency range.
The vocoder was invented in the late 1930s and is an implementation of a
model of the human sound production system. Vocoders are often known as
analysis-synthesis systems: the input speech is passed through a
multiband filter and each filter output is passed through an envelope
follower. The signals from the envelope followers are transmitted, and
the decoder applies the amplitude-controlled signals to corresponding
filters in the synthesizer. The main motivation for this type of system
was to cryptographically encode the signals during transmission. Delta
modulation appeared in 1952; it is the simplest form of differential
pulse-code modulation (DPCM), where the difference between successive
samples is encoded into a one bit stream. Also in the 1950s the Lincoln
Laboratory at MIT conducted a study of pitch detection in speech, which
led to vocoders designed to reduce the speech bandwidth. The first LPC
ideas came about in 1966 from work done at NTT in Japan. In the late
1960s early real-time versions of LPC coders were implemented. The
first workable LPC encoder was the US government’s LPC-10 coder
developed in the early 1980s [@Trem8204:Government]. The ten in LPC-10
signifies the number of coefficients it used. 1964 saw the
standardization of PCM waveform coding for fixed telecommunication
networks. The implications of this choice are still with us today.
Well before the modern Internet was devised, people were investigating
alternatives to the traditional telephony system for carrying voice. The
earliest accounts of packet switched networks can be found in the signal
processing community. Researchers and engineers were looking for
computationally efficient methods of compressing voice for transmission
over low bandwidth links. In fact, advances in low data rate coders and
the deployment of a distributed packet switched network led to some of
the earliest findings [@Magi73:Adaptive]. The details of the networking
are often omitted, but the idea was to block-code voice for
transmission. Much of the focus was on LPC and entropy methods.
Blankenship et al. described the Lincoln Laboratory digital voice
terminal system in a technical note published in 1975. Accounts of the
early days of vocoder work can be found in [@gray.05:_voip] and the
small amount of networking in [@cohen99:_realt_networ_packet_voice].
Moving forward a number of years, warped LPC was first proposed in 1980
which is a variant of LPC where the spectral representation of the
system is modified. This reduces the bitrate required for a given level
of perceived audio quality/intelligibility. In 1985 the Code-Excited
Linear Predictive (CELP) codec was introduced
[@schroeder85:_code_excit_linear_predic_celp]. The ITU’s G.729 was
standardized in 1996 [@ITU:g729]. In 1997 the Enhanced Full Rate (EFR)
codec was standardized. More recently intelligent multimode terminals
have appeared that can adapt their configuration to different rates,
quality and robustness. These are known as adaptive multirate (AMR)
codecs, standardized in 1998. For an account of early vocoder research,
consult [@Gold7712:Digital].
Pulse Code Modulation (PCM)
In narrowband telephony, the frequency bandwidth is restricted to 3100
Hz, ranging from 300 to 3400 Hz. Voice in the fixed telephony system has
therefore to be reduced from its original range to this 3100 Hz range (a
reduction to about one third). The lower limit of the human range is
below that of the telephony system. This is not as problematic as it may
seem, due to the perceptual system’s ability to reconstruct the lower
tones from their overtones. Traditional telephony does not use the low
frequencies as they are very hard to reproduce with inexpensive
loudspeakers.
Quantizing the sampled waveform can be done either with constant steps
between the sample levels or with non-constant steps; such systems are
known as linear and non-linear quantizers respectively. From a 12 bit
linear input signal, an 8 bit companded signal can be produced which has
a similar signal-to-noise ratio to the original. Non-linear quantization
has the advantage that the quantization performance is independent of
the signal loudness. Its disadvantage is lower accuracy for larger
amplitude signals. Two (similar) examples of non-linear quantizing
encodings are the A-law and $\mu$-law companders. There are three
main methods of implementing the $\mu$-law algorithm:
- The first uses an amplifier with non-linear gain to achieve companding
  entirely in the analogue domain.

- The second uses an analogue-to-digital converter with quantization
  levels that match the $\mu$-law algorithm.

- The third converts the 12 bit linearly quantized representation to
  $\mu$-law coding entirely in the digital domain.
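The third, all-digital method can be sketched with the continuous $\mu$-law characteristic (a simplified illustration; the real G.711 encoder uses an 8 bit segmented approximation of this curve rather than the formula directly):

```python
import math

MU = 255  # standard mu value for North American/Japanese telephony

def mulaw_compress(x):
    """Map a sample in [-1, 1] to [-1, 1], giving proportionally more
    of the output range to quiet samples near zero."""
    return math.copysign(math.log(1 + MU * abs(x)) / math.log(1 + MU), x)

def mulaw_expand(y):
    """Inverse compander, recovering the linear sample."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

# A 1% amplitude input occupies roughly 23% of the companded range:
print(round(mulaw_compress(0.01), 3))  # ~0.228
```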
In Europe A-law coding is used. The A-law algorithm provides a slightly
larger dynamic range than the $\mu$-law version at the cost of worse
proportional distortion for small signals. By convention, A-law is used
on an international connection if at least one country uses it. The
G.711 standard encapsulates the A-law and $\mu$-law formats into a
single standard [@ITU:g711]. G.711’s simplicity (and the low SNR) makes
it the default choice in the non-wireless telecommunications
infrastructure.
Adaptive differential pulse-code modulation (ADPCM)
Differential (or delta) pulse-code modulation (DPCM) encodes the PCM
values as differences between the current and a predicted value. An
algorithm predicts the next sample based on previous samples, and the
encoder transmits only the difference between this prediction and the
actual value. If the prediction is reasonable, fewer bits can be used to
represent the same information. For speech, this type of encoding
reduces the number of bits required per sample by about 25% compared to
PCM. Adaptive DPCM (ADPCM) is a variant of DPCM that varies the size of
the quantization step to allow further reduction of the required
bandwidth for a given signal-to-noise ratio. The rate of ADPCM is 32
kb/s.
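A minimal sketch of the DPCM idea, assuming the trivial "previous sample" predictor and no quantization (ADPCM would additionally adapt the quantizer step size, which is omitted here):

```python
def dpcm_encode(samples):
    """Transmit only the difference between each sample and the
    prediction (here simply the previous sample)."""
    diffs, prev = [], 0
    for s in samples:
        diffs.append(s - prev)  # small if the prediction is good
        prev = s
    return diffs

def dpcm_decode(diffs):
    """Rebuild the signal by accumulating the differences."""
    out, prev = [], 0
    for d in diffs:
        prev += d
        out.append(prev)
    return out

signal = [100, 104, 110, 111, 108]
print(dpcm_encode(signal))  # [100, 4, 6, 1, -3]
```

Because speech samples are strongly correlated, the differences are much smaller than the samples themselves and need fewer bits to represent.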
Low bit rate models
Speech that is sampled and encoded using A-law or $\mu$-law at 8000
samples per second with 8 bit resolution for each sample produces a data
rate of 64 kb/s. Current speech coding techniques can produce encoded
voice with rates as low as 16 kb/s that is indistinguishable in quality
from the 64 kb/s codec. We will discuss some of these schemes soon;
however it is first necessary to explain how humans produce speech, in
order to understand the technique known as source-filter modeling.
Human production of sounds
The lungs produce a stream of air that enters the vocal tract. The vocal
tract is the pharynx, mouth, and nasal cavities. There are essentially
two types of sounds: voiced and unvoiced sounds. Voiced sounds such as
/a/ or /e/ are produced by the vocal cords. Unvoiced sounds have two
types, the first type is fricatives such as /s/, /sh/, or /f/ which are
produced when the vocal tract is constricted. The second type of
unvoiced sounds are known as plosives, and include sounds such as /p/,
/k/ or /t/. They are produced when the end of the vocal tract is closed,
pressure is built up, and the pressure is released suddenly. There are
actually additional types of sounds such as the nasal /n/ sound, but we
will omit these from the following discussion.
Voiced and unvoiced segments
In order to encode and transmit speech at low bit rates, it is necessary
to differentiate between the voiced and unvoiced sounds. As we will see,
these sounds constitute different parts of a source filter model, and
are actually transmitted separately. Different techniques are available
to separate them:

- Spectral flatness: calculated as the geometric mean of the power
  spectrum divided by the arithmetic mean. Unvoiced frames (typically
  20 ms long) are flatter than voiced frames. The spectral flatness can
  also be measured within a specified sub-band of frequencies as well as
  across the whole frequency band.

- Energy: the sum of the squared values of the sampled frame. Voiced
  frames have greater energies than unvoiced frames.

- Zero crossing points: counting the sign changes in the signal; voiced
  frames exhibit fewer crossing points than unvoiced frames.
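The latter two features are easy to sketch; the frame length matches 20 ms at 8 kHz, but the two synthetic "frames" below are illustrative stand-ins, not real speech:

```python
import math

def zero_crossing_rate(frame):
    """Count sign changes; unvoiced frames cross zero more often."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))

def energy(frame):
    """Sum of squared samples; voiced frames carry more energy."""
    return sum(s * s for s in frame)

n = 160  # one 20 ms frame at 8 kHz
# Stand-in "voiced" frame: strong low-pitched tone near 120 Hz.
voiced = [math.sin(2 * math.pi * 120 * i / 8000) for i in range(n)]
# Stand-in "unvoiced" frame: weak high-frequency content near 3 kHz.
unvoiced = [0.1 * math.sin(2 * math.pi * 3000 * i / 8000) for i in range(n)]

print(zero_crossing_rate(voiced) < zero_crossing_rate(unvoiced))  # True
print(energy(voiced) > energy(unvoiced))                          # True
```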
Source-filter models
The most popular technique within source filter models is based on
linear predictive coding (LPC). The basic idea is to model the speech
generator as produced by the human vocal system, described in the
previous section. The generator is a simple buzzer at the end of a tube.
The space between the vocal cords (called the glottis) produces the
buzz. It is characterized by its intensity and frequency (pitch). The
vocal tract (the throat and the mouth) forms the tube, which is
characterized by its resonances, these are known as formants.
The parametric coding process
Low bit rate coders estimate the formants, remove their effects from the
speech signal, and then estimate the intensity and frequency of the
remaining buzz. The process of removing the formants is called inverse
filtering, and the remaining signal after the subtraction of the
filtered signal is called the residue. The formants and the residue can
then be transmitted to recreate the voice at the receiver. Another term
for this process is vocoding, a contraction of the words voice and
coding.
Decoding or synthesizing the speech signal is done by reversing the
process. The buzz parameters are used together with the residue to
create a source signal. The formants are used to create a filter (which
is the tube), and the source is run through the filter reproducing the
original speech. The spectral information is well suited for vector
quantization. Compression algorithms often differ in how the residuals
are treated. Typically 30 bits are used to code the 10 coefficients for
basic LPC quality, and up to 18 coefficients can be used for improved
fidelity.
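The formant-estimation step can be illustrated with a textbook linear-prediction sketch using the autocorrelation method and the Levinson-Durbin recursion (a generic formulation under our own conventions, not that of any particular standard; the synthetic two-pole "vocal tract" is an assumption for the demonstration):

```python
import random

def lpc(x, order):
    """Return predictor coefficients a[1..order] minimising the squared
    error of the prediction x_hat[n] = sum_j a[j] * x[n-j]."""
    n = len(x)
    # Autocorrelation of the frame at lags 0..order.
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a, e = [], r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step.
        k = (r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))) / e
        a = [aj - k * a[i - 2 - j] for j, aj in enumerate(a)] + [k]
        e *= 1 - k * k  # remaining prediction-error energy
    return a

# Recover the coefficients of a synthetic 2nd-order "vocal tract":
random.seed(1)
x, c1, c2 = [0.0, 0.0], 0.9, -0.2
for _ in range(5000):
    x.append(c1 * x[-1] + c2 * x[-2] + random.gauss(0, 1))
a = lpc(x, 2)
print([round(v, 1) for v in a])  # approximately [0.9, -0.2]
```

Inverse-filtering the signal with these coefficients leaves the residue; transmitting the coefficients plus a compact description of the residue is the essence of vocoding.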
Code excited linear prediction (CELP)
In an attempt to improve on the robotic sound of early LPC schemes, a
number of improvements were made that have led to methods used in modern
codecs (see section 8.10{reference-type="ref"
reference="subsec:modern_codecs"}). Multi-excitation linear predictive
coding (MELPC) is based on LPC but instead of using a periodic pulse
train for the voiced segments and white noise to represent the unvoiced
segments, it uses mixed periodic and aperiodic pulses, a pulse
dispersion filter, and spectral enhancement. The multi-pulse linear
predictive coder (MPLPC) is an analysis-by-synthesis approach in which
each excitation vector consists of a number of pulses whose amplitudes
are derived from closed-loop optimisation. CELP uses a codebook
(sequence) of excitation pulses as the excitation rather than the
multi-pulses of MPLPC. The optimum sequence is chosen to minimise the
distortion between the derived signal and the original one. At the
decoder the sequence of excitation signals is passed through a long-term
filter and an LPC vocal tract filter to produce a block of reconstructed
samples. The bitrate of CELP coders is usually in the range of 5 to 15
kb/s.
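The closed-loop codebook search at the heart of CELP can be caricatured with a toy brute-force search over a tiny hypothetical codebook (real coders use structured codebooks and a perceptually weighted error, omitted here):

```python
def best_codeword(target, codebook):
    """Pick the codebook index and gain minimising the squared error
    against the target excitation (analysis-by-synthesis selection)."""
    best = (float("inf"), None, 0.0)
    for idx, cw in enumerate(codebook):
        num = sum(t * c for t, c in zip(target, cw))
        den = sum(c * c for c in cw)
        gain = num / den if den else 0.0  # optimal gain for this codeword
        err = sum((t - gain * c) ** 2 for t, c in zip(target, cw))
        if err < best[0]:
            best = (err, idx, gain)
    return best[1], best[2]  # only the index and gain are transmitted

codebook = [[1, 0, -1, 0], [1, 1, 1, 1], [0, 1, 0, -1]]
idx, gain = best_codeword([2, 0, -2, 0], codebook)
print(idx, gain)  # 0 2.0
```

The decoder holds the same codebook, so reconstructing the excitation requires only the received index and gain.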
Transform coders
Transform coding tries to draw the best from the waveform-tracking
techniques used in PCM encoders, but also includes models of the human
production of speech as the source-filter models do. Knowledge of the
speech signal is used to select which information to discard in order to
lower the bandwidth of the signal. Transform coding derives its name
from frequency-based techniques that code the transform coefficients in
a manner suitable for voice. Different transforms have been suggested
for speech compression; we will briefly consider just two: the
Karhunen-Loève transform (KLT) and the Discrete Cosine Transform (DCT).
The Karhunen-Loève transform offers optimal coding performance (in terms
of minimum square error) if the input samples are Gaussian distributed
and the coefficients are scalar quantized. However the Karhunen-Loève
transform is difficult to implement and its performance is signal
dependent. The DCT is signal independent, but is sub-optimal (compared
to the KLT) in that it cannot completely de-correlate the transform
coefficients. The DCT is attractive since there are computationally
efficient algorithms to compute it, and it retains the formant structure
of the speech. The bitrate of transform coders is in the range of 10-20
kb/s, but they can produce higher fidelity speech.
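A naive DCT-II in pure Python illustrates the energy-compaction property that makes the transform attractive for coding (production coders use fast O(N log N) algorithms rather than this O(N²) form):

```python
import math

def dct2(x):
    """Naive DCT-II: concentrates a smooth signal's energy in a few
    low-order coefficients."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n)) for k in range(n)]

# A constant (maximally smooth) frame: all energy lands in coefficient 0.
coeffs = dct2([1.0] * 8)
print(round(coeffs[0], 6))                     # 8.0
print(all(abs(c) < 1e-9 for c in coeffs[1:]))  # True
```

Coefficients near zero can be quantized coarsely or discarded, which is where the bitrate reduction comes from.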
Modern codecs GSM, G.729 and iLBC {#subsec:modern_codecs}
GSM networks employ a LPC-based speech encoding technique called
Code-Excited Linear Predictive (CELP) coding. The significant difference
between CELP and LPC is that the excitation signals are not simply
generated based upon a voice or unvoiced sound, but taken from stored
codebooks. There are two types of codebooks, fixed and adaptive which
are used in conjunction to code the signal. ETSI’s GSM has defined
different rate voice codecs ranging from 6 kb/s (half-rate) to 13 kb/s
(full-rate). GSM was further enhanced in the mid-1990s by the GSM-EFR
codec (enhanced full-rate), which is a 12.2 kb/s codec that uses a
full-rate GSM channel. GSM is one of the preferred speech coding schemes
for wide area radio links. EFR is a fixed rate codec, however some GSM
networks now use Adaptive Multi-Rate (AMR) coding [@Bessette:A01]. AMR
uses link adaptation to select from one of eight different bit rates
depending on the instantaneous link conditions.
G.729 is another example of an LPC-based encoder, again a CELP codec.
The coded stream consists of linear prediction coefficients, the
excitation codebook indices, and gain parameters. Technically it is
known as a conjugate-structure algebraic code-excited linear prediction
scheme (CS-ACELP). The standard rate of G.729 is 8 kb/s. It requires
10 ms input frames and produces an 80 bit output frame. It also includes
a 5 ms lookahead, producing a 15 ms algorithmic
delay. Annex B of the recommendation (G.729B [@ITU:g729b]) also
describes a silence compression scheme and a voice activity detection scheme. It
also has a discontinuous transmission module, which estimates the
background noise at the sender and can use a comfort noise generator at
the receiver. G.729 is popular within VoIP applications, due to its low
data rate and the features just mentioned. A Skype call initiated from
the Internet and terminating at a PSTN connection uses G.729 for the
Internet part of the path. It was developed by the University of
Sherbrooke (Canada), the Nippon Telegraph and Telephone Corporation of
Japan and France Telecom in 1995.
The iLBC encoder from Global IP Solutions is a block-independent LPC
coder [@s.v.02:_ilbc]. Whereas conventional LPC schemes have a memory
that leads to error propagation in the case of lost packets, iLBC
encodes each frame as a separate block. It therefore has a controlled
response to packet loss and exhibits a robustness similar to PCM with
packet loss concealment [@ITU:g711i]. The CPU resources required by iLBC
are comparable to those of G.729A, but it yields higher basic quality.
Although a narrow-band speech coder, iLBC uses the full 0-4 kHz spectrum
rather than the 300-3400 Hz band of traditional codecs, thus producing
better fidelity. iLBC is popular
in PC to PC communication and is found in tools such as Skype and
GoogleTalk.
A book on speech coding and synthesis was published by Kleijn
[@Kleijn:B95] (one of the creators of the iLBC codec). Also one of the
first papers on IEEE 802.11 and VoIP was published in the same year
[@visser95:_voice_data_trans]. As for speech coding, the first G.729
standard was released in 1996 [@ITU:g729]. As noted earlier, G.729 is an
8 kb/sec LPC-based coder still used in many VoIP applications today.
This includes the Skype application when using IP to telephony services
e.g. in the SkypeIn and SkypeOut services. As the load on the Internet
grew, studies of error recovery were being published
[@Bolot:C97; @IETF:RFC2733].
A good reference for speech coding algorithms is the review article of
1994 [@Spanias9410:Speech]. More comprehensive books include Kleijn’s
speech coding and synthesis from 1995 ([@Kleijn:B95]) and Oppenheim’s
older classic [@dspapp] (1978). For low bitrates in particular, consult
[@561156]. Design considerations for speech coders for packet networks
can be found in [@Lefebvre:C04ab].
Future directions for networked voice {#sect:future_qos}
Echo cancellation
TBD.
Improved quality
Moving on to just one topic outside the scope of this book, we believe
that higher fidelity telephony should be available in the near future.
Although the technology for transporting bits has improved, the media
stream itself has not changed since the introduction of 64 kb/s voice
many decades ago. From the user’s perspective the voice quality of a 13
kb/s stream is actually worse than that of traditional telephony.
However, we are prepared to pay this cost in order to have mobile
telephony, and, of course, the operator can squeeze more calls out of
the system without substantial investment.
The drive to reduce bitrates for calls has been to multiplex more calls
onto capacity-constrained links. However, as ever more capacity becomes
available in both cellular and Internet technologies, the
time is right for a new type of voice experience. Therefore, one example
would be to use higher fidelity than we are currently used to. This may
be stereo voice, and would require headsets, but many mobile users
already use such devices to listen to music.
3D telephony
Going one step further is 3D telephony. This will enhance
the experience at the listener through capturing binaural signals at the
speaker, optionally rendering them in 3D space, and replaying the
enhanced signal at the listener. Capturing the signals at the speaker
can be done by placing small microphones on the outside of the headsets,
somewhat similar to what noise canceling headsets do today.
Steps such as these would represent a new domain for telephony that has
been thus far the preserve of specific environments such as audio
conferencing. 3D telephony is very much under investigation, however
significant challenges remain, particularly in the domain of noise
cancellation, either at the sender or receiver, or both.
Rate friendly VoIP
Competing traffic
- P2P traffic
- Upgraded links
- Our result that quality is improving
Wireless too late?
A survey such as this reveals topics still to be addressed in the area.
Ultimately users decide the perceived quality.
[^1]: In the case of wireless links this may mean that the user has to
physically move to a location with better link properties.
[^2]: Most computers today utilize linear ADCs to digitize audio using
16, 24, or 32 bits per sample at typical sampling rates of 8 kHz,
16 kHz, 22 kHz, 44.1 kHz, 48 kHz, 96 kHz, or 192 kHz. Traditional
telephony systems use eight-bit A-law or $\mu$-law logarithmic
conversion at an 8 kHz sampling rate to support a large dynamic
range.
[^3]: The estimate of 35% traffic savings is based upon a long term
average of 24 or more simultaneous calls [@Cisco:7934].