
Comparative Study of Techniques to Minimize Packet Loss in VoIP
Shveni P Mehta
ABSTRACT
Voice over IP is an upcoming technology that enables voice
communication through the Internet. Packet-based network links
are shared between different connections, which gives rise to
interaction between various traffic types. Excessive delay, packet
loss, and high delay jitter all impair the communication quality.
Although voice over IP is very economical, people become
hesitant to use it due to the above-mentioned facts which affect
the voice quality greatly. Delay jitter and packet loss are the two
factors that affect the voice quality the most. The delay jitter
manifests itself as packet loss in the dejitter buffer, which drops
the late packets. It is impossible to remove packet loss from the
network but it can definitely be minimized. Several techniques
have been proposed to minimize the effect of packet loss. This
paper describes some of those techniques and then compares the
following two techniques on the basis of certain criteria:
1) A New Technique for improving quality of speech in voice
over IP using time-scale modification [1]
2) Adaptive playout scheduling and loss concealment for voice
communication over IP Network [3]

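Table 1 - Available Codecs [2]

| Codec standard | Bitrate |
|---|---|
| G.711 μ-Law | 64 kbps |
| G.711 A-Law | 64 kbps |
| G.726-7 ADPCM 5 bit | 40 kbps |
| G.726-7 ADPCM 4 bit | 32 kbps |
| G.726-7 ADPCM 3 bit | 24 kbps |
| G.728 | 16 kbps |
| GSM-FR | 13 kbps |
| G.729 | 8 kbps |
| G.723.1 | 5.3 kbps |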

1. INTRODUCTION
For non-real-time applications like Telnet, FTP, and email, TCP
offers reliable delivery of data, but for real-time applications like
voice over IP, TCP is not appropriate because it introduces too
much delay in creating that reliability. The three main problems
occurring in real-time applications like Voice over IP (VoIP) are:
1) End-to-End delay: The total delay experienced by the packet
from the sender till it reaches the receiver.
2) Jitter: The variation in packet interarrival time. The
difference between when the packet is expected and when it
is actually received is jitter.
3) Packet loss: Loss of voice packets from sender to receiver.
The total packet loss is composed of two elements: 1) packets lost
over the network due to congestion, and 2) packets arriving late
after their expected playout time that are discarded by the
receiver. The jitter caused by variable delays in the network is
ultimately translated into the effect of packet loss in the network,
as the packets arriving after the playout time are considered as
lost. Another factor that could affect the quality of VoIP is the
choice of codec used to transform and compress analog signals
into digital signals. Currently available codecs are listed in Table
1.
The Mean Opinion Score (MOS) is a measure used to determine or
compare the quality of audio transmissions. It is widely applied
as a listening quality scale. To evaluate a given speech sample,
human listeners rate it on a MOS scale of 1 to 5, as given in
Table 2. When a speech sample is compared against a reference
speech sample, PESQ-MOS (Perceptual Evaluation of Speech
Quality - MOS) or DMOS (Degradation MOS) is determined on a
scale of 1 to 5, as given in Table 2.

Keywords
VoIP, Packet loss, Adaptive playout, Time-Scale Modification.


Table 2 - MOS, DMOS and PESQ-MOS Scale

| MOS | Opinion | DMOS or PESQ-MOS | Opinion |
|---|---|---|---|
| 1 | Bad | 1 | Very annoying |
| 2 | Poor | 2 | Annoying |
| 3 | Fair | 3 | Slightly annoying |
| 4 | Good | 4 | Audible but not annoying |
| 5 | Excellent | 5 | Inaudible |

From [2] it can be seen that the influence of sampling and
digitization is small but not negligible. Lower bit-rate codecs
show a larger degradation of speech quality. Codecs like G.726-7
and G.728 are more susceptible to degradation from lost packets
than the other codecs like G.711, GSM-FR, G.729 and G.723.1.
For high bit-rate codecs, packet loss values up to 10% are
acceptable for good quality, whereas for the low bit-rate codecs
only 4-5% packet loss rate is acceptable because the number of
samples lost within a packet covers more of the speech signal.
However, packet size itself does not have much influence on
quality for any of the codecs. For a connection without a dejitter
buffer, jitter has a significant influence on the perceived quality of
received voice. For the connection with a dejitter buffer, both lost
and delayed packets have the same effect in the dejitter buffer,
which drops the late packets. Thus, packet loss is the most
important problem that should be studied to improve the quality
of VoIP.

21st Computer Science Seminar


SB3-T2-1

2. BRIEF OVERVIEW OF SOME OF THE PROPOSED TECHNIQUES
Many techniques have been proposed to minimize the effect of
packet loss in VoIP. This section describes some of the techniques
in brief to get an idea about various approaches taken for
minimizing the effect of packet loss in VoIP.

2.1 Replacing Lost Packets


Mayorga et al. [4] first study the impact of packet loss in different
transmissions with respect to different codecs and then propose
reconstruction strategies to recover lost information.

2.1.1 Interleaving
This technique distributes the effect of the lost packets in order to
reduce the impact on quality. The information of a speech part is
distributed in multiple packets. The data units are regrouped in a
crossed form before transmission such that they are distributed,
and at the receiver they are rearranged in their original form.
Thus, instead of a whole packet being lost, only small parts of
several packets are lost (Figure 1).
Figure 1. Example of Interleaving (panels: before transmission, interleaving, lost packet)
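The interleaving idea of Section 2.1.1 can be sketched in a few lines. This is a minimal illustration and not the scheme evaluated in [4]: it assumes a square block of packets, models interleaving as a simple transposition, and uses the hypothetical names `interleave`, `deinterleave`, and `LOST`. At least one transmitted packet is assumed to arrive.

```python
LOST = None  # marker for a data unit that never arrived (illustrative)


def interleave(block):
    """Transpose a block so that transmitted packet j carries unit j of every original packet."""
    return [list(units) for units in zip(*block)]


def deinterleave(received):
    """Undo the transposition; a lost transmitted packet (None) leaves one LOST unit in each original packet."""
    size = next(len(p) for p in received if p is not None)
    filled = [p if p is not None else [LOST] * size for p in received]
    return [list(units) for units in zip(*filled)]


block = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # three packets of three data units each
sent = interleave(block)                    # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
sent[1] = None                              # the second transmitted packet is lost
restored = deinterleave(sent)               # [[1, None, 3], [4, None, 6], [7, None, 9]]
```

Each original packet now misses a single unit instead of one packet missing entirely, which is the situation the repetition and interpolation strategies are meant to repair.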

2.1.2 Repetition

Lost packets are replaced by copies of the last received packets.

2.1.3 Simple Interpolation

A lost packet is reconstructed by interpolating (averaging) the packets received before and after it.

2.1.4 Interleaving with Repetition

The data are interleaved before sending, and any missing part is then substituted using the repetition technique at the receiver.

2.1.5 Interleaving with Interpolation Calculation

The interleaving technique is used before sending, and the receiver then interpolates to replace any missing parts in the jitter buffer.

2.2 Forward Error Correction and Concealment [7]

This approach encompasses both loss correction and loss concealment algorithms. Loss correction uses media-dependent and media-independent Forward Error Correction (FEC) techniques. FEC adds redundant data to the normal voice stream to protect against packet loss. FEC introduces overhead in terms of the total amount of traffic on the network, but this approach can be used if the amount of redundancy is controlled. In media-independent FEC, general protection codes like Reed-Solomon or Viterbi are used to produce an extra protection packet that follows the protected set of voice packets. These codes do not depend on any particular underlying media characteristics, but introduce a higher delay which may not be tolerated by many applications, including VoIP. In media-dependent FEC, the sender uses a high-quality codec to create the voice samples and a lower-quality codec to generate redundant bits that are added to every packet rather than being sent in their own separate packet. The receiving codec removes the redundancy. If the receiver must use that redundant data to substitute for a lost packet, the result is a lower-quality (but not missing) segment of voice. Media-dependent FEC introduces minimum delay but may introduce more computational processing delay as it is media dependent.
Concealment techniques can be used to supplement FEC for even better lost-packet compensation. The most common concealment approaches include:

- Silence substitution is substitution of the lost frame by a silence frame of the same temporal frequency, but it can introduce noise if several of them are introduced.
- In noise substitution, Gaussian noise frames are used to substitute for the missing frames. This produces better quality.
- In frame replication, missing frames are replaced by already present redundancy in the voice. This has low computational complexity and is efficient as more redundancy is expected to be present in the neighboring voice frames. It does not need large temporal size.
- Waveform substitution uses the frames prior to the lost frames and tries to use the most recent ones. It examines buffered frames and searches for the best match.

2.3 Optimized unequal error protection [5]

Forward Error Correction schemes allocate equal amounts of error-control resources to each voice packet, irrespective of the perceived importance of a packet. This technique proposes signal-adaptive, unequal error protection in which certain packets are allocated more FEC protection than others, depending on their perceived importance. Here, the basic unequal protection varies the number of copies of a packet that are piggybacked onto subsequent packets: an adaptive Reed-Solomon (RS) coding scheme provides only certain packets with RS protection. The error-control resources that should be allocated to a packet are determined by anticipating what the packet loss concealment will do if the packet is lost, and calculating the expected distortion for different protection scenarios. This technique is based on Lagrangian optimization, as used in video communications to balance rate against distortion.
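The rate-distortion balancing mentioned for Section 2.3 can be illustrated with a generic Lagrangian selection. This is only a sketch of the general principle, not the optimization from [5]: the function names, the multiplier `lam`, and all rate and distortion numbers are hypothetical, and the step in which expected distortion is derived from the anticipated concealment is not modeled.

```python
def choose_protection(options, lam):
    """Return the (rate, distortion) option with the lowest Lagrangian cost D + lam * R."""
    return min(options, key=lambda opt: opt[1] + lam * opt[0])


def allocate(packets, lam):
    """Pick a protection level independently for every packet."""
    return [choose_protection(options, lam) for options in packets]


# Each packet lists (redundancy rate, expected distortion if lost) per protection level.
# Values are made up for illustration only.
packets = [
    [(0, 9.0), (1, 2.0), (2, 1.5)],  # perceptually important packet: protection pays off
    [(0, 1.0), (1, 0.8), (2, 0.7)],  # easily concealed packet: protection is mostly wasted
]
plan = allocate(packets, lam=1.0)    # [(1, 2.0), (0, 1.0)]
```

With these numbers the important packet receives one unit of redundancy and the easily concealed packet receives none, which is the unequal allocation the section describes.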


2.4 Compressed Domain Packet Loss Concealment of Sinusoidally Coded Speech [6]
For this technique Rodbro et al. [6] designed a new codec where
the speech signal is compressed at the transmitter using a
sinusoidal coding scheme working at 8 kbps. At the receiver the
concealment is carried out working directly on the sinusoidal
parameters, based on time-scaling of the packets surrounding the
missing ones. This is essentially another type of interpolation
approach.

2.5 Time Scale Modification Approach [1]


In this approach a time-scale modification algorithm has been
used to minimize the effect of packet loss without introducing
additional delays. It provides flexible arrival delay cut-offs to late
arriving packets by means of adaptive playout, which helps in
reducing the packet loss-rate at the receiver.

2.6 Adaptive Playout Scheduling and Loss Concealment [3]
In this scheme the past statistics on network delay are used to
adaptively adjust the playout time of the voice packets.
Continuous speech samples are created at the receiver using
time-scale modification. This technique significantly improves
the trade-off between buffering delay and late loss.

3. EVALUATION OF TECHNIQUES
The most important characteristics in determining a good
solution for VoIP include the following:
1) Receiver-based: such techniques are faster, are independent of
the network delay characteristics, and impose lower
computational overhead on the sender or network [3].
2) Minimizes overall delay and packet loss.
3) Flexible arrival delay cut-offs for late arriving packets, to
reduce the packet loss-rate at the receiver.
4) Uses adaptive playout to minimize overall delay and effects of
lost packets.
5) Low complexity.
6) Maintains pitch frequencies and intelligibility of speech.
7) Technique itself includes the solution for recovering from burst
losses.
Of the techniques described above, the two that best meet these
criteria and are similar enough to be compared effectively are:
1) Time-scale modification Approach [1].
2) Adaptive Playout scheduling and Loss Concealment [3].
The specific characteristics to be considered for comparison are:
1) Receiver-based
2) Minimizes overall delay and packet loss
3) Flexible arrival delay cut-offs to late arriving packets, reducing
the packet loss-rate at the receiver
4) Considers both portions of packet lost, i.e., packets lost in the
network due to congestion at some intermediate node and at the
receiver due to packets arriving later than their scheduled playout
times
5) Generic and relative computational overhead
6) Maintains pitch frequencies and intelligibility of speech
7) Solution for burst losses
The primary reason for choosing the selected techniques for
comparison is that both are packet loss concealment techniques
based on time scaling. Both also maintain an adaptive playout
buffer at the receiver to compensate for the jitter. A detailed
comparison of these two follows.

3.1 Time-scale Modification Approach [1]


3.1.1 GLS-TSM Algorithm
Time-scale modification is the processing performed on speech signals that changes the perceived rate of speech without affecting the pitch or intelligibility of the speech. GLS-TSM (Global Local Search Time Scale Modification) is a time-scale modification algorithm that is a low-complexity improvement of the Synchronized Overlap and Add (SOLA) algorithm [1].
Let $S_s$ be the length of the output frame and $S_a$ be the length of the input frame arriving at the receiver. The relation between them is $S_s = \alpha S_a$, where $\alpha$ is the time-scale modification factor ($\alpha < 1$ for compression and $\alpha > 1$ for expansion). The GLS-TSM algorithm was originally designed for larger speech segments ($S_a$ of length 1600 samples or 200 ms) and an a priori fixed value of $\alpha$. VoIP uses smaller packet sizes, hence the values of these parameters were redefined to match the requirement of shorter signal segments and per-packet variability of $\alpha$ [1]. GLS-TSM takes N samples from those arriving at the receiver, and $\alpha$ is calculated from the arrival pattern at the receiver on a packet-by-packet basis. GLS-TSM searches for the point of best alignment in two steps. First it searches for global similarity, or similarity over a time interval, between the analysis and synthesis frames by comparing the zero crossing rates. Then it searches for local similarity about a sample point. Once the point of best alignment is located, the output signal y[n] is formed by fading in the analysis frame and fading out the synthesis frame in the overlapping interval, and then duplicating the input frame until all N samples are exhausted [1]. 10-ms voice packets and a playout buffer size of 5 packets are used to create the illustration in Figure 2. The additional delay allowed per packet was set to 4 ms and the Time-Scale Modification (TSM) frame size was set to 5 packets.
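Two of the quantities used above are easy to make concrete: the output frame length for a given time-scale factor, and the zero crossing rate that the global similarity search compares. The sketch below is illustrative only. The function names are hypothetical, the factor is written as `alpha`, and the way GLS-TSM actually compares the rates of the analysis and synthesis frames is not reproduced here.

```python
def output_length(s_a, alpha):
    """Output frame length S_s = alpha * S_a, rounded to whole samples."""
    return round(alpha * s_a)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# A 10-ms packet at an 8 kHz sampling rate holds 80 samples.
expanded = output_length(80, 1.3)    # 104 samples
compressed = output_length(80, 0.5)  # 40 samples
```

Frames with similar zero crossing rates are cheap to identify, which is consistent with the low-complexity goal stated for the algorithm.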
In Figure 2(a), packet 4 is delayed by 7 ms. This delay is accepted and the packet is played out by time-expanding packet 1 with $\alpha = 1.3$ and packets 2 and 3 with $\alpha = 1.2$, so packet 4 is available when playout of packet 3 ends. But as 7 ms of the playout time of packet 4 has already been taken by packets 1, 2, and 3, packet 4 is compressed with $\alpha = 0.5$ as in Figure 2(b). In general, if a packet would have to be compressed with $\alpha < 0.5$, it is compressed down to $\alpha = 0.5$ only and the rest of the compression is shared by subsequent packets. This maintains synchronization without degrading voice quality by excessively compressing a packet. As a result, packet 5 is also compressed and played out with $\alpha = 0.8$. If packet 4 does not arrive by its rescheduled playout time, it is assumed to be lost (Figure 2(c)). Hence, any packet with sequence number > 4 is searched for. For a buffer size of five packets, sequence numbers 5 or 6 are searched. If packet 5 has arrived then it is expanded to compensate for the residual playout time of packet 4. If the expansion is more than 1.5 times, then subsequent packets share the compensation to avoid delivering poor quality voice with excessive expansion. In Figure 2(c), packet 5 is expanded with $\alpha = 1.3$ to compensate for 3 ms of packet 4. In the worst case, if both packets 5 and 6 do not arrive, then samples from the previously successfully received packet are repeated to compensate for the residual playout time, with smoothing of the waveform occurring at packet boundaries. This reduces the number of fully repeated packets and eliminates the waveform mismatch at packet boundaries. Thus, acceptable delay for late-arriving packets and concealment of the lost packets are achieved effectively by this technique. The redefined parameter ranges of the GLS-TSM algorithm produced good quality speech for $0.5 \le \alpha \le 2.0$. The input speech material was taken from the TIMIT database; files were downsampled to 8 kHz and two files were concatenated to create a reference speech file of duration 6-8 seconds [1].
The plots in Figure 3 show that for acceptable packet loss values between 5% and 10% the voice quality has been graded as 3.5 PESQ-MOS and above, which is between slightly annoying and audible but not annoying. Flexible delay cut-offs and improved loss concealment give better quality speech. The scheme is generic, computationally efficient, and suitable for any practical VoIP system.

Figure 2. Working of the proposed scheme with packet delay jitter (a) and loss (b)

Figure 3. MOS predictions for 3 representative input files with random packet-loss percentages (no packet arrival jitter) [1] (axes: PESQ score versus RPL %)

3.2 Adaptive Playout scheduling and Loss Concealment for voice communication over IP Networks

3.2.1 Adaptive Playout
When a playout buffer is employed at the receiver to absorb the
delay jitter, there is a tradeoff between average time spent in the
buffer (buffering delay) and the number of packets that have to be
dropped because of late arrival (late loss). Increasing
buffering delay will minimize packet loss but increase end-to-end
delay, whereas decreasing buffering delay will decrease end-to-end
delay but increase packet loss. In this technique adaptive playout
is used to accommodate as many late-arriving packets as possible.
In Figure 4 the graph shows the delay of voice packets on
the network as dots and the total delay as a solid line. Packets
arriving after the playout deadline are lost and have to be
concealed. As can be seen in the figure, the playout is adjusted
not only in silence periods but also within talk spurts according to
the varying delay statistics, which shows that the technique is
highly dynamic. Voice packets are scaled to maintain continuous
playout, which introduces some playout jitter. However, this
flexibility allows reducing the average buffering delay and late
loss at the same time. Hence, the tradeoff between buffering delay
and late loss is improved.
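The idea of deriving the playout deadline from past delay statistics can be sketched as follows. The paper only states that [3] uses past network delay statistics; the sliding window, the quantile rule, the default value, and the class name `PlayoutScheduler` are assumptions made for this illustration and are not the estimator of [3].

```python
from collections import deque


class PlayoutScheduler:
    """Track recent network delays and propose a playout deadline (illustrative)."""

    def __init__(self, window=50, quantile=0.95):
        self.delays = deque(maxlen=window)  # only the most recent delays are kept
        self.quantile = quantile

    def observe(self, network_delay_ms):
        """Record the network delay of a packet that has arrived."""
        self.delays.append(network_delay_ms)

    def deadline(self, default_ms=60.0):
        """Return the chosen quantile of the recent delays, or a default if none were seen."""
        if not self.delays:
            return default_ms
        ordered = sorted(self.delays)
        index = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        return ordered[index]


scheduler = PlayoutScheduler(window=5, quantile=0.5)
for delay in [20, 22, 21, 60, 23]:
    scheduler.observe(delay)
median_deadline = scheduler.deadline()  # 22: the single 60 ms spike does not dominate
```

A higher quantile trades more buffering delay for fewer late losses, and a lower one does the opposite, which is the tradeoff described above.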
This technique uses the average buffering delay and the late loss rate as its basic performance measures. All measures are defined in Table 3.

Figure 4. Adaptive Playout scheduling scheme [3] (dots: network delay; solid line: total end-to-end delay)

Assume that there are N packets in the packet stream. For voice, packets are fixed-size blocks and outgoing packets are generated periodically. As shown in Figure 5, packet $i$ is sent from the sender at time $t_s^i$ with a constant packetization time $L_o$, where $i = 1, 2, 3, \ldots, N$ is the sequence number of the packet, so that

$$t_s^{i+1} - t_s^i = L_o = \text{constant}.$$

Packet $i$ is received at the receiver at time $t_r^i$. It is first stored in a buffer for a time equal to the buffering delay and then played out at time $t_p^i$. Thus, the buffering delay is $d_b^i = t_p^i - t_r^i$, the network delay is $d_n^i = t_r^i - t_s^i$, and the total delay is $d_t^i = d_n^i + d_b^i$. If the packet is lost during transmission then $d_n^i = \infty$. Hence, the set of received packets is $R = \{i \mid t_r^i < \infty\}$. As can be seen from Figure 5, the adaptation is performed on a packet-by-packet basis and each packet is played out with its own length

$$L_i = t_p^{i+1} - t_p^i,$$

where $L_i$ is the achieved length (in time) of audio packet $i$. The average buffering delay is

$$\bar{d}_b = \frac{1}{|P|} \sum_{i \in P} \left(d_{max}^i - d_n^i\right),$$

where $P = \{i \mid t_p^i > t_r^i\}$ is the set of played packets and $|P|$ denotes the cardinality of the set. The late loss rate and the link loss rate are

$$\epsilon_l = \frac{|R| - |P|}{N}, \qquad \epsilon_n = \frac{N - |R|}{N}.$$

The total loss rate is the sum of the two, $\epsilon = \epsilon_l + \epsilon_n$. The burst loss rate is $\epsilon_b = |B| / N$, where $B = \{i \mid t_p^i < t_r^i,\ t_p^{i+1} < t_r^{i+1}\}$ is the set of packets with two consecutive losses.
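The measures defined above can be computed directly from the receive and playout times of a stream. The following is a minimal sketch under stated assumptions: a packet lost in the network is represented by an infinite receive time, indices start at 0 rather than 1, and the function name `stream_metrics`, the dictionary keys, and the sample timings are hypothetical.

```python
import math


def stream_metrics(t_r, t_p):
    """Loss rates and average buffering delay from receive times t_r and playout times t_p."""
    n = len(t_r)
    received = [i for i in range(n) if t_r[i] < math.inf]   # set R
    played = [i for i in range(n) if t_p[i] > t_r[i]]       # set P
    burst = [i for i in range(n - 1)                         # set B
             if t_p[i] < t_r[i] and t_p[i + 1] < t_r[i + 1]]
    late_loss = (len(received) - len(played)) / n
    link_loss = (n - len(received)) / n
    avg_buffering = (sum(t_p[i] - t_r[i] for i in played) / len(played)) if played else 0.0
    return {
        "late_loss": late_loss,
        "link_loss": link_loss,
        "total_loss": late_loss + link_loss,
        "burst_loss": len(burst) / n,
        "avg_buffering_delay": avg_buffering,
    }


# Five packets with times in ms; packets 2 and 3 are lost in the network, packet 4 arrives too late.
t_r = [30, 55, math.inf, math.inf, 150]
t_p = [50, 70, 90, 110, 130]
metrics = stream_metrics(t_r, t_p)
```

In this example one of five packets is a late loss and two are link losses, and the two played packets wait 20 ms and 15 ms in the buffer.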

Table 3 - Notation [3]

| Notation | Description |
|---|---|
| $t_s^i$ | Time packet i is sent |
| $t_r^i$ | Time packet i is received |
| $t_p^i$ | Time packet i is played out |
| $d_n^i$ | Network delay of packet i |
| $d_b^i$ | Buffering delay of packet i |
| $\bar{d}_b$ | Average buffering delay of a stream |
| $d_t^i$ | Total delay of packet i |
| $d_{max}$ | Fixed playout deadline |
| $d_{max}^i$ | Playout deadline of packet i |
| $\epsilon_n$ | Link loss rate |
| $\epsilon_l$ | Late loss rate |
| $\epsilon_b$ | Burst loss rate |
| $\epsilon$ | Total loss rate |
| $R$ | Set of received packets |
| $P$ | Set of played packets |
| $B$ | Set of packets lost consecutively |
| $N$ | Number of packets in stream |
| $L_o$ | Sender packetization time |
| $L_i$ | Actual length of scaled packet i |


3.2.2 Scaling of voice packets with WSOLA

Waveform Similarity Overlap-Add (WSOLA) decomposes the input into overlapping segments of equal length, which are then realigned and superimposed to form the output with equal and fixed overlap. The realignment leads to a modified output length. The segments to be added in overlap are searched on the basis of maximum correlation to ensure that they have the maximum similarity and that the superposition will not cause any discontinuity in the output. Weighting windows are applied to the segments before they are superimposed to generate smooth transitions in the reconstructed output. For speech processing, WSOLA has the advantage of maintaining the pitch period, which results in improved quality compared to resampling.
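The two building blocks named above, a maximum-correlation search and a windowed overlap-add, can be sketched in plain Python. This is a simplified illustration and not the WSOLA implementation of [3]: it assumes normalized cross-correlation as the similarity measure and linear weighting windows, and the function names and test signal are hypothetical.

```python
def best_offset(signal, template, search_start, search_end):
    """Return the start index in [search_start, search_end) whose segment correlates best with template."""
    n = len(template)
    best, best_score = search_start, float("-inf")
    for start in range(search_start, search_end):
        segment = signal[start:start + n]
        if len(segment) < n:
            break  # ran past the end of the signal
        dot = sum(a * b for a, b in zip(segment, template))
        energy = sum(a * a for a in segment) * sum(b * b for b in template)
        score = dot / energy ** 0.5 if energy > 0 else 0.0
        if score > best_score:
            best, best_score = start, score
    return best


def cross_fade(falling, rising):
    """Overlap-add two equal-length segments (at least two samples) with complementary linear windows."""
    n = len(falling)
    return [falling[k] * (1 - k / (n - 1)) + rising[k] * (k / (n - 1)) for k in range(n)]


period = [0, 1, 2, 3, 0, -1, -2, -3]          # toy waveform with a pitch period of 8 samples
signal = period * 4
match = best_offset(signal, signal[0:4], 4, 12)  # 8: exactly one pitch period later
```

Because the best match lies one pitch period away from the template, overlapping the two segments inserts or removes a whole period, which is why the pitch is preserved.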
Figure 5. Adaptive Playout [3]

Single-Packet WSOLA
The WSOLA approach has been modified to work on only one packet so that incoming packets can be scaled immediately, without introducing any additional processing delay. To scale a voice packet, a template segment of constant length is selected from the input. Next, the segment of maximum similarity to the template segment, called the similar segment, is searched for in the input. The start of the similar segment is searched for in a search region, as shown in Figure 6. For expanding short packets, the search region is moved to the prior packet to find the first similar segment. This widens the range for finding similar waveforms. The prior packet may already have been played out at the time of scaling. Once the similar segment is found, it is weighted by a rising window and the template segment is weighted by a symmetric falling window. The similar segment followed by the rest of the samples in the packet is shifted and superimposed with the template segment to generate the output. This results in a longer output, as shown in Figure 6(a), where the output waveform has one extra pitch period. The extra pitch period is the interpolation of several pitch periods based on similarity. If the output speech has not reached the desired length after such operations, additional iterations are performed. Packet compression is done similarly, as shown in Figure 6(b), but because the output must be shorter than the input, the search region for the similar segment should not be defined in the prior packet. Packet compression requires that a packet contain more than one pitch period, which limits the minimum length of a packet that can be compressed. However, a common packet length, such as 20 ms, is usually sufficient because pitch values below 100 Hz are not very frequent in speech signals. If for some reason packet compression cannot be performed, then the compression is shared by later packets. This scheme is entirely voice codec-independent. The beginning and the end of each packet are not altered, so concatenation of modified packets needs no overlap to obtain smooth transitions. Hence, packets can be modified independently and sent to the output queue back to back. To avoid extreme scaling, the packet length is limited to $L_{max} = 2.3 L_o$ and $L_{min} = 0.3 L_o$, where $L_o$ is the constant packetization time.

Figure 6. (a) Extension and (b) compression of single voice packets using time-scale modification [3]

3.3 Packet Loss Concealment

In this technique the packet loss concealment tries to cover late loss as well as link loss in the network by taking advantage of redundancy in the audio signal. It is a hybrid of time-scale modification and waveform substitution. It scales one packet at a time and uses two-sided information along with adaptive playout scheduling. Scaling one packet at a time reduces the delay to one packet time and results in better voice quality. Waveform repetition is used to repair burst loss.

Figure 7. Loss concealment for (a) single loss (b) interleaved loss (c) consecutive loss [3]
As shown in Figure 7, the packets are stored in a buffer at the receiver after arrival. Packet $i$ is assumed lost if it has not been received by the time packet $i-1$ is to be played out, and the concealment starts at that moment. When packet $i$ is assumed lost, its prior packet $i-1$ is extended with a target length of $2L_o$ and then played out. Further operation depends on the loss pattern. If packet $i$ is the only packet lost and the following packets are received by their deadlines, packet $i+1$ is extended with a target length of $1.3L_o$. As shown in Figure 7(a), small segments from packets $i-1$ and $i+1$ are searched for similarity to obtain a merging position. The two segments are then weighted by falling and rising windows before merging. The total expansion of packets $i-1$ and $i+1$ can be bigger or smaller than the gap created by the lost packet. The resulting length of the modified packets is then $L_{i-1} + L_{i+1} - L_{merge}$, and the playout time of packet $i+2$ is

$$t_p^{i+2} = t_p^{i-1} + L_{i-1} + L_{i+1} - L_{merge}.$$

For successful concealment $t_p^{i+2} > t_r^{i+2}$, but in general $t_p^{i+2}$ will not match the desired playout time and is likely to be either ahead of or behind the scheduled playout by a small difference, as shown in Figure 7. This difference is corrected by the adaptive playout. Interleaved loss patterns or burst loss can be covered as shown in Figures 7(b) and 7(c), respectively. In Figure 7(b), when packet $i+2$ is determined to be lost, packet $i+1$ is scaled with a target length of $2L_o$ instead of $1.3L_o$ to cover the gap resulting from the second loss. In Figure 7(c), when packets $i$ and $i+1$ are lost, the waveform of the scaled packet $i-1$ is repeated in order to conceal the burst loss. In both cases, a search for similar waveforms is performed for merging, and adaptive playout time adjustment is used on the following packets. For consecutive losses as shown in Figure 7(c), waveform repetition is used a maximum of three times before the mechanism stops generating output and resets the playout schedule. Burst loss degrades voice quality most severely, even after being concealed, because the concealment is a simple repetition of prior waveforms.

Table 4 - Subjective results of the technique [3]

| Trace | STD of Network Delay (ms) | Maximum Jitter (ms) | Link Loss Rate (%) | STD of Total Delay (ms) | Buffering Delay (ms) | Total Loss Rate (%) | Burst Loss Rate (%) | MOS |
|---|---|---|---|---|---|---|---|---|
| 1 | 23.7 | 130.0 | 0 | 7.5 | 54.6 | 2.8 | 0.7 | 3.7 |
| 2 | 15.9 | 86.0 | 8.3 | 8.5 | 26.1 | 8.3 | 0 | 2.8 |
| 3 | 5.9 | 39.0 | 0 | 2.6 | 23.0 | 0.3 | 0 | 4.3 |
| 4 | 13.7 | 47.0 | 0 | 7.4 | 25.7 | 1.1 | 0 | 4.1 |
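The playout-time bookkeeping for the single-loss case of Section 3.3 can be illustrated numerically. This sketch only evaluates the target-length rule and the playout-time relation described in that section. The function names, the 20-ms packetization time, and the achieved lengths are hypothetical, and the scheduled time assumes that unscaled packets would be played back to back.

```python
def next_packet_target(l_o, following_lost):
    """Target length for packet i+1: 2*L_o if packet i+2 is also lost, otherwise 1.3*L_o."""
    return 2.0 * l_o if following_lost else 1.3 * l_o


def playout_after_single_loss(tp_prev, len_prev, len_next, len_merge):
    """Playout time of packet i+2: t_p(i-1) + L(i-1) + L(i+1) - L_merge."""
    return tp_prev + len_prev + len_next - len_merge


L_O = 20                        # packetization time in ms (assumed)
tp_prev = 100                   # playout time of packet i-1 in ms (assumed)
scheduled = tp_prev + 3 * L_O   # start of packet i+2 if no packet had been scaled
actual = playout_after_single_loss(tp_prev, len_prev=38, len_next=26, len_merge=5)
offset = actual - scheduled     # negative: ahead of schedule; positive: behind
```

Here the concealed stream ends 1 ms ahead of the schedule, which is the kind of small difference that the adaptive playout then absorbs in the following packets.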

3.4 Performance of the technique

As given in [3], results were collected using packet delay traces from the Internet by transmitting voice streams between hosts at four different geographic locations. The data sequences collected from these links are referred to as Traces 1-4, respectively. Subjective testing of the quality degradation caused by scaling of voice packets, on a scale of 5 to 1 as given in Section 1, shows that the degradation due to audio scaling is between inaudible and not annoying, even for extreme cases. In short, these results indicate that scaled audio has good quality. Subjective testing to determine the overall quality of speech using adaptive playout in combination with loss concealment was carried out using four short segments from Traces 1-4 that last approximately 6 s. The corresponding network characteristics for these segments are given in columns 2-4 of Table 4.
No original sample was provided for direct comparison, and the listeners were asked to rate the quality of speech using an absolute scale of 5 to 1 corresponding to 5-excellent, 4-good, 3-fair, 2-poor, and 1-bad quality, respectively. The mean opinion scores (MOS) obtained by averaging the scores from all listeners and four different samples are listed in Table 4. In order to obtain reasonable sound quality for each trace, the buffering delay used is adjusted to the jitter characteristics. Therefore, the highest buffering delay is used for Trace 1, while the buffering delay for Trace 3 is close to the minimum of one packet time (20 ms), which is required for loss concealment. The data in Table 4 are quite detailed, covering not just packet loss, but also network loss, jitter, buffering delay, and burst loss.

4. SUMMARY AND CONCLUSION

The two techniques under comparison may look very similar on the whole, but when studied closely they show some differences. Table 5 captures the highlights of how the Time-Scale Modification Approach [1] and Adaptive Playout Scheduling and Loss Concealment [3] compare.

Table 5. Comparison Table

| Characteristic | Time-Scale Modification Approach [1] | Adaptive Playout Scheduling and Loss Concealment [3] |
|---|---|---|
| 1) Receiver-based | Yes | Yes |
| 2) Minimizes overall delay and packet loss | Yes, but it is likely to introduce more delay as compared to the other technique because it is based on collecting N samples from the input as described in Section 3.1.1. | Yes, but it does not have to wait for N samples from the input before processing starts and hence it has comparatively less buffering delay, as demonstrated in Table 4. |
| 3) Flexible arrival delay cut-offs to late arriving packets, reducing the packet loss-rate at the receiver | Yes, but this technique is based on the GLS-TSM algorithm as described in Section 3.1.1, where expansion and compression of packets for time-scaling are based on $\alpha$. The value of $\alpha$ cannot be more than 1.5 to maintain the quality of sound. The working of the algorithm shows that it is comparatively less dynamic and is a simple time-scale modification technique. | Yes, but this technique is based on the WSOLA algorithm as described in Section 3.2.2, where expansion and compression are based on finding similar waveforms rather than scaling the neighboring packets. The similar segment is weighted by a rising window and the template segment is weighted by a falling window. The two segments are then shifted and superimposed to generate the output. This shows that the technique is highly dynamic. It detects arrival patterns from the input and automatically adjusts the amount of delay. The output is not simple time-scale modification but is based on similarity and is a hybrid of time-scale modification and waveform substitution. |
| 4) Considers both portions of packet loss, i.e., packets lost in the network due to congestion at some intermediate node and at the receiver due to packets arriving later than their scheduled playout times | Yes | Yes |
| 5) Generic and less computation overhead | Yes | Yes. It integrates well with the system, is independent of codec, and the results demonstrate less computational overhead. |
| 6) Maintains pitch frequencies and intelligibility | Yes | Yes |
| 7) Solution for burst losses | Yes, waveform repetition | Yes, waveform repetition |

Thus, in this paper two closely related techniques have been compared. From Figure 3 and Table 5 it might appear that the Time-Scale Modification Approach [1] is better from the point of view of MOS values, but the results presented for Adaptive Playout Scheduling and Loss Concealment [3] are more detailed. The latter technique uses traces from a real network, whereas the former uses speech files. Although the Time-Scale Modification Approach [1] is effective, Adaptive Playout Scheduling and Loss Concealment [3] gives a more detailed analysis based on total delay, buffering delay and loss, burst losses, and network delay and loss; hence it is more reliable in terms of minimizing packet loss along with delay. The results presented for the two techniques are insufficient to compare them in terms of MOS and PESQ-MOS values. Adaptive Playout Scheduling and Loss Concealment [3] gives a detailed analysis for each trace, but it is based only on a single loss rate value. If it had reported MOS values for a range of packet loss rates, then the two techniques could have been compared more effectively. This leads to the need for further work in this area. The other techniques proposed so far for minimizing packet loss can also be studied to determine which technique best serves the purpose of minimizing the effects in particular network conditions.

5. REFERENCES
[1] Agnihotri, S., Aravindhan, K., Jamadagni, H.S., Pawate, B.I. "A new technique for improving quality of speech in Voice Over IP using time-scale modification", Acoustics, Speech, and Signal Processing, 2002 Proceedings. (ICASSP '02). IEEE International Conference on, Volume: 2, 2002, Pages: 2085-2088.
[2] Duysburgh, B., Vanhastel, S., De Vreese, B., Petrisor, C., Demeester, P. "On the influence of best-effort network conditions on the perceived speech quality of VoIP connections", Computer Communications and Networks, 2001 Proceedings. Tenth International Conference on, 15-17 Oct. 2001, Pages: 334-339.
[3] Liang, Y.J., Farber, N., Girod, B. "Adaptive Playout scheduling and Loss Concealment for voice communication over IP Networks", Multimedia, IEEE Transactions on, Volume: 5, Issue: 4, Dec. 2003, Pages: 532-543.
[4] Mayorga, P., Besacier, L., Lamy, R., Serignat, J.-F. "Audio packet loss over IP and speech recognition", Automatic Speech Recognition and Understanding, 2003. ASRU '03. 2003 IEEE Workshop on, 30 Nov.-3 Dec. 2003, Pages: 607-612.
[5] Mingyu Chen, Murthi, M.N. "Optimized unequal error protection for voice over IP", Acoustics, Speech, and Signal Processing, 2004 Proceedings. (ICASSP '04). IEEE International Conference on, Volume: 5, 17-21 May 2004, Pages: 865-8 vol.5.
[6] Rodbro, C.A., Christensen, M.G., Andersen, S.V., Jensen, S.H. "Compressed domain packet loss concealment of sinusoidally coded speech", Acoustics, Speech, and Signal Processing, 2003 Proceedings. (ICASSP '03). 2003 IEEE International Conference on, Volume: 1, 6-10 April 2003, Pages: I - 104-7 vol.1.
[7] Santos, P.M., Balbinot, R., Silveira, J.G., Castello, F.C. "Analysis of packet loss correction and concealment algorithms in robust voice over IP environments", Communications, Computers and Signal Processing, 2003. PACRIM. 2003 IEEE Pacific Rim Conference on, Volume: 2, 28-30 Aug. 2003, Pages: 824-827 vol.2.
