PERFORMANCE ASSESSMENT OF 4.8 KBIT/S AMBE CODING
UNDER AERONAUTICAL ENVIRONMENTAL CONDITIONS*
Simao F. Campos Neto, Franklin L. Corcoran
COMSAT Laboratories, 22300 Comsat Drive, Clarksburg, MD 20871, USA
&
John Phipps
Compression Telecommunications, 19540 Amaranth Drive, Germantown, MD 20874, USA
&
Spiros Dimolitsas
Lawrence Livermore National Laboratory, University of California, Livermore, CA 94551, USA
ABSTRACT
The quality delivered by low-rate parametric speech codecs
deployed in commercial mobile satellite systems has to
be assessed by formal subjective assessment methods because
of the codecs' nonlinear nature. Aeronautical systems,
especially those for deployment in small aircraft, pose yet
more stringent conditions due to high environmental
noise levels. In this paper the results of an evaluation of
the voice quality of DVSI's 4.8 kbit/s Advanced Multi-Band
Excited (AMBE) coding under aeronautical channel
conditions are presented for listening and conversational
experiments. From these assessments it was concluded
that the voice quality of the Inmarsat Aeronautical
system will be perceptibly improved when the current
full-rate codec is replaced by the AMBE codec.
1. INTRODUCTION
Digital air-to-ground and ground-to-air communications
systems employing satellite networks have been
commercially employed for more than six years now,
mostly on large aircraft. In the meantime, the extension
of aeronautical systems to aircraft smaller than
the wide-body jets where these systems are presently
installed has become a commercial imperative, including
providing for the use of the aeronautical system on ATR-style
commuter aircraft. This extension requires the voice
system to operate under much worse ambient noise
conditions than the currently available aeronautical
communication systems. Recently developed speech coding
technology, with codecs delivering higher quality at
lower rates, offers the potential of improved efficiency and
performance for systems currently in use. To realize this
potential, however, improved voice performance is
required under these high ambient-noise conditions.
Adequately characterizing this performance, given the
increasingly non-linear nature of low-rate speech codecs,
demands the use of direct subjective evaluation methods.
* This work was sponsored by Inmarsat.
0-7803-3192-3/96 $5.00 © 1996 IEEE
This paper presents the results of an evaluation of the
voice quality of the 4.8 kbit/s Advanced Multi-Band
Excited (AMBE) codec (developed by DVSI), under
aeronautical channel conditions [1]. The performance of
this system was compared against that of 9.6 kbit/s
CELP, which is presently used in the Inmarsat
Aeronautical system, and that of a newer-generation speech
codec consisting of a modified version of the full-rate codec
but operating at 4.8 kbit/s. The 4.8 kbit/s AMBE system
was recently selected for use in the Inmarsat mini-M
(land-mobile) notebook-sized satellite terminal [2].
These characterizations, which were functionally divided
into four subjective experiments (three listening experiments
in Phase I, and one conversational experiment in
Phase II), included the performance assessment under
various input speech-levels, transmission errors,
interconnectivity with other signal processing devices,
and background noise. In the second phase, only the two
best codecs from Phase I were considered.
2. EXPERIMENTAL DESIGN
In defining the experiment design for measuring codec
voice performance, several factors were taken into
consideration, including knowledge of human psychology,
statistics, experiment size, and the objective of the
evaluation in terms of the system performance parameters
sought.
2.1 Phase I Listening Tests. The first phase of
tests was implemented to parametrically determine which
of two low-bit-rate speech coding schemes would better
suit the evolution of the Inmarsat Aeronautical system.
Three listening-opinion tests were designed. Two of the
listener-opinion tests were conducted using an Absolute
Category Rating (ACR) (single-stimulus) 5-point Mean
Opinion Score (MOS) transmission quality scale [3] to
quantify the performance under different input-levels, bit
errors (random and burst), and tandem combinations, for
unweighted (flat) and IRS [3] weighted speech.
ACR assessments are usually conducted by arranging for a
listener to hear a succession of groups of typically two to
three sentences (stimulus), with each stimulus or group
of stimuli being reproduced over a different circuit
condition. After each sample is heard, listeners express an
opinion of the perceived quality of the
processed speech as an excellent, good, fair,
poor, or bad (5, 4, 3, 2, 1) rating. Each opinion is based
on exposure to the most recently heard sample only, and
listeners are typically given 5 to 10 seconds in which to
cast a vote before the next sample is heard.
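The scoring step of the ACR procedure described above can be sketched as follows. This is a minimal illustration, not the analysis software used in the study: the function name mos_with_ci, the normal-approximation factor z = 1.96, and the example votes are all assumptions introduced here.

```python
import math
from statistics import mean, variance


def mos_with_ci(votes, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes independent votes and a normal approximation; both are
    simplifications made for this sketch.
    """
    if any(v not in (1, 2, 3, 4, 5) for v in votes):
        raise ValueError("ACR votes must be integers from 1 (bad) to 5 (excellent)")
    n = len(votes)
    mos = mean(votes)
    # Standard error of the mean, using the sample variance (n - 1 denominator).
    std_err = math.sqrt(variance(votes) / n)
    return mos, z * std_err


# Hypothetical votes for one codec condition (illustrative, not data from the paper).
example_votes = [4, 3, 4, 5, 3, 4, 2, 4, 3, 4]
mos, half_width = mos_with_ci(example_votes)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

The same calculation applies to DCR votes, where the mean of the 5-point degradation ratings gives the DMOS.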
The other listener-opinion test used a Degradation
Category Rating (DCR) (dual-stimulus) 5-point
Degradation MOS (DMOS) scale to assess the quality
degradation for IRS-weighted speech in the presence of
vehicle and aeronautical environment background noise.
The DCR procedure is similar to the ACR procedure,
with the exception that votes are cast for a pair of
stimuli, of which the first stimulus is the unprocessed
speech, and the second is the speech processed under a
given circuit condition. The listener is asked how the
second stimulus is degraded in relation to the unprocessed
speech: no degradation, audible but not annoying, slightly
annoying, annoying, and very annoying degradation (5, 4,
3, 2, 1). Each opinion is based on exposure to the most
recently heard pair of samples (stimulus) only, and the
listener is again given 5 to 10 seconds to cast a vote
before the next pair of samples is heard.
The experimental designs of the three listening tests were
based on a balanced block structure, and provided for
arranging the conditions in presentation blocks, where
each block contained a complete set of randomized codec-
condition combinations. In these tests, eight talkers were
used for the ACR experiments and four talkers were used
for the DCR experiment (equal number of male and
female talkers).
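One way to build presentation blocks in which every block holds a complete, randomized set of codec-condition combinations is sketched below. The paper does not state how blocks were assigned, so the choice of one block per talker, the function name presentation_blocks, and the example labels are assumptions made for illustration only.

```python
import random
from itertools import product


def presentation_blocks(codecs, conditions, talkers, seed=0):
    """Build one block per talker; each block contains every
    codec-condition combination exactly once, in random order."""
    rng = random.Random(seed)  # fixed seed so the ordering is reproducible
    blocks = []
    for talker in talkers:
        block = [(codec, cond, talker) for codec, cond in product(codecs, conditions)]
        rng.shuffle(block)
        blocks.append(block)
    return blocks


# Hypothetical labels (not the actual test plan).
codecs = ["AMBE", "4.8k Aero", "9.6k Aero"]
conditions = ["BER=0", "BER=1%", "tandem"]
talkers = ["F1", "F2", "M1", "M2"]

blocks = presentation_blocks(codecs, conditions, talkers)
for block in blocks:
    print(block[0][2], [(codec, cond) for codec, cond, _ in block][:3], "...")
```

Because each block is complete, every listener hears every combination equally often, and drift in listener judgment over a session is spread evenly across conditions.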
The conditions, which were evaluated by 40 non-expert
listeners in each experiment (120 in total), included the
network configurations whose assessment was sought, as
well as a number of reference systems, including a
number of Modulated Noise Reference Units (MNRU,
ITU-T Rec. P.80). The two ACR tests also used as
reference conditions: IS-54 8 kbit/s Vector-Sum Excited
Linear Prediction (VSELP) codec; G.728 16 kbit/s Low-
Delay Code-Excited Linear Prediction (LD-CELP) codec;
the Inmarsat-M 6.4 kbit/s Improved Multi-Band
Excitation (IMBE) speech codec; and four interconnected
32 kbit/s Adaptive Differential Pulse Code Modulation
(ADPCM) devices (whose cumulative distortion is
accepted as perceptually equivalent to the maximum end-to-end
quantization distortion permitted in wireline
connections).
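For readers unfamiliar with the MNRU reference condition, the sketch below shows the commonly stated core relation y(i) = x(i)[1 + 10^(-Q/20) N(i)], where N is unit-variance Gaussian noise and Q is the signal-to-modulated-noise ratio in dB. It is a simplified illustration: the band-limiting filters of a standardized MNRU are omitted, and the function names and the test tone are assumptions introduced here.

```python
import math
import random


def mnru(samples, q_db, seed=0):
    """Add speech-correlated (multiplicative) Gaussian noise at Q = q_db dB."""
    rng = random.Random(seed)
    gain = 10.0 ** (-q_db / 20.0)
    return [x * (1.0 + gain * rng.gauss(0.0, 1.0)) for x in samples]


def measured_q_db(clean, degraded):
    """Ratio of signal power to added-noise power, in dB."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, degraded))
    return 10.0 * math.log10(signal_power / noise_power)


# A sine tone stands in for speech in this illustration.
clean = [math.sin(2.0 * math.pi * i / 50.0) for i in range(20000)]
degraded = mnru(clean, q_db=20.0)
print(f"target Q = 20.0 dB, measured Q = {measured_q_db(clean, degraded):.2f} dB")
```

Because the noise amplitude follows the signal amplitude, the distortion resembles that of a waveform quantizer, which is why a set of MNRU conditions at different Q values is useful for anchoring the opinion scale.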
2.2 Phase II Conversational Test. The second
phase of tests consisted of one conversational experiment
to characterize the best lower bit rate codec in Phase I in
relation to the full-rate aeronautical codec and to two other
reference codecs in a dynamic simulation of actual
network conditions, involving air-to-ground and ground-to-air
connections. Environmental noise conditions
consisted of one of the booths having aeronautical noise,
and the other booth having either office babble noise or
simulating a quiet room (40 dBA). Four scales were used,
two quintal (quality and ease-of-interruption) and two
binary (difficulty-in-conversing, acceptability). The
quality scale was identical to the ACR scale, while the
ease-of-interruption scale ranged uniformly from “no
effort” to interrupt (5) to “extreme effort” to interrupt (1).
The difficulty-in-conversing scale consisted simply of
asking the subjects whether any difficulty was felt during
the conversations (yes or no). The acceptability scale,
similarly, consisted of asking whether the connection was
considered acceptable. Binary scale scores are computed
based on the number of answers “yes”.
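A binary-scale score of this kind can be sketched as a percentage of “yes” answers with a binomial standard error. The function name percent_yes and the example answers are hypothetical, and the binomial standard error is an assumption of this sketch rather than the method stated in the paper.

```python
import math


def percent_yes(answers):
    """Return (percentage of 'yes', standard error in percentage points)."""
    n = len(answers)
    p = sum(1 for a in answers if a) / n
    # Binomial standard error of a proportion.
    std_err = math.sqrt(p * (1.0 - p) / n)
    return 100.0 * p, 100.0 * std_err


# Hypothetical acceptability answers for one condition (True = "yes").
answers = [True] * 12 + [False] * 4
score, score_se = percent_yes(answers)
print(f"Acceptability = {score:.1f}% (standard error {score_se:.1f} points)")
```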
The design followed a 16x16 Latin square, where test
conditions (quiet, babble, and aeronautical noise
environments) alternated among the 64 pairs of non-expert
conversation participants. Channel conditions
included both error (0.1% random bit errors) and error-free
situations for the full-rate codec and the best of the two
lower rate codecs, plus two reference codecs (8 kbit/s
VSELP and Inmarsat-M system 6.4 kbit/s IMBE) in an
error-free situation. The conversational test also included a
270 ms delay simulating a one-hop satellite configura-
tion.
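The Latin-square property used in the design (each condition appears exactly once in every row and every column) can be illustrated with a simple cyclic construction. The paper does not say how its square was generated or how rows and columns were mapped to subject pairs and presentation order, so the construction and the mapping in the comments are assumptions for illustration.

```python
def cyclic_latin_square(n):
    """Return an n x n Latin square built by cyclic shifts of 0..n-1."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """Check that every symbol occurs once per row and once per column."""
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


# Assumed mapping: rows = groups of conversation pairs,
# columns = position in the presentation order, entries = condition index.
square = cyclic_latin_square(16)
print(square[0])
print(square[1])
print("valid Latin square:", is_latin_square(square))
```

With 64 pairs and 16 rows, one plausible arrangement is four pairs per row, though that allocation is not stated in the text. In practice rows and columns are usually also randomized, since a purely cyclic square always presents conditions in the same relative order.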
3. RESULTS & ANALYSIS
3.1 Phase I. The results of Phase I, summarized in
Table 1, indicated a substantial overall advantage of the
AMBE codec over the half-rate version of the aeronautical
codec. All statistical analyses were conducted at a 95%
confidence level. In Experiment 1, where unweighted
speech was assessed in the presence of codec input level
variation, transmission errors, and double-transcodings
(“tandem”), the AMBE codec and the half-rate codec were
statistically equivalent, and better in performance than the
full-rate codec. For IRS-weighted input speech in the
presence of the same circuit conditions (Experiment 2),
however, the full- and half-rate aeronautical codecs
delivered an equivalent overall performance, while the
AMBE codec performed substantially better than both of
them. In Experiment 3, IRS-weighted speech contaminated
either by an interfering talker or by Gaussian,
vehicular, aeronautical, lorry or babble background noise
was assessed over error-free channel conditions. In this
experiment, the full-rate (9.6 kbit/s) aeronautical codec
had the best performance, which was statistically better
than the half-rate codec, while the AMBE performed
statistically better than the half-rate codec for all the tested
conditions. The test variance was approximately 0.9 for
the two ACR experiments and 1.7 for the DCR
experiment. The average standard error was 0.06 for the
ACR tests and 0.08 for the DCR test, which were within
the target experiment accuracies.
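To give a feel for what these standard errors mean at the 95% confidence level, the sketch below converts a standard error into a confidence-interval half-width and into the smallest difference between two condition means that a simple two-sided z-test would flag. The z-test, the independence of the two means, and the function names are assumptions of this sketch; the paper does not specify the statistical test it used.

```python
import math

Z_95 = 1.96  # two-sided 95% point of the standard normal distribution


def ci_half_width(std_err, z=Z_95):
    """Half-width of an approximate 95% confidence interval for one mean."""
    return z * std_err


def min_significant_difference(std_err_a, std_err_b, z=Z_95):
    """Smallest difference between two independent means that a
    two-sided z-test would call significant."""
    return z * math.sqrt(std_err_a ** 2 + std_err_b ** 2)


# Average standard errors reported above for the ACR and DCR tests.
for name, se in (("ACR", 0.06), ("DCR", 0.08)):
    print(f"{name}: CI half-width = {ci_half_width(se):.3f}, "
          f"minimum significant difference = {min_significant_difference(se, se):.3f}")
```

Under these assumptions, two ACR conditions would need to differ by roughly 0.17 MOS, and two DCR conditions by roughly 0.22 DMOS, to be distinguished.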
The AMBE was hence chosen for further testing, since
the AMBE codec performed equivalently to the full-rate
and the half-rate aeronautical codec for unweighted speech,
and better than the latter in the presence of IRS-weighted
speech, as well as in the presence of background noise.
3.2 Phase II. In Phase II, only the full-rate (9.6 kbit/s)
aeronautical codec and the 4.8 kbit/s AMBE codec were
tested using two quintal and two binary scales. Table 2
reports the results only for the Quality and the Accept-
ability scales, which allowed for a more insightful
understanding of the dynamic performance of both codecs.
The test variance was 0.9 and 0.2 respectively for the
Quality and the Acceptability scales. Standard errors were
0.9 and 4% respectively for those two scales. Examining
Table 2, it can be seen that the overall performance of the
AMBE codec was equivalent to that of the VSELP codec,
while the full-rate aeronautical codec was either worse or
in the same range of quality as that of the Inmarsat-M
system IMBE codec. Another observation derived by
means of a proper analysis of variance is that the codec
performance was affected more by the room noise than by
circuit condition impairments used.
4. CONCLUSIONS
From the results obtained it was determined that the
overall performance of the 4.8 kbit/s AMBE is, in
general, better than that of the half-rate version of the
Inmarsat Aeronautical codec (both having outperformed
the 9.6 kbit/s full-rate aeronautical codec), and was thus
selected for further evaluation using conversational
methods. From the conversational test, it was subse-
quently confirmed that 4.8 kbit/s AMBE coding performed
equivalently to or better than the other codecs under
aeronautical-noise conditions. Thus, it was concluded that
the voice quality of the Inmarsat Aeronautical system will
be perceptibly improved when the current full-rate codec is
replaced by the AMBE codec, and that this replacement will
allow for expansion of the Inmarsat Aeronautical service to
smaller aircraft while maintaining good cellular speech quality.
Table 1(a): Results for the DCR Listening Experiment (Performance with Background Noise)
[Score columns not recoverable from the source.]
Codecs: AMBE, 9.6k Aero, 4.8k Aero, IMBE
FACTOR:
ATR, Takeoff, S/N=6 dB
WBA, Cruise, S/N=20 dB
CJ Noise 1, S/N=9 dB
CJ Noise 2, S/N=15 dB
Babble Noise, S/N=20 dB
Talker Noise, S/N=24 dB
Gaussian Noise, S/N=20 dB
Lorry Noise, S/N=15 dB
WBA: Wide-body jet; CJ: Corporate jet; ATR: Turbo-propeller
Table 1(b): Results for the ACR Listening Experiment (Effect of Tandem, Input Level Variation ...
Table 1(c): Results for the ACR Listening Experiment (Effect of Tandem, Input Level Variation ...
[Score columns not recoverable from the source.]
Codec conditions: BER=0 at -20, -10, and -30 dBm0; Coder/Coder, -20 dBm0; G.726/Coder, -20 dBm0; BER=1%, -20 dBm0; C/M=12 dB, -20 dBm0
Reference conditions (BER=0, -20 dBm0): VSELP, 4xG.726, IMBE
Table 2: Results for the Conversational Experiment (Environmental noise and channel errors ...
[Score columns not recoverable from the source; row factors: Circuit, Noise.]
Y(Q): scores for the Quality scale
Y(A): percentage of “Yes” for Acceptability
5. REFERENCES
[1] S. F. Campos Neto et al., “Performance Assessment
of the Inmarsat Aeronautical Codec,” Final Report,
Inmarsat Contract INM/94/1367/ES.
[2] S. Dimolitsas et al., “Evaluation of Voice Codec
Performance for the Inmarsat Mini-M System,”
Proceedings, 10th Int. Conf. on Digital Satellite Communications
(ICDSC’10), Brighton, England, May 1995.
[3] ITU-T, “Methods for Subjective Determination of
Transmission Quality,” Rec. P.80, March 1993.
[4] CCITT, “Specification of an Intermediate
Reference System,” Rec. P.48, Blue Book, Vol.
V, pp. 81-86, Melbourne, 1988.