
THE SIMULTANEOUS USE OF THREE MACHINE SPEECH RECOGNITION SYSTEMS TO INCREASE RECOGNITION ACCURACY
Tim Barry
North Coast Simulation
Dayton, Ohio
Tom Solz
John Reising
Dave Williamson
Pilot-Vehicle Interface Technology Branch
Wright-Patterson AFB, Ohio


ABSTRACT
Two studies were performed to test the ability of
three different automated speech recognition devices
working in parallel, along with an Enhanced Majority Rules
(EMR) software algorithm, to obtain a combined speech
recognition accuracy better than the accuracies produced by
each of the individual systems alone. The first experiment,
using a rather simple, robust vocabulary, compared the
recognition accuracy of the three individual systems with the
accuracy obtained by combining the data from all three
systems using the EMR algorithm. The second experiment
made the same comparison, but a more difficult, non-robust
vocabulary was used during testing. Results from both
experiments revealed a significant increase in speech
recognition accuracy obtained by combining the recognition
data from all three systems in the EMR algorithm when
compared with two of the three speech recognition systems.
The third system, a newer-generation recognition device,
performed as well as all three used in concert. The
implications of using intelligent software and multiple
speech recognition devices to improve speech recognition
accuracy are discussed.

INTRODUCTION

Speech recognition has long been advocated as a manual and visual workload alleviator for the pilot. As the modern aircraft incorporates more sophisticated avionics, the challenge of managing all the various information sources becomes even more difficult. From on- and off-board sensor manipulation to weapons, communications, and navigation control, the single-seat fighter pilot has limited ability to effectively manage all of the information available using just hands and eyes. For these reasons, researchers have been exploring the possibilities of using speech recognition technology to augment the pilot's ability to control and display information in the cockpit (Lizza and Goulet, 1986; Williamson and Barry, 1990).

One of the hurdles preventing the transition of speech technology into the cockpit is ensuring a high degree of accuracy in its recognition of pilot commands under all the various noise, vibration, and g forces the fighter pilot has to endure. However, several flight test activities performed in the early and mid 1980s went a long way toward quantifying these effects and led to significant improvements in basic speech recognition performance in the military environment (Williamson, 1987).

One simple approach to improving recognition performance was suggested several years ago by researchers in the Flight Dynamics Directorate in Wright Laboratory. Given the observed unique strengths of individual vendors' speech systems, it was hypothesized that combining multiple systems into a simple voting architecture might result in an enhancement in overall speech recognition performance. This hypothesis was tested for a limited-vocabulary, isolated-word application and resulted in a significant improvement in recognition accuracy (Barry, Liggett, Williamson, and Reising, 1992). That experiment led to this follow-on study effort to further investigate the benefits of using multiple speech recognition systems in a single architecture.

OBJECTIVES

The objectives of the two experiments were 1) to gauge the recognition accuracies of the three different speech recognition systems using both "easy" and "hard" vocabularies and 2) to determine whether the Enhanced Majority Rules (EMR) algorithm, combining raw data from all three systems, could increase recognition accuracy above and beyond the accuracy produced by the individual systems alone.

EXPERIMENT 1

Subjects. Six male and six female civilian and Air Force military personnel volunteered to participate in Experiment 1. The subjects had little or no direct experience with automated speech recognition systems.



Apparatus. Four computer systems were used to host the three speech recognition systems and an experimenter's station, as shown in Figure 1. A Compaq 386/20e served as the experimenter's station and master of three slave computers, each hosting a different speech recognition system. The master was in charge of presenting the vocabulary words to the subjects and receiving and logging recognition data sent by the slaves. Slave #1 contained a Votan VPC-2100 speech board (Votan) operating in isolated speech recognition mode. Slave #2 contained a Texas Instruments speech board operating in isolated speech recognition mode (TI). Slave #3 contained an ITT-1290 speech board operating in connected speech recognition mode (ITT). The three slaves were connected to the master via RS-232 serial communications links. One microphone was connected to an audio amplifier which, in turn, fed the audio signal to each of the three recognition systems. Engineering tests verified that each system received the same noise-free audio signal.
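To illustrate this master/slave arrangement, the sketch below shows how a master station might poll three recognizers over serial links and log whatever each one returns for an utterance. It is only a minimal illustration under assumed details: the port names, baud rate, and line-oriented packet framing are hypothetical and are not described in the paper.

# Hypothetical sketch of the master collecting results from three slave
# recognizers over RS-232; port names and packet framing are assumptions.
import serial  # pyserial

PORTS = {"votan": "COM1", "ti": "COM2", "itt": "COM3"}  # assumed port assignments

def open_links(baudrate=9600, timeout=2.0):
    """Open one serial link per slave recognizer."""
    return {name: serial.Serial(port, baudrate, timeout=timeout)
            for name, port in PORTS.items()}

def collect_responses(links):
    """Read one line-delimited response packet from each slave.

    A slave that times out contributes an empty string, which the EMR
    logic can treat as 'no response'.
    """
    responses = {}
    for name, link in links.items():
        raw = link.readline()  # blocks until newline or timeout
        responses[name] = raw.decode("ascii", errors="replace").strip()
    return responses

if __name__ == "__main__":
    links = open_links()
    print(collect_responses(links))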

Enhanced Majority Rules algorithm logic. By processing the raw recognition and other data received from the three speech recognition devices whenever a word was spoken by the subject, the EMR software program determined its own best guess as to which word the subject spoke. The EMR algorithm, in essence, was a fourth speech recognition system capable of being compared with the other three hardware systems.

The logic of the EMR algorithm when processing the raw data is conceptually very simple and is depicted in Figure 2. Each of the three hardware speech recognition systems simultaneously received speech input from the subject and attempted recognition. The three speech systems then sent a packet of information to the Master system containing the recognized word (the system's best guess as to which word was spoken by the subject), a second-choice word, the distance score (a measure of how much confidence the system had in choosing the first-choice word) for the recognized word, and the distance score for the second-choice word.
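As a concrete illustration of the per-utterance packet described above, the sketch below defines a simple record type holding the four fields each recognizer reports. The field names and types are illustrative assumptions, not the actual packet layout used by the Votan, TI, or ITT boards.

# Illustrative record for one recognizer's response to one utterance;
# the field names and the use of floats for distance scores are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecognizerPacket:
    system: str                   # "votan", "ti", or "itt"
    first_choice: Optional[str]   # recognized word, or None if no valid response
    first_distance: float         # distance score for the first-choice word
    second_choice: Optional[str]  # second-choice word
    second_distance: float        # distance score for the second-choice word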


At times, one or more of the recognition systems would respond with a packet of data for more than one recognized word. These extra words, or insertions, were handled by the algorithm by looking at all of the returned words and immediately checking for a clear majority. When no insertions were encountered, the EMR algorithm first tested to see if at least two of the three systems returned the same first-choice word. When this was the case, the EMR algorithm simply identified that word as its response. When only one system returned a response and the other two systems returned either an invalid response or no response at all, the EMR algorithm chose that word as its response. When there was no agreement at all among the systems, the second-best words were added to the pool, with equal weight, to see if some majority agreement among the systems could now be reached. If the systems still could not agree on the recognized word, the response with the lowest distance score was chosen by the EMR as its response.
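The voting rules above can be summarized in a short function. The sketch below is only an interpretation of the published description (majority vote on first choices, fall back to a pooled vote that includes second choices, then fall back to the lowest distance score); the tie-breaking details and the handling of insertions are simplified assumptions, not the original implementation.

# Sketch of the Enhanced Majority Rules (EMR) vote over three recognizer
# packets; an interpretation of the paper's description with simplified
# tie-breaking, using the RecognizerPacket record sketched earlier.
from collections import Counter

def emr_vote(packets):
    """Return the EMR algorithm's chosen word for one utterance.

    A missing or invalid response is represented by first_choice = None.
    """
    first_choices = [p.first_choice for p in packets if p.first_choice]

    # 1. Clear majority among the first-choice words.
    if first_choices:
        word, count = Counter(first_choices).most_common(1)[0]
        if count >= 2:
            return word
        # Only one system produced a valid response: accept its word.
        if len(first_choices) == 1:
            return first_choices[0]

    # 2. No agreement: add the second-choice words to the pool, equal weight.
    pool = first_choices + [p.second_choice for p in packets if p.second_choice]
    if pool:
        word, count = Counter(pool).most_common(1)[0]
        if count >= 2:
            return word

    # 3. Still no agreement: take the response with the lowest distance score.
    candidates = [p for p in packets if p.first_choice]
    if candidates:
        return min(candidates, key=lambda p: p.first_distance).first_choice
    return None  # no system produced a usable response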

Figure 1. Hardware configuration used in the experiment


Vocabulary. The test vocabulary consisted of the
ten digit words (Zero through Niner) and 10 additional
words likely to be used in a simple cockpit digit entry task.
The 20 word vocabulary used in Experiment 1 is shown in
Table 1.

Zero     Five      Point      Frequency
One      Six       Clear      Channel
Two      Seven     Enter      Range
Three    Eight     Hundred    Affirmative
Four     Niner     Thousand   Negative

Table 1. Vocabulary words used in Experiment 1

Experimental design. The 20 vocabulary words were randomly presented to both the male and female subjects in each of five trials, resulting in a total of 100 word presentations (and therefore 100 words spoken by the subjects). Because the hardware was configured so that all recognition systems received the spoken word simultaneously, 400 data points (3 recognition systems and the EMR times 100 utterances) were collected from the 100 spoken words. No tests for a trial effect were planned, so the data for the five trials were statistically combined. The resulting experimental design was therefore a 2 gender (male, female) by 4 recognition system (Votan, TI, ITT, EMR algorithm) Within Subjects Factorial design.





Dependent measure. The dependent measure was a mean Adjusted Overall Accuracy (AOA) value (Simpson, 1990) computed for each of the three recognition systems and the EMR algorithm. AOA was computed as follows:

AOA = (NC / NT) * (1 - (NI / NT)) * 100.0

where:
NC = number of correctly recognized words
NT = total number of words presented
NI = number of word insertions

An AOA value of 100 (percent recognition) represented perfect speech recognition.
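The measure can be computed directly from the three counts; the helper below and its sample numbers are purely illustrative and are not taken from the paper's data.

# Adjusted Overall Accuracy (Simpson, 1990) as defined above; the sample
# counts in the example are made up for illustration only.
def adjusted_overall_accuracy(nc: int, nt: int, ni: int) -> float:
    """AOA = (NC / NT) * (1 - NI / NT) * 100."""
    return (nc / nt) * (1.0 - ni / nt) * 100.0

# Example: 97 of 100 words correct with 2 insertions -> 0.97 * 0.98 * 100 ~ 95.06
print(adjusted_overall_accuracy(nc=97, nt=100, ni=2))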

RESULTS AND DISCUSSION


Planned orthogonal comparisons were run using the Statistical Package for the Social Sciences (SPSS Inc., 1993) to contrast the EMR algorithm with each of the three individual speech recognition systems. The results indicated that when using the three speech recognition systems in combination, Adjusted Overall Accuracy (AOA) was significantly improved over the recognition accuracy produced from the TI (F(1,11) = 26.85, p < .000) and Votan (F(1,11) = 11.59, p < .006) systems (Figure 3). The EMR algorithm failed to provide better recognition accuracy than the ITT system. A test for a gender effect revealed no performance differences between males and females.

Figure 2. Logic flow of EMR algorithm


Procedure. Each subject was seated in a cockpit
mockup, fitted with a headset microphone, and given a
standardized briefing in which the experimenter explained
the purpose of the experiment and what events would take
place. The subject then "trained" the three speech systems
to recognize his or her voice. This training process allowed
each recognition system to collect and store speech
templates for use in the recognition portion of the
experiment. The order in which the three systems were
trained was counter-balanced across the 12 subjects. After
training, the data collection session was begun. The 20
vocabulary words were presented, one at a time, on a
computer screen in front of the subject. The subject's task
was to simply say the presented word. Each of the three
speech recognition systems simultaneously received the
speech signal, attempted recognition, and sent the
recognition results to the Master system. The random
presentation of all 20 vocabulary words represented one
trial. The five data collection trials were run consecutively.

[Figure 3: mean Adjusted Overall Accuracy (AOA) for the "easy" vocabulary, plotted by recognition system (Votan, TI, ITT, EMR).]

Once the raw data from the three recognition systems were collected and stored in a data file, the EMR algorithm computer program was run. This program simulated the real-time processing of input from the three speech recognition systems by reading the raw data file, computing its own recognition result, and adding this fourth recognition result to the raw data file. At the end of the experiment, the raw recognition data file was processed to calculate the recognition accuracy dependent measure.
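A minimal sketch of that offline pass is shown below, reusing the RecognizerPacket and emr_vote sketches given earlier. The file names, the semicolon-separated field layout, and the parsing details are invented for illustration; the real raw data format is not specified in the paper.

# Hypothetical offline replay of the raw recognition log: re-run the EMR
# vote for each utterance and append its result as a fourth "system".
import csv

def replay_emr(in_path="raw_recognition.log", out_path="raw_with_emr.log"):
    """Read one utterance per line, compute the EMR choice, append it."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst, delimiter=";")
        for row in reader:
            presented = row[0]  # word shown to the subject (kept for reference)
            packets = []
            # Each recognizer contributes 4 fields: word1, dist1, word2, dist2.
            for i in range(1, len(row), 4):
                w1, d1, w2, d2 = row[i:i + 4]
                packets.append(RecognizerPacket("sys%d" % (i // 4),
                                                w1 or None, float(d1 or "inf"),
                                                w2 or None, float(d2 or "inf")))
            writer.writerow(row + [emr_vote(packets) or ""])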

Figure 3. Results of Experiment 1


Some of the results of this study were expected while others were surprising. Previous work in the area of multi-board speech recognition using one Votan and two TI systems (one in discrete mode and one in connected mode) had shown a significant increase in recognition accuracy of the EMR algorithm over all three systems (Barry et al., 1992). We were surprised to learn that the ITT system, which replaced the TI connected speech system, performed as well as it did. The accuracy of the EMR algorithm may in fact have been solely due to the strong performance of the ITT system. Another possible explanation was the relative simplicity of the vocabulary words used in the test. Would the three recognition systems, and therefore the EMR algorithm, perform differently when a more difficult, less robust, vocabulary was used? It was these questions that necessitated a second experiment.

EXPERIMENT 2

The objectives of Experiment 2 were to validate the results of Experiment 1 and to test the recognition performance of the three systems and the EMR algorithm using a relatively difficult vocabulary.

Method

Subjects. An attempt was made to use the twelve subjects who participated in Experiment 1 in the new experiment, but because of scheduling conflicts, only ten of the twelve subjects were able to participate in Experiment 2. Two additional subjects were recruited as replacements.

Vocabulary. The new test vocabulary consisted of the digit words used in the first experiment (Zero through Eight, with Niner changed to Nine), six "teen" words, seven "tens" words, and two very short words, "on" and "off." These specific words, taken from a list of words likely to be used in an aircraft cockpit environment, were chosen because of their similar sounds and therefore potential confusability. The 25-word vocabulary used in Experiment 2 is shown in Table 2.

Zero     Five      Fourteen     Nineteen   Sixty
One      Six       Fifteen      Thirty     Seventy
Two      Seven     Sixteen      Forty      Eighty
Three    Eight     Seventeen    Fifty      Ninety
Four     Nine      Eighteen     On         Off

Table 2. Vocabulary words used in Experiment 2

Experimental design. The design used in this experiment, a 2 gender by 4 recognition system Within Subjects Factorial design, was identical to that used in Experiment 1.

Procedure. The experimental procedure used in Experiment 1 was also used in Experiment 2.

RESULTS

Planned orthogonal comparisons were run using SPSS to contrast the EMR algorithm with each of the three individual speech recognition systems. The results indicated that when using the three speech recognition systems in combination, AOA was significantly improved over the recognition accuracy produced from the TI (F(1,11) = 27.16, p < .000) and Votan (F(1,11) = 38.61, p < .000) systems (Figure 4). As was the case in Experiment 1, the EMR algorithm failed to provide better recognition accuracy than the ITT system. A test for a gender effect again revealed no performance differences between males and females.

[Figure 4: mean Adjusted Overall Accuracy (AOA) for the "hard" vocabulary, plotted by recognition system (Votan, TI, ITT, EMR).]

Figure 4. Results of Experiment 2

DISCUSSION

In both experiments, the collective use of raw data from the three speech recognition systems in the EMR algorithm resulted in a better mean AOA than the accuracy obtained by the two individual older-generation systems. As was expected, the use of a difficult vocabulary in Experiment 2 resulted in an overall decrease in accuracy of these systems compared with the results of Experiment 1. The EMR algorithm, however, showed a slight increase in mean accuracy with the hard vocabulary.

The excellent performance of the newer-generation ITT-1290 recognition system, compared with the older-generation systems, proved also to be a pleasant surprise.


The accuracy of the ITT system remained above 96.6% even when operating with the hard vocabulary containing similar-sounding words like "six", "sixteen", and "sixty".

It is becoming increasingly evident that automated speech recognition technology is now maturing to the point that serious consideration must be given to the integration of this technology into the aircraft cockpit. Additional work, of course, must be done to ensure that the best possible accuracy is obtained while at the same time the initial system costs, life cycle costs, etc. are considered. With the laboratory success of the EMR algorithm using recognition data from two older-generation systems and one newer system, it seems plausible that even better recognition accuracy might be obtained if all three systems were state-of-the-art recognition devices.

REFERENCES

Barry, T., Liggett, K., Williamson, D., and Reising, J. (1992). Enhanced Recognition Accuracy with the Simultaneous Use of Three Automated Speech Recognition Systems. In Proceedings of the Human Factors Society 36th Annual Meeting (pp. 288-292). Santa Monica, CA: Human Factors Society.

Lizza, G. and Goulet, C. (1986). Cockpit Natural Language. In Proceedings of the 1986 National Aerospace and Electronics Conference (pp. 818-819). Dayton, OH: IEEE.

Simpson, C. (1990). Evaluation of Speech Recognizers for Use in Advanced Combat Helicopter Crew Station Research and Development (NASA Technical Memorandum 90-A001). Moffett Field, CA: Ames Research Center.

SPSS Inc. (1993). SPSS for Windows Advanced Statistics Release 6.0. Chicago, IL: SPSS Inc.

Williamson, D. (1987). Flight Test Results of the AFTI/F-16 Voice Interactive Avionics Program. In Proceedings of the Voice I/O Systems Applications Conference (pp. 335-345). Alexandria, VA.

Williamson, D. and Barry, T. (1990). Cockpit Application of Voice Technology. In Proceedings of the Eleventh Annual IEEE/AESS Dayton Chapter Symposium (pp. 57-60). Dayton, OH: IEEE.

