
Investigation of Voice Stage Support: Subjective Preference Test Using an Auralization System for Self-Voice

by

Cheuk Wa Yuen

A Thesis

Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for the degree of MASTER OF SCIENCE IN BUILDING SCIENCES, CONCENTRATION IN ARCHITECTURAL ACOUSTICS

Approved:

_________________________________________
Professor Paul T. Calamia, Thesis Adviser

_________________________________________
Professor Ning Xiang, Ph.D.

Rensselaer Polytechnic Institute
Troy, New York
June, 2007
(For Graduation August 2007)

Copyright 2007 by Cheuk Wa Yuen
All Rights Reserved

CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT
1. Introduction
   1.1 Aim of the Thesis
   1.2 Historical Review
      1.2.1 Stage Acoustics and Support
      1.2.2 Previous Research on Subjective Preferences in Stage Acoustics
      1.2.3 Self-Voice Perception
      1.2.4 Experimental Setup in Previous Self-voice Auralization and Related Sound Field Simulation
   1.3 Thesis Outline

2. Self-voice Auralization System: Design and Implementation
   2.1 Experimental Design Concept
   2.2 Measurement Setup
   2.3 Binaural Real-time Auralization System
      2.3.1 System Overview
      2.3.2 Implementation of Direct Air Conduction Modeling
      2.3.3 Implementation of Indirect Air Conduction Modeling
      2.3.4 Implementation of Headphone Equalization
   2.4 BRIR Acquisition System
   2.5 Experimental Procedures in Subjective Test
      2.5.1 Test Subject Conditioning
      2.5.2 Use of Dramatic Text in Study of Actors
      2.5.3 Verifying the Consistency of Self-voice Stimuli by Monitoring the Pace of Speech

   2.6 HATS Verification Tests
      2.6.1 Binaural Microphones
      2.6.2 Artificial Mouth
   2.7 Subjective Test on Naturalness of the Auralization System
      2.7.1 Evaluation of Naturalness of CIL Filter Delay Time
      2.7.2 Evaluation of Naturalness of CIL Filter Level
   2.8 Discussion

3. Subjective Preference Tests on Stage Acoustic Conditions for Actors
   3.1 Introduction
   3.2 Impulse Response Acquisition
   3.3 Subjective Test Design
      3.3.1 Preference Ratings from Paired Comparison
      3.3.2 Test Procedures
   3.4 Paired Comparison Test on Stage Locations
      3.4.1 Preference study of stage locations when head orientation is "look center"
      3.4.2 Preference study of stage locations when head orientation is "look left"
      3.4.3 Preference study of stage locations when head orientation is "look right"
      3.4.4 Discussion on Stage Location Preference
4. Discussions
   4.1 Reflections on subjective preferences on stage acoustic conditions
   4.2 Accuracy in subjective testing
   4.3 Potential ways of improving voice stage support in proscenium theaters


LIST OF TABLES
Table 1. Direct AC filter settings (Rane PE-15)
Table 2. CIL filter settings (digital parametric equalizer on 02R)
Table 3. Headphone compensation filter settings (02R master output)
Table 4. Delay times in the evaluation test of naturalness of the Direct AC insertion loss compensation filter (CIL filter). *The current system has a processing delay of 0.14 ms with a setting of 0.01 ms on the DN716. **Pörschmann's tested delay times were based on taps of a 48 kHz Tucker-Davis DSP system, and are represented here in milliseconds for convenience of comparison.

LIST OF FIGURES
Figure 1. Components of perception of self-voice
Figure 2. Earthworks M30 omni-directional microphone
Figure 3. Countryman B3 miniature omni-directional microphone
Figure 4. Transfer function from Earthworks M30 to Countryman B3
Figure 5. Binaural self-voice auralization system block diagram
Figure 6. Test subject with microphone and pop filter
Figure 7. MRP-to-ERP transfer function & Direct AC filter using PE-15 parametric equalizer (1/3-octave smoothing)
Figure 8. Setup for measuring the headphone's insertion loss using an isolation tube
Figure 9. Isolation tube used in insertion loss measurement
Figure 10. Waves IR1 Convolution Reverb, loaded with a 3-second unit sample sequence
Figure 11. Impulse response trimming before importing to IR-1
Figure 12. Headphone response and compensation filter (02R master output)
Figure 13. Frontal plane section of HATS, showing binaural microphones and the related fittings (adapted from manufacturer's manual)
Figure 14. Median plane section of HATS, showing the artificial mouth
Figure 15. Binaural room impulse response acquisition system block diagram, showing how the binaural ears and artificial mouth of the HATS are connected
Figure 16. Example plot of effective duration of the running autocorrelation function of two recordings of the same text at different pace, showing the use of τe,min as a temporal reference for monitoring subjective testing
Figure 17. Verification test of HATS in anechoic chamber at General Electric Laboratory (NY)
Figure 18. Frequency response comparison between HATS binaural microphones
Figure 19. Voice directivity of HATS, (a) horizontal plane and (b) vertical plane, in 4 octave bands (250 Hz, 500 Hz, 1 kHz & 2 kHz)
Figure 20. Comparison of voice directivity in 3 octave bands (500 Hz, 1 kHz & 2 kHz)
Figure 21. On-axis frequency response of HATS artificial mouth; overlays of no smoothing and 1/6-octave smoothing
Figure 22. Frequency response of B&K Artificial Mouth Type 4128C (adapted from manufacturer's datasheet)
Figure 23. MRP-to-ERP transfer function of HATS (1/3-octave smoothing)
Figure 24. Averaged frequency response of MRP-to-ERP (Direct AC) of 18 human subjects; the grey area marks the standard deviation (adapted from Pörschmann [2000])
Figure 25. Subjective evaluation of naturalness of delay time in Direct AC auralization; mean score and error rate of all subjects (95% confidence)
Figure 26. Architectural plan of the main space at the RPI Playhouse; dimensions in inches; blue lines are dimensional guides (CAD drawing courtesy of RPI Building Management)
Figure 27. Stage locations where the BRIR was measured; the dashed line labeled "CL" is the center line of the stage across the proscenium
Figure 28. Top view of the HATS showing 3 different head orientations in binaural room impulse response acquisition
Figure 29. Preference scores of different stage locations when head orientation is "look center" (conditions A: DSC, B: DSR, C: CSC, D: CSR); normalized scores of 13 individual subjects A-M (blue bars) and overall average score of all subjects (red bars)
Figure 30. Interaural cross-correlation functions in 100-ms intervals for conditions A-D when head orientation is "look center"
Figure 31. VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D: CSR) when head orientation is "look center"
Figure 32. Preference scores of different stage locations when head orientation is "look left" (conditions A: DSC, B: DSR, C: CSC, D: CSR); normalized scores of 13 individual subjects A-M (blue bars) and overall average score of all subjects (red bars)
Figure 33. Interaural cross-correlation functions in 100-ms intervals for conditions A-D when head orientation is "look left"
Figure 34. VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D: CSR) when head orientation is "look left"


ACKNOWLEDGMENT
I am grateful to Professor Paul Calamia for his willingness to share his knowledge and wisdom. I would also like to thank Dr. Ning Xiang for his insights and meticulous training in laboratory work and research, and Dr. Jonas Braasch for an enjoyable class in psychoacoustics. This research would not have been possible without Mr. David Larson's generosity in lending me the Brüel & Kjær head and torso simulator. My gratitude also goes to Mr. Bob Hedeen at General Electric Laboratory (NY) for letting me use the anechoic chamber for numerous acoustical measurements.

A thunderous applause goes to all the actors who participated in this research. Speaking in an isolated environment, without interaction with other actors or an audience, was the most difficult experience for the artists; your patience and concentration were thoroughly professional. Without your support, there would be no study of voice stage support.

My study in the United States was made possible by the support of the prestigious Sir Edward Youde Memorial Fund Fellowship in Hong Kong. I hereby send my dedication to the late Sir Edward Youde. His wife Pamela Youde's continuous encouragement means a lot to me. Respect also goes to all officers of the fellowship council, especially Ms. Carnelia Fung.

I sincerely thank all my mentors and the incredible educators from whom I learned so much throughout my years at Rensselaer Polytechnic Institute, the California Institute of the Arts and the Hong Kong Academy for Performing Arts.

Last but not least, I thank my family for their continuous support. This thesis is dedicated to my parents, my late grandmother, Fung Yau Hau, and my brother Cheuk Chi Yuen, who is recovering from a speech disorder after a stroke in summer 2006. That his speech therapy consists of sessions of repetitive reading is an irony not lost on my research.


"Not everything that can be counted counts, and not everything that counts can be counted." [4]
Albert Einstein

ABSTRACT
The human voice plays an integral role in the dramatic arts. The performance of singers and actors, who perceive their voice through their ears as well as through bone conduction, is highly dependent on the acoustic conditions they are in. Due to the proximity of the sound source and the spectral difference between transmission through the skull and through air, a support condition different from that for instrumentalists is needed. This thesis aims to initiate a standardized methodology for subjective preference testing of voice stage support, in order to collect more data for statistical analysis. A proposal for an acquisition/auralization system for self-voice and a set of subjective test procedures are presented. The subjective evaluation of the system is compared to previous designs reported in the literature, and the implementation is validated. A small playhouse has been measured and auralized using the system described, and subjective preference tests have been conducted with 13 professionally trained actors. Their preferred stage acoustic conditions (in relation to locations on stage and head orientations) are reported. The results show potential directions for further investigation and identify the necessary concerns in developing an objective parameter for voice stage support.


1. Introduction
In the course of theater history, from classical Greek drama to Shakespearean plays to Ibsen's naturalistic plays to 20th-century Broadway rock musicals, the human voice has always been an integral part of the dramatic art. The success of this art largely relies on how well the audience understands the words voiced by the actors. This rule has not changed in more than 2,300 years since the days of the Lycurgian Theater of Dionysus in Greece (the first great permanent theater in recorded history) [1].

While most contemporary architects and acousticians focus on auditorium acoustics in the design of performance spaces, the special acoustical needs of musicians, singers and actors are often less emphasized. Although acoustic shells have been developed for concert halls and have achieved a certain degree of success, opera houses and theaters do not receive the same attention. Stage performers are left to adapt to the acoustics of the space as best they can. [2] In many cases, performers find it difficult to hear themselves or each other intelligibly, and thus do not achieve their best tonality; in extreme cases, they fail to attain pitch accuracy and coherence, individually or in the ensemble. This may result in a less-than-satisfactory performance: the performer-audience communication is not achieved, which ultimately affects how the audience rates the performance and possibly the acoustics of the venue.

It is strongly suggested here that stage acoustics demands as much attention as auditorium acoustics. It is logical that optimal stage acoustics is fundamental to a good overall rating of the acoustics in a performance space. (Visual appeal also plays a role in the audience's rating of the acoustics, but it is outside the scope of this thesis.) There is currently no parameter in international standards quantifying stage acoustics.
Among all acoustical parameters widely used in the industry, only one is generally accepted as a means of quantifying the ease of listening and performing on stage: Support (ST1, ST2), first proposed by A.C. Gade in 1989, which is intended to measure the contribution of early reflections to the sound from the musician's own instrument. [3]

Gade's proposal is, however, limited to instrumentalists. For singers and actors (or "voice performers," the consistent terminology for the rest of this thesis), whose instrument is the human voice, Support cannot be applied directly because of the influence of bone conduction in the perception of self-voice. Moreover, Support fails to address the frequency spectrum and orientations of early reflections, and the directions of late reverberation, which might be determining factors as well. The practicality of using a single parameter to represent voice performers' preferred stage acoustic conditions remains uncertain. But one thing is clear: whether the goal is to propose a new acoustical parameter or to validate the effectiveness of a stage acoustics design, the subjective preference test is the only viable means of solving the problem. Every human being is unique, and preferences for a given acoustic condition remain highly subjective and may vary enormously. The preferred stage acoustic condition depends on one's own voice quality and auditory behavior.

A new study in stage acoustics, called Voice Stage Support (VSS), is proposed to investigate auditory feedback on stage for professional voice performers. It thus excludes normal speech communication between laypeople. It is not the objective of this thesis to comprehensively define VSS or to devise a new parameter experimentally. It is rather the initiation of groundwork in promoting the study of this uniquely different field in opera house and theater acoustics, which involves acoustical design, psychoacoustics and performance psychology. One may argue that generalizing auditory preferences for the entire human population is tremendously difficult, if not impossible; it remains a challenge for acousticians.

"From error to error, one discovers the entire truth."
Sigmund Freud

1.1 Aim of the Thesis


As discussed above, in order to study Voice Stage Support (VSS), subjective preference tests are inevitable. There are some prominent difficulties in conducting such tests. In statistical analysis, the key to success is a large number of samples in the population. It therefore takes a long time for any single researcher to acquire enough data for analysis and reach a convincing conclusion. This is particularly difficult for VSS because professional voice performers constitute only a very small portion of the human population. It will take the combined effort of numerous studies before any comprehensive theory of VSS can be accomplished; the more subjective data collected, the better the development of the field.

Unlike most auditory experiments, which involve external stimuli (sound sources outside one's body), the perception of self-voice strictly requires one's own vocalization to generate the sound stimuli for the test. This demands real-time auralization of auditory scenes and precludes the use of pre-recorded, pre-processed test stimuli. The demand for real-time auralization implies a system with low propagation (or processing) delay. Previous similar acoustical studies often required specialized digital signal processing (DSP) equipment, which means very few facilities are equipped to repeat such tests and generate compatible results. However, alternatives now exist, since digital audio signal processing has become more widely available and affordable in the professional audio industry. The argument here is that an easily obtainable, reproducible and repeatable setup for real-time auralization of self-voice would greatly promote this field of study by enabling more researchers who have access to professional singers and actors to conduct such tests, thus enlarging the sample base in the aggregation of compatible data for long-term statistical analysis.

The first objective of this thesis is to verify the reliability of a more accessible real-time auralization setup, compared to previous experimental systems found in the literature. It is also a step toward standardizing psychoacoustical experimental procedures involving real-time self-voice auralization and the respective data acquisition, both in terms of hardware setup and subject conditioning, for the purpose of voice stage support study. This thesis includes a subjective evaluation of the stage in a 200-seat playhouse: various stage locations and head orientations are compared using the proposed auralization setup and procedures.

1.2 Historical Review


This section briefly covers the issues related to the thesis. It first summarizes the field of stage acoustics and support for performers, followed by the differences and difficulties of support for voice performers as compared to musicians (Sections 1.2.1 & 1.2.2). The psychophysics of self-voice perception is introduced in Section 1.2.3. Previous subjective preference tests and their auralization setups are reviewed in Section 1.2.4.

1.2.1 Stage Acoustics and Support

Stage acoustics can be defined as the study of the acoustic conditions at the locations where performers are situated during a performance. On many occasions, performers are located in a stage house or stage volume which is spatially distinct (yet not isolated) from the physical volume of the auditorium; this is particularly obvious in proscenium theaters and opera houses. In other settings, such as theater-in-the-round or multi-functional/modular theaters, the separation between stage and auditorium acoustic spaces is less distinct, and the two may overlap. Whatever the setting, performers demand certain conditions so that they can perform comfortably. Stage support usually refers to the amount of auditory feedback from one's own instrument, which enables performers to hear themselves with ease so that they do not need to force the instrument to develop the tone. In A.C. Gade's pioneering work [5], this is translated into an objective parameter, SUPPORT, which comprises three measures of energy ratios (ST1, ST2 and STlate) in the sound field.

ST1 = 10 log10 [ E(20, 100 ms) / E(0, 10 ms) ]

ST2 = 10 log10 [ E(20, 200 ms) / E(0, 10 ms) ]

STlate = 10 log10 [ E(100, ∞ ms) / E(0, 10 ms) ]

After a few applications and analyses, they were later revised by Gade [6] as:

STearly = 10 log10 [ E(20, 100 ms) / E(0, 10 ms) ]

STtotal = 10 log10 [ E(20, 1000 ms) / E(0, 10 ms) ]

STlate = 10 log10 [ E(100, 1000 ms) / E(0, 10 ms) ]

where E(t1, t2) stands for the time integral of the squared pressure signal of an impulse response between the time limits given in the brackets. In the above definitions, t = 0 is the arrival time of the direct sound, and the unit is dB.

SUPPORT has been applied in various studies of acoustics for performers [7][8][9], and it generally agrees with performers' subjective preferences. Nevertheless, a few points attract our attention. Firstly, Gade's setup specifies a measurement microphone position one meter (roughly the maximum distance between a player's ears and his instrument) in front of the sound source. Secondly, a single microphone is used in the measurement. Gade reported that STearly unexpectedly succeeded in describing the ease of hearing other musicians rather than fulfilling its intended purpose [6]. Although its reliability has yet to be ascertained over a longer period of time, some acoustical consultants have been using it as a parametric guideline. However closely related to performers' support, it is not applicable in the case of singers and actors, because the instrument concerned, the human voice, is in close proximity to the ears, and there exists a fundamental difference between the perception of self-voice and that of any other musical instrument.
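As a concrete illustration, Gade's revised Support measures can be computed directly from a measured stage impulse response. The sketch below is a minimal implementation, not code from this thesis; it assumes the impulse response has been time-aligned so that the direct sound arrives at index t0, and the function name is my own.

```python
import numpy as np

def support_parameters(ir, fs, t0=0):
    """Compute Gade's revised Support measures (STearly, STtotal, STlate)
    from a stage impulse response measured 1 m from the source.

    ir : 1-D impulse response, time-aligned so the direct sound
         arrives at sample index t0.
    fs : sampling rate in Hz.
    """
    def energy(t1_ms, t2_ms):
        # E(t1, t2): integral of squared pressure between the time limits
        i1 = t0 + int(t1_ms * 1e-3 * fs)
        i2 = t0 + int(t2_ms * 1e-3 * fs)
        return np.sum(ir[i1:i2] ** 2)

    ref = energy(0, 10)  # direct sound (0-10 ms) as the reference energy
    return {
        "STearly": 10 * np.log10(energy(20, 100) / ref),
        "STtotal": 10 * np.log10(energy(20, 1000) / ref),
        "STlate":  10 * np.log10(energy(100, 1000) / ref),
    }
```

For example, a synthetic response with a unit direct sound and two reflections of amplitude 0.1 at 50 ms and 200 ms yields STearly = STlate = -20 dB.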

1.2.2 Previous Research on Subjective Preferences in Stage Acoustics

Previous research consistently indicates musicians' preferences (including both instrumentalists and singers) regarding early reflections in support of their performance. Marshall and Meyer, in 1985, reported that singers prefer a strong presence of reverberation, while early reflections were only weakly preferred [10]. Noson, later in 2000, reported that singers preferred a longer delay time of reflections than musicians, due to the masking effect of bone-conducted sound inside the head [11]. He also discovered that a melisma singing style (non-plosive, non-fricative syllables) resulted in a shift in the preferred delay time of reflections [12], indicating that subjective preference depends on the content of the sound source. Noson's work also showed that singers' subjective preference for the delay time of a single reflection is proportional to the minimum effective duration (τe,min) of the running autocorrelation function (ACF) of the sound source. This is in direct agreement with Ando's earlier research on audiences' subjective preferences in concert halls [13]; Ando's proposal is thus believed to be applicable to musicians and singers as well. Noson's work strongly supports the view that the unique nature of self-voice perception is the most significant factor contributing to a different preference pattern for voice performers as compared to instrumentalists.
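The minimum effective duration of the running ACF mentioned above can be estimated numerically. The following is a simplified sketch, not the procedure of the cited works: it approximates τe of each analysis frame as the lag at which the envelope of the normalized ACF, tracked through its local maxima, first falls below 10% of the zero-lag value, and τe,min as the minimum over overlapping windows. The window lengths are assumptions.

```python
import numpy as np

def effective_duration(frame, fs, decay=0.1):
    """Approximate effective duration tau_e of one frame: the lag at
    which the envelope of the normalized ACF (tracked via local maxima
    of |phi(tau)|) first drops below `decay` of its zero-lag value."""
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf = np.abs(acf / acf[0])  # normalize so phi(0) = 1
    for i in range(1, len(acf) - 1):
        if acf[i] >= acf[i - 1] and acf[i] >= acf[i + 1] and acf[i] < decay:
            return i / fs
    return len(frame) / fs  # envelope never decayed within the frame

def min_effective_duration(signal, fs, win_s=2.0, hop_s=0.1):
    """tau_e_min: minimum effective duration of the running ACF over
    overlapping windows of a running speech signal."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    taus = [effective_duration(signal[i:i + win], fs)
            for i in range(0, len(signal) - win, hop)]
    if not taus:  # signal shorter than one window
        taus = [effective_duration(signal, fs)]
    return min(taus)
```

A sustained tone (highly self-similar) yields a long τe, while broadband noise yields a very short one, matching the intuition that tonal, melismatic material calls for longer preferred reflection delays.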

1.2.3 Self-Voice Perception

Perception of self-voice is constituted by air conduction (sound-wave propagation from mouth to ear) and bone conduction (vibrations transmitted from the voice organs to the ear inside the human head). The air conduction path mainly comprises the diffraction of sound emerging from the mouth opening, traveling across the surface of the head and into the ear canal. It also includes the transmission of vibrations of the vocal tissues from the surface of the head into the air,

and back to the ear canal. However, this latter component is believed to contribute negligibly to our hearing [14]. The role of bone conduction was not well understood until 1949, when Georg von Békésy [15] identified bone conduction and air conduction as the sound paths pertinent to perceiving one's self-voice. Estimates derived from his various investigations show that the perceived loudness of bone conduction is of the same order of magnitude as that of air conduction. According to a more contemporary study of bone conduction by Stenfelt and Goode [16], the bone conduction path can be divided into four components: (1) sound radiation into the ear canal, (2) inertial motion of the middle ear ossicles, (3) inertial motion of the fluid in the cochlea, and (4) compression and expansion of the bone encapsulating the cochlea. In most situations, the natural human voice occurs within an acoustic environment. With the inclusion of the acoustic space, the air conduction path can be further divided into two parts: direct air conduction (mouth to ear) and indirect air conduction (specular reflections from boundaries in the acoustic environment). Hence, the paths constituting the perception of self-voice can be summarized as:

Direct Air Conduction (Direct AC): from mouth to ear
Bone Conduction (BC): through the skull
Indirect Air Conduction (Indirect AC): reflections of the voice off room boundaries

*Direct AC, BC and Indirect AC are used throughout this thesis to denote the above auditory paths. Their relationship is represented graphically, in simplified fashion, in Figure 1.

Figure 1. Components of perception of self-voice

The spectral characteristics of the above pathways can be identified with human subjects. For Direct AC, the transfer function is usually obtained by measuring the sound pressure at microphones placed at the mouth reference point (MRP) and the ear reference point (ERP) in an anechoic environment, in which human subjects recite a selection of words effectively covering the vocal frequency range, as demonstrated by Pörschmann [17] as well as Williams and Barnes [18]. For BC, direct measurement cannot be applied; it is instead determined by measuring the masked threshold of pure-tone or narrow-band noise while the air-conducted sound is removed (or highly attenuated) [17]. In general, the threshold increases as frequency rises. Nevertheless, Stenfelt's research [19] found that sensitivity in loudness perception is higher in bone conduction than in air conduction, and that this trend becomes progressively more pronounced as the listening level increases, suggesting that the loudness contour of bone conduction differs from that of air conduction. (The air conduction loudness contour refers to the Fletcher and Munson curves of 1933 [20].)

To determine Indirect AC, a method similar to that for Direct AC can be used with human subjects. Binaural receivers can be fitted in the subject's ear canals. By measuring the impulse response between a source at the MRP and the binaural receivers in the ears, the transfer function of the room can be determined. The Indirect AC can then be isolated by properly removing the direct sound from the impulse response. For Direct AC and BC, an average result can be collected from a group of subjects in the laboratory. For Indirect AC, however, this would require bringing a large number of subjects to each acoustic condition under examination (i.e. different concert halls and different stage positions), which is impractical in most cases. An alternative approach is discussed in Section 2.1 of this thesis.

1.2.4 Experimental Setup in Previous Self-voice Auralization and Related Sound Field Simulation

In the subjective evaluation of a sound field for singers or actors, owing to the use of self-voice as the sound stimulus, one must create an auralization setup capable of reproducing, in real time, (1) the direct mouth-to-ear air conduction (if that path is obstructed by the reproduction system) and (2) the convolution product of the live signal and the impulse response of the sound field under test.

At the source pickup end, there is no consistency between previous experiments. A microphone is usually placed in front of the mouth, but the microphone type and the microphone-to-mouth distance vary greatly between studies. In Marshall and Meyer's setup [10], a cardioid microphone is placed 0.5 m away, pointing directly at the mouth; in Noson's setup [11], a small headset microphone (no polar pattern specified) is located 10 cm in front of and 5 cm below the mouth; and in Pörschmann's experiment [21], a Sennheiser KE4 omni-directional miniature microphone is positioned precisely at the mouth reference point (MRP), 40 mm in front of the lips, with a holding device attached to the headphone's harness.
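Isolating the Indirect AC by removing the direct sound from a measured impulse response amounts to a time-windowing operation. The sketch below is a minimal illustration, not the exact procedure used in this thesis; the window and fade lengths, the peak-based onset detection, and the function name are all assumptions.

```python
import numpy as np

def split_direct_indirect(ir, fs, direct_ms=5.0, fade_ms=0.5):
    """Split a mouth-to-ear room impulse response into its direct
    (Direct AC) and reflected (Indirect AC) parts by time-windowing.

    The direct sound is assumed to arrive at the strongest peak; a
    short raised-cosine fade-in on the indirect part avoids a hard
    truncation edge."""
    onset = int(np.argmax(np.abs(ir)))          # direct-sound arrival
    cut = onset + int(direct_ms * 1e-3 * fs)    # end of the direct window
    fade = int(fade_ms * 1e-3 * fs)
    win = np.ones(len(ir))
    win[:cut] = 0.0
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(fade) / fade))
    end = min(cut + fade, len(ir))
    win[cut:end] = ramp[:end - cut]
    indirect = ir * win            # reflections only
    direct = ir - indirect         # direct sound (and fade complement)
    return direct, indirect
```

The two parts sum back to the original response exactly, so the split introduces no energy error, only the assumption about where the direct sound ends.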
At the sound reproduction end, there are generally two different approaches: (1) spatially distributed loudspeakers for reproduction of delayed reflections, as found in Marshall and Noson; and (2) open-back circumaural headphones with compensation filters for binaural sound field simulation, as found in Pörschmann. The two reproduction approaches have their pros and cons. Using loudspeakers inherently creates a possible acoustic feedback path: for instance, a delayed reflection is picked up by the microphone and further reflections are then regenerated through the system. This leads to unintended stimuli and eventually affects the accuracy of the subjective test. Its advantage is that subjects are free of any body-worn attachments. However, a loudspeaker system requires comparatively more space and is usually not portable. The headphone system, on the contrary, is less demanding of laboratory space and is fairly portable and easy to set up. The disadvantage of a headphone system is the necessity of implementing a compensation filter for the direct mouth-to-ear air conduction path because, even when open-back circumaural headphones are used, the headphone enclosures inevitably impose a sound insertion loss between the mouth and the ears; high frequencies are usually attenuated. Moreover, the compensated (filtered) signal needs to be delayed before reaching the headphones so that it is in sync with the natural air conduction, to avoid a comb-filtering effect. Pörschmann [21] has shown that such a reproduction scheme is successful in achieving a certain degree of naturalness in a virtual auditory environment. Another issue with headphones is the occlusion effect and the return of radiation by the human head. The occlusion effect refers to the accentuation of sensitivity at bass frequencies when the ear canal is obstructed. Details of the occlusion effect can be found in the literature by Tonndorf [22] and Dean [23]. Open-back headphones can minimize this effect and have been accepted and used in experiments provided that there is enough padding between the headphone hardware and the test subject.


1.3 Thesis Outline


In Chapter 2, the proposed self-voice auralization system dedicated to the investigation of voice stage support is introduced. It includes the binaural impulse response acquisition system, the binaural auralization system and the subjective preference test procedures. A validation test with human subjects, in which subjective ratings of the naturalness of the system were obtained, is also reported. The results were compared to evaluations of setups in previous studies. The proposed system was used to investigate stage acoustic conditions of a playhouse. Chapter 3 presents the results and analysis of actors' subjective preferences for various stage locations and head orientations. Chapter 4 discusses the experimental results, followed by suggestions for future work in the field of voice stage support.


2. Self-voice Auralization System: Design and Implementation

2.1 Experimental Design Concept


As discussed in section 1.2.3, using human subjects to obtain averaged Indirect AC data is impractical when many acoustic spaces and conditions have to be examined. Portability and repeatability were therefore the first criteria of the current design. To achieve them, a dummy head was proposed to substitute for the human head in the measurement and acquisition process. An artificial mouth was used to represent the human voice source. Dunn and Farnsworth [24] showed that a person's own voice can be modeled by a source at the opening of the mouth. A similar approach has been taken and examined by Bozzoli, Viktorovitch and Farina [25]. The design consists of three basic components:
- Binaural Room Impulse Response (BRIR) Acquisition
- Binaural Real-time Auralization
- Experimental Procedures for Test Subjects
Since the experimental design was logically driven by the implementation of auralizing the conduction paths, the design of the real-time auralization is described first (section 2.3), followed by the BRIR acquisition (section 2.4) and the experimental procedures (section 2.5).

2.2 Measurement Setup


In this thesis, the acoustical measurement system was the Electronic & Acoustic System Evaluation & Response Analysis (EASERA) v1.0.60 software running on a Pentium M based PC, with a Sound Devices USBPre (USB-powered audio interface) for audio input/output. The sampling rate was set at 48 kHz and the bit depth at 16. The excitation signal was a pink sine sweep with 1 pre-send and 3 averages, customized in EASERA, unless otherwise stated. Two measurement microphones were used: an Earthworks M30 and a Countryman B3, both omni-directional (Figure 2 & Figure 3). The M30 was chosen for its high sound pressure capability, whereas the B3 was chosen for its compact size, for measurement positions the M30 is unable to reach. Their transfer functions were first measured in order to compensate for their difference in frequency characteristics when they were used simultaneously. A pink sine sweep was reproduced by a Yamaha MSP5 2-way studio monitor loudspeaker at a distance of 1 m in front of the microphones. The microphones' outputs were recorded using the above EASERA setup, and the transfer function was then obtained for use as an equalization function in subsequent calculations. Figure 4 shows the microphones' transfer function.
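The equalization function described above is essentially a transfer-function estimate between two simultaneously recorded signals. A minimal sketch of that computation (the regularized FFT division and the toy delayed-copy signals are illustrative assumptions, not the EASERA procedure):

```python
import numpy as np

def transfer_function(ref, meas, eps=1e-8):
    """Estimate H(f) = Meas(f) / Ref(f) between two simultaneously
    recorded signals, with a small regularization term so near-zero
    reference bins do not blow up the division."""
    n = max(len(ref), len(meas))
    R = np.fft.rfft(ref, n)
    M = np.fft.rfft(meas, n)
    return M * np.conj(R) / (np.abs(R) ** 2 + eps)

# Toy check: a copy of the reference attenuated by 6 dB and delayed by
# 20 samples should yield |H| close to 0.5 across the band.
ref = np.random.default_rng(0).standard_normal(4800)
meas = np.zeros_like(ref)
meas[20:] = 0.5 * ref[:-20]
mag = np.abs(transfer_function(ref, meas))
print(f"median |H| = {np.median(mag):.3f}")
```

In practice the resulting transfer function would be applied as an equalization curve to the B3 channel so that the two microphones can be used interchangeably.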

Figure 2 Earthworks M30 omni-directional microphone

Figure 3 Countryman B3 miniature omni-directional microphone


The loudspeaker used as the excitation source was an artificial mouth in a dummy head unless otherwise stated. The detailed structure of the dummy head is described in section 2.4. All measurements were conducted in a hemi-anechoic chamber unless otherwise stated.

Figure 4 Transfer function from Earthworks M30 to Countryman B3

2.3 Binaural Real-time Auralization System


2.3.1 System Overview

In this research, only Direct AC and Indirect AC need to be implemented. Since the human subject's own voice is used as the sound stimulus in real-time auralization, the bone conduction component is produced naturally inside the subject's head. The auralization system used a topology of two separate paths to model the direct air conduction (Direct AC) and indirect air conduction (Indirect AC). All auralizations were conducted in a hemi-anechoic chamber.

Figure 5 Binaural self-voice auralization system block diagram

Figure 5 shows the system block diagram. The setup used an Earthworks M30 omni-directional microphone to pick up the subject's voice. It was positioned at the mouth reference point, MRP (80 mm from the lips), separated from the subject's mouth by a metal-grille pop filter mounted 40 mm from the diaphragm so as to eliminate microphone diaphragm excursion caused by plosive sounds. Figure 6 shows the relationship between the subject's mouth, the pop filter and the microphone. The microphone signal was split in two and connected to input channels 1 & 2 (Ch1 & Ch2) on a Yamaha 02R digital mixing console, with identical and repeatable gain settings using the step-gain control on the pre-amplifiers. The gain setting was optimized to achieve a peak at -10 dBFS using a microphone calibrator producing a 1 kHz sine tone at 105 dBA. The resulting line-level signals were routed to two paths, Path 1 & Path 2, modeling the Direct AC and Indirect AC respectively.


Figure 6 Test subject with microphone and pop filter.

(Path 1) Through the channel-insert-send before the A/D stage on Ch1, the pre-amplified signal was connected to a Klark Teknik DN-716 single-channel digital delay unit (with built-in 16-bit A/D & D/A conversion) cascaded with a Rane PE-15 4-band analog parametric equalizer. The analog output was returned to the channel-insert-return of Ch1, going into the A/D conversion stage on the 02R. The delay unit and parametric equalizer were used to model the mouth-to-ear propagation delay and transfer function respectively. Their implementations are further described in section 2.3.2. The 02R on-board digital equalizer on Ch1 was used as the compensation filter for the insertion loss introduced by the auralization headphones. Details are described in section 2.3.4. (Path 2) Through Ch2, the signal was A/D converted and digitally routed, via an optical connection (TOSLINK), to a Digidesign TDM MIX digital audio workstation (a Motorola DSP-based PCI mixing engine running in a Macintosh dual-processor 500 MHz G4 computer) using a Digidesign ADAT Bridge digital interface. The workstation was running ProTools audio software with Waves IR1 (a dual-channel convolution reverberation plug-in), through which a BRIR can be loaded and convolved with the incoming signal. The output (convolved) signal was returned digitally to the ADAT IN (TAPE IN 1) on the 02R mixer via the ADAT Bridge. The setup described was used to model the Indirect AC, also called the room response. The convolution implementation is further described in section 2.3.3. Both returns from Path 1 and Path 2 were internally routed to the 02R's main stereo output in the digital domain. The stereo output of the 02R was connected to a Samson HP-5 headphone amplifier driving a pair of Audio-Technica ATH-A700 open-back headphones. A compensation filter was implemented using the on-board equalizer on the 02R stereo output channel to remedy the frequency anomalies induced by the headphones. It is described in section 2.3.4. The A/D & D/A conversions in the 02R and DN-716 are all 16-bit, 48 kHz. Each conversion stage introduces a processing delay of 0.02 ms. The processing delay of Path 1 measured 0.14 ms (A/D & D/A conversion and the filter network in the DN-716, plus conversion stages in the 02R) with the DN-716 at its lowest setting of 0.01 ms, whereas the processing delay of Path 2 measured 11.74 ms (A/D & D/A conversion plus the latency of IR1 [11.6 ms]) while IR1 was engaged and loaded with a 3-second-long unit-sample sequence. All levels were set at unity gain during the delay measurements.


2.3.2 Implementation of Direct Air Conduction Modeling

In Path 1, which is designed to model the Direct AC, the MRP-to-ERP transfer function (measured in section 2.6.2.3) is approximated using the PE-15 four-band parametric equalizer, the DN-716 digital delay and the internal equalizer on the 02R.

2.3.2.1 Determining the PE-15 filter setting

The frequency response of the MRP-to-ERP impulse response was approximated using an analog parametric filter. The precise settings of the filters were determined by overlaying the transfer function of the PE-15 against the MRP-to-ERP magnitude-spectrum plot. Using the Live mode in EASERA, the MRP-to-ERP plot was pre-loaded. Pink noise was fed to the PE-15 at line level, and its output was connected directly back to EASERA to obtain a live magnitude-spectrum plot while adjusting the PE-15 settings. Figure 7 shows an overlay magnitude-spectrum plot of the MRP-to-ERP impulse response and the determined filter settings in the PE-15 (see Table 1). The plot was generated in Matlab.


Figure 7 MRP-to-ERP Transfer Function & Direct AC Filter using PE-15 parametric equalizer (1/3 octave smoothing)

Table 1. Direct AC Filter Settings (Rane PE-15)

Direct AC Filter   Gain (dB)   Frequency (Hz)   Q
Band 1             +4.0        90               1.2
Band 2             -5.5        800              0.26
Band 3             -8.0        7k               0.45
Output Level       -18.0       -                -


2.3.2.2 Determining the DN-716 delay time

The initial arrival time of the MRP-to-ERP impulse response was implemented using a digital delay line. The mean MRP-to-ERP propagation delay in humans is 300 µs (0.3 ms), as reported by Pörschmann [17]. Thus, by subtracting the 0.14 ms processing delay, the delay time to be inserted is 0.16 ms (corresponding to a panel display of 0.17 ms on the DN-716). The MRP-to-ERP transfer function measurement is described in section 2.6.2.3. Various delay times are evaluated in section 2.7.1.
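The delay arithmetic above can be restated explicitly (the values are from the text; the conversion to samples at 48 kHz is an added illustration):

```python
# Delay to dial into the DN-716: nominal mouth-to-ear propagation delay
# minus the measured Path 1 processing delay.
fs = 48000                  # system sampling rate, Hz
nominal = 0.30e-3           # mean MRP-to-ERP propagation delay, s
processing = 0.14e-3        # measured Path 1 processing delay, s
insert_delay = nominal - processing
print(f"{insert_delay * 1e3:.2f} ms = {insert_delay * fs:.1f} samples at 48 kHz")
```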

2.3.2.3 Determining the 02R parametric equalizer setting

The headphones used in auralization introduced an insertion loss in the Direct AC path. As a result, Path 1 essentially performs both Direct AC modeling and compensation of the insertion loss (CIL) induced by the headphones. The CIL filter was implemented using the digital parametric equalizer on Channel 1 of the 02R mixer. Two microphones, the M30 and B3, were first calibrated for identical gain and then used to measure the insertion loss of the headphones, as shown in Figure 8.

Figure 8 Setup for measuring the headphone's insertion loss using an isolation tube.


Figure 9 Isolation tube used in insertion loss measurement

An isolation tube (see Figure 9) was built to measure the insertion loss of the headphones. The tube was 300 mm in length and 250 mm in diameter. It had a 50 mm thick soft fiberglass outer shell with a thin layer of cotton lining the inner wall. The headphone was carefully mounted to the tube opening and sealed with rubber to close any air gaps. A Yamaha MSP5 2-way loudspeaker was used to generate the measurement signal, while an M30 microphone was positioned close to the headphone enclosure outside the tube and a B3 microphone was mounted 10 mm away from the headphone transducer inside the tube. The transfer function was recorded using EASERA. The inverted magnitude-spectrum plot represents the compensation filter. The internal digital parametric equalizer in the 02R was used to approximate the compensation filter response using an adjustment method similar to that described above for the PE-15 (section 2.3.2.1). To assure unity gain through the 02R during filter adjustment, a sine tone was fed from EASERA and split in two: one feed was routed back to EASERA Channel 1, and the other was connected to the 02R and returned to EASERA Channel 2. The CIL filter implemented in the 02R is shown in Table 2.
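The inversion step that produces the compensation filter can be sketched as follows. The boost clamp is an added safeguard of my own, not part of the thesis procedure; it prevents deep measurement notches from demanding unbounded gain:

```python
import numpy as np

def compensation_filter_db(insertion_loss_db, max_boost_db=12.0):
    """Invert a measured insertion loss (in dB per frequency point) to
    obtain a compensation filter, clamping the gain so deep notches in
    the measurement do not demand unbounded boost."""
    comp = -np.asarray(insertion_loss_db, dtype=float)
    return np.clip(comp, -max_boost_db, max_boost_db)

# Example: a 3 dB loss becomes a 3 dB boost; a 20 dB notch is clamped.
loss = [0.0, -3.0, -20.0, 2.5]
comp = compensation_filter_db(loss)
print(comp.tolist())
```

In the thesis the continuous inverted curve was approximated by hand with a few parametric EQ bands rather than applied point by point.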


Table 2. CIL filter settings (digital parametric equalizer on 02R)

CIL filter   Gain (dB)   Frequency (Hz)   Q
Band 1       +2.0        4k               0.2
Band 2       +3.0        10k              0.1
Band 3       -           -                -
Band 4       -           -                -

2.3.3 Implementation of Indirect Air Conduction Modeling

In Path 2, which modeled Indirect AC, real-time binaural convolution was applied using the Waves IR1 Convolution Reverb plug-in (see Figure 10). In order to align timing correctly, the room impulse response to be convolved was first trimmed to eliminate the direct sound. The length of the trim was determined by the propagation delay of Path 2, which measured 11.74 ms (see Figure 11). A Hann window was applied to the trimmed impulse response before importing it into IR-1. A shortcoming resulting from this latency is the inability to reproduce the room response between the Direct AC and the early sound field up to 11.74 ms (approximately 13 feet of travel distance, roughly the mouth-to-floor round trip for a 6-foot-tall person), which may include diffraction from the subject's own body and the first back-scattered sound from the floor or other nearby boundaries. Nevertheless, the focus of the current research is stage acoustics, which seldom involves boundaries in close proximity (at least not in the case of this thesis). Also, there is no direct specular reflection path between the mouth and the floor. The back-scattered rays from the floor were assumed to have minimal influence on the perception of self-voice.
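One plausible implementation of the trim-and-window step is sketched below. The 1 ms half-Hann fade-in length is an assumption; the thesis does not specify the window parameters:

```python
import numpy as np

def trim_brir(brir, fs, latency_s, fade_ms=1.0):
    """Drop the first latency_s seconds of a BRIR (the portion already
    covered by Path 1 plus the system latency) and apply a short
    half-Hann fade-in so the cut does not introduce a click."""
    start = int(round(latency_s * fs))
    trimmed = np.array(brir[start:], dtype=float)
    n = int(fade_ms * 1e-3 * fs)
    fade = 0.5 * (1.0 - np.cos(np.pi * np.arange(n) / n))
    trimmed[:n] *= fade
    return trimmed

fs = 48000
brir = np.ones(fs)                          # dummy 1-second impulse response
out = trim_brir(brir, fs, latency_s=11.74e-3)
print(len(brir) - len(out), "samples removed")
```

At 48 kHz the 11.74 ms latency corresponds to 564 samples removed from the head of the impulse response.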


Figure 10 Waves IR1 Convolution Reverb, loaded with a 3-second unit sample sequence

Figure 11 Impulse response trimming before importing to IR-1


2.3.4 Implementation of Headphone Equalization

The frequency response of the headphones was measured in a hemi-anechoic chamber. An M30 microphone was positioned 10 mm in front of the headphone's transducer. The impulse response was recorded using EASERA. The internal digital parametric equalizer on the 02R was used to approximate the headphone compensation filter using the method described in section 2.3.2.1. The headphone's response is plotted against the inverted compensation filter in Figure 12, and the filter settings are shown in Table 3.

Figure 12 Headphone response and compensation filter (02R master output)


Table 3. Headphone compensation filter settings (02R master output)

Headphone compensation   Gain (dB)   Frequency (Hz)   Q
Band 1                   +8.0        40               0.1
Band 2                   +2.5        185              0.3
Band 3                   -4.0        1.7k             1.0
Band 4                   -2.5        9.1k             1.2

2.4 BRIR Acquisition System


In order to achieve repeatability in binaural acquisition, a head simulator (sometimes called a dummy head) was used. In the particular interest of this thesis, the sound source and receivers correspond to the human mouth and ears; thus microphones and a loudspeaker were installed inside the dummy head. The heart of the design was a Brüel & Kjær Type 5930 head and torso simulator (HATS). The head geometry theoretically represents an average of human head physical features, in compliance with ITU-T Rec. P.58, IEC 60959 and ANSI S3.36-1985. It was retrofitted with a loudspeaker unit of 50 mm diameter inside the mouth cavity as an artificial mouth. The microphones mounted inside the HATS were Brüel & Kjær Type 4010 omni-directional transducers. The grilles of the capsules were aligned with the openings of the ear canals as binaural receivers (see Appendix for the free-field microphone specifications). The structure of the HATS is shown in Figure 13, and the position of the artificial mouth is illustrated in Figure 14. Detailed dimensional information on the HATS can be obtained from the Brüel & Kjær website (www.bksv.com).


Figure 13 Frontal plane section of HATS, showing binaural microphones and the related fittings. (Adapted from the manufacturer's manual)


Figure 14 Median plane section of HATS, showing the artificial mouth

To validate the representativeness of the HATS, a series of verification tests was conducted to examine the binaural microphone characteristics, the artificial mouth frequency response and the MRP-to-ERP transfer function (see section 2.6). In BRIR acquisition, the HATS was supported by a microphone stand so that its height measured 5 ft 7 in (approximately 1.7 m), which is about the mean height of men and women aged 20 to 74 as reported by the U.S. Department of Health and Human Services in 2004 [26]. The binaural microphones inside the HATS were connected to the EASERA measurement system during acquisition. Before data acquisition, their gain settings were optimized to -10 dBFS using a microphone calibrator producing a 1 kHz sine tone at 105 dBA. Since the binaural microphones cannot be easily removed from the HATS for calibration, and any such repetitive preparation could cause microphone position misalignment, a compromise approach was adopted: the calibrator was positioned as close to each binaural microphone as possible while remaining on axis. The artificial mouth was driven by a Samson Servo 170 power amplifier, which has a published linear frequency response between 20 Hz and 20 kHz. Figure 15 shows the BRIR acquisition system block diagram.

Figure 15 Binaural room impulse response acquisition system block diagram, showing how the binaural ears and artificial mouth of the HATS are connected

As described in section 2.3.3, the binaural room impulse response was trimmed such that the direct sound (including any contribution from internal (bone) conduction and direct air conduction) in the BRIR was not included in the convolution.


2.5 Experimental Procedures in Subjective Test

2.5.1 Test Subject Conditioning

In the current research, it was inevitable to use the test subjects' own voices in real-time auralization; the sound stimuli thus become highly unpredictable and may lead to erroneous results due to variance in the subjects' condition, both physiological (e.g., vocal fatigue) and psychological (e.g., personal emotion). In order to minimize these experimental errors, a set of experimental procedures for human subjects was adapted and modified from the method used by Jónsdóttir et al. [27]. In each measurement, subjects were asked to read a piece of text at least twice before subjective scoring. For each subject, the same set of tests was repeated 6 times within a period of 21 days; three trials were in the morning/midday, and three others were in the late afternoon/early evening. Before each experimental session, subjects were asked to warm up their voice to performing condition (which takes about 10-20 minutes). Subjects also confirmed that they were not under the influence of drugs or alcohol. The above procedures were expected to lessen the impact of subjects' individual conditions over a period of time.

2.5.2 Use of Dramatic Text in the Study of Actors

The objective of the thesis is to investigate voice support in theaters, and so the subjects were professionally trained actors. Instead of sentences from the Harvard Psychoacoustic Sentence Lists (often recommended for psychoacoustic research on speech), a short edited excerpt from Shakespeare's play Hamlet was chosen for its well-known dramatic expression and its inclusion of most vowels and consonants in English.


To be, or not to be; that is the question; To die, to sleep; no more; and by a sleep to say we end the heart-ache and the thousand natural shocks.

The reason for using a dramatic text is that actors seldom read sentences that carry no literal meaning. The Harvard sentences are far from theatrical reality and are considered unrepresentative of acting in the theater. The second argument in this particular thesis is that actors always know what they are going to say (as they have rehearsed before the performance), and because the current research is about self-perception, there is no issue of intelligibility of unexpected words or vowel sounds from an unknown sound source. Subjects were given a sample recording of the text prior to each test session to get accustomed to the rhythm and speed of the speech, and the entire test was recorded and analyzed for pace afterwards.

2.5.3 Verifying the Consistency of Self-voice Stimuli by Monitoring the Pace of Speech

A method of pace analysis was developed with reference to Ando's work on the relationship between subjective preference and objective parameters. Ando found that the minimum effective duration (τe,min) of the running autocorrelation function (ACF) of the sound source is proportional to the most preferred delay of a single reflection [28]. Ando's results suggest that the faster the tempo of the stimuli, the lower the resulting τe,min and thus the shorter the preferred delay of the first reflection. In the current thesis, this parameter was used as a reference for the pace of speech.


ACF is defined by:

\Phi_p(\tau) = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{+T} p'(t)\, p'(t+\tau)\, dt

where p'(t) = p(t) * s(t), and s(t), corresponding to ear sensitivity, is chosen as the impulse response of an A-weighting filter, as suggested by Ando. The normalized ACF is expressed as:

\phi_p(\tau) = \frac{\Phi_p(\tau)}{\Phi_p(0)}

The effective duration of the envelope of the normalized ACF is represented by τe, defined as the delay at which the initial decay of the envelope reaches -10 dB (the ten-percentile delay), obtained by linear regression. τe is computed at 100 ms intervals of the running source signal, and the minimum value is denoted τe,min. Figure 16 shows the values of τe against the elapsed time for two speech recordings. The τe,min for samples 1 and 2 are 0.39 s and 0.37 s respectively, indicating that a slight variation in the pace of speech does not change the overall temporal characteristics.


Figure 16 Example plot of the effective duration of the running autocorrelation function for two recordings of the same text at different paces, showing the use of τe,min as a temporal reference for monitoring subjective testing.

From Figure 16, it is observed that there is a slight shift of the peaks, which indicates the different paces of speech in the two samples (red is faster and blue is slower). Through a number of trials, it was shown that τe,min can be used as a robust quantifier for verifying the speech pace in subjective testing. By calculating τe,min on each trial, the pace was controlled by maintaining a deviation of less than 5% from the τe,min value of the sample recording. This procedure ensured that all subjects gave their preference scores under the same condition, within a fixed tolerance.
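A minimal Python sketch of the τe estimate is given below. The 50 ms regression range, the peak-picked envelope, and the omission of the A-weighting pre-filter used in the thesis are simplifying assumptions of this sketch:

```python
import numpy as np

def effective_duration(frame, fs, fit_range_s=0.05):
    """Estimate tau_e for one frame: the delay at which a linear fit to
    the initial decay (in dB) of the normalized |ACF| envelope reaches
    -10 dB. The envelope is taken from local maxima of |ACF|."""
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    phi = np.abs(acf / acf[0])                  # normalized |ACF|
    n = int(fit_range_s * fs)
    peaks = [k for k in range(1, n - 1)
             if phi[k] >= phi[k - 1] and phi[k] >= phi[k + 1]]
    lags = np.array(peaks) / fs
    level_db = 10.0 * np.log10(phi[peaks])
    slope, intercept = np.polyfit(lags, level_db, 1)
    return (-10.0 - intercept) / slope

def tau_e_min(signal, fs, frame_s=0.1):
    """tau_e,min: the minimum tau_e over running 100 ms frames."""
    hop = int(frame_s * fs)
    return min(effective_duration(signal[i:i + 2 * hop], fs)
               for i in range(0, len(signal) - 2 * hop, hop))

# Synthetic check: an exponentially decaying 440 Hz tone whose ACF
# envelope decays as exp(-tau/a), so tau_e should be near a*ln(10).
fs, a = 8000, 0.05
t = np.arange(int(0.4 * fs)) / fs
frame = np.exp(-t / a) * np.sin(2 * np.pi * 440 * t)
tau_e = effective_duration(frame, fs)
tau_min = tau_e_min(frame, fs)
print(f"tau_e = {tau_e:.3f} s, tau_e,min = {tau_min:.3f} s")
```

For the synthetic tone with a = 0.05 s, both values should land near 0.115 s; on speech, the running minimum over frames is the quantity compared against the 5% tolerance.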


A total of 15 subjects participated in this project, but 2 of them were dropped because they did not manage to attend all required tests. The final 13 subjects included 6 Caucasians, 2 Asians, 3 Black Americans and 2 Latin Americans. All subjective tests in this thesis followed the above scheme unless otherwise stated.

2.6 HATS Verification Tests


All evaluation measurements of the HATS were conducted in an anechoic chamber at the General Electric Laboratory, Niskayuna, NY (see Figure 17). The volume of the chamber was 5100 cu. ft. The background noise was rated at under 20 dBA. Room temperature and humidity were measured before and after each experimental session and showed negligible variation.

Figure 17 Verification Test of HATS in anechoic chamber at General Electric Laboratory (NY).


2.6.1 Binaural Microphones

The microphones at the two ears of the HATS were first calibrated using a sine tone calibrator (1 kHz at 105 dBA) to achieve identical gain. A dodecahedron loudspeaker was positioned 1 meter away, directly in front of the HATS mouth opening. The frequency responses were averaged over 3 measurements taken with different orientations of the dodecahedron loudspeaker, to minimize any potential error caused by its directional characteristics in the near field. Figure 18 shows a frequency response overlay of the two binaural microphones.

Figure 18 Frequency response comparison between HATS binaural microphones

The result showed acceptable differences between the two binaural microphones. The peaks and notches are due to the head-related transfer function and the pinnae of the HATS.

2.6.2 Artificial Mouth

In BRIR acquisition, the loudspeaker unit inside the HATS was used to generate excitation signals in the HATS mouth cavity; it is thus called the artificial mouth. Since the loudspeaker was retrofitted, no specifications for the artificial mouth were readily available. Its directivity, on-axis response and MRP-to-ERP transfer function were therefore measured.

2.6.2.1 Directivity

The directivity measurements were conducted at 15-degree resolution on the horizontal plane (full 360 degrees) and the vertical plane (from -45 to 135 degrees). The HATS was left stationary throughout the measurement session, while the microphone position was manually adjusted for each measurement.

Figure 19 Voice directivity of HATS (a) horizontal plane (b) vertical plane, in 4 octave bands (250 Hz, 500 Hz, 1 kHz & 2 kHz)

The directivity was plotted using Matlab and compared to data from previous artificial heads and mean values of human voices, as shown in Figure 20.

Figure 20 Comparison of voice directivity, in 3 octave bands (500 Hz, 1 kHz & 2 kHz)

The current HATS was found to be similar to the human mean values reported in the literature. Although this does not directly imply a higher reliability of the experimental results, it does suggest that the HATS is a good representation of a human voice source.

2.6.2.2 On-axis Frequency Response

The on-axis frequency response of the HATS was measured at 1 m directly in front of the artificial mouth. Figure 21 shows the frequency response. It was rather erratic, which may have resulted from the construction of the HATS: the HATS itself had inherent resonance characteristics, and the head cavity was not dampened by any material.


The prominent peaks in the high-mid frequency range were believed to be related to the resonant frequencies of the HATS.

Figure 21 On-axis frequency response of HATS artificial mouth. Overlays of no-smoothing and 1/6-octave smoothing.

Two resonant peaks were observed at 6.6 kHz and 7.6 kHz, which roughly correspond to wavelengths of 0.05 m and 0.045 m. They were believed to be related to the position of the loudspeaker unit with respect to the HATS internal cavity. The current artificial mouth response was compared to another commercially available model, the B&K Type 4128C artificial mouth (see Figure 22).
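The correspondence between the resonant peaks and the quoted dimensions follows from λ = c/f. The speed of sound c = 343 m/s is a standard value assumed here; the thesis does not state one:

```python
# Wavelengths of the two observed resonant peaks.
c = 343.0                       # speed of sound in air, m/s (assumed)
for f in (6600.0, 7600.0):
    print(f"{f / 1000:.1f} kHz -> {c / f:.3f} m")
```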


Figure 22 Frequency Response of B&K Artificial Mouth Type 4128C (adapted from the manufacturer's datasheet)

The frequency response of a loudspeaker unit coupled to the HATS is expected to display anomalies due to interaction with the physical geometry; a flat frequency response is hardly achievable in such devices. The assumption here is that, given the similar voice directivity analyzed in section 2.6.2.1, the response is acceptable for the current study.

2.6.2.3 MRP-to-ERP Transfer Function

Two microphones were used in this measurement: an Earthworks M30 omni-directional microphone and a Countryman B3 miniature omni-directional microphone. The two microphones were first calibrated by measuring their transfer function (see section 2.2), which was used as an equalization function in the measurement. The M30 and B3 were placed at the MRP and ERP respectively. The B3 capsule was suspended such that there was no physical contact with the HATS, in order to eliminate any internal vibration transmission from the artificial mouth to the microphone via the HATS surface. The transfer function was measured in EASERA and compensated with the equalization described above. The result is shown in Figure 23.

Figure 23 MRP-to-ERP transfer function of HATS (1/3-octave smoothing)

The above plot was smoothed and zoomed in order to compare with the averaged results for human subjects in Pörschmann's study, as shown in Figure 24.


Figure 24 Averaged frequency response of MRP-to-ERP (Direct AC) of 18 human subjects. The grey area marks the standard deviation. (Adapted from Pörschmann [2000])

The HATS frequency response was considered to fall roughly within the human averaged values, except that it was slightly lower in the frequency range between 300 Hz and 2 kHz.

2.7 Subjective Test on Naturalness of the Auralization System


Considering the variation among individual human heads, the validity of the auralization setup can best be established through subjective tests. The aim of this evaluation is to find the optimal delay time and filter implementation for Direct AC, as explained in sections 2.3.2.1 and 2.3.2.2. In the evaluation of the system's naturalness, Indirect AC (as stated in section 1.2.3) was ignored.


2.7.1 Evaluation on Naturalness of CIL Filter Delay Time

The delay time implemented for Direct AC represents the propagation delay from the mouth opening to the ear canal entrance. It is critical in the auralization process because the delayed voice reproduced by the headphones is acoustically combined with the direct sound traveling from the mouth through the open-back headphones before entering the subject's ear canal. An improper delay time would induce perceivable echoes or a comb-filtering effect. Since everyone's head geometry and facial features differ, the delay time may vary. It was assumed that the comb filtering is inaudible if the separation of arrival times is short enough that the lowest null frequency of the comb filter lies beyond the audible spectrum; for instance, a time delay of 20 µs would produce comb filtering starting at a frequency of 25 kHz. This test was to find the most natural and representative delay time to be used in auralization without compromising the perceptual naturalness of self-voice. The tests were conducted with 13 subjects following the procedures stated in section 2.5, using the auralization setup in section 2.1. During the test, subjects were exposed to a random sequence of delay times which included 3 repetitions of the 8 delay settings. The random sequence was generated by Matlab individually for each test set. For each setting, subjects were asked to read the given text in full twice, once with headphones and once without. Subjects were allowed to repeat reading and listening until they were comfortable comparing the sound of their own voice with headphones to that without headphones (the reference). Then, the subjects rated the degree of similarity on a 7-point category scale from 1 to 7 (1 being very dissimilar and 7 being very similar) for the given setting and notified the experimenter via the microphone before changing to the next delay setting.
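The audibility criterion above can be checked with a one-line calculation. The 1/(2Δt) formula for the first null of a unity-gain delayed addition is standard acoustics; the function name is mine:

```python
def first_comb_null_hz(delay_mismatch_s):
    """First null of the comb filter formed when a sound combines with
    an equal-amplitude copy of itself delayed by delay_mismatch_s."""
    return 1.0 / (2.0 * delay_mismatch_s)

# A 20 microsecond mismatch pushes the first null to 25 kHz,
# beyond the audible range, as noted in the text.
print(f"{first_comb_null_hz(20e-6):.0f} Hz")
```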
The aims of this evaluation test were to find the optimal delay time for the CIL filter in auralization and to validate the effectiveness of the current system by comparison with results from previous experiments. Thus, the delay times evaluated were specifically chosen to match those found in the literature, allowing a direct comparison (see Table 4).

Table 4. Delay times in the evaluation test of the naturalness of the Direct AC insertion-loss compensation filter (CIL filter). *The current system has a processing delay of 0.14 ms with a setting of 0.01 ms on the DN-716. **Pörschmann's tested delay times were based on taps of a 48-kHz Tucker-Davis DSP system; they are expressed here in milliseconds for ease of comparison.

Direct AC Delay Time   DN-716* Delay Setting   Pörschmann's Test [2001]**

0.14 ms                0.01 ms                 -
0.30 ms                0.17 ms                 0.30 ms (0 taps)
0.47 ms                0.34 ms                 0.47 ms (8 taps)
0.63 ms                0.50 ms                 0.63 ms (16 taps)
0.97 ms                0.84 ms                 0.97 ms (32 taps)
1.63 ms                1.50 ms                 1.63 ms (64 taps)
2.30 ms                2.17 ms                 2.30 ms (96 taps)
2.97 ms                2.84 ms                 2.97 ms (128 taps)

The same test was repeated 6 times for each subject, totaling 144 data points per subject. All results were combined and compared with Pörschmann's published results, as shown in Figure 25.
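The per-setting mean scores and 95% confidence intervals plotted in Figure 25 can be computed with a short sketch like the following. It uses a normal-approximation interval; the ratings shown are hypothetical, not data from the experiment.

```python
import math
from statistics import mean, stdev

def mean_ci95(scores):
    """Mean and half-width of a normal-approximation 95% confidence
    interval (1.96 * standard error) for a list of ratings."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return m, half

# Hypothetical 1-7 similarity ratings for one delay setting:
ratings = [6, 5, 7, 6, 6, 5, 7, 6, 5, 6, 7, 6, 6]
m, h = mean_ci95(ratings)
print(f"mean = {m:.2f} +/- {h:.2f}")
```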


Figure 25 Subjective evaluation of the naturalness of the delay time in Direct AC auralization. Mean scores and 95% confidence intervals across all subjects.

The above comparison shows that the current evaluation exhibits a preference trend similar to that observed in Pörschmann's results: subjects tended to prefer lower delay times. The current experiment included one additional delay setting (0.14 ms), which is less than the nominal MRP-to-ERP propagation delay of 0.3 ms. Interestingly, the most preferred delay in the current setup is 0.14 ms. This might result from the difference in the definition of the MRP in the current research (MRP80, 80 mm from the lips) as compared to Pörschmann's (MRP40, 40 mm from the lips).


2.7.2 Evaluation on Naturalness of CIL Filter Level

This test was intended to find the sound level of the CIL-filtered signal at the headphones that achieves the best realism. The tests were conducted with 13 subjects following the procedures stated in section 2.5, using the auralization setup described in section 2.1. Six levels were under test: -Inf, -6 dB, -3 dB, 0 dB, +3 dB, and +6 dB, where 0 dB indicates unity gain of the CIL filter and -Inf refers to the absence of the filter compensation path in the auralization. Positive dB values represent a gain in the filtered compensation, whereas negative values represent an attenuation. During the test, subjects were exposed to a random sequence of sound levels comprising 3 repetitions of the 6 settings. The random sequence was generated in Matlab individually for each test set. For each setting, subjects were asked to read the given text in full twice, once with headphones and once without. Subjects were allowed to repeat reading and listening until they were comfortable comparing the sound of their own voice with headphones to that without headphones (the reference). The subjects then rated the degree of similarity on a 7-point category scale from 1 to 7 (1 being very dissimilar and 7 being very similar) for the given setting and notified the experimenter via the microphone before changing to the next CIL-filter level setting. The same test was repeated 6 times for each subject, totaling 108 data points per subject. All results were combined and compared with Pörschmann's published results, as shown in Figure 26. The results show that the current evaluation has overall lower scores in all settings but displays a trend similar to that observed in Pörschmann's setup. Subjects agreed that the nominal level setting for the CIL filter is the most natural, supporting the validity of the implementation.


Figure 26 Subjective evaluation of the naturalness of the CIL filter level in Direct AC auralization. Mean scores and 95% confidence intervals across all subjects.

2.8 Discussion
Overall, the current auralization setup gave satisfactory results in the evaluation of the naturalness of Direct AC. The CIL filter level was determined to be at unity gain, and the delay setting was chosen to be 0.01 ms on the DN-716 (resulting in 0.14 ms of total delay).


3. Subjective Preference Tests on Stage Acoustic Conditions for Actors

3.1 Introduction
The focus of the current study is voice stage support in proscenium theaters. This type of theater design creates a special acoustical phenomenon: a proscenium theater is characterized by the separation of a stage house from the main auditorium by a large opening called the proscenium. The two acoustic volumes (stage and auditorium) can have very different acoustical properties depending on the design. The stage house usually has a high ceiling for counterweight fly systems, and thick curtains are hung above and beside the stage area to mask off-stage areas, lighting instruments, and unused scenery pieces. The stage is often not specifically designed for acoustical purposes, whereas the auditorium is usually designed to optimize the audience's aural experience. The interaction of these two volumes is known in architectural acoustics as the coupled-space phenomenon.

Considering the actors' locations and movements on stage, they are mostly within the stage volume. As they move closer to the audience and reach the proscenium or the forestage, they move into the aperture of the coupled spaces where the stage volume and the auditorium meet. Unlike an audience member, actors are constantly moving and turning on stage while acting, so their experience of the sound field changes drastically from moment to moment. As a starting point for research in this field, the current study investigates acoustic conditions on stage under the assumption that the actor is stationary. Actors sometimes speak from a fairly fixed location for a sustained period, for example when delivering monologues. The most common positions are the center of the forestage and the central area of the main stage. These two positions were studied in this research, and two more positions off to the side were chosen as well. Ideally, more positions would be studied, but the limiting factor was the time required for each subject to go through all acoustic conditions without suffering vocal fatigue.


Figure 27 Architectural plan of the main space at the RPI Playhouse. Dimensions in inches; blue lines are dimensional guides. (CAD drawing courtesy of RPI Building Management)

3.2 Impulse Response Acquisition


Binaural room impulse responses were collected while the playhouse was unoccupied. The auditorium of the 200-seat playhouse has an area of approximately 2650 sq. ft. (246.1 sq. m.) and a volume of around 39,750 cu. ft. (1125.5 cu. m.). The stage is located opposite the playhouse entrance, separated from it by a proscenium with dimensions of 31.6 ft by 14 ft (9.6 m by 4.2 m). The stage level is 5 feet higher than the auditorium floor; the stage has an area of roughly 1260 sq. ft. (117 sq. m.) and a volume of 25,200 cu. ft. (713.6 cu. m.). Figure 27 shows the detailed dimensions. All seats were removed and the stage was cleared during the measurement sessions. The stage was set to a standard configuration of masking flats, with borders hung above the stage.


Measurements were taken at four stage locations using the setup described in section 2.4. The relative positions of the stage locations are shown in Figure 28. The acronyms used stand for down stage center (DSC), down stage right (DSR), center stage center (CSC), and center stage right (CSR). It should be noted that stage right is defined from the actor's perspective when he or she is facing the audience. The four stage locations were 3 meters apart from each other.

Figure 28 Stage locations where BRIR was measured. Dashed line labeled "CL" is the center line of the stage across the proscenium.

At each stage position, the HATS was adjusted manually to three different head orientations with the aid of a rotation-angle guide marked on top of the HATS. The relative angles of head rotation are shown in Figure 29. The height of the HATS was maintained at 5 feet 7 inches (about 1.7 meters) above the stage floor, as described in section 2.4.

Figure 29 Top view of the HATS showing 3 different head orientations in binaural room impulse response acquisition.

Due to the background noise level at the playhouse, a pink (logarithmic) swept-sine excitation signal of length 21.8 s was used to achieve a better signal-to-noise ratio (SNR). The average SNR across all measurements was 52 dB. Three averages were taken for each measurement.
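A pink (logarithmic) swept sine of the kind used for these measurements can be sketched as follows. The parameter values (20 Hz to 20 kHz, 48 kHz sample rate) are illustrative assumptions, not the exact settings of the measurement system; the phase formula is the standard exponential-sweep form, whose spectrum falls at 3 dB/octave (pink).

```python
import math

def log_sweep(f1: float, f2: float, duration: float, fs: int):
    """Exponential (pink-spectrum) sine sweep from f1 to f2 Hz over
    `duration` seconds, sampled at `fs` Hz. The instantaneous frequency
    is f1 at t=0 and f2 at t=duration."""
    k = math.log(f2 / f1)
    n = int(duration * fs)
    return [math.sin(2.0 * math.pi * f1 * duration / k *
                     (math.exp(k * (i / fs) / duration) - 1.0))
            for i in range(n)]

# One second of sweep for illustration (the actual excitation was 21.8 s):
sweep = log_sweep(20.0, 20000.0, 1.0, 48000)
```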

3.3 Subjective Test Design


3.3.1 Preference ratings from paired comparison

In section 2.7, a 7-category rating method was chosen for a specific reason: to generate results compatible with previously published data. Nevertheless, that method suffers from uncertainty in absolute judgments between subjects. Paired comparison helps reduce both absolute and relative judgment errors. In this chapter, different combinations of acoustic conditions were presented in pairs to the subjects in randomized order. Each of the paired comparison tests in this chapter involved four acoustic conditions, giving six pair-combinations: A-B, A-C, A-D, B-C, B-D, and C-D. Each pair was presented twice, once in forward order and once in reversed order, for a total of 12 randomized pair-conditions per subject in each test.


(The author initially proposed 4 repetitions of each pair, resulting in 24 test conditions, but test subjects reported vocal fatigue and loss of concentration after a certain period of time, usually 30 minutes. Thus, to balance the reliability of the test results against the subjects' personal comfort, the count was reduced to 12 pair-conditions in the stage preference testing.)
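The randomized presentation order described above (six pairs of four conditions, each pair once forward and once reversed) can be sketched as follows. The original sequences were generated in Matlab, so this Python version is only an illustration.

```python
import random
from itertools import combinations

def paired_comparison_order(conditions=("A", "B", "C", "D"), seed=None):
    """Each unordered pair of conditions is presented once in forward and
    once in reversed order, then the 12 trials are shuffled."""
    pairs = []
    for a, b in combinations(conditions, 2):
        pairs.append((a, b))  # forward order
        pairs.append((b, a))  # reversed order
    rng = random.Random(seed)
    rng.shuffle(pairs)
    return pairs

order = paired_comparison_order(seed=1)
```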

3.3.2 Test procedures

In each comparison, the pair of conditions (Indirect AC) was pre-loaded into preset A and preset B of the IR-1 (see Figure 10; note the "Load A:" button below the red circle labeled RTAS). The subjects were able to compare the pair-conditions by asking the experimenter to A/B swap the presets, and they were allowed to spend as much time as they wished on each preset (condition). After auditioning both, the subjects were asked, "Which did you prefer?" The preferred condition was scored as +1 and the other as -1. Preference scores were summed and normalized: a score of 1.0 indicates complete unanimity of preference for an acoustic condition, 0.0 an equal number of positive and negative judgments, and -1.0 complete agreement on a negative preference. After each complete set of paired comparisons, the subjects were allowed to rest for 5 minutes before the next set of tests began. All other procedures followed section 2.5.
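The +1/-1 scoring and normalization described above can be sketched as follows. The trial data are hypothetical; each `trials` entry records the two conditions presented and which one was preferred.

```python
def preference_scores(trials):
    """trials: list of (condition_a, condition_b, winner) tuples.
    Each appearance of a condition scores +1 if it was preferred in that
    trial and -1 otherwise; totals are normalized by appearance count,
    giving a score in [-1.0, +1.0] per condition."""
    totals, counts = {}, {}
    for a, b, winner in trials:
        for c in (a, b):
            counts[c] = counts.get(c, 0) + 1
            totals[c] = totals.get(c, 0) + (1 if c == winner else -1)
    return {c: totals[c] / counts[c] for c in totals}

# Hypothetical session: condition A preferred in all of its pairings.
scores = preference_scores([("A", "B", "A"), ("A", "C", "A"), ("A", "D", "A"),
                            ("B", "C", "B"), ("B", "D", "D"), ("C", "D", "C")])
```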

3.4 Paired Comparison Test on Stage Locations


The paired comparison tests were organized in three groups, each representing one head orientation: "look center," "look left," and "look right." Four stage locations were studied in each group, and the preference scores were obtained and analyzed.


3.4.1 Preference study of stage locations when head orientation is "look center"

3.4.1.1 Result

The preference scores for each subject were averaged over the 6 test sessions. Scores were obtained for 4 conditions (A, B, C, and D) corresponding to the 4 stage locations (DSC, DSR, CSC, and CSR) respectively. The results of all 13 subjects are plotted individually in Figure 30; the scores of all subjects in each condition were then averaged and plotted as well. The results show agreement across all subjects on the preference for Condition A (DSC, down stage center) and Condition C (CSC, center stage center), with Condition A slightly more preferred than Condition C in the overall average. Another point of agreement across all subjects is the negative preference for Condition D (CSR).


Figure 30 Preference scores for different stage locations when head orientation is "look center" (Conditions - A: DSC, B: DSR, C: CSC, D: CSR). Normalized scores of the 13 individual subjects A-M (blue bar graphs) and the overall average score of all subjects (red bar graph)

3.4.1.2 Analysis

In interviews after the experiment, the test subjects' common feedback concerned the lateralization of the sound decay and changes in the sense of envelopment. Some subjects asked during the test sessions whether the headphone volume was unbalanced between the two channels. Subjects expressed an inclination toward spatially balanced and enveloping sound fields. The interaural cross-correlation function (IACF) is commonly used in analyzing the impact of side reflections and the subjective preference for room width. The IACF can also be used to visualize the lateralization of a running sound source. The IACF is defined as:
\mathrm{IACF}_t(\tau) = \frac{\displaystyle\int_{t_1}^{t_2} P_L(t)\, P_R(t+\tau)\, dt}{\left[\displaystyle\int_{t_1}^{t_2} P_L^2(t)\, dt \int_{t_1}^{t_2} P_R^2(t)\, dt\right]^{1/2}}

where L and R refer to the entrances to the left and right ear canals. The maximum possible value of the IACF is one, reached when both signals are identical. The variable τ accounts for the time difference between the two ears and is varied over a range from -1 to +1 ms from the first arrival [29]. The IACF for each condition was calculated in 100-ms intervals and plotted in Figure 31.
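A direct, if slow, sketch of this computation for a discrete-time binaural pair might look like the following. It is pure Python for illustration; in practice the correlation would be evaluated per 100-ms window of the impulse response, and `p_left`/`p_right` are hypothetical signal lists, not the measured data.

```python
import math

def iacf(p_left, p_right, fs, lag_ms=1.0):
    """Normalized interaural cross-correlation over integer-sample lags
    in [-lag_ms, +lag_ms] ms; returns one value per lag."""
    norm = math.sqrt(sum(x * x for x in p_left) *
                     sum(x * x for x in p_right))
    max_lag = int(lag_ms * 1e-3 * fs)
    out = []
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i, x in enumerate(p_left):
            j = i + lag
            if 0 <= j < len(p_right):
                acc += x * p_right[j]
        out.append(acc / norm if norm > 0 else 0.0)
    return out

# Identical signals in both ears give a peak of 1.0 at zero lag:
p = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
vals = iacf(p, p, 8000, lag_ms=0.25)
```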

Figure 31 Interaural cross-correlation functions in 100ms-intervals for conditions A-D when head orientation is "look center".

From the above IACF plots, the spike at time = 0 ms indicates the highest correlation between the two ears at the onset of the impulse response. As the sound decays, the correlation rapidly drops to around zero, which suggests that the early reverberant field (0 to 400 ms) was fairly diffuse in all four conditions. The correlation rises toward the late reverberant field (after 400 ms). It should be noted that this rise in the IACF does not produce a prominent peak across the binaural sound field. The rise is apparent in both Conditions A & C, which suggests a correlation between the subjective preference scores and the behavior of the late reverberant field. The late reverberation in the playhouse is believed to be contributed by reflections from the back wall of the auditorium; the absence of a peak in the late IACF implies diffuseness of the late reflections and a sense of envelopment. Furthermore, the energy decay was examined. Because the sound source (artificial mouth) and the receivers (binaural ears) were in extremely close proximity during impulse response acquisition, conventional reverberation-time calculations cannot be applied directly to describe the subjective sensation of the decay. It is also unknown how a person perceives the reverberation time of the same acoustic space differently when listening to his or her own voice versus other sound sources. As a result, an alternative parameter, Voice Stage Support (VSS), was proposed to analyze the energy decay:
\mathrm{VSS}_{t_i} = 10 \log_{10} \frac{\displaystyle\int_{t_i-90}^{t_i+10} p^2(t)\, dt}{\displaystyle\int_{0}^{10} p^2(t)\, dt}

where t_i steps through the impulse response in 100-ms intervals and the integration limits are in milliseconds. Since the energy of the direct sound from the artificial mouth to the ears is constant, the energy of the initial 10 ms is taken as the reference. The energy ratio was then calculated for every 100-ms interval after the initial 10 ms. The VSS results for the binaural ears are plotted against the four conditions (stage locations) in Figure 32.
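The VSS computation described above can be sketched as follows, assuming a monaural impulse response given as a list `p` of samples and taking the first 10 ms as the direct-sound reference energy. This is an illustrative implementation, not the original analysis code.

```python
import math

def vss(p, fs, interval_ms=100.0, direct_ms=10.0):
    """Voice Stage Support sketch: 10*log10 of the energy in each
    successive `interval_ms` window (starting after the first `direct_ms`)
    relative to the initial direct-sound energy."""
    n_direct = int(direct_ms * 1e-3 * fs)
    direct_energy = sum(x * x for x in p[:n_direct])
    step = int(interval_ms * 1e-3 * fs)
    out = []
    for i in range(n_direct, len(p) - step + 1, step):
        energy = sum(x * x for x in p[i:i + step])
        out.append(10.0 * math.log10(energy / direct_energy)
                   if energy > 0 else float("-inf"))
    return out

# A constant-amplitude toy "response": each 100-ms window holds 10x the
# energy of the 10-ms direct window, so every VSS value is +10 dB.
out = vss([1.0] * 1010, fs=1000)
```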


Figure 32 VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D: CSR) when head orientation is "look center".

From the above plot, it is found that Conditions A & B have higher energy content than C & D during the time period of 0.3 to 0.7 s. This may be attributed to the exposure to more reflective surfaces in Conditions A & B; both locations are down stage, very close to the stage front. However, the higher energy ratio in Conditions A & B does not correlate with the subjective preference scores, which implies that the loudness of the sound field is not the major factor in actors' support.

3.4.2 Preference study of stage locations when head orientation is "look left"

3.4.2.1 Result

The results of all 13 subjects are plotted individually in Figure 33; the scores of all subjects in each condition were then averaged and plotted as well. As opposed to the preference study with the "look center" head orientation, the overall preference is less distinct. Subjects B, E, and F, in particular, showed less pronounced preferences. One general agreement among all subjects is the negative preference for Condition C (CSC, center stage center). Also, Conditions B & D are preferred over the others.


Figure 33 Preference score of different stage locations when head orientation is "look left" (Conditions - A: DSC, B: DSR, C: CSC, D: CSR). Normalized scores of 13 individual subjects A-M (blue bar graph) and overall average score of all subjects (red bar graph)

3.4.2.2 Analysis

Again, the IACF is used to analyze the behavior of the binaural sound field and the lateralization of the late reverberation.


Figure 34 Interaural cross-correlation functions in 100ms-intervals for conditions A-D when head orientation is "look left".

The lower preference scores for Conditions A & C compared with the "look center" orientation could be accounted for by the lateralization of the late reverberation shown in Figure 34. In Conditions B & D, by contrast, the rise of the IACF in the late sound field shows less prominent peaks, indicating a diffuse late reverberation spread uniformly across the binaural field. The energy change was also examined using the VSS proposed in the previous section. The results are plotted in Figure 35.


Figure 35 VSS plot of binaural ears in four conditions (A: DSC, B: DSR, C: CSC, D: CSR) when head orientation is "look left".

The energy difference between conditions is less pronounced in the left channel. One possible reason is that when the head was turned left, the left ear was directed toward the stage. Notice the generally lower VSS on the left channel compared to the right channel in each time slice; this indicates that the perceived loudness of the sound field is mainly constituted by the reflections and reverberation in the auditorium. On the right channel, a higher VSS is recorded in Conditions A & B within the time period of 0.3 to 0.6 s, mainly due to better exposure to the late reverberation from the auditorium.


3.4.3 Discussion on Stage Location Preference

In this chapter, stage locations were compared under different head orientations. It was found that actors' subjective preference depends strongly on the directionality of the binaural sound fields, as explained using the IACF. One might assume that a stage location offering greater loudness and lying closer to the center line would be preferred. However, the current results show that no single stage location was consistently rated best, which implies that voice stage support depends on parameters beyond local acoustic characteristics. Nevertheless, it should be noted that the current subjective test did not include visual stimuli. In reality, actors' subjective response to a stage location also depends strongly on the visibility and perceived distance of the audience; as a form of communicative art, the actor-audience interaction plays an important role. There are also other factors not covered in the current study, such as scenery, lighting, musicians, and sound reinforcement systems, whose impact on the actors' subjective preference should not be overlooked.


4. DISCUSSIONS

4.1 Reflections on subjective preferences on stage acoustic conditions


The study of stage acoustic conditions in Chapter 3 was intended as initial groundwork in the field of voice stage support; it does not offer a complete view of the issues involved. The subjective tests were limited by the time for which subjects could actively concentrate on listening and speaking in an acoustically controlled environment. To achieve higher statistical significance, more averaging is desired, which in turn demands longer test sessions. The actors participating in this research all expressed concern about not being able to concentrate after repetitive tests, so it is unlikely that more conditions could have been added. Originally, more stage locations were planned for testing, but the plan was scaled back due to the overwhelming duration of the test. To advance this field of study, subjective tests must be carefully designed and organized: spreading the tests over a longer span of time, with fewer stimuli/conditions in each session, might maintain the actors' optimal condition (both psychological and physiological). Involving more subjects is another goal of this study. This strengthens the need for a portable and repeatable auralization system, so that more experimenters and actors can take part in this field of research and aggregate statistically significant data in the long run.

4.2 Accuracy in subjective testing


Research employing subjective tests is not trivial. Preference scoring is highly individual, and the reliability of the data depends on how well the test subjects concentrate. In the current research, it was found that the test duration approached the limit of the test subjects.


In designing experiments involving human perception, one must be aware of the neurological functioning of the human body. Research in neuroscience has shown that human subjects are susceptible to neurological suppression or activation through repetition of external stimuli: Belin and Zatorre (2003), Henson (2003), and Bergerbest et al. (2004) reported suppression of auditory cortex activity after a prolonged period of repeated stimuli. In the current research, subjects actively speaking the same words over and over again may induce such neurological suppression. The influence of repetition of one's own voice on the auditory cortex is unknown; further experiments should be carried out to quantify the potential errors introduced by having human subjects speak repetitively in a controlled environment, and to determine the optimal rest period that retains memory for comparison tests while not introducing excessive repetition suppression.

4.3 Potential ways of improving voice stage support in proscenium theaters


Proscenium theaters have a characteristic coupled-space phenomenon resulting from the stage volume and the auditorium. The current research partially revealed the differences in the sound field at various stage locations and found that higher diffuseness of the late reverberation and balanced binaural fields are most preferred. A possible improvement of voice stage support is to provide a late sound field on stage similar to the reverberation returning from the auditorium. Due to the complexity of scenery changes and the occupancy of lighting instruments in the fly systems, passive acoustical treatment may not be feasible. The most viable means is an electro-acoustic enhancement system comprising distributed microphones and loudspeakers above and around the stage area.


References:
[1] George Izenour, Theater Design (New York: McGraw-Hill, 1977), 10.
[2] Barry Blesser and Linda-Ruth Salter, Spaces Speak, Are You Listening?: Experiencing Aural Architecture (Cambridge: MIT Press, 2006), 225.
[3] A.C. Gade, "Investigations of musicians' room acoustic conditions in concert halls. I. Method and laboratory experiments," Acustica 65 (1989), 193-203.
[4] Quoted from a sign hanging in Einstein's office at Princeton. [Source: Barry Blesser and Linda-Ruth Salter, Spaces Speak, Are You Listening?: Experiencing Aural Architecture (Cambridge: MIT Press, 2006), 215.]
[5] A.C. Gade, "Investigations of musicians' room acoustic conditions in concert halls. II. Field experiments and synthesis of results," Acustica 69 (1989), 249-62.
[6] A.C. Gade, "Practical Aspects of Room Acoustic Condition Measurements on Orchestra Platforms," Proceedings of the 14th International Congress on Acoustics (Beijing, 1992).
[7] Gino Iannace et al., "Room Acoustic Conditions of Performers in an Old Opera House," Journal of Sound and Vibration 232:1 (2000), 17-26.
[8] Jin Yong Jeon and Michael Barron, "Evaluation of Stage Acoustics in Seoul Arts Center Concert Hall by Measuring Stage Support," JASA 117 (2005), 232-9.
[9] Kanako Ueno et al., "Experimental Study on the Effects of Early Reflections on Players by Binaural Simulation Technique," Journal of the Acoustical Society of Japan (E) 21:3 (2000), 167-70.
[10] A.H. Marshall and J. Meyer, "The Directivity and Auditory Impression of Singers," Acustica 58 (1985), 130-40.
[11] Dennis Noson et al., "Singer Responses to Sound Fields with a Simulated Reflection," Journal of Sound and Vibration 232 (2000), 39-51.
[12] Dennis Noson et al., "Melisma Singing and Preferred Stage Acoustics for Singers," Journal of Sound and Vibration 258 (2002), 473-85.
[13] Yoichi Ando et al., "The Running Autocorrelation Function of Different Music Signals Relating to Preferred Temporal Parameters of Sound Fields," JASA 86:2 (1989), 644-9.
[15] Georg von Békésy, "The Structure of the Middle Ear and the Hearing of One's Own Voice by Bone Conduction," JASA 21 (1949), 217-32.
[16] Stefan Stenfelt and Richard L. Goode, "Bone-Conducted Sound: Physiological and Clinical Aspects," Otol. Neurotol. 26 (2005), 1245-61.
[17] Christian Pörschmann, "Influences of Bone Conduction and Air Conduction on the Sound of One's Own Voice," Acustica 86 (2000), 1038-45.
[18] B. Williams and G.J. Barnes, "The Specification and Measurement of Sidetone as Dictated by Human Factors Considerations," Human Factors Symposium, Montreal, 1974.
[19] Stefan Stenfelt and Bo Håkansson, "Air versus Bone Conduction: An Equal Loudness Investigation," Hearing Research 167 (2002), 1-12.
[20] Harvey Fletcher and W.A. Munson, "Loudness, Its Definition, Measurement and Calculation," JASA 5 (1933).
[21] Christian Pörschmann, "Own Voice in Auditory Virtual Environments," Acustica 87 (2001), 382-5.
[22] Jürgen Tonndorf, "Bone Conduction," in Handbook of Sensory Physiology, Vol. 5, Part 3: Auditory System: Clinical and Special Topics (New York: Springer-Verlag, 1976), 58-66.
[23] Margaret S. Dean and Frederick N. Martin, "Insert Earphone Depth and the Occlusion Effect," American Journal of Audiology 9 (2000), 131-4.
[24] H.K. Dunn and D.W. Farnsworth, "Exploration of Pressure Field Around the Human Head During Speech," JASA 10 (1938), 184-99.
[25] F. Bozzoli et al., "Balloons of Directivity of Real and Artificial Mouth Used in Determining Speech Transmission Index," AES 118th Convention Paper (2005).
[26] Centers for Disease Control and Prevention, U.S. Department of Health and Human Services, Vital and Health Statistics, Number 347, 2004.
[27] Valdis Jónsdóttir et al., "Effects of Amplified and Damped Auditory Feedback on Vocal Characteristics," Log Phon Vocol 26 (2001), 76-81.
[28] Yoichi Ando, Architectural Acoustics (New York: Springer-Verlag, 1998), Chapter 7.
[29] Marshall Long, Architectural Acoustics (New York: Academic Press, 2005), 675.