An Automatic Volume Control for Preserving Intelligibility
Franklin Felber
Starmark Technologies Division, Starmark, Inc., P. O. Box 270710, San Diego, CA 92198, [email protected]
Abstract: A new method has been developed to adjust volume automatically on all audio devices equipped with at least one microphone, including mobile phones, personal media players, headsets, and car radios, that might be used in noisy environments, such as crowds, cars, and outdoors. The method uses a patented set of algorithms, implemented on the chips in such devices, to preserve constant intelligibility of speech in noisy environments, rather than constant signal-to-noise ratio. The algorithms analyze the noise background in real time and compensate only for fluctuating noise in the frequency domain and the time domain that interferes with intelligibility of speech. Advantages of this method of controlling volume include controlling volume without sacrificing clarity; adjusting only for persistent speech-interference noise; smoothing volume fluctuations; and eliminating static-like bursts caused by noise spikes. Practical human-factors approaches to implementing these algorithms in mobile phones are discussed.
Keywords: automatic volume control; automatic gain control; intelligibility of speech; SmartAVC™; speech interference level; noisy environments; US Patent 7,760,893; US Patent 7,908,134

I. INTRODUCTION AND BACKGROUND
This paper presents a method for automatically adjusting the volume of an audio device to compensate only for noise that interferes with the intelligibility of speech or appreciation of music from the audio device.

The automatic volume control (AVC) described here [1] is a fully automatic system and method for adjusting the volume of an audio output device, such as a mobile phone or car radio, in accordance with listener preferences, to compensate selectively for changing levels of ambient noise only in the time and frequency domains that interfere with intelligibility of speech or appreciation of music.

An example of an audio device is a car radio. Many sources of noise can interfere with hearing a car radio, including tire (road) noise, wind, engine noise, traffic (highway) noise, the fan of a heater or air conditioner, and noises made by the driver and passengers. The noise levels of all of these sources can change with time, depending on factors like the speed of the car or changing environmental conditions outside or inside the car. The noise levels can change abruptly or quasi-continuously or can be transient. Having to adjust the volume of an audio device manually and repeatedly to compensate for changing noise levels is a nuisance and, in a car, can compromise the safety of the occupants and others.

Not all noise, however, interferes with a listener's understanding or appreciation of the output of an audio device. And not all noise, therefore, would impel a listener to want to change the volume. For example, nearly all the information in speech is contained within the frequency interval 200 Hz to 6 kHz [2]. Generally, only the frequency components of noise within this interval can detract significantly from intelligibility of speech. Similarly, the intelligibility of full sentences in noisy environments is substantially greater than the intelligibility of isolated words.
Generally, only noises that persist long enough to mask more than a few words can detract significantly from intelligibility of speech [2].

Any system that attempts to compensate for all noise, regardless of frequency or duration, will generally overcompensate by raising or lowering the volume of an audio device to adjust for noise that is not significantly interfering with the ability to listen to the audio device. For example, the occurrence of a high-pitched whine above 6 kHz should not generally be cause for the volume of an audio device to be increased automatically, or to be decreased upon its cessation. Similarly, a transient noise within a car, or another car passing at high speed in the opposite direction, should not generally be cause for the volume of a phone or radio to be changed.

What is needed, therefore, is not a means for automatically adjusting the volume of an audio device to compensate for changes in all ambient noise, but rather only that noise of a frequency and duration that detracts from the ability to listen to the audio device. That is, the AVC should have some means of discriminating significant noise, which persistently detracts from listening ability, from noise that is less consequential. One means of identifying such significant noise is to measure its interference with the intelligibility of speech. One measure of interference with intelligibility considered suitable for field use is the preferred speech interference level (PSIL), which is the arithmetic average of the noise levels in the three octave bands centered at 500, 1000, and 2000 Hz [2].

Another example of an audio device is a two-way voice communications device, such as a telephone. Mobile phones in particular are often used outdoors, in crowds, and in cars and other environments where the background noise fluctuates in intensity. To adjust the volume control constantly on a phone in a noisy environment is inconvenient and often impractical.
For this reason, a user of a communications device, such as a mobile phone, could potentially benefit from an AVC feature.

Published in Proceedings of 34th IEEE Sarnoff Symposium, Princeton, NJ, 3–4 May 2011

The AVC for a phone is similar to the AVC for a radio in that both should have some means of discriminating significant noise from less consequential noise. Both should also have some means of separating the significant noise from a signal that requires no compensation or different compensation. In the case of a radio, the signal that requires no compensation by an AVC is the normal audio output of the radio speakers. The AVC for a radio should have some means of separating the speaker signal from the noise background. In the case of a telephone, the signal that requires no compensation or different compensation than the noise background is the telephone user's own voice. The AVC for a telephone or other multiplexed communications device should have some means of separating the user's voice from the noise background.

II. SUMMARY OF SmartAVC™

A means of identifying and separating human voice from a noise background is presented in [3]. But the simplest and most practical solution for separating the user's own voice from background noise, when using an AVC-equipped mobile phone, is to momentarily suspend operation of the AVC during each instant that the user is speaking. For example, the AVC function might be suspended whenever the sound level into the microphone exceeds a threshold value indicating that the user is speaking. Since the user generally cannot understand much of what is being said to him while he is at the same time speaking on the phone, little is lost by the temporary suspension of the AVC function. The following discussion of a means of separating an audio signal from background noise, therefore, is mostly applicable to one-way communications devices, such as car radios.

For an audio amplifier providing an audio signal to audio speakers, the
SmartAVC™ automatic volume control compensates for speech interference noise by a means including the following components and processes, as shown in Fig. 1: a microphone for detecting the background noise and the audio signal either from the speakers of a radio or the phone user's voice, and in response for producing a corresponding signal; a phase correlator process for phase correlating the microphone and audio signals; an amplitude correlator for correlating the phase-correlated microphone and audio signals; a subtraction process for producing a signal corresponding to a difference between the phase- and amplitude-correlated microphone and audio signals; a transform process for producing over a period of time a signal corresponding to the amplitude of each frequency component of the noise background; a bandpass filter for filtering the transform-produced signal to pass only frequency components within selected bands; a speech-interference level (SIL) calculation process for producing a signal corresponding to a combination of the amplitudes of the bandpass-filtered frequency components; and a solver process for producing according to an algorithm a signal for controlling the gain of the audio amplifier. Preferably the selected bands include the three octave bands centered at 500, 1000, and 2000 Hz. Preferably the transform process comprises a fast Fourier transform module. Preferably the combination of the amplitudes of the bandpass-filtered frequency components is an arithmetic average of the noise levels in the octave bands. Preferably some or all processes, algorithms, and filtering are performed by a digital signal processor (DSP) that receives both the digitized microphone signal and audio signal.

Fig. 1. Functional block diagram of DSP (within thick-lined block) of SmartAVC™, and DSP's interfaces with the rest of AVC and amplifier.

III. SUMMARY OF CONVENTIONAL AVC

The first modern digital AVC was described in [4]. All later variations of AVCs, including
SmartAVC™, differ from the AVC in [4] primarily by the components and processes within the digital signal processor (DSP). Other than SmartAVC™, the methods for controlling volume largely depend on maintaining constant signal-to-noise ratio, in some manner or other, as is done for example in [5], which keeps constant the ratio of signal to A-weighted noise.

Fig. 2 shows the main components of a conventional audio device having a conventional AVC. In Fig. 2, the components of the conventional audio device preceding its amplifier stage are not shown individually, but are generally represented by a function entitled "Signal Source." In a conventional audio device, the signal source provides an electrical signal that is amplified by an audio amplifier for driving a set of speakers. The speakers convert the amplified signal to an acoustic signal that can be transmitted to listeners. Generally, the volume of such a conventional audio device is controlled by a manual volume control that adjusts the gain of the audio amplifier. The microphone receives both the transmitted signal from the speakers and any background noise. The microphone transduces the incident acoustic waves to a corresponding analog electrical signal that is communicated to an analog-to-digital (A/D) converter, wherein the analog signal is converted to a corresponding digital signal that is communicated to the DSP for processing. Concurrently, the amplified electrical signal from the audio amplifier is converted by an A/D converter to a corresponding digital signal that is also communicated to the DSP. After comparing the signals from the microphone and the audio amplifier, the DSP automatically performs a process that results in a control signal that is communicated to the audio amplifier to adjust the gain of the amplifier and, thereby, the volume of the speakers.

Fig. 2. Functional block diagram of conventional AVC and interface with audio device.

IV. SmartAVC™ PREFERRED EMBODIMENT
SmartAVC™ incorporates a novel DSP that includes the components and processes shown in Fig. 1. The correlators and the signal subtraction process cooperate to separate the sound of the speakers from the background noise so that the background noise can be processed separately. The correlators correlate the digitized inputs from the two A/Ds, so that they can be subtracted from each other by the signal subtraction process with the remainder being the background noise.

It might be possible, using factory settings, to subtract the inputs to the correlators directly without first correlating them, but the tolerance for jitter between the inputs to the correlators is so demanding that over time the system characteristics may drift and detune. The phase and amplitude correlators can correlate the inputs continuously in near real time, if necessary, or only at each start-up of the audio device, if such is sufficient. Both the phase and amplitude can be correlated with respect to the inputs over multiple processing periods for greater accuracy.

Referring again to Fig. 1, the phase correlator precedes the amplitude correlator. The phase correlator calculates the correlation function of the digitized inputs with respect to phase difference (over a limited range around the factory-set value of zero), and adjusts the relative phase to the maximum of the correlation function. The phase-correlated signals are then sent to the amplitude correlator as inputs. The amplitude correlator calculates the correlation function with respect to the gain of the audio amplifier (over a limited range around the factory-set value of one), and adjusts the gain to the minimum of the correlation function. The phase- and amplitude-correlated signals are then sent to the signal subtraction process. The signal subtraction module subtracts them to produce a difference signal that is communicated as an input to the FFT module.
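The correlate-then-subtract chain described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not from the paper: the phase difference is modeled as an integer sample lag, the quantity minimized by the amplitude correlator is taken to be the energy of the residual, and the function names (`best_lag`, `align`, `best_gain`, `noise_estimate`), the search range `max_lag`, and the gain grid are all hypothetical.

```python
import numpy as np

def best_lag(mic, ref, max_lag):
    """Integer lag of mic relative to ref that maximizes their correlation,
    searched over a limited range around zero (assumed stand-in for phase)."""
    n = len(ref)

    def corr(lag):
        # Correlate only the overlapping samples for this lag.
        if lag >= 0:
            return float(np.dot(mic[lag:], ref[:n - lag]))
        return float(np.dot(mic[:n + lag], ref[-lag:]))

    return max(range(-max_lag, max_lag + 1), key=corr)

def align(ref, lag):
    """Shift ref by lag samples, zero-padding the vacated samples."""
    out = np.zeros_like(ref)
    if lag >= 0:
        out[lag:] = ref[:len(ref) - lag]
    else:
        out[:lag] = ref[-lag:]
    return out

def best_gain(mic, ref_aligned, gains):
    """Candidate gain (searched around one) that minimizes residual energy."""
    return min(gains, key=lambda g: float(np.sum((mic - g * ref_aligned) ** 2)))

def noise_estimate(mic, ref, max_lag=8, gains=np.linspace(0.5, 1.5, 101)):
    """Return (difference signal, lag, gain): the noise background that
    remains after the aligned, scaled audio signal is subtracted."""
    lag = best_lag(mic, ref, max_lag)
    ref_aligned = align(ref, lag)
    gain = best_gain(mic, ref_aligned, gains)
    return mic - gain * ref_aligned, lag, gain
```

In this sketch the returned difference signal plays the role of the input to the FFT module. A fielded implementation would more likely work with sub-sample phase and would average the estimates over several processing periods, as the text suggests.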
The difference signal is the best representation of the pure noise background after the sound from the speakers, if any, has been subtracted.

The operating characteristics of a preferred embodiment of an FFT module, optimized for minimum throughput demand, can be best described as follows. Let the sampling rate of the A/D converters be $s$ samples/second. Let the number of samples to be processed in each processing period of the FFT module be $N$, where $N$ must be an integer power of 2. Then each processing period is $N/s$, and the time from receiving the first sample to the last in each processing period is

$$T = (N - 1)/s. \tag{1}$$

The frequency resolution of the Fourier transform is

$$\Delta f = 1/T = s/(N - 1). \tag{2}$$

The highest frequency component of the Fourier transform is

$$f_m = N \Delta f / 2 = [N/(N - 1)]\, s/2. \tag{3}$$

In the preferred embodiment of SmartAVC™, the FFT module described below is particularly well suited to calculating the PSIL from the noise background. The PSIL is the arithmetic average of the noise levels in the three octave bands centered at 500, 1000, and 2000 Hz, that is, the three octave bands from 354 to 707 Hz, from 707 to 1414 Hz, and from 1414 to 2828 Hz, respectively.

The following design guidelines are preferred for an accurate calculation of the PSIL:

(a) The frequency resolution of the Fourier transform should be finer than about 40 Hz, that is,

$$\Delta f = s/(N - 1) \lesssim 40 \text{ Hz}, \tag{4}$$

in order to get good statistics on the noise level by having at least of the order of 10 frequency components, even in the lowest octave band.

(b) The processing period of the FFT module should be no longer than about 25 ms, that is,

$$T = (N - 1)/s \lesssim 25 \text{ ms}, \tag{5}$$

in order to provide at least of the order of 10 PSIL calculations to the solver every quarter second or so. A quarter second is less than or about the time over which the AVC should begin to respond to a rapidly changing noise background.

(c) The highest frequency component of the Fourier transform should be at least about 2800 Hz, that is,

$$f_m = [N/(N - 1)]\, s/2 \gtrsim 2800 \text{ Hz}, \tag{6}$$

in order to get good statistics on the noise level in the highest octave band by populating it fully.

Combining these design guidelines, Eqs. (4)–(6), leads to the following point design as an example of an FFT module that is particularly well suited to calculating the PSIL for an AVC: $N = 128$; $s = 5600$ Hz; $T = 22.7$ ms; $\Delta f = 44.1$ Hz; $f_m = 2822$ Hz.

After each processing period, the FFT module sends a signal as an input to the bandpass filters, the signal comprising an amplitude for each of the frequency components of the FFT spectrum.
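As a rough check of the point design, and to illustrate how a PSIL value could be obtained from one FFT frame, the following Python sketch evaluates Eqs. (1)–(3) and averages the levels of the three octave bands. The values $N = 128$ and $s = 5600$ Hz are taken as assumptions consistent with the figures quoted in the text. The function names (`band_bins`, `psil_db`) are hypothetical, the levels are uncalibrated (dB relative to an arbitrary reference), and each band level is assumed to be the summed power of its frequency components.

```python
import numpy as np

N = 128          # samples per processing period (assumed; a power of 2)
S_RATE = 5600.0  # A/D sampling rate in Hz (assumed)

T = (N - 1) / S_RATE        # Eq. (1): time from first to last sample, s
DELTA_F = 1.0 / T           # Eq. (2): frequency resolution, Hz
F_MAX = N * DELTA_F / 2.0   # Eq. (3): highest frequency component, Hz

# Octave bands centered at 500, 1000, and 2000 Hz.
OCTAVE_BANDS_HZ = [(354.0, 707.0), (707.0, 1414.0), (1414.0, 2828.0)]

def band_bins(lo_hz, hi_hz):
    """Indices j of the components f_j = j * DELTA_F inside [lo_hz, hi_hz).
    NumPy's own bin spacing is S_RATE / N, about 1% smaller than DELTA_F;
    for these parameters both spacings select the same bins."""
    j = np.arange(N // 2 + 1)
    f = j * DELTA_F
    return j[(f >= lo_hz) & (f < hi_hz)]

def psil_db(noise_frame):
    """Arithmetic average (in dB) of the three octave-band noise levels
    for one frame of N noise samples. Levels are uncalibrated."""
    amp = np.abs(np.fft.rfft(noise_frame, n=N))  # N/2 + 1 = 65 amplitudes
    levels = []
    for lo, hi in OCTAVE_BANDS_HZ:
        power = np.sum(amp[band_bins(lo, hi)] ** 2)
        levels.append(10.0 * np.log10(power + 1e-30))  # guard against log(0)
    return float(np.mean(levels))
```

Calling `psil_db` once per processing period on the difference signal would yield the single value per period that the solver consumes. Absolute sound-pressure calibration is outside the scope of this sketch.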
With the point design in the preferred embodiment, the FFT calculates 65 amplitudes each processing period for the frequency components $f_j = j\,\Delta f = 44.1\,j$ Hz, where $j = 0, 1, 2, \ldots, 64$. In the preferred embodiment, the 8 frequency components, $f_9 \approx 397$ Hz through $f_{16} \approx 706$ Hz, populate the lowest octave of the PSIL. The 16 frequency components, $f_{17} \approx 750$ Hz through $f_{32} \approx 1411$ Hz, populate the middle octave of the PSIL. The 32 frequency components, $f_{33} \approx 1455$ Hz through $f_{64} \approx 2822$ Hz, populate the highest octave of the PSIL.

The bandpass filters pass only those frequency components within bands that are used by the SIL calculator. In the preferred embodiment, the bands include the 56 frequency components from $f_9$ through $f_{64}$. The SIL calculator calculates the arithmetic average (in dB) of the noise levels in the three (octave) frequency bands passed by the filters and sends as an input to the solver a single PSIL value (in dB) every processing period ($N/s = 22.9$ ms in the preferred embodiment).

The solver calculates a gain control signal, subject to certain constraints, to be sent to the audio amplifier every processing period. The purpose of the solver is to calculate a gain control signal that responds proportionately to changing noise levels of a duration sufficient to interfere with intelligibility of speech or appreciation of music, and that responds negligibly to fluctuations of noise levels at the processing cycle frequency, $s/N$, or to brief noise transients. The response of the gain control signal must be somewhat dilatory to allow the solver to distinguish SIL changes of significant duration from insignificant transients. But it should not be so dilatory as to seem to the listener to be unresponsive to substantial changes of SIL.

In the preferred embodiment, the model used for the solver is that of a driven damped harmonic oscillator.
The gain control signal (in dB), $a(t)$, as a function of time $t$ satisfies the second-order differential equation,

$$a''(t) + b\,a'(t) + \omega_0^2\,a(t) = \omega_0^2\,[S(t) + R], \tag{7}$$

where a prime denotes a derivative with respect to time, $b$ is a damping constant, $\omega_0$ is a constant frequency indicative of the 'stiffness' of the response, $S(t)$ is the SIL (in dB), and $R$ is the listener's preferred signal-to-SIL ratio (in dB). ($R$ is one of the constraints imposed on the solver by user interaction through the manual volume control.)

In terms of a normalized gain control signal, $A(t) \equiv a(t) - R$, Eq. (7) may be written as

$$A''(t) + b\,A'(t) + \omega_0^2\,[A(t) - S(t)] = 0. \tag{8}$$

For the $i$th processing cycle, this model is implemented in the solver by the following algorithm:

$$A'_i = A'_{i-1} + (N/s)\,A''_{i-1}; \tag{9a}$$

$$\text{if } |A_{i-1} - S_i| > r, \text{ then } A_i = A_{i-1} + (N/s)\,A'_i; \tag{9b}$$

$$\text{otherwise } A_i = A_{i-1}; \tag{9c}$$

$$A''_i = \omega_0^2\,S_i - b\,A'_i - \omega_0^2\,A_i; \tag{9d}$$

$$\text{if } A_i < A_{\min}, \text{ then } A_i = A_{\min}. \tag{9e}$$

The constant $r$ (in dB) is a threshold difference of the normalized gain control signal, $A(t)$, from the SIL, $S(t)$, below which the gain control signal remains unchanged. The constant $A_{\min}$ (in dB) is the user-preferred floor of the normalized gain control signal, $A(t)$.

The constant $r$ is intended to desensitize the algorithm to most of the high-frequency fluctuations of the SIL in an otherwise constant noise background, and to keep $A(t)$ constant in such an environment. A typical factory setting for $r$ might be about 1 dB. The constant $r$ could also be made adaptive by making it proportional to the root-mean-square fluctuation of the SIL, for example, at the cost of additional processing.

The constant $A_{\min}$ is the listener's preferred minimum normalized gain control signal, which is generally independent of how quiet the environment may become. The listener establishes or re-establishes $A_{\min}$ through the manual volume control by adjusting the volume higher in quiet environments.

The initial conditions for the algorithm in Eqs. (9) at system start-up ($t = 0$), or whenever the user establishes new constraints through the manual volume control, are: $A_0 = S_0$, $A'_0 = 0$, $A''_0 = 0$.

Fig. 3 shows the result of implementing the algorithm of Eqs. (9) on a simulated SIL. SIL noise was simulated in Fig. 3 with significant changes of various durations and with random high-frequency fluctuations up to ±1 dB. The simulated SIL includes two transient triangular noise spikes, each 100 times (20 dB) louder than the background. For this simulation, the processing period, $N/s$, was taken to be 22.7 ms, as in the example above. The following values of constants were used in implementing the algorithm, Eqs. (9), in Fig. 3: $\omega_0 = [\ldots]$ s$^{-1}$, $b = [\ldots]$, $r = [\ldots]$ dB, $A_{\min} = [\ldots]$ dB. Fig. 3 also shows that the algorithm, Eqs. (9), for the normalized gain control signal, the solid black curve, responds as desired to the SIL. After a brief delay, $A(t)$ responds fully to long-duration changes in the SIL. $A(t)$ is virtually oblivious to high-frequency fluctuations.
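A minimal Python sketch of a solver stepped in the manner of Eqs. (9) follows. The placement of the indices $i$ and $i-1$ is an assumption: each cycle updates the first derivative, then the signal (only outside the deadband $r$), then the second derivative, and finally applies the floor. The function name `solve_gain` and the parameter values in the usage note are hypothetical and are not the constants used for Fig. 3.

```python
def solve_gain(sil_db, dt, omega0, b, r, a_min):
    """Step a damped-oscillator solver over a sequence of SIL values (dB).

    sil_db : SIL samples S_0, S_1, ..., one per processing period
    dt     : processing period N/s in seconds
    omega0 : 'stiffness' frequency in 1/s
    b      : damping constant in 1/s
    r      : deadband threshold in dB
    a_min  : floor of the normalized gain control signal in dB
    Returns the list of normalized gain control values A_0, A_1, ...
    """
    a = sil_db[0]  # assumed initial conditions: A_0 = S_0, A'_0 = A''_0 = 0
    da = 0.0
    d2a = 0.0
    out = [a]
    for s_i in sil_db[1:]:
        da = da + dt * d2a                        # (9a) update first derivative
        if abs(a - s_i) > r:                      # (9b) outside the deadband:
            a = a + dt * da                       #      advance the signal
        # (9c) otherwise the signal is left unchanged
        d2a = omega0 ** 2 * (s_i - a) - b * da    # (9d) oscillator equation (8)
        if a < a_min:                             # (9e) enforce the floor
            a = a_min
        out.append(a)
    return out
```

For example, `solve_gain([60.0] * 50 + [70.0] * 400, dt=128 / 5600, omega0=4.0, b=8.0, r=1.0, a_min=40.0)` holds the signal at 60 dB while the SIL is steady, then rises smoothly and settles within the 1 dB deadband of 70 dB. The choice `b = 2 * omega0` corresponds to critical damping and is one illustrative setting among many.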
Fig. 3. Simulated SIL (red) vs. time and corresponding normalized gain control signal $A$ (black) produced by SmartAVC™ from Eqs. (9). Axes: Noise, Control Signal (dB) vs. Time (s).

To the half-second noise spike at $t = [\ldots]$ s and the quarter-second noise spike at $t = [\ldots]$ s, both 100 times louder than the background, the response of $A(t)$ is a few dB for no more than about one second. Lastly, the normalized gain control signal does not fall below the user-preferred floor of $A_{\min} = [\ldots]$ dB.

V. HUMAN FACTORS FEATURES
To be fully automatic, an AVC should impose no need for additional manual controls on an audio device, other than possibly an on-off switch for the AVC feature. Listener preferences for volume should be established through normal operation of the audio device and a minimum of manual volume adjustments. The two key listener preferences that should be automatically registered by an AVC are the preferred signal-to-noise ratio and the preferred signal floor. The relevant signal-to-noise ratio is the ratio of the amplifier gain of an audio device to a suitable measure of significant noise, such as the PSIL. The preferred signal floor is the lowest amplifier gain acceptable to the listener, independent of how quiet the environment may be.

For human factors considerations, constraints are applied as inputs to the solver. Generally, it is preferable to apply at least two constraints: (1) $R$, the listener's preferred signal-to-SIL ratio (in dB); and (2) $A_{\min}$, the listener's preferred floor for the normalized gain control signal (in dB). There are many variations of algorithms for providing these and other constraints from the constraint module. One example follows.

Any time the manual volume control is adjusted (including at start-up of the audio device in Fig. 2), a new value of $R$ is calculated and sent as an input to the solver. The new value of $R$ is the difference between the gain control signal $a(t)$ at the end of each manual volume adjustment (or at start-up) and some weighted average of SILs calculated for the same time. For example, let the processing period during which the manual adjustment ends be denoted by the subscript $m$, and let the weighted average be over $m$ processing periods. An example of an algorithm for calculating $R$ is

$$R = a(t_m) - \frac{1}{m} \sum_{i=1}^{m} w_i\, SIL_i, \tag{10}$$

where $w_i$ is a normalized weighting function. An example of a normalized weighting function that weights SILs in processing periods near the end of an adjustment more heavily is $w_i = 2i/(m + 1)$. A typical time for calculating a weighted average of SILs might be about a quarter second, or about 11 processing periods in the example given above.

Any time a weighted average of SILs is below some threshold value $SIL_t$, and the manual volume control is adjusted upward, a new value of $A_{\min}$ is calculated and sent as an input to the solver. (The threshold $SIL_t$ may be, for example, the lowest weighted average of SILs since start-up that did not prompt a manual volume adjustment during some latency period.) The new value of $A_{\min}$ is the normalized gain control signal established manually by the end of each such adjustment. When these conditions are met for establishing a new $A_{\min}$, a new $R$ is not also calculated. That is, if $A_{\min}$ is changed by a manual volume adjustment, $R$ remains unchanged by that adjustment. Any further manual volume adjustments establish new values of $A_{\min}$ and $R$, in accordance with the same algorithms.

Some two-way communications devices, such as mobile phones, may be equipped with two microphones, a voice microphone that selectively transduces a user's voice and a noise microphone that non-selectively transduces ambient sounds. The term "selective voice microphone" refers to a unidirectional microphone that selectively receives a voice signal from a relatively narrow solid angle in the direction of the user's voice, and that generally has low gain. A selective voice microphone is generally designed to capture the voice signal of a user, and reject most of the background noise from directions other than that of the user's voice. The term "non-selective noise microphone" refers to a microphone that is more nearly omni-directional, and that generally has higher gain, for detecting all ambient sounds, such as voices in a conference room.
When an audio device is equipped with two microphones, the non-selective noise microphone will generally characterize the noise background more accurately.

VI. CONCLUSIONS
This paper presented a method that automatically controls volume to preserve constant intelligibility of speech, rather than constant signal-to-noise ratio, and thereby avoids overcompensating for noises in the frequency domain that do not degrade intelligibility. Additionally, the method uses an algorithm resembling that of a shock absorber to smooth out fluctuations of noise in the time domain that do not affect intelligibility. The patented
SmartAVC™ algorithms have been demonstrated on an A-B breadboard unit to provide substantial advantages in performance [6] by controlling volume without sacrificing clarity, not overcompensating for high-frequency noise, smoothing volume fluctuations, and eliminating static-like bursts caused by noise spikes.

REFERENCES

[1] F. S. Felber, "Automatic volume control to compensate for speech interference noise," U. S. Patent 7,760,893 (20 July 2010); U. S. Patent 7,908,134 (15 March 2011).
[2] L. E. Kinsler et al., Fundamentals of Acoustics, 3rd Ed. NY: John Wiley & Sons, 1982.
[3] F. S. Felber, International Patent Application No. PCT/US08/69002.
[4] F. P. Helms, "Automatic volume control to compensate for ambient noise variations," U. S. Patent 5,666,426 (9 Sep. 1997).
[5] J. Sjöberg, "Method and circuit arrangement for adjusting the level or dynamic range of an audio signal," U. S. Patent 5,907,823 (25 May 1999).
[6] SmartAVC™