Audio Watermarking over the Air With Modulated Self-Correlation
Yuan-Yen Tai, Mohamed F. Mansour
Amazon Inc., USA
ABSTRACT
We propose a novel audio watermarking system that is robust to the distortion due to the indoor acoustic propagation channel between the loudspeaker and the receiving microphone. The system utilizes a set of new algorithms that effectively mitigate the impact of room reverberation and interfering sound sources without using dereverberation procedures. The decoder has low latency and operates asynchronously, which alleviates the need for explicit synchronization with the encoder. It is also robust to the standard audio processing operations considered in legacy watermarking systems, e.g., compression and volume change. The effectiveness of the system is established with a real-time system under general room conditions.
Index Terms: audio watermarking, asynchronous decoder, reverberation, spread-spectrum, second-screen.
1. INTRODUCTION
In most existing audio watermarking scenarios in the literature, the audio signal stays in the digital domain between the encoder and the decoder. This is a typical situation in digital rights management of audio distribution, where the watermarking decoder is invoked prior to media playback [1, 2]. Recently, there has been growing interest in audio watermarking that survives indoor acoustic propagation, e.g., for second-screen applications [3]. In this scenario, the watermarked audio is played through a consumer loudspeaker after the encoder, propagates through an indoor acoustic channel, and is picked up by a consumer microphone (usually in another device) before passing to the watermark decoder. This scenario poses a set of new challenges that were not encountered in legacy audio watermarking:
• Room reverberation, which introduces time and frequency smearing of the audio content [4].
• Time/frequency drift between the encoder and decoder due to different system clocks.

The relevant work in the literature has treated these two challenges rather separately, and frequently at the cost of less robustness to standard audio processing operations. For example, a few audio watermarking systems have been designed to withstand desynchronization between the encoder and decoder [5, 6, 7, 8, 9, 10]. This robustness can be achieved by using features that are robust to local time-scale variations [5], or by deploying a special synchronization mechanism (through a time-warping-like procedure) at the decoder [8, 9]. On the other hand, some earlier works have focused on the reverberation impact while assuming perfect synchronization [10, 11]. In [10], a special filter bank with a long symbol interval is used, and the watermark is embedded in the specific time-frequency cells that are robust to expected operations. Synchronization was not explicitly addressed; rather, general guidelines from wireless communication systems were described.
In this work, we develop an end-to-end audio watermarking system that addresses these challenges under practical computation and latency constraints.

In particular, we develop a novel audio watermarking system that is robust to both reverberation and desynchronization as well as standard audio processing operations. The encoder embeds a spread-spectrum watermark in successive short blocks of the host audio, and the watermark at each block is modulated with a binary ±1 sequence to improve the detection and suppress host signal correlation. The encoder resembles standard audio watermarking systems; therefore, it inherits their good properties, e.g., imperceptibility of the watermark and robustness to standard signal processing operations such as audio coding and filtering. The decoder applies a modulated self-correlation of successive blocks rather than the standard matched filter that uses cross-correlation with the embedded watermark (which requires perfect synchronization and knowledge of the acoustic channel at the decoder). Although self-correlation is not the optimal detector from a detection theory perspective, it effectively and blindly mitigates the impact of both reverberation and desynchronization at a low cost in both computation and latency, which enables real-time embedded implementation. The degradation in the detection performance is shown to be small for practical use cases.

The following notations are used throughout the paper. A bold lower-case letter denotes a column vector. $v_k$ denotes the $k$-th element of $\mathbf{v}$. $\mathbf{x}$ and $\tilde{\mathbf{x}}$ denote, respectively, the host and watermarked signals at the encoder, while $\mathbf{y}$ denotes the watermarked signal at the decoder. $\langle \cdot , \cdot \rangle$ denotes the inner product. Additional notations are introduced when needed.
2. BACKGROUND

2.1. Spread-Spectrum Watermarking
In the following, we assume that the watermark is embedded in selected DCT coefficients of audio blocks. The spread-spectrum watermarking procedure has the general form [1, 12, 13]

$$\tilde{\mathbf{x}} = \mathbf{x} + \eta\,\mathbf{w} \tag{1}$$

where $\eta$ is the watermark strength (which controls the audibility of the watermark). If $\mathbf{y}$ is the received signal in the DCT domain, then the standard spread-spectrum decoder uses cross-correlation of the form

$$\rho = \langle \mathbf{y}, \mathbf{w} \rangle \tag{2}$$

In the additive noise case, $\mathbf{y} = \mathbf{x} + \eta\,\mathbf{w} + \mathbf{n}$ (where $\mathbf{n}$ is the noise component), and

$$\langle \mathbf{y}, \mathbf{w} \rangle = \langle \mathbf{x}, \mathbf{w} \rangle + \langle \mathbf{n}, \mathbf{w} \rangle + \eta\,\|\mathbf{w}\|^2 \tag{3}$$

If the watermark is correlated with neither the signal nor the noise, then both $\langle \mathbf{x}, \mathbf{w} \rangle$ and $\langle \mathbf{n}, \mathbf{w} \rangle$ vanish and $\rho$ becomes proportional to the watermark energy. At the detector, $\rho$ is compared with a predetermined threshold, $\gamma$. If $\rho \geq \gamma$, then the watermark is detected at the decoder; otherwise, it is not detected.

2.2. Acoustic Channel Model

The acoustic propagation channel has a few sources of distortion: clock drift between the encoder and the decoder, sampling rate difference, loudspeaker behavior, room reverberation, and analog-to-digital and digital-to-analog distortion. The microphone impact is usually ignored because of its flat response over the frequencies of interest. The clock drift is measured in parts-per-million (ppm), and consumer-grade system clocks can have up to a few hundred ppm error. If the clock drift is … ppm, then at … kHz sampling frequency, the effective sampling frequency is … ± … Hz, which results in a time shift of up to 4.8 samples every second, and also a slight frequency shift. If an explicit synchronization procedure is used, then this clock drift must be estimated and corrected (through PLL-like systems [14]). The operating sampling rate difference could be mitigated by standardizing the sampling frequency at which the watermark is embedded or detected.
The other distortions can be broadly modeled as a slowly time-varying channel with additive noise, similar to fading channels in wireless communication:

$$y(t) = \sum_{\tau} h^{(t)}(\tau)\,\tilde{x}(t-\tau) + n(t) \tag{4}$$

where $\{h^{(t)}(\tau)\}$ is the time-varying impulse response, and $n(t)$ is the additive noise. In the frequency domain, we have

$$y^{(t)}(\omega_k) = h^{(t)}(\omega_k)\,\tilde{x}^{(t)}(\omega_k) + n^{(t)}(\omega_k) \tag{5}$$

where $y^{(t)}(\omega_k)$ is the frequency response of $y$ at audio frame $t$, and similarly for $\tilde{x}^{(t)}(\omega_k)$ and $n^{(t)}(\omega_k)$. If the channel change is slow compared to the watermark length, then in vector form, each DCT block can be represented as [15]

$$\mathbf{y} = \tilde{\mathbf{x}} \odot \boldsymbol{\alpha} + \mathbf{n} \tag{6}$$

where $\odot$ denotes element-wise vector multiplication, and $\boldsymbol{\alpha}$ is the channel representation in the DCT domain. If $\alpha_k$ changes amplitude and sign with the frequency index $k$ (which is the typical case), then spread-spectrum based audio watermark detection would fail. To see this, consider the cross-correlation factor in this case (assuming perfect synchronization)

$$\langle \mathbf{y}, \mathbf{w} \rangle \approx \langle \mathbf{x} \odot \boldsymbol{\alpha}, \mathbf{w} \rangle + \eta\,\langle \mathbf{w} \odot \boldsymbol{\alpha}, \mathbf{w} \rangle \approx \eta \sum_{k} \alpha_k |w_k|^2 \tag{7}$$

If the sign of $\alpha_k$ changes with $k$, then the cross-correlation becomes close to the noise level, and detection fails. In this case, the optimal detector requires knowledge of the channel at the receiver, which is the common approach in wireless communication. The estimation is performed by transmitting a known pilot signal at the start of each frame, which is used for system identification at the receiver. The channel estimation procedure requires perfect synchronization, and it is an expensive procedure in both computation and latency.
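The failure mode described by (7) can be reproduced numerically. The following is a minimal sketch, not the authors' implementation: it embeds a ±1 watermark as in (1), then compares the cross-correlation of (2) for an ideal channel against a channel of the form (6) whose coefficients have random signs. The block size `K`, the strength `eta`, the white Gaussian stand-in for the host DCT coefficients, and the distribution of `alpha` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
K = 4096                         # DCT coefficients per block (assumed)
eta = 0.5                        # watermark strength (assumed)

x = rng.standard_normal(K)            # stand-in for host DCT coefficients
w = rng.choice([-1.0, 1.0], size=K)   # pseudo-random +/-1 watermark
x_wm = x + eta * w                    # embedding, as in (1)


def cross_corr(y, w):
    """Cross-correlation <y, w> of (2), normalized by the energy ||w||^2."""
    return float(np.dot(y, w) / np.dot(w, w))


noise = 0.1 * rng.standard_normal(K)

# Ideal channel (alpha_k = 1 for all k): the score stays close to eta, cf. (3).
rho_clean = cross_corr(x_wm + noise, w)

# Channel of the form (6): per-coefficient gain whose sign changes with k.
alpha = rng.choice([-1.0, 1.0], size=K) * rng.uniform(0.5, 1.0, size=K)
rho_reverb = cross_corr(x_wm * alpha + noise, w)

print(f"ideal channel:        rho = {rho_clean:.3f}")
print(f"sign-changing channel: rho = {rho_reverb:.3f}")
```

With the ideal channel the normalized score stays near `eta`, while with the sign-changing channel the sum in (7) averages out and the score falls to the level of the host-signal interference.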
3. WATERMARKING SYSTEM

3.1. Watermark Design

If $\mathbf{H}$ is a full-rank symmetric matrix of size $\kappa$, then its eigenvectors $\{\mathbf{v}_i\}$ are real and constitute an orthonormal basis for $\mathbb{R}^{\kappa}$ [16]. Let the watermark $\mathbf{w}$ be chosen as one of the eigenvectors, e.g., $\mathbf{v}_1$ (without loss of generality). The host signal block, $\mathbf{x}$ in (1), can be expressed as

$$\mathbf{x} = \sum_{l} a_l \mathbf{v}_l \tag{8}$$

where $a_l = \langle \mathbf{x}, \mathbf{v}_l \rangle$. In this case, the cross-correlation between the host signal and the watermark becomes $\langle \mathbf{x}, \mathbf{w} \rangle = a_1$. This constitutes the detection noise floor in the noiseless case (i.e., when $\mathbf{n} = \mathbf{0}$ in (3)). To completely remove this noise floor in the noiseless case, the host signal is slightly modified to remove its projection onto the watermark subspace. If we choose $\mathbf{w} = \mathbf{v}_1$, then the watermark embedding equation in (1) is modified to

$$\tilde{\mathbf{x}} = \mathbf{x} - \langle \mathbf{x}, \mathbf{v}_1 \rangle \mathbf{v}_1 + \eta\,\mathbf{v}_1 = \bar{\mathbf{x}} + \eta\,\mathbf{v}_1, \quad \text{where } \bar{\mathbf{x}} \triangleq \mathbf{x} - \langle \mathbf{x}, \mathbf{v}_1 \rangle \mathbf{v}_1 \tag{9}$$

In Fig. 1, we show that eigen watermarking significantly improves the ROC of the detector and has a good operating point against the simulated reverberation effect.

Fig. 1. The ROC (detection rate versus false accept rate) of (a) the 'traditional SS watermark' and (b) the 'eigen watermark'. Each data point is averaged over 1000+ audio pieces embedded with the watermark, with a simulated reverberation effect added to the watermarked audio source.

In order for the embedding algorithm to be robust to standard audio processing operations, e.g., filtering and compression, the embedding is restricted to mid-range DCT coefficients, i.e., in the range $k_L$ to $k_H$. Henceforth, the inner product definition is

$$\langle \mathbf{a}, \mathbf{b} \rangle \triangleq \sum_{k=k_L}^{k_H} a_k b_k \tag{10}$$

The central idea of the proposed system is using self-correlation at the detector, rather than cross-correlation as in standard watermarking detectors. As noted in (7), the cross-correlation with the watermark template requires perfect synchronization and perfect knowledge of the acoustic channel; otherwise, it will be smeared by the alternating sign of the channel response.
This stringent requirement is relaxed if self-correlation is used as described in this section.

Let $\mathbf{y}_a$ and $\mathbf{y}_b$ be two adjacent DCT blocks of the received signal; then self-correlation is defined as

$$\psi \triangleq \langle \mathbf{y}_a, \mathbf{y}_b \rangle \tag{11}$$

The term self-correlation is used rather than autocorrelation to emphasize that it is always between different blocks. If each block corresponds to an embedded watermarked block as in (1) after passing through the acoustic channel in (6), then

$$\psi = \langle \tilde{\mathbf{x}}_a \odot \boldsymbol{\alpha} + \mathbf{n}_a,\; \tilde{\mathbf{x}}_b \odot \boldsymbol{\alpha} + \mathbf{n}_b \rangle \approx \sum_{k} \alpha_k^2\, x_{a,k}\, x_{b,k} + \eta^2 \sum_{k} \alpha_k^2\, w_{a,k}\, w_{b,k} + \sum_{k} n_{a,k}\, n_{b,k} \tag{12}$$

where we assumed that the channel behavior does not change for adjacent blocks, and in the approximation we invoked the assumption of the absence of correlation between the watermark, signal, and additive noise. If the additive noise is zero-mean (which is usually the case), then the last term in (12) vanishes. If adjacent audio blocks are weakly correlated, then the first term in (12) is much weaker than the watermark component (which is the second term in (12)), and this would improve detection. However, the first term might become significant if a music chord is present in the host signal, and that increases the noise floor.

Note that, by employing self-correlation, the impact of the acoustic channel is neutralized (by making the channel contribution nonnegative) at the cost of a higher noise floor due to the host signal self-correlation. The noise floor is significantly reduced through the sign modulation scheme that is described in the following section.

Fig. 2. Illustration of the bi-layered watermark encoding structure (a layer of random signs on top of a layer of eigenvectors) with $N_r = 2$, $N_s = 6$.

The second central component in the proposed watermarking system is the sign modulation of adjacent blocks in the host signal.
A second encoding layer utilizes a sequence of ±1 to modify the binary phase of the watermark in each block. The entire encoded audio sequence can be expressed as

$$\tilde{\mathbf{x}} = \bigoplus_{n=1}^{N_r} \bigoplus_{i=1}^{N_s} \left( \mathbf{x}_{n,i} + \beta\, s_{n,i}\, g_{n,i}\, \mathbf{w}_i \right) \tag{13}$$

where $\bigoplus$ denotes block concatenation, $N_s$ denotes the number of segments of basic watermark building blocks, $N_r$ denotes the number of repeats of the set of segments, $\mathbf{x}_{n,i}$ is the $i$-th audio block of the $n$-th segment of the host audio, $\beta$ is the encoding strength, $g_{n,i} \triangleq \sqrt{\langle \mathbf{x}_{n,i}, \mathbf{x}_{n,i} \rangle}$ is the segment normalization factor, and $s_{n,i}$ is a ±1 random sequence. Note that the watermark strength at each block is proportional to the signal strength, i.e., $\eta_{n,i} = \beta g_{n,i}$, and mutually orthogonal watermarks $\{\mathbf{w}_i\}_{1 \leq i \leq N_s}$ are inserted at each block. An illustration of this encoding process is shown in Fig. 2. Note that each block within the segment is modulated by a random sign that will be incorporated at the decoder. Different keys could be used for the generation of the watermark and the sign sequence to allow for increased accuracy or multiple access watermarking.

The decoder modifies the self-correlation procedure in (11) to accommodate the multilayered embedding in (13). The multilayered self-correlation has the form

$$\rho(t) = \sum_{i=1}^{N_s} \sum_{n=1}^{N_r-1} \sum_{m=n+1}^{N_r} \frac{s_{m,i}\, s_{n,i}\, \langle \mathbf{y}_{m,i}, \mathbf{y}_{n,i} \rangle}{h_{m,i}\, h_{n,i}} \tag{14}$$

where $\rho(t)$ is the watermark decoding score, $\mathbf{y}_{n,i}$ is the $i$-th block of the $n$-th audio segment at the receiver, and $h_{m,i} \equiv \sqrt{\langle \mathbf{y}_{m,i}, \mathbf{y}_{m,i} \rangle}$ is a normalization factor for the received audio block $\mathbf{y}_{m,i}$.

Note that, with this sign modulation arrangement in the encoder and the decoder, the watermark component in (12) is invariant, while the signal component is effectively suppressed. Assuming for now that we have perfect synchronization, we will describe how $\rho$ in (14) behaves under signal and null hypotheses.
Let

$$\begin{aligned} \psi_{m,n,i} &\triangleq s_{m,i}\, s_{n,i}\, \langle \mathbf{y}_{m,i}, \mathbf{y}_{n,i} \rangle \\ &= s_{m,i}\, s_{n,i}\, \langle \bar{\mathbf{x}}_{m,i} \odot \boldsymbol{\alpha} + \mathbf{n},\; \bar{\mathbf{x}}_{n,i} \odot \boldsymbol{\alpha} + \mathbf{n} \rangle \\ &\quad + \beta^2\, s_{m,i}^2\, s_{n,i}^2\, g_{m,i}\, g_{n,i}\, \langle \mathbf{w}_i \odot \boldsymbol{\alpha},\; \mathbf{w}_i \odot \boldsymbol{\alpha} \rangle \\ &\quad + \beta\, s_{m,i}\, s_{n,i}^2\, g_{n,i}\, \langle \bar{\mathbf{x}}_{m,i} \odot \boldsymbol{\alpha} + \mathbf{n},\; \mathbf{w}_i \odot \boldsymbol{\alpha} \rangle \\ &\quad + \beta\, s_{m,i}^2\, s_{n,i}\, g_{m,i}\, \langle \mathbf{w}_i \odot \boldsymbol{\alpha},\; \bar{\mathbf{x}}_{n,i} \odot \boldsymbol{\alpha} + \mathbf{n} \rangle \end{aligned} \tag{15}$$

Note that the fractional delay of the block boundaries can be represented as part of the channel. Under the null hypothesis $H_0$ (i.e., no watermark, $\beta = 0$), we get the noise signature

$$\rho_0(t) = \sum_{i=1}^{N_s} \sum_{n=1}^{N_r-1} \sum_{m=n+1}^{N_r} \frac{s_{m,i}\, s_{n,i}\, \langle \bar{\mathbf{x}}_{m,i} \odot \boldsymbol{\alpha} + \mathbf{n},\; \bar{\mathbf{x}}_{n,i} \odot \boldsymbol{\alpha} + \mathbf{n} \rangle}{h_{m,i}\, h_{n,i}} \tag{16}$$

Under the signal hypothesis $H_1$ (i.e., the watermark exists), and after invoking the assumption of non-correlation between signal, watermark, and noise, we get the signal signature (note that $s_{m,i}^2 = s_{n,i}^2 = 1$)

$$\rho_1(t) = \rho_0(t) + \sum_{i=1}^{N_s} \sum_{n=1}^{N_r-1} \sum_{m=n+1}^{N_r} \frac{\beta^2\, g_{m,i}\, g_{n,i}\, \langle \mathbf{w}_i \odot \boldsymbol{\alpha},\; \mathbf{w}_i \odot \boldsymbol{\alpha} \rangle}{h_{m,i}\, h_{n,i}} \tag{17}$$

The modulated self-correlation behavior under both hypotheses is illustrated in Fig. 3.

Fig. 3. Illustration of the modulated self-correlation score, $\rho(t)$, under signal/null hypotheses: the score at $t = t_{\text{noise}}$ (noise) and at $t = t_{\text{signal}}$ (signal) is separated by a gap.

The difficulty of decoding with self-correlation is that the mean under $H_1$ and the variance under both hypotheses are dependent on the unknown channel parameters $\boldsymbol{\alpha}$. Nevertheless, as illustrated in Fig. 4, there is more than … dB difference in the mean under both hypotheses, which provides flexibility to choose the detection threshold with good overall performance. In our system, the detection threshold is set to be significantly higher than the noise floor over a long period of time.
Hence, the detection threshold itself is a function of the acoustic channel.

The above discussion assumed synchronization was achieved prior to self-correlation. In the worst case, synchronization could be achieved at the sample level by brute-force computation of $\rho(t)$ (where fractional delay is absorbed in the channel response). Nevertheless, it was found that the modulated self-correlation mechanism tolerates imperfect alignment (roughly ± … of the eigenvector length) with an acceptable detection rate. For example, if we take the eigenvector to be … ms long, then ± … ms misalignment can be tolerated. This is due to the blind detection procedure that parameterizes the detection parameters with the channel, and tolerable misalignments can be modeled as part of the channel. The tradeoff between complexity and performance could be further exploited by incorporating only a subset of segments (out of $N_r$) and blocks (out of $N_s$) in (14).

Fig. 4. Histogram of the detection score in (14) under null and signal hypotheses (with and without reverberation), averaged over 20,000 audio streams.

The overall detection procedure proceeds as follows (where $\rho(t)$ is computed as in the previous section):

1. Calculate the noise mean throughout the noise region,

$$\bar{\rho} \equiv \frac{1}{\Delta_n} \sum_{t=t_n}^{t_n+\Delta_n-1} \rho(t) \tag{18}$$

2. Calculate the channel-dependent noise variance,

$$\sigma^2 \equiv \frac{1}{\Delta_n} \sum_{t=t_n}^{t_n+\Delta_n-1} |\rho(t) - \bar{\rho}|^2 \tag{19}$$

3. Set the detection threshold, $\gamma$, at the desired point on the ROC curve, e.g., $\gamma = 3\sigma$.

4. Calculate the modulated self-correlation factor with the noise-mean correction,

$$\bar{\rho}(t) \equiv \frac{1}{\Delta_s} \sum_{\tau=0}^{\Delta_s-1} \left( \rho(t+\tau) - \bar{\rho} \right) \tag{20}$$

5. The detector operates as

$$\varepsilon(t) = \begin{cases} 0, & \text{for } \bar{\rho}(t) < \gamma, \\ 1, & \text{for } \bar{\rho}(t) \geq \gamma \end{cases} \tag{21}$$

Note that the frequency of computing $\bar{\rho}(t)$ and $\varepsilon(t)$ is determined by the available computation resources.
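The encoder in (9) and (13), the decoder score in (14), and the threshold procedure in (18)-(21) can be put together in a short simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the host blocks are white Gaussian stand-ins for mid-band DCT coefficients, the channel is a fixed vector `alpha` with random signs as in (6), blocks are perfectly aligned, and the values of `K`, `N_s`, `N_r`, `beta`, the noise level, and the single-sample averaging window are all chosen for illustration. The helper names (`encode`, `channel`, `score`, `host`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N_s, N_r = 512, 6, 4      # band size, blocks per segment, repeats (assumed)
beta = 0.3                   # encoding strength (assumed, deliberately strong)

# Mutually orthogonal watermarks: eigenvectors of a random symmetric matrix.
A = rng.standard_normal((K, K))
_, V = np.linalg.eigh(A + A.T)
W = V[:, :N_s].T                              # W[i] is w_i, unit norm
S = rng.choice([-1.0, 1.0], size=(N_r, N_s))  # sign key s_{n,i}


def encode(X):
    """Embed per (9) and (13); X has shape (N_r, N_s, K)."""
    Y = np.empty_like(X)
    for n in range(N_r):
        for i in range(N_s):
            x, w = X[n, i], W[i]
            g = np.sqrt(x @ x)              # block normalization g_{n,i}
            x_bar = x - (x @ w) * w         # remove projection, as in (9)
            Y[n, i] = x_bar + beta * S[n, i] * g * w
    return Y


def channel(X, alpha, noise_std=0.05):
    """Element-wise channel of (6), shared by all blocks, plus noise."""
    return X * alpha + noise_std * rng.standard_normal(X.shape)


def score(Y):
    """Modulated self-correlation score rho of (14)."""
    rho = 0.0
    for i in range(N_s):
        for n in range(N_r - 1):
            for m in range(n + 1, N_r):
                ym, yn = Y[m, i], Y[n, i]
                h = np.sqrt(ym @ ym) * np.sqrt(yn @ yn)
                rho += S[m, i] * S[n, i] * (ym @ yn) / h
    return rho


def host():
    """White Gaussian stand-in for host DCT blocks."""
    return rng.standard_normal((N_r, N_s, K))


# Channel gains with random sign per coefficient, as discussed after (6).
alpha = rng.choice([-1.0, 1.0], size=K) * rng.uniform(0.5, 1.0, size=K)

# Steps 1-3: noise mean (18), noise deviation (19), threshold gamma = 3 sigma.
null_scores = np.array([score(channel(host(), alpha)) for _ in range(200)])
rho_bar, sigma = null_scores.mean(), null_scores.std()
gamma = 3.0 * sigma

# Steps 4-5 with a single-sample window (Delta_s = 1, assumed), cf. (20)-(21).
rho_wm = score(channel(encode(host()), alpha))
detected = int(rho_wm - rho_bar >= gamma)
print(f"noise mean {rho_bar:.3f}, sigma {sigma:.3f}, "
      f"watermarked score {rho_wm:.3f}, detected {detected}")
```

In this sketch the watermark term survives the sign-changing channel because each `alpha` coefficient enters squared, while the sign key decorrelates the host contribution. Real audio is far less white than the Gaussian stand-in, so the separation between the two hypotheses seen here should not be read as a performance estimate.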
4. EXPERIMENTAL RESULTS
The proposed algorithm was implemented in Python for a real-time demo with consumer-grade loudspeakers and microphones. We carried out an extensive evaluation to compute the Receiver Operating Characteristic (ROC) curve (which fully captures the detector performance [17]) under different room environments and audio processing attacks. For the false accept rate calculation, we scan through non-watermarked audio of duration ∼ … min every 5 milliseconds. For the detection part, the watermark is inserted every … seconds in the same host audio, i.e., ∼ … watermarks.

The system was first evaluated against standard audio processing operations, e.g., lowpass filtering, highpass filtering, and mp3 compression. It showed the standard robust performance of spread-spectrum systems [13]. The subjective quality of the watermarked audio was evaluated by expert listeners and was shown to be indistinguishable from the original audio. Both the encoder and the decoder run in real time, and the latency is only due to the audio block buffering delay.

Next, we evaluate the robustness of the proposed system to room reverberation. The ROC curve is computed under different room conditions, and the results are averaged for different sizes of the embedded watermark, which is also proportional to the overall system latency. Multiple watermarks with different durations are simultaneously inserted in the host audio. This has a minor impact because the watermarks are mutually orthogonal. In evaluating the ROC curve, we applied measured reverberation filters to the watermarked audio prior to the detector. Fig. 5a shows the ROC versus watermark length (with no clock drift between the encoder and the decoder). In the figure, we zoomed in on the horizontal axis of the ROC curve because of the almost perfect behavior when the watermark duration is longer than … second.

Fig. 5. (a) The ROC (detection rate versus false accept rate) with reverberation for different watermark lengths, where we set $N_s = 2$; (b) the ROC with reverberation and clock drift (in ppm), where the watermark length is fixed to 1 sec with $N_s = 2$.

Finally, we evaluated the combined impact of clock drift and reverberation. In this experiment, both the encoder and the decoder run at the same sampling frequency, but the decoder clock is perturbed by different ppm values, and the decoder is run without clock correction. The resulting ROC behavior is shown in Fig. 5b, where we also zoomed in on the horizontal axis to clarify the behavior. As noted from the figure, the performance is robust to clock drift up to ∼ … ppm when the watermark duration is 1 second.
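The ROC evaluation described above reduces to sweeping a threshold over two sets of detector scores: scores from scans of non-watermarked audio (for the false accept rate) and scores at watermark insertion points (for the detection rate). The following is a minimal sketch of that bookkeeping. The function name `roc_curve` is hypothetical, and the Gaussian scores below are synthetic placeholders used only to exercise the function; they are not results from this work.

```python
import numpy as np


def roc_curve(null_scores, signal_scores):
    """Return (false_accept_rate, detection_rate) over all thresholds.

    A score counts as a detection when it is >= the threshold; thresholds
    are swept from the largest observed score down to the smallest.
    """
    null_scores = np.asarray(null_scores, dtype=float)
    signal_scores = np.asarray(signal_scores, dtype=float)
    thresholds = np.sort(np.concatenate([null_scores, signal_scores]))[::-1]
    far = np.array([(null_scores >= g).mean() for g in thresholds])
    dr = np.array([(signal_scores >= g).mean() for g in thresholds])
    return far, dr


rng = np.random.default_rng(2)
null_scores = rng.normal(0.0, 1.0, size=5000)    # synthetic placeholder scores
signal_scores = rng.normal(4.0, 1.0, size=500)   # synthetic placeholder scores

far, dr = roc_curve(null_scores, signal_scores)

# Detection rate at the operating point with at most 1% false accepts.
print(f"detection rate at FAR <= 1%: {dr[far <= 0.01].max():.3f}")
```

Because false accepts are counted per scan position, a fine scan step (such as the 5 ms step mentioned above) yields a large number of null scores, which is what makes very low false accept rates measurable.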
5. CONCLUSION
With the complicated indoor acoustic channel, there are two choices for successful audio watermarking. The first choice is to estimate and equalize the channel prior to applying a standard detector. The second choice is to restructure the detector to neutralize the channel impact without explicit channel estimation. The first choice is optimal from a performance perspective, but it is expensive in computation and latency. The second choice, which we adopted in this work, is more appropriate for real-time embedded systems despite its suboptimal detector. The suboptimal performance could always be enhanced by using a longer or stronger watermark. The proposed novel detector utilizes modulated self-correlation between adjacent audio blocks, which effectively neutralizes the indoor channel impact and eliminates the need for explicit channel estimation. The experimental results showed the robustness of the system when the watermark duration is greater than … second under general reverberation conditions, and with clock drift up to … ppm.

In future work, we will describe a dynamic programming algorithm to prune the synchronization search by an order of magnitude with negligible impact on the detection performance. Future work also includes deploying microphone arrays [18] to improve self-correlation and reduce the reverberation impact.

6. REFERENCES

[1] Mitchell D Swanson, Bin Zhu, and Ahmed H Tewfik, “Audio watermarking and data embedding–current state of the art, challenges and future directions,” in Multimedia and Security Workshop at ACM Multimedia. Citeseer, 1998, vol. 41.
[2] Guang Hua, Jiwu Huang, Yun Q Shi, Jonathan Goh, and Vrizlynn LL Thing, “Twenty years of digital audio watermarking–a comprehensive review,” Signal Processing, vol. 128, pp. 222–242, 2016.
[3] Pablo Cesar, Dick CA Bulterman, and Jack Jansen, “Leveraging user impact: an architecture for secondary screens usage in interactive television,” Multimedia Systems, vol. 15, no. 3, pp. 127–142, 2009.
[4] Heinrich Kuttruff, Room acoustics, CRC Press, 2016.
[5] Mohamed F Mansour and Ahmed H Tewfik, “Time-scale invariant audio data embedding,” EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 993–1000, 2003.
[6] Chi-Man Pun and Xiao-Chen Yuan, “Robust segments detector for de-synchronization resilient audio watermarking,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2412–2424, 2013.
[7] Xiang-Yang Wang and Hong Zhao, “A novel synchronization invariant audio watermarking scheme based on DWT and DCT,” IEEE Transactions on Signal Processing, vol. 54, no. 12, pp. 4835–4840, 2006.
[8] Yong Xiang, Iynkaran Natgunanathan, Song Guo, Wanlei Zhou, and Saeid Nahavandi, “Patchwork-based audio watermarking method robust to de-synchronization attacks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 9, pp. 1413–1423, 2014.
[9] Andrew Nadeau and Gaurav Sharma, “An audio watermark designed for efficient and robust resynchronization after analog playback,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1393–1405, 2017.
[10] Giovanni Del Galdo, Juliane Borsum, Tobias Bliem, Alexandra Craciun, and Stefan Krägeloh, “Audio watermarking for acoustic propagation in reverberant environments,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 2364–2367.
[11] Xia Zhang, Di Chang, Wanyi Yang, Qian Huang, Wei Guo, and Yanbin Zhao, “An audio digital watermarking algorithm transmitted via air channel in double DCT domain,” in Multimedia Technology (ICMT), 2011 International Conference on. IEEE, 2011, pp. 2926–2930.
[12] Ingemar J Cox, Joe Kilian, Tom Leighton, and Talal Shamoon, “Secure spread spectrum watermarking for images, audio and video,” in Image Processing, 1996. Proceedings., International Conference on. IEEE, 1996, vol. 3, pp. 243–246.
[13] Darko Kirovski and Henrique S Malvar, “Spread-spectrum watermarking of audio signals,” IEEE Transactions on Signal Processing, vol. 51, no. 4, pp. 1020–1033, 2003.
[14] Roland E Best, Phase locked loops: design, simulation, and applications, McGraw-Hill Professional, 2007.
[15] Stephen A Martucci, “Symmetric convolution and the discrete sine and cosine transforms,” IEEE Transactions on Signal Processing, vol. 42, no. 5, pp. 1038–1051, 1994.
[16] Roger A Horn and Charles R Johnson, Matrix analysis, Cambridge University Press, 1990.
[17] Steven M Kay, “Fundamentals of statistical signal processing, vol. II: Detection theory,” Signal Processing. Upper Saddle River, NJ: Prentice Hall, 1998.
[18] Amit Chhetri, Philip Hilmes, Trausti Kristjansson, Wai Chu, Mohamed Mansour, Xiaoxue Li, and Xianxian Zhang, “Multichannel Audio Front-End for Far-Field Automatic Speech Recognition,” in 2018 European Signal Processing Conference (EUSIPCO)