A Time-Frequency Perspective on Audio Watermarking

Abstract

Existing audio watermarking methods usually treat the host audio signal as a function of time or of frequency individually, while treating it in the joint time-frequency (TF) domain has received less attention. This paper proposes an audio watermarking framework from the perspective of TF analysis. The proposed framework treats the host audio signal in the 2-dimensional (2D) TF plane, and selects a series of patches within the 2D TF image. These patches correspond to the TF clusters with minimum average energy, and are used to form the feature vectors for watermark embedding. Classical spread spectrum embedding schemes are incorporated in the framework. The feature patches that carry the watermarks occupy only a few TF regions of the host audio signal, which leads to improved imperceptibility. In addition, since each feature patch contains a neighborhood area of the TF representation of audio samples, the correlations among the samples within a single patch can be exploited for improved robustness against a series of processing attacks. Extensive experiments are carried out to illustrate the effectiveness of the proposed system compared to its counterpart systems. The aim of this work is to shed some light on the notion of audio watermarking in the TF feature domain, which may potentially lead to watermarking solutions that are more robust against malicious attacks.


TIME-FREQUENCY ANALYSIS AND ITS APPLICATIONS


Haijian Zhang


Index Terms: Audio watermarking, feature domain watermarking, time-frequency domain watermarking, short-time Fourier transform, imperceptibility, robustness.

I. INTRODUCTION

With the rapid development of modern communication and multimedia technologies, the dissemination and processing of digital multimedia products are becoming more and more popular, which inevitably gives rise to a variety of piracy and infringement issues. Watermarking techniques have received significant research attention as a means to efficiently protect the copyright of digital multimedia products [1]. While watermarks can be embedded into media formats including but not limited to document, image, audio, and video, in this paper we focus on watermarking for audio signals, which are functions of time.

Recently, the authors in [2] reviewed the research, development, and commercialization achievements of digital audio watermarking technology over the past twenty years. Generally, existing audio watermarking techniques can be classified according to the domains in which the watermarks are embedded. More specifically, time domain methods either modify the raw audio samples frame by frame [3]–[5], or change the histogram of host audio signals [6]. On the other hand, transform domain methods, which have received much more research attention, can be classified into spread spectrum (SS) [7]–[10], patchwork [11], quantization index modulation (QIM) [12], and a special case based on over-complete transform dictionaries [13], [14].

It can be seen from the literature that although audio watermarking solutions have been extensively studied in the time domain or in a transform domain individually, less effort has been devoted to the case in which the host audio signal is analyzed and represented jointly in the time-frequency (TF) domain, based on the well established TF analysis techniques (transforms) [15]–[18]. TF analysis is a generalization of Fourier analysis for the case when the signal frequency characteristics are time-varying. Since many practical signals of interest, such as speech and music, have varying frequency characteristics, TF analysis has a broad scope of applications. One of the most basic forms of TF analysis is the short-time Fourier transform (STFT) [19], [20], while more sophisticated techniques have also been developed [21]–[23]. In [24], [25], the authors proposed two efficient approaches to speech watermarking based on the STFT and the S-method [26].

In this paper, we discuss audio watermarking from the perspective of time-frequency analysis, and propose an audio watermarking framework based on the fact that audio signals are a function of time. Specifically, we propose to embed the watermark signal into a set of low-energy points in the TF representation, which correspond to noise-only or silent segments in audio signals. The selected points form non-overlapping 2-dimensional (2D) feature frames, each of which is composed of TF domain samples across multiple time and frequency bins. To achieve this purpose, a method to automatically determine these low-energy TF positions is introduced, and the energy invariance before and after watermark embedding is exploited.

H. Zhang is with the Signal Processing Laboratory, School of Electronic Information, Wuhan University, China.
The proposed scheme only modifies a few frames within the feature space, while other frames are kept intact. The imperceptibility of the watermarking system can hence be improved, in that the portions of the host audio signal containing strong audio content are less modified. Therefore, while the robustness against host signal interference is ensured via the use of the improved spread spectrum (ISS) method [9], the inability of the ISS method to control imperceptibility is remedied by the proposed localized watermark embedding. Furthermore, the proposed system enjoys improved robustness against a series of signal processing attacks including adding noise, amplitude scaling, and lossy compression, thanks to the appropriately designed 2D embedding frames. In general, the proposed framework can be considered as a design for feature domain audio watermarking, in which the features correspond to the appropriately selected embedding locations. We concretize the above framework via realizations based on the basic STFT and a similarly formed short-time cosine transform (STCT), with both SS and ISS embedding and extraction mechanisms. Conventional SS and ISS schemes with a uniform embedding rule across different frequency bands are also implemented for comparison. Extensive experiments are carried out to evaluate the proposed framework and demonstrate its performance advantages.

arXiv preprint [cs.MM].

Fig. 1. Commonly used and the proposed watermark embedding schemes: (a) Transform after framing. (b) Transform before framing. (c) Proposed scheme.

II. AUDIO WATERMARKING IN TF FEATURE DOMAIN

A. General Frameworks

We use the following notation in this paper. The host audio signal is denoted by the vector $\mathbf{x} \in \mathbb{R}^{N \times 1}$, and its $i$th frame after time domain non-overlapping framing is $\mathbf{x}_i \in \mathbb{R}^{M \times 1}$, where $N$ is the number of samples of the host signal, $M$ is the number of samples per frame, and $i \in \{0, 1, \ldots, \lceil N/M \rceil - 1\}$, where $\lceil \cdot \rceil$ is the ceiling function. Similarly, the host audio signal in the transform domain is denoted by $\mathbf{y}$, with the same length as $\mathbf{x}$, and the $i$th frame of $\mathbf{y}$ after transform domain non-overlapping framing is $\mathbf{y}_i$. We use the subscript $\{\cdot\}_w$ to denote the watermarked version of a signal; thus the representations of the watermarked signal in the time and transform domains, in terms of frame and ensemble, are denoted by $\mathbf{x}_{w,i}$, $\mathbf{x}_w$, $\mathbf{y}_{w,i}$, and $\mathbf{y}_w$, respectively. Since this paper mainly utilizes SS based watermark embedding and extraction mechanisms, the corresponding spreading sequence is a pseudo-random noise sequence $\mathbf{p} \in \{+1, -1\}^{L \times 1}$. In conventional full spectrum SS settings, we have $L = M$, so that the spreading sequence can be additively embedded into host signal frames.

Audio watermark embedding in the transform domain can be carried out under two basic schemes, i.e., transform after and before framing, which are shown in Fig. 1 (a) and (b). Transform after framing is the most widely adopted processing flow in the existing literature, which is summarized in [2]. On the other hand, the host audio signal has also been treated as a single frame to calculate its corresponding transform domain representation, directly obtaining $\mathbf{y}$ from $\mathbf{x}$. Good examples of such works can be found in [10], [11], [27]. In this paper, we propose an alternative watermark embedding framework as depicted in Fig. 1 (c), which is similar to the transform before framing case. Specifically, instead of calculating the transform $\mathbf{y}$, we first obtain the TF representation of the host signal $\mathbf{x}$, denoted by the matrix $\mathbf{Y} \in \mathbb{C}^{M \times \lceil N/M \rceil}$, composed of both time and frequency bins.
In this way, the proposed framework differs from most existing ones by considering watermark embedding based on a 2D TF image. Furthermore, the TF domain features, denoted by $\mathbf{f}_i \in \mathbb{C}^{W^2 \times 1}$ (where $W$ is a window dimension), are selected as the patches with low energy values, which correspond to noise-only or silent locations. One of the advantages of the proposed framework over the similar framework in Fig. 1 (b) is that the modification of the host signal in one area does not affect the host signal in other areas, while for a system as in Fig. 1 (b), any modification of transform domain samples changes all samples in the time domain. In addition, since each of the selected feature vectors contains multiple time and frequency bins, the correlation of the host signal across both different time intervals and frequency ranges is considered for watermark embedding, which can lead to improved robustness against a series of processing operations and attacks. Details of the proposed framework are provided in the next subsection.

Fig. 2. Heuristic tuning mechanism to control imperceptibility based on ODG. The dashed line is validated only if the decision is "Yes" (stopping criterion).

B. Watermark Embedding and Extraction Schemes

In this subsection, the STFT is used as an example for TF analysis. To ensure well controlled imperceptibility, heuristic tuning is incorporated in the embedding scheme: the objective difference grade (ODG) [28] quantifies the current imperceptibility condition, and the watermark embedding strength parameter $\alpha$ is adjusted according to the feedback of the ODG value. The ODG value is a real non-positive number in the interval $[-4, 0]$, with grades $\{-4, -3, -2, -1, 0\}$, where $0$ corresponds to imperceptible and $-4$ corresponds to very annoying. The heuristic tuning mechanism is depicted in Fig. 2. The proposed watermark embedding scheme is detailed as follows.

1) Partition $\mathbf{x}$ into non-overlapping frames $\mathbf{x}_i$ with $M$ samples. Perform the Hilbert transform on $\mathbf{x}_i$ to remove the symmetry of the frequency spectrum over the $2\pi$ frequency range [29]. For simplicity of notation, we still use $\mathbf{x}_i$ to denote the Hilbert transform output,

$$\mathbf{x}_i \in \mathbb{C}^{M \times 1} \leftarrow \mathrm{Hilbert}(\mathbf{x}_i), \quad i = 0, 1, \cdots, \lceil N/M \rceil - 1, \tag{1}$$

but note that $\mathbf{x}_i$ now consists of complex quantities. Note: the frames cannot overlap, because otherwise the watermarked patches would affect multiple overlapped frames and the inverse transform would become unstable. This is slightly different from the TF analysis literature. However, such a treatment does not cause performance degradation, since we are not interested in the resolution or accuracy of TF analysis in the context of watermarking.

2) Compute the non-symmetric STFT of the host audio signal $\mathbf{x}$, and obtain the TF representation $\mathbf{Y}$. Specifically, perform the fast Fourier transform (FFT) for each frame,

$$\mathbf{y}_i = \mathbf{H}\mathbf{x}_i, \tag{2}$$

where $\mathbf{H}$ is the orthonormal FFT matrix. Then we have

$$\mathbf{Y} = [\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{\lceil N/M \rceil - 1}]. \tag{3}$$

Further, select the low to middle frequency bins bounded by $f_1$ and $f_2$, e.g., $f_1 = 60$ Hz and $f_2 = 2800$ Hz, as the feasible watermark embedding region. The exact dimension, $M_1$, resulting from this process depends on $f_1$, $f_2$, the sampling frequency, and the length of the FFT. We simply denote the refined 2D TF image as $\tilde{\mathbf{Y}} \in \mathbb{C}^{M_1 \times \lceil N/M \rceil}$. Therefore, vertically, the $M_1$ samples correspond to the frequency region $[f_1, f_2]$.

3) Partition $\tilde{\mathbf{Y}}$ into square patches using a $W \times W$ window, and index the patches in raster scanning order. Usually, we have

$$W < M_1 < M \ll N. \tag{4}$$

For convenience, we further assume $W$ is chosen such that both $M_1$ and $\lceil N/M \rceil$ are divisible by $W$; hence there is no residual after partition, and $M_1 \lceil N/M \rceil / W^2$ patches are obtained in total.

4) Calculate the average energy of each patch. Denote each patch as $\mathbf{P}_j$, $j \in \{0, 1, \ldots, M_1 \lceil N/M \rceil / W^2 - 1\}$; then the average energy is given by

$$E_j = \frac{1}{W^2} \sum_{m_1=0}^{W-1} \sum_{m_2=0}^{W-1} \left| \mathbf{P}_j(m_1, m_2) \right|^2. \tag{5}$$

5) Sort $E_j$ in ascending order. According to the binary payload vector $\mathbf{w} \in \{+1, -1\}^{P \times 1}$, where usually

$$P < M_1 \lceil N/M \rceil / W^2, \tag{6}$$

select the first $P$ patches with minimum average energies as features for watermark embedding. Vectorize the selected patches into feature vectors $\mathbf{f}_i \in \mathbb{C}^{W^2 \times 1}$, $i \in \{0, 1, \ldots, P - 1\}$. Next, the embedding order is from top to bottom and column-wise, i.e., the feature patches in the first column are embedded first, from high frequency bands to low frequency bands, followed by the patches in the second column, and so on.

Fig. 3. Demonstration of the proposed STFT-based watermark embedding scheme (vertical axes: frequency in Hz). (a) STFT of host audio signal. (b) Partition of the 2D TF image. (c) Energy image of the patches of the 2D TF image. (d) Selected feature patches (red, payload size $P = 32$).

6) For each feature vector ordered as above, generate the PN sequence $\mathbf{p} \in \{+1, -1\}^{W^2 \times 1}$ as the spreading code, and perform SS or ISS watermark embedding additively, i.e.,

$$\mathbf{f}_{w,i} = \mathbf{f}_i + \left( \alpha w(i) - I \Phi \right) \mathbf{p}, \tag{7}$$

where

$$\Phi \triangleq \frac{\mathbf{f}_i^T \mathbf{p}}{\|\mathbf{p}\|^2}, \tag{8}$$

$\{\cdot\}^T$ is the transpose operator, and $\alpha > 0$ controls the watermark embedding strength. For simplicity, the parameter $I$ is a binary indicator: if $I = 0$, the scheme is based on SS, while if $I = 1$, the scheme is based on ISS.

7) After embedding the payload $\mathbf{w}$, $\mathbf{Y}_w$ is obtained from $\mathbf{Y}$ by simply replacing the sub-matrix $\tilde{\mathbf{Y}}$ with $\tilde{\mathbf{Y}}_w$. Then, perform the inverse STFT according to the same framing rule as used in Step 1, reorder the output into vector form, and discard the imaginary part to obtain $\mathbf{x}_w$.

8) Calculate the ODG value according to $\mathbf{x}$ and $\mathbf{x}_w$. Adjust the parameter $\alpha$ according to a desired ODG level: if the ODG value is greater than the desired value (more imperceptible), then $\alpha$ can be slightly increased as long as the resultant ODG stays within a tolerated distance from the desired value; if the ODG value is smaller than the desired value (less imperceptible), then $\alpha$ should be reduced accordingly. This process is shown in Fig. 2.

The watermark embedding process is visualized in Fig. 3. At the receiving end, assuming an error-free channel, the extraction of the payload, in terms of the detection of each embedded information bit, is carried out as follows.

1) Partition $\mathbf{x}_w$ into non-overlapping frames $\mathbf{x}_{w,i}$ with $M$ samples, and perform the Hilbert transform similarly to (1).

2) Compute the non-symmetric STFT of $\mathbf{x}_w$, and obtain the TF representation $\mathbf{Y}_w$. Specifically, perform the FFT for each frame,

$$\mathbf{y}_{w,i} = \mathbf{H}\mathbf{x}_{w,i}, \tag{9}$$

then we have

$$\mathbf{Y}_w = [\mathbf{y}_{w,0}, \mathbf{y}_{w,1}, \ldots, \mathbf{y}_{w,\lceil N/M \rceil - 1}]. \tag{10}$$

Further, according to $f_1$ and $f_2$, construct the sub-matrix $\tilde{\mathbf{Y}}_w \in \mathbb{C}^{M_1 \times \lceil N/M \rceil}$.

3) Partition $\tilde{\mathbf{Y}}_w$ into square patches using a $W \times W$ window, and index the patches in raster scanning order.

4) Calculate the average energy of each patch. Denote each patch as $\mathbf{P}_{w,j}$, $j \in \{0, 1, \ldots, M_1 \lceil N/M \rceil / W^2 - 1\}$; then the average energy is given by

$$E_{w,j} = \frac{1}{W^2} \sum_{m_1=0}^{W-1} \sum_{m_2=0}^{W-1} \left| \mathbf{P}_{w,j}(m_1, m_2) \right|^2. \tag{11}$$

5) Sort $E_{w,j}$ in ascending order, and find the $P$ patches with the least energy values. Vectorize these patches to form $\mathbf{f}_{w,i}$, ordered column-wise and from top to bottom. The embedded information bit is estimated by

$$\hat{w}(i) = \mathrm{sgn}\left( \frac{\langle \Re(\mathbf{f}_{w,i}), \mathbf{p} \rangle}{\|\mathbf{p}\|^2} \right) = \mathrm{sgn}\left( \frac{\Re(\mathbf{f}_i)^T \mathbf{p} + \left( \alpha w(i) - I \Re(\Phi) \right) \|\mathbf{p}\|^2}{\|\mathbf{p}\|^2} \right) = \mathrm{sgn}\left( (1 - I) \Re(\Phi) + \alpha w(i) \right), \tag{12}$$

where $\Re\{\cdot\}$ denotes the real part.

It can be seen from (12) that the SS scheme ($I = 0$) suffers from the host signal interference $\Phi$, while the ISS scheme ($I = 1$) is able to remove the interference term in a closed-loop environment. Note that the above embedding and extraction schemes can be adapted to other schemes by simply replacing the STFT with the STCT or other transforms.

C. Feature Invariance
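Because the invariance question in this subsection hinges on how the embedding rule (7) changes a feature vector, a minimal Python sketch of (7), (8), and (12) may serve as a reference. It is an illustration under stated assumptions, not the author's implementation: the function names (`projection`, `embed_bit`, `detection_statistic`, `detect_bit`) are hypothetical, feature vectors are plain lists of complex numbers, and no attack or TF transform is modeled.

```python
def projection(f, p):
    """Phi = f^T p / ||p||^2 as in Eq. (8); complex when f is complex."""
    return sum(fk * pk for fk, pk in zip(f, p)) / sum(pk * pk for pk in p)


def embed_bit(f, bit, p, alpha, iss):
    """Eq. (7): f_w = f + (alpha * w(i) - I * Phi) * p.

    `iss=True` sets the indicator I to 1 (ISS); `iss=False` sets it to 0 (SS).
    """
    indicator = 1 if iss else 0
    gain = alpha * bit - indicator * projection(f, p)
    return [fk + gain * pk for fk, pk in zip(f, p)]


def detection_statistic(f_w, p):
    """Argument of sgn(.) in Eq. (12): <Re(f_w), p> / ||p||^2."""
    return projection([fk.real for fk in f_w], p)


def detect_bit(f_w, p):
    """Estimated bit, +1 or -1 (a zero statistic is mapped to +1 here)."""
    return 1 if detection_statistic(f_w, p) >= 0 else -1


if __name__ == "__main__":
    # Toy host vector whose projection onto p is large relative to alpha.
    p = [1, -1, 1, 1]
    f = [1 + 2j, 0.5 - 1j, -0.3 + 0.1j, 2 - 0.5j]
    for bit in (+1, -1):
        ss = detect_bit(embed_bit(f, bit, p, alpha=0.1, iss=False), p)
        iss = detect_bit(embed_bit(f, bit, p, alpha=0.1, iss=True), p)
        print("sent", bit, "SS detects", ss, "ISS detects", iss)
```

In this toy case the host projection dominates the SS statistic, so the sign of the SS decision follows the host rather than the bit, whereas the ISS statistic reduces to $\alpha w(i)$, in line with the last expression of (12).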

In this subsection, we address an important issue in validating the proposed system, i.e., feature invariance before and after watermark embedding. It can be seen from the embedding function (7) that the energy of a watermarked feature vector is altered. For blind extraction to locate the watermark, the $P$ patches selected for embedding must therefore still be the $P$ patches with the least energy levels in $\mathbf{Y}_w$ after the whole embedding process. To study the feature recovery property of the proposed system under different TF transforms, the recovery results using a sample audio clip in a closed-loop (attack-free) environment are depicted in Fig. 4, where the audio clip has a duration of […] seconds, and $P = 32$. It can be seen that the STFT based method is more suitable for the proposed framework. The reason is that the DCT tends to compact the signal's energy into a smaller frequency band and also de-correlates the signal in the frequency domain, making the small energy regions more ambiguous in terms of energy difference. Therefore, we only consider the STFT as the TF analysis tool in the following experiments. Extensive experimental results are provided later to demonstrate how additive noise and other processing attacks affect the effectiveness of the proposed framework.

Fig. 4. Detection results of watermark positions via (a) STFT and (b) STCT. In each panel (frequency in Hz versus time in s): top, the TF representation; middle, the feature patches; bottom, the recovered patches.

It is worth noting that there exists another solution to the requirement of feature invariance, which can be obtained using an indexing array that points to $P$ randomly chosen patches to identify which patches are watermarked. This array can be considered as a private key shared among the authority and trusted parties. Note that the system proposed in the previous subsection is strictly a blind watermarking scheme whose watermark extraction does not require any auxiliary information, but if the random indexing array is introduced, then this array, serving as a key, should be transmitted to authorized receivers via some secure channel.
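The blind (key-free) invariance requirement discussed in this subsection can be checked numerically: compute the patch energies of (5) before and after embedding and compare the $P$ lowest-energy patch indices. The following Python sketch assumes the TF image is given as a list of rows (frequency bins) of complex values; the function names are hypothetical, and for simplicity it compares sorted raster indices rather than reproducing the column-wise embedding order described earlier.

```python
def patch_energies(Y, W):
    """Average energy of each non-overlapping W x W patch, raster order (Eq. (5))."""
    rows, cols = len(Y), len(Y[0])
    if rows % W or cols % W:
        raise ValueError("W must divide both dimensions of the TF image")
    energies = []
    for r0 in range(0, rows, W):          # patch rows (frequency direction)
        for c0 in range(0, cols, W):      # patch columns (time direction)
            total = sum(abs(Y[r][c]) ** 2
                        for r in range(r0, r0 + W)
                        for c in range(c0, c0 + W))
            energies.append(total / (W * W))
    return energies


def lowest_energy_patches(Y, W, P):
    """Sorted raster indices of the P patches with minimum average energy."""
    E = patch_energies(Y, W)
    order = sorted(range(len(E)), key=lambda j: E[j])  # ascending, stable
    return sorted(order[:P])


def selection_is_invariant(Y, Y_w, W, P):
    """True if the same P patches are selected before and after embedding."""
    return lowest_energy_patches(Y, W, P) == lowest_energy_patches(Y_w, W, P)
```

If this check fails for a given $\alpha$, a blind extractor would look for the payload in the wrong patches, which is the situation where the random indexing key mentioned in the text becomes an alternative.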

The study of random index key based TF feature domain watermarking is noted here for future research attention.

III. EVALUATIONS AND EXPERIMENTAL RESULTS

In this section, we carry out extensive experiments to evaluate the proposed framework in terms of imperceptibility and robustness. The imperceptibility is measured quantitatively by the document-to-watermark ratio (DWR) and the ODG. For comparison, the counterpart system based on the DCT and the scheme in Fig. 1 (b) is also implemented. Therefore, the experiments and comparisons are conducted on four systems, i.e., STFT-SS, STFT-ISS, DCT-SS, and DCT-ISS. Some measurement metrics are defined as follows. First, the DWR is given by

$$\mathrm{DWR} = 10 \log_{10} \frac{\|\mathbf{x}\|^2}{\|\mathbf{x}_w - \mathbf{x}\|^2}. \tag{13}$$

Fig. 5. Watermark detection rates averaged over the audio samples, for DCT-ISS, DCT-SS, STFT-ISS, and STFT-SS: (a) Detection rate versus DWR. (b) Detection rate versus ODG.

The signal-to-noise ratio (SNR) is defined by

$$\mathrm{SNR} = 10 \log_{10} \frac{\|\mathbf{x}_w\|^2}{\sigma^2}, \tag{14}$$

where $\sigma^2$ is the variance of the additive white Gaussian noise (AWGN). To characterize watermark extraction performance, the detection rate (DR), i.e., the percentage of correctly detected bits, is defined by

$$\mathrm{DR} = \left( 1 - \frac{1}{2P} \sum_{i=0}^{P-1} \left| w(i) - \hat{w}(i) \right| \right) \times 100\%. \tag{15}$$

Parameters are set as follows: $M = 1024$, $f_1 = 60$ Hz, $f_2 = 2800$ Hz, $W = 16$, $P = 32$, $L = W^2$, and $\alpha$ is heuristically controlled so that the ODG values are no less than $-$[…]. During comparison, the ODG values are tuned to be similar for fair comparison. In our simulation, four music samples are selected, including a male song ([…] s), a female song ([…] s), a violin and piano duet ([…] s), and electronic music ([…] s). All the sample audio files have 16-bit quantization and a sampling frequency of […] kHz.

A. Imperceptibility
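The DWR in (13) and the DR in (15) are simple to compute, and a small Python sketch may help when reading the tables in this section. It is a hedged illustration rather than the evaluation code used for the experiments: the function names are hypothetical, signals are plain lists of real samples, payload bits take values in $\{+1, -1\}$, and the DR is assumed to count correctly detected bits, as the reported values (100% meaning error-free extraction) indicate.

```python
import math


def dwr_db(x, x_w):
    """Document-to-watermark ratio in dB, Eq. (13).

    Raises ZeroDivisionError if x_w equals x (no embedding distortion).
    """
    host_energy = sum(v * v for v in x)
    distortion = sum((a - b) ** 2 for a, b in zip(x_w, x))
    return 10.0 * math.log10(host_energy / distortion)


def detection_rate(w, w_hat):
    """Percentage of correctly detected bits.

    Each wrong bit contributes |w(i) - w_hat(i)| = 2, hence the division by 2.
    """
    wrong_bits = sum(abs(a - b) for a, b in zip(w, w_hat)) / 2.0
    return 100.0 * (1.0 - wrong_bits / len(w))
```

With $P = 32$, a single wrong bit changes the DR by $100/32 = 3.125$ percentage points, which matches the granularity of the per-sample values reported for the first setting (e.g., 96.8 corresponds to 31 of 32 bits).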

The imperceptibility of the implemented systems is demonstrated in Fig. 5, where three of the four audio samples, with […] s duration, are used to generate the performance curves, and AWGN with SNR = 30 dB is considered. It can be observed from both sub-figures that the proposed schemes consistently yield better imperceptibility when the DR values are the same. In terms of ODG, the proposed systems obtain ODG values between $-$[…] and $0$ with DRs above […]. Further, the DWR and ODG values of the four audio watermarking systems applied to the four audio samples are summarized in Tables I and II. It can be seen that the performance improvement in imperceptibility is consistent across different samples, and the inability of ISS based methods to control imperceptibility is resolved, thanks to localized embedding in selected features. The robustness testing results are provided in the next subsection, also based on the four audio samples, and the corresponding imperceptibility information is as shown in the two tables. We will demonstrate that while the proposed systems achieve improved imperceptibility, the robustness against several common processing attacks can also be improved.

TABLE I
IMPERCEPTIBILITY OF DCT-SS AND DCT-ISS METHODS

Data        DCT-SS                 DCT-ISS
            DWR (dB)    ODG        DWR (dB)    ODG
Sample 1    33.9        -0.76      33.7        -0.70
Sample 2    34.7        -0.13      32.4        -0.02
Sample 3    40.4        -0.40      32.8        -0.90
Sample 4    33.5        -0.95      32.3        -0.44

TABLE II
IMPERCEPTIBILITY OF STFT-SS AND STFT-ISS METHODS

Data        STFT-SS                STFT-ISS
            DWR (dB)    ODG        DWR (dB)    ODG
Sample 1    39.9        -0.61      39.6        -0.44
Sample 2    40.7        -0.03      42.6        -0.02
Sample 3    46.4        -0.43      42.1        -0.40
Sample 4    39.5        -0.55      39.9        -0.34

Fig. 6. […] × […]-bit watermark logos used with DCT-ISS and STFT-ISS: original and extracted (no attack) watermarks for the IEEE logo and the Springer logo.

B. Robustness to a Series of Attacks

The attacks considered in this paper include re-quantization, adding Gaussian noise, amplitude scaling, AAC lossy compression, and MP3 lossy compression, most with several different attack strength settings. Here, we use two sets of watermarks and two sets of sample audio clips. In the first setting, a random binary sequence of 32 bits is used as the watermark, which is embedded into four audio samples of […] seconds. The embedding DWR and ODG values in this setting are given in Tables I and II. In the second setting, two graphic logos of […] × […] bits, as shown in Fig. 6, are used for watermark embedding, and the […]-minute audio file is used as the host signal. The ODG values for the second setting are given at the bottom of Table VII. The implemented methods for the second setting are DCT-ISS and STFT-ISS only, for simplicity.

TABLE III
DRS (%) OF DCT-SS METHOD UNDER DIFFERENT ATTACKS (FIRST SETTING)

Attack Type                   Sample 1   Sample 2   Sample 3   Sample 4   Average
Re-Quantization    8 bit      84.4       96.8       78.1       93.8       88.2
Gaussian Noise     30 dB      84.4       96.8       78.1       90.6       87.5
                   50 dB      84.4       96.8       78.1       93.8       88.2
Amplitude Scal.    1.2        84.4       96.8       78.1       93.8       88.2
                   1.8        84.4       96.8       78.1       93.8       88.2
AAC Compression    96 kbps    84.4       96.8       78.1       93.8       88.2
                   160 kbps   84.4       96.8       78.1       93.8       88.2
MP3 Compression    64 kbps    84.4       96.8       78.1       93.8       88.2
                   128 kbps   84.4       96.8       78.1       93.8       88.2

TABLE IV
DRS (%) OF DCT-ISS METHOD UNDER DIFFERENT ATTACKS (FIRST SETTING)

Attack Type                   Sample 1   Sample 2   Sample 3   Sample 4   Average
Re-Quantization    8 bit      100        100        100        100        100
Gaussian Noise     30 dB      75.0       75.0       68.8       71.9       72.7
                   50 dB      100        100        100        100        100
Amplitude Scal.    1.2        100        100        100        100        100
                   1.8        100        100        100        100        100
AAC Compression    96 kbps    96.8       96.8       96.8       96.8       96.8
                   160 kbps   100        100        100        100        100
MP3 Compression    64 kbps    87.5       100        93.8       100        95.3
                   128 kbps   100        100        96.8       100        99.2

TABLE V
DRS (%) OF STFT-SS METHOD UNDER DIFFERENT ATTACKS (FIRST SETTING)

Attack Type                   Sample 1   Sample 2   Sample 3   Sample 4   Average
Re-Quantization    8 bit      96.8       100        81.3       96.8       93.8
Gaussian Noise     30 dB      93.7       100        78.1       93.8       91.4
                   50 dB      96.8       100        81.3       96.8       93.8
Amplitude Scal.    1.2        96.8       100        81.3       96.8       93.8
                   1.8        96.8       100        81.3       96.8       93.8
AAC Compression    96 kbps    96.8       100        78.1       96.8       92.9
                   160 kbps   96.8       100        81.3       96.8       93.8
MP3 Compression    64 kbps    96.8       100        78.1       96.8       92.9
                   128 kbps   96.8       100        81.3       96.8       93.8

TABLE VI
DRS (%) OF STFT-ISS METHOD UNDER DIFFERENT ATTACKS (FIRST SETTING)

Attack Type                   Sample 1   Sample 2   Sample 3   Sample 4   Average
Re-Quantization    8 bit      100        100        100        100        100
Gaussian Noise     30 dB      96.8       93.7       96.8       87.5       93.7
                   50 dB      100        100        100        100        100
Amplitude Scal.    1.2        100        100        100        100        100
                   1.8        100        100        100        100        100
AAC Compression    96 kbps    100        100        100        100        100
                   160 kbps   100        100        100        100        100
MP3 Compression    64 kbps    100        100        100        100        100
                   128 kbps   100        100        100        100        100

TABLE VII
DRS (%) OF DCT-ISS AND STFT-ISS METHODS (SECOND SETTING)

Attack Type                   DCT-ISS              STFT-ISS             Average
                              IEEE      Springer   IEEE      Springer   Improvement
Re-Quantization    8 bit      94.2      95.1       100       100        5.35
Gaussian Noise     30 dB      55.2      58.9       85.2      86.7       29.3
                   40 dB      70.1      69.5       94.3      95.0       24.9
                   50 dB      92.6      94.5       98.6      100        5.75
Amplitude Scal.    1.2        100       100        100       100        0
                   1.8        100       100        100       100        0
AAC Compression    96 kbps    75.5      73.9       85.4      82.5       9.25
                   128 kbps   81.5      81.9       90.0      87.9       7.25
                   160 kbps   93.4      91.9       100       100        7.35
MP3 Compression    64 kbps    64.6      75.1       85.1      86.3       15.9
                   128 kbps   86.7      99.2       100       100        7.05
                   192 kbps   99.5      100        100       100        0.25
ODG                           -0.75     -0.76      -0.65     -0.66

Fig. 7. Recovered […] × […]-bit watermark logos (second setting) under different significant attacks used with DCT-ISS and STFT-ISS. Panels: re-quantization (8 bit), Gaussian noise (30 dB), AAC (96 kbps), and MP3 (64 kbps).

The DRs against several attacks with different attack strength settings for the DCT-SS, DCT-ISS, STFT-SS, and STFT-ISS methods are shown in Tables III to VI, respectively. For both the DCT and STFT based methods, we observe better robustness when ISS is used. This agrees with the classical property of the ISS technique. More importantly, we can see from these tables that the proposed TF feature domain watermarking systems outperform the DCT based frequency domain methods for both SS and ISS implementations. The best performance is observed in Table VI, which corresponds to the STFT-ISS method. Recall Table II, in which the DWR and ODG values of the STFT-ISS method are very close to, or even slightly better than, those obtained from its SS counterpart or the DCT based methods. Therefore, it can be concluded that the proposed system is able to simultaneously achieve improved imperceptibility and robustness.

Finally, we test the two ISS based systems, i.e., DCT-ISS and STFT-ISS, in the second setting, where the IEEE and Springer logos are embedded in a […]-minute audio clip. The results are shown in Table VII. The improvement of the proposed system over its frequency domain counterpart is consistent across different attacks for both logos. The recovered IEEE and Springer logos under different significant attacks using DCT-ISS and STFT-ISS are shown in Fig. 7. Two observations can be made: first, DCT-ISS fails to reconstruct the original logos; second, although STFT-ISS does not achieve completely error-free performance, it can still recover the shape and content of the logos well.

IV. CONCLUSION