Adversarial Audio: A New Information Hiding Method and Backdoor for DNN-based Speech Recognition Models
Yehao Kong, Jiliang Zhang*
College of Computer Science and Electronic Engineering, Hunan University
[email protected]
* Corresponding author.
Abstract
Audio is an important medium in people's daily life, and hidden information can be embedded into audio for covert communication. Current audio information hiding techniques can be roughly classified into time domain-based and transform domain-based techniques. Time domain-based techniques have a large hiding capacity but low imperceptibility, whereas transform domain-based techniques have better imperceptibility but a poor hiding capacity. This paper proposes a new audio information hiding technique with both a high hiding capacity and good imperceptibility. The proposed method takes the original audio signal as input and, through the training of our private automatic speech recognition (ASR) model, obtains an audio signal embedded with hidden information (called the stego audio). Without knowing the internal parameters and structure of the private model, the hidden information can be extracted only by the private model and not by public models. We use four other ASR models to extract the hidden information from the stego audios to evaluate the security of the private model. The experimental results show that the proposed audio information hiding technique achieves a high hiding capacity of 48 characters per second (cps) with good imperceptibility and high security. In addition, our proposed adversarial audio can be used to activate an intrinsic backdoor of DNN-based ASR models, which poses a serious threat to intelligent speakers.
1 Introduction

With the rapid development of communication-related technologies, multimedia information such as images, audio, and video is generated in large quantities and brings great convenience to people. However, multimedia information services pose a potential threat to the legitimate rights of the information owner. Unlike traditional cryptography, information hiding techniques [3] exploit the perceptual redundancy of human senses for digital signals and hide the secret information in a carrier, providing technical protection for the rights of multimedia information.

Since the human auditory system is more sensitive than the visual system, embedding secret information into audio media is more challenging than into images. In addition, as an important medium in people's daily communication, audio transmits information inconspicuously and provides ample redundant space for embedding hidden information, which makes research on audio information hiding techniques all the more valuable.

Traditional audio information hiding techniques can be roughly divided into two classes: time domain-based and transform domain-based techniques.

Time domain techniques directly embed the hidden information into the carrier signal in the time domain. They have a large hiding capacity and are easy to implement. However, directly modifying the carrier signal in the hiding process inevitably distorts the carrier signal, which increases the difficulty of extracting the hidden information and results in poor imperceptibility. Commonly used time domain-based techniques include least significant bit (LSB) [8, 19], echo hiding [15, 33], and spread spectrum [32, 35] techniques.

Transform domain techniques hide the information in the carrier's transform domain. They map the carrier information to the transform domain and modify selected parameters there to hide information, which better resists attacks based on various signal processing operations while maintaining imperceptibility. However, mapping the audio signal to the transform domain requires many signal processing operations, which results in high computational complexity. At the same time, the strong robustness comes at the cost of a reduced hiding capacity, so the hiding capacity of transform domain techniques is small. Commonly used transform domain-based techniques include phase coding [24, 25], discrete cosine transform (DCT) [20, 36], and discrete wavelet transform (DWT) [4, 7] techniques.

In order to improve the hiding capacity and imperceptibility, this paper proposes to embed the hidden information into the audio signal with a private ASR model based on a deep neural network (DNN) at the transmitting end, and to extract the hidden information with the same private ASR model at the receiving end. We also perform several performance tests on the generated stego audios. Experimental results show that our proposed information hiding technique has good hiding capacity, imperceptibility, and security. The contributions of this paper are as follows.

• Novel hiding approach.
We propose a new audio information hiding technique based on adversarial perturbations, which embeds and extracts the hidden information with a DNN-based ASR model.

• High hiding capacity.
The proposed technique embeds the hidden information in the form of a whole sentence, with a hiding capacity of 48 characters per second (cps).

• Good imperceptibility.
The value of the perceptual evaluation of speech quality (PESQ) is 3.598 on average; people can barely perceive the perturbation.

• High security.
Four public models, including the Google and IBM commercial ASR systems, are used to test the stego audio signals, and the experimental results show that these models are unable to extract the hidden information.

• A new backdoor.
The hidden information can be used as a specific trigger instruction to activate the model-intrinsic backdoor of DNN-based speech recognition models.

The remainder of this paper is organized as follows. Section 2 reviews traditional audio information hiding techniques. Section 3 presents the preliminary knowledge. The proposed method and its working mechanisms are elaborated in Section 4. The experimental results are reported in Section 5. Section 6 demonstrates the application of our method and the intrinsic backdoor of DNN-based ASR models. Finally, we conclude in Section 7.
2 Related Work

Audio information hiding methods are mainly classified into time domain-based and transform domain-based methods. Time domain-based methods are characterized by low computational complexity and a high hiding capacity, but poor robustness and imperceptibility. Transform domain-based methods usually have better robustness and imperceptibility, but their computational complexity is high and their hiding capacity is small. Several commonly used methods of both kinds are introduced below.
2.1 LSB-based Methods

In the embedding phase, the LSB method replaces the least significant bits of the carrier samples with the data bits of the hidden information. In the extraction phase, the embedded hidden information can be recovered simply by reading out the corresponding least significant bits. Dieu et al. [8] proposed an improved LSB method that is less perceptible than the traditional LSB method, but at the cost of a reduced hiding capacity, and it is not robust. Jadhav et al. [19] proposed an audio information hiding technique with enhanced security that uses the top three most significant bits (MSBs) to determine the least significant bit position of the hidden information. For example, when the top three bits are "100", the hidden information bit is embedded in the 4th least significant bit. However, it still cannot solve the problem of poor robustness while improving security.
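As an illustration of the classic substitution scheme described above, the following minimal Python sketch embeds and recovers payload bits in the least significant bit of 16-bit PCM samples; the sample values and payload are toy data, not from the paper's experiments.

```python
import numpy as np

def lsb_embed(samples: np.ndarray, bits: list) -> np.ndarray:
    """Embed one payload bit into the least significant bit of each
    16-bit PCM sample (classic LSB substitution)."""
    stego = samples.copy()
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the LSB, then set it to the payload bit
    return stego

def lsb_extract(stego: np.ndarray, n_bits: int) -> list:
    """Recover the payload by reading the LSB of the first n_bits samples."""
    return [int(s) & 1 for s in stego[:n_bits]]

# Toy usage: hide the 8 bits of one payload byte.
samples = np.array([1200, -803, 15, 7, -2, 990, 31, -64], dtype=np.int16)
payload = [1, 0, 1, 1, 0, 0, 1, 0]
stego = lsb_embed(samples, payload)
assert lsb_extract(stego, 8) == payload
```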
2.2 Echo Hiding-based Methods

According to the auditory characteristics of the human ear, a weak signal that appears shortly (usually within 0-200 ms) after a strong signal in an audio becomes inaudible. Echo hiding achieves the purpose of hiding information by introducing echoes into discrete audio signals; different pieces of information can be represented by different echo delays. Xiang et al. [33] proposed a technique for embedding audio watermarks using echo hiding. Its robustness and imperceptibility improve on previous work, but its hiding capacity was not tested. Hua et al. [15] proposed an audio watermarking scheme based on time-spread echo. It uses a finite impulse response (FIR) filter based on convex optimization to obtain the optimal echo filter coefficients. This scheme improves imperceptibility and robustness compared to previous methods, but again the hiding capacity was not tested.
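To make the mechanism concrete, the following sketch hides one bit per segment by adding a faint, delayed copy of the signal whose lag encodes the bit; the segment length, delays, and attenuation factor are illustrative assumptions, not parameters from [15] or [33]. Decoding, not shown, is typically done by locating the echo peak in the cepstrum of each segment.

```python
import numpy as np

def echo_embed(signal, bits, fs=16000, seg_len=8192,
               delay0=0.001, delay1=0.0013, alpha=0.4):
    """Hide one bit per segment by adding a quiet echo whose delay
    encodes the bit (delay0 -> bit 0, delay1 -> bit 1)."""
    stego = signal.astype(np.float64).copy()
    for i, bit in enumerate(bits):
        start = i * seg_len
        seg = signal[start:start + seg_len].astype(np.float64)
        d = int((delay1 if bit else delay0) * fs)  # echo lag in samples
        echo = np.zeros_like(seg)
        echo[d:] = alpha * seg[:-d]                # delayed, attenuated copy
        stego[start:start + seg_len] = seg + echo
    return stego
```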
2.3 DCT-based Methods

DCT-based methods first obtain the DCT coefficients of the carrier, then modify the coefficients at selected positions to embed the hidden information into the audio; the stego audio signal is finally obtained after the inverse discrete cosine transform. Zong et al. [36] introduced an information hiding algorithm based on the energy difference between frequency bands. By calculating the difference between the energy average and the energy variation, the hidden information is embedded into the low-frequency part of the DCT coefficients. The robustness is good, but the calculation involves many complicated coefficients and it is difficult to guarantee a correct extraction rate for the hidden information. Jeyhoon et al. [20] performed a DCT transform on each frame of the original audio signal and then selected an appropriate DCT coefficient band to embed the hidden information bits. The hiding capacity and robustness of this method are good, but the imperceptibility is slightly poor.
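The following sketch illustrates a generic DCT-domain embedding of this kind, using quantization index modulation on one low-frequency coefficient per frame. It is a simplified stand-in, not the energy-difference algorithm of [36] or the coefficient-band method of [20]; the frame length, coefficient index, and quantization step are assumptions.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct_embed(signal, bits, frame_len=1024, coeff_idx=20, step=0.5):
    """Hide one bit per frame by quantizing a low-frequency DCT
    coefficient onto an even (bit 0) or odd (bit 1) multiple of step."""
    stego = signal.astype(np.float64).copy()
    for i, bit in enumerate(bits):
        frame = stego[i * frame_len:(i + 1) * frame_len]
        c = dct(frame, norm='ortho')
        q = np.round(c[coeff_idx] / step)
        if int(q) % 2 != bit:              # force the parity to match the bit
            q += 1
        c[coeff_idx] = q * step
        stego[i * frame_len:(i + 1) * frame_len] = idct(c, norm='ortho')
    return stego

def dct_extract(stego, n_bits, frame_len=1024, coeff_idx=20, step=0.5):
    """Read the parity of the quantized coefficient back out of each frame."""
    bits = []
    for i in range(n_bits):
        c = dct(stego[i * frame_len:(i + 1) * frame_len], norm='ortho')
        bits.append(int(np.round(c[coeff_idx] / step)) % 2)
    return bits
```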
2.4 DWT-based Methods

Discrete wavelet transform (DWT) is a multi-scale, multi-resolution technique that decomposes signals into different time-frequency components, and wavelet decomposition matches the perceptual model of the human ear well. DWT-based methods differ from DCT-based methods in that they modify the DWT coefficients to embed the hidden information into the audio. Das et al. [7] proposed a method that hides both the hidden information and the key in the DWT coefficients, but the paper tested neither the hiding capacity nor the robustness of the stego audio. Avci et al. [4] proposed a method that applies the LSB technique in the DWT domain. It has good imperceptibility, but the hiding capacity is not high enough and no robustness experiments were performed.

A good information hiding algorithm should guarantee a large capacity together with good imperceptibility. However, the two indicators usually conflict: a large hiding capacity means that more hidden information can be embedded in the carrier audio, which degrades the quality of the carrier audio, harms imperceptibility, and increases the risk of being cracked. In this paper, we propose a new audio information hiding technique to balance imperceptibility and hiding capacity.
3 Preliminaries

3.1 Automatic Speech Recognition

Automatic speech recognition (ASR) [12] is a cross-disciplinary applied research field that transforms speech signals into the corresponding text through a process of recognition and understanding. Nowadays, speech recognition technology is widely used in mobile devices, in-vehicle devices, robots, and other scenarios, and plays an increasingly important role in fields such as search, manipulation, navigation, and entertainment.

Early speech recognition techniques were based on signal processing and pattern recognition methods. With the advancement of technology, machine learning methods have been increasingly applied to speech recognition research, especially deep learning, which has brought profound changes to the field.

The structure of DeepSpeech [14], the model we use, is shown in Fig. 1. It is an open-source ASR engine based on Baidu's deep speech research. The model is trained with deep learning techniques and consists of five hidden layers h_t^(1) to h_t^(5). The bidirectional recurrent neural network (BiRNN) in the fourth layer is the core of DeepSpeech, and the CTC loss [13] is used to train the neural network.

Figure 1: The structure of DeepSpeech [14].

3.2 Adversarial Examples

Deep learning, especially neural networks, has shown great advantages in fields such as image recognition, speech processing, autonomous driving, and medical diagnosis. In particular, the recognition ability of image recognition models has exceeded the accuracy of the human eye. However, recent research has shown that deep learning models are vulnerable to adversarial examples [29]. An adversarial example is carefully designed by attackers to fool a deep learning model: the difference between the adversarial example and the real example is almost indistinguishable to the human eye, yet it causes the model to misclassify.

The majority of adversarial example research has focused on generating adversarial examples against image recognition models [5, 9, 21, 22, 26, 29]. Szegedy et al. [29] proposed the L-BFGS method, which constrains the perturbation with the squared L2 norm to construct an image adversarial example. Although this method is stable and effective, the computation is too complicated. Goodfellow et al. [9] added perturbations in the gradient direction to make the model misclassify the resulting images. It is the simplest and fastest way to construct an adversarial example and is called the fast gradient sign method (FGSM); however, since only one gradient step is performed, the size of the perturbation cannot be well controlled. Carlini and Wagner [5] proposed an improved L-BFGS method, currently the most powerful attack and known as the CW attack, which can effectively perform targeted or non-targeted attacks to misclassify image recognition models.
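For concreteness, a minimal TensorFlow sketch of FGSM is given below; the Keras classifier `model`, the integer labels `y_true`, and the step size `eps` are assumptions for illustration.

```python
import tensorflow as tf

def fgsm(model, x, y_true, eps=0.01):
    """Fast gradient sign method [9]: take one step of size eps in the
    direction of the sign of the loss gradient, increasing the loss so
    that the model misclassifies the perturbed input."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, model(x))
    grad = tape.gradient(loss, x)
    return x + eps * tf.sign(grad)  # adversarial example
```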
However, recent research has shown that speech recognition models are also vulnerable to adversarial examples [6, 18, 30]. Iter et al. [18] proposed using FGSM [9] to generate audio adversarial examples that make the ASR system misrecognize its input; however, the generated adversarial examples carry loud, clearly audible noise and large distortion. Carlini et al. [6] proposed a method for generating audio adversarial examples against a white-box speech recognition model. It produces very strong audio adversarial examples, with a misclassification rate of up to 100%. Taori et al. [30] combined a genetic algorithm with gradient estimation to craft targeted audio adversarial examples against a black-box ASR model; however, it can only generate phrases of two words, with an attack success rate of only 35%.

The vulnerability to adversarial examples threatens the security of DNN-based ASR models. However, we find an interesting characteristic: audio adversarial examples generated by current methods have no transferability, that is, audio adversarial examples generated for a specific model are not aggressive toward other models. We therefore introduce this characteristic into the field of audio information hiding and use a deep learning model to generate audio adversarial examples that embed the hidden information.
4 The Proposed Method

This paper proposes a new technique for audio information hiding based on adversarial examples, which embeds and extracts the hidden information with a private DNN-based ASR model. The technique is described in detail below; the proposed intrinsic backdoor of DNN-based models is introduced in Section 6.
4.1 Embedding the Hidden Information

The ASR model DeepSpeech acts as a private model owned only by the sending end and the receiving end; it plays the role of the grey box in the embedding process of traditional information hiding methods shown in Fig. 6. The key idea is to take the original audio signal and the hidden text as input and obtain the stego audio signal by training against the private model. For example, an audio signal saying "Good morning" is input into the private model, and after training, the private model recognizes it as "hi, Siri".

The process is shown in Fig. 2. First, the original audio signal X and the hidden text t are input into the ASR model. In the training phase, the slight perturbation δ to be added to X is constantly updated according to the value of the loss function. Finally, the generated stego audio signal X + δ is recognized as the hidden text t while the perturbation δ remains small.

Figure 2: The process of embedding the hidden information into audio signals.

In order for the audio to be recognized as the hidden information, the CTC loss is selected as the loss function of our method; given an audio signal, it outputs a probability for any text (the detailed principle of the CTC loss can be found in [13]). The stego audio is therefore trained with the CTC loss to maximize the probability that the audio is recognized as the hidden text.

In the meantime, under the premise that the audio is recognized as the hidden text, the perturbation δ added to the original audio should be quieter than the original audio signal X, hence its magnitude should be smaller. To keep the computational complexity low, the L-infinity norm ‖δ‖∞ (Eq. (1)) is used to represent the magnitude of the perturbation:

  ‖δ‖∞ = max_{1≤i≤n} |δ_i|.   (1)

The overall goal is thus the optimization problem of minimizing ‖δ‖∞ subject to the model C(·) recognizing the speech X + δ as the target text t (i.e., C(X + δ) = t):

  min ‖δ‖∞   s.t.   C(X + δ) = t.   (2)

Since the constraint C(X + δ) = t is non-linear, gradient descent cannot be applied directly to find the convergence point of ‖δ‖∞. The constraint is instead converted into minimizing the loss function l(X + δ, t), where l(·) is the CTC loss and l(X + δ, t) measures the CTC loss between the recognition result of X + δ and the target text t. The optimization therefore becomes two minimization problems (Eq. (3)):

  min ‖δ‖∞   and   min l(X + δ, t).   (3)

How to combine the two objectives must then be considered. To facilitate the use of a gradient optimizer, we separate them into two steps that keep iterating:

1. Compute a δ that satisfies C(X + δ) = t by applying gradient descent to the loss function l(X + δ, t);
2. Reduce the admissible range of δ and clip δ into that range.

The two steps iterate until the set number of iterations is reached. In step 1, δ is optimized with the Adam optimizer so that the recognition result of X + δ gradually approaches the target text t. In step 2, a threshold τ is set for δ to ensure that the maximum fluctuation of δ never exceeds the threshold. The two steps can be integrated into the iterative update of Eq. (4), where α denotes the optimizer's step size:

  δ_0 = 0,  X_0 = X + δ_0;
  δ_{N+1} = clip_{−τ,τ}(δ_N − α∇_δ l(X_N, t)),  X_{N+1} = X + δ_{N+1}.   (4)

The detailed algorithm is shown in Algorithm 1.

Algorithm 1 Information Embedding Algorithm
Input: original audio signal X, hidden text t
Output: stego audio signal X′
Initialize: δ ← a zero array with the same shape as X; τ ← the threshold of δ; N ← the max iteration times
 1: X′ = X + δ
 2: for i = 1, 2, . . . , N do
 3:     // Calculate the loss
 4:     L = l(X′, t)
 5:     // Update δ
 6:     δ ← AdamOptimizer.minimize(L, δ)
 7:     δ = clip(δ, −τ, τ)
 8:     X′ = X + δ
 9:     // Update the threshold once the hidden text is recognized
10:     if C(X′) == t then
11:         if max(δ) ≤ τ then
12:             τ = max(δ)
13:         end if
14:         τ = c · τ   // 0 < c < 1
15:         // Save the last best result
16:         temp = X′
17:     end if
18: end for
19: return temp

The function clip(δ, −τ, τ) sets values of δ larger than τ to τ and values smaller than −τ to −τ. As shown in lines 10-18 of Algorithm 1, the threshold τ is determined as follows: a large initial value of τ is given; then, whenever a perturbation δ that yields the hidden text is obtained, τ is reduced by a constant factor c (0 < c < 1). The minimization is repeated until the set number of iterations is reached, and the last best result is returned.
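A minimal TensorFlow sketch of this loop is given below. The hooks `ctc_loss_fn` (the private model's CTC loss, assumed differentiable) and `decode_fn` (its transcription), as well as the learning rate and the threshold-reduction factor 0.8, are assumptions standing in for the DeepSpeech-specific code.

```python
import tensorflow as tf

def embed_hidden_text(x, target, ctc_loss_fn, decode_fn,
                      tau=3000.0, n_iter=500, lr=10.0):
    """Sketch of Algorithm 1. ctc_loss_fn(audio, text) is assumed to
    return the private ASR model's CTC loss for `text`, and
    decode_fn(audio) its transcription; both are model-specific hooks."""
    delta = tf.Variable(tf.zeros_like(x))          # perturbation delta, same shape as X
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    best = None
    for _ in range(n_iter):
        with tf.GradientTape() as tape:
            loss = ctc_loss_fn(x + delta, target)  # L = l(X', t)
        grads = tape.gradient(loss, [delta])
        opt.apply_gradients(zip(grads, [delta]))   # Adam update of delta
        delta.assign(tf.clip_by_value(delta, -tau, tau))
        if decode_fn(x + delta) == target:         # C(X') == t
            m = float(tf.reduce_max(tf.abs(delta)))
            tau = min(tau, m) * 0.8                # shrink the threshold (c = 0.8 assumed)
            best = (x + delta).numpy()             # save the last best result
    return best
```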
4.2 Extracting the Hidden Information

Compared with traditional audio information hiding methods, our proposed method needs no complicated algorithm to process the stego audio signal. As shown in Fig. 3, the hidden text is obtained simply by inputting the stego audio signal into the private ASR model for recognition. To verify that other, public ASR models cannot identify the hidden text, four state-of-the-art ASR models are used to try to extract the hidden text from the stego audios. The experimental results show that, apart from the private ASR model, none of the public models obtains any content related to the hidden text; the test results are given in Section 5.3.

Figure 3: The process of extracting the hidden information from stego audio signals.

4.3 Trigger for the Intrinsic Backdoor

Backdoors are typically activated under very specific conditions, which makes them unlikely to be activated or detected by random trigger inputs. This paper proposes to use the adversarial audio as the trigger input of a DNN's intrinsic backdoor, which exists in any DNN-based speech recognition model and is not deliberately implanted by the manufacturer. The intrinsic backdoor is introduced in detail in Section 6.
5 Evaluation

Although the requirements for information hiding differ across scenarios, hiding capacity, imperceptibility, security, and robustness are the main performance indicators of audio information hiding techniques [27, 28]. In addition, we perform a steganalysis to measure the probability of the stego audio being discovered.

In order to test the performance of the proposed audio information hiding technique, 100 test audios (A00-A99), which are WAV files with a length of 3 seconds, sampled at 16 kHz and quantized with 16 bits, are selected from the Mozilla Common Voice dataset [23] to embed the hidden information.
We divide these audios into 10 groups, G1-G10; the specific information to be hidden in each group is shown in Table 1. The stego audios are generated with TensorFlow and DeepSpeech v0.1.0, and the evaluation indicators are obtained with MATLAB. The initial parameters in the experiments are as follows: the number of iterations N is 500, the initial δ is a zero array with the same shape as the audio signal, and the initial τ is set to 3000.

Table 1: The specific hidden information in different groups.
Group  Audio Range  Hidden Information
G1     A00-A09      be quiet
G2     A10-A19      sing louder
G3     A20-A29      close the door
G4     A30-A39      the key is one one nine
G5     A40-A49      call the police
G6     A50-A59      happy birthday to you
G7     A60-A69      be careful
G8     A70-A79      bob is the spy
G9     A80-A89      help me
G10    A90-A99      see you at five pm

The performance of our method is compared with a spread spectrum-based audio information hiding method [32]. It first obtains the DCT coefficients of the audio and then embeds the hidden information into the coefficients using a group of orthonormal PN sequences. We implemented this method in MATLAB with the original configuration from [32]; its hiding capacity and imperceptibility are compared in the following evaluations.

5.1 Hiding Capacity

Hiding capacity, also known as the hiding rate, is the amount of hidden information that can be embedded in the carrier signal per second. Traditional methods hide information in audio in the form of bits, so the unit of hiding capacity is bits per second (bps). Our proposed method hides the information directly in the form of characters, and the information extracted from the stego audio is a whole sentence, hence characters per second (cps) is used as the unit of hiding capacity here.

Since the DeepSpeech model divides the audio signal into 50 frames per second when extracting speech features, at most 50 characters can be recognized per second; the theoretical maximum capacity of this information hiding method is therefore 50 cps. We conduct a hiding capacity test on the ten groups. The information hidden per second of audio for the capacity analysis is 10 consecutive occurrences of "hide", separated by blanks. The experimental results are shown in Table 2; the average hiding capacity is 48.0 cps. In the meantime, the hiding capacity of the method in [32] is a fixed 84 bps. As 1 character equals 8 bits, the capacity of [32] is 10.5 cps. Our proposed method therefore has a much higher hiding capacity.

Table 2: The hiding capacity of stego audios.
                 G1    G2    G3    G4    G5    G6    G7    G8    G9    G10   Avg
Proposed (cps)   47.9  48.2  48.0  46.6  48.6  48.8  48.8  47.6  46.8  48.6  48.0
Method in [32]   84 bps = 10.5 cps
5.2 Imperceptibility

The evaluation methods for audio imperceptibility can be classified into subjective and objective evaluation. Subjective methods judge audio quality based on human hearing and are consistent with human perception of audio quality, but they are time-consuming and labor-intensive, inflexible, and poorly repeatable and stable. Objective evaluation uses a machine to judge audio quality automatically; it gives results conveniently and quickly, without subjective influence. In this paper, the perceptual evaluation of speech quality (PESQ) is used for the imperceptibility analysis of audio signals. PESQ is an objective mean opinion score (MOS) evaluation method specified in ITU-T Recommendation P.862, which compares the stego audio with the original audio. In general, the score lies between 1.0 and 4.5: the worse the speech quality, the lower the score.

In order to evaluate the imperceptibility of the proposed method, we select a recently proposed audio information hiding method [32] for comparison. 100 stego audios are generated with each of the two embedding methods using the hidden texts in Table 1. The PESQ results are shown in Fig. 4: the average PESQ value of our proposed method is 3.598, while that of the method in [32] is 2.351.

Figure 4: The PESQ value of stego audios for the proposed method and the method in [32].

The audio with the lowest PESQ value is selected for waveform analysis; the waveforms are shown in Fig. 5. In general, the smaller the difference between the original audio and the stego audio, the larger the PESQ value, i.e., the better the imperceptibility. Fig. 5(a) is the waveform of the original audio and Fig. 5(b) is the waveform of the stego audio. To compare the two waveforms more intuitively, we overlay them in Fig. 5(c): the two audios overlap almost completely, with no visible difference. A small part, marked by the black box in Fig. 5(c), is therefore enlarged in Fig. 5(d) for closer observation. Even enlarged 30 times, the difference between them remains small, which shows that our method has good imperceptibility.

Figure 5: The comparison of waveforms between original and stego audios: (a) original audio; (b) stego audio; (c) the two waveforms overlaid; (d) detail view of subfigure (c).
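For reference, the PESQ score can be computed with the open-source `pesq` Python package (an implementation of ITU-T P.862); the file names below are hypothetical.

```python
from scipy.io import wavfile
from pesq import pesq  # pip install pesq

fs, original = wavfile.read('A00_original.wav')  # hypothetical file names
_, stego = wavfile.read('A00_stego.wav')

# Wide-band PESQ for 16 kHz audio; higher scores mean the stego audio
# is closer to the original (4.5 is the best possible score).
score = pesq(fs, original, stego, 'wb')
print(f'PESQ = {score:.3f}')
```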
5.3 Security

The security of audio information hiding refers to the inability of an attacker to extract the hidden information. In this paper, the original model (DeepSpeech v0.1.0) is used as the private model.
Since the private model acts like the key in traditional hiding methods, the security of the key space should be evaluated; we define this as model internal security. The model internal security analysis examines whether the model outputs the same result when the weights of the model differ. We evaluate this by comparing the extraction success rate of the private model with that of its upgraded version, DeepSpeech v0.2.0; the two DNN models share the same neural network structure but have different neuron weights. The extraction results are shown in Table 3.
The model external security analysis examines whether the model outputs the same result when the whole model structure and parameters differ. We evaluate this by comparing the extraction success rate of the private model with those of other ASR models: three public commercial ASR services, Google Cloud [10], IBM Watson [16], and iFlytek [17] Speech-to-Text, are used to extract the hidden information in the different groups. The extraction success rates are shown in Table 3.

Table 3: The extraction success rate of different ASR models.
         Internal security                       External security
Group    DeepSpeech v0.1.0  DeepSpeech v0.2.0    Google Cloud  IBM Watson  iFlytek
G1       100%               0%                   0%            0%          0%
G2       100%               0%                   0%            0%          0%
G3       100%               0%                   0%            0%          0%
G4       100%               0%                   0%            0%          0%
G5       100%               0%                   0%            0%          0%
G6       100%               0%                   0%            0%          0%
G7       100%               0%                   0%            0%          0%
G8       100%               0%                   0%            0%          0%
G9       100%               0%                   0%            0%          0%
G10      100%               0%                   0%            0%          0%
Average  100%               0%                   0%            0%          0%

From the above results, only the private model can extract the hidden information. Even the same model cannot extract the hidden information after its parameters are updated (DeepSpeech v0.2.0). In addition, judging from the specific texts extracted during the experiments, the public models only obtain content related to the original audio and nothing related to the hidden text. The security of this audio information hiding method is therefore high.
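As a sketch of the external-security check, the snippet below submits a stego audio to the Google Cloud Speech-to-Text service [10] using its Python client; the file name is hypothetical and credentials are assumed to be configured in the environment. For a stego audio from G1, the expected transcript is the original sentence rather than the hidden text "be quiet".

```python
from google.cloud import speech

client = speech.SpeechClient()
with open('A00_stego.wav', 'rb') as f:  # hypothetical stego file
    audio = speech.RecognitionAudio(content=f.read())
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
)
response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Expected: text related to the original audio, not the hidden text.
    print('extracted:', result.alternatives[0].transcript)
```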
5.4 Robustness

The robustness of audio information hiding refers to the ability of the stego audio to retain hidden text that can still be completely extracted after some modification or transformation. In order to test the robustness of the algorithm, the above 10 stego audio groups are processed as follows (an illustrative code sketch of the four operations is given at the end of this subsection):

1. Gaussian white noise: Gaussian white noise with a signal-to-noise ratio (SNR) of 20 dB is added to the stego audio signals;
2. Resampling attack: the stego audio signal is first resampled at 2 times the original sampling rate and then restored to the original sampling rate;
3. Low-pass filtering: a 2nd-order Butterworth low-pass filter with a cutoff frequency of 6 kHz is applied to the stego audio signals;
4. Echo interference: an echo with a 50% attenuation rate and a delay of 30 ms is added to the stego audio signals.

Table 4: The extraction success rate after the 4 signal processing methods.
Group    Gaussian Noise  Resampling  Low-pass Filtering  Echo Interference
G1       0%              50%         0%                  0%
G2       0%              30%         0%                  0%
G3       0%              70%         0%                  0%
G4       0%              70%         0%                  0%
G5       0%              50%         0%                  0%
G6       0%              60%         0%                  0%
G7       0%              20%         10%                 10%
G8       0%              60%         0%                  0%
G9       0%              40%         0%                  0%
G10      0%              10%         0%                  0%
Average  0%              46%         1%                  1%

As shown in Table 4, except under the resampling attack, the stego audio signals lose the hidden text after being processed by these methods. The experimental results show that the robustness of our proposed hiding technique is not good. Therefore, in order to enable the receiving end to extract the hidden text successfully, the stego audio signals can only be transmitted losslessly, for example, by uploading the audio file.
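The four processing operations can be reproduced with NumPy/SciPy as follows; this is an illustrative sketch using the parameters stated above.

```python
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

def add_awgn(x, snr_db=20):
    """Gaussian white noise at a 20 dB signal-to-noise ratio."""
    p_signal = np.mean(x.astype(np.float64) ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return x + np.random.normal(0, np.sqrt(p_noise), x.shape)

def resample_attack(x):
    """Up-sample by 2x, then restore the original sampling rate."""
    return resample_poly(resample_poly(x, 2, 1), 1, 2)

def lowpass(x, fs=16000, cutoff=6000, order=2):
    """2nd-order Butterworth low-pass filter with a 6 kHz cutoff."""
    b, a = butter(order, cutoff / (fs / 2))
    return lfilter(b, a, x)

def add_echo(x, fs=16000, delay_s=0.03, attenuation=0.5):
    """Echo with 50% attenuation and a 30 ms delay."""
    d = int(delay_s * fs)
    y = x.astype(np.float64).copy()
    y[d:] += attenuation * x[:-d]
    return y
```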
5.5 Steganalysis

Steganalysis is the counterpart of information hiding: it detects whether hidden information is present in data. According to the relationship between feature extraction and the embedding algorithm, it can be divided into special steganalysis and general steganalysis. Special steganalysis targets a specific type of information hiding method; based on a statistical analysis of the data, it exploits differences between statistical features to design a dedicated steganalysis algorithm. General steganalysis targets multiple types of hiding methods, which makes it more universal and of more practical value. It extracts features of the data to form a feature vector set, which is then used to train a neural network, clustering algorithm, or other model to build a detector for hidden information.

In this paper, we use a recently proposed general steganalysis method that takes the quantified modified DCT (QMDCT) coefficient matrix as the input of a convolutional neural network (CNN) [31] to analyze whether hidden information has been embedded in an audio signal. The CNN model is trained with the default configuration and dataset of [31], and the 200 original and stego audios are then analyzed by the model. The results show that 100% of the stego audios are recognized as containing embedded hidden information; however, 80% of the original audios are misidentified as well, which means the results cannot be trusted. The current mainstream audio steganalysis algorithm is therefore not accurate enough for our proposed DNN-based audio information hiding method.
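For orientation, the sketch below shows the general shape of such a CNN-based detector: a small binary classifier over precomputed QMDCT feature matrices. The input dimensions and the network layout are placeholders, not the architecture of [31].

```python
import tensorflow as tf

def build_steganalyzer(height=200, width=576):
    """Generic CNN steganalyzer: QMDCT matrices in, P(stego) out."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(height, width, 1)),
        tf.keras.layers.Conv2D(8, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(16, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation='sigmoid'),  # probability of stego
    ])

model = build_steganalyzer()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(qmdct_features, labels, ...)  # labels: 0 = cover, 1 = stego
```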
6 Application and the Intrinsic Backdoor

This paper proposes a new audio information hiding technique, which can also be used to activate an intrinsic backdoor of DNN-based ASR models.

Intelligent speakers, the core of the intelligent home control market, have developed rapidly. Meanwhile, the automatic speech recognition (ASR) service, the most basic component of intelligent speakers, is widely integrated into products such as Amazon Echo [2], Google Home [11], Xiaomi AI Speaker [34], and Tmall Genie [1].

When deploying an information hiding technique on IoT devices such as intelligent speakers, the overheads must be considered because of limited resources such as CPU, memory, and battery power.

Fig. 6 shows the whole process of traditional information hiding methods. In the extracting process, the encrypted information in the stego audio must be decrypted after being extracted. However, classic cryptographic security solutions incur expensive overheads, which is unacceptable on resource-constrained IoT devices. By contrast, as shown in Fig. 3, the information hiding method we propose needs neither a decryption unit nor key storage: the hidden information is obtained simply by feeding the stego audio into the ASR model, so the overhead is negligible. It can therefore be an appropriate solution for audio information hiding on resource-constrained IoT devices such as intelligent speakers.
Figure 6: The embedding and extracting process of traditional information hiding methods.
Figure 7: The recognition process of the DNN-based intelligent speaker. The blue lines indicate the process of recognizing a normal audio; the red lines indicate the process of activating the intrinsic backdoor of the ASR model by recognizing the stego audio.

However, the hidden information can also be used illegitimately by attackers to activate a backdoor of the DNN-based ASR model. Fig. 7 shows an example of the recognition process of an intelligent speaker. The intelligent speaker, which embeds a DNN-based ASR model, recognizes normal audio as a normal sentence, as indicated by the blue lines in the figure. However, when a stego audio hiding an activation instruction is sent to the intelligent speaker, it is recognized as a command that activates the backdoor of the DNN model, as indicated by the red lines. Once the backdoor of the intelligent speaker is activated, attackers can gain control of all intelligent IoT devices in the home, for example, opening the door, querying the position of a child's watch, and controlling curtains, the air conditioner, the TV, and so on. At the same time, personal private information may be leaked through these devices and used to carry out illegal activities or damage property.
7 Conclusion

This paper proposes a novel technique for audio information hiding based on adversarial examples, which takes the original audio signal as input and obtains the stego audio through the training process of a private ASR model. According to the experimental results, the generated stego audio signal has a hiding capacity of 48.0 cps with good imperceptibility: it is difficult for the human ear to perceive the difference between the original and the stego audio signal. Besides, the hidden text in a stego audio signal can only be extracted by the private ASR model; without knowing the internal parameters and structure of the private model, a public model can only extract the original text. The security of our proposed audio information hiding method is therefore high. In addition, our proposed adversarial audio poses a serious threat to DNN-based ASR models.

However, our proposed audio information hiding technique is not robust enough; at the current stage, the stego audio signals can only be transmitted losslessly. We expect to provide a new direction for audio information hiding and to gradually address this shortcoming in future research.
References

[1] Alibaba. Tmall Genie. https://bot.tmall.com/.
[2] Amazon. Amazon Echo.
[3] Ross Anderson, editor. Information Hiding, volume 1174 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 1996.
[4] Derya Avci, Turker Tuncer, and Engin Avci. A new information hiding method for audio signals. Pages 1-4, IEEE, March 2018.
[5] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. Proceedings - IEEE Symposium on Security and Privacy, pages 39-57, 2017.
[6] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In Proceedings - 2018 IEEE Symposium on Security and Privacy Workshops, SPW 2018, pages 1-7, 2018.
[7] Rupayan Das, Dipta Mukherjee, Rahul Sourav Singh, Suman Godara, and Saroj Kumar. DWTAS: A robust discrete wavelet transform approach towards audio steganography. Pages 198-204, IEEE, August 2017.
[8] Huynh Ba Dieu and Nguyen Xuan Huy. Hiding data in audio using modified CPT scheme. Pages 396-400, IEEE, December 2013.
[9] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, December 2014.
[10] Google. Google Cloud Speech-to-Text. https://cloud.google.com/speech-to-text/.
[11] Google. Google Home. https://store.google.com/product/google_home.
[12] Rainer E. Gruhn, Wolfgang Minker, and Satoshi Nakamura. Automatic speech recognition. In Signals and Communication Technology, pages 5-17. Springer Berlin Heidelberg, 2011.
[13] Awni Hannun. Sequence modeling with CTC. Distill, 2017. https://distill.pub/2017/ctc.
[14] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep Speech: Scaling up end-to-end speech recognition. CoRR, abs/1412.5567, 2014.
[15] Guang Hua, Jonathan Goh, and Vrizlynn L. L. Thing. Time-spread echo-based audio watermarking with optimized imperceptibility and robustness. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(2):227-239, February 2015.
[16] IBM. IBM Watson Speech-to-Text. https://speech-to-text-demo.ng.bluemix.net/.
[17] iFlytek. iFlytek Speech-to-Text.
[18] D. Iter, J. Huang, and M. Jermann. Generating adversarial examples for speech recognition. 2017.
[19] Shwetavinayakarao Jadhav and A. M. Rawate. A new audio steganography with enhanced security based on location selection scheme. International Journal of Performability Engineering, 12(5):451-458, 2016.
[20] Mahdi Jeyhoon, Mohammad Asgari, Lili Ehsan, and Seyedeh Zahra Jalilzadeh. Blind audio watermarking algorithm based on DCT, linear regression and standard deviation. Multimedia Tools and Applications, 76(3):3343-3359, February 2017.
[21] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
[22] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. Pages 2574-2582, IEEE, June 2016.
[23] Mozilla. Common Voice. https://voice.mozilla.org/datasets.
[24] Nhut Minh Ngo and Masashi Unoki. Robust and reliable audio watermarking based on phase coding. Pages 345-349, IEEE, April 2015.
[25] Nhut Minh Ngo and Masashi Unoki. Method of audio watermarking based on adaptive phase modulation. IEICE Transactions on Information and Systems, E99.D(1):92-101, 2016.
[26] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. Proceedings - 2016 IEEE European Symposium on Security and Privacy, EURO S&P 2016, pages 372-387, 2016.
[27] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn. Information hiding - a survey. Proceedings of the IEEE, 87(7):1062-1078, July 1999.
[28] G. J. Simmons. The history of subliminal channels. IEEE Journal on Selected Areas in Communications, 16(4):452-462, May 1998.
[29] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
[30] Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. Targeted adversarial examples for black box audio systems. CoRR, abs/1805.07820, 2018.
[31] Yuntao Wang, Kun Yang, Xiaowei Yi, Xianfeng Zhao, and Zhoujun Xu. CNN-based steganalysis of MP3 steganography in the entropy code domain. In Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security, IH&MMSec'18, pages 55-65, New York, NY, USA, 2018. ACM.
[32] Yong Xiang, Iynkaran Natgunanathan, Dezhong Peng, Guang Hua, and Bo Liu. Spread spectrum audio watermarking using multiple orthogonal PN sequences and variable embedding strengths and polarities. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(3):529-539, March 2018.
[33] Yong Xiang, Iynkaran Natgunanathan, Dezhong Peng, Wanlei Zhou, and Shui Yu. A dual-channel time-spread echo method for audio watermarking. IEEE Transactions on Information Forensics and Security, 7(2):383-392, April 2012.
[34] Xiaomi. Xiaomi AI Speaker.
[35] Xu Xie, Zhengguang Xu, and Hui Xie. Channel capacity analysis of spread spectrum watermarking in radio frequency signals. IEEE Access, 5:14749-14756, 2017.
[36] Qingquan Zong and Wei Guo. A speech information hiding algorithm based on the energy difference between the frequency band. In 2012 2nd International Conference on Consumer Electronics, Communications and Networks (CECNet).