Audio Adversarial Examples: Attacks Using Vocal Masks
Kai Yuan Tay, Lynnette Ng, Wei Han Chua, Lucerne Loke, Danqi Ye, Melissa Chua
Abstract
We construct audio adversarial examples on automatic Speech-To-Text systems. Given any audio waveform, we produce another by overlaying an audio vocal mask generated from the original audio. We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx. In addition, we engaged human annotators to transcribe the adversarial audio. Our experiments show that these adversarial examples fool State-Of-The-Art Speech-To-Text systems, yet humans are able to consistently pick out the speech. The feasibility of this attack introduces a new domain to study machine and human perception of speech.
Introduction

With the advent of virtual assistants such as Google Assistant, Apple's Siri and Amazon's Alexa, more attention has been brought to the space of Speech-To-Text (STT), where natural language commands are converted into computer text. These texts carry information from the natural language command, authorising the virtual assistant to execute an action, such as delivering news or even ordering products from an online store (Mari, 2019).

Adversarial machine learning is a sub-field of machine learning that has gained much attention in recent years. In adversarial machine learning, malicious inputs exploit weaknesses in a trained model or training regime to produce undesired and unforeseen behaviors. Goodfellow et al. (2014) proposed the Fast Gradient Sign Method for generating adversarial inputs to image classifiers. Jang et al. (2017) further refined gradient-based techniques by proposing a method that generates minimal perturbations on images in relation to their features, demonstrating how easily image classifiers can be exploited.

Most research in adversarial machine learning has been centered around the domain of computer vision, and very few techniques are available for creating adversarial examples in STT. Carlini and Wagner (2018) proposed a gradient-based technique adopted from the image classification domain and applied it to audio samples, creating adversarial examples by iteratively minimising perturbations introduced into the audio waveform while ensuring that the waveforms are transcribed as another message. Schönherr et al. (2019) use psychoacoustic hearing properties to generate perturbations below the human hearing threshold, so that the attack is inaudible to the listener; the generated audio renders transcription through Kaldi ineffective. Zhang et al. (2017) introduced the DolphinAttack, which modulates voice commands onto ultrasonic carriers to achieve inaudible attack vectors. While it is validated on popular speech recognition systems, this attack is costly to construct.
Just like how our brains fill in the gaps in our vision, our brains formulate speech to allow us to understand the speaker better (King, 2007). In a noisy environment, if we pay enough attention to a speaker and his speech, it is relatively easy for our brains to follow the words. Following this concept, as long as any background noise is below a certain threshold in decibels, humans should not fail to formulate a speaker's speech (Darwin, 2007). On the other hand, STT systems often rely largely on several quantitative qualities of speech in order to transcribe it accurately. Lacking the language ability to fill in words that were not heard clearly, STT systems may fail to transcribe accurately in noisy situations.

Human speech sounds are produced by the different shapes of the vocal tract. Mel-Frequency Cepstral Coefficients (MFCCs) (Muda et al., 2010) and vocal masks are frequently used to represent the shape of the vocal tract, which manifests itself in the short-time power spectrum generated by performing a Short-Time Fourier Transform (STFT) on the audio signal. STT systems learn these quantitative representations of words from audio waveforms in order to transcribe subsequent audio files. Recent work applying convolutional neural networks to mel-frequency spectrograms has shown remarkable accuracy in voice separation (Simpson et al., 2015; Ikemiya et al., 2016). This is largely due to extensive learning of vocal-mask features from mel-frequency spectrograms (Lin et al., 2018), and of how they differ from other sounds, such as background music. However, these properties allow adversaries to mount specific attacks against these deep learning systems.

In this paper, we propose a novel method of creating adversarial examples on audio signals that attacks five State-Of-The-Art (SOTA) STT systems. These adversarial examples are transcribed as a different message by Speech-To-Text systems, while humans are able to decipher the speech in the signal, rendering STT systems inadequate. We achieve this by overlaying a vocal mask on top of the original audio, exploiting the inability of these neural networks to differentiate a vocal mask from the original speech, and resulting in an average Word Error Rate (WER) of 0.64. In comparison, the same adversarial audio can be transcribed by human annotators with an average WER of 0.28. This is an end-to-end attack that operates directly on the raw samples used as inputs to the neural networks.

The tasks that this paper attempts are:
1. Generation of adversarial audio examples at different decibel levels using Mel-Frequency Cepstral Coefficient (MFCC) properties
2. Generation of targeted adversarial audio examples using the Carlini-Wagner audio attack (Carlini and Wagner, 2018)
3. Transcription of adversarial audio examples using five SOTA speech-to-text transcription neural networks
4. Comparison of the transcription output from neural networks with human transcription

The State-Of-The-Art Speech-To-Text systems attacked in this paper are:
1. DeepSpeech (Hannun et al., 2014). An end-to-end speech recognition system built upon Baidu's DeepSpeech architecture. To enhance its capabilities, DeepSpeech learns a function that is robust to background noise, reverberation and speaker variation.
2. Kaldi (Povey et al., 2011). One of the oldest toolkits for Automatic Speech Recognition (ASR), which integrates with Finite State Transducers and provides extensive linear algebra support built upon the OpenBLAS library.
Kaldi supports Hidden Markov Models, Gaussian Mixture Models, and neural-network based acoustic modelling.
3. Wav2letter@anywhere (Pratap et al., 2020). Uses the ArrayFire tensor library and the flashlight machine learning library, enabling training on the LibriSpeech dataset to be completed within minutes. The language model uses Time-Depth Separable Convolutions, while the decoder module is a beam-search decoder.
4. Julius (Lee and Kawahara, 2009a). A large vocabulary continuous speech recognition decoder software based on n-gram and context-dependent Hidden Markov Models, developed in Japan since 1997 (Radenen and Artieres, 2012). The decoding algorithm is based on a two-pass tree-trellis search that incorporates major decoding techniques including tree lexicons, n-gram factoring, enveloped beam search, and deep neural networks.
5. CMUSphinx (Huggins Daines et al., 2006). A collection of speech recognition tools aiming to overcome the constraints of isolated words and small vocabularies. Its acoustic model is based on Hidden Markov Models, with speech constraints overcome through function-word-dependent phone models and generalised triphone models (Kai-Fu, 1989).

To facilitate future work, we make our dataset available, and we encourage the reader to listen to our audio adversarial examples at https://drive.google.com/open?id=1ixwuvPrk1H-hveX5HNSWWYGSzx-lWYnC.

Methodology

The transcription performance on audio adversarial examples of SOTA neural network-based STT models and of human annotators is evaluated empirically with a series of experiments. This section describes the generation of the audio adversarial examples and defines the threat model and evaluation metric, which are subsequently used to compare the produced transcriptions against the original transcriptions.
Threat Model

Given an audio waveform x, a target transcription y, a SOTA STT model C, and a human transcription D, our task is to construct another audio waveform x′ = x + δ such that D(x′) = y but C(x′) ≠ y.

To quantify the distortion introduced by an adversarial audio, we use the decibel (dB) measure. dB is a logarithmic scale that measures the relative loudness of an audio sample. Since it is a relative measure, we use the original waveform x as the reference point for the adversarial audio x′:

dB(x′) = 10 log₁₀ (S(x′) / S(x)),

where S is the function that maps a waveform to the intensity of its sound waves. In this paper, most of the distortion is quieter than the original signal; the distortion is thus a negative number. This metric may not be a perfect measure of distortion; however, the audio distortion may be imperceptible to humans, as we show in our experiments, where human annotators fare better than SOTA STT systems.

Evaluation Metric

To calculate the accuracy of a transcription, we compute the Word Error Rate (WER) of the produced transcription against the original transcription provided by the TIMIT dataset. The WER is a common measure of accuracy in performance measurement of STT systems. It is derived from the Levenshtein distance, which measures the difference between two sequences through the minimum number of edits. Because the difference is measured in words, the transcribed word sequence can have a different length from the ground truth. However, this measure does not detail the nature of the transcription errors. In our experiments, we calculate the Levenshtein distance using the Wagner-Fischer algorithm (Wagner and Fischer, 1974). Note that the lower the WER, the better the transcription:

WER = (S + D + I) / N,

where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of words in the reference.
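As an illustration, a minimal sketch of this computation follows; the function name and plain whitespace tokenisation are our own simplifications, not the exact implementation used in our experiments:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via the Wagner-Fischer dynamic programme over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)       # assumes a non-empty reference
```

Because insertions are counted, the WER can exceed 1.0 when the hypothesis contains more words than the reference, a case we observe in our results.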
Dataset

We used the DARPA-TIMIT Acoustic-Phonetic Continuous Speech Corpus (Garofolo et al., 1993) for the construction of adversarial audio and for our experiments. This dataset contains time-aligned transcriptions of phonetically rich spoken American English sentences. A hand-verified transcription of the corpus is made available, which serves as the reference during the calculation of model accuracy. This dataset is suitable for our studies, as we are studying spoken speech and our human annotators are well-versed in the English language.
Generating Audio Adversarial Examples

We generate obfuscated audio with the Python library Pydub (Jiaaro, 2016).
Pydub's AudioSegment object provides several methods that allow easy manipulation of audio, including reading and writing audio files, as well as reversing audio files and overlaying them on other audio files.

Let x be the original audio waveform, and δ be the adversarial audio signal generated using AudioSegment as an attack on the original waveform. We generate δ by reversing x. We then overlay δ onto x, that is, x′ = x + δ. As decibels measure the intensity of the sound, dB(δ) = dB(x). We generate five adversarial audio waveforms by varying the intensity of the sound, each time decreasing the intensity of δ. These waveforms are represented as dB(δ−p) = dB(δ) − p, where p is the number of decibels by which δ−p is quieter than δ. For example, if the adversarial audio is decreased by 5 decibels, dB(δ−5) = dB(δ) − 5.
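A minimal sketch of this generation step with Pydub follows; the file paths, helper name, and the value of p are our own illustrative choices:

```python
from pydub import AudioSegment

def make_adversarial(path_in: str, path_out: str, p: float = 5.0) -> None:
    """Overlay a reversed, attenuated copy of a recording onto itself."""
    x = AudioSegment.from_file(path_in)   # original waveform x
    delta = x.reverse()                   # adversarial signal: x reversed
    delta_p = delta - p                   # attenuate: dB(delta_p) = dB(delta) - p
    x_adv = x.overlay(delta_p)            # x' = x + delta_p
    x_adv.export(path_out, format="wav")

make_adversarial("original.wav", "adversarial.wav", p=5.0)
```

Subtracting a number from an AudioSegment applies a gain reduction in decibels, which matches the attenuation scheme described above.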
Neural Network Models

We constructed the experimental setup of the neural network models as follows (a sketch of a typical inference call is given after this list):

1. DeepSpeech: Used the provided pre-trained English model, an n-gram language model trained from a corpus of 220 million phrases with a vocabulary of 495,000 words. This was trained on the LibriSpeech (Panayotov et al., 2015), Fisher (Cieri et al., 2004), Switchboard (Godfrey et al., 1992) and Common Voice English (Ardila et al., 2019) datasets for 233,784 steps, with the best validation loss selected at the end of 75 epochs.
2. Julius: Used the provided pre-trained n-gram language model in a hybrid Deep Neural Network with Hidden Markov Model architecture (LM+DNN-HMM), with a 262,000-word dictionary and a 32-bit language model (Lüscher et al., 2019; Lee and Kawahara, 2009b).
3. Kaldi: Used the Kaldi pre-trained ASpIRE Chain Model (Kaldi, 2016) with an already compiled HCLG phoneme decoding graph for inference, trained on Fisher English (Cieri et al., 2004). HCLG is a hidden-Markov finite state transducer representing the lexicon, grammar and phonetic contexts.
4. Wav2letter@anywhere: Used the provided inference platform that pre-trains BERT (Devlin et al., 2018) models with the LibriSpeech dataset (Panayotov et al., 2015).
5. CMUSphinx: Used the provided inference platform via the PocketSphinx Python bindings (Huggins Daines et al., 2006), with the pre-trained US English model constructed from Wall Street Journal data using hidden-Markov finite state transducers.
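For illustration, a sketch of a typical inference call using the DeepSpeech Python package; the model file names and the audio file name are assumptions, and DeepSpeech expects 16 kHz, 16-bit mono PCM input:

```python
import wave
import numpy as np
import deepspeech

# Pre-trained acoustic model and external scorer (file names are assumptions).
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

with wave.open("adversarial.wav", "rb") as f:
    # Read the raw PCM frames as 16-bit samples.
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

print(model.stt(audio))  # transcription of the adversarial audio
```

The other four systems are driven through their own provided inference platforms in an analogous fashion.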
Baseline

We used the Carlini-Wagner (CW) audio adversarial attack (Carlini and Wagner, 2018) as our baseline. This attack adds a small perturbation, quieter than the original signal, to the original audio, changing the transcribed result when the audio is passed through DeepSpeech. We show that our audio attacks surpass the CW attack.

Human Audio Transcription
We engaged seven people aged 25-30 to transcribe the audio for us. Each person transcribed 34 audio files of a single type of obfuscation. These audio files were randomly chosen from 2 dialect groups of the DARPA-TIMIT corpus' test directory. As the audio files were primarily in English, to reduce problems with understanding the text, we engaged people whose first spoken and written language is English. We asked the annotators to listen to the samples via headphones. The task was to type all words of the audio sample into a blank text field without assistance from auto-complete or grammar and spell-checking. Annotators often repeated the audio samples in order to enter a complete sentence. In a post-processing phase, we removed symbols and new lines from the transcripts before calculating the WER.
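The exact clean-up rules are not material to the results; a sketch of one plausible normalisation step before the WER calculation (the function name and regular expressions are our own assumptions):

```python
import re

def normalise(transcript: str) -> str:
    """Lower-case, strip symbols and new lines, collapse whitespace."""
    t = transcript.lower()
    t = re.sub(r"[^a-z' ]+", " ", t)      # drop symbols, digits and new lines
    return re.sub(r"\s+", " ", t).strip()  # collapse runs of whitespace
```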
Results

We present the mean WER for the various transcription systems in Table 1. We note that DeepSpeech attains a WER greater than 1.0 on some attacks because its transcriptions contain more words than the reference speech. This is likely because DeepSpeech was trained on datasets with far larger vocabularies than the TIMIT dataset. In terms of audio adversarial attacks, we note that one of the attenuated attacks x + δ−p produces the best balance between fooling neural STT systems and human identification of the audio. At the same time, it has a higher WER than x + CW, showing our attack to be more effective at fooling neural network systems than the baseline Carlini-Wagner attack.

Audio Files   DeepSpeech    Julius        Kaldi         wav2letter@anywhere   CMUSphinx     Humans
x
x + δ
x + δ−5
x + δ−10
x + δ−15
x + δ−20
x + CW        0.49 (0.31)   0.84 (0.26)   0.49 (0.26)   0.28 (0.25)           0.85 (0.36)   0.10 (0.14)

Table 1: Mean (Standard Deviation) WER of transcriptions for our experiments
Discussion

Upon investigation, SOTA STT systems fail to transcribe our attacks accurately, which we explain in terms of the attack audio's similarity to the original and of its audio signal properties.
We posit that our attack successfully fools STT systems because of the similarity of the attack audio to its original audio. We note that the mean normalised cosine similarity between our attacked audio and the original is very close to 1.0, surpassing that of the CW-attacked audio. Table 2 shows the similarities of our attacks.

Audio File    Mean          Standard Deviation
x + δ
x + δ−5
x + δ−10
x + δ−15
x + δ−20
x + CW        0.999961      0.0000229

Table 2: Cosine similarities between the original audio and the adversarial audio
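A sketch of how such a similarity can be computed between two waveforms; truncating to a common length is our own simplification:

```python
import numpy as np

def cosine_similarity(x: np.ndarray, x_adv: np.ndarray) -> float:
    """Normalised cosine similarity between two waveforms."""
    n = min(len(x), len(x_adv))        # align lengths before comparing
    a = x[:n].astype(np.float64)
    b = x_adv[:n].astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```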
We analyse the original signal and our adversarial audio signal using two audio signal analysis methods: the Fast Fourier Transform (FFT) and spectrograms. Figure 1 presents the spectrograms and FFT plots of our audio adversarial examples. The spectrograms show that x + CW retains clear vocal masks of the original audio, which makes it possible to retrieve some semblance of the original audio using a vocal-mask technique built upon computer vision principles. Our attacks obfuscate the original audio by appearing as the original audio, yet in the opposite direction, and hence are able to fool techniques that employ vocal-mask analyses. In the case of the FFT plots, the plot generated from x + CW is very similar to the plot of x, which means it is possible to retrieve x from the adversarial audio. However, our adversarial audio has different FFT forms compared to x, since our attack introduces additional audio signals, thereby preventing attempts to recover the original audio.
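Plots of this kind can be reproduced along the following lines; the file name is an assumption, and we assume a mono WAV recording:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from scipy.io import wavfile

sr, x = wavfile.read("adversarial.wav")    # assumes mono, 16-bit PCM

freqs = np.fft.rfftfreq(len(x), d=1 / sr)  # FFT magnitude spectrum
mag = np.abs(np.fft.rfft(x))

f, t, Sxx = signal.spectrogram(x, fs=sr)   # spectrogram via STFT

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
ax1.plot(freqs, mag)
ax1.set(xlabel="Frequency (Hz)", ylabel="Magnitude")
ax2.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10))  # power in dB
ax2.set(xlabel="Time (s)", ylabel="Frequency (Hz)")
plt.show()
```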
Our results reflect that humans are better at the transcription task, regardless of obfuscation. Humans have innate experience with language and are able to predict words in a sentence even if they did not hear them in the audio. This phenomenon is known as auditory perceptual restoration, where the brain fills in missing information in areas where noise obstructs portions of sounds (Bidelman and Patro, 2016; King, 2007). Commonly observed in conversations with loud background noise, this phonemic restoration effect occurs when the brain restores sounds missing from a speech signal. The effect is typically observed when missing phonemes in an auditory signal are replaced with noises that mask the original phonemes, creating an auditory ambiguity (Repp, 1992; Groppe et al., 2010). In our experiments, the attack audio masks some of the original audio, triggering perceptual restoration and allowing humans to produce better transcriptions.

Future Work

From this work, we acknowledge that humans possess an innate language ability which allowed them to perform better in the transcription task. We posit that future STT systems can include predictive abilities, such as generative models with attention gates that govern which audio features carry more importance, thereby allowing STT systems to predict transcriptions while also inferring them directly from the audio files.
Conclusion

We demonstrate a novel method of creating audio adversarial examples by reversing the audio signal and overlaying it on the original speech. We present evidence that these adversarial examples render vocal masks obsolete due to the inability to distinguish the reversed audio from the true audio. Our experiments show that these adversarial examples fool State-Of-The-Art Speech-To-Text systems, yet humans are able to consistently pick out the speech. We hope that future work will continue to investigate audio adversarial examples and improve STT systems with the predictive language abilities that humans possess.

References
Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus.

Gavin M. Bidelman and Chhayakanta Patro. 2016. Auditory perceptual restoration and illusory continuity correlates in the human brainstem. Brain Research, 1646:84–90.

Nicholas Carlini and David A. Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7.

Christopher Cieri, David Miller, and Kevin Walker. 2004. The Fisher corpus: a resource for the next generations of speech-to-text. In LREC.

C. J. Darwin. 2007. Listening to speech in the presence of other sounds. Philosophical Transactions of the Royal Society B: Biological Sciences, 363:1011–1021.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.

J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CDROM.

J. J. Godfrey, E. C. Holliman, and J. McDaniel. 1992. SWITCHBOARD: telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520, March.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples.

David Groppe, Marvin Choi, Tiffany Huang, Joseph Schilz, Ben Topkins, Thomas Urbach, and Marta Kutas. 2010. The phonemic restoration effect reveals pre-N400 effect of supportive sentence context in speech perception. Brain Research, 1361:54–66, November.

Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. 2014. Deep Speech: Scaling up end-to-end speech recognition. ArXiv, abs/1412.5567.

David Huggins Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and Alexander Rudnicky. 2006. PocketSphinx: A free, real-time continuous speech recognition system for hand-held devices. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 1, pages I–I.

Yukara Ikemiya, Katsutoshi Itoyama, and Kazuyoshi Yoshii. 2016. Singing voice separation and vocal F0 estimation based on mutual combination of robust principal component analysis and subharmonic summation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):2084–2095, November.

Uyeong Jang, Xi Wu, and Somesh Jha. 2017. Objective metrics and gradient descent algorithms for adversarial examples in machine learning. In Proceedings of the 33rd Annual Computer Security Applications Conference, ACSAC 2017, pages 262–277, New York, NY, USA. Association for Computing Machinery.

Jiaaro. 2016. Pydub. Last accessed: 15 January 2020.

Lee Kai-Fu. 1989. The development of the Sphinx system. In Automatic Speech Recognition, volume 62. Springer US.

Kaldi. 2016. ASpIRE chain model. https://kaldi-asr.org/models/m1.

Andrew King. 2007. Auditory neuroscience: Filling in the gaps. Current Biology, 17:R799–R801, October.

Akinobu Lee and Tatsuya Kawahara. 2009a. Recent development of open-source speech recognition engine Julius. In Proceedings of the 2009 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference.

Akinobu Lee and Tatsuya Kawahara. 2009b. Recent development of open-source speech recognition engine Julius.

Kin Wah Edward Lin, T. Balamurali B., Enyan Koh, Simon Lui, and Dorien Herremans. 2018. Singing voice separation using a deep convolutional neural network trained by ideal binary mask and cross entropy. Neural Computing and Applications, pages 1–14.

Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney. 2019. RWTH ASR systems for LibriSpeech: Hybrid vs attention. In Interspeech 2019, September.

Alex Mari. 2019. Voice commerce: Understanding shopping-related voice assistants and their effect on brands.

Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. 2010. Voice recognition algorithms using Mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. J Comput, 2, March.

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, April.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December.

Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. 2020. Scaling up online speech recognition using ConvNets. https://research.fb.com/wp-content/uploads/2020/01/Scaling-up-online-speech-recognition-using-ConvNets.pdf.

Mathieu Radenen and Thierry Artieres. 2012. Contextual hidden Markov models, pages 2113–2116, March.

Bruno H. Repp. 1992. Perceptual restoration of a "missing" speech sound: Auditory induction or illusion? Perception & Psychophysics, 51:14–32, January.

Lea Schönherr, Katharina Kohls, Steffen Zeiler, Thorsten Holz, and Dorothea Kolossa. 2019. Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. In Network and Distributed System Security Symposium (NDSS).

Andrew J. R. Simpson, Gerard Roma, and Mark D. Plumbley. 2015. Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network.

Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. J. ACM, 21(1):168–173, January.

Guoming Zhang, Chen Yan, Xiaoyu Ji, Tianchen Zhang, Taimin Zhang, and Wenyuan Xu. 2017. DolphinAttack: Inaudible voice commands. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 103–117, New York, NY, USA. Association for Computing Machinery.

Figure 1: Spectrograms and FFT plots of the original audio x, x + CW, x + δ and x + δ−15.