Deep Adversarial Learning on Google Home Devices

Andrea Ranieri, Davide Caputo, Luca Verderame, Alessio Merlo, and Luca Caviglione

IEEE COMMUNICATIONS LETTERS, VOL. X, NO. Y, MONTH

IMATI - National Research Council of Italy, {andrea.ranieri, luca.caviglione}@ge.imati.cnr.it
DIBRIS - University of Genova, Italy, {davide.caputo, luca.verderame, alessio}@dibris.unige.it

Abstract—Smart speakers and voice-based virtual assistants are core components for the success of the IoT paradigm. Unfortunately, they are vulnerable to various privacy threats exploiting machine learning to analyze the generated encrypted traffic. To cope with that, deep adversarial learning approaches can be used to build black-box countermeasures altering the network traffic (e.g., via packet padding) and its statistical information. This letter showcases the inadequacy of such countermeasures against machine learning attacks with a dedicated experimental campaign on a real network dataset. Results indicate the need for a major re-engineering to guarantee the suitable protection of commercially available smart speakers.
Index Terms—smart speakers, IoT privacy, deep adversarial learning, machine learning, privacy leaks.
I. INTRODUCTION
The popularity of smart speakers (including voice-based virtual assistants) is rooted in their ability to control IoT nodes, network appliances, and other devices via natural speech. They can also be used to access multimedia content and obtain various information, including news, weather forecasts, and traffic conditions. To implement such functionalities, the speaker exchanges information with a remote data center, leading to several security issues, e.g., device enumeration attacks, mass profiling, and privacy threats. An emerging trend in cybersecurity exploits machine learning techniques to obtain information from the encrypted traffic exchanged by the speaker with its ecosystem [1], [2].

Literature abounds in works investigating how statistical analysis of network flows produced by smart devices can be abused for reconnaissance or attack purposes. For instance, the traffic produced by home devices can be used to understand if a user is at home [3] as well as to model daily routines [4] or the sleep cycle [2]. In general, attacks leveraging machine learning proved to be effective, even when relying upon “poor” information. As an example, IoT nodes and connected devices can be identified by simply using the length of the produced protocol data units [5]. When HTTP-based interactions are present, it is possible to infer precise details, e.g., the status of a light bulb, as well as to hijack the conversation or physically endanger the target [6].

An emerging research trend explores the use of various artificial intelligence and machine learning techniques to classify the commands issued to smart speakers (see, e.g., [7] and the references therein). To this aim, attackers take advantage of traffic features not protected by the encryption, such as inter-packet time, throughput, the location of some endpoints, and the number of connections.
Since the classification is typically accurate, the attacker can infer details like the number of devices controlled by the smart speaker, the presence of the user (even when the interaction is absent), and the “kind” of the issued commands [3]–[7]. Moreover, a relevant part of the traffic produced by smart speakers shares functional and technological traits with VoIP, meaning that it is also susceptible to attacks disclosing the language of the talker or other sensitive behaviors [8].

Therefore, this letter focuses on investigating deep adversarial learning countermeasures against machine learning attacks targeting the traffic produced by smart speakers. To the best of our knowledge, this aspect has been mostly overlooked so far. The only notable exception is [9], which proposes a padding scheme to protect IoT and smart devices from statistical analysis. Instead, our work aims to showcase the limitations of traffic manipulation or morphing approaches, which often lead to flawed countermeasures [10].

To do so, we built a new dataset containing the network traffic of a typical smart home environment and an experimental testbed to evaluate the efficacy of deep adversarial learning techniques. The experimental activities exploit both theoretical approaches, i.e., the usage of Savitzky–Golay filters and Additive White Gaussian Noise (AWGN) on all the statistical features, and realistic ones, named “Realistic Adversarial”, that use constant padding and AWGN techniques that consider the constraints of the protocols and networks in use. The achieved results argue that smart speaker privacy needs a complete rethink.

The rest of the letter is structured as follows: Section II provides the background and the threat model, and Section III presents the deep adversarial techniques used in this work. Section IV describes the evaluation testbed and Section V presents the obtained results. Finally, Section VI concludes the letter.

II. BACKGROUND AND ATTACK MODEL
Figure 1 depicts a typical smart speaker ecosystem that provides the voice-activated user interface and acts as a hub for other IoT nodes and network appliances. In essence, the speaker collects, samples, and transmits voice commands to remote cloud services in charge of processing data to deliver back textual/binary representations as well as additional content, e.g., multimedia streams. The smart speaker can also provide feedback to the user, play content, retrieve data from third-party providers (e.g., music streaming services) or drive other nodes via local network or short-range links like IEEE 802.15.4. Even the information gathered from commands exchanged locally between the speaker and various nodes can be used to threaten the privacy of the ecosystem [11]. This letter, however, focuses on the attacks exploiting the network traffic exchanged between the speaker and its remote cloud.

Fig. 1: MITM threat model of a smart home ecosystem.

From a security perspective, the continuous exchange of data between the smart speaker and the cloud is a prime point of fragility. As depicted in Figure 1, an adversary (denoted as attacker) can mount MITM (Man-in-the-Middle) attacks [12] to gather network traffic even in the case of a communication encrypted with TLS/SSL [13].

Even if many commercial smart speakers implement countermeasures to protect the network traffic, the majority is still prone to a variety of privacy-breaking attacks targeting a composite set of features observable within the encrypted traffic flows [1], [14]. Specifically, we focus on an attacker willing to use machine learning or deep learning algorithms on encrypted traffic samples to infer “behavioral” information, e.g., the presence of the victim or the “type” of the requested information [3]–[7].
Owing to the end-to-end encryption, the attacker can only observe and acquire the traffic produced by the smart speaker and cannot alter, manipulate, or perform deep packet inspection operations. The attacker can then only rely on general statistics, e.g., the throughput, the size of protocol data units, IP addresses, the number of different endpoints, flags within the headers of the packets, or the behavior of the TCP congestion control [7].

The standard approach to mitigate machine learning attacks on the network traffic exploits the use of a middlebox (denoted as anonymization box in Figure 1) able to “sanitize” the network traffic by removing (or altering) the data that the attacker can exploit. For instance, the anonymization box can pad packets [9] or perform NAT-like operations to prevent profiling endpoints or probing [15]. All in all, since the anonymization box is outside the device, it cannot alter the protocol/communication architecture of the smart speaker ecosystem. Rather, it can only manipulate the traffic without disrupting the flow or penalizing the QoE perceived by the user, for instance, in terms of real-time guarantees (see, e.g., [16] and references therein). In the following, we will showcase the limits of such an approach, which appears to be unsuited to face modern machine learning-capable threats.

III. DEEP ADVERSARIAL LEARNING TECHNIQUES
This letter investigates both theoretical and practical deep adversarial learning techniques to lower the classification accuracy. Ideally, we would like to reduce the classification accuracy to be as close as possible to a “coin toss” (e.g., 50% on a two-class classification problem). As in previous works [7], [17], [18], the attacker can only acquire encrypted traffic to compute statistical metrics and analyze them using machine learning techniques. Such a computation requires using a suitable number of packets grouped using either time spans of length ∆t or bursts of a fixed size of N packets.

The methods considered for this work are: i) smoothing of features through a Savitzky–Golay filter, ii) injecting Additive White Gaussian Noise (AWGN) into the feature time series, and iii) applying a Realistic Adversarial, i.e., a targeted approach to feature degradation that also considers the constraints of the protocols and networks in use.

The first two approaches aim to show the theoretical performances that could be obtained by randomizing, without any constraints, the statistics of the packets and the features derived from them. In detail, the Savitzky–Golay filter [19] allows smoothing all features through a moving window. The length of the filter window determines the number of samples taken into consideration, and a polynomial of user-defined order subsequently approximates these samples.

The second adversarial technique uses Additive White Gaussian Noise (AWGN) – therefore with zero mean – whose variance is set proportionally to the variance of the original signal subject to the adversarial manipulation.

The Realistic Adversarial, instead, represents an approach that takes into account which features can actually be distorted at the egress of the IoT device (therefore with external hardware such as an anonymization box) without compromising its operation (e.g., it is possible to add padding to the packets, but it is not possible to randomize the TCP window without disrupting the service completely).
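As a minimal sketch of the two unconstrained manipulations (assuming NumPy/SciPy; the feature series below is synthetic, not taken from our dataset, and the polynomial order and ν are placeholders):

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Synthetic stand-in for one per-window feature time series,
# e.g., the mean TCP packet length (mean_len_pack).
feature = 900.0 + 150.0 * np.sin(np.linspace(0.0, 8.0 * np.pi, 300)) \
    + rng.normal(0.0, 40.0, 300)

# i) Savitzky-Golay smoothing: a moving window whose samples are
# approximated by a low-order polynomial, stripping the
# high-frequency components the classifier relies on.
smoothed = savgol_filter(feature, window_length=51, polyorder=3)

# ii) AWGN injection: zero-mean Gaussian noise whose variance is
# proportional (factor nu) to the variance of the original signal.
nu = 4.0
noisy = feature + rng.normal(0.0, np.sqrt(nu * feature.var()), feature.size)
```

The window length of 51 matches the setup described in Sect. IV-B; both manipulations preserve the length and alignment of the time series, so the attacker's feature pipeline keeps working on the sanitized data.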
In this work, we exploited two Realistic Adversarial techniques. The first is a constant padding within the whole time series, which was simulated by selecting the maximum value of the mean TCP packet length (mean_len_pack). The standard deviation of the TCP packet length (std_len_pack) was set to zero to match this operation. The second technique concerns the injection of AWGN in the following features:
• std_ipt: to simulate jitter while sending packets;
• n_pack_tcp and n_pack_udp: to simulate decoy connections and packets between endpoints;
• n_pack_icmp: to simulate decoy ping/traceroute packets between endpoints;
• n_port_unique: to simulate decoy TCP/UDP packets addressed to random port numbers.
Moreover, to match the realistic scenario, we cannot modify the following features: max_diff_time, n_ip_unique, mean_window, and std_window.

IV. EXPERIMENTAL TESTBED
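As an illustration of the adversarial stage used in the testbed, the Realistic Adversarial of Sect. III might be sketched as follows (the dict-of-arrays layout, the helper name, and the default ν are our assumptions, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def realistic_adversarial(features, nu=4.0):
    """Apply only the manipulations feasible at the egress of the device.

    `features` maps Table I feature names to per-window time series.
    """
    out = {name: np.asarray(series, dtype=float).copy()
           for name, series in features.items()}

    # Constant padding: every window reports the maximum observed mean
    # TCP packet length, so the length's standard deviation collapses.
    out["mean_len_pack"][:] = out["mean_len_pack"].max()
    out["std_len_pack"][:] = 0.0

    # AWGN on the features an anonymization box can perturb: jitter,
    # decoy TCP/UDP/ICMP traffic, and decoy destination ports.
    for name in ("std_ipt", "n_pack_tcp", "n_pack_udp",
                 "n_pack_icmp", "n_port_unique"):
        series = out[name]
        series += rng.normal(0.0, np.sqrt(nu * series.var()), series.size)

    # max_diff_time, n_ip_unique, mean_window and std_window cannot be
    # altered without breaking the service: leave them untouched.
    return out
```

Note how the constraint works both ways: the untouched features are exactly the ones the neural network can later exploit to keep its accuracy high.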
We developed an experimental testbed to prove the effectiveness of privacy threats and the inadequacy of countermeasures applied to smart speakers’ network traffic. Briefly, we
Feature Name                                       | Adversarial Technique
Number of different IP addresses (n_ip_unique)     | None
Number of different TCP/UDP ports (n_port_unique)  | AWGN
Number of TCP packets (n_pack_tcp)                 | AWGN
Number of UDP packets (n_pack_udp)                 | AWGN
Number of ICMP packets (n_pack_icmp)               | AWGN
Per-window inter-packet time (max_diff_time)       | None
Average of TCP window (mean_window)                | None
Standard deviation of TCP window (std_window)      | None
Average of IPT (mean_ipt)                          | None
Standard deviation of IPT (std_ipt)                | AWGN
Average of packet length (mean_len_pack)           | Const. Padding
Standard deviation of packet length (std_len_pack) | Set to Zero
TABLE I: Names and acronyms of the statistical indicators used, and the adversarial techniques applied to them.

collected a dataset with the network traffic of the IoT device during typical usage scenarios, e.g., during voice queries or media playback. Then, we used a set of deep adversarial learning techniques to alter the statistical information of outbound traffic. Finally, we applied a set of ML techniques to evaluate the corresponding degradation of the classification process. The experimental activities were carried out on an Intel Core i7-3770 computer equipped with GB of RAM and Ubuntu 16.04 LTS, and a Google Home Mini smart speaker. The traffic has been captured via an instrumented computer acting as an IEEE 802.11 access point running ad-hoc scripts for tshark. All the generated traffic traces have been anonymized and made available through Kaggle.

A. Dataset Definition
In this letter, we extend the dataset already published in our previous work [7]. Briefly, the original dataset consists of 9 days of network traffic that comprise: i) traffic with the microphone disabled (Mic On-Off), ii) microphone enabled in a quiet environment, and iii) microphone enabled with background noise (Mic On-Noise).

Thus, to mimic the normal use of smart speakers by users, we extended the available data with three different classes of queries for the smart speaker, i.e., media, travel, and utility. To do so, we executed three different rounds of measurements that last three days each. In essence, a synthetic talker has been created by using various voice records representing a wide range of speakers (e.g., male and female, or with different accents or talking speeds) and it has been used to issue commands to the smart speaker. In the first round, we focused on retrieving the network traffic generated to play back multimedia content. For example, we captured traffic when the synthetic talker asked questions like “What’s the latest news?” or “Play some music”. For the second round, we performed queries related to travel, thus accounting for the interaction with services providing traffic indications or weather forecasts. In this case, we asked questions like “How is the weather today?”. Lastly, we performed general queries belonging to the utility category, like “What’s on my agenda today?” and “What time is it?”.

For each query, we collected seconds of inbound and outbound network traffic to have a proper tradeoff between accuracy and size of the data. The newly collected data contains k packets for media queries, k packets for travel queries, and k for utility queries.

https://voicebot.ai/2019/03/12/smart-speaker-owners-agree-that-questions-music-and-weather-are-killer-apps-what-comes-next/

B. Deep Adversarial Techniques Setup
We implemented the three deep adversarial learning techniques presented in Sect. III. In detail, we set the Savitzky–Golay filter moving window to 51, and we applied different polynomial degrees, i.e., ψ = { , , , , }. For the Additive White Gaussian Noise (AWGN) technique, we multiplied the original variance by a constant ν = [ . , . ] for the Mic On-Off and Mic On-Noise scenarios, and ν = { , , , , } for the utility/media/travel scenario.

Table I lists the set of network features computed for the experimental activity of this letter and the corresponding employed deep adversarial technique.

C. ML Techniques
To implement the attacker, we used machine learning algorithms provided by the scikit-learn and Fast.ai libraries. We considered the most popular techniques commonly used in the literature, i.e., AdaBoost (AB), Decision Tree (DT), k-Nearest Neighbors (kNN), Random Forest (RF), and Neural Networks (NN) [7], [20], [21]. In our experiment, we assume that the attacker is able to collect 3 days of traffic for each scenario, similar to [7]. The classifiers were neither pre-trained with the original data nor fed with previously trained models.

V. EXPERIMENTAL RESULTS
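The attacker models of Sect. IV-C can be sketched with scikit-learn alone (the NN, trained with Fast.ai in our experiments, is omitted; the feature matrix below is synthetic, with one weakly separable dimension per class, and stands in for the real per-window features):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-window feature matrix of the three
# query classes (media / travel / utility).
X = rng.normal(size=(600, 12))
y = rng.integers(0, 3, 600)
X[y == 1, 0] += 3.0  # give classes 1 and 2 a weakly separable signature
X[y == 2, 1] += 3.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

classifiers = {
    "AB": AdaBoostClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {acc:.2f}")
```

With any usable signal left in the features, all four models land well above the 33% random-choice baseline of a three-class problem, which is exactly the behavior the adversarial techniques try to suppress.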
In this section, we show and discuss the numerical results obtained. First, we show the performance and accuracy of the machine learning algorithms used to infer the query category. Next, we discuss the efficacy of the deep adversarial learning techniques to protect the network traffic.
Query Classification. Fig. 4 shows the classification accuracy for the ML techniques trained on the original data (not subjected to any adversarial technique) of the Utility/Media/Travel scenario. With sampling and feature generation intervals higher than packets, all the considered algorithms achieve a classification accuracy higher than , with the kNN algorithm being the least precise. The accuracy of the neural network model (i.e., NN) also tends to drop slightly as the interval increases, because the number of samples decreases too much for the training “from scratch” of a neural network (at packets, the number of samples is just above units for this scenario). Such behavior is also consistent with the classification of the Mic On-Off and Mic On-Noise scenarios, whose accuracy is higher than (as described in [7]).
Fig. 2: Mic On-Off / Mic On-Noise scenarios: classification accuracy of the neural network model after different adversarial techniques have been applied to the original features: (a), (d) Savitzky–Golay filter; (b), (e) AWGN; (c), (f) Realistic adversarial technique. Panels (a)–(c) refer to the Mic On-Off scenario, panels (d)–(f) to the Mic On-Noise scenario.

Fig. 3: Utility/Media/Travel scenario: classification accuracy of the neural network model after different adversarial techniques have been applied to the original features: (a) Savitzky–Golay filter; (b) AWGN; (c) Realistic adversarial technique.
Mitigation Results. The sets of images in Fig. 2 and Fig. 3 depict the accuracy of a classifier trained on data subjected to the three deep adversarial techniques. In detail, Figs. 2a to 2c show the accuracy obtained in the Mic On-Off classification scenario, Figs. 2d to 2f refer to the Mic On-Noise scenario, and Fig. 3 shows the classification performance of a classifier trained on the traffic features from the Utility/Media/Travel scenario.

In the first two scenarios, the trend is similar: the theoretically most effective adversarial technique is the polynomial smoothing, as it deprives the signal of most of its high-frequency components. On the contrary, AWGN injection is less effective, as it leaves much of the information untouched, especially with low values of ν (i.e., low variance multiplier values). Figs. 2c and 2f, conversely, show the implementation of a realistic adversarial technique, which is therefore not able to degrade all the features simultaneously. As can be seen from the accuracy levels (for some sampling intervals > packets, higher than ), the neural network model is sufficiently “intelligent” to correctly predict the class using just the few remaining features and ignoring all the others. In this case, higher AWGN intensity does not appear to affect the prediction in any way.

Fig. 4: Utility/Media/Travel scenario: classification accuracy of the different machine learning/deep learning models using unmodified original features.

The analysis of the queries scenario in Fig. 3 confirms that the polynomial smoothing techniques are the most efficient, bringing the accuracy of the classifier down to around , very close to the theoretical level of a 33% random choice for a problem with three classes (cf. Fig. ). On the contrary, Fig. 3b highlights the need for a substantially higher ν value for the AWGN in order to sufficiently degrade the features for this scenario. Indeed, unlike the Mic On-Off and Mic On-Noise scenarios (with ν = 2. ), this scenario requires a variance multiplier ν = 64 to bring the accuracy level of the ML classifiers to values close to the random-choice level. Finally, Fig. 3c shows the accuracy obtained from the same neural network architecture trained on degraded features with the realistic adversarial. As in Figs. 2c and 2f, the impossibility of degrading all the features at once, leaving some of them intact, preserves enough information for the neural network to learn how to correctly identify the three classes of the problem, even with an accuracy of when considering an interval of packets, independently from the magnitude of the injected AWGN.
In this letter, we have empirically demonstrated how adversarial learning countermeasures applied to the virtual assistant’s outbound traffic are ineffective against machine learning attacks, thus leading to serious concerns for the privacy of users in smart home environments. The results indicate the need for a major HW/SW redesign of virtual assistant platforms to ensure adequate protection of commercially available smart speakers.

REFERENCES

[1] Y. Yang, L. Wu, G. Yin, L. Li, and H. Zhao, “A Survey on Security and Privacy Issues in Internet-of-Things,” IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1250–1258, 2017.
[2] N. Apthorpe, D. Reisman, and N. Feamster, “A Smart Home is No Castle: Privacy Vulnerabilities of Encrypted IoT Traffic,” 2017. [Online]. Available: http://arxiv.org/abs/1705.06805
[3] B. Copos, K. Levitt, M. Bishop, and J. Rowe, “Is Anybody Home? Inferring Activity from Smart Home Network Traffic,” in IEEE Security and Privacy Workshops. IEEE, 2016, pp. 245–251.
[4] A. Acar, H. Fereidooni, T. Abera, A. K. Sikder, M. Miettinen, H. Aksu, M. Conti, A.-R. Sadeghi, and A. S. Uluagac, “Peek-a-Boo: I See Your Smart Home Activities, Even Encrypted!” 2018. [Online]. Available: http://arxiv.org/abs/1808.02741
[5] A. J. Pinheiro, J. d. M. Bezerra, C. A. Burgardt, and D. R. Campelo, “Identifying IoT Devices and Events Based on Packet Length from Encrypted Traffic,” Computer Communications, vol. 144, pp. 8–17, 2019.
[6] Y. Amar, H. Haddadi, R. Mortier, A. Brown, J. Colley, and A. Crabtree, “An Analysis of Home IoT Network Traffic and Behaviour,” 2018. [Online]. Available: http://arxiv.org/abs/1803.05368
[7] D. Caputo, L. Verderame, A. Ranieri, A. Merlo, and L. Caviglione, “Fine-hearing Google Home: Why Silence Will Not Protect Your Privacy,” J. Wirel. Mob. Networks Ubiquitous Comput. Dependable Appl., vol. 11, no. 1, pp. 35–53, 2020.
[8] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson, “Language Identification of Encrypted VoIP Traffic: Alejandra y Roberto or Alice and Bob?” in USENIX Security Symposium, vol. 3, 2007, pp. 43–54.
[9] A. J. Pinheiro, P. F. de Araujo-Filho, J. d. M. Bezerra, and D. R. Campelo, “Adaptive Packet Padding Approach for Smart Home Networks: A Trade-off Between Privacy and Performance,” IEEE Internet of Things Journal, 2020.
[10] A. Houmansadr, C. Brubaker, and V. Shmatikov, “The Parrot Is Dead: Observing Unobservable Network Communications,” in IEEE Symposium on Security and Privacy. IEEE, 2013, pp. 65–79.
[11] B. Nour, K. Sharif, F. Li, and Y. Wang, “Security and Privacy Challenges in Information-Centric Wireless Internet of Things Networks,” IEEE Security & Privacy, vol. 18, no. 2, pp. 35–45, 2019.
[12] S. Andy, B. Rahardjo, and B. Hanindhito, “Attack Scenarios and Security Analysis of MQTT Communication Protocol in IoT System,” in , 2017.
[13] M. Conti, N. Dragoni, and V. Lesyk, “A Survey of Man in the Middle Attacks,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 2027–2051, 2016.
[14] E. Alepis and C. Patsakis, “Monkey Says, Monkey Does: Security and Privacy on Voice Assistants,” IEEE Access, vol. 5, pp. 17841–17851, 2017.
[15] M. Gregorczyk, P. Żórawski, P. Nowakowski, K. Cabaj, and W. Mazurczyk, “Sniffing Detection Based on Network Traffic Probing and Machine Learning,” IEEE Access, vol. 8, pp. 149255–149269, 2020.
[16] J. Fan, C. Guan, K. Ren, Y. Cui, and C. Qiao, “SPABox: Safeguarding Privacy During Deep Packet Inspection at a Middlebox,” IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3753–3766, 2017.
[17] D. Su, J. Liu, S. Zhu, X. Wang, and W. Wang, “‘Are You Home Alone?’ ‘Yes’: Disclosing Security and Privacy Vulnerabilities in Alexa Skills,” arXiv preprint arXiv:2010.10788, 2020.
[18] M. R. Shahid, G. Blanc, Z. Zhang, and H. Debar, “IoT Devices Recognition Through Network Traffic Analysis,” in IEEE International Conference on Big Data. IEEE, 2018, pp. 5187–5192.
[19] A. Savitzky and M. J. Golay, “Smoothing and Differentiation of Data by Simplified Least Squares Procedures,” Analytical Chemistry, vol. 36, no. 8, pp. 1627–1639, 1964.
[20] Z. Li, R. Yuan, and X. Guan, “Accurate Classification of the Internet Traffic Based on the SVM Method,” in . IEEE, 2007.
[21] A.-m. Yang, S.-y. Jiang, and H. Deng, “A P2P Network Traffic Classification Method Using SVM,” in 2008 The 9th International Conference for Young Computer Scientists.