Deep Adversarial Learning on Google Home Devices

Andrea Ranieri, Davide Caputo, Luca Verderame, Alessio Merlo, and Luca Caviglione

IEEE COMMUNICATIONS LETTERS, VOL. X, NO. Y, MONTH

IMATI - National Research Council of Italy, {andrea.ranieri, luca.caviglione}@ge.imati.cnr.it
DIBRIS - University of Genova, Italy, {davide.caputo, luca.verderame, alessio}@dibris.unige.it

Abstract—Smart speakers and voice-based virtual assistants are core components for the success of the IoT paradigm. Unfortunately, they are vulnerable to various privacy threats exploiting machine learning to analyze the generated encrypted traffic. To cope with that, deep adversarial learning approaches can be used to build black-box countermeasures altering the network traffic (e.g., via packet padding) and its statistical information. This letter showcases the inadequacy of such countermeasures against machine learning attacks with a dedicated experimental campaign on a real network dataset. Results indicate the need for a major re-engineering to guarantee the suitable protection of commercially available smart speakers.
Index Terms—smart speakers, IoT privacy, deep adversarial learning, machine learning, privacy leaks.
I. INTRODUCTION
The popularity of smart speakers (including voice-based virtual assistants) is rooted in their ability to control IoT nodes, network appliances, and other devices via natural speech. They can also be used to access multimedia content and obtain various information, including news, weather forecasts, and traffic conditions. To implement such functionalities, the speaker exchanges information with a remote data center, leading to several security issues, e.g., device enumeration attacks, mass profiling, and privacy threats. An emerging trend in cybersecurity exploits machine learning techniques to obtain information from the encrypted traffic exchanged by the speaker with its ecosystem [1], [2].

Literature abounds in works investigating how statistical analysis of network flows produced by smart devices can be abused for reconnaissance or attack purposes. For instance, the traffic produced by home devices can be used to understand if a user is at home [3] as well as to model daily routines [4] or the sleep cycle [2]. In general, attacks leveraging machine learning proved to be effective, even when relying upon “poor” information. As an example, IoT nodes and connected devices can be identified by simply using the length of the produced protocol data units [5]. When HTTP-based interactions are present, it is possible to infer precise details, e.g., the status of a light bulb, as well as to hijack the conversation or physically endanger the target [6].

An emerging research trend explores the use of various artificial intelligence and machine learning techniques to classify the commands issued to smart speakers (see, e.g., [7] and the references therein). To this aim, attackers take advantage of traffic features not protected by the encryption, such as inter-packet time, throughput, the location of some endpoints, and the number of connections.
Since the classification is typically accurate, the attacker can infer details like the number of devices controlled by the smart speaker, the presence of the user (even when the interaction is absent), and the “kind” of the issued commands [3]–[7]. Moreover, a relevant part of the traffic produced by smart speakers shares functional and technological traits with VoIP, meaning that it is also susceptible to attacks disclosing the language of the talker or other sensitive behaviors [8].

Therefore, this letter focuses on investigating deep adversarial learning countermeasures against machine learning attacks targeting the traffic produced by smart speakers. To the best of our knowledge, this aspect has been mostly overlooked so far. The only notable exception is [9], which proposes a padding scheme to protect IoT and smart devices from statistical analysis. Instead, our work aims to showcase the limitations of traffic manipulation or morphing approaches, which often lead to flawed countermeasures [10].

To do so, we built a new dataset containing the network traffic of a typical smart home environment and an experimental testbed to evaluate the efficacy of deep adversarial learning techniques. The experimental activities exploit both theoretical approaches, i.e., the usage of Savitzky–Golay filters and Additive White Gaussian Noise (AWGN) on all the statistical features, and realistic ones, named “Realistic Adversarial”, that use constant padding and AWGN techniques that consider the constraints of the protocols and networks in use. The achieved results argue that smart speaker privacy needs a complete rethink.

The rest of the letter is structured as follows: Section II provides the background and the threat model, and Section III presents the deep adversarial techniques used in this work. Section IV describes the evaluation testbed and Section V presents the obtained results. Finally, Section VI concludes the letter.

II. BACKGROUND AND ATTACK MODEL
Figure 1 depicts a typical smart speaker ecosystem that provides the voice-activated user interface and acts as a hub for other IoT nodes and network appliances. In essence, the speaker collects, samples, and transmits voice commands to remote cloud services in charge of processing data to deliver back textual/binary representations as well as additional content, e.g., multimedia streams. The smart speaker can also provide feedback to the user, play content, retrieve data from third-party providers (e.g., music streaming services) or drive other nodes via local network or short-range links like IEEE 802.15.4. Even the information gathered from commands exchanged locally between the speaker and various nodes can be used to threaten the privacy of the ecosystem [11]. This letter, however, focuses on the attacks exploiting the network traffic exchanged between the speaker and its remote cloud.

Fig. 1: MITM threat model of a smart home ecosystem.

From a security perspective, the continuous exchange of data between the smart speaker and the cloud is a prime point of fragility. As depicted in Figure 1, an adversary (denoted as attacker) can mount MITM (Man-in-the-Middle) attacks [12] to gather network traffic even in the case of a communication encrypted with TLS/SSL [13].

Even if many commercial smart speakers implement countermeasures to protect the network traffic, the majority is still prone to a variety of privacy-breaking attacks targeting a composite set of features observable within the encrypted traffic flows [1], [14]. Specifically, we focus on an attacker willing to use machine learning or deep learning algorithms on encrypted traffic samples to infer “behavioral” information, e.g., the presence of the victim or the “type” of the requested information [3]–[7].
Owing to the end-to-end encryption, the attacker can only observe and acquire the traffic produced by the smart speaker and cannot alter, manipulate, or perform deep packet inspection operations. The attacker can then only rely on general statistics, e.g., the throughput, the size of protocol data units, IP addresses, the number of different endpoints, flags within the headers of the packets, or the behavior of the TCP congestion control [7].

The standard approach to mitigate machine learning attacks on the network traffic exploits the use of a middlebox (denoted as anonymization box in Figure 1) able to “sanitize” the network traffic by removing (or altering) the data that the attacker can exploit. For instance, the anonymization box can pad packets [9] or perform NAT-like operations to prevent profiling endpoints or probing [15]. All in all, since the anonymization box is outside the device, it cannot alter the protocol/communication architecture of the smart speaker ecosystem. Rather, it can only manipulate the traffic without disrupting the flow or penalizing the QoE perceived by the user, for instance, in terms of real-time guarantees (see, e.g., [16] and references therein). In the following, we will showcase the limits of such an approach, which appears to be unsuited to face modern machine learning-capable threats.

III. DEEP ADVERSARIAL LEARNING TECHNIQUES
This letter investigates both theoretical and practical deep adversarial learning techniques to lower the classification accuracy. Ideally, we would like to reduce the classification accuracy to be as close as possible to a “coin toss” (e.g., 50% on a two-class classification problem). As in previous works [7], [17], [18], the attacker can only acquire encrypted traffic to compute statistical metrics and analyze them using machine learning techniques. Such a computation requires using a suitable number of packets grouped using either time spans of length ∆t or bursts of a fixed size of N packets.

The methods considered for this work are: i) smoothing of features through a Savitzky–Golay filter, ii) injecting Additive White Gaussian Noise (AWGN) into the feature time series, and iii) applying a Realistic Adversarial, i.e., a targeted approach to feature degradation that also considers the constraints of the protocols and networks in use.

The first two approaches aim to show the theoretical performances that could be obtained by randomizing, without any constraints, the statistics of the packets and the features derived from them. In detail, the Savitzky–Golay filter [19] allows smoothing all features through a moving window. The length of the filter window determines the number of samples taken into consideration, and a polynomial of user-defined order subsequently approximates these samples.

The second adversarial technique uses Additive White Gaussian Noise (AWGN) – therefore with zero mean – whose variance is set proportionally to the variance of the original signal subject to the adversarial manipulation.

The Realistic Adversarial, instead, represents an approach that takes into account which features can actually be distorted at the egress of the IoT device (therefore with external hardware such as an anonymization box) without compromising its operation (e.g., it is possible to add padding to the packets, but it is not possible to randomize the TCP window without disrupting the service completely).
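As a minimal sketch of the two unconstrained manipulations (assuming NumPy/SciPy; the feature series below is synthetic, not taken from our dataset, and the polynomial order and ν are placeholders):

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Synthetic stand-in for one per-window feature time series,
# e.g., the mean TCP packet length (mean_len_pack).
feature = 900.0 + 150.0 * np.sin(np.linspace(0.0, 8.0 * np.pi, 300)) \
    + rng.normal(0.0, 40.0, 300)

# i) Savitzky-Golay smoothing: a moving window whose samples are
# approximated by a low-order polynomial, stripping the
# high-frequency components the classifier relies on.
smoothed = savgol_filter(feature, window_length=51, polyorder=3)

# ii) AWGN injection: zero-mean Gaussian noise whose variance is
# proportional (factor nu) to the variance of the original signal.
nu = 4.0
noisy = feature + rng.normal(0.0, np.sqrt(nu * feature.var()), feature.size)
```

The window length of 51 matches the setup described in Sect. IV-B; both manipulations preserve the length and alignment of the time series, so the attacker's feature pipeline keeps working on the sanitized data.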
In this work, we exploited two Realistic Adversarial techniques. The first is a constant padding within the whole time series, which was simulated by selecting the maximum value of the mean TCP packet length (mean_len_pack). The standard deviation of the TCP packet length (std_len_pack) was set to zero to match this operation. The second technique concerns the injection of AWGN in the following features:
• std_ipt: to simulate jitter while sending packets;
• n_pack_tcp and n_pack_udp: to simulate decoy connections and packets between endpoints;
• n_pack_icmp: to simulate decoy ping/traceroute packets between endpoints;
• n_port_unique: to simulate decoy TCP/UDP packets addressed to random port numbers.
Moreover, to match the realistic scenario, we cannot modify the following features: max_diff_time, n_ip_unique, mean_window, and std_window.

IV. EXPERIMENTAL TESTBED
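As an illustration of the adversarial stage used in the testbed, the Realistic Adversarial of Sect. III might be sketched as follows (the dict-of-arrays layout, the helper name, and the default ν are our assumptions, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def realistic_adversarial(features, nu=4.0):
    """Apply only the manipulations feasible at the egress of the device.

    `features` maps Table I feature names to per-window time series.
    """
    out = {name: np.asarray(series, dtype=float).copy()
           for name, series in features.items()}

    # Constant padding: every window reports the maximum observed mean
    # TCP packet length, so the length's standard deviation collapses.
    out["mean_len_pack"][:] = out["mean_len_pack"].max()
    out["std_len_pack"][:] = 0.0

    # AWGN on the features an anonymization box can perturb: jitter,
    # decoy TCP/UDP/ICMP traffic, and decoy destination ports.
    for name in ("std_ipt", "n_pack_tcp", "n_pack_udp",
                 "n_pack_icmp", "n_port_unique"):
        series = out[name]
        series += rng.normal(0.0, np.sqrt(nu * series.var()), series.size)

    # max_diff_time, n_ip_unique, mean_window and std_window cannot be
    # altered without breaking the service: leave them untouched.
    return out
```

Note how the constraint works both ways: the untouched features are exactly the ones the neural network can later exploit to keep its accuracy high.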
We developed an experimental testbed to prove the effectiveness of privacy threats and the inadequacy of countermeasures applied to smart speakers’ network traffic. Briefly, we
Feature Name                                       | Adversarial Technique
Number of different IP addresses (n_ip_unique)     | None
Number of different TCP/UDP ports (n_port_unique)  | AWGN
Number of TCP packets (n_pack_tcp)                 | AWGN
Number of UDP packets (n_pack_udp)                 | AWGN
Number of ICMP packets (n_pack_icmp)               | AWGN
Per-window inter-packet time (max_diff_time)       | None
Average of TCP window (mean_window)                | None
Standard deviation of TCP window (std_window)      | None
Average of IPT (mean_ipt)                          | None
Standard deviation of IPT (std_ipt)                | AWGN
Average of packet length (mean_len_pack)           | Const. Padding
Standard deviation of packet length (std_len_pack) | Set to Zero
TABLE I: Names and acronyms of the statistical indicators used, and the adversarial techniques applied to them.

collected a dataset with the network traffic of the IoT device during typical usage scenarios, e.g., during voice queries or media playback. Then, we used a set of deep adversarial learning techniques to alter the statistical information of outbound traffic. Finally, we applied a set of ML techniques to evaluate the corresponding degradation of the classification process. The experimental activities were carried out on an Intel Core i7-3770 computer equipped with GB of RAM and Ubuntu 16.04 LTS, and a Google Home Mini smart speaker. The traffic has been captured via an instrumented computer acting as an IEEE 802.11 access point running ad-hoc scripts for tshark. All the generated traffic traces have been anonymized and made available through Kaggle.

A. Dataset Definition
In this letter, we extend the dataset already published in our previous work [7]. Briefly, the original dataset consists of 9 days of network traffic that comprise: i) traffic with the microphone disabled (Mic On-Off), ii) microphone enabled in a quiet environment, and iii) microphone enabled with background noise (Mic On-Noise).

Thus, to mimic the normal use of smart speakers by users, we extended the available data with three different classes of queries for the smart speaker, i.e., media, travel, and utility. To do so, we executed three different rounds of measurements that last three days each. In essence, a synthetic talker has been created by using various voice records representing a wide range of speakers (e.g., male and female, or with different accents or talking speeds) and it has been used to issue commands to the smart speaker. In the first round, we focused on retrieving the network traffic generated to play back multimedia content. For example, we captured traffic when the synthetic talker asked questions like “What’s the latest news?” or “Play some music”. For the second round, we performed queries related to travel, thus accounting for the interaction with services providing traffic indications or weather forecasts. In this case, we asked questions like “How is the weather today?”. Lastly, we performed general queries belonging to the utility category, like “What’s on my agenda today?” and “What time is it?”.

For each query, we collected seconds of inbound and outbound network traffic to have a proper tradeoff between accuracy and size of the data. The newly collected data contains k packets for media queries, k packets for travel queries, and k for utility queries.

https://voicebot.ai/2019/03/12/smart-speaker-owners-agree-that-questions-music-and-weather-are-killer-apps-what-comes-next/

B. Deep Adversarial Techniques Setup
We implemented the three deep adversarial learning techniques presented in Sect. III. In detail, we set the Savitzky–Golay filter moving window to 51, and we applied different polynomial degrees, i.e., ψ = { , , , , }. For the Additive White Gaussian Noise (AWGN) technique, we multiplied the original variance by a constant ν = [ . , . ] for the Mic On-Off and Mic On-Noise scenarios, and ν = { , , , , } for the utility/media/travel scenario.

Table I lists the set of network features computed for the experimental activity of this letter and the corresponding employed deep adversarial technique.

C. ML Techniques
To implement the attacker, we used machine learning algorithms provided by the scikit-learn and Fast.ai libraries. We considered the most popular techniques commonly used in the literature, i.e., AdaBoost (AB), Decision Tree (DT), k-Nearest Neighbors (kNN), Random Forest (RF), and Neural Networks (NN) [7], [20], [21]. In our experiment, we assume that the attacker is able to collect 3 days of traffic for each scenario, similar to [7]. The classifiers were neither pre-trained with the original data nor fed with previously trained models.

V. EXPERIMENTAL RESULTS
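The attacker models of Sect. IV-C can be sketched with scikit-learn alone (the NN, trained with Fast.ai in our experiments, is omitted; the feature matrix below is synthetic, with one weakly separable dimension per class, and stands in for the real per-window features):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-window feature matrix of the three
# query classes (media / travel / utility).
X = rng.normal(size=(600, 12))
y = rng.integers(0, 3, 600)
X[y == 1, 0] += 3.0  # give classes 1 and 2 a weakly separable signature
X[y == 2, 1] += 3.0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

classifiers = {
    "AB": AdaBoostClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {acc:.2f}")
```

With any usable signal left in the features, all four models land well above the 33% random-choice baseline of a three-class problem, which is exactly the behavior the adversarial techniques try to suppress.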
In this section, we show and discuss the numerical results obtained. First, we show the performance and accuracy of the machine learning algorithms used to infer the query category. Next, we discuss the efficacy of the deep adversarial learning techniques to protect the network traffic.
Query Classification. Fig. 4 shows the classification accuracy for the ML techniques trained on the original data (not subjected to any adversarial technique) of the Utility/Media/Travel scenario. With sampling and feature generation intervals higher than packets, all the considered algorithms achieve a classification accuracy higher than , with the kNN algorithm being the least precise. The accuracy of the neural network model (i.e., NN) also tends to drop slightly as the interval increases, because the number of samples decreases too much for the training “from scratch” of a neural network (at packets, the number of samples is just above units for this scenario). Such behavior is also consistent with the classification of the Mic On-Off and Mic On-Noise scenarios, whose accuracy is higher than (as described in [7]).
Fig. 2: Mic On-Off / Mic On-Noise scenarios: classification accuracy of the neural network model after different adversarial techniques have been applied to the original features: (a), (d) Savitzky–Golay filter; (b), (e) AWGN; (c), (f) Realistic adversarial technique. Panels (a)–(c) refer to the Mic On-Off scenario, panels (d)–(f) to the Mic On-Noise scenario.

Fig. 3: Utility/Media/Travel scenario: classification accuracy of the neural network model after different adversarial techniques have been applied to the original features: (a) Savitzky–Golay filter; (b) AWGN; (c) Realistic adversarial technique.
Mitigation Results. The sets of images in Fig. 2 and Fig. 3 depict the accuracy of a classifier trained on data subjected to the three deep adversarial techniques. In detail, Figs. 2a to 2c show the accuracy obtained in the Mic On-Off classification scenario, Figs. 2d to 2f refer to the Mic On-Noise scenario, and Fig. 3 shows the classification performance of a classifier trained on the traffic features from the Utility/Media/Travel scenario.

In the first two scenarios, the trend is similar: the theoretically most effective adversarial technique is the polynomial smoothing, as it deprives the signal of most of its high-frequency components. On the contrary, AWGN injection is less effective, as it leaves much of the information untouched, especially with low values of ν (i.e., low variance multiplier values). Figs. 2c and 2f, conversely, show the implementation of a realistic adversarial technique, which is therefore not able to degrade all the features simultaneously. As can be seen from the accuracy levels (for some sampling intervals > packets, higher than ), the neural network model is sufficiently “intelligent” to correctly predict the class using just the few remaining features and ignoring all the others. In this case, higher AWGN intensity does not appear to affect the prediction in any way.

Fig. 4: Utility/Media/Travel scenario: classification accuracy of the different machine learning/deep learning models using unmodified original features.

The analysis of the queries scenario in Fig. 3 confirms that the polynomial smoothing techniques are the most efficient, bringing the accuracy of the classifier down to around , very close to the theoretical level of a 33% random choice for a problem with three classes (cf. Fig. ). On the contrary, Fig. 3b highlights the need for a substantially higher ν value for the AWGN in order to sufficiently degrade the features for this scenario. Indeed, unlike the Mic On-Off and Mic On-Noise scenarios (with ν = 2. ), this scenario requires a variance multiplier ν = 64 to bring the accuracy level of the ML classifiers to values close to the random-choice level. Finally, Fig. 3c shows the accuracy obtained from the same neural network architecture trained on degraded features with the realistic adversarial. As in Figs. 2c and 2f, the impossibility of degrading all the features at once, leaving some of them intact, preserves enough information for the neural network to learn how to correctly identify the three classes of the problem, even with an accuracy of when considering an interval of packets, independently from the magnitude of the injected AWGN.
In this letter, we have empirically demonstrated how adversarial learning countermeasures applied to the virtual assistant’s outbound traffic are ineffective against machine learning attacks, thus leading to serious concerns for the privacy of users in smart home environments. The results indicate the need for a major HW/SW redesign of virtual assistant platforms to ensure adequate protection of commercially available smart speakers.

REFERENCES

[1] Y. Yang, L. Wu, G. Yin, L. Li, and H. Zhao, “A Survey on Security and Privacy Issues in Internet-of-Things,” IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1250–1258, 2017.
[2] N. Apthorpe, D. Reisman, and N. Feamster, “A Smart Home is No Castle: Privacy Vulnerabilities of Encrypted IoT Traffic,” 2017. [Online]. Available: http://arxiv.org/abs/1705.06805
[3] B. Copos, K. Levitt, M. Bishop, and J. Rowe, “Is Anybody Home? Inferring Activity from Smart Home Network Traffic,” in IEEE Security and Privacy Workshops. IEEE, 2016, pp. 245–251.
[4] A. Acar, H. Fereidooni, T. Abera, A. K. Sikder, M. Miettinen, H. Aksu, M. Conti, A.-R. Sadeghi, and A. S. Uluagac, “Peek-a-Boo: I See Your Smart Home Activities, Even Encrypted!” 2018. [Online]. Available: http://arxiv.org/abs/1808.02741
[5] A. J. Pinheiro, J. d. M. Bezerra, C. A. Burgardt, and D. R. Campelo, “Identifying IoT Devices and Events Based on Packet Length from Encrypted Traffic,” Computer Communications, vol. 144, pp. 8–17, 2019.
[6] Y. Amar, H. Haddadi, R. Mortier, A. Brown, J. Colley, and A. Crabtree, “An Analysis of Home IoT Network Traffic and Behaviour,” 2018. [Online]. Available: http://arxiv.org/abs/1803.05368
[7] D. Caputo, L. Verderame, A. Ranieri, A. Merlo, and L. Caviglione, “Fine-hearing Google Home: Why Silence Will Not Protect Your Privacy,” J. Wirel. Mob. Networks Ubiquitous Comput. Dependable Appl., vol. 11, no. 1, pp. 35–53, 2020.
[8] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson, “Language Identification of Encrypted VoIP Traffic: Alejandra y Roberto or Alice and Bob?” in USENIX Security Symposium, vol. 3, 2007, pp. 43–54.
[9] A. J. Pinheiro, P. F. de Araujo-Filho, J. d. M. Bezerra, and D. R. Campelo, “Adaptive Packet Padding Approach for Smart Home Networks: A Trade-off Between Privacy and Performance,” IEEE Internet of Things Journal, 2020.
[10] A. Houmansadr, C. Brubaker, and V. Shmatikov, “The Parrot Is Dead: Observing Unobservable Network Communications,” in IEEE Symposium on Security and Privacy. IEEE, 2013, pp. 65–79.
[11] B. Nour, K. Sharif, F. Li, and Y. Wang, “Security and Privacy Challenges in Information-Centric Wireless Internet of Things Networks,” IEEE Security & Privacy, vol. 18, no. 2, pp. 35–45, 2019.
[12] S. Andy, B. Rahardjo, and B. Hanindhito, “Attack Scenarios and Security Analysis of MQTT Communication Protocol in IoT System,” in , 2017.
[13] M. Conti, N. Dragoni, and V. Lesyk, “A Survey of Man in the Middle Attacks,” IEEE Communications Surveys & Tutorials, vol. 18, no. 3, pp. 2027–2051, 2016.
[14] E. Alepis and C. Patsakis, “Monkey Says, Monkey Does: Security and Privacy on Voice Assistants,” IEEE Access, vol. 5, pp. 17841–17851, 2017.
[15] M. Gregorczyk, P. Żórawski, P. Nowakowski, K. Cabaj, and W. Mazurczyk, “Sniffing Detection Based on Network Traffic Probing and Machine Learning,” IEEE Access, vol. 8, pp. 149255–149269, 2020.
[16] J. Fan, C. Guan, K. Ren, Y. Cui, and C. Qiao, “SPABox: Safeguarding Privacy During Deep Packet Inspection at a Middlebox,” IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3753–3766, 2017.
[17] D. Su, J. Liu, S. Zhu, X. Wang, and W. Wang, “‘Are You Home Alone?’ ‘Yes’: Disclosing Security and Privacy Vulnerabilities in Alexa Skills,” arXiv preprint arXiv:2010.10788, 2020.
[18] M. R. Shahid, G. Blanc, Z. Zhang, and H. Debar, “IoT Devices Recognition Through Network Traffic Analysis,” in IEEE International Conference on Big Data. IEEE, 2018, pp. 5187–5192.
[19] A. Savitzky and M. J. Golay, “Smoothing and Differentiation of Data by Simplified Least Squares Procedures,” Analytical Chemistry, vol. 36, no. 8, pp. 1627–1639, 1964.
[20] Z. Li, R. Yuan, and X. Guan, “Accurate Classification of the Internet Traffic Based on the SVM Method,” in . IEEE, 2007.
[21] A.-m. Yang, S.-y. Jiang, and H. Deng, “A P2P Network Traffic Classification Method Using SVM,” in 2008 The 9th International Conference for Young Computer Scientists.