A Survey on Amazon Alexa Attack Surfaces
Yanyan Li†, Sara Kim‡, Eric Sy‡
Department of Computer Science and Information Systems
California State University San Marcos, San Marcos, California 92078
Email: †[email protected], ‡{kim106, sy004}@cougars.csusm.edu

Abstract—Since being launched in 2014, Alexa, Amazon's versatile cloud-based voice service, is now active in over 100 million households worldwide [1]. Alexa's user-friendly, personalized vocal experience offers customers a more natural way of interacting with cutting-edge technology by allowing them to dictate commands directly to the assistant. Today, the Alexa service is more accessible than ever, available on hundreds of millions of devices from not only Amazon but also third-party device manufacturers. Unfortunately, that success has also been the source of concern and controversy. The success of Alexa is based on its effortless usability, but in turn, that has led to a lack of sufficient security. This paper surveys various attacks against the Amazon Alexa ecosystem, including attacks against the frontend voice capturing and the cloud backend voice command recognition and processing. Overall, we have identified six attack surfaces covering the lifecycle of Alexa voice interaction, which spans several stages including voice data collection, transmission, processing, and storage. We also discuss potential mitigation solutions for each attack surface to better improve Alexa and other voice assistants in terms of security and privacy.
Index Terms—Alexa skills, Amazon Alexa, attack surfaces, Echo, Internet of Things, privacy, security, voice hacking
I. INTRODUCTION
Imagine we are locked inside a room that contains a door, a window, and a vent. Outside the door are one thousand ninjas all trying to attack us. What are the different ways the ninjas can access our room? The answer: the door, the window, and the vent. These access points are the "attack surface" of the room. An attack surface, for a software environment, can formally be defined as the sum of the different points (also known as "attack vectors") where an unauthorized user (an "attacker") can try to enter or extract data from an environment. Keeping an attack surface minimized is a crucial security measure for any software [2].

The Amazon Alexa voice assistant is a relatively new technology. This innovative software is set up to be not just a personal device but a natural extension of the home environment. As a result, Alexa not only has access to typical customer data used by smartphones and laptops, such as messages and schedules, but also has access to and control over house locks, personal shopping lists, voice recordings, conversations, customer voice profiles, and more.

In April of 2019, Amazon disclosed the shocking extent of private user data that Alexa carries, admitting that Alexa not only continuously listens in on customer conversations but records and saves that data [3].
Shockingly, Amazon has admitted to employing full teams of people to listen in on users' conversations with Alexa in order to improve Alexa's perceived weakness with foreign languages, regional expressions, and slang, storing potentially very sensitive and private information that could be at risk for exposure [3]. Furthermore, with more "smart home" devices such as "smart locks" (gate codes, key codes) and security cameras (live footage of customer homes) becoming integrated into the home environment and, by extension, Alexa, Alexa has gained more access to customer information than any other contemporary software in the shortest amount of time.

In addition, compared to typing- or touch-based interactions, voice-based interaction eliminates the need for finger touch. While this new avenue of interaction may be convenient in certain scenarios, it also introduces new issues. Voices from TV and radio signals, replayed voices that mimic a person's live voice, and even some inaudible sounds can be picked up by voice assistants [4], [5], [6]. Moreover, voice hacking or voice spoofing is becoming a new phenomenon in which attackers hijack an individual's unique voiceprint or voice profile in order to steal his or her identity, a problem that has become particularly serious in an era where speech-controlled, voice-activated products are common [7].

Voice-based attacks can happen not only in the frontend during voice transmission from users to voice assistants, but also in the backend for queried websites such as Wikipedia. An example of such an attack is the infamous "Burger King Advert Sabotage" [8]. In 2017, Burger King released an advert designed to activate Google Home smart speakers and Android phones and make them describe Whopper burgers. The Wikipedia page for the Whopper had been maliciously edited to call it the "worst hamburger product" and to add cyanide to its list of ingredients.
This resulted in false information being spoken by users' voice assistants.

User interaction with voice assistants is not limited to asking fact-based questions, checking the weather, or playing music; it can also involve requesting an Uber ride, placing an Amazon order, unlocking the front door, and so on. In both situations, user voice recognition and functions to process user requests (e.g., pulling data from Wikipedia, sending user requests to Uber or a smart lock) are needed in the cloud backend, but the latter are more critical, as the transcribed user voice command has total control over a user's Uber/Amazon account and home security system. If the voice recognition service is vulnerable (voices get misinterpreted as a different command) or backend functions running in the cloud get attacked, users' identities could be stolen, accounts could be compromised, and home security could be at risk [9]. The goal of this paper is to study Amazon Alexa attack surfaces to shed light on building a more secure smart voice assistant system.

Fig. 1: Overview of the Alexa ecosystem, consisting of the home environment, the AWS cloud, and its components.

The main contributions of this work are as follows:
• We presented an overview of the Amazon Alexa ecosystem, its major components, and their functionalities.
• We systematically studied the attacks against the Amazon Alexa ecosystem and categorized those attacks into six attack surfaces: attacks on voice capturing, voice traffic transmission, Alexa voice recognition, Alexa skill invocation, Lambda functions, and Amazon S3 buckets.
• We discussed potential mitigation solutions to prevent the identified attacks.

This paper is organized as follows. First, we present an overview of the Alexa system in Section II. Then, we detail its attack surfaces in Section III. Finally, we discuss mitigations in Section IV and conclude the paper in Section V.

II. BACKGROUND / ALEXA ECOSYSTEM
The Alexa system consists of various physical input components that allow customers to interact with the system, referred to as "Echo" devices, and cloud components, which include "smart" Speech Language Understanding (SLU) backend functionalities such as Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), Text-to-Speech (TTS) conversion, and response generation [10]. ASR is a technology that allows computer interfaces to be communicated with in a way that resembles normal human conversation [11]. Along the same vein, NLU is the comprehension by computers of the structure and meaning of human language, and TTS is the conversion of text into natural-sounding speech [10]. For the Alexa system, all these functionalities are used in order to produce the best "response" to the customer's interaction.

Many responses to customer interactions are provided directly by Alexa. However, responses can also be provided by third-party services through "skills". Skills are voice-driven applications that can be added to Alexa to extend functionality and personalize the user experience. Skills are launched and executed through the following interaction model [12]:
1) If a user says the wake word, "Alexa", the Echo device will be activated and listen for the user's voice command.
2) That command is sent to the cloud, where it is processed via ASR and NLU and transcribed into text.
3) A JavaScript Object Notation (JSON) request is sent to the skill's Lambda function, which processes the user intent.
4) Once the request is processed, the Lambda function sends a JSON response to the Alexa voice service.
5) The Alexa voice service receives the JSON response and converts the output text to an audio file.
6) The Echo device receives the audio response from the cloud and plays the audio via the built-in speaker.

Skills, voice applications that process user requests, have become a new target for attackers due to their capability of accessing user voice commands.
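The JSON exchange in steps 3 and 4 can be sketched as a minimal Lambda handler. This is an illustrative sketch only: the intent name and speech text are invented, and real skills typically use the Alexa Skills Kit SDK rather than hand-building the response dict.

```python
# Minimal sketch of an Alexa skill's Lambda handler: step 3 delivers a JSON
# request (here, a plain dict), step 4 returns a JSON response whose
# outputSpeech text the Alexa voice service converts to audio (step 5).
# The intent name and speech strings are hypothetical.

def lambda_handler(event, context):
    """Handle an Alexa skill request and return a speech response."""
    request_type = event["request"]["type"]

    if request_type == "LaunchRequest":
        speech = "Welcome. What would you like to do?"
    elif request_type == "IntentRequest":
        # Dispatch on the intent the voice service resolved from the utterance.
        intent = event["request"]["intent"]["name"]
        speech = f"You asked for {intent}."
    else:
        speech = "Goodbye."

    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```

Calling the handler with a sample `IntentRequest` event returns a response dict whose `outputSpeech` text names the resolved intent.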
Malicious skills have been seen to steal user data or manipulate user voice commands to execute tasks deviating from users' intentions.

The Amazon Alexa ecosystem mainly consists of two parts, the customer home environment and the Alexa cloud backend [13]. The home environment is the home setting of customers who have adopted Alexa as their smart home extension, and typically contains the Alexa smartphone app, an Amazon Echo device, and other Internet-of-Things (IoT) devices. The Alexa cloud backend is comprised of the Alexa voice service and supporting services from the AWS (Amazon Web Services) cloud, e.g., the Lambda serverless computing service, the DynamoDB database service, and the Amazon S3 storage service. An overview of the Alexa ecosystem is presented in Fig. 1 and the details of each component are provided below.
1) Users: Alexa users can interact with the Echo device via voice and with other IoT devices.
2) Echo: The Echo device listens for the wake word and, once activated, records the user's voice and sends the recordings to the Alexa voice service. When a response (in text) is received, the Echo device plays it in a lifelike voice.
3) 3rd Party Alexa Devices: Devices made by third parties (e.g., a Raspberry Pi running an Alexa client).
4) Alexa Voice Service: The Alexa voice service performs smart SLU functionalities such as ASR, NLU, and TTS conversion in order to understand the user request, which is used to decide which Alexa skill to invoke.
5) Alexa Skill: Alexa skills are voice applications that respond to user voice requests. Alexa has built-in skills, e.g., providing weather forecasts and querying Wikipedia. New skills can be built to extend Alexa's functionalities.
6) Custom Skill API: A custom skill can be built with an AWS Lambda (a serverless computing service that runs code without provisioning servers) function that defines customized interactions with user requests.
7) DynamoDB with Auto Scaling: DynamoDB provides a NoSQL database that supports key-value and document-based structures.
With its high performance and scalability, DynamoDB can handle different usage scenarios.
8) Alexa Smart Home Skill: Smart home skills allow users to control lights, thermostats, and other smart home devices via voice interaction.
9) Smart Home Skill API: This API provides an interface for developers to describe smart home devices and handle different user requests such as device discovery, status queries, and device control.
10) AWS IoT: AWS IoT provides cloud services for connecting, monitoring, and managing IoT devices in the cloud, as well as services for analyzing device data.
11) Amazon S3: Amazon S3 is an object storage service that can store static assets such as images and media files corresponding to an Alexa skill.
12) CloudFront: CloudFront provides a content delivery network service to serve content faster to Alexa users.
13) IoT Devices: IoT devices are smart home devices that can be controlled through the Alexa voice service.
14) Audio Devices: Audio devices are the speakers that can play audio files such as music or recorded user voices.

III. ATTACK SURFACES
Amazon Alexa attack surfaces are identified by analyzing the existing attacks against the Alexa ecosystem and categorizing which part of the Alexa ecosystem those attacks target. In total, we have identified six attack surfaces, depicted in Fig. 2.
A. Voice Capturing (Attack Surface 1)
The home environment is the most common setting for the Alexa voice assistant. With modern security measures, Alexa is able to prevent most noise pollution within this home environment from interrupting user conversations with the voice assistant. However, attacks that are able to defeat this noise cancellation have been proven possible. These lingering security deficiencies make the voice capturing aspect of Alexa a commonly targeted attack surface.

A very important aspect of the Amazon Alexa is its convenience. Once purchased, an Amazon Echo device requires little set-up and can immediately be interacted with using commands very similar to natural speech. Unfortunately, Alexa lacks any form of voice-based authentication, allowing any voice within a home environment to interact with and command Alexa. Therefore, any voice containing the wake word can trigger Alexa. This has led to the introduction of remote attacks, attacks that take advantage of this lack of authentication by broadcasting commands over devices such as televisions, radios, or speakers.

Fig. 2: Attack surfaces of the Alexa ecosystem.
Remote Voice Attack: Due to the lack of proper user voice authentication, voice commands played by any speaker can falsely trigger the Echo device to respond. In [4], researchers performed the remote voice attack in three different forms: injecting fake radio signals, replacing one TV channel with a provided video stream, and tricking a wireless speaker into playing valid Echo commands.
Dolphin Attack: The dolphin attack is another form of remote attack that instead uses inaudible commands to trigger Alexa. This inaudibility is achieved by modulating voice commands onto ultrasonic carriers. Although dolphin attacks require ultrasound transducers to be within 2 meters of the Echo device, making them a less common threat than remote device attacks, there is still a concern over whether these attacks will be able to extend their attack distance in the future [6].
Man-in-the-Middle: One research team was able to hack into an IoT device to attack the Alexa voice assistant. They used this along with several of the later-mentioned techniques to implement a more sophisticated attack. Their "man-in-the-middle" attack hijacks the conversation a user is having with their voice assistant without the user knowing. The first component of this attack is command jamming. The researchers use the IoT device to inaudibly jam the commands the user is giving the voice assistant while simultaneously recording them. When the user speaks the wake word, both the malicious device and the voice assistant are activated. The malicious device uses ultrasound-modulated noise to prevent the voice assistant from understanding the user. The next component is data retrieval. Since the malicious device knows what skill the user was trying to use, it can send the same requests to the Echo and find out the kind of data the user was looking for. The malicious device can now modify the data and complete the hijacking by echoing the information back to the user [14].

B. Voice Transmission (Attack Surface 2)
In cybersecurity, fingerprinting refers to a set of information that can be used to identify network protocols, operating systems, hardware devices, and software, among other things. Hackers use fingerprinting as the first step of an attack to gather as much information as possible about targets. The technique of fingerprinting can be applied to the voice traffic between smart speakers and their cloud servers. In this attack, an adversary can eavesdrop on both outgoing and incoming encrypted voice traffic of a smart speaker and infer which voice command a user says over encrypted traffic [15], [16].
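A toy sketch makes the threat concrete: even with payloads encrypted, the sequence of packet lengths differs per command, so an eavesdropper can match an observed trace against pre-recorded signatures. The signature values below are invented for illustration; the actual attacks in [15], [16] train deep learning classifiers over many real traces.

```python
# Toy traffic-fingerprinting sketch: match an observed packet-length trace
# against known command signatures. All lengths are invented; real attacks
# use learned classifiers rather than nearest-signature matching.

def closest_command(trace, signatures):
    """Return the command whose length signature is closest to the trace."""
    def distance(a, b):
        # Pad the shorter sequence with zeros, then sum absolute differences.
        n = max(len(a), len(b))
        a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
        return sum(abs(x - y) for x, y in zip(a, b))
    return min(signatures, key=lambda cmd: distance(trace, signatures[cmd]))

# Hypothetical per-command packet-length signatures (bytes per packet).
signatures = {
    "what's the weather": [120, 480, 480, 1500, 1500, 900],
    "play music":         [120, 480, 1500, 1500, 1500, 1500, 1400],
}
```

An observed trace such as `[121, 470, 485, 1500, 1480, 910]` matches the "what's the weather" signature, despite every packet being encrypted.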
C. Alexa Voice Service (Attack Surface 3)
These are attacks on the common SLU functionalities that the Alexa voice service provides: ASR, NLU, and TTS [10]. In this day and age, it is nearly impossible for software to perfectly understand everything a person is saying and correctly interpret intent. Homonyms and homophones are commonly misunderstood even by humans, and are currently nearly impossible for computers to accurately identify in human language [17]. Usually the language processing models are neither accurate enough nor trained on other languages or accents. These deficiencies in Alexa's SLU are the targets of this particular attack surface.
Skill Squatting: Skill squatting is the creation of malicious skills that carry invocation and intent names that sound similar to those of legitimate skills. Taking advantage of the easy misinterpretation of spoken words, skill squatting relies on the systematic errors produced from word to word, such as pauses or mispronunciations. The intent of this attack is to confuse Alexa and cause the malicious skill to be invoked instead of the legitimate one, hijacking the legitimate skill. This technique can be focused on specific groups of people by leveraging words that are only squattable in the targeted users' demographic [12], [18].
D. Alexa Voice Skill (Attack Surface 4)
Malicious Skills: A malicious skill can be understood as any skill that is intentionally designed to act against the interests of the user. This could be a skill that mines user data or hijacks private information, or simply a skill that falsely claims to be able to perform an action it is unable to do. Currently, there are several studies that show how easily malicious skills are able to bypass Amazon's security and become available for public use on the official Amazon Alexa skill store [19], [20], [18]. In one study, researchers were able to publish hundreds of policy-violating skills on the Alexa skill store for users to access. There proved to be many reasons why this attack surface in particular was so vulnerable [20].

In the current Amazon skill publishing system, once a developer personally verifies that their skill follows Amazon policy, the skill is put up for review and screened by an official Amazon reviewer. If the reviewer deems the skill as following customer policy, the skill becomes available for public consumption. In this way, while each skill is afforded the time and effort of an official review, the current Amazon skill publishing system is also very reliant on manual screening with little emphasis on objective automation. This subjective screening process leads to a dilemma of inconsistency. In other words, skill reviewers verify skills based on their own interpretation of Amazon policy, meaning that the chance of a skill getting published depends not on the contents and intentions of the skill itself but on the skill reviewer assigned to its verification. This inconsistent verification process was demonstrated in a variety of different studies [19], [20], [18], [9].

In one specific example, researchers were also able to publish skills with malicious responses by delaying the response just enough to clear the process.
Furthermore, if developers indicated (through a "developer's form" filled with yes-or-no questions) that their skill does not collect user information and data, they were granted publication on the Amazon skill store even if their indication was false, as that claim was often not verified by skill reviewers during the official review. This means that, as well as relying on individual skill reviewers, Amazon is also reliant on individual developer honesty [20].

The research team also believed that the review process is largely manual, given the inconsistency in finding issues within skills that could have been easily found by automated systems. Additionally, there was evidence of many reviews being performed by non-native English speakers who may be unfamiliar with US laws [20]. The study found that the review process was not thorough enough to detect malicious skills with obvious policy-breaking functionalities. Their research suggests that the abundance of human reliance and the lack of automation in the skill reviewing process is a security vulnerability [20].
Masquerading Attack: The Alexa voice skill is also vulnerable to the voice masquerading attack. This attack takes place when the adversary creates a malicious skill that mimics the behavior of a legitimate skill or even the VPA service itself. These skills are able to convince users that they are using safe and secure functionality when in reality, all of the information they have given their voice assistant has been compromised [18].
E. Lambda Functions (Attack Surface 5)
With the adoption of serverless architecture comes the risk of applications being built on insecure code. The OWASP community reported that these applications are vulnerable to traditional application-level attacks, like Cross-Site Scripting (XSS), Command/SQL Injection, Denial of Service (DoS), broken authentication and authorization, and many more [21].
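Of these, injection translates most directly to the voice setting: a transcribed slot value flows into a database query just like any other untrusted input. The sketch below contrasts an injectable query with a parameterized one in a hypothetical skill backend (sqlite3 stands in for the skill's database; the table and spoken values are invented).

```python
# Sketch: a transcribed voice slot reaching a SQL query. The "notes" table
# and slot values are hypothetical; sqlite3 stands in for the skill's DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (owner TEXT, body TEXT)")
conn.execute("INSERT INTO notes VALUES ('alice', 'gate code 1234')")
conn.execute("INSERT INTO notes VALUES ('bob', 'dentist at noon')")

def get_notes_unsafe(owner_slot):
    # VULNERABLE: the transcribed slot value is pasted into the SQL string.
    query = f"SELECT body FROM notes WHERE owner = '{owner_slot}'"
    return [row[0] for row in conn.execute(query)]

def get_notes_safe(owner_slot):
    # SAFE: the slot value is bound as a parameter, never parsed as SQL.
    return [row[0] for row in
            conn.execute("SELECT body FROM notes WHERE owner = ?", (owner_slot,))]

# A spoken payload transcribed as "bob' OR '1'='1" dumps every row:
leaked = get_notes_unsafe("bob' OR '1'='1")
```

With the unsafe query, `leaked` contains both users' notes; the parameterized version returns nothing for the same payload, since the whole string is treated as a literal owner name.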
Command/SQL Injection: With the coming of voice assistants comes the voice user interface (VUI), and the VUI introduces new ways of transferring information that are not secure or well monitored. This interface introduces a vulnerability into the Alexa system in the form of SQL injection. Attackers are able to use the VUI to interfere with the queries that skills make to their databases and grant themselves access to otherwise sensitive data. At Black Hat Europe 2019, a group of researchers demonstrated that voice commands could also cause SQL injection if the Lambda function processing the user voice request does not have proper input validation [22].

F. Amazon S3 Bucket (Attack Surface 6)
Amazon Simple Storage Service (Amazon S3) is a scalable, high-performance cloud storage service. An Amazon S3 bucket will be automatically created for storing media files when developers create an Alexa skill.
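Access control on such buckets is a recurring trouble spot. As an illustrative sketch, the check below scans a bucket policy document for statements that grant everyone read access; the policy is a hypothetical example held as a plain dict (no AWS calls), whereas a real audit would fetch the policy and ACLs through the AWS SDK.

```python
# Sketch: flag bucket policy statements that allow public s3:GetObject.
# The policy document is a hypothetical example; a real check would fetch
# it via the AWS SDK and also inspect bucket ACLs and account-level blocks.

def public_read_statements(policy):
    """Return policy statements that allow s3:GetObject to everyone."""
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        principal = stmt.get("Principal")
        is_public = principal == "*" or principal == {"AWS": "*"}
        if stmt.get("Effect") == "Allow" and is_public and "s3:GetObject" in actions:
            flagged.append(stmt)
    return flagged

# Hypothetical policy exposing every object in the bucket to the world.
policy = {
    "Statement": [
        {"Effect": "Allow", "Principal": "*", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-skill-assets/*"},
    ]
}
```

Running the check on the example policy flags the wildcard-principal statement; a policy scoped to a specific IAM principal is not flagged.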
S3 Bucket Misconfiguration: The access control of an S3 bucket can be misconfigured, e.g., configuring a bucket as publicly accessible to give skills (voice apps) easy access to all the hosted files. The problem is that anyone can find the bucket by its name and get access to all files hosted in that bucket. If that bucket happens to store sensitive data such as a private key or credentials, then that sensitive data would be leaked as well [23]. Moreover, since websites can load resources from publicly writable S3 buckets, if those buckets get maliciously modified, then the messages returned by Alexa skills could be fake, similar to the Burger King example.

IV. MITIGATION
With the discovery of Alexa's attack surfaces, measures that minimize or eliminate these attack surfaces must be taken to ensure the safety of customer data. Unfortunately, introducing additional security measures into a commercial product is already a very complicated issue. New features that enhance Alexa's security must not only be effective at ensuring security; they must also perform in a way that does not disturb the convenient customer experience that has made Alexa so popular. Therefore, a healthy balance between security and usability must be found in order to optimize both customer safety and customer satisfaction.
A. Mitigation to Voice Capturing Attacks
One solution for protecting against remote voice capturing attacks is teaching Alexa to differentiate between live and recorded voices. Void, proposed as a lightweight voice liveness detection system, is one such example [24]. This software detects voice hacking attacks by finding the differences in spectral power (analysis of cumulative power patterns in spectrograms) between live human voices and voices replayed through speakers, using multiple deep learning models. Spectral power refers to the distribution of power across frequency components. Most loudspeakers inherently add distortions to original sounds while replaying them, making the overall power distribution over the audible frequency range show some uniformity and linearity.

Speaker-Sonar, on the other hand, is a sonar-based liveness detection system for smart speakers [25]. Sonar is a technique that uses sound propagation to detect objects. The key idea of this system is to ensure that the voice command is indeed coming from the user by tracking user movement through a constant stream of inaudible ultrasonic sound and comparing the direction of the received voice command to the user's direction. This method in particular provides a non-intrusive user experience. However, it proved to be reliably effective only in open outdoor spaces, as the highly decorated interiors of customer home environments lessened the accuracy of the Speaker-Sonar system.

Protecting against dolphin attacks, on the other hand, requires a different set of solutions. A research team at Zhejiang University, the same team that invented the dolphin attack, introduced solutions for protecting against inaudible attacks. Hardware solutions for inaudible attacks target the base of the problem.
The root cause of dolphin attacks and other inaudible voice commands is that, unfortunately, most commercial microphones attached to smart devices such as phones or voice assistants are able to detect acoustic sounds with frequencies higher than 20 kHz. Therefore, adjustments to microphones that suppress any acoustic signals whose frequencies are in the ultrasound range would effectively prevent many forms of inaudible attacks [6]. Additionally, inaudible voice commands can be canceled by adding a module to microphones that detects modulated voice commands within the ultrasound frequency range. This module would then demodulate the signals to obtain the baseband [6].
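In signal-processing terms, the microphone-side fix amounts to low-pass filtering: pass the audible band, suppress the ultrasonic carrier. The toy sketch below uses a crude moving-average FIR filter and an invented 96 kHz sampling rate to show the effect; real hardware would use a properly designed analog or digital filter.

```python
# Toy low-pass filtering sketch: a moving-average FIR filter passes an
# audible 440 Hz tone nearly intact while strongly attenuating a 30 kHz
# ultrasonic carrier. The 96 kHz sample rate is a hypothetical choice.
import math

RATE = 96_000  # samples per second

def tone(freq, n):
    return [math.sin(2 * math.pi * freq * i / RATE) for i in range(n)]

def moving_average(signal, window=9):
    """Crude low-pass FIR filter: average over a sliding window."""
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def rms(signal):
    return math.sqrt(sum(x * x for x in signal) / len(signal))

audible = tone(440, 4096)        # inside the voice band: survives filtering
ultrasonic = tone(30_000, 4096)  # dolphin-attack carrier: heavily attenuated
```

Comparing RMS levels before and after filtering shows the ultrasonic tone losing most of its energy while the audible tone is nearly unchanged, which is the behavior a hardened microphone front-end would want.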
B. Mitigation to Voice Transmission Attacks
Under an encrypted traffic analysis attack, an attacker can effectively eavesdrop on encrypted user interactions with smart speakers. In order to protect against the privacy leakage of smart speakers through voice traffic fingerprinting, a solution called "adaptive padding" can be employed [15]. Adaptive padding refers to the addition of dummy packets to voice traffic, inserted based on the distribution of inter-arrival times while real packets are still sent at their original timestamps. This hides traffic bursts and traffic gaps, making encrypted user interactions with smart devices harder to interpret. Dummy packets produced by adaptive padding also have the ability to carry buffered data sooner, effectively minimizing latency [15].
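A simplified sketch of the padding step follows, with a fixed inter-arrival target and a fixed dummy size standing in for values sampled from a distribution as in [15]; real packets keep their original timestamps, and dummies fill the silent gaps.

```python
# Sketch of adaptive padding: walk a (timestamp, size) trace and emit
# dummy packets wherever the gap to the next real packet exceeds a target
# inter-arrival time. Fixed target_gap and dummy_size are simplifications;
# the scheme in [15] samples these from learned distributions.

def adaptive_pad(trace, target_gap=0.05, dummy_size=512):
    """Return the trace with dummy packets inserted into traffic gaps."""
    sentinel = trace[-1][0] + target_gap   # stop padding after the last packet
    padded = []
    for (t, size), (t_next, _) in zip(trace, trace[1:] + [(sentinel, 0)]):
        padded.append((t, size, "real"))   # real packets keep their timestamps
        gap_t = t + target_gap
        while gap_t < t_next:              # fill the silence with dummies
            padded.append((gap_t, dummy_size, "dummy"))
            gap_t += target_gap
    return padded
```

For a two-packet trace with a 0.3 s gap and a 0.05 s target, five dummy packets fill the gap, so an observer no longer sees a clean burst/silence pattern.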
C. Mitigation to Alexa Voice Service Attacks
Alexa voice service misinterpretation is one of the most exploited attack surfaces. As previously mentioned in Section III, some common attacks that target this attack surface are voice squatting attacks, voice masquerading attacks, and skill squatting attacks.

A possible countermeasure against voice squatting and voice masquerading attacks is a skill-name scanner [18]. The scanner converts the invocation name string of a skill into an ARPABET-specified phonetic expression. This phonetic expression allows the phonetic distance between different skill names to be measured, and skill names that the scanner detects to have a subset relation (considerable similarity) are deemed possible voice squatting attacks [18]. Another possible solution is to take context information, e.g., the user's utterance and the skill's response, into consideration. An example of this is presented in [18], in which the authors built a context-sensitive detector that consists of two major components, a user intention classifier and a skill response checker, to detect whether there is an impersonation and alert users if one is detected. This ensures that the response of the skill matches the perceived user intention.

All skills, as previously discussed in Section III, must go through a certification process before they can be published to the Alexa skill store for public consumption. Skill squatting attacks rely on attackers successfully registering malicious squatted skills; in other words, they rely on the flaws of this certification process. Therefore, a possible prevention tactic against skill squatting attacks is improving the certification process by adding additional screens. For instance, a word-based and phoneme-based analysis of a new skill's invocation name, as a screening measure to determine whether it could be confused with other already registered skills, would be an effective measure against skill squatting attacks [12].
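A toy version of such a phonetic screen: represent each invocation name as an ARPABET phoneme sequence and flag registered names within a small edit distance of a new submission. The transcriptions below are hand-written for illustration; a real scanner would derive them from a pronunciation dictionary such as CMUdict and use a more refined similarity measure than plain edit distance.

```python
# Toy phonetic screen for skill certification: flag registered invocation
# names that are phonetically close to a newly submitted one. Phoneme
# transcriptions are hand-written ARPABET approximations, for illustration.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

def flag_squatting(new_name, registered, threshold=1):
    """Return registered skills phonetically close to the new name."""
    return [name for name, phonemes in registered.items()
            if edit_distance(new_name, phonemes) <= threshold]

# Hypothetical registered skills and their phoneme sequences.
registered = {
    "quick flash": ["K", "W", "IH", "K", "F", "L", "AE", "SH"],
    "smart lock":  ["S", "M", "AA", "R", "T", "L", "AA", "K"],
}
```

A submission one phoneme away from "quick flash" is flagged for manual review, while a phonetically unrelated name passes the screen.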
D. Mitigation to Alexa Voice Skill Attacks
Voice assistant providers, such as Amazon, have certification processes that insufficiently check the skills submitted to their stores. A study in [20] provided two recommendations to help providers enhance the trustworthiness of their systems. Since developers have the power to change the functionality of skills after their certification, enforcing skill behavior integrity throughout the skill life-cycle is necessary. A continuous certification/vetting process should be required whenever the developer wants to change either the front-end or the back-end. Although this may increase the latency of the certification process, it will improve its quality and increase the trustworthiness of the system.

Another observation about the certification process for skills is the room for human error in a vetting process dependent on human decisions. An obvious fix would be to utilize automated skill testing in order to improve the consistency of the verifications and help the testing to be more thorough. The authors of [20] concluded that their 234 skill submissions were reviewed in a largely manual manner with limited voice response testing. To further increase the strength of the certification process, voice assistant system providers need access to a skill's back-end code to perform code analysis.
E. Mitigation to Lambda Functions Attacks
SQL injections have been around for a while, and they persist only because they still work and are effective at retrieving sensitive data. They are also effective at attacking the SQL databases used by Alexa skills if the Lambda functions are not up to par. One solution is to use Lambda-Proxy, a utility that can perform automated SQL injection testing for AWS Lambda functions [26], [27]. To better protect against Lambda-based attacks, LambdaGuard, an AWS Lambda auditing tool, can be used [28]. LambdaGuard is designed to provide visibility into Lambda functions and conduct configuration checks to identify potential vulnerabilities.
F. Mitigation to Amazon S3 Bucket Attacks
Due to human error in configuring S3 buckets, it is necessary to have stricter default policies and tools to check for common configuration errors. In [23], the authors developed a tool for bucket owners to check the access policies of their S3 buckets and verify whether readable buckets contain sensitive data such as private keys in .pem files. The same team also developed a browser extension to help users check if a rendered webpage loads resources from a writable S3 bucket [23]. If so, the extension prevents loading those untrusted resources. The same idea could be adopted by the Alexa system to verify the legitimacy of the source: if the website queried by a skill is publicly editable or loads resources from a writable S3 bucket, then those responses should be blocked from playing.

V. CONCLUSION
The Alexa voice assistant is a very popular consumer product that has completely changed the way people interact with smart technology. However, Alexa's novel functionalities, such as its ability to understand normal human conversation, adapt to the desires of the customer, and control smart devices in the home environment, require an unprecedented amount of user data, and unfortunately, security and privacy have not been able to keep up to properly protect that sensitive information. As a result, Alexa has been the target of many attacks.

This paper surveyed and analyzed various attacks against the Amazon Alexa ecosystem, giving insights into where Alexa security risks are located in its system. Furthermore, six attack surfaces were identified by examining the lifecycle of Alexa voice interaction, which spans several stages including voice data collection, transmission, processing, and storage. In addition, mitigation solutions to those attacks were investigated and discussed to provide directions for better improving Alexa and other voice assistants in terms of security and privacy.

For future work, we plan to further evaluate the complexity of each attack in terms of the effort required from the attacker's perspective, as well as the amount of loss and the level of harm it can bring to Alexa users from the user's perspective.

REFERENCES
[4] … IEEE, 2018, pp. 1–6.
[5] S. Pradhan, W. Sun, G. Baig, and L. Qiu, "Combating replay attacks against voice assistants," Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 3, no. 3, pp. 1–26, 2019.
[6] G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang, and W. Xu, "DolphinAttack: Inaudible voice commands," in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2017.
[12] … USENIX Security Symposium 18, 2018, pp. 33–47.
[13] A. Westrich, M. Bunch, et al., "Serverless applications lens: AWS Well-Architected Framework," retrieved October 26, 2020. [Online]. Available: https://docs.aws.amazon.com/wellarchitected/latest/serverless-applications-lens/wellarchitected-serverless-applications-lens.pdf
[14] R. Mitev, M. Miettinen, and A.-R. Sadeghi, "Alexa lied to me: Skill-based man-in-the-middle attacks on virtual assistants," in Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security (AsiaCCS), 2019, pp. 465–478.
[15] C. Wang, S. Kennedy, H. Li, K. Hudson, G. Atluri, X. Wei, W. Sun, and B. Wang, "Fingerprinting encrypted voice traffic on smart speakers with deep learning," in Proceedings of the 13th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), 2020, pp. 254–265.
[16] M. Ford and W. Palmer, "Alexa, are you listening to me? An analysis of Alexa voice service network traffic," Personal and Ubiquitous Computing, vol. 23, no. 1, pp. 67–79, 2019.
[17] M. K. Bispham, I. Agrafiotis, and M. Goldsmith, "Nonsense attacks on Google Assistant and missense attacks on Amazon Alexa," in Proceedings of the 5th International Conference on Information Systems Security and Privacy (ICISSP), 2019, pp. 75–87.
[18] N. Zhang, X. Mi, X. Feng, X. Wang, Y. Tian, and F. Qian, "Dangerous skills: Understanding and mitigating security risks of voice-controlled third-party functions on virtual personal assistant systems," in … IEEE, 2019, pp. 1381–1396.
[19] R. Leong, "Analyzing the privacy attack landscape for Amazon Alexa devices," Imperial College London, Tech. Rep., 2018.
[20] L. Cheng, C. Wilson, S. Liao, J. Young, D. Dong, and H. Hu, "Dangerous skills got certified: Measuring the trustworthiness of skill certification in voice personal assistant platforms," in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2020, pp. 1699–1716.
[21] OWASP, "OWASP serverless top 10," retrieved October 2020. [Online]. Available: https://github.com/OWASP/Serverless-Top-10-Project
[22] A. Bannister, "'Alexa, hack my serverless technology' – attacking web apps with voice commands," December 11, 2019. [Online]. Available: https://portswigger.net/daily-swig/alexa-hack-my-serverless-technology-attacking-web-apps-with-voice-commands
[23] A. Continella, M. Polino, M. Pogliani, and S. Zanero, "There's a hole in that bucket! A large-scale analysis of misconfigured S3 buckets," in Proceedings of the 34th Annual Computer Security Applications Conference (ACSAC), 2018, pp. 702–711.
[24] M. E. Ahmed, I.-Y. Kwak, J. H. Huh, I. Kim, T. Oh, and H. Kim, "Void: A fast and light voice liveness detection system," in USENIX Security Symposium (USENIX Security 20), 2020, pp. 2685–2702.
[25] Y. Lee, Y. Zhao, J. Zeng, K. Lee, N. Zhang, F. H. Shezan, Y. Tian, K. Chen, and X. Wang, "Using sonar for liveness detection to protect smart speakers against remote attackers," …