SkillBot: Identifying Risky Content for Children in Alexa Skills
Tu Le
University of Virginia
Danny Yuxing Huang
New York University
Noah Apthorpe
Colgate University
Yuan Tian
University of Virginia
Abstract
Many households include children who use voice personal assistants (VPA) such as Amazon Alexa. Children benefit from the rich functionalities of VPAs and third-party apps but are also exposed to new risks in the VPA ecosystem (e.g., inappropriate content or information collection). To study the risks VPAs pose to children, we build a Natural Language Processing (NLP)-based system to automatically interact with VPA apps and analyze the resulting conversations to identify content risky to children. We identify 28 child-directed apps with risky content and maintain a growing dataset of 31,966 non-overlapping app behaviors collected from 3,434 Alexa apps. Our findings suggest that although voice apps designed for children are subject to more policy requirements and intensive vetting, children are still vulnerable to risky content. We then conduct a user study showing that parents are more concerned about VPA apps with inappropriate content than those that ask for personal information, but many parents are not aware that risky apps of either type exist. Finally, we identify a new threat to users of VPA apps: confounding utterances, or voice commands shared by multiple apps that may cause a user to invoke or interact with a different app than intended. We identify 4,487 confounding utterances, including 581 shared by child-directed and non-child-directed apps.
The rapid development of Internet of Things (IoT) technology has aligned with the growing popularity of voice personal assistant (VPA) services, such as Amazon Alexa and Google Home. In addition to the first-party features provided by these products, VPA service providers have also developed platforms that allow third-party developers to build and publish their own voice apps, hereafter referred to as "skills".
Risks to Children from VPAs.
Researchers have found that 91% of children between ages 4 and 11 in the U.S. have access to VPAs, 26% of children are exposed to a VPA between 2 and 4 hours a week, and 20% talk to VPA devices more than 5 hours a week [16]. The lack of robust authentication on commercial VPAs makes it challenging to regulate children's use of skills [53], especially as anyone in the physical vicinity of a VPA can interact with the device. As a result, children may have access to risky skills that deliver inappropriate content (e.g., expletives) or collect personal information through voice interactions.

There is no systematic testing tool that vets VPA skills to identify those that contain risky content for children. Legal efforts and industry solutions have tried to protect children using VPAs; however, their effectiveness is unclear. The 1998 Children's Online Privacy Protection Act (COPPA) regulates the information collected from children under 13 online [10], but widespread COPPA violations have been shown in the mobile application market [45], and compliance in the VPA space is far from guaranteed. Additionally, parental control modes provided by VPAs (e.g., Amazon FreeTime and Google Family App) often place a burden on parents during setup and receive complaints from parents due to their limitations [1, 9, 15].
Research Questions.
Protecting children in the era of voice devices therefore raises several pressing questions:
• RQ0. Can we automate the analysis of VPA skills to identify content risky for children without requiring manual human voice interactions?
• RQ1. Are VPA skills targeted to children that claim to follow additional content requirements – hereafter referred to as "kid skills" – actually safe for child users?
• RQ2. What are parents' attitudes toward, and awareness of, the risks posed by VPAs to children?
• RQ3. How likely is it for children to be exposed to risky skills through confounding utterances – voice commands shared by multiple skills which could cause a child to accidentally invoke or interact with a different skill than intended?
In this paper, we design, implement, and perform a systematic automated analysis of the Amazon Alexa VPA skill ecosystem and conduct a user study to answer these research questions.
Challenges to Automated Skill Analysis.
In comparison to mobile applications and other traditional software, neither the executable files nor the source code of VPA skills are available to researchers for analysis. Instead, the skills' natural language processing modules and key function logic are hosted in the cloud as a black box. Thus, decompilation and traditional static or dynamic analysis methods cannot be applied to VPA skills. VPA skill voice interactions are built following a template defined by the third-party developer, which is also unavailable to researchers. To automatically detect risky content, we need to generate testing inputs that trigger this content through sequential interactions. A further challenge is that risky content does not always occur during a user's first interaction with a skill; human users often need to have back-and-forth conversations with skills to discover risky content. Automating this process requires developing a tool that can generate valid voice inputs and dynamic follow-up responses that will cause the skill to reveal risky content. This is different from existing chatbot development techniques [28], as the goal is not to generate inputs that sound natural to a human. Instead, automated skill analysis requires generating inputs that will explore the space of skill behaviors as thoroughly as possible.
Automated Identification of Risky Content.
This paper presents our systematic approach to analyzing VPA skills based on automated interactions. We apply this approach to 3,434 Alexa skills targeted toward children in order to measure the prevalence of kid skills that contain risky content. More specifically, we build a natural-language-based system called "SkillBot" that interacts with VPA skills and analyzes the results for risky content, including inappropriate language and personal information collection (Section 5). SkillBot generates valid skill inputs, analyzes skill responses, and systematically generates follow-up inputs. Through multiple rounds of interactions, we can determine whether skills contain risky content. The design of SkillBot answers RQ0, and our SkillBot analysis of 3,434 kid skills allows us to answer RQ1. We identify 8 kid skills with inappropriate content and 20 kid skills that ask for personal information (Section 6).
Online User Study and Insights from Parents.
We next wanted to verify our SkillBot results by seeing whether parents also viewed identified skills as risky, as well as to better understand the real-world contexts of children's interactions with VPAs (RQ2). We conduct a user study of 232 U.S. Alexa users who have children under 13 years old. We present these parents with examples of interactions with risky and non-risky skills identified by SkillBot and ask them to report their reactions to these skills, experiences with risky/unwanted content on their own VPAs, and use of VPA parental control features. We find that parents are uncomfortable about the inappropriate content in our identified skills. 54.1% cannot imagine such interactions are possible on Alexa, and 58.4% believe Alexa should block such interactions. Many parents do not think that these skills are designed for families/kids, although these skills are actually published in Amazon's "Kids" category. We also find that 23.7% of parents do not know about Alexa's parental control feature, and of those who know about the feature, only 29.4% use it. These data highlight the risks to children posed by VPA skills with inappropriate content. While SkillBot demonstrates that such skills exist, parents are predominantly unaware of this fact and typically neglect basic precautions such as activating parental controls.
Confounding Utterances.
Our analysis also reveals a novel threat that is particularly problematic for child users: confounding utterances. Confounding utterances are voice inputs that are shared by more than one skill that may be present on a VPA device. When a user interacts with a VPA via a confounding utterance, the utterance might trigger a reaction from any of these skills. If a kid skill shares a confounding utterance with a skill inappropriate for kids, a kid user might inadvertently begin interacting with the inappropriate skill. For example, a child user might use an utterance to invoke a kid skill X, but another skill Y, which is in a non-kid category and which shares the same utterance, could be triggered instead. As Echo does not offer visual cues about which skill is actually invoked, the user may not realize that Y is running instead of X. Furthermore, skills in the non-kid categories typically face more relaxed requirements than kid skills do; the child user could thus be exposed to risky content from skill Y. Our SkillBot reveals 4,487 confounding utterances, 581 of which are shared between a kid skill and a skill that is not in the "Kids" category (Section 8). Of these 581 utterances, 27% prioritize invoking a non-kid skill over a kid skill. This indicates that children are at real risk of accidentally invoking non-kid skills and that an adversary could exploit overlapping utterances to get child users to invoke non-kid skills (RQ3).
Contributions.
We make the following contributions:
Automated System for Skill Analysis:
We present a system, SkillBot, that automatically interacts with Alexa skills and collects their contents at scale. Our system can be run longitudinally to identify new conversations and new conversation branches in previously analyzed skills. We plan to publicly release our system to help future research.
Identification of Risks to Children:
We analyze 31,966 conversations collected from 3,434 Alexa kid skills to detect potentially risky skills directed to children. We find 8 skills that contain inappropriate content for children and 20 skills that ask for personal information through voice interaction.
User Study of Parents’ Awareness and Experiences:
We conduct a user study demonstrating that a majority of parents express concern about the content of the risky kid skills identified by SkillBot, tempered by disbelief that these skills are actually available for Alexa VPAs. This lack of risk awareness is compounded by findings that many parents do not use VPA parental controls and allow their children to use VPA versions that do not have parental controls enabled by default.
Confounding Utterances:
We identify confounding utterances as a novel threat to VPA users. Our SkillBot analysis reveals 4,487 confounding utterances shared between two or more skills and highlights those that place child users at risk by invoking a non-kid skill instead of an expected kid skill.
Voice Personal Assistant.
A VPA is a software agent that can interpret users' speech to perform certain tasks or answer users' questions via synthesized voice. Most VPAs, such as Amazon Alexa and Google Home, follow a cloud-based system design. In particular, when the user speaks a request to the VPA device (e.g., Amazon Echo), this request is sent to the VPA service provider's cloud server for processing and for invoking the corresponding skills. Third-party skills can be hosted on external web services instead of the VPA service provider's cloud server.
Building and Publishing Skills.
To provide a broader range of features, Amazon allows third parties to develop skills for Alexa via the Alexa Skills Kit (ASK) [5]. Using ASK, developers can build custom Alexa skills that use their own web services to communicate with Alexa [14]. More than 50,000 skills are currently publicly available on the Alexa Skills Store, covering a wide variety of features such as reading news, playing games, controlling smart homes, checking credit card balances, and telling jokes [6].
Enabling and Invoking Skills.
Unlike mobile apps, Alexa skills are hosted on Amazon's cloud servers. Therefore, users do not have to download any binary file or run any installation process. To use a skill, users only need to enable it in their Amazon account. There are two ways to enable/disable a skill. The first is via the skill info page, which contains an enable/disable button; users can access the skill info page via the Alexa Skills Store on the Amazon website or via the Alexa companion app. The other way is via voice command. Note that, for usability, Amazon also allows invoking skills directly through voice without needing to enable the skill first.

Users can invoke a skill by saying its invocation phrases [18]. Invocation phrases come in two types: with intent and without intent. For example, one can say "Alexa, open Ted Talks" to invoke the Ted Talks skill, or "Alexa, open Daily Horoscopes for Capricorn" to tell the Daily Horoscopes skill to give some information about Capricorn. Since there can be different ways of paraphrasing a sentence, there are multiple variants of an invocation phrase that perform the same task. In addition, Alexa allows some flexibility in invoking skills through the name-free interaction feature [19]. The user can speak to Alexa with a skill request that does not necessarily include the skill name. Alexa processes the request and selects a top candidate skill that fulfills it. If the chosen skill is not yet enabled by the user, it may be auto-enabled for the user.

Every skill has an Amazon webpage, which includes at most three sample utterances, i.e., voice commands with which users could verbally interact with the skill. In addition, the webpage may include an "Additional Instructions" section with additional voice commands for interaction, although these additional commands are optional.
We first introduce the current schemes for protecting child users on Alexa, such as the parental control and permission control features, and then show their limitations.
Alexa Parental Control.
Amazon FreeTime is a parental control feature that allows parents to manage what content their children can access on their Amazon devices. FreeTime on Alexa provides a Parent Dashboard user interface for parents to set daily time limits, monitor activities, and manage allowed content. If FreeTime is enabled, users can only use the skills in the Kids category by default. To use other skills, parents need to manually add them to the whitelist. FreeTime Unlimited is a subscription that offers thousands of pieces of kid-friendly content, including a list of kid skills available on compatible Echo devices, for children under 13. Parents can purchase this subscription via their Amazon account and use it across all compatible Amazon devices.

Children can potentially access an Amazon Echo device located in a shared space and invoke such "risky" skills in the absence of child-protection features on the Amazon Echo for the following reasons. FreeTime is turned off by default on the regular version of Amazon Echo. Previous studies, such as those in medicine [31], psychology [39], and behavioral economics [35], have shown that people often opt for default settings. Although parents can turn on FreeTime for their regular version of Amazon Echo, the feature places a burden of usage on users. For example, users sometimes cannot remove or disable certain skills added by FreeTime (which has been an issue since 2017 [1, 9]). Some users find it hard to access the list of skills available via FreeTime Unlimited [13, 15]. In particular, skills that parents would love to use may not be appropriate for kids and are thus not allowed in FreeTime mode by default. As a consequence, users may misunderstand not being able to use a skill in FreeTime mode as a bug of the skill itself, which leads to complaints being sent to the skill developer [4]. If parents want to use these skills in FreeTime mode, they have to manually add them to the whitelist in the Parent Dashboard interface. They also have to remember to enable or disable FreeTime at appropriate times, which affects the user experience.
Alexa Permission Control.
Alexa skills might need personal information from users to give accurate responses or to process transactions. To get any personal information, a skill should request the corresponding permission from the user. When the user first enables the skill, Alexa asks the user to go to the Alexa companion app to grant the requested permission. However, this permission control mechanism only protects personal information stored in the user's Amazon Alexa account. If skills do not specify permission requests but instead directly ask for such personal information through voice interaction, they can easily bypass the permission control.
In this paper, we consider two main types of threats: (1) risky skills (i.e., skills that contain inappropriate content or ask for the user's personal information through voice interaction) and (2) confounding utterances (i.e., utterances that are shared among two or more different skills).
Risky Skills.
We investigate risky content that harms children. We define "risky" skills as skills exhibiting either of two behaviors: (1) delivering content inappropriate for children or (2) asking for personal information through voice interaction. An example is the "My burns" skill in Amazon's Kids category that says "You're so ugly you'd scare the crap out of the toilet. I'm on a roll". These threats may come from either an adversary who intentionally develops malicious skills or a benign/inexperienced developer who is not aware of the risks.
Confounding Utterances.
We identify a new risk which we call "confounding utterances". We define confounding utterances as utterances that are shared among two or more different skills. Effectively, a confounding utterance spoken by the user could trigger a skill the user did not expect. Confounding utterances differ from previous research on voice squatting attacks, which exploited speech recognition misinterpretations made by voice personal assistants [32, 33, 54, 55]. That work showed that voice command misinterpretation due to spoken errors could yield unwanted skill interactions, and that an adversary can route users to malicious Alexa skills by giving them invocation names that are pronounced similarly to legitimate ones. In contrast, this paper considers a new risk: even if there is no such voice command misinterpretation, Alexa may still invoke a skill that the user does not want, because multiple skills can share exactly the same utterances. We want to find out, given a confounding utterance that is shared between multiple skills, which skill Alexa prioritizes to enable/invoke. Users have no control over which skills are actually opened, whether upon an intentional voice command or an unintentional one (e.g., Alexa being triggered by background conversations). In other words, a confounding utterance may invoke a skill that the user did not intend. With the name-free interaction feature [19], users can invoke a skill without saying its invocation name; thus, an unexpected skill can be mistakenly invoked. Furthermore, there is no downloading or installation process on the customer's device, which makes it easy for these skills to bypass user awareness. For instance, a child may have one skill in mind but accidentally invoke a different skill that has a similar invocation name (or similar utterances). An adversary can exploit confounding utterances to get kids to use malicious skills.
To study the impacts that risky skills might have on children, we propose SkillBot, which systematically interacts with skills to discover risky content and confounding utterances. In this section, we first show how we design SkillBot to interact with skills and collect their responses thoroughly and at scale. We then evaluate SkillBot for its reliability, coverage, and performance.
Figure 1: Automated Skill Interaction Pipeline Overview
Our goal for SkillBot is to interact effectively and efficiently with skills and to uncover risky content for children in skills' behaviors, thoroughly and at scale.
Overview.
Our system consists of four main components: Skill Information Extractor, Web Driver, Chatbot, and Conversation Dataset (see the workflow in Figure 1). The Skill Information Extractor handles exploring, downloading, and parsing information about skills available in the Alexa Skills Store. The Web Driver handles connections to Alexa and requests from/to the skills. The Chatbot discovers interactions with the skills and records the conversations into the Conversation Dataset.
Skill Information Extractor.
Amazon provides an online repository of skills via the Alexa Skills Store [6]. Each skill is an individual product, which has its own product info page and an Amazon Standard Identification Number (ASIN) that can be used to search for the skill in Amazon's catalogue [22]. The URL of a skill's info page can be constructed from its ASIN. Our skill information extractor includes a web scraper that systematically accesses the Alexa website and downloads the skills' info pages in HTML based on their ASINs (i.e., skill IDs). It then reads the HTML files and constructs a JSON dictionary structure using the BeautifulSoup library [8]. For each skill, we extract any information available on its info page, such as ASIN (i.e., skill ID), icon, sample utterances, invocation name, description, reviews, permission list, and category (e.g., kids, education, smart home, etc.).
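As a minimal sketch of this step, the parser below turns a downloaded info page into such a dictionary with BeautifulSoup. The CSS selectors and the inline HTML are placeholders standing in for Amazon's actual page markup, which has to be inspected separately.

```python
import json
from bs4 import BeautifulSoup

def parse_skill_page(html: str, asin: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

    return {
        "asin": asin,
        "name": text_of("h1.skill-title"),                 # placeholder selector
        "description": text_of("div.skill-description"),   # placeholder selector
        "sample_utterances": [li.get_text(strip=True)
                              for li in soup.select("ul.sample-utterances li")],
        "category": text_of("span.skill-category"),        # placeholder selector
    }

# Tiny inline page standing in for a downloaded skill info page.
html = """
<h1 class="skill-title">Space Facts</h1>
<div class="skill-description">Alexa, open Space Facts. You can say 'give me a fact'.</div>
<ul class="sample-utterances"><li>Alexa, open Space Facts</li><li>give me a fact</li></ul>
<span class="skill-category">Kids</span>
"""
print(json.dumps(parse_skill_page(html, "B000000000"), indent=2))  # hypothetical ASIN
```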
Web Driver.
We leverage Amazon’s Alexa developer con-sole [2] to allow programmatically interacting with skills us-ing text inputs. We build a web driver module using Seleniumframework [17], which is a popular web browser automationframework for testing web applications, to automate send-ing requests to Alexa and interacting with the skill info pageto check the status of the skill (i.e., enabled, disabled, notavailable). We also implement a module that handles skillenabling/disabling requests. This module uses private APIsderived from inspecting XMLHttpRequest within networkactivities of Alexa webpages.
Chatbot.
We build an NLP-based module to interact with skills and explore as much of each skill's content as possible. The module includes several techniques to explore sample utterances suggested by the skill developers, create additional utterances based on the skill's info, classify utterances, detect questions in responses, and generate follow-up utterances.
Exploring and Classifying Utterances:
Amazon allows developers to list up to three sample utterances in the sample utterances section of their skill's information page. Our system first extracts these sample utterances. Some developers also put additional instructions into their skill's description. Therefore, our system further processes the skill's description to generate more utterances. In particular, we consider sentences that start with an invocation word (i.e., "Alexa,...") to be utterances. We also notice that phrases inside quotes can be utterances. An example is "You can say 'give me a fun fact' to ask the skill for a fun fact". Once a list of collected utterances is constructed, our system classifies these utterances into opening and in-skill utterances. Opening utterances are used to invoke/open a skill. These often include the skill's name and start with opening words such as open, launch, and start [18]. In-skill utterances are used within the skill's session (when the skill is already invoked). Some examples include "tell me a joke", "help", or "more info".
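As a rough illustration, the snippet below pulls candidate utterances out of a description and separates opening from in-skill utterances; the regular expressions and the opening-word list are simplified assumptions, not the exact rules used by SkillBot.

```python
import re

OPENING_WORDS = ("open", "launch", "start", "ask", "play", "tell")

def extract_candidate_utterances(description: str) -> list:
    candidates = []
    # Sentences that begin with the wake word, e.g. "Alexa, open Space Facts".
    for sent in re.split(r"[.!?\n]", description):
        sent = sent.strip()
        if sent.lower().startswith("alexa"):
            candidates.append(sent)
    # Phrases inside straight or curly quotes, e.g. say 'give me a fun fact'.
    candidates += re.findall(r"[\"'\u2018\u201c]([^\"'\u2019\u201d]{3,60})[\"'\u2019\u201d]", description)
    return candidates

def is_opening_utterance(utterance: str, invocation_name: str) -> bool:
    u = utterance.lower().replace("alexa,", "").strip()
    return u.startswith(OPENING_WORDS) and invocation_name.lower() in u

desc = 'You can say "give me a fun fact". Alexa, open Space Facts to begin.'
for u in extract_candidate_utterances(desc):
    kind = "opening" if is_opening_utterance(u, "Space Facts") else "in-skill"
    print(f"{u!r} -> {kind}")
```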
Detecting Questions in Skill Responses:
To extend the conversation, our system first classifies responses collected from the skill into three main categories: Yes/No question, WH question, and non-question statement. For this classification task, we employ spaCy [30] and StanfordCoreNLP [38, 43], which are popular tools for NLP tasks. In particular, we first tokenize the skill's response into sentences and each sentence into words. We then annotate each sentence using part-of-speech (POS) tagging. For POS tags, we utilize both TreeBank POS tags [48] and Universal POS tags [20]. With the POS tagging, we can identify the role of each word in the sentence, such as auxiliary, subject, or object, based on its tag.

A Yes/No question usually starts with an auxiliary verb, following the subject-auxiliary inversion rule. Yes/No questions generally take the form [auxiliary + subject + (main verb) + (object/adjective/adverb)?]. Some examples are "Is she nice?", "Do you play video games?", and "Do you swim today?". It is also possible to have the auxiliary verb as a negative contraction, such as "Don't you know it?" or "Isn't she nice?".

A WH question contains WH words such as what, why, or how. We first identify these WH words in the sentence based on their POS tags: "WDT", "WP", "WP$", and "WRB". Next, we check for WH question grammar structure. Regular WH questions usually take the form [WH-word + auxiliary + subject + (main verb) + (object)?]. Some examples are "What is your name?" and "What did you say?". Furthermore, we consider pied-piping WH questions such as "To whom did you send it?". We exclude cases where WH words are used in a non-question statement, such as "What you think is great", "That is what I did", and "What goes around comes around".
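The sketch below approximates these rules using spaCy alone (the full pipeline also uses StanfordCoreNLP) and assumes the en_core_web_sm model is installed; the grammar checks are deliberately simplified relative to the description above.

```python
import spacy

nlp = spacy.load("en_core_web_sm")          # assumes the small English model is installed
WH_TAGS = {"WDT", "WP", "WP$", "WRB"}       # TreeBank tags for WH words

def classify_response_sentence(sentence: str) -> str:
    tokens = [t for t in nlp(sentence) if not t.is_punct]
    if not tokens:
        return "non-question"
    # WH question: a WH word near the start followed by an auxiliary
    # (this also covers pied-piping such as "To whom did you send it?").
    for i, tok in enumerate(tokens[:2]):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok.tag_ in WH_TAGS and nxt is not None and (nxt.pos_ == "AUX" or nxt.lemma_ == "be"):
            return "wh-question"
    # Yes/No question: subject-auxiliary inversion, i.e. the sentence starts with an auxiliary.
    first = tokens[0]
    if first.pos_ == "AUX" or first.lemma_ in ("be", "do", "have"):
        return "yes/no-question"
    return "non-question"

for s in ["Do you play video games?", "What is your name?",
          "Don't you know it?", "That is what I did."]:
    print(s, "->", classify_response_sentence(s))
```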
Generating Follow-up Utterances:
Given a skill response, there are three ways to follow up.

(1) Yes/No questions: This type of question asks for confirmation from the user, expecting either a "yes" or a "no" answer. Our system sends "yes" or "no" as a follow-up utterance to continue the conversation.

(2) WH questions: For WH questions, we further employ the question classification method presented in [49] to determine the theme of an open-ended question. There are six general categories of question theme: Abbreviation, Entity, Description, Human, Location, and Numeric [11]. 'Abbreviation' includes questions that ask about a short form of an expression (e.g., "What is the abbreviation for California?"). 'Entity' includes questions about objects that are not human (e.g., "What is your favorite color?"). 'Description' includes questions about explanations of concepts (e.g., "What does a defibrillator do?"). 'Human' includes questions about an individual or a group of people. 'Location' includes questions about places such as cities, countries, states, etc. 'Numeric' includes questions asking for numerical values such as count, weight, size, etc. Each category can have subcategories; for example, 'Human' has 'name' and 'title', and 'Location' has 'city', 'country', 'state', etc. We create a dictionary of answers for those subcategories (e.g., "age":{1, 2, 3,...}, "states":{Oregon, Arizona,...}) to continue the conversation with the skill. For questions asking about knowledge, such as those in 'Abbreviation' or 'Description' whose subcategories are too general, our system also sends "I don't know. Please tell me." to prompt for responses from the skill.

(3) Non-question statements: These include two types: directive statements and informative statements. Some directive statements ask the user to provide an answer to a question, which is essentially similar to a WH question. An example is "Please tell us your birthday". For these cases, our system parses the sentence to look for what is being asked and handles it similarly to a WH question (discussed above). Other directive statements suggest words/phrases for the user to say to continue the conversation. Some examples include "Please say 'continue' to get a fun fact" and "Say '1' to get info about a book, '2' to get info about a movie". For these cases, our system extracts the suggested words/phrases and uses them to continue the conversation. Informative statements provide users with information such as a joke, a fact, or daily news. These often give no direction on what else the user can say to continue the conversation. Thus, our system sends an in-skill utterance, "Tell me another one" or "Tell me more", as a follow-up utterance to explore more content from the skill.
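A condensed sketch of this follow-up logic is shown below; the answer dictionary mirrors the subcategory-to-answers mapping described above, with illustrative values.

```python
import random

# Illustrative answer dictionary keyed by question subcategory.
ANSWERS_BY_SUBCATEGORY = {
    "age": ["7", "10", "12"],
    "name": ["Alex", "Sam"],
    "city": ["Portland", "Austin"],
    "state": ["Oregon", "Arizona"],
}

def follow_up(response_type: str, subcategory: str = None, suggested_phrases: list = None) -> str:
    if response_type == "yes/no-question":
        return random.choice(["yes", "no"])
    if response_type == "wh-question":
        answers = ANSWERS_BY_SUBCATEGORY.get(subcategory)
        # Fall back to a prompt when the subcategory is too general (e.g., Description).
        return random.choice(answers) if answers else "I don't know. Please tell me."
    # Non-question: prefer a phrase suggested by the response, otherwise probe for more content.
    if suggested_phrases:
        return suggested_phrases[0]
    return random.choice(["Tell me another one", "Tell me more"])

print(follow_up("wh-question", subcategory="age"))
print(follow_up("non-question", suggested_phrases=["continue"]))
```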
Conversation Dataset.
Our conversation dataset is a set of JSON files, each of which represents a skill. The file's content is a list of conversations with the skill collected by the chatbot module. Each conversation is stored as a list in which the even indexes are the utterances sent by our system and the odd indexes are the corresponding responses from the skill.
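For illustration, a per-skill file might look like the following (the skill ID and the dialogue are made up); slicing with a step of two recovers the utterance/response pairs.

```python
import json

# Hypothetical content of one per-skill file, e.g. "B000000000.json": a list of
# conversations, each a flat list alternating SkillBot utterances (even indexes)
# and Alexa responses (odd indexes).
conversations = [
    ["open skill x", "Welcome to Skill X. Say 'Continue'.",
     "continue", "Great. Would you like to do A?",
     "yes", "Let's do A."],
]

for conv in conversations:
    for utterance, response in zip(conv[0::2], conv[1::2]):
        print(f"SkillBot: {utterance!r} -> Alexa: {response!r}")

with open("B000000000.json", "w") as f:   # hypothetical skill ID used as the file name
    json.dump(conversations, f, indent=2)
```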
5.2 Exploring Conversation Trees

For each skill, SkillBot runs multiple rounds to explore different paths within the conversation trees. Each node in this tree is a unique response from Alexa. There is an edge between nodes i and j if there exists an interaction where Alexa says i, the user (i.e., SkillBot) says something, and then Alexa says j. We call the progression from i to j a path in the tree. Furthermore, multiple paths of interactions could exist for a skill. For instance, node i could have two edges: one with j and another one with k. Effectively, two paths lead from i. In one path, the user says something after hearing i, and Alexa responds with j. In another path, the user says something else after hearing i, and Alexa responds with k.

To illustrate how we construct a conversation tree for a typical skill, we show a hypothetical example in Figure 2. First, the user would launch a skill by saying "Open Skill X" or "Launch Skill X". This initial utterance could be found in the "Sample Utterances" section of the skill's information page on Amazon.com; alternatively, it could also be displayed in the "Additional Instructions" section on the skill's page. Per Figure 2, let us assume that either "Open Skill X" or "Launch Skill X" triggers the same response from Alexa, "Welcome to Skill X. Say 'Continue'," which is denoted by Node 1 in Figure 2. The user would say "Continue" and trigger another response (denoted as Node 2) from Alexa, "Great. Would you like to do A?" The user could either respond with "Yes", which would trigger the response in Node 3, or "No", which would trigger Node 4.

SkillBot explores multiple paths of the conversation tree by interacting with a skill multiple times, each time picking a different response. Per the example in Figure 2, the first time SkillBot runs on this skill (i.e., the first run), it could follow a path along Nodes 1, 2, and 3. Once at Node 3, the skill in this example does not provide the user with the option to return to the state in Node 2, so to explore a different path, SkillBot would have to start over. In the second run, SkillBot could follow a path along Nodes 1, 2, 4, and 5. SkillBot responds with "No" after Node 2 because it remembers answering "Yes" in the previous run. In the third run, SkillBot could follow Nodes 1, 2, 4, and 6.
Figure 2: A conversation tree that represents how we interact with a typical skill.
Each run of SkillBot terminates when exploring down a particular path is unlikely to trigger new responses from Alexa; in this case, SkillBot starts over with the same skill and explores a different path. We list four conditions under which SkillBot terminates a particular run: (i) Alexa's response is not new; in other words, SkillBot has seen Alexa's response in a previous run of the skill and/or in a different skill. SkillBot's goal is to maximize interaction with unique Alexa responses, rather than previously seen ones, in an attempt to discover risky content. (ii) Alexa's response is empty. (iii) Alexa's response is a dynamic audio clip (e.g., music or a podcast, which does not rely on Alexa's automated voice). Due to limitations of the Alexa simulator, SkillBot is unable to extract and parse dynamic audio clips; as such, SkillBot terminates a path if it encounters a dynamic audio clip because it does not know how to react. (iv) Alexa's response is an error message, such as "Sorry, I don't understand."
In this section, we present our validation to ensure that interacting with skills via SkillBot (presented in Section 5) is representative of a user's interaction with skills via a physical Echo device. We further validate the performance of SkillBot.
Interaction Reliability.
We randomly selected 100 skills for validation. We used an Echo Dot device to interact with the skills and compared the results with our system. Note that, since a skill can have dynamic content that makes its responses differ across invocations, we first check the collected skill responses. If they do not match, we further check the skill invocation in the Alexa activity log to see whether the same skill was invoked. We find that our system and the Echo Dot produce similar interactions for 99 of the skills. Among these 99 skills, two responded with audio playbacks, which are not supported by the Alexa developer console [3] employed in our system (see detailed justifications in Section 9). However, their invocations were shown in the activity log and matched the invocations when using the Echo Dot. We cannot verify the remaining skill, as Alexa cannot recognize its sample utterances. This might be an issue with the skill's web service.
Skill’s Responses Classification.
As described in Section 5.1, to extend the conversation with a skill, our system classifies responses from the skill into three groups: Yes/No question, WH question, and non-question statement. To evaluate the performance, we randomly sampled 300 unique skill responses from our conversation collection and manually labeled them to create a ground truth. The ground truth contained 52 Yes/No questions, 50 open-ended questions, and 198 non-question statements. We then used our system to label these responses and verified the labels against our ground truth. Our classifier predicted 56 Yes/No questions, 50 open-ended questions, and 194 non-question statements, corresponding to over 95% accuracy. The performance detail for each class is shown in Table 1 (see Table 6 in Appendix E for the confusion matrix of our 3-class classifier).
Table 1: Skill Response Classification Performance
              Accuracy  Precision  Recall  F1 Score
Yes/No        98%       0.91       0.98    0.94
Open-ended    98%       0.94       0.94    0.94
Non-question  96%       0.98       0.96    0.97
Coverage.
We measure the coverage of SkillBot by analyzing the conversation trees for every skill. Our analysis includes four criteria: (i) the number of unique responses from Alexa, i.e., the number of nodes in a tree; (ii) the maximum depth (or height) of a tree; (iii) the maximum number of branches in a tree, i.e., how many options SkillBot explored; and (iv) the number of initial utterances, which counts the number of distinct ways to start interacting with Alexa. We show the results in Figure 3.

Per the 2nd chart in Figure 3, we highlight that SkillBot is able to reach a depth of at least 10 on 2.7% of the skills. Such a depth allows SkillBot to trigger and explore a wide variety of Alexa's responses from which to discover risky content. In fact, out of the 28 risky kid skills, 2 skills were identified at depth 11, 1 skill at depth 5, 4 skills at depth 4, 6 at depth 3, 8 at depth 2, and 7 at depth 1 (more details in Section 6).

Per the 4th chart in Figure 3, we highlight that SkillBot is able to initiate conversations with skills using more than 3 different utterances. Normally, a skill's information page on Amazon.com lists at most three sample utterances. In addition to using these sample utterances, SkillBot also discovers and extracts utterances in the "Additional Instructions" section on the skill's page. As a result, SkillBot interacted with 20.3% of skills using more than 3 utterances. These extra initial utterances allow SkillBot to trigger more responses from Alexa. As we will explain in Section 6, 3 out of the 28 risky kid skills were discovered by SkillBot from the additional utterances (i.e., those not among the 3 sample utterances).
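Assuming each conversation tree is stored as an adjacency mapping from a response to the responses reachable in one more turn, the four criteria can be computed with a short helper like the one below (an illustration, not the paper's implementation).

```python
def coverage_metrics(tree: dict, roots: list) -> dict:
    """tree: {response: [next responses]}; roots: responses to the initial utterances."""
    def depth(node, seen=frozenset()):
        children = [c for c in tree.get(node, []) if c not in seen]
        if not children:
            return 1
        return 1 + max(depth(c, seen | {node}) for c in children)

    return {
        "unique_responses": len(tree),                                   # criterion (i)
        "max_depth": max((depth(r) for r in roots), default=0),          # criterion (ii)
        "max_branches": max((len(v) for v in tree.values()), default=0), # criterion (iii)
        "initial_utterances": len(roots),                                # criterion (iv)
    }

tree = {"welcome": ["great", "bye"], "great": ["let's do A"], "bye": [], "let's do A": []}
print(coverage_metrics(tree, roots=["welcome"]))
```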
Time Performance.
It took about 21 seconds on average to collect one conversation. SkillBot interacted with 4,507 skills and collected 39,322 conversations within 46 hours using five parallel processes on an Ubuntu 20.04 machine with an Intel Core i7-9700K CPU.
To investigate the risks of skills made for kids (RQ1), we employed SkillBot to collect and analyze 31,966 conversations from a sample of 3,434 Alexa kid skills. In this section, we describe our dataset of kid skills and present our findings on risky kid skills.
Our system first explored and downloaded information about skills from their info pages in the Alexa U.S. skills store. Note that our system filtered out error pages (e.g., 404 not found, after three retries) and non-English skills. As a result, we collected 43,740 Alexa skills from 23 different skill categories (e.g., business & finance, social, kids, etc.). Our system then parsed data about the skills, such as ASIN (i.e., skill ID), icon, sample utterances, invocation name, description, reviews, permission list, and category, from the downloaded skill info pages.

For our analysis, we investigate all skills in Amazon's Kids category (3,439 kid skills). We ran SkillBot to interact with each skill and record the conversations. To speed up the task, we ran five processes of SkillBot simultaneously. Note that SkillBot can be run over time to revisit each skill and cumulatively collect new conversations as well as new branches of previously collected conversations for that skill. Our final sample consisted of 31,966 conversations from 3,434 kid skills, after removing five skills that resulted in errors or crashed Alexa.
We performed content analysis on the conversations collected from the 3,434 kid skills to identify risky kid skills that have inappropriate content or ask for personal information.
Skills with Inappropriate Content for Children.
Our goal was to analyze the skills' contents to identify risky skills that provide inappropriate content to children. To identify such content, we combined WebPurify and Microsoft Azure's Content Moderator, two popular content moderation services that provide inappropriate-content filtering for websites and applications with a focus on child protection [7, 21]. We implemented a content moderation module for SkillBot in Python 3, leveraging the WebPurify API and the Azure Moderation API, to flag skills that have inappropriate content for children. Our content moderation module flagged 33 potentially risky skills that have expletives in their content. However, a human review process is necessary to verify the output, because whether a flagged skill actually has inappropriate content for children depends on context. For example, some of the flagged terms (such as "facial" and "sex") are likely appropriate in some conversational contexts. For the human review process, four researchers on our team (who come from 3 countries, including the USA, are all English speakers, and range in age from 22 to 35) independently reviewed each of the flagged skills and voted on whether the skill's content is inappropriate for children. Skills that received three or four votes were counted towards the final list. Using this approach, we identified 8 kid skills with actual inappropriate content. Out of these 8 kid skills, SkillBot identified the inappropriate content of one skill at depth 11, one skill at depth 5, two at depth 4, one at depth 2, and three at depth 1. We performed a false negative analysis by sampling 100 of the other skills that were not flagged as having inappropriate content and manually checking them. As a result, we found 0 false negatives.
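A hedged sketch of the flagging and voting steps is below. The WebPurify endpoint, parameters, and response field are assumptions based on its public REST API, and the Azure Content Moderator call is omitted; the vote threshold follows the three-of-four rule described above.

```python
import requests

def webpurify_flags(text: str, api_key: str) -> bool:
    """Return True if WebPurify reports any profanity in `text` (assumed API shape)."""
    resp = requests.get(
        "https://api1.webpurify.com/services/rest/",            # assumed endpoint
        params={"method": "webpurify.live.check", "api_key": api_key,
                "text": text, "format": "json"},
        timeout=10,
    )
    return resp.json().get("rsp", {}).get("found", "0") != "0"  # assumed response field

def confirmed_inappropriate(reviewer_votes: list) -> bool:
    """A flagged skill is kept only if at least 3 of the 4 reviewers agree."""
    return sum(bool(v) for v in reviewer_votes) >= 3

# Usage sketch: flag with the moderation service(s), then apply the human vote.
# flagged = webpurify_flags(skill_response_text, api_key="...")  # plus an Azure check
# risky = flagged and confirmed_inappropriate([True, True, False, True])
```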
Figure 3: Coverage of SkillBot in terms of four criteria: number of unique responses from Alexa; maximum depth in an Interaction Tree; maximum number of branches for any node in an Interaction Tree; and number of initial utterances.
Skills Collecting Personal Information.
Our goal was to detect whether skills asked users for personal information. To the best of our knowledge, available tools only focus on detecting personal information present in text, which is a different goal. For this analysis, we employed a keyword-based search approach to identify skill responses that asked for personal information. We constructed a list of personal information keywords based on the U.S. Department of Defense Privacy Office [12] and searched for these keywords in the skill responses. In particular, our list includes: name, age, address, phone number, social security number, passport number, driver's license number, taxpayer ID number, patient ID number, financial account number, credit card number, date of birth, and zipcode. A naive keyword search that simply looks for those keywords in the text would not be sufficient, because text containing those keywords does not always ask for such information. Thus, we combined keyword search with the question detection and answer generation techniques used in our Chatbot module (Section 5.1) to detect whether the skill asked the user to provide personal information.

22 skills were flagged as asking users for personal information. To verify the result, we manually checked these 22 skills and 100 random skills that were not flagged. As a result, we found 2 false positives and 0 false negatives. Thus, 20 kid skills asked for personal information such as name, age, and birthday.

Out of these 20 skills, SkillBot identified content that asks for sensitive information in one skill at depth 11, two skills at depth 4, six skills at depth 3, seven at depth 2, and four at depth 1. Also, SkillBot identified such content via non-sample utterances for three of the skills (i.e., utterances not listed as the three samples, but rather listed in the "Additional Instructions" section of the skill's page on Amazon.com). We further analyzed the permission requests made by the skills. None of the identified 20 risky kid skills requested any permission from the user.
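A simplified version of this check, combining the keyword list with a lightweight "is this asking?" heuristic in place of the full question-detection module, might look like the following.

```python
# Keyword list from the paper; the asking markers are an illustrative shortcut for
# the question-detection and directive-parsing logic described in Section 5.1.
PII_KEYWORDS = [
    "name", "age", "address", "phone number", "social security number",
    "passport number", "driver's license number", "taxpayer id number",
    "patient id number", "financial account number", "credit card number",
    "date of birth", "zipcode",
]
ASKING_MARKERS = ("what is", "what's", "tell me", "tell us", "please say",
                  "may i have", "can you give", "how old")

def asks_for_personal_info(response: str) -> bool:
    text = response.lower()
    has_keyword = any(k in text for k in PII_KEYWORDS)
    is_asking = text.rstrip().endswith("?") or any(m in text for m in ASKING_MARKERS)
    return has_keyword and is_asking

print(asks_for_personal_info("What is your name, young explorer?"))   # True
print(asks_for_personal_info("The name of this game is Trivia."))     # False
```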
To evaluate how the risky kid skills we identified actually impact kid users (RQ2 and RQ3), we conducted a user study of 232 U.S. parents who use Amazon Alexa and have children under 13. Our goal was to qualitatively understand parents' expectations and attitudes about these risky skills, parents' awareness of parental control features, and how risky skills might affect children. Our study protocol was approved by our Institutional Review Board (IRB), and the full text of our survey instrument is provided in Appendix A. In this section, we describe our recruitment strategy, survey design, response filtering, and results.
We recruited participants on Prolific, a crowd-sourcing website for online research. Participants were required to be adults 18 years or older who are fluent in English, live in the U.S. with their kids under 13, and have at least one Amazon Echo device in their home. We combined Prolific's pre-screening filters with a screening survey to get this niche sample of participants for our main survey. Our screening survey consisted of two questions to determine: (1) whether the participant has kids aged 1–13 and (2) whether the participant has Amazon Echo device(s) in their household. 1,500 people participated in our screening survey and 258 of them qualified for our main survey. The screening survey took less than 1 minute to complete, and our main survey took an average of 6.5 minutes (5.2 minutes in the median case). Participants were compensated $0.10 for completing the screening survey and $2 for completing the main survey. To improve response quality, we limited both the screening and main surveys to Prolific workers with at least a 99% approval rate.

The screening survey consisted of two multiple-choice questions: "Who lives in your household?" and "Which electronic devices do you have in your household?". This allowed us to identify participants with kids aged 1–13 and Amazon Echo device(s) in their household who were eligible to take the main survey.
The main survey consisted of the following four sections.

Parents' Perceptions of VPA Skills. This section investigated parents' opinions of and experiences with risky skills. Participants were presented with two conversation samples collected by SkillBot from each of the following categories (six samples total). Conversation samples were randomly selected from each category for each participant and were presented in random order.
• Expletive. Conversation samples from the 8 skills identified in our analysis that contain language inappropriate for children.
• Sensitive. Conversation samples from the 20 skills identified in our analysis that ask the user to provide personal information, such as name, age, and birthday.
• Non-Risky. Conversation samples from 100 skills that did not contain inappropriate content for children or ask for personal information.
The full list of skills in the Expletive and Sensitive categories is provided in Appendix D. Each participant was asked the following set of questions after viewing each conversation sample:
• Do you think the conversation is possible on Alexa?
• Do you think Alexa should allow this type of conversation?
• Do you think this particular skill or conversation is designed for families and kids?
• How comfortable are you if this conversation is between your children and Alexa?
• If you answered "Somewhat uncomfortable" or "Extremely uncomfortable" to the previous question, what skills or conversations have you experienced with your Alexa that made you similarly uncomfortable?
Amazon Echo Usage.
We asked which device model(s) of Amazon Echo our participants have in their household (e.g., Echo Dot, Echo Dot Kids Edition, Echo Show). We also asked whether their kids used Amazon Echo at home.
Awareness of Parental Control Feature.
We asked the participants if they think Amazon Echo supports parental control (yes/no/don't know). Participants who answered "yes" were further asked to identify the feature's name (free-text response) and whether they used the feature (yes/no/don't know).
Demographic Information.
At the end of the survey, we asked demographic questions about gender, age, and comfort level with computing technology. Our sample consisted of 128 male (55.2%), 103 female (44.4%), and 1 who preferred not to answer (0.4%). The majority (79.7%) were between 25 and 44 years old. Most participants in our sample are technically savvy (68.5%). See Table 5 in Appendix C for detailed demographic information.
We received 237 responses to our main survey. We filtered out responses from participants who incorrectly answered either of two attention check questions ("What is the company that makes Alexa?" and "How many buttons are there on an Amazon Echo?"). We also excluded participants who gave meaningless responses (e.g., entering only whitespace into all free-text answer boxes). This resulted in 232 valid responses for analysis.
We find that most parents allow their kids to use types of Amazon Echo other than the Kids Edition. These types of Echo do not have parental control enabled by default. We also find that many parents do not know about the parental control feature, and of those who know about the feature, only a few use it. Thus, kids potentially have access to risky skills. Our results further show that parents are not aware of the risky skills that are available in the Kids category on Amazon. When presented with examples of risky kid skills that have expletives and those that ask for personal information, parents express concern, especially about the expletive ones. Some parents reported previous experiences of encountering such risky skills.
Parents’ Perceptions of Kid Skills.
Table 2 shows the distribution of responses to the following questions across the Expletive, Sensitive, and Non-Risky skill sets:
• Do you think the conversation is possible on Alexa?
• Do you think Alexa should allow this type of conversation?
• Do you think this particular skill or conversation is designed for families and kids?
A majority of parents thought that the interactions with the expletive skills were not possible and should not be allowed by Alexa. Only 45.9% of the respondents thought these interactions were possible, and only 41.6% thought such skills should be allowed. Furthermore, most parents (57.1%) felt that the expletive skills were not designed for families and kids.

The parents' responses with regard to the expletive skills are significantly different from their responses to the sensitive and non-risky skills on these questions. For each of these three questions, we conduct Chi-square tests on the pairs of responses across the skill sets: Non-Risky vs. Expletive, Non-Risky vs. Sensitive, and Expletive vs. Sensitive. The responses from the Expletive set are significantly different from the responses from the other two sets for all three questions.
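To make one such pairwise test concrete, a minimal sketch with SciPy is shown below; the response counts are illustrative placeholders, not the study's actual numbers.

```python
from scipy.stats import chi2_contingency

# Hypothetical Yes / No / Not-sure counts for two skill sets (placeholders only).
non_risky = [181, 17, 34]
expletive = [107, 70, 55]

chi2, p, dof, _ = chi2_contingency([non_risky, expletive])
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.4g}")
```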
Designed for Family and Children.
Table 3 shows the distribution of responses for the question: "Do you think this particular skill or conversation is designed for families and kids?" with a breakdown across the different types of skills (Non-risky, Expletive, and Sensitive). These results show that the majority of parents (72.6%) did not think that skills with expletives were designed for families/kids. This indicates that the respondents were not aware that the skills with expletives were actually developed for kids and published in Amazon's "Kids" category. In addition, about half of parents (44.2%) did not think the sensitive skills were designed for families/kids, although these skills are also in the "Kids" category on Amazon.
Parents’ Comfort Level.
We used a five-point Likert scale to measure parents' comfort levels if the presented conversations were between their children and Alexa. Figure 4 shows the participants' comfort levels for each skill category. These results indicate that parents were more uncomfortable with the Expletive skill conversations than with the Sensitive skill conversations. In particular, 42.7% of the respondents expressed discomfort ("Extremely uncomfortable" or "Somewhat uncomfortable") with the Expletive skills, compared to only 12.1% with the Sensitive skills and 5.6% with the Non-risky skills. Chi-square tests show that parents' comfort with the Expletive conversations is significantly different from their comfort with both the Sensitive conversations and the Non-risky conversations.
Figure 4: Participants’ levels of comfort if conversations of a partic-ular type happen between the participants’ children and Alexa. cerns about skills in the Expletive set by free-text responses,including “It doesn’t seem appropriate to tell jokes like thisto children (P148)”, “Under no circumstances should anyonehave a coversation [sic] with children about orgasms. Thiswould be grounds for legal action (P163)”, “I do not believeAlexa should be used in such a crass manner or to teach mychild how to be crass (P210)”, “Poop and poopy jokes don’thappen in my household (P216)”, and “It is too sexual (P123)”.Beyond the skills shown in the survey, one respondent alsorecalled hearing similar skills such as “Roastmaster (P121).”Another respondent remembered something similar but wasunable to provide the name of the skill: “We have asked Alexato tell us a joke in front of our young son and Alexa has tolda few jokes that were borderline inappropriate (P140).” We do not find any significant difference between parents’comfort with the Sensitive conversations versus the Non-riskyconversations. However, the Sensitive conversations involvedskills asking for different types of personal information. Outof the 20 skills in the Sensitive set, 15 skills asked for theuser’s name, 3 asked for the user’s age, and 2 asked for theuser’s birthday. We show the distribution of the participants’comfort level according to each type of personal informationin Figure 5. This indicates that that parents expressed morediscomfort (“Extremely uncomfortable” and “Somewhat un-comfortable”) for skills that ask for the user’s birthday (15.2%of respondents), compared with skills that ask for the user’sname (11.8%) or age (11.5%). Some participants expressedtheir concerns about these skills by free-text responses, in-cluding “I don’t like a skill or Alexa asking for PII (P115)”, “Ihaven’t had a similar experience but I think it is inappropriatefor Alexa to be asking for the name of a child (P209)”, “Idon’t know why it needs a name (P228)”, and “I would notwant Alexa to collect my children’s imformation [sic] (P003)”.
Figure 5: Participants’ levels of comfort for each type of personalinformation, if the conversations happen between the participants’children and Alexa.
Amazon Echo Usage.
Our results also show that most households with kids use Echo devices other than the Echo Kids Edition. The Echo Dot was the most popular type (46.4%) of Echo device in our participants' households. Only 27 participants (6.8%) bought an Echo Dot Kids Edition, which has parental control mode enabled by default. This shows that if kids use Echo, they likely have access to the types of Echo devices that do not have parental control mode enabled by default. Furthermore, the majority of participants (91.8%) reported that their kids do use Amazon Echo at home. Figure 6 shows the types of Echo that the participants own in their household, together with the breakdown of answers to the question "Do your kids use Amazon Echo at home?". Most parents allow their kids to use Amazon Echo at home even without an Echo Dot Kids Edition. This indicates that many kids have access to risky skills, as these skills can be used by default on Echo devices other than the Kids Edition.
                Possible on Alexa                  Alexa should allow                 Designed for families and kids
Response    non-risky  expletive  sensitive    non-risky  expletive  sensitive    non-risky  expletive  sensitive
Yes         78.0%      45.9%      71.6%        83.2%      41.6%      66.2%        68.1%      27.4%      55.8%
No          7.5%       30.2%      11.2%        8.8%       44.2%      16.6%        13.8%      57.1%      16.6%
Not sure    14.4%      23.9%      17.2%        8.0%       14.2%      17.2%        18.1%      15.5%      27.6%
Table 2: Distribution of responses for each of the three yes/no questions, across three types of conversations.

Response    Expletive   Sensitive   Non-Risky
Yes         27.4%       55.8%       68.1%
No          57.1%       16.6%       13.8%
Not sure    15.5%       27.6%       18.1%
Table 3: Distribution of responses for the question: "Do you think this particular skill or conversation is designed for families and kids?"

Figure 6: Types of Echo that the participants own in their household and the breakdown of the number of participants whose kids use Amazon Echo at home. Echo Dot is the most popular, and most households have kids who use Amazon Echo even without the Kids Edition.

Awareness of Parental Control Feature.
We analyzed the responses to the question: "Does Amazon Echo support parental control?" In total, 76.3% said "yes", 0.4% said "no", and 23.3% were unsure. For participants who had an Echo Kids Edition, almost all (92.6%) said "yes", 7.4% said "no", and none were unsure. In contrast, for participants without the Echo Kids Edition, only 74.1% said "yes", and 25.4% were unsure.
This indicates that parents who did not buy the Echo Kids Edition are less likely to know about the parental control feature. For those who said "yes" (i.e., they knew Amazon Echo supports parental control), we further asked if they used parental control. In total, only 29.4% used parental control. Specifically, 64.0% of participants who had an Echo Kids Edition said they used parental control, but only 23.7% of those who did not have an Echo Kids Edition said so. Given that the number of participants who had an Echo Kids Edition was much smaller (only 27 out of 232), the majority of parents did not use parental control for their Echo devices at home. This result again indicates that children are more likely to have access to risky skills. Furthermore, although some participants reported that they used parental control, many of them did not really know what the feature was. For participants who said they used parental control, we asked them to tell us the name of the parental control feature. As long as their answer contained "free" and "time" (the correct answer is "FreeTime"), we considered it correct. As a result, 66.7% of participants who had the Echo Kids Edition were able to correctly name the parental control feature, but only 42.3% of those who did not have the Echo Kids Edition could do so.
Takeaway.
Our results show that parents express significantly more disbelief and concern about conversations from Expletive skills versus Sensitive skills. This is worrisome, because all skills displayed in the survey are real and can be invoked by anyone now, including children. The fact that parents' responses to Sensitive skills were not significantly different from their responses to Non-Risky skills also illustrates a potential lack of understanding that skills are developed by third parties and may pose a privacy risk.
To understand how easy it is for kids to accidentally trigger skills (RQ4), we performed a systematic analysis to identify these utterances and skills. We followed two stages: (1) from the skills' info pages, we discovered the set of potential utterances, each of which might invoke multiple different skills; (2) we then used SkillBot to collect conversations started by these utterances and analyzed the interaction behaviors.
To investigate the behaviors of confounding utterances, we first created a dictionary of confounding utterances, in which each key was a confounding utterance extracted from the Alexa skill info pages and its value was the list of skills that share that same utterance. For each confounding utterance, we removed punctuation and converted it to lowercase to keep the format consistent. We then filtered the dictionary to keep only the utterances that appeared in at least two different skills. We wanted to find out, given a confounding utterance shared between multiple skills, which skill Alexa prioritizes to enable or invoke. There can be cases where Alexa prioritizes a skill that does not match the user's intention, which poses a potential risk to users (especially children). For example, if Alexa prioritizes a non-kid skill over a kid skill that shares the same utterance, children potentially have access to risky content.
We discovered a set of 4,487 confounding utterances, each of which was shared between two or more skills. Of these 4,487 utterances, 110 (2.5%) belonged only to kid skills, 581 (12.9%) belonged to both kid and non-kid skills, and 3,796 (84.6%) belonged only to non-kid skills. We defined these three types of utterances as Kids Only, Joint, and Non-kids Only, respectively. We further identified utterances shared by skills with the same skill name and skill icon, properties that would make it difficult for users to distinguish among these skills even by visually inspecting the skills' webpages. For Kids Only, we found 6 utterances (5.5%) out of 110 shared among skills with the same name and icon. For Joint, we found 48 utterances (8.3%) out of 581 shared among skills with the same name and icon. For Non-kids Only, we found 577 utterances (15.2%) out of 3,796 shared among skills with the same name and icon.
We then used SkillBot to test the confounding utterances collected in the discovery step. For each utterance, we disabled all skills and entered the utterance. We then checked whether any of the skills that shared the utterance got enabled.
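The discovery step is essentially a normalize-group-filter pass over the scraped utterances. The following is a minimal sketch of that logic; the (utterance, skill_id, is_kid_skill) input format and all function names are illustrative assumptions, not the actual SkillBot implementation.

    import re
    from collections import defaultdict

    def normalize(utterance):
        """Lowercase and strip punctuation so equivalent utterances compare equal."""
        return re.sub(r"[^\w\s]", "", utterance.lower()).strip()

    def build_confounding_dict(skill_utterances):
        """Map each normalized utterance to the skills that declare it, keep only
        utterances shared by two or more skills, then bucket them into the
        Kids Only / Joint / Non-kids Only types described above."""
        index = defaultdict(set)
        kid_flags = {}
        for utterance, skill_id, is_kid in skill_utterances:
            index[normalize(utterance)].add(skill_id)
            kid_flags[skill_id] = is_kid

        # Confounding utterances are those shared by at least two different skills.
        confounding = {u: s for u, s in index.items() if len(s) >= 2}

        categories = {"kids_only": [], "joint": [], "non_kids_only": []}
        for utterance, skills in confounding.items():
            kid_count = sum(1 for s in skills if kid_flags[s])
            if kid_count == len(skills):
                categories["kids_only"].append(utterance)
            elif kid_count == 0:
                categories["non_kids_only"].append(utterance)
            else:
                categories["joint"].append(utterance)
        return confounding, categories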
                                                       Kids Only   Joint     Non-kids Only   Total
                                                       (N=110)     (N=581)   (N=3,796)
Invoked irrelevant skill                                   64         367         1,999      2,430
Invoked relevant skill                                     46          57         1,797      1,900
Invoked relevant skill but prioritized non-kid skill        -         157             -        157

Table 4: Number of confounding utterances of each type and their corresponding behaviors.
Kids Only:
We found that 64 (58.2%) out of 110 utterances of this type invoked an irrelevant skill that was not in the list of skills associated with the utterance itself. The remaining 46 utterances (41.8%) invoked a relevant skill within the list of associated skills.
Both Kids and Non-kids (Joint):
We found that 367 (63.2%) out of 581 utterances of this type invoked an irrelevant skill that was not in the list of skills associated with the utterance itself. The remaining 214 utterances (36.8%) invoked a relevant skill within the list of associated skills. However, 157 of these 214 utterances (73.4%) were prioritized to invoke a non-kid skill over a kid skill.
Non-kids Only:
We found that 1,999 (52.7%) out of 3,796 utterances of this type invoked an irrelevant skill that was not in the list of skills associated with the utterance itself. The remaining 1,797 utterances (47.3%) invoked a relevant skill within the list of associated skills.
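Concretely, each tested utterance falls into one of the three behaviors summarized in Table 4. A minimal sketch of that labeling step, assuming hypothetical identifiers returned by our test harness, could look like the following.

    def classify_invocation(invoked_skill, shared_skills, kid_flags):
        """Label one SkillBot test outcome using the three behaviors of Table 4.

        `shared_skills` is the set of skills whose info pages declare the tested
        utterance; `invoked_skill` is whichever skill Alexa actually enabled.
        Both identifiers are placeholders, not the real harness output format."""
        if invoked_skill not in shared_skills:
            return "invoked irrelevant skill"
        if any(kid_flags[s] for s in shared_skills) and not kid_flags[invoked_skill]:
            # A kid skill shares the utterance, but Alexa preferred a non-kid skill.
            return "invoked relevant skill but prioritized non-kid skill"
        return "invoked relevant skill"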
Takeaway:
It is risky if a confounding utterance is shared between a kid skill and a non-kid skill. Our analysis shows that kids can accidentally invoke a non-kid skill while trying to use a kid skill. An adversary can exploit this problem to get kid users to invoke risky non-kid skills.
Our confounding utterance analysis shows that kids can accidentally invoke a non-kid skill while trying to use a kid skill. As a result, kids could potentially access risky content, as non-kid skill development does not need to follow the strict policy requirements that kid skill development does. Furthermore, confounding utterances could be an attack vector for an adversary to expose kids to risky content without having to publish a kid skill (which faces stricter policy requirements).
To gain more insight into the risks from non-kid skills, we sampled a set of non-kid skills, containing 50 skills from each of the other 22 categories (1,100 skills in total). We ran SkillBot to interact with each skill and record the conversations. We collected 7,356 conversations from 1,073 non-kid skills after removing 27 skills that resulted in errors or crashes. We used the same approach as our kid skill analysis presented in Section 6.2 to analyze this sample. For inappropriate content analysis, 8 skills were flagged. After the human review process, we identified 4 skills with actual inappropriate content and found 0 false negatives in 100 random skills that were not flagged.
For skills collecting personal information, 17 skills were flagged. We found 1 false positive and 0 false negatives. Thus, 16 skills actually asked for personal information, such as name, age, birthday, phone number, zip code, and address. Of these 16 skills, 8 did not request any permission from the user. We further investigated the 3 skills that requested permissions; only 1 of them requested the proper permission for the personal information it asked for.
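For reference, the per-category sampling described above amounts to a simple stratified draw. The sketch below is illustrative only; the skills_by_category structure and function name are assumptions rather than the actual crawler code.

    import random

    def sample_non_kid_skills(skills_by_category, per_category=50, seed=0):
        """Draw an equal-sized random sample of skill IDs from each category.

        `skills_by_category` maps a non-kid category name to the list of skill
        IDs crawled for it (a hypothetical structure); categories smaller than
        `per_category` are taken in full."""
        rng = random.Random(seed)
        sample = []
        for skill_ids in skills_by_category.values():
            k = min(per_category, len(skill_ids))
            sample.extend(rng.sample(skill_ids, k))
        return sample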
Suggestions for Building Safe VPA for Children.
Although there are stricter requirements for developing and publishing kid skills, risky kid skills still exist. This means VPA service providers need a more robust vetting system to ensure that published skills adhere to the policy requirements. Furthermore, since skills can be hosted on third-party servers, it is hard for VPA service providers to control what happens in the backend. An adversary can easily manipulate the backend, turning a benign skill into a malicious one. Thus, a continuous vetting process is important to ensure consistent adherence to policy requirements. VPA service providers also need to improve detection and limitation of confounding utterances, especially those that may unintentionally invoke different skills than the user intended. Third-party skill designers could be required to register invocation phrases upon posting their skills on VPA provider hosting platforms. As with email addresses or domain names, preventing overlap of skill invocation phrases would keep children and other users from accidentally opening an unwanted skill with potentially risky content.
Our user study results showed that many parents do not know about or do not use parental control features on VPA devices. This likely means that more user-friendly parental control features are needed to reduce the burden on parents while providing strong protections for child users. Parents should be encouraged to use parental control features for VPAs, especially on devices placed in a shared space. Existing recommendations for the design of parental control software in other domains likely carry over to the VPA space and would help parents prevent children from accessing risky non-kid skills.
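As an illustration of this suggestion, a platform-side registry could treat invocation phrases the way registrars treat domain names, rejecting a phrase that another skill already owns. The class and method names below are purely hypothetical, not an existing platform API.

    class InvocationPhraseRegistry:
        """Toy first-come, first-served registry for skill invocation phrases,
        analogous to how domain names are allocated."""

        def __init__(self):
            self._owners = {}  # normalized phrase -> skill_id

        @staticmethod
        def _normalize(phrase):
            return " ".join(phrase.lower().split())

        def register(self, phrase, skill_id):
            """Claim `phrase` for `skill_id`; refuse if another skill already owns it."""
            key = self._normalize(phrase)
            owner = self._owners.get(key)
            if owner is not None and owner != skill_id:
                return False  # phrase taken; the developer must choose another
            self._owners[key] = skill_id
            return True

Such a registry would make confounding utterances impossible by construction, at the cost of requiring developers to choose unique invocation phrases.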
Limitations.
The Alexa developer console employed in our system has some limitations [3] compared to an actual physical device. The limitation relevant to this paper is the inability to collect the content of audio playback from skills (e.g., music skills). Thus, we did not consider such skills in our dataset. Since audio playback is rare (as shown in Section 5.3) and is not significant in our content analysis, we believe this limitation is a reasonable trade-off for the ability to efficiently perform analysis at scale, and it does not undermine our analysis results.
Our user study's results were based on self-reported data, which means the responses might be influenced by social desirability. Specifically, people are often biased when they self-report, as they tend to follow what are considered to be social norms [26]. To mitigate this, we tried to use neutral wording for our questions. As with any self-reported survey, participants may choose the first acceptable answer without thinking carefully about the question. Thus, we included attention check questions to filter out such inattentive participants from our study.
Future Work.
We maintain a dataset that cumulatively collects branches of conversations with skills to increase coverage over time. In addition, our system can be run longitudinally to keep the dataset updated with new content from existing skills as well as new skills. Currently, our study focuses on risky content in kid skills. However, we show that kids can potentially be exposed to non-kid skills, which are allowed to have adult content due to more relaxed policy requirements. Thus, future work can investigate the non-kid skills that might be invoked accidentally by kids. We presented our analysis on Alexa in this paper, but it would be straightforward to extend our proposed system to other platforms. Our system could be further integrated into a public service that helps consumers understand "black-box" skills before installing them (i.e., a potential countermeasure for consumers). Our findings could provide insights into risky behaviors and help with setting up rules to protect the user experience.
10 Related Work
Researchers have been studying the problem of speech recognition misinterpretations made by voice personal assistants [32, 33, 54], showing that an adversary could impersonate the voice assistant system or other skills to eavesdrop on users. Different from previous work, we investigate the hidden logic inside skills to identify risky content. Recently, Guo et al. [23] looked into skills asking for users' private information by triggering the skills starting with the three sample utterances. In our work, we focus on children's safety when using voice apps, covering a broader set of risks: delivering inappropriate content and asking for personal information. We propose novel designs for SkillBot to explore skills in depth and breadth to identify hidden risky behaviors. These designs are critical for detecting risky skills. For example, out of the 28 risky kid skills (8 expletive and 20 sensitive), SkillBot revealed risky content from 3 skills by generating custom utterances. Also, 2 skills were identified at depth 11, 1 skill at depth 5, 4 skills at depth 4, 6 at depth 3, 8 at depth 2, and 7 at depth 1. We also ran a user study to understand parents' experiences, awareness, and concerns about risky content for kids, as well as their use of parental control. In addition, we analyzed confounding utterances, a novel threat that exposes kids to risky content.
There is considerable literature examining children's online protection and COPPA compliance in domains other than VPA. Automated analyses of thousands of mobile applications have found widespread COPPA violations [45, 57], ranging from illegal collection of persistent identifiers to non-compliant privacy policies. Investigations of children-directed websites have also found covert tracking techniques designed to avoid COPPA requirements [52], among other non-compliant behaviors [24, 50]. Our work sheds light on the new problems in VPA. More recently, researchers have raised concerns about children's privacy with respect to Internet-connected "smart" toys. Several studies have conducted detailed analyses of specific toys, often noting multiple implementation decisions and security vulnerabilities that place children's data at risk and violate COPPA [25, 36, 46, 47, 51]. Other studies have provided frameworks for smart toy protections [29, 44] and recommendations for smart toy manufacturers. As VPAs straddle the boundary between IoT products, mobile application platforms, and search engines, our work contributes to the evolving landscape of risks posed to children by modern online services, joining this related literature to demonstrate the breadth of challenges remaining.
Importantly, technical research into children and the Internet has been motivated and supported by qualitative and quantitative user studies investigating how children and parents understand connected technologies [41] and make privacy decisions. These studies show that parents are becoming more concerned about smart toy privacy [37, 40], but that children have difficulty conceptualizing certain types of privacy risks [34, 56]. Given that some parents actively compromise their children's online privacy [42] or help their children avoid COPPA protections [27], we should not rely on parents to keep children safe online. Instead, risky practices, such as the skills we identify, must be addressed through a combination of academic, regulatory, and industry action.
11 Conclusion
We designed and implemented an automated skill interaction system called SkillBot and used it to analyze 3,434 Alexa kid skills. We identified a number of risky skills with inappropriate content or personal data requests, as well as a confounding utterance threat. To further evaluate the impacts of these risky skills on kids, we conducted a user study of 232 U.S. parents who use Alexa in their households. We found widespread concerns about the contents of these skills, combined with general disbelief that these skills might actually be available to kids, and low adoption of parental control features.

References

[1] Alexa is adding freetime skills that i cannot remove. reddit.com/r/alexa/comments/aba5u6/alexa_is_adding_freetime_skills_that_i_cannot/. Accessed: 2020-02-13.
[2] Alexa simulator. https://developer.amazon.com/docs/devconsole/test-your-skill.html. Accessed: 2020-02-13.
[3] Alexa simulator limitations. https://developer.amazon.com/docs/devconsole/test-your-skill.html. Accessed: 2020-02-13.
[4] Alexa skill not kid friendly (freetime)? https://forums.plex.tv/t/alexa-skill-not-kid-friendly-freetime/343477. Accessed: 2020-02-13.
[5] Alexa skills kit. https://developer.amazon.com/alexa/alexa-skills-kit. Accessed: 2020-02-13.
[6] Amazon alexa skills. amazon.com/alexa-skills/b?ie=UTF8&node=13727921011. Accessed: 2020-02-13.
[7] Azure content moderator. https://docs.microsoft.com/en-us/azure/cognitive-services/content-moderator/. Accessed: 2020-02-13.
[8] Beautifulsoup. crummy.com/software/BeautifulSoup/. Accessed: 2020-02-13.
[9] Can't remove/edit freetime content! anyone have a fix? amazonforum.com/forums/devices/fire-tablets/1815-cant-remove-edit-freetime-content-anyone-have-a. Accessed: 2020-02-13.
[10] Complying with coppa: Frequently asked questions. ftc.gov/tips-advice/business-center/guidance/complying-coppa-frequently-asked-questions. Accessed: 2020-02-13.
[11] Definition of question classes. https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html. Accessed: 2020-02-13.
[12] Faqs - dod privacy office. https://dpcld.defense.gov/Privacy/About-the-Office/FAQs/. Accessed: 2020-02-13.
[13] Freetime unlimited alexa skills not available in parent dashboard. amazonforum.com/forums/devices/echo-alexa/497656-freetime-unlimited-alexa-skills-not-available-in. Accessed: 2020-02-13.
[14] Host a custom skill as a web service. https://developer.amazon.com/docs/custom-skills/host-a-custom-skill-as-a-web-service.html. Accessed: 2020-02-13.
[15] How can i know what the freetime unlimited skills are? reddit.com/r/amazonecho/comments/9nzuwj/how_can_i_know_what_the_freetime_unlimited_skills/. Accessed: 2020-02-13.
[16] Kids Are Spending More Time with Voice, but Brands Shouldn't Rush to Engage Them.
[17] Selenium. seleniumhq.org. Accessed: 2020-02-13.
[18] Understand how users invoke custom skills. https://developer.amazon.com/docs/custom-skills/understanding-how-users-invoke-custom-skills.html. Accessed: 2020-02-13.
[19] Understand name-free interactions. https://developer.amazon.com/docs/custom-skills/understand-name-free-interaction-for-custom-skills.html. Accessed: 2020-02-13.
[20] Universal pos tags. https://universaldependencies.org/u/pos/. Accessed: 2020-02-13.
[21] Webpurify for children's apps and websites. webpurify.com/childrens-apps-websites/. Accessed: 2020-02-13.
[22] What are upcs, eans, isbns, and asins? amazon.com/gp/seller/asin-upc-isbn-info.html. Accessed: 2020-02-13.
[23] Skillexplorer: Understanding the behavior of skills in large scale. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, August 2020.
[24] Xiaomei Cai and Xiaoquan Zhao. Online advertising on popular children's websites: Structural features and privacy issues. Computers in Human Behavior, 29(4):1510–1518, 2013.
[25] Gordon Chu, Noah Apthorpe, and Nick Feamster. Security and privacy analyses of internet of things children's toys. IEEE Internet of Things Journal, 6(1):978–985, 2018.
[26] Robert J. Fisher and James E. Katz. Social-desirability bias and the validity of self-reported values. Psychology & Marketing, 17(2):105–120, 2000.
[27] Eszter Hargittai, Jason Schultz, John Palfrey, et al. Why parents help their children lie to facebook about age: Unintended consequences of the 'children's online privacy protection act'. First Monday, 16(11), 2011.
[28] Hamza Harkous, Kassem Fawaz, Kang G Shin, and Karl Aberer. Pribots: Conversational privacy with chatbots. In Proc. of SOUPS'16, 2016.
[29] Jeffrey Haynes, Maribette Ramirez, Thaier Hayajneh, and Md Zakirul Alam Bhuiyan. A framework for preventing the exploitation of iot smart toys for reconnaissance and exfiltration. In International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage, pages 581–592. Springer, 2017.
[30] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength Natural Language Processing in Python, 2020.
[31] Eric J. Johnson and Daniel Goldstein. Do defaults save lives? Science, 302(5649):1338–1339, 2003.
[32] D. Kumar, R. Paccagnella, P. Murley, E. Hennenfent, J. Mason, A. Bates, and M. Bailey. Emerging threats in internet of things voice services. IEEE Security & Privacy, 17(4):18–24, July 2019.
[33] Deepak Kumar, Riccardo Paccagnella, Paul Murley, Eric Hennenfent, Joshua Mason, Adam Bates, and Michael Bailey. Skill squatting attacks on amazon alexa. In 27th USENIX Security Symposium (USENIX Security 18), pages 33–47, Baltimore, MD, 2018. USENIX Association.
[34] Priya Kumar, Shalmali Milind Naik, Utkarsha Ramesh Devkar, Marshini Chetty, Tamara L Clegg, and Jessica Vitak. 'No telling passcodes out because they're private': Understanding children's mental models of privacy and security online. Proceedings of the ACM on Human-Computer Interaction, 1(CSCW):64, 2017.
[35] Brigitte C Madrian and Dennis F Shea. The power of suggestion: Inertia in 401(k) participation and savings behavior. The Quarterly Journal of Economics, 116(4):1149–1187, 2001.
[36] Moustafa Mahmoud, Md Zakir Hossen, Hesham Barakat, Mohammad Mannan, and Amr Youssef. Towards a comprehensive analytical framework for smart toy privacy practices. In Proceedings of the 7th Workshop on Socio-Technical Aspects in Security and Trust, pages 64–75. ACM, 2018.
[37] Andrew Manches, Pauline Duncan, Lydia Plowman, and Shari Sabeti. Three questions about the internet of things and children. TechTrends, 59(1):76–83, 2015.
[38] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
[39] Craig RM McKenzie, Michael J Liersch, and Stacey R Finkelstein. Recommendations implicit in policy defaults. Psychological Science, 17(5):414–420, 2006.
[40] Emily McReynolds, Sarah Hubbard, Timothy Lau, Aditya Saraf, Maya Cakmak, and Franziska Roesner. Toys that listen: A study of parents, children, and internet-connected toys. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 5197–5207. ACM, 2017.
[41] Pekka Mertala. Young children's perceptions of ubiquitous computing and the internet of things. British Journal of Educational Technology, 2019.
[42] Tehila Minkus, Kelvin Liu, and Keith W Ross. Children seen but not heard: When parents compromise children's online privacy. In Proceedings of the 24th International Conference on World Wide Web, pages 776–786. International World Wide Web Conferences Steering Committee, 2015.
[43] Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170, Brussels, Belgium, October 2018. Association for Computational Linguistics.
[44] Laura Rafferty, Patrick CK Hung, Marcelo Fantinato, Sarajane Marques Peres, Farkhund Iqbal, Sy-Yen Kuo, and Shih-Chia Huang. Towards a privacy rule conceptual model for smart toys. In Computing in Smart Toys, pages 85–102. Springer, 2017.
[45] Irwin Reyes, Primal Wijesekera, Joel Reardon, Amit Elazari Bar On, Abbas Razaghpanah, Narseo Vallina-Rodriguez, and Serge Egelman. "Won't somebody think of the children?" Examining COPPA compliance at scale. Proceedings on Privacy Enhancing Technologies, 2018(3):63–83, 2018.
[46] Sharon Shasha, Moustafa Mahmoud, Mohammad Mannan, and Amr Youssef. Smart but unsafe: Experimental evaluation of security and privacy practices in smart toys. arXiv preprint arXiv:1809.05556, 2018.
[47] Joshua Streiff, Olivia Kenny, Sanchari Das, Andrew Leeth, and L Jean Camp. Who's watching your child? Exploring home security risks with smart toy bears. In 2018 IEEE/ACM Third International Conference on Internet-of-Things Design and Implementation (IoTDI), pages 285–286. IEEE, 2018.
[48] Ann Taylor, Mitchell Marcus, and Beatrice Santorini. The Penn Treebank: An overview. In Treebanks, pages 5–22. Springer, 2003.
[49] Harish Tayyar Madabushi and Mark Lee. High accuracy rule-based question classification using question syntax and semantics. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1220–1230, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.
[50] Joseph Turow. Privacy policies on children's websites: Do they play by the rules? 2001.
[51] Junia Valente and Alvaro A Cardenas. Security & privacy in smart toys. In Proceedings of the 2017 Workshop on Internet of Things Security and Privacy, pages 19–24. ACM, 2017.
[52] Natalija Vlajic, Marmara El Masri, Gianluigi M Riva, Marguerite Barry, and Derek Doran. Online tracking of kids and teens by means of invisible images: COPPA vs. GDPR. In Proceedings of the 2nd International Workshop on Multimedia Privacy and Security, pages 96–103. ACM, 2018.
[53] X. Yuan, Y. Chen, A. Wang, K. Chen, S. Zhang, H. Huang, and I. M. Molloy. All your alexa are belong to us: A remote voice control attack against echo. In 2018 IEEE Global Communications Conference (GLOBECOM), pages 1–6, Dec 2018.
[54] Nan Zhang, Xianghang Mi, Xuan Feng, XiaoFeng Wang, Yuan Tian, and Feng Qian. Dangerous skills: Understanding and mitigating security risks of voice-controlled third-party functions on virtual personal assistant systems. In 2019 IEEE Symposium on Security and Privacy (SP), pages 1381–1396, 2019.
[55] Yangyong Zhang, Lei Xu, Abner Mendoza, Guangliang Yang, Phakpoom Chinprutthiwong, and Guofei Gu. Life after speech recognition: Fuzzing semantic misinterpretation for voice assistant applications. In NDSS, 2019.
[56] Jun Zhao, Ge Wang, Carys Dally, Petr Slovak, Julian Edbrooke-Childs, Max Van Kleek, and Nigel Shadbolt. 'I make up a silly name': Understanding children's perception of privacy risks online. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, pages 106:1–106:13, New York, NY, USA, 2019. ACM.
[57] Sebastian Zimmeck, Ziqi Wang, Lieyong Zou, Roger Iyengar, Bin Liu, Florian Schaub, Shomir Wilson, Norman Sadeh, Steven Bellovin, and Joel Reidenberg. Automated analysis of privacy requirements for mobile apps. In 2016 AAAI Fall Symposium Series, 2016.

A Survey Questionnaire

A.1 Screening Survey
1. Who lives in your household? (Choose all that apply)
   ☐ Myself   ☐ My spouse or partner   ☐ My friend(s)   ☐ My sibling(s)
   ☐ My kid(s) - aged 1 to 13   ☐ My kid(s) - aged 14 to 18   ☐ My parent(s)
   ☐ My grandparent(s)   ☐ Housemate(s) or roommate(s)   ☐ Other relative(s)
   ☐ Other non-relative(s)

2. Which type of electronic devices do you have in the household? (Choose all that apply)
   ☐ Amazon Echo   ☐ Google Home   ☐ Smart TV   ☐ Computer   ☐ Smartphone
   ☐ Other:   ☐ None of the above

A.2 Main Survey

A.2.1 Parents' Reactions to Risky Skills

For each participant, show the following skills in random order and present the set of questions below.
• Skill 1: Randomly selected from non-risky set
• Skill 2: Randomly selected from non-risky set
• Skill 3: Randomly selected from sensitive set
• Skill 4: Randomly selected from sensitive set
• Skill 5: Randomly selected from expletive set
• Skill 6: Randomly selected from expletive set

1. Do you think this conversation is possible on Alexa?
   ☐ Yes   ☐ No   ☐ Not sure

2. Do you think Alexa should allow this type of conversation?
   ☐ Yes   ☐ No   ☐ Not sure

3. Do you think this particular skill or conversation is designed for families and kids?
   ☐ Yes   ☐ No   ☐ Not sure

4. How comfortable are you if this conversation is between your children and Alexa?
   ☐ Extremely uncomfortable   ☐ Somewhat uncomfortable   ☐ Neutral
   ☐ Somewhat comfortable   ☐ Extremely comfortable

If answering "Somewhat uncomfortable" or "Extremely uncomfortable", ask:

5. What skills or conversations have you experienced with Alexa that made you similarly uncomfortable?

A.2.2 Amazon Echo Usage

6. Which model(s) of Amazon Echo do you have in the household? (Choose all that apply)
   ☐ Regular Echo   ☐ Echo Dot   ☐ Echo Dot Kids Edition   ☐ Echo Plus   ☐ Other:

7. Do your kids use Amazon Echo at home?
   ☐ Yes   ☐ No   ☐ I don't know

A.2.3 Awareness of Parental Control Features

8. Does Amazon Echo support parental control?
   ☐ Yes   ☐ No   ☐ I don't know

If Yes, ask:

9. Do you use Amazon Echo's parental control?
   ☐ Yes   ☐ No   ☐ I don't know

If Yes, ask:

10. What is the name of Amazon Echo's parental control?

A.2.4 Demographic Information

11. What is your gender?
   ☐ Male   ☐ Female   ☐ Other:   ☐ Prefer not to answer

12. What is your age?
   ☐ 18 - 24 years old   ☐ 25 - 34 years old   ☐ 35 - 44 years old   ☐ 45 - 54 years old
   ☐ 55 - 64 years old   ☐ 65 - 74 years old   ☐ 75 years or older

13. Please select the statement that best describes your comfort level with computing technology.
   ☐ Ultra Nerd: I build my own computers, run my own servers, code my own apps. I'm basically Mr. Robot.
   ☐ Technically Savvy: I know my way around a computer pretty well. When anyone in my family needs technical help, I'm the one they call.
   ☐ Average User: I know enough to get by.
   ☐ Luddite: Technology scares me! I only use it when I have to.
B Side Findings about Sneaky Skills
During our experiments, we experienced the same issue regarding Alexa's user interface as reported by other users [1, 9]. In particular, through the "Your Skills" tab on both the Alexa webstore and mobile app interfaces, users can see a list of skills that are enabled. Although we had already disabled some skills, they still appeared in the list. Sometimes the interface showed an empty list of skills even though we actually had some skills enabled. This seems to be a bug in the Alexa user interface that has not been fixed since 2017 [1, 9]. We investigated this issue by enabling and disabling skills and asking Alexa about the status of the skills to verify. We found that this appears to be a front-end bug, which did not affect our analysis. However, in our user study, 48 participants were unwilling to use parental control features such as FreeTime. Some of these participants preferred to monitor enabled skills on their own instead of using Amazon FreeTime. Additionally, in our skill analysis, we showed that users have no way of checking which skill they have actually invoked, even with the companion app in some cases, because skills can share the same name and icon. Therefore, this front-end bug might become a security issue, as it increases the stealthiness of malicious skills. A malicious skill can get accidentally enabled and remain invisible to users. In particular, one possible attack is that the adversary crafts a malicious skill that exploits confounding utterances to get unexpectedly enabled or invoked by victims. These skills become "sneaky" because the front-end bug prevents the user from seeing the enabled skills.
C Demographic Info of Survey Respondents
                                          Responses   Percentage
Gender
  Male                                        128         55%
  Female                                      103         44%
  Prefer not to answer                          1        < 1%
Age
  18 - 24                                       3          1%
  25 - 34                                      61         26%
  35 - 44                                     124         53%
  45 - 54                                      40         17%
  55 - 64                                       3          1%
  65 and above                                  1        < 1%
Comfort level with computing technology
  Ultra Nerd                                   19          9%
  Technically Savvy                           159         68%
  Average User                                 54         23%
  Luddite                                       0          0%

Table 5: Demographic information (gender, age, comfort level with computing technology) of the participants in our sample. The numbers in the "Responses" column denote the number of participants who selected the corresponding choices. Our sample is nearly gender-balanced, with most participants in the 25 – 44 age group. Most participants also self-reported being technically savvy or average users.
D Examples of Skills in the Expletives and Sensitive Sets

D.1 Expletives Set
Skill ID: B07CX9G446 – You: "open word rally" – Alexa: "Poop."
Skill ID: B07K77Z4F8 – You: "open fart buttons" – Alexa: "Welcome to the Fart Buttons Skill. Press your Echo Buttons to play a fart sound.."
Skill ID: B07RYKZW52 – You: "open blast off" – Alexa: "Ok, Here's Flaming Fart."
Skill ID: B07ZCS7784 – You: "open new facts" – Alexa: "Here's your fact: A pig's orgasm lasts for 30 minutes.."
Skill ID: B081VS1BTD – You: "play santa claus a fart" – Alexa: "Handing off to SantaClausAFart by Baomatic Apps."
Skill ID: B0837HWNY5 – You: "open my burns" – Alexa: "You're so ugly you'd scare the crap out of the toilet. I'm on a roll."
Skill ID: B0881V9K8V – You: "open birthday countdown" – Alexa: "... until November 5! In that time, you will fart approximately 3 thousand times. Do you want me ..."
Skill ID: B08C4WVY4Q – You: "start pop story" – Alexa: "... story, I will read it to you now. poop a poop Once in a faraway kingdom there lived a Poop and Poopy who were the shortest in the land, and ... and humongous pizzas. aw man- Well, one day, the poop and poopy had a wonderful daughter, who they named a ... by the time she was a teenager. The ..."
D.2 Sensitive Set
Skill ID: B0773H1M6J – You: "start spelling bee" – Alexa: "... understand what you said, please tell me your name. for example, I am Cindy.."
Skill ID: B077C9B1KX – You: "open spelling contest" – Alexa: "... say "start" to start a contest, or my name is to change your name."
Skill ID: B078XHP4MQ – You: "open does it fly" – Alexa: "... play does it fly game, tell me your name.. So more, Alexa will ask you a question,"
Skill ID: B078XQC43F – You: "open mr. tongue twister" – Alexa: "Welcome to Mr Tongue Twister. What is your name?."
Skill ID: B079CMS36Z – You: "ask find buddy to look in the garden" – Alexa: "... - if you say "call me" before your name, I may understand you better. For example, if your name is Allen, try saying "Call me Allen", instead"
Skill ID: B07CQ68G5C – You: "open ready freddy" – Alexa: "... Ready Freddy! Please introduce yourself by saying: My name is, and then your name.."
Skill ID: B07CQQWWRT – You: "open birthday wisher" – Alexa: "Hello! I am Birthday Wisher, by Uday Pratap Singh. What is your birthday?."
Skill ID: B07DB9JMCP – You: "open community helpers" – Alexa: "... I didn't get that. Would you state your name again?."
Skill ID: B07DFCXXM5 – You: "open say please" – Alexa: "Welcome to Say Please! My name is Sara. Please tell me your name.."
Skill ID: B07L9QRYNH – You: "open santa's little helper" – Alexa: "... to help you with that. Tell me your name to see if you are on the naughty"
Skill ID: B07RZS3PGV – You: "start may twenty question" – Alexa: "You said your name is 'John'. Is that correct?."
Skill ID: B07YYK45WQ – You: "start my birthday saifi" – Alexa: "You said your name is 'John'. Is that correct?."
Skill ID: B0819YKNHJ – You: "open push up" – Alexa: "... the Push Ups Counter. May I have your name, please?."
Skill ID: B087J97GBX – You: "ask banana stories to start the game" – Alexa: "So, first, what is your name?."
Skill ID: B088TQHVCJ – You: "open birthday surprise" – Alexa: "Hello! I'll remember your birthday when no one else will! This is Birthday Surprise, when is your birthday?."
Skill ID: B08C7CC4CC – You: "open your name says" – Alexa: "Welcome to Your Name Says! Let's start by you saying your name... What's your name?."
Skill ID: B08DD9LSB4 – You: "open hamilton knowledge" – Alexa: "I have heard your name before John."
Skill ID: B07MJZ5R65 – You: "open my book assistant" – Alexa: "Please provide me your age group. Such as adult, children, young Et cetera.."
Skill ID: B07TZM51S4 – You: "open shape game" – Alexa: "Awesome! Before we start however; I'm curious...how old are you?."
Skill ID: B07XV6Z4ZM – You: "open guess my age" – Alexa: "First you must write down your age. Say the word READY to continue.."
E Skill Response Classification Confusion Matrix
Table 6: Skill Response Classification Confusion Matrix