Automatic Generation of Chatbots for Conversational Web Browsing
Pietro Chittò, Marcos Baez, Florian Daniel, Boualem Benatallah
Politecnico di Milano, Via Ponzio 34/5, 20133 Milan, Italy
[email protected], [email protected]
LIRIS – University of Claude Bernard Lyon 1, Villeurbanne, France
[email protected]
University of New South Wales, Sydney, Australia
[email protected]
Abstract.
In this paper, we describe the foundations for generating a chatbot out of a website equipped with simple, bot-specific HTML annotations. The approach is part of what we call conversational web browsing, i.e., a dialog-based, natural language interaction with websites. The goal is to enable users to use content and functionality accessible through rendered UIs by "talking to websites" instead of by operating the graphical UI using keyboard and mouse. The chatbot mediates between the user and the website, operates its graphical UI on behalf of the user, and informs the user about the state of the interaction. We describe the conceptual vocabulary and annotation format, the supporting conversational middleware and techniques, and the implementation of a demo able to deliver conversational web browsing experiences through Amazon Alexa.
Keywords: Non-visual browsing · Conversational browsing · Chatbots
(This is a post-peer-review, pre-copyedit version of an article accepted to the 29th International Conference on Conceptual Modeling, ER 2020.)

Conversational agents are emerging as an exciting new platform for accessing online services, promising a more natural and accessible interaction paradigm. They have shown great potential for regular users in hands-free and eyes-free scenarios, but also for making services more accessible to people with disabilities and visual impairments [11], as well as to groups, such as older adults, often challenged by service design choices [9]. This new generation of agents is, however, not able to natively access the Web, requiring web developers and content creators to implement specific "skills" to offer their content and services on Amazon Alexa, Google Assistant and other platforms. This requirement represents a huge barrier for developers and creators who might not have the skills or resources to invest, and a missed opportunity for making the Web accessible to everyone.

Integrating conversational capabilities into software-enabled services is an emerging research topic [3], as pushed by recent works by Castaldo et al. [5] on inferring bots directly from database schemas, by Yaghoub-Zadeh-Fard et al. [13] on deriving bots from APIs, and by Ripa et al. [12] on generating informational bots out of website content. While these works are facilitating chatbot integration at different levels of the Web architecture, they do not address the challenges of generating chatbots from both content and functionality available in websites.

In this paper, we take a software engineering approach and study how to enable conversational browsing of websites equipped with purposefully designed annotations. This represents the first step towards our vision [2] of enabling users to access the content and services accessible through rendered UIs by "talking to websites" instead of by operating the graphical UI using keyboard and mouse. We start with an annotation-driven approach, as the focus is to lay the foundation for conversational browsing and to identify all necessary conversational features and technical solutions, which can then lead to the development of support tools and automatic approaches. In doing so, we make the following contributions:

– a conceptual vocabulary for augmenting websites with conversational capabilities, able to describe domain knowledge (content and functionality) while abstracting interaction knowledge (enacting low-level interactions with sites);
– an approach, architecture and techniques for generating a chatbot out of a website equipped with simple, bot-specific HTML annotations;
– a prototype implementation and a technical feasibility study of the proposed automatic chatbot generation approach.

In the following we describe a concrete target scenario, the overall approach, and the prototype implementation.
We describe our target scenario by illustrating the interactions of a user browsing a typical research project website using a smart speaker such as Amazon Echo (Figure 1). After the user requests access to the research project website, a conversational agent tailored to the website content, functionality and domain knowledge is automatically generated to mediate the interactions between the user and the target website. During these interactions, (i) the user is informed of the available features, (ii) the user can browse the website in dialog-based natural language interactions with the agent, and (iii) the agent identifies and performs the appropriate web browsing actions on the target website on behalf of the user.

Before diving into the requirements posed by the envisioned scenario, we need to introduce some concepts related to chatbot development, with a focus on task-oriented chatbots. Modern task-oriented chatbots are built on a frame-based architecture, which relies on a domain ontology (composed of frames, slots and values) that specifies the types of user intentions the system can recognize and respond to [8]. Intents refer to the tasks requested by the user, and actions to the specific operations performed by the chatbot to serve the intent. Identifying user intents given a request in natural language (e.g., "Tell me about Florian Daniel") requires a natural language processing component trained with
a dataset of examples (e.g., researcher info: ["Tell me about @researcher", "Who is @researcher?", ...]) to correctly classify the request and infer the slots and values (e.g., intent: researcher info, researcher: "Florian Daniel"). Then the dialog management component decides on the appropriate action (e.g., parse the associated DOM element) based on the intent, the input provided and the conversation context. A response is generated using a natural language generation component that elaborates the results and presents them in a format that fits the conversation medium (refer to [8] for more on chatbot design and architecture).

[Fig. 1: Conversational browsing scenario: the user talks to a bot, not to the website. The figure shows a natural language dialog flowing through the Alexa Voice Service to a conversational browsing server (Rasa + Python), where a bot manager (REST API) applies a selection policy over an automatically generated application-specific bot and pre-canned element-specific bots; a parser & generator (Chatito) produces intents, utterances and training data from training templates, a dialog manager uses answer templates, and a headless browser (Selenium) performs the automated web browsing actions on the annotated website.]

Having introduced the scenario and main concepts, we refine some key requirements to enabling conversational browsing as identified earlier [2]:

R1 Orientation: The bot must be able to summarize the content and/or functionalities offered by the website, to guide users through the site's offers at any point, and to provide basic access structures (e.g., "In this site you can ...").

R2 Inferring intents and parameters: The bot must be able to understand the user's intent and enact suitable actions in response. Intents may be application-agnostic (e.g., fill a form field) or application-specific (e.g., post a new paper). The latter requires the bot to infer the intents from the website.

R3 Training and vocabulary: The bot should be able to speak and understand the language of the target website, so as to identify intents and elaborate proper responses. This requires deriving domain knowledge directly from the website and training the bot to identify application-specific intents.

R4 Browsing actions enactment: As the bot mediates between the user and the website, enacting an action in response to an identified intent requires a strategy for translating high-level user requests into automated low-level interactions with the website.

R5 Dialog control from rendered UIs: As the user browses the website conversationally, the chatbot should track the state of the dialog and choose dialog actions considering the evolving state of the rendered UI. That is, it should consider the conversation context as well as the browsing context.
The approach illustrated in Figure 1 is based on three main ingredients: (i) purposefully designed bot annotations, (ii) a middleware comprised of chatbot generation and run-time units, and (iii) a medium-specific conversational interface. Web developers enable conversational access by augmenting their websites with bot-specific annotations, which associate knowledge about how to generate a conversational agent with specific HTML constructs. Initiating a conversational browsing session then triggers the chatbot generation process. This process is about generating an application-specific bot tailored to the intents and domain knowledge of the target website, while reusing a library of generic element-specific bots. Using a conversational interface (e.g., Amazon Echo), the user can start a dialog with the website. At run-time, the middleware processes the user's requests in natural language, selects the relevant bot and executes the appropriate actions on the rendered GUI of the website.

Supporting conversational browsing is not trivial and requires weighing several options. The most important decisions that resulted in our solution are:

– Domain vs. interaction knowledge: Using a website generally requires the user to master two types of knowledge: domain knowledge (to understand content and functionalities) and interaction knowledge (to use and operate the site). This distinction is powerful to separate concerns in conversational browsing. Domain knowledge, e.g., about the research project and scientific publications, must be provided by the developer, as this varies from site to site. Interaction knowledge, e.g., how to fill a form or read text paragraph by paragraph, can be pre-canned and reused across multiple sites. We thus distinguish between an application-specific bot and a set of element-specific bots [R1, R2]. The former masters the domain; the latter enable the user to interact with specific content elements like lists, text, tables, forms, etc.

– Modularization: Incidentally, the distinction between application- and element-specific bots represents an excellent opportunity for modularization and reuse. Application-specific bots must be generated for each site anew [R3]; element-specific bots can be implemented and trained once and reused multiple times. They can be implemented for specific HTML elements, such as a form, or for a very specific version thereof, e.g., a login form. However, the presence of application- and element-specific bots introduces the need for a suitable bot selection logic.

– Bot selection: As a user may provide as input any possible utterance at any instant of time, referring to either application-specific or element-specific intents, it is not possible to pre-define conversational paths through a website. Instead, some form of random access must be supported. We introduce for this purpose a so-called bot manager, which takes as input the utterance and forwards it to the bots registered in the system [R5]. Depending on the context (e.g., the last used bot) and the confidence provided by each invoked bot, it then decides which bot is most likely to provide the correct answer [R1, R2]. Thanks to the bot manager, the ensemble of application-specific and element-specific bots presents itself as one single bot to the user.
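To illustrate the separation of concerns, the following sketch mocks a reusable element-specific bot behind a minimal interface. The class and method names are our own invention for illustration; in the actual prototype, element-specific bots are pre-trained Rasa bots, and confidence values come from their NLU models rather than from keyword matching.

```python
# Minimal sketch of an element-specific bot (hypothetical interface).
# Each element-specific bot is implemented once and reused across sites;
# it reports a confidence score so the bot manager can pick the best handler.

from abc import ABC, abstractmethod

class ElementBot(ABC):
    @abstractmethod
    def confidence(self, utterance: str) -> float:
        """How confident the bot is that it understands the utterance (0..1)."""

    @abstractmethod
    def handle(self, utterance: str, element_html: str) -> str:
        """Perform the element-specific action and return the answer text."""

class ListBot(ElementBot):
    """Pre-canned bot for <ul>/<ol> elements: count, read, navigate items."""

    def confidence(self, utterance: str) -> float:
        keywords = ("how many", "list", "read", "next", "previous")
        return 1.0 if any(k in utterance.lower() for k in keywords) else 0.0

    def handle(self, utterance: str, element_html: str) -> str:
        # Naive <li> extraction, sufficient for the sketch.
        items = [seg.split("</li>")[0] for seg in element_html.split("<li>")[1:]]
        if "how many" in utterance.lower():
            return f"There are {len(items)} items."
        return "These are the items: " + ", ".join(items)
```

A TextBot, TableBot and FormBot would implement the same interface, which is what makes element-specific bots reusable across sites regardless of the domain vocabulary.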
The goal of the work presented in this paper is to avoid asking developers to provide full-fledged chatbots for their websites in order to support conversational browsing. The challenge is asking them to provide as little information as possible – the annotation – such that, together with the content and functionality that are already in the site (its GUI), it is possible to automatically generate a chatbot.
Conceptual model.
Let us start by introducing the key concepts that enable conversational browsing. Figure 2 uses an intuitive, graphical notation to contextualize them in a model of a simple website about a research project, e.g., our project on conversational browsing. The site consists of a set of pages, of which the model ignores the actual content; the design of such content has traditionally been approached by modeling languages like WebML [6] or IFML [4]. Instead, the model hypothesizes a conceptual vocabulary that could extend the pages, subsuming the presence of suitable content. We identified these concepts through a literature and systems review and prototyping efforts:

– Intents: These are the core ingredients of conversational browsing. Intents annotate HTML constructs and thereby qualify their contents as relevant for the enactment of the intents' actions [R2]. More importantly, intents enable the user to access content and functionality. We distinguish three types:

• Selection intents identify HTML constructs the developer wants to make accessible through the chatbot. In order to guide the user inside complex pages, selection intents can be structured hierarchically, which tells the bot to read out options at different levels of detail.

• Link intents enable the user to navigate among the pages of the site. Each navigation may reset the context of the conversation and prompt the bot to inform the user of the new intents available.

• Built-in intents are the intents that the framework comes with in order to support basic interactions, such as orienting the user inside a page by proactively telling him/her which options are available (e.g., "What is the page about?") [R1]. Built-in intents do not require any annotation.

– Conversational links: These are the counterpart of hyperlinks in conversational browsing and tell link intents their target [R4].
Similar to conventional links, we distinguish two types of conversational links (note that we do not want to introduce a new modeling notation for conversational browsing; Figure 2 serves an intuitive, illustrative purpose only):
[Fig. 2: Informal graphical model of a project website explaining the core concepts of application-specific, conversational browsing. Labels in italics define the used graphical notation; gray-shaded intents are copied from the Home page. The model shows the pages Home, Project progress, About us, Login/Register and Personal Home, their link, selection and form intents (e.g., LatestPaper with Title, Topics and Abstract; ListOfPapers; FormLogin; TextInstructions; LinkLogout), contextual and non-contextual conversational links, target intent selectors, and synonym specifications for training data generation (e.g., LatestArticle: "latest article, latest news, last article, last one").]

• Non-contextual conversational links are links that can be navigated with the help from the bot and result in the loading and rendering of a new page, causing the bot to start a new browsing context. That is, each page accessed through a non-contextual link causes the bot to inform the user about the content of the page [R1]. For example, Login follows a non-contextual link to a new page (with a different menu of options), triggering the bot to inform of the available options (Instructions, Login).

• Contextual conversational links are links that are directed not only toward a new page but also toward a specific target intent. If a user thus accesses a page through a contextual link, the bot will immediately start performing the action associated with the target intent [R5]; e.g., About (contextual link) will trigger AuthorBios (reading the associated text).

– Bot types: If a selection intent identifies the HTML construct to act upon, i.e., if it cannot be further split into sub-intents (e.g., LatestPaper → Title), the type of element-specific bot able to perform the expected action can be specified (Title: Text). As explained earlier, the number of element-specific bots is theoretically unlimited, but we identify the need for a minimum set of element-specific bots able to manage the following content elements [R2]:

• Text, i.e., text organized into headings, sub-headings and paragraphs. Element-specific actions are reading out loud the full text, reading the titles only, jumping back and forth among paragraphs, etc.

• List, i.e., an ordered or unordered list of items. Element-specific actions are telling the number of items, reading them out, navigating them, etc.

• Table, i.e., content organized in rows and columns. Element-specific actions are reading by cells, navigating by rows, reading by columns, etc.

• Form, i.e., input fields grouped together and accompanied by a submission button. Element-specific actions are telling which inputs are required, filling individual fields, confirming inputs, submitting, etc.

– Domain vocabulary: It is necessary to equip all intents in the website with their domain-specific vocabulary. This can be achieved by accompanying intents with labels and synonyms that can be used to generate combinations of phrases and to train the application-specific bot [R3], for instance, the intent LatestPaper with the words "latest paper, recent paper" or similar.

– Intent description: Intent descriptions are simple textual explanations that the bot can use to tell the user which intents a given page supports. For instance, the LatestPaper intent could be described using the words "tell you about the last paper published by the project" [R1].

Given a website, it is important to note how the sensible selection of which HTML constructs to annotate and how to connect them with conversational links allows the developer to construct pre-defined dialog flows guiding the user through the content and functionalities published by a website [R5].
Annotation format. Annotating a website now means associating conversational knowledge (knowledge about how to generate a conversational agent) with specific HTML constructs in a page. The cues for the generation of the agent come in the form of HTML attributes and developer-provided values. Informed by the conceptual model, the concrete attributes for the generation of application-specific bots are highlighted in Figure 3. The figure provides a practical example of the use of these attributes, and of one element-specific attribute, bot-attribute, which identifies element-specific content types that the respective element-specific bot can understand. While some annotations may seem redundant (e.g., they could be derived from HTML tags), developers do not always follow the semantics of HTML tags. For instance, one of the most used tags today is the <div> tag, which lacks semantics. Explicit annotations also allow developers to indicate which elements to expose to the chatbot. As research progresses, we intend to maintain an up-to-date version of the annotation format on GitHub and to improve it with the help of the community. Please refer to https://github.com/floriandanielit/conversationalweb.
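As an illustration, the annotations can be harvested with a few lines of code. The sketch below applies Python's standard html.parser to a small annotated fragment; the attribute names (bot-intent, bot-desc, bot-keys, bot-type) follow the format above, while the element structure and attribute values are made up for the example. The actual prototype parses the rendered DOM through a headless browser instead.

```python
# Sketch: extracting bot-* annotations from an annotated page with the
# Python standard library (illustrative; the prototype uses Selenium).
from html.parser import HTMLParser

ANNOTATED_PAGE = """
<div bot-intent="LatestPaper" bot-desc="tell you about the last paper"
     bot-keys="latest paper, recent paper">
  <h2 bot-intent="Title" bot-type="Text">Automatic Generation of Chatbots...</h2>
  <p bot-intent="Topics" bot-type="Text">Non-visual browsing, chatbots...</p>
</div>
"""

class AnnotationParser(HTMLParser):
    """Collects every element carrying a bot-intent annotation."""

    def __init__(self):
        super().__init__()
        self.intents = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "bot-intent" in attrs:
            self.intents.append({
                "intent": attrs["bot-intent"],
                "desc": attrs.get("bot-desc"),
                "keys": [k.strip() for k in attrs.get("bot-keys", "").split(",")
                         if k.strip()],
                "type": attrs.get("bot-type"),  # element-specific bot to use
            })

parser = AnnotationParser()
parser.feed(ANNOTATED_PAGE)
```

The nesting of annotated elements (LatestPaper containing Title and Topics) is what later yields the hierarchical structure of the conversational context model.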
The generation process can be divided into two phases: (i) the generation of the application-specific training data and the training of the NLU (natural language understanding) component, and (ii) the generation of a suitable conversational context model to enable the bot manager to manage the dialog. The generation of the application-specific training data follows the steps highlighted in Figure 1 using circled numbers: the headless browser loads the current page of the website and builds its DOM ①; the parser and generator extracts the intent identifiers and the list of intent synonyms ② and generates a dataset of utterances for training ③; the NLU uses the dataset to learn the intents and the application-specific vocabulary ④.

[Fig. 3: Simplified code excerpt of the <body> of the Home page in Figure 2 with annotations for conversational browsing. Application-specific annotations enable navigation and content access; element-specific ones instruct the Text Reader.] The attributes shown in the figure are:

• bot-intent: associates a page-wide unique intent identifier to the HTML construct holding it.
• bot-desc: provides a text explanation that the bot can use to inform the user about the meaning of the intent.
• bot-keys: specifies a comma-separated list of synonyms as alternative names of the intent.
• bot-type: specifies the type of element-specific bot to use to process the HTML construct's internal HTML markup (e.g., Text Reader).

The conversational context model is generated by the parser and generator once the NLU is successfully trained. It consists of a tree representation of the intents contained in the current page: CT = ⟨N, C⟩, where N is the set of nodes, each node representing one application-specific intent in the page, and C ⊆ N × N is the set of non-cyclic, directed child relationships of the tree. Each node n ∈ N, n = ⟨intent, type, desc, keys, elem, link⟩, contains the identifier, type, description and keywords of the respective intent, the HTML element it is associated with, and the possible conversational link in case the intent is a link intent. The root node r ∈ N represents the information intent associated with the <body> element of the current page. Intermediate nodes represent access intents with sub-intents; leaf nodes (nodes without children) represent intents to be processed using a given type of element-specific bot. (The tree is a result of the hierarchical organization enabled by selection intents, e.g., LatestPaper → Title (Text Reader).)

The bot manager now uses the so-constructed context model to decide which bot to choose to advance the conversation with the user. The proposed policy works as follows: as the user provides input, the bot manager checks if the last used bot (the current bot) is able to understand the input, i.e., if it is able to identify an intent with a confidence that exceeds a given threshold τ. If yes, the respective answer is forwarded to the user; otherwise, it forwards the input to all direct children of the current bot, and recursively to the sub-children if none is successful. If any of them is able to identify an intent with sufficient confidence, that bot becomes the new current bot and its answer is forwarded to the user. If the current bot corresponds to a leaf node and is not able to understand the user input, it escalates the input to upper levels until there is a higher-level bot able to understand the input or the escalation reaches the root node. If none is able to understand the input, the user is asked to reformulate his/her request.
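The selection policy can be sketched as a walk over the context tree. The sketch below is a simplified model with hypothetical names: confidences come from a toy keyword function, whereas in the prototype they are the intent-classification confidences reported by the Rasa bots. Unlike the full policy described above, the escalation step here only consults ancestor bots and does not re-explore their subtrees.

```python
# Sketch of the bot manager's selection policy over the context tree.

TAU = 0.6  # confidence threshold

class Node:
    """One application-specific intent in the page's context tree."""
    def __init__(self, intent, confidence_fn, children=None):
        self.intent = intent
        self.confidence_fn = confidence_fn  # utterance -> confidence in [0, 1]
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self

def find_in_subtree(node, utterance):
    """Depth-first search for a descendant able to understand the utterance."""
    for child in node.children:
        if child.confidence_fn(utterance) >= TAU:
            return child
        hit = find_in_subtree(child, utterance)
        if hit is not None:
            return hit
    return None

def select_bot(current, utterance):
    """Try the current bot, then its subtree, then escalate toward the root."""
    if current.confidence_fn(utterance) >= TAU:
        return current
    hit = find_in_subtree(current, utterance)
    if hit is not None:
        return hit
    ancestor = current.parent
    while ancestor is not None:
        if ancestor.confidence_fn(utterance) >= TAU:
            return ancestor
        ancestor = ancestor.parent
    return None  # no bot understood: ask the user to reformulate

# Example context tree for the Home page of the scenario website.
def kw(*words):
    return lambda u: 1.0 if any(w in u.lower() for w in words) else 0.0

title = Node("Title", kw("title"))
topics = Node("Topics", kw("topic"))
latest = Node("LatestPaper", kw("latest paper"), [title, topics])
root = Node("Home", kw("home", "this site"), [latest])
```

With this tree, "Tell me about the latest paper" selects the LatestPaper bot from the root, while a follow-up "What are its topics?" descends to the Topics leaf; an unrelated request from a leaf escalates back toward the root.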
The conversational browsing infrastructure outlined in Figure 1 has been implemented making use of ready technologies: Alexa Voice Service for voice-to-text conversion, Rasa NLU (https://rasa.com/) for natural language understanding, Selenium (https://selenium.dev/) as headless browser integrated with Mozilla Firefox, and Chatito (https://github.com/rodrigopivi/Chatito) for the generation of training data. Custom integration and chatbot code were written in Python. For the tests with Alexa, the infrastructure was deployed on Heroku.

While the training phase of the chatbot could be done once for the entire site, in our current prototype we opted for a page-by-page training, in order to support dynamically generated pages. As the focus of the prototype was technical feasibility, it is not yet optimized for performance. However, tests on a local machine (Omen by HP 15-DH0, Intel Core i7, 16 GB of RAM, SSD hard drive, Win10 64bit) show that page loading and rendering, training data generation and bot training require up to a few seconds, an acceptable performance for some scenarios. Fetching pages from the Web adds additional overhead. The construction of the context model is negligible in terms of execution time.

The element-specific bots of the prototype are custom Rasa bots with pre-defined intents, actions and NLU models. Demo videos illustrating the components of the approach can be found at https://bit.ly/2OckzZW.
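The training-data generation step can be approximated by a simple template expansion in the style of Chatito. This is a hedged sketch: the template phrasings are invented for illustration, and the actual Chatito grammar and Rasa training format differ.

```python
# Sketch: generating NLU training utterances by combining generic
# templates with the intent synonyms harvested from bot-keys annotations.
from itertools import product

TEMPLATES = ["tell me about the {key}", "what is the {key}?", "read the {key}"]

def generate_training_data(intent_synonyms):
    """Map each intent to the utterances produced by template x synonym."""
    dataset = {}
    for intent, synonyms in intent_synonyms.items():
        dataset[intent] = [t.format(key=s) for t, s in product(TEMPLATES, synonyms)]
    return dataset

data = generate_training_data({"LatestPaper": ["latest paper", "recent paper"]})
```

Each intent thus contributes |templates| x |synonyms| labeled utterances to the dataset the NLU is trained on, which is why annotating synonyms in bot-keys directly improves intent recognition.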
The problem of non-visual web browsing has produced two main approaches: markup-based approaches such as VoiceXML [10], and voice-enabled screen readers integrated into web browsers [1]. VoiceXML [10] is a W3C markup language for voice applications typically accessed using a phone. Applications are stand-alone and could complement websites, but there is no native integration of the two. Voice-based screen readers (e.g., [1]) aim at lowering the complexity of managing shortcuts in navigating with screen readers, enabling users to utter browsing commands in natural language ("press the cart button"). While valuable, these approaches were developed to support desktop web browsing: they require users to be aware of the layout of the pages and perform low-level, step-by-step interactions, or to create macros to automate tasks.

As for chatbot development, general platforms and tools support the development of stand-alone chatbots (e.g., DialogFlow, Instabot.io). Another approach is that of deriving chatbots directly from database schemas, API definitions and web content. Prominent works in this regard are the ones by Castaldo et al. [5], exploring the idea of conversational data exploration by inferring a chatbot directly from an annotated database schema, and by Yaghoub-Zadeh-Fard et al. [13], generating a conversational interface directly from API specifications (e.g., OpenAPI). Website content has also been used for chatbot generation. Popular in e-commerce and CRM, approaches such as SuperAgent [7] can generate conversational FAQs from website content and offer them to visitors directly on the website. Ripa et al. [12] focus on making informational queries over content-intensive websites accessible via voice-based interfaces (e.g., smart speakers), relying on augmentations provided by end-users. While all these works illustrate the diversity of approaches, they require either (bot) programming knowledge (and effort), are constrained by an application domain, or are limited to Q&A.
This paper contributes abstractions, techniques and a conceptual vocabulary for superimposing conversational bots over websites. These contributions, along with the software infrastructure, enable the (semi-)automatic generation of chatbots directly from websites, and can be leveraged by authoring tools to enable developers, even without chatbot skills, to obtain chatbots effectively and efficiently. The solution presented is a proof-of-concept implementation not optimized for large applications, and thus presents points for improvement that are the focus of our ongoing work. As a next step, we will carry out user studies with different types of target users (end users and developers) and derive guidelines for conversational browsing. We are also already studying how to use machine learning and AI along with existing Web technical specifications (e.g., HTML5) to replace some explicit annotations by automatic recognition.
References