Automatic Generation of Chatbots for Conversational Web Browsing
Pietro Chittò, Marcos Baez, Florian Daniel, Boualem Benatallah
Politecnico di Milano, Via Ponzio 34/5, 20133 Milan, Italy
[email protected], [email protected]
LIRIS – University of Claude Bernard Lyon 1, Villeurbanne, France
[email protected]
University of New South Wales, Sydney, Australia
[email protected]
Abstract.
In this paper, we describe the foundations for generating a chatbot out of a website equipped with simple, bot-specific HTML annotations. The approach is part of what we call conversational web browsing, i.e., a dialog-based, natural language interaction with websites. The goal is to enable users to use content and functionality accessible through rendered UIs by "talking to websites" instead of by operating the graphical UI using keyboard and mouse. The chatbot mediates between the user and the website, operates its graphical UI on behalf of the user, and informs the user about the state of the interaction. We describe the conceptual vocabulary and annotation format, the supporting conversational middleware and techniques, and the implementation of a demo able to deliver conversational web browsing experiences through Amazon Alexa.
Keywords: Non-visual browsing · Conversational browsing · Chatbots
(This is a post-peer-review, pre-copyedit version of an article accepted to the 29th International Conference on Conceptual Modeling, ER 2020.)

Conversational agents are emerging as an exciting new platform for accessing online services, promising a more natural and accessible interaction paradigm. They have shown great potential for regular users in hands-free and eyes-free scenarios, but also for making services more accessible to people with disabilities and visual impairments [11], as well as to groups, such as older adults, often challenged by service design choices [9]. This new generation of agents is, however, not able to natively access the Web, requiring web developers and content creators to implement specific "skills" to offer their content and services on Amazon Alexa, Google Assistant and other platforms. This requirement represents a huge barrier for developers and creators who might not have the skills or resources to invest, and a missed opportunity for making the Web accessible to everyone.

Integrating conversational capabilities into software-enabled services is an emerging research topic [3], as pushed by recent works by Castaldo et al. [5] on inferring bots directly from database schemas, by Yaghoub-Zadeh-Fard et al. [13] on deriving bots from APIs, and by Ripa et al. [12] on generating informational bots out of website content. While these works are facilitating chatbot integration at different levels of the Web architecture, they do not address the challenges of generating chatbots from both content and functionality available in websites.

In this paper, we take a software engineering approach and study how to enable conversational browsing of websites equipped with purposefully designed annotations. This represents the first step towards our vision [2] of enabling users to access the content and services accessible through rendered UIs by "talking to websites" instead of by operating the graphical UI using keyboard and mouse. We start with an annotation-driven approach, as the focus is to lay the foundation for conversational browsing and to identify all necessary conversational features and technical solutions, which can then lead to the development of support tools and automatic approaches. In doing so, we make the following contributions:

– a conceptual vocabulary for augmenting websites with conversational capabilities, able to describe domain knowledge (content and functionality) while abstracting interaction knowledge (enacting low-level interactions with sites);
– an approach, architecture and techniques for generating a chatbot out of a website equipped with simple, bot-specific HTML annotations;
– a prototype implementation and a technical feasibility study of the proposed automatic chatbot generation approach.

In the following we describe a concrete target scenario, the overall approach, and the prototype implementation.
We describe our target scenario by illustrating the interactions of a user browsing a typical research project website using a smart speaker such as Amazon Echo (Figure 1). After the user requests access to the research project website, a conversational agent tailored to the website content, functionality and domain knowledge is automatically generated to mediate the interactions between the user and the target website. During these interactions, (i) the user is informed of the available features, (ii) the user can browse the website in dialog-based natural language interactions with the agent, and (iii) the agent identifies and performs the appropriate web browsing actions on the target website on behalf of the user.

Before diving into the requirements posed by the envisioned scenario, we need to introduce some concepts related to chatbot development, with a focus on task-oriented chatbots. Modern task-oriented chatbots are built on a frame-based architecture, which relies on a domain ontology (composed of frames, slots and values) that specifies the types of user intentions the system can recognize and respond to [8]. Intents refer to the tasks requested by the user, and actions to the specific operations performed by the chatbot to serve the intent. Identifying user intents given a request in natural language (e.g., "Tell me about Florian Daniel") requires a natural language processing component trained with
a dataset of examples (e.g., researcher info: ["Tell me about @researcher", "Who is @researcher?", ...]) to correctly classify the request and infer the slots and values (e.g., intent: researcher info, researcher: "Florian Daniel"). Then the dialog management component decides on the appropriate action (e.g., parse the associated DOM element) based on the intent, the input provided and the conversation context. A response is generated using a natural language generation component that elaborates the results and presents them in a format that fits the conversation medium (refer to [8] for more on chatbot design and architecture).

[Fig. 1: Conversational browsing scenario: the user talks to a bot, not to the website. The figure shows a natural language dialog flowing through the Alexa Voice Service to a conversational browsing server (Rasa + Python), where a bot manager (REST API) applies a selection policy over an automatically generated application-specific bot and pre-canned element-specific bots; a parser & generator (Chatito) produces intents, utterances and training data from training templates, a dialog manager uses answer templates, and a headless browser (Selenium) performs the automated web browsing actions on the annotated website.]

Having introduced the scenario and main concepts, we refine some key requirements to enabling conversational browsing as identified earlier [2]:

R1 Orientation: The bot must be able to summarize the content and/or functionalities offered by the website, to guide users through the site's offers at any point, and to provide basic access structures (e.g., "In this site you can ...").

R2 Inferring intents and parameters: The bot must be able to understand the user's intent and enact suitable actions in response. Intents may be application-agnostic (e.g., fill a form field) or application-specific (e.g., post a new paper). The latter requires the bot to infer the intents from the website.

R3 Training and vocabulary: The bot should be able to speak and understand the language of the target website, so as to identify intents and elaborate proper responses. This requires deriving domain knowledge directly from the website and training the bot to identify application-specific intents.

R4 Browsing actions enactment: As the bot mediates between the user and the website, enacting an action in response to an identified intent requires a strategy for translating high-level user requests into automated low-level interactions with the website.

R5 Dialog control from rendered UIs: As the user browses the website conversationally, the chatbot should track the state of the dialog and choose dialog actions considering the evolving state of the rendered UI. That is, it should consider the conversation context as well as the browsing context.
The approach illustrated in Figure 1 is based on three main ingredients: (i) purposefully designed bot annotations, (ii) a middleware comprised of chatbot generation and run-time units, and (iii) a medium-specific conversational interface. Web developers enable conversational access by augmenting their websites with bot-specific annotations, which associate knowledge about how to generate a conversational agent with specific HTML constructs. Initiating a conversational browsing session then triggers the chatbot generation process. This process is about generating an application-specific bot tailored to the intents and domain knowledge of the target website, while reusing a library of generic element-specific bots. Using a conversational interface (e.g., Amazon Echo), the user can start a dialog with the website. At run-time, the middleware processes the user's requests in natural language, selects the relevant bot and executes the appropriate actions on the rendered GUI of the website.

Supporting conversational browsing is not trivial and requires weighing several options. The most important decisions that resulted in our solution are:

– Domain vs. interaction knowledge: Using a website generally requires the user to master two types of knowledge: domain knowledge (to understand content and functionalities) and interaction knowledge (to use and operate the site). This distinction is powerful to separate concerns in conversational browsing. Domain knowledge, e.g., about the research project and scientific publications, must be provided by the developer, as this varies from site to site. Interaction knowledge, e.g., how to fill a form or read text paragraph by paragraph, can be pre-canned and reused across multiple sites. We thus distinguish between an application-specific bot and a set of element-specific bots [R1, R2]. The former masters the domain; the latter enable the user to interact with specific content elements like lists, text, tables, forms, etc.

– Modularization: Incidentally, the distinction between application- and element-specific bots represents an excellent opportunity for modularization and reuse. Application-specific bots must be generated for each site anew [R3]; element-specific bots can be implemented and trained once and reused multiple times. They can be implemented for specific HTML elements, such as a form, or for a very specific version thereof, e.g., a login form. However, the presence of application- and element-specific bots introduces the need for a suitable bot selection logic.

– Bot selection: As a user may provide as input any possible utterance at any instant of time, referring to either application-specific or element-specific intents, it is not possible to pre-define conversational paths through a website. Instead, some form of random access must be supported. We introduce for this purpose a so-called bot manager, which takes as input the utterance and forwards it to the bots registered in the system [R5]. Depending on the context (e.g., the last used bot) and the confidence provided by each invoked bot, it then decides which bot is most likely to provide the correct answer [R1, R2]. Thanks to the bot manager, the ensemble of application-specific and element-specific bots presents itself as one single bot to the user.
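To illustrate the separation of concerns, the following sketch mocks a reusable element-specific bot behind a minimal interface. The class and method names are our own invention for illustration; in the actual prototype, element-specific bots are pre-trained Rasa bots, and confidence values come from their NLU models rather than from keyword matching.

```python
# Minimal sketch of an element-specific bot (hypothetical interface).
# Each element-specific bot is implemented once and reused across sites;
# it reports a confidence score so the bot manager can pick the best handler.

from abc import ABC, abstractmethod

class ElementBot(ABC):
    @abstractmethod
    def confidence(self, utterance: str) -> float:
        """How confident the bot is that it understands the utterance (0..1)."""

    @abstractmethod
    def handle(self, utterance: str, element_html: str) -> str:
        """Perform the element-specific action and return the answer text."""

class ListBot(ElementBot):
    """Pre-canned bot for <ul>/<ol> elements: count, read, navigate items."""

    def confidence(self, utterance: str) -> float:
        keywords = ("how many", "list", "read", "next", "previous")
        return 1.0 if any(k in utterance.lower() for k in keywords) else 0.0

    def handle(self, utterance: str, element_html: str) -> str:
        # Naive <li> extraction, sufficient for the sketch.
        items = [seg.split("</li>")[0] for seg in element_html.split("<li>")[1:]]
        if "how many" in utterance.lower():
            return f"There are {len(items)} items."
        return "These are the items: " + ", ".join(items)
```

A TextBot, TableBot and FormBot would implement the same interface, which is what makes element-specific bots reusable across sites regardless of the domain vocabulary.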
The goal of the work presented in this paper is to avoid asking developers to provide full-fledged chatbots for their websites in order to support conversational browsing. The challenge is asking them to provide as little information as possible – the annotation – such that, together with the content and functionality that are already in the site (its GUI), it is possible to automatically generate a chatbot.
Conceptual model.
Let us start by introducing the key concepts that enable conversational browsing. Figure 2 uses an intuitive, graphical notation to contextualize them in a model of a simple website about a research project, e.g., our project on conversational browsing. The site consists of a set of pages, of which the model ignores the actual content; the design of such content has traditionally been approached by modeling languages like WebML [6] or IFML [4]. Instead, the model hypothesizes a conceptual vocabulary that could extend the pages, subsuming the presence of suitable content. We identified these concepts through a literature and systems review and prototyping efforts:

– Intents: These are the core ingredients of conversational browsing. Intents annotate HTML constructs and thereby qualify their contents as relevant for the enactment of the intents' actions [R2]. More importantly, intents enable the user to access content and functionality. We distinguish three types:

• Selection intents identify HTML constructs the developer wants to make accessible through the chatbot. In order to guide the user inside complex pages, selection intents can be structured hierarchically, which tells the bot to read out options at different levels of detail.

• Link intents enable the user to navigate among the pages of the site. Each navigation may reset the context of the conversation and prompt the bot to inform the user of the new intents available.

• Built-in intents are the intents that the framework comes with in order to support basic interactions, such as orienting the user inside a page by proactively telling him/her which options are available (e.g., "What is the page about?") [R1]. Built-in intents do not require any annotation.

– Conversational links: These are the counterpart of hyperlinks in conversational browsing and tell link intents their target [R4].
Similar to conventional links, we distinguish two types of conversational links (note that we do not want to introduce a new modeling notation for conversational browsing; Figure 2 serves an intuitive, illustrative purpose only):
[Fig. 2: Informal graphical model of a project website explaining the core concepts of application-specific, conversational browsing. Labels in italics define the used graphical notation; gray-shaded intents are copied from the Home page. The model shows the pages Home, Project progress, About us, Login/Register and Personal Home, their link, selection and form intents (e.g., LatestPaper with Title, Topics and Abstract; ListOfPapers; FormLogin; TextInstructions; LinkLogout), contextual and non-contextual conversational links, target intent selectors, and synonym specifications for training data generation (e.g., LatestArticle: "latest article, latest news, last article, last one").]

• Non-contextual conversational links are links that can be navigated with the help from the bot and result in the loading and rendering of a new page, causing the bot to start a new browsing context. That is, each page accessed through a non-contextual link causes the bot to inform the user about the content of the page [R1]. For example, Login follows a non-contextual link to a new page (with a different menu of options), triggering the bot to inform of the available options (Instructions, Login).

• Contextual conversational links are links that are directed not only toward a new page but also toward a specific target intent. If a user thus accesses a page through a contextual link, the bot will immediately start performing the action associated with the target intent [R5]; e.g., About (contextual link) will trigger AuthorBios (reading the associated text).

– Bot types: If a selection intent identifies the HTML construct to act upon, i.e., if it cannot be further split into sub-intents (e.g., LatestPaper → Title), the type of element-specific bot able to perform the expected action can be specified (Title: Text). As explained earlier, the number of element-specific bots is theoretically unlimited, but we identify the need for a minimum set of element-specific bots able to manage the following content elements [R2]:

• Text, i.e., text organized into headings, sub-headings and paragraphs. Element-specific actions are reading out loud the full text, reading the titles only, jumping back and forth among paragraphs, etc.

• List, i.e., an ordered or unordered list of items. Element-specific actions are telling the number of items, reading them out, navigating them, etc.

• Table, i.e., content organized in rows and columns. Element-specific actions are reading by cells, navigating by rows, reading by columns, etc.

• Form, i.e., input fields grouped together and accompanied by a submission button. Element-specific actions are telling which inputs are required, filling individual fields, confirming inputs, submitting, etc.

– Domain vocabulary: It is necessary to equip all intents in the website with their domain-specific vocabulary. This can be achieved by accompanying intents with labels and synonyms that can be used to generate combinations of phrases and to train the application-specific bot [R3], for instance, the intent LatestPaper with the words "latest paper, recent paper" or similar.

– Intent description: Intent descriptions are simple textual explanations that the bot can use to tell the user which intents a given page supports. For instance, the LatestPaper intent could be described using the words "tell you about the last paper published by the project" [R1].

Given a website, it is important to note how the sensible selection of which HTML constructs to annotate and how to connect them with conversational links allows the developer to construct pre-defined dialog flows guiding the user through the content and functionalities published by a website [R5].
Annotation format. Annotating a website now means associating conversational knowledge (knowledge about how to generate a conversational agent) with specific HTML constructs in a page. The cues for the generation of the agent come in the form of HTML attributes and developer-provided values. Informed by the conceptual model, the concrete attributes for the generation of application-specific bots are highlighted in Figure 3. The figure provides a practical example of the use of these attributes, and of one element-specific attribute, bot-attribute, which identifies element-specific content types that the respective element-specific bot can understand. While some annotations may seem redundant (e.g., they could be derived from HTML tags), developers do not always follow the semantics of HTML tags. For instance, one of the most used tags today is the <div> tag, which lacks semantics. Explicit annotations also allow developers to indicate which elements to expose to the chatbot. As research progresses, we intend to maintain an up-to-date version of the annotation format on GitHub and to improve it with the help of the community. Please refer to https://github.com/floriandanielit/conversationalweb.
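As an illustration, the annotations can be harvested with a few lines of code. The sketch below applies Python's standard html.parser to a small annotated fragment; the attribute names (bot-intent, bot-desc, bot-keys, bot-type) follow the format above, while the element structure and attribute values are made up for the example. The actual prototype parses the rendered DOM through a headless browser instead.

```python
# Sketch: extracting bot-* annotations from an annotated page with the
# Python standard library (illustrative; the prototype uses Selenium).
from html.parser import HTMLParser

ANNOTATED_PAGE = """
<div bot-intent="LatestPaper" bot-desc="tell you about the last paper"
     bot-keys="latest paper, recent paper">
  <h2 bot-intent="Title" bot-type="Text">Automatic Generation of Chatbots...</h2>
  <p bot-intent="Topics" bot-type="Text">Non-visual browsing, chatbots...</p>
</div>
"""

class AnnotationParser(HTMLParser):
    """Collects every element carrying a bot-intent annotation."""

    def __init__(self):
        super().__init__()
        self.intents = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "bot-intent" in attrs:
            self.intents.append({
                "intent": attrs["bot-intent"],
                "desc": attrs.get("bot-desc"),
                "keys": [k.strip() for k in attrs.get("bot-keys", "").split(",")
                         if k.strip()],
                "type": attrs.get("bot-type"),  # element-specific bot to use
            })

parser = AnnotationParser()
parser.feed(ANNOTATED_PAGE)
```

The nesting of annotated elements (LatestPaper containing Title and Topics) is what later yields the hierarchical structure of the conversational context model.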
The generation process can be divided into two phases: (i) the generation of the application-specific training data and the training of the NLU (natural language understanding) component, and (ii) the generation of a suitable conversational context model to enable the bot manager to manage the dialog. The generation of the application-specific training data follows the steps highlighted in Figure 1 using circled numbers: the headless browser loads the current page of the website and builds its DOM ①; the parser and generator extracts the intent identifiers and the list of intent synonyms ② and generates a dataset of utterances for training ③; the NLU uses the dataset to learn the intents and the application-specific vocabulary ④.

[Fig. 3: Simplified code excerpt of the <body> of the Home page in Figure 2 with annotations for conversational browsing. Application-specific annotations enable navigation and content access; element-specific ones instruct the Text Reader.] The attributes shown in the figure are:

• bot-intent: associates a page-wide unique intent identifier to the HTML construct holding it.
• bot-desc: provides a text explanation that the bot can use to inform the user about the meaning of the intent.
• bot-keys: specifies a comma-separated list of synonyms as alternative names of the intent.
• bot-type: specifies the type of element-specific bot to use to process the HTML construct's internal HTML markup (e.g., Text Reader).

The conversational context model is generated by the parser and generator once the NLU is successfully trained. It consists of a tree representation of the intents contained in the current page: CT = ⟨N, C⟩, where N is the set of nodes, each node representing one application-specific intent in the page, and C ⊆ N × N is the set of non-cyclic, directed child relationships of the tree. Each node n ∈ N, n = ⟨intent, type, desc, keys, elem, link⟩, contains the identifier, type, description and keywords of the respective intent, the HTML element it is associated with, and the possible conversational link in case the intent is a link intent. The root node r ∈ N represents the information intent associated with the <body> element of the current page. Intermediate nodes represent access intents with sub-intents; leaf nodes (nodes without children) represent intents to be processed using a given type of element-specific bot. (The tree is a result of the hierarchical organization enabled by selection intents, e.g., LatestPaper → Title (Text Reader).)

The bot manager now uses the so-constructed context model to decide which bot to choose to advance the conversation with the user. The proposed policy works as follows: as the user provides input, the bot manager checks if the last used bot (the current bot) is able to understand the input, i.e., if it is able to identify an intent with a confidence that exceeds a given threshold τ. If yes, the respective answer is forwarded to the user; otherwise, it forwards the input to all direct children of the current bot, and recursively to the sub-children if none is successful. If any of them is able to identify an intent with sufficient confidence, that bot becomes the new current bot and its answer is forwarded to the user. If the current bot corresponds to a leaf node and is not able to understand the user input, it escalates the input to upper levels until there is a higher-level bot able to understand the input or the escalation reaches the root node. If none is able to understand the input, the user is asked to reformulate his/her request.
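The selection policy can be sketched as a walk over the context tree. The sketch below is a simplified model with hypothetical names: confidences come from a toy keyword function, whereas in the prototype they are the intent-classification confidences reported by the Rasa bots. Unlike the full policy described above, the escalation step here only consults ancestor bots and does not re-explore their subtrees.

```python
# Sketch of the bot manager's selection policy over the context tree.

TAU = 0.6  # confidence threshold

class Node:
    """One application-specific intent in the page's context tree."""
    def __init__(self, intent, confidence_fn, children=None):
        self.intent = intent
        self.confidence_fn = confidence_fn  # utterance -> confidence in [0, 1]
        self.children = children or []
        self.parent = None
        for child in self.children:
            child.parent = self

def find_in_subtree(node, utterance):
    """Depth-first search for a descendant able to understand the utterance."""
    for child in node.children:
        if child.confidence_fn(utterance) >= TAU:
            return child
        hit = find_in_subtree(child, utterance)
        if hit is not None:
            return hit
    return None

def select_bot(current, utterance):
    """Try the current bot, then its subtree, then escalate toward the root."""
    if current.confidence_fn(utterance) >= TAU:
        return current
    hit = find_in_subtree(current, utterance)
    if hit is not None:
        return hit
    ancestor = current.parent
    while ancestor is not None:
        if ancestor.confidence_fn(utterance) >= TAU:
            return ancestor
        ancestor = ancestor.parent
    return None  # no bot understood: ask the user to reformulate

# Example context tree for the Home page of the scenario website.
def kw(*words):
    return lambda u: 1.0 if any(w in u.lower() for w in words) else 0.0

title = Node("Title", kw("title"))
topics = Node("Topics", kw("topic"))
latest = Node("LatestPaper", kw("latest paper"), [title, topics])
root = Node("Home", kw("home", "this site"), [latest])
```

With this tree, "Tell me about the latest paper" selects the LatestPaper bot from the root, while a follow-up "What are its topics?" descends to the Topics leaf; an unrelated request from a leaf escalates back toward the root.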
The conversational browsing infrastructure outlined in Figure 1 has been implemented making use of ready technologies: Alexa Voice Service for voice-to-text conversion, Rasa NLU (https://rasa.com/) for natural language understanding, Selenium (https://selenium.dev/) as headless browser integrated with Mozilla Firefox, and Chatito (https://github.com/rodrigopivi/Chatito) for the generation of training data. Custom integration and chatbot code were written in Python. For the tests with Alexa, the infrastructure was deployed on Heroku.

While the training phase of the chatbot could be done once for the entire site, in our current prototype we opted for a page-by-page training, in order to support dynamically generated pages. As the focus of the prototype was technical feasibility, it is not yet optimized for performance. However, tests on a local machine (Omen by HP 15-DH0, Intel Core i7, 16 GB of RAM, SSD hard drive, Win10 64bit) show that page loading and rendering, training data generation and bot training require up to a few seconds, an acceptable performance for some scenarios. Fetching pages from the Web adds additional overhead. The construction of the context model is negligible in terms of execution time.

The element-specific bots of the prototype are custom Rasa bots with pre-defined intents, actions and NLU models. Demo videos illustrating the components of the approach can be found at https://bit.ly/2OckzZW.
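The training-data generation step can be approximated by a simple template expansion in the style of Chatito. This is a hedged sketch: the template phrasings are invented for illustration, and the actual Chatito grammar and Rasa training format differ.

```python
# Sketch: generating NLU training utterances by combining generic
# templates with the intent synonyms harvested from bot-keys annotations.
from itertools import product

TEMPLATES = ["tell me about the {key}", "what is the {key}?", "read the {key}"]

def generate_training_data(intent_synonyms):
    """Map each intent to the utterances produced by template x synonym."""
    dataset = {}
    for intent, synonyms in intent_synonyms.items():
        dataset[intent] = [t.format(key=s) for t, s in product(TEMPLATES, synonyms)]
    return dataset

data = generate_training_data({"LatestPaper": ["latest paper", "recent paper"]})
```

Each intent thus contributes |templates| x |synonyms| labeled utterances to the dataset the NLU is trained on, which is why annotating synonyms in bot-keys directly improves intent recognition.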
The problem of non-visual web browsing has produced two main approaches: markup-based approaches such as VoiceXML [10], and voice-enabled screen readers integrated into web browsers [1]. VoiceXML [10] is a W3C markup language for voice applications typically accessed using a phone. Applications are stand-alone and could complement websites, but there is no native integration of the two. Voice-based screen readers (e.g., [1]) aim at lowering the complexity of managing shortcuts in navigating with screen readers, enabling users to utter browsing commands in natural language ("press the cart button"). While valuable, these approaches were developed to support desktop web browsing: they require users to be aware of the layout of the pages and perform low-level, step-by-step interactions, or to create macros to automate tasks.

As for chatbot development, general platforms and tools support the development of stand-alone chatbots (e.g., DialogFlow, Instabot.io). Another approach is that of deriving chatbots directly from database schemas, API definitions and web content. Prominent works in this regard are the ones by Castaldo et al. [5], exploring the idea of conversational data exploration by inferring a chatbot directly from an annotated database schema, and by Yaghoub-Zadeh-Fard et al. [13], generating a conversational interface directly from API specifications (e.g., OpenAPI). Website content has also been used for chatbot generation. Popular in e-commerce and CRM, approaches such as SuperAgent [7] can generate conversational FAQs from website content and offer them to visitors directly on the website. Ripa et al. [12] focus on making informational queries over content-intensive websites accessible via voice-based interfaces (e.g., smart speakers), relying on augmentations provided by end-users. While all these works illustrate the diversity of approaches, they require either (bot) programming knowledge (and effort), are constrained by an application domain, or are limited to Q&A.
This paper contributes abstractions, techniques and a conceptual vocabulary for superimposing conversational bots over websites. These contributions, along with the software infrastructure, enable the (semi-)automatic generation of chatbots directly from websites, and can be leveraged by authoring tools to enable developers, even without chatbot skills, to obtain chatbots effectively and efficiently. The solution presented is a proof-of-concept implementation not optimized for large applications, and thus presents points for improvement that are the focus of our ongoing work. As a next step, we will carry out user studies with different types of target users (end users and developers) and derive guidelines for conversational browsing. We are also already studying how to use machine learning and AI along with existing Web technical specifications (e.g., HTML5) to replace some explicit annotations by automatic recognition.
References