A Maturity Assessment Framework for Conversational AI Development Platforms
Johan Aronsson, Philip Lu
Chalmers | University of Gothenburg, Sweden
convaistudy@easelab.org

Daniel Strüber
Radboud University, Nijmegen, Netherlands
d.strueber@cs.ru.nl

Thorsten Berger
Ruhr University Bochum, Germany; Chalmers | University of Gothenburg, Sweden
thorsten.berger@rub.de

ABSTRACT
Conversational Artificial Intelligence (AI) systems have recently skyrocketed in popularity and are now used in many applications, from car assistants to customer support. The development of conversational AI systems is supported by a large variety of software platforms, all with similar goals, but different focus points and functionalities. A systematic foundation for classifying conversational AI platforms is currently lacking. We propose a framework for assessing the maturity level of conversational AI development platforms. Our framework is based on a systematic literature review, in which we extracted common and distinguishing features of various open-source and commercial (or in-house) platforms. Inspired by language reference frameworks, we identify different maturity levels that a conversational AI development platform may exhibit in understanding and responding to user inputs. Our framework can guide organizations in selecting a conversational AI development platform according to their needs, as well as help researchers and platform developers improve the maturity of their platforms.
CCS CONCEPTS
• Human-centered computing → Interaction techniques

KEYWORDS
conversational AI; software platforms; assessment framework
ACM Reference Format:
Johan Aronsson, Philip Lu, Daniel Strüber, and Thorsten Berger. 2021. A Maturity Assessment Framework for Conversational AI Development Platforms. In The 36th ACM/SIGAPP Symposium on Applied Computing (SAC '21), March 22–26, 2021, Virtual Event, Republic of Korea. ACM, New York, NY, USA, 11 pages. https://doi.org/10...

1 INTRODUCTION

Conversational AI has recently surged in popularity and interest. A conversational AI system is an interface that can communicate and interact with users by relying on the automated processing of questions and formulating answers. In 2016, Facebook announced a new platform to develop chatbots on their messaging application
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SAC '21, March 22–26, 2021, Virtual Event, Republic of Korea
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8104-8/21/03...$15.00
https://doi.org/10...

[21], which simplified the creation of AI chatbots by providing relevant toolkits [2]. Since then, many other companies have implemented chatbots for both text and speech. Three of the most popular conversational AI systems today are Microsoft Cortana, Google Assistant, and Apple's Siri [29]. The advances in deep learning [33] and the advent of powerful language models, most recently GPT-3 [4], pave the way for a new generation of conversational AI systems enabling conversations with human-like qualities.

To support organizations in adopting conversational AI systems, a multitude of development platforms are available. By offering numerous concepts, such as natural language understanding (NLU), webhooks, and contexts, these platforms enable users to engineer systems that provide a rich, ideally almost human-like conversation experience. However, due to the large variety of available platforms, the relevance and need of each individual concept and its impact on the conversation experience is unclear. As a result, the use of such platforms may be overly complicated. To support organizations in selecting a suitable platform, and platform developers in increasing the maturity of their platforms, we need to improve our empirical understanding of the state of the art of the domain. Specifically, we need to understand what platforms exist, what concepts they offer, what their concepts' characteristics are, in what combinations the concepts are used, and, in sum, what level of conversation they enable. Evaluating the conversational maturity that the different platforms offer might help change the perception that these systems are merely task-oriented tools, showing instead that they can hold truly social conversations.
Additionally, it may help in understanding how the more functional terms of these platforms relate to conversational ability [6].

In this paper, we provide a maturity assessment framework for conversational AI development platforms. We provide a comprehensive overview of the features available in today's platforms, and analyze these features to see how they relate to the quality and ability of conversational AI systems produced using them. Finally, inspired by human language development frameworks, we propose a layered framework with multiple levels of conversational maturity. With this contribution, we aim to improve our empirical understanding of current development platforms for creating conversational AI systems, their concepts, and the level of conversation that bots created with these systems can achieve. As a benchmark for assessing existing and new platforms, our framework can support and guide practitioners who engineer such platforms. Moreover, it can help researchers understand the concepts that exist, identify gaps between practice and research, and scope future research. In the long term, this could help in creating better conversational AI systems.
We address the following research questions:
RQ1: What platforms exist for developing conversational AI systems?
RQ2: What are the features of these platforms?
These first two questions are aimed at analyzing existing conversational AI development platforms and extracting information regarding their usage and features. A specific focus is on the ability of the platforms to model conversation dialogs. To this end, we performed a literature study, in which we collected papers presenting different platforms. We then analyzed the documentation of the platforms to identify their distinguishing characteristics and concepts (features). To provide an intuitive, hierarchical overview of the multitude of available features, we grouped them into a feature model [8, 13], a common notation for modeling the variability of portfolios of software systems [23] in a domain. Feature models have also become popular in empirical studies for modeling the design space of technologies, such as model transformations [9] or language-engineering workbenches [10].
RQ3: What are the levels of conversational maturity supported by the identified platforms?
We created a framework that can be used to evaluate the conversational maturity (intuitively, how "smart" an agent is in understanding questions and formulating responses) offered by the platforms. To this end, we considered existing frameworks that evaluate the language proficiency of humans, and previous discussions on how to evaluate different conversational AI development platforms. We then devised a framework based on the features identified in the first two research questions and their effect on the human-like performance of a conversational AI development platform.

There are few studies on the conversational maturity that different conversational AI development platforms offer. The most closely related work is the study by Venkatesh et al. [31], who describe how to evaluate the performance of conversational agents in terms of certain metrics. In contrast, our work focuses on recently available platforms and on how their features impact the conversational maturity of systems created upon these platforms (cf. Section 2 for a discussion of related work).
2 BACKGROUND AND RELATED WORK

With the recent developments in many of the subfields of conversational AI, including machine learning, dialog management, and NLU, many different conversational AI systems have emerged [22]. In industry, this technology has been incorporated into search engines, mobile devices, and personal computers. In search engines such as Google and Bing, conversational AI is used to create the feeling of having a conversation with the search engine, enhancing the experience. In mobile devices and personal computers, one use of conversational AI is to create virtual assistants. Some of the biggest virtual assistants on the market today are Apple's Siri, Google Assistant, Amazon Alexa, and Microsoft Cortana [14]. These assistants also have the capability of acting as chatbots, keeping a turn-based dialog (a dialog where the user and the bot take turns in asking and responding to queries) with the user. There also exist conversational interfaces that focus only on this type of dialog-based conversation, such as XiaoIce [11] and Replika [11]. These dialogs use what is known within conversational AI as intents and entities to understand the user's goal behind the query. In other words, an intent is what the user wants to achieve with the query, and an entity is the key information for answering the intent.

Recently, a number of different platforms have been made available to simplify the creation and integration of conversational interfaces for developers. The most popular ones are: Google's DialogFlow (formerly api.ai), IBM's cloud-based bot service Watson Conversation, Amazon Lex, and the Microsoft Bot Framework. These platforms come equipped with several different technologies used for NLU, dialog management, response generation, and other aspects [5, 11]. Since conversational AI is a new field, systematic approaches to overview and categorize it are still in their infancy.
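To make the intent/entity terminology concrete, the following sketch maps a query to an intent and its entities. It is platform-independent and purely illustrative: the intent names, the regular-expression matching, and the `parse` helper are our own simplifications, not taken from any of the surveyed platforms, which use trained NLU models instead.

```python
import re

# Hypothetical, hand-written intent patterns; real platforms learn
# these mappings with NLU models rather than regular expressions.
INTENT_PATTERNS = {
    "get_weather": re.compile(r"weather .*in (?P<city>[A-Z][a-zA-Z ]+)"),
    "book_table":  re.compile(r"table for (?P<guests>\d+)"),
}

def parse(query: str):
    """Return (intent, entities) for a query, or ("fallback", {})."""
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(query)
        if match:
            # Entities are the named groups captured by the pattern.
            return intent, match.groupdict()
    return "fallback", {}

intent, entities = parse("What is the weather like in New York?")
print(intent, entities)  # get_weather {'city': 'New York'}
```

Here the intent (`get_weather`) captures what the user wants to achieve, while the entity (`city = New York`) is the key information needed to answer it.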
Patil et al. [24] make a general comparison of features and functionalities between some of the commercial platforms, giving an overview of which platform one might choose for developing a conversational AI system. There have also been more specific studies that compare the NLU and conversational abilities of these types of platforms. Canonico and De Russis [18] compare the NLU performance of these platforms in terms of usability, pre-built intents (a number of intents already existing in the NLU tool), context, etc. McTear [19] describes the two main conversation models, "one-shot queries" and "slot-filling dialogues". He compares different platforms' ability to handle follow-up questions in one-shot query scenarios and their mechanisms for slot-filling (a type of conversation where the bot asks specific questions to fill certain slots to fulfil a user intent). McTear also presents a number of problems that developers may face when creating conversational interfaces with these platforms. One of the main issues is that it might be difficult to know what functionalities a specific platform offers. There is also a difficulty in interpreting which functionalities might be common between platforms, since there is no standard terminology.

Venkatesh et al. [31] describe a number of metrics that can be used to evaluate the overall performance of a conversational agent, based on the annual Alexa Prize competition [25] for furthering conversational AI. They propose metrics such as conversational user experience, engagement, and conversational depth to measure the conversational abilities of entire conversational AI systems or chatbots [31].
Shawar and Atwell [26] describe metrics to specifically evaluate chatbot systems, a type of conversational AI interface. They argue that the evaluation of these systems should be based on the application and its domain, and not solely on a standard.

One of the main issues with creating the metrics described above is understanding what a good conversation is. Clark et al. [6] find that people generally describe conversations with conversational interfaces in terms of their performance and perceive them more as a device to be controlled. This indicates that people have a preconceived notion of how these systems will behave, stemming from the perception that infrastructure to support proper human-to-human-like dialogs does not exist.

The maturity assessment framework presented in this paper takes inspiration from three language proficiency frameworks: the Common European Framework of Reference (CEFR, [7]),
the American Council on the Teaching of Foreign Languages (ACTFL, [27]), and the
Interagency Language Roundtable (ILR, [1]). The goal of these frameworks is to assess the language competency of an individual for a particular language. All of these frameworks have a similar structure, distinguishing different, successive levels (e.g., in the case of CEFR, a six-item scale A1–C2), language-relevant skills (e.g., for CEFR, reading, listening, speaking, and writing), and a number of hints for assigning an individual to a level. While the contents of the frameworks differ, they all share this same basic structure, which we also found useful for inspiring the design of our framework. A number of papers have scientifically investigated these frameworks, studying their validity and the possibility of using them in an automated way [12, 15, 30].

3 METHODOLOGY

In the first part of our study (RQ1), we aimed to explore the variety of existing conversational AI development platforms. To this end, we used several methods, as follows.
We used a systematic literature review to identify papers on conversational AI systems. We focused on papers that present and evaluate platforms used to develop such systems. Specifically, among the different methods that exist for conducting literature reviews, we used snowballing. We followed Wohlin [32], who describes two types of snowballing, forward and backwards, and provides guidelines for performing them. We performed backwards snowballing to find papers describing current conversational AI development platforms. Backwards snowballing involves selecting a number of papers to be used as a start set, and then finding more relevant papers in the same field by tracing the reference lists of these papers. The start set should include a number of different papers from different areas of the field, different authors, and different points in time. The idea is to cover the considered field or topic to the largest possible extent. The reference lists of the papers in the start set are then evaluated based on certain inclusion and exclusion criteria (explained shortly). From the start set, additional papers can be found, which we also screened. Each set of reviewed papers constitutes one iteration of the snowballing procedure; once no more papers can be found, the process is over [32].

We collected the start set for snowballing through database searches, using search strings such as "Conversational AI," "Conversational AI development platforms," and "Chatbot platforms". We provide the full list as supplementary material (made available in our online appendix [28]). The first 50 results for each search string were examined based on the criteria listed below. To determine whether to include or exclude a certain paper, we used the following procedure: read the title and abstract, and skim the whole paper to determine if any relevant platforms can be found. A paper could be excluded at any stage of this process based on its relevance to the study.
The databases used in this search were Google Scholar, IEEExplore, arXiv, SpringerLink, and a university library database. Our inclusion criteria were:

• Papers published after 2000, after which most recent platforms have been developed, were candidates for inclusion.
• Papers examining and presenting different conversational AI/bot platforms were included.
• Papers that only examine characteristics of conversational AI and do not mention any platforms were excluded.

The platforms found through the literature review were then examined to determine whether enough information about them was available to fairly assess what features they provide. To this end, our exclusion criteria were:

• Platforms that are no longer available or heavily outdated (no update within two years at the time of search) were excluded.
• Systems that were just simple chatbots (i.e., responding only to the simplest queries, like "tell me the time") were excluded.
• Platforms that did not have enough documentation available publicly were excluded.
• Platforms that did not have a strong enough user base, either of individuals or companies or both, were excluded.

In addition to conducting snowballing, we consulted with an employee from an industrial partner—a company with years of experience in conversational AI—to find platforms that we might have missed. This made us aware of several additional platforms (detailed below). We conducted our analysis in the summer months of 2019.
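The backwards-snowballing procedure described above can be sketched as a simple worklist algorithm. This is only an illustration: the function name, the `references` map, the `is_relevant` predicate, and the toy citation graph are ours, standing in for the manual screening against the criteria above.

```python
def backwards_snowball(start_set, references, is_relevant):
    """Iteratively screen the reference lists of relevant papers.

    start_set:   initial papers found via database searches
    references:  maps a paper to the papers it cites
    is_relevant: stands in for the inclusion/exclusion criteria
    """
    included = set(start_set)
    frontier = set(start_set)
    while frontier:  # one iteration of the snowballing procedure
        candidates = {ref for paper in frontier
                      for ref in references.get(paper, [])}
        new = {p for p in candidates - included if is_relevant(p)}
        included |= new
        frontier = new  # stop once an iteration finds nothing new
    return included

# Toy citation graph: A cites B and C, B cites D; C is screened out.
refs = {"A": ["B", "C"], "B": ["D"]}
papers = backwards_snowball({"A"}, refs, is_relevant=lambda p: p != "C")
print(sorted(papers))  # ['A', 'B', 'D']
```

The loop terminates exactly as described in the text: an iteration that yields no new relevant papers empties the frontier and ends the process.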
To find platforms outside the more formal channels of published literature and company expertise, we also searched via the Google search engine. This required specific care and source criticism, since the information available may be outdated or even false. We conducted the searches using search terms that try to find platforms similar to those found through the previous methods, for instance, "DialogFlow competitors."
The main process for collecting information about the different conversational AI development platforms was document analysis. Document analysis involves going through any documentation available for a specific entity, such as a software platform. It allows for the collection of data that can later be evaluated and grouped based on certain criteria. Document analysis is often quite efficient and cost-effective, since no new data needs to be acquired; instead, already existing data is evaluated. However, there is a risk that the documentation may be incomplete [3].

We analyzed the documentation available for all considered platforms to identify their common and distinguishing features, thus addressing RQ2. While the entire platforms were analyzed to give an overview of the whole system structure, we put special emphasis on their conversation-defining features. The conversation-defining features build up the dialog management portion of the platforms, which defines what the bot can understand and how it should respond. This process also helped us map similar features whose names vary between different platforms.
To represent the identified features, we developed a feature model [13]. Feature models visualize the features of a platform by displaying them in a hierarchy, thus providing a good overview of top-level and more fine-grained features. Features can be mandatory or optional. In our survey, we refer to common features of the considered platforms as mandatory, and to distinguishing ones as optional. The model also includes constraints between the features, such as dependencies, in which a feature needs another feature for its implementation. There are a few other models that can be used for similar purposes, such as class diagrams. However, we used feature models since they provide a compact, hierarchical overview, which is good for managing complexity in large systems [16], and since feature models have been used in earlier empirical studies [9, 10] on systematizing the features of systems in a particular domain.
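As a minimal illustration of the notation (the class and feature names here are ours, not from a feature-modeling tool or from our model), a feature model with mandatory/optional features and a cross-tree dependency can be represented and checked like this:

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    name: str
    mandatory: bool = False          # common to (almost) all platforms
    children: list = field(default_factory=list)

def valid(selection, root, requires):
    """Check a platform's feature selection against the model.

    requires: cross-tree constraints, pairs (feature, needed feature).
    """
    def tree_ok(feature):
        return all((not child.mandatory or child.name in selection)
                   and tree_ok(child)
                   for child in feature.children)
    cross_ok = all(dep in selection
                   for feat, dep in requires if feat in selection)
    return tree_ok(root) and cross_ok

root = Feature("ConversationalAI", mandatory=True, children=[
    Feature("Conversation", mandatory=True),
    Feature("System", mandatory=True, children=[
        Feature("SpellingCorrection"),  # optional feature
    ]),
])
# SpellingCorrection depends on InputProcessing for its implementation.
requires = [("SpellingCorrection", "InputProcessing")]
print(valid({"Conversation", "System"}, root, requires))  # True
```

A selection missing a mandatory feature, or selecting a feature without its dependency, is rejected, mirroring how feature-model constraints rule out invalid configurations.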
As a prerequisite for creating a maturity framework for conversational AI development platforms, we explored whether any similar attempts had been made before. We performed a literature review to identify any existing frameworks, either directly related to conversational AI classification or used to evaluate the conversational maturity of a human. We searched using Google Scholar, IEEE, arXiv, Springer, and our university's library. The following search phrases were used when looking for these frameworks: "Common language framework," "Human language framework," and "Language framework." More can be found in the online appendix [28]. From these searches, the top 50 results were considered to determine their relevance for this study. We used the following inclusion and exclusion criteria:

• Papers discussing different aspects of what makes good conversational AI were included.
• Papers with frameworks used to evaluate the maturity of either human or bot conversation were included.
• Papers that have metrics for evaluating conversational AI systems were included.

To determine whether a paper matched the criteria above, the following procedure was followed: read the title of the paper; read the abstract of the paper; read the discussion to determine whether any frameworks are presented or characteristics of good conversational AI are mentioned. As mentioned above, a paper could be excluded at any point in the process; if the title was out of scope, the paper was directly excluded.
Our goal was to create a framework that describes a collection of incremental levels of conversational maturity, inspired by the language proficiency frameworks CEFR, ACTFL, and ILR. We used the same structure as these existing frameworks (distinguishing various incremental levels and orthogonal skills), but filled the structure with entirely new contents, tailored to our understanding of conversational maturity. Since a definition of conversational maturity was not available in the literature, we devised our own definition: the ability of a conversational AI system to participate in a human-like conversation. To identify levels, we used the features found through the documentation analysis. For each feature, we decided whether it contributes to conversational maturity according to this definition, and clustered those features that do into distinct, progressive levels.
4 RESULTS

We present the results from our literature review, the documentation analysis performed on the identified conversational AI development platforms, and the obtained feature model with common and distinguishing features of these platforms.
The database searches resulted in 10 sources, which were used as the start set for snowballing. The references of each of these papers were then screened to find any other papers relevant for the purpose of finding conversational AI development platforms. The snowballing was ended after three iterations of this procedure. In the first iteration, based on the start set, 13 additional papers were added. The list of potential candidates was narrowed down using the following procedure: read the title of the paper; check where the paper is referenced in the text; read the abstract of the paper; look at the full text to determine whether it contains any new conversational platforms. The place of reference was checked to determine whether it was used in conjunction with text that describes conversational platforms. All papers were matched against the same criteria used to put together the start set, see Section 3. The second iteration of the snowballing procedure was performed on the 13 newly found papers. From these papers, another 3 were identified that describe conversational AI development platforms. The third and last iteration identified no additional relevant papers.

Using snowballing, we identified a total of 56 different potential conversational AI development platforms. From these 56 platforms, we removed a number of duplicates arising from the same system appearing under different names: API.ai was renamed to DialogFlow, and IBM voice server and AT&T Watson were the predecessors of IBM Watson Conversation. We excluded the conversational interfaces Cortana, Google Assistant, and Amazon Alexa, as they are not actual development platforms: Cortana is developed by Microsoft, which makes its technology available through the Microsoft Bot Framework; Google Assistant is closely related to DialogFlow, and Amazon Alexa to Amazon Lex. The remaining platforms were matched against the inclusion and exclusion criteria mentioned in Section 3.
These criteria were used to narrow down the set to a total of 10 platforms: DialogFlow, Microsoft Bot Framework, Houndify, RASA, Amazon Lex, IBM Watson Conversation, VoiceXML, Recast.ai, Kore.ai, and AIML.

Consulting with an employee of our partner company resulted in the addition of three new platforms that had not yet been examined, as well as the confirmation that the platforms found in the literature review match many of the platforms that had been found by the company. The three new platforms are: Teneo, Boost.ai, and TDM. Teneo and Boost.ai were not included in further investigations, as they lacked sufficient documentation of their contained features.

Lastly, we used the Google search engine to identify potentially missed conversational AI development platforms. Our search brought forward three new platforms that had emerged quite recently: Meya, Chatbot, and Botpress. Chatbot and Botpress lacked available documentation supporting a fair assessment of the available functionality. For this reason, both of these platforms were excluded. Meya was included, since its documentation was extensive enough to form a full image of its features.

We finally obtained a list of twelve platforms to further analyze. Table 1 shows these platforms, together with their core characteristics: open- vs. closed-source, (semi-)commercial vs. free, and web-based vs. command-line vs. implementation-dependent. A platform is semi-commercial if it has both free and paid variants, where the free variant typically has an upper cap (for example, 10,000 messages per month in the Microsoft Bot Framework).

Table 1: Conversational AI development platforms

Platform                   Source   Availability   Modality
DialogFlow                 closed   commercial     web-based
Meya.ai                    closed   commercial     web-based
Microsoft Bot Framework    closed   semi-comm.     web-based
Houndify                   closed   semi-comm.     web-based
Amazon Lex                 closed   commercial     web-based
RASA                       open     free           command-line
IBM Watson Conversation    closed   semi-comm.     web-based
VoiceXML                   open     free           impl.-dependent
Recast.ai                  closed   semi-comm.     web-based
Kore.ai                    closed   semi-comm.     web-based
AIML                       closed   free           impl.-dependent
TDM                        closed   commercial     command-line

URLs: https://cloud.google.com/dialogflow, https://meya.ai/, https://botframework.com/, https://houndify.com/, https://aws.amazon.com/lex/, https://rasa.com/, https://ibm.com/cloud/watson-assistant/, https://w3.org/Voice/, https://recast.ai, http://aiml.foundation/, http://talkamatic.se/
We thoroughly analyzed the documentation of each platform to pinpoint its included functionality, resulting in a list of features. These features were then reviewed to identify common and distinguishing features between the different platforms. If the same feature was found in multiple platforms under different names, we continued with the name used more often in the platforms and existing literature. The consolidated features were added to a feature matrix (made available in our online appendix [28]). The end result was a list of 54 different features, which we grouped and organized in a hierarchy to obtain a feature model, discussed in what follows.
Figure 1 shows a high-level overview of our feature model, highlighting its four top-level feature groups: System, Conversation, Input modalities, and Output modalities. These top-level feature groups and their contained features are discussed in this section. The full list of all features with their descriptions is made available in our online appendix [28].

Figure 1: Top level view of the feature model

We use the standard syntax of feature models. Specifically, as shown in the legend, features can be marked as mandatory, meaning that they exist in all or most of the analyzed platforms, or optional, meaning that they only exist in some platforms. Numbers attached to a node indicate that the node represents a collapsed sub-tree, with the specified number of total nodes in the sub-tree. Abstract features are used for grouping purposes. "Or" feature groups specify groups in which each considered platform had at least one of the group's features. In what follows, we describe the most crucial features, including those deemed particularly relevant for assessing conversational maturity.

Figure 2: Main system features
Among the System features shown in Figure 2, Content refers to the different conversation contents and how the platform handles them. A conversation content can be, for example, "phoning a friend" or "weather information". Figure 2 shows two crucial sub-features of content: ContentCatalogs refers to platforms with built-in content catalogs that simplify the development of the conversational AI bot; these catalogs contain entities and intents that are common in the domain. MultipleConversationDomains distinguishes platforms that support the handling of domains that are completely independent from one another, making it possible to have two different content domains in the same conversation. This is in contrast to most platforms, which support only one particular domain with no explicit separation from other domains. MultiLanguage is the feature that concerns the number of languages supported within the platform.

Development features concern the development process of systems using the platform, which can be supported by features such as error feedback, debugging, and versioning tools. InputProcessing refers to features for processing the user input, such as SpellingCorrection. The different Interfaces supported may include a custom frontend, integration with social media, and other websites.
Figure 3: Conversational features
Figure 4: Conversational output features
One of the main reasons why there are so many different conversational AI development platforms is that most handle conversations differently from one another. These differences can concern anything from the content of the conversation to the dialog management of the platform. The platforms treat conversations differently depending on the intended use and the area of use. Many of the platforms on the market are focused on one specific field of expertise and are customized to fit the needs and standards of this field. An overview of these features and feature categories can be seen in Fig. 3. These features all concern the conversation between the agent and a user, everything from processing to supported conversation types. Language-specific features are also in the Conversation category, since the language is a part of the conversation.
LanguageSeparation is the ability of the system to separate language-specific and non-language-specific information in a sentence. This allows the system to identify which parts of the sentence are crucial for the information to be conveyed and which parts are not. This feature simplifies the translation of a conversation and the multilingual maintenance of the system.
Conversational output features, depicted in Fig. 4, affect how the conversational output is processed. DialogInitiation is one way to do so: it allows the developer to instigate a conversation. Different companies have different Policies and rules to adhere to, so some platforms allow such policies and rules to be implemented within the system itself. Sentiments allows the conversational AI to display emotions in its responses.
Clarification is something that all the analyzed platforms support in some way, since not misunderstanding the user is crucial for the robustness of the system. As summarized in Fig. 5, this is done by using Affirmation, Rephrasing, or FallbackActions to confirm the user's intent. These features affect how the system reacts when a user input is not understood or when a user input can be assigned to two different intents. They also give the user a second chance to change their mind or their query.

Figure 5: Clarification features

Figure 6: Conversation types
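The interplay of these clarification features can be sketched as follows. This is a hypothetical illustration under assumed names (`clarify`, `CONFIDENCE_THRESHOLD`), not an API of any analyzed platform:

```python
# Hypothetical sketch of clarification handling: an affirmation prompt
# when the intent match is uncertain, a fallback/rephrasing prompt when
# nothing matched at all.
CONFIDENCE_THRESHOLD = 0.7  # assumed cut-off for a confident match

def clarify(matched_intent, confidence):
    if matched_intent is None:
        # FallbackAction / Rephrasing: no intent matched the input.
        return "Sorry, I didn't understand. Could you rephrase that?"
    if confidence < CONFIDENCE_THRESHOLD:
        # Affirmation: confirm the uncertain match with the user.
        return f"Did you mean: {matched_intent}?"
    return None  # confident match, no clarification needed
```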
ConversationTypes features cover the different types of conversation and questions that a platform supports. These features and feature subsets can be seen in Figure 6.
SearchOrientedDialog refers to a dialog that searches through a database to find matching entities and responds to a user intent.
SlotFilling is a dialog in which the conversational AI asks for additional information to fill certain criteria, so as to match the correct intent to an entity. An example could be:

User: What is the weather like today?
Bot: Which city would you like to search the weather for?
User: New York.
Bot: The weather in New York is cloudy.
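The weather example above follows a simple loop: prompt for each missing slot, then fulfill the intent once all slots are filled. A minimal, hypothetical sketch (the names `REQUIRED_SLOTS`, `PROMPTS`, and `next_action` are invented, not any platform's API):

```python
# Hypothetical sketch of slot filling: the agent keeps asking for
# missing slots until the intent can be fulfilled.
REQUIRED_SLOTS = {"get_weather": ["city"]}
PROMPTS = {"city": "Which city would you like to search the weather for?"}

def next_action(intent, filled_slots):
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in filled_slots:
            return ("ask", PROMPTS[slot])  # prompt for the missing slot
    return ("fulfill", filled_slots)       # all slots filled, answer
```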
SmallTalk is a conversation type that refers to conversations without any specific end goal. These types of conversations can be anything from asking how you feel to telling jokes.
ContextualDialogs are an important component of conversational AI systems, supported by every analyzed platform. We found two different ways to support contextual dialogs: one or multiple contexts per conversation. To support contextual dialogs, a conversational AI development platform must have the feature MemoryForContext: specially allocated memory that the AI uses to remember previous information. Support for multiple contexts is referred to as TopicShifting and can enable conversations like, for example:

User: Send a text message to Peter.
Bot: What would you like to text?
User: I want to book a flight to Japan for tomorrow.
Bot: What time would you like to book the flight at tomorrow?
User: At 4am.
Bot: Where in Japan would you like to fly?
User: I would like to text: "The temperature in New York is 20 °C."
Bot: Ok, your message has been sent.
User: I would like to fly to Tokyo.
Bot: Ok, flight booked to Tokyo tomorrow at 4am.
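The interleaved dialog above requires one slot memory per open intent, so that switching topics does not discard either conversation. A hypothetical sketch of such a MemoryForContext structure (the `ContextMemory` class is invented for illustration):

```python
# Hypothetical sketch of MemoryForContext with topic shifting: each
# open intent keeps its own slot memory, so the user can interleave
# conversations (e.g. texting and flight booking) and resume either.
class ContextMemory:
    def __init__(self):
        self.contexts = {}  # intent name -> remembered slots

    def update(self, intent, slot, value):
        # Switching to `intent` does not discard other open contexts.
        self.contexts.setdefault(intent, {})[slot] = value

    def recall(self, intent):
        return self.contexts.get(intent, {})
```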
Figure 7: Intent features
Figure 8: Entity features
Figure 9: Speech features
The last conversation type is Questions, which can vary from simple OneShotQueries to complex MultipleUserIntents. There are two types of OneShotQueries: YesNoQuestions and OpenQuestions. Both types require only one response from the conversational AI to fully answer the query. MultipleUserIntents are queries with multiple intents within them; an example is: "What is the time and weather like in New York?"

Features related to Intent are concerned with intent manipulation and intent restrictions. Intents define the user's goal with the query, for example:

User: What is the time in New York?
Bot: The time in New York is 1pm.

In this case, the intent of the user is finding out the time. These features can be seen in Figure 7.
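Intent definition typically pairs each intent with training phrases against which user queries are matched. A deliberately naive, hypothetical sketch using word overlap (real platforms use trained language-understanding models; `INTENTS` and `match_intent` are invented names):

```python
# Hypothetical sketch of intent definition and matching: each intent is
# defined by training phrases; a query is mapped to the intent whose
# phrases share the most words with it. Real platforms use ML models.
INTENTS = {
    "get_time": ["what is the time", "what time is it"],
    "get_weather": ["what is the weather like", "is it raining"],
}

def match_intent(query):
    words = set(query.lower().strip("?").split())
    scores = {
        intent: max(len(words & set(phrase.split())) for phrase in phrases)
        for intent, phrases in INTENTS.items()
    }
    return max(scores, key=scores.get)  # best-overlapping intent
```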
Entity features, as shown in Figure 8, are concerned with entity manipulation and entity restrictions. Entities are descriptive actions the conversational AI can perform after identifying the user's intent. An example is:

User: Can you call mom?
Bot: Calling mom.

In this case, the intent would be to make a call and the entity would be mom.

Several platforms support features specific to Speech. One such feature is VoiceActivityDetection, which allows the system to detect changes in audio level to determine whether or not the user is currently speaking. Another is ToneAnalyzer, which allows the conversational AI to identify emotions within the speech pattern. These features, depicted in Fig. 9, are required by those platforms that support speech as an input modality.
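In its simplest form, detecting speech from changes in audio level amounts to thresholding the energy of an audio frame. A hypothetical sketch (production detectors use far more robust statistical or learned models; `is_speech` and the threshold value are assumptions):

```python
# Hypothetical sketch of energy-based voice activity detection: a frame
# of audio samples is labeled as speech when its mean energy exceeds a
# threshold. Real systems use statistical or ML-based detectors.
def is_speech(frame, threshold=0.01):
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)  # mean energy
    return energy > threshold
```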
For a user to communicate with a conversational system, the platform must support reading and analyzing one or several input types. The input modalities supported by the considered platforms are shown in Figure 10: text, speech, images, and URLs. The more input modalities a platform supports, the larger the potential areas of use and the accessibility of the created systems.
For a two-sided conversation, the conversational AI system must be able to respond to user queries using one or several output modalities, which are also shown in Fig. 10. Like the input modalities, these different types of output allow the system to be used in different types of environments. Some environments only allow one type of output, so as not to disrupt the user's focus; in a moving car, for example, the optimal type of output is speech. Having several types of output to choose from enhances both the usability and the accessibility of the developed system.

Figure 10: Input and output modalities
To support the evaluation of the conversational maturity of a conversational AI development platform, we created a maturity framework. The framework is inspired by those for human language development, such as CEFR, ILR, and ACTFL, which help to assess the conversational maturity of a language learner. Our maturity framework can be used in the same fashion: to evaluate the conversational maturity level of a conversational AI development platform and how it compares to other platforms. We obtained the framework by considering all features of the state-of-the-art systems surveyed in Section 4, using the methodology outlined in Sect. 3.4.2.
Our framework, as summarized in Table 3, is divided into four main levels, where each level is further divided into two sub-levels for finer differentiation (inspired by the organization of the CEFR into three main levels with two sub-levels each). Orthogonally to the levels, the framework distinguishes two main skills, targeted to the capabilities of conversational AI systems: understanding and response. Understanding refers to the level of comprehension the conversational AI system has and the type of natural language processing it can perform. Response refers to the system's response patterns and abilities to interact with the user.

Table 3 further specifies the features corresponding to each level, pinpointing how certain functionality enables the conversational maturity of a conversational AI development platform. All relevant features have been introduced in Section 4. For a conversational AI development platform to be assigned a specific level, it has to accommodate all the preceding levels of conversational maturity. For example, to be assessed as level 2a, the platform must satisfy all criteria for levels 1a and 1b, in addition to those of 2a.
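The cumulative level assignment can be made precise with a small sketch: a platform is assigned the highest level for which it also satisfies every preceding level's required features. This is a hypothetical illustration with an invented, simplified subset of Table 3; `LEVEL_FEATURES` and `assess` are not part of the framework itself:

```python
# Hypothetical sketch of the cumulative level assignment described in
# the text: a platform gets the highest level for which it also
# satisfies every preceding level's required features.
LEVEL_FEATURES = {  # illustrative subset of Table 3, not the full framework
    "1a": {"Intent", "Entity", "OneShotQueries"},
    "1b": {"OpenQuestions"},
    "2a": {"SlotFilling"},
    "2b": {"MemoryForContext"},
}
LEVEL_ORDER = ["1a", "1b", "2a", "2b"]

def assess(platform_features):
    achieved = None
    for level in LEVEL_ORDER:
        if LEVEL_FEATURES[level] <= platform_features:  # all required present
            achieved = level
        else:
            break  # a gap at this level blocks all higher levels
    return achieved
```

For instance, a platform with slot filling but without open questions would still be assessed at level 1a, because the missing 1b criteria block all higher levels.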
A conversational AI development platform's maturity level depends on its conversational abilities. We propose the following levels:
Level 1: Indicates very limited conversational abilities and very simple comprehension.
Level 1a: Capabilities within the explicitly defined domain and response restricted to isolated words.
Level 1b: Capabilities within related domains and response restricted to sentence fragments.
Level 2: Indicates abilities to hold a short conversation with context and comprehension on an intermediate level.
Level 2a: Capabilities to understand longer queries and respond with full sentences.
Level 2b: Capabilities to memorize context for a short conversation and respond with questions.
Level 3: Indicates abilities to hold a long conversation with context and comprehension on an advanced level.
Level 3a: Capabilities to comprehend spelling mistakes and small talk.
Level 3b: Capabilities to comprehend multiple intents in one query.
Level 4: Indicates abilities to hold multiple conversations and comprehend complex human features.
Level 4a: Capabilities to comprehend multiple input languages and respond in different languages.
Level 4b: Capabilities to comprehend feelings and sentiments and to use them when responding.

Level 1 describes a platform's ability to understand and respond to simple one-shot queries and phrases, which requires support for simple intent and entity definition as well as basic dialog management based on a definition of isolated words. It also needs to provide some basic input and output processing.

Level 2 is devoted to intermediate-level conversation. A feature that significantly affects the conversational maturity of a conversational AI is context. In conjunction with other features, it plays a crucial role in creating more fluid conversations between the AI and a user.

Level 3 addresses advanced conversational ability. A complex feature of conversational AI is the ability to process and understand multiple intents within one query, corresponding to level 3b of the framework. This requires not only all the aforementioned features, but also features such as multiple intents per entity and nested intents. Handling many different user requests at once makes for a more human-like conversation.

Level 4 considers one of the most important and challenging factors of human conversation: the interpretation and understanding of emotions. A conversational AI system that can both read and respond with feelings creates a more natural, and as such more mature, conversation, allowing a proper relationship with the user to develop [17, 20]. In conversational platforms this functionality is supported through the features sentiments and tone analysis. Together with features such as context, multiple intents, and multi-domains, this makes for a platform that can support some of the most mature conversational AI systems.

Our proposed framework has several use-cases, depending on the perspective of the stakeholder.
Platform users can use the framework to inform their investment into particular platforms, a decision that might require carefully trading off conversational maturity against orthogonal factors such as cost and required resources. Developers can use the framework as a benchmark to evaluate and compare their own platforms, to identify missing functionalities, and to make an informed decision on the "next steps" when extending their platform or developing a new platform from scratch.
Table 2: Maturity levels of considered platforms (based on our analysis in 2019; possible later updates not yet included)

Platform           Level  Features for next level
DialogFlow         3b     TopicShifting, Sentiments, Policies
Meya.ai            1b     OpenQuestions, MultiConv.Dom.
Microsoft Bot Fw.  3a     MultipleUserIntents
Houndify           1b     SearchOrientedDialog
Amazon Lex         3b     MultiLanguage, Policies
RASA               2b     SpellingCorrection
IBM Watson Conv.   3a     LanguageSeparation
VoiceXML           3a     LanguageSeparation
Recast.ai          1b     SearchOrientedDialog
Kore.ai            1b     SearchOrientedDialog
AIML               1b     SearchOrientedDialog
TDM                2a     MemoryForContext

Researchers can use the framework as a tool for obtaining a qualified overview of the platform landscape in practice. In what follows, we describe the results of applying the framework to the twelve platforms considered in our survey.

Table 2 shows an overview of the analyzed platforms together with their proposed conversational maturity level, based on a manual assessment by the authors. This assessment was done based on a feature matrix derived from the literature survey (see our online appendix [28]). For each platform, we used the information on included features to map the platform to the corresponding level. The table also includes a hint for the features required to achieve the next level, which can act as a suggestion for current developers of the platforms to specifically focus on during development. It also indicates to the platform company which features are necessary for the platform to reach a more human-like performance, if this is the end goal of the platform.

The decision whether a platform has a particular feature or not is binary. One might further be interested in how well each platform supports each feature. To this end, one might combine our framework with a metrics-based approach such as the one by Venkatesh et al. [31]. Having an automated assessment tool, ideally based on a continuously updated "benchmarking bot" supported by a community consensus, is a desirable direction for future work.
The main threat to external validity is the question of how our framework will generalize to future platforms: since our framework is based on an analysis of the documentation of existing platforms, it is essentially an overview of "what we have", rather than "what we want". A promising direction for future work is to conduct a user study to identify the limitations and unaddressed user needs in current platforms. Consequently, the framework might be augmented by a fifth maturity level. Nevertheless, this concern does not threaten the usefulness of our framework for assessing current and near-future platforms and guiding organizations and developers in systematically improving their platforms.

A threat to internal validity are possible inaccuracies resulting from our decision to consult the considered platforms' documentations: documentations might be incomplete (i.e., not all features
Table 3: Assessment framework for conversational AI development platforms: maturity levels.

Level | Understanding | Response | Features

Level 1
Very limited conversational abilities and very simple comprehension
Level 1a
Understanding:
• Can understand simple one-shot queries, like yes-or-no questions and questions for previously defined names.
• Can understand simple phrases like greetings or information regarding the domain at hand.
• Can understand one intent per entity defined in the conversational AI.
Response:
• Can respond with isolated words and numbers like “yes”, “no”, and “50”.
• Can initiate a dialog with the user, instead of waiting for a user input.
Features: Intent, Entity, OneShotQueries, YesNoQuestions, InputModality, OutputModality, DialogInitiation
Level 1b
Understanding:
• Can understand queries that have been explicitly defined for the domain at hand.
• Can also understand simple queries for related domains.
Response:
• Can build a short phrase or sentence using a few connected words to produce a response.
Features: OpenQuestions, MultipleConversationDomains
Level 2
Abilities to hold a short conversation with context and comprehension on an intermediate level
Level 2a
Understanding:
• Can understand longer queries with nested intents within the defined domain.
• Can understand queries for related domains.
Response:
• Can respond with full sentences containing more information.
• Can use affirmation to confirm the user's intent.
• Can ask the user to repeat the query if it was not understood or misheard.
Features: Affirmation, Rephrasing, FallbackActions, SearchOrientedDialog, SlotFilling
Level 2b
Understanding:
• Can understand the conversational context and keep that context memorized throughout a short conversation.
• Can understand multiple intents for a single entity defined in the conversational AI.
• Can comprehend propositionality.
Response:
• Will ask the user for additional information if there is missing information for the intent.
• Can respond with follow-up questions to further continue the conversation.
Features: ContextualDialogs, MemoryForContext
Level 3
Abilities to hold a long conversation with context and comprehension on an advanced level
Level 3a
Understanding:
• Can understand context in a sentence and keep that context memorized throughout an entire conversation.
• Can use spelling correction and understand policies and censorship.
• Can comprehend small talk, like asking: “How are you?”
Response:
• Can respond with sentences regarding small talk, like: “I’m doing very good!”
Features: SpellingCorrection, SmallTalk
Level 3b
Understanding:
• Can understand multiple intents in one query.
• Can separate between language-specific and non-language-specific information.
Response:
• Can respond with multiple answers to cover all intents in the query.
Features: MultipleUserIntents, LanguageSeparation
Level 4
Abilities to hold multiple conversations and comprehend complex human language features
Level 4a
Understanding:
• Can shift between different contexts within the same conversation.
• Can understand at least two input languages.
Response:
• Can have sentiments when answering queries to add a more human aspect.
• Can respond in different languages.
• Can translate information for the user.
Features: TopicShifting, MultiLanguage, Sentiments, Policies
Level 4b
Understanding:
• Can understand sentiments and feelings.
• Can analyze the user's speech input using linguistic analysis to detect emotion and language tones.
Response:
• Can convey feelings, such as anger or comfort, when responding to queries.
Features: ToneAnalyzer, VoiceActivityDetection
are documented) and/or affected by overselling (i.e., documented features are actually not available). While we cannot rule out that our findings are affected by this threat, we point out that companies have a significant interest in keeping their documentation up-to-date and complete. Accurately depicting all available features is required to be competitive in a rising market; conversely, overselling might affect reputation and customer trust. Furthermore, one may argue that our features and levels were defined in a subjective fashion (albeit with agreement between the authors); a large-scale user-rating study might increase objectivity.
We presented a conversational AI maturity framework for assessing conversational AI development platforms, based on the ability of the produced systems to conduct conversations. By supporting the understanding of how the features of a conversational AI development platform correspond to conversational ability, this framework can help both users with choosing and developers with developing a powerful conversational AI system. Our framework is inspired by related frameworks for human language development. Comparable to the way in which a human speaker learns a language, the levels of conversational maturity in our framework indicate the ability to conduct and engage in a natural conversation with a user.

Our framework is based on, and incorporates results from, an analysis of the state-of-the-art conversational AI development platforms, which we identified in a literature review. We considered the documentations of these platforms to extract common and unique features, which we grouped into a feature model to provide a high-level overview of all the different existing features. Each feature comes with a description to support the understanding of its use, context, and scope. We related the features to conversational maturity and used them to develop our framework's maturity levels.

Our results show that the various existing conversational AI development platforms share significant commonalities. In the future, to bridge different terminologies and support users in flexibly choosing a platform according to their current needs, we aim to develop a domain-specific language together with code generators for the various platforms. Such an infrastructure allows developing a system on a high level and transforming the specification into an implementation for a concrete platform. It can also support the migration between different platforms when a platform with higher conversational maturity becomes available.
REFERENCES
[1] govtilr.org/Skills/ILRscale2.htm
[2] 2019. Messenger Platform. https://developers.facebook.com/docs/messenger-platform/
[3] Glenn Bowen. 2009. Document Analysis as a Qualitative Research Method.
[4] Tom B. Brown and others. 2020. Language Models are Few-Shot Learners. In NeurIPS.
[5] Massimo Canonico and Luigi De Russis. 2018. A Comparison and Critique of Natural Language Understanding Tools. In International Conference on Cloud Computing, GRIDs, and Virtualization. IARIA, Barcelona, 110–115.
[6] Leigh Clark and others. 2019. What Makes a Good Conversation?: Challenges in Designing Truly Conversational Agents. In CHI Conference on Human Factors in Computing Systems.
[7] Council of Europe. Common European Framework for Languages: Learning, teaching, assessment. Technical Report. 21–42 pages. https://rm.coe.int/1680459f97
[8] Krzysztof Czarnecki and Ulrich W. Eisenecker. 2000. Generative Programming: Methods, Tools, and Applications. Addison-Wesley.
[9] Krzysztof Czarnecki and Simon Helsen. 2006. Feature-based survey of model transformation approaches. IBM Syst. J. 45 (July 2006), 621–645. Issue 3.
[10] Sebastian Erdweg and others. 2013. The State of the Art in Language Workbenches. In International Conference on Software Language Engineering.
[11] Jianfeng Gao, Michel Galley, and Lihong Li. 2018. Neural Approaches to Conversational AI: Question Answering, Task-Oriented Dialogues and Social Chatbots. Technical Report. https://arxiv.org/pdf/1809..pdf
[12] Lan-Fen Huang, Simon Kubelec, Nicole Keng, and Lung-Hsun Hsu. 2018. Evaluating CEFR rater performance through the analysis of spoken learner corpora.
[13] Kyo C. Kang, Sholom G. Cohen, James A. Hess, William E. Novak, and A. Spencer Peterson. 1990. Feature-oriented domain analysis (FODA) feasibility study. Technical Report. Carnegie-Mellon University, Pittsburgh.
[14] Lorenz Cuno Klopfenstein, Saverio Delpriori, Silvia Malatini, and Alessandro Bogliolo. 2017. The Rise of Bots: A Survey of Conversational Interfaces, Patterns, and Paradigms. In Conference on Designing Interactive Systems, DIS. 555–565.
[15] Dale L. Lange and Pardee Lowe, Jr. 1987. Grading Reading Passages According to the ACTFL/ETS/ILR Reading Proficiency Standard: Can It Be Learned? Technical Report.
[16] Kwanwoo Lee, Kyo Kang, and Jaejoon Lee. 2002. Concepts and Guidelines of Feature Modeling for Product Line Software Engineering. Technical Report. Pohang University of Science and Technology.
[17] Iolanda Leite, André Pereira, Samuel Mascarenhas, Carlos Martinho, Rui Prada, and Ana Paiva. 2013. The influence of empathy in human-robot relations. International Journal of Human Computer Studies 71, 3 (2013), 250–260.
[18] Massimo Canonico and Luigi De Russis. 2018. A Comparison and Critique of Natural Language Understanding Tools. In Cloud Computing 2018. Barcelona, 110–115.
[19] Michael McTear. 2018. Conversational Modelling for Chatbots: Current Approaches and Future Directions. In Conference on Electronic Speech Signal Processing. 175–185.
[20] Michael McTear, Zoraida Callejas, and David Griol. 2016. The Conversational Interface: Talking to Smart Devices.
[21] 2017. The Rise of Conversational AI. forbes.com/sites/forbestechcouncil/2017/12/04/the-rise-of-conversational-ai/
[22] Modelling Chatbots with a Cognitive System Allows for a Differentiating User Experience. Technical Report. IBM, Belgium.
[23] Damir Nešić, Jacob Krüger, Ştefan Stănciulescu, and Thorsten Berger. 2019. Principles of feature modeling. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 62–73.
[24] Amit Patil, K. Marimuthu, Nagaraja Rao, and R. Niranchana. 2017. Comparative study of cloud platforms to develop a Chatbot. International Journal of Engineering & Technology (2017), 57–61.
[25] Ashwin Ram and others. 2017. Conversational AI: The Science Behind the Alexa Prize. Technical Report.
[26] Bayan Abu Shawar and Eric Atwell. 2007. Different measurement metrics to evaluate a chatbot system. In Bridging the Gap: Academic and Industrial Research in Dialog Technologies Workshop Proceedings. Association for Computational Linguistics, Rochester, NY, 89–96.
[27] Elvira Swender, J. Daniel Conrad, and Robert Vicars. 2012. ACTFL proficiency guidelines 2012. actfl.org/sites/default/files/pdfs/public/ACTFLProficiencyGuidelines2012_FINAL.pdf
[28] The authors. 2020. Online appendix. (2020). https://doi.org/10..figshare..v1
[29] Peter Tsai. 2018. Data snapshot: AI Chatbots and Intelligent Assistants in the Workplace. Spiceworks. (2018). https://community.spiceworks.com/blog/2964-data-snapshot-ai-chatbots-and-intelligent-assistants-in-the-workplace
[30] Erwin Tschirner, Olaf Bärenfänger, and Irmgard Wanner. 2012. Assessing Evidence of Validity of Assigning CEFR Ratings to the ACTFL Oral Proficiency Interview (OPI) and the Oral Proficiency Interview by computer (OPIc). Technical Report. Institute for Test Research and Development, Leipzig.
[31] Anu Venkatesh and others. 2017. On Evaluating and Comparing Conversational Agents. In NIPS: Conversational AI Workshop.
[32] Claes Wohlin. 2014. Guidelines for Snowballing in Systematic Literature Studies and a Replication in Software Engineering. In International Conference on Evaluation and Assessment in Software Engineering. 38:1–10.
[33] Rui Yan. 2018. "Chitty-Chitty-Chat Bot": Deep Learning for Conversational AI. In