Individual and Domain Adaptation in Sentence Planning for Dialogue

Abstract

One of the biggest challenges in the development and deployment of spoken dialogue systems is the design of the spoken language generation module. This challenge arises from the need for the generator to adapt to many features of the dialogue domain, user population, and dialogue context. A promising approach is trainable generation, which uses general-purpose linguistic knowledge that is automatically adapted to the features of interest, such as the application domain, individual user, or user group. In this paper we present and evaluate a trainable sentence planner for providing restaurant information in the MATCH dialogue system. We show that trainable sentence planning can produce complex information presentations whose quality is comparable to the output of a template-based generator tuned to this domain. We also show that our method easily supports adapting the sentence planner to individuals, and that the individualized sentence planners generally perform better than models trained and tested on a population of individuals. Previous work has documented and utilized individual preferences for content selection, but to our knowledge, these results provide the first demonstration of individual preferences for sentence planning operations, affecting the content order, discourse structure and sentence structure of system responses. Finally, we evaluate the contribution of different feature sets, and show that, in our application, n-gram features often do as well as features based on higher-level linguistic representations.


Journal of Artificial Intelligence Research 30 (2007) 413-456. Submitted 05/2007; published 11/2007


Marilyn Walker

Department of Computer Science, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, United Kingdom

Amanda Stent

Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA

François Mairesse

Department of Computer Science, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, United Kingdom

Rashmi Prasad

Institute for Research in Cognitive Science, University of Pennsylvania, 3401 Walnut Street, Suite 400A, Philadelphia, PA 19104, USA


1. Introduction

One of the most robust findings of studies of human-human dialogue is that people adapt their interactions to match their conversational partners' needs and behaviors (Goffman, 1981; Brown & Levinson, 1987; Pennebaker & King, 1999). People adapt the content of their utterances (Garrod & Anderson, 1987; Luchok & McCroskey, 1978). They choose syntactic structures to match their partners' syntax (Levelt & Kelter, 1982; Branigan, Pickering, & Cleland, 2000; Reitter, Keller, & Moore, 2006; Stenchikova & Stent, 2007), and adapt their choice of words and referring expressions (Clark & Wilkes-Gibbs, 1986; Brennan & Clark, 1996). They also adapt their speaking rate, amplitude, and clarity of pronunciation (Jungers, Palmer, & Speer, 2002; Coulston, Oviatt, & Darves, 2002; Ferguson & Kewley-Port, 2002).

However, it is beyond the state of the art to reproduce this type of adaptation in the spoken language generation module of a dialogue system, i.e. the components that handle response generation and information presentation. A standard generation system includes modules for content planning, sentence planning, and surface realization (Kittredge, Korelsky, & Rambow, 1991; Reiter & Dale, 2000). A content planner takes as input a communicative goal; it selects content to realize that goal and organizes that content into a content plan. A sentence planner takes as input a content plan. It decides how the content is allocated into sentences, how the sentences are ordered, and which discourse cues to use to express the relationships between content elements. It outputs a sentence plan. Finally, a surface realizer determines the words and word order for each sentence in the sentence plan. It outputs a text or speech realization for the original communicative goal.

The findings from human-human dialogue suggest that adaptation could potentially be useful at any stage of the generation pipeline.
Yet to date, the only work on adaptation to individual users utilizes models of the user's knowledge, needs, or preferences to adapt the content for content planning (Jokinen & Kanto, 2004; Rich, 1979; Wahlster & Kobsa, 1989; Zukerman & Litman, 2001; Carenini & Moore, 2006), rather than applying models of individual linguistic preferences as to the form of the output, as determined by sentence planning or surface realization.

However, consider the alternative realizations for a restaurant recommendation in Figure 1. Columns A and B contain human ratings of the quality of the realizations from users A and B. The differences in the rating feedback suggest that each user has different perceptions as to the quality of the potential realizations. Data from an experiment collecting feedback from users A and B, for 20 realizations of 30 different recommendation content plans (600 examples), shows that the feedback of the two users is easily distinguished: a paired t-test supports the hypothesis that the two samples are drawn from distinct distributions (t = 17. […], p < . […]). […] (F = 1. […], p < . […]).

• It is common knowledge that individual authors can be identified from the linguistic features of their written texts (Madigan, Genkin, Lewis, Argamon, Fradkin, & Ye, 2005; Oberlander & Brew, 2000).

• An examination of a weather report corpus for five weather forecasters showed individual differences in lexical choice for expressing specific weather-related concepts (Reiter & Sripada, 2002).

Alt | Realization | A | B | AVG
6 | Chanpen Thai has the best overall quality among the selected restaurants since it is a Thai restaurant, with good service, its price is 24 dollars, and it has good food quality. | 1 | 4 | 2.5
7 | Chanpen Thai has the best overall quality among the selected restaurants because it has good service, it has good food quality, it is a Thai restaurant, and its price is 24 dollars. | 2 | 5 | 3.5
4 | Chanpen Thai has the best overall quality among the selected restaurants. It has good food quality, with good service, it is a Thai restaurant, and its price is 24 dollars. | 2 | 4 | 3
9 | Chanpen Thai is a Thai restaurant, with good food quality, its price is 24 dollars, and it has good service. It has the best overall quality among the selected restaurants. | 2 | 4 | 3
5 | Chanpen Thai has the best overall quality among the selected restaurants. It has good service. It has good food quality. Its price is 24 dollars, and it is a Thai restaurant. | 3 | 2 | 2.5
3 | Chanpen Thai has the best overall quality among the selected restaurants. Its price is 24 dollars. It is a Thai restaurant, with good service. It has good food quality. | 3 | 3 | 3
10 | Chanpen Thai has the best overall quality among the selected restaurants. It has good food quality. Its price is 24 dollars. It is a Thai restaurant, with good service. | 3 | 3 | 3
2 | Chanpen Thai has the best overall quality among the selected restaurants. Its price is 24 dollars, and it is a Thai restaurant. It has good food quality and good service. | 4 | 4 | 4
1 | Chanpen Thai has the best overall quality among the selected restaurants. This Thai restaurant has good food quality. Its price is 24 dollars, and it has good service. | 4 | 3 | 3.5
8 | Chanpen Thai is a Thai restaurant, with good food quality. It has good service. Its price is 24 dollars. It has the best overall quality among the selected restaurants. | 4 | 2 | 3

Figure 1: Some alternative realizations for the content plan in Figure 4, with feedback from Users A and B, and the mean (AVG) of their feedback (1=worst and 5=best).

• Rules learned for generating nominal referring expressions perform better when individual speakers are provided as a feature to the learning algorithm (Jordan & Walker, 2005), and an experiment evaluating choice of referring expression shows only 70% agreement among native speakers as to the best choice (Yeh & Mellish, 1997). Chai, Hong, Zhou, and Prasov (2004) show that there are also individual differences in gesture when generating multimodal references, and the corpus study of accented pronouns reported by Kothari (2007) suggests that accentuation is also partly determined by individual linguistic style.

• Automatic evaluation techniques applied to human-generated reference outputs for machine translation and automatic summarization perform better when multiple outputs are provided for comparison (Papineni, Roukos, Ward, & Zhu, 2002; Nenkova, Passonneau, & McKeown, 2007): this can be attributed to the large variation in what humans generate given particular content to express. This is also reflected in the finding that human subjects produce many different valid content orderings when asked to order a specific set of content items to produce the best possible summary (Barzilay, Elhadad, & McKeown, 2002; Lapata, 2003).
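As a rough illustration of the paired comparison of the two users' feedback described above, the sketch below computes a paired t statistic over the ten ratings that Users A and B gave in Figure 1. This is only a toy calculation on the small sample shown in the figure, not the 600-example experiment reported in the text, so its value should not be expected to match the reported statistic. The function name `paired_t` is hypothetical.

```python
import math
import statistics

# Ratings from Figure 1, in row order (Alt 6, 7, 4, 9, 5, 3, 10, 2, 1, 8).
ratings_a = [1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
ratings_b = [4, 5, 4, 4, 2, 3, 3, 4, 3, 2]


def paired_t(xs, ys):
    """Return (t, degrees of freedom) for a paired t-test on xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err, len(diffs) - 1


t_stat, dof = paired_t(ratings_a, ratings_b)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The ten rows shown in Figure 1 are far too few to reproduce the reported result; the claim in the text rests on the full set of 600 rated realizations.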

In the past, linguistic variation among individuals was considered a problem for generation researchers to work around, rather than a potential area of study (McKeown, Kukich, & Shaw, 1994; Reiter, 2002; Reiter, Sripada, & Robertson, 2003). In part, this was due to the hand-crafting of generation components and resources. It is impossible to encode by hand, for each individual, rules for sentence planning and realization. Furthermore, if domain experts do not agree on the best way to express a domain concept, how can the generation dictionary be encoded? It is difficult simply to get good output that respects all the interacting domain and linguistic constraints even with considerable handcrafting of rules (Kittredge, Korelsky, & Rambow, 1991).

Modeling individual differences can also be a problem for statistical methods when learning paradigms are used that assume there is a single correct output (Lapata, 2003; Jordan & Walker, 2005; Hardt & Rambow, 2001, inter alia). We believe that the simplest way to deal with the inherent variability in possible generation outputs is to treat generation as a ranking problem, as we explain below, with techniques that overgenerate using user- or domain-independent rules, and then filter or rank the possibilities using domain- or user-specific corpora or feedback (Langkilde & Knight, 1998; Langkilde-Geary, 2002; Bangalore & Rambow, 2000; Rambow, Rogati, & Walker, 2001). This approach has an advantage for dialogue systems because it also affords joint optimization of the generator and the text-to-speech engine (Bulyko & Ostendorf, 2001; Nakatsu & White, 2006). There are many problems in generation to which ranking models and individualization could be applied, such as text planning, cue word selection, or referring expression generation (Mellish, O'Donnell, Oberlander, & Knott, 1998; Litman, 1996; Di Eugenio, Moore, & Paolucci, 1997; Marciniak & Strube, 2004).
However, only recently has any work in generation acknowledged that there are individual differences and tried to model them (Guo & Stent, 2005; Mairesse & Walker, 2005; Belz, 2005; Lin, 2006).

This article describes SPaRKy (Sentence Planning with Rhetorical Knowledge), a sentence planner that uses rhetorical relations and adapts to the user's individual sentence planning preferences. SPaRKy has two components: a randomized sentence plan generator (SPG) that produces multiple alternative realizations of an information presentation, and a sentence plan ranker (SPR) that is trained (using human feedback) to rank these alternative realizations (see Figure 1). As mentioned above, previous work has documented and utilized individual preferences for content selection, but to our knowledge, our results provide the first demonstration of individual preferences for sentence planning operations, affecting the content ordering, discourse structure, sentence structure, and sentence scope of system responses. We also show that some of the learned preferences are domain-specific.

Section 2 compares our approach and results with previous work. Section 3 provides an overview of the MATCH system architecture, which can generate dialogue system responses using either SPaRKy or a domain-specific template-based generator described and evaluated in previous work (Stent, Walker, Whittaker, & Maloor, 2002; Walker et al., 2004). Sections 4, 5 and 6 describe SPaRKy in detail; they describe the SPG, the automatic generation of features used in training the SPR, and how boosting is used to train the SPR. Sections 7 and 8 present both quantitative and qualitative results:

Footnote 1: A Java version of SPaRKy can be downloaded from …

1. First, we show that SPaRKy learns to select sentence plans that are significantly better than a randomly selected sentence plan, and on average less than 10% worse than a sentence plan ranked highest by human judges. We also show that, in our experiments, simple n-gram features perform as well as features based on higher-level linguistic representations.

2. Second, we show that SPaRKy's SPG can produce realizations that are comparable to those of MATCH's template-based generator, but that there is a gap between the realizations that the SPR selects when trained on multiple users and those selected by a human.

3. Third, we show that when SPaRKy is trained for particular individuals, it performs better than when trained on feedback from multiple individuals. These are the first results suggesting that individual sentence planning preferences exist, and that they can be modeled by a trainable generation system. We also show that in most cases the performance of the individualized SPRs is statistically indistinguishable from that of MATCH's template-based generator, but for compare-2, User B prefers SPaRKy, while for compare-3, User A prefers the template-based generator.

4. Fourth, we show that the differences in the learned models make sense in terms of previous rule-based approaches to sentence planning. We analyze the qualitative differences between the learned group and individual models, and show that SPaRKy learns specific rules about the interaction between content items and sentence planning operations, and rules that model individual differences, that would be difficult to capture with a hand-crafted generator.

We sum up and discuss future work in Section 9.
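The two-component design described above, a randomized generator that overgenerates followed by a ranker, can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not SPaRKy's implementation: the function names `generate_candidates`, `rank` and `toy_score`, the hand-written scoring rule, and the representation of a candidate as a tuple of assertion ids are all hypothetical. Here a candidate is just an ordering of content items, whereas the SPG also varies sentence structure and discourse cues, and the SPR is trained from human feedback rather than written by hand.

```python
import random


def generate_candidates(assertions, n_candidates, seed=0):
    """Overgenerate: sample distinct random orderings of the assertions."""
    rng = random.Random(seed)
    seen = set()
    # Cap the number of attempts so the loop ends even if few orderings exist.
    for _ in range(n_candidates * 20):
        if len(seen) >= n_candidates:
            break
        order = list(assertions)
        rng.shuffle(order)
        seen.add(tuple(order))
    return sorted(seen)


def rank(candidates, score):
    """Rank candidates from best to worst under a scoring function."""
    return sorted(candidates, key=score, reverse=True)


def toy_score(candidate):
    """Hypothetical stand-in for a learned ranker: prefer the claim (1) early."""
    return -candidate.index(1)


candidates = generate_candidates([1, 2, 3, 4, 5], n_candidates=10)
best = rank(candidates, toy_score)[0]
print(best)
```

Swapping `toy_score` for a function fitted to one user's ratings, or to ratings averaged over users, is the point at which individual or group adaptation would enter such a design.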

2. Related Work

We discuss related work on adaptation in generation using the standard generation architecture, which contains modules for content planning (Section 2.1), sentence planning (Section 2.2) and surface realization (Section 2.3) (Kittredge, Korelsky, & Rambow, 1991; Reiter & Dale, 2000).

There has been significant research on the use of user models and discourse context to adapt the content of information presentations in dialogue (Joshi, Webber, & Weischedel, 1984, 1986; Chu-Carroll & Carberry, 1995; Zukerman & Litman, 2001, inter alia), but only the user models (not the information presentation strategies) are sensitive to particular individuals. Several studies have investigated the use of quantitative models of user preferences in selection of content for recommendations and comparisons (Carenini & Moore, 2006; Walker et al., 2004; Polifroni & Walker, 2006), and Moore, Foster, Lemon, and White (2004) use such models for referring expression generation, sentence planning and some surface realization. Elhadad, Kan, Klavans, and McKeown (2005) applied group models (physician, lay person) and individual user models to the task of summarizing medical information. McCoy (1989) used context information to design helpful system-generated corrections. Other work has looked at the use of statistical techniques for adapting content selection and content ordering methods to particular domains (Barzilay, Elhadad, & McKeown, 2002; Duboue & McKeown, 2003; Lapata, 2003), but not to individual users.

The first trainable sentence planner was SPoT, a precursor to SPaRKy that output information gathering utterances in the travel domain (Walker, Rambow, & Rogati, 2002). Evaluations of SPoT demonstrated that it performed as well as a template-based generator developed for the travel domain and field-tested in the DARPA Communicator evaluations (Rambow, Rogati, & Walker, 2001; Walker et al., 2002). Information gathering utterances are considerably simpler than information presentations: they do not usually exhibit any complexities in rhetorical structure, and there is little interaction between domain-specific content items and sentence structures. Thus the SPoT generator did not produce utterances with variation in rhetorical structure; it learned to optimize speech-act ordering and sentence structure choices, but it did not adapt to individuals.

Work on adaptation in surface realization has mainly focused on decisions such as lexical and syntactic choice, using models of a target text, but not individual text models, although recent research has also shown that n-gram models trained on user-specific corpora can adapt generators to reproduce individualized lexical and syntactic choices (Lin, 2006; Belz, 2005). Paiva and Evans (2004) present a technique for training a generator by learning the relationship between particular generation decisions and text variables that can be measured in the output corpus. This technique was applied to generator decisions such as the form of referring expression and syntactic structure, and was used to capture stylistic, rather than individual, differences. Gupta and Stent (2005) use discourse context and speaker knowledge for referring expression generation in dialogue.

User models have also been used to adapt surface realization. The approach of learning a ranking from user feedback has been applied to multimedia presentation planning (Stent & Guo, 2005) and to the joint optimization of the syntactic realizer and the text-to-speech engine (Nakatsu & White, 2006). This work does not look at individual differences.

Research has also focused on other factors that affect stylistic variation: how realization choices reflect personality, politeness, emotion or domain-specific style (Hovy, 1987; DiMarco & Foster, 1997; Walker, Cahn, & Whittaker, 1997; André, Rist, van Mulken, Klesen, & Baldes, 2000; Bouayad-Agha, Scott, & Power, 2000; Fleischman & Hovy, 2002; Piwek, 2003; Porayska-Pomsta & Mellish, 2004; Isard, Brockmann, & Oberlander, 2006; Gupta, Walker, & Romano, 2007; Mairesse & Walker, 2007). None of this work has attempted to reproduce individual stylistic variation.

3. Overview of MATCH's Spoken Language Generator

[Diagram. Components: Dialog Manager; SPUR text planner; template-based generator; SPaRKy (sentence plan generator, sentence plan ranker); RealPro surface realizer. Data labels: communicative goal; content plan; text-plan trees (tp-trees); pairs of (sentence plan [sp-tree], dependency tree [d-tree]); ranked list of (sp-tree, d-tree) pairs; text.]

Figure 2: Architecture of MATCH's Spoken Language Generator.

MATCH (Multimodal Access To City Help) is a multimodal dialogue system for finding restaurants and entertainment options in New York City (Johnston, Bangalore, Vasireddy, Stent, Ehlen, Walker, Whittaker, & Maloor, 2002). Information presentations in MATCH include route descriptions, as well as user-tailored recommendations and comparisons of restaurants. Figure 2 shows MATCH's architecture for spoken language generation (SLG). The content planning module is the SPUR text planner (Section 3.1) (Walker et al., 2004). There are two modules for producing text or spoken dialogue responses from SPUR's output: a highly engineered domain-specific template-based realizer (Section 3.2); and the SPaRKy sentence planner followed by the RealPro surface realizer (Lavoie & Rambow, 1997) (Section 3.3). Example template-based and SPaRKy outputs for each dialogue strategy are in Figure 3. Both SPUR and SPaRKy are trainable, and produce different output depending on the user and discourse context.

Strategy | System | Realization | AVG
recommend | Template | Caffe Cielo has the best overall value among the selected restaurants. Caffe Cielo has good decor and good service. It's an Italian restaurant. | 4
recommend | SPaRKy | Caffe Cielo, which is an Italian restaurant, with good decor and good service, has the best overall quality among the selected restaurants. | 4
compare-2 | Template | Caffe Buon Gusto's an Italian restaurant. On the other hand, John's Pizzeria's an Italian, Pizza restaurant. | 2
compare-2 | SPaRKy | Caffe Buon Gusto is an Italian restaurant, and John's Pizzeria is an Italian, Pizza restaurant. | 4
compare-3 | Template | Among the selected restaurants, the following offer exceptional overall value. Uguale's price is 33 dollars. It has good decor and very good service. It's a French, Italian restaurant. Da Andrea's price is 28 dollars. It has good decor and very good service. It's an Italian restaurant. John's Pizzeria's price is 20 dollars. It has mediocre decor and decent service. It's an Italian, Pizza restaurant. | 4.5
compare-3 | SPaRKy | Da Andrea, Uguale, and John's Pizzeria offer exceptional value among the selected restaurants. Da Andrea is an Italian restaurant, with very good service, it has good decor, and its price is 28 dollars. John's Pizzeria is an Italian, Pizza restaurant. It has decent service. It has mediocre decor. Its price is 20 dollars. Uguale is a French, Italian restaurant, with very good service. It has good decor, and its price is 33 dollars. | 4

Figure 3: Template outputs and a sample SPaRKy output for each dialogue strategy. AVG = averaged score of two human users.

3.1 SPUR

The input to SPUR is a high-level communicative goal from the MATCH dialogue manager and its output is a content plan for a recommendation or comparison. SPUR selects and organizes the content to be communicated based on the communicative goal, a conciseness parameter, and a decision-theoretic user model. It produces targeted recommendations and comparisons: the restaurants mentioned and the attributes selected for each restaurant are those the user model predicts the user will want to know about. Thus SPUR can produce a wide variety of content plans.

Figure 4 shows a sample content plan for a recommendation. This content plan gives rise to the alternate realizations for recommendations for Chanpen Thai in Figure 1. Following a bottom-up approach to text-planning (Marcu, 1997; Mellish, O'Donnell, Oberlander, & Knott, 1998), each content plan consists of a set of assertions that must be communicated to the user and a set of rhetorical relations that hold between those assertions that may be communicated as well. Each rhetorical relation designates one or more facts as the nuclei of the relation, i.e. the main point, and the other facts as satellites, i.e. the supplementary facts (Mann & Thompson, 1987). Three rhetorical relations (Mann & Thompson, 1987) are used by SPUR: the justify relation for the recommendation strategy, and the contrast and elaboration relations for the comparison strategies. The relations in Figure 4 specify that the nucleus (1) is the claim being made in the recommendation, and that the satellites (assertions 2 to 5) provide justifying evidence for the claim.

relations: justify(nuc:1, sat:2); justify(nuc:1, sat:3); justify(nuc:1, sat:4); justify(nuc:1, sat:5)
content:
1. assert(best(Chanpen Thai))
2. assert(is(Chanpen Thai, cuisine(Thai)))
3. assert(has-att(Chanpen Thai, food-quality(good)))
4. assert(has-att(Chanpen Thai, service(good)))
5. assert(is(Chanpen Thai, price(24 dollars)))

Figure 4: A content plan for a recommendation.
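To make the structure of Figure 4 concrete, the sketch below encodes that content plan as plain Python data and recovers the nucleus and satellites of its justify relations. The class, field and method names (`Relation`, `ContentPlan`, `nuclei`, `satellites`) are hypothetical and chosen only for illustration; they are not SPUR's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    name: str       # rhetorical relation, e.g. "justify"
    nucleus: int    # id of the main assertion
    satellite: int  # id of the supporting assertion


@dataclass
class ContentPlan:
    assertions: dict  # assertion id -> assertion text
    relations: list = field(default_factory=list)

    def nuclei(self, name):
        """Ids of all assertions acting as nucleus of the named relation."""
        return sorted({r.nucleus for r in self.relations if r.name == name})

    def satellites(self, name, nucleus):
        """Ids of the satellites attached to one nucleus by the named relation."""
        return [r.satellite for r in self.relations
                if r.name == name and r.nucleus == nucleus]


# The recommendation content plan of Figure 4.
plan = ContentPlan(
    assertions={
        1: "assert(best(Chanpen Thai))",
        2: "assert(is(Chanpen Thai, cuisine(Thai)))",
        3: "assert(has-att(Chanpen Thai, food-quality(good)))",
        4: "assert(has-att(Chanpen Thai, service(good)))",
        5: "assert(is(Chanpen Thai, price(24 dollars)))",
    },
    relations=[Relation("justify", 1, sat) for sat in (2, 3, 4, 5)],
)

print(plan.nuclei("justify"))
print(plan.satellites("justify", 1))
```

Note that the plan fixes neither the order of the assertions nor the wording of the justify relation; those choices are left to the later stages.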

In order to produce utterances from the content plans produced by SPUR, we first implemented and evaluated a template-based generator for MATCH (Stent, Walker, Whittaker, & Maloor, 2002; Walker et al., 2004). The template-based generator was designed to make it possible to evaluate algorithms for user-specific content selection based on SPUR's decision-theoretic user model. It performs sentence planning, including some discourse cue insertion, clause combining and referring expression generation. It produces one high-quality output for any content plan for our three dialogue strategies: recommend, compare-2 and compare-3. Recommendations and comparisons are one form of evaluative argument, so the generator's realization strategies are based on guidelines from argumentation theory for producing effective evaluative arguments, as summarized by Carenini and Moore (2000). Because the templates are highly tailored to this domain, the template-based generator can be expected to perform well in comparison to SPaRKy.

Following the argumentation guidelines, the template-based generator realizes recommendations with the nucleus ordered first, followed by the satellites. The satellites are ordered to maximize the opportunity for aggregation. To produce the most concise recommendations given the content to be communicated, phrases with identical verbs and subjects are grouped, so that lists and coordination can be used to aggregate the assertions about the subject. Figure 5 provides examples of aggregation as the number of assertions varies according to SPUR's conciseness parameter (Z-value).

The realization template for comparisons focuses on communicating both the elaboration and the contrast relations. Figure 6 contains a content plan for comparisons. The nucleus is the assertion (1) that Above and Carmine's are exceptional restaurants. The satellites (assertions 2 to 7 representing the selected attributes for each restaurant) elaborate on the claim in the nucleus (assertion 1). Contrast relations hold between assertions 2 and 3, between 4 and 5, and between 6 and 7. One way to communicate the elaboration relation is to structure the comparison so that all the satellites are grouped together, following the nucleus. To communicate the contrast relation, the satellites are produced in a fixed order, with a parallel structure maintained across options (Prevost, 1995; Prince, 1985). The satellites are initially ordered in terms of their evidential strength, but then are reordered to allow for aggregation. Figure 7 illustrates aggregation for comparisons with varying numbers of assertions.
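The grouping step described above for recommendations, in which assertions that share a subject and verb are merged into a single coordinated clause, can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions, not the template-based generator's code: the function names `coordinate` and `aggregate` and the (subject, verb, object) triple format are hypothetical.

```python
from collections import OrderedDict


def coordinate(items):
    """Join phrases as a list: 'a', 'a and b', 'a, b and c'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def aggregate(triples):
    """Merge (subject, verb, object) triples that share subject and verb."""
    groups = OrderedDict()
    for subject, verb, obj in triples:
        groups.setdefault((subject, verb), []).append(obj)
    return [f"{subject} {verb} {coordinate(objs)}."
            for (subject, verb), objs in groups.items()]


triples = [
    ("Komodo", "has", "very good service"),
    ("Komodo", "has", "very good food quality"),
    ("Komodo", "has", "good decor"),
]
sentences = aggregate(triples)
print(sentences)
```

The three assertions collapse into one sentence with a coordinated list, which is the pattern visible in the longer outputs of Figure 5.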

Z | Output
1.5 | Komodo has the best overall value among the selected restaurants. Komodo's a Japanese, Latin American restaurant.
0.7 | Komodo has the best overall value among the selected restaurants. Komodo's a Japanese, Latin American restaurant.
0.3 | Komodo has the best overall value among the selected restaurants. Komodo's price is $29. It's a Japanese, Latin American restaurant.
-0.5 | Komodo has the best overall value among the selected restaurants. Komodo's price is $29 and it has very good service. It's a Japanese, Latin American restaurant.
-0.7 | Komodo has the best overall value among the selected restaurants. Komodo's price is $29 and it has very good service and very good food quality. It's a Japanese, Latin American restaurant.
-1.5 | Komodo has the best overall value among the selected restaurants. Komodo's price is $29 and it has very good service, very good food quality and good decor. It's a Japanese, Latin American restaurant.

Figure 5: Recommendations for the East Village Japanese Task, for different settings of the conciseness parameter Z.

strategy: compare3
items: Above, Carmine's
relations: elaboration(nuc:1, sat:2); elaboration(nuc:1, sat:3); elaboration(nuc:1, sat:4); elaboration(nuc:1, sat:5); elaboration(nuc:1, sat:6); elaboration(nuc:1, sat:7); contrast(nuc:2, nuc:3); contrast(nuc:4, nuc:5); contrast(nuc:6, nuc:7)
content:
1. assert(exceptional(Above, Carmine's))
2. assert(has-att(Above, decor(good)))
3. assert(has-att(Carmine's, decor(decent)))
4. assert(has-att(Above, service(good)))
5. assert(has-att(Carmine's, service(good)))
6. assert(has-att(Above, cuisine(New American)))
7. assert(has-att(Carmine's, cuisine(Italian)))

Figure 6: A content plan for a comparison.
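One way to picture the parallel structure that the comparison template maintains across options is to encode the satellites of Figure 6 and list them per restaurant in a single fixed attribute order. The sketch below does this and also checks that each contrast relation in the plan links the same attribute of two different restaurants. The tuple encoding, the particular attribute order, and the helper name `parallel_order` are assumptions made for illustration; the text states only that the order is fixed and parallel, not which order is used.

```python
# Satellites of the Figure 6 content plan: id -> (restaurant, attribute, value).
satellites = {
    2: ("Above", "decor", "good"),
    3: ("Carmine's", "decor", "decent"),
    4: ("Above", "service", "good"),
    5: ("Carmine's", "service", "good"),
    6: ("Above", "cuisine", "New American"),
    7: ("Carmine's", "cuisine", "Italian"),
}
# Contrast relations as listed in Figure 6.
contrasts = [(2, 3), (4, 5), (6, 7)]

ATTRIBUTE_ORDER = ["decor", "service", "cuisine"]  # hypothetical fixed order


def parallel_order(sats, restaurants, attribute_order):
    """List each restaurant's assertion ids in the same attribute order."""
    by_key = {(r, a): i for i, (r, a, _) in sats.items()}
    return {r: [by_key[(r, a)] for a in attribute_order if (r, a) in by_key]
            for r in restaurants}


order = parallel_order(satellites, ["Above", "Carmine's"], ATTRIBUTE_ORDER)
print(order)

# True when every contrast pair shares an attribute but not a restaurant.
contrasts_are_parallel = all(
    satellites[left][1] == satellites[right][1]
    and satellites[left][0] != satellites[right][0]
    for left, right in contrasts
)
print(contrasts_are_parallel)
```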

3.3 SPaRKy

Like the template-based generator, SPaRKy takes as input any of the content plans produced by SPUR. Figure 2 shows that SPaRKy has two modules: the sentence plan generator (SPG), and the sentence plan ranker (SPR). The SPG uses a set of clause-combining operations (Figure 12); it produces a large set of alternative realizations of an input content plan (see Figure 1). The SPR ranks the alternative realizations using a model learned from users' ratings of a training set of content plans. The SPG is described in Section 4. The features used to train the SPR are described in Section 5; the procedure for training the SPR is described in Section 6.

Z | Output
1.5 | Among the selected restaurants, the following offer exceptional overall value. Komodo has very good service.
0.7 | Among the selected restaurants, the following offer exceptional overall value. Komodo has very good service and good decor.
0.3 | Among the selected restaurants, the following offer exceptional overall value. Komodo's price is $29. It has very good food quality, very good service and good decor. Takahachi's price is $27. It has very good food quality, good service and decent decor.
-0.5 | Among the selected restaurants, the following offer exceptional overall value. Komodo's price is $29. It has very good food quality, very good service and good decor. Takahachi's price is $27. It has very good food quality, good service and decent decor. Japonica's price is $37. It has excellent food quality, good service and decent decor.
-0.7 | Among the selected restaurants, the following offer exceptional overall value. Komodo's price is $29. It has very good food quality, very good service and good decor. Takahachi's price is $27. It has very good food quality, good service and decent decor. Japonica's price is $37. It has excellent food quality, good service and decent decor. Shabu-Tatsu's price is $31. It has very good food quality, good service and decent decor.
-1.5 | Among the selected restaurants, the following offer exceptional overall value. Komodo's price is $29. It has very good food quality, very good service and good decor. Takahachi's price is $27. It has very good food quality, good service and decent decor. Japonica's price is $37. It has excellent food quality, good service and decent decor. Shabu-Tatsu's price is $31. It has very good food quality, good service and decent decor. Bond Street's price is $51. It has excellent food quality, good service and very good decor. Dojo's price is $14. It has decent food quality, mediocre service and mediocre decor.

Figure 7: Comparisons for the East Village Japanese Task, for different settings of the conciseness parameter Z.

Because SPaRKy is trained using user feedback, rather than being handcrafted, it can be trained to be an individualized spoken language generator. As discussed above, the feedback from the two users in Figure 1 suggests that each user has different perceptions as to the quality of the potential realizations. A significant part of Sections 7 and 8 is dedicated to examining the differences between a model trained on averaged feedback, shown as AVG in Figure 1, and those trained on individual feedback from users A and B.
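The contrast between averaged and individual feedback can be previewed directly on the ratings in Figure 1. The sketch below simply looks up which alternatives each user rated highest and which score highest on the average; it does not train any model, and the variable and function names are hypothetical.

```python
# Figure 1 ratings: alternative id -> (User A, User B).
ratings = {
    6: (1, 4), 7: (2, 5), 4: (2, 4), 9: (2, 4), 5: (3, 2),
    3: (3, 3), 10: (3, 3), 2: (4, 4), 1: (4, 3), 8: (4, 2),
}


def top_alternatives(scores):
    """Return the sorted ids of all alternatives with the maximum score."""
    best = max(scores.values())
    return sorted(alt for alt, s in scores.items() if s == best)


user_a = {alt: a for alt, (a, b) in ratings.items()}
user_b = {alt: b for alt, (a, b) in ratings.items()}
average = {alt: (a + b) / 2 for alt, (a, b) in ratings.items()}

top_a = top_alternatives(user_a)
top_b = top_alternatives(user_b)
top_avg = top_alternatives(average)
print("User A:", top_a)
print("User B:", top_b)
print("AVG:   ", top_avg)
```

User B's single top-rated alternative (7) is not among User A's top-rated ones (1, 2 and 8), and the average favors alternative 2; a ranker fitted to the average would therefore miss User B's first choice on this example.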

4. Sentence Plan Generation

The input to SPaRKy’s SPG is a content plan from SPUR. Content plans for a sample recommendation and comparison were in Figure 4 and Figure 6. Figure 1 shows alternative SPaRKy realizations for the recommendation in Figure 4, while Figure 8 shows alternative SPaRKy realizations for the comparison in Figure 6. Content plans specify which assertions to include in an information presentation, and the rhetorical relations holding between them, but not the order of assertions or how to express the rhetorical relations between them. This task is known as discourse planning. The SPG has two stages of processing; first it does discourse planning, and then it does sentence planning.

Discourse planning algorithms can be characterized as: schema-based (McKeown, 1985; Kittredge, Korelsky, & Rambow, 1991); top-down algorithms using plan operators (Moore & Paris, 1993); or bottom-up approaches that use, for example, constraint satisfaction algorithms (Marcu, 1996, 1997) or genetic algorithms (Mellish, O’Donnell, Oberlander, & Knott, 1998). In SPaRKy, the SPG takes a bottom-up approach to discourse planning using principles from Centering Theory (Grosz, Joshi, & Weinstein, 1995). Content items are grouped because they talk about the same thing, but the linear order between and among the groupings is left unspecified. The centering constraints have the result that alternatives such as Alt-25 in Figure 8, which repeatedly changes the discourse center, are never generated.

Alt 11 (A: 2, B: 2, AVG: 2): Above and Carmine’s offer exceptional value among the selected restaurants. Above, which is a New American restaurant, with good decor, has good service. Carmine’s, which is an Italian restaurant, with good service, has decent decor.

Alt 12 (A: 3, B: 2, AVG: 2.5): Above and Carmine’s offer exceptional value among the selected restaurants. Above has good decor, and Carmine’s has decent decor. Above and Carmine’s have good service. Above is a New American restaurant. On the other hand, Carmine’s is an Italian restaurant.

Alt 13 (A: 3, B: 3, AVG: 3): Above and Carmine’s offer exceptional value among the selected restaurants. Above is a New American restaurant. It has good decor. It has good service. Carmine’s, which is an Italian restaurant, has decent decor and good service.

Alt 14 (A: 4, B: 5, AVG: 4.5): Above and Carmine’s offer exceptional value among the selected restaurants. Above has good decor while Carmine’s has decent decor, and Above and Carmine’s have good service. Above is a New American restaurant while Carmine’s is an Italian restaurant.

Alt 20 (A: 2, B: 3, AVG: 2.5): Above and Carmine’s offer exceptional value among the selected restaurants. Carmine’s has decent decor but Above has good decor, and Carmine’s and Above have good service. Carmine’s is an Italian restaurant. Above, however, is a New American restaurant.

Alt 25 (A: NR, B: NR, AVG: NR): Above and Carmine’s offer exceptional value among the selected restaurants. Above has good decor. Carmine’s is an Italian restaurant. Above has good service. Carmine’s has decent decor. Above is a New American restaurant. Carmine’s has good service.

Figure 8: Some alternative realizations for the compare-3 plan in Figure 6, with feedback from Users A and B, and the mean (AVG) of their feedback (1=worst and 5=best). NR = Not generated or ranked.

[Tree diagram. Interior nodes: justify, infer. Leaves: nucleus <1> assert-reco-best; satellite <2> assert-reco-cuisine; satellite <3> assert-reco-food-quality; satellite <4> assert-reco-service; satellite <5> assert-reco-price.]

Figure 9: A tp-tree for the plan of Figure 4, used to generate Alternatives 1, 3, 4, 5, 6, 7 and 10 in Figure 1.

The discourse planning stage produces one or more text-plan trees (tp-trees). A tp-tree for the recommend plan in Figure 4 is in Figure 9, and tp-trees for the compare-3 plan in Figure 6 are in Figure 10. In a tp-tree, each leaf represents a single assertion and is labeled with a speech act. Interior nodes are labeled with rhetorical relations. In addition to the rhetorical relations in the content plan, the SPG uses the relation infer for combinations of speech acts for which there is no rhetorical relation expressed in the content plan (Marcu, 1997). The infer relation is similar to the joint relation in RST; it joins multiple satellites in a mononuclear relation or the nuclei in a multinuclear relation.

[Two tree diagrams. Both have the leaves nucleus <1> assert-com-list_exceptional; nucleus <2> and <3> assert-com-decor; nucleus <4> and <5> assert-com-service; nucleus <6> and <7> assert-com-cuisine. Interior nodes of the top tree: elaboration, infer, and three contrast nodes. Interior nodes of the bottom tree: elaboration, contrast, and two infer nodes.]

Figure 10: Tp-trees for the comparisons shown as alternatives 12 and 14 (top) and alternatives 11 and 13 (bottom) in Figure 8.

Each simple assertion, or leaf, in a tp-tree is associated with one or more syntactic realizations (d-trees), using a dependency tree representation, called DSyntS (Figure 11) (Mel’čuk, 1988; Lavoie & Rambow, 1997). The association between the simple assertions and any potential d-trees specifying their syntactic realizations is specified in a hand-crafted generation dictionary. Leaves of some d-trees in the generation dictionary are variables, which are instantiated from the content plan, e.g. Thai replaces a cuisine type variable.

During sentence planning, the SPG assigns assertions to sentences, orders the sentences, inserts discourse cues, and performs referring expression generation. It uses a set of clause-combining operations that operate on tp-trees and incrementally transform the elementary d-trees associated with their leaves into a single lexico-structural representation. The output from this process is two parallel structures: (1) a sentence plan tree (sp-tree), a binary tree with leaves labeled with the assertions from the input tp-tree, and interior nodes labeled with clause-combining operations; and (2) one or more d-trees which reflect parallel operations on the predicate-argument representations.

The clause-combining operations are general operations similar to aggregation operations used in other research (Rambow & Korelsky, 1992; Danlos, 2000). The operations and examples of their use are given in Figure 12. They are applied in a bottom-up left-to-right fashion, with the choice of operation constrained by the rhetorical relation holding between the assertions to be combined (Scott & de Souza, 1990), as specified in Figure 12.

assert-com-cuisine
    BE3 [class:verb]
    (
        I Chanpen Thai [number:sg class:proper noun article:no-art person:3rd]
        II restaurant [class:common noun article:indef]
        (
            Thai [class:adjective]
        )
    )

assert-com-food quality
    HAVE1 [class:verb]
    (
        I Chanpen Thai [number:sg class:proper noun article:no-art person:3rd]
        II quality [class:common noun article:no-art]
        (
            ATTR good [class:adjective]
            ATTR food [class:common noun]
        )
    )

Figure 11: Example d-trees from the generation dictionary used by the SPG.

In addition to ordering assertions, a clause-combining operation may insert cue words between assertions. Figure 13 gives the list of cue words used by the SPG. The choice of cue-word is determined by the type of rhetorical relation.

The SPG generates a random sample of possible sp-trees for each tp-tree, up to a pre-specified number of sp-trees, by randomly selecting among the clause-combining operations according to a probability distribution that favors preferred operations. Figure 14 shows the probability distribution used in our experiments, which is hand-crafted based on assumed preferences for operations such as merge, relative-clause and with-reduction, and is one way in which some knowledge can be injected into the random process to bias it towards producing higher quality sentence plans.

The SPG handles referring expression generation by converting a proper name to a pronoun when the same proper name appears in the previous utterance. Referring expression generation rules are applied locally, across adjacent utterances, rather than globally across the entire presentation at once (Brennan, Friedman, & Pollard, 1987). Referring expressions are manipulated in the d-trees, either intrasententially during the incremental creation of the sp-tree, or intersententially, if the full sp-tree contains any period operations. The third and fourth sentences for Alt 13 in Figure 8 show the conversion of a named restaurant (Above) to a pronoun.

Merge (Rel: infer or contrast)
    Description: Two clauses can be combined if they have identical matrix verbs and identical arguments and adjuncts except one. The non-identical arguments are coordinated.
    Sample 1st arg: Chanpen Thai has good service.
    Sample 2nd arg: Chanpen Thai has good food quality.
    Result: Chanpen Thai has good service and good food quality.

With-reduction (Rel: justify or infer)
    Description: Two clauses with identical subject arguments can be identified if one of the clauses has a have-possession matrix verb. The possession clause undergoes with-participial clause formation and is attached to the non-reduced clause.
    Sample 1st arg: Chanpen Thai is a Thai restaurant.
    Sample 2nd arg: Chanpen Thai has good food quality.
    Result: Chanpen Thai is a Thai restaurant, with good food quality.

Relative-clause (Rel: justify or infer)
    Description: Two clauses with an identical subject can be identified. One clause is attached to the subject of the other clause as a relative clause.
    Sample 1st arg: Chanpen Thai has the best overall quality among the selected restaurants.
    Sample 2nd arg: Chanpen Thai is located in Midtown West.
    Result: Chanpen Thai, which is located in Midtown West, has the best overall quality among the selected restaurants.

Cue-word-conjunction (Rel: justify, infer or contrast)
    Description: Two clauses are conjoined with a cue word (coordinating or subordinating conjunction). The order of the arguments of the connective is determined by the order of the nucleus (N) and the satellite (S), yielding two distinct operations, cue-word-conjunction-ns and cue-word-conjunction-sn.
    Sample 1st arg: Chanpen Thai has the best overall quality among the selected restaurants.
    Sample 2nd arg: Chanpen Thai is a Thai restaurant, with good service.
    Result: Chanpen Thai has the best overall quality among the selected restaurants, since it is a Thai restaurant, with good service.

Cue-word-insertion (on the other hand) (Rel: contrast)
    Description: Cue-word insertion combines clauses by inserting a cue word at the start of the second clause (Carmine’s is an Italian restaurant. HOWEVER, Above is a New American restaurant), resulting in two separate sentences.
    Sample 1st arg: Penang has very good decor.
    Sample 2nd arg: Baluchi’s has mediocre decor.
    Result: Penang has very good decor. On the other hand, Baluchi’s has mediocre decor.

Period (Rel: justify, contrast, infer or elaboration)
    Description: Two clauses are joined by a period.
    Sample 1st arg: Chanpen Thai is a Thai restaurant, with good food quality.
    Sample 2nd arg: Chanpen Thai has good service.
    Result: Chanpen Thai is a Thai restaurant, with good food quality. It has good service.

Figure 12: Clause combining operations and examples.

2. An alternative approach is for the cue-word to impose a constraint on the rhetorical relation that must hold (Webber, Knott, Stone, & Joshi, 1999; Forbes, Miltsakaki, Prasad, Sarkar, Joshi, & Webber, 2003).
3. This probability distribution could be learned from a corpus (Marcu, 1997; Prasad, Joshi, Dinesh, Lee, & Miltsakaki, 2005).
4. If an infer relation holds and both clauses contain the have possession predicate, the second clause is arbitrarily selected for reduction. If a justify relation holds, it is the satellite of the RST relation that always undergoes reduction, if the syntactic constraints are satisfied.
5. If an infer relation holds, any clause is arbitrarily selected for reduction. If a justify relation holds, the clause that undergoes relative clause formation is the satellite clause. This is motivated by the fact that relative clause formation is generally seen to occur when the modifying relative clause provides additional information about the noun it modifies, but where the additional/elaborated information does not have the same informational status as the information in the main clause.

The sp-trees for Alts 6 and 8 in Figure 1 are shown in Figs. 15 and 16. Leaf labels are concise names for assertions in the content plan, e.g. assert-reco-best is the claim (labelled 1) in Figure 4. Because combination operations can switch the order of their arguments, from satellite before nucleus (SN) to nucleus before satellite (NS), the labels on the interior nodes indicate whether this occurred, and specify the rhetorical relation that the operation realizes. These labels keep track of the operations and substitutions used in constructing the tree and are subsequently used in the tree feature set described in Section 5, one of the feature sets tested for training the SPR. For example, the label at the root of the tree in Figure 15 (CW-SINCE-NS-justify) specifies that the cw-conjunction operation was used, with the since cue word, with the nucleus first (NS), to realize the justify relation. Similarly, the bottom left-most interior node (WITH-NS-infer) indicates that the with-reduction operation was used, with the nucleus before the satellite (NS), to realize the infer relation.

RST relation    Aggregation operator
justify         with-reduction, relative-clause, cue-word conj. because, cue-word conj. since, period
contrast        merge, cue-word insert. however, cue-word conj. while, cue-word conj. and, cue-word conj. but, cue-word insert. on the other hand, period
infer           merge, cue-word conj. and, period
elaboration     period

Figure 13: RST relation constraints on aggregation operators.

Aggregation operator                                                                                           Probability
merge, with-reduction, relative-clause                                                                         [...]
cue-word conj. because, cue-word conj. since, cue-word conj. while, cue-word conj. and, cue-word conj. but     [...]
cue-word insert. however, cue-word insert. on the other hand                                                   [...]
period                                                                                                         [...]

Figure 14: Probability distribution of aggregation operators. The final operation is randomly chosen from the selected set with a uniform distribution.

Figure 17 shows a d-tree for the content plan in Figure 4. This d-tree shows that the SPG treats the period operation as part of the lexico-structural representation for the d-tree. The d-tree is split into multiple d-trees at these nodes before being sent to RealPro for surface realization.

Note that a tp-tree can have very different realizations, depending on the operations of the SPG. For example, the tp-tree in Figure 9 yields both Alt 6 and Alt 2 in Figure 1. Alt 2 is highly rated, with an average human rating of 4. However, Alt 6 is a poor realization of this plan, with an average human rating of 2.5.

[Tree diagram. Interior nodes: CW-SINCE-NS-justify, CW-CONJUNCTION-infer (twice), WITH-NS-infer. Leaves: assert-reco-best, assert-reco-cuisine, assert-reco-food-quality, assert-reco-service, assert-reco-price.]

Figure 15: Sentence Plan Tree (SP-tree) for Alternative 6 of Figure 1.

[Tree diagram. Interior nodes: PERIOD-justify, PERIOD-infer (twice), WITH-NS-infer. Leaves: assert-reco-food-quality, assert-reco-cuisine, assert-reco-service, assert-reco-price, assert-reco-best.]

Figure 16: Sentence Plan Tree (SP-tree) for Alternative 8 of Figure 1.

[Dependency tree diagram. Node labels: PERIOD_justify, PERIOD_infer (twice), HAVE1 (twice), BE3 (twice), AMONG1, with, Champen_Thai, Champen_Thai’s, price, 24, dollar, good, service, food, quality, Thai, restaurant, selected, overall, best.]

Figure 17: Dependency tree for alternative 8 in Figure 1.

To summarize, SPaRKy’s SPG transforms an input content plan into a set of alternative pairs of sentence-plan trees and d-trees. First, assertions in the input content plan are grouped using principles from centering theory. Second, assertions are assigned to sentences and discourse cues inserted using clause combining operations. Third, decisions about the realization of referring expressions are made on the basis of recency. The rhetorical relations and clause-combining operations are domain-independent.
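The random choice of a clause-combining operation, constrained by the rhetorical relation, can be sketched as follows. This is a minimal illustration and not SPaRKy's implementation. The relation-to-operator table is transcribed from Figure 13. The operator identifiers, the group names and the group weights are hypothetical, since the probability values of Figure 14 are not available here. The two-step draw (pick a group, then pick uniformly inside it) is one reading of the Figure 14 caption.

```python
import random

# Operators licensed by each rhetorical relation (transcribed from Figure 13).
ALLOWED = {
    "justify": ["with-reduction", "relative-clause",
                "cw-conj-because", "cw-conj-since", "period"],
    "contrast": ["merge", "cw-insert-however", "cw-conj-while", "cw-conj-and",
                 "cw-conj-but", "cw-insert-on-the-other-hand", "period"],
    "infer": ["merge", "cw-conj-and", "period"],
    "elaboration": ["period"],
}

# Operator groups follow the rows of Figure 14; the group names are made up.
GROUPS = {
    "reduce": ["merge", "with-reduction", "relative-clause"],
    "conjunction": ["cw-conj-because", "cw-conj-since", "cw-conj-while",
                    "cw-conj-and", "cw-conj-but"],
    "insertion": ["cw-insert-however", "cw-insert-on-the-other-hand"],
    "period": ["period"],
}

# HYPOTHETICAL weights, chosen only to favor the "reduce" group.
GROUP_WEIGHTS = {"reduce": 0.5, "conjunction": 0.25,
                 "insertion": 0.15, "period": 0.10}


def sample_operation(relation, rng):
    """Draw one operator that is licensed for the given rhetorical relation."""
    allowed = set(ALLOWED[relation])
    # Keep, per group, only the operators that the relation permits.
    candidates = {}
    for group, ops in GROUPS.items():
        licensed = [op for op in ops if op in allowed]
        if licensed:
            candidates[group] = licensed
    groups = list(candidates)
    weights = [GROUP_WEIGHTS[g] for g in groups]
    group = rng.choices(groups, weights=weights, k=1)[0]
    # Uniform choice inside the selected group.
    return rng.choice(candidates[group])


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_operation("contrast", rng) for _ in range(5)])
```

Sampling one operation per interior node of a tp-tree, bottom-up, would yield one candidate sp-tree. Repeating the process up to the pre-specified limit gives the random sample of sp-trees described above.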

SPaRKy uses two types of domain-dependent knowledge: the probability distribution over clause-combining operations, and the d-trees that are input to the RealPro surface realizer. In order to use SPaRKy in a new domain, it might be necessary to:

• add new rhetorical relations if the content planner used additional rhetorical relations;
• modify the probability distribution over clause-combining operations, either by hand or by learning from a corpus;
• construct a new set of d-trees to capture the syntactic structure of sentences in the domain, unless we used a surface realizer that could take logical forms or semantic representations as input.

5. Feature Generation

To train or use the SPR, each potential realization generated by the SPG, along with its corresponding sp-tree and d-tree, is encoded as a set of real-valued features (binary features are modeled with values 0 and 1) from three feature types:

• N-Gram features – simple word n-gram features generated from the realization of SPG outputs;
• Concept features – concept n-gram features generated from named entities in the realization of SPG outputs;
• Tree features – these features represent structural configurations in the sp-trees and d-trees output by the SPG.

These features are automatically generated as described below.

N-gram features capture information about lexical selection and lexical ordering in the realizations output by SPaRKy. A two-step approach is used to generate these features. First, a domain-specific rule-based named-entity tagger (using MATCH’s lexicons for restaurant, cuisine type and location names) replaces specific tokens with their types, e.g. Babbo with restname. Then, unigram, bigram and trigram features and their counts are automatically generated. The tokens begin and end indicate the beginning and end of a realization. N-gram feature names are prefixed with n-gram. For example, ngram-cuisinename-restaurant-with counts the occurrences of cuisine type followed by “restaurant” and “with” (as in the realization “Italian restaurant with”); ngram-begin-restname-which counts occurrences of realizations starting with a restaurant’s name followed by “which”. We also count words per presentation, and per sentence in a presentation.
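The two-step n-gram extraction can be illustrated with a short sketch. It assumes a toy two-entry lexicon in place of MATCH's lexicons, whitespace tokenization, and single-word entity names. All three are simplifications introduced here, not details from the paper.

```python
from collections import Counter

# HYPOTHETICAL mini-lexicon standing in for MATCH's restaurant and cuisine lexicons.
LEXICON = {"babbo": "restname", "italian": "cuisinename"}


def ngram_features(realization, lexicon=LEXICON, max_n=3):
    """Count unigram, bigram and trigram features over a type-tagged realization."""
    # Step 1: crude tokenization, then replace known names with their types.
    tokens = [tok.strip(".,").lower() for tok in realization.split()]
    tokens = [lexicon.get(tok, tok) for tok in tokens if tok]
    # Mark the beginning and end of the realization.
    tokens = ["begin"] + tokens + ["end"]
    # Step 2: count every n-gram for n = 1..max_n.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts["ngram-" + "-".join(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    feats = ngram_features(
        "Babbo, which is an Italian restaurant with good service, has good decor."
    )
    print(feats["ngram-begin-restname-which"])
    print(feats["ngram-cuisinename-restaurant-with"])
```

The same counting loop would apply to the concept features if the token sequence were replaced by the sequence of concepts in the sentence plan.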

Concept features capture information about the concepts selected for a presentation, and their linear order in the realization. A two-step approach is used to generate these features. First, a named-entity tagger marks the names of items in our restaurant database, e.g. Uguale. Then, unigram, bigram and trigram features and their counts are automatically generated from the sequences of concepts in the sentence plan for the realization. As with the n-gram features, the tokens begin and end indicate the beginning and end of a realization. Concept feature names are prefixed with conc. For example, conc-decor-claim is set to 1 if the claim is expressed directly after information about decor, while the feature conc-begin-service characterizes utterances starting with information about service. In the concept n-gram features, we use ’*’ to separate individual features. We also count concepts per presentation, and per sentence in a presentation.

Tree features capture declaratively the way in which merge, infer and cue-word operations are applied to the tp-trees, and were inspired by the parsing features used by Collins (2000). They count the occurrences of certain structural linguistic configurations in the sp-trees and associated d-trees that the SPG generated. Tree feature names are prefixed with r for “rule” (sp-tree) or s for “sentence” (d-tree).

Several feature templates are used to generate tree features. Local feature templates record structural configurations local to a particular node (its ancestors, daughters etc.); global feature templates, used only for sp-tree features, record properties of the entire sp-tree. There are four types of local feature template: traversal features, sister features, ancestor features and leaf features. Traversal, sister and ancestor features are generated for all nodes in sp-trees and d-trees; leaf features are generated for sp-trees only. The value of each feature is the count of the described configuration in the tree. We discard features that occur fewer than 10 times to avoid those specific to particular content plans.

For each node in the tree, traversal features record the preorder traversal of the subtree rooted at that node, for all subtrees of all depths. Feature names are the concatenation of the prefix trav-, with the names of the nodes (starting with the current node) on the traversal path. ’*’ is used to separate node names. An example is r-trav-with-ns-infer*assert-reco-food-quality*assert-reco-cuisine (with value 1) of the bottom-left subtree in Figure 16.
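A sketch of traversal-feature extraction follows. It reads "all subtrees of all depths" as: for every interior node, emit the preorder traversal of its subtree truncated at each depth from 1 up to the subtree's height. That reading is an assumption made here. The three-leaf tree is a hypothetical fragment built around the WITH-NS-infer example from the text, and node labels are lowercased to match the feature names.

```python
from collections import Counter


class Node:
    """A labelled tree node; leaves have no children."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def height(node):
    """Number of edges on the longest path from node down to a leaf."""
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children)


def preorder(node, depth):
    """Preorder labels of the subtree at node, cut off below the given depth."""
    labels = [node.label]
    if depth > 0:
        for child in node.children:
            labels.extend(preorder(child, depth - 1))
    return labels


def all_nodes(node):
    yield node
    for child in node.children:
        yield from all_nodes(child)


def traversal_features(root, prefix="r"):
    """Count one feature per (interior node, truncation depth) pair."""
    feats = Counter()
    for node in all_nodes(root):
        for depth in range(1, height(node) + 1):
            feats[prefix + "-trav-" + "*".join(preorder(node, depth))] += 1
    return feats


if __name__ == "__main__":
    # HYPOTHETICAL fragment, not a tree taken from the paper's figures.
    fragment = Node("period-infer", [
        Node("with-ns-infer", [Node("assert-reco-food-quality"),
                               Node("assert-reco-cuisine")]),
        Node("assert-reco-service"),
    ])
    for name, value in sorted(traversal_features(fragment).items()):
        print(name, value)
```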

Sister features record all consecutive sister nodes. Names are the concatenation of the prefix sis-, with the names of the sister nodes. An example is r-sis-assert-reco-best*cw-conjunction-infer (with value 1) of the tree in Figure 15.

For each node in the tree, ancestor features record all the initial subpaths of the path from that node to the root. Feature names are the concatenation of the prefix anc- with the names of the nodes (starting with the current node). An example is r-anc-assert-reco-cuisine*with-ns-infer*cw-conjunction-infer (with value 1) of the tree in Figure 15.
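Sister and ancestor features can be sketched in the same style. The tree below is an assumed reconstruction of the Figure 15 sp-tree, inferred from the node labels and the example features quoted in the text. The original diagram may differ in detail. Trees are plain (label, children) tuples, and subpaths of length one are skipped, which is a choice made here.

```python
from collections import Counter


def leaf(label):
    return (label, [])


# ASSUMED reconstruction of the Figure 15 sp-tree (labels lowercased).
FIG15 = ("cw-since-ns-justify", [
    leaf("assert-reco-best"),
    ("cw-conjunction-infer", [
        ("with-ns-infer", [leaf("assert-reco-cuisine"),
                           leaf("assert-reco-food-quality")]),
        ("cw-conjunction-infer", [leaf("assert-reco-service"),
                                  leaf("assert-reco-price")]),
    ]),
])


def sister_features(tree, prefix="r"):
    """Count each pair of adjacent sister nodes."""
    _, children = tree
    feats = Counter()
    for left, right in zip(children, children[1:]):
        feats[f"{prefix}-sis-{left[0]}*{right[0]}"] += 1
    for child in children:
        feats.update(sister_features(child, prefix))
    return feats


def ancestor_features(tree, prefix="r", above=()):
    """Count each initial subpath of the path from a node up to the root."""
    label, children = tree
    path = (label,) + above  # current node first, then parent, ..., root
    feats = Counter()
    for k in range(2, len(path) + 1):
        feats[f"{prefix}-anc-" + "*".join(path[:k])] += 1
    for child in children:
        feats.update(ancestor_features(child, prefix, path))
    return feats


if __name__ == "__main__":
    print(sister_features(FIG15)["r-sis-assert-reco-best*cw-conjunction-infer"])
    print(ancestor_features(FIG15)[
        "r-anc-assert-reco-cuisine*with-ns-infer*cw-conjunction-infer"])
```

Under this reconstruction both example features from the text come out with value 1.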

Leaf features record all initial substrings of the frontier of the sp-tree. Names are the concatenation of the prefix leaf-, with the names of the frontier nodes (starting with the current node). For example, the sp-tree of Figure 15 has value 1 for leaf-assert-reco-best and also for leaf-assert-reco-best*leaf-assert-reco-cuisine, and the sp-tree of Figure 16 has value 1 for leaf-assert-reco-food-quality*assert-reco-cuisine.

Global features apply only to the sp-tree. They record, for each sp-tree and for each operation labeling a non-frontier node, (1) the minimal number of leaves dominated by a node labeled with that rule in that tree (MIN); (2) the maximal number of leaves dominated by a node labeled with that rule (MAX); and (3) the average number of leaves dominated by a node labeled with that rule (AVG). For example, the sp-tree in Figure 15 has value 4 for cw-conjunction-infer-max, value 2 for cw-conjunction-infer-min and value 3 for cw-conjunction-infer-avg.
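The leaf and global templates can be sketched as below. The tree is again an assumed reconstruction of the Figure 15 sp-tree, chosen to be consistent with the labels and example features quoted in the text. Leaf feature names here carry the leaf- prefix once, followed by '*'-separated leaf names, which is one of the two spellings that appear in the text.

```python
from collections import defaultdict


def leaf(label):
    return (label, [])


# ASSUMED reconstruction of the Figure 15 sp-tree (labels lowercased).
FIG15 = ("cw-since-ns-justify", [
    leaf("assert-reco-best"),
    ("cw-conjunction-infer", [
        ("with-ns-infer", [leaf("assert-reco-cuisine"),
                           leaf("assert-reco-food-quality")]),
        ("cw-conjunction-infer", [leaf("assert-reco-service"),
                                  leaf("assert-reco-price")]),
    ]),
])


def frontier(tree):
    """Leaf labels of the tree, left to right."""
    label, children = tree
    if not children:
        return [label]
    return [name for child in children for name in frontier(child)]


def leaf_features(tree):
    """One feature per initial substring of the frontier."""
    leaves = frontier(tree)
    return {"leaf-" + "*".join(leaves[:k]): 1 for k in range(1, len(leaves) + 1)}


def global_features(tree):
    """MIN, MAX and AVG number of leaves dominated by each operation label."""
    dominated = defaultdict(list)

    def count_leaves(node):
        label, children = node
        if not children:
            return 1
        total = sum(count_leaves(child) for child in children)
        dominated[label].append(total)
        return total

    count_leaves(tree)
    feats = {}
    for label, counts in dominated.items():
        feats[label + "-min"] = min(counts)
        feats[label + "-max"] = max(counts)
        feats[label + "-avg"] = sum(counts) / len(counts)
    return feats


if __name__ == "__main__":
    feats = global_features(FIG15)
    print(feats["cw-conjunction-infer-min"], feats["cw-conjunction-infer-avg"])
```

With this tree the two cw-conjunction-infer nodes dominate 2 and 4 leaves, which reproduces the minimum of 2 and the average of 3 given in the text.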

6. Training the Sentence Plan Ranker

The SPR ranks alternative information presentations using a model learned from user ratings of a set of training data. The training procedure is as follows:

• For each content plan in the training data, the SPG generates a set of alternative sentence plans using a random selection of sentence planning operators (Section 4);
• Features are automatically generated from the surface realizations and sentence plans so that each alternative sentence plan is represented in terms of a number of real-valued features (Section 5);
• Feedback as to the perceived quality of the realization of each alternative sentence plan is collected from one or more users;
• The RankBoost boosting method (Freund, Iyer, Schapire, & Singer, 1998) learns a function from the featural representation of each realization to its feedback, that attempts to duplicate the rankings in the training examples.

We use RankBoost for three reasons. First, it produces a ranking over the input alternatives rather than a selection of one best alternative. Second, it can handle many sparse features. Third, the function that it learns is a rule-based model showing the effect of each feature on the ranking of the competing examples. These models can be inspected and compared. This allows us to qualitatively analyze the models (Section 8) in order to understand the preferences of individuals, and the differences between SPRs for individuals vs. groups.

This section describes the training of the SPR in detail. The SPUR content planner produces content plans for three dialogue strategies:

• recommend: recommend an entity from a set of entities
• compare-2: compare two entities
• compare-3: compare three or more entities

For each dialogue strategy, we start with a set of 30 representative content plans from SPUR. The SPG was parameterized to produce up to 20 distinct (sp-tree, d-tree) pairs for each content plan. Each of these was realized by RealPro. Separately, we also obtained output for each content plan from our template-based generator (Section 3.2).

Both the SPaRKy realizations and the template-based realizations were randomly ordered and placed on a series of Web pages. These 1830 realizations were then rated on a scale from 1 to 5 by the first two authors of this paper, neither of whom had implemented the template-based realizer or the SPG. The raters worked on this rating task during sessions of one hour at a time for several hours a day, over a period of a week. They were instructed to look at all 21 realizations for a particular content plan before rating any of them, to try to use the whole rating scale, and to indicate their spontaneous rating without repeatedly re-labelling the alternative realizations. They did not discuss their ratings or the basis for their ratings at any time. Given the cognitive load and long duration of this rating task, it was impossible for the raters to keep track of which realizations came from SPaRKy and which from the template-based generator, and likely to be impossible to do more than generate a “gestalt” evaluation of each alternative.

Each (sp-tree, d-tree, realization) triple is an example input for RankBoost; the ratings are used as feedback. The experiments below examine two uses of the ratings. First, we train and test an SPR with the average of the ratings of the two users, i.e. we consider the two users as representing a single user group. Second, we train and test individualized SPRs, one for each user.

The SPR is trained using the RankBoost algorithm (Freund, Iyer, Schapire, & Singer, 1998), which we describe briefly here. First, the training corpus is converted into a set $\mathcal{T}$ of ordered pairs of examples $x, y$:

\[ \mathcal{T} = \{ (x, y) \mid x, y \text{ are alternatives for the same plan, } x \text{ is preferred to } y \text{ by user ratings} \} \]

Each alternative realization $x$ is represented by a set of $m$ indicator functions $h_s(x)$ for $1 \le s \le m$. The indicator functions are calculated by thresholding the feature values (counts) described in Section 5. For example, one indicator function is:

\[ h(x) = \begin{cases} 1 & \text{if leaf-assert-reco-best}(x) \ge 1 \\ 0 & \text{otherwise} \end{cases} \]

So $h(x) = 1$ if the leftmost leaf is the assertion of the claim as in Figure 15. A single parameter $\alpha_s$ is associated with each indicator function, and the “ranking score” for an example $x$ is calculated as

\[ F(x) = \sum_s \alpha_s h_s(x) \]

This score is used to rank competing sp-trees of the same content plan with the goal of duplicating the ranking found in the training data. Training is the process of setting the parameters $\alpha_s$ to minimize the following loss function:

\[ \mathit{RankLoss} = \frac{1}{|\mathcal{T}|} \sum_{(x,y) \in \mathcal{T}} \mathit{eval}(F(x) \le F(y)) \]

The eval function returns 1 if the ranking scores of the $(x, y)$ pair are misordered (so that $y$ is ranked at least as high as $x$ even though in the training data $x$ is preferred to $y$), and 0 otherwise. In other words, the RankLoss is the proportion of misordered pairs. As this loss function is minimized, the ranking errors (cases where ranking scores disagree with human judgments) are reduced. Initially all parameter values are set to zero. The optimization method then greedily picks a single parameter at a time – the parameter which will make the most impact on the loss function – and updates the parameter value to minimize the loss.

In the experiments described below, we use two evaluation metrics:

• RankLoss: The value of the training method’s loss function;
• TopRank: The difference between the human rating of the top realization for each content plan and the human rating of the realization that the SPR predicts to be the top ranked.
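The ranking score and the two evaluation metrics can be written out directly from the definitions above. The sketch below only evaluates a model that already exists. It does not implement RankBoost's greedy parameter updates, and the function names and toy numbers are illustrative assumptions.

```python
def indicator(count, threshold=1):
    """Threshold a feature count into a 0/1 indicator value h_s(x)."""
    return 1 if count >= threshold else 0


def ranking_score(indicators, alphas):
    """F(x): sum over s of alpha_s * h_s(x)."""
    return sum(a * h for a, h in zip(alphas, indicators))


def rank_loss(pairs, alphas):
    """Fraction of pairs (x, y), with x preferred to y, where F(x) <= F(y)."""
    if not pairs:
        return 0.0
    misordered = sum(
        1 for hx, hy in pairs
        if ranking_score(hx, alphas) <= ranking_score(hy, alphas)
    )
    return misordered / len(pairs)


def top_rank(human_ratings, spr_scores):
    """Best human rating minus the human rating of the SPR's top-scored item."""
    chosen = max(range(len(spr_scores)), key=spr_scores.__getitem__)
    return max(human_ratings) - human_ratings[chosen]


if __name__ == "__main__":
    alphas = [1.0, -0.5]                      # toy parameters
    pairs = [([1, 0], [0, 1]),                # correctly ordered
             ([0, 1], [1, 1]),                # misordered
             ([0, 0], [0, 0])]                # tie, counted as misordered
    print(rank_loss(pairs, alphas))
    print(top_rank([2, 5, 3], [0.1, 0.2, 0.9]))
```

Because the comparison is non-strict, a tie between two alternatives that the user ranked differently counts as an error. A TopRank of 0 means the SPR picked an alternative that the human rated highest.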

7. Quantitative Results

In this section, we describe three experiments with SPaRKy:

1. Feature sets for trainable sentence planning: We examine which features (n-gram, concept, tree, all) lead to the best performance for the sentence planning task, and find that n-gram features sometimes perform as well as all the features.

2. Comparison with template-based generation: We show that the performance of a trainable sentence planner using the best performing feature set is more consistent than that of a template-based generator, although overall a template-based generator still performs better.

3. Individualized sentence planners: We show that people have quite specific individual preferences regarding the three tasks of sentence planning: information ordering, sentence aggregation, and use of discourse cues; and furthermore, that a trainable sentence planner can model these individual preferences. Moreover we show that in some cases the individualized sentence planners are better than, or statistically indistinguishable from, the template-based generator.

We report results below separately for comparisons between two entities and among three or more entities. These two types of comparison are generated using different strategies in the SPG, and produce text that is very different both in terms of length and structure.

Using a cross-validation methodology, we repeatedly train the SPR on a random 90% of the corpus, and test on the remaining 10%. Here, we use the averaged feedback from user A and user B as feedback. Figure 18 repeats the examples in Figure 1, here showing both the user rankings and the rankings for a ranking function that was learned by the trained SPRs for both users A and B and for the AVG user.

Table 1 shows RankLoss for each feature set (Section 5). Paired t-tests comparing the ranking loss for different feature sets show surprisingly few performance differences among the features. Using all the features (All) always produces the best results, but the differences are not always significant.

The n-gram features give results comparable to all the features for both compare-2 and recommend. An analysis of the learned models suggests that one reason that n-gram features perform well is that there are individual lexical items that are uniquely associated with many of the combination operators, such as the lexical item with for the with-ns operator. This means that the detailed representations of the content and structure of an information presentation as represented by the tree features are equivalent to n-gram features in this application domain.

Figure 18: Some alternative realizations for the content plan in Figure 4, with feedback from users A and B (1=worst and 5=best) and rankings from the trained SPRs for users A and B and mean(A,B) ([0, […]. Columns: Alt, Realization, A, B, SPR_A, SPR_B, SPR_AVG.

We evaluated SPaRKy on the test sets by comparing three data points for each content plan: Human (the score of the best sentence plan that SPaRKy's SPG can produce as selected by the human users); SPaRKy (the score of the SPR's top-ranked selected sentence); and Random (the score of a sentence plan randomly selected from the alternative sentence plans). For all three presentation types, a paired t-test comparing SPaRKy to Human and to Random showed that SPaRKy was significantly better than Random (df = 59, p < .…) […] (df = 59, p < .…) […] SPaRKy scores and the Human scores indicates how much performance could be improved if the SPR were perfect at replicating the Human ratings.

6. The TopRank metric is sensitive to the distribution of ranking feedback and SPR scores in the test set, which means that it is sensitive to the number of cross-validation folds.

Feature set/Strategy   compare-2        compare-3        recommend
Random Baseline        […]              […]              […]
Concept                […] (p < .…)     […] (p < .…)     […] (p < .…)
N-Gram                 0.14 (p < .…)    […] (p < .…)     […] (p < .…)
Tree                   […] (p < .…)     […] (p < .…)     […] (p < .…)
All                    0.13             0.14             0.20

Table 1: AVG model's ranking error with different feature sets, for all strategies. Results are averaged over 10-fold cross-validation, testing over the mean feedback. p values in parentheses indicate the level of significance of the decrease in accuracy when compared to the model using all the features. Cases where different feature sets perform as well as all the features are marked in bold.

Table 2: […] recommend, compare-2 and compare-3 (N = 180), using all the features, for SPaRKy trained on AVG feedback, with standard deviations. Columns: User, Strategy, SPaRKy, Human, Random. Rows: AVG with recommend, compare-2 and compare-3.

Table 3: […] MATCH's template-based generator, SPaRKy (AVG) and Human. N = 180, with standard deviations. Columns: User, Strategy, SPaRKy, Human, Template. Rows: AVG with recommend, compare-2 and compare-3.

As described above, the raters also rated the single output of the template-based generator for MATCH for each content plan in the training data. Table 3 shows the mean TopRank scores for the template-based generator's output (Template), compared to the best plan the trained SPR selects (SPaRKy), and the best plan as selected by a human oracle (Human). In each fold, both SPaRKy and the Human oracle select the best of 10 sentence plans for each text plan, while the template-based generator produces a single output with a single human-rated score. A paired t-test comparing Human with Template shows that there are no significant differences between them for recommend or compare-3, but that Human is significantly better for compare-2 (df = 29, t = 4.…, p < .…) […] compare-2 template. A paired t-test comparing SPaRKy to Template shows that the template-based generator is significantly better for both recommend and compare-3 (df = 29, t = 2.…, p < .…) […] SPaRKy to be better for compare-2 (df = 29, t = 2.…, p = .…) […] SPaRKy, indicating that while the template-based generator performs well overall, it performs poorly on some inputs. One reason for this might be that SPUR's decision-theoretic user model selects a wide range and number of content items for different users, and for conciseness settings (see Figures 5 and 7). This means that it is difficult to handcraft a template-based generator to handle all the different cases well.

The gap between the Human scores (produced by the SPG but selected by a human rather than by the SPR) and the Template scores shows that the SPG produces sentence plans as good as those of the template-based generator, but the accuracy of the SPR needs to be improved. Below, Section 7.3 shows that when the SPR is trained for individuals, SPaRKy's performance is indistinguishable from the template-based generator in most cases.
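The three data points compared per content plan (Human, SPaRKy and Random) can be illustrated with a minimal sketch. It assumes each alternative sentence plan is available as a pair of a human rating and an SPR score; the function names and the toy data are hypothetical and are not taken from the SPaRKy implementation.

```python
import random
from statistics import mean


def top_rank_scores(plans, rng):
    """Return (human, sparky, rand) scores for one content plan.

    `plans` is a list of (human_rating, spr_score) pairs, one per
    alternative sentence plan (an assumed data layout).
    """
    ratings = [rating for rating, _ in plans]
    human = max(ratings)                         # oracle: best-rated plan
    sparky = max(plans, key=lambda p: p[1])[0]   # rating of the plan the ranker puts first
    rand = rng.choice(ratings)                   # rating of a randomly chosen plan
    return human, sparky, rand


def evaluate(content_plans, seed=0):
    """Average the three scores over all content plans."""
    rng = random.Random(seed)
    rows = [top_rank_scores(plans, rng) for plans in content_plans]
    human, sparky, rand = zip(*rows)
    return {"Human": mean(human), "SPaRKy": mean(sparky), "Random": mean(rand)}


# Hypothetical toy data: two content plans with three alternatives each.
toy = [
    [(4.0, 0.9), (2.0, 0.4), (5.0, 0.7)],
    [(3.0, 0.2), (3.5, 0.8), (1.0, 0.1)],
]
result = evaluate(toy)
print(result)
```

By construction the Human score is an upper bound on the other two, so the distance between Human and SPaRKy is the room left for improving the ranker.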

We discussed in Section 1 that the differences in the rating feedback from users A and B for competing realizations (see Figure 1) suggest that each user has different perceptions as to the quality of the potential realizations. To quantify the utility and the feasibility of training individualized SPRs, we first examine the feasibility of training models for individual users. The results in Table 1 are based on a corpus of 600 examples, rated by each user, which may involve too much effort for most users. We would like to know whether a high-performing individualized SPR can be trained from less labelled data. Figure 19 plots ranking error rates as a function of the amount of training data. This data suggests that error rates around 0.20 could be acquired with a much smaller training set, i.e. with a training set of around 120 examples, which is certainly more feasible.

recommend          A's model   B's model   AVG model
A's test data      0.17        0.52        0.29
B's test data      0.52        0.17        0.27
AVG's test data    0.31        0.31        0.20

Table 4: Ranking error for various configurations with the recommend strategy.

Figure 19: Variation of the testing error for both users as a function of the number of training utterances. (x-axis: number of sentences in the training set; y-axis: average test error; curves: User A, User B.)

compare-2          A's model   B's model   AVG model
A's test data      0.16        0.26        0.20
B's test data      0.23        0.11        0.13
AVG's test data    0.17        0.16        0.13

Table 5: Ranking error for various configurations with the compare-2 strategy.

compare-3          A's model   B's model   AVG model
A's test data      0.13        0.30        0.18
B's test data      0.26        0.14        0.18
AVG's test data    0.17        0.20        0.14

Table 6: Ranking error for various configurations with the compare-3 strategy.

We then examine if trained individualized SPRs are accurate. The results in Tables 4, 5 and 6 show RankLoss for several training and testing configurations for each strategy (using 10-fold cross-validation). We compare the two individualized models with models trained on A and B's mean feedback (AVG). For each model, we test on its own test data, and on test data for the other models. This shows how well a model might 'fit' if customizing an SPR to a new domain or user group. For example, if we train a model for recommendations using feedback from a group of users, and then deploy this system to an individual user, we might expect model fit differences similar to those in Table 4.

Of course, there may be strongly conflicting preferences in any group of users. For example, consider the differences in the ratings for users A and B and the average ratings in Figure 1. Alt-1 and Alt-7 are equivalent using the average feedback, but user A dislikes Alt-7 and likes Alt-1 and vice versa for user B. Column 3 of Table 4 shows that the average model, when used in an SPR for user A or user B, has a much higher ranking error (.29 and .27 respectively) than that of an SPR customized to user A (.17 error) or customized to user B (.17 error).

An examination of Tables 4, 5 and 6 shows that in general, there are striking differences between models trained and tested on one individual's feedback (RankLoss ranges from 0.11 to 0.17) and cross-tested models (RankLoss ranges from 0.13 to 0.52). Also, the average (AVG) models always perform more poorly for both users A and B than individually-tailored models.
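The cross-testing layout of Tables 4, 5 and 6 can be sketched as follows. The sketch assumes that ranking error is the fraction of candidate pairs with different human ratings that a model orders the wrong way round; that definition, the function names and the toy data are assumptions made for illustration and may differ in detail from the RankLoss used in the experiments.

```python
from itertools import combinations


def ranking_error(ratings, scores):
    """Fraction of pairs with different human ratings that the model
    scores in the wrong order (a score tie counts as half an error)."""
    errors, pairs = 0.0, 0
    for i, j in combinations(range(len(ratings)), 2):
        if ratings[i] == ratings[j]:
            continue  # no human preference to violate
        pairs += 1
        human_prefers_i = ratings[i] > ratings[j]
        if scores[i] == scores[j]:
            errors += 0.5
        elif (scores[i] > scores[j]) != human_prefers_i:
            errors += 1.0
    return errors / pairs if pairs else 0.0


def cross_test(models, test_sets):
    """models: name -> scoring function over a candidate.
    test_sets: name -> list of (candidates, ratings), one entry per content plan.
    Returns {(test_name, model_name): mean ranking error}."""
    table = {}
    for test_name, plans in test_sets.items():
        for model_name, score in models.items():
            errs = [ranking_error(r, [score(c) for c in cands]) for cands, r in plans]
            table[(test_name, model_name)] = sum(errs) / len(errs)
    return table


# Hypothetical toy setup: one feature, and two users with opposite preferences.
candidates = [{"claim_first": 1}, {"claim_first": 0}]
models = {
    "A": lambda c: -c["claim_first"],   # stand-in model that demotes claim-first plans
    "B": lambda c: c["claim_first"],    # stand-in model that promotes claim-first plans
}
test_sets = {
    "A": [(candidates, [2, 4])],
    "B": [(candidates, [5, 1])],
}
table = cross_test(models, test_sets)
print(table)
```

With strictly opposed preferences the off-diagonal cells reach an error of 1.0, which is worse than the 0.5 expected from a ranker that orders pairs at random.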
As a baseline for comparison, a model ranking sentence alternatives randomly produces an error rate of 0.5 on average; Table 4 shows that models trained on one user's data and tested on the other's can perform as badly as the random model baseline. This suggests that the differences in the users' ratings are not random noise.

In some cases, the average model also performs significantly worse than the individual models even when tested on feedback from the "average" user (the diagonal in Tables 4, 5 and 6). This suggests that in some cases it is harder to get a good model for the average user case, possibly because the feedback is more inconsistent. For recommendations, the performance of each individual model is significantly better than the average model (df = 9, t = 2.…, p < .…) […] compare-2 the average model is better than user A's (df = 9, t = 2.…, p < .…) […] (df = 9, t = 3.…, p < .…) […]

Table 7: […] SPaRKy as compared with MATCH's template-based generator as rated separately by Users A and B, and individual User A and User B Human Oracles. Standard deviations are in parentheses. N = 180. Columns: User, Strategy, SPaRKy, Human, Template. Rows: Users A and B, each with recommend, compare-2 and compare-3.

We can also compare the template-based generator to the individualized SPaRKy generators using the TopRank metric (see Table 7). All comparisons are done with paired t-tests using the Bonferroni adjustment for multiple comparisons.

For recommend, there are no significant differences between SPaRKy and Template for User A (df = 59, t = 2.…, p = .…) […] (df = 59, t = 1.…, p = .…) […] are also no significant differences for either user between Template and Human (df = 59, t < .…, p > .…) […]

[…] compare-2, there are large differences between Users A and B. User A appears to like the template for compare-2 (average rating is 4.2) while User B does not (average rating is 3.1). For User A, there are no significant differences between SPaRKy and Template (df = 59, t = 2.…, p = .…) […] (df = 59, t = 0.…, p = .…) […] SPaRKy to Template (df = 59, t = 7.…, p < .…) […]

[…] compare-3, there are also large differences between Users A and B. User A likes the template for compare-3 (average rating 3.9), and strongly prefers it to the individualized SPaRKy (average rating 3.1) (df = 59, t = 3.…, p < .…) […] SPaRKy (average rating 4.4) (df = 59, t = 1.…, p = .…) […] SPaRKy and Human scores, indicating that the performance of the SPR could be improved (df = 59, t = 3.…, p < .…) […] trainable sentence planning can produce output comparable to or better than that of a template-based generator, with less programming effort and more flexibility.
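The testing procedure named above, paired t-tests with a Bonferroni adjustment, can be sketched in a few lines. This is a generic illustration of the two textbook techniques using only the standard library; the scores and p-values in the example are invented, and converting t to a p-value is left to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(xs, ys):
    """Paired t statistic and degrees of freedom for two matched samples.

    Raises ZeroDivisionError if all pairwise differences are identical.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant at the adjusted level alpha / m,
    where m is the number of comparisons."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]


# Invented ratings for five content plans under two generators.
system_scores = [4, 5, 3, 4, 5]
template_scores = [3, 4, 3, 2, 4]
t, df = paired_t(system_scores, template_scores)

# Invented p-values for three comparisons; only the first survives adjustment.
significant = bonferroni([0.01, 0.03, 0.2])
print(t, df, significant)
```

Degrees of freedom equal the number of pairs minus one, so a reported df = 59 corresponds to 60 paired scores.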

8. Qualitative Analysis

An important aspect of RankBoost is that the learned models are expressed as rules: a qualitative examination of the learned models may highlight individual differences in linguistic preferences, and help us understand why SPaRKy's SPG can produce sentence plans that are better than those produced by the template-based generator, and why the individually trained SPRs usually select sentence plans that are as good as the templates. To qualitatively compare the learned ranking models for the individualized SPRs, we assess both which linguistic aspects of an utterance (which features) are important to an individual, and how important they are. We evaluate whether an individual is oriented towards a particular feature by examining which features' indicator functions h_s(x) have non-zero values. We evaluate how important a feature is to an individual by examining the magnitude of the parameters α_s.

There are two potential problems with this approach. The first problem is that the feature templates produce thousands of features, some of which are redundant, so that differences in each model's indicator functions can be spurious. Therefore, to allow more meaningful qualitative comparisons between models, one of a pair of perfectly correlated features is eliminated.

The second problem arises from RankBoost's greedy algorithm. The selection of which parameter α_s to set on any round of boosting is highly dependent on the training set, so that the models derived from a single episode of training are highly variable. To compare indicator functions independently of the training set, we adopt a bootstrapping method to identify a feature set for each user that is independent of a particular training episode. By repeatedly randomly selecting 10 alternatives for training and 10 for testing for each content plan, we created 50 different training sets for each user. We then average the α values of the features selected by RankBoost over these 50 training runs, and conduct experiments using only the 100 features for each user with the highest average α magnitude. In Section 8.1 we discuss differences in the types of feature that are selected by the bootstrapping algorithm just outlined. Section 8.2 discusses differences in models produced using the tree features for user A and user B, while Section 8.3 discusses differences between the average model and the individual models.

Model   Strategy     Tree   N-Gram   Concept   Leaf   Global
AVG     recommend    45     36       9         7      3
AVG     compare-2    37     46       12        1      4
AVG     compare-3    63     29       4         1      3
A       recommend    50     29       14        4      3
A       compare-2    35     51       10        3      1
A       compare-3    47     37       11        1      4
B       recommend    47     34       9         6      4
B       compare-2    45     36       13        1      5
B       compare-3    47     34       9         6      4

Table 8: Features in the top 100 with the highest average α for each user model.

The bootstrapping process selects a total of 100 features for each strategy and for each type of feedback (individual or averaged). We found differences in the features along both dimensions.

Table 8 shows the number of features of each type that were in the top 100 (averaged over 50 training runs). Only 9 features are shared by the three strategies for the AVG model; these shared features are usually n-gram features. For User A, 6 features are shared by the three strategies (mostly n-gram features). For User B, there are no features shared by the three strategies.

We also found that some features capture specific interactions between domain-specific content items and syntactic structure, which are difficult to model in a rule-based or template-based generator. An example is Rule (1) in Figure 20 which significantly lowers the ranking of any sentence plan in which neighborhood information (assert-reco-nbhd) is combined with subsequent content items via the with-ns operation. Among the bootstrapped features for the average user, 16 features for compare-2 count interactions between domain-specific content and syntactic structure. For compare-3, 22 features count such interactions, and the bootstrapped features for recommend include 39 such features. We examine some of the models derived from these features in detail below.
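The bootstrapped feature selection described in this section (average each feature's α over repeated training runs, then keep the features with the largest average magnitude) can be sketched as below. The sketch makes two labeled assumptions: a feature that a run did not select contributes 0 to its average, and "perfectly correlated" is simplified to "identical value columns". All names and toy values are hypothetical.

```python
from collections import defaultdict


def average_alphas(runs):
    """runs: list of dicts mapping feature name -> alpha, one per training run.
    Assumption: a feature missing from a run counts as alpha = 0 for that run."""
    totals = defaultdict(float)
    for model in runs:
        for feature, alpha in model.items():
            totals[feature] += alpha
    return {feature: total / len(runs) for feature, total in totals.items()}


def top_features(avg, k=100):
    """The k features with the largest average alpha magnitude."""
    return sorted(avg, key=lambda f: abs(avg[f]), reverse=True)[:k]


def drop_duplicates(columns):
    """Keep one feature from each group with identical value columns
    (a simplification of removing perfectly correlated features)."""
    seen, kept = set(), []
    for name, values in columns.items():
        key = tuple(values)
        if key not in seen:
            seen.add(key)
            kept.append(name)
    return kept


# Hypothetical output of three training runs (the paper uses 50).
runs = [
    {"with-ns": -1.0, "merge": 0.4},
    {"with-ns": -0.6},
    {"merge": 0.2, "period": 0.1},
]
avg = average_alphas(runs)
best = top_features(avg, k=2)

# Hypothetical feature columns over three sentence plans.
kept = drop_duplicates({"f1": [1, 0, 1], "f2": [1, 0, 1], "f3": [0, 1, 1]})
print(avg, best, kept)
```

Ranking by absolute value keeps strongly negative rules as well as strongly positive ones, which matters here because many of the rules discussed in the figures lower a plan's ranking.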

To further analyze individual linguistic preferences for information presentation strategies, we now qualitatively compare the two models for Users A and B. We believe that this qualitative analysis provides additional evidence that the differences in the users' ranking preferences are not random noise. We identify differences among the features selected by RankBoost, and their α values, using models derived using bootstrapping over the tree features only, since they are easier to interpret qualitatively. Of course many different models are possible. User A's model consists of 109 rules; a subset are in Figure 20. User B's model consists of 90 rules, a subset of which are shown in Figure 21. We first consider how the individual models account for the rating differences for Alt-6 and Alt-8 from Figure 1 (repeated in Figure 18 with ratings from the trained SPRs), and then discuss other differences.

N    Condition
1    r-anc-assert-reco-nbhd*with-ns-infer
2    cw-conjunction-infer-avg-leaves-under
3    r-anc-assert-reco*with-ns-infer*cw-conjunction-infer
4    leaf-assert-reco-best*assert-reco-price
5    cw-conjunction-infer-avg-leaves-under
6    r-trav-with-ns-infer*assert*assert
7    r-anc-cw-conjunction-infer*cw-conjunction-infer
8    with-ns-infer-min-leaves-under
9    r-anc-assert-reco*with-ns-infer
10   cw-conjunction-infer-max-leaves-under
11   r-trav-with-ns-infer*assert-reco*assert-reco
12   r-anc-assert*with-ns-infer
13   r-anc-with-ns-infer*relative-clause-infer
14   r-anc-assert*with-ns-infer*relative-clause-infer
15   cw-conjunction-infer-avg-leaves-under
16   r-anc-assert-reco-cuisine*with-ns-infer*period-infer
17   cw-conjunction-infer-avg-leaves-under
18   r-anc-assert-reco-food-quality*merge-infer
19   r-anc-assert-reco*merge-infer
20   r-anc-assert-reco-decor*merge-infer
21   r-anc-assert*merge-infer
22   r-trav-merge-infer
23   r-trav-with-ns-infer*assert-reco-service*assert-reco-food-quality
24   leaf-assert-reco-food-quality*assert-reco-cuisine
25   cw-conjunction-infer-avg-leaves-under
26   leaf-assert-reco-food-quality
27   s-trav-have1*propernoun-restaurant*II-quality*attr-among1
28   s-anc-attr-with*have1

Figure 20: A subset of rules and corresponding α values of User A's model, ordered by α.

N    Condition
1    r-sis-assert-reco-relative-clause-infer
2    r-sis-period-infer-assert-reco
3    r-anc-assert-reco-nbhd*with-ns-infer
4    r-anc-assert-reco*period-infer*period-infer
5    r-anc-assert-reco-food-quality*with-ns-infer*relative-clause-infer
6    r-anc-assert-reco-cuisine*with-ns-infer*relative-clause-infer
7    r-anc-assert-reco*period-infer
8    leaf-assert-reco-price
9    r-anc-assert*period-infer*period-infer
10   leaf-assert-reco-decor
11   r-anc-assert*relative-clause-infer*period-infer
12   r-trav-relative-clause-infer*assert-reco*with-ns-infer
13   cw-conjunction-infer-avg-leaves-under
14   cw-conjunction-infer-avg-leaves-under
15   cw-conjunction-infer-avg-leaves-under
16   r-anc-assert*relative-clause-infer*period-infer
17   leaf-assert-reco-service
18   s-trav-attr-with
19   r-anc-assert-reco-cuisine*with-ns-infer*cw-conjunction-infer
20   cw-conjunction-infer-avg-leaves-under
21   leaf-assert-reco-best
22   leaf-assert-reco-best*assert-reco-cuisine
23   cw-conjunction-infer-avg-leaves-under
24   r-trav-with-ns-infer*assert-reco-cuisine*assert-reco-food-quality

Figure 21: A subset of rules and corresponding α values of User B's model, ordered by α.

Comparing Alt-6 and Alt-8: Alt-6 is highly ranked by User B but not by User A. Alt-6 instantiates Rule 21 of Figure 21, expressing User B's preferences about linear order of the content. (Alt-6's sp-tree is in Figure 15.) Rule 21 increases the rating of examples in which the claim, i.e. assert-reco-best (Chanpen Thai has the best overall quality), is realized first. Thus, unlike user A, user B prefers the claim at the beginning of the utterance (the ordering of the claim is left unspecified by argumentation theory (Carenini & Moore, 2000)). Rule 22 increases the rating of examples in which the initial claim is immediately followed by the type of cuisine (assert-reco-cuisine). These rules interact with Rule 19 in Figure 21, which specifies a preference for information following assert-reco-cuisine to be combined via the with-ns operation, and then conjoined (cw-conjunction-infer) with additional evidence. Alt-6 also instantiates Rule 23 in User B's model, with an α value of .52 associated with multiple uses of the cw-conjunction-infer operation.

User A's low rating of Alt-6 arises from A's dislike of the with-ns operation (Rules 3, 8, 9, 11 and 12) and the cw-conjunction-infer operation (Rules 3, 5, 7, 10 and 15) in Figure 20. (Contrast User B's Rule 23 with User A's Rules 5 and 17.) Alt-6 also fails to instantiate A's preference for food quality and cuisine information to occur first (Rules 24 and 26). Finally, user A also prefers the claim assert-reco-best to be realized in its own sentence (Rule 27).

By contrast, Alt-8 is rated highly by User A but not by User B (see Figure 1). Even though Alt-8 instantiates the negatively evaluated with-ns operation (Rules 3, 8, 9 and 11 in Figure 20), there are no instances of cw-conjunction-infer (Rules 3, 5, 7, 10 and 15). Moreover, Alt-8 follows A's ordering preferences (Rules 24 and 26), which describe sp-trees with assert-reco-food-quality on the left frontier, and trees where it is followed by assert-reco-cuisine. (See Alt-8's sp-tree in Figure 16.) Rule 27 also increases the rating of Alt-8 with its large positive α reflecting the expression of the claim in its own sentence.

On the other hand, Alt-8 is rated poorly by User B; it violates B's preferences for linear order (remember that Rules 21 and 22 specify that B prefers the claim first, followed by cuisine information). Also, B's model has rules that radically decrease the ranking of examples using the period-infer operation (Rules 2, 4, 7 and 9).

Thus, Alt-6 and Alt-8 show that users A and B prefer different combination operators, and different ordering of content, e.g. B likes the claim first and A likes recommendations with food quality first followed by cuisine. As mentioned above, previous work on the generation of evaluative arguments states that the claim may appear first or last (Carenini & Moore, 2000). The relevant guideline for producing effective evaluative arguments states that "placing the main claim first helps users follow the line of reasoning, but delaying the claim until the end of the argument can also be effective if the user is likely to disagree with the claim." The template-based generator for MATCH always placed the claim first, but this analysis suggests that this may not be effective for user A.

Other similarities and differences: There are also individual differences in preferences for particular operations, and for specific content operation interactions. For example, User A's model demotes examples where the with-ns operation has been applied (Rules 3, 6 and 8 in Figure 20), while User B generally likes examples where with-ns has been used (Rule 18 in Figure 21). However, neither A nor B likes with-ns when used to combine other content with neighborhood information. In User A's model the α value is -1.26, while in User B's model the value is -0.50 (see Rule 1 in Figure 20 and Rule 3 in Figure 21). These rules capture a specific interaction in the sp-tree between domain-specific content and the with-ns-infer combination operation. Utterances instantiating these rules place information in an adjunctival with-clause following the clause realizing the restaurant's neighborhood. There is no constraint on the type of information in the with-clause. In utterance (1) below, the with-clause realizes the restaurant's food quality, whereas in (2) it contains information about the restaurant's service.

(1) Mont Blanc has very good service, its price is 34 dollars, and it is located in Midtown West, with good food quality. It has the best overall quality among the selected restaurants.

(2) Mont Blanc is located in Midtown West, with very good service, its price is 34 dollars, and it has good food quality. It has the best overall quality among the selected restaurants.

Moreover, both users like with-ns when it combines cuisine and food-quality information as in example (3) (Rule 23 in Figure 20 and Rule 24 in Figure 21).

(3) Komodo has the best overall quality among the selected restaurants since it is a Japanese, Latin American restaurant, with very good food quality, it has very good service, and its price is 29 dollars.

But User B radically reduces the rating of the cuisine, food-quality combination when it is combined with further information using the relative-clause-infer operation, as in example (4) (Rules 5 and 6 in Figure 21).

(4) Bond Street has very good decor. This Japanese, Sushi restaurant, with excellent food quality, has good service. It has the best overall quality among the selected restaurants.

Example (4) is an interesting contrast with example (3). Example (4) instantiates Rule 24 in Figure 21, but it also instantiates a number of negatively valued features. As discussed above, User B prefers examples where the claim is expressed first (Rule 21 in Figure 21), and User B's model explicitly reduces the rating of examples where information about decor is expressed first (Rule 10 in Figure 21).

In general, User A likes the merge-infer operation (Rules 19, 21 and 22), especially when applied with assert-reco-food-quality (Rule 18), and assert-reco-decor (Rule 20). User A strongly prefers to hear about food quality first (Rule 26 in Figure 20), followed by cuisine information (Rule 24). In contrast, User B has rules that reduce the rating of examples with price or decor first (Rules 8 and 10 in Figure 21). User B also has no preferences for merge-infer but likes the cw-conjunction operation (Rule 20 in Figure 21).
Finally, User B dislikes the relative-clause-infer operation in general (Rule 1), and its combination with the with-ns operation (Rule 12) or the period-infer operation (Rule 11).

In addition to other evidence discussed above as to individual differences in language generation, we believe that the fact that these model differences are interpretable shows that the differences in user perception of the quality of system utterances are true individual differences, and not random noise.

Figure 22 shows a subset of rules that have the largest α magnitudes for an example AVG model using the same 100 feature bootstrapping process described above. Section 7.3 presented results that the average model performs statistically worse for recommendations than either of the individual models. This may be due to the fact that the average model is essentially trying to learn from contradictory feedback from the two users. To see whether an examination of the models provides support for this hypothesis, we first examine how the learned model ranks Alt-6 and Alt-8 as shown in Figure 18 in the column SPR_AVG. The average feedback for Alt-6 is 2.5 while the average feedback for Alt-8 is 3, but the trained SPR ranks Alt-8 second highest and Alt-6 fifth out of 10.

The mid-value ranking of Alt-6 arises from a number of interacting rules, some of which are similar to User B's and some of which are similar to User A's. Alt-6 instantiates Rules 26 and 27 in Figure 22 which increase the ranking of sentence plans in which the claim, i.e. assert-reco-best, is realized first, and sentence plans where the claim is immediately followed by information about the type of cuisine (assert-reco-cuisine). These rules are identical to B's Rules 21 and 22 in Figure 21. Rule 18 additionally increases the ranking of sentence plans where cuisine information is followed by service information, which applies to Alt-6 to further increase its ranking.
However, Rule 3 lowers the ranking of Alt-6, since it combines more than 3 different assertions into a single DSyntS tree.

Alt-8 is highly ranked by SPR_AVG, largely as a result of several rules that increase its ranking. Rule 31 specifies an increase in ranking for sentence plans that have the claim in its own sentence, which is true of Alt-8 but not of Alt-6. This rule also appears as Rule 27 in A's model in Figure 20. Alt-8 also instantiates Rules 21 and 29, which are identical to user A's ordering preferences (Rules 24 and 26 in Figure 20). These rules describe sp-trees with assert-reco-food-quality on the left frontier, and trees where it is followed by assert-reco-cuisine. (See Alt-8's sp-tree in Figure 16.) Rule 3 also applies to Alt-8, reducing its ranking due to the number of content items it realizes.

N    Condition                                                                       α_s
1    s-anc-attr-with*locate ≥ −∞                                                     -0.87
2    s-trav-have1*i-restaurant*cuisine-type*ii-quality*attr-good*attr-food ≥ −∞      -0.81
3    s-trav-propernoun-restaurant ≥ […]
4    r-anc-cw-conjunction-infer*cw-conjunction-infer*period-justify ≥ −∞             -0.77
5    r-sis-assert-reco-relative-clause-infer ≥ −∞                                    -0.74
6    r-anc-assert-reco-decor*with-ns-infer*period-infer*period-justify ≥ −∞          -0.62
7    r-anc-assert-reco-cuisine*with-ns-infer*relative-clause-infer ≥ −∞              -0.62
8    period-justify-avg-leaves-under ≥ […]
9    cw-conjunction-infer-avg-leaves-under ≥ […]
10   r-anc-assert-reco-nbhd*with-ns-infer ≥ −∞                                       -0.45
11   r-sis-cw-conjunction-infer-relative-clause-infer ≥ −∞                           -0.40
12   period-infer-avg-leaves-under ≥ […]
13   r-anc-assert-reco-food-quality*merge-infer ≥ −∞
14   s-trav-propernoun-restaurant ≥ […]
15   r-anc-assert-reco-decor*merge-infer ≥ −∞
16   s-anc-attr-with*i-restaurant*have1 ≥ −∞
17   r-anc-assert-reco-decor*with-ns-infer ≥ −∞
18   leaf-assert-reco-best*assert-reco-cuisine*assert-reco-service ≥ −∞
19   s-trav-propernoun-restaurant ≥ […]
20   r-anc-assert-reco-cuisine*with-ns-infer*period-infer*period-justify ≥ −∞
21   leaf-assert-reco-food-quality ≥ −∞
22   period-infer-avg-leaves-under ≥ […]
23   r-sis-merge-infer-assert-reco ≥ −∞
24   period-justify-avg-leaves-under ≥ […]
25   s-anc-attr-with*have1 ≥ −∞
26   leaf-assert-reco-best*assert-reco-cuisine ≥ −∞
27   leaf-assert-reco-best ≥ −∞
28   merge-infer-max-leaves-under ≥ −∞
29   leaf-assert-reco-food-quality*assert-reco-cuisine ≥ −∞
30   merge-infer-max-leaves-under ≥ […]
31   s-trav-have1*propernoun-restaurant*ii-quality*attr-among1 ≥ −∞

Figure 22: A subset of the rules with the largest α magnitudes that were learned for ranking recommendations given AVG feedback.

Other similarities and differences: There are many rules in the average model that are similar to either A or B's models or both, and the average model retains a number of preferences seen in the individual models. For example, Rules 1 and 10 both reduce the ranking of any sentence plan where neighborhood information is combined with subsequent information using the with-ns combination operator. Rule 1 expresses this in terms of the lexical items in the d-tree, whereas Rule 10 expresses it in terms of semantic features derived from the sp-tree. Examples 1 and 2 in Section 8.2 illustrate this interaction.

Some of the rules are more similar to User A. For example, Rules 4 and 9 (like A's Rules 2 and 5 in Figure 20) reduce the rating of sentence plans that use the operation cw-conj-infer. In addition, Rules 22, 23, 24, and 28 express preferences for merging information, which are very similar to A's Rules 19, 21 and 22. Rule 15 expresses a preference for information about the atmosphere (assert-reco-decor) to be combined using the merge operation, as specified in A's Rule 20. Rule 20 in Figure 22 is also similar to A's Rule 16 with assert-reco-cuisine combined with subsequent information with the with-ns operation.

Other rules are more similar to B's model. For example, Rule 5 reduces the ranking of sentence plans using the relative clause operation, which was also specified in User B's Rule 1, and Rules 16 and 25 indicate a general preference for use of the with-ns operation, which was a strong preference in User B's model (see B's Rule 18 in Figure 21).

Note that in some cases, the learned model tries to account for both A's and B's preferences, even when these contradict one another. For example, Rule 27 specifies a preference for the claim to come first, as in B's Rule 21, whereas Rule 29 is the same as A's Rule 24, specifying a preference for food quality and cuisine information to be expressed first. Thus the model does suggest that a reduction in performance may arise from trying to account for the contradictory preferences of users A and B.
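The way such rules combine into a ranking score can be illustrated with a small sketch. It assumes each rule fires when a feature count meets a threshold and that a plan's score is the sum of the α values of the rules that fire. The three α values (-1.26, -0.50 and .52) are the ones quoted in the text for the neighborhood/with-ns rules and for User B's cw-conjunction rule; the thresholds, the `Rule` class and the example plan are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    feature: str      # feature name as it appears in the rule figures
    threshold: float  # hypothetical threshold; not recoverable from the text
    alpha: float      # weight added to the score when the rule fires


def score(plan_features, rules):
    """Sum alpha over the rules whose feature value meets the threshold.
    Features absent from the plan are treated as 0 (an assumption)."""
    return sum(r.alpha for r in rules
               if plan_features.get(r.feature, 0) >= r.threshold)


NBHD_WITH = "r-anc-assert-reco-nbhd*with-ns-infer"
CW_AVG = "cw-conjunction-infer-avg-leaves-under"

# One-rule and two-rule stand-ins for the users' models.
user_a = [Rule(NBHD_WITH, 1, -1.26)]
user_b = [Rule(NBHD_WITH, 1, -0.50), Rule(CW_AVG, 2, 0.52)]

# Hypothetical sentence plan: neighborhood joined by with-ns, plus conjunctions.
plan = {NBHD_WITH: 1, CW_AVG: 2.5}
score_a = score(plan, user_a)
score_b = score(plan, user_b)
print(score_a, score_b)
```

Under these assumptions the same plan is penalized heavily in the stand-in for User A's model, while in User B's the penalty is offset by the positive weight on conjunction, which mirrors the qualitative contrast drawn for Alt-6.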

9. Conclusions

This article describes SPaRKy, a two-stage sentence planner that generates many alternative realizations of input content plans and then ranks them using a statistical model trained on human feedback. We demonstrate that the training technique developed for SPoT (Walker, Rambow, & Rogati, 2002) generalizes easily to new domains, and that it can be extended to handle the rhetorical structures required for more complex types of information presentation.

One of the most novel contributions of this paper is to show that trainable generation can be used to train sentence planners tailored to individual users' preferences. Previous work modeling individuals has mainly applied to content planning. While studies of human-human dialogue suggest that modeling other types of individual differences could be valuable for spoken language generation, in the past, linguistic variation among individuals was considered a problem for generation (McKeown, Kukich, & Shaw, 1994; Reiter, 2002; Reiter, Sripada, & Robertson, 2003). Here, we show that users have different perceptions of the quality of alternative realizations of a content plan, and that individualized models perform better than those trained for groups of users. Our qualitative analysis indicates that trainable sentence generation is sensitive to variations in domain application, presentation type, and individual human preferences about the arrangement of particular content types. These are the first results showing that individual preferences apply to sentence planning.

We also compared SPaRKy to the template-based generator described in Section 3.2: this generator is highly tuned to this domain and was previously shown to produce high quality outputs in a user evaluation (Stent, Prasad, & Walker, 2004). When SPaRKy is trained for a group of users, then template-based generation is better for recommend and compare-3, but in most cases the performance of the individualized SPRs is statistically indistinguishable from MATCH's template-based generator: the exceptions are that, for compare-2, User B prefers SPaRKy, while for compare-3 User A prefers the template-based generator. In all cases, the Human scores (outputs produced by the SPG but selected by a human) are as good as or better than the template-based generator, even for complex information presentations such as extended comparisons.

These results show that there is a gap between the performance of the trained SPR and human performance. This suggests that it might be possible to improve the SPR with different feature sets or a different ranking algorithm. We leave a comparison with other ranking algorithms to future work. Here, we report results for many different feature sets (n-gram, concept and tree) and investigate their effect on performance. Table 1 shows that a combination of the three feature sets performs significantly better for recommend and compare-3 than the tree features from our earlier work (Walker, Rambow, & Rogati, 2002; Stent, Prasad, & Walker, 2004; Mairesse & Walker, 2005). Interestingly, in some cases, simple features like n-grams perform as well as features representing linguistic structure such as the tree features. This might be because particular lexical items, e.g. with, are often uniquely associated with a combination operator, e.g. the with-ns operator, which was shown to have impact on user perceptions of utterance quality (Section 8). More work is needed to determine whether these performance similarities are simply due to the fact that the variation of form generated by SPaRKy's SPG is limited. Other work has also examined tradeoffs between n-gram features and linguistically complex features in terms of tradeoffs between time and accuracy (Pantel, Ravichandran, & Hovy, 2004). Although SPaRKy is trained offline, the time to compute features and rank SPG outputs remains an issue when using SPaRKy in a real-time spoken dialogue system.

A potential limitation of our approach is the time and effort required to elicit user feedback for training the system, as described in Section 6. In Section 7.3 we showed that RankLoss error rates of around 0.20 could be acquired with a much smaller training set, i.e. with a training set of around 120 examples. However, typical users would probably not want to provide ratings of 120 examples. Future work should explore alternative training regimes, perhaps by utilizing ratings from several users. For example, we could identify examples that most distinguish our existing users, and just present these examples to new users. Also, instead of users rating information presentations before using MATCH, perhaps a method for users to rate information presentations while using MATCH could be developed, i.e. in the course of a dialogue with MATCH when a recommendation or comparison is presented to the user, the system could display on the screen a rating form for that presentation. Another approach would be to train from a different type of user feedback collected automatically by monitoring the user's behavior, e.g. measures of cognitive load such as reading time.

Another limitation is that SPaRKy's dictionary is handcrafted, i.e. the associations between simple assertions and their syntactic realizations (d-trees) are specified by hand, like all generators. Recent work has begun to address this limitation by investigating techniques for learning a generation dictionary automatically from different types of corpora, such as user reviews (Barzilay & Lee, 2002; Higashinaka, Walker, & Prasad, 2007; Snyder & Barzilay, 2007).

A final limitation is that we only use two individuals to provide a proof-of-concept argument for the value of user-tailored trainable sentence planning. We have argued throughout this paper that the individual differences we document are more general, are not particular to users A and B, and are not the result of random noise in user feedback. Nevertheless, we hope that future work will test these results against a larger population of individuals in order to provide further support for these arguments and in order to characterize the full range of individual differences in preferences for language variation in dialogue interaction.

Acknowledgments

This work was partially funded by a DARPA Communicator Contract MDA972-99-3-0003, by a Royal Society Wolfson Research Merit Award to M. Walker, and by a Vice Chancellor’s studentship to F. Mairesse.

References

André, E., Rist, T., van Mulken, S., Klesen, M., & Baldes, S. (2000). Embodied Conversational Agents, chap. The automated design of believable dialogues for animated presentation teams, pp. 220–255. MIT Press.

Bangalore, S., & Rambow, O. (2000). Exploiting a probabilistic hierarchical model for generation. In Proc. of the International Conference on Computational Linguistics.

Barzilay, R., Elhadad, N., & McKeown, K. R. (2002). Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research, 35–55.

Barzilay, R., & Lee, L. (2002). Bootstrapping lexical choice via multiple-sequence alignment. In Proc. of the Conference on Empirical Methods for Natural Language Processing.

Belz, A. (2005). Corpus-driven generation of weather forecasts. In Proc. 3rd Corpus Linguistics Conference.

Bouayad-Agha, N., Scott, D., & Power, R. (2000). Integrating content and style in documents: a case study of patient information leaflets. Information Design Journal, (2), 161–176.

Branigan, H., Pickering, M., & Cleland, A. (2000). Syntactic coordination in dialogue. Cognition, B13–B25.

Brennan, S. E., & Clark, H. H. (1996). Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology: Learning, Memory And Cognition, (6), 1482–1493.

Brennan, S. E., Friedman, M. W., & Pollard, C. J. (1987). A centering approach to pronouns. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Brown, P., & Levinson, S. (1987). Politeness: Some Universals in Language Usage. Cambridge University Press.

Bulyko, I., & Ostendorf, M. (2001). Joint prosody prediction and unit selection for concatenative speech synthesis. In Proc. of the International Conference on Acoustic Speech and Signal Processing.

Carenini, G., & Moore, J. D. (2000). A strategy for generating evaluative arguments. In Proc. of the International Natural Language Generation Conference.

Carenini, G., & Moore, J. D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence Journal, (11), 925–952.

Chai, J., Hong, P., Zhou, M., & Prasov, Z. (2004). Optimization in multimodal interpretation. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Chu-Carroll, J., & Carberry, S. (1995). Response generation in collaborative negotiation. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Clark, H. H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 1–39.

Collins, M. (2000). Discriminative reranking for natural language parsing. In Proc. of the International Conference on Machine Learning.

Coulston, R., Oviatt, S., & Darves, C. (2002). Amplitude convergence in children’s conversational speech with animated personas. In Proc. of the International Spoken Language Processing Conference.

Danlos, L. (2000). G-TAG: A lexicalized formalism for text generation inspired by tree adjoining grammar. In Abeillé, A., & Rambow, O. (Eds.), Tree Adjoining Grammars: Formalisms, Linguistic Analysis, and Processing. CSLI Publications.

Di Eugenio, B., Moore, J. D., & Paolucci, M. (1997). Learning features that predict cue usage. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

DiMarco, C., & Foster, M. E. (1997). The automated generation of Web documents that are tailored to the individual reader. In Proc. of the AAAI Spring Symposium on Natural Language Processing on the World Wide Web.

Duboue, P. A., & McKeown, K. R. (2003). Statistical acquisition of content selection rules for natural language generation. In Proc. of the Conference on Empirical Methods in Natural Language Processing.

Elhadad, N., Kan, M.-Y., Klavans, J., & McKeown, K. (2005). Customization in a unified framework for summarizing medical literature. Journal of Artificial Intelligence in Medicine, (2), 179–198.

Ferguson, S. H., & Kewley-Port, D. (2002). Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. Journal of the Acoustical Society of America, (1), 259–271.

Fleischman, M., & Hovy, E. (2002). Emotional variation in speech-based natural language generation. In Proc. of the International Natural Language Generation Conference.

Forbes, K., Miltsakaki, E., Prasad, R., Sarkar, A., Joshi, A., & Webber, B. (2003). D-LTAG system: Discourse parsing with a lexicalized tree adjoining grammar. Journal of Logic, Language and Information, (3), 261–279.

Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference

Cognition, 181–218.

Goffman, E. (1981). Forms of Talk. University of Pennsylvania Press, Philadelphia, Pennsylvania, USA.

Grosz, B. J., Joshi, A. K., & Weinstein, S. (1995). Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, (2), 203–225.

Guo, H., & Stent, A. (2005). Trainable adaptable multimedia presentation generation. In Proc. of the International Conference on Multimodal Interfaces. Demo paper.

Gupta, S., Walker, M., & Romano, D. (2007). How rude are you?: Evaluating politeness and affect in interaction. In Proc. of the Second International Conference on Affective Computing and Intelligent Interaction.

Gupta, S., & Stent, A. (2005). Automatic evaluation of referring expression generation using corpora. In Proc. of the Workshop on Using Corpora in Natural Language Generation.

Hardt, D., & Rambow, O. (2001). Generation of VP ellipsis: A corpus-based approach. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Higashinaka, R., Walker, M., & Prasad, R. (2007). An unsupervised method for learning generation lexicons for spoken dialogue systems by mining user reviews. ACM Transactions on Speech and Language Processing, (4).

Hovy, E. (1987). Some pragmatic decision criteria in generation. In Kempen, G. (Ed.), Natural Language Generation, pp. 3–17. Martinus Nijhoff.

Isard, A., Brockmann, C., & Oberlander, J. (2006). Individuality and alignment in generated dialogues. In Proc. of the International Natural Language Generation Conference.

Johnston, M., Bangalore, S., Vasireddy, G., Stent, A., Ehlen, P., Walker, M., Whittaker, S., & Maloor, P. (2002). MATCH: An architecture for multimodal dialogue systems. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Jokinen, K., & Kanto, K. (2004). User expertise modelling and adaptivity in a speech-based e-mail system. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Jordan, P., & Walker, M. (2005). Learning content selection rules for generating object descriptions in dialogue. Journal of Artificial Intelligence Research, 157–194.

Joshi, A. K., Webber, B., & Weischedel, R. M. (1986). Some aspects of default reasoning in interactive discourse. Tech. rep. MS-CIS-86-27, University of Pennsylvania.

Joshi, A. K., Webber, B. L., & Weischedel, R. M. (1984). Preventing false inferences. In Proc. of the International Conference on Computational Linguistics.

Jungers, M. K., Palmer, C., & Speer, S. R. (2002). Time after time: The coordinating influence of tempo in music and speech. Cognitive Processing, 21–35.

Kittredge, R., Korelsky, T., & Rambow, O. (1991). On the need for domain communication knowledge. Computational Intelligence, (4), 305–314.

Kothari, A. (2007). Accented pronouns and unusual antecedents: A corpus study. In Proc. of the 8th SIGdial Workshop on Discourse and Dialogue.

Langkilde, I., & Knight, K. (1998). Generation that exploits corpus-based statistical knowledge. In Proc. of the International Conference on Computational Linguistics and Meeting of the Association for Computational Linguistics.

Langkilde-Geary, I. (2002). An empirical verification of coverage and correctness for a general-purpose sentence generator. In Proc. of the International Natural Language Generation Conference.

Lapata, M. (2003). Probabilistic text structuring: Experiments with sentence ordering. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Lavoie, B., & Rambow, O. (1997). A fast and portable realizer for text generation systems. In Proc. of the Conference on Applied Natural Language Processing.

Levelt, W. J. M., & Kelter, S. (1982). Surface form and memory in question answering. Cognitive Psychology, 78–106.

Lin, J. (2006). Using distributional similarity to identify individual verb choice. In Proc. of the International Natural Language Generation Conference.

Litman, D. (1996). Cue phrase classification using machine learning. Journal of Artificial Intelligence Research, 53–94.

Luchok, J. A., & McCroskey, J. C. (1978). The effect of quality of evidence on attitude change and source credibility. The Southern Speech Communication Journal, 371–383.

Madigan, D., Genkin, A., Lewis, D., Argamon, S., Fradkin, D., & Ye, L. (2005). Author identification on the large scale. In Proc. of the Meeting of the Classification Society of North America.

Mairesse, F., & Walker, M. (2007). PERSONAGE: Personality generation for dialogue. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Mairesse, F., & Walker, M. (2005). Learning to personalize spoken generation for dialogue systems. In Proc. Interspeech.

Mann, W., & Thompson, S. (1987). Rhetorical structure theory: Description and construction of text structures. In Kempen, G. (Ed.), Natural Language Generation, pp. 83–96. Martinus Nijhoff.

Marciniak, T., & Strube, M. (2004). Classification-based generation using TAG. In Proc. of the International Natural Language Generation Conference.

Marcu, D. (1996). Building up rhetorical structure trees. In Proc. of the Conference on Artificial Intelligence and Conference on Innovative Applications of Artificial Intelligence.

Marcu, D. (1997). From local to global coherence: a bottom-up approach to text planning. In Proc. of the Conference on Artificial Intelligence.

McCoy, K. F. (1989). Generating context-sensitive responses to object related misconceptions. Artificial Intelligence, (2), 157–195.

McKeown, K., Kukich, K., & Shaw, J. (1994). Practical issues in automatic document generation. In Proc. of the Conference on Applied Natural Language Processing.

McKeown, K. R. (1985). Discourse strategies for generating natural language text. Artificial Intelligence, (1), 1–42.

Mellish, C., O’Donnell, M., Oberlander, J., & Knott, A. (1998). An architecture for opportunistic text generation. In Proc. of the Ninth International Workshop on Natural Language Generation.

Melčuk, I. A. (1988). Dependency Syntax: Theory and Practice. State University of New York Press, Albany, New York.

Moore, J. D., & Paris, C. L. (1993). Planning text for advisory dialogues: Capturing intentional and rhetorical information. Computational Linguistics, (4), 651–694.

Moore, J. D., Foster, M. E., Lemon, O., & White, M. (2004). Generating tailored, comparative descriptions in spoken dialogue. In Proc. of the Seventeenth International Florida Artificial Intelligence Research Society Conference.

Nakatsu, C., & White, M. (2006). Learning to say it well: Reranking realizations by predicted synthesis quality. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Nenkova, A., Passonneau, R. J., & McKeown, K. (2007). The pyramid method: incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, (2).

Oberlander, J., & Brew, C. (2000). Stochastic text generation. Philosophical Transactions of the Royal Society of London, Series A, 358, 1373–1385.

Paiva, D. S., & Evans, R. (2004). A framework for stylistically controlled generation. In Proc. of the International Natural Language Generation Conference.

Pantel, P., Ravichandran, D., & Hovy, E. (2004). Towards terascale knowledge acquisition. In Proc. of the International Conference on Computational Linguistics.

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A method for automatic evaluation of machine translation. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Pennebaker, J. W., & King, L. A. (1999). Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology, 1296–1312.

Piwek, P. (2003). A flexible pragmatics-driven language generator for animated agents. In Proc. of the European Meeting of the Association for Computational Linguistics.

Polifroni, J., & Walker, M. (2006). An analysis of automatic content selection algorithms for spoken dialogue system summaries. In Proc. of the IEEE/ACL Conference on Spoken Language Technology.

Porayska-Pomsta, K., & Mellish, C. (2004). Modelling politeness in natural language generation. In Proc. of the International Natural Language Generation Conference.

Prasad, R., Joshi, A., Dinesh, N., Lee, A., & Miltsakaki, E. (2005). The Penn Discourse TreeBank as a resource for natural language generation. In Proc. of the Corpus Linguistics Workshop on Using Corpora for Natural Language Generation.

Prevost, S. (1995). A Semantics of Contrast and Information Structure for Specifying Intonation in Spoken Language Generation. Ph.D. thesis, University of Pennsylvania.

Prince, E. F. (1985). Fancy syntax and shared knowledge. Journal of Pragmatics, (1), 65–81.

Rambow, O., Rogati, M., & Walker, M. (2001). Evaluating a trainable sentence planner for a spoken dialogue travel system. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Rambow, O., & Korelsky, T. (1992). Applied text generation. In Proc. of the Conference on Applied Natural Language Processing.

Reiter, E. (2002). Should corpora be gold standards for NLG? In Proc. of the International Natural Language Generation Conference.

Reiter, E., & Dale, R. (2000). Building Natural Language Generation Systems. Cambridge University Press.

Reiter, E., & Sripada, S. (2002). Human variation and lexical choice. Computational Linguistics, 545–553.

Reiter, E., Sripada, S., & Robertson, R. (2003). Acquiring correct knowledge for natural language generation. Journal of Artificial Intelligence Research, 491–516.

Reitter, D., Keller, F., & Moore, J. D. (2006). Computational modeling of structural priming in dialogue. In Proc. of the Joint Conference on Human Language Technologies and Meeting of the North American Chapter of the Association for Computational Linguistics.

Rich, E. (1979). User modelling via stereotypes. Cognitive Science, 329–354.

Scott, D. R., & de Souza, C. S. (1990). Getting the message across in RST-based text generation. In Dale, R., Mellish, C., & Zock, M. (Eds.), Current Research in Natural Language Generation, pp. 47–73. Academic Press.

Snyder, B., & Barzilay, R. (2007). Database-text alignment via structured multilabel classification. In Proc. of the International Joint Conference on Artificial Intelligence.

Stenchikova, S., & Stent, A. (2007). Measuring adaptation between dialogs. In Proc. of the 8th SIGdial Workshop on Discourse and Dialogue.

Stent, A., & Guo, H. (2005). A new data-driven approach for multimedia presentation generation. In Proc. EuroIMSA.

Stent, A., Prasad, R., & Walker, M. (2004). Trainable sentence planning for complex information presentation in spoken dialog systems. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Stent, A., Walker, M., Whittaker, S., & Maloor, P. (2002). User-tailored generation for spoken dialogue: An experiment. In Proc. of the International Conference on Spoken Language Processing.

Wahlster, W., & Kobsa, A. (1989). User models in dialogue systems. In User Models in Dialogue Systems, pp. 4–34. Springer Verlag, Berlin.

Walker, M., Rambow, O., & Rogati, M. (2002). Training a sentence planner for spoken dialogue using boosting. Computer Speech and Language: Special Issue on Spoken Language Generation, (3-4), 409–433.

Walker, M. A., Cahn, J. E., & Whittaker, S. J. (1997). Improvising linguistic style: Social and affective bases for agent personality. In Proc. of the First Conference on Autonomous Agents.

Walker, M. A., et al. (2002). DARPA communicator: Cross-system results for the 2001 evaluation. In Proc. of the International Spoken Language Processing Conference.

Walker, M. A., et al. (2004). Generation and evaluation of user tailored responses in multimodal dialogue. Cognitive Science, (5), 811–840.

Webber, B., Knott, A., Stone, M., & Joshi, A. (1999). What are little trees made of?: A structural and presuppositional account using lexicalized tag. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Yeh, C.-L., & Mellish, C. (1997). An empirical study on the generation of anaphora in Chinese. Computational Linguistics, 169–190.

Zukerman, I., & Litman, D. (2001). Natural language processing and user modeling: Synergies and limitations. User Modeling and User-Adapted Interaction, (1-2), 129–158.