"A Modern Up-To-Date Laptop" – Vagueness in Natural Language Queries for Product Search
Andrea Papenmeier, Alfred Sliwa, Dagmar Kern, Daniel Hienert, Ahmet Aker, Norbert Fuhr
GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
University of Duisburg-Essen, Duisburg, Germany
ABSTRACT
With the rise of voice assistants and an increase in mobile search usage, natural language has become an important query language. So far, most current systems are not able to process these queries because of the vagueness and ambiguity in natural language. Users have adapted their query formulation to what they think the search engine is capable of, which adds to their cognitive burden. With our research, we contribute to the design of interactive search systems by investigating the genuine information need in a product search scenario. In a crowd-sourcing experiment, we collected 132 information needs in natural language. We examine the vagueness of the formulations and their match to retailer-generated content and user-generated product reviews. Our findings reveal high variance in the level of vagueness and the potential of user reviews as a source for supporting users with rather vague search intents.
Author Keywords
Information retrieval; information need; query formulation; vagueness; natural language.
CCS Concepts
• Human-centered computing → User interface programming; Text input; Empirical studies in HCI.
INTRODUCTION
In 2018, GlobalWebIndex reported that 27% of mobile internet users made use of the voice search functionality of their mobile devices (not including tablets) [29], rising from 18% in 2015 [28]. With the rising number of mobile searches, accompanied by an increase in voice interactions [12], natural language support for searching the Web has gained importance. Likewise, voice assistants such as Siri, Alexa, or Cortana depend on understanding and interacting with natural language. Applications of voice assistants in laboratory settings [6] or Smart Homes [7, 34] show the usefulness of voice as an input modality but, at the same time, highlight existing problems: conversation techniques are not yet sophisticated enough to elicit long-term usage [7], and natural language has a great variance in vocabulary [6]. Already in 1987, Furnas et al. [11] noted the "vocabulary problem": the natural language of users is not equal to the controlled language used to index information in search systems. Although users are best at expressing their information need (i.e., what they are searching for) through natural language [27], traditional information retrieval systems for searching the Web were not designed to deal with challenges such as vagueness and ambiguity [3]. This has serious consequences for users. They need to focus not only on accessing and verbalising their information need, but also on respecting the formal restrictions of the system in order to achieve an acceptable search outcome [36]. First, this means an increased cognitive burden for the user. Second, users adapt the formulation of their information need to what they believe the system can process [5, 18]. This impedes an intuitive interaction with the search system and leads to less relevant search results.

In our research, we address the vocabulary problem in web search by designing an interactive search system from a user-centric view that is able to handle users' information needs with all the vagueness and ambiguity they might include. In this paper, we report on the first step in our user-centred design process: the collection and investigation of natural language descriptions of information needs. We designed and conducted a user study with 132 participants to collect genuine information need descriptions in the product search context. We examined the vagueness of these natural language descriptions and explored how well they match retailer-generated content (product descriptions) as well as user-generated content (user reviews). Our key findings show that user reviews have high potential as a source for matching vague descriptions of products. User reviews have syntactic and semantic similarities to the vague information needs and contain information about users' experiences with a product that is not provided in retailer-generated descriptions (e.g., quality or brand reputation). We conclude our paper by discussing implications for designing more intuitive product search systems.
RELATED WORK
In the following section, we present the current state of research on natural language in voice interaction and for interacting with search systems, as well as existing approaches to resolving the challenges that natural language poses for querying.
Natural Language in Voice Interactions
The quality of natural language processing is an essential driver of user experience in the area of voice assistants. Voice assistants have been evaluated in many scenarios [6, 7, 32, 39], highlighting the potential for conversational interaction but also showing existing challenges. Cambre et al. [6] employed a voice assistant in a laboratory setting, noting the challenge of the versatile natural language vocabulary. In their use case, the system was unable to understand technical terms used during laboratory work. Missing context is also a problem often described in the literature [6, 23, 32]. To date, interacting with voice assistants is mostly restricted to simple commands and implemented "skills". Yet, in an ideal setting, voice interaction is highly versatile and conversational [39]. Interaction with voice assistants is therefore not naturally restricted to simple commands or a closed vocabulary. The absence of sophisticated conversational abilities and a lack of understanding of rich natural language often result in disappointment and frustration, not only with voice assistants but also with conversational chat bots [15]. In general, voice leads to strong anthropomorphisation [7]. Cho, Lee and Lee [7] conducted a long-term study to investigate why long-term usage of Amazon's voice assistant Alexa is low. Participants unconsciously anthropomorphised the virtual assistant, leading to disappointment when Alexa did not live up to the expectation of a human-like conversation. In human-human interaction, vagueness does not necessarily lead to a problem of understanding. Jucker, Smith and Lüdge [17] argue that vagueness is a result of using an appropriate level of detail with respect to a specific recipient and situation. It can therefore be considered a "tool" to reduce cognitive burden.
Query Formulation Problem
Traditional search engines require users to issue a query at the beginning of the search process. Previous research argued that users modify and reformulate the query if they are not satisfied with the results [36]. Kato et al. [20] analysed logs from a search engine and found that experienced searchers adapt their formulation to what they believe the search system is able to process. In the same vein, Kammerer and Bohnacker [18] analysed the query formulation process of children (age 8-10 years). In their experiment, younger and inexperienced children tended to use natural language rather than keyword search, while older children had already learned to use keywords as their search strategy. Aula [2] likewise found familiarity with the search system and domain experience to influence query formulation. For search systems, vagueness is a major problem: Balfe and Smyth [3] argue that due to their brevity, search queries lack the necessary context to restore the user's information need. Introducing controlled languages (e.g., boolean logic) to overcome the query formulation problem did not show promising results in the past, as they are often applied incorrectly [16].

In the context of voice search, relatively little research has been conducted on query formulation. Guy [12] analysed the query logs of a voice search system by comparing them to textual queries, revealing some subtle differences between the two input modalities. Voice queries more often contain words that are easy to pronounce but difficult to type. The opposite could be observed for typed queries, which more often contained words that are easy to type but lengthy to verbalise, e.g. calendar years. Guy [12] also found that the language in voice queries is richer, i.e. more varied, and closer to natural language than in typed queries.

The evidence presented so far suggests that information needs can best be expressed in natural language, as opposed to controlled query languages. However, users are influenced in their query formulation process by their beliefs about what a search system is capable of. Since information retrieval systems traditionally have not been designed to support vagueness arising from natural language, examining vagueness in queries is not possible via logs of existing search engines.
Faceted Search
Several researchers have investigated methods to support the user in searching large amounts of data, e.g. in product search. Faceted search provides filtering opportunities while exploiting additional, structured information about the items [13, 40]. These facets are generated based on the provided product items. Approaches have been suggested to better adjust these facets to user needs, e.g. by changing the ordering of facets [24] or by automatically extracting relevant facets with respect to the user query [37]. Adding a "weight" to a facet can further help users indicate how much impact a facet should have on the result list [21]. Research has also shown that novice users profit from different content than experts: in e-commerce, novices need more general information about a product than experts, who prefer more detailed information [33].

Recent research has focused on data-driven improvements of product search. Hirschmeier and Egger [14] built facets from notebook reviews on Amazon and noted that user-generated content holds "a problem and an opportunity at the same time" [14]: for a user, there is too much content to consider in detail; at the same time, it provides enough data to automatically extract product attributes that are frequently mentioned. However, a user study validating the usability of their extracted facets is missing. Similarly, Feuerbach et al. [10] used reviews to generate new facet values for existing facets of a hotel search system, e.g. "comfortable" for the facet "bed". In a user study with 30 participants, they found that users perceived the extracted facet values as more suitable with respect to their preferences compared to the original values. In their work on ontologies for multilingual product search, Lehtola, Heinecke and Bounsaythip [26] investigated the extraction and mapping of colour and material attributes for clothes. They highlight the discrepancy between retailers (synonyms are used for the same material), but also between retailer and customer (preference for the name of a textile vs. its functionality).

Kleemann and Ziegler [22] developed a dialog-based product advisor for notebooks. The advisor poses questions about abstract attributes of the product rather than asking about precise technical characteristics, e.g. "What do you want to use it for?" (translated from German) with the answer options "surfing and watching movies", "gaming", "advanced office tasks", and "image and video editing". Vaccaro et al. [38] likewise take a user-centered approach in their work on personal fashion assistants. They analysed the interaction between personal fashion advisors and their clients to develop design guidelines for fashion assistant chatbots. These guidelines differ from the abstract questions used by Kleemann and Ziegler [22], suggesting that the two domains, technical products and clothes, differ with regard to product search support. Sawasdichai and Poggenpohl [33] explored e-commerce shopping from the perspective of the users and found that users, in general, expect to be provided with the same information when shopping online as they are in an offline setting.

Although these works show that researchers experiment with different information sources to improve product search, it remains unclear whether the improvements fit the user's language. First user studies (evaluating, for example, a natural language product advisor or new facet values) show positive perception by users, but the development process did not start with the users in the first place. Little research has focused on the user's genuine information need as a source for identifying product attributes for faceting and filtering.

RESEARCH METHOD
In the context of designing an interactive search system following a user-centred design process, we conducted a first user study to investigate how users formulate their information needs in natural language and the potential of these formulations for improving product search systems. Since users adjust their queries to the search system, analysing query logs is not sufficient for investigating natural language as a query language. Therefore, we decided to carry out a user study in which the formulation of the information need is decoupled from an existing search system. We focus on two different domains: the technical domain and the clothes domain. For the technical domain, we chose purchasing a laptop as an example; for the clothes domain, our decision went in favour of jackets.

Our user study was designed and conducted as an online survey to collect real-life examples of information need formulations. The study is set up as a between-subjects study with two independent conditions (laptop domain, jacket domain). The participants are asked to describe a product (independent of the search context) before retrieving the described item from an online product search website.

Our research is driven by the following research questions:
RQ 1: How do users formulate their information need in natural language in a product search scenario?
RQ 1.1: How vaguely do users formulate their queries in product search when using natural language?
RQ 1.2: What role does the product domain play in query formulation?
RQ 2: To what extent do retailer-generated content and product reviews reflect the language used in user-generated product descriptions?
RQ 3: How well do the facets of online shops match the user-generated product descriptions?
STUDY DESIGN
The following section describes the design and sample of the online user study.
Scenario and Task
We re-use the task described by Barbu et al. [4], who investigated the impact of review tonality on buying decisions in product search. In their study, participants are asked to imagine themselves "looking to purchase a new laptop after their old one broke". Except for the target product, this task does not prime participants to re-use formulations from the task description. We adapt the scenario for the technical domain:
Imagine your laptop broke down. What kind of laptop would you choose as a replacement? Please describe the laptop you would want as a replacement in your own words.
For the clothes domain, we alter the scenario to match the need for a new jacket, while keeping the instructions constant:
Imagine you lost the jacket you wear on a daily basis. What kind of jacket would you choose as a replacement? Please describe the jacket you would want as a replacement in your own words.
In both cases, participants are instructed to write at least 50 characters, which serves two goals. First, texts of a certain length can be used as an attention check (i.e., they show whether the participant has understood the instructions correctly); second, we expect that requesting longer texts will elicit natural language rather than bullet points.

After formulating their information need, participants were asked to search for the product on Amazon. Amazon was chosen as a well-known representative of a non-specialised online product retailer. To avoid superficial searches, participants are instructed to search for approximately five minutes. This time frame is weakly enforced by the survey system, which blocks the "next" button for two minutes. At the end of the search, participants report the URL of their final choice. To investigate how much the search has influenced the users' search intent, we offer participants the opportunity to make changes to their initial product description.
Apparatus & Procedure
To reach native English-speaking participants, the study is set up as an online study. We use the survey platform SoSci Survey. For both the survey and the search task, participants are asked to use the internet browser on their private device. The study is structured as follows:

1. Introduction and consent
2. Scenario and task: multi-line text field for the information need description
3. Product search task: link to Amazon, and URL input field
4. Description refinement: display of the participant's answer from step 2, with the possibility to refine and change the description
5. Post-task questionnaire: query, domain knowledge, and last product search in the specific domain
6. Demographic questions: gender, age

Measures
Per participant, we measure the following dependent variables: (1) information need description, (2) search query, (3) domain knowledge, (4) the search outcome in the form of the URL of the product on Amazon, (5) satisfaction with the search outcome, and (6) search duration. The information need description, search query, and product URL are provided by the participant in full text, while the search duration is automatically recorded by the survey system. We measure domain knowledge as a combination of search experience and self-assessment questions, as suggested by Kanwar, Grund and Olson [19]. We ask participants to indicate how much they think they know about several product attributes (as found in common facets of online shops) on a 7-point polarity scale (with 1 = "no knowledge", 7 = "expert knowledge"). Satisfaction with the search outcome is measured on a 7-point Likert scale (from -3 = "very dissatisfied" to +3 = "very satisfied"). In a second step after the user study, user-generated descriptions are annotated with a "vagueness score". To investigate how well existing systems support natural language queries, we collect retailer-generated product descriptions and user-generated product reviews (RQ 2) as well as the facets of online product retailers (RQ 3). For each of these information sources, we calculate how well they match the user-generated descriptions.
Statistical Tests
For all significance tests between two independent samples (e.g., laptop domain vs. jacket domain, or user-generated content vs. retailer-generated content), we use the two-sided Mann-Whitney U test. The correlation of dependent variables is computed using Pearson's correlation coefficient, while the significance of differences between two paired samples is evaluated with Wilcoxon's signed-rank test.
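These tests correspond directly to standard SciPy routines. A minimal sketch for reference; the variable names and example data below are ours, not from the study:

```python
# Minimal sketch of the significance tests described above, using SciPy.
# `laptop_scores`/`jacket_scores` stand for two independent samples,
# `before`/`after` for two paired samples; all values are hypothetical.
from scipy.stats import mannwhitneyu, pearsonr, wilcoxon

laptop_scores = [5.1, 6.0, 4.8, 5.5]
jacket_scores = [4.4, 4.9, 4.2, 5.0]

# Two-sided Mann-Whitney U test for two independent samples.
u_stat, p_independent = mannwhitneyu(laptop_scores, jacket_scores, alternative="two-sided")

# Pearson's correlation coefficient for two dependent variables.
r, p_corr = pearsonr(laptop_scores, jacket_scores)

# Wilcoxon signed-rank test for two paired samples.
before, after = [0.12, 0.30, 0.25, 0.40], [0.28, 0.45, 0.31, 0.52]
w_stat, p_paired = wilcoxon(before, after)
```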
Participants
In total, 149 participants were recruited on the scientific crowd-sourcing platform Prolific. 17 participants had to be excluded from the analysis due to misunderstanding the task. The final sample size is therefore N = 132 (f = 83, m = 49, d = 0). Participants had to be older than 18 years (M = 34.1 years, SD = 11.2 years) to take part in the experiment and each received a financial allowance of 0.80 GBP for 7 minutes of work (6.80 GBP per hour). The platform's population was screened for participants with English as their first language to avoid a translation bias. Furthermore, participants were informed before the start that they would need to access Amazon.com in the course of the experiment, to avoid technical difficulties. Participants are equally distributed over the two domains with respect to age and gender, with 66 participants searching for a laptop and 66 searching for a jacket.

DATA PREPARATION
This section describes the text preprocessing, segmentation, and annotation process for the user-generated product descriptions, as well as the collection of lists of existing facets from a general retailer and specialised retailers for both domains (laptop and jacket).
Annotations
As described in the Measures section, we enrich the user-generated information need descriptions with a vagueness score. We followed the method used by Lebanoff and Liu [25], who use crowd sourcing with native English speakers to annotate the level of vagueness of a sentence. Four annotators were recruited on Amazon Mechanical Turk to each label all user-generated descriptions on a scale from 1 ("very specific, not vague at all") to 10 ("very vague, not specific at all"). They received a compensation of 7.00 USD for one hour of work and had to have an approval rate on Mechanical Turk of more than 95%. Before labelling the dataset, annotators were introduced to our definition of vagueness:

Vagueness is the imprecise or unclear use of language. Contrast this term with "clarity" and "specificity". Vague language states a general idea but leaves the precise meaning to the reader's interpretation.
The annotators performed a training phase in which they annotated nine descriptions from pretests to get used to the topic, the annotation scale, and the formulations. With an estimated duration of an hour, we anticipated the task to be rather lengthy. To ensure high-quality annotations, we included three attention check mechanisms (a minimum duration of 20 minutes, asking the participant after the training phase for the maximum scale value, and a question in between the annotations asking to tick a specific value). The attention checks disqualified one annotator.

The three remaining annotators had a reliability, measured with Krippendorff's α, of .62. The average of their individual test-retest reliability is Krippendorff's α = .67, while the Pearson's correlation coefficient among the three annotators is r(1) = .63 with p < .001 in all cases. Although the inter-annotator reliability is not very high, the annotators show a high correlation and an acceptable test-retest reliability. We expected judging the vagueness on a sentence level to be an ambiguous task, which is reflected in the rather low reliability scores. The annotator reliability is comparable to, if not better than, that reported in [25]: 4/5 of their annotators agreed on 13% of the descriptions, which, in our case, holds for 3/3 annotators on 13% of the descriptions. In 47% of their cases, 3/5 annotators agreed, while in our case 2/3 annotators agreed in 70% of the cases. Note that for this comparison, we mapped our 10-point scale to their 5-point scale. As done in [25], we average the scores of all three annotators to obtain a single decimal number as the vagueness score for each description.
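To make the aggregation concrete, a small sketch; note that the 10-to-5-point mapping shown here (halving and rounding up) is our assumption for illustration, as the exact mapping is not spelled out above:

```python
# Sketch of the vagueness score aggregation. The 10-to-5-point mapping
# (ceil of half) is our assumption; the paper does not specify it.
import math

def vagueness_score(annotations: list[float]) -> float:
    """Average the three annotators' scores into one decimal score."""
    return sum(annotations) / len(annotations)

def to_five_point(ten_point: float) -> int:
    """Map a 1-10 rating onto a 1-5 scale for comparison with [25]."""
    return math.ceil(ten_point / 2)

print(vagueness_score([9.0, 10.0, 9.0]))  # -> 9.33...
print(to_five_point(9))                   # -> 5
```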
Segmentation

Independent of each other, two of the authors and a second person from outside the research project divided the user-generated descriptions (e.g., "One that's lightweight, warm and has a hood") into segments. The final segmentation was determined via majority vote (90% of the descriptions). If all three segmentations of a description differed from each other (10% of the descriptions), the group discussed the segmentation until a consensus was found. Three rules were set before the segmentation process: 1) no deletion of words, only delimitation of the text, 2) except for "and", "but", and "or" if used as a conjunction between two attributes, 3) duplication of words that refer to two attributes, e.g. "fast start up and use" becomes "fast start up" and "fast use". Overall, 132 descriptions were segmented into 570 segments (252 in laptop descriptions, 318 in jacket descriptions), each containing one product attribute. The descriptions contain at least one attribute and at most 10, with 4 attributes on average (M = 3.82, SD = 1.82).

Text Preprocessing
Before comparing the information need descriptions to retailer-generated content as well as user reviews, we preprocess each text. For the information need descriptions, we manually sort out text fragments that do not contain information on the desired product characteristics but are artifacts of natural language sentence structure and the task instructions, e.g. "I would like" or "I would try to find", as well as stop words such as "a" or "and". This step was taken to avoid false positives when unimportant words are matched, and false negatives due to insubstantial words lacking in the target text. The automatic preprocessing steps for all texts include: (1) conversion of texts to lowercase, (2) removal of trailing characters, (3) removal of punctuation characters, (4) lemmatization of each word in the texts, and (5) removal of stop words. The preprocessing pipeline is realised using the NLTK library.
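A minimal sketch of this pipeline using NLTK; the function names are ours, and the NLTK resources ('punkt', 'wordnet', 'stopwords') are assumed to be downloaded:

```python
# Minimal sketch of the five preprocessing steps described above, with NLTK.
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    text = text.lower().strip()                                        # (1) lowercase, (2) trailing chars
    text = text.translate(str.maketrans("", "", string.punctuation))   # (3) punctuation
    tokens = [LEMMATIZER.lemmatize(t) for t in word_tokenize(text)]    # (4) lemmatization
    return [t for t in tokens if t not in STOP_WORDS]                  # (5) stop word removal

print(preprocess("A waterproof and weatherproof jacket in a subtle earthy colour"))
```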
Facet Matching

To answer RQ 3, "How well do the facets of online shops match the user-generated product descriptions?", we collected lists of product search facets. We retrieve the facets for each domain from both a general retailer (Amazon) and a specialised retailer (skinflint for laptops and next for jackets). To avoid influence on the facet lists by search queries, we navigate to the product categories via the websites' menu bars. On the website of the general retailer Amazon, we navigate to the domain categories in a private browser ("Shop by category" > "Computers" > "Computers & Tablets" > "Laptops" and "Shop by category" > "Men's Fashion" > "Clothing" > "Jackets & Coats", with the same path for "Women's Fashion") and retrieve the list of facets with facet values. On skinflint, the specialised retailer for technical products, we navigate to the laptop category via "Hardware" > "Notebooks" > "Notebooks". For retrieving specialised jacket facets on next, we follow "Women" > "Clothing" > "Coats & Jackets" (or starting at "Men", respectively). The gender-specific facets for jackets are merged into a single list of facets. The final lists of the general retailer contain 27 facets for laptops and 12 for jackets. The specialised retailer lists offer 84 facets for laptops and 14 facets for jackets. Table 1 shows the first five facets on each list, sorted alphabetically, while the full lists are available online at https://git.gesis.org/papenmaa/dis20_usersearchintentformulation.

Table 1. First five entries of the facet lists for both domains and both the unspecialised (Amazon) and specialised (skinflint, next) retailers, ordered alphabetically:
- Laptop, Amazon: Activity; Average Customer Review; Certifications; Condition; CPU Speed
- Laptop, skinflint: Aspect ratio; Battery; Card Readers; Class; Codename AMD
- Jacket, Amazon: Big & Tall Size; Brand; Color; New Arrivals; Petite Size
- Jacket, next: Benefit; Brand; Category; Colour; Design Feature

Product Page Matching
During the user study, participants delivered an Amazon URL of their chosen product, which is used to crawl the retailer-generated content for the product (i.e., product title and product description) immediately after the study. For each product on Amazon, there exist specific HTML fields for the product title and product description. If available, we also crawled the product's review texts. As products might have multiple reviews, we concatenate the associated reviews into one full review. Eventually, we gather three information sources to describe a product: the title, the product description, and the reviews. These are used for comparison with the user-generated descriptions from the user study for matching purposes. We apply the text preprocessing pipeline from the Section "Text Preprocessing" to all three corpora.

Table 2 illustrates the vocabulary sizes for each information field in the respective domain.

Table 2. Statistics about the corpora for the three information fields in both domains (number of products with the respective field / vocabulary size, i.e. the number of unique terms in the corpus):
- titles: laptop 66 / 296; jacket 66 / 267
- descriptions: laptop 66 / 1,132; jacket 66 / 973
- reviews: laptop 60 / 14,151; jacket 55 / 17,532

This table shows that the vocabulary of retailer-generated content (both product title and product description) in the laptop domain is greater than that in the jacket domain. Conversely, the vocabulary of reviews in the jacket domain is greater than in the laptop domain.

In the matching step, we determine for each attribute in every description whether the attribute can be found in the different information fields (product title, product description, user reviews). The matching is binary per attribute and determined by simple substring matching. We evaluate the quality of the automatic matching against manual matching and achieve a strong correlation (Pearson's correlation coefficient of r(130) = .63), meaning that the automatic matches can be used to approximate the manual matches. The manual matching was more extensive and semantically focused compared to the strict automatic substring matching, e.g. "has four pockets" was matched to the attribute "multiple pockets" in the manual matching despite missing the word "multiple". For better reproducibility, we report results based on the automatic matches in the remainder of the paper.

Finally, we compute the "coverage" of each user-generated description. The coverage indicates how many attributes of a description are found in an information field and is calculated as follows:

coverage = (number of description attributes found in the information field) / (total number of description attributes)

In the description "A waterproof and weatherproof jacket in a subtle earthy colour", only the attribute "colour" was found in the list of Amazon facets. As the description contains three attributes ("waterproof", "weatherproof", and "subtle earthy colour"), the coverage of the Amazon facets is 33%.
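A minimal sketch of the binary matching and coverage computation as described; the normalisation here is simplified to lowercasing (the study first applies the full preprocessing pipeline), and all names and example data are ours:

```python
# Sketch of binary substring matching and per-description coverage.

def matches(attribute: str, field_text: str) -> bool:
    """Binary match: is the attribute a substring of the information field?"""
    return attribute.lower() in field_text.lower()

def coverage(attributes: list[str], field_text: str) -> float:
    """Fraction of description attributes found in one information field."""
    found = sum(matches(a, field_text) for a in attributes)
    return found / len(attributes) if attributes else 0.0

# Toy example: one of three attributes matches, so coverage is 33%.
facet_text = "brand colour material fit"
attrs = ["waterproof", "weatherproof", "colour"]
print(coverage(attrs, facet_text))  # -> 0.333...
```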
RESULTS
In the following section, we describe characteristics of the user-generated product descriptions as given before the search, compare the descriptions to the issued queries, and examine how users adjusted the descriptions after the search. In a second step, we present the results of investigating how well the natural language descriptions match the retailer-generated content, the product reviews given by other buyers, and the facets currently available in popular product search systems. The complete dataset of the user study, including the segmentations and annotated vagueness scores, is publicly available at https://git.gesis.org/papenmaa/dis20_usersearchintentformulation.

Vagueness in User Descriptions
Figure 1 shows the histogram of annotated vagueness over all 132 user-generated product descriptions. The data is not normally distributed (Shapiro-Wilk normality test, p = .002), with relatively few descriptions classified at the mean vagueness score (M = 4.79, SD = 2.10). Descriptions are assigned either a rather low or a rather high vagueness. The domains show a weakly significant difference in means (p = .043), with the laptop descriptions being rated more vague (M = 5.10, SD = 2.16) than the jacket descriptions (M = 4.48, SD = 2.00).

Figure 1. Histogram of vagueness annotations for all 132 user-generated descriptions.

The most vague laptop description collected in our study (unprocessed) was:
I'd go for one that is reasonably priced, with a good sized hard drive and ram (vagueness = 9.3)

In the jacket domain, the following description was rated the most vague:

I would want a waterproof jacket that is cosy and warm. (vagueness = 8.67)

Whereas the least vague jacket description in our dataset is the following:

It is a mustard coloured padded jacket, with quite a high collar that has a hood inside it. The cuffs are ribbed at the sleeves. It also has two quite deep side pockets. (vagueness = 1.0)

Some user-generated product descriptions were rated in the middle of the vagueness scale, e.g.:

A laptop with a screen over 14 inches and that was light. The brand wouldn't be particularly important but one which looks stylish (vagueness = 6.0)
User Descriptions and Queries
On average, the user-generated descriptions of products were 29 words long in the laptop domain and 24 words long in the jacket domain. The initial search queries used to retrieve the respective product from Amazon.com only had a length of 2.2 words for laptops and 2.8 words for jackets, showing that on average, the queries were 92% (88%, respectively) shorter than the descriptions. Two participants reported not having used the search bar and therefore were not able to report a query string. Out of 66 participants in the laptop domain, 21 used the query "laptop" or "laptops". On average, 51% of the query terms also appeared in the description. In the jacket domain, only 3 participants used the queries "jacket" or "coat", and on average, 46% of the query terms also appeared in the description they wrote before. In general, queries were much shorter than the user-generated descriptions given before the search. 40% of participants searching for laptops issued a query with no overlap with their initial description. In the jacket domain, this could be observed in only 21% of the cases. We observe three types of phenomena when queries contained words not appearing in the description: (1) the usage of pronouns instead of nouns, e.g. the query "laptop" with the description "a simple one that does the basics large memory and simple to use"; (2) additional information in the query, e.g. the query "navy wool coat" with the description "simple and classic navy blue knee lenght coat with a collar made by cos or arket", where the term "wool" was added; (3) omission of a word in the description that is used as a generalisation to summarise the description, e.g. the query "laptop" with the description "14in screen with 1tb memory must include 365 microsoft".

After the search task, participants were offered the possibility to adjust their initial product description. 23% of the participants in both domains made use of this offer, whereof 80% expanded the description, while 20% shortened it. In those cases, the final description in the laptop domain was 40% different from the initial description (22% in the jacket domain), for example by adding "with a hood" to "a light waterproof neutral color jacket brand name but not too expensive" or changing "enough ram and graphics card lots of internal harddrive storage" to "ram graphics card i7 processor at least 500gb internal storage".

Statistical Analysis of Variables
Table 3. Means of dependent variables and p-values of tests for significance between domains, where a single asterisk denotes significance at a 95% CI and a double asterisk significance at a 99% CI:
- satisfaction: all 5.69; laptop 6.06; jacket 5.32; p < .001 (**)
- domain knowledge: all 4.19; laptop 3.88; jacket 4.51; p = .003 (**)

Table 3 reports the means of the dependent variables (vagueness, satisfaction, domain knowledge, amount of attributes, amount of words, and search duration). The last column of Table 3 shows the results of testing for significant differences between the two domain groups. The statistical tests for differences between the laptop and jacket domain indicate that the domain has a significant impact on query formulation in terms of the amount of words, amount of attributes, and satisfaction. Laptop descriptions contain significantly more words (M_laptop = 29, M_jacket = 24) but significantly fewer attributes (M_laptop = 3.83, M_jacket = 4.83), which means that users on average take more words to describe a single attribute in the laptop domain. The domain knowledge of participants is reported to be significantly higher in the jacket domain than in the laptop domain (M_laptop = 3.88, M_jacket = 4.51), while satisfaction with the result is higher for participants who searched for a laptop compared to those who searched for a jacket (M_laptop = 6.06, M_jacket = 5.32). Performing a correlational analysis of the dependent variables, we see a moderate negative correlation between vagueness and the amount of attributes mentioned in the description (r(130) = -.444): the more attributes are mentioned, the lower the vagueness of the description. Satisfaction is weakly negatively correlated with the amount of mentioned attributes (r(130) = -.224), meaning that the more attributes the user described before the search, the lower the satisfaction after the search. A very weak positive correlation can be observed between domain knowledge and the amount of mentioned attributes (r(130) = +.187), and a very weak negative correlation between domain knowledge and the search duration (r(130) = -.142).
Matching Measurements
The average coverage of the different information sources is presented in Table 4 and visualised in Figure 2, for matching against retailer-generated content (left) and retailer-generated content enriched with user reviews (right).

Table 4. Average coverages for the various information sources, with coverage being the percentage of attributes matched per user-generated description. Right column: p-values of tests for significance between domains, where * denotes significance at a 95% CI and ** significance at a 99% CI:
- titles: all 14%; laptop 15%; jacket 13%; p = .186
- descriptions: all 9%; laptop 5%; jacket 14%; p < .001 (**)
- reviews: all 27%; laptop 21%; jacket 33%; p = .011 (*)
- titles+descr.: all 18%; laptop 16%; jacket 20%; p = .012 (*)
- titles+descr.+rev.: all 34%; laptop 28%; jacket 41%; p = .001 (**)

The data in Table 4 shows a weakly significant higher coverage through retailer-generated content (product page title + product description) in the jacket domain (20%) compared to the laptop domain (16%). Adding user reviews as an information source increases this difference: user-generated descriptions are better matched in the jacket domain (41%) than in the laptop domain (28%) when considering all information sources (product page title + product description + user reviews).

Figure 2. Histograms of matched attributes per user-generated description with respect to product title and product description (left) and with respect to product title, product description, and user reviews (right).

Retailer-Generated Content vs. User-Generated Content

The data in Table 4 also show that the coverage of retailer-generated content (product page title and product description) improves when the product reviews are added: 11 percentage points are gained in the laptop domain and 21 percentage points in the jacket domain. The Wilcoxon signed-rank test shows significance in both domains with p < .001. The impact of adding reviews becomes apparent in Figure 2: the amount of descriptions with a low coverage (between .0 and .2) decreases greatly. Still, even when including user reviews as an information source, some individual attributes remain unmatched. Some unmatched attributes are too precise to be mentioned by any information source, e.g. "that I could wear for dog walking and going out as well as to hockey" or "if it came with some extra free programmes like Office I would be overjoyed". Other unmatched attributes are highly vague in their formulation, e.g. "has a bit of longevity" or "with a nice big screen".

To determine the benefit of adding reviews for automatically processing vague language, we examine statistical differences between high-vagueness and low-vagueness descriptions. We divide the user-generated descriptions into two vagueness groups: a low-vagueness group with vagueness scores smaller than the average vagueness score of 4.79 (N=74), and a high-vagueness group with scores greater than the average (N=58). The results are presented in Table 5.

Table 5. Average coverages for the various information sources in the low-vagueness and high-vagueness groups, with coverage being the percentage of attributes matched per user-generated description. Right column: p-values of tests for significance between the two vagueness groups, where * denotes significance at a 95% CI and ** significance at a 99% CI:
- titles: low-vag. 31%; high-vag. 20%; p = .003 (**)
- descriptions: low-vag. 42%; high-vag. 27%; p < .001 (**)
- reviews: low-vag. 44%; high-vag. 38%; p = .257
- titles+descr.: low-vag. 23%; high-vag. 12%; p = .007 (**)
- titles+descr.+rev.: low-vag. 37%; high-vag. 30%; p = .088

For the low-vagueness group, the coverage of retailer-generated content (product page title + product description) is significantly higher than for the high-vagueness group (23% vs. 12%, p = .007). This means that product descriptions with low vagueness are matched significantly better in titles and descriptions. Adding user reviews has two implications: (1) highly vague product descriptions are covered substantially better (12% vs. 30%, p < .001), an increase of 18 percentage points; (2) but lowly vague product descriptions are also matched significantly better (23% vs. 37%, p < .001). Product attributes with high and low vagueness are both found in user reviews, resulting in no remaining statistical difference between the low- and high-vagueness groups.

Facet Matching
Table 6 provides the average matching coverage of facets, both from Amazon as a general retailer and from the specialised retailers. Amazon provides 27 facets for laptop search and 12 facets for jacket search, while the specialised retailers offer 84 facets for laptop search (skinflint) and 14 for jacket search (next). As already noted in Table 3, the descriptions of jackets contain more attributes than those of laptops. Not all attributes mentioned in the user-generated descriptions could be matched to available facets; yet, for both domains, more attributes were matched to the facets of a specialised retailer (33% coverage) than to those of the general retailer (18% coverage). Additionally, only a third of the attributes could be matched to both a facet and a facet value of the specialised retailers, with the percentage being even lower for the general retailer.

Table 6. Average coverages for the facet lists, with coverage being the percentage of attributes matched per user-generated description. Right column: p-values of tests for significance between domains, where * denotes significance at a 95% CI and ** significance at a 99% CI:
- facets Amazon: all 18%; laptop 26%; jacket 10%; p = .004 (**)
- facets specialised retailer: all 33%; laptop 33%; jacket 32%; p = .208

In the jacket domain, 8 out of 12 facets on Amazon focused on the size of the jackets, leaving only 4 facets for filtering other jacket attributes. The facet that was most often matched was the facet describing the colour of the jacket. The specialised retailer, however, had no redundancy in its facets, with the "Design Feature" facet being matched most often. Characteristics such as "quilted", "hooded" or "padded" as well as style types such as "Biker" or "Parka" were possible values of this facet.

In the laptop domain, the brand plays the most important role as a facet. Other popular facets were "RAM", "Screen Size", and "Hard Drive Size". However, although the coverage was higher for facets of the specialised retailer, it was often not possible to select facet values. While the facets contain precise values ("8GB", "15 inch", "Apple"), participants described vaguer ranges: "small", "reasonable screen size", "min 8GB". Some descriptions contained very vague language, with no attribute being successfully matched to any facet (neither at the general retailer nor at the specialised retailer):

Jacket that is warm and comfortable, yet fashionable and will go with most outfits.

and:

A modern up to date laptop with the software that I use on a daily basis

In those cases, attributes could not be matched because either the respective facet was missing (as with "warm") or because it was impossible to determine which attribute of the product would bring about the desired characteristic (e.g. "comfortable", "fashionable").

Derived Facet Suggestions
Using the attributes that could be matched neither to the facets of the general retailer nor to those of the specialised shop, we grouped similar attributes and identified six new facet categories for laptops (containing 86% of the unmatched attributes) and five for jackets (accounting for 94% of the unmatched jacket attributes); see Table 7. Three proposed facets could be helpful for both the laptop and the jacket domain: "purpose" (e.g. using the laptop for image editing, or a jacket for the winter season), "appearance" (e.g. a laptop that fits in a bag, whether a jacket is "fashionable"), and "experiences", relating to attributes that need repeated interaction to judge (e.g. battery life, longevity of the jacket's seams, overall quality of the product).

Table 7. Proposed facet categories for unmatched attributes:
- Laptop: purpose, appearance, experiences, software, behaviour, model
- Jacket: purpose, appearance, experiences, accessories, material
DISCUSSION AND DESIGN IMPLICATIONS
We investigated in a first step how users formulate their information needs in product search (RQ 1). We found a broad range of vagueness in the user descriptions, from quite precise formulations ("mustard coloured padded jacket") to very vague formulations ("reasonably priced, with a good sized hard drive"). Although descriptions received a medium vagueness score on average (M = 4.79), the scores are distributed non-normally: they centre either around a low or around a high vagueness score (RQ 1.1). Designers of interactive systems dealing with vague queries should keep this in mind and provide reliable search results and functions for both cases: very precise queries and highly vague queries.
Additionally, we examined the influence of the product domain (RQ 1.2), finding significant differences between the laptop and the jacket domain. The results therefore probably do not generalise to other domains. Significantly more attributes are mentioned in jacket descriptions than in laptop descriptions, while laptop descriptions are slightly more vague. Users therefore need different support in different domains. As there cannot be a one-size-fits-all system, it is essential for the design of an interactive search system to include the users from an early stage. An important finding is the weak negative correlation between satisfaction and the amount of attributes mentioned: the more attributes are mentioned, the lower the satisfaction. Users with a precise conception of their information need were not simultaneously better at finding what they were looking for, despite mentioning more desired attributes. Furthermore, there were differences between the user-generated description and the query. Participants used generalisations, word omissions, and references (pronouns instead of nouns) that have to be resolved when designing for natural language queries.
RQ 2 ). Especially in voice applica-tions, systems need to be able to process natural language andlive up to the user’s expectations of a human-human-like con-versation. Using the information on the product pages of theselected products, we found that on average, only a fifth of thedesired attributes mentioned in the user-generated descriptionswere covered by retailer-generated content (i.e., the productpage title and the product description). Reviews significantlyincrease this matching percentage: Adding reviews to theretailer-generated content yields significantly better coverageof desired attributes in both domains. Reviews especially helpwhen matching vague descriptions. They match equally wellto low vagueness as they match to high vagueness descriptions,which is not the case for retailer-generated content. Retailer-generated content is significantly worse at matching to highlyvague descriptions as compared to low-vagueness descriptions.We suspect that retailers attempt to describe their productsprecisely to not give a false image of their product, while userreviews are written in natural language with an equal level ofvagueness as the user-generated descriptions. Therefore, wesuggest to not only rely on retailer-generated content in theretrieval process but to also include user-generated reviews tohandle vague search intents when designing new interactivesearch systems . Not only matching algorithms could profitfrom user-generated content: Shopping assistants and conver-sational agents in the context of online shopping would beable to process vague search intents.Current online product search systems also offer facets for fil-tering, raising the question whether current facets fit to naturallanguage queries (
RQ 3 ). The facets of retailers specialisedin the respective product domain match better to the user-generated descriptions than the facets of a general retailer, yetdo not cover all attributes mentioned in the user descriptions.We therefore propose to add more user-centered facets thatrelate not only to hard facts (like the storage size or the brand),but also to experienced attributes such as quality or reputationof the brand (see Table 7). A fair amount of those suggestedfacets are difficult to quantify, e.g. quality or longevity, but areof importance to the user. Giving the user the possibility toindicate how important the respective attribute is, as done inprevious research with sliders [21, 8], could be a way to pro-cess those vague requirements. However, selecting the value ofthe facet is the next hurdle. Participants often described rangesof values ( “at least” ) or more abstract concepts of values, e.g. “large storage” . The specialised notebook retailer skinflint pro-vides an overwhelming amount of 84 facets with technicalattributes. Selecting the correct facet and facet values, whilesimultaneously keeping the overview over the result list addscognitive burden on the user. Here, adjusting the ordering offacets according to the user’s input could help to reduce thecognitive burden. Besides compiling user-driven facets, de-signers of interactive search systems should consider mappingvague facet values onto dynamic ranges , e.g. assigning “cheap” laptop to the lower third of the current price range, or “fast” processor onto the upper quartile of available processor speeds.Lexical ambiguity could be processed with query expansionbased on synonyms from thesauri and user reviews, all while9iving users insights into the system’s processing steps andthe possibility to correct faulty interpretations.We approach the topic of information need formulation fromthe user side and provide empirical evidence showing thatsearch systems could be improved through utilising user-generated content. Systems supporting the user in onlineshopping or voice search could profit from user-generatedcontent to improve their processing of vague language. Forproduct search facets, we suggest some new facets based onthe user needs mentioned in the study. In our experiment, userreviews have shown to bring a substantial improvement forfinding information sought in natural language informationneeds. User-generated content is therefore a valuable source toextract new facets or determine the relevance of a specific prod-uct for the user, i.e. by using the reviews during the automaticretrieval and ranking process. The next step, generating facetsand facet values from user-generated content has already beeninvestigated [10, 14]. Our findings validate the applicabilityof those approaches and highlight the inaptitude of currentsearch systems to deal with vague natural language. To cre-ate modern systems that live up to the expectations of users,search systems need to be designed from a user perspective,supporting natural language and vagueness where appropriate.
Limitations
In this section, we summarise the limitations of our user study. The descriptions of jackets are highly influenced by the season in which the user study took place. As the study was conducted in November, most participants described a winter jacket, limiting the mentioned attributes to this category. Future research should repeat the experiment during a different season to yield a fuller image of desired attributes. Furthermore, other product domains should be investigated in further studies.

Our results could furthermore be influenced by the amount of available reviews per product. While some products had a great amount of reviews (500 reviews for one product), others had few to no reviews (14 products without reviews, others with fewer than 10 reviews). This limitation also extends to the proposed solutions: using user reviews to improve product search and process vagueness in natural language queries requires the existence of user reviews in the first place.

The analysis of matching percentages ("coverage") is limited by the choice of purely lexical matching. Compared to manual matching, where annotators take a more sophisticated approach, answering "Is this information available in the text?" rather than "Is this string a substring of the text?", the automatic matching delivers conservative results. A more sophisticated matching would account for synonyms, homonyms, generalisations, and ambiguity. To develop such an algorithm, a deep understanding of vagueness in each product category is needed. Technical terms and technical abbreviations often do not appear in standard synonym databases. As described by Lehtola, Heinecke and Bounsaythip [26], retailers and users use different vocabularies, likewise complicating the application of advanced natural language processing techniques used in conversational search, e.g. query expansion and the usage of word embeddings [1]. Various natural language processing methods have already been developed for learning distributed word representations [30, 31] to address the vocabulary mismatch problem. Word embeddings can, for example, be used to identify which words are used in the same context. The context, however, is lost when using bullet point lists. For retailer-generated content, which is often a list of technical specifications, word embeddings might therefore not be enough. Conversely, for user-generated content, word embeddings could be used to identify synonyms and expand natural language queries with additional, related terms. The embeddings would have to be trained per topic to account for domain-specific vocabulary and word usage [9]. Other methods make use of deep neural networks to improve the matching process between queries and products [35]. These methods, however, are often based on behavioural data such as click-through data. To develop a search system with deep networks, data about user behaviour would need to be collected and annotated. The potential of applying these natural language techniques in the product search scenario needs to be investigated in future work.

Overall, understanding how users formulate their search intent in natural language provides the basis for developing more sophisticated matching algorithms. Our research provides first insights for the jacket and laptop domains and clears the way for developing sophisticated product search engines.
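As an illustration of the embedding-based query expansion discussed above, a sketch using gensim's Word2Vec (4.x API); the toy corpus, parameters, and function names are ours, and a production system would train on the full per-domain review corpus as noted:

```python
# Illustrative sketch of review-based query expansion with word embeddings
# (gensim 4.x API). The toy corpus and parameters are our own assumptions.
from gensim.models import Word2Vec

review_corpus = [
    ["warm", "cosy", "jacket", "winter"],
    ["cosy", "comfortable", "hood", "warm"],
    ["lightweight", "jacket", "waterproof"],
]  # tokenised, preprocessed review sentences

model = Word2Vec(review_corpus, vector_size=50, window=3, min_count=1, seed=1)

def expand_query(terms: list[str], topn: int = 3) -> list[str]:
    """Add the nearest embedding neighbours of each query term."""
    expanded = list(terms)
    for term in terms:
        if term in model.wv:
            expanded += [w for w, _ in model.wv.most_similar(term, topn=topn)]
    return expanded

print(expand_query(["warm"]))  # e.g. ["warm", "cosy", ...] on this toy corpus
```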
CONCLUSION
This paper investigated intuitively formulated information needs in product search with respect to their vagueness and their fit to currently existing product search systems, through an online user study (N=132). Our findings show the broad variety of information need formulations and how vagueness is used to describe products. We found that retailer-generated content does not match natural language queries well. User reviews have been shown to be a valuable source for improving product matching, especially for highly vague search queries. User reviews also provide a basis to generate user-centered facets or expand queries with synonymous terms. Based on our findings and the derived design implications, we are currently developing and evaluating a prototype of a search system that supports vague information needs. In addition to studying the user experience of such a system, we will investigate how to include products without a sufficient amount of user reviews in a system that relies primarily on user-generated content. In this context, we also explore the potential of using more sophisticated natural language processing techniques to improve the matching process.
ACKNOWLEDGMENTS
This work was partly funded by the DFG, grant no. 388815326 (the VACOS project at GESIS).
REFERENCES

[1] Salvatore Andolina, Valeria Orso, Hendrik Schneider, Khalil Klouche, Tuukka Ruotsalo, Luciano Gamberini, and Giulio Jacucci. 2018. Investigating Proactive Search Support in Conversations. In Proceedings of the 2018 Designing Interactive Systems Conference (DIS '18). Association for Computing Machinery, New York, NY, USA, 1295–1307. DOI: http://dx.doi.org/10.1145/3196709.3196734

[2] Anne Aula. 2003. Query Formulation in Web Information Search. In Proceedings of the IADIS International Conference WWW/Internet 2003, ICWI 2003, Algarve, Portugal, November 5-8, 2003. 403–410.

[3] Evelyn Balfe and Barry Smyth. 2004. Improving Web Search through Collaborative Query Recommendation. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI '04). IOS Press, NLD, 268–272.

[4] Catalin-Mihai Barbu, Guillermo Carbonell, and Jürgen Ziegler. 2019. The Influence of Trust Cues on the Trustworthiness of Online Reviews for Recommendations. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. ACM, 1687–1689. DOI: http://dx.doi.org/10.1145/3297280.3297603

[5] Elena Barsky and Judit Bar-Ilan. 2005. From the search problem through query formulation to results on the web. Online Information Review 29 (02 2005), 75–89. DOI: http://dx.doi.org/10.1108/14684520510583954

[6] Julia Cambre, Ying Liu, Rebecca E. Taylor, and Chinmay Kulkarni. 2019. Vitro: Designing a Voice Assistant for the Scientific Lab Workplace. In Proceedings of the 2019 on Designing Interactive Systems Conference (DIS '19). Association for Computing Machinery, New York, NY, USA, 1531–1542. DOI: http://dx.doi.org/10.1145/3322276.3322298

[7] Minji Cho, Sang-su Lee, and Kun-Pyo Lee. 2019. Once a Kind Friend is Now a Thing: Understanding How Conversational Agents at Home Are Forgotten. In Proceedings of the 2019 on Designing Interactive Systems Conference (DIS '19). Association for Computing Machinery, New York, NY, USA, 1557–1569. DOI: http://dx.doi.org/10.1145/3322276.3322332

[8] Cecilia di Sciascio, Vedran Sabol, and Eduardo E. Veas. 2016. Rank As You Go: User-Driven Exploration of Search Results. In Proceedings of the 21st International Conference on Intelligent User Interfaces (IUI '16). Association for Computing Machinery, New York, NY, USA, 118–129. DOI: http://dx.doi.org/10.1145/2856767.2856797

[9] Fernando Diaz, Bhaskar Mitra, and Nick Craswell. 2016. Query Expansion with Locally-Trained Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, 367–377. DOI: http://dx.doi.org/10.18653/v1/P16-1035

[10] Jan Feuerbach, Benedikt Loepp, Catalin-Mihai Barbu, and Jürgen Ziegler. 2017. Enhancing an Interactive Recommendation System with Review-based Information Filtering. In IntRS@RecSys. 2–9.

[11] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. 1987. The Vocabulary Problem in Human-System Communication. Commun. ACM 30, 11 (Nov. 1987), 964–971. DOI: http://dx.doi.org/10.1145/32206.32212

[12] Ido Guy. 2016. Searching by Talking: Analysis of Voice Queries on Mobile Web Search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Association for Computing Machinery, New York, NY, USA, 35–44. DOI: http://dx.doi.org/10.1145/2911451.2911525

[13] Marti Hearst, Ame Elliott, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ka-Ping Yee. 2002. Finding the Flow in Web Site Search. Commun. ACM 45, 9 (Sept. 2002), 42–49. DOI: http://dx.doi.org/10.1145/567498.567525

[14] Stefan Hirschmeier and Marc Egger. 2018. Social Product Search: Enhancing Product Search with Mined (Sparse) Product Features. In ECIS. 170.

[15] Mohit Jain, Pratyush Kumar, Ramachandra Kota, and Shwetak N. Patel. 2018. Evaluating and Informing the Design of Chatbots. In Proceedings of the 2018 Designing Interactive Systems Conference (DIS '18). Association for Computing Machinery, New York, NY, USA, 895–906. DOI: http://dx.doi.org/10.1145/3196709.3196735

[16] Bernard J. Jansen, Amanda Spink, and Tefko Saracevic. 1998. Failure Analysis in Query Construction: Data and Analysis from a Large Sample of Web Queries. In Proceedings of the Third ACM Conference on Digital Libraries (DL '98). Association for Computing Machinery, New York, NY, USA, 289–290. DOI: http://dx.doi.org/10.1145/276675.276735

[17] Andreas H. Jucker, Sara W. Smith, and Tanja Lüdge. 2003. Interactive aspects of vagueness in conversation.
Journal of Pragmatics
35, 12 (2003), 1737 – 1769.
DOI: http://dx.doi.org/https://doi.org/10.1016/S0378-2166(02)00188-1 [18] Yvonne Kammerer and Maja Bohnacker. 2012.Children’s Web Search with Google: The Effectivenessof Natural Language Queries. In
Proceedings of the 11thInternational Conference on Interaction Design andChildren (IDC â ˘A ´Z12) . Association for ComputingMachinery, New York, NY, USA, 184â ˘A ¸S187.
DOI: http://dx.doi.org/10.1145/2307096.2307121 [19] Rajesh Kanwar, Lorna Grund, and Jerry C Olson. 1990.When do the measures of knowledge measure what wethink they are measuring?
NA - Advances in ConsumerResearch
17 (1990), 603–608.1120] Makoto P. Kato, Takehiro Yamamoto, Hiroaki Ohshima,and Katsumi Tanaka. 2014. Cognitive Search IntentsHidden behind Queries: A User Study on QueryFormulations. In
Proceedings of the 23rd InternationalConference on World Wide Web (WWW â ˘A ´Z14Companion) . Association for Computing Machinery,New York, NY, USA, 313â ˘A ¸S314.
DOI: http://dx.doi.org/10.1145/2567948.2577279 [21] Dagmar Kern, Wilko van Hoek, and Daniel Hienert.2018. Evaluation of a Search Interface forPreference-Based Ranking: Measuring User Satisfactionand System Performance. In
Proceedings of the 10thNordic Conference on Human-Computer Interaction(NordiCHI ’18) . Association for Computing Machinery,New York, NY, USA, 184â ˘A ¸S194.
DOI: http://dx.doi.org/10.1145/3240167.3240170 [22] Timm Kleemann and JÃijrgen Ziegler. 2019. Integrationof Dialog-based Product Advisors into Filter Systems.In
Proceedings of the Conference on Mensch undComputer (ACM International Conference ProceedingSeries) . ACM Press, New York, 67â ˘A ¸S77.
DOI: http://dx.doi.org/10.1145/3340764.3340786 [23] Lorenz Cuno Klopfenstein, Saverio Delpriori, SilviaMalatini, and Alessandro Bogliolo. 2017. The Rise ofBots: A Survey of Conversational Interfaces, Patterns,and Paradigms. In
Proceedings of the 2017 Conferenceon Designing Interactive Systems (DIS â ˘A ´Z17) .Association for Computing Machinery, New York, NY,USA, 555â ˘A ¸S565.
DOI: http://dx.doi.org/10.1145/3064663.3064672 [24] Jonathan Koren, Yi Zhang, and Xue Liu. 2008.Personalized Interactive Faceted Search. In
Proceedingsof the 17th International Conference on World Wide Web(WWW â ˘A ´Z08) . Association for Computing Machinery,New York, NY, USA, 477â ˘A ¸S486.
DOI: http://dx.doi.org/10.1145/1367497.1367562 [25] Logan Lebanoff and Fei Liu. 2018. Automatic Detectionof Vague Words and Sentences in Privacy Policies. In
Proceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing . Associationfor Computational Linguistics, Brussels, Belgium,3508–3517.
DOI: http://dx.doi.org/10.18653/v1/D18-1387 [26] Aarno Lehtola, Johannes Heinecke, and CatherineBounsaythip. 2003. Intelligent human language queryprocessing in MKBEEM.
Proceedings of HCIInternational/UAHCI 2003 (2003), 22–27.[27] David D Lewis and Karen Sparck Jones. 1996. Naturallanguage processing for information retrieval.
Commun.ACM
39, 1 (1996), 92–101.[28] Jason Mander. 2016. Chart of the day: 1 in 5 using voicesearch on mobile.https://blog.globalwebindex.com/chart-of-the-day/1-in-5-using-voice-search-on-mobile/. (February 2016).Accessed: 2020-01-30. [29] Jason Mander and Chase Buckle. 2018.
Voice Search: Adeep-dive into the consumer uptake of the voice assistanttechnology.
Technical Report INSIGHT REPORT 2018.GlobalWebIndex, London, United Kingdom. 22 pages.
Accessed: 2020-01-30.[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, GregCorrado, and Jeffrey Dean. 2013. DistributedRepresentations of Words and Phrases and TheirCompositionality. In
Proceedings of the 26thInternational Conference on Neural InformationProcessing Systems - Volume 2 (NIPSâ ˘A ´Z13) . CurranAssociates Inc., Red Hook, NY, USA, 3111â ˘A ¸S3119.[31] Jeffrey Pennington, Richard Socher, and Christopher DManning. 2014. Glove: Global vectors for wordrepresentation. In
Proceedings of the 2014 conferenceon empirical methods in natural language processing(EMNLP) . 1532–1543.[32] David A. Robb, José Lopes, Stefano Padilla, AtanasLaskov, Francisco J. Chiyah Garcia, Xingkun Liu,Jonatan Scharff Willners, Nicolas Valeyrie, KatrinLohan, David Lane, Pedro Patron, Yvan Petillot, Mike J.Chantler, and Helen Hastie. 2019. Exploring Interactionwith Remote Autonomous Systems UsingConversational Agents. In
Proceedings of the 2019 onDesigning Interactive Systems Conference (DIS â ˘A ´Z19) .Association for Computing Machinery, New York, NY,USA, 1543â ˘A ¸S1556.
DOI: http://dx.doi.org/10.1145/3322276.3322318 [33] Napawan Sawasdichai and Sharon Poggenpohl. 2002.User Purposes and Information-Seeking Behaviors inWeb-Based Media: A User-Centered Approach toInformation Design on Websites. In
Proceedings of the4th Conference on Designing Interactive Systems:Processes, Practices, Methods, and Techniques (DISâ ˘A ´Z02) . Association for Computing Machinery, NewYork, NY, USA, 201â ˘A ¸S212.
DOI: http://dx.doi.org/10.1145/778712.778742 [34] Alex Sciuto, Arnita Saini, Jodi Forlizzi, and Jason I.Hong. 2018. â ˘AIJHey Alexa, Whatâ ˘A ´Zs Up?â ˘A˙I: AMixed-Methods Studies of In-Home ConversationalAgent Usage. In
Proceedings of the 2018 DesigningInteractive Systems Conference (DIS â ˘A ´Z18) .Association for Computing Machinery, New York, NY,USA, 857â ˘A ¸S868.
DOI: http://dx.doi.org/10.1145/3196709.3196772 [35] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, andGrégoire Mesnil. 2014. Learning semanticrepresentations using convolutional neural networks forweb search. In
Proceedings of the 23rd InternationalConference on World Wide Web . 373–374.[36] Arthur HM ter Hofstede, Henderik Alex Proper, andTh P van der Weide. 1996. Query formulation as aninformation retrieval problem.
Comput. J.
39, 4 (1996),255–274.1237] Daniel Tunkelang. 2006. Dynamic category sets: Anapproach for faceted search. In
ACM SIGIR , Vol. 6.[38] Kristen Vaccaro, Tanvi Agarwalla, Sunaya Shivakumar,and Ranjitha Kumar. 2018. Designing the Future ofPersonal Fashion. In
Proceedings of the 2018 CHIConference on Human Factors in Computing Systems(CHI â ˘A ´Z18) . Association for Computing Machinery,New York, NY, USA, Article Paper 627, 11 pages.
DOI: http://dx.doi.org/10.1145/3173574.3174201 [39] Alexandra Vtyurina and Adam Fourney. 2018.Exploring the Role of Conversational Cues in GuidedTask Support with Virtual Assistants. In
Proceedings of the 2018 CHI Conference on Human Factors inComputing Systems (CHI â ˘A ´Z18) . Association forComputing Machinery, New York, NY, USA, ArticlePaper 208, 7 pages.
DOI: http://dx.doi.org/10.1145/3173574.3173782 [40] Ka-Ping Yee, Kirsten Swearingen, Kevin Li, and MartiHearst. 2003. Faceted Metadata for Image Search andBrowsing. In
Proceedings of the SIGCHI Conference onHuman Factors in Computing Systems (CHI â ˘A ´Z03) .Association for Computing Machinery, New York, NY,USA, 401â ˘A ¸S408.
DOI: http://dx.doi.org/10.1145/642611.642681http://dx.doi.org/10.1145/642611.642681