An Empirical Study of Clarifying Question-Based Systems
Jie Zou
University of Amsterdam, Amsterdam, The Netherlands ([email protected])
Evangelos Kanoulas
University of Amsterdam, Amsterdam, The Netherlands ([email protected])
Yiqun Liu
BNRist, DCST, Tsinghua University, Beijing, China ([email protected])
ABSTRACT
Search and recommender systems that take the initiative to ask clarifying questions to better understand users' information needs are receiving increasing attention from the research community. However, to the best of our knowledge, there is no empirical study to quantify whether and to what extent users are willing or able to answer these questions. In this work, we conduct an online experiment by deploying an experimental system, which interacts with users by asking clarifying questions against a product repository. We collect both implicit interaction behavior data and explicit feedback from users, showing that: (a) users are willing to answer a good number of clarifying questions (11-21 on average), but not many more than that; (b) most users answer questions until they reach the target product, but a fraction of them stop due to fatigue or due to receiving irrelevant questions; (c) part of the users' answers (12-17%) are actually opposite to the description of the target product; while (d) most of the users (66-84%) find the question-based system helpful towards completing their tasks. Some of the findings of the study contradict current assumptions underlying simulation-based evaluations in the field, while they point towards improvements of the evaluation framework and can inspire future interactive search/recommender system designs.
KEYWORDS
Empirical Study; Question-based Systems; Asking Clarifying Questions; Conversational Search; Conversational Recommendation
ACM Reference Format:
Jie Zou, Evangelos Kanoulas, and Yiqun Liu. 2020. An Empirical Study of Clarifying Question-Based Systems. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 6 pages. https://doi.org/10.475/123_4
INTRODUCTION

One of the key components of conversational search and recommender systems [3, 9, 11] is the construction and selection of good clarifying questions to gather item information from users in a searchable repository. Most current studies either collect and learn from human-to-human conversations [2, 5, 8], or create a pool of questions on the basis of some "anchor" text (e.g. item aspects [1, 9],
entities [10-13], grounding text [6, 7]) that characterizes the searchable items themselves. Although the aforementioned works have demonstrated success in helping systems better understand users, most of them evaluate algorithms by means of simulations which assume that users are willing to provide answers to as many questions as the system generates, and that users can always answer the questions correctly, i.e. they always know what the target item should look like in its finest details. On the basis of such assumptions, their evaluations (e.g. Bi et al. [1], Zhang et al. [9], Zou and Kanoulas [11]) focus on whether the system can place the target item at a high ranking position. To the best of our knowledge, there is no empirical study to quantify whether and to what extent users can respond to these questions, and what usefulness users perceive while interacting with the system.

In this paper we conduct a user study by deploying an online question-based system to answer the following research questions:

(1) To what extent are users willing to engage with a question-based system?
(2) To what extent can users provide correct answers to the generated questions?
(3) How useful do users find interacting with a question-based system?

The study is repeated under two conditions: (a) the question-based system uses an oracle to always obtain the right answer to the questions asked, and (b) the system uses the user's answers, even if they are imperfect, in ranking items and choosing the next question to ask. We believe that answering these research questions can help the community design better evaluation frameworks and more robust question-based systems.
STUDY DESIGN

In our study, the users interact with a question-based system in the domain of online retail. The user answers questions prompted by the system with a "Yes", a "No", or a "Not Sure", in order to find a target product to buy. The architecture of our system is shown in Figure 1, with the user going through four steps.
Step 1: Category selection.
In this step, the users select an Amazon category that they feel most familiar with and that fits their interests, e.g. a category from which they have purchased products before.

Step 2: Target product assignment.
We randomly assign a target product to the user from the selected category. The user is requested to read the title and the description of the product carefully. A picture of the product is also provided. Asking the users to carefully read the description simulates a use case in which the user really knows what she is looking for, as opposed to an exploratory use case. (Categories and dataset we used: http://jmcauley.ucsd.edu/data/amazon/links.html)
[Figure 1 UI pages, summarized. Step 1, category selection: "Imagine that you want to buy a product. Please select the product category you are most familiar with (e.g., most frequently purchased category)." Step 2, target product assignment: the product picture, title (e.g. "Hog Wild Fish Sticks (Sold Individually)"), and description are shown, with a "Change target product" button and a "start conversational search" button. Step 3, finding the target product: the system asks questions such as "Is 'rosewood' relevant to the product you are looking for?" with Yes / No / Not Sure options, Next and Stop buttons, and a ranked list of search results (top 1 to top 4); users are instructed to choose "yes" when the selected term is present in the title and description, "no" when it is absent, and "not sure" otherwise. Step 4, the exit questionnaire: Q1, whether the question-based system was helpful towards locating the target product; Q2, whether the user would use such a system in the future; Q3, overall experience (1, very negative, to 5); Q4, how many questions the user is willing to answer; Q5/Q6, why the user stopped answering; Q7, whether the generated questions are easy to answer.]
Figure 1: System architecture and main UI pages
If the user is not familiar with the target product, or the description is not clear to her, the user can request a new randomly selected product.
Step 3: Find the target product.
After the user indicates that the conversation with the system can start, the target product disappears from the screen and the system selects a question to ask the user. The user needs to provide an answer on the basis of the target product information she read in the previous step. To help the user better understand her task of answering questions, an example target product along with an example conversation is also shown to the user. Once the user answers the question, a 4-by-4 grid of the pictures of the top sixteen ranked products is shown to the user, along with the next clarifying question. The user can stop answering questions at any time during her interaction with the system.

To select which clarifying question to ask, a state-of-the-art algorithm [11] is deployed to first extract important entities from each product description (e.g. product aspects) and construct questions of the form "Is [entity] relevant to the product you are looking for?". Then, it selects the information-theoretically optimal question, that is, the question that splits the probability mass of predicted user preferences over items as close to two halves as possible, and updates this predicted preference on the basis of the user's answer [11]. In this work, we compare the results under two conditions: (a) the system updates the predicted preference using the correct answer, i.e. the answer which agrees with the description of the product, independent of the user's answer, and (b) the system updates its belief using the user's noisy answers. Under the first condition, we study user behavior under a system that is perfect from an information-theoretic point of view, leading to a best-case analysis and conclusions, while under the second condition we study user behavior when the system gets confused and becomes suboptimal due to the user's mistakes.
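To make the selection criterion concrete, the following Python sketch illustrates such an information-theoretic question loop. It is an illustration of the idea rather than the implementation of [11]: the entity-occurrence matrix, the fixed answer-noise probability `noise`, and the function names are our own simplifying assumptions.

```python
import numpy as np

def select_question(belief, has_entity, asked):
    """Pick the entity question whose expected "yes" probability mass
    best halves the current belief over products.

    belief:     (n_items,) probability distribution over products
    has_entity: (n_questions, n_items) boolean matrix; entry [q, i] is
                True if entity q occurs in product i's title/description
    asked:      set of indices of questions already asked
    """
    yes_mass = has_entity.astype(float) @ belief  # P(answer = "yes") per question
    split_score = np.abs(yes_mass - 0.5)          # 0 = perfect halving
    split_score[list(asked)] = np.inf             # never repeat a question
    return int(np.argmin(split_score))

def update_belief(belief, entity_row, answer, noise=0.1):
    """Bayesian update of the belief after one user answer.

    answer is "yes", "no", or "not sure"; noise is an assumed probability
    that the user answers a question incorrectly.
    """
    if answer == "not sure":                      # treated as uninformative
        return belief
    agrees = entity_row if answer == "yes" else ~entity_row
    likelihood = np.where(agrees, 1.0 - noise, noise)
    posterior = belief * likelihood
    return posterior / posterior.sum()
```

In this sketch, condition (a) corresponds to deriving `answer` from the target product's own description (with `noise=0`), while condition (b) passes the user's possibly erroneous answer in unchanged.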
Step 4: Questionnaire.
In this step users are asked a number of questions about their experience with the system, for further analysis.
RESULTS AND ANALYSIS

Our research questions revolve around user engagement and the perceived value of the system:

RQ1: Are users willing to answer the clarifying questions, how many of them, when do they stop and why, and how fast do they provide the answers?

RQ2: To what extent can users provide correct answers given a target product, and what factors affect this?

RQ3: How useful do users find the clarifying questions, what is their overall experience, and how likely are they to use such a system in the future?
Prior to the actual study, we ran a pilot study with a small number of users, in a controlled environment, and iterated over the experimental design and the user interface until no issues or concerns were reported. Then we considered two conditions. Under the first one, the system used an oracle to obtain the correct answers to the questions it asked. For the actual study, 53 participants located in the USA were recruited through Amazon Mechanical Turk and 1025 conversations were collected. The participants were of varying gender, age, career field, English skills, and online shopping experience. In particular, gender: 34 male, 19 female; age: 2 in 18-23, 8 in 23-27, 14 in 27-35, 29 older than 35 years old; career field: 22 in science, computers & technology, 8 in management, business & finance, 7 in hospitality, tourism & the service industry, 3 in education and social services, 2 in arts and communications, 2 in trades and transportation, 9 did not specify their career field; English skills: all of them were native speakers; online shopping experience: 44 were mostly shopping online, 9 did online shopping once or twice per year. Under the second condition, the system actually used the users' answers; 48 users participated in this one, also with varying demographic and skill characteristics, and 1833 conversations were collected. Participants were paid 2.5 dollars to complete the study. For quality control, we only engaged Master Workers (high-performing workers identified by Mechanical Turk who have demonstrated excellence across a wide range of tasks), filtered out users who spent less than 3 seconds on reading the product title and description, and filtered out users who gave random answers (~50% correct/wrong).
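As an illustration of how such quality-control filtering might be applied to the collected logs, consider the sketch below; the record layout, the helper name `keep_worker`, and the 0.10 band around chance accuracy are hypothetical choices of ours, not the paper's published pipeline.

```python
def keep_worker(sessions, min_read_s=3.0, chance_band=0.10):
    """Decide whether a worker's data passes quality control.

    sessions: list of dicts, one per conversation, e.g.
      {"read_time_s": 12.4,
       "answers": [{"label": "yes", "correct": True}, ...]}
    """
    # Drop workers who skimmed the product title and description.
    if any(s["read_time_s"] < min_read_s for s in sessions):
        return False
    # Pool all yes/no answers (ignore "not sure", which is neither
    # correct nor incorrect) and compare accuracy against chance.
    decided = [a for s in sessions for a in s["answers"]
               if a["label"] != "not sure"]
    if not decided:
        return False
    accuracy = sum(a["correct"] for a in decided) / len(decided)
    return abs(accuracy - 0.5) > chance_band
```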
To answer RQ1, we attempt to answer the following sub-questions: (1) Are users willing to answer the system's questions? (2) How many questions are the users willing to answer? (3) When do they stop and why? (4) How fast are they able to answer?

We first investigate whether users are willing to answer the system's questions and how many of them, both by observing the actual number of questions users answered when interacting with the question-based system and what they declared in the exit questionnaire. The findings under the oracle condition are summarized in Figure 2, with the left subfigure depicting the actual number of questions answered by the users when interacting with our question-based system, and the right subfigure depicting the number of questions the users declared they are willing to answer in the exit questionnaire. In the left subfigure of Figure 2, the red line represents the accumulated percentage of users willing to answer a certain number of questions; the light blue histogram reflects the percentage per number of questions.

Figure 2: The number of questions the users actually answered in the system (left) and declared in the exit questionnaire (right). In the system, the average number of answered questions per product is 11.4, and 70.3% of users answered 4-12 questions per product. In the questionnaire, 50% of users are willing to answer 6-10 questions.

The results in Figure 2 show that users answer a minimum of 2 and a maximum of 48 questions. The average number of questions answered per target product is 11.4, the median number is 7, and 70.3% of users answered 4-12 questions per product, while in the exit questionnaire about 50% of the users declare that they are willing to answer 6-10 questions.

We also compare the afore-described statistics with those under the condition that the system updates its beliefs, and hence chooses the next question and ranks items, according to the actual user answers, however imperfect they might be. In this latter case we observe that the average number of questions answered per target product is 21, and the median number is 14, almost double that of the oracle condition. We do not provide these plots due to space limitations. We hypothesize (and later confirm through the exit questionnaire) that this is because our users really try to locate the target product and go as far as it takes to make that happen or get frustrated; however, their noisy answers confuse the algorithm, and it takes longer to bring the target product to the top of the recommendation list.

Further, we explore why users stop answering questions. Users could select one out of six answers in the exit questionnaire: "The target product was found", "A similar product was found", "I got tired of answering questions", "I could not answer the questions", "The questions asked were irrelevant", and "Other". The results under the oracle condition in Figure 3 show that while a small percentage of users stop due to fatigue (14%) or due to irrelevant questions being asked (7%), the large majority of users (77%) stop because they located the target product. Under the second condition of imperfect answers most users also stop answering questions because they found the target products (38%), but other reasons are more prominent, such as fatigue (34%) or receiving irrelevant questions (22%).

Figure 3: The reasons for stopping answering questions. Most users stop answering questions after they find their target products.

We then analyze how quick the users are in answering questions. Figure 4a shows a box plot of the time spent per answer, while Figure 4b better demonstrates the distribution.

Figure 4: Time spent per question by a user in order to provide an answer: (a) box plot, (b) histogram of time spent. The average time for answering one question is 7.1 seconds.
From the results, we observe that the minimum time for answering one question is 1.75 seconds, the average time is 7.1 seconds, and the median time is 4.98 seconds; 86.5% of the users spent between 1.75s and 11.59s. Despite a median time of about 5 seconds per question, in the exit questionnaire 98% of the users indicate that the system's questions are easy to answer. When using the users' imperfect answers, the average time for answering a question is 6.22 seconds and the median time is 4.1 seconds, similar to the oracle condition.

For RQ2, we first explore to what extent users can provide correct answers. As one can observe in Table 1, users provide correct answers 73.1% of the time, are not sure 9.6% of the time, and are wrong 17.3% of the time. Under the imperfect-user-answer setup, the afore-described percentages are 78.3%, 9.5%, and 12.2%, respectively.

Table 1: The % of correct, "not sure", and incorrect answers.

                  Correct   Not sure   Incorrect
  Oracle           73.1%     9.6%       17.3%
  Imperfect user   78.3%     9.5%       12.2%

We then explore which factors affect the percentage of incorrect answers. We do that under both setups, but we only report the oracle setup, given that the numbers are very close across the two.

In particular, we first investigate whether the percentage of incorrect answers differs across users. The results in Figure 5a show that the percentages of correct, "not sure", and incorrect answers vary across users. This might be due to the varied knowledge of different users. A couple of users provide a high percentage of incorrect answers; this might be due to the crowdsourcing nature of the experiment.

Further, we explore whether the percentage of correct answers differs across target products. The results are shown in Figure 5b, from which we conclude that the percentages of incorrect answers also vary across target products, but not as much as they vary across users.

The percentages of correct, "not sure", and incorrect answers for the different questions asked by the system are shown in Figure 5d. Here we observe some dramatic differences across questions, with a small subset of questions receiving almost always incorrect answers. This might be because some questions are more ambiguous than others. This finding suggests improvements of question-based systems in multiple directions: for instance, one can try to improve the question pool by considering different question characteristics, or one could develop question selection strategies that also account for the chance of the user providing a wrong answer.

Further, we explore whether the percentage of incorrect answers is correlated with the question index, or whether it remains stable throughout the conversation. The results are shown in Figure 5c, where the lines show the average percentages of correct, "not sure", and incorrect answers as a function of the question index within the conversation, and the histogram shows the average incorrect-answer percentage over a sliding window. From Figure 5c, it can be observed that the percentages of correct, "not sure", and incorrect answers fluctuate, but in principle remain at similar levels throughout the conversation.

Last, we explore whether the percentage of incorrect answers is correlated with the time spent giving the answers. The results within different time intervals are shown in Figure 5e. We divide the time spent per question (1.75s - 50.96s) into 5 equal non-overlapping buckets (or frames): Frame 1 through Frame 5 are 1.75-11.59s, 11.59-21.43s, 21.43-31.28s, 31.28-41.12s, and 41.12-50.96s, respectively. From Figure 5e, we see that the percentage of incorrect answers decreases with more time spent. Also, we calculate the time spent when users give a correct answer, a "not sure" answer, and an incorrect answer: the averages are 6.59s, 10.81s, and 7.12s respectively, and the medians 4.65s, 8.20s, and 5.06s respectively. This suggests that users usually spend more time when they are not sure about the answer, but roughly the same time whether they are right or wrong about a question.
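The bucketing itself is straightforward; a minimal sketch, assuming hypothetical `times` (seconds per answer) and `incorrect` (boolean per answer) arrays extracted from the interaction logs:

```python
import numpy as np

def incorrect_rate_per_frame(times, incorrect, n_frames=5):
    """Split [min(times), max(times)] into equal-width frames and
    return the fraction of incorrect answers falling in each frame."""
    edges = np.linspace(times.min(), times.max(), n_frames + 1)
    # np.digitize maps each time to a bin index in 1..n_frames
    # (the maximum value lands in n_frames + 1, hence the clip).
    frame = np.clip(np.digitize(times, edges), 1, n_frames) - 1
    return [float(incorrect[frame == f].mean()) if np.any(frame == f) else 0.0
            for f in range(n_frames)]
```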
Regarding RQ3, we explore how useful users find interacting with such a question-based system. We ask the users (a) whether they think the question-based system is helpful, (b) whether they would use such a system in the future, and (c) what their rating of the system is, ranging from 1 (very negative) to 5 (very positive). The results using oracle answers are shown in Figure 6, in the three plots respectively. From the results we collected, most users think the question-based system is helpful and would use it in the future. Specifically, 83.9% of users are positive about the helpfulness, 5.4% are neutral, and 10.7% are negative. Further, 60.7% of users are positive about using such a system in the future, 30.4% are neutral, and 8.9% are negative. Regarding user ratings, the results show 46.5% of 5-star ratings, 37.5% of 4-star ratings, 7.1% of 3-star ratings, 7.1% of 2-star ratings, and 1.8% of 1-star ratings; 84% of the users gave a rating of at least 4.

In the case of imperfect answers, 66% of users are positive, 6% are neutral, and 28% are negative about being helped by the system. Regarding using such a system in the future, 40% of users are positive, 20% are neutral, and 40% are negative. Regarding ratings, 76% of users gave a rating greater than or equal to 3; specifically, there are 22% of 5's, 32% of 4's, 22% of 3's, 16% of 2's, and 8% of 1's. Still, most users are positive towards the conversational recommender. But we also observe that, when updating with the user's answers, users are less positive than under oracle-answer updating. It is therefore clear that the quality of the user answers affects the quality of the system's questions and the overall user experience with the system.
Figure 5: The percentage of correct answers, "not sure" answers (which cannot be classified as correct or incorrect), and incorrect answers. (a) The % varies per user; (b) the % varies across different target products; (c) the % remains stable throughout the conversation; (d) the % varies per question, with only a few questions receiving most of the incorrect answers; (e) the % of incorrect answers decreases with more time spent.

Figure 6: User-perceived helpfulness. (a) Is the question-based system helpful; (b) will you use the question-based system in the future; (c) ratings. Most users are positive towards question-based systems.

CONCLUSIONS AND FUTURE WORK

In this paper, we conduct an empirical study using a question-based product search system, to better understand users and gain insight into user behavior and interaction with such systems. We deploy a state-of-the-art question-based system online and collect interaction log data and questionnaire data for analysis. We find that users are willing to answer a certain number of system-generated questions and stop answering questions when they find the target product, but only if the questions are relevant and well selected. While users are able to answer these questions effectively, we also observe that users provide incorrect answers at a rate of about 17%; this rate is affected mostly by some, still to be identified, question characteristics, while it also varies across users and products. Last, most users are positive towards question-based systems and think that these systems help them towards achieving their goals, although this feeling is weaker with systems that are not robust to imperfect answers. The take-home message, if there is one, is that current research should drop the assumption that users are happy to answer as many questions as the system generates and that all questions are answered correctly.

One limitation of this work is the isolated clarifying-question environment of the study. A more realistic experiment would require clarifying questions to be embedded in an existing environment, where the user is enabled not only to answer questions, but also to reformulate her query, filter results by selecting pre-defined item attributes, and browse the results to the preferred depth. Also, a mixed-initiative approach under which a system switches between asking questions and understanding user searches, combining the two, is worth studying. Such a study is in our future plans. A further limitation of this work is that this was not an in-situ experiment but a simulation of a use case of a question-based system involving crowd workers. Hence, the findings are only as good as our simulation of a user looking for a target product. Furthermore, we cannot know whether by running the study for a long period of time the results would have been the same, or whether we are observing some novelty effects [4]. Other factors, such as question quality, question format (e.g. yes/no or open questions), and noisy answers, may affect the results, and studying these factors in an A/B testing experiment would therefore be beneficial. We also leave that for future work.
REFERENCES

[1] Keping Bi, Qingyao Ai, Yongfeng Zhang, and W. Bruce Croft. 2019. Conversational Product Search Based on Negative Feedback. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). Association for Computing Machinery, New York, NY, USA, 359-368.
[2] Qibin Chen, Junyang Lin, Yichang Zhang, Ming Ding, Yukuo Cen, Hongxia Yang, and Jie Tang. 2019. Towards Knowledge-Based Recommender Dialog System. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 1803-1813.
[3] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards Conversational Recommender Systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). Association for Computing Machinery, New York, NY, USA, 815-824.
[4] Ron Kohavi and Roger Longbotham. 2017. Online Controlled Experiments and A/B Testing. Encyclopedia of Machine Learning and Data Mining 7, 8 (2017), 922-929.
[5] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards Deep Conversational Recommendations. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 9725-9735.
[6] Mao Nakanishi, Tetsunori Kobayashi, and Yoshihiko Hayashi. 2019. Towards Answer-unaware Conversational Question Generation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics, Hong Kong, China, 63-71.
[7] Peng Qi, Yuhao Zhang, and Christopher D. Manning. 2020. Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations. arXiv preprint arXiv:2004.14530 (2020).
[8] Yueming Sun and Yi Zhang. 2018. Conversational Recommender System. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18). ACM, New York, NY, USA, 235-244.
[9] Yongfeng Zhang, Xu Chen, Qingyao Ai, Liu Yang, and W. Bruce Croft. 2018. Towards Conversational Search and Recommendation: System Ask, User Respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18). ACM, New York, NY, USA, 177-186.
[10] Jie Zou, Yifan Chen, and Evangelos Kanoulas. 2020. Towards Question-Based Recommender Systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 881-890.
[11] Jie Zou and Evangelos Kanoulas. 2019. Learning to Ask: Question-Based Sequential Bayesian Product Search. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM '19). Association for Computing Machinery, New York, NY, USA, 369-378.
[12] Jie Zou and Evangelos Kanoulas. 2020. Towards Question-Based High-Recall Information Retrieval: Locating the Last Few Relevant Documents for Technology-Assisted Reviews. ACM Trans. Inf. Syst. 38, 3, Article 27 (May 2020), 35 pages.
[13] Jie Zou, Dan Li, and Evangelos Kanoulas. 2018. Technology Assisted Reviews: Finding the Last Few Relevant Documents by Asking Yes/No Questions to Reviewers. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18). ACM, New York, NY, USA.