User Engagement Prediction for Clarification in Search
Ivan Sekulić, Mohammad Aliannejadi, and Fabio Crestani
Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
{ivan.sekulic,fabio.crestani}@usi.ch
University of Amsterdam, Amsterdam, The Netherlands
[email protected]
Abstract.
Clarification is increasingly becoming a vital factor in various topics of information retrieval, such as conversational search and modern Web search engines. Prompting the user for clarification in a search session can be very beneficial to the system, as the user's explicit feedback helps the system improve retrieval considerably. However, it comes with a high risk of frustrating the user in case the system fails to ask decent clarifying questions. Therefore, it is of great importance to determine when and how to ask for clarification. To this aim, in this work, we model search clarification prediction as a user engagement problem. We assume that the better a clarification is, the higher the user engagement with it would be. We propose a Transformer-based model to tackle the task. A comparison with competitive baselines on large-scale real-life clarification engagement data proves the effectiveness of our model. Also, we analyse the effect of all result page elements on the performance and find that, among others, the ranked list of the search engine leads to considerable improvements. Our extensive analysis of task-specific features guides future research.
Keywords:
Search Clarification · Mixed-Initiative Conversations · User Engagement Prediction
The primary goal of an information retrieval (IR) system is satisfying the user information need, which can often be ambiguous when expressed as a short query. Incorporating users' implicit feedback has long been studied for improved retrieval [17]. However, the recent rise of interest in conversational systems and mixed-initiative interactions has enabled IR systems to collect users' explicit feedback. Current research focuses on prompting users for feedback by asking for clarification [32,2,43]. For example, search clarification has recently been utilised in search engines, leading to an improved user experience [43]. Another prominent area studying clarification is conversational search, as the system can usually output only one response, thus requiring it to clarify the user's intent [32,2].
Fig. 1: An example of a Bing clarification pane, taken from [44].

The importance of clarification further increases in a mixed-initiative conversational setting [39], where control of the conversation goes back and forth between user and system through assertions, prompts, and questions [32]. However, clarification in search has proved to be a cumbersome task [45], posing a higher risk of user dissatisfaction. The challenge arises from two main aspects: deciding whether or not it is necessary to ask for clarification, and selecting or generating the appropriate clarifying question. Clarification selection can in fact be formalised as a user engagement prediction problem. User engagement refers to the quality of user experience characterised by, among others, attributes of positive affect, attention, interactivity, and perceived user control [26]. Persistent user interactions with the clarification mechanism are an indication of a well-designed system. Furthermore, through these interactions users provide implicit feedback about the necessity and the quality of the prompted clarifications.

Recently, modern search engines have included various types of clarification components in their systems. An example of such a component in Bing, namely a clarification pane, can be seen in Figure 1. Given a user query, a number of Microsoft's internal algorithms propose a clarifying question and offer clickable answers that filter the retrieved results according to the user's need. Research on the quality of asked clarifying questions and potential answers is still in its early stages [43]; however, Zamani et al. [44] argued that the engagement level could be an indicator of clarification system quality. User engagement prediction has been studied in various domains of IR [25]. However, studying and modelling user engagement with web search clarification remains relatively unexplored.

In this paper, we focus on the task of predicting the user engagement level (ELP) with clarification panes. Given an initial query, search results, and a clarification pane, ELP aims to estimate how engaged the user would be with the clarification pane. Previous work [45] studies how engagement levels correlate with query attributes such as query type and aspects. However, the relationship between SERPs and engagement has not yet been explored. We stress the importance of utilising retrieved results, as they can contain cues as to how faceted or ambiguous the query is, suggesting how necessary the clarification is in the first place.

Moreover, users' engagement with the system implicitly discloses information about the necessity and the quality of the asked clarification. The quality aspect can be modelled under the assumption that the higher the engagement levels, the better the question and the provided answers are. We make this assumption inspired by a large body of work in the IR community on implicit feedback from aggregated click-through rates for document retrieval [42]. Also, we study clarification necessity prediction through ELP. Our clarification necessity prediction model takes as input the initial query and the retrieved results list and predicts the level of user engagement with a clarification pane.
Although certain attributes of the initial query, such as length and ambiguity, could indicate the necessity of asking clarifying questions, we show that incorporating other SERP elements, such as result titles and snippets, plays an important role in improving prediction accuracy. We formulate the task as supervised regression and propose a deep learning-based model for the prediction of engagement levels. We compare the performance of the model to various central tendency measures and a number of traditional machine learning algorithms, as well as popular neural models. Our model, based on a Transformer architecture, jointly encodes the user query, the clarification pane, and the SERP elements, outperforming competitive baselines. We evaluate the performance of our model on MIMICS [44] (https://github.com/microsoft/MIMICS), a large-scale dataset of search clarification engagements collected from millions of interaction records of Bing users. Our extensive experiments establish a strong baseline for the task, while ablation studies and analysis of the model's inner mechanisms provide guidelines for future research. Our main contributions can be summarised as follows:

• We formally introduce the clarification pane ELP task as supervised regression and propose a Transformer-based model to tackle it. We make the code publicly available for reproducibility purposes (https://github.com/isekulic/mimics-EL-benchmark).
• We perform ablation studies with respect to the model input data. We find that utilising retrieved search results greatly benefits the model's performance.
• We perform a detailed analysis of the performance of our model w.r.t. various characteristics of the SERP.

To the best of our knowledge, our work is the first to utilise SERP elements for clarification pane engagement prediction. More precisely, we find that utilising search results in certain ways is highly beneficial for the ELP task, as the performance of our model increases by up to 40% when provided with retrieved results, compared to the query and the clarification pane only.

Our work is related to work done in conversational and web search clarification, engagement level prediction, and neural networks. In this section we briefly review some of the works in these areas.
Clarification.
Search clarification has recently been addressed as an important problem in the IR community. Recent research efforts study clarification in a wide range of areas, including web search engines [45], community question answering [6], voice queries [18], dialogue systems [38], entity disambiguation [9], and information-seeking conversations [2,20,36].

Radlinski and Craswell [32] discuss the need for clarification in their proposed theoretical framework for conversational search, highlighting the necessity of multi-turn interactions with users. Moreover, the report from the Dagstuhl Seminar on Conversational Search [4] summarises potential research topics in conversational search and recognises clarification as an integral part of a conversational information seeking (CIS) system, which was also argued by Penha et al. [29] for information-need elucidation. Asking clarifying questions was studied by Aliannejadi et al. [2], who propose an offline evaluation setting of an open-domain CIS system, which was highlighted as a hard-to-evaluate setting [30]. They find that asking clarifying questions reduces the number of turns needed for identifying the underlying user information need. Adding the fact that users like to be prompted for clarification [18], we see a clear importance of clarification.

Clarification is further highlighted in mixed-initiative conversational search, where the system in each turn needs to decide whether to ask for clarification or issue a response [32]. Hashemi et al. [15] propose a Guided Transformer model for document retrieval and next clarifying question selection in a conversational search setting. Zamani et al. [43] propose supervised and reinforcement learning models for generating clarifying questions and the corresponding candidate answers from weak supervision data. On the other hand, Ren et al. [34] introduce the task of conversations with search engines, where the system generates a short, summarised response from the retrieved passages. Although generating and selecting clarifying questions for such purposes has recently been studied, the necessity of asking for clarification is still a relatively unexplored topic [1]. Whether or not it is necessary to ask for clarification depends mostly on the level of ambiguity of the query.
User engagement.
O’Brien and Toms [26] define user engagement as the quality of user experience in interaction with a system, characterised by various attributes, e.g., positive affect, aesthetic and sensory appeal, attention, novelty, and perceived user control. In a recent study [25], they point to user engagement as an important outcome measure in interactive IR research. User engagement has previously been studied in the context of commercial software, social media [13], online news [24], student engagement with online courses [12], and applications for monitoring health-related signals [3].

User engagement in the aforementioned studies has usually been measured by self-reported questionnaires, facial expression analysis or speech analysis, signal processing methods, or web analytics [21]. Recently, Zamani et al. [44] created a collection of datasets for studying clarification in search by aggregating user interactions with the clarification pane of a major commercial search engine, thus falling into the category of measuring user engagement by web analytics. In this paper, however, instead of estimating the engagement levels with the goal of advancing a search engine's clarification feature, we analyse the implicit signals of the interactions, which contain valuable information about the ambiguity of the query, the diversity of retrieved results, and the quality of the clarifying question. Thus, motivated by work on implicit feedback from aggregated users' click-through logs for ad hoc retrieval [17], we view the engagement levels as implicit evaluation of clarifying questions with respect to the query and search results. Intuitively, the higher the engagement levels with the clarification system, the higher the quality of the prompted clarification, and the higher the need for asking for clarification.

Zamani et al. [45] study clarifying question selection with respect to user queries, prompted questions, and candidate answers in clarification panes of a search engine. However, the retrieved search engine results for a query have not yet been studied. To bridge this gap, in this paper, we propose a model to predict the user engagement levels not only from the information in the clarification pane, but also from the retrieved search results.
Transformers.
The unprecedented success of Transformer-based architectures in a large variety of IR and natural language processing tasks motivated their application to the engagement level prediction task as well. One of the most prominent Transformer-based models is BERT [11]. BERT has reached state-of-the-art results on multiple language understanding benchmarks, such as GLUE [40] and SQuAD [33], as well as on IR tasks, such as passage and document ranking [23,37]. In this work, we utilise ALBERT [22] – a lite BERT. ALBERT matches or exceeds the performance of BERT while having fewer parameters, reducing the GPU/TPU memory requirements.
In this section, we first describe the dataset used for engagement level prediction (ELP). Then, we formally introduce the task of ELP and propose a BERT-based model to tackle it.
MIMICS [44] is a recently proposed large-scale collection of datasets for research on search clarification. It enables the IR community to study various aspects of search clarification, ranging from clarification generation and selection, over re-ranking of candidate answers, to user engagement prediction and click models for clarification. MIMICS consists of three datasets:

1. MIMICS-Click, including over 400k unique queries, their corresponding clarification panes, and the aggregated user interaction signals.
2. MIMICS-ClickExplore, consisting of over 60k unique queries, each with multiple clarification panes, and the aggregated interaction signals.
3. MIMICS-Manual, containing 2k query-clarification pairs, manually labelled for the quality of clarifying questions, candidate answer sets, and landing result pages of each answer.
Table 1: Dataset statistics for MIMICS-Click.
                        Mean    Std   Median   min-max
Query length            2.66   1.18        2    1-12
Question length         6.05   0.47        6    5-14
SERP titles length      7.65   2.71        8    0-30
SERP snippets length   43.47  14.76       45    0-149
Answers per query       2.81   1.06        2    2-5
Responses per query     9.07   1.19        9    0-10
In this work, we mainly focus on MIMICS-Click, as the largest and most generic one. Each sample in MIMICS-Click consists of the initial query q, the clarification question c, and the answers offered as options by the system, A = [a_1, ..., a_m]. The sample is associated with user interaction signals as labels. The impression level i, a categorical variable with i ∈ {low, medium, high}, represents the frequency with which the clarification pane was presented to users for the corresponding query. The engagement level e ∈ [0, 10] shows the level of total engagement received from the users in terms of click-through rate. Each answer is also associated with its conditional click probability.

The authors also released search engine result pages (SERPs) for each query, as retrieved by Bing. In addition to the query meta-data, SERPs contain up to 10 retrieved instances with a title, a URL, and a short snippet of a web document. We denote the retrieved results as R = [r_1, r_2, ..., r_n], where n ∈ [0, 10]. Each of the results r_i consists of a tuple r_i = (t_i, s_i), where t_i and s_i are the title and snippet of the i-th result. Table 1 shows the average lengths of queries, questions, retrieved titles and snippets (computed by splitting the text on whitespace), as well as the number of retrieved results in SERPs. We utilise all of the available text and information as input to our models to compose our experiments, as described in Section 3.3.

We formulate the task of user engagement level prediction as supervised regression. The goal of the regression is to predict the value of the target variable y, given a D-dimensional vector x of input variables [5]. Given a dataset of N observation pairs (x_n, y_n), where n = 1, ..., N, the goal is to find a function f(x) whose outputs ŷ for new inputs x produce the predictions for the corresponding values of y. The loss functions of the predicted values ŷ and the actual values y are model-dependent and described in Section 3.3.

The target variable y is given in the dataset in the range of 0 to 10, corresponding to the level of user engagement with the clarification pane. We approach ELP as a regression problem as it poses itself as a natural formulation of our task. Compared to classification, false predictions of different values are penalised differently. For example, classification would punish false predictions of ŷ = 7 and ŷ = 1 for a sample with y = 8 the same, while in reality the predicted label of 7 is much closer to the actual engagement level. Therefore, even though still wrong, one would prefer a system to predict 7 instead of 1. Moreover, the task of user engagement prediction has been evaluated as regression in various applications [35,12].
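As a concrete illustration, the following minimal sketch shows how a single MIMICS-Click sample and its regression target could be read with pandas. The file name and column names are assumptions based on the public MIMICS release and should be checked against the actual data; the SERP titles and snippets (R) are distributed separately and are omitted here.

```python
# Minimal loading sketch for MIMICS-Click; file and column names are assumed from the
# public release at https://github.com/microsoft/MIMICS and may differ in detail.
import pandas as pd

click = pd.read_csv("MIMICS-Click.tsv", sep="\t")

sample = click.iloc[0]
query = sample["query"]                      # initial query q
question = sample["question"]                # clarifying question c
answers = [sample[f"option_{i}"] for i in range(1, 6)
           if isinstance(sample.get(f"option_{i}"), str)]  # candidate answers A (2-5 per pane)
impression = sample["impression_level"]      # categorical: low / medium / high
y = float(sample["engagement_level"])        # regression target, engagement level in [0, 10]

# SERP titles and snippets are released separately (Bing SERPs keyed by query) and
# would be joined to each sample before composing the model input.
```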
We now define our model called
ELBERT (Engagement Level prediction by ALBERT). As mentioned in the previous section, the goal is to predict the engagement level y based on the initial query q, the clarification question c, the list of candidate answers A, and the retrieved results R. We predict the engagement level EL as follows:

EL(q, c, A, R) = ψ(φ_q(q), φ_c(c), φ_A(A), φ_R(R))    (1)

where φ_{q,c,A,R} are high-dimensional representations of q, c, A, and R. The aggregation function ψ outputs the final engagement level based on the input representations. All of these components can be modelled with numerous methods. In this work, we utilise ALBERT as our encoder for generating the φ_{q,c,A,R} representations in a joint fashion. More specifically, as ALBERT has been shown to consistently help downstream tasks with multiple inputs [22], we essentially learn a joint representation of the query, clarification question, answers, and results as:

Φ(q, c, A, R) = ALBERT(q, c, A, R)    (2)

reducing Equation 1 to:

EL(q, c, A, R) = ψ(Φ(q, c, A, R)).    (3)

The input to the ALBERT component is composed of the tokenized query, question, answers, and results, separated by the separation token [SEP], with the classification token [CLS] inserted at the beginning of the sequence. The answers a_i are aggregated before being fed to the model. Similarly, we aggregate the SERP information R, with the difference that we experiment with both titles t_i and snippets s_i as inputs. In either case, the texts of the titles or of the snippets are joined by whitespace prior to being fed to the model. We note that in the ablation studies some of the components are left out by simply removing them from Equation 2. We use a pretrained ALBERT-base [22] as the text encoder and truncate the total input sequence length to a maximum of 512 tokens. Our model has considerably fewer trainable parameters than other Transformer-based models such as BERT.

The regression component ψ, which outputs the engagement level, is constructed as follows: the last-layer hidden state of the first token of the encoded sequence (the [CLS] token) is further processed by a linear layer and a non-linear activation function. We then add another linear layer, with dropout and a non-linear activation function in between, to produce the final 1-dimensional output that corresponds to EL. The model is trained using mean squared error as the loss function for 4 epochs, with the Adam optimizer [19] and linear weight decay with warmup.
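A minimal sketch of this architecture in PyTorch with the HuggingFace transformers library is shown below. It mirrors the description above (joint ALBERT encoding of query, question, answers, and result titles, followed by a two-layer regression head over the [CLS] representation), but the dropout rate, activation choice, and example input values are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Sketch of ELBERT: ALBERT encoder + regression head over the [CLS] token.
# Hyper-parameters (dropout, activation) are illustrative, not the paper's exact values.
import torch
import torch.nn as nn
from transformers import AlbertModel, AlbertTokenizerFast


class ELBERT(nn.Module):
    def __init__(self, pretrained="albert-base-v2", dropout=0.1):
        super().__init__()
        self.encoder = AlbertModel.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        # Regression head: linear -> non-linearity -> dropout -> linear -> scalar EL.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # hidden state of the [CLS] token
        return self.head(cls).squeeze(-1)      # predicted engagement level


tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = ELBERT()

# Query, clarifying question, aggregated answers, and aggregated titles are joined
# with the [SEP] token, then truncated to at most 512 tokens (cf. Eq. 2).
text = " [SEP] ".join([
    "headache",                                                 # q
    "What do you want to know about this medical condition?",   # c
    "symptom treatment causes",                                  # answers A, whitespace-joined
    "title of result 1 title of result 2",                       # titles from R, whitespace-joined
])
enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

pred = model(enc["input_ids"], enc["attention_mask"])
loss = nn.MSELoss()(pred, torch.tensor([3.0]))   # gold engagement level in [0, 10]
loss.backward()                                  # trained with Adam and warmup in the paper
```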
In this section, we introduce our experimental setup and present the main results for engagement level prediction. Furthermore, we analyse the effect of SERP elements on the model's performance and perform a detailed analysis w.r.t. various characteristics of the data.
We use central tendency measures as our first baselines for predicting the engagement level. More specifically, we have three different static baselines: (i) the mean of the data (MeanEngagement); (ii) the median of the data (MedianEngagement); (iii) sampling from a normal distribution N(µ, σ), where µ and σ are the mean and the standard deviation of the engagement levels in the training data, respectively (NormalEngagement). To tackle the task of ELP, we also experiment with a number of models from traditional machine learning and deep learning, namely:

Linear Regression.
The first baseline is a linear regression model, fitted using the ordinary least squares approach.
SVR.
We employ support vector regression machines [14], a version of support vector machines [10] for regression. We experiment with the linear kernel, as well as the radial basis function (RBF) kernel.
Random Forests.
An ensemble meta-algorithm that uses the bootstrap aggregating (bagging) technique to improve the stability of decision trees [7].
LSTM.
Long short-term memory networks [16] are a well-established method for sequence modelling, especially on text data. We experiment with multi-layer bidirectional networks.

The input to the traditional ML models is tf-idf weighted bag-of-words features extracted from the input text. The LSTM is fed with pretrained GloVe word embeddings [31] of the tokenized input text. We use Scikit-learn [28], HuggingFace [41], and PyTorch [27] for the implementation of the aforementioned models.
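For illustration, a rough sketch of the tf-idf-based baselines under this setup is given below: a tf-idf vectoriser feeding SVR and random forest regressors, tuned with 5-fold grid search and scored with MAE, MSE, and R². The parameter grids and the synthetic placeholder data are assumptions made for the sake of a runnable example; the actual grids are listed in the GitHub repository mentioned below.

```python
# Sketch of the traditional-ML baselines: tf-idf features + grid-searched regressors.
# The placeholder data and parameter grids are illustrative only.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Placeholder inputs: concatenated query/question/answer text and engagement labels.
texts = [f"query {i} [SEP] clarifying question {i} [SEP] answer a answer b" for i in range(200)]
labels = [(i % 11) / 1.0 for i in range(200)]    # engagement levels in [0, 10]

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2, random_state=42)

models = {
    "SVR": (SVR(), {"reg__kernel": ["linear", "rbf"], "reg__C": [0.1, 1, 10]}),
    "RandomForest": (RandomForestRegressor(random_state=0),
                     {"reg__n_estimators": [100, 300], "reg__max_depth": [None, 10]}),
}

for name, (reg, grid) in models.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("reg", reg)])
    search = GridSearchCV(pipe, grid, cv=5)       # 5-fold tuning on the training split
    search.fit(X_tr, y_tr)
    pred = search.predict(X_te)
    print(name, mean_absolute_error(y_te, pred),
          mean_squared_error(y_te, pred), r2_score(y_te, pred))
```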
We evaluate the effectiveness of our models using standard evaluation metrics for the task of supervised regression. The first two are Mean Absolute Error (MAE) and Mean Squared Error (MSE). We also evaluate our regression models with the Coefficient of Determination, or R². It is a statistical measure of the proportion of the variance in one variable that is predictable from the second variable, estimating the "goodness of fit". It is defined as:

R² = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)² / Σ_{i=1}^{N} (y_i − ȳ)²,

where N is the number of samples, y_i is the actual value in the dataset for the i-th sample, ŷ_i is the predicted value, and ȳ is the mean of the actual values.

We evaluate our models using a hold-out method, i.e., reserving 20% of the dataset for the test set. We train and tune the traditional ML models in a cross-validation manner [8]. We use a 5-fold split of the training set into training and development sets, which is used for grid-searching the best parameters. The extensive grids of parameters include the regularisation parameter C, the choice of kernel, gamma, and epsilon for SVR, the number of estimators and depth of the random forest regressor, as well as the feature selection process. All of the parameters can be found in our GitHub repository. For tuning the hyper-parameters of our neural models, we split the training set into training and development sets. Notice that the models are retrained on the full training set with the best parameters before being evaluated on the hold-out test set.

We evaluate the models on the full MIMICS-Click dataset, consisting of more than 400k query-clarification-SERP tuples, and on the subset of that dataset in which only samples with an engagement level larger than zero are selected. The models in this setting were fed all the available data, i.e., the queries, clarification panes, and the SERPs, while the ablation studies in Section 4.4 analyse the effect of the input data.

Table 2: Performance on the full MIMICS-Click dataset (400k+ samples) and a subset where engagement levels are higher than zero (71k samples). Bold values denote the best results for each metric. Symbols † and ‡ mark statistically significant improvement over central tendency measures and traditional ML models, respectively.

                     Full MIMICS-Click          EL-only MIMICS-Click
Model                MAE     MSE     R²          MAE     MSE     R²
Mean                 0.1531  0.0546  0.0         0.2426  0.0790  0.0
Median               –       –       –           –       –       –
RandomForest         0.1477  0.0526  0.0423      –       –       –
BiLSTM               –       –       –           –       –       –
ELBERT               –       –       –           –       –       –

Here, we compare the performance of our ELBERT model against the baselines on the complete dataset, as well as on the subset of data with EL > 0. Table 2 lists the results in terms of all our evaluation metrics. We notice that the heuristic baselines (i.e., MeanEngagement, MedianEngagement, and NormalEngagement) are consistently outperformed by both the traditional ML models and the neural models. However, one exception is
MedianEngagement, a baseline that always outputs the median of the training set, i.e., an EL of 0.0, when evaluated on the full MIMICS-Click by mean absolute error. Since more than 80% of the dataset has an EL of 0.0, and MAE does not penalise large errors as hard as MSE or R², this is expected. The tide turns swiftly when evaluating on the subset of the data with EL larger than 0.0, where all of the static baselines, including MedianEngagement, are outperformed by all of our models.

Moreover, we see a clear disparity in the performance of the traditional ML models and the neural networks. This is consistent with recent research on various tasks in the IR and NLP fields. Moreover, we see that ELBERT significantly outperforms the BiLSTM model. Through its powerful encoder, ELBERT is able to capture deeper semantic relations, as it is pretrained on a large body of text. This is also consistent with recent research on deep learning-based models for natural language understanding.

Table 3: Impact of the SERP elements available on the model performance. Bold values denote the best performance for each metric. Statistically significant results over the query-only setting and the query+pane setting are marked with † and ‡, respectively.

Effect of SERP elements on ELP.
In this experiment, we aim to analyse the effect of the clarification panes and of every SERP element on the performance of our model. Our hypothesis is that each SERP element (e.g., result titles and snippets) provides a complementary set of features that aids the model towards more effective prediction. Therefore, we train our ELBERT model with different combinations of SERP elements and clarification panes, and compare the performance of the different models. We report the results in Table 3. We see that the relative improvement when utilising titles from SERPs is up to 40% compared to using the query and clarification pane only, and even larger over the query-only setting. The results strongly suggest the advantage of making use of SERP elements for ELP.

An interesting finding is that even though snippets contain more text than titles, and thus arguably more information as well, the model does not consistently perform better with snippets as input. In fact, even though results with titles seem better than ones with snippets, we observe no statistically significant difference between the performance of query+titles and query+snippets on the full MIMICS-Click, nor on EL-only MIMICS-Click. There are several reasons why
snippets do not exceed the performance of titles. First, it might be the quality and type of text shown in snippets. Snippets often show only short excerpts, or even multiple excerpts which are not clearly divided, from a longer document, focusing on query words in the retrieved document. Thus, they might not contain all the semantics of the document, while titles usually do. Second, it might be the maximum input length of our encoder, which is 512 sub-word tokens. As mentioned in Table 1, the median length of a title is 8 tokens, while the median snippet length is 45. Considering that most of the samples have 9 or more title-snippet pairs in their SERPs, it is evident that some portion of the concatenated snippets gets left out. The potential limitation of truncating the input length in most BERT-based models is a research direction on its own.

We point out that the necessity of asking for clarification can be estimated from the initial query and the retrieved search results, i.e., the settings in Table 3 that do not use the clarification pane. The success of the model in predicting EL based on SERPs and the query alone suggests that this framework can be used for determining whether or not to ask a clarifying question. However, we leave this aspect for future work. Instead, in the next subsection we evaluate our model trained on the ELP task for clarification pane selection, addressing the pane quality aspect.

Here we show ELBERT performance, as measured by R², with respect to various characteristics of the dataset and the input components.

Fig. 2: Performance by impression levels (left) and query lengths (right) with different input configurations.

Impression level.
Figure 2 (left) shows the performance of our model w.r.t. impression levels. We notice that our model performs significantly better on queries with a high impression rate, i.e., those whose clarification panes have been shown to users more frequently. The differences between models at each impression level are not statistically significant, while the differences between levels are. As the engagement level labels have been computed by aggregating user click information, this suggests that query-clarification pairs that have been implicitly evaluated by a small number of users, i.e., those with a low impression level, contain noise.
Fig. 3: Performance by number of search results made available to the model.
Query length.
Figure 2 (right) presents the performance of our model w.r.t. query length. The difference in performance between all query lengths is statistically significant. We notice that longer queries generally lead to better performance. This can be attributed to longer queries being more descriptive, thus allowing the search engine to retrieve more relevant results. Consequently, our model can utilise SERPs of higher quality, improving ELP. The highest improvement is seen for the query and pane-only setting. Since the model in that setting does not see any SERP content, it benefits the most from longer, more descriptive queries.
Number of search results.
Since user behaviour is mainly biased by the results users see, and they mostly look at the top results only, we perform experiments to see how our models behave in a setting with a limited number of retrieved results. As mentioned before, the MIMICS dataset contains up to 10 retrieved results for each query. We evaluate our model with a varying number of SERP elements made available to it. Results for both the titles setting and the snippets setting are presented in Figure 3. We see a clear improvement in performance as the number of search results fed to the model rises. This suggests that our model heavily utilises SERP elements for ELP. We notice a saturation after 7 elements, especially in the setting with snippets. This might be due to snippets exceeding the maximum input length of Transformer-based models, which is 512 subword tokens.
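The sketch below illustrates how such a limited-results setting could be constructed: the model input is recomposed with only the top-k result titles before re-evaluating the trained model for each cut-off. The composition function and the example values are assumptions mirroring the input format of Section 3.3, not the authors' exact code.

```python
# Sketch of the top-k SERP ablation behind Figure 3: build one input string per cut-off k.
# Each string would then be scored by the trained ELBERT model and compared against the
# gold engagement level to obtain the R^2 curve. Example values are illustrative.
SEP = " [SEP] "

def compose_input(query, question, answers, titles, k):
    """Join query, clarifying question, answers, and only the top-k result titles."""
    return SEP.join([query, question, " ".join(answers), " ".join(titles[:k])])

query = "headache"
question = "What do you want to know about this medical condition?"
answers = ["symptom", "treatment", "causes"]
titles = [f"title of result {i}" for i in range(1, 11)]   # up to 10 results per SERP

inputs_per_k = {k: compose_input(query, question, answers, titles, k) for k in range(1, 11)}
print(inputs_per_k[3])   # input limited to the top-3 retrieved titles
```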
In this study, we conducted various experiments on the engagement level prediction task for clarification in search. We showed that semantic-rich models, like ALBERT, are much more successful at the task than traditional ML models. Furthermore, we demonstrated the benefit of utilising information from search engine result pages, such as titles and text snippets of retrieved documents, for the ELP task. Modelling of engagement levels can help guide the system on when and which clarifications to prompt, thus improving the overall user experience. Future work involves a deeper analysis of topical changes in the retrieved pages, which could lead to more accurate prediction of engagement levels, and estimating the necessity of asking for clarification.
References
1. Aliannejadi, M., Kiseleva, J., Chuklin, A., Dalton, J., Burtsev, M.: ConvAI3: Generating clarifying questions for open-domain dialogue systems (ClariQ) (2020)
2. Aliannejadi, M., Zamani, H., Crestani, F., Croft, W.B.: Asking clarifying questions in open-domain information-seeking conversations. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 475–484 (2019)
3. Alkhaldi, G., Hamilton, F.L., Lau, R., Webster, R., Michie, S., Murray, E.: The effectiveness of prompts to promote engagement with digital interventions: a systematic review. Journal of Medical Internet Research (1), e6 (2016)
4. Anand, A., Cavedon, L., Joho, H., Sanderson, M., Stein, B.: Conversational search (Dagstuhl Seminar 19461). In: Dagstuhl Reports. vol. 9. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2020)
5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
6. Braslavski, P., Savenkov, D., Agichtein, E., Dubatovka, A.: What do you mean exactly? Analyzing clarification questions in CQA. In: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval. pp. 345–348 (2017)
7. Breiman, L.: Random forests. Machine Learning (1), 5–32 (2001)
8. Cawley, G.C., Talbot, N.L.: On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 2079–2107 (2010)
9. Coden, A., Gruhl, D., Lewis, N., Mendes, P.N.: Did you mean A or B? Supporting clarification dialog for entity disambiguation. In: SumPre-HSWI@ESWC (2015)
10. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning (3), 273–297 (1995)
11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423
12. Dhall, A., Kaur, A., Goecke, R., Gedeon, T.: EmotiW 2018: Audio-video, student engagement and group-level affect prediction. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. pp. 653–656 (2018)
13. Di Gangi, P.M., Wasko, M.M.: Social media engagement theory: Exploring the influence of user engagement on social media usage. Journal of Organizational and End User Computing (JOEUC) (2), 53–73 (2016)
14. Drucker, H., Burges, C.J., Kaufman, L., Smola, A.J., Vapnik, V.: Support vector regression machines. In: Advances in Neural Information Processing Systems. pp. 155–161 (1997)
15. Hashemi, H., Zamani, H., Croft, W.B.: Guided Transformer: Leveraging multiple external sources for representation learning in conversational search. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1131–1140 (2020)
16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (8), 1735–1780 (1997)
17. Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: a bibliography. In: ACM SIGIR Forum. vol. 37, pp. 18–28. ACM New York, NY, USA (2003)
18. Kiesel, J., Bahrami, A., Stein, B., Anand, A., Hagen, M.: Toward voice query clarification. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. pp. 1257–1260 (2018)
19. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
20. Krasakis, A.M., Aliannejadi, M., Voskarides, N., Kanoulas, E.: Analysing the effect of clarifying questions on document ranking in conversational search. In: Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval. pp. 129–132 (2020)
21. Lalmas, M., O'Brien, H., Yom-Tov, E.: Measuring user engagement. Synthesis Lectures on Information Concepts, Retrieval, and Services (4), 1–132 (2014)
22. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: A lite BERT for self-supervised learning of language representations. In: Proceedings of ICLR (2020)
23. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
24. O'Brien, H.L.: Antecedents and learning outcomes of online news engagement. Journal of the Association for Information Science and Technology (12), 2809–2820 (2017)
25. O'Brien, H.L., Arguello, J., Capra, R.: An empirical study of interest, task complexity, and search behaviour on user engagement. Information Processing & Management (3), 102226 (2020)
26. O'Brien, H.L., Toms, E.G.: What is user engagement? A conceptual framework for defining user engagement with technology. Journal of the American Society for Information Science and Technology (6), 938–955 (2008)
27. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8026–8037 (2019)
28. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 2825–2830 (2011)
29. Penha, G., Balan, A., Hauff, C.: Introducing MANtIS: a novel multi-domain information seeking dialogues dataset. arXiv preprint arXiv:1912.04639 (2019)
30. Penha, G., Hauff, C.: Challenges in the evaluation of conversational search systems. In: KDD Workshop on Conversational Systems Towards Mainstream Adoption (2020)
31. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
32. Radlinski, F., Craswell, N.: A theoretical framework for conversational search. In: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval. pp. 117–126 (2017)
33. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pp. 2383–2392. Association for Computational Linguistics, Austin, Texas (Nov 2016)
34. Ren, P., Chen, Z., Ren, Z., Kanoulas, E., Monz, C., de Rijke, M.: Conversations with search engines. ACM Transactions on Information Systems