User Engagement Prediction for Clarification in Search
Ivan Sekulić, Mohammad Aliannejadi, and Fabio Crestani
Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
{ivan.sekulic,fabio.crestani}@usi.ch
University of Amsterdam, Amsterdam, The Netherlands
[email protected]
Abstract.
Clarification is increasingly becoming a vital factor in various topics of information retrieval, such as conversational search and modern Web search engines. Prompting the user for clarification in a search session can be very beneficial to the system, as the user's explicit feedback helps the system improve retrieval considerably. However, it comes with a high risk of frustrating the user in case the system fails to ask decent clarifying questions. Therefore, it is of great importance to determine when and how to ask for clarification. To this aim, in this work, we model search clarification prediction as a user engagement problem. We assume that the better a clarification is, the higher the user engagement with it would be. We propose a Transformer-based model to tackle the task. A comparison with competitive baselines on large-scale real-life clarification engagement data proves the effectiveness of our model. Also, we analyse the effect of all result page elements on the performance and find that, among others, the ranked list of the search engine leads to considerable improvements. Our extensive analysis of task-specific features guides future research.
Keywords:
Search Clarification · Mixed-Initiative Conversations · User Engagement Prediction
The primary goal of an information retrieval (IR) system is satisfying the user information need, which can often be ambiguous when expressed as a short query. Incorporating users' implicit feedback has long been studied for improved retrieval [17]. However, the recent rise of interest in conversational systems and mixed-initiative interactions has enabled IR systems to collect users' explicit feedback. Current research focuses on prompting users for feedback by asking for clarification [32,2,43]. For example, search clarification has recently been utilised in search engines, leading to an improved user experience [43]. Another prominent area studying clarification is conversational search, as the system can usually output only one response, thus requiring it to clarify the user's intent [32,2].
Fig. 1: An example of a Bing clarification pane, taken from [44].

The importance of clarification further increases in a mixed-initiative conversational setting [39], where control of the conversation goes back and forth between user and system through assertions, prompts, and questions [32]. However, clarification in search has proved to be a cumbersome task [45], posing a higher risk of user dissatisfaction. The challenge arises from two main aspects: deciding whether or not it is necessary to ask for clarification, and selecting or generating the appropriate clarifying question. Clarification selection can in fact be formalised as a user engagement prediction problem. User engagement refers to the quality of user experience characterised by, among others, attributes of positive affect, attention, interactivity, and perceived user control [26]. Persistent user interactions with the clarification mechanism are an indication of a well-designed system. Furthermore, through these interactions users provide implicit feedback about the necessity and the quality of the prompted clarifications.

Recently, modern search engines have included various types of clarification components in their systems. An example of such a component in Bing, namely a clarification pane, can be seen in Figure 1. Given a user query, a number of Microsoft's internal algorithms propose a clarifying question and offer clickable answers that filter the retrieved results according to the user's need. Research on the quality of asked clarifying questions and potential answers is still in its early stages [43]; however, Zamani et al. [44] argued that the engagement level could be an indicator of clarification system quality. User engagement prediction has been studied in various domains of IR [25]. However, studying and modelling user engagement with web search clarification remains relatively unexplored.

In this paper, we focus on the task of predicting the user engagement level (ELP) with clarification panes. Given an initial query, search results, and a clarification pane, ELP aims to estimate how engaged the user would be with the clarification pane. Previous work [45] studies how engagement levels correlate with query attributes such as query type and aspects. However, the relationship between SERPs and engagement has not yet been explored. We stress the importance of utilising retrieved results, as they can contain cues as to how faceted or ambiguous the query is, suggesting how necessary the clarification is in the first place.

Moreover, users' engagement with the system implicitly discloses information about the necessity and the quality of the asked clarification. The quality aspect can be modelled under the assumption that the higher the engagement levels, the better the question and the provided answers are. We make this assumption inspired by a large body of work in the IR community on implicit feedback from aggregated click-through rates for document retrieval [42]. Also, we study clarification necessity prediction through ELP. Our clarification necessity prediction model takes as input the initial query and the retrieved results list and predicts the level of user engagement with a clarification pane.
Although certain attributes of the initial query, such as length and ambiguity, could indicate the necessity of asking clarifying questions, we show that incorporating other SERP elements, such as result titles and snippets, plays an important role in improving prediction accuracy. We formulate the task as supervised regression and propose a deep learning-based model for the prediction of engagement levels. We compare the performance of the model to various central tendency measures and a number of traditional machine learning algorithms, as well as popular neural models. Our model, based on a Transformer architecture, jointly encodes the user query, the clarification pane, and the SERP elements, outperforming competitive baselines. We evaluate the performance of our model on MIMICS [44] (https://github.com/microsoft/MIMICS), a large-scale dataset of search clarification engagements collected from millions of interaction records of Bing users. Our extensive experiments establish a strong baseline for the task, while ablation studies and analysis of the model's inner mechanisms provide guidelines for future research. Our main contributions can be summarised as follows:

• We formally introduce the clarification pane ELP task as supervised regression and propose a Transformer-based model to tackle it. We make the code publicly available for reproducibility purposes (https://github.com/isekulic/mimics-EL-benchmark).
• We perform ablation studies with respect to the model input data. We find that utilising retrieved search results greatly benefits the model's performance.
• We perform a detailed analysis of the performance of our model w.r.t. various characteristics of the SERP.

To the best of our knowledge, our work is the first to utilise SERP elements for clarification pane engagement prediction. More precisely, we find that utilising search results in certain ways is highly beneficial for the ELP task, as the performance of our model increases by up to 40% when provided with retrieved results, compared to the query and the clarification pane only.

Our work is related to work done in conversational and web search clarification, engagement level prediction, and neural networks. In this section we briefly review some of the works in these areas.
Clarification.
Search clarification has recently been addressed as an important problem in the IR community. Recent research efforts study clarification in a wide range of areas, including web search engines [45], community question answering [6], voice queries [18], dialogue systems [38], entity disambiguation [9], and information-seeking conversations [2,20,36].

Radlinski and Craswell [32] discuss the need for clarification in their proposed theoretical framework for conversational search, highlighting the necessity of multi-turn interactions with users. Moreover, the report from the Dagstuhl Seminar on Conversational Search [4] summarises potential research topics in conversational search and recognises clarification as an integral part of a conversational information seeking (CIS) system, which was also argued by Penha et al. [29] for information-need elucidation. Asking clarifying questions was studied by Aliannejadi et al. [2], who propose an offline evaluation setting of an open-domain CIS system, which was highlighted as a hard-to-evaluate setting [30]. They find that asking clarifying questions reduces the number of turns needed for identifying the underlying user information need. Adding the fact that users like to be prompted for clarification [18], we see a clear importance of clarification.

Clarification is further highlighted in mixed-initiative conversational search, where the system in each turn needs to decide whether to ask for clarification or issue a response [32]. Hashemi et al. [15] propose a Guided Transformer model for document retrieval and next clarifying question selection in a conversational search setting. Zamani et al. [43] propose supervised and reinforcement learning models for generating clarifying questions and the corresponding candidate answers from weak supervision data. On the other hand, Ren et al. [34] introduce the task of conversations with search engines, where the system generates a short, summarised response from the retrieved passages. Although generating and selecting clarifying questions for such purposes has recently been studied, the necessity of asking for clarification is still a relatively unexplored topic [1]. Whether or not it is necessary to ask for clarification depends mostly on the level of ambiguity of the query.
User engagement.
O’Brien and Toms [26] define user engagement as the quality of user experience in interaction with a system, characterised by various attributes, e.g., positive affect, aesthetic and sensory appeal, attention, novelty, and perceived user control. In a recent study [25], they point to user engagement as an important outcome measure in interactive IR research. User engagement has previously been studied in the context of commercial software, social media [13], online news [24], student engagement with online courses [12], and applications for monitoring health-related signals [3].

User engagement in the aforementioned studies has usually been measured by self-reported questionnaires, facial expression analysis or speech analysis, signal processing methods, or web analytics [21]. Recently, Zamani et al. [44] created a collection of datasets for studying clarification in search by aggregating user interactions with the clarification pane of a major commercial search engine, thus falling into the category of measuring user engagement by web analytics. In this paper, however, instead of estimating the engagement levels with the goal of advancing a search engine's clarification feature, we analyse the implicit signals of the interactions, which contain valuable information about the ambiguity of the query, the diversity of retrieved results, and the quality of the clarifying question. Thus, motivated by work on implicit feedback from aggregated users' click-through logs for ad hoc retrieval [17], we view the engagement levels as implicit evaluation of clarifying questions with respect to the query and search results. Intuitively, the higher the engagement levels with the clarification system, the higher the quality of the prompted clarification, and the higher the need for asking for clarification.

Zamani et al. [45] study clarifying question selection with respect to user queries, prompted questions, and candidate answers in clarification panes of a search engine. However, the retrieved search engine results for a query have not yet been studied. To bridge this gap, in this paper, we propose a model to predict the user engagement levels not only from the information in the clarification pane, but also from the retrieved search results.
Transformers.
The unprecedented success of Transformer-based architectures in a large variety of IR and natural language processing tasks motivated their application to the engagement level prediction task as well. One of the most prominent Transformer-based models is BERT [11]. BERT has reached state-of-the-art results on multiple language understanding benchmarks, such as GLUE [40] and SQuAD [33], as well as on IR tasks, such as passage and document ranking [23,37]. In this work, we utilise ALBERT [22] – a lite BERT. ALBERT matches or exceeds the performance of BERT while having fewer parameters, reducing the GPU/TPU memory requirements.
In this section, we first describe the dataset used for engagement level prediction (ELP). Then, we formally introduce the task of ELP and propose a BERT-based model to tackle it.
MIMICS [44] is a recently proposed large-scale collection of datasets for research on search clarification. It enables the IR community to study various aspects of search clarification, ranging from clarification generation and selection, over re-ranking of candidate answers, to user engagement prediction and click models for clarification. MIMICS consists of three datasets:

1. MIMICS-Click, including over 400k unique queries, their corresponding clarification panes, and the aggregated user interaction signals.
2. MIMICS-ClickExplore, consisting of over 60k unique queries, each with multiple clarification panes, and the aggregated interaction signals.
3. MIMICS-Manual, containing 2k query-clarification pairs, manually labelled for the quality of clarifying questions, candidate answer sets, and landing result pages of each answer.
Table 1: Dataset statistics for MIMICS-Click.
                        Mean    Std   Median   min-max
Query length            2.66   1.18        2    1-12
Question length         6.05   0.47        6    5-14
SERP titles length      7.65   2.71        8    0-30
SERP snippets length   43.47  14.76       45    0-149
Answers per query       2.81   1.06        2    2-5
Responses per query     9.07   1.19        9    0-10
In this work, we mainly focus on MIMICS-Click, as the largest and most generic one. Each sample in MIMICS-Click consists of the initial query q, the clarification question c, and the answers offered as options by the system, A = [a_1, ..., a_m]. The sample is associated with user interaction signals as labels. The impression level i, a categorical variable with i ∈ {low, medium, high}, represents the frequency with which the clarification pane was presented to users for the corresponding query. The engagement level e ∈ [0, 10] shows the level of total engagement received from the users in terms of click-through rate. Each answer is also associated with its conditional click probability.

The authors also released search engine result pages (SERPs) for each query, as retrieved by Bing. In addition to the query meta-data, SERPs contain up to 10 retrieved instances with a title, a URL, and a short snippet of a web document. We denote the retrieved results as R = [r_1, r_2, ..., r_n], where n ∈ [0, 10]. Each of the results r_i consists of a tuple r_i = (t_i, s_i), where t_i and s_i are the title and snippet of the i-th result. Table 1 shows the average lengths of queries, questions, retrieved titles and snippets (computed by splitting the text on whitespace), as well as the number of retrieved results in SERPs. We utilise all of the available text and information as input to our models to compose our experiments, as described in Section 3.3.

We formulate the task of user engagement level prediction as supervised regression. The goal of the regression is to predict the value of the target variable y, given a D-dimensional vector x of input variables [5]. Given a dataset of N observation pairs (x_n, y_n), where n = 1, ..., N, the goal is to find a function f(x) whose outputs ŷ for new inputs x produce the predictions for the corresponding values of y. The loss functions of the predicted values ŷ and the actual values y are model-dependent and described in Section 3.3.

The target variable y is given in the dataset in the range of 0 to 10, corresponding to the level of user engagement with the clarification pane. We approach ELP as a regression problem as it poses itself as a natural formulation of our task. Compared to classification, false predictions of different values are penalised differently. For example, classification would punish false predictions of ŷ = 7 and ŷ = 1 for a sample with y = 8 the same, while in reality the predicted label of 7 is much closer to the actual engagement level. Therefore, even though still wrong, one would prefer a system to predict 7 instead of 1. Moreover, the task of user engagement prediction has been evaluated as regression in various applications [35,12].
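As a concrete illustration, the following minimal sketch shows how a single MIMICS-Click sample and its regression target could be read with pandas. The file name and column names are assumptions based on the public MIMICS release and should be checked against the actual data; the SERP titles and snippets (R) are distributed separately and are omitted here.

```python
# Minimal loading sketch for MIMICS-Click; file and column names are assumed from the
# public release at https://github.com/microsoft/MIMICS and may differ in detail.
import pandas as pd

click = pd.read_csv("MIMICS-Click.tsv", sep="\t")

sample = click.iloc[0]
query = sample["query"]                      # initial query q
question = sample["question"]                # clarifying question c
answers = [sample[f"option_{i}"] for i in range(1, 6)
           if isinstance(sample.get(f"option_{i}"), str)]  # candidate answers A (2-5 per pane)
impression = sample["impression_level"]      # categorical: low / medium / high
y = float(sample["engagement_level"])        # regression target, engagement level in [0, 10]

# SERP titles and snippets are released separately (Bing SERPs keyed by query) and
# would be joined to each sample before composing the model input.
```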
We now define our model called
ELBERT (Engagement Level prediction by ALBERT). As mentioned in the previous section, the goal is to predict the engagement level y based on the initial query q, the clarification question c, the list of candidate answers A, and the retrieved results R. We predict the engagement level EL as follows:

EL(q, c, A, R) = ψ(φ_q(q), φ_c(c), φ_A(A), φ_R(R))    (1)

where φ_{q,c,A,R} are high-dimensional representations of q, c, A, and R. The aggregation function ψ outputs the final engagement level based on the input representations. All of these components can be modelled with numerous methods. In this work, we utilise ALBERT as our encoder for generating the φ_{q,c,A,R} representations in a joint fashion. More specifically, as ALBERT has been shown to consistently help downstream tasks with multiple inputs [22], we essentially learn a joint representation of the query, clarification question, answers, and results as:

Φ(q, c, A, R) = ALBERT(q, c, A, R)    (2)

reducing Equation 1 to:

EL(q, c, A, R) = ψ(Φ(q, c, A, R)).    (3)

The input to the ALBERT component is composed of the tokenized query, question, answers, and results, separated by the separation token [SEP], with the classification token [CLS] inserted at the beginning of the sequence. The answers a_i are aggregated before being fed to the model. Similarly, we aggregate the SERP information R, with the difference that we experiment with both titles t_i and snippets s_i as inputs. In either case, the texts of the titles or of the snippets are joined by whitespace prior to being fed to the model. We note that in the ablation studies some of the components are left out by simply removing them from Equation 2. We use a pretrained ALBERT-base [22] as the text encoder and truncate the total input sequence length to a maximum of 512 tokens. Our model has considerably fewer trainable parameters than other Transformer-based models such as BERT.

The regression component ψ, which outputs the engagement level, is constructed as follows: the last-layer hidden state of the first token of the encoded sequence (the [CLS] token) is further processed by a linear layer and a non-linear activation function. We then add another linear layer, with dropout and a non-linear activation function in between, to produce the final 1-dimensional output that corresponds to EL. The model is trained using mean squared error as the loss function for 4 epochs, with the Adam optimizer [19] and linear weight decay with warmup.
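A minimal sketch of this architecture in PyTorch with the HuggingFace transformers library is shown below. It mirrors the description above (joint ALBERT encoding of query, question, answers, and result titles, followed by a two-layer regression head over the [CLS] representation), but the dropout rate, activation choice, and example input values are illustrative assumptions rather than the exact configuration used in the paper.

```python
# Sketch of ELBERT: ALBERT encoder + regression head over the [CLS] token.
# Hyper-parameters (dropout, activation) are illustrative, not the paper's exact values.
import torch
import torch.nn as nn
from transformers import AlbertModel, AlbertTokenizerFast


class ELBERT(nn.Module):
    def __init__(self, pretrained="albert-base-v2", dropout=0.1):
        super().__init__()
        self.encoder = AlbertModel.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        # Regression head: linear -> non-linearity -> dropout -> linear -> scalar EL.
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # hidden state of the [CLS] token
        return self.head(cls).squeeze(-1)      # predicted engagement level


tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = ELBERT()

# Query, clarifying question, aggregated answers, and aggregated titles are joined
# with the [SEP] token, then truncated to at most 512 tokens (cf. Eq. 2).
text = " [SEP] ".join([
    "headache",                                                 # q
    "What do you want to know about this medical condition?",   # c
    "symptom treatment causes",                                  # answers A, whitespace-joined
    "title of result 1 title of result 2",                       # titles from R, whitespace-joined
])
enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

pred = model(enc["input_ids"], enc["attention_mask"])
loss = nn.MSELoss()(pred, torch.tensor([3.0]))   # gold engagement level in [0, 10]
loss.backward()                                  # trained with Adam and warmup in the paper
```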
In this section, we introduce our experimental setup and present the main results for engagement level prediction. Furthermore, we analyse the effect of SERP elements on the model's performance and perform a detailed analysis w.r.t. various characteristics of the data.
We use central tendency measures as our first baselines for predicting the engagement level. More specifically, we have three different static baselines: (i) the mean of the data (MeanEngagement); (ii) the median of the data (MedianEngagement); (iii) sampling from a normal distribution N(µ, σ), where µ and σ are the mean and the standard deviation of the engagement levels in the training data, respectively (NormalEngagement). To tackle the task of ELP, we also experiment with a number of models from traditional machine learning and deep learning, namely:

Linear Regression.
The first baseline is a linear regression model, fitted using the ordinary least squares approach.
SVR.
We employ support vector regression machines [14], a version of support vector machines [10] for regression. We experiment with the linear kernel, as well as the radial basis function (RBF) kernel.
Random Forests.
An ensemble meta-algorithm that uses the bootstrap aggregating (bagging) technique to improve the stability of decision trees [7].
LSTM.
Long short-term memory networks [16] are a well-established method for sequence modelling, especially on text data. We experiment with multi-layer bidirectional networks.

The input to the traditional ML models is tf-idf weighted bag-of-words features extracted from the input text. The LSTM is fed with pretrained GloVe word embeddings [31] of the tokenized input text. We use Scikit-learn [28], HuggingFace [41], and PyTorch [27] for the implementation of the aforementioned models.
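For illustration, a rough sketch of the tf-idf-based baselines under this setup is given below: a tf-idf vectoriser feeding SVR and random forest regressors, tuned with 5-fold grid search and scored with MAE, MSE, and R². The parameter grids and the synthetic placeholder data are assumptions made for the sake of a runnable example; the actual grids are listed in the GitHub repository mentioned below.

```python
# Sketch of the traditional-ML baselines: tf-idf features + grid-searched regressors.
# The placeholder data and parameter grids are illustrative only.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Placeholder inputs: concatenated query/question/answer text and engagement labels.
texts = [f"query {i} [SEP] clarifying question {i} [SEP] answer a answer b" for i in range(200)]
labels = [(i % 11) / 1.0 for i in range(200)]    # engagement levels in [0, 10]

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.2, random_state=42)

models = {
    "SVR": (SVR(), {"reg__kernel": ["linear", "rbf"], "reg__C": [0.1, 1, 10]}),
    "RandomForest": (RandomForestRegressor(random_state=0),
                     {"reg__n_estimators": [100, 300], "reg__max_depth": [None, 10]}),
}

for name, (reg, grid) in models.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("reg", reg)])
    search = GridSearchCV(pipe, grid, cv=5)       # 5-fold tuning on the training split
    search.fit(X_tr, y_tr)
    pred = search.predict(X_te)
    print(name, mean_absolute_error(y_te, pred),
          mean_squared_error(y_te, pred), r2_score(y_te, pred))
```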
We evaluate the effectiveness of our models using standard evaluation metrics for the task of supervised regression. The first two are Mean Absolute Error (MAE) and Mean Squared Error (MSE). We also evaluate our regression models with the Coefficient of Determination, or R². It is a statistical measure of the proportion of the variance in one variable that is predictable from the second variable, estimating the "goodness of fit". It is defined as:

R² = 1 − Σ_{i=1}^{N} (y_i − ŷ_i)² / Σ_{i=1}^{N} (y_i − ȳ)²,

where N is the number of samples, y_i is the actual value in the dataset for the i-th sample, ŷ_i is the predicted value, and ȳ is the mean of the actual values.

We evaluate our models using a hold-out method, i.e., reserving 20% of the dataset for the test set. We train and tune the traditional ML models in a cross-validation manner [8]. We use a 5-fold split of the training set into training and development sets, which is used for grid-searching the best parameters. The extensive grids of parameters include the regularisation parameter C, the choice of kernel, gamma, and epsilon for SVR, the number of estimators and depth of the random forest regressor, as well as the feature selection process. All of the parameters can be found in our GitHub repository. For tuning the hyper-parameters of our neural models, we split the training set into training and development sets. Notice that the models are retrained on the full training set with the best parameters before being evaluated on the hold-out test set.

We evaluate the models on the full MIMICS-Click dataset, consisting of more than 400k query-clarification-SERP tuples, and on the subset of that dataset in which only samples with an engagement level larger than zero are selected. The models in this setting were fed all the available data, i.e., the queries, clarification panes, and the SERPs, while the ablation studies in Section 4.4 analyse the effect of the input data.

Table 2: Performance on the full MIMICS-Click dataset (400k+ samples) and a subset where engagement levels are higher than zero (71k samples). Bold values denote the best results for each metric. Symbols † and ‡ mark statistically significant improvement over central tendency measures and traditional ML models, respectively.

                     Full MIMICS-Click          EL-only MIMICS-Click
Model                MAE     MSE     R²          MAE     MSE     R²
Mean                 0.1531  0.0546  0.0         0.2426  0.0790  0.0
Median               –       –       –           –       –       –
RandomForest         0.1477  0.0526  0.0423      –       –       –
BiLSTM               –       –       –           –       –       –
ELBERT               –       –       –           –       –       –

Here, we compare the performance of our ELBERT model against the baselines on the complete dataset, as well as on the subset of data with EL > 0. Table 2 lists the results in terms of all our evaluation metrics. We notice that the heuristic baselines (i.e., MeanEngagement, MedianEngagement, and NormalEngagement) are consistently outperformed by both the traditional ML models and the neural models. However, one exception is
MedianEngagement, a baseline that always outputs the median of the training set, i.e., an EL of 0.0, when evaluated on the full MIMICS-Click by mean absolute error. Since more than 80% of the dataset has an EL of 0.0, and MAE does not penalise large errors as hard as MSE or R², this is expected. The tide turns swiftly when evaluating on the subset of the data with EL larger than 0.0, where all of the static baselines, including MedianEngagement, are outperformed by all of our models.

Moreover, we see a clear disparity in the performance of the traditional ML models and the neural networks. This is consistent with recent research on various tasks in the IR and NLP fields. Moreover, we see that ELBERT significantly outperforms the BiLSTM model. Through its powerful encoder, ELBERT is able to capture deeper semantic relations, as it is pretrained on a large body of text. This is also consistent with recent research on deep learning-based models for natural language understanding.

Table 3: Impact of the SERP elements available on the model performance. Bold values denote the best performance for each metric. Statistically significant results over the query-only setting and the query+pane setting are marked with † and ‡, respectively.

Effect of SERP elements on ELP.
In this experiment, we aim to analyse the effect of the clarification panes and of every SERP element on the performance of our model. Our hypothesis is that each SERP element (e.g., result titles and snippets) provides a complementary set of features that aids the model towards more effective prediction. Therefore, we train our ELBERT model with different combinations of SERP elements and clarification panes, and compare the performance of the different models. We report the results in Table 3. We see that the relative improvement when utilising titles from SERPs is up to 40% compared to using the query and clarification pane only, and even larger over the query-only setting. The results strongly suggest the advantage of making use of SERP elements for ELP.

An interesting finding is that even though snippets contain more text than titles, and thus arguably more information as well, the model does not consistently perform better with snippets as input. In fact, even though results with titles seem better than ones with snippets, we observe no statistically significant difference between the performance of query+titles and query+snippets on the full MIMICS-Click, nor on EL-only MIMICS-Click. There are several reasons why
snippets do not exceed the performance of titles. First, it might be the quality and type of text shown in snippets. Snippets often show only short excerpts, or even multiple excerpts which are not clearly divided, from a longer document, focusing on query words in the retrieved document. Thus, they might not contain all the semantics of the document, while titles usually do. Second, it might be the maximum input length of our encoder, which is 512 sub-word tokens. As mentioned in Table 1, the median length of a title is 8 tokens, while the median snippet length is 45. Considering that most of the samples have 9 or more title-snippet pairs in their SERPs, it is evident that some portion of the concatenated snippets gets left out. The potential limitation of truncating the input length in most BERT-based models is a research direction on its own.

We point out that the necessity of asking for clarification can be estimated from the initial query and the retrieved search results, i.e., the settings in Table 3 that do not use the clarification pane. The success of the model in predicting EL based on SERPs and the query alone suggests that this framework can be used for determining whether or not to ask a clarifying question. However, we leave this aspect for future work. Instead, in the next subsection we evaluate our model trained on the ELP task for clarification pane selection, addressing the pane quality aspect.

Here we show ELBERT performance, as measured by R², with respect to various characteristics of the dataset and the input components.

Fig. 2: Performance by impression levels (left) and query lengths (right) with different input configurations.

Impression level.
Figure 2 (left) shows the performance of our model w.r.t. impression levels. We notice that our model performs significantly better on queries with a high impression rate, i.e., those whose clarification panes have been shown to users more frequently. The differences between models at each impression level are not statistically significant, while the differences between levels are. As the engagement level labels have been computed by aggregating user click information, this suggests that query-clarification pairs that have been implicitly evaluated by a small number of users, i.e., those with a low impression level, contain noise.
Fig. 3: Performance by number of search results made available to the model.
Query length.
Figure 2 (right) presents the performance of our model w.r.t. query length. The difference in performance between all query lengths is statistically significant. We notice that longer queries generally lead to better performance. This can be attributed to longer queries being more descriptive, thus allowing the search engine to retrieve more relevant results. Consequently, our model can utilise SERPs of higher quality, improving ELP. The highest improvement is seen for the query and pane-only setting. Since the model in that setting does not see any SERP content, it benefits the most from longer, more descriptive queries.
Number of search results.
Since user behaviour is mainly biased by the results users see, and they mostly look at the top results only, we perform experiments to see how our models behave in a setting with a limited number of retrieved results. As mentioned before, the MIMICS dataset contains up to 10 retrieved results for each query. We evaluate our model with a varying number of SERP elements made available to it. Results for both the titles setting and the snippets setting are presented in Figure 3. We see a clear improvement in performance as the number of search results fed to the model rises. This suggests that our model heavily utilises SERP elements for ELP. We notice a saturation after 7 elements, especially in the setting with snippets. This might be due to snippets exceeding the maximum input length of Transformer-based models, which is 512 subword tokens.
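The sketch below illustrates how such a limited-results setting could be constructed: the model input is recomposed with only the top-k result titles before re-evaluating the trained model for each cut-off. The composition function and the example values are assumptions mirroring the input format of Section 3.3, not the authors' exact code.

```python
# Sketch of the top-k SERP ablation behind Figure 3: build one input string per cut-off k.
# Each string would then be scored by the trained ELBERT model and compared against the
# gold engagement level to obtain the R^2 curve. Example values are illustrative.
SEP = " [SEP] "

def compose_input(query, question, answers, titles, k):
    """Join query, clarifying question, answers, and only the top-k result titles."""
    return SEP.join([query, question, " ".join(answers), " ".join(titles[:k])])

query = "headache"
question = "What do you want to know about this medical condition?"
answers = ["symptom", "treatment", "causes"]
titles = [f"title of result {i}" for i in range(1, 11)]   # up to 10 results per SERP

inputs_per_k = {k: compose_input(query, question, answers, titles, k) for k in range(1, 11)}
print(inputs_per_k[3])   # input limited to the top-3 retrieved titles
```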
In this study, we conducted various experiments on the engagement level prediction task for clarification in search. We showed that semantic-rich models, like ALBERT, are much more successful at the task than traditional ML models. Furthermore, we demonstrated the benefit of utilising information from search engine result pages, such as titles and text snippets of retrieved documents, for the ELP task. Modelling of engagement levels can help guide the system on when and which clarifications to prompt, thus improving the overall user experience. Future work involves a deeper analysis of topical changes in the retrieved pages, which could lead to more accurate prediction of engagement levels, and estimating the necessity of asking for clarification.
References
1. Aliannejadi, M., Kiseleva, J., Chuklin, A., Dalton, J., Burtsev, M.: ConvAI3: Generating clarifying questions for open-domain dialogue systems (ClariQ) (2020)
2. Aliannejadi, M., Zamani, H., Crestani, F., Croft, W.B.: Asking clarifying questions in open-domain information-seeking conversations. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 475–484 (2019)
3. Alkhaldi, G., Hamilton, F.L., Lau, R., Webster, R., Michie, S., Murray, E.: The effectiveness of prompts to promote engagement with digital interventions: a systematic review. Journal of Medical Internet Research (1), e6 (2016)
4. Anand, A., Cavedon, L., Joho, H., Sanderson, M., Stein, B.: Conversational search (Dagstuhl Seminar 19461). In: Dagstuhl Reports. vol. 9. Schloss Dagstuhl-Leibniz-Zentrum für Informatik (2020)
5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
6. Braslavski, P., Savenkov, D., Agichtein, E., Dubatovka, A.: What do you mean exactly? Analyzing clarification questions in CQA. In: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval. pp. 345–348 (2017)
7. Breiman, L.: Random forests. Machine Learning (1), 5–32 (2001)
8. Cawley, G.C., Talbot, N.L.: On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research, 2079–2107 (2010)
9. Coden, A., Gruhl, D., Lewis, N., Mendes, P.N.: Did you mean A or B? Supporting clarification dialog for entity disambiguation. In: SumPre-HSWI@ESWC (2015)
10. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning (3), 273–297 (1995)
11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423
12. Dhall, A., Kaur, A., Goecke, R., Gedeon, T.: EmotiW 2018: Audio-video, student engagement and group-level affect prediction. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. pp. 653–656 (2018)
13. Di Gangi, P.M., Wasko, M.M.: Social media engagement theory: Exploring the influence of user engagement on social media usage. Journal of Organizational and End User Computing (JOEUC) (2), 53–73 (2016)
14. Drucker, H., Burges, C.J., Kaufman, L., Smola, A.J., Vapnik, V.: Support vector regression machines. In: Advances in Neural Information Processing Systems. pp. 155–161 (1997)
15. Hashemi, H., Zamani, H., Croft, W.B.: Guided Transformer: Leveraging multiple external sources for representation learning in conversational search. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1131–1140 (2020)
16. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (8), 1735–1780 (1997)
17. Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: a bibliography. In: ACM SIGIR Forum. vol. 37, pp. 18–28. ACM New York, NY, USA (2003)
18. Kiesel, J., Bahrami, A., Stein, B., Anand, A., Hagen, M.: Toward voice query clarification. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. pp. 1257–1260 (2018)
19. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
20. Krasakis, A.M., Aliannejadi, M., Voskarides, N., Kanoulas, E.: Analysing the effect of clarifying questions on document ranking in conversational search. In: Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval. pp. 129–132 (2020)
21. Lalmas, M., O'Brien, H., Yom-Tov, E.: Measuring user engagement. Synthesis Lectures on Information Concepts, Retrieval, and Services (4), 1–132 (2014)
22. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: A lite BERT for self-supervised learning of language representations. In: Proceedings of ICLR (2020)
23. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
24. O'Brien, H.L.: Antecedents and learning outcomes of online news engagement. Journal of the Association for Information Science and Technology (12), 2809–2820 (2017)
25. O'Brien, H.L., Arguello, J., Capra, R.: An empirical study of interest, task complexity, and search behaviour on user engagement. Information Processing & Management (3), 102226 (2020)
26. O'Brien, H.L., Toms, E.G.: What is user engagement? A conceptual framework for defining user engagement with technology. Journal of the American Society for Information Science and Technology (6), 938–955 (2008)
27. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8026–8037 (2019)
28. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 2825–2830 (2011)
29. Penha, G., Balan, A., Hauff, C.: Introducing MANtIS: a novel multi-domain information seeking dialogues dataset. arXiv preprint arXiv:1912.04639 (2019)
30. Penha, G., Hauff, C.: Challenges in the evaluation of conversational search systems. In: KDD Workshop on Conversational Systems Towards Mainstream Adoption (2020)
31. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
32. Radlinski, F., Craswell, N.: A theoretical framework for conversational search. In: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval. pp. 117–126 (2017)
33. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pp. 2383–2392. Association for Computational Linguistics, Austin, Texas (Nov 2016)
34. Ren, P., Chen, Z., Ren, Z., Kanoulas, E., Monz, C., de Rijke, M.: Conversations with search engines. ACM Transactions on Information Systems