Knowledge Elicitation via Sequential Probabilistic Inference for High-Dimensional Prediction
Pedram Daee†, Tomi Peltola†, Marta Soare†, and Samuel Kaski
Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University
[email protected]
† Authors contributed equally.
Abstract
Prediction in a small-sized sample with a large number of covariates, the "small n, large p" problem, is challenging. This setting is encountered in multiple applications, such as precision medicine, where obtaining additional samples can be extremely costly or even impossible, and extensive research effort has recently been dedicated to finding principled solutions for accurate prediction. However, a valuable source of additional information, domain experts, has not yet been efficiently exploited. We formulate knowledge elicitation generally as a probabilistic inference process, where expert knowledge is sequentially queried to improve predictions. In the specific case of sparse linear regression, where we assume the expert has knowledge about the values of the regression coefficients or about the relevance of the features, we propose an algorithm and computational approximation for fast and efficient interaction, which sequentially identifies the most informative features on which to query expert knowledge. Evaluations of our method in experiments with simulated and real users show improved prediction accuracy already with a small effort from the expert.

∗ This is the pre-print version. The paper is published in the Machine Learning journal. Definitive version DOI: 10.1007/s10994-017-5651-7. Link: http://rdcu.be/t9KF.

1 Introduction

Data sets with a small number of samples n and a large number of variables p are nowadays common. Statistical learning, for example regression, in these kinds of problems is ill-posed, and it is known that statistical methods have limits in how low in sample size they can go [1]. A lot of recent research in statistical methodology has focused on finding different kinds of solutions via well-motivated trade-offs in model flexibility and bias. These include strong assumptions about the model family, such as linearity, low rank, or sparsity; meta-analysis and transfer learning from related datasets; efficient collection of new data via active learning; and, less prominently, prior elicitation.

There is, however, a certain disconnect between the development of state-of-the-art statistical methods and their application in challenging data analysis problems. Many applications have significant amounts of previous knowledge to incorporate into the analysis, but this is often unstructured and tacit. Building it into the analysis would require tailoring the model and inference.
Contributions and Outline
The outline of the paper and our main contributions are as follows. After discussing related work (Sect. 2), we rigorously formulate expert knowledge elicitation as a probabilistic inference process (Sect. 3). We study a specific case of sparse linear regression and, in particular, consider cases where the user has knowledge about the values of the regression coefficients and about the relevance of the features (Sect. 4). We present an algorithm for efficient interactive sequential knowledge elicitation for high-dimensional models that makes knowledge elicitation in "small n, large p" problems feasible (Sect. 4.3). We describe an efficient computational approach using deterministic posterior approximations, allowing real-time interaction for the sparse linear regression case (Sect. 4.4). Simulation studies are presented to demonstrate the performance and to gain insight into the behaviour of the approach (Sect. 5). Finally, we demonstrate that real users are able to improve the predictive performance of sparse linear regression in a proof-of-concept experiment (Sect. 5.4).

2 Related Work

The problem we study relates to several topics studied in the literature, either by the method, the goal, or the considered setting. In this section, we highlight the main connections.
Interactive Learning.
Interactive machine learning includes a variety of ways to employ the user's knowledge, preferences, and human cognition to enhance statistical learning [2, 3]. These methods have been used successfully in several applications, such as learning user intent [4] and preferential clustering. For instance, the semi-supervised clustering method in [5, 6] uses feedback on pairs of items that should or should not be in the same cluster to learn user preferences. In addition to the differences coming from the learning task, one notable contrast between these works and our method is that their aim is to identify user preferences or opinions, whereas our goal is to use expert knowledge as an additional source of information for an improved prediction model, by integrating it with the knowledge coming from the (small n) data. As a probabilistic approach, our work relates to [7] and [8], where expert feedback is used for improved learning of Bayesian networks and for visual data exploration, respectively. In Sect. 3.3, we show how these works can be seen as instances of the general approach we propose.

Active Learning.
The method we propose for efficiently using expert feedback is related to active learning techniques (for a survey, see, for instance, [9]), where the algorithm actively selects the most informative data points to be used in prediction tasks. Our method similarly queries the user for information, with the goal of maximising the information gain from each feedback and thus learning more accurate models with less feedback. The same definition of efficiency with respect to the use of samples also connects our work with experimental design techniques, recently used for linear settings by Seeger [31], Hernández-Lobato et al. [32], and Ravi et al. [12]. Our task, however, is different, as we do not aim at collecting new data samples; the additional information comes from a different source, the expert, with its respective bias and uncertainty. Indeed, our method will be most useful in cases where obtaining additional input samples would be too expensive.
Prior Elicitation and Privileged Information.
Many works have studied approaches for efficient elicitation of human and, in particular, expert knowledge. In prior elicitation [13], the goal is to use expert knowledge to construct a prior distribution for Bayesian data analysis and to restrict the range of parameters to be later used in learning models. Notably, an important line of work [14, 15] studies methods for quantifying subjective opinion about the coefficients of linear regression models through the assessment of credible intervals. Our approach goes beyond pure prior elicitation, as the training data is used to facilitate efficient user interaction. Another line of work considers expert feedback as privileged information [16], where additional human knowledge is allowed in the training phase only. Unlike our method, these works typically do not consider an interactive integration of the expert knowledge with the training data, and do not model the reliability of the human feedback thus received; rather, they use it as a guideline for improving the performance of learning tasks.
3 Knowledge Elicitation as Probabilistic Inference

In the following, we formulate expert knowledge elicitation as a probabilistic inference process.
Let y and x denote the outputs (target variables) and inputs (covariates), and θ and φ_y the model parameters. Let f encode input from the user (feedback based on the user's knowledge) and φ_f the related model parameters. We identify the following key components:

1. An observation model p(y | x, θ, φ_y) for y.
2. A feedback model p(f | θ, φ_f) for the user's knowledge.
3. A prior model p(θ, φ_y, φ_f) completing the hierarchical model description.
4. A query algorithm and user interface that facilitate gathering f iteratively from the user.
5. An update process for the model after user interaction.

The observation model can be any appropriate probability model. It is assumed that there is some parameter θ, possibly high-dimensional, that the user has knowledge about. The user's knowledge is encoded as (possibly partial) feedback f that is transformed into information about θ via the feedback model. Of course, there could be a more complex hierarchy tying the observation and feedback models, and the feedback model can also be used to model more user-centric issues, such as the quality of or uncertainty in the knowledge, or the user's interests.

The feedback model, together with a query algorithm and a user interface, is used to facilitate an efficient interaction with the user. The term "query algorithm" is used here in a broad sense, to describe any mechanism that intelligently guides the user's focus in providing feedback to the system. This enables considering a high-dimensional f without overwhelming the user, as the most useful feedbacks can be queried first. Crucially, this enables going beyond pure prior elicitation, as the observed data can be used to inform the queries via the dependence of the feedback and observation models. For example, the queries can be formed as solutions to decision or experimental design tasks that maximize the expected information gain from the interaction. Finally, as the user's feedback is modelled as additional data, Bayes' theorem can be used to sequentially update the model during the interaction. For real-time interaction, this may present a challenge, as computation in probabilistic models can be demanding. It is known that slow computation can impair effective interaction [17] and, thus, efficient computational approaches are important.

Figure 1 depicts the information flow. First, the posterior distribution given the observations D = {(y_i, x_i) : i = 1, …, n} is computed. Then, the user is queried iteratively for feedback via the user interface and the query algorithm. The feedback is used to sequentially update the posterior distribution. The query algorithm has access to the latest beliefs about the model parameters and the predicted user behaviour, that is, the posterior predictive distribution of f, p(f_{t+1} | D, f_1, …, f_t), where the f_j are possibly partial observations of f, to formulate queries and highlight the most informative interactions in the user interface.

[Figure 1: diagram of the information flow: the model p and the data D yield p(θ | D); queries f_1, f_2, … answered by the expert sequentially update it to p(θ | D, f_1), p(θ | D, f_1, f_2), ….] Figure 1: Information flow. The parameters φ_y and φ_f are omitted from the posterior distributions for brevity.
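To make the information flow concrete, the following is a minimal sketch of the interaction loop in Python. The model object and its methods (fit, expected_information_gain, update), as well as ask_user, are hypothetical placeholders standing in for components 1–5 above, not part of the paper's implementation.

```python
def elicitation_loop(model, D, candidates, n_queries, ask_user):
    """Sequential knowledge elicitation: fit on data, then iteratively
    query the expert on the most informative target and update the model.
    All methods of `model` are hypothetical stand-ins for components 1-5."""
    model.fit(D)                                   # posterior p(theta | D)
    feedbacks = {}
    for _ in range(n_queries):
        remaining = [j for j in candidates if j not in feedbacks]
        # query algorithm: maximize the expected information gain
        j = max(remaining, key=model.expected_information_gain)
        f = ask_user(j)                            # expert answers via the UI
        if f is not None:                          # 'uncertain' gives no feedback
            feedbacks[j] = f
            model.update(j, f)                     # p(theta | D, f_1, ..., f_t)
    return model, feedbacks
```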
The goal in this paper is to use the interaction scheme to help solve prediction problems in the "small n, large p" setting. The approach as described above is, however, more general and applicable to other problems. We briefly describe two earlier works that can be seen as instances of it.

Cano et al. [7] present a method for integrating expert knowledge into learning of Bayesian networks. The observation model is a multinomial Bayesian network with Dirichlet priors. The user provides answers to queries about the presence or absence of edges in the graph, and the feedback model assumes the answers to be correct with some probability. Which edge to query about next is selected by maximising the information gain with regard to the inclusion probability of the edges. Monte Carlo algorithms are used for the computation.

House et al. [8] present a framework for interactive visual data exploration. They describe two observation models, principal component analysis and multidimensional scaling, that are used for dimensionality reduction to visualise the observations in a two-dimensional plot. They do not have a query algorithm, but their user interface allows moving points in the low-dimensional plot closer together or further apart, which is interpreted by a feedback model that transforms the feedback into appropriate changes in the parameters shared with the observation model, allowing exploration of different aspects of the data. Their model affords closed-form updates.

We next introduce the knowledge elicitation approach for sparse linear regression.
4 Knowledge Elicitation for Sparse Linear Regression

4.1 Model

Let y ∈ R^n be the observed output values and X ∈ R^{n×m} the matrix of covariate values. The regression is modelled with a Gaussian observation model, a spike-and-slab sparsity-inducing prior [18] on the regression coefficients w ∈ R^m, and a Gamma prior on the inverse of the residual noise variance σ²:

    y ∼ N(Xw, σ²I),                                        (1)
    σ⁻² ∼ Gamma(α_σ, β_σ),
    w_j ∼ γ_j N(0, ψ²) + (1 − γ_j) δ_0,   j = 1, …, m,
    γ_j ∼ Bernoulli(ρ),                   j = 1, …, m.

Here, the γ_j are latent binary variables indicating inclusion or exclusion of the covariates in the regression (δ_0 is a point mass at zero) and ρ is the prior inclusion probability controlling the expected sparsity. The α_σ, β_σ, ψ², and ρ are assumed fixed hyperparameters.

4.2 Feedback Models

We consider two simple and natural feedback models encoding knowledge about the individual regression coefficients:

• The user has knowledge about the value of a coefficient (f_{w,j} ∈ R):

    f_{w,j} ∼ N(w_j, ω²).                                  (2)

• The user has knowledge about the relevance of a coefficient (f_{γ,j} ∈ {0, 1} for not-relevant/relevant):

    f_{γ,j} ∼ γ_j Bernoulli(π) + (1 − γ_j) Bernoulli(1 − π).   (3)

Here, ω² and π control the uncertainty or strength of the knowledge. In detail, ω² is the uncertainty in the user's estimate of the coefficient, and π is the probability that the user gives correct feedback relative to the state of the covariate inclusion indicator γ_j.

4.3 Query Algorithm

Our aim is to improve prediction. Thus, the user interaction should focus on aspects of the model (here, predictive features) that would be most beneficial towards this goal. We use the query algorithm to rank the features for choosing which feature to ask feedback about next. The ranking is formulated as a Bayesian experimental design task [19]. More specifically, the feature j* that maximizes the expected information gain is chosen next:

    j* = arg max_{j ∉ F} E_{p(f̃_j | D)} [ Σ_i KL( p(ỹ | D, x_i, f̃_j) ‖ p(ỹ | D, x_i) ) ],

where j indexes the features, F is the set of feedbacks that have already been given (to simplify notation, those are here included in D), and the summation over i goes over the training dataset. The information gain is defined as the Kullback–Leibler divergence (KL) between the current posterior predictive distribution p(ỹ | D, x) = ∫ p(ỹ | x, θ) p(θ | D) dθ, where θ = (w, γ, σ²), and the posterior predictive distribution with the new feedback f̃_j, p(ỹ | D, x, f̃_j). The bigger the information gain, the bigger the impact the new feedback has on the predictive distribution. Since the feedback itself will only be observed after querying the user, we take the expectation over the posterior predictive distribution of the feedback, p(f̃_j | D). More details about the Bayesian experimental design are provided in the supplementary material (Sect. B).

We note that, were the predictive distribution of y Gaussian, the problem would be simple: the expected information gain would be independent of y and of the actual values of the feedbacks (when feedback is on values of the regression coefficients) and would only depend on the x and on which features were given feedback on [31]. The sparsity-promoting prior, however, makes the problem non-trivial.
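Under a Gaussian posterior approximation (as used in Sect. 4.4 and derived in Sects. A–B of the supplement), the expected information gain for feedback on a coefficient value has a closed form: the rank-one covariance update does not depend on the feedback value, and the expected squared shift of the predictive mean equals (xᵀΣ̄e_j)²/(Σ̄_jj + ω²). A minimal sketch under these assumptions (function and argument names are ours):

```python
import numpy as np

def expected_info_gain_value_feedback(X, mean, cov, s2, omega2, candidates):
    """Expected information gain of querying the value of each candidate
    coefficient, for a Gaussian coefficient posterior q(w) = N(mean, cov)
    with residual variance s2 and feedback model f_j ~ N(w_j, omega2).
    Uses the closed form: the rank-one covariance downdate is independent
    of the feedback value, and the expected squared mean shift equals
    (x' cov e_j)^2 / (cov_jj + omega2)."""
    v_old = np.einsum('ij,jk,ik->i', X, cov, X) + s2   # predictive variances
    gains = {}
    for j in candidates:
        c = cov[:, j]                                  # j-th column of cov
        denom = cov[j, j] + omega2
        Xc = X @ c                                     # x_i' cov e_j for each i
        v_new = v_old - Xc**2 / denom                  # downdated predictive variance
        shift2 = Xc**2 / denom                         # expected squared mean shift
        gains[j] = 0.5 * np.sum(np.log(v_old / v_new) + v_new / v_old
                                + shift2 / v_old - 1.0)
    return gains
```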
4.4 Computation

The model does not have a closed-form posterior distribution, predictive distribution, or solution to the information gain maximization problem. To achieve fast computation, we use deterministic posterior approximations. Expectation propagation [25] is used to approximate the spike-and-slab prior [29] and the feedback models, and variational Bayes (e.g., [27, Chapter 10]) is used to approximate the residual variance σ². The form of the posterior approximation for the regression coefficients w is Gaussian. The posterior predictive distribution for y is also approximated as Gaussian. Details are provided in the supplementary material (Sect. A.2).

Expectation propagation has been found to provide good estimates of uncertainty, which is important in experimental design [29, 31, 32]. However, in evaluating the expected information gain for a large number of candidate features, running the approximation iterations to full convergence for each candidate is too slow. We follow the approach of Seeger [31] and Hernández-Lobato et al. [32] in computing only a single iteration of updates on the essential parameters for each candidate. We show in the results that this already provides good performance for the query algorithm in comparison to random queries. Details on the computations are provided in the supplementary material (Sect. A.3).

5 Experiments

The performance of the proposed method (Sect. 4) is evaluated in several "small n, large p" regression problems on both simulated and real data. A proof-of-concept user study is presented to demonstrate the feasibility of the method with real users.

5.1 Simulated Data

We use synthetic data to study the behaviour of the approach in a wide range of controlled settings.

[Figure 2: heatmaps of MSE over dimensionality (m = 12, …, 192) and number of expert feedbacks (up to 50), for Random and Sequential Experimental Design; panels: (a) feedback on coefficients' values, (b) feedback on coefficients' relevances.] Figure 2: Mean squared errors in simulated settings with increasing dimensionality. The number of relevant coefficients is m* = 10 and the number of training data points is n = 10. The MSE values are averages over 100 independent runs.
Setting.
The covariates of the n training data points are generated from X ∼ N(0, I). Out of the m regression coefficients w_1, …, w_m ∈ R, m* are generated from w_j ∼ N(0, ψ²) and the rest are set to zero. The observed output values are generated from y ∼ N(Xw, σ²I). We consider cases where the user has knowledge about the value of the coefficients (Eq. 2 with noise value ω² = 0.1) and where the user has knowledge about non-relevant/relevant features (Eq. 3 with γ_j = 1 if w_j is non-zero, γ_j = 0 otherwise, and π = 0.). We set ψ² = 1, ρ = m*/m, and σ² = 1 (here we do not use the distribution assumption on σ²). The generative process is sketched below.
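A minimal generator for this setting might look as follows (the function name is ours; which m* coefficients are non-zero is arbitrary, so we pick them at random):

```python
import numpy as np

def simulate_data(n, m, m_star, psi2=1.0, sigma2=1.0, seed=None):
    """Synthetic "small n, large p" data as in the setting above:
    X ~ N(0, I), m_star coefficients drawn from N(0, psi2), rest zero,
    y ~ N(Xw, sigma2 I)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, m))
    w = np.zeros(m)
    relevant = rng.choice(m, size=m_star, replace=False)
    w[relevant] = rng.normal(0.0, np.sqrt(psi2), size=m_star)
    y = X @ w + rng.normal(0.0, np.sqrt(sigma2), size=n)
    return X, y, w, relevant
```

For the experiments in Fig. 2, n = 10, m* = 10, and m ranges from 12 to 192.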
Results.

In Fig. 2, we consider a "small n, large p" scenario, with n = 10, m* = 10, and increasing dimensionality (hence also increasing sparsity) from m = 12, …, 192. (All codes and data are available at https://github.com/HIIT/knowledge-elicitation-for-linear-regression.) A further experiment (Fig. 5 in Sect. C.1.1 of the supplement) considers a fixed dimensionality m = 100 and a varying number of training data points n = 5, …, 50. For those experiments, we can again see superior improvement for the sequential experimental design compared to random, for both feedback models, and in particular for small sample sizes. Moreover, a comparison of the sequential experimental design algorithm to its non-sequential version (Sect. C.1.2 in the supplement) shows that the former achieves better performance, indicating that the user feedback affects the next query. Finally, for further insight into the behaviour of the approach, a simulation experiment with n = 10 in Sect. C.2 of the supplementary material shows that the training set error begins to increase as a function of the number of feedbacks while the test error decreases. This happens because the initial fit exhausts the information in the training data, but at this small sample size it is insufficient to provide good generalization performance.

5.2 Review Rating Prediction on Amazon and Yelp Data

We test our method on the task of predicting review ratings from textual reviews in subsets of the Amazon and Yelp datasets. Each review is one data point, and each distinct word is a feature, with the corresponding covariate value given by the number of appearances of the word in the review. In addition to being well-suited to sparse linear regression models (as shown in previous studies, for instance, in [29]), we also chose this type of dataset due to the uncomplicated interpretation of the features, which allows us to easily test our method on real users.
Amazon data.

The Amazon data is a subset of the sentiment dataset of [23]. This dataset contains textual reviews and their corresponding 1–5 star ratings for Amazon products. Here, we only consider the reviews for products in the kitchen appliances category, which amounts to 5149 reviews. The preprocessing of the data follows the method described in [29], where this dataset was used for testing the performance of a sparse linear regression model. Each review is represented as a vector of features, where the features correspond to unigrams and bigrams, as given by the data provided by [23]. For each distinct feature and for each review, we created a matrix of occurrences and only kept for our analysis the features that appeared in at least 100 reviews, that is, 824 features.

Yelp data.
The second dataset we use is a subset of the Yelp (academic) dataset. The dataset contains 2.7 million restaurant reviews with ratings ranging from 1 to 5 stars (rounded to half-stars). Here, we consider the 4086 reviews from the year 2004. Similarly to the preprocessing done for the Amazon data, each review is represented as a vector of features (distinct words). After removing non-alphanumeric characters from the words and removing words that appear fewer than 100 times, we have 465 words for our analysis.

Dataset   Reviews   Features
Yelp      4086      465
Amazon    5159      824

Table 1: Sizes of the dataset subsets used in the experiments.

5.3 Simulated User Feedback

For all experiments on the Amazon and Yelp datasets, we proceeded as follows. First, each dataset was partitioned into three parts: (1) a training set of 100 randomly selected reviews, (2) a test set of 1000 randomly selected reviews, and (3) the rest as a "user-data" set for constructing simulated user knowledge. The data were normalised to have zero mean and unit standard deviation on the training and user-data sets. The simulated user feedback was generated based on the posterior inclusion probabilities E[γ] in a spike-and-slab model trained on the user-data partition. We only considered the more realistic case where the user can give feedback about the relevance of the words. For a word j selected by the algorithm, the user gives feedback that the word is relevant if E[γ_j] > π, not-relevant if E[γ_j] < 1 − π, and uncertain otherwise (this rule is sketched in code after the list below). The intuition is that if the user-data indicate that a feature is zero/non-zero with high probability, then the simulated user would select that feature as not-relevant/relevant. However, for uncertain words, the feedback iteration passes without receiving any feedback. The model parameters were set to π = 0., ψ² = 0., α_σ = 1, β_σ = 1, and ρ = 0.

We compare three query algorithms:

• random feature suggestion (green line, triangle up),
• a strategy that knows the relevant features beforehand (inferred from the posterior inclusion probabilities over all data) and asks exclusively about them first, and then chooses at random from the features not already selected (red line, triangle down); although unrealistic, this "oracle" strategy allows us to see the performance gain obtainable by an intuitively good strategy that first queries experts about the relevant features,
• our sequential experimental design algorithm (Sect. 4.3) (blue line, squares).
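The simulated user described above reduces to a simple thresholding rule on the posterior inclusion probabilities; a minimal sketch, assuming the thresholds π and 1 − π as reconstructed above:

```python
def simulated_relevance_feedback(j, e_gamma, pi):
    """Simulated expert: e_gamma[j] is the posterior inclusion probability
    E[gamma_j] from a spike-and-slab model trained on the user-data
    partition. Returns 1 (relevant), 0 (not-relevant), or None (uncertain:
    the feedback iteration then passes without any feedback)."""
    if e_gamma[j] > pi:
        return 1
    if e_gamma[j] < 1.0 - pi:
        return 0
    return None
```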
[Figure 3: MSE as a function of the number of expert feedbacks; curves: Random; First relevant features, then non-relevant; Sequential Experimental Design; Ground truth (all feedbacks); panels: (a) Amazon data, (b) Yelp data.] Figure 3: Mean squared errors when user feedback is on relevance of features, for the Amazon and Yelp data. The MSE values are averages over 100 independent runs.

All algorithms query feedback about one feature at a time, and MSE is used as the performance measure. The ground truth line represents the MSE after receiving user feedback for all words in each dataset.

A first observation is that the use of additional knowledge coming from the simulated expert indeed reduces the prediction errors, for all algorithms and on both datasets. Yet, the reduction
in the prediction error differs significantly depending on whether the methods manage to query feedback on the most informative features first. Indeed, the goal is to make the elicitation as little burdensome as possible for the experts. To reach this goal, a strategy needs to rapidly extract a maximal amount of information from the expert, which here amounts to the careful selection of the features on which to query feedback. As expected, the random query selection strategy has a constant and slow improvement rate as the number of feedbacks grows, leaving a big gap to the ground-truth performance in both datasets, even after 200 user feedbacks. In contrast, the (unrealistic) strategy that first asks about relevant features begins with a steep increase in performance for the first iterations (only 26 words for Amazon and 23 for Yelp are marked as relevant, as computed from the full dataset); then it continues with a very slow improvement rate coming from asking about non-relevant words. Our method manages to identify the informative features rapidly and thus shows a larger improvement compared to random from the first user feedbacks. In the case of the Yelp data, our strategy manages to stay very close to the strategy knowing the relevant words in the initial feedbacks, and then gets very close to the ground truth after 200 interactions. Furthermore, there is a significant gap compared to the random strategy for all amounts of feedback. In the more difficult (in terms of rating prediction error and dimensionality) Amazon dataset, although the gap to the random strategy is clear, our strategy exceeds the level of information obtained in the 26 non-zero features only after 140 feedbacks.
We next contrast the improvements in the predictions brought by eliciting expert feedback with the improvements gained by adding samples from the user-data set to the training set. For the latter, we use two alternative strategies: randomly selecting a sequence of reviews to be included in the training set, and an active learning strategy which selects samples based on maximizing the expected information gain (an adaptation of the method in [31]).

[Table 2: columns MSE | More Samples: Random, Active [31] | More Feedback: Random, SeqExpDes; values for the Yelp dataset.]

Table 2 shows how many expert feedbacks (for the knowledge elicitation strategies in the last two columns: random and our method; see Sect. 4.3) and, respectively, how many additional samples (that is, additional reviews to be included in the training set) are needed to reach set levels of MSE, noting that all strategies have the same "small n, large p" regression setting as a starting point, with n = 100 and a corresponding MSE of 1.2036.

Even with the relatively weak type of expert feedback (feedback on the relevance of features), a specific performance is reached with a comparable number of expert feedbacks and additional data. For instance, the same level of MSE = 1.18 is obtained either by asking an expert about the relevance of 25 features or by actively selecting 12 extra samples. When active selection is not possible, we can see that the same information gain requires 94 additional randomly selected samples. Naturally, the results obtained are specific to this Yelp data and to the feedback model we assume. Nevertheless, the comparison shows the potential of expert knowledge elicitation in prediction for settings where actively selecting samples is not possible, or even more so, where getting additional samples is impossible or very expensive. The same observations and intuitions about the information gain comparison remain valid for the Amazon data (see Sect. C.3).

5.4 User Study

The goal of the user study is to investigate the prediction improvement and convergence speed of the proposed sequential method based on human feedback. Our focus is on testing the accuracy of feedback from real users on the easily interpretable Amazon data, rather than on details of the user interface. Hence, we asked ten university students and researchers to go through all the 824 words and give us feedback in the form of not-relevant, relevant, or uncertain. This allowed for a fast collection of feedbacks, and we could use the pre-given feedback to test the effectiveness of several query algorithms. We assumed that the algorithms had access to 100 training data points and that at each iteration they could query the pre-given feedback of the participant about one word. The whole process was repeated for 40 independent runs, where the training data were randomly selected. The hyperparameters of the model were set to the same values as in the simulated data study, with the only difference that the strength of user knowledge was lowered to π = 0.
[Figure 4: MSE as a function of the number of user feedbacks; curves: Baseline (no user feedback), Random, Sequential Experimental Design.] Figure 4: Mean squared errors for ten participants. Values are averages over 40 independent runs.

Fig. 4 shows the average MSE improvements for each of the 10 participants, when using our proposed method and the random query order. From the very first feedbacks, the sequential experimental design approach performs better for all users and captures the expert knowledge more efficiently. The random strategy exhibits a relatively constant rate of performance improvement with the increasing number of feedbacks, while the sequential experimental design shows a faster improvement rate in the beginning, implying that it can query about the more important features first.

To further quantify the statistical evidence for the difference, we computed paired-sample t-tests between the random suggestion and the proposed method at each iteration (green and blue curves in Fig. 4). Already after the first feedback, the difference between the methods is significant at the Bonferroni-corrected level.

6 Conclusion

We presented a knowledge elicitation approach for high-dimensional sparse linear regression. The results for "small n, large p" problems on simulated and real data, with simulated and real users, and with user knowledge on the regression weight values and on the relevance of features, showed improved prediction accuracy already with a small number of user interactions. The knowledge elicitation problem was formulated as a probabilistic inference process that sequentially acquires and integrates user knowledge with the training data. Compared to pure prior elicitation, the approach can facilitate richer interaction and be used in knowledge elicitation for high-dimensional parameters without overwhelming the user.

As a by-product of our study, we noticed that even for the rather weak feedback on the relevance of features, the number of expert feedbacks and the number of randomly acquired additional data samples needed to reach a certain level of MSE reduction were of the same order. Although this observation was obtained on a noisy dataset and in a simplified user interaction setting, the fact that the considered feedback type was rather weak sets the ground for a further and more robust comparison of the performance gain obtained from these two different sources of information.

The presented knowledge elicitation method is general, and as all assumptions have been explicated as a probabilistic model, the approach can be rigorously analyzed and tailored to match the specifics of other knowledge elicitation settings. The presented results considered rather simple types of feedback as a proof of concept of the approach. In the future, we will work on extending the types of interactions and outlining new types of interactive machine learning problems.

Acknowledgements
This work was financially supported by the Academy of Finland (Finnish Center of Excellence in Computational Inference Research COIN; grants 295503, 294238, 292334, and 284642), Re:Know funded by TEKES, and MindSee (FP7-ICT; Grant Agreement no. 611570). We thank Juho Piironen for comments that improved the article.
References

[1] David Donoho and Jared Tanner. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A, 367:4273–4293, 2009.
[2] Saleema Amershi. Designing for Effective End-User Interaction with Machine Learning. PhD thesis, University of Washington, 2012.
[3] Reid Porter, James Theiler, and Don Hush. Interactive machine learning in data exploitation. Computing in Science & Engineering, 15(5):12–20, 2013.
[4] Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, and Samuel Kaski. Interactive intent modeling: Information discovery beyond search. Communications of the ACM, 58(1):86–92, 2014.
[5] Zhengdong Lu and Todd K. Leen. Semi-supervised clustering with pairwise constraints: A discriminative approach. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), pages 299–306, 2007.
[6] Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In Algorithmic Learning Theory, pages 316–328. Springer, 2008.
[7] Andrés Cano, Andrés R. Masegosa, and Serafín Moral. A method for integrating expert knowledge when learning Bayesian networks from data. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(5):1382–1394, 2011.
[8] Leanna House, Scotland Leman, and Chao Han. Bayesian visual analytics: BaVA. Statistical Analysis and Data Mining, 8(1):1–13, 2015.
[9] Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.
[12] Sathya N. Ravi, Vamsi K. Ithapu, Sterling C. Johnson, and Vikas Singh. Experimental design on a budget for sparse linear models and applications. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 583–592, 2016.
[13] Anthony O'Hagan, Caitlin E. Buck, Alireza Daneshkhah, J. Richard Eiser, Paul H. Garthwaite, David J. Jenkinson, Jeremy E. Oakley, and Tim Rakow. Uncertain Judgements: Eliciting Experts' Probabilities. Wiley, Chichester, England, 2006.
[14] Paul H. Garthwaite and James M. Dickey. Quantifying expert opinion in linear regression problems. Journal of the Royal Statistical Society, Series B (Methodological), pages 462–474, 1988.
[15] Joseph B. Kadane, James M. Dickey, Robert L. Winkler, Wayne S. Smith, and Stephen C. Peters. Interactive elicitation of opinion for a normal linear model. Journal of the American Statistical Association, 75(372):845–854, 1980.
[16] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5–6):544–557, 2009.
[17] Jerry Alan Fails and Dan R. Olsen, Jr. Interactive machine learning. In Proceedings of the 8th International Conference on Intelligent User Interfaces (IUI), pages 39–45, 2003.
[18] Edward I. George and Robert E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
[19] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, 10(3):273–304, 1995.
[23] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL), pages 187–205, 2007.
[24] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 3rd edition, 2014.
[25] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 362–369, 2001.
[26] Thomas P. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
[27] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[28] José M. Hernández-Lobato, Tjeerd Dijkstra, and Tom Heskes. Regulator discovery from gene expression time series of malaria parasites: a hierarchical approach. In Advances in Neural Information Processing Systems 20 (NIPS), pages 649–656. Curran Associates, Inc., 2008.
[29] José Miguel Hernández-Lobato, Daniel Hernández-Lobato, and Alberto Suárez. Expectation propagation in linear regression models with spike-and-slab priors. Machine Learning, 99(3):437–487, 2015.
[30] Thomas P. Minka and John Lafferty. Expectation-propagation for the generative aspect model. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 352–359, 2002.
[31] Matthias W. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759–813, 2008.
[32] Daniel Hernández-Lobato, José Miguel Hernández-Lobato, and Pierre Dupont. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, 14(1):1891–1945, 2013.
A Gaussian Linear Regression with Spike-and-Slab Prior
A.1 Model
The posterior distribution of the regression model is

    p(w, σ², γ | D) ∝ p(f_γ | γ) p(f_w | w) p(y | X, w, σ²) p(σ⁻²) p(w | γ) p(γ),

where D = (y, X, f_γ, f_w) are the training data observations together with the sets of observed user feedback, and

    p(f_γ | γ) = ∏_{j ∈ F_γ} [γ_j Bernoulli(f_{γ,j} | π) + (1 − γ_j) Bernoulli(f_{γ,j} | 1 − π)],
    p(f_w | w) = ∏_{j ∈ F_w} N(f_{w,j} | w_j, ω²),
    p(y | X, w, σ²) = N(y | Xw, σ²I),
    p(σ⁻²) = Gamma(σ⁻² | α_σ, β_σ),
    p(w | γ) = ∏_j [γ_j N(w_j | 0, ψ²) + (1 − γ_j) δ(w_j)],
    p(γ) = ∏_j Bernoulli(γ_j | ρ).

Here, F_γ and F_w denote the sets of indices of the features that have received relevance feedback and weight feedback, respectively. π, ω², α_σ, β_σ, and ψ² are assumed fixed hyperparameters. The parametrizations of the distributions follow Gelman et al. [24], and we use the generic p(·) notation, where it is understood that the parameters identify the separate terms.
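For reference, the unnormalized log posterior can be evaluated term by term; a minimal sketch with our own naming conventions (hyperparameters passed as a plain dict):

```python
import numpy as np
from scipy import stats

def log_joint(w, inv_s2, gamma, y, X, f_gamma, f_w, hp):
    """Unnormalized log p(w, sigma^-2, gamma | D) for the model above.
    f_gamma and f_w are dicts {feature index: feedback}; hp holds the
    fixed hyperparameters pi, omega2, alpha_s, beta_s, psi2, rho."""
    lp = stats.gamma.logpdf(inv_s2, hp['alpha_s'], scale=1.0 / hp['beta_s'])
    lp += stats.norm.logpdf(y, X @ w, np.sqrt(1.0 / inv_s2)).sum()
    for j, g in enumerate(gamma):
        if g:   # slab: w_j ~ N(0, psi2)
            lp += np.log(hp['rho']) + stats.norm.logpdf(w[j], 0.0, np.sqrt(hp['psi2']))
        else:   # spike: point mass at zero
            if w[j] != 0.0:
                return -np.inf
            lp += np.log1p(-hp['rho'])
    for j, f in f_gamma.items():     # relevance feedback
        p_rel = hp['pi'] if gamma[j] else 1.0 - hp['pi']
        lp += np.log(p_rel) if f == 1 else np.log1p(-p_rel)
    for j, f in f_w.items():         # value feedback
        lp += stats.norm.logpdf(f, w[j], np.sqrt(hp['omega2']))
    return lp
```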
A.2 Posterior approximation

The corresponding posterior approximation is

    q(w, σ⁻², γ) = q(w) q(σ⁻²) q(γ),

where, using bars to distinguish the parameters of the posterior approximation,

    q(w) = N(w | m̄, Σ̄),
    q(σ⁻²) = Gamma(σ⁻² | ᾱ_σ, β̄_σ),
    q(γ) = ∏_j Bernoulli(γ_j | ρ̄_j),

and the model terms are approximated by site terms

    p(f_γ | γ) ≈ ∏_{j ∈ F_γ} t̃_Bernoulli(γ_j | ρ̃_{f_γ,j}),
    p(f_w | w) = ∏_{j ∈ F_w} t̃_N(w_j | μ̃_{f_w,j}, τ̃_{f_w,j}),
    p(y | X, w, σ²) ≈ t̃_N(w | μ̃_y, Γ̃_y) t̃_Gamma(σ⁻² | α̃_y, β̃_y),
    p(σ⁻²) = t̃_Gamma(σ⁻² | α_σ − 1, −β_σ),
    p(w | γ) ≈ ∏_j t̃_N(w_j | μ̃_{w,j}, τ̃_{w,j}) t̃_Bernoulli(γ_j | ρ̃_{w,j}),
    p(γ) = ∏_j t̃_Bernoulli(γ_j | logit(ρ)),

where the t̃ denote the exponential family forms of the corresponding distributions, parametrized by the precision-adjusted mean and precision for the normal distribution, and by the natural parameters for the Bernoulli and Gamma distributions. Note that the terms p(σ⁻²), p(f_w | w), and p(γ) need not be approximated, as they are already of the correct exponential family form.

The parameters of the full approximation can be identified from the products of the corresponding site term approximations and are

    m̄ = Σ̄ (μ̃_y + μ̃_w + μ̃_{f_w}),
    Σ̄ = (Γ̃_y + diag(τ̃_w) + diag(τ̃_{f_w}))⁻¹,
    ᾱ_σ = α_σ + α̃_y,
    β̄_σ = β_σ − β̃_y,
    ρ̄_j = 1 / (1 + exp(−(ρ̃_{w,j} + logit(ρ) + ρ̃_{f_γ,j}))),

where diag(·) is a diagonal matrix with the parameter vector as the diagonal, and the feedback term approximation parameters are zero for feedbacks that have not been observed.
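Assembling the full approximation from the site parameters is a direct transcription of these formulas; a sketch with our argument names:

```python
import numpy as np

def combine_site_approximations(mu_y, Gamma_y, mu_w, tau_w, mu_fw, tau_fw,
                                alpha_y, beta_y, rho_w, rho_fg,
                                alpha_sigma, beta_sigma, logit_rho):
    """Assemble the full approximation from the site terms (Sect. A.2).
    Site parameters are precision-adjusted means / precisions for the
    Gaussian terms and natural parameters for the Bernoulli and Gamma
    terms; unobserved feedbacks contribute zeros."""
    Sigma_bar = np.linalg.inv(Gamma_y + np.diag(tau_w) + np.diag(tau_fw))
    m_bar = Sigma_bar @ (mu_y + mu_w + mu_fw)
    alpha_bar = alpha_sigma + alpha_y
    beta_bar = beta_sigma - beta_y
    rho_bar = 1.0 / (1.0 + np.exp(-(rho_w + logit_rho + rho_fg)))
    return m_bar, Sigma_bar, alpha_bar, beta_bar, rho_bar
```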
A.3 Computation of the posterior approximation

Expectation propagation (EP) and variational Bayes (VB) inference are used to find the parameters of the posterior approximation [25–27]. Expectation propagation for linear regression with a spike-and-slab prior was introduced by Hernández-Lobato et al. [28] (see [29] for a more extensive treatment). We update the t̃_N(w | μ̃_y, Γ̃_y) and t̃_Gamma(σ⁻² | α̃_y, β̃_y) term approximations using VB, and all other terms using EP.

The parameter update steps in the algorithm, to be iterated until convergence, are:

1. p(w | γ) approximation using a parallel EP update.
2. p(y | X, w, σ²) approximation using a VB update.
3. p(f_γ | γ) approximation using a parallel EP update.

The individual terms are updated following the pattern in [25]:

1. Computation of the cavity distribution, q^∖(·) ∝ q(·) / t̃(·). In the natural parametrization, this corresponds to subtracting the parameters of the site approximation from the parameters of the full approximation for the processed model parameter.
2. Minimization of the Kullback–Leibler divergence between the approximation q and the tilted distribution, p̂(·) ∝ p(·) q^∖(·). For the EP update, this is KL[p̂ ‖ q], and for the VB update, KL[q ‖ p̂]. The former corresponds to setting the moments of the sufficient statistics of q to match those of p̂, and the latter has the solution q(·)_new ∝ exp(E_{q_{−·}}[log p̂(·)]), where the expectation is over the approximate posterior of all model parameters other than the one being processed [26, 27].
3. Updating of the parameters of the site approximation, t̃_new ∝ q(·)_new / q^∖(·). This can be thought of as an inverse of step 1, to now obtain the updated site approximation; in the natural parametrization, it is a subtraction of the cavity parameters from the parameters of the new full approximation. We use damping of the updates (the parameters are set to a convex combination of the old parameters and the new parameters computed above) [30].

All of the computations have closed-form solutions.
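For a single Gaussian site, the three steps reduce to additions and subtractions in the natural parameters plus a problem-specific moment-matching step. A minimal sketch; moment_match is a placeholder for the tilted-distribution moment computation of the particular term:

```python
def ep_site_update(r_full, q_full, r_site, q_site, moment_match, damp=0.8):
    """One EP update for a scalar Gaussian site in natural parameters
    (r = precision-adjusted mean, q = precision), following the
    cavity / moment-matching / site-update pattern of Sect. A.3.
    moment_match maps cavity (r_cav, q_cav) to the natural parameters
    of the Gaussian matching the tilted distribution's moments."""
    # 1. cavity: subtract the site from the full approximation
    r_cav, q_cav = r_full - r_site, q_full - q_site
    # 2. moment matching against the tilted distribution
    r_new, q_new = moment_match(r_cav, q_cav)
    # 3. damped site update: subtract the cavity from the new full approx.
    r_site_new = damp * (r_new - r_cav) + (1 - damp) * r_site
    q_site_new = damp * (q_new - q_cav) + (1 - damp) * q_site
    return r_site_new, q_site_new, r_cav + r_site_new, q_cav + q_site_new
```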
B Bayesian Experimental Design

The task is to find the feedback that maximises the expected information gain:

    j* = arg max_{j ∉ F} E_{p(f̃_j | D)} [ Σ_i KL( p(ỹ | D, x_i, f̃_j) ‖ p(ỹ | D, x_i) ) ],        (4)

where F is the set of feedbacks that have already been given (to simplify notation, those are here assumed included in D) and the summation over i goes over the training dataset. The evaluation of the expected information gain is described in the following.

The posterior predictive distribution is approximated as Gaussian:

    p(ỹ | D, x̃) ≈ N(ỹ | x̃ᵀm̄, x̃ᵀΣ̄x̃ + s̄²),        (5)

where s̄² = β̄_σ / ᾱ_σ is the posterior mean approximation of the residual variance. Similarly, the posterior predictive distributions of the feedbacks for the two feedback types follow as approximate Gaussian and Bernoulli distributions:

    p(f̃_{w,j} | D) ≈ N(f̃_{w,j} | m̄_j, Σ̄_jj + ω²),        (6)
    p(f̃_{γ,j} | D) ≈ Bernoulli(f̃_{γ,j} | π ρ̄_j + (1 − π)(1 − ρ̄_j)).        (7)

The information gain between the predictive distributions is

    KL[p(ỹ | D, x̃, f̃_j) ‖ p(ỹ | D, x̃)] = ½ [ log((x̃ᵀΣ̄x̃ + s̄²) / (x̃ᵀΣ̄_f̃x̃ + s̄²_f̃)) + (x̃ᵀΣ̄_f̃x̃ + s̄²_f̃) / (x̃ᵀΣ̄x̃ + s̄²) + (x̃ᵀm̄_f̃ − x̃ᵀm̄)² / (x̃ᵀΣ̄x̃ + s̄²) − 1 ].        (8)

As running the EP algorithm to full convergence would be too costly when evaluating a large number of candidates, we approximate the posterior distribution with the new feedback using partial EP updates. This is similar to the approach of Seeger [31] and Hernández-Lobato et al. [32] for experimental design in the sparse linear model. We consider the two types of feedback separately.

In the case of feedback directly on a regression weight, we add the corresponding site term (which is already of Gaussian form and does not need approximation, as noted above) and do not update the approximations of the other site terms (including assuming s̄²_f̃ = s̄²). The new posterior approximation of w under these assumptions is

    Σ̄_{f̃_{w,j}} = (Σ̄⁻¹ + T e eᵀ)⁻¹,        (9)
    m̄_{f̃_{w,j}} = Σ̄_{f̃_{w,j}} (Σ̄⁻¹ m̄ + h e),        (10)

where e is a vector of zeros except for a 1 at the j-th element, T = 1/ω², and h = f̃_{w,j}/ω². Notably, Σ̄⁻¹ and Σ̄⁻¹m̄ are the precision and the precision-adjusted mean of the posterior approximation without the new feedback and are directly available from the previous EP approximation. The new posterior covariance is independent of the value of the feedback f̃_{w,j}, and it can be efficiently evaluated using the matrix inversion lemma as Σ̄_f̃ = Σ̄ − (T⁻¹ + Σ̄_jj)⁻¹ Σ̄ e eᵀ Σ̄. Furthermore, the expectation over the feedback in the expected information gain affects only the term with the squared difference of the means. This is

    E_{p(f̃_j | D)}[(x̃ᵀm̄_f̃ − x̃ᵀm̄)²] = E_{p(f̃_j | D)}[ ((T⁻¹ + Σ̄_jj)⁻¹ x̃ᵀΣ̄e)² (h/T − m̄_j)² ]        (11)
        = ((T⁻¹ + Σ̄_jj)⁻¹ x̃ᵀΣ̄e)² (Σ̄_jj + ω²),        (12)

where the first equality follows from substituting Equation (10) and using the matrix inversion lemma, and the second from h/T = f̃_{w,j} and the remaining expectation being equal to the variance of the predictive distribution of the feedback.

In the case of relevance feedback, we add the corresponding site term for the feedback and run a single EP update on it and on the corresponding prior term p(w_j | γ_j). These updates are purely scalar operations and do not require any costly matrix operations. Other site term approximations are not updated.
The new posterior approximation of w under these assumptions is of the same form,

    Σ̄_{f̃_{γ,j}} = (Σ̄⁻¹ + T e eᵀ)⁻¹,        (13)
    m̄_{f̃_{γ,j}} = Σ̄_{f̃_{γ,j}} (Σ̄⁻¹ m̄ + h e),        (14)

where now T = [Σ̄⁻¹_{f̃_{γ,j}}]_jj − [Σ̄⁻¹]_jj and h = [Σ̄⁻¹_{f̃_{γ,j}} m̄_{f̃_{γ,j}}]_j − [Σ̄⁻¹ m̄]_j. That is, T and h are the changes in the precision and the precision-adjusted mean of the j-th feature, and these are available with cheap scalar operations. The expectation over the value of the feedback in the expected information gain is in this case a sum of two terms, and we evaluate both terms separately using the above scheme. Again, we use the matrix inversion lemma to avoid full matrix inversions in computing the new posterior covariance.
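Both feedback types thus lead to the same rank-one form: given the change (T, h) in the j-th precision and precision-adjusted mean, the new covariance and mean follow from the matrix inversion lemma without any full matrix inverse. A sketch with our naming:

```python
import numpy as np

def rank_one_posterior_update(Sigma, m, j, T, h):
    """Posterior N(m, Sigma) after adding a site with precision T and
    precision-adjusted mean h on coordinate j (Eqs. 9-10 and 13-14),
    via the matrix inversion lemma."""
    c = Sigma[:, j]
    Sigma_new = Sigma - np.outer(c, c) / (1.0 / T + Sigma[j, j])
    m_new = m + c * (h - T * m[j]) / (1.0 + T * Sigma[j, j])
    return Sigma_new, m_new
```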
C Additional Experiments

C.1 Synthetic data
For the synthetic experiments with simulated data, we continue the study of the behaviour of our algorithm through additional experiments and visualisations. The setting stays the same as in Sect. 5.1, except for the specifications below.

C.1.1 Heatmaps with varying number of training data
We now study the performance when the number of training data points varies from 1 to 50 (since we consider in particular small-sample settings). The dimensionality is fixed to 100, and the number of relevant features is 10.
[Figure 5: heatmaps of MSE over the number of training data points (5, …, 49) and the number of expert feedbacks (up to 50), for Random and Sequential Experimental Design; panels: (a) feedback on coefficients' values, (b) feedback on coefficients' relevances.] Figure 5: Mean squared errors with increasing number of training data. The number of relevant coefficients is m* = 10 and the number of dimensions is m = 100. The MSE values are averages over 100 independent runs.

Fig. 5 illustrates the behaviour of our strategy and that of the random feature selection, for the previously described synthetic data setting with a fixed dimension m = 100 and with increasing numbers of training data points n = 5, …,
50. For very small sample sizes n, the differences between the strategies are largest, while for n > 30 both strategies have a much smaller MSE.
C.1.2 Sequential vs Non-sequential Experimental Design
For a simple setting with simulated data, we now study the difference between our method and its non-sequential version, for the two feedback models discussed previously: user feedback on the coefficients and on their relevance. The non-sequential version chooses the sequence of features to be queried before observing any expert feedback. We note that the behaviour and ranking of the query algorithms remain similar to those observed in the previous plots. In Fig. 6, we consider a "small n, large p" scenario, with n = 10, m = 100, m* = 10, and we report the average MSE value over 500 runs.

[Figure 6: MSE as a function of the number of expert feedbacks; curves: Random; First relevant features, then non-relevant; Sequential Experimental Design; Non-sequential Experimental Design; panels: (a) feedback on coefficients' values, (b) feedback on coefficients' relevances.] Figure 6: MSE for all query algorithms, with simulated data, for feedback on coefficient values and relevance. Note that the red strategy is not available in practice.

The results in the plots are shown for an increasing number of feedbacks, up to the number of dimensions, at which point all methods converge. However, if we consider the plausible scenario where the number of user interactions is limited, one can notice that, compared to the other methods, both experimental design methods show a sharper decrease in prediction loss already in the first iterations. This reflects the fact that both experimental design strategies manage to identify and ask with priority about the most informative coefficients. This is more evident for the feedback model on coefficient relevance (Fig. 6(b)), where the performance of the two experimental design strategies is very close to the strategy that first suggests only relevant features. However, one can also notice an improved performance for the sequential version of the experimental design strategy. Indeed, the more carefully selected sequence of queries made by the sequential experimental design strategy manages to reduce the prediction error faster, compared to the non-sequential selection, where the observed expert feedback is not taken into account. Also, as expected, the difference between the sequential and non-sequential experimental designs is more significant in the case of the stronger feedback model on coefficient values (Fig. 6(a)).
C.2 Comparison of Training and Test Set Errors and the Average Accumulated Suggestion Behaviour
We can get some insight into the behaviour of the approach by comparing the training and test set errors shown in Fig. 6(a) and Fig. 7(a) for the simulated data scenario described in the previous section, with feedback on the coefficient values. The training set error begins to increase as a function of the number of expert feedbacks. This happens because the model without any feedbacks has exhausted the information in the training data (to the extent allowed by the regularizing priors) and fits the training data well. The user feedback, however, moves the model away from the training data optimum and towards better generalization performance. Indeed, the MSE curves for the training and test errors converge close to each other as the number of feedbacks increases. Moreover, the convergence is faster for the query algorithms that start by suggesting the features with non-zero effects, implying that these are more informative (Fig. 7(b)).
[Figure 7: panels: (a) MSE on training data as a function of the number of expert feedbacks; (b) accumulated average suggestion (fractions of zero and non-zero features suggested); curves: Random; First relevant features, then non-relevant; Sequential Experimental Design; Non-sequential Experimental Design.] Figure 7: MSE on the training data and accumulated average suggestion behaviour for all query algorithms, with simulated data, for the case where feedback is on coefficient values.

C.3 Expert Knowledge Elicitation vs. Collecting More Samples

[Table 3: columns MSE | More Samples: Random, Active [31] | More Feedback: Random, SeqExpDes; rows at MSE levels 1.95, 1.925, 1.9, and 1.875; several entries exceed 200 (reported as > 200).] Table 3: Number of samples/feedbacks needed to reach a particular MSE level in the Amazon dataset. The results are averages over 100 independent runs.

C.4 User Study

We complement the analysis of the results of the user study with two illustrations. First, to compare the convergence speed of the different methods, we normalised the MSE improvements at each iteration by the amount of total improvement obtained by each of the users when considering all their individual feedback. Figure 8(a) depicts the convergence speed of the methods based on this measure. As can be seen from the figure, for all participants, the proposed method was able to capture most of the participants' knowledge with a small budget of feedback queries (stabilizing at around 200 out of the total 824 features in the considered subset of Amazon data).

Then, in Figure 8(b), we show the average percentage of relevant words that were asked from the participants at each iteration. It is evident from the figure that the proposed algorithm started by mostly asking about the limited set of relevant words. The relevant words were identified by considering all the data in the Amazon dataset, training a spike-and-slab model, and then choosing the words with E[γ_j] > 0.

[Figure 8: user study results for the 10 participants on the Amazon data; curves: Random, Sequential Experimental Design; panels: (a) percentage of improvement in MSE, (b) accumulated average suggestion (not-relevant or uncertain vs. relevant keywords).] Figure 8: User study results: MSE for 10 participants, Amazon data.

[Figure: mean of the posterior of the coefficients for frequent keywords such as good, love, well, great, months, easy, broke, disappointed, excellent, perfect, waste, poor, and not worth.]