Active embedding search via noisy paired comparisons
Gregory H. Canal, Andrew K. Massimino, Mark A. Davenport, Christopher J. Rozell
AActive embedding search via noisy paired comparisons
Gregory H. Canal, Andrew K. Massimino, Mark A. Davenport, and Christopher J. RozellMay 27, 2019
Abstract
Suppose that we wish to estimate a user’s preference vector w from paired comparisons of the form“does user w prefer item p or item q ?,” where both the user and items are embedded in a low-dimensionalEuclidean space with distances that reflect user and item similarities. Such observations arise in numeroussettings, including psychometrics and psychology experiments, search tasks, advertising, and recommendersystems. In such tasks, queries can be extremely costly and subject to varying levels of response noise;thus, we aim to actively choose pairs that are most informative given the results of previous comparisons.We provide new theoretical insights into the benefits and challenges of greedy information maximizationin this setting, and develop two novel strategies that maximize lower bounds on information gain and aresimpler to analyze and compute respectively. We use simulated responses from a real-world dataset tovalidate our strategies through their similar performance to greedy information maximization, and theirsuperior preference estimation over state-of-the-art selection methods as well as random queries. We consider the task of user preference learning, where we have a set of items (e.g., movies, music, or food)embedded in a Euclidean space and aim to represent the preferences of a user as a continuous point in thesame space (rather than simply a rank ordering over the items) so that their preference point is close to itemsthe user likes and far from items the user dislikes. To estimate this point, we consider a system using the method of paired comparisons , where during a sequence of interactions a user chooses which of two given itemsthey prefer [1]. For instance, to characterize a person’s taste in food, we might ask them which one of twodishes they would rather eat for a number of different pairs of dishes. The recovered preference point can beused in various tasks, for instance in the recommendation of nearby items, personalized product creation, orclustering of users with similar preferences. We refer to the entire process of querying via paired comparisonsand continuous preference point estimation as pairwise search , and note that this is distinct from the problemof searching for a single discrete item in the fixed dataset. A key goal of ours is to actively choose the itemsin each query and demonstrate the advantage over non-adaptive selection.More specifically, given N items, there are O ( N ) possible paired comparisons. Querying all such pairs isnot only prohibitively expensive for large datasets, but also unnecessary since not all queries are informative;some queries are rendered obvious by the accumulation of evidence about the user’s preference point, whileothers are considered ambiguous due to noise in the comparison process. Given these considerations, themain contribution of this work is the design and analysis of two new query selection algorithms for pairwisesearch that select the most informative pairs by directly modeling redundancy and noise in user responses .While previous active algorithms have been designed for related paired comparison models, none directlyaccount for probabilistic user behavior as we do here. To the best of our knowledge our work is the firstattempt to search a low-dimensional embedding for a continuous point via paired comparisons while directlymodeling noisy responses.Our approach builds upon the popular technique in active learning and Bayesian experimental design ofgreedily maximizing information gain [2, 3, 4]. In our setting, this corresponds to selecting pairs that maximizethe mutual information between the user’s response and the unknown location of their preference point. Weprovide new theoretical and computational insights into relationships between information gain maximization
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, 30332 USA(e-mail: {gregory.canal,massimino,mdav,crozell}@gatech.edu) a r X i v : . [ s t a t . M L ] M a y nd estimation error minimization in pairwise search, and present a lower bound on the estimation errorachievable by any query strategy.Due to the known difficulty of analyzing greedy information gain maximization [5] and the high com-putational cost of estimating mutual information for each pair in a pool, we propose two strategies thateach maximize new lower bounds on information gain and are simpler to analyze and compute respectively.We present upper and lower bounds on the performance of our first strategy, which then motivates theuse of our second, computationally cheaper strategy. We then demonstrate through simulations using areal-world dataset how both strategies perform comparably to information maximization while outperformingstate-of-the-art techniques and randomly selected queries. Our goal in this paper is to estimate a user’s preference point (denoted as vector w ) with respect to a givenlow-dimensional embedding of items constructed such that distances between items are consistent with itemsimilarities, where similar items are close together and dissimilar items are far apart. While many items(e.g., images) exist in their raw form in a high-dimensional space (e.g., pixel space), this low-dimensionalrepresentation of items and user preferences offers the advantage of simple Euclidean relationships that directlycapture notions of preference and similarity, as well as mitigating the effects of the curse of dimensionality inestimating preferences. Specifically, we suppose user preferences can be captured via an ideal point model inwhich each item and user is represented using a common set of parameters in R d , and that a user’s overallpreference for a particular item decreases with the distance between that item and the user’s ideal point w [6]. This means that any item placed exactly at the user would be considered “ideal” and would be the mostpreferred over all other items. Although this model can be applied to the situation where a particular item issought, in general we do not assume the user point w to be co-located with any item.The embedding of the items can be constructed through a training set of triplet comparisons (pairedcomparisons regarding similarity of two items to a third reference item) using one of several standardnon-metric embedding techniques such as the Crowd Kernel Learning [7] or Stochastic Triplet Embeddingmethods [8]. In this study, we assume that such an embedding is given, presumably acquired through alarge set of crowdsourced training triplet comparisons. We do not consider this training set to be part ofthe learning cost in measuring a pairwise search algorithm’s efficiency, since our focus here is on efficientlychoosing paired comparisons to search an existing embedding.In this work, we assume a noisy ideal point model where the probability of a user located at w choosingitem p over item q in a paired comparison is modeled using P ( p ≺ q ) = f ( k pq ( (cid:107) w − q (cid:107) − (cid:107) w − p (cid:107) )) , (1)where p ≺ q denotes “item p is preferred to item q ,” f ( x ) = 1 / (1+ e − x ) is the logistic function, and k pq ∈ [0 , ∞ ) is the pair’s noise constant , which represents roughly the signal-to-noise ratio of a particular query and maydepend on the values of p and q . This type of logistic noise model is common in psychometrics literature andbears similarity to the Bradley–Terry model [9].Note that (1) can also be written as P ( p ≺ q ) = f ( k pq ( a T w − b )) , where a = 2( p − q ) and b = (cid:107) p (cid:107) − (cid:107) q (cid:107) encode the normal vector and threshold of a hyperplane bisectingitems p and q . After a number of such queries, the response model in (1) for each query can be multiplied toform a posterior belief about the location of w , as depicted in Figure 1.Note that we allow the noise constant k pq to differ for each item pair to allow for differing user behaviordepending on the geometry of the items being compared. When k pq → ∞ , this supposes a user’s selection ismade with complete certainty and cannot be erroneous. Conversely, k pq = 0 corresponds to choosing itemsrandomly with probability / . Varying k pq allows for differing reliability when items are far apart versuswhen they are close together. 2ome concrete examples for setting this parameter are:constant : k (1) pq = k , (K1)normalized : k (2) pq = k (cid:107) a (cid:107) − = 12 k (cid:107) ( p − q ) (cid:107) − , (K2)decaying : k (3) pq = k exp( −(cid:107) a (cid:107) )= k exp( − (cid:107) ( p − q ) (cid:107) ) . (K3)Figure 1: Paired comparisons between items can be thought of as a set of noisy hyperplane queries. In thehigh-fidelity case, this uniquely identifies a convex region of R d . In general, we have a posterior distributionwhich only approximates the shape of the ideal cell around the true user point, depicted with an x. There is a rich literature investigating statistical inference from paired comparisons and related ordinal querytypes. However, many of these works target a different problem than considered here, such as constructing itemembeddings [7], training classifiers [10], selecting arms of bandits [11], and learning rankings [12, 13, 14, 15, 16]or scores [17, 18] over items.Paired comparisons have also been used for learning user preferences: [19] models user preferences as avector, but preferences are modeled as linear weightings of item features rather than by relative distancesbetween the user and items in an embedding, resulting in a significantly different model (e.g., monotonic) ofpreference. [20] considers the task of actively estimating the maximizer of an unknown preference functionover items, while [21] and [22] actively approximate the preference function itself, the former study notablyusing information gain as a metric for selecting queries. Yet, these approaches are not directly comparableto our methods since they do not consider a setting where user points are assigned within an existinglow-dimensional item embedding. [7] does consider the same item embedding structure as our setting andactively chooses paired comparisons that maximize information gain for search, but only seeks discrete items within a fixed dataset rather than estimating a continuous preference vector as we do here. Furthermore weprovide novel insights into selecting pairs via information gain maximization, and mainly treat informationgain for pairwise search as a baseline in this work since our primary focus is instead on the development,analysis, and evaluation of alternative strategies inspired by this approach.The most directly relevant prior work to our setting consists of the theory and algorithms developed in[23] and [24]. In [23], item pairs are selected in stages to approximate a Gaussian point cloud that surroundsthe current user point estimate and dyadically shrinks in size with each new stage. In [24], previous queryresponses define a convex polytope in d dimensions (as in Figure 1), and their algorithm only selects querieswhose bisecting hyperplanes intersect this feasible region. While this algorithm in its original form onlyproduces a rank ordering over the embedding items, for the sake of a baseline comparison we extend it toproduce a preference point estimate from the feasible region. Neither of these studies fundamentally modelsor handles noise in their active selection algorithms; slack variables are used in the user point estimation of[23] to allow for contradicting query responses, but the presence of noise is not considered when selectingqueries. In an attempt to filter non-persistent noise (the type encountered in our work), [24] simply repeateach query multiple times and take a majority vote as the user response, but the items in the query pairare still selected using the same method as in the noiseless setting. Nevertheless, these methods provide animportant baseline. 3 Query selection
We now proceed to describe the pair selection problem in detail along with various theoretical and computa-tional considerations. We show that the goal of selecting pairwise queries to minimize estimation error leadsnaturally to the strategy of information maximization and subsequently to the development of our two novelselection strategies.
Let W ∈ R d ( d ≥ denote a random vector encoding the user’s preference point, assumed for the sake ofanalysis to be drawn from a uniform distribution over the hypercube [ − , ] d denoted by the prior density of p ( w ) . Unless noted otherwise, we denote random variables with uppercase letters, and specific realizationswith lowercase letters. Let Y i ∈ { , } denote the binary response to the i th paired comparison involving items p i and q i , with Y i = 0 indicating a preference for p i and Y i = 1 a preference for q i . After i queries, we havethe vector of responses Y i = { Y , Y , . . . Y i } . We assume that each response Y i is conditionally independentfrom previous responses Y i − when conditioned on preference W . Applying this assumption in conjunctionwith a recursive application of Bayes’ rule, after i queries we have a posterior density of p i ( w ) ≡ p ( w | Y i ) = p ( w ) (cid:81) ij =1 p ( Y j | w ) p ( Y i | Y i − ) (2)where p ( Y i | w ) is given by the model in (1). This logistic likelihood belongs to the class of log-concave (LCC)distributions, whose probability density functions f ( w ) satisfy f ( αw + (1 − α ) w ) ≥ f ( w ) α f ( w ) − α forany w , w ∈ R d and ≤ α ≤ . Since p ( w ) is log-concave and products of log-concave functions are alsolog-concave [25], we have that the posterior density given in (2) is log-concave.Suppose that after i queries, the posterior p i ( w ) is used to produce a Bayesian user point estimate (cid:99) W i .We denote the mean squared error for this estimate by MSE i = E W | Y i [ (cid:107) W − (cid:99) W i (cid:107) ] , which provides a directmeasure of our estimation error and is a quantity we wish to minimize by adaptively selecting queries basedon previous responses. One approach might be to greedily select an item pair such that MSE i +1 is minimizedin expectation after the user responds. However, this would require both updating the posterior distributionand estimating MSE i +1 for each possible response over all item pairs. This would be very computationallyexpensive since under our model there is no closed-form solution for MSE i +1 , and so each such evaluationrequires a “lookahead” batch of Monte Carlo samples from the posterior. Specifically, if S posterior samplesare generated for each MSE i +1 evaluation over a candidate pool of M pairs at a computational cost of C per sample generation, and MSE i +1 is estimated with O ( dS ) operations per pair, this strategy requires O (( C + d ) SM ) computations to select each query. This is undesirable for adaptive querying settings wheretypically data sets are large (resulting in a large number of candidate pairwise queries) and queries need tobe selected in or close to real-time.Instead, consider the covariance matrix of the user point posterior after i queries, denoted as Σ W | Y i = E [( W − E [ W | Y i ])( W − E [ W | Y i ]) T | Y i ] . For the minimum mean squared error (MMSE) estimator, given by the posterior mean (cid:99) W i = E [ W | Y i ] , wehave MSE i = Tr(Σ W | Y i ) ≥ d | Σ W | Y i | d where the last inequality follows from the arithmetic-geometric mean inequality (AM–GM) [26]. This impliesthat a necessary condition for a low MSE is for the posterior volume , defined here as the determinant of theposterior covariance matrix, to also be low. Unfortunately, actively selecting queries that greedily minimizeposterior volume is too computationally expensive to be useful in practice since this also requires a set of“lookahead” posterior samples for each candidate pair and possible response, resulting in a computationalcomplexity of O ((( C + d ) S + d ) M ) to select each query from the combined cost per pair of generatingsamples ( O ( CS ) ), estimating Σ W | Y i ( O ( d S ) ), and calculating | Σ W | Y i | ( O ( d ) ).4 .2 Information theoretic framework By utilizing statistical tools from information theory, we can select queries that approximately minimizeposterior volume (and hence tend to encourage low MSE) at a more reasonable computationally cost.Furthermore, an information theoretic approach provides convenient analytical tools which we use to provideperformance guarantees for the query selection methods we present.Towards this end, we define the posterior entropy as the differential entropy of the posterior distributionafter i queries: h i ( W ) ≡ h ( W | y i ) = − (cid:90) w p i ( w ) log ( p i ( w )) dw. (3)As we show in the following lemma, the posterior entropy of LCC distributions is both upper and lowerbounded by a monotonically increasing function of posterior volume, implying that low posterior entropy isboth necessary and sufficient for low posterior volume, and hence a necessary condition for low MSE. Theproofs of this lemma and subsequent results are provided in the supplementary material. Lemma 3.1.
For a LCC posterior distribution p ( w | Y i ) in d ≥ dimensions, where c d = e d / (4 √ d + 2)) , d | Σ W | Y i | d e c d ≤ h i ( W ) ≤ d (2 πe | Σ W | Y i | d ) . This relationship between MSE, posterior volume, and posterior entropy suggests a strategy of selectingqueries that minimize the posterior entropy after each query. Since the actual user response is unknownat the time of query selection, we seek to minimize the expected posterior entropy after a response ismade, i.e., E Y i +1 [ h i +1 ( W ) | y i ] . Using a standard result from information theory, we have E Y i [ h i ( W ) | y i − ] = h i ( W ) − I ( W ; Y i | y i − ) , where I ( W ; Y i | y i − ) is the mutual information between the location of the unknownuser point and the user response, conditioned on previous responses [27]. Examining this identity, we observethat selecting queries that minimize the expected posterior entropy is equivalent to selecting queries thatmaximize the mutual information between the user point and the user response, referred to here as the information gain .In this setting, it is generally difficult to obtain sharp performance bounds for query selection viainformation gain maximization. Instead, we use information theoretic tools along with Lemma 3.1 to providea lower bound on MSE for any estimator and query selection scheme in a manner similar to [28] and [27]: Theorem 3.2.
For any user point estimate given by (cid:99) W i after i queries, the MSE (averaged over user pointsand query responses) for any selection strategy is bounded by E W,Y i (cid:107) W − (cid:99) W i (cid:107) ≥ d − id πe . This result implies that the best rate of decrease in MSE one can hope for is exponential in the numberof queries and slows down in a matter inversely proportional to the dimension, indicating quicker possiblepreference convergence in settings with lower dimensional embeddings. To estimate the information gain of aquery, we can use the symmetry of mutual information to write I ( W ; Y i | y i − ) = H ( Y i | y i − ) − H ( Y i | W, y i − ) (4) H ( Y i | y i − ) = − (cid:88) Y i ∈{ , } p ( Y i | y i − ) log p ( Y i | y i − ) (5) H ( Y i | w, y i − ) = − (cid:88) Y i ∈{ , } p ( Y i | w ) log p ( Y i | w ) (6) H ( Y i | W, y i − ) = E W | y i − [ H ( Y i | W, y i − )] . (7)Unlike the greedy MSE and posterior volume minimization strategies, information gain estimation onlyrequires a single batch of posterior samples at each round of query selection, which is used to estimate the5iscrete entropy quantities in (4)–(7). (4) can be estimated in O ( dS ) operations per pair, resulting in acomputational cost of O ( dSM ) for selecting each query, which although more computationally feasible thanthe methods proposed so far is still likely prohibitive for highly accurate information gain estimates over alarge pool of candidate pairs.Because of these analytical and computational challenges, we develop two strategies that mimic the actionof maximizing information gain while being more analytically and computationally tractable, respectively. Inthe next section we present our first strategy, which we analyze for more refined upper and lower boundson the number of queries needed to shrink the posterior to a desired volume. Then we introduce a secondstrategy which benefits from reduced computational complexity while still remaining theoretically coupled tomaximizing information gain. In developing an approximation for information gain maximization, consider the scenario where arbitrary pairs of items can be generated (unconstrained to a given dataset), resulting in a bisecting hyperplaneparameterized by ( a i , b i ) . In practice, such queries might correspond to the generation of synthetic items viatools such as generative adversarial networks [29]. With this freedom, we could consider an equiprobable querystrategy where b i is selected so that each item in the query will be chosen by the user with probability .This strategy is motivated by the fact that the information gain of query i is upper bounded by H ( Y i | y i − ) ,which is maximized if and only if the response probability is equiprobable [27].To motivate the selection of query hyperplane directions, we define a query’s projected variance , denoted as σ i , as the variance of the posterior marginal in the direction of a query’s hyperplane, i.e., σ i = a Ti Σ W | y i − a i .This corresponds to a measure of how far away the user point is from the hyperplane query, in expectationover the posterior distribution. With this notation, we have the following lower bound on information gainfor equiprobable queries. Proposition 3.3.
For any “equiprobable” query scheme with noise constant k i and projected variance σ i ,for any choice of constant ≤ c ≤ we have I ( W ; Y i | y i − ) ≥ (cid:18) − h b (cid:16) f (cid:16) ck i σ i (cid:17)(cid:17)(cid:19) (1 − c ) =: L c,k i ( σ i ) where h b ( p ) = − p log p − (1 − p ) log (1 − p ) . This lower bound is monotonically increasing with k i σ i and achieves a maximum information gain of 1bit at k i → ∞ and/or σ i → ∞ (with an appropriate choice of c ). This suggests choosing a i that maximizeprojected variance in addition to selecting b i according to the equiprobable strategy. Together, we refer to theselection of equiprobable queries in the direction of largest projected variance as the equiprobable-max-variancescheme, or EPMV for short.Our primary result concerns the expected number of comparisons (or query complexity) sufficient toreduce the posterior volume below a specified threshold set a priori, using EPMV. Theorem 3.4.
For the EPMV query scheme with each selected query satisfying k i (cid:107) a i (cid:107) ≥ k min for someconstant k min > , consider the stopping time T ε = min { i : | Σ W | y i | d < ε } for stopping threshold ε > . For τ = d log ( πeε ) and τ = d log e c d ε , we have τ ≤ E [ T ε ] ≤ τ + τ + 1 l ( τ ) − l ( τ ) (cid:90) τ l ( x ) dx where l ( x ) = L c,k min (cid:16) − xd √ πe (cid:17) for any constant ≤ c ≤ as defined in Proposition 3.3. Furthermore, the lowerbound is true for any query selection scheme. This result follows from a martingale stopping-time analysis of the entropy at each query. Our nexttheorem presents a looser upper bound, but is more easily interpretable.6 heorem 3.5.
The EPMV scheme, under the same assumptions as in Theorem 3.4, satisfies E [ T ε ] = O (cid:18) d log 1 ε + (cid:18) εk min (cid:19) d log 1 ε (cid:19) . Furthermore, for any query scheme, E [ T ε ] = Ω (cid:0) d log ε (cid:1) . This result has a favorable dependence on the dimension d , and the upper bound can be interpreted as ablend between two rates, one of which matches that of the generic lower bound. The second term in the upperbound provides some evidence that our ability to recover w worsens as k min decreases. This is intuitivelyunsurprising since small k min corresponds to the case where queries are very noisy. We hypothesize that theabsence of such a penalty term in the lower bound is an artifact of our analysis, since increasing noise levels(i.e., decreasing k min ) should limit achievable performance by any querying strategy. On the other hand, forasymptotically large k i , we have the following corollary: Corollary 3.1.
In the noiseless setting ( k min → ∞ ), EPMV has optimal expected stopping time complexityfor posterior volume stopping.Proof. When k min → ∞ , from Theorem 3.5 E [ T ε ] = O (cid:0) d log ε (cid:1) ; for any scheme, E [ T ε ] = Ω (cid:0) d log ε (cid:1) .Taken together, these results suggest that EPMV is optimal with respect to posterior volume minimizationup to a penalty term which decreases to zero for large noise constants. While low posterior volume is only anecessary condition for low MSE, this result could be strengthened to an upper bound on MSE by boundingthe condition number of the posterior covariance matrix, which is left to future work. Yet, as we empiricallydemonstrate in Section 4, in practice our methods are very successful in reducing MSE.While EPMV was derived under the assumption of arbitrary hyperplane queries, depending on theapplication we may have to select a pair from a fixed pool of items in a given dataset. For this purpose wepropose a metric for any candidate pair that, when maximized over all pairs in a pool, approximates thebehavior of EPMV. For a pair with items p and q in item pool P , let a pq = 2( p − q ) and b pq = (cid:107) p (cid:107) − (cid:107) q (cid:107) denote the weights and threshold parameterizing the bisecting hyperplane. We choose a pair that maximizesthe utility function (for some λ > ) η ( p, q ; λ ) = k pq (cid:113) a Tpq Σ W | Y i − a pq − λ (cid:12)(cid:12)(cid:12)(cid:98) p − (cid:12)(cid:12)(cid:12) (8) (cid:98) p = P ( Y i = 1 | Y i − ) = E W | Y i − [ f ( k pq ( a Tpq W − b pq ))] . This has the effect of selecting queries which are close to equiprobable and align with the direction of largestvariance, weighted by k pq to prefer higher fidelity queries. While Σ W | Y i − can be estimated once from abatch of posterior samples, (cid:98) p must be estimated for each candidate pair in O ( dS ) operations, resulting in acomputational cost of O ( dSM ) which is on the same order as directly maximizing information gain. For thisreason, we develop a second strategy that approximates EPMV while significantly reducing the computationalcost. Our second strategy is a mean-cut strategy where b i is selected such that the query hyperplane passes throughthe posterior mean, i.e. a Ti E [ W | Y i − ] − b i = 0 . For such a strategy, we have the following proposition: Proposition 3.6.
For mean-cut queries with noise constant k i and projected variance σ i we have (cid:12)(cid:12)(cid:12) p ( Y i | y i − ) − (cid:12)(cid:12)(cid:12) ≤ e − e + ln 2 k i σ i and, I ( W ; Y i | y i − ) ≥ h b (cid:32) e − ln 2 k i σ i (cid:33) − π (log e )3 k i σ i . lgorithm 1: Pairwise search with noisy comparisons
Input: item set X , parameters S , β , λ P ← set of all pairwise queries from items in X (cid:102) W , µ , Σ ← initialize from samples of prior for i = 1 to T do P β ← uniformly downsample P at rate < β ≤ InfoGain: p i , q i ← arg max p,q ∈P β η ( p, q ; (cid:102) W i − ) EPMV: p i , q i ← arg max p,q ∈P β η ( p, q ; λ, (cid:102) W i − ) MCMV: p i , q i ← arg max p,q ∈P β η ( p, q ; λ, µ i − , Σ i − ) y i ← PairedComparison ( p i , q i ) , y i ← y i ∪ y i − . (cid:102) W i ← batch of S samples from posterior W | Y i µ i , Σ i ← Mean( (cid:102) W i ) , Covariance( (cid:102) W i ) (cid:99) W i ← µ i end forOutput: user point estimate (cid:99) W T For large projected variances, we observe that | p ( Y i | y i − ) − | (cid:47) . , suggesting that mean-cut queriesare somewhat of an approximation to equiprobable queries in this setting. Furthermore, notice that the lowerbound to information gain in Proposition 3.6 is a monotonically increasing function of the projected variance.As σ i → ∞ , this bound approaches h b (1 /e ) ≈ . which is nearly sharp since a query’s information gainis upper bounded by 1 bit. This implies some correspondence between maximizing a query’s informationgain and maximizing the projected variance, as was the case in EPMV. Hence, our second strategy selects mean-cut, maximum variance queries (referred to as MCMV) and serves as an approximation to EPMV whilestill maximizing a lower bound on information gain.For implementing MCMV over a fixed pool of pairs (rather than arbitrary hyperplanes), we calculate theorthogonal distance of each pair’s hyperplane to the posterior mean as | a Tpq E [ W | Y i − ] − b pq | / (cid:107) a pq (cid:107) and theprojected variance as a Tpq Σ W | Y i − a pq . We choose a pair that maximizes the following function which is atradeoff (tuned by λ > ) between minimizing distance to the posterior mean, maximizing noise constant,and maximizing projected variance: η ( p, q ; λ ) = k pq (cid:113) a Tpq Σ W | Y i − a pq − λ | a Tpq E [ W | Y i − ] − b pq |(cid:107) a pq (cid:107) . (9)This strategy is attractive from a computational standpoint since the posterior mean E [ W | Y i − ] and covariance Σ W | Y i − can be estimated once in O ( d S ) computations, and subsequent calculation of the hyperplane distancefrom mean and projected variance requires only O ( d ) computations per pair. Overall, this implementationof the MCMV strategy has a computational complexity of O ( d ( S + M )) , which scales more favorably thanboth the information gain maximization and EPMV strategies.We unify the information gain (referred to as InfoGain), EPMV, and MCMV query selection methods undera single framework described in Algorithm 1. At each round of querying, a pair is selected that maximizes autility function η ( p, q ) over a randomly downsampled pool of candidates pairs, with η ( p, q ) ≡ I ( W ; Y i | y i − ) for InfoGain and η from (8) and η from (9) denoting the utility functions of EPMV and MCMV, respectively.We include a batch of posterior samples denoted by (cid:102) W as an input to η and η to emphasize their dependenceon posterior sampling, and add mean and covariance inputs to η since once these are estimated, MCMVrequires no additional samples to select pairs. For all methods, we estimate the user point as the mean of thesample batch since this is the MMSE estimator. 8 Results
To evaluate our approach, we constructed a realistic embedding (from a set of training user-response triplets)consisting of multidimensional item points and simulated our pairwise search methods over randomly generatedpreference points and user responses . We constructed an item embedding of the Yummly Food-10k datasetof [30, 31], consisting of 958,479 publicly available triplet comparisons assessing relative similarity among10,000 food items. The item coordinates are derived from the crowdsourced triplets using the popularprobabilistic multidimensional scaling algorithm of [7] and the implementation obtained from the NEXTproject . We compare InfoGain, EPMV, and MCMV as described in Algorithm 1 against several baseline methods:
Random: pairs are selected uniformly at random and user preferences are estimated as the posterior mean.
GaussCloud- Q : pairs are chosen to approximate a Gaussian point cloud around the preference estimatethat shrinks dyadically over Q stages, as detailed in [23]. ActRank- Q : pairs are selected that intersect a feasible region of preference points and queried Q times; amajority vote is then taken to determine a single response, which is used with the pair hyperplane to furtherconstrain the feasible set [24]. Since the original goal of the algorithm was to rank embedding items ratherthan estimate a continuous preference point, it does not include a preference estimation procedure; in ourimplementation we estimate user preference as the Chebyshev center of the feasible region since it is thedeepest point in the set and is simple to compute [26].In each simulation trial, we generate a point W uniformly at random from the hypercube [ − , d andcollect paired comparisons using the item points in our embedding according to the methods described above.The response probability of each observation follows (1) (referred to herein as the “logistic” model), usingeach of the three schemes for choosing k pq described in (K1)–(K3). In each scheme we optimized the value of k over the set of training triplets via maximum-likelihood estimation according to the logistic model. Weuse the Stan Modeling Language [32] to generate posterior samples when required, since our model is LCCand therefore is particularly amenable to Markov chain Monte Carlo methods [33].Note that unlike GaussCloud- Q and ActRank- Q , the Random, InfoGain, EPMV, and MCMV methodsdirectly exploit a user response model in the selection of pairs and estimation of preference points, which canbe advantageous when a good model of user responses is available. Below we empirically test each method inthis matched scenario, where the noise type (logistic) and the model for k pq (e.g., “constant”, “normalized”, or“decaying”) are revealed to the algorithms. We also test a mismatched scenario by generating response noiseaccording to a non-logistic response model while the methods above continue to calculate the posterior as ifthe responses were logistic. Specifically, we generate responses according to a “Gaussian” model y i = sign( k pq ( a Ti w − b i ) + Z ) Z ∼ N (0 , where k and the model for k pq are selected using maximum-likelihood estimation on the training triplets. The left column of Figure 2 plots the MSE of each method’s estimate with respect to the ground truth locationover the course of a pairwise search run. In the matched model case of Figure 2a, our strategies outperformRandom, ActRank- Q , and GaussCloud- Q for multiple values of Q by a substantial margin. Furthermore, bothof our strategies performed similarity to InfoGain, corroborating their design as information maximizationapproximations. Note that Random outperforms the other baseline methods, supporting the use of Bayesianestimation in this setting (separately from the task of active query selection). Although mismatched noiseresults in decreased performance overall in Figure 2c, the same relative trends between the methods as inFigure 2a are evident. Code available at https://github.com/siplab-gt/pairsearch http://nextml.org a) Estimation error: matchedlogistic noise, d = 4 (b) Ranking performance: matchedlogistic noise, d = 4 (c) Estimation error: mismatchedGaussian noise, d = 4 (d) Ranking performance: mismatchedGaussian noise, d = 4 Figure 2: Performance evaluation over 80 simulated search queries, averaged over 50 trials per method andplotted with ± one standard error. (Left Column) MSE. (Right Column) for each trial, a batch of 15 itemswas uniformly sampled without replacement from the dataset, and the normalized Kendall’s Tau distance(lower distance is better) was calculated between a ranking of these items by distance to the ground truthpreference point and a ranking by distance to the estimated point. To get an unbiased estimate, this metricwas averaged over 1000 batches per trial, and error bars calculated with respect to the number of trials. (TopRow) “normalized” logistic model with matching noise in d = 4. (Bottom Row) “decaying” logistic model withmismatched Gaussian “normalized” noise in d = 4. Additional plots testing a wider selection of parametersare available in the supplement. Overall, our new strategies (EPMV, MCMV) outperform existing methodsand also perform comparably to information gain maximization (InfoGain), which they were designed toapproximate. 10 .3 Item ranking evaluation We also consider each method’s performance with respect to ranking embedding items in relation to apreference point. For each trial, a random set of 15 items is sampled from the embedding without replacementand ranked according to their distance to a user point estimate. This ranking is compared to the groundtruth ranking produced by the true user point by calculating a normalized Kendall’s Tau distance, which is 0for identical rankings and 1 for completely discordant rankings [24]. This metric measures performance inthe context of a recommender system type task (a common application of preference learning) rather thansolely measuring preference estimation error. This metric is depicted in the right column of Figure 2, forthe matched model case in 2b and mismatched case in 2d. The same trends as observed in MSE analysisoccur, with our strategies performing similarly to InfoGain and outperforming all other methods. This is aparticularly noteworthy result in that our method produces more accurate rankings than ActRank- Q , whichto our knowledge is the state-of-the-art method in active embedding ranking. Our simulations demonstrate that both InfoGain approximation methods, EPMV and MCMV, significantlyoutperform the state-of-the-art techniques in active preference estimation in the context of low-dimensionalitem embeddings with noisy user responses, and perform similarity to InfoGain, the method they weredesigned to approximate. This is true even when generating noise according to a different model than the oneused for Bayesian estimation. These empirical results support the theoretical connections between EPMV,MCMV, and InfoGain presented in this study, and suggest that the posterior volume reduction properties ofEPMV may in fact allow for MSE reduction guarantees.These results also highlight the attractiveness of MCMV, which proved to be a top performer in embeddingpreference learning yet is computationally efficient and simple to implement. This technique may also findutility as a subsampling strategy in supervised learning settings with implicit pairwise feedback, such as in[34]. Furthermore, although in this work pairs were drawn from a fixed embedding, MCMV is easily adaptableto continuous item spaces that allow for generative construction of new items to compare. This is possiblein some applications, such as facial composite generation for criminal cases [35] or in evaluating foods andbeverages, where we might be able to generate nearly arbitrary stimuli based on the ratios of ingredients [36].
Acknowledgements
We thank the reviewers for their useful feedback and comments, as well as colleagues for insightful discussionsincluding Matthieu Bloch, Yao Xie, Justin Romberg, Stefano Fenu, Marissa Connor, and John Lee. Thiswork is supported by NSF grants CCF-1350954 and CCF-1350616, ONR grant N00014-15-1-2619, and a giftfrom the Alfred P. Sloan Foundation.
References [1] H. A. David,
The method of paired comparisons . London, 1963, vol. 12.[2] B. Settles, “Active learning,”
Synthesis Lectures on Artificial Intelligence and Machine Learning , vol. 6,no. 1, pp. 1–114, 2012.[3] D. V. Lindley, “On a measure of the information provided by an experiment,”
The Annals of MathematicalStatistics , pp. 986–1005, 1956.[4] D. J. MacKay, “Information-based objective functions for active data selection,”
Neural computation ,vol. 4, no. 4, pp. 590–604, 1992.[5] Y. Chen, S. H. Hassani, A. Karbasi, and A. Krause, “Sequential information maximization: When isgreedy near-optimal?” in
Conference on Learning Theory , 2015, pp. 338–363.[6] C. H. Coombs, “Psychological scaling without a unit of measurement.”
Psychological review , vol. 57,no. 3, p. 145, 1950. 117] O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. T. Kalai, “Adaptively Learning the CrowdKernel,” in
International Conference on Machine Learning (ICML ) , May 2011. [Online]. Available:http://arxiv.org/abs/1105.1033[8] L. Van Der Maaten and K. Weinberger, “Stochastic triplet embedding,” in
Machine Learning for SignalProcessing (MLSP), 2012 IEEE International Workshop on . IEEE, 2012, pp. 1–6.[9] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of pairedcomparisons,”
Biometrika , vol. 39, no. 3/4, pp. 324–345, 1952.[10] Y. Guo, P. Tian, J. Kalpathy-Cramer, S. Ostmo, J. P. Campbell, M. F. Chiang, D. Erdogmus, J. G. Dy,and S. Ioannidis, “Experimental design under the bradley-terry model.” in
IJCAI , 2018, pp. 2198–2204.[11] K. G. Jamieson, S. Katariya, A. Deshpande, and R. D. Nowak, “Sparse dueling bandits.” in
AISTATS ,2015.[12] F. Wauthier, M. Jordan, and N. Jojic, “Efficient ranking from pairwise comparisons,” in
InternationalConference on Machine Learning , 2013, pp. 109–117.[13] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, “Learning to rank: from pairwise approach to listwiseapproach,” in
Proceedings of the 24th international conference on Machine learning . ACM, 2007, pp.129–136.[14] Y. Chen and C. Suh, “Spectral mle: Top-k rank aggregation from pairwise comparisons,” in
InternationalConference on Machine Learning , 2015, pp. 371–380.[15] B. Eriksson, “Learning to top-k search using pairwise comparisons,” in
Artificial Intelligence and Statistics ,2013, pp. 265–273.[16] N. B. Shah and M. J. Wainwright, “Simple, robust and optimal ranking from pairwise comparisons.”
Journal of machine learning research , vol. 18, pp. 199–1, 2017.[17] N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. J. Wainwright, “Estimationfrom pairwise comparisons: Sharp minimax bounds with topology dependence,”
The Journal of MachineLearning Research , vol. 17, no. 1, pp. 2049–2095, 2016.[18] S. Negahban, S. Oh, and D. Shah, “Iterative ranking from pair-wise comparisons,” in
Advances in NeuralInformation Processing Systems , F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds.,2012, pp. 2474–2482.[19] L. Qian, J. Gao, and H. Jagadish, “Learning user preferences by adaptive pairwise comparison,”
Proceed-ings of the VLDB Endowment , vol. 8, no. 11, pp. 1322–1333, 2015.[20] E. Brochu, N. D. Freitas, and A. Ghosh, “Active preference learning with discrete choice data,” in
Advances in neural information processing systems , 2008, pp. 409–416.[21] N. Houlsby, F. Huszar, Z. Ghahramani, and J. M. Hernández-Lobato, “Collaborative gaussian processesfor preference learning,” in
Advances in Neural Information Processing Systems , 2012, pp. 2096–2104.[22] W. Chu and Z. Ghahramani, “Preference learning with gaussian processes,” in
Proceedings of the 22ndinternational conference on Machine learning . ACM, 2005, pp. 137–144.[23] A. K. Massimino and M. A. Davenport, “As you like it: Localization via paired comparisons,” arXivpreprint arXiv:1802.10489 , 2018.[24] K. G. Jamieson and R. Nowak, “Active ranking using pairwise comparisons,” in
Advances in NeuralInformation Processing Systems , 2011, pp. 2240–2248.[25] A. Saumard and J. A. Wellner, “Log-concavity and strong log-concavity: a review,”
Statistics surveys ,vol. 8, p. 45, 2014. 1226] S. Boyd and L. Vandenberghe,
Convex optimization . Cambridge university press, 2004.[27] T. M. Cover and J. A. Thomas,
Elements of information theory . John Wiley & Sons, 2012.[28] S. Prasad, “Certain relations between mutual information and fidelity of statistical estimation,” arXivpreprint arXiv:1010.1508 , 2010.[29] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio,“Generative adversarial nets,” in
Advances in neural information processing systems , 2014, pp. 2672–2680.[30] M. Wilber, I. S. Kwak, D. Kriegman, and S. Belongie, “Learning concept embeddings with combinedhuman-machine expertise,” in
Proceedings of the IEEE International Conference on Computer Vision ,2015, pp. 981–989.[31] M. J. Wilber, I. S. Kwak, and S. J. Belongie, “Cost-effective hits for relative similarity comparisons,” in
Second AAAI conference on human computation and crowdsourcing , 2014.[32] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo,P. Li, and A. Riddell, “Stan: A probabilistic programming language,”
Journal of statistical software ,vol. 76, no. 1, 2017.[33] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng,
Handbook of markov chain monte carlo . CRC press,2011.[34] L. Wu, C.-J. Hsieh, and J. Sharpnack, “Large-scale collaborative ranking in near-linear time,” in
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and DataMining . ACM, 2017, pp. 515–524.[35] C. Frowd, V. Bruce, M. Pitchford, C. Gannon, M. Robinson, C. Tredoux, J. Park, A. Mcintyre, and P. J.Hancock, “Evolving the face of a criminal: how to search a face space more effectively,”
Soft Computing ,vol. 15, no. 1, pp. 61–70, 2011.[36] E. E. Ventura, J. N. Davis, and M. I. Goran, “Sugar content of popular sweetened beverages based onobjective laboratory analysis: focus on fructose content,”
Obesity , vol. 19, no. 4, pp. 868–874, 2011.[37] L. Lovász and S. Vempala, “The geometry of logconcave functions and sampling algorithms,”
RandomStructures & Algorithms , vol. 30, no. 3, pp. 307–358, 2007.[38] A. Marsiglietti and V. Kostina, “A lower bound on the differential entropy of log-concave random vectorswith applications,”
Entropy , vol. 20, no. 3, p. 185, 2018.[39] S. G. Bobkov and M. M. Madiman, “On the problem of reversibility of the entropy power inequality,” in
Limit theorems in probability, statistics and number theory . Springer, 2013, pp. 61–74.[40] M. V. Burnashev and K. Zigangirov, “An interval estimation problem for controlled observations,”
Problemy Peredachi Informatsii , vol. 10, no. 3, pp. 51–61, 1974.[41] R. Durrett,
Probability: theory and examples . Cambridge university press, 2010.[42] A. R. Klivans, P. M. Long, and A. K. Tang, “Baum’s algorithm learns intersections of halfspaces withrespect to log-concave distributions,” in
Approximation, Randomization, and Combinatorial Optimization.Algorithms and Techniques . Springer, 2009, pp. 588–600.13
Supplementary material
First, we begin with an additional lemma:
Lemma A.1.
Let X i be a marginal distribution of W . The density of X i is then p X i | y i ( x ) = 1 σ i p Z i (cid:18) X i − E [ X i | y i ] σ i (cid:19) ≤ σ i , where σ i = (cid:112) E [( X i − E [ X i | y i ]) | y i ] and Z i = X i − E [ X i | y i ] σ i .Proof. Since X i is a marginal of a log-concave distribution, X i is also log-concave. Furthermore, Z i is azero-mean, unit-variance (i.e., isotropic) log-concave random variable with density p Z i ( z ) . Then Lemma A.1follows because one-dimensional isotropic log-concave densities are upper bounded by one [37].A direct consequence of Lemma A.1 is that for any a > , P ( | X i | < a | y i ) = (cid:90) a − a p X i | y i ( x ) dx ≤ σ i (cid:90) a − a dx ≤ aσ i , implying that P ( | X i | ≥ a | y i ) ≥ − aσ i . (10) A.1 Proof of Lemma 3.1
Proof.
Letting Σ W denote the d × d covariance matrix of random vector W ∈ R d , from Theorem 8.6.5 in [27],we have the upper bound h ( W ) ≤
12 log ((2 πe ) d | Σ W | ) . (11)Now assume the distribution P W of W is log-concave, let W , W ∼ P W be i.i.d. and let (cid:102) W := W − W . Let p (cid:102) W and p W denote the respective densities of (cid:102) W and W . We have by Proposition 3.5 of [25], for all z ∈ R d , p (cid:102) W ( z ) = p W ( z ) (cid:63) p W ( − z ) , (12)where (cid:63) is the convolution operator, is also log-concave. Since covariances add for independent randomvectors, Σ (cid:102) W = 2Σ W .By Theorem 4 of [38], for d ≥ h ( (cid:102) W ) ≥ d | Σ (cid:102) W | /d c ( d ) , where c ( d ) = e d / (4 √ d + 2)) . From Corollary 2.3 of [39], h ( (cid:102) W ) = h ( W − W ) ≤ h ( W ) + d log e, which implies h ( W ) ≥ h ( (cid:102) W ) − d log e ≥ d | Σ (cid:102) W | /d c ( d ) − d log e ≥ d | W | /d e c ( d ) (13)The result follows combining (11) and (13). 14 .2 Proof of Theorem 3.2 E Y i [ h i ( W )] = h ( W ) − i (cid:88) j =1 I ( W ; Y j | Y j − ) ≥ − i (14)from the chain rule for mutual information with h ( W ) = 0 and I ( W ; Y j | Y j − ) ≤ [27], and E Y i [ h i ( W )] ≤ E Y i log ((2 πe ) d | Σ W | Y i | ) (15) ≤
12 log ((2 πe ) d | E Y i Σ W | Y i | ) (16)from Lemma (3.1) with Jensen’s inequality and the concavity of log | A | for any matrix A in the positivedefinite cone [26]. Rearranging, we have − i (2 πe ) d ≤ | E Y i Σ W | Y i | ≤ Tr (cid:0) E Y i [Σ W | y i ] (cid:1) d d d (17) = ( E W,Y i [ (cid:107) W − E [ W | Y i ] (cid:107) ]) d d d (18) ≤ ( E W,Y i [ (cid:107) W − (cid:99) W i (cid:107) ]) d d d (19)where (17) is from the AM–GM inequality, (18) is due to the linearity of trace and expectation, and the lastinequality is due to that fact that expected value is the MMSE estimator, from which the MSE lower boundfollows. A.3 Proof of Proposition 3.3
Proof.
Consider the ‘equiprobable’ query scheme, with P ( Y i = 1 | y i − ) = for hyperplane query given byweights a i , threshold τ i , and noise constant k . Letting X i = a Ti W − τ i , we have I ( W ; Y i | y i − ) = H ( Y i | y i − ) − H ( Y i | y i − , W )= H ( Y i | y i − ) − H ( Y i | y i − , W, X i ) since X i is a deterministic function of W = H ( Y i | y i − ) − H ( Y i | y i − , X i ) since p ( Y i | y i − , W, X i ) = p ( Y i | y i − , X i ) = I ( X i ; Y i | y i − ) . Revisiting mutual information, we have I ( X i ; Y i | y i − ) = E (cid:20) log p ( Y i | X i , y i − ) p ( Y i | y i − ) (cid:21) (20) = E X i [(1 − h b ( f ( kX i ))) | y i − ] (21) = E X i [(1 − h b ( f ( k | X i | ))) | y i − ] (22)since − h b ( f ( kX i )) is symmetric. From Markov’s inequality with − h b ( f ( k | X i | )) being monotonicallyincreasing, for any a > , ≥ (1 − h b ( f ( ka ))) P ( | X | > a | y i − ) (23)(from (10)) ≥ (1 − h b ( f ( ka ))) (cid:18) − aσ i (cid:19) (24) = (cid:18) − h b (cid:18) f (cid:18) kcσ i (cid:19)(cid:19)(cid:19) (1 − c ) (25)by letting a = cσ i for any ≤ c ≤ .4 Proof of Theorem 3.4 Entropy properties:
Let h ( W | y i ) denote the posterior entropy after observing i queries. With a uniformprior distribution over the hypercube [ − , ] , we have that h ( W | y ) = 0 and h ( W | y i ) ≤ for ∀ i since theuniform distribution maximizes entropy over this bounded space.After query i , let the eigenvalues of the posterior covariance matrix be denoted in decreasing orderas λ ≥ λ · · · ≥ λ d . In the equiprobable, max-variance scheme, query a i is in the direction of maximaleigenvector, so the product of the noise constant and query standard deviation at iteration i is given by k (cid:113) a Ti Σ W | y i a i = k (cid:107) a i (cid:107)√ λ ≥ k min √ λ . From the monotonicity of the mutual information lower bound onequiprobable queries, we have I ( W ; Y i | y i − ) ≥ L c,k min ( (cid:112) λ ) (26)From rearranging terms in Lemma 3.1 along with | Σ W | y i | = (cid:81) di =1 λ i , we have h ( W | y i ) (2 πe ) d ≤ | Σ W | y i | = d (cid:89) i =1 λ i ≤ λ d (27) = ⇒ λ ≥ h ( W | yi ) d πe (28)For compactness of notation, let ˜ L c,k min ( h ) = L c,k min (cid:32) hd √ πe (cid:33) (29)Since L c,k min is monotonically increasing, we have I ( W ; Y i | y i − ) ≥ ˜ L c,k min ( h ( W | y i )) (30)Combined with the 1 bit upper bound on mutual information along with I ( W ; Y i | y i − ) = h ( W | y i − ) − E Y i | y i − [ h ( W | y i )] , we have h ( W | y i − ) − ≤ E Y i | y i − [ h ( W | y i )] (31) ≤ h ( W | y i − ) − ˜ L c,k min ( h ( W | y i − )) To bound the entropy deviations from one measurement to the next, we need the following lemma:
Lemma A.2.
For the equiprobable query scheme, | h ( W | y i ) − h ( W | y i − ) | ≤ γ ( d ) ∀ i ≥ where γ ( d ) = 8 d + d log (2 πed ) + 1 . The proof of Lemma A.2 is highly technical and so we relegate it to the end of the supplementary materials.
Martingale properties:
We note our martingale argument is similar in style to [40]. Let Z i = − h ( W | y i ) .From the previous section we have Z = 0 , Z i ≥ ∀ i ≥ , | Z i − Z i − | ≤ γ ( d ) from Lemma A.2, and Z i − + ˜ L c,k min ( − Z i − ) ≤ E Z i | y i − [ Z i ] ≤ Z i − + 1 . Since Z i − is a deterministic function of y i − ∀ i along withthe law of total expectation, E [ Z i | Z , . . . , Z i − ] = E Y i − | Z ,...,Z i − E [ Z i | Z , . . . , Z i − , y i − ]= E Y i − | Z ,...,Z i − E [ Z i | y i − ] which implies E [ Z i | Z i − ] ≥ E Y i − | Z ,...,Z i − [ Z i − + ˜ L c,k min ( − Z i − )]= Z i − + ˜ L c,k min ( − Z i − ) E [ Z i | Z , . . . , Z i − ] ≤ E Y i − | Z ,...,Z i − [ Z i − + 1]= Z i − + 1 Since ˜ L c,k min ( − Z i − ) > , we have E [ Z i | Z i − ] ≥ Z i − . For all i ≥ , | Z i | < ∞ since | Z i | = | Z + (cid:80) ij =1 Z j − Z j − | ≤ (cid:80) ij =1 | Z j − Z j − | ≤ iγ ( d ) < ∞ . Therefore, Z i is a submartingale.Let τ > define a stopping threshold and corresponding stopping time T = min { i : Z i ≥ τ } Considering E [ Z i | Z i − ] ≤ Z i − + 1 and taking the expectation over Z i − on both sides and expanding with the towerrule, we have E [ E [ Z i | Z i − ]] ≤ E [ Z i − ] + 1 E [ Z i ] ≤ E E [ Z i − | Z i − ] + 1 E [ Z i ] ≤ E [ Z i − ] + 1 + 1 . . . E [ Z i ] ≤ i which implies T ≥ E [ Z T ] ≥ τ where the last inequality follows by definition, so E [ T ] ≥ τ . Note that this is true for any query selectionscheme since mutual information is always upper bounded by 1 bit.To lower bound the expected stopping time, observe ˜ L c,k min ( − z ) is monotonically decreasing in z , and Z i ≤ τ for i < T , so we have in this range that ˜ L c,k min ( − Z i ) > ˜ L c,k min ( − τ ) . Using this fact, we construct aseparate submartingale that equals Z i up to and including the stopping time and has the same propertieslisted above. Specifically, let U i = (cid:40) Z i i ≤ TU i − + ˜ L c,k min ( − τ ) i > T. (32)Clearly for i ≤ T , U i = Z i , and if T U is defined as T U = min { i : U i ≥ τ } , by observation T U = T . U i alsosatisfies | U i − U i − | < γ ( d ) , and U i − + ˜ L c,k min ( − τ ) ≤ E [ U i | U i − ] ≤ U i − + 1 .We have E [ U i | U i − ] ≥ U i − + ˜ L c,k min ( − τ ) (33) E [ U i | U i − ]˜ L c,k min ( − τ ) ≥ U i − ˜ L c,k min ( − τ ) + 1 (34) E [ U i | U i − ]˜ L c,k min ( − τ ) − i ≥ U i − ˜ L c,k min ( − τ ) − ( i − (35)We then have a submartingle given by U ( sub ) i = U i ˜ L c,k min ( − τ ) − i .Assume for the time being that the optional stopping theorem can be applied to this submartingale (provedin the sequel)—for any stopping time S satisfying S ≤ T , E [ U sub S ] ≤ E [ U sub T ] . Specifically, if τ S is a stoppingthreshold satisfying τ S ≤ τ such that S = min { i : U i ≥ τ S } , then (for brevity, letting l ( u ) = ˜ L c,k min ( − u ) ) E [ U S ] l ( τ ) − E [ S ] ≤ E [ U T ] l ( τ ) − E [ T ] (36)which implies E [ U S ] l ( τ S ) − E [ S ] = l ( τ ) l ( τ S ) (cid:20) E [ U S ] l ( τ ) − E [ S ] (cid:21) − (cid:18) − l ( τ ) l ( τ S ) (cid:19) E [ S ] (37) ≤ l ( τ ) l ( τ S ) (cid:20) E [ U T ] l ( τ ) − E [ T ] (cid:21) − (cid:18) − l ( τ ) l ( τ S ) (cid:19) E [ S ] (38)17ore generally, let ∆ > be given and set stopping threshold τ i = i ∆ , with corresponding stopping time T i .Define P i = U Ti l ( τ i ) − T i . Letting r i = l ( τ i ) l ( τ i − ) and letting T = T i and S = T i − , by rearranging the above wehave E [ P i ] ≥ E [ P i − ] r i + (1 − r i ) r i E [ T i − ] (39)Noting that E [ T ] = 0 since a threshold of τ results in stopping at T = 0 and E [ P ] = U T l ( τ ) − E [ T ] = 0 , wecontinue this bound recursively E [ P i ] ≥ E [ P i − ] r i r i − + (1 − r i − ) r i r i − E [ T i − ] + (1 − r i ) r i E [ T i − ] . . . = i − (cid:88) j =1 − r j +1 (cid:81) ik = j +1 r k E [ T j ]= i − (cid:88) j =1 l ( τ j ) − l ( τ j +1 ) l ( τ i ) E [ T j ] since (cid:81) ik = j +1 r k = l ( τ i ) l ( τ i − ) l ( τ i − ) l ( τ i − ) . . . l ( τ j +1 ) l ( τ j ) = l ( τ i ) l ( τ j ) = 1 l ( τ i ) i − (cid:88) j =1 l ( j ∆) − l ( j ∆ + ∆)∆ ∆ E [ T j ] ≥ l ( τ i ) i − (cid:88) j =1 l ( τ j ) − l ( τ j + ∆)∆ τ j ∆ since E [ T j ] ≥ τ j = j ∆ . Now let τ > be given (with corresponding stopping time defined as T ) and let ∆ → , choosing i appropriately such that τ = τ i = i ∆ ≥ − l ( τ ) (cid:90) τ (cid:18) ddx l ( x ) (cid:19) xdx = 1 l ( τ ) (cid:90) τ l ( x ) dx − τ = ⇒ E [ U T ] l ( τ ) − E [ T ] ≥ l ( τ ) (cid:90) τ l ( x ) dx − τ = ⇒ E [ T ] ≤ τ + E [ U T ] l ( τ ) − l ( τ ) (cid:90) τ l ( x ) dx ≤ τ + τ + 1 l ( τ ) − l ( τ ) (cid:90) τ l ( x ) dx since E [ U T ] = E [ E [ U T | U T − ]] ≤ E [ U T − ] + 1 ≤ τ + 1 .All together we have τ ≤ E [ T ] ≤ τ + τ + 1 l ( τ ) − l ( τ ) (cid:90) τ l ( x ) dx (40)Now, suppose we’d like to stop the algorithm when the posterior covariance determinant crosses below athreshold, corresponding to a low posterior volume. Denote this threshold as ε , and define the stopping time T ε as min { i : | Σ W | y i | d < ε } . By rearranging the upper bound in Lemma 3.1 we have the necessary condition h i ( W ) ≤ d (2 πeε ) (41)18etting τ = d log ( πeε ) be the entropic stopping threshold with stopping time T , from (40) this results in(with E [ T ε ] ≥ E [ T ] since this is a necessary condition) E [ T ε ] ≥ E [ T ] ≥ τ (42)Similarly, by rearranging the lower bound in Lemma 3.1 we observe that a sufficient condition for thisstopping criterion is h i ( W ) ≤ d (cid:15)e c d (43)where c d = ( e d ) / (4 √ d + 2)) . Letting τ = d log e c d ε be the entropic stopping threshold with stoppingtime T , we have from (40) (with E [ T ε ] ≤ E [ T ] since this is only a sufficient condition): E [ T ε ] ≤ E [ T ] ≤ τ + τ + 1 l ( τ ) − l ( τ ) (cid:90) τ l ( x ) dx (44)Combining these, we have the theorem result. Verifying optional stopping theorem:
Consider a submartingale of the form P i = Q i C − i for some C > , where Q i is also a submartingale satisfying Q i = 0 , Q i ≥ for i ≥ , and | Q i +1 − Q i | ≤ B for some B > C > . This implies | P i − P i − | = (cid:12)(cid:12)(cid:12)(cid:12) Q i C − i − Q i − C + ( i − (cid:12)(cid:12)(cid:12)(cid:12) = | Q i − Q i − − C | C ≤ | Q i − Q i − | C + 1 ≤ BC + 1 =: B (cid:48) < ∞ Let stopping time T Q be defined as min { i : Q i > τ } for some threshold < τ < ∞ . This implies a stoppingtime on P i given by T P = min { i : P i > τC − i } , with T := T Q = T P . We have from Theorem 5.2.6 of [41] that P T ∧ i and Q T ∧ i are also submartingales.Consider sup E Q + T ∧ i = sup E Q T ∧ i ≤ τ + B < ∞ , by definition. From Theorem 5.2.8 of [41], as i → ∞ , Q T ∧ i converges a.s. to a limit Q with E | Q | < ∞ (and hence | Q | < ∞ a.s.). This also implies | Q T ∧ i | a.s. → | Q | .Similarly, sup E P + T ∧ i = sup E (cid:20)(cid:110) Q T ∧ i C − ( T ∧ i ) (cid:111) + (cid:21) ≤ sup E (cid:104) Q + T ∧ i C (cid:105) ≤ τ + BC < ∞ , so as i → ∞ , P T ∧ i converges a.s. to a limit P with E | P | < ∞ (and hence | P | < ∞ a.s.). This also implies | P T ∧ i | a.s. → | P | .We have T ∧ i = (cid:12)(cid:12)(cid:12)(cid:12) ( T ∧ i ) − Q T ∧ i C + Q T ∧ i C (cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12) ( T ∧ i ) − Q T ∧ i C (cid:12)(cid:12)(cid:12)(cid:12) + | Q T ∧ i | C = | P T ∧ i | + | Q T ∧ i | C Since the right side converges a.s. to a limit | P | + | Q | C =: L and L < ∞ a.s., for all large enough i , T ∧ i < L a.s. which implies T < L a.s. and therefore E [ T ] < ∞ . Combining this fact with | P i +1 − P i | ≤ B (cid:48) , Theorem5.7.5 of [41] gives that P T ∧ i is uniformly integrable. Then, from Theorem 5.7.4 of [41], for any stopping time L ≤ T , E [ P L ] ≤ E [ P T ] . 19 .5 Proof of Theorem 3.5 To lower bound the complexity of T ε , we substitute the definition of τ into (42), which is true for any queryscheme: E [ T ε ] ≥ d (cid:18) πeε (cid:19) (45) = ⇒ E [ T ε ] = Ω (cid:18) d log 1 ε (cid:19) (46)To upper bound the complexity of T ε , note that τ − l ( τ ) (cid:82) τ l ( x ) dx ≤ from the mean value theorem,so E [ T ε ] ≤ τ +1 l ( τ ) . Also note that L c,k ( σ ) = (cid:18) − h b (cid:18) f (cid:18) ckσ (cid:19)(cid:19)(cid:19) (1 − c ) ≥ (cid:18) − sech (cid:18) ckσ (cid:19)(cid:19) (1 − c ) (47) ≥ c k σ
32 + c k σ (1 − c ) (48)where (47) is from h b ( p ) ≤ (cid:112) p (1 − p ) , and (48) is from sech( x ) ≤ x .Plugging in the definition for τ into l ( τ ) we have l ( τ ) = L c,k min (cid:32) − τ d √ πe (cid:33) = L c,k min (cid:18)(cid:114) επe c d (cid:19) (49)so l ( τ ) ≥ c k min πe c d ε + c k min (1 − c ) (50)which implies E [ T ε ] ≤ (cid:16) d log e c d ε + 1 (cid:17) (cid:0) πe c d ε + c k min (cid:1) (1 − c ) c k min (51) = ⇒ E [ T ε ] = O (cid:18) d log 1 ε + (cid:18) εk min (cid:19) d log 1 ε (cid:19) (52) A.6 Proof of Proposition 3.6
Proof.
We first bound p := P ( Y = 1) . Recall that for some fixed k , f ( x ) = (1 + e − kx ) − . First note that (cid:90) ba f ( x ) dx = (cid:90) ba k ke kx e kx dx = 1 k (cid:90) ba u (cid:48) u dx = 1 k (cid:90) u ( b ) u ( a ) u du = 1 k ln 1 + e kb e ka . We have that P ( Y = 1) = E [ P ( Y = 1 | X = x )] = E [ f ( x )] . Note that ∀ x , (1 + e − kx ) ≤ . Then, p = E [ f ( x )] = (cid:90) f ( x ) p X ( x ) dx = (cid:90) x ≤ f ( x ) p X ( x ) dx + (cid:90) x> f ( x ) p X ( x ) dx ≤ σ X (cid:90) −∞ f ( x ) dx + (cid:90) x> f ( x ) p X ( x ) dx ≤ σ X k ln 1 + e k (cid:90) x> p X ( x ) dx ≤ ln 2 σ X k + P ( X > ≤ ln 2 σ X k + 1 − e , p X ( x ) ≤ /σ X and the final inequality follows from P ( X ≤ ≥ e for zero-mean LCC X [37].Using a similar argument it can be shown that E [ f ( x )] ≥ /e − ln 2 / ( σ X k ) . Combining these, we have e − ln 2 σ X k ≤ p ≤ − (cid:18) e − ln 2 σ X k (cid:19) . (53)Now we turn to lower bounding I ( X ; Y ) := H ( Y ) − H ( Y | X ) . The second term can be written H ( Y | X ) = E X H ( Y | X = x ) = (cid:90) ∞−∞ h b ( f ( x )) p X ( x ) dx ≤ σ X (cid:90) ∞−∞ h b ( f ( x )) dx. (54)where the inequality follows from Lemma A.1. Since H ( Y | X = x )= − f ( x ) log f ( x ) − (1 − f ( x )) log (1 − f ( x ))= 11 + e − kx log (1 + e − kx ) + e − kx e − kx log ((1 + e − kx ) /e − kx )= 1 + e − kx e − kx log (1 + e − kx ) − e − kx e − kx log ( e − kx )= log (1 + e − kx ) + kxe − kx log ( e )1 + e − kx , which is an even function, we have (omitting details of the integration) H ( Y | X ) ≤ σ X (cid:90) ∞ log (1 + e − kx ) + kxe − kx log ( e )1 + e − kx dx = π (log e )3 kσ X (55)For the second term, note that H ( Y = 1) = h b ( p ) . The binary entropy function is symmetric about, andmonotonically decreasing from p = 1 / . Therefore, H ( Y ) = h b ( p ) ≥ h b (cid:18) e − ln 2 σ X k (cid:19) (56)Combining (55) and (56) gives the desired result. A.7 Proof of Lemma A.2
Proof.
Since p ( W | y i ) is log-concave, and by Jensen’s inequality, − h ( W | y i ) = E W | y i [log p ( W | y i )] ≤ log p ( E [ W | y i ] | y i ) ≤ log sup w p ( w | y i ) . Without loss of generality, we may suppose E [ W | y i ] = 0 , and let V = Σ − W | y i W and W ∼ P W | y i , such that E [ V ] = 0 and E [ V V T ] = Σ − W | y i E [ W W T ]Σ − W | y i = Σ − W | y i Σ W | y i Σ − W | y i = I and therefore V is isotropic. From[42] we have that p V ( v ) ≤ d d d . From the density of a linear transformation of a random variable we have p W | y i ( w ) = p V (Σ − W | y i w ) | Σ W | y i | ≤ d d d | Σ W | y i | . f i ( W ) denoting the logistic response model for the query atiteration i ) p ( w | y i ) = p ( w | y i = y, y i − )= f i ( W ) y + (1 − f i ( W ))(1 − y ) p ( y i = y | y i − ) p ( W | y i − ) ≤ (1) y + (1 − (0))(1 − y ) p ( y i = y | y i − ) p ( W | y i − )= 1 p ( y i = y | y i − ) p ( W | y i − ) ≤ p ( y i = y | y i − ) 2 d d d | Σ W | y i − | = ⇒ sup w p ( w | y i ) ≤ p ( y i = y | y i − ) 2 d d d | Σ W | y i − | , which implies log sup w p ( w | y i ) ≤ d + d d −
12 log | Σ W | y i − | − log ( p ( y i = y | y i − )) , and hence h ( W | y i ) ≥
12 log | Σ W | y i − | + log ( p ( y i = y | y i − )) − (cid:18) d + d d (cid:19) ≥
12 log ((2 πe ) d | Σ W | y i − | ) −
12 log (2 πe ) d + log ( p ( y i = y | y i − )) − (cid:18) d + d d (cid:19) ≥ h ( W | y i − ) + log ( p ( y i = y | y i − )) − (cid:18) d + d (2 πed ) (cid:19) from (11) . For equiprobable queries p ( y i = y | y i − ) = 1 / , and so we have h ( W | y i − ) − h ( W | y i ) ≤ γ ( d ) . (57)where γ ( d ) = 8 d + d log (2 πed ) + 1 .To obtain the other direction, let h i − y = h ( W | Y i = y, y i − ) , y m = arg min y ∈{ , } h i − y , y M = 1 − y m .Note that h i − y M ≥ h i − y m . We have h ( W | Y i , y i − ) = 12 h i − m + 12 h i − M ≥
12 ( h ( W | y i − ) − γ ( d )) + 12 h i − M ≥
12 ( h ( W | y i − ) − γ ( d )) + 12 h ( W | y i ) where the first inequality follows from (57) and the second inequality follows from the definition of h M . Fromthe non-negativity of mutual information, we have that h ( W | Y i , y i − ) ≤ h ( W | y i − ) , implying h ( W | y i − ) ≥
12 ( h ( W | y i − ) − γ ( d )) + 12 h ( W | y i ) h ( W | y i − ) − h ( W | y i ) ≥ − γ ( d ) (58)Combining (58) with (57) we have the desired result.22 .8 Additional experiments Performance across dimensions:
Figure 3 plots MSE against embedding dimension averaged acrossall trials at both 20 and 60 queries asked. For all dimensions across all experiments, the learned Yummly
Food-10k embedding was centered and scaled by a constant amount such that the unit hypercube of userpreference points would be contained in the embedding of items, allowing for a rich pool of pairs to be selectedfrom for any user point. This scaling constant was heuristically set to √ d/ (3˜ λ / ) , where ˜ λ is the smallesteigenvalue of the covariance matrix of embedding items. This scaling is motivated by setting the smallestvariance direction of the embedding to align with the furthest point of the unit cube at a distance of √ d fromthe origin. For each learned embedding, responses to the Yummly Food-10k training triplets were predictedby selecting the closer of the two comparison items to the reference item, using the embedding to measuredistances. For a given embedding, we refer to the fraction of incorrectly predicted triplet responses as the triplet error fraction , which we plot for reference against embedding dimension in Figure 4. (a) Matched “constant” noise: 20queries (b) Matched “normalized” noise: 20queries (c) Matched “decaying” noise: 20queries(d) Matched “constant” noise: 60queries (e) Matched “normalized” noise: 60queries (f) Matched “decaying” noise: 60queries
Figure 3: Mean squared error performance across dimensions at a fixed number of answered queries, plottedwith ± one standard error.Figure 4: Triplet error fraction versus embedding dimension.23 peed plot comparison Figure 5 plots MSE against cumulative compute time for matched logistic noisewith “normalized” noise constant for d ∈ { , , } in a smaller scale experiment of 60 queries per trial, and40 trials per dimension. Specifically, MSE and average cumulative compute time were calculated for each number of queries asked, and these two values plotted against each other directly in a range up to 600 seconds.We evaluated all three of our methods (InfoGain, MCMV, EPMV) at various pair pool downsampling ratesof β ∈ { − , − . , − } , as listed in the figure legend next to each method. Each experiment was run onan Intel Xeon CPU E5-2680 v4 2.40 GHz processor. (a) d = 4 (b) d = 7 (c) d = 12 Figure 5: Mean squared error performance against cumulative compute time (s) for matched, “normalized”logistic noise at various pair downsampling rates. Error bars have been omitted for visual clarity.
Additional experimental results
In this section, MSE is evaluated for both matched and mismatchednoise at d ∈ { , , , , } in Figs. 6, 7, 8, 9, and 10. The model for k pq on mismatched Gaussian noiseis chosen as the maximum-likelihood model (“constant,” “normalized”, “decaying”) on the training triplets,calculated separately for each embedding dimension. For all experiments, β = 10 − and results are averagedover 50 trials. (a) “Constant” model,matched, d = 3 (b) “Normalized” model,matched, d = 3 (c) “Decaying” model,matched, d = 3(d) “Constant” model,mismatched, d = 3 (e) “Normalized” model,mismatched, d = 3 (f) “Decaying” model,mismatched, d = 3 Figure 6: Mean squared error performance versus number of queries asked for pairwise search in 3 dimensions,plotted with ± one standard error. All mismatched noise is Gaussian with a “constant” noise constant.24 a) “Constant” model,matched, d = 5 (b) “Normalized” model,matched, d = 5 (c) “Decaying” model,matched, d = 5(d) “Constant” model,mismatched, d = 5 (e) “Normalized” model,mismatched, d = 5 (f) “Decaying” model,mismatched, d = 5 Figure 7: Mean squared error performance versus number of queries asked for pairwise search in 5 dimensions,plotted with ± one standard error. All mismatched noise is Gaussian with a “normalized” noise constant. (a) “Constant” model,matched, d = 7 (b) “Normalized” model,matched, d = 7 (c) “Decaying” model,matched, d = 7(d) “Constant” model,mismatched, d = 7 (e) “Normalized” model,mismatched, d = 7 (f) “Decaying” model,mismatched, d = 7 Figure 8: Mean squared error performance versus number of queries asked for pairwise search in 7 dimensions,plotted with ± one standard error. All mismatched noise is Gaussian with a “normalized” noise constant.25 a) “Constant” model,matched, d = 9 (b) “Normalized” model,matched, d = 9 (c) “Decaying” model,matched, d = 9(d) “Constant” model,mismatched, d = 9 (e) “Normalized” model,mismatched, d = 9 (f) “Decaying” model,mismatched, d = 9 Figure 9: Mean squared error performance versus number of queries asked for pairwise search in 9 dimensions,plotted with ± one standard error. All mismatched noise is Gaussian with a “normalized” noise constant. (a) “Constant” model,matched, d = 12 (b) “Normalized” model,matched, d = 12 (c) “Decaying” model,matched, d = 12(d) “Constant” model,mismatched, d = 12 (e) “Normalized” model,mismatched, d = 12 (f) “Decaying” model,mismatched, d = 12 Figure 10: Mean squared error performance versus number of queries asked for pairwise search in 12dimensions, plotted with ±±