Large-Margin Determinantal Point Processes
Boqing Gong, Dept. of Computer Science, U. of Southern California, Los Angeles, CA 90089, [email protected]
Wei-Lun Chao, Dept. of Computer Science, U. of Southern California, Los Angeles, CA 90089, [email protected]
Kristen Grauman, Dept. of Computer Science, U. of Texas at Austin, Austin, TX 78701, [email protected]
Fei Sha, Dept. of Computer Science, U. of Southern California, Los Angeles, CA 90089, [email protected]
November 10, 2014
Abstract
Determinantal point processes (DPPs) offer a powerful approach to modeling diversity in many applications where the goal is to select a diverse subset. We study the problem of learning the parameters (i.e., the kernel matrix) of a DPP from labeled training data. We make two contributions. First, we show how to reparameterize a DPP's kernel matrix with multiple kernel functions, thus enhancing modeling flexibility. Second, we propose a novel parameter estimation technique based on the principle of large margin separation. In contrast to the state-of-the-art method of maximum likelihood estimation, our large-margin loss function explicitly models errors in selecting the target subsets, and it can be customized to trade off different types of errors (precision vs. recall). Extensive empirical studies validate our contributions, including applications to challenging document and video summarization, where flexibility in modeling the kernel matrix and balancing different errors is indispensable.
1 Introduction

Imagine we are to design a search engine to retrieve web images that match user queries. In response to the search term jaguar, what should we retrieve — the images of the animal jaguar or the images of the automobile jaguar?

This frequently cited example illustrates the need to incorporate the notion of diversity. In many tasks, we want to select a subset of items from a "ground set". While the ground set might contain many similar items, our goal is not to discover all of the same ones, but rather to find a subset of diverse items that ensures coverage (the exact definition of coverage is task-specific). In the example of retrieving images for jaguar, we achieve diversity by including both types of images.

Recently, the determinantal point process (DPP) has emerged as a promising technique for modeling diversity [1]. A DPP defines a probability distribution over the power set of a ground set. Intuitively, subsets of higher diversity are assigned larger probabilities, and thus are more likely to be selected than those with lower diversity. Since its original application to quantum physics, the DPP has found many applications in modeling random trees and graphs [2], document summarization [3], search and ranking in information retrieval [4], and clustering [5]. Various extensions have also been studied, including the k-DPP [4], structured DPP [6], Markov DPP [7], and DPP on continuous spaces [8].

The probability distribution of a DPP depends crucially on its kernel — a square, symmetric, positive semidefinite matrix whose elements specify how similar every pair of items in the ground set is. This kernel matrix is often unknown and needs to be estimated from training data.

This is a very challenging problem for several reasons. First, the number of parameters, i.e., the number of elements in the kernel matrix, is quadratic in the number of items in the ground set. For many tasks (for instance, image search), the ground set can be very large.
Thus it is impractical to directly specify every element of the matrix, and a suitable reparameterization of the matrix is necessary. Second, the number of training samples is often limited in many practical applications. One such example is the task of document summarization, where our aim is to select a succinct subset of sentences from a long document. There, acquiring accurate annotations from human experts is costly and difficult. Third, for many tasks, we need to evaluate the performance of the learned DPP not only by its accuracy in predicting whether an item should be selected, but also by other measures like precision and recall. For instance, failing to select key sentences for summarizing documents might be regarded as more catastrophic than injecting sentences with repetitive information into the summary.

Existing methods of parameter estimation for DPPs are inadequate to address these challenges. For example, maximum likelihood estimation (MLE) typically requires a large number of training samples in order to estimate the underlying model correctly. This also limits the number of parameters it can estimate reliably, restricting its use to DPPs whose kernels can be parameterized with few degrees of freedom. It also does not offer fine control over precision and recall.

We propose a two-pronged approach for learning a DPP from labeled data. First, we improve modeling flexibility by reparameterizing the DPP's kernel matrix with multiple base kernels. This representation can easily incorporate domain knowledge and requires learning fewer parameters (instead of the whole kernel matrix). Then, we optimize the parameters such that the probability of the correct subset is larger than that of other, erroneous subsets by a large margin. This margin is task-specific and can be customized to reflect the desired performance measure — for example, to control precision and recall.
As such, our approach defines objective functions that closely track selection errors and work well with few training samples. While the principle of large margin separation has been widely used in classification [9] and structured prediction [10], formulating DPP learning with the large margin principle is novel. Our empirical studies show that the proposed method attains superior performance on two challenging tasks of practical interest: document and video summarization.

The rest of the paper is organized as follows. We provide background on the DPP in section 2, followed by our approach in section 3. We discuss related work in section 4 and report our empirical studies in section 5. We conclude in section 6.
2 Background

We first review background on the determinantal point process (DPP) [11] and the standard maximum likelihood estimation technique for learning DPP parameters from data. More details can be found in the excellent tutorial [1].

Given a ground set of N items, Y = {1, 2, . . . , N}, a DPP defines a probabilistic measure over the power set, i.e., all possible subsets (including the empty set) of Y. Concretely, let L denote a symmetric and positive semidefinite matrix in R^{N×N}. The probability of selecting a subset y ⊆ Y is given by

P(y; L) = det(L_y) / det(L + I),   (1)

where L_y denotes the submatrix of L with rows and columns selected by the indices in y, and I is the identity matrix of the proper size. We define det(L_∅) = 1. The above way of defining a DPP is called an L-ensemble. An equivalent way of defining a DPP is to use a kernel matrix to define the marginal probability of selecting a random subset:

P_y = Σ_{y' ⊆ Y} P(y'; L) 1[y ⊆ y'] = det(K_y),   (2)

where we sum over all subsets y' that contain y (1[·] is an indicator function). The matrix K is another positive semidefinite matrix, computable from the L matrix as

K = L(L + I)^{-1},   (3)

and K_y is the submatrix of K indexed by y. Despite the exponential number of summands in eq. (2), the marginalization is analytically tractable and computable in polynomial time.

Modeling diversity
One particularly useful property of the DPP is its ability to model pairwise repulsion. Consider the marginal probability of having two items i and j simultaneously in a subset:

P_{i,j} = det [ K_ii  K_ij ; K_ji  K_jj ] = K_ii K_jj − K_ij² ≤ K_ii K_jj = P_{i} P_{j} ≤ min(P_{i}, P_{j}).   (4)

Thus, unless K_ij = 0, the probability of observing i and j jointly is always less than that of observing either i or j separately. Namely, having i in a subset repulsively excludes j, and vice versa. Another extreme case is when i and j are identical; then K_ii = K_jj = K_ij, which leads to P_{i,j} = 0. Namely, we should never allow them together in any subset.

Consequently, a subset with a large (marginal) probability cannot have too many items that are similar to each other (i.e., with high values of K_ij). In other words, the probability provides a gauge of the diversity of the subset. The most diverse subset, which balances all the pairwise repulsions, is the subset that attains the highest probability:

y* = arg max_y P(y; L).   (5)

Note that this MAP inference is computed with respect to the L-ensemble (instead of K), as we are interested in the mode, not the marginal probability of having the subset. Unfortunately, the MAP inference is NP-hard [12]. Various approximation algorithms have been investigated [13, 1].

Maximum likelihood estimation (MLE)
Suppose we are given a training set {(Y_n, y_n)}, where each ground set Y_n is annotated with its most diverse subset y_n. How can we discover the underlying parameters L or K? Note that different ground sets need not overlap. Thus, directly specifying kernel values for every pair of items is unlikely to be scalable. Instead, we will need to assume that either L or K for each ground set is represented by a shared set of parameters θ.

For items i and j in Y_n, suppose their kernel values K^n_ij can be computed as a function of x^n_i, x^n_j and θ, where x^n_i and x^n_j are features characterizing those items. Our learning objective is to optimize θ such that y_n is the most diverse subset in Y_n, i.e., attains the highest probability. This gives rise to the following maximum likelihood estimate (MLE) [3],

θ_mle = arg max_θ Σ_n log P(y_n; L_n(Y_n; θ)),   (6)

where L_n(Y_n; θ) converts features in Y_n to the L matrix for the ground set Y_n. MLE has been a standard approach for estimating DPP parameters. However, as we will discuss in section 3.2, it has important limitations.

3 Our approach

Next, we introduce our method for learning the parameters. We first present our multiple kernel based representation of the L matrix and then the large-margin based estimation. Our approach consists of two components that are developed in parallel, yet work in concert: (1) the use of multiple kernel functions to represent the DPP; (2) applying the principle of large margin separation to optimize the parameters. The former reduces the number of parameters to learn and thus is especially advantageous when the number of training samples is limited. The latter strengthens the advantage by optimizing objective functions that closely track subset selection errors.

3.1 Multiple kernel representation of a DPP
Learning the L or K matrix for a DPP is an instance of learning kernel functions, as those matrices are positive semidefinite and thus interpretable as kernel functions evaluated on the items in the ground set. Thus, our goal is essentially to learn the right kernel function to measure similarity.

However, for many applications, similarity is just one of the criteria for selecting items. For instance, in the previous example of image retrieval, the retrieved images not only need to be diverse (thus different) but also need to have strong relevance to the query term. Similarly, in document summarization, the selected sentences not only need to be succinct and non-redundant, but also need to represent the contents of the document [14].

Kulesza and Taskar [3] propose to balance these two potentially conflicting forces with a decomposable L matrix:

L_ij = q_i q_j S_ij = q_i q_j φ_i^T φ_j,   q_i = q(x_i) = exp(θ^T x_i),   ∀ i, j ∈ Y,   (7)

where q_i is referred to as the quality factor, modeling how representative or relevant the selected items are. It depends on item i's feature vector x_i, which encodes i's contextual information and its representativeness of other items. For example, in document summarization, possible features are the sentence lengths, the positions of the sentences in the text, and others. S_ij, on the other hand, measures how similar two items are, computed from a different set of features, φ_i and φ_j, such as bag-of-words descriptors that represent each item's individual characteristics.

However, prior work [3] does not investigate whether this specific definition of similarity could be made optimal and adapted to the data, thus limiting the modeling power of the DPP largely to inferring the quality q_i. Our empirical studies show that this limitation can be severe, especially when the modeling choice is erroneous (cf. section 5.1).

In this paper, we retain the aspect of quality modeling but improve the modeling of similarity S_ij in two ways.
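Before turning to those improvements, the baseline decomposition of eq. (7) can be made concrete with a small sketch (hypothetical features and parameters, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4
X = rng.normal(size=(N, d))    # quality features x_i (hypothetical)
Phi = rng.normal(size=(N, d))  # similarity features phi_i (hypothetical)
theta = rng.normal(size=d)

q = np.exp(X @ theta)          # quality factors q_i = exp(theta^T x_i)
S = Phi @ Phi.T                # similarity S_ij = phi_i^T phi_j
L = np.outer(q, q) * S         # L_ij = q_i q_j S_ij  (eq. (7))

# L = diag(q) S diag(q) is positive semidefinite whenever S is.
assert np.linalg.eigvalsh(L).min() >= -1e-6 * np.abs(L).max()
```

Under this parameterization, learning reduces to fitting the low-dimensional θ (and, with the extension below, a handful of kernel combination weights) rather than all N² entries of L.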
First, we use nonlinear kernel functions such as the Gaussian RBF kernel to determine similarity. Second, and more importantly, we combine several base kernels:

S_ij = Σ_k α_k exp(−‖φ_i − φ_j‖² / σ_k²) + β φ_i^T φ_j,   (8)

where k indexes the base kernels and σ_k is a scaling factor. The combination coefficients are constrained such that Σ_k α_k + β = 1. They are optimized on the annotated data, either via maximum likelihood estimation or via our novel parameter estimation technique, described next.

3.2 Large-margin parameter estimation

Maximum likelihood estimation does not closely track discriminative errors [15, 9, 16]. While improving the likelihood of the ground-truth subset y_n, MLE could also improve the likelihoods of other competing subsets. Consequently, a model learned with MLE could have modes that are very different subsets yet very close to each other in probability. Having highly confusable modes is especially problematic for the DPP's NP-hard MAP inference — the difference between such modes can fall within the approximation errors of approximate inference algorithms, such that the true MAP cannot be easily extracted.

Multiplicative large margin constraints
To address these deficiencies, our large-margin based approach aims to maintain or increase the margin between the correct subset and alternative, incorrect ones. Specifically, we formulate the following large margin constraints:

log P(y_n; L_n) ≥ max_{y ⊆ Y_n} log [ ℓ(y_n, y) P(y; L_n) ] = max_{y ⊆ Y_n} [ log ℓ(y_n, y) + log P(y; L_n) ],   (9)

where ℓ(y_n, y) is a loss function measuring the discrepancy between the correct subset and an alternative y. We assume ℓ(y_n, y_n) = 0.

Intuitively, the more different y is from y_n, the larger the gap we want to maintain between the two probabilities. This way, the incorrect one has less chance of being identified as the most diverse one. Note that while similar intuitions have been explored in multiway classification and structured prediction, the margin here is multiplicative instead of additive — this is by design, as it leads to a tractable optimization over the exponential number of constraints, as we will explain later.

Design of the loss function
A natural choice for the loss function is the Hamming distance between y_n and y, counting the number of disagreements between the two subsets:

ℓ_H(y_n, y) = Σ_{i ∈ y} 1[i ∉ y_n] + Σ_{i ∉ y} 1[i ∈ y_n].   (10)

In this loss function, failing to select the right item costs the same as adding an unnecessary item. In many tasks, however, this symmetry does not hold. For example, in summarizing a document, omitting a key sentence has more severe consequences than adding a (trivial) sentence.

To balance these two types of errors, we introduce the generalized Hamming loss function,

ℓ_ω(y_n, y) = Σ_{i ∈ y} 1[i ∉ y_n] + ω Σ_{i ∉ y} 1[i ∈ y_n].   (11)

When ω is greater than 1, the learning biases towards higher recall, to select as many items in y_n as possible. When ω is significantly less than 1, the learning biases towards higher precision, to avoid incorrect items as much as possible. Our empirical studies demonstrate such flexibility and its advantages in two real-world summarization tasks.

Numerical optimization
To overcome the challenge of dealing with an exponential number of constraints in eq. (9), we reformulate it as a tractable optimization problem. We first upper-bound the hard-max operation with Jensen's inequality (i.e., the softmax):

log P(y_n; L_n) ≥ log Σ_{y ⊆ Y_n} e^{log ℓ_ω(y_n, y) + log P(y; L_n)} = softmax_{y ⊆ Y_n} [ log ℓ_ω(y_n, y) + log P(y; L_n) ].   (12)

With the loss function ℓ_ω(y_n, y), the right-hand side is computable in polynomial time,

softmax_{y ⊆ Y_n} [ log ℓ_ω(y_n, y) + log P(y; L_n) ] = log [ Σ_{i ∉ y_n} K^n_ii + ω Σ_{i ∈ y_n} (1 − K^n_ii) ],   (13)

where K^n_ii is the i-th element on the diagonal of K^n, the marginal kernel matrix corresponding to L_n. The detailed derivation of this result is in the supplementary material. Note that K^n can be computed efficiently from L_n through the identity in eq. (3).

The softmax can be seen as a summary of all undesirable subsets (the correct subset y_n does not contribute to the weighted sum, as ℓ_ω(y_n, y_n) = 0). Our optimization balances this term with the likelihood of the target via the hinge loss function [z]_+ = max(0, z):

min_θ Σ_n −log P(y_n; L_n) + λ [ log ( Σ_{i ∉ y_n} K^n_ii + ω Σ_{i ∈ y_n} (1 − K^n_ii) ) ]_+,   (14)

where λ ≥ 0. We optimize the objective function with subgradient descent. Details are in the supplementary material.

4 Related work

The DPP arises from random matrix theory and quantum physics [11, 1]. In machine learning, researchers have proposed different variations to improve its modeling capacity. Kulesza and Taskar introduced the k-DPP to restrict the sets to have a constant size k [4]. Affandi et al. proposed a Markov DPP which offers diversity at adjacent time stamps [7]. A structured DPP was presented in [6] to model trees and graphs. The MAP inference of a DPP is generally NP-hard [12]. Gillenwater et al. developed a 1/4-approximation algorithm [13]. In practice, greedy inference gives decent results [3], though it lacks theoretical guarantees.
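The greedy heuristic just mentioned admits a short sketch (an illustrative implementation, not the code used in [3]): starting from the empty set, repeatedly add the item that most increases det(L_y), i.e., the unnormalized probability of eq. (1):

```python
import numpy as np

def greedy_map(L):
    """Greedy heuristic for the NP-hard MAP problem of eq. (5):
    grow y by the single item that most increases det(L_y)."""
    N = len(L)
    y, best = [], 1.0  # det(L_empty) = 1
    while True:
        gains = [
            (np.linalg.det(L[np.ix_(y + [i], y + [i])]), i)
            for i in range(N) if i not in y
        ]
        if not gains:
            break
        g, i = max(gains)
        if g <= best:        # no remaining item improves the probability
            break
        best, y = g, y + [i]
    return sorted(y)

# On a diagonal L, items act independently: keep exactly those with L_ii > 1.
assert greedy_map(2.0 * np.eye(3)) == [0, 1, 2]
assert greedy_map(0.5 * np.eye(3)) == []
```

As the surrounding text notes, such greedy inference carries no optimality guarantee; it simply tends to work well in practice.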
Another popular alternative is to resort to fast sampling algorithms [5, 1].

In spite of much research activity surrounding DPPs, there is very little work exploring how to effectively learn the model parameters. MLE is the most popular estimator. Compared to MLE, our approach is more robust to the amount of training data and to mis-specified models, and offers greater flexibility by incorporating customizable error functions. A recent Bayesian approach works with the posterior over the parameters [17]. In contrast to that work, we develop a large-margin training approach for DPPs and directly minimize the set selection errors. The large-margin principle has been widely used in classification [9] and structured prediction [10, 18, 19, 20], but its application to DPPs is original. In order to make it tractable for DPPs, we use multiplicative rather than additive margin constraints.

5 Experiments

We validate our large-margin approach to learning DPP parameters with extensive empirical studies on both synthetic data and two real-world summarization tasks with documents and videos. While the DPP also has applications beyond summarization, summarization is a particularly good testbed for diverse subset selection: a compact summary ought to include high-quality items that, taken together, offer good coverage of the source content. We report key results in this section and provide more extensive results in the supplementary material.
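As a concrete reference for the experiments, the multiple-kernel similarity of eq. (8) can be assembled as follows (a sketch with hypothetical features and coefficients, assuming Gaussian RBF base kernels of the form exp(−‖φ_i − φ_j‖²/σ_k²)):

```python
import numpy as np

def multi_kernel_similarity(Phi, alphas, sigmas, beta):
    """S_ij = sum_k alpha_k RBF_k(phi_i, phi_j) + beta phi_i^T phi_j (eq. (8)).
    Coefficients are assumed nonnegative with sum_k alpha_k + beta = 1."""
    assert np.isclose(sum(alphas) + beta, 1.0)
    sq = np.sum((Phi[:, None, :] - Phi[None, :, :]) ** 2, axis=-1)
    S = beta * (Phi @ Phi.T)                 # linear base kernel
    for a, s in zip(alphas, sigmas):
        S += a * np.exp(-sq / s ** 2)        # Gaussian RBF base kernel
    return S

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))                # hypothetical item features
S = multi_kernel_similarity(Phi, alphas=[0.3, 0.3], sigmas=[0.5, 2.0], beta=0.4)

# A convex combination of positive semidefinite kernels is positive semidefinite.
assert np.linalg.eigvalsh(S).min() >= -1e-8
```

In the paper's method the coefficients α_k and β are not fixed by hand as above but learned from the annotated data, by MLE or by the large-margin estimator.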
Data
Our ground set has 10 items, Y = {x_1, x_2, . . . , x_10}. For each item, we sample a 5-dimensional feature vector from a spherical Gaussian: x_i ∼ N(0, I). To generate the L matrix for the DPP, we follow the model in eq. (7); for the parameter vector θ we sample from a spherical Gaussian, θ ∼ N(0, I), and for the similarity we simply let φ_i = x_i and compute S_ij = φ_i^T φ_j.

We identify the most diverse subset y* (eq. (5)) via exhaustive search over all subsets, which is possible given the small ground set. The resulting y* has 5 items on average. We then add noise by randomly (with probability 0.1) adding or dropping an item to or from y*. We repeat the process of sampling another pair of a ground set and its most diverse subset. We do so 200 times and use 100 pairs for holdout and 100 for testing. We repeat the process to yield training sets of various sizes.

Evaluation metrics
We evaluate the quality of the selected subset y_map against the ground truth y* using the F-score, the harmonic mean of precision and recall:

F-score = 2 · Precision · Recall / (Precision + Recall),   Precision = |y_map ∩ y*| / |y_map|,   Recall = |y_map ∩ y*| / |y*|.   (15)

All three quantities are between 0 and 1, and higher values are better.

Learning and inference
We compare our large-margin approach using the Hamming loss (eq. (10)) to the standard MLE method for learning DPP parameters. All hyperparameters are tuned by cross-validation. After learning, we apply MAP inference to the testing ground sets.
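The Hamming loss of eq. (10), its generalized form of eq. (11), and the F-score of eq. (15) are simple set computations; a minimal sketch:

```python
def generalized_hamming(y_true, y_pred, omega=1.0):
    """ell_omega of eq. (11); omega = 1 recovers the Hamming loss of eq. (10)."""
    y_true, y_pred = set(y_true), set(y_pred)
    extra = len(y_pred - y_true)    # selected but not in the target subset
    missed = len(y_true - y_pred)   # in the target subset but not selected
    return extra + omega * missed

def f_score(y_true, y_pred):
    """F-score of eq. (15) between predicted and ground-truth subsets."""
    y_true, y_pred = set(y_true), set(y_pred)
    if not y_true or not y_pred:
        return 0.0
    p = len(y_pred & y_true) / len(y_pred)
    r = len(y_pred & y_true) / len(y_true)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

assert generalized_hamming({1, 2, 3}, {1, 2, 4}) == 2        # one extra, one miss
assert generalized_hamming({1, 2, 3}, {1, 2}, omega=4) == 4  # misses cost more
assert f_score({1, 2, 3}, {1, 2, 3}) == 1.0
```

Setting omega above or below 1 shifts the learned model towards recall or precision, respectively, as discussed around eq. (11).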
Results
The DPP is parameterized by two things: θ for the quality of the items, and S_ij for the similarity among them. Since the ground-truth parameters are known to us, we conduct experiments to isolate the impact of learning either one.

Fig. 1(a) contrasts the two methods when learning θ only, assuming all S_ij are known and the ground truths are used. Our dpp lme method significantly outperforms dpp mle. When the number of training samples is increased, the performance of our method generally improves and gets very close to the oracle's performance, for which the true values of both S_ij and θ are used. Adding a zero-mean Gaussian prior over θ while learning with MLE, as in [3], did not yield improvement.
[Figure 1: three panels plotting F-score (%), comparing dpp mle [3], dpp lme, their variants with the true S, and the ground truth. (a) Learning θ only, with S_ij correctly specified, against the number of training samples. (b) Learning θ under mis-specified S_ij, against q, where the RBF bandwidth is σ = 2^q. (c) Learning both θ and S_ij with the multiple kernel parameterization, against the number of training samples.]

Figure 1: On synthetic datasets, our method dpp lme significantly outperforms the state-of-the-art parameter estimation technique dpp mle [3] in various learning settings. See text for details. Best viewed in color.
Fig. 1(b) examines the two methods in the setting of model mis-specification, where the S_ij values deliberately deviate from the true values. Specifically, we set them to exp(−‖x_i − x_j‖²/σ²), where the bandwidth σ varies from small to large, while the true values are x_i^T x_j. All methods generally suffer. However, our method is fairly robust to the mis-specification, while dpp mle quickly deteriorates. Our advantage is likely due to our method's focus on learning to reduce subset selection errors, whereas MLE focuses on learning the right probabilistic model (even though it is mis-specified).

Fig. 1(c) compares the two methods when both θ and S_ij need to be learned from the data. We apply our multiple kernel parameterization technique to model S_ij, as in eq. (8), except that β is set to zero to avoid including the ground truth. We see that our parameterization overcomes the problem of model mis-specification in Fig. 1(b), demonstrating its effectiveness in approximating unknown similarities. In fact, both learning methods match the performance of the corresponding methods with ground-truth similarity values. Nonetheless, our large-margin estimation still outperforms MLE significantly.

In summary, our results on synthetic data are very encouraging. Our multiple kernel parameterization avoids the pitfall of model mis-specification, and the large-margin estimation outperforms MLE due to its ability to track selection errors more closely.

5.2 Document summarization

Next we apply the DPP to the task of extractive multi-document summarization [21, 3, 14]. In this task, the input is a document cluster consisting of several documents on a single topic. The desired output is a subset of the sentences in the cluster that serves as a summary for the entire cluster. Naturally, we want the sentences in this subset to be both representative and diverse.
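The baseline sentence similarity for this task is the cosine between term-frequency style vectors [3]; a toy sketch with plain term counts (hypothetical sentences; real systems use tf-idf weighting):

```python
import numpy as np
from collections import Counter

sentences = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
]
vocab = sorted({w for s in sentences for w in s.split()})
counts = np.array(
    [[Counter(s.split())[w] for w in vocab] for s in sentences], dtype=float
)

# Cosine similarity between term-count vectors; S is PSD with unit diagonal.
V = counts / np.linalg.norm(counts, axis=1, keepdims=True)
S = V @ V.T

assert np.allclose(np.diag(S), 1.0)
assert S[0, 1] > S[0, 2]  # overlapping sentences score higher than unrelated ones
```

Such an S can be plugged into the decomposition of eq. (7), with sentence-level quality features supplying q_i.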
Setup
We use the text data from the Document Understanding Conference (DUC) 2003 and 2004 [21] as the training and testing sets, respectively. There are 60 document clusters in DUC 2003 and 50 in DUC 2004, each collected over a short time period on a single topic. A cluster includes 10 news articles and on average 250 sentences. Four human reference summaries are provided along with each cluster. Following prior work, we generate the oracle/ground-truth summary by identifying a subset of the original sentences that best agrees with the human reference summaries [3]. On average, the oracle summary consists of 5 sentences. As is standard practice, we use the oracles only during training. During testing, the algorithm output is evaluated against each of the four human reference summaries separately, and we report the average accuracy [21, 3, 14].

We use the widely-used evaluation package ROUGE [22], which scores document summaries based on n-gram overlap statistics. We use ROUGE 1.5.5 along with WordNet 2.0, and report the F-score (F), Precision (P), and Recall (R) of both unigram and bigram matchings, denoted by ROUGE-1X and ROUGE-2X respectively (X ∈ {F, P, R}). Additionally, we limit the maximum length of each summary to 665 characters to be consistent with existing work [21]. This yields 5 sentences on average for the subsets generated by our algorithm.

To allow the fairest comparison to existing DPP work for this task, we use the same features designated in [3]. To model quality, the features are the sentence length, position in the original document, mean cluster similarity, LexRank [23], and personal pronouns. To model similarity, the features are standard normalized term frequency-inverse document frequency (tf-idf) vectors.

Table 1: Accuracy on document summarization. Our methods outperform others with statistical significance.

Method | rouge-1f | rouge-1p | rouge-1r | rouge-2f | rouge-2p | rouge-2r
PEER 35 [21] | 37.54 | 37.69 | 37.45 | 8.37 | – | –
PEER 104 [21] | 37.12 | 36.79 | 37.48 | 8.49 | – | –
PEER 65 [21] | 37.87 | 37.58 | 38.20 | 9.13 | – | –
dpp mle + cos [3] | 37.89 ± … | … | … | … | … | …
Ours (dpp lme + cos) | 38.36 ± … | … | … | … | … | …
Ours (dpp mle + mkr) | 39.14 ± … | … | … | … | … | …
Ours (dpp lme + mkr) | … | … | … | … | … | …

Table 2: Accuracy on video summarization. Our method performs the best and allows precision-recall control.

Metric | VSUMM1 [24] | VSUMM2 [24] | dpp mle + mkr | Ours (dpp lme + mkr): ω = 1/64 | ω = 1 | ω = 64
F-score | 70.25 | 68.20 | 72.94 ± … | … | … | …

Learning
We consider two ways of modeling similarity. The first is the cosine similarity (cos) between feature vectors, as in [3]. The second is our multiple kernel based similarity (mkr, eq. (8)). For mkr, the bandwidths are σ = 2^q for a range of integer values of q up to 6, and the combination coefficients are learned on the data. We implement the method in [3] as a baseline (dpp mle + cos). We also test an enhanced variant of that method by replacing its cosine similarity with our multiple kernel based similarity (dpp mle + mkr).

Results
Table 1 compares several DPP-based methods, as well as the top three results (PEER 35, 104, 65) from the DUC 2004 competition, which are not DPP-based ("–" indicates results not available). Since the DPP MAP inference is NP-hard, we use a sampling technique to extract the most diverse subset [1]. We run inference 10 times and report the mean accuracy and standard error.

The state-of-the-art MLE-trained DPP model (dpp mle + cos) [3] achieves about the same performance as the best PEER results of DUC 2004. We obtain a noticeable improvement by applying our large-margin estimation (dpp lme + cos). By applying multiple kernels to model similarity, we obtain significant improvements (above the standard errors) for both parameter estimation techniques. In particular, our complete method, dpp lme + mkr, attains the best performance across all the evaluation metrics.

5.3 Video summarization

Finally, we demonstrate the broad applicability of our method by applying it to video summarization. In this case, the goal is to select a set of representative and diverse frames from a video sequence.
Setup
The dataset consists of 50 videos from the Open Video Project (ovp). They are 30 fps, 352 × ….

Features
We extract from each frame a color histogram and a SIFT-based Fisher vector [25, 26] to model pairwise frame similarity S_ij. The two features are combined via our multiple kernel representation. To model the quality of each frame, we extract both intra-frame and inter-frame representativeness features. They are computed on saliency maps [27, 28] and include the mean, standard deviation, median, and quantiles of the maps, as well as the visual similarities between a frame and its neighbors. We z-score them within each video sequence.

Results
Table 2 compares several methods for selecting key frames: the unsupervised clustering method VSUMM [24] (we implemented its two variants, which offer a degree of tradeoff between precision and recall, and finely tuned their parameters), dpp mle with a multiple kernel parameterization of S_ij, and our margin-based approach. For our method, we illustrate its flexibility to target different operating points by varying the tradeoff constant ω in the generalized Hamming loss function of eq. (11). Recall that higher values of ω promote higher recall, while lower values promote higher precision.

The results clearly demonstrate the advantage of our approach, particularly in how it offers finer control of the tradeoff between precision and recall. By adjusting ω, our method performs the best in each of the three metrics and outperforms the baselines by a statistically significant margin measured in the standard errors. Controlling the tradeoff is quite valuable in this application; for example, high precision may be preferable to a user summarizing a video he himself captured (he knows what appeared in the video, and wants a noise-free summary), whereas high recall may be preferable to a user summarizing a video taken by a third party (he has not seen the original video, and prefers some noise to dropped frames). More detailed analysis, including exemplar video frames, is provided in the supplementary material.

6 Conclusion

The determinantal point process (DPP) offers a powerful and probabilistically grounded approach for selecting diverse subsets. We proposed a novel technique for learning DPPs from annotated data. In contrast to the status quo of maximum likelihood estimation, our method is more flexible in modeling pairwise similarity and avoids the pitfall of model mis-specification. Empirical results demonstrate its advantages on both synthetic datasets and challenging real-world summarization applications.
References

[1] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
[2] Robert Burton and Robin Pemantle. Local characteristics, entropy and limit theorems for spanning trees and domino tilings via transfer-impedances. The Annals of Probability, pages 1329–1371, 1993.
[3] Alex Kulesza and Ben Taskar. Learning determinantal point processes. In UAI, 2011.
[4] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, 2011.
[5] Byungkon Kang. Fast determinantal point process sampling with application to clustering. In NIPS, 2013.
[6] Alex Kulesza and Ben Taskar. Structured determinantal point processes. In NIPS, 2011.
[7] Raja Hafiz Affandi, Alex Kulesza, and Emily B. Fox. Markov determinantal point processes. In UAI, 2012.
[8] Raja Hafiz Affandi, Emily B. Fox, and Ben Taskar. Approximate inference in continuous determinantal point processes. In NIPS, 2013.
[9] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[10] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In ICML, 2005.
[11] Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
[12] Chun-Wa Ko, Jon Lee, and Maurice Queyranne. An exact algorithm for maximum entropy sampling. Operations Research, 43(4):684–691, 1995.
[13] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal MAP inference for determinantal point processes. In NIPS, 2012.
[14] Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In NAACL/HLT, 2010.
[15] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In NIPS, 2002.
[16] Tony Jebara. Machine Learning: Discriminative and Generative. Springer, 2004.
[17] Raja Hafiz Affandi, Emily B. Fox, Ryan P. Adams, and Ben Taskar. Learning the parameters of determinantal point process kernels. In ICML, 2014.
[18] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[19] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In NIPS, 2004.
[20] Fei Sha and Lawrence K. Saul. Large margin hidden Markov models for automatic speech recognition. In NIPS, 2006.
[21] Hoa Trang Dang. Overview of DUC 2005. In Document Understanding Conf., 2005.
[22] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proc. of the ACL-04 Workshop, 2004.
[23] Güneş Erkan and Dragomir R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. JAIR, 22(1):457–479, 2004.
[24] Sandra Eliza Fontes de Avila, Ana Paula Brandão Lopes, et al. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 32(1):56–68, 2011.
[25] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[26] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[27] Esa Rahtu, Juho Kannala, Mikko Salo, and Janne Heikkilä. Segmenting salient objects from images and videos. In ECCV, 2010.
[28] Xiaodi Hou, Jonathan Harel, and Christof Koch. Image signature: Highlighting sparse salient regions. T-PAMI, 34(1):194–201, 2012.
[29] William H. Beyer. CRC Standard Mathematical Tables and Formulae. CRC Press, 1991.
[30] Vaibhava Goel and William J. Byrne. Minimum Bayes-risk automatic speech recognition. Computer Speech & Language, 14(2):115–135, 2000.
Appendix

A Calculating the softmax (cf. eq. (13))

In the main text, we use the softmax to deal with the exponential number of large-margin constraints and arrive at eq. (13). Here we show how to calculate the right-hand side of eq. (13). First, we compute $\sum_{y \subseteq \mathcal{Y}_n} \ell_\omega(y_n, y)\, P(y; L_n)$ as follows:

$$\sum_{y \subseteq \mathcal{Y}_n} \ell_\omega(y_n, y)\, P(y; L_n) = \sum_{y \subseteq \mathcal{Y}_n} \Big[ \sum_{i\colon i \in y} \mathbb{I}(i \notin y_n) + \omega \sum_{i\colon i \notin y} \mathbb{I}(i \in y_n) \Big] P(y; L_n) \tag{16}$$

$$= \sum_{i=1}^{N} \Big[ \mathbb{I}(i \notin y_n) \sum_{y\colon i \in y} P(y; L_n) + \omega\, \mathbb{I}(i \in y_n) \sum_{y\colon i \notin y} P(y; L_n) \Big] \tag{17}$$

$$= \sum_{i=1}^{N} \Big[ \mathbb{I}(i \notin y_n)\, P_n\{i\} + \omega\, \mathbb{I}(i \in y_n) \big(1 - P_n\{i\}\big) \Big] \tag{18}$$

$$= \sum_{i\colon i \notin y_n} P_n\{i\} + \omega \sum_{i\colon i \in y_n} \big(1 - P_n\{i\}\big) \tag{19}$$

$$= \sum_{i\colon i \notin y_n} K^n_{ii} + \omega \sum_{i\colon i \in y_n} \big(1 - K^n_{ii}\big), \tag{20}$$

where $P_n\{i\} = K^n_{ii}$ is the marginal probability of selecting item $i$. Now we are ready to see

$$\operatorname*{softmax}_{y \subseteq \mathcal{Y}_n} \big[ \log \ell_\omega(y_n, y) + \log P(y; L_n) \big] = \log \sum_{y \subseteq \mathcal{Y}_n} \ell_\omega(y_n, y)\, P(y; L_n) \tag{21}$$

$$= \log \Big[ \sum_{i\colon i \notin y_n} K^n_{ii} + \omega \sum_{i\colon i \in y_n} \big(1 - K^n_{ii}\big) \Big]. \tag{22}$$

Moreover, recall that $K = L(L + I)^{-1}$. Eigen-decomposing $L = \sum_m \lambda_m v_m v_m^T$, we have

$$K = L(L + I)^{-1} = \sum_m \frac{\lambda_m}{\lambda_m + 1}\, v_m v_m^T, \quad \text{and thus} \quad K_{ii} = \sum_m \frac{\lambda_m}{\lambda_m + 1}\, v_{mi}^2. \tag{23}$$

B Subgradients of the objective function (cf. eq. (14))
Recall that our objective function in eq. (14) consists of a likelihood term $\mathcal{L}(\cdot)$ and a second term accounting for undesirable subsets. Denote them respectively by

$$\mathcal{L}(\theta, \alpha; \mathcal{Y}_n, y_n) \triangleq \log P(y_n; L_n) = \log\det\big(L^n_{y_n}\big) - \log\det\big(L^n + I\big), \tag{24}$$

$$\mathcal{A}(\theta, \alpha; \mathcal{Y}_n, y_n) \triangleq \log \Big[ \sum_{i \notin y_n} K^n_{ii} + \omega \sum_{i \in y_n} \big(1 - K^n_{ii}\big) \Big]. \tag{25}$$

For brevity, we drop the subscript $n$ of $L^n$ and $K^n_{ii}$ and write $y_n$ as $y^\star$ in what follows.

To compute the overall subgradients, it suffices to compute the gradients of the two terms $\mathcal{L}$ and $\mathcal{A}$. Denoting the parameters by $\Theta = \{\theta, \alpha, \beta\}$, we have

$$\frac{\partial \mathcal{L}}{\partial \Theta_k} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial L_{ij}} \frac{\partial L_{ij}}{\partial \Theta_k} = \sum_{i,j} \Big[ \frac{\partial \mathcal{L}}{\partial L} \circ \frac{\partial L}{\partial \Theta_k} \Big]_{ij}, \qquad \frac{\partial \mathcal{A}}{\partial \Theta_k} = \sum_{i,j} \Big[ \frac{\partial \mathcal{A}}{\partial L} \circ \frac{\partial L}{\partial \Theta_k} \Big]_{ij}, \tag{26}$$

where $\circ$ stands for the element-wise product between two matrices of the same size. We deliberately use the chain rule to factor $\partial L / \partial \Theta_k$ out of the overall gradients: if we change the parameterization of the DPP kernel $L$, we only need to recompute $\partial L / \partial \Theta_k$ for the new parameterization.

B.1 Gradients of the quality-diversity decomposition

In terms of the quality-diversity decomposition (cf. eqs. (7) and (8)), we have

$$\frac{\partial L}{\partial \alpha_k} = \big(q q^T\big) \circ S_k, \qquad \frac{\partial L_{ij}}{\partial \theta_k} = L_{ij}\big(x_{ik} + x_{jk}\big), \quad \text{or} \quad \frac{\partial L}{\partial \theta_k} = L \circ \big( X e_k \mathbf{1}^T + \mathbf{1} e_k^T X^T \big), \tag{27}$$

where $q$ is the vector concatenating the quality terms $q_i$, $X$ is the design matrix stacking the $x_i^T$ row by row, $\mathbf{1}$ is the all-ones vector, and $e_k$ stands for the standard unit vector with 1 at the $k$-th entry and 0 elsewhere.

B.2 Gradients with respect to the DPP kernel

In what follows we calculate $\partial \mathcal{L} / \partial L$ and $\partial \mathcal{A} / \partial L$ in eq. (26). Since eq. (26) sums over all the $(i, j)$ pairs, we need not take special care of the symmetric structure in $L$.

We will need to map $L_{y^\star}$ "back" to a matrix $M$ of the same size as the original matrix $L$, such that $M_{y^\star} = L_{y^\star}$ and all the other entries of $M$ are zeros. We denote this mapping by $\langle \cdot \rangle$, i.e., $\langle L_{y^\star} \rangle = M$. Now we are ready to see

$$\frac{\partial \mathcal{L}}{\partial L} = \frac{\partial \log\det\big(L_{y^\star}\big)}{\partial L} - \frac{\partial \log\det\big(L + I\big)}{\partial L} = \Big\langle \big(L_{y^\star}\big)^{-1} \Big\rangle - \big(L + I\big)^{-1}. \tag{28}$$

It is a little more involved to compute

$$\frac{\partial \mathcal{A}}{\partial L} = \frac{1}{\sum_{i \notin y^\star} K_{ii} + \omega \sum_{i \in y^\star} \big(1 - K_{ii}\big)} \Big[ \sum_{i \notin y^\star} \frac{\partial K_{ii}}{\partial L} - \omega \sum_{i \in y^\star} \frac{\partial K_{ii}}{\partial L} \Big], \tag{29}$$

which involves $\partial K_{ii} / \partial L$. To calculate it, we start from the basic identity [29]

$$\frac{\partial A^{-1}}{\partial t} = -A^{-1} \frac{\partial A}{\partial t} A^{-1}, \tag{30}$$

which implies $\partial A^{-1} / \partial A_{mn} = -A^{-1} J_{mn} A^{-1}$, where $J_{mn}$ is of the same size as $A$; its $(m, n)$-th entry is 1 and all else are zeros.

Let $A = L + I$. Noting that $K = L(L + I)^{-1} = I - (L + I)^{-1} = I - A^{-1}$, and thus $K_{ii} = 1 - [A^{-1}]_{ii}$, we have

$$\frac{\partial K_{ii}}{\partial L_{mn}} = -\frac{\partial [A^{-1}]_{ii}}{\partial L_{mn}} = -\frac{\partial [A^{-1}]_{ii}}{\partial A_{mn}} = \big[ A^{-1} J_{mn} A^{-1} \big]_{ii} = [A^{-1}]_{mi}\, [A^{-1}]_{ni}. \tag{31}$$

We can also write eq. (31) in matrix form,

$$\frac{\partial K_{ii}}{\partial L} = [A^{-1}]_{\cdot i}\, [A^{-1}]_{\cdot i}^T = A^{-1} e_i e_i^T A^{-1} = A^{-1} J_{ii} A^{-1}, \tag{32}$$

where $[A^{-1}]_{\cdot i}$ is the $i$-th column of $A^{-1}$.

Overall, we arrive at a concise form by writing out the right-hand side of eq. (29) and merging some terms:

$$\sum_{i \notin y^\star} \frac{\partial K_{ii}}{\partial L} - \omega \sum_{i \in y^\star} \frac{\partial K_{ii}}{\partial L} = A^{-1}\, \mathbf{I}_\omega(y^\star)\, A^{-1} = (L + I)^{-1}\, \mathbf{I}_\omega(y^\star)\, (L + I)^{-1}, \tag{33}$$

where $\mathbf{I}_\omega(y^\star)$ looks like an identity matrix except that its $(i, i)$-th entry is $-\omega$ for $i \in y^\star$.

C Minimum Bayes Risk decoding
We conduct the MAP inference of the DPP by brute-force search on the synthetic data, and turn to the so-called minimum Bayes risk (MBR) decoding [30, 1] for the larger ground sets of the real data.

MBR inference samples subsets $\mathcal{S} = \{y_1, \cdots, y_T\}$ from the learned DPP and outputs the one $\hat{y}$ that achieves the highest consensus with the others, where the consensus can be measured by different evaluation metrics depending on the application. We use the F-score in our case. In particular,

$$\hat{y} \leftarrow \operatorname*{arg\,max}_{y_{t'} \in \mathcal{S}} \frac{1}{T} \sum_{t=1}^{T} \text{F-score}\big(y_{t'}, y_t\big). \tag{34}$$

Note that MBR inference introduces some degree of flexibility to the DPP (and to other probabilistic models): it allows users to infer the desired output according to different evaluation metrics. As a result, the selected subset is not necessarily the "true" most diverse subset, but is biased towards the users' specific interests.

D Video summarization
We provide details on 1) how to generate oracle summaries as the supervised targets for learning DPPs and 2) how to evaluate system-generated summaries against user summaries. We also present more results on balancing precision and recall through our large-margin DPP.
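As a preview of the oracle-construction procedure detailed in Section D.1 below, the greedy loop can be sketched as follows. Here `vsumm_score` is a hypothetical stand-in for the vsumm evaluation package of [24]: any function returning a scalar agreement between a candidate summary and the user summaries fits.

```python
def build_oracle(frames, user_summaries, vsumm_score):
    """Greedy oracle-summary construction (a sketch of Section D.1).

    `vsumm_score(summary, user_summaries)` is a hypothetical stand-in for
    the vsumm evaluation package [24]. Frames are added one at a time by
    largest marginal gain, stopping when the best gain becomes negative."""
    oracle = []
    remaining = list(frames)
    while remaining:
        base = vsumm_score(oracle, user_summaries)
        # marginal gain of each remaining candidate frame (cf. eq. (35))
        gains = [(vsumm_score(oracle + [i], user_summaries) - base, i)
                 for i in remaining]
        best_gain, best_i = max(gains)
        if best_gain < 0:  # stop once no frame improves the score
            break
        oracle.append(best_i)
        remaining.remove(best_i)
    return oracle
```

The same loop also covers the user-specific variant mentioned in D.1: one simply scores against a single user's summary instead of all five.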
D.1 Oracle summary
In the OVP dataset, each video comes with five user summaries $y^1, y^2, \cdots, y^5$ [24]. Similar to document summarization [3], we extract an "oracle" summary $y^\star$ from the five user summaries using a greedy algorithm. We initialize $y^\star = \emptyset$. From the frames not in $y^\star$, we pick the frame $i$ that contributes the largest marginal gain,

$$\text{vsumm}\big(y^\star \cup \{i\}, \{y^1, \cdots, y^5\}\big) - \text{vsumm}\big(y^\star, \{y^1, \cdots, y^5\}\big), \tag{35}$$

where vsumm is the package developed in [24] for evaluating video summarization results. We describe the evaluation scheme of vsumm in Section D.2. We select the oracle frames greedily for each video in this manner and stop when the marginal gain becomes negative. Evaluating the oracle summaries against the users', we find that they achieve high precision and recall, 84.1% and 87.7% respectively, validating that the oracle summaries can serve as good supervised targets for training DPP models.

The above procedure gives a "user-independent" definition of a good oracle summary for learning. Of course, if the application goal were to generate user-specific summaries catering to a particular user's taste, one would instead simply apply our framework with $y^\star$ set to that particular user's selection.

D.2 vsumm: evaluating video summarization results
We evaluate video summarization results using the vsumm package [24]. Given two sets of summary frames, it searches for the maximum number of matched pairs of frames between them. Two frames are viewed as a matched pair if their visual difference is below a certain threshold; vsumm computes this difference from normalized color histograms. In addition, each frame of one set can be matched to at most one frame of the other set, and vice versa. After the matching procedure, one can then define different evaluation metrics based on the number of matched pairs. In our experiments, we define the F-score, precision, and recall (cf. eq. (15) of the main text).
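The matching-based metrics above can be sketched as follows. This is our own simplified rendering of the vsumm protocol, not the package itself: the greedy one-to-one matching, the L1 histogram distance, and the threshold value are our assumptions.

```python
import numpy as np

def match_frames(hists_a, hists_b, threshold=0.5):
    """Greedy one-to-one matching between two sets of frames, a simplified
    sketch of the vsumm evaluation protocol [24]. Inputs are lists of 1-D
    normalized color histograms (numpy arrays); the L1 distance and the
    threshold are our assumptions. Each frame matches at most once."""
    used_b = set()
    matched = 0
    for ha in hists_a:
        for j, hb in enumerate(hists_b):
            if j in used_b:
                continue
            if np.abs(ha - hb).sum() < threshold:  # visual difference test
                used_b.add(j)
                matched += 1
                break
    return matched

def precision_recall_f(system, user, threshold=0.5):
    """Precision, recall, and F-score from the number of matched pairs
    (in the spirit of eq. (15) of the main text)."""
    m = match_frames(system, user, threshold)
    prec = m / len(system) if system else 0.0
    rec = m / len(user) if user else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f
```

For example, a two-frame system summary matching one of one user frame yields precision 0.5 and recall 1.0.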
D.3 More results on balancing precision and recall
We present more results here on balancing precision and recall through our large-margin trained DPPs (dpp_lme). By varying ω between 2^{-6} and 2^{6} in the generalized Hamming distance (cf. Section 3.2 in the main text), we obtain 8 pairs of (precision, recall) values. We interpolate uniformly among them and draw the precision-recall curve in Fig. 2. One can see that dpp_lme is able to control the characteristics of the DPP-generated summaries, biasing them toward either high precision or high recall without sacrificing the other too much. Though neither MLE nor VSUMM supplies such modeling flexibility, we include them in the figure for reference.

Besides, Fig. 3 shows some qualitative results. For this particular video, dpp_mle, dpp_lme with ω = 1, and dpp_lme with ω = 2^{6} all give rise to high recall. Their output summaries are rather lengthy, and may bore users who just want to grasp something interesting to watch. By turning the weight down to ω = 2^{-6}, our dpp_lme dramatically improves the precision to 76% (in contrast to the 48% of dpp_mle).
Figure 2: Balancing precision and recall. Through our large-margin DPPs (dpp_lme), we can balance precision and recall by varying ω in the generalized Hamming distance (cf. Section 3.2 in the main text). In contrast, neither MLE nor VSUMM (the two variants in [24] are plotted together) readily supports such flexibility.

[Figure 3 shows five rows of key frames, labeled DPP_LME + MKR (ω = 64), DPP_LME + MKR (ω = 1), DPP_LME + MKR (ω = 1/64), DPP_MLE + MKR, and Oracle, each annotated with its scores as printed: (F=63, P=48, R=97), (F=77, P=76, R=81), (F=88, P=88, R=91), (F=67, P=55, R=91), (F=63, P=48, R=97).]

Figure 3: Video summaries generated by dpp_mle and our dpp_lme with ω = 1, ω = 2^{6}, and ω = 2^{-6}.