PIGMIL: Positive Instance Detection via Graph Updating for Multiple Instance Learning
Dongkuan Xu∗, Jia Wu†, Wei Zhang‡, and Yingjie Tian§¶
∗School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China. Email: [email protected]
†QCIS, University of Technology, Sydney, NSW 2007, Australia. Email: [email protected]
‡School of Information, Renmin University of China, Beijing 100872, China. Email: [email protected]
§Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
¶Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China. Email: [email protected]
Abstract—Positive instance detection, especially for instances in positive bags (true positive instances, TPIs), plays a key role in multiple instance learning (MIL), which arises from classification problems where only bag-level (a set of instances) label information is provided. However, most previous MIL methods on this issue ignore the global similarity among positive instances and the fact that negative instances are non-i.i.d., so their TPI detection is often imprecise and sensitive to outliers. To this end, we propose Positive Instance detection via Graph updating for Multiple Instance Learning, called PIGMIL, to detect TPIs accurately. PIGMIL selects instances from the working sets (WSs) of some working bags (WBs) to form a positive candidate pool (PCP). The global similarity among positive instances and the robust discrimination of PCP instances from negative instances are measured to construct the consistent similarity and discrimination graph (CSDG). As a result, the primary goal (i.e., TPI detection) is transformed into PCP updating, which is approximated efficiently by updating the CSDG with a random walk ranking algorithm and an instance updating strategy. Finally, bags are transformed into feature representation vectors based on the identified TPIs to train a classifier. Extensive experiments demonstrate the high precision of PIGMIL's TPI detection and its excellent performance compared to classic baseline MIL methods.

Index Terms—Positive Instance Detection; Multiple Instance Learning; Graph Learning; True Positive Instance
I. INTRODUCTION
Multiple instance learning (MIL) was originally proposed for drug activity detection [1]. In contrast to the traditional classification problem, MIL classifies bags (sets of instances), labeling a bag positive or negative without full instance-level label information. Under the general MIL setting, a bag is labeled positive if it contains at least one positive instance; otherwise it is negative. However, the specific labels of individual instances in positive bags are unknown. Because of its ability to cope with instance label ambiguity, MIL has been applied to various tasks in pattern recognition and computer vision, e.g., image categorization [2], [3], object detection [4], [5], graph classification [6], [7], [8], and text categorization [9], [10], [11].
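The bag-labeling assumption just stated can be written down directly. A minimal sketch, using the paper's {+1, −1} label convention (the function name `bag_label` is our own illustrative choice):

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive (+1) iff it contains
    at least one positive instance; otherwise it is negative (-1)."""
    return +1 if any(y == +1 for y in instance_labels) else -1

print(bag_label([-1, +1, -1]))  # prints 1: one positive instance suffices
print(bag_label([-1, -1, -1]))  # prints -1: all instances negative
```

Note the asymmetry this creates: a negative bag certifies every one of its instances as negative, while a positive bag only certifies that at least one instance is positive.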
Fig. 1. Positive instances play an important role for CBIR based on MIL. An image of a tiger is divided into patches; each patch corresponds to an instance, and the image is considered a positive bag for tiger. Patches involving the tiger correspond to positive instances, i.e., TPIs, for this image. CBIR uses these instances to search retrieved images for a query image. There may also be an irrelevant result, such as the squirrel image, among the retrieved images because of FPIs.
The positive instances in positive bags are called true positive instances (TPIs), and the negative instances in positive bags (false positive instances) are called FPIs. The intrinsic problem of MIL is to determine whether a bag contains TPIs or not. A typical application of TPI detection is content-based image retrieval (CBIR) [12], whose main objective is to locate the regions of interest (ROIs) in images that show a great discriminative ability to label images. As shown in Figure 1, the image with a tiger is divided into several patches based on feature extraction methods. Under the MIL framework, the whole image is considered a positive bag and each patch is taken as an instance. The patches involving the tiger, called TPIs, correspond to ROIs and are significant in image retrieval for tiger. The remaining patches, called FPIs, provide little information for CBIR.

Fig. 2. Four bags are represented in feature space. Bags A and B are positive, represented as green ellipses, and bags C and D are negative, represented as red rhombuses. The true positive instances (TPIs), represented as green solid circles and enclosed by the polygonal yellow line in bags A and B, should be far from negative instances (represented by the blue double-headed arrows). Moreover, TPIs should also be similar to each other, which is represented by the straight yellow lines.

There are extensive studies on TPIs in MIL [1], [13], [14], [9], [2], [15], [16], [11]. APR [1] constructs an axis-parallel rectangle that encompasses as many instances from different positive bags as possible while minimizing the number of instances from negative bags; the rectangle is considered the region where TPIs are located. DD-based methods [14] extend the basic idea of APR: they try to recognize instances with a high diverse density value, i.e., instances near all positive bags while distant from negative bags, and regard these instances as TPIs.
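As a rough illustration of the diverse-density idea described above (a simplified sketch, not the exact estimator of [14]; `dd_score`, the Gaussian width `sigma`, and the bag format as tuples are our own assumptions), a candidate point scores high only if it is close to every positive bag through its nearest instance and far from all negative instances:

```python
import math

def dd_score(t, pos_bags, neg_bags, sigma=1.0):
    """Diverse-density-style score for a candidate instance t:
    the product collapses to ~0 as soon as t is far from any
    positive bag or close to any negative instance."""
    def k(a, b):  # Gaussian similarity between two feature vectors
        return math.exp(-sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / sigma ** 2)
    score = 1.0
    for bag in pos_bags:                        # near ALL positive bags
        score *= max(k(t, x) for x in bag)
    for bag in neg_bags:                        # far from ALL negative instances
        score *= math.prod(1.0 - k(t, x) for x in bag)
    return score
```

The multiplicative form also exposes the noise sensitivity criticized later in this section: a single nearby negative instance drives the whole product toward zero.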
SVM-based methods [9], [12] use SVMs to discriminate TPIs. mi-SVM [9] searches for an instance-level hyperplane such that each positive bag has at least one instance in the positive half-space while all instances of negative bags lie in the negative half-space. KI-SVM [12] proposes two convex optimization models at different levels based on SVM and maximizes the margin via the most violated key instance to locate key instances at different levels. MILD [15] focuses on the ambiguous information of instances in positive bags: it selects the instance with the highest maximum empirical precision from each positive bag as the TPI and, inspired by MILES [2], constructs a two-level classification scheme based on the selected TPIs. MILIS [16] first selects instance prototypes (IPs) using a Gaussian-kernel-based kernel density estimator on negative instances, then updates these IPs and trains a classifier in an iterative learning framework to construct the feature representation of each bag, and finally employs an SVM to classify new bags.

However, these common MIL methods for identifying TPIs have some disadvantages. APR showed high performance only for drug activity detection, because it is hard to construct such a rectangle accurately for data sets in other application contexts. DD-based methods [14] are sensitive to noise: the diverse density value decreases dramatically if some negative instances are nearby. Moreover, DD-based methods need to consider every instance in the positive bags, resulting in high computation cost. MILD [15] simply takes the instance with the highest empirical precision in each positive bag as the TPI; this empirical precision is calculated from all training bags and a threshold θ_t that is hard to determine. In general, most TPI detection methods for MIL either do not consider the similarity among TPIs or do not exploit it in depth.
Similarity among TPIs is of great significance for TPI detection because it reveals the intrinsic property of TPIs; however, observed similarity may also result from coincidental patterns that are irrelevant to the topic [17]. For instance, when we want to judge whether two images are similar because of the target content, the two images may instead be similar because they share irrelevant content. Such content corresponds to coincidental patterns that are not repetitive in feature space. This implies that a reliable similarity should be homogeneous across several parts, i.e., a global similarity. Moreover, for most methods the discrimination between TPIs and negative instances is not robust to outliers. This discrimination provides a reliable way to decide whether an instance is negative, because only negative instance labels are certain in MIL. Although most TPI detection methods utilize the difference between positive and negative instances, the respective influences of a far negative instance and a near negative instance on a TPI are not sufficiently characterized. The influence of a far negative instance on an instance x's label should decrease exponentially as it moves farther from x, and a near negative instance should sharply increase its influence on x's label as it moves closer to x, i.e., a robust discrimination. Furthermore, it is unnecessary to search all instances of all positive bags to identify TPIs, even though there is at least one TPI in each positive bag: the computation cost of searching all positive bags is too high, not every instance in positive bags is positive, and the TPIs of some positive bags may not be positive enough. Inspired by these observations, this paper proposes Positive Instance detection via Graph updating for Multiple Instance Learning (PIGMIL), whose core idea is to identify TPIs that are not only globally similar to each other but also robustly different from negative instances, as shown in Figure 2.
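The two requirements on a negative instance's influence, exponentially vanishing when far and sharply growing when near, can be sketched as a piecewise function of the distance. This is our own illustrative form with hypothetical steepness parameters `g1` and `g2`, not the paper's exact definition (which appears later as Definition 8):

```python
import math

def influence(delta, g1=1.0, g2=1.0):
    """Influence of a negative instance at distance `delta` on an
    instance x's label. More negative = stronger pull toward negative."""
    if delta == 0:
        return -math.inf                       # x coincides with a negative instance
    if delta >= 1.0:                           # "far" regime: influence decays
        return -math.exp(-g1 * (delta - 1.0))  # exponentially toward 0 (bounded)
    return g2 * math.log(delta) - 1.0          # "near" regime: grows sharply
```

The far branch is bounded below by −1, so a single distant negative outlier cannot dominate the score, while the near branch diverges, so a negative instance right next to x vetoes it as a TPI candidate.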
PIGMIL first determines and initializes working sets (WSs), working bags (WBs), and the positive candidate pool (PCP) to reduce computation cost and improve the accuracy of TPI detection. The original TPI detection is approximated by maximizing the global similarity among positive instances and the robust discrimination of positive instances from negative ones, based on the PCP. The resulting maximization problem is then handled on a consistent similarity and discrimination graph (CSDG) with a random walk algorithm and an instance updating strategy. Bags are embedded into an instance-based feature space and transformed into representation vectors by the TPIs to train the classifier. The main contributions of PIGMIL are summarized as follows:
1) The global similarity among positive instances is utilized. Combining the similarity (S) and its consistency (C) provides a global similarity (S + C) and avoids the misleading effect of coincidental patterns on TPI detection.
2) The robust discrimination of positive instances from negative instances is exploited. The discrimination (D) is robust to outliers, decreasing a far negative instance's influence sharply as it gets farther from an instance and putting exponentially more importance on near negative instances as they get closer.
3) WSs, WBs, and the PCP are determined to reduce computation cost and improve searching accuracy. The original objective of identifying TPIs is transformed into PCP updating and then approximated efficiently by updating the graph CSDG iteratively with an instance updating strategy.
In the rest of the paper, we define basic concepts and give an overview of PIGMIL in Section II; describe PIGMIL at length in Section III; conduct experiments in Section IV; provide a discussion in Section V; and draw conclusions in Section VI.

II. PROBLEM FORMULATION

In this section, we define some important notations, then provide a formal definition of the MIL problem.
Definition 1. (Instance and Bag) Let x_i = (x_{i1}, ..., x_{id})^T and X_j = (x_{j1}, ..., x_{jn_j}) denote an instance and a bag respectively, where d is the dimensionality, n_j is the number of instances of X_j, and x_{jk} is an instance belonging to X_j. Each instance and each bag are labelled with y ∈ {+1, −1} and L ∈ {+1, −1} respectively, where +1 indicates that the instance or bag is positive and −1 corresponds to negative [9].

Definition 2. (KDE_min) Based on KDE [18], KDE_min is defined as:

f_{KDE_min}(x) = (1 / (Z · N⁻)) Σ_{L_j = −1} min_{x_{ji} ∈ X_j} exp(−γ ‖x − x_{ji}‖²)   (1)

where N⁻ is the number of negative bags, and Z, γ are empirical parameters.
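Read literally, Eq. (1) averages, over the negative bags, each bag's smallest Gaussian kernel value with respect to x. A sketch under that reading (the squared Euclidean norm inside the kernel and the function name `f_kde_min` are our assumptions; instances are plain tuples):

```python
import math

def f_kde_min(x, neg_bags, Z=1.0, gamma=1.0):
    """KDE_min of Eq. (1): for each negative bag, take the minimum
    Gaussian kernel value between x and that bag's instances, then
    average over the negative bags and scale by 1/Z. Small values
    indicate x sits far from negative mass, i.e. a better TPI candidate."""
    def kern(a, b):
        return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    n_neg = len(neg_bags)
    return sum(min(kern(x, xi) for xi in bag) for bag in neg_bags) / (Z * n_neg)
```

This is the decision function ws(·) used below to build working sets: an instance joins WS_j when its f_KDE_min value falls below the threshold T_ws_j.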
Definition 3. (Working Set) The working set of bag X_j is WS_j = {(x_1, ..., x_{n_j}) | ws(x_k) ≤ T_{ws_j}, ∀k ∈ (1, ..., n_j)}, where ws(·) is a decision function that decides whether an instance belongs to the WS, T_{ws_j} is a threshold, and n_j is the size of WS_j.

Definition 4. (Working Bag) A positive bag X_j is called a working bag, denoted WB_j, iff the values of the instances in WS_j under the decision function ws(·) are not significantly worse than the values of the instances in the other positive bags' working sets.

Definition 5. (Positive Candidate Pool) A positive candidate pool is a group of instances PCP = {x*_{wb_1}, ..., x*_{wb_{n_w}}}, where x*_{wb_j} is an instance from the WS of WB_{wb_j} and n_w is the number of working bags.

Given a group of bags X = {X_1, ..., X_N}, each positive bag contains at least one positive instance, while all instances in negative bags are negative. The objective of MIL is to build a classification model from a training set with only bag labels to predict the labels of new bags. The overall framework includes three major steps, presented in Figure 3:
• Initialization: To improve the efficiency of updating the PCP and reduce computation cost, WSs and WBs are initialized first. By doing so, we can update the PCP from the most possibly positive instances. We take one instance from the WS of each WB based on KDE_min to initialize the PCP.
• PCP Updating: To maximize the global similarity among the instances in the PCP and the robust discrimination of these instances from negative ones, the CSDG is constructed. A random walk algorithm recognizes the instance in the PCP that shares the least similarity with the other instances and the least difference from negative instances, and that instance is replaced by a new one according to an instance updating strategy.
• Bag Classification: To label unknown bags, a bag classification scheme based on the updated PCP is constructed. The distance between a bag and each instance of the updated PCP is exploited to transform the bag into a feature representation vector. An SVM classifier is learned on these vectors.

III. THE PROPOSED METHOD: PIGMIL
A. Initialization
Initialization refers to initializing the WSs, WBs, and PCP. It is necessary to determine the useful positive bags and their useful instance candidates to construct the PCP for identifying TPIs, because the computation cost of searching all positive bags is too high, and not every instance in positive bags is positive enough, or positive at all.

Working Set: A working set (WS) contains the useful instance candidates of a positive bag. Instances in a WS have a high possibility of being TPIs. We take advantage of negative instances to figure out this possibility because only the labels of instances in negative bags are known. However, since negative instances may follow very general distributions, we adopt KDE_min (Eq. (1)) as the decision function ws(·) to capture the relationship between an instance and the negative ones when constructing a WS. In other words, instance x_i of bag X_j belongs to WS_j if f_{KDE_min}(x_i) ≤ T_{ws_j}.

Working Bag: Working bags (WBs) are the selected positive bags that are used to update the PCP. To determine all WBs among the positive bags, a T-test [19] is employed to check whether the average value (under ws(·)) of the instances in a positive bag's WS is significantly worse than the average value of all instances in the remaining positive bags' WSs. If it is, this positive bag is not considered a WB. In this paper, f_{KDE_min}(·) is chosen as ws(·).

Positive Candidate Pool: The positive candidate pool (PCP) includes instances from the WSs of the WBs, with only one instance chosen per WS. Instances in the PCP are considered positive, which means they should share high similarity among themselves and differ significantly from negative instances. The PCP is used to construct a bag classification scheme after it has been updated, i.e., once its instances are positive enough. Initially, the instance x with the lowest ws(·) value in WS_{wb_j} of WB_{wb_j} is chosen as x*_{wb_j}.

Fig. 3. A conceptual view of Positive Instance Detection via Graph Updating for Multiple Instance Learning (PIGMIL): The goal of PIGMIL is to construct a bag classification scheme to label a new bag based on the instances in the updated positive candidate pool (PCP), after working sets (WSs), working bags (WBs), and the PCP are initialized from the original data set. Specifically, the original data set is first preprocessed into bags (sets of instances) under MIL. To improve the accuracy and reduce the computation cost of searching for the true positive instances (TPIs), WSs, WBs, and the PCP (consisting of one instance from each WS (a)) are identified and initialized. Instances in the PCP are considered positive. Then, to discern the instance x_t in the PCP that is not positive and needs replacing, the consistent similarity and discrimination graph (CSDG) is built (b). x_t is identified by a random walk algorithm on the CSDG and updated by an instance updating strategy (c). Eventually, a bag is classified by a bag classification scheme in which bags are embedded into an updated-PCP-based feature space and transformed into feature vectors to train an SVM classifier.

B. PCP Updating
Some instances in the initialized PCP are not positive enough, or not actually positive. PCP updating means updating the instances in the PCP until they are positive enough in general, i.e., until they share high similarity among themselves and differ greatly from negative ones. The original PCP updating is a difficult combinatorial optimization problem, so we transform it into an approximation based on the consistent similarity and discrimination graph (CSDG). Additionally, an instance updating strategy is proposed to accelerate the updating.

Optimization Objective of Updating PCP: The goal of updating the PCP is to maximize the overall similarity of the instances in the PCP and their difference from negative instances. However, learning the overall similarity directly is a challenging task for most learning problems. We adopt a kind of pairwise similarity S to approximate it. To improve the approximation, the consistency C of S is employed to single out the S that is homogeneous across different parts. The difference of an instance from the negative instances is denoted D. Therefore, the original goal of updating the PCP can be formulated as finding the best labeling L for the training instances that maximizes S, C, and D for the instances in the PCP:

max_L Σ_{(x_i, x_j), x_k ∈ PCP} α (S(x_i, x_j) + C(x_i, x_j)) + β D(x_k)   (2)

where (x_i, x_j) is a pair of instances in the PCP, x_k is an instance in the PCP, α, β are balancing factors, and S, C, D are the similarity, consistency, and discrimination respectively. S + C indicates the global similarity and D indicates the robust discrimination.

Similarity:
Because only the labels of negative instances are known, we calculate the similarity S(x_i, x_j) between two instances based on how similarly they differ from negative instances. Inspired by [11], we use x_i as a positive instance, together with all the negative instances, to learn an SVM classifier. The imbalance between positive and negative instances is handled by resampling x_i. The confidence of x_j under the learned classifier is Υ_{i,j} = w_i^T · x_j, where w_i is the weight learned based on x_i.

Definition 6. (Similarity) The similarity between instances x_i and x_j is:

S(x_i, x_j) = ϕ(i, j) · ϕ(j, i) if Υ_{i,j} > 0 and Υ_{j,i} > 0, and 0 otherwise   (3)

where ϕ(j, i) stands for the rank order of x_j among the other instances whose confidence is positive when they are classified by w_i.

Consistency:
To improve the accuracy of the similarity, a consistency value is computed for each pairwise similarity. Sometimes the similarity between two objects may be confounded by coincidental patterns; the intrinsic similarity should therefore be consistent across several parts. In this paper, the size of the maximal quasi-clique containing the two instances is adopted as the consistency of their similarity:

Definition 7. (Consistency) In a graph Graph = (V, E), the consistency for v_i and v_j is the size of the maximal quasi-clique containing them, defined as:

C(v_i, v_j) = max_k {|Q_k|} if ∃k: v_i, v_j ∈ Q_k, and 0 if ∄k: v_i, v_j ∈ Q_k   (4)

where the Q_k are the different maximal quasi-cliques containing v_i and v_j.

A quasi-clique is an undirected graph Graph = (V, E) with |E| ≥ ⌊γ · C(|V|, 2)⌋ and 0 < γ ≤ 1 [20]; in this paper we set γ to 0.9. The vertexes in a quasi-clique share dense similarities among themselves, and a maximal quasi-clique is a quasi-clique to which no node can be added while keeping it a quasi-clique. In other words, the size of the maximal quasi-clique for two objects is large when their similarities are consistent, i.e., when several homogeneous similarities exist.

Discrimination:
Beyond being similar to each other, positive instances should also differ from negative ones. To measure the difference between an instance and the negative instances, inspired by the Gaussian-kernel-based kernel density estimator (KDE) [18], the discrimination of an instance is defined as:

Definition 8. (Discrimination) The discrimination of instance x from the negative instances is:

D(x) = (1 / (Z Σ_{j=1}^{N⁻} n_j)) Σ_{L_j = −1} Σ_{i=1}^{n_j} d(x, x_{ji})   (5)

d(Δ) = −exp[−γ₁(Δ − 1)] if Δ ≥ 1; γ₂ ln Δ − 1 if 1 > Δ > 0; −∞ if Δ = 0   (6)

where Δ = Δ(x, x_{ji}) is a distance function between x and x_{ji}, N⁻ is the number of negative bags, n_j is the number of instances in bag X_j, L_j = −1 indicates that bag X_j is negative, and Z, γ₁, γ₂ are positive empirical parameters.

In general, when determining how likely x is to be negative based on the known negative instances, we should consider: 1) the closer a negative instance is to x, the more influence it has on x's label; 2) far negative instances should not exert much influence. In other words, D should be robust to outliers.

Theorem 1. The influence on D (defined in Eq. (5)) of a far / close negative instance decreases / increases sharply as the negative instance moves farther / closer, and the influence of a far negative instance is bounded.

Proof. For an instance x_i and a negative one x_j, suppose that x_j is a far negative instance for x_i if Δ = Δ(x_i, x_j) ≥ 1; otherwise, x_j is a close one for x_i.

If x_j is a far negative instance for x_i, i.e., Δ ≥ 1, then according to the definition of d in Eq. (6) the influence of x_j on x_i is −exp[−γ₁(Δ − 1)]. Its first derivative with respect to Δ is γ₁ exp[−γ₁(Δ − 1)] > 0 and its second derivative is −γ₁² exp[−γ₁(Δ − 1)] < 0, so −exp[−γ₁(Δ − 1)] is a monotonically increasing and concave function with value range [−1, 0). Therefore x_j's influence on x_i decreases sharply as Δ = Δ(x_i, x_j) increases and is bounded within [−1, 0).

If x_j is a close negative instance for x_i, the influence of x_j on x_i is γ₂ ln Δ − 1. Its first derivative is γ₂/Δ > 0 and its second derivative is −γ₂/Δ² < 0. Therefore x_j's influence on x_i increases sharply as Δ decreases.

As shown in Figure 4, d(·) increases along with Δ. The influences of close and far negative instances are dealt with separately, and the farther a negative instance is, the exponentially less it contributes to d(·). Since D consists of many d(·) terms, D is robust to outliers and puts exponentially more importance on near negative instances as they get closer. ∎

Fig. 4. The function image of d(Δ), a monotonically increasing function of Δ. Suppose that x_j is a far negative instance for x_i if Δ = Δ(x_i, x_j) > 1 and a near negative instance if Δ < 1. Tag '1' corresponds to the value of d(·) when Δ is moderate. Tag '2' indicates the situation where d(Δ)'s increase slows down sharply as a far negative instance of x_i gets farther, with limit value 0, which means d(Δ) is robust to outliers. Tag '3' indicates the situation where d(Δ) decreases exponentially (equivalently, shows a more significant influence) as a near negative instance of x_i gets closer, which means exponentially more importance is put on regarding x_i as negative when it is closer to a near negative instance.

Approximation of the Objective Based on CSDG: Different instances in the PCP come from different positive bags, and finding the best labeling L directly for Eq. (2) is a difficult combinatorial optimization problem. In this section, we approximate the original goal in Eq. (2) by maximizing the total ranking score of the instances in the PCP based on the consistent similarity and discrimination graph (CSDG).

Definition 9. (Consistent Similarity and Discrimination Graph (CSDG)) CSDG = (V, E) is an undirected weighted graph where vertex v_i ∈ V corresponds to instance x_{wb_i} in the PCP, and there is an edge e_ij ∈ E between v_i and v_j on the condition that S(x_i, x_j) > 0 and i ≠ j. The weight of e_ij is E(v_{x_i}, v_{x_j}) = max{0, S(x_i, x_j) + α₁ C(x_i, x_j) + β₁ D(x_i, x_j)}, where D(x_i, x_j) = min{D(x_i), D(x_j)} and α₁, β₁ are two balance factors.

When an instance x in the PCP is likely to be negative, the importance of the role it plays in the CSDG should be reduced, i.e., the weights of the edges containing x should be decreased. This is because the instances in the PCP are considered positive, and the edges in the CSDG correspond to similarities between positive instances. S(x_i, x_j) and C(x_i, x_j) represent the similarity between two instances from the global structure, and D(x_i, x_j) = min{D(x_i), D(x_j)} means the similarity should be decreased if either of the two vertexes of edge e_ij is likely to be negative. We approximate the optimization problem in Eq. (2) as a combinatorial problem of maximizing E(v_{x_i}, v_{x_j}) over the CSDG, formulated as:

max_V Σ_{(v_{x_i}, v_{x_j}) ∈ CSDG} E(v_{x_i}, v_{x_j})
s.t. Σ_{(v_{x_p}, v_{x_q})} E(v_{x_p}, v_{x_q}) ≥ Σ_{(v_{x_p}, v_{x_k})} E(v_{x_p}, v_{x_k}), (v_{x_p}, v_{x_q}) ∈ CSDG, ∀x_k ∈ (X_W \ X_CSDG)   (7)

where v_x is the vertex in the CSDG corresponding to instance x in the PCP, X_W is the union of the WSs of all WBs, V is the set of vertexes corresponding to X_W, and X_CSDG is the set of instances that the vertexes of the CSDG correspond to.

Instance Updating Strategy: The intuitive way to solve the problem in Eq. (7) is to replace the vertexes in the CSDG with vertexes corresponding to the remaining instances in X_W \ X_CSDG iteratively until the maximum is reached. However, this is hard and time-consuming. An approximate way is to rank the vertexes in the CSDG, regard the vertex with the lowest ranking score as the one to be replaced, and then search for the most suitable substitute instance for the replaced one.
Ranking Instances in the PCP: We propose a random walk algorithm based on PageRank [21], summarized in Algorithm 1, to rank the vertexes of the CSDG. The intuition is that vertexes connected to high-ranking vertexes by high-weight edges should themselves receive high ranking scores; the higher a vertex's ranking score, the more positive its corresponding instance is considered to be. This is because the edge weights combine the similarity among positive instances and the discrimination from negative instances, so a vertex is considered positive with higher probability if it is connected to more high-ranking vertexes by high-weight edges, as shown in Figure 5.

The vertex with the lowest ranking score is chosen for replacement in each iteration phase on the CSDG. The ranking score of each vertex is calculated iteratively by:

R^{k+1} = (1 − d) · Υ + d · [E]_{M×M} · R^k,  Υ = (Υ_{1,1}, ..., Υ_{M,M})'   (8)

where M is the number of vertexes in the CSDG, k indicates the k-th phase, d is a damping factor, R^k = (r_{v_1}, ..., r_{v_M})' is the vector of ranking scores in the k-th phase, Υ_{i,i} is the self-confidence value of instance x_i from the similarity computation, E(i, j) is the normalized E(v_{x_i}, v_{x_j}) and equals E(j, i), and E(i, j) = 0 if there is no edge between vertexes v_i and v_j.

Algorithm 1 CRS: Calculate Ranking Score
Input: CSDG = (V, E): consistent similarity and discrimination graph defined in Definition 9; d: a damping factor; n_max_ite: the maximal number of iterations;
Output: R = (r_{v_1}, ..., r_{v_M})': the ranking scores of all vertexes in the CSDG;
  n_ite ← 0; R: initialized randomly;
  Υ_{i,i} ← w_i^T · x_i: calculate the confidence value of each vertex;
  for all (i, j) ∈ {(p, q) | p, q ∈ (1, ..., M)} do
    if e_ij exists then E(i, j) ← E(v_{x_i}, v_{x_j}) else E(i, j) ← 0
    Normalize E(i, j)
  end for
  while n_ite ≤ n_max_ite do
    R^{n_ite+1} ← (1 − d)[Υ]_{M×1} + d [E]_{M×M} · R^{n_ite};
    n_ite ← n_ite + 1
  end while
  return R;

It is noteworthy that: 1) A vertex's ranking score r_{v_k} is determined by its confidence value Υ_{k,k} and the ranking scores of its adjacent vertexes. Υ_{k,k} represents, to a certain degree, the probability of instance x_k being classified as positive, and the influence of the adjacent vertexes' ranking scores is transmitted by E(i, j), which captures the differences in the relationships among vertexes. 2) The random walk algorithm is practicable: the CSDG can be regarded as a bidirectional weighted graph without self-loops because the edge weight E(v_{x_i}, v_{x_j}) is symmetric and no vertex connects to itself. R is initialized randomly, and the iteration stops when the maximal iteration number is reached. Once all vertexes have ranking scores, the instance corresponding to the vertex with the lowest score is regarded as the least positive instance.

Instance Updating:
After the least positive instance x_t has been identified, it needs to be replaced with a new one from X_W \ X_CSDG. The whole instance updating strategy is summarized in Algorithm 2. It is intuitive to first look for a new instance in x_t's own working set WS_T, because there is at least one positive instance in each positive bag, so a more positive instance should exist in WS_T if x_t is not positive enough. Therefore, we replace x_t with each instance in WS_T in turn and calculate the sum of all vertexes' ranking scores in the CSDG. There are two cases:

i) No instance in WS_T makes the total score higher than x_t does. In this case, we update another vertex of the CSDG. Specifically, denoting x_t's vertex in the CSDG by v_t, we sort the vertexes of the CSDG in increasing order of ranking score and choose the vertex just after v_t for replacement; it then returns to the beginning of the updating strategy. Notably, if v_t is last in the order, the updating process terminates.

ii) Some instance x_t' in WS_T makes the total score higher than x_t does. In this case, x_t' is selected as the substitute for x_t. Specifically, denoting x_t''s vertex by v_t', we rank v_t' together with the rest of the vertexes of the CSDG based on Eq. (8) in increasing order. After ranking: A) if v_t' is first in the order, we choose the second vertex in the order for replacement; B) if v_t' is not first, we choose the first vertex for replacement. In either case it then returns to the beginning of the updating strategy. An actual update corresponds to the process of v_t being replaced by v_t'. The instance updating process also terminates when it reaches the maximal number of actual updates.

Fig. 5. Ten vertexes with different ranking scores (labelled with different numbers and colors) are connected by edges of different weights (labelled with different colors: A, B, and C) after the random walk algorithm is applied. The higher the ranking score, the more positive the vertex is considered to be. The red circle labelled '1' is considered to play the most important role and to be the most positive one in the network because it is connected to the largest number of vertexes by relatively high-weight edges. The green circle on top labelled '5' and the dark red one on the bottom right labelled '2' are connected to the same number of vertexes but possess different ranking scores because of different edge weights.

C. Bag Classification
The bag classification scheme is built on the instances in the updated PCP, denoted T^+. The basic idea is to embed bags into a feature space based on T^+, using the distance between a bag and each instance in T^+ to represent the bag. For bag X_t, the feature representation vector is:

Z_t = [w(X_t, x_1^+), w(X_t, x_2^+), ..., w(X_t, x_M^+)]'   (9)

where x_i^+ ∈ T^+, M is the number of instances in the PCP, and w(X_t, x_i^+) is the distance between X_t and x_i^+ based on the Hausdorff distance metric:

w(X_t, x_i^+) = max_{x_tj ∈ X_t} exp(−γ_d ‖x_tj − x_i^+‖)   (10)

where γ_d is an empirical parameter. By this definition of the feature vector, a bag's label is determined by its instance nearest to T^+, i.e., the bag is labelled positive if one of its instances is similar to any instance in T^+. w(X_t, x_i^+) thus also satisfies the basic MIL assumption that there is at least one positive instance in each positive bag.

Algorithm 2 IUS: Instance Updating Strategy
Input: CSDG: initialized; n_max_upd: the maximal updating number;
Output: CSDG: updated;
R = (r_v1, ..., r_vM)' ← invoke CRS (Algorithm 1);
v_t: the vertex in the CSDG with the lowest ranking score;
n_update ← 0;
while n_update ≤ n_max_upd do
  x_t: the corresponding instance of v_t;
  x_t': x_t' ∈ WS_T, with v_t' the vertex corresponding to x_t';
  if ∄ x_t' s.t. (Σ_{i=1}^{M} r_vi)(v_t') > (Σ_{i=1}^{M} r_vi)(v_t) then
    R(v_t) ← invoke CRS (Algorithm 1);
    sort the vertexes in increasing order according to R(v_t);
    if v_t is last in the order then return CSDG;
    else v_t ← v_t+, where v_t+ is just after v_t in the order;
  else
    replace v_t with v_t' in the CSDG;
    R(v_t') ← invoke CRS (Algorithm 1);
    sort the vertexes in increasing order according to R(v_t');
    if v_t' is first in the order then v_t ← v_2nd, the second vertex in the order;
    else v_t ← v_1st, the first vertex in the order;
    replace v_t' with v_t in the CSDG;
    n_update ← n_update + 1;
end while
return CSDG;

In the end, the MIL setting is transformed into a standard single-instance learning problem in which a classifier is trained on these vectors and their labels. An SVM classifier is employed, and a new bag is classified as:

L_t = sgn(G_bag(Z_t))   (11)

where G_bag(·) is the learned decision function. The whole procedure of PIGMIL is summarized in Algorithm 3.

IV. EXPERIMENTS
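The experiments that follow rely on the bag-level embedding of Section III-C. As a rough illustration of Eqs. (9)-(11), the embedding and classification steps might look like the sketch below; this is our own rendering under stated assumptions (NumPy/scikit-learn, an already-identified instance set T^+, γ_d = 1, and illustrative function names), not the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

GAMMA_D = 1.0  # empirical parameter gamma_d of Eq. (10); value assumed


def embed_bag(bag, t_plus):
    """Map a bag (n_instances x d array) to the M-dim vector of Eq. (9).

    Each coordinate is the Hausdorff-style similarity of Eq. (10):
    w(X_t, x_i^+) = max_j exp(-gamma_d * ||x_tj - x_i^+||).
    """
    # Pairwise distances between every instance in the bag and every x_i^+.
    dists = np.linalg.norm(bag[:, None, :] - t_plus[None, :, :], axis=2)
    # Max over the bag's instances: one coordinate per instance in T^+.
    return np.exp(-GAMMA_D * dists).max(axis=0)


def train_and_predict(train_bags, train_labels, test_bags, t_plus):
    """Train a linear SVM on embedded bags and label test bags per Eq. (11)."""
    Z_train = np.array([embed_bag(b, t_plus) for b in train_bags])
    clf = LinearSVC().fit(Z_train, train_labels)
    Z_test = np.array([embed_bag(b, t_plus) for b in test_bags])
    return clf.predict(Z_test)
```

The max over instances in Eq. (10) is what lets a single instance close to T^+ make the whole bag look positive, in line with the standard MIL assumption.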
A. Data Sets, Baseline Methods, and Experimental Settings
Three synthetic MIL data sets (BASIC, RHOMBUS, and RING, shown in Figures 6(a)-6(c) respectively) are constructed to verify PIGMIL's ability to detect TPIs. Each data set contains 20 positive and 20 negative bags; each positive bag contains 4 positive and 4 negative instances, and each negative bag contains 8 negative instances. Negative and positive instances are generated from a uniform distribution and a normal distribution, respectively. Three kinds of real-world MIL data sets are used to verify PIGMIL's classification accuracy against classic MIL methods: 1) Musk-1 and Musk-2 [9]. 2) Elephant, Fox, and Tiger [9]. 3) UCSB Breast [22], an image classification task used in tissue microarray (TMA) based diagnosis of malignant breast cancer; each image (bag) is split into an equal-sized grid (instances), and the goal is to determine whether an image is benign or malignant.

To evaluate PIGMIL's ability to detect TPIs on the synthetic data sets and its performance on the real-world data sets, the following baseline methods are implemented:

Algorithm 3 PIGMIL: Positive instance detection via graph updating for multiple instance learning
Input:
Training Set: TR = {(X_1^TR, L_1^TR), ..., (X_{N_TR}^TR, L_{N_TR}^TR)} ∈ X^d × {+1, −1};
Test Set: TE = {X_1^TE, ..., X_{N_TE}^TE} ∈ X^d, where X is the instance space;
Output:
The labels of the test set {L_1^TE, ..., L_{N_TE}^TE} ∈ {+1, −1};
// Initialization (Section III-A):
WS_j ← {(x_jk_1, ..., x_jk_{n_ws}) | f_KDEmin(x_ji) ≤ T_ws_j, x_ji ∈ X_j^TR, L_j^TR = +1}, where n_ws is the size of WS_j;
WB_j ← {X_j^TR | t_j ≥ h_Ttest}, where t_j is the t-value for X_j^TR and h_Ttest is a threshold value;
PCP ← {(x*_p_1, ..., x*_p_{n_wb}) | x*_p_j = argmin_{x ∈ WS_p_j} f_KDEmin(x)}, where X_P_j^TR is a WB and WS_p_j is its working set;
// PCP Updating (Section III-B):
Initialized CSDG ← construct the CSDG based on the PCP: S(x_i, x_j), C(x_i, x_j), and D(x_i, x_j);
Updated CSDG (updated PCP) ← invoke IUS (Algorithm 2);
// Bag Classification (Section III-C):
L_t^TE ← sgn(G_bag(Z_t^TE)): transform X_t^TR and X_t^TE into Z_t^TR and Z_t^TE respectively, and employ an SVM classifier learned on (Z_t^TR, L_t^TR) to classify Z_t^TE;
return L_t^TE;

I: The TPI-based methods
1. APR: The first method designed for the MIL problem; it constructs an axis-parallel rectangle that tries to cover as many positive instances as possible [1].
2. DD: Recognizes the instances with the highest DD values and regards them as TPIs [14].
3. MILD: Utilizes the ambiguity information of instances in positive bags to distinguish the true positive instances, with two feature representations [15].
4. mi-Sim: Learns the similarity between instances in positive bags combined with the similarity's consistency [11].
5. KDE_min: (For TPI detection) The instance with the lowest f_KDEmin(x) (defined in Equation (1)) in each positive bag makes up the TPIs.
6. KDE: (For TPI detection) The instance with the lowest f_KDE(x) in each positive bag makes up the TPIs [18].
7. KDE_max: (For TPI detection) The instance with the lowest f_KDEmax(x) in each positive bag makes up the TPIs.

II: The non-TPI-based methods
1. Citation kNN: Applies the k-nearest-neighbor method to MIL and defines a bag-level distance between bags based on the minimum Hausdorff distance [13].
2. MI-Kernel: Applies the set kernel method to bags represented by sets of feature vectors [23].
3. MILES: Tries to discriminate target instances and measures similarity between bags according to their closeness to target instances [2].
4. miGraph: Assumes that instances in a bag are non-i.i.d. and takes advantage of a graph kernel [10].
5. Clustering MIL: Constructs a 'concept' (a spherical area) by clustering all positive instances; instances located in the concept are labelled positive [24].
6. MInD (Hausdorff): A MIL framework that uses the Hausdorff distance to measure the difference between bags [25].

f_KDE(x) = (Z N^-)^{-1} Σ_{j=1}^{L^-} Σ_{x_ji ∈ X_j} exp(−γ ‖x − x_ji‖)
f_KDEmax(x) = (Z N^-)^{-1} Σ_{j=1}^{L^-} max_{x_ji ∈ X_j} exp(−γ ‖x − x_ji‖)

TABLE I
Comparison of TPI detection accuracy (%) with the average one on BASIC, RHOMBUS, and RING. The highest accuracy for each data set is in bold. Columns: PIGMIL, DD, MILD, mi-Sim, KDE_min, KDE, KDE_max; rows: BASIC, RHOMBUS, RING.
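For concreteness, the scores used by the KDE_min, KDE, and KDE_max baselines compared in Table I can be sketched as below. This is our reading of the footnoted formulas, with γ and the normalizer Z treated as constants and all function names our own; it is an illustrative sketch, not the authors' code.

```python
import numpy as np

GAMMA = 1.0   # kernel width gamma; value assumed
Z_NORM = 1.0  # normalizing constant Z; value assumed


def kde_scores(x, negative_bags, reduce="sum"):
    """Density-style score of instance x with respect to the negative bags.

    reduce="sum" -> f_KDE, "max" -> f_KDE_max, "min" -> f_KDE_min
    (per the formulas above; normalization follows our reading).
    """
    per_bag = []
    for bag in negative_bags:
        # Kernel values between x and every instance of one negative bag.
        k = np.exp(-GAMMA * np.linalg.norm(bag - x, axis=1))
        per_bag.append({"sum": k.sum(), "max": k.max(), "min": k.min()}[reduce])
    n_neg = sum(len(b) for b in negative_bags)  # total negative instances N^-
    return sum(per_bag) / (Z_NORM * n_neg)


def pick_tpi(positive_bag, negative_bags, reduce="sum"):
    """Baseline TPI pick: the instance of the bag with the LOWEST score."""
    scores = [kde_scores(x, negative_bags, reduce) for x in positive_bag]
    return positive_bag[int(np.argmin(scores))]
```

An instance far from all negative instances gets a low score, so the per-bag argmin recovers the candidate most unlike the negatives, which is the selection rule these baselines share.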
All reported results are based on 5 runs of 10-fold cross-validation. All features are normalized so that each feature has zero mean and unit variance. A linear-kernel SVM is selected as the classifier. α and β (in Equation (2)) are set to max(10, log(C(x_i, x_j)/S(x_i, x_j))) and max(10, log(α·(S + C)/D)), respectively. We set the size of each WS to 40% of a positive bag, i.e., T_ws_j is set to the 40th-percentile threshold of f_KDEmin(·). The quantile used when selecting WBs is set to 1.5, which corresponds to the 90% confidence level. The two γ parameters (Equation (5)), Z (Equation (6)), and γ_d (Equation (10)) are set to 1. d (Equation (8)) is set to 0.8. n_max_ite (Algorithm 1) and n_max_upd (Algorithm 2) are 10 and 20, respectively. Moreover, all methods are executed on an Intel Core 2 Duo CPU (2.10 GHz) PC.

Because the labels of individual instances in positive bags are unknown for the real-world data sets but known for the synthetic ones, we test PIGMIL's TPI detection ability on the synthetic data sets and compare it with that of the baseline methods.

B. TPI Detection Comparison on Synthetic Data Sets
Figure 6 presents the details of BASIC, RHOMBUS, and RING. Figure 7 reports PIGMIL's detection of TPIs on these data sets. According to Figures 6(a), (b), and (c), the WSs contained most of the positive instances in the positive bags, which verifies the validity of the WSs. According to Figures 7(a), (b), and (c), many negative instances (encompassed by black dashed circles) that had been in the WSs of Figure 6 were excluded from the PCP. The ratio of positive to negative instances in the PCP was higher than that in the WSs of Figure 6, which demonstrates PIGMIL's strong ability to detect TPIs.

Table I reports the TPI detection comparison for PIGMIL, DD, MILD, mi-Sim, KDE_min, KDE, and KDE_max. Overall, PIGMIL achieved the best performance on all three data sets, demonstrating its flexibility to data sets of different shapes. DD and MILD performed poorly on BASIC, mainly because its positive and negative instances are close. The poor performance of KDE and KDE_max on RING is mainly because its TPIs are highly centralized while its negative instances are highly dispersed.

C. Accuracy Comparison on Real-World Data Sets
We choose Musk-1, Musk-2, Elephant, Fox, Tiger, and UCSB to test PIGMIL's classification accuracy on real-world data sets.

Fig. 6. Three synthetic MIL data sets: 1) BASIC is linearly separable, and the negative instances in negative bags arise from the same uniform distribution; negative and positive instances in positive bags arise from another uniform distribution and a normal distribution, respectively. 2) RHOMBUS is linearly inseparable; the positive and negative instances arise from two normal distributions and two uniform distributions, respectively, and its negative instances in positive bags are randomly selected from the negative instance set. 3) RING is linearly inseparable; positive instances arise from a normal distribution located at the center, and negative ones arise from a uniform distribution located in the area between two concentric circles. Instances labelled with a red '×' construct the working sets (WSs). The size of each WS is 60% of a positive bag's size. The critical value of t for selecting working bags (WBs) is set to 1.5.

Fig. 7. PIGMIL's detection of TPIs on BASIC, RHOMBUS, and RING. Instances labelled with a red '+' construct the PCP. Black dashed circles correspond to the negative instances that are not in the PCP but were in the WSs (shown in Figure 6).

TABLE II
Comparison of classification accuracy (%, mean ± standard deviation) on the real-world data sets. The highest accuracy for each data set is in bold. Columns: PIGMIL, APR, MILD, mi-Sim, Citation kNN, MI-Kernel, MILES, miGraph, Clustering MIL, MInD (Hausdorff); rows: Musk-1, Musk-2, Elephant, Fox, Tiger, UCSB.

PIGMIL is compared to APR, MILD, mi-Sim, Citation kNN, MI-Kernel, MILES, miGraph, Clustering MIL, and MInD (Hausdorff). The accuracy comparison is reported in Table II, where PIGMIL achieved competitive performance. PIGMIL outperformed most other methods, especially on Elephant, Fox, and Tiger, possibly because the TPIs of these data sets (regions of interest, ROIs) are easier to discriminate than those of the others, such as a specific drug molecule shape in Musk-1 and Musk-2.

V. DISCUSSION
A. Sensitivity to Global Similarity (S + C) and Robust Discrimination (D)

PIGMIL can capture the global similarity (S + C) of TPIs and their robust discrimination (D) from negative instances. To measure the influence of S + C and D on TPI detection accuracy, we changed the ratio of D to S + C (scaling D to different levels). Figures 8(a), (b), and (c) present the change of TPI detection accuracy with different ratios of S + C and D on BASIC, RHOMBUS, and RING, respectively.

Fig. 8. TPI detection accuracy of PIGMIL with different ratios of the global similarity (S + C) of TPIs and the robust discrimination (D) on BASIC, RHOMBUS, and RING. Specifically, a ratio of '2' indicates that D is scaled to twice the order of magnitude of S + C.

Fig. 9. (a) TPI detection accuracy of PIGMIL with different noise levels on BASIC, RHOMBUS, and RING. Specifically, noise level '3' indicates that the labels of 20% of the positive instances are changed to negative and the labels of 30% of the negative instances are changed to positive. (b) TPI detection accuracy of PIGMIL with different sizes of the working set (WS) on BASIC, RHOMBUS, and RING. '40%' indicates that the size of the WS is set to 40% of its corresponding working bag (WB).

According to Figure 8(a), the accuracy increased as the ratio grew, which indicates that D contributed more to the accuracy than S + C on this kind of data set. In Figure 8(b), the highest accuracy was reached when S + C and D were at the same order of magnitude, mainly because the TPIs and negative instances are symmetrical, so S + C and D play similarly important roles. Figure 8(c) indicates that increasing D contributed to higher accuracy, although the contribution was limited.

B. Sensitivity to Noise
To evaluate PIGMIL's ability to cope with noise, we added noise to BASIC, RHOMBUS, and RING. In Figure 9(a), the noise level indicates how many instances' labels are changed. According to Figure 9(a), the accuracy decreased as the noise level increased. However, the decrease slowed down at higher noise levels (e.g., the drop in accuracy when the noise level changed from '4' to '5' was smaller than when it changed from '2' to '3'), which demonstrates PIGMIL's ability to cope with noise. Moreover, the accuracy decreased more sharply on RING than on BASIC and RHOMBUS. This is because the TPIs of RING are more centralized and differ more strongly from the negative instances than those of BASIC and RHOMBUS, so it was harder for PIGMIL to detect TPIs on RING when TPIs were labelled negative.
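The label-flipping noise protocol just described might be implemented as follows. This is a minimal sketch: the per-level fractions follow our reading of the Fig. 9 caption (level k flips (k−1)·10% of positives and k·10% of negatives), and all function names are ours.

```python
import numpy as np


def flip_labels(labels, frac, target, rng):
    """Flip `frac` of the instances whose label equals `target` (+1 or -1)."""
    labels = labels.copy()
    idx = np.flatnonzero(labels == target)       # instances eligible to flip
    n_flip = int(round(frac * len(idx)))
    chosen = rng.choice(idx, size=n_flip, replace=False)
    labels[chosen] = -target                     # flip to the opposite class
    return labels


def add_noise(labels, level, rng):
    """Noise level k: flip (k-1)*10% of positives and k*10% of negatives
    (our assumed reading of the paper's noise-level definition)."""
    noisy = flip_labels(labels, (level - 1) / 10.0, +1, rng)
    return flip_labels(noisy, level / 10.0, -1, rng)
```

Flipping the two classes at different rates, as the caption describes, makes the positive class both smaller and more contaminated, which is what stresses TPI detection on RING the most.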
C. Sensitivity to Size of Working Set (WS)

We changed the size of the WS to evaluate its influence on PIGMIL's TPI detection accuracy. Figure 9(b) reports the accuracy for different WS sizes on BASIC, RHOMBUS, and RING, where the working set percentage indicates the size of the WS relative to its corresponding working bag (WB). For BASIC and RHOMBUS, the highest accuracy was reached when the size of the WS was about 40% (of a positive bag). This is because some instances in positive bags are false positive instances (FPIs), which provide little information for detecting TPIs if they are included in the WS. For RING, the accuracy did not change significantly with the WS size. This is because the TPIs in RING differ markedly from the negative instances (including FPIs), so the TPIs are included in the WS even when it is small, let alone when it is big.

VI. CONCLUSION
Positive instance detection is key to MIL. Various methods have been developed for this issue, but they suffer from disadvantages such as ignoring the global similarity among positive instances and the non-i.i.d. nature of negative ones. To this end, positive instance detection via graph updating for multiple instance learning (PIGMIL) is proposed. PIGMIL first constructs a positive candidate pool (PCP) from the working sets (WSs) of some working bags (WBs) to transform positive instance detection into an optimization problem. Then, based on a consistent similarity and discrimination graph (CSDG), this problem is solved approximately by an instance updating strategy. Finally, a bag classification scheme is constructed to classify new bags. Extensive experiments demonstrated PIGMIL's strong ability to detect TPIs and showed that it outperformed the baseline methods.

REFERENCES
[1] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, "Solving the multiple instance problem with axis-parallel rectangles," Artificial Intelligence, vol. 89, no. 1, pp. 31–71, 1997.
[2] Y. Chen, J. Bi, and J. Z. Wang, "MILES: Multiple-instance learning via embedded instance selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 1931–1947, 2006.
[3] W. Shen, X. Bai, Z. Hu, and Z. Zhang, "Multiple instance subspace learning via partial random projection tree for local reflection symmetry in natural images," Pattern Recognition, vol. 52, pp. 306–316, 2016.
[4] C. Zhang, J. C. Platt, and P. A. Viola, "Multiple instance boosting for object detection," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2005, pp. 1417–1424.
[5] Y. Yi and M. Lin, "Human action recognition with graph-based multiple-instance learning," Pattern Recognition, vol. 53, pp. 148–162, 2015.
[6] J. Wu, X. Zhu, C. Zhang, and Z. Cai, "Multi-instance multi-graph dual embedding learning," in Proceedings of the IEEE 13th International Conference on Data Mining (ICDM), 2013, pp. 827–836.
[7] J. Wu, X. Zhu, C. Zhang, and P. S. Yu, "Bag constrained structure pattern mining for multi-graph classification," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 10, pp. 2382–2396, 2014.
[8] J. Wu, S. Pan, X. Zhu, and Z. Cai, "Boosting for multi-graph classification," IEEE Transactions on Cybernetics, vol. 45, no. 3, pp. 430–443, 2015.
[9] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2002, pp. 561–568.
[10] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li, "Multi-instance learning by treating instances as non-i.i.d. samples," in Proceedings of the International Conference on Machine Learning (ICML), 2009, pp. 1249–1256.
[11] M. Rastegari, H. Hajishirzi, and A. Farhadi, "Discriminative and consistent similarities in instance-level multiple instance learning," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 740–748.
[12] Y.-F. Li, J. T. Kwok, I. W. Tsang, and Z.-H. Zhou, "A convex method for locating regions of interest with multi-instance learning," in Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 15–30.
[13] J. Wang and J.-D. Zucker, "Solving multiple-instance problem: A lazy learning approach," in Proceedings of the International Conference on Machine Learning (ICML), 2000, pp. 1119–1126.
[14] Q. Zhang and S. A. Goldman, "EM-DD: An improved multiple-instance learning technique," in Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2001, pp. 1073–1080.
[15] W.-J. Li and D.-Y. Yeung, "MILD: Multiple-instance learning via disambiguation," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 76–89, 2010.
[16] Z. Fu, A. Robles-Kelly, and J. Zhou, "MILIS: Multiple instance learning with instance selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 958–977, 2011.
[17] A. Shrivastava, T. Malisiewicz, A. Gupta, and A. A. Efros, "Data-driven visual similarity for cross-domain image matching," in ACM Transactions on Graphics (TOG), vol. 30, no. 6. ACM, 2011, p. 154.
[18] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[19] B. J. Winer, D. R. Brown, and K. M. Michels, Statistical Principles in Experimental Design. McGraw-Hill, New York, 1971, vol. 2.
[20] M. Brunato, H. H. Hoos, and R. Battiti, "On effectively finding maximal quasi-cliques in graphs," in Learning and Intelligent Optimization. Springer, 2007, pp. 41–55.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," 1999.
[22] M. Kandemir, C. Zhang, and F. A. Hamprecht, "Empowering multiple instance histopathology cancer diagnosis by cell graphs," in Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2014, pp. 228–235.
[23] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola, "Multi-instance kernels," in Proceedings of the International Conference on Machine Learning (ICML), 2002, pp. 179–186.
[24] D. M. Tax, E. Hendriks, M. F. Valstar, and M. Pantic, "The detection of concept frames using clustering multi-instance learning," in Proceedings of the International Conference on Pattern Recognition (ICPR), 2010, pp. 2917–2920.
[25] V. Cheplygina, D. M. Tax, and M. Loog, "Multiple instance learning with bag dissimilarities,"