Kernel Mean Embedding of Instance-wise Predictions in Multiple Instance Regression
Thomas Uriot
Department of Computing, Imperial College London
[email protected]
ABSTRACT
In this paper, we propose an extension to an existing algorithm (instance-MIR) which tackles the multiple instance regression (MIR) problem, also known as distribution regression. The MIR setting arises when the data is a collection of bags, where each bag consists of several instances which correspond to the same and unique real-valued label. The goal of a MIR algorithm is to find a mapping from the instances of an unseen bag to its target value. The instance-MIR algorithm treats all the instances separately and maps each instance to a label. The final bag label is then taken as the mean or the median of the predictions for that given bag. While it is conceptually simple, taking a single statistic to summarize the distribution of the labels in each bag is a limitation. In spite of this performance bottleneck, the instance-MIR algorithm has been shown to be competitive when compared to the current state-of-the-art methods. We address the aforementioned issue by computing the kernel mean embeddings of the distributions of the predicted labels, for each bag, and learn a regressor from these embeddings to the bag label. We test our algorithm (instance-kme-MIR) on five real-world datasets and obtain better results than the baseline instance-MIR across all the datasets, while achieving state-of-the-art results on two of the datasets.
KEYWORDS
Distribution regression; Kernel mean embedding; Instance-wise regression; Multiple instance regression
1 INTRODUCTION

Multiple instance learning (MIL) is a setting which falls under the supervised learning paradigm. Within the MIL framework, there exist two different learning tasks: multiple instance classification (MIC) [1] and multiple instance regression (MIR) [4, 11]. The former has been extensively studied in the literature while the latter has been underrepresented. This could be due to the fact that many of the data sources studied within the MIL framework are images and text, which correspond to classification tasks. The MIC problem generally consists in classifying bags into positive or negative examples, where negative bags contain only negative instances and positive bags contain at least one positive instance. A multitude of applications are covered by the MIC framework. It has been applied to medical imaging in a weakly supervised setting [20, 21], where each image is taken as a bag and sub-regions of the image are instances, to image categorization [3] and retrieval [22, 23], and to analyzing videos [12], where the video is treated as the bag and the frames are the instances.

On the other hand, the MIR problem, where bag labels are now real-valued, has been much less prevalent in the literature. In a regression setting, as opposed to classification, one cannot simply identify a single positive instance. Instead, one needs to estimate the contribution of each of the instances towards the bag label. The MIR problem was first introduced in the context of predicting drug activity level [4], and the first proposed MIR algorithm relied on the assumption that the bag's label can be fully explained by a single prime instance (prime-MIR) [11]. However, this is a simplistic assumption, as we throw away a lot of information about the distribution (e.g., variance, skewness). Instead of assuming that a single instance is responsible for the bag's label, the MIR problem has been tackled using a weighted linear combination of the instances [16], or as a prime cluster of instances (cluster-MIR) [17]. Other works have looked at first efficiently mapping the instances in each bag to a new embedding space, and then training a regressor on the new embedded feature space. For instance, one can transform the MIR problem into a regular supervised learning problem by mapping each bag to a feature space which is characterized by a similarity measure between a bag and an instance [2]. The resulting embedding of a bag in the new feature space represents how similar a bag is to various instances from the training set. A drawback of this approach is that the embedding space for each bag can be high-dimensional when the number of instances in the training set is large, producing many redundant and possibly uninformative features.

In this paper, we use a similar approach and compute the kernel mean embeddings for each bag [9]. The use of kernel mean embedding in distribution regression has been applied to various real-world problems such as analyzing the 2016 US presidential election [6] and estimating aerosol levels in the atmosphere [13]. Intuitively, kernel mean embedding measures how similar each bag is to all the other bags from the training set. In this paper, as opposed to previous works, we do not compute the kernel mean embeddings directly on the input features (i.e., on the instances) but on the predictions made by a previous learning algorithm (e.g., a neural network).
This insight comes from the fact that a simple baseline algorithm (instance-MIR) performed surprisingly well on several datasets, when the regressor was a neural network with a large hidden layer [14]. The instance-MIR algorithm essentially ignores the fact that we are in a distribution regression framework and treats each instance as a separate observation, thereby yielding a unique prediction for each instance. The final bag label is taken to be the mean or the median of the predictions for that given bag. However, using a point estimate in the original prediction space is a performance bottleneck. In this paper, we propose a novel algorithm (instance-kme-MIR), which both leverages the representational power of the instance-MIR algorithm equipped with a neural network and alleviates the aforementioned issue by mapping our predictions into a high or infinite-dimensional space, characterized by a kernel function. We test our approach on 5 remotely sensed real-world datasets.

2 APPLICATIONS AND RELATED WORK

The datasets we are using to test our algorithm stem from remotely sensed data (https://harvist.jpl.nasa.gov/papers.shtml), and have previously been described [14, 18] and studied as a distribution regression problem [14, 18, 19]. This allows us to compare the performance of our approach with the baseline instance-MIR and the current state-of-the-art. The first application (3 of the 5 datasets) consists in predicting aerosol optical depth (AOD); aerosols are fine airborne solid particles or liquid droplets in air, that both reflect and absorb incoming solar radiation. The second application (2 of the 5 datasets) is the prediction of county-level crop yields [16] (wheat and corn) in Kansas between 2001 and 2005. These two applications can naturally be framed as a multiple instance regression problem. Indeed, in both applications, satellites will gather noisy measurements due to the intrinsic variability within the sensors and the properties of the targeted area on Earth (e.g., surface and atmospheric effects). For the AOD prediction task, aerosols have been found to have a very small spatial variability over distances up to 100 km [7]. For the crop data, we can reasonably assume that the yields are similar across a county and thus consider the bag label as the aggregated yield over the entire county.

The first study which investigated estimating AOD levels within a MIR setting proposed an iterative method (pruning-MIR) which prunes outlying instances from each bag and then proceeds in a similar fashion as instance-MIR [19]. The main drawback of this approach is that it is not obvious what the pruning threshold should be, and we may thus get rid of informative instances in the process. In a subsequent work, the authors investigated a probabilistic framework (EM-MIR) by fitting a mixture model and using the expectation-maximization (EM) algorithm to learn the mixing and distribution parameters [18]. The current state-of-the-art algorithm (attention-MIR) on the AOD datasets has been obtained by treating each bag as a set (i.e., an unordered sequence) of instances [14]. To do so, the authors implemented an order-invariant operation characterized by a content-based attention mechanism [15], which then attends the instances a selected number of times. Finally, the problem of estimating AOD levels has been tackled using kernel mean embedding directly on the input features (i.e., the instances) [13], where they show that performance is robust to the kernel choice but the hyperparameter values of the kernels are of primary importance.
In this paper, however, we compute the kernel mean embeddings of the distributions of the predicted labels made by a neural network. In order to have a principled way to find the kernel parameters, authors have proposed a Bayesian kernel mean embedding model with a Gaussian process prior, from which we can obtain a closed-form marginal pseudolikelihood [5]. This marginal likelihood can then be optimized in order to find the kernel parameters.

3 BACKGROUND

3.1 Problem Formulation

In the MIR problem, our observed dataset is $\{(\{x_{i,l}\}_{l=1}^{L_i}, y_i)\}_{i=1}^{B}$, where $B$ is the number of bags, $y_i \in \mathbb{R}$ is the label of bag $i$, $x_{i,l}$ is the $l$-th instance of bag $i$, and $L_i$ is the number of instances in bag $i$. Note that $x_{i,l} \in \mathcal{X}$, and $\mathcal{X}$ is a subset of $\mathbb{R}^d$, where $d$ is the number of features in each instance. The number of features must be the same for all the instances, but the number of instances can vary within each bag.

We want to learn the best mapping $\hat{f} : \{x_{i,l}\}_{l=1}^{L_i} \to \hat{y}_i$, $i = 1 \dots B$. By best mapping we mean the function $\hat{f}$ which minimizes the mean squared error (MSE) on bags unseen during training (e.g., on the validation set). Formally, we seek $\hat{f}$ such that

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{B^*} \sum_{i=1}^{B^*} \mathrm{MSE}\big(y_i^*, f(\{x_{i,l}^*\}_{l=1}^{L_i^*})\big), \quad (1)$$

from the validation data $\{(\{x_{i,l}^*\}_{l=1}^{L_i^*}, y_i^*)\}_{i=1}^{B^*}$, where $\mathcal{H}$ is the hypothesis space of functions $f$ under consideration.

The two main challenges that the multiple instance regression problem poses are to find which instances are predictive of the bag's label and to efficiently summarize the information from the instances within each bag. However, the instance-MIR baseline algorithm, which we describe next, does not attempt to solve the multiple instance regression problem by addressing the two aforementioned challenges. Instead, it simply treats each instance independently and fits a regression model to all the instances separately.

3.2 Instance-MIR

As mentioned, the instance-MIR algorithm makes predictions on all the instances before taking the mean or the median of the predictions for each bag as the final prediction. This means that during training, all the instances have the same weights and thus contribute equally to the loss function. Formally, our dataset is formed by pairs of instance and bag label, which can be denoted as $\{(x_{i,l}, y_i),\ i = 1 \dots B,\ l = 1 \dots L_i\}$. The final label prediction on an unseen bag can be simply calculated as

$$\hat{y}_i^* = \frac{1}{L_i^*} \sum_{l=1}^{L_i^*} \hat{y}_{i,l}^*, \quad i = 1 \dots B^*,$$

where $\hat{y}_{i,l}^*$ is the predicted label corresponding to the $l$-th instance in bag $i$. Empirically, this method has been shown to be competitive [10], even though it requires models with high complexity in order to be able to effectively map many different noisy instances to the same target value. Thus, it is appropriate to take $\hat{f}$ as a neural network with a large number of hidden units [14], as opposed to a small number [18].
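To make the baseline concrete, the following is a minimal sketch of instance-MIR with scikit-learn. Using an MLPRegressor with a wide single hidden layer as the instance-level model follows [14], but the exact width (512) and training settings here are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def instance_mir(train_bags, train_labels, test_bags, aggregate=np.mean):
    """Instance-MIR baseline: fit one regressor on all instances,
    then aggregate per-bag instance predictions into a bag label.

    train_bags: list of arrays, each of shape (L_i, d)
    train_labels: array of shape (B,), one real-valued label per bag
    test_bags: list of arrays, each of shape (L*_j, d)
    """
    # Flatten the bags: every instance inherits its bag's label.
    X = np.vstack(train_bags)
    y = np.concatenate([np.full(len(bag), label)
                        for bag, label in zip(train_bags, train_labels)])

    # A wide single-hidden-layer network, as advocated in [14]
    # (the width 512 is an illustrative choice).
    f_hat = MLPRegressor(hidden_layer_sizes=(512,), max_iter=500)
    f_hat.fit(X, y)

    # Bag-level prediction: mean (or median) of the instance predictions.
    return np.array([aggregate(f_hat.predict(bag)) for bag in test_bags])
```

The flattening step makes the underlying assumption explicit: every instance is trained toward its bag's label, so all instances contribute equally to the loss.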
3.3 Kernel Mean Embedding

In this subsection, we briefly describe kernel mean embedding and its application to distribution regression, where the goal is to compute the kernel mean embedding of each bag. We assume that the instances $\{x_{i,l}\}_{l=1}^{L_i}$ in each bag are i.i.d. samples from some unobserved distribution $P_i$, for $i = 1, \dots, B$. The idea is to adopt a two-stage procedure by first representing each set of samples (i.e., each bag) $\{x_{i,l}\}_{l=1}^{L_i}$ by its corresponding kernel mean embedding, and then training a kernel ridge regression on those embeddings [13].

Formally, let $\mathcal{H}_k$ be a reproducing kernel Hilbert space (RKHS), which is a potentially infinite-dimensional space of functions $f : \mathcal{X} \to \mathbb{R}$, and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a reproducing kernel function of $\mathcal{H}_k$. Then for $f \in \mathcal{H}_k$ and $x \in \mathcal{X}$, we can evaluate $f$ at $x$ as an inner product $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}$ (reproducing property). Then, for a probability measure $P$ on $\mathcal{X}$, we can define its kernel mean embedding as

$$\mu_P = \int k(\cdot, x) \, P(dx) \in \mathcal{H}_k. \quad (2)$$

For $\mu_P$ to be well-defined, we simply require that its norm is finite, which holds when $\int \sqrt{k(x, x)} \, P(dx) < \infty$. This is always true for kernel functions that are bounded (e.g., Gaussian RBF, inverse multiquadric) but may be violated for unbounded ones (e.g., polynomial) [13]. In fact, it has been shown that the kernel mean embedding approach to distribution regression does not yield satisfying results when using a polynomial kernel, due to the aforementioned violation [13].

However, as mentioned, we do not have access to $P_i$ but only observe i.i.d. samples $\{x_{i,l}\}_{l=1}^{L_i}$ drawn from it. Instead, we compute the empirical mean estimator $\hat{\mu}_{P_i}$ of $\mu_{P_i}$, given by

$$\hat{\mu}_{P_i} = \int k(\cdot, x) \, \hat{P}_i(dx) = \frac{1}{L_i} \sum_{l=1}^{L_i} k(\cdot, x_{i,l}), \quad \text{for bag } i. \quad (3)$$

3.4 Kernel Ridge Regression

In kernel ridge regression (KRR), we seek to find the set of parameters $\hat{\alpha}$ such that

$$\hat{\alpha} = \arg\min_{\alpha} \big( \|y - K\alpha\|^2 + \lambda \alpha^T K \alpha \big), \quad (4)$$

where $K \in \mathbb{R}^{n \times n}$ is the kernelized Gram matrix of the dataset, and $\lambda$ is the hyperparameter controlling the amount of weight decay (i.e., $L_2$ regularization) on the parameters $\alpha$. In the case of KRR applied to kernel mean embedding, we have

$$K(i, j) = k'(\hat{\mu}_{P_i}, \hat{\mu}_{P_j}), \quad \text{for bags } i, j = 1, \dots, B, \quad (5)$$

where $k'$ is the KRR kernel and $K \in \mathbb{R}^{B \times B}$ ($B$ is the number of bags in the training set). In this paper, we take $k'$ to be the linear kernel, as it simplifies the computation and has been shown to yield competitive results when compared to non-linear kernels [13]. Thus, we have that

$$K(i, j) = k'(\hat{\mu}_{P_i}, \hat{\mu}_{P_j}) = \langle \hat{\mu}_{P_i}, \hat{\mu}_{P_j} \rangle_{\mathcal{H}_k} = \Big\langle \frac{1}{L_i} \sum_{l=1}^{L_i} k(\cdot, x_{i,l}), \frac{1}{L_j} \sum_{m=1}^{L_j} k(\cdot, x_{j,m}) \Big\rangle_{\mathcal{H}_k} = \frac{1}{L_i L_j} \sum_{l=1}^{L_i} \sum_{m=1}^{L_j} k(x_{i,l}, x_{j,m}), \quad (6)$$

where $x_{i,l}$ is the $l$-th instance of bag $i$, $x_{j,m}$ is the $m$-th instance of bag $j$, and the last equality is due to the reproducing property. In order to make predictions $\hat{y}_{\text{test}}$ on bags not seen during training, we simply compute

$$\hat{y}_{\text{test}} = \hat{\alpha} K_{\text{test}}, \quad (7)$$

where $\hat{\alpha} = y_{\text{train}} (K_{\text{train}} + \lambda I_{B \times B})^{-1}$ is obtained by differentiating (4) with respect to $\alpha$, equating to 0, and solving for $\alpha$. Note that, as mentioned in subsection 3.1, $B$ is the number of bags in the training set and $B^*$ is the number of unseen bags (e.g., in the validation or testing set), and so $K_{\text{test}} \in \mathbb{R}^{B \times B^*}$.
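A minimal sketch of this two-stage procedure, equations (3) through (7), assuming NumPy. The function names are illustrative; the Gaussian RBF is one admissible choice for $k$, while the linear kernel for $k'$ is realized by averaging pairwise instance kernel evaluations as in (6).

```python
import numpy as np

def rbf(a, b, theta=1.0):
    """Gaussian RBF kernel between two sets of points (rows of a, b)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / theta ** 2)

def kme_gram(bags_a, bags_b, kernel=rbf):
    """Equation (6): linear kernel between empirical mean embeddings,
    i.e. the average over all pairwise instance kernel evaluations."""
    K = np.empty((len(bags_a), len(bags_b)))
    for i, a in enumerate(bags_a):
        for j, b in enumerate(bags_b):
            K[i, j] = kernel(a, b).mean()
    return K

def krr_fit(K_train, y_train, lam=1e-3):
    """Closed-form KRR solution: alpha = y (K + lam I)^{-1},
    computed via a symmetric linear solve."""
    B = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(B), y_train)

def krr_predict(alpha, K_test):
    """Equation (7): predictions on unseen bags; K_test has shape (B, B*)."""
    return alpha @ K_test
```

Applying kme_gram directly to the raw instance bags corresponds to the approach of [13]; instance-kme-MIR instead applies it to bags of scalar predictions, as described in the next section.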
4 INSTANCE-KME-MIR

In this section, we describe our novel algorithm (instance-kme-MIR) and discuss the choices we made for the hyperparameter values. We emphasise that the novelty in this paper is to compute the kernel mean embeddings on the predictions made by a previous learning algorithm, as opposed to previous works [5, 6, 8, 13], where the authors directly compute the kernel mean embeddings on the input features. Our algorithm can be seen as an extension of instance-MIR, where we take advantage of the representational power of neural networks (Part 1 of our algorithm), and address its performance bottleneck by computing the kernel mean embeddings on the predictions (Part 2 of our algorithm). The full procedure is given in Algorithm 1 below.

In our implementation (https://github.com/pinouche/Instance-kme-MIR), we choose D = 50 and f to be a single-layered neural network, as it was shown to yield good results for the instance-MIR algorithm [14]. We purposefully set the number of folds D to be large, so that in Part 1 of our algorithm, we still train f on 98% of the training set. It thus makes sense to use the same hyperparameter values for the neural network f when comparing the baseline instance-MIR to instance-kme-MIR. For Part 2, we experimented with two different kernels k (RBF and inverse multiquadric).

5 EXPERIMENTS

In order to fairly compare our algorithm to the current state-of-the-art [14, 18], we evaluate its performance using the same training and evaluation protocol. The protocol consists in a 5-fold cross-validation, where the bags in the training set are randomly split into 5 folds, out of which 4 folds are used in training and 1 fold serves as the validation set. In turn, each of the 5 folds serves as the validation set and the 4 remaining folds as the training set. The cross-validation is repeated 10 times in order to eliminate the randomness involved in choosing the folds. We use the root mean squared error (RMSE) to evaluate the performance and report our results, shown in Table 1, on 5 real-world datasets. While the baseline instance-MIR was already evaluated [14], we re-implement it on the 3 AOD datasets, with different hyperparameter values, and thus obtain distinct results. The validation loss reported in Table 1 is the average loss over the 50 evaluations (10 iterations of 5-fold cross-validation).
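This evaluation loop can be sketched as follows, assuming any fit_predict callable with the signature of the instance_mir sketch above (or the instance-kme-MIR pipeline sketched later); the helper name is a placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold

def evaluate(bags, labels, fit_predict, n_splits=5, n_repeats=10):
    """10 repeats of 5-fold cross-validation over bags; returns mean RMSE.

    fit_predict(train_bags, train_labels, val_bags) -> val predictions,
    e.g. the instance_mir sketch from subsection 3.2.
    """
    labels = np.asarray(labels, dtype=float)
    rmses = []
    for repeat in range(n_repeats):
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=repeat)
        for train_idx, val_idx in kf.split(bags):
            preds = fit_predict([bags[i] for i in train_idx],
                                labels[train_idx],
                                [bags[i] for i in val_idx])
            rmses.append(np.sqrt(np.mean((preds - labels[val_idx]) ** 2)))
    return np.mean(rmses)
```

Under these assumptions, evaluate(bags, labels, instance_mir) mirrors the baseline protocol, averaging the RMSE over the 50 fold evaluations.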
Algorithm 1: Instance-kme-MIR

Inputs: training bags $\{(\{x_{i,l}\}_{l=1}^{L_i}, y_i)\}_{i=1}^{B}$ and validation bags $\{(\{x_{i,l}^*\}_{l=1}^{L_i^*}, y_i^*)\}_{i=1}^{B^*}$
Outputs: bag-level predictions $\{\hat{y}_i^*\}_{i=1}^{B^*}$ on the validation set

Part 1: out-of-fold stacking
  Set D = number of folds
  Initialize an array A with $\sum_{i=1}^{B} L_i$ elements (i.e., the number of training instances)
  Choose a learning algorithm f
  Shuffle the bags in the training data $\{(\{x_{i,l}\}_{l=1}^{L_i}, y_i)\}_{i=1}^{B}$
  Partition the training data into D folds with an equal number of bags in each fold, delimited by indices $F_0 = 0 < F_1 < \dots < F_D = B$
  Set counter = 0
  for k = 0, ..., D-1 do
    Set $\{X, Y\}_{\text{train}}$ = all the bags except those in fold k
    Set $\{X, Y\}_{\text{val}}$ = all the bags in fold k, i.e., bags $i = F_k + 1, \dots, F_{k+1}$
    Learn $\hat{f} : x_{i,l} \to \hat{y}_{i,l}$ on $\{X, Y\}_{\text{train}}$ (see equation (1))
    for each bag i in fold k do
      for l = 1, ..., $L_i$ do
        Predict $\hat{y}_{i,l} = \hat{f}(x_{i,l})$
        Set A[counter] = $\hat{y}_{i,l}$ (build a stacked training set for Part 2)
        counter += 1
      end for
    end for
  end for
  Return A

Part 2: kernel mean embedding and KRR on the stacked dataset A
  Choose the weight decay value $\lambda$
  Choose the kernel function k (and its parameter values)
  Compute $K_{\text{train}}(i, j) = \frac{1}{L_i L_j} \sum_{l=1}^{L_i} \sum_{m=1}^{L_j} k(\hat{y}_{i,l}, \hat{y}_{j,m})$, for $i, j = 1, \dots, B$ (see equation (6))
  Compute $\hat{\alpha} = y_{\text{train}} (K_{\text{train}} + \lambda I_{B \times B})^{-1}$, where $y_{\text{train}} = [y_1, \dots, y_B] \in \mathbb{R}^B$
  Return $\hat{\alpha}$

Part 3: predict bag labels $\{\hat{y}_i^*\}_{i=1}^{B^*}$ on the validation set
  for i = 1, ..., $B^*$ do
    for l = 1, ..., $L_i^*$ do
      Predict $\hat{y}_{i,l}^* = \hat{f}(x_{i,l}^*)$
    end for
  end for
  Compute $K_{\text{val}}(i, j) = \frac{1}{L_i L_j^*} \sum_{l=1}^{L_i} \sum_{m=1}^{L_j^*} k(\hat{y}_{i,l}, \hat{y}_{j,m}^*)$, for $i = 1, \dots, B$, $j = 1, \dots, B^*$
  Compute $\hat{y}_{\text{val}} = \hat{\alpha} K_{\text{val}}$, where $\hat{y}_{\text{val}} = [\hat{y}_1^*, \dots, \hat{y}_{B^*}^*] \in \mathbb{R}^{B^*}$ (see equation (7))
  Return $\hat{y}_{\text{val}}$ (i.e., $\{\hat{y}_i^*\}_{i=1}^{B^*}$)

6 RESULTS

In Table 1, we display the results for 4 algorithms: the baseline instance-MIR (described in subsection 3.2), attention-MIR [14], EM-MIR [18], and our novel algorithm (instance-kme-MIR) for two different kernels, $k_{\text{RBF}}$ and $k_{\text{INV}}$, where

$$k_{\text{RBF}}(x, x') = \exp\Big( -\frac{\|x - x'\|^2}{\theta^2} \Big), \qquad k_{\text{INV}}(x, x') = 1 - \frac{\|x - x'\|^2}{\|x - x'\|^2 + \theta^2}.$$

Note that prior to our implementation of instance-kme-MIR, the state-of-the-art results on the 5 datasets were shared between the 3 other algorithms [14]. Now, as can be seen in Table 1, attention-MIR achieves the best results on the AOD datasets while instance-kme-MIR yields the best results on the crop datasets.

We experimented with several values for $\theta$ and $\lambda$, on grids with a constant increment for each hyperparameter. The results in Table 1 are reported for the best hyperparameter values. We found that while extreme hyperparameter values negatively impacted the performance of our algorithm, most values yielded similarly good results, which means that our algorithm is robust to the hyperparameter values.

Instance-MIR (median) refers to the instance-MIR algorithm where the median is used to compute the final prediction for each bag, instead of the mean, as described in subsection 3.2. We can see that there does not seem to be an advantage to using the mean or the median, as both methods achieve very similar results. On the other hand, we can see that our algorithm consistently outperforms the baseline instance-MIR. However, note that since our algorithm makes use of the predictions made from instance-MIR (in Part 2 of Algorithm 1), we can only aim to achieve a measured improvement over the standard instance-MIR. Thus, our method is mostly beneficial in the cases where instance-MIR is the best out-of-the-box algorithm (e.g., on the 2 crop datasets). Since our algorithm computes the kernel mean embedding between scalars (i.e., between the real-valued predictions) and is robust to the values of $\lambda$ and $\theta$, it is easy to tune and its computation cost is very close to that of instance-MIR.
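To tie Algorithm 1 to runnable code, the following is a compact sketch of the full pipeline under the same assumptions as the earlier snippets (scikit-learn MLPRegressor as f, D = 50 as in our implementation, illustrative defaults for $\theta$ and $\lambda$). Whether $\hat{f}$ used in Part 3 is refit on the full training set or taken from the last fold is not pinned down by the listing; the refit here is an assumption of this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

def k_inv(a, b, theta):
    """Inverse multiquadric-style kernel on 1-D prediction arrays."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return 1.0 - d2 / (d2 + theta ** 2)

def instance_kme_mir(train_bags, y_train, test_bags, D=50, theta=1.0, lam=1e-3):
    y_train = np.asarray(y_train, dtype=float)

    # Part 1: out-of-fold stacking of instance-level predictions.
    oof = [None] * len(train_bags)
    for tr, va in KFold(n_splits=D, shuffle=True).split(train_bags):
        f_hat = MLPRegressor(hidden_layer_sizes=(512,), max_iter=500)
        X = np.vstack([train_bags[i] for i in tr])
        y = np.concatenate([np.full(len(train_bags[i]), y_train[i]) for i in tr])
        f_hat.fit(X, y)
        for i in va:
            oof[i] = f_hat.predict(train_bags[i])

    # Part 2: KME Gram matrix on stacked predictions + closed-form KRR.
    def gram(preds_a, preds_b):
        K = np.empty((len(preds_a), len(preds_b)))
        for i, a in enumerate(preds_a):
            for j, b in enumerate(preds_b):
                K[i, j] = k_inv(a, b, theta).mean()
        return K

    B = len(train_bags)
    alpha = np.linalg.solve(gram(oof, oof) + lam * np.eye(B), y_train)

    # Part 3: predict test instances, embed, and regress. Refitting f on
    # all training instances is an assumption of this sketch.
    f_hat = MLPRegressor(hidden_layer_sizes=(512,), max_iter=500)
    f_hat.fit(np.vstack(train_bags),
              np.concatenate([np.full(len(b), l)
                              for b, l in zip(train_bags, y_train)]))
    test_preds = [f_hat.predict(bag) for bag in test_bags]
    return alpha @ gram(oof, test_preds)
```

Because the embeddings are computed between scalar predictions rather than d-dimensional instances, the Gram computation is cheap, which is why the overall cost stays close to that of instance-MIR.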
7 CONCLUSION

In this paper, we developed a straightforward extension of the baseline instance-MIR algorithm. Our method takes advantage of the expressive power of neural networks while addressing the main weakness of instance-MIR by computing the kernel mean embeddings of the predictions. We have shown that our algorithm consistently outperforms the baseline and achieves state-of-the-art results on the 2 crop datasets. In addition, our algorithm is robust to the kernel parameter values and its performance gains come at a low computational cost.

Nonetheless, it fails when the baseline instance-MIR does not yield satisfying results (e.g., on the 3 AOD datasets). This is because we compute the kernel mean embeddings on predictions made from the baseline instance-MIR, and we can thus only expect measured improvements over that baseline. Another drawback of our method comes from the fact that instance-MIR assigns the same weights to all the instances during training. However, the number of instances per bag may vary, and it would make sense to weight the instances so that each bag contributes equally to the loss, regardless of its size.
Table 1: The loss for the 3 AOD datasets (MISR1, MISR2, MODIS) is the RMSE × 100; for the 2 CROP datasets (WHEAT, CORN) the loss is the RMSE.

Algorithms                  MODIS   MISR1   MISR2   WHEAT   CORN
Instance-MIR (mean)         10.4    9.02    7.61    4.96    24.57
Instance-MIR (median)       10.4    8.89    7.50    5.00    24.72
Instance-kme-MIR (k_INV)    10.1    8.68    7.28    4.91    –
Instance-kme-MIR (k_RBF)    10.1    8.70    7.38    –       –
REFERENCES

[1] Jaume Amores. 2013. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence 201 (2013), 81–105.
[2] Yixin Chen, Jinbo Bi, and James Ze Wang. 2006. MILES: Multiple-instance learning via embedded instance selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 12 (2006), 1931–1947.
[3] Yixin Chen and James Z Wang. 2004. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research 5, Aug (2004), 913–939.
[4] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence 89, 1-2 (1997), 31–71.
[5] Seth Flaxman, Dino Sejdinovic, John P Cunningham, and Sarah Filippi. 2016. Bayesian learning of kernel embeddings. arXiv preprint arXiv:1603.02160 (2016).
[6] Seth Flaxman, Dougal Sutherland, Yu-Xiang Wang, and Yee Whye Teh. 2016. Understanding the 2016 US presidential election using ecological inference and distribution regression with census microdata. arXiv preprint arXiv:1611.03787 (2016).
[7] Charles Ichoku, D Allen Chu, Shana Mattoo, Yoram J Kaufman, Lorraine A Remer, Didier Tanré, Ilya Slutsker, and Brent N Holben. 2002. A spatio-temporal approach for global validation and analysis of MODIS aerosol products. Geophysical Research Letters 29, 12 (2002), MOD1-1.
[8] Ho Chung Leon Law, Dougal J Sutherland, Dino Sejdinovic, and Seth Flaxman. 2017. Bayesian approaches to distribution regression. arXiv preprint arXiv:1705.04293 (2017).
[9] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. 2017. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends® in Machine Learning 10, 1-2 (2017), 1–141.
[10] Soumya Ray and Mark Craven. 2005. Supervised versus multiple instance learning: An empirical comparison. In Proceedings of the 22nd International Conference on Machine Learning. ACM, 697–704.
[11] Soumya Ray and David Page. 2001. Multiple instance regression. In ICML, Vol. 1. 425–432.
[12] Karan Sikka, Abhinav Dhall, and Marian Bartlett. 2013. Weakly supervised pain localization using multiple instance learning. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on. IEEE, 1–8.
[13] Zoltán Szabó, Arthur Gretton, Barnabás Póczos, and Bharath Sriperumbudur. 2015. Two-stage sampled learning theory on distributions. In Artificial Intelligence and Statistics. 948–957.
[14] Thomas Uriot. 2019. Learning with Sets in Multiple Instance Regression Applied to Remote Sensing. arXiv preprint arXiv:1903.07745 (2019).
[15] Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 (2015).
[16] Kiri L Wagstaff and Terran Lane. 2007. Salience assignment for multiple-instance regression. (2007).
[17] Kiri L Wagstaff, Terran Lane, and Alex Roper. 2008. Multiple-instance regression with structured data. In Data Mining Workshops, 2008. ICDMW'08. IEEE International Conference on. IEEE, 291–300.
[18] Zhuang Wang, Liang Lan, and Slobodan Vucetic. 2012. Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing 50, 6 (2012), 2226–2237.
[19] Zhuang Wang, Vladan Radosavljevic, Bo Han, Zoran Obradovic, and Slobodan Vucetic. 2008. Aerosol optical depth prediction from satellite observations by multiple instance regression. In Proceedings of the 2008 SIAM International Conference on Data Mining. SIAM, 165–176.
[20] Jianjun Wu, Yinan Yu, Chang Huang, and Kai Yu. 2015. Deep multiple instance learning for image classification and auto-annotation. In CVPR.
[21] Yan Xu, Tao Mo, Qiwei Feng, Peilin Zhong, Maode Lai, and Eric I-Chao Chang. [n. d.]. Deep learning of feature representation with multiple instance learning for medical image analysis. ([n. d.]).
[22] Cheng Yang and Tomas Lozano-Perez. 2000. Image database retrieval with multiple-instance learning techniques. In Data Engineering, 2000. Proceedings. 16th International Conference on. IEEE, 233–243.
[23] Qi Zhang, Sally A Goldman, Wei Yu, and Jason E Fritts. 2002. Content-based image retrieval using multiple-instance learning. In ICML, Vol. 2. Citeseer, 682–689.