Distributed Maximization of Submodular plus Diversity Functions for Multi-label Feature Selection on Huge Datasets
Mehrdad Ghadiri Mark Schmidt
University of British Columbia
Abstract
There are many problems in machine learning and data mining which are equivalent to selecting a non-redundant, high "quality" set of objects. Recommender systems, feature selection, and data summarization are among many applications of this. In this paper, we consider this problem as an optimization problem that seeks to maximize the sum of a sum-sum diversity function and a non-negative monotone submodular function. The diversity function addresses the redundancy, and the submodular function controls the predictive quality. We consider the problem in big data settings (in other words, distributed and streaming settings) where the data cannot be stored on a single machine or the processing time is too high for a single machine. We show that a greedy algorithm achieves a constant factor approximation of the optimal solution in these settings. Moreover, we formulate the multi-label feature selection problem as such an optimization problem. This formulation, combined with our algorithm, leads to the first distributed multi-label feature selection method. We compare the performance of this method with centralized multi-label feature selection methods in the literature, and we show that its performance is comparable or in some cases even better than current centralized multi-label feature selection methods.
1 Introduction

Many problems from different areas of machine learning and data mining can be modeled as an optimization
problem that tries to maximize the sum of a sum-sum diversity function (which is the sum of the distances between all of the pairs in a given subset) and a non-negative monotone submodular function. Examples include the query diversification problem in the area of databases [Demidova et al., 2010, Liu et al., 2009], search result diversification [Agrawal et al., 2009, Drosou and Pitoura, 2010], and recommender systems [Yu et al., 2009]. The size of the datasets in these applications is growing rapidly, and there is a need for scalable methods to tackle these problems on huge datasets. Inspired by these applications, we propose an algorithm for approximately solving this optimization problem with a theoretical guarantee in distributed and streaming settings. Borodin et al. [2017] presented a 0.5-approximation for this optimization problem in the centralized setting, in which data can be stored and processed on a single machine. In this paper, we consider this problem for big data settings where the data cannot be stored on a single machine, or the processing time is too high for a single machine. We show that our algorithm achieves a constant factor approximation. Note that solving this problem in a distributed or streaming setting is strictly harder than solving it in the centralized setting because, in the aforementioned settings, the algorithm does not use all of the data. As a result, our algorithm is √(d/k) times faster in the distributed setting and it needs √(d/k) times less memory in the streaming setting compared to the centralized setting, where d is the size of the ground set (for example, the number of features in the feature selection problem), and k is the number of machines (in the distributed setting) or the number of partitions of the data (in the streaming setting). Therefore, our algorithm gives a worse approximate solution compared to the centralized method of Borodin et al. [2017], but it is much faster and needs less memory. This trade-off might be interesting and useful in some applications.

One of the problems that can be modeled as such an optimization problem and is in need of scalable methods in modern applications is multi-label feature selection. The diversity part controls the redundancy of the selected features and the submodular part promotes features that are relevant to the labels. A multi-label dataset is made up of a number of samples, features, and labels. Each sample is a set of values for the features and labels. Usually, labels have binary values; for example, whether a patient has diabetes or not. Multi-label datasets can be found in different areas, including but not limited to semantic image annotation, protein and gene function studies, and text categorization [Kashef et al., 2018]. The applications, number, and size of such datasets are growing very rapidly, and it is necessary to develop efficient and scalable methods to deal with them.

Feature selection is a fundamental problem in machine learning. Its goal is to decrease the dimensionality of a dataset in order to improve the learning accuracy, decrease the learning and prediction time, and prevent overfitting. There are three different categories of feature selection methods depending on their interaction with the learning methods.
Filter methods select the features based on the intrinsic properties of the data and are totally independent of the learning method. Wrapper methods select the features according to the accuracy of a specific learning method, like SVMs. Finally, embedded methods select the features as a part of their learning procedure [Guyon and Elisseeff, 2003]; decision trees and the use of ℓ1 and ℓ2 regularization for feature selection fall into the latter category. When the number of features is large, filter methods are a reasonable choice since they are fast, resistant to overfitting, and independent of the learning model. Therefore, we can quickly select a number of features with filter methods and then try different learning methods to see which one fits the data better (possibly with wrapper or embedded feature selection methods). However, with millions of features, centralized filter methods are not applicable anymore. To deal with such huge datasets, we need scalable methods. Although there have been efforts to develop scalable and distributed filter methods for single-label datasets [Zadeh et al., 2017, Bolón-Canedo et al., 2015a], to the best of our knowledge, there is no previous distributed multi-label feature selection method.

In this paper, we propose an information theoretic filter feature selection method for multi-label datasets that is usable in distributed, streaming, and centralized settings. In the centralized setting, all of the data is stored and can be processed on a single machine. In the distributed setting, the data is stored on multiple machines, and there is no shared memory between machines. In the streaming setting, although the computation is done on a single machine, this machine does not have enough memory to store all of the data at once. The data in our method is distributed vertically, which means that the features are distributed between machines instead of the samples (horizontal distribution). Feature selection is considered harder when the data is distributed vertically because we lose much information about the relations of the features [Bolón-Canedo et al., 2015b]. However, when the number of instances is small and the number of features is large (for example, biological or medical datasets), vertical distribution is the only reasonable choice. Our work can be seen as an extension of Borodin et al. [2017] to distributed and streaming settings, or an extension of Zadeh et al. [2017] to multi-label data. However, our results cannot be derived from these previous works in a straightforward manner. The main contributions of the paper are listed in the following.

Our Contributions

• We present a greedy algorithm for maximizing the sum of a sum-sum diversity function and a non-negative monotone submodular function in the distributed and streaming settings. We prove that it achieves a constant factor approximation of the optimal solution.

• We formulate the multi-label feature selection problem as such a combinatorial optimization problem. Using this formulation, we present information theoretic filter feature selection methods for the distributed, streaming, and centralized settings. The distributed method is the first distributed multi-label feature selection method proposed in the literature.

• We perform an empirical study of the proposed distributed method and compare its results to different centralized multi-label feature selection methods. We show that the results of the distributed method are comparable to the current centralized methods in the literature.
We also compare the runtime and the value of the objective function that our centralized and distributed methods achieve. Note that the centralized methods have access to all of the data and can do computation on it. We do not expect our distributed or streaming method to beat the centralized methods, because it is not possible. However, we argue that our results are comparable to the results of centralized methods, and our method is much faster (in the case of the distributed setting) and needs much less memory (in the case of the streaming setting). We compare our results with the centralized methods in the literature (this comparison is unfair to the distributed setting) because, to the best of our knowledge, there is no distributed multi-label feature selection method prior to this work.

Our techniques can be used prior to multi-label classification, multi-label regression, and in some multi-task learning setups. The structure of the paper is as follows. In the next section, we review the related work and preliminaries. In Section 3, we formulate the multi-label feature selection problem as the mentioned optimization problem and present the algorithm for maximizing it in the distributed and streaming settings. In Section 4, we show the theoretical approximation guarantee of the proposed algorithm. In Section 5, we evaluate the performance of the proposed distributed algorithm in practice.
2 Related Work and Preliminaries

In this section, we review the previous work on different aspects of the problem, including diversity maximization, submodular maximization, composable core-sets, and feature selection.
Diversity Maximization and Submodular Maximization
Usually, the diversity maximization problem is defined on a metric space over a set of points U, with the goal of finding a subset of them which maximizes a diversity function subject to a constraint, for example a cardinality constraint or a matroid constraint. If S is a subset of the points, the sum-sum diversity of S is D(S) = 0.5 Σ_{x∈S} Σ_{y∈S} d(x, y), where d(·,·) is a metric distance. In the centralized setting, a simple greedy or local search algorithm can achieve a half approximation of the optimal solution subject to |S| = k [Hassin et al., 1997, Abbassi et al., 2013]. A better approximation factor is not achievable under the planted clique conjecture [Bhaskara et al., 2016, Borodin et al., 2017].

Submodular functions are important concepts in machine learning and data mining with many applications; see Krause and Guestrin [2008] for their applications. A submodular function is a set function with a diminishing marginal gain. A function f : 2^U → R is submodular if f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B) for any A ⊆ B ⊂ U and x ∈ U \ B. It is monotone if f(A) ≤ f(B) and it is non-negative if f(A) ≥ 0, for any A ⊆ B ⊆ U. Maximizing a monotone submodular function subject to a cardinality constraint is NP-hard, but using a simple greedy algorithm we can achieve (1 − 1/e) of the optimal solution. A better approximation factor is not achievable using a polynomial time algorithm unless P=NP [Krause and Golovin, 2014].

Let U be a set, f(·) be a submodular function defined on U, and d(·,·) be a metric distance defined between pairs of elements of U. Borodin et al. [2017] showed that in the centralized setting, using a simple greedy algorithm, we can achieve half of the optimal value for maximizing f(S) + λ Σ_{{u,v}: u,v∈S} d(u, v) subject to S ⊆ U and |S| = k. This result is extended to semi-metric distances in Abbasi Zadeh and Ghadiri [2015]. Similar problems are considered in Dasgupta et al. [2013], where the diversity part can be other diversity functions; namely, they considered the sum-sum diversity, the minimum spanning tree, and the minimum of distances between all pairs. They showed that the greedy algorithm achieves a constant factor approximation in all of these cases.
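As a concrete illustration of the (1 − 1/e) greedy mentioned above, the following Python sketch (ours, not part of the original paper) maximizes a toy monotone submodular coverage function under a cardinality constraint; the function names and the toy data are assumptions made purely for illustration.

# Classical greedy for monotone submodular maximization under |S| = k.
# `coverage` is only an example of a non-negative monotone submodular function.

def coverage(sets_by_element, S):
    # g(S) = number of items covered by the elements in S (monotone submodular)
    covered = set()
    for x in S:
        covered |= sets_by_element[x]
    return len(covered)

def greedy_submodular(ground_set, g, k):
    S = []
    for _ in range(k):
        # pick the element with the largest marginal gain g(S ∪ {u}) - g(S)
        best = max((u for u in ground_set if u not in S),
                   key=lambda u: g(S + [u]) - g(S))
        S.append(best)
    return S

# toy usage: elements cover subsets of {0,...,9}
sets_by_element = {"a": {0, 1, 2}, "b": {2, 3}, "c": {4, 5, 6, 7}, "d": {0, 7, 8, 9}}
g = lambda S: coverage(sets_by_element, S)
print(greedy_submodular(list(sets_by_element), g, 2))  # -> ['c', 'a'] (covers 7 of 10 items)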
Composable Core-sets

In computational geometry, a core-set is a small subset of points that approximately preserves a measure of the original set [Agarwal et al., 2005]. Composable core-sets extend this property to the combination of sets; therefore, they can be used in a divide and conquer manner to find an approximate solution. Let U be a set, f : 2^U → R be a set function on U, (T_1, ..., T_m) be a random partitioning of the elements of U, and k be a positive integer. Let OPT(T) = arg max_{S⊆T, |S|=k} f(S), where T ⊆ U. Let ALG be an algorithm which takes T ⊆ U as an input and outputs S ⊆ T. For α > 0, we call ALG an α-approximate composable core-set of size k for f if the size of its output is k and f(OPT(ALG(T_1) ∪ ... ∪ ALG(T_m))) ≥ α f(OPT(T_1 ∪ ... ∪ T_m)) [Indyk et al., 2014]. We call ALG an α-approximate randomized composable core-set of size k for f if the size of its output is k and E[f(OPT(ALG(T_1) ∪ ... ∪ ALG(T_m)))] ≥ α f(OPT(T_1 ∪ ... ∪ T_m)) [Mirrokni and Zadimoghaddam, 2015]. Composable core-sets and randomized composable core-sets can be used in distributed settings (like the MapReduce framework) and streaming settings (see Figure 1).

Composable core-sets were first used to approximately solve several diversity maximization problems in distributed and streaming settings [Indyk et al., 2014]. This resulted in a constant factor approximation algorithm for sum-sum diversity maximization, and the approximation factor was later improved in Aghamolaei et al. [2015]. Randomized composable core-sets were first introduced to tackle the submodular maximization problem in distributed and streaming settings, which resulted in a constant factor approximation algorithm for monotone submodular functions [Mirrokni and Zadimoghaddam, 2015]. They were then used to further improve the approximation factor for sum-sum diversity maximization [Zadeh et al., 2017]. The randomized composable core-sets used in the latter case find the approximate solution with high probability instead of in expectation. There are a number of other works on distributed submodular maximization [Mirzasoleiman et al., 2016, Barbosa et al., 2015]. Moreover, submodular and weakly submodular functions have been used for distributed single-label feature selection [Khanna et al., 2017]. We should note that the objective function discussed in our work is neither submodular nor weakly submodular; this is because of the diversity term of the function. An advantage of using this diversity function is that it is evaluated by a pairwise distance function. As a result, it is easy to evaluate our objective function on datasets with few samples. On the contrary, evaluating the pure submodular functions that were used for feature selection in the literature is quite hard and needs a large amount of data and computing power.
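The generic divide-and-conquer pattern behind composable core-sets can be sketched in a few lines of Python (ours, schematic only); `local_alg` plays the role of ALG and `solve_on_union` the role of the final centralized maximization, and both are placeholders to be instantiated, for example with Algorithms 1 and 2 below.

import random

def composable_coreset_run(ground_set, m, local_alg, solve_on_union, k):
    # 1) randomly partition the ground set into m parts T_1, ..., T_m
    items = list(ground_set)
    random.shuffle(items)
    parts = [items[i::m] for i in range(m)]
    # 2) "map" step: each machine i returns a core-set S_i = ALG(T_i) of size k
    coresets = [local_alg(part, k) for part in parts]
    # 3) "reduce" step: solve the problem on the union of the core-sets
    union = [x for s in coresets for x in s]
    return solve_on_union(union, k)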
Feature Selection and Multi-label Feature Selection

Filter feature selection methods select features independently of the learning algorithm. Hence, they are usually faster and immune to overfitting [Guyon and Elisseeff, 2003]. Mutual information based methods are a well-known family of filter methods. The best-known method of this kind for single-label feature selection is minimum redundancy and maximum relevance (mRMR), which tries to find a subset of features S that maximizes the following objective function using a greedy algorithm:

(1/|S|) Σ_{x_i∈S} I(x_i, c) − (1/|S|^2) Σ_{x_i,x_j∈S} I(x_i, x_j),

where I(·,·) is the mutual information function and c is the label vector [Peng et al., 2005]. The method proposed in this paper can be seen as a variation of mRMR which is capable of being used for multi-label feature selection in distributed, streaming, and centralized settings.

Although there have been great advancements in centralized feature selection, there are few works on distributed feature selection, and most of them distribute the data horizontally. Zadeh et al. [2017] was the first work on single-label vertically distributed feature selection that considered the redundancy of the features. Their method selects features using randomized composable core-sets in order to maximize a diversity function defined on the features. Although there are some similarities between the formulation presented in Zadeh et al. [2017] and this work, we should note that the single-label formulation cannot be applied directly to multi-label datasets. Moreover, the maximization of the functions and the analysis of the algorithms to prove the theoretical guarantee are completely different.

Most of the multi-label feature selection methods transform the data to a single-label form. Binary relevance (BR) and label powerset (LP) are two common ways to do so. BR methods consider each label separately and use a single-label feature selection method to select features for each label, and then they aggregate the selected features. A disadvantage of BR methods is that they cannot consider the relations of the labels. LP methods consider the multi-label dataset as one single-label multi-class dataset where each class of its single label is a possible combination of labels in the dataset (treating the labels as a binary string), and then they apply a single-label feature selection method. Although LP methods consider the relations of the labels, they have significant drawbacks. For example, some classes may end up with very few samples or none at all. Moreover, the method is biased toward the combinations of the labels which exist in the training set [Kashef et al., 2018]. Our proposed method does not transform the data to single-label data and is designed in a way that does not suffer from the mentioned disadvantages.

3 Formulation and Algorithms

Let U be a set of d features and L be a set of t labels. We also have a set A of n instances, each of which is a vector of observations for the elements of U ∪ L. The goal of multi-label feature selection is to find a small non-redundant subset of U which can predict the labels in L accurately. In order to quantify redundancy, it is natural to use a metric distance d over the feature set to measure dissimilarity. In our application (feature selection) we are particularly interested in the following metric distance. For any u_i, u_j ∈ U, we define

d(u_i, u_j) = 1 − I(u_i, u_j)/H(u_i, u_j) = 1 − [ Σ_{x∈u_i, y∈u_j} p(x, y) log( p(x, y)/(p(x) p(y)) ) ] / [ − Σ_{x∈u_i, y∈u_j} p(x, y) log p(x, y) ],

where H(·,·) is the joint entropy and I(·,·) is the mutual information.
This distance function is called the normalized variation of information (its values lie between 0 and 1), and it is a metric [Nguyen et al., 2010]. In Zadeh et al. [2017], this distance function plus a modular function is used for single-label feature selection.

In order to quantify the predictive quality of the selected features, we define a non-negative monotone submodular function g : 2^U → R which measures the relevance of the selected features to the labels. For any positive integer p, we define

g(S) = Σ_{ℓ∈L} top-p_{x∈S} {MI(x, ℓ)},

where top-p_{x∈S} {MI(x, ℓ)} is the sum of the p largest numbers in {MI(x, ℓ) | x ∈ S}. Here MI(x, ℓ) = I(x, ℓ)/√(H(x) H(ℓ)) is the normalized mutual information, where H(·) is the entropy function, and the value MI(·,·) lies in [0, 1]. Note that if we only have one label (i.e., |L| = 1) and p = d (the number of all features of the dataset), then g will be exactly the modular function used in Zadeh et al. [2017]. Therefore, our formulation is a generalization of theirs. Using the top-p function, this formulation tries to select at least p relevant features for each label. In order to understand the importance of the top-p function, we discuss two extreme cases: p = 1 and p = d. If p = 1, then a feature that is somewhat relevant to all the labels can dominate g(S) and prevent other features, which are highly relevant to one or a few labels, from being selected. If p = d, then a label that has a lot of relevant features can dominate g(S) and prevent other labels from getting relevant features, while a few features would be enough for predicting this label with a high accuracy. In the following lemma, we show that g has the nice properties we need in our model. Its proof is included in Appendix A.

Lemma 1. g is a non-negative, monotone, submodular function.

Hence, if we define f(S) = g(S) + Σ_{{u,v}⊆S} d(u, v), then our feature selection model reduces to solving the following combinatorial optimization problem:

max_{S⊆U, |S|=k} f(S) = max_{S⊆U, |S|=k} { g(S) + Σ_{{u,v}⊆S} d(u, v) },   (1)

where d(·,·) is a metric distance and g(·) is a non-negative monotone submodular function. In the actual feature selection method we are free to scale the relative contributions of the diversity and submodular parts, since both metric and submodular functions are closed under multiplication by a positive constant. Hence, we use a weighted version of the objective function in our application.
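A small Python sketch (ours, not the authors' code) of the quantities just defined may help make them concrete: the normalized variation-of-information distance d, the top-p relevance g, and the unweighted objective f = g + sum-sum diversity. It assumes `features` is a dict mapping a feature name to a discrete-valued column and `labels` is a list of discrete-valued label columns; these names are illustrative.

import numpy as np
from collections import Counter

def entropy(*cols):
    # joint Shannon entropy (in nats) of one or more discrete vectors
    joint = Counter(zip(*cols))
    n = len(cols[0])
    p = np.array([c / n for c in joint.values()])
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def vi_distance(x, y):
    # d(x, y) = 1 - I(x, y) / H(x, y): normalized variation of information
    h_xy = entropy(x, y)
    return 1.0 - mutual_information(x, y) / h_xy if h_xy > 0 else 0.0

def normalized_mi(x, y):
    denom = np.sqrt(entropy(x) * entropy(y))
    return mutual_information(x, y) / denom if denom > 0 else 0.0

def g_relevance(S, features, labels, p):
    # g(S) = sum over labels of the p largest normalized MI values MI(x, label), x in S
    total = 0.0
    for lab in labels:
        scores = sorted((normalized_mi(features[x], lab) for x in S), reverse=True)
        total += sum(scores[:p])
    return total

def f_objective(S, features, labels, p):
    div = sum(vi_distance(features[u], features[v])
              for i, u in enumerate(S) for v in S[i + 1:])
    return g_relevance(S, features, labels, p) + div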
Algorithm 1: Greedy
Input: Set of features U, set of labels L, number of features we want to select k.
Output: Set S ⊂ U with |S| = k.
S ← {arg max_{u∈U} g({u})};
forall 2 ≤ i ≤ k do
    u* ← arg max_{u∈U\S} g(S ∪ {u}) − g(S) + Σ_{x∈S} d(x, u);   ▷ This arg max has a consistent tiebreaking rule (see Definition 1).
    Add u* to S;
Return S;

The problem (1) is NP-hard, but Borodin et al. [2017] show that Algorithm 2 is a half approximation in the centralized setting. Note that Algorithm 2 is a greedy algorithm under the objective where g(S) is scaled by 1/2. On the other hand, Algorithm 1 is a standard greedy algorithm for (1), and in the next section we show it is a constant factor randomized composable core-set for any function f which is the sum of a sum-sum diversity function and a non-negative, monotone, submodular function. Combining these, we conclude that Algorithm 3 is a constant factor approximation algorithm for maximizing f. Moreover, Algorithm 3 can be used both in distributed and streaming settings, as illustrated in Figure 1. In our experiments, to select k features, we use the following function:

h(S) = (1 − λ) · ( k(k−1)/(2p|L|) ) · g(S) + λ Σ_{{x_i,x_j}⊆S} d(x_i, x_j).   (2)

As discussed, the first term of h(S) promotes features that are relevant to the labels, and the second term controls the redundancy of the selected features. The term k(k−1)/(2p|L|) is a normalization coefficient to make the range of both terms the same. Also, λ is a hyper-parameter which controls the effect of the two criteria on the final function.
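A direct Python sketch (ours) of Algorithm 1 follows. The callables `g` and `dist` stand in for the relevance function and the metric distance (for example, `g_relevance` and `vi_distance` above, partially applied); `g_scale` is 1.0 for Algorithm 1 and can be set to 0.5 for the AltGreedy variant given below. These parameter names are our own.

def greedy(U, g, dist, k, g_scale=1.0):
    U = list(U)                        # a fixed a-priori order gives consistent tiebreaking
    S = [max(U, key=lambda u: g([u]))]
    for _ in range(2, k + 1):
        def gain(u):
            # marginal gain of g plus summed distance to already selected features
            return g_scale * (g(S + [u]) - g(S)) + sum(dist(u, x) for x in S)
        u_star = max((u for u in U if u not in S), key=gain)
        S.append(u_star)
    return S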
Algorithm 2: AltGreedy
Input: Set of features U, set of labels L, number of features we want to select k.
Output: Set S ⊂ U with |S| = k.
S ← {arg max_{u∈U} g({u})};
forall 2 ≤ i ≤ k do
    u* ← arg max_{u∈U\S} (1/2)(g(S ∪ {u}) − g(S)) + Σ_{x∈S} d(x, u);
    Add u* to S;
Return S;

4 Theoretical Guarantee

Let f(S) = D(S) + g(S) be a set function defined on U, where g(S) is a non-negative, monotone, submodular function and D(S) is a sum-sum diversity function, i.e., D(S) = Σ_{{u,v}⊆S} d(u, v), where d(·,·) is a metric distance. In this section, we show that Algorithm 1 is a constant factor randomized composable core-set of size k for f. We also show that running Algorithm 3, which is equivalent to running Algorithm 1 on each slave machine and then running Algorithm 2 on the master machine on the union of the outputs of the slave machines, is a constant factor randomized approximation algorithm for maximizing f subject to a cardinality constraint.
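The composition just described can be sketched in Python as follows (ours, schematic); it reuses the `greedy` function from the previous sketch. In the streaming setting the same loop over parts is executed sequentially on one machine, keeping only the current chunk and the accumulated core-sets in memory. Section 5 discusses choosing m ≈ √(d/k).

import random

def distributed_greedy(U, g, dist, k, m, seed=0):
    U = list(U)
    random.Random(seed).shuffle(U)
    parts = [U[i::m] for i in range(m)]                                    # T_1, ..., T_m
    coresets = [greedy(T_i, g, dist, min(k, len(T_i))) for T_i in parts]   # map step (in parallel)
    union = [x for S_i in coresets for x in S_i]
    return greedy(union, g, dist, k, g_scale=0.5)                          # AltGreedy on the master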
Algorithm 3: Distributed Greedy
Input: Set of features U, set of labels L, number of features we want to select k, number of machines m.
Output: Set S ⊂ U with |S| = k.
Randomly partition U into (T_i)_{i=1}^m;
forall 1 ≤ i ≤ m do
    S_i ← output of Greedy(T_i, L, k);
S ← output of AltGreedy(∪_{i=1}^m S_i, L, k);
Return S;

We use the following key concept of a β-nice algorithm from Mirrokni and Zadimoghaddam [2015] throughout our analysis.

Figure 1: Algorithm 3 operating in big data settings. (a) Distributed setting. (b) Streaming setting.
Definition 1.
Let f be a set function on U. Let ALG be an algorithm that, given any T ⊆ U, outputs ALG(T) ⊆ T. Let t ∈ T \ ALG(T). For β ∈ R+, we call ALG a β-nice algorithm if it has the following properties.

• ALG(T) = ALG(T \ {t}).
• f(ALG(T) ∪ {t}) − f(ALG(T)) ≤ β f(ALG(T))/k.

The intuition behind the first condition is simply that by removing an element of T which is not used in the algorithm's output, we do not change the output. This is effectively a condition on how we perform tiebreaking. The second condition helps to bound f(ALG(T) ∪ O), where O is a global optimum. Our theoretical analysis heavily relies on the following theorem, which is proved in Appendix B.

Theorem 1.
Let k be sufficiently large. Algorithm 1 is a 5-nice algorithm for f(·) = D(·) + g(·). Also, if ALG is Algorithm 1, T ⊆ U, and t ∈ T \ ALG(T), then (c/(k − 1)) f(ALG(T)) ≥ Σ_{x∈ALG(T)} d(t, x) for an absolute constant c.

Our main result is that Algorithm 3 is a constant factor approximation algorithm.

Theorem 2.
Let k be sufficiently large. Algorithm 3 gives a constant factor approximate solution in expectation for maximizing f(S) subject to |S| = k.

We note that for small k the constant degrades, so we focus on the large-k regime. The proof of this theorem follows from two key lemmas which bound the diversity and submodular portions of an optimal solution. We use O to denote a global optimum. To state the lemmas, we need the following notation. Let OPT(T) = arg max_{R⊆T, |R|=k} f(R). Let U be the set of all elements (for example, the set of all features for the feature selection problem) and (T_1, ..., T_m) be a random partitioning of the elements of U.

Table 1: Specifications of the datasets.

Lemma 2.
Let ALG be Algorithm 1 and S_i = ALG(T_i). Then D(O) ≤ (c + 4) f(OPT(∪_{i=1}^m S_i)), where c is the constant from Theorem 1.

Lemma 3.
Let ALG be Algorithm 1 and S_i = ALG(T_i). Then g(O) ≤ 6 f(OPT(∪_{i=1}^m S_i)) + E[f(OPT(∪_{i=1}^m S_i))].

We use Theorem 1 and techniques from a number of papers [Zadeh et al., 2017, Indyk et al., 2014, Mirrokni and Zadimoghaddam, 2015, Aghamolaei et al., 2015] to prove these two key lemmas in Appendix B. Even in cases where some parts of the proofs are similar to previous work, we include a complete proof for the sake of completeness. We should note that our analysis is not a straightforward combination of the ideas in the mentioned papers. Using Lemmas 2 and 3, we can easily prove Theorem 2.

Proof of Theorem 2.
Lemmas 2 and 3 immediately yield f(O) ≤ (c + 11) E[f(OPT(∪_{i=1}^m S_i))], where c is the constant from Theorem 1. Based on Borodin et al. [2017], we know that Algorithm 2 is a half approximation algorithm for maximizing f. Therefore, if ALG' is Algorithm 2, then f(OPT(∪_{i=1}^m S_i)) ≤ 2 f(ALG'(∪_{i=1}^m S_i)). Hence f(O) ≤ 2(c + 11) E[f(ALG'(∪_{i=1}^m S_i))], which is exactly the statement of the theorem. □

5 Experiments

In this section, we investigate the performance of our method in practice. In the first experiment, we compare our distributed method with centralized multi-label feature selection methods in the literature on a classification task. We show that our method's performance is comparable to, or in some cases even better than, previous centralized methods. Next, we compare our distributed and centralized methods on two large datasets. We show that the distributed algorithm achieves almost the same objective function value and is much faster. This implies that the distributed algorithm achieves a better approximation in practice compared to the theoretical guarantee.
Table 2: Comparison of the distributed and the centralized algorithms. "h" and "m" mean hours and minutes.
Comparison to Centralized Methods
As mentioned in Section 2, most of the multi-label feature selection methods convert the multi-label dataset to one or multiple single-label datasets, then use single-label feature selection methods, and then aggregate the results. Binary relevance (BR) and label powerset (LP) are the two best known of these conversions. Here, we combine these two conversion methods with two single-label feature selection methods, which results in four different centralized feature selection methods. We considered ReliefF (RF) [Kononenko, 1994, Robnik-Sikonja and Kononenko, 2003] and information gain (IG) [Zhao et al., 2010] as the single-label methods. These methods compute a score for each feature; for aggregating their results in the binary relevance conversion, it is enough to calculate the sum of the scores of each feature over the labels and use these sums for selecting features. These methods have been used before in the literature for multi-label feature selection [Chen et al., 2007, Dendamrongvit et al., 2011, Spolaor et al., 2011, Spolaôr et al., 2012, 2013].

Figure 2: Effect of λ on the performance of the method.

For comparison, we selected 10 to 100 features with each method and did a multi-label classification using BRKNN-b, proposed in Xioufis et al. [2008]. We did a 10-fold cross validation with five neighbors for BRKNN-b. We evaluated the classification outputs over five multi-label evaluation measures: subset accuracy, example-based accuracy, example-based F-measure, micro-averaged F-measure, and macro-averaged F-measure [Spolaôr et al., 2013, Kashef et al., 2018]. The evaluation measures are defined in Appendix C. We used the Mulan library for the classification and for the computation of the evaluation measures [Tsoumakas et al., 2011]. We used a synthesized dataset and two real-world datasets, Corel5k [Duygulu et al., 2002] and Eurlex-ev [Francesconi et al., 2010]. Their specifications are shown in Table 1. The synthesized dataset is made up of eight labels. Each label has two original features, each repeated 50 times. One of the features has the same value as its label in half of the samples, and the other one has the same value as its label in a quarter of the samples. The results on this dataset show that our method outperforms other methods on a dataset with redundant features. The results of this experiment are shown in Figure 3. Results of the example-based accuracy and macro-averaged F-measure comparison for these datasets are included in Appendix D. We named our method distributed greedy diversity plus submodular (DGDS) in the plots. The other methods are named based on the conversion method they use (i.e., BR or LP) and the feature selection method they use (i.e., RF or IG). In the experiments, we used fixed values of λ and of the top-p parameter for our method. Moreover, the methods are compared on three other datasets in Appendix D. The results of the distributed method fluctuate more compared to the other methods. The reason is that, for every number of features, we did the feature selection, including the random partitioning, from scratch. This caused more variation in its results, but it also shows that the method is relatively stable and does not produce poor quality results for different random partitionings.

As discussed, we compared our method to centralized feature selection methods because there is no distributed multi-label feature selection method prior to our work. We should note that this comparison is unfair to the distributed method because it uses much less of the data compared to centralized methods.
For example, it does not use the relation (or the distance) between features that reside on different machines. The advantage of the distributed method is that it is much faster and scalable. This is supported by experiments on its speed-up (see Table 2).
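The binary relevance aggregation used by the BR baselines described above can be sketched in a few lines of Python (ours, illustrative only); `score` stands in for any per-label, per-feature scoring function such as ReliefF or information gain, and is not a specific library call.

def br_feature_selection(features, labels, score, k):
    # Binary relevance: score every feature against every label separately,
    # sum the per-label scores, and keep the k features with the largest sums.
    totals = {x: sum(score(features[x], lab) for lab in labels) for x in features}
    return sorted(totals, key=totals.get, reverse=True)[:k]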
Comparison of Distributed and Centralized Algorithms

Here, we compare the performance of our proposed algorithm (Algorithm 3) with the centralized algorithm introduced in Borodin et al. [2017] (Algorithm 2) on the optimization task. We compare the runtime and the value of the objective function that the algorithms achieve.
Figure 3: Comparison of the proposed distributed method with centralized methods in the literature on (a) Corel5k, (b) Eurlex-ev, and (c) the synthesized dataset.

We select 10, 50, 100, and 200 features on two large datasets. If there are d′ features on a machine and we want to select k of them, then the runtime of the machine is O(d′k). Therefore, if we have ⌈√(d/k)⌉ slave machines, then each of them has O(√(dk)) features and its runtime is equal to O(k√(dk)), where d is the total number of features. Also, the master machine will have O(√(dk)) features, and its runtime is O(k√(dk)), which means the runtime complexities of the master machine and the slave machines are equal. If we increase or decrease the number of slave machines, then the running time of the master machine or the slave machines will increase, which results in a lower speed-up. Hence, we set the number of slave machines equal to ⌈√(d/k)⌉. The results show that in practice our proposed distributed algorithm achieves an approximate solution as good as the centralized algorithm in a much shorter time. The results are summarized in Table 2. Moreover, we compared the distributed and the centralized algorithms on the classification task. The results of this experiment are included in Appendix E.

Effect of the λ hyper-parameter

To show the importance of both terms of the objective function, redundancy (the diversity function) and relevance (the submodular function), we compared the performance of the method for different λ values. We select 20, 30, 40, and 50 features on the scene dataset [Boutell et al., 2004]. As shown in Figure 2, the best performance happens for some λ between 0 and 1. This shows that both terms are necessary and that it is possible to get better results by choosing λ carefully.

6 Conclusion

In this paper, we presented a greedy algorithm for maximizing the sum of a sum-sum diversity function and a non-negative, monotone, submodular function subject to a cardinality constraint in distributed and streaming settings. We showed that this algorithm guarantees a provable theoretical approximation. Moreover, we formulated the multi-label feature selection problem as such an optimization problem and developed a multi-label feature selection method for distributed and streaming settings that can handle the redundancy of the features. Improving the theoretical approximation guarantee is appealing for future work. From the empirical standpoint, it would be nice to try other metric distances and other submodular functions for the multi-label feature selection problem.
References
Abbasi Zadeh, S. and Ghadiri, M. (2015). Max-sum diversi-fication, monotone submodular functions and semi-metricspaces.
CoRR , abs/1511.02402.Abbassi, Z., Mirrokni, V. S., and Thakur, M. (2013). Di-versity maximization under matroid constraints. In
The19th ACM SIGKDD International Conference on Knowl-edge Discovery and Data Mining, KDD 2013, Chicago, IL,USA, August 11-14, 2013 , pages 32–40.Agarwal, P. K., Har-Peled, S., and Varadarajan, K. R.(2005). Geometric approximation via coresets.
Combina-torial and computational geometry , 52:1–30.Aghamolaei, S., Farhadi, M., and Zarrabi-Zadeh, H. (2015).Diversity maximization via composable coresets. In
Pro-ceedings of the 27th Canadian Conference on Computa-tional Geometry, CCCG 2015, Kingston, Ontario, Canada,August 10-12, 2015 .Agrawal, R., Gollapudi, S., Halverson, A., and Ieong, S.(2009). Diversifying search results. In
Proceedings of theSecond International Conference on Web Search and WebData Mining, WSDM 2009, Barcelona, Spain, February9-11, 2009 , pages 5–14.Barbosa, R., Ene, A., Nguyen, H. L., and Ward, J. (2015).The power of randomization: Distributed submodular max-imization on massive datasets. In
Proceedings of the 32ndInternational Conference on Machine Learning, ICML2015, Lille, France, 6-11 July 2015 , pages 1236–1244.Bhaskara, A., Ghadiri, M., Mirrokni, V. S., and Svensson,O. (2016). Linear relaxations for finding diverse elementsin metric spaces. In
Advances in Neural InformationProcessing Systems 29: Annual Conference on Neural In-formation Processing Systems 2016, December 5-10, 2016,Barcelona, Spain , pages 4098–4106.Bolón-Canedo, V., Sánchez-Maroño, N., and Alonso-Betanzos, A. (2015a). Recent advances and emergingchallenges of feature selection in the context of big data.
Knowl.-Based Syst. , 86:33–45.Bolón-Canedo, V., Sánchez-Maroño, N., and Alonso-Betanzos, A. (2015b). Recent advances and emergingchallenges of feature selection in the context of big data.
Knowl.-Based Syst. , 86:33–45.Borodin, A., Jain, A., Lee, H. C., and Ye, Y. (2017). Max-sum diversification, monotone submodular functions, anddynamic updates.
ACM Trans. Algorithms , 13(3):41:1–41:25.Boutell, M. R., Luo, J., Shen, X., and Brown, C. M.(2004). Learning multi-label scene classification.
PatternRecognition , 37(9):1757–1771.Chen, W., Yan, J., Zhang, B., Chen, Z., and Yang, Q.(2007). Document transformation for multi-label featureselection in text categorization. In
Proceedings of the 7thIEEE International Conference on Data Mining (ICDM2007), October 28-31, 2007, Omaha, Nebraska, USA , pages451–456. Dasgupta, A., Kumar, R., and Ravi, S. (2013). Summariza-tion through submodularity and dispersion. In
Proceedingsof the 51st Annual Meeting of the Association for Com-putational Linguistics, ACL 2013, 4-9 August 2013, Sofia,Bulgaria, Volume 1: Long Papers , pages 1014–1022.Demidova, E., Fankhauser, P., Zhou, X., and Nejdl, W.(2010).
DivQ : diversification for keyword search over struc-tured databases. In
Proceeding of the 33rd InternationalACM SIGIR Conference on Research and Development inInformation Retrieval, SIGIR 2010, Geneva, Switzerland,July 19-23, 2010 , pages 331–338.Dendamrongvit, S., Vateekul, P., and Kubat, M. (2011).Irrelevant attributes and imbalanced classes in multi-labeltext-categorization domains.
Intelligent Data Analysis ,15(6):843–859.Drosou, M. and Pitoura, E. (2010). Search result diversifi-cation.
SIGMOD Record , 39(1):41–47.Duygulu, P., Barnard, K., de Freitas, J. F. G., and Forsyth,D. A. (2002). Object recognition as machine translation:Learning a lexicon for a fixed image vocabulary. In
Com-puter Vision - ECCV 2002, 7th European Conferenceon Computer Vision, Copenhagen, Denmark, May 28-31,2002, Proceedings, Part IV , pages 97–112.Francesconi, E., Montemagni, S., Peters, W., and Tiscor-nia, D. (2010).
Semantic processing of legal texts: Wherethe language of law meets the law of language , volume 6036.Springer.Guyon, I. and Elisseeff, A. (2003). An introduction to vari-able and feature selection.
Journal of Machine LearningResearch , 3:1157–1182.Hassin, R., Rubinstein, S., and Tamir, A. (1997). Approx-imation algorithms for maximum dispersion.
Oper. Res.Lett.
Indyk, P., Mahabadi, S., Mahdian, M., and Mirrokni, V. S.(2014). Composable core-sets for diversity and coveragemaximization. In
Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of DatabaseSystems, PODS’14, Snowbird, UT, USA, June 22-27, 2014 ,pages 100–108.Kashef, S., Nezamabadi-pour, H., and Nikpour, B. (2018).Multilabel feature selection: A comprehensive review andguiding experiments.
Wiley Interdisc. Rew.: Data Miningand Knowledge Discovery , 8(2).Khanna, R., Elenberg, E. R., Dimakis, A. G., Negah-ban, S., and Ghosh, J. (2017). Scalable greedy featureselection via weak submodularity. In
Proceedings of the20th International Conference on Artificial Intelligenceand Statistics, AISTATS 2017, 20-22 April 2017, FortLauderdale, FL, USA , pages 1560–1568.Kononenko, I. (1994). Estimating attributes: Analysis andextensions of RELIEF. In
Machine Learning: ECML-94,European Conference on Machine Learning, Catania, Italy,April 6-8, 1994, Proceedings , pages 171–182.Krause, A. and Golovin, D. (2014). Submodular functionmaximization. In
Tractability: Practical Approaches toHard Problems , pages 71–104. ulti-Label Feature Selection Using Submodular Plus Diversity Maximization
Krause, A. and Guestrin, C. (2008). Beyond convexity:Submodularity in machine learning.
ICML Tutorials .Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004).RCV1: A new benchmark collection for text categorizationresearch.
Journal of Machine Learning Research , 5:361–397.Liu, Z., Sun, P., and Chen, Y. (2009). Structured searchresult differentiation.
PVLDB , 2(1):313–324.Mirrokni, V. S. and Zadimoghaddam, M. (2015). Ran-domized composable core-sets for distributed submodularmaximization. In
Proceedings of the Forty-Seventh AnnualACM on Symposium on Theory of Computing, STOC 2015,Portland, OR, USA, June 14-17, 2015 , pages 153–162.Mirzasoleiman, B., Karbasi, A., Sarkar, R., and Krause,A. (2016). Distributed submodular maximization.
Journalof Machine Learning Research , 17:238:1–238:44.Nguyen, X. V., Epps, J., and Bailey, J. (2010). Informationtheoretic measures for clusterings comparison: Variants,properties, normalization and correction for chance.
Jour-nal of Machine Learning Research , 11:2837–2854.Peng, H., Long, F., and Ding, C. H. Q. (2005). Featureselection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy.
IEEETrans. Pattern Anal. Mach. Intell. , 27(8):1226–1238.Robnik-Sikonja, M. and Kononenko, I. (2003). Theoreticaland empirical analysis of relieff and rrelieff.
MachineLearning , 53(1-2):23–69.Spolaor, N., Cherman, E., and Monard, M. (2011). Usingrelieff for multi-label feature selection. In
ConferenciaLatinoamericana de Informática , pages 960–975.Spolaôr, N., Cherman, E. A., Monard, M. C., and Lee,H. D. (2012). Filter approach feature selection methodsto support multi-label learning based on relieff and infor-mation gain. In
Advances in Artificial Intelligence - SBIA2012 - 21th Brazilian Symposium on Artificial Intelligence,Curitiba, Brazil, October 20-25, 2012. Proceedings , pages72–81.Spolaôr, N., Cherman, E. A., Monard, M. C., and Lee,H. D. (2013). A comparison of multi-label feature selec-tion methods using the problem transformation approach.
Electr. Notes Theor. Comput. Sci. , 292:135–151.Srivastava, A. N. and Zane-Ulman, B. (2005). Discoveringrecurring anomalies in text reports regarding complexspace systems. In
Aerospace conference, 2005 IEEE , pages3853–3862. IEEE.Tsoumakas, G., Katakis, I., and Vlahavas, I. (2008). Ef-fective and efficient multilabel classification in domainswith large number of labels. In
Proc. ECML/PKDD 2008Workshop on Mining Multidimensional Data (MMD’08) ,volume 21, pages 53–59. sn.Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., andVlahavas, I. (2011). Mulan: A java library for multi-labellearning.
Journal of Machine Learning Research , 12:2411–2414. Turnbull, D., Barrington, L., Torres, D. A., and Lanckriet,G. R. G. (2008). Semantic annotation and retrieval ofmusic and sound effects.
IEEE Trans. Audio, Speech &Language Processing , 16(2):467–476.Xioufis, E. S., Tsoumakas, G., and Vlahavas, I. P. (2008).An empirical study of lazy multilabel classification algo-rithms. In
Artificial Intelligence: Theories, Models andApplications, 5th Hellenic Conference on AI, SETN 2008,Syros, Greece, October 2-4, 2008. Proceedings , pages 401–406.Yu, C., Lakshmanan, L. V. S., and Amer-Yahia, S. (2009).It takes variety to make a world: diversification in rec-ommender systems. In
EDBT 2009, 12th InternationalConference on Extending Database Technology, Saint Pe-tersburg, Russia, March 24-26, 2009, Proceedings , pages368–378.Zadeh, S. A., Ghadiri, M., Mirrokni, V. S., and Zadi-moghaddam, M. (2017). Scalable feature selection viadistributed diversity maximization. In
Proceedings of theThirty-First AAAI Conference on Artificial Intelligence,February 4-9, 2017, San Francisco, California, USA. ,pages 2876–2883.Zhao, Z., Morstatter, F., Sharma, S., Alelyani, S., Anand,A., and Liu, H. (2010). Advancing feature selection re-search.
ASU feature selection repository , pages 1–28.
A Appendix A
Proof of Lemma 1.
Clearly g is non-negative and monotone. Since the sum of submodular functions is a submodular function, we only need to show that top-p_{x∈S}{MI(x, ℓ)} is submodular. We assume that the top-p of an empty collection is 0 (and that, when a set has fewer than p elements, the missing values are treated as 0). Let S ⊆ T ⊂ U and a ∈ U \ T. We show that

top-p_{x∈S∪{a}}{MI(x, ℓ)} − top-p_{x∈S}{MI(x, ℓ)} ≥ top-p_{x∈T∪{a}}{MI(x, ℓ)} − top-p_{x∈T}{MI(x, ℓ)}.

We have two cases. If MI(a, ℓ) is not among the p largest numbers of {MI(x, ℓ) | x ∈ S ∪ {a}}, then both sides of the above inequality are zero. If MI(a, ℓ) is among the p largest numbers of {MI(x, ℓ) | x ∈ S ∪ {a}}, then the left hand side of the inequality is equal to MI(a, ℓ) − MI(b, ℓ), where MI(b, ℓ) is the p-th largest value in {MI(x, ℓ) | x ∈ S}. The right hand side is equal to max{0, MI(a, ℓ) − MI(c, ℓ)}, where MI(c, ℓ) is the p-th largest value in {MI(x, ℓ) | x ∈ T}. The p-th largest value in {MI(x, ℓ) | x ∈ T} is greater than or equal to the p-th largest value in {MI(x, ℓ) | x ∈ S} because S ⊆ T. Therefore, in this case MI(a, ℓ) − MI(b, ℓ) ≥ max{0, MI(a, ℓ) − MI(c, ℓ)} and the inequality holds. □
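As a quick numerical illustration of the diminishing-returns argument above (ours, a sanity check rather than part of the proof), the following Python snippet verifies the inequality for the top-p function on random scores standing in for MI(x, ℓ).

import random

def top_p(scores, S, p):
    return sum(sorted((scores[x] for x in S), reverse=True)[:p])

random.seed(0)
scores = {x: random.random() for x in range(20)}   # stands in for MI(x, l)
p = 3
for _ in range(1000):
    T = random.sample(range(20), 10)
    S = random.sample(T, 5)
    a = random.choice([x for x in range(20) if x not in T])
    lhs = top_p(scores, S + [a], p) - top_p(scores, S, p)
    rhs = top_p(scores, T + [a], p) - top_p(scores, T, p)
    assert lhs >= rhs - 1e-12   # diminishing returns (submodularity) holds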
B Appendix B

For S ⊆ U and x ∈ U \ S, let Δ(x, S) = g(S ∪ {x}) − g(S). We now show that Algorithm 1 is a β-nice algorithm for f. This is ultimately needed for the proofs of both key lemmas.

Proof of Theorem 1.
Let ALG be Algorithm 1, T ⊆ U, t ∈ T \ ALG(T), and x_1, ..., x_k be the elements that ALG selected, in the order of selection. Also, let S_i = {x_1, ..., x_i} and S_0 = ∅.

For the first property of β-nice algorithms it is enough to have a consistent tiebreaking rule for ALG. It is sufficient to fix an ordering on all elements of U up front; if some iteration finds multiple elements with the same maximum marginal gain, then it should select the earliest one in the a priori ordering.

Now we prove the second property of β-nice algorithms for ALG. Because of the greedy selection of ALG, we have the following inequalities:

Δ(x_1, S_0) ≥ Δ(t, S_0)
Δ(x_2, S_1) + d(x_2, x_1) ≥ d(t, x_1) + Δ(t, S_1)
Δ(x_3, S_2) + Σ_{i=1}^{2} d(x_3, x_i) ≥ Σ_{i=1}^{2} d(t, x_i) + Δ(t, S_2)
...
Δ(x_k, S_{k−1}) + Σ_{i=1}^{k−1} d(x_k, x_i) ≥ Σ_{i=1}^{k−1} d(t, x_i) + Δ(t, S_{k−1}).

Adding these inequalities together gives

g(S_k) + D(S_k) ≥ Σ_{i=1}^{k−1} (k − i) d(t, x_i) + Σ_{i=0}^{k−1} Δ(t, S_i) ≥ Σ_{i=1}^{k−1} (k − i) d(t, x_i) + k Δ(t, S_k),   (3)

where the second inequality holds because of the submodularity of g. Note that

f(ALG(T) ∪ {t}) − f(ALG(T)) = Δ(t, ALG(T)) + Σ_{x∈ALG(T)} d(x, t).   (4)

One may thus note that if the right-hand side coefficients in (3) were all k/2 (instead of k − i) we would have 2-niceness of the algorithm. Our strategy is to achieve this by shifting some of the "weight" from coefficients where k − i > k/2 to coefficients smaller than k/2. This uses the metric inequality: since d(x_{k−i}, x_i) + d(x_i, t) ≥ d(x_{k−i}, t), if we add d(x_{k−i}, x_i) to both sides of (3), then we may increase the coefficient of d(t, x_{k−i}) by 1 at the expense of reducing the coefficient of d(t, x_i) by 1.

We use this idea to fix all of the "small" coefficients in bulk by adding a batch of distinct distances to both sides of (3). Since these distances are distinct, we increase the left-hand side by at most D(S_k); in particular, the new left-hand side will be at most g(S_k) + 2D(S_k). The batch of distances we add to both sides of the inequality is Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(x_i, x_j). Clearly these distances are distinct, so we now need to make sure that the strategy produces the desired coefficients of the terms d(t, x_i). More formally, we claim that the following inequality holds.

Claim 1.

Σ_{i=1}^{k−1} (k − i) d(t, x_i) + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(x_i, x_j) ≥ Σ_{i=1}^{k} (⌈k/2⌉ − 1) d(t, x_i).

We prove this claim later. Using it, we have

g(S_k) + 2D(S_k) ≥ g(S_k) + D(S_k) + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(x_i, x_j)
≥ Σ_{i=1}^{k−1} (k − i) d(t, x_i) + k Δ(t, S_k) + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(x_i, x_j)
≥ Σ_{i=1}^{k} (⌈k/2⌉ − 1) d(t, x_i) + (⌈k/2⌉ − 1) Δ(t, S_k),

where the second inequality is (3), and the last inequality follows from Claim 1 (which relies on the metric property, i.e., the triangle inequality) and the monotonicity of g. By using the above inequality, the non-negativity of g, and (4), we have

(2/(⌈k/2⌉ − 1)) f(ALG(T)) = (2 g(S_k) + 2 D(S_k))/(⌈k/2⌉ − 1) ≥ (g(S_k) + 2 D(S_k))/(⌈k/2⌉ − 1) ≥ Σ_{i=1}^{k} d(t, x_i) + Δ(t, S_k) = f(ALG(T) ∪ {t}) − f(ALG(T)).

One can check that for sufficiently large k, 2/(⌈k/2⌉ − 1) ≤ 5/k and 2/(⌈k/2⌉ − 1) ≤ c/(k − 1) for a suitable absolute constant c. Therefore, ALG is a 5-nice algorithm for f and, because of the monotonicity of g, (c/(k − 1)) f(ALG(T)) ≥ Σ_{i=1}^{k} d(t, x_i). □

Now we prove Claim 1 to conclude Theorem 1.
Proof of Claim 1.
Note that k = ⌈k/2⌉ + ⌊k/2⌋ and ⌊k/2⌋ + 1 ≥ ⌈k/2⌉. First, we show that

Σ_{j=1}^{k−⌊k/2⌋−1} (⌈k/2⌉ − j) d(t, x_j) = Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(t, x_j).   (5)

On the right hand side of (5), d(t, x_j) appears in the inner summation when i − ⌊k/2⌋ − 1 ≥ j or, equivalently, when i ≥ j + ⌊k/2⌋ + 1. We know that k ≥ i ≥ ⌈k/2⌉ + 1. We also know that j ≥ 1; hence j + ⌊k/2⌋ + 1 ≥ ⌈k/2⌉ + 1. Therefore, d(t, x_j) appears in the inner summation exactly when k ≥ i ≥ j + ⌊k/2⌋ + 1. This means that d(t, x_j) appears k − j − ⌊k/2⌋ = ⌈k/2⌉ − j many times on the right hand side of (5). Moreover, note that the index j on the right hand side of (5) ranges between 1 and k − ⌊k/2⌋ − 1. Hence (5) holds.

Let

A = Σ_{i=k−⌊k/2⌋}^{k} (k − i) d(t, x_i) + Σ_{i=1}^{k−⌊k/2⌋−1} (⌈k/2⌉ − 1) d(t, x_i).

By decomposing Σ_{i=1}^{k−1} (k − i) d(t, x_i) into three summations, noting that (k − k) d(t, x_k) = 0, and using (5), we have

Σ_{i=1}^{k−1} (k − i) d(t, x_i)
= Σ_{i=k−⌊k/2⌋}^{k} (k − i) d(t, x_i) + Σ_{i=1}^{k−⌊k/2⌋−1} (⌈k/2⌉ − 1) d(t, x_i) + Σ_{j=1}^{k−⌊k/2⌋−1} (k − j − ⌈k/2⌉ + 1) d(t, x_j)
= A + Σ_{j=1}^{k−⌊k/2⌋−1} (⌊k/2⌋ − j + 1) d(t, x_j)
≥ A + Σ_{j=1}^{k−⌊k/2⌋−1} (⌈k/2⌉ − j) d(t, x_j)
= A + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(t, x_j).

Therefore, by the triangle inequality and the above statements, we have

Σ_{i=1}^{k−1} (k − i) d(t, x_i) + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(x_i, x_j)
≥ A + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(t, x_j) + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(x_i, x_j)
= A + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} (d(t, x_j) + d(x_i, x_j))
≥ A + Σ_{i=⌈k/2⌉+1}^{k} Σ_{j=1}^{i−⌊k/2⌋−1} d(t, x_i)
= A + Σ_{i=⌈k/2⌉+1}^{k} (i − ⌊k/2⌋ − 1) d(t, x_i)
≥ A + Σ_{i=⌈k/2⌉+1}^{k} (i − ⌊k/2⌋ − 1) d(t, x_i) + (⌈k/2⌉ − ⌊k/2⌋ − 1) d(t, x_{⌈k/2⌉})
= A + Σ_{i=⌈k/2⌉}^{k} (i − ⌊k/2⌋ − 1) d(t, x_i)
= Σ_{i=k−⌊k/2⌋}^{k} (k − i) d(t, x_i) + Σ_{i=1}^{k−⌊k/2⌋−1} (⌈k/2⌉ − 1) d(t, x_i) + Σ_{i=k−⌊k/2⌋}^{k} (i − ⌊k/2⌋ − 1) d(t, x_i)
= Σ_{i=k−⌊k/2⌋}^{k} (k − i + i − ⌊k/2⌋ − 1) d(t, x_i) + Σ_{i=1}^{k−⌊k/2⌋−1} (⌈k/2⌉ − 1) d(t, x_i)
= Σ_{i=k−⌊k/2⌋}^{k} (⌈k/2⌉ − 1) d(t, x_i) + Σ_{i=1}^{k−⌊k/2⌋−1} (⌈k/2⌉ − 1) d(t, x_i)
= Σ_{i=1}^{k} (⌈k/2⌉ − 1) d(t, x_i),

where the third-to-last equality uses k − ⌊k/2⌋ = ⌈k/2⌉. This yields the result. □
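The following small Python check (ours, purely illustrative and not part of the proof) verifies the inequality of Claim 1 numerically for random points under the Euclidean metric and small k.

import math, random

def check_claim(k, trials=200):
    for _ in range(trials):
        pts = [(random.random(), random.random()) for _ in range(k)]
        t = (random.random(), random.random())
        d = math.dist
        lhs = sum((k - i) * d(t, pts[i - 1]) for i in range(1, k))
        lhs += sum(d(pts[i - 1], pts[j - 1])
                   for i in range(math.ceil(k / 2) + 1, k + 1)
                   for j in range(1, i - k // 2))
        rhs = (math.ceil(k / 2) - 1) * sum(d(t, p) for p in pts)
        assert lhs >= rhs - 1e-9

for k in range(3, 12):
    check_claim(k)
print("Claim 1 held on all sampled instances")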
We now proceed to bound the diversity part of the optimal solution (Lemma 2). We re-use the key ideas from Aghamolaei et al. [2015] to achieve this. Let O be an optimal solution for maximizing f(S) subject to S ⊆ U and |S| = k. Let O_i = T_i ∩ O and Q_i = O_i \ S_i. So Q_i are the elements of O on machine i that were "missed" by S_i. Intuitively, we bound the damage to optimality from missing these elements by finding a low-weight matching between Q_i and S_i. The following normalization parameters are used in the next two lemmas: r_i = f(S_i)/(k choose 2) and r = max_{i=1,...,m} r_i. Let G_i(O_i ∪ S_i, E) be a complete weighted graph. For u, v ∈ O_i ∪ S_i, we use d(u, v) as the edge weight in our matching problem.

Lemma 4.
There exists a bipartite matching between Q_i and S_i in G_i, covering all of Q_i, with a weight of at most (c/2) |Q_i| r, where c is the constant from Theorem 1.

Proof.
The number of maximal bipartite matchings between Q_i and S_i that cover Q_i is k!/(k − |Q_i|)!. Any of these matchings covers Q_i because |Q_i| ≤ |S_i|. Each edge {q, x} with q ∈ Q_i and x ∈ S_i is in (k − 1)!/(k − |Q_i|)! of these matchings. Hence the total weight of all matchings can be expressed as

((k − 1)!/(k − |Q_i|)!) Σ_{q∈Q_i} Σ_{x∈S_i} d(q, x)
≤ ((k − 1)!/(k − |Q_i|)!) Σ_{q∈Q_i} (c/(k − 1)) f(S_i)
≤ ((k − 1)!/(k − |Q_i|)!) Σ_{q∈Q_i} (c/(k − 1)) (k choose 2) r
= ((k − 1)!/(k − |Q_i|)!) |Q_i| (c/2) k r
= (k!/(k − |Q_i|)!) (c/2) |Q_i| r.

The first inequality is from Theorem 1 and the second from the definition of r. It follows that there exists a matching with a weight of at most (c/2) |Q_i| r.

We are now in a position to upper bound the diversity portion of an optimal solution in terms of f(OPT(∪_{i=1}^m S_i)).

Proof of Lemma 2.
Let M_i be a maximal bipartite matching between Q_i and S_i with a weight of at most (c/2) |Q_i| r; it exists because of Lemma 4. Let M = ∪_{i=1}^m M_i. Note that the S_i's are disjoint and the Q_i's are disjoint. This implies that the M_i's are disjoint. Therefore, M is a matching between ∪_{i=1}^m Q_i and ∪_{i=1}^m S_i that covers all of ∪_{i=1}^m Q_i, with a weight of at most (c/2) Σ_{i=1}^m |Q_i| r ≤ (c/2) |O| r = (c/2) k r.

Let e : O → ∪_{i=1}^m S_i be a mapping which maps any o ∈ O ∩ (∪_{i=1}^m S_i) to itself and any o ∈ ∪_{i=1}^m Q_i to its matched vertex in M. The weight of this mapping, Σ_{o∈O} d(o, e(o)), is at most the weight of M, since d(o, o) = 0. Note that each vertex in range(e) is mapped to from at most two vertices in O. We use this fact in the second inequality below, and we use the triangle inequality in the first inequality. We have

D(O) = Σ_{{u,v}⊆O} d(u, v)
≤ Σ_{{u,v}⊆O} ( d(u, e(u)) + d(e(u), e(v)) + d(e(v), v) )
= (|O| − 1) Σ_{o∈O} d(o, e(o)) + Σ_{{u,v}⊆O} d(e(u), e(v))
≤ (k − 1) (c/2) k r + 4 D(range(e))
≤ c (k choose 2) r + 4 f(OPT(∪_{i=1}^m S_i))
≤ (c + 4) f(OPT(∪_{i=1}^m S_i)). □

Now, we proceed to bound g(O). The proofs of the next two lemmas follow those found in Mirrokni and Zadimoghaddam [2015]. Let o_1, ..., o_k be an ordering of the elements of O. For x = o_i ∈ O, define O_x = {o_1, ..., o_{i−1}}, with O_{o_1} = ∅.

Lemma 5. g(O) ≤ 6 f(OPT(∪_{i=1}^m S_i)) + Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} (Δ(x, O_x) − Δ(x, O_x ∪ S_i)).

Proof.
Note that g(O) = g(O ∩ (∪_{i=1}^m S_i)) + Σ_{x∈O\(∪_{i=1}^m S_i)} Δ(x, O_x ∪ (O ∩ (∪_{i=1}^m S_i))). Therefore, using the submodularity and monotonicity of g and the 5-niceness of Algorithm 1, we have

g(O) ≤ f(OPT(∪_{i=1}^m S_i)) + Σ_{x∈O\(∪_{i=1}^m S_i)} Δ(x, O_x)
= f(OPT(∪_{i=1}^m S_i)) + Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} ( Δ(x, O_x ∪ S_i) + Δ(x, O_x) − Δ(x, O_x ∪ S_i) )
≤ f(OPT(∪_{i=1}^m S_i)) + Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} ( Δ(x, S_i) + Δ(x, O_x) − Δ(x, O_x ∪ S_i) )
≤ f(OPT(∪_{i=1}^m S_i)) + Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} ( (5/k) f(S_i) + Δ(x, O_x) − Δ(x, O_x ∪ S_i) )
≤ f(OPT(∪_{i=1}^m S_i)) + Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} ( (5/k) f(OPT(∪_{i=1}^m S_i)) + Δ(x, O_x) − Δ(x, O_x ∪ S_i) )
≤ f(OPT(∪_{i=1}^m S_i)) + 5 f(OPT(∪_{i=1}^m S_i)) + Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} ( Δ(x, O_x) − Δ(x, O_x ∪ S_i) )
= 6 f(OPT(∪_{i=1}^m S_i)) + Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} ( Δ(x, O_x) − Δ(x, O_x ∪ S_i) ). □

In the next lemma, we use the randomness of the partitioning of the data over machines and the first property of β-niceness.

Lemma 6. E[ Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} (Δ(x, O_x) − Δ(x, O_x ∪ S_i)) ] ≤ E[f(OPT(∪_{i=1}^m S_i))].

Proof.
We show that E[ Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} (Δ(x, O_x) − Δ(x, O_x ∪ S_i)) ] ≤ E[ Σ_{i=1}^m g(S_i) ]/m, and the statement of the lemma follows from the fact that Σ_{i=1}^m g(S_i)/m ≤ f(OPT(∪_{i=1}^m S_i)). We first establish the inequality A ≤ B/m, where

A := E[ Σ_{i=1}^m Σ_{x∈O∩T_i\S_i} (Δ(x, O_x) − Δ(x, O_x ∪ S_i)) ]  and  B := E[ Σ_{i=1}^m Σ_{x∈O} (Δ(x, O_x) − Δ(x, O_x ∪ S_i)) ].

Let ALG be Algorithm 1. For T ⊆ U and x ∈ U, let q(x, T) = Δ(x, O_x) − Δ(x, O_x ∪ ALG(T)). Let P[·] be the probability mass function for the uniform distribution over m-partitions P = (T_1, ..., T_m) of U, and let [x ∉ ALG(T ∪ {x})] be a 0, 1 indicator function. Note that

P[T_i = T] = (1/m)^{|T|} (1 − 1/m)^{|U|−|T|},
P[T_i = T ∪ {x}] = (1/m)^{|T|+1} (1 − 1/m)^{|U|−|T|−1}.

Therefore,

P[T_i = T ∪ {x}] = ( P[T_i = T] + P[T_i = T ∪ {x}] ) / m.   (6)

We have that

A = Σ_{i=1}^m Σ_{x∈O} Σ_{T⊆U\{x}} P[T_i = T ∪ {x}] [x ∉ ALG(T ∪ {x})] q(x, T ∪ {x}),

B = Σ_{i=1}^m Σ_{x∈O} Σ_{T⊆U\{x}} ( P[T_i = T ∪ {x}] q(x, T ∪ {x}) + P[T_i = T] q(x, T) )
≥ Σ_{i=1}^m Σ_{x∈O} Σ_{T⊆U\{x}} [x ∉ ALG(T ∪ {x})] q(x, T ∪ {x}) ( P[T_i = T ∪ {x}] + P[T_i = T] ).

The last inequality holds because q(·,·) is a non-negative function, so multiplying it by [x ∉ ALG(T ∪ {x})] can only decrease the sum. Also, q(x, T) is replaced by q(x, T ∪ {x}); this does not change the sum because, when [x ∉ ALG(T ∪ {x})] = 1, the first property of β-niceness gives ALG(T) = ALG(T ∪ {x}) and hence q(x, T) = q(x, T ∪ {x}). We now deduce A ≤ B/m from (6).

Now note that Σ_{x∈O} Δ(x, O_x ∪ S_i) = g(O ∪ S_i) − g(S_i) and Σ_{x∈O} Δ(x, O_x) = g(O). Therefore, because of the monotonicity of g, we have for any i

Σ_{x∈O} ( Δ(x, O_x) − Δ(x, O_x ∪ S_i) ) = g(O) − g(O ∪ S_i) + g(S_i) ≤ g(S_i).

Hence B ≤ E[ Σ_{i=1}^m g(S_i) ], so A ≤ B/m ≤ E[ Σ_{i=1}^m g(S_i) ]/m, and the lemma follows.

We now have that Lemma 3 follows directly from Lemmas 5 and 6, as they imply g(O) ≤ 6 f(OPT(∪_{i=1}^m S_i)) + E[f(OPT(∪_{i=1}^m S_i))]. Therefore, this completes the proof of Theorem 2.
C Appendix C
Let n be the number of samples in the dataset, L_i be the set of labels for sample i that are 1 in the dataset, and L′_i be the set of labels for sample i that we predicted to be 1. Then the subset accuracy of our learning method is equal to

(1/n) Σ_{i=1}^n I(L_i, L′_i),

where I(·,·) is a 0, 1 indicator function which is equal to 1 when the set L_i is equal to the set L′_i, and 0 otherwise. Example-based accuracy is equal to

(1/n) Σ_{i=1}^n |L_i ∩ L′_i| / |L_i ∪ L′_i|.

Example-based F-measure is equal to

(1/n) Σ_{i=1}^n 2 |L_i ∩ L′_i| / ( |L_i| + |L′_i| ).

These evaluation measures are example-based. Micro-averaged F-measure and macro-averaged F-measure are two label-based measures for multi-label classification. Let t be the number of labels in the dataset, E_i be the set of examples whose i-th label is equal to 1, and E′_i be the set of examples whose i-th label we predicted to be 1. Then macro-averaged F-measure is equal to

(1/t) Σ_{i=1}^t 2 |E_i ∩ E′_i| / ( |E_i| + |E′_i| ),

and micro-averaged F-measure is equal to

2 Σ_{i=1}^t |E_i ∩ E′_i| / ( Σ_{i=1}^t |E_i| + Σ_{i=1}^t |E′_i| ).
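For concreteness, a small Python sketch (ours) of these five measures follows; it takes the true and predicted label sets per example, and the variable names are illustrative rather than part of any library.

def subset_accuracy(true_sets, pred_sets):
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)

def example_based_accuracy(true_sets, pred_sets):
    return sum(len(t & p) / len(t | p) if t | p else 1.0
               for t, p in zip(true_sets, pred_sets)) / len(true_sets)

def example_based_f1(true_sets, pred_sets):
    return sum(2 * len(t & p) / (len(t) + len(p)) if t or p else 1.0
               for t, p in zip(true_sets, pred_sets)) / len(true_sets)

def label_sets(sets_per_example, num_labels):
    # E_i: the examples whose i-th label is 1
    return [{j for j, s in enumerate(sets_per_example) if i in s} for i in range(num_labels)]

def macro_f1(true_sets, pred_sets, num_labels):
    E, Ep = label_sets(true_sets, num_labels), label_sets(pred_sets, num_labels)
    return sum(2 * len(e & ep) / (len(e) + len(ep)) if e or ep else 1.0
               for e, ep in zip(E, Ep)) / num_labels

def micro_f1(true_sets, pred_sets, num_labels):
    E, Ep = label_sets(true_sets, num_labels), label_sets(pred_sets, num_labels)
    num = 2 * sum(len(e & ep) for e, ep in zip(E, Ep))
    den = sum(len(e) for e in E) + sum(len(ep) for ep in Ep)
    return num / den if den else 1.0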
D Appendix D

Results of the example-based accuracy and macro-averaged F-measure comparisons for the Corel5k, Eurlex-ev, and synthesized datasets are shown in Figure 4. Specifications of three other datasets are shown in Table 3, and the performance of our method on these datasets is compared to centralized methods in Figure 5.
Table 3: Specifications of other datasets.
E Appendix E
The performances of our distributed and centralized methods are compared in Figure 6.
Figure 4: Comparison of the proposed distributed method with centralized methods in the literature on (a) Corel5k, (b) Eurlex-ev, and (c) the synthesized dataset.
Figure 5: Comparison of the proposed distributed method with centralized methods in the literature on (a) CAL500, (b) Delicious, and (c) Scene.

Figure 6: Comparison of the distributed and centralized methods on the classification task, on (a) Corel5k, (b) the synthesized dataset, and (c) Scene.