Simultaneous Feature and Expert Selection within Mixture of Experts
Billy Peralta∗
Department of Informatics, Universidad Católica de Temuco, Chile.
∗Corresponding author. Telephone: (56 45) 255 3948. Email: [email protected]
Abstract
A useful strategy to deal with complex classification scenarios is the "divide and conquer" approach. The mixture of experts (MoE) technique makes use of this strategy by jointly training a set of classifiers, or experts, that are specialized in different regions of the input space. A global model, or gate function, complements the experts by learning a function that weights their relevance in different parts of the input space. Local feature selection appears as an attractive alternative to improve the specialization of experts and gate function, particularly for the case of high dimensional data. Our main intuition is that particular subsets of dimensions, or subspaces, are usually more appropriate to classify instances located in different regions of the input space. Accordingly, this work contributes with a regularized variant of MoE that incorporates an embedded process for local feature selection using $L_1$ regularization.
Keywords: Mixture of experts, local feature selection, embedded feature selection, regularization.
1. Mixture of Experts with embedded variable selection
Our main idea is to incorporate a local feature selection scheme inside each expert and gate function of a MoE formulation. Our main intuition is that, in the context of classification, different partitions of the input data can be best represented by specific subsets of features.
This is particularly relevant in the case of high dimensional spaces, where the common presence of noisy or irrelevant features might obscure the detection of particular class patterns. Specifically, our approach takes advantage of the linear nature of each local expert and gate function in the classical MoE formulation [17], meaning that an $L_1$ penalty can be imposed directly on their parameters.

In the context of supervised classification, there is available a set of $N$ training examples, or instance-label pairs $(x_n, y_n)$, representative of the domain data $(x, y)$, where $x_n \in \mathbb{R}^D$ and $y_n \in C$. Here $C$ is a discrete set of $Q$ class labels $\{c_1, \ldots, c_Q\}$. The goal is to use the training data to find a function $f$ that minimizes a loss function scoring the quality of $f$ to predict the true underlying relation between $x$ and $y$. From a probabilistic point of view [4], a useful approach to find $f$ is using a conditional formulation:

$$f(x) = \arg\max_{c_i \in C} \; p(y = c_i \mid x).$$

In the general case of complex relations between $x$ and $y$, a useful strategy consists of approximating $f$ through a mixture of local functions. This is similar to the case of modeling a mixture distribution [34], and it leads to the MoE model. We decompose the conditional likelihood $p(y|x)$ as:

$$p(y \mid x) = \sum_{i=1}^{K} p(y, m_i \mid x) = \sum_{i=1}^{K} p(y \mid m_i, x)\, p(m_i \mid x), \qquad (1)$$

where Equation (1) represents a MoE model with $K$ experts $m_i$. Figure (1) shows a schematic diagram of the MoE approach. The main idea is to obtain local models in such a way that they are specialized in a particular region of the data. In Figure (1), $x$ corresponds to the input instance, $p(y|m_i, x)$ is the expert function, $p(m_i|x)$ is the gating function, and $p(y|x)$ is the weighted sum of the experts. Note that the output of each expert model is weighted by the gating function. This weight can be interpreted as the relevance of expert $m_i$ for the classification of input instance $x$. Also note that the gate function has $K$ outputs, one for each expert. There are $K$ expert functions that have $Q$ components, one for each class.

Figure 1: Mixture of experts scheme.
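To make the decomposition in Equation (1) concrete, the following minimal Python sketch combines precomputed gate and expert probabilities into the mixture prediction and the classification rule $f(x)$. The array shapes and function names are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch of Equation (1) and f(x) = argmax_c p(y = c | x), assuming
# gate and expert probabilities have already been computed.
import numpy as np

def moe_predict(gate_probs, expert_probs):
    """p(y|x) as the gate-weighted sum of expert outputs (Equation (1)).

    gate_probs   : (N, K) array with p(m_i | x_n).
    expert_probs : (N, K, Q) array with p(y = c_l | x_n, m_i).
    Returns an (N, Q) array with p(y = c_l | x_n).
    """
    return np.einsum('nk,nkq->nq', gate_probs, expert_probs)

def moe_classify(gate_probs, expert_probs):
    """f(x): pick the class with the highest mixture probability."""
    return np.argmax(moe_predict(gate_probs, expert_probs), axis=1)
```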
The traditional MoE technique uses multinomial logit models, also known as soft-max functions [4], to represent the gate and expert functions. An important characteristic of this model is that it forces competition among its components. In MoE, such components are expert functions for the gates and class-conditional functions for the experts. The competition in soft-max functions enforces the specialization of experts in different areas of the input space [41]. Using multinomial logit models, a gate function is defined as:

$$p(m_i \mid x) = \frac{\exp(\nu_i^T x)}{\sum_{j=1}^{K} \exp(\nu_j^T x)} \qquad (2)$$

where $i \in \{1, \ldots, K\}$ refers to the set of experts and $\nu_i \in \mathbb{R}^D$ is a vector of model parameters. Component $\nu_{ij}$ of vector $\nu_i$ models the relation between the gate and dimension $j$ of input instance $x$. Similarly, an expert function is defined as:

$$p(y = c_l \mid x, m_i) = \frac{\exp(\omega_{li}^T x)}{\sum_{j=1}^{Q} \exp(\omega_{ji}^T x)} \qquad (3)$$

where $\omega_{li}$ depends on class label $c_l$ and expert $i$. In this way, there are a total of $Q \times K$ vectors $\omega_{li}$. Component $\omega_{lij}$ of vector $\omega_{li}$ models the relation between expert function $i$ and dimension $j$ of input instance $x$.

There are several methods to find the value of the hidden parameters $\nu_{ij}$ and $\omega_{lij}$ [26]. An attractive alternative is to use the EM algorithm. In the case of MoE, the EM formulation augments the model by introducing a set of latent variables, or responsibilities, indicating the expert that generates each instance. Accordingly, the EM iterations consider an expectation step that estimates expected values for the responsibilities, and a maximization step that updates the values of parameters $\nu_{ij}$ and $\omega_{lij}$. Specifically, the posterior probability of the responsibility $R_{in}$ assigned by the gate function to expert $m_i$ for an instance $x_n$ is given by [26]:

$$R_{in} = p(m_i \mid x_n, y_n) = \frac{p(y_n \mid x_n, m_i)\, p(m_i \mid x_n)}{\sum_{j=1}^{K} p(y_n \mid x_n, m_j)\, p(m_j \mid x_n)} \qquad (4)$$

Considering these responsibilities and Equation (1), the expected complete log-likelihood $\langle L_c \rangle$ used in the EM iterations is [26]:

$$\langle L_c \rangle = \sum_{n=1}^{N} \sum_{i=1}^{K} R_{in} \left[ \log p(y_n \mid x_n, m_i) + \log p(m_i \mid x_n) \right] \qquad (5)$$

To embed a feature selection process in the MoE approach, we use the fact that in Equations (2) and (3) the multinomial logit models for the gate and expert functions contain linear relations for the relevant parameters. This linearity can be straightforwardly used for feature selection by considering that a parameter component $\nu_{ij}$ or $\omega_{lij}$ equal to zero implies that dimension $j$ is irrelevant for gate function $p(m_i|x)$ or expert model $p(y|m_i, x)$, respectively. Consequently, we propose to penalize complex models using $L_1$ regularization. A similar consideration is used in the work of [29], but in the context of unsupervised learning. The idea is to maximize the likelihood of the data while simultaneously minimizing the number of parameter components $\nu_{ij}$ and $\omega_{lij}$ different from zero. Considering that there are $Q$ classes, $K$ experts, and $D$ dimensions, the expected regularized complete log-likelihood $\langle L_c^R \rangle$ is given by:

$$\langle L_c^R \rangle = \langle L_c \rangle - \lambda_\nu \sum_{i=1}^{K} \sum_{j=1}^{D} |\nu_{ij}| - \lambda_\omega \sum_{l=1}^{Q} \sum_{i=1}^{K} \sum_{j=1}^{D} |\omega_{lij}| \qquad (6)$$
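As an illustration of Equations (2) to (4), a minimal sketch of the soft-max gate and expert functions and of the E-step responsibilities is given below. The parameter layouts (nu as a K by D array, omega as a Q by K by D array) and function names are assumptions for illustration.

```python
# Soft-max gates/experts (Equations (2), (3)) and responsibilities (Equation (4)).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gate_probs(X, nu):
    """Equation (2): p(m_i | x_n) for X of shape (N, D), nu of shape (K, D)."""
    return softmax(X @ nu.T, axis=1)                            # (N, K)

def expert_probs(X, omega):
    """Equation (3): p(y = c_l | x_n, m_i) for omega of shape (Q, K, D)."""
    return softmax(np.einsum('qkd,nd->nkq', omega, X), axis=2)  # (N, K, Q)

def responsibilities(X, y, nu, omega):
    """Equation (4): posterior R_in over experts, for integer labels y in {0..Q-1}."""
    g = gate_probs(X, nu)                                       # (N, K)
    p_y = expert_probs(X, omega)[np.arange(X.shape[0]), :, y]   # (N, K)
    joint = g * p_y
    return joint / joint.sum(axis=1, keepdims=True)
```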
To maximize Equation (6) with respect to the model parameters, we first use the standard fact that the likelihood function in Equation (5) can be decomposed in terms of independent expressions for the gate and expert models [26]. In this way, the maximization step of the EM-based solution can be performed independently with respect to gate and expert parameters [26]. In our problem, each of these optimizations has an extra term given by the respective regularization term in Equation (6). To handle this case, we observe that each of these optimizations is equivalent to solving a regularized logistic regression [20]. As shown in [20], this problem can be solved by using a coordinate ascent optimization strategy [37] given by a sequential two-step approach that first models the problem as an unregularized logistic regression and afterwards incorporates the regularization constraints.

In summary, we handle Equation (6) by using an EM-based strategy that at each step solves the maximization with respect to the model parameters by decomposing this problem in terms of gate and expert parameters. Each of these problems is in turn solved using the strategy proposed in [20]. Next, we provide details of this procedure.
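Concretely, the per-block subproblem produced by this strategy is a least-squares fit under an $L_1$-norm constraint. The sketch below follows the two-step idea: fit without the constraint first, and re-solve with the constraint only when it is violated. The positive/negative variable split and the use of scipy's SLSQP solver are assumptions for illustration; the paper only requires some quadratic programming routine.

```python
# Sketch of an L1-constrained (weighted) least-squares solve:
#   min_w  sum_n c_n (a_n^T w - b_n)^2   subject to  ||w||_1 <= lam
# via the split w = u - v with u, v >= 0, which makes the constraint linear.
import numpy as np
from scipy.optimize import minimize

def l1_constrained_lsq(A, b, lam, weights=None):
    if weights is not None:                      # fold weights into A and b
        s = np.sqrt(weights)
        A, b = A * s[:, None], b * s
    # Step 1: unregularized least-squares fit.
    w0, *_ = np.linalg.lstsq(A, b, rcond=None)
    if np.abs(w0).sum() <= lam:
        return w0                                # constraint already satisfied
    # Step 2: re-solve with the constraint ||w||_1 = sum(u + v) <= lam.
    D = A.shape[1]
    def obj(uv):
        r = A @ (uv[:D] - uv[D:]) - b
        return r @ r
    x0 = np.concatenate([np.clip(w0, 0, None), np.clip(-w0, 0, None)])
    x0 *= lam / np.abs(w0).sum()                 # shrink the start onto the feasible set
    res = minimize(obj, x0, method='SLSQP',
                   bounds=[(0, None)] * (2 * D),
                   constraints=[{'type': 'ineq', 'fun': lambda uv: lam - uv.sum()}])
    return res.x[:D] - res.x[D:]
```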
Optimization of the unregularized log-likelihood

In this case, we solve the unconstrained log-likelihood given by Equation (5). First, we optimize the log-likelihood with respect to vector $\omega_{li}$. The maximization of the expected log-likelihood $\langle L_c \rangle$ implies deriving Equation (5) with respect to $\omega_{li}$:

$$\frac{\partial}{\partial \omega_{li}} \sum_{n=1}^{N} \sum_{i=1}^{K} R_{in} \left[ \log p(y_n \mid x_n, m_i) \right] = 0, \qquad (7)$$

and applying the derivative, we have:

$$-\sum_{n=1}^{N} R_{in} \left( p(y_n \mid x_n, m_i) - y_n \right) x_n = 0. \qquad (8)$$

In this case, the classical least-squares technique cannot be directly applied because of the soft-max function in $p(y_n|x_n, m_i)$. Fortunately, as described in [18] and later in [26], Equation (8) can be approximated by using a transformation that implies inverting the soft-max function. Using this transformation, Equation (8) is equivalent to an optimization problem that can be solved using a weighted least squares technique [4]:

$$\min_{\omega_{li}} \sum_{n=1}^{N} R_{in} \left( \omega_{li}^T x_n - \log y_n \right)^2 \qquad (9)$$

A similar derivation can be performed with respect to vectors $\nu_i$. Again deriving Equation (5), in this case with respect to parameters $\nu_{ij}$, and applying the transformation suggested in [18], we obtain:

$$\min_{\nu_i} \sum_{n=1}^{N} \left( \nu_i^T x_n - \log R_{in} \right)^2 \qquad (10)$$
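A sketch of these unregularized updates, with responsibilities acting as sample weights and log-transformed targets as in Equations (9) and (10), could look as follows. The one-hot encoding of $y_n$ and the small constant guarding the logarithm are assumptions added for numerical safety.

```python
# Weighted least-squares updates of Equations (9) and (10).
import numpy as np

EPS = 1e-8  # keeps log() finite when a target probability is zero

def update_expert_vector(X, y_onehot_l, R_i):
    """Equation (9): fit omega_li with responsibilities R_i as weights."""
    s = np.sqrt(R_i)
    w, *_ = np.linalg.lstsq(X * s[:, None], np.log(y_onehot_l + EPS) * s,
                            rcond=None)
    return w

def update_gate_vector(X, R_i):
    """Equation (10): fit nu_i against the log-responsibilities."""
    w, *_ = np.linalg.lstsq(X, np.log(R_i + EPS), rcond=None)
    return w
```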
Optimization of the regularized likelihood

Following the procedure of [20], we add the regularization term to the optimization problem given by Equation (9), obtaining an expression that can be solved using quadratic programming [35]:

$$\min_{\omega_{li}} \sum_{n=1}^{N} R_{in} \left( \log y_n - \omega_{li}^T x_n \right)^2 \quad \text{subject to: } \|\omega_{li}\|_1 \leq \lambda_\omega \qquad (12)$$

Similarly, we can also obtain a standard quadratic optimization problem to find parameters $\nu_{ij}$:

$$\min_{\nu_i} \sum_{n=1}^{N} \left( \log R_{in} - \nu_i^T x_n \right)^2 \quad \text{subject to: } \|\nu_i\|_1 \leq \lambda_\nu \qquad (13)$$

A practical advantage of using quadratic programming is that most available optimization packages can be utilized to solve it [6]. Specifically, in the case of $T$ iterations, there are a total of $T * K * (Q+1)$ convex quadratic problems related to the maximization step of the EM algorithm. To further reduce this computational load, we slightly modify this maximization by applying the following two-step scheme:

• Step-1: Solve $K$ quadratic problems to find gate parameters $\nu_{ij}$, assuming that each expert uses all the available dimensions. In this case, there are $T - 1$ iterations.

• Step-2: Solve $K * (Q+1)$ quadratic problems to find expert parameters $\omega_{lij}$, applying the feature selection process. In this case, there is a single iteration.

Using the previous scheme, we reduce from $T * K * (Q+1)$ to $K * (T+1) + K * (Q+1)$ the number of quadratic problems that we need to solve in the maximization step of the EM algorithm. In our experiments, we do not notice a drop in performance by using this simplification, but we are able to increase the processing speed by one order of magnitude.

In summary, starting by assigning random values to the relevant parameters $\nu_{ij}$ and $\omega_{lij}$, our EM implementation consists of iterating the following two steps:

• Expectation: estimate the responsibilities for each expert using Equation (4), and then estimate the outputs of the gate and experts using Equations (2) and (3).

• Maximization: update the values of parameters $\nu_{ij}$ and $\omega_{lij}$ in Equations (12) and (13) by solving $K * (T+1) + K * (Q+1)$ quadratic problems according to the approximation described above in Step-1 and Step-2.

2. Expert Selection

The MoE or RMoE model (the regularized variant described above) assumes that all gate functions affect every data instance. But, for example, in object detection we can assume that there are groups of objects, e.g., vehicles, animals, or kitchen items, where each group is assigned to a gate function. We believe that considering all groups of objects can confuse the classifiers. Therefore, we propose to select a subset of gate functions for each data instance. We denominate this idea "expert selection".

Recall that the likelihood in a regular mixture of experts is:

$$L = \prod_{n=1}^{N} \sum_{i=1}^{K} p(y_n \mid x_n, m_i)\, p(m_i \mid x_n) \qquad (14)$$

Now, in order to select a gate, we change the multinomial logit representation of the gate function (Equation 2) in this way:

$$p(m_i \mid x_n) = \frac{\exp\left( \mu_{in}\, \nu_i^T x_n \right)}{\sum_{j=1}^{K} \exp\left( \mu_{jn}\, \nu_j^T x_n \right)} \qquad (15)$$

where all the components of Equation 2 remain the same, except $\mu$. The vector $\mu_{*n} \in \{0, 1\}^K$ contains the model parameters of the expert selector; its component $\mu_{in}$ depends on data $x_n$ and expert $i$, where $i \in \{1, \ldots, K\}$ indexes the set of expert gates. When $\mu_{in} = 1/0$, expert $i$ is relevant/irrelevant for instance $n$. In the case of $\mu_{in} = 0$, the corresponding gate activation becomes constant, so we can say that instance $n$ is ignorant about expert $i$. In this way, expert selection is performed.
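A minimal sketch of the selective gate of Equation (15) is shown below; the binary selector mu is taken as a precomputed (N, K) array, which is an assumption for illustration.

```python
# Selective gate of Equation (15): mu masks the gate activations, so a
# de-selected expert (mu_in = 0) contributes a constant exp(0) = 1 activation.
import numpy as np

def selective_gate_probs(X, nu, mu):
    """p(m_i | x_n) with expert selection; X: (N, D), nu: (K, D), mu: (N, K)."""
    act = mu * (X @ nu.T)                        # mu_in * nu_i^T x_n
    act = act - act.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(act)
    return e / e.sum(axis=1, keepdims=True)
```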
In order to use the EM algorithm, we write the expected log-likelihood by considering the responsibilities, i.e., the posterior probabilities of the experts, and the respective regularization terms, with the addition of the term corresponding to the expert selector:

$$\langle L_c \rangle = \sum_{n=1}^{N} \sum_{i=1}^{K} R_{in} \left[ \log p(y_n \mid x_n, m_i) + \log p(m_i \mid x_n) \right] - \lambda_\nu \sum_{i=1}^{K} \sum_{j=1}^{D} |\nu_{ij}| - \lambda_\omega \sum_{l=1}^{Q} \sum_{i=1}^{K} \sum_{j=1}^{D} |\omega_{lij}| - P(\mu) \qquad (16)$$

The penalization $P$ depends on the regularization norm, mainly the 0-norm or the 1-norm. Now, we define the posterior probability of the gates $m_i$ as:

$$R_{in} = \frac{p(y_n \mid x_n, m_i)\, p(m_i \mid x_n)}{\sum_{j=1}^{K} p(y_n \mid x_n, m_j)\, p(m_j \mid x_n)} \qquad (17)$$

Next, we repeat the strategy of Lee et al. [20] by first optimizing the unregularized expected log-likelihood and then adding the restriction. In order to facilitate the calculations, we define some auxiliary variables. As the derivative is linear in the sum, we calculate the contribution of a single data instance and call it $E'$:

$$E' = -\log \sum_{k=1}^{K} p(y_n \mid x_n, m_k)\, p(m_k \mid x_n) \qquad (18)$$

We solve this process using an EM algorithm where, in the E-step, we calculate the responsibilities, in this case using Equation 17. In the M-step, we assume the responsibilities as known and we find the optimal parameters $\nu$, $\omega$ and $\mu$.

Given the responsibility values, the term $p(y_n|x_n, m_k)$ can be evaluated separately, and then the parameter $\omega$ can be optimized using the equation used in RMoE. In the case of $p(m_k|x_n)$, by fixing the parameter $\mu$, we can optimize the parameter $\nu$. We use some notation in order to facilitate the calculus: we denote the term $p(y_n|x_n, m_k)$ as $g_k$, $p(m_k|x_n)$ as $h_k$, and the activation $\mu_{in}\, \nu_i^T x_n$ as $z_i$. Deriving with respect to $\nu_i$, we have:

$$\frac{\partial E'}{\partial \nu_i} = \frac{\partial E'}{\partial z_i} \frac{\partial z_i}{\partial \nu_i} = \left[ \sum_{k=1}^{K} \frac{\partial E'}{\partial h_k} \frac{\partial h_k}{\partial z_i} \right] \frac{\partial z_i}{\partial \nu_i} \qquad (19)$$

Now we have three terms, and we evaluate the derivative of each one:

$$\frac{\partial E'}{\partial h_k} = \frac{\partial}{\partial h_k} \left( -\log \sum_{j=1}^{K} g_j h_j \right) = \frac{-g_k}{\sum_{j=1}^{K} g_j h_j} = -\frac{R_{kn}}{h_k} \qquad (20)$$

$$\frac{\partial h_k}{\partial z_i} = \frac{\partial}{\partial z_i} \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} = \delta_{ki} h_i - h_i h_k \qquad (21)$$

$$\frac{\partial z_i}{\partial \nu_i} = \frac{\partial \left( \mu_{in}\, \nu_i^T x_n \right)}{\partial \nu_i} = \mu_{in}\, x_n$$

We integrate these elements to obtain:

$$\frac{\partial E'}{\partial \nu_i} = -\left( \sum_{k=1}^{K} \frac{R_{kn}}{h_k} \left( \delta_{ki} h_i - h_i h_k \right) \right) \mu_{in}\, x_n = \left( h_i - R_{in} \right) \mu_{in}\, x_n \qquad (22)$$

By considering all the data and the regularization term, and applying the trick of Bishop [4] of taking the logarithms of the outputs and setting the result to zero, we have:

$$\min_{\nu_i} \sum_{n=1}^{N} \left( \log(R_{in}) - \nu_i^T \mu_{in} x_n \right)^2 \quad \text{subject to: } \|\nu_i\|_1 \leq \lambda_\nu \qquad (23)$$

This is a modified version of Equation 13, and we can apply a QP package to solve it. Finally, we fix the parameters $\nu$ and $\omega$ in order to optimize the parameter $\mu$. The regularization over the parameter of the expert selector originally uses norm 0; on the other hand, it can be relaxed by considering norm 1. We state both approaches:

A. Optimization of $\mu$ considering norm 0

Since $\mu$ depends on data $x_n$, we need to solve the optimization problem:

$$\min_{\mu_{*n}} \; -\log \sum_{k=1}^{K} p(y_n \mid x_n, m_k)\, p(m_k \mid x_n) \quad \text{subject to: } \|\mu_{*n}\|_0 \leq \lambda_\mu \qquad (24)$$

The minimization of Equation 24 requires the exploration of $\binom{K}{\lambda_\mu}$ combinations; however, by assuming a low number of gates $K < 50$ and a lower number of active experts $\lambda_\mu < 10$, this numerical optimization is feasible in practice.
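Because $\mu_{*n}$ is binary, the 0-norm problem of Equation (24) can be solved by direct enumeration. The sketch below scans all subsets of at most $\lambda_\mu$ experts for one instance and keeps the one with the smallest negative log-likelihood; the function name and argument layout are assumptions for illustration.

```python
# Exhaustive search for the 0-norm expert selector of Equation (24).
import itertools
import numpy as np

def select_experts_l0(x_n, expert_lik_n, nu, lam_mu):
    """expert_lik_n: (K,) array with p(y_n | x_n, m_k); nu: (K, D)."""
    K = nu.shape[0]
    scores = nu @ x_n                             # nu_k^T x_n for every gate
    best_mu, best_nll = None, np.inf
    for size in range(1, lam_mu + 1):             # at most lam_mu active experts
        for subset in itertools.combinations(range(K), size):
            mu = np.zeros(K)
            mu[list(subset)] = 1.0
            act = mu * scores
            gate = np.exp(act - act.max())
            gate /= gate.sum()                    # selective gate, Equation (15)
            nll = -np.log(gate @ expert_lik_n + 1e-12)
            if nll < best_nll:
                best_mu, best_nll = mu, nll
    return best_mu
```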
B. Optimization of $\mu$ considering norm 1
A more applicable approach is to relax the 0-norm constraint by replacing it with a 1-norm, also known as LASSO regularization. Given that $\mu$ appears in the same component as $\nu$, its solution involves many of the same steps; in particular, we find almost the same equations. Using the same notation as in Equation 19, we have for the individual log-likelihood:

$$\frac{\partial E'}{\partial \mu_{in}} = \frac{\partial E'}{\partial z_i} \frac{\partial z_i}{\partial \mu_{in}} = \left[ \sum_{k=1}^{K} \frac{\partial E'}{\partial h_k} \frac{\partial h_k}{\partial z_i} \right] \frac{\partial z_i}{\partial \mu_{in}} \qquad (25)$$

We get the same Equations 20 and 21. In the case of the last component, we have:

$$\frac{\partial z_i}{\partial \mu_{in}} = \frac{\partial \left( \mu_{in}\, \nu_i^T x_n \right)}{\partial \mu_{in}} = \nu_i^T x_n \qquad (26)$$

We assemble all the component equations to obtain:

$$\frac{\partial E'}{\partial \mu_{in}} = -\left( \sum_{k=1}^{K} \frac{R_{kn}}{h_k} \left( \delta_{ki} h_i - h_i h_k \right) \right) \nu_i^T x_n = \left( h_i - R_{in} \right) \nu_i^T x_n$$

In order to find the optimum parameter $\mu_{in}$, we fix $n$ and consider $i = 1$ to $K$. By setting each equation to zero, we have:

$$\left( h_i - R_{in} \right) \nu_i^T x_n = 0 \qquad (27)$$

Next, we approximate the previous equation using the logarithms of the outputs [4]:

$$\left( \log(R_{in}) - \mu_{in}\, \nu_i^T x_n \right) \nu_i^T x_n = 0 \qquad (28)$$

Now, we fix $n$ in order to jointly find the parameters $\mu$ for a fixed data instance $n$. Therefore, when we add the $K$ equations, we have an equation system:

$$\sum_{i=1}^{K} \left( \log(R_{in}) - \mu_{in}\, \nu_i^T x_n \right) \nu_i^T x_n = 0 \qquad (29)$$

This system can be represented as a minimization problem considering the sum of squared residuals between $\log(R_{in})$ and $\mu_{in}\, \nu_i^T x_n$, where we add a 1-norm restriction over $\mu_{*n}$, which represents all selected experts for data instance $n$. In this case, we have:

$$\min_{\mu_{*n}} \left\| \log(R_n) - \mu_{*n}\, \nu x_n \right\|^2 \quad \text{subject to: } \|\mu_{*n}\|_1 \leq \lambda_\mu \qquad (30)$$

This problem can be solved with a quadratic programming package, where the variable is $\mu_{*n}$; a sketch of this per-instance solve is given at the end of this section. In the training phase, we need to solve this optimization $N$ times; in the test phase, it is necessary to solve it once for each test instance.

By using norm 0 or norm 1, we can find the parameters of the expert selector. The whole process is summarized as an EM algorithm where, in the M-step, we first freeze $\nu$ and $\omega$ and find $\mu$; then we freeze $\mu$ and iterate to find the local optimum of $\nu$ and $\omega$. In the E-step, we find the responsibilities $R_{in}$ using the new parameters $\nu$, $\omega$ and $\mu$. At the beginning, we initialize all parameters randomly. In the following section, we detail the results of our experiments.
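Since the design of Equation (30) is diagonal (component $i$ couples $\mu_{in}$ only with $\nu_i^T x_n$), the per-instance update reduces to a small constrained least-squares problem. A sketch, reusing the l1_constrained_lsq helper sketched earlier, could be:

```python
# Per-instance 1-norm update of Equation (30).
import numpy as np

def update_mu_l1(x_n, R_n, nu, lam_mu):
    """min_mu ||log(R_n) - mu * (nu x_n)||^2  s.t.  ||mu||_1 <= lam_mu."""
    A = np.diag(nu @ x_n)        # diagonal system: residual_i depends on mu_in only
    b = np.log(R_n + 1e-12)      # log-responsibility targets
    return l1_constrained_lsq(A, b, lam_mu)
```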