Optimization of the Asymptotic Property of Mutual Learning Involving an Integration Mechanism of Ensemble Learning
Kazuyuki Hara∗, Takahiro Yamada
Tokyo Metropolitan College of Industrial Technology, Higashi-oi 1-10-40, Shinagawa-ku, Tokyo 140-0011.
Toyohashi University of Technology, 1-1 Hibarigaoka, Tempaku, Toyohashi, Aichi 441-8580.
Abstract–
We propose an optimization method of mutual learning which converges into the identical state of optimum ensemble learning within the framework of on-line learning, and have analyzed its asymptotic property through the statistical mechanics method. The proposed model consists of two learning steps: two students independently learn from a teacher, and then the students learn from each other through mutual learning. In mutual learning, students learn from each other and the generalization error is improved even if the teacher has not taken part in the mutual learning. However, in the case of different initial overlaps (direction cosines) between teacher and students, a student with a larger initial overlap tends to have a larger generalization error than before the mutual learning. To overcome this problem, our proposed optimization method of mutual learning optimizes the step sizes of the two students to minimize the asymptotic property of the generalization error. Consequently, the optimized mutual learning converges to a generalization error identical to that of the optimal ensemble learning. In addition, we show the relationship between the optimum step sizes of the mutual learning and the integration mechanism of the ensemble learning.
Keywords– mutual learning, learning step size, on-line learning, linear perceptron, statistical mechanics
As a model involving the interaction between students, Kinzel proposed mutual learning within the framework of on-line learning [9, 10, 11]. Kinzel's model employs two students, and a student learns with the other student acting as a teacher. The target of his model is to obtain the same networks through the learning. On the other hand, ensemble learning algorithms, such as bagging [1]

∗ E-mail: [email protected]

Figure 1: Network structure of latent teacher and student networks, all having the same network structure.
In this section, we formulate the latent teacher and student networks, and the mutual learning algorithms. We assume the latent teacher and student networks receive N-dimensional input x(m) = (x_1(m), ..., x_N(m)) at the m-th learning iteration as shown in Fig. 1. Learning iteration m is omitted in the figure. The latent teacher network is a linear perceptron, and the student networks are two linear perceptrons. We also assume that the elements x_i(m) of the independently drawn input x(m) are uncorrelated random variables with zero mean and 1/N variance; that is, the elements are drawn from a probability distribution P(x). In this paper, the thermodynamic limit of N → ∞ is assumed. The size of the input vector |x| then becomes one.

⟨x_i⟩ = 0, ⟨(x_i)^2⟩ = 1/N, |x| = 1, (1)

where ⟨· · ·⟩ denotes the average and | · | denotes the norm of a vector.

The latent teacher network is a linear perceptron, and is not subject to training. Thus, the weight vector is fixed in the learning process. The output of the latent teacher v(m) for the N-dimensional input x(m) = (x_1(m), x_2(m), ..., x_N(m)) at the m-th learning iteration is

v(m) = Σ_{i=1}^N B_i x_i(m) = B · x(m), (2)
B = (B_1, B_2, ..., B_N), (3)

where the latent teacher weight vector B is an N-dimensional vector like the input vector, and each element B_i of the latent teacher weight vector B is drawn from a probability distribution of zero mean and unit variance. Assuming the thermodynamic limit of N → ∞, the size of the latent teacher weight vector |B| becomes √N.

⟨B_i⟩ = 0, ⟨(B_i)^2⟩ = 1, |B| = √N. (4)

The output distribution of the latent teacher P(v) follows a Gaussian distribution of zero mean and unit variance in the thermodynamic limit of N → ∞. The two linear perceptrons are used as student networks that compose the mutual learning machine. Each student network has the same architecture as the latent teacher network.
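The scalings in Eqs. (1)-(4) can be checked directly at finite N. The following sketch is our own illustration (NumPy, with an assumed N = 10000 standing in for the thermodynamic limit); it draws an input vector and a latent teacher and confirms |x| ≈ 1 and |B| ≈ √N:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000  # finite stand-in for the thermodynamic limit N -> infinity

# Input elements: zero mean, 1/N variance, so |x| ~ 1 (Eq. (1))
x = rng.normal(0.0, np.sqrt(1.0 / N), size=N)

# Latent teacher elements: zero mean, unit variance, so |B| ~ sqrt(N) (Eq. (4))
B = rng.normal(0.0, 1.0, size=N)

# Teacher output (Eq. (2)); P(v) approaches N(0, 1) for large N
v = B @ x

print(np.linalg.norm(x))               # close to 1
print(np.linalg.norm(B) / np.sqrt(N))  # close to 1
```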
Each element of J^k(0), which is the initial value of the k-th student weight vector J^k, is drawn from a probability distribution of zero mean and unit variance. The norm of the initial student weight vector |J^k(0)| is √N in the thermodynamic limit of N → ∞:

⟨J^k_i(0)⟩ = 0, ⟨(J^k_i(0))^2⟩ = 1, |J^k(0)| = √N. (5)

The k-th student output u_k(m) for the N-dimensional input x(m) is

u_k(m) = Σ_{i=1}^N J^k_i(m) x_i(m) = J^k(m) · x(m), (6)
J^k(m) = (J^k_1, J^k_2, ..., J^k_N). (7)

Generally, the norm of the student weight vector |J^k(m)| changes as the time step proceeds. Therefore, the ratio l_k of the norm to √N is considered and is called the length of the student weight vector J^k. The norm at the m-th iteration is l_k(m)√N, and the size of l_k(m) is O(1).

|J^k(m)| = l_k(m)√N (8)

The distribution of the output of the k-th student P(u_k) follows a Gaussian distribution of zero mean and l_k^2 variance in the thermodynamic limit of N → ∞.

Next, we formulate the learning algorithm. After the students learn from a latent teacher, mutual learning is carried out. The learning equation of the mutual learning is

J^k(m + 1) = J^k(m) + η_k (u_{k'}(m) − u_k(m)) x(m), (9)

where k is 1 or 2, k ≠ k', and m denotes the iteration number. Equation (9) shows that mutual learning is carried out between the two students. Therefore, the teacher used in the initial learning is called a latent teacher. We use the gradient descent algorithm in this paper, while another algorithm was used in Kinzel's work [9]. When the interaction between students is introduced, the performance of the students may be improved if they exchange knowledge that each student has acquired from the latent teacher in the initial learning. In other words, the two students approach each other through mutual learning, and tend to move towards the middle of the initial weight vectors.
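To illustrate the update rule of Eq. (9), the following simulation (our own sketch; N, the step sizes, and the run length are assumed values) starts two independent students and applies mutual learning with η_1 = η_2 = 0.1. The overlap q between the students approaches one, i.e. the two weight vectors move toward a common middle point:

```python
import numpy as np

rng = np.random.default_rng(1)
N, eta1, eta2 = 1000, 0.1, 0.1
t_max = 20.0  # continuous time t = m / N

# Independent initial students with unit-variance elements (Eq. (5))
J1 = rng.normal(0.0, 1.0, size=N)
J2 = rng.normal(0.0, 1.0, size=N)

for _ in range(int(t_max * N)):
    x = rng.normal(0.0, np.sqrt(1.0 / N), size=N)
    u1, u2 = J1 @ x, J2 @ x
    # Mutual learning (Eq. (9)): each student moves toward the other's output
    J1, J2 = J1 + eta1 * (u2 - u1) * x, J2 + eta2 * (u1 - u2) * x

q = (J1 @ J2) / (np.linalg.norm(J1) * np.linalg.norm(J2))
print(q)  # close to 1: the students have converged toward each other
```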
This tendency is similar to the integration mechanism of ensemble learning, so mutual learning may mimic this mechanism.

Theory
In this section, we first derive the differential equations of two order parameters which depict the behavior of mutual learning. After that, we derive an auxiliary order parameter which depicts the relationship between the latent teacher and the students. We then rewrite the generalization error using these order parameters.

We first derive the differential equation of the length of the student weight vector l_k. l_k is the first order parameter of the system. We modify the length of the student weight vector in Eq. (8) as J^k · J^k = N l_k^2. To obtain a time-dependent differential equation of l_k, we square both sides of Eq. (9). We then average the terms of the equation using the distribution P(u_k, u_{k'}). Note that x and J^k are random variables, so the equation becomes a random recurrence formula. We formulate the size of the weight vectors to be O(N) and the size of the input x to be O(1), so the length of the student weight vector has a self-averaging property. Here, we rewrite m as m = Nt, and represent the learning process using continuous time t in the thermodynamic limit of N → ∞. We then obtain the deterministic differential equation of l_k^2,

dl_k^2/dt = (η_k^2 − 2η_k) l_k^2 + η_k^2 l_{k'}^2 + 2(η_k − η_k^2) Q. (10)

Here, k is 1 or 2, and k ≠ k'. In this equation, Q = q l_k l_{k'}, and q is the overlap between J^k and J^{k'}, defined as

q = J^k · J^{k'} / (|J^k| |J^{k'}|) = J^k · J^{k'} / (N l_k l_{k'}), (11)

and q is the second order parameter of the system. The overlap q also has a self-averaging property, so we can derive the differential equation in the thermodynamic limit of N → ∞. The differential equation is derived by calculating the product of the learning equations (Eq. (9)) for J^k and J^{k'}, and we then average the terms of the equation using the distribution P(u_k, u_{k'}). After that, we obtain the deterministic differential equation

dQ/dt = (η_2 − η_1η_2) l_1^2 + (η_1 − η_1η_2) l_2^2 − (η_1 + η_2 − 2η_1η_2) Q. (12)
Equations (10) and (12) form closed differential equations. The analytical solutions for the length of the student l_k and the overlap between students Q are given by

l_k^2(t) = −A_1 (η_k/η_{k'}) exp(−(η_1 + η_2)(2 − (η_1 + η_2)) t) + (−1)^k (2η_k/(η_2 − η_1)) A_2 exp(−(η_1 + η_2) t) + A_3, (13)
Q(t) = A_1 exp(−(η_1 + η_2)(2 − (η_1 + η_2)) t) + A_2 exp(−(η_1 + η_2) t) + A_3, (14)

where

A_1 = −η_1η_2 (l_1^2(0) + l_2^2(0) − 2Q(0)) / (η_1 + η_2)^2, (15)
A_2 = −(η_2 − η_1)(η_2 l_1^2(0) − η_1 l_2^2(0) − (η_2 − η_1) Q(0)) / (η_1 + η_2)^2, (16)
A_3 = (η_2^2 l_1^2(0) + η_1^2 l_2^2(0) + 2η_1η_2 Q(0)) / (η_1 + η_2)^2. (17)

l_1(0) is the initial condition of student 1, and l_2(0) is that of student 2. Q(0) = q(0) l_1(0) l_2(0), and q(0) is the initial condition of the overlap between student 1 and student 2. From Eqs. (13) and (14), l_k(t) and Q(t) converge to finite values at t → ∞ if

0 < η_1 + η_2 < 2. (18)

To depict the behavior of mutual learning with a latent teacher, we have to obtain the differential equation of the overlap R_k, which is the direction cosine between the latent teacher weight vector B and the k-th student weight vector J^k defined by Eq. (19). We introduce R_k as the third order parameter of the system.

R_k = B · J^k / (|B| |J^k|) = B · J^k / (N l_k) (19)

For the sake of convenience, we write the overlap between the latent teacher weight vector and the student weight vector as r_k, with r_k = R_k l_k. The differential equation of the overlap r_k is derived by calculating the product of B and Eq. (9), and we then average the terms of the equation using the distribution P(v, u_k, u_{k'}). The overlap r_k also has a self-averaging property, and in the thermodynamic limit the deterministic differential equation of r_k is then obtained through a calculation similar to that used for l_k.

dr_k/dt = η_k (r_{k'} − r_k) (20)

The solution for the overlap r_k is obtained by solving the simultaneous differential equations of Eq. (20) for k = 1, k' = 2 and for k = 2, k' = 1.
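The closed-form solutions of Eqs. (13)-(17) can be cross-checked against a direct numerical integration of Eqs. (10) and (12). A minimal sketch (our own verification code; the step sizes, initial conditions, and Euler step are assumed values):

```python
import numpy as np

def analytic(t, eta1, eta2, l1sq0, l2sq0, Q0):
    """Closed-form l_1^2(t), l_2^2(t), Q(t) from Eqs. (13)-(17)."""
    S = eta1 + eta2
    A1 = -eta1 * eta2 * (l1sq0 + l2sq0 - 2.0 * Q0) / S**2
    A2 = -(eta2 - eta1) * (eta2 * l1sq0 - eta1 * l2sq0 - (eta2 - eta1) * Q0) / S**2
    A3 = (eta2**2 * l1sq0 + eta1**2 * l2sq0 + 2.0 * eta1 * eta2 * Q0) / S**2
    ef = np.exp(-S * (2.0 - S) * t)  # fast mode
    es = np.exp(-S * t)              # slow mode
    l1sq = -A1 * (eta1 / eta2) * ef - A2 * (2.0 * eta1 / (eta2 - eta1)) * es + A3
    l2sq = -A1 * (eta2 / eta1) * ef + A2 * (2.0 * eta2 / (eta2 - eta1)) * es + A3
    Q = A1 * ef + A2 * es + A3
    return l1sq, l2sq, Q

# Euler integration of the order-parameter equations, Eqs. (10) and (12)
eta1, eta2, dt, t_end = 0.1, 0.2, 1e-4, 10.0
l1sq, l2sq, Q = 1.0, 1.0, -0.2  # l_1(0) = l_2(0) = 1, q(0) = -0.2
for _ in range(int(t_end / dt)):
    dl1 = (eta1**2 - 2*eta1) * l1sq + eta1**2 * l2sq + 2*(eta1 - eta1**2) * Q
    dl2 = (eta2**2 - 2*eta2) * l2sq + eta2**2 * l1sq + 2*(eta2 - eta2**2) * Q
    dQ = ((eta2 - eta1*eta2) * l1sq + (eta1 - eta1*eta2) * l2sq
          - (eta1 + eta2 - 2*eta1*eta2) * Q)
    l1sq, l2sq, Q = l1sq + dt*dl1, l2sq + dt*dl2, Q + dt*dQ

l1sq_a, l2sq_a, Q_a = analytic(t_end, eta1, eta2, 1.0, 1.0, -0.2)
print(l1sq, l1sq_a)  # the two values agree
```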
r_k(t) = (η_k (r_k(0) − r_{k'}(0)) / (η_1 + η_2)) exp(−(η_1 + η_2) t) + (η_2 r_1(0) + η_1 r_2(0)) / (η_1 + η_2), (21)

where r_k(0) = R_k(0) l_k(0), and R_k(0) is the initial overlap between the latent teacher and the k-th student.

The squared error for the k-th student ǫ_k is then defined using the output of the latent teacher and that of the student as given in Eqs. (2) and (6), respectively.

ǫ_k = (1/2)(B · x − J^k · x)^2 (22)

The generalization error for the k-th student ǫ_g^k is given by the squared error ǫ_k in Eq. (22) averaged over the possible inputs x drawn from a Gaussian distribution P(x) of zero mean and 1/N variance.

ǫ_g^k = ∫ dx P(x) ǫ_k (23)
= (1/2) ∫ dx P(x) (B · x − J^k · x)^2. (24)

This calculation is an N-dimensional Gaussian integral over x and is hard to carry out directly. To overcome this difficulty, we employ a coordinate transformation from x to v and u_k in Eqs. (2) and (6). Note that the distribution of the output of the students P(u_k) follows a Gaussian distribution of zero mean and l_k^2 variance in the thermodynamic limit of N → ∞. For the same reason, the output distribution of the latent teacher P(v) follows a Gaussian distribution of zero mean and unit variance in the thermodynamic limit. Thus, the distribution P(v, u_k) of the latent teacher output v and the k-th student output u_k is

P(v, u_k) = (1 / (2π√|Σ|)) exp[−(v, u_k) Σ^{−1} (v, u_k)^T / 2], (25)
Σ = ((1, r_k), (r_k, l_k^2)). (26)

Here, T denotes the transpose of a vector, r_k denotes r_k = R_k l_k, and R_k is the overlap between the latent teacher weight vector B and the student weight vector J^k defined by Eq. (19). Hence, by using this coordinate transformation, the generalization error in Eq. (24) can be rewritten as

ǫ_g^k = (1/2) ∫ dv du_k P(v, u_k) (v − u_k)^2 (27)
= (1/2)(1 − 2r_k + l_k^2). (28)

Consequently, we calculate the dynamics of the generalization error by substituting the time-dependent values of l_k(t), Q(t), and r_k(t) into Eq. (28).
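Equation (28) can be sanity-checked by Monte Carlo: draw a teacher and a correlated student, estimate the input-averaged squared error of Eq. (24), and compare with (1/2)(1 − 2r_k + l_k^2). In this sketch (our own; the construction J = 0.5 B + noise is just an arbitrary way to obtain a student with nonzero overlap):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4000
B = rng.normal(0.0, 1.0, size=N)            # latent teacher
J = 0.5 * B + rng.normal(0.0, 1.0, size=N)  # a student correlated with the teacher

l = np.linalg.norm(J) / np.sqrt(N)  # length l_k (Eq. (8))
r = (B @ J) / N                     # r_k = R_k l_k (Eq. (19))

# Monte Carlo estimate of the generalization error (Eq. (24))
errs = [0.5 * ((B - J) @ rng.normal(0.0, np.sqrt(1.0 / N), size=N)) ** 2
        for _ in range(2000)]
mc = float(np.mean(errs))

theory = 0.5 * (1.0 - 2.0 * r + l**2)  # Eq. (28)
print(mc, theory)  # the two estimates agree
```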
ǫ_g^k(t) = (1/2){1 − (2η_k (r_k(0) − r_{k'}(0)) / (η_1 + η_2)) exp(−(η_1 + η_2) t) − 2(η_2 r_1(0) + η_1 r_2(0)) / (η_1 + η_2)
+ (η_k^2 (l_1^2(0) + l_2^2(0) − 2Q(0)) / (η_1 + η_2)^2) exp(−(η_1 + η_2)(2 − (η_1 + η_2)) t)
+ (−1)^{k+1} (2η_k (η_2 l_1^2(0) − η_1 l_2^2(0) − (η_2 − η_1) Q(0)) / (η_1 + η_2)^2) exp(−(η_1 + η_2) t)
+ (η_2^2 l_1^2(0) + η_1^2 l_2^2(0) + 2η_1η_2 Q(0)) / (η_1 + η_2)^2} (29)

Results
When the step sizes of the two students are the same, the mutual learning asymptotically converges to the average weight vector of the two students [12]. In this section, we analyze the asymptotic property of mutual learning in the case of different step sizes, and then discuss the relationship between mutual learning and ensemble learning.
We analyze the effect of the learning step size on the asymptotic property of mutual learning. The two students use different learning step sizes. For this purpose, we use computer simulations.

Figure 2 shows trajectories of the student weight vectors when the initial overlaps between the latent teacher and the students were inhomogeneous: (a) shows the results obtained by setting the learning step size of student 1 (η_1) to 0.1 (fixed), and setting the learning step size of student 2 (η_2) to 0.1, 0.2, 0.3, or 0.5; (b) shows the results obtained by setting the learning step size η_1 to 0.01 (fixed), and setting η_2 to 0.01, 0.02, 0.03, or 0.05. In these figures, the horizontal axis shows the length of the student weight vector l_k, and the vertical axis shows the overlap R_k. The initial conditions were l_1(0) = l_2(0) = 1, R_1(0) = 0.6, R_2(0) = 0.2, and q(0) = −0.2. The theoretical results obtained using Eqs. (13), (14), and (21) are shown as thick lines, and the results obtained through computer simulations for N = 10000 are shown as thin lines. The upper lines show trajectories of the weight vector of student 1, and the lower lines show trajectories of the weight vector of student 2. The black rectangle symbols show the convergence points of the trajectories of the student weight vectors, and the numbers above the symbols show the learning step sizes of student 2.

When the learning step sizes η_1 and η_2 were the same, student 1 started at l_1(0) = 1 and R_1(0) = 0.6, and converged to the average weight vector of the initial student vectors, denoted by AW. Student 2 started at l_2(0) = 1 and R_2(0) = 0.2, and also converged to the average weight vector denoted by AW when using the same learning step sizes. When the learning step sizes η_1 and η_2 were not the same, the convergence points were changed by using a different step size η_2 of 0.2, 0.3, or 0.5 in (a), or η_2 of 0.02, 0.03, or 0.05 in (b). Note that the convergence points for the same ratio of the learning step sizes tend to be the same. Thus, we pay attention to the effect of the ratio of learning step sizes η_2/η_1 in the mutual learning.

Figure 2: Trajectories of the student weight vectors (length l_k versus overlap R_k). The initial conditions were l_1(0) = l_2(0) = 1, R_1(0) = 0.6, R_2(0) = 0.2, and q(0) = −0.2. (a) Results of setting the learning step size to η_1 = 0.1 and η_2 = 0.1, 0.2, 0.3, or 0.5. (b) Results of setting the learning step size to η_1 = 0.01 and η_2 = 0.01, 0.02, 0.03, or 0.05.

Figure 3 shows the learning step size dependence of the generalization error. The learning step size of student 1 was 0.1 or 0.01 (fixed), and that of student 2 was changed as shown in the figure. The horizontal axis shows the ratio of learning step sizes η_2/η_1, and the vertical axis shows the asymptotic property of the generalization error ǫ_g, that is, its value at t → ∞. The results show that the asymptotic property of the generalization error was minimized when the ratio η_2/η_1 was 2. Consequently, the asymptotic property of the generalization error can be minimized by using the optimal ratio of learning step sizes. Next, we will obtain this optimal ratio of learning step sizes that minimizes the asymptotic property of the generalization error.

Figure 3: Relation between learning step size and generalization error. The learning step size of student 1 was 0.1 or 0.01 (fixed), and that of student 2 was changed. The generalization error is minimized when the ratio of the learning step sizes is two for both cases. The optimum ratio is independent of the size of the learning step size.

We now analyze the asymptotic property of the generalization error based on the ratio of learning step sizes, and then we obtain the optimum ratio of learning step sizes η_2/η_1 that minimizes the asymptotic property of the generalization error. The asymptotic property of the order parameters is obtained by substituting t → ∞ into Eqs. (13), (14), and (21):

l_1^2(∞) = l_2^2(∞) = Q(∞) = (η_2^2 l_1^2(0) + η_1^2 l_2^2(0) + 2η_1η_2 Q(0)) / (η_1 + η_2)^2, (30)
r_1(∞) = r_2(∞) = (η_2 / (η_1 + η_2)) r_1(0) + (η_1 / (η_1 + η_2)) r_2(0). (31)

The above equations show that the mutual learning converges to the internal dividing point of the initial student weight vectors. Using Eqs. (30) and (31), we can obtain the asymptotic property of the generalization error:

ǫ_g(∞) = (1/2){1 − 2(η_2 r_1(0) + η_1 r_2(0)) / (η_1 + η_2) + (η_2^2 l_1^2(0) + η_1^2 l_2^2(0) + 2η_1η_2 Q(0)) / (η_1 + η_2)^2}. (32)

We rewrite the generalization error by replacing the ratio η_2/η_1 with α:

ǫ_g(∞) = (1/2){1 − 2(α r_1(0) + r_2(0)) / (α + 1) + (α^2 l_1^2(0) + l_2^2(0) + 2α Q(0)) / (α + 1)^2}. (33)
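The minimum reported in Figure 3 can be reproduced from Eq. (33) alone. The sketch below (our own check, using the initial conditions quoted for Figure 2) evaluates ǫ_g(∞) on a grid of ratios α and locates its minimizer:

```python
import numpy as np

# Initial conditions from the simulations:
# l_1(0) = l_2(0) = 1, R_1(0) = 0.6, R_2(0) = 0.2, q(0) = -0.2
l1sq0, l2sq0, Q0 = 1.0, 1.0, -0.2
r10, r20 = 0.6, 0.2

def eg_inf(a):
    """Asymptotic generalization error as a function of alpha = eta2/eta1 (Eq. (33))."""
    return 0.5 * (1.0 - 2.0 * (a * r10 + r20) / (a + 1.0)
                  + (a**2 * l1sq0 + l2sq0 + 2.0 * a * Q0) / (a + 1.0) ** 2)

grid = np.linspace(0.1, 10.0, 100000)
alpha_min = grid[np.argmin(eg_inf(grid))]
print(alpha_min)  # close to 2, the optimum ratio seen in Figure 3
```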
When the generalization error is minimized, ∂ǫ_g(∞)/∂α = 0 is satisfied, so

∂ǫ_g(∞)/∂α = (1/2){(2α l_1^2(0) + 2Q(0)) / (α + 1)^2 − 2(α^2 l_1^2(0) + l_2^2(0) + 2α Q(0)) / (α + 1)^3 + 2(α r_1(0) + r_2(0)) / (α + 1)^2 − 2 r_1(0) / (α + 1)} = 0. (34)

Solving Eq. (34), we obtain α_opt as

α_opt = (l_2^2(0) − Q(0) + r_1(0) − r_2(0)) / (l_1^2(0) − Q(0) − r_1(0) + r_2(0)). (35)

Therefore, the optimum ratio of the learning step sizes is obtained through Eq. (35). The optimum asymptotic property of the generalization error is obtained by substituting Eq. (35) into Eq. (33):

ǫ_g^opt(∞) = (1/2){1 − 2(κ r_1(0) + (1 − κ) r_2(0)) + κ^2 l_1^2(0) + (1 − κ)^2 l_2^2(0) + 2κ(1 − κ) Q(0)}. (36)

Here, κ is defined as κ = α_opt / (1 + α_opt).

On the other hand, we can consider the linear combination of the initial weight vectors of the students, that is, J = C J^1(0) + (1 − C) J^2(0), and minimize the generalization error with respect to C. This is an ensemble learning with two students, so from the appendix, the optimum C* that minimizes the generalization error is

C* = (l_2^2(0) − Q(0) + r_1(0) − r_2(0)) / (l_1^2(0) + l_2^2(0) − 2Q(0)). (37)

Therefore, the optimum ratio C*/(1 − C*) is obtained as

C* / (1 − C*) = (l_2^2(0) − Q(0) + r_1(0) − r_2(0)) / (l_1^2(0) − Q(0) − r_1(0) + r_2(0)) = η_2^opt / η_1^opt, (38)

and it is shown that the optimum ratio of the learning step sizes of mutual learning, α_opt = η_2^opt/η_1^opt, is equal to that of the optimum linear combination of the initial weight vectors, C*/(1 − C*). Consequently, mutual learning using an optimum ratio of learning step sizes converges to the optimum ensemble learning that is the linear combination of the initial student vectors.

Conclusion

We have proposed an optimization method for mutual learning by means of minimizing the asymptotic property of the generalization error within the framework of on-line learning. We first formulated mutual learning with a latent teacher, and then derived the differential equations of order parameters that depict the learning process.
The order parameters of mutual learning are the length of the student weight vector l_k and the overlap between students q. To depict the relationship between the latent teacher and the students, we introduced the order parameter R_k. We derived these differential equations using statistical mechanics methods and solved them analytically. After that, we obtained the dynamics of the generalization error using these order parameters.

Next, we used the theoretical results to analyze the relationship between the asymptotic property of the mutual learning and the learning step sizes of the students. From the results, we found that the asymptotic property of the mutual learning is related to the ratio of the learning step sizes of the two students, and not to the learning step sizes themselves. We obtained analytically the optimum ratio of the learning step sizes which minimizes the generalization error. We also showed that the optimum ratio of the learning step sizes of the mutual learning is equal to the inverse of the ratio of optimum weights for an average of the linear combination of initial student weight vectors. We conclude that the integration mechanism of ensemble learning can be mimicked through mutual learning by introducing the interaction between students. Our future work will include analysis of mutual learning with non-linear perceptrons.

Acknowledgment

We would like to thank Masato Okada (The University of Tokyo) and Seiji Miyoshi (Kobe City College of Technology) for their useful discussions. Part of this study has been supported by a Grant-in-Aid for Scientific Research (C) No. 16500146.
References

[1] L. Breiman, Bagging predictors, Machine Learning, vol. 24, pp. 123-140 (1996).
[2] Y. Freund and R. E. Schapire, J. Comput. Syst. Sci. (1997) 119.
[3] On-line Learning in Neural Networks, ed. D. Saad (Cambridge University Press, Oxford, 1998).
[4] A. Krogh and P. Sollich, Phys. Rev. E (1997) 811.
[5] K. Hara and M. Okada, Neural Networks (2004) 215.
[6] K. Hara and M. Okada, J. Phys. Soc. Jpn. (2005) 2966.
[7] S. Miyoshi, K. Hara, and M. Okada, Phys. Rev. E (2005) 036116.
[8] A. Lazarevic and Z. Obradovic, Distributed and Parallel Databases, vol. 11, pp. 203 (2002).
[9] E. Klein et al., Proc. Neural Inf. Proc. Sys. (2004).
[10] R. Metzler, W. Kinzel, and I. Kanter, Phys. Rev. E 62 (2000) 2555.
[11] R. Mislovaty, E. Klein, I. Kanter, and W. Kinzel, Phys. Rev. Lett. 91 (2003) 118701.
[12] K. Hara and M. Okada, J. Phys. Soc. Jpn. (2007) 014001.

A Ensemble learning
Ensemble learning is a learning method using many weak learning machines to improve upon the performance of a single weak learning machine [1, 2, 8]. Students learn from the teacher individually, and then an ensemble output is calculated by integrating the students' outputs. Because many students are used, ensemble learning is effective when the students differ from each other. Therefore, we assume that the overlap (direction cosine) between the k-th student and the k'-th student, q_{kk'}, is not one. The ensemble output of the student networks u is given by the weighted average of each student output using the weights for averaging C_k:

u = Σ_{k=1}^K C_k u_k = Σ_{k=1}^K C_k (J^k · x) (39)

Here, the number of students is K, and we assume Σ_{k=1}^K C_k = 1. In the following, we assume that the number of students is two. We use linear perceptrons as the students, so the average output of the two students is equal to the output of a perceptron having the average of the two student weight vectors. The weighted average of the two student weight vectors J^E is defined as follows [12].

J^E = C_k J^k + C_{k'} J^{k'} = C J^k + (1 − C) J^{k'} (40)

Here, we rewrite C_k as C and C_{k'} as 1 − C from C_k + C_{k'} = 1. From this equation, ensemble learning can be viewed as the linear combination of the two student weight vectors. Note that ensemble learning is a static process, so there is no dynamical property. The length of the weight vector l_E and the overlap r_E are given by

(l_E)^2 = C^2 l_k^2 + (1 − C)^2 l_{k'}^2 + 2C(1 − C) Q, (41)
r_E = C r_k + (1 − C) r_{k'}. (42)

The generalization error of the ensemble output ǫ_g^E is given by substituting Eqs. (41) and (42) into Eq. (28):

ǫ_g^E = (1/2)(1 − 2 r_E + (l_E)^2) = (1/2){1 − 2(C r_k + (1 − C) r_{k'}) + C^2 l_k^2 + (1 − C)^2 l_{k'}^2 + 2C(1 − C) Q}. (43)
Since the optimum weight for averaging C* satisfies the condition ∂ǫ_g^E/∂C = 0, we obtain

C* = (l_{k'}^2 − Q + r_k − r_{k'}) / (l_k^2 + l_{k'}^2 − 2Q). (44)

When the student weight vector lengths satisfy l_k = l_{k'} = l and the overlaps with the teacher satisfy r_k = r_{k'} = r, from Eq. (44) we obtain C* = (1 − C*) = 1/2.
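The optimal weight of Eq. (44) can be confirmed numerically. In this sketch (our own; the order-parameter values are arbitrary illustrative assumptions), a brute-force scan of the ensemble error of Eq. (43) recovers the closed-form C*:

```python
import numpy as np

# Illustrative (assumed) order parameters for two students
lk_sq, lkp_sq, Q = 1.0, 1.0, 0.3  # squared lengths and Q = q * l_k * l_k'
rk, rkp = 0.7, 0.4                # overlaps with the teacher, r_k = R_k l_k

def eg_ensemble(C):
    r_E = C * rk + (1.0 - C) * rkp                                          # Eq. (42)
    lE_sq = C**2 * lk_sq + (1.0 - C)**2 * lkp_sq + 2.0 * C * (1.0 - C) * Q  # Eq. (41)
    return 0.5 * (1.0 - 2.0 * r_E + lE_sq)                                  # Eq. (43)

C_star = (lkp_sq - Q + rk - rkp) / (lk_sq + lkp_sq - 2.0 * Q)  # Eq. (44)
grid = np.linspace(0.0, 1.0, 100001)
C_scan = grid[np.argmin(eg_ensemble(grid))]
print(C_star, C_scan)  # the scan minimizer matches the closed form
```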