Université de Montréal

Analyzing the Benefits of Communication Channels Between Deep Learning Models

by
Philippe Lacaille

Département d'informatique et de recherche opérationnelle
Faculté des arts et des sciences

Thesis presented to the Faculté des études supérieures et postdoctorales in fulfillment of the requirements for the degree of Maître ès sciences (M.Sc.) in computer science
August 2018
© Philippe Lacaille, 2018

Sommaire

As the application domains of artificial intelligence systems and their associated tasks keep diversifying, machine learning algorithms, and in particular deep learning models and the databases required to train them, keep growing. Some algorithms make it possible to scale the numerous computations involved by leveraging data parallelism. However, these algorithms require a large quantity of data to be exchanged in order to ensure that the knowledge shared between the compute units remains accurate.

In the following work, different levels of communication between deep learning models are studied, in particular their effect on model performance. The first approach presented focuses on decentralizing the multiple computations performed in parallel by the synchronous or asynchronous stochastic gradient descent algorithms. It turns out that a simplified communication scheme, which consists of allowing models to exchange low-bandwidth outputs, can prove beneficial. In the following chapter, the communication protocol is slightly modified to also carry training instructions. This is studied in a simplified environment where a pre-trained model, like a teacher, can customize the training of a randomly initialized model in order to accelerate learning. Finally, a communication channel through which two deep learning models can exchange a purpose-built language is analyzed, while allowing it to be optimized in different ways.
Keywords:
Machine learning, deep learning, communication, language, teacher, student, optimization, gradient

Summary
As artificial intelligence systems spread to more diverse and larger tasks in many domains, machine learning algorithms, and in particular the deep learning models and the databases required to train them, are getting bigger themselves. Some algorithms do allow for scaling of large computations by leveraging data parallelism. However, they often require a large amount of data to be exchanged in order to ensure the shared knowledge throughout the compute nodes is accurate.

In this work, the effect of different levels of communication between deep learning models is studied, in particular how it affects performance. The first approach studied looks at decentralizing the numerous computations that are done in parallel in training procedures such as synchronous and asynchronous stochastic gradient descent. In this setting, a simplified communication scheme that consists of exchanging low-bandwidth outputs between compute nodes can be beneficial. In the following chapter, the communication protocol is slightly modified to further include training instructions. This is studied in a simplified setup where a pre-trained model, analogous to a teacher, can customize a randomly initialized model's training procedure to accelerate learning. Finally, a communication channel where two deep learning models can exchange a purposefully crafted language is explored, while allowing for different ways of optimizing that language.
Keywords:
Machine learning, deep learning, communication, language, teacher, student, optimization, gradients

Contents
Sommaire
Summary
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Introduction
Chapter 1. Machine Learning Basics
Chapter 2. Exchanging Outputs Between Models
Chapter 3. Increased Utility Through Selection Of Training Data
Chapter 4. Sharing Internal Representation Through Language
Conclusion
Bibliography

List of Tables

List of Figures

1.2 Visual representation of a multi-layer perceptron with a single hidden layer of 6 units, input x and a single output.
2.1 Class probabilities resulting from normalizing the logits with different temperatures. The logits or class scores used (1.0, 2.0, 4.0, 8.0 for the 4 classes) are the same across the different temperatures.
3.1 Validation accuracy (%) per number of parameter updates of the student network during training. Configurations of the students include a baseline with no communication; stack sizes of two and five minibatches; cross-entropy and Euclidean distance as difficulty measures; and soft-labels, true labels and random selection.
4.1 (Left) Original MNIST samples; (Middle and Right) the same samples modified by the noisy mask, where each pixel has some probability of being inverted.
4.2 Vector representation of a generated message, with the temperature of the Gumbel-Softmax set at two different values (left and right). For illustrative purposes, the logits and the underlying sample from the Gumbel distribution are the same for the two temperatures.
4.3 Validation accuracy (%) per training step of the student network during training, based on different levels of communication (average of five runs).

List of Abbreviations

SGD: Stochastic Gradient Descent
GPU: Graphics Processing Unit
MSE: Mean Square Error
MLP: Multi-Layer Perceptron
RNN: Recurrent Neural Network
CNN: Convolutional Neural Network
ReLU: Rectified Linear Unit
GAN: Generative Adversarial Network
VAE: Variational Auto-Encoder
ELBO: Evidence Lower Bound
DCGAN: Deep Convolutional Generative Adversarial Network
BiGAN: Bidirectional Generative Adversarial Network
KL: Kullback-Leibler
MINE: Mutual Information Neural Estimation

Acknowledgements
Firstly, I would like to thank my supervisor Yoshua Bengio, who was willing to give me, an actuary coming back to school, a chance to learn about this field, about which I am now passionate. Your guidance and your support, both educational and financial, during the completion of this Master's degree will serve as an inspiration in the future; merci Yoshua.

To my close family, Jean-Claude, Carole, Olivier and, of course, Marie-Pier, thank you for believing in me and keeping me steered in the right direction through the ups and downs of the past few years; your support means everything to me.

To my collaborator on all the work presented in this thesis, Min Lin, I can't thank you enough for your accessibility and your patience. Through the many brainstorming sessions, along with your help writing and debugging code, I hope you enjoyed our time working together as much as I did.

Thank you also to all my colleagues and friends from room 3248 at Mila; the numerous hours spent there were truly enjoyable.

Finally, thank you to my dear friend William Perrault, who made getting back to school as enjoyable as it was. I'm forever grateful our paths crossed; here's to many more golf rounds and even more birdies in the future.

Introduction
This thesis presents my research during the completion of my Master's degree at the Université de Montréal, at Mila, under the supervision of Professor Yoshua Bengio. This work was done in the field of computer science, more specifically artificial intelligence, with the collaboration of postdoctoral researcher Min Lin at Mila.

This thesis is structured so that the first chapter introduces the machine learning basics needed to follow the work detailed in the subsequent ones. Chapters 2, 3 and 4 detail different experiments that study how useful a communication channel between deep learning models can be.
Chapter 1. Machine Learning Basics
This chapter reviews the basics of machine learning required to follow the work described in the following chapters. It does not serve as a full review of machine learning or deep learning. Should the reader be interested in an in-depth review of the background material and trends in machine learning, and more specifically in deep learning, the Deep Learning book [16] is an excellent reference.
Machine learning is the broad family of techniques for building functions from data, leveraging the computing abilities and algorithms of computer science. In addition to computer science, it sits at the intersection of multiple fields of research, in particular probability and statistics, information theory, optimization, linear algebra and linguistics.

Recent breakthroughs in artificial intelligence, such as the highly publicized AlphaGo successes [32], leverage different components of the machine learning family. One in particular, reinforcement learning, is not covered in this document. The interested reader can learn more about it from [34], of which a second edition is currently in the works.
Although the field of machine learning and its potential applications are expanding rapidly, most applications can be boiled down to a few categories of tasks. This section serves as a brief introduction to the most common tasks that leverage machine learning algorithms.

1.2.1. Supervised learning tasks
Supervised learning tasks can generally be seen as tasks where the objective is to make a prediction as to what an input corresponds to. Conceptually, models try to figure out the relationship between the input data and an associated value or label. Generally speaking, the objective is to build models that make similar predictions for similar inputs, in the hopes of making an appropriate prediction for new data. In other words, supervised learning tasks aim to understand the relationship between the data and some other value or attribute of the data. Provided with a dataset of training samples, the difficulty lies in establishing what constitutes similarity.
1.2.1.1. Classification
A typical classification task consists of predicting which class or group a given input is associated with. Examples of such tasks include predicting whether an image is of a cat or a dog, or whether a given wine sample is a red or a white wine. Other, more advanced tasks, such as auto-correcting mistyped words on a cellphone keyboard or even facial recognition software, also correspond to a form of classification task. Apart from some cases of multi-label classification, the general objective of the machine learning model in this task is to predict which one of the C possible classes corresponds to the input.

Datasets used for classification tasks consist of the input data as well as the corresponding label for each of the data samples. The goal is then to make a prediction of the label of a new data point. To design a classification algorithm or model, it is often mandatory to know ahead of time the possible labels or classes that the data may represent, i.e. will these pictures be exclusively of either dogs or cats? In addition, the loss function used to train machine learning models for such supervised learning tasks is the cross-entropy loss between the model's probabilistic prediction and the label distribution of the training data.

Given training data $\mathcal{D}_{\text{train}} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(i)}, y^{(i)}), \ldots, (x^{(N)}, y^{(N)})\}$, where $x \in \mathbb{R}^d$, $y \in \{1, 2, \ldots, C\}$ and $f_\theta(x^{(i)}) = \left[ P(y^{(i)} = 1 \mid x^{(i)}),\, P(y^{(i)} = 2 \mid x^{(i)}),\, \ldots,\, P(y^{(i)} = C \mid x^{(i)}) \right]$, the global cross-entropy objective can be defined as,

$$\mathcal{L}(f_\theta(x^{(i)}), y^{(i)}) = -\sum_{j=1}^{C} \mathbb{1}\{y^{(i)} = j\} \log f_\theta(x^{(i)})_j \qquad (1.2.1)$$
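As an illustration, here is a minimal NumPy sketch of equation (1.2.1); the code and variable names are mine, not from the thesis:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, label):
    # Equation (1.2.1): only the term of the true class survives the indicator.
    return -np.log(probs[label])

logits = np.array([1.0, 2.0, 4.0, 8.0])   # class scores for C = 4 classes
probs = softmax(logits)                   # the model's predictive distribution
loss = cross_entropy(probs, label=3)      # true class y = 4 (index 3)
print(probs, loss)
```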
1.2.1.2. Regression

Similarly to classification, regression tasks can be seen as predicting a value for a given input. However, instead of selecting one of the possible groups to associate a given input with, the goal is to estimate a real value. An intuitive example of regression is estimating the value of a house. A real estate agent acts much like a machine learning model in a regression task: given all the information about the neighborhood and the characteristics of a house, the agent tries to determine a good price at which to list the house on the market. Variants of regression include predicting the amount of acceleration an autonomous vehicle requires, or even the price a user may be ready to pay online for an item, given their user profile.

The datasets for regression are similar to those of classification tasks, but a real value is associated with each of the data samples. The loss function generally used in regression tasks is the mean square error (MSE), as it allows for computing losses between real numbers.

Given training data $\mathcal{D}_{\text{train}} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(i)}, y^{(i)}), \ldots, (x^{(N)}, y^{(N)})\}$, where $x \in \mathbb{R}^d$, $y \in \mathbb{R}$ and $f_\theta(x^{(i)}) = \hat{y}^{(i)}$, the MSE objective can be defined as,

$$\mathcal{L}(f_\theta(x^{(i)}), y^{(i)}) = \frac{1}{2}\left(y^{(i)} - f_\theta(x^{(i)})\right)^2 \qquad (1.2.2)$$
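A corresponding one-line sketch of equation (1.2.2), again with illustrative values of my own:

```python
import numpy as np

def mse(y_true, y_pred):
    # Equation (1.2.2): half squared error, averaged over the samples.
    return 0.5 * np.mean((y_true - y_pred) ** 2)

y = np.array([310_000.0, 455_000.0])      # e.g. observed house prices
y_hat = np.array([300_000.0, 470_000.0])  # model predictions
print(mse(y, y_hat))
```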
1.2.2. Unsupervised learning tasks

Unlike supervised learning tasks, unsupervised learning focuses on the actual data itself rather than its relationship with a corresponding label or value. There is a wide array of unsupervised learning tasks, but most of them try to estimate, in some way, the underlying distribution of the data. In general, a key distinction from supervised learning datasets is the absence of labels.

1.2.2.1. Clustering
Conceptually, clustering is about discovering boundaries throughout the data in order to regroup similar data points into groups, or clusters. The number of clusters is usually required to be known in advance. There is an analogy to be made between clustering and classification, except that the former is in a situation where it is not known in advance what classes the data correspond to. An example of a clustering task is grouping users based on their online activity in order to better predict their purchases.
1.2.2.2. Density estimation
Rather than focusing on a function of the data, density estimation aims at discovering or approximating the underlying function behind the data, commonly called the probability density function. In other words, the goal is to find the distribution from which the data was created. If successful, the recovered function is a powerful tool that can be used to replace the original data or even create new data from the same distribution. Density modelling can further be understood as a way of compressing all the data at hand into a machine learning model.

These tasks are fairly general, and they depend on the intended purpose once the density has been successfully modelled by a machine learning model. For example, modelling the insurance claims of a set of car insurance policies allows an insurance company to analyze the risks it is exposed to. Given that model, it can further determine appropriate pricing for a new customer. The key here is the intent of doing something else with the density estimate, which on its own does not accomplish anything.
1.2.2.3. Generative models
Density estimation usually requires an explicit parameterized model of a family of density functions. On the contrary, generative models usually have an implicit model of the density distribution of the data. The objective is then to be able to sample from that distribution in order to get additional data samples. An example of a task using generative models would be creating additional artwork given some paintings of a deceased artist. By modelling the distribution of the known paintings of an artist, it could be possible, in theory, to generate new paintings that correspond to that artist's characteristics.
In order to solve any of the previously mentioned machine learning tasks, a mathematical framework of the problem needs to be formulated. Once the problem is formulated as an optimization problem, algorithms can be used to minimize or maximize the objective. Throughout the different tasks, the one recurring approach is to define the objective to minimize as a loss function. That loss function is what defines the task a model will be trained to do.

Given a function $f$ with parameters $\theta$ and a training dataset $\mathcal{D}_{\text{train}}$, the global objective $J(\theta)$ can be defined as,

$$J(\theta) = \hat{R}(f_\theta, \mathcal{D}_{\text{train}}) = \frac{1}{|\mathcal{D}_{\text{train}}|} \sum_{i=1}^{|\mathcal{D}_{\text{train}}|} \mathcal{L}(f_\theta(x^{(i)}), y^{(i)}) \qquad (1.3.1)$$

The ultimate goal of the optimization procedure is to find the parameters $\theta$ that minimize $J(\theta)$, the empirical risk, over the entire training data. The solution to the optimization problem can be written as $\theta^* = \operatorname{argmin}_\theta J(\theta)$.

However, this is not easily accomplished, in particular given the complexity of the tasks now explored in the machine learning community. Indeed, especially with deep learning models, the resulting loss function to minimize is simply not tractable. This means the possibility of solving the optimization problem analytically is out of the question.

The true intent behind using machine learning models is often not to minimize or maximize the actual loss used in training. Indeed, this mathematical formulation of the true goal is a surrogate loss that can be easily optimized, because most of the time the real measure or objective is not. A key example of this is the classification task. The actual objective of classification is to minimize the number of errors the model makes on new data samples it never trained on, i.e. the generalization error. However, for gradient-based optimization algorithms to work, the mathematical formulation of the objective must be continuous and differentiable, which the generalization error may not be.
Given the complex nature of the loss functions used throughout, rather than finding the solution analytically, an iterative method can be used. Had the problem been analytically solvable, the optimization procedure would have called for finding where $\nabla_\theta J(\theta) = 0$, i.e. where the gradients of the objective with respect to the parameters $\theta$ are zero. An intuitive alternative that circumvents the need to solve this is to consider an iterative solution, called gradient descent/ascent [9].

1.3.2.1. Stochastic Gradient Descent
The gradient descent procedure can be conceptually understood as hiking down a mountain without knowing the actual path down. An intuitive way to reach the bottom would be, at each step, to look at the slope of the mountain and take a step in the direction that goes downhill. If the mountain is convex, such that there are no valleys restricting access to the bottom, this approach is guaranteed to get you there.

This is in fact exactly what minimizing the empirical risk through gradient descent does. Iteratively, at each step, the direction that goes downhill is computed, and then a step is made in that direction. That direction is given by the gradient $\nabla_\theta J(\theta)$.

To mitigate the risk of getting stuck in local minima during the optimization, rather than computing the gradient based on the full training set, noisy estimates of the gradient can be computed by randomly selecting a subset of the data. This approach is commonly called minibatch stochastic gradient descent, where a minibatch of size $m$ represents a randomly selected subset of the data. The gradient estimate can therefore be defined as,

$$\hat{\nabla}_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \mathcal{L}(f_\theta(x^{(i)}), y^{(i)}) \qquad (1.3.2)$$

In the most extreme case, where $m = 1$, the gradient can be estimated with a single sample from the data. A stripped-down version of minibatch SGD and its training algorithm is described in Algorithm 1.

One key aspect of SGD is that the gradients must be evaluated with the same value of $\theta$ for the full minibatch. This is often one of the bottlenecks in training speed. As the minibatch size increases, the accuracy of the gradient estimate increases, but so does the time needed to compute it. Although GPU implementations have been able to parallelize some of these computations to speed up training, other approaches have been proposed to further alleviate the time constraint.

Algorithm 1: Minibatch stochastic gradient descent

    Given D_train = {(x^(1), y^(1)), ..., (x^(N), y^(N))}, the training dataset
    Given f_θ, a continuous and differentiable function
    Initialize η as the step size
    Initialize m as the size of the minibatch
    while not converged do
        Randomly select m data samples from D_train
        θ ← θ − (η / m) Σ_{i=1}^{m} ∇_θ L(f_θ(x^(i)), y^(i))
    end while
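A runnable sketch of Algorithm 1 on a toy linear-regression problem, using the MSE loss of equation (1.2.2); everything here (data, model, names) is illustrative rather than taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y = 3x + 1 plus noise.
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0   # parameters theta
eta, m = 0.1, 32  # step size and minibatch size

for step in range(500):
    idx = rng.choice(len(X), size=m, replace=False)  # random minibatch
    x_mb, y_mb = X[idx, 0], y[idx]
    err = (w * x_mb + b) - y_mb
    # Gradients of the (1/2) squared error, averaged over the minibatch.
    grad_w = np.mean(err * x_mb)
    grad_b = np.mean(err)
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)  # should approach 3.0 and 1.0
```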
1.3.2.2. Synchronous and asynchronous SGD

Distributed SGD refers to the widely used approach of data parallelism with large neural network models and datasets. The main concept is to split the data among different computation nodes connected to a central controller. Each node always has the same version of the model, and each is responsible for providing the central controller with the parameter changes for its local data share. To create a single parameter update of the global model, each node computes the gradients on a local minibatch of data drawn from its local dataset partition. The gradients are then sent to the central controller, where the gradients from all compute nodes are aggregated to form the global gradient estimator. The global gradient estimator is then sent back to all compute nodes, so they can all update their local version of the model in the same manner. This ensures each local model keeps the same version of the parameters as the others.

This method scales almost linearly with the number of computing nodes (up to a certain number of nodes [18]). The computational advantage of this approach comes from parallelizing the computation over multiple data partitions. The theoretical gain for a dataset split into N partitions is a factor of N, because each compute node is assumed to train on its local share in parallel. Furthermore, asynchronous versions [5] of distributed SGD have been developed to alleviate some of the computational needs of the approach. However, synchronous methods have shown some limitations to scalability, leading to alternative methods such as co-distillation [1].

Throughout this document, this distributed SGD approach is considered an approach leveraging communication between models, because of its use of the central controller. In fact, it is considered the most efficient in terms of computation gains and speedup. The main problematic aspect of distributed SGD is the high bandwidth requirement that derives from sharing the raw gradients. Given that the models used under this approach are usually large and that the different computation nodes are physically separated, a top-of-the-line network between the compute nodes and the central controller is required.
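The synchronous scheme can be mimicked in a few lines; the sketch below simulates the gradient-aggregation step across workers on the toy regression problem above (again, my own illustrative code, not the thesis implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy problem, y = 3x + 1, with data split across 4 "compute nodes".
X = rng.normal(size=(1000,))
y = 3.0 * X + 1.0 + 0.1 * rng.normal(size=1000)
shards = np.array_split(rng.permutation(1000), 4)  # local dataset partitions

theta = np.zeros(2)  # [w, b], identical on every node
eta, m = 0.1, 32

def local_gradient(theta, shard):
    # Each node estimates the gradient on a minibatch of its own partition.
    idx = rng.choice(shard, size=m, replace=False)
    err = theta[0] * X[idx] + theta[1] - y[idx]
    return np.array([np.mean(err * X[idx]), np.mean(err)])

for step in range(300):
    grads = [local_gradient(theta, s) for s in shards]  # ideally in parallel
    global_grad = np.mean(grads, axis=0)  # central controller aggregates
    theta -= eta * global_grad            # identical update sent back to all

print(theta)  # approaches [3.0, 1.0]
```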
With machine learning being a subcategory of artificial intelligence, neural networks and deep learning are a subgroup of machine learning techniques and training algorithms. They derive their name from their structural inspiration in the topology of the brain, and from the way these algorithms leverage the stacking of different layers of units.

All recent deep learning architectures are based on the linear combination of multidimensional inputs and parameters. Given an input $x = \{x_1, x_2, \ldots, x_d\}$ and parameters $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, the most basic model to approximate a function of $x$, the linear combination, can be written as,

$$f(x) = \sum_{i=1}^{d} w_i x_i + b \qquad (1.4.1)$$

This model alone could be applied to previously mentioned tasks like classification or regression. In either case, $f(x)$ would serve as an approximation of the label or value $y$, and it could be trained by minimizing the empirical risk using minibatch gradient descent.

The ancestor of neural networks is the Perceptron [28], a clever take on the linear combination that applies it to a binary classification task. Consider a dataset for a two-group classification task with $\mathcal{D}_{\text{train}} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$, where $x \in \mathbb{R}^d$, $y \in \{-1, 1\}$, in addition to parameters $w \in \mathbb{R}^d$, $b \in \mathbb{R}$. To simplify the notation, all parameters of a model can be grouped into a single variable $\theta = \{w, b\}$.

The Perceptron defines the predicted class $f(x)$ as follows,

$$f(x) = \begin{cases} 1, & \text{if } \sum_{i=1}^{d} w_i x_i + b > 0 \\ -1, & \text{otherwise} \end{cases} \qquad (1.4.2)$$

To minimize the empirical risk, the Perceptron uses a creative loss function that allows it to employ gradient optimization algorithms. Indeed, as previously discussed, directly minimizing the number of errors on the training set is not feasible with SGD. However, the Perceptron slightly modifies the error count in order to allow for the gradient computation. Denoting $h(x) = \sum_{i=1}^{d} w_i x_i + b$, it can be expressed in the following manner,

$$\mathcal{L}(f_\theta(x^{(i)}), y^{(i)}) = \mathbb{1}\{y^{(i)} \times f_\theta(x^{(i)}) \leq 0\} \left( -y^{(i)} \times h(x^{(i)}) \right) \qquad (1.4.3)$$

The loss function is therefore valued at 0 unless there is an error in the prediction, in which case it is equal to the linear combination of the parameters and the input $x$. Without getting into the details, this loss function can be plugged into the previously described optimization algorithms to iteratively train $\theta$.

The Perceptron can be decomposed into two functions quite intuitively: firstly, a linear combination of the parameters, or weights, and the input, which gives $h(x) = \sum_{i=1}^{d} w_i x_i + b$; and secondly, the step function that gives the final output $f(x) = \operatorname{sign}(h(x))$.

Similarly, any function could be applied to $h$ in order to get an output. These functions are called non-linearities or activation functions, and they are a key concept in allowing neural networks to become high-capacity models.
Figure 1.1. Popular non-linearity functions used in neural networks: (Left) Sigmoid, (Middle) Tanh and (Right) ReLU.

Without the use of a non-linearity, the model can only represent linear relationships. Although any function could be used as a non-linearity, given the use of gradient optimization, it is required to be differentiable almost everywhere.

The ReLU [29], the sigmoid and the tanh functions are the most common activation functions. They serve different purposes: the ReLU is often used as a hidden layer activation, while the sigmoid can be used to provide a value between 0 and 1, like a probability. The tanh function used to be a popular hidden layer activation, but it is now mostly used to force values between −1 and 1. These non-linearities are plotted in Figure 1.1, and their functions are as follows,
• ReLU: $f(x) = \max(0, x)$
• Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
• Tanh: $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

The linear combination described in equation 1.4.1 combines an array of inputs $x$ and parameters $w$ and $b$ into a single scalar. A natural way of expanding this is to consider multiple sets of parameters $w$ and $b$. If, instead of a single linear combination with a single set of $w$ and $b$, there were $k$ linear combinations, each with its corresponding parameters, $f(x)$ could become multidimensional. In addition, much like in the Perceptron, a non-linearity function could be applied to each of these.

For simplicity, a linear combination of the parameters and an array of inputs, combined with a non-linearity, can be called a node or a unit. The analogy to a neural network then comes from the visual representation where a single node is connected to the input array.

The combination of the $k$ units or nodes is referred to as a layer, and the $k$ scalar values generated from the computations are called the outputs of the layer. In a way, this layer is composed of $k$ Perceptrons, allowing the model to have an input dimension of $d$ and an output size of $k$. The $k$ outputs of the previously defined layer can very well be seen as an input themselves, and in fact this is the key concept behind neural networks: layers can be stacked on top of each other, where the output of a previous layer is the input to the next.
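To make the layer-stacking concrete, here is a minimal NumPy forward pass for a one-hidden-layer MLP; the dimensions and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 2, 6                                      # input dimension, hidden units
W1, b1 = rng.normal(size=(k, d)), np.zeros(k)    # hidden layer parameters
W2, b2 = rng.normal(size=(1, k)), np.zeros(1)    # output layer parameters

def relu(a):
    return np.maximum(0.0, a)

def mlp(x):
    h = relu(W1 @ x + b1)  # k linear combinations followed by a non-linearity
    return W2 @ h + b2     # output layer: a single scalar

print(mlp(np.array([0.5, -1.2])))
```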
Figure 1.2. Visual representation of a multi-layer perceptron with a single hidden layer of 6 units, input x and a single output.

When considering the whole set of nodes, the last layer is referred to as the output layer, while all the others, apart from the input, are called hidden layers. Names often given to these networks are the feedforward neural network and the Multi-Layer Perceptron (MLP), as per their structural design. A visual representation of a single-hidden-layer MLP is given in Figure 1.2.

Given enough memory to store all the parameters, these models can be built to arbitrary size, where both the number of units and the number of layers can be controlled. Having multiple layers and numerous units allows the output of the network to be a highly complex function of the input. The MLP has an interesting property: provided with an infinite number of hidden units, it is a universal function approximator [21]. In other words, it can gain enough complexity through the combinations of the input that any continuous function can be reproduced.

Convolutional neural networks [27] (CNN or ConvNet) are a special type of neural network especially designed to handle data such as images. CNNs have been shown to be applicable not only to images but also to text [23] and even audio signals [35]. These models are what popularized deep learning, through their highly optimized implementations and impressive performance on difficult image classification tasks such as ImageNet [12].

The key concept behind CNNs is parameter sharing, which allows the same computations to be made at different places of a layer. In comparison, an MLP layer needs specific computations for all of the layer's input dimensions. Parameter sharing is very useful when applied to images because it allows the model to detect the same shapes and patterns throughout the image. These models have the property of being shift and space equivariant, which simply means that, through the convolution operation of a layer, a shift in the input results in the same shift in the output.

It is important, however, to point out that this type of network can very well be, and most often is, mixed with the MLP. In fact, the different types of networks are usually handled as layers, where it does not matter how a given output was obtained, as long as it can be considered as an input to another layer. Multiple convolutional layers are often used to extract features from images, only to be combined higher in the architecture with an MLP.
Another type of neural network, the Recurrent Neural Network (RNN) [31], is used to handle sequential data such as time series or video. To handle input that changes over time, RNNs have the characteristic of sharing parameters through time. Indeed, rather than having a separate set of parameters for each time step, the same parameters are used at every time step.

Given an input $x = \{x_1, \ldots, x_t, \ldots, x_T\}$, where $x_t \in \mathbb{R}^d$, and parameters $W_x \in \mathbb{R}^{d \times m}$, $W_h \in \mathbb{R}^{m \times m}$, $b \in \mathbb{R}^m$, it can be written as a recurrence, in the simplest case, as,

$$h_t = f(x_t, h_{t-1}) = W_x \cdot x_t + W_h \cdot h_{t-1} + b \qquad (1.4.4)$$

With a hidden state $h_t$ computed for each time step, it can be used as an input to another layer to generate an output. This is an example of the concept of layers that is key to understanding how neural networks are built. Furthermore, it is easy to imagine a wide array of combinations of inputs and outputs, especially given the added dimension of time often associated with RNNs. The most widely used RNN architectures are the long short-term memory [20] (LSTM) and the gated recurrent unit [10] (GRU). Both of these models have been shown to allow longer dependencies between time steps to emerge, by limiting the effects of exploding and vanishing gradients, which have been described as causing problems for the optimization procedure [6].
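A bare-bones NumPy version of recurrence (1.4.4), with illustrative dimensions of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, T = 3, 5, 10                  # input size, state size, sequence length
Wx = rng.normal(size=(m, d)) * 0.1  # input-to-state parameters, shared across time
Wh = rng.normal(size=(m, m)) * 0.1  # state-to-state parameters, shared across time
b = np.zeros(m)

x = rng.normal(size=(T, d))         # one input sequence
h = np.zeros(m)                     # initial hidden state h_0

for t in range(T):
    # Equation (1.4.4): the same parameters are reused at every time step.
    h = Wx @ x[t] + Wh @ h + b

print(h)
```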
In this section, the most popular deep unsupervised learning models are detailed; they are also used in the subsequent chapters of this document.

1.4.6.1. Variational Auto-Encoder
The Variational Auto-Encoder [25] (VAE) is a model of the encoder-decoder type that can both encode an input into features and generate samples. An encoder network maps from the input space of $x$ to $z$, the feature space, which can be of arbitrary size. The true distribution of $z$ given an input $x$ is denoted $p(z|x)$, and the encoder's model of that distribution is defined as $q(z|x)$. It turns out that, through its training objective, $q(z|x)$ is trained to become closer to the true posterior distribution $p(z|x)$. This in turn makes it possible to sample $z \sim q(z|x)$. Another network, the decoder, maps the sampled $z$ back into the input space as a generator. The output distribution of the reconstructed inputs is denoted $p(x|z)$.

The key, and the beauty, of making this whole model work is the training objective, called the variational lower bound or evidence lower bound (ELBO). This bound on the log-likelihood of the underlying data distribution $p(x)$ can be maximized to create meaningful features and a good sample generator. Using the above-mentioned distributions, it can be defined as,

$$\mathcal{L}(q) = \mathbb{E}_{q(z|x)}\left[\log p(x, z)\right] + H\left(q(z|x)\right) = \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - D_{KL}\left[q(z|x) \,\|\, p(z)\right] \leq \log p(x) \qquad (1.4.5)$$

Practically speaking, the leftmost objective, $\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right]$, is attributed to the decoder and is considered the reconstruction or prediction error. Indeed, the decoder is trained to increase its ability, given a sampled $z \sim q(z|x)$, to predict $x$. The rightmost part, the Kullback-Leibler (KL) divergence $D_{KL}\left[q(z|x) \,\|\, p(z)\right]$, pushes the features generated by the encoder closer to the prior distribution $p(z)$, which is defined as a Gaussian distribution. This approach has the advantage, depending on the task at hand, of encoding an input into a distribution rather than a simple deterministic mapping.
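As an illustration, here is how the ELBO of equation (1.4.5) is typically computed, assuming a Gaussian encoder and a Bernoulli decoder (a common choice, not one stated by the thesis); the PyTorch sketch and its names are mine:

```python
import torch
import torch.nn.functional as F

def elbo(x, x_hat, mu, logvar):
    """x: inputs in [0, 1]; x_hat: decoder output probabilities;
    mu, logvar: parameters of the Gaussian q(z|x)."""
    # E_q[log p(x|z)], estimated with one sample z ~ q(z|x) (Bernoulli decoder).
    recon = -F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL[q(z|x) || p(z)] for a Gaussian q and standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon - kl  # maximize this lower bound on log p(x)

# The reparameterized sample fed to the decoder to obtain x_hat:
# z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```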
1.4.6.2. Generative Adversarial Networks

GANs, short for generative adversarial networks [17], revolutionized the world of generative models. Although numerous variants of the vanilla GAN have been proposed, this brief section details the original version.

The main concept at the center of this type of model is the competition between two components, the generator and the discriminator. Firstly, the generator, with its parameters $\theta_G$, is a function of a noise vector $z \in \mathbb{R}^m$ and directly outputs fake samples, i.e. $G_{\theta_G}(z) = x'$. The discriminator, with its own set of parameters $\theta_D$, takes as input either the original data $x$ or the fake sample $x'$, and gives a single scalar score between 0 and 1 using a sigmoid non-linearity on the output layer. The discriminator network is noted $D_{\theta_D}$. Conceptually, this score represents the confidence of the discriminator that the provided input (whether real or fake) is a real sample.

The discriminator is trained both to increase the score of true data and to decrease the score of fake data. In a way, it is learning to distinguish between the true distribution of the data, $p(x)$, and the fake distribution of the data, $q(x|z)$. The adversarial aspect of the model derives from the way the generator is trained. Indeed, its objective is to compete with the discriminator and generate samples that would be considered true. In other words, it is trained to fool the discriminator into thinking its samples are real.

Given samples $x \sim p(x)$ and $x' \sim q(x|z)$ with $z \sim q(z)$, from the true data distribution and generated samples respectively, the global objective can be written as,

$$\mathbb{E}_{p(x)}\left[\log D_{\theta_D}(x)\right] + \mathbb{E}_{q(x|z)}\left[\log\left(1 - D_{\theta_D}(x')\right)\right] \qquad (1.4.6)$$

Relating back to the adversarial aspect, the discriminator is trained to maximize this full objective, while the generator is trained to minimize it.
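A compact PyTorch sketch of one update of objective (1.4.6), assuming `G` and `D` are any suitable generator and discriminator modules, with `D` ending in a sigmoid as described above; the names and optimizers are illustrative:

```python
import torch

def gan_step(G, D, x_real, opt_d, opt_g, m=64, z_dim=100):
    z = torch.randn(m, z_dim)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    d_loss = -(torch.log(D(x_real)).mean()
               + torch.log(1 - D(G(z).detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: minimize log(1 - D(G(z))), the original minimax form.
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```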
1.4.6.3. Bidirectional Generative Adversarial Networks

One disadvantage of the GAN is that it does not provide a compressed representation of the data. Unlike the VAE, it only builds a model that can generate samples. Two similar models have been designed to leverage the adversarial aspect of GANs in order to generate features out of input data. Although Adversarially Learned Inference [15] is similar to the Bidirectional Generative Adversarial Network [13] (BiGAN), the latter is slightly more straightforward to explain and was used in some experiments detailed in this document.

The key difference between the GAN and the BiGAN is the added encoder $E_{\theta_E}(x)$, used to map true data samples to the feature space. Previously, the discriminator was only fed either the true data or the generated samples from the generator. In the BiGAN, the noise vector and the encoded features are paired up with their corresponding generated and true data samples, respectively. Much like in the traditional GAN framework, the pairs are then considered as the fake and true data and fed to the discriminator.

Training of the generator and discriminator is the same as in the original GAN formulation, while the encoder is also trained to minimize the objective. The authors of BiGAN further argue that, with the proposed objective, the encoder learns to invert the generator.

The capacity of a model refers to its ability to represent a large space, or family, of functions. Although abstract, it can conceptually be understood as a measure of how flexible a model is. For example, a model such as a deep learning network with many parameters has high capacity and is therefore known to be able to represent highly complex functions, while a linear model has very low capacity, since it can only represent linear relationships.

Capacity can often be controlled by the choice of machine learning algorithm, and even within the configuration of a particular algorithm. It might seem intuitive to always aim for the highest possible capacity when tackling a machine learning task; however, there is an important caveat, and it relates to the objective being optimized. Given that the training procedure minimizes the empirical risk rather than the true evaluation objective, increased capacity should allow reaching a lower training loss, but it might not be ideal in terms of generalization error.
Ensemble learning is a machine learning method for combining a set of models into a single one. A common approach, called model averaging, consists of fully training variants of the same machine learning model on the same data and then combining them when making a prediction. This has the effect of leveraging the diversity of the different solutions proposed by the set of machine learning models. An early version of this approach, called bootstrap aggregating (or bagging) [8], proposed to train the models on different subsets of the training data.

This is another type of algorithm that can be considered to have a communication protocol. At the time of inference, or deployment of the ensemble of models, a communication occurs, since only a single prediction is made out of the ensemble. All the models that are part of the ensemble therefore, in some way, communicate in order to jointly make one prediction.
Regularization consists of limiting the capacity of the model such that optimizing the training objective does not make performance on the evaluation objective worse.
1.5.3.1. Weight-decay
A widely used regularization approach is to impose constraints on the parameter weights of neural networks. This is done by adding an additional objective to the empirical risk $J(\theta)$. The additional objective to minimize can be the norm of the weight vectors. Since minimizing the norm of the vector goes against the main objective, it is considered to restrict the model, as it pushes all of the parameter values towards 0. Furthermore, the importance of this additional objective is controlled by a hyper-parameter $\lambda$, where a larger value indicates greater importance relative to the original empirical risk.

Indeed, by considering the L1 or the L2 norm of the weight vectors, the global objective can become either of the two objectives described below. Additionally, in some cases, such as elastic net regularization [38], both the L1 and L2 norms may be combined to ensure proper regularization of the network's weights.

$$\tilde{J}(\theta) = J(\theta) + \lambda \sum_i |\theta_i| \qquad (1.5.1)$$

$$\tilde{J}(\theta) = J(\theta) + \lambda \sum_i \theta_i^2 \qquad (1.5.2)$$
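In code this is a one-line addition to the training loss; a PyTorch-style sketch of equation (1.5.2), with illustrative names (most frameworks also expose this directly, e.g. the weight_decay argument of PyTorch optimizers):

```python
def l2_regularized_loss(task_loss, model, lam=1e-4):
    # Equation (1.5.2): empirical risk plus lambda times the squared L2 norm
    # of all parameters, which pushes the weights towards 0.
    penalty = sum(p.pow(2).sum() for p in model.parameters())
    return task_loss + lam * penalty
```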
1.5.3.2. Dropout

The dropout [33] regularization approach is a very interesting and simple approach that provides great regularization power. It consists of injecting noise into a layer by sampling a binary mask over the input and hidden layers. For example, given $x = \{x_1, x_2, \ldots, x_d\}$, a binary mask of size $d$ is sampled from a Bernoulli distribution with probability $p$. The amount of noise is controlled by the probability $p$, a hyper-parameter.

During training, the mask is sampled for each $x$ that goes through a given layer of the network, and the dimensions of $x$ that match the noisy mask are turned off. During evaluation, the probability is set to 0 and inference is made on the full observations for each layer. By sampling different masks at each layer, dropout has the effect of limiting the number of pathways in the neural network. Furthermore, this approach has been compared to training an ensemble of models [2], at a much smaller resource cost.
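A minimal sketch of the training-time masking described above; the rescaling mentioned in the comment is the common "inverted dropout" convention, not something stated in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, training=True):
    if not training:
        return x                     # evaluation: full observation, no noise
    mask = rng.random(x.shape) >= p  # keep each dimension with probability 1 - p
    # Common implementations also scale by 1/(1 - p) here, so that
    # activations keep the same expected value at test time.
    return x * mask

x = np.array([0.5, -1.2, 3.3, 0.7])
print(dropout(x, p=0.5))
```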
1.5.3.3. Early stopping

A widely used method of monitoring the progress of the training procedure of a network is to consider how it performs on the actual objective. For example, monitoring the generalization error of a classification network during training, and stopping when the model no longer improves, is a way of regularizing it. Indeed, by limiting the number of gradient optimization steps the procedure makes, the number of functions that can be represented by the network is limited.
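A typical patience-based version of this procedure, as a hedged sketch (the thesis does not specify one; train_step and val_error are placeholder callables):

```python
def train_with_early_stopping(model, train_step, val_error, patience=10):
    best, since_best = float("inf"), 0
    while since_best < patience:
        train_step(model)        # one epoch (or a fixed number of updates)
        err = val_error(model)   # error on data held out from training
        if err < best:
            best, since_best = err, 0
        else:
            since_best += 1      # stop after `patience` rounds without progress
    return best
```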
Chapter 2. Exchanging Outputs Between Models
Most implementations of large-scale neural networks are trained using either synchronous [11, 37] or asynchronous [5] gradient optimization procedures. Although both of these approaches have great benefits and can achieve impressive results leveraging data parallelism, they require a great deal of both computing and networking resources to actually achieve their theoretical optimal speedup. In this work, we propose to explore how removing one of the main characteristics of these optimization procedures may affect performance. Mainly, this work studies how decentralizing the computation onto multiple compute nodes, and allowing them to communicate with each other rather than with a central controller, may reach similar performance. Part of the study concerns what information can be communicated between the compute nodes, which includes low-bandwidth outputs and hidden layer activations.
Previous work on allowing a model's output to be used as training labels for another one, known as distillation [19], has paved the way to leveraging a network's outputs to accelerate training in a teacher/student setup. More recently, [1] showed that distillation could be applied online, during training, on two large models, in order to circumvent some of the scaling issues of distributed synchronous SGD. Building on this, this work explores how this expands to a full network of computation nodes that exchange outputs at different depths of their respective architectures.

The current shift towards synchronous and asynchronous gradient optimization as deep learning models and datasets grow bigger entails high bandwidth requirements, to trade off the large local computations done at each node. In some cases where highly optimized infrastructure and large computational power are available, distributed SGD can drastically reduce the time needed to train large deep learning models [18]. There are therefore large bandwidth gains to be made by decentralizing some of the computation.

Reconsidering a decentralized version of distributed SGD further allows for a reconsideration of future computational needs and possibilities. Using a decentralized computing network could allow for increased stability and failure resistance, as the overall progress of training would not depend on the success of every compute node. Such an approach could also lower the cost of training large deep learning models, by removing the requirement for highly optimized and efficient computing centers.

In a way, all of the potential advantages explored here could be extended to form an internet of computing. One could indeed imagine a large network of low-powered and low-bandwidth computing nodes exchanging information between them without the need to centralize the communication, all working on the same objective.

An inspiration for studying the effects of decentralizing the communication between models comes from society and the way humans go about communicating with each other. There is a parallel to be made with the way humans exchange with each other directly when the intent is to learn something. In particular, when the wide range of knowledge and data available to society is considered, an analogy can be drawn regarding the specialization of different members of society in different subsets. When called upon to solve a problem, the society unites to propose a solution. This work touches on parts of this analogy through the way it trains, communicates and predicts.

As part of the decentralized and communicating network of computing nodes, different components and parameters are explored. In particular, which nodes can communicate, what they communicate and how often they communicate with each other are part of the tested configurations of the computing group. Both supervised and unsupervised learning tasks are used to analyze the effectiveness of the proposed approach. An overview of the experimental setup is, however, required first, to show the different components at play.
2.2. Method

In order to alleviate the engineering and computational needs often associated with using a physical network of computing nodes, all experiments were done with models in a simulated network of computing nodes on the same physical machine. What is meant by simulated is that the computing nodes were not truly set up as a network of computing nodes, but were instead all stored on the same physical machine. Using academic resources such as large clusters of GPUs for relatively high computational needs is possible, but requiring the exclusive availability of more than 50-100 GPUs across multiple machines was simply not feasible. Furthermore, having all the simulated nodes effectively on the same machine simplified the implementation of the communication protocol between nodes.

Ultimately, the main limitation to both training speed and memory usage of the implementation was the number of nodes in the network. The approach chosen to scale up efficiently with the number of nodes was to evaluate the training loops of the compute nodes sequentially. Some significant operations, such as data management and the communication protocol, were implemented in parallel across all compute nodes in order to shave off some of the sequential overhead. Given the appropriate resources, all experiments could be extended relatively easily to a full network of parallel computing nodes.
2.2.2.1. Dataset
To test the effect of decentralized communication, and for practicality of experimentation, the CIFAR-10 dataset was used. This dataset consists of 50,000 colored images of size 32x32 used for training, as well as a distinct set of 10,000 images of the same size used as the test set. Each image represents 1 of 10 different classes, uniformly split across both the training and test sets.

To reflect the decentralization of the data between computing nodes, the entire training set was randomly split among the different computing nodes in the network. In other words, given a network of 25 nodes, each node has its own distinct, and exclusive, 1/25th of the training data. Furthermore, the distribution of data among the computing nodes was done independently of the classes. The fraction of the training data that each computing node has access to is considered, and referred to, as its local training data.

In the actual implementation, given the simulated network of nodes, the dataset was centralized on the physical machine used to hold all the models, as this provided easier access and simplified the experimental design. In no way did any node use the partition of another node's training data. Much like the extension to physically distinct computing nodes, the dataset setup could be extended to be truly decentralized.
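The random, class-independent split can be expressed in a couple of lines; a sketch under the assumption that the training set is indexable (names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

n_train, n_nodes = 50_000, 25
perm = rng.permutation(n_train)         # class-independent shuffle of indices
shards = np.array_split(perm, n_nodes)  # 25 exclusive partitions

# shards[i] holds the indices of node i's local training data (2,000 images).
assert all(len(s) == n_train // n_nodes for s in shards)
```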
2.2.2.2. Model architecture
With the dataset this work focuses on, the logical choice of model was the Convolutional Neural Network (CNN). These models are widely used in the deep learning community when the inputs are images, given their optimized GPU implementations and, in particular, their structural characteristics. Such characteristics include parameter sharing and their ability to be invariant to slight input transformations such as translations.

The size of the model and the details of the architecture of the CNNs used throughout the experiments are not essential to the understanding of this work. For completeness, the models were traditional CNNs with no pooling layers, using strided convolutions instead. For work in the unsupervised learning setting, if a decoder was necessary, the same structure as the encoder was used. In general, the structural recommendations of the DCGAN [30] architecture were followed.
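For illustration only, here is a small strided-convolution encoder in the DCGAN spirit for 32x32 CIFAR-10 images; since the thesis states the exact architecture is not essential, this is purely a plausible stand-in:

```python
import torch.nn as nn

encoder = nn.Sequential(
    # Strided convolutions replace pooling, following the DCGAN recommendations.
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 16x16 -> 8x8
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 8x8 -> 4x4
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    nn.Linear(256 * 4 * 4, 10),  # logits for the 10 CIFAR-10 classes
)
```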
2.2.2.3. Communication and network of nodes
In order to design the communication exchanges between compute nodes, one key aspect to think about is how each of these computing nodes will communicate throughout training. The first thing to consider is which nodes can communicate with each other. There is, of course, a combinatorial number of ways of designing sets of nodes that can communicate with each other. To simplify this, consider the analogy to human communication at both extremes: on the one hand, each of us communicates with a small set of relatives; on the other hand, we attend classes or conferences where the same information is distributed to a much wider array of individuals.

Implicit in these is the distinction between broadcasting and consuming information. One could be broadcasting to a large number of nodes, e.g. giving a talk at a conference, or to a small number of nodes, e.g. speaking to close relatives. Regarding consumption, attending a conference allows consuming from a large array of different sources, while being exposed only to close relatives restricts that number.

To address this in the implementation, the communication between nodes was designed such that a given node can broadcast to its p neighbours, with p controlled as a hyper-parameter. As for consumption, it is controlled implicitly by the dynamics of the network resulting from the hyper-parameter p. For example, in a 25-node network, if p is set to 24, a given node can broadcast to all other nodes, and every node can broadcast to all other nodes. The set of nodes to which a given node can broadcast is considered to be its neighbours. In a network with much more restricted communication, e.g. a 100-node network with p = 5, the constraint imposed in the implementation is that a node's 5 neighbours must be adjacent. An illustrative way of understanding this is to consider all nodes arranged in a circle, with the neighbours selected as the closest nodes.

For simplicity, the communication pattern between nodes is fixed, and p is the same for all nodes. A possible extension of this work consists of using more complex sets of connections, e.g. each node having a random set of neighbours, fixed or changing.

To further control the communication between nodes, broadcasting and consumption frequencies were added as hyper-parameters. Simply put, for a given minibatch, the broadcasting frequency can be seen as the likelihood of broadcasting it to the node's neighbours. As for the consumption frequency, it can be seen as the likelihood of consuming data sent from other nodes rather than from the local training data. In addition, if a node is training on data sent from another node, that data is also exposed to the possibility of being broadcast again, i.e. broadcasted data can be re-broadcast to other nodes.

To implement data consumption from other nodes, each node has its local training data in addition to a consumption queue to which all the data sent from other nodes is added. This consumption queue is analogous to an email inbox, receiving and storing all data locally, where it will be read from.
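One plausible reading of the adjacency constraint, with nodes on a circle and neighbours picked as the closest indices (my own sketch, not the thesis code):

```python
def ring_neighbours(i, n_nodes, p):
    """Return the p nodes closest to node i on a circle of n_nodes."""
    offsets, k = [], 1
    while len(offsets) < p:
        offsets.append(k)       # next node clockwise
        if len(offsets) < p:
            offsets.append(-k)  # next node counter-clockwise
        k += 1
    return [(i + o) % n_nodes for o in offsets]

print(ring_neighbours(0, 100, 5))  # [1, 99, 2, 98, 3]
```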
All the components impacting the communication of a single computation node are summarized in Algorithm 2.

Algorithm 2: Consumption and broadcasting training pseudo-code (for one node)

    Initialize q_local, q_external as the local training data queue and an empty external data queue
    Initialize p_c, p_b as the consumption and broadcasting probabilities
    Initialize a communication channel to the q_external of the neighbours
    while training do
        consume ∼ Bernoulli(p_c)
        broadcast ∼ Bernoulli(p_b)
        if consume then
            data ← pop q_external
            Do consumed-data training objective step
        else
            data ← pop q_local
            Do local training objective step
        end if
        if broadcast then
            for all neighbours do
                Put data in their q_external
            end for
        end if
    end while
2.2.2.4. Collective decision making

As communication during training was detailed in the previous section, another key aspect to consider is how the computing nodes communicate at test time. In this line of work, all the nodes have a randomly initialized model and each has its own subset of the data, but the focus remains on combining the knowledge from each of the nodes and considering their total knowledge as a group. In addition to being aligned with the human-culture analogy previously described, traditional centralized approaches leveraging data parallelism usually aim to train a single model to make a single prediction at test/inference time.

Traditional ensemble methods such as bagging [8], which make the full dataset accessible to each of the models, combine predictions by averaging results or, if applicable, by using a voting scheme. Under the studied framework, the preferred approach was to employ a form of weighted model averaging. The weighting is done at the level of the output probabilities. For example, in the supervised setting, all nodes are presented with the data to make a prediction on, and they all provide their distribution over the possible answers, namely the different classes along with their corresponding probabilities. The distributions are then gathered from all nodes, the entropy of each per-model distribution is computed, and a single distribution is created by weighing each of them with their negative entropy. The entropy used is the Shannon entropy, leveraging its relationship to uncertainty as a confidence level for each node. The collective decision-making procedure is described in Algorithm 3.
Algorithm 3 Collective decision making pseudo-code
    Input x is received by every compute node
    Each node i computes its probability distribution y_i = [y_{i1}, y_{i2}, ..., y_{iC}]
    Collect all y_i's and initialize the sum of negative entropies s = 0
    for all nodes and their corresponding y_i do
        Compute the entropy as h_i = -\sum_k y_{ik} \log y_{ik}
        Increment the total sum of negative entropies: s = s - h_i
    end for
    for all nodes do
        Compute the normalized weight based on negative entropy as w_i = -h_i / s
    end for
    Compute the single weighted distribution as y = \sum_i w_i \times y_i
    Make a single class prediction as argmax y

The prediction is therefore made from the entropy-weighted average distribution across nodes. This work makes an important assumption at test time: that all nodes are reachable in order to make a prediction. Although not explored here, future work on collective decision making at a much larger scale should consider applying the same approach while collecting predictions from only a subset of the nodes rather than all the network's nodes. If this framework were to be extended to a very large number of computing nodes, such as an internet of computing, requesting an answer from all nodes would simply not be feasible.
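A minimal NumPy sketch of this entropy-weighted averaging is shown below; the function name and array shapes are assumptions made for illustration.

    import numpy as np

    def collective_decision(probs):
        """probs: array of shape (num_nodes, num_classes), one predictive
        distribution per node. Returns the class predicted from the
        negative-entropy-weighted average distribution of Algorithm 3."""
        eps = 1e-12                                     # avoid log(0)
        h = -(probs * np.log(probs + eps)).sum(axis=1)  # entropy h_i per node
        s = (-h).sum()                                  # sum of negative entropies
        w = -h / s                                      # w_i = -h_i / s, sums to 1
        combined = (w[:, None] * probs).sum(axis=0)     # entropy-weighted average
        return int(np.argmax(combined))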
2.2.3.1. Training objective and evaluation
Regarding a single node along with its local training data, the supervised learning procedure and objective are standard. The objective for each node is to minimize the cross-entropy loss over all the local training data, considering the corresponding label of each training image. If there is a communication channel between two nodes, and depending on what information is communicated between them, an additional training objective is considered; more on this in section 2.2.3.2.

As for evaluation, the network of nodes aims to have a low generalization error, much like in all other traditional supervised learning tasks. In practice, the accuracy on the data left out of the training procedure is used as a measure of generalization performance. Considering that there is a full group of models rather than a single one to measure accuracy on, and that all the nodes need to be evaluated as a whole, the same data is used to evaluate all the nodes. The predictions on the data left out of the training procedure are made the same way as described in section 2.2.2.4. A prediction is considered correct if the class associated with the highest probability is the correct one.

The measure of accuracy over training steps will be adjusted to reflect the acceleration potential of the approach. In other words, given that our implementation simulates a parallel system, some operations can be assumed to be potentially executed in parallel. Given an appropriate computation network, all training steps could be executed at the same time for all compute nodes.
2.2.3.2. Information communicated between nodes
In order to avoid directly communicating the gradients between nodes or to a central system, as with the synchronous optimization algorithms, different outputs of each of the models in the network of compute nodes can be exchanged, taken at different depths of the architecture. Given the classification task at hand, neural networks compute label predictions for each input as the highest-level output. These label probabilities are a normalized version of what is commonly called logits, or class label scores. The normalization of these logits is usually done with the softmax function.

An intuitive thing to share between compute nodes would be the class label for a training sample. However, as previously described in distillation [19], additional information about a model can be extracted from the logits and, in turn, accelerate the training of a secondary model if they are used as training targets, in particular when the temperature of the logits is raised. The modified logits can be normalized to create another predictive distribution over the labels, further referred to here as soft-labels. The operation of normalizing the received logits into soft-labels is detailed in equation 2.2.1.

Given the received logits [v_1, v_2, ..., v_C] from another model for each of the C classes, and with temperature \tau, the soft-labels [y_1, y_2, ..., y_C] to be used as targets can be computed as,

    y_i = \frac{\exp(v_i / \tau)}{\sum_j \exp(v_j / \tau)}        (2.2.1)

As the temperature approaches zero, the soft-labels harden and become more like a one-hot vector of the predicted class label. In contrast, as the temperature approaches infinity, the soft-labels become uniform. See Figure 2.1 for the effect of varying the temperature on the resulting class probabilities. Conceptually, exchanging the soft-labels can be seen as exchanging what a node thinks the answer is, as opposed to sending the answer directly, i.e. the true label.

Figure 2.1. Class probabilities resulting from normalizing the logits with different temperatures. The logits or class scores used, i.e. 1.0, 2.0, 4.0, 8.0 for the 4 classes, are the same across the different temperatures.
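To make the soft-label computation of equation 2.2.1 concrete, a small NumPy sketch follows; the function name is illustrative, and the example logits are the ones from Figure 2.1.

    import numpy as np

    def soft_labels(logits, tau):
        """Normalize received logits into soft-labels (equation 2.2.1)."""
        scaled = np.asarray(logits, dtype=float) / tau
        scaled -= scaled.max()                # for numerical stability
        exp = np.exp(scaled)
        return exp / exp.sum()

    logits = [1.0, 2.0, 4.0, 8.0]             # the class scores used in Figure 2.1
    print(soft_labels(logits, tau=0.5))       # hardened, close to one-hot
    print(soft_labels(logits, tau=20.0))      # close to uniform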
Whether the information communicated between nodes is the soft-labels or the true labels, both can be used with the same additional training objective. Indeed, for the node on the receiving end of such information, they can be treated just like regular training data with corresponding labels. If the true label is exchanged, using the same training objective as for the local training data is straightforward. As for the soft-labels, the slight difference is simply to take into consideration the full probability distribution over the classes from the sender as being the target label, as shown in equation 2.2.2.

Using y = [y_1, y_2, ..., y_C] as the soft-labels or the true labels and f_\theta(x) as the predictive probability distribution over the classes for a given input x, the cross-entropy loss originally defined in equation 1.2.1 can be extended to consider the full probability distribution as,

    L(f_\theta(x), y) = -\sum_{j=1}^{C} y_j \log f_\theta(x)_j        (2.2.2)

In addition to exchanging soft-labels and true labels, exchanging high-level features between nodes was tested. Conceptually, the highest hidden layer before the label prediction output is an abstract representation of the image fed as input. This in itself makes it a good candidate for information to be shared between the models. Much like labels, soft or not, the features can be passed through the same communication channel.

As for the training objective for another node's features, and given that the features are trained under no constraint apart from the main supervised training objective, an appropriate loss function is the mean-squared error. Given the features of an input x from models A and B, represented as h^A = [h^A_1, h^A_2, ..., h^A_m] and h^B = [h^B_1, h^B_2, ..., h^B_m], the MSE objective can be defined as,

    L(h^A, h^B) = \sum_{k=1}^{m} \frac{1}{2} (h^A_k - h^B_k)^2        (2.2.3)

A receiving node is therefore trained to match its top-level representation to its neighbours' by considering them as arrays of real values. From the group's point of view, exchanging top-level features consists of ensuring that all the nodes of the network extract similar features from the same images.

A great way to test the effectiveness of the communication channel is to test it in an unsupervised learning setting. Previously, in the supervised setting, it was still possible for a computing node to communicate the true label corresponding to the training data. In contrast, with the unsupervised task, no information apart from the training data itself is assumed to be available. This then becomes a question of how a model can send good features to others in the computing network.
2.2.4.1. Training objective and evaluation
Much like in the supervised setting, all nodes train on their local data with their own respective objective. In this case, their objective is to extract meaningful features from the images. Considering that the intent is to communicate features (more on this in section 2.2.4.2), two families of unsupervised models that allow for encoding an input into features were explored:

(1) Variational Auto-Encoder (VAE) [25]
(2) Bidirectional Generative Adversarial Networks (BiGAN) [14]

Both of these allow nodes to have their own local training objective and don't influence the general understanding of this work.

The VAE is comprised of an encoder and a decoder and is trained on the local data both to ensure that the extracted features lead to a reconstruction of the input and to make the features themselves similar to a Gaussian distribution. One key feature of the VAE is that the learned features are distributional, as the encoder's output corresponds to Gaussian parameters \mu and \sigma.

As for the BiGAN, it has three components. The generator is the same concept as in the traditional GAN [17] framework, such that it generates fake samples out of sampled noise. The component added in BiGAN is an encoder that maps from the input space to a feature space. These features, extracted from the real input (not from the generator's output), are paired with the input before being fed to the discriminator. The pair of features and real input is considered real, while the sampled noise along with the corresponding generated samples is considered fake for the discriminator. Much like in the GAN, the discriminator is trained to distinguish between the real and the fake, trying to create a bigger distance between the two distributions. On the other end, both the generator and the encoder are trained to fool the discriminator.

Evaluating unsupervised learning is in itself a field of research, but for this line of work, the focus was on leveraging the same collective decision making for evaluation as in supervised learning. To do so, a linear classifier was added on top of the encoder of each computing node, which was then trained on the full training data. Doing so allowed for evaluating purely how effective the exchange of communication between nodes was with regard to the feature extraction process. Much like in the supervised setting, the weighted predictive distribution along with the accuracy on data left out of the training data was also utilized. Unlike in the supervised setting, here the focus is mostly on performance rather than actual acceleration.

2.2.4.2. Information communicated between nodes
In the unsupervised learning setting, there is no grounded information such as class labels to be exchanged between nodes. Therefore, in this setting, the features are used as the information to be communicated. For a compute node on the receiving end, this unsupervised task now becomes similar to a supervised task, as it tries to reproduce another model's output. The hypothesis is that, for a model, it is easier to learn through a supervised objective than an unsupervised one.

There is, however, more flexibility as to what the training objective for the exchanged features can be. In particular, when a VAE is used, the features are distributional and characterized as a Gaussian distribution. As the goal is to have models with a similar representation for the same input, an objective such as the Kullback-Leibler (KL) divergence between two Gaussian distributions can be leveraged. Indeed, minimizing the KL divergence between the sender's and the receiver's features can be seen as pulling the latter's feature distribution towards the former's. As for the experiments with the BiGAN, much like in the supervised case, the mean-squared error was used as the training objective for the feature matching.
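Since the VAE encoder outputs the parameters of a diagonal Gaussian, the KL divergence between the sender's and the receiver's feature distributions has a closed form; a minimal sketch is shown below, with illustrative names, assuming one independent Gaussian per feature dimension.

    import numpy as np

    def gaussian_kl(mu_s, sigma_s, mu_r, sigma_r):
        """KL( N(mu_s, sigma_s^2) || N(mu_r, sigma_r^2) ), summed over feature
        dimensions, where (mu_s, sigma_s) are the sender's encoder outputs
        and (mu_r, sigma_r) the receiver's. Minimizing this with respect to
        the receiver pulls its feature distribution towards the sender's."""
        kl = (np.log(sigma_r / sigma_s)
              + (sigma_s ** 2 + (mu_s - mu_r) ** 2) / (2.0 * sigma_r ** 2)
              - 0.5)
        return kl.sum()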
To evaluate how successful this approach is at decentralizing the computation efficiently, different configurations of the network of nodes were explored. Almost all of the configurations and parameters of the network of nodes were tested in the supervised setting. The rationale behind this is that the unsupervised and supervised learning tasks in this setting share very similar aspects, in particular when the compute nodes exchange features under the supervised task. It is expected that the most successful configuration under the supervised setting should transfer to the unsupervised task, especially given that the dataset is the same and the models are similar in size and structure.

For this work, the most meaningful hyper-parameters of the network-of-nodes configuration are the content being communicated (logits, true labels, features), the temperature, if applicable, to which the sent logits are raised, the frequencies of broadcasting and consumption, the number of nodes, as well as the number of neighbours each compute node has. Although not detailed here, other hyper-parameters outside of the communication scope were also tuned. In particular, the scaling factor for the additional loss, the model size and the learning rate were explored, but varying these did not change how the other factors affected the results.
2.3.1.1. Number of nodes and neighbours
The overall assessment is that the more, the better. However, there is an important caveat regarding the resulting size of the local training data partition associated with each computing node. Collective accuracy levels did not increase past 20 nodes, to the point where using 30 and 50 nodes performed the same, while jumping to 100 nodes caused performance to decrease. Given that the experiments were performed on CIFAR-10, which is a relatively small dataset, the observed lack of improvement can be explained by the resulting small amount of training data available to each node. Indeed, as the number of nodes reached 100, signs of over-fitting were noticeable, e.g. the training loss on the local training data rapidly collapsed to zero.

Regarding the number of neighbours, at least in the supervised setting, increasing the number of connections in the graph always helped, with a fully connected network achieving the best performance. This means that increasing the level of diversity in communications for each of the compute nodes further increased generalization performance.
2.3.1.2. Frequency of broadcasting and consumption
Although multiple scenarios of consumption frequency were explored, the best performing and most logical was to allow each node to consume one sample from its local training data queue for each sample consumed from the rest of the network. Effectively, in a 10-node network, each node consumes on average 50% of its actual training data from its private share, while the other 50% is split among the other 9 nodes, i.e. 5.6% from each other node's private share.

The frequency of broadcasting was constructed in a way that allows for an equilibrium in the number of samples available for consumption. In other words, some scenarios could cause individual nodes to receive too much data from the others, to the point where the extra data received would need to be dealt with, i.e. by flushing older samples or ignoring more recent ones. Instead, the focus was on ensuring that the broadcasting frequency allows the consumption to remain stable.

On the other end, a too-high consumption scenario could arise where each node consumes as much from its private share as from each of the other nodes, making it just like training on the full dataset. When considering all samples consumed, this would make a single node consume more from others by a factor equal to the number of neighbours. Conceptually, however, this doesn't align well with the human analogy previously described. Therefore, consumption probabilities were set to 0.5 for all nodes.

2.3.1.3. Content of communication
If the information exchanged between compute nodes is logits, better performance results from lowering the temperature, i.e. hardened outputs. An intuitive extension to sharing logits with a low temperature is to share the actual ground truth labels. Although in slight contradiction with the findings in [19], these results are confirmed by the observation that exchanging the true labels of the training data performs even better.

The fact that exchanging the true labels performs better than a close-to-zero temperature can be easily explained by the fact that some models make prediction errors on the training set. In other words, with a small temperature that produces hardened outputs, a node still has a chance of making an error on its prediction, while the true label is always correct. When making a prediction error, that node sends an incorrect target to another node, which will negatively affect performance compared to sending the actual true label. Sharing logits only later in training was briefly experimented with, but no significant difference was noticeable.

Broadcasting the top-level features of each compute node performed very badly when compared to the other types of information communicated.
Tables 2.1 and 2.2 show the performance of a single node as well as of 10 nodes, based on the outputs communicated between the compute nodes (logits, true labels, features). Also included in the tables are results for 20 nodes exchanging true labels, as it was the best-performing configuration. All the models in the tables have the same size and are trained with Adam [24] with default learning rates. The average accuracy of the whole group and the collective decision accuracy are denoted as Avg acc. and Coll. decision in the tables, respectively. The results further reflect all the previously mentioned configuration choices and the best of each setting, in particular fully connected communication throughout the network of nodes.

Table 2.1 showcases the performance that configurations can achieve after 5,000 training steps. These training steps can be considered as wall-time for a parallel implementation; in other words, each of the nodes of the network has trained for 5,000 training steps. Training on either local data or broadcasted data from other nodes both counted as a training step. It can be noticed that sharing the logits (or soft-labels) with collective decision achieves performance similar to a single node operating on its own with all the training data available, the two achieving 84.5% and 84.8%, respectively. In general, leveraging a unified prediction through collective decision making rather than measuring the average accuracy across all nodes allowed a gain of at least 5% in all configurations, but this increase in performance is expected when using any ensemble method [8]. This isn't true, however, for one of the baselines, where all nodes are trained independently: only a marginal increase can be noticed using the collective decision. For 10 nodes, sharing the features performed poorly, as it only reached 80.1% accuracy with the collective decision, while sharing the true labels reached 86.8%, the best of all 10-node configurations.

The overall best performing approach was 20 nodes communicating the true labels of their training data, with 87.7%. However, exchanging true labels is basically the same as a traditional ensemble method, only with a weighted dataset resulting from the communication with other nodes rather than access to all the data. It can further be seen that any of the approaches with a communication protocol greatly outperforms the isolated network of 10 compute nodes on both average accuracy and collective decision after 5,000 steps.

Table 2.1. Validation accuracy (%) after 5,000 training steps.

Configuration                        Avg acc.   Coll. decision
1 node – No sharing                    84.8         84.8
10 nodes – No sharing                    –            –
10 nodes – Sharing logits                –          84.5
10 nodes – Sharing true labels           –          86.8
10 nodes – Sharing features              –          80.1
20 nodes – Sharing true labels           –          87.7

As for Table 2.2, the intent is to show the potential speedup of these approaches when considering the parallelism at play. Making the 10 nodes share logits did not provide much training acceleration when compared to a single node, the two reaching 85% accuracy at 5,300 and 5,900 steps, respectively. There was, however, a speedup of over 215% (5,900 vs 2,700 steps) for reaching 85% accuracy when comparing the 20 nodes sharing true labels to the single node. Sharing the features, as well as the network of nodes without any communication, did not reach 85% accuracy within the step budget, which is denoted as >13,000 in the table.

Table 2.2. Training steps until reaching 85% validation accuracy.

Configuration                        Avg acc.   Coll. decision
1 node – No sharing                    5,900        5,900
10 nodes – No sharing                 >13,000      >13,000
10 nodes – Sharing logits             >13,000       5,300
10 nodes – Sharing true labels         12,800       3,100
10 nodes – Sharing features           >13,000      >13,000
20 nodes – Sharing true labels           –          2,700

2.3.3. Unsupervised learning

Unfortunately, the low performance of feature broadcasting in the supervised setting carried over to this new setting, for both the BiGAN and VAE models. It turns out that for the unsupervised learning task, having communication between the nodes negatively impacted performance, while non-communicating nodes performed slightly better on a downstream classification task, in particular when using the collective decision approach. Table 2.3 details the accuracy of the linear classifier trained on top of the extracted features. The linear classifiers were fully trained, using early stopping on a validation set, at different stages of the unsupervised learning. Ultimately, the best performing linear classifier on the validation set throughout training was selected, and its test set performance is reported.
Table 2.3. Test set accuracy (%) of a linear classifier over the learned features.

Configuration                          Avg acc.   Coll. decision
VAE
    No sharing                             –            –
    10 nodes – Full communication        41.5         42.7
BiGAN
    No sharing                             –            –
    10 nodes – Full communication        44.3         46.2

The results show that the fully communicating network performs worse than the network of nodes without any communication. It is to be noted, however, that the collective decision approach performed well even for the non-communicating nodes, which is in line with the observations in supervised learning. No approach could beat the single node on the average accuracy of the classifiers, but when combined to make a single prediction, the group performed better. In particular, for both VAE and BiGAN, the collective decision in the no-communication setting allowed for at least a 3% increase when compared to the single node. Again, this kind of jump in performance is to be expected when using ensemble methods.

Using the KL divergence as the training objective for the communicated features seemed to help performance slightly when compared to the single node. Indeed, the collective decision with communication was able to achieve 42.7%, which is 0.9% over the single node, while in the BiGAN setting with the mean-squared error, it was 0.7% under that same baseline.
The main observation to be made in the supervised setting is that exchanging the true labels performs better, by a significant margin, than letting the nodes communicate outputs such as the logits or top-level features. As for the unsupervised setting, it was clear that sharing the features only impacted performance negatively. There are, however, still potential uses for the proposed approach in some specific circumstances or setups.

In a scenario where the true label is not available to be exchanged, e.g. in a semi-supervised setting or with partially lost labels, it was shown that having the compute nodes broadcast their predictions on their data could make the whole group achieve performance similar to what a single model could achieve. This is especially true in situations where the predictions are known to be good. These results could be extrapolated to a scenario where having a single model is simply not feasible: instead of leveraging a distributed SGD implementation that requires the models to exchange the large number of parameter gradients, they could simply exchange the logits between them.

Although exchanging true labels did not provide a linear acceleration with the number of nodes, it was shown in [1] that distributed SGD implementations do not scale up well in very large networks of over 256 compute nodes. Extending the proposed approach of sharing true labels to a much larger network of nodes could still show some speedup. Demonstrating the gains of such a very large scale network of communicating nodes is left for future work. Further considering why the true labels perform better than logits, this approach could potentially be improved if the predictive accuracy of the broadcasting node were better than the consuming node's: the better model or node could then replace the true labels directly, removing the need to broadcast them. An extension of this setting is described in Chapter 3.

Chapter 3
Increased Utility Through Selection Of Training Data
Training deep learning models with stochastic gradient descent requires randomly selecting samples from the training data. During the training of a neural network, it can be anticipated that some samples will be more or less effective for the training of the model; those can be seen as harder or easier, respectively. In this work, it is proposed to allow a randomly initialized student network to communicate with a fully trained network, the teacher, to try and leverage the latter's expertise by having it instruct the student about which of the training data are harder. To identify the difficult examples, rather than simply sending away model outputs or labels as in Chapter 2, the teacher considers the predictions from the student to evaluate which of the training samples are good training candidates. It is demonstrated that the distance between the predictions of the teacher and the student can be used as a proxy of difficulty to select samples for the student and therefore accelerate training when compared to randomly sampling training data. This builds on previous results from Chapter 2 in addition to [19]'s work on distillation. Furthermore, it will be shown that using the teacher's predictions as training targets for the student can further increase convergence speed.

Graduate students often seek shortcuts when studying for a final exam. Instead of going through all of the content, they wish to optimize their grade while not having to go over all the course material. They sometimes do so out of laziness, but more often these shortcuts are taken because the student already understands a given section of the course material well. The student can therefore afford to skip some exercises listed for a chapter, as he trusts the understanding developed through previous experience. Another way a graduate student accelerates his training is by leveraging a professor's role and accessibility. He can reach out to a professor, who uses his experience to recommend either exercises or additional readings in a way that is beneficial for the student's learning.

Maintaining this analogy of a graduate student, a randomly initialized neural network and its traditional stochastic gradient descent training can be seen as a highly inefficient training procedure. For that randomly initialized neural network, the student, going through all training samples certainly has some inefficiencies, since it might already have mastered the content associated with some of them.

In this work, it is proposed to address these inefficiencies by maintaining a communication channel between the student network and a previously trained teacher network. This communication channel will not only be used to transmit information but also instructions regarding which training samples can accelerate performance. One of the findings leveraged from previous work in Chapter 2 was that sharing the true labels resulted in better performance than sharing the hardened soft-labels as supervised learning targets. A possible explanation for this observation was that the model sending the soft-labels was simply not accurate enough for its predictions to be used as ground truth labels.
This is where leveraging a teacher and student analogy might prove itself useful.

The teacher will identify which examples can accelerate performance by using the predictions of the student with different measures quantifying the difference from its own predictions. The effect of varying the size of the pool from which the teacher selects the samples to train the student on will also be detailed. In this setup, the teacher has the possibility of either sending the normal training data with the ground truth labels or sending its own set of predictions as the labels. The communication channel therefore includes both the student sending its predictions to the teacher and receiving the samples along with the appropriate targets to use in its training procedure.

A successful acceleration in this simplified setup could be useful in settings where training the student on the original data is no longer feasible. In a sense, this could be used in semi-supervised learning tasks where the ground truth label is not available for part of the training data.

3.2. Method
There are two major components of the teacher and student analogy to consider. The first is that a teacher already knows the content of a course, or, in the supervised setting, has a low generalization error. The second is that, much like in the teacher and student relationship in an academic setting, the teacher has the ability to evaluate the student's weaknesses and customize its training. More concretely, with its evaluation of the student's skills, it is able to identify which chapters or exercises are the most beneficial for the student to learn the course content.

To incorporate these two components into a supervised learning experimental framework, a model referred to as the teacher was trained until convergence with early stopping, by monitoring its prediction accuracy on a validation set. A second model, the student, is then randomly initialized and its training procedure begins. Throughout training, the student has access to the same data as the teacher, but in addition, it is able to leverage the teacher's predictions as targets. More details regarding the student's training are given in section 3.2.3.
3.2.2.1. Dataset
A widely used dataset throughout the machine learning community, and the one used in this work, is the MNIST dataset [26]. It consists of 60,000 greyscale images of digits between 0 and 9, as well as an additional 10,000 images as the test set. Each image is 28x28 pixels and is used as a single row of 784 pixels. This dataset allows for a simplified experimental setup and does not require a specific model size or architecture in order to show significant performance. Given that the objective of this work is to show the acceleration of training, an interesting aspect of this dataset is the efficiency of training on it.

From the 60,000 training samples, 10,000 were set aside as a validation set, used to monitor the performance of the teacher for early stopping, as well as the training speed of the student.
3.2.2.2. Model architecture
Both the teacher model and the student were constructed as networks with 2 fully connected hidden layers with ReLU activations [29], in addition to a softmax output layer over the 10 classes of the MNIST dataset. The teacher was purposefully set up to have many more hidden units at each layer than the student's network, with 1,200 and 32 respectively. In addition, each hidden layer of the teacher was regularized using dropout [33], while the student's were not. The structure of these networks was intended to allow the teacher network to acquire more knowledge, through its greater capacity, than the student model.

For both models, Adam [24] optimization with the default hyper-parameters was used, and both the student and teacher architectures were kept the same throughout the experiments. Also, the parameters of the teacher network were trained once and kept the same across all sets of experiments in order to compare the different effects on student training.
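As a sketch of the described architectures, the two networks could be defined as follows in PyTorch; the layer sizes come from the text above, while the framework choice and the dropout rate are assumptions.

    import torch.nn as nn

    # Teacher: 2 hidden layers of 1,200 units, regularized with dropout
    # (the 0.5 rate is an assumption, not stated in the text).
    teacher = nn.Sequential(
        nn.Linear(784, 1200), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(1200, 1200), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(1200, 10),   # logits; the softmax is applied in the loss
    )

    # Student: same depth but only 32 units per hidden layer, no dropout.
    student = nn.Sequential(
        nn.Linear(784, 32), nn.ReLU(),
        nn.Linear(32, 32), nn.ReLU(),
        nn.Linear(32, 10),
    )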
The interactions described below between the teacher and student models in the proposed framework can be seen as analogous to a student answering quizzes and sending them to the teacher. Moreover, rather than having the teacher provide feedback on all the submitted quizzes, it only provides feedback on some of them. This section covers how the teacher selects which samples will be used to train the student, in addition to how they will be used.
3.2.3.1. Evaluating the difficulty
Let's assume the teacher has a set of training examples from which it needs to select the ones that will be the most beneficial for the student, and let's further assume that the ground truth labels are not available. What is proposed here is to compare the predictions made by the student to the ones made by the teacher. Considering the teacher's prediction as a proxy for the real labels, it can easily be identified when the student is wrong. The goal of using them as a proxy is to identify which training samples are more difficult for the student, without requiring the ground truth labels.

However, the intent is to leverage the totality of the probabilities associated with each of the class predictions from the student. There is much more information to be gathered by considering the full predictive distribution than just its most likely outcome.

It is therefore proposed to measure the difference between the teacher's and the student's predictive distributions as a proxy for difficulty. For a random variable Y, the class label, and given probability distributions P_Y and Q_Y, the teacher's and student's predictive distributions for a given sample, respectively, the different metrics explored are the following.

Cross-entropy. The cross-entropy serves as a natural metric to measure the difficulty of the training samples, since it is the actual objective of supervised learning when P_Y is the ground truth label. It measures how much the probability distribution Q_Y differs from P_Y, where P_Y is the target. It is defined as,

    H(P_Y, Q_Y) = -\sum_y P_Y(y) \log Q_Y(y)        (3.2.1)

Euclidean distance. Sometimes called the pairwise distance, it can be used to determine the distance between the two distributions by leveraging their vector form. Unlike the cross-entropy, the Euclidean distance puts an equal weight on each of the class labels. Using the previously defined P_Y and Q_Y, it can be defined as,

    D_E(P_Y, Q_Y) = \left[ \sum_y (P_Y(y) - Q_Y(y))^2 \right]^{1/2}        (3.2.2)

Of the above-mentioned error measures, only the Euclidean distance can be considered a true distance, since the cross-entropy is not symmetric. However, the term distance will be used more loosely throughout this chapter to refer to any of these error measures.

For a given set of training samples, such as a minibatch, once the teacher has both its own set of predictions and the student's, it can compute any of the metrics for each sample. A sample with a higher error indicates currently lower performance from the student and is therefore considered a harder example. Now equipped with the error of each of the training samples, the teacher is able to select the most difficult ones and send them back to the student.

In order to fairly compare performance between a student exchanging with a teacher and one without such communication, there must be some consideration of the computational cost associated with this approach. In particular, the student still needs to compute predictions on the training samples to be able to communicate them to the teacher. In addition, the cost of computing the distances is left to the teacher, but it can be assumed that both models operate in parallel and, moreover, that computing the distances is much cheaper than a backward pass of the student.
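A small NumPy sketch of the two difficulty measures of equations 3.2.1 and 3.2.2, computed per sample over a minibatch, is given below (the function names are illustrative).

    import numpy as np

    def cross_entropy(p_teacher, q_student, eps=1e-12):
        """Equation 3.2.1 per sample: H(P, Q) = -sum_y P(y) log Q(y).
        Both arrays have shape (batch_size, num_classes)."""
        return -(p_teacher * np.log(q_student + eps)).sum(axis=1)

    def euclidean(p_teacher, q_student):
        """Equation 3.2.2 per sample: the L2 distance between the two
        predictive distributions, seen as vectors."""
        return np.sqrt(((p_teacher - q_student) ** 2).sum(axis=1))

A higher value of either measure indicates a sample that is currently harder for the student.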
3.2.3.2. Using soft-labels as student targets

In addition to selecting which samples will be beneficial to the student, it was important to explore which target to provide to the student once these have been selected. This decision comes down to selecting which of the information or outputs available from the teacher should be used. Following the work in Chapter 2, an intuitive choice is the ground truth labels, since they proved to perform better. However, it was proposed that this observation occurred because the models exchanging soft-labels did not have good generalization performance. Therefore, using the predictions of the teacher as soft-labels will be considered as a possible target for the student. With either of these approaches, no additional training objective is necessary: both can leverage the same cross-entropy objective, simply replacing the labels with the soft-labels as detailed in equation 2.2.2.
3.2.3.3. Implementation details
Similarly to the quiz analogy, the teacher network manages the set of training samples by stacking the multiple minibatches sent by the student, along with their corresponding predictions. The size of the stack is important, as it allows the teacher to select the most difficult samples from a larger set of training samples.

For the implementation, a stack referred to as the teacher stack was created, with a size controlled by the hyper-parameter n, where n represents the number of minibatches it can hold. Before pushing the student's data onto the stack, the teacher makes its own predictions, to be later used to compute the distances.

Once the stack is full, the m highest-scoring of the n * m samples are selected by the teacher as the training data to send back to the student, where m is the batch size. Conceptually, the teacher therefore compresses n minibatches into a single one for the student to train on. Although the student does a forward pass on all n * m samples, there is a gain of (n - 1) * m backward passes. As the size of the stack increases, a greater acceleration is therefore expected in terms of accuracy per backward pass. The number of forward passes still needs to be considered, so the mentioned acceleration should take it into account.

In order to show any potential advantage of the proposed approach, it is necessary to compare it with the most basic baseline: a student network training by itself, without any communication with the teacher and using the ground truth labels. This baseline is, in essence, the same training procedure as the teacher network, or any randomly initialized neural network, the only differences being the model size and regularization, as previously described.

An additional baseline considered is filling up the teacher stack as in the other approaches, but rather than having the teacher select the training samples with a metric, it selects which ones to send randomly and sends its predictions as soft-labels. This allows testing the impact of actively selecting the samples. Throughout the different configurations tested, all student models were trained by monitoring the validation accuracy and were stopped after 20 epochs of no improvement. The batch size was 32, so the stack from which the teacher chooses the hardest examples holds multiples of 32 samples.
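The selection step itself can be sketched as follows, reusing the cross_entropy measure defined above; the stack layout and names are illustrative assumptions.

    import numpy as np

    def select_hardest(x_stack, p_teacher, q_student, m):
        """From a full teacher stack of n*m samples, keep the m samples on
        which the student's distribution differs most from the teacher's,
        and return them with the teacher's predictions as soft-label targets."""
        difficulty = cross_entropy(p_teacher, q_student)   # shape (n*m,)
        hardest = np.argsort(difficulty)[-m:]              # indices of the top m
        return x_stack[hardest], p_teacher[hardest]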
3.3.1.1. Sample selection and defining student targets
Figure 3.1 shows the validation accuracy during training; to show the potential acceleration of this approach, it is plotted against the number of parameter updates, or backward passes, made by the student (in minibatches).

It can be seen in Figure 3.1, on the left, that all three scenarios using a distance and a teacher stack size of 2 minibatches reach the 96%-97% mark about twice as fast as the baseline (in blue). Using the Euclidean distance (in red) performs slightly less well than the two other configurations using the cross-entropy between teacher and student predictions as a proxy of difficulty. Considering this, the cross-entropy was selected as the best distance
to distinguish which examples to send to the student. The baseline of randomly selecting soft-labels to send to the student (in yellow) performs worse than actively selecting based on any of the above-mentioned metrics. This can be interpreted as the increased complexity of the communication channel being beneficial to the student's training. Furthermore, using the cross-entropy as a distance measure aligns perfectly with the supervised objective: if the teacher's predictions are used as targets, the difficulty of a sample can be computed directly with the loss.
Figure 3.1.
Validation accuracy (%) per number of parameter updates of the student network during training. Configurations of the students include a baseline with no communication. (Left) Stack size of two minibatches, with both cross-entropy and Euclidean distance as the difficulty measure, in addition to comparing soft-labels, true labels, and random selection. (Right) The best configuration from stack size two (cross-entropy, soft-labels), compared with a teacher using a stack size of 5 minibatches with soft-labels and true labels.

In the same figure, on the left, it is shown that when using a stack size of 2 minibatches, using either the ground truth labels or the teacher's soft-labels did not seem to affect the general performance (in green and brown).
3.3.1.2. Teacher stack size
On the right side of Figure 3.1, it can be seen that increasing the size of the teacher stack to 5 minibatches, which makes the pool from which the teacher can select samples bigger, helps performance. With either soft or true labels (orange and purple), it greatly reduces the number of backward passes required when compared to both the baseline and a stack size of 2 minibatches with soft-labels and cross-entropy (in green). Furthermore, using the teacher's predictions as soft-label targets for the student (in orange), combined with a stack size of 5 minibatches, significantly outperforms the student training directly with the true labels. In this work, when the true labels are communicated back to the student as targets, it is important to note that both the student's and the teacher's predictions are still used for determining which examples are harder. Other teacher stack sizes, ranging up to 100 minibatches, were also experimented with. However, they did not provide any benefit from scaling; an explanation put forward is the small dataset size. Verifying this assumption on much larger networks and datasets is left for future work.
Although Figure 3.1 shows the convergence speed of the model, it must be considered that peak performance on the validation set cannot be used to compare the generalization performance of different configurations without bias. To alleviate this, Table 3.1 shows the test set accuracy of the same configurations, using their best model based on the peak validation accuracy.

Table 3.1. Test set accuracy (%) and number of backward passes (or parameter updates) to reach that performance, for various student configurations, based on the best-scoring parameters on the validation set. All the optimization hyper-parameters are the same for the different student networks.

Configuration                          Accuracy   Backward passes
Baseline                                 96.4        39,000
Teacher stack 2 minibatches
    Cross-entropy – Soft-labels            –            –
Teacher stack 5 minibatches
    Cross-entropy – Soft-labels          96.8        13,700
    Cross-entropy – True labels          96.8         9,100

These results show that ultimately, all of these approaches have very similar generalization performance: all of them are within 0.5% of the baseline. There is, however, a considerable speedup to reach that performance, in particular when using a bigger teacher stack size. The baseline reached its performance in 39,000 updates/backward passes, while the approaches using the larger teacher stack reached theirs in only 9,100 and 13,700 steps, for true labels and soft-labels, respectively. With this approach applied to a larger dataset, it is anticipated that the scaling would be more evident.
Throughout this work, it was shown that a communication channel carrying training instructions between a teacher and a student network can allow for accelerated training while reaching the same level of generalization performance. Indeed, the distance the teacher network computes between its own predictions on a given sample and the student's can be used as a proxy of sample difficulty and help the student train. Creating minibatches out of harder examples was shown to significantly accelerate convergence speed. Through that process, the number of backward passes (or parameter updates) can be efficiently reduced by the teacher instructing the student which data to train on. There were further signs that increasing the pool of data from which the teacher can select difficult samples further increased the acceleration.

The approach presented implies that the student has to make a prediction on all datapoints the teacher wishes to consider. The focus was mostly on the fact that, even so, there is a speedup because of the gains in backward passes. Other approaches, in which the student's predictive ability could be predicted by the teacher, would allow relaxing that assumption. Such an approach could prove itself useful in other settings, such as when new data is unlabelled but a trained model is available. It could then be used to train an additional smaller model without the cost of acquiring labels for those new training samples. Another possible extension of this work would be to consider this approach with a bigger network of nodes. Much like the work detailed in Chapter 2, using predictions from different compute nodes that are experts on their own sets of data may prove a simpler and profitable way to transfer knowledge of their data partitions.

Chapter 4
Sharing Internal Representation Through Language
Language is the key to humans exchanging communications, both diversely and imperfectly, with each other. This is the opposite of the communication protocol of training algorithms used in the machine learning community, such as synchronous SGD, where it is mandatory that the information communicated be precise. In the latter, the communication requires high bandwidth, due to the high number of values and their corresponding high level of precision. Humans, however, can communicate how they perceive their highly complex surroundings with a discrete language through a low bandwidth channel. The language we use is specially crafted to allow us to communicate and exchange with our peers. It is therefore of interest to study how a similar language could be useful for deep learning models.

Contrary to the work previously described in Chapters 2 and 3, rather than directly exchanging model outputs or training samples, a language is purposefully created between models, where it serves as a way to communicate internal representations. To study how effective such a language is, two models are set up, shown variants of the same input, and made to try to better understand the underlying original input. These two models are able to communicate, with different levels of complexity, using a language. The language created is low bandwidth, discrete, and trained to have high mutual information with the partial observation of the broadcasting model. It will be demonstrated that allowing agents to exchange information makes it easier for them to perform a classification task based on their own internal representations.

4.1. Introduction
Training algorithms used for training large neural networks on massive datasets and tasks, such as distributed synchronous gradient descent (SGD), do employ some language to communicate. Indeed, such an approach first requires that all models directly communicate their noisy approximation of the gradients to a central system. Following that, they receive another message containing the aggregated gradients to update their local parameters.

Such a language that shares gradients, although containing very useful information, has the disadvantage of requiring lots of bandwidth, simply given the sheer size of the models. Unlike the gradients, which are represented continuously by decimals (up to some precision), the language we use daily as humans consists of selecting discrete words. Furthermore, this discrete language that we use both in writing and speech has very small bandwidth requirements when compared to parameter gradient tensors. Given that we can successfully exchange excessively rich and complex concepts through this discrete language with our peers, developing such a language between machine learning models is of interest.

Although the use of a continuous language with less bandwidth than gradients could very well convey more information than a discrete one, focusing on a discrete language allows for a more interesting analysis, in particular with the analogy to human language. Moving forward, not without the implementation difficulties that arise from working with discrete sequences, the focus is on the use and development of a discrete language.

To evaluate how successfully a model helps another by sending a message, this work uses partial observations derived from a shared input. The objective is to help the other model/agent better understand the underlying full observation behind its own partial observation. Different levels of communication are explored throughout the experiments, in particular one-way broadcasting and two-way broadcasting, in addition to allowing for some feedback from the receiver of the message. It will be illustrated how the latter level of complexity can be rewritten as a different objective tying both models' objectives together. To discretize the outputs, the Gumbel-Softmax distribution is used to sample the messages, where over time the temperature used is annealed to ensure the samples become discrete. The message generation is trained through unsupervised learning, and it is shown that having communication does improve performance on a classification task using the previously learned features.

4.2. Method
4.2.1.1. Partial observation setting
Let us assume a world with current state X and two agents, A and B. One important characteristic of the setup is that X is never fully observable by A nor B; it can rather be understood as what an oracle would see if it could observe the current state of the world. In addition, both A and B witness X at the same time, but from different angles, which makes them see it differently. The partial observations of the world by agents A and B are therefore denoted X_A and X_B, respectively. An analogy can be drawn with how different individuals have different perspectives on the shared world they live in. The partial observation setup used throughout this work was previously described in [4].

The goal of this setting is to consider how, as humans, we can quite easily communicate about our environment, even though we don't necessarily see it the same way as those we communicate with. Even though we don't see things exactly the same way, we are able to help each other better understand the underlying world we live in.

4.2.1.2. Dataset
To translate the partial observation setting into a machine learning task, a noisy mask can be applied to the training data for each agent, generating two different observations of the same original sample. When created, this mask is random, and when applied to an image, it has the effect of inverting the pixels selected by the mask. Doing so ensures that the agents have different partial observations of the same original state of the world; at the very least, the probability of generating two identical masks given a considerable input size is extremely low. Throughout the experiments, the noise level was kept fixed, and to create the mask, a Bernoulli distribution was sampled for each pixel with probability matching the noise level. The mask was sampled before training for each agent and kept fixed, i.e. each agent applies its mask to all the inputs, and it does not change during training.

The dataset used in this set of experiments was the MNIST dataset [26]. It consists of 60,000 images, from which 50,000 are used for training and 10,000 as a validation set, with another 10,000 given as a test set. Handwritten digits from 0 to 9 represent the 10 possible classes of the dataset. The images are of size 28x28, but are used as a single row of 784 pixels due to the model architecture. Some samples along with the resulting partial observations can be seen in Figure 4.1.

Figure 4.1. (Left) Original MNIST samples; (Middle & Right) the same samples modified by the noisy masks. Each pixel has a probability equal to the noise level of being inverted.
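A minimal NumPy sketch of the mask generation and application follows; the noise_level value is a placeholder, as the exact level used is not recoverable here.

    import numpy as np

    rng = np.random.default_rng(0)
    noise_level = 0.1        # placeholder for the experiments' noise level

    def make_mask(num_pixels=784):
        """Sample a fixed Bernoulli mask; True means the pixel is inverted."""
        return rng.random(num_pixels) < noise_level

    def apply_mask(image, mask):
        """Invert the masked pixels of an image given as a row of 784
        values in [0, 1]."""
        out = image.copy()
        out[mask] = 1.0 - out[mask]
        return out

    mask_A, mask_B = make_mask(), make_mask()   # one fixed mask per agent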
4.2.2.1. Discretization
One key component of human communication is the discretization of internal representations into the discrete language used by so many of us. It was therefore important to take this into account and build a discrete language.

The language explored allows models to communicate bits, namely 0's and 1's, or rather, sequences of them. Using bits can be understood as using a language with only a very limited vocabulary size, therefore restricting the number of possible messages. However, allowing these sequences to be of considerable length allows for larger capacity: e.g. for a sequence of only 16 bits, the number of possible messages is 2^16, which equates to 65,536 possibilities.

Generating discrete sequences is known to carry its own set of problems when mixed into deep learning training procedures, because of the backpropagation algorithm. For the gradient descent procedure to work, and in particular for the chain rule to allow gradients to flow down to the model parameters, all functions must be continuous. However, making discrete decisions, such as sampling from a softmax distribution or taking the argmax of that same distribution, stops any gradient from flowing back through that operation. In order to leverage gradient optimization, it was necessary to explore alternatives.

To allow training the message generation end-to-end with the main objective, the Gumbel-Softmax [22] layer was selected. Employing the Gumbel-Softmax distribution to sample the messages allows the backpropagation to flow through the sampling process, while ensuring, in the limit, that the generated samples are discrete. Any regular training objective usually applied to continuous functions can therefore be used to optimize the messages.
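As an illustration of this sampling process, here is a minimal Gumbel-Softmax sketch in PyTorch; it is a sketch of the mechanism rather than the thesis' exact implementation (PyTorch also ships an equivalent torch.nn.functional.gumbel_softmax).

    import torch
    import torch.nn.functional as F

    def gumbel_softmax_sample(logits, tau):
        """Draw a differentiable, approximately one-hot sample from the
        categorical distribution defined by logits (last dim = vocabulary)."""
        u = torch.rand_like(logits)
        gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
        return F.softmax((logits + gumbel) / tau, dim=-1)

    # One head per message position: logits of shape (batch, n_bits, 2).
    logits = torch.randn(4, 16, 2)
    soft_msg = gumbel_softmax_sample(logits, tau=5.0)   # smooth, early in training
    hard_msg = gumbel_softmax_sample(logits, tau=0.1)   # near one-hot, after annealing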
4.2.2.2. Vocabulary and message generation

In order to produce the message as an output, different approaches could have been used. At the time of this writing, the approach used focuses on generating the message all at once rather than character by character. Given the decision to use bits as the vocabulary, generating the full message at once is manageable. Indeed, the highest layer of the message generator can be seen as having n heads, where n is the length of the message, and each head has two output units, one for the 0 and the other for the 1. Each output unit gives a score for that token of the vocabulary, and the Gumbel-Softmax function is then applied to these scores to sample a message. In this simplified setup, since only 0's and 1's can be used as characters, the size of the output layer, in number of units, remains quite reasonable. As for the rest of the message generator model, fully connected hidden layers with rectified linear units were used.

Another possible approach is to use a Recurrent Neural Network for this model. In the current state of this work, some successful initial testing was done to ensure the feasibility of an RNN in combination with the Gumbel-Softmax distribution. However, due in particular to the training time, and given the small size of the vocabulary, there is not much to gain from moving to this type of model. For future work with a larger vocabulary size, it will be necessary to employ recurrent connections in the message generator network. An RNN could also prove useful for allowing messages of different lengths: it could be used as a way for the receiving model to handle variable-length messages from other models, or even to generate variable-length messages for an array of models.
Figure 4.2.
Vector representation of a message of length 4 with a vocabulary size of 2, with the temperature of the Gumbel-Softmax set high (left) and low (right). For illustrative purposes, in this figure the logits and the underlying sample from the Gumbel distribution are the same for the two temperatures.

Given the use of the Gumbel-Softmax layer as the final output layer, the message generated depends on the temperature used to smooth or harden the samples. See Figure 4.2 for the vector representation of a possible message sent from one model to another. The left corresponds to early in training, with a high temperature, which results in smoother samples and, where applicable, allows a greater gradient to flow through. On the right, for the same input, as the temperature is annealed throughout training, the same logits generate a more discrete message. The vector representation of the message can be seen as a softened one-hot representation. Where applicable, training is done with the softened version of the message, while testing is done with the one-hot version. In particular, throughout training, the reconstruction of the images using the hard version of the message was successfully used as a way of verifying appropriate training.

Algorithm 4 details the steps to generate both the partial observations and the messages in the previously described setup with two agents, A and B.

4.2.2.3. Training objective
Algorithm 4 details the steps for generating both the partial observations and the messages in the previously described setup with two agents, A and B.

Algorithm 4 Partial observation and message generation pseudo-code
    Generate masks m_A and m_B based on the noise hyper-parameter
    Initialize message generator networks f_{θ_A} and f_{θ_B}
    while Training do
        Sample a minibatch of data from D_train
        X_A ← invert pixels in the minibatch based on m_A
        X_B ← invert pixels in the minibatch based on m_B
        s_A ← f_{θ_A}(X_A)    ▷ generate message from agent A
        s_B ← f_{θ_B}(X_B)    ▷ generate message from agent B
        if Communication then
            A sends s_A to B
            B sends s_B to A
        end if
    end while

4.2.2.3. Training objective

One question that arises is how to train, or optimize, this language. An intuitive answer is that the messages sent should convey as much information as possible about the partial observation of the world each agent experiences. Thankfully, these concepts tie into probability and information theory quite nicely by considering X_A, the partial observation of agent A, and S_A, the message it sends to agent B, as two random variables. Maximizing the mutual information between these two random variables, I(X_A; S_A), ensures the generated message is a good substitute for the partial observation. If well trained, the message will be a low bandwidth yet informative representation of an agent's partial observation. This objective serves as the base objective for developing a language in the absence of any other task.

Consider the random variables X_A and S_A with their marginal distributions p_{X_A}(x_A) and p_{S_A}(s_A), as well as their joint distribution p_{X_A,S_A}(x_A, s_A). The mutual information between the partial observation X_A and the message S_A is defined as

    $I(X_A; S_A) = \mathbb{E}_{X_A, S_A}\left[\log \frac{p(x_A, s_A)}{p(x_A)\, p(s_A)}\right]$    (4.2.1)

To maximize this quantity, the joint distribution p(x_A, s_A) must therefore be made to differ greatly from the product of the two marginals p(x_A) and p(s_A). In other words, for samples x_A and s_A, the probability associated with the joint observation needs to be high while the product of the corresponding marginal probabilities must be small. This amounts to ensuring X_A and S_A are not independent.

To do so, although approaches such as MINE [3] could be useful, the traditional GAN objective can be leveraged quite elegantly. In particular, the role of the discriminator in a GAN framework can be viewed as distinguishing between two distributions, i.e., pushing the true samples and the generated ones far apart. Under that framework, the generator is used to do the opposite and bring these two distributions closer, mainly by controlling the distribution of the generated samples. In our case, the two distributions we wish to make distinguishable are the joint p(x_A, s_A) and the product of the two marginals, p(x_A) and p(s_A). Making them far apart increases the mutual information. Similarly to what was done in [7], it is proposed to use the discriminator objective of a vanilla GAN to achieve this. Additionally, the generator is trained to further separate the two distributions rather than bring them closer. Doing so, the mutual information between X_A and S_A can be maximized, where [7] minimized it, because p(s_A) and p(s_A | x_A) are controlled by a generator model.

In addition, contrary to the GAN framework where the generator and the discriminator compete, our objectives are to generate S_A and to make it incorporate information from X_A. This includes the generator in the optimization and makes it a max-max problem as opposed to a min-max problem.
The former, based on previous experience, is much easier and more stable to train.

Implementation-wise, similarly to the procedure proposed in the MINE framework [3], a pair of x_A and its corresponding s_A (drawn from the joint distribution) is considered a true sample in the GAN framework. In addition, a new x'_A is used to generate its corresponding s'_A, but the previous x_A is paired with s'_A to form a fake sample. The fake samples represent two samples from the two marginal distributions, while the true samples come from the joint distribution. Both (x_A, s_A) and (x_A, s'_A) are then fed to the discriminator, as the true samples and the fake samples respectively.
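The following sketch makes this pairing and the resulting discriminator loss concrete, using the binary cross-entropy form of the vanilla-GAN objective. The discriminator architecture, dimensions, and function names are assumptions for the illustration, not the thesis implementation.

    import torch
    import torch.nn as nn

    # Hypothetical discriminator scoring (observation, message) pairs:
    # inputs are a flattened 784-pixel observation and a 16-bit soft message.
    disc = nn.Sequential(
        nn.Linear(784 + 16 * 2, 256), nn.ReLU(),
        nn.Linear(256, 1),  # logit: joint vs. product of marginals
    )
    bce = nn.BCEWithLogitsLoss()

    def info_loss(x_a, s_a, s_a_prime):
        """Vanilla-GAN discriminator loss separating the joint p(x_A, s_A)
        from the product of marginals p(x_A) p(s_A).

        (x_a, s_a)       -- true pair, sampled from the joint distribution
        (x_a, s_a_prime) -- fake pair, s_a_prime generated from a different x'_A
        """
        true_pair = torch.cat([x_a, s_a.flatten(1)], dim=1)
        fake_pair = torch.cat([x_a, s_a_prime.flatten(1)], dim=1)
        ones = torch.ones(x_a.size(0), 1)
        zeros = torch.zeros(x_a.size(0), 1)
        # Unlike a standard GAN, the message generator is not adversarial:
        # it minimizes this same loss (a max-max problem), so gradients also
        # flow through s_a and s_a_prime into the generator.
        return bce(disc(true_pair), ones) + bce(disc(fake_pair), zeros)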
4.2.2.4. Communicating the message with another model

One of the purposes of developing an informative message is to communicate it to another agent or model. Ideally, the message will be informative enough that it allows the agent on the receiving end to better understand its own observation.

In the two-agent setup, one agent is known as the teacher, agent A, while the other is the student, agent B. The idea here is to develop an objective that ties both agents into a global objective. Without any communication between them, a baseline can be defined where each agent only has the training objective of making its message generation informative, as described in section 4.2.2.3.

In order to add a communication objective, consider the desire for A's message S_A to add little or no information beyond X_A, its partial observation, i.e. for S_A to take all of its information from X_A. In addition, A should hope that its message conveys a lot of information regarding B's partial observation X_B. In information-theoretic terms, the former quantity is the conditional entropy of S_A given X_A, or H(S_A | X_A), while the latter is the mutual information between X_B and S_A, or I(S_A; X_B).

In other words, the message from A should be relatable for B while still carrying a lot of information about A's own observation. Putting these two concepts together, a global objective to maximize, tying both agents together, can be written as

    $I(S_A; X_B) - H(S_A \mid X_A)$    (4.2.2)

Interestingly, equation 4.2.2 can be decomposed and rewritten to recover the original training objective defined in section 4.2.2.3 along with an additional term. Indeed, by expanding the mutual information term, it can be rewritten as

    $H(S_A) - H(S_A \mid X_B) - H(S_A \mid X_A) = I(S_A; X_A) - H(S_A \mid X_B)$    (4.2.3)

The conditional entropy term can in turn be rewritten as

    $H(S_A \mid X_B) = -\sum_{s_A \in \mathcal{S}_A,\, x_B \in \mathcal{X}_B} p(s_A, x_B) \log p(s_A \mid x_B) = \mathbb{E}_{S_A, X_B}\left[-\log p(s_A \mid x_B)\right]$    (4.2.4)

Reassembling all the components, we get

    $I(S_A; X_A) + \mathbb{E}_{S_A, X_B}\left[\log p(s_A \mid x_B)\right] = R_{\text{info}} - L_{\text{likelihood}}$    (4.2.5)

The first term R_info in equation 4.2.5 corresponds to the training objective originally described in section 4.2.2.3, while the second term L_likelihood is a likelihood objective on the receiving end of the messages. In other words, agent B tries to predict s_A given X_B, which corresponds to a cross-entropy loss. For the implementation, a hyper-parameter was added to control the importance of the likelihood loss in the global objective; a sketch of the combined loss is given at the end of this section.

The objective on the receiving end is analogous to how we build a model of the people with whom we communicate. Given our own partial observation, we increase the likelihood of a message under our language model based on what others say about their partial observation. Of course, for this analogy to make sense, both partial observations need to be related in some way, which is the case in the partial observation setup described earlier.
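As a small sketch of how the two terms of equation 4.2.5 can be combined into a single loss to minimize; the default weight of 0.01 is the value reported for the one-way configuration in Figure 4.3, while the function and argument names are assumptions.

    def global_loss(r_info_loss, likelihood_loss, weight=0.01):
        """Combine the two terms of equation 4.2.5 into one loss to minimize.

        r_info_loss     -- loss whose minimization maximizes I(S_A; X_A),
                           e.g. the discriminator-based loss sketched above
        likelihood_loss -- the cross-entropy term -E[log p(s_A | x_B)]
        weight          -- hyper-parameter controlling the likelihood term
        """
        return r_info_loss + weight * likelihood_loss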
4.2.2.5. Communication levels

To test the effectiveness of this new communication between models, it is proposed to study different levels of communication. Each is evaluated by having the agents train a linear classifier on top of the last hidden layer of features, just before the message output layer. The training of this linear classifier is kept separate from the main training objective and only applies to the classification task. The levels of communication refer to the different gradient sources for the objective defined in equation 4.2.5.

For simplicity, consider the point of view of the student, agent B. As previously mentioned, the baseline consists of the student training its message generation only by maximizing the mutual information. The first level of communication consists of having the student reproduce the message the teacher generated. This way, similarly to the work done in Chapters 2 and 3, a cross-entropy supervised learning objective is added for the received message.

More formally, let $s_A = [s_A^1, s_A^2, \ldots, s_A^T]$ be the message generated by the teacher from its observation X_A through f_{θ_A}(X_A), and $s_B = [s_B^1, s_B^2, \ldots, s_B^T]$ the message generated by the student from its own observation X_B through f_{θ_B}(X_B). For this initial level of communication, the gradients from the likelihood term for the student can be computed as

    $\nabla_{\theta_B} L_{\text{likelihood}} = -\nabla_{\theta_B}\left[\sum_{t=1}^{T} s_A^t \log s_B^t\right]$    (4.2.6)

Under this formulation, $s_A^t$ and $s_B^t$ are the real values associated with the probability of activating bit t in the message sampled from the Gumbel distribution. For more information, refer to section 4.2.2.1 and Figure 4.2.

By adding the objective $\mathbb{E}_{S_A, X_B}\left[\log p(s_A \mid x_B)\right]$, since the expectation is over both S_A and X_B, some gradients should flow back to the model that generated S_A, which in this case is the teacher, agent A, with f_{θ_A}. However, at this first level of communication, no gradients are propagated back to the broadcaster of the message. This configuration is referred to as the student only receiving the message.

The following level allows the gradients from the loss computed in equation 4.2.6 to flow back to the language generation model on the teacher side, f_{θ_A}. This means the teacher receives, in addition to the objective of maximizing I(S_A; X_A), a loss from the likelihood term of the student. This communication affects the student only through the teacher modifying its message generation based on the feedback; it does not directly affect the student's message generation model f_{θ_B}. This can be seen as the student sending feedback to the teacher based on its understanding, and similarly the gradients for the teacher can be computed as

    $\nabla_{\theta_A} L_{\text{likelihood}} = -\nabla_{\theta_A}\left[\sum_{t=1}^{T} s_A^t \log s_B^t\right]$    (4.2.7)
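The difference between these first two levels comes down to whether gradients are blocked on the teacher's side of the channel. A minimal sketch follows, with assumed shapes and an assumed numerical-stability constant.

    import torch

    def likelihood_loss(s_a, s_b, feedback_to_teacher=False):
        """Cross-entropy term of equation 4.2.6 for the first two levels.

        s_a, s_b -- soft messages of shape (batch, T, 2) produced by the
                    teacher's and the student's Gumbel-Softmax layers.
        With feedback_to_teacher=False (student receiving only), the teacher's
        message is detached so no gradient reaches f_{theta_A}; with
        feedback_to_teacher=True the same loss also back-propagates into the
        teacher's generator, as in equation 4.2.7.
        """
        target = s_a if feedback_to_teacher else s_a.detach()
        eps = 1e-8  # assumed constant to avoid log(0)
        return -(target * torch.log(s_b + eps)).sum(dim=(1, 2)).mean()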
Finally, the last level of communication considered further allows the teacher to reproduce the student's language, in addition to allowing gradients to flow back to the student, i.e. both agents communicate and copy each other. In other words, both student and teacher send their messages while receiving the message from the other. Each of them then computes a cross-entropy loss using the received message as the target and its own generated message as the prediction. This cross-entropy loss is then sent back to the sender of the message, and its gradients flow back into the sender's message generation network.

Considering the total loss computed by the teacher and the student, the gradients from the likelihood terms for the student then become

    $\nabla_{\theta_B} L_{\text{likelihood}} = -\nabla_{\theta_B}\left[\sum_{t=1}^{T} \left(s_A^t \log s_B^t + s_B^t \log s_A^t\right)\right]$    (4.2.8)

To compare all of these approaches, the classification performance of the student is monitored throughout training. Conceptually, the different levels of communication represent how a model performs when it is trained alone, when it is on the receiving end of a communication, and when it is communicating and receiving feedback from that communication.

4.3. Results

The different levels of communication explained in the previous section can be seen as increasing the complexity of the communication protocol between two agents. It is therefore interesting to compare how increasing the complexity of that communication affects performance, in particular against a scenario where the two agents do not communicate.

Figure 4.3 shows performance on the validation set during training for the different levels of communication, with a message length of 32 characters and a vocabulary size of 2 (exchanging bits). The teacher and the student network were trained at the same time, and each kept the same noisy mask across all configurations, at a fixed noise level. Furthermore, the test set performance of the model parameters selected at the best validation set accuracy is shown in Table 4.1. The temperature is set to a high value at the start of training, annealed by a constant factor every fixed number of training steps, and never brought below a floor value; the sketch below illustrates the shape of such a schedule.
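All constants in this sketch are illustrative stand-ins rather than the experimental settings.

    def temperature(step, tau_start=5.0, factor=0.9, every=1000, tau_min=0.5):
        """Multiplicative temperature annealing with a floor value."""
        return max(tau_start * factor ** (step // every), tau_min)

    # temperature(0) -> 5.0; the value then decays every 1000 steps toward 0.5.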
Figure 4.3. Validation accuracy (%) per training step of the student network during training, for the different levels of communication. Results show the average of five runs at a fixed noise level. The one-way communication configuration (orange) has a weight of 0.01 on the cross-entropy term, while the two-way communications (green and red) both have 0.005.
Configuration                   Accuracy
Baseline                        64.4
Student-receiving-only          65.7
Student-feedback-to-teacher     69.4
Both-feedback                   —
Table 4.1. Test set accuracy (%) of the different communication-level approaches. Performance reported is the average of five runs, taken at the highest validation set accuracy for the student.

From the results, it can be noticed that in all scenarios performance peaks early in training and is then followed by an important decrease. A hypothesis put forward is simply the discretization of the messages: as training progresses, the temperature of the Gumbel-Softmax layer is gradually annealed and, over time, there is less and less information in the hardened message. A possible approach to mitigate this would be to increase the length of the message, allowing the extra bits to carry the lost information. This is, however, left for future work.

The baseline approach, without any communication and with only the mutual information training objective, achieved 64.4% accuracy with a linear classifier on the test set. Allowing the student to add a cross-entropy term on the received teacher message slightly increased the student's performance, to 65.7% accuracy. More interestingly, by allowing the cross-entropy error term of the student to flow back into the teacher's message generation model (green in Figure 4.3), there is a considerable jump in test accuracy, to 69.4%. This can be seen as the teacher customizing its language to reduce the predictive error of the student. There is, however, a greater decrease in performance on the validation set than with the other approaches. Finally, the scenario where both models try to predict the other model's message and allow the gradients to flow back shows even greater accuracy: it allowed the student to reach the highest test set accuracy of the four configurations.

Having a fixed temperature was also tried but did not provide any benefit apart from slightly stabilizing the accuracy over training, still without beating the peak performance of the annealed temperature setting. Given that the temperature seems to be causing some issues, future work could focus on removing the Gumbel temperature by moving towards other approaches for dealing with discrete sequences, such as REINFORCE [36].

To mitigate the early peaking in performance, delaying the training of the linear classifier until the messages were more discrete was also tried. However, it did not provide any gain and was not able to reach the same level of accuracy. It did seem to stabilize the performance slightly, but still could not match the peak performance of the no-delay approach.

It is interesting to point out that, in addition to the work in Chapters 2 and 3, explicitly creating a language trained with its own objective shows promising results. The increased complexity of the communication allowed for greater performance in a shared partial observation setting. Indeed, the best performing approach was to allow both models to customize their language based on its predictability by the other model.

These results show that making the joint distribution of the message and the broadcaster's partial observation differ greatly from the product of their marginal distributions can generate meaningful features. In addition, adding a cross-entropy term on the receiver's end of the message was demonstrated to further increase performance, in particular when the gradients of that loss are allowed to flow back to the sender.

Some future work using an RNN to handle the language between the two models is currently being done to further discretize the language used. An interesting way to expand this work is to consider the same training objective but with a large number of nodes.
Previous work with a large number of compute nodes seemed to reach low performance when using communication. However, having this explicit training objective tailored to the message, rather than automatically derived from the outputs as in the previous chapters, might help break the limitations previously observed.

Conclusion
In this thesis, a study of various communication channels between deep learning models was presented. This was accomplished by viewing the communication channels through different objectives, ranging from low bandwidth outputs of a model to a language crafted and optimized with the sole purpose of being communicated between two models.

It was shown that low bandwidth messages exchanged between the compute nodes of a fully decentralized computing network can speed up parts of training. It was pointed out that this approach, given further research on pooling the knowledge of the compute nodes, could give rise to an internet of computing. In addition, under a simplified teacher and student setup, a teacher could accelerate the student's training by customizing its training procedure. Indeed, when selecting which samples to train on by considering both the student's and its own predictions, providing the hardest samples to the student proved to increase the convergence speed of the generalization error. Finally, using two randomly initialized models that share a partial observation of an input, it was shown that a purposefully crafted discrete language can lead to better generalization performance of the learned features. Although the language crafted here is relatively restrictive, these promising results can pave the way for a more flexible language, which is key to extending this proposal to a large number of communicating models.

Bibliography

[1] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations, 2018.
[2] Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pages 2814–2822, 2013.
[3] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Devon Hjelm, and Aaron Courville. Mutual information neural estimation. In International Conference on Machine Learning, pages 530–539, 2018.
[4] Yoshua Bengio. Evolving culture versus local minima. In Growing Adaptive Machines, pages 109–138. Springer, 2014.
[5] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[6] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
[7] Philemon Brakel and Yoshua Bengio. Learning independent features with adversarial nets for non-linear ICA. arXiv preprint arXiv:1710.05050, 2017.
[8] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[9] Augustin Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.
[10] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[11] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
[13] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[14] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[15] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[18] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[20] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[21] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[22] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations. OpenReview.net, 2017.
[23] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058, 2014.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[26] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[27] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pages 396–404, 1990.
[28] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[29] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[30] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[31] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
[32] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
[33] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[34] Richard S Sutton and Andrew G Barto. Introduction to Reinforcement Learning, volume 135. MIT Press, Cambridge, 1998.
[35] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. In SSW, page 125, 2016.
[36] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[37] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.
[38] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2):301–320, 2005.