Lecture Notes: Introduction to Machine Learning for the Sciences

Titus Neupert, Mark H. Fischer, Eliska Greplova, Kenny Choo, and Michael Denner

Department of Physics, University of Zurich, 8057 Zurich, Switzerland
Kavli Institute of Nanoscience, Delft University of Technology, 2600 GA Delft, The Netherlands
Institute for Theoretical Physics, ETH Zurich, CH-8093, Switzerland
February 10, 2021
These lecture notes, including exercises, are available online at ml-lectures.org. If you notice mistakes or typos, please report them to [email protected].
Machine learning and artificial neural networks are everywhere and change our daily life more profoundly than we might be aware of. However, these concepts are not a particularly recent invention. Their foundational principles emerged already in the 1940s. The perceptron, the predecessor of the artificial neuron, the basic unit of many neural networks to date, was invented by Frank Rosenblatt in 1958, and even cast into a hardware realization by IBM.

It then took half a century for these ideas to become technologically relevant. Now, artificial intelligence based on neural-network algorithms has become an integral part of data processing with widespread applications. The reason for its tremendous success is twofold. First, the availability of big and structured data caters to machine learning applications. Second, the realization that deep (feedforward) networks (made from many "layers" of artificial neurons) with many variational parameters are tremendously more powerful than few-layer ones was a big leap, the "deep learning revolution".

Machine learning refers to algorithms that infer information from data in an implicit way. If the algorithms are inspired by the functionality of neural activity in the brain, the term cognitive or neural computing is used. Artificial neural networks refer to a specific, albeit most broadly used, ansatz for machine learning. Another field that concerns itself with inferring information from data is statistics. In that sense, both machine learning and statistics have the same goal. However, the way this goal is achieved is markedly different: while statistics uses insights from mathematics to extract information, machine learning aims at optimizing a variational function using available data through learning.

The mathematical foundations of machine learning with neural networks are poorly understood: we do not know why deep learning works. Nevertheless, there are some exact results for special cases.
For instance, certain classes of neural networks are a complete basis of smooth functions, that is, when equipped with enough variational parameters, they can approximate any smooth high-dimensional function with arbitrary precision. Other variational functions with this property that we commonly use are Taylor or Fourier series (with the coefficients as "variational" parameters). We can think of neural networks as a class of variational functions, for which the parameters can be efficiently optimized with respect to a desired objective.

As an example, this objective can be the classification of handwritten digits from '0' to '9'. The input to the neural network would be an image of the number, encoded in a vector of grayscale values. The output is a probability distribution saying how likely it is that the image shows a '0', '1', '2', and so on. The variational parameters of the network are adjusted until it accomplishes that task well. This is a classical example of supervised learning. To perform the network optimization, we need data consisting of input data (the pixel images) and labels (the integer number shown on the respective image).

Our hope is that the optimized network also recognizes handwritten digits it has not seen during the learning. This property of a network is called generalization. It stands in opposition to a tendency called overfitting, which means that the network has learned specificities of the data set it was presented with, rather than the abstract features necessary to identify the respective digit.

Figure 1: The MNIST dataset. Examples of the digits from the handwritten MNIST dataset.

An illustrative example of overfitting is fitting a polynomial of degree 9 to 10 data points, which will always be a perfect fit. Does this mean that this polynomial best characterizes the behavior of the measured system? Of course not! Fighting overfitting and creating algorithms that generalize well are key challenges in machine learning. We will study several approaches to achieve this goal.

Handwritten digit recognition has become one of the standard benchmark problems in the field. Why so? The reason is simple: there exists a very good and freely available data set for it, the MNIST database, see Fig. 1. This curious fact highlights an important aspect of machine learning: it is all about data. The most efficient way to improve machine learning results is to provide more and better data. Thus, one should keep in mind that despite the widespread applications, machine learning is not the hammer for every nail. It is most beneficial if large and balanced data sets (meaning, roughly, that the algorithm can learn all aspects of the problem equally) are available in a machine-readable way.

This lecture is an introduction specifically targeting the use of machine learning in different domains of science. In scientific research, we see a vastly increasing number of applications of machine learning, mirroring the developments in industrial technology. With that, machine learning presents itself as a universal new tool for the exact sciences, standing side-by-side with methods such as calculus, traditional statistics, and numerical simulations. This poses the question where in the scientific workflow, summarized in Fig.
2, these novel methods are best employed.

Once a specific task has been identified, applying machine learning to the sciences does, furthermore, hold its very specific challenges: (i) scientific data often has a very particular structure, such as the nearly perfect periodicity in an image of a crystal; (ii) typically, we have specific knowledge about correlations in the data which should be reflected in a machine learning analysis; (iii) we want to understand why a particular algorithm works, seeking a fundamental insight into mechanisms and laws of nature; (iv) in the sciences we are used to algorithms and laws that provide deterministic answers, while machine learning is intrinsically probabilistic: there is no absolute certainty. Nevertheless, quantitative precision is paramount in many areas of science and thus a critical benchmark for machine learning methods.

A note on the concept of a model
In both machine learning and the sciences, models play a crucial role. However, it is important to recognize the difference in meaning: In the natural sciences, a model is a conceptual representation of a phenomenon. A scientific model does not try to represent the whole world, but only a small part of it. A model is thus a simplification of the phenomenon and can be both a theoretical construct, for example the ideal gas model or the Bohr model of the atom, or an experimental simplification, such as a small version of an airplane in a wind channel.

(The MNIST database is available at http://yann.lecun.com/exdb/mnist.)

Figure 2: Simplified scientific workflow. From observations, via abstraction to building and testing hypotheses or laws, to finally making predictions. [The figure shows the cycle: observation (design experiment, acquire data, analyse data), model building (abstract knowledge, formulate questions, develop theories), and prediction ("if my models are correct, then ...").]

In machine learning, on the other hand, we most often use a complicated variational function, for example a neural network, to try to approximate a statistical model. But what is a model in statistics? Colloquially speaking, a statistical model comprises a set of statistical assumptions which allow us to calculate the probability P(y|x) of any event x. The statistical model does not correspond to the true distribution of all possible events, it simply approximates the distribution. Scientific and statistical models thus share an important property: neither claims to be a representation of reality.

This lecture is an introduction to basic machine learning algorithms for scientists and students of the sciences. We will cover

• the most fundamental machine learning algorithms,
• the terminology of the field, succinctly explained,
• the principles of supervised and unsupervised learning and why it is so successful,
• various architectures of artificial neural networks and the problems they are suitable for,
• how we find out what the machine learning algorithm uses to solve a problem.

The field of machine learning is full of lingo which to the uninitiated obscures what is at the core of the methods. Being a field in constant transformation, new terminology is being introduced at a fast pace. Our aim is to cut through the slang with mathematically precise and concise formulations in order to demystify machine learning concepts for someone with an understanding of calculus and linear algebra.

As mentioned above, data is at the core of most machine learning approaches discussed in this lecture.
With raw data in many cases very complex and extremely high-dimensional, it is often crucial to first understand the data better and reduce their dimensionality. Simple algorithms that can be used before turning to the often heavy machinery of neural networks will be discussed in the next section, Sec. 2.
[Figure 3 content: starting from a probability distribution P, the lecture splits into discriminative/supervised methods, which learn a conditional probability P(y|x), either without neural networks (Sec. 3: regression, linear classifier) or with neural networks (Sec. 4: simple neural network, convolutional neural network (geometry), recursive neural network (sequence)); generative/unsupervised methods, which learn a probability P(x), P(x,y), with neural networks (Sec. 5: restricted Boltzmann machine, (variational) autoencoder, recursive neural network II, generative adversarial networks); and additional topics: dimensional reduction (Sec. 2), interpretability and vulnerability (Sec. 6), and reinforcement learning (Sec. 7).]
Figure 3: Overview over the plan of the lecture from the perspective of learning probability distributions.

The machine learning algorithms we will focus on most can generally be divided into two classes of algorithms, namely discriminative and generative algorithms, as illustrated in Fig. 3. Examples of discriminative tasks include classification problems, such as the aforementioned digit classification or the classification into solid, liquid and gas phases given some experimental observables. Similarly, regression, in other words estimating relationships between variables, is a discriminative problem. More specifically, we try to approximate the conditional probability distribution P(y|x) of some variable y (the label) given some input data x. As data is provided in the form of input and target data for most of these tasks, these algorithms usually employ supervised learning. Discriminative algorithms are most straightforwardly applicable in the sciences and we will discuss them in Secs. 3 and 4.

Generative algorithms, on the other hand, model a probability distribution P(x). These approaches are, once trained, in principle more powerful, since we can also learn the joint probability distribution P(x, y) of both the data x and the labels y and infer the conditional probability of y. Still, the more targeted approach of discriminative learning is better suited for many problems. However, generative algorithms are useful in the natural sciences, as we can sample from a known probability distribution, for example for image denoising, or when trying to find new compounds/molecules resembling known ones with given properties. These algorithms are discussed in Sec. 5.

The promise of artificial intelligence may trigger unreasonable expectations in the sciences. After all, scientific knowledge generation is one of the most complex intellectual processes.
Computer algorithms are certainly far from achieving anything on that level of complexity and will in the near future not formulate new laws of nature independently. Nevertheless, researchers study how machine learning can help with individual segments of the scientific workflow (Fig. 2). While the type of abstraction needed to formulate Newton's laws of classical mechanics seems incredibly complex, neural networks are very good at implicit knowledge representation. To understand precisely how they achieve certain tasks, however, is not an easy undertaking. We will discuss this question of interpretability in Sec. 6.

A third class of algorithms, which does not neatly fit the framework of approximating a statistical model, and thus the distinction into discriminative and generative algorithms, is known as reinforcement learning. Instead of approximating a statistical model, reinforcement learning tries to optimize strategies (actions) for achieving a given task. Reinforcement learning has gained a lot of attention with Google's AlphaGo Zero, a computer program that beat the best Go players in the world. As an example for an application in the sciences, reinforcement learning can be used to decide on what experimental configuration to perform next. While the whole topic is beyond the scope of this lecture, we will give an introduction to the basic concepts of reinforcement learning in Sec. 7.

A final note on the practice of learning. While the machine learning machinery is extremely powerful, using an appropriate architecture and the right training details, captured in what are called hyperparameters, is crucial for its successful application. Though there are attempts to learn a suitable model and all hyperparameters as part of the overall learning process, this is not a simple task and requires immense computational resources. A large part of the machine learning success is thus connected to the experience of the scientist using the appropriate algorithms. We thus strongly
We thus stronglyencourage solving the accompanying exercises carefully and taking advantage of theexercise classes.
While it may seem that implementing ML tasks is computationally challenging, almost any ML task one might be interested in can actually be done with relatively few lines of code, simply by relying on external libraries or mathematical computing systems such as Mathematica or Matlab. At the moment, most of the external libraries are written for the Python programming language. Here are some useful Python libraries:

1. TensorFlow. Developed by Google, TensorFlow is one of the most popular and flexible libraries for machine learning with complex models, with full GPU support.

2. PyTorch. Developed by Facebook, PyTorch is the biggest rival library to TensorFlow, with pretty much the same functionality.

3. Scikit-Learn. Whereas TensorFlow and PyTorch are catered to deep-learning practitioners, Scikit-Learn provides much of the traditional machine learning tools, including linear regression and PCA.

4. Pandas. Modern machine learning is largely reliant on big datasets. This library provides many helpful tools to handle these large datasets.
This course is aimed at students of the (natural) sciences with a basic mathematics education and some experience in programming. In particular, we assume the following prerequisites:

• Basic knowledge of calculus and linear algebra.
• Rudimentary knowledge of statistics and probability theory (advantageous).
• Basic knowledge of a programming language. For the teaching assignments, you are free to choose your preferred one. The solutions will typically be distributed in Python in the form of Jupyter notebooks.

Please don't hesitate to ask questions if any notions are unclear.
For further reading, we recommend the following books:

• ML without neural networks: The Elements of Statistical Learning, T. Hastie, R. Tibshirani, and J. Friedman (Springer)
• ML with neural networks: Neural Networks and Deep Learning, M. Nielsen (http://neuralnetworksanddeeplearning.com)
• Deep Learning Theory: Deep Learning, I. Goodfellow, Y. Bengio, and A. Courville (MIT Press)
• Reinforcement Learning: Reinforcement Learning, R. S. Sutton and A. G. Barto (MIT Press)
Machine Learning without Neural Networks
Deep learning with neural networks is very much at the forefront of the recent renaissance in machine learning. However, machine learning is not synonymous with neural networks. There is a wealth of machine learning approaches without neural networks, and the boundary between them and conventional statistical analysis is not always sharp.

It is a common misconception that neural-network techniques would always outperform these approaches. In fact, in some cases, a simple linear method can achieve faster and better results. Even when we might eventually want to use a deep network, simpler approaches may help us to understand the problem we are facing and the specificity of the data so as to better formulate our machine learning strategy. In this chapter, we shall explore machine learning approaches without the use of neural networks. This will further allow us to introduce basic concepts and the general form of a machine learning workflow.
At the heart of any machine learning task is data. In order to choose the most appropriate machine learning strategy, it is essential that we understand the data we are working with. However, very often, we are presented with a dataset containing many types of information, called features of the data. Such a dataset is also described as being high-dimensional. Techniques that extract information from such a dataset are broadly summarised as high-dimensional inference. For instance, we could be interested in predicting the progress of diabetes in patients given features such as age, sex, body mass index, or average blood pressure. Extremely high-dimensional data can occur in biology, where we might want to compare gene expression patterns in cells. Given a multitude of features, it is neither easy to visualise the data nor to pick out the most relevant information. This is where principal component analysis (PCA) can be helpful.

Very briefly, PCA is a systematic way to find out which feature or combination of features varies the most across the data samples. We can think of PCA as approximating the data with a high-dimensional ellipsoid, where the principal axes of this ellipsoid correspond to the principal components. A feature which is almost constant across the samples, in other words has a very short principal axis, might not be very useful. PCA then has two main applications: (1) It helps to visualise the data in a low-dimensional space, and (2) it can reduce the dimensionality of the input data to an amount that a more complex algorithm can handle.
PCA algorithm
Given a dataset of m samples with n data features, we can arrange our data in the form of an m by n matrix X, where the element x_{ij} corresponds to the value of the j-th data feature of the i-th sample. We will also use the feature vector x_i for all the n features of one sample, i = 1, ..., m. The vector x_i can take values in the feature space, for example x_i \in \mathbb{R}^n. Going back to our diabetes example, we might have 10 data features. Furthermore, if we are given information regarding 100 patients, our data matrix X would have 100 rows and 10 columns.

The procedure to perform PCA can then be described as follows:

Algorithm 1: Principal Component Analysis
1. Center the data by subtracting from each column the mean of that column,

    x_i \mapsto x_i - \frac{1}{m} \sum_{i=1}^{m} x_i .   (2.1)

This ensures that the mean of each data feature is zero.

2. Form the n by n (unnormalised) covariance matrix

    C = X^T X = \sum_{i=1}^{m} x_i x_i^T .   (2.2)

3. Diagonalize the matrix to the form C = X^T X = W \Lambda W^T, where the columns of W are the normalised eigenvectors, or principal components, and \Lambda is a diagonal matrix containing the eigenvalues. It will be helpful to arrange the eigenvalues from largest to smallest.

4. Pick the l largest eigenvalues \lambda_1, ..., \lambda_l, l \leq n, and their corresponding eigenvectors v_1, ..., v_l. Construct the n by l matrix \widetilde{W} = [v_1 \cdots v_l].

5. Dimensional reduction: Transform the data matrix as

    \widetilde{X} = X \widetilde{W} .   (2.3)

The transformed data matrix \widetilde{X} now has dimensions m by l.

We have thus reduced the dimensionality of the data from n to l. Notice that there are actually two things happening: First, of course, we now only have l data features. But second, the l data features are new features and not simply a selection of the original data. Rather, they are a linear combination of them. Using our diabetes example again, one of the "new" data features could be the sum of the average blood pressure and the body mass index. These new features are automatically extracted by the algorithm.

But why did we have to go through such an elaborate procedure to do this instead of simply removing a couple of features? The reason is that we want to maximize the variance in our data. We will give a precise definition of the variance later in the chapter, but briefly, the variance just means the spread of the data. Using PCA, we have essentially obtained l "new" features which maximise the spread of the data when plotted as a function of this feature. We illustrate this with an example.

Example
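In code, the five steps of Algorithm 1 can be sketched with a few lines of NumPy before we turn to a concrete dataset (a minimal illustration; the function name `pca` and the random toy data are our own choices):

```python
import numpy as np

def pca(X, l):
    """Reduce an (m, n) data matrix X to its l leading principal components."""
    # Step 1: center the data by subtracting the column means, Eq. (2.1).
    Xc = X - X.mean(axis=0)
    # Step 2: form the (unnormalised) covariance matrix C = X^T X, Eq. (2.2).
    C = Xc.T @ Xc
    # Step 3: diagonalize; eigh returns eigenvalues of the symmetric C
    # in ascending order, so we reverse to get largest-to-smallest.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1][:l]
    # Step 4: the n by l matrix of the l leading principal components.
    W = eigvecs[:, order]
    # Step 5: dimensional reduction, Eq. (2.3).
    return Xc @ W

# Toy usage: 100 samples with 10 features, reduced to 2 new features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X_reduced = pca(X, 2)
print(X_reduced.shape)  # (100, 2)
```

By construction, the first column of the result has the largest spread: its variance is the largest eigenvalue of C (up to the 1/m normalisation).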
Let us consider a very simple dataset with just 2 data features. We take data from the Iris dataset (https://archive.ics.uci.edu/ml/datasets/iris), a well-known dataset on 3 different species of flowers. We are given information about the petal length and petal width. Since there are just 2 features,
it is easy to visualise the data.

Figure 4: PCA on Iris Dataset. [Scatter plots of petal width versus petal length (in cm) before and after the PCA transformation, for the species setosa, versicolor, and virginica.]

In Fig. 4, we show how the data is transformed under the PCA algorithm. Notice that there is no dimensional reduction here, since l = n. In this case, the PCA algorithm amounts simply to a rotation of the original data. However, it still produces 2 new features which are orthogonal linear combinations of the original features, petal length and petal width, i.e.,

    w_1 = 0.\ldots \times \text{Petal Length} + 0.\ldots \times \text{Petal Width} ,
    w_2 = -0.\ldots \times \text{Petal Length} + 0.\ldots \times \text{Petal Width} .   (2.4)

We see very clearly that the first new feature w_1 has a much larger variance than the second feature w_2. In fact, if we are interested in distinguishing the three different species of flowers, as in a classification task, it is almost sufficient to use only the data feature with the largest variance, w_1. This is the essence of (PCA) dimensional reduction.

Finally, it is important to note that it is not always true that the feature with the largest variance is the most relevant for the task, and it is possible to construct counterexamples where the feature with the least variance contains all the useful information. However, PCA is often a good guiding principle and can yield interesting insights into the data. Most importantly, it is also interpretable, i.e., not only does it separate the data, but we also learn which linear combination of features can achieve this. We will see that for many neural network algorithms, in contrast, a lack of interpretability is a big issue.

PCA performs a linear transformation on the data. However, there are cases where such a transformation is unable to produce any meaningful result. Consider for instance the fictitious dataset with 2 classes and 2 data features as shown on the left of Fig. 5. We see by naked eye that it should be possible to separate this data well, for instance by the distance of the datapoint from the origin, but it is also clear that a linear function cannot be used to compute it.
In this case, it can be helpful to consider a non-linear extension of PCA, known as kernel PCA.

The basic idea of this method is to apply to the data x \in \mathbb{R}^n a chosen non-linear vector-valued transformation function \Phi(x) with

    \Phi : \mathbb{R}^n \to \mathbb{R}^N ,   (2.5)
which is a map from the original n-dimensional space (corresponding to the n original data features) to an N-dimensional feature space.

Figure 5: Kernel PCA versus PCA. [Left: the original two-class dataset, which PCA cannot separate by a linear function; right: the same data after kernel PCA, where the classes separate.]

Kernel PCA then simply involves performing the standard PCA on the transformed data \Phi(x). Here, we will assume that the transformed data is centered, i.e.,

    \sum_i \Phi(x_i) = 0   (2.6)

to have simpler formulas.

In practice, when N is large, it is not efficient or even possible to explicitly perform the transformation \Phi. Instead, we can make use of a method known as the kernel trick. Recall that in standard PCA, the primary aim is to find the eigenvectors and eigenvalues of the covariance matrix C. In the case of kernel PCA, this matrix becomes

    C = \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T ,   (2.7)

with the eigenvalue equation

    \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T v_j = \lambda_j v_j .   (2.8)

By writing the eigenvectors v_j as a linear combination of the transformed data features,

    v_j = \sum_{i=1}^{m} a_{ji} \Phi(x_i) ,   (2.9)

we see that finding the eigenvectors is equivalent to finding the coefficients a_{ji}. On substituting this form back into Eq. (2.8), we find

    \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T \left[ \sum_{l=1}^{m} a_{jl} \Phi(x_l) \right] = \lambda_j \left[ \sum_{l=1}^{m} a_{jl} \Phi(x_l) \right] .   (2.10)

By multiplying both sides of the equation by \Phi(x_k)^T we arrive at

    \sum_{i=1}^{m} \Phi(x_k)^T \Phi(x_i) \Phi(x_i)^T \left[ \sum_{l=1}^{m} a_{jl} \Phi(x_l) \right] = \lambda_j \Phi(x_k)^T \left[ \sum_{l=1}^{m} a_{jl} \Phi(x_l) \right] ,

    \sum_{i=1}^{m} \left[ \Phi(x_k)^T \Phi(x_i) \right] \sum_{l=1}^{m} a_{jl} \left[ \Phi(x_i)^T \Phi(x_l) \right] = \lambda_j \sum_{l=1}^{m} a_{jl} \left[ \Phi(x_k)^T \Phi(x_l) \right] ,

    \sum_{i=1}^{m} K(x_k, x_i) \sum_{l=1}^{m} a_{jl} K(x_i, x_l) = \lambda_j \sum_{l=1}^{m} a_{jl} K(x_k, x_l) ,   (2.11)

where K(x, y) = \Phi(x)^T \Phi(y) is known as the kernel. Thus we see that if we directly specify the kernels, we can avoid explicitly performing the transformation \Phi. In matrix form, we find the eigenvalue equation K^2 a_j = \lambda_j K a_j, which simplifies to

    K a_j = \lambda_j a_j .
  (2.12)

Note that this simplification requires \lambda_j \neq 0, which will be the case for the relevant principal components. (If \lambda_j = 0, the corresponding eigenvectors would be irrelevant components to be discarded.) After solving the above equation and obtaining the coefficients a_{jl}, the kernel PCA transformation is then simply given by the overlap with the eigenvectors v_j, i.e.,

    x \to \Phi(x) \to y_j = v_j^T \Phi(x) = \sum_{i=1}^{m} a_{ji} \Phi(x_i)^T \Phi(x) = \sum_{i=1}^{m} a_{ji} K(x_i, x) ,   (2.13)

where once again the explicit \Phi transformation is avoided.

A common choice for the kernel is the radial basis function (RBF) kernel, defined by

    K_{\mathrm{RBF}}(x, y) = \exp\left( -\gamma \| x - y \|^2 \right) ,   (2.14)

where \gamma is a tunable parameter. Using the RBF kernel, we compare the result of kernel PCA with that of standard PCA, as shown on the right of Fig. 5. It is clear that kernel PCA leads to a meaningful separation of the data, while standard PCA completely fails.

We studied (kernel) PCA as an example for a method that reduces the dimensionality of a dataset and makes features apparent by which data points can be efficiently distinguished. Often, it is desirable to more clearly cluster similar data points and visualize this clustering in a low (two- or three-) dimensional space. We focus our attention on a relatively recent algorithm (from 2008) that has proven very performant. It goes by the name t-distributed stochastic neighborhood embedding (t-SNE).
Figure 6: PCA vs. t-SNE. Application of both methods on 5000 samples from the MNIST handwritten digit dataset. We see that perfect clustering cannot be achieved with either method, but t-SNE delivers the much better result.

The basic idea is to think of the data (images, for instance) as objects x_i in a very high-dimensional space and to characterize their relation by the Euclidean distance \|x_i - x_j\| between them. These pairwise distances are mapped to a probability distribution p_{ij}. The same is done for the distances \|y_i - y_j\| of the images of the data points y_i in the target low-dimensional space. Their probability distribution is denoted q_{ij}. The mapping is optimized by changing the locations y_i so as to minimize the distance between the two probability distributions. Let us substantiate these words with formulas.

The probability distribution in the space of data points is given as the symmetrized version (joint probability distribution)

    p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n} ,   (2.15)

    p_{j|i} = \frac{\exp\left( -\| x_i - x_j \|^2 / 2\sigma_i^2 \right)}{\sum_{k \neq i} \exp\left( -\| x_i - x_k \|^2 / 2\sigma_i^2 \right)} ,   (2.16)

where the choice of variances \sigma_i will be explained momentarily. Distances are thus turned into a Gaussian distribution. Note that p_{j|i} \neq p_{i|j}, while p_{ji} = p_{ij}.

The probability distribution in the target space is chosen to be a Student t-distribution,

    q_{ij} = \frac{(1 + \| y_i - y_j \|^2)^{-1}}{\sum_{k \neq l} (1 + \| y_k - y_l \|^2)^{-1}} .   (2.17)

This choice has several advantages: (i) it is symmetric upon interchanging i and j, (ii) it is numerically more efficient to evaluate because there are no exponentials, (iii) it has 'fatter' tails, which helps to produce more meaningful maps in the lower-dimensional space.

Let us now discuss the choice of \sigma_i. Intuitively, in dense regions of the dataset, a smaller value of \sigma_i is usually more appropriate than in sparser regions, in order to resolve the distances better. Any particular value of \sigma_i induces a probability distribution P_i over all the other data points.
This distribution has an entropy (here we use the Shannon entropy; in general, it is a measure for the "uncertainty" represented by the distribution)

    H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i} .   (2.18)

The value of H(P_i) increases as \sigma_i increases, i.e., the more uncertainty is added to the distances. The algorithm searches for the \sigma_i that result in a P_i with fixed perplexity

    \mathrm{Perp}(P_i) = 2^{H(P_i)} .   (2.19)

The target value of the perplexity is chosen a priori and is the main parameter that controls the outcome of the t-SNE algorithm. It can be interpreted as a smooth measure for the effective number of neighbors. Typical values for the perplexity are between 5 and 50.

Finally, we have to introduce a measure for the similarity between the two probability distributions p_{ij} and q_{ij}. This defines a so-called loss function. Here, we choose the Kullback-Leibler divergence

    L(\{y_i\}) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} ,   (2.20)

which we will frequently encounter during this lecture. The symmetrized p_{ij} ensures that \sum_j p_{ij} > 1/(2n), so that each data point makes a significant contribution to the cost function. The minimization of L(\{y_i\}) with respect to the positions y_i can be achieved with a variety of methods. In the simplest case, it can be gradient descent, which we will discuss in more detail in a later chapter. As the name suggests, it follows the direction of largest gradient of the cost function to find the minimum. To this end, it is useful that these gradients can be calculated in a simple form,

    \frac{\partial L}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \| y_i - y_j \|^2)^{-1} .   (2.21)

By now, t-SNE is implemented as standard in many packages. They involve some extra tweaks that force the points y_i to stay close together at the initial steps of the optimization and create a lot of empty space. This facilitates the moving of larger clusters in early stages of the optimization, until a globally good arrangement is found.
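In practice one rarely implements the above by hand. A typical invocation of the Scikit-Learn implementation might look as follows (a sketch; the random toy data and the parameter values are our choices):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy data: 200 samples in 50 dimensions, two displaced Gaussian clouds.
X = np.vstack([rng.normal(0.0, 1.0, (100, 50)),
               rng.normal(4.0, 1.0, (100, 50))])

# perplexity is the main parameter controlling the outcome, cf. Eq. (2.19);
# init="pca" initializes the two-dimensional map with the PCA projection.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Y = tsne.fit_transform(X)
print(Y.shape)  # (200, 2)
```

Note that fixing `random_state` only makes a single run repeatable; as discussed below, the stochastic optimization means different seeds can give rather different-looking maps.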
If the dataset is very high-dimensional, it is advisable to perform an initial dimensionality reduction (to somewhere between 10 and 100 dimensions, for instance) with PCA before running t-SNE.

While t-SNE is a very powerful clustering technique, it has its limitations. (i) The target dimension should be 2 or 3; for much larger dimensions the ansatz for q_{ij} is not suitable. (ii) If the dataset is intrinsically high-dimensional (so that also the PCA pre-processing fails), t-SNE may not be a suitable technique. (iii) Due to the stochastic nature of the optimization, the results are not reproducible: the outcome may look very different when the algorithm is initialized with slightly different initial values for the y_i.

k-means

All of PCA, kernel-PCA and t-SNE may or may not deliver a visualization of the dataset in which clusters emerge. They all leave it to the observer to identify these
possible clusters. In this section, we want to introduce an algorithm that actually clusters data, i.e., it will sort any data point into one of k clusters. Here, the desired number of clusters k is fixed a priori by us. This is a weakness, but it may be compensated by running the algorithm with different values of k and assessing where the performance is best.

We will exemplify a simple clustering algorithm that goes by the name k-means. The algorithm is iterative. The key idea is that data points are assigned to clusters such that the squared distances between the data points belonging to one cluster and the centroid of the cluster are minimized. The centroid is defined as the arithmetic mean of all data points in a cluster.

This description already suggests that we will again minimize a loss function (or maximize an expectation function, which differs from the loss function only in the overall sign). Suppose we are given an assignment of data points x_i to clusters j = 1, ..., k that is represented by

w_{ij} = 1 if x_i is in cluster j, and w_{ij} = 0 if x_i is not in cluster j.   (2.22)

Then the loss function is given by

L({x_i}, {w_{ij}}) = Σ_{i=1}^m Σ_{j=1}^k w_{ij} ||x_i − μ_j||²,   (2.23)

where

μ_j = Σ_i w_{ij} x_i / Σ_i w_{ij}.   (2.24)

Naturally, we want to minimize the loss function with respect to the assignment w_{ij}. However, a change in this assignment also changes μ_j. For this reason, it is natural to divide each update step into two parts. The first part updates the w_{ij} according to

w_{ij} = 1 if j = argmin_l ||x_i − μ_l||², and w_{ij} = 0 else.   (2.25)

That means we attach each data point to the nearest cluster centroid. The second part is a recalculation of the centroids according to Eq. (2.24).

The algorithm is initialized by choosing at random k distinct data points as initial positions of the centroids.
Then one repeats the above two-part steps until convergence, i.e., until the w_{ij} do not change anymore.

In this algorithm we use the Euclidean distance measure ||·||. It is advisable to standardize the data such that each feature has mean zero and standard deviation one when averaged over all data points. Otherwise (if some features are overall numerically smaller than others), the differences in the various features may be weighted very differently by the algorithm.

Furthermore, the results depend on the initialization. One should re-run the algorithm with a few different initializations to avoid running into bad local minima.

Applications of k-means are manifold: in economy they include market segmentation; in science, any classification problem such as that of phases of matter, document clustering, or image compression (color reduction). In general it helps to build intuition about the data at hand.
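The two alternating steps can be sketched in a few lines of numpy; this is a minimal illustration on synthetic data of our own making, not an optimized implementation.

```python
import numpy as np

def k_means(x, k, n_iter=100, seed=0):
    """Plain k-means: alternate the assignment step, Eq. (2.25), and
    the centroid update, Eq. (2.24), until the assignment is stable."""
    rng = np.random.default_rng(seed)
    # initialize centroids with k distinct, randomly chosen data points
    mu = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.full(len(x), -1)
    for it in range(n_iter):
        # assignment step: nearest centroid for every point
        d2 = ((x[:, None, :] - mu[None, :, :])**2).sum(-1)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                    # converged: w_ij no longer change
        labels = new_labels
        # update step: each centroid becomes the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                mu[j] = x[labels == j].mean(axis=0)
    return labels, mu

# two well-separated 2D blobs are recovered as the two clusters
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-5, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, mu = k_means(x, k=2)
assert len(set(labels[:50])) == 1 and len(set(labels[50:])) == 1
```

Note that, as discussed above, a different `seed` (initialization) can in general lead to a different local minimum; for well-separated blobs, the correct split is found regardless.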
Supervised learning is the term for a machine learning task where we are given a dataset consisting of input-output pairs {(x_1, y_1), ..., (x_m, y_m)} and our task is to "learn" a function which maps input to output, f: x → y. Here we chose a vector-valued input x and only a single real number as output y, but in principle the output can be vector-valued as well. The output data that we have is called the ground truth and is sometimes also referred to as the "labels" of the input. In contrast to supervised learning, all algorithms presented so far were unsupervised, because they relied on input data alone, without any ground truth or output data.

Within the scope of supervised learning, there are two main types of tasks: Classification and
Regression. In a classification task, our output y is a discrete variable corresponding to a classification category. An example of such a task would be to distinguish stars with a planetary system (exoplanets) from those without, given time series of images of such objects. In a regression problem, on the other hand, the output y is a continuous number or vector, for example the quantity of rainfall predicted from meteorological data of the previous days.

In this section, we first familiarize ourselves with linear methods for achieving these tasks. Neural networks, in contrast, are a non-linear method for supervised classification and regression tasks.

Linear regression, as the name suggests, simply means to fit a linear model to a dataset. Consider a dataset consisting of input-output pairs {(x_1, y_1), ..., (x_m, y_m)}, where the inputs are n-component vectors xᵀ = (x_1, x_2, ..., x_n) and the output y is a real-valued number. The linear model then takes the form

f(x|β) = β_0 + Σ_{j=1}^n β_j x_j   (3.1)

or, in matrix notation,

f(x|β) = x̃ᵀ β,   (3.2)

where x̃ᵀ = (1, x_1, x_2, ..., x_n) and β = (β_0, ..., β_n)ᵀ are (n+1)-dimensional vectors.

The aim then is to find parameters β̂ such that f(x|β̂) is a good estimator for the output value y. In order to quantify what it means to be a "good" estimator, one needs to specify a real-valued loss function L(β), sometimes also called a cost function. The good set of parameters β̂ is then the minimizer of this loss function,

β̂ = argmin_β L(β).   (3.3)

There are many inequivalent choices for this loss function. For our purpose, we choose the loss function to be the residual sum of squares (RSS), defined as

RSS(β) = Σ_{i=1}^m [y_i − f(x_i|β)]² = Σ_{i=1}^m ( y_i − β_0 − Σ_{j=1}^n β_j x_{ij} )²,   (3.4)

where the sum runs over the m samples of the dataset.
This loss function is sometimes also called the L2 loss and can be seen as a measure of the distance between the output values from the dataset, y_i, and the corresponding predictions f(x_i|β).

It is convenient to define the m × (n+1) data matrix X̃, each row of which corresponds to an input sample x̃_iᵀ, as well as the output vector Yᵀ = (y_1, ..., y_m). With this notation, Eq. (3.4) can be expressed succinctly as a matrix equation,

RSS(β) = (Y − X̃β)ᵀ (Y − X̃β).   (3.5)

The minimum of RSS(β) can be found by considering the partial derivatives with respect to β, i.e.,

∂RSS/∂β = −2 X̃ᵀ(Y − X̃β),   ∂²RSS/∂β∂βᵀ = 2 X̃ᵀX̃.   (3.6)

At the minimum, ∂RSS/∂β = 0 and ∂²RSS/∂β∂βᵀ is positive-definite. Assuming X̃ᵀX̃ is full-rank and hence invertible, we can obtain the solution β̂ from

∂RSS/∂β |_{β=β̂} = 0  ⟹  X̃ᵀX̃ β̂ = X̃ᵀY  ⟹  β̂ = (X̃ᵀX̃)⁻¹ X̃ᵀ Y.   (3.7)

If X̃ᵀX̃ is not full-rank, which can happen if certain data features are perfectly correlated (e.g., x_1 = 2x_2), a solution to X̃ᵀX̃ β = X̃ᵀY can still be found, but it is not unique. Note that the RSS is not the only possible choice for the loss function, and a different choice would lead to a different solution.

What we have done so far is uni-variate linear regression, that is, linear regression where the output y is a single real-valued number. The generalization to the multi-variate case, where the output is a p-component vector yᵀ = (y_1, ..., y_p), is straightforward. The model takes the form

f_k(x|β) = β_{0k} + Σ_{j=1}^n β_{jk} x_j,   (3.8)

where the parameters β_{jk} now carry an additional index k = 1, ..., p. Considering the parameters β as an (n+1) × p matrix, we can show that the solution takes the same form as before [Eq. (3.7)], with Y an m × p output matrix.
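The closed-form solution of Eq. (3.7) is two lines of numpy; the synthetic data below are our own illustrative choice.

```python
import numpy as np

# synthetic data (our own choice): y = 1 + 2*x1 - 3*x2 + small noise
rng = np.random.default_rng(0)
m, n = 200, 2
X = rng.normal(size=(m, n))
beta_true = np.array([1.0, 2.0, -3.0])        # (beta_0, beta_1, beta_2)
Xt = np.hstack([np.ones((m, 1)), X])          # data matrix with leading 1s
y = Xt @ beta_true + rng.normal(0, 0.1, size=m)

# normal equations of Eq. (3.7): solve (X~^T X~) beta = X~^T Y
# (np.linalg.solve is numerically preferable to forming the inverse)
beta_hat = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)

assert np.allclose(beta_hat, beta_true, atol=0.1)
```

With little noise and many samples, β̂ recovers the true parameters to high accuracy, as the error analysis in the next section makes quantitative.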
Let us stop here and evaluate the quality of the method we have just introduced. At the same time, we will take the opportunity to introduce some notions from statistics which will be useful throughout the book.

Up to now, we have made no assumptions about the dataset we are given; we simply stated that it consists of input-output pairs {(x_1, y_1), ..., (x_m, y_m)}. In order to assess the accuracy of our model in a mathematically clean way, we have to make an additional assumption. The output data y_1, ..., y_m may arise from some measurement or observation. Then, each of these values will generically be subject to errors ε_1, ..., ε_m, by which the values deviate from the "true" output without errors,

y_i = y_i^true + ε_i,   i = 1, ..., m.   (3.9)

We assume that this error ε is a Gaussian random variable with mean μ = 0 and variance σ², which we denote by ε ∼ N(0, σ²). Assuming that the linear model of Eq. (3.1) is a suitable model for our dataset, we are interested in the following question: how does our solution β̂, as given in Eq. (3.7), compare with the true solution β^true, which obeys

y_i = β_0^true + Σ_{j=1}^n β_j^true x_{ij} + ε_i,   i = 1, ..., m ?   (3.10)

In order to make statistical statements about this question, we have to imagine that we can fix the inputs x_i of our dataset and repeatedly draw samples for the outputs y_i. Each time we will obtain a different value for y_i following Eq. (3.10); in other words, the ε_i are uncorrelated random numbers. This allows us to formalize the notion of an expectation value E(···) as the average over an infinite number of draws. For each draw, we obtain a new dataset, which differs from the others by the values of the outputs y_i. With each of these datasets, we obtain a different solution β̂ as given by Eq. (3.7). The expectation value E(β̂) is then simply the average value obtained across an infinite number of datasets.
The deviation of this average value from the "true" value given perfect data is called the bias of the model,

Bias(β̂) = E(β̂) − β^true.   (3.11)

For the linear regression we study here, the bias is exactly zero, because

E(β̂) = E( (X̃ᵀX̃)⁻¹ X̃ᵀ (Y^true + ε) ) = β^true,   (3.12)

where the last equality follows because E(ε) = 0 and (X̃ᵀX̃)⁻¹ X̃ᵀ Y^true = β^true. Equation (3.12) implies that linear regression is unbiased. Note that other machine learning algorithms will in general be biased.

What about the standard error or uncertainty of our solution? This information is contained in the covariance matrix

Var(β̂) = E( [β̂ − E(β̂)][β̂ − E(β̂)]ᵀ ).   (3.13)
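The unbiasedness of Eq. (3.12) can be checked numerically by fixing the inputs and redrawing the noise many times; the Monte-Carlo setup below is our own illustration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 50, 1.0
X = rng.normal(size=(m, 2))
Xt = np.hstack([np.ones((m, 1)), X])      # inputs are fixed across draws
beta_true = np.array([0.5, -1.0, 2.0])
y_true = Xt @ beta_true

# repeatedly draw epsilon ~ N(0, sigma^2) and refit, following Eq. (3.10)
draws = np.empty((2000, 3))
for t in range(2000):
    y = y_true + rng.normal(0, sigma, size=m)
    draws[t] = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)

# the average over draws approximates E(beta_hat) = beta_true, Eq. (3.12)
assert np.allclose(draws.mean(axis=0), beta_true, atol=0.05)
```

The empirical covariance of `draws` likewise approximates the covariance matrix of Eq. (3.13), which is computed analytically below.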
The covariance matrix can be computed for the case of linear regression using the solution Eq. (3.7), the expectation value Eq. (3.12), and the assumption of Eq. (3.10) that Y = Y^true + ε, yielding

Var(β̂) = E( [β̂ − E(β̂)][β̂ − E(β̂)]ᵀ )
        = E( [(X̃ᵀX̃)⁻¹X̃ᵀ ε][(X̃ᵀX̃)⁻¹X̃ᵀ ε]ᵀ )
        = E( (X̃ᵀX̃)⁻¹ X̃ᵀ εεᵀ X̃ (X̃ᵀX̃)⁻¹ ).   (3.14)

This expression can be simplified by using the fact that our input matrix X̃ is independent of the draw, such that

Var(β̂) = (X̃ᵀX̃)⁻¹ X̃ᵀ E(εεᵀ) X̃ (X̃ᵀX̃)⁻¹
        = (X̃ᵀX̃)⁻¹ X̃ᵀ σ²I X̃ (X̃ᵀX̃)⁻¹
        = σ² (X̃ᵀX̃)⁻¹.   (3.15)

Here, the second line follows from the fact that different samples are uncorrelated, which implies E(εεᵀ) = σ²I, with I the identity matrix. The diagonal elements of σ²(X̃ᵀX̃)⁻¹,

Var(β̂_i) = [σ² (X̃ᵀX̃)⁻¹]_{ii},   (3.16)

then correspond to the variances of the individual parameters β̂_i. The standard error or uncertainty of β̂_i is √Var(β̂_i).

There is one more missing element: we have not explained how to obtain the variance σ² of the outputs y. In an actual machine learning task, we would not know anything about the true relation, Eq. (3.10), governing our dataset; the only information we have access to is a single dataset. Therefore, we have to estimate the variance from the samples of our dataset as

σ̂² = 1/(m − n − 1) Σ_{i=1}^m ( y_i − f(x_i|β̂) )²,   (3.17)

where y_i are the output values from our dataset and f(x_i|β̂) is the corresponding prediction. Note that we normalized the above expression by (m − n −
1) instead of m to ensure that E(σ̂²) = σ², meaning that σ̂² is an unbiased estimator of σ².

Our ultimate goal is not simply to fit a model to the dataset; we want our model to generalize to inputs not within the dataset. To assess how well this is achieved, let us consider the prediction ãᵀβ̂ on a new random input-output pair (a, y). The output is again subject to an error, y = ãᵀβ^true + ε. In order to compute the expected error of the prediction, we compute the expectation value of the loss function over this previously unseen data. This is also known as the test or generalization error. For the squared-distance loss function, this is the mean squared error (MSE),

MSE(β̂) = E( (y − ãᵀβ̂)² )
        = E( (ε + ãᵀβ^true − ãᵀβ̂)² )
        = E(ε²) + [ãᵀ(β^true − E(β̂))]² + E( [ãᵀ(β̂ − E(β̂))]² )
        = σ² + [ãᵀ Bias(β̂)]² + ãᵀ Var(β̂) ã.   (3.18)
Figure 7: Schematic depiction of the bias-variance tradeoff: the generalization error as a function of model complexity, decomposed into bias and variance contributions.
There are three terms in this expression. The first term is the irreducible or intrinsic uncertainty of the dataset. The second term represents the bias, and the third term the variance of the model. For RSS linear regression, the estimate is unbiased, so that

MSE(β̂) = σ² + ãᵀ Var(β̂) ã.   (3.19)

Based on the assumption that the dataset indeed derives from a linear model, as given by Eq. (3.10) with a Gaussian error, it can be shown that the RSS solution, Eq. (3.7), gives the minimum error among all unbiased linear estimators of the form Eq. (3.1). This is known as the Gauss-Markov theorem. This completes our error analysis of the method.

Although the RSS solution has the minimum error among unbiased linear estimators, the expression for the generalization error, Eq. (3.18), suggests that we can actually still reduce the error by sacrificing some bias in our estimate.

A possible way to reduce the generalization error is to drop some data features. From the n data features {x_1, ..., x_n}, we can pick a reduced set M; for example, we can choose M = {x_1, x_2, x_3} and define our new linear model as

f(x|β) = β_0 + Σ_{j∈M} β_j x_j.   (3.20)

This is equivalent to fixing some parameters to zero, i.e., β_k = 0 if x_k ∉ M. Minimizing the RSS with this constraint results in a biased estimator, but the reduction in model variance can sometimes help to reduce the overall generalization error. For a small number of features n ∼
20, one can search exhaustively for the best subset of features that minimizes the error, but beyond that the search becomes computationally unfeasible.

A common alternative is called ridge regression. In this method, we consider the same linear model given in Eq. (3.1), but with a modified loss function,

L_ridge(β) = Σ_{i=1}^m [y_i − f(x_i|β)]² + λ Σ_{j=0}^n β_j²,   (3.21)

where λ > 0 [c.f. Eq. (3.4)]. The effect of this new term is to penalize large parameters β_j and to bias the model towards smaller absolute values. The parameter λ is an example of a hyperparameter, which is kept fixed during the training. On fixing λ and minimizing the loss function, we obtain the solution

β̂_ridge = (X̃ᵀX̃ + λI)⁻¹ X̃ᵀ Y,   (3.22)

from which we can see that β̂_ridge → 0 as λ → ∞. By computing the bias and variance,

Bias(β̂_ridge) = −λ (X̃ᵀX̃ + λI)⁻¹ β^true,
Var(β̂_ridge) = σ² (X̃ᵀX̃ + λI)⁻¹ X̃ᵀX̃ (X̃ᵀX̃ + λI)⁻¹,   (3.23)

it becomes apparent that increasing λ increases the bias while reducing the variance. This is the tradeoff between bias and variance. By appropriately choosing λ, it is possible to reduce the generalization error. We will introduce in the next section a common strategy for finding the optimal value of λ.

The techniques presented here to reduce the generalization error, namely dropping features and biasing the model towards small parameters, are part of a large class of methods known as regularization. Comparing the two methods, we can see a similarity: both reduce the complexity of our model. In the former, some parameters are set to zero, while in the latter, a constraint effectively reduces the magnitude of all parameters. A less complex model has a smaller variance but a larger bias. By balancing these competing effects, generalization can be improved, as illustrated schematically in Fig. 7.

In the next chapter, we will see that these techniques are useful beyond applications to linear methods. We now illustrate the different concepts with an example: linear regression on a medical dataset. In the process, we will also familiarize ourselves with the standard machine learning workflow [see Fig. 8].

Figure 8: Machine Learning Workflow. Schematic of the dataset split into training, validation and test set: the model is trained on the training set, hyperparameters are tuned against the validation set, and the final model is evaluated on the test set.

For this example, we are given 10 data features, namely age, sex, body mass index, average blood pressure, and six blood serum measurements from 442 diabetes
patients, and our task is to train a model f(x|β) [Eq. (3.1)] to predict a quantitative measure of the disease progression after one year.

Recall that the final aim of a machine-learning task is not to obtain the smallest possible value of the loss function, such as the RSS, but to minimize the generalization error on unseen data [c.f. Eq. (3.18)]. The standard approach relies on a division of the dataset into three subsets: a training set, a validation set, and a test set. The standard workflow is summarized in Box 1.

Box 1: ML Workflow
1. Divide the dataset into training set T, validation set V and test set S. A common ratio for the split is 70:15:15.

2. Pick the hyperparameters, e.g., λ in Eq. (3.21).

3. Train the model using only the training set, in other words minimize the loss function on the training set. [This corresponds to Eq. (3.7) or (3.22) for linear regression, where X̃ only contains the training set.]

4. Evaluate the MSE (or any other chosen metric) on the validation set [c.f. Eq. (3.18)],

MSE_validation(β̂) = 1/|V| Σ_{j∈V} ( y_j − f(x_j|β̂) )².   (3.24)

This is known as the validation error.

5. Pick a different value for the hyperparameters and repeat steps 3 and 4 until the validation error is minimized.

6. Evaluate the final model on the test set,

MSE_test(β̂) = 1/|S| Σ_{j∈S} ( y_j − f(x_j|β̂) )².   (3.25)

It is important to note that the test set S was not involved in optimizing either the parameters β or the hyperparameters such as λ.

Applying this procedure to the diabetes dataset¹, we obtain the results in Fig. 9. We compare RSS linear regression with ridge regression, and indeed we see that, by appropriately choosing the regularization hyperparameter λ, the generalization error can be minimized.

Figure 9: Ridge regression on the diabetes patients dataset. Left: validation error versus λ. Right: test data versus the prediction from the trained model. If the prediction were free of any error, all the points would fall on the blue line.

As a side remark regarding ridge regression, we can see on the left of Fig. 10 that as λ increases, the magnitude of the parameters β̂_ridge, Eq. (3.22), decreases. Consider, on the other hand, a different form of regularization, which goes by the name of lasso regression, where the loss function is given by

L_lasso(β) = Σ_{i=1}^m [y_i − f(x_i|β)]² + α Σ_{j=0}^n |β_j|.   (3.26)

Despite the similarities, lasso regression has a very different behaviour, as depicted on the right of Fig. 10: as α increases, some parameters actually vanish and can be ignored completely. This corresponds to dropping certain data features entirely and can be useful if we are interested in selecting the most important features in a dataset.

Figure 10: Evolution of the model parameters. Increasing the hyperparameter λ or α leads to a reduction of the absolute values of the model parameters, here shown for the ridge (left) and lasso (right) regression for the diabetes dataset.

¹ ∼ boos/var.select/diabetes.html

In a classification problem, the aim is to categorize the inputs into one of a finite set of classes. Formulated as a supervised learning task, the dataset again consists of input-output pairs {(x_1, y_1), ..., (x_m, y_m)} with x ∈ R^n. However, unlike in regression problems, the output y is a discrete integer representing one of the classes. In a binary classification problem, in other words a problem with only two classes, it is natural to choose y ∈ {−1, 1}.

We introduced linear regression in the previous section as a method for supervised learning when the output is a real number. Here, we will see how we can use the same model for a binary classification task. If we look at the regression problem, we first note that, geometrically,

f(x|β) = β_0 + Σ_{j=1}^n β_j x_j = 0   (3.27)

defines a hyperplane perpendicular to the vector with elements β_j, j ≥ 1. If we fix the length Σ_{j=1}^n β_j² = 1, then f(x|β) measures the (signed) distance of x to the hyperplane, with a sign depending on which side of the plane the point x lies. To use this model as a classifier, we thus define

F(x|β) = sign f(x|β),   (3.28)

which yields {+1, −1}. If the two classes are (completely) linearly separable, then the goal of the classification is to find a hyperplane that separates the two classes in feature space. Specifically, we look for parameters β such that

y_i x̃_iᵀ β > M,   ∀i,   (3.29)

where M is called the margin. The optimal solution β̂ then maximizes this margin. Note that, instead of fixing the norm of the β_{j≥1} and maximizing M, it is customary to minimize ½ Σ_{j=1}^n β_j², setting M = 1 in Eq. (3.29).

In most cases, the two classes are not completely separable.
In order to still find a good classifier, we allow some of the points x_i to lie within the margin or even on the wrong side of the hyperplane. For this purpose, we rewrite the optimization constraint Eq. (3.29) as

y_i x̃_iᵀ β > (1 − ξ_i),  with ξ_i ≥ 0,  ∀i.   (3.30)

We can now define the optimization problem as finding

min_{β, {ξ_i}}  ½ Σ_{j=1}^n β_j² + C Σ_i ξ_i   (3.31)

subject to the constraint Eq. (3.30). Note that the second term, with hyperparameter C, acts like a regularizer, in particular a lasso regularizer: as we have seen in the example of the previous section, such a regularizer tries to set as many ξ_i to zero as possible.

We can solve this constrained minimization problem by introducing Lagrange multipliers α_i and μ_i and solving

min_{β, {ξ_i}}  ½ Σ_{j=1}^n β_j² + C Σ_i ξ_i − Σ_i α_i [ y_i x̃_iᵀβ − (1 − ξ_i) ] − Σ_i μ_i ξ_i,   (3.32)

which yields the conditions

β_j = Σ_i α_i y_i x_{ij},   (3.33)
0 = Σ_i α_i y_i,   (3.34)
α_i = C − μ_i,  ∀i.   (3.35)
Figure 11: Binary classification. Hyperplane separating the two classes and margin M of the linear binary classifier. The support vectors are marked with a circle around them. (Axes: length [cm] versus width [cm].)

It is numerically simpler to solve the dual problem,

min_{α_i}  ½ Σ_{i,i'} α_i α_{i'} y_i y_{i'} x_iᵀ x_{i'} − Σ_i α_i,   (3.36)

subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.¹ Using Eq. (3.33), we can re-express β_j to find

f(x|{α_i}) = Σ_i α_i y_i xᵀ x_i + β_0,   (3.40)

where the sum only runs over the points x_i that lie within the margin; all other points have α_i ≡ 0 [see Eq. (3.37)]. These points are thus called the support vectors and are denoted in Fig. 11 with a circle around them. Finally, note that we can use Eq. (3.37) again to find β_0.

The Kernel trick and support vector machines
We have seen in our discussion of PCA that most data is not linearly separable. However, we have also seen how the kernel trick can help us in such situations. In particular, we have seen how a non-linear function Φ(x), which we first apply to the data x, can help us separate data that is not linearly separable. Importantly, we never actually need to use the non-linear function Φ(x) itself, but only the kernel. Looking at the dual optimization problem, Eq. (3.36), and the resulting classifier, Eq. (3.40), we see that, as in the case of kernel PCA, only the kernel K(x, y) = Φ(x)ᵀΦ(y) enters, simplifying the problem. This non-linear extension of the binary classifier is called a support vector machine.

¹ Note that the constraints for the minimization are not equalities, but inequalities. A solution thus has to fulfil the additional Karush-Kuhn-Tucker constraints

α_i [ y_i x̃_iᵀβ − (1 − ξ_i) ] = 0,   (3.37)
μ_i ξ_i = 0,   (3.38)
y_i x̃_iᵀβ − (1 − ξ_i) > 0.   (3.39)
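For the linear case, the constrained problem of Eqs. (3.30)–(3.31) has a well-known equivalent unconstrained "hinge-loss" form, which can be minimized by plain (sub)gradient descent. The sketch below uses this reformulation rather than the dual solved in the text; the data and parameter values are our own illustrative choices.

```python
import numpy as np

def train_linear_svm(x, y, lam=0.01, lr=0.01, epochs=300):
    """Soft-margin linear classifier via (sub)gradient descent on the
    hinge-loss form of Eq. (3.31):
        lam/2 * sum_j beta_j^2 + mean_i max(0, 1 - y_i (x_i @ beta + b))."""
    beta, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (x @ beta + b)
        active = margins < 1                  # points violating the margin
        # subgradient: only margin-violating points contribute, cf. Eq. (3.33)
        g_beta = lam * beta - (y[active, None] * x[active]).sum(axis=0) / len(x)
        g_b = -y[active].sum() / len(x)
        beta -= lr * g_beta
        b -= lr * g_b
    return beta, b

# toy data: two linearly separable 2D classes with labels y in {-1, +1}
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
beta, b = train_linear_svm(x, y)
pred = np.sign(x @ beta + b)                  # F(x|beta) of Eq. (3.28)
assert np.all(pred == y)
```

For the kernelized support vector machine itself, library implementations such as scikit-learn's `sklearn.svm.SVC` solve the dual problem, Eq. (3.36), with a chosen kernel.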
In the following, we are interested in the case of p classes with p > 2. After the previous discussion, it seems natural for the output to take the integer values y = 1, ..., p. However, it turns out to be helpful to use a different, so-called one-hot encoding. In this encoding, the output y is represented by the p-dimensional unit vector in the y direction, e^(y),

y → e^(y) = (e_1^(y), ..., e_y^(y), ..., e_p^(y))ᵀ = (0, ..., 1, ..., 0)ᵀ,   (3.41)

where e_l^(y) = 1 if l = y, and zero for all other l = 1, ..., p. A main advantage of this encoding is that we are not forced to choose a potentially biasing ordering of the classes, as we would when arranging them along the ray of integers.

A linear approach to this problem then again mirrors the case of linear regression: we fit the multi-variate linear model, Eq. (3.8), to the one-hot encoded dataset {(x_1, e^(y_1)), ..., (x_m, e^(y_m))}. By minimizing the RSS, Eq. (3.4), we obtain the solution

β̂ = (X̃ᵀX̃)⁻¹ X̃ᵀ Y,   (3.42)

where Y is the m × p output matrix. The prediction given an input x is then a p-dimensional vector f(x|β̂) = x̃ᵀβ̂. On a generic input x, the components of this prediction vector are real-valued, rather than being one of the one-hot basis vectors. To obtain a class prediction F(x|β̂) ∈ {1, ..., p}, we simply take the index of the largest component of that vector, i.e.,

F(x|β̂) = argmax_k f_k(x|β̂).   (3.43)

The argmax function is non-linear and is a first example of what is referred to as an activation function.

For numerical minimization, it is better to use a smooth activation function. Such an activation function is given by the softmax function,

F_k(x|β̂) = e^{−f_k(x|β̂)} / Σ_{k'=1}^p e^{−f_{k'}(x|β̂)}.   (3.44)

Importantly, the output of the softmax function is a probability P(y = k|x), since Σ_k F_k(x|β̂) = 1.
This extended linear model is referred to as logistic regression.¹

The current linear approach based on classification of one-hot encoded data generally works poorly when there are more than two classes. We will see in the next chapter that relatively straightforward non-linear extensions of this approach can lead to much better results.

¹ Note that the softmax function for two classes is the logistic function.
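One-hot encoding, Eq. (3.41), and the softmax, Eq. (3.44), can be sketched directly in numpy (helper names are ours; the minus-sign convention follows the text):

```python
import numpy as np

def one_hot(y, p):
    """One-hot encoding of Eq. (3.41): class labels 1..p -> unit vectors."""
    e = np.zeros((len(y), p))
    e[np.arange(len(y)), np.asarray(y) - 1] = 1.0
    return e

def softmax(f):
    """Softmax of Eq. (3.44), with the minus-sign convention of the text;
    shifting f by its minimum is the usual numerical-stability trick."""
    z = np.exp(-(f - f.min(axis=-1, keepdims=True)))
    return z / z.sum(axis=-1, keepdims=True)

e = one_hot([2, 1, 3], p=3)
assert np.array_equal(e, [[0, 1, 0], [1, 0, 0], [0, 0, 1]])

F = softmax(np.array([[0.1, -2.0, 0.3]]))
assert np.isclose(F.sum(), 1.0)   # components sum to one: a probability
assert F.argmax() == 1            # smallest f_k dominates under e^{-f_k}
```

Note that with the e^{−f_k} convention the class with the smallest f_k receives the largest probability; with the more common e^{+f_k} convention it is the largest.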
Supervised Learning with Neural Networks
In the previous chapter, we covered the basics of machine learning using conventional methods such as linear regression and principal component analysis. In the present chapter, we move towards a more complex class of machine learning models: neural networks. Neural networks have been central to the recent vast success of machine learning in many practical applications.

The idea for the design of a neural network model is an analogy to how biological organisms process information. Biological brains contain neurons, electrically activated nerve cells, connected by synapses that facilitate information transfer between neurons. The machine learning equivalent of this structure, the so-called artificial neural network, or neural network in short, is a mathematical function developed with the same principles in mind. It is composed of elementary functions, the neurons, which are organized in layers that are connected to each other. To simplify the notation, a graphical representation of the neurons and the network is used, see Fig. 12. A connection in the graphical representation means that the output from one set of neurons (forming one layer) serves as the input for the next set of neurons (the next layer). This defines a sense of direction in which information is handed over from layer to layer, and thus the architecture is referred to as a feed-forward neural network.

In general, an artificial neural network is simply an example of a variational non-linear function that maps some (potentially high-dimensional) input data to a desired output. Neural networks are remarkably powerful, and it has been proven that, under some mild structural assumptions, they can approximate any smooth function arbitrarily well as the number of neurons tends to infinity. A drawback is that neural networks typically depend on a large number of parameters.
In the following, we will learn how to construct these neural networks and find optimal values for their variational parameters.

In this chapter, we are going to discuss one option for optimizing neural networks: so-called supervised learning. A machine learning process is called supervised whenever we use training data comprising input-output pairs, in other words input with a known correct answer (the label), to teach the network the required task.
Figure 12: Neural network. Graphical representation and basic architecture: the input layer x_1, ..., x_n, with information flowing towards the output (here the label "carrot").
Figure 13: The artificial neuron. Left: schematic of a single neuron and its functional form. Right: examples of the commonly used activation functions: ReLU, sigmoid function, and hyperbolic tangent.
The basic building block of a neural network is the neuron. Let us consider a single neuron which we assume to be connected to k neurons in the preceding layer, see Fig. 13, left side. The neuron corresponds to a function f: R^k → R which is a composition of a linear function q: R^k → R and a non-linear function (the so-called activation function) g: R → R. Specifically,

f(z_1, ..., z_k) = g(q(z_1, ..., z_k)),   (4.1)

where z_1, z_2, ..., z_k are the outputs of the neurons from the preceding layer to which the neuron is connected.

The linear function is parametrized as

q(z_1, ..., z_k) = Σ_{j=1}^k w_j z_j + b.   (4.2)

Here, the real numbers w_1, w_2, ..., w_k are called weights and can be thought of as the "strength" of the respective connection between a neuron in the preceding layer and this neuron. The real parameter b is known as the bias and is simply a constant offset.¹ The weights and biases are the variational parameters we will need to optimize when we train the network.

The activation function g is crucial for the neural network to be able to approximate any smooth function; without it, we would merely perform a linear transformation. For this reason, g has to be non-linear. In analogy to biological neurons, g represents the property of the neuron that it "spikes", i.e., it produces a noticeable output only when the input potential grows beyond a certain threshold value. The most common choices for activation functions, shown in Fig. 13, include:

• ReLU: ReLU stands for rectified linear unit; it is zero for all numbers smaller than zero and the identity for all positive numbers.

¹ Note that this bias is unrelated to the bias we learned about in regression.
• Sigmoid: The sigmoid function, usually taken as the logistic function, is a smoothed version of the step function.
• Hyperbolic tangent: The hyperbolic tangent function has a similar behaviour as the sigmoid, but takes both positive and negative values.
• Softmax: The softmax function is a common activation function for the last layer in a classification problem (see below).

The choice of activation function is part of the neural network architecture and is therefore not changed during training (in contrast to the variational parameters weights and bias, which are adjusted during training). Typically, the same activation function is used for all neurons in a layer, while the activation function may vary from layer to layer. Determining what a good activation function is for a given layer of a neural network is typically a heuristic rather than a systematic task.

Note that the softmax provides a special case of an activation function, as it explicitly depends on the output of the q functions in the other neurons of the same layer. Let us label by l = 1, \dots, n the n neurons in a given layer and by q_l the output of their respective linear transformation. Then, the softmax is defined as

g_l(q_1, \dots, q_n) = \frac{e^{q_l}}{\sum_{l'=1}^{n} e^{q_{l'}}}    (4.3)

for the output of neuron l. A useful property of softmax is that

\sum_{l} g_l(q_1, \dots, q_n) = 1,    (4.4)

so that the layer output can be interpreted as a probability distribution. The softmax function is thus a continuous generalization of the argmax function introduced in the previous chapter.

Now that we understand how a single neuron works, we can connect many of them together and create an artificial neural network. The general structure of a simple (feed-forward) neural network is shown in Fig. 14. The first and last layers are the input and output layers (blue and violet, respectively, in Fig. 14) and are called visible layers, as they are directly accessed. All the other layers in between are neither accessible for input nor providing any direct output, and are thus called hidden layers (green layer in Fig. 14).

Assuming we can feed the input to the network as a vector, we denote the input data with x.
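The activation functions above can be sketched in a few lines of NumPy (a sketch for illustration; in practice one would use a library implementation):

```python
import numpy as np

# Sketches of the activation functions discussed above.
def relu(z):
    return np.maximum(0.0, z)          # zero for z < 0, linear for z > 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # logistic function

# The hyperbolic tangent is available directly as np.tanh.

def softmax(q):
    e = np.exp(q - np.max(q))          # shift for numerical stability
    return e / e.sum()                 # entries sum to one, cf. Eq. (4.4)
```

Since the softmax outputs sum to one, the last layer of a classification network can be read directly as a probability distribution over the classes.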
The network then transforms this input into the output F(x), which in general is also a vector. As a simple and concrete example, we write the complete functional form of a neural network with one hidden layer as shown in Fig. 14,

F(x) = g^{[2]}(W^{[2]} g^{[1]}(W^{[1]} x + b^{[1]}) + b^{[2]}).    (4.5)

Here, W^{[n]} and b^{[n]} are the weight matrix and bias vector of the n-th layer. Specifically, W^{[1]} is the l × k weight matrix of the hidden layer, with k and l the number of neurons in the input and hidden layer, respectively. W^{[1]}_{ij} is the j-th entry of the weight vector of the i-th neuron in the hidden layer, while b^{[1]}_i is the bias of this neuron. The W^{[2]}_{ij} and b^{[2]}_i are the respective quantities for the output layer. This network is called fully connected or dense, because each neuron in a given layer takes as input the output from all the neurons in the previous layer; in other words, all weights are allowed to be non-zero.

Figure 14: Simple neural network. Architecture and variational parameters.

Note that for the evaluation of such a network, we first calculate all the neurons' values of the first hidden layer, which feed into the neurons of the second hidden layer, and so on, until we reach the output layer. This procedure, which is possible only for feed-forward neural networks, is obviously much more efficient than evaluating the nested function of each output neuron independently.

Adjusting all the weights and biases to achieve the given task using data samples D = {(x_1, y_1), \dots, (x_m, y_m)} constitutes the training of the network. In other words, the training is the process that makes the network an approximation to the mathematical function F(x) = y that we want it to represent. Since each neuron has its own bias and weights, there is a potentially huge number of variational parameters, and we will need to adjust all of them.

We have already seen in the previous chapter how one in principle trains a variational function. For the purpose of learning, we introduce a loss function L(W, B), which characterizes how well the network is doing at predicting the correct output for each input. The loss function now depends, through the neural network, on all the weights and biases that we collectively denote by the vectors W and B.

The choice of loss function may strongly impact the efficiency of the training and is based on heuristics (as was the case with the choice of activation functions). In the previous chapter, we already encountered one loss function, the mean square error

L(\theta) = \frac{1}{2m} \sum_{i=1}^{m} ||F(x_i) - y_i||_2^2.
(4.6)

Here, ||a||_2 = \sqrt{\sum_i a_i^2} is the L2 norm, and the corresponding loss is called the L2 loss. An advantage of the L2 loss is that it is a smooth function of the variational parameters. Another natural loss function is the mean absolute error,

L(\theta) = \frac{1}{2m} \sum_{i=1}^{m} ||F(x_i) - y_i||_1,    (4.7)

where ||a||_1 = \sum_i |a_i| denotes the L1 norm; the corresponding loss is called the L1 loss. A loss function that is particularly suited for classification tasks is the cross-entropy between the true label y_i and the network output F(x_i), defined as

L_ent(\theta) = -\sum_{i=1}^{m} [y_i \cdot \ln(F(x_i)) + (1 - y_i) \cdot \ln(1 - F(x_i))],    (4.8)

where the logarithm is taken element-wise. This loss function is also called negative log likelihood. It is here written for outputs that lie between 0 and 1, as is the case when the activation function of the last layer of the network is the sigmoid \sigma(z) = 1/(1 + e^{-z}). (The cross-entropy is preferably combined with sigmoid activation in the last layer.)

Of these loss functions, the cross-entropy is probably the least intuitive one. We want to understand what it means and gain some intuition about it. The different cost functions actually differ in the speed of the learning process. The learning rate is largely determined by the partial derivatives of the cost function, \partial L/\partial\theta. Slow learning appears when these derivatives become small. Let us consider the toy example of a single neuron with sigmoid activation F(x) = \sigma(wx + b) and a single input-output pair {x, y} = {1, 0}. Then the quadratic cost function has derivatives

\partial L/\partial w = \partial L/\partial b = \sigma(w + b) \sigma'(w + b).    (4.9)

We observe that this derivative gets very small for \sigma(w + b) \to 1, because \sigma' gets very small in that limit. Therefore, a slowdown of learning appears. This slowdown is also observed in more complex neural networks with L2 loss; we considered the simple case here only to be able to say something analytically.

Given this observation, we want to see whether the cross-entropy can improve the situation. We again compute the derivative of the cost function with respect to the weights for a single term in the sum and a network that is composed of a single sigmoid, now for a general input-output pair {x, y},

\partial L_ent/\partial w = -[y/\sigma(wx + b) - (1 - y)/(1 - \sigma(wx + b))] \sigma'(wx + b) x
    = \sigma'(wx + b) x / (\sigma(wx + b)[1 - \sigma(wx + b)]) \cdot [\sigma(wx + b) - y]
    = x [\sigma(wx + b) - y],    (4.10)

where in the last step we used that \sigma'(z) = \sigma(z)[1 - \sigma(z)]. This is a much better result than what we got for the L2 loss. The learning rate is here directly proportional to the error between data point and prediction, [\sigma(wx + b) - y]. The mathematical reason for this change is that \sigma'(z) cancels out due to this specific form of the cross-entropy. A similar expression holds true for the derivative with respect to b,

\partial L_ent/\partial b = \sigma(wx + b) - y.    (4.11)
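The cancellation leading to Eq. (4.10) can be checked numerically with a finite-difference derivative (a sketch; the values of w, b, x, and y below are arbitrary):

```python
import numpy as np

# Check Eq. (4.10): for a sigmoid output neuron with cross-entropy loss,
# dL/dw = x * (sigma(w*x + b) - y).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, x, y):
    f = sigmoid(w * x + b)
    return -(y * np.log(f) + (1 - y) * np.log(1 - f))

w, b, x, y = 0.7, -0.3, 1.5, 1.0
analytic = x * (sigmoid(w * x + b) - y)

eps = 1e-6  # finite-difference step
numeric = (cross_entropy(w + eps, b, x, y)
           - cross_entropy(w - eps, b, x, y)) / (2 * eps)
```

The analytic and the numeric derivative agree to high precision, confirming the cancellation of sigma'.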
In fact, if we insist on the very intuitive form of Eqs. (4.10) and (4.11) for the gradients, we can derive the cost function for the sigmoid activation function to be the cross-entropy. This follows simply because

\partial L/\partial b = (\partial L/\partial F) F'    (4.12)

and F' = F(1 - F) for the sigmoid activation, which, in comparison to Eq. (4.11), yields

\partial L/\partial F = (F - y) / (F(1 - F)),    (4.13)

which, when integrated with respect to F, gives exactly the cross-entropy (up to a constant). Starting from Eqs. (4.10) and (4.11), we can thus think of the choice of cost function as backward engineering. Following this logic, we can find other pairs of final-layer activations and cost functions that may work well together.

What happens if we change the activation function in the last layer from sigmoid to softmax? For the loss function, we consider just the first term in the cross-entropy for brevity of presentation (for softmax, this form is appropriate, as compared to a sigmoid activation),

L(\theta) = -\sum_{i=1}^{m} y_i \cdot \ln(F(x_i)),    (4.14)

where again the logarithm is taken element-wise. For concreteness, let us look at a one-hot encoded classification problem. Then, all labels y_i are vectors with exactly one entry "1". Let that entry have index n_i in the vector. The loss function then reads

L(\theta) = -\sum_{i=1}^{m} \ln(F_{n_i}(x_i)).    (4.15)

Due to the properties of the softmax, F_{n_i}(x_i) is always ≤
1, so that the loss function is minimized if it approaches 1, the value of the label. For the gradients, we obtain

\partial L/\partial b_j = -\sum_{i=1}^{m} \frac{1}{F_{n_i}(x_i)} \frac{\partial F_{n_i}(x_i)}{\partial b_j}
    = -\sum_{i=1}^{m} \frac{1}{F_{n_i}(x_i)} [F_{n_i}(x_i) \delta_{n_i,j} - F_{n_i}(x_i) F_j(x_i)]
    = \sum_{i=1}^{m} [F_j(x_i) - y_{i,j}].    (4.16)

We observe that again the gradient has a favorable structure similar to the previous case, in that it depends linearly on the error that the network makes. (The same can be found for the derivatives with respect to the weights.)

Once we have defined a loss function, we also already understand how to train the network: we need to minimize L(\theta) with respect to W and B. However, L is typically a high-dimensional function and may have many nearly degenerate minima. Unlike in the previous chapter, finding the loss function's absolute minimum exactly is typically intractable analytically and may come at prohibitive costs computationally. The practical goal is therefore rather to find a "good" minimum, instead of the absolute one, through training. Having found such "good" values for
W, B, the network can then be applied to previously unseen data.

It remains to be explained how to minimize the loss function. Here, we employ an iterative method called gradient descent. Intuitively, the method corresponds to "walking down the hill" in our many-parameter landscape until we reach a (local) minimum. For this purpose, we use the (discrete) derivative of the cost function to update all the weights and biases incrementally, searching for the minimum of the function via tiny steps on the many-dimensional surface. More specifically, in each step we update all weights and biases as

\theta_\alpha \to \theta_\alpha - \eta \, \partial L(\theta)/\partial\theta_\alpha.    (4.17)

The variable \eta, also referred to as the learning rate, specifies the size of the steps we use to walk the landscape: if it is too small in the beginning, we might get stuck in a local minimum early on, while for too large \eta we might never find a minimum. The learning rate is a hyperparameter of the training algorithm. Note that gradient descent is just a discrete many-variable version of the analytical search for extrema which we know from calculus: an extremum is characterized by vanishing derivatives in all directions, which results in convergence of the gradient-descent algorithm outlined above.

While the process of optimizing the many variables of the loss function is mathematically straightforward to understand, it presents a significant numerical challenge: for each variational parameter, for instance a weight in the k-th layer, W^{[k]}_{ij}, the partial derivative \partial L/\partial W^{[k]}_{ij} has to be computed. And this has to be done each time the network is evaluated for a new dataset during training. Naively, one could assume that the whole network has to be evaluated each time. Luckily, there is an algorithm that allows for an efficient and parallel computation of all derivatives: it is known as backpropagation.
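For the one-hidden-layer network of Eq. (4.5), such a gradient computation can be written out by hand (a sketch: the layer sizes and data below are made up, and sigmoid activations with an L2 loss are assumed):

```python
import numpy as np

# Hand-coded gradients for a one-hidden-layer network (sketch).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=3)                    # input vector
y = np.array([1.0])                       # target output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

# Forward pass, storing intermediate layer outputs for the backward pass.
z1 = sigmoid(W1 @ x + b1)                 # hidden layer
F = sigmoid(W2 @ z1 + b2)                 # network output
loss = 0.5 * float(np.sum((F - y) ** 2))

# Backward pass: propagate dL/dq from the output layer back to layer 1.
dq2 = (F - y) * F * (1 - F)               # output layer: (F - y) * sigma'(q2)
dW2, db2 = np.outer(dq2, z1), dq2
dq1 = (W2.T @ dq2) * z1 * (1 - z1)        # chain rule through W2 and sigma
dW1, db1 = np.outer(dq1, x), dq1
```

The same pattern, computing the derivatives of each layer once and reusing them for the layer before, underlies the general algorithm described next.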
The algorithm derives directly from the chain rule of differentiation for nested functions and is based on two observations:

(1) The loss function is a function of the neural network F(x), that is, L ≡ L(F).

(2) To determine the derivatives in layer k, only the derivatives of the following layers are required, given as the Jacobi matrices

D f^{[l]}(z^{[l-1]}) = \partial f^{[l]}/\partial z^{[l-1]},    (4.18)

with l > k and z^{[l-1]} the output of the previous layer, as well as

\partial z^{[k]}/\partial\theta^{[k]}_\alpha = (\partial g^{[k]}/\partial q^{[k]}_i)(\partial q^{[k]}_i/\partial\theta_\alpha) = \begin{cases} (\partial g^{[k]}/\partial q^{[k]}_i) \, z^{[k-1]}_j, & \theta_\alpha = W_{ij} \\ \partial g^{[k]}/\partial q^{[k]}_i, & \theta_\alpha = b_i \end{cases}    (4.19)

The Jacobi matrices D f^{[l]} are the same for all parameters. Their calculation thus has to be performed only once for every update. In contrast to the evaluation of the network itself, which propagates forward (the output of layer n is the input to layer n + 1), we find that a change in the output propagates backwards through the network, hence the name. The full algorithm then looks as follows:

Algorithm 2: Backpropagation
Input:
Loss function L, which in turn depends on the neural network, parametrized by weights and biases, summarized as \theta = {W, b}.
Output:
Partial derivatives \partial L/\partial\theta^{[k]}_\alpha with respect to all parameters \theta^{[k]} of all layers k = 1, ..., n.

Calculate the derivatives with respect to the parameters of the output layer: \partial L/\partial W^{[n]}_{ij} = (\nabla L)^T (\partial g^{[n]}/\partial q^{[n]}_i) z^{[n-1]}_j and \partial L/\partial b^{[n]}_i = (\nabla L)^T (\partial g^{[n]}/\partial q^{[n]}_i);
for k = n - 1, ..., 1 do
    Calculate the Jacobi matrices for layer k: D g^{[k]} = \partial g^{[k]}/\partial q^{[k]} and D f^{[k]} = \partial f^{[k]}/\partial z^{[k-1]};
    Multiply all following Jacobi matrices to obtain the derivatives of layer k: \partial L/\partial\theta^{[k]}_\alpha = (\nabla L)^T D f^{[n]} \cdots D f^{[k+1]} D g^{[k]} (\partial q^{[k]}/\partial\theta^{[k]}_\alpha);
end

A remaining question is when to actually perform updates to the network parameters. One possibility would be to perform the above procedure for each training sample individually. The other extreme is to use all the training data available and perform the update with an averaged derivative. Not surprisingly, the answer lies somewhere in the middle: often, we do not present the training data to the network one item at a time; instead, the full training data is divided into so-called batches, groups of training data that are fed into the network together. Chances are the weights and biases can be adjusted better if the network is presented with more information in each training step. However, the price to pay for larger batches is a higher computational cost. Therefore, the batch size can greatly impact the efficiency of training. The random partitioning of the training data into batches is kept for a certain number of iterations, before a new partitioning is chosen. The consecutive iterations carried out with a chosen set of batches constitute a training epoch.

As we discussed in the introduction, the recognition of hand-written digits 0, 1, ..., 9 is a classic task for machine learning. Each data sample of the MNIST dataset, a 28 ×
28 grayscale image, comes with a label, which holds the information which digit is stored in the image. The difficulty of learning to recognize the digits is that handwriting styles are incredibly personal and different people will write the digit "4" slightly differently. It would be very challenging to hardcode all the criteria to recognize "4" and not confuse it with, say, a "9".

(Footnote: Backpropagation is actually a special case of a set of techniques known as automatic differentiation (AD). AD makes use of the fact that any computer program can be composed of elementary operations (addition, subtraction, multiplication, division) and elementary functions (sin, exp, ...). By repeated application of the chain rule, derivatives of arbitrary order can be computed automatically.)

We can use a simple neural network as introduced earlier in the chapter to tackle this complex task. We will use a network as shown in Fig. 14 and given in Eq. (4.5) to do just that. The input is the image of the handwritten digit, transformed into a vector of length k = 28² = 784, the hidden layer contains l neurons, and the output layer has p = 10 neurons, each corresponding to one digit in the one-hot encoding. The output is then a probability distribution over these 10 neurons that will determine which digit the network identifies.

As an exercise, we build a neural network according to these guidelines and train it. How exactly one writes the code depends on the library of choice, but the generic structure will be the following:

Example 1: MNIST

1. Import the data: The MNIST database is available for download at http://yann.lecun.com/exdb/mnist/

2. Define the model:
• Input layer: 28² = 784 neurons (the greyscale value of each pixel of the image, normalized to a value in [0, 1]).
• Fully connected hidden layer: Here one can experiment, starting from as few as 10 neurons.
The use of a sigmoid activation function is recommended, but others can in principle be used.
• Output layer: Use 10 neurons, one for each digit. The proper activation function for this classification task is, as discussed, a softmax function.

3.
Choose the loss function: Since we are dealing with a classification task, we use the cross-entropy, Eq. (4.8).

4.
Train and evaluate the model: Follow the standard machine-learning workflow to train and evaluate the model. However, unlike in the regression example of the previous chapter, where we evaluated the model using the mean square error, here we are rather interested in the accuracy of our prediction.

(Footnote: Most ML packages have some type of 'train' function built in, so there is no need to worry about implementing backpropagation by hand. All that is needed here is to call the 'train' function.)
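The workflow of Example 1 can be sketched end-to-end in pure NumPy (a sketch on toy data: all sizes, the synthetic inputs, and the learning rate below are made up for illustration; for real MNIST one would load the actual images and typically use a library):

```python
import numpy as np

# Miniature classification setup: sigmoid hidden layer, softmax output,
# cross-entropy loss, full-batch gradient descent.
rng = np.random.default_rng(1)
n, d, h, c = 200, 16, 8, 3            # samples, input dim, hidden, classes

X = rng.normal(size=(n, d))           # toy "images"
labels = rng.integers(0, c, size=n)
Y = np.eye(c)[labels]                 # one-hot labels

W1, b1 = 0.1 * rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, c)), np.zeros(c)

def forward(X):
    Z1 = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))      # sigmoid hidden layer
    Q2 = Z1 @ W2 + b2
    E = np.exp(Q2 - Q2.max(axis=1, keepdims=True))
    return Z1, E / E.sum(axis=1, keepdims=True)    # softmax output

def loss(F):
    return float(-np.mean(np.sum(Y * np.log(F + 1e-12), axis=1)))

eta = 0.5                             # learning rate (arbitrary choice)
initial = loss(forward(X)[1])
for _ in range(200):
    Z1, F = forward(X)
    dQ2 = (F - Y) / n                 # softmax + cross-entropy gradient
    dW2, db2 = Z1.T @ dQ2, dQ2.sum(axis=0)
    dZ1 = (dQ2 @ W2.T) * Z1 * (1 - Z1)
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
final = loss(forward(X)[1])
```

After training, the loss on the training data has decreased from its near-random starting value.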
With the training completed, we want to understand how well the final model performs in recognizing handwritten digits. For that, we introduce the accuracy, defined by

accuracy = correct predictions / total predictions.    (4.20)

If we use 30 hidden neurons and a suitably chosen learning rate \eta, training with the cross-entropy loss already yields a high accuracy; with a quadratic cost we obtain only slightly worse results of 95.42%. For 100 hidden neurons, we obtain 96.82% with the cross-entropy. That is a considerable improvement over the quadratic cost, where we obtain 96.59%. (This means that about 1 in 14 previously wrongly classified pictures is now correctly classified.) Still, these numbers are not even close to state-of-the-art neural-network performances. The reason is that we have used the simplest possible all-to-all connected architecture with only one hidden layer. Below, we will introduce more advanced neural-network features and show how to increase the performance.

Before doing so, we briefly introduce other important measures used in statistics to characterize the performance of, specifically, binary-classification models: precision, specificity, and recall. In the language of true (false) positives (negatives), the precision is defined as

precision = true positives / (true positives + false positives).    (4.21)

Recall (also referred to as sensitivity) is defined as

recall = true positives / (true positives + false negatives).    (4.22)

While recall can be interpreted as the true positive rate, as it represents the fraction of actual positives that are identified as positive, the specificity is an analogous measure for negatives,

specificity = true negatives / (true negatives + false positives).    (4.23)

Note, however, that these measures can be misleading, in particular when dealing with very unbalanced data sets.

In the previous sections, we have illustrated an artificial neural network that is constructed in analogy to neuronal networks in the brain.
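The performance measures of Eqs. (4.20) to (4.23) can be computed directly from the counts of true/false positives and negatives (a sketch; the counts in the usage example below are invented):

```python
# Performance measures from a binary confusion matrix (sketch).
def metrics(tp, fp, tn, fn):
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "precision":   tp / (tp + fp),          # Eq. (4.21)
        "recall":      tp / (tp + fn),          # Eq. (4.22), true positive rate
        "specificity": tn / (tn + fp),          # Eq. (4.23), true negative rate
    }
```

The unbalanced-data caveat is easy to see here: on a set with, say, 990 negatives and 10 positives, a classifier that always answers "negative" scores 99% accuracy yet has recall 0.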
Such a model is only given a rough structure a priori, within which it has a huge number of parameters to adjust by learning from the training set. While we already understand that this is an extraordinarily powerful approach, this method of learning comes with its own set of challenges. The most prominent of them is the generalization of the rules learned from training data to unseen data.

We have already encountered in the previous chapter how the naive optimization of a linear model reduces the generalization. However, we have also seen how the generalization error can be improved using regularization. Training neural networks comes with the same issue and the same solution: we always show the algorithm a training set that is limited in one way or another, and we need to make sure that the neural network does not learn particularities of that given training set, but actually extracts general knowledge.

Step zero to avoid over-fitting is to create a sufficiently representative and diverse training set. Once this is taken care of, we can take several steps for the regularization of the network. The simplest, but at the same time most powerful,
option is introducing dropout layers. This regularization is very similar to dropping features, which we discussed for linear regression. However, the dropout layer ignores a randomly selected subset of neuron outputs in the network only during training. Which neurons are dropped is chosen at random for each training step. This regularization procedure is illustrated in Fig. 15. By randomly discarding a certain fraction of neurons we ensure that the network does not get fixated on small particular features of the training set and is better equipped to recognize the more general features. Another way of looking at it is that this procedure corresponds to training a large number of neural networks with different neuron connections in parallel. The fraction of neurons that are ignored in a dropout layer is a hyperparameter that is fixed a priori. It may be counter-intuitive, but the best performance is often achieved when this number is sizable, between 20% and 50%. This shows the remarkable resilience of the network against fluctuations.
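A dropout layer can be sketched as a random mask that is applied during training only (an "inverted dropout" sketch: the surviving outputs are rescaled by 1/(1 - p) so that the expected output stays unchanged):

```python
import numpy as np

# Dropout sketch: each neuron output is zeroed with probability p during
# training; at test time the layer passes its input through unchanged.
rng = np.random.default_rng(0)

def dropout(z, p, training=True):
    if not training:
        return z
    mask = rng.random(z.shape) >= p     # keep each neuron with prob. 1 - p
    return z * mask / (1.0 - p)         # rescale to preserve the mean
```

A fresh mask is drawn on every call, mirroring the fact that the dropped neurons are re-chosen in each training step.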
Dropout layer
As for the linear models, regularization can also be achievedby adding regularization terms R to the L , L → L + R . Again,the two most common regularization terms are L
1- or
Lasso -regularisation, where R L = λ X j | W j | , (4.24)and the sum runs over all weights W j of the network, as wellas the L R L = λ X j W j , (4.25)where the sum runs again over all weights W j of the network. As for the lin-ear models, L L weight decay . Either way, both L L . In particular if we know symmetriesof the problem from which the data originates (such as time translation invariance,invariance under spatial translations or rotations), effective generation of augmenteddatasets is possible. Another option is the addition of various forms of noise to data See Simard et al., http://dx.doi.org/10.1109/ICDAR.2003.1227801
in order to prevent overfitting to the existing noise, or in general to increase the resilience of the neural network to noise. Finally, for classification problems in particular, the data may not be distributed equally between categories. To avoid a bias, it is then desirable to enhance the data in the underrepresented categories.

Figure 16: Convolutional layer in 2D. Here with kernel size k = 3 and stride s = 2. The filter is first applied to the 3 × 3 block of neurons in the upper left corner; it is then moved s neurons to the right, which yields the next pixel of the feature map, and so on. After moving all the way to the right, the filter moves s pixels down and starts from the left again, until reaching the bottom right.

The fully-connected simple single-layer architecture for a neural network is in principle universally applicable. However, this architecture is often inefficient and hard to train. In this section, we introduce more advanced neural-network layers and examples of the types of problems for which they are suitable.
The achieved accuracy in the MNIST example above was not as high as one may have hoped, being much worse than the performance of a human. A main reason was that, using a dense network structure, we discarded all local information contained in the pictures. In other words, by connecting every input neuron with every neuron in the next layer, the information whether two neurons are close to each other is lost. This information is, however, not only crucial for pictures, but quite generally for input data with an underlying geometric structure or a natural notion of 'distance' in its correlations. To use this local information, so-called convolutional layers were introduced. The neural networks that contain such layers are called convolutional neural networks (CNNs).

The key idea behind convolutional layers is to identify certain (local) patterns in the data. In the example of the MNIST images, such patterns could be straight and curved lines, or corners. A pattern is then encoded in a kernel or filter in the form of weights, which are again part of the training. The convolutional layer then compares these patterns with a local patch of the input data. Mathematically, identifying the features in the data corresponds to a convolution (f * x)(t) = \sum_\tau f(\tau) x(t - \tau) of the kernel f with the original data x.
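A strided 2D convolutional layer as in Fig. 16 can be sketched as follows (a sketch; note that, as is common in ML libraries, the indexing below is a cross-correlation over the window starting at (i s, j s), rather than the flipped-kernel indexing of a strict mathematical convolution):

```python
import numpy as np

# Strided 2D "convolution" of a k x k kernel f over an input x (sketch).
def conv2d(x, f, b=0.0, s=1):
    n_in, k = x.shape[0], f.shape[0]
    n_q = (n_in - k) // s + 1                     # feature-map size
    q = np.empty((n_q, n_q))
    for i in range(n_q):
        for j in range(n_q):
            # Overlap the kernel with the patch at stride offsets and add bias.
            q[i, j] = np.sum(f * x[i*s:i*s+k, j*s:j*s+k]) + b
    return q
```

The output shape reproduces the formula n_q = floor((n_in - k)/s + 1): a 7 × 7 input with k = 3 and s = 2 yields a 3 × 3 feature map.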
Figure 17: Pooling layer. (a) An average pooling and (b) a max pooling layer (both with n = 3).

For two-dimensional data, such as shown in the example of Fig. 16, we write the discrete convolution explicitly as

q_{i,j} = \sum_{m=1}^{k} \sum_{n=1}^{k} f_{n,m} \, x_{si-m, sj-n} + b,    (4.26)

where f_{n,m} are the weights of the kernel, which has linear size k, and b is a bias. Finally, s is called the stride and refers to the number of pixels by which the filter moves per application. The output, q, is called the feature map. Note that the dimension of the feature map is n_q × n_q with n_q = \lfloor (n_in - k)/s + 1 \rfloor when the input image is of dimensions n_in × n_in: the application of a convolutional layer thus reduces the image size, an effect that is not always intended. To avoid this reduction, the original data can be padded, for example by adding zeros around the border of the data, to ensure that the feature map has the same dimension as the input data.

For typical convolutional networks, one applies a number of filters for each layer in parallel, where each filter is trained to recognize different features. For instance, one filter could start to be sensitive to contours in an image, while another filter recognizes the brightness of a region. Further, while filters in the first layers may be sensitive to local patterns, the ones in the later layers recognize larger structures. This distribution of functionalities between filters happens automatically; it is not preconceived when building the neural network.

Another very useful layer, in particular in combination with convolutional layers, is the pooling layer. Each neuron in the pooling layer takes input from n (neighboring) neurons in the previous layer (in the case of a convolutional network, for each feature map individually) and only retains the most significant information. Thus, the pooling layer helps to reduce the spatial dimension of the data.
What is considered significant depends on the particular circumstances: picking the neuron with the maximum input among the n, called max pooling, detects whether a given feature is present in the window. Furthermore, max pooling is useful to avoid dead neurons, in other words neurons that are stuck with a value near 0 irrespective of the input, and that have such a small gradient for their weights and biases that this is unlikely to change with further training. This scenario can often happen especially when using the ReLU activation function. Average pooling, in other words taking the average value of the n inputs, is a straightforward compression. Note that unlike other layers, the pooling layer has just a small set of n connections with no adjustable weights. The functionality of the pooling layer is shown in Fig. 17 (a) and (b).
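Both variants can be sketched as reductions over non-overlapping n × n windows (a sketch; n here is the linear size of the pooling window):

```python
import numpy as np

# Max and average pooling over non-overlapping n x n windows (sketch).
def pool(x, n, mode="max"):
    h, w = x.shape[0] // n, x.shape[1] // n
    # Group the input into h x w blocks of size n x n, then reduce each block.
    blocks = x[:h*n, :w*n].reshape(h, n, w, n)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))
```

Note that, as described above, the layer has no adjustable weights; it only reduces the spatial dimension of its input by a factor of n in each direction.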
Figure 18: Comparison of DNA and random sequences. Left: five strings of human DNA; right: five strings of 36 randomly generated letters A, G, C, and T.

DNA sequences:
AACCCCTAACCCTAACCCTAACCCTAACCCTAAACTCTATGTATTTATCTATCATCTATCTATCTACCTGCCCACCTGGCTTCCTGTTGAAGTTGACCTGCTGGAACACTCAGATCCTTCATGCTTTCATTGCTGCCTCCACATCCCTCCAGGTACCCAAGGTCTCTCCACTGCCCTGCC

Random sequences:
CTGGCCCGATATCAGTCACTTATCACGCGGATGAGTTACCAAATCTCCCTATGATTAGTCTTATTGTAAATAATTTGCCAAGAAGCGTATAACGCCCATTTGGTCTTAAATGAGTCACTGCACAGACCAAGCAGCCGATCACGTTGGAATTGAGAAGTGCGCGAAGGAGACTCGAGGATC
An extreme case of pooling is global pooling, where the full input is converted to a single output. Using max pooling, this would then immediately tell us whether a given feature is present in the data.
With lowering costs and expanding applications, DNA sequencing has become a widespread tool in biological research. Especially the introduction of high-throughput sequencing methods and the related increase of data has required the introduction of data-science methods into biology. Sequenced data is extremely complex and thus a great playground for machine-learning applications. Here, we consider a simple classification task as an example. The primary structure of DNA consists of a linear sequence of basic building blocks called nucleotides. The key components of nucleotides are the nitrogen bases: Adenine (A), Guanine (G), Cytosine (C), and Thymine (T). The order of the bases in the linear chains defines the DNA sequence. Which sequences are meaningful is determined by a set of complex, specific rules. In other words, there are series of letters A, G, C, and T that correspond to DNA, while many other sequences do not resemble DNA. Trying to distinguish between strings of nitrogen bases that correspond to human DNA and those that do not is a simple example of a classification task that is at the same time not so easy for an untrained human eye.

In Fig. 18, we show a comparison of five strings of human DNA and five strings of 36 randomly generated letters A, G, C, and T. Without deeper knowledge it is hard to distinguish the two classes, and it is even harder to find the set of empirical rules that quantifies their distinction. We can let a neural network have a go and see if it performs any better than us studying these sequences by visual analysis.

We have all the ingredients to build a binary classifier that is able to distinguish between DNA and non-DNA sequences. First, we download a freely available database of the human genome from https://genome.ucsc.edu. Here, we downloaded a database of encoding genomes that contains 100000 sequences of human DNA (each 36 letters long). Additionally, we generate 100000 random sequences of the letters A, G, C, T.
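To feed such sequences to a network, each letter is one-hot encoded. This can be sketched as follows (the base ordering A, G, C, T matches the order in which the bases were introduced above; any fixed convention works):

```python
import numpy as np

# One-hot encoding of a DNA sequence in the base ordering A, G, C, T.
BASES = "AGCT"

def one_hot(seq):
    idx = [BASES.index(ch) for ch in seq]
    return np.eye(4)[idx]            # shape (len(seq), 4)
```

For a full 36-letter sequence this yields a 36 × 4 array, which is exactly the input format used in the example below.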
The learning task we are facing now is very similar to the MNIST classification, though in the present case we only have two classes. Note, however, that we generated random sequences of bases and labeled them as random, even though we might have accidentally created sequences that do correspond to human DNA. This limits the quality of our data set and thus naturally also the final performance of the network. (The database used here is http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeUwRepliSeq/wgEncodeUwRepliSeqBg02esG1bAlnRep1.bam.)

The model we choose here has a standard architecture and can serve as a guiding
example for supervised learning with neural networks that will be useful in many other scenarios. In particular, we implement the following architecture:

Figure 19: Neural network classification of DNA sequences. The input sequence (e.g. GCTATTTGGTTAAAAGCTATCAGGCTAGGT) is processed by convolutional, pooling, and dense layers into a DNA/random classification. The two lower panels show the loss function and the accuracy on the training (evaluation) data in green (orange) as a function of the training step.
Example 2: DNA classification

1. Import the data from https://genome.ucsc.edu

2. Define the model:
• Input layer: The input layer has dimension 36 × 4 (one-hot encoding of the four bases for each of the 36 letters). Example: [[1,0,0,0], [0,0,1,0], [0,0,1,0], [0,0,0,1]] = ACCT
• Convolutional layer: Kernel size k = 4, stride s = 1, and number of filters N = 64.
• Pooling layer: max pooling over n = 2 neurons, which reduces the output of the previous layer by a factor of 2.
• Dense layer: 256 neurons with a ReLU activation function.
Output layer : 2 neurons (DNA and non-DNA output) with softmaxactivation function.3.
Loss function : Cross-entropy between DNA and non-DNA.A schematic of the network structure as well as the evolution of the loss and theaccuracy measured over the training and validation sets with the number of trainingsteps are shown in Fig. 19. Comparing the accuracies of the training and validationsets is a standard way to avoid overfitting: On the examples from the training setwe can simply check the accuracy during training. When training and validation
accuracy reach the same number, this indicates that we are not overfitting on the training set, since the validation set is never used to adjust the weights and biases. A decreasing validation accuracy despite an increasing training accuracy, on the other hand, is a clear sign of overfitting.
We see that this simple convolutional network is able to achieve around 80% accuracy. By downloading a larger training set, ensuring that only truly random sequences are labeled as such, and by optimizing the hyper-parameters of the network, it is likely that an even higher accuracy can be achieved. We also encourage you to test other architectures: one can try to add more layers (both convolutional and dense), adjust the size of the convolution kernel or stride, add dropout layers, and finally, test whether it is possible to reach higher accuracies without overfitting on the training set.
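To see how the layer dimensions of Example 2 fit together, here is a minimal NumPy sketch of a single forward pass with randomly initialized (untrained) weights. The ReLU after the convolution, the A, G, C, T encoding order (taken from the one-hot example above), and all parameter values are assumptions for illustration; a real implementation would use a deep-learning framework and train the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(seq):
    # One-hot encode a nucleotide string using the A, G, C, T ordering
    idx = {"A": 0, "G": 1, "C": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [idx[c] for c in seq]] = 1.0
    return x

def conv1d(x, w):
    # w has shape (kernel, channels_in, filters); stride 1, no padding
    k = w.shape[0]
    steps = x.shape[0] - k + 1
    return np.array([np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
                     for i in range(steps)])

def max_pool(x, n=2):
    steps = x.shape[0] // n
    return np.array([x[i * n:(i + 1) * n].max(axis=0) for i in range(steps)])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Untrained parameters, for shape illustration only
w_conv = rng.normal(0, 0.1, (4, 4, 64))       # kernel 4, 4 channels, 64 filters
w_dense = rng.normal(0, 0.1, (16 * 64, 256))  # after pooling: 16 positions x 64 filters
w_out = rng.normal(0, 0.1, (256, 2))

x = one_hot("GCTA" * 9)                     # a 36-letter toy sequence
h = np.maximum(conv1d(x, w_conv), 0)        # conv + ReLU -> (33, 64)
h = max_pool(h)                             # -> (16, 64)
h = np.maximum(h.reshape(-1) @ w_dense, 0)  # dense layer with ReLU
p = softmax(h @ w_out)                      # class probabilities (DNA vs. random)
```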
We can now revisit the MNIST example and approach the classification with the more advanced neural network structures of the previous section. In particular, we use the following architecture:
Example 3: Advanced MNIST
1. Input layer: 28 × 28 = 784 neurons.
2. Convolutional layer 1: kernel size $k = 5$, stride $s = 1$, and number of filters $N = 32$ with a ReLU activation function.
3. Pooling layer: max pooling over $n = 2 \times 2$ neurons.
4. Convolutional layer 2: kernel size $k = 5$, stride $s = 1$, and number of filters $N = 64$ with a ReLU activation function.
5. Pooling layer: max pooling over $n = 2 \times 2$ neurons.
6. Dropout: dropout layer for regularization with a 50% dropout probability.
7. Dense layer: 1000 neurons with a ReLU activation function.
8. Output layer: 10 neurons with a softmax activation function.
For the loss function, we again use the cross-entropy between the output and the labels. Notice here the repeated structure of convolutional and pooling layers. This is a very common structure for deep convolutional networks. With this model, we achieve an accuracy on the MNIST test set of 98.8%, a massive improvement over the simple dense network.
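A useful sanity check when stacking convolutional and pooling layers is to track the spatial size of each layer's output. The sketch below does this for the layers of Example 3; the formula assumes no padding, which the text does not specify, so the numbers are illustrative.

```python
# Output length of a convolution or pooling layer applied without padding to an
# input of size n with kernel size k and stride s: floor((n - k) / s) + 1.
# (For a max-pooling layer over n_pool neurons, k = s = n_pool.)
def out_size(n, k, s):
    return (n - k) // s + 1

n = 28                    # one spatial dimension of an MNIST image
n = out_size(n, 5, 1)     # convolutional layer 1, kernel 5 -> 24
n = out_size(n, 2, 2)     # max pooling 2x2 -> 12
n = out_size(n, 5, 1)     # convolutional layer 2, kernel 5 -> 8
n = out_size(n, 2, 2)     # max pooling 2x2 -> 4
```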
We have seen in the previous section how the convolutional neural network allows us to retain local information through the use of filters. While this context-sensitivity
of the CNN is applicable in many situations where geometric information is important, there are situations where we have more than neighborhood relations, namely a sequential order: an element comes before or after another, it is not simply next to it. A common situation where the order of the input is important is time-series data. Examples are measurements of a distant star collected over years or the events recorded in a detector after a collision in a particle collider. The classification task in these examples could be the determination whether the star has an exoplanet, or whether a Higgs boson was observed, respectively. Another example without any temporal structure in the data is the prediction of a protein's functions from its primary amino-acid sequence.
A property that the above examples have in common is that the length of the input data is not necessarily fixed for all samples. This emphasizes another weakness of both the dense network and the CNN: these networks only work with fixed-size input, and there is no good procedure to decrease or increase the input size. While we can in principle always cut our input to a desired size, this finite window is not guaranteed to contain the relevant information.
In this final section on supervised learning, we introduce one more neural network architecture that solves both problems discussed above: recurrent neural networks (RNNs). The key idea behind a recurrent neural network is that the input is passed to the network one element after another (unlike for other neural networks, where an input 'vector' is given to the network all at once), and that, to recognize context, the network keeps an internal state, or memory, which is fed back to the network together with the next input.
Recurrent neural networks were developed in the context of natural language processing (NLP), the field of processing, translating, and transforming spoken or written language input, where clearly both context and order are crucial pieces of information. However, over the last couple of years, RNNs have found applications in many fields in the sciences.
The special structure of an RNN is depicted in Fig. 20. At step $t$, the input $x_t$ and the (hidden) internal state of the last step $h_{t-1}$ are fed to the network to calculate $h_t$. The new hidden memory of the RNN is finally connected to the output layer $y_t$. As shown in Fig. 20, this is equivalent to having many copies of the input-output architecture, where the hidden layers of the copies are connected to each other. The RNN cell itself can have a very simple structure with a single activation function.
Figure 20:
Recurrent neural network architecture.
The input $x_t$ is fed into the recurrent cell together with the (hidden) memory $h_{t-1}$ of the previous step to produce the new memory $h_t$ and the output $y_t$. One can understand the recurrent structure via the "unwrapped" depiction of the structure on the right-hand side of the figure. The red arrows indicate how gradients are propagated back in time for updating the network parameters.
Concretely, in each step of a simple RNN we update the hidden state as
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$,   (4.27)
where we used for the nonlinearity the hyperbolic tangent, a common choice, which is applied element-wise. Further, if the input data $x_t$ has dimension $n$ and the hidden state $h_t$ dimension $m$, the weight matrices $W_{hh}$ and $W_{xh}$ have dimensions $m \times m$ and $m \times n$, respectively. Finally, the output at step $t$ can be calculated using the hidden state $h_t$,
$y_t = W_{ho} h_t$.   (4.28)
A schematic of this implementation is depicted in Fig. 21(a). Note that in this simplest implementation, the output is only a linear operation on the hidden state. A straightforward extension, necessary in the case of a classification problem, is to add a nonlinear element to the output as well, i.e.,
$y_t = g(W_{ho} h_t + b_y)$   (4.29)
with $g(q)$ some activation function, such as a softmax. Note that while in principle an output can be calculated at every step, this is only done after the last input element in a classification task.
An interesting property of RNNs is that the weight matrices and biases, the parameters we learn during training, are the same for each input element. This property is called parameter sharing and is in stark contrast to dense networks. In the latter architecture, each input element is connected through its own weight matrix. While it might seem that this property could be detrimental in terms of representability of the network, it can greatly help with extracting sequential information: Similar to a filter in a CNN, the network does not need to learn the exact location of some specific sequence that carries the important information, it only learns to recognize this sequence somewhere in the data. Note that the way each input element is processed differently is instead implemented through the hidden memory.
Parameter sharing is, however, also the root of a major problem when training a simple RNN.
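The update rule (4.27) can be sketched directly in NumPy. Tracking the norm of the Jacobian $\partial h_t / \partial h_0$ along the way also previews the training problem just mentioned: with the deliberately small random weights chosen here (an assumption for this toy example), the gradient contribution from early steps shrinks rapidly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 8                              # input and hidden dimensions (arbitrary)
W_hh = rng.normal(0, 0.1, (m, m))        # deliberately small recurrent weights
W_xh = rng.normal(0, 0.1, (m, n))
b_h = np.zeros(m)

h = np.zeros(m)
jac = np.eye(m)                          # accumulates the Jacobian d h_t / d h_0
norms = []
for x in rng.normal(size=(30, n)):       # a toy sequence of 30 random inputs
    a = W_hh @ h + W_xh @ x + b_h        # pre-activation of Eq. (4.27)
    h = np.tanh(a)
    # One-step Jacobian: diag(tanh'(a)) W_hh, with tanh'(a) = 1 - tanh(a)^2
    jac = np.diag(1 - h**2) @ W_hh @ jac
    norms.append(np.linalg.norm(jac))
```

With these weights, `norms` decays by many orders of magnitude over the 30 steps, the vanishing-gradient behavior discussed below; weights with singular values larger than one would instead make it grow.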
To see this, remember that during training, we update the network parameters using gradient descent. As in the previous sections, we can use backpropagation to achieve this optimization; in the context of RNNs, backpropagation is thus referred to as backpropagation through time (BPTT). Even though the unwrapped representation of the RNN in Fig. 20 suggests a single hidden layer, the gradients for the backpropagation also have to propagate back through time. This is depicted in Fig. 20 with the red arrows. Looking at the backpropagation algorithm in Sec. 4.3, we see that to use data points from $N$ steps back, we need to multiply $N - 1$ Jacobi matrices $Df[t']$ with $t - N < t' \leq t$. Using Eq. (4.27), we can write each Jacobi matrix as a product of the derivative of the activation function, $\partial_q \tanh(q)$, with the weight matrix. If either of these factors (for the weight matrix this means its singular values) is much smaller (much larger) than 1, the gradients decrease (grow) exponentially. This is known as the problem of vanishing gradients (exploding gradients) and makes learning long-term dependencies with simple RNNs practically impossible.
Note that the problem of exploding gradients can be mitigated by clipping the gradients, in other words scaling them to a fixed size. Furthermore, we can use the
Figure 21:
Comparison of (a) a simple RNN and (b) a LSTM.
The boxes denote neural networks with the respective activation function, while the circles denote element-wise operations. The dark green box indicates that the four individual neural networks can be implemented as one larger one.
ReLU activation function instead of a hyperbolic tangent, as the derivative of the ReLU is constant for $q > 0$.
The key idea behind the LSTM is to introduce another state to the RNN, the so-called cell state, which is passed from cell to cell, similar to the hidden state. However, unlike for the hidden state, no matrix multiplication takes place; instead, information is added to or removed from the cell state through gates. The LSTM commonly comprises four gates, which correspond to three steps: the forget step, the input and update step, and finally the output step. In the following, we will go through each of these steps individually.
Forget step
In this step, specific information of the cell state is forgotten. Specifically, we update the cell state as
$c_t = \sigma(W_{hf} h_{t-1} + W_{xf} x_t + b_f) \odot c_{t-1}$,   (4.30)
where $\sigma$ is the sigmoid function (applied element-wise) and $\odot$ denotes element-wise multiplication. Note that this step multiplies each element of the cell state with a number $\in (0, 1)$.
Input and update step
In the next step, we decide what and how much to add to the cell state. For this purpose, we first define what we would like to add to the cell,
$g_t = \tanh(W_{hu} h_{t-1} + W_{xu} x_t + b_u)$,   (4.31)
which, due to the hyperbolic tangent, satisfies $-1 < g_t^\alpha < 1$ for each component $\alpha$. We then use another gate, which determines whether to actually write to the cell,
$i_t = \sigma(W_{hi} h_{t-1} + W_{xi} x_t + b_i)$,   (4.32)
again with $0 < i_t^\alpha < 1$. Finally, we update the cell state,
$c_t = c_t + i_t \odot g_t$.   (4.33)
Output step
In the final step, we decide how much of the information stored in the cell state should be written to the new hidden state,
$h_t = \sigma(W_{ho} h_{t-1} + W_{xo} x_t + b_o) \odot \tanh(c_t)$.   (4.34)
The full structure of the LSTM with the four gates and the element-wise operations is schematically shown in Fig. 21(b). Note that we can concatenate the input $x_t$ and hidden memory $h_{t-1}$ into a vector of size $n + m$ and write one large weight matrix $W$ of size $4m \times (m + n)$.
So far, we have only used the RNN in a supervised setting for classification purposes, where the input is a sequence and the output a single class, predicted at the end of the full sequence. A network that performs such a task is thus called a many-to-one RNN. We will see in the next section that, unlike the other network architectures encountered in this section, RNNs can straightforwardly be used for unsupervised learning, usually as one-to-many RNNs.
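The three steps of Eqs. (4.30)-(4.34) can be combined into a single cell function. The sketch below uses the concatenation trick just described, so each gate has one weight matrix of shape $(m, m + n)$; the dimensions and the untrained random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(q):
    return 1 / (1 + np.exp(-q))

def lstm_step(x, h, c, Wf, Wi, Wu, Wo, bf, bi, bu, bo):
    # One LSTM step following Eqs. (4.30)-(4.34); z concatenates the hidden
    # memory and the input, so each gate uses a single weight matrix.
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z + bf)            # forget gate, Eq. (4.30)
    g = np.tanh(Wu @ z + bu)            # update candidate, Eq. (4.31)
    i = sigmoid(Wi @ z + bi)            # input gate, Eq. (4.32)
    c = f * c + i * g                   # forget and update the cell state
    o = sigmoid(Wo @ z + bo)            # output gate
    h = o * np.tanh(c)                  # new hidden state, Eq. (4.34)
    return h, c

rng = np.random.default_rng(0)
n, m = 3, 5                             # input and hidden dimensions (arbitrary)
Ws = [rng.normal(0, 0.5, (m, m + n)) for _ in range(4)]
bs = [np.zeros(m) for _ in range(4)]

h, c = np.zeros(m), np.zeros(m)
for x in rng.normal(size=(10, n)):      # feed a toy sequence of 10 inputs
    h, c = lstm_step(x, h, c, *Ws, *bs)
```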
In Sec. 4, we discussed supervised learning tasks, for which datasets consist of input-output pairs, or data-label pairs. More often than not, however, we have data without labels and would like to extract information from such a dataset. Clustering problems fall in this category, for instance: We suspect that the data can be divided into different types, but we do not know which features distinguish these types.
Mathematically, we can think of the data $x$ as samples that were drawn from a probability distribution $P(x)$. The unsupervised learning task is to implicitly represent this distribution with a model, for example represented by a neural network. The model can then be used to study properties of the distribution or to generate new 'artificial' data. The models we encounter in this chapter are thus also referred to as generative models. In general, unsupervised learning is conceptually more challenging than supervised learning. At the same time, unsupervised algorithms are highly desirable, since unlabelled data is much more abundant than labelled data. Moreover, we can in principle use a generative model for a classification task by learning the joint probability distribution of the data-label pair.
In this chapter, we will introduce three types of neural networks that are specific to unsupervised learning tasks: restricted Boltzmann machines, autoencoders, and generative adversarial networks. Furthermore, we will discuss how the RNN introduced in the previous chapter can also be used for an unsupervised task.
Restricted Boltzmann machines (RBMs) are a class of generative stochastic neural networks. More specifically, given some (binary) input data $x \in \{0, 1\}^{n_v}$, an RBM can be trained to approximate the probability distribution of this input.
Moreover, once the neural network is trained to approximate the distribution of the input, we can sample from the network, in other words we can generate new instances from the learned probability distribution.
The RBM consists of two layers (see Fig. 22) of binary units. Each binary unit is a variable which can take the values 0 or 1. We call the first (input) layer visible and the second layer hidden. The visible layer with input variables $\{v_1, v_2, \ldots, v_{n_v}\}$, which we collect in the vector $v$, is connected to the hidden layer with variables $\{h_1, h_2, \ldots, h_{n_h}\}$, which we collect in the vector $h$. The role of the hidden layer is to mediate correlations between the units of the visible layer. In contrast to the neural networks we have seen in the previous chapter, the hidden layer is not followed by an output layer. Instead, the RBM represents a probability distribution $P_{rbm}(v)$, which depends on variational parameters represented by the weights and biases of a neural network. The RBM, as illustrated by the graph in Fig. 22, is a special case of a network structure known as a Boltzmann machine, with the restriction that a unit in the visible layer is only connected to hidden units and vice versa, hence the name restricted Boltzmann machine.
The structure of the RBM is motivated from statistical physics: To each choice of the binary vectors $v$ and $h$, we assign a value we call the energy
$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{ij} v_i W_{ij} h_j$,   (5.1)
Figure 22:
Restricted Boltzmann machine.
Each of the three visible units and five hidden units represents a variable that can take the values 0 or 1. The connections between them represent the entries $W_{ij}$ of the weight matrix that enters the energy function (5.1).
where the vectors $a$, $b$, and the matrix $W$ are the variational parameters of the model. Given the energy, the probability distribution over the configurations $(v, h)$ is defined as
$P_{rbm}(v, h) = \frac{1}{Z} e^{-E(v, h)}$,   (5.2)
where
$Z = \sum_{v, h} e^{-E(v, h)}$   (5.3)
is a normalisation factor called the partition function. The sum in Eq. (5.3) runs over all binary vectors $v$ and $h$, i.e., vectors with entries 0 or 1. The probability that the model assigns to a visible vector $v$ is then the marginal over the joint probability distribution Eq. (5.2),
$P_{rbm}(v) = \sum_h P_{rbm}(v, h) = \frac{1}{Z} \sum_h e^{-E(v, h)}$.   (5.4)
As a result of the restriction, the visible units, with the hidden units fixed, are mutually independent: given a choice of the hidden units $h$, we have an independent probability distribution for each visible unit given by
$P_{rbm}(v_i = 1 | h) = \sigma(a_i + \sum_j W_{ij} h_j)$, $i = 1, \ldots, n_v$,   (5.5)
where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function. Similarly, with the visible units fixed, the individual hidden units are also mutually independent, with the probability distribution
$P_{rbm}(h_j = 1 | v) = \sigma(b_j + \sum_i v_i W_{ij})$, $j = 1, \ldots, n_h$.   (5.6)
The visible (hidden) units can thus be interpreted as artificial neurons connected to the hidden (visible) units with a sigmoid activation function and bias $a$ ($b$). A direct consequence of this mutual independence is that sampling a vector $v$ or $h$ reduces to sampling every component individually. Notice that this simplification comes about due to the restriction that visible (hidden) units do not directly interact amongst themselves, i.e., there are no terms proportional to $v_i v_j$ or $h_i h_j$ in Eq. (5.1). In the following, we explain how one can train an RBM and discuss possible applications of RBMs.
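The conditional distributions (5.5) and (5.6) translate directly into sampling code. The dimensions, zero biases, and random weights below are illustrative assumptions; the point is that each unit is sampled independently given the other layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 5
a = np.zeros(n_v)                       # visible biases
b = np.zeros(n_h)                       # hidden biases
W = rng.normal(0, 0.5, (n_v, n_h))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Eq. (5.6): with the visible units fixed, each hidden unit is independent
def sample_h(v):
    p = sigmoid(b + v @ W)
    return (rng.random(n_h) < p).astype(int), p

# Eq. (5.5): with the hidden units fixed, each visible unit is independent
def sample_v(h):
    p = sigmoid(a + W @ h)
    return (rng.random(n_v) < p).astype(int), p

v = np.array([1, 0, 1])
h, p_h = sample_h(v)
```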
Consider a set of binary input data $x_k$, $k = 1, \ldots, M$, drawn from a probability distribution $P_{data}(x)$. The aim of the training is to tune the parameters $\{a, b, W\}$ of an RBM such that after training $P_{rbm}(x) \approx P_{data}(x)$. The standard approach to solve this problem is the maximum likelihood principle, in other words we want to find the parameters $\{a, b, W\}$ which maximize the probability that our model produces the data $x_k$.
Maximizing the likelihood $\mathcal{L}(a, b, W) = \prod_k P_{rbm}(x_k)$ is equivalent to training the RBM using a loss function we have encountered before, the negative log-likelihood
$L(a, b, W) = -\sum_{k=1}^{M} \log P_{rbm}(x_k)$.   (5.7)
For the gradient descent, we need derivatives of the loss function of the form
$\frac{\partial L(a, b, W)}{\partial W_{ij}} = -\sum_{k=1}^{M} \frac{\partial \log P_{rbm}(x_k)}{\partial W_{ij}}$.   (5.8)
This derivative consists of two terms,
$\frac{\partial \log P_{rbm}(x)}{\partial W_{ij}} = x_i P_{rbm}(h_j = 1 | x) - \sum_v v_i P_{rbm}(h_j = 1 | v) P_{rbm}(v)$,   (5.9)
and similarly simple forms are found for the derivatives with respect to the components of $a$ and $b$. We can then iteratively update the parameters just as we have done in Chapter 4,
$W_{ij} \to W_{ij} - \eta \frac{\partial L(a, b, W)}{\partial W_{ij}}$,   (5.10)
with a sufficiently small learning rate $\eta$. As we have seen in the previous chapter in the context of backpropagation, we can reduce the computational cost by replacing the summation over the whole data set in Eq. (5.8) with a summation over a small, randomly chosen batch of samples. This reduction in the computational cost comes at the expense of noise, but at the same time it can help to improve generalization.
However, there is one more problem: The second summation in Eq. (5.9), which contains $2^{n_v}$ terms, cannot be evaluated efficiently. Instead, we have to approximate the sum by sampling the visible layer $v$ from the marginal probability distribution $P_{rbm}(v)$. This sampling can be done using Gibbs sampling as follows:
Algorithm 3: Gibbs sampling
Input: Any visible vector $v(0)$
Output: Visible vector $v(r)$
for $n = 1 \ldots r$ do
    sample $h(n)$ from $P_{rbm}(h | v = v(n-1))$;
    sample $v(n)$ from $P_{rbm}(v | h = h(n))$;
end
With sufficiently many steps $r$, the vector $v(r)$ is an unbiased sample drawn from $P_{rbm}(v)$. By repeating the procedure, we can obtain multiple samples to estimate the summation. Note that this is still rather computationally expensive, requiring multiple evaluations of the model.
The key innovation which allows the training of an RBM to be computationally feasible was proposed by Geoffrey Hinton (2002). Instead of obtaining multiple samples, we simply perform the Gibbs sampling with $r$ steps and estimate the summation with a single sample, in other words we replace the second summation in Eq. (5.9) with
$\sum_v v_i P_{rbm}(h_j = 1 | v) P_{rbm}(v) \to v_i P_{rbm}(h_j = 1 | v)$,   (5.11)
where $v = v(r)$ is simply the sample obtained from $r$-step Gibbs sampling. With this modification, the gradient, Eq. (5.9), can be approximated as
$\frac{\partial \log P_{rbm}(x)}{\partial W_{ij}} \approx x_i P_{rbm}(h_j = 1 | x) - v_i P_{rbm}(h_j = 1 | v)$.   (5.12)
This method is known as contrastive divergence. Although the quantity computed is only a biased estimator of the gradient, this approach is found to work well in practice. The complete algorithm for training an RBM with $r$-step contrastive divergence can be summarised as follows:
Algorithm 4: Contrastive divergence
Input: Dataset $D = \{x_1, x_2, \ldots, x_M\}$ drawn from a distribution $P(x)$
Initialize the RBM weights $\{a, b, W\}$;
Initialize $\Delta W_{ij} = \Delta a_i = \Delta b_j = 0$;
while not converged do
    select a random batch $S$ of samples from the dataset $D$;
    forall $x \in S$ do
        obtain $v$ by $r$-step Gibbs sampling starting from $x$;
        $\Delta W_{ij} \leftarrow \Delta W_{ij} - x_i P_{rbm}(h_j = 1 | x) + v_i P_{rbm}(h_j = 1 | v)$;
    end
    $W_{ij} \leftarrow W_{ij} - \eta \Delta W_{ij}$ (and similarly for $a$ and $b$)
end
Having trained the RBM to represent the underlying data distribution $P(x)$, there are a few ways one can use the trained model:
1. Pretraining
We can use $W$ and $b$ as the initial weights and biases for a deep network (cf. Chapter 4), which is then fine-tuned with gradient descent and backpropagation.
2. Generative modelling
As a generative model, a trained RBM can be used to generate new samples via Gibbs sampling (Alg. 3). Some potential uses of the generative aspect of the RBM include recommender systems and image reconstruction. In the following subsection, we provide an example where an RBM is used to reconstruct a noisy signal.
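Algorithm 4 can be sketched compactly in NumPy. For brevity this sketch updates only the weights $W$ (the biases $a$ and $b$ are omitted, which is a simplification of the algorithm above), and the data, dimensions, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Minimal r-step contrastive divergence update for the weights W only;
# with zero biases, P(h_j=1|v) = sigmoid((v @ W)_j) and P(v_i=1|h) = sigmoid((W @ h)_i).
def cd_update(W, batch, r=1, eta=0.1):
    dW = np.zeros_like(W)
    for x in batch:
        v = x.copy()
        for _ in range(r):                       # r-step Gibbs sampling, Alg. 3
            h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
            v = (rng.random(W.shape[0]) < sigmoid(W @ h)).astype(float)
        # Eq. (5.12): positive phase from the data, negative phase from the sample
        dW -= np.outer(x, sigmoid(x @ W)) - np.outer(v, sigmoid(v @ W))
    return W - eta * dW

data = rng.integers(0, 2, size=(20, 6)).astype(float)  # toy binary dataset
W = rng.normal(0, 0.1, (6, 4))                         # 6 visible, 4 hidden units
for _ in range(5):                                     # a few training iterations
    W = cd_update(W, data)
```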
Figure 23:
Signal reconstruction.
Using an RBM to repair a corrupted signal, here a sine and a sawtooth waveform.
A major drawback of simple RBMs for practical applications is the fact that they only take binary data as input. As an example, we thus look at simple periodic waveforms with 60 sample points. In particular, we use sawtooth, sine, and square waveforms. In order to have quasi-continuous data, we use eight bits for each point, such that our signal can take values from 0 to 255. Finally, we generate samples to train with a small variation in the maximum value, the periodicity, as well as the center point of each waveform.
After training the RBM using the contrastive divergence algorithm, we have a model which represents the data distribution of the binarized waveforms. Consider now a signal which has been corrupted, meaning some parts of the waveform have not been received, in other words they are set to 0. By feeding this corrupted data into the RBM and performing a few iterations of Gibbs sampling (Alg. 3), we can obtain a reconstruction of the signal, where the missing part has been repaired, as can be seen at the bottom of Fig. 23.
Note that the same procedure can be used to reconstruct or denoise images. Due to the limitation to binary data, however, the picture either has to be binarized, or the input size to the RBM becomes fairly large for high-resolution pictures. It is thus not surprising that while RBMs were popular in the mid-2000s, they have largely been superseded by more modern architectures such as generative adversarial networks, which we shall explore later in the chapter. However, they still serve a pedagogical purpose and could also provide inspiration for future innovations, in particular in science. A recent example is the idea of using an RBM to represent a quantum mechanical state.
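The eight-bit binarization described above can be done with NumPy's bit-unpacking routines; the sine signal below is one possible choice of test waveform, and the mapping is exactly invertible.

```python
import numpy as np

# Binarize a quasi-continuous signal (values 0-255) into 8 bits per sample
# point, so it can serve as input to a binary-unit RBM.
signal = (127.5 * (1 + np.sin(np.linspace(0, 2 * np.pi, 60)))).astype(np.uint8)
bits = np.unpackbits(signal[:, None], axis=1)   # shape (60, 8), entries 0 or 1

# The inverse map recovers the original 8-bit signal exactly.
recovered = np.packbits(bits, axis=1).ravel()
```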
In Sec. 4, the RNN was introduced as a classification model. Instead of classifyingsequences of data, such as time series, the RNN can also be trained to generate
Figure 24:
RNN used as a generator.
For training (left), the input data shifted by one, $x_{t+1}$, are used as the label. For the generation of new sequences (right), we input a single data point $x_1$ and the RNN uses the recurrent steps to generate a new sequence.
valid sequences itself. Given the RNN introduced in Sec. 4.7, the implementation of such a generator is straightforward and does not require a new architecture. The main difference is that the output $y_t$ of the network given the data point $x_t$ is a guess of the subsequent data point $x_{t+1}$ instead of the class to which the whole sequence belongs. This means in particular that the input and output size are now the same. For training this network, we can once again use the cross-entropy or (negative) log-likelihood as a loss function,
$L_{ent} = -\sum_{t=1}^{m-1} x_{t+1} \cdot \ln(y_t)$,   (5.13)
where $x_{t+1}$ is now the 'label' for the input $x_t$, $y_t$ is the output of the network, and $t$ runs over the input sequence of length $m$. This training is schematically shown in Fig. 24.
For generating a new sequence, it is enough to have one single input point $x_1$ to start the sequence. Note that since we can now start with a single data point $x_1$ and generate a whole sequence of data points $\{y_t\}$, this mode of using an RNN is referred to as one-to-many. This sequence generation is shown in Fig. 24, right.
To illustrate the concept of sequence generation using recurrent neural networks, we use an RNN to generate new molecules. The first question we need to address is how to encode a chemical structure into input data (of sequential form, no less) that a machine learning model can read. A common representation of molecular graphs used in chemistry is the simplified molecular-input line-entry system, or SMILES. Figure 25 shows examples of such SMILES strings for the caffeine, ethanol, and aspirin molecules. We can use the dataset
Molecular Sets (https://github.com/molecularsets/moses), which contains a large collection of molecules represented as SMILES strings.
CN1C=NC2=C1C(=O)N(C(=O)N2C)C (caffeine), CC(=O)OC1=CC=CC=C1C(=O)O (aspirin), CCO (ethanol)
Figure 25:
SMILES.
Examples of molecules and their representation in SMILES.
To use these strings as training data, we one-hot encode the characters and feed each character separately to the RNN. This creates a map from characters in SMILES strings onto an array of numbers. Finally, in order to account for the variable size of the molecules and hence, the variable length of the strings, we introduce a 'stop' character such that the network learns, and later generates, sequences of arbitrary length.
We are now ready to use the SMILES strings for training our network as described above, where the input is a one-hot-encoded vector and the output is again a vector of the same size. Note, however, that similar to a classification task, the output vector is a probability distribution over the characters the network believes could come next. Unlike in a classification task, where we consider the largest output the best guess of the network, here we sample in each step from the probability distribution $y_t$ to again have a one-hot-encoded vector for the input of the next step.
Autoencoders are neuron-based generative models, initially introduced for dimensionality reduction. The original purpose, thus, is similar to that of PCA or t-SNE that we already encountered in Sec. 2, namely the reduction of the number of features that describe our input data. Unlike for PCA, where we have a clear recipe for how to reduce the number of features, an autoencoder learns the best way of achieving the dimensionality reduction. An obvious question, however, is how to measure the quality of the compression, which is essential for the definition of a loss function and thus, for training.
In the case of t-SNE, we introduced two probability distributions based on the distance of samples in the original and feature space, respectively, and minimized their difference, for example using the Kullback-Leibler divergence.
The solution the autoencoder uses is to have a neural network perform first the dimensionality reduction, or encoding to the latent space, $x \mapsto e(x) = z$, and then the decoding back to the original dimension, $z \mapsto d(z)$, see Fig. 26. This architecture allows us to directly compare the original input $x$ with the reconstructed output $d(e(x))$, such that the autoencoder can be trained unsupervised by minimizing the difference. A good example of a loss function that achieves successful training and that we have already encountered several times is the cross-entropy,
$L_{ae} = -\sum_i x_i \cdot \ln[d(e(x_i))]$.   (5.14)
In other words, we compare point-wise the difference between the input to the encoder with the decoder's output.
Figure 26:
General autoencoder architecture.
A neural network is used to construct a compressed representation of the input in the latent space. A second neural network is used to reconstruct the original input.
Intuitively, the latent space with its lower dimension presents a bottleneck for the information propagation from input to output. The goal of training is to find and keep the most relevant information for the reconstruction to be optimal. The latent space then corresponds to the reduced space in PCA and t-SNE. Note that much like in t-SNE but unlike in PCA, the new features are in general not independent.
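A minimal sketch of the encode-decode structure and the loss (5.14) is given below. The single linear layers, the sigmoid on the decoder output (so the reconstruction lies in $(0, 1)$ and the logarithm is well defined), and the dimensions are all illustrative assumptions; in practice both maps are deeper, trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(q):
    return 1 / (1 + np.exp(-q))

# Untrained encoder/decoder pair: 8-dimensional input, 2-dimensional latent space
W_enc = rng.normal(0, 0.5, (2, 8))
W_dec = rng.normal(0, 0.5, (8, 2))

def encode(x):
    return W_enc @ x          # x -> e(x) = z

def decode(z):
    return sigmoid(W_dec @ z) # z -> d(z), components in (0, 1)

# Reconstruction loss of Eq. (5.14) for a single sample
x = rng.random(8)
loss = -np.sum(x * np.log(decode(encode(x))))
```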
A major problem of the approach introduced in the previous section is its tendency to overfit. As an extreme example, a sufficiently complicated encoder-decoder pair could learn to map all data in the training set onto a single variable and back to the data. Such a network would indeed accomplish completely lossless compression and decompression. However, the network would not have extracted any useful information from the dataset and thus would completely fail to compress and decompress previously unseen data. Moreover, as in the case of the dimensionality-reduction schemes discussed in Sec. 2, we would like to analyze the latent space images and extract new information about the data. Finally, we might also want to use the decoder part of the autoencoder as a generator for new data. For these reasons, it is essential that we combat overfitting as we have done in the previous chapters, namely by regularization.
The question then becomes how one can effectively regularize the autoencoder. First, we need to analyze what properties we would like the latent space to fulfil. We can identify two main properties:
1. If two input data points are close (according to some measure), their images in the latent space should also be close. We call this property continuity.
2. Any point in the latent space should be mapped through the decoder onto a meaningful data point, a property we call completeness.
While there are principled ways to achieve regularization along similar paths as discussed in the previous section on supervised learning, we will discuss here a solution that is particularly useful as a generative model: the variational autoencoder (VAE).
The idea behind VAEs is for the encoder to output not just an exact point $z$ in the latent space, but a (factorized) normal distribution of points, $\mathcal{N}(\mu, \sigma)$. In particular, the output of the encoder comprises two vectors, the first representing the means, $\mu$, and the second the standard deviations, $\sigma$.
The input for the decoder
Figure 27:
Architecture of variational autoencoder.
Instead of outputting a point $z$ in the latent space, the encoder provides a distribution $\mathcal{N}(\mu, \sigma)$, parametrized by the means $\mu$ and the standard deviations $\sigma$. The input $z$ for the decoder is then drawn from $\mathcal{N}(\mu, \sigma)$.
is then sampled from this distribution, $z \sim \mathcal{N}(\mu, \sigma)$, and the original input is reconstructed and compared to the original input for training. In addition to the standard loss function comparing input and output of the VAE, we add a regularization term to the loss function such that the distributions from the encoder are close to a standard normal distribution $\mathcal{N}(0, 1)$. Using the Kullback-Leibler divergence, Eq. (2.20), to measure the deviation from the standard normal distribution, the full loss function then reads
$L_{vae} = -\sum_i x_i^{in} \ln x_i^{out} + \mathrm{KL}(\mathcal{N}(\mu_i, \sigma_i) \,\|\, \mathcal{N}(0, 1))$
$\phantom{L_{vae}} = -\sum_i x_i^{in} \ln x_i^{out} + \frac{1}{2} \sum_k [\sigma_{i,k}^2 + \mu_{i,k}^2 - 1 - \ln \sigma_{i,k}^2]$.   (5.15)
In this expression, the first term quantifies the reconstruction loss with $x_i^{in}$ the input to and $x_i^{out}$ the reconstructed data from the VAE. The second term is the regularization on the latent space for each input data point, which for two (diagonal) normal distributions can be simplified, see the second line of Eq. (5.15). This procedure regularizes the training through the introduction of noise, similar to the dropout layer in Sec. 4. However, the regularization here not only generically increases generalization, but also enforces the desired structure in the latent space.
The structure of a VAE is shown in Fig. 27. By enforcing the mean and variance structure of the encoder output, the latent space fulfills the requirements outlined above. This type of structure can then serve as a generative model for many different data types: anything from human faces to complicated molecular structures. Hence, the variational autoencoder goes beyond extracting information from a dataset and can be used for scientific discovery.
Note, finally, that the general structure of the variational autoencoder can be applied beyond the simple example above. For instance, a distribution other than the standard normal distribution can be enforced in the latent space, or a different neural network, such as an RNN, can be used as encoder and decoder.
Figure 28: Maximum likelihood approaches to generative modeling. (Taxonomy: maximum-likelihood generative models split into those with an explicit probability density, either tractable, such as fully visible belief nets, or approximate, whether variational, such as the variational autoencoder, or statistical, such as the Boltzmann machine, and those with an implicit probability density, such as the GAN.)
In this section, we will be concerned with a type of generative neural network, the generative adversarial network (GAN), which has gained high popularity in recent years. Before getting into the details of this method, we give a quick systematic overview of the types of generative methods, to place GANs in proper relation to them. (This overview follows the "NIPS 2016 Tutorial: Generative Adversarial Networks", Ian Goodfellow, arXiv:1701.00160.)

We restrict ourselves to methods that are based on the maximum likelihood principle. The role of the model is to provide an estimate $p_{\rm model}(x; \theta)$ of a probability distribution parametrized by parameters θ. The likelihood is the probability that the model assigns to the training data,

$$\prod_{i=1}^m p_{\rm model}(x_i; \theta), \quad (5.16)$$

where m is again the number of samples in the data $\{x_i\}$. The goal is to choose the parameters θ so as to maximize the likelihood. Thanks to the equality

$$\theta^* = \underset{\theta}{\operatorname{argmax}} \prod_{i=1}^m p_{\rm model}(x_i; \theta) = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^m \log p_{\rm model}(x_i; \theta), \quad (5.17)$$

we can just as well work with the sum of logarithms, which is easier to handle. As we explained previously (see the section on t-SNE), the maximization is equivalent to the minimization of the cross-entropy between two probability distributions: the 'true' distribution $p_{\rm data}(x)$ from which the data has been drawn and $p_{\rm model}(x; \theta)$. While we do not have access to $p_{\rm data}(x)$ in principle, we estimate it empirically as a distribution peaked at the m data points we have.

Methods can now be distinguished by the way $p_{\rm model}$ is defined and evaluated (see Fig. 28). We differentiate between models that define $p_{\rm model}(x)$ explicitly through some functional form.
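Eq. (5.17) can be illustrated with a small sketch of our own: we fit the mean of a unit-variance Gaussian model by maximizing the summed log-likelihood over a grid of candidate parameters (the product of Eq. (5.16) would underflow for large m, while the log-sum is well behaved):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # samples from p_data

def log_likelihood(theta, x):
    """Sum of log p_model(x_i; theta) for a unit-variance Gaussian model."""
    return np.sum(-0.5 * (x - theta)**2 - 0.5 * np.log(2 * np.pi))

# argmax over a grid of candidate parameters, Eq. (5.17)
thetas = np.linspace(0.0, 4.0, 401)
theta_star = thetas[np.argmax([log_likelihood(t, data) for t in thetas])]
print(theta_star)  # close to the true mean 2.0
```

For this simple model the maximum-likelihood estimate is, of course, just the sample mean; the grid search only serves to make the argmax of Eq. (5.17) explicit.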
They have the general advantage that maximization of the likelihood is rather straightforward, since we have direct access to this function. The downside is that the functional forms generically either limit the ability of the model to fit the data distribution or become computationally intractable.

Among these explicit density models, we can further distinguish between those that represent a computationally tractable density and those that do not. An example for tractable explicit density models are fully visible belief networks (FVBNs), which decompose the probability distribution over an n-dimensional vector x into a product of conditional probabilities,

$$p_{\rm model}(x) = \prod_{j=1}^n p_{\rm model}(x_j \mid x_1, \dots, x_{j-1}). \quad (5.18)$$

We can already see that, once we use the model to draw new samples, this is done one entry of the vector x at a time (first $x_1$ is drawn; then, knowing it, $x_2$ is drawn, etc.). This is computationally costly and not parallelizable, but it is useful for tasks that are anyway sequential (like the generation of human speech, where the so-called WaveNet employs FVBNs).

Models that encode an explicit density but require approximations to maximize the likelihood can either be variational in nature or use stochastic methods. We have seen examples of both. Variational methods define a lower bound to the log-likelihood which can be maximized,

$$\mathcal{L}(x; \theta) \leq \log p_{\rm model}(x; \theta). \quad (5.19)$$

The algorithm produces a maximum value of the log-likelihood that is at least as high as the value obtained for $\mathcal{L}$ (when summed over all data points). Variational autoencoders belong to this category. Their most obvious shortcoming is that $\mathcal{L}(x; \theta)$ may represent a very bad lower bound to the log-likelihood (and is in general not guaranteed to converge to it for infinite model size), so that the distribution represented by the model is very different from $p_{\rm data}$.
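The sequential sampling implied by Eq. (5.18) can be sketched as follows; this is a toy model of our own, with each conditional probability given by a simple logistic function of the previously drawn binary entries (not the architecture of an actual FVBN such as WaveNet):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
# Toy conditional model: p(x_j = 1 | x_1, ..., x_{j-1}) via fixed random weights
W = rng.normal(size=(n, n))

def sample_fvbn():
    """Draw one sample, one entry at a time, as in Eq. (5.18)."""
    x = np.zeros(n)
    for j in range(n):
        # The conditional probability depends only on the entries drawn so far
        p = 1.0 / (1.0 + np.exp(-W[j, :j] @ x[:j]))
        x[j] = rng.random() < p
    return x

print(sample_fvbn())  # a binary vector of length n
```

The loop makes the drawback discussed above explicit: entry j cannot be drawn before entries 1 through j−1 are known, so the sampling cannot be parallelized.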
Stochastic methods, in contrast, often rely on a Markov chain process: the model is defined by a probability $q(x' \mid x)$, from which the next sample x′ is drawn, which depends on the previously drawn sample x (but not any others). RBMs are an example of this. They have the advantage that there is a rigorously proven convergence to $p_{\rm model}$ for large enough RBMs, but the convergence may be slow. As with FVBNs, the drawing process is sequential and thus not easily parallelizable.

All these classes of models allow for explicit representations of (approximations of) the probability density function. In contrast, for GANs and related models, there is only indirect access to said probability density: the model allows us to sample from it. Naturally, this makes optimization potentially harder, but it circumvents many of the other previously mentioned problems. In particular:

• GANs can generate samples in parallel;

• there are few restrictions on the form of the generator function (as compared to Boltzmann machines, for instance, which have a restricted form to make Markov chain sampling work);

• no Markov chains are needed;

• no variational bound is needed, and some GAN model families are known to be asymptotically consistent (meaning that for a large enough model they are approximations to any probability distribution).

GANs have been immensely successful in several application scenarios. Their superiority over other methods is, however, often assessed subjectively. Most performance comparisons have been in the field of image generation, and largely on the ImageNet database.
Some of the standard tasks evaluated in this context are:

• generate an image from a sentence or phrase that describes its content ("a blue flower");

• generate realistic images from sketches;

• generate abstract maps from satellite photos;

• generate a high-resolution ("super-resolution") image from a lower-resolution one;

• predict the next frame in a video.

As far as more science-related applications are concerned, GANs have been used to

• predict the impact of climate change on individual houses;

• generate new molecules that have later been synthesized.

In the light of these examples, it is of fundamental importance to understand that GANs enable (and excel at) problems with multi-modal outputs. That means the problems are such that a single input corresponds to many different 'correct' or 'likely' outputs (in contrast to a mathematical function, which would always produce the same output). This is important to keep in mind in particular in scientific applications, where we often search for the one answer. Only if that is not the case can GANs actually play to their strengths.

Let us consider image super-resolution as an illustrative example: conventional (deterministic) methods of increasing the image resolution would necessarily lead to some blurring or artifacts, because the information that can be encoded in the finer pixel grid simply does not exist in the input data. A GAN, in contrast, will provide a possibility of how a realistic image could have looked if it had been taken with higher resolution. This way, it adds information that may differ from the true scene of the image that was taken, a process that obviously does not yield a unique answer, since many versions of the added information may correspond to a realistic image.
Figure 29: Architecture of a GAN. (The generator turns noise z into a fake sample; the discriminator receives real samples x and fake samples and outputs the probability for 'real'.)
The optimization of all neural-network models we discussed so far was formulated as the minimization of a cost function. For GANs, while such a formulation is also possible, a much more illuminating perspective is viewing the GAN as a game between two players, the generator (G) and the discriminator (D), see Fig. 29. The role of G is to generate, from some random input z drawn from a simple distribution, samples that could be mistaken for being drawn from $p_{\rm data}$. The task of D is to classify its input as generated by G or coming from the data. Training should improve the performance of both D and G at their respective tasks simultaneously. After training is completed, G can be used to draw samples that closely resemble those drawn from $p_{\rm data}$. In summary,

$$D_{\theta_D}: x \mapsto \text{binary true/false}, \qquad G_{\theta_G}: z \mapsto x, \quad (5.20)$$

where we have also indicated the two sets of parameters on which the two functions depend: $\theta_D$ and $\theta_G$, respectively. The game is then defined by two cost functions: the discriminator wants to minimize $J_D(\theta_D, \theta_G)$ by only changing $\theta_D$, while the generator wants to minimize $J_G(\theta_D, \theta_G)$ by only changing $\theta_G$. Each player's cost thus depends on both their own and the other player's parameters, the latter of which cannot be controlled by the player. The solution to this game optimization problem is a (local) minimum, i.e., a point in $(\theta_D, \theta_G)$-space where $J_D(\theta_D, \theta_G)$ has a local minimum with respect to $\theta_D$ and $J_G(\theta_D, \theta_G)$ has a local minimum with respect to $\theta_G$. In game theory, such a solution is called a Nash equilibrium. Let us now specify possible choices for the cost functions as well as for D and G.

The most important requirement on G is that it is differentiable. It thus can (in contrast to VAEs) not have discrete variables on the output layer. A typical representation is a deep (possibly convolutional) neural network; a popular deep convolutional architecture is called DCGAN. Then $\theta_G$ are the network's weights and biases.
The input z is drawn from some simple prior distribution, e.g., the uniform distribution or a normal distribution. (The specific choice of this distribution is secondary, as long as we use the same one during training and when we use the generator by itself.) It is important that z has at least as high a dimension as x if the full multi-dimensional $p_{\rm model}$ is to be approximated. Otherwise the model will perform some sort of dimensional reduction. Several tweaks have also been used, such as feeding some components of z into a hidden instead of the input layer and adding noise to hidden layers.

The training proceeds in steps. At each step, a minibatch of x is drawn from the data set and a minibatch of z is sampled from the prior distribution. Using these, gradient-descent-type updates are performed: one update of $\theta_D$ using the gradient of $J_D(\theta_D, \theta_G)$ and one of $\theta_G$ using the gradient of $J_G(\theta_D, \theta_G)$.

For D, the cost function of choice is the cross-entropy, as with standard binary classifiers that have sigmoid output. Given that the labels are '1' for data and '0' for z samples, it is simply

$$J_D(\theta_D, \theta_G) = -\sum_{i=1}^{N_1} \log D(x_i) - \sum_{j=1}^{N_2} \log\big(1 - D(G(z_j))\big), \quad (5.21)$$

where the sums over i and j run over the respective minibatches, which contain $N_1$ and $N_2$ points.

For G, more variations of the cost function have been explored. Maybe the most intuitive one is

$$J_G(\theta_D, \theta_G) = -J_D(\theta_D, \theta_G), \quad (5.22)$$

which corresponds to the so-called zero-sum or minimax game. Its solution is formally given by

$$\theta_G^\star = \underset{\theta_G}{\operatorname{arg\,min}}\; \underset{\theta_D}{\operatorname{max}}\, \big[-J_D(\theta_D, \theta_G)\big]. \quad (5.23)$$

This form of the cost is convenient for theoretical analysis, because there is only a single target function, which helps drawing parallels to conventional optimization. However, other cost functions have proven superior in practice.
The reason is that the minimization can get trapped very far from an equilibrium: when the discriminator manages to learn to reject generator samples with high confidence, the gradient of the generator will be very small, making its optimization very hard. Instead, we can use the cross-entropy also for the generator cost function (but this time from the generator's perspective),

$$J_G(\theta_D, \theta_G) = -\sum_{j=1}^{N_2} \log D(G(z_j)). \quad (5.24)$$

Now the generator maximizes the probability of the discriminator being mistaken. This way, each player still has a strong gradient when the player is losing the game. We observe that this version of $J_G(\theta_D, \theta_G)$ has no direct dependence on the training data. Of course, such a dependence is implicit via D, which has learned from the training data. This indirect dependence also acts like a regularizer, preventing overfitting: G has no possibility to directly 'fit' its output to the training data.

In closing this section, we comment on a few properties of GANs, which also mark frontiers for improvements. One global problem is that GANs are typically difficult to train: they require large training sets and are highly susceptible to hyperparameter fluctuations. It is currently an active topic of research to compensate for this with structural modifications and novel loss-function formulations.
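A minimal sketch of this training loop, using Eq. (5.21) for the discriminator and the non-saturating cost of Eq. (5.24) for the generator, is given below. To stay self-contained, we use a one-parameter generator (a shift of the prior), a two-parameter logistic discriminator, and numerical gradients; real GANs use deep networks trained by backpropagation, and all names here are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def J_D(theta_D, theta_G, x, z):
    """Discriminator cross-entropy, Eq. (5.21)."""
    D_real = sigmoid(theta_D[0] * x + theta_D[1])
    D_fake = sigmoid(theta_D[0] * (z + theta_G) + theta_D[1])
    return -np.sum(np.log(D_real)) - np.sum(np.log(1.0 - D_fake))

def J_G(theta_D, theta_G, z):
    """Non-saturating generator cost, Eq. (5.24)."""
    D_fake = sigmoid(theta_D[0] * (z + theta_G) + theta_D[1])
    return -np.sum(np.log(D_fake))

theta_D, theta_G, eta, eps = np.array([1.0, 0.0]), 0.0, 1e-3, 1e-6
for step in range(200):
    x = rng.normal(2.0, 1.0, size=32)   # minibatch of data
    z = rng.normal(0.0, 1.0, size=32)   # minibatch from the prior
    # numerical gradient step for the discriminator (update theta_D only)
    grad_D = np.array([(J_D(theta_D + eps * e, theta_G, x, z) -
                        J_D(theta_D - eps * e, theta_G, x, z)) / (2 * eps)
                       for e in np.eye(2)])
    theta_D -= eta * grad_D
    # numerical gradient step for the generator (update theta_G only)
    grad_G = (J_G(theta_D, theta_G + eps, z) -
              J_G(theta_D, theta_G - eps, z)) / (2 * eps)
    theta_G -= eta * grad_G

print(theta_G)  # the generated distribution drifts towards the data
```

Even in this drastically simplified setting, one can see the game structure: each player's update uses only the gradient with respect to its own parameters, while the other player's parameters enter its cost.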
Mode collapse — this may describe one of the most obvious problems of GANs: it refers to a situation where G does not explore the full space to which x belongs, but rather maps several inputs z to the same output. Mode collapse can be more or less severe. For instance, a G trained on generating images may always resort to certain fragments or patterns of images. A formal reason for mode collapse is when the simultaneous gradient descent gravitates towards a solution

$$\theta_G^\star = \underset{\theta_D}{\operatorname{arg\,max}}\; \underset{\theta_G}{\operatorname{min}}\, \big[-J_D(\theta_D, \theta_G)\big], \quad (5.25)$$

instead of the order in Eq. (5.23). (A priori it is not clear which of the two solutions is closer to what the algorithm actually does.) Note that the interchange of min and max in general corresponds to a different solution: it is now sufficient for G to always produce one (and the same) output that is classified as data by D with very high probability. Due to the mode-collapse problem, GANs are not good at exploring the full space of possible outputs ergodically. They rather produce a few very good possible outputs.

One strategy to fight mode collapse is called minibatch features. Instead of letting D rate one sample at a time, a minibatch of real and generated samples is considered at once. D then detects whether the generated samples are unusually close to each other.

Arithmetics with GANs — it has been demonstrated that GANs can do linear arithmetics with inputs to add or remove abstract features from the output. This has been demonstrated using a DCGAN trained on images of faces: the gender and the feature 'wearing glasses' can be added or subtracted and thus changed at will. Of course, such a result is only empirical; there is no formal mathematical theory for why it works.
Using GANs with labelled data — it has been shown that, if (partially) labelled data is available, using the labels when training D may improve the performance of G. In this constellation, G still has the same task as before and does not interact with the labels. If data with n classes exist, then D is constructed as a classifier for (n + 1) classes, where the extra class corresponds to 'fake' data that D attributes to coming from G. If a data point has a label, then this label is used as a reference in the cost function. If a data point has no label, then the first n outputs of D are summed up.

One-sided label smoothing — this technique is useful not only for the D in GANs, but also for other binary classification problems with neural networks. Often we observe that classifiers give proper results, but show a too confident probability. This overshooting confidence can be counteracted by one-sided label smoothing. The idea is to simply replace the target value for the real examples with a value slightly less than 1, e.g., 0.9. This smoothes the distribution of the discriminator. Why do we only perform this offset one-sided and not also give a small nonzero value β to the target values of the fake samples? If we were to do this, the optimal function for D would be

$$D^\star(x) = \frac{p_{\rm data}(x) + \beta\, p_{\rm model}(x)}{p_{\rm data}(x) + p_{\rm model}(x)}. \quad (5.26)$$
Consider now a range of x for which $p_{\rm data}(x)$ is small but $p_{\rm model}(x)$ is large (a 'spurious mode'). $D^\star(x)$ will have a peak near this spurious mode. This means D reinforces incorrect behavior of G, encouraging G to reproduce samples that it already makes (irrespective of whether they are anything like real data).
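A minimal sketch of one-sided label smoothing in the discriminator's cross-entropy (the function and variable names are our own):

```python
import numpy as np

def discriminator_loss(D_real, D_fake, smooth=0.9):
    """Binary cross-entropy with one-sided label smoothing:
    real targets are `smooth` (< 1), fake targets stay exactly 0."""
    loss_real = -np.mean(smooth * np.log(D_real)
                         + (1 - smooth) * np.log(1 - D_real))
    loss_fake = -np.mean(np.log(1 - D_fake))
    return loss_real + loss_fake

D_real = np.array([0.8, 0.95])   # discriminator outputs on real samples
D_fake = np.array([0.1, 0.3])    # discriminator outputs on fake samples
print(discriminator_loss(D_real, D_fake))
```

With the smoothed target, pushing the output on real samples all the way to 1 is no longer optimal; the loss is minimized at an output of 0.9 instead, which counteracts the overshooting confidence discussed above.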
In particular for applications in science, we not only want to obtain a neural network that excels at performing a given task, but we also seek an understanding of how the problem was solved. Ideally, we want to know the underlying principles, deduce causal relations, and identify abstracted notions. This is the topic of interpretability.
Exploring the weights and intermediate activations of a neural network in order to understand what the network has learnt very quickly becomes unfeasible or uninformative for large network architectures. In this case, we can try a different approach, where we focus on the inputs to the neural network rather than on the intermediate activations.

More precisely, let us consider a neural network classifier f, depending on the weights W, which maps an input x to a probability distribution over n classes, $f(x|W) \in \mathbb{R}^n$, i.e.,

$$f_i(x|W) \geq 0, \qquad \sum_i f_i(x|W) = 1. \quad (6.1)$$

We want to minimize the distance between the output of the network f(x) and a chosen target output $y_{\rm target}$, which can be done by minimizing the loss function

$$L = |f(x) - y_{\rm target}|. \quad (6.2)$$

However, unlike in supervised learning, where the loss minimization was done with respect to the weights W of the network, here we are interested in minimizing with respect to the input x while keeping the weights W fixed. This can be achieved using gradient descent, i.e.,

$$x \rightarrow x - \eta \frac{\partial L}{\partial x}, \quad (6.3)$$

where η is the learning rate. With a sufficient number of iterations, our initial input x will be transformed into a final input x* such that

$$f(x^*) \approx y_{\rm target}. \quad (6.4)$$

By choosing the target output to correspond to a particular class, e.g., $y_{\rm target} = (1, 0, 0, \dots)$, we are then essentially finding examples of inputs which the network would classify as belonging to the chosen class. This procedure is called dreaming.

Let us apply this technique to a binary classification example. We consider a dataset consisting of images of healthy and unhealthy plant leaves. Some samples from the dataset are shown in the top row of Fig. 30. After training a deep convolutional network to classify the leaves (reaching a test accuracy of around 95%), we start with a random image as our initial input x and perform gradient descent on the input, as described above, to arrive at a final image x* which our network confidently classifies.
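The update of Eq. (6.3) can be sketched with a toy differentiable classifier. Here we use a fixed logistic model as a stand-in for a trained deep network (the weights are random and purely illustrative), and, for convenience, the squared loss $L = (f(x) - y_{\rm target})^2$, whose input gradient can be written analytically:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=4)           # fixed 'trained' weights of a toy classifier

def f(x):
    """Toy binary classifier: probability of class 1."""
    return 1.0 / (1.0 + np.exp(-W @ x))

# 'Dreaming': gradient descent on the input x, weights W kept fixed
x = rng.normal(size=4)           # random initial input
y_target, eta = 1.0, 0.5
for _ in range(2000):
    p = f(x)
    # L = (p - y_target)^2, with dL/dx = 2 (p - y_target) p (1 - p) W
    x -= eta * 2 * (p - y_target) * p * (1 - p) * W

print(f(x))  # close to the chosen target output 1
```

The same loop, run with an automatic-differentiation framework in place of the analytic gradient, is what produces the 'dreamed' images in Fig. 30.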
Source: https://data.mendeley.com/datasets/tywbtsjrjv/1
Figure 30: Plant leaves. Top: some samples from the plants dataset, labeled 'Healthy' or 'Unhealthy'. Bottom: samples generated using the 'dreaming' procedure starting from random noise, classified as 'Healthy' (99% confidence), 'Unhealthy' (98% confidence), and 'Healthy' (97% confidence).

In the bottom row of Fig. 30, we show three examples produced using the 'dreaming' technique. At first sight, it might be astonishing that the final images do not even remotely resemble a leaf. How can it be that the network has such a high accuracy of around 95%, yet we have here confidently classified images which are essentially just noise? Although this seems surprising, a closer inspection reveals the problem: the noisy image x* looks nothing like the samples in the dataset with which we trained our network. By feeding this image into our network, we are asking it to make an extrapolation, which, as can be seen, leads to uncontrolled behavior. This is a key issue which plagues most data-driven machine learning approaches. With few exceptions, it is very difficult to train a model capable of performing reliable extrapolations. Since scientific research is often in the business of making extrapolations, this is an extremely important point of caution to keep in mind.

While it might seem obvious that any model should only be predictive for data that 'resembles' those in the training set, the precise meaning of 'resembles' is actually more subtle than one imagines. For example, if one trains an ML model using a dataset of images captured with a Canon camera but subsequently decides to use the model to make predictions on images taken with a Nikon camera, we could actually be in for a surprise. Even though the images may 'resemble' each other to our naked eye, the different cameras can have different noise profiles which might not be perceptible to the human eye. We shall see in the next section that even such minute image distortions can be sufficient to completely confuse our model.

As we have seen, it is possible to modify the input x so that the corresponding model output approximates a chosen target.
This concept can also be applied to generate adversarial examples, i.e., images which have been intentionally modified to cause a model to misclassify them. In addition, we usually want the modification to be minimal, or almost imperceptible to the human eye.

One common method for generating adversarial examples is known as the fast gradient sign method (FGSM). Starting from an input x which our model correctly classifies, we choose a target output $y_{\rm target}$ which corresponds to a wrong classification, and follow the procedure described in the previous section with a slight modification. Instead of updating the input according to Eq. (6.3), we use the update rule

$$x \rightarrow x - \eta\, \operatorname{sign}\!\left(\frac{\partial L}{\partial x}\right), \quad (6.5)$$

where L is given by Eq. (6.2). The $\operatorname{sign}(\dots) \in \{-1, 1\}$ both serves to enhance the signal and acts as a constraint to limit the size of the modification. By choosing $\eta = \epsilon/T$ and performing only T iterations, we can then guarantee that each component of the final input x* satisfies

$$|x_i^* - x_i| \leq \epsilon, \quad (6.6)$$

which is important, since we want our final image x* to be only minimally modified. We summarize this algorithm as follows:

Algorithm 5: Fast Gradient Sign Method
  Input: a classification model f, a loss function L, an initial image x, a target label $y_{\rm target}$, a perturbation size ε, and a number of iterations T
  Output: an adversarial example x* with $|x_i^* - x_i| \leq \epsilon$
  η = ε/T
  for i = 1 … T do
      x = x − η sign(∂L/∂x)
  end

This process of generating adversarial examples is called an adversarial attack, which we can classify under two broad categories: white-box and black-box attacks. In a white-box attack, the attacker has full access to the network f and is thus able to compute or estimate the gradients with respect to the input. On the other hand, in a black-box attack, the adversarial examples are generated without using the target network f. In this case, a possible strategy for the attacker is to train their own model g, find an adversarial example for that model, and use it against the target f without actually having access to it. Although it might seem surprising, this strategy has been found to work, albeit with a lower success rate compared to white-box methods. We shall illustrate these concepts in the example below.

We shall use the same plant-leaves classification example as above. The target model f which we want to 'attack' is a pretrained model using Google's well-known InceptionV3 deep convolutional neural network, containing over 20 million parameters. (This is an example of transfer learning: the base model, InceptionV3, has been trained on a different classification dataset, ImageNet, with over 1000 classes. To apply this network to our binary classification problem, we simply replace the top layer with a simple two-output dense softmax layer. We keep the weights of the base model fixed and only train the top layer.) The model achieved a test accuracy of ∼95%. Given full access to the model f, we can then consider a white-box attack. Starting from an image in the dataset which the target model correctly classifies and applying the fast gradient sign method (Alg. 5) with ε = 0.01 and T = 1, we obtain an adversarial image which differs from the original image by an almost imperceptible amount of noise, as depicted on the left of Fig. 31. Any human would still correctly identify the image, and yet the network, which has around 95% accuracy, has completely failed.

Figure 31: Adversarial examples. Generated using the fast gradient sign method with T = 1 iteration and ε = 0.01. The target model is Google's InceptionV3 deep convolutional network with a test accuracy of ∼95% on the binary ('Healthy' vs 'Unhealthy') plants dataset.

If, however, the gradients and outputs of the target model f are hidden, the above white-box attack strategy becomes unfeasible. In this case, we can adopt the following black-box attack strategy: we train a secondary model g and then apply the FGSM algorithm to g to generate adversarial examples for g. Note that it is not necessary for g to have the same network architecture as the target model f. In fact, it is possible that we do not even know the architecture of our target model.

Let us consider another pretrained network based on MobileNet, containing about 2 million parameters. After retraining the top classification layer of this model, we generate adversarial examples for g and test them on the target model f.
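Algorithm 5 takes only a few lines to implement. The sketch below attacks a toy logistic classifier (our own stand-in for a deep network), using, for convenience, the squared loss $L = (f(x) - y_{\rm target})^2$, whose input gradient is analytic:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=16)          # toy 'trained' classifier weights

def f(x):
    """Probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-W @ x))

def fgsm(x, y_target, eps=0.01, T=1):
    """Fast gradient sign method, Alg. 5; guarantees |x*_i - x_i| <= eps."""
    eta = eps / T
    x_adv = x.copy()
    for _ in range(T):
        p = f(x_adv)
        grad = 2 * (p - y_target) * p * (1 - p) * W  # dL/dx for L = (f-y)^2
        x_adv -= eta * np.sign(grad)
    return x_adv

x = rng.normal(size=16)
# Target the opposite of the current prediction
x_adv = fgsm(x, y_target=1 - round(f(x)), eps=0.05, T=1)
print(np.max(np.abs(x_adv - x)))  # never exceeds eps
```

Note how the sign of the gradient, rather than the gradient itself, fixes the per-component step size, which is what makes the bound of Eq. (6.6) possible.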
Figure 32: Black-box adversarial attack.

We notice a significant drop in the accuracy, as shown in the graph on the right of Fig. 32. The fact that the drop in accuracy is greater for the black-box generated adversarial images than for images with random noise (of the same scale) added to them shows that adversarial images have some degree of transferability between models. As a side note, on the left of Fig. 32 we observe that black-box attacks are more effective when only T = 1 iteration of the FGSM algorithm is used, contrary to the situation for the white-box attack. This is because, with more iterations, the method has a tendency to overfit the secondary model, resulting in adversarial images which are less transferable.

These forms of attacks highlight a serious vulnerability of data-driven machine learning techniques. Defending against such attacks is an active area of research, but it is largely a cat-and-mouse game between attacker and defender.

Previously, we have learned about the broad scope of applications of generative models. We have seen that autoencoders can serve as powerful generative models in the scientific context by extracting a compressed representation of the input and using it to generate new instances of the problem. It turns out that for simple enough problems one can find a meaningful interpretation of the latent representation, which may be novel enough to give us new insight into the problem we are analyzing.

In 2020, the group of Renato Renner considered a machine-learning perspective on one of the most historically important problems in physics: Copernicus' heliocentric system of the solar orbits. Via a series of careful and precise measurements of the positions of objects in the night sky, Copernicus conjectured that the Sun is the center of the solar system and the other planets orbit around it.
Let us now ask the following question: is it possible to build a neural network that receives the same observation angles Copernicus did and deduces the same conclusion from them?

Renner's group fed into an autoencoder the angles of Mars and the Sun as observed from Earth ($\alpha_{ES}$ and $\alpha_{EM}$ in Fig. 33) at certain times and asked the autoencoder to predict the angles at other times. When analyzing the trained model, they realized that the two latent neurons included in their model store information in heliocentric coordinates! In particular, one observes that the information stored in the latent space is a linear combination of the angle between Sun and Mars, $\gamma_{SM}$, and the angle between Sun and Earth, $\gamma_{SE}$. In other words, just like Copernicus, the
Figure 33: The Copernicus problem. Relation between the angles in the heliocentric and geocentric coordinate systems.

autoencoder has learned that the most efficient way to store the information given is to transform it into the heliocentric coordinate system.

While this fascinating example is a great way to show that generative models can be interpreted in some important cases, in general the question of interpretability is still very much open and subject to ongoing research. In the instances discussed earlier in this book, like the generation of molecules, where the input is compressed through several layers of transformations requiring a complex dictionary and the dimension of the latent space is high, interpreting the latent space becomes increasingly challenging.
In the previous sections, we have introduced data-based learning, where we are given a dataset $\{x_i\}$ for training. Depending on whether we are given labels $y_i$ with each data point, we have further divided our learning task as either being supervised or unsupervised, respectively. The aim of machine learning is then to classify unseen data (supervised), or to extract useful information from the data and generate new data resembling the data in the given dataset (unsupervised). However, the concept of learning as commonly understood certainly encompasses other forms of learning that do not fall into these data-driven categories.

An example of a form of learning not obviously covered by supervised or unsupervised learning is learning how to walk: a child that learns how to walk does not first collect data on all possible ways of successfully walking in order to extract rules on how to walk best. Rather, the child performs an action, sees what happens, and then adjusts its actions accordingly. This kind of learning thus happens best 'on the fly', in other words while performing the attempted task. Reinforcement learning formalizes this different kind of learning and introduces suitable (computational) methods.

As we will explain in the following, the framework of reinforcement learning considers an agent that interacts with an environment through actions, which, on the one hand, change the state of the agent and, on the other hand, lead to a reward. Whereas we tried to minimize a loss function in the previous sections, the main goal of reinforcement learning is to maximize this reward by learning an appropriate policy. One way of reformulating this task is to find a value function, which associates to each state (or state-action pair) a value, or expected total reward. Note that, importantly, to perform our learning task we do not require knowledge of the environment, i.e., a model.
All that is needed is feedback on our actions in the form of a reward signal and a new state. We stress again that in the following we study methods that learn at each time step. One could also devise methods where an agent tries a policy many times and judges only the final outcome.

The framework of reinforcement learning is very powerful and versatile. Examples include:

• We can train a robot to perform a task, such as using an arm to collect samples. The state of the agent is the position of the robot arm, the actions move the arm, and the agent receives a reward for each sample collected.

• We can use reinforcement learning to optimize experiments, such as chemical reactions. In this case, the state contains the experimental conditions, such as temperature, solvent composition, or pH, and the actions are all possible ways of changing these state variables. The reward is a function of the yield, the purity, or the cost. Note that reinforcement learning can be used at several levels of this process: while one agent might be trained to target the experimental conditions directly, another agent could be trained to reach the target temperature by adjusting the current running through a heating element.

• We can train an agent to play a game, with the state being the current state of the game and a reward being received once for winning. The most famous example of such an agent is Google's AlphaGo, which outperforms humans in the
game of Go. A possible way of applying reinforcement learning in the sciences is to phrase a problem as a game. An example, where such rephrasing was successfully applied, is error correction for (topological) quantum computers.

In the following, we will use a toy example to illustrate the concepts introduced: we want to train an agent to help us with the plants in our lab. In particular, the state of the agent is the water level. The agent can turn a growth lamp on and off, and it can send us a message if we need to show up to water the plants. Obviously, we would like to optimize the growth of the plants and not have them die.

As a full discussion of reinforcement learning goes well beyond the scope of this lecture, we will focus in the following on the main ideas and terminology with no claim of completeness.
We begin our discussion with a simple example that demonstrates some important aspects of reinforcement learning. In particular, we discuss a situation, where the reward does not depend on a state, but only on the action taken. The agent is a doctor, who has to choose from $n$ actions, the treatments, for a given disease, with the reward depending on the recovery of the patient. The doctor 'learns on the job' and tries to find the best treatment. The value of a treatment $a \in \mathcal{A}$ is denoted by $q_*(a) = E(r)$, the expectation value of our reward.

Unfortunately, there is an uncertainty in the outcome of each treatment, such that it is not enough to perform each treatment just once to know the best one. Rather, only by performing a treatment many times do we find a good estimate $Q_t(a) \approx q_*(a)$. Here, $Q_t(a)$ is our estimate of the value of $a$ after $t$ (time-) steps. Obviously, we should not perform a bad treatment many times, only to have a better estimate for its failure. We could instead try each action once and then continue for the rest of the time with the action that performed best. This strategy is called a greedy method and exploits our knowledge of the system. Again, this strategy bears risks, as the uncertainty in the outcome of the treatment means that we might use a suboptimal treatment. It is thus crucial to explore other actions. This dilemma is called the 'conflict between exploration and exploitation'. A common strategy is to use the best known action $a^* = \mathrm{argmax}_a Q_t(a)$ most of the time, but with probability $\epsilon$ choose randomly one of the other actions. This strategy of choosing the next action is called $\epsilon$-greedy.

After this introductory example, we introduce the idealized form of reinforcement learning with a Markov decision process (MDP). At each time step $t$, the agent starts from a state $S_t \in \mathcal{S}$, performs an action $A_t \in \mathcal{A}$, which, through interaction with the environment, leads to a reward $R_{t+1} \in \mathcal{R}$ and moves the agent to a new state $S_{t+1}$.
This agent-environment interaction is schematically shown in Fig. 34. Note that we assume the spaces of all actions, states, and rewards to be finite, such that we speak of a finite MDP.
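The $\epsilon$-greedy strategy for the treatment example above can be sketched in a few lines. Note that the number of actions, their success probabilities, and all names below are invented for illustration:

```python
import random

# Hypothetical success rates of n = 3 treatments; unknown to the agent.
TRUE_VALUES = [0.2, 0.5, 0.4]

def reward(action):
    """Noisy reward: the patient recovers (r = 1) with the treatment's success rate."""
    return 1.0 if random.random() < TRUE_VALUES[action] else 0.0

def epsilon_greedy(n_actions=3, n_steps=10_000, eps=0.1):
    Q = [0.0] * n_actions  # running estimates Q_t(a)
    N = [0] * n_actions    # number of times each action was chosen
    for _ in range(n_steps):
        if random.random() < eps:  # explore: pick a random action
            a = random.randrange(n_actions)
        else:                      # exploit: a* = argmax_a Q_t(a)
            a = max(range(n_actions), key=lambda i: Q[i])
        r = reward(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental estimate of the mean reward
    return Q

random.seed(0)
print(epsilon_greedy())  # estimates approach TRUE_VALUES
```

After enough steps, the greedy action coincides with the best treatment, while the $\epsilon$-exploration keeps refining the estimates of the other actions.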
Figure 34:
Markov decision process.
Schematic of the agent-environment interaction.

For our toy example, the sensor we have only shows whether the water level is high (h) or low (l), so that the state space of our agent is $\mathcal{S} = \{h, l\}$. In both cases, our agent can choose to turn the growth lamps on or off, or, in the case of low water, it can choose to send us a message so we can go and water the plants. The available actions are thus $\mathcal{A} = \{on, off, text\}$. When the growth lamps are on, the plants grow faster, which leads to a bigger reward, $r_{on} > r_{off} > 0$. Furthermore, there is a penalty for texting us, but an even bigger penalty for letting the plants die, $0 > r_{text} > r_{fail}$.

A model of the environment provides the probability $p(s', r | s, a)$ of ending in state $s'$ with reward $r$, starting from a state $s$ and choosing the action $a$. With this model, the dynamics of the Markov decision process is completely characterized. Note that the process is a Markov process, since the next state and reward only depend on the current state and chosen action.

In our toy example, being in state 'high' and having the growth lamp on will provide a reward of $r_{on}$ and keep the agent in 'high' with probability $p(h, r_{on} | h, on) = \alpha$, while with probability $1 - \alpha$ the agent will end up with a low water level. However, if the agent turns the lamps off, the reward is $r_{off}$ and the probability of staying in state 'high' is $\alpha' > \alpha$. For the case of a low water level, the probability of staying in 'low' despite the lamps being on is $p(l, r_{on} | l, on) = \beta$, which means that with probability $1 - \beta$, our plants run out of water. In this case, we will need to get new plants and we will water them, of course, such that $p(h, r_{fail} | l, on) = 1 - \beta$. As with high water levels, turning the lamps off reduces our reward, but increases our chance of not losing the plants, $\beta' > \beta$. Finally, if the agent should choose to send us a text, we will refill the water, such that $p(h, r_{text} | l, text) = 1$. The whole Markov process is summarized in the transition graph in Fig. 35.

From the probability for the next reward and state, we can also calculate the expected reward starting from state $s$ and choosing action $a$, namely

$$r(s, a) = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r | s, a). \qquad (7.1)$$

Obviously, the value of an action now depends on the state the agent is in, such that we write $q_*(s, a)$.
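The environment model of our toy example can be written down explicitly as a table of probabilities $p(s', r | s, a)$, from which the expected reward of Eq. (7.1) follows by summation. The numerical values for $\alpha$, $\alpha'$, $\beta$, $\beta'$ and the rewards below are invented for illustration:

```python
# Illustrative parameters: staying probabilities and rewards with
# r_on > r_off > 0 > r_text > r_fail.
alpha, alpha_p = 0.8, 0.9   # lamps on / off in state 'high'
beta, beta_p = 0.6, 0.8     # lamps on / off in state 'low'
r_on, r_off, r_text, r_fail = 1.0, 0.5, -0.2, -3.0

# p[(s, a)] lists the possible outcomes as (probability, next state, reward).
p = {
    ('high', 'on'):  [(alpha, 'high', r_on), (1 - alpha, 'low', r_on)],
    ('high', 'off'): [(alpha_p, 'high', r_off), (1 - alpha_p, 'low', r_off)],
    ('low', 'on'):   [(beta, 'low', r_on), (1 - beta, 'high', r_fail)],
    ('low', 'off'):  [(beta_p, 'low', r_off), (1 - beta_p, 'high', r_fail)],
    ('low', 'text'): [(1.0, 'high', r_text)],
}

def expected_reward(s, a):
    """r(s, a) of Eq. (7.1): sum over all next states and rewards."""
    return sum(prob * r for prob, s_next, r in p[(s, a)])

print(expected_reward('high', 'on'))  # r_on, since the reward is certain here
print(expected_reward('low', 'on'))   # beta * r_on + (1 - beta) * r_fail
```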
Alternatively, we can also assign to each state a value $v_*(s)$, which quantifies the optimal reward obtainable from this state.

Finally, we can define what we want to accomplish by learning: knowing our current state $s$, we want to know what action to choose such that our future total reward is maximized. Importantly, we want to accomplish this without any prior knowledge of how to optimize rewards directly. This poses yet another question: what is the total reward? We usually distinguish tasks with a well-defined end point
Figure 35:
Transition graph of the MDP for the plant-watering agent.
The states 'high' and 'low' are denoted with large circles, the actions with small black circles, and the arrows correspond to the probabilities and rewards.

$t = T$, so-called episodic tasks, from continuous tasks that go on forever. The total reward for the former is simply the total return

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T. \qquad (7.2)$$

As such a sum is not guaranteed to converge for a continuous task, the total reward there is the discounted return

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad (7.3)$$

with $0 \leq \gamma < 1$ the discount rate. An episodic task can be described in this framework by setting $\gamma = 1$ and $R_t = 0$ for $t > T$. Note that for rewards which are bounded, the sum in Eq. (7.3) is guaranteed to converge to a finite value.

A policy $\pi(a|s)$ is the probability of choosing the action $a$ when in state $s$. We can thus formulate our learning task as finding the policy that maximizes our reward, and reinforcement learning as adapting an agent's policy as a result of its experience. For a given policy, we can define the value function of a state $s$ as the expected return from starting in that state and using the policy function $\pi$ for choosing all our future actions. We can write this as

$$v_\pi(s) \equiv E_\pi(G_t | S_t = s). \qquad (7.4)$$

Alternatively, we can define the action-value function of $\pi$ as

$$q_\pi(s, a) \equiv E_\pi(G_t | S_t = s, A_t = a). \qquad (7.5)$$

This is the expectation value for the return starting in state $s$ and choosing action $a$, but using the policy $\pi$ for all future actions. Note that one of the key ideas of reinforcement learning is to use such value functions, instead of the policy, to organize our learning process.
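For any finite reward sequence, the discounted return of Eq. (7.3) is a simple sum; a minimal sketch with arbitrary example rewards (the function name and values are ours, for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t of Eq. (7.3): sum over gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.5, 0.5, 2.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
print(discounted_return(rewards, gamma=0.9))  # 1 + 0.45 + 0.405 + 1.458 = 3.313
print(discounted_return(rewards, gamma=1.0))  # undiscounted episodic return, Eq. (7.2)
```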
The value function of Eq. (7.4) satisfies a self-consistency equation,

$$v_\pi(s) = E_\pi(R_{t+1} + \gamma G_{t+1} | S_t = s) \qquad (7.6)$$
$$= \sum_a \pi(a|s) \sum_{s', r} p(s', r | s, a) [r + \gamma v_\pi(s')]. \qquad (7.7)$$

This equation, known as the Bellman equation, relates the value of state $s$ to the expected reward and the (discounted) value of the next state after having chosen an action under the policy $\pi(a|s)$.

As an example, we can write the Bellman equation for the strategy of always leaving the lamps on in our toy model. Then, we find the system of linear equations

$$v_{on}(h) = p(h, r_{on} | h, on)[r_{on} + \gamma v_{on}(h)] + p(l, r_{on} | h, on)[r_{on} + \gamma v_{on}(l)] = r_{on} + \gamma[\alpha v_{on}(h) + (1 - \alpha) v_{on}(l)], \qquad (7.8)$$
$$v_{on}(l) = \beta[r_{on} + \gamma v_{on}(l)] + (1 - \beta)[r_{fail} + \gamma v_{on}(h)], \qquad (7.9)$$

from which we can easily solve for $v_{on}(h)$ and $v_{on}(l)$.

Instead of calculating the value function for all possible policies, we can directly try to find the optimal policy $\pi_*$, for which $v_{\pi_*}(s) > v_\pi(s)$ for all policies $\pi$ and $s \in \mathcal{S}$. For this policy, we find the Bellman optimality equations

$$v_*(s) = \max_a q_{\pi_*}(s, a) = \max_a E(R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a) \qquad (7.10)$$
$$= \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_*(s')]. \qquad (7.11)$$

Importantly, the Bellman optimality equations do not depend on the actual policy anymore. As such, Eq. (7.11) defines a non-linear system of equations, which for a sufficiently simple MDP can be solved explicitly. For our toy example, the two equations for the value function are

$$v_*(h) = \max \begin{cases} r_{on} + \gamma[\alpha v_*(h) + (1 - \alpha) v_*(l)] \\ r_{off} + \gamma[\alpha' v_*(h) + (1 - \alpha') v_*(l)] \end{cases} \qquad (7.12)$$

and

$$v_*(l) = \max \begin{cases} \beta[r_{on} + \gamma v_*(l)] + (1 - \beta)[r_{fail} + \gamma v_*(h)] \\ \beta'[r_{off} + \gamma v_*(l)] + (1 - \beta')[r_{fail} + \gamma v_*(h)] \\ r_{text} + \gamma v_*(h) \end{cases}. \qquad (7.13)$$

Note that equations equivalent to Eqs. (7.10) and (7.11) hold for the state-action value function,

$$q_*(s, a) = E(R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a) \qquad (7.14)$$
$$= \sum_{s', r} p(s', r | s, a)[r + \gamma \max_{a'} q_*(s', a')]. \qquad (7.15)$$

Once we know $v_*$, the optimal policy $\pi_*(a|s)$ is the greedy policy that chooses the action $a$ that maximizes the right-hand side of Eq. (7.11). If, instead, we know $q_*(s, a)$, then we can directly choose the action which maximizes $q_*(s, a)$, namely $\pi_*(a|s) = \mathrm{argmax}_a q_*(s, a)$, without looking one step ahead.
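For the always-on policy, Eqs. (7.8) and (7.9) form a 2x2 linear system that can be solved directly. The parameter values below are invented for illustration:

```python
import numpy as np

alpha, beta = 0.8, 0.6    # illustrative staying probabilities
r_on, r_fail = 1.0, -3.0  # illustrative rewards
gamma = 0.9               # discount rate

# Rearranging Eqs. (7.8)-(7.9) as A v = b with v = (v_on(h), v_on(l)):
A = np.array([[1 - gamma * alpha, -gamma * (1 - alpha)],
              [-gamma * (1 - beta), 1 - gamma * beta]])
b = np.array([r_on, beta * r_on + (1 - beta) * r_fail])
v_h, v_l = np.linalg.solve(A, b)
print(v_h, v_l)  # approximately 5.5 and 3.0 for these parameters
```

Plugging the solution back into Eqs. (7.8) and (7.9) confirms its self-consistency.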
While Eqs. (7.11) or (7.15) can be solved explicitly for a sufficiently simple system, such an approach, which corresponds to an exhaustive search, is often not feasible. In the following, we distinguish two levels of complexity: First, if the explicit solution is too hard, but we can still keep track of all possible value functions (we can choose either the state or the state-action value function), we can use a tabular approach. A main difficulty in this case is the evaluation of a policy, or prediction, which is needed to improve on the policy. While various methods for policy evaluation and policy improvement exist, we will discuss in the following an approach called temporal-difference learning. Second, in many cases the space of possible states is much too large to allow for a complete knowledge of all value functions. In this case, we additionally need to approximate the value functions. For this purpose, we can use the methods encountered in the previous chapters, such as (deep) neural networks.
If we cannot explicitly solve the Bellman optimality equations (the case most often encountered), then we need to find the optimal policy by some other means. If the state space is still small enough to keep track of all value functions, we can tabulate the value function for all the states and a given policy and thus speak of tabular methods. The most straightforward approach, referred to as policy iteration, proceeds in two steps: First, given a policy $\pi(a|s)$, the value function $v_\pi(s)$ is evaluated. Second, after this policy evaluation, we can improve on the given policy $\pi(a|s)$ using the greedy policy

$$\pi'(a|s) = \mathrm{argmax}_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_\pi(s')]. \qquad (7.16)$$

This second step is called policy improvement. The full policy iteration then proceeds iteratively,

$$\pi_0 \to v_{\pi_0} \to \pi_1 \to v_{\pi_1} \to \pi_2 \to \cdots, \qquad (7.17)$$

until convergence to $v_*$ and hence $\pi_*$. Note that, indeed, the Bellman optimality equation (7.11) is the fixed-point equation for this procedure.

Policy iteration requires a full evaluation of the value function of $\pi_k$ for every iteration $k$, which is usually a costly calculation. Instead of fully evaluating the value function under a fixed policy, we can also directly try to calculate the optimal value function by iteratively solving the Bellman optimality equation,

$$v^{[k+1]}(s) = \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v^{[k]}(s')]. \qquad (7.18)$$

Note that once we have converged to the optimal value function, the optimal policy is given by the greedy policy corresponding to the right-hand side of Eq. (7.16). An alternative way of interpreting this iterative procedure is to perform policy improvement every time we update the value function, instead of finishing the policy evaluation each time before policy improvement. This procedure is called value iteration and is an example of a generalized policy iteration, the idea of allowing policy evaluation and policy improvement to interact while learning.
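Value iteration, Eq. (7.18), can be carried out explicitly for the plant-watering toy model. The transition table and all numerical values below are invented for illustration:

```python
# Illustrative parameters; p[(s, a)] lists outcomes (probability, next state, reward).
alpha, alpha_p, beta, beta_p = 0.8, 0.9, 0.6, 0.8
r_on, r_off, r_text, r_fail = 1.0, 0.5, -0.2, -3.0
gamma = 0.9

p = {
    ('high', 'on'):  [(alpha, 'high', r_on), (1 - alpha, 'low', r_on)],
    ('high', 'off'): [(alpha_p, 'high', r_off), (1 - alpha_p, 'low', r_off)],
    ('low', 'on'):   [(beta, 'low', r_on), (1 - beta, 'high', r_fail)],
    ('low', 'off'):  [(beta_p, 'low', r_off), (1 - beta_p, 'high', r_fail)],
    ('low', 'text'): [(1.0, 'high', r_text)],
}
actions = {'high': ['on', 'off'], 'low': ['on', 'off', 'text']}

def backup(v, s, a):
    """One term of Eq. (7.18): expectation over p(s', r | s, a)."""
    return sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])

v = {'high': 0.0, 'low': 0.0}  # initial guess v^[0]
for _ in range(1000):          # iterate Eq. (7.18) until convergence
    v = {s: max(backup(v, s, a) for a in actions[s]) for s in v}

# The optimal policy is the greedy policy with respect to the converged v.
policy = {s: max(actions[s], key=lambda a: backup(v, s, a)) for s in v}
print(v, policy)
```

With these particular numbers, the greedy policy keeps the lamps on when the water level is high and sends a text when it is low.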
In the following, we want to use such a generalized policy iteration scheme for the (common) case, where we do not have a model for our environment. In this model-free case, we have to perform the (generalized) policy improvement using only our interactions with the environment. It is instructive to first think about how to evaluate a policy. We have seen in Eqs. (7.4) and (7.6) that the value function can also be written as an expectation value,

$$v_\pi(s) = E_\pi(G_t | S_t = s) \qquad (7.19)$$
$$= E_\pi(R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s). \qquad (7.20)$$

We can thus either try to directly sample the expectation value of the first line (this can be done using Monte Carlo sampling over possible state-action sequences), or we try to use the second line to iteratively solve for the value function. In both cases, we start from state $S_t$ and choose an action $A_t$ according to the policy we want to evaluate. The agent's interaction with the environment results in the reward $R_{t+1}$ and the new state $S_{t+1}$. Using the second line, Eq. (7.20), goes under the name temporal-difference learning and is in many cases the most efficient method. In particular, we make the updates

$$v_\pi^{[k+1]}(S_t) = v_\pi^{[k]}(S_t) + \alpha [R_{t+1} + \gamma v_\pi^{[k]}(S_{t+1}) - v_\pi^{[k]}(S_t)]. \qquad (7.21)$$

The expression in the brackets is the difference between our new estimate and the old estimate of the value function, and $0 < \alpha < 1$ is a learning rate. For the policy improvement, however, we need the state-action value function. The corresponding update is

$$q^{[k+1]}(S_t, a) = q^{[k]}(S_t, a) + \alpha [R_{t+1} + \gamma q^{[k]}(S_{t+1}, a') - q^{[k]}(S_t, a)], \qquad (7.22)$$

and the question is then what action $a$ we should take for the state-action pair to be updated, and what action $a'$ should be taken in the new state $S_{t+1}$.

Starting from a state $S_0$, we first choose an action $A_0$ according to a policy derived from the current estimate of the state-action value function, such as an $\epsilon$-greedy policy. For the first approach, we perform updates as

$$q^{[k+1]}(S_t, A_t) = q^{[k]}(S_t, A_t) + \alpha [R_{t+1} + \gamma q^{[k]}(S_{t+1}, A_{t+1}) - q^{[k]}(S_t, A_t)].$$
(7.23)

As above, we are provided a reward $R_{t+1}$ and a new state $S_{t+1}$ through our interaction with the environment. (Note that we assume here an episodic task; at the very beginning of training, we may initialize the state-action value function randomly.) To choose the action $A_{t+1}$, we again use a policy derived
from $Q^{[k]}(s = S_{t+1}, a)$. Since we are using the policy for choosing the action in the next state $S_{t+1}$, this approach is called on-policy. Further, since in this particular case we use the quintuple $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$, this algorithm is referred to as Sarsa. Finally, note that for the next step, we use $S_{t+1}, A_{t+1}$ as the state-action pair for which $q^{[k]}(s, a)$ is updated.

Alternatively, we only keep the state $S_t$ from the last step and first choose the action $A_t$ for the update using the current policy. Then, we choose our action from state $S_{t+1}$ in a greedy fashion, which effectively uses $Q^{[k]}(s = S_t, a)$ as an approximation for $q_*(s = S_t, a)$. This leads to

$$q^{[k+1]}(S_t, A_t) = q^{[k]}(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a q^{[k]}(S_{t+1}, a) - q^{[k]}(S_t, A_t)] \qquad (7.24)$$

and is a so-called off-policy approach. The algorithm, a variant of which is used in AlphaGo, is called Q-learning.

When the state-action space becomes very large, we face two problems: First, we cannot use tabular methods anymore, since we cannot store all values. Second, and more importantly, even if we could store all the values, the probability of visiting all state-action pairs with the above algorithms becomes increasingly small; in other words, most states will never be visited during training. Ideally, we should thus identify states that are 'similar', assign them 'similar' values, and choose 'similar' actions when in these states. This grouping of similar states is exactly the kind of generalization we tried to achieve in the previous sections. Not surprisingly, reinforcement learning is most successful when combined with neural networks.

In particular, we can parametrize a value function $\hat{v}_\pi(s; \theta)$ and try to find parameters $\theta$ such that $\hat{v}_\pi(s; \theta) \approx v_\pi(s)$.
This approximation can be done using the supervised-learning methods encountered in the previous sections, where the target, or label, is given by the new estimate. In particular, we can use the mean squared value error to formulate a gradient-descent method for an update procedure analogous to Eq. (7.21). Starting from a state $S$ and choosing an action $A$ according to the policy $\pi(a|S)$, the environment provides a reward $R$ and a new state $S'$, and we update the parameters

$$\theta^{[k+1]} = \theta^{[k]} + \alpha [R + \gamma \hat{v}_\pi(S'; \theta^{[k]}) - \hat{v}_\pi(S; \theta^{[k]})] \nabla \hat{v}_\pi(S; \theta^{[k]}), \qquad (7.25)$$

with $0 < \alpha < 1$ again a learning rate. Note that, even though the new estimate $R + \gamma \hat{v}_\pi(S'; \theta^{[k]})$ also depends on $\theta^{[k]}$, we only take the derivative with respect to the old estimate. This method is thus referred to as a semi-gradient method. In a similar fashion, we can reformulate the Sarsa algorithm introduced for generalized policy iteration.
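Coming back to the tabular setting, the off-policy Q-learning update of Eq. (7.24) can be sketched for the plant-watering toy model, where the agent learns purely from sampled interactions. All numerical values, as well as the sampling helper, are invented for illustration:

```python
import random

# Illustrative parameters; p[(s, a)] lists outcomes (probability, next state, reward).
alpha, alpha_p, beta, beta_p = 0.8, 0.9, 0.6, 0.8
r_on, r_off, r_text, r_fail = 1.0, 0.5, -0.2, -3.0
gamma, lr, eps = 0.9, 0.02, 0.1

p = {
    ('high', 'on'):  [(alpha, 'high', r_on), (1 - alpha, 'low', r_on)],
    ('high', 'off'): [(alpha_p, 'high', r_off), (1 - alpha_p, 'low', r_off)],
    ('low', 'on'):   [(beta, 'low', r_on), (1 - beta, 'high', r_fail)],
    ('low', 'off'):  [(beta_p, 'low', r_off), (1 - beta_p, 'high', r_fail)],
    ('low', 'text'): [(1.0, 'high', r_text)],
}
actions = {'high': ['on', 'off'], 'low': ['on', 'off', 'text']}

def step(s, a):
    """Sample (next state, reward) from the environment; the agent never sees p."""
    x, acc = random.random(), 0.0
    for prob, s2, r in p[(s, a)]:
        acc += prob
        if x < acc:
            return s2, r
    return p[(s, a)][-1][1:]  # guard against rounding

random.seed(1)
Q = {(s, a): 0.0 for s in actions for a in actions[s]}
s = 'high'
for _ in range(200_000):
    # eps-greedy behavior policy derived from the current Q
    if random.random() < eps:
        a = random.choice(actions[s])
    else:
        a = max(actions[s], key=lambda a_: Q[(s, a_)])
    s2, r = step(s, a)
    target = r + gamma * max(Q[(s2, a_)] for a_ in actions[s2])  # Eq. (7.24)
    Q[(s, a)] += lr * (target - Q[(s, a)])
    s = s2

policy = {st: max(actions[st], key=lambda a_: Q[(st, a_)]) for st in actions}
print(policy)  # the greedy policy learned from samples
```

In contrast to the tabular methods above, the agent here never uses the model $p(s', r | s, a)$ itself; the environment only returns samples.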
In this lecture, 'Introduction to Machine Learning for the Sciences', we have discussed common structures and algorithms of machine learning to analyze data or learn policies to achieve a given goal. Even though machine learning is often associated with neural networks, we have first introduced methods commonly known from statistical analysis, such as linear regression. Neural networks, which we used for most of this lecture, are much less controlled than these conventional methods. As an example, we do not try to find an absolute minimum in the optimization procedure, but one of many almost degenerate minima. This uncertainty might feel like a loss of control to a scientist, but it is crucial for the successful generalization of the trained network.

The goal of our discussions was not to provide the details needed for an actual implementation (as all standard algorithms are provided by standard libraries such as TensorFlow or PyTorch, this is indeed not necessary), but to give an overview over the most important terminology and the common algorithms. We hope that such an overview is helpful for reading the literature and deciding whether a given method is suitable for your own problems.

To help with the use of machine learning in your own research, here are a few lessons for a successful machine learner:

1. Your learning result can only be as good as your data set.
2. Understand your data, its structure and biases.
3. Try simpler algorithms first.
4. Don't be afraid of lingo. Not everything that sounds fancy actually is.
5. Neural networks are better at interpolating than extrapolating.
6. Neural networks represent smooth functions well, not discontinuous or spiky ones.

Regarding the use of machine learning in a scientific setting, several points should be kept in mind. First, unlike in many other applications, scientific data often exhibits specific structure, correlations, or biases, which are known beforehand.
It is thus important to use our prior knowledge in the construction of the neural network and the loss function. Second, there are many situations, where the output of the network has to satisfy conditions, such as symmetries, to be meaningful in a given context. This should ideally be included in the definition of the network. Finally, scientific analysis needs to be well defined and reproducible. Machine learning, with its intrinsic stochastic nature, does not easily satisfy these conditions. It is thus crucial to document carefully the architecture and all hyperparameters of a network and its training. The results should be compared to conventional statistical methods, and their robustness to variations in the structure of the network and the hyperparameters should be checked.
Figure 36: Machine Learning overview. A task-oriented overview chart relating the tasks of regression and classification (supervised learning), structure and generation (unsupervised learning), and optimal policy (reinforcement learning, MDP) to the methods discussed in these notes: linear, ridge, and logistic regression, SVM, dense, convolutional (spatial), and recurrent (sequential) neural networks, PCA, t-SNE, (V)AE, RBM, GAN, as well as the Bellman equation (brute force), Sarsa (on-policy), Q-learning (off-policy), and function approximation. ©MHF