Deep neural networks for classifying complex features in diffraction images
Julian Zimmermann, Bruno Langbehn, Riccardo Cucini, Michele Di Fraia, Paola Finetti, Aaron C. LaForge, Toshiyuki Nishiyama, Yevheniy Ovcharenko, Paolo Piseri, Oksana Plekan, Kevin C. Prince, Frank Stienkemeier, Kiyoshi Ueda, Carlo Callegari, Thomas Möller, Daniela Rupp
Max-Born-Institut für Nichtlineare Optik und Kurzzeitspektroskopie, 12489 Berlin, Germany
Institut für Optik und Atomare Physik, Technische Universität Berlin, 10623 Berlin, Germany
Elettra-Sincrotrone Trieste S.C.p.A., 34149 Trieste, Italy
ISM-CNR, Istituto di Struttura della Materia, LD2 Unit, 34149 Trieste, Italy
Physikalisches Institut, Universität Freiburg, 79104 Freiburg, Germany
Division of Physics and Astronomy, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan
European XFEL GmbH, 22869 Schenefeld, Germany
CIMAINA and Dipartimento di Fisica, Università degli Studi di Milano, 20133 Milano, Italy
Department of Chemistry and Biotechnology, Swinburne University of Technology, Victoria 3122, Australia
Institute of Multidisciplinary Research for Advanced Materials, Tohoku University, Sendai 980-8577, Japan
∗ [email protected]
(Dated: June 25, 2019)

Intense short-wavelength pulses from free-electron lasers and high-harmonic-generation sources enable diffractive imaging of individual nano-sized objects with a single x-ray laser shot. The enormous data sets with up to several million diffraction patterns represent a severe problem for data analysis, due to the high dimensionality of imaging data. Feature recognition and selection is a crucial step to reduce the dimensionality. Usually, custom-made algorithms are developed at considerable effort to approximate the particular features connected to an individual specimen, but facing different experimental conditions, these approaches do not generalize well. On the other hand, deep neural networks are the principal instrument for today's revolution in automated image recognition, a development that has not been adapted to its full potential for data analysis in science. We recently published in [Langbehn et al., Phys. Rev. Lett. 121, 255301 (2018)] the first application of a deep neural network as a feature extractor for wide-angle diffraction images of helium nanodroplets. Here we present the setup, our modifications and the training process of the deep neural network for diffraction image classification and its systematic benchmarking. We find that deep neural networks significantly outperform previous attempts at sorting and classifying complex diffraction patterns and are a significant improvement for the much-needed assistance during post-processing of large amounts of experimental coherent diffraction imaging data.

PACS numbers: 05.10.-a, 07.05.-t, 61.05.C-, 87.59.-e
Keywords: residual convolutional deep neural networks, coherent diffraction imaging, image classification
1. INTRODUCTION
Coherent diffraction imaging (CDI) experiments of single particles in free flight have been proven to be a significant asset in the pursuit of understanding the structural composition of nano-scaled matter [2-7]. While traditional microscopy methods are able to image fixated, substrate-grown or deposited individual particles [8-12], only CDI can combine high-resolution imaging with single particles in free flight in one experiment [13-15]. CDI became possible due to the recent advent of short-wavelength free-electron lasers (FELs) producing coherent high-intensity x-ray pulses of femtosecond duration [16]. However, CDI also comes with its own set of new challenges.

One of the growing problems of CDI experiments is the sheer amount of recorded data that has to be analyzed. The LINAC Coherent Light Source (LCLS), for instance, has a repetition rate of 120 Hz with typical hit-rates ranging from 1 % to 30 % [16-18], greatly depending on the performed experiment. The newly opened European XFEL will have an even higher maximum repetition rate of 27 000 Hz [19], which may add up to several million diffraction patterns in a single 12-hour shift. The idea of using neural networks for the classification of large numbers of scattering patterns was born out of the significant difficulties of analyzing large data sets of clusters [20], in particular metal clusters [21]. Moreover, the ability to analyze such data sets is sought after by the community in general [22]. For example, for the successful determination of 3D structures from a CDI data set using the expansion-maximization-compression algorithm [22-24], it is necessary to sample the 3D Fourier space up to the Nyquist rate for the desired resolution, and this for all sub-species contained in the target under study. The achievable resolution, as well as the chance for successful convergence of the algorithm, correlates directly with the number of diffraction patterns with a high signal-to-noise ratio [23]. Thus, huge data sets are taken, and as a consequence of the sheer amount of data, it is getting increasingly complicated to distill the high-quality data subsets that are suitable for subsequent analysis steps.

The enormous success of neural networks in the regime of image processing and classification provides a unique way of facing the imminent data-analysis bottleneck and reduces the impending problem to a mere domain adaptation from datasets used throughout the industry to ones that are used in CDI research. This work aims to be a stepping-stone towards this adaptation by providing an introduction to the theory of deep neural networks and analyzing how to best transfer and optimize these algorithms to the domain of scattering images. As a new baseline, we train a widely used deep neural network architecture, a residual convolutional deep neural network [25], in a supervised manner with a training set of manually labeled data. We then adapt the neural network to the domain of diffraction images and improve on the baseline performance by addressing the following issues:

1. Modification of the architecture to account for the specificities of diffraction images and thus optimize the prediction capabilities.
2. Determination of the appropriate size of the training dataset in order to keep the manual work of a researcher at a moderate level.
3. Mitigation of experimental artifacts, in particular noisy diffraction images.

Experience has shown that a researcher is able to relate diffraction patterns produced by similarly shaped particles of different sizes and orientations in context with each other. However, a programmatic description for a classification and sorting of these mostly similar patterns is almost impossible to achieve.

Figure 1 illustrates the case of two diffraction patterns captured from almost identical particles but under different orientations. Both patterns clearly show an elongated and bent streak, but the bending is differently pronounced and directed. If we wanted to handcraft an algorithm that detects this feature, we would need to describe it via some appropriate metric that must take into account the various grades of inflection, direction, brightness, and completeness of this feature within every image. Furthermore, we would need to redo this for every characteristic feature in a diffraction image of which we want to find similar ones.

In addition to that, poor signal-to-noise ratios, straylight, a beam stop or the central hole of multichannel plates or pnCCDs [26] and overall poor image quality can even further increase the difficulty of making an automated classification of all images coherent [27-29].

Therefore, we need a robust classification routine that is insusceptible to the described artifacts, just as a researcher is, to tackle the upcoming data volume. Deep neural networks provide a way out of this situation, and we show in this paper that they outperform the current state-of-the-art classification and sorting routines.

FIG. 1. a) and b) show a capsule-shaped particle whose orientation and size differ. The scattering images are calculated using a multi-slice Fourier transform (MSFT) algorithm that simulates a wide-angle x-ray scattering experiment, which includes 3D information about the particle [7, 21]. The two incoming beams (indicated by the arrow on the left-hand side) produce very different scattering images, yet the dominant feature, an elongated bent streak, is distinctly visible in both calculations. A handcrafted algorithm is typically not able to identify the similarity between the two scattering patterns and would sort these two images into two distinct classes, although they belong to the same capsule shape class. A deep neural network can learn these complicated similarities on its own when we provide a few manually selected diffraction patterns that contain this feature.
Current state-of-the-art automatic classification routines for diffraction experiments employ so-called kernel methods [28, 30]. Bobkov et al. [28] trained a support-vector-machine on a public small-angle x-ray scattering dataset with an Accuracy of 87 %, but only on selected images (we will use this approach as a reference in section 4). Yoon et al. [30] were able to achieve an Accuracy of up to 90 % using unsupervised spectral clustering on a non-public small-angle x-ray scattering dataset.

Deep neural networks, on the other hand, have already been applied to a broad range of physics-related problems, ranging from predicting topological ground states [31], distinguishing different topological phases of topological band insulators [32], enhancing the signal-to-noise ratio at hadron colliders [33] and differentiating between so-called known-physics background and new-physics signals at the Large Hadron Collider [34], to helping solve the Schrödinger equation [35, 36]. Their ability to classify images has also been utilized in cryo-electron microscopy [37], medical imaging [38] and even for hit-finding in serial x-ray crystallography [39]. However, to our knowledge, this paper is the first application of deep neural networks for classifying complex features within diffraction patterns. We show that deep neural networks outperform the current state-of-the-art classification and sorting routines, while being insusceptible to typical artifact features of diffraction measurements. Furthermore, a deeper analysis of the trained network shows that it can understand complex concepts of what constitutes a characteristic feature in a diffraction pattern.

The paper is organized as follows: In section 2, the data set is presented and a few experimental details are discussed. Section 3 provides the fundamental theory to understand the basics of neural networks; it has two subsections. Subsection 3.1 covers the theory and algorithmic underpinnings of deep neural networks and how to train these models, and subsection 3.2 presents three common metrics to evaluate the quality of the neural network's predictions. Section 4 establishes our starting point, while the full benchmark report on the baseline neural network can be found in appendix A. We introduce the chosen network architecture and provide baseline results on the data presented in section 2 but also on a reference dataset for which classification results are already published [40]. In section 5, we discuss solutions for the above stated issues of applying neural networks to diffraction data. In subsection 5.1 we discuss the choice of the activation function for the neural network and present a novel logarithmic activation function that enhances the prediction performance with diffraction image data. Subsection 5.2 benchmarks the dependence of neural networks on training data size, asking essentially how much manually labeled data is needed for the neural network to give acceptable results, and subsection 5.3 presents an approach to harden the neural network against very noisy data using a custom two-point cross-correlation map. In section 6 we then provide more profound insights into the output of the neural network by showing and discussing calculated heatmaps that visualize the gradient flow within the neural network. These images directly correlate with what the neural network sees; they are created using an advanced visualization algorithm called GradCam++ [41]. Finally, we give a summary of the principal results and unique propositions of this paper and conclude with an outlook on further modifications as well as future directions.
2. THE DATA
Helium nanodroplets [1] were imaged using extreme ultraviolet (XUV) photon energies between 19 eV and 35 eV at the experimental setup of the LDM beamline [42, 43] at the free-electron laser FERMI [44]. Scattering images were recorded with a multi-channel-plate (MCP) detector combined with a phosphor screen, which was placed 65 mm downstream from the interaction region; this defines the maximum scattering angle of 30°. Single-shot diffraction images in the XUV regime are in some respect a special case, as they cover large scattering angles and can contain 3D structural information [21], manifesting as complex and pronounced characteristic features, such as the bent streaks in Figure 1. Out of all recorded laser shots, about 38 000 images were obtained. The images were corrected for straylight background and the flat detector (see also Langbehn et al. [1]).

For the neural network training dataset, we selected 7264 diffraction images randomly out of all recorded patterns. The size of the subset was chosen to be the maximum a researcher could classify manually given one week's time. From this subset we manually identified 11 distinct but non-exclusive classes (see Figure 2 for examples as well as a description, and Table 1 for statistics about every class). We chose each of the diffraction patterns shown in Figure 2 for being a strong candidate for its class, but it is important to note that almost all diffraction patterns belong to multiple classes, since this is a multi-class labeling scenario. These patterns are therefore not always clearly distinguishable from each other and can exhibit multiple characteristics from different classes. For example, the Newton rings in Figure 2d) are superimposed on a concentric ring pattern that falls into the category Spherical/Oblate, but Newton rings can also occur in other classes, e.g. streak patterns. Furthermore, labeling all images is itself prone to systematic errors because the researcher has to learn-to-label [45]. This means that the labeling process itself is to some extent ill-posed, as the researcher does not know the characteristics of a feature a priori, which results in a changing perception of features and classes along the labeling process and thus a systematically decreased consistency for every class.

We uploaded all available data alongside our assigned labels to the public CXI database (CXIDB, [46]) under the public domain CC0 waiver (http://cxidb.org/id-94.html).

TABLE 1. Statistics of the helium nanodroplets dataset. Non-exclusive labels assigned by a researcher; one image can be in multiple classes. Total dataset size is 7264. Note that Spherical/Oblate as a class also contains Round patterns; only Prolate shapes are excluded from this class (see also caption of Figure 2).

Class               Nr. of labels   % of the whole dataset
Spherical/Oblate    6589            90.7
Round               …               …
Elliptical          …               …
Double Rings        …               …
Prolate             …               …
Streak              …               …
Bent                …               …
Asymmetric          …               …
Newton Rings        …               …
Layered             …               …
Empty               …               …

[Figure 2, panels a) to d): Spherical/Oblate and Prolate (exclusive superordinate classes); Round, Elliptical, Double Rings (partially exclusive oblate subclasses); Streak, Bent (non-exclusive prolate subclasses); Asymmetric, Newton Rings, Layered (non-exclusive other subclasses), each shown with the approximate particle shape or a schematic representation of the expected feature.]
FIG. 2. Characteristic examples for all the classes assigned to the 7264 images by a researcher, except for the Empty class. The top row of every class shows a representative diffraction pattern, and the bottom row in b) - d) shows a stylized drawing of the characteristic feature of this class. The bottom row in a) shows an illustration of the name-giving particle shape for the Spherical/Oblate and Prolate class. The shapes are derived from the analysis of the data in Langbehn et al. [1], and they serve as a form of superordinate classes. They are mutually exclusive to each other, and all diffraction patterns are part of one of these two classes. Also, both superordinate classes have subclasses. For example, b) shows the Spherical/Oblate subclasses Round, Elliptical and Double Rings. While a diffraction pattern can be part of the Round and the Double Rings class, it cannot be part of the Round and the Elliptical class. For the Prolate superordinate class, we find analog subclass rules, although there is no exclusivity rule as it was with the Round and Elliptical class. Therefore, an image belonging to Bent can also be in the Streaks class. Furthermore, all Spherical/Oblate and Prolate patterns can not only be part of their respective subclass but can also be part of one or more of the classes in the non-exclusive other subclass categories shown in d). These classes describe general features within the image which are to some extent independent of the particle shape. We derived the superordinate classes from these general features. These complicated inter-class relationships demonstrate the capabilities of a researcher to interconnect mostly distinctive-appearing features into a consistent description, ultimately leading to a valid physical interpretation. A hand-crafted algorithm normally could not account for these relationships, but now these interconnections can serve as an additional evaluation metric for the neural network. Since there is no diffraction pattern which belongs to the Spherical/Oblate and Prolate class simultaneously, we can check if the neural network mislabeled a diffraction pattern according to these rules. We can then interpret this as a reliable indicator for a failed generalization of the network. The physics behind these patterns is quite complicated as well; for a rigorous interpretation and analysis of these patterns, please see Langbehn et al. [1].
3. BASIC THEORY

3.1. What is a deep neural network
We concentrate in this paper solely on deep feed-forward neural networks. They are a classification model consisting of a directed acyclic graph that defines a set of hierarchically structured non-linear functions. A fundamental example can be constructed by arranging n non-linear functions (z_1, z_2, ..., z_n) in a chain-like manner:

z_{output} = z_n(z_{n-1}(\ldots(z_2(z_1(x)))\ldots)),

where x is the input, which is in our case a diffraction image. The first function, z_1(x), is called the input layer. We then pass the output of z_1 to z_2 and so on; this goes on until the last layer (z_n), which is called the output layer. The nomenclature is that all layers except the output layer (z_n) and the input layer (z_1) are called hidden layers.

For illustrative purposes, Figure 3 shows a convolutional neural network. There, we schematically show the layer functions z_1, ..., z_n, where every layer consists of two stages: a linear layer-specific operation on its inputs followed by a so-called activation function, which is always non-linear. We address the choice of layer-specific operations in section 3.1.1 and then introduce the activation functions in section 3.1.2. In general, the layer-specific operation is always the name-giving component for the layer; for example, if we compute a 2D convolution as the layer-specific operation on the input and then apply an activation function, we call the set of these two stages a convolutional layer. Figure 3 shows a neural network whose first layers are convolutional layers followed by a fully connected layer that produces the predictions.

All common choices for layer-specific operations are affine transformations. They all introduce trainable weights; free parameters that are adjustable during the training process and are sometimes called neurons, due to the intuition that in a fully connected layer they share some similarity to the dendrites, soma, and axon of a biological neuron [47]. These trainable weights are the name-giving components in a neural network. Now, the goal of training a neural network is to optimize all these weights for all layers, so that the predictions for all images in the training data match their accompanying original labels. The original labels are called ground truth and define the upper limit of how good a network can fit a domain. No neural network is better than its training data. In this section, we briefly illustrate the affine transformations of the fully connected layer and the convolutional layer, and then explain in the next section the role of the activation function.
a. Fully connected layer. The name-giving operation for the fully connected layer is a matrix multiplication performed on a flattened input; for example, an m × n sized input image would be flattened into an m · n sized vector. Mathematically this is a matrix multiplication between a matrix and a vector:

a_j = \sum_{k=1}^{m} x_k w_{kj},   (1)

where x is the flattened input and w is the weight matrix of a fully connected layer. Here, all input vector elements (e.g., the pixels of an image, now arranged in one large row x_k) contribute to all output elements and are therefore connected. Furthermore, by convention x_0 is defined as 1 and w_{0j} = b_j, where b_j is a free and trainable bias parameter.

b. Convolutional layer. In a convolutional layer, the trainable weights are parameters of a kernel that slides over the inputs; this is visualized in Figure 3. The general idea of a convolutional layer is to preserve the spatial correlations in the input image when going to a lower-dimensional representation (the next layer). This is achieved by using a kernel with a spatial extent larger than 1 px. The kernel size is then also the extent to which one kernel can correlate different areas of an input and is called its local receptive field. Each kernel produces one output, which is called a feature map or filter. Multiple feature maps from multiple kernels are grouped within one convolutional layer. For example, the first convolutional layer in Figure 3 produces 9 feature maps out of the input diffraction image and hence has 9 kernels that get optimized during training. Since we usually have only a 2-dimensional diffraction image as input in the input layer, but a high number of feature maps as the inputs of every subsequent convolutional layer, we define the output of a convolutional layer with a 4-dimensional kernel k that produces i feature maps of size j × k:

a_{i,j,k} = \sum_{l,m,n} x_{l,\, j+m-1,\, k+n-1} \, k_{i,l,m,n},   (2)

here the input x has l dimensions of size j × k and we slide a kernel of size m × n across all these l dimensions. In the given example for the input layer, l is simply 1 and the summation is just across one input image, as shown in Figure 3.

Regardless of the affine transformation that is used, the outputs of all layer-specific operations are passed through an activation function. This function is always non-linear. We only address two activation functions here, as they are the most common ones used by the community and the only ones we use: the sigmoid and the LeakyRelu function. The first one is a logistic regression function used mostly at the outputs of neural networks, and the second one is a piecewise linear activation function used between layers for numerical reasons [48, 49].
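To make Equations 1 and 2 concrete, the following sketch implements both affine transformations directly in NumPy. This is a minimal, unoptimized illustration written by us, not the authors' code; real frameworks replace these loops with highly optimized routines.

import numpy as np

def fully_connected(x, w, b):
    # Eq. (1): flatten the input image and multiply with the weight
    # matrix w of shape (m*n, outputs); b plays the role of w_0j.
    return x.reshape(-1) @ w + b

def conv2d(x, kernels):
    # Eq. (2): x has shape (L, H, W), i.e. L input feature maps;
    # kernels has shape (I, L, M, N) and produces I feature maps.
    # "Valid" convolution without padding or stride, as in Eq. (2).
    I, L, M, N = kernels.shape
    out = np.zeros((I, x.shape[1] - M + 1, x.shape[2] - N + 1))
    for i in range(I):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # sum over all input maps l and the local receptive field
                out[i, j, k] = np.sum(x[:, j:j + M, k:k + N] * kernels[i])
    return out

For the input layer of Figure 3, L = 1 (a single diffraction image) and I = 9 (the nine kernels producing nine feature maps).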
FIG. 3. Schematic visualization of a convolutional neural network. It shows the hierarchical structure of the network with the function hierarchy z_1, ..., z_n above each layer. Depicted as input is a diffraction image, which is getting expanded by 9 trainable convolutional kernels into 9 feature maps. Note, only 1 kernel, producing the last feature map, is shown. The output of the first layer is then passed through multiple convolutional layers; this is the feature extraction part of the neural network. Ultimately, a fully connected layer with a logistic function as activation function produces the predictions. Every layer consists of 2 stages, also indicated by the brackets underneath z_1, ..., z_n. The first stage is an affine transformation and the second one is a non-linear function, called an activation function. The operation that is used as affine transformation is then the name-giving component for the layer, e.g., a convolutional layer uses a convolution as affine transformation. The choice of the activation function is subject to empirical optimization with various choices possible. Section 3.1.1 describes the affine transformations in more detail and section 3.1.2 covers the basics on activation functions.

The sigmoid function is given as:

h(x) = \frac{1}{1 + \exp(-x)},   (3)

and the LeakyRelu function is given as:

h(x) = \begin{cases} x & \text{if } x \geq 0 \\ \gamma x & \text{if } x < 0 \end{cases},   (4)

where in both functions x ∈ a, the outputs of the affine transformation (the convolutional or the fully connected layer operation, i.e., the output of Equation 1 or 2), and γ is the slope for the negative part in the LeakyRelu function, called leakage.

In Figure 3, the last activation function of the neural network, denoted by Logistic function, is a sigmoid function, because its output can be interpreted as a probability in a Bernoulli distribution, yielding a probability for how likely it is that a given event (an image in our case) is part of a class (in our case, the pre-defined classes from Table 1). Sigmoid functions always give an output between 0 and 1. In our case, we have 11 distinct classes which are mutually non-exclusive, which means every image has a probability of being part of every class. Using a sigmoid function at the end of the neural network therefore yields 11 distinct Bernoulli distributions. The generalization from the single-case Bernoulli distribution to its multi-case n-class equivalent is called a categorical distribution.

Interpreting the output of the neural network, as well as the original labels, as a categorical distribution is key to training the neural network, because only then can we use statistical measures to evaluate the quality of the neural network's predictions, which allows us to optimize it iteratively. However, due to the non-linearity of all activation functions, optimizing a neural network is a non-convex problem where no global extrema can be found with certainty. The general procedure is that of a forward pass and then a backward correction. Meaning, we feed the neural network several images, take the network's prediction and compare this prediction to the ground truth; this is the forward pass. Then we calculate a loss function, which is a metric for how bad or good the predictions were (see the next section), and correct the weights of the network in a way that it would be better equipped to predict the labels for the images it just saw. This correction step starts at the end of the network, using an algorithm called backpropagation; hence the name backward correction, see section 3.1.4.
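Both activation functions from Equations 3 and 4 are one-liners in NumPy; a minimal sketch follows. The leakage value 0.01 below is a common default and our assumption, since the value used in the paper did not survive extraction.

import numpy as np

def sigmoid(x):
    # Eq. (3): squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, gamma=0.01):
    # Eq. (4): identity for x >= 0, small slope gamma ("leakage")
    # for x < 0 so the gradient never vanishes completely
    return np.where(x >= 0, x, gamma * x)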
Optimizing a neural network always starts by feeding it multiple images and evaluating what the neural network made of them. For assessing the quality of the network's predictions a so-called loss function is used. It is the defining metric that we seek to minimize during the training of the neural network. In every training step, we compare the output of the neural network to the real labels provided by the researcher and calculate the so-called loss. Lower loss values correspond to a higher prediction quality of the neural net. Therefore, the goal during the training process is to adjust all weights and biases within the network so that the loss is minimal for all input training images. There are various possible loss functions, which often serve a specific purpose. For classification tasks, such as the present case, primarily the cross-entropy is used [50-53]. Cross-entropy is a concept from information theory giving an estimate about the statistical distance between a true distribution p and an unnatural distribution q. In our case, p is the categorical distribution over the ground truth labels, and q is the output of the neural network.

Cross-entropy is calculated as the sum of the Shannon entropy [54] of the true distribution p and the Kullback-Leibler divergence [55] between p and q. The former is a measure of the total amount of information of p, and the latter is a typical distance measure between two probability distributions. If the Kullback-Leibler divergence is zero, then the cross-entropy is just the Shannon entropy of p, and we have p = q. Then, the predictions of the neural network are not distinguishable from the labels of all training images. Cross-entropy can be formally written as:

H(p, q) = H(p) + D_{KL}(p \| q),   (5)

where H(p) is the Shannon entropy of p, and D_{KL}(p ‖ q) is the Kullback-Leibler divergence of p and q [56]. When using a sigmoid function as activation function on the output layer, the final loss function can be defined as:

H(x^{out}, x) = \sum_{i}^{M} \left( x^{out}_i - x^{out}_i x_i + \log\left(1 + \exp\left(-x^{out}_i\right)\right) \right),   (6)

where M is the number of all images in the training data, x^{out}_i is the prediction for one image from the deep neural network and x_i is the original label of the image, assigned by the researcher. Please see appendix B for a complete derivation.

Using Equation 6 as it is would require us to pass all images through the network for one training step, as the sum runs over all images. This is computationally intractable. Therefore, we use a variant of Equation 6 where the sum runs only over a stochastically chosen subset of size bs, called a batch. The size of that batch is called batch size and is an important hyperparameter that needs to be chosen prior to training, see section 3.1.5. One iteration step now involves only bs images from the dataset, and we define an epoch as the number of iteration steps it takes the network during the training to see all images one time.

To summarize, minimizing the cross-entropy is the goal during the training process of a neural network. The network learns to link the user-defined labels to the provided images. All that is left to understand the basic training process of a neural network is a way to adjust the weights in all layers.
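The rearranged form of Equation 6 is exactly the standard sigmoid cross-entropy on raw network outputs (logits). A minimal sketch, ours and not the paper's implementation:

import numpy as np

def sigmoid_cross_entropy(logits, labels):
    # Eq. (6) per label: z - z*t + log(1 + exp(-z)), algebraically
    # equal to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
    # logits: raw network outputs; labels: 0/1 ground truth.
    # Kept simple for clarity; production code additionally guards
    # against overflow for large negative logits.
    return logits - logits * labels + np.log1p(np.exp(-logits))

In training, this quantity is averaged over the bs images of a batch rather than summed over all M images, as described above.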
Optimizing the weights within the neural network so that they give minimal loss for all training images is done using two distinct algorithms: gradient descent and backpropagation. In principle, gradient descent works by evaluating the gradient at some point and then moving a certain step-size in the opposite direction; this is done iteratively until the gradient is smaller than some pre-defined threshold, which is the numerical equivalent of calculating the extrema of a function analytically. The basic gradient descent step is given by:

w_{\tau+1} = w_{\tau} - \eta \nabla_{w_{\tau}} H(x^{out}, x),   (7)

where η is the aforementioned step-size, called learning rate, ∇_{w_τ} is the gradient w.r.t. the weights at step τ and H(x^{out}, x) is the loss function from Equation 6. With Equation 7 we could already update the weights within the output layer of the neural network (z_n(·)), since for the output layer we can calculate the numerical gradients. But we cannot do this for the layers that come before the output layer, since we are lacking a way to include these. In order to propagate the gradient descent correction throughout the network, an algorithm called backpropagation is used [57]: First, we define the gradient of H(x^{out}, x) w.r.t. the weights at the output of the deep neural network, using the chain rule:

\nabla_{w_{\tau}} H(x^{out}, x) = \frac{\partial H(x^{out}, x)}{\partial w^{N}_{j\tau}} = \frac{\partial H(x^{out}, x)}{\partial h^{N}(a^{N}_{j})} \frac{\partial h^{N}(a^{N}_{j})}{\partial w^{N}_{j\tau}},   (8)

where N denotes the layer depth of the output layer, h^N(·) is the activation function used in that layer and a^N_j are the outputs of the layer-specific operation, as in Equations 1 and 2. Starting from there, we include the layer preceding the output layer (z_{n-1}(z_n(·))) by making use of the chain rule again:

\frac{\partial H(x^{out}, x)}{\partial h^{N}(a^{N}_{j})} = \frac{\partial H(x^{out}, x)}{\partial h^{N-1}(a^{N-1}_{j})} \frac{\partial h^{N-1}(a^{N-1}_{j})}{\partial h^{N}(a^{N}_{j})}.   (9)

This can be iteratively repeated until the input layer (z_1(·)) is included in the calculation. By making use of the chain rule until we reach the input layer, we can include all trainable weights of all layers in the correction term of the gradient descent algorithm. With this, we conclude the full optimization routine in Table 2.

TABLE 2. The iterative optimization routine for a deep feed-forward neural network.
1. Forward pass: Propagate bs images through the network.
2. Evaluate the predictions: At the output layer, calculate the loss between the ground truth and the output of the deep neural network (Equation 6).
3. Construct the backpropagation rule: Include all gradients w.r.t. the weights of all layers according to Equation 9.
4. Backward correction: Update all weights in the network using gradient descent, see Equation 7.
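The routine of Table 2 maps almost one-to-one onto a few lines in a modern framework. A hypothetical PyTorch sketch follows; model, loader and optimizer are placeholders and not the authors' actual code:

import torch

def train_one_epoch(model, loader, optimizer, device="cuda"):
    # One epoch of the routine in Table 2, batch by batch.
    # BCEWithLogitsLoss is the sigmoid cross-entropy of Eq. (6);
    # labels must be float tensors of 0/1 per class.
    criterion = torch.nn.BCEWithLogitsLoss()
    model.train()
    for images, labels in loader:          # bs images per iteration step
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)             # 1. forward pass
        loss = criterion(logits, labels)   # 2. evaluate the predictions
        loss.backward()                    # 3. backpropagation, Eq. (9)
        optimizer.step()                   # 4. gradient descent step, Eq. (7)

A plain gradient descent optimizer matching Equation 7 would be, e.g., optimizer = torch.optim.SGD(model.parameters(), lr=0.1).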
Of significant importance is the way the network is constructed: how deep should the network be and of what should it consist? For nomenclature, the combination of all used layers, the depth of the network and the used activation functions is called an architecture. We benchmarked the performance of various architectural choices when used with diffraction images as input and provide the results in appendix A and not in the main paper, due to their rather technical character. In short, all architectures are established through extensive empirical research. So far, not only the leading A.I. research institutes, like the Massachusetts Institute of Technology (MIT) or the University of Toronto, but also large companies like Google, Facebook and Microsoft have invested significant amounts of resources to establish well working out-of-the-box solutions [50-52]. Building on this, and after extensively benchmarking the most common architectures on our own, we settled on an architecture called pre-activated wide residual convolutional neural network in its 18-layer configuration, called ResNet18 [25, 58, 59]. In essence, it is a convolutional neural network much like the example in Figure 3, but it employs so-called residual skip connections which increase Accuracy while decreasing training time; see appendix A for further details as well as comparisons with other architectures.

After settling on an architecture, training a neural network requires fine-tuning of multiple free parameters. Four of them are critical: the learning rate η, the batch size bs and so-called regularization parameters, of which we have two (introduced at the end of this section). We set the initial learning rate for the gradient descent algorithm to η = 0.1, see also Equation 7, and throughout the training we repeatedly multiply η by a constant factor smaller than one. All images are down-sampled to 224 × 224 px, which is necessary to fit the deep neural net on two Nvidia 1080Ti GPUs, each having 11 GB memory. The image dimensions are chosen to be a compromise between file size and resolution. All features we are training the neural network on are still clearly visible and distinguishable after the rescaling.

Furthermore, we face the problem of having a comparatively small training set. To mitigate the risk of over-fitting, we rely on two techniques, regularization and data augmentation (a code sketch of the regularized loss follows at the end of this section):

1. Regularization extends the loss function by penalty terms on the trainable weights:

H(x^{out}, x)_{reg} = H(x^{out}, x) + \alpha \|w\|_{1} + \beta \|w\|_{2}^{2},   (10)

where H(x^{out}, x) is the cross-entropy loss function, ||w||_1 and ||w||_2^2 are the L1- and the L2-norm applied on the sum of all trainable weight parameters, and α and β are so-called regularization coefficients. In our experiments we set α and β to the same value during training. Using L1 and L2 regularization in combination is commonly referred to as elastic net regularization [60].

2. Data augmentation means creating artificial input images by randomly applying image transformations to the original image, like flipping the vertical or the horizontal axis and adjusting contrast or brightness values randomly. This greatly increases the robustness to over-fitting and is used as a standard procedure when facing small training datasets [61, 62].

We were able to train deep neural networks with a depth of up to 101 layers without over-fitting using regularization and data augmentation, see appendix A. In all experiments reported here we choose a depth of 18 layers for the neural network, due to numerical, memory and time reasons. We trained all deep neural network variants for 200 epochs.
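The elastic net term of Equation 10 is a one-line addition on top of any framework loss. A sketch under the assumption of PyTorch; the coefficient values are placeholders, since the paper's exact exponent did not survive extraction:

import torch

def elastic_net_loss(loss, model, alpha=1e-5, beta=1e-5):
    # Eq. (10): add L1 and L2 penalties over all trainable weights
    # to the cross-entropy loss; alpha and beta are the
    # regularization coefficients.
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return loss + alpha * l1 + beta * l2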
3.2. Evaluation metrics

We use three metrics to assess the quality of the predictions from the neural network: Accuracy, Precision, and Recall. We calculated these metrics every 2500 training iteration steps (≈ 52 epochs) using the evaluation dataset. Accuracy is formally defined as:

Accuracy = \frac{\text{True Positives} + \text{True Negatives}}{\text{Condition Positives} + \text{Condition Negatives}},

where condition positives/negatives is the real number of positives/negatives in the data, and true positives/negatives is the correct overlap of the predictions from the model with the condition positives/negatives. An Accuracy of 1 corresponds to a model that was able to predict all classes of all images correctly. Therefore, Accuracy is a good measure for evaluating the prediction capabilities of a model when true positives and true negatives are of importance. Predicting negative labels correctly is of particular interest in the case of the helium dataset, because we want to estimate if the neural network was able to understand the complex inter-class relationships imposed by the researcher. The network should realize that if, for example, one prediction is Spherical/Oblate, it cannot simultaneously be Prolate. Therefore, the network has to produce a true negative for either one of these predictions. However, using only Accuracy as a metric has several downsides. The most important one is the decreased expressiveness of Accuracy when working in a multi-class scenario. In order to understand this, we first introduce Precision and Recall, and then provide an example:

Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}},
Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}.

Precision, also called positive predictive value, is a measure for how reasonable the estimates of the model were when it labeled a class positive, and Recall is a measure for how complete the model's positive estimates were. For example, if the model would predict all training images in the helium dataset to be Spherical/Oblate and nothing else (out of 7264 images, 6589 are indeed Spherical/Oblate), then Accuracy would be 0.879; if the model instead made no positive prediction at all, Accuracy would still be 0.805. Precision in these two examples would give 0.907 for the Spherical/Oblate example and 0.000 for the all-negative example. Precision is, therefore, a metric that quantifies how well the positive predictions were assigned. Since 91 % of all images are indeed Spherical/Oblate, setting all labels positive in the Spherical/Oblate class can make sense, and Precision also provides insight when the model makes no positive prediction at all, which would be a useless model for our purpose. However, Precision alone is not sufficient as a metric. At this point we do not know if our model predicted almost every possible positive label correctly, or if only a small fraction of all positive labels were assigned correctly; we therefore need an additional measure for the generalization capabilities of our model. For that reason, Precision is always used in combination with Recall. The Recall for our first example is 0.423 and for the second one 0.000. Recall relies on False Negatives instead of the False Positives used by Precision, which provides a measure of the completeness of all positive predictions compared to all positive labels within our data. Recall states that our model only captured 42 % of all possible positive labels in the Spherical/Oblate example, showing that the generalization of the model would not be sufficient for a real-world application. Therefore, a balanced interpretation of these three metrics is necessary to estimate the quality of the models tested here. The worked example is spelled out numerically below.
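The numbers of the worked example follow directly from the label totals quoted in this paper (79 904 labels in total, 64 339 of them negative, see section 5.2); a short self-contained computation:

# Model that labels every image "Spherical/Oblate" and nothing else.
total_labels = 79904
negatives    = 64339
positives    = total_labels - negatives   # 15 565 positive labels overall

tp = 6589                  # all actual Spherical/Oblate images hit
fp = 7264 - 6589           # the 675 images wrongly labeled positive
fn = positives - tp        # positives of all other classes, missed
tn = negatives - fp        # remaining negatives, correctly negative

accuracy  = (tp + tn) / total_labels   # -> 0.879
precision = tp / (tp + fp)             # -> 0.907
recall    = tp / (tp + fn)             # -> 0.423
# All-negative model: tp = fp = 0, accuracy = negatives/total -> 0.805
print(round(accuracy, 3), round(precision, 3), round(recall, 3))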
4. BASELINE PERFORMANCE OF NEURAL NETWORKS WITH CDI DATA
In this section we briefly report on what we call baseline results. We used the previously described ResNet [25] neural network architecture in its basic configuration with a depth of 18 layers, termed vanilla configuration or ResNet18 (see section 3.1.5), and trained it with the helium diffraction data set as described in section 2, as well as with a reference data set from the literature [40]. This reference data set was made freely available on the CXIDB by Kassemeyer et al. [40]. It contains diffraction patterns of a number of prototypical diffraction imaging targets, namely the Paramecium bursarium Chlorella virus (PBCV-1), bacteriophage T4, magnetosomes and nanorice. For further experimental details see Kassemeyer et al. [40]. We selected this dataset because of a previous publication dealing with it [28] that describes, to our knowledge, the best classification results on it so far: Bobkov et al. [28] trained a support-vector-machine on the CXIDB dataset and inferred the particle type directly from the diffraction images. Overall, they achieved an Accuracy of up to 0.87, but only on selected high-quality images with a high confidence score of the support-vector-machine.

Table 3 summarizes our results. Train Time_Max is the time when the neural network achieved the highest Accuracy score on the evaluation dataset, and Train Time_Full is the time for training 200 epochs. In practice, we achieved optimal convergence after training for 70 to 100 epochs. We achieved an Accuracy of 0.967 not only on a high-quality subset of the CXIDB data, like in [28], but on all available data (see Table 3), using a vanilla ResNet18 architecture, proving that using a neural network significantly outperforms the current state-of-the-art approach in [28].

TABLE 3. Overall evaluation metrics for the ResNet18 architecture (vanilla configuration) and both datasets. The table gives the max values during training for Accuracy, Precision, and Recall. The training time after which the neural network achieved the highest Accuracy score on the evaluation dataset is labeled Train Time_Max, and the time for training the full 200 epochs is labeled Train Time_Full. See also appendix A for further details.

Architecture           ResNet18
Dataset                CXIDB    Helium
Accuracy               0.967    0.955
Precision              0.932    …
Recall                 0.933    …
Train Time_Max [h]     0.278    …
Train Time_Full [h]    0.668    …

In the case of the helium dataset, we face a much more complicated multi-class learning problem (one image can belong to multiple classes, compared to one image belonging to exactly one class as in the CXIDB data). However, we reach a comparable Accuracy score of 0.955. Precision and Recall are very high for the helium and the CXIDB dataset, proving that the neural network not only predicted the true positives with high confidence and reliability (high Precision), it did so for almost all true positive labels in the evaluation dataset (high Recall). In the next section we show how to further improve on the baseline performance of neural networks with diffraction images as input data.
5. ADAPTING NEURAL NETWORKS FOR CDI DATA
Here, we describe our contributions to using neural networks in combination with diffraction images. First, we show in section 5.1 that the performance of a neural network can be enhanced when using a special activation function after the input layer. Second, in section 5.2 we benchmark the performance of the neural network when using a smaller amount of training data. The idea is to provide an intuition about how much the prediction capabilities deteriorate when a smaller training dataset is used. This is useful because so far a researcher still has to invest a lot of time preparing the training dataset and, more generally, minimizing the time spent looking through the raw data is the ultimate goal for using a neural network in the first place. Third, in section 5.3 we propose a novel data augmentation in the form of a custom two-point cross-correlation map that hardens the network against very noisy data. We show that when using this augmentation the network is more robust to noise from a uniform distribution added on top of the original diffraction image. This simulates the experimental scenario in which a very low signal-to-noise ratio is unavoidable, e.g., during CDI experiments with very limited photon flux [7] or very small scattering cross sections, as is the case with upcoming CDI experiments on single biomolecules [63, 64].
5.1. A logarithmic activation function

One of the key additions of this paper is the proposed activation function, formally stated in Equation 11. It is designed to account for the inherent property of diffraction images of scaling exponentially. More generally, the intensity distribution of scattered light on a flat detector follows two laws, depending on the scattering angle that is recorded. For very small angles (SAXS and USAXS experiments) the Guinier approximation is the dominant contribution to the recorded intensity, while for larger scattering angles (SAXS and WAXS experiments) Porod's law becomes dominant [65, 66]. Where the scattering intensity in the Guinier approximation is proportional to ≈ exp(-q²), in Porod's law the intensity scales with ≈ q^{-d}. q is the scattering vector (a function of the scattering angle and of the wavelength in use) and d is the so-called Porod coefficient, which can vary significantly depending on the object from which the light was scattered [65]. In any case, the recorded detector intensity for diffraction images scales exponentially. For this reason we propose a logarithmic activation function of the form:

h(x) = \begin{cases} \alpha (\log(x + c_1) + c_2) & \text{if } x \geq 0 \\ -\alpha (\log(c_1 - x) + c_2) & \text{if } x < 0 \end{cases},   (11)

where α > 0, c_1 = exp(-1), c_2 = 1 and x is the input. We define c_1 and c_2 so that the activation function is anti-symmetric around 0, which helps speed up training and avoids a bias shift for succeeding layers [67, 68].

Since we are using a gradient-based optimization technique, we need to take care that the gradient can propagate throughout the whole network; otherwise this would lead to so-called gradient flow problems, which befall deep architectures [48, 69]. There are two possibilities for insufficient gradient flow: either the gradients are getting too small (vanishing gradient) or too large (exploding gradient) when propagating throughout the network. Both scenarios lead to numerical instabilities during training, making convergence for large architectures very hard or even impossible. The reason for this is the backpropagation algorithm, which invokes the chain rule for calculating the gradients. Every gradient is therefore also a multiplicative factor for the gradient of a succeeding layer. For our case the derivative of Equation 11 w.r.t. x is given by:

\frac{\partial h(x)}{\partial x} = \begin{cases} \frac{\alpha}{x + c_1} & \text{if } x \geq 0 \\ \frac{\alpha}{c_1 - x} & \text{if } x < 0 \end{cases}.   (12)

It shows that the gradient scales with x^{-1}, with a discontinuity of size α c_1^{-1} at 0. If we used this activation function for all activations throughout the network, the gradient would have an increased probability to vanish, or explode, the deeper the architecture gets. In addition to that, the discontinuity at x = 0 could lead to gradient jumps, which would further decrease numerical stability. Therefore, we use the logarithmic activation function only for the first convolutional layer and use a LeakyRelu activation between all other layers. α is a tunable hyperparameter; we conduct experiments with three values, α ∈ {0.2, 0.5, 1.0}, and evaluate its impact on the performance of the neural network.

In Table 4 we provide the evaluation metrics for ResNet18 used with the logarithmic activation function, trained with three different values for α. For comparison, we also provide the results of the unmodified ResNet18, labeled unmodified. The best performing configuration is the one with an α value of 0.2, maxing out with an Accuracy of 0.965 and improving the Accuracy by a full percentage point compared to the unmodified ResNet18. The lowest value for the maximum Accuracy was reached without the logarithmic activation function, topping out at 0.955. Precision and Recall both increase with the addition of the logarithmic activation function. These improvements all come without increasing training time or complexity of the model. The maximum achieved Accuracy seems to be anti-correlated with α, with the ResNet18_α=1.0 variant performing worst among the modified networks.

TABLE 4. Evaluation metrics for the ResNet18 network with and without the logarithmic activation function. We benchmark three values for α. Results are shown for both datasets and are the maximum value recorded during training. Bold numbers indicate the best scores across their respective category.

Architecture   ResNet18
α              0.2      0.5      1.0      unmodified
Dataset        Helium
Accuracy       0.965    0.960    0.959    0.955
Precision      0.922    …        …        …
Recall         0.870    …        0.868    …

We suspect that this is related to the smaller size of the discontinuity of the derivative of h(a_j) when choosing a small value for α, see Equation 12. However, choosing even smaller values for α did not improve the Accuracy further, either because the benefit from the activation function plateaus there or because we reached the classification capacity of this ResNet layout. These results show convincingly that the addition of the logarithmic activation function improves the overall performance and generalization of the deep neural network. This is in so far expected, as we imposed a form of feature engineering on the network by exploiting a known characteristic of the dataset. Therefore, without increasing the complexity, the depth or the training time, we showed that using the logarithmic activation improves all relevant evaluation metrics. For this reason, we use the logarithmic activation function with an α value of 0.2 for all further experiments; a code sketch of this activation follows at the end of the next subsection.

5.2. Dependence on the training set size

In this section we evaluate the impact of the training set size on the evaluation metrics: we trained the ResNet18_α=0.2 with a varying amount of labeled images. The reason for this is to provide intuition for how many images need to be classified manually before the employment of a neural network is useful. We uniformly selected images from the training set but kept the same evaluation dataset described in section 3.1.5. We decreased the size of the training set in three stages (to 75 % ≡ 4631 images, 50 % ≡ 3088 images and 25 % ≡ 1544 images of the full 6174 training images).

Table 5 shows the results of the ResNet18_α=0.2 when trained with datasets of different sizes. For the helium dataset, the maximum achieved Accuracy drops from 0.965 to 0.797 when using only 1544 images instead of the full 6174 images. Even more pronounced is the decline in Precision and Recall, from 0.922 and 0.870 to 0.673 and 0.593 for the smallest training set size. The steeper decline rate for Precision and Recall, compared to Accuracy, can be understood as follows: the helium dataset predominantly consists of Negative ground truth labels (64 339 out of 79 904 labels), to which the neural network resorts in the absence of sufficient training data. Precision and Recall, on the other hand, provide only information about the positive prediction capabilities and their completeness, and therefore decrease faster when a smaller training set size is used.

This shows that the number of images is critical for the prediction capabilities of the neural network. The drastic decrease in training set size results in a much worse generalization of the model, detecting only those images that are very close to the ones from the training set and missing most from the evaluation set. The network has not learned the characteristics of a particular class to a point where it can transfer the gained knowledge to other images, which is the one critical property for which we employed a neural network in the first place. Therefore, if time is limited, one may be well advised to concentrate efforts on preparing a sufficiently large, high-quality training dataset while using, e.g., our here presented neural network approach in its standard configuration.
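As announced in section 5.1, a minimal NumPy sketch of the logarithmic activation from Equation 11 follows. The constants reflect our reconstruction of the garbled text, c_1 = exp(-1) and c_2 = 1, which indeed make h(0) = 0 and h anti-symmetric:

import numpy as np

ALPHA = 0.2          # best-performing value from Table 4
C1 = np.exp(-1.0)    # chosen so that log(C1) + C2 = 0, hence h(0) = 0
C2 = 1.0

def log_activation(x):
    # Eq. (11): the two branches collapse to a single expression
    # via |x| and sign(x), which also avoids taking the log of a
    # negative argument in vectorized code.
    return np.sign(x) * ALPHA * (np.log(np.abs(x) + C1) + C2)

In the network itself, this function replaces the LeakyRelu only after the first convolutional layer, as described above.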
5.3. A noise-resistant cross-correlation input map

This section introduces an image augmentation based on the two-point cross-correlation function, which increases the resistance to noise. We prepare four training sets, each with an increasing amount of noise sampled from a uniform distribution, and analyze the noise dependence of the neural network.

One of the principal problems in CDI experiments, or imaging experiments in general, is recorded noise. Noise often leads to computational problems, noise resistance being a known weak point for a significant fraction of predictive algorithms [29]. In particular, deep neural networks are known to be easily fooled by noise. When adding noise to an image, whose addition may be invisible to the human eye, a neural network can come to entirely different conclusions, and this even with high confidence; seeing a panda where there was a wolf [70, 71]. Therefore, we propose an additional pre-processing step for the input images to increase the noise resistance of the neural network.

TABLE 5. Evaluation metrics of the ResNet18_α=0.2 network with the logarithmic activation function and an α value of 0.2. Results are shown for the helium dataset and reflect the maximum achieved value reached throughout the training process, assessed on the evaluation dataset. Bold numbers indicate the best scores across their respective category.

Architecture        ResNet18_α=0.2
Training set size   6174     4631     3088     1544
Dataset             Helium
Accuracy            0.965    0.915    0.829    0.797
Precision           0.922    0.821    0.740    0.673
Recall              0.870    0.771    0.679    0.593

To quantify the quality of an image, the signal-to-noise ratio is often used. It is a measure for how much noise is present compared to some information content, where low values indicate that information might be indistinguishable from noise. It has been shown that higher orders of the two-point cross-correlation function (CCF) can act as a frequency-dependent noise filter and increase the quality of a reconstruction of a diffraction image even in the presence of recorded noise [72, 73]. And since the CCF can be interpreted as an image, see Figure 4 e) to h), we employ this method in a similar manner to optimize the use-case with a convolutional deep neural network, expecting that the higher-order terms make the neural network more resistant to the presence of noise. In general, the CCF is defined as:

C_{i,j}(q_i, q_j, \Delta) = \int_{-\infty}^{\infty} I^{*}_{i}(q_i, \phi) \, I_{j}(q_j, \phi + \Delta) \, d\phi,   (13)

where Δ is the angular separation, φ is the angular coordinate, and (i, j) denotes the index of the two scattering vectors q_i and q_j. For discrete φ and written as a Fourier decomposition, Equation 13 yields [72]:

C^{n}_{i,j}(q_i, q_j) = I^{n*}_{i}(q_i) \, I^{n}_{j}(q_j),   (14)

where n denotes the order of the CCF. I^n_i is given by:

I^{n}_{i}(q_i) = \frac{1}{2\pi} \int_{0}^{2\pi} I(q_i, \phi) \exp(-in\phi) \, d\phi.   (15)

Since C_{i,j} = C_{j,i}, we can split the final correlation map into an upper and a lower triangle matrix. To maximize information, and to optimally use the local receptive fields of the convolutional layers, we merge the lower triangle from the full CCF calculation, Equation 13 with Δ = 0, and the upper triangle of order n = 8 from Equation 14. Therefore, we combine a plain correlation map with a higher-order map that is more resistant to noise; see Figure 4 e) to h) for a full example.

To test the robustness of this method, we use the ResNet18_α=0.2 and train it with various pre-processed datasets. From our original dataset we derive three additional datasets that only differ in the amount of noise added. We do this as follows: First, we calculate the mean, the standard deviation (std) and the maximum intensity value of each image in the original dataset. From these values we calculate the median, instead of the mean (due to its increased robustness against outliers), ending up with three statistical characteristics describing the intensity distribution throughout all diffraction images. With that, we define three continuous uniform distributions to sample noise from. A continuous uniform distribution is fully defined by a lower and an upper boundary, a and b, respectively. The probability for a value to be drawn within these boundaries is equal and non-zero everywhere. For our three noise distributions we always use a lower boundary of 0 and vary the upper boundary, so that b is either the mean, the mean + the std. or the maximum of the intensity distribution of the images (the three statistical characteristics described above). For example, for creating the maximum noise dataset, we looped through every diffraction image and added noise sampled from the maximum noise distribution. We do this for all three noise distributions. From these three noise-embedded datasets, as well as our original dataset, we calculate the here proposed CCF maps. This leads to a total of eight data sets; for each of them we train a ResNet18_α=0.2. An example of one image in all eight datasets is shown in Figure 4.

FIG. 4. a) to d) show the various stages of added noise for a standard scattering image (None, Mean, Mean + Std, Max). e) to h) are the calculated correlation maps with the upper triangle of order n = 8 and the lower triangle from the full CCF calculation.

The results for these eight data sets are given in Table 6. The performance of the neural network without added noise is much stronger when using the original diffraction images instead of the CCF maps. However, as soon as noise is added, the performance of the neural network trained on diffraction images deteriorates much faster compared to the performance with CCF maps as input. When the upper boundary of the added noise exceeds the median value of mean + std., the neural network performs better with the CCF maps than with the original diffraction images. Especially for the noisiest dataset the differences in performance are significant: Precision is increased by 4 percentage points when using the CCF maps as input, showing that our data augmentation may serve as a helpful asset when dealing with very noisy data. In general, it is a viable alternative to use the CCF maps as input to the convolutional deep neural network, which should be considered an option in the case of very noisy data, where it provides a boost to classification results. The downside is that calculating the CCF for every image comes at an additional computational cost. It took us three full days to calculate the CCF maps for all 39 879 images of both datasets on an Intel 6700K quad-core machine using a multi-threaded Python script (also released on Github).
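One plausible NumPy implementation of the combined map is sketched below. It assumes the image has already been resampled to polar coordinates polar[q, phi]; that interpolation step, the 1/2π normalization of Equation 15, and whether the paper takes the real part or the modulus of the complex correlations are not specified here, so we take the real part:

import numpy as np

def ccf_map(polar, n_order=8):
    # polar: real array of shape (Q, P), intensity I(q, phi).
    fourier = np.fft.fft(polar, axis=1)       # I^n(q) for all orders, Eq. (15)
    # Full CCF at angular separation Delta = 0, Eq. (13), discretized
    # as a sum over phi: entry (i, j) = sum_phi I(q_i, phi) I*(q_j, phi).
    full = (polar @ polar.conj().T).real
    # Single Fourier order n, Eq. (14): C^n_{i,j} = conj(I^n_i) I^n_j.
    f_n = fourier[:, n_order]
    order_n = np.outer(f_n.conj(), f_n).real
    # Lower triangle from the full CCF, upper triangle of order n = 8.
    return np.tril(full, -1) + np.triu(order_n)

The resulting (Q, Q) array is then treated as an ordinary input image for the convolutional network.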
6. WHAT THE NEURAL NETWORK SAW
Neural networks are often considered to be a black-box approach. We usually do not impose a-priori knowledge on our model; the network learns on its own. Although this is part of the reason why neural networks are so successful, it also gives rise to doubts about the interpretability of their predictions. Several ways to interpret the decision-finding processes within a trained neural network have been presented in the literature [41, 74–76]. In order to better understand why our deep neural network assigned images to certain classes, we calculated heatmaps using the GradCam++ algorithm [41]. These heatmaps make visible where the network looked for a particular class, which we determine by tracing back the gradient flow from the output layer to the last convolutional layer. The network's class-specific interest correlates directly with this gradient signal because, in essence, we simulate a training step using backpropagation and interpolate the feature maps from the last convolutional layer.
TABLE 6. Evaluation results when training a ResNet18_α=0. network on the original diffraction images and on CCF maps calculated from them. The results reflect the maximum value achieved throughout the training process, assessed on the evaluation dataset. Bold numbers indicate the best scores across their respective category.

Architecture   ResNet18_α=0.                                            Dataset: Helium
Noise added    None               Mean               Mean + Std.        Max.
Input data     CCF    Diff.Imgs.  CCF    Diff.Imgs.  CCF    Diff.Imgs.  CCF    Diff.Imgs.
Accuracy       …      …           …      …           …      …           …      …
Precision      …      …           …      …           …      …           …      …
Recall         …      …           …      …           …      …           …      …
FIG. 5. GradCam++ results for two distinct classes from the helium dataset. a) shows five randomly selected images from the Streak class and b) shows five images from the Bent class. We chose these classes due to their distinct and distinguishable characteristic shapes, which can easily be identified using the contour maps provided by the GradCam++ algorithm. For each class, the schematic from Figure 2 is also plotted at the beginning of each row. GradCam++ contour levels are plotted as dashed lines and used as transparency values for the images from which we calculated them; this way, regions with strong gradients are also brighter.

A full description of this process is given in appendix D. The output of the GradCam++ algorithm provides contour maps whose amplitude is a normalized measure of how much the gradient would impose corrections on the weights if used during training. This gradient flow directly corresponds to what the network deemed the most relevant regions.

Figure 5 shows the GradCam++ results for the
Streak and
Bent classes, using our best performing network, the ResNet18_α=0. variant. We present results from these classes because their distinct spatial characteristics are obvious to the human eye; they are therefore ideal candidates for testing whether the neural network understood these characteristics. In each row of Figure 5, a schematic sketch of the key feature is depicted together with five randomly selected images from the class. The GradCam++ contour maps are overlaid on the images; in addition, the contour levels are used as an α mask for the diffraction image, so that the brightest areas in each plot correspond to the ones with the highest gradient flow. In the case of the Streak class, Figure 5 clearly shows that the neural network was able to identify the dominant streak feature regardless of its orientation or size. Results on the
Bent class also show a strong correlation between the shape of the contour maps and the bent shape of the diffraction pattern. Combining these metrics and the GradCam++ images, we conclude that the
Streak class feature identified by the neural network indeed corresponds to the one seen by the researcher. Also, the
Bent class contour maps from the network show a clear resemblance to the feature intended by the researcher, albeit not as strongly pronounced. Although the deep neural network learned these representations on its own, they align with the intentions of the researcher. This demonstrates that neural networks are capable of learning such complicated patterns on their own.
7. SUMMARY AND OUTLOOK
In this paper, we give a general introduction to the capabilities of neural networks and provide results on the first domain adaptation of neural networks to the use case of diffraction images as input data. The main contributions of this paper are (i) a novel activation function that incorporates the intrinsic logarithmic intensity scaling of diffraction images, (ii) an evaluation of the impact of different training-set sizes on the performance of a trained network and (iii) the use of the two-point cross-correlation function to improve the resistance against very noisy data. In addition, we provide a large benchmarking routine, utilizing multiple neural network architectures and layouts, in appendix A.

We have shown that even in the most basic configuration, convolutional deep neural networks outperform previously established sorting algorithms by a significant margin. More importantly, we improved on these baseline results by modifying the activation function of the first layer. For the case of very noisy data, often a problem in diffraction imaging experiments, we showed that two-point cross-correlation maps as input data, instead of the original diffraction images, improve the robustness of the classification capabilities of the network. Our results set the stage for using deep learning techniques as feature extractors for diffraction imaging datasets. The ultimate goal is to establish an unsupervised routine that can categorize and extract essential pieces of information from a large set of diffraction images on its own. We envision for the near future that the gained insights will lead to multiple new approaches regarding neural networks and diffraction data. For example, the MSFT algorithm used in Langbehn et al. [1] can serve as a generative module in an end-to-end unsupervised classification routine, using large synthetic datasets as training data for a neural network. This approach can be extended to utilize the trained networks as an online-analysis tool during experiments. Furthermore, we hope to develop an unsupervised approach that connects recent research on Generative Adversarial Network theory [77–80] and mutual information maximization [81] with the results of this paper. Such an approach would allow finding characteristic classes of patterns within a data set without any a priori knowledge about the recorded data.
All of the code, written in Python 3.6+ and using the Tensorflow framework, is available on GitHub (https://github.com/julian-carpenter/airynet), free to use under the MIT License. We hope the community uses and improves the code provided in this repository.

ACKNOWLEDGMENTS
We would like to thank K. Kolatzki, B. Senfftleben, R. M. P. Tanyag, M. J. J. Vrakking, A. Rouzée, B. Fingerhut, D. Engel and A. Lübcke from the Max-Born-Institut, Ruslan Kurta from the European XFEL, and Christian Peltz as well as Thomas Fennel from the University of Rostock for fruitful discussions. This work received financial support from the Deutsche Forschungsgemeinschaft under Grants
MO 719/13-1 and STI 125/19-1, and from the Leibniz Grant
SAW/2017/MBI4.

[1] B. Langbehn, K. Sander, Y. Ovcharenko, C. Peltz, A. Clark, M. Coreno, R. Cucini, M. Drabbels, P. Finetti, M. Di Fraia, et al., Phys. Rev. Lett. 121, 255301 (2018).
[2] M. M. Seibert, T. Ekeberg, F. R. N. C. Maia, M. Svenda, J. Andreasson, O. Jönsson, D. Odić, B. Iwan, A. Rocker, D. Westphal, et al., Nature 470, 78 (2011).
[3] N. D. Loh, C. Y. Hampton, A. V. Martin, D. Starodub, R. G. Sierra, A. Barty, A. Aquila, J. Schulz, L. Lomb, J. Steinbrener, et al., Nature 486, 513 (2012).
[4] C. Bostedt, M. Adolph, E. Eremina, M. Hoener, D. Rupp, S. Schorb, H. Thomas, A. R. B. de Castro, and T. Möller, J. Phys. B: At. Mol. Opt. Phys. 43, 194011 (2010).
[5] L. F. Gomez, K. R. Ferguson, J. P. Cryan, C. Bacellar, R. M. P. Tanyag, C. Jones, S. Schorb, D. Anielski, A. Belkacem, C. Bernando, et al., Science 345, 906 (2014).
[6] H. N. Chapman and K. A. Nugent, Nat. Photonics 4, 833 (2010).
[7] D. Rupp, N. Monserud, B. Langbehn, M. Sauppe, J. Zimmermann, Y. Ovcharenko, T. Möller, F. Frassetto, L. Poletto, A. Trabattoni, et al., Nat. Commun. 8, 493 (2017).
[8] Z. Y. Li, N. P. Young, M. Di Vece, S. Palomba, R. E. Palmer, A. L. Bleloch, B. C. Curley, R. L. Johnston, J. Jiang, and J. Yuan, Nature 451, 46 (2008).
[9] J. Farges, M. F. de Feraudy, B. Raoult, and G. Torchet, J. Chem. Phys. 84, 3491 (1986).
[10] D. E. Clemmer and M. F. Jarrold, J. Mass Spectrom. 32, 577 (1997).
[11] O. Kostko, B. Huber, M. Moseler, and B. von Issendorff, Phys. Rev. Lett. 98, 043401 (2007).
[12] A. Sakdinawat and D. Attwood, Nat. Photonics 4, 840 (2010).
[13] C. Bostedt, H. N. Chapman, J. T. Costello, J. R. C. López-Urrutia, S. Düsterer, S. W. Epp, J. Feldhaus, A. Föhlisch, M. Meyer, T. Möller, et al., Nucl. Instrum. Methods Phys. Res. Sect. A, 108 (2009).
[14] T. Gorkhover, M. Adolph, D. Rupp, S. Schorb, S. W. Epp, B. Erk, L. Foucar, R. Hartmann, N. Kimmel, K.-U. Kühnel, et al., Phys. Rev. Lett. 108, 245005 (2012).
[15] C. Bostedt, T. Gorkhover, D. Rupp, M. Thomas, and T. Möller, in Synchrotron Light Sources and Free-Electron Lasers, edited by E. Jaeschke, S. Khan, R. J. Schneider, and J. B. Hastings (Springer International Publishing, Cham, 2016), 1st ed., Chap. Clusters and Nanocrystals, pp. 1–38.
[16] P. Emma, R. Akre, J. Arthur, R. Bionta, C. Bostedt, J. Bozek, A. Brachmann, P. Bucksbaum, R. Coffee, F.-J. Decker, et al., Nat. Photonics 4, 641 (2010).
[17] C. Bostedt, S. Boutet, D. M. Fritz, Z. Huang, H. J. Lee, H. T. Lemke, A. Robert, W. F. Schlotter, J. J. Turner, and G. J. Williams, Rev. Mod. Phys. 88, 015007 (2016).
[18] G. D. Calvey, A. M. Katz, C. B. Schaffer, and L. Pollack, Struct. Dyn. 3, 054301 (2016).
[19] E. A. Schneidmiller, Photon beam properties at the European XFEL, Tech. Rep. (XFEL, Hamburg, 2011).
[20] D. Rupp, M. Adolph, L. Flückiger, T. Gorkhover, J. P. Müller, M. Müller, M. Sauppe, D. Wolter, S. Schorb, R. Treusch, et al., J. Chem. Phys. 141, 044306 (2014).
[21] I. Barke, H. Hartmann, D. Rupp, L. Flückiger, M. Sauppe, M. Adolph, S. Schorb, C. Bostedt, R. Treusch, C. Peltz, et al., Nat. Commun. 6, 6187 (2015).
[22] I. V. Lundholm, J. A. Sellberg, T. Ekeberg, M. F. Hantke, K. Okamoto, G. van der Schot, J. Andreasson, A. Barty, J. Bielecki, P. Bruza, et al., IUCrJ 5, 531 (2018).
[23] J. Flamant, N. Le Bihan, A. V. Martin, and J. H. Manton, Phys. Rev. E 93, 053302 (2016).
[24] T. Ekeberg, M. Svenda, C. Abergel, F. R. Maia, V. Seltzer, J.-M. Claverie, M. Hantke, O. Jönsson, C. Nettelblad, G. van der Schot, et al., Phys. Rev. Lett. 114, 098102 (2015).
[25] K. He, X. Zhang, S. Ren, and J. Sun, arXiv:1603.05027 (2016).
[26] N. Meidinger, R. Andritschke, R. Hartmann, S. Herrmann, P. Holl, G. Lutz, and L. Strüder, Nucl. Instrum. Methods Phys. Res. Sect. A, 251 (2006).
[27] R. P. Kurta, M. Altarelli, and I. A. Vartanyants, in Adv. Chem. Phys., edited by S. A. Rice and A. R. Dinner (John Wiley & Sons, Inc., 2016), Chap. Structural analysis by x-ray intensity angular cross correlations, pp. 1–39.
[28] S. A. Bobkov, A. B. Teslyuk, R. P. Kurta, O. Y. Gorobtsov, O. M. Yefanov, V. A. Ilyin, R. A. Senin, and I. A. Vartanyants, J. Synchrotron Radiat. 22, 1345 (2015).
[29] A. Atla, R. Tada, V. Sheng, and N. Singireddy, in J. Comput. Sci. Coll., Vol. 26 (Consortium for Computing Sciences in Colleges, 2011), Chap. Sensitivity of different machine learning algorithms to noise, pp. 96–103.
[30] C. H. Yoon, P. Schwander, C. Abergel, I. Andersson, J. Andreasson, A. Aquila, S. Bajt, M. Barthelmess, A. Barty, M. J. Bogan, et al., Opt. Express 19, 16542 (2011).
[31] D.-L. Deng, X. Li, and S. Das Sarma, Phys. Rev. B 96, 195145 (2017).
[32] P. Zhang, H. Shen, and H. Zhai, Phys. Rev. Lett. 120, 066401 (2018).
[33] R. D. Field, Y. Kanev, M. Tayebnejad, and P. A. Griffin, Phys. Rev. D, 2296 (1996).
[34] W. Bhimji, S. A. Farrell, T. Kurth, M. Paganini, Prabhat, and E. Racah, arXiv:1711.03573 (2017).
[35] K. Mills, M. Spanner, and I. Tamblyn, Phys. Rev. A 96, 042113 (2017).
[36] S. Manzhos, K. Yamashita, and T. Carrington, Chem. Phys. Lett., 217 (2009).
[37] Y. Zhu, Q. Ouyang, and Y. Mao, BMC Bioinformatics 18, 348 (2017).
[38] Z. Gao, L. Wang, L. Zhou, and J. Zhang, IEEE J. Biomed. Health Inform. 21, 416 (2017).
[39] T. W. Ke, A. S. Brewster, S. X. Yu, D. Ushizima, C. Yang, and N. K. Sauter, J. Synchrotron Radiat. 25, 655 (2018).
[40] S. Kassemeyer, J. Steinbrener, L. Lomb, E. Hartmann, A. Aquila, A. Barty, A. V. Martin, C. Y. Hampton, S. Bajt, M. Barthelmess, et al., Opt. Express 20, 4149 (2012).
[41] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, arXiv:1710.11063 (2017).
[42] V. Lyamayev, Y. Ovcharenko, R. Katzy, M. Devetta, L. Bruder, A. LaForge, M. Mudrich, U. Person, F. Stienkemeier, M. Krikunova, et al., J. Phys. B: At. Mol. Opt. Phys. 46, 164007 (2013).
[43] C. Svetina, C. Grazioli, N. Mahne, L. Raimondi, C. Fava, M. Zangrando, S. Gerusina, M. Alagia, L. Avaldi, G. Cautero, et al., J. Synchrotron Radiat. 22, 538 (2015).
[44] E. Allaria, R. Appio, L. Badano, W. Barletta, S. Bassanese, S. Biedron, A. Borga, E. Busetto, D. Castronovo, P. Cinquegrana, et al., Nat. Photonics 6, 699 (2012).
[45] B. Frénay and A. Kabán, in ESANN (2014), Chap. A comprehensive introduction to label noise.
[46] F. R. N. C. Maia, Nat. Methods 9, 854 (2012).
[47] M. A. Arbib, Brains, Machines, and Mathematics (Springer-Verlag, 1987), p. 202.
[48] V. Nair and G. E. Hinton, in Proc. 27th Int. Conf. Mach. Learn. (2010), Chap. Rectified linear units improve restricted Boltzmann machines, pp. 807–814.
[49] A. L. Maas, A. Y. Hannun, and A. Y. Ng, in Proc. ICML, Vol. 30 (2013), Chap. Rectifier nonlinearities improve neural network acoustic models, p. 3.
[50] J. Schmidhuber, Neural Networks 61, 85 (2015).
[51] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[52] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, arXiv:1602.07261 (2016).
[53] G. Hinton, O. Vinyals, and J. Dean, arXiv:1503.02531 (2015).
[54] C. E. Shannon, Bell Syst. Tech. J. 27, 379 (1948).
[55] S. Kullback and R. A. Leibler, Ann. Math. Stat. 22, 79 (1951).
[56] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016), p. 775.
[57] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323, 533 (1986).
[58] K. He, X. Zhang, S. Ren, and J. Sun, arXiv:1512.03385 (2015).
[59] S. Zagoruyko and N. Komodakis, arXiv:1605.07146 (2016).
[60] H. Zou and T. Hastie, J. R. Stat. Soc. Ser. B 67, 301 (2005).
[61] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, arXiv:1207.0580 (2012).
[62] L. Perez and J. Wang, arXiv:1712.04621 (2017).
[63] S. Ikeda and H. Kono, Opt. Express 20, 3375 (2012).
[64] T. Shintake, Phys. Rev. E, 041906 (2008).
[65] B. Hammouda, J. Appl. Crystallogr. 43, 716 (2010).
[66] S. K. Sinha, E. B. Sirota, S. Garoff, and H. B. Stanley, Phys. Rev. B 38, 2297 (1988).
[67] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, arXiv:1511.07289 (2015).
[68] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, arXiv:1409.4842 (2014).
[69] J. F. Kolen and S. C. Kremer, A Field Guide to Dynamical Recurrent Networks (2001), doi:10.1109/9780470544037.ch14.
[70] A. Nguyen, J. Yosinski, and J. Clune, arXiv:1412.1897 (2014).
[71] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, arXiv:1610.08401 (2016).
[72] R. P. Kurta, J. J. Donatelli, C. H. Yoon, P. Berntsen, J. Bielecki, B. J. Daurer, H. DeMirci, P. Fromme, M. F. Hantke, F. R. N. C. Maia, et al., Phys. Rev. Lett. 119, 158102 (2017).
[73] J. J. Donatelli, P. H. Zwart, and J. A. Sethian, Proc. Natl. Acad. Sci. U.S.A. 112, 10286 (2015).
[74] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, in Proc. IEEE Int. Conf. Comput. Vis. (IEEE, 2017), pp. 618–626, arXiv:1610.02391.
[75] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, arXiv:1512.04150 (2015).
[76] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu, arXiv:1802.10171 (2018).
[77] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, arXiv:1805.08318 (2018).
[78] T. Miyato and M. Koyama, arXiv:1802.05637 (2018).
[79] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, arXiv:1802.05957 (2018).
[80] I. Goodfellow, arXiv:1701.00160 (2016).
[81] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, arXiv:1606.03657 (2016).
[82] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Proc. IEEE 86, 2278 (1998).
[83] A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Proc. 25th Int. Conf. Neural Inf. Process. Syst., Vol. 1 (Curran Associates Inc., 2012), Chap. ImageNet classification with deep convolutional neural networks, pp. 1097–1105.
[84] K. Simonyan and A. Zisserman, arXiv:1409.1556 (2014).
[85] R. Eldan and O. Shamir, arXiv:1512.03965 (2015).
[86] A. Veit, M. Wilber, and S. Belongie, arXiv:1605.06431 (2016).
[87] S. Li, J. Jiao, Y. Han, and T. Weissman, arXiv:1611.01186 (2016).
[88] H. Shimodaira, J. Stat. Plan. Inference 90, 227 (2000).
[89] J. Jiang, A Literature Survey on Domain Adaptation of Statistical Classifiers (2008).
[90] S. Ioffe and C. Szegedy, arXiv:1502.03167 (2015).
Appendix A: Architectural design choices
In this section, we describe and explain our choices of neural network architecture to establish a baseline performance when working with diffraction patterns, before the inclusion of our diffraction-specific activation function, see section 5.1 in the main manuscript. We present the theory and background on available architectures and provide results for two architectures with five depth layouts.

There are different layer styles from which we can build a neural network. The nomenclature is that a full arrangement of all layers is called the architecture, or configuration, of the network. For our tests, we use two different neural network architectures, a ResNet and a VGG-Net, both with multiple depth layouts. For the ResNet, we train and evaluate three depth variations (18, 50 and 101 layers), and for the VGG-Net we train two variants (16 and 19 layers). The structure of this section is as follows: First, we explain how a convolutional layer works in general. Second, we motivate the derivation of the VGG-Net from preceding architectures, and third, we show how the ResNet architecture can be explained by expanding the core ideas used in the VGG-Net. In the following section, we then present the results for all the configurations trained here.

Almost every architectural design is empirically derived [50–52] and consists of multiple combinations of only a few basic layer styles, namely the fully connected layer, the convolutional layer, a pooling operation and a batch normalization operation. We discuss the pooling and batch normalization layers only in appendix C because of their minor role within the neural network; the reader is also referred to the exhaustive overviews in Schmidhuber [50] and LeCun et al. [51]. Since the convolutional layer serves as a fundamental basis for image analysis with neural networks, we explain it here in more detail.

The very basic idea of a convolutional layer is that nearby pixels in an input image are more strongly correlated than more distant pixels; the region a filter sees at once is called a local receptive field. Therefore, by calculating a convolution over an input image with trainable filters, local features can be extracted: N filters, each of size M × M, slide over the input image and produce N convolved maps, called feature maps. One filter uses the same weights on all parts of the input image for producing one feature map; this is called weight sharing. Weight sharing not only reduces the complexity of the model but also provides a bridge towards the convolution function in mathematics: with weight sharing, we can identify the filter within the convolutional layer as the kernel function of the mathematical convolution. Figure 6 a) shows a schematic of a convolutional layer with one filter.
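As a toy illustration of this operation (our own sketch, not part of the released code base), the following Python snippet slides a single 3 × 3 filter with stride 2 over a 7 × 7 input, reproducing the feature-map size of Figure 6 a):

```python
import numpy as np

def conv2d_single_filter(img, kernel, stride=2):
    """Slide one M x M filter over the image (no padding) and return
    the feature map; cf. Figure 6 a)."""
    m = kernel.shape[0]
    h = (img.shape[0] - m) // stride + 1
    w = (img.shape[1] - m) // stride + 1
    fmap = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = img[i * stride:i * stride + m,
                        j * stride:j * stride + m]
            fmap[i, j] = np.sum(patch * kernel)  # same (shared) weights
    return fmap

img = np.arange(49.0).reshape(7, 7)  # a 7 x 7 toy input
kernel = np.ones((3, 3))             # one 3 x 3 filter
print(conv2d_single_filter(img, kernel).shape)  # (3, 3), as in Figure 6 a)
```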
FIG. 6. Schematic of a convolutional operation inside a convolutional layer in a), and of a classic skip connection found in the ResNet architecture in b). a) illustrates the local receptive fields and shared weights concepts. The convolutional filter has size 3 × 3 and slides over an input image of size 7 × 7, which produces an output, called a feature map, of size 3 × 3. The stride is the distance the filter moves in each step, implied by the gray shading every 2 pixels in the input image. Using a local receptive field describes the inclusion of nearby pixels, and weight sharing means using the same filter weights for the whole input image. The calculation at the bottom (4 + 0 + 0 + 12 + 5 + 28 + 8 + 24 + 5 = 86) is for the second entry in the feature map. b) A classical skip connection is shown with two convolutional layers that approximate a sparse residual, which gets added to the identity at the output.
This exemplary filter has size 3 × 3; the feature map is smaller than the input image because the filter moves two pixels for each step. This step size is called the stride. Hereafter we use the notation conv(a, b, c) for a convolutional layer with filter size a × a, number of filters b and stride c. The example from Figure 6 a) could, therefore, be written as
conv(3, 1, 2) and would result in 9 trainable weight parameters plus 1 bias parameter (not shown in the figure).

This concept was introduced with the LeNet architecture by LeCun et al. [82], which is considered the seminal work in the field and the first deep convolutional neural network. After Yann LeCun proposed the LeNet architecture, further research [83] led to the now de-facto standard for plain convolutional networks, the VGG-Net. Simonyan and Zisserman [84] proposed the original architecture, which consists of up to 19 weight layers, of which 16 are convolutional layers and 3 are fully connected ones. It is easy to build, easy to train and in general provides good results [50, 51]. For these reasons, we include two variations of it in our tests, namely variants D and E (nomenclature from [84]). Table 7 shows the details of the architecture, using the naming convention we introduced with the convolutional layer. Simonyan and Zisserman [84] derived the VGG-Net directly from the LeNet by arguing that three convolutional layers with filter size 3 and stride 1 (VGG-Net) achieve better results than a single filter with size 7 and stride 2 (LeNet), which amounts to the same effective local receptive field size [84]. Three layers perform better than one due to having two additional non-linear activation functions and reduced complexity (fewer weight parameters because of the smaller filter sizes), which forces the neural network not only to be more discriminative but also to find sparser solutions [84].
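The complexity part of this argument is easy to verify by counting weights. The following sketch (plain Python, our own illustration; it assumes an equal number of input and output channels) compares three stacked 3 × 3 layers with a single 7 × 7 layer:

```python
def conv_params(filter_size, n_in, n_out, bias=True):
    """Weight count of one conv layer: filter_size^2 * n_in * n_out (+ biases)."""
    return filter_size**2 * n_in * n_out + (n_out if bias else 0)

channels = 64  # assumed channel count, identical at input and output

# Three stacked 3x3 layers (VGG style): same 7x7 effective receptive field.
vgg_style = 3 * conv_params(3, channels, channels)

# One 7x7 layer (LeNet style).
lenet_style = conv_params(7, channels, channels)

print(vgg_style, lenet_style)  # 110784 vs. 200768: fewer parameters,
                               # plus two extra non-linearities in between
```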
TABLE 7. The deep neural network architecture of the VGG variants D and E. conv(a, b, c) is a convolutional layer with filter size a × a, number of filters b and stride c. max pooling(d, e) is a max pooling layer with filter size d × d and stride e. Note that we changed the fully connected layers of the original architecture to convolutional layers.

Variant       D                     E
Depth         16                    19
Input         2 × conv(3, 64, 1)    2 × conv(3, 64, 1)
              max pooling(2, 2)
Block 1       2 × conv(3, 128, 1)   2 × conv(3, 128, 1)
              max pooling(2, 2)
Block 2       3 × conv(3, 256, 1)   4 × conv(3, 256, 1)
              max pooling(2, 2)
Block 3       3 × conv(3, 512, 1)   4 × conv(3, 512, 1)
              max pooling(2, 2)
Block 4       3 × conv(3, 512, 1)   4 × conv(3, 512, 1)
              max pooling(2, 2)
Output        conv(7, 4096, 1), conv(1, 4096, 1), conv(1, N, 1)

Building on the results achieved by the VGG-Net, it was shown that the depth of a deep neural network directly relates to its classification capabilities [58, 68, 85]. This led to the introduction of so-called residual skip connections, which further exploit this depth-matters concept [58, 68]. These residual skip connections are the name-giving component of the ResNet architecture. In principle, a ResNet still uses the VGG architectural layout but exchanges the convolutional blocks 1 to 4 with residual skip connections; compare tables 7 and 8. This exchange drastically reduces the complexity of the whole network while increasing the number of layers.

The VGG architecture can be broken down into six blocks: one input block, one output block, and four convolutional blocks (see table 7). Block 2 is the first block in which there are distinctions between VGG variants D and E. The VGG-Net architecture proved that increasing the depth and decreasing the amount and size of the filters increases the accuracy, which ultimately gave rise to plain skip connections: blocks of few convolutional layers designed to replace the large amounts of filters in one layer with multiple layers of fewer, and smaller, filters. Two types exist, a classical and a bottleneck skip connection; they differ only in how much the depth is increased and the complexity decreased. This addition alone only modifies the depth and complexity of the network and yields what is called a plain network, see He et al. [58]. It performs reasonably well, but not significantly better than the VGG-Net. A residual skip connection differs from a plain skip connection only in adding the identity of its inputs to its outputs. This way, all the convolutional layers in a skip connection learn only a residual of their input. This simple technique enables a ResNet to outperform all other convolutional deep neural network architectures [25, 52]. Figure 6 b) exemplifies a classical residual skip connection. There is still an ongoing debate about why a residual neural network performs so well [58, 68, 86]. Research has shown that ResNets find sparser solutions faster due to their layout, and that they behave like ensembles of shallower networks, with information flow only activated on 10 to 34 layers even when the neural network has a depth of 101 layers [58, 68, 86]. However, besides the empirical success, one of the critical advantages of ResNets is that reaching training convergence does not get significantly harder when increasing the depth of the neural network, which is usually the case with other architectures. Therefore, training very deep residual neural networks is no more difficult than training shallow plain neural networks [52, 87]. For these reasons, we train three variants, with 18, 50 and 101 layers, of a further optimized version of the classical ResNet, called the pre-activated ResNet [25]; see table 8 for implementation details.
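To make the distinction concrete, the following is a minimal sketch of a classical pre-activated residual skip connection in TensorFlow/Keras (the framework our released code uses); the function name is our own, the projection variant for changing channel counts or strides is omitted, and the block assumes that the channel count of x equals filters:

```python
import tensorflow as tf
from tensorflow.keras import layers

def preact_residual_block(x, filters):
    """Classical pre-activated residual skip connection, cf. Figure 6 b).

    Two 3x3 convolutions learn only a residual, which is added back
    onto the identity of the block input."""
    identity = x
    out = layers.BatchNormalization()(x)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters, 3, padding="same")(out)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters, 3, padding="same")(out)
    return layers.Add()([out, identity])  # identity + learned residual
```

Removing the final Add() turns this into a plain skip connection in the sense used above: same depth and complexity, but without the residual learning behavior.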
Table 9 shows the overall evaluation metrics on the helium and the CXIDB datasets, as well as the training wall time. Table 10 shows the per-class evaluation metrics for the helium dataset; per-class metrics are not needed for the CXIDB dataset, because predictions on the helium dataset are a multi-class problem whereas predictions on the CXIDB data are single-class. Single-class, or one-hot, problems have identical overall and per-class evaluation metrics. We trained all models as described in section 3.1.5 in the main manuscript. Train Time_Max is the time at which the neural network achieved the highest accuracy score on the evaluation dataset, and Train Time_Full is the time for training the full 200 epochs.
However, in practice we achieved optimal convergence after training for 70 to 100 epochs; beyond this point, the network showed overfitting.

Both VGG models took significantly longer to train than the ResNet variants, needing more than 6 h of training time while reaching lower maximum accuracy scores: 0.959 compared to 0.964 for the helium data and 0.970 vs. 0.978 for the CXIDB data. Also, the accuracy did not change much when increasing the depth from 16 to 19 layers; precision even decreased slightly and recall remained unchanged. On the other hand, increasing complexity within the ResNet architecture helped to boost the accuracy from 0.955 (CXIDB data: 0.967) for 18 layers to 0.964 (CXIDB data: 0.978) for 101 layers.

For the classes Oblate, Spherical, Streak and Empty, the ResNet reached a precision of 0.92 or higher together with recall values above 0.97 (see Table 10). For Prolate, Bent, Double Rings and Layered, the ResNet reached a good precision, but a recall score of ≈ 0.65 shows that it missed almost a third of all available images, indicating that we failed to generalize the network for these classes. For Elliptical, Newton Rings and Asymmetric images, the recall is considerably lower, at 0.48, 0.23 and 0.22, respectively. Elliptical is the only class of these three where the precision is high enough for using the neural network as a predictor. For the Newton Rings and Asymmetric classes, with precision scores around 0.6, the neural network is effectively guessing.

The performance of all variants clearly shows the generally good classification capabilities of convolutional deep neural networks in the use case of diffraction patterns. Even the lowest-performing neural network can outperform previous classification approaches by a large margin; compare with [28]. In particular, the results of ResNet18 are compelling: it is small, easy to train and has relatively low complexity. Although having only a fraction of the trainable parameters, it performed almost always on par with the much more complex VGG architectures, all while requiring only a fraction of the training time.
TABLE 8. Used ResNet variants; see also the 18, 50 and 101 layer layouts in [58]. Note that we added the pre-activated layer layout from [59]. conv(a, b, c) is a convolutional layer with filter size a × a, number of filters b and stride c. max pooling(d, e) is a max pooling layer with filter size d × d and stride e. avg pooling is a global average pooling layer, and fc(f) is a fully connected layer with output size f. Layers in bold emphasis have a stride of 2 during their first iteration, therefore reducing the dimension by a factor of 2.

Variant        Classic                   Bottleneck                  Bottleneck
Depth          18                        50                          101
Input          conv(7, 64, 2)
Pooling        max pooling(3, 2)
Block 1        2 × [conv(3, 64, 1),      3 × [conv(1, 64, 1),        3 × [conv(1, 64, 1),
                    conv(3, 64, 1)]           conv(3, 64, 1),             conv(3, 64, 1),
                                              conv(1, 256, 1)]            conv(1, 256, 1)]
Block 2        2 × [conv(3, 128, 1),     4 × [conv(1, 128, 1),       4 × [conv(1, 128, 1),
                    conv(3, 128, 1)]          conv(3, 128, 1),            conv(3, 128, 1),
                                              conv(1, 512, 1)]            conv(1, 512, 1)]
Block 3        2 × [conv(3, 256, 1),     6 × [conv(1, 256, 1),       23 × [conv(1, 256, 1),
                    conv(3, 256, 1)]          conv(3, 256, 1),             conv(3, 256, 1),
                                              conv(1, 1024, 1)]            conv(1, 1024, 1)]
Block 4        2 × [conv(3, 512, 1),     3 × [conv(1, 512, 1),       3 × [conv(1, 512, 1),
                    conv(3, 512, 1)]          conv(3, 512, 1),            conv(3, 512, 1),
                                              conv(1, 2048, 1)]           conv(1, 2048, 1)]
Output block   avg pooling, fc(N)

TABLE 9. Overall evaluation metrics for all architectures and both datasets. The training time after which the neural network scored the highest accuracy on the evaluation dataset is labeled Train Time_Max, and Train Time_Full is the time for training the full 200 epochs. The table gives the maximum values during training for accuracy, precision, and recall. Bold scores are the best results in their respective category.
Architecture          ResNet                      VGG
Depth                 18      50      101         16      19
Dataset: Helium
Accuracy              0.955   …       0.964       …       …
Precision             …       …       …           …       …
Recall                …       …       …           …       …
Train Time_Max [h]    …       …       …           …       …
Train Time_Full [h]   …       …       …           …       …
Dataset: CXIDB
Accuracy              0.967   …       0.978       …       …
Precision             …       …       …           …       …
Recall                …       …       …           …       …
Train Time_Max [h]    …       …       …           …       …
Train Time_Full [h]   …       …       …           …       …

TABLE 10. Per-class accuracy, precision and recall values for the best performing ResNet configuration with 101 layers. Samples are the number of images whose ground-truth label is positive in the evaluation dataset. Results reflect the maximum achieved value reached throughout the training process, assessed on the evaluation dataset.

Class          Accuracy  Precision  Recall  Samples
Oblate         0.9681    0.9770     0.9965  988
Spherical      0.9166    0.9247     0.9849  869
Elliptical     0.9231    0.8054     0.4836  119
Newton rings   0.9352    0.6325     0.2282  69
Prolate        0.9690    0.9274     0.6777  68
Bent           0.9657    0.8161     0.6487  59
Asymmetric     0.9458    0.6044     0.2207  55
Streak         0.9898    0.9372     0.9876  36
Double Rings   0.9768    0.7708     0.6788  33
Layered        0.9896    0.9062     0.6170  7
Empty          0.9904    0.9537     0.9763  32

Appendix B: Derivation of the binary cross-entropy

Here, we give a derivation of the binary cross-entropy (Equation 6 in the main manuscript). We start with the most general form of the cross-entropy, given by:

$$H(p, q) = H(p) + D_{KL}(p \| q), \tag{B1}$$

where $H(p)$ is the Shannon entropy of $p$, and $D_{KL}(p \| q)$ is the Kullback–Leibler divergence of $p$ and $q$ [56]. This is equivalent to:

$$H(p, q) = -\sum_i p_i \log q_i, \tag{B2}$$

where $p_i$ and $q_i$ are two probability distributions over the same set of events; $p_i$ is the "correct" distribution, and $q_i$ is the approximation of $p_i$ by the deep neural network. Since we are using a Bernoulli distribution as our probabilistic model, there are only two outcomes that one event ($k$) can have: $k \in \{0, 1\}$. The probabilities for both outcomes of one event, under both distributions, can be written as:

$$p(x) = \begin{cases} y(x) & \text{if } k = 1 \\ 1 - y(x) & \text{if } k = 0 \end{cases} \qquad q(x) = \begin{cases} \hat{y}(x) & \text{if } k = 1 \\ 1 - \hat{y}(x) & \text{if } k = 0 \end{cases}$$

Here $x$ is some event, $y$ is the ground-truth label and $\hat{y}$ is the approximate probability assigned by the deep neural network. Since we are using a sigmoid function at the output of our deep neural network, we can simplify Equation B2. Using

$$\hat{y}(x)_{\mathrm{sigmoid}} = \frac{1}{1 + \exp(-x)},$$

we can write:

$$\begin{aligned} H(p, q) &= -\sum_i p_i \log q_i \\ &= -y(x) \log(\hat{y}(x)) - (1 - y(x)) \log(1 - \hat{y}(x)) \\ &= -y(x) \log\!\left(\frac{1}{1 + \exp(-x)}\right) - (1 - y(x)) \log\!\left(1 - \frac{1}{1 + \exp(-x)}\right) \\ &= x - x\, y(x) + \log(1 + \exp(-x)), \end{aligned}$$

where $x$ is an event (e.g. the activation in the output layer of the deep neural network) and $y$ is the real label of this event.
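As a quick numerical check of the last step, the following sketch (plain Python/NumPy, our own illustration) compares the simplified closed form with the textbook binary cross-entropy for a few arbitrary logits:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 11)       # pre-sigmoid activations (logits)
y = np.array([0.0, 1.0] * 5 + [1.0])  # arbitrary ground-truth labels

sigmoid = 1.0 / (1.0 + np.exp(-x))

# Textbook binary cross-entropy, Eq. (B2) for a Bernoulli model.
bce = -y * np.log(sigmoid) - (1 - y) * np.log(1 - sigmoid)

# Simplified closed form derived above: x - x*y + log(1 + exp(-x)).
simplified = x - x * y + np.log(1 + np.exp(-x))

assert np.allclose(bce, simplified)  # identical up to floating-point error
```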
Appendix C: Further building blocks of deep neural networks

This section describes the pooling layer and the batch normalization layer in more detail. Since these components are not critical for the neural network, their explanation is given only here in the supplemental material.
1. Pooling
There are two commonly used variants of pooling layers, the max pool and the average pool. The idea is to reduce the dimensionality of the output of a preceding layer by letting a filter of size a × a slide over parts of the image with step size b, called the stride, and perform a down-sampling operation within each window. A max pool filter takes only the maximum value, and an avg pool filter averages over all values within its receptive field [82, 83]. This process is equivalent to a convolutional operation, but instead of a matrix multiplication with a convolutional kernel, the pooling operation is carried out.
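A minimal sketch of both variants (plain Python/NumPy, our own illustration):

```python
import numpy as np

def pool2d(img, size=2, stride=2, mode="max"):
    """Slide a size x size window with the given stride over a 2D map
    and reduce each window to a single value (down-sampling)."""
    h = (img.shape[0] - size) // stride + 1
    w = (img.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = img[i * stride:i * stride + size,
                         j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)  # max pool or avg pool
    return out

fmap = np.arange(16.0).reshape(4, 4)
print(pool2d(fmap).shape)  # (2, 2): dimensionality reduced by a factor of 2
```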
2. Batch Normalization
Every layer within a deep neural network is, to some extent, modeling the probability distribution given to it by its preceding layer. This is a hierarchical regression problem, which becomes harder if one layer changes key characteristics of the modeled probability distribution (e.g. the mean, variance or kurtosis). This shift is then further multiplied in every succeeding layer and is therefore dependent on the depth of the network. This phenomenon is called a covariate shift [88]. Although this problem is solved in a deep neural network via domain adaptation, the costs of a covariate shift are usually much longer training times and reduced accuracy [89]. For this reason, a batch normalization layer (bn) is used to shift the mean of the mini-batch input to zero and to set the variance to one. This significantly reduces the training time and increases accuracy [90]. bn consists of four steps, after which a normalized mini-batch is returned (a short code sketch of these steps follows the list):

1. Calculate the mini-batch mean:
$$\mu_{mb} = \frac{1}{m} \sum_{i=1}^{m} x_i$$

2. Calculate the mini-batch variance:
$$\sigma_{mb}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{mb})^2$$

3. Normalize:
$$\hat{x}_i = \frac{x_i - \mu_{mb}}{\sqrt{\sigma_{mb}^2 + \epsilon}}$$

4. Scale and shift according to adjustable parameters:
$$y_i = \gamma \hat{x}_i + \beta$$

where $y_i$ is the normalized output of input $x_i$, and $\gamma$ and $\beta$ are adjustable parameters.
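The promised sketch of these four steps (plain Python/NumPy, our own illustration; a framework implementation would additionally keep running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """The four bn steps above for a mini-batch x (batch on axis 0)."""
    mu_mb = x.mean(axis=0)                       # 1. mini-batch mean
    var_mb = x.var(axis=0)                       # 2. mini-batch variance
    x_hat = (x - mu_mb) / np.sqrt(var_mb + eps)  # 3. normalize
    return gamma * x_hat + beta                  # 4. scale and shift

batch = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 10))
out = batch_norm(batch)
print(out.mean(), out.std())  # approx. 0 and 1 before scale/shift
```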
In chapter 6 of the main manuscript we show what the neural network deemed the most relevant areas within an input image. We calculated these so-called heatmaps with an algorithm called GradCam++. The main idea is based on CAM [75] and GradCam [74] and allows for a very intuitive explanation of the decisions made by a convolutional deep neural network [41].

The core principle is that the output of a convolutional deep neural network can be expressed as a linear combination of the globally average-pooled feature maps of the last convolutional layer:

$$Y^c = \sum_k w_k^c \sum_i \sum_j A_{ij}^k,$$

where $A_{ij}^k$ is one feature map of all $k$ maps from the last convolutional layer and $w_k^c$ are the weights of feature map $k$ for a particular class prediction $c$. $Y^c$ is the predicted probability that the input image belongs to this certain class $c$. In the GradCam++ formalism the weights can be calculated as:

$$w_k^c = \sum_i \sum_j a_{ij}^{kc}\, \mathrm{LeakyReLU}\!\left(\frac{\partial Y^c}{\partial A_{ij}^k}\right), \tag{D1}$$

where $a_{ij}^{kc}$ are the gradient weights and $\mathrm{LeakyReLU}(\cdot)$ is a rectified linear unit activation function, very similar to the one we used throughout the main manuscript. $a_{ij}^{kc}$ depends only on $A_{ij}^k$ and $Y^c$ via:

$$a_{ij}^{kc} = \frac{\dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2}}{2\,\dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2} + \sum_a \sum_b A_{ab}^k\, \dfrac{\partial^3 Y^c}{(\partial A_{ij}^k)^3}}$$

The final heatmap, often called a saliency map, can then be obtained as:

$$L_{ij}^c = \mathrm{LeakyReLU}\!\left(\sum_k w_k^c A_{ij}^k\right).$$
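A compact sketch of this computation in TensorFlow (our own illustration, not the released implementation; it assumes a trained tf.keras classifier and uses the common approximation that, for an exponential-family score, the higher derivatives reduce to powers of the first derivative):

```python
import tensorflow as tf

def grad_cam_pp(model, image, class_idx, conv_layer_name):
    """Minimal GradCam++ saliency map following the equations above.

    model           : a trained tf.keras classifier (assumption).
    image           : one input image, shape (H, W, C).
    conv_layer_name : name of the last convolutional layer.
    """
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[tf.newaxis])
        score = preds[0, class_idx]              # Y^c
    grads = tape.gradient(score, conv_out)       # dY^c / dA^k_ij
    # Gradient weights a^{kc}_ij: higher derivatives approximated by
    # powers of the first derivative.
    denom = 2.0 * grads**2 + tf.reduce_sum(
        conv_out, axis=(1, 2), keepdims=True) * grads**3
    alpha = grads**2 / (denom + 1e-8)
    # Eq. (D1): per-feature-map weights w^c_k.
    w = tf.reduce_sum(alpha * tf.nn.leaky_relu(grads), axis=(1, 2))
    # Saliency map: weighted sum of the feature maps, normalized to [0, 1].
    cam = tf.nn.leaky_relu(
        tf.reduce_sum(w[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8))[0]
```

The returned map has the spatial resolution of the last convolutional layer and is interpolated up to the input-image size for overlays such as those in Figure 5.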