Deep neural networks for classifying complex features in diffraction images
Julian Zimmermann, Bruno Langbehn, Riccardo Cucini, Michele Di Fraia, Paola Finetti, Aaron C. LaForge, Toshiyuki Nishiyama, Yevheniy Ovcharenko, Paolo Piseri, Oksana Plekan, Kevin C. Prince, Frank Stienkemeier, Kiyoshi Ueda, Carlo Callegari, Thomas Möller, Daniela Rupp
Max-Born-Institut für Nichtlineare Optik und Kurzzeitspektroskopie, 12489 Berlin, Germany
Institut für Optik und Atomare Physik, Technische Universität Berlin, 10623 Berlin, Germany
Elettra-Sincrotrone Trieste S.C.p.A., 34149 Trieste, Italy
ISM-CNR, Istituto di Struttura della Materia, LD2 Unit, 34149 Trieste, Italy
Physikalisches Institut, Universität Freiburg, 79104 Freiburg, Germany
Division of Physics and Astronomy, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan
European XFEL GmbH, 22869 Schenefeld, Germany
CIMAINA and Dipartimento di Fisica, Università degli Studi di Milano, 20133 Milano, Italy
Department of Chemistry and Biotechnology, Swinburne University of Technology, Victoria 3122, Australia
Institute of Multidisciplinary Research for Advanced Materials, Tohoku University, Sendai 980-8577, Japan
∗ [email protected]
(Dated: June 25, 2019)

Intense short-wavelength pulses from free-electron lasers and high-harmonic-generation sources enable diffractive imaging of individual nano-sized objects with a single x-ray laser shot. The enormous data sets with up to several million diffraction patterns represent a severe problem for data analysis, due to the high dimensionality of imaging data. Feature recognition and selection is a crucial step to reduce the dimensionality. Usually, custom-made algorithms are developed at considerable effort to approximate the particular features connected to an individual specimen, but facing different experimental conditions, these approaches do not generalize well. On the other hand, deep neural networks are the principal instrument for today's revolution in automated image recognition, a development that has not been adapted to its full potential for data analysis in science. We recently published in [Langbehn et al., Phys. Rev. Lett. 121, 255301 (2018)] the first application of a deep neural network as a feature extractor for wide-angle diffraction images of helium nanodroplets. Here we present the setup, our modifications and the training process of the deep neural network for diffraction image classification and its systematic benchmarking. We find that deep neural networks significantly outperform previous attempts at sorting and classifying complex diffraction patterns and are a significant improvement for the much-needed assistance during post-processing of large amounts of experimental coherent diffraction imaging data.

PACS numbers: 05.10.-a, 07.05.-t, 61.05.C-, 87.59.-e
Keywords: residual convolutional deep neural networks, coherent diffraction imaging, image classification
1. INTRODUCTION
Coherent diffraction imaging (CDI) experiments of single particles in free flight have been proven to be a significant asset in the pursuit of understanding the structural composition of nano-scaled matter [2-7]. While traditional microscopy methods are able to image fixated, substrate-grown or deposited individual particles [8-12], only CDI can combine high-resolution imaging with single particles in free flight in one experiment [13-15]. CDI became possible due to the recent advent of short-wavelength free-electron lasers (FELs) producing coherent high-intensity x-ray pulses of femtosecond duration [16]. However, CDI also comes with its own set of new challenges.

One of the growing problems of CDI experiments is the sheer amount of recorded data that has to be analyzed. The LINAC Coherent Light Source (LCLS), for instance, has a repetition rate of 120 Hz with typical hit-rates ranging from 1 % to 30 % [16-18], greatly depending on the performed experiment. The newly opened European XFEL will have an even higher maximum repetition rate of 27 000 Hz [19], which may add up to several million diffraction patterns in a single 12-hour shift. The idea of using neural networks for the classification of large numbers of scattering patterns was born out of the significant difficulties of analyzing large data sets of clusters [20], in particular metal clusters [21]. Moreover, the ability to analyze such data sets is sought after by the community in general [22]. For example, for the successful determination of 3D structures from a CDI data set using the expansion-maximization-compression algorithm [22-24], it is necessary to sample the 3D Fourier space up to the Nyquist rate for the desired resolution, and this for all sub-species contained in the target under study. The achievable resolution, as well as the chance for successful convergence of the algorithm, correlates directly with the number of diffraction patterns with a high signal-to-noise ratio [23]. Thus, huge data sets are taken, and as a consequence of the sheer amount of data, it is getting increasingly complicated to distill the high-quality data subsets that are suitable for subsequent analysis steps.

The enormous success of neural networks in the regime of image processing and classification provides a unique way of facing the imminent data-analysis bottleneck and reduces the impending problem to a mere domain adaptation from datasets used throughout the industry to ones that are used in CDI research. This work aims to be a stepping-stone towards this adaptation by providing an introduction to the theory of deep neural networks and analyzing how to best transfer and optimize these algorithms to the domain of scattering images. As a new baseline, we train a widely used deep neural network architecture, a residual convolutional deep neural network [25], in a supervised manner with a training set of manually labeled data. We then adapt the neural network to the domain of diffraction images and improve on the baseline performance by addressing the following issues:

1. Modification of the architecture to account for the specificities of diffraction images and thus optimize the prediction capabilities.
2. Determination of the appropriate size of the training dataset in order to keep the manual work of a researcher at a moderate level.
3. Mitigation of experimental artifacts, in particular noisy diffraction images.

Experience has shown that a researcher is able to relate diffraction patterns produced by similarly shaped particles of different sizes and orientations in context with each other. However, a programmatic description for a classification and sorting of these mostly similar patterns is almost impossible to achieve.

Figure 1 illustrates the case of two diffraction patterns captured from almost identical particles but under different orientations. Both patterns clearly show an elongated and bent streak, but the bending is differently pronounced and directed. If we wanted to handcraft an algorithm that detects this feature, we would need to describe it via some appropriate metric that must take into account the various grades of inflection, direction, brightness, and completeness of this feature within every image. Furthermore, we would need to redo this for every characteristic feature in a diffraction image of which we want to find similar ones.

In addition to that, poor signal-to-noise ratios, straylight, a beam stop or the central hole of multichannel plates or pnCCDs [26] and overall poor image quality can even further increase the difficulty of making an automated classification of all images coherent [27-29].

Therefore, we need a robust classification routine that is insusceptible to the described artifacts, just as a researcher is, to tackle the upcoming data volume. Deep neural networks provide a way out of this situation, and we show in this paper that they outperform the current state-of-the-art classification and sorting routines.

FIG. 1. a) and b) show a capsule-shaped particle whose orientation and size differ. The scattering images are calculated using a multi-slice Fourier transform (MSFT) algorithm that simulates a wide-angle x-ray scattering experiment, which includes 3D information about the particle [7, 21]. The two incoming beams (indicated by the arrow on the left-hand side) produce very different scattering images, yet the dominant feature, an elongated bent streak, is distinctly visible in both calculations. A handcrafted algorithm is typically not able to identify the similarity between the two scattering patterns and would sort these two images into two distinct classes, although they belong to the same capsule shape class. A deep neural network can learn these complicated similarities on its own when we provide a few manually selected diffraction patterns that contain this feature.
Current state-of-the-art automatic classification routines for diffraction experiments employ so-called kernel methods [28, 30]. Bobkov et al. [28] trained a support-vector-machine on a public small-angle x-ray scattering dataset with an Accuracy of 87 %, but only on selected images (we will use this approach as a reference in section 4). Yoon et al. [30] were able to achieve an Accuracy of up to 90 % using unsupervised spectral clustering on a non-public small-angle x-ray scattering dataset.

Deep neural networks, on the other hand, have already been applied to a broad range of physics-related problems, ranging from predicting topological ground states [31], distinguishing different topological phases of topological band insulators [32], enhancing the signal-to-noise ratio at hadron colliders [33] and differentiating between so-called known-physics background and new-physics signals at the Large Hadron Collider [34], to helping solve the Schrödinger equation [35, 36]. Their ability to classify images has also been utilized in cryo-electron microscopy [37], medical imaging [38] and even for hit-finding in serial x-ray crystallography [39]. However, to our knowledge, this paper is the first application of deep neural networks for classifying complex features within diffraction patterns. We show that deep neural networks outperform the current state-of-the-art classification and sorting routines, while being insusceptible to typical artifact features of diffraction measurements. Furthermore, a deeper analysis of the trained network shows that it can understand complex concepts of what constitutes a characteristic feature in a diffraction pattern.

The paper is organized as follows: In section 2, the data set is presented and a few experimental details are discussed. Section 3 provides the fundamental theory to understand the basics of neural networks; it has two subsections. Subsection 3.1 covers the theory and algorithmic underpinnings of deep neural networks and how to train these models, and subsection 3.2 presents three common metrics to evaluate the quality of the neural network's predictions. Section 4 establishes our starting point, while the full benchmark report on the baseline neural network can be found in appendix A. We introduce the chosen network architecture and provide baseline results on the data presented in section 2 but also on a reference dataset for which classification results are already published [40]. In section 5, we discuss solutions for the above stated issues of applying neural networks to diffraction data. In subsection 5.1 we discuss the choice of the activation function for the neural network and present a novel logarithmic activation function that enhances the prediction performance with diffraction image data. Subsection 5.2 benchmarks the dependence of neural networks on training data size, asking essentially how much manually labeled data is needed for the neural network to give acceptable results, and subsection 5.3 presents an approach to harden the neural network against very noisy data using a custom two-point cross-correlation map. In section 6 we then provide more profound insights into the output of the neural network by showing and discussing calculated heatmaps that visualize the gradient flow within the neural network. These images directly correlate with what the neural network sees; they are created using an advanced visualization algorithm called GradCam++ [41]. Finally, we give a summary of the principal results and unique propositions of this paper and conclude with an outlook on further modifications as well as future directions.
2. THE DATA
Helium nanodroplets [1] were imaged using extreme ultraviolet (XUV) photon energies between 19 eV and 35 eV at the experimental setup of the LDM beamline [42, 43] at the free-electron laser FERMI [44]. Scattering images were recorded with a multi-channel-plate (MCP) detector combined with a phosphor screen, which was placed 65 mm downstream from the interaction region; this defines the maximum scattering angle of 30°. Single-shot diffraction images in the XUV regime are in some respect a special case, as they cover large scattering angles and can contain 3D structural information [21], manifesting as complex and pronounced characteristic features, such as the bent streaks in Figure 1. Out of all recorded laser shots, about 38 000 images were obtained. The images were corrected for straylight background and the flat detector (see also Langbehn et al. [1]).

For the neural network training dataset, we selected 7264 diffraction images randomly out of all recorded patterns. The size of the subset was chosen to be the maximum a researcher could classify manually given one week's time. From this subset we manually identified 11 distinct but non-exclusive classes (see Figure 2 for examples as well as a description, and Table 1 for statistics about every class). We chose each of the diffraction patterns shown in Figure 2 for being a strong candidate for its class, but it is important to note that almost all diffraction patterns belong to multiple classes, since this is a multi-class labeling scenario. These patterns are therefore not always clearly distinguishable from each other and can exhibit multiple characteristics from different classes. For example, the Newton rings in Figure 2d) are superimposed on a concentric ring pattern that falls into the category Spherical/Oblate, but Newton rings can also occur in other classes, e.g. streak patterns. Furthermore, labeling all images is itself prone to systematic errors because the researcher has to learn-to-label [45]. This means that the labeling process itself is to some extent ill-posed, as the researcher does not know the characteristics of a feature a priori, which results in a changing perception of features and classes along the labeling process and thus a systematically decreased consistency for every class.

We uploaded all available data alongside our assigned labels to the public CXI database (CXIDB, [46]) under the public domain CC0 waiver (http://cxidb.org/id-94.html).

TABLE 1. Statistics of the helium nanodroplets dataset. Non-exclusive labels assigned by a researcher; one image can be in multiple classes. Total dataset size is 7264. Note that Spherical/Oblate as a class also contains Round patterns; only Prolate shapes are excluded from this class (see also caption of Figure 2).

Class               Nr. of labels   % of the whole dataset
Spherical/Oblate    6589            90.7
Round               …               …
Elliptical          …               …
Double Rings        …               …
Prolate             …               …
Streak              …               …
Bent                …               …
Asymmetric          …               …
Newton Rings        …               …
Layered             …               …
Empty               …               …

[Figure 2, panels a) to d): Spherical/Oblate and Prolate (exclusive superordinate classes); Round, Elliptical, Double Rings (partially exclusive oblate subclasses); Streak, Bent (non-exclusive prolate subclasses); Asymmetric, Newton Rings, Layered (non-exclusive other subclasses), each shown with the approximate particle shape or a schematic representation of the expected feature.]
FIG. 2. Characteristic examples for all the classes assigned to the 7264 images by a researcher, except for the Empty class. The top row of every class shows a representative diffraction pattern, and the bottom row in b) - d) shows a stylized drawing of the characteristic feature of this class. The bottom row in a) shows an illustration of the name-giving particle shape for the Spherical/Oblate and Prolate class. The shapes are derived from the analysis of the data in Langbehn et al. [1], and they serve as a form of superordinate classes. They are mutually exclusive to each other, and all diffraction patterns are part of one of these two classes. Also, both superordinate classes have subclasses. For example, b) shows the Spherical/Oblate subclasses Round, Elliptical and Double Rings. While a diffraction pattern can be part of the Round and the Double Rings class, it cannot be part of the Round and the Elliptical class. For the Prolate superordinate class, we find analog subclass rules, although there is no exclusivity rule as it was with the Round and Elliptical class. Therefore, an image belonging to Bent can also be in the Streaks class. Furthermore, all Spherical/Oblate and Prolate patterns can not only be part of their respective subclass but can also be part of one or more of the classes in the non-exclusive other subclass categories shown in d). These classes describe general features within the image which are to some extent independent of the particle shape. We derived the superordinate classes from these general features. These complicated inter-class relationships demonstrate the capabilities of a researcher to interconnect mostly distinctive-appearing features into a consistent description, ultimately leading to a valid physical interpretation. A hand-crafted algorithm normally could not account for these relationships, but now these interconnections can serve as an additional evaluation metric for the neural network. Since there is no diffraction pattern which belongs to the Spherical/Oblate and Prolate class simultaneously, we can check if the neural network mislabeled a diffraction pattern according to these rules. We can then interpret this as a reliable indicator for a failed generalization of the network. The physics behind these patterns is quite complicated as well; for a rigorous interpretation and analysis of these patterns, please see Langbehn et al. [1].
3. BASIC THEORY

3.1. What is a deep neural network
We concentrate in this paper solely on deep feed-forward neural networks. They are a classification model consisting of a directed acyclic graph that defines a set of hierarchically structured non-linear functions. A fundamental example can be constructed by arranging n non-linear functions (z_1, z_2, ..., z_n) in a chain-like manner:

z_{output} = z_n(z_{n-1}(\ldots(z_2(z_1(x)))\ldots)),

where x is the input, which is in our case a diffraction image. The first function, z_1(x), is called the input layer. We then pass the output of z_1 to z_2 and so on; this goes on until the last layer (z_n), which is called the output layer. The nomenclature is that all layers except the output layer (z_n) and the input layer (z_1) are called hidden layers.

For illustrative purposes, Figure 3 shows a convolutional neural network. There, we schematically show the layer functions z_1, ..., z_n, where every layer consists of two stages: a linear layer-specific operation on its inputs followed by a so-called activation function, which is always non-linear. We address the choice of layer-specific operations in section 3.1.1 and then introduce the activation functions in section 3.1.2. In general, the layer-specific operation is always the name-giving component for the layer; for example, if we compute a 2D convolution as the layer-specific operation on the input and then apply an activation function, we call the set of these two stages a convolutional layer. Figure 3 shows a neural network whose first layers are convolutional layers followed by a fully connected layer that produces the predictions.

All common choices for layer-specific operations are affine transformations. They all introduce trainable weights; free parameters that are adjustable during the training process and are sometimes called neurons, due to the intuition that in a fully connected layer they share some similarity to the dendrites, soma, and axon of a biological neuron [47]. These trainable weights are the name-giving components in a neural network. Now, the goal of training a neural network is to optimize all these weights for all layers, so that the predictions for all images in the training data match their accompanying original labels. The original labels are called ground truth and define the upper limit of how good a network can fit a domain. No neural network is better than its training data. In this section, we briefly illustrate the affine transformations of the fully connected layer and the convolutional layer, and then explain in the next section the role of the activation function.
a. Fully connected layer. The name-giving operation for the fully connected layer is a matrix multiplication performed on a flattened input; for example, an m × n sized input image would be flattened into an m · n sized vector. Mathematically this is a matrix multiplication between a matrix and a vector:

a_j = \sum_{k=1}^{m} x_k w_{kj},   (1)

where x is the flattened input and w is the weight matrix of a fully connected layer. Here, all input vector elements (e.g., the pixels of an image, now arranged in one large row x_k) contribute to all output elements and are therefore connected. Furthermore, by convention x_0 is defined as 1 and w_{0j} = b_j, where b_j is a free and trainable bias parameter.

b. Convolutional layer. In a convolutional layer, the trainable weights are parameters of a kernel that slides over the inputs; this is visualized in Figure 3. The general idea of a convolutional layer is to preserve the spatial correlations in the input image when going to a lower-dimensional representation (the next layer). This is achieved by using a kernel with a spatial extent larger than 1 px. The kernel size is then also the extent to which one kernel can correlate different areas of an input and is called its local receptive field. Each kernel produces one output, which is called a feature map or filter. Multiple feature maps from multiple kernels are grouped within one convolutional layer. For example, the first convolutional layer in Figure 3 produces 9 feature maps out of the input diffraction image and hence has 9 kernels that get optimized during training. Since we usually have only a 2-dimensional diffraction image as input in the input layer, but a high number of feature maps as the inputs of every subsequent convolutional layer, we define the output of a convolutional layer with a 4-dimensional kernel k that produces i feature maps of size j × k:

a_{i,j,k} = \sum_{l,m,n} x_{l,\, j+m-1,\, k+n-1} \, k_{i,l,m,n},   (2)

here the input x has l dimensions of size j × k and we slide a kernel of size m × n across all these l dimensions. In the given example for the input layer, l is simply 1 and the summation is just across one input image, as shown in Figure 3.

Regardless of the affine transformation that is used, the outputs of all layer-specific operations are passed through an activation function. This function is always non-linear. We only address two activation functions here, as they are the most common ones used by the community and the only ones we use: the sigmoid and the LeakyRelu function. The first one is a logistic regression function used mostly at the outputs of neural networks, and the second one is a piecewise linear activation function used between layers for numerical reasons [48, 49].
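To make Equations 1 and 2 concrete, the following sketch implements both affine transformations directly in NumPy. This is a minimal, unoptimized illustration written by us, not the authors' code; real frameworks replace these loops with highly optimized routines.

import numpy as np

def fully_connected(x, w, b):
    # Eq. (1): flatten the input image and multiply with the weight
    # matrix w of shape (m*n, outputs); b plays the role of w_0j.
    return x.reshape(-1) @ w + b

def conv2d(x, kernels):
    # Eq. (2): x has shape (L, H, W), i.e. L input feature maps;
    # kernels has shape (I, L, M, N) and produces I feature maps.
    # "Valid" convolution without padding or stride, as in Eq. (2).
    I, L, M, N = kernels.shape
    out = np.zeros((I, x.shape[1] - M + 1, x.shape[2] - N + 1))
    for i in range(I):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # sum over all input maps l and the local receptive field
                out[i, j, k] = np.sum(x[:, j:j + M, k:k + N] * kernels[i])
    return out

For the input layer of Figure 3, L = 1 (a single diffraction image) and I = 9 (the nine kernels producing nine feature maps).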
FIG. 3. Schematic visualization of a convolutional neural network. It shows the hierarchical structure of the network with the function hierarchy z_1, ..., z_n above each layer. Depicted as input is a diffraction image, which is getting expanded by 9 trainable convolutional kernels into 9 feature maps. Note, only 1 kernel, producing the last feature map, is shown. The output of the first layer is then passed through multiple convolutional layers; this is the feature extraction part of the neural network. Ultimately, a fully connected layer with a logistic function as activation function produces the predictions. Every layer consists of 2 stages, also indicated by the brackets underneath z_1, ..., z_n. The first stage is an affine transformation and the second one is a non-linear function, called an activation function. The operation that is used as affine transformation is then the name-giving component for the layer, e.g., a convolutional layer uses a convolution as affine transformation. The choice of the activation function is subject to empirical optimization with various choices possible. Section 3.1.1 describes the affine transformations in more detail and section 3.1.2 covers the basics on activation functions.

The sigmoid function is given as:

h(x) = \frac{1}{1 + \exp(-x)},   (3)

and the LeakyRelu function is given as:

h(x) = \begin{cases} x & \text{if } x \geq 0 \\ \gamma x & \text{if } x < 0 \end{cases},   (4)

where in both functions x ∈ a, the outputs of the affine transformation (the convolutional or the fully connected layer operation, i.e., the output of Equation 1 or 2), and γ is the slope for the negative part in the LeakyRelu function, called leakage.

In Figure 3, the last activation function of the neural network, denoted by Logistic function, is a sigmoid function, because its output can be interpreted as a probability in a Bernoulli distribution, yielding a probability for how likely it is that a given event (an image in our case) is part of a class (in our case, the pre-defined classes from Table 1). Sigmoid functions always give an output between 0 and 1. In our case, we have 11 distinct classes which are mutually non-exclusive, which means every image has a probability of being part of every class. Using a sigmoid function at the end of the neural network therefore yields 11 distinct Bernoulli distributions. The generalization from the single-case Bernoulli distribution to its multi-case n-class equivalent is called a categorical distribution.

Interpreting the output of the neural network, as well as the original labels, as a categorical distribution is key to training the neural network, because only then can we use statistical measures to evaluate the quality of the neural network's predictions, which allows us to optimize it iteratively. However, due to the non-linearity of all activation functions, optimizing a neural network is a non-convex problem where no global extrema can be found with certainty. The general procedure is that of a forward pass and then a backward correction. Meaning, we feed the neural network several images, take the network's prediction and compare this prediction to the ground truth; this is the forward pass. Then we calculate a loss function, which is a metric for how bad or good the predictions were (see the next section), and correct the weights of the network in a way that it would be better equipped to predict the labels for the images it just saw. This correction step starts at the end of the network, using an algorithm called backpropagation; hence the name backward correction, see section 3.1.4.
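Both activation functions from Equations 3 and 4 are one-liners in NumPy; a minimal sketch follows. The leakage value 0.01 below is a common default and our assumption, since the value used in the paper did not survive extraction.

import numpy as np

def sigmoid(x):
    # Eq. (3): squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def leaky_relu(x, gamma=0.01):
    # Eq. (4): identity for x >= 0, small slope gamma ("leakage")
    # for x < 0 so the gradient never vanishes completely
    return np.where(x >= 0, x, gamma * x)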
Optimizing a neural network always starts by feeding it multiple images and evaluating what the neural network made of them. For assessing the quality of the network's predictions a so-called loss function is used. It is the defining metric that we seek to minimize during the training of the neural network. In every training step, we compare the output of the neural network to the real labels provided by the researcher and calculate the so-called loss. Lower loss values correspond to a higher prediction quality of the neural net. Therefore, the goal during the training process is to adjust all weights and biases within the network so that the loss is minimal for all input training images. There are various possible loss functions, which often serve a specific purpose. For classification tasks, such as the present case, primarily the cross-entropy is used [50-53]. Cross-entropy is a concept from information theory giving an estimate about the statistical distance between a true distribution p and an unnatural distribution q. In our case, p is the categorical distribution over the ground truth labels, and q is the output of the neural network.

Cross-entropy is calculated as the sum of the Shannon entropy [54] of the true distribution p and the Kullback-Leibler divergence [55] between p and q. The former is a measure of the total amount of information of p, and the latter is a typical distance measure between two probability distributions. If the Kullback-Leibler divergence is zero, then the cross-entropy is just the Shannon entropy of p, and we have p = q. Then, the predictions of the neural network are not distinguishable from the labels of all training images. Cross-entropy can be formally written as:

H(p, q) = H(p) + D_{KL}(p \| q),   (5)

where H(p) is the Shannon entropy of p, and D_{KL}(p ‖ q) is the Kullback-Leibler divergence of p and q [56]. When using a sigmoid function as activation function on the output layer, the final loss function can be defined as:

H(x^{out}, x) = \sum_{i}^{M} \left( x^{out}_i - x^{out}_i x_i + \log\left(1 + \exp\left(-x^{out}_i\right)\right) \right),   (6)

where M is the number of all images in the training data, x^{out}_i is the prediction for one image from the deep neural network and x_i is the original label of the image, assigned by the researcher. Please see appendix B for a complete derivation.

Using Equation 6 as it is would require us to pass all images through the network for one training step, as the sum runs over all images. This is computationally intractable. Therefore, we use a variant of Equation 6 where the sum runs only over a stochastically chosen subset of size bs, called a batch. The size of that batch is called batch size and is an important hyperparameter that needs to be chosen prior to training, see section 3.1.5. One iteration step now involves only bs images from the dataset, and we define an epoch as the number of iteration steps it takes the network during the training to see all images one time.

To summarize, minimizing the cross-entropy is the goal during the training process of a neural network. The network learns to link the user-defined labels to the provided images. All that is left to understand the basic training process of a neural network is a way to adjust the weights in all layers.
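The rearranged form of Equation 6 is exactly the standard sigmoid cross-entropy on raw network outputs (logits). A minimal sketch, ours and not the paper's implementation:

import numpy as np

def sigmoid_cross_entropy(logits, labels):
    # Eq. (6) per label: z - z*t + log(1 + exp(-z)), algebraically
    # equal to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
    # logits: raw network outputs; labels: 0/1 ground truth.
    # Kept simple for clarity; production code additionally guards
    # against overflow for large negative logits.
    return logits - logits * labels + np.log1p(np.exp(-logits))

In training, this quantity is averaged over the bs images of a batch rather than summed over all M images, as described above.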
Optimizing the weights within the neural network so that they give minimal loss for all training images is done using two distinct algorithms: gradient descent and backpropagation. In principle, gradient descent works by evaluating the gradient at some point and then moving a certain step-size in the opposite direction; this is done iteratively until the gradient is smaller than some pre-defined threshold, which is the numerical equivalent of calculating the extrema of a function analytically. The basic gradient descent step is given by:

w_{\tau+1} = w_{\tau} - \eta \nabla_{w_{\tau}} H(x^{out}, x),   (7)

where η is the aforementioned step-size, called learning rate, ∇_{w_τ} is the gradient w.r.t. the weights at step τ and H(x^{out}, x) is the loss function from Equation 6. With Equation 7 we could already update the weights within the output layer of the neural network (z_n(·)), since for the output layer we can calculate the numerical gradients. But we cannot do this for the layers that come before the output layer, since we are lacking a way to include these. In order to propagate the gradient descent correction throughout the network, an algorithm called backpropagation is used [57]: First, we define the gradient of H(x^{out}, x) w.r.t. the weights at the output of the deep neural network, using the chain rule:

\nabla_{w_{\tau}} H(x^{out}, x) = \frac{\partial H(x^{out}, x)}{\partial w^{N}_{j\tau}} = \frac{\partial H(x^{out}, x)}{\partial h^{N}(a^{N}_{j})} \frac{\partial h^{N}(a^{N}_{j})}{\partial w^{N}_{j\tau}},   (8)

where N denotes the layer depth of the output layer, h^N(·) is the activation function used in that layer and a^N_j are the outputs of the layer-specific operation, as in Equations 1 and 2. Starting from there, we include the layer preceding the output layer (z_{n-1}(z_n(·))) by making use of the chain rule again:

\frac{\partial H(x^{out}, x)}{\partial h^{N}(a^{N}_{j})} = \frac{\partial H(x^{out}, x)}{\partial h^{N-1}(a^{N-1}_{j})} \frac{\partial h^{N-1}(a^{N-1}_{j})}{\partial h^{N}(a^{N}_{j})}.   (9)

This can be iteratively repeated until the input layer (z_1(·)) is included in the calculation. By making use of the chain rule until we reach the input layer, we can include all trainable weights of all layers in the correction term of the gradient descent algorithm. With this, we conclude the full optimization routine in Table 2.

TABLE 2. The iterative optimization routine for a deep feed-forward neural network.
1. Forward pass: Propagate bs images through the network.
2. Evaluate the predictions: At the output layer, calculate the loss between the ground truth and the output of the deep neural network (Equation 6).
3. Construct the backpropagation rule: Include all gradients w.r.t. the weights of all layers according to Equation 9.
4. Backward correction: Update all weights in the network using gradient descent, see Equation 7.
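The routine of Table 2 maps almost one-to-one onto a few lines in a modern framework. A hypothetical PyTorch sketch follows; model, loader and optimizer are placeholders and not the authors' actual code:

import torch

def train_one_epoch(model, loader, optimizer, device="cuda"):
    # One epoch of the routine in Table 2, batch by batch.
    # BCEWithLogitsLoss is the sigmoid cross-entropy of Eq. (6);
    # labels must be float tensors of 0/1 per class.
    criterion = torch.nn.BCEWithLogitsLoss()
    model.train()
    for images, labels in loader:          # bs images per iteration step
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)             # 1. forward pass
        loss = criterion(logits, labels)   # 2. evaluate the predictions
        loss.backward()                    # 3. backpropagation, Eq. (9)
        optimizer.step()                   # 4. gradient descent step, Eq. (7)

A plain gradient descent optimizer matching Equation 7 would be, e.g., optimizer = torch.optim.SGD(model.parameters(), lr=0.1).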
Of significant importance is the way the network is constructed: how deep should the network be and of what should it consist? For nomenclature, the combination of all used layers, the depth of the network and the used activation functions is called an architecture. We benchmarked the performance of various architectural choices when used with diffraction images as input and provide the results in appendix A and not in the main paper, due to their rather technical character. In short, all architectures are established through extensive empirical research. So far, not only the leading A.I. research institutes, like the Massachusetts Institute of Technology (MIT) or the University of Toronto, but also large companies like Google, Facebook and Microsoft have invested significant amounts of resources to establish well working out-of-the-box solutions [50-52]. Building on this, and after extensively benchmarking the most common architectures on our own, we settled on an architecture called pre-activated wide residual convolutional neural network in its 18-layer configuration, called ResNet18 [25, 58, 59]. In essence, it is a convolutional neural network much like the example in Figure 3, but it employs so-called residual skip connections which increase Accuracy while decreasing training time; see appendix A for further details as well as comparisons with other architectures.

After settling on an architecture, training a neural network requires fine-tuning of multiple free parameters. Four of them are critical: the learning rate η, the batch size bs and so-called regularization parameters, of which we have two (introduced at the end of this section). We set the initial learning rate for the gradient descent algorithm to η = 0.1, see also Equation 7, and throughout the training we repeatedly multiply η by a constant factor smaller than one. All images are down-sampled to 224 × 224 px, which is necessary to fit the deep neural net on two Nvidia 1080Ti GPUs, each having 11 GB memory. The image dimensions are chosen to be a compromise between file size and resolution. All features we are training the neural network on are still clearly visible and distinguishable after the rescaling.

Furthermore, we face the problem of having a comparatively small training set. To mitigate the risk of over-fitting, we rely on two techniques, regularization and data augmentation (a code sketch of the regularized loss follows at the end of this section):

1. Regularization extends the loss function by penalty terms on the trainable weights:

H(x^{out}, x)_{reg} = H(x^{out}, x) + \alpha \|w\|_{1} + \beta \|w\|_{2}^{2},   (10)

where H(x^{out}, x) is the cross-entropy loss function, ||w||_1 and ||w||_2^2 are the L1- and the L2-norm applied on the sum of all trainable weight parameters, and α and β are so-called regularization coefficients. In our experiments we set α and β to the same value during training. Using L1 and L2 regularization in combination is commonly referred to as elastic net regularization [60].

2. Data augmentation means creating artificial input images by randomly applying image transformations to the original image, like flipping the vertical or the horizontal axis and adjusting contrast or brightness values randomly. This greatly increases the robustness to over-fitting and is used as a standard procedure when facing small training datasets [61, 62].

We were able to train deep neural networks with a depth of up to 101 layers without over-fitting using regularization and data augmentation, see appendix A. In all experiments reported here we choose a depth of 18 layers for the neural network, due to numerical, memory and time reasons. We trained all deep neural network variants for 200 epochs.
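The elastic net term of Equation 10 is a one-line addition on top of any framework loss. A sketch under the assumption of PyTorch; the coefficient values are placeholders, since the paper's exact exponent did not survive extraction:

import torch

def elastic_net_loss(loss, model, alpha=1e-5, beta=1e-5):
    # Eq. (10): add L1 and L2 penalties over all trainable weights
    # to the cross-entropy loss; alpha and beta are the
    # regularization coefficients.
    l1 = sum(p.abs().sum() for p in model.parameters())
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return loss + alpha * l1 + beta * l2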
3.2. Evaluation metrics

We use three metrics to assess the quality of the predictions from the neural network: Accuracy, Precision, and Recall. We calculated these metrics every 2500 training iteration steps (≈ 52 epochs) using the evaluation dataset. Accuracy is formally defined as:

Accuracy = \frac{\text{True Positives} + \text{True Negatives}}{\text{Condition Positives} + \text{Condition Negatives}},

where condition positives/negatives is the real number of positives/negatives in the data, and true positives/negatives is the correct overlap of the predictions from the model with the condition positives/negatives. An Accuracy of 1 corresponds to a model that was able to predict all classes of all images correctly. Therefore, Accuracy is a good measure for evaluating the prediction capabilities of a model when true positives and true negatives are of importance. Predicting negative labels correctly is of particular interest in the case of the helium dataset, because we want to estimate if the neural network was able to understand the complex inter-class relationships imposed by the researcher. The network should realize that if, for example, one prediction is Spherical/Oblate, it cannot simultaneously be Prolate. Therefore, the network has to produce a true negative for either one of these predictions. However, using only Accuracy as a metric has several downsides. The most important one is the decreased expressiveness of Accuracy when working in a multi-class scenario. In order to understand this, we first introduce Precision and Recall, and then provide an example:

Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}},
Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}.

Precision, also called positive predictive value, is a measure for how reasonable the estimates of the model were when it labeled a class positive, and Recall is a measure for how complete the model's positive estimates were. For example, if the model would predict all training images in the helium dataset to be Spherical/Oblate and nothing else (out of 7264 images, 6589 are indeed Spherical/Oblate), then Accuracy would be 0.879; if the model instead made no positive prediction at all, Accuracy would still be 0.805. Precision in these two examples would give 0.907 for the Spherical/Oblate example and 0.000 for the all-negative example. Precision is, therefore, a metric that quantifies how well the positive predictions were assigned. Since 91 % of all images are indeed Spherical/Oblate, setting all labels positive in the Spherical/Oblate class can make sense, and Precision also provides insight when the model makes no positive prediction at all, which would be a useless model for our purpose. However, Precision alone is not sufficient as a metric. At this point we do not know if our model predicted almost every possible positive label correctly, or if only a small fraction of all positive labels were assigned correctly; we therefore need an additional measure for the generalization capabilities of our model. For that reason, Precision is always used in combination with Recall. The Recall for our first example is 0.423 and for the second one 0.000. Recall relies on False Negatives instead of the False Positives used by Precision, which provides a measure of the completeness of all positive predictions compared to all positive labels within our data. Recall states that our model only captured 42 % of all possible positive labels in the Spherical/Oblate example, showing that the generalization of the model would not be sufficient for a real-world application. Therefore, a balanced interpretation of these three metrics is necessary to estimate the quality of the models tested here. The worked example is spelled out numerically below.
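The numbers of the worked example follow directly from the label totals quoted in this paper (79 904 labels in total, 64 339 of them negative, see section 5.2); a short self-contained computation:

# Model that labels every image "Spherical/Oblate" and nothing else.
total_labels = 79904
negatives    = 64339
positives    = total_labels - negatives   # 15 565 positive labels overall

tp = 6589                  # all actual Spherical/Oblate images hit
fp = 7264 - 6589           # the 675 images wrongly labeled positive
fn = positives - tp        # positives of all other classes, missed
tn = negatives - fp        # remaining negatives, correctly negative

accuracy  = (tp + tn) / total_labels   # -> 0.879
precision = tp / (tp + fp)             # -> 0.907
recall    = tp / (tp + fn)             # -> 0.423
# All-negative model: tp = fp = 0, accuracy = negatives/total -> 0.805
print(round(accuracy, 3), round(precision, 3), round(recall, 3))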
4. BASELINE PERFORMANCE OF NEURAL NETWORKS WITH CDI DATA
In this section we briefly report on what we call baseline results. We used the previously described ResNet [25] neural network architecture in its basic configuration with a depth of 18 layers, termed vanilla configuration or ResNet18 (see section 3.1.5), and trained it with the helium diffraction data set as described in section 2, as well as with a reference data set from the literature [40]. This reference data set was made freely available on the CXIDB by Kassemeyer et al. [40]. It contains diffraction patterns of a number of prototypical diffraction imaging targets, namely the Paramecium bursarium Chlorella virus (PBCV-1), bacteriophage T4, magnetosomes and nanorice. For further experimental details see Kassemeyer et al. [40]. We selected this dataset because of a previous publication dealing with it [28] that describes, to our knowledge, the best classification results on it so far: Bobkov et al. [28] trained a support-vector-machine on the CXIDB dataset and inferred the particle type directly from the diffraction images. Overall, they achieved an Accuracy of up to 0.87, but only on selected high-quality images with a high confidence score of the support-vector-machine.

Table 3 summarizes our results. Train Time_Max is the time when the neural network achieved the highest Accuracy score on the evaluation dataset, and Train Time_Full is the time for training 200 epochs. In practice, we achieved optimal convergence after training for 70 to 100 epochs. We achieved an Accuracy of 0.967 not only on a high-quality subset of the CXIDB data, like in [28], but on all available data (see Table 3), using a vanilla ResNet18 architecture, proving that using a neural network significantly outperforms the current state-of-the-art approach in [28].

TABLE 3. Overall evaluation metrics for the ResNet18 architecture (vanilla configuration) and both datasets. The table gives the max values during training for Accuracy, Precision, and Recall. The training time after which the neural network achieved the highest Accuracy score on the evaluation dataset is labeled Train Time_Max, and the time for training the full 200 epochs is labeled Train Time_Full. See also appendix A for further details.

Architecture           ResNet18
Dataset                CXIDB    Helium
Accuracy               0.967    0.955
Precision              0.932    …
Recall                 0.933    …
Train Time_Max [h]     0.278    …
Train Time_Full [h]    0.668    …

In the case of the helium dataset, we face a much more complicated multi-class learning problem (one image can belong to multiple classes, compared to one image belonging to exactly one class as in the CXIDB data). However, we reach a comparable Accuracy score of 0.955. Precision and Recall are very high for the helium and the CXIDB dataset, proving that the neural network not only predicted the true positives with high confidence and reliability (high Precision), it did so for almost all true positive labels in the evaluation dataset (high Recall). In the next section we show how to further improve on the baseline performance of neural networks with diffraction images as input data.
5. ADAPTING NEURAL NETWORKS FOR CDI DATA
Here, we describe our contributions to using neural networks in combination with diffraction images. First, we show in section 5.1 that the performance of a neural network can be enhanced when using a special activation function after the input layer. Second, in section 5.2 we benchmark the performance of the neural network when using a smaller amount of training data. The idea is to provide an intuition about how much the prediction capabilities deteriorate when a smaller training dataset is used. This is useful because so far a researcher still has to invest a lot of time preparing the training dataset and, more generally, minimizing the time spent looking through the raw data is the ultimate goal for using a neural network in the first place. Third, in section 5.3 we propose a novel data augmentation in the form of a custom two-point cross-correlation map that hardens the network against very noisy data. We show that when using this augmentation the network is more robust to noise from a uniform distribution added on top of the original diffraction image. This simulates the experimental scenario in which a very low signal-to-noise ratio is unavoidable, e.g., during CDI experiments with very limited photon flux [7] or very small scattering cross sections, as is the case with upcoming CDI experiments on single biomolecules [63, 64].
5.1. A logarithmic activation function

One of the key additions of this paper is the proposed activation function, formally stated in Equation 11. It is designed to account for the inherent property of diffraction images of scaling exponentially. More generally, the intensity distribution of scattered light on a flat detector follows two laws, depending on the scattering angle that is recorded. For very small angles (SAXS and USAXS experiments) the Guinier approximation is the dominant contribution to the recorded intensity, while for larger scattering angles (SAXS and WAXS experiments) Porod's law becomes dominant [65, 66]. Where the scattering intensity in the Guinier approximation is proportional to ≈ exp(-q²), in Porod's law the intensity scales with ≈ q^{-d}. q is the scattering vector (a function of the scattering angle and of the wavelength in use) and d is the so-called Porod coefficient, which can vary significantly depending on the object from which the light was scattered [65]. In any case, the recorded detector intensity for diffraction images scales exponentially. For this reason we propose a logarithmic activation function of the form:

h(x) = \begin{cases} \alpha (\log(x + c_1) + c_2) & \text{if } x \geq 0 \\ -\alpha (\log(c_1 - x) + c_2) & \text{if } x < 0 \end{cases},   (11)

where α > 0, c_1 = exp(-1), c_2 = 1 and x is the input. We define c_1 and c_2 so that the activation function is anti-symmetric around 0, which helps speed up training and avoids a bias shift for succeeding layers [67, 68].

Since we are using a gradient-based optimization technique, we need to take care that the gradient can propagate throughout the whole network; otherwise this would lead to so-called gradient flow problems, which befall deep architectures [48, 69]. There are two possibilities for insufficient gradient flow: either the gradients are getting too small (vanishing gradient) or too large (exploding gradient) when propagating throughout the network. Both scenarios lead to numerical instabilities during training, making convergence for large architectures very hard or even impossible. The reason for this is the backpropagation algorithm, which invokes the chain rule for calculating the gradients. Every gradient is therefore also a multiplicative factor for the gradient of a succeeding layer. For our case the derivative of Equation 11 w.r.t. x is given by:

\frac{\partial h(x)}{\partial x} = \begin{cases} \frac{\alpha}{x + c_1} & \text{if } x \geq 0 \\ \frac{\alpha}{c_1 - x} & \text{if } x < 0 \end{cases}.   (12)

It shows that the gradient scales with x^{-1}, with a discontinuity of size α c_1^{-1} at 0. If we used this activation function for all activations throughout the network, the gradient would have an increased probability to vanish, or explode, the deeper the architecture gets. In addition to that, the discontinuity at x = 0 could lead to gradient jumps, which would further decrease numerical stability. Therefore, we use the logarithmic activation function only for the first convolutional layer and use a LeakyRelu activation between all other layers. α is a tunable hyperparameter; we conduct experiments with three values, α ∈ {0.2, 0.5, 1.0}, and evaluate its impact on the performance of the neural network.

In Table 4 we provide the evaluation metrics for ResNet18 used with the logarithmic activation function, trained with three different values for α. For comparison, we also provide the results of the unmodified ResNet18, labeled unmodified. The best performing configuration is the one with an α value of 0.2, maxing out with an Accuracy of 0.965 and improving the Accuracy by a full percentage point compared to the unmodified ResNet18. The lowest value for the maximum Accuracy was reached without the logarithmic activation function, topping out at 0.955. Precision and Recall both increase with the addition of the logarithmic activation function. These improvements all come without increasing training time or complexity of the model. The maximum achieved Accuracy seems to be anti-correlated with α, with the ResNet18_α=1.0 variant performing worst among the modified networks.

TABLE 4. Evaluation metrics for the ResNet18 network with and without the logarithmic activation function. We benchmark three values for α. Results are shown for both datasets and are the maximum value recorded during training. Bold numbers indicate the best scores across their respective category.

Architecture   ResNet18
α              0.2      0.5      1.0      unmodified
Dataset        Helium
Accuracy       0.965    0.960    0.959    0.955
Precision      0.922    …        …        …
Recall         0.870    …        0.868    …

We suspect that this is related to the smaller size of the discontinuity of the derivative of h(a_j) when choosing a small value for α, see Equation 12. However, choosing even smaller values for α did not improve the Accuracy further, either because the benefit from the activation function plateaus there or because we reached the classification capacity of this ResNet layout. These results show convincingly that the addition of the logarithmic activation function improves the overall performance and generalization of the deep neural network. This is in so far expected, as we imposed a form of feature engineering on the network by exploiting a known characteristic of the dataset. Therefore, without increasing the complexity, the depth or the training time, we showed that using the logarithmic activation improves all relevant evaluation metrics. For this reason, we use the logarithmic activation function with an α value of 0.2 for all further experiments; a code sketch of this activation follows at the end of the next subsection.

5.2. Dependence on the training set size

In this section we evaluate the impact of the training set size on the evaluation metrics: we trained the ResNet18_α=0.2 with a varying amount of labeled images. The reason for this is to provide intuition for how many images need to be classified manually before the employment of a neural network is useful. We uniformly selected images from the training set but kept the same evaluation dataset described in section 3.1.5. We decreased the size of the training set in three stages (to 75 % ≡ 4631 images, 50 % ≡ 3088 images and 25 % ≡ 1544 images of the full 6174 training images).

Table 5 shows the results of the ResNet18_α=0.2 when trained with datasets of different sizes. For the helium dataset, the maximum achieved Accuracy drops from 0.965 to 0.797 when using only 1544 images instead of the full 6174 images. Even more pronounced is the decline in Precision and Recall, from 0.922 and 0.870 to 0.673 and 0.593 for the smallest training set size. The steeper decline rate for Precision and Recall, compared to Accuracy, can be understood as follows: the helium dataset predominantly consists of Negative ground truth labels (64 339 out of 79 904 labels), to which the neural network resorts in the absence of sufficient training data. Precision and Recall, on the other hand, provide only information about the positive prediction capabilities and their completeness, and therefore decrease faster when a smaller training set size is used.

This shows that the number of images is critical for the prediction capabilities of the neural network. The drastic decrease in training set size results in a much worse generalization of the model, detecting only those images that are very close to the ones from the training set and missing most from the evaluation set. The network has not learned the characteristics of a particular class to a point where it can transfer the gained knowledge to other images, which is the one critical property for which we employed a neural network in the first place. Therefore, if time is limited, one may be well advised to concentrate efforts on preparing a sufficiently large, high-quality training dataset while using, e.g., our here presented neural network approach in its standard configuration.
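As announced in section 5.1, a minimal NumPy sketch of the logarithmic activation from Equation 11 follows. The constants reflect our reconstruction of the garbled text, c_1 = exp(-1) and c_2 = 1, which indeed make h(0) = 0 and h anti-symmetric:

import numpy as np

ALPHA = 0.2          # best-performing value from Table 4
C1 = np.exp(-1.0)    # chosen so that log(C1) + C2 = 0, hence h(0) = 0
C2 = 1.0

def log_activation(x):
    # Eq. (11): the two branches collapse to a single expression
    # via |x| and sign(x), which also avoids taking the log of a
    # negative argument in vectorized code.
    return np.sign(x) * ALPHA * (np.log(np.abs(x) + C1) + C2)

In the network itself, this function replaces the LeakyRelu only after the first convolutional layer, as described above.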
5.3. A noise-resistant cross-correlation input map

This section introduces an image augmentation based on the two-point cross-correlation function, which increases the resistance to noise. We prepare four training sets, each with an increasing amount of noise sampled from a uniform distribution, and analyze the noise dependence of the neural network.

One of the principal problems in CDI experiments, or imaging experiments in general, is recorded noise. Noise often leads to computational problems, noise resistance being a known weak point for a significant fraction of predictive algorithms [29]. In particular, deep neural networks are known to be easily fooled by noise. When adding noise to an image, whose addition may be invisible to the human eye, a neural network can come to entirely different conclusions, and this even with high confidence; seeing a panda where there was a wolf [70, 71]. Therefore, we propose an additional pre-processing step for the input images to increase the noise resistance of the neural network.

TABLE 5. Evaluation metrics of the ResNet18_α=0.2 network with the logarithmic activation function and an α value of 0.2. Results are shown for the helium dataset and reflect the maximum achieved value reached throughout the training process, assessed on the evaluation dataset. Bold numbers indicate the best scores across their respective category.

Architecture        ResNet18_α=0.2
Training set size   6174     4631     3088     1544
Dataset             Helium
Accuracy            0.965    0.915    0.829    0.797
Precision           0.922    0.821    0.740    0.673
Recall              0.870    0.771    0.679    0.593

To quantify the quality of an image, the signal-to-noise ratio is often used. It is a measure for how much noise is present compared to some information content, where low values indicate that information might be indistinguishable from noise. It has been shown that higher orders of the two-point cross-correlation function (CCF) can act as a frequency-dependent noise filter and increase the quality of a reconstruction of a diffraction image even in the presence of recorded noise [72, 73]. And since the CCF can be interpreted as an image, see Figure 4 e) to h), we employ this method in a similar manner to optimize the use-case with a convolutional deep neural network, expecting that the higher-order terms make the neural network more resistant to the presence of noise. In general, the CCF is defined as:

C_{i,j}(q_i, q_j, \Delta) = \int_{-\infty}^{\infty} I^{*}_{i}(q_i, \phi) \, I_{j}(q_j, \phi + \Delta) \, d\phi,   (13)

where Δ is the angular separation, φ is the angular coordinate, and (i, j) denotes the index of the two scattering vectors q_i and q_j. For discrete φ and written as a Fourier decomposition, Equation 13 yields [72]:

C^{n}_{i,j}(q_i, q_j) = I^{n*}_{i}(q_i) \, I^{n}_{j}(q_j),   (14)

where n denotes the order of the CCF. I^n_i is given by:

I^{n}_{i}(q_i) = \frac{1}{2\pi} \int_{0}^{2\pi} I(q_i, \phi) \exp(-in\phi) \, d\phi.   (15)

Since C_{i,j} = C_{j,i}, we can split the final correlation map into an upper and a lower triangle matrix. To maximize information, and to optimally use the local receptive fields of the convolutional layers, we merge the lower triangle from the full CCF calculation, Equation 13 with Δ = 0, and the upper triangle of order n = 8 from Equation 14. Therefore, we combine a plain correlation map with a higher-order map that is more resistant to noise; see Figure 4 e) to h) for a full example.

To test the robustness of this method, we use the ResNet18_α=0.2 and train it with various pre-processed datasets. From our original dataset we derive three additional datasets that only differ in the amount of noise added. We do this as follows: First, we calculate the mean, the standard deviation (std) and the maximum intensity value of each image in the original dataset. From these values we calculate the median, instead of the mean (due to its increased robustness against outliers), ending up with three statistical characteristics describing the intensity distribution throughout all diffraction images. With that, we define three continuous uniform distributions to sample noise from. A continuous uniform distribution is fully defined by a lower and an upper boundary, a and b, respectively. The probability for a value to be drawn within these boundaries is equal and non-zero everywhere. For our three noise distributions we always use a lower boundary of 0 and vary the upper boundary, so that b is either the mean, the mean + the std. or the maximum of the intensity distribution of the images (the three statistical characteristics described above). For example, for creating the maximum noise dataset, we looped through every diffraction image and added noise sampled from the maximum noise distribution. We do this for all three noise distributions. From these three noise-embedded datasets, as well as our original dataset, we calculate the here proposed CCF maps. This leads to a total of eight data sets; for each of them we train a ResNet18_α=0.2. An example of one image in all eight datasets is shown in Figure 4.

FIG. 4. a) to d) show the various stages of added noise for a standard scattering image (None, Mean, Mean + Std, Max). e) to h) are the calculated correlation maps with the upper triangle of order n = 8 and the lower triangle from the full CCF calculation.

The results for these eight data sets are given in Table 6. The performance of the neural network without added noise is much stronger when using the original diffraction images instead of the CCF maps. However, as soon as noise is added, the performance of the neural network trained on diffraction images deteriorates much faster compared to the performance with CCF maps as input. When the upper boundary of the added noise exceeds the median value of mean + std., the neural network performs better with the CCF maps than with the original diffraction images. Especially for the noisiest dataset the differences in performance are significant: Precision is increased by 4 percentage points when using the CCF maps as input, showing that our data augmentation may serve as a helpful asset when dealing with very noisy data. In general, it is a viable alternative to use the CCF maps as input to the convolutional deep neural network, which should be considered an option in the case of very noisy data, where it provides a boost to classification results. The downside is that calculating the CCF for every image comes at an additional computational cost. It took us three full days to calculate the CCF maps for all 39 879 images of both datasets on an Intel 6700K quad-core machine using a multi-threaded Python script (also released on Github).
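One plausible NumPy implementation of the combined map is sketched below. It assumes the image has already been resampled to polar coordinates polar[q, phi]; that interpolation step, the 1/2π normalization of Equation 15, and whether the paper takes the real part or the modulus of the complex correlations are not specified here, so we take the real part:

import numpy as np

def ccf_map(polar, n_order=8):
    # polar: real array of shape (Q, P), intensity I(q, phi).
    fourier = np.fft.fft(polar, axis=1)       # I^n(q) for all orders, Eq. (15)
    # Full CCF at angular separation Delta = 0, Eq. (13), discretized
    # as a sum over phi: entry (i, j) = sum_phi I(q_i, phi) I*(q_j, phi).
    full = (polar @ polar.conj().T).real
    # Single Fourier order n, Eq. (14): C^n_{i,j} = conj(I^n_i) I^n_j.
    f_n = fourier[:, n_order]
    order_n = np.outer(f_n.conj(), f_n).real
    # Lower triangle from the full CCF, upper triangle of order n = 8.
    return np.tril(full, -1) + np.triu(order_n)

The resulting (Q, Q) array is then treated as an ordinary input image for the convolutional network.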
6. WHAT THE NEURAL NETWORK SAW
Neural networks are often considered to be a black-box approach. We usually do not impose a-priori knowledge on our model; the network learns on its own. Although this is part of the reason why neural networks are so successful, it also gives rise to doubts about the interpretability of their predictions. Several ways to interpret the decision-finding processes within a trained neural network have been presented in the literature [41, 74–76]. In order to better understand why our deep neural network assigned images to certain classes, we calculated heatmaps using the GradCam++ algorithm [41]. These heatmaps make visible where the network looked for a particular class, which we determine by tracing back the gradient flow from the output layer to the last convolutional layer. The network's class-specific interest correlates directly with this gradient signal because, in essence, we simulate a training step using backpropagation and interpolate the feature maps from the last convolutional layer.
TABLE 6. Evaluation results when training a ResNet18_α=0. network on the original diffraction images and on CCF maps calculated from them. The results reflect the maximum value achieved throughout the training process, assessed on the evaluation dataset. Bold numbers indicate the best scores across their respective category.

Architecture   ResNet18_α=0.                                            Dataset: Helium
Noise added    None               Mean               Mean + Std.        Max.
Input data     CCF    Diff.Imgs.  CCF    Diff.Imgs.  CCF    Diff.Imgs.  CCF    Diff.Imgs.
Accuracy       …      …           …      …           …      …           …      …
Precision      …      …           …      …           …      …           …      …
Recall         …      …           …      …           …      …           …      …
FIG. 5. GradCam++ results for two distinct classes from the helium dataset. a) shows five randomly selected images from the Streak class and b) shows five images from the Bent class. We chose these classes due to their distinct and distinguishable characteristic shapes, which can easily be identified using the contour maps provided by the GradCam++ algorithm. For each class, the schematic from Figure 2 is also plotted at the beginning of each row. GradCam++ contour levels are plotted as dashed lines and used as transparency values for the images from which we calculated them; this way, regions with strong gradients are also brighter.

A full description of this process is given in appendix D. The output of the GradCam++ algorithm provides contour maps whose amplitude is a normalized measure of how much the gradient would impose corrections on the weights if used during training. This gradient flow directly corresponds to what the network deemed the most relevant regions.

Figure 5 shows the GradCam++ results for the
Streak and
Bent classes, using our best performing network, the ResNet18_α=0. variant. We present results from these classes because their distinct spatial characteristics are obvious to the human eye; they are therefore ideal candidates for testing whether the neural network understood these characteristics. In each row of Figure 5, a schematic sketch of the key feature is depicted together with five randomly selected images from the class. The GradCam++ contour maps are overlaid on the images; in addition, the contour levels are used as an α mask for the diffraction image, so that the brightest areas in each plot correspond to the ones with the highest gradient flow. In the case of the Streak class, Figure 5 clearly shows that the neural network was able to identify the dominant streak feature regardless of its orientation or size. Results on the
Bent class also show a strong correlation between the shape of the contour maps and the bent shape of the diffraction pattern. Combining these metrics and the GradCam++ images, we conclude that the
Streak class feature identified by the neural network indeed corresponds to the one seen by the researcher. Also, the
Bent class contour maps from the network show a clear resemblance to the feature intended by the researcher, albeit not as strongly pronounced. Although the deep neural network learned these representations on its own, they align with the intentions of the researcher. This demonstrates that neural networks are capable of learning such complicated patterns on their own.
7. SUMMARY AND OUTLOOK
In this paper, we give a general introduction to the capabilities of neural networks and provide results on the first domain adaptation of neural networks to the use case of diffraction images as input data. The main contributions of this paper are (i) a novel activation function that incorporates the intrinsic logarithmic intensity scaling of diffraction images, (ii) an evaluation of the impact of different training-set sizes on the performance of a trained network and (iii) the use of the two-point cross-correlation function to improve the resistance against very noisy data. In addition, we provide a large benchmarking routine, utilizing multiple neural network architectures and layouts, in appendix A.

We have shown that even in the most basic configuration, convolutional deep neural networks outperform previously established sorting algorithms by a significant margin. More importantly, we improved on these baseline results by modifying the activation function of the first layer. For the case of very noisy data, often a problem in diffraction imaging experiments, we showed that two-point cross-correlation maps as input data, instead of the original diffraction images, improve the robustness of the classification capabilities of the network. Our results set the stage for using deep learning techniques as feature extractors for diffraction imaging datasets. The ultimate goal is to establish an unsupervised routine that can categorize and extract essential pieces of information from a large set of diffraction images on its own. We envision for the near future that the gained insights will lead to multiple new approaches regarding neural networks and diffraction data. For example, the MSFT algorithm used in Langbehn et al. [1] can serve as a generative module in an end-to-end unsupervised classification routine, using large synthetic datasets as training data for a neural network. This approach can be extended to utilize the trained networks as an online-analysis tool during experiments. Furthermore, we hope to develop an unsupervised approach that connects recent research on Generative Adversarial Network theory [77–80] and mutual information maximization [81] with the results of this paper. Such an approach would allow finding characteristic classes of patterns within a data set without any a priori knowledge about the recorded data.
All of the code, written in Python 3.6+ and using the Tensorflow framework, is available on GitHub (https://github.com/julian-carpenter/airynet), free to use under the MIT License. We hope the community uses and improves the code provided in this repository.

ACKNOWLEDGMENTS
We would like to thank K. Kolatzki, B. Senfftleben, R. M. P. Tanyag, M. J. J. Vrakking, A. Rouzée, B. Fingerhut, D. Engel and A. Lübcke from the Max-Born-Institut, Ruslan Kurta from the European XFEL, and Christian Peltz as well as Thomas Fennel from the University of Rostock for fruitful discussions. This work received financial support from the Deutsche Forschungsgemeinschaft under Grants
MO 719/13-1 and STI 125/19-1, and from the Leibniz Grant
SAW/2017/MBI4.

[1] B. Langbehn, K. Sander, Y. Ovcharenko, C. Peltz, A. Clark, M. Coreno, R. Cucini, M. Drabbels, P. Finetti, M. Di Fraia, et al., Phys. Rev. Lett. 121, 255301 (2018).
[2] M. M. Seibert, T. Ekeberg, F. R. N. C. Maia, M. Svenda, J. Andreasson, O. Jönsson, D. Odić, B. Iwan, A. Rocker, D. Westphal, et al., Nature 470, 78 (2011).
[3] N. D. Loh, C. Y. Hampton, A. V. Martin, D. Starodub, R. G. Sierra, A. Barty, A. Aquila, J. Schulz, L. Lomb, J. Steinbrener, et al., Nature 486, 513 (2012).
[4] C. Bostedt, M. Adolph, E. Eremina, M. Hoener, D. Rupp, S. Schorb, H. Thomas, A. R. B. de Castro, and T. Möller, J. Phys. B: At. Mol. Opt. Phys. 43, 194011 (2010).
[5] L. F. Gomez, K. R. Ferguson, J. P. Cryan, C. Bacellar, R. M. P. Tanyag, C. Jones, S. Schorb, D. Anielski, A. Belkacem, C. Bernando, et al., Science 345, 906 (2014).
[6] H. N. Chapman and K. A. Nugent, Nat. Photonics 4, 833 (2010).
[7] D. Rupp, N. Monserud, B. Langbehn, M. Sauppe, J. Zimmermann, Y. Ovcharenko, T. Möller, F. Frassetto, L. Poletto, A. Trabattoni, et al., Nat. Commun. 8, 493 (2017).
[8] Z. Y. Li, N. P. Young, M. Di Vece, S. Palomba, R. E. Palmer, A. L. Bleloch, B. C. Curley, R. L. Johnston, J. Jiang, and J. Yuan, Nature 451, 46 (2008).
[9] J. Farges, M. F. de Feraudy, B. Raoult, and G. Torchet, J. Chem. Phys. 84, 3491 (1986).
[10] D. E. Clemmer and M. F. Jarrold, J. Mass Spectrom. 32, 577 (1997).
[11] O. Kostko, B. Huber, M. Moseler, and B. von Issendorff, Phys. Rev. Lett. 98, 043401 (2007).
[12] A. Sakdinawat and D. Attwood, Nat. Photonics 4, 840 (2010).
[13] C. Bostedt, H. N. Chapman, J. T. Costello, J. R. C. López-Urrutia, S. Düsterer, S. W. Epp, J. Feldhaus, A. Föhlisch, M. Meyer, T. Möller, et al., Nucl. Instrum. Methods Phys. Res. Sect. A, 108 (2009).
[14] T. Gorkhover, M. Adolph, D. Rupp, S. Schorb, S. W. Epp, B. Erk, L. Foucar, R. Hartmann, N. Kimmel, K.-U. Kühnel, et al., Phys. Rev. Lett. 108, 245005 (2012).
[15] C. Bostedt, T. Gorkhover, D. Rupp, M. Thomas, and T. Möller, in Synchrotron Light Sources and Free-Electron Lasers, edited by E. Jaeschke, S. Khan, R. J. Schneider, and J. B. Hastings (Springer International Publishing, Cham, 2016), 1st ed., Chap. Clusters and Nanocrystals, pp. 1–38.
[16] P. Emma, R. Akre, J. Arthur, R. Bionta, C. Bostedt, J. Bozek, A. Brachmann, P. Bucksbaum, R. Coffee, F.-J. Decker, et al., Nat. Photonics 4, 641 (2010).
[17] C. Bostedt, S. Boutet, D. M. Fritz, Z. Huang, H. J. Lee, H. T. Lemke, A. Robert, W. F. Schlotter, J. J. Turner, and G. J. Williams, Rev. Mod. Phys. 88, 015007 (2016).
[18] G. D. Calvey, A. M. Katz, C. B. Schaffer, and L. Pollack, Struct. Dyn. 3, 054301 (2016).
[19] E. A. Schneidmiller, Photon beam properties at the European XFEL, Tech. Rep. (XFEL, Hamburg, 2011).
[20] D. Rupp, M. Adolph, L. Flückiger, T. Gorkhover, J. P. Müller, M. Müller, M. Sauppe, D. Wolter, S. Schorb, R. Treusch, et al., J. Chem. Phys. 141, 044306 (2014).
[21] I. Barke, H. Hartmann, D. Rupp, L. Flückiger, M. Sauppe, M. Adolph, S. Schorb, C. Bostedt, R. Treusch, C. Peltz, et al., Nat. Commun. 6, 6187 (2015).
[22] I. V. Lundholm, J. A. Sellberg, T. Ekeberg, M. F. Hantke, K. Okamoto, G. van der Schot, J. Andreasson, A. Barty, J. Bielecki, P. Bruza, et al., IUCrJ 5, 531 (2018).
[23] J. Flamant, N. Le Bihan, A. V. Martin, and J. H. Manton, Phys. Rev. E 93, 053302 (2016).
[24] T. Ekeberg, M. Svenda, C. Abergel, F. R. Maia, V. Seltzer, J.-M. Claverie, M. Hantke, O. Jönsson, C. Nettelblad, G. van der Schot, et al., Phys. Rev. Lett. 114, 098102 (2015).
[25] K. He, X. Zhang, S. Ren, and J. Sun, arXiv:1603.05027 (2016).
[26] N. Meidinger, R. Andritschke, R. Hartmann, S. Herrmann, P. Holl, G. Lutz, and L. Strüder, Nucl. Instrum. Methods Phys. Res. Sect. A, 251 (2006).
[27] R. P. Kurta, M. Altarelli, and I. A. Vartanyants, in Adv. Chem. Phys., edited by S. A. Rice and A. R. Dinner (John Wiley & Sons, Inc., 2016), Chap. Structural analysis by x-ray intensity angular cross correlations, pp. 1–39.
[28] S. A. Bobkov, A. B. Teslyuk, R. P. Kurta, O. Y. Gorobtsov, O. M. Yefanov, V. A. Ilyin, R. A. Senin, and I. A. Vartanyants, J. Synchrotron Radiat. 22, 1345 (2015).
[29] A. Atla, R. Tada, V. Sheng, and N. Singireddy, in J. Comput. Sci. Coll., Vol. 26 (Consortium for Computing Sciences in Colleges, 2011), Chap. Sensitivity of different machine learning algorithms to noise, pp. 96–103.
[30] C. H. Yoon, P. Schwander, C. Abergel, I. Andersson, J. Andreasson, A. Aquila, S. Bajt, M. Barthelmess, A. Barty, M. J. Bogan, et al., Opt. Express 19, 16542 (2011).
[31] D.-L. Deng, X. Li, and S. Das Sarma, Phys. Rev. B 96, 195145 (2017).
[32] P. Zhang, H. Shen, and H. Zhai, Phys. Rev. Lett. 120, 066401 (2018).
[33] R. D. Field, Y. Kanev, M. Tayebnejad, and P. A. Griffin, Phys. Rev. D, 2296 (1996).
[34] W. Bhimji, S. A. Farrell, T. Kurth, M. Paganini, Prabhat, and E. Racah, arXiv:1711.03573 (2017).
[35] K. Mills, M. Spanner, and I. Tamblyn, Phys. Rev. A 96, 042113 (2017).
[36] S. Manzhos, K. Yamashita, and T. Carrington, Chem. Phys. Lett., 217 (2009).
[37] Y. Zhu, Q. Ouyang, and Y. Mao, BMC Bioinformatics 18, 348 (2017).
[38] Z. Gao, L. Wang, L. Zhou, and J. Zhang, IEEE J. Biomed. Health Inform. 21, 416 (2017).
[39] T. W. Ke, A. S. Brewster, S. X. Yu, D. Ushizima, C. Yang, and N. K. Sauter, J. Synchrotron Radiat. 25, 655 (2018).
[40] S. Kassemeyer, J. Steinbrener, L. Lomb, E. Hartmann, A. Aquila, A. Barty, A. V. Martin, C. Y. Hampton, S. Bajt, M. Barthelmess, et al., Opt. Express 20, 4149 (2012).
[41] A. Chattopadhyay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, arXiv:1710.11063 (2017).
[42] V. Lyamayev, Y. Ovcharenko, R. Katzy, M. Devetta, L. Bruder, A. LaForge, M. Mudrich, U. Person, F. Stienkemeier, M. Krikunova, et al., J. Phys. B: At. Mol. Opt. Phys. 46, 164007 (2013).
[43] C. Svetina, C. Grazioli, N. Mahne, L. Raimondi, C. Fava, M. Zangrando, S. Gerusina, M. Alagia, L. Avaldi, G. Cautero, et al., J. Synchrotron Radiat. 22, 538 (2015).
[44] E. Allaria, R. Appio, L. Badano, W. Barletta, S. Bassanese, S. Biedron, A. Borga, E. Busetto, D. Castronovo, P. Cinquegrana, et al., Nat. Photonics 6, 699 (2012).
[45] B. Frénay and A. Kabán, in ESANN (2014), Chap. A comprehensive introduction to label noise.
[46] F. R. N. C. Maia, Nat. Methods 9, 854 (2012).
[47] M. A. Arbib, Brains, Machines, and Mathematics (Springer-Verlag, 1987), p. 202.
[48] V. Nair and G. E. Hinton, in Proc. 27th Int. Conf. Mach. Learn. (2010), Chap. Rectified linear units improve restricted Boltzmann machines, pp. 807–814.
[49] A. L. Maas, A. Y. Hannun, and A. Y. Ng, in Proc. ICML, Vol. 30 (2013), Chap. Rectifier nonlinearities improve neural network acoustic models, p. 3.
[50] J. Schmidhuber, Neural Networks 61, 85 (2015).
[51] Y. LeCun, Y. Bengio, and G. Hinton, Nature 521, 436 (2015).
[52] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, arXiv:1602.07261 (2016).
[53] G. Hinton, O. Vinyals, and J. Dean, arXiv:1503.02531 (2015).
[54] C. E. Shannon, Bell Syst. Tech. J. 27, 379 (1948).
[55] S. Kullback and R. A. Leibler, Ann. Math. Stat. 22, 79 (1951).
[56] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016), p. 775.
[57] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Nature 323, 533 (1986).
[58] K. He, X. Zhang, S. Ren, and J. Sun, arXiv:1512.03385 (2015).
[59] S. Zagoruyko and N. Komodakis, arXiv:1605.07146 (2016).
[60] H. Zou and T. Hastie, J. R. Stat. Soc. Ser. B 67, 301 (2005).
[61] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, arXiv:1207.0580 (2012).
[62] L. Perez and J. Wang, arXiv:1712.04621 (2017).
[63] S. Ikeda and H. Kono, Opt. Express 20, 3375 (2012).
[64] T. Shintake, Phys. Rev. E, 041906 (2008).
[65] B. Hammouda, J. Appl. Crystallogr. 43, 716 (2010).
[66] S. K. Sinha, E. B. Sirota, S. Garoff, and H. B. Stanley, Phys. Rev. B 38, 2297 (1988).
[67] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, arXiv:1511.07289 (2015).
[68] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, arXiv:1409.4842 (2014).
[69] J. F. Kolen and S. C. Kremer, A Field Guide to Dynamical Recurrent Networks (2001), doi:10.1109/9780470544037.ch14.
[70] A. Nguyen, J. Yosinski, and J. Clune, arXiv:1412.1897 (2014).
[71] S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, arXiv:1610.08401 (2016).
[72] R. P. Kurta, J. J. Donatelli, C. H. Yoon, P. Berntsen, J. Bielecki, B. J. Daurer, H. DeMirci, P. Fromme, M. F. Hantke, F. R. N. C. Maia, et al., Phys. Rev. Lett. 119, 158102 (2017).
[73] J. J. Donatelli, P. H. Zwart, and J. A. Sethian, Proc. Natl. Acad. Sci. U.S.A. 112, 10286 (2015).
[74] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, in Proc. IEEE Int. Conf. Comput. Vis. (IEEE, 2017), pp. 618–626, arXiv:1610.02391.
[75] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, arXiv:1512.04150 (2015).
[76] K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu, arXiv:1802.10171 (2018).
[77] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, arXiv:1805.08318 (2018).
[78] T. Miyato and M. Koyama, arXiv:1802.05637 (2018).
[79] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, arXiv:1802.05957 (2018).
[80] I. Goodfellow, arXiv:1701.00160 (2016).
[81] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, arXiv:1606.03657 (2016).
[82] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Proc. IEEE 86, 2278 (1998).
[83] A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Proc. 25th Int. Conf. Neural Inf. Process. Syst., Vol. 1 (Curran Associates Inc., 2012), Chap. ImageNet classification with deep convolutional neural networks, pp. 1097–1105.
[84] K. Simonyan and A. Zisserman, arXiv:1409.1556 (2014).
[85] R. Eldan and O. Shamir, arXiv:1512.03965 (2015).
[86] A. Veit, M. Wilber, and S. Belongie, arXiv:1605.06431 (2016).
[87] S. Li, J. Jiao, Y. Han, and T. Weissman, arXiv:1611.01186 (2016).
[88] H. Shimodaira, J. Stat. Plan. Inference 90, 227 (2000).
[89] J. Jiang, A Literature Survey on Domain Adaptation of Statistical Classifiers (2008).
[90] S. Ioffe and C. Szegedy, arXiv:1502.03167 (2015).
Appendix A: Architectural design choices
In this section, we describe and explain our choices of neural network architecture to establish a baseline performance when working with diffraction patterns, before the inclusion of our diffraction-specific activation function, see section 5.1 in the main manuscript. We present the theory and background on available architectures and provide results for two architectures with five depth layouts.

There are different layer styles from which we can build a neural network. The nomenclature is that a full arrangement of all layers is called the architecture, or configuration, of the network. For our tests, we use two different neural network architectures, a ResNet and a VGG-Net, both with multiple depth layouts. For the ResNet, we train and evaluate three depth variations (18, 50 and 101 layers), and for the VGG-Net we train two variants (16 and 19 layers). The structure of this section is as follows: First, we explain how a convolutional layer works in general. Second, we motivate the derivation of the VGG-Net from preceding architectures, and third, we show how the ResNet architecture can be explained by expanding the core ideas used in the VGG-Net. In the following section, we then present the results for all the configurations trained here.

Almost every architectural design is empirically derived [50–52] and consists of multiple combinations of only a few basic layer styles, namely the fully connected layer, the convolutional layer, a pooling operation and a batch normalization operation. We discuss the pooling and batch normalization layers only in appendix C because of their minor role within the neural network; the reader is also referred to the exhaustive overviews in Schmidhuber [50] and LeCun et al. [51]. Since the convolutional layer serves as a fundamental basis for image analysis with neural networks, we explain it here in more detail.

The very basic idea of a convolutional layer is that nearby pixels in an input image are more strongly correlated than more distant pixels; the region a filter sees at once is called a local receptive field. Therefore, by calculating a convolution over an input image with trainable filters, local features can be extracted: N filters, each of size M × M, slide over the input image and produce N convolved maps, called feature maps. One filter uses the same weights on all parts of the input image for producing one feature map; this is called weight sharing. Weight sharing not only reduces the complexity of the model but also provides a bridge towards the convolution function in mathematics: with weight sharing, we can identify the filter within the convolutional layer as the kernel function of the mathematical convolution. Figure 6 a) shows a schematic of a convolutional layer with one filter.
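As a toy illustration of this operation (our own sketch, not part of the released code base), the following Python snippet slides a single 3 × 3 filter with stride 2 over a 7 × 7 input, reproducing the feature-map size of Figure 6 a):

```python
import numpy as np

def conv2d_single_filter(img, kernel, stride=2):
    """Slide one M x M filter over the image (no padding) and return
    the feature map; cf. Figure 6 a)."""
    m = kernel.shape[0]
    h = (img.shape[0] - m) // stride + 1
    w = (img.shape[1] - m) // stride + 1
    fmap = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = img[i * stride:i * stride + m,
                        j * stride:j * stride + m]
            fmap[i, j] = np.sum(patch * kernel)  # same (shared) weights
    return fmap

img = np.arange(49.0).reshape(7, 7)  # a 7 x 7 toy input
kernel = np.ones((3, 3))             # one 3 x 3 filter
print(conv2d_single_filter(img, kernel).shape)  # (3, 3), as in Figure 6 a)
```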
FIG. 6. Schematic of a convolutional operation inside a convolutional layer in a), and of a classic skip connection found in the ResNet architecture in b). a) illustrates the local receptive fields and shared weights concepts. The convolutional filter has size 3 × 3 and slides over an input image of size 7 × 7, which produces an output, called a feature map, of size 3 × 3. The stride is the distance the filter moves in each step, implied by the gray shading every 2 pixels in the input image. Using a local receptive field describes the inclusion of nearby pixels, and weight sharing means using the same filter weights for the whole input image. The calculation at the bottom (4 + 0 + 0 + 12 + 5 + 28 + 8 + 24 + 5 = 86) is for the second entry in the feature map. b) A classical skip connection is shown with two convolutional layers that approximate a sparse residual, which gets added to the identity at the output.
This exemplary filter has size 3 × 3; the feature map is smaller than the input image because the filter moves two pixels for each step. This step size is called the stride. Hereafter we use the notation conv(a, b, c) for a convolutional layer with filter size a × a, number of filters b and stride c. The example from Figure 6 a) could, therefore, be written as
conv(3, 1, 2) and would result in 9 trainable weight parameters plus 1 bias parameter (not shown in the figure).

This concept was introduced with the LeNet architecture by LeCun et al. [82], which is considered the seminal work in the field and the first deep convolutional neural network. After Yann LeCun proposed the LeNet architecture, further research [83] led to the now de-facto standard for plain convolutional networks, the VGG-Net. Simonyan and Zisserman [84] proposed the original architecture, which consists of up to 19 weight layers, of which 16 are convolutional layers and 3 are fully connected ones. It is easy to build, easy to train and in general provides good results [50, 51]. For these reasons, we include two variations of it in our tests, namely variants D and E (nomenclature from [84]). Table 7 shows the details of the architecture, using the naming convention we introduced with the convolutional layer. Simonyan and Zisserman [84] derived the VGG-Net directly from the LeNet by arguing that three convolutional layers with filter size 3 and stride 1 (VGG-Net) achieve better results than a single filter with size 7 and stride 2 (LeNet), which amounts to the same effective local receptive field size [84]. Three layers perform better than one due to having two additional non-linear activation functions and reduced complexity (fewer weight parameters because of the smaller filter sizes), which forces the neural network not only to be more discriminative but also to find sparser solutions [84].
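The complexity part of this argument is easy to verify by counting weights. The following sketch (plain Python, our own illustration; it assumes an equal number of input and output channels) compares three stacked 3 × 3 layers with a single 7 × 7 layer:

```python
def conv_params(filter_size, n_in, n_out, bias=True):
    """Weight count of one conv layer: filter_size^2 * n_in * n_out (+ biases)."""
    return filter_size**2 * n_in * n_out + (n_out if bias else 0)

channels = 64  # assumed channel count, identical at input and output

# Three stacked 3x3 layers (VGG style): same 7x7 effective receptive field.
vgg_style = 3 * conv_params(3, channels, channels)

# One 7x7 layer (LeNet style).
lenet_style = conv_params(7, channels, channels)

print(vgg_style, lenet_style)  # 110784 vs. 200768: fewer parameters,
                               # plus two extra non-linearities in between
```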
TABLE 7. The deep neural network architecture of the VGG variants D and E. conv(a, b, c) is a convolutional layer with filter size a × a, number of filters b and stride c. max pooling(d, e) is a max pooling layer with filter size d × d and stride e. Note that we changed the fully connected layers of the original architecture to convolutional layers.

Variant       D                     E
Depth         16                    19
Input         2 × conv(3, 64, 1)    2 × conv(3, 64, 1)
              max pooling(2, 2)
Block 1       2 × conv(3, 128, 1)   2 × conv(3, 128, 1)
              max pooling(2, 2)
Block 2       3 × conv(3, 256, 1)   4 × conv(3, 256, 1)
              max pooling(2, 2)
Block 3       3 × conv(3, 512, 1)   4 × conv(3, 512, 1)
              max pooling(2, 2)
Block 4       3 × conv(3, 512, 1)   4 × conv(3, 512, 1)
              max pooling(2, 2)
Output        conv(7, 4096, 1), conv(1, 4096, 1), conv(1, N, 1)

Building on the results achieved by the VGG-Net, it was shown that the depth of a deep neural network directly relates to its classification capabilities [58, 68, 85]. This led to the introduction of so-called residual skip connections, which further exploit this depth-matters concept [58, 68]. These residual skip connections are the name-giving component of the ResNet architecture. In principle, a ResNet still uses the VGG architectural layout but exchanges the convolutional blocks 1 to 4 with residual skip connections; compare tables 7 and 8. This exchange drastically reduces the complexity of the whole network while increasing the number of layers.

The VGG architecture can be broken down into six blocks: one input block, one output block, and four convolutional blocks (see table 7). Block 2 is the first block in which there are distinctions between VGG variants D and E. The VGG-Net architecture proved that increasing the depth and decreasing the amount and size of the filters increases the accuracy, which ultimately gave rise to plain skip connections: blocks of few convolutional layers designed to replace the large amounts of filters in one layer with multiple layers of fewer, and smaller, filters. Two types exist, a classical and a bottleneck skip connection; they differ only in how much the depth is increased and the complexity decreased. This addition alone only modifies the depth and complexity of the network and yields what is called a plain network, see He et al. [58]. It performs reasonably well, but not significantly better than the VGG-Net. A residual skip connection differs from a plain skip connection only in adding the identity of its inputs to its outputs. This way, all the convolutional layers in a skip connection learn only a residual of their input. This simple technique enables a ResNet to outperform all other convolutional deep neural network architectures [25, 52]. Figure 6 b) exemplifies a classical residual skip connection. There is still an ongoing debate about why a residual neural network performs so well [58, 68, 86]. Research has shown that ResNets find sparser solutions faster due to their layout, and that they behave like ensembles of shallower networks, with information flow only activated on 10 to 34 layers even when the neural network has a depth of 101 layers [58, 68, 86]. However, besides the empirical success, one of the critical advantages of ResNets is that reaching training convergence does not get significantly harder when increasing the depth of the neural network, which is usually the case with other architectures. Therefore, training very deep residual neural networks is no more difficult than training shallow plain neural networks [52, 87]. For these reasons, we train three variants, with 18, 50 and 101 layers, of a further optimized version of the classical ResNet, called the pre-activated ResNet [25]; see table 8 for implementation details.
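To make the distinction concrete, the following is a minimal sketch of a classical pre-activated residual skip connection in TensorFlow/Keras (the framework our released code uses); the function name is our own, the projection variant for changing channel counts or strides is omitted, and the block assumes that the channel count of x equals filters:

```python
import tensorflow as tf
from tensorflow.keras import layers

def preact_residual_block(x, filters):
    """Classical pre-activated residual skip connection, cf. Figure 6 b).

    Two 3x3 convolutions learn only a residual, which is added back
    onto the identity of the block input."""
    identity = x
    out = layers.BatchNormalization()(x)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters, 3, padding="same")(out)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    out = layers.Conv2D(filters, 3, padding="same")(out)
    return layers.Add()([out, identity])  # identity + learned residual
```

Removing the final Add() turns this into a plain skip connection in the sense used above: same depth and complexity, but without the residual learning behavior.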
Table 9 shows the overall evaluation metrics on the helium and the CXIDB datasets, as well as the training wall time. Table 10 shows the per-class evaluation metrics for the helium dataset; per-class metrics are not needed for the CXIDB dataset, because predictions on the helium dataset are a multi-class problem whereas predictions on the CXIDB data are single-class. Single-class, or one-hot, problems have identical overall and per-class evaluation metrics. We trained all models as described in section 3.1.5 in the main manuscript. Train Time_Max is the time at which the neural network achieved the highest accuracy score on the evaluation dataset, and Train Time_Full is the time for training the full 200 epochs.
However, in practice we achieved optimal convergence after training for 70 to 100 epochs; beyond this point, the network showed overfitting.

Both VGG models took significantly longer to train than the ResNet variants, needing more than 6 h of training time while reaching lower maximum accuracy scores: 0.959 compared to 0.964 for the helium data and 0.970 vs. 0.978 for the CXIDB data. Also, the accuracy did not change much when increasing the depth from 16 to 19 layers; precision even decreased slightly and recall remained unchanged. On the other hand, increasing complexity within the ResNet architecture helped to boost the accuracy from 0.955 (CXIDB data: 0.967) for 18 layers to 0.964 (CXIDB data: 0.978) for 101 layers.

For the classes Oblate, Spherical, Streak and Empty, the ResNet reached a precision of 0.92 or higher together with recall values above 0.97 (see Table 10). For Prolate, Bent, Double Rings and Layered, the ResNet reached a good precision, but a recall score of ≈ 0.65 shows that it missed almost a third of all available images, indicating that we failed to generalize the network for these classes. For Elliptical, Newton Rings and Asymmetric images, the recall is considerably lower, at 0.48, 0.23 and 0.22, respectively. Elliptical is the only class of these three where the precision is high enough for using the neural network as a predictor. For the Newton Rings and Asymmetric classes, with precision scores around 0.6, the neural network is effectively guessing.

The performance of all variants clearly shows the generally good classification capabilities of convolutional deep neural networks in the use case of diffraction patterns. Even the lowest-performing neural network can outperform previous classification approaches by a large margin; compare with [28]. In particular, the results of ResNet18 are compelling: it is small, easy to train and has relatively low complexity. Although having only a fraction of the trainable parameters, it performed almost always on par with the much more complex VGG architectures, all while requiring only a fraction of the training time.
TABLE 8. Used ResNet variants; see also the 18, 50 and 101 layer layouts in [58]. Note that we added the pre-activated layer layout from [59]. conv(a, b, c) is a convolutional layer with filter size a × a, number of filters b and stride c. max pooling(d, e) is a max pooling layer with filter size d × d and stride e. avg pooling is a global average pooling layer, and fc(f) is a fully connected layer with output size f. Layers in bold emphasis have a stride of 2 during their first iteration, therefore reducing the dimension by a factor of 2.

Variant        Classic                   Bottleneck                  Bottleneck
Depth          18                        50                          101
Input          conv(7, 64, 2)
Pooling        max pooling(3, 2)
Block 1        2 × [conv(3, 64, 1),      3 × [conv(1, 64, 1),        3 × [conv(1, 64, 1),
                    conv(3, 64, 1)]           conv(3, 64, 1),             conv(3, 64, 1),
                                              conv(1, 256, 1)]            conv(1, 256, 1)]
Block 2        2 × [conv(3, 128, 1),     4 × [conv(1, 128, 1),       4 × [conv(1, 128, 1),
                    conv(3, 128, 1)]          conv(3, 128, 1),            conv(3, 128, 1),
                                              conv(1, 512, 1)]            conv(1, 512, 1)]
Block 3        2 × [conv(3, 256, 1),     6 × [conv(1, 256, 1),       23 × [conv(1, 256, 1),
                    conv(3, 256, 1)]          conv(3, 256, 1),             conv(3, 256, 1),
                                              conv(1, 1024, 1)]            conv(1, 1024, 1)]
Block 4        2 × [conv(3, 512, 1),     3 × [conv(1, 512, 1),       3 × [conv(1, 512, 1),
                    conv(3, 512, 1)]          conv(3, 512, 1),            conv(3, 512, 1),
                                              conv(1, 2048, 1)]           conv(1, 2048, 1)]
Output block   avg pooling, fc(N)

TABLE 9. Overall evaluation metrics for all architectures and both datasets. The training time after which the neural network scored the highest accuracy on the evaluation dataset is labeled Train Time_Max, and Train Time_Full is the time for training the full 200 epochs. The table gives the maximum values during training for accuracy, precision, and recall. Bold scores are the best results in their respective category.
Architecture          ResNet                      VGG
Depth                 18      50      101         16      19
Dataset: Helium
Accuracy              0.955   …       0.964       …       …
Precision             …       …       …           …       …
Recall                …       …       …           …       …
Train Time_Max [h]    …       …       …           …       …
Train Time_Full [h]   …       …       …           …       …
Dataset: CXIDB
Accuracy              0.967   …       0.978       …       …
Precision             …       …       …           …       …
Recall                …       …       …           …       …
Train Time_Max [h]    …       …       …           …       …
Train Time_Full [h]   …       …       …           …       …

TABLE 10. Per-class accuracy, precision and recall values for the best performing ResNet configuration with 101 layers. Samples are the number of images whose ground-truth label is positive in the evaluation dataset. Results reflect the maximum achieved value reached throughout the training process, assessed on the evaluation dataset.

Class          Accuracy  Precision  Recall  Samples
Oblate         0.9681    0.9770     0.9965  988
Spherical      0.9166    0.9247     0.9849  869
Elliptical     0.9231    0.8054     0.4836  119
Newton rings   0.9352    0.6325     0.2282  69
Prolate        0.9690    0.9274     0.6777  68
Bent           0.9657    0.8161     0.6487  59
Asymmetric     0.9458    0.6044     0.2207  55
Streak         0.9898    0.9372     0.9876  36
Double Rings   0.9768    0.7708     0.6788  33
Layered        0.9896    0.9062     0.6170  7
Empty          0.9904    0.9537     0.9763  32

Appendix B: Derivation of the binary cross-entropy

Here, we give a derivation of the binary cross-entropy (Equation 6 in the main manuscript). We start with the most general form of the cross-entropy, given by:

$$H(p, q) = H(p) + D_{KL}(p \| q), \tag{B1}$$

where $H(p)$ is the Shannon entropy of $p$, and $D_{KL}(p \| q)$ is the Kullback–Leibler divergence of $p$ and $q$ [56]. This is equivalent to:

$$H(p, q) = -\sum_i p_i \log q_i, \tag{B2}$$

where $p_i$ and $q_i$ are two probability distributions over the same set of events; $p_i$ is the "correct" distribution, and $q_i$ is the approximation of $p_i$ by the deep neural network. Since we are using a Bernoulli distribution as our probabilistic model, there are only two outcomes that one event ($k$) can have: $k \in \{0, 1\}$. The probabilities for both outcomes of one event, under both distributions, can be written as:

$$p(x) = \begin{cases} y(x) & \text{if } k = 1 \\ 1 - y(x) & \text{if } k = 0 \end{cases} \qquad q(x) = \begin{cases} \hat{y}(x) & \text{if } k = 1 \\ 1 - \hat{y}(x) & \text{if } k = 0 \end{cases}$$

Here $x$ is some event, $y$ is the ground-truth label and $\hat{y}$ is the approximate probability assigned by the deep neural network. Since we are using a sigmoid function at the output of our deep neural network, we can simplify Equation B2. Using

$$\hat{y}(x)_{\mathrm{sigmoid}} = \frac{1}{1 + \exp(-x)},$$

we can write:

$$\begin{aligned} H(p, q) &= -\sum_i p_i \log q_i \\ &= -y(x) \log(\hat{y}(x)) - (1 - y(x)) \log(1 - \hat{y}(x)) \\ &= -y(x) \log\!\left(\frac{1}{1 + \exp(-x)}\right) - (1 - y(x)) \log\!\left(1 - \frac{1}{1 + \exp(-x)}\right) \\ &= x - x\, y(x) + \log(1 + \exp(-x)), \end{aligned}$$

where $x$ is an event (e.g. the activation in the output layer of the deep neural network) and $y$ is the real label of this event.
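As a quick numerical check of the last step, the following sketch (plain Python/NumPy, our own illustration) compares the simplified closed form with the textbook binary cross-entropy for a few arbitrary logits:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 11)       # pre-sigmoid activations (logits)
y = np.array([0.0, 1.0] * 5 + [1.0])  # arbitrary ground-truth labels

sigmoid = 1.0 / (1.0 + np.exp(-x))

# Textbook binary cross-entropy, Eq. (B2) for a Bernoulli model.
bce = -y * np.log(sigmoid) - (1 - y) * np.log(1 - sigmoid)

# Simplified closed form derived above: x - x*y + log(1 + exp(-x)).
simplified = x - x * y + np.log(1 + np.exp(-x))

assert np.allclose(bce, simplified)  # identical up to floating-point error
```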
Appendix C: Further building blocks of deep neural networks

This section describes the pooling layer and the batch normalization layer in more detail. Since these components are not critical for the neural network, their explanation is given only here in the supplemental material.
1. Pooling
There are two commonly used variants of pooling layers, the max pool and the average pool. The idea is to reduce the dimensionality of the output of a preceding layer by letting a filter of size a × a slide over parts of the image with step size b, called the stride, and perform a down-sampling operation within each window. A max pool filter takes only the maximum value, and an avg pool filter averages over all values within its receptive field [82, 83]. This process is equivalent to a convolutional operation, but instead of a matrix multiplication with a convolutional kernel, the pooling operation is carried out.
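A minimal sketch of both variants (plain Python/NumPy, our own illustration):

```python
import numpy as np

def pool2d(img, size=2, stride=2, mode="max"):
    """Slide a size x size window with the given stride over a 2D map
    and reduce each window to a single value (down-sampling)."""
    h = (img.shape[0] - size) // stride + 1
    w = (img.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            window = img[i * stride:i * stride + size,
                         j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)  # max pool or avg pool
    return out

fmap = np.arange(16.0).reshape(4, 4)
print(pool2d(fmap).shape)  # (2, 2): dimensionality reduced by a factor of 2
```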
2. Batch Normalization
Every layer within a deep neural network is, to some extent, modeling the probability distribution given to it by its preceding layer. This is a hierarchical regression problem, which becomes harder if one layer changes key characteristics of the modeled probability distribution (e.g. the mean, variance or kurtosis). This shift is then further multiplied in every succeeding layer and is therefore dependent on the depth of the network. This phenomenon is called a covariate shift [88]. Although this problem is solved in a deep neural network via domain adaptation, the costs of a covariate shift are usually much longer training times and reduced accuracy [89]. For this reason, a batch normalization layer (bn) is used to shift the mean of the mini-batch input to zero and to set the variance to one. This significantly reduces the training time and increases accuracy [90]. bn consists of four steps, after which a normalized mini-batch is returned (a short code sketch of these steps follows the list):

1. Calculate the mini-batch mean:
$$\mu_{mb} = \frac{1}{m} \sum_{i=1}^{m} x_i$$

2. Calculate the mini-batch variance:
$$\sigma_{mb}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{mb})^2$$

3. Normalize:
$$\hat{x}_i = \frac{x_i - \mu_{mb}}{\sqrt{\sigma_{mb}^2 + \epsilon}}$$

4. Scale and shift according to adjustable parameters:
$$y_i = \gamma \hat{x}_i + \beta$$

where $y_i$ is the normalized output of input $x_i$, and $\gamma$ and $\beta$ are adjustable parameters.
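The promised sketch of these four steps (plain Python/NumPy, our own illustration; a framework implementation would additionally keep running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """The four bn steps above for a mini-batch x (batch on axis 0)."""
    mu_mb = x.mean(axis=0)                       # 1. mini-batch mean
    var_mb = x.var(axis=0)                       # 2. mini-batch variance
    x_hat = (x - mu_mb) / np.sqrt(var_mb + eps)  # 3. normalize
    return gamma * x_hat + beta                  # 4. scale and shift

batch = np.random.default_rng(0).normal(5.0, 3.0, size=(32, 10))
out = batch_norm(batch)
print(out.mean(), out.std())  # approx. 0 and 1 before scale/shift
```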
In chapter 6 of the main manuscript we show what the neural network deemed the most relevant areas within an input image. We calculated these so-called heatmaps with an algorithm called GradCam++. The main idea is based on CAM [75] and GradCam [74] and allows for a very intuitive explanation of the decisions made by a convolutional deep neural network [41].

The core principle is that the output of a convolutional deep neural network can be expressed as a linear combination of the globally average-pooled feature maps of the last convolutional layer:

$$Y^c = \sum_k w_k^c \sum_i \sum_j A_{ij}^k,$$

where $A_{ij}^k$ is one feature map of all $k$ maps from the last convolutional layer and $w_k^c$ are the weights of feature map $k$ for a particular class prediction $c$. $Y^c$ is the predicted probability that the input image belongs to this certain class $c$. In the GradCam++ formalism the weights can be calculated as:

$$w_k^c = \sum_i \sum_j a_{ij}^{kc}\, \mathrm{LeakyReLU}\!\left(\frac{\partial Y^c}{\partial A_{ij}^k}\right), \tag{D1}$$

where $a_{ij}^{kc}$ are the gradient weights and $\mathrm{LeakyReLU}(\cdot)$ is a rectified linear unit activation function, very similar to the one we used throughout the main manuscript. $a_{ij}^{kc}$ depends only on $A_{ij}^k$ and $Y^c$ via:

$$a_{ij}^{kc} = \frac{\dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2}}{2\,\dfrac{\partial^2 Y^c}{(\partial A_{ij}^k)^2} + \sum_a \sum_b A_{ab}^k\, \dfrac{\partial^3 Y^c}{(\partial A_{ij}^k)^3}}$$

The final heatmap, often called a saliency map, can then be obtained as:

$$L_{ij}^c = \mathrm{LeakyReLU}\!\left(\sum_k w_k^c A_{ij}^k\right).$$
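A compact sketch of this computation in TensorFlow (our own illustration, not the released implementation; it assumes a trained tf.keras classifier and uses the common approximation that, for an exponential-family score, the higher derivatives reduce to powers of the first derivative):

```python
import tensorflow as tf

def grad_cam_pp(model, image, class_idx, conv_layer_name):
    """Minimal GradCam++ saliency map following the equations above.

    model           : a trained tf.keras classifier (assumption).
    image           : one input image, shape (H, W, C).
    conv_layer_name : name of the last convolutional layer.
    """
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[tf.newaxis])
        score = preds[0, class_idx]              # Y^c
    grads = tape.gradient(score, conv_out)       # dY^c / dA^k_ij
    # Gradient weights a^{kc}_ij: higher derivatives approximated by
    # powers of the first derivative.
    denom = 2.0 * grads**2 + tf.reduce_sum(
        conv_out, axis=(1, 2), keepdims=True) * grads**3
    alpha = grads**2 / (denom + 1e-8)
    # Eq. (D1): per-feature-map weights w^c_k.
    w = tf.reduce_sum(alpha * tf.nn.leaky_relu(grads), axis=(1, 2))
    # Saliency map: weighted sum of the feature maps, normalized to [0, 1].
    cam = tf.nn.leaky_relu(
        tf.reduce_sum(w[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8))[0]
```

The returned map has the spatial resolution of the last convolutional layer and is interpolated up to the input-image size for overlays such as those in Figure 5.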