A Bayesian multiscale CNN framework to predict local stress fields in structures with microscale features
Vasilis Krokos, Viet Bui Xuan, Stéphane P. A. Bordas, Philippe Young, Pierre Kerfriden
Cardiff School of Engineering, Cardiff University, UK
Synopsys Northern Europe Ltd, Exeter, UK
Institute for Computational Engineering, Faculty of Science, Technology and Communication, University of Luxembourg, Luxembourg
School of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, UK
Centre des Matériaux, Mines ParisTech, Evry, France
* Corresponding authors: Vasilis Krokos, KrokosV@cardiff.ac.uk; Pierre Kerfriden, [email protected]
December 17, 2020
Abstract
Finite Element Analysis (FEA) for stress prediction in structures with microstructural features is computationally expensive because those features are much smaller than the other geometric features of the structure. Accurately predicting the additional stress generated by such microstructural features therefore requires a very fine FE mesh. Omitting or averaging the effect of the microstructural features in FEA models is standard practice, resulting in faster calculations of global stress fields, which, assuming some degree of scale separability, may then be complemented by local defect analyses. The purpose of this work is to train an Encoder-Decoder Convolutional Neural Network (CNN) to automatically add local fine-scale stress corrections to coarse stress predictions around defects. We wish to understand to what extent such a framework may provide reliable stress predictions inside and outside the training set, i.e. for unseen coarse-scale geometries and stress distributions and/or unseen defect geometries. Ultimately, we aim to develop efficient offline data generation and online data acquisition methods to maximise the domain of validity of the CNN predictions. To achieve these ambitious goals, we deploy a Bayesian approach providing not point estimates but credible intervals of the fine-scale stress field, as a means to evaluate the uncertainty of the predictions. The uncertainty quantified by the network automatically encompasses the lack of knowledge due to unseen macro and micro features, and the lack of knowledge due to a potential lack of scale separability. This uncertainty is used in a Selective Learning framework to reduce the data requirements of the network. In this work we investigate stress prediction in 2D composite structures with randomly distributed circular pores.
Multi-scale structures are very common in industrial and bio-mechanical applications, for instance composite materials and trabecular bones [Fig. 1]. Full Finite Element Analysis (FEA) for stress prediction is usually very expensive in these structures, as the FE mesh needs to be very dense to capture the effect of the fine-scale features. Therefore, a common approach is to use a coarse-scale mesh to predict the global stress field and to perform local-scale corrections in the areas where the fine-scale features are present.

Figure 1: On the left a 3D reconstruction of a femur and on the right an X-ray revealing the internal microstructure. Images courtesy of Synopsys.

The effect of fine (micro) scale features on the global (macro) stress field can be expressed using Stress Concentration Factors (SCFs). SCFs can be calculated analytically, but only for spherical or elliptical isolated micro-scale features, or for specific assemblies of such features whose main axes of inertia are aligned with the principal stress, on semi-infinite or infinite planes [Pilkey and Pilkey, 2008]. Additionally, analytical SCFs assume a separation of scales and thus require the application of homogeneous boundary conditions at infinity.

In this work we use a non-parametric regression model to perform the local corrections. We do not want to explicitly parameterise the micro- and macro-scale geometric features, nor to assume scale separability. A promising approach that satisfies all these constraints is a Convolutional Neural Network (CNN). A CNN is a specific type of Neural Network (NN) designed to work with images. CNNs have been used extensively for tasks like classification, localisation and segmentation, and are able to recognise arbitrary shapes and textures as long as these are relevant to their objective [Wu et al., 2015; Zhang et al., 2018; Aladem and Rawashdeh, 2020]. Therefore, our approach is not a priori limited to any kind of micro-scale geometric feature, although in practice we need to stay relatively close to the training set. Additionally, we can use the full macro-scale stress fields as input to the CNN, not only hand-crafted averages such as the ones used in homogenisation theory [Sanchez-Palencia, 1986]. Consequently, we do not need to assume scale separability or to apply homogeneous boundary conditions at infinity. Lastly, real medical or industrial data are hard to find and often expensive, so we aim to train our CNN on simpler, artificial datasets and find a way to transfer our knowledge to real cases. To achieve that, we train our CNN using only patches of the geometry, so the CNN is completely agnostic to the overall structure and learns to identify the effect of microscale features on the macro stress field.

Although Neural Networks (NNs) are nowadays used in a wide range of applications, they usually lack one very important feature: the vast majority of NNs fail to incorporate uncertainty information in their predictions. That makes them overconfident when facing data far away from the training set, something that can lead to catastrophic consequences in critical tasks. Even worse, in classification tasks, predictive probabilities obtained after the softmax layer are often erroneously interpreted as model confidence, when in reality a NN can be very uncertain even with a very high softmax output. NNs that can provide uncertainty information are called Bayesian Neural Networks (BNNs).
The extracted uncertainty from a BNN is expected to increase far from the dataset or in high-noise areas, warning us not to trust the prediction. On the other hand, low uncertainty implies that the BNN is confident to make a prediction for the specific input and thus the prediction is sensible. There are two common ways to quantify the uncertainty in a NN and obtain a probabilistic output. The first one is the Bayes by Backprop method [Blundell et al., 2015]. This is a proper Bayesian treatment of the model, where the parameters of the network are replaced by distributions, often but not necessarily Gaussians. The second option is to use dropout as a Bayesian approximation [Gal and Ghahramani, 2016]. This method does not require changes in the architecture of the NN, and the main idea is to use dropout even during inference to introduce randomness into the network.

Although the uncertainty on its own is a very important property, BNNs are usually employed to solve another, bigger problem with modern NNs, namely the lack of labelled data. Modern Deep Neural Networks (DNNs) are extremely data hungry, but labelled data are usually hard to find and expensive, either in terms of time or money. This is true in our case as well. To reduce the labelled data requirements we employ a Selective Learning (SL) framework that helps us identify the data that contain new, useful information and label only these. Selective learning with image data is a challenging task with a very sparse existing literature [Gal et al., 2017; Gal and Ghahramani, 2016; Holub et al., 2008; Joshi et al., 2009; Li and Guo, 2013].

Machine Learning (ML) has already been used in fields such as solid mechanics and biomechanics. One of the earliest works using ML models as surrogates for FEA is from [Liang et al., 2018], who developed an image-to-image deep learning framework to predict the aortic wall stress distribution, where the mechanical behaviour in the FEA model was described by a fibre-reinforced hyperelastic material model. After that, other NNs with fully connected layers were used for stress predictions for non-linear materials but simple beam structures, as shown by [Roewer-Despres et al., 2018; Meister et al., 2018]. Later, [Mendizabal et al., 2019] used a CNN for the prediction of nonlinear soft tissue deformation on more complicated structures such as a liver, but without any kind of microscale features. Moreover, [Nie et al., 2019] deployed a CNN model for stress prediction on simple structures with geometric features, but not on multi-scale problems, as the size of these features was comparable to the size of the structure. Also, [Jiang et al., 2020] used a GAN to analyse mechanical stress distributions on a set of structures encoded as high- and low-resolution images. A variety of loading and boundary conditions was used, and some of them resembled the effect of isolated microstructural features. Recently, [Sun et al., 2020], based on the architecture of [Nie et al., 2019], created an Encoder-Decoder CNN for the prediction of the stress field in fibre-reinforced polymers, but their samples come from a single specimen and a single FE simulation, implying low generalisation ability both in terms of different structures and loading/boundary conditions. Additionally, they predict only the z component of the stress tensor and they report a value of about 70% in their evaluation metric.
Lastly, [Wang et al., 2020] used a Convolutional Aided Bidirectional Long Short-term Memory Network to predict the sequence of maximum internal stress until material failure.

In contrast to all the aforementioned approaches, our model is able to make predictions in multiscale cases, where multiple microscale features interact with each other and with the macroscale structure. Also, our NN is trained on patches of different structures under different boundary conditions, thus it can be applied to a much broader set of macro-scale features and loadings. Furthermore, because we use the cheap macroscale stress as input, the NN only needs to learn how the microscale features affect the macro-scale stress, further contributing to a model that generalises well to unseen cases. Moreover, we managed to incorporate uncertainty information into our prediction, allowing us to make confident predictions or be aware of the potentially large error in the prediction. Lastly, we propose a SL framework to tackle the well-known problem of insufficient labelled data.

In this section we discuss the reference multiscale mechanical model that we aim to solve online using the trained CNN, and we also introduce definitions and notations that will be necessary to explain our methodology.
We consider a 2D body occupying a domain Ω ⊂ R² with boundary ∂Ω. The body is subjected to prescribed displacements U_D on part of its boundary ∂Ω_u and prescribed tractions T_D on the complementary boundary ∂Ω_T = ∂Ω \ ∂Ω_u. The equations of linear elasticity under the plane strain assumption are as follows [Eq. 1a - 1d]:

−∇ · σ = f   in Ω                                        (1a)
σ · n = T_D   on ∂Ω_T                                    (1b)
σ = λ tr(ε) I + 2 μ ε                                    (1c)
ε = ½ (∇u + (∇u)ᵀ)                                       (1d)

where σ is the stress tensor, f is the body force per unit volume, λ and μ are the Lamé elasticity parameters of the material in Ω, I is the identity tensor, tr is the trace operator on a tensor, ε is the symmetric strain tensor (symmetric gradient), u is the displacement vector field and n is the outward unit normal to ∂Ω.

We are interested in predicting the stress field and, more specifically, a fracture indicator. A suitable fracture indicator is the Tresca stress, defined here as the maximum of the principal stresses. The Tresca stress can be calculated from the stress tensor as described below. The stress tensor can be rotated [Eq. 2] using the rotation matrix [Eq. 3]. The components of the stress tensor after the rotation are given in [Eq. 4a - 4c]. From [Eq. 4c] we conclude that there must be an angle θ_p such that the shear stress after rotation is zero [Eq. 5]. After inserting θ_p into [Eq. 4a, 4b] we can calculate the two principal stress components [Eq. 6]. σ_max is the Tresca stress.

σ′ = Q · σ · Qᵀ                                          (2)
Q = [ cos θ   −sin θ ;  sin θ   cos θ ]                  (3)
σ′_xx = σ_xx cos²θ + σ_yy sin²θ + 2 τ_xy sin θ cos θ     (4a)
σ′_yy = σ_xx sin²θ + σ_yy cos²θ − 2 τ_xy sin θ cos θ     (4b)
τ′_xy = (σ_yy − σ_xx) sin θ cos θ + τ_xy (cos²θ − sin²θ) (4c)
tan(2 θ_p) = 2 τ_xy / (σ_xx − σ_yy)                      (5)
σ_max, σ_min = (σ_xx + σ_yy)/2 ± √( ((σ_xx − σ_yy)/2)² + τ_xy² )   (6)

Assuming a structure with both macroscale and microscale features, like the one in [Fig 2], we assume that elasticity can be written over Ω̄, where microscale features are ignored or averaged, through a modification of the constitutive law (i.e. standard homogenisation), as for example in [Sanchez-Palencia, 1986]. We assume that the macro stress prediction, together with the microscale information in a domain surrounding a Region Of Interest (ROI), is sufficient to predict the micro stress field in the ROI up to an acceptable level of accuracy. This assumption is backed up by Saint-Venant's principle, which states that the micro effect in a subregion B ⊂ Ω̄ can be fully predicted knowing the macroscopic solution, provided we look sufficiently far away from the boundaries of B, as shown in [Fig 2].

Figure 2: Sketch of a multiscale geometry. In gray the domain Ω̄. Two macroscale features are visible and black areas correspond to microscale features. In brown a subregion B where the macro stress field and the microscale information are available, and in green the ROI where the micro Tresca stress can be calculated.

In our case we assume a homogeneous distribution of circular pores as microscale features. Nevertheless, our surrogate model is completely agnostic to the shape of the geometric features, and this information is only available to it through the training examples.
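For reference, the principal stresses and the Tresca-type indicator of [Eq. 6] can be evaluated directly from the three in-plane stress components. The following is only an illustrative sketch (the function name tresca_from_stress is ours, not part of the paper's code):

    import numpy as np

    def tresca_from_stress(s_xx, s_yy, t_xy):
        """Principal stresses and Tresca-type indicator (Eq. 6) from the
        in-plane stress components. Inputs may be scalars or arrays."""
        center = 0.5 * (s_xx + s_yy)                     # (σ_xx + σ_yy) / 2
        radius = np.sqrt((0.5 * (s_xx - s_yy)) ** 2 + t_xy ** 2)
        s_max, s_min = center + radius, center - radius  # principal stresses
        return s_max, s_min, s_max                       # σ_max is the fracture indicator

    # example: uniaxial tension σ_xx = 1 gives σ_max = 1 and σ_min = 0
    print(tresca_from_stress(1.0, 0.0, 0.0))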
One strategy is to work on the entire domain, as described in [Zhang et al., 2020; Sasaki and Igarashi, 2019] for topology optimisation. Unfortunately, in our problem this strategy suffers from high computational requirements, both for training the NN and for creating enough data to train on. Most importantly, it would result in poor generalisation, as the NN would unavoidably try to learn the specific macrostructures present in the dataset.

Another strategy, better suited to our case, is first to solve for the stress in the whole domain and then to divide the domain into squares that we call patches, as can be seen in [Fig. 3]. We can then consider each patch as a training example instead of each domain. This strategy reduces the computational requirements, because we can extract many patches from a single domain and the NN has inputs of smaller size. Last but not least, this strategy results in better generalisation, because the NN is encouraged to learn how the microstructural features affect the global stress field instead of learning the specific structures present in the training dataset.

As already discussed, according to Saint-Venant's principle we cannot fully predict the micro stress in the whole patch but only in a ROI far from its boundaries. Consequently, we need to provide to the NN all the available information in the patch (macro stress field and microscale features), but we can only ask for a micro stress prediction in the ROI. Specifically, the input of our model is the full macro stress tensor in the patch along with the microscale features, while the output is the Tresca stress in the ROI.

Figure 3: On the left the original structure, in the middle the red squares correspond to patches and on the right some extracted patches.
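For illustration, the patch extraction described above can be implemented as a simple sliding-window crop of the macro stress fields and of the geometry image. This is only a sketch under our own assumptions about array layout (2D field stored as [N_y, N_x], patch size and stride in pixels); it is not the authors' data pipeline:

    import numpy as np

    def extract_patches(field, patch, stride):
        """Crop square patches of size `patch` from a 2D field, moving the
        window by `stride` pixels in each direction."""
        ny, nx = field.shape
        patches = []
        for i in range(0, ny - patch + 1, stride):
            for j in range(0, nx - patch + 1, stride):
                patches.append(field[i:i + patch, j:j + patch])
        return np.stack(patches)

    # example: a 288x288 stress field split into 72x72 patches with stride 32
    sigma_xx = np.random.rand(288, 288)
    print(extract_patches(sigma_xx, patch=72, stride=32).shape)  # (49, 72, 72)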
For the purpose of training our model we assume a distribution of elliptical pores as macroscale features. We consider all the microscale features as circles with the same radius R. We assume that at a distance larger than 4 radii from the centre of a microscale feature the micro effect on the global stress field is negligible; for instance, in the case of an infinite plate under uni-axial loading, the maximum stress at r = 4R is 1.04 times the macro stress [Pilkey and Pilkey, 2008]. Based on that, the defect length is 2R and the interaction length is equal to 3R. Given those two parameters we conclude that the patch length should be 18R and the ROI should be an [8R × 8R] window in the middle of the patch, as shown in [Fig 4].

The boundary conditions are applied to a buffer area where the mesh is much coarser, as can be seen in [Fig. 5]. The buffer area allows us to apply boundary conditions without introducing boundary effects in the fine mesh area. Additionally, because the mesh in the buffer area is very coarse, the computational cost remains practically the same. We apply displacements as boundary conditions [Eq. 7]:

u = [ E_xx  E_xy ;  E_xy  E_yy ] (X − X₀)                (7)

where E_xx is the far-field displacement along the x direction, E_xy is the far-field displacement along the xy direction, E_yy is the far-field displacement along the y direction, X is the position of a point in R² and X₀ is the initial position of the centre of the body in R².

Figure 4: A sketch of the patch. In gray the patch, in green the region of interest and in blue the circular defects.

The input of the CNN is a 4D array of size [N_D × N_x × N_y × N_C], where N_D is the number of data points, N_x and N_y are the sizes of the input image along the x and y directions respectively, and N_C is the number of channels of every data point. Each data point has 4 channels, namely σ_xx, τ_xy, σ_yy and Geometry, corresponding to the xx, xy, yy components of the macro stress tensor and a binary image of the geometry respectively. We chose R = 4 units, so the input array is of size [N_D × 72 × 72 × 4]. The output of the CNN is an [N_x × N_y] image corresponding to the micro Tresca stress, so here an image of size [32 × 32], i.e. the [8R × 8R] window in the middle of the patch, as shown in [Fig 4].
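As a concrete illustration of the input layout just described, the sketch below stacks the three macro stress channels and the binary geometry mask of one patch into a single [72 × 72 × 4] array. The channel order and the function name are our own assumptions, not prescribed by the paper:

    import numpy as np

    def build_input_patch(sigma_xx, tau_xy, sigma_yy, geometry_mask):
        """Stack the three macro stress components and the binary geometry
        image of a 72x72 patch into one [72, 72, 4] CNN input."""
        assert sigma_xx.shape == (72, 72)
        return np.stack([sigma_xx, tau_xy, sigma_yy,
                         geometry_mask.astype(np.float32)], axis=-1)

    patch = build_input_patch(np.zeros((72, 72)), np.zeros((72, 72)),
                              np.zeros((72, 72)), np.ones((72, 72)))
    print(patch.shape)  # (72, 72, 4); a batch would then be [N_D, 72, 72, 4]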
Because we want to identify the effect of micro-scale features on the macro-scale stress, we scale the output by a number that reflects the intensity of the macro stress field. This number is the sum of the principal stresses of the macro stress tensor, σ_max + σ_min from [Eq. 6]. The micro stress in areas away from micro-scale features should be the same as the macro-scale stress, because these features only have a local effect. This suggests that the output should be constant away from the micro-scale features and change rapidly very close to them. That is clearly visible in [Fig. 6].

Differences in the scales across input variables may increase the difficulty of the problem being modelled, for example by making it harder for the optimizer to converge to a local minimum or by causing unstable behaviour of the network, so a standard practice is to pre-process the input data, usually with a simple linear rescaling [Bishop, 1995]. In our case we scale the data not only to improve the model but also to restrict the space we have to explore. The space that we would have to cover is infinite, because the input can take any real value. Fortunately, because we chose to model linear elasticity problems, we can scale the input stress tensor by any value and the output micro Tresca field will be scaled by exactly the same number. From [Eq. 6] it is trivial to show that if we replace σ_xx, τ_xy, σ_yy by k·σ_xx, k·τ_xy, k·σ_yy, where k is some scaling factor, then σ′_max = k·σ_max, where σ′_max is the new Tresca stress after scaling the input. Here k⁻¹ is the maximum stress value present in all three stress components. This scaling of the input values to the range [0, 1] allows us to make predictions on input data of any possible scale: we just have to calculate k, multiply the input by it to transfer it to the desired scale, and then multiply the output by k⁻¹ to recover the true output.
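A minimal sketch of this linearity-based normalisation, assuming the patch layout introduced above (function names are ours):

    import numpy as np

    def normalise_patch(stress_channels):
        """Scale the three macro stress channels so that their largest
        absolute value becomes 1, i.e. k = 1 / max |σ|; return scaled input and k."""
        k = 1.0 / np.max(np.abs(stress_channels))
        return stress_channels * k, k

    # usage: predict on the scaled input, then undo the scaling on the output
    stress = np.random.uniform(-250.0, 250.0, size=(3, 72, 72))
    scaled, k = normalise_patch(stress)
    # micro_tresca = model_prediction_on(scaled) / k   # follows from the linearity of Eq. 6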
Figure 5: On the left a sketch of the buffer zone in gray and on the right an example where the mesh is visible.

Figure 6: On the left the input of the CNN and on the right the output.

Training Deep Neural Networks is complex for a number of reasons. A common reason is that the distribution of each layer's inputs changes during training as the parameters of the previous layers change; this phenomenon is known as internal covariate shift. It slows down training by requiring lower learning rates and careful parameter initialization [Ioffe and Szegedy, 2015]. Batch normalization (BN) aims at reaching a stable distribution of activation values throughout training [Ioffe and Szegedy, 2015; Santurkar et al., 2019]. To achieve that, BN normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. After the normalization, BN scales and shifts the normalized output by adding two trainable parameters to each layer: BN multiplies the normalized output by a "standard deviation" parameter γ and then adds a "mean" parameter β. This scaling and shifting restores the representation power of the network; the network could even recover the original activations if that were the optimal thing to do [Ioffe and Szegedy, 2015; Santurkar et al., 2019]. Batch normalization standardizes only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change [Goodfellow et al., 2016].

Reduction of the internal covariate shift [Ioffe and Szegedy, 2015] is one possible reason why BN works. The other popular theory is that BN makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behaviour of the gradients, allowing for faster training [Santurkar et al., 2019]. Nevertheless, both agree that BN enables faster (larger learning rates) and more stable training of DNNs.

There is also a discussion about whether BN should be used with dropout or not. According to [Ioffe and Szegedy, 2015; Goodfellow et al., 2016], BN reduces generalization error and allows dropout to be omitted, due to the noise in the estimate of the statistics used to normalize each variable. [Chen et al., 2019] show that the simultaneous use of BN and dropout can actually reduce accuracy. However, they state that this is not true if dropout is used after BN. [Garbin et al., 2020; Li et al., 2018a] agree that both can be used together if dropout is used after the BN layer, and [Li et al., 2018a] actually proposed a modification to the architecture of well-studied Neural Networks to take advantage of the combined effect of BN and dropout. Lastly, everyone seems to agree that the BN layer has to be used before the activation function.

In theory, the more layers a network has, the better it should perform. Nevertheless, that is not true in practice, and training very deep NNs is very challenging due to problems like vanishing gradients, which can prevent the NN from learning even simple functions like the identity mapping between input and output [Sussillo and Abbott, 2015; Hochreiter et al., 2001].
The usual way to train very deep NNs is through residual blocks [He et al., 2015; Zagoruyko and Komodakis, 2017; Lim et al., 2017; Kim et al., 2016]. With residual blocks the NN itself can choose its depth by skipping the training of a few layers using skip connections. As we can see from [Fig. 7], even if the NN chooses to ignore some layers, F(X) = 0, it will still map the input to the output, since the output is F(X) + X = X. This way we can use a large number of residual blocks and the network will simply ignore the ones it does not need. The name residual comes from the fact that the network tries to learn the residual F(X), or in other words the difference between the true output, F(X) + X, and the input X.

Figure 7: Structure of a generic residual block with input X and output F(X) + X.

In the residual blocks of our CNN we are also using another kind of block, the Squeeze-and-Excitation (SE) block. The SE block can adaptively recalibrate channel-wise feature responses by explicitly modelling interdependencies between channels, resulting in improved generalization across datasets and improved performance [Hu et al., 2019; Cheng et al., 2018; Li et al., 2018b]. The input of the SE block has C channels, height H and width W, [H × W × C]. The input is reduced in size using a global average pooling layer, resulting in a linear array of size [1 × C]. After that, two fully connected layers downsample and then upsample the linear array: first the linear array is downsampled by a factor of 16 to [1 × C/16], then upsampled back to [1 × C/16 · 16] = [1 × C], and in the end a Sigmoid activation function is applied. Lastly, the linear array is reshaped to size [1 × 1 × C] and multiplied with the input of the SE block [Fig. 8].

Figure 8: Structure of a generic SE block with input e and output G(e) × e.

The residual blocks we use in this work consist of two convolution layers, each followed by a BN layer and a ReLU activation function, and a SE block at the end [Fig. 9]. The input and output of this block have exactly the same size, as we choose the number of filters of the convolution layers to be the same as the number of filters at the input of the residual block.

Figure 9: Structure of the residual block we are using, with input u and output F(u) + u.

The architecture of the network is inspired by "StressNet", proposed by [Nie et al., 2019]. Three convolution layers with an increasing number of filters downsample the input; after that, five residual blocks are applied to the resulting array before 3 deconvolution layers with a decreasing number of filters upsample it back to the original dimension, but with 1 channel instead of 4 [Fig. 10].

Figure 10: Structure of the CNN. N is the number of data. σ_xx, σ_yy and τ_xy are the stress components in the x, y and xy directions respectively.
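To make the block structure concrete, here is a minimal PyTorch sketch of an SE block and of the residual block of [Fig. 9]: two Conv → BN → ReLU stages followed by an SE block and a skip connection. The channel reduction factor of 16 and the BN-before-activation ordering follow the text; everything else (names, kernel size 3) is our own illustrative choice, not the authors' exact implementation:

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)              # [H x W x C] -> [1 x 1 x C]
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),  # downsample by 16
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),  # upsample back to C
                nn.Sigmoid())

        def forward(self, x):
            b, c, _, _ = x.shape
            w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
            return x * w                                     # channel-wise recalibration

    class ResidualSEBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            conv = lambda: nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),                    # BN placed before the activation
                nn.ReLU(inplace=True))
            self.body = nn.Sequential(conv(), conv(), SEBlock(channels))

        def forward(self, x):
            return x + self.body(x)                          # skip connection: F(x) + x

    print(ResidualSEBlock(32)(torch.randn(1, 32, 18, 18)).shape)

The full encoder-decoder would then stack three downsampling convolutions, five such residual blocks and three deconvolutions, as described for [Fig. 10].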
In our effort to include uncertainty information in our predictions we deploy a Bayesian framework, first described by [Blundell et al., 2015], to introduce uncertainty in the weights of the network. To achieve that, we replace the constant weights of a plain neural network with a distribution over each weight, as seen in [Fig. 11]. The output of this probabilistic model, for an input x ∈ Rⁿ and a set of possible outputs y ∈ R, is a probability distribution over all possible outputs, P(y|x, w).

Figure 11: On the left a sketch of a plain Neural Network with constant weights and on the right a Bayesian Neural Network where the weights are replaced by distributions.

The distribution of the weights before observing the data is called the prior distribution, P(w), and it incorporates our prior beliefs about the weights. The goal is to calculate the posterior, the distribution of the weights after observing the data, because during training and of course inference the weights of the network are sampled from the posterior. Bayesian inference can calculate the posterior distribution of the weights given the training data, P(w|D). Unfortunately, the posterior is intractable for NNs, but it can be approximated by a variational distribution q(w|θ) [Hinton and van Camp, 1993; Graves, 2011], parameterised by θ. Variational learning finds the parameters θ_opt that minimise the Kullback-Leibler (KL) divergence between the approximate posterior and the true Bayesian posterior. This is how the loss function is defined [Eq. 8]. The first term of the loss is the KL divergence between the approximate posterior and the prior. The prior clearly introduces a regularisation effect, because the KL divergence penalises complexity by forcing the approximate posterior to be close to the prior. The second part is the negative log likelihood. This is a data-dependent term and it forces the network to fit the data.

θ_opt = arg min_θ KL[ q(w|θ) ‖ P(w|D) ]
      = arg min_θ ∫ q(w|θ) log [ q(w|θ) / ( P(w) P(D|w) ) ] dw
      = arg min_θ KL[ q(w|θ) ‖ P(w) ] − E_{q(w|θ)}[ log P(D|w) ]          (8)

Here we consider the approximate posterior to be a fully factorised Gaussian [Graves, 2011]. During one forward pass we sample the weights from the posterior, but during backpropagation we would have to define the gradient of the loss with respect to this sampling procedure, which is of course not possible. Instead we use the reparameterization trick [Kingma and Welling, 2014]; this procedure is well described in [Blundell et al., 2015]. Instead of a parameter-free operation (sampling), we obtain the weights of the posterior by sampling a unit Gaussian, shifting it by a mean μ and scaling it by a standard deviation σ. This standard deviation is parameterised as σ = log(1 + exp(ρ)) and is thus always positive. So the weights are sampled according to the following scheme: w = μ + log(1 + exp(ρ)) · ε, where ε is drawn from a unit Gaussian, and the variational parameters to be optimised are θ = (μ, ρ).
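A compact sketch of one mean-field Bayesian layer under this scheme: the weight posterior is a factorised Gaussian with parameters (μ, ρ), weights are drawn with the reparameterization trick, and the layer exposes its KL term against a unit Gaussian prior. This is a simplified illustration (fixed N(0, 1) prior, no bias, our own names), not the exact implementation used in the paper:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BayesianLinear(nn.Module):
        def __init__(self, n_in, n_out):
            super().__init__()
            self.mu = nn.Parameter(torch.zeros(n_out, n_in))
            self.rho = nn.Parameter(torch.full((n_out, n_in), -3.0))

        def forward(self, x):
            sigma = torch.log1p(torch.exp(self.rho))       # σ = log(1 + exp(ρ)) > 0
            eps = torch.randn_like(sigma)                  # unit Gaussian sample
            w = self.mu + sigma * eps                      # reparameterised weight
            # KL divergence between N(μ, σ²) and the unit Gaussian prior N(0, 1)
            self.kl = (0.5 * (sigma ** 2 + self.mu ** 2 - 1.0)
                       - torch.log(sigma)).sum()
            return F.linear(x, w)

    layer = BayesianLinear(8, 1)
    y = layer(torch.randn(4, 8))
    target = torch.zeros(4, 1)
    loss = layer.kl + ((y - target) ** 2).sum()            # KL term + squared-loss likelihood
    loss.backward()

The network loss of [Eq. 8] is then the sum of such KL terms over all Bayesian layers plus the negative log likelihood of the data, and the predictive mean and variance are obtained by repeating the stochastic forward pass several times.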
The architecture remains the same, but we replace all the convolutional and dense layers with the respective Bayesian layers. We use a Gaussian as the predictive distribution, P(y|x, w), corresponding to a squared loss. The prior is a single Gaussian per layer, corresponding to L2 regularisation. During backpropagation we also optimise the prior, by considering the gradients of the loss not only with respect to the posterior but also with respect to the prior parameters. The mean prediction and the variance are calculated by passing the same input through the network multiple times.

In this work we train a NN in a supervised way. This means that we need to provide labelled data during training. Labelled data contain both the input and the output of the NN, while unlabelled data contain only the input. Usually, and in our work as well, labelled data are considerably more expensive to create than unlabelled data.

A selective learning process, in a supervised learning framework, assumes that a large pool of unlabelled data is available while there is a very expensive function that labels these data. The aim of selective learning is to identify which of the unlabelled data contain useful information, so that only these are labelled. To achieve that, an acquisition function needs to be formulated that is able to identify the useful data. [Tsymbalov et al., 2018] suggest that the uncertainty extracted from a Bayesian Neural Network is a sensible acquisition function for this task. This is also intuitively a sensible conclusion, because high uncertainty in the prediction of the network means that the input is far away from the training data distribution.
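The loop just described can be summarised in a few lines. Here `model`, `label_with_fea` (the expensive FE labelling function) and `predictive_std` (the Monte Carlo standard deviation over repeated stochastic forward passes) are placeholders for the components discussed in the text, so this is an illustrative sketch rather than the authors' code:

    import numpy as np

    def selective_learning(model, labelled, unlabelled, label_with_fea,
                           predictive_std, query_rate=500, n_rounds=5):
        """Iteratively label the unlabelled patches with the highest
        predictive uncertainty and retrain on the growing labelled set."""
        for _ in range(n_rounds):
            if len(unlabelled) == 0:
                break
            model.fit(labelled)                              # retrain on the current labelled set
            std = np.array([predictive_std(model, x) for x in unlabelled])
            picked = set(np.argsort(std)[-query_rate:])      # max-uncertainty acquisition
            labelled += [label_with_fea(unlabelled[i]) for i in picked]
            unlabelled = [x for i, x in enumerate(unlabelled) if i not in picked]
        return model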
Firstly, we created an initial dataset with very simple examples [Fig 12]: a single ellipse in the middle plays the role of the macroscale feature, creating a diverse macro stress field, and a few defects are randomly positioned around the ellipse, accounting for the micro-scale features that affect the macro stress field. All the defects have a circular shape and the same radius, R = 4 units. From almost 500 examples we extracted about 33.000 patches, 5.000 of which were used as a validation set.

Figure 12: Nine random examples from the initial dataset.

Experiments on this dataset showed very positive results. Training with 28.000 data points and validating on 5.000 unseen data points resulted in a validation accuracy of 96% when training with the Adam optimizer for 600 epochs, which required about 6 hours on an NVIDIA T4 GPU. The concept of accuracy in a regression task with images needs to be discussed. The process followed to define accuracy is summarised in [Algorithm 1] and described in more detail in the following passage. First of all, we take the max of each prediction; this is because we are primarily interested in the max values, as these indicate whether the material will fail or not. Next, we define an error metric between the real max value, y_FE, and the max value of our prediction, y_NN. Here we use the relative error, defined as: relative error = |y_NN − y_FE| / y_FE. Finally, we need to set a threshold for the acceptable error; in this case we use 10%. To sum up, 96% validation accuracy means that in the validation set 96% of the max values were predicted with a relative error of less than 10%. Someone could argue that this 10% threshold is arbitrarily chosen and that it should be more application specific, because different applications have different error requirements. For this reason we have constructed a diagram that shows the accuracy as a function of the threshold [Fig. 13]. We present results for 2 random patches [Fig. 14] and then a result on the whole structure [Fig. 15]. The prediction happens again at the patch level, but then the original image is reconstructed. This is possible if we align one corner of the ROI with a corner of the image and use a sliding window equal to the size of the ROI, as can be seen in [Fig. 16]. We can see that in all cases the CNN was able to accurately reconstruct the full micro stress field, and it was also able to predict the max values with a very small error. More specifically, it is clear that away from the micro-scale features the micro-scale field is constant. Also, we can see that very close to the defects there is a very steep rise of the micro stress field. The orientation and the shape of the micro stress field are accurately predicted even in complicated cases where more than one defect is interacting, or where defects and macroscale features are interacting.
Algorithm 1
Calculate accuracy

    N = length(datapoints)
    accepted = zeros(N)
    for point in datapoints do
        y_NN = max(prediction[point])
        y_FE = max(ground_truth[point])
        error = |y_NN − y_FE| / y_FE
        if error ≤ threshold then
            accepted[point] = 1
    accuracy = sum(accepted) / N
    return accuracy

Figure 13: Accuracy as a function of the threshold.

We also wanted to investigate the effect of training with less data. We randomly chose 10.000 data points, almost 30% of the available data, and trained the NN with exactly the same settings; an example can be found in [Fig. 17]. We noticed that for the 10% threshold the accuracy is only 6% higher for the large dataset, even though we used almost 3 times more data. This implies that after a specific point in training, most of the data from this dataset do not really contain any new information. This highlights the need to use selective learning in order to only label (and train on) the useful data.

Figure 14: Prediction of the network on 2 patches. On the top left of each example we see the FE result for the whole patch and on the top right the NN prediction for the whole patch. On the second row we see the same but for the ROI.

Figure 15: Comparison between the FE solution, on the left, and the NN prediction, on the right.
Even though the CNN we trained seems to work well for the data it was trained on, we believe that it can only be used for simple cases and is not suitable for complicated structures like the femur. To tackle this problem we created a new, more interesting family of data, with the expectation that this would add more complexity [Fig. 18]. We first used the old CNN to make predictions on the new dataset. We observed that the accuracy dropped from 96% to 72%. This implies two things. Firstly, the drop in accuracy means that the new dataset contains information that the network had never seen before, thus training on this dataset will make the CNN generalize better. Secondly, the concept of making the knowledge transferable seems to be working, as we were able to make reasonable, but not perfect, predictions on a new family of data. This suggests that we managed to learn interactions between micro-scale features and the macro stress field and not just the structures themselves.

Figure 16: Patch generation for full image prediction. The corner of the ROI is aligned to the corner of the image and then a sliding window of size equal to the size of the ROI is used.

Figure 17: A prediction for the same input for 2 identical NNs trained with 28.000 (right) and 10.000 (left) data points.

Training a CNN with the new dataset proved to be more challenging. By using 23.000 training data points (almost as many as in the original case) and 5.000 validation data points we obtained, with the same settings, a validation accuracy of 74%, in contrast to the 96% in the first case. From experiments we found that as more and more new data points are added, the accuracy tends to increase more and more slowly. This happens because the new data points tend to contain less and less new information. As discussed before, we can use rotation as a data augmentation technique, by rotating mechanically the stress and "physically" the images. We started from an initial training set of 5.000 data points and, by applying 12 rotations as data augmentation, were able to reach an accuracy of 82%.

Bayesian Neural Network

Until now we have used a deterministic neural network for the predictions. In this section we present some results from the Bayesian Neural Network. We trained the BNN with the same 5.000 data points as earlier for 600 epochs and validated on 10.000 data points. This requires almost twice as much time compared to the deterministic case. The accuracy of the prediction is 72% for the 10% threshold, compared to 62% for the deterministic case.

The mean and the variance of the NN prediction are calculated by passing the input 100 times through the network. The results can be found in [Fig 22]. We can see from the first image that the mean prediction is very close to the real value. We can also observe that for higher values we get higher error, something we expected because those cases are more challenging. In the second image we can see that the error between the real maximum value in the ROI and the predicted one is almost always, and specifically for 92% of the data points, inside the 95% credible interval. This is a very positive result, as our objective is to always have the real maximum value inside the 95% credible interval. Lastly, we can see that the error tends to increase with the variance, which justifies our choice to use the BNN's variance as the acquisition function for the Selective Learning framework.

Results from the uncertainty estimation at image level can be found in [Fig. 23]. In the top 2 cases we can see some examples of very good mean predictions where there is nevertheless relatively high uncertainty.
This is justified by the fact that there is high error as well; it just happens that this error is not in high-stress regions, so it does not affect the maximum values. We can observe that the uncertainty, expressed as the 95% CI, is higher close to the high-error pixels, indicating that the BNN has successfully identified the unseen interactions. The bottom left image is an example of a case where the maximum value is mispredicted with a large error of about 1 unit. Fortunately, we can observe that the uncertainty is also very large; specifically the 95% CI has a value of about 1.5 units, meaning that the true maximum value lies between the mean prediction and the 95% CI. The bottom right image is an example of a case with low uncertainty and low error. This means that the CIs are very tight and the BNN is very confident about the prediction. That was an expected result, in the sense that this is a very simple case and we would expect the BNN to handle it without a problem.

Figure 22: On the left a diagram showing the relationship between the NN's mean prediction and the FE results for the maximum value in the ROI. On the right a diagram showing the relationship between the prediction error and the 95% Credible Interval for the prediction. In both images the black line corresponds to the y = x line.

Figure 23: Four random examples of the BNN prediction. For each example the top left image is the absolute error between the FE solution and the mean prediction of the BNN. The top right image is the uncertainty of the BNN, expressed as the 95% CI. The bottom left image is the FE prediction and the bottom right is the mean prediction from the BNN.

In this last section we investigate the idea of Selective Learning to reduce the labelled data requirements for training the BNN. The principles of this framework are described below. We have an initial dataset with labelled data and a bigger dataset with only unlabelled data. We train on the initial dataset and then use an acquisition function to select small batches of data from the unlabelled dataset to label and train on. After training, when the new information is incorporated into the BNN, we repeat the same process until we reach the desired accuracy or label the entire unlabelled set.

We designed a small experiment to validate our approach, inspired by [Gal et al., 2017]. We used the following setup: 2.500 data points for the initial set, 2.500 data points as the unlabelled set and 11.000 data points as the validation set. We trained each network for 50 epochs, performed 50 forward passes for the uncertainty estimation and added 500 data points to the labelled set at each iteration (query rate = 500). The accuracy is calculated from the mean prediction of the network. Here we made a comparison between the max-uncertainty acquisition function, which chooses first the points with the highest uncertainty, and a random acquisition function that chooses data points randomly. For the random selection approach we repeated the experiments 5 times and present the mean and the 95% confidence interval. The results can be found in [Fig. 24]. We can observe that the results produced by the max-uncertainty acquisition function are consistently better, presenting higher accuracy. More specifically, with this unlabelled dataset we can reach an accuracy of about 75%. This can be achieved using 1.500 points with the max-uncertainty acquisition function, but requires all 2.500 if we choose points randomly. This means that we reduced the labelled data requirement by 40%.
From that wecan conclude that the selective framework is working.
Figure 24: Results of Selective Learning for a small problem: 50 epochs per training, 50 forward passes for the uncertainty quantification and 500 data points added in each iteration. The orange line corresponds to the max-uncertainty acquisition function and the blue one to the random acquisition function.

Now we use a larger unlabelled dataset consisting of 10.000 data points. We again compare the max-uncertainty acquisition function with a random acquisition function. The initial training set has 5.000 data points, we train every network for 150 epochs, we perform 100 forward passes for the uncertainty quantification and we label 2.000 unlabelled data points in every iteration (query rate = 2.000). The results can be found in [Fig. 25]. We can again observe the accuracy increasing faster in the case of the max-uncertainty acquisition function, and the loss function also decreases faster until it reaches 6.000 new data points. At this point the accuracy practically stops increasing and the loss gradually approaches the same value as with the random acquisition function. Using the max-uncertainty acquisition function we can reach the maximum accuracy using 6.000 data points, while we need all 10.000 data points when randomly choosing new data. Again we have a decrease of 40% in the labelled data requirement.

This time we want to perform a similar experiment, but we are interested in examining the effect of the query rate on the results. Specifically, we use an initial set of 5.000 data points and perform Selective Learning on an unlabelled dataset of 4.000 data points. We repeat the experiment 3 times, with query rates of 500, 1.000 and 2.000. A similar experiment was conducted by [Islam, 2016], who concluded that using very small query rates results in sub-optimal performance, higher simulation times and noisy behaviour. There are 2 reasons why the results are worse in this case. Firstly, adding just a few data points compared to the size of the initial dataset might result in overfitting, and secondly, these data points might get smoothed out in the loss function. The simulation time increases because the network needs to be retrained a considerable number of times. On the other hand, using too large query rates also leads to worse results, because the weights of the network are not updated frequently enough, so new information is rarely incorporated into the network and we end up again labelling and training on data points that do not contain new information. The results of our experiment can be found in [Fig. 26]. We have reached the same conclusions: when the query rate is 1.000 we observe the optimal behaviour, when we double it we observe slower convergence, and when we use a small query rate we observe noisy, sub-optimal behaviour.

Figure 25: Demonstration of the selective learning framework for an unlabelled set of size 10.000 data points. On the left a diagram depicting the accuracy and on the right a diagram depicting the loss.

Figure 26: Selective learning with query rates of different sizes.

After validating the Selective Learning framework, we now use it without the random acquisition function as a baseline. We use all the 30.000 available data points to train the network. As the initial set we again use 5.000 data points. We query 5.000 unlabelled data points at each iteration, chosen by the max-uncertainty acquisition function.
We train for 300 epochs and perform 100 forward passes for the uncertainty quantification. The results can be found in [Fig. 27]. It is clear that the accuracy does not improve after the third iteration (15.000 data points), but we continued labelling points for demonstration purposes only. The mean squared error decreases for the first 3 iterations and then stops decreasing as well. In this example we could reach the maximum accuracy using 15.000 out of the 30.000 data points, so we managed to reduce the labelled data requirements by 50%.

Figure 27: Accuracy and mean squared error plots with respect to the training data chosen by the max-uncertainty acquisition function. The blue line corresponds to the accuracy and the orange one to the MSE.

Lastly, we want to test the BNN on data outside of the training set. Specifically, we use ellipses as defects. Ellipses have a similar shape to circles and not very different behaviour. Neural Networks extrapolate when they make predictions outside of the dataset, and they are notoriously bad at it. What we are hoping for is that the BNN will understand that the ellipses are not in the dataset and will assign high variance to most of the patches. We created 500 patches and made a prediction with the previously trained BNN. The results can be found in [Fig. 28]. From the first plot we can see that the mean prediction of the BNN for the max values in the patch is not close to the real max value for a big percentage of the data, i.e. the accuracy is low. Nevertheless, the real value lies inside the prediction's 95% Credible Interval for 80% of these cases.
Figure 28: On the left a diagram showing the relationship between the BNN's mean prediction and the FE results for the max value in the ROI. On the right a diagram showing the relationship between the prediction error and the 95% Credible Interval for the prediction. In both images the black line corresponds to the y = x line.

Figure 29: Two examples where the error in max values is large and, even though the 95% CIs are very broad, they cannot capture it.
Figure 30: Two examples where the error in max values is large but inside the 95% CI.
Figure 31: Two examples where the error in max values is small. We can observe high error in other areas, but this error is captured by the uncertainty of the BNN.
Conclusions
The goal of this work was to use a CNN to identify the effect of microscale features on the global stress field. We aimed for a Bayesian approach that would provide not a point estimate but credible intervals for the prediction. We successfully trained a plain CNN on a simple dataset, achieving 96% validation accuracy using 28.000 data points. We proceeded to a more advanced dataset with more interesting interactions and found that training with 23.000 data points only resulted in a 72% accuracy. We used rotation as a data augmentation technique in order to acquire more data at minimal computational cost; with this approach, using only 5.000 data points and applying 12 rotations, we were able to achieve an accuracy of 82%. Finally, we deployed a Bayesian Neural Network, able to provide uncertainty information for the predictions. The mean prediction of this network fit the data very well, achieving a 72% validation accuracy using again 5.000 data points. The true result was inside the 95% Credible Interval for 92% of the data points, and the uncertainty was higher close to the high-error pixels. This suggests that the BNN is able to efficiently quantify the uncertainty of the prediction and provide high-quality uncertainty estimates that can be used in a Selective Learning framework. We demonstrated the advantages of the Selective Learning framework twice, by comparing a random acquisition function with a max-uncertainty acquisition function and showing that the latter results in a 40% reduction in the labelled data requirement. We also examined the effect of the query rate on the efficiency of the Selective Learning process and concluded that too small or too large query rates should be avoided. Furthermore, we used Selective Learning to train the BNN with all the available data, and we reached an accuracy of 84% using 50% less data. Lastly, we tested the limits of our BNN by making predictions on points outside the training data distribution. The accuracy of course dropped, but the important conclusion is that the network was able to efficiently quantify the uncertainty on the unseen cases: the real value was inside the prediction's 95% CI for 80% of the cases.

Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 764644. This paper only contains the authors' views; the Research Executive Agency and the Commission are not responsible for any use that may be made of the information it contains.
References
Aladem, M. and Rawashdeh, S. A. (2020). A single-stream segmentation and depth prediction CNN for autonomous driving. IEEE Intelligent Systems, pages 1–1.

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, Inc., USA.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv: 1505.05424.

Chen, G., Chen, P., Shi, Y., Hsieh, C.-Y., Liao, B., and Zhang, S. (2019). Rethinking the usage of batch normalization and dropout in the training of deep neural networks. arXiv: 1905.05928.

Cheng, X., Li, X., Yang, J., and Tai, Y. (2018). SESR: Single image super resolution with recursive squeeze and excitation networks. pages 147–152.

Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv: 1506.02142.

Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep Bayesian active learning with image data. arXiv: 1703.02910.

Garbin, C., Zhu, X., and Marques, O. (2020). Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimedia Tools and Applications, 79:1–39.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.

Graves, A. (2011). Practical variational inference for neural networks. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. arXiv: 1512.03385.

Hinton, G. E. and van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 16th Annual Conference On Learning Theory (COLT).

Hochreiter, S., Bengio, Y., and Frasconi, P. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kolen, J. and Kremer, S., editors, Field Guide to Dynamical Recurrent Networks. IEEE Press.

Holub, A., Perona, P., and Burl, M. C. (2008). Entropy-based active learning for object recognition. In CVPR Workshops, pages 1–8.

Hu, J., Shen, L., Albanie, S., Sun, G., and Wu, E. (2019). Squeeze-and-excitation networks. arXiv: 1709.01507.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv: 1502.03167.

Islam, R. (2016). Active learning for high dimensional inputs using Bayesian convolutional neural networks.

Jiang, H., Nie, Z., Yeo, R., Farimani, A. B., and Kara, L. B. (2020). StressGAN: A generative deep learning model for 2D stress distribution prediction. arXiv: 2006.11376.

Joshi, A., Porikli, F., and Papanikolopoulos, N. (2009). Multi-class active learning for image classification. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pages 2372–2379. IEEE Computer Society.

Kim, J., Lee, J. K., and Lee, K. M. (2016). Deeply-recursive convolutional network for image super-resolution. arXiv: 1511.04491.

Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. arXiv: 1312.6114.

Li, X., Chen, S., Hu, X., and Yang, J. (2018a). Understanding the disharmony between dropout and batch normalization by variance shift. arXiv: 1801.05134.

Li, X. and Guo, Y. (2013). Adaptive active learning for image classification. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '13, pages 859–866, USA. IEEE Computer Society.

Li, X., Wu, J., Lin, Z., Liu, H., and Zha, H. (2018b). Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV).

Liang, L., Minliang, L., Caitlin, M., and Wei, S. (2018). A deep learning approach to estimate stress distribution: a fast and accurate surrogate of finite-element analysis. Journal of the Royal Society, Interface, 15:138.

Lim, B., Son, S., Kim, H., Nah, S., and Lee, K. M. (2017). Enhanced deep residual networks for single image super-resolution. arXiv: 1707.02921.

Meister, F., Passerini, T., Mihalef, V., Tuysuzoglu, A., Maier, A., and Mansi, T. (2018). Towards fast biomechanical modeling of soft tissue using neural networks. arXiv: 1812.06186.

Mendizabal, A., Márquez-Neila, P., and Cotin, S. (2019). Simulation of hyperelastic materials in real-time using deep learning. CoRR, abs/1904.06197. arXiv: 1904.06197.

Nie, Z., Jiang, H., and Kara, L. B. (2019). Stress field prediction in cantilevered structures using convolutional neural networks. Journal of Computing and Information Science in Engineering, 20(1).

Pilkey, W. and Pilkey, D. (2008). Peterson's Stress Concentration Factors, Third Edition, pages 1–522.

Roewer-Despres, F., Khan, N., and Stavness, I. (2018). Towards finite-element simulation using deep learning.

Sanchez-Palencia, E. (1986). Homogenization in mechanics, a survey of solved and open problems. Rendiconti del Seminario Matematico, 44(1):1–45.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2019). How does batch normalization help optimization? arXiv: 1805.11604.

Sasaki, H. and Igarashi, H. (2019). Topology optimization accelerated by deep learning. IEEE Transactions on Magnetics, 55(6):1–5.

Sun, Y., Hanhan, I., Sangid, M. D., and Lin, G. (2020). Predicting mechanical properties from microstructure images in fiber-reinforced polymers using convolutional neural networks. arXiv: 2010.03675.

Sussillo, D. and Abbott, L. F. (2015). Random walk initialization for training very deep feedforward networks. arXiv: 1412.6558.

Tsymbalov, E., Panov, M., and Shapeev, A. (2018). Dropout-based active learning for regression. Analysis of Images, Social Networks and Texts, pages 247–258.

Wang, Y., Oyen, D., Guo, W., Mehta, A., Scott, C. B., Panda, N., Fernández-Godino, M. G., Srinivasan, G., and Yue, X. (2020). StressNet: Deep learning to predict stress with fracture propagation in brittle materials.

Wu, J., Yu, Y., Huang, C., and Yu, K. (2015). Deep multiple instance learning for image classification and auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Zagoruyko, S. and Komodakis, N. (2017). Wide residual networks. arXiv: 1605.07146.

Zhang, M., Li, W., and Du, Q. (2018). Diverse region-based CNN for hyperspectral image classification.