A parallel Fortran framework for neural networks and deep learning
Milan Curcic
University of Miami, Miami, FL
[email protected]
Abstract
This paper describes neural-fortran, a parallel Fortran framework for neural networks and deep learning. It features a simple interface to construct feed-forward neural networks of arbitrary structure and size, several activation functions, and stochastic gradient descent as the default optimization algorithm. neural-fortran also leverages the Fortran 2018 standard collective subroutines to achieve data-based parallelism on shared- or distributed-memory machines. First, I describe the implementation of neural networks with Fortran derived types, whole-array arithmetic, and collective sum and broadcast operations to achieve parallelism. Second, I demonstrate the use of neural-fortran in an example of recognizing hand-written digits from images. Finally, I evaluate the computational performance in both serial and parallel modes. Ease of use and computational performance are similar to an existing popular machine learning framework, making neural-fortran a viable candidate for further development and use in production.
Keywords
Machine learning, neural networks, parallel processing
1 Introduction

Machine learning has seen a tremendous increase in real-life applications over the past two decades, with rapid development of algorithms, software, and hardware to enable these applications. Some examples include image recognition [13], natural language processing [9], and prediction of non-linear systems such as markets [12] or turbulent fluid flows [15]. In contrast to classical computer programming, where the input data is processed with a predefined set of instructions to produce the output data, machine learning aims to infer the instructions from sets of input and output data. There are several widely used and battle-tested machine learning frameworks available, such as Tensorflow [1], Keras [3], PyTorch [17], or Scikit-learn [18]. Most high-level frameworks tend to be written in Python, with computational kernels written in C++. With Fortran being a high-performance, compiled, and natively parallel programming language, it is worth exploring whether Fortran is suitable for machine learning applications.

Why consider Fortran as a high-level language for machine learning applications? First, Fortran is a statically and strongly typed language, making it easier to write correct programs. Second, it provides strong support for whole-array arithmetic, making it suitable to concisely express arithmetic operations over large, multidimensional arrays. Third, Fortran is a natively parallel programming language, with syntax that allows expressing parallel patterns independently of the underlying hardware architecture, shared or distributed memory alike. The most recent iteration of the Fortran standard is Fortran 2018 [20], which brought improvements to interoperability with C, advanced parallel programming features such as teams and events, and miscellaneous additions to the language and the standard library.
These features, as well as the continued development by the Fortran Standards committee, provide a solid foundation for a Fortran framework for neural networks and deep learning.

Development of this framework was largely motivated by my own desire to learn how neural networks work inside out. In the process, I learned that there may be a multifold usefulness to such a library. First, it allows existing Fortran developers without machine learning experience to ease into the field without the friction that comes up when one needs to learn both a new programming language and a new concept. Second, it allows existing machine learning practitioners who are not familiar with modern Fortran to learn more about the language and its unique capabilities, such as whole-array arithmetic and native distributed parallelism. Finally, this framework can be used by scientists and engineers to integrate machine learning flows into existing Fortran software, such as those for civil engineering [7], computational chemistry [23], and numerical weather [19], ocean [2], wave [5], and climate [11] prediction. This paper thus serves as a gentle introduction to each of these audiences.

This paper describes the implementation of neural-fortran. While still only a proof of concept, I demonstrate that its ease of use, serial performance, and parallel scalability make it a viable candidate for use in production on its own, or in integration with existing Fortran software. For brevity, I will not go into much detail on how neural networks work in a mathematical sense, and I will focus on the implementation details in the context of Fortran programming. I first describe the features in Section 2, then I describe the implementation in Section 3. I demonstrate the use of the framework with a real-life example in Section 4. Then, I discuss the computational performance of the framework in Section 5. Concluding remarks are given in Section 6.
2 Features

neural-fortran provides out-of-the-box support for:

• Feed-forward neural networks with an arbitrary number of hidden layers, each with an arbitrary number of neurons;
• Several activation functions, including Gaussian, RELU, sigmoid, step, and tangent hyperbolic functions;
• Optimization using stochastic gradient descent (SGD) [21] and a quadratic cost function;
• Data-based parallelism using collective sum and broadcast operations;
• Single (real32), double (real64), or quadruple (real128) precision floats, selected at compile time;
• Saving and loading networks to and from file.

Feed-forward neural networks are the simplest possible flavor of all neural networks, and form a basis for other network flavors such as convolutional [14] and recurrent [10] neural networks. The optimization algorithm and the choice of cost function allow a network to learn from the data. Parallelism allows training or evaluating the network using many processors at once. Finally, the choice of single, double, or quadruple floating-point precision is made available via type kind constants from the standard iso_fortran_env module, and selected at compile time using a preprocessor macro. While quadruple floating-point precision is likely overkill for most machine learning applications, it is trivially made available by most recent Fortran compilers, and may prove to be useful if an application that requires it comes about.
3 Implementation

This section describes the fundamental data structures, networks and layers, and the implementation of methods to compute the output of the network (forward propagation), as well as learning from data (back-propagation). Finally, we will go over the method to parallelize the training of the network.
3.1 Networks and layers

At a high level, a neural network is made of an arbitrary number of layers, each having an arbitrary number of neurons. Each neuron has at least two fundamental properties: a bias b, which determines how easy it is to activate the neuron, and a weight w for every connection to other neurons, determining the strength of the connection. A feed-forward neural network is also often called dense because each neuron in one layer is connected to all neurons in the next layer. Due to this unstructured nature, a contiguous Fortran array is not a convenient choice to model a network. Instead, we can model a network with a derived type, one of its components being a dynamic array of layers (Listing 1).

type :: network_type
  type(layer_type), allocatable :: layers(:)
  integer, allocatable :: dims(:)
  procedure(activation_function), &
    pointer, nopass :: activation => null()
  procedure(activation_function), &
    pointer, nopass :: activation_prime => null()
contains
  ...
end type network_type

Listing 1.
Definition of the network class. Type-bound methods are omitted for brevity.

The key component in the network class is the allocatable array of layer instances. The allocatable attribute is essential to allowing the size and structure of the network to be determined at run time rather than compile time. We keep track of two procedure pointers at this point, one for the activation function and another for the derivative of the activation function. The former is input by the user, or given a default value if not specified. The latter is determined by the activation function. While not crucial for functionality, we keep the component dims for housekeeping. The network class has several private and public methods, which are omitted in Listing 1. We will look into these methods in more detail soon.

To create and initialize a network instance from a single line, we need to build all the set-up logic into a custom constructor function (Listing 2).

type(network_type) function &
    net_constructor(dims, activation) result(net)
  integer(ik), intent(in) :: dims(:)
  character(len=*), intent(in), optional :: &
    activation
  call net % init(dims)
  if (present(activation)) then
    call net % set_activation(activation)
  else
    call net % set_activation('sigmoid')
  end if
  call net % sync(1)
end function net_constructor
Listing 2.
Custom network constructor.

The network constructor accepts two input arguments. First is a rank-1 array of integers dims. This array controls how many layers to allocate in the network, and how many neurons each layer should have. size(dims) is the total number of layers, including input and output layers. The second argument is optional, and it is a character string that contains the name of the activation function to use in the network. If not present, we default to a sigmoid function, σ(x) = 1 / (1 + e^(−x)), a commonly used function for neural network applications. The first step that we take in the constructor is to call the init() method, which goes on to allocate the layers and their components in memory, and to initialize their values. Second, we set the procedure pointers for the activation function and its derivative. Finally, the sync() method broadcasts the state of the network to all other processors if executed in parallel.

With our custom constructor defined, a network instance can be created from the client code as shown in Listing 3.

use mod_network, only: network_type
type(network_type) :: net
net = network_type([3, 5, 2], 'tanh')
Creating a network instance in the client code.

In this example, we created a network instance with an input layer of 3 neurons, one hidden layer of 5 neurons, an output layer with 2 neurons, and a tangent hyperbolic activation function. Recall that providing the name of the activation function is optional. If omitted, it will default to a sigmoid function.

Just as a network can be made of any number of layers, so can a layer be made of any number of neurons. For the basic feed-forward network architecture, we don't need special treatment of neurons using derived types, and modeling their properties as Fortran arrays inside the layer class is straightforward and sufficient (Listing 4).

type :: layer_type
  real(rk), allocatable :: a(:)   ! activations
  real(rk), allocatable :: b(:)   ! biases
  real(rk), allocatable :: w(:,:) ! weights
  real(rk), allocatable :: z(:)   ! temp. array
end type layer_type
Listing 4.
Definition of the layer class.

The layer is thus simply made of contiguous allocatable arrays for activations, biases, weights, and a temporary array to be used in the backpropagation step. Note that unlike the activations and biases, the weights are of rank 2, one for each neuron in this layer, and the other for all the neurons in the next layer. The kind constant rk refers to the real kind, and can be selected by the user at compile time as either real32 (single precision, default), real64 (double precision), or real128 (quadruple precision, if available).

Before we begin any training, it is important to initialize the network with quasi-random weights and biases to make it as little biased as possible. To that end, we initialize the weights using a simplified variant of Xavier's initialization [8], as shown in Listing 5.

type(layer_type) function &
    constructor(this_size, next_size) result(layer)
  integer(ik), intent(in) :: this_size, next_size
  allocate(layer % a(this_size))
  allocate(layer % z(this_size))
  layer % a = 0
  layer % z = 0
  layer % w = randn(this_size, next_size) &
    / this_size
  layer % b = randn(this_size)
end function constructor

Listing 5.
Layer constructor.

All weights are initialized as random numbers with a normal distribution centered on zero, and normalized by the number of neurons in the layer. The goal of normalization is to prevent the saturation of the network output for large networks. Activations are computed and stored during the evaluation of the network output, so they are initialized to zero.

Note that neural-fortran at this time does not treat individual neurons with their own data structure. While this approach allows for a simpler implementation, it does impose limitations on what kinds of neural networks can be built. Specifically, the current design allows for a family of classic, feed-forward neural networks. More complex network architectures, such as recurrent neural networks (RNN, [10]), which are well suited for natural language processing, would require additional logic, either by modeling individual neurons with derived types, or by using Fortran pointers. Efforts to generalize the possible network architectures that can be constructed in neural-fortran will be considered in the future.
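To make the initialization scheme concrete, the following is a minimal sketch in Python of drawing normally distributed weights and scaling them by the layer size, as Listing 5 does with randn. The function name and list-based layout are illustrative, not part of neural-fortran.

```python
import random

def init_layer(this_size, next_size, seed=0):
    """Simplified Xavier-style initialization: normal(0, 1) weights
    scaled by the number of neurons in this layer."""
    rng = random.Random(seed)
    # weights are rank 2: one row per neuron here, one column per neuron
    # in the next layer
    w = [[rng.gauss(0.0, 1.0) / this_size for _ in range(next_size)]
         for _ in range(this_size)]
    b = [rng.gauss(0.0, 1.0) for _ in range(this_size)]  # biases
    a = [0.0] * this_size  # activations start at zero
    return w, b, a

w, b, a = init_layer(3, 5)
print(len(w), len(w[0]), len(b))  # 3 5 3
```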
3.2 Computing the network output

Given a state of a network's weights and biases, and a sample of input data, we can compute the output of the network using so-called forward propagation. This procedure involves computing the activations of all neurons layer by layer, starting from the input layer and finishing with the output layer. The activations of each previous layer take the role of input data.

The output of a neuron is calculated as the weighted sum of neuron outputs from the previous layer, plus the bias of that neuron, the result of which is passed to an activation function. For each neuron i in layer j, we evaluate its output as:

a_ij = σ( Σ_i w_i(j−1) x_i(j−1) + b_ij )    (1)

where σ is the activation function. Σ_i w_i(j−1) x_i(j−1) is an expanded form of a dot product, so we can write this more concisely in vector form as:

a_j = σ( w_(j−1) · x_(j−1) + b_j )    (2)

Fortran's whole-array arithmetic allows us to express Eq. 2 in code almost identically, as shown in Listing 6.

pure subroutine fwdprop(self, x)
  class(network_type), intent(in out) :: self
  real(rk), intent(in) :: x(:)
  integer(ik) :: n
  associate(layers => self % layers)
    layers(1) % a = x
    do n = 2, size(layers)
      layers(n) % z = &
        matmul(transpose(layers(n-1) % w), &
        layers(n-1) % a) + layers(n) % b
      layers(n) % a = &
        self % activation(layers(n) % z)
    end do
  end associate
end subroutine fwdprop

Listing 6.
A subroutine to perform the forward propagation of the network, and store intermediate activations for later use.

The activations for the first (input) layer are just the input data x. The actual computation of activations is done from the second layer and onward. For each following layer, the input data are the activations of the previous layer. For the evaluation of w_(j−1) · x_(j−1), we use the intrinsic matmul rather than dot_product because layer_type % w is a rank-2 array, one rank for all neurons in this layer, and one rank for all neurons in the following (connecting) layer. The associate construct is here only to make the code less verbose, and is otherwise unnecessary for the implementation.

While the fwdprop subroutine is declared as pure, it does modify the state of the network, by storing w_(j−1) · x_(j−1) + b_j of each layer as it propagates through the layers. Storing these values is necessary for the back-propagation algorithm used for updating the weights and biases (see subsection 3.3). A variant of fwdprop that doesn't store any intermediate values but only returns the output of the network is also available as a pure function network_type % output(). This method should be used in any case outside of training of the network, such as testing or prediction.

3.3 Optimization

The optimization method used in neural-fortran is back-propagation with stochastic gradient descent [21]. I omit the details for brevity, but the essence of the method can be described in three steps:

1. Evaluate the cost function C of the network, that is, calculate the error of the network output relative to the training or testing data;
2. Find the gradient of C in the weight-bias space, (∂C/∂w, ∂C/∂b). The gradient vector informs us about the direction in which C decreases the fastest;
3. Update the weights and biases in all layers such that C decreases.

Thus the name gradient descent. The second step is the core of the method.
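As a toy illustration of these three steps (not taken from neural-fortran), consider a single "neuron" with a linear output and a quadratic cost; the gradient and the update rule can then be written out directly:

```python
def grad_step(w, b, x, y, eta):
    """One gradient-descent step for a linear model with
    quadratic cost C = 0.5 * (w*x + b - y)**2."""
    err = w * x + b - y   # step 1: error of the output relative to the data
    dc_dw = err * x       # step 2: gradient of C in weight-bias space
    dc_db = err
    return w - eta * dc_dw, b - eta * dc_db  # step 3: update so C decreases

w, b = 0.0, 0.0
for _ in range(200):
    w, b = grad_step(w, b, x=1.0, y=2.0, eta=0.1)
print(round(w + b, 3))  # the model output w*x + b approaches y = 2
```

Repeating the step drives the output toward the target, which is exactly the descent behavior that backprop generalizes to many layers.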
The "stochastic" part comes about in theway that the input data is shuffled during the training.Back to our code. Assuming we have done a forward prop-agation step and updated the activations in all layers, a back-ward propagation step will compute the weight and biastendencies dw and db , as shown in Listing 7.Unlike the forward propagation step, backward propaga-tion does not modify the state of the network, but returnsthe weight and bias tendencies that would minimize the costfunction given data y . Updating the weights and biases isjust a matter of iterating over all the layers and applyingthe tendencies returned by network_type % backprop() . Aconvenience method network_type % update() is providedfor this purpose, and is omitted in this paper for brevity. Now that we have the essential building blocks of a neu-ral network’s training loop (forward propagation, backwardpropagation, and update), we can put them together in a se-quence to make a convenience high-level training procedure.The subroutine train_single takes a single set of inputdata x and y, and the learning rate eta , and updates itsweights and biases accordingly (Listing 8). pure subroutine train_single(self, x, y, eta)class(network_type), intent(in out) :: selfreal(rk), intent(in) :: x(:), y(:), etatype(array2d), allocatable :: dw(:)type(array1d), allocatable :: db(:)call self % fwdprop(x)call self % backprop(y, dw, db)call self % update(dw, db, eta)end subroutine train_single
Listing 8.
A training procedure for a single data sample.

This method takes a sample of input data x, output data y, and a learning rate as input arguments. We first forward propagate the network using the input data. At this stage, all activations in the network are updated and stored. Second, we perform the backpropagation using the output data. At this step, we get the weight and bias tendencies. Finally, we pass the tendencies and the learning rate to the update method, which increments the state of the network.

More often than not, bias and weight tendencies are computed and accumulated over a long sequence of inputs and outputs (a batch), and applied at the very end. Thus we also need the train_batch subroutine, which accepts rank-2 arrays x and y as input arguments, the first dimension corresponding to the shapes of the input and output layers, respectively, and the second dimension corresponding to the size of the data batch (Listing 9).

pure subroutine backprop(self, y, dw, db)
  class(network_type), intent(in out) :: self
  real(rk), intent(in) :: y(:)
  type(array2d), allocatable, intent(out) :: dw(:)
  type(array1d), allocatable, intent(out) :: db(:)
  integer :: n, nm
  associate(dims => self % dims, layers => self % layers)
    call db_init(db, dims)
    call dw_init(dw, dims)
    n = size(dims)
    db(n) % array = (layers(n) % a - y) * self % activation_prime(layers(n) % z)
    dw(n-1) % array = matmul(reshape(layers(n-1) % a, [dims(n-1), 1]), &
      reshape(db(n) % array, [1, dims(n)]))
    do n = size(dims) - 1, 2, -1
      db(n) % array = matmul(layers(n) % w, db(n+1) % array) &
        * self % activation_prime(layers(n) % z)
      dw(n-1) % array = matmul(reshape(layers(n-1) % a, [dims(n-1), 1]), &
        reshape(db(n) % array, [1, dims(n)]))
    end do
  end associate
end subroutine backprop

Listing 7.
A subroutine to perform the backpropagation using gradient descent, and return the weight and bias tendencies.

subroutine train_batch(self, x, y, eta)
  class(network_type), intent(in out) :: self
  real(rk), intent(in) :: x(:,:), y(:,:), eta
  ...
end subroutine train_batch

Listing 9.
A variant of the training method that accepts batches of data.

Finally, access to either variant is made via the generic procedure train (Listing 10).

generic, public :: &
  train => train_batch, train_single
Listing 10.
Overloading specific training procedures with a generic name.

The user can thus perform training using a single sample of data, or a batch of data, using the same generic procedure. The correct specific procedure is determined by the compiler depending on the rank of the input data. Example:

! invokes network % train_single()
call network % train(x(:,n), y(:,n), eta)

! invokes network % train_batch()
call network % train(x(:,:), y(:,:), eta)
Listing 11.
Using the generic training procedure with single samples and batches of data.

This subsection completes the training sequence. The last remaining piece is the implementation of the parallelism, described in the following section.
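The accumulate-then-apply pattern of the batch variant can be sketched as follows. This is a hedged Python illustration using a linear model and quadratic cost, not the neural-fortran code itself; the tendencies are summed over the batch and the averaged update is applied once at the end.

```python
def train_batch(w, b, xs, ys, eta):
    """Accumulate weight and bias tendencies over a whole batch,
    then apply the averaged update once at the end."""
    dw = db = 0.0
    for x, y in zip(xs, ys):
        err = w * x + b - y
        dw += err * x  # weight tendency for this sample
        db += err      # bias tendency for this sample
    n = len(xs)
    return w - eta * dw / n, b - eta * db / n

# fit y = 2x + 1 from three samples
w, b = 0.0, 0.0
for _ in range(2000):
    w, b = train_batch(w, b, [0.0, 1.0, 2.0], [1.0, 3.0, 5.0], eta=0.1)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```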
3.4 Parallelism

There are two main approaches to parallelizing neural networks: data-based and model-based parallelism.

In the data-based parallelism approach, a training data batch is evenly distributed between the parallel processors, which can be working in either shared or distributed memory. Each processor calculates the weight and bias tendencies based on its piece of the data batch, and a collective sum operation is done over all parallel processes to calculate the total update to the weights and biases of the network. It is crucial that the randomly initialized weights and biases are equal on all parallel processors. This can be done either by all processes using the same random seed at initialization, or by broadcasting the weights and biases from one process to all others. neural-fortran adopts the latter approach.

The data-based parallelism in neural-fortran is accomplished with the following steps:

1. Create the network instance on each image by invoking the network_type constructor. Inside the constructor, the networks are synchronized by broadcasting the initial weights and biases from image 1 to all other images. This is done using the intrinsic co_broadcast subroutine. As the synchronization is built into the type constructor (see Listing 2), the user simply has to create the network instance as if they were working in serial mode.
2. Each parallel process calculates its own weight and bias tendencies given its piece of the training data batch.
3. Finally, the weight and bias tendencies are summed up across all parallel processes using the collective sum method:

if (num_images() > 1) then
  call dw_co_sum(dw_batch)
  call db_co_sum(db_batch)
end if

Once the collective sum operation is complete, each process updates its own copy of the network.
dw_co_sum and db_co_sum are thin wrappers around co_sum for arrays of derived types that hold the weight and bias tendencies, respectively.

neural-fortran thus relies exclusively on the collective subroutines co_sum (for the collective sum of weight and bias tendencies) and co_broadcast (for broadcasting randomly initialized weights and biases across all processes), introduced in Fortran 2018 [20]. Collective subroutines let the programmer express a family of parallel algorithms without involving coarrays for explicit data communication. Being easy to write, debug, and reason about, they are an extremely valuable new feature of the language.

In contrast to data-based parallelism, model-based parallelism entails parallelizing the dot product and matrix multiplication operations used in the forward and backward propagation steps of the training procedure. Model-based parallelism is currently not implemented in neural-fortran. The first step toward such capability would likely involve interfacing external libraries that provide parallel implementations of dot_product and matmul, such as the Intel Math Kernel Library, OpenBLAS, or ATLAS [25, 26]. As the matmul invocation in fwdprop and backprop would be automatically replaced by the compiler with the optimized and parallel procedure provided by the linked library, model-based parallelism could thus be accomplished without any further modification to the code.

The data- and model-based parallelism approaches are thus decoupled and independent from each other. For large systems with distributed nodes with multiple cores on each, a hybrid approach could be adopted: intra-node (shared memory) parallelization of matmul via an external linear algebra library, and inter-node (distributed memory) parallelization via Fortran 2018 collective subroutines. The viability and performance of such a hybrid approach will be explored in the future and are beyond the scope of this paper.
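Because the per-sample tendencies enter the update additively, summing shard-wise gradients reproduces the full-batch gradient exactly, which is why the co_sum-based scheme matches the serial result. A small Python sketch of this equivalence (the two-shard layout and the linear model are illustrative assumptions, not neural-fortran code):

```python
def grad(w, b, xs, ys):
    """Summed weight and bias tendencies over one shard of the batch
    (linear model, quadratic cost)."""
    dw = sum((w * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(w * x + b - y for x, y in zip(xs, ys))
    return dw, db

xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
w, b = 0.5, -0.5

# serial: gradient over the full batch on one image
dw_full, db_full = grad(w, b, xs, ys)

# "parallel": two images, each with its shard, then a collective sum (co_sum)
shard_grads = [grad(w, b, xs[:2], ys[:2]), grad(w, b, xs[2:], ys[2:])]
dw_sum = sum(g[0] for g in shard_grads)
db_sum = sum(g[1] for g in shard_grads)
print(dw_full == dw_sum, db_full == db_sum)  # True True
```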
4 Recognizing handwritten digits

To demonstrate the use of neural-fortran, we will look at an example of training a network to recognize handwritten digits using the MNIST database of images [16] (Figure 1). MNIST is commonly used in teaching machine learning methods, and for development and testing of new algorithms. This dataset is included in the neural-fortran repository and is a good example for beginners to get started and play with the framework. A convenience Fortran subroutine to load the dataset is available in the mod_io module.
Figure 1.
A small set of data samples from the MNIST database.

The MNIST dataset consists of 70000 images of handwritten digits, and the same number of labels denoting the value in each image. A label is a qualitative value that classifies the input image. In this paper, 50000 images will be used for training, and 10000 for validation. Each image is a 28 by 28 pixel greyscale scan of a numerical digit in the range [0-9]. The labels for this dataset are thus numerical values in the same range. Figure 2 offers a close-up view of one such image, labeled as the number 2.

Figure 2.
A close-up view of a single digit sample of 28 by 28 pixels. The value of each pixel is the greyscale intensity, zero for white and one for black.

In this example, we will load the MNIST dataset into memory, the data from each image being represented as a rank-1 real array of 784 elements, with magnitudes in the range [0-1]. The pixel values from each image will serve as the input data x to our network. The labels will serve as the output data y. Rather than a single number indicating the value of the label, we will use an array of 10 elements, one for each digit. The values of the elements will be zero for all elements except the one corresponding to the value of the label, which will take the value of one. Listing 12 shows the complete program.

This program has 3 key pieces. First, it loads the training and testing data into memory by calling the load_mnist subroutine. This subroutine returns input and output data arrays that are in the appropriate shape for working with neural-fortran. While not obvious from the listing, the training dataset consists of 50000 images, and the testing dataset consists of 10000 images. The distinction between the two is built into the data loader.
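The 10-element label arrays described above are a one-hot encoding. A minimal Python sketch of the idea (an illustrative analogue of neural-fortran's label_digits, not its actual implementation):

```python
def one_hot(label, n=10):
    """Encode a digit label as an array with 1 at the label's position
    and 0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(n)]

print(one_hot(2))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```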
It is important for the validation dataset to be completely distinct from the training dataset to ensure that the model can learn in a general way.

Second, we create our network instance, in this case a network with one input layer with 784 neurons, one hidden layer with 30 neurons, and one output layer with 10 neurons. Recall that the sizes of the input and output layers are not arbitrary. The input layer size must match the size of the input data sample. In the case of MNIST, each image is 28 by 28 pixels, a total of 784 pixels. Similarly, the size of the output layer must match the size of the output data sample, an array of 10 elements, one for each digit. The rationale for transforming the value of the image label into an array of 10 binary values is empirical. As the weights and biases of the network are adjusted in the training process, the network assigns higher probability to one or more digits. The output of the network is thus the probability that the input image represents each of the 10 digits. For a well-trained network and a clean input image, the network should output one value that is approximately 1, and nine others that are approximately 0. This binary label representation thus allows for probabilistic output of the network. While other output layer configurations are possible, they don't tend to yield as good results.

In the third step, we begin the training loop, which is performed for a set number of epochs. An epoch refers to one round of cycling through the whole training dataset, in our case, 50000 images. Within each epoch, however, we iterate over a number of so-called mini-batches, which are subsets of the whole dataset. The number of mini-batches is controlled by the batch size, in our case 1000, meaning that a total of 50 training iterations will be done in each epoch. The random choice of the starting index of the mini-batch makes this gradient descent approach a stochastic one.
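The random mini-batch selection just described reduces to drawing a uniform number and scaling it to a valid starting index, as done with random_number in Listing 12. A Python sketch of the same pattern (0-based indices here, unlike Fortran's 1-based arrays; names are illustrative):

```python
import random

def batch_bounds(n_samples, batch_size, rng=random.random):
    """Pick a random contiguous mini-batch [start, end] from the dataset,
    mirroring the random_number-based slicing in the example program."""
    pos = rng()  # uniform in [0, 1)
    start = int(pos * (n_samples - batch_size + 1))
    return start, start + batch_size - 1

start, end = batch_bounds(50000, 1000)
print(0 <= start and end <= 49999 and end - start + 1 == 1000)  # True
```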
Randomly sampling the training data in each batch prevents the network from converging toward a state that is specific to the ordering of the data. Note that in this simplistic implementation using the random_number subroutine, not all data samples will be used even for a large number of epochs, and there will be some overlap within each epoch. While this works well enough for this example, more sophisticated shuffling should be used in production.

As we saw in Section 3.3, the learning rate eta is the multiplier to the weight and bias tendencies at the time of the network update. Its value is somewhat arbitrary and varies between applications. An optimal value of eta thus needs to be determined by experimentation whenever we work with new datasets or network structures. A value of eta that is too high may lead to never converging to a local minimum of the cost function. On the other hand, a value too low may lead to a slow and computationally expensive training procedure.

The neural-fortran API used in this example is rather simple. We used only a network constructor and two methods, accuracy() and train(). All other code was for the purpose of loading and preparing the training data, slicing mini-batches out of the whole dataset, and other housekeeping. Running this program outputs the following:

Initial accuracy: 10.09 %
Epoch  1 done, Accuracy: 27.91 %
Epoch  2 done, Accuracy: 53.17 %
Epoch  3 done, Accuracy: 75.16 %
Epoch  4 done, Accuracy: 82.96 %
Epoch  5 done, Accuracy: 87.24 %
...
Epoch 30 done, Accuracy: 93.39 %
Listing 13.
Output from the MNIST training example program. Some lines omitted for brevity.

program example_mnist
  use mod_kinds, only: ik, rk
  use mod_mnist, only: label_digits, load_mnist
  use mod_network, only: network_type
  implicit none
  real(rk), allocatable :: tr_images(:,:), tr_labels(:)
  real(rk), allocatable :: te_images(:,:), te_labels(:)
  real(rk), allocatable :: input(:,:), output(:,:)
  type(network_type) :: net
  integer(ik) :: i, n, num_epochs
  integer(ik) :: batch_size, batch_start, batch_end
  real(rk) :: pos
  call load_mnist(tr_images, tr_labels, te_images, te_labels)
  net = network_type([784, 30, 10])
  batch_size = 1000
  num_epochs = 30
  if (this_image() == 1) then
    write(*, '(a,f5.2,a)') 'Initial accuracy: ', &
      net % accuracy(te_images, label_digits(te_labels)) * 100, ' %'
  end if
  epochs: do n = 1, num_epochs
    mini_batches: do i = 1, size(tr_labels) / batch_size
      ! pull a random mini-batch from the dataset
      call random_number(pos)
      batch_start = int(pos * (size(tr_labels) - batch_size + 1))
      batch_end = batch_start + batch_size - 1
      ! prepare mini-batch
      input = tr_images(:,batch_start:batch_end)
      output = label_digits(tr_labels(batch_start:batch_end))
      ! train the network on the mini-batch
      call net % train(input, output, eta=3._rk)
    end do mini_batches
    if (this_image() == 1) then
      write(*, '(a,i2,a,f5.2,a)') 'Epoch ', n, ' done, Accuracy: ', &
        net % accuracy(te_images, label_digits(te_labels)) * 100, ' %'
    end if
  end do epochs
end program example_mnist

Listing 12.
The complete program to train and evaluate a neural network using the MNIST dataset of handwritten digits.

The program first outputs the initial accuracy, which is approximately 10%. The accuracy is calculated as the number of digits (out of 10000, the size of the testing dataset) that are correctly predicted by the network. Since we have not done any training at this point, the accuracy is equivalent to that of a random guess of one in ten digits. With each epoch, the accuracy steadily improves and exceeds 93% at the 30th epoch. While this is not an extraordinarily high accuracy for handwritten digit recognition, it demonstrates that our simple little network works as intended. For example, a deep 6-layer network with several thousand neurons was shown to achieve an accuracy of 99.65% [4].

Figure 3.
Accuracy as a function of the number of epochs in the MNIST example.

The network accuracy as a function of the number of epochs is shown in Figure 3. It shows that the fastest learning occurs during the first five epochs, beyond which the network learns slowly and eventually plateaus. The rate of increase in accuracy, and the final value at which the accuracy asymptotes, are sensitive to almost every aspect of the network: the number of layers and neurons in each layer, the batch size, the learning rate, the choice of activation and cost functions, the initialization method for weights and biases, and more.
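The accuracy metric reported here, the fraction of the 10,000 test digits whose predicted class (the index of the largest network output) matches the true label, can be sketched compactly. The following Python sketch is illustrative only; the score vectors and the `accuracy` helper are hypothetical stand-ins, not the neural-fortran implementation:

```python
# Accuracy as used in the MNIST example: the fraction of test digits
# whose predicted class (index of the largest output score) matches
# the true label. Score vectors below are illustrative stand-ins.

def accuracy(outputs, labels):
    """outputs: list of 10-element score vectors; labels: true digits."""
    correct = sum(
        1 for scores, label in zip(outputs, labels)
        if max(range(10), key=lambda i: scores[i]) == label
    )
    return correct / len(labels)

# Two test digits: the first is classified correctly, the second is not.
outputs = [
    [0.0] * 3 + [0.9] + [0.1] * 6,  # argmax = 3
    [0.8] + [0.0] * 9,              # argmax = 0
]
labels = [3, 7]
print(accuracy(outputs, labels))  # -> 0.5
```

An untrained network produces essentially random scores, which is why the initial accuracy sits near the one-in-ten chance level before training begins.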
We evaluate two aspects of the computational performance of neural-fortran: serial performance in comparison with an established machine learning framework, and parallel scalability. Both benchmarks were done with the following configuration:

• Processor: Intel Xeon Platinum 8168 (2.70 GHz)
• Operating system: Fedora 28 64-bit
• Compiler: GNU Fortran 8.2.1
• Libraries:
  – OpenMPI 2.1.1
  – OpenCoarrays 2.3.1 [6]
To evaluate the serial (single-core) performance of neural-fortran, we compare the elapsed training time of neural-fortran against an established and mature machine learning framework, Keras 2.2.4 [3] using Tensorflow 1.12.0 [1] as the computational backend. This framework was configured in a Python 3.6.8 environment with numpy 1.16.1 and scipy 1.2.1 on the same system used to run neural-fortran.

The performance is compared using the handwritten digits example with the MNIST image dataset, with the same layer configuration as that used in Section 4 (784-30-10 neurons with sigmoid activations). The Keras network was configured in the exact same fashion, with the stochastic gradient descent optimizer and the quadratic cost function. The default batch size for Keras is 32 if not specified, so we use that value in this comparison. A total of 10 training epochs are done in each run. As the Tensorflow backend engages multithreading by default, we explicitly limit the number of cores and threads to one, for fairness. A total of 5 runs are made, the mean and standard deviation of which are shown in Table 1.
Table 1. Computational performance of neural-fortran and Keras+Tensorflow in the MNIST training example. Elapsed time shown is the mean ± standard deviation of 5 repeated runs.

Framework             Elapsed (s)     Memory use (MB)
neural-fortran        13.933 ± …      …
Keras + Tensorflow    … ± …           …

Finally, it is worth comparing the ease of use of neural-fortran to that of Keras. The most straightforward metric is the number of lines of code needed to construct, train, and evaluate the network. The neural-fortran program used in this benchmark consists of 35 lines of code excluding comments, whereas the Keras script consists of 41 lines of code. Five of these lines were needed to constrain the Keras network to use a single thread. The amount of code needed for this example is thus almost equivalent between the two frameworks. The Python script used to construct, train, and time the Keras network, and the neural-fortran program used in this benchmark, are available in the source code repository for this paper.

With a basic measure of serial performance established, let's evaluate neural-fortran's parallel scaling. All network parameters are the same as in the serial case, except for the batch size, which is now set to 1200. A large batch size is important for parallel scaling because a single batch is distributed evenly between the parallel processes. Unlike in the serial benchmarks, here we time only the training portion of the code and exclude the loading of the dataset. Elapsed times on up to 12 parallel images on a shared-memory system are shown in Figure 4. The elapsed time decreases monotonically from over 12 s in serial mode to under 2 s on 12 parallel cores.
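The speed-up and parallel efficiency implied by such timings are simple ratios of the serial and parallel elapsed times. A minimal sketch follows; the timing values are illustrative placeholders, not the measurements reported in this paper:

```python
# Parallel efficiency PE(n) = t(1) / (n * t(n)): the serial elapsed
# time divided by the number of cores times the parallel elapsed time.
# Timings below are illustrative placeholders, not measured values.

def parallel_efficiency(t_serial, t_parallel, n):
    return t_serial / (n * t_parallel)

t1 = 12.0                                     # hypothetical serial time (s)
timings = {1: 12.0, 2: 6.5, 4: 3.6, 8: 2.1}   # hypothetical t(n)

for n, tn in timings.items():
    pe = parallel_efficiency(t1, tn, n)
    print(f"{n:2d} cores: speed-up {t1 / tn:.2f}, PE {pe:.2f}")
```

PE = 1 corresponds to perfect scaling, while PE = 1/n corresponds to no speed-up at all relative to the serial run.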
Figure 4. Mean elapsed time as a function of the number of cores in the MNIST training example.

Next we look at the parallel efficiency (PE), which is the serial elapsed time divided by the number of processes times the parallel elapsed time, PE = t(1)/(n t(n)) (Figure 5). On the high end, perfect scaling corresponds to PE = 1. On the low end, no parallel speed-up relative to the serial performance, meaning t(n) = t(1) for any value of n, corresponds to PE = 1/n and is shown using the dashed line.

Figure 5.
Parallel efficiency as a function of the number of cores in the MNIST training example (solid line), and zero speed-up relative to the serial mode, PE = 1/n (dashed line).

The mean elapsed times and their standard deviations (out of five runs) are given in Table 2.

Table 2. Parallel scaling of neural-fortran in the MNIST training example.

Cores   Elapsed (s)    Parallel efficiency
1       12.068 ± …     1.00
…       …              …

PE remains well above the zero-speed-up line for all configurations, indicating that in this example neural-fortran can efficiently use all cores in a high-end shared-memory system. The ability to explicitly allocate any number of cores in the system, or to execute the training on a distributed-memory system without any change to the code, are unique features of neural-fortran that are not trivial to accomplish in a framework like Keras or Tensorflow.

This paper described the implementation of neural-fortran, a parallel framework for neural networks and deep learning. It allows constructing feed-forward neural networks with any number of layers and neurons and offers several activation functions and learning via stochastic gradient descent. neural-fortran also leverages the collective sum and broadcast procedures introduced in Fortran 2018 to accomplish data-based parallelism. Model-based parallelism can be enabled by linking to external linear algebra libraries such as the Intel MKL or ATLAS. Use of neural-fortran was demonstrated in an example of training the network to recognize handwritten digits. In a rudimentary benchmark, the computational performance of neural-fortran is comparable to that of an established and mature framework (Keras + Tensorflow), making it a viable candidate for use in production. While currently only a proof of concept with a limited number of features, neural-fortran provides a good foundation for further development and applications.

There are two potential benefits of using neural-fortran in new software development and machine learning projects. One is tight integration with existing Fortran code, such as computational fluid dynamics for engineering applications, or numerical weather prediction.
For example, efforts to augment existing climate modeling capabilities are underway, one instance being the Earth Machine project [22, 24]. Such projects work with large Fortran code bases with many parametric and empirical sub-models that could be greatly accelerated via tight integration with a machine learning framework such as neural-fortran. Another benefit is the ease of use for experienced Fortran programmers who want to get started with machine learning and neural networks without needing to also learn another programming language or framework. For this reason, any further development of neural-fortran will maintain ease of use as its primary focus.
Acknowledgments
I am grateful to Arjen Markus and Izaak Beekman, who reviewed and helped improve this article. I am also thankful to the Sourcery Institute team for their work on OpenCoarrays. Michael Hirsch helped set up the continuous integration for neural-fortran. Michael Nielsen's tutorial on neural networks and deep learning significantly informed the implementation of neural-fortran. The neural-fortran repository is available at https://github.com/modern-fortran/neural-fortran. The complete source code for this manuscript is available at https://github.com/milancurcic/neural-fortran-paper.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. 2016. Tensorflow: a system for large-scale machine learning. In OSDI, Vol. 16. 265–283.
[2] E. P. Chassignet, H. E. Hurlburt, O. M. Smedstad, G. R. Halliwell, P. J. Hogan, A. J. Wallcraft, and R. Bleck. 2006. Ocean prediction with the hybrid coordinate ocean model (HYCOM). In Ocean weather forecasting. Springer, 413–426.
[3] F. Chollet et al. 2015. Keras. https://keras.io
[4] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. 2010. Deep, Big, Simple Neural Nets for Handwritten Digit Recognition. Neural Computation 22, 12 (2010), 3207–3220. https://doi.org/10.1162/NECO_a_00052
[5] M. A. Donelan, M. Curcic, S. S. Chen, and A. K. Magnusson. 2012. Modeling waves and wind stress. Journal of Geophysical Research: Oceans.
[6] A. Fanfarillo, T. Burnus, V. Cardellini, S. Filippone, D. Nagle, and D. Rouson. 2014. OpenCoarrays: open-source transport layers supporting coarray Fortran compilers. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. ACM, 4.
[7] P. F. Fischer, J. W. Lottes, and S. G. Kerkemeier. 2008. nek5000. http://nek5000.mcs.anl.gov
[8] X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249–256.
[9] Y. Goldberg. 2016. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research 57 (2016), 345–420.
[10] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[11] J. W. Hurrell, M. M. Holland, P. R. Gent, S. Ghan, J. E. Kay, P. J. Kushner, J.-F. Lamarque, W. G. Large, D. Lawrence, K. Lindsay, W. H. Lipscomb, M. C. Long, N. Mahowald, D. R. Marsh, R. B. Neale, P. Rasch, S. Vavrus, M. Vertenstein, D. Bader, W. D. Collins, J. J. Hack, J. Kiehl, and S. Marshall. 2013. The Community Earth System Model: A Framework for Collaborative Research. Bulletin of the American Meteorological Society 94, 9 (2013), 1339–1360. https://doi.org/10.1175/BAMS-D-12-00121.1
[12] T. Kimoto, K. Asakawa, M. Yoda, and M. Takeoka. 1990. Stock market prediction system with modular neural networks. In 1990 IJCNN International Joint Conference on Neural Networks. IEEE, 1–6.
[13] A. Krizhevsky and G. Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
[15] J. Nathan Kutz. 2017. Deep learning in fluid dynamics. Journal of Fluid Mechanics 814 (2017), 1–4. https://doi.org/10.1017/jfm.2016.803
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
[17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[19] J. G. Powers, J. B. Klemp, W. C. Skamarock, C. A. Davis, J. Dudhia, D. O. Gill, J. L. Coen, D. J. Gochis, R. Ahmadov, S. E. Peckham, G. A. Grell, J. Michalakes, S. Trahan, S. G. Benjamin, C. R. Alexander, G. J. Dimego, W. Wang, C. S. Schwartz, G. S. Romine, Z. Liu, C. Snyder, F. Chen, M. J. Barlage, W. Yu, and M. G. Duda. 2017. The Weather Research and Forecasting Model: Overview, System Efforts, and Future Directions. Bulletin of the American Meteorological Society 98, 8 (2017), 1717–1737. https://doi.org/10.1175/BAMS-D-15-00308.1
[20] John Reid. 2018. The new features of Fortran 2018. In ACM SIGPLAN Fortran Forum, Vol. 37. ACM, 5–43.
[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Learning representations by back-propagating errors. Nature 323 (Oct. 1986), 533–536. https://doi.org/10.1038/323533a0
[22] T. Schneider, S. Lan, A. Stuart, and J. Teixeira. 2017. Earth System Modeling 2.0: A Blueprint for Models That Learn From Observations and Targeted High-Resolution Simulations. Geophysical Research Letters 44, 24 (2017), 12,396–12,417. https://doi.org/10.1002/2017GL076101
[23] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong. 2010. NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications.
[24] P. Voosen. 2018. The Earth Machine. Science. https://doi.org/10.1126/science.361.6400.344
[25] R. C. Whaley and A. Petitet. 2005. Minimizing development and maintenance costs in supporting persistently optimized BLAS. Software: Practice and Experience 35, 2 (2005), 101–121.
[26] R. C. Whaley, A. Petitet, and J. J. Dongarra. 2001. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Comput. 27, 1–2 (2001), 3–35.