Flexible numerical optimization with ensmallen
Ryan R. Curtin, Marcus Edel, Rahul Ganesh Prabhu, Suryoday Basak, Zhihao Lou, Conrad Sanderson
aa r X i v : . [ c s . M S ] M a r Flexible numerical optimization with ensmallen
Ryan R. Curtin , ∗ , Marcus Edel , , ∗ , Rahul Ganesh Prabhu , Suryoday Basak , Zhihao Lou , Conrad Sanderson , RelationalAI, Atlanta, USA Free University of Berlin, Germany Collabora, Montreal, Canada Birla Institute of Technology and Science, Pilani, Pilani Campus, India Department of Computer Science and Engineering, The University of Texas at Arlington, USA Epsilon, Chicago, USA Data61 / CSIRO, Australia Griffith University, Australia ∗ Primary maintainers
Abstract
This report provides an introduction to the ensmallen numerical optimization library, as well as a deep dive into the technicaldetails of how it works. The library provides a fast and flexible C++ framework for mathematical optimization of arbitraryuser-supplied functions. A large set of pre-built optimizers is provided, including many variants of Stochastic GradientDescent and Quasi-Newton optimizers. Several types of objective functions are supported, including differentiable, separable,constrained, and categorical objective functions. Implementation of a new optimizer requires only one method, while a newobjective function requires typically only one or two C++ methods. Through internal use of C++ template metaprogramming, ensmallen provides support for arbitrary user-supplied callbacks and automatic inference of unsupplied methods without anyruntime overhead. Empirical comparisons show that ensmallen outperforms other optimization frameworks (such as Juliaand SciPy), sometimes by large margins. The library is available at https://ensmallen.org and is distributed under thepermissive BSD license.
The problem of mathematical optimization is fundamental in the computational sciences [54]. In short, this problem isexpressed as argmin x f ( x ) (1)where f ( x ) is a given objective function and x is typically a vector or matrix. The ubiquity of this problem gives rise to theproliferation of mathematical optimization toolkits, such as SciPy [81], opt++ [48], OR-Tools [58], CVXOPT [76], NLopt [28],Ceres [3], and RBFOpt [13]. Furthermore, in the field of machine learning, many deep learning frameworks have integratedoptimization components. Examples include Theano [4], TensorFlow [1], PyTorch [56], and Caffe [26].Mathematical optimization is generally quite computationally intensive. For instance, the training of deep neural networksis dominated by the optimization of the model parameters on the data [36, 39]. Similarly, other popular machine learningalgorithms such as logistic regression are also expressed as and dominated by an optimization process [85, 46]. Computationalbottlenecks occur even in fields as wide-ranging as rocket landing guidance systems [19], motivating the development andimplementation of specialized solvers.The necessity for both efficient and specializable function optimization motivated us to implement the ensmallen library,originally as part of the mlpack machine learning library [14]. The ensmallen library provides a large set of pre-builtoptimizers for optimizing user-defined objective functions in C++; at the time of writing, 46 optimizers are available. Theexternal interface to the optimizers is intuitive and matches the ease of use of popular optimization toolkits mentioned above.However, unlike many of the existing optimization toolkits, ensmallen explicitly supports numerous function types:arbitrary, differentiable, separable, categorical, constrained, and semidefinite. Furthermore, custom behavior duringoptimization can be easily specified via callbacks . Lastly, the underlying framework facilitates the implementation ofnew optimization techniques, which can be contributed upstream and incorporated into the library. Table 1 contrasts thefunctionality supported by ensmallen and other optimization toolkits. n i f i e d fr a m e w o r k c o n s t r a i n t ss e p a r a b l e f u n c ti o n s / b a t c h e s a r b it r a r y f u n c ti o n s a r b it r a r y o p ti m i z e r ss p a r s e g r a d i e n t s c a t e g o r i c a l a r b it r a r y t y p e s c a ll b a c k s ensmallen Shogun [70] - - - - -Vowpal Wabbit [38] - - - - - - -TensorFlow [1] - G - G - G -PyTorch [56] - G - - G -Caffe [26] - G - - G Keras [12] - G - - G scikit-learn [57] G - G - - - G -SciPy [29] - - - - G MATLAB [47] - - - - G -Julia ( Optim.jl ) [49] G - - - - Table 1: Feature comparison: = provides feature, G = partially provides feature, - = does not provide feature. unifiedframework indicates if there is a form of generic/unified optimization framework; constraints and separable functions / batches indicate support for constrained functions and separable functions; arbitrary functions means arbitrary objective functions areeasily implemented; arbitrary optimizers means arbitrary optimizers are easily implemented; sparse gradient indicates thatthe framework can natively take advantage of sparse gradients; categorical refers to if support for categorical features exists; arbitrary types mean that arbitrary types can be used for the parameters x ; callbacks indicates that user-implementable callbacksupport is available.This report is a revised and expanded version of our initial overview of ensmallen [10]. It also provides a deep dive intothe internals of how the library works, which can be a useful resource for anyone looking to contribute to the library or getinvolved with its development.We continue the report as follows. We overview the available functionality and show example usage in Section 2.Advanced usage such as callbacks is discussed in Section 3. We demonstrate the empirical efficiency of ensmallen inSection 4. The salient points are summarized in Section 5. The task of optimizing an objective function with ensmallen is straightforward. The class of objective function (e.g., arbitrary,constrained, differentiable, etc.) defines the implementation requirements. Each objective function type has a minimal set ofmethods that must be implemented. Typically this is only between one and four methods. As an example, to optimize anobjective function f ( x ) that is differentiable, implementations of f ( x ) and f ′ ( x ) are required. One of the optimizers fordifferentiable functions, such as L-BFGS [41], can then be immediately employed.Whenever possible, ensmallen will automatically infer methods that are not provided. For instance, given a separableobjective function f ( x ) = P i f i ( x ) where an implementation of f i ( x ) is provided (as well as the number of such separableobjectives), an implementation of f ( x ) can be automatically inferred. This is done at compile-time, and so there is noadditional runtime overhead compared to a manual implementation. C++ template metaprogramming techniques [2, 5, 80]are internally used to produce efficient code during compilation.Not every type of objective function can be used with every type of optimizer. For instance, since L-BFGS is adifferentiable optimizer, it cannot be used with a non-differentiable object function type (e.g. an arbitrary function). Whenan optimizer is used with a user-provided objective function, an internal mechanism automatically checks the requirements,resulting in user-friendly error messages if any required methods are not detected. .1 Types of Objective Functions In most cases, the objective function f ( x ) has inherent attributes; for example, as mentioned in the previous paragraphs, f ( x ) might be differentiable. The internal framework in ensmallen can optionally take advantage of such attributes. In theexample of a differentiable function f ( x ) , the user can provide an implementation of the gradient f ′ ( x ) , which in turn allowsa first-order optimizer to be used. This generally leads to significant speedups when compared to using only f ( x ) . To allowexploitation of such attributes, the optimizers are built to work with many types of objective functions. The classes of objectivefunctions are listed below.For details on the required signatures for each objective function type (such as DifferentiableFunctionType ), see theonline documentation at https://ensmallen.org/docs.html .• Arbitrary functions ( ArbitraryFunctionType ). No assumptions are made on function f ( x ) and only the objective f ( x ) can be computed for a given x .• Differentiable functions ( DifferentiableFunctionType ). A differentiable function f ( x ) is an arbitrary functionwhose gradient f ′ ( x ) can be computed for a given x , in addition to the objective.• Partially differentiable functions ( PartiallyDifferentiableFunctionType ). A partially differentiable function f ( x ) is a differentiable function with the additional property that the gradient f ′ ( x ) can be decomposed along somebasis j such that f ′ j ( x ) is sparse. Often, this is used for coordinate descent algorithms (i.e., f ′ ( x ) can be decomposedinto f ′ ( x ) , f ′ ( x ) , etc.).• Arbitrary separable functions ( ArbitrarySeparableFunctionType ). An arbitrary separable function is an arbitraryfunction f ( x ) that can be decomposed into the sum of several objective functions: f ( x ) = P i f i ( x ) .• Differentiable separable functions ( DifferentiableSeparableFunctionType ). A differentiable separable functionis a separable arbitrary function f ( x ) where the individual gradients f ′ i ( x ) are also computable.• Categorical functions ( CategoricalFunctionType ). A categorical function type is an arbitrary function f ( x ) wheresome (or all) dimensions of x take discrete values from a set.• Constrained functions ( ConstrainedFunctionType ). A constrained function f ( x ) is a differentiable function subjectto constraints of the form c i ( x ) ; when the constraints are satisfied, c i ( x ) = 0 ∀ i . Minimizing f ( x ) then meansminimizing f ( x ) + P i c i ( x ) .• Semidefinite programs (SDPs). (These are a subset of constrained functions.) ensmallen has special support to makeoptimizing semidefinite programs [77] straightforward.
For each class of the objective functions, ensmallen provides a set of pre-built optimizers:•
For arbitrary functions:
Simulated annealing [34], CNE (Conventional Neural Evolution) [51], DE(Differential Evolution) [73], PSO (Particle Swarm Optimization) [31], SPSA (Simultaneous Perturbation StochasticApproximation) [71].•
For differentiable functions:
L-BFGS [41], Frank-Wolfe [25], gradient descent.•
For partially differentiable functions.
SCD (Stochastic Coordinate Descent) [68].•
For arbitrary separable functions:
CMA-ES (Covariance Matrix Adaptation Evolution Strategy) [23].•
For differentiable separable functions:
AdaBound [43], AdaDelta [83], AdaGrad [18], Adam [33], AdaMax [33],AMSBound [43], AMSGrad [60], Big Batch SGD [16], Eve [35], FTML (Follow The Moving Leader) [86],Hogwild! [59], IQN (Incremental Quasi-Newton) [50], Katyusha [6], Lookahead [84], SGD with momentum [63],Nadam [17], NadaMax [17], SGD with Nesterov momentum [52], Optimistic Adam [15], QHAdam (Quasi-Hyperbolic Generally, constrained functions do not need to be differentiable. However, this is a requirement here, as all of the current optimizers in ensmallen forconstrained functions require a gradient to be available. dam) [44], QHSGD (Quasi-Hyperbolic Stochastic Gradient Descent) [44], RMSProp [74], SARAH/SARAH+ [53],stochastic gradient descent, SGDR (Stochastic Gradient Descent with Restarts) [42], Snapshot SGDR [24],SMORMS3 [21], SVRG (Stochastic Variance Reduced Gradient) [27], SWATS [32], SPALeRA (Safe Parameter-wiseAgnostic LEarning Rate Adaptation) [67], WNGrad [82].•
For categorical functions:
Grid search.•
For constrained functions:
Augmented Lagrangian method, primal-dual interior point SDP solver, LRSDP (low-rankaccelerated SDP solver) [11].
Let us consider the problem of linear regression, where we are given a matrix of predictors X ∈ R n × d and a vector ofresponses y ∈ R n . Our task is to find the best linear model θ ∈ R d ; that is, we want to find θ ∗ = argmin θ f ( θ ) for f ( θ ) = k Xθ − y k = ( Xθ − y ) T ( Xθ − y ) . (2)From this we can derive the gradient f ′ ( θ ) : f ′ ( θ ) = 2 X T ( Xθ − y ) . (3)To find θ ∗ using a differentiable optimizer , we simply need to provide implementations of f ( θ ) and f ′ ( θ ) according tothe signatures required by the DifferentiableFunctionType of objective function. For a differentiable function, only twomethods are necessary:
Evaluate() and
Gradient() . The pre-built L-BFGS optimizer can be used to find θ ∗ .Figure 1 shows an example implementation. We hold X and y as members of the LinearRegressionFunction class ,and theta is used to represent θ . Via the use of Armadillo [65], the linear algebra expressions to implement the objectivefunction and gradient are readable in a way that closely matches Equations (2) and (3). In addition to the support for various objective functions and the provision of many pre-built optimizers covered in Section 2,the ensmallen library provides further functionality, suitable for advanced users. This includes:• Inference of functions not supplied by the user. For example, a user might supply a single function named
EvaluateWithGradient() in order to allow sharing common computation that may be present in
Evaluate() and
Gradient() . Those two functions can then be automatically inferred when needed.• Callbacks, which can specify custom behavior during the optimization. Examples include printing the loss functionvalue at each iteration, terminating when a time limit is reached, changing the step size, and adding custom constraintswhen they are violated by the current solution.• Use of types other than arma::mat when calling
Optimize() . This can include integer-valued matrices ( arma::imat ),sparse matrices ( arma::sp_mat ), or any type whose implementation matches the Armadillo API [65, 64].Each of the above points is covered in more detail in the following subsections.
In Section 2.3, we saw an example of how a user might implement
Evaluate() and
Gradient() for the linear regressionobjective function and use ensmallen to find the minimum. However, there is an inefficiency: the objective functioncomputation is defined as f ( θ ) = k Xθ − y k , and the gradient computation is defined as f ′ ( θ ) = 2 X T ( Xθ − y ) . There isa shared inner computation in f ( θ ) and f ′ ( θ ) : the term ( Xθ − y ) . If f ( θ ) and f ′ ( θ ) are implemented as separate functions,there is no easy way to exploit this shared computation. Typically, in practice, solving a linear regression model can be done directly by taking the pseudoinverse: θ ∗ = X † y . However, the objective functionis easy to describe and useful for demonstration, so we use it in this document. include
Gradient( const arma::mat& theta, arma::mat& gradient) { gradient = 2 * X.t() * (X * theta - y); } private : const arma::mat& X; const arma::vec& y;}; int main(){ arma::mat X;arma::vec y;// ... set the contents of X and y here ...ens::LinearRegressionFunction f(X, y);ens::L_BFGS optimizer; // create the optimizer with default parametersarma::mat theta_best(X.n_rows, 1, arma::fill::randu); // initial starting point (uniform random values)optimizer.Optimize(f, theta_best);// at this point theta_best contains the best parameters} Figure 1: An example implementation of an objective function class for linear regression and usage of the L-BFGS optimizerin ensmallen . The online documentation for all ensmallen optimizers is at https://ensmallen.org/docs.html . The arma::mat and arma::vec types are dense matrix and vector classes from the Armadillo linear algebra library [65], with thecorresponding online documentation at http://arma.sf.net/docs.html .The differentiable optimizers in ensmallen treat the given functions as oracular, and do not know anything about theinternal computations of the functions. This inefficiency can apply to any optimization package that accepts an objectivefunction and its gradient as separate parameters, such as SciPy, Optim.jl and bfgsmin() from Octave [20].To work around this issue, ensmallen internally uses template metaprogramming techniques to allow the user to provide either separate implementations of the objective function and gradient, or a combined implementation that computes both theobjective function and gradient simultaneously. The latter options allows the sharing of inner computations. That is, the usercan provide the methods Evaluate() and
Gradient() , or
EvaluateWithGradient() . For the example objective functionabove, we empirically show (Section 4) that the ability to provide
EvaluateWithGradient() can result in a significantspeedup. Similarly, when implementing a differentiable optimizer in ensmallen , it is possible to use either
Evaluate() and
Gradient() , or EvaluateWithGradient() during optimization.The same technique is used to infer and provide more missing methods than just
EvaluateWithGradient() or Evaluate() and
Gradient() . For instance, separable functions, differentiable separable functions, constrained functions,and categorical functions each have inferrable methods. Not all of these possibilities are currently implemented in ensmallen ,but the existing framework makes it straightforward to add more. Below are examples that are currently implemented: It may be possible to use a specialized implementation of auto-differentiation or a programming language with introspection that operates directly onthe abstract syntax tree [30] of the given objective and gradient computations to avoid this inefficiency and successfully share the computations across thegradient and objective. However, at the time of this writing, we are not aware of any optimization packages that actually do this. unction Description Function type
BeginOptimization
Called at the beginning of the optimization process all
EndOptimization
Called at the end of the optimization process all
Evaluate
Called after any call to
Evaluate()
Arbitrary, Differentiable, Partially differentiable,Arbitrary separable, Differentiable separable
EvaluateConstraint
Called after any call to
EvaluateConstraint()
Constrained
Gradient
Called after any call to
Gradient()
Differentiable, Partially differentiableDifferentiable separable
GradientConstraint
Called after any call to
GradientConstraint()
Constrained
BeginEpoch
Called at the beginning of a pass over the data Differentiable separable
EndEpoch
Called at the end of a pass over the data Differentiable separable
Table 2: Available callback routines, with brief descriptions. Optional additional arguments have been omitted for brevity.See for more detailed documentation.• (Differentiable functions.)
If the user provides an objective function with
Evaluate() and
Gradient() , we canautomatically synthesize
EvaluateWithGradient() .• (Differentiable functions.) If the user provides an objective function with
EvaluateWithGradient() , we can synthesize
Evaluate() and/or
Gradient() .• (Separable functions.) If the user provides an objective function with
Evaluate() and
NumFunctions() , we cansythesize a non-separable version of
Evaluate() .• (Separable functions.) If the user provides an objective function with
Gradient() and
NumFunctions() , we cansynthesize a non-separable version of
Gradient() .• (Separable functions.) If the user provides an objective function with
EvaluateWithGradient() and
NumFunctions() ,we can synthesize a non-separable version of
Gradient() .For more precise details on exactly how the method generation works, see Appendix A, where we describe the frameworkin a simplified form, focusing only the
EvaluateWithGradient() / Evaluate() / Gradient() example described above.
Many of the optimizers in ensmallen offer the ability to monitor and modify parts of the optimization process. Examplemodifications include changing the step size, adding custom constraints when they are violated by the current solution, orproviding custom heuristics to find and investigate feasible solutions.In many existing toolkits, this type of functionality is provided only via solver-specific interfaces. For instance,the tick statistical learning toolkit [9] requires the use of a solver-specific
History . In contrast, ensmallen providesoptimizer-independent callbacks to allow classes to inspect and work with internal parts of the optimization process. Inparticular, the callbacks allow code to be executed regularly during an optimization session.
RosenbrockFunction f;arma::mat coordinates = f.GetInitialPoint();MomentumSGD optimizer(0.01, 32, 100000, 1e-5, true, MomentumUpdate(0.5));optimizer.Optimize(f, coordinates, EarlyStopAtMinLoss(), ProgressBar());
Figure 2: Code snippet to demonstrate usage of the callback functionality using pre-defined callbacks:
EarlyStopAtMinLoss and
ProgressBar . EarlyStopAtMinLoss will terminate the optimization as soon as the objective value stars increasing, and
ProgressBar will print a progress bar during the optimization. lass CustomCallback { public : template < typename OptimizerType, typename
FunctionType, typename
MatType> bool
Evaluate(OptimizerType& opt, FunctionType& function, const
MatType& coordinates, const double objective){ std::cout << "The optimization process called Evaluate()!" << std::endl; return false; // Do not terminate, continue the optimization process.}};
Figure 3: An example of a custom callback. This callback prints to std::cout after each time the
Evaluate() functionis called by the optimizer. In the example the callback always returns false , meaning that the optimization should not beterminated on behalf of the callback.
To use callbacks, either for optimization, tuning or logging, arbitrary callbacks to any optimizer can be optionally providedto the
Optimize() function. Figure 2 contains a code snippet which briefly demonstrates usage of the callback functionality.Given the pre-defined callbacks
EarlyStopAtMinLoss and
ProgressBar , the code snippet shows not only how the
MomentumSGD optimizer can be used to find the best coordinates but also how the callbacks can be used to control and monitorthe optimization.Eight types of optimization callback routines are available, as shown in Table 2. These callbacks are regularly calledduring the optimization process, depending on the objective function type. Callbacks are executed in the order that they arespecified. Each callback may terminate the optimization via its bool return value, where true indicates that the optimizationshould be stopped. By default, subsequent callbacks are not called if an earlier callback terminates the optimization: theoptimizer terminates immediately.There are several pre-built callbacks that can be used without needing any custom code:•
EarlyStopAtMinLoss : stops the optimization process if the loss stops decreasing or no improvement has been made.•
PrintLoss : callback that prints loss to stdout or a specified output stream.•
ProgressBar : callback that prints a progress bar to stdout or a specified output stream.•
StoreBestCoordinates : callback that stores the model parameter after every epoch if the objective decreased.Note that to use the
StoreBestCoordinates callback, the user will need to instantiate a
StoreBestCoordinates object,and then call
BestCoordinates() in order to recover the best coordinates found during the optimization. This process isdetailed more in the following section.
Implementing a custom callback is straightforward: the only requirement is a class that has functions whose names are thesame as the callback functions listed in Table 2. Comprehensive documentation on the required signatures for each callbackcan be found in the ensmallen documentation at http://ensmallen.org/docs.html .If a custom callback is desired to take an action before the optimization process starts, then only a
CustomCallback::BeginOptimization() method is required to perform the desired action. An example of a customcallback that prints a line to std::cout every time the optimization calls
Evaluate() is shown in Figure 3. The
CustomCallback class can be used exactly like
EarlyStopAtMinLoss in Figure 2—the only change needed is to add
CustomCallback() to the list of arguments passed to optimizer.Optimize() .If a callback class requires additional parameters or state beyond what is passed through the predefined arguments tofunctions in Table 2, a user should manually create an instance of the callback class with those additional parameters, andthen pass the object to the optimizer as a callback when
Optimize() is called. ensmallen does not modify or dereference theobject, so it is safe to use for this purpose. Figure 4 provides an example, where we pass an instantiated custom callback thattakes an additional step-size decay parameter as input. In addition, inside the
StepTaken() callback, we use the optimizer’s truct
CustomCallback{ CustomCallback( double rIn) : r(rIn) {} template < typename OptimizerType, typename
FunctionType, typename
MatType> void
StepTaken(OptimizerType& optimizer, FunctionType& function, MatType& coordinates){ // Multiply the step size by r (hopefully r is less than 1!).optimizer.StepSize() *= r;} double r;};RosenbrockFunction f;arma::mat coordinates = f.GetInitialPoint();Adam opt;CustomCallback cb(0.9); // Instantiate the custom callback...opt.Optimize(f, coordinates, cb); // ...and call Optimize() with that object!
Figure 4: Code snippet demonstrating how to add additional parameters/state to a callback and accessing optimizer-specificparameters. struct
Optimizer{ template < typename FT> void
Optimize(FT& f, arma::mat& p){ f.Evaluate(p);}}; int main(){ RosenbrockFunction rf;arma::mat parameters = rf.GetInitialPoint();Optimizer opt;opt.Optimize(rf, parameters); return struct
Optimizer{ template < typename FT, typename... CallbackType > void
Optimize(FT& f, arma::mat& p, CallbackType&&... c){ Callback::BeginOptimization(*this, f, p, c...);f.Evaluate(iterate);}}; int main(){ RosenbrockFunction rf;arma::mat parameters = rf.GetInitialPoint();Optimizer opt;opt.Optimize(rf, parameters); return
Figure 5: Left panel: A C++ program that mimics the ensmallen optimizer interface without any callback functionality. Rightpanel: A corresponding C++ program using an empty callback routine that is automatically optimized out. Both programsproduce the exact same machine code, resulting in no performance penalty if no callbacks are used.interface function (
StepSize() ) to update the step size. Note that if this callback is attempted to be used with an optimizerthat did not have a
StepSize() function, compilation would fail. As such, some care is necessary when implementing customcallbacks.There is no performance penalty if no callbacks are used. Figure 5 shows two programs; one explicitly without callbacks,and one with an empty callback. The implementation of callback facility is internally done via template metaprogramming.In all cases, modern C++ compilers (e.g. clang-1100.0.33.16 and g++
In the example shown in Section 2.3, where we introduced the class
LinearRegressionFunction , the matrix and vectorobjects are hardcoded as arma::mat and arma::vec . These objects hold elements with the C++ type double , representingdouble-precision floating point [22]. However, in many applications it can be very important to specify a different underlyingelement type (e.g. the single-precision float , used by arma::fmat and arma::fvec ). For instance, in the field of machinelearning, neural networks have been shown to be effective with low-precision floating point representations for weights [79].Furthermore, many optimization problems have parameters that are best represented as sparse data [75, 59], which isrepresented in Armadillo as the sp_mat class [64, 66]. Even alternate representations such as data held on the GPU canbe quite important: the use of GPUs can often result in significant speedups [55, 8].In order to handle this diverse set of needs, ensmallen has been built in such a way that any underlying storage type canbe used to represent the coordinates to be optimized—so long as it matches the Armadillo API. This means that the BandicootGPU matrix library can be used as a drop-in replacement once it is stable.An example of seamlessly using ensmallen ’s optimizers with different underlying storage types is given in Figure 6. Inthis example, the types arma::mat (which holds double ), arma::fmat (which holds float ), and arma::imat (which holds https://gitlab.com/conradsnicta/bandicoot-code class ParabolicFunction { public : template < typename MatType> typename
MatType::elem_type Evaluate( const
MatType& coordinates){ // Return the sum of squared coordinates. return arma::accu(arma::square(coordinates));}}; int main(){ // Use simulated annealing to optimize the ParabolicFunction.ParabolicFunction pf;ens::SA<> optimizer(ens::ExponentialSchedule());// element type is doublearma::mat doubleCoordinates(10, 1, arma::fill::randu);optimizer.Optimize(pf, doubleCoordinates);// element type is floatarma::fmat floatCoordinates(10, 1, arma::fill::randu);optimizer.Optimize(pf, floatCoordinates);// element type is intarma::imat intCoordinates(10, 1, arma::fill::randi);optimizer.Optimize(pf, intCoordinates);}
Figure 6: Example ensmallen program showing that ensmallen ’s optimizers can be used to seamlessly optimize functionswith many different element types. lass GradientDescent { template < typename FunctionType, typename
MatType, typename
GradType = MatType> typename
MatType::elem_type Optimize(FunctionType& function, MatType& coordinates){ // The step size is hardcoded to 0.01, and the number of iterations is 1000. for ( size_t i = 0; i < 1000; ++i){ GradType gradient;function.Gradient(coordinates, gradient);// Take the step.coordinates -= 0.01 * gradient;}// Compute and return the final objective. return function.Evaluate(coordinates);}}; Figure 7: Example implementation of a simple gradient descent optimizer. For the sake of brevity, functionality such asthe ability to configure the parameters has been deliberately omitted. The actual
GradientDescent optimizer in ensmallen provides more functionality. int ) are all used with ensmallen ’s simulated annealing implementation to optimize a very simple parabolic function. Due tothe use of templates both inside ensmallen and in the implementation of the
ParabolicFunction class, it is trivial to changethe underlying type used for storage.Consider the simplified gradient descent optimizer shown in Figure 7. The use of the template types
FunctionType , MatType , and
GradType means that at compilation time, the correct types are substituted in for
FunctionType , MatType , and
GradType (which by default is set to be the same as
MatType in this code). So long as each type has all the methods that areused inside of
Optimize() , there will be no compilation problems. Templates are a technique for code generation; in thiscase, that means the code generated will be exactly the same as if
Optimize() was written with the types specified in thosetemplate parameters. This means that there is no additional runtime overhead when a different
MatType is used.Appendix C contains details on the internal compile-time checks used for providing users with concise error messages,avoiding the long errors typically associated with template metaprogramming.
To show the efficiency of mathematical optimization with ensmallen , we compare its performance with several othercommonly used optimization frameworks, including some that use automatic differentiation.
For our first experiment, we aim to capture the overhead involved in various optimization toolkits. In order to do this, weconsider the simple and popular Rosenbrock function [62]: f ([ x , x ]) = 100( x − x ) + (1 − x ) . (4)This objective function is useful for this task because the computational effort involved in computing f ( · ) is trivial.Therefore, if we hold the number of iterations of each toolkit constant, then this will help us understand the overhead costs ofeach toolkit. For the optimizer, we use simulated annealing [34], a gradient-free optimizer. Simulated annealing will call theobjective function numerous times; for each simulation we limit the optimizer to 100K objective evaluations.The code used to run this simulation for ensmallen (including the implementation of the Rosenbrock function) is givenin Figure 8. Note that the RosenbrockFunction is actually implemented in ensmallen ’s source code, in the directory include/ensmallen_bits/problems/ . include
MatType::elem_type Evaluate( const
MatType& x) const { return
100 * std::pow(x[1] - std::pow(x[0], 2), 2) + std::pow(1 - x[0], 2);}}; int main(){ arma::wall_clock clock;RosenbrockFunction rf;ens::ExponentialSchedule sched;// A tolerance of 0.0 means the optimization will run for the maximum number of iterations.ens::SA<> s(sched, 100000, 10000, 1000, 100, 0.0);// Get the initial point of the optimization.arma::mat parameters = rf.GetInitialPoint();// Run the optimization and time it.clock.tic();s.Optimize(rf, parameters); const double time = clock.toc();std::cout << time << std::endl << "Result (optimal 1, 1): " << parameters.t(); return
Figure 8: Code to use ensmallen to optimize the Rosenbrock function using 100K iterations of simulated annealing.We compare four frameworks for this task: (i) ensmallen , (ii) scipy.optimize.anneal from SciPy 0.14.1 [29], (iii) simulated annealing implementation in Optim.jl with Julia 1.0.1 [49], (iv) samin in the optim package for Octave [20].While another option here might be simulannealbnd() in the Global Optimization Toolkit for MATLAB, no license wasavailable. We ran our code on a MacBook Pro i7 2018 with 16GB RAM running macOS 10.14 with clang 1000.10.44.2, Juliaversion 1.0.1, Python 2.7.15, and Octave 4.4.1.Our initial implementation for each toolkit, corresponding to the line “default” in Table 3, was as simple of animplementation as possible and included no tuning. This reflects how a typical user might interact with a given framework.Only Julia and ensmallen are compiled, and thus are able to avoid the function pointer dereference for evaluating theRosenbrock function and take advantage of inlining and related optimizations. The overhead of both scipy and samin arequite large— ensmallen is nearly three orders of magnitude faster for the same task.However, both Python and Octave have routes for acceleration, such as Numba [37], MEX bindings and JIT compilation.We hand-optimized the Rosenbrock implementation using Numba, which required significant modification of the underlying anneal.anneal() function. These techniques did produce some speedup, as shown in the second row of Table 3. For Octave,a MEX binding did not produce a noticeable difference. We were unable to tune either ensmallen or Optim.jl to get anyspeedup, suggesting that novice users will easily be able to write efficient code in these cases. .2 Large-Scale Linear Regression Problems
Next, we consider the linear regression example described in Section 2.3. For this task we use the first-order L-BFGSoptimizer [41], implemented in ensmallen as the
L_BFGS class. Using the same four packages, we implement the linearregression objective and gradient. Remembering that ensmallen allows us to share work across the objective functionand gradient implementations (Section 3.1), for ensmallen we implement a version with only
EvaluateWithGradient() ,denoted as ensmallen-1 . We also implement a version with both
Evaluate() and
Gradient() : ensmallen-2 . We also useautomatic differentiation for Julia via the ForwardDiff.jl [61] package and for Python via the
Autograd [45] package. ForGNU Octave we use the bfgsmin() function.Results for various data sizes are shown in Table 4. For each implementation, L-BFGS was allowed to run for only iterations and never converged in fewer iterations. The datasets used for training are highly noisy random data with a slightlinear pattern. Note that the exact data is not relevant for the experiments here, only its size. Runtimes are reported as theaverage across 10 runs.The results indicate that ensmallen with EvaluateWithGradient() is the fastest approach. Furthermore, the use of
EvaluateWithGradient() yields non-negligible speedup over the ensmallen-2 implementation with both the objective andgradient implemented separately. In addition, although the automatic differentiation support makes it easier for users to writetheir code (since they do not have to write an implementation of the gradient), we observe that the output of automaticdifferentiators is not as efficient, especially with
ForwardDiff.jl . We expect this effect to be more pronounced withincreasingly complex objective functions.
Lastly, we demonstrate the easy pluggability in ensmallen for using various optimizers on the same task. Using a versionof
LinearRegressionFunction from Section 2.3 adapted for separable differentiable optimizers, we run six optimizers withdefault parameters in just 8 lines of code, as shown in Fig. 9. Applying these optimizers to the
YearPredictionMSD datasetfrom the UCI repository [40] yields the learning curves shown in Fig. 10. // X and y are data.LinearRegressionFunction lrf(X, y); using namespace ens;StandardSGD<>().Optimize(lrf, sgdModel);Adam().Optimize(lrf, adamModel);AdaGrad().Optimize(lrf, adagradModel);SMORMS3().Optimize(lrf, smorms3Model);SPALeRASGD().Optimize(lrf, spaleraModel);RMSProp().Optimize(lrf, rmspropModel);
Figure 9: ensmallen makes it easy to switch out optimizer types: 8 lines of code run 6 optimizers on one problem.Any other optimizer for separable differentiable objective functions can be dropped into place in the same manner; giventhe large number of available optimizers in ensmallen , this support could be used to easily compare optimizers. In fact, thisis exactly done with the interactive optimizer visualization tool found at https://vis.ensmallen.org . Figure 11 shows anexample visualization. ensmallen scipy Optim.jl samin default
K iterations of simulated annealing with various frameworks on the simple Rosenbrock function.Julia code runs do not count compilation time. The tuned row indicates that the code was manually modified for speed. lgorithm d : 100, n : 1k d : 100, n : 10k d : 100, n : 100k d : 1k, n : 100k ensmallen-1 ensmallen-2 Optim.jl scipy bfgsmin
ForwardDiff.jl autograd n indicating the number of samples, and d indicating the dimensionality of each sample. All Julia runs do not count compilation time. l o ss AdamAdaGradRMSpropSPALeRASMORMS3SGD
Figure 10: Example usage of six ensmallen optimizers to optimize a linear regression function on the
YearPredictionMSD dataset [40] for 5 epochs of training. The optimizers can be tuned for better performance.igure 11: Visualization of the Adam optimizer on the Styblinski-Tang objective function; this is a screen capture from https://vis.ensmallen.org . This report introduces and explains ensmallen , a C++ mathematical optimization library that internally uses templatemetaprogramming to produce efficient code. The library is flexible, with support for numerous types of objective functions,and many implemented optimizers. It is easy to both implement objective functions to optimize with ensmallen , and towrite new optimizers for inclusion in the library. ensmallen has support for automatic function inference and callbacks.Because this is done through template metaprogramming, there is no additional runtime overhead. Empirical results show that ensmallen outperforms other toolkits for similar tasks.Future work includes the implementation of additional optimizers and better support for various types ofobjective functions (such as better support for constrained functions). Development is done in an open manner at https://github.com/ensmallen/ensmallen . The library uses the permissive BSD license [72]. Anyone is welcome tohelp with the effort and contribute code. ppendix A: Inferring Missing Methods – Details
In Section 3.1, we described that ensmallen is able to infer missing methods. For instance, if the user provides a differentiableobjective function that implements
EvaluateWithGradient() , we can automatically infer and provide implementations of
Evaluate() and
Gradient() that the optimizer can then use. (And vice versa: we can infer
EvaluateWithGradient() from
Evaluate() and
Gradient() .)Here, we provide details of how this is implemented, focusing specifically on how
EvaluateWithGradient() can beinferred, and how this is used in ensmallen ’s optimizers. The same technique is applicable to all of the other methods that ensmallen infers.Using our framework for function inference is straightforward; an example differentiable optimizer may resemble the codein Figure 12. Code for a new optimizer only needs to define a
Function
Evaluate() , Gradient() , and
EvaluateWithGradient() to be available. Theclass
Function<...> is like a mix-in class [69]; more accurately, it is a collection of mix-in classes.Figure 13 shows a shortened snippet of the definition of the
Function class. We can see that
Function inheritsmethods from the given
FunctionType (which is the user-supplied objective function class that is to be optimized),and also from the
AddEvaluateWithGradient<...> mixin class. The key to our approach is that if
FunctionType class has an
EvaluateWithGradient() method, then
AddEvaluateWithGradient<...> will provide no functions; if
FunctionType does not have an
EvaluateWithGradient() method, then
AddEvaluateWithGradient<...> will providean
EvaluateWithGradient() function.The details of how this work depend on the SFINAE technique [2, 78] and template specialization. The
AddEvaluateWithGradient class has five template parameters, shown in Figure 14; the last two of these have default values.For the sake of brevity we omit the expressions for the bool parameters
HasEvaluateGradient and
HasEvaluateWithGradient . These two parameters are traits expressions that depend heavily on SFINAE techniques for Interested readers can find that code in ensmallen_bits/function/add_evaluate_with_gradient.hpp . // FunctionType is the user-supplied function type to optimize.// MatType is the user-supplied type of the initial coordinates.// GradType is the user-specified type of the gradient.// typename MatType::elem_type represents the internal type held by MatType// (e.g., if MatType is `arma::mat`, then this is `double`) template < typename FunctionType, typename
MatType, typename
GradType> typename
MatType::elem_type Optimize(FunctionType& function,MatType& coordinates){ // The Function<> mix-in class adds all inferrable methods to `function`// So, if `function` has `function.Evaluate()` and `function.Gradient()`, then// `fullFunction` will have `fullFunction.EvaluateWithGradient()`. typename
Function
Figure 12: An example implementation of an ensmallen optimizer’s
Optimize() method. The use of the
Function<> mix-inclass [69] will provide methods that the original
FunctionType may not have. template < typename FunctionType, typename
MatType, typename
GradType> class Function : public AddEvaluateWithGradient
FunctionType... // many other mixin classes omitted
Figure 13: Snippet of the definition of the
Function class. The full implementation contains numerous other mix-in classeslike
AddEvaluateWithGradient .ethod detection;
HasEvaluateGradient will evaluate to true if FunctionType has
Evaluate() and
Gradient() ; if onlyone (or neither) is available, the value is false . HasEvaluateWithGradient will evaluate to true if FunctionType has anoverload of
EvaluateWithGradient() , and false otherwise.Using these two boolean template variables, we can then use template specialization to control the behavior of
AddEvaluateWithGradient as a function of what is provided by
FunctionType . Specifically, we make two specializations:one for when
EvaluateWithGradient() exists, and one for when both
Evaluate() and
Gradient() exist but
EvaluateWithGradient() does not.The first specialization, shown in Figure 15, is for when
HasEvaluateWithGradient is true —meaningthat FunctionType already has
EvaluateWithGradient() . In this situation, the implementation of template < typename FunctionType, typename
MatType, typename
GradType,// Check if FunctionType has at least one non-const Evaluate() or// Gradient(). bool
HasEvaluateGradient = ...// Check if FunctionType has an EvaluateWithGradient() method already. bool
HasEvaluateWithGradient = ... class AddEvaluateWithGradient { public :// Provide a dummy overload so the name 'EvaluateWithGradient' exists for this// object. typename MatType::elem_type EvaluateWithGradient(traits::UnconstructableType&);};
Figure 14: Definition of the
AddEvaluateWithGradient mix-in class. The first three template parameters are the ‘inputs’,and the last two template parameters are computed quantities used for later template specialization. template < typename FunctionType, typename
MatType, typename
GradType, bool
HasEvaluateGradient> class AddEvaluateWithGradient
MatType::elem_type EvaluateWithGradient( const
MatType& coordinates, GradType& gradient){ return static_cast
Figure 15: Partial template specialization of
AddEvaluateWithGradient for when the given
FunctionType does have an
EvaluateWithGradient() method available already. ddEvaluateWithGradient::EvaluateWithGradient() simply calls out to
FunctionType::EvaluateWithGradient() . However, there is somecomplexity here, as to successfully call
FunctionType::EvaluateWithGradient() , we must first cast the this pointer tohave type
FunctionType —which can only be done by first casting this to the derived class
Function<...> . (This cast isonly safe because we know that we will never create an
AddEvaluateWithGradient outside of the context of the
Function<> class.)Next, we specialize for the case where
HasEvaluateGradient is true (i.e., FunctionType has both
Evaluate() and
Gradient() methods), and
HasEvaluateWithGradient is false (i.e., there is no EvaluateWithGradient() provided by
FunctionType ). In this situation, we intend to synthesize an implementation for
EvaluateWithGradient() by using both ofthe provided
Evaluate() and
Gradient() methods sequentially. Figure 16 shows this specialization.The same complexity with casting is necessary in this specialization too. The last specialization is the general case:where there is neither
Evaluate() and
Gradient() nor
EvaluateWithGradient() provided by
FunctionType . That caseis covered by the initial shown implementation of
AddEvaluateWithGradient : EvaluateWithGradient() is provided withan argument of type traits::UnconstructableType —which is just a class with a non-public constructor, that cannot becreated; and thus, this function cannot be called. It is necessary to have this unusable version of
EvaluateWithGradient() ,though, because the design pattern requires that
AddEvaluateWithGradient always provides a method with the name
EvaluateWithGradient() .This does, however, mean that users can get long and confusing error messages if the optimizer attempts to instantiatethe overload of
EvaluateWithGradient() with traits::UnconstructableType . But, remember that this situation canonly happen when using a differentiable optimizer if the user provided a
FunctionType that (a) does not have both
Evaluate() or Gradient() , or (b) does not have
EvaluateWithGradient() . This is something that we already havecode to detect—that is the code that computes the values of the boolean template parameters
HasEvaluateGradient and
HasEvaluateWithGradient . Therefore, we simply use a static_assert() to provide a clear and understandable errormessage at compile time when neither of those values are true . We encapsulate this in a convenient function that can beadded at the beginning of the
Optimize() method:
CheckFunctionTypeAPI
This technique is adapted to each objective function type that ensmallen supports, in order to provide straightforwarderror messages of the form shown in Figure 17. This static_assert() failure output appears typically at the end of the erroroutput, which is far preferable to the output without
CheckFunctionTypeAPI . Example output produced by clang++ is shownin Figure 18.Note that the error message above can be highly misleading: it’s not actually a requirement that
EvaluateWithGradient() be supplied by the objective function class! That can be automatically inferred, just like discussed above, but only if template < typename FunctionType, typename
MatType, typename
GradType> class AddEvaluateWithGradient
MatType::elem_type EvaluateWithGradient( const
MatType& coordinates,GradType& gradient){ const typename
MatType::elem_type objective = static_cast
Figure 16: Partial template specialization of
AddEvaluateWithGradient for when
EvaluateWithGradient() is notavailable, but
Evaluate() and
Gradient() are. ..include/ensmallen_bits/function/static_checks.hpp:292:3: error: static_assertfailed due to requirement'CheckEvaluateWithGradient
Figure 17: Example error output produced by clang++ when the given objective function is missing methods and static_assert() s are used. Compare with Figure 18, which is far less clear. ...include/ensmallen_bits/lbfgs/lbfgs_impl.hpp:376:30: error: no matching memberfunction for call to 'EvaluateWithGradient'ElemType functionValue = f.EvaluateWithGradient(iterate, gradient);~~^~~~~~~~~~~~~~~~~~~~include/ensmallen_bits/lbfgs/lbfgs.hpp:97:12: note: in instantiation of functiontemplate specialization 'ens::L_BFGS::Optimize
Figure 18: Example error output produced by clang++ when the given objective function is missing methods, and static_assert() s are not used. Compare with Figure 17, which is much clearer.
Evaluate() and
Gradient() are both available. In the case of the example code that generated the errors above,
Gradient() was not implemented—thus, the given error message is actually somewhat inaccurate and confusing!The last piece of the puzzle is making sure that when
Function<...>::EvaluateWithGradient() is called, that thisalways calls
AddEvaluateWithGradient::EvaluateWithGradient() . This can be done by a simple using declaration inthe body of the
Function class: using
AddEvaluateWithGradient
This is all that is needed to infer and synthesize an
EvaluateWithGradient() method when the user provided an objectivefunction that only has
Evaluate() and
Gradient() . There is one detail we have omitted, though—the code that we haveshown here handles non- const and non- static versions of methods provided by the user. Separate auxiliary structures like
AddEvaluateWithGradientConst and
AddEvaluateWithGradientStatic are used for function inference in those cases; thegeneral design is identical.All of these pieces put together result in a clean interface that inference of methods that users did not provide in theirobjective function. Further, this all happens at compile time and thus there is no runtime penalty. Auxiliary structures like
AddEvaluateWithGradient should be optimized out by the compiler.All of the code involved with automatic generation of missing methods can be found in the ensmallen source code in theirectory include/ensmallen_bits/function/ . Appendix B: Callbacks – Details
In Section 3.2, we introduced ensmallen ’s support for callbacks. This support is implemented via templates—like muchof the rest of ensmallen . This design strategy was chosen in order to minimize runtime overhead caused by callbacks, andproduce machine code roughly equivalent to the machine code that would have been produced if the callback had been directlyintegrated into the optimizer’s code.In this section, we detail the implementation of these callbacks and show how to implement optimizers that are capable ofhandling these callbacks.An optimizer that supports callbacks should call out to the static methods in the
Callback class inside of its
Optimize() method, as below: double functionValue = function.Evaluate(coordinates); bool terminate = Callback::Evaluate(* this , function, coordinates, functionValue, callbacks...);
In the above snippet, function represents the function being optimized, *this is the optimizer itself, coordinates represents the current coordinates of the optimization, and callbacks is a template vararg pack containing all of the givencallbacks. Note that this means callbacks can be empty.The
Callback class contains a top-level method to use for each of the callback types supported by ensmallen , as listed inTable 2. There is a
Callback::Gradient() method,
Callback::EvaluateConstraint() , Callback::BeginEpoch() , andso forth.The
Callback::Evaluate() method calls the
Evaluate() method of every given callback that has an
Evaluate() method implemented. This means that there is some amount of difficulty we have to handle: not every callback in callbacks... will have an
Evaluate() method available. Thus, we can use the same techniques as in Section 3.1 todetect, via SFINAE, whether an
Evaluate() method exists for a given callback, and then perform the correct action based onthe result.The definition of
Callback::Evaluate() is given in Figure 19. Each callback function takes several template parameters,including
OptimizerType (the type of optimizer that is being used),
FunctionType (the type of function being optimized),
MatType (the matrix type used to store the coordinates), and
CallbackTypes (the set of types of callbacks). Of these,
CallbackTypes is the most important. Since it is a template vararg, the type
CallbackTypes is actually a pack thatcorresponds to every type of every callback that must be called.The implementation of the function unpacks the callbacks , calling each callback’s
Evaluate() method (if it exists) insequence, and terminating early if any of these callbacks returns true . Each callback’s
Evaluate() method is called throughthe helper function
Callback::EvaluateFunction() , which uses SFINAE traits to control behavior depending on whetherthe given callback has an
Evaluate() method or not. Figure B shows the two overloads of
Callback::EvaluateFunction() . template < typename OptimizerType, typename
FunctionType, typename
MatType, typename ... CallbackTypes> static bool
Callback::Evaluate(OptimizerType& optimizer,FunctionType& function, const
MatType& coordinates, const double objective,CallbackTypes&... callbacks){ // This will return immediately once a callback returns true. bool result = false;( void ) std::initializer_list< bool >{ result =result || Callback::EvaluateFunction(callbacks, optimizer, function,coordinates, objective)... }; return result;}
Figure 19: Implementation of
Callback::Evaluate() , showing part of the process of calling each callback given in thetemplate vararg pack callbacks . emplate < typename CallbackType, typename
OptimizerType, typename
FunctionType, typename
MatType> static typename std::enable_if
MatType& coordinates, const double objective){ return ( const_cast
OptimizerType, typename
FunctionType, typename
MatType> static typename std::enable_if::value, bool >::typeEvaluateFunction(CallbackType& /* callback */,OptimizerType& /* optimizer */,FunctionType& /* function */, const
MatType& /* coordinates */, const double /* objective */){ return false; }
Figure 20: Implementation of
Callback::EvaluateFunction() . Two overloads are specified: the first is used when
CallbackType has an
Evaluate() method, and the second is used when
CallbackType does not have an
Evaluate() method.SFINAE is used, via callbacks::traits::HasEvaluateSignature , to determine which overload to use.Similar to Section 3.1, a traits class is used to determine which of the two overloads of
EvaluateFunction() to call. Thetraits class callbacks::traits::HasEvaluateSignature<...>::value evaluates to true only when the inner callbackand arguments can form a valid
Evaluate() signature. Thus, the correct overload is called when
EvaluateFunction() iscalled from
Callback::Evaluate() .Each of the other callbacks in the
Callback class is implemented in virtually identical form, although with differentarguments depending on what the callback is. For further details on the callback infrastructure, examples are provided in thecode repository in the directory include/ensmallen_bits/callbacks/ . Appendix C: Static Assertions to Check Matrix Traits
In Section 3.3, we showed that ensmallen uses templates to allow arbitrary types to be used for the matrix type or gradient type.However, care must be taken there to avoid a drawback with C++ template metaprogramming, which is the issue of compilererrors. In Figure 7, the compiler may fail to substitute a given
MatType into the code for
GradientDescent::Optimize() .A typical reason might be that a required method of
MatType or FunctionType is not available. For example, the given
MatType does not have an operator() method. Ordinarily, current C++ compilers typically emit a long list of error messages,with unnecessarily detailed and distracting information. The internal framework in ensmallen avoids this issue via the use ofC++ compile-time static assertions ( static_assert() ), resulting in considerably clearer error messages.The framework provides a set of utility methods for checking traits of
MatType and
GradType (for differentiable objectivefunctions) inside of an optimizer’s
Optimize() method:•
RequireFloatingPointType
MatType is either float or double . This is generally needed by optimizers that use Armadillo functionality which ends up calling LAPACKfunctions [7].• RequireSameInternalTypes
Mat1Type and
Mat2Type use the same typeto hold elements. This is useful for differentiable optimizers, where the user is allowed to specify different objectivematrix types (
MatType ) and gradient matrix types (
GradType ). This function allows us to require that the same type(e.g., int or float or double ) is used for both MatType and
GradType . *** Require that the internal element type of the matrix type and gradient type* are the same. A static_assert() will fail if not.*/ template < typename MatType, typename
GradType> void
RequireSameInternalTypes(){ static_assert (std::is_same< typename
MatType::elem_type, typename
GradType::elem_type>::value,"The internal element types of the given MatType and GradType must be ""identical, or it is not known to work! If you would like to try ""anyway, set the preprocessor macro ENS_DISABLE_TYPE_CHECKS before ""including ensmallen.hpp. However, you get to pick up all the pieces if ""there is a failure!");
Figure 21: Implementation of
RequireSameInternalTypes() from internal ensmallen code.•
RequireDenseFloatingPointType
MatType is either float or double , and that the representation used for storage by MatType is dense.These methods would typically be called at the top of the implementation of an optimizer’s
Optimize() method.At the time of writing, these are the only three checks that have been needed, but it is easy to add more. The implementationof these functions is just a static_assert() that uses some underlying traits of the given template parameters. For example,Figure 21 is the implementation of
RequireSameInternalTypes<...>() .Note that users who would like to continue despite these static_assert() checks failing may simply define the macro
ENS_DISABLE_TYPE_CHECKS in their code before the line . eferences [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al.TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.[2] D. Abrahams and A. Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost andBeyond (C++ in Depth Series) . Addison-Wesley Professional, 2004.[3] S. Agarwal, K. Mierle, et al. Ceres solver. http://ceres-solver.org .[4] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, et al.Theano: A Python framework for fast computation of mathematical expressions. arXiv pre-print , 1605.02688, May2016.[5] A. Alexandrescu.
Modern C++ design: generic programming and design patterns applied . Addison-Wesley, 2001.[6] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In
Proceedings of the 49th AnnualACM SIGACT Symposium on Theory of Computing , pages 1200–1205. ACM, 2017.[7] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling,A. McKenney, and D. Sorensen.
LAPACK Users’ Guide . SIAM, 1999.[8] A. Athanasopoulos, A. Dimou, V. Mezaris, and I. Kompatsiaris. GPU acceleration for support vector machines. In
Proceedings of 12th International Workshop on Image Analysis for Multimedia Interactive Services, Delft, Netherlands ,pages 17–55, 2011.[9] E. Bacry, M. Bompaire, P. Deegan, S. Gaïffas, and S. V. Poulsen. tick: a Python library for statistical learning, with anemphasis on hawkes processes and time-dependent models.
Journal of Machine Learning Research , 18(214):1–5, 2018.[10] S. Bhardwaj, R. R. Curtin, M. Edel, Y. Mentekidis, and C. Sanderson. ensmallen: a flexible C++ library for efficientfunction optimization. In
Workshop on Systems and ML and Open Source Software at NeurIPS , 2018.[11] S. Burer and R. D. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rankfactorization.
Mathematical Programming , 95(2):329–357, 2003.[12] F. Chollet et al. Keras. https://github.com/keras-team/keras , 2015.[13] A. Costa and G. Nannicini. RBFOpt: an open-source library for black-box optimization with costly function evaluations.
Mathematical Programming Computation , 10(4):597–629, 2018.[14] R. R. Curtin, M. Edel, M. Lozhnikov, Y. Mentekidis, S. Ghaisas, and S. Zhang. mlpack 3: a fast, flexible machinelearning library.
Journal of Open Source Software , 3:726, 2018.[15] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. arXiv pre-print , 1711.00141, 2017.[16] S. De, A. K. Yadav, D. W. Jacobs, and T. Goldstein. Big batch SGD: automated inference using adaptive batch sizes. arXiv pre-print , 1610.05792, 2017.[17] T. Dozat. Incorporating Nesterov momentum into Adam. Technical report, Stanford University, 2015.[18] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization.
Journalof Machine Learning Research , 12(Jul):2121–2159, 2011.[19] D. Dueri, B. Açıkme¸se, D. P. Scharf, and M. W. Harris. Customized real-time interior-point methods for onboardpowered-descent guidance.
Journal of Guidance, Control, and Dynamics , 40(2):197–212, 2016.[20] J. W. Eaton, D. Bateman, S. Hauberg, and R. Wehbring.
GNU Octave version 4.4.0 manual: a high-level interactivelanguage for numerical computations , 2018.[21] S. Funk. RMSprop loses to SMORMS3 - Beware the Epsilon!, 2015. http://sifter.org/ simon/journal/20150420.html.[22] D. Goldberg. What every computer scientist should know about floating-point arithmetic.
ACM Computing Surveys(CSUR) , 23(1), 1991.[23] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies.
Evol. Comput. ,9(2):159–195, 2001.[24] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot ensembles: Train 1, get m for free.In
Proceedings of the International Conference on Learning Representations (ICLR) , 2017.[25] M. Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In
International Conference on MachineLearning , pages 427–435, 2013.[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutionalarchitecture for fast feature embedding. arXiv pre-print , 1408.5093, 2014.[27] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In
Proceedingsof the 26th International Conference on Neural Information Processing Systems - Volume 1 , NIPS’13, pages 315–323.Curran Associates Inc., 2013.[28] S. G. Johnson. The NLopt nonlinear-optimization package. https://github.com/stevengj/nlopt.[29] E. Jones, T. Oliphant, and P. Peterson. SciPy: open source scientific tools for Python, 2014. .30] J. Jones. Abstract syntax tree implementation idioms. In
Proceedings of the 10th Conference on Pattern Languages ofPrograms , 2003.[31] J. Kennedy and R. C. Eberhart. Particle swarm optimization. In
Proceedings of the IEEE International Conference onNeural Networks , pages 1942–1948, 1995.[32] N. S. Keskar and R. Socher. Improving generalization performance by switching from Adam to SGD. arXiv pre-print ,1712.07628, 2017.[33] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv pre-print , 1412.6980, 2014.[34] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing.
Science , 220(4598):671–680, 1983.[35] J. Koushik and H. Hayashi. Improving stochastic gradient descent with feedback. arXiv pre-print , 1611.01505, 2016.[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In
Advances in Neural Information Processing Systems , pages 1097–1105, 2012.[37] S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVM-based Python JIT compiler. In
Second Workshop on the LLVMCompiler Infrastructure in HPC , page 7, 2015.[38] J. Langford, L. Li, and A. Strehl. Vowpal wabbit open source project. Technical report, Yahoo!, 2007.[39] F. Q. Lauzon. An introduction to deep learning. In
International Conference on Information Science, Signal Processingand their Applications , pages 1438–1439, 2012.[40] M. Lichman. UCI Machine Learning Repository, 2013. http://archive.ics.uci.edu/ml .[41] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization.
MathematicalProgramming , 45(1-3):503–528, 1989.[42] I. Loshchilov and F. Hutter. SGDR: stochastic gradient descent with restarts. arXiv pre-print , 1608.03983, 2016.[43] L. Luo, Y. Xiong, Y. Liu, and X. Sun. Adaptive gradient methods with dynamic bound of learning rate. In
Proceedingsof the 7th International Conference on Learning Representations , New Orleans, Louisiana, May 2019.[44] J. Ma and D. Yarats. Quasi-hyperbolic momentum and Adam for deep learning. In
International Conference on LearningRepresentations , 2019.[45] D. Maclaurin, D. Duvenaud, and R. P. Adams. Autograd: Effortless gradients in NumPy. In
AutoML Workshop at ICML ,2015.[46] G. Manogaran and D. Lopez. Health data analytics using scalable logistic regression with stochastic gradient descent.
International Journal of Advanced Intelligence Paradigms , 10(1-2):118–132, 2018.[47] MathWorks. fminsearch
Journal of Open SourceSoftware , 3(24):615, 2018.[50] A. Mokhtari, M. Eisen, and A. Ribeiro. IQN: An incremental quasi-Newton method with local superlinear convergencerate. arXiv pre-print , 1702.00709, 2017.[51] D. J. Montana and L. Davis. Training feedforward neural networks using genetic algorithms. In
International JointConference on Artificial Intelligence , volume 1, pages 762–767, 1989.[52] Y. Nesterov. A method of solving a convex programming problem with convergence rate o (1 /k ) . Technical report,Soviet Math. Dokl., 1983.[53] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáˇc. SARAH: A novel method for machine learning problems usingstochastic recursive gradient. arXiv pre-print , 1703.00102, 2017.[54] J. Nocedal and S. Wright. Numerical Optimization . Springer, 2nd edition, 2006.[55] K.-S. Oh and K. Jung. Gpu implementation of neural networks.
Pattern Recognition , 37(6):1311–1314, 2004.[56] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al.PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer,F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems 32 , pages8024–8035. Curran Associates, Inc., 2019.[57] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, et al.Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research , 12(Oct):2825–2830, 2011.[58] L. Perron and V. Furnon. OR-tools, 2019-7-19. https://developers.google.com/optimization/.[59] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In
Advances in Neural Information Processing Systems , pages 693–701, 2011.[60] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. arXiv pre-print , 1904.09237, 2019.[61] J. Revels, M. Lubin, and T. Papamarkou. Forward-mode automatic differentiation in Julia. arXiv:1607.07892 , 2016.[62] H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function.
The Computer Journal ,:175–184, 1960.[63] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors.
Nature ,323:533–536, 1986.[64] C. Sanderson and R. Curtin. A user-friendly hybrid sparse matrix class in C++. In
Lecture Notes in Computer Science(LNCS), Vol. 10931 , pages 422–430, 2018.[65] C. Sanderson and R. R. Curtin. Armadillo: a template-based C++ library for linear algebra.
Journal of Open SourceSoftware , 1(2):26, 2016.[66] C. Sanderson and R. R. Curtin. Practical sparse matrices in C++ with hybrid storage and template-based expressionoptimisation.
Mathematical and Computational Applications , 24(3), 2019.[67] A. Schoenauer-Sebag, M. Schoenauer, and M. Sebag. Stochastic gradient descent: Going as fast as possible but notfaster. arXiv pre-print , 1709.01427, 2017.[68] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1 regularized loss minimization. In
Proceedings of the 26thAnnual International Conference on Machine Learning , 2009.[69] Y. Smaragdakis and D. Batory. Mixin-based programming in C++. In
International Symposium on Generative andComponent-Based Software Engineering, Lecture Notes in Computer Science , volume 2177, pages 164–178. Springer,2000.[70] S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc. TheSHOGUN machine learning toolbox.
Journal of Machine Learning Research , 11:1799–1802, 2010.[71] J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation.
IEEETransactions on Automatic Control , 37(3):332–341, 1992.[72] A. St. Laurent.
Understanding Open Source and Free Software Licensing . O’Reilly Media, 2008.[73] R. Storn and K. Price. Differential evolution–a simple and efficient heuristic for global optimization over continuousspaces.
Journal of Global Optimization , 11(4):341–359, 1997.[74] T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.
COURSERA: Neural networks for machine learning , 4(2):26–31, 2012.[75] E. Van den Berg and M. P. Friedlander. Sparse optimization with least-squares constraints.
SIAM Journal onOptimization , 21(4):1201–1229, 2011.[76] L. Vandenberghe. The CVXOPT linear and quadratic cone program solvers, 2010. http://cvxopt.org.[77] L. Vandenberghe and S. Boyd. Semidefinite programming.
SIAM Review , 38(1):49–95, 1996.[78] D. Vandevoorde and N. M. Josuttis.
C++ Templates: The Complete Guide . Addison-Wesley, 2nd edition, 2017.[79] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on CPUs. In
Deep Learning andUnsupervised Feature Learning Workshop at NIPS , 2011.[80] T. L. Veldhuizen. C++ templates as partial evaluation. arXiv preprint cs/9810010 , 1998.[81] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, et al. SciPy1.0–Fundamental Algorithms for Scientific Computing in Python. arXiv pre-print , 1907.10121, 2019.[82] X. Wu, R. Ward, and L. Bottou. WNGrad: Learn the learning rate in gradient descent. arXiv pre-print , 1803.02865,2018.[83] M. D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv pre-print , 1212.5701, 2012.[84] M. R. Zhang, J. Lucas, G. E. Hinton, and J. Ba. Lookahead optimizer: k steps forward, 1 step back. arXiv pre-print ,1907.08610, 2019.[85] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In
Proceedingsof the Twenty-First International Conference on Machine Learning , page 116, 2004.[86] S. Zheng and J. T. Kwok. Follow the moving leader in deep learning. In