Robust and Efficient Optimization Using a Marquardt-Levenberg Algorithm with R Package marqLevAlg
Viviane Philipps, Boris P Hejblum, Mélanie Prague, Daniel Commenges, Cécile Proust-Lima
Viviane Philipps
Inserm, Bordeaux Population Health Research Center, UMR 1219, Univ. Bordeaux, F-33000 Bordeaux, France

Boris P. Hejblum, Mélanie Prague
Univ. Bordeaux, Inserm, Bordeaux Population Health Research Center, UMR 1219, Inria BSO SISTM, F-33000 Bordeaux, France

Daniel Commenges, Cécile Proust-Lima
Inserm, Bordeaux Population Health Research Center, UMR 1219, Univ. Bordeaux, F-33000 Bordeaux, France
Abstract
Optimization is an essential task in many computational problems. In statistical modelling for instance, in the absence of analytical solution, maximum likelihood estimators are often retrieved using iterative optimization algorithms. R software already includes a variety of optimizers, from general-purpose optimization algorithms to more specific ones. Among Newton-like methods, which have good convergence properties, the Marquardt-Levenberg algorithm (MLA) provides a particularly robust algorithm for solving optimization problems. Newton-like methods generally have two major limitations: (i) convergence criteria that are a little too loose and do not ensure convergence towards a maximum, and (ii) a calculation time that is often too long, which makes them unusable in complex problems. We propose in the marqLevAlg package an efficient and general implementation of a modified MLA combined with strict convergence criteria and parallel computations. Convergence to saddle points is avoided by using the relative distance to minimum/maximum criterion (RDM) in addition to the stability of the parameters and of the objective function. The RDM exploits the first and second derivatives to compute the distance to a true local maximum. The independent multiple evaluations of the objective function at each iteration, used for computing either first or second derivatives, are run in parallel to allow a theoretical speed-up to the square of the number of parameters. We show through the estimation of 7 relatively complex statistical models how the parallel implementation can largely reduce computational time. We also show, through the estimation of the same model using 3 different algorithms (BFGS of the optim routine, an E-M, and MLA), the superior efficiency of MLA to correctly and consistently reach the maximum.
Keywords: convergence criteria, Marquardt-Levenberg, Newton-Raphson, optimization, parallel computing, R.
1. Introduction
Optimization is an essential task in many computational problems. In statistical modelling for instance, in the absence of analytical solution, maximum likelihood estimators are often retrieved using iterative optimization algorithms.

Steepest descent algorithms are among the most famous general optimization algorithms. They generally consist in updating parameters according to the steepest gradient (gradient descent), possibly scaled by the Hessian in the Newton (Newton-Raphson) algorithm, or by an approximation of the Hessian based on the gradients in the quasi-Newton algorithms (e.g., Broyden-Fletcher-Goldfarb-Shanno, BFGS). Newton-like algorithms have been shown to provide good convergence properties (Joe and Nash 2003) and were demonstrated in particular to behave better than Expectation-Maximization (EM) algorithms in several contexts of maximum likelihood estimation, such as random-effect models (Lindstrom and Bates 1988) or latent class models (Proust and Jacqmin-Gadda 2005). Among Newton methods, the Marquardt-Levenberg algorithm, initially proposed by Levenberg (1944) then Marquardt (1963), combines BFGS and gradient descent methods to provide a more robust optimization algorithm.

The R software includes multiple solutions for optimization tasks (see the CRAN task view on "Optimization and Mathematical Programming" (Theussl, Schwendinger, and Borchers 2014)). In particular the optim function in base R offers different algorithms for general-purpose optimization, and so does optimx, a more recent package extending optim (Nash and Varadhan 2011). Numerous additional packages are available for different contexts, from nonlinear least squares problems (including some exploiting the Marquardt-Levenberg idea, such as minpack.lm (Elzhov, Mullen, Spiess, and Bolker 2016)) to stochastic optimization and algorithms based on the simplex approach. However, the R software could benefit from a general-purpose implementation of the Marquardt-Levenberg algorithm.

Moreover, while optimization can be easily achieved in small dimension, the increasing complexity of statistical models leads to critical issues. First, the large dimension of the objective function can induce excessively long computation times. Second, with complex objective functions, it is more likely to encounter flat regions, so that convergence cannot be assessed according to objective function stability anymore.

To address these two issues, we propose an R implementation of the Marquardt-Levenberg algorithm in the package marqLevAlg, which relies on a stringent convergence criterion based on the first and second derivatives to avoid loose convergence (Prague, Diakite, and Commenges 2012), and which includes (from version 2.0.1) parallel computations within each iteration to speed up convergence in complex settings.

Sections 2 and 3 describe the algorithm and the implementation, respectively. Then Section 4 provides an example of call with the estimation of a linear mixed model. A benchmark of the package is reported in Section 5 with the performances of the parallel implementation. The Marquardt-Levenberg algorithm implementation is also compared with other algorithms on a case example in Section 6. Finally Section 7 concludes.
2. Methodology
2.1. The Marquardt-Levenberg algorithm

The Marquardt-Levenberg algorithm (MLA) can be used for any problem where a function $Q(\theta)$ has to be minimized (or, equivalently, a function $L(\theta) = -Q(\theta)$ has to be maximized) according to a set of $m$ unconstrained parameters $\theta$, as long as the second derivatives of $Q(\theta)$ exist. In statistical applications for instance, the objective function is the deviance to be minimized or the log-likelihood to be maximized.

Our improved MLA iteratively updates the vector $\theta^{(k)}$ from a starting point $\theta^{(0)}$ until convergence, using the following formula at iteration $k+1$:

$$\theta^{(k+1)} = \theta^{(k)} - \delta_k \left(\tilde{H}^{(k)}\right)^{-1} \nabla Q(\theta^{(k)})$$

where $\theta^{(k)}$ is the set of parameters at iteration $k$, $\nabla Q(\theta^{(k)})$ is the gradient of the objective function at iteration $k$, and $\tilde{H}^{(k)}$ is the Hessian matrix $H^{(k)}$ where the diagonal terms are replaced by $\tilde{H}^{(k)}_{ii} = H^{(k)}_{ii} + \lambda_k [(1-\eta_k)|H^{(k)}_{ii}| + \eta_k \mathrm{tr}(H^{(k)})]$. In the original MLA the Hessian matrix is inflated by a scaled identity matrix. Following Fletcher (1971), we consider a refined inflation based on the curvature. The diagonal inflation of our improved MLA makes it an intermediate between the steepest descent method and the Newton method. The parameters $\delta_k$, $\lambda_k$ and $\eta_k$ are scalars specifically determined at each iteration $k$. Parameter $\delta_k$ is fixed to 1 unless the objective function is not reduced, in which case a line search determines the locally optimal step length. Parameters $\lambda_k$ and $\eta_k$ are internally modified in order to ensure that (i) $\tilde{H}^{(k)}$ is positive definite at each iteration $k$, and (ii) $\tilde{H}^{(k)}$ approaches $H^{(k)}$ when $\theta^{(k)}$ approaches $\hat{\theta}$. When the problem has a unique solution, the minimum is reached whatever the chosen initial values.

2.2. Convergence criteria

As in any iterative algorithm, convergence of MLA is achieved when the convergence criteria are fulfilled. In the marqLevAlg package, convergence is defined according to three criteria:

• parameters stability: $\sum_{j=1}^{m} \left(\theta_j^{(k+1)} - \theta_j^{(k)}\right)^2 < \epsilon_a$

• objective function stability: $|Q^{(k+1)} - Q^{(k)}| < \epsilon_b$

• relative distance to minimum/maximum (RDM): $\frac{\nabla Q(\theta^{(k)})^{\top} \left(H^{(k)}\right)^{-1} \nabla Q(\theta^{(k)})}{m} < \epsilon_d$

The last criterion is essential to ensure that an optimum is truly reached: the first two criteria, used in other iterative algorithms, only ensure that the algorithm has reached a saddle point. As the last criterion requires the Hessian matrix to be invertible, it prevents such convergence to a saddle point. When the Hessian is not invertible, the RDM is set to $1 + \epsilon_d$ and the convergence criteria cannot be fulfilled.

Although it constitutes a relevant convergence criterion in any optimization context, the RDM was initially designed for log-likelihood maximization problems, that is cases where $Q(\theta) = -L(\theta)$ with $L$ the log-likelihood. In that context, the RDM can be interpreted as the ratio between the numerical error and the statistical error (Commenges, Jacqmin-Gadda, Proust, and Guedj 2006; Prague, Commenges, Guedj, Drylewicz, and Thiébaut 2013). The three thresholds $\epsilon_a$, $\epsilon_b$ and $\epsilon_d$ can be adjusted, but values around 0.001 are usually sufficient to guarantee a correct convergence. In some complex log-likelihood maximization problems for instance, Prague et al. (2013) showed that the RDM convergence properties remain acceptable provided $\epsilon_d$ is below 0.1 (although the lower the better). A minimal sketch of these convergence checks is given below.
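The following R sketch (our own illustrative code, not the package internals) shows how the three criteria could be evaluated for a candidate iterate, given the gradient grad and the Hessian H of Q at the new point:

converged <- function(theta_old, theta_new, Q_old, Q_new, grad, H,
                      epsa = 0.001, epsb = 0.001, epsd = 0.01) {
  crit_a <- sum((theta_new - theta_old)^2)   # parameters stability
  crit_b <- abs(Q_new - Q_old)               # objective function stability
  m <- length(theta_new)
  Hinv <- tryCatch(solve(H), error = function(e) NULL)
  if (is.null(Hinv)) {
    crit_d <- 1 + epsd                       # singular Hessian: RDM cannot be met
  } else {
    crit_d <- c(t(grad) %*% Hinv %*% grad) / m   # relative distance to optimum
  }
  (crit_a < epsa) && (crit_b < epsb) && (crit_d < epsd)
}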
2.3. Derivatives calculation

The MLA update relies on the first derivatives $\nabla Q(\theta^{(k)})$ and second derivatives $H^{(k)}$ of the objective function $Q(\theta^{(k)})$ at each iteration $k$. The gradient and the Hessian may sometimes be calculated analytically, but in a general framework a numerical approximation can become necessary. In the marqLevAlg package, in the absence of an analytical gradient, the first derivatives are computed by central finite differences; in the absence of an analytical Hessian, the second derivatives are computed using forward finite differences. The step of finite difference for each derivative depends on the value of the involved parameter: it is set to $\max(10^{-7}, 10^{-4}|\theta_j|)$ for parameter $j$.

When both the gradient and the Hessian are to be numerically computed, numerous evaluations of $Q$ are required at each iteration:

• $2 \times m$ evaluations of $Q$ for the numerical approximation of the gradient;

• $\frac{m(m+1)}{2}$ evaluations of $Q$ for the numerical approximation of the Hessian matrix.

The number of derivatives thus grows quadratically with the number $m$ of parameters, and the calculations are per se independent, as they are done for different vectors of parameters $\theta$. When the gradient is analytically calculated, only the second derivatives have to be approximated, requiring $2 \times m$ independent calls to the gradient function; in that case, the complexity increases linearly with $m$. In both cases, and especially when each calculation of a derivative is long and/or $m$ is large, parallel computation of the independent $Q$ evaluations becomes particularly relevant to speed up the estimation process (a small numerical sketch is given at the end of this section).

2.4. Special case of the log-likelihood maximization

When the optimization problem is the maximization of the log-likelihood $L(\theta)$ of a statistical model according to parameters $\theta$, the Hessian matrix of $Q(\theta) = -L(\theta)$ calculated at the optimum $\hat{\theta}$, that is $H_{\hat{\theta}} = -\left.\frac{\partial^2 L(\theta)}{\partial \theta \, \partial \theta^{\top}}\right|_{\theta = \hat{\theta}}$, provides an estimator of the Fisher information matrix. The inverse of $H_{\hat{\theta}}$ computed in the package thus provides an estimator of the variance-covariance matrix of $\hat{\theta}$.
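As announced, here is a minimal R sketch of the central finite differences used for the gradient (an illustrative simplification under our own naming, not the package's actual deriva code; the forward differences for the Hessian follow the same pattern):

num_gradient <- function(fn, theta, ...) {
  m <- length(theta)
  h <- pmax(1e-7, 1e-4 * abs(theta))   # step max(10^-7, 10^-4 |theta_j|)
  g <- numeric(m)
  for (j in seq_len(m)) {
    up <- theta; up[j] <- up[j] + h[j]
    dn <- theta; dn[j] <- dn[j] - h[j]
    g[j] <- (fn(up, ...) - fn(dn, ...)) / (2 * h[j])   # central difference
  }
  g
}

Each of the $2m$ evaluations of fn is independent, which is what makes the parallelization described in Section 3 effective.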
3. Implementation
The call of the marqLevAlg function, or its shortcut mla, is the following:

marqLevAlg(b, m = FALSE, fn, gr = NULL, hess = NULL, maxiter = 500,
           epsa = 0.001, epsb = 0.001, epsd = 0.01, digits = 8,
           print.info = FALSE, blinding = TRUE, multipleTry = 25, nproc = 1,
           clustertype = NULL, file = "", .packages = NULL, minimize = TRUE, ...)
Argument b is the set of initial parameters; alternatively its length m can be entered. fn is the function to optimize; it should take the parameter vector as first argument, and additional arguments are passed in .... Optional gr and hess refer to the functions implementing the analytical calculations of the gradient and the Hessian matrix, respectively. maxiter is the maximum number of iterations. Arguments epsa, epsb and epsd are the thresholds for the three convergence criteria defined in Section 2.2. print.info specifies whether details on each iteration should be printed; such information can be reported in a file if argument file is specified, and digits indicates the number of decimals in the eventually reported information during optimization. blinding is an option allowing the algorithm to go on even when the fn function returns NA, which is then replaced by the arbitrary value of 500,000 (for minimization) and -500,000 (for maximization). Similarly, if an infinite value is found for the chosen initial values, the multipleTry option will internally reshape b (up to multipleTry times) until a finite value is obtained, so that the algorithm can be correctly initialized. The parallel framework is first stated by the nproc argument, which gives the number of cores, and by the clustertype argument (see below). In the case where the fn function depends on R packages, these should be given as a character vector in the .packages argument. Finally, the minimize argument offers the possibility to minimize or maximize the objective function fn; a maximization problem is implemented as the minimization of the opposite function (-fn).

In the absence of analytical gradient calculation, the derivatives are computed in the deriva subfunction with two loops, one for the first derivatives and one for the second derivatives. Both loops are parallelized. The parallelized loops are at most over m(m+1)/2 + m elements for m parameters to estimate, which suggests that the performance could theoretically be improved with up to m(m+1)/2 + m cores. When the gradient is calculated analytically, the deriva subfunction is replaced by the deriva_grad subfunction. It is parallelized in the same way, but the parallelization being executed over m elements, the performance gain is bounded at m cores.

In all cases, the parallelization is achieved using the doParallel and foreach packages. The snow and multicore options of the doParallel backend are kept, making the parallel option of the marqLevAlg package available on all systems. The user specifies the type of parallel environment among FORK, SOCK or MPI in argument clustertype, and the number of cores in nproc. For instance, clustertype = "FORK", nproc = 6 will use the FORK technology and 6 cores. A rough sketch of this mechanism follows.
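The following sketch (our own simplified code, not the package's deriva subfunction) illustrates how independent objective evaluations can be dispatched with foreach and doParallel:

library(doParallel)
library(foreach)

cl <- parallel::makeCluster(2)   # 2-core SOCK cluster; use type = "FORK" on Unix
registerDoParallel(cl)

fn <- function(x) sum(x^2)                          # toy objective function
points <- list(c(1, 0), c(0, 1), c(1, 1), c(2, 2))  # hypothetical shifted vectors
vals <- foreach(p = points, .combine = c) %dopar% fn(p)

parallel::stopCluster(cl)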
4. Example
We illustrate how to use the marqLevAlg function with the maximum likelihood estimation in a linear mixed model (Laird and Ware 1982). Function loglikLMM, available in the package, implements the log-likelihood of a linear mixed model for a dependent outcome vector ordered by subject (argument Y), explained according to a matrix of covariates (argument X) entered in the same order as Y, with a Gaussian individual-specific random intercept and Gaussian independent errors:

loglikLMM(b, Y, X, ni)

Argument b specifies the vector of parameters, with first the regression parameters (length given by the number of columns in X) and then the standard deviations of the random intercept and of the independent error. Finally, argument ni specifies the number of repeated measures for each subject.

We consider the dataset dataEx (available in the package), in which variable Y is repeatedly observed at time t for 500 subjects, along with a binary variable X1 and a continuous variable X3. For the illustration, we specify a linear trajectory over time adjusted for X1, X3 and the interaction between X1 and time t. The vector of parameters to estimate corresponds to the intercept, 4 regression parameters and the 2 standard deviations.

We first define the quantities to include as arguments in the loglikLMM function:

R> Y <- dataEx$Y
R> X <- as.matrix(cbind(1, dataEx[, c("t", "X1", "X3")],
+                       dataEx$t * dataEx$X1))
R> ni <- as.numeric(table(dataEx$i))
The vector of initial parameters to specify in the marqLevAlg call is created with trivial values of 0 for the fixed effects and 1 for the variance components.
R> binit <- c(0, 0, 0, 0, 0, 1, 1)
The maximum likelihood estimation of the linear mixed model in sequential mode is then run using a simple call to the marqLevAlg function for a maximization (with argument minimize = FALSE):

R> estim <- marqLevAlg(b = binit, fn = loglikLMM, minimize = FALSE,
+                      X = X, Y = Y, ni = ni)
R> estim

Robust marqLevAlg algorithm

marqLevAlg(b = binit, fn = loglikLMM, minimize = FALSE, X = X,
    Y = Y, ni = ni)

Iteration process:
     Number of parameters: 7
     Number of iterations: 18
     Optimized objective function: -6836.754
     Convergence criteria satisfied

Convergence criteria: parameters stability= 3.2e-07
                    : objective function stability= 4.35e-06
                    : Matrix inversion for RDM successful
                    : relative distance to maximum(RDM)= 0

Final parameter values:
 50.115  0.106  2.437  2.949 -0.376 -5.618  3.015
The printed output estim shows that the algorithm converged in 18 iterations, with convergence criteria of 3.2e-07, 4.35e-06 and 0 for parameters stability, objective function stability and RDM, respectively. The output also displays the list of coefficient values at the optimum. All this information can also be recovered in the estim object, where item b contains the estimated coefficients.

As mentioned in Section 2.4, in log-likelihood maximization problems the inverse of the Hessian given by the program provides an estimate of the variance-covariance matrix of the coefficients at the optimum. The upper triangle of the inverse Hessian is thus systematically stored in object v. When appropriate, the summary function can output this information with option loglik = TRUE. With this option, the summary also includes the square root of these variances (i.e., the standard errors), the corresponding Wald statistic, the associated p value and the 95% confidence interval boundaries for each parameter:

R> summary(estim, loglik = TRUE)
Robust marqLevAlg algorithm

marqLevAlg(b = binit, fn = loglikLMM, minimize = FALSE, X = X,
    Y = Y, ni = ni)

Iteration process:
     Number of parameters: 7
     Number of iterations: 18
     Optimized objective function: -6836.754
     Convergence criteria satisfied

Convergence criteria: parameters stability= 3.2e-07
                    : objective function stability= 4.35e-06
                    : Matrix inversion for RDM successful
                    : relative distance to maximum(RDM)= 0

Final parameter values:
   coef SE.coef        Wald P.value   binf   bsup
 50.115   0.426 13839.36027   0e+00 49.280 50.950
  0.106   0.026    16.02319   6e-05  0.054  0.157
  2.437   0.550    19.64792   1e-05  1.360  3.515
  2.949   0.032  8416.33202   0e+00  2.886  3.012
 -0.376   0.037   104.82702   0e+00 -0.449 -0.304
 -5.618   0.189   883.19775   0e+00 -5.989 -5.248
  3.015   0.049  3860.64370   0e+00  2.919  3.110
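The content of v can be turned back into a full matrix. A short sketch, assuming the upper triangle is stored column-wise and using the estim object above:

R> m <- length(estim$b)
R> V <- matrix(0, m, m)
R> V[upper.tri(V, diag = TRUE)] <- estim$v  # refill the upper triangle
R> V <- V + t(V) - diag(diag(V))            # symmetrize
R> sqrt(diag(V))                            # standard errors, as in the summary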
The exact same model can also be estimated in parallel mode using the FORK implementation of parallelism (here with two cores):
R> estim2 <- marqLevAlg(b = binit, fn = loglikLMM, minimize = FALSE,
+                       nproc = 2, clustertype = "FORK",
+                       X = X, Y = Y, ni = ni)
It can also be estimated by using the analytical gradient (provided in the gradient function gradLMM, which takes the same arguments as loglikLMM):

R> estim3 <- marqLevAlg(b = binit, fn = loglikLMM, gr = gradLMM,
+                       minimize = FALSE, X = X, Y = Y, ni = ni)
In all three situations, the program converges to the same maximum, as shown in Table 1 for the estimation process and in Table 2 for the parameter estimates. The iteration process is identical when using either the sequential or the parallel code (number of iterations, final convergence criteria, etc.). It necessarily differs slightly when using the analytical gradient, as the computation steps are not identical (e.g., here it converges in 15 iterations rather than 18), but all the final results are identical.
5. Benchmark
We aimed at evaluating and comparing the performances of the parallelization in some time-consuming examples. We focused on three examples of sophisticated models from the mixed models area estimated by maximum likelihood. These examples rely on packages using three different languages, thus illustrating the behavior of the marqLevAlg package with a program exclusively written in R (JM, Rizopoulos (2010)), and with programs including Rcpp (CInLPN, Taddé, Jacqmin-Gadda, Dartigues, Commenges, and Proust-Lima (2019)) and Fortran90 (lcmm, Proust-Lima, Philipps, and Liquet (2017)), languages widely used in complex situations.

We first describe the generated dataset on which the benchmark has been realized. We then introduce each statistical model and associated program. Finally, we detail the results obtained with the three programs. Each time, the model has been estimated sequentially and with a varying number of cores in order to provide the program speed-up; a sketch of the procedure is given below. We used a Linux cluster with 32-core machines and 100 replicates to assess the variability. Codes and dataset used in this section are available at https://github.com/VivianePhilipps/marqLevAlgPaper.
We generated a dataset of 20,000 subjects having repeated measurements of a marker Ycens (measured at times t) up to a right-censored time of event tsurv, with the indicator event that the event occurred. The data were generated according to a 4 latent class joint model (Proust-Lima, Séne, Taylor, and Jacqmin-Gadda 2014). This model assumes that the population is divided in 4 latent classes, each class having a specific trajectory of the marker defined according to a linear mixed model with class-specific parameters, and a specific risk of event defined according to a parametric proportional hazard model with class-specific parameters too. The class-specific linear mixed model included a basis of natural cubic splines with 3 equidistant knots taken at times 5, 10 and 15, associated with fixed and correlated random effects. The proportional hazard model included a class-specific Weibull risk adjusted on 3 covariates: one binary (Bernoulli with 50% probability) and two continuous variables (standard Gaussian, and Gaussian with mean 45 and standard deviation 8). The proportion of individuals in each class is about 22%, 17%, 34% and 27% in the sample.

Below are the first five rows of the first three subjects:

    i class X1        X2       X3 t    Ycens     tsurv event
1   1     2  0 0.6472205 43.42920 0 61.10632 20.000000     0
2   1     2  0 0.6472205 43.42920 1 60.76988 20.000000     0
3   1     2  0 0.6472205 43.42920 2 58.72617 20.000000     0
4   1     2  0 0.6472205 43.42920 3 56.76015 20.000000     0
5   1     2  0 0.6472205 43.42920 4 54.04558 20.000000     0
22  2     1  0 0.3954846 43.46060 0 37.95302  3.763148     1
23  2     1  0 0.3954846 43.46060 1 34.48660  3.763148     1
24  2     1  0 0.3954846 43.46060 2 31.39679  3.763148     1
25  2     1  0 0.3954846 43.46060 3 27.81427  3.763148     1
26  2     1  0 0.3954846 43.46060 4       NA  3.763148     1
43  3     3  0 1.0660837 42.08057 0 51.60877 15.396958     1
44  3     3  0 1.0660837 42.08057 1 53.80671 15.396958     1
45  3     3  0 1.0660837 42.08057 2 51.11840 15.396958     1
46  3     3  0 1.0660837 42.08057 3 50.64331 15.396958     1
47  3     3  0 1.0660837 42.08057 4 50.87873 15.396958     1
Joint shared random effect model for a longitudinal marker and a time to event: package JM

The maximum likelihood estimation of joint shared random effect models has been made available in R with the JM package (Rizopoulos 2010). The implemented optimization functions are optim and nlminb; we added the marqLevAlg function for the purpose of this example. We considered a subsample of the simulated dataset, consisting in 5,000 randomly selected subjects.

The joint shared random effect model is divided into two submodels jointly estimated:

• a linear mixed submodel for the repeated marker $Y$ measured at times $t_{ij}$ ($j = 1, ..., n_i$):

$$Y_i(t_{ij}) = \tilde{Y}_i(t_{ij}) + \varepsilon_{ij} = X_i(t_{ij})\beta + Z_i(t_{ij})u_i + \varepsilon_{ij}$$

where, in our example, $X_i(t)$ contained the intercept, the class indicator, the 3 simulated covariates and a basis of natural cubic splines on time $t$ (with 2 internal knots); $Z_i(t)$ contained the intercept and the same basis of natural cubic splines on time $t$, and was associated with $u_i$, the 4-vector of correlated Gaussian random effects; $\varepsilon_{ij}$ was the independent Gaussian error.

• a survival submodel for the right-censored time-to-event:

$$\alpha_i(t) = \alpha_0(t)\exp(X_{si}\gamma + \eta\tilde{Y}_i(t))$$

where, in our example, the vector $X_{si}$, containing the 3 simulated covariates, was associated with the vector of parameters $\gamma$; the current underlying level of the marker $\tilde{Y}_i(t)$ was associated with parameter $\eta$, and the baseline hazard $\alpha_0(t)$ was defined using a basis of B-splines with 1 interior knot.

The length of the total vector of parameters $\theta$ to estimate was 40 (20 fixed effects and 11 variance component parameters in the longitudinal submodel, and 9 parameters in the survival submodel). One particularity of this model is that the log-likelihood does not have a closed form: it involves an integral over the random effects (here, of dimension 4), which is numerically computed using an adaptive Gauss-Hermite quadrature with 3 integration points in this example. As package JM includes an analytical computation of the gradient, we ran two estimations, one with the analytical gradient and one with the numerical approximation, to compare the speed-ups and execution times.
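For reference, a hedged sketch of the kind of JM call involved (illustrative only: the dataset names longData5000 and survData5000 are hypothetical, and the control settings are simplified):

R> library(splines); library(nlme); library(survival); library(JM)
R> lmeFit <- lme(Ycens ~ class + X1 + X2 + X3 + ns(t, knots = c(5, 10)),
+                random = ~ ns(t, knots = c(5, 10)) | i,
+                data = longData5000, na.action = na.omit)
R> coxFit <- coxph(Surv(tsurv, event) ~ X1 + X2 + X3,
+                  data = survData5000, x = TRUE)
R> jmFit <- jointModel(lmeFit, coxFit, timeVar = "t",
+                      method = "spline-PH-aGH")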
Latent class linear mixed model: package lcmm

The second example is a latent class linear mixed model, as implemented in the hlme function of the lcmm R package. The function uses a previous implementation of the Marquardt algorithm coded in Fortran90 and in sequential mode. For the purpose of this example, we extracted the log-likelihood computation programmed in Fortran90 to be used with the marqLevAlg package.

The latent class linear mixed model consists in two submodels estimated jointly:

• a multinomial logistic regression for the latent class membership ($c_i$):

$$P(c_i = g) = \frac{\exp(W_i \zeta_g)}{\sum_{l=1}^{G} \exp(W_i \zeta_l)} \quad \text{with } g = 1, ..., G$$

where $\zeta_G = 0$ for identifiability and $W_i$ contained an intercept and the 3 covariates;

• a linear mixed model specific to each latent class $g$ for the repeated outcome $Y$ measured at times $t_{ij}$ ($j = 1, ..., n_i$):

$$Y_i(t_{ij} \mid c_i = g) = X_i(t_{ij})\beta_g + Z_i(t_{ij})u_{ig} + \varepsilon_{ij}$$

where, in this example, $X_i(t)$ and $Z_i(t)$ contained an intercept, time $t$ and quadratic time. The vector $u_{ig}$ of correlated Gaussian random effects had a proportional variance across latent classes, and the $\varepsilon_{ij}$ were independent Gaussian errors.

The log-likelihood of this model has a closed form, but it involves the logarithm of a sum over latent classes, which can become computationally demanding. We estimated the model on the total sample of 20,000 subjects with 1, 2, 3 and 4 latent classes, which corresponded to 10, 18, 26 and 34 parameters to estimate, respectively.
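As an indication, a 2-class version of this model could be specified with hlme as follows (an illustrative sketch; simData stands for the simulated dataset, and nwg = TRUE encodes the class-proportional random-effect variance):

R> library(lcmm)
R> fit2 <- hlme(Ycens ~ t + I(t^2), random = ~ t + I(t^2),
+               mixture = ~ t + I(t^2), classmb = ~ X1 + X2 + X3,
+               subject = "i", ng = 2, nwg = TRUE, data = simData)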
Multivariate latent process mixed model: package CInLPN
The last example is provided by the CInLPN package, which relies on Rcpp. The function fits a multivariate linear mixed model combined with a system of difference equations in order to retrieve temporal influences between several repeated markers (Taddé et al. 2019). Briefly, three markers L1, L2 and L3 were repeatedly measured over time. The model relates each marker $k$ ($k = 1, ..., 3$) measured at observation times $t_{ijk}$ ($j = 1, ..., T$) to its underlying level $\Lambda_{ik}(t_{ijk})$ as follows:

$$L_{ik}(t_{ijk}) = \eta_{0k} + \eta_{1k}\Lambda_{ik}(t_{ijk}) + \epsilon_{ijk}$$

where the $\epsilon_{ijk}$ are independent Gaussian errors and ($\eta_{0k}$, $\eta_{1k}$) are parameters to estimate. Simultaneously, the structural model defines the initial state at time 0 ($\Lambda_{ik}(0)$) and the change over time at subsequent times $t$, with $\delta$ a discretization step:

$$\Lambda_{ik}(0) = \beta_{0k} + u_{ik}$$

$$\frac{\Lambda_{ik}(t+\delta) - \Lambda_{ik}(t)}{\delta} = \gamma_{0k} + v_{ik} + \sum_{l=1}^{K} a_{kl}\Lambda_{il}(t)$$

where $u_{ik}$ and $v_{ik}$ are Gaussian random effects. Again, the log-likelihood of this model, which depends on 27 parameters, has a closed form, but it may involve complex calculations.

Results

All the models have been estimated with 1, 2, 3, 4, 6, 8, 10, 15, 20, 25 and 30 cores. To fairly compare the execution times, we ensured that changing the number of cores did not affect the final estimation point or the number of iterations needed to converge. The means of the speed-ups over the 100 replicates are reported in Table 3 and plotted in Figure 1.

The joint shared random effect model (JM) converged in 16 iterations after 4279 seconds in sequential mode when using the analytical gradient. Running the algorithm in parallel on 2 cores made the execution 1.85 times shorter. Computational time was gradually reduced with a number of cores between 2 and 10, to reach a maximal speed-up slightly above 4. With 15, 20, 25 or 30 cores, the performances were no longer improved, the speed-up even showing a slight reduction, probably due to the overhead. In contrast, when the program involved numerical computations of the gradient, the parallelization reduced the computation time by a factor of almost 8 at maximum. The better speed-up performances with a numerical gradient calculation were expected, since the parallel loops iterate over more elements.

                                          JM              hlme              CInLPN
                                  analytic numeric  G=1   G=2    G=3    G=4
Number of parameters                    40     40    10    18     26     34      27
Number of iterations                    16     16    30    30     30     30      13
Number of elements in foreach loop      40    860    65   189    377    629     405
Sequential time (seconds)             4279  14737   680  3703  10402  22421     272
Speed up with 2 cores                 1.85   1.93  1.78  1.93   1.94   1.96    1.89
Speed up with 3 cores                 2.40   2.80  2.35  2.81   2.88   2.92    2.75
Speed up with 4 cores                 2.97   3.57  2.90  3.58   3.80   3.87    3.56
Speed up with 6 cores                 3.66   4.90  3.49  5.01   5.44   5.66    4.95
Speed up with 8 cores                 4.15   5.84  3.71  5.84   6.90   7.26    5.96
Speed up with 10 cores                4.23   6.69  3.98  6.70   8.14   8.96    6.89
Speed up with 15 cores                4.32   7.24  3.59  7.29  10.78  12.25    8.14
Speed up with 20 cores                4.28   7.61  3.11  7.71  12.00  15.23    8.36
Speed up with 25 cores                3.76   7.29  2.60  7.37  12.30  16.84    8.11
Speed up with 30 cores                3.41   6.82  2.47  6.82  13.33  17.89    7.83
Table 3: Estimation process characteristics for the 3 different programs (JM, hlme and CInLPN). Analytic and numeric refer to the analytical and numerical computations of the gradient in JM; G refers to the number of latent classes.

The second example, the latent class mixed model estimation (hlme), showed an improvement of the performances as the complexity of the models increased. The simple linear mixed model (one-class model), like the joint model with analytical gradient, reached a maximum speed-up of about 4 with 10 cores. The two-class mixed model with 18 parameters showed a maximum speed-up of 7.71 with 20 cores. Finally, the 3- and 4-class mixed models reached speed-ups of 13.33 and 17.89 with 30 cores and might still be improved with larger resources.

The running time of the third program (CInLPN) was also progressively reduced with the increasing number of cores, reaching a maximal speed-up of 8.36 with 20 cores.

In these 7 examples, the speed-up systematically reached almost 2 with 2 cores, and it remained interesting with 3 or 4 cores, although some variations in the speed-up performances began to be observed according to the complexity of the objective function computations. This highlights the benefit of the parallel implementation of MLA even on personal computers. As the number of cores continued to increase, the speed-up performances varied a lot. Among our examples, the most promising situation was the one of the latent class mixed model (with the program in Fortran90), where the speed-up was up to 15 for 20 cores with the 4-class model.

Figure 1: Speed-up performances for the 3 different programs (JM, hlme and CInLPN). Analytic and numeric refer to the analytical and numerical computations of the gradient in JM. The number of parameters was 40 for JM; 10, 18, 26 and 34 for hlme with 1, 2, 3, 4 classes, respectively; and 27 for CInLPN.
6. Comparison with other optimization algorithms
The JM package (Rizopoulos 2010) includes several optimization algorithms, namely the BFGS of the optim function and an internally implemented expectation-maximization technique. It thus offers a nice framework to compare the reliability of MLA to find the maximum likelihood of a joint model with the reliability of other optimization algorithms. We used in this comparison the prothro dataset described in the JM package and elsewhere (Skrondal and Rabe-Hesketh 2004; Andersen, Borgan, Gill, and Keiding 1993). It consists of a randomized trial in which 488 subjects were split into two treatment arms (prednisone versus placebo). Repeated measures of prothrombin ratio were collected over time, as well as time to death.

The longitudinal part of the joint model included a linear trajectory with time in the study, an indicator of first measurement and their interaction with treatment group. Correlated individual random effects on the intercept and on the slope with time were also included. The survival part was a proportional hazard model adjusted for treatment group as well as for the dynamics of the longitudinal outcome, either through the current value of the marker, or its slope, or both. The baseline risk function was approximated by B-splines with one internal knot. The total number of parameters to estimate was 17 or 18 (10 for the longitudinal submodel, and 7 for the survival submodel when considering only the current value of the marker or only its slope, or 8 when both the current level and the slope were considered). The marker initially ranged from 6 to 176 (mean = 79.0, sd = 27.3). To investigate the consistency of the results across different dimensions of the marker, we also considered cases where the marker was rescaled by a factor 0.1 or 10. In these cases, the log-likelihood was rescaled a posteriori to the original dimension of the marker to make the comparisons possible. The starting point was systematically set at the default initial value of the jointModel function, which is the estimation point obtained from the separate linear mixed model and proportional hazard model. Codes and dataset used in this section are available at https://github.com/VivianePhilipps/marqLevAlgPaper.

MLA ran on 3 cores and converged when the three criteria defined in Section 2.2 were satisfied, with tolerance 0.001, 0.001 and 0.01 for the parameters, the likelihood and the RDM, respectively. BFGS converged when the convergence criterion on the log-likelihood was satisfied with the square root of the machine tolerance (about 1.5e-8). The EM algorithm converged when stability on the parameters or on the log-likelihood was satisfied, with tolerance 0.0001 and around 1.5e-8 (i.e., the square root of the machine tolerance), respectively.
Nature of    Algorithm  Scaling  Rescaled log-     Variation of  Variation of  Number of   Time
dependency              factor   likelihood value  value (%)     slope (%)     iterations  (seconds)
value        BFGS       1        -13958.55          -3.73                       120          29.32
value        BFGS       0.1      -13957.91          -0.01                       490         117.20
value        BFGS       10       -13961.54          -9.28                        91          18.30
value        EM         1        -13957.91          -0.29                        66          57.64
value        EM         0.1      -13957.72           0.14                       104          89.98
value        EM         10       -13957.94          -0.59                        62          61.10
value        MLA        1        -13957.69          -0.00                         7          36.08
value        MLA        0.1      -13957.69          -0.00                         5          26.77
value        MLA        10       -13957.69          -0.00                        15          72.57
slope        BFGS       1        -13961.41                        -1.85         251          52.46
slope        BFGS       0.1      -13961.23                        -1.37         391          78.78
slope        BFGS       10       -13980.90                       -13.98         444          86.61
slope        EM         1        -13960.69                         0.18         169         143.20
slope        EM         0.1      -13960.69                         0.03         208         156.80
slope        EM         10       -13960.70                         0.08         156         138.04
slope        MLA        1        -13960.69                        -0.00          10          46.04
slope        MLA        0.1      -13960.69                         0.00          10          46.37
slope        MLA        10       -13960.69                         0.00          14          63.63
both         BFGS       1        -13951.60          15.97        -28.17         164          40.19
both         BFGS       0.1      -13949.82           2.66         -4.63         502         133.82
both         BFGS       10       -13965.25          40.31        -95.26          52          10.85
both         EM         1        -13949.82           4.10         -7.22         159         179.71
both         EM         0.1      -13949.44           1.68         -3.66         156         148.23
both         EM         10       -13950.46          10.67        -16.31         142         197.16
both         MLA        1        -13949.42          -0.00         -0.00          10          51.24
both         MLA        0.1      -13949.42          -0.00          0.00          10          53.32
both         MLA        10       -13949.42           0.00         -0.01          22         118.37
Table 4: Comparison of the convergence obtained by the MLA, BFGS and EM algorithms for the estimation of a joint model for the prothrombin repeated marker (scaled by 1, 0.1 or 10) and time to death, when considering a dependency on the current level of prothrombin ('value'), on the current slope ('slope') or on both ('both'). All the models converged correctly according to the algorithm outputs. We report the final log-likelihood rescaled to scaling factor 1 (for comparison), the percentage of variation of the association parameters ('value' and 'slope' columns) compared to the one obtained with the overall maximum likelihood with scaling 1, the number of iterations, and the running time in seconds.

Table 4 compares the convergence obtained with the three optimization methods, when considering a pseudo-adaptive Gauss-Hermite quadrature with 15 points. All the algorithms converged correctly according to the programs. Although the model for a given association structure is exactly the same, some differences were observed in the final maximum log-likelihood (computed in the original scale of the prothrombin ratio). The final log-likelihood obtained by MLA was always the same whatever the outcome's scaling, showing its consistency. It was also higher than the one obtained using the two other algorithms, showing that BFGS and, to a lesser extent, EM did not systematically converge towards the effective maximum. The difference could go up to 20 points of log-likelihood for BFGS in the example with the current slope of the marker as the association structure. The convergence also differed according to the outcome's scaling with BFGS, and slightly with EM, even though in general the EM algorithm seemed relatively stable in this example. The less stringent convergence of BFGS and, to a lesser extent, of EM also had consequences on the parameter estimates, as roughly illustrated in Table 4 with the percentage of variation in the association parameters of prothrombin dynamics estimated in the survival model (either the current value or the current slope) in comparison with the estimate obtained using MLA, which gives the overall maximum likelihood. The better performance of MLA was not at the expense of the number of iterations, since MLA converged in at most 22 iterations, whereas several hundreds of iterations could be required for EM or BFGS. Note however that one iteration of MLA is much more computationally demanding.

Finally, for BFGS, the problem of convergence is even more apparent when the outcome is scaled by a factor 10. Indeed, the optimal log-likelihood of the model assuming a bivariate association structure (on the current level and the current slope) is worse than the optimal log-likelihood of its nested model, which assumes an association structure on the current level only (i.e., constraining the parameter for the current slope to 0).
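A comparison of the same flavor can be reproduced on simpler objectives, for instance on the linear mixed model of Section 4 (a sketch; we assume here that the fn.value element of the marqLevAlg fit holds the optimized objective):

R> fitBFGS <- optim(par = binit, fn = loglikLMM, X = X, Y = Y, ni = ni,
+                   method = "BFGS", control = list(fnscale = -1))
R> fitMLA <- marqLevAlg(b = binit, fn = loglikLMM, minimize = FALSE,
+                       X = X, Y = Y, ni = ni)
R> c(BFGS = fitBFGS$value, MLA = fitMLA$fn.value)  # final log-likelihoods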
7. Concluding remarks
We proposed in this paper a general-purpose optimization algorithm based on a robust Marquardt-Levenberg algorithm. The program, written in R and Fortran90, is available in the marqLevAlg R package. It provides a very nice alternative to other optimization packages available in the R software such as optim, roptim (Pan 2020) or optimx (Nash and Varadhan 2011). In particular, as shown in our example with the estimation of joint models, it is more reliable than classical alternatives (EM and BFGS). This is due to the very good convergence properties of the Marquardt-Levenberg algorithm, associated with very stringent convergence criteria based on the first and second derivatives of the objective function, which avoid spurious convergence at saddle points (Commenges et al. 2006). Moreover, optimization procedures available in R were, to our knowledge, all constrained to sequential mode, despite increasingly powerful computers and servers. There is an exception with the Broyden-Fletcher-Goldfarb-Shanno algorithm with box constraints (L-BFGS-B), which was very recently made available in a parallel mode using the independent computations of the objective function at each iteration (Gerber and Furrer 2019). With its parallel implementation of the derivative calculations combined with the very good convergence behavior of MLA, the marqLevAlg package provides a promising solution for the estimation of complex statistical models in R. We have chosen for the moment to parallelize the derivatives, which is very useful for optimization problems involving many parameters. However, we could also easily parallelize the computation of the objective function when the latter is decomposed into independent sub-computations, as is the log-likelihood computed independently over statistical units. This alternative is currently under development.
8. Funding
This work was funded by the French National Research Agency [grant number ANR-18-CE36-0004-01 for project DyMES] and [grant number ANR-2010-PRPS-006 for project MOBYDIQ].
9. Acknowledgments
The computing facilities MCIA (Mésocentre de Calcul Intensif Aquitain) at the Université de Bordeaux and the Université de Pau et des Pays de l'Adour provided advice on parallel computing technologies, as well as computer time.
References
Andersen PK, Borgan O, Gill RD, Keiding N (1993). Statistical Models Based on Counting Processes. Springer, Paris.

Commenges D, Jacqmin-Gadda H, Proust C, Guedj J (2006). "A Newton-Like Algorithm for Likelihood Maximization: The Robust-Variance Scoring Algorithm." arXiv:math/0610402. URL http://arxiv.org/abs/math/0610402.

Elzhov TV, Mullen KM, Spiess AN, Bolker B (2016). minpack.lm: R Interface to the Levenberg-Marquardt Nonlinear Least-Squares Algorithm Found in MINPACK, Plus Support for Bounds. R package version 1.2-1, URL https://CRAN.R-project.org/package=minpack.lm.

Fletcher R (1971). "A Modified Marquardt Subroutine for Non-Linear Least Squares." Technical report, Theoretical Physics Division, Atomic Energy Research Establishment.

Gerber F, Furrer R (2019). "optimParallel: An R Package Providing a Parallel Version of the L-BFGS-B Optimization Method." The R Journal, 11(1), 352-358. doi:10.32614/RJ-2019-030.

Joe H, Nash JC (2003). "Numerical Optimization and Surface Estimation with Imprecise Function Evaluations." Statistics and Computing, 13(3), 277-286. doi:10.1023/A:1024226918553.

Laird NM, Ware JH (1982). "Random-Effects Models for Longitudinal Data." Biometrics, 38(4), 963-974. doi:10.2307/2529876.

Levenberg K (1944). "A Method for the Solution of Certain Non-Linear Problems in Least Squares." Quarterly of Applied Mathematics, 2(2), 164-168. doi:10.1090/qam/10666.

Lindstrom MJ, Bates DM (1988). "Newton-Raphson and EM Algorithms for Linear Mixed-Effects Models for Repeated-Measures Data." Journal of the American Statistical Association, 83(404), 1014-1022. doi:10.1080/01621459.1988.10478693.

Marquardt DW (1963). "An Algorithm for Least-Squares Estimation of Nonlinear Parameters." Journal of the Society for Industrial and Applied Mathematics, 11(2), 431-441. doi:10.1137/0111030.

Nash JC, Varadhan R (2011). "Unifying Optimization Algorithms to Aid Software System Users: optimx for R." Journal of Statistical Software, 43(9), 1-14.

Pan Y (2020). roptim: General Purpose Optimization in R using C++. R package version 0.1.5, URL https://CRAN.R-project.org/package=roptim.

Prague M, Commenges D, Guedj J, Drylewicz J, Thiébaut R (2013). "NIMROD: A Program for Inference via a Normal Approximation of the Posterior in Models with Random Effects Based on Ordinary Differential Equations." Computer Methods and Programs in Biomedicine, 111(2), 447-458.

Prague M, Diakite A, Commenges D (2012). "Package 'marqLevAlg' - Algorithme de Levenberg-Marquardt en R: une Alternative à 'optimx' pour des Problèmes de Minimisation." Bordeaux, France. URL https://hal.archives-ouvertes.fr/hal-00717566.

Proust C, Jacqmin-Gadda H (2005). "Estimation of Linear Mixed Models with a Mixture of Distribution for the Random Effects." Computer Methods and Programs in Biomedicine, 78(2), 165-173. doi:10.1016/j.cmpb.2004.12.004.

Proust-Lima C, Philipps V, Liquet B (2017). "Estimation of Extended Mixed Models Using Latent Classes and Latent Processes: The R Package lcmm." Journal of Statistical Software, 78(2), 1-56. doi:10.18637/jss.v078.i02.

Proust-Lima C, Séne M, Taylor JM, Jacqmin-Gadda H (2014). "Joint Latent Class Models for Longitudinal and Time-to-Event Data: A Review." Statistical Methods in Medical Research, 23(1), 74-90. doi:10.1177/0962280212445839.

Rizopoulos D (2010). "JM: An R Package for the Joint Modelling of Longitudinal and Time-to-Event Data." Journal of Statistical Software, 35(9), 1-33. doi:10.18637/jss.v035.i09.

Skrondal A, Rabe-Hesketh S (2004). Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. CRC Press. ISBN 978-0-203-48943-7.

Taddé BO, Jacqmin-Gadda H, Dartigues JF, Commenges D, Proust-Lima C (2019). "Dynamic Modeling of Multivariate Dimensions and their Temporal Relationships Using Latent Processes: Application to Alzheimer's Disease." Biometrics. doi:10.1111/biom.13168.

Theussl S, Schwendinger F, Borchers HW (2014). "CRAN Task View: Optimization and Mathematical Programming." Version 2020-05-21, URL https://cran.r-project.org/view=Optimization.