Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software: Extended Analysis
Jeremy M. Myers, Daniel M. Dunlavy, Keita Teranishi, D. S. Hollman
Jeremy M. Myers [email protected], [email protected]
Daniel M. Dunlavy [email protected]
Keita Teranishi [email protected]
D. S. Hollman [email protected]
Sandia National Laboratories, Albuquerque, NM 87123, USA
College of William and Mary, Williamsburg, VA 23185, USA
Abstract
Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Since default parameters in SparTen are tuned to experimental results in prior published work on a single real-world dataset conducted using MATLAB implementations of these methods, it remains unclear if the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.
Keywords: tensor decomposition, Poisson factorization, Kokkos, Newton optimization
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. SAND2020-11901R

Acknowledgments: We would like to thank Richard Barrett for assistance utilizing computing resources at Sandia National Laboratories and Rich Lehoucq for comments of support.
Contents

1. Introduction
2. Background
3. Methods
4. Results
5. Conclusions
A. Detailed Experiment Results
   A.1 Heatmaps
   A.2 Average Convergence Results

List of Figures

1. Example heatmap illustrating results.
2. Mean function evaluations and 95% CI, varying max backtrack steps and step reduction factor (PDNR, lbnl).
3. PDNR, mu initial. Mean function evaluations and 95% CI. PDNR damps out Hessian information and is more prone to time-outs when mu initial is large (PDNR, lbnl).
4. Mean objective function values with 95% confidence interval, varying eps div zero grad (PDNR, lbnl).
5. Experiment outcomes for chicago-crime-comm data on Intel platform.
6. Experiment outcomes for lbnl-network data on Intel platform.
7. Experiment outcomes for nell-2 data on Intel platform.
8. Experiment outcomes for chicago-crime-comm data on ARM platform.
9. Experiment outcomes for lbnl-network data on ARM platform.
10. Experiment outcomes for nell-2 data on ARM platform.
11. Experiment outcomes for chicago-crime-comm data on IBM platform.
12. Experiment outcomes for lbnl-network data on IBM platform.
13. Experiment outcomes for nell-2 data on IBM platform.

List of Tables

1. Hardware characteristics and software environment of the clusters in this paper. Threads and RAM (GB) are per node.
2. Sparse tensor datasets from the FROSTT collection.
3. SparTen software parameter descriptions and values used in our experiments.
4. Experiments run on the different datasets and hardware platforms.
5. Average convergence behavior for the chicago-crime-comm data set. Confidence intervals are percentages of the mean.
6. Average convergence behavior for the lbnl-network data set. Confidence intervals are percentages of the mean.
7. Average convergence behavior for the nell-2 data set. Confidence intervals are percentages of the mean.

1. Introduction
The Canonical Polyadic (CP) tensor decomposition model has garnered attention as a tool for extracting useful information from high-dimensional data across a wide range of applications [9, 3, 8, 2, 7].

Recently, Hansen et al. developed two highly parallelizable Newton-based methods for low-rank tensor factorizations on Poisson count data in [6]: a first-order quasi-Newton method (PQNR) and a second-order damped Newton method (PDNR). These methods were first implemented in the MATLAB Tensor Toolbox [1] as the function cp_apr, and the approach is referred to as computing a CP decomposition using Alternating Poisson Regression (i.e., CP-APR). These methods fit a reduced-rank CP model to count data, assuming a Poisson error distribution. PDNR and PQNR are implemented in SparTen, a high-performance C++ library of CP-APR solvers for sparse tensors. SparTen improves on the MATLAB implementation to provide efficient execution for large, sparse tensor decompositions, exploiting the Kokkos hardware abstraction library [5] to harness parallelism on diverse HPC platforms, including x86-multicore, ARM, and GPU computer architectures.

SparTen contains many algorithmic parameters for controlling the optimization subroutines comprising PDNR and PQNR. To date, only anecdotal evidence exists for how best to tune the algorithms. Parameter defaults in SparTen were chosen according to previous results using the MATLAB implementations described by Hansen et al. [6]. However, their analysis was limited to a single real-world dataset, and thus the defaults may not be optimal for computing decompositions of more general tensor data. Furthermore, it is unknown how the initial guess to a solution affects convergence, since SparTen methods may converge slowly—or worse, stagnate—on real data if the initial state is far from a solution. And, lastly, the average impact of input parameters on algorithm convergence is unclear.

To address these unknowns, we present the results of numerical experiments that assess the sensitivity of algorithm convergence to software parameters over a range of values on benchmark tensor problems. Every experiment was replicated with 30 randomly chosen initial guesses on three diverse computer architectures to aid statistical interpretation. With our results, we (1) provide new results that offer a realistic picture of algorithm convergence under reasonable resource constraints, (2) establish practical bounds on parameters such that, if set at or beyond these values, convergence is unlikely, and (3) identify areas of performance degradation and convergence toward qualitatively different results owing to parameter sensitivities.

We limited our study to multicore CPU architectures only, using OpenMP [4] to manage the parallel computations across threads/cores. Although SparTen, through Kokkos, can leverage other execution backends—e.g., NVIDIA's CUDA framework for GPU computation—we focus solely on diversity in CPU architectures in this work.

This paper is structured as follows. Section 2 summarizes basic tensor notation and details. Section 3 describes the hardware environment, test data, and experimental design of the sensitivity analysis. Section 4 provides detailed results of the sensitivity analyses. Section 5 offers concluding remarks and lays out future work.
1. SparTen is a portmanteau word derived from Sparse and Tensor. The SparTen code is available at http://gitlab.com/tensors/sparten.

2. Background

We briefly describe below the problem we are addressing in this report; for a detailed description of the CP decomposition algorithms implemented in SparTen, refer to the descriptions in Hansen et al. [6].

An $N$-way data tensor $\mathcal{X}$ has dimension sizes $I_1 \times I_2 \times \cdots \times I_N$. We wish to fit a reduced-dimension tensor model, $\mathcal{M}$, to $\mathcal{X}$. The $R$-component Canonical Polyadic (CP) decomposition is given as follows:

\[ \mathcal{X} \approx \mathcal{M} = \llbracket \lambda;\, A^{(1)}, \dots, A^{(N)} \rrbracket = \sum_{r=1}^{R} \lambda_r \, a^{(1)}_r \circ \cdots \circ a^{(N)}_r, \tag{1} \]

where $\lambda = [\lambda_1, \dots, \lambda_R]$ is a scaling vector, $a^{(n)}_r$ represents the $r$-th column of the factor matrix $A^{(n)}$ of size $I_n \times R$, and $\circ$ is the vector outer product. We refer to the operator $\llbracket \cdot \rrbracket$ as a Kruskal operator, and the tensor $\mathcal{M}$, with its specific multilinear model form, as a Kruskal tensor in (1). See [9] for more details regarding these definitions.

SparTen addresses the special case when the elements of $\mathcal{X}$ are nonnegative counts. Assuming the entries in $\mathcal{X}$ follow a Poisson distribution with multilinear parameters, the low-rank CP decomposition in (1) can be computed using the CP-APR methods, PDNR and PQNR, introduced by Hansen et al. [6].
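To make the model concrete, the sketch below evaluates one entry of the Kruskal tensor $\mathcal{M}$ in (1) and the Poisson negative log-likelihood that CP-APR minimizes (Equation (4) in [6], up to an additive constant that does not depend on $\mathcal{M}$). This is an illustration with our own types and names, not SparTen's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Dense factor matrix A^(n) of size I_n x R, stored row-major.
struct Factor {
  int I = 0, R = 0;
  std::vector<double> a;  // a[i*R + r]
  double at(int i, int r) const { return a[i * R + r]; }
};

// m_i = sum_r lambda_r * prod_n A^(n)(i_n, r) for one multi-index i.
double modelEntry(const std::vector<double>& lambda,
                  const std::vector<Factor>& A, const std::vector<int>& idx) {
  double m = 0.0;
  for (std::size_t r = 0; r < lambda.size(); ++r) {
    double p = lambda[r];
    for (std::size_t n = 0; n < A.size(); ++n)
      p *= A[n].at(idx[n], static_cast<int>(r));
    m += p;
  }
  return m;
}

// Poisson negative log-likelihood, up to a constant:
//   f(M) = sum_i m_i - sum_{i in nnz(X)} x_i * log(m_i).
// The first sum runs over *all* tensor entries, but for a Kruskal tensor it
// collapses to sum_r lambda_r * prod_n (sum of column r of A^(n)), so only
// the nonzeros of X are ever visited explicitly.
double poissonNll(const std::vector<std::vector<int>>& coords,  // nnz indices
                  const std::vector<double>& vals,              // nnz counts
                  const std::vector<double>& lambda,
                  const std::vector<Factor>& A) {
  double f = 0.0;
  for (std::size_t r = 0; r < lambda.size(); ++r) {  // linear term
    double p = lambda[r];
    for (const Factor& An : A) {
      double colsum = 0.0;
      for (int i = 0; i < An.I; ++i) colsum += An.at(i, static_cast<int>(r));
      p *= colsum;
    }
    f += p;
  }
  for (std::size_t k = 0; k < vals.size(); ++k)      // log term over nnz only
    f -= vals[k] * std::log(modelEntry(lambda, A, coords[k]));
  return f;
}
```

The collapse of the linear term is what makes the objective tractable for the very sparse tensors considered in this report: the cost per evaluation scales with the number of nonzeros, not the number of tensor entries.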
3. Methods

In this section, we describe the hardware platforms, data, and SparTen algorithm parameters used in our experiments.

3.1 Hardware

We used diverse computer architectures running Red Hat Enterprise Linux (RHEL) to perform our experiments, with hardware and compiler specifications detailed in Table 1. Intel 1–4 are production clusters with hundreds to thousands of nodes, whereas the ARM and IBM clusters are advanced architecture research testbeds with tens of nodes each. The ARM and IBM testbeds have larger memory and support many more threads per node than does any cluster in Intel 1–4. We employed the maximum number of OpenMP threads available per node from each platform to maximize throughput and configured the maximum wall-clock limit as 12 hours for all experiments. All parallelism was solely across threads on a single node. We built and compiled SparTen to leverage OpenMP via Kokkos with the latest software build tools available on each cluster. The GNU compiler, gcc, was used, with -O3 optimization and Kokkos architecture-specific flags enabled.

Table 1: Hardware characteristics and software environment of the clusters in this paper. Threads and RAM (GB) are per node.

Platform | Processor     | Nodes | CPUs   | Threads | RAM (GB) | GCC
ARM      | ThunderX2     | 44    | 28     | 256     | 255      | 7.2.0
IBM      | Newell Power9 | 10    | 20     | 80      | 319      | 7.2.0
Intel 1  | Sandy Bridge  | 1,848 | 29,568 | 16      | 64       | 8.2.1
Intel 2  | Broadwell     | 740   | 26,640 | 36      | 128      | 8.2.1
Intel 3  | Sandy Bridge  | 201   | 3,344  | 16      | 64       | 8.2.1
Intel 4  | Sandy Bridge  | 1,232 | 19,712 | 16      | 64       | 8.2.1
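SparTen reaches OpenMP through Kokkos, so a loop over tensor nonzeros is written once against Kokkos's portable primitives and dispatched to whichever backend was enabled at build time. The fragment below is a generic illustration of that idiom, not SparTen source; the kernel bodies are placeholders of our own.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdint>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);  // picks up OpenMP thread settings
  {
    const int64_t nnz = 1000000;                 // placeholder problem size
    Kokkos::View<double*> work("work", nnz);     // per-nonzero scratch values

    // With the OpenMP backend enabled at build time, this loop is spread
    // across all threads on the node; the same source runs on other backends.
    Kokkos::parallel_for(
        "per_nonzero_kernel", nnz,
        KOKKOS_LAMBDA(const int64_t k) { work(k) = 0.5 * k; });

    // Typical follow-up: reduce per-nonzero contributions to a scalar,
    // e.g., accumulating objective function terms.
    double total = 0.0;
    Kokkos::parallel_reduce(
        "sum_kernel", nnz,
        KOKKOS_LAMBDA(const int64_t k, double& acc) { acc += work(k); },
        total);
  }
  Kokkos::finalize();
  return 0;
}
```

Thread counts per node were controlled through the standard OpenMP mechanisms, e.g., the OMP_NUM_THREADS environment variable.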
3.2 Data

We conducted experiments using sparse tensors of count data from the FROSTT collection [11]. Specifically, we chose the following three datasets (summary statistics provided in Table 2) to account for size, dimensionality, and density (i.e., the ratio of nonzero entries to the total number of elements in the tensor):

1. Chicago Crime Community is a 4th-order tensor of crime reports in the city of Chicago spanning nearly 17 years. The four modes represent day × hour × community × crime-type and the values are counts.
2. LBNL-Network is a 5th-order tensor of anonymized internal network traffic at Lawrence Berkeley National Laboratory. The five modes represent sender-IP × sender-port × destination-IP × destination-port × time and the values are total packet length per timestep.
3. NELL-2 is a 3rd-order benchmark tensor that gives a snapshot of the NELL: Never-Ending Language Learning relational database. The three modes represent entity × relation × entity relationships.

Throughout the discussion below, we refer to the data using the short names listed in the table.

Table 2: Sparse tensor datasets from the FROSTT collection.

FROSTT Name (short name)          | Nonzeros | Dimensions                              | Density
Chicago Crime Community (chicago) | 5.3M     | 6,186 × 24 × 77 × 32                    | 1.5 × 10^-2
LBNL-Network (lbnl)               | 1.7M     | 1,605 × 4,198 × 1,631 × 4,209 × 868,131 | 4.2 × 10^-14
NELL-2 (nell)                     | 77M      | 12,092 × 9,184 × 28,818                 | 2.4 × 10^-5

3.3 Algorithm Parameters

PQNR and PDNR are composed of standard techniques from the numerical optimization literature. Specifically, for each tensor mode, the Newton optimization computes the gradient and Hessian matrix. Then, the inverse Hessian is approximated to compute a search direction, and an Armijo backtracking line search is used to compute the Newton step. How the inverse Hessian is approximated differentiates PDNR and PQNR. PDNR shifts the eigenvalues by a damping factor µ to guarantee the Hessian matrix is positive semidefinite, and solves the resulting linear system exactly. PQNR approximates the inverse Hessian directly with a limited-memory BFGS (L-BFGS) approach, computed with a small number of update pairs. Since the algorithm parameters analyzed here are those presented in several equations and algorithms in [6], we defer to that paper for specific details.

To support discussion later in Section 4, we group the algorithm parameters into the following five categories; a sketch of the backtracking line search that the category B parameters control appears after the list. We note that the stability parameters used to safeguard against numerical errors—e.g., offset tolerances to avoid divide-by-zero floating point errors—do not appear in the corresponding MATLAB Tensor Toolbox method cp_apr.

A. CP-APR
• max outer iterations: Maximum number of outer iterations to perform (Algorithm 1, Steps 2–9 in [6]).
• max inner iterations: Maximum number of inner iterations to perform (K_max in Algorithms 3 and 4 in [6]).

B. Line search
• max backtrack steps: Maximum number of backtracking steps in line search (maximum allowable value of t used in Equation (17) in [6]).
• min variable nonzero tolerance: Tolerance for nonzero line search step length (smallest allowable value of β in Equation (17) in [6]).
• step reduction factor: Factor to reduce line search step size between iterations (β_{t+1}/β_t in Equation (17) in [6]).
• suff decrease tolerance: Tolerance to ensure the next iterate decreases the objective function (σ in Equation (17) in [6]).

C. Damped Newton (PDNR)
• mu initial: Initial value of damping parameter (µ in Algorithm 3 in [6]).
• damping increase factor: Scalar value to increase damping parameter in next iterate (Equation (16) in [6]).
• damping decrease factor: Scalar value to decrease damping parameter in next iterate (Equation (16) in [6]).
• damping increase tolerance: Tolerance to increase the damping parameter (Equation (16) in [6]). If the search direction increases the objective function and the ratio of actual reduction to predicted reduction in the objective function (ρ in Equation (15) in [6]) is less than damping increase tolerance, the damping parameter µ_k will be increased for the next iteration.
• damping decrease tolerance: Tolerance to decrease the damping parameter (Equation (16) in [6]). Conversely, if the search direction decreases the objective function and the ratio of actual reduction to predicted reduction (ρ in Equation (15) in [6]) is greater than damping decrease tolerance, the damping parameter µ_k will be decreased for the next iteration.

D. Quasi-Newton (PQNR)
• size LBFGS: Number of recent L-BFGS update pairs to use in estimating the current Hessian (M in Equation (18) in [6]).

E. Numerical stability
• eps div zero grad: Safeguard against divide-by-zero in gradient and Hessian calculations.
• log zero safeguard: Tolerance to avoid computing log(0) in objective function calculations.

The default value in SparTen of each parameter described above and the experimental ranges tested in these experiments are given in Table 3.
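The sketch below shows the standard Armijo backtracking pattern that the category B parameters control, under our reading of Equation (17) in [6] (including the projection that keeps iterates nonnegative). It is a minimal illustration with our own names, not SparTen's code.

```cpp
#include <functional>
#include <vector>

// One Armijo backtracking line search along direction d from point x.
// f: objective; g_dot_d: directional derivative grad(x).d, negative for a
// descent direction. Returns the accepted step length beta (0 on failure).
double backtrack(const std::function<double(const std::vector<double>&)>& f,
                 const std::vector<double>& x, const std::vector<double>& d,
                 double g_dot_d,
                 int max_backtrack_steps,                    // max value of t
                 double step_reduction_factor,               // beta_{t+1}/beta_t
                 double suff_decrease_tolerance,             // sigma
                 double min_variable_nonzero_tolerance) {    // smallest beta
  const double f0 = f(x);
  double beta = 1.0;                                 // initial (full) step
  std::vector<double> trial(x.size());
  for (int t = 0; t < max_backtrack_steps; ++t) {
    if (beta < min_variable_nonzero_tolerance) break;  // step too small
    for (std::size_t i = 0; i < x.size(); ++i) {
      trial[i] = x[i] + beta * d[i];
      if (trial[i] < 0.0) trial[i] = 0.0;            // keep iterate nonnegative
    }
    // Armijo sufficient-decrease test: accept if the objective dropped by at
    // least sigma times the predicted linear decrease.
    if (f(trial) <= f0 + suff_decrease_tolerance * beta * g_dot_d) return beta;
    beta *= step_reduction_factor;                   // shrink and retry
  }
  return 0.0;                                        // line search failed
}
```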
Table 3: SparTen software parameter descriptions and values used in our experiments. Cells marked "–" are unrecoverable here.

Parameter                      | Default              | Values Used in Experiments
max outer iterations           | 100,000              | 1, 2, 4, 8, 16, 32, 64, 128, 256, 512
max inner iterations           | 20                   | 20, 40, 80, 160
max backtrack steps            | 10                   | 1†, 2, 4, 8, 10, 12†, 16
min variable nonzero tolerance | –                    | 10^-15†, 10^-7, 10^-3, –†
step reduction factor          | 0.5                  | 0.1, 0.3†, 0.5, 0.7†, 0.9
suff decrease tolerance        | –                    | 10^-12, 10^-8, –†, –†
mu initial                     | 10^-5                | 10^-8, 10^-5, 10^-2
damping increase factor        | –                    | 1.5, –†, –, –†, –
damping decrease factor        | –                    | –, –, –†, –, –†, –
damping increase tolerance     | 0.25                 | –, 0.25, 0.495
damping decrease tolerance     | 0.75                 | 0.505, 0.75, –
size LBFGS                     | 3                    | 1, 2, 3, 4, 5, 10, 15, 20
eps div zero grad              | 10^-10               | 10^-15, 10^-12†, 10^-10, 10^-8†, 10^-5
log zero safeguard             | –                    | 10^-32†, 10^-24, 10^-16†, 10^-12, 10^-8†, –
eps active set                 | PDNR: 10^-8; PQNR: – | 10^-8, 10^-5, 10^-3†, –†

^{1–4} Intel platform used for experiments; † values evaluated on Intel platforms only.

An individual experiment is a job j on platform m solving a PDNR/PQNR row subproblem for dataset d with SparTen solver s, parameter p, parameter value v, and random initialization r; all remaining software parameters are fixed at the default values listed in Table 3. Certain experiments, denoted with a dagger †, were run only on Intel hardware due to limited resources associated with the other architectures; this accounts for the larger number of experiments reported for these platforms. We conducted tests on these values to provide better resolution of the impact of the parameter where nearby values—i.e., on the bounds of the test range—contained uncertainty in the results. Furthermore, we split up the experiments across the Intel platforms by parameter, running the full set of experiments across all parameter values and all random initializations on a single platform. The superscripts for each parameter in the table denote the Intel platform number specified in Table 1. Since we report only the number of function evaluations and outer iterations in our results, we expect that running our experiments in this way has produced valid results.

In all experiments, we fit a 5-component CP decomposition using the stopping tolerance τ of Equation (20) in [6]; the violation of the Karush-Kuhn-Tucker (KKT) conditions is used as the stopping criterion for the methods we explore here. Computation of a CP decomposition using PDNR or PQNR in SparTen requires an initial guess of the model parameters—i.e., initial values for M in (1)—drawn from a uniform distribution. For each experiment we record the computational cost (i.e., the number of evaluations of the objective function f(M), defined in Equation (4) of [6]) and the quality of the solution (i.e., the value of the negative log-likelihood objective function). As each of our experiments consists of 30 replicates (i.e., 30 random initializations) across three CPU architectures, we report sample means and 95% confidence intervals (as defined in [10]) when presenting statistical trends in the results.
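For reference, the interval we report is assumed here to be the standard two-sided Student-t confidence interval, which is the textbook definition given in [10]. For a sample of $n$ converged replicates $x_1, \dots, x_n$:

\[
\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad
s^2 = \frac{1}{n-1}\sum_{j=1}^{n} \left(x_j - \bar{x}\right)^2, \qquad
\mathrm{CI}_{95\%} = \bar{x} \,\pm\, t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}},
\]

where $t_{0.975,\,n-1}$ is the 97.5th percentile of the Student-t distribution with $n-1$ degrees of freedom. In Tables 5–7 the half-width $t_{0.975,\,n-1}\, s / \sqrt{n}$ is reported as a percentage of $\bar{x}$.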
4. Results

Table 4: Experiments run on the different datasets and hardware platforms.

CPU   | Solver | Data    | Planned | Collected | Canceled | Converged | Max Iterations | Missing
ARM   | PDNR   | chicago | –       | –         | –        | –         | –              | –
ARM   | PDNR   | lbnl    | –       | –         | –        | –         | –              | –
ARM   | PDNR   | nell    | –       | –         | –        | –         | –              | –
ARM   | PQNR   | chicago | 990     | 281       | 0.0%     | 55.5%     | 44.5%          | 71.6%
ARM   | PQNR   | lbnl    | 990     | 237       | 0.0%     | 0.0%      | 100.0%         | 76.1%
ARM   | PQNR   | nell    | 990     | 390       | 23.3%    | 0.3%      | 76.4%          | 60.6%
IBM   | PDNR   | chicago | –       | –         | –        | –         | –              | –
IBM   | PDNR   | lbnl    | –       | –         | –        | –         | –              | –
IBM   | PDNR   | nell    | –       | –         | –        | –         | –              | –
IBM   | PQNR   | chicago | 990     | 676       | 10.2%    | 76.3%     | 13.5%          | 31.7%
IBM   | PQNR   | lbnl    | 990     | 293       | 61.8%    | 0.0%      | 38.2%          | 70.4%
IBM   | PQNR   | nell    | 990     | 481       | 31.0%    | 6.6%      | 62.4%          | 51.4%
Intel | PDNR   | chicago | –       | –         | –        | –         | –              | –
Intel | PDNR   | lbnl    | –       | –         | –        | –         | –              | –
Intel | PDNR   | nell    | –       | –         | –        | –         | –              | –
Intel | PQNR   | chicago | –       | –         | –        | –         | –              | –
Intel | PQNR   | lbnl    | –       | –         | –        | –         | –              | –
Intel | PQNR   | nell    | –       | –         | –        | –         | –              | –
In this section we analyze the results of the parameter sensitivity experiments and describe the statistical relationships between the convergence properties of the PDNR and PQNR methods and their input parameters.

In total, 21,960 unique experiments were planned, accounting for running PDNR and PQNR with random initializations across all parameter value ranges on the various hardware architectures described in Sections 3.1 and 3.3. An experiment converged if the final KKT violation is less than the stopping tolerance τ; an experiment reached maximum iterations if the number of outer iterations exceeded the maximum limit (i.e., max outer iterations) and did not converge; an experiment was canceled if it exceeded the wall-clock limit (i.e., SparTen neither converged to a solution nor reached the maximum number of outer iterations within 12 hours); and an experiment was missing if it did not run due to a failure of the system to launch the experiment or another system issue. Of the planned experiments, we collected data from 16,139 experiments.

Table 4 presents the number of experiments planned as defined above and the number of planned experiments where data was collected (i.e., planned minus missing). For those collected, the table shows the percentage of experiments that were canceled, converged, or exceeded the maximum iterations. We note that the most complete set of experiment results was obtained on the Intel platforms. Although there are many missing experiment results for the IBM and ARM platforms, we attempt to identify patterns in the data we collected when there is strong evidence to support our claims. We note that a few parameters (eps active set, min variable nonzero tolerance, suff decrease tolerance, damping increase tolerance, damping decrease tolerance) showed no statistically significant differences across the range of input values used in the experiments. We conjecture that we did not find values where these parameters display sensitivities on the chosen tensor problems; thus it remains unclear whether this behavior holds in general.

4.1 General Convergence Behavior

To illustrate general convergence behavior, we present the results of the experiments in a heatmap for a given dataset, method, and hardware platform. Figure 1 presents an example heatmap, where each square represents the total number of objective function evaluations of an experiment, with random initializations across the rows and parameter values used (with all other parameter values set to their default values) across the columns. The complete set of heatmaps for all experiments can be found in Appendix A.1. These results illustrate that there are certain ranges of parameter values that lead to good or bad convergence behaviors in general.
[Figure 1: Example heatmap illustrating results. Columns are grouped by parameter (damping decrease factor, damping decrease tolerance, damping increase factor, damping increase tolerance, eps active set, eps div zero grad, log zero safeguard, max inner iterations, max outer iterations, max backtrack steps, min variable nonzero tolerance, mu initial, step reduction factor, suff decrease tolerance), with one column per tested value; rows are experiment IDs (random initializations). Cell shade encodes total function evaluations; status hatching marks Canceled, Max. Iter., and Missing outcomes.]

The colours in the heatmaps denote the outcomes of the experiments as follows. Green shades are consistent with converged experiments. Vertical bands not shaded green identify values that may impact algorithm performance, due either to iteration constraints (blue hues) or excessive computations corresponding to slow convergence or stagnation (red hues). Hatches denote non-convergent exit status. Grey represents missing data, i.e., experiments that were planned but never conducted due to resource limitations—e.g., dequeued by the cluster administrator—or a system failure. Solid columns of a single shade indicate the same convergence behavior across all 30 random initializations. Nearly solid columns of the same shade indicate similar behavior, but also some sensitivity to the initial starting point of the iterative methods at those parameter values.
Observation 1: Convergence properties are demonstrated empirically.
As discussed in Section 1, applying PDNR and PQNR to real-world data has been explored previously in the literature only for a single problem. From Table 4, we observed that PQNR is canceled more often than PDNR in the allotted time across datasets and CPU platforms. This confirms our intuition, since it is a classical result in iterative methods that damped Newton methods converge quadratically, in comparison to quasi-Newton methods, which converge superlinearly. Specifically, PQNR calls the objective function more than twice as often as PDNR on chicago data and fails to converge in the allotted time for any experiment on lbnl data across all hardware platforms. By contrast, PDNR converges in 86% of lbnl experiments across platforms when only 32 outer iterations are allowed.

Observation 2: There is consistent convergence behavior across many ranges of parameter values.
Looking across the range of values for each parameter, there are many cases where there is a distinct change in behavior—e.g., from converged to canceled once log zero safeguard grows past a threshold in the PDNR experiments on chicago shown in Figure 5. This distinct, repeatable behavior can be used to guide the choices for both a general set of parameter defaults and tuning of parameters for specific data.

Observation 3: PDNR and PQNR are not necessarily sensitive to the same parameters.
In some cases PQNR converges where PDNR is canceled using the same random initial starting point. Several interesting patterns from the nell results should also be noted. First, the solution space is highly sensitive to random initialization (seen as hatched red bands in Figure 7). Thus statements quantifying average behavior are more uncertain. Second, where PQNR does converge on nell, it is for extreme parameter values for which convergence is unlikely in the other cases: eps div zero grad at or below its smallest tested values, log zero safeguard at or above its largest tested values, and max backtrack steps at or below its smallest tested values.

4.2 Parameter Sensitivity

We are interested in comparing the convergence results across parameter values and solvers. In the following sections, we refer to Tables 5, 6, and 7, which compare the computational costs of PDNR and PQNR by reporting the mean number of objective function evaluations of converged jobs from the three data sets for each parameter value on all computer hardware platforms, with 95% confidence intervals presented as percentages above and below the mean. We proceed by analyzing the results by the categories of parameters described in Section 3.3: general CP-APR parameters, line search parameters, damped Newton step parameters, quasi-Newton step parameters, and numerical stability parameters.
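Before examining the categories individually, the skeleton below fixes the loop nesting that the CP-APR iteration limits bound, following Algorithm 1 and Algorithms 3 and 4 in [6]. It is a schematic with empty kernels and our own names, not SparTen code.

```cpp
// Schematic CP-APR solver loop structure (cf. Algorithm 1 in [6]).
// max_outer_iterations bounds the outer sweeps over all N modes;
// max_inner_iterations bounds the Newton iterations per row subproblem.
void cpAprSchematic(int N, int max_outer_iterations, int max_inner_iterations) {
  for (int outer = 0; outer < max_outer_iterations; ++outer) {
    for (int mode = 0; mode < N; ++mode) {  // one factor matrix at a time
      // For each row of factor matrix A^(mode), solve a small constrained
      // subproblem with PDNR or PQNR (Algorithms 3 and 4 in [6]); the row
      // subproblems are independent, which is where SparTen parallelizes.
      for (int inner = 0; inner < max_inner_iterations; ++inner) {
        // Compute gradient (and Hessian or L-BFGS pairs), form a search
        // direction, and take an Armijo backtracking step; stop early when
        // the row's KKT conditions are satisfied.
      }
    }
    // Stop the outer loop when the overall KKT violation falls below the
    // tolerance tau (Equation (20) in [6]).
  }
}
```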
4.2.1 CP-APR

As noted above, damped Newton methods converge faster than quasi-Newton methods (quadratic convergence versus superlinear, respectively). Thus, we are interested in determining the impact on convergence of constraining the maximum number of allowable iterations in the outer and inner solvers. Note that in experiments that do not measure the effect of max outer iterations and max inner iterations explicitly, these values were fixed at 100,000 and 20, respectively.

In all cases, there is a minimum value where PDNR converges to a solution. However, PQNR times out for all max outer iterations test values on lbnl and nell. Where comparisons can be made (chicago), PQNR calls the objective function more than twice as often as PDNR, on average, as max outer iterations grows.

The effect of increasing the maximum number of inner iterations, max inner iterations, is similar. The non-monotonic increase in function evaluations for the maximum number of inner iterations can be explained by the trade-off between outer and inner iterations, which depends on the particular data set. The exception to this trend is the result that PDNR calls the objective function 2.33 times more than PQNR on nell (see Table 7), although this is most likely due to so few converged PQNR experiments and may not be statistically significant.

In principle, there is a value of the parameter eps active set that will have an effect on convergence. In practice, however, we did not find that value for any dataset. We note that this result differs from [6], which found that one tested value of the parameter leads to faster PDNR convergence than another. Therefore, since algorithm sensitivity to this parameter is data-dependent, it is unreasonable to generalize its behavior.

4.2.2 Line search

Allowing many backtracking steps during the line search, set by max backtrack steps, may cause PDNR to waste effort; however, PQNR appears to perform better, in general, with more steps. PDNR is sensitive to the number of backtracking steps on chicago: less work is performed on average when the maximum number of allowed steps is large and more work is performed when the number of steps is small. On lbnl—the sparsest tensor problem considered—PDNR performs better with fewer backtracking steps (see Figure 2a).

The line search parameter step reduction factor is used to reduce the line search step size between iterations (β_{t+1}/β_t in Equation (17) in [6]). On a large, sparse tensor problem, increasing this parameter may accelerate convergence. On the other hand, a small value makes convergence less certain. Figure 2b illustrates this behavior on the lbnl data: the average total cost decreased by 77% as step reduction factor increased from 0.1 to 0.5 (the SparTen default) and decreased another 28% from 0.5 to 0.9. On nell data, PQNR only converged for large values (0.7, 0.9).

[Figure 2: Mean function evaluations versus outer iterations (PDNR, lbnl). (a) The x-axis is truncated to emphasize the lower average cost when fewer max backtrack steps are allowed; the figure does not fully capture the high average cost when 16 maximum backtrack steps are allowed. (b) The x-axis is truncated to demonstrate how a high step reduction factor may accelerate convergence on large, sparse tensor data.]

The threshold parameter min variable nonzero tolerance guarantees that the final Newton step length is nonzero. In theory, if this value is too large, the next iterate may overstep important information, forcing additional iterations to correct the misstep. On the other hand, if the value is too small, the algorithm may converge too slowly. We observed no statistically significant differences in PDNR algorithm performance when varying this parameter on chicago. The same appears true for PQNR, although we caution that the empirical data is limited to results on chicago only. Addressing the result that no experiments converged for two of the tested min variable nonzero tolerance values on nell, it is unlikely that size and sparsity play a role in this parameter's effect on convergence; PDNR algorithm performance shows no significant difference on lbnl, the largest and sparsest tensor problem.

The parameter suff decrease tolerance is used to assess whether sufficient decrease in the objective function has been achieved in the line search—i.e., it is used to determine when to stop the line search iterations. A characterization virtually identical to that made for parameter min variable nonzero tolerance can also be stated here. In short, there is no significant performance difference when varying this parameter on PDNR or PQNR, although PQNR results are limited to chicago only.
4.2.3 Damped Newton (PDNR)

In this section, we discuss results of varying parameters that are used only by the PDNR solver. The damped Newton parameters control updates to the damping factor for the next iterate (µ_k in Equation (16) in [6]). The damping parameter µ_k shifts the eigenvalues of the Hessian matrix, forcing it to be positive semidefinite and guaranteeing that a solution exists. For every outer iteration, the damping factor is initialized to mu initial and updated using the following parameters (a sketch of the update logic follows at the end of this section):

1. damping decrease factor
2. damping decrease tolerance
3. damping increase factor
4. damping increase tolerance

[Figure 3: PDNR, mu initial. Mean function evaluations and 95% CI versus outer iterations for (a) chicago, (b) lbnl, and (c) nell. PDNR damps out Hessian information and is more prone to time-outs when mu initial is large (PDNR, lbnl).]

Hansen et al. predict in [6] that when the damping parameter µ is set too large, a loss of Hessian information follows, which impacts convergence:

"We expect larger values of µ_k to improve robustness by effectively shortening the step length and hopefully avoiding the mistake of setting too many variables to zero. However, a serious drawback to increasing µ_k is that it damps out Hessian information, which can hinder the convergence rate."

For example, when mu initial is large, the computational cost grows dramatically and time-outs become more likely, since the initial step length will at first be very small in every outer iteration and useful Hessian information is discarded in early stages of the inner loop solves. See Figure 3. Convergence is most likely for a large, but not too-large, value, i.e., mu initial = 10^-5. Cost grows 177.2% on lbnl and nearly doubles (+92%) on chicago as mu initial grows from 10^-5 to 10^-2. It is important to note in the former case that this cost is skewed by one experiment that converged after nearly 42,000 outer iterations, in comparison to 1,300 for the other parameter values on average, as illustrated in Figure 3, where the x-axis is truncated to highlight the differences in total cost. Smaller values (i.e., 10^-8) seem to perform better for chicago, the smallest, densest problem, and larger values (i.e., 10^-5) tend to perform better for large, sparse problems.

The PDNR parameters damping increase factor and damping decrease factor, which control updates to the Hessian matrix damping parameter µ, are two examples of algorithm parameters where convergence behavior is similar for values set within sensitivity constraints. SparTen rarely converges when the former is set too low (1.5); the likely effect is that the updated damping factor is insufficient to guarantee a well-conditioned Hessian, and too many unimportant directions are considered when computing the search direction. Above the 1.5 bound, the cost in objective function calls does not change significantly.

In our experiments, we found that varying either damping decrease tolerance or damping increase tolerance has no noticeable effect on either the number of calls to the objective function or the number of outer iterations performed.
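The fragment below sketches the trust-region-style update these four parameters drive, under our reading of Equations (15) and (16) in [6]: the ratio ρ of actual to predicted reduction decides whether to damp more or less. The function name and the multiplicative update form are our own assumptions, not SparTen source.

```cpp
// One damping update after a trial Newton step.
//   rho = actual_reduction / predicted_reduction   (Equation (15) in [6])
// Poor agreement (small rho) => increase mu, trust the Hessian less;
// good agreement (large rho) => decrease mu, trust the Hessian more.
// damping_decrease_factor is assumed to be < 1 so that mu shrinks.
double updateDamping(double mu, double rho,
                     double damping_increase_factor,
                     double damping_decrease_factor,
                     double damping_increase_tolerance,    // e.g., 0.25
                     double damping_decrease_tolerance) {  // e.g., 0.75
  if (rho < damping_increase_tolerance)
    mu *= damping_increase_factor;   // shorten steps, more damping
  else if (rho > damping_decrease_tolerance)
    mu *= damping_decrease_factor;   // lengthen steps, less damping
  return mu;                         // otherwise leave mu unchanged
}
```

Under this reading, a too-small damping increase factor (e.g., the 1.5 value above) grows µ so slowly that a poorly conditioned Hessian is not repaired quickly, which is consistent with the convergence failures we observed.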
4.2.4 Quasi-Newton (PQNR)

In this section, we discuss varying the parameter size LBFGS, which is the only software parameter used exclusively by the PQNR solver. PQNR uses a limited-memory BFGS (L-BFGS) approach to approximate the inverse Hessian matrix in the quasi-Newton step, with M update pairs stored. The value of M is set by the parameter size LBFGS. See [6] for algorithm details. More update pairs M should provide a higher resolution to approximate the inverse Hessian; intuitively, too few update pairs seem insufficient to compute an acceptable approximation. The only observable difference occurs when the update size is 1, using only the current iterate in the BFGS update.

4.2.5 Numerical stability

The numerical stability parameters eps div zero grad and log zero safeguard described in this section are offset tolerances to avoid divide-by-zero floating point errors tailored to SparTen's C++ implementation; they do not appear in the corresponding MATLAB Tensor Toolbox method cp_apr. Their impact on convergence was consistent across combinations of solver, data, and CPU hardware. A sketch of how such safeguards enter the computations follows at the end of this section.

The parameter eps div zero grad is an offset to avoid divide-by-zero floating point errors when computing the gradient and Hessian (Equation (10) in [6]). When eps div zero grad is large, gradient directions that do not lead to objective function improvements may be scaled the same as gradient directions that do lead to such improvements. Furthermore, the corresponding eigenvalues of the Hessian matrix are amplified and Hessian information may be lost when determining the next iterate. For example, PDNR loses Hessian information as eps div zero grad increases on chicago data; PDNR rarely converges and PQNR never converges when this parameter is relatively large—i.e., at the largest tested value, 10^-5. Moreover, both algorithms are sensitive to the parameter's lower bound, as small values may be insufficient to avoid an ill-conditioned Hessian matrix. In either case, additional iterations follow to correct errors incurred by eps div zero grad values, large and small.

Parameter sensitivities affect not only convergence behavior, but may also produce qualitatively different results. Figure 4 illustrates the effect where large eps div zero grad—and consequently, small step length—minimizes calls to the objective function and results in a minimal objective function value. Most striking is that larger eps div zero grad decreases the objective function by more than an order of magnitude. This result was collected from 79 of 90 planned PDNR experiments on lbnl, and thus we consider this interesting effect worthy of further investigation.

PDNR typically does not converge for large log zero safeguard values on large tensor problems. This parameter sets a nonzero offset in logarithm calculations to avoid explicitly computing log(0). High precision in logarithm computations tends to ensure the objective function is minimized accurately. When the value is too large, the calculated logarithm may be too small, and more backtracking steps are required to sufficiently decrease the objective function in the line search routine, making time-outs more likely. On the other hand, the effect of the parameter on convergence is indistinguishable for the smaller values tested, across all experiments.

[Figure 4: Mean objective function values with 95% confidence interval, varying eps div zero grad (PDNR, lbnl). Curves for eps div zero grad values 10^-15, 10^-12, 10^-10, 10^-8, and 10^-5 are plotted against outer iterations.]
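To make the role of these safeguards concrete, the fragment below shows the generic clamping pattern such offsets imply: denominators and logarithm arguments are kept away from zero. The helper names are ours; this is not SparTen's implementation.

```cpp
#include <algorithm>
#include <cmath>

// Generic safeguarded kernels of the kind the stability parameters control.
// eps_div_zero_grad keeps denominators in gradient/Hessian terms away from
// zero; log_zero_safeguard keeps log() arguments strictly positive.

// Ratio x_i / m_i appearing in gradient-like terms (cf. Equation (10) in [6]).
inline double safeRatio(double x, double m, double eps_div_zero_grad) {
  return x / std::max(m, eps_div_zero_grad);
}

// log(m_i) term of the objective, protected against log(0).
inline double safeLog(double m, double log_zero_safeguard) {
  return std::log(std::max(m, log_zero_safeguard));
}
```

One way to read the results above through this pattern: a large eps_div_zero_grad flattens the distinction between small and moderate model values m_i, discarding curvature information, while a large log_zero_safeguard biases the computed objective and forces extra backtracking to achieve sufficient decrease.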
5. Conclusions

Using results from more than 16,000 numerical experiments on several hardware platforms, we presented experimental results that expand our understanding of average PDNR and PQNR convergence on real-world tensor problems. We have shown that when using PQNR to compute large tensor decompositions, convergence is less likely under reasonable resource constraints. We have shown that some software parameters have practical bounds beyond which convergence is unlikely. Further, we showed that varying several parameters can dramatically impact algorithm performance and, in some cases, may produce qualitatively different results.

Future work may address the issue of stagnation in Newton optimization methods for CP decompositions. We showed examples where the solver converged to a solution slowly but within the allotted time of 12 hours. For those experiments that timed out, it is unknown whether SparTen would eventually converge to a solution or stagnate without making progress. We anticipate that stagnation could be detected if the objective function values converge to a statistical steady state without satisfying the convergence criterion. Future development of SparTen may include dynamic updates to algorithm parameters based on local convergence information. Lastly, future experiments could explore coupled sensitivities among algorithm parameters, as this work was limited to single-parameter, univariate analyses. Understanding the nature of bivariate (or even more complex) relationships among parameters may better inform end-users when searching for optimal parameter choices to run the SparTen methods.

References

[1] Brett W. Bader, Tamara G. Kolda, et al. MATLAB Tensor Toolbox, version 3.0-dev. Available online, August 2017.
[2] J. D. Carroll and J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35:283–319, 1970.
[3] A. Cichocki, D. Mandic, L. De Lathauwer, G. Zhou, Q. Zhao, C. Caiafa, and H. A. Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
[4] Leonardo Dagum and Ramesh Menon. OpenMP: An industry standard API for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.
[5] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing, 74(12):3202–3216, 2014. Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
[6] Samantha Hansen, Todd Plantenga, and Tamara G. Kolda. Newton-based optimization for Kullback-Leibler nonnegative tensor factorizations. Optimization Methods and Software, 30(5):1002–1029, April 2015.
[7] Richard A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
[8] Frank L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1–4):164–189, 1927.
[9] T. Kolda and B. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[10] Lawrence M. Leemis and Stephen K. Park. Discrete-Event Simulation: A First Course. Prentice-Hall, Inc., USA, 2005.
[11] Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools. Available online at http://frostt.io/, 2017.

Appendix A. Detailed Experiment Results
This section provides additional results for the experiments described in this analysis.
A.1 Heatmaps
Below are the outcomes of the planned experiments described in this report, presented as heatmap images. They are organized by hardware platform, data set, and the solver used in each experiment, as described in Section 3.

[Figure 5: Experiment outcomes for chicago-crime-comm data on Intel platform. (a) PDNR; (b) PQNR.]
[Figure 6: Experiment outcomes for lbnl-network data on Intel platform. (a) PDNR; (b) PQNR.]

[Figure 7: Experiment outcomes for nell-2 data on Intel platform. (a) PDNR; (b) PQNR.]

[Figure 8: Experiment outcomes for chicago-crime-comm data on ARM platform. (a) PDNR; (b) PQNR.]

[Figure 9: Experiment outcomes for lbnl-network data on ARM platform. (a) PDNR; (b) PQNR.]

[Figure 10: Experiment outcomes for nell-2 data on ARM platform. (a) PDNR; (b) PQNR.]

[Figure 11: Experiment outcomes for chicago-crime-comm data on IBM platform. (a) PDNR; (b) PQNR.]

[Figure 12: Experiment outcomes for lbnl-network data on IBM platform. (a) PDNR; (b) PQNR.]

[Figure 13: Experiment outcomes for nell-2 data on IBM platform. (a) PDNR; (b) PQNR.]

Each of Figures 5–13 shares the layout of Figure 1: rows are experiment IDs (random initializations), columns are tested parameter values grouped by parameter, cell shade encodes total function evaluations, and status hatching marks Canceled, Max. Iter., and Missing outcomes.

A.2 Average Convergence Results

Below are tables presenting the number of objective function evaluations for PDNR and PQNR to reach convergence, averaged across hardware platforms and reported per data set. N denotes the number of converged experiments per set of experiments. Means and 95% confidence intervals (CI) are given for the objective function evaluation results.

[Table 5: Average convergence behavior for the chicago-crime-comm data set. Confidence intervals are percentages of the mean. Columns: Parameter, Value, Solver, N, Mean function evaluations, 95% CI.]

[Table 6: Average convergence behavior for the lbnl-network data set. Confidence intervals are percentages of the mean. Columns: Parameter, Value, Solver, N, Mean function evaluations, 95% CI.]

[Table 7: Average convergence behavior for the nell-2 data set. Confidence intervals are percentages of the mean. Columns: Parameter, Value, Solver, N, Mean function evaluations, 95% CI.]