A Robust and Efficient Algorithm to Find Profile Likelihood Confidence Intervals

Samuel M. Fischer · Mark A. Lewis

Abstract
Profile likelihood confidence intervals are a robust alternative to Wald's method if the asymptotic properties of the maximum likelihood estimator are not met. However, the constrained optimization problem defining profile likelihood confidence intervals can be difficult to solve in these situations, because the likelihood function may exhibit unfavorable properties. As a result, existing methods may be inefficient and yield misleading results. In this paper, we address this problem by computing profile likelihood confidence intervals via a trust-region approach, where steps computed based on local approximations are constrained to regions where these approximations are sufficiently precise. As our algorithm also accounts for numerical issues arising if the likelihood function is strongly non-linear or parameters are not estimable, the method is applicable in many scenarios where earlier approaches are shown to be unreliable. To demonstrate its potential in applications, we apply our algorithm to benchmark problems and compare it with six existing approaches to compute profile likelihood confidence intervals. Our algorithm consistently achieved higher success rates than any competitor while also being among the quickest methods. As our algorithm can be applied to compute both confidence intervals of parameters and model predictions, it is useful in a wide range of scenarios.
S. M. Fischer
Department of Mathematical and Statistical Sciences, 632 Central Academic Building, University of Alberta, Edmonton, AB, T6G 2G1. E-mail: samuel.fischer@ualberta.ca

M. A. Lewis
Department of Mathematical and Statistical Sciences, 632 Central Academic Building, University of Alberta, Edmonton, AB, T6G 2G1.
Department of Biological Sciences, CW 405, Biological Sciences Building, University of Alberta, Edmonton, AB, T6G 2E9.
Keywords computer algorithm, constrained optimization, parameter estimation, estimability, identifiability
Declarations

Funding.
SMF is thankful for the funding received from the Canadian Aquatic Invasive Species Network and the Natural Sciences and Engineering Research Council of Canada (NSERC); MAL gratefully acknowledges an NSERC Discovery Grant and Canada Research Chair.

Competing interests.
The authors declare no competing interests.

Availability of data and material.
Not applicable.

Code availability.
A Python implementation of the described algorithm and the test procedures can be retrieved from the Python package index as package ci-rvm (see pypi.org/project/ci-rvm).

Authors' contributions.
Samuel M. Fischer and Mark A. Lewis jointly conceived the project; Samuel M. Fischer conceived the algorithm, conducted the mathematical analysis, implemented the algorithm, and wrote the manuscript. Mark A. Lewis revised the manuscript.
Acknowledgements
The authors would like to give thanks to the members of the Lewis Research Group at the University of Alberta for helpful feedback and discussions.

1.1 Profile likelihood confidence intervals

The profile likelihood confidence interval $I$ for a parameter $\theta_0$ encompasses all values $\bar\theta_0$ that might suit as acceptable null hypotheses if the parameter were to be fixed; i.e., $H_0\colon \theta_0 = \bar\theta_0$ could not be rejected versus the alternative $H_1\colon \theta_0 \ne \bar\theta_0$. As the likelihood ratio statistic is, under regularity conditions, approximately $\chi^2$ distributed under the null hypothesis, the confidence interval is given by

$$I = \left[\,\bar\theta_0 \;\middle|\; 2\left(\max_{\theta\in\Theta}\ell(\theta) - \max_{\theta\in\Theta:\,\theta_0=\bar\theta_0}\ell(\theta)\right) \le \chi^2_{1,1-\alpha}\right], \tag{1}$$

whereby $\Theta$ is the parameter space, $\ell$ denotes the log-likelihood function, $1-\alpha$ is the desired confidence level, and $\chi^2_{k,1-\alpha}$ is the $(1-\alpha)$ quantile of the $\chi^2$ distribution with $k$ degrees of freedom.

The function that maps $\bar\theta_0$ to the constrained maximum

$$\ell_{\mathrm{PL}}(\bar\theta_0) := \max_{\theta\in\Theta:\,\theta_0=\bar\theta_0}\ell(\theta) \tag{2}$$

is called the profile log-likelihood. While Wald's method approximates $\ell$ and $\ell_{\mathrm{PL}}$ as quadratic functions, profile likelihood confidence intervals are constructed by exact computation of the profile log-likelihood $\ell_{\mathrm{PL}}$. This makes the method more accurate but also computationally challenging.
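For illustration, the following sketch evaluates the profile log-likelihood (2) for a generic model and computes the threshold value that reappears as $\ell^*$ in equation (3) below. The helper names and the use of scipy.optimize.minimize with its default solver are illustrative assumptions, not part of the algorithms discussed in this paper.

```python
import numpy as np
from scipy import optimize, stats

def profile_loglike(theta0_bar, loglike, theta_hat):
    """Profile log-likelihood (2): maximize over the nuisance
    parameters while the parameter of interest is fixed at theta0_bar."""
    def neg_restricted(theta_tilde):
        return -loglike(np.concatenate(([theta0_bar], theta_tilde)))
    # Start the nuisance optimization at the nuisance part of the MLE.
    result = optimize.minimize(neg_restricted, theta_hat[1:])
    return -result.fun

def loglike_threshold(loglike, theta_hat, alpha=0.05):
    """Threshold below which theta0 values leave the interval (1);
    this is the value l* of equation (3)."""
    return loglike(theta_hat) - stats.chi2.ppf(1 - alpha, df=1) / 2
```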
1.2 Existing approaches

Conceptually, the task of identifying the end points $\theta_0^{\min}$ and $\theta_0^{\max}$ of the confidence interval $I$ is equivalent to finding the maximal (or minimal) value $\bar\theta_0$ with

$$\ell_{\mathrm{PL}}(\bar\theta_0) = \ell^* := \ell(\hat\theta) - \tfrac12\chi^2_{1,1-\alpha}. \tag{3}$$

Here, $\hat\theta$ denotes the MLE; the value $\ell^*$ follows from rearranging the terms in the inequality characterizing $I$ (see equation (1)).

There are two major perspectives to address this problem. It could either be understood as a one-dimensional root finding problem on $\ell_{\mathrm{PL}}$ or as the constrained maximization (or minimization) problem

$$\theta_0^{\max} = \max_{\theta\in\Theta:\,\ell(\theta)\ge\ell^*}\theta_0 \tag{4}$$

($\theta_0^{\min}$ analogously). Approaches developed from either perspective face the challenge of balancing robustness against efficiency.

The root finding perspective (Cook and Weisberg, 1990; DiCiccio and Tibshirani, 1991; Stryhn and Christensen, 2003; Moerbeek et al., 2004; Ren and Xia, 2019) is robust if small steps are taken and solutions of the maximization problem (2) are good initial guesses for the maximizations in later steps. Nonetheless, the step size should be variable if parameters might not be estimable and the confidence intervals large. At the same time, care must be taken with large steps, as solving (2) can be difficult if the initial guesses are poor, and algorithms may fail to converge. Therefore, conservative step choices are often advisable even though they may decrease the overall efficiency of the approaches.

The constrained maximization perspective (Neale and Miller, 1997; Wu and Neale, 2012) has the advantage that efficient solvers for such problems are readily implemented in many optimization packages. If the likelihood function is "well behaved", these methods converge very quickly. However, in practical problems, the likelihood function may have local extrema, e.g. due to lack of data, or steep "cliffs" that may hinder these algorithms from converging to a feasible solution. Furthermore, general algorithms are typically not optimized for problems like (4), in which the target function is simple and the major challenge is in ensuring that the constraint is met. Therefore, an approach would be desirable that is specifically tailored to solve the constrained maximization (4) in a robust and efficient manner.

A first step in this direction is the algorithm by Venzon and Moolgavkar (1988), which solves (4) by repeated quadratic approximations of the likelihood surface. As the method is of Newton-Raphson type, it is very efficient as long as the local approximations are accurate. Therefore, the algorithm is fast if the asymptotic normality of the MLE is achieved approximately. Otherwise, the algorithm relies heavily on good initial guesses. Though methods to determine accurate initial guesses exist (Gimenez et al., 2005), the algorithm by Venzon and Moolgavkar (1988) (below abbreviated as VM) can get stuck in local extrema or fail to converge if the likelihood surface has unfavorable properties (see e.g. Ren and Xia, 2019). Moreover, the algorithm will break down if parameters are not identifiable. Thus, VM cannot be applied in important use cases of profile likelihood confidence intervals.

1.3 Our contributions

In this paper, we address the issues of VM by introducing an algorithm extending the ideas of Venzon and Moolgavkar (1988). Our algorithm, which we will call the
Robust Venzon-Moolgavkar Algorithm (RVM) below, combines the original procedure with a trust region approach (Conn et al., 2000). That is, the algorithm never steps outside of the region in which the likelihood approximation is sufficiently precise. Furthermore, RVM accounts for unidentifiable parameters, local minima and maxima, and sharp changes in the likelihood surface. Though RVM may not outcompete traditional approaches in problems with well-behaved likelihood functions or in the absence of estimability issues, we argue that RVM is a valuable alternative in the (common) cases that the likelihood function is hard to optimize and the model involves parameters that are not estimable.

Another well-known limitation of the approach by Venzon and Moolgavkar (1988) is that it is not directly applicable to construct confidence intervals for functions of parameters. Often the main research interest is not in identifying specific model parameters but in obtaining model predictions, which can be expressed as a function of the parameters. In addition to presenting a robust algorithm to find confidence intervals for model parameters, we show how RVM (and the original VM) can also be applied to determine confidence intervals for functions of parameters.

This paper is structured as follows: in the first section, we start by outlining the main ideas behind RVM before we provide details of the applied procedures. Furthermore, we briefly describe how the algorithm can be used to determine confidence intervals of functions of parameters. In the second section, we apply RVM and alternative algorithms to benchmark problems with simulated data. Thereby, we review the implemented alternative algorithms before we present the results. We conclude this paper with a discussion of the benchmark results and the benefits and limitations of RVM in comparison to earlier methods.

All code used in this study, including a Python implementation of RVM, can be retrieved from the Python package index as package ci-rvm (see pypi.org/project/ci-rvm).

2 The algorithm

2.1 Setup

Consider an $n$-dimensional parameter vector $\theta := (\theta_0, \dots, \theta_{n-1})^\top$ and a twice continuously differentiable log-likelihood function $\ell$. Assume without loss of generality that we seek to construct a confidence interval with confidence level $1-\alpha$ for the parameter $\theta_0$, and let $\tilde\theta := (\theta_1, \dots, \theta_{n-1})^\top$ be the vector of all remaining parameters, called nuisance parameters. For convenience, we may write $\ell = \ell(\theta)$ as a function of the complete parameter vector or $\ell = \ell(\theta_0, \tilde\theta)$ as a function of the parameter of interest and the nuisance parameters.

The algorithm RVM introduced in this paper searches for the right end point $\theta_0^{\max}$ (equation (4)) of the confidence interval $I$. The left end point can be identified with the same approach if a modified model is considered in which $\ell$ is flipped in $\theta_0$. As RVM builds on the method by Venzon and Moolgavkar (1988), we start by recapitulating their algorithm VM below.

Let $\theta^* \in \Theta$ be the admissible parameter vector at which the parameter of interest is maximal, $\theta_0^* = \theta_0^{\max}$, and $\ell(\theta^*) \ge \ell^*$. Venzon and Moolgavkar (1988) note that $\theta^*$ satisfies the following necessary conditions:

1. $\ell(\theta^*) = \ell^*$, and
2. $\ell$ is in a local maximum with respect to the nuisance parameters, which implies $\partial\ell/\partial\tilde\theta(\theta^*) = 0$.

The algorithm VM searches for $\theta^*$ by minimizing both the log-likelihood distance to the threshold $|\ell(\theta) - \ell^*|$ and the gradient with respect to the nuisance parameters $\partial\ell/\partial\tilde\theta$.
To this end, the algorithm repeatedly approximates the log-likelihood surface $\ell$ with second order Taylor expansions $\hat\ell$. If $\theta^{(i)}$ is the parameter vector in the $i$-th iteration of the algorithm, expanding $\ell$ around $\theta^{(i)}$ yields

$$\hat\ell(\theta) := \ell(\theta^{(i)}) + g^\top(\theta - \theta^{(i)}) + \tfrac12(\theta - \theta^{(i)})^\top H\,(\theta - \theta^{(i)}) = \bar\ell + \tilde g^\top\tilde\delta + g_0\delta_0 + \tfrac12\tilde\delta^\top\tilde H\tilde\delta + \delta_0\tilde H_0^\top\tilde\delta + \tfrac12\delta_0^2 H_{00} =: \hat\ell^\delta(\delta_0, \tilde\delta). \tag{5}$$

Here, $\delta := \theta - \theta^{(i)}$ and $\bar\ell := \ell(\theta^{(i)})$; $g := \partial\ell/\partial\theta\,(\theta^{(i)})$ is the gradient and $H := \partial^2\ell/\partial\theta^2\,(\theta^{(i)})$ the Hessian matrix of $\ell$ at $\theta^{(i)}$. Analogously to the notation used above, we split $\delta$ into its first entry $\delta_0$ and the remainder $\tilde\delta$, $g$ into $g_0$ and $\tilde g$, and write $H_0$ for the first column of $H$, $\tilde H$ for $H$ without its first column and row, and split $H_0$ into $H_{00}$ and $\tilde H_0$.

In each iteration, VM seeks $\delta_0^*$ and $\tilde\delta^*$ that satisfy conditions 1 and 2. Applying condition 2 to the approximation $\hat\ell^\delta$ (equation (5)) yields

$$\tilde\delta^* = -\tilde H^{-1}\left(\tilde H_0\delta_0^* + \tilde g\right). \tag{6}$$

Inserting (5) and (6) into condition 1 gives us

$$\ell^* = \tfrac12\left(H_{00} - \tilde H_0^\top\tilde H^{-1}\tilde H_0\right)\delta_0^{*2} + \left(g_0 - \tilde g^\top\tilde H^{-1}\tilde H_0\right)\delta_0^* + \bar\ell - \tfrac12\tilde g^\top\tilde H^{-1}\tilde g, \tag{7}$$

which can be solved for $\delta_0^*$ if $H$ is negative definite. If equation (7) has multiple solutions, Venzon and Moolgavkar (1988) choose the one that minimizes $\delta$ according to some norm. Our algorithm RVM applies a different procedure and chooses the root that minimizes the distance to $\theta_0^{\max}$ without stepping into a region in which the approximation (5) is inaccurate. In section 2.5, we provide further details and discuss the case in which equation (7) has no real solutions.

After each iteration, $\theta$ is updated according to the above results:

$$\theta^{(i+1)} = \theta^{(i)} + \delta^*. \tag{8}$$

If $\ell(\theta^{(i+1)}) \approx \ell^*$ and $\partial\ell/\partial\tilde\theta(\theta^{(i+1)}) \approx 0$, the algorithm terminates and $\theta^{(i+1)}$ is returned.
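The following sketch condenses one VM iteration (equations (5)-(8)) into code. It assumes an invertible nuisance Hessian and, for brevity, omits the $a=0$ case and the refined root selection rules discussed in section 2.5; it is an illustration, not the authors' implementation.

```python
import numpy as np

def vm_step(l_bar, l_star, g, H):
    """One Venzon-Moolgavkar step: returns delta = (delta0, delta_tilde).
    g and H are the gradient and Hessian of l at the current point."""
    g0, g_t = g[0], g[1:]
    H00, H0_t, H_t = H[0, 0], H[1:, 0], H[1:, 1:]
    H_t_inv = np.linalg.inv(H_t)              # assumes H_t is invertible
    # Coefficients of the quadratic equation (7) in delta0:
    a = 0.5 * (H00 - H0_t @ H_t_inv @ H0_t)
    p = g0 - g_t @ H_t_inv @ H0_t
    q = l_bar - 0.5 * g_t @ H_t_inv @ g_t - l_star
    disc = p ** 2 - 4 * a * q
    if disc < 0:
        raise ValueError("equation (7) has no real root")
    roots = (-p + np.array([1.0, -1.0]) * np.sqrt(disc)) / (2 * a)
    delta0 = roots[np.argmin(np.abs(roots))]  # e.g. the smaller-norm root
    delta_t = -H_t_inv @ (H0_t * delta0 + g_t)  # equation (6)
    return np.concatenate(([delta0], delta_t))
```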
The need to extend the original algorithm VM outlined above comes from the following issues: (1) The quadratic approximation $\hat\ell$ may be imprecise far from the approximation point. In extreme cases, updating $\theta$ as suggested could take us farther away from the target $\theta^*$ rather than closer to it. (2) The approximation $\hat\ell$ may be constant in some directions or not bounded above. In these cases, we may not be able to identify unique solutions for $\delta_0$ and $\tilde\delta$, and the gradient criterion in condition 2 may not characterize a maximum but a saddle point or a minimum. (3) The limited precision of numerical operations can result in discontinuities corrupting the results of VM and hinder the algorithm from terminating.

To circumvent these problems, we introduce a number of extensions to VM. First, we address the limited precision of the Taylor approximation $\hat\ell$ with a trust region approach (Conn et al., 2000). That is, we constrain our search for $\delta^*$ to a region in which the approximation $\hat\ell$ is sufficiently accurate. Second, we choose some parameters freely if $\hat\ell$ is constant in some directions, and solve constrained maximization problems if $\hat\ell$ is not bounded above. In particular, we detect cases in which $\ell_{\mathrm{PL}}$ approaches an asymptote above $\ell^*$, which means that $\theta_0$ is not estimable. Lastly, we introduce a method to identify and jump over discontinuities as appropriate. An overview of the algorithm is depicted as a flow chart in Figure 1. Below, we describe each of our extensions in detail.

Figure 1: Flow chart for RVM. The procedure is repeated until the termination criterion is met and the result is returned.

2.2 The trust region

In practice, the quadratic approximation (5) may not be good enough to reach a point close to $\theta^*$ within one step. In fact, since $\ell$ may be very "non-quadratic", we might obtain a parameter vector for which $\ell$ and $\partial\ell/\partial\tilde\theta$ are farther from $\ell^*$ and $0$ than in the previous iteration. Therefore, we accept changes in $\theta$ only if the approximation is sufficiently accurate in the new point.
In each iteration $i$, we compute the new parameter vector, compare the values of $\hat\ell$ and $\ell$ at the obtained point $\theta^{(i)} + \delta^*$, and accept the step if, and only if, $\hat\ell$ and $\ell$ are close together with respect to a given distance measure. If $\bar\ell$ is near the target $\ell^*$, we may also check the precision of the gradient approximation $\partial\hat\ell/\partial\tilde\theta$ to enforce timely convergence of the algorithm.

If we reject a step, we decrease the value $\delta_0^*$ obtained before, reduce the maximal admissible length $r$ of the nuisance parameter step, and solve the constrained maximization problem

$$\tilde\delta^* = \operatorname*{argmax}_{\tilde\delta:\,|\tilde\delta|\le r}\hat\ell^\delta(\delta_0^*, \tilde\delta). \tag{9}$$

As the quadratic subproblem (9) appears in classical trust-region algorithms, efficient solvers are available (Conn et al., 2000) and implemented in optimization software, such as in the Python package Scipy (Jones et al., 2001).

We check the accuracy of the approximation at the resulting point $\theta^{(i)} + \delta^*$, decrease the search radius if necessary, and continue with this procedure until the approximation is sufficiently precise. The metric and the tolerance applied to measure the approximation's precision may depend on how far the current log-likelihood $\bar\ell$ is from the target $\ell^*$. We suggest suitable precision measures in section 2.8.

Since it is often computationally expensive to compute the Hessian $H$, we desire to take steps $\delta$ as large as possible. However, it is also inefficient to adjust the search radius very often to find the maximal admissible $\delta^*$. Therefore, RVM first attempts to make the unconstrained step given by equation (5). If this step is rejected, RVM determines the search radius with a log-scale binary search between the radius of the unconstrained step and the search radius accepted in the previous iteration. If even the latter radius does not lead to a sufficiently precise result, we update $\delta_0^*$ and $r$ by factors $\beta_0, \beta \in (0, 1)$ so that $\delta_0^* \leftarrow \beta_0\delta_0^*$ and $r \leftarrow \beta r$.
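As a sketch, the constrained subproblem (9) can be handed to a generic solver. Dedicated trust-region subproblem routines (Conn et al., 2000) are more efficient; the use of Scipy's trust-constr method here is merely an illustrative assumption.

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

def constrained_nuisance_step(delta0, g, H, r):
    """Maximize the quadratic model (5) over the nuisance step
    delta_tilde subject to |delta_tilde| <= r (subproblem (9))."""
    g_t = g[1:]
    H0_t, H_t = H[1:, 0], H[1:, 1:]

    def neg_model(d_t):
        # Model (5) with delta0 held fixed; sign flipped for minimization.
        return -(g_t @ d_t + delta0 * (H0_t @ d_t) + 0.5 * d_t @ H_t @ d_t)

    ball = NonlinearConstraint(lambda d_t: d_t @ d_t, 0.0, r ** 2)
    res = minimize(neg_model, np.zeros(g_t.size),
                   method="trust-constr", constraints=[ball])
    return res.x
```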
2.3 Linearly dependent parameters

The right hand side of equation (6) is defined only if the nuisance Hessian $\tilde H$ is invertible. If $\tilde H$ is singular, the maximum with respect to the nuisance parameters is not uniquely defined or does not exist at all. We will consider the second case in the next section and focus on the first case here.

There are multiple options to compute a pseudo-inverse of a singular matrix to solve underspecified linear equation systems (Rao, 1967). A commonly used approach is the Moore-Penrose inverse (Penrose, 1955), which yields a solution with minimal norm (Rao, 1967). This is a desirable property for our purposes, as the quadratic approximation is generally most precise close to the approximation point. The Moore-Penrose inverse can be computed efficiently with singular value decompositions (Golub and Kahan, 1965), which have also been applied to determine the number of identifiable parameters in a model (Eubank and Webster, 1985; Viallefont et al., 1998).

Whether or not a matrix is singular is often difficult to determine precisely due to numerical inaccuracies. The Moore-Penrose inverse is therefore highly sensitive to a threshold parameter determining when the considered matrix is deemed singular. As the Hessian matrix is typically computed with numerical methods subject to error, it is often beneficial to choose a high value for this threshold parameter to increase the robustness of the method. Too large threshold values, however, can slow down or even hinder convergence of the algorithm.

An alternative method to account for singular Hessian matrices is to hold linearly dependent parameters constant until the remaining parameters form a non-singular system. In tests, this approach appeared to be more robust than applying the Moore-Penrose inverse. Therefore, we used this method in our implementation. We provide details on this method as well as test results in Supplementary Appendix A. Note that we write $\tilde H^{-1}$ for this generalized inverse below.

To determine whether the approximate system has any solution when $\tilde H$ is singular, we test whether $\tilde\delta^*$ computed according to equations (6) and (7) indeed satisfies the necessary conditions for a maximum in the nuisance parameters. That is, we check whether

$$0 \approx \frac{\partial}{\partial\tilde\delta}\hat\ell^\delta = \tilde H\tilde\delta^* + \tilde H_0\delta_0^* + \tilde g \tag{10}$$

holds up to a certain tolerance. If this is not the case, $\hat\ell$ is unbounded, and we proceed as outlined in the next section.
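A minimal sketch of the Moore-Penrose variant discussed above: NumPy's pinv with its rcond argument plays the role of the sensitive singularity threshold, and the returned flag encodes the stationarity check (10). The implementation described in Supplementary Appendix A replaces this with the row-elimination method.

```python
import numpy as np

def nuisance_step_pinv(delta0, g_t, H0_t, H_t, rcond=1e-10, tol=1e-8):
    """Minimal-norm nuisance step for a singular H_t, plus the
    stationarity test of equation (10)."""
    delta_t = -np.linalg.pinv(H_t, rcond=rcond) @ (H0_t * delta0 + g_t)
    residual = H_t @ delta_t + H0_t * delta0 + g_t  # equation (10)
    is_stationary = np.linalg.norm(residual) <= tol
    return delta_t, is_stationary
```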
2.4 Solving unbounded subproblems

In each iteration, we seek the nuisance parameters $\tilde\theta$ that maximize $\ell$ for the computed value of $\theta_0$. Since the log-likelihood function $\ell$ is bounded above, such a maximum must exist in theory. However, the approximate log-likelihood $\hat\ell$ can be unbounded at times, which implies that the approximation is imprecise for large steps. Since we cannot identify a global maximum of $\hat\ell$ if it is unbounded, we instead seek the point maximizing $\hat\ell$ in the range where $\hat\ell$ is sufficiently accurate.

To test whether $\hat\ell$ is unbounded in the nuisance parameters, we first check whether $\tilde H$ is negative semi-definite. If $\tilde H$ is invertible, this test can be conducted by applying a Cholesky decomposition to $-\tilde H$, which succeeds if and only if $\tilde H$ is negative definite. If $\tilde H$ is singular, we use an eigenvalue decomposition instead: if all eigenvalues are below a small threshold, $\tilde H$ is negative semi-definite. To confirm that $\hat\ell$ is bounded, we also test whether equation (10) holds approximately if $\tilde H$ is singular (see section 2.3).

If either of these tests fails, $\hat\ell$ is unbounded. In this case, we set $\delta_0^* \leftarrow r_0$ and $r \leftarrow r_1$ for some parameters $r_0, r_1 > 0$. The values $r_0$ and $r_1$ can be adjusted and saved for future iterations to efficiently identify the maximal admissible step. That is, we may increase (or reduce) $\delta_0^*$ and $r$ as long as (or until) $\hat\ell$ is sufficiently precise. Thereby, we adjust the ratio of $\delta_0^*$ and $r$ so that the likelihood increases: $\hat\ell^\delta(\delta_0^*, \tilde\delta^*) > \bar\ell$.
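The definiteness tests described above translate directly into code; a sketch:

```python
import numpy as np

def is_negative_definite(H_t):
    """-H_t has a Cholesky factor iff it is positive definite,
    i.e. iff H_t is negative definite."""
    try:
        np.linalg.cholesky(-H_t)
        return True
    except np.linalg.LinAlgError:
        return False

def is_negative_semidefinite(H_t, tol=1e-10):
    """Eigenvalue-based test used when H_t is singular."""
    return bool(np.all(np.linalg.eigvalsh(H_t) <= tol))
```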
2.5 Step choice for the parameter of interest

Whenever $\hat\ell$ has a unique maximum in the nuisance parameters, we compute $\delta_0^*$ by solving equation (7). This equation can have one, two, or no roots. To discuss how $\delta_0^*$ should be chosen in each of these cases, we introduce some helpful notation. First, we write $\hat\ell_{\mathrm{PL}}(\theta_0) := \max_{\tilde\theta}\hat\ell(\theta_0, \tilde\theta)$ for the profile log-likelihood function of the quadratic approximation. Furthermore, we write in accordance with previous notation

$$\hat\ell^\delta_{\mathrm{PL}}(\delta_0) := \hat\ell_{\mathrm{PL}}\left(\theta_0^{(i)} + \delta_0\right) = a\delta_0^2 + p\delta_0 + q + \ell^* \tag{11}$$

with $a := \tfrac12\left(H_{00} - \tilde H_0^\top\tilde H^{-1}\tilde H_0\right)$, $p := g_0 - \tilde g^\top\tilde H^{-1}\tilde H_0$, and $q := \bar\ell - \tfrac12\tilde g^\top\tilde H^{-1}\tilde g - \ell^*$ (see equation (7)).

Our choices of $\delta_0^*$ attempt to increase $\theta_0$ as much as possible while staying in a region in which the approximation $\hat\ell$ is reasonably accurate. The specific step choice depends on the slope of the profile likelihood $\hat\ell^\delta_{\mathrm{PL}}$ and on whether we have already exceeded $\theta_0^{\max}$ according to our approximation, i.e. $\hat\ell^\delta_{\mathrm{PL}}(0) < \ell^*$. Below, we first assume that $\hat\ell^\delta_{\mathrm{PL}}(0) \ge \ell^*$ and discuss the opposite case later.

1. If the profile likelihood decreases at the approximation point, i.e. $p < 0$, we select the smallest positive root of $\hat\ell^\delta_{\mathrm{PL}}(\delta_0) = \ell^*$:

$$\delta_0^* = \begin{cases} -q/p & \text{if } a = 0,\\[2pt] -\frac{1}{2a}\left(p + \sqrt{p^2 - 4aq}\right) & \text{else.} \end{cases} \tag{12}$$

Choosing $\delta_0^* \le 0$ would mean that $\theta_0$ decreases in this iteration. Choosing the smaller positive root increases our trust in the accuracy of the approximation and prevents potential convergence issues (see Figure 2a). If $\hat\ell^\delta_{\mathrm{PL}}$ has a local minimum above the threshold $\ell^*$, equation (11) does not assume the value $\ell^*$, and we could attempt to decrease the distance between $\hat\ell^\delta_{\mathrm{PL}}$ and $\ell^*$ instead. This procedure, however, may let RVM converge to a local minimum of $\hat\ell^\delta_{\mathrm{PL}}$ rather than to a point with $\hat\ell^\delta_{\mathrm{PL}} = \ell^*$. Therefore, we "jump" over the extreme point by doubling the step to the vertex. That is, we choose

$$\delta_0^* = -\frac{p}{a} \tag{13}$$

if $p^2 < 4aq$ (see Figure 2b).

2. If the profile likelihood increases at the approximation point, i.e. $p > 0$, equation (11) has a positive root with value $\ell^*$ if and only if $\hat\ell_{\mathrm{PL}}$ is concave down, $a < 0$. We choose this root whenever it exists:

$$\delta_0^* = -\frac{1}{2a}\left(p + \sqrt{p^2 - 4aq}\right). \tag{14}$$

However, if $\hat\ell_{\mathrm{PL}}$ grows unboundedly, equation (11) does not have a positive root. In this case, we change the threshold value $\ell^*$ temporarily to a value $\ell_0^*$ chosen so that equation (11) has a solution with the updated threshold (see Figure 2c). For example, we may set

$$\ell_0^* := \max\left\{\hat\ell^\delta_{\mathrm{PL}}(0) + 1,\ \tfrac12\left(\bar\ell + \ell(\hat\theta)\right)\right\}.$$

This choice ensures that a solution exists while at the same time reaching local likelihood maxima quickly. After resetting the threshold, we proceed as usual. To memorize that we changed the threshold value $\ell^*$, we set a flag maximizing := True. In future iterations $j > i$, we set the threshold back to its initial value $\ell^*$ and maximizing := False as soon as $\ell(\theta^{(j)}) < \ell^*$ or $\hat\ell_{\mathrm{PL}}$ is concave down at the approximation point $\theta^{(j)}$.

3. If the profile likelihood has a local extremum at the approximation point, i.e. $p = 0$, $a \ne 0$, we proceed as in cases 1 and 2: if $a > 0$, we proceed as if $\hat\ell_{\mathrm{PL}}$ were increasing, and if $a < 0$, we proceed as if $\hat\ell_{\mathrm{PL}}$ were decreasing. However, the approximate profile likelihood could also be constant, $a = p = 0$. In this case, we attempt to make a very large step to check whether we can push $\theta_0$ arbitrarily far. In section 2.6, we discuss this procedure in greater detail.

4. If the profile likelihood at the approximation point is below the threshold, $\hat\ell^\delta_{\mathrm{PL}}(0) < \ell^*$, we always choose the smallest possible step:

$$\delta_0^* = \begin{cases} -\frac{1}{2a}\left(p + \sqrt{p^2 - 4aq}\right) & \text{if } a \ne 0,\ p < 0,\\[2pt] -q/p & \text{if } a = 0,\ p \ne 0,\\[2pt] -\frac{1}{2a}\left(p - \sqrt{p^2 - 4aq}\right) & \text{if } a \ne 0,\ p > 0. \end{cases} \tag{15}$$

This shall bring us back to the admissible parameter region as quickly as possible. As RVM rarely steps far beyond the admissible region in practice, equation (15) usually suffices to define $\delta_0^*$. Nonetheless, if we find that $\hat\ell^\delta_{\mathrm{PL}}$ has a local maximum below the threshold, i.e. $p^2 < 4aq$, we may instead maximize $\hat\ell^\delta_{\mathrm{PL}}$ as far as possible:

$$\delta_0^* = -\frac{p}{2a}. \tag{16}$$

If we have already reached such a local maximum ($p \approx 0$), this does not yield a usable step $\delta_0^*$. In this case, we may recall the iteration $k := \operatorname*{argmax}_{j:\,\ell(\theta^{(j)})\ge\ell^*}\theta_0^{(j)}$, in which the largest admissible value of $\theta_0$ with $\ell(\theta^{(k)}) \ge \ell^*$ has been found so far, and conduct a binary search between $\theta^{(i)}$ and $\theta^{(k)}$ until we find a point $\theta^{(i+1)}$ with $\ell(\theta^{(i+1)}) \ge \ell^*$.

Figure 2: Step choice for $\theta_0$ in special cases. The figures depict the profile likelihood function $\ell_{\mathrm{PL}}$ (solid black), its quadratic approximation $\hat\ell_{\mathrm{PL}}$ (dashed parabola), and the threshold log-likelihood $\ell^*$. (a) The approximation has two roots. Though the largest root of $\ell_{\mathrm{PL}}$ is searched, the smaller root of $\hat\ell_{\mathrm{PL}}$ is closest to the desired result. In fact, consistently choosing the larger root would let the algorithm diverge. (b) If $\ell_{\mathrm{PL}}$ is decreasing but $\hat\ell_{\mathrm{PL}}$ does not assume the threshold value $\ell^*$, we "jump" over the local minimum. (c) If $\ell_{\mathrm{PL}}$ is increasing but $\hat\ell_{\mathrm{PL}}$ does not assume the threshold value $\ell^*$, we reset the target value to an increased value $\ell_0^*$.
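The case analysis above can be consolidated into a single step-selection routine. The sketch below returns the proposed $\delta_0^*$, or None when no sensible root exists (the algorithm then raises the target as in case 2, or falls back to the binary search of case 4). The constant LARGE_STEP and the simplified handling of degenerate edge cases are assumptions made for illustration.

```python
import numpy as np

LARGE_STEP = 1e3  # probe step used when the model is constant (case 3)

def choose_delta0(a, p, q):
    """Select delta0* from the quadratic model (11),
    l_PL(delta0) - l* = a*delta0**2 + p*delta0 + q, per section 2.5."""
    disc = p ** 2 - 4 * a * q
    if a == 0 and p == 0:                       # constant model (case 3)
        return LARGE_STEP if q >= 0 else None
    if q >= 0:                                  # inside the region
        if p < 0 or (p == 0 and a < 0):         # cases 1 and 3
            if a == 0:
                return -q / p                   # eq. (12), linear model
            if disc >= 0:
                return -(p + np.sqrt(disc)) / (2 * a)  # eq. (12)
            return -p / a                       # eq. (13): jump over minimum
        if a < 0:                               # case 2, concave down
            return -(p + np.sqrt(disc)) / (2 * a)      # eq. (14)
        return None                             # unbounded growth: raise l*
    # case 4: below the threshold
    if a == 0:
        return -q / p                           # eq. (15), linear model
    if disc < 0:
        return -p / (2 * a)                     # eq. (16): local maximum
    root = np.sqrt(disc)
    return -(p + root) / (2 * a) if p < 0 else -(p - root) / (2 * a)
```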
2.6 Identifying inestimable parameters

In some practical scenarios, the profile log-likelihood $\ell_{\mathrm{PL}}$ never falls below the threshold $\ell^*$, which means that the considered parameter is not estimable. In these cases, RVM may not converge. However, it is often possible to identify inestimable parameters by introducing a step size limit $\delta_0^{\max}$. If the computed step exceeds the maximal step size, $\delta_0^* > \delta_0^{\max}$, and the current function value exceeds the threshold value, i.e. $\bar\ell \ge \ell^*$, we set $\delta_0^* := \delta_0^{\max}$ and compute the corresponding nuisance parameters. If the resulting log-likelihood $\ell(\theta^{(i)} + \delta^*)$ is not below the threshold $\ell^*$, we let the algorithm terminate, raising a warning that the parameter $\theta_0$ is not estimable. If $\ell(\theta^{(i)} + \delta^*) < \ell^*$, however, we cannot draw this conclusion and decrease the step size until the approximation is sufficiently close to the original function.

The criterion suggested above may not always suffice to identify inestimable parameters. For example, if the profile likelihood is constant but the nuisance parameters maximizing the likelihood change non-linearly, RVM may not halt. For this reason, and also to prevent unexpected convergence issues, it is advisable to impose an iteration limit on the algorithm. If the iteration limit is exceeded, potential estimability issues may be investigated further.

2.7 Discontinuities

RVM is based on quadratic approximations and therefore requires that $\ell$ is twice differentiable. Nonetheless, discontinuities can occur due to numerical imprecision even if the likelihood function is continuous in theory. Though we may still be able to compute the gradient $g$ and the Hessian $H$ in these cases, the resulting quadratic approximation will be inaccurate even if we take very small steps. Therefore, these discontinuities could hinder the algorithm from terminating.

To identify discontinuities, we define a minimal step size $\epsilon$, which may depend on the gradient $g$. If we reject a step with small length $|\delta^*| \le \epsilon$, we conclude that $\ell$ is discontinuous at the current approximation point $\theta^{(i)}$. To determine the set $D$ of parameters responsible for the issue, we decompose $\delta^*$ into its components. We initialize $D \leftarrow \emptyset$ and consider, with $e_j$ denoting the $j$-th unit vector, the partial steps $\delta^*_{:k} := \sum_{j\le k,\, j\notin D} e_j\delta^*_j$ until $\hat\ell^\delta(\delta^*_{:k})$ deviates strongly from $\ell^\delta(\delta^*_{:k})$ for some $k < n$. When we identify such a component, we add it to the set $D$ and continue the procedure.

If we find that $\ell$ is discontinuous in $\theta_0$, we check whether the current nuisance parameters maximize the likelihood, i.e. whether $\ell$ is bounded above and $\tilde g$ is approximately $0$. If the nuisance parameters are not optimal, we hold $\theta_0$ constant and maximize $\ell$ with respect to the nuisance parameters. Otherwise, we conclude that the profile likelihood function has a jump discontinuity. In this case, our action depends on the current log-likelihood value $\bar\ell$, the value of $\ell$ at the other end of the discontinuity, and the threshold $\ell^*$.

– If $\ell(\theta^{(i)} + \delta^*) \ge \ell^*$ or $\ell(\theta^{(i)}) < \ell(\theta^{(i)} + \delta^*)$, we accept the step regardless of the undesirably large error.
– If $\ell(\theta^{(i)} + \delta^*) < \ell^*$ and $\ell(\theta^{(i)}) \ge \ell^*$, we terminate and return $\theta_0^{(i)}$ as the bound of the confidence interval.
– Otherwise, we cannot make a sensible step and try to get back into the admissible region by conducting the binary search procedure described in case 4 of section 2.5.

If $\ell$ is discontinuous in variables other than $\theta_0$, we hold the variables constant whose change decreases the likelihood and repeat the iteration with a reduced system. After a given number of iterations, we release these parameters again, as $\theta$ may have left the point of discontinuity.

Since we may require that not only $\hat\ell$ but also its gradient is well approximated, a robust implementation of RVM should also handle potential gradient discontinuities. The nuisance parameters causing the issues can be identified analogously to the procedure outlined above. All components in which the gradient changes its sign from positive to negative should be held constant, as the likelihood appears to be in a local maximum in these components. The step in the remaining components may be accepted regardless of the large error.
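A sketch of the component-wise decomposition used to find the set $D$ of offending parameters; model_error is assumed to return the approximation error $|\hat\ell^\delta - \ell^\delta|$ for a partial step.

```python
import numpy as np

def find_discontinuous_components(model_error, delta, tol):
    """Rebuild the rejected small step one component at a time and
    collect the components whose addition breaks the approximation."""
    D = set()
    partial = np.zeros_like(delta)
    for j in range(delta.size):
        partial[j] = delta[j]           # tentatively add component j
        if model_error(partial) > tol:  # approximation breaks here
            D.add(j)
            partial[j] = 0.0            # drop the offending component
    return D
```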
2.8 Suitable parameters and distance measures

The efficiency of RVM depends strongly on the distance measures and parameters applied when assessing the accuracy of the approximation and updating the search radius of the constrained optimization problems. If the precision measures are overly conservative, many steps will be needed to find $\theta^*$. If the precision measure is too liberal, in turn, RVM may take detrimental steps and might not even converge.

We suggest the following procedure: (1) We always accept forward steps with $\delta_0^* \ge 0$ if the true log-likelihood is at least as large as predicted, $\ell^\delta(\delta^*) \ge \hat\ell^\delta(\delta^*)$. (2) If the approximate likelihood function is unbounded, we require that the likelihood increases, $\ell^\delta(\delta^*) \ge \bar\ell$. This requirement helps RVM to return quickly to a region in which the approximation is bounded. However, if the step size falls below the threshold used to detect discontinuities, we may relax this constraint so that less time must be spent to detect potential discontinuities. (3) If we are outside the admissible region, i.e. $\bar\ell < \ell^*$, we enforce that we get closer to the target likelihood: $|\ell^\delta(\delta^*) - \ell^*| < |\bar\ell - \ell^*|$. This reduces potential convergence issues. (4) We require that

$$\frac{\left|\hat\ell^\delta(\delta^*) - \ell^\delta(\delta^*)\right|}{\left|\bar\ell - \ell^*\right|} \le \gamma \tag{17}$$

for a constant $\gamma \in (0, 1)$. That is, the required precision depends on how close we are to the target, which facilitates fast convergence of the algorithm; a moderate value of $\gamma$ appeared to be a good choice in our tests. (5) If we are close to the target, $\ell^\delta(\delta^*) \approx \ell^*$, we also require that the gradient estimate is precise:

$$\frac{\left\|\frac{\partial\hat\ell^\delta}{\partial\tilde\theta}(\delta^*) - \frac{\partial\ell^\delta}{\partial\tilde\theta}(\delta^*)\right\|}{\left\|\tilde g\right\|} \le \gamma, \tag{18}$$

where $\|\cdot\|$ denotes a suitable vector norm, e.g. the Euclidean norm. This constraint helps us to get closer to a maximum in the nuisance parameters.

When we reject a step because the approximation is not sufficiently accurate, we adjust $\delta_0^*$ and solve the constrained maximization problem (9) requiring $|\tilde\delta| \le r$. To ensure that the resulting step does not push the log-likelihood below the target $\ell^*$, the radius $r$ should not be decreased more strongly than $\delta_0^*$. In tests, halving $\delta_0^*$ while reducing $r$ by a smaller factor appeared to be a good choice.
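A sketch of checks (17) and (18); the default $\gamma = 0.5$ and the Euclidean norm for the gradient comparison are assumptions made for illustration.

```python
import numpy as np

def accept_step(l_true, l_model, l_bar, l_star,
                g_true=None, g_model=None, g_ref=None, gamma=0.5):
    """Relative accuracy test (17), plus the gradient test (18)
    when gradients are supplied (i.e. close to the target)."""
    ok = abs(l_model - l_true) <= gamma * abs(l_bar - l_star)  # eq. (17)
    if ok and g_true is not None:                              # eq. (18)
        ok = (np.linalg.norm(g_model - g_true)
              <= gamma * np.linalg.norm(g_ref))
    return ok
```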
2.9 Confidence intervals for functions of parameters

Often, modelers are interested in confidence intervals for functions $f(\theta)$ of the parameters. A limitation of VM and RVM is that such confidence intervals cannot be computed directly with these algorithms. However, this problem can be solved approximately by considering a slightly changed likelihood function. We aim to find

$$\varphi^{\max} = \max_{\theta\in\Theta:\,\ell(\theta)\ge\ell^*} f(\theta) \tag{19}$$

or the respective minimum. Define

$$\check\ell(\varphi, \theta) := \ell(\theta) - \left(\frac{f(\theta) - \varphi}{\varepsilon}\right)^2 \tfrac12\chi^2_{1,1-\alpha} \tag{20}$$

with a small constant $\varepsilon > 0$, and consider the altered maximization problem

$$\check\varphi^{\max} = \max_{(\varphi,\theta):\,\theta\in\Theta,\ \check\ell(\varphi,\theta)\ge\ell^*}\varphi, \tag{21}$$

which can be solved with VM or RVM, treating $\varphi$ as the parameter of interest.

We argue that a solution to (21) is an approximate solution to (19), whereby the error is bounded by $\varepsilon$. Let $(\varphi^{\max}, \theta^*)$ be a solution to problem (19) and $(\check\varphi^{\max}, \check\theta^*)$ a solution to problem (21). Since $\varphi^{\max} = f(\theta^*)$, we have $\check\ell(\varphi^{\max}, \theta^*) = \ell(\theta^*) \ge \ell^*$. Therefore, $(\varphi^{\max}, \theta^*)$ is also a feasible solution to (21), and it follows that $\check\varphi^{\max} \ge \varphi^{\max}$. At the same time, $\check\ell(\varphi, \theta) \le \ell(\theta)$, which implies that $f(\check\theta^*) \le f(\theta^*)$, since $\theta^*$ maximizes $f$ over a domain at least as large as the feasibility domain of (21). In conclusion, $f(\check\theta^*) \le f(\theta^*) = \varphi^{\max} \le \check\varphi^{\max}$. Lastly,

$$\ell^* = \ell(\hat\theta) - \tfrac12\chi^2_{1,1-\alpha} \le \check\ell\left(\check\varphi^{\max}, \check\theta^*\right) = \ell(\check\theta^*) - \left(\frac{f(\check\theta^*) - \check\varphi^{\max}}{\varepsilon}\right)^2 \tfrac12\chi^2_{1,1-\alpha}. \tag{22}$$

Simplifying (22) with $\ell(\check\theta^*) \le \ell(\hat\theta)$ yields $|f(\check\theta^*) - \check\varphi^{\max}| \le \varepsilon$. Thus, $|\varphi^{\max} - \check\varphi^{\max}| \le \varepsilon$.

Though it is possible to bound the error by an arbitrarily small constant $\varepsilon$ in theory, care must be taken if the function $f(\theta)$ is not well-behaved, i.e. strongly nonlinear. In these cases, overly small values of $\varepsilon$ may slow down convergence.

Note that the suggested procedure may seem to resemble the approach of Neale and Miller (1997), who also account for constraints by adding the squared error to the target function. However, unlike Neale and Miller (1997), the approach suggested above bounds the error in the confidence interval bound, not the error of the constraint. Furthermore, we do not square the log-likelihood function, which would worsen nonlinearities and could thus make optimization difficult. Therefore, our approach is less error-prone than the method by Neale and Miller (1997).
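A sketch of the modified log-likelihood (20): wrapping a given loglike this way lets RVM (or VM) treat $\varphi$ as the parameter of interest in problem (21). The function names are illustrative assumptions.

```python
from scipy import stats

def modified_loglike(loglike, f, eps, alpha=0.05):
    """Return l_check(phi, theta) from equation (20). The bound found
    for the modified problem (21) deviates from the exact bound (19)
    by at most eps."""
    c = stats.chi2.ppf(1 - alpha, df=1) / 2
    def l_check(phi, theta):
        return loglike(theta) - ((f(theta) - phi) / eps) ** 2 * c
    return l_check
```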
3 Benchmark tests

To compare the presented algorithm to existing methods, we applied RVM, the classic VM, and five other algorithms to benchmark problems and compared the robustness and performance of the approaches. Below, we review the implemented methods. Then we introduce the benchmark problems, before we finally present the benchmark results.

3.1 Methods implemented for comparison

Besides RVM and VM, we implemented three methods that repeatedly evaluate the profile likelihood function and two methods that search for the confidence intervals directly. We implemented all methods in the programming language Python version 3.7 and made use of different optimization routines implemented or wrapped in the scientific computing library Scipy (Jones et al., 2001).

First, we implemented a grid search for the confidence bounds. The approach uses repeated Lagrangian constrained optimizations and may resemble the method by DiCiccio and Tibshirani (1991); however, rather than implementing the algorithm by DiCiccio and Tibshirani (1991), we applied the constrained optimization algorithm by Lalee et al. (1998), which is a trust-region approach and may thus be more robust than the method by DiCiccio and Tibshirani (1991). Furthermore, the algorithm by Lalee et al. (1998) was readily implemented in Scipy. We conducted the grid search with a naive step size of 0.2, which we repeatedly reduced by a factor of 2 close to the threshold log-likelihood $\ell^*$ until the desired precision was achieved. To account for unidentifiable parameters, we attempted one large step (1000 units) if the algorithm did not terminate within the given iteration limit. We considered a parameter as unbounded if this step yielded a log-likelihood above the target value $\ell^*$.

Second, we implemented a quadratic bisection method for root finding on $\ell_{\mathrm{PL}}$ (cf. Ren and Xia, 2019). Initially, we chose a step size of 1. Afterwards, we computed the step of $\theta_0$ based on a quadratic interpolation between the MLE $\hat\theta_0$, the maximal value of $\theta_0$ for which we found $\ell_{\mathrm{PL}}(\theta_0) > \ell^*$, and the smallest identified value of $\theta_0$ with $\ell_{\mathrm{PL}}(\theta_0) < \ell^*$. Until a point $\theta_0$ with $\ell_{\mathrm{PL}}(\theta_0) < \ell^*$ was identified, we interpolated $\ell_{\mathrm{PL}}$ between $\hat\theta_0$ and the two largest evaluated values of $\theta_0$. When only two points were available or the approximation of $\ell_{\mathrm{PL}}$ did not assume the target value, we introduced the additional constraint $\mathrm{d}\ell_{\mathrm{PL}}/\mathrm{d}\theta_0 = 0$. Using a quadratic rather than a linear interpolation for bisection has the advantage that the algorithm converges faster if the profile log-likelihood function is convex or quadratic. To evaluate $\ell_{\mathrm{PL}}$, we applied sequential least squares programming (Kraft, 1988), which is the default method for constrained optimization in Scipy.

Third, we implemented a binary search with an initial step of 1. Until a value $\theta_0$ with $\ell_{\mathrm{PL}}(\theta_0) < \ell^*$ was found, we increased $\theta_0$ by a factor of 10. This preserves the logarithmic runtime of the algorithm if the problem has a solution. To broaden the range of tested internal optimization routines, we used a different method to evaluate $\ell_{\mathrm{PL}}$ than in the bisection method: we fixed $\theta_0$ at the desired value and performed an unconstrained optimization over the nuisance parameters. Here, we used the quasi-Newton method by Broyden, Fletcher, Goldfarb, and Shanno (BFGS; see Nocedal and Wright, 2006, pp. 136).

To test methods that search for the confidence interval end points directly, we solved problem (4) with sequential least squares programming (Kraft, 1988). Furthermore, we implemented the approximate method by Neale and Miller (1997). They transform the constrained maximization problem (4) to an unconstrained problem by considering the sum of the parameter of interest $\theta_0$ and the squared error between the target $\ell^*$ and the log-likelihood. Minimization of this target function yields a point in which the target log-likelihood is reached approximately and the parameter of interest is minimal. Again, we used the method BFGS for minimization (see above).

Finally, we implemented Wald's method to assess the need to apply any profile likelihood method at all.

3.2 Benchmark problem

To investigate the performance of the implemented methods, we applied the algorithms to a benchmark problem with variable parameter number and data set size. We considered a logistic regression problem with $n$ count data covariates $c_{ij}$, $j \in \{1, \dots, n\}$, for each data point $i \in \{1, \dots, N\}$. We assumed that the impact of the covariates levels off at high values and therefore considered the transformed covariates $c_{ij}^{\alpha_j}$ with $\alpha_j \in (0, 1)$. The distribution of the binary response $X_i$ was thus given by

$$P(X_i = 1) = \left(1 + \exp\left(-\beta_0 - \sum_j \beta_j c_{ij}^{\alpha_j}\right)\right)^{-1} \tag{23}$$

and $P(X_i = 0) = 1 - P(X_i = 1)$.

We drew the covariate values randomly from a negative binomial distribution with mean 5 and variance 10. The negative binomial distribution is commonly used to model count data (Gardner et al., 1995) and is thus suited to represent count covariates. To simulate the common case that covariates are correlated, we furthermore drew the value of every other covariate from a binomial distribution with the respective preceding covariate as count parameter. That is, for uneven $j$, $c_{i,j+1} \sim \mathrm{Binomial}(c_{ij}, p)$ with a fixed success probability $p$.
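A sketch of this data-generating process: the negative binomial parameters follow from matching mean 5 and variance 10, and the success probability p = 0.5 used here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_data(beta0, beta, alpha, N, p=0.5):
    """Simulate one benchmark data set: negative binomial covariates
    (mean 5, variance 10, hence n_nb = 5 and p_nb = 0.5), every second
    covariate drawn binomially from its predecessor, and binary
    responses following equation (23)."""
    n = len(alpha)
    c = rng.negative_binomial(5, 0.5, size=(N, n)).astype(float)
    for j in range(0, n - 1, 2):                 # correlate covariate pairs
        c[:, j + 1] = rng.binomial(c[:, j].astype(int), p)
    logit = beta0 + (np.asarray(beta) * c ** np.asarray(alpha)).sum(axis=1)
    prob = 1.0 / (1.0 + np.exp(-logit))          # equation (23)
    return c, (rng.random(N) < prob).astype(int)
```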
We chose the parameters $\alpha_j$ and $\beta_j$ so that the data were balanced, i.e. the frequencies of 0s and 1s were approximately even. Refer to Supplementary Appendix B for the parameter values we used.

3.3 Test procedure

To test the algorithms in a broad range of scenarios and assess how their performance is impacted by model characteristics, we considered a model with 1 covariate (3 parameters), a model with 5 covariates (11 parameters), and a generalized linear model (GLM) with 10 covariates, in which the powers $\alpha_j$ were set to 1 (11 parameters). Furthermore, we varied the sizes of the simulated data sets, ranging between $N = 500$ and $N = 10000$ for the models with transformed covariates and between $N = 50$ and $N = 1000$ for the GLM. In Figure 3, we depict the impact of $N$ on the shape of the likelihood function and thus the difficulty of the problem.

For each considered set of parameters, we generated 200 realizations of covariates and training data from the model described in the previous section. We determined the maximum likelihood estimate by maximizing the log-likelihood with the method BFGS and refined the estimate with an exact trust region optimizer (Conn et al., 2000). Then, we applied each of the implemented algorithms to each data set and determined the algorithms' success rates and efficiencies.

As the likelihood functions of the tested models decrease drastically at $\alpha_j = 0$, potentially causing some algorithms to fail, we constrained the $\alpha_j$ to non-negative values. To that end, we considered the transformed parameters $\alpha_j' := \ln(\exp(\alpha_j) - 1)$ and evaluated the algorithms based on the back-transformed parameters $\alpha_j$.
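The transformation and its softplus back-transformation in code; np.expm1 and np.log1p keep the mapping numerically stable near zero.

```python
import numpy as np

def to_unconstrained(alpha):
    """alpha' = ln(exp(alpha) - 1), mapping alpha > 0 to the real line."""
    return np.log(np.expm1(alpha))

def to_constrained(alpha_prime):
    """Back-transformation alpha = ln(1 + exp(alpha')) > 0 (softplus)."""
    return np.log1p(np.exp(alpha_prime))
```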
We measured the algorithms' success based on their ability to solve problem (4) rather than their capability to determine the true confidence intervals for the parameters. Though profile likelihood confidence intervals are usually highly accurate, they rely on the limiting distribution of the likelihood ratio statistic. Therefore, algorithms could fail to solve optimization problem (4) but, by coincidence, return a result close to the true confidence interval bound, and vice versa. To exclude such effects and circumvent the high computational effort required to determine highly precise confidence intervals with sampling methods, we determined the "true" confidence interval bound by choosing the widest confidence interval bound obtained by any of the tested methods, provided it was admissible, i.e. $\ell(\theta^{\max}) \ge \ell^*$ up to a small permissible error. We considered an algorithm successful if its result was within a $\pm 5\%$ range of the true confidence interval bound or had an error below a small absolute tolerance.

Figure 3: Likelihood surface of the 3-parameter benchmark model for different data set sizes $N$. As $N$ increases, the confidence region becomes smaller and closer to an elliptic shape. The orange dots depict the accepted (large dots) and rejected (small dots) steps of RVM searching for a confidence interval for $\beta_1$. RVM follows the ridge of the likelihood surface. The red dot shows the location of the MLE $\hat\theta$. The background color depicts the respective maximal log-likelihood for the given $\alpha_1$ and $\beta_1$, ranging from $\le \hat\ell - 50$ (dark blue) to $\hat\ell$ (yellow). The solid blue line denotes the target log-likelihood $\ell^*$ for a 95% confidence interval. (a) $N = 500$; (b) $N = 1000$; (c) $N = 10000$.

The quadratic bisection method exploits the approximately quadratic shape of the profile likelihood function if many data are available. In scenarios with large data sets, the bisection method was among the most efficient algorithms. The errors of the three root finding methods decreased the more data became available to fit the models. However, while the binary search had a consistently low error, both the grid search and the bisection method were more prone to large errors than all other tested methods.

The algorithms developed from the constrained maximization perspective (the method by Neale and Miller and direct constrained maximization) had success rates ranging between 45% and 85% in problems with transformed covariates. In the GLM scenario, the success rate was smaller with 50 data points and higher with more data. The direct constrained maximization procedure was slightly more successful than the method by Neale and Miller (1997). Both methods required relatively few function evaluations, whereby direct constrained maximization performed better. Both methods were less prone to large errors than the grid search and the bisection method. However, the outlier-reduced error was on average more than twice as large as with any other method except RVM.
4 Discussion

We presented an algorithm that determines the end points of profile likelihood confidence intervals both of parameters and of functions of parameters with high robustness and efficiency. We tested the algorithm in scenarios varying in parameter number, size of the data set, and complexity of the likelihood function. In the tests, our algorithm RVM was more robust than any other considered method. At the same time, RVM was among the fastest algorithms in most scenarios. This is remarkable, because there is typically a trade-off between robustness and computational speed of optimization algorithms. RVM achieves this result by exploiting the approximately quadratic form of the log-likelihood surface in "benign" cases while maintaining high robustness with the trust-region approach. Consequently, RVM naturally extends the algorithm VM (Venzon and Moolgavkar, 1988), which appeared to be highly efficient but lacking robustness in our tests.

Surprisingly, RVM turned out to be even more robust than methods based on repeated evaluations of the profile likelihood.
Figure 4: Benchmark results. The success rate (A), the mean error (B), and the number of function evaluations (C) are plotted for the 3-parameter and the 11-parameter model with transformed covariates and for the 11-parameter GLM, for the methods RVM, VM, grid search, bisection, binary search, Neale-Miller, constrained maximization, and Wald. Throughout the simulations, our algorithm RVM had the highest success rate. At the same time, RVM had a low mean error and required only few likelihood function evaluations compared to the considered alternative methods. The parameter values used to generate the figures are given in Supplementary Appendix B.

For the bisection method and the binary search, this may be due to failures of internal optimization routines, as initial guesses far from the solution can hinder accurate convergence. The grid search method, in turn, was often aborted due to the limited step size, which precluded the method from identifying confidence bounds farther than 40 units away from the respective MLE. This, however, does not explain the comparatively high error in the results of the grid search, as only successful runs were considered. We therefore hypothesize that internal optimization issues were responsible for some failures.

As expected, the algorithms that searched for the confidence interval end points directly were more efficient but less robust than algorithms that repeatedly evaluate the profile likelihood. Remarkably, a "standard" algorithm for constrained optimization performed slightly better than an unconstrained optimizer operating on the modified target function suggested by Neale and Miller (1997). This indicates that the approximation introduced by Neale and Miller (1997) might not be necessary and may even be a disadvantage.

All methods implemented in this study (except RVM and VM) rely on general optimizers. Consequently, the performance of these methods depends on the chosen optimizers both in terms of computational speed and robustness. Careful adjustment of optimization parameters might make some of the implemented algorithms more efficient and thus more competitive in benchmark tests. Though we attempted to reduce potential bias by applying a variety of different methods, an exhaustive test of optimization routines was beyond the scope of this study. Nonetheless, the consistently good performance of RVM throughout our benchmark tests suggests that RVM is a good choice in many applications.

Though RVM performed well in our tests, there are instances in which the algorithm is not applicable or not sufficiently efficient. These are scenarios in which (1) the log-likelihood cannot be computed directly, (2) the Hessian matrix of the log-likelihood function is hard to compute, (3) the dimension of the parameter space is very large, or (4) there are multiple points in the parameter space at which problem (4) is solved locally. Below, we briefly discuss each of these limitations.

(1) In hierarchical models, the likelihood function may not be known. As RVM needs to evaluate the log-likelihood, its gradient, and its Hessian matrix, the algorithm is not applicable in these instances.
Consequently, sampling based methods, such as the parametric bootstrap (Efron, 1981), Monte Carlo methods (Buckland, 1984), or data cloning (Ponciano et al., 2009), may then be the only feasible approach to determine confidence intervals.

(2) Especially in problems with a large parameter space, it is computationally expensive to compute the Hessian matrix with finite difference methods, as the number of function calls increases in quadratic order with the length of the parameter vector. Though alternative differentiation methods, such as analytical or automatic differentiation (Griewank, 1989), are often applicable, there may be instances in which finite difference methods are the only feasible alternative. In these scenarios, optimization routines that do not require knowledge of the Hessian matrix may be faster than RVM. Note, however, that the higher computational speed may come with decreased robustness, and sampling based methods might be the only remaining option if application of RVM is infeasible.

(3) If the parameter space has a very high dimension (exceeding 1000), internal routines, such as inversion of the Hessian matrix, may become the dominant factor determining the speed of RVM. Though it may be possible in the future to make RVM more efficient, sampling based methods or algorithms that do not use the Hessian matrix may be better suited in these scenarios.

(4) RVM, as well as all other methods implemented in this study, is a local optimization algorithm. Therefore, the algorithms may converge to wrong results if maximization problem (4) has multiple local solutions. This is in particular the case if the confidence set $\{\theta_0 : \ell_{\mathrm{PL}}(\theta_0) \ge \ell^*\}$ is not connected and thus not an interval. RVM reduces the issue of local extreme points by choosing steps carefully and ensuring that the point of convergence is indeed a maximum. This contrasts with VM, which can converge to the wrong confidence interval end point (e.g. maximum instead of minimum) if the initial guesses are not chosen with care. Nonetheless, stochastic optimization routines, such as genetic algorithms (Akrami et al., 2010), and sampling methods may be better suited if a local search is insufficient.

Despite these caveats, RVM is applicable to a broad class of systems. Especially when inestimable parameters are present, commonly used methods such as VM or grid search techniques can break down or be highly inefficient. Furthermore, optimization failures are commonly observed if not enough data are available to reach the asymptotic properties of the MLE (Ren and Xia, 2019). RVM is a particularly valuable tool in these instances.

5 Conclusions

We developed and presented an algorithm to determine profile likelihood confidence intervals. In contrast to many earlier methods, our algorithm is robust in scenarios in which lack of data or a complicated likelihood function makes it difficult to find the bounds of profile likelihood confidence intervals. In particular, our method is applicable in instances in which parameters are not estimable and in cases in which the likelihood function has strong nonlinearities. At the same time, our method efficiently exploits the asymptotic properties of the maximum likelihood estimator if enough data are available.

We tested our method on benchmark problems of varying difficulty. Throughout our simulations, our method was the most robust while also being among the fastest algorithms. We therefore believe that RVM can be helpful to researchers and modelers across disciplines.
References
Akrami Y, Scott P, Edsjö J, Conrad J, Bergström L (2010) A profile likelihood analysis of the constrained MSSM with genetic algorithms. Journal of High Energy Physics 2010(4):57, DOI 10.1007/JHEP04(2010)057
Brodtkorb PA, D'Errico J (2019) numdifftools 0.9.39. URL https://github.com/pbrod/numdifftools
Buckland ST (1984) Monte Carlo confidence intervals. Biometrics 40(3):811, DOI 10.2307/2530926
Conn AR, Gould NIM, Toint PL (2000) Trust-region methods. MPS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics, Philadelphia, PA
Cook RD, Weisberg S (1990) Confidence curves in nonlinear regression. Journal of the American Statistical Association 85(410):544–551, DOI 10.1080/01621459.1990.10476233
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. No. 32 in Monographs on Statistics and Applied Probability, Routledge, Boca Raton, FL, DOI 10.1201/9781315137391
DiCiccio TJ, Tibshirani R (1991) On the implementation of profile likelihood methods. Tech. rep., Department of Statistics, University of Toronto
Efron B (1981) Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics 9(2):139–158, DOI 10.2307/3314608
Eubank RL, Webster JT (1985) The singular-value decomposition as a tool for solving estimability problems. The American Statistician 39(1):64, DOI 10.2307/2683912
Gardner W, Mulvey EP, Shaw EC (1995) Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118(3):392–404, DOI 10.1037/0033-2909.118.3.392
Gimenez O, Choquet R, Lamor L, Scofield P, Fletcher D, Lebreton JD, Pradel R (2005) Efficient profile-likelihood confidence intervals for capture-recapture models. Journal of Agricultural, Biological, and Environmental Statistics 10(2):184–196, DOI 10.1198/108571105X46462
Golub G, Kahan W (1965) Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis 2(2):205–224, DOI 10.1137/0702016
Griewank A (1989) On automatic differentiation. Mathematical Programming: Recent Developments and Applications 6(6):83–107
Jones E, Oliphant T, Peterson P (2001) SciPy: open source scientific tools for Python. URL https://scipy.org/
Kraft D (1988) A software package for sequential quadratic programming. Tech. Rep. DFVLR-FB 88-28, DLR German Aerospace Center – Institute for Flight Mechanics, Köln, Germany
Lai KL, Crassidis J, Cheng Y, Kim J (2005) New complex-step derivative approximations with application to second-order Kalman filtering. In: AIAA Guidance, Navigation, and Control Conference and Exhibit, American Institute of Aeronautics and Astronautics, San Francisco, California, DOI 10.2514/6.2005-5944
Lalee M, Nocedal J, Plantenga T (1998) On the implementation of an algorithm for large-scale equality constrained optimization. SIAM Journal on Optimization 8(3):682–706, DOI 10.1137/S1052623493262993
Moerbeek M, Piersma AH, Slob W (2004) A comparison of three methods for calculating confidence intervals for the benchmark dose. Risk Analysis 24(1):31–40, DOI 10.1111/j.0272-4332.2004.00409.x
Neale MC, Miller MB (1997) The use of likelihood-based confidence intervals in genetic models. Behavior Genetics 27(2):113–120, DOI 10.1023/A:1025681223921, URL http://link.springer.com/10.1023/A:1025681223921
Nocedal J, Wright SJ (2006) Numerical optimization, 2nd edn. Springer Series in Operations Research, Springer, New York
Penrose R (1955) A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society 51(3):406–413
Ponciano JM, Taper ML, Dennis B, Lele SR (2009) Hierarchical models in ecology: confidence intervals, hypothesis testing, and model selection using data cloning. Ecology 90(2):356–362, DOI 10.1890/08-0967.1, URL http://doi.wiley.com/10.1890/08-0967.1
Rao CR (1967) Calculus of generalized inverse of matrices. Part 1: general theory. Sankhya Ser A 29:317–342
Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, Timmer J (2009) Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25(15):1923–1929, DOI 10.1093/bioinformatics/btp358, URL https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp358
Ren X, Xia J (2019) An algorithm for computing profile likelihood based pointwise confidence intervals for nonlinear dose-response models. PLOS ONE 14(1):e0210953, DOI 10.1371/journal.pone.0210953, URL http://dx.plos.org/10.1371/journal.pone.0210953
Stryhn H, Christensen J (2003) Confidence intervals by the profile likelihood method, with applications in veterinary epidemiology. Viña del Mar, Chile, URL https://pdfs.semanticscholar.org/716e/5366865e409a430bce73d26c8770e6bbbeab.pdf
Venzon DJ, Moolgavkar SH (1988) A method for computing profile-likelihood-based confidence intervals. Applied Statistics 37(1):87, DOI 10.2307/2347496
Viallefont A, Lebreton JD, Reboulet AM, Gory G (1998) Parameter identifiability and model selection in capture-recapture models: a numerical approach. Biometrical Journal: Journal of Mathematical Methods in Biosciences 40(3):313–325, DOI 10.1002/(SICI)1521-4036(199807)40:3
Wu H, Neale MC (2012) Adjusted confidence intervals for a bounded parameter. Behavior Genetics, DOI 10.1007/s10519-012-9560-z, URL http://link.springer.com/10.1007/s10519-012-9560-z

Supplementary Appendices for "A Robust and Efficient Algorithm to Find Profile Likelihood Confidence Intervals"

Samuel M. Fischer · Mark A. Lewis

A An alternative way to account for singular matrices
In each iteration, we seek to maximize the approximate log-likelihood $\hat\ell$ with respect to the nuisance parameters. To that end, we solve the equation

$$0 = \frac{\partial}{\partial \tilde\delta} \hat\ell(\delta) = \tilde H \tilde\delta^* + \tilde h \delta_0^* + \tilde g, \tag{A1}$$

where $\tilde H$ and $\tilde g$ are the blocks of the Hessian and the gradient corresponding to the nuisance parameters, $\tilde h$ is the column of mixed second derivatives linking the parameter of interest to the nuisance parameters, and $\delta_0^*$ is the step in the parameter of interest. Equation (A1) has a unique solution if and only if $\tilde H$ is invertible; otherwise, it may have infinitely many solutions or none. In the main text, we suggested solving (A1) with the Moore-Penrose inverse if $\tilde H$ is singular. However, this procedure proved to be very sensitive to a threshold parameter in tests, and we obtained better results with an alternative method, which we describe below. We furthermore show test results comparing the two methods.
A.1 Description of the method

If $\hat\ell$ has infinitely many maxima in the nuisance parameters, we can choose some nuisance parameters freely and consider a reduced system including only the remaining, independent parameters. To that end, we check $\tilde H$ for linear dependencies at the beginning of each iteration. We are interested in a minimal set $S$ containing the indices of rows and columns whose removal from $\tilde H$ would make the matrix invertible. To compute $S$, we iteratively determine the ranks of sub-matrices of $\tilde H$ using singular value decompositions (SVDs), a well-known tool to identify the rank of a matrix.

The iterative algorithm proceeds as follows: first, we consider one row of $\tilde H$ and determine its rank. Then we add a second row, determine the rank of the new matrix, and repeat the procedure until all rows, i.e. the full matrix $\tilde H$, are considered. Whenever the addition of a row increases the matrix rank, this row is linearly independent of the previous rows. Conversely, rows that do not increase the matrix rank are linearly dependent on other rows of $\tilde H$. The indices of these rows form the set $S$.

In general, the set of linearly dependent rows is not unique. Therefore, we consider the rows of $\tilde H$ in descending order of the magnitudes of the corresponding gradient entries. This can help the algorithm to converge faster.

After $S$ is determined, we need to check whether there is a parameter vector $\theta^*$ satisfying requirements 1 and 2 from section 2.1 (main text) for the approximation $\hat\ell$. Let $\tilde H_{\mathrm{dd}}$ ("d" for "dependent") be the submatrix of $\tilde H$ that remains if all rows and columns corresponding to indices in $S$ are removed from $\tilde H$. Similarly, let $\tilde H_{\mathrm{ff}}$ ("f" for "free") be the submatrix of $\tilde H$ containing only the rows and columns corresponding to indices in $S$, and let $\tilde H_{\mathrm{df}} = \tilde H_{\mathrm{fd}}^\top$ be the matrix containing the rows whose indices are not in $S$ and the columns whose indices are in $S$. Let us define $\tilde g_{\mathrm{d}}$, $\tilde g_{\mathrm{f}}$, $\tilde\delta_{\mathrm{d}}$, and $\tilde\delta_{\mathrm{f}}$ accordingly. If $\tilde H_{\mathrm{dd}}$ is not negative definite, $\hat\ell$ is unbounded, and requirement 2 (section 2.1, main text) cannot be satisfied. Otherwise, we may attempt to solve

$$0 = \frac{\partial}{\partial \tilde\delta} \hat\ell(\delta) \tag{A2}$$

$$\Longleftrightarrow\qquad 0 = \tilde H_{\mathrm{dd}} \tilde\delta_{\mathrm{d}}^* + \tilde H_{\mathrm{df}} \tilde\delta_{\mathrm{f}}^* + \tilde h_{\mathrm{d}} \delta_0^* + \tilde g_{\mathrm{d}} \tag{A3}$$

$$0 = \tilde H_{\mathrm{df}}^\top \tilde\delta_{\mathrm{d}}^* + \tilde H_{\mathrm{ff}} \tilde\delta_{\mathrm{f}}^* + \tilde h_{\mathrm{f}} \delta_0^* + \tilde g_{\mathrm{f}}. \tag{A4}$$

If the equation system (A3)–(A4) has a solution, we can choose $\tilde\delta_{\mathrm{f}}^*$ freely. Setting $\tilde\delta_{\mathrm{f}}^* \leftarrow 0$ makes equation (A3) equivalent to

$$\tilde\delta_{\mathrm{d}}^* = -\tilde H_{\mathrm{dd}}^{-1} \left( \tilde h_{\mathrm{d}} \delta_0^* + \tilde g_{\mathrm{d}} \right). \tag{A5}$$

That is, we may set $\tilde H \leftarrow \tilde H_{\mathrm{dd}}$, $\tilde g \leftarrow \tilde g_{\mathrm{d}}$, and $\tilde\delta^* \leftarrow \tilde\delta_{\mathrm{d}}^*$ for the remainder of the current iteration and proceed as usual, whereby the free nuisance parameters are left unchanged: $\tilde\delta_{\mathrm{f}}^* = 0$. With the resulting $\delta^*$, we check whether (A4) holds approximately. If not, the log-likelihood is unbounded above. We consider this case in section 2.4 in the main text.
A.2 Tests

We implemented RVM with both suggested methods for treating linearly dependent parameters. To that end, we applied the same testing procedure described in section 3 of the main text. The two methods yielded similar results in terms of computational speed (number of required likelihood evaluations) and error in case of reported success (see section 3.1 in the main text). However, holding some parameters constant, as suggested in this Appendix, turned out to be more robust in general and led to slightly higher success rates (see Figure A1). Therefore, we suggest using this method in practice.
Figure A1 Comparison of different methods to handle linearly dependent parameters. The success rates of RVM and RVM-MPI are plotted for the 3 parameter model with transformed covariates (panel A), the 11 parameter model with transformed covariates (panel B), and the 11 parameter GLM (panel C). Though the algorithm using the Moore-Penrose inverse (RVM-MPI) performed slightly better for the GLM with little data (panel C), the method had lower success rates when the 11 parameter model with transformed covariates was considered (panel B). The parameter values used to generate the figures are given in Appendix B.
B Parameters for benchmark tests
Here we provide the parameter values used to generate the data for our benchmark tests. We tested models with transformed covariates with 3 and 11 parameters and a GLM with 11 parameters. For the models with transformed covariates, we considered scenarios with N = 500, N = 1000, N = 3000, and N = 10000 data points. The parameters for the two model families are given in Tables A1 and A2. For the GLM, we considered data sets with sizes N = 50, N = 100, N = 300, and N = 1000. We provide the parameter values in Table A3.

Table A1 Parameters for the model with 3 parameters (α, β₁, β₂) and transformed covariates.
Table A2 Parameters for the model with 11 parameters (α₁–α₅, β₁–β₆) and transformed covariates.
Table A3 Parameters for the 11 parameter GLM (β₁–β₁₁). The covariate powers αᵢ