A Robust and Efficient Algorithm to Find Profile Likelihood Confidence Intervals

Samuel M. Fischer · Mark A. Lewis

Abstract
Profile likelihood confidence intervals are a robust alternative to Wald's method if the asymptotic properties of the maximum likelihood estimator are not met. However, the constrained optimization problem defining profile likelihood confidence intervals can be difficult to solve in these situations, because the likelihood function may exhibit unfavorable properties. As a result, existing methods may be inefficient and yield misleading results. In this paper, we address this problem by computing profile likelihood confidence intervals via a trust-region approach, where steps computed based on local approximations are constrained to regions where these approximations are sufficiently precise. As our algorithm also accounts for numerical issues arising if the likelihood function is strongly non-linear or parameters are not estimable, the method is applicable in many scenarios where earlier approaches are shown to be unreliable. To demonstrate its potential in applications, we apply our algorithm to benchmark problems and compare it with six existing approaches to compute profile likelihood confidence intervals. Our algorithm consistently achieved higher success rates than any competitor while also being among the quickest methods. As our algorithm can be applied to compute both confidence intervals of parameters and model predictions, it is useful in a wide range of scenarios.
S. M. Fischer
Department of Mathematical and Statistical Sciences, 632 Central Academic Building, University of Alberta, Edmonton, AB, T6G 2G1. E-mail: samuel.fischer@ualberta.ca

M. A. Lewis
Department of Mathematical and Statistical Sciences, 632 Central Academic Building, University of Alberta, Edmonton, AB, T6G 2G1.
Department of Biological Sciences, CW 405, Biological Sciences Building, University of Alberta, Edmonton, AB, T6G 2E9.
Keywords computer algorithm, constrained optimization, parameter estimation, estimability, identifiability
Declarations

Funding.
SMF is thankful for the funding received from the Canadian Aquatic Invasive Species Network and the Natural Sciences and Engineering Research Council of Canada (NSERC); MAL gratefully acknowledges an NSERC Discovery Grant and Canada Research Chair.

Competing interests.
The authors declare no competing interests.

Availability of data and material.
Not applicable.

Code availability.
A Python implementation of the described algorithm and the test procedures can be retrieved from the Python package index as package ci-rvm (see pypi.org/project/ci-rvm).

Authors' contributions.
Samuel M. Fischer and Mark A. Lewis jointly conceived the project; Samuel M. Fischer conceived the algorithm, conducted the mathematical analysis, implemented the algorithm, and wrote the manuscript. Mark A. Lewis revised the manuscript.
Acknowledgements
The authors would like to give thanks to the members of the Lewis Research Group at the University of Alberta for helpful feedback and discussions.

1.1 Profile likelihood confidence intervals

The profile likelihood confidence interval $I$ for a parameter $\theta_0$ encompasses all values $\bar\theta_0$ that might suit as acceptable null hypotheses if the parameter were to be fixed; i.e., $H_0\colon \theta_0 = \bar\theta_0$ could not be rejected versus the alternative $H_1\colon \theta_0 \ne \bar\theta_0$. As the likelihood ratio statistic is, under regularity conditions, approximately $\chi^2$ distributed under the null hypothesis, the confidence interval is given by

$$I = \left[\,\bar\theta_0 \;\middle|\; 2\left(\max_{\theta\in\Theta}\ell(\theta) - \max_{\theta\in\Theta:\,\theta_0=\bar\theta_0}\ell(\theta)\right) \le \chi^2_{1,1-\alpha}\right], \tag{1}$$

whereby $\Theta$ is the parameter space, $\ell$ denotes the log-likelihood function, $1-\alpha$ is the desired confidence level, and $\chi^2_{k,1-\alpha}$ is the $(1-\alpha)$ quantile of the $\chi^2$ distribution with $k$ degrees of freedom.

The function that maps $\bar\theta_0$ to the constrained maximum

$$\ell_{\mathrm{PL}}(\bar\theta_0) := \max_{\theta\in\Theta:\,\theta_0=\bar\theta_0}\ell(\theta) \tag{2}$$

is called the profile log-likelihood. While Wald's method approximates $\ell$ and $\ell_{\mathrm{PL}}$ as quadratic functions, profile likelihood confidence intervals are constructed by exact computation of the profile log-likelihood $\ell_{\mathrm{PL}}$. This makes the method more accurate but also computationally challenging.
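For illustration, the following sketch evaluates the profile log-likelihood (2) for a generic model and computes the threshold value that reappears as $\ell^*$ in equation (3) below. The helper names and the use of scipy.optimize.minimize with its default solver are illustrative assumptions, not part of the algorithms discussed in this paper.

```python
import numpy as np
from scipy import optimize, stats

def profile_loglike(theta0_bar, loglike, theta_hat):
    """Profile log-likelihood (2): maximize over the nuisance
    parameters while the parameter of interest is fixed at theta0_bar."""
    def neg_restricted(theta_tilde):
        return -loglike(np.concatenate(([theta0_bar], theta_tilde)))
    # Start the nuisance optimization at the nuisance part of the MLE.
    result = optimize.minimize(neg_restricted, theta_hat[1:])
    return -result.fun

def loglike_threshold(loglike, theta_hat, alpha=0.05):
    """Threshold below which theta0 values leave the interval (1);
    this is the value l* of equation (3)."""
    return loglike(theta_hat) - stats.chi2.ppf(1 - alpha, df=1) / 2
```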
1.2 Existing approaches

Conceptually, the task of identifying the end points $\theta_0^{\min}$ and $\theta_0^{\max}$ of the confidence interval $I$ is equivalent to finding the maximal (or minimal) value $\bar\theta_0$ with

$$\ell_{\mathrm{PL}}(\bar\theta_0) = \ell^* := \ell(\hat\theta) - \tfrac12\chi^2_{1,1-\alpha}. \tag{3}$$

Here, $\hat\theta$ denotes the MLE; the value $\ell^*$ follows from rearranging the terms in the inequality characterizing $I$ (see equation (1)).

There are two major perspectives to address this problem. It could either be understood as a one-dimensional root finding problem on $\ell_{\mathrm{PL}}$ or as the constrained maximization (or minimization) problem

$$\theta_0^{\max} = \max_{\theta\in\Theta:\,\ell(\theta)\ge\ell^*}\theta_0 \tag{4}$$

($\theta_0^{\min}$ analogously). Approaches developed from either perspective face the challenge of balancing robustness against efficiency.

The root finding perspective (Cook and Weisberg, 1990; DiCiccio and Tibshirani, 1991; Stryhn and Christensen, 2003; Moerbeek et al., 2004; Ren and Xia, 2019) is robust if small steps are taken and solutions of the maximization problem (2) are good initial guesses for the maximizations in later steps. Nonetheless, the step size should be variable if parameters might not be estimable and the confidence intervals large. At the same time, care must be taken with large steps, as solving (2) can be difficult if the initial guesses are poor, and algorithms may fail to converge. Therefore, conservative step choices are often advisable even though they may decrease the overall efficiency of the approaches.

The constrained maximization perspective (Neale and Miller, 1997; Wu and Neale, 2012) has the advantage that efficient solvers for such problems are readily implemented in many optimization packages. If the likelihood function is "well behaved", these methods converge very quickly. However, in practical problems, the likelihood function may have local extrema, e.g. due to lack of data, or steep "cliffs" that may hinder these algorithms from converging to a feasible solution. Furthermore, general algorithms are typically not optimized for problems like (4), in which the target function is simple and the major challenge is in ensuring that the constraint is met. Therefore, an approach would be desirable that is specifically tailored to solve the constrained maximization (4) in a robust and efficient manner.

A first step in this direction is the algorithm by Venzon and Moolgavkar (1988), which solves (4) by repeated quadratic approximations of the likelihood surface. As the method is of Newton-Raphson type, it is very efficient as long as the local approximations are accurate. Therefore, the algorithm is fast if the asymptotic normality of the MLE is achieved approximately. Otherwise, the algorithm relies heavily on good initial guesses. Though methods to determine accurate initial guesses exist (Gimenez et al., 2005), the algorithm by Venzon and Moolgavkar (1988) (below abbreviated as VM) can get stuck in local extrema or fail to converge if the likelihood surface has unfavorable properties (see e.g. Ren and Xia, 2019). Moreover, the algorithm will break down if parameters are not identifiable. Thus, VM cannot be applied in important use cases of profile likelihood confidence intervals.

1.3 Our contributions

In this paper, we address the issues of VM by introducing an algorithm extending the ideas of Venzon and Moolgavkar (1988). Our algorithm, which we will call the
Robust Venzon-Moolgavkar Algorithm (RVM) below, combines the original procedure with a trust region approach (Conn et al., 2000). That is, the algorithm never steps outside of the region in which the likelihood approximation is sufficiently precise. Furthermore, RVM accounts for unidentifiable parameters, local minima and maxima, and sharp changes in the likelihood surface. Though RVM may not outcompete traditional approaches in problems with well-behaved likelihood functions or in the absence of estimability issues, we argue that RVM is a valuable alternative in the (common) cases that the likelihood function is hard to optimize and the model involves parameters that are not estimable.

Another well-known limitation of the approach by Venzon and Moolgavkar (1988) is that it is not directly applicable to construct confidence intervals for functions of parameters. Often the main research interest is not in identifying specific model parameters but in obtaining model predictions, which can be expressed as a function of the parameters. In addition to presenting a robust algorithm to find confidence intervals for model parameters, we show how RVM (and the original VM) can also be applied to determine confidence intervals for functions of parameters.

This paper is structured as follows: in the first section, we start by outlining the main ideas behind RVM before we provide details of the applied procedures. Furthermore, we briefly describe how the algorithm can be used to determine confidence intervals of functions of parameters. In the second section, we apply RVM and alternative algorithms to benchmark problems with simulated data. Thereby, we review the implemented alternative algorithms before we present the results. We conclude this paper with a discussion of the benchmark results and the benefits and limitations of RVM in comparison to earlier methods.

All code used in this study, including a Python implementation of RVM, can be retrieved from the Python package index as package ci-rvm (see pypi.org/project/ci-rvm).

2 The algorithm

2.1 Setup

Consider an $n$-dimensional parameter vector $\theta := (\theta_0, \dots, \theta_{n-1})^\top$ and a twice continuously differentiable log-likelihood function $\ell$. Assume without loss of generality that we seek to construct a confidence interval with confidence level $1-\alpha$ for the parameter $\theta_0$, and let $\tilde\theta := (\theta_1, \dots, \theta_{n-1})^\top$ be the vector of all remaining parameters, called nuisance parameters. For convenience, we may write $\ell = \ell(\theta)$ as a function of the complete parameter vector or $\ell = \ell(\theta_0, \tilde\theta)$ as a function of the parameter of interest and the nuisance parameters.

The algorithm RVM introduced in this paper searches for the right end point $\theta_0^{\max}$ (equation (4)) of the confidence interval $I$. The left end point can be identified with the same approach if a modified model is considered in which $\ell$ is flipped in $\theta_0$. As RVM builds on the method by Venzon and Moolgavkar (1988), we start by recapitulating their algorithm VM below.

Let $\theta^* \in \Theta$ be the admissible parameter vector at which the parameter of interest is maximal, $\theta_0^* = \theta_0^{\max}$, and $\ell(\theta^*) \ge \ell^*$. Venzon and Moolgavkar (1988) note that $\theta^*$ satisfies the following necessary conditions:

1. $\ell(\theta^*) = \ell^*$, and
2. $\ell$ is in a local maximum with respect to the nuisance parameters, which implies $\partial\ell/\partial\tilde\theta(\theta^*) = 0$.

The algorithm VM searches for $\theta^*$ by minimizing both the log-likelihood distance to the threshold $|\ell(\theta) - \ell^*|$ and the gradient with respect to the nuisance parameters $\partial\ell/\partial\tilde\theta$.
To this end, the algorithm repeatedly approximates the log-likelihood surface $\ell$ with second order Taylor expansions $\hat\ell$. If $\theta^{(i)}$ is the parameter vector in the $i$-th iteration of the algorithm, expanding $\ell$ around $\theta^{(i)}$ yields

$$\hat\ell(\theta) := \ell(\theta^{(i)}) + g^\top(\theta - \theta^{(i)}) + \tfrac12(\theta - \theta^{(i)})^\top H\,(\theta - \theta^{(i)}) = \bar\ell + \tilde g^\top\tilde\delta + g_0\delta_0 + \tfrac12\tilde\delta^\top\tilde H\tilde\delta + \delta_0\tilde H_0^\top\tilde\delta + \tfrac12\delta_0^2 H_{00} =: \hat\ell^\delta(\delta_0, \tilde\delta). \tag{5}$$

Here, $\delta := \theta - \theta^{(i)}$ and $\bar\ell := \ell(\theta^{(i)})$; $g := \partial\ell/\partial\theta\,(\theta^{(i)})$ is the gradient and $H := \partial^2\ell/\partial\theta^2\,(\theta^{(i)})$ the Hessian matrix of $\ell$ at $\theta^{(i)}$. Analogously to the notation used above, we split $\delta$ into its first entry $\delta_0$ and the remainder $\tilde\delta$, $g$ into $g_0$ and $\tilde g$, and write $H_0$ for the first column of $H$, $\tilde H$ for $H$ without its first column and row, and split $H_0$ into $H_{00}$ and $\tilde H_0$.

In each iteration, VM seeks $\delta_0^*$ and $\tilde\delta^*$ that satisfy conditions 1 and 2. Applying condition 2 to the approximation $\hat\ell^\delta$ (equation (5)) yields

$$\tilde\delta^* = -\tilde H^{-1}\left(\tilde H_0\delta_0^* + \tilde g\right). \tag{6}$$

Inserting (5) and (6) into condition 1 gives us

$$\ell^* = \tfrac12\left(H_{00} - \tilde H_0^\top\tilde H^{-1}\tilde H_0\right)\delta_0^{*2} + \left(g_0 - \tilde g^\top\tilde H^{-1}\tilde H_0\right)\delta_0^* + \bar\ell - \tfrac12\tilde g^\top\tilde H^{-1}\tilde g, \tag{7}$$

which can be solved for $\delta_0^*$ if $H$ is negative definite. If equation (7) has multiple solutions, Venzon and Moolgavkar (1988) choose the one that minimizes $\delta$ according to some norm. Our algorithm RVM applies a different procedure and chooses the root that minimizes the distance to $\theta_0^{\max}$ without stepping into a region in which the approximation (5) is inaccurate. In section 2.5, we provide further details and discuss the case in which equation (7) has no real solutions.

After each iteration, $\theta$ is updated according to the above results:

$$\theta^{(i+1)} = \theta^{(i)} + \delta^*. \tag{8}$$

If $\ell(\theta^{(i+1)}) \approx \ell^*$ and $\partial\ell/\partial\tilde\theta(\theta^{(i+1)}) \approx 0$, the algorithm terminates and $\theta^{(i+1)}$ is returned.
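The following sketch condenses one VM iteration (equations (5)-(8)) into code. It assumes an invertible nuisance Hessian and, for brevity, omits the $a=0$ case and the refined root selection rules discussed in section 2.5; it is an illustration, not the authors' implementation.

```python
import numpy as np

def vm_step(l_bar, l_star, g, H):
    """One Venzon-Moolgavkar step: returns delta = (delta0, delta_tilde).
    g and H are the gradient and Hessian of l at the current point."""
    g0, g_t = g[0], g[1:]
    H00, H0_t, H_t = H[0, 0], H[1:, 0], H[1:, 1:]
    H_t_inv = np.linalg.inv(H_t)              # assumes H_t is invertible
    # Coefficients of the quadratic equation (7) in delta0:
    a = 0.5 * (H00 - H0_t @ H_t_inv @ H0_t)
    p = g0 - g_t @ H_t_inv @ H0_t
    q = l_bar - 0.5 * g_t @ H_t_inv @ g_t - l_star
    disc = p ** 2 - 4 * a * q
    if disc < 0:
        raise ValueError("equation (7) has no real root")
    roots = (-p + np.array([1.0, -1.0]) * np.sqrt(disc)) / (2 * a)
    delta0 = roots[np.argmin(np.abs(roots))]  # e.g. the smaller-norm root
    delta_t = -H_t_inv @ (H0_t * delta0 + g_t)  # equation (6)
    return np.concatenate(([delta0], delta_t))
```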
The need to extend the original algorithm VM outlined above comes from the following issues: (1) The quadratic approximation $\hat\ell$ may be imprecise far from the approximation point. In extreme cases, updating $\theta$ as suggested could take us farther away from the target $\theta^*$ rather than closer to it. (2) The approximation $\hat\ell$ may be constant in some directions or not bounded above. In these cases, we may not be able to identify unique solutions for $\delta_0$ and $\tilde\delta$, and the gradient criterion in condition 2 may not characterize a maximum but a saddle point or a minimum. (3) The limited precision of numerical operations can result in discontinuities corrupting the results of VM and hinder the algorithm from terminating.

To circumvent these problems, we introduce a number of extensions to VM. First, we address the limited precision of the Taylor approximation $\hat\ell$ with a trust region approach (Conn et al., 2000). That is, we constrain our search for $\delta^*$ to a region in which the approximation $\hat\ell$ is sufficiently accurate. Second, we choose some parameters freely if $\hat\ell$ is constant in some directions, and solve constrained maximization problems if $\hat\ell$ is not bounded above. In particular, we detect cases in which $\ell_{\mathrm{PL}}$ approaches an asymptote above $\ell^*$, which means that $\theta_0$ is not estimable. Lastly, we introduce a method to identify and jump over discontinuities as appropriate. An overview of the algorithm is depicted as a flow chart in Figure 1. Below, we describe each of our extensions in detail.

Figure 1: Flow chart for RVM. The procedure is repeated until the termination criterion is met and the result is returned.

2.2 The trust region

In practice, the quadratic approximation (5) may not be good enough to reach a point close to $\theta^*$ within one step. In fact, since $\ell$ may be very "non-quadratic", we might obtain a parameter vector for which $\ell$ and $\partial\ell/\partial\tilde\theta$ are farther from $\ell^*$ and $0$ than in the previous iteration. Therefore, we accept changes in $\theta$ only if the approximation is sufficiently accurate in the new point.
In each iteration $i$, we compute the new parameter vector, compare the values of $\hat\ell$ and $\ell$ at the obtained point $\theta^{(i)} + \delta^*$, and accept the step if, and only if, $\hat\ell$ and $\ell$ are close together with respect to a given distance measure. If $\bar\ell$ is near the target $\ell^*$, we may also check the precision of the gradient approximation $\partial\hat\ell/\partial\tilde\theta$ to enforce timely convergence of the algorithm.

If we reject a step, we decrease the value $\delta_0^*$ obtained before, reduce the maximal admissible length $r$ of the nuisance parameter step, and solve the constrained maximization problem

$$\tilde\delta^* = \operatorname*{argmax}_{\tilde\delta:\,|\tilde\delta|\le r}\hat\ell^\delta(\delta_0^*, \tilde\delta). \tag{9}$$

As the quadratic subproblem (9) appears in classical trust-region algorithms, efficient solvers are available (Conn et al., 2000) and implemented in optimization software, such as in the Python package Scipy (Jones et al., 2001).

We check the accuracy of the approximation at the resulting point $\theta^{(i)} + \delta^*$, decrease the search radius if necessary, and continue with this procedure until the approximation is sufficiently precise. The metric and the tolerance applied to measure the approximation's precision may depend on how far the current log-likelihood $\bar\ell$ is from the target $\ell^*$. We suggest suitable precision measures in section 2.8.

Since it is often computationally expensive to compute the Hessian $H$, we desire to take steps $\delta$ as large as possible. However, it is also inefficient to adjust the search radius very often to find the maximal admissible $\delta^*$. Therefore, RVM first attempts to make the unconstrained step given by equation (5). If this step is rejected, RVM determines the search radius with a log-scale binary search between the radius of the unconstrained step and the search radius accepted in the previous iteration. If even the latter radius does not lead to a sufficiently precise result, we update $\delta_0^*$ and $r$ by factors $\beta_0, \beta \in (0, 1)$ so that $\delta_0^* \leftarrow \beta_0\delta_0^*$ and $r \leftarrow \beta r$.
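As a sketch, the constrained subproblem (9) can be handed to a generic solver. Dedicated trust-region subproblem routines (Conn et al., 2000) are more efficient; the use of Scipy's trust-constr method here is merely an illustrative assumption.

```python
import numpy as np
from scipy.optimize import NonlinearConstraint, minimize

def constrained_nuisance_step(delta0, g, H, r):
    """Maximize the quadratic model (5) over the nuisance step
    delta_tilde subject to |delta_tilde| <= r (subproblem (9))."""
    g_t = g[1:]
    H0_t, H_t = H[1:, 0], H[1:, 1:]

    def neg_model(d_t):
        # Model (5) with delta0 held fixed; sign flipped for minimization.
        return -(g_t @ d_t + delta0 * (H0_t @ d_t) + 0.5 * d_t @ H_t @ d_t)

    ball = NonlinearConstraint(lambda d_t: d_t @ d_t, 0.0, r ** 2)
    res = minimize(neg_model, np.zeros(g_t.size),
                   method="trust-constr", constraints=[ball])
    return res.x
```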
2.3 Linearly dependent parameters

The right hand side of equation (6) is defined only if the nuisance Hessian $\tilde H$ is invertible. If $\tilde H$ is singular, the maximum with respect to the nuisance parameters is not uniquely defined or does not exist at all. We will consider the second case in the next section and focus on the first case here.

There are multiple options to compute a pseudo-inverse of a singular matrix to solve underspecified linear equation systems (Rao, 1967). A commonly used approach is the Moore-Penrose inverse (Penrose, 1955), which yields a solution with minimal norm (Rao, 1967). This is a desirable property for our purposes, as the quadratic approximation is generally most precise close to the approximation point. The Moore-Penrose inverse can be computed efficiently with singular value decompositions (Golub and Kahan, 1965), which have also been applied to determine the number of identifiable parameters in a model (Eubank and Webster, 1985; Viallefont et al., 1998).

Whether or not a matrix is singular is often difficult to determine precisely due to numerical inaccuracies. The Moore-Penrose inverse is therefore highly sensitive to a threshold parameter determining when the considered matrix is deemed singular. As the Hessian matrix is typically computed with numerical methods subject to error, it is often beneficial to choose a high value for this threshold parameter to increase the robustness of the method. Too large threshold values, however, can slow down or even hinder convergence of the algorithm.

An alternative method to account for singular Hessian matrices is to hold linearly dependent parameters constant until the remaining parameters form a non-singular system. In tests, this approach appeared to be more robust than applying the Moore-Penrose inverse. Therefore, we used this method in our implementation. We provide details on this method as well as test results in Supplementary Appendix A. Note that we write $\tilde H^{-1}$ for this generalized inverse below.

To determine whether the approximate system has any solution when $\tilde H$ is singular, we test whether $\tilde\delta^*$ computed according to equations (6) and (7) indeed satisfies the necessary conditions for a maximum in the nuisance parameters. That is, we check whether

$$0 \approx \frac{\partial}{\partial\tilde\delta}\hat\ell^\delta = \tilde H\tilde\delta^* + \tilde H_0\delta_0^* + \tilde g \tag{10}$$

holds up to a certain tolerance. If this is not the case, $\hat\ell$ is unbounded, and we proceed as outlined in the next section.
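A minimal sketch of the Moore-Penrose variant discussed above: NumPy's pinv with its rcond argument plays the role of the sensitive singularity threshold, and the returned flag encodes the stationarity check (10). The implementation described in Supplementary Appendix A replaces this with the row-elimination method.

```python
import numpy as np

def nuisance_step_pinv(delta0, g_t, H0_t, H_t, rcond=1e-10, tol=1e-8):
    """Minimal-norm nuisance step for a singular H_t, plus the
    stationarity test of equation (10)."""
    delta_t = -np.linalg.pinv(H_t, rcond=rcond) @ (H0_t * delta0 + g_t)
    residual = H_t @ delta_t + H0_t * delta0 + g_t  # equation (10)
    is_stationary = np.linalg.norm(residual) <= tol
    return delta_t, is_stationary
```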
2.4 Solving unbounded subproblems

In each iteration, we seek the nuisance parameters $\tilde\theta$ that maximize $\ell$ for the computed value of $\theta_0$. Since the log-likelihood function $\ell$ is bounded above, such a maximum must exist in theory. However, the approximate log-likelihood $\hat\ell$ can be unbounded at times, which implies that the approximation is imprecise for large steps. Since we cannot identify a global maximum of $\hat\ell$ if it is unbounded, we instead seek the point maximizing $\hat\ell$ in the range where $\hat\ell$ is sufficiently accurate.

To test whether $\hat\ell$ is unbounded in the nuisance parameters, we first check whether $\tilde H$ is negative semi-definite. If $\tilde H$ is invertible, this test can be conducted by applying a Cholesky decomposition to $-\tilde H$, which succeeds if and only if $\tilde H$ is negative definite. If $\tilde H$ is singular, we use an eigenvalue decomposition instead: if all eigenvalues are below a small threshold, $\tilde H$ is negative semi-definite. To confirm that $\hat\ell$ is bounded, we also test whether equation (10) holds approximately if $\tilde H$ is singular (see section 2.3).

If either of these tests fails, $\hat\ell$ is unbounded. In this case, we set $\delta_0^* \leftarrow r_0$ and $r \leftarrow r_1$ for some parameters $r_0, r_1 > 0$. The values $r_0$ and $r_1$ can be adjusted and saved for future iterations to efficiently identify the maximal admissible step. That is, we may increase (or reduce) $\delta_0^*$ and $r$ as long as (or until) $\hat\ell$ is sufficiently precise. Thereby, we adjust the ratio of $\delta_0^*$ and $r$ so that the likelihood increases: $\hat\ell^\delta(\delta_0^*, \tilde\delta^*) > \bar\ell$.
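The definiteness tests described above translate directly into code; a sketch:

```python
import numpy as np

def is_negative_definite(H_t):
    """-H_t has a Cholesky factor iff it is positive definite,
    i.e. iff H_t is negative definite."""
    try:
        np.linalg.cholesky(-H_t)
        return True
    except np.linalg.LinAlgError:
        return False

def is_negative_semidefinite(H_t, tol=1e-10):
    """Eigenvalue-based test used when H_t is singular."""
    return bool(np.all(np.linalg.eigvalsh(H_t) <= tol))
```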
2.5 Step choice for the parameter of interest

Whenever $\hat\ell$ has a unique maximum in the nuisance parameters, we compute $\delta_0^*$ by solving equation (7). This equation can have one, two, or no roots. To discuss how $\delta_0^*$ should be chosen in each of these cases, we introduce some helpful notation. First, we write $\hat\ell_{\mathrm{PL}}(\theta_0) := \max_{\tilde\theta}\hat\ell(\theta_0, \tilde\theta)$ for the profile log-likelihood function of the quadratic approximation. Furthermore, we write in accordance with previous notation

$$\hat\ell^\delta_{\mathrm{PL}}(\delta_0) := \hat\ell_{\mathrm{PL}}\left(\theta_0^{(i)} + \delta_0\right) = a\delta_0^2 + p\delta_0 + q + \ell^* \tag{11}$$

with $a := \tfrac12\left(H_{00} - \tilde H_0^\top\tilde H^{-1}\tilde H_0\right)$, $p := g_0 - \tilde g^\top\tilde H^{-1}\tilde H_0$, and $q := \bar\ell - \tfrac12\tilde g^\top\tilde H^{-1}\tilde g - \ell^*$ (see equation (7)).

Our choices of $\delta_0^*$ attempt to increase $\theta_0$ as much as possible while staying in a region in which the approximation $\hat\ell$ is reasonably accurate. The specific step choice depends on the slope of the profile likelihood $\hat\ell^\delta_{\mathrm{PL}}$ and on whether we have already exceeded $\theta_0^{\max}$ according to our approximation, i.e. $\hat\ell^\delta_{\mathrm{PL}}(0) < \ell^*$. Below, we first assume that $\hat\ell^\delta_{\mathrm{PL}}(0) \ge \ell^*$ and discuss the opposite case later.

1. If the profile likelihood decreases at the approximation point, i.e. $p < 0$, we select the smallest positive root of $\hat\ell^\delta_{\mathrm{PL}}(\delta_0) = \ell^*$:

$$\delta_0^* = \begin{cases} -q/p & \text{if } a = 0,\\[2pt] -\frac{1}{2a}\left(p + \sqrt{p^2 - 4aq}\right) & \text{else.} \end{cases} \tag{12}$$

Choosing $\delta_0^* \le 0$ would mean that $\theta_0$ decreases in this iteration. Choosing the smaller positive root increases our trust in the accuracy of the approximation and prevents potential convergence issues (see Figure 2a). If $\hat\ell^\delta_{\mathrm{PL}}$ has a local minimum above the threshold $\ell^*$, equation (11) does not assume the value $\ell^*$, and we could attempt to decrease the distance between $\hat\ell^\delta_{\mathrm{PL}}$ and $\ell^*$ instead. This procedure, however, may let RVM converge to a local minimum of $\hat\ell^\delta_{\mathrm{PL}}$ rather than to a point with $\hat\ell^\delta_{\mathrm{PL}} = \ell^*$. Therefore, we "jump" over the extreme point by doubling the step to the vertex. That is, we choose

$$\delta_0^* = -\frac{p}{a} \tag{13}$$

if $p^2 < 4aq$ (see Figure 2b).

2. If the profile likelihood increases at the approximation point, i.e. $p > 0$, equation (11) has a positive root with value $\ell^*$ if and only if $\hat\ell_{\mathrm{PL}}$ is concave down, $a < 0$. We choose this root whenever it exists:

$$\delta_0^* = -\frac{1}{2a}\left(p + \sqrt{p^2 - 4aq}\right). \tag{14}$$

However, if $\hat\ell_{\mathrm{PL}}$ grows unboundedly, equation (11) does not have a positive root. In this case, we change the threshold value $\ell^*$ temporarily to a value $\ell_0^*$ chosen so that equation (11) has a solution with the updated threshold (see Figure 2c). For example, we may set

$$\ell_0^* := \max\left\{\hat\ell^\delta_{\mathrm{PL}}(0) + 1,\ \tfrac12\left(\bar\ell + \ell(\hat\theta)\right)\right\}.$$

This choice ensures that a solution exists while at the same time reaching local likelihood maxima quickly. After resetting the threshold, we proceed as usual. To memorize that we changed the threshold value $\ell^*$, we set a flag maximizing := True. In future iterations $j > i$, we set the threshold back to its initial value $\ell^*$ and maximizing := False as soon as $\ell(\theta^{(j)}) < \ell^*$ or $\hat\ell_{\mathrm{PL}}$ is concave down at the approximation point $\theta^{(j)}$.

3. If the profile likelihood has a local extremum at the approximation point, i.e. $p = 0$, $a \ne 0$, we proceed as in cases 1 and 2: if $a > 0$, we proceed as if $\hat\ell_{\mathrm{PL}}$ were increasing, and if $a < 0$, we proceed as if $\hat\ell_{\mathrm{PL}}$ were decreasing. However, the approximate profile likelihood could also be constant, $a = p = 0$. In this case, we attempt to make a very large step to check whether we can push $\theta_0$ arbitrarily far. In section 2.6, we discuss this procedure in greater detail.

4. If the profile likelihood at the approximation point is below the threshold, $\hat\ell^\delta_{\mathrm{PL}}(0) < \ell^*$, we always choose the smallest possible step:

$$\delta_0^* = \begin{cases} -\frac{1}{2a}\left(p + \sqrt{p^2 - 4aq}\right) & \text{if } a \ne 0,\ p < 0,\\[2pt] -q/p & \text{if } a = 0,\ p \ne 0,\\[2pt] -\frac{1}{2a}\left(p - \sqrt{p^2 - 4aq}\right) & \text{if } a \ne 0,\ p > 0. \end{cases} \tag{15}$$

This shall bring us back to the admissible parameter region as quickly as possible. As RVM rarely steps far beyond the admissible region in practice, equation (15) usually suffices to define $\delta_0^*$. Nonetheless, if we find that $\hat\ell^\delta_{\mathrm{PL}}$ has a local maximum below the threshold, i.e. $p^2 < 4aq$, we may instead maximize $\hat\ell^\delta_{\mathrm{PL}}$ as far as possible:

$$\delta_0^* = -\frac{p}{2a}. \tag{16}$$

If we have already reached such a local maximum ($p \approx 0$), this does not yield a usable step $\delta_0^*$. In this case, we may recall the iteration $k := \operatorname*{argmax}_{j:\,\ell(\theta^{(j)})\ge\ell^*}\theta_0^{(j)}$, in which the largest admissible value of $\theta_0$ with $\ell(\theta^{(k)}) \ge \ell^*$ has been found so far, and conduct a binary search between $\theta^{(i)}$ and $\theta^{(k)}$ until we find a point $\theta^{(i+1)}$ with $\ell(\theta^{(i+1)}) \ge \ell^*$.

Figure 2: Step choice for $\theta_0$ in special cases. The figures depict the profile likelihood function $\ell_{\mathrm{PL}}$ (solid black), its quadratic approximation $\hat\ell_{\mathrm{PL}}$ (dashed parabola), and the threshold log-likelihood $\ell^*$. (a) The approximation has two roots. Though the largest root of $\ell_{\mathrm{PL}}$ is searched, the smaller root of $\hat\ell_{\mathrm{PL}}$ is closest to the desired result. In fact, consistently choosing the larger root would let the algorithm diverge. (b) If $\ell_{\mathrm{PL}}$ is decreasing but $\hat\ell_{\mathrm{PL}}$ does not assume the threshold value $\ell^*$, we "jump" over the local minimum. (c) If $\ell_{\mathrm{PL}}$ is increasing but $\hat\ell_{\mathrm{PL}}$ does not assume the threshold value $\ell^*$, we reset the target value to an increased value $\ell_0^*$.
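The case analysis above can be consolidated into a single step-selection routine. The sketch below returns the proposed $\delta_0^*$, or None when no sensible root exists (the algorithm then raises the target as in case 2, or falls back to the binary search of case 4). The constant LARGE_STEP and the simplified handling of degenerate edge cases are assumptions made for illustration.

```python
import numpy as np

LARGE_STEP = 1e3  # probe step used when the model is constant (case 3)

def choose_delta0(a, p, q):
    """Select delta0* from the quadratic model (11),
    l_PL(delta0) - l* = a*delta0**2 + p*delta0 + q, per section 2.5."""
    disc = p ** 2 - 4 * a * q
    if a == 0 and p == 0:                       # constant model (case 3)
        return LARGE_STEP if q >= 0 else None
    if q >= 0:                                  # inside the region
        if p < 0 or (p == 0 and a < 0):         # cases 1 and 3
            if a == 0:
                return -q / p                   # eq. (12), linear model
            if disc >= 0:
                return -(p + np.sqrt(disc)) / (2 * a)  # eq. (12)
            return -p / a                       # eq. (13): jump over minimum
        if a < 0:                               # case 2, concave down
            return -(p + np.sqrt(disc)) / (2 * a)      # eq. (14)
        return None                             # unbounded growth: raise l*
    # case 4: below the threshold
    if a == 0:
        return -q / p                           # eq. (15), linear model
    if disc < 0:
        return -p / (2 * a)                     # eq. (16): local maximum
    root = np.sqrt(disc)
    return -(p + root) / (2 * a) if p < 0 else -(p - root) / (2 * a)
```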
2.6 Identifying inestimable parameters

In some practical scenarios, the profile log-likelihood $\ell_{\mathrm{PL}}$ never falls below the threshold $\ell^*$, which means that the considered parameter is not estimable. In these cases, RVM may not converge. However, it is often possible to identify inestimable parameters by introducing a step size limit $\delta_0^{\max}$. If the computed step exceeds the maximal step size, $\delta_0^* > \delta_0^{\max}$, and the current function value exceeds the threshold value, i.e. $\bar\ell \ge \ell^*$, we set $\delta_0^* := \delta_0^{\max}$ and compute the corresponding nuisance parameters. If the resulting log-likelihood $\ell(\theta^{(i)} + \delta^*)$ is not below the threshold $\ell^*$, we let the algorithm terminate, raising a warning that the parameter $\theta_0$ is not estimable. If $\ell(\theta^{(i)} + \delta^*) < \ell^*$, however, we cannot draw this conclusion and decrease the step size until the approximation is sufficiently close to the original function.

The criterion suggested above may not always suffice to identify inestimable parameters. For example, if the profile likelihood is constant but the nuisance parameters maximizing the likelihood change non-linearly, RVM may not halt. For this reason, and also to prevent unexpected convergence issues, it is advisable to impose an iteration limit on the algorithm. If the iteration limit is exceeded, potential estimability issues may be investigated further.

2.7 Discontinuities

RVM is based on quadratic approximations and therefore requires that $\ell$ is twice differentiable. Nonetheless, discontinuities can occur due to numerical imprecision even if the likelihood function is continuous in theory. Though we may still be able to compute the gradient $g$ and the Hessian $H$ in these cases, the resulting quadratic approximation will be inaccurate even if we take very small steps. Therefore, these discontinuities could hinder the algorithm from terminating.

To identify discontinuities, we define a minimal step size $\epsilon$, which may depend on the gradient $g$. If we reject a step with small length $|\delta^*| \le \epsilon$, we conclude that $\ell$ is discontinuous at the current approximation point $\theta^{(i)}$. To determine the set $D$ of parameters responsible for the issue, we decompose $\delta^*$ into its components. We initialize $D \leftarrow \emptyset$ and consider, with $e_j$ denoting the $j$-th unit vector, the partial steps $\delta^*_{:k} := \sum_{j\le k,\, j\notin D} e_j\delta^*_j$ until $\hat\ell^\delta(\delta^*_{:k})$ deviates strongly from $\ell^\delta(\delta^*_{:k})$ for some $k < n$. When we identify such a component, we add it to the set $D$ and continue the procedure.

If we find that $\ell$ is discontinuous in $\theta_0$, we check whether the current nuisance parameters maximize the likelihood, i.e. whether $\ell$ is bounded above and $\tilde g$ is approximately $0$. If the nuisance parameters are not optimal, we hold $\theta_0$ constant and maximize $\ell$ with respect to the nuisance parameters. Otherwise, we conclude that the profile likelihood function has a jump discontinuity. In this case, our action depends on the current log-likelihood value $\bar\ell$, the value of $\ell$ at the other end of the discontinuity, and the threshold $\ell^*$.

– If $\ell(\theta^{(i)} + \delta^*) \ge \ell^*$ or $\ell(\theta^{(i)}) < \ell(\theta^{(i)} + \delta^*)$, we accept the step regardless of the undesirably large error.
– If $\ell(\theta^{(i)} + \delta^*) < \ell^*$ and $\ell(\theta^{(i)}) \ge \ell^*$, we terminate and return $\theta_0^{(i)}$ as the bound of the confidence interval.
– Otherwise, we cannot make a sensible step and try to get back into the admissible region by conducting the binary search procedure described in case 4 of section 2.5.

If $\ell$ is discontinuous in variables other than $\theta_0$, we hold the variables constant whose change decreases the likelihood and repeat the iteration with a reduced system. After a given number of iterations, we release these parameters again, as $\theta$ may have left the point of discontinuity.

Since we may require that not only $\hat\ell$ but also its gradient is well approximated, a robust implementation of RVM should also handle potential gradient discontinuities. The nuisance parameters causing the issues can be identified analogously to the procedure outlined above. All components in which the gradient changes its sign from positive to negative should be held constant, as the likelihood appears to be in a local maximum in these components. The step in the remaining components may be accepted regardless of the large error.
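A sketch of the component-wise decomposition used to find the set $D$ of offending parameters; model_error is assumed to return the approximation error $|\hat\ell^\delta - \ell^\delta|$ for a partial step.

```python
import numpy as np

def find_discontinuous_components(model_error, delta, tol):
    """Rebuild the rejected small step one component at a time and
    collect the components whose addition breaks the approximation."""
    D = set()
    partial = np.zeros_like(delta)
    for j in range(delta.size):
        partial[j] = delta[j]           # tentatively add component j
        if model_error(partial) > tol:  # approximation breaks here
            D.add(j)
            partial[j] = 0.0            # drop the offending component
    return D
```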
2.8 Suitable parameters and distance measures

The efficiency of RVM depends strongly on the distance measures and parameters applied when assessing the accuracy of the approximation and updating the search radius of the constrained optimization problems. If the precision measures are overly conservative, many steps will be needed to find $\theta^*$. If the precision measure is too liberal, in turn, RVM may take detrimental steps and might not even converge.

We suggest the following procedure: (1) We always accept forward steps with $\delta_0^* \ge 0$ if the true log-likelihood is at least as large as predicted, $\ell^\delta(\delta^*) \ge \hat\ell^\delta(\delta^*)$. (2) If the approximate likelihood function is unbounded, we require that the likelihood increases, $\ell^\delta(\delta^*) \ge \bar\ell$. This requirement helps RVM to return quickly to a region in which the approximation is bounded. However, if the step size falls below the threshold used to detect discontinuities, we may relax this constraint so that less time must be spent to detect potential discontinuities. (3) If we are outside the admissible region, i.e. $\bar\ell < \ell^*$, we enforce that we get closer to the target likelihood: $|\ell^\delta(\delta^*) - \ell^*| < |\bar\ell - \ell^*|$. This reduces potential convergence issues. (4) We require that

$$\frac{\left|\hat\ell^\delta(\delta^*) - \ell^\delta(\delta^*)\right|}{\left|\bar\ell - \ell^*\right|} \le \gamma \tag{17}$$

for a constant $\gamma \in (0, 1)$. That is, the required precision depends on how close we are to the target, which facilitates fast convergence of the algorithm; a moderate value of $\gamma$ appeared to be a good choice in our tests. (5) If we are close to the target, $\ell^\delta(\delta^*) \approx \ell^*$, we also require that the gradient estimate is precise:

$$\frac{\left\|\frac{\partial\hat\ell^\delta}{\partial\tilde\theta}(\delta^*) - \frac{\partial\ell^\delta}{\partial\tilde\theta}(\delta^*)\right\|}{\left\|\tilde g\right\|} \le \gamma, \tag{18}$$

where $\|\cdot\|$ denotes a suitable vector norm, e.g. the Euclidean norm. This constraint helps us to get closer to a maximum in the nuisance parameters.

When we reject a step because the approximation is not sufficiently accurate, we adjust $\delta_0^*$ and solve the constrained maximization problem (9) requiring $|\tilde\delta| \le r$. To ensure that the resulting step does not push the log-likelihood below the target $\ell^*$, the radius $r$ should not be decreased more strongly than $\delta_0^*$. In tests, halving $\delta_0^*$ while reducing $r$ by a smaller factor appeared to be a good choice.
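A sketch of checks (17) and (18); the default $\gamma = 0.5$ and the Euclidean norm for the gradient comparison are assumptions made for illustration.

```python
import numpy as np

def accept_step(l_true, l_model, l_bar, l_star,
                g_true=None, g_model=None, g_ref=None, gamma=0.5):
    """Relative accuracy test (17), plus the gradient test (18)
    when gradients are supplied (i.e. close to the target)."""
    ok = abs(l_model - l_true) <= gamma * abs(l_bar - l_star)  # eq. (17)
    if ok and g_true is not None:                              # eq. (18)
        ok = (np.linalg.norm(g_model - g_true)
              <= gamma * np.linalg.norm(g_ref))
    return ok
```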
2.9 Confidence intervals for functions of parameters

Often, modelers are interested in confidence intervals for functions $f(\theta)$ of the parameters. A limitation of VM and RVM is that such confidence intervals cannot be computed directly with these algorithms. However, this problem can be solved approximately by considering a slightly changed likelihood function. We aim to find

$$\varphi^{\max} = \max_{\theta\in\Theta:\,\ell(\theta)\ge\ell^*} f(\theta) \tag{19}$$

or the respective minimum. Define

$$\check\ell(\varphi, \theta) := \ell(\theta) - \left(\frac{f(\theta) - \varphi}{\varepsilon}\right)^2 \tfrac12\chi^2_{1,1-\alpha} \tag{20}$$

with a small constant $\varepsilon > 0$, and consider the altered maximization problem

$$\check\varphi^{\max} = \max_{(\varphi,\theta):\,\theta\in\Theta,\ \check\ell(\varphi,\theta)\ge\ell^*}\varphi, \tag{21}$$

which can be solved with VM or RVM, treating $\varphi$ as the parameter of interest.

We argue that a solution to (21) is an approximate solution to (19), whereby the error is bounded by $\varepsilon$. Let $(\varphi^{\max}, \theta^*)$ be a solution to problem (19) and $(\check\varphi^{\max}, \check\theta^*)$ a solution to problem (21). Since $\varphi^{\max} = f(\theta^*)$, we have $\check\ell(\varphi^{\max}, \theta^*) = \ell(\theta^*) \ge \ell^*$. Therefore, $(\varphi^{\max}, \theta^*)$ is also a feasible solution to (21), and it follows that $\check\varphi^{\max} \ge \varphi^{\max}$. At the same time, $\check\ell(\varphi, \theta) \le \ell(\theta)$, which implies that $f(\check\theta^*) \le f(\theta^*)$, since $\theta^*$ maximizes $f$ over a domain at least as large as the feasibility domain of (21). In conclusion, $f(\check\theta^*) \le f(\theta^*) = \varphi^{\max} \le \check\varphi^{\max}$. Lastly,

$$\ell^* = \ell(\hat\theta) - \tfrac12\chi^2_{1,1-\alpha} \le \check\ell\left(\check\varphi^{\max}, \check\theta^*\right) = \ell(\check\theta^*) - \left(\frac{f(\check\theta^*) - \check\varphi^{\max}}{\varepsilon}\right)^2 \tfrac12\chi^2_{1,1-\alpha}. \tag{22}$$

Simplifying (22) with $\ell(\check\theta^*) \le \ell(\hat\theta)$ yields $|f(\check\theta^*) - \check\varphi^{\max}| \le \varepsilon$. Thus, $|\varphi^{\max} - \check\varphi^{\max}| \le \varepsilon$.

Though it is possible to bound the error by an arbitrarily small constant $\varepsilon$ in theory, care must be taken if the function $f(\theta)$ is not well-behaved, i.e. strongly nonlinear. In these cases, overly small values of $\varepsilon$ may slow down convergence.

Note that the suggested procedure may seem to resemble the approach of Neale and Miller (1997), who also account for constraints by adding the squared error to the target function. However, unlike Neale and Miller (1997), the approach suggested above bounds the error in the confidence interval bound, not the error of the constraint. Furthermore, we do not square the log-likelihood function, which would worsen nonlinearities and could thus make optimization difficult. Therefore, our approach is less error-prone than the method by Neale and Miller (1997).
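A sketch of the modified log-likelihood (20): wrapping a given loglike this way lets RVM (or VM) treat $\varphi$ as the parameter of interest in problem (21). The function names are illustrative assumptions.

```python
from scipy import stats

def modified_loglike(loglike, f, eps, alpha=0.05):
    """Return l_check(phi, theta) from equation (20). The bound found
    for the modified problem (21) deviates from the exact bound (19)
    by at most eps."""
    c = stats.chi2.ppf(1 - alpha, df=1) / 2
    def l_check(phi, theta):
        return loglike(theta) - ((f(theta) - phi) / eps) ** 2 * c
    return l_check
```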
3 Benchmark tests

To compare the presented algorithm to existing methods, we applied RVM, the classic VM, and five other algorithms to benchmark problems and compared the robustness and performance of the approaches. Below, we review the implemented methods. Then we introduce the benchmark problems, before we finally present the benchmark results.

3.1 Methods implemented for comparison

Besides RVM and VM, we implemented three methods that repeatedly evaluate the profile likelihood function and two methods that search for the confidence intervals directly. We implemented all methods in the programming language Python version 3.7 and made use of different optimization routines implemented or wrapped in the scientific computing library Scipy (Jones et al., 2001).

First, we implemented a grid search for the confidence bounds. The approach uses repeated Lagrangian constrained optimizations and may resemble the method by DiCiccio and Tibshirani (1991); however, rather than implementing the algorithm by DiCiccio and Tibshirani (1991), we applied the constrained optimization algorithm by Lalee et al. (1998), which is a trust-region approach and may thus be more robust than the method by DiCiccio and Tibshirani (1991). Furthermore, the algorithm by Lalee et al. (1998) was readily implemented in Scipy. We conducted the grid search with a naive step size of 0.2, which we repeatedly reduced by a factor of 2 close to the threshold log-likelihood $\ell^*$ until the desired precision was achieved. To account for unidentifiable parameters, we attempted one large step (1000 units) if the algorithm did not terminate within the given iteration limit. We considered a parameter as unbounded if this step yielded a log-likelihood above the target value $\ell^*$.

Second, we implemented a quadratic bisection method for root finding on $\ell_{\mathrm{PL}}$ (cf. Ren and Xia, 2019). Initially, we chose a step size of 1. Afterwards, we computed the step of $\theta_0$ based on a quadratic interpolation between the MLE $\hat\theta_0$, the maximal value of $\theta_0$ for which we found $\ell_{\mathrm{PL}}(\theta_0) > \ell^*$, and the smallest identified value of $\theta_0$ with $\ell_{\mathrm{PL}}(\theta_0) < \ell^*$. Until a point $\theta_0$ with $\ell_{\mathrm{PL}}(\theta_0) < \ell^*$ was identified, we interpolated $\ell_{\mathrm{PL}}$ between $\hat\theta_0$ and the two largest evaluated values of $\theta_0$. When only two points were available or the approximation of $\ell_{\mathrm{PL}}$ did not assume the target value, we introduced the additional constraint $\mathrm{d}\ell_{\mathrm{PL}}/\mathrm{d}\theta_0 = 0$. Using a quadratic rather than a linear interpolation for bisection has the advantage that the algorithm converges faster if the profile log-likelihood function is convex or quadratic. To evaluate $\ell_{\mathrm{PL}}$, we applied sequential least squares programming (Kraft, 1988), which is the default method for constrained optimization in Scipy.

Third, we implemented a binary search with an initial step of 1. Until a value $\theta_0$ with $\ell_{\mathrm{PL}}(\theta_0) < \ell^*$ was found, we increased $\theta_0$ by a factor of 10. This preserves the logarithmic runtime of the algorithm if the problem has a solution. To broaden the range of tested internal optimization routines, we used a different method to evaluate $\ell_{\mathrm{PL}}$ than in the bisection method: we fixed $\theta_0$ at the desired value and performed an unconstrained optimization over the nuisance parameters. Here, we used the quasi-Newton method by Broyden, Fletcher, Goldfarb, and Shanno (BFGS; see Nocedal and Wright, 2006, pp. 136).

To test methods that search for the confidence interval end points directly, we solved problem (4) with sequential least squares programming (Kraft, 1988). Furthermore, we implemented the approximate method by Neale and Miller (1997). They transform the constrained maximization problem (4) to an unconstrained problem by considering the sum of the parameter of interest $\theta_0$ and the squared error between the target $\ell^*$ and the log-likelihood. Minimization of this target function yields a point in which the target log-likelihood is reached approximately and the parameter of interest is minimal. Again, we used the method BFGS for minimization (see above).

Finally, we implemented Wald's method to assess the need to apply any profile likelihood method at all.

3.2 Benchmark problem

To investigate the performance of the implemented methods, we applied the algorithms to a benchmark problem with variable parameter number and data set size. We considered a logistic regression problem with $n$ count data covariates $c_{ij}$, $j \in \{1, \dots, n\}$, for each data point $i \in \{1, \dots, N\}$. We assumed that the impact of the covariates levels off at high values and therefore considered the transformed covariates $c_{ij}^{\alpha_j}$ with $\alpha_j \in (0, 1)$. The distribution of the binary response $X_i$ was thus given by

$$P(X_i = 1) = \left(1 + \exp\left(-\beta_0 - \sum_j \beta_j c_{ij}^{\alpha_j}\right)\right)^{-1} \tag{23}$$

and $P(X_i = 0) = 1 - P(X_i = 1)$.

We drew the covariate values randomly from a negative binomial distribution with mean 5 and variance 10. The negative binomial distribution is commonly used to model count data (Gardner et al., 1995) and is thus suited to represent count covariates. To simulate the common case that covariates are correlated, we furthermore drew the value of every other covariate from a binomial distribution with the respective preceding covariate as count parameter. That is, for uneven $j$, $c_{i,j+1} \sim \mathrm{Binomial}(c_{ij}, p)$ with a fixed success probability $p$.
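A sketch of this data-generating process: the negative binomial parameters follow from matching mean 5 and variance 10, and the success probability p = 0.5 used here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_data(beta0, beta, alpha, N, p=0.5):
    """Simulate one benchmark data set: negative binomial covariates
    (mean 5, variance 10, hence n_nb = 5 and p_nb = 0.5), every second
    covariate drawn binomially from its predecessor, and binary
    responses following equation (23)."""
    n = len(alpha)
    c = rng.negative_binomial(5, 0.5, size=(N, n)).astype(float)
    for j in range(0, n - 1, 2):                 # correlate covariate pairs
        c[:, j + 1] = rng.binomial(c[:, j].astype(int), p)
    logit = beta0 + (np.asarray(beta) * c ** np.asarray(alpha)).sum(axis=1)
    prob = 1.0 / (1.0 + np.exp(-logit))          # equation (23)
    return c, (rng.random(N) < prob).astype(int)
```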
We chose the parameters $\alpha_j$ and $\beta_j$ so that the data were balanced, i.e. the frequencies of 0s and 1s were approximately even. Refer to Supplementary Appendix B for the parameter values we used.

3.3 Test procedure

To test the algorithms in a broad range of scenarios and assess how their performance is impacted by model characteristics, we considered a model with 1 covariate (3 parameters), a model with 5 covariates (11 parameters), and a generalized linear model (GLM) with 10 covariates, in which the powers $\alpha_j$ were set to 1 (11 parameters). Furthermore, we varied the sizes of the simulated data sets, ranging between $N = 500$ and $N = 10000$ for the models with transformed covariates and between $N = 50$ and $N = 1000$ for the GLM. In Figure 3, we depict the impact of $N$ on the shape of the likelihood function and thus the difficulty of the problem.

For each considered set of parameters, we generated 200 realizations of covariates and training data from the model described in the previous section. We determined the maximum likelihood estimate by maximizing the log-likelihood with the method BFGS and refined the estimate with an exact trust region optimizer (Conn et al., 2000). Then, we applied each of the implemented algorithms to each data set and determined the algorithms' success rates and efficiencies.

As the likelihood functions of the tested models decrease drastically at $\alpha_j = 0$, potentially causing some algorithms to fail, we constrained the $\alpha_j$ to non-negative values. To that end, we considered the transformed parameters $\alpha_j' := \ln(\exp(\alpha_j) - 1)$ and evaluated the algorithms based on the back-transformed parameters $\alpha_j$.
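The transformation and its softplus back-transformation in code; np.expm1 and np.log1p keep the mapping numerically stable near zero.

```python
import numpy as np

def to_unconstrained(alpha):
    """alpha' = ln(exp(alpha) - 1), mapping alpha > 0 to the real line."""
    return np.log(np.expm1(alpha))

def to_constrained(alpha_prime):
    """Back-transformation alpha = ln(1 + exp(alpha')) > 0 (softplus)."""
    return np.log1p(np.exp(alpha_prime))
```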
We measured the algorithms' success based on their ability to solve problem (4) rather than their capability to determine the true confidence intervals for the parameters. Though profile likelihood confidence intervals are usually highly accurate, they rely on the limiting distribution of the likelihood ratio statistic. Therefore, algorithms could fail to solve optimization problem (4) but, by coincidence, return a result close to the true confidence interval bound, and vice versa. To exclude such effects and circumvent the high computational effort required to determine highly precise confidence intervals with sampling methods, we determined the "true" confidence interval bound by choosing the widest confidence interval bound obtained by any of the tested methods, provided it was admissible, i.e. $\ell(\theta^{\max}) \ge \ell^*$ up to a small permissible error. We considered an algorithm successful if its result was within a $\pm 5\%$ range of the true confidence interval bound or had an error below a small absolute tolerance.

Figure 3: Likelihood surface of the 3-parameter benchmark model for different data set sizes $N$. As $N$ increases, the confidence region becomes smaller and closer to an elliptic shape. The orange dots depict the accepted (large dots) and rejected (small dots) steps of RVM searching for a confidence interval for $\beta_1$. RVM follows the ridge of the likelihood surface. The red dot shows the location of the MLE $\hat\theta$. The background color depicts the respective maximal log-likelihood for the given $\alpha_1$ and $\beta_1$, ranging from $\le \hat\ell - 50$ (dark blue) to $\hat\ell$ (yellow). The solid blue line denotes the target log-likelihood $\ell^*$ for a 95% confidence interval. (a) $N = 500$; (b) $N = 1000$; (c) $N = 10000$.

The quadratic bisection method exploits the approximately quadratic shape of the profile likelihood function if many data are available. In scenarios with large data sets, the bisection method was among the most efficient algorithms. The errors of the three root finding methods decreased the more data became available to fit the models. However, while the binary search had a consistently low error, both the grid search and the bisection method were more prone to large errors than all other tested methods.

The algorithms developed from the constrained maximization perspective (the method by Neale and Miller and direct constrained maximization) had success rates ranging between 45% and 85% in problems with transformed covariates. In the GLM scenario, the success rate was smaller with 50 data points and higher with more data. The direct constrained maximization procedure was slightly more successful than the method by Neale and Miller (1997). Both methods required relatively few function evaluations, whereby direct constrained maximization performed better. Both methods were less prone to large errors than the grid search and the bisection method. However, the outlier-reduced error was on average more than twice as large as with any other method except RVM.
4 Discussion

We presented an algorithm that determines the end points of profile likelihood confidence intervals both of parameters and of functions of parameters with high robustness and efficiency. We tested the algorithm in scenarios varying in parameter number, size of the data set, and complexity of the likelihood function. In the tests, our algorithm RVM was more robust than any other considered method. At the same time, RVM was among the fastest algorithms in most scenarios. This is remarkable, because there is typically a trade-off between robustness and computational speed of optimization algorithms. RVM achieves this result by exploiting the approximately quadratic form of the log-likelihood surface in "benign" cases while maintaining high robustness with the trust-region approach. Consequently, RVM naturally extends the algorithm VM (Venzon and Moolgavkar, 1988), which appeared to be highly efficient but lacking robustness in our tests.

Surprisingly, RVM turned out to be even more robust than methods based on repeated evaluations of the profile likelihood.
Figure 4: Benchmark results. The success rate (A), the mean error (B), and the number of function evaluations (C) are plotted for the 3-parameter and the 11-parameter model with transformed covariates and for the 11-parameter GLM, for the methods RVM, VM, grid search, bisection, binary search, Neale-Miller, constrained maximization, and Wald. Throughout the simulations, our algorithm RVM had the highest success rate. At the same time, RVM had a low mean error and required only few likelihood function evaluations compared to the considered alternative methods. The parameter values used to generate the figures are given in Supplementary Appendix B.

For the bisection method and the binary search, this may be due to failures of internal optimization routines, as initial guesses far from the solution can hinder accurate convergence. The grid search method, in turn, was often aborted due to the limited step size, which precluded the method from identifying confidence bounds farther than 40 units away from the respective MLE. This, however, does not explain the comparatively high error in the results of the grid search, as only successful runs were considered. We therefore hypothesize that internal optimization issues were responsible for some failures.

As expected, the algorithms that searched for the confidence interval end points directly were more efficient but less robust than algorithms that repeatedly evaluate the profile likelihood. Remarkably, a "standard" algorithm for constrained optimization performed slightly better than an unconstrained optimizer operating on the modified target function suggested by Neale and Miller (1997). This indicates that the approximation introduced by Neale and Miller (1997) might not be necessary and may even be a disadvantage.

All methods implemented in this study (except RVM and VM) rely on general optimizers. Consequently, the performance of these methods depends on the chosen optimizers both in terms of computational speed and robustness. Careful adjustment of optimization parameters might make some of the implemented algorithms more efficient and thus more competitive in benchmark tests. Though we attempted to reduce potential bias by applying a variety of different methods, an exhaustive test of optimization routines was beyond the scope of this study. Nonetheless, the consistently good performance of RVM throughout our benchmark tests suggests that RVM is a good choice in many applications.

Though RVM performed well in our tests, there are instances in which the algorithm is not applicable or not sufficiently efficient. These are scenarios in which (1) the log-likelihood cannot be computed directly, (2) the Hessian matrix of the log-likelihood function is hard to compute, (3) the dimension of the parameter space is very large, or (4) there are multiple points in the parameter space at which problem (4) is solved locally. Below, we briefly discuss each of these limitations.

(1) In hierarchical models, the likelihood function may not be known. As RVM needs to evaluate the log-likelihood, its gradient, and its Hessian matrix, the algorithm is not applicable in these instances.
Consequently, sampling based methods, such as the parametric bootstrap (Efron, 1981), Monte Carlo methods (Buckland, 1984), or data cloning (Ponciano et al., 2009), may then be the only feasible approach to determine confidence intervals.

(2) Especially in problems with a large parameter space, it is computationally expensive to compute the Hessian matrix with finite difference methods, as the number of function calls increases in quadratic order with the length of the parameter vector. Though alternative differentiation methods, such as analytical or automatic differentiation (Griewank, 1989), are often applicable, there may be instances in which finite difference methods are the only feasible alternative. In these scenarios, optimization routines that do not require knowledge of the Hessian matrix may be faster than RVM. Note, however, that the higher computational speed may come with decreased robustness, and sampling based methods might be the only remaining option if application of RVM is infeasible.

(3) If the parameter space has a very high dimension (exceeding 1000), internal routines, such as inversion of the Hessian matrix, may become the dominant factor determining the speed of RVM. Though it may be possible in the future to make RVM more efficient, sampling based methods or algorithms that do not use the Hessian matrix may be better suited in these scenarios.

(4) RVM, as well as all other methods implemented in this study, is a local optimization algorithm. Therefore, the algorithms may converge to wrong results if maximization problem (4) has multiple local solutions. This is in particular the case if the confidence set $\{\theta_0 : \ell_{\mathrm{PL}}(\theta_0) \ge \ell^*\}$ is not connected and thus not an interval. RVM reduces the issue of local extreme points by choosing steps carefully and ensuring that the point of convergence is indeed a maximum. This contrasts with VM, which can converge to the wrong confidence interval end point (e.g. maximum instead of minimum) if the initial guesses are not chosen with care. Nonetheless, stochastic optimization routines, such as genetic algorithms (Akrami et al., 2010), and sampling methods may be better suited if a local search is insufficient.

Despite these caveats, RVM is applicable to a broad class of systems. Especially when inestimable parameters are present, commonly used methods such as VM or grid search techniques can break down or be highly inefficient. Furthermore, optimization failures are commonly observed if not enough data are available to reach the asymptotic properties of the MLE (Ren and Xia, 2019). RVM is a particularly valuable tool in these instances.

5 Conclusions

We developed and presented an algorithm to determine profile likelihood confidence intervals. In contrast to many earlier methods, our algorithm is robust in scenarios in which lack of data or a complicated likelihood function makes it difficult to find the bounds of profile likelihood confidence intervals. In particular, our method is applicable in instances in which parameters are not estimable and in cases in which the likelihood function has strong nonlinearities. At the same time, our method efficiently exploits the asymptotic properties of the maximum likelihood estimator if enough data are available.

We tested our method on benchmark problems of varying difficulty. Throughout our simulations, our method was the most robust while also being among the fastest algorithms. We therefore believe that RVM can be helpful to researchers and modelers across disciplines.
References
Akrami Y, Scott P, Edsjö J, Conrad J, Bergström L (2010) A profile likelihood analysis of the constrained MSSM with genetic algorithms. Journal of High Energy Physics 2010(4):57, DOI 10.1007/JHEP04(2010)057
Brodtkorb PA, D'Errico J (2019) numdifftools 0.9.39. URL https://github.com/pbrod/numdifftools
Buckland ST (1984) Monte Carlo confidence intervals. Biometrics 40(3):811, DOI 10.2307/2530926
Conn AR, Gould NIM, Toint PL (2000) Trust-region methods. MPS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics, Philadelphia, PA
Cook RD, Weisberg S (1990) Confidence curves in nonlinear regression. Journal of the American Statistical Association 85(410):544–551, DOI 10.1080/01621459.1990.10476233
Cox DR, Snell EJ (1989) Analysis of binary data, 2nd edn. No. 32 in Monographs on Statistics and Applied Probability, Routledge, Boca Raton, FL, DOI 10.1201/9781315137391
DiCiccio TJ, Tibshirani R (1991) On the implementation of profile likelihood methods. Tech. rep., Department of Statistics, University of Toronto
Efron B (1981) Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics 9(2):139–158, DOI 10.2307/3314608
Eubank RL, Webster JT (1985) The singular-value decomposition as a tool for solving estimability problems. The American Statistician 39(1):64, DOI 10.2307/2683912
Gardner W, Mulvey EP, Shaw EC (1995) Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118(3):392–404, DOI 10.1037/0033-2909.118.3.392
Gimenez O, Choquet R, Lamor L, Scofield P, Fletcher D, Lebreton JD, Pradel R (2005) Efficient profile-likelihood confidence intervals for capture-recapture models. Journal of Agricultural, Biological, and Environmental Statistics 10(2):184–196, DOI 10.1198/108571105X46462
Golub G, Kahan W (1965) Calculating the singular values and pseudo-inverse of a matrix. Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis 2(2):205–224, DOI 10.1137/0702016
Griewank A (1989) On automatic differentiation. Mathematical Programming: Recent Developments and Applications 6(6):83–107
Jones E, Oliphant T, Peterson P (2001) SciPy: open source scientific tools for Python. URL https://scipy.org/
Kraft D (1988) A software package for sequential quadratic programming. Tech. Rep. DFVLR-FB 88-28, DLR German Aerospace Center – Institute for Flight Mechanics, Köln, Germany
Lai KL, Crassidis J, Cheng Y, Kim J (2005) New complex-step derivative approximations with application to second-order Kalman filtering. In: AIAA Guidance, Navigation, and Control Conference and Exhibit, American Institute of Aeronautics and Astronautics, San Francisco, California, DOI 10.2514/6.2005-5944
Lalee M, Nocedal J, Plantenga T (1998) On the implementation of an algorithm for large-scale equality constrained optimization. SIAM Journal on Optimization 8(3):682–706, DOI 10.1137/S1052623493262993
Moerbeek M, Piersma AH, Slob W (2004) A comparison of three methods for calculating confidence intervals for the benchmark dose. Risk Analysis 24(1):31–40, DOI 10.1111/j.0272-4332.2004.00409.x
Neale MC, Miller MB (1997) The use of likelihood-based confidence intervals in genetic models. Behavior Genetics 27(2):113–120, DOI 10.1023/A:1025681223921, URL http://link.springer.com/10.1023/A:1025681223921
Nocedal J, Wright SJ (2006) Numerical optimization, 2nd edn. Springer Series in Operations Research, Springer, New York
Penrose R (1955) A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society 51(3):406–413
Ponciano JM, Taper ML, Dennis B, Lele SR (2009) Hierarchical models in ecology: confidence intervals, hypothesis testing, and model selection using data cloning. Ecology 90(2):356–362, DOI 10.1890/08-0967.1, URL http://doi.wiley.com/10.1890/08-0967.1
Rao CR (1967) Calculus of generalized inverse of matrices. Part 1: general theory. Sankhya Ser A 29:317–342
Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmüller U, Timmer J (2009) Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25(15):1923–1929, DOI 10.1093/bioinformatics/btp358, URL https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp358
Ren X, Xia J (2019) An algorithm for computing profile likelihood based pointwise confidence intervals for nonlinear dose-response models. PLOS ONE 14(1):e0210953, DOI 10.1371/journal.pone.0210953, URL http://dx.plos.org/10.1371/journal.pone.0210953
Stryhn H, Christensen J (2003) Confidence intervals by the profile likelihood method, with applications in veterinary epidemiology. Viña del Mar, Chile, URL https://pdfs.semanticscholar.org/716e/5366865e409a430bce73d26c8770e6bbbeab.pdf
Venzon DJ, Moolgavkar SH (1988) A method for computing profile-likelihood-based confidence intervals. Applied Statistics 37(1):87, DOI 10.2307/2347496
Viallefont A, Lebreton JD, Reboulet AM, Gory G (1998) Parameter identifiability and model selection in capture-recapture models: a numerical approach. Biometrical Journal: Journal of Mathematical Methods in Biosciences 40(3):313–325, DOI 10.1002/(SICI)1521-4036(199807)40:3
Wu H, Neale MC (2012) Adjusted confidence intervals for a bounded parameter. Behavior Genetics, DOI 10.1007/s10519-012-9560-z, URL http://link.springer.com/10.1007/s10519-012-9560-z

Supplementary Appendices for "A Robust and Efficient Algorithm to Find Profile Likelihood Confidence Intervals"

Samuel M. Fischer · Mark A. Lewis

A An alternative way to account for singular matrices
In each iteration, we seek to maximize the approximate log-likelihood $\hat\ell$ with respect to the nuisance parameters. To that end, we solve the equation

$$0 = \frac{\partial}{\partial \tilde\delta} \hat\ell(\delta) = \tilde H \tilde\delta^* + \tilde h \delta_0^* + \tilde g, \tag{A1}$$

where $\tilde H$ and $\tilde g$ are the blocks of the Hessian and the gradient corresponding to the nuisance parameters, $\tilde h$ is the column of mixed second derivatives linking the parameter of interest to the nuisance parameters, and $\delta_0^*$ is the step in the parameter of interest. Equation (A1) has a unique solution if and only if $\tilde H$ is invertible; otherwise, it may have infinitely many solutions or none. In the main text, we suggested solving (A1) with the Moore-Penrose inverse if $\tilde H$ is singular. However, this procedure proved to be very sensitive to a threshold parameter in tests, and we obtained better results with an alternative method, which we describe below. We furthermore show test results comparing the two methods.
A.1 Description of the method

If $\hat\ell$ has infinitely many maxima in the nuisance parameters, we can choose some nuisance parameters freely and consider a reduced system including only the remaining, independent parameters. To that end, we check $\tilde H$ for linear dependencies at the beginning of each iteration. We are interested in a minimal set $S$ containing the indices of rows and columns whose removal from $\tilde H$ would make the matrix invertible. To compute $S$, we iteratively determine the ranks of sub-matrices of $\tilde H$ using singular value decompositions (SVDs), a well-known tool to identify the rank of a matrix.

The iterative algorithm proceeds as follows: first, we consider one row of $\tilde H$ and determine its rank. Then we add a second row, determine the rank of the new matrix, and repeat the procedure until all rows, i.e. the full matrix $\tilde H$, are considered. Whenever the addition of a row increases the matrix rank, this row is linearly independent of the previous rows. Conversely, rows that do not increase the matrix rank are linearly dependent on other rows of $\tilde H$. The indices of these rows form the set $S$.

In general, the set of linearly dependent rows is not unique. Therefore, we consider the rows of $\tilde H$ in descending order of the magnitudes of the corresponding gradient entries. This can help the algorithm to converge faster.

After $S$ is determined, we need to check whether there is a parameter vector $\theta^*$ satisfying requirements 1 and 2 from section 2.1 (main text) for the approximation $\hat\ell$. Let $\tilde H_{\mathrm{dd}}$ ("d" for "dependent") be the submatrix of $\tilde H$ that remains if all rows and columns corresponding to indices in $S$ are removed from $\tilde H$. Similarly, let $\tilde H_{\mathrm{ff}}$ ("f" for "free") be the submatrix of $\tilde H$ containing only the rows and columns corresponding to indices in $S$, and let $\tilde H_{\mathrm{df}} = \tilde H_{\mathrm{fd}}^\top$ be the matrix containing the rows whose indices are not in $S$ and the columns whose indices are in $S$. Let us define $\tilde g_{\mathrm{d}}$, $\tilde g_{\mathrm{f}}$, $\tilde\delta_{\mathrm{d}}$, and $\tilde\delta_{\mathrm{f}}$ accordingly. If $\tilde H_{\mathrm{dd}}$ is not negative definite, $\hat\ell$ is unbounded, and requirement 2 (section 2.1, main text) cannot be satisfied. Otherwise, we may attempt to solve

$$0 = \frac{\partial}{\partial \tilde\delta} \hat\ell(\delta) \tag{A2}$$

$$\Longleftrightarrow\qquad 0 = \tilde H_{\mathrm{dd}} \tilde\delta_{\mathrm{d}}^* + \tilde H_{\mathrm{df}} \tilde\delta_{\mathrm{f}}^* + \tilde h_{\mathrm{d}} \delta_0^* + \tilde g_{\mathrm{d}} \tag{A3}$$

$$0 = \tilde H_{\mathrm{df}}^\top \tilde\delta_{\mathrm{d}}^* + \tilde H_{\mathrm{ff}} \tilde\delta_{\mathrm{f}}^* + \tilde h_{\mathrm{f}} \delta_0^* + \tilde g_{\mathrm{f}}. \tag{A4}$$

If the equation system (A3)–(A4) has a solution, we can choose $\tilde\delta_{\mathrm{f}}^*$ freely. Setting $\tilde\delta_{\mathrm{f}}^* \leftarrow 0$ makes equation (A3) equivalent to

$$\tilde\delta_{\mathrm{d}}^* = -\tilde H_{\mathrm{dd}}^{-1} \left( \tilde h_{\mathrm{d}} \delta_0^* + \tilde g_{\mathrm{d}} \right). \tag{A5}$$

That is, we may set $\tilde H \leftarrow \tilde H_{\mathrm{dd}}$, $\tilde g \leftarrow \tilde g_{\mathrm{d}}$, and $\tilde\delta^* \leftarrow \tilde\delta_{\mathrm{d}}^*$ for the remainder of the current iteration and proceed as usual, whereby the free nuisance parameters are left unchanged: $\tilde\delta_{\mathrm{f}}^* = 0$. With the resulting $\delta^*$, we check whether (A4) holds approximately. If not, the log-likelihood is unbounded above. We consider this case in section 2.4 in the main text.
A.2 Tests

We implemented RVM with both suggested methods for treating linearly dependent parameters. To that end, we applied the same testing procedure described in section 3 of the main text. The two methods yielded similar results in terms of computational speed (number of required likelihood evaluations) and error in case of reported success (see section 3.1 in the main text). However, holding some parameters constant, as suggested in this Appendix, turned out to be more robust in general and led to slightly higher success rates (see Figure A1). Therefore, we suggest using this method in practice.
Figure A1 Comparison of different methods to handle linearly dependent parameters. The success rates of RVM and RVM-MPI are plotted for the 3 parameter model with transformed covariates (panel A), the 11 parameter model with transformed covariates (panel B), and the 11 parameter GLM (panel C). Though the algorithm using the Moore-Penrose inverse (RVM-MPI) performed slightly better for the GLM with little data (panel C), the method had lower success rates when the 11 parameter model with transformed covariates was considered (panel B). The parameter values used to generate the figures are given in Appendix B.
B Parameters for benchmark tests
Here we provide the parameter values used to generate the data for our benchmark tests. We tested models with transformed covariates with 3 and 11 parameters and a GLM with 11 parameters. For the models with transformed covariates, we considered scenarios with N = 500, N = 1000, N = 3000, and N = 10000 data points. The parameters for the two model families are given in Tables A1 and A2. For the GLM, we considered data sets with sizes N = 50, N = 100, N = 300, and N = 1000. We provide the parameter values in Table A3.

Table A1 Parameters for the model with 3 parameters (α, β₁, β₂) and transformed covariates.
Table A2 Parameters for the model with 11 parameters (α₁–α₅, β₁–β₆) and transformed covariates.
Table A3 Parameters for the 11 parameter GLM (β₁–β₁₁). The covariate powers αᵢ