Improved Structural Methods for Nonlinear Differential-Algebraic Equations via Combinatorial Relaxation
aa r X i v : . [ c s . S C ] J u l Improved Structural Methodsfor Nonlinear Differential-Algebraic Equationsvia Combinatorial Relaxation ∗Taihei Oki † July 11, 2019
Abstract
Differential-algebraic equations (DAEs) are widely used for modeling of dynamical sys-tems. In numerical analysis of DAEs, consistent initialization and index reduction are im-portant preprocessing prior to numerical integration. Existing DAE solvers commonly adoptstructural preprocessing methods based on combinatorial optimization. Unfortunately, thestructural methods fail if the DAE has numerical or symbolic cancellations. For such DAEs,methods have been proposed to modify them to other DAEs to which the structural meth-ods are applicable, based on the combinatorial relaxation technique. Existing modificationmethods, however, work only for a class of DAEs that are linear or close to linear.This paper presents two new modification methods for nonlinear DAEs: the substitutionmethod and the augmentation method. Both methods are based on the combinatorial re-laxation approach and are applicable to a large class of nonlinear DAEs. The substitutionmethod symbolically solves equations for some derivatives based on the implicit functiontheorem and substitutes the solution back into the system. Instead of solving equations, theaugmentation method modifies DAEs by appending new variables and equations. The aug-mentation method has advantages that the equation solving is not needed and the sparsityof DAEs is retained. It is shown in numerical experiments that both methods, especiallythe augmentation method, successfully modify high-index DAEs that the DAE solver inMATLAB cannot handle.
Keywords differential-algebraic equations, index reduction, implicit function theo-rem, combinatorial relaxation, combinatorial scientific computing
Let T ⊆ R be a nonempty open interval and Ω ⊆ R ( l +1) n a nonempty open set. An l th-order differential-algebraic equation (DAE) of size n for x : T → R n is a differential equation in theform of F ( t, x ( t ) , ˙ x ( t ) , . . . , x ( l ) ( t )) = 0 , (1.1)where F : T × Ω → R n is a sufficiently smooth function. DAEs have aspects of both ordinarydifferential equations (ODEs) ˙ x ( t ) = ϕ ( t, x ( t )) and algebraic equations G ( t, x ( t )) = 0. DAEsare widely used for modeling dynamical systems such as mechanical systems, electrical circuits,and chemical reaction plants. ∗ A preliminary version of this paper is to appear in Proceedings of the 44th International Symposium onSymbolic and Algebraic Computation (ISSAC 2019), Beijing, China, July 2019. † Department of Mathematical Informatics, Graduate School of Information Science and Technology, Universityof Tokyo, Tokyo 113-8656, Japan. E-mail: [email protected] initial value problem ,which is to find a smooth trajectory x : T → R n satisfying (1.1) with the initial value condition x ( t ∗ ) = x ∗ (0) , ˙ x ( t ∗ ) = x ∗ (1) , . . . , x ( l − ( t ∗ ) = x ∗ ( l − , (1.2)where t ∗ ∈ T and x ∗ (0) , x ∗ (1) , . . . , x ∗ ( l − ∈ R n . Unlike ODEs, an initial value problem for a DAEmay not have a solution because the DAE can involve algebraic constraints, and the solutionmust satisfy not only the constraints but also their differentiations, called hidden constraints .While giving a consistent initial value of a DAE is an important process prior to numericalintegration, this is known to be a non-trivial task [1, 15, 21].Another important preprocessing of the numerical simulation of DAEs is an index reduction ,which is a process of reducing the differentiation index [2] of a DAE. The differentiation indexof a first-order DAE F ( t, x ( t ) , ˙ x ( t )) = 0 (1.3)is the minimum nonnegative integer ν such that the system of equations F ( t, x ( t ) , ˙ x ( t )) = 0 , dd t F ( t, x ( t ) , ˙ x ( t )) = 0 , . . . , d ν d t ν F ( t, x ( t ) , ˙ x ( t )) = 0can determine ˙ x as a continuous function of t and x . In other words, ν is the number of timesone has to differentiate the DAE (1.3) to get an ODE. Intuitively, the differentiation indexrepresents how far the DAE is from ODEs. The differentiation index of an l th-order DAE (1.1)is defined as that of the first-order DAE obtained by replacing higher-order derivatives of x withnewly introduced variables. It is commonly said to be difficult to numerically solve high ( ≥ ≤
1) index DAE.Today, most simulation software packages for dynamical systems, such as Dymola, Open-Modelica, MapleSim, and Simulink, are equipped with graph-based preprocessing methods,which we call structural methods . These methods were first presented by Pantelides [15] for theconsistent initialization of DAEs. This method was subsequently applied to an index reductionmethod by dummy derivative approach of Mattsson–Söderlind [12] (MS-method). Pryce [16]proposed a structural analysis method for DAEs, called the Σ-method, based on a variant ofPantelides’ method. These structural methods construct a bipartite graph from DAEs’ struc-tural information and solves an assignment problem on the bipartite graph.These structural methods, however, do not work even for the following simple DAE ˙ x + ˙ x + x = 0 , ˙ x + ˙ x = 0 ,x + ˙ x = 0 . (1.4)The Σ-method reports that the index is zero whereas it is indeed two. This is because themethod cannot detect the singularity of the coefficient matrix of ˙ x ˙ x ˙ x . As thistoy example shows, structural methods, which ignore numerical information, may fail on someDAEs due to numerical or symbolic cancellations. In general, the structural methods work onlyif the associated Jacobian matrix, called the system Jacobian , is nonsingular.In order to overcome this issue for a first-order linear DAE A ˙ x ( t ) + A x ( t ) = f ( t ) (1.5)with constant matrices A , A ∈ R n × n and a smooth function f : T → R n , Wu et al. [24]presented a method to modify (1.5) into an equivalent DAE having nonsingular system Jacobian,2sing the combinatorial relaxation algorithm by Iwata [7]. The combinatorial relaxation is aframework devised by Murota [14] to solve linear algebraic problems by iteratively relaxingthem into combinatorial optimization problems. Another combinatorial relaxation method forlinear DAEs whose coefficient matrices are mixed matrices is given in [8]. A mixed matrix isa matrix consisting of accurate constants and inaccurate parameters. Independently, Tan etal. [23] presented modification methods, called LC-method and ES-method, for nonlinear DAEsbased on the same principle. All the above methods iteratively replace an equation of DAEs bya linear combination of other equations or their derivatives. These methods can deal only withDAEs close to linear DAEs; see Section 3.2 for details. In fact, one can make DAEs intractablejust by changing the coordinate nonlinearly.In this paper, we present two modification methods for nonlinear DAEs, which we call the substitution method and the augmentation method . While the previous combinatorial relaxationmethods [7, 8, 9, 14, 23, 24] are designed only for a class of DAEs that is linear or close to linear,our methods are applicable to a much larger class of nonlinear DAEs. The substitution methodexplicitly solves equations for some derivatives based on the implicit function theorem (IFT)and then substitutes the solution back into the system. This can be seen as a generalization ofsolving linear equations in the LC-method. To implement the substitution method, a routineto solve algebraic equations symbolically is needed. The augmentation method is presented asa remedy for this drawback. In order to avoid solving equations symbolically, the augmentationmethod introduces new variables and equations, which are copies of existing ones in the DAEsystem. While the size of the modified DAE is increased, the augmentation method does notdestroy the sparsity of DAEs. 
We show in numerical experiments that both methods can modifyhigh-index DAEs which cannot be dealt with by the standard DAE-solving library in MATLAB.The experimental results also show that an equation-solving engine in MATLAB cannot obtainexplicit functions in the substitution method depending on DAEs and on the selection of thevalues used in the method. The augmentation method successfully serves as a remedy for thisproblem. Related work.
The substitution method repeatedly eliminates some derivatives in the DAEsystem. In theory of DAEs and partial differential equations (PDEs), this approach is known as“differential elimination” or “projection” [4, 17, 18], especially for polynomial DAEs and PDEs.Maple provides rifsimp function that simplifies polynomial PDEs based on the differentialalgebra and the Gröbner basis [11]. From practical dynamical systems, however, non-polynomialDAEs often appear. Gear [4] described a naïve index reduction method for nonlinear DAEsusing a similar approach to the substitution method that iteratively eliminates derivatives usingthe IFT. Gear’s method appends differentiations of some equations in the DAE and thus theresultant DAE is overdetermined. Our method is advantageous in this point since it returnsDAEs having the same number of equations and variables.Takamatsu–Iwata [22] proposed an index reduction method which is also named as “sub-stitution method.” Our substitution method is different from their substitution method inthat their method deals with the first-order linear DAEs with constant coefficients based oncombinatorial matrix theory, whereas our method is designed for fully nonlinear DAEs.
Organization.
This paper is organized as follows. Section 2 summarizes structural methodsfor DAEs and analyzes failure reasons. Section 3 explains previous modification methods basedon the combinatorial relaxation approach. Sections 4 and 5 describe the substitution methodand the augmentation method, respectively. Section 6 illustrates two examples. Section 7 showsresults of numerical experiments. Finally, Section 8 concludes this paper.3
Structural Methods for DAEs
Structural methods for DAEs utilize information on which variable each equation depends. Wefirst introduce notations and a proposition to describe the structural methods.Let T ⊆ R be a nonempty open interval and Ω ⊆ R ( l +1) n a nonempty open set havingcoordinates ( x, ˙ x, . . . , x ( l ) ), where x ( k ) = (cid:16) x ( k ) j (cid:17) j ∈ C ∈ R n for k ∈ { , , . . . , l } . Here C is a set ofindices with | C | = n . Note that each x ( k ) j is regarded not as the k th-order derivative of sometrajectory but as an independent variable here. Let f : T × Ω → R be a smooth function. For j ∈ C and k ∈ { , , . . . , l } , the function f is said to depend on x ( k ) j if the partial derivative ∂f∂x ( k ) j is not identically zero on the domain T × Ω of f . We denote the maximum nonnegativeinteger k such that f depends on x ( k ) j by σ ( f, x j ). If f does not depend on x ( k ) j for any k , weassign σ ( f, x j ) := −∞ for convenience.The derivative ˙ f of f with respect to t is defined by˙ f ( t, x, ˙ x, . . . , x ( l +1) ) := ∂f∂t ( t, x, ˙ x, . . . , x ( l ) ) + l X k =0 ∂f∂x ( k ) ( t, x, ˙ x, . . . , x ( l ) ) x ( k +1) . For a nonnegative integer d , the d th-order derivative f ( d ) of f is recursively defined by f (0) := f and f ( d ) := ˙ f ( d − for d ≥
1. It should be noted that the domain of ˙ f is not T × Ω but T × Ω × R n because ˙ f linearly depends on x ( l +1) . Similarly, for a nonnegative integer d , weregard the domain of f ( d ) as T × Ω ( d ) , where Ω ( d ) := Ω × R dn .The following simple proposition plays an important role in structural methods for DAEs. Proposition 2.1 (Griewank’s lemma [5, Section 2.2], [16, Lemma 3.7]) . Let f : T × Ω → R bea smooth function. For j ∈ C and a nonnegative integer d , if σ ( f, x j ) ≤ c , then ∂f∂x ( c ) j ( t, x, ˙ x, . . . , x ( l ) ) = ∂f ( d ) ∂x ( c + d ) j ( t, x, ˙ x, . . . , x ( l + d ) ) (2.1) holds for all ( t, x, ˙ x, . . . , x ( l + d ) ) ∈ T × Ω ( d ) . We sometimes regard the domain of ∂f ( d ) ∂x ( c + d ) j not as T × Ω ( d ) but as T × Ω to simply writethe equality (2.1) as ∂f∂x ( c ) j = ∂f ( d ) ∂x ( c + d ) j . In addition, it follows from Proposition 2.1 that σ (cid:16) f ( d ) , x j (cid:17) = σ ( f, x j ) + d holds for j ∈ C and a nonnegative integer d . Pryce [16] introduced an assignment problem for a reinterpretation of Pantelides’ algorithm [15]as follows.Consider a DAE (1.1) of size n with equation index set R and variable index set C . Let G ( F ) denote the bipartite graph with vertex set R ∪ C and edge set E ( F ) := { ( i, j ) ∈ R × C | σ ( F i , x j ) > −∞} .
4n edge subset M ⊆ E ( F ) is called a matching if the ends of edges in M are disjoint. A perfect matching is a matching of size n . We set the weight c e of an edge e = ( i, j ) ∈ E ( F ) by c e = c i,j = σ ( F i , x j ).The assignment problem on G ( F ) is the following problem P( F ):P( F ) maximize X e ∈ M c e subject to M ⊆ E ( F ) is a perfect matching on G ( F ).The dual problem D( F ) of P( F ) is expressed as follows:D( F ) minimize X j ∈ C q j − X i ∈ R p i subject to q j − p i ≥ c i,j (( i, j ) ∈ E ( F )) ,p i ∈ Z ( i ∈ R ) ,q j ∈ Z ( j ∈ C ) . It can be shown from the duality theorem that D( F ) has an optimal solution if and only if G ( F )has a perfect matching. Considerˆ δ ( F ) := the optimal value of the problem D( F ) , which is equal to the optimal value of P( F ) due to the strong duality. If D( F ) has no optimalsolution, we assign ˆ δ ( F ) := −∞ . The problems P( F ) and D( F ) can be efficiently solved by theHungarian method [10].For a dual feasible solution ( p, q ), a system Jacobian D = ( D i,j ) i ∈ R,j ∈ C : T × Ω → R n × n of F with respect to ( p, q ) is a matrix defined by D i,j := ∂F ( p i ) i ∂x ( q j ) j = ∂F i ∂x ( q j − p i ) j (2.2)for each i ∈ R and j ∈ C . The last equality in (2.2) for ( i, j ) with q j − p i ≥ i, j ) with q j − p i < ∂F i ∂x ( q j − p i ) j as anidentically zero function.Here we give a characterization of the optimality of D( F ), which was originally given byMurota [14] for linear DAEs with constant coefficients. For a system Jacobian D , let G ∗ ( D ) bethe bipartite graph with vertex set R ∪ C and edge set E ∗ ( D ) = { ( i, j ) ∈ R × C | D i,j is not identically zero } . (2.3)The term rank of D is the maximum size of a matching in G ∗ ( D ), and is denoted by t-rank D . Proposition 2.2 ([14, Proposition 2.3]) . For a DAE (1.1) of size n , let D be a system Jacobianof the DAE with respect to a feasible solution ( p, q ) of D( F ) . Then ( p, q ) is optimal if and onlyif t-rank D = n . It is well-known that the term-rank of D serves as a combinatorial upper bound on the rankof D . Therefore, t-rank D = n is a necessary condition for the nonsingularity of D . Pryce’s Σ-method [16] uses the assignment problem to determine a system of equations whosesolution provides a consistent initial value. The Mattsson–Söderlind method [12] (MS-method)reduces the index of DAEs in a structural way based on the dummy derivative approach. Thevalidity of these structural methods is established as follows.5 heorem 2.3 ([12, Section 3.2], [16, Theorems 4.2, 5.2]) . For a DAE (1.1) , suppose that D( F ) has an optimal solution ( p, q ) and let D be the system Jacobian of (1.1) with respect to ( p, q ) .If there exists a consistent point ( t ∗ , X ∗ ) of (1.1) at which D is nonsingular, then ( t ∗ , X ∗ ) canbe found by the Σ -method. In addition, the MS-method returns an equivalent DAE whose indexis at most one around ( t ∗ , X ∗ ) . In practice, the condition in Theorem 2.3 is satisfied on many DAEs of real instances. Forexample, Pryce [16] showed that the Σ-method can be applied to any DAE which is of indexzero, in standard canonical form, in Hessenberg form, a constrained mechanical system, or atriangular chain of systems for which the method works [16, Theorem 5.3]. 
The structuralmethods succeed for seven instances out of nine DAE problems in the test set for IVP (initialvalue problem) solvers collected by Mazzia and Magherini [13].However, it is also true that the structural methods do not work for two DAEs in the testset, which model electrical circuits describing the behaviour of a transistor amplifier and a ringmodulator. In addition, it is reported [8, 20] that the structural methods fail for DAEs modelingsimple RLC circuits.Here we investigate how the structural methods fail. From Theorem 2.3, these failures areclassified into the following three scenarios.(F1) The bipartite graph G ( F ) has no perfect matching, or equivalently, the dual problemD( F ) has no optimal solution.(F2) The system Jacobian D with respect to an optimal solution of D( F ) is not identicallysingular on T × Ω but singular at all consistent points.(F3) D is identically singular.Example DAEs of the failures are shown in the following. Example 2.4.
Consider the following DAE: ( x + ( x − = 0 , . (2.4)The DAE (2.4) has a unique solution x ( t ) = 0 and x ( t ) = 1 for all t ∈ R . However, since thebipartite graph G ( F ) corresponding to (2.4) has no perfect matching, the structural methodscannot be applied to (2.4) due to (F1). Example 2.5.
Consider the following DAE: ( x = 0 , ( x − = 0 . (2.5)The solution of (2.5) is the same as that of (2.4).We try to apply the Σ-method to (2.5). In Step 1, we find a dual optimal solution p = (0 , q = (0 , D is D = x
00 2( x − ! , which is not identically singular on Ω = { ( x , x ) | x , x ∈ R } . However, D is singular at theunique consistent point (0 ,
1) of (2.5). Hence (2.5) does not satisfy the validity condition of theΣ-method (and the MS-method) due to (F2). 6 xample 2.6.
The structural methods cannot be applied to the DAE (1.4). In fact, its systemJacobian D corresponding to a dual optimal solution p = (0 , ,
0) and q = (1 , ,
1) is a singularconstant matrix D = . Thus the DAE (1.4) is in the case of (F3).The structural methods indeed fail for the aforementioned electrical network DAEs due to(F3). In this paper, we focus on (F3). It is also known that the nonsingularity of the systemJacobian is destroyed by a simple linear transformation of DAEs as follows.
Example 2.7.
Let F ( t, x, ˙ x, . . . , x ( l ) ) = 0 be a DAE and D the system Jacobian with respectto a dual optimal solution ( p, q ). Suppose p i = p i for some i , i ∈ R . Take a “generic”matrix A ∈ R n × n , that is, each entry in A is chosen at random. Then F ( t, x, ˙ x, . . . , x ( l ) ) = 0and AF ( t, x, ˙ x, . . . , x ( l ) ) = 0 are equivalent DAEs since A is nonsingular (with probability one),whereas AF ( t, x, ˙ x, . . . , x ( l ) ) = 0 meets (F3) as we explain below.In fact, from the genericity of A , the associated graph G ( AF ) is the complete bipartite graphwith edge weight c ′ h,j := max i ∈ R σ ( F i , x j ) for ( h, j ) ∈ E ( AF ). Thus an optimal solution ( p ′ , q ′ )of D( AF ) is p ′ = (0 , . . . ,
0) and q ′ = (cid:16) max i ∈ R σ ( F i , x j ) (cid:17) j ∈ C . It is easy to see that the systemJacobian D ′ of AF ( t, x, ˙ x, . . . , x ( l ) ) = 0 with respect to ( p ′ , q ′ ) is given by D ′ = A ˜ D , where ˜ D isa matrix defined by ˜ D i,j := D i,j if σ ( F i , x j ) = q ′ j and ˜ D i,j := 0 otherwise for i ∈ R and j ∈ C .Here from the assumption p i = p i , there is a row of zeros in ˜ D , and thus D ′ is identicallysingular.The failure (F3) is attributed to the fact that the structural methods use only combinatorialinformation and ignore numerical and symbolic information of DAEs assuming that nonzeroentries in Jacobian matrices are generic. Then numerical or symbolic cancellations inherent inthe DAEs make the system Jacobian identically singular. The method of Wu et al. [24] modifies a given first-order linear DAE (1.5) with constant coef-ficients into an equivalent linear DAE without (F3), i.e., the system Jacobian is not identicallysingular. This method relies on the combinatorial relaxation algorithm of Iwata [7], and allother modification methods are also based on the combinatorial relaxation approach. Thecombinatorial relaxation method consists of the following three phases [7, 14, 23, 24].
Combinatorial RelaxationPhase 1.
Compute an optimal solution ( p, q ) of D( F ). If D( F ) has no optimal solution,the algorithm terminates with failure. Phase 2.
If the system Jacobian D with respect to ( p, q ) is not identically singular, returnthe DAE F = 0 and halt. Phase 3.
Modify the DAE F = 0 into an equivalent DAE ¯ F = 0 such that ˆ δ ( ¯ F ) ≤ ˆ δ ( F ) − F ) has an optimal solution if and only if ˆ δ ( F ) ≥
0, the above process ends in at mostˆ δ ( F ) ≤ ln iterations. Therefore, given a DAE with (F3), the combinatorial relaxation methodreturns an equivalent DAE without (F3) (or with (F1) if the method has failed in Phase 1).A non-trivial part of the combinatorial relaxation method is only Phase 3, which modifiesDAEs to decrease the value of ˆ δ . Iwata’s combinatorial relaxation algorithm modifies first-orderlinear DAEs with constant coefficients using strict equivalence transformations , which multiplynonsingular constant matrix to equations and variables. A combinatorial relaxation methodin [8] for linear DAEs with mixed matrices employs unimodular transformations here. The uni-modular transformation is a sequence of trivial equivalent transformations of DAEs that addan equation (or its derivative) to another equation. Iwata–Takamatsu’s index reduction algo-rithm [9] for first-order linear DAEs with constant coefficients is also based on the combinatorialrelaxation and modifies DAEs using unimodular transformations. The LC-method of Tan et al. [23] can be regarded as a nonlinear generalization of the method ofWu et al [24], where the difference is only the modification method in Phase 3. The modificationmethod of the LC-method is summarized as follows.Suppose that we have a DAE (1.1) and its dual optimal solution ( p, q ) such that the sys-tem Jacobian D with respect to ( p, q ) is identically singular. First, we find a nonzero vector u ( t, x, ˙ x, . . . ) = ( u i ) i ∈ R in the cokernel of D , namely, u is a row vector such that uD is identicallyzero. Let supp u denote the support of u , i.e.,supp u := { i ∈ R | u i is not identically zero } . Take r ∈ supp u such that p r ≤ p i for all i ∈ supp u and put I := supp u \ { r } . Then we replacethe r -th equation F r = 0 of the DAE by ¯ F LC r = 0, where¯ F LC r := u r F r + X i ∈ I u i F ( p i − p r ) i . (3.1)It is shown that this modification decreases the value of ˆ δ if σ ( u i , x j ) < q j − p r (3.2)for all i ∈ R and j ∈ C [23, Theorem 4.1]. Intuitively, the condition (3.2) means that the highest-order derivatives appear linearly in DAEs. For (time-varying) linear DAEs, (3.2) trivially holdssince σ ( u i , x j ) = −∞ for all i, j .However, there still exist DAEs to which the LC-method cannot be applied. For example,the following DAE ( ˙ x ˙ x − t = 0 , ˙ x ˙ x + x + x − t − t − F . Example 3.1.
Let F ( t, x, ˙ x, . . . , x ( l ) ) = 0 be a DAE and D the system Jacobian with respectto a dual optimal solution ( p, q ). Suppose p i = p i for some i , i ∈ R as in Example 2.7. Let ψ = ( ψ h ) h ∈ R ′ : R n → R n be a “generic” nonlinear diffeomorphism such that ψ ( w ) = 0 if andonly if w = 0. Then F ( t, x, ˙ x, . . . , x ( l ) ) = 0 is equivalent to ψ ( F ( t, x, ˙ x, . . . , x ( l ) )) = 0, whereas8he latter DAE meets (F3) but cannot be handled by the LC-method since the highest-orderderivatives appear nonlinearly.More formally, this is shown as follows. From the genericity assumption on ψ , it holds σ ( ψ h ( F ) , x j ) = max i ∈ R σ ( F i , x j ) for each h ∈ R ′ and j ∈ C . Then as in Example 2.7, G ( ψ ( F ))is the complete bipartite graph, and a dual optimal solution is given by p ′ = (0 , . . . , q ′ = (cid:16) max i ∈ R σ ( F i , x j ) (cid:17) j ∈ C . Let ˜ D be a matrix defined by ˜ D i,j := D i,j if σ ( F i , x j ) = q ′ j and ˜ D i,j :=0 otherwise for i ∈ R and j ∈ C . Then the ( h, j )-th entry in the system Jacobian D ′ of ψ ( F ( t, x, ˙ x, . . . , x ( l ) )) = 0 with respect to ( p ′ , q ′ ) is D ′ h,j = ∂ψ h ( F ( t, x, ˙ x, . . . , x ( l ) )) ∂x ( q ′ j ) j = X i ∈ R ∂ψ h ∂w i ( F ( t, x, ˙ x, . . . , x ( l ) )) ∂F i ∂x ( q ′ j ) j ( t, x, ˙ x, . . . , x ( l ) )= X i ∈ R ∂ψ h ∂w i ( F ( t, x, ˙ x, . . . , x ( l ) )) ˜ D i,j . Therefore, it holds D ′ = d ψ d w ( F ( t, x, ˙ x, . . . , x ( l ) )) ˜ D , where d ψ d w is the Jacobian matrix of ψ .Now there is a row of zeros in ˜ D from the assumption p i = p i , and thus D ′ is identicallysingular. In addition, a cokernel vector u of D ′ corresponds to a cokernel vector v of ˜ D by u = d ψ d w ( F ( t, x, ˙ x, . . . , x ( l ) )) v . Since each entry in d ψ d w ( F ( t, x, ˙ x, . . . , x ( l ) )) depends on x ( q ′ j ) j bythe nonlinearity and genericity assumptions on ψ , an entry u h in u depends on x ( q ′ j ) j for each h ∈ R ′ and j ∈ C if u = 0. This means that the DAE ψ ( F ( t, x, ˙ x, . . . , x ( l ) )) = 0 does not fulfillthe validity condition (3.2) of the LC-method.The claim in Example 3.1 implies that (3.2) holds only if we have a special coordinate of thecodomain space of F . Therefore, from a geometrical point of view, it is natural and importantto devise a modification method for such “heavily nonlinear” DAEs. In this section, we describe a new modification method for nonlinear DAEs, called the substi-tution method. This method is used in Phase 3 of the combinatorial relaxation framework.Let T ⊆ R be a nonempty open interval and Ω ⊆ R ( l +1) n a nonempty open set. The inputof the substitution method is a DAE (1.1) of size n with real analytic function F : T × Ω → R n such that(I1) G ( F ) has a perfect matching,(I2) for any square submatrix D [ I, J ] of the system Jacobian D with respect to a dualoptimal solution, if D [ I, J ] is not identically singular on T × Ω, then there exists aconsistent point of (1.1) at which D [ I, J ] is nonsingular, and(I3) D is identically singular.The smoothness assumption on F is needed to avoid technical difficulties. We remark that (I2)is just a part of a sufficient condition for which the substitution method works, and it suffices inpractice to check the condition only for a few submatrices of D that are needed in the method.9he substitution method modifies the DAE (1.1) into another DAE¯ F sub ( t, x, ˙ x, . . . 
, x ( l + κ ) ) = 0 (4.1)of size n such that(S1) ¯ F sub is a real analytic function defined on a nonempty open subset ¯ T sub × ¯Ω sub ⊆ T × Ω ( κ ) with κ ≤ ln ,(S2) the resulting DAE (4.1) is locally equivalent to the input DAE (1.1), and(S3) ˆ δ ( ¯ F sub ) ≤ ˆ δ ( F ) − R and C be the equationindex set and the variable index set of the DAE (1.1), respectively. For I ⊆ R , let F I denote a“subvector” ( F i ) i ∈ I of F indexed by I . Similarly, for J ⊆ C , let x J denote a subvector ( x j ) j ∈ J of x indexed by J . Let p and q be the vectors of variables in D( F ). In addition, we use thefollowing notations F ( p ) I := (cid:18) F ( p i ) i (cid:19) i ∈ I , x ( q ) J := (cid:18) x ( q j ) j (cid:19) j ∈ J , ∂F ( p ) I ∂x ( q ) J := ∂F ( p i ) i ∂x ( q j ) j i ∈ I,j ∈ J for I ⊆ R and J ⊆ C .Here we start to describe the method. Let D be the system Jacobian of (1.1) with respectto an optimal solution ( p, q ) of D( F ) and suppose that D is identically singular. We regard D as a matrix over the quotient field F of the ring of real analytic functions on T × Ω. Thesubstitution method first finds r ∈ R , I ⊆ R \ { r } and J ⊆ C with | I | = | J | =: m such that(C1) D [ I, J ] is nonsingular,(C2) rank D [ I ∪ { r } , C ] = m , and(C3) p r ≤ p i for i ∈ I .Here, both the nonsingularity in (C1) and the rank in (C2) are in the sense of those of matricesover F . Namely, these conditions can be rewritten as(C1 ∗ ) D [ I, J ] is not identically singular, and(C2 ∗ ) the maximum size of a submatrix in D [ I ∪ { r } , C ] that is not identically singular is m .The existence of ( r, I, J ) satisfying (C1)–(C3) is guaranteed through the algorithm explained inSection 4.2.Let ( r, I, J ) be a triple satisfying the conditions (C1)–(C3). Define S = R \ ( I ∪ { r } ) and T = C \ J . Then the DAE (1.1) is divided into three subsystems as follows: F r ( t, x, ˙ x, . . . , x ( l ) ) = 0 ,F I ( t, x, ˙ x, . . . , x ( l ) ) = 0 ,F S ( t, x, ˙ x, . . . , x ( l ) ) = 0 . (4.2)10he system Jacobian D with respect to ( p, q ) forms a block matrix as follows: D = J T { r } ∂F ( p r ) r ∂x ( q ) J ∂F ( p r ) r ∂x ( q ) T I ∂F ( p ) I ∂x ( q ) J ∂F ( p ) I ∂x ( q ) T S ∂F ( p ) S ∂x ( q ) J ∂F ( p ) S ∂x ( q ) T . By the condition (C3) and Proposition 2.1, it holds that ∂F ( p ) I ∂x ( q ) J = ∂F ( p i ) i ∂x ( q j ) j i ∈ I,j ∈ J = ∂F ( p i − p r ) i ∂x ( q j − p r ) j i ∈ I,j ∈ J = ∂F ( p − p r ) I ∂x ( q − p r ) J , where is the vector of ones with appropriate dimension. In addition, from the condition (C1),the submatrix D [ I, J ] = ∂F ( p ) I ∂x ( q ) J = ∂F ( p − p r ) I ∂x ( q − p r ) J is not identically singular on T × Ω. Therefore, by(I2), there exists a point (ˆ t, ˆ X ) ∈ T × Ω ( κ ) such that F ( p − p r ) I (ˆ t, ˆ X ) = 0 and ∂F ( p − p r ) I ∂x ( q − p r ) (ˆ t, ˆ X ) isnonsingular, where κ := max i ∈ I p i − p r . (4.3)Then via the IFT, we can solve an equation F ( p − p r ) I ( t, x, ˙ x, . . . , x ( l + κ ) ) = 0 (4.4)for x ( q − p r ) J as x ( q − p r ) J = ϕ ( t, x, ˙ x, . . . , x ( l + κ ) ) , (4.5)where ϕ is a function that does not depend on x ( q − p r ) J . See Section 4.3 for a rigorous descriptionof this part.Finally, we substitute the right-hand side of (4.5) into x ( q − p r ) J in the first equation F r = 0of (4.2). The modified DAE (4.1) is ¯ F sub r ( t, x, ˙ x, . . . , , x ( l + κ ) ) = 0 ,F I ( t, x, ˙ x, . . . , x ( l ) ) = 0 ,F S ( t, x, ˙ x, . . . , x ( l ) ) = 0 , (4.6)where ¯ F sub r is a function obtained from F r by substituting (4.5). 
( r, I, J ) Let D be a singular n × n matrix over a field F with row index set R and column index set C ,and p = ( p i ) i ∈ R an integer vector indexed by R . On the setting in Section 4.1, F is the quotientfield of the ring of analytic functions on T × Ω. We give an algorithm, which uses arithmeticoperations over F , to find r ∈ R, I ⊆ R \ { r } and J ⊆ C satisfying the conditions (C1)–(C3).11irst, by column operations, we transform D into D ′ = ( D ′ i,j ) i ∈ R,j ∈ C in the form D ′ = B C \ B (cid:18) (cid:19) H I OR \ H ∗ O , (4.7)where H ⊆ R and B ⊆ C with | H | = | B | = rank D . Here, I and O in (4.7) are the identitymatrix and the zero matrix of appropriate size, respectively, and “ ∗ ” indicates an arbitrarymatrix. Let h : B → H denote the natural bijection represented by the top left block D ′ [ H, B ]in (4.7). Namely, h ( j ) = i if and only if D ′ i,j = 0 for j ∈ B and i ∈ H .Next, we choose ℓ ∈ R \ H arbitrarily. Note that R \ H is nonempty because D ′ is singular.Put Z := { ℓ } ∪ n h ( j ) (cid:12)(cid:12)(cid:12) j ∈ B, D ′ ℓ,j = 0 o ⊆ R. (4.8)Finally, we take r ∈ Z such that p r ≤ p i for all i ∈ Z . Put I := Z \ { r } and choose J ⊆ C suchthat D [ I, J ] is nonsingular. The existence of J is guaranteed by the following lemma. Lemma 4.1.
Let D ∈ F n × n be a singular matrix and Z ⊆ R defined in (4.8) . Then D [ Z, C ] isnot of full-row rank and D [ I, C ] is of full-row rank for any proper subset I ( Z .Proof. Since D ′ in (4.7) is a matrix obtained from D by column operations, it suffices to showthe statement for D ′ . By the definition of Z , it holds D ′ [ { ℓ } , C ] − X i ∈ Z \{ ℓ } D ′ [ { i } , C ] = 0 . This implies that D ′ [ Z, C ] is not of full-row rank.We next show that D ′ [ I, C ] is of full-row rank for I ( Z . This is trivial if ℓ / ∈ I since I ( Z ⊆ { ℓ } ∪ H and D ′ [ H, C ] is of full-row rank. Suppose that ℓ ∈ I . Then we cantake i ∈ Z \ I . From the definition of Z , D ′ [( Z \ { i } ) ∪ { ℓ } , C ] is of full-row rank. Since I ⊆ ( Z \ { i } ) ∪ { ℓ } , D ′ [ I, C ] is also of full-row rank.The following theorem holds from the construction of ( r, I, J ) together with Lemma 4.1.
Theorem 4.2.
For a singular matrix D ∈ F n × n , the above algorithm returns ( r, I, J ) satisfyingthe conditions (C1)–(C3) . This algorithm uses O (cid:0) n (cid:1) arithmetic operations over F . This section gives a mathematically rigorous description of the application of the IFT to (4.4).The description in this section is used in proofs of the substitution method later.We introduce additional notations. Let
C ⊆ C × { , , , . . . , l } be a finite set of index pairssuch that ( j, k ) ∈ C indicates an argument x ( k ) j of F in (1.1). Let R C denote a |C| -dimensionalreal vector space with index set C . For X ∈ R C and J ⊆ C , let X J designate a subvector of X with index set J .The following is a version of the IFT which we use. Theorem 4.3 (Implicit Function Theorem; IFT) . Let U ⊆ R n + m be an open set having coor-dinates ( x, y ) with x ∈ R n and y ∈ R m . Let f : U → R m be a real analytic function. Fix apoint ( ξ, η ) ∈ U such that f ( ξ, η ) = 0 and ∂f∂y ( ξ, η ) is nonsingular. Then there exist open sets V ⊆ R n and W ⊆ R m with ( ξ, η ) ∈ V × W ⊆ U and a real analytic function ϕ : V → W suchthat ϕ ( ξ ) = η , (2) f ( x, y ) = 0 if and only if y = ϕ ( x ) for all ( x, y ) ∈ V × W , and (3) ∂f∂y ( x, ϕ ( x )) is nonsingular and d ϕ d x ( x ) = − (cid:18) ∂f∂y ( x, ϕ ( x )) (cid:19) − ∂f∂x ( x, ϕ ( x )) (4.9) for all x ∈ V . The function ϕ in the IFT is called an explicit function . The formula (4.9) is called the implicit differentiation formula .Let us start the description of the application of the implicit function theorem. Let ( p, q )be an optimal solution of D( F ) and ( r, I, J ) triple satisfying the conditions (C1)–(C3). Put C := { ( j, k ) | j ∈ C, ≤ k ≤ q j − p r } . (4.10)From Proposition 2.1 and the feasibility of ( p, q ), it holds that σ (cid:16) F ( p i − p r ) i , x j (cid:17) = σ ( F i , x j ) + p i − p r = c i,j + p i − p r ≤ q j − p r (4.11)for i ∈ I ∪ { r } and j ∈ J with σ ( F i , x j ) > −∞ . Thus we regard both F r and F ( p I − p r ) I asfunctions defined on T × U , where U is an open subset of R C .Take (ˆ t, ˆ X ) ∈ T × U such that F ( p − p r ) I (ˆ t, ˆ X ) = 0 and ∂F ( p − p r ) I ∂x ( q − p r ) J (ˆ t, ˆ X ) is nonsingular. Let J := { ( j, q j − p r ) | j ∈ J } ⊆ C . (4.12)Then the components of ˆ X is bipartitioned by J as ˆ X = ( ˆ X C\J , ˆ X J ). Thus by the implicitfunction theorem, there exist open sets ¯ T sub ⊆ T , V ⊆ R C\J and W ⊆ R J with (ˆ t, ˆ X C\J , ˆ X J ) ∈ ¯ T sub × V × W ⊆ T × U and a real analytic function ϕ : ¯ T sub × V → W such that ˆ X J = ϕ (cid:0) ˆ t, ˆ X C\J (cid:1) and F ( p − p r ) I (cid:0) t, X C\J , ϕ (cid:0) t, X
C\J (cid:1)(cid:1) = 0 (4.13)for every ( t, X
C\J ) ∈ V . In addition, all zeros of F ( p − p r ) I in ¯ T sub × V × W are in the formof (4.13). Using ϕ , the modified function ¯ F sub r : ¯ T sub × V → R can be expressed as¯ F sub r ( t, X C\J ) = F r (cid:0) t, X C\J , ϕ (cid:0) t, X
C\J (cid:1)(cid:1) (4.14)for ( t, X
C\J ) ∈ ¯ T sub × V . Since both F r and ϕ are real analytic, so is ¯ F sub r .We remark about the domain of the resulting system of functions ¯ F in (4.6). In the aboveargument, we treated the domain of F ( p − p r ) I as T × U , which is an open subset of T × R C .However, the domain of F ( p − p r ) I can also be represented as T × Ω ( κ ) , where κ is defined by (4.3)(indeed, U is the projection of Ω ( κ ) onto R C ). Since ¯ F r is a function obtained from F ( p − p r ) I and F r by the above transformation, the domain of ¯ F r (and thus of ¯ F ) can also be regarded as¯ T sub × ¯Ω sub , where ¯Ω sub is a nonempty open subset of Ω ( κ ) .13 .4 Proofs This section is devoted to the validity proofs of our method.We first show (S1). In Section 4.3, we have already shown that F r is a real analytic functiondefined on ¯ T sub × ¯Ω sub . Thus what we should give is only the bound on κ . To give this bound,we need to bound p i for i ∈ I . This can be achieved by using a specialized ( p, q ) that thefollowing lemma claims. Lemma 4.4 ([8, Lemma 4.1]) . Let G = ( R ∪ C ; E ) be a bipartite graph with | R | = | C | = n and suppose that G has a perfect matching. For each edge e ∈ E , let c e ∈ { , , . . . , l } be theweight of e . Then there exists an algorithm to find an optimal solution ( p, q ) of the dual of themaximum weight perfect matching problem on G such that p i , q j ≤ nl for all i ∈ R and j ∈ C . See [8, Section 4.2] for the algorithm mentioned in Lemma 4.4. By using ( p, q ) obtained bythe algorithm, the following lemma immediately follows.
Lemma 4.5.
In the substitution method, κ defined in (4.3) is at most nl . Next, we focus on (S2), which claims about the equivalence of the original DAE and themodified DAE.
Lemma 4.6.
Consider a DAE (1.1) satisfying (I1)–(I3) . Let x : ¯ T sub → R n be a sufficientlysmooth trajectory satisfying the initial value condition (1.2) for ( t ∗ , X ∗ ) ∈ ¯ T sub × ¯Ω sub . Thenthere exists an open subinterval I ⊆ ¯ T sub containing t ∗ such that x is a solution of (1.1) on I ifand only if x is a solution of (4.6) on I .Proof. We show both the “if” and “only if” parts simultaneously. Suppose that there exists anopen subinterval I ⊆ ¯ T sub with t ∗ ∈ I such that x is a solution of (1.1) or (4.6) on I . Then x satisfies F I ( t, x ( t ) , ˙ x ( t ) , . . . , x ( l ) ( t )) = 0 on I , which is a subsystem of both (4.2) and (4.6). Thus x also satisfies (4.4) on I .We rewrite the equation (4.4) for x ( t ) using C and J defined by (4.10) and (4.12), respec-tively. Let D C denote a differentiation operator that maps x to a trajectory D C x : ¯ T sub → R C such that (cid:16) D C x ( t ) (cid:17) ( j,k ) = x ( k ) j ( t )for t ∈ ¯ T sub and ( j, k ) ∈ C . Then the initial value condition (1.2) can be represented asD C x ( t ∗ ) = X ∗C . Since the domain of F ( p − p r ) I is an open subset of T × R C , the equation (4.4)for x ( t ) can also be represented as F ( p − p r ) I (cid:0) t, D C x ( t ) (cid:1) = 0or F ( p − p r ) I (cid:0) t, D C\J x ( t ) , D J x ( t ) (cid:1) = 0 (4.15)for t ∈ I , where D C\J and D J are differentiation operators defined in the same way as D C .Let U ⊆ R C , V ⊆ R C\J and W ⊆ R J be open sets defined in Section 4.3. Here, since x is smooth, U is open and D C x ( t ∗ ) = X ∗C ∈ U , it holds D C x ( t ) ∈ U for all t ∈ I by taking I sufficiently small. This implies that D C\J x ( t ) ∈ V and D J x ( t ) ∈ W for t ∈ I . Comparing (4.13)and (4.15), we obtain D J x ( t ) = ϕ (cid:0) t, D C\J x ( t ) (cid:1) t ∈ I . Therefore, we have F r ( t, x ( t ) , ˙ x ( t ) , . . . , x ( l ) ( t )) = F r (cid:0) t, D C\J x ( t ) , D J x ( t ) (cid:1) = F r (cid:0) t, D C\J x ( t ) , ϕ (cid:0) t, D C\J x ( t ) (cid:1)(cid:1) = ¯ F sub r (cid:0) t, D C\J x ( t ) (cid:1) = ¯ F sub r ( t, x ( t ) , ˙ x ( t ) , . . . , x ( l + κ ) ( t )) , which means that x is a solution of (4.6) if x is a solution of (1.1), and vice versa.We finally show that the modified DAE satisfies (S3). In order to show (S3), it suffices toshow that ( p, q ) is a feasible solution of D( ¯ F sub ) but not an optimal solution. The feasibility iseasily shown as follows. Lemma 4.7.
Consider a DAE (1.1) satisfying (I1)–(I3) and let ( p, q ) be an optimal solutionof D( F ) . Then ( p, q ) is feasible on D( ¯ F sub ) .Proof. Let ( r, I, J ) be a triple satisfying the conditions (C1)–(C3). Consider the explicit function ϕ in (4.5). For i ∈ I and j ∈ C , we have σ ( ϕ, x j ) ≤ σ (cid:16) F ( p i − p r ) i , x j (cid:17) ≤ q j − p r , where we used (4.11) in the last inequality. Because ¯ F sub r is a function obtained from F r bysubstituting (4.5), it holds σ (cid:16) ¯ F sub r , x j (cid:17) ≤ max { σ ( F r , x j ) , σ ( ϕ, x j ) } ≤ q j − p r for every j ∈ C . Thus ( p, q ) is feasible on D( ¯ F sub ).We finally focus on the non-optimality of ( p, q ) on D( ¯ F sub ). By Proposition 2.2, our goalis to show that t-rank ¯ D < n holds, where ¯ D be the system Jacobian of (4.1) with respect to( p, q ). This is shown by the following lemma. Lemma 4.8.
Consider a DAE (1.1) satisfying (I1)–(I3) . Let ( p, q ) be an optimal solution of D( F ) and ( r, I, J ) a triple satisfying (C1)–(C3) . Then the modified function ¯ F r in (4.6) doesnot depend on x ( q j − p r ) j for all j ∈ C .Proof. The claim is easy to see for j ∈ J because we have eliminated x ( q j − p r ) j from F r by substi-tuting (4.5). Consider the variable x ( q − p r ) T with T = C \ J . Let C and J be index sets definedin (4.10) and (4.12), respectively. For ( t, X ) ∈ ¯ T sub × ¯Ω sub , we denote ( t, X C\J , ϕ ( t, X C\J )) by A t,X for short, where ϕ is the explicit function given by (4.5). From the chain rule, the implicitdifferentiation formula (4.9) and Proposition 2.1, we obtain ∂ ¯ F sub r ∂x ( q − p r ) ( t, X C\J )= ∂F r ∂x ( q − p r ) T ( A t,X ) + ∂F r ∂x ( q − p r ) J ( A t,X ) ∂ϕ∂x ( q − p r ) T ( t, X C\J )= ∂F r ∂x ( q − p r ) T ( A t,X ) − ∂F r ∂x ( q − p r ) J ( A t,X ) ∂F ( p − p r ) I ∂x ( q − p r ) J ( A t,X ) ! − ∂F ( p − p r ) I ∂x ( q − p r ) T ( A t,X )= ∂F ( p r ) r ∂x ( q ) T ( A t,X ) − ∂F ( p r ) r ∂x ( q ) J ( A t,X ) ∂F ( p ) I ∂x ( q ) J ( A t,X ) ! − ∂F ( p ) I ∂x ( q ) ( A t,X ) (4.16)15or ( t, X ) ∈ ¯ T sub × ¯Ω sub . The right hand side of (4.16) coincides with the Schur complement of ∂F ( p ) I ∂x ( q ) J ( A t,X ) in the following matrix˜ D ( t, X C\J ) := ∂F ( p r ) r ∂x ( q ) J ( A t,X ) ∂F ( p r ) r ∂x ( q ) T ( A t,X ) ∂F ( p ) I ∂x ( q ) J ( A t,X ) ∂F ( p ) I ∂x ( q ) T ( A t,X ) . Thus, we have rank ˜ D ( t, X C\J ) = rank ∂F ( p ) I ∂x ( q ) J ( A t,X ) + rank ∂ ¯ F sub r ∂x ( q − p r ) T ( t, X C\J ) (4.17)for all ( t, X ) ∈ ¯ T sub × ¯Ω sub . Let D be a system Jacobian of F with respect to ( p, q ). Note that˜ D is a matrix obtained from D [ I ∪ { r } , C ] by substituting ϕ ( t, X C\J ) into X J . Hence it holdsrank ˜ D ( t, X C\J ) ≤ rank D [ I ∪ { r } , C ]( t, X ) ≤ rank D [ I ∪ { r } , C ] = m with m = | I | , where thelast equality comes from (C2). In addition, the rank of ∂F ( p ) I ∂x ( q ) J ( A t,X ) is m due to the invertibility.Therefore, the rank of ∂ ¯ F sub r ∂x ( q − p r ) T ( t, X C\J ) is zero from (4.17), which means that ∂ ¯ F sub r ∂x ( q − p r ) T isidentically zero on ¯ T sub × ¯Ω sub . Thus ¯ F sub r does not depend on x ( q j − p r ) j for j ∈ T . Corollary 4.9.
For a DAE (1.1) satisfying (I1)–(I3) , it holds ˆ δ ( ¯ F sub ) ≤ ˆ δ ( F ) − .Proof. Let ( p, q ) be an optimal solution of D( F ) and ( r, I, J ) a triple satisfying the conditions(C1)–(C3). Let ¯ D be the system Jacobian of (4.6) with respect to ( p, q ). By Proposition 2.1, itholds ¯ D [ { r } , C ] = ∂ ¯ F ( p r ) r ∂x ( q ) = ∂ ¯ F r ∂x ( q − p r ) , whereas the right-hand side is identically zero from Lemma 4.8. Thus t-rank ¯ D is less than n , and from Proposition 2.2, ( p, q ) is not an optimal solution of D( ¯ F sub ). This concludes theproof.We conclude this section with the following theorem. Theorem 4.10.
For a DAE (1.1) satisfying (I1)–(I3) , the substitution method outputs a DAE (4.1) satisfying (S1)–(S3) . This section describes another proposed modification method for nonlinear DAEs, which wecall an augmentation method. The input of the augmentation method is a nonlinear DAE (1.1)of size n satisfying the conditions (I1)–(I3), where F : T × Ω → R n is a real analytic functionagain. Instead of solving equations symbolically, the augmentation method augments the sizeof the DAE by introducing a new variable vector y and attaching new equations. Formally, theaugmentation method modifies (1.1) into a DAE¯ F aug ( t, x, ˙ x, . . . , x ( l + κ ) , y ) = 0 (5.1)of size n + m such that 16A1) ¯ F aug is a real analytic function defined on a nonempty open subset ¯ T aug × ¯Ω aug × Y ⊆ T × Ω ( κ ) × R m with κ ≤ ln and m ≤ n − δ ( ¯ F aug ) ≤ ˆ δ ( F ) − R and C be theequation index set and the variable index set of the input DAE (1.1), respectively. Let ( p, q ) bean optimal solution of D( F ) and D denote the system Jacobian with respect to ( p, q ). We firstfind r ∈ R , I ⊆ R \ { r } and J ⊆ C satisfying the conditions (C1)–(C3) described in Section 4.1.Define κ := max i ∈ I p i − p r , m := | I | , S := R \ ( I ∪ { r } ) and T := C \ J .The following modification step differs from the substitution method. Let I ′ = { i ′ | i ∈ I } and J ′ = { j ′ | j ∈ J } be copies of I and J , respectively. Take a point ( τ, Ξ) arbitrary fromthe domain ¯ T sub × ¯Ω sub ⊆ T × Ω ( κ ) of the resultant DAE ¯ F sub of the substitution method. Weregard Ω ( κ ) as a subset of R C hereafter, where C := C × { , , , . . . , l + κ } . For X ∈ R C and avector y = ( y j ′ ) j ′ ∈ J ′ with index set J ′ , let ψ Ξ ( X, y ) be a vector of R C such that( ψ Ξ ( X, y )) ( j,k ) := y j ′ ( j ∈ J, k = q j − p r ) , Ξ ( j,k ) ( j ∈ T, k = q j − p r ) ,X ( j,k ) (otherwise)for ( j, k ) ∈ C . For each i ∈ I , we define a function¯ F aug i ′ ( t, x, ˙ x, . . . , x ( l + κ ) , y ) := F ( p i − p r ) i ( t, ψ Ξ ( X, y )) , where X = ( x, ˙ x, . . . , x ( l + κ ) ). Namely, ¯ F aug i ′ is obtained by replacing x ( q j − p r ) j in F ( p i − p r ) i with avariable y j ′ for j ∈ J and with a constant Ξ ( j,q j − p r ) for j ∈ T . Put ¯ F aug I ′ := (cid:16) ¯ F aug i ′ (cid:17) i ′ ∈ I ′ . Wealso define ¯ F aug r ( t, x, ˙ x, . . . , x ( l + κ ) , y ) := F r ( t, ψ Ξ ( X, y ))in the same way.The output (5.1) of the augmentation method is the following DAE ¯ F aug r ( t, x, ˙ x, . . . , , x ( l + κ ) , y ) = 0 ,F I ( t, x, ˙ x, . . . , x ( l ) ) = 0 ,F S ( t, x, ˙ x, . . . , x ( l ) ) = 0 , ¯ F aug I ′ ( t, x, ˙ x, . . . , x ( l + κ ) , y ) = 0 (5.2)with unknown function ( x ( t ) , y ( t )) of t . The domain ¯ T aug × ¯Ω aug of (5.2) is given by ¯ T aug := ¯ T sub and ¯Ω aug := n ( X, y ) ∈ ¯Ω sub × R m (cid:12)(cid:12)(cid:12) ψ Ξ ( X, y ) ∈ ¯Ω sub o .The DAE (5.2) is obtained by copying some equations (or their derivatives), relabellingvariables and substituting constants. Hence if the original DAE contains only a few variablesin each equation, so does (5.2). Thus the augmentation method retains the sparsity of DAEs.17 .2 Proofs Validity proofs of the augmentation method are given in this section. We first show (A1).
Lemma 5.1.
For a DAE (1.1) satisfying (I1)–(I3) , the resulting DAE ¯ F aug = 0 satisfies (A1) .Proof. It is clear that ¯ F aug is real analytic from its construction, which is a combination ofvariable relabelling and partial substitution of constants on F . Let η = ( η j ′ ) j ′ ∈ J ′ be a vectordefined by η j ′ = Ξ ( j,q j − p r ) for j ′ ∈ J ′ . Then it holds (Ξ , η ) ∈ ¯Ω aug from ψ Ξ (Ξ , η ) = Ξ ∈ ¯Ω sub .Hence ¯Ω aug is nonempty. In addition, since ψ Ξ is a continuous map and ¯Ω sub is an open set,¯Ω aug is also open. Therefore the domain ¯ T aug × ¯Ω aug of ¯ F aug is a nonempty open set.The bounds on κ and m are given by Lemma 4.5 and m = | I | ≤ n − Lemma 5.2.
Consider a DAE (1.1) satisfying (I1)–(I3) . Let x : ¯ T aug → R n be a sufficientlysmooth trajectory satisfying the initial value condition (1.2) for ( t ∗ , X ∗ ) ∈ ¯ T aug × ¯Ω aug . Thenthere exists an open subinterval I ⊆ ¯ T aug containing t ∗ such that the following two statementsare equivalent: (1) x is a solution of (1.1) on I , and (2) there uniquely exists a trajectory y : I → R m such that ( x, y ) is a solution of (5.2) on I .Proof. From the argument on the substitution method, the last equation ¯ F aug I ′ ( t, x, ˙ x, . . . , x ( l + κ ) , y ) =0 in (5.2) can be solved for y on the domain of ¯ F aug as y = ¯ ϕ aug ( t, x, ˙ x, . . . , x ( l + κ ) ) , where ¯ ϕ aug is a function obtained by replacing x ( q j − p r ) j of ϕ in (4.5) with the constant Ξ ( j,q j − p r ) for j ∈ T . Therefore, (5.2) is equivalent to ¯ F aug r ( t, x, ˙ x, . . . , , x ( l + κ ) , ¯ ϕ aug ( t, x, ˙ x, . . . , x ( l + κ ) )) = 0 ,F I ( t, x, ˙ x, . . . , x ( l ) ) = 0 ,F S ( t, x, ˙ x, . . . , x ( l ) ) = 0 ,y = ¯ ϕ aug ( t, x, ˙ x, . . . , x ( l + κ ) ) . (5.3)It can be seen from (4.14) that the left-hand side of the first equation in (5.3) is a functionobtained by replacing x ( q j − p r ) j of ¯ F sub r with the constant Ξ ( j,q j − p r ) for j ∈ T . On the other hand,¯ F sub r does not depend on x ( q j − p r ) j for all j ∈ T from Lemma 4.8. Therefore, the first equationin (5.3) is equivalent to ¯ F sub r ( t, x, ˙ x, . . . , , x ( l + κ ) ) = 0. Thus the system (5.3) is equivalent to ( ¯ F sub ( t, x, ˙ x, . . . , , x ( l + κ ) ) = 0 ,y = ¯ ϕ aug ( t, x, ˙ x, . . . , x ( l + κ ) ) . (5.4)The statement of this lemma is shown by (5.4) together with Lemma 4.6.Let ¯ R := R ∪ I ′ and ¯ C := C ∪ J ′ . We finally show (A3) as a corollary of the following lemma. Lemma 5.3.
Consider a DAE (1.1) satisfying (I1)–(I3) and let ( p, q ) be a dual optimal solution.Define ¯ p i := ( p i ( i ∈ R ) ,p r ( i ∈ I ′ ) , ¯ q j := ( q j ( j ∈ C ) ,p r ( j ∈ J ′ ) (5.5) for i ∈ ¯ R and j ∈ ¯ C . Then (¯ p, ¯ q ) is feasible but not optimal on D( ¯ F aug ) . ISI ′ J T J ′ O OO O OO
Figure 1: The zero/nonzero pattern of the system Jacobian ¯ D of ¯ F aug . The hatched region maycontain nonzero elements. Proof.
We first prove σ (cid:16) ¯ F aug i ′ , x j (cid:17) < q j − p r (5.6)for i ′ ∈ I ′ ∪ { r } and j ∈ C . Since x ( q j − p r ) j in ¯ F aug i ′ has been replaced with a dummy variableor a constant, it holds σ (cid:16) ¯ F aug i ′ , x j (cid:17) < σ (cid:16) F ( p i − p r ) i , x j (cid:17) , where i = r if i ′ = r here. In addition, σ (cid:16) F ( p i − p r ) i , x j (cid:17) = σ ( F i , x j ) + p i − p r ≤ q j − p r holds, where the first equality comes fromProposition 2.1 and the second inequality is due to the feasibility of ( p, q ) on D( F ). Thus (5.6)is true.We next show the feasibility of (¯ p, ¯ q ) on D( ¯ F aug ). For i ∈ R \ { r } and j ∈ C , it holds σ (cid:16) ¯ F aug i , x j (cid:17) = σ ( F i , x j ) ≤ q j − p i = ¯ q j − ¯ p i from the feasibility of ( p, q ) on D( F ). For i ∈ R \ { r } and j ′ ∈ J ′ , we have σ (cid:16) ¯ F aug i , y j ′ (cid:17) = σ (cid:0) F i , y j ′ (cid:1) = −∞ ≤ ¯ q j − ¯ p i . For i ′ ∈ I ′ ∪ { r } and j ∈ C , itholds σ (cid:16) ¯ F aug i ′ , x j (cid:17) < q j − p r = ¯ q j − ¯ p i ′ from (5.6). In the last case with i ′ ∈ I ′ ∪ { r } and j ′ ∈ J ′ ,we have σ (cid:16) ¯ F aug i ′ , y j ′ (cid:17) = 0 = p r − p r = ¯ p i ′ − ¯ q j ′ . Thus (¯ p, ¯ q ) is feasible on D( ¯ F aug ).Finally, we show the non-optimality of (¯ p, ¯ q ) on D( ¯ F aug ). From Proposition 2.2, it sufficesto show t-rank ¯ D < n + m , where ¯ D is the system Jacobian of (5.2) with respect to (¯ p, ¯ q ).Here, ¯ D i ′ ,j is identically zero for i ′ ∈ I ′ ∪ { r } and j ∈ C due to (5.6). Figure 1 shows thezero/nonzero pattern of ¯ D , where ¯ D [ I, J ′ ] = O and ¯ D [ S, J ′ ] = O can also be checked from thedefinition of ¯ F aug . Therefore, I ∪ S ∪ J ′ is a vertex cover in the corresponding bipartite graph G ∗ ( D ) = ( ¯ R ∪ ¯ C, E ∗ ( D )) with edge set (2.3). By the König–Egeváry theorem, we havet-rank ¯ D ≤ (cid:12)(cid:12) I ∪ S ∪ J ′ (cid:12)(cid:12) = m + ( n − m −
1) + m = n + m − , which completes the proof. Corollary 5.4.
For a DAE (1.1) satisfying (I1)–(I3) , the resulting DAE ¯ F aug = 0 satisfies(A3).Proof. Let ( p, q ) be a dual optimal solution on D( F ) and (¯ p, ¯ q ) defined by (5.5). From Lemma 5.3,it holds thatˆ δ ( ¯ F aug ) < X j ∈ ¯ C ¯ q j − X i ∈ ¯ R ¯ p i = mp r + X j ∈ C q j − mp r + X i ∈ R p i ! = X j ∈ C q j − X i ∈ R p i = ˆ δ ( F )as required. 19he above lemmas are summed up in the following theorem. Theorem 5.5.
For a DAE (1.1) satisfying (I1)–(I3) , the augmentation method returns aDAE (5.1) satisfying (A1)–(A3) . In this section, we demonstrate our methods using two examples.
We first apply our modification methods to the index-1 DAE (3.3). Neither the LC-method northe ES-method works for (3.3).In Phase 1 of the combinatorial relaxation, we find a dual optimal solution p = (0 ,
0) and q = (1 , p, q ) is D = ˙ x ˙ x x ˙ x x ˙ x ! , which is identically singular. In Phase 3, we find r = 2 , I = { } and J = { } as follows: D = J z }| {(cid:18) (cid:19) ˙ x ˙ x I x ˙ x x ˙ x r . We first demonstrate the substitution method applied to (3.3). By solving F = 0 in (3.3)for ˙ x , we get ˙ x = − t ˙ x (6.1)unless ˙ x = 0. Then (6.1) is substituted into the second equation in (3.3). The resulting DAEof the substitution method is ( F : ˙ x ˙ x − t = 0 , ¯ F sub2 : x + x − t − . (6.2)The system Jacobian D ′ of (6.2) corresponding to a dual optimal solution p ′ = (0 , , q ′ = (1 , D ′ = ˙ x ˙ x ! , which is not identically singular. Thus we are done.Next, we show the modification of (3.3) by the augmentation method. Let y ′ be a newvariable corresponding to ˙ x and ξ ∈ R an arbitrary nonzero constant corresponding to ˙ x . Theaugmentation method modifies the DAE (3.3) into the following DAE F : ˙ x ˙ x − t = 0 , ¯ F aug2 : y ′ ξ + x + x − t − t − , ¯ F aug1 ′ : y ′ ξ − t = 020ith unknown function ( x , x , y ′ ). A pair of p ′ = (0 , ,
1) and q ′ = (1 , ,
1) is a new dualoptimal solution, and the system Jacobian D ′ is D ′ = ˙ x ˙ x
01 1 2 y ′ ξ ξ . Since D ′ is not identically singular, the method terminates at this point. Next we consider a transistor amplifier problem arising from an electrical network [13]. Theproblem is described by an index-1 DAE in the following form F : C ( ˙ x − ˙ x ) + ( x − U e ( t )) /R = 0 ,F : − C ( ˙ x − ˙ x ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 ,F : C ˙ x + x /R − g ( x − x ) = 0 ,F : C ( ˙ x − ˙ x ) + ( x − U b ) /R + αg ( x − x ) = 0 ,F : − C ( ˙ x − ˙ x ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 ,F : C ˙ x + x /R − g ( x − x ) = 0 ,F : C ( ˙ x − ˙ x ) + ( x − U b ) /R + αg ( x − x ) = 0 ,F : − C ( ˙ x − ˙ x ) + x /R = 0 , (6.3)where g ( x ) = β (exp( x/U F ) −
1) and U e ( t ) = 0 . πt ) with nonzero parameters U b , U F , α , β , R , R , . . . , R , and C , . . . , C .A dual optimal solution on (6.3) is given by p = (0 , . . . ,
0) and q = (1 , . . . , ∈ Z . Thesystem Jacobian corresponding ( p, q ) is a singular constant matrix D = C − C − C C C C − C − C C C C − C − C C . One possible selection of ( r, I, J ) is r = 1 , I = { } and J = { } .On the substitution method, we solve F = 0 for ˙ x to get˙ x = ˙ x + ( − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x )) /C (6.4)and substitute (6.4) into F = 0. Then the first equation is modified into¯ F sub1 : − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) + ( x − U e ( t )) /R = 0and the dual optimal solution is updated to p ′ = (1 , , , , , , ,
1) and q ′ = q . The substitutionmethod modifies the DAE twice more in the same manner for ( r, I, J ) = (4 , { } , { } ) and217 , { } , { } ), and outputs the following DAE ¯ F sub1 : − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) + ( x − U e ( t )) /R = 0 ,F : − C ( ˙ x − ˙ x ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 ,F : C ˙ x + x /R − g ( x − x ) = 0 , ¯ F sub4 : − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x )+ ( x − U b ) /R + αg ( x − x ) = 0 ,F : − C ( ˙ x − ˙ x ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 ,F : C ˙ x + x /R − g ( x − x ) = 0 , ¯ F sub7 : x /R + ( x − U b ) /R + αg ( x − x ) = 0 ,F : − C ( ˙ x − ˙ x ) + x /R = 0 , (6.5)which has a nonsingular system Jacobian.The augmentation method also modifies the DAE (6.3) three times for ( r, I, J ) = (1 , { } , { } ),(4 , { } , { } ) and (7 , { } , { } ). Due to limitations of space, we just describe the resulting DAEin the following: ¯ F aug1 : C ( y − ξ ) + ( x − U e ( t )) /R = 0 ,F : − C ( ˙ x − ˙ x ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 , ¯ F aug2 ′ : − C ( y − ξ ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 ,F : C ˙ x + x /R − g ( x − x ) = 0 , ¯ F aug4 : C ( y − ξ ) + ( x − U b ) /R + αg ( x − x ) = 0 ,F : − C ( ˙ x − ˙ x ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 , ¯ F aug5 ′ : − C ( y − ξ ) − U b /R + x (1 /R + 1 /R ) − ( α − g ( x − x ) = 0 ,F : C ˙ x + x /R − g ( x − x ) = 0 , ¯ F aug7 : C ( y − ξ ) + ( x − U b ) /R + αg ( x − x ) = 0 ,F : − C ( ˙ x − ˙ x ) + x /R = 0 , ¯ F aug8 ′ : − C ( y − ξ ) + x /R = 0 , where y , y and y are new variables corresponding to ˙ x , ˙ x , and ˙ x , respectively, and ξ , ξ and ξ are arbitrary constants corresponding to ˙ x , ˙ x and ˙ x , respectively.Indeed, the LC-method can also modify the DAE (6.3) into (6.5). In general, the substitutionmethod and the LC-method return the same DAE under some reasonable restrictions; see theappendix for details. We applied our library in practice to the following four DAEs. The DAEs have identicallysingular system Jacobian, and thus the MS-method, which is the index reduction method usedin MATLAB, cannot be applied to them.(a) Nonlinearly modified pendulum (index-3): ˙ x − x x cos x = 0 , ˙ x − x cos x sin x + g = 0 ,x + x sin x − , tanh( ˙ x − x ) = 0 , ˙ x sin x + x ˙ x cos x − x = 022ith parameter g = 9 .
8. This DAE is obtained by nonlinearly changing the variable( y, z, λ, v y , v z ) of a simple pendulum DAE ˙ v y − yλ = 0 , ˙ v z − zλ + g = 0 ,y + z − , ˙ y − v y = 0 , ˙ z − v z = 0by ( y, z, λ, v y , v z ) = ( x , x sin x , x cos x , x , x ). In addition, we equivalently changedthe fourth equation ˙ x − x = 0 to tanh( ˙ x − x ) = 0.(b) Robotic arm (index-5): ¨ x − c ( x )( ˙ x + ˙ x ) − ˙ x d ( x ) + ( x − x )( a ( x ) + 2 b ( x )) − a ( x )( x − x ) = 0 , ¨ x + 2 c ( x )( ˙ x + ˙ x ) + ˙ x d ( x ) + ( x − x )(1 − a ( x ) − b ( x ))+ a ( x )( x − x ) + x = 0 , ¨ x + 2 c ( x )( ˙ x + ˙ x ) + ˙ x d ( x ) + ( x − x )( a ( x ) − b ( x ))+ 2 ˙ x c ( x ) + d ( x )( ˙ x + ˙ x ) + ( a ( x ) + b ( x ))( x − x ) = 0 , cos x + cos( x + x ) − p ( t ) = 0 , sin x + sin( x + x ) − p ( t ) = 0 , where p ( t ) = cos(1 − e t ) + cos(1 − t ) , p ( t ) = sin(1 − e t ) + sin(1 − t ) ,a ( s ) = 22 − cos s , b ( s ) = cos s − cos s , c ( s ) = sin s − cos s , d ( s ) = sin s cos s − cos s . The robotic arm DAE arises from the path control of a two-link, flexible joint andplanar robotic arm [3]. The above formulation is a slightly modified version given inthe preliminary paper of [23] available on arXiv.(c) Transistor amplifier (index-1): the DAE (6.3). The values of parameters are shown inthe appendix. 23d) Ring modulator (index-2): ˙ x + ( x /R − x + 0 . x − . x − x ) /C = 0 , ˙ x + ( x /R − x + 0 . x − . x − x ) /C = 0 ,x − q ( U D ) + q ( U D ) = 0 ,x − q ( U D ) + q ( U D ) = 0 ,x + q ( U D ) − q ( U D ) = 0 ,x + q ( U D ) − q ( U D ) = 0 , ˙ x + ( x /R p − q ( U D ) − q ( U D ) + q ( U D ) + q ( U D )) /C p = 0 , ˙ x + x /L h = 0 , ˙ x + x /L h = 0 , ˙ x + ( − . x + x + R g x ) /L s = 0 , ˙ x + (0 . x − x + R g x ) /L s = 0 , ˙ x + ( − . x + x + R g x ) /L s = 0 , ˙ x + (0 . x − x + R g x ) /L s = 0 , ˙ x + ( x + ( R g + R i ) x − U in1 ( t )) /L s = 0 , ˙ x + ( x + ( R c + R g ) x ) /L s = 0 , where U D = x − x − x − U in2 ( t ) , U D = − x + x − x − U in2 ( t ) ,U D = x + x + x + U in2 ( t ) , U D = − x − x + x + U in2 ( t ) ,q ( U ) = γ (cid:16) e δU − (cid:17) , U in1 ( t ) = 0 . πt, U in2 ( t ) = 2 sin 20000 πt with parameters C, C p , L h , L s , L s , L s , γ, δ, R, R p , R g , R g , R g , R i and R c . Their nu-merical values are given in the appendix. The DAE represents an electrical networkdescribing the behaviour of a ring modulator [13]. The above formulation is obtainedby setting C s = 0 in the original problem.We computed numerical solutions of DAEs (a)–(d) after performing the following three kindsof preprocessing: (i) no index reduction, (ii) reduce the index by the MS-method after applyingthe substitution method, and (iii) reduce the index after using the augmentation method; whatthe default DAE solver in MATLAB can do for the DAEs is only (i). For the rank computationof system Jacobian and the process of finding ( r, I, J ), we adopted the fast symbolic Gaussianelimination algorithm by Sasaki–Murao [19]. The numerical solutions were computed by ode15i in MATLAB, which is a variable-step variable-order (VSVO) solver for index-0 or 1 DAEs basedon the backward differentiation formulas (BDFs) [21]. The parameters of ode15i were set to thedefault values: AbsTol = 10 − and RelTol = 10 − . We explicitly provided Jacobian matricesto ode15i through the Jacobian option. All the computation were performed on MATLABR2019a.
Numerical Results.
Figure 2 shows the numerical solutions obtained in the experiments. Without index reduction, ode15i could not yield a numerical solution of the index-3 DAE (a) or the index-5 DAE (b). As for the index-2 DAE (d), ode15i first output a numerical solution but stopped partway. These results indicate that it is difficult for ode15i to solve high-index DAEs stably.

Figure 2: Numerical solutions of the experiments, arranged in a 4 × 3 grid: the rows correspond to the DAEs (a)–(d) (modified pendulum, robotic arm, transistor amplifier and ring modulator), and the columns (i)–(iii) to no modification, the substitution method and the augmentation method, respectively.

Both the substitution method and the augmentation method successfully modified all the DAEs (a)–(d). With the preprocessing by the substitution method, ode15i stopped in the middle of the integration at points where the submatrix D[I, J] became nearly singular, even though D[I, J] is identically nonsingular. Nevertheless, by re-modifying the DAE at such a point, our program could go on with the computation. This process, called dynamic pivoting, can be applied by re-choosing (r, I, J) such that D[I, J] is far from singular at the present point. Dynamic pivoting is also known to be needed in the MS-method [12].
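Schematically, the re-selection step can be organized as follows; a simplified MATLAB sketch with our own naming (evalD is a hypothetical handle that evaluates the system Jacobian numerically, and cands is a precomputed list of triples satisfying (C1)–(C3); neither is part of the paper's library):

    % Schematic dynamic pivoting (our simplification): among candidate
    % triples (r, I, J), re-choose the one whose submatrix D[I, J] is
    % farthest from singular at the current point (t, x).
    function best = choosePivot(cands, evalD, t, x)
        D = evalD(t, x);                 % numeric system Jacobian at (t, x)
        best = cands{1};  bestScore = -Inf;
        for k = 1:numel(cands)
            c = cands{k};                % struct with fields r, I, J
            s = rcond(D(c.I, c.J));      % 0 = singular, ~1 = well-conditioned
            if s > bestScore
                bestScore = s;  best = c;
            end
        end
    end

Here rcond serves merely as an inexpensive proxy for the distance of D[I, J] from singularity; any condition estimate would do.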
In the first application of the substitution method to the DAE (d), we used the following values of (p, q, r, I, J):

  p = (0, …),  q = (1, …),  r = 11,  I = {…},  J = {…}.   (7.1)

There exist other possible values of (p, q, r, I, J), as follows:

  p = (0, …),  q = (1, …),  r = 5,  I = {…},  J = {…}.   (7.2)

The values in (7.2) seem to be superior to (7.1) in that the substitution method needs to differentiate equations for (7.1) but not for (7.2), and the size |I| of the equation system (4.4) for (7.2) is smaller than that for (7.1). However, the equation-solving routine in MATLAB could not return a solution of the equation system (4.4) for (7.2). This is because the substitution method with (7.2) needs to solve the equation system

  x_10 − γe^{δ(x_3 − x_5 − x_7 − U_in2(t))} + γe^{δ(−x_3 − x_6 + x_7 + U_in2(t))} = 0,
  x_11 − γe^{δ(−x_4 + x_6 − x_7 − U_in2(t))} + γe^{δ(x_4 + x_5 + x_7 + U_in2(t))} = 0,
  x_13 + γe^{δ(−x_4 + x_6 − x_7 − U_in2(t))} − γe^{δ(−x_3 − x_6 + x_7 + U_in2(t))} = 0

for three of the variables x_3, …, x_7, while the solution cannot be represented by a combination of elementary functions. On the other hand, the equation system (4.4) for (7.1) is linear because the derivatives of x appear linearly in the differentiations of the 3rd–6th equations in (d). Thus the substitution method works with (7.1) but not with (7.2). As we have seen, the substitution method may fail to run depending on the choice of (p, q, r, I, J). The experimental results show that the augmentation method successfully serves as a remedy for this issue.

8 Conclusion

In this paper, we have presented two modification methods for nonlinear DAEs, called the substitution method and the augmentation method. Based on the combinatorial relaxation approach, both methods modify DAEs into other DAEs for which the structural preprocessing methods work. The substitution method modifies DAEs using the implicit function theorem (IFT) and has the merit that it retains the size of DAEs. The augmentation method modifies DAEs by appending new variables and equations, and is advantageous in that it does not require an equation-solving engine and keeps the sparsity of DAEs. The numerical experiments have shown that both methods can modify DAEs that the standard DAE solver in MATLAB cannot handle into amenable forms. While the success of the substitution method depends on the selection of the values of (p, q, r, I, J), the augmentation method has successfully served as a remedy for this problem. Numerical experiments on sparse DAEs are left for further investigation.
Acknowledgments

The author thanks Satoru Iwata for careful reading and helpful comments, and Mizuyo Takamatsu for discussions. This work was supported by JST CREST Grant Number JPMJCR14D2, Japan, by Grant-in-Aid for JSPS Research Fellow Grant Number JP18J22141, Japan, and by JST ACT-I Grant Number JPMJPR18U9, Japan.
References

[1] K. E. Brenan, S. L. Campbell, and L. R. Petzold. Numerical Solution of Initial-Value Problems in Differential-Algebraic Equations. SIAM, Philadelphia, 1996.
[2] S. L. Campbell and C. W. Gear. The index of general nonlinear DAEs. Numerische Mathematik, 72(2):173–196, 1995.
[3] S. L. Campbell and E. Griepentrog. Solvability of general differential algebraic equations. SIAM Journal on Scientific Computing, 16(2):257–270, 1995.
[4] C. W. Gear. Differential-algebraic equation index transformations. SIAM Journal on Scientific and Statistical Computing, 9(1):39–47, 1988.
[5] A. Griewank. On automatic differentiation. In M. Iri and K. Tanabe, editors, Mathematical Programming: Recent Developments and Applications, pages 83–108, Dordrecht, 1989. Kluwer Academic Publishers.
[6] E. Hairer and G. Wanner. Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer Series in Computational Mathematics. Springer-Verlag, Berlin, Heidelberg, 1996.
[7] S. Iwata. Computing the maximum degree of minors in matrix pencils via combinatorial relaxation. Algorithmica, 36(4):331–341, 2003.
[8] S. Iwata, T. Oki, and M. Takamatsu. Index reduction for differential-algebraic equations with mixed matrices. In Proceedings of the 8th SIAM Workshop on Combinatorial Scientific Computing (CSC'18), pages 45–55, Bergen, Norway, 2018.
[9] S. Iwata and M. Takamatsu. Index reduction via unimodular transformations. SIAM Journal on Matrix Analysis and Applications, 39(3):1135–1151, 2018.
[10] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.
[11] Maplesoft, a division of Waterloo Maple, Inc. Overview of the Rif Command Set Version 1.1 - Maple Programming Help. URL: . (accessed April 11, 2019).
[12] S. E. Mattsson and G. Söderlind. Index reduction in differential-algebraic equations using dummy derivatives. SIAM Journal on Scientific Computing, 14(3):677–692, 1993.
[13] F. Mazzia and C. Magherini. Test set for initial value problem solvers. Technical report, Department of Mathematics, University of Bari, 2008. URL: https://archimede.dm.uniba.it/~testset/.
[14] K. Murota. Computing the degree of determinants via combinatorial relaxation. SIAM Journal on Computing, 24(4):765–796, 1995.
[15] C. C. Pantelides. The consistent initialization of differential-algebraic systems. SIAM Journal on Scientific and Statistical Computing, 9(2):213–231, 1988.
[16] J. D. Pryce. A simple structural analysis method for DAEs. BIT Numerical Mathematics, 41(2):364–394, 2001.
[17] X. Qin, L. Yang, Y. Feng, B. Bachmann, and P. Fritzson. Index reduction of differential algebraic equations by differential Dixon resultant. Applied Mathematics and Computation, 328:189–202, 2018.
[18] G. J. Reid, P. Lin, and A. D. Wittkopf. Differential elimination-completion algorithms for DAE and PDAE. Studies in Applied Mathematics, 106(1):1–45, 2001.
[19] T. Sasaki and H. Murao. Efficient Gaussian elimination method for symbolic determinants and linear systems. ACM Transactions on Mathematical Software, 8(3):277–289, 1982.
[20] L. Scholz. The signature method for DAEs arising in the modeling of electrical circuits. Journal of Computational and Applied Mathematics, 332:107–139, 2018.
[21] L. F. Shampine. Solving 0 = F(t, y(t), y'(t)) in MATLAB. Journal of Numerical Mathematics, 10(4):291–310, 2002.
[22] M. Takamatsu and S. Iwata. Index reduction for differential-algebraic equations by substitution method. Linear Algebra and Its Applications, 429(8-9):2268–2277, 2008.
[23] G. Tan, N. S. Nedialkov, and J. D. Pryce. Conversion methods for improving structural analysis of differential-algebraic equation systems. BIT Numerical Mathematics, 57(3):845–865, 2017. arXiv: .
[24] X. Wu, Y. Zeng, and J. Cao. The application of the combinatorial relaxation theory on the structural index reduction of DAE. In Proceedings of the 12th International Symposium on Distributed Computing and Applications to Business, Engineering & Science, pages 162–166, London, UK, 2013. IEEE.
A Appendix
A.1 Relation between Substitution Method and LC-method
We discuss the relation between the substitution method and the LC-method described in Section 3.2. One may conjecture that the sets of possible outputs of these two methods are the same whenever a DAE can be modified by both methods. However, this is false because the cokernel of the system Jacobian contains infinitely many vectors and the LC-method returns different DAEs for different cokernel vectors, while the number of possible triples (r, I, J) in the substitution method is finite. Indeed, the conjecture becomes true if we restrict the selection of the cokernel vector as follows.
Theorem A.1.
Consider a DAE (1.1) satisfying (I1)–(I3). Let (r, I, J) be a triple satisfying (C1)–(C3) and let D denote the system Jacobian of F with respect to a dual optimal solution (p, q). Define a row vector u = (u_i(t, x, ẋ, …, x^{(l)}))_{i∈R} by

  u_i := 1                                    (i = r),
         −D[{r}, J] ((D[I, J])^{−1}[J, {i}])   (i ∈ I),
         0                                    (otherwise)

for i ∈ R, where (D[I, J])^{−1}[J, {i}] is the i-th column vector of the inverse of D[I, J]. Then u is a cokernel vector of D. In addition, if u satisfies the validity condition (3.2) of the LC-method, then the output F̄^sub of the substitution method is the same as the output F̄^LC of the LC-method with respect to u. If u does not satisfy (3.2), neither does any cokernel vector v of D with supp v = I ∪ {r}.

Proof. We first show that u is a cokernel vector of D. From the definition of u, it holds that

  uD[R, J] = D[{r}, J] − D[{r}, J](D[I, J])^{−1}D[I, J] = D[{r}, J] − D[{r}, J] = 0   (A.1)

and

  uD[R, T] = D[{r}, T] − D[{r}, J](D[I, J])^{−1}D[I, T],   (A.2)

where T := C \ J. Here, the right-hand side of (A.2) is the Schur complement of D[I, J] in the block matrix

  D[I ∪ {r}, C] = ( D[{r}, J]  D[{r}, T]
                    D[I, J]    D[I, T]   ).

Since rank D[I ∪ {r}, C] = rank D[I, J] by the conditions (C1) and (C2), and the rank of the Schur complement equals rank D[I ∪ {r}, C] − rank D[I, J], the rank of uD[R, T] must be zero, which means uD[R, T] = 0. Thus u is a cokernel vector of D by this and (A.1).

Next suppose that u satisfies the validity condition (3.2) of the LC-method. Let u_I := (u_i)_{i∈I}. Then the modified function F̄_r^LC in (3.1) can be expressed as F̄_r^LC = F_r + u_I F_I^{(p_I − p_r)}, where we used u_r = 1 in the present setting. Let C and J be the index sets defined in (4.10) and (4.12), respectively. In order to show F̄_r^sub = F̄_r^LC, we prove

  ∂F̄_r^sub/∂x_j^{(k)} (t, X_{C\J}) = ∂F̄_r^LC/∂x_j^{(k)} (t, X)   (A.3)

for all j ∈ C, all nonnegative integers k and all points (t, X) in the domain T̄^sub × Ω̄^sub of F̄^sub. If (A.3) holds, then the difference between F̄_r^sub and F̄_r^LC is a function depending only on t. Now, since F̄^sub = 0 and F̄^LC = 0 share the same solution set, the difference must be identically zero and we are done.

We show (A.3). Denote (t, X_{C\J}, φ(t, X_{C\J})) by A_{t,X} for short, where φ is the explicit function given by (4.5). Then, by the same calculation as (4.16), it holds that

  ∂F̄_r^sub/∂x_j^{(k)} (t, X_{C\J}) = ∂F_r/∂x_j^{(k)} (A_{t,X}) − ∂F_r^{(p_r)}/∂x_J^{(q)} (A_{t,X}) (∂F_I^{(p)}/∂x_J^{(q)} (A_{t,X}))^{−1} ∂F_I^{(p − p_r)}/∂x_j^{(k)} (A_{t,X}).

On the other hand, the right-hand side of (A.3) is calculated as

  ∂F̄_r^LC/∂x_j^{(k)} (t, X) = ∂/∂x_j^{(k)} (F_r + u_I F_I^{(p − p_r)}) |_{(t,X)}
    = ∂F_r/∂x_j^{(k)} (t, X) + ∂u_I/∂x_j^{(k)} (t, X) F_I^{(p − p_r)}(t, X) + u_I(t, X) ∂F_I^{(p − p_r)}/∂x_j^{(k)} (t, X)

for all (t, X). Now, since F̄_r^LC does not depend on X_J = (x_j^{(q_j)})_{j∈J} by [23, Lemma 4.2], both sides of the above equation also do not depend on X_J. By replacing X_J in X with φ(t, X_{C\J}), we have

  ∂F̄_r^LC/∂x_j^{(k)} (t, X) = ∂F_r/∂x_j^{(k)} (A_{t,X}) + ∂u_I/∂x_j^{(k)} (A_{t,X}) F_I^{(p − p_r)}(A_{t,X}) + u_I(A_{t,X}) ∂F_I^{(p − p_r)}/∂x_j^{(k)} (A_{t,X})
    = ∂F_r/∂x_j^{(k)} (A_{t,X}) + u_I(A_{t,X}) ∂F_I^{(p − p_r)}/∂x_j^{(k)} (A_{t,X})
    = ∂F_r/∂x_j^{(k)} (A_{t,X}) − ∂F_r^{(p_r)}/∂x_J^{(q)} (A_{t,X}) (∂F_I^{(p)}/∂x_J^{(q)} (A_{t,X}))^{−1} ∂F_I^{(p − p_r)}/∂x_j^{(k)} (A_{t,X}),

where we used F_I^{(p − p_r)}(A_{t,X}) = 0 in the second equality. Thus (A.3) holds.

Finally, suppose that u does not satisfy (3.2). Let v = (v_i)_{i∈R} be a nonzero cokernel vector of D with supp v = I ∪ {r}. This implies that (v_i)_{i∈I∪{r}} is in the cokernel of D[I ∪ {r}, C]. From rank D[I ∪ {r}, C] = |I|, the dimension of the cokernel of D[I ∪ {r}, C] is one. Thus all cokernel vectors of D having support I ∪ {r} are multiples of u; indeed, it holds that v = v_r u from u_r = 1. Since u does not satisfy (3.2), there exist i ∈ R and j ∈ C such that σ(u_i, x_j) ≥ q_j − p_r. If σ(v_r, x_j) < q_j − p_r, then σ(v_i, x_j) = σ(v_r u_i, x_j) = σ(u_i, x_j) ≥ q_j − p_r, which implies that v does not satisfy the validity condition (3.2). Otherwise, σ(v_r, x_j) ≥ q_j − p_r, which also means that v does not meet (3.2). ∎
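The construction of u in Theorem A.1 is easy to sanity-check numerically; the following toy MATLAB sketch (our construction, not from the paper) enforces rank D[I ∪ {r}, C] = rank D[I, J] = |I|, as in (C1) and (C2), by making row r a combination of the rows indexed by I:

    % Toy numerical check of Theorem A.1 (our construction): build D so
    % that row r is a linear combination of the rows indexed by I, mirror
    % the definition of u, and verify that u*D vanishes.
    rng(0);
    n = 5;  r = 1;  I = [2 3];  J = [2 3];
    D = randn(n);
    D(r, :) = randn(1, 2) * D(I, :);   % force rank D[I u {r}, C] = |I|
    u = zeros(1, n);
    u(r) = 1;
    u(I) = -D(r, J) / D(I, J);         % -D[{r}, J] (D[I, J])^{-1}
    disp(norm(u * D))                   % ~1e-16: u is a cokernel vector of D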
A.2 Parameter and Initial Value Settings

This section describes the parameter and initial value settings used in the numerical experiments in Section 7. Tables 3 and 5 show the parameter settings of the transistor amplifier DAE and the ring modulator DAE, respectively. Tables 1, 2, 4 and 6 show the initial value settings of the nonlinearly modified pendulum DAE, the robotic arm DAE, the transistor amplifier DAE and the ring modulator DAE, respectively.

Table 1: Initial values of the nonlinearly modified pendulum DAE

  j   x_j(0)            ẋ_j(0)
  1   0.5               0
  2   8.5311195044981   0
  3   3.2432815053528   0
  4   0                 −…
  5   −…                …

Table 2: Initial values of the robotic arm DAE

  j   x_j(0)            ẋ_j(0)   ẍ_j(0)
  1   0                 −…       −…
  2   0.9537503511807   −…       −…
  3   −…                −…       …
  4   …                 …        …
  5   …                 …        …

Table 3: Parameters of the transistor amplifier DAE

  U_b   6       R_0                 1000
  U_F   0.026   R_k (k = 1, …, 9)   9000
  α     0.99    C_k (k = 1, …, 5)   k × 10⁻⁶
  β     10⁻⁶

Table 4: Initial values of the transistor amplifier DAE

  j   x_j(0)   ẋ_j(0)             j   x_j(0)   ẋ_j(0)
  1   0        51.3392765171807   5   3        −…
  2   3        …                  6   3        …
  3   3        …                  7   6        …
  4   6        …                  8   0        …

Table 5: Parameters of the ring modulator DAE

  C     1.6 × 10⁻⁸           R      25000
  C_p   10⁻⁸                 R_p    50
  L_h   4.45                 R_g1   36.3
  L_s1  2 × 10⁻³             R_g2   17.3
  L_s2  5 × 10⁻⁴             R_g3   17.3
  L_s3  5 × 10⁻⁴             R_i    50
  γ     40.67286402 × 10⁻⁹   R_c    600
  δ     17.7493332

Table 6: Initial values of the ring modulator DAE

  j   x_j(0)   ẋ_j(0)       j    x_j(0)   ẋ_j(0)
  1   0        0             9    0        0
  2   0        0             10   0        0
  3   0        6.… × 10^…    11   0        0
  4   0        −… × 10^…     12   0        0
  5   0        −… × 10^…     13   0        0
  6   0        6.… × 10^…    14   …        …
  7   …        …             15   …        …
  8   …        …
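For reference, the parameter values of Tables 3 and 5 translate directly into code; a minimal sketch with our own naming:

    % Parameter sets of Tables 3 and 5 as MATLAB structs (our naming;
    % the values are the standard ones of the test set problems [13]).
    ta = struct('Ub', 6, 'UF', 0.026, 'alpha', 0.99, 'beta', 1e-6, ...
                'R0', 1000, 'Rk', 9000);   % R_k = 9000 for k = 1..9
    ta.Ck = (1:5)*1e-6;                    % C_k = k*10^-6 for k = 1..5
    rm = struct('C', 1.6e-8, 'Cp', 1e-8, 'Lh', 4.45, ...
                'Ls1', 2e-3, 'Ls2', 5e-4, 'Ls3', 5e-4, ...
                'gamma', 40.67286402e-9, 'delta', 17.7493332, ...
                'R', 25000, 'Rp', 50, 'Rg1', 36.3, 'Rg2', 17.3, ...
                'Rg3', 17.3, 'Ri', 50, 'Rc', 600);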