Solving the k-sparse Eigenvalue Problem with Reinforcement Learning
Li Zhou, Lihao Yan, Mark A. Caprio, Weiguo Gao, and Chao Yang∗

School of Mathematical Sciences, Fudan University, Shanghai 200433, P.R. China.
Department of Physics, University of Notre Dame, IN 46556, United States.
School of Data Science, Fudan University, Shanghai 200433, P.R. China.
Computational Research Division, Lawrence Berkeley National Laboratory, CA 94720, United States.
Abstract.
We examine the possibility of using a reinforcement learning (RL) algorithm to solve large-scale eigenvalue problems in which the desired eigenvector can be approximated by a sparse vector with at most k nonzero elements, where k is relatively small compared to the dimension of the matrix to be partially diagonalized. This type of problem arises in applications in which the desired eigenvector exhibits localization properties and in large-scale eigenvalue computations in which the amount of computational resources is limited. When the positions of these nonzero elements can be determined, we can obtain a k-sparse approximation to the original problem by computing eigenvalues of a k × k submatrix extracted from k rows and columns of the original matrix. We review a previously developed greedy algorithm for incrementally probing the positions of the nonzero elements in a k-sparse approximate eigenvector and show that the greedy algorithm can be improved by using an RL method to refine the selection of k rows and columns of the original matrix. We describe how to represent states, actions, rewards and policies in an RL algorithm designed to solve the k-sparse eigenvalue problem, and demonstrate the effectiveness of the RL algorithm on two examples originating from quantum many-body physics.

AMS subject classifications: xxxxx, xxxxx
Key words: large-scale eigenvalue problem, quantum many-body problem, eigenvector localization, reinforcement learning, approximate Q-learning, stochastic sampling, high performance computing

∗Corresponding author.
Let A be an n × n sparse symmetric matrix, where n can be very large. We are interested in solving the following problem

\min_{\|x\|_0 \le k} \; \frac{x^T A x}{x^T x}, \qquad (1.1)

where \|\cdot\|_0 denotes the cardinality of a vector, i.e., the number of nonzero elements of the vector. A vector x that satisfies \|x\|_0 = k is called a k-sparse vector. We will refer to (1.1) as a k-sparse eigenvalue problem because the solution to (1.1) is the eigenvector associated with the algebraically smallest eigenvalue of A if the k-sparse constraint \|x\|_0 \le k is not imposed. When the k-sparse constraint is imposed, the solution to (1.1) can be obtained from the eigenvector of a submatrix of A with at most k rows and columns. The k-sparse eigenvalue problem can also be stated more plainly as follows: select at most k rows and columns of A to form a submatrix A_{11} such that the algebraically smallest eigenvalue of A_{11} is the smallest among the smallest eigenvalues of all submatrices of dimension at most k. Note that we may replace the minimum in (1.1) by a maximum if the eigenvalue of interest is the largest among all eigenvalues. This problem is related to the sparse principal component analysis (PCA) problem, in which A is a covariance matrix of the form A = B^T B and minimization is replaced with maximization in (1.1) [10, 16, 17].

If the eigenvector x_* associated with the algebraically smallest eigenvalue of A has at most k nonzero elements, it is the solution of (1.1). The positions of the nonzero elements of the eigenvector specify the rows and columns of A that define A_{11}. If x_* has more than k nonzero elements, it is not entirely clear how one can obtain the solution to (1.1) efficiently, or which rows and columns of A should be extracted to form the A_{11} whose lowest eigenvalue yields the minimum of the objective function in (1.1). As we will show in Section 5, even if we can compute the smallest eigenvalue of A, simply taking the k rows and columns of A corresponding to the k largest components (in magnitude) of the corresponding eigenvector does not necessarily yield the optimal solution to (1.1).

The k-sparse eigenvalue problem is of particular interest when we try to solve a large-scale eigenvalue problem with a limited amount of computational resources. One of the motivations originates from solving a quantum many-body problem

H\Psi = E\Psi, \qquad (1.2)

where H is a many-body Hamiltonian and \Psi is an eigenfunction of H corresponding to the eigenvalue E. The lowest eigenvalue E and its corresponding eigenfunction form the ground state of the many-body Hamiltonian [23, 24, 33].

One way to solve (1.2) is to expand \Psi in terms of a linear combination of a finite number of many-body basis functions known as Slater determinants in some well defined Hilbert space (often referred to as a configuration interaction space), and to solve a projected linear eigenvalue problem in that subspace. To obtain an accurate approximation to the solution of (1.2), the dimension of the configuration space may be prohibitively large. Although significant progress has been made on solving this type of problem on high performance computers using advanced sparse matrix techniques and efficient iterative methods [2–4, 27, 31], the size of the problems that we can currently solve is still limited by the amount of available computational resources.
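Throughout the paper, the basic computational primitive behind (1.1) is: given a candidate index set S of at most k rows and columns, extract the corresponding submatrix and compute its smallest eigenvalue. The sketch below is a minimal illustration of this primitive using standard sparse linear algebra tools; the helper name evaluate_support is ours, not the paper's.

```python
# A minimal sketch (not the authors' code): evaluate the objective in (1.1)
# for a given support set S by computing the smallest eigenvalue of A[S, S].
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def evaluate_support(A, S):
    """Return (theta, x1): smallest eigenvalue/eigenvector of the submatrix A[S, S].

    A : symmetric scipy.sparse matrix (CSR/CSC); S : 1-D array of row/column indices.
    """
    S = np.asarray(S)
    A11 = A[S, :][:, S]                      # extract the |S| x |S| submatrix
    if len(S) < 50:                          # small blocks: a dense solve is simpler
        w, V = np.linalg.eigh(A11.toarray())
        return w[0], V[:, 0]
    theta, x1 = eigsh(A11, k=1, which='SA')  # smallest algebraic eigenvalue
    return theta[0], x1[:, 0]

# Example usage on a random sparse symmetric matrix:
# A = sp.random(1000, 1000, density=1e-3, format='csr'); A = (A + A.T) * 0.5
# theta, x1 = evaluate_support(A, np.arange(100))
```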
However, it is well known that the ground state of many Hamiltonians has localization properties [1, 11]. This means that \Psi can be well represented by a relatively small number of many-body basis functions. If these basis functions are known, we only need to diagonalize a Hamiltonian matrix of relatively small dimension. However, in most cases, the optimal set of basis functions that allows us to give the best estimate of the ground state energy of the many-body Hamiltonian from a subspace (of many-body basis functions) of a fixed dimension (limited by computational resources) is not known. As a result, we need to learn how to identify the desired basis functions and the corresponding rows and columns of A to be extracted as we try to solve the k-sparse eigenvalue problem.

We are interested in efficient methods for finding the solution of (1.1) without computing the eigenvalue and eigenvector of A directly, which can be prohibitively expensive. We would like to identify the locations of the k nonzero components of the eigenvector associated with the smallest eigenvalue of A if the eigenvector is indeed k-sparse (which we may not know a priori). If we know the locations of these components (but not the components themselves), we can then diagonalize a k × k matrix A_{11} to obtain the solution to (1.1). In the case that the eigenvector associated with the smallest eigenvalue of A is not strictly k-sparse, we would like to identify k rows and columns of A that yield a submatrix A_{11} whose eigenvector associated with the smallest eigenvalue minimizes the objective function in (1.1) after it is padded with zeros.

In the quantum chemistry and physics literature, several selected CI methods have been developed to solve (1.2) by selecting important Slater determinant basis functions [12, 36, 38]. These methods have been shown to be competitive with, or sometimes better than, Monte Carlo based methods that sample rows and columns of A stochastically [6, 7, 19].

In [13], we developed a greedy algorithm to incrementally select rows and columns of A to form A_{11} based on either a residual based or a component perturbation based selection criterion. We will review the basic idea of this approach in section 3. However, this greedy algorithm is far from optimal because the rows and columns of A selected in the early stages of the greedy algorithm may not be the best ones for constructing A_{11}. As we learn more about the matrix A through the selection of additional rows and columns, it may become clear that some of the previously selected rows and columns can be replaced by others that contribute more to lowering the smallest eigenvalue of A_{11}.

In this paper, we explore the possibility of using a reinforcement learning (RL) algorithm [29, 32] to improve the sampling of rows and columns of A so that a better approximation to the solution of (1.1) can be obtained. The RL algorithm uses a global expected reward function to guide the selection of rows and columns. Previously selected rows and columns may be removed to make room for new rows and columns that contribute more to lowering the smallest eigenvalue of A_{11}. This global reward function is updated repeatedly to improve the selection policy.
The basic elements of an RL algorithm are reviewed in section 4, where we also show how to formulate the search for an optimal set of rows and columns of A to solve (1.1) as an RL process. We also provide some algorithmic details of the RL algorithm in the context of solving (1.1). In section 5, we demonstrate the effectiveness of the RL approach with two numerical examples.

Before we start to discuss methods for solving the k-sparse eigenvalue problem, it is worth discussing first how to assess the optimality of an approximate solution. The optimal solution of (1.1) is well defined if (1.1) is strictly k-sparse, meaning that the desired eigenvector has at most k nonzero elements. We can reorder the elements of the eigenvector to have all nonzero elements appear in the leading k rows, i.e.,

Px = \begin{bmatrix} x_1 \\ 0 \end{bmatrix},

where x_1 \in \mathbb{R}^k and P is the permutation matrix associated with such a reordering. Consequently, we can reorder the rows and columns of the matrix A so that

(PAP^T)(Px) = \begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ 0 \end{bmatrix} = \lambda \begin{bmatrix} x_1 \\ 0 \end{bmatrix} \qquad (2.1)

holds. To obtain x_1, we just need to solve the eigenvalue problem

A_{11} x_1 = \lambda x_1. \qquad (2.2)

Assuming \lambda is simple, we can verify the optimality of the solution by comparing the solution of (2.2) to that of (2.1) (which may be costly to obtain).

However, when the desired eigenvector is not strictly k-sparse, the matrix A_{11} that yields the best approximate eigenvalue may not be the submatrix obtained by extracting the rows and columns of A associated with the k largest (in magnitude) elements of the desired eigenvector of A. In theory, one can obtain the optimal solution of (1.1) by enumerating all possible combinations of k rows and columns out of the n rows and columns of A, solving each k-dimensional eigenvalue problem, and choosing the optimal solution among the solutions to all k-dimensional eigenvalue problems. In practice, however, this approach is prohibitively expensive even for problems with moderate n and k because \binom{n}{k} can be extremely large.

Let us denote by λ_b the smallest eigenvalue of the A_{11} matrix obtained by extracting the rows and columns of A associated with the k largest (in magnitude) elements of the desired eigenvector of A. We say that an approximate solution to (1.1) is a "good" solution if the smallest eigenvalue θ of the A_{11} matrix selected by the RL algorithm presented in this paper is less than or equal to λ_b. The true optimal solution of (1.1) lies between λ_b and λ, the smallest eigenvalue of A. Clearly, we would like to obtain an approximate solution that is as close as possible to λ. However, how close θ is to λ depends on k and on how localized the eigenvector x_* is.
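Whenever the full eigenvector of A is available, the baseline value λ_b described above can be computed directly and used as a reference, as is done in the experiments of Section 5. The following sketch is our own illustration of this baseline, not code from the paper: it builds the baseline support from the k largest-magnitude eigenvector components and evaluates the corresponding submatrix eigenvalue.

```python
# A hedged sketch of the baseline lambda_b: take the k largest-magnitude components
# of the eigenvector of A, extract the corresponding submatrix, and diagonalize it.
import numpy as np
from scipy.sparse.linalg import eigsh

def baseline_lambda_b(A, k):
    """Baseline of Section 2: smallest eigenvalue of the k x k submatrix defined by
    the k largest (in magnitude) components of the true lowest eigenvector of A."""
    lam, x = eigsh(A, k=1, which='SA')          # eigenpair of the full matrix
    S = np.argsort(-np.abs(x[:, 0]))[:k]        # indices of the k largest |x_i|
    A11 = A[S, :][:, S]
    lam_b, _ = eigsh(A11, k=1, which='SA')
    return lam_b[0], np.sort(S)

# Any greedy or RL selection producing theta <= lambda_b is a "good" solution
# in the sense defined above.
```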
One way to solve the k-sparse eigenvalue problem is to use the greedy algorithm presented in [13]. The basic algorithm can be summarized as follows.

1. We select a subset of the indices 1, 2, ..., n, denoted by S, that corresponds to "important" rows and columns of A. In the configuration interaction method for solving quantum many-body eigenvalue problems, this subset may correspond to a set of many-body basis functions produced from some type of basis truncation scheme [12, 14, 36].

2. Let A_{11} be the submatrix of A that consists of the rows and columns defined by S. Assuming the size of S is small relative to k (and n), we can easily compute the desired eigenpair (λ_1, x_1) of A_{11}, i.e., A_{11} x_1 = λ_1 x_1.

3. We take λ_1 to be the approximation to the smallest eigenvalue of A. The approximation to the eigenvector of A is constructed as x̂ = P^T [x_1^T 0]^T. To assess the accuracy of the computed eigenpair (λ_1, x̂), we compute the full residual r = A x̂ − λ_1 x̂.

4. If the norm of r is sufficiently small, we terminate the computation and return (λ_1, x̂) as the approximate solution. Otherwise, we select some additional rows and columns of A using an appropriate selection criterion to augment A_{11}, and repeat steps 2–4 until the dimension of A_{11} reaches k.

Note that steps 2–4 of the above algorithm do not need to be repeated if k rows and columns of A are selected all at once in step 4. But for large problems in which k can still be relatively large, it is generally more efficient computationally to select a few rows and columns at a time and construct A_{11} incrementally. Such a scheme often yields a better approximation as well, because it may not be clear in advance which k rows and columns of A we should select. We now discuss a few strategies for selecting rows and columns of A incrementally to obtain a k × k submatrix A_{11} that yields a good k-sparse approximation to the desired eigenpair of A.

Without loss of generality, we take S to be the leading n_1 ≪ n rows and columns of A, so that we can partition A as

A = \begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix}. \qquad (3.1)

We now discuss how to select additional "important" rows and columns outside of the subset S to obtain a more accurate approximation of the desired eigenvector of A. Suppose (λ_1, x_1) is the computed eigenpair of the submatrix A_{11} that serves as an approximation to the desired eigenpair (λ, x) of A. By padding x_1 with zeros to form

\hat{x} = \begin{bmatrix} x_1 \\ 0 \end{bmatrix}, \qquad (3.2)

we can assess the accuracy of the approximate eigenvector x̂ in the full space by computing its residual

r = A\hat{x} - \lambda_1 \hat{x} = \begin{bmatrix} 0 \\ A_{21} x_1 \end{bmatrix} \equiv \begin{bmatrix} 0 \\ r' \end{bmatrix}. \qquad (3.3)

The first greedy scheme for improving the accuracy of x_1 is to select some row indices in {1, 2, ..., n} \ S that correspond to the components of r' = A_{21} x_1 with the largest magnitudes. These indices, along with S, yield an augmented A_{11} from which a more accurate approximation to (λ, x) can be obtained.

Another greedy probing algorithm can be developed from a component-wise perturbation analysis, in which one component in the zero block of the x̂ defined by (3.2) is perturbed to make

\tilde{x} = \begin{bmatrix} x_1 \\ \gamma_j e_j \end{bmatrix} \qquad (3.4)

a better approximation to the desired eigenvector. The analysis presented in [13, 15] shows that γ_j can be estimated as

\gamma_j \approx \frac{e_j^T A_{21} x_1}{\lambda_1 - e_j^T A_{22} e_j}. \qquad (3.5)

We choose the indices j ∈ {1, 2, ..., n} \ S that yield large values of |γ_j| to augment S incrementally in the greedy algorithm, ultimately obtaining a k × k matrix A_{11} from which an approximation to the desired eigenpair can be computed.
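To make the two selection criteria above concrete, the sketch below (an illustration under our own naming, not the authors' implementation) performs one augmentation step: it computes r' = A_{21} x_1, optionally converts it into the perturbation estimates γ_j of (3.5), and returns the indices with the largest magnitudes.

```python
# A minimal sketch of one greedy probing step (residual or perturbation criterion).
import numpy as np
from scipy.sparse.linalg import eigsh

def greedy_probe(A, S, n_add, use_perturbation=True):
    """Return n_add new indices to append to the current support S."""
    S = np.asarray(S)
    mask = np.ones(A.shape[0], dtype=bool)
    mask[S] = False
    comp = np.where(mask)[0]                    # indices outside S
    lam1, x1 = eigsh(A[S, :][:, S], k=1, which='SA')
    rprime = A[comp, :][:, S] @ x1[:, 0]        # r' = A_21 x_1
    if use_perturbation:                        # gamma_j of Eq. (3.5)
        diag = A.diagonal()[comp]               # e_j^T A_22 e_j
        scores = np.abs(rprime / (lam1[0] - diag))
    else:                                       # residual criterion
        scores = np.abs(rprime)
    return comp[np.argsort(-scores)[:n_add]]
```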
Both the residual based and the component-wise perturbation based greedy probing algorithms require computing A_{21} x_1. When the dimension of A is large, which is the case we are primarily interested in, this computation can be prohibitively costly, especially if we have to perform it each time A_{11} is augmented [14]. The cost of the computation can be reduced if we exploit the sparsity structure of A_{21}, i.e., we only multiply the nonzero rows of A_{21} with x_1. The identification of these nonzero rows is problem dependent. For quantum many-body problems arising from chemistry, several strategies have been developed to perform this update efficiently [14, 25, 28].

To avoid recomputing A_{21} x_1 from scratch whenever A_{11} is augmented in multiple stages, we use a simple updating scheme described below. We denote the partitioned blocks of A in the m-th stage of the greedy algorithm by A_{11}^{(m)} and A_{21}^{(m)}, respectively. Let us partition the matrix A_{11}^{(m+1)} as

A_{11}^{(m+1)} = \begin{bmatrix} A_{11}^{(m)} & B^T \\ B & C \end{bmatrix}, \qquad (3.6)

where the B and C blocks correspond to the rows and columns newly added in the (m+1)-st stage. The eigenvector of A_{11}^{(m+1)}, denoted by x_1^{(m+1)}, can be partitioned conformally with A_{11}^{(m+1)}, i.e.,

x_1^{(m+1)} = \begin{bmatrix} \hat{x}_1^{(m+1)} \\ y \end{bmatrix}. \qquad (3.7)

Let A_{21}^{(m+1)} be the (2,1) block of A after A_{11}^{(m)} is augmented to A_{11}^{(m+1)}. We can partition this matrix conformally with the way x_1^{(m+1)} is partitioned, i.e.,

A_{21}^{(m+1)} = \begin{bmatrix} \hat{A}_{21}^{(m)} & E \end{bmatrix}, \qquad (3.8)

where \hat{A}_{21}^{(m)} is the submatrix of A_{21}^{(m)} that remains after the rows forming the B block are removed. As a result, the vector A_{21}^{(m+1)} x_1^{(m+1)}, which is required in the (m+1)-st stage, can be written as

A_{21}^{(m+1)} x_1^{(m+1)} = \hat{A}_{21}^{(m)} \hat{x}_1^{(m+1)} + E y. \qquad (3.9)

Note that \hat{A}_{21}^{(m)} contains a subset of the rows of the A_{21}^{(m)} in (3.1) before A_{11}^{(m)} is augmented. Therefore, if x̂_1^{(m+1)} is close to x_1^{(m)}, we may use components of A_{21}^{(m)} x_1^{(m)}, which have already been computed, in place of those of A_{21}^{(m+1)} x_1^{(m+1)} in (3.9). The only additional computation we need to perform is E y. Because ‖x̂_1^{(m+1)}‖ < 1 when y ≠ 0, whereas ‖x_1^{(m)}‖ = 1, to account for the difference in scale we multiply the selected components of the previously computed A_{21}^{(m)} x_1^{(m)} by the scaling factor ν = ‖x̂_1^{(m+1)}‖ before they are combined with the components of E y to yield an estimate of the components of A_{21}^{(m+1)} x_1^{(m+1)}.
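The following sketch is our own illustration of the approximate update (3.9) with the scaling factor ν, under hypothetical naming and bookkeeping that may differ from the paper's: previously computed entries of A_{21}^{(m)} x_1^{(m)} are reused and scaled, and only the product E y involving the newly added columns is formed explicitly.

```python
# A hedged sketch of the update in (3.9): reuse the previously computed A21 @ x1
# (restricted to rows still outside the support) scaled by nu, and add E @ y.
import numpy as np

def update_A21x1(A, old_S, new_idx, prev_A21x1, x1_new):
    """Estimate A21^{(m+1)} x1^{(m+1)} without recomputing the full product.

    old_S      : indices in the support at stage m
    new_idx    : indices added at stage m+1
    prev_A21x1 : previously computed A21^{(m)} x1^{(m)} over all rows outside old_S
    x1_new     : eigenvector of the augmented block, ordered as [x_hat; y]
    """
    n = A.shape[0]
    outside_old = np.setdiff1d(np.arange(n), old_S)
    outside_new = np.setdiff1d(outside_old, new_idx)      # rows still outside S
    x_hat = x1_new[:len(old_S)]
    y = x1_new[len(old_S):]
    nu = np.linalg.norm(x_hat)                            # scaling factor nu in the text
    keep = np.isin(outside_old, outside_new)              # drop rows absorbed into A11
    Ey = A[outside_new, :][:, new_idx] @ y                # only the new columns touched
    return nu * prev_A21x1[keep] + Ey
```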
One of the main issues with the greedy algorithm reviewed in the previous section is that the algorithm terminates when the dimension of A_{11} reaches k, but the selected rows and columns may be far from optimal at that point. In this section, we examine how to use an RL algorithm to improve the incremental greedy probing method. In particular, we will examine a procedure that continues to perform the selection repeatedly after the dimension of A_{11} reaches k. This requires us to remove some rows and columns from A_{11} in order to bring in new rows and columns of A that can yield a smaller λ_1. We will also discuss criteria and strategies for selecting rows and columns. Just like the greedy probing scheme, the RL algorithm we examine only needs to access a small fraction of A in every step. Therefore, we do not need to store the entire A in advance.

In a reinforcement learning algorithm, an agent is trained to take a sequence of actions from one state to other states, with the ultimate goal of achieving a predefined objective. In our case, the states are simply different combinations of k rows and columns selected from A to form A_{11}. The state that solves the k-sparse eigenvalue problem is called the optimal state, which the agent aims to reach after taking a sequence of actions from a non-optimal starting state. Each action corresponds to removing one or more rows and columns from A_{11} and selecting some additional rows or columns from A to replace the removed rows and columns. The effectiveness of each action is measured by a reward function that takes a state and action pair as input and gives a score for such a pair; the score can, for example, be the change in the smallest eigenvalue of A_{11} or some other metric.

Because the reward associated with each state/action pair is local, taking the action that gives the largest reward at a particular state does not necessarily lead to a globally optimal strategy. For example, even when replacing some rows and columns of A_{11} with others from A leads to a smaller decrease, or even an increase, in the smallest eigenvalue of A_{11}, the newly selected rows and columns may still be important in minimizing the smallest eigenvalue of A_{11} when combined with other rows and columns selected in subsequent steps.
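In code, the state, action, and local-reward ingredients described above are small objects: a state is an index set of size k, an action is a pair (p, q), and the local reward is the change in the smallest eigenvalue of the selected submatrix. The sketch below is our own minimal rendering of these definitions, not the authors' implementation.

```python
# A minimal sketch of the RL ingredients: a state is an index set of size k,
# an action is a swap (p, q), and the local reward is theta - theta_prime.
import numpy as np
from scipy.sparse.linalg import eigsh

def smallest_eig(A, S):
    S = np.asarray(sorted(S))
    theta, _ = eigsh(A[S, :][:, S], k=1, which='SA')
    return theta[0]

def take_action(A, state, action):
    """state: set of k row/column indices; action: (p, q) with p in state, q not in state."""
    p, q = action
    new_state = (set(state) - {p}) | {q}
    theta = smallest_eig(A, state)
    theta_new = smallest_eig(A, new_state)
    reward = theta - theta_new        # Eq. (4.2): positive if the eigenvalue decreased
    return new_state, reward
```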
We decide which action to take by following a policy prescribed by a function Q(s,a), which can be viewed as the sum of discounted future rewards the agent can expect, on average, after it reaches the state s and chooses to take the action a. This Q(s,a) function is constructed and refined dynamically in a multi-episode training process. The policy defined in terms of Q(s,a) is related to, but not completely determined by, the local reward. During a training process, we may not want the agent to always take the action that yields the largest reward at that moment. Instead, we may want the agent to have the opportunity to explore different possibilities to improve Q(s,a), which can be viewed as giving a more global assessment of the (s,a) pair towards the goal of solving (1.1).

A reinforcement learning algorithm that produces an optimal policy defined by an optimal Q(s,a) is known as a Q-learning algorithm [32, 39, 40]. The construction of Q(s,a) can be exact or approximate. An exact Q(s,a) may be obtained and tabulated if the number of state and action pairs is relatively small. When the number of state and action pairs is too large, we typically need to develop a way to approximate Q(s,a). Such an approximation can be as simple as a linear combination of important factors or features of the (s,a) pair [5, 8, 35]. It can also be represented by a deep neural network if a nonlinear parametrization can better capture the behavior of the function [18, 21, 29].

In this paper, we approximate Q(s,a) by using a feature-based representation of state/action pairs. For a given (s,a) pair, such a representation can be expressed in terms of a vector f(s,a) ∈ R^m. The function Q(s,a) can then be expressed as a linear combination of feature components, i.e.,

Q(s,a) = \sum_{i=1}^{m} w_i f_i(s,a), \qquad (4.1)

where w_i is a weight factor. We will define s, a, f(s,a), Q(s,a) and how they are represented in greater detail in the next section.

To solve (1.1) by RL, we train an agent to iteratively refine the selection of rows and columns of A to be included in the k × k submatrix A_{11}. The training procedure allows us to construct and update an expected (global) reward function Q(s,a) associated with each state and action pair through the update of a set of feature weights. After the training is completed, the optimal policy allows us to select rows and columns of A in an optimal fashion to solve (1.1). We now discuss how state, action, reward and policy are defined specifically for an RL algorithm designed to solve the k-sparse eigenvalue problem (1.1).

As we indicated earlier, each state corresponds to a set of k rows and columns of A used to form A_{11}. The order of these rows and columns is not important. The environment for our agent is the set of all possible states. There are multiple ways to represent a state. One possibility is to use a size-n binary indicator vector s, in which the rows and columns that are not included are marked by 0's and those that are included are marked by 1's. An alternative representation is simply the set of indices of the rows/columns of A that are included in A_{11}. The set of all possible states, which can be very large, is denoted by S.

For simplicity, let us first define an action as removing one row/column from A_{11} and adding a new row/column from A \ A_{11} to A_{11}. This is equivalent to removing an index p from s and adding a new index q that is not in s. We denote such an action by a pair of indices, i.e., a = (p, q). The space of all actions for a particular state, which can also be very large, is denoted by A. Each action is associated with a local reward.
One natural choice of the reward function is

r = \theta - \theta', \qquad (4.2)

where θ is the smallest eigenvalue of the A_{11} associated with the state s before the action is taken, and θ' is the smallest eigenvalue of the A_{11} associated with the state s' reached after the action is taken. The true impact of the action a may not be known until some future actions are taken. The global expected reward of the state action pair (s,a), which allows us to devise an effective policy towards finding the optimal A_{11}, is defined by a function Q(s,a) : S × A → R. As we indicated in (4.1), in an approximate Q-learning scheme, Q(s,a) can be defined in terms of a weighted average of feature components associated with a feature vector defined for each (s,a).

For each state action pair (s,a), where the action a involves taking p out of s and adding q into the new state s', f_i can be defined as

f_i(s,a) = \delta_{i,q} - \delta_{i,p}, \quad \text{for } p \in s \text{ and } q \notin s, \qquad (4.3)

where δ_{i,j} = 1 if i = j and 0 otherwise. Note that when p ∈ s and q ∉ s hold, the vector f(s,a) contains only two nonzero elements. As a result, the value of Q(s,a) depends on the difference between w_q and w_p for a specific (s,a) pair.

Another way to define f(s,a) is to set

f_i(s,a) = \begin{cases} 1, & \text{if } i \in s \text{ and } i \ne p, \text{ or } i \notin s \text{ and } i = q, \\ -1, & \text{if } i \in s \text{ and } i = p, \\ 0, & \text{otherwise}. \end{cases} \qquad (4.4)

In this case, Q(s,a) is simply the sum of the weights associated with the elements of s' minus w_p, where s' is the state reached from s by taking the action a = (p, q). It provides a measure of how important the rows and columns in s' are collectively.

Note that the weights in (4.1) are themselves independent of s and a. They give, to some extent, a global ranking of the rows and columns of A in terms of their importance in being selected to form A_{11}. They are updated after an action a has been taken. As we will show below, the update of the w_i's makes use of the local reward r. It balances the previous values and their changes through a learning rate α. It also takes into account future rewards resulting from taking a subsequent action a' from the new state s'. We can view the update of the w_i's as taking a gradient descent step towards minimizing the difference between the current Q(s,a) and an optimal (but unknown) Q(s,a). We will examine this optimization point of view in section 4.4.

Before the weights w_i are updated, we need to decide which action to take at each state s. In the RL literature, a rule for making such a decision is called a policy, which is defined to be a function π(s) : S → A [32]. Intuitively, the policy should be designed to select actions that can lead to the largest expected reward Q(s,a), and ultimately reach an optimal state s_* that solves (1.1), i.e., π(s) is defined as

\pi(s) = \arg\max_a Q(s,a). \qquad (4.5)

However, due to the large number of possible actions the agent can take at each s, this optimization problem may not be easy to solve. One way to overcome this difficulty is to obtain a nearly optimal solution by considering a subset of actions that are likely to produce a larger Q(s,a) value. This choice also depends on how Q(s,a) is defined. We will examine two specific policies in section 4.4.2. Because Q(s,a) is a function that is learned over time during the training process, the policy also evolves during the training process.
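Under the linear model (4.1), evaluating Q(s,a) for the two feature choices above reduces to simple arithmetic on the weights. The following sketch is our own illustration: with (4.3), Q(s,a) reduces to w_q − w_p, and with (4.4) it is the sum of the weights over the new state s' minus w_p.

```python
# A hedged sketch of the two feature-based Q(s, a) evaluations.
import numpy as np

def q_value(w, state, action, feature='swap'):
    """w: length-n weight vector; state: set of selected indices; action: (p, q).

    feature='swap'  -> Eq. (4.3): Q(s, a) = w_q - w_p
    feature='state' -> Eq. (4.4): Q(s, a) = sum_{i in s'} w_i - w_p
    """
    p, q = action
    if feature == 'swap':
        return w[q] - w[p]
    new_state = (set(state) - {p}) | {q}
    return sum(w[i] for i in new_state) - w[p]

# w = np.zeros(n)   # the weights themselves are learned during training (section 4.4)
```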
Upon the completion of the training process, we obtain, in principle, an optimal Q*, which corresponds to a set of optimal weights w_i associated with all rows and columns of A. This optimal set of w_i's provides a ranking of all rows and columns of A in terms of their importance in contributing to the solution of (1.1). We form A_{11} by simply selecting the rows and columns of A associated with the largest w_i's.

Using the notation established above, we outline the basic steps of the RL algorithm for solving the k-sparse eigenvalue problem (1.1) in Algorithm 1.
Algorithm 1: The basic RL algorithm

Input: matrix A (efficiently represented), the sparsity level k;
Output: approximate solution (θ, x) of (1.1), where x contains at most k nonzero elements.

  Initialize the weights w_i (and thus Q(s,a)) for all s's and a's;
  for episode = 1, 2, ..., epmax do
    Select an initial state s;
    while not done do
      Choose and take an action a according to the policy π(s) based on the Q(s,a) values for all possible a's;
      Evaluate the local reward r;
      Use r, a learning rate α, and a reward discount rate γ to update Q(s,a) implicitly by updating w;
    end while
  end for
  Select the best state using the training results;

The algorithm consists of two nested loops. The outer loop runs over a number of training episodes. Each episode consists of a number of inner loop iterations in which an agent transitions from one state to another by following a search policy to take a sequence of actions. Each action is followed by the calculation of a local reward and the update of Q(s,a) (through the update of the feature weights). We will provide details of the state and weight initialization, search policy, reward calculation, Q(s,a) update and termination criteria in the next section.

In this section, we give some algorithmic details of Algorithm 1 that lead to an efficient implementation of the RL method for solving the k-sparse eigenvalue problem.

The first episode of RL begins when the greedy algorithm described in section 3 produces a k × k A_{11} matrix. The set of row and column indices of A that have been selected to form A_{11} defines the initial state s. We initialize the weights w_i as follows. If i ∈ s, we set w_i = β_1 |e_i^T x_1|, where x_1 is the eigenvector associated with the desired eigenvalue of A_{11} and β_1 is a normalization factor. If i ∉ s, we set w_i = β_2 |γ_i|, where β_2 is another normalization factor, if γ_i has been computed from (3.5) as the result of the perturbation analysis performed in the greedy algorithm. Otherwise, w_i is simply set to 0. The normalization factors β_1 and β_2 are chosen to ensure that e_i^T x_1 and γ_i are on the same scale. They depend on the dimension of the original problem (n) and the desired sparsity k of the eigenvector.

Upon the completion of each RL episode, we need to reinitialize the state s. The weights are not reinitialized because they represent the "global" ranking of the rows and columns of A that is learned over the training episodes. The state reinitialization strategy is not unique. In this paper, we present two state reinitialization methods used in two variants of the RL algorithm. In the first method, we initialize s by first selecting the rows that have the largest weights and then replacing the rows in s that have the lowest weights with the rows outside of s that have the largest weights. In the second method, we simply initialize s to be the last state reached in the previous episode.
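For concreteness, the sketch below renders the structure of Algorithm 1 in Python. It is a schematic under our own simplifications; the policy, action and weight-update routines named here are hypothetical helpers standing in for the choices detailed in the following subsections, not the authors' code.

```python
# A schematic sketch of the training loop in Algorithm 1 (not the authors' code).
import numpy as np

def train(A, k, w, initial_state, policy, take_action, update_weights,
          ep_max=10, max_steps=100):
    """w: length-n weight vector, initialized as described above.

    policy(w, state)              -> an action (p, q), or None if none is found
    take_action(A, state, action) -> (new_state, local_reward)
    update_weights(w, state, action, new_state, reward) -> modifies w in place
    """
    state = set(initial_state)
    for episode in range(ep_max):
        for step in range(max_steps):
            action = policy(w, state)
            if action is None:                      # terminate the episode
                break
            new_state, reward = take_action(A, state, action)
            update_weights(w, state, action, new_state, reward)
            state = new_state
        # state reinitialization between episodes goes here (see the two methods above)
    return np.argsort(-w)[:k]                       # k rows/columns with largest weights
```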
We now discuss how to decide which action to take when we reach the state s during a training procedure. This decision defines a policy function π(s). As we indicated earlier, an intuitive policy is to take the action that maximizes the expected (global) reward, i.e., the action that maximizes Q(s,a), as described by (4.5). The implementation of such a policy depends on how Q(s,a) is defined in terms of the w_i's, which in turn depends on how the feature vectors are defined for each (s,a) pair. It may appear that solving (4.5) is difficult due to the large number of potential choices of actions. By limiting our search to states that correspond to the largest values of the w_i's determined in previous training episodes, solving (4.5) becomes tractable.

When f(s,a) is defined by (4.3), Q(s,a) is determined by w_q − w_p, where p is the row/column to be removed and q is the row/column to be added in a. Therefore, to solve the optimization problem (4.5) so that the best action can be determined, we just need to examine all pairwise differences w_q − w_p, for p ∈ s and q ∉ s, and choose the action that yields the largest difference. If the w_p's and w_q's are sorted in advance, we can simply take the p that yields the smallest w_p and the q that corresponds to the largest w_q.

An alternative search policy is to take into account both the w_i's and some additional information in determining the best action to take from the state s. We use the values of the w_i's for i ∈ s to select a set of rows in s to be potentially removed. We denote this set of indices by S_1. Because the values of the w_j's for j ∉ s may be far from optimal, we use an alternative metric to select potential rows and columns outside of s to be added to s'. There are a number of alternative metrics we can use. For example, we can use the magnitude of

c_j = e_j^T A_{21} x_1, \qquad (4.6)

where x_1 is the eigenvector associated with the smallest eigenvalue of the A_{11} defined by s. Instead of examining all rows of A_{21} x_1, which can be costly, we can examine a subset of these elements by sampling a few rows of A_{21}. The sampled rows are determined by the nonzero structure of A_{21} and the magnitudes of the nonzero matrix elements. We select a subset of rows that correspond to the elements of A_{21} x_1 with the largest magnitudes, and place the global indices of these rows in a set S_2.

Instead of choosing p = argmin_{i∈S_1} w_i to be removed from s and replacing it with q = argmax_{j∈S_2} c_j, we go through all pairs (i, j) for i ∈ S_1 and j ∈ S_2, and compute the smallest eigenvalue θ' of the updated matrix A_{11} obtained by replacing row i with row j. We choose the action a = (p, q) as the first pair that satisfies

\theta' < \theta \cdot (1 - \tau \cdot \epsilon), \qquad (4.7)

where τ is an exploration rate and ε is a uniformly distributed random number in the interval (0,1) (ε ∼ U(0,1)). Such a probabilistic selection strategy makes it possible for the agent to escape from a locally optimal policy and reach a better state. Effectively, the set of actions defined by the pairs (p, q) for p ∈ S_1 and q ∈ S_2 forms an active space from which a good action can be determined via exploitation and exploration. We refer to a search policy that selects an action from an active space as an active space based search policy. This approach is similar to the complete active space self-consistent field method used in quantum chemistry [26]. We summarize such a policy in Algorithm 2.

In addition to using (4.6) as a criterion for selecting candidate rows/columns to be added to s', we may also use

c_j = \frac{|e_j^T A_{21} x_1|}{|\theta - e_j^T A_{22} e_j|}, \qquad (4.8)

which results from an estimate of the magnitude of the j-th eigenvector component derived from first order perturbation analysis. Yet another way to assess how important a row/column j ∉ s is in achieving the goal of solving (1.1) is to estimate the amount of change in the desired eigenvalue we can realize by bringing j into s. The estimate can be obtained by computing the smallest eigenvalue of the 2 × 2 matrix

A' = \begin{bmatrix} \theta & x_1^T A_{21}^T e_j \\ e_j^T A_{21} x_1 & e_j^T A_{22} e_j \end{bmatrix}, \qquad (4.9)

and subtracting it from θ.
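The 2 × 2 estimate in (4.9) is cheap to evaluate for many candidate rows at once. The sketch below (our own illustration) computes the estimated eigenvalue change θ − λ_min(A') for a list of candidate indices; it could be used, for example, to build the set S_2 in the active space based policy.

```python
# A hedged sketch of the 2x2 estimate (4.9): predicted eigenvalue change from
# bringing a candidate row/column j into the current support S.
import numpy as np
from scipy.sparse.linalg import eigsh

def estimated_gain(A, S, candidates):
    """Return, for each j in candidates, theta - lambda_min(A') with A' as in (4.9)."""
    S = np.asarray(sorted(S))
    theta, x1 = eigsh(A[S, :][:, S], k=1, which='SA')
    theta = theta[0]
    gains = np.empty(len(candidates))
    for idx, j in enumerate(candidates):
        b = (A[j, :][:, S] @ x1[:, 0]).item()     # e_j^T A_21 x_1
        d = A[j, j]                               # e_j^T A_22 e_j
        Ap = np.array([[theta, b], [b, d]])
        gains[idx] = theta - np.linalg.eigvalsh(Ap)[0]
    return gains
```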
Algorithm 2: Active space based search policy

Input: the state s, Q(s,a) (represented by the weights w_i), the c_i's determined by (4.6) or (4.8), the exploration rate τ;
Output: the action pair (p, q)

  Construct S_1 to include the i ∈ s that have the smallest w_i's;
  Construct S_2 to include the j ∉ s that have the largest c_j's;
  for j = 1, ..., |S_2| do
    for i = 1, ..., |S_1| do
      s' = (s \ S_1[i]) ∪ S_2[j];
      Compute the smallest eigenvalue θ' of the submatrix A(s', s');
      Generate a random number ε ∼ U(0,1);
      if θ' < θ · (1 − τ · ε) then
        Let p = S_1[i], q = S_2[j] and output the action (p, q). Calculate the reward and update w.
      end if
    end for
  end for

To improve efficiency, it is not necessary to generate new S_1 and S_2 sets for each state. Whenever an action (p, q) is taken, we can update S_1 by deleting p and adding q. We update S_2 by deleting q and marking all elements of S_2 that have already been considered in previous action selection steps.

The local reward defined by (4.2) measures the immediate progress the agent makes towards the final goal of solving (1.1) after taking an action a. One natural way to define such a reward is to calculate the difference between the desired eigenvalues computed before and after the action is taken. If θ' is the eigenvalue of the updated A_{11} matrix obtained by taking an action a, the local reward can be calculated as θ − θ'. Local rewards are used to update Q(s,a), which measures the global expected reward associated with a state-action pair. When Q(s,a) is defined as a weighted sum of the components of a feature vector, updating Q(s,a) is equivalent to updating the weighting factors w_i. To update several weight factors simultaneously, it is convenient to partition the local reward among the rows and columns that constitute the state s. Let x_1 = (ξ_1, ξ_2, ..., ξ_k)^T be the eigenvector of A_{11} associated with the eigenvalue θ. We can then partition r as

r = \sum_i r_i, \qquad (4.10)

where r_i = r \xi_i^2.

Once the agent takes an action a from s to reach s', we update the function Q(s,a) by modifying the weights w_i using the following scheme:

w_i^{(m+1)} \leftarrow w_i^{(m)} + \alpha \cdot \left[ r + \gamma \cdot \max_{a'} Q(s',a') - Q(s,a) \right] \cdot f_i(s,a), \qquad (4.11)

where m is the (inner) iteration index, α is known as the learning rate, r is the local reward resulting from taking the action a, e.g., defined by (4.2), and γ is known as a discount rate for a reward to be collected from a future action. Recall that f_i(s,a) is the i-th component of the feature vector associated with the state action pair (s,a). The term \max_{a'} Q(s',a') is the maximum future reward that can be collected after the action a is taken.

This updating formula can be viewed as a gradient descent step in the optimization of Q(s,a) [8]. Suppose the optimal value of a particular (s,a) pair is Q^*(s,a), which we do not know in advance. The goal of an RL training process is to minimize the distance between the existing Q(s,a) value and Q^*(s,a) with respect to the w_i's. Since we represent Q(s,a) by (4.1), the least squares error function between the target optimal function Q^*(s,a) and the current approximation Q(s,a) is

E(w) = \left[ Q^*(s,a) - Q(s,a) \right]^2 = \left[ Q^*(s,a) - \sum_i w_i f_i(s,a) \right]^2. \qquad (4.12)
To minimize (4.12), we compute the gradient of E(w) formally as follows:

\frac{\partial E(w)}{\partial w_i} = -2 \left[ Q^*(s,a) - \sum_j w_j f_j(s,a) \right] \cdot f_i(s,a).

The least squares error may be reduced if we move along the negative gradient direction by updating the w_i's as

w_i \leftarrow w_i + \alpha \cdot \left[ Q^*(s,a) - \sum_j w_j f_j(s,a) \right] \cdot f_i(s,a), \qquad (4.13)

where the learning rate α is simply an appropriately chosen step length. Note that (4.13) can also be written as

w_i \leftarrow w_i + \alpha \cdot \left[ Q^*(s,a) - Q(s,a) \right] \cdot f_i(s,a) \qquad (4.14)

by making use of the definition of Q(s,a) given in (4.1). Because the f_i(s,a)'s are nonzero for only a few i's, only a few w_i's are modified in each step. In fact, if f_i(s,a) is defined by (4.3), only two components of w are updated because Q(s,a) is given by

Q(s,a) = w_q - w_p, \qquad (4.15)

where a = (p, q), p ∈ s and q ∉ s.

However, since we do not know the optimal Q^*(s,a) in advance, the formulae given by (4.13) and (4.14) are not computable. To work around this issue, we replace Q^*(s,a) by a surrogate function defined in terms of a local reward and discounted future Q(s',a') values. A commonly used surrogate for Q^*(s,a) is

r + \gamma \cdot \max_{a'} Q(s',a'), \qquad (4.16)

where r is a local reward defined in section 4.4.3, and γ is an appropriately chosen discount rate. This choice yields the updating formula given in (4.11). The surrogate is likely to be poor in early iterations and episodes, but Q(s,a) can gradually converge to the optimal function as the agent visits more states and takes more actions.

When f_i(s,a) is defined by (4.4) and the local reward r is defined by (4.10), it is difficult to draw a direct connection between (4.13) and a gradient descent minimization scheme for obtaining an optimal Q(s,a). However, the updating formula can still be derived from the standard updating formula for Q(s,a), i.e.,

Q(s,a) \leftarrow Q(s,a) + \alpha \cdot \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right), \qquad (4.17)

which is widely used in Q-learning [32]. When f_i(s,a) is defined by (4.4), Q(s,a) is given by

Q(s,a) = \sum_{i \in s'} w_i - w_p, \qquad (4.18)

where s' is the state reached after the action a is taken and the action pair is (p, q), i.e., p is taken out of s and q is added into the new state s'. Substituting (4.18) and (4.10) into (4.17) yields

\Big( \sum_{i \in s'} w_i - w_p \Big) \leftarrow \Big( \sum_{i \in s'} w_i - w_p \Big) + \alpha \cdot \Big[ \sum_{i \in s'} r_i + \gamma \max_{a'} Q(s',a') \sum_{i \in s'} \xi_i^2 - \Big( \sum_{i \in s'} w_i - w_p \Big) \Big], \qquad (4.19)

where x_1 = (ξ_1, ξ_2, ..., ξ_k)^T is the eigenvector corresponding to the smallest eigenvalue of the A_{11} associated with the new state s'. Partitioning and regrouping terms by the indices i and p, and dividing \max_{a'} Q(s',a') proportionally according to the weighting factors ξ_i^2 among these k terms, yields the following updating formula for all i ∈ s':

w_i \leftarrow w_i + \alpha \cdot \left[ r_i + \xi_i^2 \gamma \max_{a'} Q(s',a') - w_i \right]. \qquad (4.20)

For w_p, the updating formula is

w_p \leftarrow w_p + \alpha \cdot \left[ -w_p \right], \qquad (4.21)

which can be regarded as a penalty for the removed row/column.
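A compact rendering of the two weight-update rules, under our own naming and the same reward-partition convention adopted above, is sketched below: the first implements the update for the swap features (4.3), cf. (4.11) and (4.13)–(4.16), and the second implements (4.20)–(4.21) for the state-sum features (4.4).

```python
# A hedged sketch of the weight updates (not the authors' code).
import numpy as np

def update_swap(w, p, q, r, q_next, alpha=0.5, gamma=0.1):
    """Update for features (4.3): only w_p and w_q change.

    q_next stands for the surrogate max_{a'} Q(s', a') evaluated at the new state s'.
    """
    td = r + gamma * q_next - (w[q] - w[p])     # temporal-difference term
    w[q] += alpha * td                          # f_q(s, a) = +1
    w[p] -= alpha * td                          # f_p(s, a) = -1

def update_state_sum(w, new_state, p, r, xi, q_next, alpha=0.5, gamma=1.0):
    """Update for features (4.4), cf. (4.20) and (4.21).

    xi: eigenvector of the new A11 (entries ordered consistently with new_state),
    used to partition the local reward r as r_i = r * xi_i**2.
    """
    for i, idx in enumerate(new_state):
        r_i = r * xi[i] ** 2
        w[idx] += alpha * (r_i + xi[i] ** 2 * gamma * q_next - w[idx])
    w[p] += alpha * (-w[p])                     # penalty for the removed row/column
```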
An RL procedure typically consists of several episodes, and many steps (actions) are taken in each episode to identify a better state. We terminate each episode either when no action can be found to improve the Q(s,a) function or when a maximum number of steps per episode is reached. Because RL is a greedy algorithm that makes use of a number of problem dependent heuristics (e.g., the choice of features and the learning and exploration rates), its convergence is generally not guaranteed. As a result, a maximum number of episodes is often set to terminate the RL algorithm when the algorithm does not reach convergence after an excessive number of episodes. When the RL procedure is terminated, we use the weights produced by the training episodes to rank all rows and columns of A, and choose the k rows and columns associated with the k largest w_i's as the final state, which may be different from the state reached at the end of the last episode of the RL algorithm.

In this section, we demonstrate the effectiveness of the RL algorithm for solving k-sparse eigenvalue problems. We choose two test problems arising from many-body physics applications. The first problem arises from the nuclear structure analysis of the Li6 isotope, which consists of 3 protons and 3 neutrons. The goal is to compute the ground state of the nucleus, which corresponds to the smallest eigenvalue of the nuclear Schrödinger Hamiltonian operator. The Hamiltonian is approximated in a truncated configuration interaction (CI) space consisting of Slater determinant basis functions that satisfy a truncation cutoff constraint defined in terms of a parameter N_max [9]. We choose N_max = 6, which results in a sparse matrix A of dimension 197,822. The sparsity pattern of the matrix is shown in Figure 1. We label this matrix as Li6Nmax6.

In general, the Slater determinants that satisfy a smaller N_max cutoff tend to contribute more to the ground state, although this is not always true. The true eigenvector associated with the smallest eigenvalue of A is shown in Figure 2. We observe that this eigenvector has larger magnitudes in the first few components, corresponding to Slater determinants in a small configuration space. But the eigenvector is not strictly localized: other components of the eigenvector are small but not negligible.

Figure 1: The sparsity of the Li6 Hamiltonian matrix.

Figure 2: The component magnitudes of the eigenvector corresponding to the smallest eigenvalue of the Li6 Hamiltonian matrix.
The second problem we use for testing originates from the study of many-body localization (MBL) [20, 22, 30]. The sparse matrix A represents the Heisenberg spin-1/2 Hamiltonian associated with a disordered quantum spin chain with L = 20 spins and nearest neighbor interactions [37]. We name this matrix MBL20. The dimension of the MBL20 matrix is 184,756 and its sparsity structure is shown in Figure 3. This matrix is much sparser, with only 0.006% nonzero elements. When the strength of the disorder is sufficiently large, the eigenvectors of MBL20 exhibit localization features. In this paper, we compute only the smallest eigenvalue and the corresponding eigenvector. Figure 4 shows the magnitudes of the different components of this eigenvector.

We set k to 100 for both test problems, which is admittedly small: even though the eigenvector of the MBL20 matrix is localized, the number of nonzero components of the eigenvector is actually larger than 1000.

Figure 3: The sparsity of the MBL20 Hamiltonian matrix (nz = 2,032,316).

Figure 4: The component magnitudes of the eigenvector corresponding to the smallest eigenvalue of the MBL20 Hamiltonian matrix.
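The Li6Nmax6 and MBL20 matrices themselves are not reproduced here, but the selection routines sketched earlier can be exercised on a synthetic stand-in. The sketch below is our own construction, not from the paper: a sparse symmetric matrix whose lowest eigenvector is approximately localized on a small block of rows.

```python
# A synthetic stand-in (our own construction) for experimenting with k-sparse selection:
# a sparse symmetric matrix whose lowest eigenvector is approximately localized.
import numpy as np
import scipy.sparse as sp

def localized_test_matrix(n=2000, width=60, density=5e-4, seed=0):
    R = sp.random(n, n, density=density, random_state=seed, format='csr')
    A = (R + R.T) * 0.5                           # random sparse symmetric background
    d = np.ones(n)
    d[:width] = -5.0                              # deep diagonal well on the first rows
    return (A + sp.diags(d)).tocsr()              # the lowest eigenvector concentrates there

# A = localized_test_matrix()
# Combined with evaluate_support / greedy_probe above, this gives a small test bed.
```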
We use two variants of the RL algorithm to solve the k-sparse eigenvalue problems described above. In the first variant, which we denote by RL1, the feature vector associated with each state action pair is defined by (4.3). The corresponding Q(s,a) function is defined by (4.15). The search policy is designed to solve (4.5). As a result, each action simply swaps out the row in s that has the smallest w_i for the row outside of s that has the largest w_j. Only w_p and w_q are updated, by the formula given in (4.13).

In the second variant, which we label RL2, the feature vector f(s,a) is chosen to be (4.4). The corresponding Q(s,a) function is defined by (4.18). The alternative search policy given in Algorithm 2 is used to select an action. The update of the weighting factors w_i follows the formulas given by (4.20) and (4.21).

The learning rate for updating the w_i's is set to α = 0.5 for both variants. The discount rate γ for RL1 is set to 0.1. For RL2, it is set to 1.0. RL2 uses an exploration rate that decays exponentially with respect to the number of steps within each episode. We initialize the state s for both variants to an approximate solution obtained from the greedy algorithm discussed in section 3. In RL1, a random starting guess of s is chosen for subsequent episodes, whereas in RL2, the state reached in the previous episode is chosen as the starting state for the next episode. Table 1 summarizes the algorithmic and parameter differences between RL1 and RL2.

                          RL1           RL2
  feature representation  Eq. (4.3)     Eq. (4.4)
  Q(s,a)                  Eq. (4.15)    Eq. (4.18)
  policy                  solve (4.5)   Algorithm 2
  w_i update              Eq. (4.13)    Eqs. (4.20) and (4.21)
  learning rate           0.5           0.5
  discount rate           0.1           1
  exploration rate        N/A           e^{-i}

Table 1: Algorithmic options and parameters for the two variants of the RL algorithm.

In Figure 5, we show the convergence of the smallest eigenvalue of A_{11} towards the solution of the k-sparse eigenvalue problem for the Li6Nmax6 problem in both RL1 and RL2. We denote the best eigenvalue approximation obtained from RL2 by θ*. Instead of plotting the change of the eigenvalue of A_{11} with respect to the episode number, we plot the difference θ^{(i)} − θ*, shifted by a small positive constant to avoid plotting 0 on a log scale for the last episode of RL2, where θ^{(i)} is the eigenvalue of A_{11} obtained at the end of the i-th episode. The dashed line marks the difference between the baseline solution λ_b and θ*, where the baseline solution is obtained by selecting the rows and columns of A corresponding to the k largest (in magnitude) components of the eigenvector associated with the smallest eigenvalue of A. As we indicated earlier, any solution that falls below this baseline is considered a "good" solution. The solution to the k-sparse eigenvalue problem lies below the blue lines. However, we do not know the exact solution to (1.1), which in principle can be obtained by enumerating all possible combinations of k = 100 rows and columns of A and computing the smallest eigenvalue of the corresponding A_{11}. Therefore, we cannot show how far θ* is from the exact solution.
Figure 5: The differences between the best eigenvalue obtained from RL2 and the lowest eigenvalue at the end of each training episode of RL1 and RL2 for the Li6Nmax6 matrix.
We observe that both RL1 and RL2 can find approximate solutions that are "good". There is very little change in the eigenvalue of A_{11} in RL1 after the first few episodes. In RL2, the smallest eigenvalue of A_{11} increases in the first few episodes, but eventually decreases and falls below that produced by RL1. This is likely due to the active space based search policy used in RL2 to explore a wider range of actions.

We also compared the computed eigenvalues obtained from RL1 and RL2 with a solution obtained by reformulating (1.1) as a sparse principal component analysis (PCA) problem and using the GPower_{ℓ1} method presented in [17] to solve this problem. In such a reformulation, we try to find

x_* = \arg\max_{x^T x \le 1} \sqrt{x^T (\sigma I - A) x} - \rho \|x\|_1, \qquad (5.1)

where σ is a shift chosen to map the smallest eigenvalue of A to the largest eigenvalue of (σI − A), and ρ is an ℓ1 penalty parameter used to introduce sparsity in an approximate solution to (5.1) [34]. The larger ρ is, the sparser the approximate solution x will be. The GPower_{ℓ1} method is based on the power iteration. By setting ρ to 0.003, the GPower_{ℓ1} method produces an approximate solution x that has 100 elements that are significantly larger than 0 in magnitude. Table 2 shows that both RL1 and RL2 produce better approximate eigenvalues than that obtained from GPower_{ℓ1}.

  Method                  Approximate Eigenvalue
  RL1                     -22.9503
  RL2                     -23.0441
  GPower_{ℓ1}             -22.7243
  baseline                -22.0618
  Greedy initialization   -22.9468

Table 2: A comparison of approximate eigenvalues obtained from RL1, RL2 and the sparse PCA solver GPower_{ℓ1}.

We observe similar convergence patterns in RL1 and RL2 when they are applied to the MBL20 problem. Figure 6 shows that the smallest eigenvalue of A_{11} obtained at the end of each episode of RL1 changes very little after the third episode. The approximate eigenvalue computed in RL2 increases above the baseline eigenvalue λ_b in the second episode, but then gradually decreases in subsequent episodes until it reaches the best value reported in Table 3. Approximate solutions obtained from both RL1 and RL2 are good. The difference between the baseline eigenvalue λ_b and θ* (reached by RL2) is plotted as a dashed line. Table 3 shows the final approximate eigenvalues from RL1, RL2 and the baseline.

  Method                  Approximate Eigenvalue
  RL1                     -27.9241
  RL2                     -27.9242
  baseline                -27.9229
  Greedy initialization   -27.9237

Table 3: A comparison of approximate eigenvalues obtained from RL1 and RL2 for the MBL20 matrix.
Figure 6: The differences between the best eigenvalue obtained from RL2 and the lowest eigenvalue at the end of each training episode of RL1 and RL2 for the MBL20 matrix.
In Figure 7, we show how the states evolve from one episode to another in RL1 and RL2, respectively. The state s reached at the end of each episode is plotted as shaded rectangles in a horizontal array of rectangles indexed by the row numbers of the matrix A. We only plot up to the largest row index that has been selected.

Figure 7: State evolution in RL1 and RL2 for the Li6Nmax6 problem.
Notice that the state evolution pattern in RL1 is quite different from that in RL2. The states evolve slowly in RL1: with a good initial guess provided by the greedy algorithm, only a few rows and columns are changed over several episodes of RL1. The changes of states are much more dramatic in RL2. We can see that RL2 explores a much wider range of row/column indices. This is partly because RL2 uses the active space based search policy to choose an action in each step, and partly due to the more frequent use of exploration in the early episodes of the algorithm. Although the states reached in earlier episodes are far from optimal, RL2 eventually converges to a better state than the one identified by RL1, as we have shown in Figure 5.

For MBL20, as shown in Figure 8, RL1 again converges in a few episodes starting from an initial state produced by the greedy algorithm. The states reached in the first several episodes again change quite a bit in RL2, reflecting the use of the active space based policy and more extensive exploration. But eventually the algorithm settles down to a subset of rows and columns that corresponds to the localized region of the eigenvector. Due to the more localized nature of the eigenvector for this problem, the difference between the states reached by RL1 and RL2 is relatively small. This is consistent with the approximate eigenvalues reported in Table 3.

Figure 8: State evolution in RL1 and RL2 for the MBL20 problem.
In Figures 9 and 10, we plot √w_i for the largest k = 100 weights produced when RL2 is terminated. We compare the locations of these weights with the locations of the largest k = 100 components (in magnitude) of the desired eigenvector of A. To make it easier to see the difference between the row indices selected by RL2 and the row indices corresponding to the largest eigenvector components (the baseline solution), we collect the indices I = {i} associated with the largest w_i's and the indices J = {j} associated with the largest eigenvector components. We then take the union of I and J, K = I ∪ J, sort the indices in K in increasing order, and denote them by ℓ_1 < ℓ_2 < ... < ℓ_{|K|}, where |K| is the size of K. We plot √w_{ℓ_i} as a solid blue bar over i if ℓ_i ∈ I, and the eigenvector component magnitude |ξ_{ℓ_i}| as a red empty bar over i if ℓ_i ∈ J.

We observe from Figure 9 that only a small subset of the rows selected by RL2 for the Li6Nmax6 problem overlaps with the largest 100 elements of the desired eigenvector of A. Because the solution obtained by RL2 corresponds to a much smaller eigenvalue, it is a much better solution than the baseline solution. For this problem, whose eigenvector is not strictly localized, RL appears to be very effective in finding a very good solution, even though we cannot verify that this solution is the exact solution of (1.1). Figure 10 shows that, for MBL20, there is a significant overlap between the rows of A selected by RL2 and those that correspond to the 100 largest eigenvector components associated with the smallest eigenvalue of A. To some extent, this is not surprising because the desired eigenvector of the MBL20 problem has a clear localization feature, as shown in Figure 4.

Figure 9: The absolute values of the components of the lowest eigenvector x of the Li6Nmax6 matrix, overlaid with the square roots of the weights w of the corresponding rows from RL2. The weights w are rescaled to a unit vector for comparison.

Figure 10: The absolute values of the components of the lowest eigenvector x of the MBL20 matrix, overlaid with the square roots of the weights w of the corresponding rows from RL2. The weights w are rescaled to a unit vector for comparison.

We showed how a k-sparse eigenvalue problem can be solved via an RL algorithm. We described how to represent states (s), actions (a), local rewards, and the global expected reward (also known as the Q(s,a) function), which are the basic ingredients of an RL algorithm, for the k-sparse eigenvalue problem. Each state simply consists of the indices of the k rows and columns selected from A. Each action involves removing one index from s and adding another index outside of s to form a new state s'. The most challenging problem in devising an effective and efficient RL algorithm is to construct an appropriate representation of the Q(s,a) function, which values the suitability of taking the action a at the state s, so that it can be updated efficiently during the learning process, especially when both the number of states and the number of actions that can be taken in using RL to solve the k-sparse eigenvalue problem are extremely large. In this paper, we chose to represent Q(s,a) as a linear combination of a number of feature components defined in terms of the row and column indices in the current or the next state. This linear representation is sufficiently simple that the update of Q(s,a) can be realized by modifying the weighting factors in the linear combination. We presented two strategies (policies) for choosing the next action based on the information provided by Q(s,a) and component-wise perturbation analysis.
In particular, we proposed an active space based policy that searches within a subset of candidate actions that can lead to a significant reduction of the smallest eigenvalue of the selected submatrix. We tested the RL algorithm on two examples originating from many-body physics. One of the problems has nice localization properties, whereas the other is not strictly localized. We demonstrated the effectiveness of RL on both problems. Although these problems are still relatively small, and we chose a relatively small k so that the RL algorithms can terminate in a handful of episodes, they are good benchmarks for testing the robustness and effectiveness of the RL algorithm. Clearly, more tests involving larger problems and larger k's need to be performed in the future to develop strategies for setting a number of RL parameters such as the learning rate, exploration rate and discount rate. As the problem size becomes larger, many other computational considerations, such as fast local reward computation and the parallelization of the active space search and exploration, need to be addressed. Furthermore, we may also consider using a nonlinear representation of the Q(s,a) function, perhaps via a deep neural network, to further improve the predictive power of Q(s,a) and consequently the search policy, and to expand the space of actions by including actions that involve removing and adding multiple rows and columns of A from and to A_{11}. In addition to comparing with the baseline solution, we will also compare the RL solution to solutions obtained from Monte Carlo sampling based methods.

Acknowledgments
Acknowledgments

This work was supported in part by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program, and by the U.S. Department of Energy, Office of Science, Office of Nuclear Physics, under Award Number DE-FG02-95ER-40934. W. Gao is supported in part by the National Science Foundation of China under Grant Nos. 11690013, 71991471, and U1811461.