A literature survey of matrix methods for data science

Martin Stoll
Chair of Scientific Computing, Department of Mathematics, TU Chemnitz, Chemnitz, Germany

Correspondence
Martin Stoll. Email: [email protected]

Present Address
Chair of Scientific Computing, Department of Mathematics, TU Chemnitz, Reichenhainer Str. 41, 09126 Chemnitz, Germany
Abstract
Efficient numerical linear algebra is a core ingredient in many applications across almost all scientific and industrial disciplines. With this survey we want to illustrate that numerical linear algebra has played and is playing a crucial role in enabling and improving data science computations, with many new developments being fueled by the availability of data and computing resources. We highlight the role of various different factorizations and the power of changing the representation of the data, as well as discussing topics such as randomized algorithms, functions of matrices, and high-dimensional problems. We briefly touch upon the role of techniques from numerical linear algebra used within deep learning.
KEYWORDS:
Numerical Linear Algebra
Introduction

The study of extracting information from data has become crucial in many fields, ranging from business and engineering to fundamental research and culture. We here assume that data science intends to analyze and understand actual phenomena with data, according to [136]. To achieve this, we follow [80] in that data science draws on elements of machine learning, data mining and many other mathematical fields such as optimization or statistics. Also, we want to point out that in order to obtain information from data it is not necessarily implied that the amount of data is big, but often it is. The multitude of applications where such data occur and need to be studied goes beyond the scope of this paper. Naturally, matrices arise as part of spatial data analysis [28, 206, 117] or time-series data analysis [2, 154, 291, 233, 40] but can also be found in many more disciplines [280]. We here view the data as being represented in a matrix
$$
A = \begin{bmatrix} a_1^T \\ \vdots \\ a_n^T \end{bmatrix} \in \mathbb{R}^{n,m} \qquad (1)
$$
with the dimensions $m, n$ related to the underlying data. Here $m$ is the dimension of the feature space with feature vectors $a_i \in \mathbb{R}^m$ viewed here as the rows of $A$. The dimension $n$ is the number of data points and is thus possibly very large. We often assume that $n > m$. Alternatively, the data can also arise in the form of a tensor of order $d$
$$
\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \ldots \times n_d} \qquad (2)
$$
with dimensions $n_i \in \mathbb{N}\ \forall i$.

The tasks of extracting meaningful information from the collected data vary between application areas. In the process of extracting information from data we often encounter tasks such as unsupervised learning, semi-supervised learning, and supervised learning (cf. [153, 135]). The difference between these can roughly be summarized by the availability and usage of training data: none for unsupervised, only a subset for semi-supervised, and all of the data in supervised learning.

Before starting the detailed discussion, we would like to point out a particular example that is a core problem in data science, statistics, numerical linear algebra and computer science alike. This is the least squares problem [110], where in brief one wants to fit a linear model to labeled data. Let us assume that we are given $n$ data points $a_i \in \mathbb{R}^m$ and typically outputs $y_i$, e.g. a value of either $-1$ or $1$ for a classification problem. We are then interested in finding a weight vector $w \in \mathbb{R}^m$ such that the 2-norm of the residual $r_i = a_i^T w - y_i$ is minimized. It is well known that this leads to a linear regression problem
$$
\min_w \|Aw - y\|^2 + \lambda \|w\|^2 \qquad (3)
$$
with $A$ as above. Here, we also introduce $\lambda > 0$ as a regularization or ridge parameter for the regularization term $\|w\|^2$. The regularization term could also be measured in several different norms such as the $\ell_1$-norm [274] or the total variation norm [285]. The solution to this problem is then given by
$$
w_* = \left(A^\top A + \lambda I\right)^{-1} A^\top y, \qquad (4)
$$
a prototypical linear system of equations with a symmetric and positive definite matrix as long as $\lambda > 0$. This example illustrates that the different disciplines are intertwined. Problem (3) arises in data mining relying on techniques from numerical linear algebra to make the evaluation robust and efficient. On the other hand, the development of sophisticated numerical methods is driven by studying problems with real data.
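As a concrete illustration, the closed-form solution (4) can be computed in a few lines of NumPy. The following is only a minimal sketch on synthetic data; the problem sizes, labels, and regularization value are placeholders and not taken from any particular application.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 10                              # n data points, m features
A = rng.standard_normal((n, m))             # rows are the feature vectors a_i^T
y = np.sign(A @ rng.standard_normal(m))     # synthetic labels in {-1, 1}

lam = 1e-2                                  # ridge parameter lambda > 0
# Normal equations (4): (A^T A + lambda I) w = A^T y.
# The system matrix is symmetric positive definite for lambda > 0.
w = np.linalg.solve(A.T @ A + lam * np.eye(m), A.T @ y)
print(np.linalg.norm(A @ w - y))            # training residual
```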
The goal of this survey is to show how information extraction from data, data modeling, and pattern finding relies on efficient techniques from numerical linear algebra that not only enable computations but also reveal hidden information.

The paper is structured as follows. We first illustrate the use of classical factorizations such as the QR and singular value decomposition (SVD) and then introduce interpretable factorizations such as the non-negative matrix factorization (NMF) or the CUR decomposition. We additionally discuss literature devoted to kernel methods, with special attention given to the graph Laplacian. Randomization as well as functions of matrices are discussed next. We close with a brief discussion of recent results for high-dimensional problems and deep learning applications. As a word of caution, we want to remark that the field of numerical linear algebra is vast and the analysis of data via techniques from numerical linear algebra is not new. The goal of this survey is to point to recent trends and we apologize to the authors whose results we missed while writing this. In particular we want to refer to the beautiful books by Eldén [92] and Strang [266] that provide general introductions to linear algebra for data science applications.
Matrix factorizations

The decompositional approach to matrix computations has been named one of the top 10 algorithms of the 20th century [79]. Matrix factorizations are a ubiquitous tool in data science and have received much attention over the last years. A great example is the use of matrix factorization techniques for recommender systems such as the Netflix challenge [169] (see https://en.wikipedia.org/wiki/Netflix_Prize). We here review some important matrix factorizations and their applications, as well as tailored factorizations used for data science. We split the discussion into classical factorizations, which have been the workhorse of many applications such as engineering or fluid mechanics, and factorizations that are designed to more closely resemble the nature of the data.

The singular value decomposition

Given a data matrix $A \in \mathbb{R}^{n,m}$, assuming $n \ge m$, a singular value decomposition (SVD) [110] is given as
$$
A = USV^\top
$$
with $U \in \mathbb{R}^{n,n}$ and $V \in \mathbb{R}^{m,m}$ being orthogonal matrices. The matrix $S \in \mathbb{R}^{n,m}$ is of the following form
$$
S = \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_m \\ 0 & \ldots & 0 \\ \vdots & & \vdots \\ 0 & \ldots & 0 \end{bmatrix}
$$
with singular values $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_m \ge 0$. An equivalent and often very useful representation is the outer product form of the SVD as
$$
A = \sum_{i=1}^m \sigma_i u_i v_i^T
$$
where $u_i$ and $v_i$ are the columns of $U$ and $V$, respectively. This allows for a natural interpretation of $U$ as providing a basis for the column space of $A$ and $V$ for its row-space. A crucial task is to find a good rank-$k$ approximation to the matrix $A$. The Eckart–Young–Mirsky theorem [111] states that the best rank-$k$ approximation in any unitarily invariant matrix norm is given as
$$
A \approx A_k := \sum_{i=1}^k \sigma_i u_i v_i^T,
$$
hereby ignoring the smaller singular values in the summation. This is known as the truncated singular value decomposition and is written in matrix form as
$$
A_k = U_k S_k V_k^\top. \qquad (5)
$$
The SVD is one of the most crucial tools in the complexity reduction of large-scale problems. The singular value decomposition is naturally well-suited to solve the least squares problem (3) via
$$
w_* = \left(A^\top A + \lambda I\right)^{-1} A^\top y = \left(VS^\top U^T USV^\top + \lambda I\right)^{-1} VS^\top U^\top y = V\left(S^\top S + \lambda I\right)^{-1} S^\top U^\top y. \qquad (6)
$$
This is expensive due to the cost of computing the full SVD, but the truncated SVD can be exploited for solving least squares problems as discussed in [133]. There have been many algorithmic updates in the computation of the (truncated) SVD that deal with the numerical difficulties of large-scale problems, rounding errors, locking of wanted singular vectors and purging of unwanted information [14, 147, 264, 140]. At the heart often lies the Lanczos bidiagonalization, and for large scale problems incorporating implicit restarts is mandatory (cf. [30, 141, 14, 264]). The algorithmic foundation for computing the truncated SVD, namely the Lanczos bidiagonalization [106], was shown to be equivalent [91, 30] to a well-known method in statistics, namely the NIPALS (Nonlinear Iterative Partial Least Squares) method. To the best of our knowledge, the algorithmic improvements developed for the Lanczos bidiagonalization have not been exploited within NIPALS implementations (cf. [30]). Nevertheless, NIPALS has become a crucial tool in the analysis of problems from economics applications [150, 128, 99].

The SVD is also a key ingredient in model order reduction [19, 50], classically used for reducing the dimensionality of physics-related models based on differential equations. Recently, model order reduction has found more and more applications in machine learning [171, 275]. One of the most important applications that directly mirrors the use of the SVD in computational science and engineering is the creation of reduced representations. Here, applications include text mining [3], face recognition [309], medicine [306] and many more.

The SVD also comes in many disguises among the different disciplines of statistics, engineering, and applied mathematics. Given a data matrix $A$, applying principal component analysis (PCA), which is equivalent to performing the SVD, has been a key tool for understanding the structure of the data. In PCA, the columns of $A$ are first centered to have zero mean; then the right singular vectors $v_i$ are called the principal component directions of $A$ and the left singular vectors are the principal components of $A$. More information and applications are given in [155, 156, 203, 300]. In order to compute the singular value decomposition for data matrices that originate from massive datasets one often has to resort to techniques from high performance computing [8, 267, 212].
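A small NumPy illustration of the truncated SVD (5) and the Eckart–Young–Mirsky theorem may be helpful here; the data below are synthetic and the truncation rank is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 60))  # numerically low-rank data

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD; s holds sigma_1 >= ... >= sigma_m

k = 10
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # truncated SVD (5)

# Eckart-Young-Mirsky: the 2-norm error of the best rank-k approximation
# equals the (k+1)-st singular value.
print(np.linalg.norm(A - A_k, 2), s[k])
```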
Additionally, it is possible to rely on randomized algorithms that we discuss in Section 4. The SVD is also a crucial ingredient in many algorithms for high-dimensional data analysis, see Section 6 on tensor factorizations. A prototypical example is the drastic reduction of the dimensionality of the system matrices defining a dynamical system.
The QR factorization
Given a set of vectors collected in the matrix $A$ and considering the span of the columns of $A$, it is clear that the vectors themselves can potentially be a terribly conditioned basis for further numerical computations. Hence, one wants to find a well-conditioned basis, ideally a basis with orthogonal vectors. This task is achieved by computing the QR factorization, i.e.,
$$
A = QR
$$
with $Q \in \mathbb{R}^{n,n}$ an orthogonal matrix and $R \in \mathbb{R}^{n,m}$ an upper triangular matrix of the form
$$
R = \begin{bmatrix} \hat{R} \\ 0 \end{bmatrix},
$$
where the matrix $\hat{R} \in \mathbb{R}^{m,m}$ is invertible if and only if the matrix $A$ has full column rank $m$. From this representation it is clear that one can also work with $A = \hat{Q}\hat{R}$, where $\hat{Q}$ only contains the first $m$ columns of $Q$. The QR factorization is another ubiquitous factorization in applied mathematics. Its computation is typically rather expensive and for the details we refer to [110]. In particular, the reduction of $A$ to triangular form via so-called Householder reflectors is a de-facto standard. The solution of the least squares problem (3) without regularization [110] via the QR factorization is well known:
$$
w_* = \operatorname{argmin}_w \|QRw - y\| = \operatorname{argmin}_w \left\|Rw - Q^\top y\right\|.
$$
Truncated QR decompositions in disguise are also at the heart of the Lanczos [172] and Arnoldi [9] methods, which are the key algorithms of Krylov subspace methods. Many algorithms in numerical linear algebra rely on Krylov subspaces, i.e.,
$$
\mathcal{K}_\ell(M, r) := \operatorname{span}\left\{r, Mr, M^2r, \ldots, M^{\ell-1}r\right\}
$$
being the space of dimension $\ell$, where $M$ is the system matrix of the underlying problem, e.g. $M = (A^\top A + \lambda I)$ for the ridge regression problem (4), and $r$ is a vector associated with a right-hand side of (4). In order to obtain robust methods we rely on a well-conditioned basis of $\mathcal{K}_\ell(M, r)$. The Lanczos method computes an orthonormal basis of increasing dimensionality with great efficiency, only requiring one matrix vector product per iteration. For more details, we refer to [111, 237]. Truncated QR decompositions are based on computing the columns of a matrix $Q$ spanning the basis of the range of the Krylov matrix. Additionally, the pivoted QR factorization [262, 261] computes the factorization
$$
AP = QR
$$
with $P$ a permutation matrix. This QR factorization can be computed maintaining the sparsity of the matrix $A$ (cf. [261]) and it is also closely related to interpretable factorizations, which we discuss next.
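Before moving on, here is a minimal NumPy sketch of the QR-based least-squares solve described above, with random data as a stand-in; for rank-deficient problems one would resort to the pivoted variant.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 8
A = rng.standard_normal((n, m))
y = rng.standard_normal(n)

Q, R = np.linalg.qr(A)             # reduced QR: Q is n x m, R is m x m upper triangular
w = np.linalg.solve(R, Q.T @ y)    # w = argmin ||Aw - y||

print(np.linalg.norm(w - np.linalg.lstsq(A, y, rcond=None)[0]))  # agrees with lstsq
```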
Interpretable factorizations

Despite the beautiful mathematical properties of both the QR and the SVD, they sometimes do not provide a representation that allows an easy interpretation of the results for practitioners. For example, in many applications the data are nonnegative while the singular vectors can contain negative values. Mathematically this means that a given data matrix $A \in \mathbb{R}^{n,m}$ is well approximated by the truncated SVD $A \approx U_k S_k V_k^\top$, but the singular vectors contained in $U_k(:, l)$ are in general not sparse or non-negative even though this holds for the data. We adopt MATLAB notation addressing columns and rows of matrices by using $A(:, k)$ for the $k$-th column of $A$ and $A(k, :)$ for the $k$-th row. Analogously, this can be defined for several rows or columns. To preserve the properties inherent in the application while maintaining a good approximation of the original data with reduced complexity, many interpretable factorizations have been developed over recent years (cf. [193, 293, 84, 38, 176, 81, 103] for some of them). Here, fig. 1 illustrates how the CUR approximation, which we introduce later in this section, naturally represents the original data. We now briefly introduce these methods and comment on applications and new developments. We here follow the notation of [287] and start with the CX decomposition defined by
$$
A \approx CX
$$
where $C \in \mathbb{R}^{n,k}$ and $X \in \mathbb{R}^{k,m}$. Here the matrix $C$ is a matrix consisting of $k$ columns of the data matrix $A$, i.e., $C = A(:, J)$ for an index set $J$ of order $k$.
FIGURE 1 We here illustrate a set of data points (red) that are stored in the matrix $A$. We then show the two dominating rows (black) of $A$ obtained from a CUR decomposition and the two dominating singular vectors (blue).

The important feature of the matrix $C$ is that its columns are interpretable as they are taken from the original data. Since the data matrix is often sparse, this is inherited by $C$ and as a result storing $C$ requires less memory than storing the singular vector matrix $U_k$. The computation of $C$ and $X$ is done via the minimization of
$$
\min \|A - CX\|,
$$
where $\|\cdot\|$ may be the 2-norm or the Frobenius norm. This problem is known as the column subset selection problem [37] and its NP-completeness is discussed in [252]. Typically, more structure is required than just a general matrix $X$ and one quickly moves to the interpolative decomposition (ID) [287]
$$
A \approx CV^\top \qquad (7)
$$
where $C \in \mathbb{R}^{n,k}$ is as before but the matrix $V \in \mathbb{R}^{m,k}$ is constructed such that it contains a $k \times k$ identity matrix and $\max_{i,j} |v_{ij}| \le 1$. A procedure to obtain a low-rank ID via a pivoted QR is given in [287], returning
$$
V^T = \begin{bmatrix} I_k & T_l \end{bmatrix} P^T
$$
with $P$ a permutation matrix and $T_l$ a solution related to the upper triangular factors of the pivoted QR. This approach works analogously when a one-sided representation in terms of matrix rows is desired. One can also obtain a two-sided interpolative decomposition
$$
A \approx W A(I, J) V^T. \qquad (8)
$$
We start its computation by using a one-sided interpolative decomposition $A \approx CV^\top$, which already provides us with the set of crucial column indices $J$. In order to compute the row indices $I$ and the matrix $W$, we now compute a one-sided decomposition for the matrix $C^T$. Following (7) we then obtain
$$
C^T \approx \tilde{C}\tilde{V}^T,
$$
where $\tilde{C}$ contains by design columns of $C^T$, i.e., rows of $C$ and thus elements of the rows of $A$, along with the row index set $I$. The matrix $W$ is then given as $W = \tilde{V}$. Note that $W$ and $V$ do not contain columns and rows of the original matrix $A$, respectively, and hence properties such as sparsity and non-negativity found in the data are typically not carried over. As a result, we can consider a factorization that avoids this pitfall and bears a lot of similarity to the two-sided ID. This is achieved by the CUR decomposition
$$
A \approx CUR
$$
where $C \in \mathbb{R}^{n,k}$, $U \in \mathbb{R}^{k,k}$, and $R \in \mathbb{R}^{k,m}$. Here, $C$ contains columns of the original matrix and $R$ represents a subset of its rows. Both matrices inherit properties such as non-negativity, sparsity, and finally interpretability. The CUR decomposition is also known as the skeleton decomposition [164, 102, 277, 220, 114] and is closely related to a rank-revealing QR factorization [287, 119]. The point of departure for computing the CUR decomposition is typically a low-rank factorization of the matrix $A$. Both the truncated SVD (5) and the two-sided ID (8) are the typical initial factorizations. Given a truncated SVD, the crucial algorithmic step is to select the index sets $I$ and $J$. For this one typically computes leverage scores as sums over the rows of the matrices $U_k$ and $V_k$ coming from the truncated SVD $A \approx U_k S_k V_k^\top$, e.g. $\ell_j = \sum_{i=1}^k U_{j,i}^2$ for $j = 1, \ldots, n$ for the row selection, and analogously with $V_k$ for the column selection.
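A hedged sketch of this leverage-score construction follows; for simplicity it uses a deterministic top-$k$ selection rather than the sampling strategies from the cited literature, and the intersection matrix choice $U = C^\dagger A R^\dagger$ discussed next.

```python
import numpy as np

def leverage_cur(A, k):
    """CUR sketch: pick rows/columns via leverage scores of the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    row_scores = np.sum(U[:, :k] ** 2, axis=1)    # row leverage scores
    col_scores = np.sum(Vt[:k, :] ** 2, axis=0)   # column leverage scores
    I = np.sort(np.argsort(row_scores)[-k:])      # deterministic top-k picks
    J = np.sort(np.argsort(col_scores)[-k:])
    C, R = A[:, J], A[I, :]
    U_mid = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # U = C^+ A R^+
    return C, U_mid, R

rng = np.random.default_rng(3)
A = rng.random((200, 50)) @ rng.random((50, 80))  # non-negative, numerically low-rank data
C, U_mid, R = leverage_cur(A, k=15)
print(np.linalg.norm(A - C @ U_mid @ R) / np.linalg.norm(A))
```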
In [193] the authors provide a statistical interpretation of the leverage scores as a probability distribution. More recently, Sorensen and Embree [258] have introduced a procedure based on the discrete empirical interpolation method (DEIM) [50] where the selection of the column and row indices is based on a greedy projection technique resembling a pivoting strategy within the LU factorization. In [73] the authors compute the column and row subset using conditional expectations, with a more efficient numerical realization being recently introduced in [64]. It remains to compute the intersection matrix $U$ as $U = A(I, J)^{-1}$, with the other possibility being $U = C^\dagger A R^\dagger$, where $\dagger$ indicates the Moore-Penrose inverse. Recently, a perturbation analysis of the CUR decomposition was presented in [132].

Another important interpretable factorization of the matrix $A$ is the so-called non-negative matrix factorization (NMF) [104, 92, 77, 24]
$$
A^T \approx WH, \qquad (9)
$$
which is a low-rank approximation using $W \in \mathbb{R}^{m,k}$ and $H \in \mathbb{R}^{k,n}$ with component-wise non-negativity written as $W \ge 0$, $H \ge 0$. The interpretation of the columns $W(:, j) \in \mathbb{R}^m$ is that they form a basis of order $k$ that best approximates the data points $a_j$, i.e., the columns of $A^T$, via
$$
a_l = WH(:, l) = \sum_{j=1}^k W_{:,j} H_{j,l}.
$$
The coefficients of how the data are expanded in the basis defined by $W$ are stored in $H$. Such linear dimension reduction frameworks are found in various data tasks within image processing or text mining. Here again the non-negativity of the elements in $W$ means that the resulting matrix is more sparse than an SVD-based approach and provides an interpretable feature representation [187, 185]. The computation of an NMF is an NP-hard ill-posed problem [283] and it is typically based on solving the minimization problem
$$
\min_{W, H \ge 0} \left\|A^T - WH\right\|_F.
$$
Alternating minimization procedures, which consist of an alternating update of the factors $W$ and $H$, are typically employed and we refer to [103, 129, 190] for more details. It is also possible to include further constraints such as sparsity, as was done in [230, 229]. In [75] the authors analyze the relationship between the NMF of a kernel matrix and spectral clustering discussed later. For further improving the performance the authors in [47] include a kernel matrix as a regularization term for the objective function. This shows that kernel matrices, which are matrices changing the representation of the data, are often very useful, and we discuss these next.

The graph Laplacian

So far we focused on methods that directly utilize the data $A$ as a matrix. Often it is necessary to transform the data to a different representation. The goal is that for the transformed data the learning task is easier, and we now describe several approaches designed for that purpose. The data encoded in $A \in \mathbb{R}^{n,m}$ either have a natural representation as a graph with the nodes $v_j$ representing the associated feature vectors $a_j\ \forall j$, or they can be modeled that way. The result is a graph $G = (V, E)$ consisting of the nodes $v_j \in V$ and edges $e \in E$, where an edge $e$ consists of a pair of nodes. We here consider undirected graphs where an edge is typically equipped with an edge weight representing the strength of the connection between the corresponding nodes. Practically, the most relevant weight function is the Gaussian weight function
$$
w(v_i, v_j) = w_{ij} = \exp\left(-\|a_i - a_j\|^2 / \sigma^2\right). \qquad (10)
$$
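To make the construction concrete, here is a small NumPy sketch assembling the Gaussian weights (10) into a dense weight matrix, together with the degree matrix and Laplacian defined in the following paragraph. The dense $\mathcal{O}(n^2)$ assembly is only viable for small $n$, and the bandwidth $\sigma$ is an arbitrary choice.

```python
import numpy as np

def gaussian_weights(A, sigma):
    """Weight matrix (10) for data points stored as the rows of A, zero diagonal."""
    sq = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)  # ||a_i - a_j||^2
    W = np.exp(-sq / sigma**2)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 2))
W = gaussian_weights(A, sigma=0.5)
D = np.diag(W.sum(axis=1))          # degree matrix
L = D - W                           # graph Laplacian
print(np.linalg.eigvalsh(L)[:3])    # smallest eigenvalues; one (near) zero for this connected graph
```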
Many applications naturally have a graph structure, but one can also convert data to graph form and we refer to [286, Section 2.2] where different techniques are presented. Given a graph, the weights are collected into a matrix $W \in \mathbb{R}^{n,n}$ where the diagonal is set to zero. If two nodes are not connected in the graph, the associated matrix entry is set to zero. A particularly interesting and challenging example is a fully connected graph where all data points are compared pairwise, also resulting in a dense matrix $W$. As a second ingredient we compute the diagonal degree matrix $D$ where $d_{ii} = \sum_{j=1}^n w(v_i, v_j)$. We then obtain the graph Laplacian
$$
L = D - W,
$$
which is often used in a normalized form, either as the symmetric normalized Laplacian
$$
L_{sym} = I - D^{-1/2} W D^{-1/2}
$$
or as the random walk Laplacian $L_{rw} = D^{-1}L$. The properties of the graph Laplacian are discussed in [286, 55]. In more detail, we can see that $L$ is symmetric and its positive semi-definiteness follows from
$$
u^\top L u = \frac{1}{2} \sum_{i,j=1}^n w_{ij} \left(u_i - u_j\right)^2. \qquad (11)
$$
This relation only changes slightly when $L_{sym}$ is used. In the context of data science, many of the properties of the graph Laplacian can be utilized for tasks such as clustering. In particular, the eigeninformation of $L$ and $L_{sym}$ is crucial. For example, the number of zero eigenvalues gives information about the number of connected components of the underlying graph. The eigenvector corresponding to the first non-zero eigenvalue is known as the Fiedler vector. As the eigenvector corresponding to the zero eigenvalue is the constant vector $c[1, \ldots, 1]^T$ with $c \in \mathbb{R}$, and since the Fiedler vector is orthogonal to it, we must have sign changes in the Fiedler vector. This makes the Fiedler vector a first candidate to perform clustering simply by the sign of its entries [279, 76, 27, 49]. In fact, one of the classical tasks that is performed using the eigeninformation of the graph Laplacian is spectral clustering [286], where one computes the first $k$ eigenvectors $\phi_1, \ldots, \phi_k$. As the graph Laplacian translates the original data $A$ into an alternative space encoded into the new matrices $W$ and $D$, its eigeninformation will lead to a different clustering behavior than traditional methods such as k-means [260]. It can be seen from fig. 2 that standard k-means clustering based on $A$ [134, 158] shows poorer performance when compared against spectral clustering [286]. In more detail, the most common spectral clustering methods proceed by using k-means on the rows of the eigenvector matrix
$$
\Phi_k = [\phi_1, \ldots, \phi_k]. \qquad (12)
$$
In [286] the author illustrates that performing spectral clustering solves relaxed versions of known graph cut problems, which aim at partitioning the vertices of a graph into disjoint subsets.
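The comparison shown in fig. 2 can be reproduced with a few lines of scikit-learn; the following sketch uses a nearest-neighbor affinity and illustrative parameter choices.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means acts on the raw coordinates and cuts straight through the moons.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering uses Laplacian eigenvectors of a neighborhood graph
# and recovers the curved cluster structure.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

for name, labels in [("k-means", km), ("spectral", sc)]:
    acc = max(np.mean(labels == y_true), np.mean(labels != y_true))  # labels only up to relabeling
    print(name, round(acc, 2))
```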
FIGURE 2 The typical two-moons data set with desired unsupervised classification into two classes. A standard scikit-learn k-means clustering applied on the left vs. scikit-learn spectral clustering (right). The curvature of the data is only correctly identified with the spectral clustering approach. (See https://scikit-learn.org.)

If instead of the graph Laplacian the two normalized versions are used, one obtains different results, for $L_{rw}$ see [251] and for $L_{sym}$ [216]. The hyperparameter $\sigma$ in the denominator of the weight function can be replaced by a local scaling with an additional hyperparameter describing the locality [307]. As hyperparameter we understand a parameter whose value is set before the learning process starts. Often the parameter is chosen in a heuristic way according to the performance of the algorithm using the graph Laplacian [216]. When interpreted in terms of Gaussian processes, the parameter $\sigma$ is related to the characteristic length-scale of the process [231].

As the foundation of the spectral clustering method is the computation of $k$ eigenvectors of the (normalized) graph Laplacian, it is important to be able to compute these eigenvectors efficiently. Be reminded that the matrices $L$, $L_{rw}$, and $L_{sym} = I - D^{-1/2}WD^{-1/2}$ are singular and the multiplicity of the zero eigenvalue corresponds to the number of connected components in the graph. (In https://people.eecs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html a connection to vibrating strings and standing waves is made that beautifully illustrates this property.) We are interested in computing the smallest eigenvalues of $L$, where the eigenvalues of the Laplacian are ordered as $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. Iterative eigenvalue algorithms typically converge towards the largest eigenvalues, and as a result we would need to invert the matrix $L$, which is not possible. Focusing on $L_{sym}$, it is obvious that the smallest eigenvalues of $L_{sym}$ can be computed from the largest eigenvalues of $D^{-1/2}WD^{-1/2}$. Methods for efficiently computing the eigeninformation of such a matrix often rely on Krylov subspaces [111, 263, 177, 238], where it is crucial to perform the matrix vector products with $D^{-1/2}WD^{-1/2}$ efficiently. If we assume that the graph consists of one connected component, $D$ is diagonal and invertible. The main cost of multiplying with $D^{-1/2}WD^{-1/2}$ comes from computing matrix vector products with $W$. For sparse graphs the matrix $W$ will itself be sparse and matrix vector products will be inexpensive. The main computational challenge is then encountered for non-sparse or fully connected graphs. For large data sets, matrix vector products with $W$ are often infeasible and hence more sophisticated techniques are needed. Recently, methods based on the non-equispaced Fourier transform [5], the (improved) fast Gauss transform [299, 209], or algebraic fast multipole methods [303, 195] have shown great potential, resulting in a complexity of $\mathcal{O}(n \log n)$ for the matrix vector products with a fixed number of columns $m$. While these methods provide great speed-ups, the dimensionality $m$ of the feature vectors still provides a significant challenge in computations and we return to this point in the next section. The basis $\Phi_k$ given in (12) is not only important for spectral clustering but also for a group of methods recently introduced for semi-supervised learning, i.e., methods that, given a small set of labeled data, classify the remaining unlabeled points simultaneously.
The methods are based on partial differential equation techniques from materials science modeling [46, 7], namely diffuse-interface methods, and have also been used in image inpainting [25]. The graph Laplacian then replaces the classical Laplacian operator, resulting in a differential equation based on the graph data via
$$
u_t = -\varepsilon L_{sym} u - \frac{1}{\varepsilon} \psi'(u) + \omega(f - u) \qquad (13)
$$
with $\psi(u)$ being a potential defined on the graph nodes enforcing two classes and $\omega$ incorporating penalization for deviation from the training data stored in $f$. The variable $\varepsilon$ is a hyperparameter related to the thickness of the interface region. Due to the large number of vertices, the dimensionality of (13) is vast. A projection using $\Phi_k$ reduces the PDE to a $k$-dimensional equation. Similar to model order reduction, the nonlinearity still needs to be evaluated in the large-dimensional space [50].

We briefly want to comment on the use of the graph Laplacian in image processing, where the pixels are often represented as the nodes in a graph, which would then lead to a fully connected graph. In this field the graph Laplacian is often used as a regularizer for denoising [159, 189, 232] or image restoration [160]. In [205] the construction of the Laplacian is performed patch-wise in both a local and a non-local fashion. A more in-depth discussion for applications in image processing can also be found in [53].

The graph Laplacian has also recently enjoyed wide applicability within deep learning, namely, as an essential ingredient within so-called Graph Convolutional Networks [139, 42, 162] where the equation at layer $l$ for semi-supervised learning becomes
$$
X^{(l+1)} = \sigma_l\left( \sum_{k=1}^{N_K} K^{(k)} X^{(l)} \Theta^{(l,k)} \right),
$$
with $\sigma_l$ an activation function and weights $\Theta$ that need to be learned. The crucial filter matrices $K^{(k)}$ are computed using the eigeninformation of the graph Laplacian associated with the input $X^{(0)}$. The matrices $K^{(k)}$ are composed from a $k$-dimensional filter space $\operatorname{span}\{\phi_1, \ldots, \phi_k\}$, typically of the form $K^{(k)} = U\phi_k(\Lambda)U^T$, where $U$ and $\Lambda$ are the eigenvector and eigenvalue matrix of the graph Laplacian or a slight modification of it. The name convolutional network stems from the fact that the transformation $U\phi_k(\Lambda)U^T$ can be interpreted as a graph Fourier transform [253, 70]. Similarly, classical convolutional networks apply a filter/convolution to the data to detect more structure within the data [174]. In more detail, for a signal $x \in \mathbb{R}^n$, with $n$ the number of nodes in the graph, $\hat{x} = U^T x$ is the graph Fourier transform and its inverse is given by $x = U\hat{x}$. It is clear that polynomial filters $\phi_j$ are easy to apply either directly, by multiplying with the matrices, or via an (approximate) eigendecomposition of the Laplacian. Many filters and efficient methods for their computation have been suggested and we refer to [162, 139, 6, 181, 254, 184, 313, 294] for some of them.

There have also been generalizations of the graph Laplacian to other settings. We in particular want to mention the case of hypergraphs [312, 311], where an edge is now a collection of possibly many nodes. Again, one can obtain a normalized Laplacian operator of the form $L = I - \Xi$ and perform spectral clustering based on the eigeninformation of this matrix. Hypergraphs are encountered in many applications such as biological networks [166], image processing [305], social networks [310] or music recommendation [43].
Hypergraphs have also been used in the context of semi-supervised learning [178, 228, 34] and in particular in convolutional neural networks based on hypergraphs [6, 298]. Graph Laplacians have also been used to analyze multilayer networks [165, 31] where a set of nodes can be connected in various layers (cf. fig. 3 for an illustration). The connections between the nodes and the various layers can be represented as a tensor but also using a (supra-)Laplacian [257]
$$
L = L^{(L)} + L^{(I)}
$$
with the intra-layer supra-Laplacian $L^{(L)} = \operatorname{blkdiag}(L_1, \ldots, L_K)$ and the inter-layer supra-Laplacian $L^{(I)} = L_I \otimes I$ with $L_I$ the inter-layer Laplacian. Again, the eigeninformation of the supra-Laplacian provides rich information about the network [257, 226, 271]. One can also obtain networks where the nodes are not connected across layers but rather only have intra-layer connections (cf. [201]). We do not discuss the full details of the various different network structures here but identify this as a very exciting area of future research.
FIGURE 3 A simple multilayer graph.

In the context of analyzing social relationships we want to mention signed networks, which are graphs with positive and negative edge weights. These networks are used to model friend and foe type relationships and we refer to [180, 248, 127, 179, 268] and the references mentioned therein for an overview of some of the crucial applications. Again, techniques such as spectral clustering [248, 200, 202], semi-supervised learning [198], and convolutional networks [72] are available to extract further information from the data. The difficulty for signed networks is that the classical graph Laplacian is not feasible as for example the sum of the weights could be zero, resulting in a non-invertible degree matrix. As a result, several competing Laplacians are possible (see [101, 268] for an overview).

The graph Laplacian is also an essential tool for analyzing complex networks via network motifs [20, 221], graph centralities [96, 98, 225], or community detection [214, 215]. It has also been suggested to replace the graph Laplacian by a deformed Laplacian or Bethe Laplacian [41, 239, 210] as
$$
H(s) = (s^2 - 1)I - sW + D
$$
for a parameter $s \in \mathbb{R}$ and $W$ the adjacency matrix of the graph. It has been shown that $H(s)$ corresponds to a non-backtracking random walk, which is a simple random walk that is conditioned not to jump back along the edge it has just traversed [170], and shows better performance when community detection is desired in very sparse graphs generated by the stochastic block model. In the next section, we turn our attention to the case when the kernel, i.e. (10), is not only part of the graph Laplacian, but is viewed as the defining element of embedding the data into a high-dimensional space.

Kernel methods

The transformation of the data via a so-called kernel function, like the Gaussian encountered for the graph Laplacian, is a technique that has been successfully applied in many data science tasks [250, 247, 211, 148]. In fig. 4 we illustrate a dataset that is difficult to linearly separate in two dimensions on the left. When the data is transformed via kernelization to three-dimensional space it is easily separable.

FIGURE 4
A typical two-dimensional dataset (left), difficult to separate linearly, vs. the embedding of the data into three-dimensional space (right) with the decision boundary shown in grey.

Kernel methods can be motivated from the evaluation of the function $f(x) = w^T x$ for a new data point $x$ by using the computed weights $w$ as minimizers of a least squares problem (3). By employing Lagrangian duality [247] we can write the weights as $w = \frac{1}{\lambda}\sum_i \alpha_i a_i \in \mathbb{R}^m$, where $\lambda$ is a regularization parameter, the $a_i \in \mathbb{R}^m$ are the vectors associated with the data and the $\alpha_i$ are the Lagrange multipliers. Inserting the new point $x$ gives $f(x) = \frac{1}{\lambda}\sum_i \alpha_i a_i^T x$, which relies on the evaluation of inner products $a_i^T x$. It turns out that the evaluation of this and other inner products can be replaced by a more general kernel function $k(\cdot, \cdot)$. We then write the above as
$$
f(x) = \frac{1}{\lambda} \sum_i \alpha_i k(a_i, x)
$$
using the kernel instead of the inner product. The goal within kernel methods is now to find a kernel that allows for better separability than the trivial kernel $k(a_i, x) = a_i^T x$. To understand the role of the kernel function let us look at the kernel [247]
$$
k(v, x) = \left(x^T v\right)^2
$$
and consider the feature mapping $\phi$ that maps the two-dimensional data into three-dimensional space via
$$
\phi(x) = \phi\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{bmatrix}
$$
and we can see that $k(x, v) = \phi(x)^T \phi(v)$. This even comes at a computational advantage as we do not need to compute the higher-dimensional $\phi$ vectors. This technique of avoiding the direct computation of the higher-dimensional nonlinear relation via the evaluation of a kernel function is known as the kernel trick [243, 247, 244]. It is clear now that the adjacency matrix of the graph Laplacian is also a kernel matrix for the particular choice of the Gaussian kernel. The assembly of the kernel for all data points $a_i$ into a matrix leads to a positive semi-definite Gram/kernel matrix $K$. In fact, under the name of kernel PCA the leading (now largest) eigenvectors of $K$ are computed to obtain principal components/directions in the data (cf. [246, 245]). The use of the kernel trick in machine learning is omnipresent. One of the simplest but powerful methods is the so-called kernel ridge regression (KRR), where the core problem is to minimize the function
$$
\|b - Au\|^2 + \lambda \|u\|^2
$$
with $b$ a vector encoding the training data and $u$ a vector of weights. This problem is then solved using a dual formulation resulting in a formulation based on the matrix $AA^T$, which consists of inner products of the feature vectors. A kernelization of $AA^T$ then leads to the kernel matrix $K$ for which we need to solve the linear system [5]
$$
(K + \lambda I)w = b.
$$
The system matrix is symmetric and positive definite for a positive regularization or ridge parameter $\lambda \in \mathbb{R}$. Such a system is typically solved numerically with the use of a preconditioned iterative solver such as the conjugate gradient method [111, 237, 142]. The key ingredients are the matrix vector products with $K + \lambda I$ and developing a preconditioner $P \approx K + \lambda I$. For the matrix vector product the non-equispaced fast Fourier transform (NFFT) can be used for a variety of different kernels [5] and in the case of Gaussian kernels many bespoke methods exist such as [299, 303, 195]. A method based on sketching and the random feature method [227] is introduced in [12], which can also be applied to various different kernel functions.
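For moderate problem sizes the KRR system can simply be assembled and solved directly before any preconditioning is needed; a minimal NumPy sketch with a Gaussian kernel follows (data, bandwidth, and ridge parameter are placeholders).

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gram matrix with entries exp(-||x_i - y_j||^2 / sigma^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / sigma**2)

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, (150, 2))                          # training inputs
b = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(150)  # noisy training targets

lam, sigma = 1e-3, 0.5
K = gaussian_kernel(X, X, sigma)
w = np.linalg.solve(K + lam * np.eye(len(X)), b)          # solve (K + lambda I) w = b

X_new = rng.uniform(-1, 1, (5, 2))
print(gaussian_kernel(X_new, X, sigma) @ w)               # kernel expansion at new points
```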
A flexible preconditioner, which changes in every iteration, combined with a suitable Krylov solver was presented in [259], whereas the authors in [249] construct a low-rank approximation preconditioner based on an interpolative decomposition and fast matrix vector products. A more difficult scenario arises when $m$, the dimensionality of the feature vectors $a_i \in \mathbb{R}^m$, gets larger. In this case, fast matrix vector multiplication becomes more difficult and suffers from the curse of dimensionality unless certain decay rates are imposed on the matrix entries [209]. In [301] the authors consider an example with $m = 90$ where they use a divide-and-conquer parallel algorithm that also comes with communication avoidance. For more details we refer to [301] and also the mentioned literature for competing methods. In [234] the authors use a technique based on randomization (cf. Section 4) for both the matrix vector product and the preconditioner in KRR, while a similar problem is solved in [304] via high performance computing approaches.

The power of kernelization has been exploited as an essential ingredient of support vector machines (SVM) [247, 281, 63]. In more detail, support vector machines are derived from maximizing the margin of the separating hyperplane. The support vectors are the closest data points to this hyperplane. In order to be able to obtain a nonlinear hyperplane, kernelization of the inner products is employed. The resulting kernel matrix is then at the heart of the quadratic program that needs to be solved for determining the support vectors [247], but the computation suffers from having to deal with dense kernel matrices. In [222] the sequential minimal optimization (SMO) method is introduced, which breaks the problem into smaller, analytically solvable problems. Kernel matrices are also crucial when solving the closely related support vector regression (SVR) problem [256, 88]. While both SVM and SVR remain popular methods, deep learning techniques have recently gained more popularity and have been combined with kernel techniques via graph convolution networks [139, 162] or as loss functions in convolutional networks [269, 192]. Often the linear algebra tasks associated with large scale kernel methods will involve randomization, which we discuss next.

Randomized methods

In their seminal review paper [131] the authors discuss the importance of the decompositional approach to matrix computations [79] (cf. Section 2). Due to the computational complexity it is not always straightforward to efficiently compute matrix factorizations such as the SVD or QR, since in many data science applications the matrix $A$ is so large that computing approximate factorizations is too costly. Additionally, the data might be corrupted or it might be desirable to avoid many passes over the data, so that classical methods need further thought. One of the key ingredients in allowing efficient numerical linear algebra is the process of randomization. Randomization has proven to be a valuable tool in various matrix computation tasks such as matrix vector products [62, 82] or low-rank approximations [29, 83]. These techniques typically follow the scheme of producing a skinny matrix $Q \in \mathbb{R}^{n,k+p}$, with $k$ the desired approximation rank and $p$ an oversampling parameter that is needed for theoretical guarantees, such that we obtain the orthogonal-projection approximation onto the subspace spanned by $Q$ via
$$
A \approx QQ^T A.
$$
From the information contained in $Q$ one then computes standard factorizations such as QR or SVD decompositions.
In more detail, we form the matrix $B = Q^T A \in \mathbb{R}^{k+p,m}$, which then gives the low-rank approximation $A \approx QB$.
Replacing $B$ by its SVD results in an approximate SVD of $A$ via
$$
A \approx QB = Q\left(U\Sigma V^T\right) = (QU)\Sigma V^T.
$$
The first stage, i.e., the computation of the matrix $Q$, follows the prototype algorithm illustrated in Algorithm 1.
Algorithm 1 Prototype algorithm from [131].
1. Draw a random $n \times (k+p)$ test matrix $G$ (typically drawn as a Gaussian random matrix).
2. Compute $Y = AG$.
3. Construct an orthogonal basis for $\operatorname{range}(Y)$, e.g. $[Q, R] = qr(Y)$.

We are typically only interested in a decomposition of order $k$ when proceeding to the second stage of the method, but note that $G$ is of dimension $k + p$ where $p$ is the oversampling parameter. The dimensionality of $G$ is crucial as our aim is to reduce the complexity as much as possible. Theoretical bounds for the approximation quality are given in [131] and depend on the parameters $k$ and $p$ as well as the matrix size and the $(k+1)$-st singular value of $A$. Such a randomized SVD has become an essential tool in many scientific computing and data science applications in areas such as uncertainty quantification [182], optimal experimental design [4], or computer vision [93].

One of the key applications of randomized methods within data science is their use for the approximation of kernel matrices [196, 85, 308, 183, 149, 26], where the desire is to avoid the computation of the full kernel matrix since the storage demand can be too large. In particular, the kernel matrix is approximated via
$$
A \approx (AQ)\left(Q^T A Q\right)^{-1} Q^T A^T
$$
where in the traditional setup the matrix $Q$ contains columns of the identity matrix drawing columns from the matrix $A$, and we obtain the so-called Nyström scheme
$$
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix} A_{11}^{-1} \begin{bmatrix} A_{11} & A_{12} \end{bmatrix}
= \begin{bmatrix} I \\ A_{21} A_{11}^{-1} \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \end{bmatrix}
= \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{21} A_{11}^{-1} A_{12} \end{bmatrix} \approx A.
$$
This means in order to approximate a kernel matrix $A$ one only needs to compute a small number of columns and rows, which has been shown to be very successful (cf. [224, 85]). The authors in [85] give an approximation bound using the diagonal entries of the kernel matrix and a probabilistic interpretation, whereas the authors in [220] provide a bound for the Chebyshev-norm of the pseudo-skeleton approximation and the $(k+1)$-st singular value of the kernel matrix. In [196] the author suggests to use a $Q$ obtained via Algorithm 1 and thus obtain a variant of the traditional Nyström method.

We already pointed out in Section 2.2 that interpretable decompositions can be obtained via QR or SVD approximations to $A$. As the author in [196] points out, computing a matrix that holds a basis for the column space of $A$ is crucial. For this the author in [196] computes $Y = AG$ and via a subspace iteration proceeds to iteratively refine the matrix $Y$ using the update $Y \leftarrow AA^T Y$. The resulting $Y$ is then used to identify the relevant row-indices for the ID. In the same way one can obtain the two-sided ID and also a randomized CUR decomposition, where the details are given in [196]. Note that algorithms for directly computing the CUR decomposition compute some sort of statistical score for the importance of the columns/rows via the truncated SVD [193, 289, 37]. When obtaining the CUR from the two-sided ID we get the index sets from the ID matrices. The usefulness of the importance sample or sampling statistics is illustrated in [39], where the author shows that operations from numerical linear algebra such as inner products or matrix vector products can be performed using randomized numerical linear algebra. The key ingredient when multiplying $AB$ or $a^T b$ is the sampling strategy for producing the approximations. The strategy is often based on an importance sampling strategy [84, 86] to draw elements or rows/columns of a matrix for further processing. For example, the authors in [94] use the vector $q$ of probabilities
$$
q_j = \frac{\|A(:, j)\| \, \|B(j, :)\|}{\sum_{i=1}^n \|A(:, i)\| \, \|B(i, :)\|}, \quad j = 1, \ldots, n,
$$
obtained from the minimization of the expected value of the variance of the inner product/matrix vector product, to produce an approximation $C \approx AB$ based on columns of $A$ and rows of $B$.
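A compact NumPy sketch of the two-stage randomized SVD built on Algorithm 1 follows; the test matrix is sized to match the number of columns of $A$, and all sizes and ranks are illustrative.

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Two-stage randomized SVD: Algorithm 1 plus the SVD of B = Q^T A."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((A.shape[1], k + p))  # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ G)                    # orthonormal basis for range(A G)
    B = Q.T @ A                                   # small (k+p) x m matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]      # A ~ (Q Ub) S V^T

rng = np.random.default_rng(6)
A = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 300))
U, s, Vt = randomized_svd(A, k=20)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))
```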
For the solution of a least squares problem such as (3) [87, 39], the (random) selection process is decoupled from a deterministic computation via a traditional numerical linear algebra method. In particular, the first stage computes a score based on e.g. Euclidean norms, a truncated SVD matrix, or columns of a truncated Hadamard matrix. This means that the original problem is separated into a random sampling procedure leading to a reduced formulation that is then solved via a deterministic NLA approach. Due to the success in data science applications, randomized methods have also penetrated classical problems in scientific computing such as solving linear systems of equations [115, 276, 213], eigenvalue problems [242, 118] or inverse problems [296, 295, 282]. A recent survey can be found in [197]. One area where randomization has helped greatly with the evaluation of complex mathematical expressions is the computation of functions of matrices, which we want to describe now.
Functions of matrices

The concept of evaluating a function of a matrix $f: \mathbb{R}^{n,n} \to \mathbb{R}^{n,n}$, where $f$ is not meant to be evaluated element-wise, is an old but still very relevant one and we refer to [144] for an excellent introduction to the topic. Matrix functions appear in a variety of applications, with a particularly important example given by Gaussian processes [236], a ubiquitous tool in machine learning and statistics, or in the analysis of complex networks [97, 22]. One can define the matrix function via the Cauchy integral theorem as
$$
f(A) = \frac{1}{2\pi i} \int_\Gamma f(z)(zI - A)^{-1} \, dz \qquad (14)
$$
provided $f$ is analytic on and inside a closed contour $\Gamma \subset \mathbb{C}$ that encloses the spectrum of $A$. There are other definitions that are equivalent for analytic functions (cf. [144]). Matrix functions provide a large number of challenges for numerical methods and an early discussion with 19 dubious methods for the matrix exponential can be found in [207], with an update provided in [208] adding Krylov subspace methods as the twentieth method. One typically considers two different setups: the first is the computation or approximation of the matrix $f(A)$ explicitly, and the other is the evaluation of $f(A)b$, where the explicit computation of $f(A)$ and its subsequent application to $b$ is avoided. For a recent survey on evaluating $f(A)b$ we refer to [125]. Efficient techniques typically depend on the size of the data matrix $A$. The approximation of $f(A)b$ employing the Cauchy integral formula (14) via contour integration was given in [130]. Often the main work goes into solving linear systems with $(zI - A)$, and once direct solvers become infeasible, approaches based on Krylov subspace methods have shown very good performance for approximating the evaluation of $f(A)b$ [146, 167, 89, 124]. For the more complex case of computing $f(A)$ we again refer to [144] for an overview of suitable methods. We here want to mention certain scenarios in data science applications that lead us to such matrix functions.

Throughout scientific computing and engineering, fractional Laplacians, where instead of second order derivatives one considers derivatives of arbitrary orders, have gained much popularity in recent years [223] and one way to evaluate these efficiently is using matrix functions, e.g. [45]. Fractional powers of the Laplacian have recently become more popular for semi-supervised learning [17, 69], where to the best of our knowledge techniques based on the full eigendecomposition have been employed, which for large graphs quickly becomes infeasible. In [199] a multilayer graph is considered where arbitrary matrix powers of the layer Laplacians are evaluated using the polynomial Krylov subspace method (PKSM) [144] and using contour integrals [201].

We return again to the study of complex networks and are now interested in network centrality [97] as this helps in identifying the most important nodes, such as the most influential people in a social network. We assume that the network is understood as an undirected graph and describe some of the most important centrality measures. One measure of centrality is the degree matrix $D$ of the graph Laplacian, assigning the degree $d_{ii}$ to every node. Other centrality measures are of great use for understanding the underlying data. For example, the eigenvector centrality [32] is defined as
$$
b_i = \frac{1}{\lambda_1} \left(W \phi_1\right)_i
$$
with $W$ the adjacency matrix of the network and $(\lambda_1, \phi_1)$ the Perron-Frobenius eigenvalue and eigenvector, respectively.
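A minimal power-iteration sketch for the eigenvector centrality just defined is given below, assuming a connected, non-bipartite, undirected graph so that the iteration converges to the Perron eigenvector.

```python
import numpy as np

def eigenvector_centrality(W, max_iter=1000, tol=1e-12):
    """Power iteration for the Perron eigenpair (lambda_1, phi_1) of W."""
    phi = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(max_iter):
        nxt = W @ phi
        nxt /= np.linalg.norm(nxt)
        if np.linalg.norm(nxt - phi) < tol:
            break
        phi = nxt
    lam = phi @ (W @ phi)   # Rayleigh quotient approximates lambda_1
    return phi, lam         # b = (1/lambda_1) W phi equals phi at convergence

# small undirected example: a path 0-1-2-3 with one extra edge closing a triangle
W = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (1, 3)]:
    W[i, j] = W[j, i] = 1.0
phi, lam = eigenvector_centrality(W)
print(phi, lam)
```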
The authors in [97] illustrate that powers of the adjacency matrix $W$ provide essential information about the graph. In particular, studying the $(i,j)$-th entry of $W^n$ allows to count the number of different walks of length $n$ getting from node $i$ to node $j$. For example, the diagonal entries of $W^2$ give the degrees of the nodes in the network. Note that powers of the graph Laplacian are considered in [1] for the purpose of spectral clustering and within graph convolutional networks for semi-supervised learning in [188]. The use of $W^2$ suggests to consider higher powers of the adjacency matrix for longer walks, and one obtains
$$
\left(I + W + \frac{W^2}{2!} + \frac{W^3}{3!} + \frac{W^4}{4!} + \ldots\right)_{ii} = \exp(W)_{ii} = e_i^T \exp(W) e_i
$$
as the so-called subgraph centrality for node $i$ (cf. [95, 98]), with $e_i$ the corresponding unit vector. The quantity $\operatorname{trace}(\exp(W))$ is defined as the Estrada index of a network [105, 71, 66]. Due to the high complexity, the explicit computation of the matrix function $\exp(W)$ should be avoided, especially if complex networks are considered. It is well known that expressions of the form $u^T f(A) u$ can be well approximated by $e_1^T f(T_k) e_1$, where $T_k$ is the tridiagonal matrix coming from the Lanczos process applied to $A$ and $u$ is assumed to have norm one. In particular, for our example we have $u = e_i$, $f(\cdot) = \exp(\cdot)$, and $A = W$. As the authors in [108, 107] show, there is a beautiful relation between $u^T f(A) u$ and Gauss quadrature, with first results going back to [112]. In particular, the authors in [21] use the error estimates that stem from studying the relation between the Lanczos process and Gauss quadrature. In [33] the authors also rely on the power of Gauss quadrature to compute Katz scores and commute times between nodes in a network. Using the Lanczos process or other Krylov subspace methods [108, 109, 265] to approximate quantities of the form $u^T f(A) u$ also proves essential for many applications (cf. [11, 151, 10]) when the trace of a matrix (function) has to be estimated via
$$
\operatorname{trace}(f(A)) \approx \frac{1}{s} \sum_{i=1}^s v_i^T f(A) v_i
$$
where we rely on $s$ carefully chosen random vectors $v_i$. One such estimator is the so-called Hutchinson estimator [13], where the entries of the $v_i$ are $\pm 1$ and the numerical computation again relies on the Lanczos process and its efficient approximation of the spectrum of $A$ [278, 204, 18].

Gaussian processes [231, 191, 236] are another essential tool within statistics and data science. Their evaluation poses a significant numerical challenge as it relies on evaluating matrix functions, as we will illustrate now. Following [78], a Gaussian process is a collection of random variables with a joint Gaussian probability distribution. Let us consider $X = \{x_1, \ldots, x_n\}$ with all the $x_i \in \mathbb{R}^d$. The Gaussian process can then define a distribution over functions $f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$ with mean $\mu(x): \mathbb{R}^d \to \mathbb{R}$ and covariance function $k(x, x'): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Now $f_X \in \mathbb{R}^n$ represents the vector of the function values $f(x_i)$, $\mu_X \in \mathbb{R}^n$ the evaluation of $\mu(x_i)$, and the matrix $K_{XX} \in \mathbb{R}^{n,n}$ represents the evaluation of $k(x_i, x_j)\ \forall i, j$. Then it holds that $f_X \sim \mathcal{N}(\mu_X, K_{XX})$. The chosen covariance kernel depends on hyperparameters $\theta$ and we assume that we are given a set of noisy function values $y \in \mathbb{R}^n$ with noise variance $\sigma^2$. Assuming a Gaussian prior distribution depending on $\theta$, one obtains the log marginal likelihood as
$$
\mathcal{L}(\theta \mid y) = -\frac{1}{2} \left[ (y - \mu_X)^T \left(K_{XX} + \sigma^2 I\right)^{-1} (y - \mu_X) + \log\left(\det\left(K_{XX} + \sigma^2 I\right)\right) + n \log(2\pi) \right].
$$
It is obvious that for the numerical optimization of the log marginal likelihood the evaluation of the log-determinant is crucial. If the matrix $K_{XX}$ is small then we can afford the computation via the Cholesky decomposition, which is an $\mathcal{O}(n^3)$ computation. An efficient evaluation is then given via
$$
\log(\det(W)) = 2 \sum_{i=1}^n \log(L_{ii})
$$
where $W = LL^T$ is the Cholesky decomposition of the symmetric positive definite matrix $W$. In [235, 236] the authors show that the methods for approximating the log-determinant become more efficient if one works with Markovian fields (cf. [255] for approaches beyond Cholesky for Markovian fields).
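For moderate $n$ the Cholesky-based formula can be used directly; a small NumPy sketch (the kernel matrix and noise level are placeholders):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 3))

# symmetric positive definite test matrix: Gaussian kernel plus a noise term
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0) + 1e-2 * np.eye(len(X))

L = np.linalg.cholesky(K)                     # K = L L^T
logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log det(K) = 2 sum_i log(L_ii)

print(logdet, np.linalg.slogdet(K)[1])        # agrees with NumPy's slogdet
```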
In general the Cholesky decomposition will be too expensive and again Krylov subspace methods such as the Lanczos process and its connection to Gauss quadrature are exploited. From the relation
$$
\exp(\operatorname{trace}(\log(W))) = \exp\left(\operatorname{trace}\left(X \log(\Lambda) X^{-1}\right)\right) = \exp\left(\sum_i \log(\lambda_i)\right) = \prod_i \lambda_i = \det(W)
$$
we obtain the following beautiful relationship
$$
\log(\det(W)) = \operatorname{trace}(\log(W)).
$$
This shows that in order to approximate the log-determinant, it is again crucial to efficiently evaluate a matrix function, namely
$$
\operatorname{trace}(\log(A)) \approx \frac{1}{s} \sum_{i=1}^s v_i^T \log(A) v_i.
$$
The Lanczos-based approximation has been used in [186, 15, 278] and the authors in [78] combine these ideas with fast matrix vector products by modified structured kernel interpolation [292], allowing a good approximation of the log-determinant and also its derivatives. We refer to [145] for an overview of software for the computation of matrix functions in various programming languages.

High-dimensional problems

All of the above problems are tailored to the case when the data are given as a matrix $A \in \mathbb{R}^{n,m}$, but often it is also natural for the data to be given as a tensor
$$
\mathcal{A} \in \mathbb{R}^{n_1 \times n_2 \times \ldots \times n_d} \qquad (15)
$$
with the dimensions $n_i \in \mathbb{N}$ for all $d$ modes. It is obvious that the storage requirements for a full tensor $\mathcal{A}$ quickly surpass the available resources in many applications. This exponential increase in relation to the parameter $d$ is often referred to as the curse of dimensionality. Tensors and their approximations are closely related to the representation of a nonlinear function in a high-dimensional space [74]
$$
f(x_1, \ldots, x_d) \approx \sum_{j=1}^l \alpha_j \phi_j(x_1, \ldots, x_d) \qquad (16)
$$
where the choice of the functions $\phi_j$ is crucial and these could be a basis, a frame, or a dictionary of functions [272, 44, 61, 90, 157]. In this case it is desirable to find an approximation to $f(x_1, \ldots, x_d)$ with some form of sparsity, i.e., an as small as possible number of terms $l$. Alternatively, one can interpret the function as being defined on a tensor product space and aim at designing a low-rank approximation to $f$ (cf. [217] for example). The approximation of $f$ in (16) is then performed by relying on similar formats and tools as the ones that are used for the approximation of a given tensor $\mathcal{A}$ in (15), which we briefly describe now. For this reason, approximating the tensor has a long tradition in computational mathematics, physics, chemistry, and other disciplines. While being already 10 years old, we would like to point to [168] for a seminal overview of numerical tensor methods due to its introductory character and broad overview. There are several low-rank approximations to the full tensor $\mathcal{A}$ that correspond to different tensor formats, such as the classical CANDECOMP/PARAFAC (CP) formulation [161] where
$$
\mathcal{A} = \sum_{j=1}^R \lambda_j \, u_j^{(1)} \circ \cdots \circ u_j^{(d)}, \qquad (17)
$$
with $\circ$ being the dyadic product. While this format is the most natural, it has properties that make it not ideal for (many) applications, e.g. the set of rank-$R$ tensors is not closed, the computation is ill-posed, etc. (cf. [116] for more details). Alternatively, one can consider the Tucker format/HOSVD [67, 68], which somewhat generalizes the SVD to the multidimensional case but still suffers from the curse of dimensionality due to the $d$-dimensional core tensor needed. This is overcome by the tensor-train format [219], which consists of $d$ core tensors of order at most three. In the context of data science, tensors have long been an essential tool and to list all applications and techniques goes beyond the scope of this paper.
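To give a flavor of the tensor-train format, the following is a hedged NumPy sketch of the standard TT-SVD construction via successive SVDs of unfoldings; the tolerance and tensor sizes are arbitrary.

```python
import numpy as np

def tt_svd(A, eps=1e-10):
    """TT-SVD: successive SVDs of unfoldings yield d cores of order at most three."""
    dims, d = A.shape, A.ndim
    cores, r = [], 1
    M = A.reshape(dims[0], -1)
    for i in range(d - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        r_new = max(1, int(np.sum(s > eps * s[0])))   # truncate tiny singular values
        cores.append(U[:, :r_new].reshape(r, dims[i], r_new))
        M = (np.diag(s[:r_new]) @ Vt[:r_new]).reshape(r_new * dims[i + 1], -1)
        r = r_new
    cores.append(M.reshape(r, dims[-1], 1))
    return cores

rng = np.random.default_rng(8)
A = rng.random((6, 7, 8))
cores = tt_svd(A)
B = cores[0]
for G in cores[1:]:
    B = np.tensordot(B, G, axes=1)       # contract the shared rank index
print(np.linalg.norm(B.squeeze() - A))   # exact up to roundoff when no rank is truncated
```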
All of the above problems are tailored to the situation where the data is given as a matrix $A \in \mathbb{R}^{n,m}$, but often it is also natural for the data to be given as a tensor
\[
A \in \mathbb{R}^{n_1 \times n_2 \times \dots \times n_d} \tag{15}
\]
with dimensions $n_i \in \mathbb{N}$ for all $d$ modes. It is obvious that the storage requirements for a full tensor $A$ quickly surpass the available resources in many applications. This exponential growth with respect to the parameter $d$ is often referred to as the curse of dimensionality. Tensors and their approximations are closely related to the representation of a nonlinear function in a high-dimensional space [74],
\[
f(x_1, \dots, x_d) \approx \sum_{j=1}^{l} \alpha_j \phi_j(x_1, \dots, x_d), \tag{16}
\]
where the choice of the functions $\phi_j$ is crucial; these could form a basis, a frame, or a dictionary of functions [272, 44, 61, 90, 157]. In this case it is desirable to find an approximation to $f(x_1, \dots, x_d)$ with some form of sparsity, i.e., with as small a number of terms $l$ as possible. Alternatively, one can interpret the function as being defined on a tensor product space and aim at designing a low-rank approximation to $f$ (cf. [217] for an example). The approximation of $f$ in (16) is then performed by relying on similar formats and tools as the ones used for the approximation of the given tensor $A$ in (15), which we briefly describe now. Approximating a given tensor has a long tradition in computational mathematics, physics, chemistry, and other disciplines. While already ten years old, we would like to point to [168] for a seminal overview of numerical tensor methods due to its introductory character and broad scope. There are several low-rank approximations to the full tensor $A$ that correspond to different tensor formats, such as the classical CANDECOMP/PARAFAC (CP) formulation [161], where
\[
A = \sum_{j=1}^{R} \lambda_j \, u_j^{(1)} \circ \dots \circ u_j^{(d)}, \tag{17}
\]
with $\circ$ being the dyadic (outer) product. While this format is the most natural, it has properties that make it not ideal for (many) applications, e.g. the set of rank-$R$ tensors is not closed and the associated approximation problem can be ill-posed (cf. [116] for more details). Alternatively, one can consider the Tucker format/HOSVD [67, 68], which somewhat generalizes the SVD to the multidimensional case but still suffers from the curse of dimensionality due to the $d$-dimensional core tensor it requires. This is overcome by the tensor-train format [219], which consists of $d$ core tensors of order at most three. In the context of data science, tensors have long been an essential tool, and listing all applications and techniques goes beyond the scope of this paper. We here only list a few results that address questions similar to the matrix cases mentioned above: non-negative factorizations [59, 60, 54], interpolative decomposition [241], CUR [64, 194], kernel methods [270, 138, 137], semi-supervised learning [120], and randomization [16, 290]. This list is far from complete, and for more detailed reviews on tensor methods for data science we refer to the recent surveys [57, 58, 56] and the references given therein. For a very broad overview of recent techniques and literature for numerical tensor methods beyond data science we refer to [116].
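As a small illustration of why such formats help against the curse of dimensionality, the following sketch (plain NumPy, with made-up sizes) assembles a rank-$R$ CP tensor as in (17) from its factor vectors and compares the storage costs; in practice one of course works with the factors directly and never forms the full tensor:

```python
import numpy as np

def cp_to_full(lmbda, factors):
    """Assemble A = sum_j lambda_j u_j^(1) o ... o u_j^(d) from CP factor
    matrices U^(m) of size n_m x R (for illustration only)."""
    R = lmbda.size
    A = np.zeros(tuple(U.shape[0] for U in factors))
    for j in range(R):
        term = lmbda[j]
        for U in factors:                     # successive outer products
            term = np.multiply.outer(term, U[:, j])
        A += term
    return A

d, n, R = 4, 10, 3                            # made-up dimensions
rng = np.random.default_rng(1)
lmbda = rng.standard_normal(R)
factors = [rng.standard_normal((n, R)) for _ in range(d)]
A = cp_to_full(lmbda, factors)                # full tensor: n^d = 10^4 entries
print(A.shape, d * n * R + R)                 # CP format: d*n*R + R numbers
```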
One of the key areas in which tensors prove to be very powerful tools is the field of deep learning, which we now briefly review.

The study of computational and mathematical aspects of deep learning is glowing white hot with activity, and we only want to point to a number of places where linear algebra is used for better understanding or improving the performance of deep learning methods. In the most generic setup, the task in an artificial neural network is to determine the weight matrices $W^{(j)}$ and bias vectors $b^{(j)}$ by minimizing a loss function, so as to obtain the function
\[
F(x) := W^{(L)} \sigma\big( \dots \sigma\big( W^{(2)} \sigma\big( W^{(1)} x + b^{(1)} \big) + b^{(2)} \big) \dots \big) + b^{(L)}; \tag{18}
\]
see [143, 113, 175] for more information and further references. The loss function typically consists of a sum of many terms due to the large number of training data, and as a result the optimization is based on gradient descent schemes [36]. For classical optimization problems it is often desirable to use second-order methods due to their superior convergence properties. These typically require the use of direct or iterative solvers to compute the step towards optimality, e.g. solving with the Hessian matrix in a Newton method. As a result, the applicability of Newton-type schemes for the computation of the $W^{(j)}$ and $b^{(j)}$ has received more attention, with a strong focus on exploiting the structure of the Hessian matrix [35, 65, 288, 51, 240, 284].

The computational complexity of neural networks is challenging on many levels. In order to reduce the numerical effort, the authors in [273, 297] use the SVD to analyze the weight matrices and then restructure the neural network accordingly, while the authors in [302] use the condition numbers of the weight matrices for such a restructuring. The weight matrices have received further attention: in [218] the authors compress the weight matrices of fully connected layers using a low-rank tensor approximation, namely the tensor-train format [219]. Additionally, low-rank approximations are also useful for speeding up the evaluation of convolutional neural networks [152] by using a low-rank representation of the filters that are used for detecting image features. For a very similar task the authors in [173, 122, 123] rely on optimized tensor decompositions. Note that recently these techniques have also been applied to adversarial networks [48]. A generic sketch of the SVD-based compression of a single dense layer is given at the end of this section.

The connection of neural network architectures to optimal control problems, as introduced in [126, 121], makes heavy use of PDE-constrained optimization techniques, including efficient matrix-vector products. This topic has also recently received more attention from within the machine learning community [52] and promises to be a very interesting field for combining traditional methods from numerical analysis with deep learning. As already pointed out in Section 3.1, graph convolutional networks (GCNs) have shown great potential and have recently received much popularity as tools for semi-supervised learning [162]. The expressive power of representing data as graphs that is the basis of GCNs has led to using their methodology for adversarial networks [100], matrix completion [23], and within graph autoencoders [163]. This is only a brief glimpse of where numerical linear algebra techniques have enhanced deep learning methodology. We believe that this will be an area of intense research in the future.
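The promised sketch of the SVD-based compression idea follows; it is a generic illustration, not the specific procedure of any of the cited works, and the layer sizes and rank are made up:

```python
import numpy as np

def compress_dense_layer(W, r):
    """Replace a dense layer x -> W x (W of size m x n) by two thinner layers
    x -> W2 (W1 x) with W ~ W2 W1 of rank r, reducing the number of weight
    parameters from m*n to r*(m + n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = Vt[:r, :]                      # r x n
    W2 = U[:, :r] * s[:r]               # m x r, columns scaled by singular values
    return W1, W2

# Made-up weight matrix that is approximately of low rank.
rng = np.random.default_rng(2)
m, n, r = 256, 512, 20
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) \
    + 0.01 * rng.standard_normal((m, n))
W1, W2 = compress_dense_layer(W, r)
x = rng.standard_normal(n)
err = np.linalg.norm(W @ x - W2 @ (W1 @ x)) / np.linalg.norm(W @ x)
print(f"parameters: {m * n} -> {r * (m + n)}, relative error {err:.1e}")
```

Whether such a truncation is harmless depends on the decay of the singular values of the trained weight matrix, which is exactly the kind of structural information the cited restructuring approaches exploit.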
We have illustrated with this literature review that numerical linear algebra is alive and kicking. In a sense, the rise of machine learning, big data, and data science shows that old methods are still very much in fashion, but also that new techniques are required to either meet the demands of practitioners or to account for the complexity and size of the data.
ACKNOWLEDGMENTS
The author would like to thank Dominik Alfke, Peter Benner, Pedro Mercado, and Daniel Potts for helpful comments on an earlier version of this manuscript. He is also greatly indebted to the two anonymous referees who greatly helped to improve the presentation.
References

[1] E. Abbe, E. Boix, P. Ralli, and C. Sandon, Graph powering and spectral robustness, arXiv preprint arXiv:1809.04818, (2018).
[2] S. Aghabozorgi, A. Seyed Shirkhorshidi, and T. Ying Wah, Time-series clustering – a decade review, Inf. Syst., 53 (2015), pp. 16–38.
[3] R. Albright, Taming text with the SVD, SAS Institute Inc, (2004).
[4] A. Alexanderian, N. Petra, G. Stadler, and O. Ghattas, A-optimal design of experiments for infinite-dimensional Bayesian linear inverse problems with regularized $\ell_0$-sparsification, SIAM J. Sci. Comput., 36 (2014), pp. A2122–A2148.
[5] D. Alfke, D. Potts, M. Stoll, and T. Volkmer, NFFT meets Krylov methods: Fast matrix-vector products for the graph Laplacian of fully connected networks, Front. Appl. Math. Stat., 4 (2018), p. 61.
[6] D. Alfke and M. Stoll, Semi-supervised classification on non-sparse graphs using low-rank graph convolutional networks, arXiv preprint arXiv:1905.10224, (2019).
[7] S. M. Allen and J. W. Cahn, A microscopic theory for antiphase boundary motion and its application to antiphase domain coarsening, Acta Metall., 27 (1979), pp. 1085–1095.
[8] E. Angerson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. Bischof, and D. Sorensen, LAPACK: A portable linear algebra library for high-performance computers, in Proceedings SUPERCOMPUTING '90, IEEE Computer Society Press, IEEE, 1990, pp. 2–11.
[9] W. E. Arnoldi, The principle of minimized iterations in the solution of the matrix eigenvalue problem, Quart. Appl. Math., 9 (1951), pp. 17–29.
[10] M. J. Atallah, F. Chyzak, and P. Dumas, A randomized algorithm for approximate string matching, Algorithmica, 29 (2001), pp. 468–486.
[11] H. Avron, Counting triangles in large graphs using randomized matrix trace estimation, in Workshop on Large-scale Data Mining: Theory and Applications, vol. 10, 2010, pp. 10–9.
[12] H. Avron, K. L. Clarkson, and D. P. Woodruff, Faster kernel ridge regression using sketching and preconditioning, SIAM J. Matrix Anal. & Appl., 38 (2017), pp. 1116–1138.
[13] H. Avron and S. Toledo, Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix, J ACM, 58 (2011), pp. 1–34.
[14] J. Baglama and L. Reichel, Augmented implicitly restarted Lanczos bidiagonalization methods, SIAM J. Sci. Comput., 27 (2005), pp. 19–42.
[15] Z. Bai, M. Fahey, G. H. Golub, M. Menon, and E. Richter, Computing partial eigenvalue sum in electronic structure calculations, Tech. Report SCCM-98-03, Stanford University, 1998.
[16] C. Battaglino, G. Ballard, and T. G. Kolda, A practical randomized CP tensor decomposition, SIAM J. Matrix Anal. & Appl., 39 (2018), pp. 876–901.
[17] E. Bautista, P. Abry, and P. Gonçalves, $L^\gamma$-PageRank for Semi-Supervised Learning, arXiv preprint arXiv:1903.06007, (2019).
[18] M. Bellalij, L. Reichel, G. Rodriguez, and H. Sadok, Bounding matrix functionals via partial global block Lanczos decomposition, Appl Numer Math, 94 (2015), pp. 127–139.
[19] P. Benner, S. Gugercin, and K. Willcox, A survey of projection-based model reduction methods for parametric dynamical systems, SIAM Rev., 57 (2015), pp. 483–531.
[20] A. R. Benson, D. F. Gleich, and J. Leskovec, Higher-order organization of complex networks, Science, 353 (2016), pp. 163–166.
[21] M. Benzi and P. Boito, Quadrature rule-based bounds for functions of adjacency matrices, Linear Algebra Appl., 433 (2010), pp. 637–652.
[22] M. Benzi and P. Boito, Matrix functions in network analysis, GAMM-Mitteilungen, (2020).
[23] R. v. d. Berg, T. N. Kipf, and M. Welling, Graph convolutional matrix completion, arXiv preprint arXiv:1706.02263, (2017).
[24] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, Algorithms and applications for approximate nonnegative matrix factorization, Comput Stat Data An, 52 (2007), pp. 155–173.
[25] A. L. Bertozzi, S. Esedoglu, and A. Gillette, Inpainting of Binary Images Using the Cahn–Hilliard Equation, IEEE Trans. on Image Process., 16 (2007), pp. 285–291.
[26] A. L. Bertozzi and A. Flenner, Diffuse interface models on graphs for classification of high dimensional data, Multiscale Model. Simul., 10 (2012), pp. 1090–1118.
[27] A. Bertrand and M. Moonen, Seeing the bigger picture: How nodes can learn their place within a complex ad hoc network topology, IEEE Signal Process. Mag., 30 (2013), pp. 71–82.
[28] F. Berzal and N. Marín, Data mining, SIGMOD Rec., 31 (2002), p. 66.
[29] E. Bingham and H. Mannila, Random projection in dimensionality reduction, in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '01, ACM, ACM Press, 2001, pp. 245–250.
[30] A. Björck, Stability of two direct methods for bidiagonalization and partial least squares, SIAM J. Matrix Anal. & Appl., 35 (2014), pp. 279–291.
[31] S. Boccaletti, G. Bianconi, R. Criado, C. I. del Genio, J. Gómez-Gardenes, M. Romance, I. Sendina-Nadal, Z. Wang, and M. Zanin, The structure and dynamics of multilayer networks, Phys. Rep., 544 (2014), pp. 1–122.
[32] P. Bonacich, Power and centrality: A family of measures, Am J Sociol, 92 (1987), pp. 1170–1182.
[33] F. Bonchi, P. Esfandiar, D. F. Gleich, C. Greif, and L. V. Lakshmanan, Fast matrix computations for pairwise and columnwise commute times and Katz scores, Internet Math., 8 (2012), pp. 73–112.
[34] J. Bosch, S. Klamt, and M. Stoll, Generalizing diffuse interface methods on graphs: Nonsmooth potentials and hypergraphs, SIAM J. Appl. Math., 78 (2018), pp. 1350–1377.
[35] A. Botev, H. Ritter, and D. Barber, Practical Gauss–Newton optimisation for deep learning, in Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, 2017, pp. 557–565.
[36] L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT'2010, Physica-Verlag HD, 2010, pp. 177–186.
[37] C. Boutsidis, M. W. Mahoney, and P. Drineas, An improved approximation algorithm for the column subset selection problem, in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, Society for Industrial and Applied Mathematics, Jan. 2009, pp. 968–977.
[38] C. Boutsidis and D. P. Woodruff, Optimal CUR matrix decompositions, SIAM J. Comput., 46 (2017), pp. 543–589.
[39] M. W. Mahoney, Randomized algorithms for matrices and data, Found. Trends Mach. Learn., 3 (2010), pp. 123–224.
[40] R. G. Brown, Smoothing, forecasting and prediction of discrete time series, Courier Corporation, 2004.
[41] J. Bruna and X. Li, Community detection with graph neural networks, Stat, 1050 (2017), p. 27.
[42] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, Spectral networks and locally connected networks on graphs, arXiv preprint arXiv:1312.6203, (2013).
[43] J. Bu, S. Tan, C. Chen, C. Wang, H. Wu, L. Zhang, and X. He, Music recommendation by unified hypergraph: combining social media information and music content, in Proceedings of the 18th ACM international conference on Multimedia, ACM, 2010, pp. 391–400.
[44] H.-J. Bungartz and M. Griebel, Sparse grids, Acta Numer., 13 (2004), pp. 147–269.
[45] K. Burrage, N. Hale, and D. Kay, An efficient implicit FEM scheme for fractional-in-space reaction-diffusion equations, SIAM J. Sci. Comput., 34 (2012), pp. A2145–A2172.
[46] J. W. Cahn and J. E. Hilliard, Free Energy of a Nonuniform System. I. Interfacial Free Energy, J Chem Phys, 28 (1958), pp. 258–267.
[47] D. Cai, X. He, J. Han, and T. S. Huang, Graph regularized nonnegative matrix factorization for data representation, IEEE Trans. Pattern Anal. Mach. Intell., 33 (2011), pp. 1548–1560.
[48] X. Cao, X. Zhao, and Q. Zhao, Tensorizing generative adversarial nets, in 2018 IEEE International Conference on Consumer Electronics - Asia (ICCE-Asia), IEEE, IEEE, June 2018, pp. 206–212.
[49] T. F. Chan, P. Ciarlet, and W. K. Szeto, On the optimality of the median cut spectral bisection graph partitioning method, SIAM J. Sci. Comput., 18 (1997), pp. 943–948.
[50] S. Chaturantabut and D. C. Sorensen, Nonlinear model reduction via discrete empirical interpolation, SIAM J. Sci. Comput., 32 (2010), pp. 2737–2764.
[51] C. Chen, S. Reiz, C. Yu, H.-J. Bungartz, and G. Biros, Fast evaluation and approximation of the Gauss-Newton Hessian matrix for the multilayer perceptron, arXiv preprint arXiv:1910.12184, (2019).
[52] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, Neural ordinary differential equations, in Adv Neural Inf Process Syst, 2018, pp. 6571–6583.
[53] G. Cheung, E. Magli, Y. Tanaka, and M. K. Ng, Graph spectral image processing, Proc. IEEE, 106 (2018), pp. 907–930.
[54] E. C. Chi and T. G. Kolda, On tensors, sparsity, and nonnegative factorizations, SIAM J. Matrix Anal. & Appl., 33 (2012), pp. 1272–1299.
[55] F. Chung, Spectral Graph Theory, no. 92, American Mathematical Society, Dec. 1996.
[56] A. Cichocki, Tensor networks for big data analytics and large-scale optimization problems, arXiv preprint arXiv:1407.3124, (2014).
[57] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. P. Mandic, Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 1 Low-Rank Tensor Decompositions, Found. Trends Mach. Learn., 9 (2016), pp. 249–429.
[58] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, M. Sugiyama, and D. P. Mandic, Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 2 Applications and Future Perspectives, Found. Trends Mach. Learn., 9 (2017), pp. 249–429.
[59] A. Cichocki, R. Zdunek, and S.-I. Amari, Nonnegative matrix and tensor factorization, IEEE Signal Process. Mag., 25 (2008), pp. 142–145.
[60] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations, John Wiley & Sons, Ltd, Sept. 2009.
[61] A. Cohen and R. DeVore, Approximation of high-dimensional parametric PDEs, Acta Numer., 24 (2015), pp. 1–159.
[62] E. Cohen and D. D. Lewis, Approximating matrix multiplication for pattern recognition tasks, Journal of Algorithms, 30 (1999), pp. 211–252.
[63] C. Cortes and V. Vapnik, Support-vector networks, Mach Learn., 20 (1995), pp. 273–297.
[64] A. Cortinovis and D. Kressner, Low-rank approximation in the Frobenius norm by column and row subset selection, arXiv preprint arXiv:1908.06059, (2019).
[65] F. Dangel and P. Hennig, A modular approach to block-diagonal Hessian approximations for second-order optimization methods, arXiv preprint arXiv:1902.01813, (2019).
[66] J. A. de la Peña, I. Gutman, and J. Rada, Estimating the Estrada index, Linear Algebra Appl., 427 (2007), pp. 70–76.
[67] L. De Lathauwer, Signal processing based on multilinear algebra, Katholieke Universiteit Leuven, 1997.
[68] L. De Lathauwer, B. De Moor, and J. Vandewalle, A multilinear singular value decomposition, SIAM J. Matrix Anal. & Appl., 21 (2000), pp. 1253–1278.
[69] S. De Nigris, E. Bautista, P. Abry, K. Avrachenkov, and P. Gonçalves, Fractional graph-based semi-supervised learning, in 2017 25th European Signal Processing Conference (EUSIPCO), IEEE, 2017, pp. 356–360.
[70] M. Defferrard, X. Bresson, and P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in Adv Neural Inf Process Syst, 2016, pp. 3844–3852.
[71] H. Deng, S. Radenkovic, and I. Gutman, The Estrada index, Applications of Graph Spectra, Math. Inst., Belgrade, (2009), pp. 123–140.
[72] T. Derr, Y. Ma, and J. Tang, Signed graph convolutional networks, in 2018 IEEE International Conference on Data Mining (ICDM), IEEE, IEEE, Nov. 2018, pp. 929–934.
[73] A. Deshpande and L. Rademacher, Efficient Volume Sampling for Row/Column Subset Selection, in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, IEEE, IEEE, Oct. 2010, pp. 329–338.
[74] R. A. DeVore, Nonlinear approximation, Acta Numer., 7 (1998), pp. 51–150.
[75] C. Ding, X. He, and H. D. Simon, On the equivalence of nonnegative matrix factorization and spectral clustering, in Proceedings of the 2005 SIAM International Conference on Data Mining, SIAM, Society for Industrial and Applied Mathematics, Apr. 2005, pp. 606–610.
[76] C. H. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, in Proceedings 2001 IEEE International Conference on Data Mining, IEEE, IEEE Comput. Soc, 2001, pp. 107–114.
[77] C. H. Ding, T. Li, and M. I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Trans. Pattern Anal. Machine Intell., 32 (2008), pp. 45–55.
[78] K. Dong, D. Eriksson, H. Nickisch, D. Bindel, and A. G. Wilson, Scalable log determinants for Gaussian process kernel learning, in Advances in Neural Information Processing Systems, 2017, pp. 6327–6337.
[79] J. Dongarra and F. Sullivan, Guest editors introduction to the top 10 algorithms, Comput. Sci. Eng., 2 (2000), pp. 22–23.
[80] D. Donoho, 50 years of data science, J Comput Graph Stat, 26 (2017), pp. 745–766.
[81] D. Donoho and V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts?, in Adv Neural Inf Process Syst, 2004, pp. 1141–1148.
[82] P. Drineas, R. Kannan, and M. W. Mahoney, Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication, SIAM J. Comput., 36 (2006), pp. 132–157.
[83] P. Drineas, R. Kannan, and M. W. Mahoney, Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix, SIAM J. Comput., 36 (2006), pp. 158–183.
[84] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff, Fast approximation of matrix coherence and statistical leverage, J Mach Learn Res., 13 (2012), pp. 3475–3506.
[85] P. Drineas and M. W. Mahoney, On the Nyström method for approximating a Gram matrix for improved kernel-based learning, J Mach Learn Res., 6 (2005), pp. 2153–2175.
[86] P. Drineas and M. W. Mahoney, RandNLA, Commun. ACM, 59 (2016), pp. 80–90.
[87] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós, Faster least squares approximation, Numer. Math., 117 (2010), pp. 219–249.
[88] H. Drucker, C. J. Burges, L. Kaufman, A. J. Smola, and V. Vapnik, Support vector regression machines, in Adv Neural Inf Process Syst, 1997, pp. 155–161.
[89] V. Druskin and L. Knizhnerman, Extended Krylov subspaces: Approximation of the matrix square root and related functions, SIAM J. Matrix Anal. & Appl., 19 (1998), pp. 755–771.
[90] D. Dũng, V. Temlyakov, and T. Ullrich, Hyperbolic Cross Approximation, Springer International Publishing, 2018.
[91] L. Eldén, Partial least-squares vs. Lanczos bidiagonalization—I: Analysis of a projection method for multiple regression, Comput Stat Data An, 46 (2004), pp. 11–31.
[92] L. Eldén, Matrix methods in data mining and pattern recognition, vol. 15, SIAM, 2019.
[93] N. B. Erichson and C. Donovan, Randomized low-rank dynamic mode decomposition for motion detection, Comput Vis Image Underst, 146 (2016), pp. 40–50.
[94] S. Eriksson-Bique, M. Solbrig, M. Stefanelli, S. Warkentin, R. Abbey, and I. C. F. Ipsen, Importance sampling for a Monte Carlo matrix multiplication algorithm, with application to information retrieval, SIAM J. Sci. Comput., 33 (2011), pp. 1689–1706.
[95] E. Estrada, Characterization of 3D molecular structure, Chem. Phys. Lett., 319 (2000), pp. 713–718.
[96] E. Estrada, N. Hatano, and M. Benzi, The physics of communicability in complex networks, Phys. Rep., 514 (2012), pp. 89–119.
[97] E. Estrada and D. J. Higham, Network properties revealed through matrix functions, SIAM Rev., 52 (2010), pp. 696–714.
[98] E. Estrada and J. A. Rodríguez-Velázquez, Subgraph centrality in complex networks, Phys. Rev. E, 71 (2005), p. 056103.
[99] J. F. Hair Jr, M. Sarstedt, L. Hopkins, and V. G. Kuppelwieser, Partial least squares structural equation modeling (PLS-SEM), European Business Review, 26 (2014), pp. 106–121.
[100] S. Fan and B. Huang, Labeled graph generative adversarial networks, CoRR, abs/1906.03220 (2019).
[101] J. Gallier, Spectral theory of unsigned and signed graphs. Applications to graph clustering: A survey, arXiv preprint arXiv:1601.04692, (2016).
[102] F. Gantmacher, The theory of matrices, Volume I, (1964), pp. 95–103.
[103] N. Gillis, The why and how of nonnegative matrix factorization, Regularization, Optimization, Kernels, and Support Vector Machines, 12 (2014).
[104] N. Gillis, The why and how of nonnegative matrix factorization, Regularization, Optimization, Kernels, and Support Vector Machines, 12 (2014).
[105] Y. Ginosar, I. Gutman, T. Mansour, and M. Schork, Estrada index and Chebyshev polynomials, Chem. Phys. Lett., 454 (2008), pp. 145–147.
[106] G. Golub and W. Kahan, Calculating the singular values and pseudo-inverse of a matrix, Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis, 2 (1965), pp. 205–224.
[107] G. H. Golub and G. Meurant, Matrices, moments and quadrature, Pitman Research Notes in Mathematics Series, (1994), pp. 105–105.
[108] G. H. Golub and G. Meurant, Matrices, Moments and Quadrature with Applications, vol. 30, Princeton University Press, Dec. 2009.
[109] G. H. Golub, M. Stoll, and A. Wathen, Approximation of the scattering amplitude and linear systems, Electron. Trans. Numer. Anal., 31 (2008), pp. 178–203.
[110] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, third ed., 1996.
[111] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, fourth ed., 2013.
[112] G. H. Golub and J. H. Welsch, Calculation of Gauss quadrature rules, Math. Comput., 23 (1969), p. 221.
[113] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, MIT Press, 2016.
[114] S. A. Goreinov, N. L. Zamarashkin, and E. E. Tyrtyshnikov, Pseudo-skeleton approximations by matrices of maximal volume, Math Notes, 62 (1997), pp. 515–519.
[115] R. M. Gower and P. Richtárik, Randomized iterative methods for linear systems, SIAM J. Matrix Anal. & Appl., 36 (2015), pp. 1660–1690.
[116] L. Grasedyck, D. Kressner, and C. Tobler, A literature survey of low-rank tensor approximation techniques, GAMM-Mitteilungen, 36 (2013), pp. 53–78.
[117] E. Grilli, F. Menna, and F. Remondino, A review of point clouds segmentation and classification algorithms, Int Arch Photogramm Remote Sens Spat Inf Sci, 42 (2017), p. 339.
[118] M. Gu, Subspace iteration randomization and singular value problems, SIAM J. Sci. Comput., 37 (2015), pp. A1139–A1173.
[119] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM J. Sci. Comput., 17 (1996), pp. 848–869.
[120] E. Gujral and E. E. Papalexakis, SMACD: Semi-supervised Multi-Aspect Community Detection, in Proceedings of the 2018 SIAM International Conference on Data Mining, SIAM, 2018, pp. 702–710.
[121] S. Günther, L. Ruthotto, J. B. Schroder, E. Cyr, and N. R. Gauger, Layer-parallel training of deep residual neural networks, arXiv preprint arXiv:1812.04352, (2018).
[122] J. Gusak, M. Kholiavchenko, E. Ponomarev, L. Markeeva, P. Blagoveschensky, A. Cichocki, and I. Oseledets, Automated multi-stage compression of neural networks, in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[123] J. Gusak, M. Kholyavchenko, E. Ponomarev, L. Markeeva, I. Oseledets, and A. Cichocki, MUSCO: Multi-stage compression of neural networks, arXiv preprint arXiv:1903.09973, (2019).
[124] S. Güttel, Rational Krylov approximation of matrix functions: Numerical methods and optimal pole selection, GAMM-Mitteilungen, 36 (2013), pp. 8–31.
[125] S. Güttel, D. Kressner, and K. Lund, Limited-memory polynomial methods for large-scale matrix functions, (2020).
[126] E. Haber and L. Ruthotto, Stable architectures for deep neural networks, Inverse Prob., 34 (2017), p. 014004.
[127] P. Hage, A graph theoretic approach to the analysis of alliance structure and local grouping in highland New Guinea, in Anthropological Forum, vol. 3, Taylor & Francis, 1973, pp. 280–294.
[128] J. F. Hair Jr, G. T. M. Hult, C. Ringle, and M. Sarstedt, A primer on partial least squares structural equation modeling (PLS-SEM), Sage Publications, 2016.
[129] D. Hajinezhad, T.-H. Chang, X. Wang, Q. Shi, and M. Hong, Nonnegative matrix factorization using ADMM: Algorithm and convergence analysis, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, IEEE, Mar. 2016, pp. 4742–4746.
[130] N. Hale, N. J. Higham, and L. N. Trefethen, Computing $A^\alpha$, $\log(A)$, and related matrix functions by contour integrals, SIAM J. Numer. Anal., 46 (2008), pp. 2505–2523.
[131] N. Halko, P. G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, SIAM Rev., 53 (2011), pp. 217–288.
[132] K. Hamm and L. Huang, Perturbations of CUR decompositions, arXiv preprint arXiv:1908.08101, (2019).
[133] P. C. Hansen, The truncated SVD as a method for regularization, BIT, 27 (1987), pp. 534–553.
[134] J. A. Hartigan and M. A. Wong, Algorithm AS 136: A k-means clustering algorithm, J R Stat Soc C-Appl, 28 (1979), pp. 100–108.
[135] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer New York, 2009.
[136] C. Hayashi, What is data science? Fundamental concepts and a heuristic example, in Studies in Classification, Data Analysis, and Knowledge Organization, Springer Japan, 1998, pp. 40–51.
[137] L. He, X. Kong, P. S. Yu, X. Yang, A. B. Ragin, and Z. Hao, DuSK: A dual structure-preserving kernel for supervised tensor learning with applications to neuroimages, in Proceedings of the 2014 SIAM International Conference on Data Mining, SIAM, Society for Industrial and Applied Mathematics, Apr. 2014, pp. 127–135.
[138] L. He, C.-T. Lu, G. Ma, S. Wang, L. Shen, P. S. Yu, and A. B. Ragin, Kernelized support tensor machines, in Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, 2017, pp. 1442–1451.
[139] M. Henaff, J. Bruna, and Y. LeCun, Deep convolutional networks on graph-structured data, arXiv preprint arXiv:1506.05163, (2015).
[140] V. Hernández, J. E. Román, and A. Tomás, A robust and efficient parallel SVD solver based on restarted Lanczos bidiagonalization, Electron. Trans. Numer. Anal., 31 (2008), pp. 68–85.
[141] V. Hernández, J. E. Román, A. Tomás, and V. Vidal, Restarted Lanczos bidiagonalization for the SVD in SLEPc, STR-8, Tech. Rep., (2007).
[142] M. R. Hestenes, Conjugate Direction Methods in Optimization, vol. 49, Springer New York, 1980.
[143] C. F. Higham and D. J. Higham, Deep learning: An introduction for applied mathematicians, SIAM Review, 61 (2019), pp. 860–891.
[144] N. J. Higham, Functions of Matrices, vol. 104, Society for Industrial and Applied Mathematics, Jan. 2008.
[145] N. J. Higham and E. Deadman, A catalogue of software for matrix functions. Version 2.0, (2016).
[146] M. Hochbruck and C. Lubich, On Krylov subspace approximations to the matrix exponential operator, SIAM J. Numer. Anal., 34 (1997), pp. 1911–1925.
[147] M. E. Hochstenbach, Harmonic and refined extraction methods for the singular value problem, with applications in least squares problems, BIT, 44 (2004), pp. 721–754.
[148] T. Hofmann, B. Schölkopf, and A. J. Smola, Kernel methods in machine learning, Ann. Statist., 36 (2008), pp. 1171–1220.
[149] C.-J. Hsieh, S. Si, and I. S. Dhillon, Fast prediction for large-scale kernel machines, in Advances in Neural Information Processing Systems, 2014, pp. 3689–3697.
[150] J. Hulland, Use of partial least squares (PLS) in strategic management research: A review of four recent studies, Strat. Mgmt. J., 20 (1999), pp. 195–204.
[151] M. Hutchinson, A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines, Communications in Statistics - Simulation and Computation, 19 (1990), pp. 433–450.
[152] M. Jaderberg, A. Vedaldi, and A. Zisserman, Speeding up convolutional neural networks with low rank expansions, arXiv preprint arXiv:1405.3866, (2014).
[153] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, Springer New York, 2013.
[154] Y.-S. Jeong, M. K. Jeong, and O. A. Omitaomu, Weighted dynamic time warping for time series classification, Pattern Recognit., 44 (2011), pp. 2231–2240.
[155] I. T. Jolliffe, Principal Component Analysis, Springer New York, 1986.
[156] I. T. Jolliffe and J. Cadima, Principal component analysis: A review and recent developments, Phil. Trans. R. Soc. A, 374 (2016), p. 20150202.
[157] L. Kämmerer, D. Potts, and T. Volkmer, Approximation of multivariate periodic functions by trigonometric polynomials based on rank-1 lattice sampling, J Complex., 31 (2015), pp. 543–576.
[158] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, An efficient k-means clustering algorithm: Analysis and implementation, IEEE Trans. Pattern Anal. Machine Intell., 24 (2002), pp. 881–892.
[159] A. Kheradmand and P. Milanfar, A general framework for kernel similarity-based image denoising, in 2013 IEEE Global Conference on Signal and Information Processing, IEEE, IEEE, Dec. 2013, pp. 415–418.
[160] A. Kheradmand and P. Milanfar, A general framework for regularized, similarity-based image restoration, IEEE Trans. on Image Process., 23 (2014), pp. 5136–5151.
[161] H. A. L. Kiers, Towards a standardized notation and terminology in multiway analysis, J. Chemometrics, 14 (2000), pp. 105–122.
[162] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907, (2016).
[163] T. N. Kipf and M. Welling, Variational graph auto-encoders, arXiv preprint arXiv:1611.07308, (2016).
[164] N. Kishore Kumar and J. Schneider, Literature survey on low rank approximation of matrices, Linear Multilinear A, 65 (2016), pp. 2212–2244.
[165] M. Kivela, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter, Multilayer networks, SSRN Journal, 2 (2013), pp. 203–271.
[166] S. Klamt, U.-U. Haus, and F. Theis, Hypergraphs and cellular networks, PLoS Comput Biol, 5 (2009), p. e1000385.
[167] L. Knizhnerman and V. Simoncini, A new investigation of the extended Krylov subspace method for matrix function evaluations, Numer. Linear Algebra Appl., 17 (2009).
[168] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Rev., 51 (2009), pp. 455–500.
[169] Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, Computer, (2009), pp. 30–37.
[170] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborova, and P. Zhang, Spectral redemption in clustering sparse networks, Proceedings of the National Academy of Sciences, 110 (2013), pp. 20935–20940.
[171] J. N. Kutz, Deep learning in fluid dynamics, J. Fluid Mech., 814 (2017), pp. 1–4.
[172] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, United States Governm. Press Office, Los Angeles, CA, 1950.
[173] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, Speeding-up convolutional neural networks using fine-tuned CP-decomposition, arXiv preprint arXiv:1412.6553, (2014).
[174] Y. LeCun, Y. Bengio, et al., Convolutional networks for images, speech, and time series, The handbook of brain theory and neural networks, 3361 (1995), p. 1995.
[175] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), pp. 436–444.
[176] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, in Adv Neural Inf Process Syst, 2001, pp. 556–562.
[177] R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK Users' Guide, vol. 6, Society for Industrial and Applied Mathematics, Jan. 1998.
[178] M. Leordeanu, A. Zanfir, and C. Sminchisescu, Semi-supervised learning and optimization for hypergraph matching, in 2011 International Conference on Computer Vision, IEEE, IEEE, Nov. 2011, pp. 2274–2281.
[179] J. Leskovec, D. Huttenlocher, and J. Kleinberg, Predicting positive and negative links in online social networks, in Proceedings of the 19th international conference on World wide web - WWW '10, ACM, ACM Press, 2010, pp. 641–650.
[180] J. Leskovec, D. Huttenlocher, and J. Kleinberg, Signed networks in social media, in Proceedings of the 28th international conference on Human factors in computing systems - CHI '10, ACM, ACM Press, 2010, pp. 1361–1370.
[181] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein, CayleyNets: Graph convolutional neural networks with complex rational spectral filters, IEEE Trans. Signal Process., 67 (2019), pp. 97–109.
[182] C. Li and G. Stadler, Sparse solutions in optimal control of PDEs with uncertain parameters: The linear case, SIAM J. Control Optim., 57 (2019), pp. 633–658.
[183] M. Li, W. Bi, J. T. Kwok, and B.-L. Lu, Large-Scale Nyström Kernel Matrix Approximation Using Randomized SVD, IEEE Trans. Neural Netw. Learning Syst., 26 (2015), pp. 152–164.
[184] S. Li, Y. Jin, and D. I. Shuman, Scalable $M$-channel critically sampled filter banks for graph signals, IEEE Trans. Signal Process., 67 (2019), pp. 3954–3969.
[185] X. Li, G. Cui, and Y. Dong, Graph regularized non-negative low-rank matrix factorization for image clustering, IEEE Trans. Cybern., 47 (2017), pp. 3840–3853.
[186] L. Lin, Y. Saad, and C. Yang, Approximating spectral densities of large matrices, SIAM Rev., 58 (2016), pp. 34–65.
[187] H. Liu, Z. Wu, D. Cai, and T. S. Huang, Constrained nonnegative matrix factorization for image representation, IEEE Trans. Pattern Anal. Mach. Intell., 34 (2012), pp. 1299–1311.
[188] S. Liu, L. Chen, H. Dong, Z. Wang, D. Wu, and Z. Huang, Higher-order weighted graph convolutional networks, arXiv preprint arXiv:1911.04129, (2019).
[189] X. Liu, D. Zhai, D. Zhao, G. Zhai, and W. Gao, Progressive image denoising through hybrid graph Laplacian regularization: A unified framework, IEEE Trans. on Image Process., 23 (2014), pp. 1491–1503.
[190] X. Luo, M. Zhou, S. Li, Z. You, Y. Xia, and Q. Zhu, A nonnegative latent factor model for large-scale sparse matrices in recommender systems via alternating direction method, IEEE Trans. Neural Netw. Learning Syst., 27 (2016), pp. 579–592.
[191] D. J. MacKay, Introduction to Gaussian processes, NATO ASI Series F Computer and Systems Sciences, 168 (1998), pp. 133–166.
[192] G. Madjarov, D. Kocev, D. Gjorgjevikj, and S. Džeroski, An extensive experimental comparison of methods for multi-label learning, Pattern Recognit., 45 (2012), pp. 3084–3104.
[193] M. W. Mahoney and P. Drineas, CUR matrix decompositions for improved data analysis, PNAS, 106 (2009), pp. 697–702.
[194] M. W. Mahoney, M. Maggioni, and P. Drineas, Tensor-CUR decompositions for tensor-based data, SIAM J. Matrix Anal. & Appl., 30 (2008), pp. 957–987.
[195] W. B. March, B. Xiao, S. Tharakan, C. D. Yu, and G. Biros, A kernel-independent FMM in general dimensions, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15, IEEE, ACM Press, 2015, pp. 1–12.
[196] P.-G. Martinsson, Randomized methods for matrix computations, The Mathematics of Data, 25 (2018), pp. 187–231.
[197] P.-G. Martinsson and J. Tropp, Randomized numerical linear algebra: Foundations & algorithms, arXiv preprint arXiv:2002.01387, (2020).
[198] P. Mercado, J. Bosch, and M. Stoll, Node classification for signed social networks using diffuse interface methods, in ECMLPKDD, Sept. 2019.
[199] P. Mercado, A. Gautier, F. Tudisco, and M. Hein, The power mean Laplacian for multilayer graph clustering, arXiv preprint arXiv:1803.00491, (2018).
[200] P. Mercado, F. Tudisco, and M. Hein, Clustering signed networks with the geometric mean of Laplacians, in Advances in Neural Information Processing Systems, 2016, pp. 4421–4429.
[201] P. Mercado, F. Tudisco, and M. Hein, Generalized Matrix Means for Semi-Supervised Learning with Multilayer Graphs, (2019).
[202] P. Mercado, F. Tudisco, and M. Hein, Spectral clustering of signed graphs via matrix power means, in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov, eds., vol. 97 of Proceedings of Machine Learning Research, Long Beach, California, USA, June 2019, PMLR, pp. 4526–4536.
[203] T. Metsalu and J. Vilo, ClustVis: A web tool for visualizing clustering of multivariate data using principal component analysis and heatmap, Nucleic Acids Res, 43 (2015), pp. W566–W570.
[204] G. Meurant, Estimates of the trace of the inverse of a symmetric matrix using the modified Chebyshev algorithm, Numer Algor, 51 (2008), pp. 309–318.
[205] P. Milanfar, A tour of modern image filtering: New insights and methods, both practical and theoretical, IEEE Signal Process. Mag., 30 (2013), pp. 106–128.
[206] H. J. Miller and J. Han, Geographic Data Mining and Knowledge Discovery, CRC Press, May 2009.
[207] C. Moler and C. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, SIAM Rev., 20 (1978), pp. 801–836.
[208] C. Moler and C. Van Loan, Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later, SIAM Rev., 45 (2003), pp. 3–49.
[209] V. I. Morariu, B. V. Srinivasan, V. C. Raykar, R. Duraiswami, and L. S. Davis, Automatic online tuning for fast Gaussian summation, in Adv Neural Inf Process Syst, 2009, pp. 1113–1120.
[210] F. Morbidi, The deformed consensus protocol, Automatica, 49 (2013), pp. 3049–3055.
[211] K.-R. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Schölkopf, An introduction to kernel-based learning algorithms, IEEE Trans. Neural Netw., 12 (2001), pp. 181–201.
[212] Y. Nakatsukasa and N. J. Higham, Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD, SIAM J. Sci. Comput., 35 (2013), pp. A1325–A1349.
[213] D. Needell, Randomized Kaczmarz solver for noisy linear systems, BIT, 50 (2010), pp. 395–403.
[214] M. E. J. Newman, Detecting community structure in networks, The European Physical Journal B - Condensed Matter, 38 (2004), pp. 321–330.
[215] M. E. J. Newman, Modularity and community structure in networks, Proceedings of the National Academy of Sciences, 103 (2006), pp. 8577–8582.
[216] A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, in Adv Neural Inf Process Syst, 2002, pp. 849–856.
[217] A. Nouy, Higher-order principal component analysis for the approximation of tensors in tree-based low-rank formats, Numer. Math., 141 (2019), pp. 743–789.
[218] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, Tensorizing neural networks, in Adv Neural Inf Process Syst, 2015, pp. 442–450.
[219] I. V. Oseledets, Tensor-train decomposition, SIAM J. Sci. Comput., 33 (2011), pp. 2295–2317.
[220] A. Osinsky and N. Zamarashkin, Pseudo-skeleton approximations with better accuracy estimates, Linear Algebra Appl., 537 (2018), pp. 221–249.
[221] A. Paranjape, A. R. Benson, and J. Leskovec, Motifs in temporal networks, in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining - WSDM '17, ACM, ACM Press, 2017, pp. 601–610.
[222] J. Platt, Sequential minimal optimization: A fast algorithm for training support vector machines, (1998).
[223] I. Podlubny, Fractional differential equations: an introduction to fractional derivatives, fractional differential equations, to methods of their solution and some of their applications, vol. 198, Elsevier, 1998.
[224] F. Pourkamali-Anaraki, S. Becker, and M. B. Wakin, Randomized clustered Nyström for large-scale kernel machines, in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[225] X. Qi, E. Fuller, Q. Wu, Y. Wu, and C.-Q. Zhang, Laplacian centrality: A new centrality measure for weighted networks, Information Sciences, 194 (2012), pp. 240–253.
[226] F. Radicchi, Driving interconnected networks to supercriticality, Phys. Rev. X, 4 (2014), p. 021014.
[227] A. Rahimi and B. Recht, Random features for large-scale kernel machines, in Adv Neural Inf Process Syst, 2008, pp. 1177–1184.
[228] S. S. Rangapuram, T. Bühler, and M. Hein, Towards realistic team formation in social networks based on densest subgraphs, in Proceedings of the 22nd international conference on World Wide Web - WWW '13, ACM Press, 2013, pp. 2427–2435.
[229] J. Rapin, J. Bobin, A. Larue, and J.-L. Starck, Sparse and non-negative BSS for noisy data, IEEE Trans. Signal Process., 61 (2013), pp. 5620–5632.
[230] J. Rapin, J. Bobin, A. Larue, and J.-L. Starck, NMF with sparse regularizations in transformed domains, SIAM J. Imaging Sci., 7 (2014), pp. 2020–2047.
[231] C. E. Rasmussen, Gaussian processes in machine learning, in Summer School on Machine Learning, Springer, 2003, pp. 63–71.
[232] Y. Romano, M. Elad, and P. Milanfar, The little engine that could: Regularization by denoising (RED), SIAM J. Imaging Sci., 10 (2017), pp. 1804–1844.
[233] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, vol. 589, John Wiley & Sons, Inc., Oct. 1987.
[234] A. Rudi, L. Carratino, and L. Rosasco, Falkon: An optimal large scale kernel method, in Advances in Neural Information Processing Systems, 2017, pp. 3888–3898.
[235] H. Rue, Fast sampling of Gaussian Markov random fields, J R Stat Soc B, 63 (2001), pp. 325–338.
[236] H. Rue and L. Held, Gaussian Markov Random Fields, Chapman and Hall/CRC, Feb. 2005.
[237] Y. Saad, Iterative Methods for Sparse Linear Systems, vol. 82, Society for Industrial and Applied Mathematics, Jan. 2003.
[238] Y. Saad, Numerical Methods for Large Eigenvalue Problems, vol. 66, Society for Industrial and Applied Mathematics, Jan. 2011.
[239] A. Saade, F. Krzakala, and L. Zdeborová, Spectral clustering of graphs with the Bethe Hessian, in Advances in Neural Information Processing Systems, 2014, pp. 406–414.
[240] L. Sagun, L. Bottou, and Y. LeCun, Singularity of the Hessian in deep learning, arXiv preprint arXiv:1611.07476, (2016).
[241] A. K. Saibaba, HOID: Higher Order Interpolatory Decomposition for Tensors Based on Tucker Representation, SIAM J. Matrix Anal. & Appl., 37 (2016), pp. 1223–1249.
[242] A. K. Saibaba, J. Lee, and P. K. Kitanidis, Randomized algorithms for generalized Hermitian eigenvalue problems with application to computing Karhunen–Loève expansion, Numer. Linear Algebra Appl., 23 (2015), pp. 314–339.
[243] R. Schaback and H. Wendland, Kernel techniques: From machine learning to meshless methods, Acta Numerica, 15 (2006), pp. 543–639.
[244] B. Schölkopf, The kernel trick for distances, in Adv Neural Inf Process Syst, 2001, pp. 301–307.
[245] B. Schölkopf, A. Smola, and K.-R. Müller, Kernel principal component analysis, in International conference on artificial neural networks, Springer, 1997, pp. 583–588.
[246] B. Schölkopf, A. Smola, and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput., 10 (1998), pp. 1299–1319.
[247] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, 2018.
[248] J. Sedoc, J. Gallier, D. Foster, and L. Ungar, Semantic word clusters using signed spectral clustering, in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2017, pp. 939–949.
[249] G. Shabat, E. Choshen, D. Ben-Or, and N. Carmel, Fast and accurate Gaussian kernel ridge regression using matrix decompositions for preconditioning, arXiv preprint arXiv:1905.10587, (2019).
[250] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[251] J. Shi and J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Machine Intell., 22 (2000), pp. 888–905.
[252] Y. Shitov, Column subset selection is NP-complete, arXiv preprint arXiv:1701.02764, (2017).
[253] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Process. Mag., 30 (2013), pp. 83–98.
[254] D. I. Shuman, P. Vandergheynst, D. Kressner, and P. Frossard, Distributed signal processing via Chebyshev polynomial approximation, IEEE Trans. on Signal and Inf. Process. over Networks, 4 (2018), pp. 736–751.
[255] D. P. Simpson, I. W. Turner, and A. N. Pettitt, Fast sampling from a Gaussian Markov random field using Krylov subspace approaches, (2008).
[256] A. J. Smola and B. Schölkopf, A tutorial on support vector regression, Stat Comput, 14 (2004), pp. 199–222.
[257] A. Solé-Ribalta, M. D. Domenico, N. E. Kouvaris, A. Díaz-Guilera, S. Gómez, and A. Arenas, Spectral properties of the Laplacian of multiplex networks, Phys. Rev. E, 88 (2013), p. 032807.
[258] D. C. Sorensen and M. Embree, A DEIM induced CUR factorization, SIAM J. Sci. Comput., 38 (2016), pp. A1454–A1482.
[259] B. V. Srinivasan, Q. Hu, N. A. Gumerov, R. Murtugudde, and R. Duraiswami, Preconditioned Krylov solvers for kernel regression, arXiv preprint arXiv:1408.1237, (2014).
[260] D. Steinley, K-means clustering: A half-century synthesis, Br. J. Math. Stat. Psychol., 59 (2006), pp. 1–34.
[261] G. Stewart, Four algorithms for the efficient computation of truncated pivoted QR approximations to a sparse matrix, Numer Math, 83 (1999), pp. 313–323.
[262] G. W. Stewart, Matrix Algorithms, vol. 1, Society for Industrial and Applied Mathematics, Jan. 1998.
[263] G. W. Stewart, Matrix Algorithms, vol. 2, Society for Industrial and Applied Mathematics, Jan. 2001.
[264] M. Stoll, A Krylov–Schur approach to the truncated SVD, Linear Algebra Appl., 436 (2012), pp. 2795–2806.
[265] Z. Strakoš and P. Tichý, On efficient numerical approximation of the bilinear form $c^* A^{-1} b$, SIAM J. Sci. Comput., 33 (2011), pp. 565–587.
[266] G. Strang, Linear algebra and learning from data, Wellesley-Cambridge Press, 2019.
[267] D. Sukkari, H. Ltaief, and D. Keyes, A high performance QDWH-SVD solver using hardware accelerators, ACM Trans. Math. Softw., 43 (2016), pp. 1–25.
[268] J. Tang, Y. Chang, C. Aggarwal, and H. Liu, A survey of signed network mining in social media, CSUR, 49 (2016), pp. 1–37.
[269] Y. Tang, Deep learning using linear support vector machines, arXiv preprint arXiv:1306.0239, (2013).
[270] D. Tao, X. Li, W. Hu, S. Maybank, and X. Wu, Supervised tensor learning, in Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE, IEEE, 2005, 8 pp.
[271] D. Taylor, S. A. Myers, A. Clauset, M. A. Porter, and P. J. Mucha, Eigenvector-based centrality measures for temporal networks, Multiscale Model. Simul., 15 (2017), pp. 537–574.
[272] V. Temlyakov, Greedy Approximation, vol. 20, Cambridge University Press, 2009.
[273] E. Teoh, K. Tan, and C. Xiang, Estimating the number of hidden neurons in a feedforward network using the singular value decomposition, IEEE Trans. Neural Netw., 17 (2006), pp. 1623–1629.
[274] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58 (1996), pp. 267–288.
[275] S. Trehan, K. T. Carlberg, and L. J. Durlofsky, Error modeling for surrogates of dynamical systems using machine learning, Int J Numer Methods Eng, 112 (2017), pp. 1801–1827.
[276] S. Tu, S. Venkataraman, A. C. Wilson, A. Gittens, M. I. Jordan, and B. Recht, Breaking locality accelerates block Gauss-Seidel, in Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org, 2017, pp. 3482–3491.
[277] E. Tyrtyshnikov, Mosaic-skeleton approximations, Calcolo, 33 (1996), pp. 47–57.
[278] S. Ubaru, J. Chen, and Y. Saad, Fast estimation of $\mathrm{tr}(f(A))$ via stochastic Lanczos quadrature, SIAM J. Matrix Anal. & Appl., 38 (2017), pp. 1075–1099.
[279] J. C. Urschel, Nodal decompositions of graphs, Linear Algebra Appl., 539 (2018), pp. 60–71.
[280] W. van der Aalst, Data science in action, Springer Berlin Heidelberg, 2016, ch. Data Science in Action, pp. 3–23.
[281] V. Vapnik, Estimation of Dependences Based on Empirical Data: Springer Series in Statistics, Springer-Verlag, Berlin, Heidelberg, 1982.
[282] S. Vatankhah, R. A. Renaut, and V. E. Ardestani, Total variation regularization of the 3-D gravity inverse problem using a randomized generalized singular value decomposition, Geophys. J. Int., 213 (2018), pp. 695–705.
[283] S. A. Vavasis, On the complexity of nonnegative matrix factorization, SIAM J. Optim., 20 (2010), pp. 1364–1377.
[284] O. Vinyals and D. Povey, Krylov subspace descent for deep learning, in Artificial Intelligence and Statistics, 2012, pp. 1261–1268.
[285] C. R. Vogel and M. E. Oman, Iterative methods for total variation denoising, SIAM J. Sci. Comput., 17 (1996), pp. 227–238.
[286] U. von Luxburg, A tutorial on spectral clustering, Stat Comput, 17 (2007), pp. 395–416.
[287] S. Voronin and P.-G. Martinsson, Efficient algorithms for CUR and interpolative matrix decompositions, Adv Comput Math, 43 (2016), pp. 495–516.
[288] C.-C. Wang, K. L. Tan, and C.-J. Lin, Newton methods for convolutional neural networks, arXiv preprint arXiv:1811.06100, (2018).
[289] S. Wang and Z. Zhang, Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling, J Mach Learn Res., 14 (2013), pp. 2729–2769.
[290] Y. Wang, H.-Y. Tung, A. J. Smola, and A. Anandkumar, Fast and guaranteed tensor decomposition via sketching, in Advances in Neural Information Processing Systems, 2015, pp. 991–999.
[291] T. Warren Liao, Clustering of time series data – a survey, Pattern Recognit., 38 (2005), pp. 1857–1874.
[292] A. Wilson and H. Nickisch, Kernel interpolation for scalable structured Gaussian processes (KISS-GP), in International Conference on Machine Learning, 2015, pp. 1775–1784.
[293] D. P. Woodruff et al., Sketching as a tool for numerical linear algebra, Found Trends Theor Comput Sci, 10 (2014), pp. 1–157.
[294] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, A comprehensive survey on graph neural networks, arXiv preprint arXiv:1901.00596, (2019).
[295] H. Xiang and J. Zou, Regularization with randomized SVD for large-scale discrete inverse problems, Inverse Prob., 29 (2013), p. 085008.
[296] H. Xiang and J. Zou, Randomized algorithms for large-scale inverse problems with general Tikhonov regularizations, Inverse Prob., 31 (2015), p. 085008.
[297] J. Xue, J. Li, and Y. Gong, Restructuring of deep neural network acoustic models with singular value decomposition, in Interspeech, 2013, pp. 2365–2369.
[298] N. Yadati, M. Nimishakavi, P. Yadav, A. Louis, and P. Talukdar, HyperGCN: Hypergraph convolutional networks for semi-supervised classification, arXiv preprint arXiv:1809.02589, (2018).
[299] C. Yang, R. Duraiswami, and L. S. Davis, Efficient kernel machines using the improved fast Gauss transform, in Adv Neural Inf Process Syst, 2005, pp. 1561–1568.
[300] S. Yi, Z. Lai, Z. He, Y.-M. Cheung, and Y. Liu, Joint sparse principal component analysis, Pattern Recognit., 61 (2017), pp. 524–536.
[301] Y. You, J. Demmel, C.-J. Hsieh, and R. Vuduc, Accurate, fast and scalable kernel ridge regression on parallel and distributed systems, in Proceedings of the 2018 International Conference on Supercomputing - ICS '18, ACM, ACM Press, 2018, pp. 307–317.
[302] R. Yousefzadeh and D. P. O'Leary, Refining the structure of neural networks using matrix conditioning, arXiv preprint arXiv:1908.02400, (2019).
[303] C. D. Yu, J. Levitt, S. Reiz, and G. Biros, Geometry-oblivious FMM for compressing dense SPD matrices, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '17, ACM, ACM Press, 2017, p. 53.
[304] C. D. Yu, W. B. March, B. Xiao, and G. Biros, INV-ASKIT: A parallel fast direct solver for kernel matrices, in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, IEEE, May 2016, pp. 161–171.
[305] J. Yu, D. Tao, and M. Wang, Adaptive hypergraph learning and its application in image classification, IEEE Trans. on Image Process., 21 (2012), pp. 3262–3272.
[306] A. Zear, A. K. Singh, and P. Kumar, A proposed secure multiple watermarking technique based on DWT, DCT and SVD for application in medicine, Multimed Tools Appl, 77 (2016), pp. 4863–4882.
[307] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering, in Adv Neural Inf Process Syst, 2005, pp. 1601–1608.
[308] K. Zhang, I. W. Tsang, and J. T. Kwok, Improved Nyström low-rank approximation and error analysis, in Proceedings of the 25th international conference on Machine learning - ICML '08, ACM, ACM Press, 2008, pp. 1232–1239.
[309] Q. Zhang and B. Li, Discriminative k-SVD for dictionary learning in face recognition, in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, IEEE, June 2010, pp. 2691–2698.
[310] Z.-K. Zhang and C. Liu, A hypergraph model of social tagging networks, J. Stat. Mech., 2010 (2010), p. P10005.
[311] D. Zhou, J. Huang, and B. Schölkopf, Beyond pairwise classification and clustering using hypergraphs, (2005).
[312] D. Zhou, J. Huang, and B. Schölkopf, Learning with hypergraphs: Clustering, classification, and embedding, in Adv Neural Inf Process Syst, 2007, pp. 1601–1608.
[313] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and