EigenGame Unloaded: When playing games is better than optimizing
Ian Gemp*, Brian McWilliams*, Claire Vernade, Thore Graepel

Abstract
We build on the recently proposed EigenGame that views eigendecomposition as a competitive game. EigenGame's updates are biased if computed using minibatches of data, which hinders convergence and more sophisticated parallelism in the stochastic setting. In this work, we propose an unbiased stochastic update that is asymptotically equivalent to EigenGame, enjoys greater parallelism allowing computation on datasets of larger sample sizes, and outperforms EigenGame in experiments. We present applications to finding the principal components of massive datasets and performing spectral clustering of graphs. We analyze and discuss our proposed update in the context of EigenGame and the shift in perspective from optimization to games.
1. Introduction
Large, high-dimensional datasets containing billions of samples are commonplace. Dimensionality reduction to extract the most informative features is an important step in the data processing pipeline which enables faster learning of classifiers and regressors (Dhillon et al., 2013), clustering (Kannan & Vempala, 2009), and interpretable visualizations. Many dimensionality reduction and clustering techniques rely on eigendecomposition at their core including principal component analysis (Jolliffe, 2002), locally linear embedding (Roweis & Saul, 2000), multidimensional scaling (Mead, 1992), Isomap (Tenenbaum et al., 2000), and graph spectral clustering (Von Luxburg, 2007).

Numerical solutions to the eigenvalue problem have been approached from a variety of angles for centuries: Jacobi's method, Rayleigh quotient, power (von Mises) iteration (Golub & Van der Vorst, 2000). For large datasets that do not fit in memory, approaches that access only subsets

*Equal contribution. DeepMind, London, UK. Correspondence to: Ian Gemp <[email protected]>, Brian McWilliams <[email protected]>. Copyright 2021 by the author(s).

Figure 1.
Comparing α-EigenGame (Gemp et al., 2021) and µ-EigenGame (this work) over 1000 trials with a batch size of 1. (a) The expected trajectory of each algorithm from initialization (□) to the true value of the third eigenvector (⋆). The density of the shaded region shows the distribution of steps taken by the stochastic variant of each algorithm after 100 burn-in steps. Although the expected path of α-EG is slightly more direct, its stochastic variant has much larger variance. (b) The distribution of distances between stochastic update trajectories and the expected trajectory of each algorithm as a function of iteration count (bolder lines are later iterations and modes further left are more desirable). With increasing iterations, the stochastic µ-EG trajectory approaches its expected value whereas α-EG exhibits larger bias. The trajectory when updating with E[X_t^⊤ X_t].

— or minibatches — of the data at a time have been proposed. Recently, EigenGame (Gemp et al., 2021) was introduced with the novel perspective of viewing the set of eigenvectors as the Nash strategy of a suitably defined game. While this work demonstrated an algorithm that was empirically competitive given access to only subsets of the data, its performance degraded with smaller minibatch sizes.

One path towards circumventing EigenGame's need for large minibatch sizes is parallelization. In a data parallel approach, updates are computed in parallel on partitions of the data and then combined such that the aggregate update is equivalent to a single large-batch update. The technical obstacle preventing such an approach for EigenGame lies in the bias of its updates, i.e., the divide-and-conquer EigenGame update is not equivalent to the large-batch update. Biased updates are not just a theoretical nuisance; they can slow and even prevent convergence to the solution.
In this work we introduce a formulation of EigenGame which admits unbiased updates, which we term µ-EigenGame. When necessary we will refer to the original formulation of EigenGame as α-EigenGame. The difference between µ-EigenGame and α-EigenGame is illustrated in Figure 1. Unbiased updates allow us to increase the effective batch size using data parallelism. Lower variance updates mean that µ-EigenGame should converge faster and to more accurate solutions than α-EigenGame regardless of batch size.

Our contributions: In the rest of the paper, we present our new formulation of the EigenGame problem, analyze its bias and propose a novel unbiased parallel variant, µ-EigenGame. We demonstrate its performance with extensive experiments including applications to massive datasets and clustering a large social network graph. We conclude with discussions of the algorithm's design and context within optimization, game theory, and neuroscience.
2. Preliminaries
In this work, we aim to compute the top-k right singular vectors of data X, which is either represented as a matrix, X ∈ R^{n×d}, of n d-dimensional samples, or as a d-dimensional random variable. In either case, we assume we can repeatedly sample a minibatch X_t from the data of size n′ < n, X_t ∈ R^{n′×d}. The top-k right singular vectors of the dataset are then given by the top-k eigenvectors of the (sample) covariance matrix, E[(1/n′) X_t^⊤ X_t].

For small datasets, SVD is appropriate. However, SVD's time, O(min{nd², n²d}), and space, O(nd), complexity prohibit its use for larger datasets (Shamir, 2015) including when X is a random variable. For larger datasets, stochastic, randomized, or sketching algorithms are better suited. Stochastic algorithms such as Oja's algorithm (Oja, 1982; Allen-Zhu & Li, 2017) perform power iteration (Rutishauser, 1971) to iteratively improve an approximation, maintaining orthogonality of the learned eigenvectors typically through repeated QR decompositions. Alternatively, randomized algorithms (Halko et al., 2011; Sarlos, 2006; Cohen et al., 2017) first compute a random projection of the data onto a (k + p)-subspace approximately containing the top-k subspace. This is done using techniques similar to Krylov subspace iteration methods (Musco & Musco, 2015). After projecting, a call to SVD is then made on this reduced-dimensionality data matrix. Sketching algorithms (Feldman et al., 2020) such as Frequent Directions (Ghashami et al., 2016) also target learning the top-k subspace by maintaining an overcomplete sketch matrix of size (k + p) × d and maintaining a span of the top subspace with repeated calls to SVD. In both the randomized and sketching approaches, a final SVD of the n × (k + p) dataset is required to recover the desired singular vectors.

¹µ signifies unbiased or unloaded and α denotes original.
Although the SVD scales linearly in the number of samples, some datasets are too large to fit in memory; in this case, an out-of-memory SVD may suffice (Haidar et al., 2017). For this reason, the direct approach of stochastic algorithms, which avoid an SVD call altogether, is appealing when processing very large datasets.

Notation: We follow the same notation as Gemp et al. (2021). Variables returned by an approximation algorithm are distinguished from the true solutions with hats, e.g., the column-wise matrix of eigenvectors V̂ approximates V. We order the columns of V such that the i-th column, v_i, is the eigenvector with the i-th largest eigenvalue λ_i. The set of all eigenvectors {v̂_j} with λ_j larger than λ_i, namely v_i's parents, will be denoted by v̂_{j<i}.
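Because the covariance estimator (1/n′) X_t^⊤ X_t is linear in the per-sample outer products, per-minibatch estimates average exactly to the full-batch covariance. A small NumPy sketch of this fact (illustrative only, not from the paper's codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 1000, 5, 50

# Full dataset and its (uncentered) sample covariance.
X = rng.normal(size=(n, d))
sigma_full = X.T @ X / n

# Average the per-minibatch estimates (1/n') X_t^T X_t over disjoint
# minibatches; the average recovers the full-batch covariance exactly
# because the estimator is linear in the outer products.
batches = X.reshape(n // batch, batch, d)
sigma_avg = np.mean([B.T @ B / batch for B in batches], axis=0)

print(np.allclose(sigma_full, sigma_avg))  # True
```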
3. EigenGame
We build on the algorithm introduced by Gemp et al. (2021), which we refer to here as α-EigenGame. This algorithm is derived by formulating the eigendecomposition of a symmetric positive definite matrix as the Nash equilibrium of a game among k players, each player i owning the approximate eigenvector v̂_i. Each player is also assigned a utility function,

u_i^α(v̂_i | v̂_{j<i}) = v̂_i^⊤ Σ v̂_i − Σ_{j<i} (v̂_i^⊤ Σ v̂_j)² / (v̂_j^⊤ Σ v̂_j),

which rewards alignment with Σ and penalizes alignment with parents.
It is helpful to rearrange equation (3) to shift perspective from estimating a penalty coefficient (in red) to estimating a penalty direction (in blue):

∇_i^α ∝ (n′/n) Σ_t [ Σ_t v̂_i − Σ_{j<i} (v̂_i^⊤ Σ_t v̂_j) · ( Σ_t v̂_j / (v̂_j^⊤ Σ_t v̂_j) ) ],

where Σ_t = (1/n′) X_t^⊤ X_t. The penalty direction is a ratio of quantities that each depend on the minibatch, so its expectation over minibatches does not equal the full-batch penalty direction; this is the source of the bias.
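The bias can be observed numerically: averaging per-minibatch α updates does not recover the full-batch update, precisely because the penalty involves a ratio. A NumPy sketch, assuming the update form Σv̂_i − Σ_{j<i} (v̂_i^⊤Σv̂_j)/(v̂_j^⊤Σv̂_j) Σv̂_j (the helper `alpha_update` is our illustration, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, batch = 1000, 4, 10
X = rng.normal(size=(n, d))

def alpha_update(sigma, vi, parents):
    """alpha-EigenGame direction: the penalty coefficient and the
    penalty direction both depend nonlinearly on sigma."""
    out = sigma @ vi
    for vj in parents:
        out -= (vi @ sigma @ vj) / (vj @ sigma @ vj) * (sigma @ vj)
    return out

vi = rng.normal(size=d); vi /= np.linalg.norm(vi)
vj = rng.normal(size=d); vj /= np.linalg.norm(vj)

full = alpha_update(X.T @ X / n, vi, [vj])

# Average the update over per-minibatch covariance estimates: because
# the update is nonlinear in sigma, the average systematically differs
# from the full-batch update -- this gap is the bias.
batches = X.reshape(n // batch, batch, d)
avg = np.mean([alpha_update(B.T @ B / batch, vi, [vj]) for B in batches],
              axis=0)

print(np.linalg.norm(avg - full))  # strictly positive gap
```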
Figure caption (partial): Bias. In the top row, player 3's utility is given for parents mis-specified by an angular distance along the sphere of ∠(v̂_{j<3}, v_{j<3}).
We arrived at µ-EigenGame by analyzing and improving properties of the α-EigenGame update. However, the µ-EigenGame update direction is linear in each v̂_i. This suggests we may be able to design a pseudo-utility function for it. Rearranging the update direction from equation (5) as

Δ_i^µ = Σ v̂_i − Σ_{j<i} (v̂_j^⊤ Σ v̂_i) v̂_j = [ I − Σ_{j<i} v̂_j v̂_j^⊤ ] Σ v̂_i
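Because the µ direction is linear in Σ, averaging per-minibatch updates recovers the full-batch update exactly; this is what licenses data parallelism. A NumPy sketch assuming the update form Δ_i^µ = Σv̂_i − Σ_{j<i} (v̂_j^⊤Σv̂_i) v̂_j:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, batch = 1000, 4, 10
X = rng.normal(size=(n, d))

def mu_update(sigma, vi, parents):
    """mu-EigenGame direction: linear in sigma."""
    out = sigma @ vi
    for vj in parents:
        out -= (vj @ sigma @ vi) * vj
    return out

vi = rng.normal(size=d); vi /= np.linalg.norm(vi)
vj = rng.normal(size=d); vj /= np.linalg.norm(vj)

full = mu_update(X.T @ X / n, vi, [vj])
batches = X.reshape(n // batch, batch, d)
avg = np.mean([mu_update(B.T @ B / batch, vi, [vj]) for B in batches],
              axis=0)

# Linearity in sigma means averaging per-machine updates reproduces the
# large-batch update exactly -- the basis for unbiased data parallelism.
print(np.allclose(full, avg))  # True
```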
PCA is the unique Nash of µ-EigenGame given symmetric Σ with distinct eigenvalues.

Proof. We will show by induction that each v_i is the unique best response to v_{−i}, which implies they constitute the unique Nash equilibrium. First, consider player 1's utility. It is simply the Rayleigh quotient of Σ because v̂_1 is constrained to the unit sphere, i.e., u_1^µ = v̂_1^⊤ Σ v̂_1 = (v̂_1^⊤ Σ v̂_1)/(v̂_1^⊤ v̂_1). Therefore, we know v_1 maximizes u_1^µ and the maximizer is unique because the eigenvalues are distinct. In game theory parlance, v_1 is a best response to all other v_{−1}. The proof then continues by induction. The utility of player i is u_i^µ = v̂_i^⊤ [ I − Σ_{j<i} v̂_j v̂_j^⊤ ] Σ v̂_i; with exact parents v̂_{j<i} = v_{j<i}, this is the Rayleigh quotient of Σ restricted to the subspace orthogonal to the parents, whose unique maximizer on the unit sphere is v_i.
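The Nash characterization can be sanity-checked numerically: at the true eigenvectors, the Riemannian-projected µ update vanishes for every player. A NumPy check (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 6, 3
A = rng.normal(size=(d, d))
sigma = A @ A.T  # symmetric PSD with (almost surely) distinct eigenvalues

# Eigenvectors sorted by decreasing eigenvalue.
w, V = np.linalg.eigh(sigma)
V = V[:, ::-1]

for i in range(k):
    vi, parents = V[:, i], [V[:, j] for j in range(i)]
    upd = sigma @ vi - sum((vj @ sigma @ vi) * vj for vj in parents)
    # Project onto the tangent space of the sphere at vi.
    riem = (np.eye(d) - np.outer(vi, vi)) @ upd
    assert np.allclose(riem, 0)  # each true eigenvector is a fixed point
print("fixed-point check passed")
```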
Before introducing an algorithm for µ-EigenGame, we first briefly review necessary terminology for learning on Riemannian manifolds (Absil et al., 2009), specifically for the sphere. The notation T_{v̂_i} S^{d−1} denotes the set of vectors tangent to the sphere at a point v̂_i (i.e., any vector orthogonal to v̂_i). R_{v̂_i}(z) = (v̂_i + z)/||v̂_i + z|| is the commonly used restriction of the retraction on S^{d−1} to the tangent bundle at v̂_i (i.e., step in tangent direction z and then unit-normalize the result). The operator Π_{v̂_i}(y) = (I − v̂_i v̂_i^⊤) y projects the direction y onto T_{v̂_i} S^{d−1}. Combining these tools together results in a movement along the Riemannian manifold: v̂_i^{(t+1)} ← R_{v̂_i}(Π_{v̂_i}(y)).

We present pseudocode for µ-EigenGame below where computation is parallelized both over the k players and over M machines per player.

Algorithm 1 µ-EigenGame^R
Given: data stream X_t ∈ R^{n′×d}, number of parallel machines M per player (minibatch size per machine n′′ = n′/M), initial vectors v̂_i ∈ S^{d−1}, step size sequence η_t, and number of iterations T.
  initialize v̂_i for all i
  for t = 1 : T do
    parfor i = 1 : k do
      parfor m = 1 : M do
        rewards ← X_tm^⊤ X_tm v̂_i
        penalties ← Σ_{j<i} (v̂_j^⊤ X_tm^⊤ X_tm v̂_i) v̂_j
        Δ_im ← rewards − penalties
      Δ_i ← (1/(M n′′)) Σ_m Δ_im
      Δ_i^R ← Π_{v̂_i}(Δ_i)
      v̂_i ← R_{v̂_i}(η_t Δ_i^R)
  return all v̂_i
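A minimal single-machine NumPy sketch of the update loop (our illustrative translation, assuming the µ update Σv̂_i − Σ_{j<i}(v̂_j^⊤Σv̂_i)v̂_j together with the projection Π and retraction R described above; the spectrum, step size, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k, batch, steps, eta = 2000, 8, 3, 64, 3000, 0.05
# Data with a decaying spectrum so the top-3 eigenvectors are well separated.
X = rng.normal(size=(n, d)) @ np.diag([3., 2.5, 2., 1., 1., .5, .5, .1])

V_hat = rng.normal(size=(d, k))
V_hat /= np.linalg.norm(V_hat, axis=0)

for t in range(steps):
    Xt = X[rng.choice(n, batch)]          # sample a minibatch
    St = Xt.T @ Xt / batch                # minibatch covariance
    for i in range(k):
        vi = V_hat[:, i]
        upd = St @ vi
        for j in range(i):                # penalty from parents j < i
            vj = V_hat[:, j]
            upd -= (vj @ St @ vi) * vj
        upd = (np.eye(d) - np.outer(vi, vi)) @ upd  # projection Pi
        vi = vi + eta * upd                         # step in tangent direction
        V_hat[:, i] = vi / np.linalg.norm(vi)       # retraction R

# Compare against the exact eigenvectors of the full sample covariance.
w, V = np.linalg.eigh(X.T @ X / n)
V = V[:, ::-1][:, :k]
print(np.abs(np.sum(V * V_hat, axis=0)))  # |cosine| per eigenvector
```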
5. Experiments
As in EigenGame, we omit the projection of gradients onto the tangent space of the sphere; specifically, we omit line 9 in Algorithm 1. As discussed in Gemp et al. (2021), this has the effect of intelligently adapting the step size to use smaller learning rates near the fixed point.
To ease comparison with previous work, we count the longest correct eigenvector streak as introduced in Gemp et al. (2021), which measures the number of eigenvectors that have been learned, in order, to within an angular threshold (e.g., π/8) of the true eigenvectors. We also measure how well the set of v̂_i captures the top-k subspace with a normalized subspace distance: 1 − (1/k) · Tr(U* P) ∈ [0, 1], where U* = V V† and P = V̂ V̂† (Tang, 2019). Shading in plots indicates ± standard error of the mean.

We first validate µ-EigenGame in a full-batch setting on two synthetic datasets: one with an exponentially decaying spectrum; the other with a linearly decaying spectrum. Figure 4 shows µ-EigenGame outperforms α-EigenGame on the former and matches its performance on the latter. We discuss possible reasons for this gap in Section 6.

We compare µ-EigenGame against α-EigenGame, GHA (Sanger, 1989), Matrix Krasulina (Tang, 2019), and Oja's algorithm (Allen-Zhu & Li, 2017) on the MNIST

[Figure 4 plots: longest correct eigenvector streak over iterations for the exponentially and linearly decaying spectra; per-method runtimes (ms) in legend.]
Figure 4.
Synthetic Experiment. Runtime (milliseconds) in legend.

dataset. We flatten each image in the training set to obtain a 60,000 × 784 dimensional matrix X. Figure 5 demonstrates µ-EigenGame's robustness to minibatch size. It performs best in the longest streak metric and better than α-EigenGame in subspace distance. We attribute this performance boost to the unbiased updates of µ-EigenGame.

This dataset consists of embeddings computed using the Meena language model (Adiwardana et al., 2020). The embedded data consists of a subset of the 40 billion words used to train the transformer-based model. The subset was preprocessed to remove duplicates and then embedded using the trained model. Full details of the dataset and model can be found in Adiwardana et al. (2020). The dataset consists of n ≈ 10⁹ embeddings, each with dimensionality d = 2560; its total size is on the order of terabytes. Due to its moderate dimensionality we can exactly compute the ground truth solution by stochastically accumulating the covariance matrix of the data and computing its eigendecomposition. On a single machine this takes approximately 1.5 days (but is embarrassingly parallelizable with MapReduce).

We use minibatches of size 4,096 in each TPU. We do model parallelism across 8 TPUs so we see 32,768 samples per iteration per set of eigenvectors. We test two additional degrees of data parallelism: 16 TPUs seeing 131,072 samples and 32 TPUs seeing 262,144 samples per iteration, respectively. We compute and apply updates in Optax using SGD with a learning rate of 10⁻⁴ and Nesterov momentum with a factor of 0.9.

We compare the performance of µ-EigenGame against α-EigenGame as a function of the degree of parallelism in computing the top k = 256 eigenvectors. Each TPU is tasked with learning 32 contiguous eigenvectors. Figure 6 reports on the longest correct streak of eigenvectors learned to an angular tolerance of π/8 (with standard error computed over five runs).
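For reference, the normalized subspace distance metric described in this section can be computed directly; a NumPy sketch with a hypothetical `subspace_distance` helper (illustrative, not the paper's evaluation code):

```python
import numpy as np

def subspace_distance(V_true, V_hat):
    """Normalized subspace distance 1 - (1/k) * Tr(U* P), where
    U* = V V^+ and P = V_hat V_hat^+ project onto the two spans."""
    k = V_true.shape[1]
    U = V_true @ np.linalg.pinv(V_true)
    P = V_hat @ np.linalg.pinv(V_hat)
    return 1.0 - np.trace(U @ P) / k

rng = np.random.default_rng(5)
V = np.linalg.qr(rng.normal(size=(10, 3)))[0]  # orthonormal basis
Q = np.linalg.qr(rng.normal(size=(10, 3)))[0]  # unrelated basis

print(subspace_distance(V, V))  # identical subspaces: distance 0
print(subspace_distance(V, Q))  # different random spans: in (0, 1]
```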
We see that increasing the degree of parallelism has no effect on the performance of α-EigenGame. As expected, α-EigenGame is unable to take advantage of the higher data throughput since its updates are biased and cannot be meaningfully linearly combined across copies.

[Figure 5 plots: longest correct eigenvector streak and subspace distance over 50 epochs on MNIST for minibatch sizes 1024, 256, and 32; per-method runtimes (s) in legend.]

Figure 5.
MNIST Experiment. Runtime (seconds) in legend. Each column evaluates a different minibatch size ∈ {1024, 256, 32}.

The performance of µ-EigenGame scales with the effective batch size achieved through parallelism. µ-EigenGame with the largest degree of data parallelism is able to recover 256 eigenvectors after 40,000 iterations (approximately 0.75 epochs) in 2 hours 45 minutes.

Figure 6.
Comparison between µ-EigenGame and α-EigenGame with different degrees of data parallelism (in parentheses) on the Meena dataset.

We conducted an experiment on learning the eigenvectors of the Laplacian of a social network graph (Leskovec & McAuley, 2012) for the purpose of spectral clustering. The eigenvalues of the graph Laplacian reveal several interesting properties as well, such as the number of connected components, an approximation to the sparsest cut, and the diameter of a connected graph (Chung et al., 1994).

Given a graph with a set of nodes V and set of edges E, the graph Laplacian can be written as L = X^⊤ X where each row of the incidence matrix X ∈ R^{|E|×|V|} represents a distinct edge; X_{e=(i,j)∈E} is a vector containing only two nonzero entries, a 1 at index i and a −1 at index j (Horaud, 2009). In this setting, the eigenvectors of primary interest are the bottom-k (λ_{|V|}, λ_{|V|−1}, ...) rather than the top-k (λ_1, λ_2, ...); however, a simple algebraic manipulation allows us to reuse a top-k solver. By defining the matrix L⁻ = λ* I − L with λ* > λ_1, we ensure L⁻ ≻ 0 and the top-k eigenvectors of L⁻ are the bottom-k of L.

Figure 7.
Facebook Page Networks. Petals differentiate ground truth clusters; colors differentiate learned clusters. Petals are ideally colored according to the color bar starting with the rightmost petal and proceeding counterclockwise. Numbers indicate ground truth cluster size. Clusters are extracted by running k-means clustering on the learned eigenvectors V̂ ∈ R^{|V|×k} (samples on rows).

The update in equation (5) is transformed into

∇̃_i^µ = (λ* I − L) v̂_i − Σ_{j<i} (v̂_j^⊤ (λ* I − L) v̂_i) v̂_j.
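The shift trick can be checked on a toy graph: the top-k eigenvectors of λ*I − L span the same subspace as the bottom-k eigenvectors of L. A NumPy sketch with a hand-built incidence matrix (illustrative only):

```python
import numpy as np

# Tiny graph: a triangle 0-1-2 plus a path 2-3-4. Each incidence row
# holds +1 at node i and -1 at node j for the edge (i, j).
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
n_nodes = 5
X = np.zeros((len(edges), n_nodes))
for e, (i, j) in enumerate(edges):
    X[e, i], X[e, j] = 1.0, -1.0

L = X.T @ X                       # graph Laplacian
lam_star = np.linalg.eigvalsh(L)[-1] + 1.0   # any lam_star > lam_1 works
L_shift = lam_star * np.eye(n_nodes) - L     # positive definite shift

k = 2
w, V = np.linalg.eigh(L)           # ascending: bottom-k are first columns
ws, Vs = np.linalg.eigh(L_shift)   # ascending: top-k are last columns
bottom = V[:, :k]
top_shift = Vs[:, -k:]

# The two k-dimensional subspaces coincide: compare their projectors.
P1 = bottom @ bottom.T
P2 = top_shift @ top_shift.T
print(np.allclose(P1, P2))  # True
```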
6. Discussion
We conjecture that µ-EigenGame converges more quickly than α-EigenGame because of the following two claims.
Claim 1
The penalty terms of ∇̃_i^µ are all within 90° of those of ∇_i^α because

⟨ M v̂_j / (v̂_j^⊤ M v̂_j), v̂_j ⟩ = 1 > 0.  (17)

Claim 2
The penalty terms of ∇̃_i^µ are all smaller in magnitude than those of ∇_i^α:

||v̂_j|| ≤ || M v̂_j / (v̂_j^⊤ M v̂_j) ||.  (18)

Indeed, consider the direction M v̂_j. By properties of the vector rejection, we know the rejection of this direction from the tangent space of the unit sphere has magnitude less than or equal to that of the original vector, ||M v̂_j||. The projection is (I − v̂_j v̂_j^⊤)(M v̂_j), so the rejection is v̂_j v̂_j^⊤ (M v̂_j) and, by the preceding argument, its magnitude |v̂_j^⊤ M v̂_j| ||v̂_j|| is less than or equal to ||M v̂_j||. Rearranging the inequality completes the proof.

By Claim 1, the penalty directions of µ-EG and α-EG approximately agree. And by Claim 2, µ-EG's penalty direction is shorter. Consider a scenario where a parent of v̂_i has not converged and transiently occupies space along v̂_i's geodesic to its true endpoint v_i: a strong penalty term will force v̂_i to take a roundabout trajectory, thereby slowing its convergence. A weaker penalty term allows v̂_i to pass through regions occupied by its parent as long as its parent is not an eigenvector. Recall from Section 4 that the two utilities are equivalent when the parents are eigenvectors.

Figure 8 summarizes the relationships advising the designs of the various EigenGame algorithms. Starting from the α-EigenGame utility, its update is arrived at by simply following the standard gradient ascent paradigm. In noticing that stochastic estimates of the gradient are biased, we arrive at the µ-EigenGame update by considering how to remove this bias in a principled manner.

Sacrificing the exact steepest descent direction for a direction that allows unbiased estimates is a tradeoff that in this case has benefits.
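Both claims are easy to verify numerically for a PSD matrix M (Claim 2 follows from Cauchy–Schwarz: v̂_j^⊤ M v̂_j ≤ ||M v̂_j|| for unit v̂_j). A NumPy check on random draws (ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
A = rng.normal(size=(d, d))
M = A @ A.T  # symmetric PSD stand-in for a (minibatch) covariance

for _ in range(100):
    vj = rng.normal(size=d)
    vj /= np.linalg.norm(vj)
    # alpha-EG penalty direction M v_j / (v_j^T M v_j) vs.
    # mu-EG penalty direction v_j (which has unit norm).
    alpha_len = np.linalg.norm(M @ vj) / (vj @ M @ vj)
    assert np.linalg.norm(vj) <= alpha_len + 1e-12   # Claim 2
    assert (M @ vj) @ vj / (vj @ M @ vj) > 0         # Claim 1: acute angle
print("claims hold on all random draws")
```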
Also, while ∇̃_i^µ is not a gradient (except with exact parents), the new penalties have properties (above) that make them intuitively more desirable than the originals; they are adaptive to the state of the system.

From this new update we can derive a pseudo-utility using the stop-gradient operator which has the desired theoretical properties. However, it is unlikely that this utility would be developed independently of these steps to solve the problem at hand (see Appendix D.1 for more details).

This suggests an alternative approach to algorithm design complementary to the optimization perspective: directly designing updates themselves which converge to the desired solution, reminiscent of previous paradigms that drove neuro-inspired learning rules.

[Figure 8 diagram: u_i^µ, u_i^α, ∇̃_i^µ, ∇_i^α, with arrows labeled "Var & ⊥" and "remove bias".]

Figure 8.
This diagram presents the relationships between utilities and updates. An arrow indicates the endpoint is reasonably derived from the origin; the lack of an arrow indicates the direction is unlikely.
The Generalized Hebbian Algorithm (GHA) (Sanger, 1989; Gang et al., 2019; Chen et al., 2019) update direction for v̂_i with inexact parents v̂_j is similar to µ-EigenGame:

Δ_i^gha = Σ v̂_i − Σ_{j≤i} (v̂_i^⊤ Σ v̂_j) v̂_j.  (19)

Σ appears linearly in this update, so GHA parallelizes as well. In contrast to µ-EigenGame, GHA additionally penalizes the alignment of v̂_i to itself and removes the unit-norm constraint on v̂_i (not shown). Without any constraints, GHA overflows in experiments. We take the approach of Gemp et al. (2021) and constrain v̂_i to the unit ball (||v̂_i|| ≤ 1) rather than the unit sphere (||v̂_i|| = 1).

The connection between GHA and µ-EigenGame is interesting because GHA is a Hebbian learning algorithm inspired by neuroscience whereas µ-EigenGame is inspired by game theory. Game formulations of classical machine learning problems may provide a bridge between statistical and biologically inspired viewpoints.
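A sketch of the GHA update (equation 19) with the unit-ball constraint described above; this is our illustrative NumPy version, not Sanger's original implementation:

```python
import numpy as np

def gha_update(sigma, V_hat, eta):
    """One full-batch GHA step: note the sum over j <= i includes a
    self-penalty, unlike mu-EigenGame's j < i."""
    d, k = V_hat.shape
    new = V_hat.copy()
    for i in range(k):
        vi = V_hat[:, i]
        upd = sigma @ vi
        for j in range(i + 1):            # j <= i
            vj = V_hat[:, j]
            upd -= (vi @ sigma @ vj) * vj
        vi = vi + eta * upd
        # Unit-ball constraint ||v|| <= 1 (as in the text) to avoid overflow.
        new[:, i] = vi / max(np.linalg.norm(vi), 1.0)
    return new

rng = np.random.default_rng(7)
A = rng.normal(size=(6, 6))
sigma = A @ A.T
V = rng.normal(size=(6, 2))
V /= np.linalg.norm(V, axis=0)
for _ in range(2000):
    V = gha_update(sigma, V, 0.01)

w, Vt = np.linalg.eigh(sigma)
print(np.abs(np.sum(Vt[:, ::-1][:, :2] * V, axis=0)))  # |cosine| per vector
```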
7. Conclusion
We introduced µ-EigenGame, an unbiased, globally convergent, parallelizable algorithm that recovers the top-k eigenvectors of a symmetric positive definite matrix. We demonstrated this feat empirically on several different datasets of varying size and application. We discussed technical components of this algorithm and its place within the wider context of game theory meets machine learning meets neuroscience. µ-EigenGame's improved robustness to smaller minibatches makes it more amenable to being used within popular deep learning frameworks and as part of optimization (Krummenacher et al., 2016) and regularization (Miyato et al., 2018) techniques which leverage spectral information of gradient covariances or Hessians.

Acknowledgements.
We would like to thank Trevor Cai, Rosalia Schneider, and Dimitrios Vytiniotis for invaluable help with optimizing algorithm performance on TPU. We thank Maribeth Rauh, Zonglin Li, Daniel Adiwardana, and the Meena team for providing us with data and assistance.
References
Absil, P.-A., Mahony, R., and Sepulchre, R.
Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

Adiwardana, D., Luong, M.-T., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.

Allen-Zhu, Z. and Li, Y. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 487–492. IEEE, 2017.

Chen, Z., Li, X., Yang, L., Haupt, J., and Zhao, T. On constrained nonconvex stochastic optimization: A case study for generalized eigenvalue decomposition. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 916–925. PMLR, 2019.

Chung, F. R., Faber, V., and Manteuffel, T. A. An upper bound on the diameter of a graph from eigenvalues associated with its Laplacian. SIAM Journal on Discrete Mathematics, 7(3):443–457, 1994.

Cohen, M. B., Musco, C., and Musco, C. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1758–1777. SIAM, 2017.

Dhillon, P. S., Foster, D. P., Kakade, S. M., and Ungar, L. H. A risk comparison of ordinary least squares vs ridge regression. The Journal of Machine Learning Research, 14(1):1505–1511, 2013.

Feldman, D., Schmidt, M., and Sohler, C. Turning big data into tiny data: Constant-size coresets for k-means, PCA, and projective clustering. SIAM Journal on Computing, 49(3):601–657, 2020.

Gang, A., Raja, H., and Bajwa, W. U. Fast and communication-efficient distributed PCA. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7450–7454. IEEE, 2019.

Gemp, I., McWilliams, B., Vernade, C., and Graepel, T. EigenGame: PCA as a Nash equilibrium. In International Conference on Learning Representations, 2021.

Ghashami, M., Liberty, E., Phillips, J. M., and Woodruff, D. P. Frequent Directions: simple and deterministic matrix sketching. SIAM Journal on Computing, 45(5):1762–1792, 2016.

Golub, G. H. and Van der Vorst, H. A. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123(1-2):35–65, 2000.

Haidar, A., Kabir, K., Fayad, D., Tomov, S., and Dongarra, J. Out of memory SVD solver for big data. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. IEEE, 2017.

Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Hessel, M., Budden, D., Viola, F., Rosca, M., Sezener, E., and Hennigan, T. Optax: composable gradient transformation and optimisation, in JAX!, 2020. URL http://github.com/deepmind/optax.

Horaud, R. A short tutorial on graph Laplacians, Laplacian embedding, and spectral clustering, 2009. URL https://csustan.csustan.edu/~tom/Clustering/GraphLaplacian-tutorial.pdf.

Jolliffe, I. T. Principal components in regression analysis. In Principal Component Analysis. Springer, 2002.

Kannan, R. and Vempala, S. Spectral algorithms. Now Publishers Inc, 2009.

Krummenacher, G., McWilliams, B., Kilcher, Y., Buhmann, J. M., and Meinshausen, N. Scalable adaptive stochastic optimization using random projections. In Advances in Neural Information Processing Systems, pp. 1750–1758, 2016.

Leskovec, J. and McAuley, J. Learning to discover social circles in ego networks. Advances in Neural Information Processing Systems, 25:539–547, 2012.

Mead, A. Review of the development of multidimensional scaling methods. Journal of the Royal Statistical Society: Series D (The Statistician), 41(1):27–39, 1992.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
Musco, C. and Musco, C. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems, 2015.

Oja, E. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Rozemberczki, B., Davies, R., Sarkar, R., and Sutton, C. GEMSEC: Graph embedding with self clustering. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 65–72. ACM, 2019.

Rutishauser, H. Simultaneous iteration method for symmetric matrices. In Handbook for Automatic Computation, pp. 284–302. Springer, 1971.

Sanger, T. D. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6):459–473, 1989.

Sarlos, T. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 143–152. IEEE, 2006.

Shah, S. M. Stochastic approximation on Riemannian manifolds. Applied Mathematics & Optimization, pp. 1–29, 2019.

Shamir, O. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the International Conference on Machine Learning, pp. 144–152, 2015.

Tang, C. Exponentially convergent stochastic k-PCA without variance reduction. In Advances in Neural Information Processing Systems, pp. 12393–12404, 2019.

Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

Wang, Y., Xiu, N., and Han, J. On cone of nonsymmetric positive semidefinite matrices. Linear Algebra and its Applications, 433(4):718–736, 2010.
A. Error Propagation / Sensitivity Analysis
Lemma 3. An O(ε) angular error of parent v̂_j results in an O(ε) perturbation of the update direction; specifically, for small enough ε, ⟨∇̃_i^{µ,e}, ∇̃_i^µ⟩ > 0 (see Figure 9a). And recall that the direction with exact parents is equivalent to the gradient of α-EigenGame with exact parents, ∇_i^{α,e}.

Therefore, by a Lyapunov argument, the ∇̃_i^µ direction is an ascent direction on the α-EigenGame utility where it forms an acute angle (positive inner product) with ∇_i^{α,e}. Furthermore, ∇_i^{α,e} is the gradient of a utility function that is sinusoidal along the sphere manifold; specifically, it is a cosine with period π and positive amplitude dependent on the spectrum of Σ (c.f. equation (8) of Gemp et al. (2021)).

Figure 9. (a) Close in Euclidean distance can imply close in angular distance if the vectors are long enough. (b) The stable region for µ-EigenGame consists of an O(ε) ball around the true optimum as ε → 0.

We can derive an upper bound on the size of the angular region for which ∇̃_i^µ is not necessarily an ascent direction (the "?" marks in Figure 9). This region is defined as the set of angles for which the norm of the utility's derivative is small, i.e., ||∇_i^{α,e}|| ≤ ε. The derivative of cosine is sine, which depends linearly on its argument (angle) for small values; therefore, |θ| ≤ O(ε) or |π − θ| ≤ O(ε). As long as v̂_i does not lie within the |π − O(ε)| region, µ-EigenGame will ascend the utility landscape to within O(ε) angular error of the true eigenvector v_i. In the limit as ε → 0, the size of the |π − O(ε)| region vanishes to a point, v_i^⊥. To understand the stability of this point, we can again appeal to the analysis from Gemp et al. (2021). The Jacobian of ∇̃_i^µ and the Hessian of u_i^α are equal with exact parents, and we know that the Riemannian Hessian is positive definite for distinct eigenvalues: H^R_{v̂_i}[u_i^α] ⪰ min_j (λ_j − λ_{j+1}) I.
This means that the point v_i^⊥ is a repeller for α-EigenGame. Similarly to before, we can show more formally that an O(ε) perturbation to parents results in an O(ε) perturbation to the Jacobian of ∇̃_i^µ from H[u_i^α]:

J = [ I − Σ_{j<i} v̂_j v̂_j^⊤ ] Σ = H[u_i^α] + O(ε).
For the sake of reproducibility we have included pseudocode in JAX. We use the Optax optimization library (Hessel et al., 2020) and the Jaxline training framework. Our graph algorithm is a straightforward modification of the provided pseudocode. See Section C for details.

https://github.com/deepmind/optax
https://github.com/deepmind/jaxline

"""Copyright 2020 DeepMind Technologies Limited.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License.
"""
import jax
import jax.numpy as jnp
import optax


def eg_grads(vi: jnp.ndarray,
             weights: jnp.ndarray,
             eigs: jnp.ndarray,
             data: jnp.ndarray) -> jnp.ndarray:
  """
  Args:
    vi: shape (d,), eigenvector to be updated
    weights: shape (k,), mask for penalty coefficients
    eigs: shape (k, d), i.e., vectors on rows
    data: shape (N, d), minibatch X_t
  Returns:
    grads: shape (d,), gradient for vi
  """
  # Map the mask entries: +1 (self) -> 0, -1 (parent) -> -1, 0 (child) -> 0.
  weights_ij = (jnp.sign(weights + 0.5) - 1.) / 2.
  data_vi = jnp.dot(data, vi)
  data_eigs = jnp.transpose(jnp.dot(data, jnp.transpose(eigs)))
  vi_m_vj = jnp.dot(data_eigs, data_vi)
  penalty_grads = vi_m_vj * jnp.transpose(eigs)
  penalty_grads = jnp.dot(penalty_grads, weights_ij)
  grads = jnp.dot(jnp.transpose(data), data_vi) + penalty_grads
  return grads


def utility(vi, weights, eigs, data):
  """Compute EigenGame utilities.

  Args: as in eg_grads.
  Returns:
    util: shape (1,), utility for vi
  """
  data_vi = jnp.dot(data, vi)
  data_eigs = jnp.transpose(jnp.dot(data, jnp.transpose(eigs)))
  vi_m_vj2 = jnp.dot(data_eigs, data_vi)**2.
  vj_m_vj = jnp.sum(data_eigs * data_eigs, axis=1)
  r_ij = vi_m_vj2 / vj_m_vj
  util = jnp.dot(jnp.array(r_ij), weights)
  return util

Listing 1.
Gradient and utility functions.

def _grads_and_update(vi, weights, eigs, inputs, opt_state,
                      axis_index_groups):
  """Compute utilities and update directions, psum and apply.

  Args:
    vi: shape (d,), eigenvector to be updated
    weights: shape (k_per_device, k), mask for penalty coefficients
    eigs: shape (k, d), i.e., vectors on rows
    inputs: shape (N, d), minibatch X_t
    opt_state: optax state
    axis_index_groups: for multi-host parallelism, see
      https://jax.readthedocs.io/en/latest/_modules/jax/_src/lax/parallel.html
  Returns:
    vi_new: shape (d,), updated eigenvector
    opt_state: new optax state
    utilities: shape (1,), utilities
  """
  grads, utilities = _grads_and_utils(vi, weights, eigs, inputs)
  avg_grads = jax.lax.psum(
      grads, axis_name='i', axis_index_groups=axis_index_groups)
  vi_new, opt_state = _update_with_grads(vi, avg_grads, opt_state)
  return vi_new, opt_state, utilities


def _grads_and_utils(vi, weights, eigs, inputs):
  """Compute utilities and update directions ("grads").

  Wrap in jax.vmap for the k_per_device dimension."""
  utilities = utility(vi, weights, eigs, inputs)
  grads = eg_grads(vi, weights, eigs, inputs)
  return grads, utilities


def _update_with_grads(vi, grads, opt_state):
  """Compute and apply updates with an optax optimizer.

  Wrap in jax.vmap for the k_per_device dimension."""
  updates, opt_state = self._optimizer.update(-grads, opt_state)
  vi_new = optax.apply_updates(vi, updates)
  vi_new /= jnp.linalg.norm(vi_new)
  return vi_new, opt_state

Listing 2.
EigenGame update functions.

import functools
import numpy as np

def init(self, *args):
  """Initialization function for a Jaxline experiment."""
  weights = np.eye(self._total_k) * 2 - np.ones((self._total_k,
                                                 self._total_k))
  weights[np.triu_indices(self._total_k, 1)] = 0.
  self._weights = jnp.reshape(
      weights, [self._num_devices, self._k_per_device, self._total_k])
  local_rng = jax.random.fold_in(jax.random.PRNGKey(seed),
                                 jax.host_id())
  keys = jax.random.split(local_rng, self._num_devices)
  V = jax.pmap(
      lambda key: jax.random.normal(key, (self._k_per_device,
                                          self._dims)))(keys)
  self._V = jax.pmap(
      lambda V: V / jnp.linalg.norm(V, axis=1, keepdims=True))(V)
  self._partial_grad_update = functools.partial(
      self._grads_and_update,
      axis_index_groups=self._axis_index_groups)
  self._par_grad_update = jax.pmap(
      self._partial_grad_update,
      in_axes=(0, 0, None, 0, 0),
      axis_name='i')
  self._optimizer = optax.sgd(learning_rate=1e-4, momentum=0.9,
                              nesterov=True)
  self._opt_state = jax.pmap(self._optimizer.init)(self._V)


def step(self, *args):
  """Step function for a Jaxline experiment."""
  inputs = next(input_data_iterator)
  self._local_V = jnp.reshape(self._V, (self._total_k, self._dims))
  self._V, self._opt_state, utilities = self._par_grad_update(
      self._V, self._weights, self._local_V, inputs, self._opt_state)

Listing 3.
Skeleton for a Jaxline experiment.

C. µ-EigenGame on Graphs

Algorithm 2 receives a stream of edges represented as a matrix with edges on the rows and the outgoing node id (out) and incoming node id (in) as nonnegative integers on the columns. The method zeros_like(z) returns an array of zeros with the same dimensions as z. The method index_add(z, idx, val) adds the values in array val to z at the corresponding indices in array idx with thread-safe locking, so that indices in idx may be duplicated. Both methods are available in JAX. The largest eigenvector v̂_1 is learned to estimate λ_1 and may be discarded. The bottom-k eigenvectors are returned by the algorithm in increasing order. Algorithm 2 expects k + 1 random unit vectors as input rather than k in order to additionally estimate the top eigenvector necessary for the computation; otherwise, the inputs are the same as for Algorithm 1.

Algorithm 2 µ-EigenGame for Graphs
Given: edge stream E_t ∈ R^{n' × 2}, minibatch size per partition n'', initial vectors v̂_i0 ∈ S^{d−1}, step size sequence η_t, and iterations T.
  v̂_i ← v̂_i0 for all i ∈ {1, ..., k + 1}
  λ_1 ← |V|  (upper bound on top eigenvalue)
  for t = 1 : T do
    parfor i = 1 : k + 1 do
      parfor t' = 1 : n'/n'' do
        [Xv]_i ← v̂_i(out_tt') − v̂_i(in_tt')
        [X⊤Xv]_i ← zeros_like(v̂_i)
        [X⊤Xv]_i ← index_add([X⊤Xv]_i, out_tt', [Xv]_i)
        [X⊤Xv]_i ← index_add([X⊤Xv]_i, in_tt', −[Xv]_i)
        if i = 1 then
          λ_1 ← ‖[Xv]_i‖²
          ∇̃^µ_it' ← [X⊤Xv]_i
        else
          ∇̃^µ_it' ← λ_1 [v̂_i − Σ_{j<i} (v̂_j⊤ v̂_i) v̂_j] − ([X⊤Xv]_i − Σ_{j<i} (v̂_j⊤ [X⊤Xv]_i) v̂_j)
        end if
      end parfor
      v̂_i ← v̂_i + η_t · (n''/n') Σ_{t'} ∇̃^µ_it'
      v̂_i ← v̂_i / ‖v̂_i‖
    end parfor
  end for
  return v̂_i for i ∈ {2, ..., k + 1}

In Section 4.1, we presented u^µ_i as the Rayleigh quotient of a deflated matrix (repeated in equation (15) for convenience):

u^µ_i = v̂_i⊤ [ I − Σ_{j<i} v̂_j v̂_j⊤ ] M v̂_i,   (15)

where the bracketed factor performs the deflation.
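The inner loop of Algorithm 2 computes [X⊤Xv]_i without ever materializing the incidence matrix X: each edge (out, in) contributes a row of X with +1 at `out` and −1 at `in`, so (Xv)_e = v[out_e] − v[in_e], and the transpose scatters these values back onto the nodes. A minimal sketch in JAX (the function name and test graph are illustrative, not from the paper; the `.at[].add()` scatter API plays the role of index_add in current JAX releases):

```python
import jax.numpy as jnp


def laplacian_matvec(edges: jnp.ndarray, v: jnp.ndarray) -> jnp.ndarray:
  """Compute L v = X^T X v for a graph given as an edge list.

  Args:
    edges: shape (m, 2), int array of (out, in) node ids per row.
    v: shape (d,), vector over nodes.
  Returns:
    shape (d,), the product X^T X v without forming X.
  """
  out_ids, in_ids = edges[:, 0], edges[:, 1]
  xv = v[out_ids] - v[in_ids]        # (m,) entries of X v, one per edge
  xtxv = jnp.zeros_like(v)
  xtxv = xtxv.at[out_ids].add(xv)    # scatter +Xv back onto out nodes
  xtxv = xtxv.at[in_ids].add(-xv)    # scatter -Xv back onto in nodes
  return xtxv
```

On a triangle graph this reproduces the usual graph Laplacian D − A applied to v, which is how the minibatched edge stream in Algorithm 2 accumulates its per-partition estimates.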
This diagram presents the relationships between the utilities and updates. An arrow indicates that the endpoint can reasonably be derived from the origin; the lack of an arrow indicates the derivation is unlikely. The link from equation (38) is explicitly crossed out with a hard stop for emphasis.

D.1. Gradient Ascent on u^µ_i

If we remove the stop gradient from equation (37), we are left with equation (41):

u^µ_i = v̂_i⊤ [ I − Σ_{j<i} v̂_j v̂_j⊤ ] M v̂_i,   (41)

i.e., the same deflated Rayleigh quotient, but with every appearance of the deflation term now participating in differentiation.
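The effect of the stop gradient discussed above can be illustrated with a toy utility (the matrix, fixed vector, and penalty form below are illustrative assumptions, not the paper's equation (37)): wrapping one factor of the penalty in `jax.lax.stop_gradient` treats it as a constant under differentiation, so `jax.grad` returns a different ascent direction than for the fully differentiated utility.

```python
import jax
import jax.numpy as jnp

M = jnp.array([[2., 0.], [0., 1.]])  # toy symmetric matrix
w = jnp.array([1., 0.])              # a fixed competing vector


def util_sg(v):
  # First penalty factor is held constant via stop_gradient.
  return v @ M @ v - jax.lax.stop_gradient(v @ M @ w) * (v @ w)


def util_plain(v):
  # Same expression, fully differentiated.
  return v @ M @ v - (v @ M @ w) * (v @ w)


v = jnp.array([0.6, 0.8])
g_sg = jax.grad(util_sg)(v)      # 2Mv - <v, Mw> w
g_plain = jax.grad(util_plain)(v)  # 2Mv - <v, w> Mw - <v, Mw> w
```

Both utilities evaluate to the same number at v, yet the two gradients differ, which is exactly why removing the stop gradient changes which update direction the diagram connects to which utility.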
Figure 11. Synthetic Experiment. Runtime (milliseconds) in legend.

Figure 12. MNIST Experiment. Runtime (seconds) in legend. Each column evaluates a different minibatch size ∈ {1024, 256, 32}; rows plot the longest correct eigenvector streak and the subspace distance against epochs (0 to 50), comparing the EG variants.
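The subspace distance plotted in Figure 12 can be measured in several standard ways; a common choice, sketched below as an assumption rather than the paper's exact normalization, scores two k-dimensional subspaces by the squared cosines of their principal angles (the singular values of U⊤V for orthonormal bases U, V):

```python
import jax.numpy as jnp


def subspace_distance(u: jnp.ndarray, v: jnp.ndarray) -> float:
  """Distance between the column spans of u and v.

  Args:
    u, v: shape (d, k) matrices with orthonormal columns.
  Returns:
    k minus the sum of squared principal-angle cosines: 0 when the
    spans coincide, k when they are mutually orthogonal.
  """
  k = u.shape[1]
  cosines = jnp.linalg.svd(u.T @ v, compute_uv=False)
  return float(k - jnp.sum(cosines ** 2))
```

A metric of this form is invariant to the choice of basis within each subspace, which is what makes it suitable for comparing a learned set of eigenvectors to the ground-truth top-k subspace even before the individual vectors have aligned.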