A sampling algorithm to compute the set of feasible solutions for non-negative matrix factorization with an arbitrary rank
Ragnhild Laursen∗ and Asger Hobolth†
Department of Mathematics, Aarhus University
January 20, 2021
Abstract
Non-negative Matrix Factorization (NMF) is a useful method to extract features from multivariate data, but an important and sometimes neglected concern is that NMF can result in non-unique solutions. Often, there exists a Set of Feasible Solutions (SFS), which makes it more difficult to interpret the factorization. This problem is especially ignored in cancer genomics, where NMF is used to infer information about the mutational processes present in the evolution of cancer. In this paper the extent of non-uniqueness is investigated for two mutational counts data sets, and a new sampling algorithm that can find the SFS is introduced. Our sampling algorithm is easy to implement and applies to an arbitrary rank of NMF. This is in contrast to the state of the art, where the NMF rank must be smaller than or equal to four. For lower ranks we show that our algorithm performs similarly to the polygon inflation algorithm developed in relation to chemometrics. Furthermore, we show how the size of the SFS can have a high influence on the apparent variability of a solution. Our sampling algorithm is implemented in an R package
SFS (https://github.com/ragnhildlaursen/SFS).

Keywords: Identifiability, mutational processes, non-negative matrix factorization (NMF), sampling, uniqueness.
Introduction

The applications of Non-negative Matrix Factorization (NMF) are many, and one of them is pure component analysis in analytical chemistry. In this field, it is a big obstacle that the solution from NMF is non-unique, such that there exists a Set of Feasible Solutions (SFS) and not only one. As a consequence, there exists a vast literature in chemometrics on finding the SFS, both for general applications of NMF and for more specific applications to chemical data [3, 7, 12, 19]. Having a non-unique solution to NMF makes it problematic to interpret the factorization. NMF is an unsupervised learning method that factorizes a non-negative data matrix $M \in \mathbb{R}_+^{K \times G}$ into two non-negative matrices $P \in \mathbb{R}_+^{K \times N}$ and $E \in \mathbb{R}_+^{N \times G}$ [13]. The rank N is usually chosen much smaller than K and G. This means the factorization is an approximation $M \approx PE$.

∗ Email: [email protected]
† Email: [email protected]

Here, all the entries in P and E are free parameters that need to be estimated. The problem with the factorization is that other solutions can be constructed by finding invertible matrices $A \in \mathbb{R}^{N \times N}$ that fulfill

$$\tilde{P} = PA \ge 0, \qquad \tilde{E} = A^{-1}E \ge 0. \tag{1}$$

Then the product of $\tilde{P}$ and $\tilde{E}$ gives the exact same approximation of M as the product of P and E. Trivial ambiguities exist when A is either a diagonal matrix, which scales the entries in P and E, or a permutation matrix, which re-orders the columns in P and rows in E. These ambiguities are always possible, so it is standard to define P and E as a unique solution to NMF if the only possible ambiguities are scaling and/or re-ordering [4, 10, 11, 15].
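To make the ambiguity in (1) concrete, here is a minimal numeric sketch (our own Python illustration, not code from the paper's R package): an invertible A that is neither diagonal nor a permutation produces a second non-negative factorization with exactly the same product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rank-2 factorization M = P E; all numbers are illustrative only.
P = rng.uniform(0.5, 1.0, size=(6, 2))   # K x N
E = rng.uniform(0.5, 1.0, size=(2, 5))   # N x G
M = P @ E

# An invertible A that is neither a scaling nor a permutation.  Because all
# entries of P and E lie in [0.5, 1], both transformed factors stay >= 0.
A = np.array([[1.0, 0.3],
              [0.2, 1.0]])
P_tilde = P @ A
E_tilde = np.linalg.inv(A) @ E

same_product = np.allclose(P_tilde @ E_tilde, M)          # P~ E~ equals M
all_nonneg = (P_tilde >= 0).all() and (E_tilde >= 0).all()
```

So (P̃, Ẽ) is a genuinely different member of the SFS even though the approximation of M is unchanged.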
Here, the SFS for P and E is defined as

$$\mathcal{M}(P) = \left\{ \tilde{P} \in \mathbb{R}_+^{K \times N} \,\middle|\, \exists \text{ invertible } A \in \mathbb{R}^{N \times N}: \tilde{P} = PA \ge 0,\ \tilde{E} = A^{-1}E \ge 0 \right\}$$
$$\mathcal{M}(E) = \left\{ \tilde{E} \in \mathbb{R}_+^{N \times G} \,\middle|\, \exists \text{ invertible } A \in \mathbb{R}^{N \times N}: \tilde{P} = PA \ge 0,\ \tilde{E} = A^{-1}E \ge 0 \right\} \tag{2}$$

where A is normalised such that the columns of $\tilde{P} = PA$ sum to one, as this removes the scaling ambiguity. Assuming the columns of P sum to one, this can be shown as follows:

$$\sum_{j=1}^{K} \tilde{P}_{jn} = \sum_{j=1}^{K} \sum_{i=1}^{N} P_{ji} A_{in} = \sum_{i=1}^{N} A_{in} \underbrace{\sum_{j=1}^{K} P_{ji}}_{=1} = \sum_{i=1}^{N} A_{in}$$

for each n = 1, ..., N. Therefore, constructing A such that its columns sum to one automatically assures that the columns of P̃ sum to one, such that the whole SFS (P̃, Ẽ) has the same scaling. The problem of re-ordering of the entries is solved by ordering by cosine similarity.

In chemometrics, the SFS for a given NMF solution has been investigated since 1971, when Lawton and Sylvester [12] first introduced the analytical calculation for finding the SFS for rank two, i.e. N = 2. A vast literature has later evolved on finding the SFS for a higher rank, N ≥ 3, both by analytical calculations and with numerical algorithms [3, 7, 9, 16]. The analytical calculations of the SFS have been extended to ranks three and four [3, 9], but they require a significant amount of computation when K and G are large, as this increases the number of inequalities in (1). For a higher rank, others have proposed numerical algorithms to approximate the SFS [8, 19]. Here, we will focus on the recent polygon inflation algorithm [19] to compare with our new sampling algorithm. For a further review and comparison of some of the other currently available algorithms we refer to [8]. Our sampling algorithm has the advantage of being able to compute the SFS for an arbitrary rank of the factorization, which is needed for many NMF applications. In particular we will focus on applications in cancer genomics, where NMF is used to infer information about the mutational processes in cancer evolution.

In the field of cancer genomics, the assumption is that all the mutations in a cancer genome come from a spectrum of N mutational processes, each of which has its own effect on the cancer genome. The overall goal is to identify these mutational processes and use them to enhance the understanding of cancer evolution [17]. A popular method to infer information on the mutational processes from a matrix of mutational counts is NMF. Here, M is a K × G matrix of mutational counts, and the G columns of dimension K × 1 are the mutational catalogs, which are found from sequencing a cancer genome.

Figure 1: The signatures and exposures for the Lung A. cancer data, when assuming five mutational processes. In (A) the signatures are depicted, with the SFS marked in red from the minimum feasible value to the maximum feasible value. In (B) the normalised exposures are depicted, where the SFS is again marked in red. The signatures are arranged according to their total exposure, i.e. $\sum_{g=1}^{G} E_{ng}$, in decreasing order.

The number of different mutation types, K, in the data sets is 96, as they include the 6 base substitutions C>A, C>G, C>T, T>A, T>C, T>G, together with the immediate 3' and 5' neighboring bases, i.e. 6 · 4 · 4 = 96 mutation types. The number of mutational processes is N. This means approximating M by the factorization of two non-negative matrices P and E of dimension K × N and N × G, respectively, such that $M \approx PE$. The matrix P represents the N signatures of the mutational processes and the matrix E represents their exposures in each mutational catalog. The size of N is usually chosen magnitudes smaller than K and G, which makes P and E a low-dimensional representation of M describing its main features.

Throughout this paper we refer to two different data sets of mutations in whole genomes. One of them is the 21 breast cancer genomes, a commonly known and analysed data set in relation to mutational processes in cancer genomics [2, 5, 18], which we refer to as the Breast cancer data. The data is a matrix of dimension 96 × 21 consisting of the mutational counts for the 96 different mutation types in 21 different breast cancer genomes. The other data set consists of the mutational counts from 24 patients having Lung Adenocarcinoma cancer, taken from [1]. This data set is referred to as the Lung A. cancer data and is a 96 × 24 matrix of mutational counts. The Lung A. cancer data is chosen because it has a large SFS, even for large values of N. The SFS found with our sampling algorithm for the Lung A. cancer data when assuming five mutational processes is depicted in Figure 1. The red bars illustrate the SFS from the minimum feasible value to the maximum feasible value for each of the different entries. In Figure 1 we observe that the range of the SFS varies over the different mutation types and patients, as the changes are very dependent on the structure of one another. The code and data for this paper are available at https://github.com/ragnhildlaursen/sampleSFS_paper.

The paper is structured as follows. In Section 2 we introduce and explain our sampling algorithm. In Section 3 we apply our algorithm to the Breast and Lung A. cancer data and compare it to the polygon inflation algorithm. Section 4 is a further analysis of the optimal choice of parameters as well as the running time of our sampling algorithm. Finally, in Section 5 we cover the problem of identifiability of P and E in relation to the initialization of the updates by Lee and Seung [13] and to noise in the data. These identifiability problems are compared to the influence of the SFS. The paper ends with concluding remarks regarding our sampling algorithm, the choice of rank N, and the lack of uniqueness of the matrix factorization.

The sampling algorithm

Recall that our starting point is two non-negative matrices $P \in \mathbb{R}_+^{K \times N}$ and $E \in \mathbb{R}_+^{N \times G}$ that approximate our data $M \in \mathbb{R}_+^{K \times G}$, and that we want to find the SFS in (2) for both P and E. The general idea of our algorithm is to use the simple analytical calculation for rank two, and adapt it to higher dimensions through sampling. First, the SFS is described in the simple setting of N = 2 to ease the understanding of the general setting.
The SFS for N = 2

For the case of a rank two factorization, the SFS for P and E can be found in closed form. The calculations we make here are similar to the ones found in [15], but here we find the SFS for one column of P at a time. Assume we would like to find the SFS for the first column of P. Then we set

$$A(\lambda) = \begin{pmatrix} 1-\lambda & 0 \\ \lambda & 1 \end{pmatrix}. \tag{3}$$

The inverse $A^{-1}(\lambda)$ is simple as well and can be expressed directly in terms of the original matrix as

$$A^{-1}(\lambda) = \frac{1}{1-\lambda}\begin{pmatrix} 1 & 0 \\ -\lambda & 1-\lambda \end{pmatrix} = \begin{pmatrix} \frac{1}{1-\lambda} & 0 \\ \frac{-\lambda}{1-\lambda} & 1 \end{pmatrix} = A\!\left(\frac{-\lambda}{1-\lambda}\right). \tag{4}$$

The simple inverse eases both the calculations and the computation time for the algorithm. The feasible values of λ must fulfill that all entries of $\tilde{P} = PA(\lambda)$ and $\tilde{E} = A^{-1}(\lambda)E$ remain non-negative, which can be formulated as

$$0 \le PA(\lambda) = \begin{pmatrix} P_{11}(1-\lambda)+P_{12}\lambda & P_{12} \\ \vdots & \vdots \\ P_{K1}(1-\lambda)+P_{K2}\lambda & P_{K2} \end{pmatrix} = \begin{pmatrix} P_{11}-\lambda(P_{11}-P_{12}) & P_{12} \\ \vdots & \vdots \\ P_{K1}-\lambda(P_{K1}-P_{K2}) & P_{K2} \end{pmatrix} \tag{5}$$

and

$$0 \le A^{-1}(\lambda)E = \frac{1}{1-\lambda}\begin{pmatrix} E_{11} & \cdots & E_{1G} \\ -E_{11}\lambda + E_{21}(1-\lambda) & \cdots & -E_{1G}\lambda + E_{2G}(1-\lambda) \end{pmatrix} = \frac{1}{1-\lambda}\begin{pmatrix} E_{11} & \cdots & E_{1G} \\ E_{21}-\lambda(E_{11}+E_{21}) & \cdots & E_{2G}-\lambda(E_{1G}+E_{2G}) \end{pmatrix}. \tag{6}$$

A general requirement is λ < 1. In (6), the second row gives the upper bound

$$\bar{\Lambda}_1 = \min_{g=1,\dots,G}\left\{ \frac{E_{2g}}{E_{1g}+E_{2g}} \,\middle|\, E_{1g}+E_{2g} > 0 \right\} \ge 0. \tag{7}$$

In (5), the requirement λ < 1 assures $P_{k1}-\lambda(P_{k1}-P_{k2}) \ge 0$ for all k where $P_{k1} \ge P_{k2}$. For the other case, where $P_{k2} > P_{k1}$, we get the following lower bound for λ:

$$\underline{\Lambda}_1 = \max_{k=1,\dots,K}\left\{ \frac{P_{k1}}{P_{k1}-P_{k2}} \,\middle|\, P_{k2} > P_{k1} \right\} \le 0. \tag{8}$$

This means that all $\lambda \in [\underline{\Lambda}_1, \bar{\Lambda}_1]$ give feasible solutions for the first column of P, while the second column is fixed. The bounds of λ for the second column of P are similar, but with the indices 1 and 2 switched in (7) and (8). In this simple setting, the two individual feasible intervals $[\underline{\Lambda}_1, \bar{\Lambda}_1]$ and $[\underline{\Lambda}_2, \bar{\Lambda}_2]$ give the whole SFS for both P and E.
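The closed-form bounds in (7) and (8) are easy to verify numerically. The sketch below (our own Python illustration; the function name is ours) computes the interval for a small made-up rank-two example and checks feasibility at both endpoints.

```python
import numpy as np

def feasible_interval_rank2(P, E):
    """Bounds (7) and (8) for A(lam) = [[1-lam, 0], [lam, 1]], which mixes
    the first column of P with the second while keeping the product P E."""
    p1, p2 = P[:, 0], P[:, 1]
    e1, e2 = E[0, :], E[1, :]
    # (8): rows with p2 > p1 force lam >= p1 / (p1 - p2), a non-positive bound.
    mask_p = p2 > p1
    lam_lo = np.max(p1[mask_p] / (p1[mask_p] - p2[mask_p])) if mask_p.any() else 0.0
    # (7): columns with e1 + e2 > 0 force lam <= e2 / (e1 + e2), non-negative.
    mask_e = (e1 + e2) > 0
    lam_hi = np.min(e2[mask_e] / (e1[mask_e] + e2[mask_e]))
    return lam_lo, lam_hi

# Small made-up example: K = 4 mutation types, G = 3 catalogs.
P = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.3, 0.6]])
E = np.array([[0.7, 0.2, 0.5], [0.1, 0.9, 0.4]])
lam_lo, lam_hi = feasible_interval_rank2(P, E)   # -> (-1/3, 0.125)

# Both endpoints (and everything in between) keep both factors non-negative.
for lam in (lam_lo, 0.0, lam_hi):
    A = np.array([[1 - lam, 0.0], [lam, 1.0]])
    A_inv = np.array([[1.0, 0.0], [-lam, 1 - lam]]) / (1 - lam)
    assert (P @ A >= -1e-12).all() and (A_inv @ E >= -1e-12).all()
```

At λ = lam_hi one entry of Ẽ hits exactly zero, and at λ = lam_lo one entry of P̃ does, which is why the endpoints delimit the feasible interval.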
In the case of a rank higher than two, the analytical calculations for the SFS are substantially more complicated. Calculations for N equal to 3 can be seen in [3, 9].

The general sampling algorithm

Our algorithm uses the analytical calculations for N = 2 above and is inspired by Gibbs sampling [6]. The idea is to change each column of P into an affine combination with another column chosen by random sampling. This is done sequentially for each column of P while updating E correspondingly such that the matrix product remains the same. For this approach we define a general transformation matrix $A_{ij}(\lambda)$ of dimension N × N, similar to the one in (3):

$$\left(A_{ij}(\lambda)\right)_{uv} = \begin{cases} 1-\lambda & \text{if } u = v = i \\ 1 & \text{if } u = v \ne i \\ \lambda & \text{if } u = j,\ v = i \\ 0 & \text{otherwise.} \end{cases}$$

$A_{ij}(\lambda)$ changes column i of P to an affine combination of columns i and j, where $j \ne i$. Let $\Lambda_{ij}$ denote the feasible interval for λ given a solution P and E, such that $PA_{ij}(\lambda) \ge 0$ and $A_{ij}^{-1}(\lambda)E \ge 0$. The endpoints of this interval are similar to the ones in (7) and (8), i.e. a function of columns i and j in P and rows i and j in E:

$$\underline{\Lambda}_{ij} = \underline{f}(P, E, i, j) := \max_k\left\{ \frac{P_{ki}}{P_{ki}-P_{kj}} \,\middle|\, P_{kj} > P_{ki} \right\} \le 0 \tag{9}$$

$$\bar{\Lambda}_{ij} = \bar{f}(P, E, i, j) := \min_g\left\{ \frac{E_{jg}}{E_{ig}+E_{jg}} \,\middle|\, E_{ig}+E_{jg} > 0 \right\} \ge 0 \tag{10}$$

such that $\Lambda_{ij} = f(P, E, i, j) = [\underline{\Lambda}_{ij}, \bar{\Lambda}_{ij}]$.

Sampling a value λ in Λij

The most simple way to choose λ is uniformly at random in the interval $\Lambda_{ij}$, but this choice often gives slow convergence. The endpoints of the interval often lead to larger changes in P and E and are therefore more favorable. We therefore choose $\lambda \in \Lambda_{ij}$ from a shifted symmetric beta distribution with equal shape parameters denoted by β. Setting β = 1 results in a uniform choice, and setting β < 1 puts more weight on the endpoints of the interval. The choice of β is further discussed in Section 4.1, but for now we fix β = 0.5. Sampling $\lambda \in \Lambda_{ij}$ is done by sampling x from a regular beta distribution with both shape parameters equal to β and then setting $\lambda = x \cdot \bar{\Lambda}_{ij} + (1-x) \cdot \underline{\Lambda}_{ij}$. After sampling $\lambda \in \Lambda_{ij}$, the matrices P and E are updated to $P^{new} = PA_{ij}(\lambda)$ and $E^{new} = A_{ij}^{-1}(\lambda)E$. In one iteration, this is done sequentially for $i = 1, \dots, N$, where a random $j \ne i$ is chosen to mix with for each i.

After each iteration the values of P and E are saved, so after S iterations we have the following samples from the SFS for both P and E:

$$\mathcal{P}_S = \{P^0, P^1, \dots, P^S\} \qquad \mathcal{E}_S = \{E^0, E^1, \dots, E^S\}.$$

This means every entry in P and E has S samples besides the initial solution $(P^0, E^0)$. We often observe that for some entries all samples are equivalent, while for others they spread across an interval. Each new sample is created as an affine combination of the previous sample including itself, which means all the values between two samples in the interval are feasible as well. The SFS for each entry is therefore defined as the interval between the minimum and the maximum value of these S samples. Similarly to the polygon inflation algorithm [19], our algorithm assumes that the SFS consists of connected sets, as the samples are affine combinations of one another. This assumption has not appeared to restrict the SFS for any of the data sets we have used, but in theory the solution could have a larger SFS than what is found by our algorithm.

We define the size of $\mathcal{P}_S$ as the average change of each entry across the full sample:

$$\text{avg}[\mathcal{P}_S] = \frac{1}{K \cdot N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ \max_{s=0,1,\dots,S} P^s_{kn} - \min_{s=0,1,\dots,S} P^s_{kn} \right\} \tag{11}$$

To assure that the algorithm stops at the right time, the size of the SFS is calculated after every T iterations, which we have chosen to fix at T = 1000.
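The transformation $A_{ij}(\lambda)$ and the shifted beta draw can be written compactly. The following is our own Python sketch (function names are ours), which also checks the convenient identity $A_{ij}(\lambda)^{-1} = A_{ij}(-\lambda/(1-\lambda))$ analogous to (4):

```python
import numpy as np

def A_ij(N, i, j, lam):
    """N x N transformation replacing column i of P by the affine combination
    (1 - lam) * P[:, i] + lam * P[:, j]; all other columns are unchanged."""
    A = np.eye(N)
    A[i, i] = 1.0 - lam
    A[j, i] = lam
    return A

def sample_lambda(lam_lo, lam_hi, beta, rng):
    """Shifted symmetric Beta(beta, beta) draw on [lam_lo, lam_hi];
    beta = 1 is uniform, beta < 1 favors the endpoints."""
    x = rng.beta(beta, beta)
    return x * lam_hi + (1 - x) * lam_lo

lam = 0.4
A = A_ij(3, 0, 2, lam)
# The inverse has the same one-parameter form, as in (4).
inverse_ok = np.allclose(A @ A_ij(3, 0, 2, -lam / (1 - lam)), np.eye(3))

rng = np.random.default_rng(2)
lam_draw = sample_lambda(-0.3, 0.5, beta=0.5, rng=rng)
```

Because the columns of $A_{ij}(\lambda)$ sum to one, the column normalization of P̃ is automatically preserved by every update.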
The algorithm is stopped when the size of the SFS has changed by less than ε within these T iterations,

$$\text{avg}[\mathcal{P}_S] - \text{avg}[\mathcal{P}_{S-T}] < \epsilon,$$

where ε is a small fixed tolerance. This assures that our algorithm continues until convergence and stops when no more changes are found. One could add a similar stopping criterion for $\mathcal{E}_S$, but in our experiments we found the proposed criterion satisfactory for finding the whole SFS for both P and E. The sampling algorithm is summarized in Algorithm 1.

Algorithm 1: Finding the SFS given (P̂, Ê)

    Initialize P⁰ = P̂ and E⁰ = Ê
    for s = 1, 2, ... do
        P ← P^{s−1}, E ← E^{s−1}
        for i = 1, ..., N do
            j ← random element of {1, ..., N} \ {i}
            Λij = f(P, E, i, j)
            x ← Beta(β, β)
            λ = x · Λ̄ij + (1 − x) · Λ̲ij
            P ← P Aij(λ)
            E ← Aij(−λ/(1 − λ)) E
        end
        P^s = P, E^s = E
        stop if avg[P_s] − avg[P_{s−T}] < ε   (checked every T iterations)
    end

Results

In this section results from the sampling algorithm are compared with the polygon inflation algorithm. The comparison is made for both the Breast and Lung A. cancer data, assuming three mutational processes. The results from our sampling algorithm are illustrated in Figures 2a and 2b in the representation defined in (2). Before making the comparison we have to define another representation of the SFS that is more commonly used in chemometrics [3, 7, 9, 16, 19], and in particular by the polygon inflation algorithm. After introducing this representation, the assumption of three mutational processes for the comparison is more easily justified.
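Algorithm 1 is short enough to sketch directly. The following Python version is our own simplified illustration (not the authors' R package SFS): it omits the T/ε stopping rule, runs a fixed number of iterations, and tracks the entrywise minimum and maximum of the sampled P matrices.

```python
import numpy as np

def feasible_interval(P, E, i, j):
    """Endpoints of the feasible interval Lambda_ij keeping both
    P A_ij(lam) and A_ij(lam)^{-1} E non-negative."""
    pi, pj = P[:, i], P[:, j]
    ei, ej = E[i, :], E[j, :]
    mask_p = pj > pi
    lo = np.max(pi[mask_p] / (pi[mask_p] - pj[mask_p])) if mask_p.any() else 0.0
    mask_e = (ei + ej) > 0
    hi = np.min(ej[mask_e] / (ei[mask_e] + ej[mask_e])) if mask_e.any() else 0.0
    return lo, hi

def sample_sfs(P, E, beta=0.5, n_iter=500, seed=0):
    """Simplified sketch of Algorithm 1 (fixed iteration count)."""
    rng = np.random.default_rng(seed)
    P, E = P.copy(), E.copy()
    N = P.shape[1]
    P_min, P_max = P.copy(), P.copy()
    for _ in range(n_iter):
        for i in range(N):
            j = rng.choice([c for c in range(N) if c != i])
            lo, hi = feasible_interval(P, E, i, j)
            x = rng.beta(beta, beta)
            lam = x * hi + (1 - x) * lo
            # Apply the inverse transform to E first (it uses the old rows),
            # then replace column i of P by the affine combination.
            E[j, :] -= lam / (1 - lam) * E[i, :]
            E[i, :] /= (1 - lam)
            P[:, i] = (1 - lam) * P[:, i] + lam * P[:, j]
        P_min = np.minimum(P_min, P)
        P_max = np.maximum(P_max, P)
    return P_min, P_max

rng = np.random.default_rng(3)
P0 = rng.uniform(0.1, 1.0, size=(10, 3))
P0 /= P0.sum(axis=0)                 # columns of P sum to one, as in (2)
E0 = rng.uniform(0.1, 1.0, size=(3, 8))
P_min, P_max = sample_sfs(P0, E0)
sfs_size = np.mean(P_max - P_min)    # the average width, as in (11)
```

In the real algorithm the loop instead runs until the size in (11) stabilizes; `[P_min, P_max]` is the entrywise SFS recovered by the samples.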
The SFS relative to the singular value decomposition

In chemometrics it is common to define the SFS relative to the Singular Value Decomposition (SVD). Note that any matrix $PE \in \mathbb{R}_+^{K \times G}$ of rank N can be decomposed into $PE = U\Sigma V'$, where $U \in \mathbb{R}^{K \times N}$ and $V \in \mathbb{R}^{G \times N}$ contain the N eigenvectors of $PE(PE)'$ and $(PE)'PE$, respectively, and $\Sigma \in \mathbb{R}^{N \times N}$ is the diagonal matrix of singular values. The factorization into $U$ and $\Sigma V'$ consists of both positive and negative entries, but it can be transformed by a matrix $T \in \mathbb{R}^{N \times N}$ such that $UT \ge 0$ and $T^{-1}\Sigma V' \ge 0$. The columns of T are normalised and T is defined by

$$T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \alpha_1 & w_{1,1} & \cdots & w_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{N-1} & w_{N-1,1} & \cdots & w_{N-1,N-1} \end{pmatrix} = \begin{pmatrix} e' \\ \alpha \;\; W \end{pmatrix} \tag{12}$$

where $\alpha = (\alpha_1, \dots, \alpha_{N-1})' \in \mathbb{R}^{N-1}$, $W \in \mathbb{R}^{(N-1) \times (N-1)}$, and $e = (1, \dots, 1)'$ is a vector of ones. The scaling of the columns is made such that the first entry of each column is 1. Here, the SFS is defined by

$$\mathcal{M}_{SVD}(P) = \left\{ \alpha \in \mathbb{R}^{N-1} \,\middle|\, \exists W \in \mathbb{R}^{(N-1) \times (N-1)}: UT \ge 0,\ T^{-1}\Sigma V' \ge 0 \right\}$$
$$\mathcal{M}_{SVD}(E) = \left\{ \alpha \in \mathbb{R}^{N-1} \,\middle|\, \exists W \in \mathbb{R}^{(N-1) \times (N-1)}: U(T')^{-1} \ge 0,\ T'\Sigma V' \ge 0 \right\} \tag{13}$$

which is similar to the definition in [16, 19], where a detailed description of this construction can be found. As the set in (13) is defined in terms of $\alpha \in \mathbb{R}^{N-1}$, we choose to assume N = 3 such that the SFS can be visualized in 2D. A big advantage of $\mathcal{M}_{SVD}$ is that it removes the problem of reordering columns and rows, as it only focuses on possible transformations of the SVD of PE and not the actual values of P and E.

The connection between the definition in (13) and the one in (2) is mostly a matter of different normalizations of P and E. Constructing an invertible T with columns consisting of different $\alpha \in \mathcal{M}_{SVD}(P)$ gives

$$PE = U\Sigma V' = \underbrace{UTD^{-1}}_{\tilde{P}}\ \underbrace{DT^{-1}\Sigma V'}_{\tilde{E}}$$

where $\tilde{P} \in \mathcal{M}(P)$, $\tilde{E} \in \mathcal{M}(E)$ and $D = \text{diag}(e'UT) \in \mathbb{R}^{N \times N}$. The diagonal matrix D scales the matrices such that $\tilde{P}$ has columns summing to one. Conversely, given $\tilde{P} \in \mathcal{M}(P)$, the values of α can be found in the columns of $T = U'\tilde{P}D^{-1}$, as the eigenvectors in U are orthonormal; here $D = \text{diag}(e_1'U'\tilde{P}) \in \mathbb{R}^{N \times N}$ with $e_1 = (1, 0, \dots, 0)'$, so that D assures that the first row of T consists of ones, as in (12).

The set in (2) is easier to interpret as it visually shows the direct possible changes in P and E, but the one in (13) gives a simpler visualization of the SFS and is better suited for comparison with the polygon inflation algorithm.

The polygon inflation algorithm

The polygon inflation algorithm is an algorithm used to approximate the SFS. The algorithm is introduced in [19], and further analysis and proofs are given in [16]. The algorithm assumes that the set $\mathcal{M}_{SVD}$ for both P and E consists of at most N separated whole free subsets. The examples in Figure 3 all have three separated whole free subsets, such that all the points within the polygons are feasible solutions. This assumption is also made by the sampling algorithm, as its SFS is always connected and cannot have two different feasible intervals for one entry. The idea of the polygon inflation algorithm is to approximate the subsets of the SFS in terms of the representation in (13) by inflating an initial solution in $\mathcal{M}_{SVD}$ to the boundaries, which creates a polygon. The algorithm continues to inflate the edges of the polygon with new vertices on the boundary of the SFS until the area stops increasing.
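The mapping of a feasible solution into the $\mathcal{M}_{SVD}$ representation described above can be sketched numerically. Below is our own Python illustration (not from the paper): build an exact rank-N product, take its truncated SVD, and recover $T = U'\tilde{P}D^{-1}$ with a first row of ones.

```python
import numpy as np

rng = np.random.default_rng(4)
K, G, N = 12, 9, 3
P = rng.uniform(0.1, 1.0, size=(K, N))
P /= P.sum(axis=0)                 # column-normalised solution, as in (2)
E = rng.uniform(0.1, 1.0, size=(N, G))
M = P @ E                          # exact rank-N product

# Truncated SVD: M = U S Vt with U in R^{K x N}.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, s, Vt = U[:, :N], s[:N], Vt[:N, :]

# T = U' P D^{-1}, where D rescales each column so the first row of T is
# all ones (cf. (12)).
T = U.T @ P
D = T[0, :].copy()
T = T / D
alpha = T[1:, :]     # each column of T yields one point alpha in R^{N-1}

first_row_ones = np.allclose(T[0, :], 1.0)
# U spans the column space of M, which contains the columns of P, so U T
# reproduces P up to the column scaling by D.
reconstructs = np.allclose(U @ T, P / D)
```

For N = 3 each column of T gives a 2D point α, which is exactly what is plotted in the $\mathcal{M}_{SVD}$ figures.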
In the package FAC-PACK, the polygon inflation algorithm has been applied for rank three and four. The results from the polygon inflation algorithm and the sampling algorithm with rank three are seen in Figure 3, where the grey and black dots illustrate the samples from the sampling algorithm and the colored polygons are the SFS found by the polygon inflation algorithm.

Figure 2: The signatures and exposures for the (a) Breast and (b) Lung A. cancer data, when assuming three mutational processes. In (A) the signatures are depicted, with the SFS marked in red from the minimum feasible value to the maximum feasible value. In (B) the normalised exposures are depicted and the SFS is again marked in red. The signatures are arranged according to their total exposure, i.e. $\sum_{g=1}^{G} E_{ng}$, in decreasing order.

Figure 3: The SFS for the Breast and Lung A. cancer data in the $\mathcal{M}_{SVD}$ representation, where the grey dots show the results of all the iterations from the sampling algorithm and the black dots highlight the results from the last 500 samples. The colored polygons are the SFS found by the polygon inflation algorithm. Coloring of the three signatures is in accordance with Figure 2.

The colored lines of the different areas in Figure 3 correspond to the signatures with the same strip colors in Figure 2. The sampling algorithm proposed here clearly gives results similar to the polygon inflation algorithm, which has been the case for all examples we have tried. The SFS for E is also completely found, even though the stopping criterion and the sampling are made in relation to the signature matrix P. The matrix used as a reference is therefore unimportant for finding the SFS of both P and E.

A curious observation is that the sizes of the areas in Figure 3 are hard to translate directly into the sizes of the changes of the different entries in Figure 2. Still, the representation in Figure 3 gives a simpler visualization of the SFS. The advantages of our sampling algorithm compared to the polygon inflation algorithm are that it is easier to implement and scales to an arbitrary dimension N, as shown in Figure 5. This is especially important for cancer genomics, where the number of mutational processes varies a lot depending on e.g.
the number of samples and the number of cancer types. Potentially, this could also be advantageous in other fields where NMF is applied with a higher rank N.

Choice of β and rank N

Here we will discuss the choice of the parameter β and the running time of the sampling algorithm. The sampling algorithm is applied with an assumed rank between 2 and 10, which also makes it possible to comment on how the size of the SFS and the computation time are influenced by an increase in the rank.

The influence of the parameter β is found by running the sampling algorithm multiple times for β ∈ {0.1, 0.5, 1} on the Breast and Lung A. cancer data. We test β = 1, which is equivalent to choosing λ uniformly at random, β = 0.1, which puts most of the weight on the endpoints of the interval, and β = 0.5, which is something in-between the two cases. An illustration of 1000 samples from the algorithm with β ∈ {0.1, 0.5, 1} is seen in Figure 4. In Figure 5, avg[P_S] is depicted as a function of N, i.e. the number of assumed mutational processes. The size of the SFS for the Breast cancer data is close to zero for five mutational processes and above. Even below five mutational processes the average variations of the entries seem to be fairly small.

Figure 4: Sampling 1000 points in the SFS $\mathcal{M}_{SVD}(P)$ for different choices of β. Here, the results are illustrated on the second signature of the Lung A. cancer data in Figure 3.

Figure 5: The size of the SFS for different numbers of assumed mutational processes, i.e. N. Each point is based on the results of running the sampling algorithm 50 times for each choice of β and N. The error bars show the percentiles 0.25 and 0.75 of the 50 different runs.

On the contrary, the size of the SFS for the Lung A. cancer data increases with the number of signatures. The increase of the SFS when N equals five instead of three can be seen by comparing Figure 1 and Figure 2b, where there are more red areas with N equal to five than with three. Here, it is clear that an increase in the number of mutational processes does not necessarily decrease the size of the SFS. The values avg[P_S] on the y-axis can appear small, but remember from the definition in (11) that the sum is divided by K · N.

The choice of β seems to influence both the stability and the final size of avg[P_S]. Sampling λ ∈ Λij uniformly, i.e. β = 1, spreads the samples equally across the area, which gives a very low probability of sampling near the endpoints and thereby a smaller estimated size of the SFS, as also seen in Figure 5. On the contrary, sampling λ ∈ Λij with too much weight on the endpoints, i.e. β = 0.1, explores the interior of the SFS more slowly, which is why we use the intermediate choice β = 0.5.
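The effect of β on where λ lands in the feasible interval is easy to quantify. The sketch below (our own illustration; the 5% threshold for "near an endpoint" is an arbitrary choice of ours) measures the fraction of Beta(β, β) draws close to the interval ends.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
edge = 0.05   # "near an endpoint" = within 5% of either end of [0, 1]

frac_near_edges = {}
for beta in (0.1, 0.5, 1.0):
    x = rng.beta(beta, beta, size=n)
    frac_near_edges[beta] = np.mean((x < edge) | (x > 1 - edge))

# Smaller beta puts more mass near the endpoints of the feasible interval;
# beta = 1 (uniform) places only ~10% of the draws there.
assert frac_near_edges[0.1] > frac_near_edges[0.5] > frac_near_edges[1.0]
```

This is the trade-off discussed above: β = 1 rarely probes the boundary of the SFS, while β = 0.1 rarely probes the interior.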
Figure 6: The number of iterations and the time needed to compute the SFS for different numbers of mutational processes for the Breast and Lung A. cancer data, where the algorithm was repeated 50 times. The average is reported together with the quantiles 0.05 and 0.95.
Running time

Our sampling algorithm is not only simple to implement, but is also very fast, even for higher ranks of the factorization. In Figure 6, the average number of iterations and the computation time (using a regular laptop with an Intel Core i7-8565U CPU @ 1.80 GHz and 16 GB RAM) are shown for the two cancer data sets for different values of N, where the algorithm was repeated 50 times for each choice of N. In general the computation time is highly influenced by the size of the SFS and the rank N.

Identifiability in relation to initialization and noise

The identification of P and E is not only influenced by the size of the SFS, but also by noise in the data and by the initialization of the algorithm by Lee and Seung [13]. Here, we will cover the influence of these aspects and compare it to the SFS. In cancer genomics the most natural assumption is that each mutational count is drawn from a Poisson distribution,

$$M_{kg} \sim \text{Pois}\left((PE)_{kg}\right). \tag{14}$$

Under this assumption, a natural way to estimate P and E is to maximize the log-likelihood function

$$\ell(P, E; M) = \sum_{k=1}^{K}\sum_{g=1}^{G} \left[ M_{kg}\log\left((PE)_{kg}\right) - (PE)_{kg} - \log(M_{kg}!) \right] \tag{15}$$
$$= -D(M \mid PE) + C \tag{16}$$

where

$$D(M \mid PE) = \sum_{k=1}^{K}\sum_{g=1}^{G} \left[ M_{kg}\log\left(\frac{M_{kg}}{(PE)_{kg}}\right) - M_{kg} + (PE)_{kg} \right] \tag{17}$$

is the generalised Kullback-Leibler divergence and C is a constant depending only on M. Hence, maximizing the likelihood function is equivalent to minimizing the generalised Kullback-Leibler divergence D(M | PE). This is the divergence appearing in Lee and Seung [13], which is decreased by each update. Notice that the objective function is convex in either P or E, but not in both variables together. As mentioned in Lee and Seung [13], the update rules can therefore not ensure a global minimum, only a local minimum. Several initializations make it more likely that a global minimum is reached, which is shown in the next section.
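The equivalence between maximizing the Poisson likelihood and minimizing D(M | PE) can be checked directly. The sketch below is our own Python illustration of the multiplicative KL updates of Lee and Seung [13] on made-up counts, verifying that the updates decrease the divergence (17):

```python
import numpy as np

def gkl(M, PE):
    """Generalised Kullback-Leibler divergence D(M | PE), equation (17).
    The convention 0 * log(0/x) = 0 is handled via a mask."""
    mask = M > 0
    div = np.sum(PE - M)
    div += np.sum(M[mask] * np.log(M[mask] / PE[mask]))
    return div

def lee_seung_step(M, P, E):
    """One multiplicative update pair for the KL objective [13]."""
    P = P * ((M / (P @ E)) @ E.T) / E.sum(axis=1)
    E = E * (P.T @ (M / (P @ E))) / P.sum(axis=0)[:, None]
    return P, E

rng = np.random.default_rng(6)
M = rng.poisson(5.0, size=(20, 15)).astype(float)  # toy count matrix
P = rng.uniform(0.1, 1.0, size=(20, 3))
E = rng.uniform(0.1, 1.0, size=(3, 15))

d_before = gkl(M, P @ E)
for _ in range(50):
    P, E = lee_seung_step(M, P, E)
d_after = gkl(M, P @ E)
assert d_after < d_before   # each update decreases the divergence
```

From a random start the divergence drops quickly, but different random starts may end in different local minima, which is why several initializations are used below.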
Initialization

Throughout this paper, we have performed at least five random initializations when finding an NMF solution. The influence of increasing the number of initializations can be seen in Figure 7a, which shows how just a few initializations stabilize the reached minimum to a large extent. As the minimum stops changing, we assume the global minimum has been found. Even though the algorithm reaches the same minimum, we will still observe variability in the estimates of P and E when there is variability in the SFS.
Figure 7: The influence of initialization. In (a), the average of the minimum D(M | PE) for a given number of initializations is depicted together with error bars that show the quantiles 0.05 and 0.95. The average and quantiles are based on 100 runs. In (b), the results from the 100 different runs with ten initializations are shown in $\mathcal{M}_{SVD}(P)$ relative to the solution found previously in Figure 3(a).

Different random initializations reaching the same minimum will still give different results within the SFS, which is illustrated in Figure 7b. Some of the SFS can therefore appear through the initialization, but using initializations to find the SFS would be a very unreliable and time-consuming way to recover it. Also, it is clear to observe in Figure 7b how the iterative procedure often converges to the same areas of the SFS.

Noise

Noise in the data obviously has an influence on the identification of P and E, but sometimes the influence can seem larger than it actually is due to variability in the SFS. We will illustrate this on the Breast cancer data assuming three mutational processes, where we have variability in the SFS. Assume we have found a global minimum factorization $\hat{P}\hat{E} \in \mathbb{R}^{96 \times 21}$ that approximates the Breast cancer data $M \in \mathbb{N}^{96 \times 21}$. The influence of variance is now illustrated through parametric bootstrapping using the assumption in (14). Specifically, we create 100 bootstrap samples

$$M^*_b \sim \text{Pois}(\hat{P}\hat{E})$$

for b = 1, ..., 100. The global minimum is then found for each of the bootstrap samples $(M^*_1, \dots, M^*_{100})$, both for 10 different random initializations and for the same 10 initializations, which is illustrated in Figure 8. We clearly see how the apparent variability of the results is influenced by the SFS through random initialization. A large variance could therefore appear as low reproducibility of the signatures, but is actually just a result of the SFS. Remember, having a large SFS does not make the assumed number of mutational processes less correct; it is instead a result of the structure of the signatures and exposures.
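The parametric bootstrap under the Poisson assumption (14) is a one-liner per sample. The sketch below is our own illustration with made-up stand-ins for P̂ and Ê (in the paper, NMF is refitted to every bootstrap sample; we only show the construction):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for a fitted factorization of 96 x 21 count data; the real
# P_hat and E_hat would come from the NMF fit of the Breast cancer data.
P_hat = rng.uniform(0.1, 1.0, size=(96, 3))
E_hat = rng.uniform(1.0, 20.0, size=(3, 21))
mean = P_hat @ E_hat

# Parametric bootstrap under the Poisson assumption (14): B samples M*_b.
B = 100
boot = np.stack([rng.poisson(mean) for _ in range(B)])

# The simulated counts scatter around the fitted mean P_hat E_hat.
rel_err = abs(boot.mean(axis=0).mean() - mean.mean()) / mean.mean()
```

Refitting NMF to each `boot[b]` and comparing the fitted signatures then separates genuine sampling variance from spread that is merely movement inside the SFS.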
Figure 8: Visualization of the identification problem due to variance, where (a) shows the results from random initialization and (b) shows the results using the same ten initializations for each sample. Both show the global minimum from 100 bootstrap samples.
Conclusion
We introduce a novel sampling algorithm that can find the SFS for an arbitrary rank of NMF. The algorithm is based on the simple and well-known calculations for two dimensions, together with random sampling to explore the possible changes of the initial solution. The algorithm gives results equivalent to current algorithms, most notably the polygon inflation algorithm. The problem of non-uniqueness of NMF in relation to mutational counts has also been highlighted here. The size of the SFS is shown to depend strongly on the specific data set at hand and on the assumed number of mutational processes. Furthermore, the importance of several initializations is demonstrated, together with the important fact that a large variance in a signature could potentially be caused by the SFS.

An important aspect of NMF is to recover the true rank of the factorization. In cancer genomics, methods have been introduced based on the Bayesian Information Criterion (BIC) [5, 18], as well as a method that includes both the variability of the signatures and how well P and E approximate M [2]. The latter is definitely more questionable, as large variations could appear due to the size of the SFS. An interesting future problem is the influence of having an SFS on the further analysis of the mutational processes. Would the conclusion from one solution in the SFS be very different from another? One way to remove the problem of non-uniqueness of the solution is to decrease the variability of the signatures through fewer parameters. This was implemented in a simple form in [20].

Our sampling algorithm is very simple to implement and fast, which makes it easy to check whether an NMF solution is unique or not. This article has highlighted the problem of non-uniqueness of NMF and at the same time shown a simple way to check this often overlooked problem.
It is important to remember that the size of the SFS is not a tool to determine the correct choice of N, but rather a way to find all the solutions that give the same global minimum approximation of the data. Variability within the SFS is not the result of a poor factorization, but of a lack of uniqueness of the matrix factorization.

Acknowledgements
We would like to thank Joachim Beck for his master thesis on this subject, as it gave much inspiration to the development of the sampling algorithm. Furthermore, we would like to thank Dan Ariel Søndergaard and Kenneth Borup for helpful suggestions and corrections to the manuscript.
References

[1] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, S. A. Aparicio, S. Behjati, A. V. Biankin, G. R. Bignell, N. Bolli, A. Borg, A.-L. Børresen-Dale, et al. Signatures of mutational processes in human cancer. Nature, 500(7463):415–421, 2013.
[2] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, P. J. Campbell, and M. R. Stratton. Deciphering signatures of mutational processes operative in human cancer. Cell Reports, 3(1):246–259, 2013.
[3] O. S. Borgen and B. R. Kowalski. An extension of the multivariate component-resolution method to three components. Analytica Chimica Acta, 174:1–26, 1985.
[4] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? Advances in Neural Information Processing Systems, 16:1141–1148, 2003.
[5] A. Fischer, C. J. Illingworth, P. J. Campbell, and V. Mustonen. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome Biology, 14(4):1–10, 2013.
[6] A. E. Gelfand and A. F. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[7] P. J. Gemperline. Computation of the range of feasible solutions in self-modeling curve resolution algorithms. Analytical Chemistry, 71(23):5398–5404, 1999.
[8] A. Golshan, H. Abdollahi, S. Beyramysoltan, M. Maeder, K. Neymeyr, R. Rajkó, M. Sawall, and R. Tauler. A review of recent methods for the determination of ranges of feasible solutions resulting from soft modelling analyses of multivariate data. Analytica Chimica Acta, 911:1–13, 2016.
[9] R. C. Henry and B. M. Kim. Extension of self-modeling curve resolution to mixtures of more than three components: Part 1. Finding the basic feasible region. Chemometrics and Intelligent Laboratory Systems, 8(2):205–216, 1990.
[10] K. Huang, N. D. Sidiropoulos, and A. Swami. Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Transactions on Signal Processing, 62(1):211–224, 2013.
[11] H. Laurberg, M. G. Christensen, M. D. Plumbley, L. K. Hansen, and S. H. Jensen. Theorems on positive data: On the uniqueness of NMF. Computational Intelligence and Neuroscience, 2008.
[12] W. H. Lawton and E. A. Sylvestre. Self modeling curve resolution. Technometrics, 13(3):617–633, 1971.
[13] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[14] C. Mason, M. Maeder, and A. Whitson. Resolving factor analysis. Analytical Chemistry, 73(7):1587–1594, 2001.
[15] S. Moussaoui, D. Brie, and J. Idier. Non-negative source separation: range of admissible solutions and conditions for the uniqueness of the solution. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), volume 5, pages v–289. IEEE, 2005.
[16] K. Neymeyr and M. Sawall. On the set of solutions of the nonnegative matrix factorization problem. SIAM Journal on Matrix Analysis and Applications, 39(2):1049–1069, 2018.
[17] S. Nik-Zainal and S. Morganella. Mutational signatures in breast cancer: the problem at the DNA level, 2017.
[18] R. A. Rosales, R. D. Drummond, R. Valieris, E. Dias-Neto, and I. T. Da Silva. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics, 33(1):8–16, 2017.
[19] M. Sawall, C. Kubis, D. Selent, A. Boerner, and K. Neymeyr. A fast polygon inflation algorithm to compute the area of feasible solutions for three-component systems. I: concepts and applications. Journal of Chemometrics, 27(5):106–116, 2013.
[20] Y. Shiraishi, G. Tremmel, S. Miyano, and M. Stephens. A simple model-based approach to inferring and visualizing cancer mutation signatures.