A sampling algorithm to compute the set of feasible solutions for non-negative matrix factorization with an arbitrary rank
Ragnhild Laursen∗ and Asger Hobolth†
Department of Mathematics, Aarhus University
January 20, 2021
Abstract
Non-negative Matrix Factorization (NMF) is a useful method to extract features from multivariate data, but an important and sometimes neglected concern is that NMF can result in non-unique solutions. Often, there exists a Set of Feasible Solutions (SFS), which makes it more difficult to interpret the factorization. This problem is especially ignored in cancer genomics, where NMF is used to infer information about the mutational processes present in the evolution of cancer. In this paper the extent of non-uniqueness is investigated for two mutational counts data sets, and a new sampling algorithm that can find the SFS is introduced. Our sampling algorithm is easy to implement and applies to an arbitrary rank of NMF. This is in contrast to the state of the art, where the NMF rank must be smaller than or equal to four. For lower ranks we show that our algorithm performs similarly to the polygon inflation algorithm developed in relation to chemometrics. Furthermore, we show how the size of the SFS can have a high influence on the apparent variability of a solution. Our sampling algorithm is implemented in an R package
SFS (https://github.com/ragnhildlaursen/SFS).

Keywords: Identifiability, mutational processes, non-negative matrix factorization (NMF), sampling, uniqueness.
Introduction

The applications of Non-negative Matrix Factorization (NMF) are many, and one of them is pure component analysis in analytical chemistry. In this field, it is a big obstacle that the solution from NMF is non-unique, such that there exists a Set of Feasible Solutions (SFS) and not only one. As a consequence, there exists a vast literature in chemometrics on finding the SFS, both for general applications of NMF and for more specific applications to chemical data [3, 7, 12, 19]. Having a non-unique solution to NMF makes it problematic to interpret the factorization. NMF is an unsupervised learning method that factorizes a non-negative data matrix $M \in \mathbb{R}_+^{K \times G}$ into two non-negative matrices $P \in \mathbb{R}_+^{K \times N}$ and $E \in \mathbb{R}_+^{N \times G}$ [13]. The rank N is usually chosen much smaller than K and G. This means the factorization is an approximation $M \approx PE$.

∗ Email: [email protected]
† Email: [email protected]

Here, all the entries in P and E are free parameters that need to be estimated. The problem with the factorization is that other solutions can be constructed by finding invertible matrices $A \in \mathbb{R}^{N \times N}$ that fulfill

$$\tilde{P} = PA \ge 0, \qquad \tilde{E} = A^{-1}E \ge 0. \tag{1}$$

Then the product of $\tilde{P}$ and $\tilde{E}$ gives the exact same approximation of M as the product of P and E. Trivial ambiguities exist when A is either a diagonal matrix, which scales the entries in P and E, or a permutation matrix, which re-orders the columns in P and rows in E. These ambiguities are always possible, so it is standard to define P and E as a unique solution to NMF if the only possible ambiguities are scaling and/or re-ordering [4, 10, 11, 15].
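To make the ambiguity in (1) concrete, here is a minimal numeric sketch (our own Python illustration, not code from the paper's R package): an invertible A that is neither diagonal nor a permutation produces a second non-negative factorization with exactly the same product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rank-2 factorization M = P E; all numbers are illustrative only.
P = rng.uniform(0.5, 1.0, size=(6, 2))   # K x N
E = rng.uniform(0.5, 1.0, size=(2, 5))   # N x G
M = P @ E

# An invertible A that is neither a scaling nor a permutation.  Because all
# entries of P and E lie in [0.5, 1], both transformed factors stay >= 0.
A = np.array([[1.0, 0.3],
              [0.2, 1.0]])
P_tilde = P @ A
E_tilde = np.linalg.inv(A) @ E

same_product = np.allclose(P_tilde @ E_tilde, M)          # P~ E~ equals M
all_nonneg = (P_tilde >= 0).all() and (E_tilde >= 0).all()
```

So (P̃, Ẽ) is a genuinely different member of the SFS even though the approximation of M is unchanged.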
Here, the SFS for P and E is defined as

$$\mathcal{M}(P) = \left\{ \tilde{P} \in \mathbb{R}_+^{K \times N} \,\middle|\, \exists \text{ invertible } A \in \mathbb{R}^{N \times N}: \tilde{P} = PA \ge 0,\ \tilde{E} = A^{-1}E \ge 0 \right\}$$
$$\mathcal{M}(E) = \left\{ \tilde{E} \in \mathbb{R}_+^{N \times G} \,\middle|\, \exists \text{ invertible } A \in \mathbb{R}^{N \times N}: \tilde{P} = PA \ge 0,\ \tilde{E} = A^{-1}E \ge 0 \right\} \tag{2}$$

where A is normalised such that the columns of $\tilde{P} = PA$ sum to one, as this removes the scaling ambiguity. Assuming the columns of P sum to one, this can be shown as follows:

$$\sum_{j=1}^{K} \tilde{P}_{jn} = \sum_{j=1}^{K} \sum_{i=1}^{N} P_{ji} A_{in} = \sum_{i=1}^{N} A_{in} \underbrace{\sum_{j=1}^{K} P_{ji}}_{=1} = \sum_{i=1}^{N} A_{in}$$

for each n = 1, ..., N. Therefore, constructing A such that its columns sum to one automatically assures that the columns of P̃ sum to one, such that the whole SFS (P̃, Ẽ) has the same scaling. The problem of re-ordering of the entries is solved by ordering by cosine similarity.

In chemometrics, the SFS for a given NMF solution has been investigated since 1971, when Lawton and Sylvester [12] first introduced the analytical calculation for finding the SFS for rank two, i.e. N = 2. A vast literature has later evolved on finding the SFS for a higher rank, N ≥ 3, both by analytical calculations and with numerical algorithms [3, 7, 9, 16]. The analytical calculations of the SFS have been extended to ranks three and four [3, 9], but they require a significant amount of computation when K and G are large, as this increases the number of inequalities in (1). For a higher rank, others have proposed numerical algorithms to approximate the SFS [8, 19]. Here, we will focus on the recent polygon inflation algorithm [19] to compare with our new sampling algorithm. For a further review and comparison of some of the other currently available algorithms we refer to [8]. Our sampling algorithm has the advantage of being able to compute the SFS for an arbitrary rank of the factorization, which is needed for many NMF applications. In particular we will focus on applications in cancer genomics, where NMF is used to infer information about the mutational processes in cancer evolution.

In the field of cancer genomics, the assumption is that all the mutations in a cancer genome come from a spectrum of N mutational processes, each of which has its own effect on the cancer genome. The overall goal is to identify these mutational processes and use them to enhance the understanding of cancer evolution [17]. A popular method to infer information on the mutational processes from a matrix of mutational counts is NMF. Here, M is a K × G matrix of mutational counts, and the G columns of dimension K × 1 are the mutational catalogs, which are found from sequencing a cancer genome.

Figure 1: The signatures and exposures for the Lung A. cancer data, when assuming five mutational processes. In (A) the signatures are depicted, with the SFS marked in red from the minimum feasible value to the maximum feasible value. In (B) the normalised exposures are depicted, where the SFS is again marked in red. The signatures are arranged according to their total exposure, i.e. $\sum_{g=1}^{G} E_{ng}$, in decreasing order.

The number of different mutation types, K, in the data sets is 96, as they include the 6 base substitutions C>A, C>G, C>T, T>A, T>C, T>G, together with the immediate 3' and 5' neighboring bases, i.e. 6 · 4 · 4 = 96 mutation types. The number of mutational processes is N. This means approximating M by the factorization of two non-negative matrices P and E of dimension K × N and N × G, respectively, such that $M \approx PE$. The matrix P represents the N signatures of the mutational processes and the matrix E represents their exposures in each mutational catalog. The size of N is usually chosen magnitudes smaller than K and G, which makes P and E a low-dimensional representation of M describing its main features.

Throughout this paper we refer to two different data sets of mutations in whole genomes. One of them is the 21 breast cancer genomes, a commonly known and analysed data set in relation to mutational processes in cancer genomics [2, 5, 18], which we refer to as the Breast cancer data. The data is a matrix of dimension 96 × 21 consisting of the mutational counts for the 96 different mutation types in 21 different breast cancer genomes. The other data set consists of the mutational counts from 24 patients having Lung Adenocarcinoma cancer, taken from [1]. This data set is referred to as the Lung A. cancer data and is a 96 × 24 matrix of mutational counts. The Lung A. cancer data is chosen because it has a large SFS, even for large values of N. The SFS found with our sampling algorithm for the Lung A. cancer data when assuming five mutational processes is depicted in Figure 1. The red bars illustrate the SFS from the minimum feasible value to the maximum feasible value for each of the different entries. In Figure 1 we observe that the range of the SFS varies over the different mutation types and patients, as the changes are very dependent on the structure of one another. The code and data for this paper are available at https://github.com/ragnhildlaursen/sampleSFS_paper.

The paper is structured as follows. In Section 2 we introduce and explain our sampling algorithm. In Section 3 we apply our algorithm to the Breast and Lung A. cancer data and compare it to the polygon inflation algorithm. Section 4 is a further analysis of the optimal choice of parameters as well as the running time of our sampling algorithm. Finally, in Section 5 we cover the problem of identifiability of P and E in relation to the initialization of the updates by Lee and Seung [13] and to noise in the data. These identifiability problems are compared to the influence of the SFS. The paper ends with concluding remarks regarding our sampling algorithm, the choice of rank N, and the lack of uniqueness of the matrix factorization.

The sampling algorithm

Recall that our starting point is two non-negative matrices $P \in \mathbb{R}_+^{K \times N}$ and $E \in \mathbb{R}_+^{N \times G}$ that approximate our data $M \in \mathbb{R}_+^{K \times G}$, and that we want to find the SFS in (2) for both P and E. The general idea of our algorithm is to use the simple analytical calculation for rank two, and adapt it to higher dimensions through sampling. First, the SFS is described in the simple setting of N = 2 to ease the understanding of the general setting.
The SFS for N = 2

For the case of a rank two factorization, the SFS for P and E can be found in closed form. The calculations we make here are similar to the ones found in [15], but here we find the SFS for one column of P at a time. Assume we would like to find the SFS for the first column of P. Then we set

$$A(\lambda) = \begin{pmatrix} 1-\lambda & 0 \\ \lambda & 1 \end{pmatrix}. \tag{3}$$

The inverse $A^{-1}(\lambda)$ is simple as well and can be expressed directly in terms of the original matrix as

$$A^{-1}(\lambda) = \frac{1}{1-\lambda}\begin{pmatrix} 1 & 0 \\ -\lambda & 1-\lambda \end{pmatrix} = \begin{pmatrix} \frac{1}{1-\lambda} & 0 \\ \frac{-\lambda}{1-\lambda} & 1 \end{pmatrix} = A\!\left(\frac{-\lambda}{1-\lambda}\right). \tag{4}$$

The simple inverse eases both the calculations and the computation time for the algorithm. The feasible values of λ must fulfill that all entries of $\tilde{P} = PA(\lambda)$ and $\tilde{E} = A^{-1}(\lambda)E$ remain non-negative, which can be formulated as

$$0 \le PA(\lambda) = \begin{pmatrix} P_{11}(1-\lambda)+P_{12}\lambda & P_{12} \\ \vdots & \vdots \\ P_{K1}(1-\lambda)+P_{K2}\lambda & P_{K2} \end{pmatrix} = \begin{pmatrix} P_{11}-\lambda(P_{11}-P_{12}) & P_{12} \\ \vdots & \vdots \\ P_{K1}-\lambda(P_{K1}-P_{K2}) & P_{K2} \end{pmatrix} \tag{5}$$

and

$$0 \le A^{-1}(\lambda)E = \frac{1}{1-\lambda}\begin{pmatrix} E_{11} & \cdots & E_{1G} \\ -E_{11}\lambda + E_{21}(1-\lambda) & \cdots & -E_{1G}\lambda + E_{2G}(1-\lambda) \end{pmatrix} = \frac{1}{1-\lambda}\begin{pmatrix} E_{11} & \cdots & E_{1G} \\ E_{21}-\lambda(E_{11}+E_{21}) & \cdots & E_{2G}-\lambda(E_{1G}+E_{2G}) \end{pmatrix}. \tag{6}$$

A general requirement is λ < 1. In (6), the second row gives the upper bound

$$\bar{\Lambda}_1 = \min_{g=1,\dots,G}\left\{ \frac{E_{2g}}{E_{1g}+E_{2g}} \,\middle|\, E_{1g}+E_{2g} > 0 \right\} \ge 0. \tag{7}$$

In (5), the requirement λ < 1 assures $P_{k1}-\lambda(P_{k1}-P_{k2}) \ge 0$ for all k where $P_{k1} \ge P_{k2}$. For the other case, where $P_{k2} > P_{k1}$, we get the following lower bound for λ:

$$\underline{\Lambda}_1 = \max_{k=1,\dots,K}\left\{ \frac{P_{k1}}{P_{k1}-P_{k2}} \,\middle|\, P_{k2} > P_{k1} \right\} \le 0. \tag{8}$$

This means that all $\lambda \in [\underline{\Lambda}_1, \bar{\Lambda}_1]$ give feasible solutions for the first column of P, while the second column is fixed. The bounds of λ for the second column of P are similar, but with the indices 1 and 2 switched in (7) and (8). In this simple setting, the two individual feasible intervals $[\underline{\Lambda}_1, \bar{\Lambda}_1]$ and $[\underline{\Lambda}_2, \bar{\Lambda}_2]$ give the whole SFS for both P and E.
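The closed-form bounds in (7) and (8) are easy to verify numerically. The sketch below (our own Python illustration; the function name is ours) computes the interval for a small made-up rank-two example and checks feasibility at both endpoints.

```python
import numpy as np

def feasible_interval_rank2(P, E):
    """Bounds (7) and (8) for A(lam) = [[1-lam, 0], [lam, 1]], which mixes
    the first column of P with the second while keeping the product P E."""
    p1, p2 = P[:, 0], P[:, 1]
    e1, e2 = E[0, :], E[1, :]
    # (8): rows with p2 > p1 force lam >= p1 / (p1 - p2), a non-positive bound.
    mask_p = p2 > p1
    lam_lo = np.max(p1[mask_p] / (p1[mask_p] - p2[mask_p])) if mask_p.any() else 0.0
    # (7): columns with e1 + e2 > 0 force lam <= e2 / (e1 + e2), non-negative.
    mask_e = (e1 + e2) > 0
    lam_hi = np.min(e2[mask_e] / (e1[mask_e] + e2[mask_e]))
    return lam_lo, lam_hi

# Small made-up example: K = 4 mutation types, G = 3 catalogs.
P = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.3, 0.6]])
E = np.array([[0.7, 0.2, 0.5], [0.1, 0.9, 0.4]])
lam_lo, lam_hi = feasible_interval_rank2(P, E)   # -> (-1/3, 0.125)

# Both endpoints (and everything in between) keep both factors non-negative.
for lam in (lam_lo, 0.0, lam_hi):
    A = np.array([[1 - lam, 0.0], [lam, 1.0]])
    A_inv = np.array([[1.0, 0.0], [-lam, 1 - lam]]) / (1 - lam)
    assert (P @ A >= -1e-12).all() and (A_inv @ E >= -1e-12).all()
```

At λ = lam_hi one entry of Ẽ hits exactly zero, and at λ = lam_lo one entry of P̃ does, which is why the endpoints delimit the feasible interval.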
In the case of a rank higher than two, the analytical calculations for the SFS are substantially more complicated. Calculations for N equal to 3 can be seen in [3, 9].

The general sampling algorithm

Our algorithm uses the analytical calculations for N = 2 above and is inspired by Gibbs sampling [6]. The idea is to change each column of P into an affine combination with another column chosen by random sampling. This is done sequentially for each column of P while updating E correspondingly such that the matrix product remains the same. For this approach we define a general transformation matrix $A_{ij}(\lambda)$ of dimension N × N, similar to the one in (3):

$$\left(A_{ij}(\lambda)\right)_{uv} = \begin{cases} 1-\lambda & \text{if } u = v = i \\ 1 & \text{if } u = v \ne i \\ \lambda & \text{if } u = j,\ v = i \\ 0 & \text{otherwise.} \end{cases}$$

$A_{ij}(\lambda)$ changes column i of P to an affine combination of columns i and j, where $j \ne i$. Let $\Lambda_{ij}$ denote the feasible interval for λ given a solution P and E, such that $PA_{ij}(\lambda) \ge 0$ and $A_{ij}^{-1}(\lambda)E \ge 0$. The endpoints of this interval are similar to the ones in (7) and (8), i.e. a function of columns i and j in P and rows i and j in E:

$$\underline{\Lambda}_{ij} = \underline{f}(P, E, i, j) := \max_k\left\{ \frac{P_{ki}}{P_{ki}-P_{kj}} \,\middle|\, P_{kj} > P_{ki} \right\} \le 0 \tag{9}$$

$$\bar{\Lambda}_{ij} = \bar{f}(P, E, i, j) := \min_g\left\{ \frac{E_{jg}}{E_{ig}+E_{jg}} \,\middle|\, E_{ig}+E_{jg} > 0 \right\} \ge 0 \tag{10}$$

such that $\Lambda_{ij} = f(P, E, i, j) = [\underline{\Lambda}_{ij}, \bar{\Lambda}_{ij}]$.

Sampling a value λ in Λij

The most simple way to choose λ is uniformly at random in the interval $\Lambda_{ij}$, but this choice often gives slow convergence. The endpoints of the interval often lead to larger changes in P and E and are therefore more favorable. We therefore choose $\lambda \in \Lambda_{ij}$ from a shifted symmetric beta distribution with equal shape parameters denoted by β. Setting β = 1 results in a uniform choice, and setting β < 1 puts more weight on the endpoints of the interval. The choice of β is further discussed in Section 4.1, but for now we fix β = 0.5. Sampling $\lambda \in \Lambda_{ij}$ is done by sampling x from a regular beta distribution with both shape parameters equal to β and then setting $\lambda = x \cdot \bar{\Lambda}_{ij} + (1-x) \cdot \underline{\Lambda}_{ij}$. After sampling $\lambda \in \Lambda_{ij}$, the matrices P and E are updated to $P^{new} = PA_{ij}(\lambda)$ and $E^{new} = A_{ij}^{-1}(\lambda)E$. In one iteration, this is done sequentially for $i = 1, \dots, N$, where a random $j \ne i$ is chosen to mix with for each i.

After each iteration the values of P and E are saved, so after S iterations we have the following samples from the SFS for both P and E:

$$\mathcal{P}_S = \{P^0, P^1, \dots, P^S\} \qquad \mathcal{E}_S = \{E^0, E^1, \dots, E^S\}.$$

This means every entry in P and E has S samples besides the initial solution $(P^0, E^0)$. We often observe that for some entries all samples are equivalent, while for others they spread across an interval. Each new sample is created as an affine combination of the previous sample including itself, which means all the values between two samples in the interval are feasible as well. The SFS for each entry is therefore defined as the interval between the minimum and the maximum value of these S samples. Similarly to the polygon inflation algorithm [19], our algorithm assumes that the SFS consists of connected sets, as the samples are affine combinations of one another. This assumption has not appeared to restrict the SFS for any of the data sets we have used, but in theory the solution could have a larger SFS than what is found by our algorithm.

We define the size of $\mathcal{P}_S$ as the average change of each entry across the full sample:

$$\text{avg}[\mathcal{P}_S] = \frac{1}{K \cdot N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left\{ \max_{s=0,1,\dots,S} P^s_{kn} - \min_{s=0,1,\dots,S} P^s_{kn} \right\} \tag{11}$$

To assure that the algorithm stops at the right time, the size of the SFS is calculated after every T iterations, which we have chosen to fix at T = 1000.
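The transformation $A_{ij}(\lambda)$ and the shifted beta draw can be written compactly. The following is our own Python sketch (function names are ours), which also checks the convenient identity $A_{ij}(\lambda)^{-1} = A_{ij}(-\lambda/(1-\lambda))$ analogous to (4):

```python
import numpy as np

def A_ij(N, i, j, lam):
    """N x N transformation replacing column i of P by the affine combination
    (1 - lam) * P[:, i] + lam * P[:, j]; all other columns are unchanged."""
    A = np.eye(N)
    A[i, i] = 1.0 - lam
    A[j, i] = lam
    return A

def sample_lambda(lam_lo, lam_hi, beta, rng):
    """Shifted symmetric Beta(beta, beta) draw on [lam_lo, lam_hi];
    beta = 1 is uniform, beta < 1 favors the endpoints."""
    x = rng.beta(beta, beta)
    return x * lam_hi + (1 - x) * lam_lo

lam = 0.4
A = A_ij(3, 0, 2, lam)
# The inverse has the same one-parameter form, as in (4).
inverse_ok = np.allclose(A @ A_ij(3, 0, 2, -lam / (1 - lam)), np.eye(3))

rng = np.random.default_rng(2)
lam_draw = sample_lambda(-0.3, 0.5, beta=0.5, rng=rng)
```

Because the columns of $A_{ij}(\lambda)$ sum to one, the column normalization of P̃ is automatically preserved by every update.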
The algorithm is stopped when the size of the SFS has changed by less than ε within these T iterations,

$$\text{avg}[\mathcal{P}_S] - \text{avg}[\mathcal{P}_{S-T}] < \epsilon,$$

where ε is a small fixed tolerance. This assures that our algorithm continues until convergence and stops when no more changes are found. One could add a similar stopping criterion for $\mathcal{E}_S$, but in our experiments we found the proposed criterion satisfactory for finding the whole SFS for both P and E. The sampling algorithm is summarized in Algorithm 1.

Algorithm 1: Finding the SFS given (P̂, Ê)

    Initialize P⁰ = P̂ and E⁰ = Ê
    for s = 1, 2, ... do
        P ← P^{s−1}, E ← E^{s−1}
        for i = 1, ..., N do
            j ← random element of {1, ..., N} \ {i}
            Λij = f(P, E, i, j)
            x ← Beta(β, β)
            λ = x · Λ̄ij + (1 − x) · Λ̲ij
            P ← P Aij(λ)
            E ← Aij(−λ/(1 − λ)) E
        end
        P^s = P, E^s = E
        stop if avg[P_s] − avg[P_{s−T}] < ε   (checked every T iterations)
    end

Results

In this section results from the sampling algorithm are compared with the polygon inflation algorithm. The comparison is made for both the Breast and Lung A. cancer data, assuming three mutational processes. The results from our sampling algorithm are illustrated in Figures 2a and 2b in the representation defined in (2). Before making the comparison we have to define another representation of the SFS that is more commonly used in chemometrics [3, 7, 9, 16, 19], and in particular by the polygon inflation algorithm. After introducing this representation, the assumption of three mutational processes for the comparison is more easily justified.
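Algorithm 1 is short enough to sketch directly. The following Python version is our own simplified illustration (not the authors' R package SFS): it omits the T/ε stopping rule, runs a fixed number of iterations, and tracks the entrywise minimum and maximum of the sampled P matrices.

```python
import numpy as np

def feasible_interval(P, E, i, j):
    """Endpoints of the feasible interval Lambda_ij keeping both
    P A_ij(lam) and A_ij(lam)^{-1} E non-negative."""
    pi, pj = P[:, i], P[:, j]
    ei, ej = E[i, :], E[j, :]
    mask_p = pj > pi
    lo = np.max(pi[mask_p] / (pi[mask_p] - pj[mask_p])) if mask_p.any() else 0.0
    mask_e = (ei + ej) > 0
    hi = np.min(ej[mask_e] / (ei[mask_e] + ej[mask_e])) if mask_e.any() else 0.0
    return lo, hi

def sample_sfs(P, E, beta=0.5, n_iter=500, seed=0):
    """Simplified sketch of Algorithm 1 (fixed iteration count)."""
    rng = np.random.default_rng(seed)
    P, E = P.copy(), E.copy()
    N = P.shape[1]
    P_min, P_max = P.copy(), P.copy()
    for _ in range(n_iter):
        for i in range(N):
            j = rng.choice([c for c in range(N) if c != i])
            lo, hi = feasible_interval(P, E, i, j)
            x = rng.beta(beta, beta)
            lam = x * hi + (1 - x) * lo
            # Apply the inverse transform to E first (it uses the old rows),
            # then replace column i of P by the affine combination.
            E[j, :] -= lam / (1 - lam) * E[i, :]
            E[i, :] /= (1 - lam)
            P[:, i] = (1 - lam) * P[:, i] + lam * P[:, j]
        P_min = np.minimum(P_min, P)
        P_max = np.maximum(P_max, P)
    return P_min, P_max

rng = np.random.default_rng(3)
P0 = rng.uniform(0.1, 1.0, size=(10, 3))
P0 /= P0.sum(axis=0)                 # columns of P sum to one, as in (2)
E0 = rng.uniform(0.1, 1.0, size=(3, 8))
P_min, P_max = sample_sfs(P0, E0)
sfs_size = np.mean(P_max - P_min)    # the average width, as in (11)
```

In the real algorithm the loop instead runs until the size in (11) stabilizes; `[P_min, P_max]` is the entrywise SFS recovered by the samples.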
The SFS relative to the singular value decomposition

In chemometrics it is common to define the SFS relative to the Singular Value Decomposition (SVD). Note that any matrix $PE \in \mathbb{R}_+^{K \times G}$ of rank N can be decomposed into $PE = U\Sigma V'$, where $U \in \mathbb{R}^{K \times N}$ and $V \in \mathbb{R}^{G \times N}$ contain the N eigenvectors of $PE(PE)'$ and $(PE)'PE$, respectively, and $\Sigma \in \mathbb{R}^{N \times N}$ is the diagonal matrix of singular values. The factorization into $U$ and $\Sigma V'$ consists of both positive and negative entries, but it can be transformed by a matrix $T \in \mathbb{R}^{N \times N}$ such that $UT \ge 0$ and $T^{-1}\Sigma V' \ge 0$. The columns of T are normalised and T is defined by

$$T = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \alpha_1 & w_{1,1} & \cdots & w_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{N-1} & w_{N-1,1} & \cdots & w_{N-1,N-1} \end{pmatrix} = \begin{pmatrix} e' \\ \alpha \;\; W \end{pmatrix} \tag{12}$$

where $\alpha = (\alpha_1, \dots, \alpha_{N-1})' \in \mathbb{R}^{N-1}$, $W \in \mathbb{R}^{(N-1) \times (N-1)}$, and $e = (1, \dots, 1)'$ is a vector of ones. The scaling of the columns is made such that the first entry of each column is 1. Here, the SFS is defined by

$$\mathcal{M}_{SVD}(P) = \left\{ \alpha \in \mathbb{R}^{N-1} \,\middle|\, \exists W \in \mathbb{R}^{(N-1) \times (N-1)}: UT \ge 0,\ T^{-1}\Sigma V' \ge 0 \right\}$$
$$\mathcal{M}_{SVD}(E) = \left\{ \alpha \in \mathbb{R}^{N-1} \,\middle|\, \exists W \in \mathbb{R}^{(N-1) \times (N-1)}: U(T')^{-1} \ge 0,\ T'\Sigma V' \ge 0 \right\} \tag{13}$$

which is similar to the definition in [16, 19], where a detailed description of this construction can be found. As the set in (13) is defined in terms of $\alpha \in \mathbb{R}^{N-1}$, we choose to assume N = 3 such that the SFS can be visualized in 2D. A big advantage of $\mathcal{M}_{SVD}$ is that it removes the problem of reordering columns and rows, as it only focuses on possible transformations of the SVD of PE and not the actual values of P and E.

The connection between the definition in (13) and the one in (2) is mostly a matter of different normalizations of P and E. Constructing an invertible T with columns consisting of different $\alpha \in \mathcal{M}_{SVD}(P)$ gives

$$PE = U\Sigma V' = \underbrace{UTD^{-1}}_{\tilde{P}}\ \underbrace{DT^{-1}\Sigma V'}_{\tilde{E}}$$

where $\tilde{P} \in \mathcal{M}(P)$, $\tilde{E} \in \mathcal{M}(E)$ and $D = \text{diag}(e'UT) \in \mathbb{R}^{N \times N}$. The diagonal matrix D scales the matrices such that $\tilde{P}$ has columns summing to one. Conversely, given $\tilde{P} \in \mathcal{M}(P)$, the values of α can be found in the columns of $T = U'\tilde{P}D^{-1}$, as the eigenvectors in U are orthonormal; here $D = \text{diag}(e_1'U'\tilde{P}) \in \mathbb{R}^{N \times N}$ with $e_1 = (1, 0, \dots, 0)'$, so that D assures that the first row of T consists of ones, as in (12).

The set in (2) is easier to interpret as it visually shows the direct possible changes in P and E, but the one in (13) gives a simpler visualization of the SFS and is better suited for comparison with the polygon inflation algorithm.

The polygon inflation algorithm

The polygon inflation algorithm is an algorithm used to approximate the SFS. The algorithm is introduced in [19], and further analysis and proofs are given in [16]. The algorithm assumes that the set $\mathcal{M}_{SVD}$ for both P and E consists of at most N separated whole free subsets. The examples in Figure 3 all have three separated whole free subsets, such that all the points within the polygons are feasible solutions. This assumption is also made by the sampling algorithm, as its SFS is always connected and cannot have two different feasible intervals for one entry. The idea of the polygon inflation algorithm is to approximate the subsets of the SFS in terms of the representation in (13) by inflating an initial solution in $\mathcal{M}_{SVD}$ to the boundaries, which creates a polygon. The algorithm continues to inflate the edges of the polygon with new vertices on the boundary of the SFS until the area stops increasing.
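The mapping of a feasible solution into the $\mathcal{M}_{SVD}$ representation described above can be sketched numerically. Below is our own Python illustration (not from the paper): build an exact rank-N product, take its truncated SVD, and recover $T = U'\tilde{P}D^{-1}$ with a first row of ones.

```python
import numpy as np

rng = np.random.default_rng(4)
K, G, N = 12, 9, 3
P = rng.uniform(0.1, 1.0, size=(K, N))
P /= P.sum(axis=0)                 # column-normalised solution, as in (2)
E = rng.uniform(0.1, 1.0, size=(N, G))
M = P @ E                          # exact rank-N product

# Truncated SVD: M = U S Vt with U in R^{K x N}.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U, s, Vt = U[:, :N], s[:N], Vt[:N, :]

# T = U' P D^{-1}, where D rescales each column so the first row of T is
# all ones (cf. (12)).
T = U.T @ P
D = T[0, :].copy()
T = T / D
alpha = T[1:, :]     # each column of T yields one point alpha in R^{N-1}

first_row_ones = np.allclose(T[0, :], 1.0)
# U spans the column space of M, which contains the columns of P, so U T
# reproduces P up to the column scaling by D.
reconstructs = np.allclose(U @ T, P / D)
```

For N = 3 each column of T gives a 2D point α, which is exactly what is plotted in the $\mathcal{M}_{SVD}$ figures.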
In the package FAC-PACK, the polygon inflation algorithm has been applied for rank three and four. The results from the polygon inflation algorithm and the sampling algorithm with rank three are seen in Figure 3, where the grey and black dots illustrate the samples from the sampling algorithm and the colored polygons are the SFS found by the polygon inflation algorithm.

Figure 2: The signatures and exposures for the (a) Breast and (b) Lung A. cancer data, when assuming three mutational processes. In (A) the signatures are depicted, with the SFS marked in red from the minimum feasible value to the maximum feasible value. In (B) the normalised exposures are depicted and the SFS is again marked in red. The signatures are arranged according to their total exposure, i.e. $\sum_{g=1}^{G} E_{ng}$, in decreasing order.

Figure 3: The SFS for the Breast and Lung A. cancer data in the $\mathcal{M}_{SVD}$ representation, where the grey dots show the results of all the iterations from the sampling algorithm and the black dots highlight the results from the last 500 samples. The colored polygons are the SFS found by the polygon inflation algorithm. Coloring of the three signatures is in accordance with Figure 2.

The colored lines of the different areas in Figure 3 correspond to the signatures with the same strip colors in Figure 2. The sampling algorithm proposed here clearly gives results similar to the polygon inflation algorithm, which has been the case for all examples we have tried. The SFS for E is also completely found, even though the stopping criterion and the sampling are made in relation to the signature matrix P. The matrix used as a reference is therefore unimportant for finding the SFS of both P and E.

A curious observation is that the sizes of the areas in Figure 3 are hard to translate directly into the sizes of the changes of the different entries in Figure 2. Still, the representation in Figure 3 gives a simpler visualization of the SFS. The advantages of our sampling algorithm compared to the polygon inflation algorithm are that it is easier to implement and scales to an arbitrary dimension N, as shown in Figure 5. This is especially important for cancer genomics, where the number of mutational processes varies a lot depending on e.g.
the number of samples and the number of cancer types. Potentially, this could also be advantageous in other fields where NMF is applied with a higher rank N.

Choice of β and rank N

Here we will discuss the choice of the parameter β and the running time of the sampling algorithm. The sampling algorithm is applied with an assumed rank between 2 and 10, which also makes it possible to comment on how the size of the SFS and the computation time are influenced by an increase in the rank.

The influence of the parameter β is found by running the sampling algorithm multiple times for β ∈ {0.1, 0.5, 1} on the Breast and Lung A. cancer data. We test β = 1, which is equivalent to choosing λ uniformly at random, β = 0.1, which puts most of the weight on the endpoints of the interval, and β = 0.5, which is something in-between the two cases. An illustration of 1000 samples from the algorithm with β ∈ {0.1, 0.5, 1} is seen in Figure 4. In Figure 5, avg[P_S] is depicted as a function of N, i.e. the number of assumed mutational processes. The size of the SFS for the Breast cancer data is close to zero for five mutational processes and above. Even below five mutational processes the average variations of the entries seem to be fairly small.

Figure 4: Sampling 1000 points in the SFS $\mathcal{M}_{SVD}(P)$ for different choices of β. Here, the results are illustrated on the second signature of the Lung A. cancer data in Figure 3.

Figure 5: The size of the SFS for different numbers of assumed mutational processes, i.e. N. Each point is based on the results of running the sampling algorithm 50 times for each choice of β and N. The error bars show the percentiles 0.25 and 0.75 of the 50 different runs.

On the contrary, the size of the SFS for the Lung A. cancer data increases with the number of signatures. The increase of the SFS when N equals five instead of three can be seen by comparing Figure 1 and Figure 2b, where there are more red areas with N equal to five than with three. Here, it is clear that an increase in the number of mutational processes does not necessarily decrease the size of the SFS. The values avg[P_S] on the y-axis can appear small, but remember from the definition in (11) that the sum is divided by K · N.

The choice of β seems to influence both the stability and the final size of avg[P_S]. Sampling λ ∈ Λij uniformly, i.e. β = 1, spreads the samples equally across the area, which gives a very low probability of sampling near the endpoints and thereby a smaller estimated size of the SFS, as also seen in Figure 5. On the contrary, sampling λ ∈ Λij with too much weight on the endpoints, i.e. β = 0.1, explores the interior of the SFS more slowly, which is why we use the intermediate choice β = 0.5.
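The effect of β on where λ lands in the feasible interval is easy to quantify. The sketch below (our own illustration; the 5% threshold for "near an endpoint" is an arbitrary choice of ours) measures the fraction of Beta(β, β) draws close to the interval ends.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
edge = 0.05   # "near an endpoint" = within 5% of either end of [0, 1]

frac_near_edges = {}
for beta in (0.1, 0.5, 1.0):
    x = rng.beta(beta, beta, size=n)
    frac_near_edges[beta] = np.mean((x < edge) | (x > 1 - edge))

# Smaller beta puts more mass near the endpoints of the feasible interval;
# beta = 1 (uniform) places only ~10% of the draws there.
assert frac_near_edges[0.1] > frac_near_edges[0.5] > frac_near_edges[1.0]
```

This is the trade-off discussed above: β = 1 rarely probes the boundary of the SFS, while β = 0.1 rarely probes the interior.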
Figure 6: The number of iterations and the time needed to compute the SFS for different numbers of mutational processes for the Breast and Lung A. cancer data, where the algorithm was repeated 50 times. The average is reported together with the quantiles 0.05 and 0.95.
Running time

Our sampling algorithm is not only simple to implement, but is also very fast, even for higher ranks of the factorization. In Figure 6, the average number of iterations and the computation time (using a regular laptop with an Intel Core i7-8565U CPU @ 1.80 GHz and 16 GB RAM) are shown for the two cancer data sets for different values of N, where the algorithm was repeated 50 times for each choice of N. In general the computation time is highly influenced by the size of the SFS and the rank N.

Identifiability in relation to initialization and noise

The identification of P and E is not only influenced by the size of the SFS, but also by noise in the data and by the initialization of the algorithm by Lee and Seung [13]. Here, we will cover the influence of these aspects and compare it to the SFS. In cancer genomics the most natural assumption is that each mutational count is drawn from a Poisson distribution,

$$M_{kg} \sim \text{Pois}\left((PE)_{kg}\right). \tag{14}$$

Under this assumption, a natural way to estimate P and E is to maximize the log-likelihood function

$$\ell(P, E; M) = \sum_{k=1}^{K}\sum_{g=1}^{G} \left[ M_{kg}\log\left((PE)_{kg}\right) - (PE)_{kg} - \log(M_{kg}!) \right] \tag{15}$$
$$= -D(M \mid PE) + C \tag{16}$$

where

$$D(M \mid PE) = \sum_{k=1}^{K}\sum_{g=1}^{G} \left[ M_{kg}\log\left(\frac{M_{kg}}{(PE)_{kg}}\right) - M_{kg} + (PE)_{kg} \right] \tag{17}$$

is the generalised Kullback-Leibler divergence and C is a constant depending only on M. Hence, maximizing the likelihood function is equivalent to minimizing the generalised Kullback-Leibler divergence D(M | PE). This is the divergence appearing in Lee and Seung [13], which is decreased by each update. Notice that the objective function is convex in either P or E, but not in both variables together. As mentioned in Lee and Seung [13], the update rules can therefore not ensure a global minimum, only a local minimum. Several initializations make it more likely that a global minimum is reached, which is shown in the next section.
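The equivalence between maximizing the Poisson likelihood and minimizing D(M | PE) can be checked directly. The sketch below is our own Python illustration of the multiplicative KL updates of Lee and Seung [13] on made-up counts, verifying that the updates decrease the divergence (17):

```python
import numpy as np

def gkl(M, PE):
    """Generalised Kullback-Leibler divergence D(M | PE), equation (17).
    The convention 0 * log(0/x) = 0 is handled via a mask."""
    mask = M > 0
    div = np.sum(PE - M)
    div += np.sum(M[mask] * np.log(M[mask] / PE[mask]))
    return div

def lee_seung_step(M, P, E):
    """One multiplicative update pair for the KL objective [13]."""
    P = P * ((M / (P @ E)) @ E.T) / E.sum(axis=1)
    E = E * (P.T @ (M / (P @ E))) / P.sum(axis=0)[:, None]
    return P, E

rng = np.random.default_rng(6)
M = rng.poisson(5.0, size=(20, 15)).astype(float)  # toy count matrix
P = rng.uniform(0.1, 1.0, size=(20, 3))
E = rng.uniform(0.1, 1.0, size=(3, 15))

d_before = gkl(M, P @ E)
for _ in range(50):
    P, E = lee_seung_step(M, P, E)
d_after = gkl(M, P @ E)
assert d_after < d_before   # each update decreases the divergence
```

From a random start the divergence drops quickly, but different random starts may end in different local minima, which is why several initializations are used below.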
Initialization

Throughout this paper, we have performed at least five random initializations when finding an NMF solution. The influence of increasing the number of initializations can be seen in Figure 7a, which shows how just a few initializations stabilize the reached minimum to a large extent. As the minimum stops changing, we assume the global minimum has been found. Even though the algorithm reaches the same minimum, we will still observe variability in the estimates of P and E when there is variability in the SFS.
Figure 7: The influence of initialization. In (a), the average of the minimum D(M | PE) for a given number of initializations is depicted together with error bars that show the quantiles 0.05 and 0.95. The average and quantiles are based on 100 runs. In (b), the results from the 100 different runs with ten initializations are shown in $\mathcal{M}_{SVD}(P)$ relative to the solution found previously in Figure 3(a).

Different random initializations reaching the same minimum will still give different results within the SFS, which is illustrated in Figure 7b. Some of the SFS can therefore appear through the initialization, but using initializations to find the SFS would be a very unreliable and time-consuming way to recover it. Also, it is clear to observe in Figure 7b how the iterative procedure often converges to the same areas of the SFS.

Noise

Noise in the data obviously has an influence on the identification of P and E, but sometimes the influence can seem larger than it actually is due to variability in the SFS. We will illustrate this on the Breast cancer data assuming three mutational processes, where we have variability in the SFS. Assume we have found a global minimum factorization $\hat{P}\hat{E} \in \mathbb{R}^{96 \times 21}$ that approximates the Breast cancer data $M \in \mathbb{N}^{96 \times 21}$. The influence of variance is now illustrated through parametric bootstrapping using the assumption in (14). Specifically, we create 100 bootstrap samples

$$M^*_b \sim \text{Pois}(\hat{P}\hat{E})$$

for b = 1, ..., 100. The global minimum is then found for each of the bootstrap samples $(M^*_1, \dots, M^*_{100})$, both for 10 different random initializations and for the same 10 initializations, which is illustrated in Figure 8. We clearly see how the apparent variability of the results is influenced by the SFS through random initialization. A large variance could therefore appear as low reproducibility of the signatures, but is actually just a result of the SFS. Remember, having a large SFS does not make the assumed number of mutational processes less correct; it is instead a result of the structure of the signatures and exposures.
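The parametric bootstrap under the Poisson assumption (14) is a one-liner per sample. The sketch below is our own illustration with made-up stand-ins for P̂ and Ê (in the paper, NMF is refitted to every bootstrap sample; we only show the construction):

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for a fitted factorization of 96 x 21 count data; the real
# P_hat and E_hat would come from the NMF fit of the Breast cancer data.
P_hat = rng.uniform(0.1, 1.0, size=(96, 3))
E_hat = rng.uniform(1.0, 20.0, size=(3, 21))
mean = P_hat @ E_hat

# Parametric bootstrap under the Poisson assumption (14): B samples M*_b.
B = 100
boot = np.stack([rng.poisson(mean) for _ in range(B)])

# The simulated counts scatter around the fitted mean P_hat E_hat.
rel_err = abs(boot.mean(axis=0).mean() - mean.mean()) / mean.mean()
```

Refitting NMF to each `boot[b]` and comparing the fitted signatures then separates genuine sampling variance from spread that is merely movement inside the SFS.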
Figure 8: Visualization of the identification problem due to variance, where (a) shows the results from random initialization and (b) shows the results using the same ten initializations for each sample. Both show the global minimum from 100 bootstrap samples.
Conclusion
We introduce a novel sampling algorithm that can find the SFS for an arbitrary rank of NMF. The algorithm is based on the simple and well-known calculations for two dimensions, together with random sampling to explore the possible changes of the initial solution. The algorithm gives results equivalent to current algorithms, most notably the polygon inflation algorithm. The problem of non-uniqueness of NMF in relation to mutational counts has also been highlighted here. The size of the SFS is shown to depend strongly on the specific data set at hand and on the assumed number of mutational processes. Furthermore, the importance of several initializations is demonstrated, together with the important fact that a large variance in a signature could potentially be caused by the SFS.

An important aspect of NMF is to recover the true rank of the factorization. In cancer genomics, methods have been introduced based on the Bayesian Information Criterion (BIC) [5, 18], as well as a method that includes both the variability of the signatures and how well P and E approximate M [2]. The latter is definitely more questionable, as large variations could appear due to the size of the SFS. An interesting future problem is the influence of having an SFS on the further analysis of the mutational processes. Would the conclusion from one solution in the SFS be very different from another? One way to remove the problem of non-uniqueness of the solution is to decrease the variability of the signatures through fewer parameters. This was implemented in a simple form in [20].

Our sampling algorithm is very simple to implement and fast, which makes it easy to check whether an NMF solution is unique or not. This article has highlighted the problem of non-uniqueness of NMF and at the same time shown a simple way to check this often overlooked problem.
It is important to remember that the size of the SFS is not a tool to determine the correct choice of N, but rather a way to find all the solutions that give the same global minimum approximation of the data. Variability within the SFS is not the result of a poor factorization, but of a lack of uniqueness of the matrix factorization.

Acknowledgements
We would like to thank Joachim Beck for his master thesis on this subject, as it gave much inspiration to the development of the sampling algorithm. Furthermore, we would like to thank Dan Ariel Søndergaard and Kenneth Borup for helpful suggestions and corrections to the manuscript.
References

[1] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, S. A. Aparicio, S. Behjati, A. V. Biankin, G. R. Bignell, N. Bolli, A. Borg, A.-L. Børresen-Dale, et al. Signatures of mutational processes in human cancer. Nature, 500(7463):415–421, 2013.
[2] L. B. Alexandrov, S. Nik-Zainal, D. C. Wedge, P. J. Campbell, and M. R. Stratton. Deciphering signatures of mutational processes operative in human cancer. Cell Reports, 3(1):246–259, 2013.
[3] O. S. Borgen and B. R. Kowalski. An extension of the multivariate component-resolution method to three components. Analytica Chimica Acta, 174:1–26, 1985.
[4] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? Advances in Neural Information Processing Systems, 16:1141–1148, 2003.
[5] A. Fischer, C. J. Illingworth, P. J. Campbell, and V. Mustonen. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome Biology, 14(4):1–10, 2013.
[6] A. E. Gelfand and A. F. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.
[7] P. J. Gemperline. Computation of the range of feasible solutions in self-modeling curve resolution algorithms. Analytical Chemistry, 71(23):5398–5404, 1999.
[8] A. Golshan, H. Abdollahi, S. Beyramysoltan, M. Maeder, K. Neymeyr, R. Rajkó, M. Sawall, and R. Tauler. A review of recent methods for the determination of ranges of feasible solutions resulting from soft modelling analyses of multivariate data. Analytica Chimica Acta, 911:1–13, 2016.
[9] R. C. Henry and B. M. Kim. Extension of self-modeling curve resolution to mixtures of more than three components: Part 1. Finding the basic feasible region. Chemometrics and Intelligent Laboratory Systems, 8(2):205–216, 1990.
[10] K. Huang, N. D. Sidiropoulos, and A. Swami. Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Transactions on Signal Processing, 62(1):211–224, 2013.
[11] H. Laurberg, M. G. Christensen, M. D. Plumbley, L. K. Hansen, and S. H. Jensen. Theorems on positive data: On the uniqueness of NMF. Computational Intelligence and Neuroscience, 2008.
[12] W. H. Lawton and E. A. Sylvestre. Self modeling curve resolution. Technometrics, 13(3):617–633, 1971.
[13] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[14] C. Mason, M. Maeder, and A. Whitson. Resolving factor analysis. Analytical Chemistry, 73(7):1587–1594, 2001.
[15] S. Moussaoui, D. Brie, and J. Idier. Non-negative source separation: range of admissible solutions and conditions for the uniqueness of the solution. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), volume 5, pages v–289. IEEE, 2005.
[16] K. Neymeyr and M. Sawall. On the set of solutions of the nonnegative matrix factorization problem. SIAM Journal on Matrix Analysis and Applications, 39(2):1049–1069, 2018.
[17] S. Nik-Zainal and S. Morganella. Mutational signatures in breast cancer: the problem at the DNA level, 2017.
[18] R. A. Rosales, R. D. Drummond, R. Valieris, E. Dias-Neto, and I. T. Da Silva. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics, 33(1):8–16, 2017.
[19] M. Sawall, C. Kubis, D. Selent, A. Boerner, and K. Neymeyr. A fast polygon inflation algorithm to compute the area of feasible solutions for three-component systems. I: concepts and applications. Journal of Chemometrics, 27(5):106–116, 2013.
[20] Y. Shiraishi, G. Tremmel, S. Miyano, and M. Stephens. A simple model-based approach to inferring and visualizing cancer mutation signatures.