A Goodness-of-fit Test on the Number of Biclusters in a Relational Data Matrix
Chihiro Watanabe∗ and Taiji Suzuki†
Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan
Abstract
Biclustering is the problem of detecting homogeneous submatrices in a given observed matrix, and it has been shown to be an effective tool for relational data analysis. Although there have been many studies for estimating the underlying bicluster structure of a matrix, few have enabled us to determine the appropriate number of biclusters in an observed matrix. Recently, a statistical test on the number of biclusters has been proposed for a regular-grid bicluster structure, where we assume that the latent bicluster structure can be represented by row-column clustering. However, when the latent bicluster structure does not satisfy such a regular-grid assumption, the previous test requires too many biclusters (i.e., a finer bicluster structure) for the null hypothesis to be accepted, which is not desirable in terms of interpreting the accepted bicluster structure. In this paper, we propose a new statistical test on the number of biclusters that does not require the regular-grid assumption, and derive the asymptotic behavior of the proposed test statistic in both null and alternative cases. To develop the proposed test, we construct a consistent submatrix localization algorithm, that is, one for which the probability of outputting the correct bicluster structure converges to one. We show the effectiveness of the proposed method by applying it to both synthetic and practical relational data matrices.
Keywords. biclustering, submatrix detection, consistent submatrix localization, goodness-of-fit test, random matrix theory
Relational data is a kind of matrix data whose entries show some kind of relationship between two (generally different) objects. For example, the rows and columns, respectively, of an observed matrix A ∈ R^{n×p} represent customers and products, and each entry A_ij is the number of times the ith customer purchased the jth product. It has been shown that we can successfully model various kinds of relational data matrices, including customer-item transaction/rating data [46, 48], document-word co-occurrence data [16, 20], and gene expression data [36, 39, 42, 49], by assuming the existence of latent biclusters or some “homogeneous” submatrices (e.g., the entries in each bicluster are identically distributed). In the example of the above customer-item relational data matrix, this assumption corresponds to assuming that there are some groups of customers I ⊆ {1, . . . , n} and of items J ⊆ {1, . . . , p}, and that the customers in I tend to purchase the items in J at a similar frequency.
In regard to the bicluster structure of a relational data matrix, the following two problems have been extensively studied in the literature: submatrix detection and localization. Submatrix detection is the task of detecting the existence of such biclusters in a given observed matrix A (i.e., whether or not matrix A contains at least one bicluster) [7, 26, 35, 45]. As in the previous studies [9, 13], in this paper we distinguish such a task from submatrix localization (which is also known as biclustering), whose purpose is to recover the exact position of such biclusters. So far, many biclustering methods have been proposed for a fixed number of biclusters K [9, 13, 24, 27, 45]. In most practical cases, however, we would not have prior knowledge about K in a given data matrix. Therefore, it is an important task to develop a method to appropriately determine K from the observed data A.
∗ chihiro [email protected] † [email protected]
In the next two paragraphs, we introduce some related studies that have proposed methods for choosing K.
Statistical test on the number of biclusters K. Although there have been many studies for testing whether an observed matrix A contains any large average submatrix or not [6, 7, 10, 32, 35], few statistical test methods have been proposed for the number of biclusters K in matrix A. Recently, statistical tests on K have been proposed in [3, 28, 30, 55] with the constraint that the underlying bicluster structure should be represented by a regular grid (as shown in Figure 1 (b-2)). However, if the latent bicluster structure does not satisfy the regular grid constraint (as shown in Figure 1 (b-1)), such a test needs a larger hypothetical number of biclusters K (i.e., a finer bicluster structure) than necessary to accept the null hypothesis K₀ = K, which is not desirable from the perspective of interpreting the accepted bicluster structure. To cope with such a problem, we need a more flexible model which can represent the existence of local biclusters [45]. For singular-value-decomposition-based biclustering, a stopping criterion for detecting multiple biclusters (which determines K) based on stability selection has been proposed in [47]. This method has enabled us to detect a bicluster structure with Type I error control and without the regular grid constraint. However, unlike our proposed method, its Type I error has been guaranteed only in terms of an upper bound, not the null distribution of a test statistic. Therefore, this previous study provides no way to perform a statistical test on the number of biclusters. Moreover, it has no theoretical guarantee for the alternative cases (i.e., statistical power).
In this paper, we address these problems by developing a new statistical test on K, which does not require the regular grid constraint and whose test statistic T shows a good property in an alternative case (i.e., with high probability, T asymptotically increases with the matrix size and thus the Type II error converges in probability to zero), as shown later in Theorem 4.3.
Information criterion on the number of biclusters K. Aside from the statistical-test-based methods, some studies have proposed to determine K based on the minimum description length [44, 50, 57] and a modified DIC for the biclustering problem [11, 12]. Particularly, under the regular grid constraint on the bicluster structure, an information criterion called the integrated completed likelihood (ICL) has been proposed for determining K [15, 33, 56], which approximates the maximum marginal likelihood for a given K. These methods aim to select the optimal number of biclusters K from a given set of candidates in terms of some criterion (e.g., marginal likelihood). This purpose is different from that of a statistical test, which aims to judge whether or not we accept a hypothetical number of biclusters K with a specific significance level given by a user.
Other approaches for determining the number of biclusters K. Aside from the above information-criterion-based and statistical-test-based methods, some studies have proposed to construct a generative model of the bicluster structure including the number of biclusters K and then select an optimal model in terms of some measure (e.g., choose a MAP estimator) [37, 43]. There have also been some heuristic criteria for determining the number of biclusters in an observed matrix, which have been proposed as stopping rules for top-down division-based biclustering algorithms [18, 26, 51] or bottom-up merging-based ones [41]. Particularly, in [3, 28, 30], the observed matrix A (and thus its bicluster structure) is assumed to be square symmetric.
It must be noted that Theorem 4.3 shows the asymptotic behavior of test statistic T under the assumption that an observed matrix has some latent bicluster structure. Deriving its behavior in the case that an observed matrix cannot be represented by such a model is beyond the scope of this paper.
In this paper, we propose a statistical test method for the number of biclusters K in a given observed matrix A, without the regular grid constraint on the latent bicluster structure. To guarantee the asymptotic behavior of the proposed test statistic in the null case (i.e., K₀ = K), which is given in Theorem 4.1, we use the properties of a random matrix with a sub-exponential decay [4, 40]. Moreover, we derive its behavior in the alternative case (i.e., K₀ > K): it increases with the matrix size with high probability, which is given in Theorem 4.3. Unlike the previous study [55], where the number of biclusters K₀ is assumed to be a fixed constant that does not depend on the matrix size m, we consider a case in which K₀ might increase with m (the precise description of this assumption is given in (9)). Finally, we explain a method to estimate K₀ from observed matrix A, by sequentially testing the hypothetical numbers of biclusters in ascending order (i.e., K = 0, 1, 2, . . .) until the null hypothesis is accepted.
In the proposed test, we assume that the submatrix localization algorithm is consistent, that is, if the hypothetical number of biclusters K is equal to the null one K₀, the probability that it outputs the correct bicluster structure converges to one asymptotically. Such a consistency condition in submatrix localization has been considered in several previous studies; however, most of them have made the assumption that there exists at most one bicluster in a given observed matrix [1, 5, 8, 23, 24, 29, 34]. It would be possible to localize multiple biclusters by applying these methods to a given observed matrix multiple times; however, there is no guarantee of consistency for such a heuristic approach.
When considering submatrix localization with multiple biclusters, we have to take into account the notions of disjointness and bi-disjointness. We call a bicluster structure disjoint iff each entry belongs to at most one bicluster (as shown in Figure 1 (a) and (b)).
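The sequential estimation of K₀ described above can be sketched as follows. Here `test_stat` and `threshold` are hypothetical stand-ins for the paper's test statistic T and its critical value (both are only developed later, in Section 4); the loop itself is the selection rule: test K = 0, 1, 2, . . . and return the first accepted K.

```python
def select_num_biclusters(A, test_stat, threshold, K_max=10):
    """Test K = 0, 1, 2, ... and return the first K at which the null
    hypothesis is accepted (the statistic stays below its critical value)."""
    for K in range(K_max + 1):
        if test_stat(A, K) <= threshold(A, K):
            return K  # accept the null hypothesis K0 = K
    return None  # no hypothetical K was accepted up to K_max

# Toy demo with a fake statistic that is small only at K = 2.
K_hat = select_num_biclusters(
    A=None,  # the toy statistic below ignores the matrix
    test_stat=lambda A, K: abs(K - 2.0),
    threshold=lambda A, K: 0.5,
)
```

With the toy statistic above, K = 0 and K = 1 are rejected and K = 2 is the first accepted hypothesis, so `K_hat` is 2.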
Bi-disjointness is a stricter condition: we call a disjoint bicluster structure bi-disjoint iff each row or column belongs to at most one bicluster (as shown in Figure 1 (a)). For a bi-disjoint bicluster structure, some submatrix localization algorithms, based on the maximum likelihood estimator for a known model parameter [13] and on singular value decomposition [9], have been shown to be consistent. However, these algorithms cannot be directly applied to non-bi-disjoint cases, where the localization problem cannot be formulated as a row-column (hard) clustering [45].
As another direction, by imposing the regular grid constraint on the underlying bicluster structure, we can consider a special case of non-bi-disjoint bicluster structures, which can be represented as a result of row-column clustering (as shown in Figure 1 (b-2)). As for the biclustering problem with such a regular grid structure, Flynn and Perry [19] have proposed a consistent algorithm based on the criterion of (generalized) profile likelihood. In this paper, we extend this method and develop a consistent submatrix localization algorithm without the regular grid assumption (as shown in Figure 1 (b-1)).
In summary, the main contributions of this paper are as follows:
1. We propose for the first time a statistical test on the number of biclusters K in a given observed matrix, under the assumptions that the underlying bicluster structure is disjoint (but not necessarily bi-disjoint) and that the submatrix localization algorithm is consistent.
2. To fulfill the above assumption, we develop a consistent submatrix localization algorithm for a disjoint but not necessarily bi-disjoint bicluster structure (including the cases of Figure 1 (a) and (b)).
We first prove the consistency of the proposed algorithm based on (generalized) profile likelihood [38], and then propose an approximated algorithm to obtain the maximum profile likelihood solution while avoiding combinatorial explosion.
This paper is organized as follows: in Section 2, we describe the problem settings and the model of the underlying bicluster structure in an observed matrix. Then, in Section 3, we develop a consistent submatrix localization algorithm based on generalized profile likelihood. Based on the estimated bicluster structure given by such a consistent algorithm, in Section 4, we propose a statistical test on the number of biclusters K in an observed matrix, and derive its theoretical guarantees in both null and alternative cases. In Section 5, we give some experimental results which show the effectiveness of the proposed test. Finally, in Section 6, we discuss the obtained results and future work on the proposed method, and give a conclusion to this paper in Section 7.
2 Problem setting and statistical model for goodness-of-fit test for submatrix detection problem
Let A ∈ R^{n×p} be an n × p observed matrix. Given such an observed matrix A, the goal of the submatrix detection problem is to determine whether it contains one or multiple disjoint submatrices or biclusters, in each of which the entries are generated in the i.i.d. sense (Figure 1 (a) and (b)). As in the previous studies [9, 32], we distinguish the submatrix detection and localization problems in that the goal of the latter is not only to detect the existence of biclusters in an observed matrix, but to estimate their precise locations.
Let K₀ be the minimum number of such biclusters needed to represent matrix A, which is unknown in advance. Aside from the K₀ biclusters, we assume that there are “background” entries in matrix A that do not belong to any bicluster. Note that the difference between a bicluster and the background is that the former can be represented as a submatrix (i.e., {(i, j) : i ∈ I_k, j ∈ J_k} for some index sets I_k ⊂ {1, . . . , n} and J_k ⊂ {1, . . . , p}), while the latter does not necessarily have such a submatrix structure, as shown in Figure 1. Unlike the previous studies that have provided consistent submatrix localization methods with multiple biclusters [9, 13], we do not assume the underlying bicluster structure to be bi-disjoint, that is, that each row or column is assigned to at most one bicluster (Figure 1 (a)).
We denote the bicluster index of the (i, j)th entry of matrix A as g_ij ∈ {0, 1, . . . , K₀}, where g_ij = k if the (i, j)th entry belongs to the kth bicluster for some k ∈ {1, . . . , K₀}, and g_ij = 0 if it belongs to the background. We define the set of group indices of all the entries as g ≡ (g_ij)_{1≤i≤n, 1≤j≤p}. We also define I_k ≡ {(i, j) : g_ij = k}, which represents the set of entries in the kth group.
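As a concrete illustration, the entry-wise assignment g and the groups I_k can be represented as below; the matrix sizes and bicluster positions are arbitrary toy choices. The two biclusters occupy disjoint sets of entries but share two columns, so the structure is disjoint yet not bi-disjoint:

```python
import numpy as np

n, p = 6, 8
g = np.zeros((n, p), dtype=int)  # g_ij = 0 marks the background
g[0:3, 0:4] = 1                  # bicluster 1
g[3:6, 2:6] = 2                  # bicluster 2 (reuses columns 2 and 3)

# I_k: the set of entries (i, j) assigned to the k-th group.
I = {k: list(zip(*np.where(g == k))) for k in range(3)}

cols1 = {j for _, j in I[1]}
cols2 = {j for _, j in I[2]}
shared = cols1 & cols2           # non-empty, so the structure is not bi-disjoint
```

Since g is a single integer matrix, each entry belongs to exactly one group by construction, which is exactly the disjointness assumed by the proposed method.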
Specifically, we consider the following model:
P = (P_ij)_{1≤i≤n, 1≤j≤p}, P_ij = b_{g_ij},
σ = (σ_ij)_{1≤i≤n, 1≤j≤p}, σ_ij = s_{g_ij},
A = (A_ij)_{1≤i≤n, 1≤j≤p}, E[A_ij] = P_ij, E[(A_ij − P_ij)²] = σ_ij², (1)
where b_k and s_k > 0, respectively, are the mean and standard deviation of the kth bicluster (k = 1, . . . , K₀) or background (k = 0). This model is a generalized version of well-studied submatrix detection models, in which it is assumed that the mean of the background noise is zero (i.e., b₀ = 0) [7, 35, 45]. Let Z ∈ R^{n×p} be a standardized noise matrix, which is given by
Z = (Z_ij)_{1≤i≤n, 1≤j≤p}, Z_ij = (A_ij − P_ij)/σ_ij. (2)
In most cases, the number of biclusters K₀ is unknown in advance. The goal of this paper is to develop a statistical test on K₀, which is based on the following null (N) and alternative (A) hypotheses:
(N) : K₀ = K, (A) : K₀ > K, (3)
where K is a given hypothetical number of biclusters. In this paper, we only consider the cases where K ≤ K₀. If K = K₀ (i.e., the null case), then we call it a realizable case. Otherwise (i.e., in the alternative case), we call it an unrealizable case. To select the number of biclusters for a given observed matrix A, we propose to sequentially test the bicluster numbers K = 0, 1, 2, . . . until the null hypothesis (N) is accepted. Let K̂ be the hypothetical number of biclusters when (N) is accepted. The proposed method outputs K̂ as the selected number of biclusters in matrix A.
Notations.
Throughout this paper, we use the following notations:
X = O_p[f(m)] :⇔ ∀ε > 0, ∃C > 0, M > 0 such that ∀m ≥ M, Pr[Cf(m) ≥ X] ≥ 1 − ε.
X = Ω_p[f(m)] :⇔ ∀ε > 0, ∃C > 0, M > 0 such that ∀m ≥ M, Pr[Cf(m) ≤ X] ≥ 1 − ε.
X = Θ_p[f(m)] :⇔ ∀ε > 0, ∃C₁, C₂ > 0, M > 0 such that ∀m ≥ M, Pr[C₁f(m) ≤ X ≤ C₂f(m)] ≥ 1 − ε. (4)
Figure 1: (a) Bi-disjoint, (b) disjoint but not bi-disjoint, and (c) non-disjoint bicluster structures. In the proposed method, we assume that the underlying bicluster structure is disjoint, but not necessarily bi-disjoint. We assume that the observed matrix consists of one or multiple biclusters, in each of which the entries are generated in the i.i.d. sense. Note that in the cases of (b-1) and (c), it is not always possible to make all the rows and columns within the biclusters contiguous by sorting the rows and columns.
‖A‖_op = sup_{u ∈ R^p, u ≠ 0} ‖Au‖/‖u‖, ‖A‖_F = √(Σ_{i=1}^n Σ_{j=1}^p A_ij²). (5)
In the proofs in Section 4, we use the following sample mean matrix P̃ and standard deviation matrix σ̃ for the correct block structure, and matrix Z̃:
P̃ = (P̃_ij)_{1≤i≤n, 1≤j≤p}, P̃_ij = b̃_{g_ij},
σ̃ = (σ̃_ij)_{1≤i≤n, 1≤j≤p}, σ̃_ij = s̃_{g_ij},
Z̃ = (Z̃_ij)_{1≤i≤n, 1≤j≤p}, Z̃_ij = (A_ij − P̃_ij)/σ̃_ij, (6)
where b̃_k and s̃_k, respectively, are the sample mean and standard deviation of the entries in the kth null group in observed matrix A. Let λ̃ and ṽ, respectively, be the maximum eigenvalue of matrix Z̃ᵀZ̃ and the corresponding eigenvector whose Euclidean norm is constrained to be one:
Z̃ᵀZ̃ ṽ = λ̃ ṽ, ‖ṽ‖ = 1. (7)
Similarly, we denote the eigenvalues (in descending order) and the corresponding normalized eigenvectors of matrix ZᵀZ as {λ_j} and {v_j}, respectively:
ZᵀZ v_j = λ_j v_j, ‖v_j‖ = 1, j = 1, . . . , p,
λ₁ ≥ λ₂ ≥ · · · ≥ λ_p. (8)
Assumptions.
In Section 4, we develop a statistical test for the number of biclusters in an observed matrix based on the following assumptions:
(i). We assume that the biclusters (and the background) are disjoint, that is, there is no entry (i, j) that is assigned to two or more biclusters (i.e., I_k ∩ I_k′ = ∅ if k ≠ k′, k, k′ ∈ {0, 1, . . . , K₀}). Moreover, we assume that each (i, j)th entry belongs to exactly one group. Note that in this paper, we use the term “disjoint” in the meaning that each entry belongs to a single group, not in the meaning that the bicluster structure is bi-disjoint as in Figure 1 (a) (i.e., that each row or column is assigned to at most one bicluster).
(ii). We assume that each entry Z_ij of the standardized noise matrix has a sub-exponential decay (i.e., there exists some ϑ > 0 such that for all x > 1, Pr(|Z_ij| > x) ≤ ϑ⁻¹ exp(−x^ϑ)). We also assume the following conditions:
• max_{k=0,1,...,K₀} s_k = O(1), and min_{k=0,1,...,K₀} s_k = Ω(1).
• max_{k=1,...,K₀, k′=1,...,K₀} |b_k − b_k′| = O(1), and max_{k=0,1,...,K₀} |b_k| = O(K₀). We also assume that the minimum difference between a pair of different biclusters (including the background) is lower bounded by some constant that does not depend on the matrix size: min_{k≠k′} |b_k − b_k′| ≥ C_b > 0.
• C_{M(4)} ≡ max_{i=1,...,n, j=1,...,p} E[Z_ij⁴] = O(1).
(iii). We assume that both the row and column sizes n and p of the observed matrix increase in proportion to some sufficiently large number m (i.e., n, p ∝ m).
(iv). As for the background, we assume that it can be divided into H disjoint submatrices. In regard to the row and column sizes of each kth submatrix (|I_k| and |J_k|, respectively, where k = 1, . . . , K₀ + H), we assume that they monotonically increase with m.
• In Theorem 4.1 for a realizable case, we assume that the minimum number of biclusters K₀ and that of background submatrices H needed to represent the observed matrix A satisfy the following conditions:
K₀ + H = O(m^{1/2−ε}), for some ε > 0, (9)
n_min ≡ min_{k=1,...,K₀+H} |I_k| = Ω(m^{3/4}), p_min ≡ min_{k=1,...,K₀+H} |J_k| = Ω(m^{3/4}). (10)
Note that from these conditions, we have
(K₀ + H)(min_{k=1,...,K₀+H} |I_k||J_k|)⁻¹ ≤ (K₀ + H)/(n_min p_min) = O(m^{−1−ε}), for some ε > 0. (11)
• In Theorem 4.3 for an unrealizable case, we assume the following stricter condition:
(K₀ + H)/√(n_min p_min) = O(m^{−1−ε}), for some ε > 0. (12)
(v).
In the realizable case (i.e., K = K₀), we assume that a submatrix localization algorithm for estimating the bicluster structure g is consistent, that is,
Pr(ĝ = g) → 1 as m → ∞,
where g and ĝ, respectively, are the null and the estimated bicluster structures of observed matrix A (the precise definition of ĝ is given in Section 4). In Section 3, we develop a consistent submatrix localization algorithm that can be applied without the bi-disjoint assumption.
3 Consistent submatrix localization based on generalized profile likelihood
In this section, we propose a consistent submatrix localization algorithm, which is used to estimate the bicluster structure of observed matrix A before applying the proposed statistical test. To our knowledge, this is the first paper that gives a submatrix localization method with a consistency guarantee (i.e., the probability that it outputs the correct bicluster structure converges to one in the limit of matrix size m → ∞) where the underlying bicluster structure is not necessarily bi-disjoint (as shown in Figure 1 (b)).
We first show the rigorous theoretical result that the bicluster structure with the maximum generalized profile likelihood is consistent in Section 3.1. Since the combinatorial optimization problem of finding the bicluster structure with the maximum generalized profile likelihood is intractable, we propose an approximated algorithm in Section 3.2.
To show the consistency of a submatrix localization algorithm based on the (generalized) profile likelihood (whose definition is given later in (15)), we use the following assumptions, aside from those in Section 2.
3.1 Consistency of submatrix localization based on the generalized profile likelihood
Assumptions in constructing a consistent submatrix localization algorithm
(a). The hypothetical number of biclusters is equal to the null one (i.e., K = K₀).
(b). Let p_k ∈ R and p̂_k ∈ R, respectively, be the proportions of entries in the kth group (k ∈ {0, 1, . . . , K}) in the null and estimated bicluster structures to all the np entries. Let p = (p_k)_{0≤k≤K} ∈ R^{K+1}. To guarantee the consistency, we need a stricter condition than assumption (iv): that min_{k=0,1,...,K} p_k = Ω(m^{−1/2}). Fix ε > 0 such that (1) ε = Ω(m^{−1/2}) (and thus npε = Ω(m^{3/2})), and (2) ε < min_{k=0,1,...,K} p_k. For instance, if we set ε = C_ε m^{−1/2} with an arbitrary fixed constant C_ε > 0 (which does not depend on m), for a sufficiently large m, both (1) and (2) hold.
(c). In regard to the sub-exponential assumption (ii), we additionally assume that there exist some ϑ₀, ϑ₁, . . . , ϑ_K ≥ C_ϑ > 0 such that for all x > 1, Pr(|Z_ij| > x) ≤ ϑ_{g_ij}⁻¹ exp(−x^{ϑ_{g_ij}}), where C_ϑ is a fixed constant that does not depend on the matrix size m.
To evaluate the accuracy of the estimated bicluster structure ĝ, we introduce a confusion matrix C ∈ R^{(K+1)×(K+1)}, whose (k, k′)th entry represents the proportion of entries that belong to the kth group in the null bicluster structure and the k′th group in the estimated one [i.e., C_{kk′} ≡ (1/(np)) Σ_{i=1}^n Σ_{j=1}^p I(g_ij = k) I(ĝ_ij = k′)], as shown in Figure 2. Note that the following two conditions are mutually equivalent:
• The estimated bicluster structure is correct (i.e., ĝ = g).
• The confusion matrix C can be made diagonal by permutations of its rows.
Moreover, the population mean b̄_k of each kth estimated group can be computed based on the confusion matrix C as follows:
b̄_k ≡ (1/|Î_k|) Σ_{(i,j)∈Î_k} P_ij = (np/|Î_k|) Σ_{k′=0}^K C_{k′k} b_{k′} = (Cᵀb)_k / (Cᵀu)_k, (13)
where u ≡ [1, 1, . . . , 1]ᵀ ∈ R^{K+1}. Note that we used p̂_k = |Î_k|/(np) = (Cᵀu)_k to derive (13).
We use the following notation:
J_ε ≡ {ĝ : p̂_k(ĝ) > ε for all k ∈ {0, 1, . . . , K}}. (14)
Since, by assumption (b), ε is chosen to be smaller than min_{k=0,1,...,K} p_k, we choose the estimated bicluster assignment ĝ from J_ε.
Figure 2: An example of a confusion matrix C_{kk′} for a set of null and estimated bicluster assignments; the sum of each row is p_k and the sum of each column is p̂_k′.
As in [19], we develop a consistent submatrix localization algorithm based on the (generalized) profile likelihood [38].
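A minimal sketch of the confusion matrix C and of the check that it is a permutation of a diagonal matrix; the assignments below are toy examples, and the check uses the column-wise characterization that no column may contain two nonzero entries:

```python
import numpy as np

def confusion_matrix(g, g_hat, K):
    """C[k, k'] = fraction of the n*p entries in null group k and
    estimated group k' (groups 0..K, with 0 the background)."""
    C = np.zeros((K + 1, K + 1))
    for k in range(K + 1):
        for k2 in range(K + 1):
            C[k, k2] = np.mean((g == k) & (g_hat == k2))
    return C

def is_permuted_diagonal(C, tol=1e-12):
    """True iff C becomes diagonal under a row permutation, i.e.
    every column has at most one (numerically) nonzero entry."""
    return all((np.abs(C[:, j]) > tol).sum() <= 1 for j in range(C.shape[1]))

# Toy example: the estimate recovers the null structure with bicluster
# labels 1 and 2 exchanged, which still counts as a correct localization.
g     = np.array([[0, 0, 1, 1],
                  [2, 2, 1, 1]])
g_hat = np.array([[0, 0, 2, 2],
                  [1, 1, 2, 2]])
C = confusion_matrix(g, g_hat, K=2)
```

Here `C` has a single nonzero entry per column, so `is_permuted_diagonal(C)` holds; perturbing a single entry of `g_hat` breaks this property.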
Given an estimated bicluster assignment ĝ, the generalized profile-likelihood criterion for an exponential family model is given by
F(ĝ) ≡ Σ_{k=0}^K p̂_k f( (1/|Î_k|) Σ_{(i,j)∈Î_k} A_ij ), (15)
where f : R → R is an arbitrary function that satisfies the following conditions:
• We define the following closed and open intervals:
S̄ ≡ [ min{ inf_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, min_{k=0,1,...,K} b_k }, max{ sup_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, max_{k=0,1,...,K} b_k } ],
S ≡ ( min{ inf_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, min_{k=0,1,...,K} b_k }, max{ sup_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, max_{k=0,1,...,K} b_k } ). (16)
f is continuous on S̄ and differentiable on S. The first derivative of f is bounded (i.e., |f′(x)| < ∞) on S.
• f is locally strictly convex: inf_{x∈T} f″(x) > 0, where T ≡ (min_{k=0,1,...,K} b_k, max_{k=0,1,...,K} b_k).
We also define the following criterion function based on the population group-wise mean:
F̄(ĝ) ≡ Σ_{k=0}^K p̂_k f(b̄_k). (17)
Note that we have F̄(g) = Σ_{k=0}^K p_k f(b_k).
Based on these assumptions and notations, the following Theorem 3.1 guarantees the consistency of the maximum generalized profile-likelihood solution ĝ_PL.
Theorem 3.1 (Consistency of submatrix localization based on the generalized profile likelihood). Under the assumptions in Sections 2 and 3, for any bicluster assignment ĝ_PL that achieves the maximum generalized profile likelihood (i.e., ĝ_PL ∈ arg max_{ĝ∈J_ε} F(ĝ)), the confusion matrix C(ĝ_PL) converges in probability to a permutation of a diagonal matrix.
Proof. To prove this theorem, we first show the following Lemmas A.1, 3.1 and 3.2.
Lemma 3.1.
Under the assumptions in Sections 2 and 3, sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| converges in probability to zero in the limit of m → ∞.
Proof. A proof is in Appendix A.
Before the statement of Lemma 3.2, we define P_δ: For a given δ > 0, let P_δ ≡ {C ∈ C_p : max_{(k₀,k₁,k₂): k₁≠k₂} C_{k₁k₀} C_{k₂k₀} < δ}, where C_p is the set of all confusion matrices with the null set of proportions p. If the confusion matrix C is a permutation of a diagonal matrix, then for k₁ ≠ k₂, we always have C_{k₁k₀} C_{k₂k₀} = 0 for all k₀ ∈ {0, 1, . . . , K}. Intuitively, P_δ is a set of confusion matrices which are “almost” permutations of diagonal matrices, where the gap from a permutation of a diagonal matrix is controlled by δ.
Lemma 3.2. Under the assumptions in Sections 2 and 3, for all δ > 0, if C(ĝ) ∉ P_δ,
F̄(ĝ) − F̄(g) ≤ −κ δ C_b², (18)
where κ ≡ inf_{x∈T} f″(x) > 0 and C_b is a constant that satisfies the condition in assumption (ii).
Proof. A proof is in Appendix B.
From Lemma 3.2, for all δ > 0, if C(ĝ) ∉ P_δ,
F(ĝ) − F(g) ≤ sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + F̄(ĝ) − F(g)
= sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + F̄(ĝ) − F̄(g) + F̄(g) − F(g)
≤ sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + sup_{g∈J_ε} |F(g) − F̄(g)| + F̄(ĝ) − F̄(g)
= 2 sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + F̄(ĝ) − F̄(g)
≤ 2 sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| − κ δ C_b². (19)
From Lemma 3.1, the first term in (19) converges in probability to zero. Therefore, for all δ > 0, if C(ĝ) ∉ P_δ, the probability that F(ĝ) < F(g) holds converges to one. By taking the limit of δ → 0, it follows from the definition of P_δ that the probability that the confusion matrix of the maximum generalized profile-likelihood solution ĝ_PL is a permutation of a diagonal matrix converges to one.
3.2 Approximated algorithm based on simulated annealing
In this section, we develop an approximated algorithm to find the bicluster structure with the maximum generalized profile likelihood based on simulated annealing (SA), since an exact algorithm to find the global optimal solution is intractable.
3.2.1 Naive implementation of SA-based submatrix localization
Let G_K be the set of all bicluster structures with (non-empty) K biclusters, which are disjoint, but not necessarily bi-disjoint (as shown in Figure 1 (b)). In SA, we first define a sequence of temperatures {T_t}_{t=0}^∞, a threshold ε_SA, and the initial state (i.e., bicluster assignment) ĝ⁽⁰⁾ ∈ G_K. For each state g ∈ G_K, we also define a set of its neighbors N(g) ⊆ G_K and a transition probability R(g, g′) ∈ [0, 1] to a given state g′ ∈ G_K. Here, we set R(g, g′) = 0 iff g′ ∉ N(g).
For each step t = 0, 1, 2, . . . , if T_t < ε_SA, we stop the algorithm and output the final state ĝ⁽ᵗ⁾. If T_t ≥ ε_SA, we randomly choose a candidate for the next state g̃ ∈ N(ĝ⁽ᵗ⁾) with probability R(ĝ⁽ᵗ⁾, g̃), and compute the difference of the objective function value ΔF ≡ F(g̃) − F(ĝ⁽ᵗ⁾). If ΔF > 0, we set the next state to ĝ⁽ᵗ⁺¹⁾ = g̃. If ΔF ≤ 0, we set the next state to ĝ⁽ᵗ⁺¹⁾ = g̃ with probability exp(ΔF/T_t), and keep the current state ĝ⁽ᵗ⁺¹⁾ = ĝ⁽ᵗ⁾ with probability 1 − exp(ΔF/T_t).
Specifically, we propose Algorithm 1 as an example of SA for approximately maximizing the generalized profile likelihood F. In Algorithm 1, we define the neighbors N(g) of state g as the set of all possible bicluster assignments that can be obtained by adding/removing one row or column to/from one bicluster in g. As for the transition probability, we define that one of the elements of N(g) is chosen from the uniform distribution on N(g) (i.e., R(g, g′) = 1/|N(g)| for g′ ∈ N(g)).
We can easily check that the above settings satisfy the following irreducibility and weak reversibility:
• Irreducibility: for any pair g, g′ ∈ G_K, there exists some sequence of transitions from g to g′ with non-zero probability.
• Weak reversibility: for any pair g, g′ ∈ G_K and F̃ ∈ R, the following two propositions (P1) and (P2) are mutually equivalent:
– (P1) there exists some sequence of transitions g₀ = g, g₁, . . . , g_p = g′ with non-zero probability that satisfies F(g_t) ≥ F̃ for all t ∈ {0, . . . , p}.
– (P2) there exists some sequence of transitions g₀ = g′, g₁, . . . , g_p = g with non-zero probability that satisfies F(g_t) ≥ F̃ for all t ∈ {0, . . . , p}.
We define that a state g is locally optimal if there is no state g′ ∈ G_K that satisfies the following two conditions simultaneously: F(g′) > F(g), and there exists some sequence of transitions g₀ = g, g₁, . . . , g_p = g′ with non-zero probability that satisfies F(g_t) ≥ F(g) for all t ∈ {1, . . . , p}.
For a locally but not globally optimal solution g, we define its depth as the minimum d that satisfies the following condition: there exists some g′ such that F(g′) > F(g), and there exists some sequence of transitions g_0 = g, g_1, . . . , g_p = g′ with non-zero probability that satisfies F(g_t) ≥ F(g) − d for all t ∈ {0, . . . , p}. By setting the sequence of temperatures to T_t = [max_{g∈G_K} F(g) − min_{g∈G_K} F(g)] / log(t + 2) for all t ≥ 0, the following two conditions are satisfied:

• T_t ≥ T_{t+1} holds for all t ≥ 0, and lim_{t→∞} T_t = 0.
• Σ_{t=0}^{∞} exp(−d*/T_t) = +∞, where d* is the maximum depth over all the locally but not globally optimal solutions.

It has been proven that under the above conditions, the probability that an SA algorithm outputs the globally optimal solution converges to one in the limit of t → ∞ [22].

Although the naive SA algorithm in Section 3.2.1 is tractable compared to the exhaustive search, it still requires too many steps to converge. Therefore, in this subsection, we propose a further approximation of Algorithm 1. The main idea is to first compress an observed data matrix A by row-column clustering, and then apply an SA algorithm to the compressed data matrix.

Algorithm 1  Naive SA algorithm for finding the maximum profile likelihood solution ĝ.
Require:
A cooling schedule of temperature {T_t}_{t=0}^{∞} and a threshold ε_SA.
Ensure: Approximated optimal bicluster assignment ĝ.
1: t ← 0.
2: Randomly generate an initial bicluster assignment ĝ, which is disjoint but not necessarily bi-disjoint (as shown in Figure 1 (b)).
3: while T_t ≥ ε_SA do
4:   Set g̃ ← ĝ and randomly choose an index k_0 from the uniform distribution on {1, . . . , 2K}.
5:   if k_0 ≤ K then
6:     Set bicluster index k ← k_0.
7:     Let I_k and J_k = {j_1, . . . , j_{|J_k|}}, respectively, be the sets of row and column indices in the kth bicluster. We define add and remove lists as follows.
8:     For i ∈ I_k, let I^rem_{ki} be the set of entries in the ith row of the kth bicluster (i.e., {(i, j_1), (i, j_2), . . . , (i, j_{|J_k|})}). We define the remove list as I^rem_k = {I^rem_{ki}}_{i∈I_k}.
9:     Let Ī_k be the set of row indices i that satisfy ∩_{s=1}^{|J_k|} ∩_{k′=1}^{K} [(i, j_s) ∉ I_{k′}]. For i ∈ Ī_k, let I^add_{ki} be the set of entries {(i, j_1), (i, j_2), . . . , (i, j_{|J_k|})}. We define the add list as I^add_k = {I^add_{ki}}_{i∈Ī_k}.
10:    Let I_0 be the set of background entries in g̃. Set y_add ← [|Ī_k| ≥ 2] ∪ [(|Ī_k| = 1) ∩ (I_0 ≠ I^add_{ki} for the unique i ∈ Ī_k)], which is a flag of whether or not we can execute the "add" operation. This guarantees that the set of background entries does not become null.
11:    if |I_k| ≥ 1 ∩ y_add = True then
12:      Randomly choose i_0 from the uniform distribution on {1, . . . , |I_k| + |Ī_k|}. If i_0 ≤ |I_k|, remove I^rem_{k i_0} from the kth bicluster and add it to the background in g̃. If i_0 > |I_k|, remove I^add_{k (i_0 − |I_k|)} from the background and add it to the kth bicluster in g̃.
13:    else if |I_k| ≥ 1 then
14:      Randomly choose i_0 from the uniform distribution on {1, . . . , |I_k|}. Remove I^rem_{k i_0} from the kth bicluster and add it to the background in g̃.
15:    else if y_add = True then
16:      Randomly choose i_0 from the uniform distribution on {1, . . . , |Ī_k|}. Remove I^add_{k i_0} from the background and add it to the kth bicluster in g̃.
17:    end if
18:  else
19:    Set bicluster index k ← k_0 − K.
20:    Execute lines 7 to 17, swapping the roles of rows and columns in all the operations.
21:  end if
22:  if F(g̃) − F(ĝ) > 0 then
23:    ĝ ← g̃.
24:  else
25:    With probability exp((F(g̃) − F(ĝ))/T_t), set ĝ ← g̃.
26:  end if
27:  t ← t + 1.
28: end while

Algorithm 2  Approximated SA algorithm for finding the maximum profile likelihood solution ĝ.
Require:
A set of row and column cluster numbers (L_1, L_2) that satisfies L_1 ≥ 2^K and L_2 ≥ 2^K, a cooling schedule of temperature {T_t}_{t=0}^{∞} and a threshold ε_SA.
Ensure: Approximated optimal bicluster assignment ĝ.
1: Apply a clustering algorithm to the rows of observed matrix A with the number of clusters L_1.
2: Apply a clustering algorithm to the columns of observed matrix A with the number of clusters L_2.
3: Let I^comp_{hh′} be the set of entries of matrix A in the hth row cluster and the h′th column cluster, and let I^comp = (I^comp_{hh′})_{1≤h≤L_1, 1≤h′≤L_2}.
4: Based on the clustering result I^comp, define matrices A^comp and M by (20).
5: t ← 0.
6: Randomly generate an initial (compressed) bicluster assignment ĝ^comp, which is disjoint but not necessarily bi-disjoint.
7: Execute lines 3 to 28 in Algorithm 1, replacing A and ĝ with A^comp and ĝ^comp, respectively. As for the objective function value, we can compute it by using (22).
8: Convert the set of results I^comp and ĝ^comp into the bicluster assignment ĝ of the original observed matrix A.

Remark that the null group-wise mean matrix P with K biclusters has at most 2^K distinct rows, depending on whether or not each row includes each kth bicluster (k = 1, . . . , K). Based on this fact, we first apply some clustering method (e.g., hierarchical clustering) to the rows of matrix A, setting the number of clusters to L_1 ∈ ℕ with min{2^K, n} ≤ L_1 ≤ n. By a similar argument, we also perform column clustering with a number of clusters L_2 that satisfies min{2^K, p} ≤ L_2 ≤ p. Then, we define the compressed observed matrix A^comp ∈ ℝ^{L_1×L_2} and the matrix M ∈ ℕ^{L_1×L_2} as follows:

A^comp = (A^comp_{hh′})_{1≤h≤L_1, 1≤h′≤L_2},  A^comp_{hh′} = (1/|I^comp_{hh′}|) Σ_{(i,j)∈I^comp_{hh′}} A_ij,
M = (M_{hh′})_{1≤h≤L_1, 1≤h′≤L_2},  M_{hh′} = |I^comp_{hh′}|,   (20)

where I^comp_{hh′} is the set of entries of matrix A in the hth row cluster and the h′th column cluster.

From now on, we apply an SA algorithm to the compressed observed matrix A^comp. Let ĝ^comp_{hh′} ∈ {0, 1, . . . , K} be the estimated group index of the (h, h′)th entry of matrix A^comp, and let J^comp_k ⊆ {(1, 1), . . . , (L_1, L_2)} be the set of entries in the kth estimated group (k = 0, 1, . . . , K) of matrix A^comp. Note that we have J^comp_k = {(h, h′) : ĝ^comp_{hh′} = k}.

The key insight is that we have

p̂_k = (1/(np)) Σ_{(h,h′)∈J^comp_k} M_{hh′},   (1/|Î_k|) Σ_{(i,j)∈Î_k} A_ij = (1/(np p̂_k)) Σ_{(h,h′)∈J^comp_k} M_{hh′} A^comp_{hh′}.   (21)

Based on the above fact, we can compute the objective function value (i.e., the profile likelihood) from the matrices A^comp and M and the bicluster assignment ĝ^comp = (ĝ^comp_{hh′})_{1≤h≤L_1, 1≤h′≤L_2} by

F(ĝ^comp) ≡ Σ_{k=0}^{K} [ (1/(np)) Σ_{(h,h′)∈J^comp_k} M_{hh′} ] f( (1/(np p̂_k)) Σ_{(h,h′)∈J^comp_k} M_{hh′} A^comp_{hh′} ).   (22)

From these observations, Algorithm 2 provides an approximated solution of Algorithm 1.

Test statistic for determining the number of biclusters
We develop the test statistic T of the proposed test based on the estimated version of the standardized noise matrix Z in (2), given a hypothetical number of biclusters K. We denote the estimated group index of the (i, j)th entry of matrix A by ĝ_ij ∈ {0, 1, . . . , K}, where ĝ_ij = k if the (i, j)th entry is estimated to be a member of the kth bicluster for some k, and ĝ_ij = 0 otherwise (i.e., the (i, j)th entry is estimated to be a member of the background). We define the set of estimated group indices of all the entries as ĝ ≡ (ĝ_ij)_{1≤i≤n, 1≤j≤p}. We also define Î_k ≡ {(i, j) : ĝ_ij = k}, which represents the estimated set of entries in the kth group.

Based on the above notations, the estimated mean, standard deviation and noise matrices P̂, σ̂ and Ẑ are given by

b̂ = (b̂_k)_{0≤k≤K},  b̂_k = (1/|Î_k|) Σ_{(i,j)∈Î_k} A_ij,   P̂ = (P̂_ij)_{1≤i≤n, 1≤j≤p},  P̂_ij = b̂_{ĝ_ij},
ŝ = (ŝ_k)_{0≤k≤K},  ŝ_k = √[ (1/|Î_k|) Σ_{(i,j)∈Î_k} (A_ij − P̂_ij)² ],   σ̂ = (σ̂_ij)_{1≤i≤n, 1≤j≤p},  σ̂_ij = ŝ_{ĝ_ij},   (23)

Ẑ = (Ẑ_ij)_{1≤i≤n, 1≤j≤p},  Ẑ_ij = (A_ij − P̂_ij)/σ̂_ij.   (24)

To construct a statistical test on the number of biclusters K, we use the following result in [40], which shows that the scaled maximum eigenvalue T* of the sample covariance matrix Z⊤Z converges in law to the Tracy–Widom distribution with index 1 (TW₁) in the limit of m → ∞:

T* = (λ₁ − a_TW)/b_TW,   T* ⇝ TW₁ (convergence in law),   (25)

where λ₁ is the maximum eigenvalue of matrix Z⊤Z and

a_TW = (√n + √p)²,   b_TW = (√n + √p)(1/√n + 1/√p)^{1/3}.   (26)

Based on the above fact, we define the test statistic T, which is an estimator of T* in (25), from the maximum eigenvalue λ̂₁ of matrix Ẑ⊤Ẑ:

T = (λ̂₁ − a_TW)/b_TW.
(27)

By using the test statistic T, we define the rejection rule of the proposed test at the significance level α as follows:

Reject the null hypothesis (K = K⁰),  if T ≥ t(α),   (28)

where K⁰ is the true number of biclusters and t(α) is the upper α quantile of the TW₁ distribution. We give theoretical guarantees for the above test in both the null and alternative cases later, in Theorems 4.1 and 4.3, respectively.

Theorem 4.1 (Realizable case). Under the assumptions in Section 2, if K = K⁰,

T ⇝ TW₁ (convergence in law),   (29)

in the limit of m → ∞, where T is defined as in (27).

Proof. To apply the result in [40], we consider the difference between T* and T. By the definitions in (25) and (27), we have

|T − T*| = |λ̂₁ − λ₁| / b_TW.   (30)

From now on, we prove that the right side of (30) can be bounded as |λ₁ − λ̂₁| / b_TW = O_p(m^{−ε}) for some ε > 0, which is given in Lemma 4.3 later. If this bound holds, then from Slutsky's theorem, (29) also holds. To show Lemma 4.3, we first prove the following Lemmas 4.1 and 4.2, which give the lower and upper bounds for the maximum eigenvalue λ̃₁ of matrix Z̃⊤Z̃.

Lemma 4.1.
Under the assumptions in Section 2, if K = K⁰,

λ₁ ≤ λ̃₁ + O_p(m^{1/3−ε}),  for some ε > 0.   (31)

Proof.
From assumption (iv), the background can be divided into H disjoint submatrices, whose minimum row and column sizes are n′_min and p′_min, respectively. Let X^(k) ∈ ℝ^{|I_k|×|J_k|} be the submatrix of a matrix X ∈ ℝ^{n×p} corresponding to the row and column indices (I_k, J_k) of the kth bicluster. To distinguish the indices of the biclusters from those of the background submatrices, let X^(K+h) ∈ ℝ^{|I_{K+h}|×|J_{K+h}|} be the submatrix of X ∈ ℝ^{n×p} corresponding to the row and column indices (I_{K+h}, J_{K+h}) of the hth background submatrix. We define a bicluster-wise constant matrix Q^(k) for each kth bicluster (k = 1, . . . , K) by

Q^(k) ≡ Z^(k) − (s̃_k/s_k) Z̃^(k) = (1/s_k)(P̃^(k) − P^(k)) = [the |I_k| × |J_k| constant matrix whose every entry is (1/|I_k|) Σ_{(i,j)∈I_k} (A_ij − P_ij)/s_k = (1/|I_k|) Σ_{(i,j)∈I_k} Z_ij],   (32)

and a submatrix-wise constant matrix Q^(K+h) for each hth background submatrix (h = 1, . . . , H) by

Q^(K+h) ≡ Z^(K+h) − (s̃₀/s₀) Z̃^(K+h) = (1/s₀)(P̃^(K+h) − P^(K+h)) = [the |I_{K+h}| × |J_{K+h}| constant matrix whose every entry is (1/|I₀|) Σ_{(i,j)∈I₀} (A_ij − P_ij)/s₀ = (1/|I₀|) Σ_{(i,j)∈I₀} Z_ij].   (33)

Based on these matrices, let Z^(k), Z̃^(k), and Q^(k), respectively, denote the n × p matrices whose entries in the kth bicluster (i.e., {(i, j) : i ∈ I_k, j ∈ J_k}) are those of Z^(k), Z̃^(k) and Q^(k), and all of whose other entries are zero. Similarly, let Z^(K+h), Z̃^(K+h), and Q^(K+h), respectively, be the n × p matrices whose entries in the hth background submatrix (i.e., {(i, j) : i ∈ I_{K+h}, j ∈ J_{K+h}}) are those of Z^(K+h), Z̃^(K+h) and Q^(K+h), and all of whose other entries are zero. Figure 3 shows an example of {Z^(k)}, where k = 1, . . . , K + H. Finally, we define matrix Q by Q ≡ Σ_{k=1}^{K+H} Q^(k).

Let v₁ be the normalized eigenvector of matrix Z⊤Z corresponding to the maximum eigenvalue λ₁, as defined in (8). Since the largest singular value √λ̃₁ of matrix Z̃ is equal to the operator norm
Figure 3: Decomposition of the noise matrix Z into the K biclusters {Z^(k)}, k = 1, . . . , K, and the H background submatrices {Z^(K+h)}, h = 1, . . . , H. In the case of this figure, K = 3 and H = 6.

of Z̃, we have

λ̃₁ = (sup_u ‖Z̃u‖/‖u‖)²
  ≥ (‖Z̃v₁‖/‖v₁‖)² = ‖Z̃v₁‖² = ‖Σ_{k=1}^{K+H} Z̃^(k) v₁‖²
  = ‖[Σ_{k=1}^{K} (s_k/s̃_k)(Z^(k) − Q^(k)) v₁] + [Σ_{h=1}^{H} (s₀/s̃₀)(Z^(K+h) − Q^(K+h)) v₁]‖²
  = ‖[Σ_{k=1}^{K+H} (Z^(k) − Q^(k)) v₁] + [Σ_{k=1}^{K+H} (s_k/s̃_k − 1)(Z^(k) − Q^(k)) v₁]‖²
  ≥ [‖Σ_{k=1}^{K+H} (Z^(k) − Q^(k)) v₁‖ − ‖Σ_{k=1}^{K+H} (1 − s_k/s̃_k)(Z^(k) − Q^(k)) v₁‖]²
  = [‖(Z − Q) v₁‖ − ‖Σ_{k=1}^{K+H} (1 − s_k/s̃_k)(Z^(k) − Q^(k)) v₁‖]²
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ ‖Σ_{k=1}^{K+H} (1 − s_k/s̃_k)(Z^(k) − Q^(k)) v₁‖
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| ‖(Z^(k) − Q^(k)) v₁‖]
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| (‖Z^(k) v₁‖ + ‖Q^(k) v₁‖)]
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| (√λ₁^(k) + ‖Q^(k) v₁‖)]
  ≥ λ₁ − 2√λ₁ ‖Q v₁‖ − 2 (√λ₁ + ‖Q v₁‖) [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| (√λ₁^(k) + ‖Q^(k) v₁‖)],   (34)

where λ₁^(k) is the maximum eigenvalue of matrix (Z^(k))⊤Z^(k), which is the same whether Z^(k) denotes the |I_k| × |J_k| submatrix or its zero-padded n × p version. From the third line in (34) onward, we use the notation s_{K+1} = · · · = s_{K+H} = s₀ for simplicity.

From now on, we show the probabilistic orders of ‖Q^(k) v₁‖ and ‖Q v₁‖. The non-zero entries of matrix (Q^(k))⊤Q^(k) are located only in the submatrix {(i, j) : i ∈ J_k, j ∈ J_k}, and all of their values are |I_k| η_k², where

η_k ≡ (1/|I_k|) Σ_{(i,j)∈I_k} Z_ij = O_p(1/√|I_k|).   (35)

Therefore, we have

(Q^(k))⊤Q^(k) v₁ = |I_k| |J_k| η_k² (v₁⊤ u^(k)) u^(k),   (36)

where u^(k) ∈ ℝ^p is the vector whose entries are defined by u^(k)_j = 1/√|J_k| if j ∈ J_k and u^(k)_j = 0 otherwise. Note that this vector satisfies ‖u^(k)‖ = 1. From (36), we have

‖Q^(k) v₁‖ = √[ v₁⊤ (Q^(k))⊤Q^(k) v₁ ] = √[ |I_k| |J_k| η_k² (v₁⊤ u^(k))² ].   (37)

To upper bound the right side of (37), we refer to the following important property of each jth eigenvector v_j of matrix Z⊤Z, which has been proven in [4].

Theorem 4.2 (Delocalization property of an eigenvector of a sample covariance matrix [4]).
Underthe assumptions in Section 2, from Theorem 2.17 in [4], a normalized eigenvector v j of matrix Z (cid:62) Z (i.e., (cid:107) v j (cid:107) = 1 ) has a delocalization property, that is, for any constant vector w that satisfies (cid:107) w (cid:107) = 1 ,for all (cid:15) > , | v (cid:62) j w | = O p (cid:20) p (cid:15) (cid:18) n + 1 n (1 + ˜ m − max { j, ( ˜ m + 1 − j ) } ) (cid:19)(cid:21) , (38) where ˜ m ≡ max { n, p } . From (38), we have | v (cid:62) j w | = O p (cid:16) m − + (cid:15) (cid:17) for all (cid:15) > . (39)Note that (39) holds uniformly in j and w from Theorem 2.20 in [4].Based on the above delocalization property of vector v and (37), we have (cid:107) Q ( k ) v (cid:107) = (cid:115) |I k | O p (cid:18) |I k | (cid:19) O p ( m − (cid:15) ) = O p (cid:16) m − + (cid:15) (cid:17) for all (cid:15) > . (40)As for (cid:107) Q v (cid:107) , we can derive its upper bound by (cid:107) Q v (cid:107) = (cid:107) K + H (cid:88) k =1 Q ( k ) v (cid:107) ≤ K + H (cid:88) k =1 (cid:107) Q ( k ) v (cid:107) = K + H (cid:88) k =1 (cid:113) |I k | η k ( v (cid:62) u ( k ) ) = K + H (cid:88) k =1 (cid:115) |I k | O p (cid:18) |I k | (cid:19) | v (cid:62) u ( k ) | = K + H (cid:88) k =1 O p (1) O p (cid:16) m − + (cid:15) (cid:17) = ( K + H ) O p (cid:16) m − + (cid:15) (cid:17) ( ∵ (39) holds uniformly in w from [4]) . (41)From Lemma C1 in Appendix C, | ˜ s k − s k | = O p (cid:18) √ |I k | (cid:19) holds, which results in (cid:12)(cid:12)(cid:12)(cid:12) − s k ˜ s k (cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:32) (cid:112) |I k | (cid:33) . 
(42)16y substituting (40), (41), (42), and the fact that (cid:113) λ ( k )1 = O p (cid:16) |I k | (cid:17) from (25), into (34), weobtain˜ λ ≥ λ − K + H ) O p (cid:16) m (cid:17) O p (cid:16) m − + (cid:15) (cid:17) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:40) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17) (cid:104) O p (cid:16) |I k | (cid:17) + O p (cid:16) m − + (cid:15) (cid:17)(cid:105)(cid:41) = λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:40) K + H (cid:88) k =1 (cid:104) O p (cid:16) |I k | − (cid:17) + O p (cid:16) |I k | − m − + (cid:15) (cid:17)(cid:105)(cid:41) . (43)By taking (cid:15) < , the lower bound in (43) can be simplified as follows:˜ λ ≥ λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:40) K + H (cid:88) k =1 (cid:104) O p (cid:16) |I k | − (cid:17) + O p (cid:16) |I k | − |I k | − + (cid:15) (cid:17)(cid:105)(cid:41) = λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:34) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17)(cid:35) ≥ λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) ( K + H ) O p (cid:34)(cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) = λ − K + H ) O p ( m (cid:15) ) − K + H ) O p (cid:16) m + (cid:15) (cid:17) O p (cid:34)(cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) = λ − K + H ) O p ( m (cid:15) ) (cid:40) O p (1) + O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35)(cid:41) = λ − K + H ) O p ( m (cid:15) ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) (cid:18) ∵ min k =1 ,...,K + H |I k | ≤ np (cid:19) = λ − K + H ) O p (cid:34) m + (cid:15) (cid:18) min k =1 ,...,K + H |I k | (cid:19) − 
(cid:35) (44)From the assumption that ( K + H ) (min k =1 ,...,K + H |I k | ) − = O (cid:16) m − − (cid:15) (cid:17) for some (cid:15) > (cid:15) < (cid:15) , we have λ ≤ ˜ λ + O p (cid:16) m − ( (cid:15) − (cid:15) ) (cid:17) , (45)which concludes the proof. Lemma 4.2.
Under the assumptions in Section 2, if K = K⁰,

λ̃₁ ≤ λ₁ + O_p(m^{1/3−ε}),  for some ε > 0.   (46)

Proof.
Let ˜ v ( k )1 ∈ R | J k | be a subvector of ˜ v corresponding to the columns of the k th submatrix inobserved matrix A , and let τ k ≡ s k ˜ s k . In (42), we have already shown that | − τ k | = O p (cid:18) √ |I k | (cid:19) . The17aximum eigenvalue ˜ λ of matrix ˜ Z (cid:62) ˜ Z can be upper bounded as follows:˜ λ = (cid:107) ˜ Z ˜ v (cid:107) = (cid:107) K + H (cid:88) k =1 τ k (cid:16) Z ( k ) − Q ( k ) (cid:17) ˜ v (cid:107) ( ∵ (32))= (cid:107) K + H (cid:88) k =1 (cid:104) Z ( k ) + ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) (cid:40) Z + K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105)(cid:41) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2˜ v (cid:62) Z (cid:62) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) ≤ (cid:107) Z ˜ v (cid:107) + 2 (cid:112) λ K + H (cid:88) k =1 | τ k − |(cid:107) Z ( k ) ˜ v (cid:107) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2 (cid:112) λ K + H (cid:88) k =1 | τ k − |(cid:107) Z ( k ) ˜ v ( k )1 (cid:107) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) ≤ (cid:107) Z ˜ v (cid:107) + 2 (cid:112) λ K + H (cid:88) k =1 | τ k − | (cid:113) λ ( k )1 − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2 O p (cid:16) m (cid:17) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) 
min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) ≤ (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:34) (cid:107) K + H (cid:88) k =1 ( τ k − Z ( k ) ˜ v (cid:107) + (cid:107) K + H (cid:88) k =1 τ k Q ( k ) ˜ v (cid:107) (cid:35) ≤ (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:34) K + H (cid:88) k =1 | τ k − |(cid:107) Z ( k ) ˜ v (cid:107) + K + H (cid:88) k =1 τ k (cid:107) Q ( k ) ˜ v (cid:107) (cid:35) ≤ (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + (cid:34) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17) + K + H (cid:88) k =1 τ k (cid:107) Q ( k ) ˜ v (cid:107) (cid:35) (cid:18) ∵ (cid:107) Z ( k ) ˜ v (cid:107) ≤ (cid:113) λ ( k )1 = O p (cid:16) |I k | (cid:17)(cid:19) = (cid:107) Z ˜ v (cid:107) + O p (cid:104) m C ( K,H ) (cid:105) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + (cid:34) O p (cid:16) C ( K,H ) (cid:17) + K + H (cid:88) k =1 τ k (cid:107) Q ( k ) ˜ v (cid:107) (cid:35) , (47)where we denote C ( K,H ) ≡ ( K + H ) (min k =1 ,...,K + H |I k | ) − .Since Z (cid:62) Z is a symmetric matrix, its eigenvectors { v j } form an orthonormal system, and thus18here exists a unique set of coefficients { c j } such that˜ v = p (cid:88) j =1 c j v j = ˜ v L + ˜ v S , (48)where ˜ v L ≡ t (cid:88) j =1 c j v j , ˜ v S ≡ p (cid:88) j = t +1 c j v j ,λ t ≥ λ − n d , λ t +1 < λ − n d , d = 57 . 
(49)By substituting (48) into the last term in (47) and from the similar discussion as in (36), (cid:107) Q ( k ) ˜ v (cid:107) = ˜ v (cid:62) ( Q ( k ) ) (cid:62) Q ( k ) ˜ v = p (cid:88) j =1 p (cid:88) j (cid:48) =1 c j c j (cid:48) v (cid:62) j ( Q ( k ) ) (cid:62) Q ( k ) v j (cid:48) = p (cid:88) j =1 p (cid:88) j (cid:48) =1 c j c j (cid:48) | I k || J k | η k ( v (cid:62) j u ( k ) )( v (cid:62) j (cid:48) u ( k ) ) = | I k || J k | η k p (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) ≤ | I k || J k | η k (cid:118)(cid:117)(cid:117)(cid:116) p (cid:88) j =1 c j (cid:118)(cid:117)(cid:117)(cid:116) p (cid:88) j =1 ( v (cid:62) j u ( k ) ) = | I k || J k | η k (cid:107) ˜ v (cid:107) p (cid:88) j =1 ( v (cid:62) j u ( k ) ) = | I k || J k | η k p (cid:88) j =1 ( v (cid:62) j u ( k ) ) ≤ | I k || J k | η k p max j =1 ,...,p ( v (cid:62) j u ( k ) ) = |I k | O p (cid:0) |I k | − (cid:1) p O p (cid:0) m − (cid:15) (cid:1) ( ∵ (39) holds uniformly in j and w from [4])= O p (cid:0) m (cid:15) (cid:1) . (50)By combining (47) and (50),˜ λ ≤ (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:104) m C ( K,H ) (cid:105) + (cid:34) O p (cid:16) C ( K,H ) (cid:17) + K + H (cid:88) k =1 τ k O p ( m (cid:15) ) (cid:35) = (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:104) m C ( K,H ) (cid:105) + (cid:34) O p (cid:16) C ( K,H ) (cid:17) + K + H (cid:88) k =1 (cid:16) O p (cid:16) |I k | − (cid:17)(cid:17) O p ( m (cid:15) ) (cid:35) = (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:104) m C ( K,H ) (cid:105) + (cid:104) O p (cid:16) C ( K,H ) (cid:17) + ( K + H ) O p ( m (cid:15) ) (cid:105) = (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:34) ( K + H ) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) + O p (cid:2) ( K + H ) m (cid:15) (cid:3) . 
(51)As for the third term in (51), from the assumption that ( K + H ) (min k =1 ,...,K + H |I k | ) − = O (cid:16) m − − (cid:15) (cid:17) for some (cid:15) > O p (cid:34) ( K + H ) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) = O p (cid:16) m − (cid:15) (cid:17) . (52)19ith regard to the fourth term in (51), from the assumption that K + H = O (cid:16) m − (cid:15) (cid:17) for some (cid:15) > (cid:15) < (cid:15) , O p (cid:2) ( K + H ) m (cid:15) (cid:3) = O p (cid:16) m − (cid:15) − (cid:15) ) (cid:17) . (53)An upper bound of the first and second terms in (51) is given by (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = (˜ v L + ˜ v S ) (cid:62) Z (cid:62) Z (˜ v L + ˜ v S ) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = (cid:0) ˜ v L (cid:1) (cid:62) Z (cid:62) Z ˜ v L + (cid:0) ˜ v S (cid:1) (cid:62) Z (cid:62) Z ˜ v S − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j v j (cid:62) t (cid:88) j =1 c j Z (cid:62) Z v j + p (cid:88) j = t +1 c j v j (cid:62) p (cid:88) j = t +1 c j Z (cid:62) Z v j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j v j (cid:62) t (cid:88) j =1 c j λ j v j + p (cid:88) j = t +1 c j v j (cid:62) p (cid:88) j = t +1 c j λ j v j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j λ j (cid:107) v j (cid:107) + p (cid:88) j = t +1 c j λ j (cid:107) v j (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j λ j + p (cid:88) j = t +1 c j λ j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ t (cid:88) j =1 c j + λ t +1 p (cid:88) j = t +1 c j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ t (cid:88) j =1 c j + ( λ − n d ) p (cid:88) j = t +1 c j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ( ∵ (49))= λ p (cid:88) j =1 c j − n d p (cid:88) j = t +1 c j − K + H (cid:88) k =1 τ k ˜ v 
(cid:62) Z (cid:62) Q ( k ) ˜ v = λ (cid:107) ˜ v (cid:107) − n d (cid:107) ˜ v S (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = λ − n d (cid:107) ˜ v S (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v L − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v S . (54)Let u ( k ) ∈ R p be a vector whose entries are defined by u ( k ) j = √ | J k | if j ∈ J k and u ( k ) j = 0otherwise. As for the third term in (54), using the fact that Q ( k ) v j = η k | J k | ( v (cid:62) j u ( k ) ) u ( k ) , for all (cid:15) > − ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v L ≤ | ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v L | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t (cid:88) j =1 c j ˜ v (cid:62) Z (cid:62) Q ( k ) v j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) η k | J k | ˜ v (cid:62) Z (cid:62) u ( k ) t (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:16) |I k | − (cid:17) | J k | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) = O p (1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ≤ O p (1) (cid:118)(cid:117)(cid:117)(cid:116) t (cid:88) j =1 c j (cid:118)(cid:117)(cid:117)(cid:116) t (cid:88) j =1 | v (cid:62) j u ( k ) | (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ≤ O p (1) (cid:107) ˜ v (cid:107)√ t O p (cid:16) m − + (cid:15) (cid:17) (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ( ∵ (39) holds uniformly in j and w from [4])= √ t O p (cid:16) m − + (cid:15) (cid:17) 
(cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ≤ √ t O p (cid:16) m − + (cid:15) (cid:17) (cid:107) ˜ v (cid:62) Z (cid:62) (cid:107)(cid:107) u ( k ) (cid:107) = √ t O p (cid:16) m − + (cid:15) (cid:17) (cid:107) ˜ v (cid:62) Z (cid:62) (cid:107) √ t O p (cid:16) m − + (cid:15) (cid:17) (cid:112) λ = √ t O p (cid:16) m − + (cid:15) (cid:17) O p (cid:16) m (cid:17) = √ t O p ( m (cid:15) ) . (55)With regard to the fourth term in (54), we have − ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v S ≤ | ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v S | ≤ (cid:107) ˜ v (cid:107)(cid:107) Z (cid:62) Q ( k ) ˜ v S (cid:107) = (cid:107) Z (cid:62) Q ( k ) ˜ v S (cid:107)≤ (cid:107) Z (cid:62) Q ( k ) (cid:107) op (cid:107) ˜ v S (cid:107) ≤ (cid:107) Z (cid:107) op (cid:107) Q ( k ) (cid:107) F (cid:107) ˜ v S (cid:107) = (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) . (56)By substituting (55) and (56) into (54), (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ − n d (cid:107) ˜ v S (cid:107) + 2 K + H (cid:88) k =1 τ k √ t O p ( m (cid:15) ) + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) = λ − n d (cid:107) ˜ v S (cid:107) + 2 K + H (cid:88) k =1 (cid:104) O p (cid:16) |I k | − (cid:17)(cid:105) √ t O p ( m (cid:15) ) + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) = λ − n d (cid:107) ˜ v S (cid:107) + 2 K + H (cid:88) k =1 √ t O p ( m (cid:15) ) + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) = λ − n d (cid:107) ˜ v S (cid:107) + √ t O p [( K + H ) m (cid:15) ] + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) . (57)We denote the j th normalized eigenvalue of matrix Z (cid:62) Z as ν j ≡ n λ j , and define the followingvariables: ν + ≡ (cid:18) (cid:114) pn (cid:19) , ν − ≡ (cid:18) − (cid:114) pn (cid:19) , (cid:15) ≡ ν + − ν . 
(58)

Note that |ε₀| = O_p(φ^C m^{−2/3}) holds for some constant C > 0, where ε₀ ≡ ν₊ − ν₁ as defined in (58) and φ ≡ (log p)^{log log p}, from (4.1) of [40]. Since φ^C = o(m^ε) holds for any ε > 0, we have

|ε₀| = O_p(m^{−2/3+ε}),  for any ε > 0.   (59)

Each jth normalized eigenvalue ν_j follows the Marcenko–Pastur distribution, whose probability density function is given by

q(x) = (1/(2π)) (n/p) √(max{(ν₊ − x)(x − ν₋), 0}) / x.   (60)

From (60), by taking ε < d − 1/3 = 8/21, we have

q(ν₁ − n^{d−1}) = q(ν₊ − n^{d−1} − ε₀) = (√(ν₊ − ν₋) / (2π (p/n) ν₊)) [n^{d−1} + O_p(m^{−2/3+ε})]^{1/2} = O_p(m^{(d−1)/2}).   (61)

From (59) and (61), by setting ε < d − 1/3,

n̄ ≡ ∫_{ν₁−n^{d−1}}^{∞} q(x) dx ≤ |∫_{ν₁−n^{d−1}}^{ν₊} q(x) dx| + |∫_{ν₊}^{∞} q(x) dx| = |∫_{ν₁−n^{d−1}}^{ν₊} q(x) dx| ≤ |ε₀ + n^{d−1}| q(ν₁ − n^{d−1}) = O_p(m^{d−1}) O_p(m^{(d−1)/2}) = O_p(m^{3(d−1)/2}).   (62)

From (3.7) of [40], we have

|n̄ − t/p| = O_p(m^{−1+ε}),  for all ε > 0.   (63)

From (62) and (63), by setting ε < (3d − 1)/2,

t = O_p(m^{(3d−1)/2}).
(64)By substituting (64) into (57) and from the assumption in (49) that d = , for all (cid:15) > (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + (cid:107) ˜ v S (cid:107) (cid:32) (cid:112) λ K + H (cid:88) k =1 τ k (cid:112) |I k | | η k | − n d (cid:107) ˜ v S (cid:107) (cid:33) = λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + (cid:107) ˜ v S (cid:107) (cid:16) (cid:112) λ (cid:36) − n d (cid:107) ˜ v S (cid:107) (cid:17) , (65)where (cid:36) ≡ K + H (cid:88) k =1 τ k (cid:112) |I k | | η k | = K + H (cid:88) k =1 (cid:34) O p (cid:32) (cid:112) |I k | (cid:33)(cid:35) (cid:112) |I k | O p (cid:32) (cid:112) |I k | (cid:33) = O p ( K + H ) . (66)If 2 √ λ (cid:36) − n d (cid:107) ˜ v S (cid:107) ≤
0, from (65), (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) . (67)If 2 √ λ (cid:36) − n d (cid:107) ˜ v S (cid:107) >
0, we have (cid:107) ˜ v S (cid:107) < n − d √ λ (cid:36) . Therefore, the following inequality holds: (cid:107) ˜ v S (cid:107) (cid:16) (cid:112) λ (cid:36) − n d (cid:107) ˜ v S (cid:107) (cid:17) ≤ n − d λ (cid:36) = O p (cid:104) ( K + H ) m (cid:105) (cid:18) ∵ d = 57 and (66) (cid:19) , (68)which results in that (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + O p (cid:104) ( K + H ) m (cid:105) ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) , for all (cid:15) > . (69)Based on the above discussion, (69) always holds. Therefore, from (51), (52), and (53), for all (cid:15) > λ ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + O p (cid:16) m − (cid:15) (cid:17) + O p (cid:16) m − (cid:15) − (cid:15) ) (cid:17) . (70)From the assumption that K + H = O (cid:16) m − (cid:15) (cid:17) for some (cid:15) > (cid:15) < (cid:15) , we finallyobtain ˜ λ ≤ λ + O p (cid:16) m − ˜ (cid:15) (cid:17) , for some ˜ (cid:15) > , (71)which concludes the proof. 22 emma 4.3. Under the assumptions in Section 2, if K = K , | λ − ˆ λ | b TW = O p (cid:0) m − (cid:15) (cid:1) , for some (cid:15) > . (72) Proof.
By combining Lemmas 4.1, 4.2, and the definition of $b_{\mathrm{TW}}$ in (26), we have

$\frac{|\lambda_1 - \tilde\lambda_1|}{b_{\mathrm{TW}}} = O_p(m^{-\epsilon}), \text{ for some } \epsilon > 0.$ (73)

We consider the following three events:

• $E^{(1)}_m$ represents the event that $\tilde Z = \hat Z$ holds.
• $E^{(2)}_m$ represents the event that the solution given by the submatrix localization algorithm is correct (i.e., $\hat g = g$).
• $E^{(3)}_{m,C}$ represents the event that $|\lambda_1 - \tilde\lambda_1| / b_{\mathrm{TW}} \le C m^{-\epsilon}$ holds.

The joint probability of events $E^{(1)}_m$ and $E^{(3)}_{m,C}$ can be lower bounded by

$\Pr\left(E^{(1)}_m \cap E^{(3)}_{m,C}\right) \ge \Pr\left(E^{(2)}_m \cap E^{(3)}_{m,C}\right) \ge 1 - \Pr\left[\left(E^{(2)}_m\right)^{\mathrm{C}}\right] - \Pr\left[\left(E^{(3)}_{m,C}\right)^{\mathrm{C}}\right],$ (74)

where $E^{\mathrm{C}}$ is the complement of event $E$. From the consistency assumption (v) in Section 2, if $K = K_0$, the second term on the right side of (74) satisfies $\Pr[(E^{(2)}_m)^{\mathrm{C}}] \to 0$ as $m \to \infty$. As for the third term $\Pr[(E^{(3)}_{m,C})^{\mathrm{C}}]$, we already have (73). By combining these facts, we have

$\forall \epsilon' > 0,\ \exists C > 0,\ M > 0,\ \forall m \ge M,\quad \Pr\left(E^{(1)}_m \cap E^{(3)}_{m,C}\right) \ge 1 - \epsilon',$ (75)

which results in (72).

From Lemma 4.3, we finally obtain

$|T - T^*| = \frac{|\hat\lambda_1 - \lambda_1|}{b_{\mathrm{TW}}} = O_p(m^{-\epsilon}), \text{ for some } \epsilon > 0.$ (76)

By combining this fact with Slutsky's theorem, the convergence of the test statistic $T$ in law to the TW₁ distribution in (29) holds.

Theorem 4.3 (Unrealizable case). Under the assumptions in Section 2, if $K < K_0$,

$T = O_p\left(m^{5/3}\right),$ (77)

and

$T = \Omega_p\left(m^{2/3}\right),$ (78)

where $T$ is defined as in (27).
Figure 4: Definition of matrices P , ¯ P , and ˆ P in an unrealizable case. Proof.
We first prove the upper bound in (77). Let $X^{\mathrm{E}(k)}$ be an $n \times p$ matrix whose entries in the $k$th estimated bicluster (including background) are the same as those of matrix $X$ and all the other entries are zero. Since the Frobenius norm upper bounds the operator norm,

$\|\hat Z\|_{\mathrm{op}} \le \|\hat Z\|_F = \sqrt{\sum_{k=0}^{K}\|\hat Z^{\mathrm{E}(k)}\|_F^2} = \sqrt{\sum_{k=0}^{K}\frac{1}{\hat s_k^2}\|A^{\mathrm{E}(k)} - \hat P^{\mathrm{E}(k)}\|_F^2} = \sqrt{\sum_{k=0}^{K}|\hat I_k|\,\frac{\|A^{\mathrm{E}(k)} - \hat P^{\mathrm{E}(k)}\|_F^2}{\|A^{\mathrm{E}(k)} - \hat P^{\mathrm{E}(k)}\|_F^2}} = \sqrt{\sum_{k=0}^{K}|\hat I_k|} = \sqrt{np}.$ (79)

From assumption (iii), let $n = C_n m$ and $p = C_p m$, where $C_n$ and $C_p$ are positive constants. By definition in (26), we have

$a_{\mathrm{TW}} = (\sqrt{C_n} + \sqrt{C_p})^2\, m, \quad b_{\mathrm{TW}} = (\sqrt{C_n} + \sqrt{C_p})\left(\frac{1}{\sqrt{C_n}} + \frac{1}{\sqrt{C_p}}\right)^{1/3} m^{1/3}.$ (80)

By substituting (79) and (80) into (27), we have

$T = \frac{\|\hat Z\|_{\mathrm{op}}^2 - a_{\mathrm{TW}}}{b_{\mathrm{TW}}} \le \frac{np - a_{\mathrm{TW}}}{b_{\mathrm{TW}}} = \frac{C_n C_p m^2 - (\sqrt{C_n} + \sqrt{C_p})^2 m}{(\sqrt{C_n} + \sqrt{C_p})\left(\frac{1}{\sqrt{C_n}} + \frac{1}{\sqrt{C_p}}\right)^{1/3} m^{1/3}} = O_p\left(m^{5/3}\right),$ (81)

which concludes the proof of (77).

We next show the lower bound in (78). As shown in Figure 4, we define $\bar P$ as the matrix that consists of the estimated bicluster structure and entries of the population means. For instance, if the $(i,j)$th entry of observed matrix $A$ belongs to an estimated bicluster that consists of $n^{(1)}$ entries of the $k_1$th null bicluster and $n^{(2)}$ entries of the $k_2$th null bicluster, its value in matrix $\bar P$ is given by $\bar P_{ij} = (n^{(1)} b_{k_1} + n^{(2)} b_{k_2}) / (n^{(1)} + n^{(2)})$.

From assumption (iv), for all $k \in \{1, \ldots, K_0 + H\}$, the $k$th null submatrix (i.e., bicluster or background submatrix) has a size of at least $n_{\min} \times p_{\min}$. Therefore, for every null group index $k \in \{1, \ldots, K_0 + H\}$, there exists at least one estimated group $h_k \in \{0, 1, \ldots, K\}$ that contains a submatrix of the $k$th null submatrix with a size of at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$ (Figure 4). Since $K < K_0$, there exists at least one pair of null group indices $(k_1, k_2)$, $k_1, k_2 \in \{1, \ldots, K_0 + H\}$, that satisfies $h_{k_1} = h_{k_2}$ and $k_1 \ne k_2$ (i.e., a pair of mutually different null groups that belong to the same group in the estimated bicluster structure). In other words, there exists at least one estimated group $\bar k$ $(= h_{k_1} = h_{k_2})$ that satisfies the following conditions:

• The $\bar k$th estimated group contains two submatrices. We denote the sets of entries in these two submatrices as $I^{(1)}$ and $I^{(2)}$.
• The set of entries $I^{(1)}$ forms a submatrix of the $k_1$th null group. The size of submatrix $I^{(1)}$ is at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$.
• The set of entries $I^{(2)}$ forms a submatrix of the $k_2$th null group. The size of submatrix $I^{(2)}$ is at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$.
• The null groups of $I^{(1)}$ and $I^{(2)}$ are mutually different (i.e., $k_1 \ne k_2$).

The population means of submatrices $I^{(1)}$ and $I^{(2)}$, respectively, are $b_{k_1}$ and $b_{k_2}$. We assume $b_{k_1} > b_{k_2}$ without loss of generality. Let $\bar b$ be the constant value of the submatrices $I^{(1)}$ and $I^{(2)}$ in matrix $\bar P$. Here, we consider the following two patterns:

• If $\bar b \ge (b_{k_1} + b_{k_2})/2$, we have $|\bar b - b_{k_2}| \ge (b_{k_1} - b_{k_2})/2$.
• If $\bar b < (b_{k_1} + b_{k_2})/2$, we have $|\bar b - b_{k_1}| > (b_{k_1} - b_{k_2})/2$.

In both cases, there exists a submatrix $I$ that satisfies the following two conditions (note that $I = I^{(2)}$ in the former case and $I = I^{(1)}$ in the latter case):

• The size of submatrix $I$ is at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$.
• Let $b_I$ and $\bar b_I$, respectively, be the constant values of submatrix $I$ in matrices $P$ and $\bar P$. Note that we have $\bar b_I = \bar b_{\bar k}$. From assumption (ii), the following inequality holds:

$|\bar b_I - b_I| \ge \min_{k \ne k'} |b_k - b_{k'}| / 2 \ge C_b / 2.$ (82)

We denote the $k$th null bicluster (including background submatrices) in matrix $X$ as $X^{\mathrm{N}(k)}$ $(k = 1, \ldots, K_0 + H)$. The difference between $\hat b_{\bar k}$ and $\bar b_{\bar k}$ is given by

$|\hat b_{\bar k} - \bar b_{\bar k}| = \frac{1}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}}\left(\hat P_{ij} - \bar P_{ij}\right)\right| = \frac{1}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}}(A_{ij} - P_{ij})\right| \le \frac{\max_{k=0,1,\ldots,K_0} s_k}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}} Z_{ij}\right|.$ (83)

Here, $Z_{ij}$ independently follows a distribution with zero mean and unit variance. From the sub-exponential condition (ii), for any $n_M \in \mathbb{N}$, we have $\mathbb{E}[Z_{ij}^{n_M}] < \infty$. Let $I_{\mathrm{CLT}}$ be a subset of entries in an $n \times p$ matrix. By defining $C_{M(4)} \equiv \max_{i,j}\mathbb{E}[Z_{ij}^4] < \infty$ and setting $\delta = 2$ in the Lyapunov condition,

$\lim_{|I_{\mathrm{CLT}}|\to\infty} \frac{1}{|I_{\mathrm{CLT}}|^{1+\delta/2}} \sum_{(i,j)\in I_{\mathrm{CLT}}} \mathbb{E}\left[|Z_{ij}|^{2+\delta}\right] \le \lim_{|I_{\mathrm{CLT}}|\to\infty} \frac{C_{M(4)}\, |I_{\mathrm{CLT}}|}{|I_{\mathrm{CLT}}|^2} = \lim_{|I_{\mathrm{CLT}}|\to\infty} \frac{C_{M(4)}}{|I_{\mathrm{CLT}}|} = 0.$ (84)

Therefore, from the Lyapunov variant of the central limit theorem, $\frac{1}{\sqrt{|I_{\mathrm{CLT}}|}}\sum_{(i,j)\in I_{\mathrm{CLT}}} Z_{ij} \rightsquigarrow N(0, 1)$. Since $|\hat I_{\bar k}| \ge |I| \ge n_{\min} p_{\min} / (K+1)^2$ (note that the right side monotonically increases with the matrix size $m$ from assumption (iv)), we have

$|\hat b_{\bar k} - \bar b_{\bar k}| \le \frac{\max_{k=0,1,\ldots,K_0} s_k}{\sqrt{|\hat I_{\bar k}|}}\, O_p(1) \le \left(\max_{k=0,1,\ldots,K_0} s_k\right)\frac{K+1}{\sqrt{n_{\min} p_{\min}}}\, O_p(1) \le O_p\left(\frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\right) \quad (\because K+1 < K_0+H \text{ and assumption (ii)}).$ (85)

From (85), we have

$\Big|\, |b_I - \bar b_I| - |b_I - \hat b_{\bar k}| \,\Big| \le |\hat b_{\bar k} - \bar b_{\bar k}| \le O_p\left(\frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\right).$ (86)

By combining this result with (82),

$\frac{C_b}{2} \le |b_I - \hat b_{\bar k}| + O_p\left(\frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\right).$ (87)

Therefore, from (12) in assumption (iv), we have

$|b_I - \hat b_{\bar k}| = \Omega_p(1).$ (88)

Let $X_I$ be a submatrix of $X$ with the set of entries $I$. Since the operator norm of a submatrix is not larger than that of the original matrix,

$\|\hat Z\|_{\mathrm{op}} \ge \|\hat Z_I\|_{\mathrm{op}} = \frac{1}{\hat s_{\bar k}}\|A_I - \hat P_I\|_{\mathrm{op}} \ge \frac{1}{\hat s_{\bar k}}\Big|\, \|A_I - P_I\|_{\mathrm{op}} - \|P_I - \hat P_I\|_{\mathrm{op}} \,\Big|.$ (89)

Let $\bar k_{\mathrm{N}}$ be the null bicluster index (including background) of submatrix $I$. Note that we have $b_I = b_{\bar k_{\mathrm{N}}}$. As for the first term in (89), from assumption (ii), we have

$\|A_I - P_I\|_{\mathrm{op}} = s_{\bar k_{\mathrm{N}}}\|Z_I\|_{\mathrm{op}} \le s_{\bar k_{\mathrm{N}}}\|Z\|_{\mathrm{op}} \le \left(\max_{k=0,1,\ldots,K_0} s_k\right) O_p(\sqrt{m}) = O_p(\sqrt{m}).$
(90)

In regard to the second term in (89), since all the entries in matrix $(P_I - \hat P_I)$ are $(b_I - \hat b_{\bar k})$ and thus its rank is one, we have

$\|P_I - \hat P_I\|_{\mathrm{op}} = \|P_I - \hat P_I\|_F = \sqrt{|I|}\, |b_I - \hat b_{\bar k}| \ge \frac{\sqrt{n_{\min} p_{\min}}}{K+1}\, |b_I - \hat b_{\bar k}| \ge \frac{\sqrt{n_{\min} p_{\min}}}{K_0+H}\, |b_I - \hat b_{\bar k}| = \Omega_p\left(\frac{\sqrt{n_{\min} p_{\min}}}{K_0+H}\right).$ (91)

To derive the last equation, we used assumption (iv) and (88).

Finally, we can derive an upper bound of $\hat s_{\bar k}$ by

$\hat s_{\bar k} = \frac{1}{\sqrt{|\hat I_{\bar k}|}}\|A^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F \le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\|A^{\mathrm{E}(\bar k)} - P^{\mathrm{E}(\bar k)}\|_F + \|P^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F\right) \le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\|A - P\|_F + \|P^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F\right)$
$= \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\sqrt{\sum_{k=1}^{K_0+H} s_k^2 \|Z^{\mathrm{N}(k)}\|_F^2} + \|P^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F\right) \le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \sqrt{\frac{1}{|\hat I_{\bar k}|}\sum_{(i,j)\in\hat I_{\bar k}}\left(P_{ij} - \hat b_{\bar k}\right)^2}$
$\le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right| \le \frac{K+1}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right|$
$\le \frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right|.$ (92)

The second term in (92) can be upper bounded as follows:

$\max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right| \le |\hat b_{\bar k}| + \max_{k=0,1,\ldots,K_0}|b_k| = \frac{1}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}}(\sigma_{ij} Z_{ij} + P_{ij})\right| + \max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{1}{|\hat I_{\bar k}|}\sum_{(i,j)\in\hat I_{\bar k}}(\sigma_{ij}|Z_{ij}| + |P_{ij}|) + \max_{k=0,1,\ldots,K_0}|b_k| \le \frac{1}{|\hat I_{\bar k}|}\sum_{(i,j)\in\hat I_{\bar k}}\sigma_{ij}|Z_{ij}| + 2\max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{1}{|\hat I_{\bar k}|}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\sum_{(i,j)\in\hat I_{\bar k}}|Z_{ij}| + 2\max_{k=0,1,\ldots,K_0}|b_k| \le \frac{1}{|\hat I_{\bar k}|}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\sqrt{\sum_{(i,j)\in\hat I_{\bar k}}|Z_{ij}|^2}\,\sqrt{|\hat I_{\bar k}|} + 2\max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + 2\max_{k=0,1,\ldots,K_0}|b_k| \le \frac{K+1}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + 2\max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + 2\max_{k=0,1,\ldots,K_0}|b_k|.$ (93)

Since $\mathbb{E}[Z_{ij}^2] = \mathbb{V}[Z_{ij}] + \mathbb{E}[Z_{ij}]^2 = 1$ and $\mathbb{V}[Z_{ij}^2] = \mathbb{E}[Z_{ij}^4] - \mathbb{E}[Z_{ij}^2]^2 = \mathbb{E}[Z_{ij}^4] - 1 < \infty$ from assumption (ii), from the central limit theorem and Prokhorov's theorem [53], we have

$\frac{1}{\sqrt{np}}\sum_{i=1}^{n}\sum_{j=1}^{p}\left(Z_{ij}^2 - 1\right) = O_p(1) \iff \sum_{i=1}^{n}\sum_{j=1}^{p} Z_{ij}^2 = \|Z\|_F^2 = np + O_p(m),$ (94)

which results in that

$\|Z\|_F = O_p(m).$ (95)

By substituting (93) and (95) into (92), and using assumption (ii),

$\hat s_{\bar k} \le \frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\, O_p(m) + O(1) = O_p\left[\frac{(K_0+H)\, m}{\sqrt{n_{\min} p_{\min}}}\right].$ (96)

By substituting (90), (91), and (96) into (89), and using (12) in assumption (iv), we finally have

$\|\hat Z\|_{\mathrm{op}} \ge \Omega_p\left[\frac{\sqrt{n_{\min} p_{\min}}}{(K_0+H)\, m}\right]\left|\Omega_p\left(\frac{\sqrt{n_{\min} p_{\min}}}{K_0+H}\right) - O_p(\sqrt{m})\right| = \Omega_p\left[\frac{n_{\min} p_{\min}}{(K_0+H)^2 m}\right]\left|\Omega_p(1) - O_p\left(\frac{(K_0+H)\sqrt{m}}{\sqrt{n_{\min} p_{\min}}}\right)\right|$
$= \Omega_p\left[\frac{n_{\min} p_{\min}}{(K_0+H)^2 m}\right] = \Omega_p\left(m^{1/2+2\epsilon}\right) \iff \|\hat Z\|_{\mathrm{op}}^2 = \Omega_p\left(m^{1+4\epsilon}\right), \text{ for some } \epsilon > 0.$ (97)

By substituting (97) and (80) into (27), we have

$T = \frac{\|\hat Z\|_{\mathrm{op}}^2 - a_{\mathrm{TW}}}{b_{\mathrm{TW}}} = \Omega_p\left(m^{2/3+4\epsilon}\right) \ge \Omega_p\left(m^{2/3}\right),$ (98)

which concludes the proof.

5 Experiments

5.1 Convergence of T in law to the TW₁ distribution in the realizable case

We first checked the asymptotic behavior of the proposed test statistic $T$ in the null case (i.e., $K = K_0$) by using synthetic data matrices, which were generated from the Gaussian, Bernoulli, and Poisson distributions. In this case, from Theorem 4.1, $T$ converges in law to the TW₁ distribution in the limit of $m \to \infty$.

We set the null number of biclusters at $K_0 = 3$ in all the distributional settings, and tried 10 sets of matrix sizes: $(n, p) = (500 \times i, 375 \times i)$, $i = 1, \ldots,$
10. For each distribution, we defined the null set of parameters and the relative entropy function $f$ of the generalized profile likelihood in (15) as follows.¹

• Gaussian case: Each entry in the $k$th group $(k = 0, 1, \ldots, K_0)$ of observed matrix $A$ was generated independently from the Gaussian distribution $N(b_k, s_k^2)$, where

$b = (\,\cdot\;\cdot\;\cdot\;\cdot\,)^\top, \quad s = (\,0.03\;\;0.04\;\;0.06\;\;\cdot\,)^\top.$ (99)

$f(x) \equiv x^2 / 2.$ (100)

• Bernoulli case: Each entry in the $k$th group of observed matrix $A$ was generated independently from the Bernoulli distribution $\mathrm{Bernoulli}(b_k)$, where

$b = (\,\cdot\;\cdot\;\cdot\;\cdot\,)^\top.$ (101)

$f(x) \equiv x \log\left(\max\{x, c\}\right) + (1 - x)\log\left(\max\{1 - x, c\}\right),$ (102)

where $c$ is a small positive constant that keeps the logarithm finite.

• Poisson case: Each entry in the $k$th group of observed matrix $A$ was generated independently from the Poisson distribution $\mathrm{Pois}(b_k)$, where

$b = (\,\cdot\;\cdot\;\cdot\;\cdot\,)^\top.$ (103)

$f(x) \equiv x \log\left(\max\{x, c\}\right) - x.$ (104)

For each combination of the distributional and matrix size settings, we randomly generated 5,
000 data matrices $A$ based on the null (non-bi-disjoint) bicluster structure, which was defined as follows. Let $K_1 \equiv (3K_0 + 4 + K_0 \bmod 2)/2$, $K_2 \equiv (3K_0 + 4 - K_0 \bmod 2)/2$, $n_0 \equiv \lfloor n/K_1 \rfloor$, and $p_0 \equiv \lfloor p/K_2 \rfloor$. For each $k$th bicluster $(k = 1, \ldots, K_0)$, we also define $k_1 \equiv (3k - 2 - k \bmod 2)/2$ and $k_2 \equiv (3k - k \bmod 2)/2$. The set of rows and columns of $A$ that belong to the $k$th bicluster is given by $I_k = \{k_1 n_0 + 1, \ldots, (k_1 + 2) n_0\}$ and $J_k = \{k_2 p_0 + 1, \ldots, (k_2 + 2) p_0\}$, respectively. Figure 5 shows examples of Gaussian, Bernoulli, and Poisson data matrices.

After generating the observed data matrices, we estimated their bicluster structures by the proposed submatrix localization algorithm (i.e., Algorithm 2 in Section 3.2.2). To compress the original data matrix $A$, we applied Ward's hierarchical clustering method [54] to the rows and columns of matrix $A$, with the numbers of clusters $L_1 = \min\{K_1, n\}$ and $L_2 = \min\{K_2, p\}$, respectively. Initial bicluster structures in Algorithm 2 were given as follows: each $k$th bicluster contains a (uniformly randomly chosen) single entry $A_{i_k j_k}$ of $A$, where $(i_k, j_k) \ne (i_{k'}, j_{k'})$ for $k \ne k'$. In Section 3.2.1, we described a sufficient condition on the cooling schedule $\{T_t\}$ for the SA algorithm to converge in probability to the global optimal solution. However, such a setting requires too many iterations to converge. In the experiments, we instead used a geometric cooling schedule $T_t = c^t$ for a fixed constant $c \in (0, 1)$ and for all $t \ge 0$, together with a small convergence threshold $\epsilon_{\mathrm{SA}}$. Since this setting no longer guarantees convergence in probability to the global optimal solution, we applied Algorithm 2 to each observed matrix five times and adopted the best solution, that is, the one achieving the maximum profile

¹These experimental settings of the relative entropy function $f$ follow those in [19].
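As a rough sketch of the Gaussian data generation described above — with hypothetical mean and standard deviation values and hypothetical (overlapping) bicluster positions, since the paper's exact vectors $b$ and $s$ and index sets are given in (99) and the definitions above:

```python
import numpy as np

def generate_gaussian_matrix(n, p, rows, cols, b, s, b0, s0, seed=0):
    """Generate an n x p matrix whose background entries follow N(b0, s0^2)
    and whose k-th bicluster (rows[k] x cols[k]) follows N(b[k], s[k]^2).
    Later biclusters overwrite earlier ones where they overlap."""
    rng = np.random.default_rng(seed)
    A = rng.normal(b0, s0, size=(n, p))
    for k in range(len(rows)):
        I, J = rows[k], cols[k]
        A[np.ix_(I, J)] = rng.normal(b[k], s[k], size=(len(I), len(J)))
    return A

# Hypothetical parameters for K0 = 3 biclusters (not the paper's exact values).
n, p = 500, 375
rows = [range(0, 150), range(100, 250), range(250, 400)]
cols = [range(0, 120), range(100, 220), range(220, 340)]
A = generate_gaussian_matrix(n, p, rows, cols,
                             b=[0.3, 0.5, 0.7], s=[0.03, 0.04, 0.06],
                             b0=0.4, s0=0.05)
```

The Bernoulli and Poisson cases would only swap `rng.normal` for `rng.binomial(1, ...)` or `rng.poisson(...)` with the corresponding group parameters.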
Figure 5: Examples of the observed data matrices. The left, center, and right figures show the Gaussian, Bernoulli, and Poisson cases, respectively.

likelihood in the last step of the algorithm (this procedure was also performed in all the subsequent experiments in Sections 5.2, 5.3, and 5.4). Based on the estimated bicluster structure, we finally applied the proposed statistical test by setting the hypothetical number of biclusters at $K_0$.

Figures 6, 7, and 8, respectively, show the histograms of the proposed test statistic $T$ with different matrix sizes under the Gaussian, Bernoulli, and Poisson distributional settings. Figure 9 shows the empirical tail probabilities of the proposed test statistic $T$ (i.e., the ratios of the trials where $T \ge t(0.1)$, $T \ge t(0.05)$, and $T \ge t(0.01)$, where $t(\alpha)$ is the upper $\alpha$ quantile of the TW₁ distribution) under the three distributional settings. As in the previous study [55], we used the approximated values $t(0.1) \approx 0.45$, $t(0.05) \approx 0.98$, and $t(0.01) \approx 2.02$. To check the convergence of $T$ to the TW₁ distribution, we also applied the Kolmogorov–Smirnov (KS) test [14] to the test statistics $T$ of the 5,000 trials, and plotted the results in Figure 10. The test statistic of the KS test is $D\sqrt{r}$, where $D$ is the maximum absolute difference between the empirical distribution function of $T$ and the cumulative distribution function of the TW₁ distribution, and $r$ is the number of trials (i.e., 5,000 in this case).

From Figures 6, 7, 8, 9 and 10, we see that the proposed test statistic $T$ converges in law to the TW₁ distribution in each distributional setting. In the Bernoulli case, however, the convergence of $T$ in law to the TW₁ distribution is slow compared to the other two cases (i.e., Gaussian and Poisson). To check the cause of this, we also computed $\tilde T$ with the null bicluster structure [i.e., $\tilde T \equiv (\tilde\lambda_1 - a_{\mathrm{TW}}) / b_{\mathrm{TW}}$] and plotted its empirical tail probabilities and the test statistic of the KS test in Figures 11 and 12, respectively. From these figures, we see that the convergence of $\tilde T$ in law to the TW₁ distribution is still slow in the Bernoulli case. Therefore, the slow convergence of $T$ would not be caused by low accuracy of the biclustering algorithm; rather, it would be a problem specific to a Bernoulli random matrix.

5.2 Asymptotic behavior of T in the unrealizable case

Second, we consider unrealizable cases (i.e., $K < K_0$). Specifically, under the assumptions that the total number of biclusters and background submatrices $(K_0 + H)$ is a fixed constant that does not depend on the matrix size and that the minimum row and column sizes of these submatrices ($n_{\min}$ and $p_{\min}$, respectively) satisfy $n_{\min} = \Omega_p(m)$ and $p_{\min} = \Omega_p(m)$, from (97) and (77), we have $T = \Theta_p(m^{5/3})$.

Based on the same procedure as in Section 5.1, we generated Gaussian, Bernoulli and Poisson random matrices with three biclusters (i.e., $K_0 = 3$), estimated their bicluster structures, and computed the test statistics. We used the same settings as in Section 5.1 for (1) the null parameters of the three distributions in (99), (101), and (103), (2) the procedure to generate the observed matrices, and (3) the SA-based submatrix localization algorithm. In this experiment, we tried the following 10 sets of matrix sizes: $(n, p) = (200 \times i,\ \cdot \times i)$, $i = 1, \ldots, 10$. For each combination of the distributional and matrix
Figure 6: Histogram of the proposed test statistic T, which was computed with the estimated bicluster structure (Gaussian case). The titles of the panels show the different matrix sizes, from (n, p) = (500, 375) to (n, p) = (5000, 3750); each panel overlays the Tracy–Widom distribution of order 1 on the histogram of T.
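The null behavior in these histograms can be reproduced numerically: for a pure-noise Gaussian matrix, the largest eigenvalue of $Z^\top Z$ concentrates near the centering constant $(\sqrt{n} + \sqrt{p})^2$, and its standardized version behaves like a TW₁ draw. A quick check of this (our own sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 225  # same 4:3 aspect ratio as in the experiments
Z = rng.standard_normal((n, p))
lam1 = np.linalg.eigvalsh(Z.T @ Z).max()  # largest eigenvalue of Z^T Z

a = (np.sqrt(n) + np.sqrt(p)) ** 2
b = (np.sqrt(n) + np.sqrt(p)) * (1 / np.sqrt(n) + 1 / np.sqrt(p)) ** (1 / 3)
T = (lam1 - a) / b   # standardized statistic; approximately TW_1 under the null

ratio = lam1 / a     # concentration of lam1 around (sqrt(n)+sqrt(p))^2
```

Repeating this over many seeds and histogramming `T` reproduces the shape of the panels above.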
Figure 7: Histogram of the proposed test statistic T, which was computed with the estimated bicluster structure (Bernoulli case). The titles of the panels show the different matrix sizes, from (n, p) = (500, 375) to (n, p) = (5000, 3750).
Figure 8: Histogram of the proposed test statistic T, which was computed with the estimated bicluster structure (Poisson case). The titles of the panels show the different matrix sizes, from (n, p) = (500, 375) to (n, p) = (5000, 3750).
Figure 9: Empirical tail probabilities of the proposed test statistic T under the three distributional settings, computed with the estimated bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions. The horizontal axis shows the row size n of matrix A, and the dashed lines show the three significance levels (α = 0.1, 0.05, 0.01).
Figure 10: Test statistics D√r of the KS test [14], computed with the estimated bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions. Given a significance level α_KS for the KS test, if D√r exceeds the corresponding critical value, the null hypothesis that T follows the TW₁ distribution is rejected.
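The KS statistic $D$ above is the supremum distance between the empirical distribution function of the $r$ computed test statistics and the reference CDF. A minimal sketch (using a standard normal reference CDF purely for illustration, since a Tracy–Widom CDF is not available in the standard library):

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF, stand-in for the TW_1 CDF used in the paper."""
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in np.atleast_1d(x)])

def ks_statistic(samples, cdf):
    """D = sup_x |F_r(x) - F(x)| between the ECDF of `samples` and `cdf`."""
    x = np.sort(np.asarray(samples, dtype=float))
    r = len(x)
    F = cdf(x)
    # The ECDF jumps from (i-1)/r to i/r at x_i; check both sides of each jump.
    d_plus = np.max(np.arange(1, r + 1) / r - F)
    d_minus = np.max(F - np.arange(0, r) / r)
    return max(d_plus, d_minus)

D = ks_statistic([0.0], normal_cdf)   # single observation at 0: D = 0.5
ks_scaled = D * np.sqrt(1)            # the D * sqrt(r) statistic plotted in Figure 10
```

With the 5,000 statistics of a single experimental setting in place of `[0.0]` and the TW₁ CDF in place of `normal_cdf`, `ks_scaled` corresponds to one plotted point.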
Figure 11: Empirical tail probabilities of the proposed test statistic T under the three distributional settings, computed with the null bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions.
Figure 12: Test statistics D√r of the KS test [14], computed with the null bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions.
Figure 13: Mean of the proposed test statistic T divided by n^(5/3) in the unrealizable case over 100 trials. The null number of biclusters was set at K₀ = 3. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions. The horizontal axis shows the row size n of the observed matrix. Each plotted line shows a result for a given hypothetical number of biclusters K (K = 0, 1, 2).

size settings, we randomly generated 100 data matrices A, estimated their bicluster structures with K = 0, 1, . . . , K₀ −
1, and checked the average behavior of test statistic $T$.

Figure 13 shows the asymptotic behavior of the mean of the proposed test statistic $T$ divided by $n^{5/3}$ under unrealizable settings. From this figure, we see that $T$ increases in proportion to $m^{5/3}$ in all the distributional settings, as shown in the first paragraph of this section.

5.3 Accuracy in selecting the number of biclusters K

Third, we checked the accuracy of the proposed test in selecting the number of biclusters $K$, by using synthetic Gaussian, Bernoulli and Poisson data matrices that were generated by the same procedure as in Section 5.1. We set the null number of biclusters at $K_0 = 3$. As for the null mean parameters of the three distributions, we tried the ten settings $\{b^{(1)}, \ldots, b^{(10)}\}$, where

$b^{(t)} \equiv \left(1 - \frac{t-1}{10}\right)\left(b - 0.5\,[1 \cdots 1]^\top\right) + 0.5\,[1 \cdots 1]^\top \quad \text{(Gaussian and Bernoulli cases)},$ (105)

$b^{(t)} \equiv \left(1 - \frac{t-1}{10}\right)\left(b - 5\,[1 \cdots 1]^\top\right) + 5\,[1 \cdots 1]^\top \quad \text{(Poisson case)},$ (106)

for all $t = 1, \ldots,$
10. In the above settings, we set b at the same vector as given in (99), (101), and (103)for each distributional setting. In regard to the standard deviation parameter s , we used the samesetting as in (99). Aside from these model parameters, we used the same settings as in Section 5.1for (1) the procedure to generate the observed matrices, and (2) the SA-based submatrix localizationalgorithm. Figures 14, 15, and 16, respectively, show the examples of generated data matrices inGaussian, Bernoulli, and Poisson cases. In this experiment, we tried the following 10 sets of matrixsizes: ( n, p ) = (40 × i, × i ), i = 1 , . . . ,
10. For each combination of the distributional and matrixsize settings, we randomly generated 1 ,
000 data matrices A and applied the proposed sequential testwith significance level of α = 0 .
01 and the hypothetical number of biclusters K = 0 , , , . . . .Figure 17 shows the accuracy of the proposed test, that is, the ratio of trials where the selectednumber of biclusters ˆ K is equal to the null one K . From Figure 17, we see that the proposed testachieved higher accuracy with the larger matrix sizes and with the smaller differences between thegroup-wise means. This result is consistent with our intuition, since with the larger matrix sizes andwith the smaller differences between the elements in mean vector, it gets more difficult to correctlyestimate the underlying bicluster structure of matrix A , based on which we compute the test statistic T . 32 = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 t = 9 t = 10 Figure 14: Examples of the observed data matrices for t = 1 , . . . ,
10 (
Gaussian case ). The coloredboxes show the null bicluster structure. t = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 t = 9 t = 10
Figure 15: Examples of the observed data matrices for t = 1, …, 10 (Bernoulli case).

Figure 16: Examples of the observed data matrices for t = 1, …, 10 (Poisson case).
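For reference, a planted-bicluster Gaussian matrix of the kind shown in Figures 14 to 16 can be generated as in the following minimal sketch; the block positions and means here are illustrative assumptions, not the exact Section 5.1 settings.

```python
import random

def planted_bicluster_matrix(n, p, blocks, b0=0.2, s=0.1):
    """Gaussian matrix with planted biclusters.

    `blocks` is a list of (rows, cols, mean) triples: each bicluster is the
    submatrix rows x cols filled with i.i.d. N(mean, s^2) entries; background
    entries are i.i.d. N(b0, s^2).
    """
    A = [[random.gauss(b0, s) for _ in range(p)] for _ in range(n)]
    for rows, cols, mean in blocks:
        for i in rows:
            for j in cols:
                A[i][j] = random.gauss(mean, s)
    return A

# e.g., a 40 x 40 matrix with K = 3 biclusters (hypothetical positions/means):
A = planted_bicluster_matrix(
    40, 40,
    [(range(0, 15), range(0, 12), 0.7),
     (range(10, 25), range(18, 30), 0.45),
     (range(28, 40), range(25, 40), 0.55)])
```

Shrinking the gap between the block means and the background mean b0 reproduces the increasingly hard settings t = 1, …, 10 above.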
[Figure 17 panels: accuracy of tests (vertical axis, 0.0 to 1.0) versus n (horizontal axis, 100 to 400). The legends list the range of B for each setting t = 1, …, 10: from (0.20, 0.70) down to (0.47, 0.52) in the Gaussian and Bernoulli panels, and from (2.00, 7.00) down to (4.70, 5.20) in the Poisson panel.]
Figure 17: Accuracy of the proposed test in selecting the number of biclusters K, under the 10 different mean parameter settings { b^{(1)}, …, b^{(10)} }. The left, center, and right figures, respectively, show the results where each entry of the observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions.

Finally, we applied the proposed test and the conventional LBM-based one [55] to the Divorce Predictors data set [59] from the UCI Machine Learning Repository [17], and compared the results. The rows and columns of the original observed matrix Ǎ ∈ R^{170×54} represent the 170 participants and 54 attributes, respectively, and each (i, j)th entry shows the Divorce Predictors Scale (DPS), which takes a value in {0, 1, …, 4}. According to [58], the original questionnaire was based on the following five-point scale: 0: "Never," 1: "Rarely," 2: "Occasionally," 3: "Often," and 4: "Always," which is used as the score for Attributes 31 to 54. As for Attributes 1 to 30, this scale is reversed (i.e., 0 means "Always" and 4 means "Never"), so that higher values indicate higher divorce risk for all the attributes. Based on the original matrix Ǎ, we defined a binary data matrix A by setting A_ij = 1 if Ǎ_ij ≥ 2 for the i th participant and the j th attribute, and A_ij = 0 otherwise. The upper left of Figure 18 shows the observed data matrix, where the meaning of each attribute index is shown in Table 1.

As for the proposed test, we applied it sequentially as in Section 5.3 with a significance level of α = 0.
01 until some hypothetical number of biclusters was accepted. In the SA algorithm, we used the relative entropy function f in (102) and the cooling schedule of T_t = 0. ^t for all t ≥ 0. For each hypothetical number of biclusters K, we set the threshold at ε_SA = 10^{−K/ ⋅ − ⋅}. Based on these settings, we applied Algorithm 2 30 times and adopted the best solution, i.e., the one that achieved the maximum profile likelihood in the last step. Based on the estimated bicluster structure, we applied the proposed statistical test.

In regard to the conventional LBM-based test, we used the same settings as in [55]. That is, for each hypothetical set of row and column cluster numbers (K, H), we estimated the regular-grid bicluster structure by applying Ward's hierarchical clustering method [54] to the rows and columns of the observed matrix. Based on the estimated row and column cluster assignments, we applied the test in [55] with a significance level of α = 0.
01. We tried multiple combinations of K and H in the following order:

(K, H) = (1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), …,  (107)

until the null hypothesis was accepted.

Based on the above settings, the estimated number of biclusters by the proposed test was 30, while the estimated set of row and column cluster numbers by the conventional test [55] was (14, 46) (i.e., the estimated number of biclusters was 14 × 46 = 644, aside from the background). The upper right and bottom of Figure 18, respectively, show the estimated bicluster structures when the null hypotheses were accepted by the proposed and the previous LBM-based tests. From these results, we see that the proposed test could capture the bicluster structure more flexibly than the previous regular-grid-based one, and thereby accepted a smaller hypothetical number of biclusters.

More specifically, Figure 19 shows each estimated bicluster when the null hypothesis was accepted by the proposed test. Most of the estimated biclusters contained constant values (i.e., 0 or 1), except for Biclusters 0, 3, 18, 19, 23, and 28, and many biclusters consisted of a small number (i.e., one or two) of attributes. Aside from Biclusters 6 and 27, most of the rows (i.e., participants) in each bicluster belonged to the same class (i.e., divorced or married). Each estimated bicluster was composed of some homogeneous sets of rows and columns: for example, Bicluster 9 shows that there exists a (mostly) married group of participants who gave small DPS (i.e., Ǎ_ij ≤
1) to Attributes 24 and 27, both of which are related to knowledge about the stress of the spouse. From Bicluster 16, we also see that there exists a divorced group of participants who gave large DPS (i.e., Ǎ_ij ≥ 2) to many attributes, including those on similarity with the spouse (e.g., Attributes 12 to 20) and awareness about the spouse (e.g., Attributes 21 to 30).
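The binarization of the DPS matrix described above can be sketched as follows; the threshold of 2 matches the small/large DPS split (Ǎ_ij ≤ 1 versus Ǎ_ij ≥ 2) used in this analysis, and the toy input is of course hypothetical rather than the 170 × 54 data set.

```python
def binarize_dps(A_check, threshold=2):
    """Map a DPS matrix (entries in 0..4) to a 0/1 matrix:
    A_ij = 1 iff the original entry is >= threshold."""
    return [[1 if a >= threshold else 0 for a in row] for row in A_check]

# Toy 2 x 3 example:
A = binarize_dps([[0, 3, 2], [4, 1, 0]])
# A == [[0, 1, 1], [1, 0, 0]]
```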
We discuss both the theoretical and practical perspectives of the two main contributions of this paper: the statistical test on the number of biclusters K and the consistent submatrix localization algorithm for estimating the bicluster structure of a given observed matrix.

In regard to the proposed statistical test on K, we derived the asymptotic behavior of the proposed test statistic T in both the null and alternative cases, where the null number of biclusters may increase with the matrix size (under the condition given in (iv)). As with the previous regular-grid-based test [55], it is an important future work to reveal the non-asymptotic property of the test statistic T, that is, its convergence rate to the Tracy-Widom distribution. To solve this problem, we need to derive the behavior of T in the case that the submatrix localization algorithm does not output the correct bicluster structure, which requires a more careful analysis. From a practical point of view, some studies [2, 31] have shown the effectiveness of analyzing a gene expression data matrix by assuming the existence of order-preserving biclusters, in which a set of rows (i.e., genes) shows a similar linear ordering over the columns (i.e., conditions). Such a definition of homogeneousness differs from ours, where we assume that the entries in each bicluster are generated i.i.d. In addition, some practical relational data matrices (e.g., the MovieLens [25] and Jester [21] data sets) contain missing entries. It is an important topic for future research to construct a statistical test on K for such cases and derive its theoretical guarantee.

As for the biclustering algorithm, the proposed test requires it to be consistent, that is, the probability that it outputs the correct bicluster structure must converge to one in the limit of matrix size m → ∞. To guarantee such consistency, we developed a profile-likelihood-based submatrix localization algorithm, and an approximated version of it based on a compressed observed matrix.
Although the approximated version (i.e., Algorithm 2) needs fewer steps until convergence than the original Algorithm 1, there is some room for improvement. First, if the compression of the observed matrix has not been done successfully, the latter part of the SA algorithm can no longer output the correct bicluster structure. When using a probabilistic clustering method (e.g., k-means clustering) for compression, this problem could be alleviated by running the clustering multiple times, running the SA algorithm with each compressed observed matrix, and outputting the solution with the maximum profile likelihood. Second, the method cannot compress the observed matrix at all when K ≥ log(max{n, p}) / log 2. In such cases, we need another method (e.g., changing the definition of the neighbors N(g) of the current state g in the SA algorithm) for reducing the computational complexity.

In this paper, we proposed a new statistical test on the number of biclusters in a given relational data matrix, and showed the asymptotic behavior of the proposed test statistic T in both the null and alternative cases. Unlike the previous study [55], we can apply the proposed method when the underlying bicluster structure is not necessarily represented by a regular grid. This has been made possible by developing a consistent submatrix localization algorithm for such bicluster structures, that is, one for which the probability of outputting the correct bicluster structure converges to one. Moreover, to alleviate the slow convergence of such an algorithm, we proposed a heuristic approximation method using a compressed observed matrix. Based on a bicluster structure estimated by such a consistent algorithm, we can compute the
[Figure 18 panels: "Sorted data matrix," "Estimated bicluster structure (proposed), α = 0.01," and "Estimated bicluster structure (LBM-based), α = 0.01," each plotted as participants × attributes; row-class legend: divorced / married.]
Figure 18: The sorted observed data matrix of the Divorce Predictors data set [59] (upper left) and its estimated bicluster structures when the null hypotheses were accepted by the proposed test (upper right) and the previous one (bottom). In the upper left and bottom figures, the black and white elements show one and zero, respectively. In the upper right figure, the sorting orders of the rows and columns are the same as in the upper left figure, and the color of each element shows its group index (the white elements were estimated as background), regardless of its value. In the bottom figure, the blue lines show the regular-grid bicluster structure (note that in this figure, the sorting orders of the rows and columns are different from those of the upper left figure). The meaning of each attribute index is shown in Table 1.
[Figure 19 panels: Biclusters 0 through 29, each plotted as participants × attributes.]
Figure 19: Estimated biclusters of the observed matrix when the proposed test accepted the null hypothesis. The font color of each title corresponds to the bicluster color in Figure 18, and the color on the left side of each row shows the class of the participant in the bicluster, as in the legend of Figure 18.

Table 1: Attributes of the Divorce Predictors data set [59].

test statistic T and perform the test on a given number of biclusters K. By sequentially testing K in ascending order, we can select an appropriate number of biclusters in a given observed matrix. We experimentally showed the asymptotic behavior of the proposed test statistic T and its accuracy in selecting the correct number of biclusters with synthetic data matrices, and we also analyzed the test result with a practical data set.

Acknowledgments
TS was partially supported by JSPS KAKENHI (18K19793, 18H03201, and 20H00576), Japan Digital Design, and JST CREST.

Appendix A Proof of Lemma 3.1
Let K_ε be the family of all the biclusters and backgrounds whose proportion is ε or more. Since there are 2^{n+p} combinations of biclusters (without any size constraint), from assumption (iv), we have

|K_ε| ≤ 2^{n+p} + 2^{K(n+p)} ≤ O[exp(m)].  (108)

To prove Lemma 3.1, we first show the following Lemma A.1.

Lemma A.1.
Under the assumptions in Sections 2 and 3, for any t > 0,

(K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t ] → 0,  (109)

in the limit of m → ∞.

Proof. Let Y_ij ≡ Z_ij I(|Z_ij| ≤ ε√|Î|) for a given ε > 0. By definition, we have

|E[Y_ij]| = |E[Z_ij; |Z_ij| ≤ ε√|Î|]| = |E[Z_ij] − E[Z_ij; |Z_ij| > ε√|Î|]| = |E[Z_ij; |Z_ij| > ε√|Î|]|  (∵ E[Z_ij] = 0)
≤ E[|Z_ij|; |Z_ij| > ε√|Î|] ≤ E[(|Z_ij| / (ε√|Î|)) |Z_ij|; |Z_ij| > ε√|Î|] = (1/(ε√|Î|)) E[Z_ij²; |Z_ij| > ε√|Î|] ≤ (1/(ε√|Î|)) E[Z_ij²] = 1/(ε√|Î|),  (110)

which results in that, for any ε > 0,

sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} E[Y_ij]| ≤ sup_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} |E[Y_ij]| ≤ sup_{Î∈K_ε} 1/(ε√|Î|) ≤ 1/(ε√(npε)) → 0,  (111)

in the limit of m → ∞ (this holds from the assumption (b) that npε = Ω(m)). Therefore, for an arbitrary t > 0, there exists some (sufficiently large) m^{(1)}_{t,ε} ∈ ℕ such that for n, p ≥ m^{(1)}_{t,ε}, sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} E[Y_ij]| < t/2.

By definition, we also have |Y_ij| ≤ ε√|Î|, and thus |Y_ij − E[Y_ij]| ≤ |Y_ij| + E[|Y_ij|] ≤ 2ε√|Î| holds almost surely. From Bernstein's inequality, for n, p ≥ m^{(1)}_{t,ε},

Pr[Σ_{(i,j)∈Î} Y_ij > t|Î|] = Pr[Σ_{(i,j)∈Î} (Y_ij − E[Y_ij]) > |Î| (t − (1/|Î|) Σ_{(i,j)∈Î} E[Y_ij])]
≤ Pr[Σ_{(i,j)∈Î} (Y_ij − E[Y_ij]) > (t/2)|Î|]
≤ exp( − ((t/2)|Î|)² / (2 [Σ_{(i,j)∈Î} E[(Y_ij − E[Y_ij])²] + (2ε√|Î|)(t/2)|Î|/3]) )
≤ exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) )  [∵ (Y_ij − E[Y_ij])² ≤ 4ε²|Î|],  (112)

and

Pr[Σ_{(i,j)∈Î} Y_ij < −t|Î|] = Pr[Σ_{(i,j)∈Î} (Y_ij − E[Y_ij]) < −|Î|t + Σ_{(i,j)∈Î} E[Y_ij]]
≤ Pr[Σ_{(i,j)∈Î} (−Y_ij − E[−Y_ij]) > (t/2)|Î|]
≤ exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) ).  (113)

By combining the results of (112) and (113), we have

Pr[(1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] ≤ 2 exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) ),  (114)

which results in

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] ≤ Σ_{Î∈K_ε} 2 exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) ) ≤ 2 |K_ε| exp( − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ≤ O[ exp( m − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ]  (∵ (108)).  (115)

Therefore, for n, p ≥ m^{(1)}_{t,ε}, we have

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t]
≤ Pr(Y_ij = Z_ij for all (i, j) ∈ Î) Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] + Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij)
≤ Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] + Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij)
≤ O[ exp( m − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ] + Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij)  (∵ (115)).  (116)

From here, we will upper bound the second term in (116). From the assumption (b), for any ε >
0, there exists some (sufficiently large) m^{(2)}_ε ∈ ℕ such that for n, p ≥ m^{(2)}_ε, the truncation level ε√|Î| ≥ ε√(npε) is large enough for the tail bound on Z_ij to apply. Hence, for n, p ≥ m^{(2)}_ε,

Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij) ≤ Σ_{(i,j)∈Î} Pr(|Z_ij| > ε√|Î|) ≤ Σ_{(i,j)∈Î} ϑ_{g_ij}^{−1} exp[ −(ε√|Î|)^{ϑ_{g_ij}} ] ≤ C_ϑ^{−1} ε²|Î| exp[ −(ε²|Î|)^{C_ϑ} ],  (117)

where C_ϑ is a fixed constant that does not depend on the matrix size m.

Here, let us consider the function f(x) ≡ x exp(−x^k) for x > 0 and k > 0. Since the first derivative of f is given by f′(x) = (1 − k x^k) exp(−x^k), we have f′(x) ≤ 0 for x ≥ (1/k)^{1/k}. Therefore, from the assumption (b), there exists some (sufficiently large) m^{(3)}_{ε,C_ϑ} ∈ ℕ such that for n, p ≥ m^{(3)}_{ε,C_ϑ},

ε²|Î| exp[ −(ε²|Î|)^{C_ϑ} ] ≤ npε³ exp[ −(npε³)^{C_ϑ} ].  (118)

Therefore, by defining m^{(4)}_{ε,C_ϑ} ≡ max{ m^{(2)}_ε, m^{(3)}_{ε,C_ϑ} }, the following inequality holds for n, p ≥ m^{(4)}_{ε,C_ϑ}:

Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij) ≤ C_ϑ^{−1} npε³ exp[ −(npε³)^{C_ϑ} ].  (119)

Let m_{t,ε,C_ϑ} ≡ max{ m^{(1)}_{t,ε}, m^{(4)}_{ε,C_ϑ} }. By substituting (119) into (116), for n, p ≥ m_{t,ε,C_ϑ},

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t] ≤ O[ exp( m − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ] + C_ϑ^{−1} npε³ exp[ −(npε³)^{C_ϑ} ].  (120)

By taking ε = m^{−c} for an appropriately chosen constant c > 0, for example, and using the assumption (b) that npε = O(m) and npε = Ω(m), we have

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t] ≤ O{ exp( m − t² m^{2c} / 32 ) + m exp[ −m^{C_ϑ} ] }.  (121)

Therefore, from assumption (iv), we have

(K + 1) Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t] ≤ O{ m exp( m − t² m^{2c} / 32 ) + m exp[ −m^{C_ϑ} ] } → 0,  (122)

in the limit of m → ∞, which concludes the proof.

From now on, we will prove Lemma 3.1 based on the result of Lemma A.1.

Lemma A.2.
Under the assumptions in Sections 2 and 3, sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| converges in probability to zero in the limit of m → ∞.

Proof. Let r ≡ b̂ − b̄, which represents the difference between the sample and population means of each estimated group. Based on this definition, we have

Pr( sup_{ĝ∈J_ε} ‖r‖_∞ > t ) = Pr( sup_{ĝ∈J_ε} max_{k=0,1,…,K} |b̂_k − b̄_k| > t )
≤ Pr[ ( sup_{ĝ∈J_ε} |b̂_0 − b̄_0| > t ) ∪ ⋯ ∪ ( sup_{ĝ∈J_ε} |b̂_K − b̄_K| > t ) ]
≤ Σ_{k=0}^{K} Pr( sup_{ĝ∈J_ε} |b̂_k − b̄_k| > t )
≤ Σ_{k=0}^{K} Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} (A_ij − P_ij)| > t ]
= (K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} (A_ij − P_ij)| > t ]
= (K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} σ_ij Z_ij| > t ]
≤ (K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t / max_{k=0,1,…,K} S_k ].  (123)

From Lemma A.1 and the assumption (ii), the right side of (123) converges to zero in the limit of m → ∞.

From the assumption that f is continuous on S, differentiable on the interior of S, and that its first derivative is bounded on S, by the mean value theorem,

|F(ĝ) − F̄(ĝ)| = |Σ_{k=0}^{K} p̂_k [f(b̂_k) − f(b̄_k)]| ≤ Σ_{k=0}^{K} p̂_k |f(b̂_k) − f(b̄_k)| ≤ (sup_{x∈S} |f′(x)|) Σ_{k=0}^{K} p̂_k |b̂_k − b̄_k| ≤ (sup_{x∈S} |f′(x)|) Σ_{k=0}^{K} p̂_k (max_{k=0,1,…,K} |b̂_k − b̄_k|) = (sup_{x∈S} |f′(x)|) ‖r‖_∞.  (124)

By combining (123) and (124), sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| converges in probability to zero.

Appendix B Proof of Lemma 3.2
Lemma B.1.
Under the assumptions in Sections 2 and 3, for all δ > 0, if C ∉ P_δ,

F̄(ĝ) − F̄(g) ≤ −(κδ/4)(C_b)²,  (125)

where κ ≡ inf_{x∈T} f″(x) > 0 and C_b is a constant that satisfies the condition in assumption (ii).

Proof. Let d ≡ (d_k)_{0≤k≤K} with d_k ≡ f(b_k), and let w_k ≡ C_{k k̃} / (C^⊤ u)_{k̃} for each k ∈ {0, 1, …, K}. By definition, w_k ≥ 0 for all k and Σ_{k=0}^{K} w_k = 1. We also define

T ≡ ( min{b_0, b_1, …, b_K, b̄_{k̃}}, max{b_0, b_1, …, b_K, b̄_{k̃}} ) = ( min_{k=0,1,…,K} b_k, max_{k=0,1,…,K} b_k ).  (126)

From Taylor's theorem, there exists some θ ∈ T_k ≡ (min{b_k, b̄_{k̃}}, max{b_k, b̄_{k̃}}) such that

f(b_k) = f(b̄_{k̃}) + f′(b̄_{k̃})(b_k − b̄_{k̃}) + (f″(θ)/2)(b_k − b̄_{k̃})² ≥ f(b̄_{k̃}) + f′(b̄_{k̃})(b_k − b̄_{k̃}) + (inf_{x∈T_k} f″(x)/2)(b_k − b̄_{k̃})².  (127)

From (127), we have

(C^⊤ d)_{k̃} / (C^⊤ u)_{k̃} − f(b̄_{k̃}) = Σ_{k=0}^{K} w_k f(b_k) − f(b̄_{k̃}) ≥ Σ_{k=0}^{K} w_k [ f′(b̄_{k̃})(b_k − b̄_{k̃}) + (κ/2)(b_k − b̄_{k̃})² ],  (128)

where κ ≡ inf_{x∈T} f″(x) > 0. For the first term on the right side of (128), we have

Σ_{k=0}^{K} w_k f′(b̄_{k̃})(b_k − b̄_{k̃}) = f′(b̄_{k̃}) ( Σ_{k=0}^{K} w_k b_k − b̄_{k̃} ) = 0  (∵ definition of b̄_{k̃} in (13)).  (129)

By substituting (129) into (128), we have

(C^⊤ d)_{k̃} / (C^⊤ u)_{k̃} − f(b̄_{k̃}) ≥ (κ/2) Σ_{k=0}^{K} w_k (b_k − b̄_{k̃})².  (130)

By definition, if C ∉ P_δ, there exists a set (k_1, k_2, k̃) that satisfies k_1 ≠ k_2 and C_{k_1 k̃} C_{k_2 k̃} ≥ δ. Therefore, if C ∉ P_δ, we can lower bound the right side of (130) by

(C^⊤ d)_{k̃} / (C^⊤ u)_{k̃} − f(b̄_{k̃}) ≥ (κ/2) Σ_{k=0}^{K} w_k (b_k − b̄_{k̃})²
≥ (κ/2) [ w_{k_1} (b_{k_1} − b̄_{k̃})² + w_{k_2} (b_{k_2} − b̄_{k̃})² ]
= (κ / (2 (C^⊤ u)_{k̃})) [ C_{k_1 k̃} (b_{k_1} − b̄_{k̃})² + C_{k_2 k̃} (b_{k_2} − b̄_{k̃})² ]
≥ (κ C_{k_1 k̃} C_{k_2 k̃} / (2 (C^⊤ u)_{k̃})) [ (b_{k_1} − b̄_{k̃})² + (b_{k_2} − b̄_{k̃})² ]  (∵ 0 < C_{k_1 k̃}, C_{k_2 k̃} ≤ 1)
≥ (κ C_{k_1 k̃} C_{k_2 k̃} / (2 (C^⊤ u)_{k̃})) · (1/2) [ (b_{k_1} − b̄_{k̃}) + (b̄_{k̃} − b_{k_2}) ]²
= (κ C_{k_1 k̃} C_{k_2 k̃} / (4 (C^⊤ u)_{k̃})) (b_{k_1} − b_{k_2})²
≥ (κδ / (4 (C^⊤ u)_{k̃})) (b_{k_1} − b_{k_2})²  (∵ C_{k_1 k̃} C_{k_2 k̃} ≥ δ)
⇔ (C^⊤ u)_{k̃} f(b̄_{k̃}) − (C^⊤ d)_{k̃} ≤ −(κδ/4)(b_{k_1} − b_{k_2})²  [∵ (C^⊤ u)_{k̃} > 0]
⇔ p̂_{k̃} f(b̄_{k̃}) − Σ_{k=0}^{K} C_{k k̃} f(b_k) ≤ −(κδ/4)(b_{k_1} − b_{k_2})²
⇒ p̂_{k̃} f(b̄_{k̃}) − Σ_{k=0}^{K} C_{k k̃} f(b_k) ≤ −(κδ/4)(C_b)²  (∵ assumption (ii)).  (131)

For k′ ≠ k̃, by defining w_k ≡ C_{k k′} / (C^⊤ u)_{k′}, the following inequality holds:

(C^⊤ d)_{k′} / (C^⊤ u)_{k′} − f(b̄_{k′}) ≥ (κ/2) Σ_{k=0}^{K} w_k (b_k − b̄_{k′})² ≥ 0 ⇔ p̂_{k′} f(b̄_{k′}) − Σ_{k=0}^{K} C_{k k′} f(b_k) ≤ 0.  (132)

By combining (131) and (132), we have

Σ_{k′=0}^{K} p̂_{k′} f(b̄_{k′}) − Σ_{k′=0}^{K} Σ_{k=0}^{K} C_{k k′} f(b_k) ≤ −(κδ/4)(C_b)²,  (133)

which is equivalent to the statement of Lemma 3.2.

Appendix C Proof of |s̃_k − s_k| = O_p(1/√|I_k|)

Let A^{(k)}, P^{(k)}, and P̃^{(k)}, respectively, be the k th null bicluster (k = 1, …, K) or background (k = 0) of matrices A, P, and P̃.

Lemma C1.
Under the assumption that the fourth moment of the noise Z_ij is bounded (E[Z_ij⁴] < ∞),

|s̃_k − s_k| = O_p( 1 / √|I_k| ),  (134)

where s̃_k is defined as in (6) and I_k ≡ {(i, j) : g_ij = k} (i.e., the set of entries in the k th group).

Proof. By definition, we have

s̃_k² ≡ (1/|I_k|) Σ_{(i,j)∈I_k} (A^{(k)}_ij − b̃_k)² = (1/|I_k|) Σ_{(i,j)∈I_k} (A^{(k)}_ij − b_k)² − (b_k − b̃_k)²,  (135)

where b̃_k ≡ (1/|I_k|) Σ_{(i,j)∈I_k} A^{(k)}_ij.

From (135), we have

s̃_k² − s_k² = (1/|I_k|) Σ_{(i,j)∈I_k} [ (A^{(k)}_ij − b_k)² − s_k² ] − (b_k − b̃_k)² = (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij − (b_k − b̃_k)²,  (136)

where we used the notation Y^{(k)}_ij ≡ (A^{(k)}_ij − b_k)² − s_k². From the assumption that the entries (A^{(k)}_ij)_{(i,j)∈I_k} are generated i.i.d. in each k th group, the random variables (Y^{(k)}_ij)_{(i,j)∈I_k} are also independent, and their expectations and variances satisfy

E[Y^{(k)}_ij] = E[(A^{(k)}_ij − b_k)²] − s_k² = 0,
V[Y^{(k)}_ij] = E[(Y^{(k)}_ij)²] = E[ { (A^{(k)}_ij − b_k)² − s_k² }² ] = s_k⁴ ( E[(Z^{(k)}_ij)⁴] − 1 ),  (137)

which results in

E[ (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij ] = 0,  V[ (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij ] = (1/|I_k|) s_k⁴ ( E[(Z^{(k)}_ij)⁴] − 1 ).  (138)

From (138) and Chebyshev's inequality, for all t >
0, we have

Pr[ | (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij | ≥ t √( (1/|I_k|) s_k⁴ ( E[(Z^{(k)}_ij)⁴] − 1 ) ) ] ≤ 1/t²,  (139)

which results in

| (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij | = O_p( 1 / √|I_k| ),  (140)

from the assumption that E[(Z^{(k)}_ij)⁴] < ∞.

As for the second term in (136), we have

b_k − b̃_k = (1/|I_k|) Σ_{(i,j)∈I_k} ( P^{(k)}_ij − A^{(k)}_ij ) = −( s_k / |I_k| ) Σ_{(i,j)∈I_k} Z^{(k)}_ij.  (141)

From (141), we have

E[ (b_k − b̃_k)² ] = (s_k² / |I_k|²) E[ ( Σ_{(i,j)∈I_k} Z^{(k)}_ij )² ] = (s_k² / |I_k|²) V[ Σ_{(i,j)∈I_k} Z^{(k)}_ij ] = s_k² / |I_k|,  (142)

since Z^{(k)}_ij has a unit variance.

From (142) and Markov's inequality, we have

∀t > 0,  Pr[ (b_k − b̃_k)² ≥ t ] ≤ s_k² / (|I_k| t)  ⇔  ∀t′ > 0,  Pr[ (b_k − b̃_k)² ≥ (s_k² / |I_k|) t′ ] ≤ 1/t′,  (143)

which results in

(b_k − b̃_k)² = O_p( 1 / |I_k| ).  (144)

Using (140), (144), and (136), we finally obtain

|s̃_k² − s_k²| ≤ | (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij | + (b_k − b̃_k)² = O_p( 1 / √|I_k| ),  (145)

which results in

|s̃_k − s_k| = |s̃_k² − s_k²| / |s̃_k + s_k|.  (146)

From (145), we see that s̃_k converges in probability to s_k, and thus |s̃_k + s_k| converges in probability to 2 s_k >
0. Therefore, we have

|s̃_k − s_k| = O_p( 1 / √|I_k| ),  (147)

which concludes the proof.

References

[1] S. Balakrishnan, M. Kolar, A. Rinaldo, A. Singh, and L. Wasserman. Statistical and computational tradeoffs in biclustering. In NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning, 2011.
[2] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini. Discovering local structure in gene expression data: the order-preserving submatrix problem. In
Proceedings of the Sixth Annual International Conference on Computational Biology, pages 49–57, 2002.
[3] P. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):253–273, 2016.
[4] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probability Theory and Related Fields, 164:459–552, 2016.
[5] M. Brennan, G. Bresler, and W. Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 48–166, 2018.
[6] M. Brennan, G. Bresler, and W. Huleihel. Universality of computational lower bounds for submatrix detection. In Proceedings of the 32nd Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 417–468, 2019.
[7] C. Butucea and Y. I. Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix.
Bernoulli, 19(5B):2652–2688, 2013.
[8] C. Butucea, Y. I. Ingster, and I. A. Suslina. Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix. ESAIM: Probability and Statistics, 19:115–134, 2015.
[9] T. T. Cai, T. Liang, and A. Rakhlin. Computational and statistical boundaries for submatrix localization in a large noisy matrix. Annals of Statistics, 45(4):1403–1430, 2017.
[10] T. T. Cai and Y. Wu. Statistical and computational limits for sparse matrix detection. Annals of Statistics, 48(3):1593–1614, 2020.
[11] T. Chekouo and A. Murua. The penalized biclustering model and related algorithms. Journal of Applied Statistics, 42(6):1255–1277, 2015.
[12] T. Chekouo, A. Murua, and W. Raffelsberger. The Gibbs-plaid biclustering model. Annals of Applied Statistics, 9(3):1643–1670, 2015.
[13] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. Journal of Machine Learning Research, 17(27):1–57, 2016.
[14] W. J. Conover. Practical Nonparametric Statistics. John Wiley & Sons, New York, 1999.
[15] M. Corneli, P. Latouche, and F. Rossi. Exact ICL maximization in a non-stationary time extension of the latent block model for dynamic networks. In Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 225–230, 2015.
[16] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–274, 2001.
[17] D. Dua and C. Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017. University of California, Irvine, School of Information and Computer Sciences.
[18] D. E. Duffy and A. J. Quiroz. A permutation-based algorithm for block clustering.
Journal of Classification, 8:65–91, 1991.
[19] C. J. Flynn and P. O. Perry. Profile likelihood biclustering. Electronic Journal of Statistics, 14(1):731–768, 2020.
[20] F. O. D. França. Scalable overlapping co-clustering of word-document data. In , pages 464–467, 2012.
[21] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
[22] B. Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2):311–329, 1988.
[23] B. Hajek, Y. Wu, and J. Xu. Information limits for recovering a hidden community. IEEE Transactions on Information Theory, 63(8):4729–4745, 2017.
[24] B. Hajek, Y. Wu, and J. Xu. Submatrix localization via message passing. Journal of Machine Learning Research, 18(186):1–52, 2018.
[25] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):1–19, 2015.
[26] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
[27] S. Hochreiter, U. Bodenhofer, M. Heusel, A. Mayr, A. Mitterecker, A. Kasim, T. Khamiakova, S. V. Sanden, D. Lin, W. Talloen, L. Bijnens, H. W. H. Göhlmann, Z. Shkedy, and D.-A. Clevert. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520–1527, 2010.
[28] J. Hu, J. Zhang, H. Qin, T. Yan, and J. Zhu. Using maximum entry-wise deviation to test the goodness of fit for stochastic block models. Journal of the American Statistical Association, 0(0):1–10, 2020.
[29] M. Kolar, S. Balakrishnan, A. Rinaldo, and A. Singh. Minimax localization of structural information in large noisy matrices. In
Advances in Neural Information Processing Systems, volume 24, pages 909–917, 2011.
[30] J. Lei. A goodness-of-fit test for stochastic block models. The Annals of Statistics, 44(1):401–424, 2016.
[31] J. Liu, J. Yang, and W. Wang. Biclustering in gene expression data by tendency. In Proceedings of 2004 IEEE Computational Systems Bioinformatics Conference, pages 182–193, 2004.
[32] Y. Liu and J. Guo. Distribution-free, size adaptive submatrix detection with acceleration. arXiv:1804.10887, 2018.
[33] A. Lomet, G. Govaert, and Y. Grandvalet. Model selection in block clustering by the integrated classification likelihood. In Proceedings of 20th International Conference on Computational Statistics, pages 519–530, 2012.
[34] Y. Luo and A. Zhang. Tensor clustering with planted structures: Statistical optimality and computational limits. In , 2020.
[35] Z. Ma and Y. Wu. Computational barriers in minimax submatrix detection. Annals of Statistics, 43(3):1089–1116, 2015.
[36] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
[37] G. E. Moran. Bayesian Approaches For Modeling Variation. PhD thesis, University of Pennsylvania, 2019.
[38] S. A. Murphy and A. W. van der Vaart. On profile likelihood. Journal of the American Statistical Association, 95(450):449–465, 2000.
[39] A. Oghabian, S. Kilpinen, S. Hautaniemi, and E. Czeizler. Biclustering methods: Biological relevance and application in gene expression analysis. PLOS ONE, 9(3):e90801, 2014.
[40] N. S. Pillai and J. Yin. Universality of covariance matrices. Annals of Applied Probability, 24(3):935–1001, 2014.
[41] G. Pio, M. Ceci, D. D’Elia, C. Loglisci, and D. Malerba. A novel biclustering algorithm for the discovery of meaningful biological correlations between microRNAs and their target genes.
BMC Bioinformatics, 14(7):S8, 2013.
[42] A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.
[43] E. Raff, R. Zak, G. L. Munoz, W. Fleming, H. S. Anderson, B. Filar, C. Nicholas, and J. Holt. Automatic Yara rule generation using biclustering. In Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, pages 71–82, 2020.
[44] Y. Sakai and K. Yamanishi. An NML-based model selection criterion for general relational data modeling. In Proceedings of 2013 IEEE International Conference on Big Data, pages 421–429, 2013.
[45] A. A. Shabalin, V. J. Weigman, C. M. Perou, and A. B. Nobel. Finding large average submatrices in high dimensional data. Annals of Applied Statistics, 3(3):985–1012, 2009.
[46] H. Shan and A. Banerjee. Bayesian co-clustering. In Proceedings of the 8th IEEE International Conference on Data Mining, pages 530–539, 2008.
[47] M. Sill, S. Kaiser, A. Benner, and A. Kopp-Schneider. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics, 27(15):2089–2097, 2011.
[48] P. Symeonidis, A. Nanopoulos, A. Papadopoulos, and Y. Manolopoulos. Nearest-biclusters collaborative filtering with constant values. In Advances in Web Mining and Web Usage Analysis, WebKDD 2006, Lecture Notes in Computer Science, volume 4811, pages 36–55, 2007.
[49] A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18(suppl 1):S136–S144, 2002.
[50] M. Tepper and G. Sapiro. Fast L1-NMF for multiple parametric model estimation. arXiv:1610.05712, 2016.
[51] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Health Research and Policy, Department of Statistics, Department of Genetics and Department of Biochemistry, Stanford University, 1999.
[52] C. A. Tracy and H. Widom. The distributions of random matrix theory and their applications. In
New Trends in Mathematical Physics, pages 753–765. Springer, 2009.
[53] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[54] J. H. Ward, Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
[55] C. Watanabe and T. Suzuki. Goodness-of-fit test for latent block models. Computational Statistics & Data Analysis, 154:107090, 2021.
[56] J. Wyse, N. Friel, and P. Latouche. Inferring structure in bipartite networks using the latent blockmodel and exact ICL. Network Science, 5(1):45–69, 2017.
[57] K. Yamanishi, T. Wu, S. Sugawara, and M. Okada. The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models. Data Mining and Knowledge Discovery, 33:1017–1058, 2019.
[58] M. K. Yöntem. The predictive role of the styles of parenthood origin on divorce predictors. PhD thesis, Gaziosmanpasa University, 2017.
[59] M. K. Yöntem, K. Adem, T. İlhan, and S. Kılıçarslan. Divorce prediction using correlation-based feature selection and artificial neural networks.
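The $O_p(1/\sqrt{|\mathcal{I}_k|})$ rate for $|\tilde{s}_k - s_k|$ established in (147) is easy to check numerically. The sketch below is a minimal simulation under simplifying assumptions, not the paper's full procedure: it draws the entries of a single homogeneous bicluster as $A_{ij} = b_k + s_k Z_{ij}$ with Gaussian noise and uses simple plug-in estimators for the bicluster mean and noise scale (the names `estimate_scale_error`, `b_k`, and `s_k`, and the chosen values, are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
b_k, s_k = 1.5, 0.7  # true bicluster mean and noise scale (illustrative values)

def estimate_scale_error(n_entries, n_trials=200):
    """Average error |s_tilde - s_k| of a plug-in noise-scale estimator
    over n_trials simulated biclusters with n_entries entries each."""
    errors = []
    for _ in range(n_trials):
        # Entries of one homogeneous bicluster: A_ij = b_k + s_k * Z_ij
        a = b_k + s_k * rng.standard_normal(n_entries)
        b_tilde = a.mean()                               # plug-in mean estimate
        s_tilde = np.sqrt(np.mean((a - b_tilde) ** 2))   # plug-in scale estimate
        errors.append(abs(s_tilde - s_k))
    return float(np.mean(errors))

# The average error should shrink roughly like 1 / sqrt(|I_k|): quadrupling
# the number of entries should roughly halve it.
for m in [100, 400, 1600, 6400]:
    print(m, estimate_scale_error(m))
```

Under these assumptions, the printed errors decrease by roughly a factor of two for each fourfold increase in the number of entries, consistent with the $1/\sqrt{|\mathcal{I}_k|}$ rate.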