A Goodness-of-fit Test on the Number of Biclusters in a Relational Data Matrix
Chihiro Watanabe∗ and Taiji Suzuki†
Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan
Abstract
Biclustering is the problem of detecting homogeneous submatrices in a given observed matrix, and it has been shown to be an effective tool for relational data analysis. Although there have been many studies for estimating the underlying bicluster structure of a matrix, few have enabled us to determine the appropriate number of biclusters in an observed matrix. Recently, a statistical test on the number of biclusters has been proposed for a regular-grid bicluster structure, where we assume that the latent bicluster structure can be represented by row-column clustering. However, when the latent bicluster structure does not satisfy such a regular-grid assumption, the previous test requires too many biclusters (i.e., a finer bicluster structure) for the null hypothesis to be accepted, which is not desirable in terms of interpreting the accepted bicluster structure. In this paper, we propose a new statistical test on the number of biclusters that does not require the regular-grid assumption, and derive the asymptotic behavior of the proposed test statistic in both null and alternative cases. To develop the proposed test, we construct a consistent submatrix localization algorithm, that is, one for which the probability of outputting the correct bicluster structure converges to one. We show the effectiveness of the proposed method by applying it to both synthetic and practical relational data matrices.
Keywords. biclustering, submatrix detection, consistent submatrix localization, goodness-of-fit test, random matrix theory
Relational data is a kind of matrix data whose entries show some kind of relationship between two (generally different) objects. For example, the rows and columns, respectively, of an observed matrix A ∈ R^{n×p} represent customers and products, and each entry A_ij is the number of times the ith customer purchased the jth product. It has been shown that we can successfully model various kinds of relational data matrices, including customer-item transaction/rating data [46, 48], document-word co-occurrence data [16, 20], and gene expression data [36, 39, 42, 49], by assuming the existence of latent biclusters or some “homogeneous” submatrices (e.g., the entries in each bicluster are identically distributed). In the example of the above customer-item relational data matrix, this assumption corresponds to assuming that there are some groups of customers I ⊆ {1, . . . , n} and of items J ⊆ {1, . . . , p}, and that the customers in I tend to purchase the items in J at a similar frequency.
In regard to the bicluster structure of a relational data matrix, the following two problems have been extensively studied in the literature: submatrix detection and localization. Submatrix detection is the task of detecting the existence of such biclusters in a given observed matrix A (i.e., whether or not matrix A contains at least one bicluster) [7, 26, 35, 45]. As in the previous studies [9, 13], in this paper we distinguish such a task from submatrix localization (which is also known as biclustering), whose purpose is to recover the exact position of such biclusters. So far, many biclustering methods have been proposed for a fixed number of biclusters K [9, 13, 24, 27, 45]. In most practical cases, however, we would not have prior knowledge about K in a given data matrix. Therefore, it is an important task to develop a method to appropriately determine K from the observed data A.
∗ chihiro [email protected] † [email protected]
In the next two paragraphs, we introduce some related studies that have proposed methods for choosing K.
Statistical test on the number of biclusters K. Although there have been many studies for testing whether an observed matrix A contains any large average submatrix or not [6, 7, 10, 32, 35], few statistical test methods have been proposed for the number of biclusters K in matrix A. Recently, statistical tests on K have been proposed in [3, 28, 30, 55] with the constraint that the underlying bicluster structure should be represented by a regular grid (as shown in Figure 1 (b-2)). However, if the latent bicluster structure does not satisfy the regular grid constraint (as shown in Figure 1 (b-1)), such a test needs a larger hypothetical number of biclusters K (i.e., a finer bicluster structure) than necessary to accept the null hypothesis K₀ = K, which is not desirable from the perspective of interpreting the accepted bicluster structure. To cope with such a problem, we need a more flexible model which can represent the existence of local biclusters [45]. For singular-value-decomposition-based biclustering, a stopping criterion for detecting multiple biclusters (which determines K) based on stability selection has been proposed in [47]. This method has enabled us to detect a bicluster structure with Type I error control and without the regular grid constraint. However, unlike our proposed method, its Type I error has been guaranteed only in terms of an upper bound, not the null distribution of a test statistic. Therefore, this previous study provides no way to perform a statistical test on the number of biclusters. Moreover, it has no theoretical guarantee for the alternative cases (i.e., statistical power).
In this paper, we address these problems by developing a new statistical test on K, which does not require the regular grid constraint and whose test statistic T shows a good property in an alternative case (i.e., with high probability, T asymptotically increases with the matrix size and thus the Type II error converges in probability to zero), as shown later in Theorem 4.3.
Information criterion on the number of biclusters K. Aside from the statistical-test-based methods, some studies have proposed to determine K based on the minimum description length [44, 50, 57] and a modified DIC for the biclustering problem [11, 12]. Particularly, under the regular grid constraint on the bicluster structure, an information criterion called the integrated completed likelihood (ICL) has been proposed for determining K [15, 33, 56], which approximates the maximum marginal likelihood for a given K. These methods aim to select the optimal number of biclusters K from a given set of candidates in terms of some criterion (e.g., marginal likelihood). This purpose is different from that of a statistical test, which aims to judge whether or not we accept a hypothetical number of biclusters K with a specific significance level given by a user.
Other approaches for determining the number of biclusters K. Aside from the above information-criterion-based and statistical-test-based methods, some studies have proposed to construct a generative model of the bicluster structure including the number of biclusters K and then select an optimal model in terms of some measure (e.g., choose a MAP estimator) [37, 43]. There have also been some heuristic criteria for determining the number of biclusters in an observed matrix, which have been proposed as stopping rules for top-down division-based biclustering algorithms [18, 26, 51] or bottom-up merging-based ones [41]. Particularly, in [3, 28, 30], the observed matrix A (and thus its bicluster structure) is assumed to be square symmetric.
It must be noted that Theorem 4.3 shows the asymptotic behavior of test statistic T under the assumption that an observed matrix has some latent bicluster structure. Deriving its behavior in the case that an observed matrix cannot be represented by such a model is beyond the scope of this paper.
In this paper, we propose a statistical test method for the number of biclusters K in a given observed matrix A, without the regular grid constraint on the latent bicluster structure. To guarantee the asymptotic behavior of the proposed test statistic in the null case (i.e., K₀ = K), which is given in Theorem 4.1, we use the properties of a random matrix with a sub-exponential decay [4, 40]. Moreover, we derive its behavior in the alternative case (i.e., K₀ > K): it increases with the matrix size with high probability, which is given in Theorem 4.3. Unlike the previous study [55], where the number of biclusters K₀ is assumed to be a fixed constant that does not depend on the matrix size m, we consider a case in which K₀ might increase with m (the precise description of this assumption is given in (9)). Finally, we explain a method to estimate K₀ from observed matrix A, by sequentially testing the hypothetical numbers of biclusters in ascending order (i.e., K = 0, 1, 2, . . .) until the null hypothesis is accepted.
In the proposed test, we assume that the submatrix localization algorithm is consistent, that is, if the hypothetical number of biclusters K is equal to the null one K₀, the probability that it outputs the correct bicluster structure converges to one asymptotically. Such a consistency condition in submatrix localization has been considered in several previous studies; however, most of them have made the assumption that there exists at most one bicluster in a given observed matrix [1, 5, 8, 23, 24, 29, 34]. It would be possible to localize multiple biclusters by applying these methods to a given observed matrix multiple times; however, there is no guarantee of consistency for such a heuristic approach.
When considering submatrix localization with multiple biclusters, we have to take into account the notions of disjointness and bi-disjointness. We call a bicluster structure disjoint iff each entry belongs to at most one bicluster (as shown in Figure 1 (a) and (b)).
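The sequential estimation of K₀ described above can be sketched as follows. Here `test_stat` and `threshold` are hypothetical stand-ins for the paper's test statistic T and its critical value (both are only developed later, in Section 4); the loop itself is the selection rule: test K = 0, 1, 2, . . . and return the first accepted K.

```python
def select_num_biclusters(A, test_stat, threshold, K_max=10):
    """Test K = 0, 1, 2, ... and return the first K at which the null
    hypothesis is accepted (the statistic stays below its critical value)."""
    for K in range(K_max + 1):
        if test_stat(A, K) <= threshold(A, K):
            return K  # accept the null hypothesis K0 = K
    return None  # no hypothetical K was accepted up to K_max

# Toy demo with a fake statistic that is small only at K = 2.
K_hat = select_num_biclusters(
    A=None,  # the toy statistic below ignores the matrix
    test_stat=lambda A, K: abs(K - 2.0),
    threshold=lambda A, K: 0.5,
)
```

With the toy statistic above, K = 0 and K = 1 are rejected and K = 2 is the first accepted hypothesis, so `K_hat` is 2.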
Bi-disjointness is a stricter condition: we call a disjoint bicluster structure bi-disjoint iff each row or column belongs to at most one bicluster (as shown in Figure 1 (a)). For a bi-disjoint bicluster structure, some submatrix localization algorithms, based on the maximum likelihood estimator for a known model parameter [13] and on singular value decomposition [9], have been shown to be consistent. However, these algorithms cannot be directly applied to non-bi-disjoint cases, where the localization problem cannot be formulated as a row-column (hard) clustering [45].
As another direction, by imposing the regular grid constraint on the underlying bicluster structure, we can consider a special case of non-bi-disjoint bicluster structures, which can be represented as a result of row-column clustering (as shown in Figure 1 (b-2)). As for the biclustering problem with such a regular grid structure, Flynn and Perry [19] have proposed a consistent algorithm based on the criterion of (generalized) profile likelihood. In this paper, we extend this method and develop a consistent submatrix localization algorithm without the regular grid assumption (as shown in Figure 1 (b-1)).
In summary, the main contributions of this paper are as follows:
1. We propose for the first time a statistical test on the number of biclusters K in a given observed matrix, under the assumptions that the underlying bicluster structure is disjoint (but not necessarily bi-disjoint) and that the submatrix localization algorithm is consistent.
2. To fulfill the above assumption, we develop a consistent submatrix localization algorithm for a disjoint but not necessarily bi-disjoint bicluster structure (including the cases of Figure 1 (a) and (b)).
We first prove the consistency of the proposed algorithm based on (generalized) profile likelihood [38], and then propose an approximated algorithm to obtain the maximum profile likelihood solution while avoiding combinatorial explosion.
This paper is organized as follows: in Section 2, we describe the problem settings and the model of the underlying bicluster structure in an observed matrix. Then, in Section 3, we develop a consistent submatrix localization algorithm based on generalized profile likelihood. Based on the estimated bicluster structure given by such a consistent algorithm, in Section 4, we propose a statistical test on the number of biclusters K in an observed matrix, and derive its theoretical guarantees in both null and alternative cases. In Section 5, we give some experimental results which show the effectiveness of the proposed test. Finally, in Section 6, we discuss the obtained results and future work on the proposed method, and give a conclusion to this paper in Section 7.
2 Problem setting and statistical model for goodness-of-fit test for submatrix detection problem
Let A ∈ R^{n×p} be an n × p observed matrix. Given such an observed matrix A, the goal of the submatrix detection problem is to determine whether it contains one or multiple disjoint submatrices or biclusters, in each of which the entries are generated in the i.i.d. sense (Figure 1 (a) and (b)). As in the previous studies [9, 32], we distinguish the submatrix detection and localization problems in that the goal of the latter is not only to detect the existence of biclusters in an observed matrix, but to estimate their precise locations.
Let K₀ be the minimum number of such biclusters needed to represent matrix A, which is unknown in advance. Aside from the K₀ biclusters, we assume that there are “background” entries in matrix A that do not belong to any bicluster. Note that the difference between a bicluster and the background is that the former can be represented as a submatrix (i.e., {(i, j) : i ∈ I_k, j ∈ J_k} for some index sets I_k ⊂ {1, . . . , n} and J_k ⊂ {1, . . . , p}), while the latter does not necessarily have such a submatrix structure, as shown in Figure 1. Unlike the previous studies that have provided consistent submatrix localization methods with multiple biclusters [9, 13], we do not assume the underlying bicluster structure to be bi-disjoint, that is, that each row or column is assigned to at most one bicluster (Figure 1 (a)).
We denote the bicluster index of the (i, j)th entry of matrix A as g_ij ∈ {0, 1, . . . , K₀}, where g_ij = k if the (i, j)th entry belongs to the kth bicluster for some k ∈ {1, . . . , K₀}, and g_ij = 0 if it belongs to the background. We define the set of group indices of all the entries as g ≡ (g_ij)_{1≤i≤n, 1≤j≤p}. We also define I_k ≡ {(i, j) : g_ij = k}, which represents the set of entries in the kth group.
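As a concrete illustration, the entry-wise assignment g and the groups I_k can be represented as below; the matrix sizes and bicluster positions are arbitrary toy choices. The two biclusters occupy disjoint sets of entries but share two columns, so the structure is disjoint yet not bi-disjoint:

```python
import numpy as np

n, p = 6, 8
g = np.zeros((n, p), dtype=int)  # g_ij = 0 marks the background
g[0:3, 0:4] = 1                  # bicluster 1
g[3:6, 2:6] = 2                  # bicluster 2 (reuses columns 2 and 3)

# I_k: the set of entries (i, j) assigned to the k-th group.
I = {k: list(zip(*np.where(g == k))) for k in range(3)}

cols1 = {j for _, j in I[1]}
cols2 = {j for _, j in I[2]}
shared = cols1 & cols2           # non-empty, so the structure is not bi-disjoint
```

Since g is a single integer matrix, each entry belongs to exactly one group by construction, which is exactly the disjointness assumed by the proposed method.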
Specifically, we consider the following model:
P = (P_ij)_{1≤i≤n, 1≤j≤p}, P_ij = b_{g_ij},
σ = (σ_ij)_{1≤i≤n, 1≤j≤p}, σ_ij = s_{g_ij},
A = (A_ij)_{1≤i≤n, 1≤j≤p}, E[A_ij] = P_ij, E[(A_ij − P_ij)²] = σ_ij², (1)
where b_k and s_k > 0, respectively, are the mean and standard deviation of the kth bicluster (k = 1, . . . , K₀) or background (k = 0). This model is a generalized version of well-studied submatrix detection models, in which it is assumed that the mean of the background noise is zero (i.e., b₀ = 0) [7, 35, 45]. Let Z ∈ R^{n×p} be a standardized noise matrix, which is given by
Z = (Z_ij)_{1≤i≤n, 1≤j≤p}, Z_ij = (A_ij − P_ij)/σ_ij. (2)
In most cases, the number of biclusters K₀ is unknown in advance. The goal of this paper is to develop a statistical test on K₀, which is based on the following null (N) and alternative (A) hypotheses:
(N) : K₀ = K, (A) : K₀ > K, (3)
where K is a given hypothetical number of biclusters. In this paper, we only consider the cases where K ≤ K₀. If K = K₀ (i.e., the null case), then we call it a realizable case. Otherwise (i.e., in the alternative case), we call it an unrealizable case. To select the number of biclusters for a given observed matrix A, we propose to sequentially test the bicluster numbers K = 0, 1, 2, . . . until the null hypothesis (N) is accepted. Let K̂ be the hypothetical number of biclusters when (N) is accepted. The proposed method outputs K̂ as the selected number of biclusters in matrix A.
Notations.
Throughout this paper, we use the following notations:
X = O_p[f(m)] :⇔ ∀ε > 0, ∃C > 0, M > 0 such that ∀m ≥ M, Pr[Cf(m) ≥ X] ≥ 1 − ε.
X = Ω_p[f(m)] :⇔ ∀ε > 0, ∃C > 0, M > 0 such that ∀m ≥ M, Pr[Cf(m) ≤ X] ≥ 1 − ε.
X = Θ_p[f(m)] :⇔ ∀ε > 0, ∃C₁, C₂ > 0, M > 0 such that ∀m ≥ M, Pr[C₁f(m) ≤ X ≤ C₂f(m)] ≥ 1 − ε. (4)
Figure 1: (a) Bi-disjoint, (b) disjoint but not bi-disjoint, and (c) non-disjoint bicluster structures. In the proposed method, we assume that the underlying bicluster structure is disjoint, but not necessarily bi-disjoint. We assume that the observed matrix consists of one or multiple biclusters, in each of which the entries are generated in the i.i.d. sense. Note that in the cases of (b-1) and (c), it is not always possible to make all the rows and columns within the biclusters contiguous by sorting the rows and columns.
‖A‖_op = sup_{u ∈ R^p, u ≠ 0} ‖Au‖/‖u‖, ‖A‖_F = √(Σ_{i=1}^n Σ_{j=1}^p A_ij²). (5)
In the proofs in Section 4, we use the following sample mean matrix P̃ and standard deviation matrix σ̃ for the correct block structure, and matrix Z̃:
P̃ = (P̃_ij)_{1≤i≤n, 1≤j≤p}, P̃_ij = b̃_{g_ij},
σ̃ = (σ̃_ij)_{1≤i≤n, 1≤j≤p}, σ̃_ij = s̃_{g_ij},
Z̃ = (Z̃_ij)_{1≤i≤n, 1≤j≤p}, Z̃_ij = (A_ij − P̃_ij)/σ̃_ij, (6)
where b̃_k and s̃_k, respectively, are the sample mean and standard deviation of the entries in the kth null group in observed matrix A. Let λ̃ and ṽ, respectively, be the maximum eigenvalue of matrix Z̃ᵀZ̃ and the corresponding eigenvector whose Euclidean norm is constrained to be one:
Z̃ᵀZ̃ ṽ = λ̃ ṽ, ‖ṽ‖ = 1. (7)
Similarly, we denote the eigenvalues (in descending order) and the corresponding normalized eigenvectors of matrix ZᵀZ as {λ_j} and {v_j}, respectively:
ZᵀZ v_j = λ_j v_j, ‖v_j‖ = 1, j = 1, . . . , p,
λ₁ ≥ λ₂ ≥ · · · ≥ λ_p. (8)
Assumptions.
In Section 4, we develop a statistical test for the number of biclusters in an observed matrix based on the following assumptions:
(i). We assume that the biclusters (and the background) are disjoint, that is, there is no entry (i, j) that is assigned to two or more biclusters (i.e., I_k ∩ I_k′ = ∅ if k ≠ k′, k, k′ ∈ {0, 1, . . . , K₀}). Moreover, we assume that each (i, j)th entry belongs to exactly one group. Note that in this paper, we use the term “disjoint” in the meaning that each entry belongs to a single group, not in the meaning that the bicluster structure is bi-disjoint as in Figure 1 (a) (i.e., that each row or column is assigned to at most one bicluster).
(ii). We assume that each entry Z_ij of the standardized noise matrix has a sub-exponential decay (i.e., there exists some ϑ > 0 such that for all x > 1, Pr(|Z_ij| > x) ≤ ϑ⁻¹ exp(−x^ϑ)). We also assume the following conditions:
• max_{k=0,1,...,K₀} s_k = O(1), and min_{k=0,1,...,K₀} s_k = Ω(1).
• max_{k=1,...,K₀, k′=1,...,K₀} |b_k − b_k′| = O(1), and max_{k=0,1,...,K₀} |b_k| = O(K₀). We also assume that the minimum difference between a pair of different biclusters (including the background) is lower bounded by some constant that does not depend on the matrix size: min_{k≠k′} |b_k − b_k′| ≥ C_b > 0.
• C_{M(4)} ≡ max_{i=1,...,n, j=1,...,p} E[Z_ij⁴] = O(1).
(iii). We assume that both the row and column sizes n and p of the observed matrix increase in proportion to some sufficiently large number m (i.e., n, p ∝ m).
(iv). As for the background, we assume that it can be divided into H disjoint submatrices. In regard to the row and column sizes of each kth submatrix (|I_k| and |J_k|, respectively, where k = 1, . . . , K₀ + H), we assume that they monotonically increase with m.
• In Theorem 4.1 for a realizable case, we assume that the minimum number of biclusters K₀ and that of background submatrices H needed to represent the observed matrix A satisfy the following conditions:
K₀ + H = O(m^{1/2−ε}), for some ε > 0, (9)
n_min ≡ min_{k=1,...,K₀+H} |I_k| = Ω(m^{3/4}), p_min ≡ min_{k=1,...,K₀+H} |J_k| = Ω(m^{3/4}). (10)
Note that from these conditions, we have
(K₀ + H)(min_{k=1,...,K₀+H} |I_k||J_k|)⁻¹ ≤ (K₀ + H)/(n_min p_min) = O(m^{−1−ε}), for some ε > 0. (11)
• In Theorem 4.3 for an unrealizable case, we assume the following stricter condition:
(K₀ + H)/√(n_min p_min) = O(m^{−1−ε}), for some ε > 0. (12)
(v).
In the realizable case (i.e., K = K₀), we assume that a submatrix localization algorithm for estimating the bicluster structure g is consistent, that is,
Pr(ĝ = g) → 1 as m → ∞,
where g and ĝ, respectively, are the null and the estimated bicluster structures of observed matrix A (the precise definition of ĝ is given in Section 4). In Section 3, we develop a consistent submatrix localization algorithm that can be applied without the bi-disjoint assumption.
3 Consistent submatrix localization based on generalized profile likelihood
In this section, we propose a consistent submatrix localization algorithm, which is used to estimate the bicluster structure of observed matrix A before applying the proposed statistical test. To our knowledge, this is the first paper that gives a submatrix localization method with a consistency guarantee (i.e., the probability that it outputs the correct bicluster structure converges to one in the limit of matrix size m → ∞) where the underlying bicluster structure is not necessarily bi-disjoint (as shown in Figure 1 (b)).
We first show the rigorous theoretical result that the bicluster structure with the maximum generalized profile likelihood is consistent in Section 3.1. Since the combinatorial optimization problem of finding the bicluster structure with the maximum generalized profile likelihood is intractable, we propose an approximated algorithm in Section 3.2.
To show the consistency of a submatrix localization algorithm based on the (generalized) profile likelihood (whose definition is given later in (15)), we use the following assumptions, aside from those in Section 2.
3.1 Consistency of submatrix localization based on the generalized profile likelihood
Assumptions in constructing a consistent submatrix localization algorithm
(a). The hypothetical number of biclusters is equal to the null one (i.e., K = K₀).
(b). Let p_k ∈ R and p̂_k ∈ R, respectively, be the proportions of entries in the kth group (k ∈ {0, 1, . . . , K}) in the null and estimated bicluster structures to all the np entries. Let p = (p_k)_{0≤k≤K} ∈ R^{K+1}. To guarantee the consistency, we need a stricter condition than assumption (iv): that min_{k=0,1,...,K} p_k = Ω(m^{−1/2}). Fix ε > 0 such that (1) ε = Ω(m^{−1/2}) (and thus npε = Ω(m^{3/2})), and (2) ε < min_{k=0,1,...,K} p_k. For instance, if we set ε = C_ε m^{−1/2} with an arbitrary fixed constant C_ε > 0 (which does not depend on m), for a sufficiently large m, both (1) and (2) hold.
(c). In regard to the sub-exponential assumption (ii), we additionally assume that there exist some ϑ₀, ϑ₁, . . . , ϑ_K ≥ C_ϑ > 0 such that for all x > 1, Pr(|Z_ij| > x) ≤ ϑ_{g_ij}⁻¹ exp(−x^{ϑ_{g_ij}}), where C_ϑ is a fixed constant that does not depend on the matrix size m.
To evaluate the accuracy of the estimated bicluster structure ĝ, we introduce a confusion matrix C ∈ R^{(K+1)×(K+1)}, whose (k, k′)th entry represents the proportion of entries that belong to the kth group in the null bicluster structure and the k′th group in the estimated one [i.e., C_{kk′} ≡ (1/(np)) Σ_{i=1}^n Σ_{j=1}^p I(g_ij = k) I(ĝ_ij = k′)], as shown in Figure 2. Note that the following two conditions are mutually equivalent:
• The estimated bicluster structure is correct (i.e., ĝ = g).
• The confusion matrix C can be made diagonal by permutations of its rows.
Moreover, the population mean b̄_k of each kth estimated group can be computed based on the confusion matrix C as follows:
b̄_k ≡ (1/|Î_k|) Σ_{(i,j)∈Î_k} P_ij = (np/|Î_k|) Σ_{k′=0}^K C_{k′k} b_{k′} = (Cᵀb)_k / (Cᵀu)_k, (13)
where u ≡ [1, 1, . . . , 1]ᵀ ∈ R^{K+1}. Note that we used p̂_k = |Î_k|/(np) = (Cᵀu)_k to derive (13).
We use the following notation:
J_ε ≡ {ĝ : p̂_k(ĝ) > ε for all k ∈ {0, 1, . . . , K}}. (14)
Since, by assumption (b), ε is chosen to be smaller than min_{k=0,1,...,K} p_k, we choose the estimated bicluster assignment ĝ from J_ε.
Figure 2: An example of a confusion matrix C_{kk′} for a set of null and estimated bicluster assignments; the sum of each row is p_k and the sum of each column is p̂_k′.
As in [19], we develop a consistent submatrix localization algorithm based on the (generalized) profile likelihood [38].
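A minimal sketch of the confusion matrix C and of the check that it is a permutation of a diagonal matrix; the assignments below are toy examples, and the check uses the column-wise characterization that no column may contain two nonzero entries:

```python
import numpy as np

def confusion_matrix(g, g_hat, K):
    """C[k, k'] = fraction of the n*p entries in null group k and
    estimated group k' (groups 0..K, with 0 the background)."""
    C = np.zeros((K + 1, K + 1))
    for k in range(K + 1):
        for k2 in range(K + 1):
            C[k, k2] = np.mean((g == k) & (g_hat == k2))
    return C

def is_permuted_diagonal(C, tol=1e-12):
    """True iff C becomes diagonal under a row permutation, i.e.
    every column has at most one (numerically) nonzero entry."""
    return all((np.abs(C[:, j]) > tol).sum() <= 1 for j in range(C.shape[1]))

# Toy example: the estimate recovers the null structure with bicluster
# labels 1 and 2 exchanged, which still counts as a correct localization.
g     = np.array([[0, 0, 1, 1],
                  [2, 2, 1, 1]])
g_hat = np.array([[0, 0, 2, 2],
                  [1, 1, 2, 2]])
C = confusion_matrix(g, g_hat, K=2)
```

Here `C` has a single nonzero entry per column, so `is_permuted_diagonal(C)` holds; perturbing a single entry of `g_hat` breaks this property.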
Given an estimated bicluster assignment ĝ, the generalized profile-likelihood criterion for an exponential family model is given by
F(ĝ) ≡ Σ_{k=0}^K p̂_k f( (1/|Î_k|) Σ_{(i,j)∈Î_k} A_ij ), (15)
where f : R → R is an arbitrary function that satisfies the following conditions:
• We define the following closed and open intervals:
S̄ ≡ [ min{ inf_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, min_{k=0,1,...,K} b_k }, max{ sup_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, max_{k=0,1,...,K} b_k } ],
S ≡ ( min{ inf_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, min_{k=0,1,...,K} b_k }, max{ sup_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} A_ij, max_{k=0,1,...,K} b_k } ). (16)
f is continuous on S̄ and differentiable on S. The first derivative of f is bounded (i.e., |f′(x)| < ∞) on S.
• f is locally strictly convex: inf_{x∈T} f″(x) > 0, where T ≡ (min_{k=0,1,...,K} b_k, max_{k=0,1,...,K} b_k).
We also define the following criterion function based on the population group-wise mean:
F̄(ĝ) ≡ Σ_{k=0}^K p̂_k f(b̄_k). (17)
Note that we have F̄(g) = Σ_{k=0}^K p_k f(b_k).
Based on these assumptions and notations, the following Theorem 3.1 guarantees the consistency of the maximum generalized profile-likelihood solution ĝ_PL.
Theorem 3.1 (Consistency of submatrix localization based on the generalized profile likelihood). Under the assumptions in Sections 2 and 3, for any bicluster assignment ĝ_PL that achieves the maximum generalized profile likelihood (i.e., ĝ_PL ∈ arg max_{ĝ∈J_ε} F(ĝ)), the confusion matrix C(ĝ_PL) converges in probability to a permutation of a diagonal matrix.
Proof. To prove this theorem, we first show the following Lemmas A.1, 3.1 and 3.2.
Lemma 3.1.
Under the assumptions in Sections 2 and 3, sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| converges in probability to zero in the limit of m → ∞.
Proof. A proof is in Appendix A.
Before the statement of Lemma 3.2, we define P_δ: For a given δ > 0, let P_δ ≡ {C ∈ C_p : max_{(k₀,k₁,k₂): k₁≠k₂} C_{k₁k₀} C_{k₂k₀} < δ}, where C_p is the set of all confusion matrices with the null set of proportions p. If the confusion matrix C is a permutation of a diagonal matrix, then for k₁ ≠ k₂, we always have C_{k₁k₀} C_{k₂k₀} = 0 for all k₀ ∈ {0, 1, . . . , K}. Intuitively, P_δ is a set of confusion matrices which are “almost” permutations of diagonal matrices, where the gap from a permutation of a diagonal matrix is controlled by δ.
Lemma 3.2. Under the assumptions in Sections 2 and 3, for all δ > 0, if C(ĝ) ∉ P_δ,
F̄(ĝ) − F̄(g) ≤ −κ δ C_b², (18)
where κ ≡ inf_{x∈T} f″(x) > 0 and C_b is a constant that satisfies the condition in assumption (ii).
Proof. A proof is in Appendix B.
From Lemma 3.2, for all δ > 0, if C(ĝ) ∉ P_δ,
F(ĝ) − F(g) ≤ sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + F̄(ĝ) − F(g)
= sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + F̄(ĝ) − F̄(g) + F̄(g) − F(g)
≤ sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + sup_{g∈J_ε} |F(g) − F̄(g)| + F̄(ĝ) − F̄(g)
= 2 sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| + F̄(ĝ) − F̄(g)
≤ 2 sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| − κ δ C_b². (19)
From Lemma 3.1, the first term in (19) converges in probability to zero. Therefore, for all δ > 0, if C(ĝ) ∉ P_δ, the probability that F(ĝ) < F(g) holds converges to one. By taking the limit of δ → 0, it follows from the definition of P_δ that the probability that the confusion matrix of the maximum generalized profile-likelihood solution ĝ_PL is a permutation of a diagonal matrix converges to one.
3.2 Approximated algorithm based on simulated annealing
In this section, we develop an approximated algorithm to find the bicluster structure with the maximum generalized profile likelihood based on simulated annealing (SA), since an exact algorithm to find the global optimal solution is intractable.
3.2.1 Naive implementation of SA-based submatrix localization
Let G_K be the set of all bicluster structures with (non-empty) K biclusters, which are disjoint, but not necessarily bi-disjoint (as shown in Figure 1 (b)). In SA, we first define a sequence of temperatures {T_t}_{t=0}^∞, a threshold ε_SA, and the initial state (i.e., bicluster assignment) ĝ⁽⁰⁾ ∈ G_K. For each state g ∈ G_K, we also define a set of its neighbors N(g) ⊆ G_K and a transition probability R(g, g′) ∈ [0, 1] to a given state g′ ∈ G_K. Here, we set R(g, g′) = 0 iff g′ ∉ N(g).
For each step t = 0, 1, 2, . . . , if T_t < ε_SA, we stop the algorithm and output the final state ĝ⁽ᵗ⁾. If T_t ≥ ε_SA, we randomly choose a candidate for the next state g̃ ∈ N(ĝ⁽ᵗ⁾) with probability R(ĝ⁽ᵗ⁾, g̃), and compute the difference of the objective function value ΔF ≡ F(g̃) − F(ĝ⁽ᵗ⁾). If ΔF > 0, we set the next state to ĝ⁽ᵗ⁺¹⁾ = g̃. If ΔF ≤ 0, we set the next state to ĝ⁽ᵗ⁺¹⁾ = g̃ with probability exp(ΔF/T_t), and keep the current state ĝ⁽ᵗ⁺¹⁾ = ĝ⁽ᵗ⁾ with probability 1 − exp(ΔF/T_t).
Specifically, we propose Algorithm 1 as an example of SA for approximately maximizing the generalized profile likelihood F. In Algorithm 1, we define the neighbors N(g) of state g as the set of all possible bicluster assignments that can be obtained by adding/removing one row or column to/from one bicluster in g. As for the transition probability, we define that one of the elements of N(g) is chosen from the uniform distribution on N(g) (i.e., R(g, g′) = 1/|N(g)| for g′ ∈ N(g)).
We can easily check that the above settings satisfy the following irreducibility and weak reversibility:
• Irreducibility: for any pair g, g′ ∈ G_K, there exists some sequence of transitions from g to g′ with non-zero probability.
• Weak reversibility: for any pair g, g′ ∈ G_K and F̃ ∈ R, the following two propositions (P1) and (P2) are mutually equivalent:
– (P1) there exists some sequence of transitions g₀ = g, g₁, . . . , g_p = g′ with non-zero probability that satisfies F(g_t) ≥ F̃ for all t ∈ {0, . . . , p}.
– (P2) there exists some sequence of transitions g₀ = g′, g₁, . . . , g_p = g with non-zero probability that satisfies F(g_t) ≥ F̃ for all t ∈ {0, . . . , p}.
We define that a state g is locally optimal if there is no state g′ ∈ G_K that satisfies the following two conditions simultaneously: F(g′) > F(g), and there exists some sequence of transitions g₀ = g, g₁, . . . , g_p = g′ with non-zero probability that satisfies F(g_t) ≥ F(g) for all t ∈ {1, . . . , p}.
For a locally but not globally optimal solution g, we define its depth as the minimum d that satisfies the following condition: there exists some g′ such that F(g′) > F(g), and there exists some sequence of transitions g_0 = g, g_1, . . . , g_p = g′ with non-zero probability that satisfies F(g_t) ≥ F(g) − d for all t ∈ {0, . . . , p}. By setting the sequence of temperatures to T_t = [max_{g∈G_K} F(g) − min_{g∈G_K} F(g)] / log(t + 2) for all t ≥ 0, the following two conditions are satisfied:

• T_t ≥ T_{t+1} holds for all t ≥ 0, and lim_{t→∞} T_t = 0.
• Σ_{t=0}^{∞} exp(−d*/T_t) = +∞, where d* is the maximum depth over all the locally but not globally optimal solutions.

It has been proven that under the above conditions, the probability that an SA algorithm outputs the globally optimal solution converges to one in the limit of t → ∞ [22].

Although the naive SA algorithm in Section 3.2.1 is tractable compared to the exhaustive search, it still requires too many steps to converge. Therefore, in this subsection, we propose a further approximation of Algorithm 1. The main idea is to first compress an observed data matrix A by row-column clustering, and then apply an SA algorithm to the compressed data matrix.

Algorithm 1  Naive SA algorithm for finding the maximum profile likelihood solution ĝ.
Require:
A cooling schedule of temperature {T_t}_{t=0}^{∞} and a threshold ε_SA.
Ensure: Approximated optimal bicluster assignment ĝ.
1: t ← 0.
2: Randomly generate an initial bicluster assignment ĝ, which is disjoint but not necessarily bi-disjoint (as shown in Figure 1 (b)).
3: while T_t ≥ ε_SA do
4:   Set g̃ ← ĝ and randomly choose an index k_0 from the uniform distribution on {1, . . . , 2K}.
5:   if k_0 ≤ K then
6:     Set bicluster index k ← k_0.
7:     Let I_k and J_k = {j_1, . . . , j_{|J_k|}}, respectively, be the sets of row and column indices in the kth bicluster. We define add and remove lists as follows.
8:     For i ∈ I_k, let I^rem_{ki} be the set of entries in the ith row of the kth bicluster (i.e., {(i, j_1), (i, j_2), . . . , (i, j_{|J_k|})}). We define the remove list as I^rem_k = {I^rem_{ki}}_{i∈I_k}.
9:     Let Ī_k be the set of row indices i that satisfy ∩_{s=1}^{|J_k|} ∩_{k′=1}^{K} [(i, j_s) ∉ I_{k′}]. For i ∈ Ī_k, let I^add_{ki} be the set of entries {(i, j_1), (i, j_2), . . . , (i, j_{|J_k|})}. We define the add list as I^add_k = {I^add_{ki}}_{i∈Ī_k}.
10:    Let I_0 be the set of background entries in g̃. Set y_add ← [|Ī_k| ≥ 2] ∪ [(|Ī_k| = 1) ∩ (I_0 ≠ I^add_{ki} for the unique i ∈ Ī_k)], which is a flag of whether or not we can execute the "add" operation. This guarantees that the set of background entries does not become null.
11:    if |I_k| ≥ 1 ∩ y_add = True then
12:      Randomly choose i_0 from the uniform distribution on {1, . . . , |I_k| + |Ī_k|}. If i_0 ≤ |I_k|, remove I^rem_{k i_0} from the kth bicluster and add it to the background in g̃. If i_0 > |I_k|, remove I^add_{k (i_0 − |I_k|)} from the background and add it to the kth bicluster in g̃.
13:    else if |I_k| ≥ 1 then
14:      Randomly choose i_0 from the uniform distribution on {1, . . . , |I_k|}. Remove I^rem_{k i_0} from the kth bicluster and add it to the background in g̃.
15:    else if y_add = True then
16:      Randomly choose i_0 from the uniform distribution on {1, . . . , |Ī_k|}. Remove I^add_{k i_0} from the background and add it to the kth bicluster in g̃.
17:    end if
18:  else
19:    Set bicluster index k ← k_0 − K.
20:    Execute lines 7 to 17, swapping the roles of rows and columns in all the operations.
21:  end if
22:  if F(g̃) − F(ĝ) > 0 then
23:    ĝ ← g̃.
24:  else
25:    With probability exp((F(g̃) − F(ĝ))/T_t), set ĝ ← g̃.
26:  end if
27:  t ← t + 1.
28: end while

Algorithm 2  Approximated SA algorithm for finding the maximum profile likelihood solution ĝ.
Require:
A set of row and column cluster numbers (L_1, L_2) that satisfies L_1 ≥ 2^K and L_2 ≥ 2^K, a cooling schedule of temperature {T_t}_{t=0}^{∞} and a threshold ε_SA.
Ensure: Approximated optimal bicluster assignment ĝ.
1: Apply a clustering algorithm to the rows of observed matrix A with the number of clusters L_1.
2: Apply a clustering algorithm to the columns of observed matrix A with the number of clusters L_2.
3: Let I^comp_{hh′} be the set of entries of matrix A in the hth row cluster and the h′th column cluster, and let I^comp = (I^comp_{hh′})_{1≤h≤L_1, 1≤h′≤L_2}.
4: Based on the clustering result I^comp, define matrices A^comp and M by (20).
5: t ← 0.
6: Randomly generate an initial (compressed) bicluster assignment ĝ^comp, which is disjoint but not necessarily bi-disjoint.
7: Execute lines 3 to 28 in Algorithm 1, replacing A and ĝ with A^comp and ĝ^comp, respectively. As for the objective function value, we can compute it by using (22).
8: Convert the set of results I^comp and ĝ^comp into the bicluster assignment ĝ of the original observed matrix A.

Remark that the null group-wise mean matrix P with K biclusters has at most 2^K distinct rows, depending on whether or not each row includes each kth bicluster (k = 1, . . . , K). Based on this fact, we first apply some clustering method (e.g., hierarchical clustering) to the rows of matrix A, setting the number of clusters to L_1 ∈ ℕ with min{2^K, n} ≤ L_1 ≤ n. By a similar argument, we also perform column clustering with a number of clusters L_2 that satisfies min{2^K, p} ≤ L_2 ≤ p. Then, we define the compressed observed matrix A^comp ∈ ℝ^{L_1×L_2} and the matrix M ∈ ℕ^{L_1×L_2} as follows:

A^comp = (A^comp_{hh′})_{1≤h≤L_1, 1≤h′≤L_2},  A^comp_{hh′} = (1/|I^comp_{hh′}|) Σ_{(i,j)∈I^comp_{hh′}} A_ij,
M = (M_{hh′})_{1≤h≤L_1, 1≤h′≤L_2},  M_{hh′} = |I^comp_{hh′}|,   (20)

where I^comp_{hh′} is the set of entries of matrix A in the hth row cluster and the h′th column cluster.

From now on, we apply an SA algorithm to the compressed observed matrix A^comp. Let ĝ^comp_{hh′} ∈ {0, 1, . . . , K} be the estimated group index of the (h, h′)th entry of matrix A^comp, and let J^comp_k ⊆ {(1, 1), . . . , (L_1, L_2)} be the set of entries in the kth estimated group (k = 0, 1, . . . , K) of matrix A^comp. Note that we have J^comp_k = {(h, h′) : ĝ^comp_{hh′} = k}.

The key insight is that we have

p̂_k = (1/(np)) Σ_{(h,h′)∈J^comp_k} M_{hh′},   (1/|Î_k|) Σ_{(i,j)∈Î_k} A_ij = (1/(np p̂_k)) Σ_{(h,h′)∈J^comp_k} M_{hh′} A^comp_{hh′}.   (21)

Based on the above fact, we can compute the objective function value (i.e., the profile likelihood) from the matrices A^comp and M and the bicluster assignment ĝ^comp = (ĝ^comp_{hh′})_{1≤h≤L_1, 1≤h′≤L_2} by

F(ĝ^comp) ≡ Σ_{k=0}^{K} [ (1/(np)) Σ_{(h,h′)∈J^comp_k} M_{hh′} ] f( (1/(np p̂_k)) Σ_{(h,h′)∈J^comp_k} M_{hh′} A^comp_{hh′} ).   (22)

From these observations, Algorithm 2 provides an approximated solution of Algorithm 1.

Test statistic for determining the number of biclusters
We develop the test statistic T of the proposed test based on the estimated version of the standardized noise matrix Z in (2), given a hypothetical number of biclusters K. We denote the estimated group index of the (i, j)th entry of matrix A by ĝ_ij ∈ {0, 1, . . . , K}, where ĝ_ij = k if the (i, j)th entry is estimated to be a member of the kth bicluster for some k, and ĝ_ij = 0 otherwise (i.e., the (i, j)th entry is estimated to be a member of the background). We define the set of estimated group indices of all the entries as ĝ ≡ (ĝ_ij)_{1≤i≤n, 1≤j≤p}. We also define Î_k ≡ {(i, j) : ĝ_ij = k}, which represents the estimated set of entries in the kth group.

Based on the above notations, the estimated mean, standard deviation and noise matrices P̂, σ̂ and Ẑ are given by

b̂ = (b̂_k)_{0≤k≤K},  b̂_k = (1/|Î_k|) Σ_{(i,j)∈Î_k} A_ij,   P̂ = (P̂_ij)_{1≤i≤n, 1≤j≤p},  P̂_ij = b̂_{ĝ_ij},
ŝ = (ŝ_k)_{0≤k≤K},  ŝ_k = √[ (1/|Î_k|) Σ_{(i,j)∈Î_k} (A_ij − P̂_ij)² ],   σ̂ = (σ̂_ij)_{1≤i≤n, 1≤j≤p},  σ̂_ij = ŝ_{ĝ_ij},   (23)

Ẑ = (Ẑ_ij)_{1≤i≤n, 1≤j≤p},  Ẑ_ij = (A_ij − P̂_ij)/σ̂_ij.   (24)

To construct a statistical test on the number of biclusters K, we use the following result in [40], which shows that the scaled maximum eigenvalue T* of the sample covariance matrix Z⊤Z converges in law to the Tracy–Widom distribution with index 1 (TW₁) in the limit of m → ∞:

T* = (λ₁ − a_TW)/b_TW,   T* ⇝ TW₁ (convergence in law),   (25)

where λ₁ is the maximum eigenvalue of matrix Z⊤Z and

a_TW = (√n + √p)²,   b_TW = (√n + √p)(1/√n + 1/√p)^{1/3}.   (26)

Based on the above fact, we define the test statistic T, which is an estimator of T* in (25), from the maximum eigenvalue λ̂₁ of matrix Ẑ⊤Ẑ:

T = (λ̂₁ − a_TW)/b_TW.
(27)

By using the test statistic T, we define the rejection rule of the proposed test at the significance level α as follows:

Reject the null hypothesis (K = K⁰),  if T ≥ t(α),   (28)

where K⁰ is the true number of biclusters and t(α) is the upper α quantile of the TW₁ distribution. We give theoretical guarantees for the above test in both the null and alternative cases later, in Theorems 4.1 and 4.3, respectively.

Theorem 4.1 (Realizable case). Under the assumptions in Section 2, if K = K⁰,

T ⇝ TW₁ (convergence in law),   (29)

in the limit of m → ∞, where T is defined as in (27).

Proof. To apply the result in [40], we consider the difference between T* and T. By the definitions in (25) and (27), we have

|T − T*| = |λ̂₁ − λ₁| / b_TW.   (30)

From now on, we prove that the right side of (30) can be bounded as |λ₁ − λ̂₁| / b_TW = O_p(m^{−ε}) for some ε > 0, which is given in Lemma 4.3 later. If this bound holds, then from Slutsky's theorem, (29) also holds. To show Lemma 4.3, we first prove the following Lemmas 4.1 and 4.2, which give the lower and upper bounds for the maximum eigenvalue λ̃₁ of matrix Z̃⊤Z̃.

Lemma 4.1.
Under the assumptions in Section 2, if K = K⁰,

λ₁ ≤ λ̃₁ + O_p(m^{1/3−ε}),  for some ε > 0.   (31)

Proof.
From assumption (iv), the background can be divided into H disjoint submatrices, whose minimum row and column sizes are n′_min and p′_min, respectively. Let X^(k) ∈ ℝ^{|I_k|×|J_k|} be the submatrix of a matrix X ∈ ℝ^{n×p} corresponding to the row and column indices (I_k, J_k) of the kth bicluster. To distinguish the indices of the biclusters from those of the background submatrices, let X^(K+h) ∈ ℝ^{|I_{K+h}|×|J_{K+h}|} be the submatrix of X ∈ ℝ^{n×p} corresponding to the row and column indices (I_{K+h}, J_{K+h}) of the hth background submatrix. We define a bicluster-wise constant matrix Q^(k) for each kth bicluster (k = 1, . . . , K) by

Q^(k) ≡ Z^(k) − (s̃_k/s_k) Z̃^(k) = (1/s_k)(P̃^(k) − P^(k)) = [the |I_k| × |J_k| constant matrix whose every entry is (1/|I_k|) Σ_{(i,j)∈I_k} (A_ij − P_ij)/s_k = (1/|I_k|) Σ_{(i,j)∈I_k} Z_ij],   (32)

and a submatrix-wise constant matrix Q^(K+h) for each hth background submatrix (h = 1, . . . , H) by

Q^(K+h) ≡ Z^(K+h) − (s̃₀/s₀) Z̃^(K+h) = (1/s₀)(P̃^(K+h) − P^(K+h)) = [the |I_{K+h}| × |J_{K+h}| constant matrix whose every entry is (1/|I₀|) Σ_{(i,j)∈I₀} (A_ij − P_ij)/s₀ = (1/|I₀|) Σ_{(i,j)∈I₀} Z_ij].   (33)

Based on these matrices, let Z^(k), Z̃^(k), and Q^(k), respectively, denote the n × p matrices whose entries in the kth bicluster (i.e., {(i, j) : i ∈ I_k, j ∈ J_k}) are those of Z^(k), Z̃^(k) and Q^(k), and all of whose other entries are zero. Similarly, let Z^(K+h), Z̃^(K+h), and Q^(K+h), respectively, be the n × p matrices whose entries in the hth background submatrix (i.e., {(i, j) : i ∈ I_{K+h}, j ∈ J_{K+h}}) are those of Z^(K+h), Z̃^(K+h) and Q^(K+h), and all of whose other entries are zero. Figure 3 shows an example of {Z^(k)}, where k = 1, . . . , K + H. Finally, we define matrix Q by Q ≡ Σ_{k=1}^{K+H} Q^(k).

Let v₁ be the normalized eigenvector of matrix Z⊤Z corresponding to the maximum eigenvalue λ₁, as defined in (8). Since the largest singular value √λ̃₁ of matrix Z̃ is equal to the operator norm
Figure 3: Decomposition of the noise matrix Z into the K biclusters {Z^(k)}, k = 1, . . . , K, and the H background submatrices {Z^(K+h)}, h = 1, . . . , H. In the case of this figure, K = 3 and H = 6.

of Z̃, we have

λ̃₁ = (sup_u ‖Z̃u‖/‖u‖)²
  ≥ (‖Z̃v₁‖/‖v₁‖)² = ‖Z̃v₁‖² = ‖Σ_{k=1}^{K+H} Z̃^(k) v₁‖²
  = ‖[Σ_{k=1}^{K} (s_k/s̃_k)(Z^(k) − Q^(k)) v₁] + [Σ_{h=1}^{H} (s₀/s̃₀)(Z^(K+h) − Q^(K+h)) v₁]‖²
  = ‖[Σ_{k=1}^{K+H} (Z^(k) − Q^(k)) v₁] + [Σ_{k=1}^{K+H} (s_k/s̃_k − 1)(Z^(k) − Q^(k)) v₁]‖²
  ≥ [‖Σ_{k=1}^{K+H} (Z^(k) − Q^(k)) v₁‖ − ‖Σ_{k=1}^{K+H} (1 − s_k/s̃_k)(Z^(k) − Q^(k)) v₁‖]²
  = [‖(Z − Q) v₁‖ − ‖Σ_{k=1}^{K+H} (1 − s_k/s̃_k)(Z^(k) − Q^(k)) v₁‖]²
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ ‖Σ_{k=1}^{K+H} (1 − s_k/s̃_k)(Z^(k) − Q^(k)) v₁‖
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| ‖(Z^(k) − Q^(k)) v₁‖]
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| (‖Z^(k) v₁‖ + ‖Q^(k) v₁‖)]
  ≥ ‖(Z − Q) v₁‖² − 2 ‖(Z − Q) v₁‖ [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| (√λ₁^(k) + ‖Q^(k) v₁‖)]
  ≥ λ₁ − 2√λ₁ ‖Q v₁‖ − 2 (√λ₁ + ‖Q v₁‖) [Σ_{k=1}^{K+H} |1 − s_k/s̃_k| (√λ₁^(k) + ‖Q^(k) v₁‖)],   (34)

where λ₁^(k) is the maximum eigenvalue of matrix (Z^(k))⊤Z^(k), which is the same whether Z^(k) denotes the |I_k| × |J_k| submatrix or its zero-padded n × p version. From the third line in (34) onward, we use the notation s_{K+1} = · · · = s_{K+H} = s₀ for simplicity.

From now on, we show the probabilistic orders of ‖Q^(k) v₁‖ and ‖Q v₁‖. The non-zero entries of matrix (Q^(k))⊤Q^(k) are located only in the submatrix {(i, j) : i ∈ J_k, j ∈ J_k}, and all of their values are |I_k| η_k², where

η_k ≡ (1/|I_k|) Σ_{(i,j)∈I_k} Z_ij = O_p(1/√|I_k|).   (35)

Therefore, we have

(Q^(k))⊤Q^(k) v₁ = |I_k| |J_k| η_k² (v₁⊤ u^(k)) u^(k),   (36)

where u^(k) ∈ ℝ^p is the vector whose entries are defined by u^(k)_j = 1/√|J_k| if j ∈ J_k and u^(k)_j = 0 otherwise. Note that this vector satisfies ‖u^(k)‖ = 1. From (36), we have

‖Q^(k) v₁‖ = √[ v₁⊤ (Q^(k))⊤Q^(k) v₁ ] = √[ |I_k| |J_k| η_k² (v₁⊤ u^(k))² ].   (37)

To upper bound the right side of (37), we refer to the following important property of each jth eigenvector v_j of matrix Z⊤Z, which has been proven in [4].

Theorem 4.2 (Delocalization property of an eigenvector of a sample covariance matrix [4]).
Underthe assumptions in Section 2, from Theorem 2.17 in [4], a normalized eigenvector v j of matrix Z (cid:62) Z (i.e., (cid:107) v j (cid:107) = 1 ) has a delocalization property, that is, for any constant vector w that satisfies (cid:107) w (cid:107) = 1 ,for all (cid:15) > , | v (cid:62) j w | = O p (cid:20) p (cid:15) (cid:18) n + 1 n (1 + ˜ m − max { j, ( ˜ m + 1 − j ) } ) (cid:19)(cid:21) , (38) where ˜ m ≡ max { n, p } . From (38), we have | v (cid:62) j w | = O p (cid:16) m − + (cid:15) (cid:17) for all (cid:15) > . (39)Note that (39) holds uniformly in j and w from Theorem 2.20 in [4].Based on the above delocalization property of vector v and (37), we have (cid:107) Q ( k ) v (cid:107) = (cid:115) |I k | O p (cid:18) |I k | (cid:19) O p ( m − (cid:15) ) = O p (cid:16) m − + (cid:15) (cid:17) for all (cid:15) > . (40)As for (cid:107) Q v (cid:107) , we can derive its upper bound by (cid:107) Q v (cid:107) = (cid:107) K + H (cid:88) k =1 Q ( k ) v (cid:107) ≤ K + H (cid:88) k =1 (cid:107) Q ( k ) v (cid:107) = K + H (cid:88) k =1 (cid:113) |I k | η k ( v (cid:62) u ( k ) ) = K + H (cid:88) k =1 (cid:115) |I k | O p (cid:18) |I k | (cid:19) | v (cid:62) u ( k ) | = K + H (cid:88) k =1 O p (1) O p (cid:16) m − + (cid:15) (cid:17) = ( K + H ) O p (cid:16) m − + (cid:15) (cid:17) ( ∵ (39) holds uniformly in w from [4]) . (41)From Lemma C1 in Appendix C, | ˜ s k − s k | = O p (cid:18) √ |I k | (cid:19) holds, which results in (cid:12)(cid:12)(cid:12)(cid:12) − s k ˜ s k (cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:32) (cid:112) |I k | (cid:33) . 
(42)16y substituting (40), (41), (42), and the fact that (cid:113) λ ( k )1 = O p (cid:16) |I k | (cid:17) from (25), into (34), weobtain˜ λ ≥ λ − K + H ) O p (cid:16) m (cid:17) O p (cid:16) m − + (cid:15) (cid:17) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:40) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17) (cid:104) O p (cid:16) |I k | (cid:17) + O p (cid:16) m − + (cid:15) (cid:17)(cid:105)(cid:41) = λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:40) K + H (cid:88) k =1 (cid:104) O p (cid:16) |I k | − (cid:17) + O p (cid:16) |I k | − m − + (cid:15) (cid:17)(cid:105)(cid:41) . (43)By taking (cid:15) < , the lower bound in (43) can be simplified as follows:˜ λ ≥ λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:40) K + H (cid:88) k =1 (cid:104) O p (cid:16) |I k | − (cid:17) + O p (cid:16) |I k | − |I k | − + (cid:15) (cid:17)(cid:105)(cid:41) = λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) (cid:34) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17)(cid:35) ≥ λ − K + H ) O p ( m (cid:15) ) − (cid:104) O p (cid:16) m (cid:17) + ( K + H ) O p (cid:16) m − + (cid:15) (cid:17)(cid:105) ( K + H ) O p (cid:34)(cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) = λ − K + H ) O p ( m (cid:15) ) − K + H ) O p (cid:16) m + (cid:15) (cid:17) O p (cid:34)(cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) = λ − K + H ) O p ( m (cid:15) ) (cid:40) O p (1) + O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35)(cid:41) = λ − K + H ) O p ( m (cid:15) ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) (cid:18) ∵ min k =1 ,...,K + H |I k | ≤ np (cid:19) = λ − K + H ) O p (cid:34) m + (cid:15) (cid:18) min k =1 ,...,K + H |I k | (cid:19) − 
(cid:35) (44)From the assumption that ( K + H ) (min k =1 ,...,K + H |I k | ) − = O (cid:16) m − − (cid:15) (cid:17) for some (cid:15) > (cid:15) < (cid:15) , we have λ ≤ ˜ λ + O p (cid:16) m − ( (cid:15) − (cid:15) ) (cid:17) , (45)which concludes the proof. Lemma 4.2.
Under the assumptions in Section 2, if K = K⁰,

λ̃₁ ≤ λ₁ + O_p(m^{1/3−ε}),  for some ε > 0.   (46)

Proof.
Let ˜ v ( k )1 ∈ R | J k | be a subvector of ˜ v corresponding to the columns of the k th submatrix inobserved matrix A , and let τ k ≡ s k ˜ s k . In (42), we have already shown that | − τ k | = O p (cid:18) √ |I k | (cid:19) . The17aximum eigenvalue ˜ λ of matrix ˜ Z (cid:62) ˜ Z can be upper bounded as follows:˜ λ = (cid:107) ˜ Z ˜ v (cid:107) = (cid:107) K + H (cid:88) k =1 τ k (cid:16) Z ( k ) − Q ( k ) (cid:17) ˜ v (cid:107) ( ∵ (32))= (cid:107) K + H (cid:88) k =1 (cid:104) Z ( k ) + ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) (cid:40) Z + K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105)(cid:41) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2˜ v (cid:62) Z (cid:62) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) ≤ (cid:107) Z ˜ v (cid:107) + 2 (cid:112) λ K + H (cid:88) k =1 | τ k − |(cid:107) Z ( k ) ˜ v (cid:107) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2 (cid:112) λ K + H (cid:88) k =1 | τ k − |(cid:107) Z ( k ) ˜ v ( k )1 (cid:107) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) ≤ (cid:107) Z ˜ v (cid:107) + 2 (cid:112) λ K + H (cid:88) k =1 | τ k − | (cid:113) λ ( k )1 − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2 O p (cid:16) m (cid:17) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) = (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) 
min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:107) K + H (cid:88) k =1 (cid:104) ( τ k − Z ( k ) − τ k Q ( k ) (cid:105) ˜ v (cid:107) ≤ (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:34) (cid:107) K + H (cid:88) k =1 ( τ k − Z ( k ) ˜ v (cid:107) + (cid:107) K + H (cid:88) k =1 τ k Q ( k ) ˜ v (cid:107) (cid:35) ≤ (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − v (cid:62) Z (cid:62) K + H (cid:88) k =1 τ k Q ( k ) ˜ v + (cid:34) K + H (cid:88) k =1 | τ k − |(cid:107) Z ( k ) ˜ v (cid:107) + K + H (cid:88) k =1 τ k (cid:107) Q ( k ) ˜ v (cid:107) (cid:35) ≤ (cid:107) Z ˜ v (cid:107) + 2( K + H ) O p (cid:34) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + (cid:34) K + H (cid:88) k =1 O p (cid:16) |I k | − (cid:17) + K + H (cid:88) k =1 τ k (cid:107) Q ( k ) ˜ v (cid:107) (cid:35) (cid:18) ∵ (cid:107) Z ( k ) ˜ v (cid:107) ≤ (cid:113) λ ( k )1 = O p (cid:16) |I k | (cid:17)(cid:19) = (cid:107) Z ˜ v (cid:107) + O p (cid:104) m C ( K,H ) (cid:105) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + (cid:34) O p (cid:16) C ( K,H ) (cid:17) + K + H (cid:88) k =1 τ k (cid:107) Q ( k ) ˜ v (cid:107) (cid:35) , (47)where we denote C ( K,H ) ≡ ( K + H ) (min k =1 ,...,K + H |I k | ) − .Since Z (cid:62) Z is a symmetric matrix, its eigenvectors { v j } form an orthonormal system, and thus18here exists a unique set of coefficients { c j } such that˜ v = p (cid:88) j =1 c j v j = ˜ v L + ˜ v S , (48)where ˜ v L ≡ t (cid:88) j =1 c j v j , ˜ v S ≡ p (cid:88) j = t +1 c j v j ,λ t ≥ λ − n d , λ t +1 < λ − n d , d = 57 . 
(49)By substituting (48) into the last term in (47) and from the similar discussion as in (36), (cid:107) Q ( k ) ˜ v (cid:107) = ˜ v (cid:62) ( Q ( k ) ) (cid:62) Q ( k ) ˜ v = p (cid:88) j =1 p (cid:88) j (cid:48) =1 c j c j (cid:48) v (cid:62) j ( Q ( k ) ) (cid:62) Q ( k ) v j (cid:48) = p (cid:88) j =1 p (cid:88) j (cid:48) =1 c j c j (cid:48) | I k || J k | η k ( v (cid:62) j u ( k ) )( v (cid:62) j (cid:48) u ( k ) ) = | I k || J k | η k p (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) ≤ | I k || J k | η k (cid:118)(cid:117)(cid:117)(cid:116) p (cid:88) j =1 c j (cid:118)(cid:117)(cid:117)(cid:116) p (cid:88) j =1 ( v (cid:62) j u ( k ) ) = | I k || J k | η k (cid:107) ˜ v (cid:107) p (cid:88) j =1 ( v (cid:62) j u ( k ) ) = | I k || J k | η k p (cid:88) j =1 ( v (cid:62) j u ( k ) ) ≤ | I k || J k | η k p max j =1 ,...,p ( v (cid:62) j u ( k ) ) = |I k | O p (cid:0) |I k | − (cid:1) p O p (cid:0) m − (cid:15) (cid:1) ( ∵ (39) holds uniformly in j and w from [4])= O p (cid:0) m (cid:15) (cid:1) . (50)By combining (47) and (50),˜ λ ≤ (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:104) m C ( K,H ) (cid:105) + (cid:34) O p (cid:16) C ( K,H ) (cid:17) + K + H (cid:88) k =1 τ k O p ( m (cid:15) ) (cid:35) = (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:104) m C ( K,H ) (cid:105) + (cid:34) O p (cid:16) C ( K,H ) (cid:17) + K + H (cid:88) k =1 (cid:16) O p (cid:16) |I k | − (cid:17)(cid:17) O p ( m (cid:15) ) (cid:35) = (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:104) m C ( K,H ) (cid:105) + (cid:104) O p (cid:16) C ( K,H ) (cid:17) + ( K + H ) O p ( m (cid:15) ) (cid:105) = (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v + O p (cid:34) ( K + H ) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) + O p (cid:2) ( K + H ) m (cid:15) (cid:3) . 
(51)As for the third term in (51), from the assumption that ( K + H ) (min k =1 ,...,K + H |I k | ) − = O (cid:16) m − − (cid:15) (cid:17) for some (cid:15) > O p (cid:34) ( K + H ) m (cid:18) min k =1 ,...,K + H |I k | (cid:19) − (cid:35) = O p (cid:16) m − (cid:15) (cid:17) . (52)19ith regard to the fourth term in (51), from the assumption that K + H = O (cid:16) m − (cid:15) (cid:17) for some (cid:15) > (cid:15) < (cid:15) , O p (cid:2) ( K + H ) m (cid:15) (cid:3) = O p (cid:16) m − (cid:15) − (cid:15) ) (cid:17) . (53)An upper bound of the first and second terms in (51) is given by (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = (˜ v L + ˜ v S ) (cid:62) Z (cid:62) Z (˜ v L + ˜ v S ) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = (cid:0) ˜ v L (cid:1) (cid:62) Z (cid:62) Z ˜ v L + (cid:0) ˜ v S (cid:1) (cid:62) Z (cid:62) Z ˜ v S − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j v j (cid:62) t (cid:88) j =1 c j Z (cid:62) Z v j + p (cid:88) j = t +1 c j v j (cid:62) p (cid:88) j = t +1 c j Z (cid:62) Z v j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j v j (cid:62) t (cid:88) j =1 c j λ j v j + p (cid:88) j = t +1 c j v j (cid:62) p (cid:88) j = t +1 c j λ j v j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j λ j (cid:107) v j (cid:107) + p (cid:88) j = t +1 c j λ j (cid:107) v j (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = t (cid:88) j =1 c j λ j + p (cid:88) j = t +1 c j λ j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ t (cid:88) j =1 c j + λ t +1 p (cid:88) j = t +1 c j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ t (cid:88) j =1 c j + ( λ − n d ) p (cid:88) j = t +1 c j − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ( ∵ (49))= λ p (cid:88) j =1 c j − n d p (cid:88) j = t +1 c j − K + H (cid:88) k =1 τ k ˜ v 
(cid:62) Z (cid:62) Q ( k ) ˜ v = λ (cid:107) ˜ v (cid:107) − n d (cid:107) ˜ v S (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v = λ − n d (cid:107) ˜ v S (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v L − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v S . (54)Let u ( k ) ∈ R p be a vector whose entries are defined by u ( k ) j = √ | J k | if j ∈ J k and u ( k ) j = 0otherwise. As for the third term in (54), using the fact that Q ( k ) v j = η k | J k | ( v (cid:62) j u ( k ) ) u ( k ) , for all (cid:15) > − ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v L ≤ | ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v L | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t (cid:88) j =1 c j ˜ v (cid:62) Z (cid:62) Q ( k ) v j (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) η k | J k | ˜ v (cid:62) Z (cid:62) u ( k ) t (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = O p (cid:16) |I k | − (cid:17) | J k | (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) = O p (1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t (cid:88) j =1 c j ( v (cid:62) j u ( k ) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ≤ O p (1) (cid:118)(cid:117)(cid:117)(cid:116) t (cid:88) j =1 c j (cid:118)(cid:117)(cid:117)(cid:116) t (cid:88) j =1 | v (cid:62) j u ( k ) | (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ≤ O p (1) (cid:107) ˜ v (cid:107)√ t O p (cid:16) m − + (cid:15) (cid:17) (cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ( ∵ (39) holds uniformly in j and w from [4])= √ t O p (cid:16) m − + (cid:15) (cid:17) 
(cid:12)(cid:12)(cid:12) ˜ v (cid:62) Z (cid:62) u ( k ) (cid:12)(cid:12)(cid:12) ≤ √ t O p (cid:16) m − + (cid:15) (cid:17) (cid:107) ˜ v (cid:62) Z (cid:62) (cid:107)(cid:107) u ( k ) (cid:107) = √ t O p (cid:16) m − + (cid:15) (cid:17) (cid:107) ˜ v (cid:62) Z (cid:62) (cid:107) √ t O p (cid:16) m − + (cid:15) (cid:17) (cid:112) λ = √ t O p (cid:16) m − + (cid:15) (cid:17) O p (cid:16) m (cid:17) = √ t O p ( m (cid:15) ) . (55)With regard to the fourth term in (54), we have − ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v S ≤ | ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v S | ≤ (cid:107) ˜ v (cid:107)(cid:107) Z (cid:62) Q ( k ) ˜ v S (cid:107) = (cid:107) Z (cid:62) Q ( k ) ˜ v S (cid:107)≤ (cid:107) Z (cid:62) Q ( k ) (cid:107) op (cid:107) ˜ v S (cid:107) ≤ (cid:107) Z (cid:107) op (cid:107) Q ( k ) (cid:107) F (cid:107) ˜ v S (cid:107) = (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) . (56)By substituting (55) and (56) into (54), (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ − n d (cid:107) ˜ v S (cid:107) + 2 K + H (cid:88) k =1 τ k √ t O p ( m (cid:15) ) + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) = λ − n d (cid:107) ˜ v S (cid:107) + 2 K + H (cid:88) k =1 (cid:104) O p (cid:16) |I k | − (cid:17)(cid:105) √ t O p ( m (cid:15) ) + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) = λ − n d (cid:107) ˜ v S (cid:107) + 2 K + H (cid:88) k =1 √ t O p ( m (cid:15) ) + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) = λ − n d (cid:107) ˜ v S (cid:107) + √ t O p [( K + H ) m (cid:15) ] + 2 K + H (cid:88) k =1 τ k (cid:112) λ |I k | | η k | (cid:107) ˜ v S (cid:107) . (57)We denote the j th normalized eigenvalue of matrix Z (cid:62) Z as ν j ≡ n λ j , and define the followingvariables: ν + ≡ (cid:18) (cid:114) pn (cid:19) , ν − ≡ (cid:18) − (cid:114) pn (cid:19) , (cid:15) ≡ ν + − ν . 
(58)

Note that |ε₀| = O_p(φ^C m^{−2/3}) holds for some constant C > 0, where ε₀ ≡ ν₊ − ν₁ as defined in (58) and φ ≡ (log p)^{log log p}, from (4.1) of [40]. Since φ^C = o(m^ε) holds for any ε > 0, we have

|ε₀| = O_p(m^{−2/3+ε}),  for any ε > 0.   (59)

Each jth normalized eigenvalue ν_j follows the Marcenko–Pastur distribution, whose probability density function is given by

q(x) = (1/(2π)) (n/p) √(max{(ν₊ − x)(x − ν₋), 0}) / x.   (60)

From (60), by taking ε < d − 1/3 = 8/21, we have

q(ν₁ − n^{d−1}) = q(ν₊ − n^{d−1} − ε₀) = (√(ν₊ − ν₋) / (2π (p/n) ν₊)) [n^{d−1} + O_p(m^{−2/3+ε})]^{1/2} = O_p(m^{(d−1)/2}).   (61)

From (59) and (61), by setting ε < d − 1/3,

n̄ ≡ ∫_{ν₁−n^{d−1}}^{∞} q(x) dx ≤ |∫_{ν₁−n^{d−1}}^{ν₊} q(x) dx| + |∫_{ν₊}^{∞} q(x) dx| = |∫_{ν₁−n^{d−1}}^{ν₊} q(x) dx| ≤ |ε₀ + n^{d−1}| q(ν₁ − n^{d−1}) = O_p(m^{d−1}) O_p(m^{(d−1)/2}) = O_p(m^{3(d−1)/2}).   (62)

From (3.7) of [40], we have

|n̄ − t/p| = O_p(m^{−1+ε}),  for all ε > 0.   (63)

From (62) and (63), by setting ε < (3d − 1)/2,

t = O_p(m^{(3d−1)/2}).
(64)By substituting (64) into (57) and from the assumption in (49) that d = , for all (cid:15) > (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + (cid:107) ˜ v S (cid:107) (cid:32) (cid:112) λ K + H (cid:88) k =1 τ k (cid:112) |I k | | η k | − n d (cid:107) ˜ v S (cid:107) (cid:33) = λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + (cid:107) ˜ v S (cid:107) (cid:16) (cid:112) λ (cid:36) − n d (cid:107) ˜ v S (cid:107) (cid:17) , (65)where (cid:36) ≡ K + H (cid:88) k =1 τ k (cid:112) |I k | | η k | = K + H (cid:88) k =1 (cid:34) O p (cid:32) (cid:112) |I k | (cid:33)(cid:35) (cid:112) |I k | O p (cid:32) (cid:112) |I k | (cid:33) = O p ( K + H ) . (66)If 2 √ λ (cid:36) − n d (cid:107) ˜ v S (cid:107) ≤
0, from (65), (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) . (67)If 2 √ λ (cid:36) − n d (cid:107) ˜ v S (cid:107) >
0, we have (cid:107) ˜ v S (cid:107) < n − d √ λ (cid:36) . Therefore, the following inequality holds: (cid:107) ˜ v S (cid:107) (cid:16) (cid:112) λ (cid:36) − n d (cid:107) ˜ v S (cid:107) (cid:17) ≤ n − d λ (cid:36) = O p (cid:104) ( K + H ) m (cid:105) (cid:18) ∵ d = 57 and (66) (cid:19) , (68)which results in that (cid:107) Z ˜ v (cid:107) − K + H (cid:88) k =1 τ k ˜ v (cid:62) Z (cid:62) Q ( k ) ˜ v ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + O p (cid:104) ( K + H ) m (cid:105) ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) , for all (cid:15) > . (69)Based on the above discussion, (69) always holds. Therefore, from (51), (52), and (53), for all (cid:15) > λ ≤ λ + O p (cid:104) ( K + H ) m + (cid:15) (cid:105) + O p (cid:16) m − (cid:15) (cid:17) + O p (cid:16) m − (cid:15) − (cid:15) ) (cid:17) . (70)From the assumption that K + H = O (cid:16) m − (cid:15) (cid:17) for some (cid:15) > (cid:15) < (cid:15) , we finallyobtain ˜ λ ≤ λ + O p (cid:16) m − ˜ (cid:15) (cid:17) , for some ˜ (cid:15) > , (71)which concludes the proof. 22 emma 4.3. Under the assumptions in Section 2, if K = K , | λ − ˆ λ | b TW = O p (cid:0) m − (cid:15) (cid:1) , for some (cid:15) > . (72) Proof.
By combining Lemmas 4.1, 4.2, and the definition of $b_{\mathrm{TW}}$ in (26), we have

$\frac{|\lambda_1 - \tilde\lambda_1|}{b_{\mathrm{TW}}} = O_p(m^{-\epsilon}), \text{ for some } \epsilon > 0.$ (73)

We consider the following three events:

• $E^{(1)}_m$ represents the event that $\tilde Z = \hat Z$ holds.
• $E^{(2)}_m$ represents the event that the solution given by the submatrix localization algorithm is correct (i.e., $\hat g = g$).
• $E^{(3)}_{m,C}$ represents the event that $|\lambda_1 - \tilde\lambda_1| / b_{\mathrm{TW}} \le C m^{-\epsilon}$ holds.

The joint probability of events $E^{(1)}_m$ and $E^{(3)}_{m,C}$ can be lower bounded by

$\Pr\left(E^{(1)}_m \cap E^{(3)}_{m,C}\right) \ge \Pr\left(E^{(2)}_m \cap E^{(3)}_{m,C}\right) \ge 1 - \Pr\left[\left(E^{(2)}_m\right)^{\mathrm{C}}\right] - \Pr\left[\left(E^{(3)}_{m,C}\right)^{\mathrm{C}}\right],$ (74)

where $E^{\mathrm{C}}$ is the complement of event $E$. From the consistency assumption (v) in Section 2, if $K = K_0$, the second term on the right side of (74) satisfies $\Pr[(E^{(2)}_m)^{\mathrm{C}}] \to 0$ as $m \to \infty$. As for the third term $\Pr[(E^{(3)}_{m,C})^{\mathrm{C}}]$, we already have (73). By combining these facts, we have

$\forall \epsilon' > 0,\ \exists C > 0,\ M > 0,\ \forall m \ge M,\quad \Pr\left(E^{(1)}_m \cap E^{(3)}_{m,C}\right) \ge 1 - \epsilon',$ (75)

which results in (72).

From Lemma 4.3, we finally obtain

$|T - T^*| = \frac{|\hat\lambda_1 - \lambda_1|}{b_{\mathrm{TW}}} = O_p(m^{-\epsilon}), \text{ for some } \epsilon > 0.$ (76)

By combining this fact with Slutsky's theorem, the convergence of the test statistic $T$ in law to the TW₁ distribution in (29) holds.

Theorem 4.3 (Unrealizable case). Under the assumptions in Section 2, if $K < K_0$,

$T = O_p\left(m^{5/3}\right),$ (77)

and

$T = \Omega_p\left(m^{2/3}\right),$ (78)

where $T$ is defined as in (27).
Figure 4: Definition of matrices P , ¯ P , and ˆ P in an unrealizable case. Proof.
We first prove the upper bound in (77). Let $X^{\mathrm{E}(k)}$ be an $n \times p$ matrix whose entries in the $k$th estimated bicluster (including background) are the same as those of matrix $X$ and all the other entries are zero. Since the Frobenius norm upper bounds the operator norm,

$\|\hat Z\|_{\mathrm{op}} \le \|\hat Z\|_F = \sqrt{\sum_{k=0}^{K}\|\hat Z^{\mathrm{E}(k)}\|_F^2} = \sqrt{\sum_{k=0}^{K}\frac{1}{\hat s_k^2}\|A^{\mathrm{E}(k)} - \hat P^{\mathrm{E}(k)}\|_F^2} = \sqrt{\sum_{k=0}^{K}|\hat I_k|\,\frac{\|A^{\mathrm{E}(k)} - \hat P^{\mathrm{E}(k)}\|_F^2}{\|A^{\mathrm{E}(k)} - \hat P^{\mathrm{E}(k)}\|_F^2}} = \sqrt{\sum_{k=0}^{K}|\hat I_k|} = \sqrt{np}.$ (79)

From assumption (iii), let $n = C_n m$ and $p = C_p m$, where $C_n$ and $C_p$ are positive constants. By definition in (26), we have

$a_{\mathrm{TW}} = (\sqrt{C_n} + \sqrt{C_p})^2\, m, \quad b_{\mathrm{TW}} = (\sqrt{C_n} + \sqrt{C_p})\left(\frac{1}{\sqrt{C_n}} + \frac{1}{\sqrt{C_p}}\right)^{1/3} m^{1/3}.$ (80)

By substituting (79) and (80) into (27), we have

$T = \frac{\|\hat Z\|_{\mathrm{op}}^2 - a_{\mathrm{TW}}}{b_{\mathrm{TW}}} \le \frac{np - a_{\mathrm{TW}}}{b_{\mathrm{TW}}} = \frac{C_n C_p m^2 - (\sqrt{C_n} + \sqrt{C_p})^2 m}{(\sqrt{C_n} + \sqrt{C_p})\left(\frac{1}{\sqrt{C_n}} + \frac{1}{\sqrt{C_p}}\right)^{1/3} m^{1/3}} = O_p\left(m^{5/3}\right),$ (81)

which concludes the proof of (77).

We next show the lower bound in (78). As shown in Figure 4, we define $\bar P$ as the matrix that consists of the estimated bicluster structure and entries of the population means. For instance, if the $(i,j)$th entry of observed matrix $A$ belongs to an estimated bicluster that consists of $n^{(1)}$ entries of the $k_1$th null bicluster and $n^{(2)}$ entries of the $k_2$th null bicluster, its value in matrix $\bar P$ is given by $\bar P_{ij} = (n^{(1)} b_{k_1} + n^{(2)} b_{k_2}) / (n^{(1)} + n^{(2)})$.

From assumption (iv), for all $k \in \{1, \ldots, K_0 + H\}$, the $k$th null submatrix (i.e., bicluster or background submatrix) has a size of at least $n_{\min} \times p_{\min}$. Therefore, for every null group index $k \in \{1, \ldots, K_0 + H\}$, there exists at least one estimated group $h_k \in \{0, 1, \ldots, K\}$ that contains a submatrix of the $k$th null submatrix with a size of at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$ (Figure 4). Since $K < K_0$, there exists at least one pair of null group indices $(k_1, k_2)$, $k_1, k_2 \in \{1, \ldots, K_0 + H\}$, that satisfies $h_{k_1} = h_{k_2}$ and $k_1 \ne k_2$ (i.e., a pair of mutually different null groups that belong to the same group in the estimated bicluster structure). In other words, there exists at least one estimated group $\bar k$ $(= h_{k_1} = h_{k_2})$ that satisfies the following conditions:

• The $\bar k$th estimated group contains two submatrices. We denote the sets of entries in these two submatrices as $I^{(1)}$ and $I^{(2)}$.
• The set of entries $I^{(1)}$ forms a submatrix of the $k_1$th null group. The size of submatrix $I^{(1)}$ is at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$.
• The set of entries $I^{(2)}$ forms a submatrix of the $k_2$th null group. The size of submatrix $I^{(2)}$ is at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$.
• The null groups of $I^{(1)}$ and $I^{(2)}$ are mutually different (i.e., $k_1 \ne k_2$).

The population means of submatrices $I^{(1)}$ and $I^{(2)}$, respectively, are $b_{k_1}$ and $b_{k_2}$. We assume $b_{k_1} > b_{k_2}$ without loss of generality. Let $\bar b$ be the constant value of the submatrices $I^{(1)}$ and $I^{(2)}$ in matrix $\bar P$. Here, we consider the following two patterns:

• If $\bar b \ge (b_{k_1} + b_{k_2})/2$, we have $|\bar b - b_{k_2}| \ge (b_{k_1} - b_{k_2})/2$.
• If $\bar b < (b_{k_1} + b_{k_2})/2$, we have $|\bar b - b_{k_1}| > (b_{k_1} - b_{k_2})/2$.

In both cases, there exists a submatrix $I$ that satisfies the following two conditions (note that $I = I^{(2)}$ in the former case and $I = I^{(1)}$ in the latter case):

• The size of submatrix $I$ is at least $\frac{n_{\min}}{K+1} \times \frac{p_{\min}}{K+1}$.
• Let $b_I$ and $\bar b_I$, respectively, be the constant values of submatrix $I$ in matrices $P$ and $\bar P$. Note that we have $\bar b_I = \bar b_{\bar k}$. From assumption (ii), the following inequality holds:

$|\bar b_I - b_I| \ge \min_{k \ne k'} |b_k - b_{k'}| / 2 \ge C_b / 2.$ (82)

We denote the $k$th null bicluster (including background submatrices) in matrix $X$ as $X^{\mathrm{N}(k)}$ $(k = 1, \ldots, K_0 + H)$. The difference between $\hat b_{\bar k}$ and $\bar b_{\bar k}$ is given by

$|\hat b_{\bar k} - \bar b_{\bar k}| = \frac{1}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}}\left(\hat P_{ij} - \bar P_{ij}\right)\right| = \frac{1}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}}(A_{ij} - P_{ij})\right| \le \frac{\max_{k=0,1,\ldots,K_0} s_k}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}} Z_{ij}\right|.$ (83)

Here, $Z_{ij}$ independently follows a distribution with zero mean and unit variance. From the sub-exponential condition (ii), for any $n_M \in \mathbb{N}$, we have $\mathbb{E}[Z_{ij}^{n_M}] < \infty$. Let $I_{\mathrm{CLT}}$ be a subset of entries in an $n \times p$ matrix. By defining $C_{M(4)} \equiv \max_{i,j}\mathbb{E}[Z_{ij}^4] < \infty$ and setting $\delta = 2$ in the Lyapunov condition,

$\lim_{|I_{\mathrm{CLT}}|\to\infty} \frac{1}{|I_{\mathrm{CLT}}|^{1+\delta/2}} \sum_{(i,j)\in I_{\mathrm{CLT}}} \mathbb{E}\left[|Z_{ij}|^{2+\delta}\right] \le \lim_{|I_{\mathrm{CLT}}|\to\infty} \frac{C_{M(4)}\, |I_{\mathrm{CLT}}|}{|I_{\mathrm{CLT}}|^2} = \lim_{|I_{\mathrm{CLT}}|\to\infty} \frac{C_{M(4)}}{|I_{\mathrm{CLT}}|} = 0.$ (84)

Therefore, from the Lyapunov variant of the central limit theorem, $\frac{1}{\sqrt{|I_{\mathrm{CLT}}|}}\sum_{(i,j)\in I_{\mathrm{CLT}}} Z_{ij} \rightsquigarrow N(0, 1)$. Since $|\hat I_{\bar k}| \ge |I| \ge n_{\min} p_{\min} / (K+1)^2$ (note that the right side monotonically increases with the matrix size $m$ from assumption (iv)), we have

$|\hat b_{\bar k} - \bar b_{\bar k}| \le \frac{\max_{k=0,1,\ldots,K_0} s_k}{\sqrt{|\hat I_{\bar k}|}}\, O_p(1) \le \left(\max_{k=0,1,\ldots,K_0} s_k\right)\frac{K+1}{\sqrt{n_{\min} p_{\min}}}\, O_p(1) \le O_p\left(\frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\right) \quad (\because K+1 < K_0+H \text{ and assumption (ii)}).$ (85)

From (85), we have

$\Big|\, |b_I - \bar b_I| - |b_I - \hat b_{\bar k}| \,\Big| \le |\hat b_{\bar k} - \bar b_{\bar k}| \le O_p\left(\frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\right).$ (86)

By combining this result with (82),

$\frac{C_b}{2} \le |b_I - \hat b_{\bar k}| + O_p\left(\frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\right).$ (87)

Therefore, from (12) in assumption (iv), we have

$|b_I - \hat b_{\bar k}| = \Omega_p(1).$ (88)

Let $X_I$ be a submatrix of $X$ with the set of entries $I$. Since the operator norm of a submatrix is not larger than that of the original matrix,

$\|\hat Z\|_{\mathrm{op}} \ge \|\hat Z_I\|_{\mathrm{op}} = \frac{1}{\hat s_{\bar k}}\|A_I - \hat P_I\|_{\mathrm{op}} \ge \frac{1}{\hat s_{\bar k}}\Big|\, \|A_I - P_I\|_{\mathrm{op}} - \|P_I - \hat P_I\|_{\mathrm{op}} \,\Big|.$ (89)

Let $\bar k_{\mathrm{N}}$ be the null bicluster index (including background) of submatrix $I$. Note that we have $b_I = b_{\bar k_{\mathrm{N}}}$. As for the first term in (89), from assumption (ii), we have

$\|A_I - P_I\|_{\mathrm{op}} = s_{\bar k_{\mathrm{N}}}\|Z_I\|_{\mathrm{op}} \le s_{\bar k_{\mathrm{N}}}\|Z\|_{\mathrm{op}} \le \left(\max_{k=0,1,\ldots,K_0} s_k\right) O_p(\sqrt{m}) = O_p(\sqrt{m}).$
(90)

In regard to the second term in (89), since all the entries in matrix $(P_I - \hat P_I)$ are $(b_I - \hat b_{\bar k})$ and thus its rank is one, we have

$\|P_I - \hat P_I\|_{\mathrm{op}} = \|P_I - \hat P_I\|_F = \sqrt{|I|}\, |b_I - \hat b_{\bar k}| \ge \frac{\sqrt{n_{\min} p_{\min}}}{K+1}\, |b_I - \hat b_{\bar k}| \ge \frac{\sqrt{n_{\min} p_{\min}}}{K_0+H}\, |b_I - \hat b_{\bar k}| = \Omega_p\left(\frac{\sqrt{n_{\min} p_{\min}}}{K_0+H}\right).$ (91)

To derive the last equation, we used assumption (iv) and (88).

Finally, we can derive an upper bound of $\hat s_{\bar k}$ by

$\hat s_{\bar k} = \frac{1}{\sqrt{|\hat I_{\bar k}|}}\|A^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F \le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\|A^{\mathrm{E}(\bar k)} - P^{\mathrm{E}(\bar k)}\|_F + \|P^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F\right) \le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\|A - P\|_F + \|P^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F\right)$
$= \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\sqrt{\sum_{k=1}^{K_0+H} s_k^2 \|Z^{\mathrm{N}(k)}\|_F^2} + \|P^{\mathrm{E}(\bar k)} - \hat P^{\mathrm{E}(\bar k)}\|_F\right) \le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \sqrt{\frac{1}{|\hat I_{\bar k}|}\sum_{(i,j)\in\hat I_{\bar k}}\left(P_{ij} - \hat b_{\bar k}\right)^2}$
$\le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right| \le \frac{K+1}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right|$
$\le \frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + \max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right|.$ (92)

The second term in (92) can be upper bounded as follows:

$\max_{k=0,1,\ldots,K_0}\left|b_k - \hat b_{\bar k}\right| \le |\hat b_{\bar k}| + \max_{k=0,1,\ldots,K_0}|b_k| = \frac{1}{|\hat I_{\bar k}|}\left|\sum_{(i,j)\in\hat I_{\bar k}}(\sigma_{ij} Z_{ij} + P_{ij})\right| + \max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{1}{|\hat I_{\bar k}|}\sum_{(i,j)\in\hat I_{\bar k}}(\sigma_{ij}|Z_{ij}| + |P_{ij}|) + \max_{k=0,1,\ldots,K_0}|b_k| \le \frac{1}{|\hat I_{\bar k}|}\sum_{(i,j)\in\hat I_{\bar k}}\sigma_{ij}|Z_{ij}| + 2\max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{1}{|\hat I_{\bar k}|}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\sum_{(i,j)\in\hat I_{\bar k}}|Z_{ij}| + 2\max_{k=0,1,\ldots,K_0}|b_k| \le \frac{1}{|\hat I_{\bar k}|}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\sqrt{\sum_{(i,j)\in\hat I_{\bar k}}|Z_{ij}|^2}\,\sqrt{|\hat I_{\bar k}|} + 2\max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{1}{\sqrt{|\hat I_{\bar k}|}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + 2\max_{k=0,1,\ldots,K_0}|b_k| \le \frac{K+1}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + 2\max_{k=0,1,\ldots,K_0}|b_k|$
$\le \frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\left(\max_{k=0,1,\ldots,K_0} s_k\right)\|Z\|_F + 2\max_{k=0,1,\ldots,K_0}|b_k|.$ (93)

Since $\mathbb{E}[Z_{ij}^2] = \mathbb{V}[Z_{ij}] + \mathbb{E}[Z_{ij}]^2 = 1$ and $\mathbb{V}[Z_{ij}^2] = \mathbb{E}[Z_{ij}^4] - \mathbb{E}[Z_{ij}^2]^2 = \mathbb{E}[Z_{ij}^4] - 1 < \infty$ from assumption (ii), from the central limit theorem and Prokhorov's theorem [53], we have

$\frac{1}{\sqrt{np}}\sum_{i=1}^{n}\sum_{j=1}^{p}\left(Z_{ij}^2 - 1\right) = O_p(1) \iff \sum_{i=1}^{n}\sum_{j=1}^{p} Z_{ij}^2 = \|Z\|_F^2 = np + O_p(m),$ (94)

which results in that

$\|Z\|_F = O_p(m).$ (95)

By substituting (93) and (95) into (92), and using assumption (ii),

$\hat s_{\bar k} \le \frac{K_0+H}{\sqrt{n_{\min} p_{\min}}}\, O_p(m) + O(1) = O_p\left[\frac{(K_0+H)\, m}{\sqrt{n_{\min} p_{\min}}}\right].$ (96)

By substituting (90), (91), and (96) into (89), and using (12) in assumption (iv), we finally have

$\|\hat Z\|_{\mathrm{op}} \ge \Omega_p\left[\frac{\sqrt{n_{\min} p_{\min}}}{(K_0+H)\, m}\right]\left|\Omega_p\left(\frac{\sqrt{n_{\min} p_{\min}}}{K_0+H}\right) - O_p(\sqrt{m})\right| = \Omega_p\left[\frac{n_{\min} p_{\min}}{(K_0+H)^2 m}\right]\left|\Omega_p(1) - O_p\left(\frac{(K_0+H)\sqrt{m}}{\sqrt{n_{\min} p_{\min}}}\right)\right|$
$= \Omega_p\left[\frac{n_{\min} p_{\min}}{(K_0+H)^2 m}\right] = \Omega_p\left(m^{1/2+2\epsilon}\right) \iff \|\hat Z\|_{\mathrm{op}}^2 = \Omega_p\left(m^{1+4\epsilon}\right), \text{ for some } \epsilon > 0.$ (97)

By substituting (97) and (80) into (27), we have

$T = \frac{\|\hat Z\|_{\mathrm{op}}^2 - a_{\mathrm{TW}}}{b_{\mathrm{TW}}} = \Omega_p\left(m^{2/3+4\epsilon}\right) \ge \Omega_p\left(m^{2/3}\right),$ (98)

which concludes the proof.

5 Experiments

5.1 Convergence of T in law to the TW₁ distribution in the realizable case

We first checked the asymptotic behavior of the proposed test statistic $T$ in the null case (i.e., $K = K_0$) by using synthetic data matrices, which were generated from the Gaussian, Bernoulli, and Poisson distributions. In this case, from Theorem 4.1, $T$ converges in law to the TW₁ distribution in the limit of $m \to \infty$.

We set the null number of biclusters at $K_0 = 3$ in all the distributional settings, and tried 10 sets of matrix sizes: $(n, p) = (500 \times i, 375 \times i)$, $i = 1, \ldots,$
10. For each distribution, we defined the null set of parameters and the relative entropy function $f$ of the generalized profile likelihood in (15) as follows.¹

• Gaussian case: Each entry in the $k$th group $(k = 0, 1, \ldots, K_0)$ of observed matrix $A$ was generated independently from the Gaussian distribution $N(b_k, s_k^2)$, where

$b = (\,\cdot\;\cdot\;\cdot\;\cdot\,)^\top, \quad s = (\,0.03\;\;0.04\;\;0.06\;\;\cdot\,)^\top.$ (99)

$f(x) \equiv x^2 / 2.$ (100)

• Bernoulli case: Each entry in the $k$th group of observed matrix $A$ was generated independently from the Bernoulli distribution $\mathrm{Bernoulli}(b_k)$, where

$b = (\,\cdot\;\cdot\;\cdot\;\cdot\,)^\top.$ (101)

$f(x) \equiv x \log\left(\max\{x, c\}\right) + (1 - x)\log\left(\max\{1 - x, c\}\right),$ (102)

where $c$ is a small positive constant that keeps the logarithm finite.

• Poisson case: Each entry in the $k$th group of observed matrix $A$ was generated independently from the Poisson distribution $\mathrm{Pois}(b_k)$, where

$b = (\,\cdot\;\cdot\;\cdot\;\cdot\,)^\top.$ (103)

$f(x) \equiv x \log\left(\max\{x, c\}\right) - x.$ (104)

For each combination of the distributional and matrix size settings, we randomly generated 5,
000 data matrices $A$ based on the null (non-bi-disjoint) bicluster structure, which was defined as follows. Let $K_1 \equiv (3K_0 + 4 + K_0 \bmod 2)/2$, $K_2 \equiv (3K_0 + 4 - K_0 \bmod 2)/2$, $n_0 \equiv \lfloor n/K_1 \rfloor$, and $p_0 \equiv \lfloor p/K_2 \rfloor$. For each $k$th bicluster $(k = 1, \ldots, K_0)$, we also define $k_1 \equiv (3k - 2 - k \bmod 2)/2$ and $k_2 \equiv (3k - k \bmod 2)/2$. The set of rows and columns of $A$ that belong to the $k$th bicluster is given by $I_k = \{k_1 n_0 + 1, \ldots, (k_1 + 2) n_0\}$ and $J_k = \{k_2 p_0 + 1, \ldots, (k_2 + 2) p_0\}$, respectively. Figure 5 shows examples of Gaussian, Bernoulli, and Poisson data matrices.

After generating the observed data matrices, we estimated their bicluster structures by the proposed submatrix localization algorithm (i.e., Algorithm 2 in Section 3.2.2). To compress the original data matrix $A$, we applied Ward's hierarchical clustering method [54] to the rows and columns of matrix $A$, with the numbers of clusters $L_1 = \min\{K_1, n\}$ and $L_2 = \min\{K_2, p\}$, respectively. Initial bicluster structures in Algorithm 2 were given as follows: each $k$th bicluster contains a (uniformly randomly chosen) single entry $A_{i_k j_k}$ of $A$, where $(i_k, j_k) \ne (i_{k'}, j_{k'})$ for $k \ne k'$. In Section 3.2.1, we described a sufficient condition on the cooling schedule $\{T_t\}$ for the SA algorithm to converge in probability to the global optimal solution. However, such a setting requires too many iterations to converge. In the experiments, we instead used a geometric cooling schedule $T_t = c^t$ for a fixed constant $c \in (0, 1)$ and for all $t \ge 0$, together with a small convergence threshold $\epsilon_{\mathrm{SA}}$. Since this setting no longer guarantees convergence in probability to the global optimal solution, we applied Algorithm 2 to each observed matrix five times and adopted the best solution, that is, the one achieving the maximum profile

¹These experimental settings of the relative entropy function $f$ follow those in [19].
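As a rough sketch of the Gaussian data generation described above — with hypothetical mean and standard deviation values and hypothetical (overlapping) bicluster positions, since the paper's exact vectors $b$ and $s$ and index sets are given in (99) and the definitions above:

```python
import numpy as np

def generate_gaussian_matrix(n, p, rows, cols, b, s, b0, s0, seed=0):
    """Generate an n x p matrix whose background entries follow N(b0, s0^2)
    and whose k-th bicluster (rows[k] x cols[k]) follows N(b[k], s[k]^2).
    Later biclusters overwrite earlier ones where they overlap."""
    rng = np.random.default_rng(seed)
    A = rng.normal(b0, s0, size=(n, p))
    for k in range(len(rows)):
        I, J = rows[k], cols[k]
        A[np.ix_(I, J)] = rng.normal(b[k], s[k], size=(len(I), len(J)))
    return A

# Hypothetical parameters for K0 = 3 biclusters (not the paper's exact values).
n, p = 500, 375
rows = [range(0, 150), range(100, 250), range(250, 400)]
cols = [range(0, 120), range(100, 220), range(220, 340)]
A = generate_gaussian_matrix(n, p, rows, cols,
                             b=[0.3, 0.5, 0.7], s=[0.03, 0.04, 0.06],
                             b0=0.4, s0=0.05)
```

The Bernoulli and Poisson cases would only swap `rng.normal` for `rng.binomial(1, ...)` or `rng.poisson(...)` with the corresponding group parameters.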
Figure 5: Examples of the observed data matrices. The left, center, and right figures show the Gaussian, Bernoulli, and Poisson cases, respectively.

likelihood in the last step of the algorithm (this procedure was also performed in all the subsequent experiments in Sections 5.2, 5.3, and 5.4). Based on the estimated bicluster structure, we finally applied the proposed statistical test by setting the hypothetical number of biclusters at $K_0$.

Figures 6, 7, and 8, respectively, show the histograms of the proposed test statistic $T$ with different matrix sizes under the Gaussian, Bernoulli, and Poisson distributional settings. Figure 9 shows the empirical tail probabilities of the proposed test statistic $T$ (i.e., the ratios of the trials where $T \ge t(0.1)$, $T \ge t(0.05)$, and $T \ge t(0.01)$, where $t(\alpha)$ is the upper $\alpha$ quantile of the TW₁ distribution) under the three distributional settings. As in the previous study [55], we used the approximated values $t(0.1) \approx 0.45$, $t(0.05) \approx 0.98$, and $t(0.01) \approx 2.02$. To check the convergence of $T$ to the TW₁ distribution, we also applied the Kolmogorov–Smirnov (KS) test [14] to the test statistics $T$ of the 5,000 trials, and plotted the results in Figure 10. The test statistic of the KS test is $D\sqrt{r}$, where $D$ is the maximum absolute difference between the empirical distribution function of $T$ and the cumulative distribution function of the TW₁ distribution, and $r$ is the number of trials (i.e., 5,000 in this case).

From Figures 6, 7, 8, 9 and 10, we see that the proposed test statistic $T$ converges in law to the TW₁ distribution in each distributional setting. In the Bernoulli case, however, the convergence of $T$ in law to the TW₁ distribution is slow compared to the other two cases (i.e., Gaussian and Poisson). To check the cause of this, we also computed $\tilde T$ with the null bicluster structure [i.e., $\tilde T \equiv (\tilde\lambda_1 - a_{\mathrm{TW}}) / b_{\mathrm{TW}}$] and plotted its empirical tail probabilities and the test statistic of the KS test in Figures 11 and 12, respectively. From these figures, we see that the convergence of $\tilde T$ in law to the TW₁ distribution is still slow in the Bernoulli case. Therefore, the slow convergence of $T$ would not be caused by low accuracy of the biclustering algorithm; rather, it would be a problem specific to a Bernoulli random matrix.

5.2 Asymptotic behavior of T in the unrealizable case

Second, we consider unrealizable cases (i.e., $K < K_0$). Specifically, under the assumptions that the total number of biclusters and background submatrices $(K_0 + H)$ is a fixed constant that does not depend on the matrix size and that the minimum row and column sizes of these submatrices ($n_{\min}$ and $p_{\min}$, respectively) satisfy $n_{\min} = \Omega_p(m)$ and $p_{\min} = \Omega_p(m)$, from (97) and (77), we have $T = \Theta_p(m^{5/3})$.

Based on the same procedure as in Section 5.1, we generated Gaussian, Bernoulli and Poisson random matrices with three biclusters (i.e., $K_0 = 3$), estimated their bicluster structures, and computed the test statistics. We used the same settings as in Section 5.1 for (1) the null parameters of the three distributions in (99), (101), and (103), (2) the procedure to generate the observed matrices, and (3) the SA-based submatrix localization algorithm. In this experiment, we tried the following 10 sets of matrix sizes: $(n, p) = (200 \times i,\ \cdot \times i)$, $i = 1, \ldots, 10$. For each combination of the distributional and matrix
Figure 6: Histogram of the proposed test statistic T, which was computed with the estimated bicluster structure (Gaussian case). The titles of the panels show the different matrix sizes, from (n, p) = (500, 375) to (n, p) = (5000, 3750); each panel overlays the Tracy–Widom distribution of order 1 on the histogram of T.
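The null behavior in these histograms can be reproduced numerically: for a pure-noise Gaussian matrix, the largest eigenvalue of $Z^\top Z$ concentrates near the centering constant $(\sqrt{n} + \sqrt{p})^2$, and its standardized version behaves like a TW₁ draw. A quick check of this (our own sketch, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 225  # same 4:3 aspect ratio as in the experiments
Z = rng.standard_normal((n, p))
lam1 = np.linalg.eigvalsh(Z.T @ Z).max()  # largest eigenvalue of Z^T Z

a = (np.sqrt(n) + np.sqrt(p)) ** 2
b = (np.sqrt(n) + np.sqrt(p)) * (1 / np.sqrt(n) + 1 / np.sqrt(p)) ** (1 / 3)
T = (lam1 - a) / b   # standardized statistic; approximately TW_1 under the null

ratio = lam1 / a     # concentration of lam1 around (sqrt(n)+sqrt(p))^2
```

Repeating this over many seeds and histogramming `T` reproduces the shape of the panels above.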
Figure 7: Histogram of the proposed test statistic T, which was computed with the estimated bicluster structure (Bernoulli case). The titles of the panels show the different matrix sizes, from (n, p) = (500, 375) to (n, p) = (5000, 3750).
Figure 8: Histogram of the proposed test statistic T, which was computed with the estimated bicluster structure (Poisson case). The titles of the panels show the different matrix sizes, from (n, p) = (500, 375) to (n, p) = (5000, 3750).
Figure 9: Empirical tail probabilities of the proposed test statistic T under the three distributional settings, computed with the estimated bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions. The horizontal axis shows the row size n of matrix A, and the dashed lines show the three significance levels (α = 0.1, 0.05, 0.01).
Figure 10: Test statistics D√r of the KS test [14], computed with the estimated bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions. Given a significance level α_KS for the KS test, if D√r exceeds the corresponding critical value, the null hypothesis that T follows the TW₁ distribution is rejected.
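The KS statistic $D$ above is the supremum distance between the empirical distribution function of the $r$ computed test statistics and the reference CDF. A minimal sketch (using a standard normal reference CDF purely for illustration, since a Tracy–Widom CDF is not available in the standard library):

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF, stand-in for the TW_1 CDF used in the paper."""
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in np.atleast_1d(x)])

def ks_statistic(samples, cdf):
    """D = sup_x |F_r(x) - F(x)| between the ECDF of `samples` and `cdf`."""
    x = np.sort(np.asarray(samples, dtype=float))
    r = len(x)
    F = cdf(x)
    # The ECDF jumps from (i-1)/r to i/r at x_i; check both sides of each jump.
    d_plus = np.max(np.arange(1, r + 1) / r - F)
    d_minus = np.max(F - np.arange(0, r) / r)
    return max(d_plus, d_minus)

D = ks_statistic([0.0], normal_cdf)   # single observation at 0: D = 0.5
ks_scaled = D * np.sqrt(1)            # the D * sqrt(r) statistic plotted in Figure 10
```

With the 5,000 statistics of a single experimental setting in place of `[0.0]` and the TW₁ CDF in place of `normal_cdf`, `ks_scaled` corresponds to one plotted point.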
Figure 11: Empirical tail probabilities of the proposed test statistic T under the three distributional settings, computed with the null bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions.
Figure 12: Test statistics D√r of the KS test [14], computed with the null bicluster structure. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions.
Figure 13: Mean of the proposed test statistic T divided by n^(5/3) in the unrealizable case over 100 trials. The null number of biclusters was set at K₀ = 3. The left, center, and right figures, respectively, show the results where each entry of observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions. The horizontal axis shows the row size n of the observed matrix. Each plotted line shows a result for a given hypothetical number of biclusters K (K = 0, 1, 2).

size settings, we randomly generated 100 data matrices A, estimated their bicluster structures with K = 0, 1, . . . , K₀ −
1, and checked the average behavior of test statistic $T$.

Figure 13 shows the asymptotic behavior of the mean of the proposed test statistic $T$ divided by $n^{5/3}$ under unrealizable settings. From this figure, we see that $T$ increases in proportion to $m^{5/3}$ in all the distributional settings, as shown in the first paragraph of this section.

5.3 Accuracy in selecting the number of biclusters K

Third, we checked the accuracy of the proposed test in selecting the number of biclusters $K$, by using synthetic Gaussian, Bernoulli and Poisson data matrices that were generated by the same procedure as in Section 5.1. We set the null number of biclusters at $K_0 = 3$. As for the null mean parameters of the three distributions, we tried the ten settings $\{b^{(1)}, \ldots, b^{(10)}\}$, where

$b^{(t)} \equiv \left(1 - \frac{t-1}{10}\right)\left(b - 0.5\,[1 \cdots 1]^\top\right) + 0.5\,[1 \cdots 1]^\top \quad \text{(Gaussian and Bernoulli cases)},$ (105)

$b^{(t)} \equiv \left(1 - \frac{t-1}{10}\right)\left(b - 5\,[1 \cdots 1]^\top\right) + 5\,[1 \cdots 1]^\top \quad \text{(Poisson case)},$ (106)

for all $t = 1, \ldots,$
10. In the above settings, we set b at the same vector as given in (99), (101), and (103)for each distributional setting. In regard to the standard deviation parameter s , we used the samesetting as in (99). Aside from these model parameters, we used the same settings as in Section 5.1for (1) the procedure to generate the observed matrices, and (2) the SA-based submatrix localizationalgorithm. Figures 14, 15, and 16, respectively, show the examples of generated data matrices inGaussian, Bernoulli, and Poisson cases. In this experiment, we tried the following 10 sets of matrixsizes: ( n, p ) = (40 × i, × i ), i = 1 , . . . ,
10. For each combination of the distributional and matrixsize settings, we randomly generated 1 ,
000 data matrices A and applied the proposed sequential testwith significance level of α = 0 .
01 and the hypothetical number of biclusters K = 0 , , , . . . .Figure 17 shows the accuracy of the proposed test, that is, the ratio of trials where the selectednumber of biclusters ˆ K is equal to the null one K . From Figure 17, we see that the proposed testachieved higher accuracy with the larger matrix sizes and with the smaller differences between thegroup-wise means. This result is consistent with our intuition, since with the larger matrix sizes andwith the smaller differences between the elements in mean vector, it gets more difficult to correctlyestimate the underlying bicluster structure of matrix A , based on which we compute the test statistic T . 32 = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 t = 9 t = 10 Figure 14: Examples of the observed data matrices for t = 1 , . . . ,
10 (
Gaussian case ). The coloredboxes show the null bicluster structure. t = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 t = 9 t = 10
Figure 15: Examples of the observed data matrices for t = 1, …, 10 (Bernoulli case).

Figure 16: Examples of the observed data matrices for t = 1, …, 10 (Poisson case).
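For reference, a planted-bicluster Gaussian matrix of the kind shown in Figures 14 to 16 can be generated as in the following minimal sketch; the block positions and means here are illustrative assumptions, not the exact Section 5.1 settings.

```python
import random

def planted_bicluster_matrix(n, p, blocks, b0=0.2, s=0.1):
    """Gaussian matrix with planted biclusters.

    `blocks` is a list of (rows, cols, mean) triples: each bicluster is the
    submatrix rows x cols filled with i.i.d. N(mean, s^2) entries; background
    entries are i.i.d. N(b0, s^2).
    """
    A = [[random.gauss(b0, s) for _ in range(p)] for _ in range(n)]
    for rows, cols, mean in blocks:
        for i in rows:
            for j in cols:
                A[i][j] = random.gauss(mean, s)
    return A

# e.g., a 40 x 40 matrix with K = 3 biclusters (hypothetical positions/means):
A = planted_bicluster_matrix(
    40, 40,
    [(range(0, 15), range(0, 12), 0.7),
     (range(10, 25), range(18, 30), 0.45),
     (range(28, 40), range(25, 40), 0.55)])
```

Shrinking the gap between the block means and the background mean b0 reproduces the increasingly hard settings t = 1, …, 10 above.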
[Figure 17 panels: accuracy of tests (vertical axis, 0.0 to 1.0) versus n (horizontal axis, 100 to 400). The legends list the range of B for each setting t = 1, …, 10: from (0.20, 0.70) down to (0.47, 0.52) in the Gaussian and Bernoulli panels, and from (2.00, 7.00) down to (4.70, 5.20) in the Poisson panel.]
Figure 17: Accuracy of the proposed test in selecting the number of biclusters K, under the 10 different mean parameter settings { b^{(1)}, …, b^{(10)} }. The left, center, and right figures, respectively, show the results where each entry of the observed matrix A is generated from the Gaussian, Bernoulli, and Poisson distributions.

Finally, we applied the proposed test and the conventional LBM-based one [55] to the Divorce Predictors data set [59] from the UCI Machine Learning Repository [17], and compared the results. The rows and columns of the original observed matrix Ǎ ∈ R^{170×54} represent the 170 participants and 54 attributes, respectively, and each (i, j)th entry shows the Divorce Predictors Scale (DPS), which takes a value in {0, 1, …, 4}. According to [58], the original questionnaire was based on the following five-point scale: 0: "Never," 1: "Rarely," 2: "Occasionally," 3: "Often," and 4: "Always," which is used as the score for Attributes 31 to 54. As for Attributes 1 to 30, this scale is reversed (i.e., 0 means "Always" and 4 means "Never"), so that higher values indicate higher divorce risk for all the attributes. Based on the original matrix Ǎ, we defined a binary data matrix A by setting A_ij = 1 if Ǎ_ij ≥ 2 for the i th participant and the j th attribute, and A_ij = 0 otherwise. The upper left of Figure 18 shows the observed data matrix, where the meaning of each attribute index is shown in Table 1.

As for the proposed test, we applied it sequentially as in Section 5.3 with a significance level of α = 0.
01 until some hypothetical number of biclusters was accepted. In the SA algorithm, we used the relative entropy function f in (102) and the cooling schedule of T_t = 0. ^t for all t ≥ 0. For each hypothetical number of biclusters K, we set the threshold at ε_SA = 10^{−K/ ⋅ − ⋅}. Based on these settings, we applied Algorithm 2 30 times and adopted the best solution, i.e., the one that achieved the maximum profile likelihood in the last step. Based on the estimated bicluster structure, we applied the proposed statistical test.

In regard to the conventional LBM-based test, we used the same settings as in [55]. That is, for each hypothetical set of row and column cluster numbers (K, H), we estimated the regular-grid bicluster structure by applying Ward's hierarchical clustering method [54] to the rows and columns of the observed matrix. Based on the estimated row and column cluster assignments, we applied the test in [55] with a significance level of α = 0.
01. We tried multiple combinations of K and H in the following order:

(K, H) = (1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), …,  (107)

until the null hypothesis was accepted.

Based on the above settings, the estimated number of biclusters by the proposed test was 30, while the estimated set of row and column cluster numbers by the conventional test [55] was (14, 46) (i.e., the estimated number of biclusters was 14 × 46 = 644, aside from the background). The upper right and bottom of Figure 18, respectively, show the estimated bicluster structures when the null hypotheses were accepted by the proposed and the previous LBM-based tests. From these results, we see that the proposed test could capture the bicluster structure more flexibly than the previous regular-grid-based one, and thereby accepted a smaller hypothetical number of biclusters.

More specifically, Figure 19 shows each estimated bicluster when the null hypothesis was accepted by the proposed test. Most of the estimated biclusters contained constant values (i.e., 0 or 1), except for Biclusters 0, 3, 18, 19, 23, and 28, and many biclusters consisted of a small number (i.e., one or two) of attributes. Aside from Biclusters 6 and 27, most of the rows (i.e., participants) in each bicluster belonged to the same class (i.e., divorced or married). Each estimated bicluster was composed of some homogeneous sets of rows and columns: for example, Bicluster 9 shows that there exists a (mostly) married group of participants who gave small DPS (i.e., Ǎ_ij ≤
1) to Attributes 24 and 27, both of which are related to knowledge about the stress of the spouse. From Bicluster 16, we also see that there exists a divorced group of participants who gave large DPS (i.e., Ǎ_ij ≥ 2) to many attributes, including those on similarity with the spouse (e.g., Attributes 12 to 20) and awareness about the spouse (e.g., Attributes 21 to 30).
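The binarization of the DPS matrix described above can be sketched as follows; the threshold of 2 matches the small/large DPS split (Ǎ_ij ≤ 1 versus Ǎ_ij ≥ 2) used in this analysis, and the toy input is of course hypothetical rather than the 170 × 54 data set.

```python
def binarize_dps(A_check, threshold=2):
    """Map a DPS matrix (entries in 0..4) to a 0/1 matrix:
    A_ij = 1 iff the original entry is >= threshold."""
    return [[1 if a >= threshold else 0 for a in row] for row in A_check]

# Toy 2 x 3 example:
A = binarize_dps([[0, 3, 2], [4, 1, 0]])
# A == [[0, 1, 1], [1, 0, 0]]
```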
We discuss both the theoretical and practical perspectives of the two main contributions of this paper: the statistical test on the number of biclusters K and the consistent submatrix localization algorithm for estimating the bicluster structure of a given observed matrix.

In regard to the proposed statistical test on K, we derived the asymptotic behavior of the proposed test statistic T in both the null and alternative cases, where the null number of biclusters may increase with the matrix size (under the condition given in (iv)). As with the previous regular-grid-based test [55], it is an important future work to reveal the non-asymptotic property of the test statistic T, that is, its convergence rate to the Tracy-Widom distribution. To solve this problem, we need to derive the behavior of T in the case that the submatrix localization algorithm does not output the correct bicluster structure, which requires a more careful analysis. From a practical point of view, some studies [2, 31] have shown the effectiveness of analyzing a gene expression data matrix by assuming the existence of order-preserving biclusters, in which a set of rows (i.e., genes) shows a similar linear ordering over the columns (i.e., conditions). Such a definition of homogeneousness differs from ours, where we assume that the entries in each bicluster are generated i.i.d. In addition, some practical relational data matrices (e.g., the MovieLens [25] and Jester [21] data sets) contain missing entries. It is an important topic for future research to construct a statistical test on K for such cases and derive its theoretical guarantee.

As for the biclustering algorithm, the proposed test requires it to be consistent, that is, the probability that it outputs the correct bicluster structure must converge to one in the limit of matrix size m → ∞. To guarantee such consistency, we developed a profile-likelihood-based submatrix localization algorithm, and an approximated version of it based on a compressed observed matrix.
Although the approximated version (i.e., Algorithm 2) needs fewer steps until convergence than the original Algorithm 1, there is some room for improvement. First, if the compression of the observed matrix has not been done successfully, the latter part of the SA algorithm can no longer output the correct bicluster structure. When using a probabilistic clustering method (e.g., k-means clustering) for compression, this problem could be alleviated by running the clustering multiple times, running the SA algorithm with each compressed observed matrix, and outputting the solution with the maximum profile likelihood. Second, the method cannot compress the observed matrix at all when K ≥ log(max{n, p}) / log 2. In such cases, we need another method (e.g., changing the definition of the neighbors N(g) of the current state g in the SA algorithm) for reducing the computational complexity.

In this paper, we proposed a new statistical test on the number of biclusters in a given relational data matrix, and showed the asymptotic behavior of the proposed test statistic T in both the null and alternative cases. Unlike the previous study [55], we can apply the proposed method when the underlying bicluster structure is not necessarily represented by a regular grid. This has been made possible by developing a consistent submatrix localization algorithm for such bicluster structures, that is, one for which the probability of outputting the correct bicluster structure converges to one. Moreover, to alleviate the slow convergence of such an algorithm, we proposed a heuristic approximation method using a compressed observed matrix. Based on a bicluster structure estimated by such a consistent algorithm, we can compute the
[Figure 18 panels: "Sorted data matrix," "Estimated bicluster structure (proposed), α = 0.01," and "Estimated bicluster structure (LBM-based), α = 0.01," each plotted as participants × attributes; row-class legend: divorced / married.]
Figure 18: The sorted observed data matrix of the Divorce Predictors data set [59] (upper left) and its estimated bicluster structures when the null hypotheses were accepted by the proposed test (upper right) and the previous one (bottom). In the upper left and bottom figures, the black and white elements show one and zero, respectively. In the upper right figure, the sorting orders of the rows and columns are the same as in the upper left figure, and the color of each element shows its group index (the white elements were estimated as background), regardless of its value. In the bottom figure, the blue lines show the regular-grid bicluster structure (note that in this figure, the sorting orders of the rows and columns are different from those of the upper left figure). The meaning of each attribute index is shown in Table 1.
[Figure 19 panels: Biclusters 0 through 29, each plotted as participants × attributes.]
Figure 19: Estimated biclusters of the observed matrix when the proposed test accepted the null hypothesis. The font color of each title corresponds to the bicluster color in Figure 18, and the color on the left side of each row shows the class of the participant in the bicluster, as in the legend of Figure 18.

Table 1: Attributes of the Divorce Predictors data set [59].

test statistic T and perform the test on a given number of biclusters K. By sequentially testing K in ascending order, we can select an appropriate number of biclusters in a given observed matrix. We experimentally showed the asymptotic behavior of the proposed test statistic T and its accuracy in selecting the correct number of biclusters with synthetic data matrices, and we also analyzed the test result with a practical data set.

Acknowledgments
TS was partially supported by JSPS KAKENHI (18K19793, 18H03201, and 20H00576), Japan Digital Design, and JST CREST.

Appendix A Proof of Lemma 3.1
Let K_ε be the family of all the biclusters and backgrounds whose proportion is ε or more. Since there are 2^{n+p} combinations of biclusters (without any size constraint), from assumption (iv), we have

|K_ε| ≤ 2^{n+p} + 2^{K(n+p)} ≤ O[exp(m)].  (108)

To prove Lemma 3.1, we first show the following Lemma A.1.

Lemma A.1.
Under the assumptions in Sections 2 and 3, for any t > 0,

(K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t ] → 0,  (109)

in the limit of m → ∞.

Proof. Let Y_ij ≡ Z_ij I(|Z_ij| ≤ ε√|Î|) for a given ε > 0. By definition, we have

|E[Y_ij]| = |E[Z_ij; |Z_ij| ≤ ε√|Î|]| = |E[Z_ij] − E[Z_ij; |Z_ij| > ε√|Î|]| = |E[Z_ij; |Z_ij| > ε√|Î|]|  (∵ E[Z_ij] = 0)
≤ E[|Z_ij|; |Z_ij| > ε√|Î|] ≤ E[(|Z_ij| / (ε√|Î|)) |Z_ij|; |Z_ij| > ε√|Î|] = (1/(ε√|Î|)) E[Z_ij²; |Z_ij| > ε√|Î|] ≤ (1/(ε√|Î|)) E[Z_ij²] = 1/(ε√|Î|),  (110)

which results in that, for any ε > 0,

sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} E[Y_ij]| ≤ sup_{Î∈K_ε} (1/|Î|) Σ_{(i,j)∈Î} |E[Y_ij]| ≤ sup_{Î∈K_ε} 1/(ε√|Î|) ≤ 1/(ε√(npε)) → 0,  (111)

in the limit of m → ∞ (this holds from the assumption (b) that npε = Ω(m)). Therefore, for an arbitrary t > 0, there exists some (sufficiently large) m^{(1)}_{t,ε} ∈ ℕ such that for n, p ≥ m^{(1)}_{t,ε}, sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} E[Y_ij]| < t/2.

By definition, we also have |Y_ij| ≤ ε√|Î|, and thus |Y_ij − E[Y_ij]| ≤ |Y_ij| + E[|Y_ij|] ≤ 2ε√|Î| holds almost surely. From Bernstein's inequality, for n, p ≥ m^{(1)}_{t,ε},

Pr[Σ_{(i,j)∈Î} Y_ij > t|Î|] = Pr[Σ_{(i,j)∈Î} (Y_ij − E[Y_ij]) > |Î| (t − (1/|Î|) Σ_{(i,j)∈Î} E[Y_ij])]
≤ Pr[Σ_{(i,j)∈Î} (Y_ij − E[Y_ij]) > (t/2)|Î|]
≤ exp( − ((t/2)|Î|)² / (2 [Σ_{(i,j)∈Î} E[(Y_ij − E[Y_ij])²] + (2ε√|Î|)(t/2)|Î|/3]) )
≤ exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) )  [∵ (Y_ij − E[Y_ij])² ≤ 4ε²|Î|],  (112)

and

Pr[Σ_{(i,j)∈Î} Y_ij < −t|Î|] = Pr[Σ_{(i,j)∈Î} (Y_ij − E[Y_ij]) < −|Î|t + Σ_{(i,j)∈Î} E[Y_ij]]
≤ Pr[Σ_{(i,j)∈Î} (−Y_ij − E[−Y_ij]) > (t/2)|Î|]
≤ exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) ).  (113)

By combining the results of (112) and (113), we have

Pr[(1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] ≤ 2 exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) ),  (114)

which results in

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] ≤ Σ_{Î∈K_ε} 2 exp( − t² / (32ε² + (8/3) ε t |Î|^{−1/2}) ) ≤ 2 |K_ε| exp( − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ≤ O[ exp( m − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ]  (∵ (108)).  (115)

Therefore, for n, p ≥ m^{(1)}_{t,ε}, we have

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t]
≤ Pr(Y_ij = Z_ij for all (i, j) ∈ Î) Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] + Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij)
≤ Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Y_ij| > t] + Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij)
≤ O[ exp( m − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ] + Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij)  (∵ (115)).  (116)

From here, we will upper bound the second term in (116). From the assumption (b), for any ε >
0, there exists some (sufficiently large) m^{(2)}_ε ∈ ℕ such that for n, p ≥ m^{(2)}_ε, the truncation level ε√|Î| ≥ ε√(npε) is large enough for the tail bound on Z_ij to apply. Hence, for n, p ≥ m^{(2)}_ε,

Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij) ≤ Σ_{(i,j)∈Î} Pr(|Z_ij| > ε√|Î|) ≤ Σ_{(i,j)∈Î} ϑ_{g_ij}^{−1} exp[ −(ε√|Î|)^{ϑ_{g_ij}} ] ≤ C_ϑ^{−1} ε²|Î| exp[ −(ε²|Î|)^{C_ϑ} ],  (117)

where C_ϑ is a fixed constant that does not depend on the matrix size m.

Here, let us consider the function f(x) ≡ x exp(−x^k) for x > 0 and k > 0. Since the first derivative of f is given by f′(x) = (1 − k x^k) exp(−x^k), we have f′(x) ≤ 0 for x ≥ (1/k)^{1/k}. Therefore, from the assumption (b), there exists some (sufficiently large) m^{(3)}_{ε,C_ϑ} ∈ ℕ such that for n, p ≥ m^{(3)}_{ε,C_ϑ},

ε²|Î| exp[ −(ε²|Î|)^{C_ϑ} ] ≤ npε³ exp[ −(npε³)^{C_ϑ} ].  (118)

Therefore, by defining m^{(4)}_{ε,C_ϑ} ≡ max{ m^{(2)}_ε, m^{(3)}_{ε,C_ϑ} }, the following inequality holds for n, p ≥ m^{(4)}_{ε,C_ϑ}:

Pr(∃(i, j) ∈ Î s.t. Y_ij ≠ Z_ij) ≤ C_ϑ^{−1} npε³ exp[ −(npε³)^{C_ϑ} ].  (119)

Let m_{t,ε,C_ϑ} ≡ max{ m^{(1)}_{t,ε}, m^{(4)}_{ε,C_ϑ} }. By substituting (119) into (116), for n, p ≥ m_{t,ε,C_ϑ},

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t] ≤ O[ exp( m − t² / (32ε² + (8/3) ε t (npε)^{−1/2}) ) ] + C_ϑ^{−1} npε³ exp[ −(npε³)^{C_ϑ} ].  (120)

By taking ε = m^{−c} for an appropriately chosen constant c > 0, for example, and using the assumption (b) that npε = O(m) and npε = Ω(m), we have

Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t] ≤ O{ exp( m − t² m^{2c} / 32 ) + m exp[ −m^{C_ϑ} ] }.  (121)

Therefore, from assumption (iv), we have

(K + 1) Pr[sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t] ≤ O{ m exp( m − t² m^{2c} / 32 ) + m exp[ −m^{C_ϑ} ] } → 0,  (122)

in the limit of m → ∞, which concludes the proof.

From now on, we will prove Lemma 3.1 based on the result of Lemma A.1.

Lemma A.2.
Under the assumptions in Sections 2 and 3, sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| converges in probability to zero in the limit of m → ∞.

Proof. Let r ≡ b̂ − b̄, which represents the difference between the sample and population means of each estimated group. Based on this definition, we have

Pr( sup_{ĝ∈J_ε} ‖r‖_∞ > t ) = Pr( sup_{ĝ∈J_ε} max_{k=0,1,…,K} |b̂_k − b̄_k| > t )
≤ Pr[ ( sup_{ĝ∈J_ε} |b̂_0 − b̄_0| > t ) ∪ ⋯ ∪ ( sup_{ĝ∈J_ε} |b̂_K − b̄_K| > t ) ]
≤ Σ_{k=0}^{K} Pr( sup_{ĝ∈J_ε} |b̂_k − b̄_k| > t )
≤ Σ_{k=0}^{K} Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} (A_ij − P_ij)| > t ]
= (K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} (A_ij − P_ij)| > t ]
= (K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} σ_ij Z_ij| > t ]
≤ (K + 1) Pr[ sup_{Î∈K_ε} (1/|Î|) |Σ_{(i,j)∈Î} Z_ij| > t / max_{k=0,1,…,K} S_k ].  (123)

From Lemma A.1 and the assumption (ii), the right side of (123) converges to zero in the limit of m → ∞.

From the assumption that f is continuous on S, differentiable on the interior of S, and that its first derivative is bounded on S, by the mean value theorem,

|F(ĝ) − F̄(ĝ)| = |Σ_{k=0}^{K} p̂_k [f(b̂_k) − f(b̄_k)]| ≤ Σ_{k=0}^{K} p̂_k |f(b̂_k) − f(b̄_k)| ≤ (sup_{x∈S} |f′(x)|) Σ_{k=0}^{K} p̂_k |b̂_k − b̄_k| ≤ (sup_{x∈S} |f′(x)|) Σ_{k=0}^{K} p̂_k (max_{k=0,1,…,K} |b̂_k − b̄_k|) = (sup_{x∈S} |f′(x)|) ‖r‖_∞.  (124)

By combining (123) and (124), sup_{ĝ∈J_ε} |F(ĝ) − F̄(ĝ)| converges in probability to zero.

Appendix B Proof of Lemma 3.2
Lemma B.1.
Under the assumptions in Sections 2 and 3, for all δ > 0, if C ∉ P_δ,

F̄(ĝ) − F̄(g) ≤ −(κδ/4)(C_b)²,  (125)

where κ ≡ inf_{x∈T} f″(x) > 0 and C_b is a constant that satisfies the condition in assumption (ii).

Proof. Let d ≡ (d_k)_{0≤k≤K} with d_k ≡ f(b_k), and let w_k ≡ C_{k k̃} / (C^⊤ u)_{k̃} for each k ∈ {0, 1, …, K}. By definition, w_k ≥ 0 for all k and Σ_{k=0}^{K} w_k = 1. We also define

T ≡ ( min{b_0, b_1, …, b_K, b̄_{k̃}}, max{b_0, b_1, …, b_K, b̄_{k̃}} ) = ( min_{k=0,1,…,K} b_k, max_{k=0,1,…,K} b_k ).  (126)

From Taylor's theorem, there exists some θ ∈ T_k ≡ (min{b_k, b̄_{k̃}}, max{b_k, b̄_{k̃}}) such that

f(b_k) = f(b̄_{k̃}) + f′(b̄_{k̃})(b_k − b̄_{k̃}) + (f″(θ)/2)(b_k − b̄_{k̃})² ≥ f(b̄_{k̃}) + f′(b̄_{k̃})(b_k − b̄_{k̃}) + (inf_{x∈T_k} f″(x)/2)(b_k − b̄_{k̃})².  (127)

From (127), we have

(C^⊤ d)_{k̃} / (C^⊤ u)_{k̃} − f(b̄_{k̃}) = Σ_{k=0}^{K} w_k f(b_k) − f(b̄_{k̃}) ≥ Σ_{k=0}^{K} w_k [ f′(b̄_{k̃})(b_k − b̄_{k̃}) + (κ/2)(b_k − b̄_{k̃})² ],  (128)

where κ ≡ inf_{x∈T} f″(x) > 0. For the first term on the right side of (128), we have

Σ_{k=0}^{K} w_k f′(b̄_{k̃})(b_k − b̄_{k̃}) = f′(b̄_{k̃}) ( Σ_{k=0}^{K} w_k b_k − b̄_{k̃} ) = 0  (∵ definition of b̄_{k̃} in (13)).  (129)

By substituting (129) into (128), we have

(C^⊤ d)_{k̃} / (C^⊤ u)_{k̃} − f(b̄_{k̃}) ≥ (κ/2) Σ_{k=0}^{K} w_k (b_k − b̄_{k̃})².  (130)

By definition, if C ∉ P_δ, there exists a set (k_1, k_2, k̃) that satisfies k_1 ≠ k_2 and C_{k_1 k̃} C_{k_2 k̃} ≥ δ. Therefore, if C ∉ P_δ, we can lower bound the right side of (130) by

(C^⊤ d)_{k̃} / (C^⊤ u)_{k̃} − f(b̄_{k̃}) ≥ (κ/2) Σ_{k=0}^{K} w_k (b_k − b̄_{k̃})²
≥ (κ/2) [ w_{k_1} (b_{k_1} − b̄_{k̃})² + w_{k_2} (b_{k_2} − b̄_{k̃})² ]
= (κ / (2 (C^⊤ u)_{k̃})) [ C_{k_1 k̃} (b_{k_1} − b̄_{k̃})² + C_{k_2 k̃} (b_{k_2} − b̄_{k̃})² ]
≥ (κ C_{k_1 k̃} C_{k_2 k̃} / (2 (C^⊤ u)_{k̃})) [ (b_{k_1} − b̄_{k̃})² + (b_{k_2} − b̄_{k̃})² ]  (∵ 0 < C_{k_1 k̃}, C_{k_2 k̃} ≤ 1)
≥ (κ C_{k_1 k̃} C_{k_2 k̃} / (2 (C^⊤ u)_{k̃})) · (1/2) [ (b_{k_1} − b̄_{k̃}) + (b̄_{k̃} − b_{k_2}) ]²
= (κ C_{k_1 k̃} C_{k_2 k̃} / (4 (C^⊤ u)_{k̃})) (b_{k_1} − b_{k_2})²
≥ (κδ / (4 (C^⊤ u)_{k̃})) (b_{k_1} − b_{k_2})²  (∵ C_{k_1 k̃} C_{k_2 k̃} ≥ δ)
⇔ (C^⊤ u)_{k̃} f(b̄_{k̃}) − (C^⊤ d)_{k̃} ≤ −(κδ/4)(b_{k_1} − b_{k_2})²  [∵ (C^⊤ u)_{k̃} > 0]
⇔ p̂_{k̃} f(b̄_{k̃}) − Σ_{k=0}^{K} C_{k k̃} f(b_k) ≤ −(κδ/4)(b_{k_1} − b_{k_2})²
⇒ p̂_{k̃} f(b̄_{k̃}) − Σ_{k=0}^{K} C_{k k̃} f(b_k) ≤ −(κδ/4)(C_b)²  (∵ assumption (ii)).  (131)

For k′ ≠ k̃, by defining w_k ≡ C_{k k′} / (C^⊤ u)_{k′}, the following inequality holds:

(C^⊤ d)_{k′} / (C^⊤ u)_{k′} − f(b̄_{k′}) ≥ (κ/2) Σ_{k=0}^{K} w_k (b_k − b̄_{k′})² ≥ 0 ⇔ p̂_{k′} f(b̄_{k′}) − Σ_{k=0}^{K} C_{k k′} f(b_k) ≤ 0.  (132)

By combining (131) and (132), we have

Σ_{k′=0}^{K} p̂_{k′} f(b̄_{k′}) − Σ_{k′=0}^{K} Σ_{k=0}^{K} C_{k k′} f(b_k) ≤ −(κδ/4)(C_b)²,  (133)

which is equivalent to the statement of Lemma 3.2.

Appendix C Proof of |s̃_k − s_k| = O_p(1/√|I_k|)

Let A^{(k)}, P^{(k)}, and P̃^{(k)}, respectively, be the k th null bicluster (k = 1, …, K) or background (k = 0) of matrices A, P, and P̃.

Lemma C1.
Under the assumption that the fourth moment of the noise Z_ij is bounded (E[Z_ij⁴] < ∞),

|s̃_k − s_k| = O_p( 1 / √|I_k| ),  (134)

where s̃_k is defined as in (6) and I_k ≡ {(i, j) : g_ij = k} (i.e., the set of entries in the k th group).

Proof. By definition, we have

s̃_k² ≡ (1/|I_k|) Σ_{(i,j)∈I_k} (A^{(k)}_ij − b̃_k)² = (1/|I_k|) Σ_{(i,j)∈I_k} (A^{(k)}_ij − b_k)² − (b_k − b̃_k)²,  (135)

where b̃_k ≡ (1/|I_k|) Σ_{(i,j)∈I_k} A^{(k)}_ij.

From (135), we have

s̃_k² − s_k² = (1/|I_k|) Σ_{(i,j)∈I_k} [ (A^{(k)}_ij − b_k)² − s_k² ] − (b_k − b̃_k)² = (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij − (b_k − b̃_k)²,  (136)

where we used the notation Y^{(k)}_ij ≡ (A^{(k)}_ij − b_k)² − s_k². From the assumption that the entries (A^{(k)}_ij)_{(i,j)∈I_k} are generated i.i.d. in each k th group, the random variables (Y^{(k)}_ij)_{(i,j)∈I_k} are also independent, and their expectations and variances satisfy

E[Y^{(k)}_ij] = E[(A^{(k)}_ij − b_k)²] − s_k² = 0,
V[Y^{(k)}_ij] = E[(Y^{(k)}_ij)²] = E[ { (A^{(k)}_ij − b_k)² − s_k² }² ] = s_k⁴ ( E[(Z^{(k)}_ij)⁴] − 1 ),  (137)

which results in

E[ (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij ] = 0,  V[ (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij ] = (1/|I_k|) s_k⁴ ( E[(Z^{(k)}_ij)⁴] − 1 ).  (138)

From (138) and Chebyshev's inequality, for all t >
0, we have

Pr[ | (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij | ≥ t √( (1/|I_k|) s_k⁴ ( E[(Z^{(k)}_ij)⁴] − 1 ) ) ] ≤ 1/t²,  (139)

which results in

| (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij | = O_p( 1 / √|I_k| ),  (140)

from the assumption that E[(Z^{(k)}_ij)⁴] < ∞.

As for the second term in (136), we have

b_k − b̃_k = (1/|I_k|) Σ_{(i,j)∈I_k} ( P^{(k)}_ij − A^{(k)}_ij ) = −( s_k / |I_k| ) Σ_{(i,j)∈I_k} Z^{(k)}_ij.  (141)

From (141), we have

E[ (b_k − b̃_k)² ] = (s_k² / |I_k|²) E[ ( Σ_{(i,j)∈I_k} Z^{(k)}_ij )² ] = (s_k² / |I_k|²) V[ Σ_{(i,j)∈I_k} Z^{(k)}_ij ] = s_k² / |I_k|,  (142)

since Z^{(k)}_ij has a unit variance.

From (142) and Markov's inequality, we have

∀t > 0,  Pr[ (b_k − b̃_k)² ≥ t ] ≤ s_k² / (|I_k| t)  ⇔  ∀t′ > 0,  Pr[ (b_k − b̃_k)² ≥ (s_k² / |I_k|) t′ ] ≤ 1/t′,  (143)

which results in

(b_k − b̃_k)² = O_p( 1 / |I_k| ).  (144)

Using (140), (144), and (136), we finally obtain

|s̃_k² − s_k²| ≤ | (1/|I_k|) Σ_{(i,j)∈I_k} Y^{(k)}_ij | + (b_k − b̃_k)² = O_p( 1 / √|I_k| ),  (145)

which results in

|s̃_k − s_k| = |s̃_k² − s_k²| / |s̃_k + s_k|.  (146)

From (145), we see that s̃_k converges in probability to s_k, and thus |s̃_k + s_k| converges in probability to 2 s_k >
0. Therefore, we have

|s̃_k − s_k| = O_p( 1 / √|I_k| ),  (147)

which concludes the proof.

References

[1] S. Balakrishnan, M. Kolar, A. Rinaldo, A. Singh, and L. Wasserman. Statistical and computational tradeoffs in biclustering. In NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning, 2011.
[2] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini. Discovering local structure in gene expression data: the order-preserving submatrix problem. In
Proceedings of the Sixth Annual International Conference on Computational Biology, pages 49–57, 2002.
[3] P. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):253–273, 2016.
[4] A. Bloemendal, A. Knowles, H.-T. Yau, and J. Yin. On the principal components of sample covariance matrices. Probability Theory and Related Fields, 164:459–552, 2016.
[5] M. Brennan, G. Bresler, and W. Huleihel. Reducibility and computational lower bounds for problems with planted sparse structure. In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 48–166, 2018.
[6] M. Brennan, G. Bresler, and W. Huleihel. Universality of computational lower bounds for submatrix detection. In Proceedings of the 32nd Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 417–468, 2019.
[7] C. Butucea and Y. I. Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix.
Bernoulli, 19(5B):2652–2688, 2013.
[8] C. Butucea, Y. I. Ingster, and I. A. Suslina. Sharp variable selection of a sparse submatrix in a high-dimensional noisy matrix. ESAIM: Probability and Statistics, 19:115–134, 2015.
[9] T. T. Cai, T. Liang, and A. Rakhlin. Computational and statistical boundaries for submatrix localization in a large noisy matrix. Annals of Statistics, 45(4):1403–1430, 2017.
[10] T. T. Cai and Y. Wu. Statistical and computational limits for sparse matrix detection. Annals of Statistics, 48(3):1593–1614, 2020.
[11] T. Chekouo and A. Murua. The penalized biclustering model and related algorithms. Journal of Applied Statistics, 42(6):1255–1277, 2015.
[12] T. Chekouo, A. Murua, and W. Raffelsberger. The Gibbs-plaid biclustering model. Annals of Applied Statistics, 9(3):1643–1670, 2015.
[13] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. Journal of Machine Learning Research, 17(27):1–57, 2016.
[14] W. J. Conover. Practical Nonparametric Statistics. John Wiley & Sons, New York, 1999.
[15] M. Corneli, P. Latouche, and F. Rossi. Exact ICL maximization in a non-stationary time extension of the latent block model for dynamic networks. In Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 225–230, 2015.
[16] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269–274, 2001.
[17] D. Dua and C. Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017. University of California, Irvine, School of Information and Computer Sciences.
[18] D. E. Duffy and A. J. Quiroz. A permutation-based algorithm for block clustering.
Journal of Classification, 8:65–91, 1991.
[19] C. J. Flynn and P. O. Perry. Profile likelihood biclustering. Electronic Journal of Statistics, 14(1):731–768, 2020.
[20] F. O. D. França. Scalable overlapping co-clustering of word-document data. In , pages 464–467, 2012.
[21] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
[22] B. Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2):311–329, 1988.
[23] B. Hajek, Y. Wu, and J. Xu. Information limits for recovering a hidden community. IEEE Transactions on Information Theory, 63(8):4729–4745, 2017.
[24] B. Hajek, Y. Wu, and J. Xu. Submatrix localization via message passing. Journal of Machine Learning Research, 18(186):1–52, 2018.
[25] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):1–19, 2015.
[26] J. A. Hartigan. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123–129, 1972.
[27] S. Hochreiter, U. Bodenhofer, M. Heusel, A. Mayr, A. Mitterecker, A. Kasim, T. Khamiakova, S. V. Sanden, D. Lin, W. Talloen, L. Bijnens, H. W. H. Göhlmann, Z. Shkedy, and D.-A. Clevert. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520–1527, 2010.
[28] J. Hu, J. Zhang, H. Qin, T. Yan, and J. Zhu. Using maximum entry-wise deviation to test the goodness of fit for stochastic block models. Journal of the American Statistical Association, 0(0):1–10, 2020.
[29] M. Kolar, S. Balakrishnan, A. Rinaldo, and A. Singh. Minimax localization of structural information in large noisy matrices. In
Advances in Neural Information Processing Systems, volume 24, pages 909–917, 2011.
[30] J. Lei. A goodness-of-fit test for stochastic block models. The Annals of Statistics, 44(1):401–424, 2016.
[31] J. Liu, J. Yang, and W. Wang. Biclustering in gene expression data by tendency. In Proceedings of 2004 IEEE Computational Systems Bioinformatics Conference, pages 182–193, 2004.
[32] Y. Liu and J. Guo. Distribution-free, size adaptive submatrix detection with acceleration. arXiv:1804.10887, 2018.
[33] A. Lomet, G. Govaert, and Y. Grandvalet. Model selection in block clustering by the integrated classification likelihood. In Proceedings of 20th International Conference on Computational Statistics, pages 519–530, 2012.
[34] Y. Luo and A. Zhang. Tensor clustering with planted structures: Statistical optimality and computational limits. In , 2020.
[35] Z. Ma and Y. Wu. Computational barriers in minimax submatrix detection. Annals of Statistics, 43(3):1089–1116, 2015.
[36] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24–45, 2004.
[37] G. E. Moran. Bayesian Approaches For Modeling Variation. PhD thesis, University of Pennsylvania, 2019.
[38] S. A. Murphy and A. W. van der Vaart. On profile likelihood. Journal of the American Statistical Association, 95(450):449–465, 2000.
[39] A. Oghabian, S. Kilpinen, S. Hautaniemi, and E. Czeizler. Biclustering methods: Biological relevance and application in gene expression analysis. PLOS ONE, 9(3):e90801, 2014.
[40] N. S. Pillai and J. Yin. Universality of covariance matrices. Annals of Applied Probability, 24(3):935–1001, 2014.
[41] G. Pio, M. Ceci, D. D’Elia, C. Loglisci, and D. Malerba. A novel biclustering algorithm for the discovery of meaningful biological correlations between microRNAs and their target genes.
BMC Bioinformatics, 14(7):S8, 2013.
[42] A. Prelić, S. Bleuler, P. Zimmermann, A. Wille, P. Bühlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122–1129, 2006.
[43] E. Raff, R. Zak, G. L. Munoz, W. Fleming, H. S. Anderson, B. Filar, C. Nicholas, and J. Holt. Automatic Yara rule generation using biclustering. In Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security, pages 71–82, 2020.
[44] Y. Sakai and K. Yamanishi. An NML-based model selection criterion for general relational data modeling. In Proceedings of 2013 IEEE International Conference on Big Data, pages 421–429, 2013.
[45] A. A. Shabalin, V. J. Weigman, C. M. Perou, and A. B. Nobel. Finding large average submatrices in high dimensional data. Annals of Applied Statistics, 3(3):985–1012, 2009.
[46] H. Shan and A. Banerjee. Bayesian co-clustering. In Proceedings of the 8th IEEE International Conference on Data Mining, pages 530–539, 2008.
[47] M. Sill, S. Kaiser, A. Benner, and A. Kopp-Schneider. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics, 27(15):2089–2097, 2011.
[48] P. Symeonidis, A. Nanopoulos, A. Papadopoulos, and Y. Manolopoulos. Nearest-biclusters collaborative filtering with constant values. In Advances in Web Mining and Web Usage Analysis, WebKDD 2006, Lecture Notes in Computer Science, volume 4811, pages 36–55, 2007.
[49] A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18(suppl 1):S136–S144, 2002.
[50] M. Tepper and G. Sapiro. Fast L1-NMF for multiple parametric model estimation. arXiv:1610.05712, 2016.
[51] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and P. Brown. Clustering methods for the analysis of DNA microarray data. Technical report, Department of Health Research and Policy, Department of Statistics, Department of Genetics and Department of Biochemistry, Stanford University, 1999.
[52] C. A. Tracy and H. Widom. The distributions of random matrix theory and their applications. In
New Trends in Mathematical Physics, pages 753–765. Springer, 2009.
[53] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[54] J. H. Ward, Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
[55] C. Watanabe and T. Suzuki. Goodness-of-fit test for latent block models. Computational Statistics & Data Analysis, 154:107090, 2021.
[56] J. Wyse, N. Friel, and P. Latouche. Inferring structure in bipartite networks using the latent blockmodel and exact ICL. Network Science, 5(1):45–69, 2017.
[57] K. Yamanishi, T. Wu, S. Sugawara, and M. Okada. The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models. Data Mining and Knowledge Discovery, 33:1017–1058, 2019.
[58] M. K. Yöntem. The predictive role of the styles of parenthood origin on divorce predictors. PhD thesis, Gaziosmanpasa University, 2017.
[59] M. K. Yöntem, K. Adem, T. İlhan, and S. Kılıçarslan. Divorce prediction using correlation-based feature selection and artificial neural networks.
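The $O_p(1/\sqrt{|\mathcal{I}_k|})$ rate for $|\tilde{s}_k - s_k|$ established in (147) is easy to check numerically. The sketch below is a minimal simulation under simplifying assumptions, not the paper's full procedure: it draws the entries of a single homogeneous bicluster as $A_{ij} = b_k + s_k Z_{ij}$ with Gaussian noise and uses simple plug-in estimators for the bicluster mean and noise scale (the names `estimate_scale_error`, `b_k`, and `s_k`, and the chosen values, are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
b_k, s_k = 1.5, 0.7  # true bicluster mean and noise scale (illustrative values)

def estimate_scale_error(n_entries, n_trials=200):
    """Average error |s_tilde - s_k| of a plug-in noise-scale estimator
    over n_trials simulated biclusters with n_entries entries each."""
    errors = []
    for _ in range(n_trials):
        # Entries of one homogeneous bicluster: A_ij = b_k + s_k * Z_ij
        a = b_k + s_k * rng.standard_normal(n_entries)
        b_tilde = a.mean()                               # plug-in mean estimate
        s_tilde = np.sqrt(np.mean((a - b_tilde) ** 2))   # plug-in scale estimate
        errors.append(abs(s_tilde - s_k))
    return float(np.mean(errors))

# The average error should shrink roughly like 1 / sqrt(|I_k|): quadrupling
# the number of entries should roughly halve it.
for m in [100, 400, 1600, 6400]:
    print(m, estimate_scale_error(m))
```

Under these assumptions, the printed errors decrease by roughly a factor of two for each fourfold increase in the number of entries, consistent with the $1/\sqrt{|\mathcal{I}_k|}$ rate.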