Average-Case Lower Bounds for Learning Sparse Mixtures, Robust Estimation and Semirandom Adversaries
Matthew Brennan ∗ Guy Bresler † May 20, 2020
Abstract
This paper develops several average-case reduction techniques to show new hardness results for three central high-dimensional statistics problems, implying a statistical-computational gap induced by robustness, a detection-recovery gap and a universality principle for these gaps. A main feature of our approach is to map to these problems via a common intermediate problem that we introduce, which we call Imbalanced Sparse Gaussian Mixtures. We assume the planted clique conjecture for a version of the planted clique problem where the position of the planted clique is mildly constrained, and from this obtain the following computational lower bounds that are tight against efficient algorithms:

• Robust Sparse Mean Estimation: Estimating an unknown k-sparse mean of n samples from a d-dimensional Gaussian is a gapless problem, with the truncated empirical mean achieving the optimal sample complexity of n = Θ̃(k). However, if an ε-fraction of these samples are corrupted, the best known polynomial time algorithms require n = Ω̃(k²) samples. We give a reduction showing that this sample complexity is necessary, providing the first average-case complexity evidence for a conjecture of [Li17, BDLS17]. Our reduction also shows that this statistical-computational gap persists even for algorithms estimating within a suboptimal ℓ₂ rate of Õ(√ε), rather than the minimax rate of O(ε).

• Semirandom Community Recovery: The problem of finding a k-subgraph with elevated edge density p within an Erdős–Rényi graph on n vertices with edge density q = Ω(p) is a canonical problem believed to have different computational limits for detection and for recovery. The conjectured detection threshold has been established through reductions from planted clique, but the recovery threshold has remained elusive. We give a reduction showing that the detection and recovery thresholds coincide when the graph is perturbed by a semirandom adversary, in the case where q is constant. This yields the first average-case evidence towards the recovery conjectures of [CX16, HWX15, BBH18, BBH19].

• Universality of k-to-k² Gaps in Sparse Mixtures: Extending the techniques for our other two reductions, we also show a universality principle for computational lower bounds at n = Ω̃(k²) samples for learning k-sparse mixtures, under mild conditions on the likelihood ratios of the planted marginals. Our result extracts the underlying problem structure leading to a gap, independently of distributional specifics, and is a first step towards explaining the ubiquity of k-to-k² gaps in high-dimensional statistics.

Our reductions produce structured and Gaussianized versions of an input graph problem, and then rotate these high-dimensional Gaussians by matrices carefully constructed from hyperplanes in F_r^t. For our universality result, we introduce a new method to perform an algorithmic change of measure tailored to sparse mixtures. We also provide evidence that the mild promise in our variant of planted clique does not change the complexity of the problem.

∗ Massachusetts Institute of Technology. Department of EECS. Email: [email protected].
† Massachusetts Institute of Technology. Department of EECS. Email: [email protected].

1 Introduction
A primary aim of the field of mathematical statistics is to determine how much data is needed forvarious estimation tasks, and to analyze the performance of practical algorithms. For a century, thefocus has been on information-theoretic limits. However, the study of high-dimensional structuredestimation problems over the last two decades has revealed that the much more relevant quantity– the amount of data needed by computationally efficient algorithms – may be significantly higherthan what is achievable without computational constraints.Because data in real-world problems is not adversarially generated, the mathematical analysisof estimation problems typically assumes a probabilistic model on the data. In computer science,combinatorial problems with random inputs have been studied since the 1970’s [Kar77, Kuˇc77]. Inthe 1980’s, Levin’s theory of average-case complexity [Lev86] crystallized the notion of an average-case reduction and obtained completeness results. Average-case hardness reductions are notoriouslydelicate, requiring that a distribution over instances in a conjecturally hard problem be mappedprecisely to the target distribution, making gadget-based reductions from worst-case complexity in-effective. For this reason, much of the recent work showing hardness for statistical problems showshardness for restricted models of computation (or equivalently, classes of algorithms), such as sta-tistical query algorithms [FGR +
13, FPV15, DKS17, DKS19], sum of squares [BHK +
16, HKP + +
17, Lin92], message-passing [ZK16,LKZ15, LDBB +
16, KMRT +
07, RTSZ19], and others. Another line of work attempts to understandcomputational limitations via properties of the energy landscape of solutions to estimation problems[ACO08, GZ17, BMMN17, BGJ18, RBBC19, CGP +
19, GZ19].This paper develops several average-case reduction techniques to show new hardness resultsfor three central high-dimensional statistics problems. We assume the planted clique conjecturefor a version of the planted clique problem where the position of the planted clique is mildlyconstrained. As discussed below, planted clique is known to be hard for all well-studied restrictedmodels of computation, and we confirm in Section 8 for statistical query algorithms and low-degree polynomial tests that this hardness remains unchanged for our modification. Aside fromthe advantage of being future-proof against new classes of algorithms, showing that a problem ofinterest is hard by reducing from planted clique effectively subsumes hardness in the restrictedmodels and thus gives much stronger evidence for hardness.
We show within the framework of average-case complexity that robustness, in the form of model misspecification or constrained adversaries, introduces statistical-computational gaps in several natural problems. In doing so, we develop a number of new techniques for average-case reductions and arrive at a universality class of problems with statistical-computational gaps.
Robust problemformulations are somewhere between worst-case and average-case, and indeed are strictly harderto solve than the corresponding purely average-case formulations, which is captured in our reduc-tions. The first problem we consider is robust sparse mean estimation, the problem of sparse meanestimation in Huber’s contamination model [Hub92, Hub65]. Lower bounds for robust sparse meanestimation have been shown for statistical query algorithms [DKS17], but to the best of our knowl-edge our reductions are the first average-case hardness results that capture hardness induced byrobustness . The second problem, semirandom community recovery, is the planted dense subgraphproblem under a semirandom (monotone) adversary [FK01, FK00]. It is surprising that reductionsfrom planted clique can capture the shift in hardness threshold due to semirandom adversaries,since the semirandom version of planted clique has the same threshold as the original [FK00].3oth of these robust problems are mapped to via an intermediate problem that we introduce, Im-balanced Sparse Gaussian Mixtures (ISGM). We also develop a new technique to reduce from ISGMto a universality class of sparse mixtures, which shows that the k to k statistical-computationalgaps in many statistical estimation problems are not coincidental and in particular do not dependon specifics of the distributions involved. The fact that all of our reductions go through ISGMunifies the techniques associated to each of the reductions and illustrates the utility of construct-ing reductions that use natural problems as intermediates, a perspective espoused in [BBH18].By constructing precise distributional maps preserving the canonical simple-versus-simple problemformulations, the maps can be composed, which greatly facilitates reducing to new problems.This work is closely related to a recent line of research showing lower bounds for average-caseproblems in computer science and statistics based on the planted clique ( pc ) conjecture. The pc conjecture asserts that there is no polynomial time algorithm to detect a clique of size k = o ( √ n )randomly planted in the Erd˝os-R´enyi graph G ( n, / pc conjecture has been used to showlower bounds for testing k -wise independence [AAK + pc up to k = o ( √ n ) under a mild promise on thevertex set of its hidden clique. Specifically, we assume that we are given a partition of its vertexset into k sets of equal size with the guarantee that the hidden clique has exactly one vertexin each set. This mild promise does not asymptotically affect any reasonable notion of entropyof the distribution over the hidden vertex set, scaling it by 1 − o (1). In Section 8, we provideevidence that this k -partite promise does not affect the complexity of planted clique, by showinghardness up to k = o ( √ n ) for low-degree polynomial tests and statistical query algorithms. In thecryptography literature, this form of promise is referred to as a small amount of information leakageabout the secret. The hardness of the Learning With Errors (LWE) problem has been shown to berobust to leakage [GKPV10], and it is left as an interesting open problem to show that a similarstatement holds true for pc . Our main results are that this k -partite pc conjecture ( k -pc ) impliesthe statistical-computational gaps outlined below. Robust Sparse Mean Estimation.
In sparse mean estimation, the observations X_1, X_2, . . . , X_n are n independent samples from N(μ, I_d) where μ is an unknown k-sparse vector in R^d and the task is to estimate μ closely in ℓ₂ norm. This is a gapless problem, with the efficiently computable truncated empirical mean achieving the optimal sample complexity of n = Θ̃(k). However, if an ε-fraction of these samples are adversarially corrupted, the best known polynomial time algorithms require n = Ω̃(k²) samples while n = Θ̃(k) remains the information-theoretically optimal sample complexity. We give a reduction showing that robust sparse mean estimation is k-pc hard if n = õ(k²). This yields the first average-case evidence for this statistical-computational gap, which was originally conjectured in [Li17, BDLS17]. Our reduction also shows that this statistical-computational gap persists even for algorithms estimating within a suboptimal ℓ₂ rate of Õ(√ε), rather than the minimax rate of O(ε). The reductions showing these lower bounds can be found in Sections 4 and 5.
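The truncated empirical mean referred to above admits a very short implementation. The following is a minimal sketch (not from the paper) using a hard top-k truncation of the coordinate-wise empirical mean; the parameter choices and the truncation rule are illustrative assumptions.

```python
import numpy as np

def sparse_mean_estimate(X, k):
    """Hard-threshold the empirical mean to its k largest-magnitude coordinates."""
    mean = X.mean(axis=0)
    keep = np.argsort(-np.abs(mean))[:k]
    est = np.zeros_like(mean)
    est[keep] = mean[keep]
    return est

rng = np.random.default_rng(0)
n, d, k = 2000, 10000, 50
mu = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
mu[support] = 0.5

X = rng.standard_normal((n, d)) + mu           # n i.i.d. samples from N(mu, I_d)
err = np.linalg.norm(sparse_mean_estimate(X, k) - mu)
print(f"l2 error with n = {n}: {err:.3f}")      # on the order of sqrt(k log d / n)
```

Without corruption this estimator illustrates why n = Θ̃(k) samples suffice; the hardness results in this paper concern the corrupted setting, where no comparably simple estimator is known to work with so few samples.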
Semirandom Community Recovery. In the planted dense subgraph model of single community recovery, one observes a graph sampled from G(n, q) with a random subgraph on k vertices replaced with a sample from G(k, p), where p > q are allowed to vary with n and satisfy that p = O(q). Detection and recovery of the hidden community in this model have been studied extensively [ACV14, BI13, VAC15, HWX15, CX16, HWX16, Mon15, CC18] and this model has emerged as a canonical example of a problem with a detection-recovery computational gap. While it is possible to efficiently detect the presence of a hidden subgraph of size k = Ω̃(√n) if (p − q)²/q(1 − q) = Ω̃(n²/k⁴), the best known polynomial time algorithms to recover the subgraph require a higher signal of (p − q)²/q(1 − q) = Ω̃(n/k²).

In each of [HWX15, BBH18] and [BBH19], it has been conjectured that the recovery problem is hard below this threshold of Θ̃(n/k²). This pds Recovery Conjecture was even used in [BBH18] as a hardness assumption to show detection-recovery gaps in other problems including biased sparse PCA and Gaussian biclustering. A line of work has tightly established the conjectured detection threshold through reductions from the pc conjecture [HWX15, BBH18, BBH19], while the recovery threshold has remained elusive. Planted clique maps naturally to the detection threshold in this model, so it seems unlikely that the pc conjecture could also yield lower bounds at the tighter recovery threshold, given that recovery and detection are known to be equivalent for pc [AAK+07].

We show that the k-pc conjecture implies the pds Recovery Conjecture for semirandom community recovery in the regime where q = Θ(1). Specifically, we show that the computational barrier in the detection problem shifts to the recovery threshold when perturbed by a semirandom adversary. This yields the first average-case evidence towards the pds recovery conjecture as stated in [CX16, HWX15, BBH18, BBH19]. The reduction showing this lower bound is in Section 6.

Universality of k-to-k² Gaps in Sparse Mixtures.
Several sparse estimation problems exhibit k-to-k² statistical-computational gaps, such as sparse PCA. By extending our reductions for the problems above and introducing a new gadget performing an algorithmic change of measure, we also show a universality principle for computational lower bounds at n = Ω̃(k²) samples for learning k-sparse mixtures. This implies that all problems with the same structure exhibit the same gap, independent of the specifics of the distributions, and is a first step towards explaining the ubiquity of k-to-k² gaps in high-dimensional statistics.

The reduction gadget is a 3-input variant of the 2-input rejection kernel framework introduced in [BBH18] and extended in [BBH19]. Its guarantees are general and yield lower bounds that only require high probability bounds on the likelihood ratios of the planted marginals. The universality class of lower bounds we obtain includes the k-to-k² gap in the spiked covariance model of sparse PCA, recovering lower bounds obtained in [BR13b, BR13a, WBS16, GMZ17, BBH18, BB19]. It also includes the sparse Gaussian mixture models considered in [ASW13, VAC17, FLWY18] and shows computational lower bounds for these problems that are tight against algorithmic achievability.

As noted earlier, average-case reductions are notoriously brittle due to the difficulty in exactly mapping to natural target distributions. Universality principles from average-case complexity must strongly overcome this obstacle by mapping precisely to every natural target distribution within an entire universality class. The rejection kernels we introduce provide a simple recipe for obtaining these universality results from an initial reduction to a single well-chosen member of the universality class, in the context of learning sparse mixtures. Our reduction can be found in Section 7.
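For concreteness, the following is a small sampler sketch (not from the paper) of the generative structure shared by the sparse mixture problems in this universality class: each sample draws a mixing parameter μ, the k coordinates in a hidden support S follow a planted marginal P_μ, and the remaining coordinates follow a null marginal Q. The specific choices P_μ = N(μ, 1), Q = N(0, 1) and a symmetric two-point distribution for μ are illustrative assumptions; the formal definition (glsm) appears in Section 2.3.

```python
import numpy as np

def sample_sparse_mixture(n, d, k, planted, null, mixing, rng):
    """Draw n samples: coordinates in a hidden k-subset S follow the planted
    marginal with a per-sample mixing parameter mu; the rest follow the null."""
    S = rng.choice(d, size=k, replace=False)
    X = null(rng, (n, d))
    for i in range(n):
        mu = mixing(rng)
        X[i, S] = planted(rng, mu, k)
    return X, S

rng = np.random.default_rng(1)
X, S = sample_sparse_mixture(
    n=200, d=1000, k=20,
    planted=lambda rng, mu, k: mu + rng.standard_normal(k),   # P_mu = N(mu, 1)
    null=lambda rng, shape: rng.standard_normal(shape),       # Q = N(0, 1)
    mixing=lambda rng: rng.choice([-0.3, 0.3]),                # mu ~ Unif{-0.3, 0.3}
    rng=rng,
)
print(X.shape, sorted(S)[:5])
```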
To obtain our lower bounds from k -pc , we introduce several techniques for average-case reductions. Our main approach is simple, but powerful: we rotate Gaussianized forms ofthe input graph by carefully designed matrices with orthonormal rows. By choosing the rightrotation matrices, we can manipulate the planted signal in a precisely controlled manner and mapapproximately in total variation to our target distributions. The matrices we use in our reductions5re constructed from hyperplanes in F tr for certain prime numbers r . In order to arrive at aGaussianized form of the input graph and permit this approach, we need a number of average-casereduction primitives tailored to the k -partite promise in k -pc , as presented in Section 4. Analyzingthe total variation guarantees of these reduction primitives proves to be technically involved. Ourtechniques are outlined in more detail in Section 2.4. Robustness.
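The following is a minimal numerical sketch of the matrix H_{r,t} built from the hyperplanes of F_r^t, assuming the entry rule described in Section 2.4 and Figure 2 as reconstructed here: entries take value 1 − r on the r^{t−1} points of a hyperplane and 1 off it, with a common normalization of 1/√(r^t(r − 1)). The check below confirms that, with these values, the rows are orthonormal and each nonzero column has roughly a 1/r fraction of negative entries, the two properties the reductions rely on.

```python
import itertools
import numpy as np

def hyperplane_matrix(r, t):
    """Build the ell x r^t matrix H_{r,t} from the hyperplanes of F_r^t.

    Assumed entry rule: (1 - r)/sqrt(r^t (r - 1)) on points of the hyperplane,
    1/sqrt(r^t (r - 1)) off it.
    """
    points = list(itertools.product(range(r), repeat=t))
    # Each hyperplane through the origin is the kernel of a nonzero functional a;
    # a and its nonzero scalar multiples give the same hyperplane, so deduplicate.
    hyperplanes = set()
    for a in itertools.product(range(r), repeat=t):
        if any(a):
            hyperplanes.add(frozenset(
                p for p in points
                if sum(ai * pi for ai, pi in zip(a, p)) % r == 0))
    scale = 1.0 / np.sqrt(r ** t * (r - 1))
    H = np.array([[(1 - r) if p in V else 1 for p in points] for V in hyperplanes],
                 dtype=float) * scale
    return H, points

r, t = 3, 3
H, points = hyperplane_matrix(r, t)
ell = (r ** t - 1) // (r - 1)
assert H.shape == (ell, r ** t)
assert np.allclose(H @ H.T, np.eye(ell))           # rows are orthonormal
neg_frac = (H < 0).mean(axis=0)                     # fraction of negative entries per column
print(ell, neg_frac[points.index((0,) * t)], neg_frac[1])  # zero column: 1.0, others ~1/r
```

When r = 2 the resulting matrix is a partial Hadamard matrix, consistent with the remark in Section 4.2 that H_{2,t} is close to a Hadamard matrix.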
Robustness. The study of robust estimation under model misspecification began with Huber's contamination model [Hub92, Hub65] and observations of Tukey [Tuk75]. Classical robust estimators have typically either been computationally intractable or heuristic [Hub11, Tuk75, Yat85]. Recently, the first dimension-independent error rates for robust mean estimation were obtained in [DKK+16] and a logarithmic dependence was obtained in [LRV16]. This sparked an active line of research into robust algorithms for high-dimensional problems and has led to algorithms for robust variants of clustering, learning of mixture models, linear regression, estimation of sparse functionals and more [ABL14, DKK+18, KKM18, DKS19, Li17, BDLS17, CSV17]. Another notion of robustness that has been studied in computer science is against semirandom adversaries, who may modify a random instance of a problem in certain constrained ways that heuristically appear to increase the signal strength [BS95]. For example, in a semirandom variant of planted clique, the adversary takes an instance of planted clique and may only remove edges outside of the clique. Algorithms robust to semirandom adversaries have been exhibited for the stochastic block model [FK01], planted clique [FK00], unique games [KMM11], correlation clustering [MS10, MMV15], graph partitioning [MMV12] and clustering mixtures of Gaussians [VA18].

The papers [FK01] and [DF16] show lower bounds in semirandom problems from worst-case hardness, and [MPW16] shows that semirandom adversaries can shift the information-theoretic threshold for recovery in the 2-block stochastic block model. Recent work also shows lower bounds from worst-case hardness for the optimal ℓ₂ error that can be attained by polynomial time algorithms in sub-Gaussian mean estimation without the sparsity constraint [HL19]. Our work builds on the framework for average-case reductions introduced in [BBH18, BBH19, BB19], making use of several of the average-case primitives in these papers.

Restricted models of computation.
In addition to the aforementioned work showing statistical-computational gaps through average-case reductions from the pc conjecture, there is a line of workshowing lower bounds in restricted models of computation. Lower bounds in the sum of squares(SOS) hierarchy have been shown for a variety of average-case problems, including planted clique[DM15b, RS15, HKP +
16, BHK + + + + k -block stochastic block model in [HS17] and forrandom optimization problems related to PCA in [BKW19].More closely related to our results are the statistical query lower bounds of [DKS17] and[FLWY18], which show statistical query lower bounds for some of the problems that we consider.Specifically, [DKS17] shows that statistical query algorithms require n = ˜Ω( k ) samples to solverobust sparse mean estimation and [FLWY18] shows tight statistical query lower bounds for sparsemixtures of Gaussians. We obtain lower bounds for sparse mixtures of Gaussians as an intermediatein reducing to robust sparse mean estimation. It also falls within our universality class for sparsemixtures. 6 verage-case reductions. There have been a number of average-case reductions in the literaturestarting with different assumptions than the pc conjecture. Hardness conjectures for random CSPshave been used to show hardness in improper learning complexity [DLSS14], learning DNFs [DSS16]and hardness of approximation [Fei02].Another related reference is the reduction in [CLR15], which proves a detection-recovery gapin the context of sub-Gaussian submatrix localization based on the hardness of finding a planted k -clique in a random n/ pc was previouslyconsidered in [DM15a] and differs in a number of ways from pc . For example, it is unclear howto generate a sample from the degree-regular variant in polynomial time. We remark that thereduction of [CLR15], when instead applied the usual formulation of pc produces a matrix withhighly dependent entries. Specifically, the sum of the entries of the output matrix has variance n /µ where µ ≪ n .Note that, in general, any reduction beginning with pc that also preserves the natural H hypothesiscannot show the existence of a detection-recovery gap, as any lower bounds for localization wouldalso apply to detection. In Section 2, we formulate the estimation and recovery tasks we consider as detection problems,present our main results and give an overview of our techniques. In Section 3, we establish ourcomputational model and give some preliminaries on reductions in total variation. In Section 4,we give our main reduction to the intermediate problem of imbalanced sparse Gaussian mixtures.In Section 5, we deduce our lower bounds for robust sparse mean estimation. In Section 6, wegive our reduction to semirandom community recovery. In Section 7, we introduce symmetric 3-ary rejection kernels and apply them with our reduction to sparse Gaussian mixtures to produceour universal lower bound. In Section 8, we provide evidence for the k -pc conjecture based onlow-degree polynomials and lower bounds against statistical query algorithms. We adopt the following notation. Let L ( X ) denote the distribution law of a random variable X andgiven two laws L and L , let L + L denote L ( X + Y ) where X ∼ L and Y ∼ L are independent.Given a distribution P , let P ⊗ n denote the distribution of ( X , X , . . . , X n ) where the X i are i.i.d.according to P . Similarly, let P ⊗ m × n denote the distribution on R m × n with i.i.d. entries distributedas P . Given a finite or measurable set X , let Unif[ X ] denote the uniform distribution on X . Let d TV , d KL and χ denote total variation distance, KL divergence and χ divergence, respectively.Let N ( µ, Σ) denote a multivariate normal random vector with mean µ ∈ R d and covariance matrixΣ, where Σ is a d × d positive semidefinite matrix. Let [ n ] = { , , . . . , n } and G n be the set ofsimple graphs on n vertices. 
Let 1_S denote the vector v ∈ R^n with v_i = 1 if i ∈ S and v_i = 0 if i ∉ S, where S ⊆ [n]. Let mix_ε(D_0, D_1) denote the ε-mixture distribution formed by sampling D_0 with probability (1 − ε) and D_1 with probability ε.

2 Problem Formulations and Main Lower Bounds
We begin by describing our general setup for detection problems and the notions of robustnessand types adversaries that we consider. In a detection task P , the algorithm is given a set ofobservations and tasked with distinguishing between two hypotheses: • a uniform hypothesis H corresponding to the natural noise distribution for the problem; and • a planted hypothesis H , under which observations are generated from the same noise distri-bution but with a latent planted structure.Both H and H can either be simple hypotheses consisting of a single distribution or a compositehypothesis consisting of multiple distributions. Our problems typically are such that either: (1)both H and H are simple hypotheses; or (2) both H and H are composite hypotheses consistingof the set of distributions that can be induced by some constrained adversary. The robust estimationliterature contains a number of adversaries capturing different notions of model misspecification.We consider the following three central classes of adversaries:1. ǫ -corruption : A set of samples ( X , X , . . . , X n ) is an ǫ -corrupted sample from a distribution D if they can be generated by giving a set of n samples drawn i.i.d. from D to an adversarywho then changes at most ǫn of them arbitrarily.2. Huber’s contamination model : A set of samples ( X , X , . . . , X n ) is an ǫ -contaminatedof D in Huber’s model if X , X , . . . , X n ∼ i.i.d. mix ǫ ( D , D O )where D O is an unknown outlier distribution chosen by an adversary. Here, mix ǫ ( D , D O )denotes the ǫ -mixture distribution formed by sampling D with probability (1 − ǫ ) and D O with probability ǫ .3. Semirandom adversaries : Suppose that D is a distribution over collections of observations { X i } i ∈ I such that an unknown subset P ⊆ I of indices correspond to a planted structure.A sample { X i } i ∈ I is semirandom if it can be generated by giving a sample from D to anadversary who is allowed decrease X i for any i ∈ I \ P . Some formulations of semirandomadversaries in the literature also permit increases in X i for any i ∈ P . Our lower boundsapply to both adversarial setups.All adversaries in these models of robustness are computationally unbounded and have accessto randomness. Given a single distribution D over a set X , any one of these three adversariesproduces a set of distributions adv ( D ) that can be obtained after corruption. When formulatedas detection problems, the hypotheses H and H are of the form adv ( D ) for some D . We remarkthat ǫ -corruption can simulate contamination in Huber’s model at a slightly smaller ǫ ′ within o (1)total variation. This is because a sample from Huber’s model has Bin( n, ǫ ′ ) samples from D O .An adversary resampling min { Bin( n, ǫ ′ ) , ǫn } samples from D O therefore simulates Huber’s modelwithin a total variation distance bounded by standard concentration for the Binomial distribution.As discussed in [BBH18] and [HWX15], when detection problems need not be composite bydefinition, average-case reductions to natural simple vs. simple hypothesis testing formulations arestronger and technically more difficult. In these cases, composite hypotheses typically arise becausea reduction gadget precludes mapping to the natural simple vs. simple hypothesis testing formu-lation. We remark that simple vs. 
simple formulations are the hypothesis testing problems that correspond to average-case decision problems (L, D) as in Levin's theory of average-case complexity. A survey of average-case complexity can be found in [BT06].

Given an observation X, an algorithm A(X) ∈ {0, 1} solves the detection problem with non-trivial probability if there is an ε > 0 such that its error with respect to H_0 and H_1 satisfies

lim sup_{n→∞} ( sup_{P ∈ H_0} P_{X∼P}[A(X) = 1] + sup_{P ∈ H_1} P_{X∼P}[A(X) = 0] ) ≤ 1 − ε

where n is the parameter indicating the size of X. We refer to this quantity as the asymptotic Type I+II error of A for the problem P. If the asymptotic Type I+II error of A is zero, then we say A solves the detection problem P. A simple consequence of this definition is that if A achieves asymptotic Type I+II error 1 − ε for a composite testing problem with hypotheses H_0 and H_1, then it also achieves this same error on the simple problem with hypotheses H_0 and H_1′ : X ∼ P, where P is any mixture of the distributions in H_1. When stating detection problems, we adopt the convention that any parameter such as n implicitly refers to a sequence n = (n_t). For notational convenience, we drop the index t. The asymptotic Type I+II error of a test for a parameterized detection problem is defined in the limit t → ∞.

In this section, we formulate robust sparse mean estimation, semirandom community recovery and our general learning sparse mixtures problem as detection problems. More precisely, for each problem P that we consider, we introduce a detection variant P′ such that a blackbox for P also solves P′.
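As a toy illustration of the Type I+II error criterion above (an example constructed here, not taken from the paper), the following sketch Monte Carlo estimates the two error terms for a simple mean-thresholding test on a one-dimensional Gaussian detection problem; the threshold and signal strength are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def test(x, tau):
    """Output 1 (planted) iff the sample mean exceeds the threshold tau."""
    return int(x.mean() > tau)

def type_I_plus_II(n, mu, tau, trials=2000):
    h0 = np.array([test(rng.standard_normal(n), tau) for _ in range(trials)])
    h1 = np.array([test(mu + rng.standard_normal(n), tau) for _ in range(trials)])
    return h0.mean() + (1 - h1.mean())   # P_H0[A = 1] + P_H1[A = 0]

for n in (10, 100, 1000):
    print(n, round(type_I_plus_II(n, mu=0.2, tau=0.1), 3))  # tends to 0 as n grows
```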
Robust Sparse Mean Estimation. In robust sparse mean estimation, the observed vectors X_1, X_2, . . . , X_n are an ε-corrupted set of n samples from N(μ, I_d) where μ is an unknown k-sparse vector in R^d. The task is to estimate μ in the ℓ₂ norm by outputting μ̂ with ‖μ̂ − μ‖₂ small. Without ε-corruption, the information-theoretically optimal number of samples is n = Θ(k log d/ρ²) in order to estimate μ within ℓ₂ distance ρ, and this is efficiently achieved by truncating the empirical mean. As discussed in [Li17, BDLS17], for ‖μ − μ′‖₂ sufficiently small, it holds that d_TV(N(μ, I_d), N(μ′, I_d)) = Θ(‖μ − μ′‖₂). Furthermore, an ε-corrupted set of samples can simulate distributions within O(ε) total variation from N(μ, I_d). Therefore ε-corruption can simulate N(μ′, I_d) if ‖μ′ − μ‖₂ = O(ε) and it is impossible to estimate μ with ℓ₂ distance less than this O(ε). This implies that the minimax rate of estimation for μ is O(ε), even for very large n. As shown in [Li17, BDLS17], the information-theoretic threshold for estimating at this rate in the ε-corrupted model remains at n = Θ(k log d/ε²) samples. However, the best known polynomial-time algorithms from [Li17, BDLS17] require n = Θ̃(k² log d/ε²) samples to estimate μ within O(ε√(log ε⁻¹)) in ℓ₂. One of our main results is to show that this k-to-k² statistical-computational gap induced by robustness follows from the k-partite planted clique conjecture. We show this by giving an average-case reduction to the following detection formulation of robust sparse mean estimation.

Definition 2.1 (Detection Formulation of Robust Sparse Mean Estimation). For any τ = ω(ε), the hypothesis testing problem rsme(n, k, d, τ, ε) has hypotheses

H_0 : (X_1, X_2, . . . , X_n) ∼ i.i.d. N(0, I_d)
H_1 : (X_1, X_2, . . . , X_n) ∼ i.i.d. mix_ε(N(μ_R, I_d), D_O)

where D_O is any adversarially chosen outlier distribution on R^d and where μ_R ∈ R^d is any random k-sparse vector satisfying ‖μ_R‖₂ ≥ τ almost surely.
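To make Definition 2.1 concrete, here is a minimal sampler sketch for the two hypotheses (not from the paper); the choice of outlier distribution D_O and the particular scaling of μ_R are arbitrary illustrative assumptions.

```python
import numpy as np

def sample_rsme(n, d, k, tau, eps, planted, rng):
    """Sample (X_1, ..., X_n) under H_0 or H_1 of rsme(n, k, d, tau, eps)."""
    if not planted:
        return rng.standard_normal((n, d))                        # H_0: N(0, I_d)
    mu = np.zeros(d)
    S = rng.choice(d, size=k, replace=False)
    mu[S] = tau / np.sqrt(k)                                      # k-sparse, ||mu||_2 = tau
    X = mu + rng.standard_normal((n, d))
    outliers = rng.random(n) < eps                                 # Huber contamination
    X[outliers] = 5.0 * rng.standard_normal((outliers.sum(), d))   # an arbitrary choice of D_O
    return X

rng = np.random.default_rng(2)
X = sample_rsme(n=500, d=2000, k=40, tau=0.5, eps=0.05, planted=True, rng=rng)
print(X.shape)
```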
[Figure 1 appears here: two phase diagrams in (β, α), one for community detection and one for community recovery, with regions labeled "IT impossible", "PC-hard", "poly-time" and, in the recovery diagram, "open", and boundaries at snr ≍ n²/k⁴ and snr ≍ n/k².]
Figure 1:
Prior computational and statistical barriers in the detection and recovery of a single hidden community from the pc conjecture [HWX15, BBH18, BBH19]. The axes are parameterized by α and β where snr = (p − q)²/q(1 − q) = Θ̃(n^{−α}) and k = Θ̃(n^β). The red region is conjectured to be computationally hard but no pc reductions showing this hardness are known.

This is a formulation of robust sparse mean estimation in Huber's contamination model, and therefore lower bounds for this problem imply corresponding lower bounds under ε-corruption. We also directly consider a detection variant c-rsme formulated in the ε-corruption model. The condition that τ = ω(ε) ensures that any algorithm achieving ℓ₂ error ‖μ̂ − μ‖₂ at the minimax rate of O(ε) can also solve the detection problem. Our lower bounds also apply to this detection problem with the requirement that τ = ω(r) for r much larger than ε, up to approximately Θ̃(√ε). Lower bounds in this case show gaps for estimators achieving a suboptimal rate of estimation that is only O(r) instead of O(ε).
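The following small sketch (an illustration constructed here, not from the paper) tabulates the phase-diagram boundaries implied by the parameterization in the caption above: for k = Θ̃(n^β) and snr = Θ̃(n^{−α}), the detection threshold snr ≳ n²/k⁴ corresponds to α ≤ 4β − 2 and the conjectured recovery threshold snr ≳ n/k² to α ≤ 2β − 1.

```python
def thresholds(beta):
    """Boundary exponents alpha for k = n^beta and snr = n^(-alpha):
    detection by edge counting needs alpha <= 4*beta - 2,
    the best known recovery algorithms need alpha <= 2*beta - 1."""
    return 4 * beta - 2, 2 * beta - 1

for beta in (0.5, 0.6, 0.75, 0.9, 1.0):
    det, rec = thresholds(beta)
    print(f"beta = {beta:.2f}: detection alpha <= {det:.2f}, recovery alpha <= {rec:.2f}")
```

The gap between the two exponents for β ∈ (1/2, 1) is exactly the region where detection is easy but recovery is conjectured to be hard.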
In single community recovery, oneobserves a graph G drawn from the planted dense subgraph ( pds ) distribution G ( n, k, p, q ) withedge densities p = p ( n ) > q = q ( n ) that can vary with n . To sample from G ( n, k, p, q ), first G issampled from G ( n, q ) and then a k -subset of [ n ] is chosen uniformly at random and the inducedsubgraph of G on S is replaced with an independently sampled copy of G ( k, p ). The regime ofinterest is when p and q are converging to one another at some rate and satisfy that p/q = O (1).The task is to recover the latent index set S , either exactly by outputting ˆ S = S or partially byoutputting ˆ S with | ˆ S ∩ S | = Ω( k ). As shown in [CX16], a polynomial-time convexified maximumlikelihood algorithm achieves exact recovery when snr = ( p − q ) q (1 − q ) & nk This is the best known polynomial time algorithm and has a statistical-computational gap fromthe information-theoretic limit of snr = ˜Θ(1 /k ).The natural detection variant of this problem is planted dense subgraph with varying p and q ,which has the hypotheses H : G ∼ G ( n, q ) and H : G ∼ G ( n, k, p, q )Unlike the recovery problem, this detection problem can be solved at the lower threshold of snr & n /k by thresholding the total number of edges in the observed graph. No polynomial10ime algorithm beating this threshold by a poly( n, k ) factor is known. The information-theoreticthreshold for detection is snr = ˜Θ(min { /k, n /k } ) and thus the problem has no statistical-computational gap when k & n / . Average-case lower bounds based on planted clique at theconjectured computational threshold of snr & n /k were shown in the regime p = cq = Θ( n − α )for some constant c > p and q with p/q = O (1) in [BBH18, BBH19].Conjectured phase diagrams for the detection and recovery variants are shown in Figure 1.This pair of problems for recovery and detection of a single community has emerged as acanonical example of a problem with a detection-recovery computational gap, which is significantin all regimes of q as long as p/q = O (1). [HWX15, BBH18] and [BBH19] all conjecture that therecovery problem is hard below the threshold of snr = ˜Θ( n/k ). This pds Recovery Conjecturewas even used in [BBH18] as a hardness assumption to show detection-recovery gaps in otherproblems including biased sparse PCA and Gaussian biclustering. It is unlikely that the pds recovery conjecture can be shown to follow from the pc conjecture, since it is a very different problemfrom detection which does have tight pc hardness. In fact, a reduction in total variation from pc to pds at the recovery threshold that faithfully maps corresponding hypotheses H i is impossiblegiven the pc conjecture, since the pds detection problem is easy for n/k ≫ snr & n /k .We show the intriguing result that the pds Recovery Conjecture is true for semirandom com-munity recovery in the regime where q = Θ(1) given the k -pc conjecture. To do this, we give anaverage-case reduction to the following semirandom detection formulation of community recovery. Definition 2.2 (Detection Formulation of Semirandom Community Recovery) . The hypothesistesting problem semi-cr ( n, k, p, q ) has observation G ∈ G n and hypotheses H : G ∼ P for some P ∈ adv ( G ( n, q )) H : G ∼ P for some P ∈ adv ( G ( n, k, p, q )) where adv ( G ( n, k, p, q )) is the set of distributions induced by a semirandom adversary that can onlyremove edges outside of the planted dense subgraph S . 
In the non-planted case, the set adv(G(n, q)) corresponds to an adversary that can remove any edges. An algorithm A solving semirandom community recovery exactly can threshold the edge density within the output set of vertices Ŝ. The semirandomness of the adversary along with concentration bounds ensures that this solves semi-cr. We remark that the convexified maximum likelihood algorithm from [CX16] continues to solve the community recovery problem at the same threshold under a semirandom adversary by a simple monotonicity argument.
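The following sketch (illustrative only; the adversary's edge-removal rule is an arbitrary choice within the constraints of Definition 2.2) samples G(n, k, p, q), applies a semirandom adversary that deletes edges with an endpoint outside the planted set S, and runs the edge-density check within a candidate set Ŝ described above.

```python
import numpy as np

def sample_pds(n, k, p, q, rng):
    """Sample an adjacency matrix from G(n, k, p, q) and return it with the planted set S."""
    A = (rng.random((n, n)) < q).astype(int)
    S = rng.choice(n, size=k, replace=False)
    A[np.ix_(S, S)] = (rng.random((k, k)) < p).astype(int)
    A = np.triu(A, 1)
    A = A + A.T                                      # symmetrize, no self-loops
    return A, set(S)

def semirandom_adversary(A, S, remove_prob, rng):
    """Remove each edge with at least one endpoint outside S, independently with remove_prob."""
    B = A.copy()
    n = A.shape[0]
    outside = np.array([i not in S for i in range(n)])
    mask = outside[:, None] | outside[None, :]
    drop = (rng.random((n, n)) < remove_prob) & (B == 1) & mask
    drop = np.triu(drop, 1)
    drop = drop | drop.T
    B[drop] = 0
    return B

def dense_enough(A, S_hat, p, q):
    """Check whether the edge density inside S_hat is closer to p than to q."""
    idx = np.array(sorted(S_hat))
    sub = A[np.ix_(idx, idx)]
    dens = np.triu(sub, 1).sum() / (len(idx) * (len(idx) - 1) / 2)
    return dens > (p + q) / 2

rng = np.random.default_rng(3)
A, S = sample_pds(n=400, k=40, p=0.9, q=0.5, rng=rng)
B = semirandom_adversary(A, S, remove_prob=0.2, rng=rng)
print(dense_enough(B, S, p=0.9, q=0.5))   # the planted set stays dense after the adversary
```

Because the adversary never touches edges inside S, the density within the true community is unaffected, which is the monotonicity that the remark above exploits.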
Learning Sparse Mixtures. Our third main result is a universality principle for statistical-computational gaps in learning sparse mixtures. The sparse mixture setup we consider includes sparse PCA in the spiked covariance model, learning sparse mixtures of Gaussians, sparse group testing and distributions related to learning graphical models. To define the detection formulation of our generalized sparse mixtures problem, we will need the notion of computable pairs from [BBH19]. For brevity in this section, we defer this definition until Section 7.2. Our general sparse mixtures detection problem is the following simple vs. simple hypothesis testing problem.
Definition 2.3 (Generalized Learning Sparse Mixtures) . Let n be a parameter and D be a symmet-ric distribution on a subset of R . Suppose that {P µ } µ ∈ R and Q are distributions on a measurablespace ( W, B ) such that ( P µ , Q ) is a computable pair for each µ ∈ R . Let the general sparse mixture glsm H ( n, k, d, {P µ } µ ∈ R , Q , D ) be the distribution on X , X , . . . , X n ∈ W d sampled as follows:1. Sample a single k -subset S ⊆ [ d ] uniformly at random; and . For each i ∈ [ n ] , choose some µ ∼ D and independently sample ( X i ) j ∼ P µ if j ∈ S and ( X i ) j ∼ Q otherwise.The problem glsm ( n, k, d, {P µ } µ ∈ R , Q , D ) has observations X , X , . . . , X n and hypotheses H : X , X , . . . , X n ∼ i.i.d. Q ⊗ d H : ( X , X , . . . , X n ) ∼ glsm H ( n, k, d, {P µ } µ ∈ R , Q , D ) We now state our main average-case lower bounds for robust sparse mean estimation, semirandomcommunity recovery and general learning of sparse mixtures. We first formally introduce the k -pc conjecture. Our reductions will all start with the more general problem of k -partite planted densesubgraph with constant edge densities, defined as follows. For simplicity of analysis we will alwaysassume that p and q are fixed constants, however we remark that our reductions can handle themore general case where only q is constant and p is tending towards q . Definition 2.4 ( k -partite Planted Dense Subgraph) . Let k divide n and E be known a partition of [ n ] with [ n ] = E ∪ E ∪ · · · ∪ E k and | E i | = n/k for each i . Let G E ( n, k, p, q ) denote the distributionon n -vertex graphs formed by sampling G ∼ G ( n, q ) and planting an independently sampled copy of H ∼ G ( k, p ) on the k -vertex subgraph with one vertex chosen uniformly at random from each E i in G . The hypothesis testing problem k -pds ( n, k, p, q ) has hypotheses H : G ∼ G ( n, q ) and H : G ∼ G E ( n, k, p, q )The k -partite variant of planted clique k -pc ( n, k, p ) is then k -pds ( n, k, , p ) in this notation.Note that the edges within each E i are irrelevant and independent of the hidden vertex set of theclique. We remark that E can be any fixed partition of [ n ] without changing the problem. The k -pc conjecture is formally stated as follows. Conjecture 2.5 ( k -pc Conjecture) . Fix some constant p ∈ (0 , . Suppose that {A t } is a sequenceof randomized polynomial time algorithms A t : G n t → { , } where n t and k t are increasing se-quences of positive integers satisfying that k t = o ( √ n t ) and k t divides n t for each t . Then if G t isan instance of k -pc ( n t , k t , p ) , it holds that lim inf t →∞ ( P H [ A t ( G ) = 1] + P H [ A t ( G ) = 0]) ≥ k -pds conjecture at a fixed pair of constant edge densities 0 < q < p ≤ k -pc conjecture. There is a plethora of evidence in the literaturefor the ordinary pc conjecture. Spectral algorithms, approximate message passing, semidefiniteprogramming, nuclear norm minimization and several other polynomial-time combinatorial ap-proaches all appear to fail to solve pc exactly when k = o ( n / ) [AKS98, FK00, McS01, FR10, AV11,DGGP14, DM15a, CX16]. Lower bounds against low-degree sum of squares relaxations [BHK + +
13] have also been shown up to k = o ( n / ). In Section 8,we show that lower bounds for ordinary pc against low-degree polynomials and statistical queryalgorithms extend easily to k -pc .We now state our main lower bounds based on either the k -pc conjecture or the k -pds conjectureat a fixed pair of constant edge densities 0 < q < p ≤
1. These theorem statements are reproducedin Sections 5, 6 and 7, respectively. We first give our general lower bound for rsme , which alsoapplies to weak algorithms only able to estimate up to ℓ rates of approximately √ ǫ .12 heorem 2.6 (General Lower Bound for rsme ) . Let ( n, k, d, ǫ ) be any parameters satisfying that k = o ( d ) , ǫ ∈ (0 , and n satisfies that n = o ( ǫ k ) and n = Ω( k ) . If c > is some fixed constant,then there is a parameter τ = Ω( p ǫ/ (log n ) c ) such that any randomized polynomial time test for rsme ( n, k, d, τ, ǫ ) has asymptotic Type I + II error at least 1 assuming either the k -pc conjecture orthe k -pds conjecture for some fixed edge densities < q < p ≤ . Specializing this theorem to the case where ǫ = 1 / polylog( n ), we establish the tight k -to- k gapfor rsme up to polylog( n ) factors, as stated in the following corollary. Corollary 2.7 (Tight k -to- k gap for rsme ) . Let ( n, k, d, ǫ ) be any parameters satisfying that k = o ( d ) , ǫ = Θ((log n ) − c ) for some constant c > and n satisfies that n = o ( k (log n ) − c ) and n = Ω( k ) . Then there is a parameter τ = ω ( ǫ ) such that any randomized polynomial time test for rsme ( n, k, d, τ, ǫ ) has asymptotic Type I + II error at least 1 assuming either the k -pc conjecture orthe k -pds conjecture for some fixed edge densities < q < p ≤ . In Section 5, we also show how to alleviate the dependence of our general sample complexitylower bound of n = Ω( ǫ k ) on ǫ as a tradeoff with how far τ is above the minimax rate of O ( ǫ ).We now state our lower bounds for semi-cr and universality principle for glsm . Theorem 2.8 (The pds
Recovery Conjecture holds for semi-cr ) . Fix any constant β ∈ [1 / , .Suppose that B is a randomized polynomial time test for semi-cr ( n, k, p, q ) for all ( n, k, p, q ) with k = Θ( n β ) and ( p − q ) q (1 − q ) ≤ ν and min { q, − q } = Ω(1) where ν = o (cid:18) nk log n (cid:19) Then B has asymptotic Type I + II error at least 1 assuming either the k -pc conjecture or the k -pds conjecture for some fixed edge densities < q ′ < p ′ ≤ . Theorem 2.9 (Universality of n = ˜Ω( k ) for glsm ) . Let ( n, k, d ) be parameters such that n = o ( k ) and k = o ( d ) . Suppose that ( D , Q , {P ν } ν ∈ R ) satisfy • D is a symmetric distribution about zero and P ν ∼D [ ν ∈ [ − , − o ( n − ) ; and • for all ν ∈ [ − , , it holds that √ k log n ≫ (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) − d P − ν d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) and 1 k log n ≫ (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) + d P − ν d Q ( x ) − (cid:12)(cid:12)(cid:12)(cid:12) with probability at least − o ( n − d − ) over each of P ν , P − ν and Q .Then if B is a randomized polynomial time test for glsm ( n, k, d, {P ν } ν ∈ R , Q , D ) . Then B hasasymptotic Type I + II error at least 1 assuming either the k -pc conjecture or the k -pds conjecturefor some fixed edge densities < q < p ≤ . Note that D and the indices of P ν can be reparameterized without changing the underlyingproblem. The assumption that D is symmetric and mostly supported on [ − ,
1] is for convenience. When the likelihood ratios are relatively concentrated, the dependence of the two conditions above on n and d is nearly negligible. These conditions almost exclusively depend on k, implying that they will not require a stronger dependence between n and k to produce hardness than the n = õ(k²) condition that arises from our reductions. Thus these conditions do show a universality principle for the computational sample complexity of n = Θ̃(k²).

2.4 Overview of Techniques

We now overview the average-case reduction techniques we introduce to show our lower bounds.
Rotating Gaussianized Instances by H r,t . We give a simplified overview of our rotationmethod in the case of robust sparse mean estimation. Observe that if M = τ · S ⊤ S + N (0 , ⊗ n × n , S is a k -subset and R is an orthogonal matrix then M R ⊤ is distributed as τ · S ( R S ) ⊤ + N (0 , ⊗ n × n since the rows of the noise term have i.i.d. N (0 ,
1) entries and thus are isotropic. This simpleproperty allows us to manipulate the latent structure S upon a rotation by R while at the same time,crucially preserve the independence among the entries of the noise distribution. Our key insight isto carefully construct R based on the geometry of F tr . Let P , P , . . . , P r t be an enumeration of thepoints in F tr and V , V , . . . , V ℓ , where ℓ = r t − r − , be an enumeration of the hyperplanes in F tr . Nowtake H r,t to be the ℓ × r t matrix( H r,t ) ij = 1 p r t ( r − · (cid:26) P j V i − r if P j ∈ V i It can be verified that the rows of H r,t are an orthonormal basis of the subspace they span. Thischoice of H r,t has two properties crucial to our reductions: (1) it contains exactly two values; and(2) each of its columns, other than the column corresponding to P i = 0, approximately containsa 1 − r − vs. r − mixture of these two values. Taking R to rotate by H r,t on blocks of indicesof M , each containing exactly one element of S , then produces an instance of robust sparse meanestimation with ǫ ≈ r − and planted mean vector approximately given by τ · S / p r t ( r − µ = S · p ǫ/wk log n for a slow-growing function w and thus k µ k = p ǫ/ log n , which for polylogarithmically small ǫ is larger thanthe O ( ǫ ) minimax rate. Carrying out this strategy involves a number of technical obstacles relatedto the counts of each type of entry, the fact that the H r,t are not square, dealing with the columncorresponding to zero and other distributional issues that arise.In our reduction to semirandom community recovery we rotate along both rows and columnsand in smaller blocks, and also only require r = 3. Our universality result uses exactly the reductiondescribed above for r = 2 as a subroutine. k -Partite Average-Case Primitives. The discussion above overlooks several issues, notablyincluding: (1) mapping from the input graph problem to a matrix of the form τ · S ⊤ S + N (0 , ⊗ n × n ;and (2) ensuring that we can rotate by H r,t on blocks, each containing exactly one element of S .Property (2) is essential to mapping in distribution to an instance of robust sparse mean estimation.Achieving (1) involves more than an algorithmic change of measure – the initial adjacency matrix ofthe graph is symmetric and lacks diagonal entries. Symmetry can be broken using a cloning gadget,but planting the diagonal entries so that they are properly distributed is harder. In particular, doingthis while maintaining the k -partite promise in order to achieve (2) requires embedding in a largersubmatrix and an involved analysis of the distances between product distributions with binomialmarginals. We break this first Gaussianization step into a number of primitives, extending theframework introduced in [BBH18] and [BBH19]. To show our universality result, we need to carry out analgorithmic change of measure with three target distributions. To do this, we introduce a new moreinvolved variant of the rejection kernels from [BBH18] and [BBH19] in order to map from threeinputs to three outputs. The analysis of these rejection kernels is made possible due to symmetries14n the initial distributions that crucially rely on using the rotations reduction described above as asubroutine. We remark that the fact that H r,t contains exactly two distinct values turns out to beespecially important here to ensure that there are only three possible input distributions to theserejection kernels. 
We give approximate reductions in total variation to show that lower bounds for one hypothesistesting problem imply lower bounds for another. These reductions yield an exact correspondencebetween the asymptotic Type I+II errors of the two problems. This is formalized in the followinglemma, which is Lemma 3.1 from [BBH18] stated in terms of composite hypotheses H and H .The main quantity in the statement of the lemma can be interpreted as the smallest total variationdistance between the reduced object A ( X ) and the closest mixture of distributions from either H ′ or H ′ . The proof of this lemma is short and follows from the definition of total variation. Lemma 3.1 (Lemma 3.1 in [BBH18]) . Let P and P ′ be detection problems with hypotheses H , H and H ′ , H ′ , respectively. Let X be an instance of P and let Y be an instance of P ′ . Suppose thereis a polynomial time computable map A satisfying sup P ∈ H inf π ∈ ∆( H ′ ) d TV ( L P ( A ( X )) , E P ′ ∼ π L P ′ ( Y )) + sup P ∈ H inf π ∈ ∆( H ′ ) d TV ( L P ( A ( X )) , E P ′ ∼ π L P ′ ( Y )) ≤ δ If there is a randomized polynomial time algorithm solving P ′ with Type I + II error at most ǫ , thenthere is a randomized polynomial time algorithm solving P with Type I + II error at most ǫ + δ . If δ = o (1), then given a blackbox solver B for P ′ D , the algorithm that applies A and then B solves P D and requires only a single query to the blackbox. We now outline the computationalmodel and conventions we adopt throughout this paper. An algorithm that runs in randomizedpolynomial time refers to one that has access to poly( n ) independent random bits and must run inpoly( n ) time where n is the size of the instance of the problem. For clarity of exposition, in ourreductions we assume that explicit expressions can be exactly computed and that we can samplea biased random bit Bern( p ) in polynomial time. We also assume that the oracles described inDefinition 7.1 can be computed in poly( n ) time. For simplicity of exposition, we assume that wecan sample N (0 ,
1) in poly( n ) time. The analysis of our reductions will make use of the following well-known facts and inequalitiesconcerning total variation distance.
Fact 3.2.
The distance d_TV satisfies the following properties:

1. (Tensorization) Let P_1, P_2, . . . , P_n and Q_1, Q_2, . . . , Q_n be distributions on a measurable space (X, B). Then

   d_TV( ∏_{i=1}^n P_i , ∏_{i=1}^n Q_i ) ≤ ∑_{i=1}^n d_TV(P_i, Q_i)
2. (Conditioning on an Event) For any distribution P on a measurable space ( X , B ) and event A ∈ B , it holds that d TV ( P ( ·| A ) , P ) = 1 − P ( A )15 . (Conditioning on a Random Variable) For any two pairs of random variables ( X, Y ) and ( X ′ , Y ′ ) each taking values in a measurable space ( X , B ) , it holds that d TV (cid:0) L ( X ) , L ( X ′ ) (cid:1) ≤ d TV (cid:0) L ( Y ) , L ( Y ′ ) (cid:1) + E y ∼ Y (cid:2) d TV (cid:0) L ( X | Y = y ) , L ( X ′ | Y ′ = y ) (cid:1)(cid:3) where we define d TV ( L ( X | Y = y ) , L ( X ′ | Y ′ = y )) = 1 for all y supp( Y ′ ) . Given an algorithm A and distribution P on inputs, let A ( P ) denote the distribution of A ( X )induced by X ∼ P . If A has k steps, let A i denote the i th step of A and A i - j denote the procedureformed by steps i through j . Each time this notation is used, we clarify the intended initial andfinal variables when A i and A i - j are viewed as Markov kernels. The next lemma from [BBH19]encapsulates the structure of all of our analyses of average-case reductions. Its proof is simple andincluded in Appendix A for completeness. Lemma 3.3 (Lemma 4.2 in [BBH19]) . Let A be an algorithm that can be written as A = A m ◦A m − ◦ · · · ◦ A for a sequence of steps A , A , . . . , A m . Suppose that the probability distributions P , P , . . . , P m are such that d TV ( A i ( P i − ) , P i ) ≤ ǫ i for each ≤ i ≤ m . Then it follows that d TV ( A ( P ) , P m ) ≤ m X i =1 ǫ i In this section, we give our reduction from k -pds to the key intermediate problem isgm , which wewill reduce from in subsequent sections to obtain several of our main computational lower bounds.We also introduce several average-case reduction subroutines that will be used in our reduction tosemirandom community recovery. The problem isgm , imbalanced sparse Gaussian mixtures, is asimple vs. simple hypothesis testing problem defined formally below. A similar distribution wasalso used in [DKS17] to construct an instance of robust sparse mean estimation inducing the tightstatistical-computational gap in the statistical query model. Definition 4.1 (Imbalanced Sparse Gaussian Mixtures) . Given some µ ∈ R and ǫ ∈ (0 , , let µ ′ be such that ǫ · µ ′ + (1 − ǫ ) · µ = 0 . The distribution isgm H ( n, k, d, µ, ǫ ) over X = ( X , X , . . . , X n ) where X i ∈ R d is sampled as follows:1. choose a k -subset S ⊆ [ d ] uniformly at random;2. sample X , X , . . . , X n i.i.d. from the mixture mix ǫ ( N ( µ · S , I d ) , N ( µ ′ · S , I d )) .The imbalanced sparse Gaussian mixture detection problem isgm ( n, k, d, µ, ǫ ) has observations X =( X , X , . . . , X n ) and hypotheses H : X ∼ N (0 , I d ) ⊗ n and H : X ∼ isgm H ( n, k, d, µ, ǫ )Figure 2 outlines the steps of our reduction from k -pds to isgm , using subroutines that will beintroduced in the next two subsections. The reduction makes use of the framework for average-casereductions set forth in the sequence of work [BBH18, BBH19, BB19] for its initial steps transforming k -pc into a Gaussianized submatrix problem, while preserving the partition-promise structure of k -pds . These steps are discussed in Section 4.1.In Section 4.2, we introduce the key insight of the reduction, which is to rotate the resultingGaussianized submatrix problem by a carefully chosen matrix H r,t constructed using hyperplanes16 lgorithm k -pds-to-isgm Inputs : k -pds instance G ∈ G N with dense subgraph size k that divides N , partition E of [ N ]and edge probabilities 0 < q < p ≤
1, a slow growing function w ( N ) = ω (1), target isgm parameters ( n, k, d, µ, ǫ ) satisfying that ǫ = 1 /r for some prime number r , wn ≤ k · r t − r − for some t ∈ N , d ≥ m and kr t ≥ m where m is the smallest multiple of k larger than (cid:16) pQ + 1 (cid:17) N where Q = 1 − p (1 − p )(1 − q ) + { p =1 } (cid:0) √ q − (cid:1) , µ ≤ c/ p r t ( r −
1) log( kmr t ) for a sufficiently smallconstant c > n, kr t ≤ poly( N )1. Symmetrize and Plant Diagonals : Compute M PD1 ∈ { , } m × m with partition F ′ of [ m ]as M PD1 ← To- k -Partite-Submatrix ( G )applied with initial dimension N , edge probabilities p and q and target dimension m .2. Pad : Form M PD2 ∈ { , } m × kr t by adding kr t − m new columns sampled i.i.d. fromBern( Q ) ⊗ m to M PD1 . Let F i be F ′ i with r t − m/k of the new columns. Randomly permutethe row indices of M PD2 and the column indices of M PD2 within each part F i .3. Gaussianize : Compute M G ∈ R m × kr t as M G ← Gaussianize ( M PD2 )applied with probabilities p and Q and µ ij = p r t ( r − · µ for all ( i, j ) ∈ [ m ] × [ kr t ].4. Construct Rotation Matrix : Form the ℓ × r t matrix H r,t where ℓ = r t − r − as follows(1) Let V , V , . . . , V ℓ be an enumeration of the hyperplanes of F tr and P , P , . . . , P r t bean enumeration of the points in F tr .(2) For each pair ( i, j ) ∈ [ ℓ ] × [ r t ], set the ( i, j )th entry of H r,t to be( H r,t ) ij = 1 p r t ( r − · (cid:26) P j V i − r if P j ∈ V i Sample Rotation : Fix a partition [ kℓ ] = F ′ ∪ F ′ ∪ · · · ∪ F ′ k into k parts each of size ℓ andcompute the matrix M R ∈ R m × kℓ where( M R ) F ′ i = ( M G ) F i H ⊤ r,t for each i ∈ [ k ]where Y F i denotes the submatrix of Y restricted to the columns with indices in F i .6. Permute and Output : Form X ∈ R d × n by choosing n columns of M R uniformly at random,randomly embedding the resulting matrix as m rows of X and sampling the remaining d − m rows of X i.i.d. from N (0 , I n ). Output the columns ( X , X , . . . , X n ) of X . Figure 2:
Reduction from k -partite planted dense subgraph to exactly imbalanced sparse Gaussian mixtures. F tr , to arrive at an instance of isgm . The matrix H r,t is a Grassmanian construction related toprojective constructions of block designs and is close to a Hadamard matrix when r = 2. Threeproperties of H r,t are essential to our reduction: (1) H r,t has orthonormal rows; (2) H r,t containsonly two distinct values; and (3) each column of H r,t has approximately an 1 /r fraction of its entriesnegative. These properties are established and used in the analysis of our reduction in Section 4.2.The next theorem encapsulates the total variation guarantees of the reduction k -pds-to-isgm .A key parameter is the prime number r , which is essential to our construction of the matrices H r,t .In applications of the theorem to robust sparse mean estimation, r will grow with n . To showthe tightest possible statistical-computational gaps for robust sparse mean estimation, we ideallywould want to take N such that N = Θ( kr t ). When r is growing with n , this induces numbertheoretic constraints on our choices of parameters that require careful attention. Because of this,we have kept the statement of our next theorem technically precise and in terms of all of thefree parameters of the reduction k -pds-to-isgm . Ignoring these number theoretic constraints, thereduction k -pds-to-isgm can be interpreted as essentially mapping an instance of k -pds ( N, k, p, q )with k = o ( √ N ) to isgm ( n, k, d, µ, ǫ ) where ǫ ∈ (0 ,
1) is arbitrary and can vary with n . The targetparameters n, d and µ satisfy that d = Ω( N ) , n = o ( ǫN ) and µ ≍ √ log N · r ǫkN All of our applications will handle the number theoretic constraints to set parameters so that theynearly satisfy these conditions. The slow-growing function w ( N ) is so that Step 6 subsamples theproduced samples by a large enough factor to enable an application of finite de Finetti’s theorem.Our lower bounds for robust sparse mean estimation, semirandom community recovery and univer-sality of lower bounds for sparse mixtures will set r to be growing, r = 3 and r = 2, respectively.We now state our total variation guarantees for k -pds-to-isgm . Theorem 4.2 (Reduction from k -pds to isgm ) . Let N be a parameter, r = r ( N ) ≥ be a primenumber and w ( N ) = ω (1) be a slow-growing function. Fix initial and target parameters as follows: • Initial k -pds Parameters: number of vertices N , dense subgraph size k that divides N , fixedconstant edge probabilities < q < p ≤ with q = N − O (1) and a partition E of [ N ] . • Target isgm
Parameters: ( n, k, d, µ, ǫ ) where ǫ = 1 /r and there is a parameter t = t ( N ) ∈ N such that wn ≤ k ( r t − r − , m ≤ d, kr t ≤ poly( N ) and0 ≤ µ ≤ δ p kmr t ) + 2 log( p − Q ) − · p r t ( r − where m is the smallest multiple of k larger than (cid:16) pQ + 1 (cid:17) N , where Q = 1 − p (1 − p )(1 − q )+ { p =1 } (cid:0) √ q − (cid:1) and δ = min n log (cid:16) pQ (cid:17) , log (cid:16) − Q − p (cid:17)o .Let A ( G ) denote k -pds-to-isgm applied to the graph G with these parameters. Then A runs inrandomized polynomial time and it follows that d TV ( A ( k -pds ( N, k, p, q )) , isgm ( n, k, d, µ, ǫ )) = O (cid:18) w − + k wN + k √ N + e − Ω( N /kn ) + N − (cid:19) under both H and H as N → ∞ . F of [ N ] with [ N ] = F ∪ F ∪ · · · ∪ F k , let U N ( F ) denotethe distribution of k -subsets of [ N ] formed by choosing one member element of each of F , F , . . . , F k uniformly at random. Let U N,k denote the uniform distribution on k -subsets of [ N ]. Let G ( n, S, p, q )denote the distribution of planted dense subgraph instances from G ( n, k, p, q ) conditioned on thesubgraph being planted on the vertex set S where | S | = k . Given S ⊆ [ m ], T ⊆ [ n ] and twodistributions P and Q over X , let M ( m, n, S, T, P , Q ) denote the distribution of matrices in X m × n with independent entries where M ij ∼ P if ( i, j ) ∈ S × T and M ij ∼ Q otherwise.For simplicity of notation, when either S or T is a distribution D on subsets of [ m ] or [ n ], welet this denote the mixture over M ( m, n, S, T, P , Q ) induced by sampling this set from D . We willadopt the same convention for S in G ( n, S, p, q ). We also let M ( n, S, P , Q ) be a shorthand for thedistribution when m = n and S = T . For simplicity, we also replace P and Q with p and q when P = Bern( p ) and Q = Bern( q ). Similarly, let V ( n, S, P , Q ) denote the distribution of vectors v in X n with independent entries such that v i ∼ P if i ∈ S and v i ∼ Q otherwise. We adopt theanalogous shorthands for V ( n, S, P , Q ). In this section, we present several reductions from [BBH18, BBH19, BB19] that are used as sub-routines in k -pds-to-isgm . We also introduce To- k -Partite-Submatrix , which is a modi-fied variant of the reduction To-Submatrix from [BBH19] that maps from the k -partite vari-ant of planted dense subgraph. We remark that the proof of the total variation guarantees of To- k -Partite-Submatrix is technically more involved than that of To-Submatrix in [BBH19].We begin with the subroutine
Graph-Clone, shown in Figure 3. This subroutine was introduced in [BBH19] and produces several independent samples from a planted subgraph problem given a single sample. Its properties as a Markov kernel are stated in the next lemma, which is proven by showing that the two explicit expressions for $\mathbb{P}[x_{ij} = v]$ in Step 1 define valid probability distributions and then explicitly writing the mass functions of $\mathcal{A}(G(n, q))$ and $\mathcal{A}(G(n, S, p, q))$.

Lemma 4.3 (Graph Cloning – Lemma 5.2 in [BBH19]). Let $t \in \mathbb{N}$, $0 < q < p \le 1$ and $0 < Q < P \le 1$ satisfy $\frac{1-p}{1-q} \le \left(\frac{1-P}{1-Q}\right)^t$ and $\left(\frac{P}{Q}\right)^t \le \frac{p}{q}$. Then the algorithm $\mathcal{A} = $ Graph-Clone runs in poly$(n)$ time and satisfies $\mathcal{A}(G(n, q)) \sim G(n, Q)^{\otimes t}$ and $\mathcal{A}(G(n, S, p, q)) \sim G(n, S, P, Q)^{\otimes t}$ for each $S \subseteq [n]$.

The next subroutine is Gaussianize from [BB19], shown in Figure 4, which maps a planted Bernoulli submatrix problem to a corresponding submatrix problem with independent Gaussian entries. To describe this subroutine, we will first need the univariate rejection kernel framework introduced in [BBH18]. A multivariate extension of this framework is used in [BBH19], but we will only require the univariate case in this section. The next lemma states the total variation guarantees of the Gaussian rejection kernels, which are also shown in Figure 4. The proof of this lemma consists of showing that the distributions of the outputs $\mathrm{rk}_G(\mu, \mathrm{Bern}(p))$ and $\mathrm{rk}_G(\mu, \mathrm{Bern}(q))$ are close to $N(\mu, 1)$ and $N(0, 1)$ conditioned to lie in the set of $x$ with $\frac{1-p}{1-q} \le \frac{\varphi_\mu(x)}{\varphi(x)} \le \frac{p}{q}$, and then showing that this event occurs with probability close to one. We will use the notation $\mathrm{rk}_G(B)$ to denote the random variable output by a run of the procedure $\mathrm{rk}_G$ using independently generated randomness.

Algorithm Graph-Clone
Inputs: Graph $G \in \mathcal{G}_n$, the number of copies $t$, parameters $0 < q < p \le 1$ and $0 < Q < P \le 1$ satisfying $\frac{1-p}{1-q} \le \left(\frac{1-P}{1-Q}\right)^t$ and $\left(\frac{P}{Q}\right)^t \le \frac{p}{q}$

1. Generate $x_{ij} \in \{0, 1\}^t$ for each $1 \le i < j \le n$ such that:

• If $\{i, j\} \in E(G)$, sample $x_{ij}$ from the distribution on $\{0, 1\}^t$ with
$$\mathbb{P}\left[x_{ij} = v\right] = \frac{1}{p - q}\left[(1-q) \cdot P^{|v|}(1-P)^{t - |v|} - (1-p) \cdot Q^{|v|}(1-Q)^{t - |v|}\right]$$

• If $\{i, j\} \notin E(G)$, sample $x_{ij}$ from the distribution on $\{0, 1\}^t$ with
$$\mathbb{P}\left[x_{ij} = v\right] = \frac{1}{p - q}\left[p \cdot Q^{|v|}(1-Q)^{t - |v|} - q \cdot P^{|v|}(1-P)^{t - |v|}\right]$$

2. Output the graphs $(G_1, G_2, \ldots, G_t)$ where $\{i, j\} \in E(G_k)$ if and only if $(x_{ij})_k = 1$.

Figure 3: Subroutine Graph-Clone for producing independent samples from planted graph problems, from [BBH19].
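To make Step 1 concrete, the following is a minimal Python sketch of the two cloning distributions above. It is illustrative only and not the paper's implementation: the function name is hypothetical, and the parameter choice $P = p$, $Q = 1 - \sqrt{(1-p)(1-q)}$, $t = 2$ is the one used later by To-k-Partite-Submatrix. The sketch tabulates both mass functions over $\{0,1\}^t$, checks they are valid probability distributions, and checks that mixing them with weights $(p, 1-p)$ recovers independent Bern$(P)$ coordinates.

```python
import itertools
import numpy as np

def clone_pmfs(p, q, P, Q, t):
    """Tabulate the two distributions on {0,1}^t used in Step 1 of Graph-Clone.

    Valid whenever (1-p)/(1-q) <= ((1-P)/(1-Q))**t and (P/Q)**t <= p/q.
    Returns (pmf_edge, pmf_nonedge) mapping each v in {0,1}^t to its mass.
    """
    pmf_edge, pmf_nonedge = {}, {}
    for v in itertools.product([0, 1], repeat=t):
        k = sum(v)
        a = P ** k * (1 - P) ** (t - k)   # product mass of Bern(P)^t at v
        b = Q ** k * (1 - Q) ** (t - k)   # product mass of Bern(Q)^t at v
        pmf_edge[v] = ((1 - q) * a - (1 - p) * b) / (p - q)
        pmf_nonedge[v] = (p * b - q * a) / (p - q)
    return pmf_edge, pmf_nonedge

if __name__ == "__main__":
    p, q, t = 0.9, 0.5, 2
    Q = 1 - np.sqrt((1 - p) * (1 - q))    # choice used by To-k-Partite-Submatrix when p < 1
    P = p
    edge, nonedge = clone_pmfs(p, q, P, Q, t)
    for pmf in (edge, nonedge):
        assert all(m >= -1e-12 for m in pmf.values())        # nonnegative masses
        assert abs(sum(pmf.values()) - 1.0) < 1e-12           # sums to one
    # Mixing with weights (p, 1-p) recovers Bern(P)^t exactly, which is why the
    # t output graphs are independent planted instances.
    for v in edge:
        k = sum(v)
        assert abs(p * edge[v] + (1 - p) * nonedge[v]
                   - P ** k * (1 - P) ** (t - k)) < 1e-12
    print("cloning distributions are valid for these parameters")
```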
Lemma 4.4 (Gaussian Rejection Kernels – Lemma 5.4 in [BBH18]). Let $n$ be a parameter and suppose that $p = p(n)$ and $q = q(n)$ satisfy $0 < q < p \le 1$, $\min(q, 1-q) = \Omega(1)$ and $p - q \ge n^{-O(1)}$. Let $\delta = \min\left\{ \log\left(\frac{p}{q}\right), \log\left(\frac{1-q}{1-p}\right) \right\}$. Suppose that $\mu = \mu(n) \in (0, 1)$ satisfies
$$\mu \le \frac{\delta}{\sqrt{6 \log n + 2 \log(p - q)^{-1}}}.$$
Then the map $\mathrm{rk}_G$ with $N = \left\lceil 6\delta^{-1} \log n \right\rceil$ iterations can be computed in $\mathrm{poly}(n)$ time and satisfies
$$d_{\mathrm{TV}}\left(\mathrm{rk}_G(\mu, \mathrm{Bern}(p)), \, N(\mu, 1)\right) = O(n^{-3}) \quad \text{and} \quad d_{\mathrm{TV}}\left(\mathrm{rk}_G(\mu, \mathrm{Bern}(q)), \, N(0, 1)\right) = O(n^{-3}).$$

We now state the total variation guarantees of Gaussianize. The instantiation of Gaussianize here generalizes that in [BB19] to rectangular matrices, but has the same proof. The procedure applies a Gaussian rejection kernel entrywise, and its total variation guarantees follow by applying the tensorization property of $d_{\mathrm{TV}}$ from Fact 3.2.

Lemma 4.5 (Gaussianization – Lemma 4.5 in [BB19]). Given parameters $m$ and $n$, let $0 < Q < P \le 1$ satisfy $\min(Q, 1-Q) = \Omega(1)$ and $P - Q \ge (mn)^{-O(1)}$, and let $\mu \in \mathbb{R}^{m \times n}$ be a matrix with entries $0 \le \mu_{ij} \le \tau$ where $\tau$
satisfies that τ ≤ δ p mn ) + 2 log( P − Q ) − where δ = min (cid:26) log (cid:18) PQ (cid:19) , log (cid:18) − Q − P (cid:19)(cid:27) The algorithm A = Gaussianize runs in poly( mn ) time and satisfies that d TV (cid:16) A ( M ( m, n, S, T, P, Q )) , µ ◦ S ⊤ T + N (0 , ⊗ m × n (cid:17) = O (cid:16) ( mn ) − / (cid:17) d TV (cid:0) A (cid:0) Bern( Q ) ⊗ m × n (cid:1) , N (0 , ⊗ m × n (cid:1) = O (cid:16) ( mn ) − / (cid:17) lgorithm rk G ( µ, B ) Parameters : Input B ∈ { , } , Bernoulli probabilities 0 < q < p ≤
1, Gaussian mean $\mu$, number of iterations $N$; let $\varphi_\mu(x) = \frac{1}{\sqrt{2\pi}} \cdot \exp\left(-\frac{(x - \mu)^2}{2}\right)$ denote the density of $N(\mu, 1)$

1. Initialize $z \leftarrow 0$.

2. Until $z$ is set or $N$ iterations have elapsed:

(1) Sample $z' \sim N(0, 1)$ independently.

(2) If $B = 0$: if the condition $p \cdot \varphi(z') \ge q \cdot \varphi_\mu(z')$ holds, then set $z \leftarrow z'$ with probability $1 - \frac{q \cdot \varphi_\mu(z')}{p \cdot \varphi(z')}$.

(3) If $B = 1$: if the condition $(1-q) \cdot \varphi_\mu(z' + \mu) \ge (1-p) \cdot \varphi(z' + \mu)$ holds, then set $z \leftarrow z' + \mu$ with probability $1 - \frac{(1-p) \cdot \varphi(z' + \mu)}{(1-q) \cdot \varphi_\mu(z' + \mu)}$.

3. Output $z$.
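As a concrete illustration of the rejection-kernel loop above, here is a minimal Python sketch. The helper name and parameter values are hypothetical and only for illustration; it follows the acceptance rules in Steps (2) and (3), mapping a bit drawn from the planted marginal Bern(p) to a sample that is approximately $N(\mu, 1)$ and a bit drawn from Bern(q) to approximately $N(0, 1)$, for $\mu$ small enough.

```python
import numpy as np

def rk_gaussian(B, p, q, mu, num_iters, rng):
    """Sketch of the Gaussian rejection kernel rk_G(mu, B) from the figure above."""
    phi = lambda x, m=0.0: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
    z = 0.0
    for _ in range(num_iters):
        zp = rng.standard_normal()
        if B == 0:
            # accept a N(0,1) proposal unless the likelihood ratio is too large
            if p * phi(zp) >= q * phi(zp, mu):
                if rng.random() < 1 - q * phi(zp, mu) / (p * phi(zp)):
                    z = zp
                    break
        else:
            # shift the proposal by mu and accept with the complementary rule
            if (1 - q) * phi(zp + mu, mu) >= (1 - p) * phi(zp + mu):
                if rng.random() < 1 - (1 - p) * phi(zp + mu) / ((1 - q) * phi(zp + mu, mu)):
                    z = zp + mu
                    break
    return z

# Example: bits drawn from Bern(p) map to samples with mean roughly mu,
# while bits drawn from Bern(q) map to samples with mean roughly 0.
rng = np.random.default_rng(0)
p, q, mu = 0.9, 0.5, 0.3
planted = [rk_gaussian(int(rng.random() < p), p, q, mu, 200, rng) for _ in range(2000)]
null = [rk_gaussian(int(rng.random() < q), p, q, mu, 200, rng) for _ in range(2000)]
print(round(float(np.mean(planted)), 2), round(float(np.mean(null)), 2))
```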
Algorithm Gaussianize
Parameters: Matrix $M \in \{0, 1\}^{m \times n}$, Bernoulli probabilities $0 < Q < P \le 1$ with $P - Q = (mn)^{-O(1)}$, and a target mean matrix with entries $0 \le \mu_{ij} \le \tau$ where $\tau > 0$

1. Form $X \in \mathbb{R}^{m \times n}$ by setting $X_{ij} \leftarrow \mathrm{rk}_G(\mu_{ij}, M_{ij})$ for each $(i, j) \in [m] \times [n]$, where each $\mathrm{rk}_G$ is run with $N_{\mathrm{it}} = \left\lceil 6\delta^{-1} \log(mn) \right\rceil$ iterations and $\delta = \min\left\{ \log\left(\frac{P}{Q}\right), \log\left(\frac{1-Q}{1-P}\right) \right\}$.

2. Output the matrix $X$.

Figure 4:
Gaussian instantiation of the rejection kernel algorithm from [BBH18] and the reduction
Gaussianize for mapping from Bernoulli to Gaussian planted submatrix problems from [BB19]. for all subsets S ⊆ [ m ] and T ⊆ [ n ] where ◦ denotes the Hadamard product between two matrices. We now introduce the procedure
To- k -Partite-Submatrix , which is shown in Figure 5. Thisreduction clones the upper half of the adjacency matrix of the input graph problem to producean independent lower half and plants diagonal entries while randomly embedding into a largermatrix to hide the diagonal entries in total variation. To- k -Partite-Submatrix is similar to To-Submatrix in [BBH19] and
To-Bernoulli-Submatrix in [BB19] but ensures that the ran-dom embedding step accounts for the k -partite promise of the input k -pds instance.We begin with the following lemma, which is a key computation in the proof of correctness21 lgorithm To- k -Partite-Submatrix Inputs : k -pds instance G ∈ G N with clique size k that divides N and partition E of [ N ], edgeprobabilities 0 < q < p ≤ q = N − O (1) and target dimension n ≥ (cid:16) pQ + 1 (cid:17) N where Q = 1 − p (1 − p )(1 − q ) + { p =1 } (cid:0) √ q − (cid:1) and k divides n
1. Apply
Graph-Clone to G with edge probabilities P = p and Q = 1 − p (1 − p )(1 − q ) + { p =1 } (cid:0) √ q − (cid:1) and t = 2 clones to obtain ( G , G ).2. Let F be a partition of [ n ] with [ n ] = F ∪ F ∪ · · · ∪ F k and | F i | = n/k . Form the matrix M PD ∈ { , } n × n as follows:(1) For each t ∈ [ k ], sample s t ∼ Bin(
N/k, p ) and s t ∼ Bin( n/k, Q ) and let S t be a subsetof F t with | S t | = N/k selected uniformly at random. Sample T t ⊆ S t and T t ⊆ F t \ S t with | T t | = s t and | T t | = max { s t − s t , } uniformly at random.(2) Now form the matrix M PD such that its ( i, j )th entry is( M PD ) ij = { π t ( i ) ,π t ( j ) }∈ E ( G ) if i < j and i, j ∈ S t { π t ( i ) ,π t ( j ) }∈ E ( G ) if i > j and i, j ∈ S t { i ∈ T t } if i = j and i, j ∈ S t { i ∈ T t } if i = j and i, j ∈ F t \ S t ∼ i.i.d. Bern( Q ) if i = j and ( i, j ) S t for a t ∈ [ k ]where π t : S t → E t is a bijection chosen uniformly at random.3. Output the matrix M PD and the partition F . Figure 5:
Subroutine
To- k -Partite-Submatrix for mapping from an instance of k -partite planted densesubgraph to a k -partite Bernoulli submatrix problem. for To- k -Partite-Submatrix . We remark that the total variation upper bound in this lemmais tight in the following sense. When all of the P i are the same, the expected value of the sum ofthe coordinates of the first distribution is k ( P i − Q ) higher than that of the second. The standarddeviation of the second sum is p kmQ (1 − Q ) and thus when k ( P i − Q ) ≫ mQ (1 − Q ), the totalvariation below tends to one. Lemma 4.6. If k, m ∈ N , P , P , . . . , P k ∈ [0 , and Q ∈ (0 , , then d TV (cid:16) ⊗ ki =1 (Bern( P i ) + Bin( m − , Q )) , Bin( m, Q ) ⊗ k (cid:17) ≤ vuut k X i =1 ( P i − Q ) mQ (1 − Q ) Proof.
Given some P ∈ [0 , χ (Bern( P ) + Bin( m − , Q ) , Bin( m, Q )).For notational convenience, let (cid:0) ab (cid:1) = 0 if b > a or b <
0. It follows that1 + χ (Bern( P ) + Bin( m − , Q ) , Bin( m, Q ))22 m X t =0 (cid:16) (1 − P ) · (cid:0) m − t (cid:1) Q t (1 − Q ) m − − t + P · (cid:0) m − t − (cid:1) Q t − (1 − Q ) m − t (cid:17) (cid:0) mt (cid:1) Q t (1 − Q ) m − t = m X t =0 (cid:18) mt (cid:19) Q t (1 − Q ) m − t (cid:18) m − tm · − P − Q + tm · PQ (cid:19) = E "(cid:18) m − Xm · − P − Q + Xm · PQ (cid:19) = E "(cid:18) X − mQm · P − QQ (1 − Q ) (cid:19) = 1 + 2( P − Q ) mQ (1 − Q ) · E [ X − mQ ] + ( P − Q ) m Q (1 − Q ) · E (cid:2) ( X − Qm ) (cid:3) = 1 + ( P − Q ) mQ (1 − Q )where X ∼ Bin( m, Q ) and the second last equality follows from E [ X ] = Qm and E [( X − Qm ) ] =Var[ X ] = Q (1 − Q ) m . The concavity of log implies that d KL ( P , Q ) ≤ log (cid:0) χ ( P , Q ) (cid:1) ≤ χ ( P , Q )for any two distributions with P absolutely continuous with respect to Q . Pinsker’s inequality andtensorization of d KL now imply that2 · d TV (cid:16) ⊗ ki =1 (Bern( P i ) + Bin( m − , Q )) , Bin( m, Q ) ⊗ k (cid:17) ≤ d KL (cid:16) ⊗ ki =1 (Bern( P i ) + Bin( m − , Q )) , Bin( m, Q ) ⊗ k (cid:17) = k X i =1 d KL (Bern( P i ) + Bin( m − , Q ) , Bin( m, Q )) ≤ k X i =1 χ (Bern( P i ) + Bin( m − , Q ) , Bin( m, Q )) = k X i =1 ( P i − Q ) mQ (1 − Q )which completes the proof of the lemma.We now use this lemma to establish an analogue of Lemma 6.4 from [BBH19] in the k -partitecase to analyze the planted diagonal entries in Step 2 of To- k -Partite-Submatrix . Lemma 4.7 (Planting k -Partite Diagonals) . Suppose that < Q < P ≤ and n ≥ (cid:16) PQ + 1 (cid:17) N issuch that both N and n are divisible by k and k ≤ QN/ . Suppose that for each t ∈ [ k ] , z t ∼ Bern( P ) , z t ∼ Bin(
N/k − , P ) and z t ∼ Bin( n/k, Q ) are independent. If z t = max { z t − z t − z t , } , then it follows that d TV (cid:16) ⊗ kt =1 L ( z t , z t + z t ) , (Bern( P ) ⊗ Bin( n/k − , Q )) ⊗ k (cid:17) ≤ k · exp (cid:18) − Q N P kn (cid:19) + r C Q k nd TV (cid:16) ⊗ kt =1 L ( z t + z t + z t ) , Bin( n/k, Q ) ⊗ k (cid:17) ≤ k · exp (cid:18) − Q N P kn (cid:19) where C Q = max n Q − Q , − QQ o . roof. Throughout this argument, let v denote a vector in { , } k . Now define the event E = k \ t =1 (cid:8) z t = z t + z t + z t (cid:9) Now observe that if z t ≥ Qn/k − QN/ k + 1 and z t ≤ P ( N/k −
1) +
QN/ k then it follows that z t ≥ z t ≥ v t + z t for any v t ∈ { , } since Qn ≥ ( P + Q ) N . Now union bounding the probabilitythat E does not hold conditioned on z yields that P h E C (cid:12)(cid:12)(cid:12) z = v i ≤ k X t =1 P (cid:2) z t < v t + z t (cid:3) ≤ k X t =1 P (cid:20) z t < Qnk − QN k + 1 (cid:21) + k X t =1 P (cid:20) z t > P (cid:18) Nk − (cid:19) + QN k (cid:21) ≤ k · exp − ( QN/ k − Qn/k ! + k · exp − ( QN/ k ) P ( N/k − ! ≤ k · exp (cid:18) − Q N P kn (cid:19) where the third inequality follows from standard Chernoff bounds on the tails of the binomialdistribution. Marginalizing this bound over v ∼ L ( z ) = Bern( P ) ⊗ k , we have that P (cid:2) E C (cid:3) = E v ∼L ( z ) P h E C (cid:12)(cid:12)(cid:12) z = v i ≤ k · exp (cid:18) − Q N P kn (cid:19)
Now consider the total variation error induced by conditioning each of the product measures ⊗ kt =1 L ( z t + z t + z t ) and ⊗ kt =1 L ( z t ) on the event E . Note that under E , by definition, we havethat z t = z t + z t + z t for each t ∈ [ k ]. By the conditioning property of d TV in Fact 3.2, we have d TV (cid:16) ⊗ kt =1 L ( z t + z t + z t ) , L (cid:16)(cid:0) z t : t ∈ [ k ] (cid:1) (cid:12)(cid:12)(cid:12) E (cid:17)(cid:17) ≤ P (cid:2) E C (cid:3) d TV (cid:16) ⊗ kt =1 L ( z t ) , L (cid:16)(cid:0) z t : t ∈ [ k ] (cid:1) (cid:12)(cid:12)(cid:12) E (cid:17)(cid:17) ≤ P (cid:2) E C (cid:3) The fact that ⊗ kt =1 L ( z t ) = Bin( n/k, Q ) ⊗ k and the triangle inequality now imply that d TV (cid:16) ⊗ kt =1 L ( z t + z t + z t ) , Bin( n/k, Q ) ⊗ k (cid:17) ≤ · P (cid:2) E C (cid:3) ≤ k · exp (cid:18) − Q N P kn (cid:19) which proves the second inequality in the statement of the lemma. It suffices to establish the firstinequality. A similar conditioning step as above shows that for all v ∈ { , } k , we have that d TV (cid:16) ⊗ kt =1 L (cid:16) v t + z t + z t (cid:12)(cid:12)(cid:12) z t = v t (cid:17) , L (cid:16)(cid:0) v t + z t + z t : t ∈ [ k ] (cid:1) (cid:12)(cid:12)(cid:12) z = v and E (cid:17)(cid:17) ≤ P h E C (cid:12)(cid:12)(cid:12) z = v i d TV (cid:16) ⊗ kt =1 L (cid:16) z t (cid:12)(cid:12)(cid:12) z t = v t (cid:17) , L (cid:16)(cid:0) z t : t ∈ [ k ] (cid:1) (cid:12)(cid:12)(cid:12) z = v and E (cid:17)(cid:17) ≤ P h E C (cid:12)(cid:12)(cid:12) z = v i The triangle inequality and the fact that z ∼ Bin( n/k, Q ) ⊗ k is independent of z implies that d TV (cid:16) ⊗ kt =1 L (cid:16) v t + z t + z t (cid:12)(cid:12)(cid:12) z t = v t (cid:17) , Bin( n/k, Q ) ⊗ k (cid:17) ≤ k · exp (cid:18) − Q N P kn (cid:19)
24y Lemma 4.6 applied with P t = v t ∈ { , } , we also have that d TV (cid:16) ⊗ kt =1 ( v t + Bin( n/k − , Q )) , Bin( n/k, Q ) ⊗ k (cid:17) ≤ vuut k X t =1 k ( v t − Q ) nQ (1 − Q ) ≤ r C Q k n The triangle now implies that for each v ∈ { , } k , d TV (cid:16) ⊗ kt =1 L (cid:16) z t + z t (cid:12)(cid:12)(cid:12) z t = v t (cid:17) , Bin( n/k − , Q ) ⊗ k (cid:17) = d TV (cid:16) ⊗ kt =1 L (cid:16) v t + z t + z t (cid:12)(cid:12)(cid:12) z t = v t (cid:17) , ⊗ kt =1 ( v t + Bin( n/k − , Q )) (cid:17) ≤ k · exp (cid:18) − Q N P kn (cid:19) + r C Q k n We now marginalize over v ∼ L ( z ) = Bern( P ) ⊗ k . The conditioning on a random variable propertyof d TV in Fact 3.2 implies that d TV (cid:16) ⊗ kt =1 L ( z t , z t + z t ) , (Bern( P ) ⊗ Bin( n/k − , Q )) ⊗ k (cid:17) ≤ E v ∼ Bern( P ) ⊗ k d TV (cid:16) ⊗ kt =1 L (cid:16) z t + z t (cid:12)(cid:12)(cid:12) z t = v t (cid:17) , Bin( n/k − , Q ) ⊗ k (cid:17) which, when combined with the inequalities above, completes the proof of the lemma.We now combine these lemmas to analyze To- k -Partite-Submatrix . The next lemma is a k -partite variant of Theorem 6.1 in [BBH19] and involves several technical subtleties that do notarise in the non k -partite case. After applying Graph-Clone , the adjacency matrix of the inputgraph G is still missing its diagonal entries. The main difficulty in producing these diagonal entriesis to ensure that entries corresponding to vertices in the planted subgraph are properly sampledfrom Bern( p ). To do this, we randomly embed the original N × N adjacency matrix in a larger n × n matrix with i.i.d. entries from Bern( Q ) and sample all diagonal entries corresponding to entries ofthe original matrix from Bern( p ). The diagonal entries in the new n − N columns are chosen sothat the supports on the diagonals within each F t each have size Bin( n/k, Q ). Even though thiscauses the sizes of the supports on the diagonals in each F t to have the same distribution underboth H and H , the randomness of the embedding and the fact that k = o ( √ n ) ensures that thisis hidden in total variation. Showing this involves some technical subtleties captured in the abovetwo lemmas and the next lemma. Lemma 4.8 (Reduction to k -Partite Bernoulli Submatrix Problems) . Let < q < p ≤ and Q = 1 − p (1 − p )(1 − q ) + { p =1 } (cid:0) √ q − (cid:1) . Suppose that n and N are such that n ≥ (cid:18) pQ + 1 (cid:19) N and k ≤ QN/ Also suppose that q = N − O (1) and both N and n are divisible by k . Let E = ( E , E , . . . , E k ) and F = ( F , F , . . . , F k ) be partitions of [ N ] and [ n ] , respectively. Then it follows that the algorithm A = To- k -Partite-Submatrix runs in poly( N ) time and satisfies d TV ( A ( G ( N, U N ( E ) , p, q )) , M ( n, U n ( F ) , p, Q )) ≤ k · exp (cid:18) − Q N pkn (cid:19) + r C Q k nd TV (cid:0) A ( G ( N, q )) , Bern ( Q ) ⊗ n × n (cid:1) ≤ k · exp (cid:18) − Q N pkn (cid:19) where C Q = max n Q − Q , − QQ o . roof. Fix some subset R ⊆ [ N ] such that | R ∩ E i | = 1 for each i ∈ [ k ]. We will first show that A maps an input G ∼ G ( N, R, p, q ) approximately in total variation to M ( n, U n ( F ) , p, Q ). ByAM-GM, we have that √ pq ≤ p + q − (1 − p ) + (1 − q )2 ≤ − p (1 − p )(1 − q )If p = 1, it follows that P = p > Q = 1 − p (1 − p )(1 − q ). This implies that − p − q = (cid:16) − P − Q (cid:17) andthe inequality above rearranges to (cid:16) PQ (cid:17) ≤ pq . If p = 1, then Q = √ q and (cid:16) PQ (cid:17) = pq . Furthermore,the inequality − p − q ≤ (cid:16) − P − Q (cid:17) holds trivially. 
Therefore we may apply Lemma 4.3, which impliesthat ( G , G ) ∼ G ( N, R, p, Q ) ⊗ .Let the random set U = { π − ( R ∩ E ) , π − ( R ∩ E ) , . . . , π − k ( R ∩ E k ) } denote the support ofthe k -subset of [ n ] that R is mapped to in the embedding step of To- k -Partite-Submatrix .Now fix some k -subset R ′ ⊆ [ n ] with | R ′ ∩ F i | = 1 for each i ∈ [ k ] and consider the distri-bution of M PD conditioned on the event U = R ′ . Since ( G , G ) ∼ G ( n, R, p, Q ) ⊗ , Step 2 of To- k -Partite-Submatrix ensures that the off-diagonal entries of M PD , given this conditioning,are independent and distributed as follows: • M ij ∼ Bern( p ) if i = j and i, j ∈ R ′ ; and • M ij ∼ Bern( Q ) if i = j and i R ′ or j R ′ .which match the corresponding entries of M ( n, R ′ , p, Q ). Furthermore, these entries are indepen-dent of the vector diag( M PD ) = (( M PD ) ii : i ∈ [ k ]) of the diagonal entries of M PD . It thereforefollows that d TV (cid:16) L (cid:16) M PD (cid:12)(cid:12)(cid:12) U = R ′ (cid:17) , M ( n, R ′ , p, Q ) (cid:17) = d TV (cid:16) L (cid:16) diag( M PD ) (cid:12)(cid:12)(cid:12) U = R ′ (cid:17) , V ( n, R ′ , p, Q ) (cid:17) Let ( S ′ , S ′ , . . . , S ′ k ) be any tuple of fixed subsets such that | S ′ t | = N/k , S ′ i ⊆ F t and R ′ ∩ F t ∈ S ′ t for each t ∈ [ k ]. Now consider the distribution of diag( M PD ) conditioned on both U = R ′ and( S , S , . . . , S k ) = ( S ′ , S ′ , . . . , S ′ k ). It holds by construction that the k vectors diag( M PD ) F t areindependent for t ∈ [ k ] and each distributed as follows: • diag( M PD ) S ′ t is an exchangeable distribution on { , } N/k with support of size s t ∼ Bin(
N/k, p ),by construction. This implies that diag( M PD ) S ′ t ∼ Bern( p ) ⊗ N/k . This can trivially be restatedas (cid:16) M R ′ ∩ F t ,R ′ ∩ F t , diag( M PD ) S ′ t \ R ′ (cid:17) ∼ Bern( p ) ⊗ Bern( p ) ⊗ N/k − . • diag( M PD ) F t \ S ′ t is an exchangeable distribution on { , } N/k with support of size z t = max { s t − s t , } . Furthermore, diag( M PD ) F t \ S ′ t is independent of diag( M PD ) S ′ t .For each t ∈ [ k ], let z t = M R ′ ∩ F t ,R ′ ∩ F t ∼ Bern( p ) and z t ∼ Bin(
N/k − , p ) be the size of thesupport of diag( M PD ) S ′ t \ R ′ . As shown discussed in the first point above, we have that z t and z t are independent and z t + z t = s t .Now consider the distribution of diag( M PD ) relaxed to only be conditioned on U = R ′ , and nolonger on ( S , S , . . . , S k ) = ( S ′ , S ′ , . . . , S ′ k ). Conditioned on U = R ′ , the S t are independent andeach uniformly distributed among all N/k size subsets of F t that contain the element R ′ ∩ F t . Inparticular, this implies that the distribution of diag( M PD ) F t \ R ′ is an exchangeable distribution on { , } n/k − with support size z t + z t for each t . Note that any v ∼ V ( n, R ′ , p, Q ) also satisfies that26 F t \ R ′ is exchangeable. This implies that V ( n, R ′ , p, Q ) and diag( M PD ) are identically distributedwhen conditioned on their entries with indices in R ′ and on their support sizes within the k sets ofindices F t \ R ′ . The conditioning property of Fact 3.2 therefore implies that d TV (cid:16) L (cid:16) diag( M PD ) (cid:12)(cid:12)(cid:12) U = R ′ (cid:17) , V ( n, R ′ , p, Q ) (cid:17) ≤ d TV (cid:16) ⊗ kt =1 L ( z t , z t + z t ) , (Bern( p ) ⊗ Bin( n/k − , Q )) ⊗ k (cid:17) ≤ k · exp (cid:18) − Q N P kn (cid:19) + r C Q k n by the first inequality in Lemma 4.7. Now observe that U ∼ U n ( F ) and thus marginalizing over R ′ ∼ L ( U ) = U n ( F ) and applying the conditioning property of Fact 3.2 yields that d TV ( A ( G ( N, R, p, q )) , M ( n, U n ( F ) , p, Q )) ≤ E R ′ ∼U n ( F ) d TV (cid:16) L (cid:16) M PD (cid:12)(cid:12)(cid:12) U = R ′ (cid:17) , M ( n, R ′ , p, Q ) (cid:17) since M PD ∼ A ( G ( N, R, p, q )). Applying an identical marginalization over R ∼ U N ( E ) completesthe proof of the first inequality in the lemma statement.It suffices to consider the case where G ∼ G ( N, q ), which follows from an analogous but simplerargument. By Lemma 4.3, we have that ( G , G ) ∼ G ( N, Q ) ⊗ . It follows that the entries of M PD are distributed as ( M PD ) ij ∼ i.i.d. Bern( Q ) for all i = j independently of diag( M PD ). Nownote that the k vectors diag( M PD ) F t for t ∈ [ k ] are each exchangeable and have support size s t + max { s t − s t , } = z t + z t + z t where z t ∼ Bern( p ), z t ∼ Bin(
N/k − , p ) and s t ∼ Bin( n/k, Q )are independent. By the same argument as above, we have that d TV (cid:0) L ( M PD ) , Bern( Q ) ⊗ n × n (cid:1) = d TV (cid:0) L (diag( M PD )) , Bern( Q ) ⊗ n (cid:1) = d TV (cid:16) ⊗ kt =1 L (cid:0) z t + z t + z t (cid:1) , Bin( n/k, Q ) (cid:17) ≤ k · exp (cid:18) − Q N P kn (cid:19) by Lemma 4.7. Since M PD ∼ A ( G ( N, q )), this completes the proof of the lemma.
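Before moving on, the $\chi^2$ identity at the heart of Lemma 4.6, which drove the total variation bounds in the two lemmas above, is easy to check numerically. The sketch below is only an illustrative check under the assumption that numpy and scipy are available; it computes $1 + \chi^2(\mathrm{Bern}(P) + \mathrm{Bin}(m-1, Q), \mathrm{Bin}(m, Q))$ exactly from the probability mass functions and compares it with $1 + (P-Q)^2 / (mQ(1-Q))$.

```python
import numpy as np
from scipy.stats import binom

def chi_square_vs_formula(P, Q, m):
    """Compare chi^2(Bern(P) + Bin(m-1, Q), Bin(m, Q)) with (P - Q)^2 / (m Q (1 - Q))."""
    t = np.arange(m + 1)
    # pmf of Bern(P) + Bin(m-1, Q) on {0, ..., m}; binom.pmf returns 0 outside its support
    mix = (1 - P) * binom.pmf(t, m - 1, Q) + P * binom.pmf(t - 1, m - 1, Q)
    ref = binom.pmf(t, m, Q)
    chi_sq = np.sum(mix ** 2 / ref) - 1.0
    return chi_sq, (P - Q) ** 2 / (m * Q * (1 - Q))

if __name__ == "__main__":
    lhs, rhs = chi_square_vs_formula(P=0.9, Q=0.3, m=50)
    print(lhs, rhs)   # the two values agree up to floating-point error
```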
In this section, we analyze the matrices $H_{r,t}$ constructed based on the incidences between points and hyperplanes in $\mathbb{F}_r^t$. The definition of $H_{r,t}$ can be found in Step 4 of k-pds-to-isgm in Figure 2. We remark that a classic trick counting the number of ordered $d$-tuples of linearly independent vectors in $\mathbb{F}_r^t$ shows that the number of $d$-dimensional subspaces of $\mathbb{F}_r^t$ is
$$|\mathrm{Gr}(d, \mathbb{F}_r^t)| = \frac{(r^t - 1)(r^t - r) \cdots (r^t - r^{d-1})}{(r^d - 1)(r^d - r) \cdots (r^d - r^{d-1})}$$
This implies that the number of hyperplanes in $\mathbb{F}_r^t$ is $\ell = \frac{r^t - 1}{r - 1}$, which justifies that the number of rows of $H_{r,t}$ is as described in k-pds-to-isgm. The matrices $H_{r,t}$ are used to rotate the Gaussianized submatrix produced from k-pds in Step 5 of k-pds-to-isgm to produce the exactly imbalanced mixture structure. The crucial properties of $H_{r,t}$ are that they have orthogonal rows, are binary in the sense that they contain two distinct real values, and contain a fraction of approximately $1/r$ negative values per column. All three properties are essential to the correctness of the reduction k-pds-to-isgm. We establish these properties in the simple lemma below.

Lemma 4.9 (Imbalanced Binary Orthogonal Matrices). If $t \ge 2$ and $r \ge 2$ is prime, then the $\left(\frac{r^t - 1}{r - 1}\right) \times r^t$ real matrix $H_{r,t}$ has orthonormal rows, and each column of $H_{r,t}$ other than the column corresponding to $P_i = 0$ contains exactly $\frac{r^{t-1} - 1}{r - 1}$ entries equal to $\frac{1 - r}{\sqrt{r^t(r - 1)}}$.

Proof. Let $r_i$ denote the $i$th row of $H_{r,t}$. First observe that
$$\|r_i\|_2^2 = (r^t - |V_i|) \cdot \frac{1}{r^t(r-1)} + |V_i| \cdot \frac{(1-r)^2}{r^t(r-1)} = 1$$
since $|V_i| = r^{t-1}$. Furthermore, $V_i \cap V_j$ is a $(t-2)$-dimensional subspace of $\mathbb{F}_r^t$ if $i \ne j$, which implies that $|V_i \cap V_j| = r^{t-2}$ and $|V_i \cup V_j| = |V_i| + |V_j| - |V_i \cap V_j| = 2r^{t-1} - r^{t-2}$. Therefore if $i \ne j$,
$$\langle r_i, r_j \rangle = (r^t - |V_i \cup V_j|) \cdot \frac{1}{r^t(r-1)} + (|V_i \cup V_j| - |V_i \cap V_j|) \cdot \frac{1-r}{r^t(r-1)} + |V_i \cap V_j| \cdot \frac{(1-r)^2}{r^t(r-1)} = \frac{r^{t-2}(r-1)^2 - 2r^{t-2}(r-1)^2 + r^{t-2}(r-1)^2}{r^t(r-1)} = 0$$
which shows that the rows of $H_{r,t}$ are orthonormal. Fix any two nonzero vectors $P_i, P_j \in \mathbb{F}_r^t$ and consider any invertible linear transformation $M : \mathbb{F}_r^t \to \mathbb{F}_r^t$ such that $M(P_i) = P_j$. The map $M$ induces a bijection between the hyperplanes containing $P_i$ and the hyperplanes containing $P_j$. In particular, it follows that each nonzero point in $\mathbb{F}_r^t$ is contained in the same number of hyperplanes. Each hyperplane contains $r^{t-1} - 1$ nonzero points, so the total number of incidences between nonzero points and hyperplanes is $(r^{t-1} - 1) \cdot \frac{r^t - 1}{r - 1}$, which implies that the number per nonzero point is $\frac{r^{t-1} - 1}{r - 1}$. Each such incidence corresponds to a negative entry in the column for that point in $H_{r,t}$, implying that the column for each nonzero point $P_i$ contains exactly $\frac{r^{t-1} - 1}{r - 1}$ negative entries.
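The properties in Lemma 4.9 can also be verified directly for small parameters. The sketch below is an illustrative rendering, under the assumption that $H_{r,t}$ is the point/hyperplane incidence matrix described above, with entry $(1-r)/\sqrt{r^t(r-1)}$ when the column's point lies on the row's hyperplane through the origin and $1/\sqrt{r^t(r-1)}$ otherwise; it then checks orthonormality and the per-column counts of negative entries.

```python
import itertools
import numpy as np

def build_H(r, t):
    """Build the ((r^t - 1)/(r - 1)) x r^t matrix H_{r,t} from incidences in F_r^t."""
    points = list(itertools.product(range(r), repeat=t))
    # One linear functional per hyperplane through the origin: nonzero vectors up to
    # scaling, normalized so the first nonzero coordinate equals 1 (r is prime).
    functionals = [a for a in points
                   if any(a) and a[next(i for i, x in enumerate(a) if x)] == 1]
    scale = 1.0 / np.sqrt(r ** t * (r - 1))
    H = np.empty((len(functionals), len(points)))
    for i, a in enumerate(functionals):
        for j, x in enumerate(points):
            on_hyperplane = sum(ai * xi for ai, xi in zip(a, x)) % r == 0
            H[i, j] = (1 - r) * scale if on_hyperplane else scale
    return H, points

if __name__ == "__main__":
    r, t = 3, 2
    H, points = build_H(r, t)
    assert H.shape == ((r ** t - 1) // (r - 1), r ** t)
    assert np.allclose(H @ H.T, np.eye(H.shape[0]))     # orthonormal rows
    neg_counts = (H < 0).sum(axis=0)
    zero_col = points.index(tuple([0] * t))
    expected = (r ** (t - 1) - 1) // (r - 1)
    assert all(neg_counts[j] == expected for j in range(len(points)) if j != zero_col)
    assert neg_counts[zero_col] == H.shape[0]           # zero lies on every hyperplane
    print("checks passed:", H.shape)
```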
We now proceed to establish the total variation guarantees for sample rotation and subsampling as in Steps 5 and 6 in k-pds-to-isgm, using these properties of $H_{r,t}$. In the rest of this section, let $\mathcal{A}$ denote the reduction k-pds-to-isgm with input $(G, E)$, where $E$ is a partition of $[N]$, and output $(X_1, X_2, \ldots, X_n)$. We will need the following convenient upper bound on the total variation between two binomial distributions.

Lemma 4.10. Given $P \in [0, 1]$, $Q \in (0, 1)$ and $n \in \mathbb{N}$, it follows that
$$d_{\mathrm{TV}}\left(\mathrm{Bin}(n, P), \mathrm{Bin}(n, Q)\right) \le |P - Q| \cdot \sqrt{\frac{n}{2Q(1-Q)}}$$

Proof.
By applying the data processing inequality for d TV to the function taking the sum of thecoordinates of a vector, we have that2 · d TV (Bin( n, P ) , Bin( n, Q )) ≤ · d TV (cid:0) Bern( P ) ⊗ n , Bern( Q ) ⊗ n (cid:1) ≤ d KL (cid:0) Bern( P ) ⊗ n , Bern( Q ) ⊗ n (cid:1) = n · d KL (Bern( P ) , Bern( Q )) ≤ n · χ (Bern( P ) , Bern( Q ))= n · ( P − Q ) Q (1 − Q )The second inequality is an application of Pinsker’s, the first equality is tensorization of d KL andthe third inequality is the fact that χ upper bounds d KL by the concavity of log. This completesthe proof of the lemma. 28et Hyp( N, K, n ) denote a hypergeometric distribution with n draws from a population ofsize N with K success states. We will also need the upper bound on the total variation betweenhypergeometric and binomial distributions given by d TV (Hyp( N, K, n ) , Bin( n, K/N )) ≤ nN This bound is a simple case of finite de Finetti’s theorem and is proven in Theorem (4) in [DF80].The following lemma analyzes Steps 5 and 6 of A . Lemma 4.11 (Sample Rotation) . Let F be a fixed partition of [ kr t ] into k parts of size r t and let S ⊆ [ m ] be a fixed k -subset. Let A denote Steps 5 and 6 of k -pds-to-isgm with input M G andoutput ( X , X , . . . , X n ) . Then for all τ ∈ R , d TV (cid:16) A (cid:16) τ · S ⊤U krt ( F ) + N (0 , ⊗ m × kr t (cid:17) , isgm H ( n, k, d, µ, ǫ ) (cid:17) ≤ w + k wn + kw p n ( r − where ǫ = 1 /r and µ = τ √ r t ( r − . Furthermore, it holds that A (cid:16) N (0 , ⊗ m × kr t (cid:17) ∼ N (0 , I d ) ⊗ n .Proof. Let T be a fixed k -subset of [ kr t ] such that | T ∩ F i | = 1 for each i ∈ [ k ]. Let ℓ = r t − r − and F ′ be a fixed partition of [ kℓ ] into k parts of size ℓ . We first consider the case where the input M G to A is of the form M G = τ · S ⊤ T + G where G ∼ N (0 , ⊗ m × kr t Since M G has independent entries, the submatrices ( M G ) F i for each i ∈ [ k ] are independent. Nowobserve that if ( H r,t ) j denotes the j th column of H r,t , then we have that( M R ) F ′ i = ( M G ) F i H ⊤ r,t = τ · S ⊤ T ∩ F i H ⊤ r,t + G F i H ⊤ r,t ∼ L (cid:16) τ · S ( H r,t ) ⊤ T ∩ F i + N (0 , ⊗ m × ℓ (cid:17) The distribution statement above follows from the joint Gaussianity and isotropy of the rows of G F i . More precisely, the entries of G F i H ⊤ r,t are linear combinations of the entries of G F i , whichimplies that they are jointly Gaussian. The fact that the rows of H r,t are orthonormal impliesthat the entries of G F i H ⊤ r,t are uncorrelated and each have unit variance. Therefore it follows that G F i H ⊤ r,t ∼ N (0 , ⊗ m × ℓ . If h T,F,F ′ ∈ R kℓ denotes the vector with ( h T,F,F ′ ) F ′ i = ( H r,t ) T ∩ F i for each i ∈ [ k ], then it follows that M R ∼ L (cid:16) τ · S h ⊤ T,F,F ′ + N (0 , ⊗ m × kℓ (cid:17) Observe that the columns of M R are independent and either distributed according N ( µ · S , I m )or N ( µ ′ · S , I m ) where µ ′ = τ (1 − r ) / p r t ( r −
1) depending on whether the entry of h T,F,F ′ at theindex corresponding to the column is 1 / p r t ( r −
1) or (1 − r ) / p r t ( r − s T,F denote the number of entries of h T,F,F ′ that are equal to 1 / p r t ( r − R n ( s ) to be the distribution on R n with a sample v ∼ R n ( s ) generated by first choosing an s -subset U of [ n ] uniformly at random and then setting v i = 1 / p r t ( r −
1) if i ∈ U and v i =(1 − r ) / p r t ( r −
1) if i U . Note that the number of columns distributed as N ( µ · S , I m ) in M R chosen to be in X is distributed according to Hyp( kℓ, s T,F , n ). Step 6 of A therefore ensures that M R ∼ L (cid:16) τ · U d,k R n (Hyp( kℓ, s T,F , n )) ⊤ + N (0 , ⊗ d × n (cid:17) isgm H ( n, k, d, µ, ǫ ) can be expressed similarly as isgm H ( n, k, d, µ, ǫ ) = L (cid:16) τ · U d,k R n (Bin( n, − ǫ )) ⊤ + N (0 , ⊗ d × n (cid:17) where again we set µ = τ / p r t ( r − M G where T ∼ U kr t ( F ).It follows that the output M R under this input is distributed as A (cid:16) τ · S ⊤U krt ( F ) + N (0 , ⊗ m × kr t (cid:17) ∼ L (cid:18) τ · U d,k R n (cid:16) Hyp (cid:16) kℓ, s U krt ( F ) ,F , n (cid:17)(cid:17) ⊤ + N (0 , ⊗ d × n (cid:19) The conditioning property of d TV in Fact 3.2 now implies that d TV (cid:16) A (cid:16) τ · S ⊤U krt ( F ) + N (0 , ⊗ m × kr t (cid:17) , isgm H ( n, k, d, µ, ǫ ) (cid:17) ≤ d TV (cid:16) Bin( n, − ǫ ) , Hyp (cid:16) kℓ, s U krt ( F ) ,F , n (cid:17)(cid:17) By Lemma 4.9, ( H r,t ) T ∩ F i contains r t − r − − r t − − r − = r t − entries equal to 1 / p r t ( r −
1) as long T ∩ F i is not the column index corresponding to the zero point in F tr . If it does correspond tozero, then ( H r,t ) T ∩ F i contains no entries equal to 1 / p r t ( r − T ∼ U kr t ( F ),then T ∩ F i corresponds to zero with probability 1 /r t for each i ∈ [ k ]. Furthermore, these eventsare independent. This implies that r − t · s U krt ( F ) ,F is the number of F i such that T ∩ F i does notcorrespond to zero, and therefore distributed as r − t · s U krt ( F ) ,F ∼ Bin( k, − /r t ). Thus P h s U krt ( F ) ,F = kr t − i = (cid:18) − r t (cid:19) k ≥ − kr t Applying the conditioning on an event property of d TV from Fact 3.2 now yields that d TV (cid:16) Hyp (cid:16) kℓ, s U krt ( F ) ,F , n (cid:17) , Hyp (cid:0) kℓ, kr t − , n (cid:1)(cid:17) ≤ P h s U krt ( F ) ,F = kr t − i ≤ kr t ≤ k wn since wn ≤ kℓ by definition and ℓ ≤ r t . By the application of Theorem (4) in [DF80] to hypergeo-metric distributions above, we also have d TV (cid:0) Hyp( kℓ, kr t − , n ) , Bin( n, r t − /ℓ ) (cid:1) ≤ nkℓ ≤ w Recall that ǫ = 1 /r and note that Lemma 4.10 implies that d TV (cid:0) Bin( n, r t − /ℓ ) , Bin( n, − ǫ ) (cid:1) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) r t − r t − r − − (cid:18) − r (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · s nr r − r t − · r n ( r − ℓ · r n r − ≤ kw p n ( r − wn ≤ kℓ . Applying the triangle inequality now yields that d TV (cid:16) Bin( n, − ǫ ) , Hyp (cid:16) kℓ, s U krt ( F ) ,F , n (cid:17)(cid:17) ≤ w + k wn + kw p n ( r − τ = 0. It follows that A (cid:16) N (0 , ⊗ m × kr t (cid:17) ∼ N (0 , ⊗ d × n = N (0 , I d ) ⊗ n which completes the proof of the lemma. 30e now combine these lemmas to complete the proof of Theorem 4.2. Proof of Theorem 4.2.
We apply Lemma 3.3 to the steps A i of A under each of H and H to proveTheorem 4.2. Define the steps of A to map inputs to outputs as follows( G, E ) A −−→ ( M PD1 , F ′ ) A −−→ ( M PD2 , F ) A −−→ ( M G , F ) A −−−→ ( X , X , . . . , X n )We first prove the desired result in the case that H holds. Consider Lemma 3.3 applied to thesteps A i above and the following sequence of distributions P = G E ( N, k, p, q ) P = M ( m, U m ( F ′ ) , p, Q ) P = M ( m, kr t , U m,k , U kr t ( F ) , p, Q ) P = M (cid:16) m, kr t , U m,k , U kr t ( F ) , N (cid:16)p r t ( r − · µ, (cid:17) , N (0 , (cid:17) = p r t ( r − · µ · U m,k ⊤U krt ( F ) + N (0 , ⊗ m × kr t P = isgm H ( n, k, d, µ, ǫ )As in the statement of Lemma 3.3, let ǫ i be any real numbers satisfying d TV ( A i ( P i − ) , P i ) ≤ ǫ i foreach step i . A direct application of Lemma 4.8 implies that we can take ǫ = 4 k · exp (cid:18) − Q N pkm (cid:19) + r C Q k m where C Q = max n Q − Q , − QQ o . Note that, by construction, the step A is exact and we can take ǫ = 0. Consider applying Lemma 4.5 and averaging over S ∼ U m,k and T ∼ U kr t ( F ) using theconditioning property from Fact 3.2. This yields that we can take ǫ = O (( mkr t ) − / ) = O ( N − ).Applying Lemma 4.11 while similarly averaging over S ∼ U m,k yields that we can take ǫ = 4 w + k wn + kw p n ( r − d TV ( A ( G E ( N, k, p, q )) , isgm H ( n, k, d, µ, ǫ )) = O (cid:18) w − + k wN + k √ N + e − Ω( N /km ) + N − (cid:19) which proves the desired result in the case of H . Now consider the case that H holds and Lemma3.3 applied to the steps A i and the following sequence of distributions P = G ( N, q ) P = Bern( Q ) ⊗ m × m P = Bern( Q ) ⊗ m × kr t P = N (0 , ⊗ m × kr t P = N (0 , I d ) ⊗ n As above, Lemmas 4.8, 4.5 and 4.11 imply that we can take ǫ = 4 k · exp (cid:18) − Q N pkm (cid:19) , ǫ = 0 , ǫ = O ( N − ) and ǫ = 031y Lemma 3.3, we therefore have that d TV (cid:0) A ( G ( N, q )) , N (0 , I d ) ⊗ n (cid:1) = O (cid:16) e − Ω( N /kn ) + N − (cid:17) which completes the proof of the theorem. In this section, we apply the reduction k - pds-to-isgm (Fig. 2) to deduce our main statistical-computational gaps for robust sparse mean estimation. We begin by showing that a direct appli-cation of k - pds-to-isgm yields a lower bound of n = Ω( ǫ k ) for polynomial-time robust sparseestimation within ℓ distance τ = o ( p ǫ/ (log n ) c ) for any fixed c >
0. When ǫ = 1 / polylog( n ),this yields the optimal sample lower bound of n = ˜Ω( k ) for estimation within the ℓ minimaxrate of τ = ˜Θ( ǫ ). When ǫ = (log n ) − ω (1) , our reduction shows that robust sparse mean estimationcontinues to have a large statistical-computational gap even when the task only requires estimationwithin ℓ distance τ ≈ √ ǫ , which is far above the minimax rate.We remark that ( n, k, d, ǫ ) in the next theorem is an arbitrary sequence of parameters satisfyingthe given conditions. Given these parameters, we construct ( N, k ′ ) such that k ′ = o ( √ N ) and k - pds-to-isgm reduces from k -pc with N vertices and subgraph size k ′ to rsme ( n, k, d, τ, ǫ ). Analgorithm solving rsme ( n, k, d, τ, ǫ ) would then that yield that k -pc can be solved on this con-structed sequence of parameters ( N, k ′ ), contradicting our form of the k -pc conjecture. When N is appropriately close to a number of the form k ′ r t , then the resulting τ satisfies τ ≍ r ǫw log n where w denotes an arbitrarily slow-growing function of n tending to infinity. Note that this τ is much larger than ǫ for ǫ = o (1 / log n ). When N is far from any k ′ r t , then τ can degrade to τ ≍ ǫ/ √ w log n , which is never larger than ǫ and thus yields vacuous lower bounds for rsme . Thisis the number theoretic subtlety alluded to in Section 4. This is a nonissue for the right choices of(
N, k ′ ), as shown in the derivation below. For clarity, we go through this first lower bound in detailand include fewer details in subsequent similar theorems. Theorem 5.1 (General Lower Bound for rsme ) . Let ( n, k, d, ǫ ) be any parameters satisfying that k = o ( d ) , ǫ ∈ (0 , and n satisfies that n = o ( ǫ k ) and n = Ω( k ) . If c > is some fixed constant,then there is a parameter τ = Ω( p ǫ/ (log n ) c ) such that any randomized polynomial time test for rsme ( n, k, d, τ, ǫ ) has asymptotic Type I + II error at least 1 assuming either the k -pc conjecture orthe k -pds conjecture for some fixed edge densities < q < p ≤ .Proof. This theorem will follow from a careful selection of parameters with which to apply k - pds-to-isgm from Theorem 4.2. Assume the k -pds conjecture for some fixed edge densities 0 < q
1. Let w = w ( n ) be an arbitrarily slow-growing function of n and let Q = 1 − p (1 − p )(1 − q )+ { p =1 } (cid:0) √ q − (cid:1) . Note that Q ∈ (0 ,
1) and is constant. Now define parameters as follows:1. Let r be a prime number with ǫ − < r = O ( ǫ − ), which can be found in poly( ǫ − ) time. This is because our reduction sets r = Θ( ǫ − ) and maps to the signal level τ ≍ p k ′ / ( r t +1 log n ) where r t isthe smallest power of r greater than ( p/Q + 1) N/k ′ and k ′ = o ( √ N ) for the starting pc instance to be hard. If( p/Q + 1) N/k ′ is far from the next smallest power of r , it is possible that r t ≍ Nr/k ′ which implies that τ = o (1 / ( r √ log n )) = O ( ǫ/ √ log n ). However, for our choice of parameters, it will hold that r t ≍ N/k ′ and τ will insteadbe close to 1 / √ r log n ≍ p ǫ/ log n .
32. Let t be such that r t is the largest power of r less than wk (1 + pQ ).3. Set k -pds parameters k ′ = ⌊ r t w − (1 + pQ ) − ⌋ and N = wk ′ .4. Set the mean parameter µ to be µ = δ p k ′ r t ) + 2 log( p − Q ) − · p r t ( r − δ = min n log (cid:16) pQ (cid:17) , log (cid:16) − Q − p (cid:17)o .By construction, we have that (cid:16) pQ + 1 (cid:17) N ≤ k ′ r t . For a slow-growing enough choice of w and largeenough n , we have (cid:18) pQ + 1 (cid:19) N + k ′ ≤ wk (cid:18) pQ (cid:19) ≤ d and wn ≤ · ǫ k ≤ r t +1) ǫ w (cid:16) pQ (cid:17) ≤ r ( r − ǫ w (cid:16) pQ (cid:17) · k ′ ( r t − r − ≤ k ′ ( r t − r − k - pds-to-isgm to map from k -pds ( N, k ′ , p, q ) to isgm ( n, k ′ , d, µ, /r ). The inequalities above guarantee that we have met the conditions needed toapply Theorem 4.2. Note that the total variation upper bound in Theorem 4.2 tends to zero since k ′ /N = w − = o (1) and N /k ′ n = Ω( k ′ ) = ω (1).Now observe that isgm ( n, k ′ , d, µ, /r ) is an instance of rsme ( n, k, d, τ, ǫ ) since k ′ ≤ k and1 /r < ǫ . More precisely, it is an instance with mean vector µ · S , where S is a k ′ -subset of [ d ]chosen uniformly at random, and outlier distribution D O = mix ǫ − r − ( N ( µ · S , I d ) , N ( µ ′ · S , I d ))where µ ′ is such that (1 − r − ) µ + r − · µ ′ = 0. Now note that τ = k µ · S k = µ √ k ′ ≍ p log( n ) · p r t ( r − · √ k ′ ≍ r ǫw log n since log( k ′ r t ) = Θ(log n ) and k ′ = ⌈ r t w − (1+ pQ ) − ⌉ implies that r t = Θ( wk ′ ). This τ satisfies that τ = Ω( p ǫ/ (log n ) c ) as long as w = O ((log n ) c ). Now suppose that some randomized polynomialtime test A for rsme ( n, k, d, τ, ǫ ) has asymptotic Type I+II error less than 1. By Lemma 3.1 andthe reduction above, this implies that there is a randomized polynomial time test for k -pds on thesequence of inputs ( N, k ′ , p, q ) with asymptotic Type I+II error less than 1. This contradicts the k -pds conjecture and proves the theorem.For small ǫ with ǫ = (log n ) − ω (1) , the mean parameter τ above is √ ǫ up to subpolynomial factorsin ǫ . This value of τ is much larger than the ω ( ǫ ) it needs to be to show lower bounds for rsme .Thus the reduction k - pds-to-isgm actually shows lower bounds at small ǫ for weak estimatorsthat can only estimate up to ℓ distance τ ≈ √ ǫ . When ǫ = (log n ) c where c > k -to- k gap in rsme up to polylogarithmic factors. This is stated inthe corollary below. Corollary 5.2 (Optimal Statistical-Computational Gaps in rsme ) . Let ( n, k, d, ǫ ) be any param-eters satisfying that k = o ( d ) , ǫ = Θ((log n ) − c ) for some constant c > and n satisfies that n = o ( k (log n ) − c ) and n = Ω( k ) . Then there is a parameter τ = ω ( ǫ ) such that any randomizedpolynomial time test for rsme ( n, k, d, τ, ǫ ) has asymptotic Type I + II error at least 1 assuming eitherthe k -pc conjecture or the k -pds conjecture for some fixed edge densities < q < p ≤ . lgorithm isgm-Sample-Cloning Inputs : isgm samples X , X , . . . , X n ∈ R d , blowup parameter ℓ
1. Set $X_i^0 = X_i$ for each $1 \le i \le n$.

2. For $j = 1, 2, \ldots, \ell$ do:

(1) Sample $G_1, G_2, \ldots, G_{2^{j-1}n} \sim_{\mathrm{i.i.d.}} N(0, I_d)$.

(2) For each $1 \le i \le 2^{j-1}n$, form $X_i^j$ and $X_{2^{j-1}n + i}^j$ as
$$X_i^j = \frac{1}{\sqrt{2}}\left(X_i^{j-1} + G_i\right) \quad \text{and} \quad X_{2^{j-1}n + i}^j = \frac{1}{\sqrt{2}}\left(X_i^{j-1} - G_i\right)$$

3. Output a subset of $X_1^\ell, X_2^\ell, \ldots, X_{2^\ell n}^\ell$ of size $n'$ chosen uniformly at random.

Figure 6:
Sample cloning subroutine in the reduction from a planted dense subgraph instance to robustsparse mean estimation.
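As a concrete rendering of Figure 6 and Lemma 5.3, the following Python sketch (illustrative only; the function name is hypothetical) doubles the sample count $\ell$ times while shrinking the planted mean by a factor of $2^{-1/2}$ per round, then subsamples.

```python
import numpy as np

def isgm_sample_cloning(X, ell, n_out, rng):
    """Sketch of isgm-Sample-Cloning: X has shape (n, d); each round doubles the
    number of samples and divides the planted mean by sqrt(2), keeping samples
    independent; a uniform subset of size n_out is returned after ell rounds."""
    samples = X.copy()
    for _ in range(ell):
        G = rng.standard_normal(samples.shape)      # fresh independent Gaussians
        plus = (samples + G) / np.sqrt(2)
        minus = (samples - G) / np.sqrt(2)
        samples = np.vstack([plus, minus])          # 2^j * n samples after round j
    idx = rng.choice(samples.shape[0], size=n_out, replace=False)
    return samples[idx]

# Example (all samples planted, for illustration): a mean vector mu * 1_S becomes
# approximately 2^{-ell/2} * mu * 1_S after cloning.
rng = np.random.default_rng(1)
n, d, k, mu, ell = 100, 20, 5, 1.0, 3
S = np.zeros(d)
S[:k] = mu
X = rng.standard_normal((n, d)) + S
Y = isgm_sample_cloning(X, ell, n_out=4 * n, rng=rng)
print(Y[:, :k].mean(), mu * 2 ** (-ell / 2))        # empirical mean vs 2^{-ell/2} mu
```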
We remark that in intermediate parameter regimes where ǫ = (log n ) − ω (1) is not yet poly-nomially small in n , such as ǫ = e − Θ( √ log n ) , our result essentially shows a k -to- k statistical-computational gap for rsme at the weak ℓ estimation rate of τ = ˜Θ( √ ǫ ). It is in these parameterregimes where our lower bounds for rsme are strongest.In the case where ǫ is polynomially small in n , the sample lower bound of n = Ω( ǫ k ) inTheorem 5.1 degrades with ǫ . We now show that the high τ ≈ √ ǫ produced by our reduction canbe traded off for sharper bounds in n using the simple post-processing subroutine isgm-Sample-Cloning in Figure 6. Its important properties are captured in the following lemma. Lemma 5.3 (Sample Cloning) . Let A denote isgm-Sample-Cloning applied with blowup param-eter ℓ and let ( Y , Y , . . . , Y ℓ n ) be the output of A ( X , X , . . . , X n ) . Then we have that • If X , X , . . . , X n are independent and exactly m of the n samples X , X , . . . , X n are dis-tributed according to N ( µ · S , I d ) and the rest from N ( µ ′ · S , I d ) , then the Y , Y , . . . , Y ℓ n areindependent and exactly ℓ m of Y , Y , . . . , Y ℓ n are distributed according to N (2 − ℓ/ µ · S , I d ) and the rest from N (2 − ℓ/ µ ′ · S , I d ) . • If X , X , . . . , X n ∼ i.i.d. N (0 , I d ) , then the Y , Y , . . . , Y ℓ n ∼ i.i.d. N (0 , I d ) .Proof. These are both simple consequences of the fact that if X ∼ N ( µ · S , I d ) and G ∼ N (0 , I d )are independent then X = 1 √ X + G i ) and X = 1 √ X − G i )are independent and satisfy ( X , X ) ∼ N ( µ · S / √ , I d ) ⊗ . Iteratively applying this fact provesthe lemma.We now use isgm-Sample-Cloning to strengthen Theorem 5.2 as follows. This requires aslightly more stringent choice of the parameter k than in Theorem 5.2 to initially improve the34ower bound to n = Ω( ǫk ) before applying isgm-Sample-Cloning . This choice of k renders thenumber-theoretic issue alluded to above trivial. We omit details that are the same as in Theorem5.2. Note that rsme is formulated in Section 2 in Huber’s ǫ -contamination model. Let rsme-c bethe variant of rsme instead defined in the ǫ -corruption model. Then we have the following theorem. Theorem 5.4 (Lower Bound Tradeoff with Estimation Accuracy for rsme ) . Fix some α ∈ (0 , and suppose that ǫ = O ( n − c ) for some constant c > . Assume either the k -pc conjectureor the k -pds conjecture for some fixed edge densities < q < p ≤ . Then any test solving rsme-c ( n, k, d, τ, ǫ ) with τ = ˜Ω( ǫ − α/ ) has asymptotic Type I + II error at least 1 if n = o ( ǫ α k ) , k = o ( d ) and n = Ω( k ) .Proof. Set parameters identically as in Theorem 5.2, except let k = r t for the choice of r and t andlet 1 /r < ǫ/ /r = Ω( ǫ ). All parameter calculations remain identical to Theorem 5.2, exceptit only needs to hold that n = o ( ǫk ) instead of n = o ( ǫ k ) to satisfy the conditions to applyTheorem 4.2. This is because if n = o ( ǫk ) and k = r t , then we have that wn ≤ · ǫk ≤ r t − w (cid:16) pQ (cid:17) ≤ k ′ ( r t − r − n since w tends to infinity. Therefore the same application of Theorem 4.2 yieldsthat k - pds-to-isgm produces an instance of isgm ( n, k ′ , d, µ, /r ) with τ ≍ p ǫ/w log n . ApplyingLemma 5.3 yields that if isgm-Sample-Cloning is then applied with blowup factor ℓ , we havethat we arrive at an instance of rsme-c (2 ℓ n, k, d, − ℓ/ · τ, ǫ ) within total variation o (1). Notethat here, we have used the fact that since 1 /r < ǫ/
2, then the concentration of Bin( n, /r )implies with high probability that the number of corrupted sampled is at most ǫn before applying isgm-Sample-Cloning . Note that isgm-Sample-Cloning will preserve this fact. Suppose that ℓ is chosen such that 2 ℓ = Θ( ǫ α − ), then it follows that 2 ℓ n = o ( ǫ α k ) and 2 − ℓ/ · τ = ˜Ω( ǫ − α/ ).Applying Lemma 3.1 completes the proof of this theorem. In this section, we prove our second main result showing the k -pc and k -pds conjectures implythe pds Recovery Conjecture under a semirandom adversary in the regime of constant ambientedge density. Our reduction from k -pds to semi-cr is shown in Figure 7. On a high level, k -pds-to-semi-cr can be interpreted as rotating by H ,ℓ to effectively spread the signal in the planteddense subgraph out by simultaneously expanding its size from k to Θ(3 ℓ k ) while decreasing itsplanted edge density. Furthermore, this rotation spreads the signal at the sharper rate from the pds Recovery Conjecture as opposed to the slower detection rate. In doing so, the reduction alsoproduces monotone noise in the rest of the graph that can be simulated by a semirandom adversary.Our reduction begins with the same first two steps as in the reduction to isgm , by symmetrizing,planting diagonals and Gaussianizing. The total variation guarantees of these steps were alreadyestablished in Section 4.1. The third step breaks the resulting matrix into blocks within each part F i and adds one row and one column to each block such that: (1) the added row and column haveindex in the block corresponding to the column index of P i = 0 in H ,ℓ ; and (2) the entries ofboth are independently sampled from N (0 , H ,ℓ along both rows and columns. The last step produces a graph byappropriately thresholding the above-diagonal entries of the resulting matrix.35 lgorithm k -pds-to-semi-cr Inputs : k -pds instance G ∈ G N with dense subgraph size k that divides N , partition E of [ N ]and edge probabilities 0 < q < p ≤
1, blowup factor ℓ , target number of vertices n ≥ m where m is the smallest multiple of (3 ℓ − k larger than (cid:16) pQ + 1 (cid:17) N where Q = 1 − p (1 − p )(1 − q ) + { p =1 } (cid:0) √ q − (cid:1) and n ≤ poly( N )1. Symmetrize and Plant Diagonals : Compute M PD ∈ { , } m × m with partition F of [ m ] as M PD ← To- k -Partite-Submatrix ( G )applied with initial dimension N , edge probabilities p and q and target dimension m .2. Gaussianize : Compute M G ∈ R m × m as M G ← Gaussianize ( M PD ) applied with proba-bilities p and Q and mean parameters µ ij = µ = 12 p m + 2 log( p − Q ) − · min (cid:26) log (cid:18) pQ (cid:19) , log (cid:18) − Q − p (cid:19)(cid:27) for each i, j ∈ [ m ].3. Partition into Blocks and Pad : Form the rotation matrix H ,ℓ as in Figure 2. Form thematrix M P ∈ R m ′ × m ′ where m ′ = 3 ℓ ks where s = m/ (3 ℓ − k by embedding M G as itsupper left submatrix and sampling all other entries i.i.d. from N (0 , F i into s blocks of size 3 ℓ − ks indices to each F i . Nowreorder the indices in each block so that the new index corresponds to the index of thecolumn for the point P i = 0 in H ,ℓ . Let [ m ′ ] = F ′ ∪ F ′ ∪ · · · ∪ F ′ ks be the partition of therow and column indices of M P induced by the blocks.4. Rotate in Blocks : Let F ′′ be a partition of [ m ′′ ] into ks equally sized parts where m ′′ = (3 ℓ − ks . Now compute the matrix M R ∈ R m ′′ × m ′′ as( M R ) F ′′ i ,F ′′ j = H ,ℓ ( M P ) F ′ i ,F ′ j H ⊤ ,ℓ for each i, j ∈ [ ks ]where Y A,B denotes the submatrix of Y restricted to the entries with indices in A × B .5. Threshold and Output : Now construct the graph G ′ with vertex set [ m ′′ ] such that for each i > j with i, j ∈ [ m ′′ ], we have { i, j } ∈ E ( G ′ ) if and only if ( M R ) ij ≥ µ · ℓ Add n − m ′′ new vertices to G ′ such that each edge incident to a new vertex is includedin E ( G ′ ) independently with probability 1 /
2. Randomly permute the vertex labels of G ′ and output the resulting graph. Figure 7:
Reduction from k -partite planted dense subgraph to semirandom community recovery. A denote the reduction k -pds-to-semi-cr . Wefirst formally introduce the key intermediate distributions on graphs that our reduction producesand which we will show can be simulated by a semirandom adversary. Definition 6.1 (Target Graph Distributions) . Given positive integers k ≤ m ≤ n and µ , µ , µ ∈ (0 , , let tg H ( n, k, k ′ , m, µ , µ , µ ) be the distribution over G ∈ G n sampled as follows:1. Choose a subset V ⊆ [ n ] of size | V | = m uniformly at random and then choose two disjointsubsets S ⊆ V and S ′ ⊆ V of sizes | S | = k and | S ′ | = k ′ , respectively, uniformly at random.2. Include the edge { i, j } in E ( G ) independently with probability p ij where p ij = / if ( i, j ) ∈ S ′ or ( i, j ) V / − µ if ( i, j ) ∈ V \ ( S ∪ S ′ ) / − µ if ( i, j ) ∈ S × S ′ or ( i, j ) ∈ S ′ × S / µ if ( i, j ) ∈ S Furthermore, let tg H ( n, m, µ ) be the distribution over G ∈ G n sampled by choosing V ⊆ [ n ] with | V | = m uniformly at random and including { i, j } in E ( G ) with probability / − µ if ( i, j ) ∈ V and probability / otherwise. We now establish the desired Markov transition properties for the block rotations and thresh-olding procedures in Steps 3, 4 and 5. We then will combine this with lemmas in Section 4.1 toprovet our lower bound for semi-cr . We remark that the block-wise padding in Step 3 is neededin the next lemma. In the proof of Lemma 4.11, we were able to condition on no planted indicescorresponding to columns with P i = 0 upon rotating without a loss in total variation since thesecorrespondences occurred with low probability. Here, this is no longer possible because rotationsare carried out block-wise and the number of blocks is much larger than the number of blocks inthe partition. This issue is resolved by adding in a new index corresponding to P i = 0 in each blockthat is guaranteed not to be planted. The fact that no planted index corresponds to P i = 0 is zeroensures that the number of vertices in the planted subgraph of the resulting semirandom instance is (3 ℓ − − k . This fact is used in the proof of the following lemma. Recall that Φ( x ) = R x −∞ e − t / dt is the standard normal cumulative distribution function. Lemma 6.2 (Block Rotations and Thresholding) . Let F be a fixed partition of [ m ] where m isdivisible by (3 ℓ − k . Let U ⊆ [ m ] be a fixed k -subset such that | U ∩ F i | = 1 for each i ∈ [ k ] . Let A denote Steps 5 and 6 of k -pds-to-semi-cr with input M G and output G ′ . Then for all τ ∈ R , A (cid:16) µ · U ⊤ U + N (0 , ⊗ m × m (cid:17) ∼ tg H (cid:18) n,
12 (3 ℓ − − k, ℓ − k, m, µ , µ , µ (cid:19) A (cid:0) N (0 , ⊗ m × m (cid:1) ∼ tg H ( n, m, µ ) where µ , µ , µ ∈ (0 , are equal to µ = Φ (cid:18) µ · − ℓ (cid:19) − , and µ = µ = Φ (cid:18) µ · − ℓ +1 (cid:19) − Proof.
First consider the case in which M G ∼ L (cid:0) µ · U ⊤ U + N (0 , ⊗ m × m (cid:1) . It follows that M P = µ · U ′ ⊤ U ′ + G where G ∼ N (0 , ⊗ m ′ × m ′ and U ′ is the image of U under the embedding and indexreordering in Step 3. Let [ m ′ ] = F ′ ∪ F ′ ∪ · · · ∪ F ′ ks be the partition of the row and column indices37f M P induced by the blocks. Note that | F ′ i | = 3 ℓ for each i ∈ [ ks ]. Since F ′ is a refinement of F ,it holds that | U ′ ∩ F ′ i | ≤ i ∈ [ ks ]. Furthermore, if W is the set of m ′ − m indices addedin Step 3, then it holds that W and U ′ are disjoint.Let [ m ′′ ] = F ′′ ∪ F ′′ ∪ · · · ∪ F ′′ ks be the partition in Step 4 of [ m ′′ ] into blocks of size (3 ℓ − H ∈ R m ′′ × m ′ to be such that: • H restricted to the indices F ′′ i × F ′ i is a copy of H ,ℓ with H F ′′ i ,F ′ i = H ,ℓ for each i ∈ [ ks ] • H ij = 0 if ( i, j ) is not in F ′′ a × F ′ a for some a ∈ [ ks ]The rotation step setting ( M R ) F ′′ i ,F ′′ j = H ,ℓ ( M P ) F ′ i ,F ′ j H ⊤ ,ℓ for each i, j ∈ [ ks ] can equivalently beexpressed as M R = H M P H ⊤ . Therefore we have that M R = H M P H ⊤ = µ · H U ′ ⊤ U ′ H ⊤ + H G H ⊤ ∼ L (cid:16) µ · vv ⊤ + N (0 , ⊗ m ′′ × m ′′ (cid:17) where v = H U ′ ∈ R m ′′ . The last statement holds since the rows of H are orthogonal by anapplication of the isotropic property of independent Gaussians similar to Lemma 4.11.Now consider the vector v , which is the sum of the k columns of H with indices in U ′ . Since | U ′ ∩ F ′ i | ≤
1, the construction of H implies that these k columns have disjoint support. ByLemma 4.9, each column of H not corresponding to the point P i = 0 in some block contains exactly (3 ℓ − −
1) entries equal to 1 / √ · ℓ , exactly 3 ℓ − equal to − / √ · ℓ and the rest of its entriesare zero. Step 3 ensures that all columns of H corresponding to P i = 0 are in W and thus not in U ′ . Thus it follows that v contains exactly k (3 ℓ − −
1) entries equal to 1 / √ · ℓ , exactly 3 ℓ − k equal to − / √ · ℓ and the rest of its entries are zero. Define S and S ′ to be S = n i ∈ [ m ′′ ] : v i = − / √ · ℓ o and S ′ = n i ∈ [ m ′′ ] : v i = 1 / √ · ℓ o where | S | = k (3 ℓ − −
1) and | S ′ | = 3 ℓ − k . Therefore it follows that the entries of M R are independentand distributed as follows:( M R ) ij ∼ N (2 µ · − ℓ ,
1) if ( i, j ) ∈ S × S N ( − µ · − ℓ ,
1) if ( i, j ) ∈ S × S ′ or ( i, j ) ∈ S ′ × S N (cid:0) µ · − ℓ , (cid:1) if ( i, j ) ∈ S ′ × S ′ N (0 ,
1) otherwiseAfter thresholding, adding n − m new vertices and permuting as in Step 5 therefore yields that G ′ ∼ tg H (cid:0) n, (3 ℓ − − k, ℓ − k, m, µ , µ , µ (cid:1) for the values of µ , µ , µ ∈ (0 ,
1) defined in thelemma statement. This completes the proof of the first part of the lemma. Now consider the casewhere M G ∼ N (0 , ⊗ m × m . By an identical argument, we have that M R ∼ N (0 , ⊗ m ′′ × m ′′ . Thenthresholding, adding vertices and permuting as in Step 5 yields G ′ ∼ tg H ( n, m, µ ), completingthe proof of the lemma.Using this lemma, we now prove our second main result showing the pds Recovery Conjectureunder a semirandom adversary for constant ambient edge density. We begin with the case of q = 1 / q = Θ(1) case subsequently in a corollary. Theorem 6.3 ( k -pc Lower Bounds for Semirandom Community Recovery) . Fix any constant β ∈ [1 / , . Suppose that B is a randomized polynomial time test for semi-cr ( n, k, / ν, / for all ( n, k, ν ) with k = Θ( n β ) and ν ≥ ν where ν = o ( n/k log n ) . Then B has asymptotic TypeI + II error at least 1 assuming either the k -pc conjecture or the k -pds conjecture for some fixededge densities < q < p ≤ . roof. Assume the k -pds conjecture for some fixed edge densities 0 < q < p ≤ Q =1 − p (1 − p )(1 − q ) + { p =1 } (cid:0) √ q − (cid:1) ∈ (0 , w = w ( k ′ ) = ω (1) be a sufficiently slow-growingfunction and define the parameters ( k ′ , N ) to be such that N = wk ′ . Now define the followingparameters: • Blow up factor ℓ = ⌈ log ( N β /k ′ ) ⌉ and target subgraph size k = (cid:0) ℓ − − (cid:1) k ′ • Target number of vertices n = m , the smallest multiple of (3 ℓ − k larger than (cid:16) pQ + 1 (cid:17) N • Target graph distribution parameters µ = Φ (cid:0) µ · − ℓ (cid:1) − and µ = µ = Φ (cid:0) µ · − ℓ +1 (cid:1) − Note that these parameters satisfy the given conditions since k = Θ( N β ) and N = Θ( n ). As definedin Step 2 of A , it holds that µ = Θ(1 / √ log n ). Let ν = µ and observe that ν = Φ (cid:18) µ · − ℓ +1 (cid:19) − ≍ µ · − ℓ ≍ √ log n · k · r Nw ≍ r nwk log n since Φ( x ) − / x ) for x ∈ (0 , ν ≥ ν for a sufficiently slow-growingchoice of w .We now will show that A maps G ( N, q ) approximately to tg H ( n, m, µ ) and maps G E ( N, k ′ , p, q )approximately to tg H (cid:0) n, (3 ℓ − − k, ℓ − k, m, µ , µ , µ (cid:1) in total variation. To prove this, weapply Lemma 3.3 to the steps A i of A in each of these two cases. Let E be a partition of [ N ] into k ′ parts of size N/k ′ . Define the steps of A to map inputs to outputs as follows( G, E ) A −−→ ( M PD , F ) A −−→ ( M G , F ) A −−−→ G ′ In the first case, consider Lemma 3.3 applied to the steps A i above and the following sequence ofdistributions P = G E ( N, k ′ , p, q ) P = M ( m, U m ( F ′ ) , p, Q ) P = M ( m, U m ( F ) , N ( µ, , N (0 , P = tg H (cid:18) n,
12 (3 ℓ − − k ′ , ℓ − k ′ , m, µ , µ , µ (cid:19) As in the statement of Lemma 3.3, let ǫ i be any real numbers satisfying d TV ( A i ( P i − ) , P i ) ≤ ǫ i foreach step i . Lemma 4.8 implies that we can take ǫ = 4 k · exp (cid:0) − Q N / pk ′ m (cid:1) + p C Q k ′ / m where C Q = max { Q/ (1 − Q ) , (1 − Q ) /Q } . Applying Lemma 4.5 and averaging over S = T ∼ U m ( F )yields that we can take ǫ = O ( N − ). Lemma 6.2 implies that Steps 3, 4 and 5 are exact andwe can take ǫ = 0. Since ǫ , ǫ = o (1), Lemma 3.3, implies that A takes P to P with o (1)total variation, which proves the first part of the claim. Now consider the second case Lemma 3.3applied to the following sequence of distributions P = G ( N, q ) , P = Bern( Q ) ⊗ m × m , P = N (0 , ⊗ m × m , P = tg H ( n, m, µ )As above, Lemmas 4.8, 4.5 and 4.11 imply that we can take ǫ = 4 k · exp (cid:0) − Q N / pk ′ m (cid:1) , ǫ = O ( N − ) and ǫ = 0. Applying Lemma 3.3 again implies that A takes P to P with o (1)total variation, which proves the second part of the claim.We now will show that these two target distributions can be simulated by the H and H semirandom adversaries in semi-cr ( n, k, / ν, / G ∼G ( n, k, / ν, /
2) and modifies G as follows: 39. samples S ′ of size 3 ℓ − k ′ uniformly at random from all 3 ℓ − k ′ -subsets of [ n ] \ S where S is thevertex set of the planted dense subgraph; and2. if the edge { i, j } is in E ( G ), remove it from G independently with probability P ij where P ij = i, j ) ∈ S µ if ( i, j ) ( S ∪ S ′ ) µ if ( i, j ) ∈ S × S ′ or ( i, j ) ∈ S ′ × S This exactly simulates tg H (cid:0) n, (3 ℓ − − k, ℓ − k, m, µ , µ , µ (cid:1) and shows that it is in the setof distributions adv ( G ( n, k, / ν, / G ∼ G ( n, /
2) and removes every present edge independently with probability 2 µ . This similarlyshows that tg H ( n, m, µ ) ∈ adv ( G ( n, / A , if B hasasymptotic Type I+II error less than 1, t follows that there is a randomized polynomial time testfor k -pds on the sequence of inputs ( N, k ′ , p, q ) with asymptotic Type I+II error less than 1. Thiscontradicts the k -pds conjecture and proves the theorem.An identical analysis and reduction modified to replace the threshold µ · − ℓ with µ · − ℓ +Φ − ( q ) shows the same statistical-computational gap holds at ambient edge density q instead of1 /
2, as long as min { q, − q } = Ω(1). The resulting generalization is formally stated below. Corollary 6.4 (Arbitrary Bounded q ) . Fix any constant β ∈ [1 / , . Suppose that B is a ran-domized polynomial time test for semi-cr ( n, k, p, q ) for all ( n, k, p, q ) with k = Θ( n β ) and ( p − q ) q (1 − q ) ≥ ν and min { q, − q } = Ω(1) where ν = o (cid:18) nk log n (cid:19) Then B has asymptotic Type I + II error at least 1 assuming either the k -pc conjecture or the k -pds conjecture for some fixed edge densities < q ′ < p ′ ≤ . In this section, we combine our reduction to isgm from Section 4 with a new gadget performingan algorithmic change of measure in order to obtain a universality principle for computationallower bounds at the sample complexity of n = ˜Ω( k ) for learning sparse mixtures. This gadget,symmetric 3-ary rejection kernels, is introduced and analyzed in Section 7.1. We remark that the k -partite promise in k -pc and k -pds is crucially used in our reduction to obtain this universality.In particular, the k -partite promise ensures that the entries of the intermediate isgm instance arefrom one of three distinct distributions, when conditioned on the part of the mixture the sample isfrom. This is necessary for our application of symmetric 3-ary rejection kernels.Our lower bounds hold given tail bounds on the likelihood ratios between the planted and noisedistributions. In particular, our universality result shows tight computational lower bounds forsparse PCA in the spiked covariance model and a wide range of natural distributional formulationsof learning sparse mixtures. The results in this section can also be interpreted as a universalityprinciple for computational lower bounds in sparse PCA. We prove total variation guarantees forour reduction to glsm in Section 7.2 and discuss the universality conditions needed for our lowerbounds in Section 7.3. 40 lgorithm ( B, P + , P − , Q ) Parameters : Input B ∈ {− , , } , number of iterations N , parameters a ∈ (0 ,
1) and sufficiently small nonzero µ1, µ2 ∈ R, distributions P+, P− and Q over a measurable space (X, B) such that (P+, Q) and (P−, Q) are computable pairs

1. Initialize z arbitrarily in the support of Q.
2. Until z is set or N iterations have elapsed:
(1) Sample z′ ∼ Q independently and compute the two quantities
L1(z′) = dP+/dQ(z′) − dP−/dQ(z′) and L2(z′) = dP+/dQ(z′) + dP−/dQ(z′) − 2
(2) Check whether 2|µ1| ≥ |L1(z′)| and 2|µ2|/max{a, 1 − a} ≥ |L2(z′)|, and if not, proceed to the next iteration.
(3) Set z ← z′ with probability P_A(z′, B) where
P_A(z′, B) = 1/2 · (1 + L1(z′)/(4µ1) + a·L2(z′)/(4µ2))   if B = 1
P_A(z′, B) = 1/2 · (1 − (1 − a)·L2(z′)/(4µ2))            if B = 0
P_A(z′, B) = 1/2 · (1 − L1(z′)/(4µ1) + a·L2(z′)/(4µ2))   if B = −1
3. Output z.

Figure 8: The symmetric 3-ary rejection kernel 3-srk.
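The following is a minimal Python sketch of 3-srk as described in Figure 8. The function name and calling convention are ours, the likelihood ratios dP+/dQ and dP−/dQ are passed in as callables, and the acceptance probabilities are our reading of the display above: they are the unique probabilities of this form for which the Tern(a, µ1, µ2)-mixture of the three accepted laws reproduces dP+/dQ, as used in the proof of Lemma 7.2 below.

```python
import random

def three_srk(B, dPp_dQ, dPm_dQ, sample_Q, a, mu1, mu2, N):
    """Sketch of the symmetric 3-ary rejection kernel of Figure 8.

    B        -- input symbol in {-1, 0, 1}
    dPp_dQ   -- callable x -> dP+/dQ(x)
    dPm_dQ   -- callable x -> dP-/dQ(x)
    sample_Q -- callable returning one sample from Q
    a, mu1, mu2, N -- parameters as in Figure 8 (names assumed here)
    """
    z = sample_Q()  # arbitrary initialization in the support of Q
    for _ in range(N):
        zp = sample_Q()
        L1 = dPp_dQ(zp) - dPm_dQ(zp)
        L2 = dPp_dQ(zp) + dPm_dQ(zp) - 2.0
        # restrict to the set S on which the acceptance probabilities are valid
        if abs(L1) > 2 * abs(mu1) or abs(L2) > 2 * abs(mu2) / max(a, 1 - a):
            continue
        if B == 1:
            p_acc = 0.5 * (1 + L1 / (4 * mu1) + a * L2 / (4 * mu2))
        elif B == 0:
            p_acc = 0.5 * (1 - (1 - a) * L2 / (4 * mu2))
        else:  # B == -1
            p_acc = 0.5 * (1 - L1 / (4 * mu1) + a * L2 / (4 * mu2))
        if random.random() < p_acc:
            return zp
    return z
```

In the intended use, the input B is produced by truncating a Gaussian entry with tr_τ, so that B is distributed as one of the three Tern distributions analyzed in Lemma 7.3 below.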
In this section, we introduce symmetric 3-ary rejection kernels, which will be the key gadget in ourreduction showing universality of lower bounds for learning sparse mixtures. Rejection kernels area form of an algorithmic change of measure. Rejection kernels mapping a pair of Bernoulli distri-butions to a target pair of scalar distributions were introduced in [BBH18]. These were extendedto arbitrary high-dimensional target distributions and applied to obtain universality results forsubmatrix detection in [BBH19]. A surprising and key feature of both of these rejection kernels isthat they are not lossy in mapping one computational barrier to another. For instance, in [BBH19],multivariate rejection kernels were applied to increase the relative size k of the planted submatrix,faithfully mapping instances tight to the computational barrier at lower k to tight instances athigher k . This feature is also true of the scalar rejection kernels applied in [BBH18].To faithfully map the k -pc computational barrier onto the computational barrier of sparsemixtures, it is important to produce multiple planted distributions. Since previous rejection kernelsall begin with binary inputs, they do not have enough degrees of freedom to map to three outputdistributions. The symmetric 3-ary rejection kernels 3 -srk introduced in this section overcome thisissue by mapping from distributions supported on {− , , } to three output distributions P + , P − Q . In order to produce clean total variation guarantees, these rejection kernels also exploitsymmetry in their three input distributions on {− , , } .Let Tern( a, µ , µ ) where a ∈ (0 ,
1) and µ , µ ∈ R denote the probability distribution on {− , , } such that if B ∼ Tern( a, µ , µ ) then P [ X = −
1] = 1 − a − µ + µ , P [ X = 0] = a − µ , P [ X = 1] = 1 − a µ + µ if all three of these probabilities are nonnegative. The map 3 -srk ( B ), shown in Figure 8, sends aninput B ∈ {− , , } to a set X simultaneously satisfying three Markov transition properties:1. if B ∼ Tern( a, µ , µ ), then 3 -srk ( B ) is close to P + in total variation;2. if B ∼ Tern( a, − µ , µ ), then 3 -srk ( B ) is close to Q in total variation; and3. if B ∼ Tern( a, , -srk ( B ) is close to P − in total variation.In order to state our main results for 3 -srk ( B ), we will need the notion of computable pairs from[BBH19]. The definition below is that given in [BBH19], without the assumption of finiteness ofKL divergences. This assumption was convenient for the Chernoff exponent analysis needed formultivariate rejection kernels in [BBH19]. Since our rejection kernels are univariate, we will be ableto state our universality conditions directly in terms of tail bounds rather than Chernoff exponents. Definition 7.1 (Relaxed Computable Pair [BBH19]) . Define a pair of sequences of distributions ( P , Q ) over a measurable space ( X, B ) where P = ( P n ) and Q = ( Q n ) to be computable if:1. there is an oracle producing a sample from Q n in poly( n ) time;2. P n and Q n are mutually absolutely continuous and the likelihood ratio satisfies E x ∼Q (cid:20) d P d Q ( x ) (cid:21) = E x ∼P "(cid:18) d P d Q ( x ) (cid:19) − = 1 where d P n d Q n is the Radon-Nikodym derivative; and3. there is an oracle computing d P n d Q n ( x ) in poly( n ) time for each x ∈ X . We remark that the second condition above always holds for discrete distributions and generallyfor most well-behaved distributions P and Q . We now prove our main total variation guaranteesfor 3 -srk . The proof of the next lemma follows a similar structure to the analysis of rejectionsampling as in Lemma 5.1 of [BBH18] and Lemma 5.1 of [BBH19]. However, the bounds that weobtain are different than those in [BBH18, BBH19] because of the symmetry of the three input Terndistributions. Lemma 7.2 (Symmetric 3-ary Rejection Kernels) . Let a ∈ (0 , and µ , µ ∈ R be nonzero andsuch that Tern( a, µ , µ ) is well-defined. Let P + , P − and Q be distributions over a measurablespace ( X, B ) such that ( P + , Q ) and ( P − , Q ) are computable pairs with respect to a parameter n .Let S ⊆ X be the set S = (cid:26) x ∈ X : 2 | µ | ≥ (cid:12)(cid:12)(cid:12)(cid:12) d P + d Q ( x ) − d P − d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) and 2 | µ | max { a, − a } ≥ (cid:12)(cid:12)(cid:12)(cid:12) d P + d Q ( x ) + d P − d Q ( x ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:27) iven a positive integer N , then the algorithm -srk : {− , , } → X can be computed in poly( n, N ) time and satisfies that d TV (3 -srk (Tern( a, µ , µ )) , P + ) d TV (3 -srk (Tern( a, − µ , µ )) , P − ) d TV (3 -srk (Tern( a, , , Q ) ≤ δ (cid:0) | µ | − + | µ | − (cid:1) + (cid:18)
12 + δ (cid:0) | µ | − + | µ | − (cid:1)(cid:19) N where δ > is such that P X ∼P + [ X S ] , P X ∼P − [ X S ] and P X ∼Q [ X S ] are upper bounded by δ .Proof. Define L , L : X → R to be L ( x ) = d P + d Q ( x ) − d P − d Q ( x ) and L ( x ) = d P + d Q ( x ) + d P − d Q ( x ) − x ∈ S , then the triangle inequality implies that P A ( x, ≤ (cid:18) a | µ | · |L ( x ) | + 14 | µ | · |L ( x ) | (cid:19) ≤ P A ( x, ≥ (cid:18) − a | µ | · |L ( x ) | − | µ | · |L ( x ) | (cid:19) ≥ ≤ P A ( x, ≤ ≤ P A ( x, − ≤
1, implying that eachof these probabilities is well-defined. Now let R = P X ∼P + [ X ∈ S ], R = P X ∼Q [ X ∈ S ] and R − = P X ∼P − [ X ∈ S ] where R , R , R − ≥ − δ by assumption.We now define several useful events. For the sake of analysis, consider continuing to iterateStep 2 even after z is set for the first time for a total of N iterations. Let A i , A i and A − i be theevents that z is set in the i th iteration of Step 2 when B = 1, B = 0 and B = −
1, respectively.Let B i = ( A ) C ∩ ( A ) C ∩ · · · ∩ ( A i − ) C ∩ A i be the event that z is set for the first time in the i thiteration of Step 2. Let C = A ∪ A ∪ · · · ∪ A N be the event that z is set in some iteration of Step2. Define B i , C , B − i and C − analogously. Let z be the initialization of z in Step 1.Now let Z ∼ D = L (3 -srk (1)), Z ∼ D = L (3 -srk (0)) and Z − ∼ D − = L (3 -srk ( − L ( Z t | B ti ) = L ( Z t | A ti ) for each t ∈ {− , , } since A ti is independent of A t , A t , . . . , A ti − and the sample z ′ chosen in the i th iteration of Step 2. The independence between Steps 2.1 and2.3 implies that P (cid:2) A i (cid:3) = E x ∼Q (cid:20) (cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19) · S ( x ) (cid:21) = 12 R + a µ ( R + R − − R ) + 18 µ ( R − R − ) ≥ − δ (cid:18) a | µ | − + 14 | µ | − (cid:19) P (cid:2) A i (cid:3) = E x ∼Q (cid:20) (cid:18) − − a µ · L ( x ) (cid:19) · S ( x ) (cid:21) = 12 R − − a µ ( R + R − − R ) ≥ − δ (cid:18) − a · | µ | − (cid:19) P (cid:2) A − i (cid:3) = E x ∼Q (cid:20) (cid:18) a µ · L ( x ) − µ · L ( x ) (cid:19) · S ( x ) (cid:21) = 12 R + a µ ( R + R − − R ) − µ ( R − R − ) ≥ − δ (cid:18) a | µ | − + 14 | µ | − (cid:19) The independence of the A ti for each t ∈ {− , , } implies that1 − P (cid:2) C t (cid:3) = N Y i =1 (cid:0) − P (cid:2) A ti (cid:3)(cid:1) ≤ (cid:18)
12 + δ (cid:18) | µ | − + | µ | − (cid:19)(cid:19) N L ( Z t | A ti ) are each absolutely continuous with respect to Q or each t ∈ {− , , } , withRadon-Nikodym derivatives given by d L ( Z | B i ) d Q ( x ) = d L ( Z | A i ) d Q ( x ) = 12 · P (cid:2) A i (cid:3) (cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19) · S ( x ) d L ( Z | B i ) d Q ( x ) = d L ( Z | A i ) d Q ( x ) = 12 · P (cid:2) A i (cid:3) (cid:18) − − a µ · L ( x ) (cid:19) · S ( x ) d L ( Z − | B − i ) d Q ( x ) = d L ( Z − | A − i ) d Q ( x ) = 12 · P (cid:2) A i (cid:3) (cid:18) a µ · L ( x ) − µ · L ( x ) (cid:19) · S ( x )Fix one of t ∈ {− , , } and note that since the conditional laws L ( Z t | B ti ) are all identical, wehave that d D t d Q ( x ) = P (cid:2) C t (cid:3) · d L ( Z t | B t ) d Q ( x ) + (cid:0) − P (cid:2) C t (cid:3)(cid:1) · z ( x )Therefore it follows that d TV (cid:0) D t , L ( Z t | B t ) (cid:1) = 12 · E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d D t d Q ( x ) − d L ( Z t | B t ) d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ (cid:0) − P (cid:2) C t (cid:3)(cid:1) · E x ∼Q (cid:20) z ( x ) + d L ( Z t | B t ) d Q ( x ) (cid:21) = 1 − P (cid:2) C t (cid:3) by the triangle inequality. Since 1 + a µ · L ( x ) + µ · L ( x ) ≥ x ∈ S , we have that E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d L ( Z | B ) d Q ( x ) − (cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · P (cid:2) A i (cid:3) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) · E x ∼Q ∗ n (cid:20)(cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19) · S ( x ) (cid:21) + E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) a µ · L ( x ) + 14 | µ | · L ( x ) (cid:12)(cid:12)(cid:12)(cid:12) · S C ( x ) (cid:21) ≤ (cid:12)(cid:12)(cid:12)(cid:12) − P [ A i ] (cid:12)(cid:12)(cid:12)(cid:12) + E x ∼Q (cid:20)(cid:18) a | µ | · (cid:18) d P + d Q ( x ) + d P − d Q ( x ) + 2 (cid:19)(cid:19) · S C ( x ) (cid:21) + E x ∼Q (cid:20) | µ | · (cid:18) d P + d Q ( x ) + d P − d Q ( x ) (cid:19) · S C ( x ) (cid:21) ≤ δ (cid:18) a | µ | − + 14 | µ | − (cid:19) + δ (cid:18) a | µ | − + 12 | µ | − (cid:19) = δ (cid:18)
32 + 54 | µ | − + 58 | µ | − (cid:19) By analogous computations, we have that E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d L ( Z | B ) d Q ( x ) − (cid:18) − − a µ · L ( x ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ δ (cid:0) | µ | − + | µ | − (cid:1) E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d L ( Z − | B − ) d Q ( x ) − (cid:18) a µ · L ( x ) − µ · L ( x ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ δ (cid:0) | µ | − + | µ | − (cid:1) Now observe that d P + d Q ( x ) = (cid:18) − a µ + µ (cid:19) · (cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19) + ( a − µ ) · (cid:18) − − a µ · L ( x ) (cid:19) (cid:18) − a − µ + µ (cid:19) · (cid:18) a µ · L ( x ) − µ · L ( x ) (cid:19) − a · (cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19) + a · (cid:18) − − a µ · L ( x ) (cid:19) + 1 − a · (cid:18) a µ · L ( x ) − µ · L ( x ) (cid:19) d P − d Q ( x ) = (cid:18) − a − µ + µ (cid:19) · (cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19) + ( a − µ ) · (cid:18) − − a µ · L ( x ) (cid:19) + (cid:18) − a µ + µ (cid:19) · (cid:18) a µ · L ( x ) − µ · L ( x ) (cid:19) Let D ∗ be the mixture of L ( Z | B ) , L ( Z | B ) and L ( Z − | B − ) with weights − a + µ + µ , a − µ and − a − µ + µ , respectively. It then follows by the triangle inequality that d TV (3 -srk (Tern( a, µ , µ )) , P + ) ≤ d TV ( D ∗ , P + ) + d TV ( D ∗ , -srk (Tern( a, µ , µ ))) ≤ (cid:18) − a µ + µ (cid:19) · E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d L ( Z | B ) d Q ( x ) − (cid:18) a µ · L ( x ) + 14 µ · L ( x ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) + ( a − µ ) · E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d L ( Z | B ) d Q ( x ) − (cid:18) − − a µ · L ( x ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) + (cid:18) − a − µ + µ (cid:19) · E x ∼Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d L ( Z − | B − ) d Q ( x ) − (cid:18) a µ · L ( x ) − µ · L ( x ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12)(cid:21) + (cid:18) − a µ + µ (cid:19) · d TV (cid:0) D , L ( Z | B ) (cid:1) + ( a − µ ) · d TV (cid:0) D , L ( Z | B ) (cid:1) + (cid:18) − a − µ + µ (cid:19) · d TV (cid:0) D − , L ( Z − | B − ) (cid:1) ≤ δ (cid:0) | µ | − + | µ | − (cid:1) + (cid:18)
12 + δ (cid:0) | µ | − + | µ | − (cid:1)(cid:19) N A symmetric argument shows analogous upper bounds on d TV (3 -srk (Tern( a, − µ , µ )) , P − ) and d TV (3 -srk (Tern( a, , , Q ). This completes the proof of the lemma.In our reduction showing universality, we will truncate Gaussians to generate the input distri-butions Tern. Let tr τ : R → {− , , } be the following truncation map tr τ ( x ) = x > | τ | − | τ | ≤ x ≤ | τ |− x < −| τ | We conclude this section with the following simple lemma on truncating symmetric triples of Gaus-sian distributions.
Lemma 7.3 (Truncating Gaussians). Let τ > 0 be constant, let µ > 0 be tending to zero and let a, µ1, µ2 be such that
tr_τ(N(µ, 1)) ∼ Tern(a, µ1, µ2),  tr_τ(N(−µ, 1)) ∼ Tern(a, −µ1, µ2)  and  tr_τ(N(0, 1)) ∼ Tern(a, 0, 0).
Then it follows that a > 0 is constant, 0 < µ1 = Θ(µ) and 0 < µ2 = Θ(µ²).

Proof. The parameters a, µ1, µ2 for which these distributional statements are true are given by
a = Φ(τ) − Φ(−τ)
µ1 = (1/2)((1 − Φ(τ − µ)) − Φ(−τ − µ)) = (1/2)(Φ(τ + µ) − Φ(τ − µ))
µ2 = (1/2)(Φ(τ) − Φ(−τ)) − (1/2)(Φ(τ + µ) − Φ(−τ + µ)) = (1/2)(2·Φ(τ) − Φ(τ + µ) − Φ(τ − µ))
Now note that
µ1 = (1/2)(Φ(τ + µ) − Φ(τ − µ)) = (1/(2√(2π))) ∫_{τ−µ}^{τ+µ} e^{−t²/2} dt = Θ(µ)
and is positive since e^{−t²/2} is bounded above and below by positive constants on [τ − µ, τ + µ], as τ is constant and µ → 0. Furthermore, note that
µ2 = (1/2)(2·Φ(τ) − Φ(τ + µ) − Φ(τ − µ)) = (1/(2√(2π))) ( ∫_{τ−µ}^{τ} e^{−t²/2} dt − ∫_{τ}^{τ+µ} e^{−t²/2} dt )
   = (1/(2√(2π))) ∫_{τ}^{τ+µ} ( e^{−(t−µ)²/2} − e^{−t²/2} ) dt = (1/(2√(2π))) ∫_{τ}^{τ+µ} e^{−t²/2} ( e^{tµ−µ²/2} − 1 ) dt
Now note that as µ → 0 and for t ∈ [τ, τ + µ], it follows that 0 < e^{tµ−µ²/2} − 1 = Θ(µ). This implies that 0 < µ2 = Θ(µ²), as claimed.

7.2 The n = ˜Θ(k²) Computational Barrier in Sparse Mixtures
In this section, we combine symmetric 3-ary rejection kernels with the reduction k -pds-to-isgm to map from k -pds to generalized sparse mixtures. The details of this reduction k -pds-to-glsm are shown in Figure 9. Throughout this section, we will denote A = k -pds-to-glsm . In order toapply symmetric 3-ary rejection kernels, we will need a set of conditions on the target distributions D , Q and {P ν } ν ∈ R . These conditions will also be the conditions for our universality result. Aswill be discussed in Section 7.3, these conditions turn out to faithfully map the computationalbarrier of k -pds to learning sparse mixtures in a number of natural cases, including learning sparseGaussian mixtures and sparse PCA in the spiked covariance model. Our universality conditionsare as follows. Definition 7.4 (Universality Conditions) . Given parameters ( n, k, d ) , define the collection of dis-tributions ( D , Q , {P ν } ν ∈ R ) to be in uc ( n, k, d ) if: • D is a symmetric distribution about zero and P ν ∼D [ ν ∈ [ − , − o ( n − ) ; and • for all ν ∈ [ − , , it holds that √ k log n ≫ (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) − d P − ν d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) and 1 k log n ≫ (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) + d P − ν d Q ( x ) − (cid:12)(cid:12)(cid:12)(cid:12) with probability at least − o ( n − d − ) over each of P ν , P − ν and Q . Let A denote Step 2 of A with input ( Z , Z , . . . , Z n ) and output ( X , X , . . . , X n ). We nowprove total variation guarantees for A , which follow from an application of tensorization of d TV .46 lgorithm k -pds-to-glsm Inputs : k -pds instance G ∈ G N with dense subgraph size k that divides N , partition E of[ N ] and edge probabilities 0 < q < p ≤
1, constant threshold τ > 0, slow-growing function w(N) = ω(1), target glsm parameters (n, k, d) with wn ≤ cN and d ≥ c^{−1}N for a sufficiently small constant c > 0, mixture distribution D and target distributions {P_ν}_{ν ∈ R} and Q

1. Map to Gaussian Sparse Mixtures: Form the sample Z1, Z2, ..., Zn ∈ R^d by setting
(Z1, Z2, ..., Zn) ← k-pds-to-isgm(G, E)
where k-pds-to-isgm is applied with r = 2, slow-growing function w(N) = ω(1), t = ⌈log₂(c^{−1}N/k)⌉, target parameters n, k, d, ǫ = 1/2 and µ = c·√(k/(N log n)).

2. Truncate and 3-ary Rejection Kernels: Sample ν1, ν2, ..., νn ∼ i.i.d. D, truncate the νi to lie within [−1,
1] and form the vectors X , X , . . . , X n ∈ R d by setting X ij ← -srk ( tr τ ( Z ij ) , P ν i , P − ν i , Q )for each i ∈ [ n ] and j ∈ [ d ]. Here 3 -srk is applied with N = ⌈ dn ) ⌉ iterations andwith the parameters a = Φ( τ ) − Φ( − τ ) , µ = 12 (Φ( τ + µ ) − Φ( τ − µ )) ,µ = 12 (2 · Φ( τ ) − Φ( τ + µ ) − Φ( τ − µ ))3. Output : The vectors ( X , X , . . . , X n ). Figure 9:
Reduction from k -partite planted dense subgraph to general learning sparse mixtures. Lemma 7.5 ( isgm to glsm ) . Suppose that τ > is a fixed constant and µ = Ω(1 / √ wk log n ) fora sufficiently slow-growing function w . If ( D , Q , {P ν } ν ∈ R ) ∈ uc ( n, k, d ) , then d TV ( A ( isgm ( n, k, d, µ, / , glsm ( n, k, d, {P ν } ν ∈ R , Q , D )) = o (1) under both H and H .Proof. Let ( Z , Z , . . . , Z n ) be an instance of isgm ( n, k, d, µ, /
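As a concrete illustration of Step 2 of Figure 9, the sketch below truncates each Gaussian entry with tr_τ and feeds the resulting ternary symbol through the rejection kernel. It reuses the hypothetical three_srk helper from the sketch in Section 7.1; Phi, tr_tau, step2_truncate_and_reject and the argument names are ours, and the Tern parameters a, µ1, µ2 are computed exactly as in Lemma 7.3.

```python
from math import erf, sqrt

def Phi(x):
    # standard Gaussian CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tr_tau(x, tau):
    # truncation map tr_tau : R -> {-1, 0, 1}
    if x > abs(tau):
        return 1
    if x < -abs(tau):
        return -1
    return 0

def step2_truncate_and_reject(Z, nus, tau, mu, dP_dQ, sample_Q, N_iter):
    """Sketch of Step 2 of k-pds-to-glsm: map Gaussian entries Z[i][j] onto the
    target marginals via tr_tau followed by the 3-ary rejection kernel.

    Z       -- n x d array-like of floats produced by k-pds-to-isgm
    nus     -- list of n mixture labels nu_i, already truncated to [-1, 1]
    dP_dQ   -- callable (nu, x) -> dP_nu/dQ(x)
    sample_Q, N_iter -- as in the three_srk sketch above
    """
    # Tern parameters from Lemma 7.3
    a = Phi(tau) - Phi(-tau)
    mu1 = 0.5 * (Phi(tau + mu) - Phi(tau - mu))
    mu2 = 0.5 * (2 * Phi(tau) - Phi(tau + mu) - Phi(tau - mu))
    X = []
    for i, Zi in enumerate(Z):
        nu = nus[i]
        Xi = [three_srk(tr_tau(z, tau),
                        lambda x: dP_dQ(nu, x),
                        lambda x: dP_dQ(-nu, x),
                        sample_Q, a, mu1, mu2, N_iter)
              for z in Zi]
        X.append(Xi)
    return X
```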
2) under H . In other words, Z i ∼ i.i.d. mix / ( N ( µ · S , I d ) , N ( − µ · S , I d )) where S is a k -subset of [ d ] chosen uniformly atrandom. For the next part of this argument, we condition on: (1) the entire vector ( ν , ν , . . . , ν n );(2) the latent support S ⊆ [ d ] with | S | = k of the planted indices of the Z i ; and (3) the subset P ⊆ [ n ] of sample indices corresponding to the positive part N ( µ · S , I d ) of the mixture. Let C denote the event corresponding to this conditioning. After truncating according to tr τ , by Lemma7.3 the resulting entries are distributed as tr τ ( Z ij ) ∼ Tern( a, µ , µ ) if ( i, j ) ∈ S × P Tern( a, − µ , µ ) if ( i, j ) ∈ S × P C Tern( a, ,
0) if i S τ is constant, Lemma7.3 also implies that a ∈ (0 ,
1) is constant, µ = Θ( µ ) and µ = Θ( µ ). Let S ν be S ν = (cid:26) x ∈ X : 2 | µ | ≥ (cid:12)(cid:12)(cid:12)(cid:12) d P ν i d Q ( x ) − d P − ν i d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) and 2 | µ | max { a, − a } ≥ (cid:12)(cid:12)(cid:12)(cid:12) d P ν i d Q ( x ) + d P − ν i d Q ( x ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:27) as in Lemma 7.2. The second implication of ( D , Q , {P ν } ν ∈ R ) ∈ uc ( n, k, d ) gives that { x ∈ S ν i } occurs with probability at least 1 − δ over each of P ν i , P − ν i and Q , where δ = o ( n − d − ), for each i ∈ [ n ]. This holds for a sufficiently slow-growing choice of w . Therefore we can apply Lemma 7.2 toeach application of 3 -srk in Step 2 of A . Note that | µ | − = O ( √ n log n ) and | µ | − = O ( n log n )since µ = Ω(1 / √ wk log n ) and k ≥
1. Now consider the d -dimensional vectors X ′ , X ′ , . . . , X ′ n withindependent entries distributed as X ′ ij ∼ P ν i if ( i, j ) ∈ S × P P − ν i if ( i, j ) ∈ S × P C Q if i S The tensorization property of d TV from Fact 3.2 implies that d TV (cid:0) L ( X , X , . . . , X n |C ) , L ( X ′ , X ′ , . . . , X ′ n |C ) (cid:1) ≤ n X i =1 d X j =1 d TV (cid:0) L ( X ij |C ) , L ( X ′ ij |C ) (cid:1) ≤ n X i =1 d X j =1 d TV (cid:0) -srk ( tr τ ( Z ij ) , P ν i , P − ν i , Q ) , L ( X ′ ij |C ) (cid:1) ≤ nd " δ (cid:0) | µ | − + | µ | − (cid:1) + (cid:18)
12 + δ (cid:0) | µ | − + | µ | − (cid:1)(cid:19) N = o (1)by the total variation upper bounds in Lemma 7.2. Now consider dropping the conditioning on C . It follows by the definition of glsm that ( X ′ , X ′ , . . . , X ′ n ), when no longer conditioned on C , isdistributed as glsm ( n, k, d, {P ν } ν ∈ R , Q , D ′ ) where D ′ is the distribution sampled by first sampling x ∼ D , truncating x to lie in [ − ,
1] and then multiplying x by − /
2. It thereforefollows by the conditioning property of d TV in Fact 3.2 that d TV (cid:0) A ( isgm ( n, k, d, µ, / , glsm (cid:0) n, k, d, {P ν } ν ∈ R , Q , D ′ (cid:1)(cid:1) = o (1)Now note that since D is symmetric, it follows that d TV ( D , D ′ ) = o ( n − ) since x ∈ D lies in [ − , − o ( n − ). Another application of tensorization yields that d TV ( D ⊗ n , D ′⊗ n ) = o (1). Coupling the latent ν , ν , . . . , ν n sampled from D ′ and D in glsm ( n, k, d, {P ν } ν ∈ R , Q , D ′ )and glsm ( n, k, d, {P ν } ν ∈ R , Q , D ), respectively, yields by the conditioning property that their totalvariation distance is o (1). The desired result in the H case then follows from the triangle inequality.Under H , an identical argument shows the desired result without conditioning on C being necessary.This completes the proof of the lemma.We now use this lemma, Theorem 4.2 and Lemma 3.1 to formally deduce our universalityprinciple for lower bounds at the threshold n = ˜Θ( k ) in glsm . The proof follows a similarstructure to that of Theorems 5.2 and 6.3. We omit details where they are the same.48 heorem 7.6 (Lower Bounds for General Sparse Mixtures) . Let ( n, k, d ) be parameters such that n = o ( k ) and k = o ( d ) . Suppose that ( D , Q , {P ν } ν ∈ R ) ∈ uc ( n, k, d ) and B is a randomizedpolynomial time test for glsm ( n, k, d, {P ν } ν ∈ R , Q , D ) . Then B has asymptotic Type I + II error atleast 1 assuming either the k -pc conjecture or the k -pds conjecture for some fixed edge densities < q < p ≤ .Proof. Assume the k -pds conjecture for some fixed edge densities 0 < q < p ≤ Q =1 − p (1 − p )(1 − q ) + { p =1 } (cid:0) √ q − (cid:1) ∈ (0 , w = w ( n ) = ω (1) be a sufficiently slow-growing function and let t be such that 2 t is the smallest power of two larger than wk . Notethat wn = o ( k (2 t − N be the largest multiple of k less than ( pQ + 1) − k · t . Byconstruction, we have that N = Θ( k · t ) = Θ( wk ). For a sufficiently slow-growing choice of w , itfollows that N ≤ d/
2. Now consider A , which applies k -pds-to-isgm to map from k -pds ( N, k, p, q )to isgm ( n, k, d, µ, /
2) with µ = c ′ δ p k · t ) + 2 log( p − Q ) − · √ t ≍ r wk log n where δ = min n log (cid:16) pQ (cid:17) , log (cid:16) − Q − p (cid:17)o and c ′ > A performs this mapping within o (1)total variation error. Combining this with Lemma 7.5 and applying Lemma 3.3 implies that A maps k -pds ( N, k, p, q ) to glsm ( n, k, d, {P ν } ν ∈ R , Q , D ) within o (1) total variation error under both H and H . Since N = ω ( k ), if B were to have an asymptotic Type I+II error less than 1, thenthis would contradict the k -pds conjecture at edge densities 0 < q < p ≤ ( n, k, d ) The result in Theorem 7.6 shows universality of the computational sample complexity of n = Ω( k )for learning sparse mixtures under the conditions of uc ( n, k, d ). In this section, we make severalremarks on the implications of showing hardness for glsm under our conditions uc ( n, k, d ). Remarks on UC ( n, k, d ) . The conditions for ( D , Q , {P ν } ν ∈ R ) ∈ uc ( n, k, d ) have the followingtwo notable properties. • They are conditions on marginals : The conditions in uc ( n, k, d ) are entirely in terms of thelikelihood ratios d P ν /d Q between the planted and non-planted marginals. In particular, theydo not depend on any properties of high-dimensional distributions constructed from the P ν and Q . Thus Theorem 7.6 extracts the high-dimensional structure leading to statistical-computational gaps in instances of glsm as a condition on the marginals P ν and Q . • Their dependence on n and d is negligible : As noted in Section 2, when the likelihood ratiosare relatively concentrated the dependence of the conditions in uc ( n, k, d ) on n and d isnearly negligible. Specifically, the upper bounds on the functions of the likelihood ratios d P ν /d Q only depend logarithmically on n . If the ratios d P ν /d Q are concentrated under P ν and Q with exponentially decaying tails, then the tail probability bounds of o ( n − d − )in uc ( n, k, d ) only impose a mild condition on the P ν and Q . Instead, the conditions in uc ( n, k, d ) almost exclusively depend on k , implying that they will not implicitly require astronger dependence between n and k to produce hardness than the n = ˜ o ( k ) conditionthat arises from our reductions. Thus Theorem 7.6 does show a universality principle for thecomputational sample complexity of n = ˜Θ( k ).49 and Parameterization over [ − , . As remarked in Section 2, D and the indices of P ν canbe reparameterized without changing the underlying problem. The assumption that D is symmetricand mostly supported on [ − ,
1] is for notational convenience.While the output vectors ( X , X , . . . , X n ) of our reduction k -pds-to-glsm are independent,their coordinates have dependence induced by the mixture D . The fact that our reduction samplesthe ν i implies that if these values were revealed to the algorithm, the problem would still remainhard: an algorithm for the latter could be used together with the reduction to solve k - pc . However,even given the ν i for the i th sample, our reduction is such that whether the planted marginals in the i th sample are distributed according to P ν i or P − ν i remains unknown to the algorithm. Intuitively,our setup chooses to parameterize the distribution D over [ − ,
1] such that the sign ambiguitybetween P ν i or P − ν i is what is producing hardness below the sample complexity of n = ˜Ω( k ). Implications for Concentrated LLR.
We now give several remarks on the conditions uc ( n, k, d )in the case that the log-likelihood ratios (LLR) log d P ν /d Q ( x ) are sufficiently well-concentrated if x ∼ Q or x ∼ P ν . Suppose that ( D , Q , {P ν } ν ∈ R ) ∈ uc ( n, k, d ), fix some function w ( n ) → ∞ as n → ∞ and fix some ν ∈ [ − , S Q is the common support of the P ν and Q , define S to be S = (cid:26) x ∈ S Q : 1 √ wk log n ≥ (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) − d P − ν d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) and 1 wk log n ≥ (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) + d P − ν d Q ( x ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:27) Now note that d TV ( P ν , P − ν ) = 12 · E x ∈Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) − d P − ν d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:21) ≤ · E x ∈Q (cid:20)(cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) − d P − ν d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) · S ( x ) (cid:21) + 12 · P ν (cid:2) S C (cid:3) + 12 · P − ν (cid:2) S C (cid:3) ≤ √ wk log n + o ( n − d − ) . √ k log n A similar calculation with the second condition defining S shows that d TV (cid:0) mix / ( P ν , P − ν ) , Q (cid:1) . k log n If the LLRs log d P ν /d Q are sufficiently well-concentrated, then the random variables (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) − d P − ν d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) and (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) + d P − ν d Q ( x ) − (cid:12)(cid:12)(cid:12)(cid:12) will also concentrate around their means if x ∼ Q . LLR concentration also implies that this is trueif x ∼ P ν or x ∼ P − ν . Thus, under sufficient concentration, the conditions in uc ( n, k, d ) reduce tothe much more interpretable conditions d TV ( P ν , P − ν ) = ˜ o ( k − / ) and d TV (cid:0) mix / ( P ν , P − ν ) , Q (cid:1) = ˜ o ( k − )These conditions directly measure the amount of statistical signal present in the planted marginals P ν . The relevant calculations for an example application of Theorem 7.6 when the LLR concentratesis shown below for sparse PCA. In [BBH19], various assumptions of concentration of the LLR andanalogous implications for computational lower bounds in submatrix detection are analyzed indetail. We refer the reader to Sections 3 and 9 of [BBH19] for the calculations needed to make thediscussion here precise. 50e remark that, assuming sufficient concentration on the LLR, the analysis of the k -sparseeigenvalue statistic from [BR13a] yields an information-theoretic upper bound for glsm . Givensamples ( X , X , . . . , X n ), consider forming the LLR-processed samples Z i with Z ij = E ν ∼D (cid:20) log d P ν d Q ( X ij ) (cid:21) for each i ∈ [ n ] and j ∈ [ d ]. Now consider taking the k -sparse eigenvalue of the samples Z , Z , . . . , Z n .Under sub-Gaussianity assumptions on the Z ij , the analysis in Theorem 2 of [BR13a] applies. Sim-ilarly, the analysis in Theorem 5 of [BR13a] continues to hold, showing that the semidefinite pro-gramming algorithm for sparse PCA yields an algorithmic upper bound for glsm . In many setupscaptured by glsm such as sparse PCA, learning sparse mixtures of Gaussians and learning sparsemixtures of Rademachers, these analyses and our lower bounds together confirm a k -to- k gap. Asinformation-theoretic limits and algorithms are not the focus of this paper, we omit these details. Sparse PCA and Specific Distributions.
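A toy version of the LLR-processed test statistic described above is sketched below. The expectation over ν ∼ D is replaced by a Monte Carlo average, log_lr is a user-supplied callable for log dP_ν/dQ applied entrywise, and the k-sparse eigenvalue is computed by brute force over all k-subsets, which only stands in for the efficient relaxations of [BR13a] and is feasible only for very small d; all names here are ours.

```python
import itertools
import math
import numpy as np

def llr_process(X, log_lr, nu_samples):
    """Form Z_ij = E_{nu ~ D}[log dP_nu/dQ(X_ij)], approximating the expectation
    by an average over Monte Carlo draws nu_samples from D.
    log_lr(nu, X) is assumed to act entrywise on the numpy array X."""
    X = np.asarray(X, dtype=float)
    Z = np.zeros_like(X)
    for nu in nu_samples:
        Z += log_lr(nu, X)
    return Z / len(nu_samples)

def k_sparse_eigenvalue(Z, k):
    """Brute-force k-sparse largest eigenvalue of the empirical second-moment
    matrix of the rows of Z; exponential in d, for illustration only."""
    n, d = Z.shape
    M = Z.T @ Z / n
    best = -math.inf
    for S in itertools.combinations(range(d), k):
        sub = M[np.ix_(S, S)]
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best
```

Under the planted hypothesis, the k coordinates in the support of the mixture inflate the restricted second-moment matrix, which is what the k-sparse eigenvalue statistic is designed to detect.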
One specific example captured by our universalityprinciple and that falls under the concentrated LLR setup discussed above is sparse PCA in thespiked covariance model. The statistical-computational gaps of sparse PCA have been characterizedbased on the planted clique conjecture in a line of work [BR13b, BR13a, WBS16, GMZ17, BBH18,BB19]. We show that our universality principle faithfully recovers the k -to- k gap for sparse PCAshown in [BR13b, BR13a, WBS16, GMZ17, BBH18] assuming the k -pc conjecture up to k = o ( √ n ).We remark that [BB19] shows stronger hardness based on weaker forms of the pc conjecture.We show that sparse PCA corresponds to glsm ( n, k, d, {P ν } ν ∈ R , Q , D ) for the proper choice of( D , Q , {P ν } ν ∈ R ) ∈ uc ( n, k, d ) exactly at its conjectured computational barrier. In particular, wehave the following corollary of Theorem 7.6. Corollary 7.7 (Lower Bounds for Sparse PCA) . Let spca ( n, k, d, θ ) be the testing problem H : ( X , X , . . . , X n ) ∼ i.i.d. N (0 , I d ) ⊗ n H : ( X , X , . . . , X n ) ∼ i.i.d. N (cid:16) , I d + θvv ⊤ (cid:17) ⊗ n where v is a k -sparse unit vector in R d chosen uniformly at random among all such vectors withnonzero entries equal to / √ k . If n = o ( k ) , k = o ( d ) and θ = o (1 / (log n ) ) , then spca ( n, k, d, θ ) can be expressed as glsm ( n, k, d, {P ν } ν ∈ R , Q , D ) for some choice ( D , Q , {P ν } ν ∈ R ) ∈ uc ( n, k, d ) .Proof. Note that if X ∼ N (0 , I d + θvv ⊤ ) then X can be written as X = √ θ · gv + G where G ∼N (0 , I d ) and g ∼ N (0 , X can be rewritten as X = √ θ log n · g ′ v + G where g ′ ∼ N (0 , / √ n ).Now consider setting D = N (0 , / p n ) , P ν = N ν r θ log nk , ! , Q = N (0 , x ∼ D satisfies x ∈ [ − ,
1] is 1 − o ( n − ) by standard Gaussian tailbounds. Fix some ν and let t = ν q θ log nk . Note that if θ = o (1 / (log n ) ), we have that (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) − d P − ν d Q ( x ) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) e tx − t / − e − tx − t / (cid:12)(cid:12)(cid:12) = Θ( tx )ift tx = o (1). Now note that this quantity is o (1 / √ k log n ) as long as x = o (log n ). Note that x = o (log n ) occurs with probability at least 1 − ( n − d − ) as long as d = poly( n ) by standardGaussian tail bounds. Now note that (cid:12)(cid:12)(cid:12)(cid:12) d P ν d Q ( x ) + d P − ν d Q ( x ) − (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) e tx − t / + e − tx − t / − (cid:12)(cid:12)(cid:12) = Θ( t )51olds if tx = o (1), which follows from x = o (log n ). As shown above, this occurs with probability atleast 1 − ( n − d − ). Since t = o (1 /k log n ), we have that these ( D , Q , {P ν } ν ∈ R ) are in uc ( n, k, d ),completing the proof.Combining this lower bound with the subsampling internal reduction in Section 8.1 of [BB19]extends this reduction to the small signal regime of θ = ˜Θ( n − α ), recovering the optimal computa-tional threshold of n = ˜Ω( k /θ ) for all ( n, k, d, θ ) with θ = ˜ o (1) and k = o ( d ). Similar calculationsto those in the above corollary can be used to show many other choices of ( D , Q , {P ν } ν ∈ R ) are in uc ( n, k, d ). Some examples are: • Balanced sparse Gaussian mixtures where Q = N (0 ,
1) and P_ν = N(θν, 1) where θ = ˜o(k^{−1/2}) and D is any distribution over [−1, 1]
• The Bernoulli case where Q = Bern(1/2) and P_ν = Bern(1/2 + θν) where θ = ˜o(k^{−1/2}) and D is any distribution over [−1, 1]
• Sparse mixtures of exponential distributions where Q = Exp(λ) and P_ν = Exp(λ + θν), D is any distribution over [−1,
1] and θ = ˜o(k^{−1/2} · min{λ, (1 + λ)^{−1}})
• Sparse mixtures of centered Gaussians with different variances where Q = N(0, 1) and P_ν = N(0, 1 + θν) where θ = ˜o(k^{−1/2}) and D is any distribution over [−1, 1]

Membership in uc(n, k, d) can be verified for many more choices of D, Q and P_ν using the computations outlined in the discussion above on the implications of our result for concentrated LLR. Furthermore, tradeoffs between n, k and θ at smaller levels of signal θ can be obtained from the lower bounds above through subsampling internal reductions, analogously to sparse PCA.

8 k-Partite Planted Clique Conjecture

Our k-partite versions of planted clique and planted dense subgraph, k-pc and k-pds, seem to be just as hard as the standard versions. While the partition E in their definitions contains a slight amount of information about the position of the clique, in this section we provide evidence that the hardness threshold for k is unchanged by considering two restricted classes of algorithms: tests based on low-degree polynomials as well as statistical query algorithms. For the expert on either of these, the high-level message substantiated below is that k-pc and ordinary pc are virtually identical in both their Fourier spectrum and statistical dimension (and the same is true for k-pds versus ordinary pds). We emphasize that this section contains no new ideas, and repeats the arguments of [Hop18] and [FGR+
13] with a few tiny changes. If anything, however, this only supports the goal of the section, which is to substantiate our claim that k-pc is extremely similar to standard pc.
We remark here that whenever we refer to k-pds in this section we will have q = 1/2 and p = 1/2 + n^{−δ} for arbitrary δ > 0. Furthermore, the analysis of low-degree polynomial and statistical query algorithms for k-pds where p − q = Ω(1) is essentially the same as for k-pc. Also, these analyses remain qualitatively the same when q is replaced by a constant other than 1/2.

8.1 Low-degree Likelihood Ratio

This subsection draws heavily from the paper by Hopkins and Steurer [HS17] and on Hopkins's thesis [Hop18]. Based on their understanding of the sum-of-squares (SOS) hierarchy applied to statistical inference problems, they conjecture that low-degree polynomial tests capture the full power of SOS and, more generally, all efficient hypothesis testing algorithms.
Conjecture 8.1.
For a broad class of hypothesis testing problems H0 versus H1, there is a test running in time n^{˜O(D)} with Type I + II errors tending to zero if and only if there is a successful D-simple statistic, i.e. a polynomial f of degree at most D such that
E_{H0} f(X) = 0 and E_{H0} f(X)² = 1, yet E_{H1} f(X) → ∞.

By the Neyman-Pearson lemma, the likelihood ratio test is the optimal test with respect to Type I + II errors: given a sample X, declare H1 if LR(X) = P_{H1}(X)/P_{H0}(X) > 1 and H0 otherwise. Of course, computing the likelihood ratio is intractable for the problems of interest. The low-degree likelihood ratio LR_{≤D} is the orthogonal projection of the likelihood ratio onto the subspace of polynomials of degree at most D, and as stated in the following theorem is the optimal test of a given degree. We take H0 to be the uniform distribution on the appropriate dimension hypercube {−1, +1}^N, and here the projection is with respect to the inner product ⟨f, g⟩ = E_{H0} f(X)g(X), which also defines a norm ‖f‖² = ⟨f, f⟩.

Theorem 8.2 (Page 35 of [Hop18]). The optimal D-simple statistic is the low-degree likelihood ratio, i.e. it holds that
max over f ∈ R[x]_{≤D} with E_{H0} f(X) = 0 of E_{H1} f(X) / √(E_{H0} f(X)²) = ‖LR_{≤D} − 1‖.

Thus existence of low-degree tests for a given problem boils down to computing the norm of the low-degree likelihood ratio. In order to bound the norm on the right-hand side it is useful to re-express it in terms of the standard Boolean Fourier basis. The collection of functions {χ_α(X) = ∏_{e∈α} X_e : α ⊆ [N]} is an orthonormal basis over the space {−1, +1}^N with the inner product defined above. By orthonormality of the basis, for any basis function χ_α with 1 ≤ |α| ≤ D,
⟨χ_α, LR_{≤D} − 1⟩ = ⟨χ_α, LR⟩ = E_{H0} χ_α(X) LR(X) = E_{H1} χ_α(X)
and ⟨1, LR_{≤D} − 1⟩ = 0. It then follows by Parseval's identity that
‖LR_{≤D} − 1‖² = Σ_{1 ≤ |α| ≤ D} (E_{H1} χ_α(X))²     (1)
which is exactly the Fourier energy up to degree D. By directly computing the Fourier coefficients of LR_{≤D} for k-pc and k-pds, we arrive at the following proposition.

Proposition 8.3 (Failure of low-degree tests). Consider k-pc and k-pds with k = n^{1/2 − ǫ}. Tests of degree at most D fail, i.e., ‖LR_{≤D} − 1‖ = O(1), in the following cases:
(i) in k-pc, if D ≤ C log n for a sufficiently small constant C;
(ii) in k-pds, if D ≤ n^{cδ} for a sufficiently small constant c > 0.

In light of Conjecture 8.1, this is to be interpreted as evidence that for k-pc there is no polynomial time algorithm and for k-pds there is no algorithm with runtime less than exp(n^{cδ}). A proof of this proposition can be found in Appendix A, which is similar to the computation of the Fourier spectrum of pc in [Hop18]. We also make the following remark.

Remark 8.4.
The computational threshold for k in both k -pds ( n, k, + n − δ , ) and pds ( n, k, + n − δ , ) are no longer at k = √ n , but rather at k = n / δ ′ for some δ ′ > . For this reason we seefailure of low-degree tests up to degree (say) n δ/ even for ǫ = 0 , i.e. k = √ n . In this section we verify that the lower bounds shown by [FGR +
13] for pc for a generalizationof statistical query algorithms hold essentially unchanged for k - pc . The same approach results inlower bounds for k - pds that are essentially the same as for pds , which we omit to avoid redundancy.The Statistical Algorithm framework of [FGR +
13] applies to distributional problems, where the input is a sequence of i.i.d. observations from a distribution D. We therefore define distributional versions of k-pds and k-pc. Just as in [FGR+13], we consider bipartite versions of k-pds and k-pc with n vertices per side. Let k divide n and E = E1 ∪ E2 ∪ · · · ∪ Ek be a known partition of the left-hand side vertices [n] with |Ei| = n/k for each i. A bipartite clique S is formed by selecting a single vertex u.a.r. from each Ei on the LHS and including each of the RHS vertices independently with probability k/n each. Note that only the LHS respects the partition E. Now in k-pc all edges between LHS and RHS vertices in S are included, and remaining edges are included with probability 1/2; the corresponding edge probabilities are p and q for k-pds. We now define the corresponding distributional version of k-pc.

Definition 8.5. Let k divide n and fix a known partition E = E1 ∪ E2 ∪ · · · ∪ Ek of [n] with |Ei| = n/k. Let S ⊂ [n] be a subset of indices with |S ∩ Ei| = 1 for each i ∈ [k]. The distribution D_S over {0, 1}^n produces with probability 1 − k/n a uniform point X ∼ Unif({0, 1}^n) and with probability k/n a point X with X_i = 1 for all i ∈ S and X_{S^c} ∼ Unif({0, 1})^{n−k}. The distributional bipartite k-PC problem is to find the subset S given some number of independent samples m from D_S.

The correspondence between this distributional problem, where samples are observed, and the bipartite version can be seen by considering algorithms that sequentially examine the neighborhoods of the RHS vertices in the bipartite graph. Because there are only n RHS vertices, meaningful conclusions regard the number of samples m ≤ n. We now make an important remark on the relationship between these formulations and our reductions.

Remark 8.6.
Our main reductions from k - pc and k - pds to rsme and glsm both only require thepartition structure of E along one axis of the adjacency matrix of the input graph. Therefore bothof these reductions can easily be adapted to begin with the distributional bipartite variants of k - pc and k - pds . However, the reductions to semi-cr require the partition structure along both the rowsand columns of the adjacency matrix of the input graph. Let X = {− , +1 } n denote the space of configurations and let D be a set of distributions over X . Let F be a set of solutions (in our case, clique positions) and Z : D → F be a map takingeach distribution D ∈ D to a subset of solutions Z ( D ) ⊆ F that are defined to be valid solutionsfor D . In our setting, since each clique position is in one-to-one correspondence with distributions,54here is a single clique Z ( D ) corresponding to each distribution D . For m >
0, the distributionalsearch problem Z over D and F using m samples is to find a valid solution f ∈ Z ( D ) given accessto m random samples from an unknown D ∈ D .One class of algorithms we are interested in are called unbiased statistical algorithms , definedby access to an unbiased oracle. Definition 8.7 (Unbiased Oracle) . Let D be the true distribution. A query to the oracle consistsof any function h : X → { , } , and the oracle then takes an independent random sample X ∼ D and returns h ( X ) . These algorithms access the sampled data only through the oracle: unbiased statistical algo-rithms outsource the computation. Because the data is accessed through the oracle, it is possible toprove unconditional lower bounds using information-theoretic methods. Another oracle, VSTAT, issimilar but the oracle is allowed to make an adversarial perturbation of the function evaluation. Itis shown in [FGR +
13] via a simulation argument that the two oracles are approximately equivalent.
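For concreteness, the sketch below samples from the distribution D_S of Definition 8.5 and answers a query in the sense of the Unbiased Oracle of Definition 8.7. The function names are ours; S is assumed to respect the partition E, i.e. to contain exactly one index from each E_i.

```python
import random

def sample_D_S(n, S):
    """One sample from D_S (Definition 8.5): with probability 1 - k/n a uniform
    point of {0,1}^n, and with probability k/n a point that equals 1 on the
    planted set S and is uniform on the remaining coordinates."""
    k = len(S)
    X = [random.randint(0, 1) for _ in range(n)]
    if random.random() < k / n:
        for i in S:
            X[i] = 1
    return X

def unbiased_oracle(h, n, S):
    """Unbiased Oracle (Definition 8.7): evaluate a {0,1}-valued query h on one
    fresh independent sample from the true distribution D_S."""
    return h(sample_D_S(n, S))

# Example query: the indicator that coordinate 0 equals 1.
# answer = unbiased_oracle(lambda X: X[0], n=12, S=[0, 4, 8])
```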
Definition 8.8 (VSTAT Oracle). Let D be the true distribution and t > 0 a sample size parameter. A query to the oracle again consists of any function h : X → {0, 1}, and the oracle returns an arbitrary value v ∈ [E_D h(X) − τ, E_D h(X) + τ], where τ = max{1/t, √(E_D h(X)(1 − E_D h(X))/t)}.

We borrow some definitions from [FGR+13]. For a reference distribution D we define the inner product ⟨f, g⟩_D = E_{X∼D} f(X)g(X) and the corresponding norm ‖f‖_D = √⟨f, f⟩_D. For distributions D1 and D2 both absolutely continuous with respect to D, the pairwise correlation is defined to be
χ_D(D1, D2) = |⟨D1/D − 1, D2/D − 1⟩_D| = |⟨D̂1, D̂2⟩_D|.
Here we defined D̂i = Di/D −
1. The average correlation ρ ( D , D ) of a set of distributions D relativeto distribution D is defined as ρ ( D , D ) = 1 |D| X D ,D ∈D χ D ( D , D ) = 1 |D| X D ,D ∈D (cid:12)(cid:12)(cid:12)D D D − , D D − E D (cid:12)(cid:12)(cid:12) . We quote the definition of statistical dimension with average correlation from [FGR + Definition 8.9 (Statistical dimension) . Fix γ > , η > , and search problem Z over set of solutions F and class of distributions D over X . We consider pairs ( D, D D ) consisting of a “referencedistribution” D over X and a finite set of distributions D D ⊆ D with the following property: forany solution f ∈ F , the set D f = D D \ Z − ( f ) has size at least (1 − η ) · |D D | . Let ℓ ( D, D D ) bethe largest integer ℓ so that for any subset D ′ ⊆ D f with |D ′ | ≥ |D f | /ℓ , the average correlation is | ρ ( D ′ , D ) | < γ (if there is no such ℓ one can take ℓ = 0 ). The statistical dimension with averagecorrelation γ and solution set bound η is defined to be the largest ℓ ( D, D D ) for valid pairs ( D, D D ) as described, and is denoted by SDA( Z , γ, η ) . Theorem 8.10 (Theorems 2.7 and 3.17 of [FGR + . Let X be a domain and Z a search problemover a set of solutions F and a class of distributions D over X . For γ > and η ∈ (0 , ,let ℓ = SDA( Z , γ, η ) . Any (possibly randomized) statistical query algorithm that solves Z withprobability δ > η requires at least ℓ calls to the V ST AT (1 / (3 γ )) oracle to solve Z .Moreover, any statistical query algorithm requires at least m calls to the Unbiased Oracle for m = min n ℓ ( δ − η )2(1 − η ) , ( δ − η ) γ o . In particular, if η ≤ / , then any algorithm with success probability atleast / requires at least min { ℓ/ , / γ } samples from the Unbiased Oracle. S be the set of all size k subsets of [ n ] respecting the partition E , i.e., S = { S : | S | = k and | S ∩ E i | = 1 for i ∈ [ k ] } , and note that |S| = ( n/k ) k . We henceforth use D to denote theuniform distribution on { , } n . The following lemma is exactly the same as in [FGR + S and T to be in S rather than arbitrary size k subsets of [ n ], which doesnot change the bound. Lemma 8.11 (Lemma 5.1 in [FGR + . For
S, T ∈ S , χ D ( D S , D T ) = |h b D S , b D T i D | ≤ | S ∩ T | k /n . We now proceed to derive the main statistical query lower bounds for the bipartite formulationsof k - pc and k - pds . The lemma below is similar to Lemma 5.2 in [FGR + Lemma 8.12 (Modification of Lemma 5.2 in [FGR + . Let δ ≥ / log n and k ≤ n / − δ . Forany integer ℓ ≤ k , S ∈ S , and set A ⊆ S with | A | ≥ |S| /n ℓδ , | A | X T ∈ A (cid:12)(cid:12) h b D S , b D T i D (cid:12)(cid:12) ≤ ℓ +3 k n . This lemma now implies the following statistical dimension lower bound.
Theorem 8.13 (Analogue of Theorem 5.3 of [FGR + . For δ ≥ / log n and k ≤ n / − δ let Z denote the distributional bipartite k - pc problem. If ℓ ≤ k then SDA ( Z , ℓ +3 k /n , (cid:0) nk (cid:1) − k ) ≥ n ℓδ / .Proof. For each clique position S let D S = D \ { D S } . Then |D S | = (cid:0) nk (cid:1) k − (cid:0) − (cid:0) nk (cid:1) − k (cid:1) |D| . Nowfor any D ′ with |D ′ | ≥ |S| /n ℓδ we can apply Lemma 8.12 to conclude that ρ ( D ′ , D ) ≤ ℓ +3 k /n .By Definition 8.9 of statistical dimension this implies the bound stated in the theorem.Applying Theorem 8.10 to this statistical dimension lower bound yields the following hardnessfor statistical query algorithms. Corollary 8.14 (SQ lower bound for recovery in k - pc ) . For any constant δ > and k ≤ n / − δ ,any SQ algorithm that solves the distributional bipartite k - pc problem requires Ω( n /k log n ) =˜Ω( n δ ) queries to the Unbiased Oracle. This is to be interpreted as impossible, as there are only n RHS vertices/samples available inthe actual bipartite graph. Because all the quantities in Theorem 8.13 are the same as in [FGR + Corollary 8.15 (SQ lower bound for decision version of k - pc ) . For any constant δ > , suppose k ≤ n / − δ . Let D = Unif( { , } n ) and let D be the set of all planted bipartite k - pc distributions(one for each clique position). Any SQ algorithm that solves the hypothesis testing problem between D and D with probability better than / requires Ω( n /k ) queries to the Unbiased Oracle. Asimilar statement holds for VSTAT. There is a t = n Ω(log n ) such that any randomized SQ algorithmthat solves the hypothesis testing problem between D and D with probability better than / requiresat least t queries to V ST AT ( n − δ /k ) . Acknowledgements
We are greatly indebted to Jerry Li for introducing the conjectured statistical-computational gapfor robust sparse mean estimation and for discussions that helped lead to this work. We thankIlias Diakonikolas for pointing out the statistical query model construction in [DKS17]. We alsothank Frederic Koehler, Sam Hopkins, Philippe Rigollet, Enric Boix Adser`a, Dheeraj Nagaraj andAustin Stromme for inspiring discussions on related topics.56 eferences [AAK +
07] Noga Alon, Alexandr Andoni, Tali Kaufman, Kevin Matulef, Ronitt Rubinfeld, andNing Xie. Testing k-wise and almost k-wise independence. In
Proceedings of thethirty-ninth annual ACM symposium on Theory of computing , pages 496–505. ACM,2007.[ABL14] Pranjal Awasthi, Maria Florina Balcan, and Philip M Long. The power of localizationfor efficiently learning linear separators with noise. In
Proceedings of the forty-sixthannual ACM symposium on Theory of computing , pages 449–458. ACM, 2014.[ACO08] Dimitris Achlioptas and Amin Coja-Oghlan. Algorithmic barriers from phase transi-tions. In ,pages 793–802. IEEE, 2008.[ACV14] Ery Arias-Castro and Nicolas Verzelen. Community detection in dense random net-works.
The Annals of Statistics , 42(3):940–969, 2014.[AKS98] Noga Alon, Michael Krivelevich, and Benny Sudakov. Finding a large hidden cliquein a random graph.
Random Structures and Algorithms , 13(3-4):457–466, 1998.[ASW13] Martin Azizyan, Aarti Singh, and Larry Wasserman. Minimax theory for high-dimensional gaussian mixtures with sparse mean separation. In
Advances in NeuralInformation Processing Systems , pages 2139–2147, 2013.[AV11] Brendan PW Ames and Stephen A Vavasis. Nuclear norm minimization for the plantedclique and biclique problems.
Mathematical programming , 129(1):69–89, 2011.[BB19] Matthew Brennan and Guy Bresler. Optimal average-case reductions to sparse pca:From weak assumptions to strong hardness. arXiv preprint arXiv:1902.07380 , 2019.[BBH18] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Reducibility and computationallower bounds for problems with planted sparse structure. In
COLT , pages 48–166,2018.[BBH19] Matthew Brennan, Guy Bresler, and Wasim Huleihel. Universality of computationallower bounds for submatrix detection. arXiv preprint arXiv:1902.06916 , 2019.[BDLS17] Sivaraman Balakrishnan, Simon S Du, Jerry Li, and Aarti Singh. Computationallyefficient robust sparse estimation in high dimensions. pages 169–212, 2017.[BGJ18] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholdsfor tensor pca. arXiv preprint arXiv:1808.00921 , 2018.[BGL17] Vijay Bhattiprolu, Venkatesan Guruswami, and Euiwoong Lee. Sum-of-squares cer-tificates for maxima of random tensors on the sphere. In
LIPIcs-Leibniz InternationalProceedings in Informatics , volume 81. Schloss Dagstuhl-Leibniz-Zentrum fuer Infor-matik, 2017.[BHK +
16] Boaz Barak, Samuel B Hopkins, Jonathan Kelner, Pravesh Kothari, Ankur Moitra,and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted cliqueproblem. In
Foundations of Computer Science (FOCS), 2016 IEEE 57th AnnualSymposium on , pages 428–437. IEEE, 2016.57BI13] Cristina Butucea and Yuri I Ingster. Detection of a sparse submatrix of a high-dimensional noisy matrix.
Bernoulli , 19(5B):2652–2688, 2013.[BKW19] Afonso S Bandeira, Dmitriy Kunisky, and Alexander S Wein. Computational hardnessof certifying bounds on constrained pca problems. arXiv preprint arXiv:1902.07324 ,2019.[BMMN17] Gerard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape ofthe spiked tensor model. arXiv preprint arXiv:1711.05424 , 2017.[BR13a] Quentin Berthet and Philippe Rigollet. Complexity theoretic lower bounds for sparseprincipal component detection. In
COLT , pages 1046–1066, 2013.[BR13b] Quentin Berthet and Philippe Rigollet. Optimal detection of sparse principal compo-nents in high dimension.
The Annals of Statistics , 41(4):1780–1815, 2013.[BS95] Avrim Blum and Joel Spencer. Coloring random and semi-random k-colorable graphs.
Journal of Algorithms , 19(2):204–234, 1995.[BT +
06] Andrej Bogdanov, Luca Trevisan, et al. Average-case complexity.
Foundations and Trends® in Theoretical Computer Science, 2(1):1–106, 2006.[CC18] Utkan Onur Candogan and Venkat Chandrasekaran. Finding planted subgraphs with few eigenvalues using the Schur–Horn relaxation. SIAM Journal on Optimization, 28(1):735–759, 2018.[CGP +
19] Wei-Kuo Chen, David Gamarnik, Dmitry Panchenko, Mustazee Rahman, et al. Sub-optimality of local algorithms for a class of max-cut problems.
The Annals of Proba-bility , 47(3):1587–1618, 2019.[Che15] Yudong Chen. Incoherence-optimal matrix completion.
IEEE Transactions on Infor-mation Theory , 61(5):2909–2923, 2015.[CLR15] Tony Cai, Tengyuan Liang, and Alexander Rakhlin. Computational and statisti-cal boundaries for submatrix localization in a large noisy matrix. arXiv preprintarXiv:1502.01988 , 2015.[CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusteddata. In
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory ofComputing , pages 47–60. ACM, 2017.[CW18] T. Tony Cai and Yihong Wu. Statistical and computational limits for sparse matrixdetection. arXiv preprint arXiv:1801.00518 , 2018.[CX16] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted prob-lems and submatrix localization with a growing number of clusters and submatrices.
Journal of Machine Learning Research , 17(27):1–57, 2016.[DF80] Persi Diaconis and David Freedman. Finite exchangeable sequences.
The Annals ofProbability , pages 745–764, 1980.[DF16] Roee David and Uriel Feige. On the effect of randomness on planted 3-coloring models.In
Proceedings of the forty-eighth annual ACM symposium on Theory of Computing ,pages 77–90. ACM, 2016. 58DGGP14] Yael Dekel, Ori Gurel-Gurevich, and Yuval Peres. Finding hidden cliques in lineartime with high probability.
Combinatorics, Probability and Computing , 23(1):29–49,2014.[DKK +
16] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, andAlistair Stewart. Robust estimators in high dimensions without the computationalintractability. In , pages 655–664. IEEE, 2016.[DKK +
18] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, andAlistair Stewart. Robustly learning a gaussian: Getting optimal error, efficiently. In
Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algo-rithms , pages 2683–2702. Society for Industrial and Applied Mathematics, 2018.[DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart. Statistical query lowerbounds for robust estimation of high-dimensional gaussians and gaussian mixtures. In ,pages 73–84. IEEE, 2017.[DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart. Efficient algorithms and lowerbounds for robust linear regression. In
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages 2745–2754. SIAM, 2019.[DLSS14] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexityto improper learning complexity. pages 441–448, 2014.[DM15a] Yash Deshpande and Andrea Montanari. Finding hidden cliques of size p N/e in nearlylinear time.
Foundations of Computational Mathematics , 15(4):1069–1128, 2015.[DM15b] Yash Deshpande and Andrea Montanari. Improved sum-of-squares lower bounds forhidden clique and hidden submatrix problems. pages 523–562, 2015.[DSS16] Amit Daniely and Shai Shalev-Shwartz. Complexity theoretic limitations on learningDNF’s. pages 815–830, 2016.[Fei02] Uriel Feige. Relations between average case complexity and approximation complexity.In
Proceedings of the thiry-fourth annual ACM symposium on Theory of computing ,pages 534–543. ACM, 2002.[FGR +
13] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh Vempala, and Ying Xiao.Statistical algorithms and a lower bound for detecting planted cliques. In
Proceedingsof the forty-fifth annual ACM symposium on Theory of computing , pages 655–664.ACM, 2013.[FK00] Uriel Feige and Robert Krauthgamer. Finding and certifying a large hidden clique ina semirandom graph.
[FK01] Uriel Feige and Joe Kilian. Heuristics for semirandom graph problems. Journal of Computer and System Sciences, 63(4):639–671, 2001.
[FLWY18] Jianqing Fan, Han Liu, Zhaoran Wang, and Zhuoran Yang. Curse of heterogeneity: Computational barriers in sparse mixture models and phase retrieval. arXiv preprint arXiv:1808.06996, 2018.
[FPV15] Vitaly Feldman, Will Perkins, and Santosh Vempala. On the complexity of random satisfiability problems with planted solutions. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 77–86. ACM, 2015.
[FR10] Uriel Feige and Dorit Ron. Finding hidden cliques in linear time. In Discrete Mathematics and Theoretical Computer Science Proceedings, pages 189–204, 2010.
[GKPV10] Shafi Goldwasser, Yael Kalai, Chris Peikert, and Vinod Vaikuntanathan. Robustness of the learning with errors assumption. In Innovations in Computer Science, pages 230–240, 2010.
[GMZ17] Chao Gao, Zongming Ma, and Harrison H Zhou. Sparse CCA: Adaptive estimation and computational barriers. The Annals of Statistics, 45(5):2074–2101, 2017.
[GS+17] David Gamarnik, Madhu Sudan, et al. Limits of local algorithms over sparse random graphs. The Annals of Probability, 45(4):2353–2376, 2017.
[GZ17] David Gamarnik and Ilias Zadik. High dimensional regression with binary coefficients. Estimating squared error and a phase transition. In Conference on Learning Theory, pages 948–953, 2017.
[GZ19] David Gamarnik and Ilias Zadik. The landscape of the planted clique problem: Dense subgraphs and the overlap gap property. arXiv preprint arXiv:1904.07174, 2019.
[HKP+16] Samuel B Hopkins, Pravesh Kothari, Aaron Henry Potechin, Prasad Raghavendra, and Tselil Schramm. On the integrality gap of degree-4 sum of squares for planted clique. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1079–1095. SIAM, 2016.
[HKP+17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. Proceedings of the fifty-eighth IEEE Symposium on Foundations of Computer Science, pages 720–731, 2017.
[HL19] Samuel B Hopkins and Jerry Li. How hard is robust mean estimation? arXiv preprint arXiv:1903.07870, 2019.
[Hop18] Samuel B Hopkins. Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, 2018.
[HS17] Samuel B Hopkins and David Steurer. Efficient Bayesian estimation from few samples: community detection and related problems. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 379–390. IEEE, 2017.
[Hub65] Peter J Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, pages 1753–1758, 1965.
[Hub92] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics, pages 492–518. Springer, 1992.
[Hub11] Peter J Huber. Robust Statistics. Springer, 2011.
[HWX15] Bruce E Hajek, Yihong Wu, and Jiaming Xu. Computational lower bounds for community detection on random graphs. In COLT, pages 899–928, 2015.
[HWX16] Bruce Hajek, Yihong Wu, and Jiaming Xu. Information limits for recovering a hidden community. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1894–1898. IEEE, 2016.
[Kar77] Richard M Karp. Probabilistic analysis of partitioning algorithms for the traveling-salesman problem in the plane. Mathematics of Operations Research, 2(3):209–224, 1977.
[KKM18] Adam Klivans, Pravesh K Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. arXiv preprint arXiv:1803.03241, 2018.
[KMM11] Alexandra Kolla, Konstantin Makarychev, and Yury Makarychev. How to play unique games against a semi-random adversary: Study of semi-random models of unique games. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 443–452. IEEE, 2011.
[KMOW17] Pravesh K Kothari, Ryuhei Mori, Ryan O'Donnell, and David Witmer. Sum of squares lower bounds for refuting any CSP. arXiv preprint arXiv:1701.04521, 2017.
[KMRT+07] Florent Krzakała, Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian, and Lenka Zdeborová. Gibbs states and the set of solutions of random constraint satisfaction problems. Proceedings of the National Academy of Sciences, 104(25):10318–10323, 2007.
[Kuč77] L Kučera. Expected behavior of graph coloring algorithms. In International Conference on Fundamentals of Computation Theory, pages 447–451. Springer, 1977.
[KZ14] Pascal Koiran and Anastasios Zouzias. Hidden cliques and the certification of the restricted isometry property. IEEE Transactions on Information Theory, 60(8):4999–5006, 2014.
[LCL+18] Hao Lu, Yuan Cao, Junwei Lu, Han Liu, and Zhaoran Wang. The edge density barrier: Computational-statistical tradeoffs in combinatorial inference. In International Conference on Machine Learning, pages 3253–3262, 2018.
[LDBB+16] Thibault Lesieur, Caterina De Bacco, Jess Banks, Florent Krzakala, Cris Moore, and Lenka Zdeborová. Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 601–608. IEEE, 2016.
[Lev86] Leonid A Levin. Average case complete problems. SIAM Journal on Computing, 15(1):285–286, 1986.
[Li17] Jerry Li. Robust sparse estimation tasks in high dimensions. arXiv preprint arXiv:1702.05860, 2017.
[Lin92] Nathan Linial. Locality in distributed graph algorithms. SIAM Journal on Computing, 21(1):193–201, 1992.
[LKZ15] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pages 680–687. IEEE, 2015.
[LRV16] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE, 2016.
[McS01] Frank McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 529–537. IEEE, 2001.
[MMV12] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Approximation algorithms for semi-random partitioning problems. In Proceedings of the forty-fourth annual ACM symposium on Theory of Computing, pages 367–384. ACM, 2012.
[MMV15] Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Correlation clustering with noisy partial information. In Conference on Learning Theory, pages 1321–1342, 2015.
[Mon15] Andrea Montanari. Finding one community in a sparse graph. Journal of Statistical Physics, 161(2):273–299, 2015.
[MPW16] Ankur Moitra, William Perry, and Alexander S Wein. How robust are reconstruction thresholds for community detection? In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 828–841. ACM, 2016.
[MS10] Claire Mathieu and Warren Schudy. Correlation clustering with noisy input. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 712–728. Society for Industrial and Applied Mathematics, 2010.
[MW15a] Tengyu Ma and Avi Wigderson. Sum-of-squares lower bounds for sparse PCA. In Advances in Neural Information Processing Systems, pages 1612–1620, 2015.
[MW15b] Zongming Ma and Yihong Wu. Computational barriers in minimax submatrix detection. The Annals of Statistics, 43(3):1089–1116, 2015.
[RBBC19] Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019.
[Ros08] Benjamin Rossman. On the constant-depth complexity of k-clique. In Proceedings of the fortieth annual ACM symposium on Theory of Computing, pages 721–730. ACM, 2008.
[Ros14] Benjamin Rossman. The monotone complexity of k-clique on random graphs. SIAM Journal on Computing, 43(1):256–279, 2014.
[RR97] Alexander A Razborov and Steven Rudich. Natural proofs. Journal of Computer and System Sciences, 55(1):24–35, 1997.
[RS15] Prasad Raghavendra and Tselil Schramm. Tight lower bounds for planted clique in the degree-4 SOS program. arXiv preprint arXiv:1507.05136, 2015.
[RTSZ19] Federico Ricci-Tersenghi, Guilhem Semerjian, and Lenka Zdeborová. Typology of phase transitions in Bayesian inference problems. Physical Review E, 99(4):042109, 2019.
[Tuk75] John W Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531, 1975.
[VA18] Aravindan Vijayaraghavan and Pranjal Awasthi. Clustering semi-random mixtures of Gaussians. In International Conference on Machine Learning, pages 5055–5064, 2018.
[VAC15] Nicolas Verzelen and Ery Arias-Castro. Community detection in sparse random networks. The Annals of Applied Probability, 25(6):3465–3510, 2015.
[VAC17] Nicolas Verzelen and Ery Arias-Castro. Detection and feature selection in sparse mixture models. The Annals of Statistics, 45(5):1920–1950, 2017.
[WBP16] Tengyao Wang, Quentin Berthet, and Yaniv Plan. Average-case hardness of RIP certification. In Advances in Neural Information Processing Systems, pages 3819–3827, 2016.
[WBS16] Tengyao Wang, Quentin Berthet, and Richard J Samworth. Statistical and computational trade-offs in estimation of sparse principal components. The Annals of Statistics, 44(5):1896–1930, 2016.
[Yat85] Yannis G Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, pages 768–774, 1985.
[ZK16] Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453–552, 2016.
A Deferred Proofs
In this section, we present the deferred proofs from the body of the paper. We first present the proof of Lemma 3.3.
Proof of Lemma 3.3.
This follows from a simple induction on $m$. The case $m = 1$ holds by definition. Now observe that, by the data-processing and triangle inequalities for total variation, if $B = A_{m-1} \circ A_{m-2} \circ \cdots \circ A_1$, then
$$d_{\mathrm{TV}}\left(A(P_0), P_m\right) \le d_{\mathrm{TV}}\left(A_m \circ B(P_0), A_m(P_{m-1})\right) + d_{\mathrm{TV}}\left(A_m(P_{m-1}), P_m\right) \le d_{\mathrm{TV}}\left(B(P_0), P_{m-1}\right) + \epsilon_m \le \sum_{i=1}^{m} \epsilon_i$$
where the last inequality follows from the induction hypothesis applied with $m - 1$ and $B$. This completes the induction and proves the lemma.

We now present the proof of Proposition 8.3, which is similar to the computation of the Fourier spectrum of pc in [Hop18]. For brevity, we only provide a sketch of the details, which are similar to [Hop18].

Proof of Proposition 8.3. Recall that in $k$-pc the nodes are partitioned into $k$ sets $E_1, \dots, E_k$ of size $n/k$ each. Denote by $S$ the set of clique vertices. We are guaranteed that $|S \cap E_i| = 1$ for all $1 \le i \le k$, and thus the edges between nodes within any given $E_i$ contain no information and can be removed without changing the clique. We take the set of possible edges $\mathcal{E} \subseteq \binom{[n]}{2}$ in an instance of $k$-pc to be the pairs $ij$ with $i$ and $j$ from different parts of the partition. Let $\mathcal{S} = \{S : |S \cap E_i| = 1 \text{ for all } i\}$ be the collection of all size-$k$ subsets respecting the given partition $E_1, \dots, E_k$. Note that choosing an $S$ uniformly from $\mathcal{S}$ amounts to selecting a single node uniformly at random from each set in the partition. Let $P_S$ be the distribution on graphs such that $X_{ij} = 1$ if $i \in S$ and $j \in S$, and otherwise $X_{ij} = \pm 1$ uniformly at random. The mixture over a uniform choice of $S$ is denoted by $P = \mathbb{E}_{S \sim \mathrm{Unif}(\mathcal{S})} P_S$.

Now let $\alpha \subseteq \mathcal{E}$ be a subset of possible edges. The set of functions $\{\chi_\alpha(X) = \prod_{e \in \alpha} X_e : \alpha \subseteq \mathcal{E}\}$ comprises the standard Fourier basis on $\{\pm 1\}^{\mathcal{E}}$. Consider a fixed clique $S$. Just as for standard pc, because $\mathbb{E}_{P_S} X_e = 0$ if $e \notin \binom{S}{2}$ and non-clique edges are independent, we see that $\mathbb{E}_{P_S}[\chi_\alpha(X)] = \mathbf{1}\{V(\alpha) \subseteq S\}$, where $V(\alpha)$ is the set of nodes covered by edges in $\alpha$. Thus, if $V(\alpha)$ has at most one node per size-$n/k$ set, then the Fourier coefficients are
$$\mathbb{E}_P[\chi_\alpha(X)] = \mathbb{P}_{S \sim \mathrm{Unif}(\mathcal{S})}\left(V(\alpha) \subseteq S\right) = \left(\frac{1}{n/k}\right)^{|V(\alpha)|} = \left(\frac{k}{n}\right)^{|V(\alpha)|},$$
and otherwise $\mathbb{E}_P[\chi_\alpha(X)] = 0$.

Remarkably, as can be seen from [Hop18] or [BHK+16], this is precisely the same Fourier coefficient as for the version of planted clique where each node is independently included in $S$ with probability $k/n$. Because the set of Fourier coefficients is indexed by $\mathcal{E}$ in $k$-pc and this is a subset of the set of Fourier coefficients in standard pc, it immediately follows that the quantity of interest in (1) is smaller in $k$-pc relative to pc. Thus $k$-pc is at least as hard as pc from the perspective of low-degree polynomial tests.

We briefly sketch the calculation showing a constant bound on the Fourier energy of $k$-pc for sets of size $|\alpha| \le D$ for $D = C \log n$, following the calculation for pc in [Hop18]. Note that if $|\alpha| \le D$, then $|V(\alpha)| \le 2D$, and for every $t \le 2D$ we may bound the number of sets $\alpha \subseteq \mathcal{E}$ with $|V(\alpha)| = t$ and $|\alpha| \le D$ as
$$\binom{k}{t}\left(\frac{n}{k}\right)^{t}\binom{\binom{t}{2}}{|\alpha|} \le \binom{k}{t}\left(\frac{n}{k}\right)^{t} t^{\min(2D,\, t^2)} \le n^{t}\, t^{\min(2D,\, t^2)}. \quad (2)$$
The total Fourier energy for $0 < |\alpha| \le D = C \log n$ is then
$$\sum_{\substack{\alpha \subseteq \mathcal{E} \\ 0 < |\alpha| \le D}} \left(\mathbb{E}_H\, \chi_\alpha(X)\right)^2 \le \sum_{t \le \sqrt{C \log n}} \left(\frac{k}{n}\right)^{2t} n^{t}\, t^{t^2} + \sum_{\sqrt{C \log n} < t \le 2D} \left(\frac{k}{n}\right)^{2t} n^{t}\, t^{2D},$$
and both sums are $O(1)$ when $k = n^{1/2 - \epsilon}$, as in the corresponding calculation for pc in [Hop18]. A similar calculation applies to the $k$-partite variant with planted edge density $p$ satisfying $2p - 1 = 2n^{-\delta}$ and ambient edge density $1/2$. Since $p > 1/2$, we express $\mathrm{Bern}(p)$ as the mixture $\mathrm{Bern}(p) = (2 - 2p) \cdot \mathrm{Bern}(1/2) + (2p - 1) \cdot \mathrm{Bern}(1)$. The Fourier coefficient corresponding to a set $\alpha$ with $V(\alpha) \subseteq S$ is then nonzero only if each of the edges selected the $\mathrm{Bern}(1)$ component of the mixture, so $\mathbb{E}\, \chi_\alpha(X) = (2p - 1)^{|\alpha|} = (2n^{-\delta})^{|\alpha|}$. We will now take $D = n^{\delta/2}$ and again $k = n^{1/2 - \epsilon}$. By (2), the number of sets with $|\alpha| = r$ and $|V(\alpha)| = t$ is bounded by $n^t t^{2r}$, so
$$\sum_{\substack{\alpha \subseteq \mathcal{E} \\ 0 < |\alpha| \le D}} \left(\mathbb{E}_H\, \chi_\alpha(X)\right)^2 \le \sum_{t \le 2D} \sum_{\substack{\alpha : |V(\alpha)| = t \\ 0 < |\alpha| \le D}} \left(\frac{k}{n}\right)^{2t} \left(2n^{-\delta}\right)^{2|\alpha|} \le \sum_{t \le 2D} \sum_{t/2 \le r \le t^2/2} n^{t}\, t^{2r} \left(\frac{k}{n}\right)^{2t} \left(2n^{-\delta}\right)^{2r} \le \sum_{t \le 2D} \sum_{t/2 \le r \le t^2/2} n^{-2\epsilon t} \left(4 n^{-\delta/2}\right)^{2r}$$
where the last inequality used $t \le 2D = 2n^{\delta/2}$. This last quantity is $O(1)$.
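As a concrete illustration of the Fourier coefficient identity $\mathbb{E}_P[\chi_\alpha(X)] = (k/n)^{|V(\alpha)|}$ derived above, the following short Python sketch estimates the coefficient by Monte Carlo on a small $k$-pc instance and compares it with the predicted value. This snippet is an editorial illustration rather than part of the original argument; the instance size $n, k$, the choice of $\alpha$, and the number of trials are arbitrary values chosen only for demonstration.

```python
import itertools
import random

def sample_k_pc(n, k):
    """Sample X in {-1,+1}^E from the k-pc planted distribution: [n] is split
    into k blocks of size n // k, one clique vertex is chosen uniformly per
    block, clique edges are set to +1, and all other cross-block edges are
    uniform in {-1, +1}. Within-block pairs are dropped, matching E."""
    m = n // k
    block_of = {v: v // m for v in range(n)}
    clique = {random.randrange(i * m, (i + 1) * m) for i in range(k)}
    X = {}
    for i, j in itertools.combinations(range(n), 2):
        if block_of[i] == block_of[j]:
            continue  # within-block pairs carry no information
        X[(i, j)] = 1 if (i in clique and j in clique) else random.choice((-1, 1))
    return X

def chi(alpha, X):
    """Fourier character chi_alpha(X) = prod_{e in alpha} X_e."""
    value = 1
    for e in alpha:
        value *= X[e]
    return value

if __name__ == "__main__":
    random.seed(0)
    n, k = 12, 4                # blocks {0,1,2}, {3,4,5}, {6,7,8}, {9,10,11}
    alpha = [(0, 3), (3, 6)]    # V(alpha) = {0, 3, 6}: one vertex in each of three blocks
    trials = 200_000
    estimate = sum(chi(alpha, sample_k_pc(n, k)) for _ in range(trials)) / trials
    predicted = (k / n) ** 3    # (k/n)^{|V(alpha)|} = (1/3)^3
    print(f"Monte Carlo estimate of E_P[chi_alpha(X)]: {estimate:.4f}")
    print(f"Predicted (k/n)^|V(alpha)|:               {predicted:.4f}")
```

Replacing the planted sampler with i.i.d. uniform $\pm 1$ edges would drive the estimate toward zero, consistent with the fact that every nonempty character has mean zero under the null distribution.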
We now present the proof of Lemma 8.12, which is similar to Lemma 5.2 in [FGR+13].

Proof of Lemma 8.12. The proof is almost identical to Lemma 5.2 in [FGR+