On the Fourier transform of a quantitative trait: Implications for compressive sensing
Discovering genetic networks using compressive sensing
Stephen Doro a,∗, Matthew A. Herman b
a Columbia University, New York, NY, USA
b Fourier Genetics, Austin, TX, USA
Abstract
A first analysis applying compressive sensing to a quantitative biological trait and its compressible "frequency domain" is presented. Consider an n-bit genetic sequence and suppose we want to discover a function that maps participating alleles (or even environmental influences) to a particular trait. Under plausible assumptions of how they evolved, certain traits can be viewed as "smooth" functions on the n-dimensional Boolean lattice of possible genomes. This allows approximation of their Fourier transforms, i.e., their gene networks, as sparse, dominated by "low-frequency" components. In turn, we can apply compressive sensing methods to collect relatively few samples, yet achieve accurate recovery. For an arbitrary quantitative trait affected by n = 26 genes and with 812 meaningful gene interactions, our simulations show that noisy trait measurements (SNR = 20 dB) from just M genomes out of a population of size N = 2^26, at an undersample ratio M/N far below 1%, permit accurate recovery. More dramatic undersample ratios are possible for traits affected by more genes. Work is currently underway to see if empirical data fit the proposed model. If so, it could offer a radical reduction in the number of measurements — from exponential to polynomial in some cases — necessary to quantify the relationship between genomes and certain traits.

Keywords:
Computational biology, quantitative traits, genomics, Boolean functions, Fourier transform
1. Introduction
There is a conceptual problem in describing the relationship between the genome and the phenotype. Consider some quantifiable trait of an organism, such as height, weight, antibiotic resistance, spherical equivalent of the eye, intelligence, or some other form of "fitness." It is natural to seek to find the set of genes responsible for that trait. Genome-wide association studies (GWAS) can detect places in the chromosome with a strong statistical association with some trait. Alleles which influence the trait in question are located at such places, although the precision of GWAS may be insufficient to pinpoint the exact location responsible for the trait variability. However, even if the allele that correlates with a trait could be pinpointed, one cannot quantify the full effect of each gene on the trait in question, because the effect of one gene may be modified

∗ Corresponding author
Email address: [email protected] (Stephen Doro)
Preprint submitted to Journal of Theoretical Biology, January 6, 2021 (arXiv, q-bio.QM)

by the presence of other genes at other loci. A naïve approach would attempt to find a spectrum of loci which modify the trait, attach a number measuring the effect of swapping alleles at each of these loci (the "allele effect size" [1]), and then predict the resultant trait value by simply adding together the individual effects over the spectrum of loci. This approach predicts that the values of a trait over a population with independently distributed alleles will converge to a Gaussian distribution (page 66 of [2]). This theory underlies biometric analysis of "heritability," which assumes that trait values are the convolution of genetic and environmental influences. Indeed, the typical distributions of biological traits appear to be bell-shaped curves, although fat tails are not ruled out.

But it is also quite clear that the naïve analysis above is incorrect. Even in the simplest cases of two homologous loci on paired chromosomes, examples of dominant and recessive gene interactions show that the effects of switching alleles at separate loci are not strictly additive. A trait is not a linear combination of individual gene effects — it is a function of the entire genome. To conclusively understand the effect of the genes, one would need a table assigning the value of the trait to every possible combination of the relevant genes at many locations. But such a table would be enormous. If traits are functions of the entire genome we face what has been called the "scale problem" [3]. The number of possible genomes grows exponentially with the number of gene loci. For example, if we find n loci (each with 2 alternate alleles) associated with a certain trait, then there are 2^n genomes. To pick a relatively small case, if n = 100 there will be approximately 10^30 potentially different genomes. It would be flatly impossible to measure the trait for each such genome and hence, the function from genome to trait cannot be measured in any practical sense.

This paper explores a third possibility. There may be many traits which are neither strictly additive nor as chaotic as the theoretical nightmare above, where each genome has its own arbitrary trait value. It is possible to find a way to interpolate between these two possibilities. Namely, we expand the spectrum of gene effects from the individual loci to associate an interaction effect to every possible set, or cluster, of loci (e.g., as singletons, pairs, triads, etc.). Given this many interactions, any trait can be broken into a spectrum of interactions, but there are now 2^n possibilities. Allele effect sizes are included in this spectrum, which is enlarged to account for all possible interactions. But it is quite plausible that the important interactions are a very sparse subset — needles in the haystack of possible interactions — and these significant interactions can be reliably discovered.

Techniques from the field of compressive sensing avoid the scale problem arising when we associate a trait with an entire genome, yet also avoid the unquantifiability of a direct map from gene to trait. The goal is to discover the Fourier transform of a trait, which is the gene network (see Section 3.2). The quantitative effect of the genome is broken into Fourier components on a Boolean lattice. Each Fourier coefficient measures the strength of the interaction of a certain subset of loci. The "level" of a Fourier component indicates the number of gene loci in a given cluster (see Section 3.3).
It will turn out that the level-0 component, corresponding to the null set, is the average value of the trait, and the level-1 components, which correspond to a single locus, are simply the allele effect sizes: the change in the trait average from swapping allele '1' to allele '0' at the locus in question. Similarly, for two homologous loci, the corresponding level-2 Fourier component encodes their interaction, with the sign expressing whether this is a dominant or a recessive interaction. In this context, the components associated with relatively few genes take the place of "low-frequency" terms in standard Fourier analysis, while "high-frequency" terms represent interactions involving many genes. Formally, the mathematics used is a well-understood generalization of standard Fourier series or the Fourier transform, using a finite abelian group as the underlying basis, rather than the circle or line.

We can view the Fourier transform of a trait as providing information about the network of gene interactions in the following way. Suppose A and B are loci and that the Fourier coefficients g_A and g_B indicate the direct effects of different alleles at A and B, respectively. Then a nonzero Fourier coefficient g_AB reveals that, in addition to the direct effects, there is some cooperation or interference between the alleles at loci A and B. For example, if locus A produces a ligand and locus B produces a receptor, and if the trait is conditional on ligand binding to receptor, then the trait emerges only when both A and B are effective. This interaction of A and B will be reflected by the Fourier coefficient g_AB having a nonzero value. Gene interactions can be visualized as a simplicial complex with the loci as points, and large interaction terms as edges. More complex interactions can be pictured as geometric simplices.

There are two potential benefits to exploring the Fourier transform of a trait.
1. In the Fourier domain, entries provide information about the gene network, such as the interactions between the metabolic activities of different genes. Large gene interaction coefficients may provide clues as to which genes are interacting intensely as well as which are occupied with independent tasks. Thus, observations of trait matched to genome may contribute to an understanding of gene function.
2. In the trait space, we have precise information of the trait for relatively few specific genomes, and no information for the rest. An accurate enough reconstruction of the larger gene interaction Fourier coefficients will allow a low-error prediction of the full trait via an inverse Fourier transform. This would provide insight into trait phenotype for novel, previously unexamined organisms.

There is a class of traits which have sufficient additivity of gene effects to permit Fourier analysis to isolate them. Provisionally, let us call such traits "smooth." Similar to functions dominated by low frequencies in classical Fourier analysis, a smooth trait is one where, on average, the Fourier coefficients of significant size are concentrated in the lower levels. Important quantitative traits may turn out to be smooth. In particular, we give heuristic arguments that traits with a pronounced modular structure will be smooth. This will ultimately be an empirical question. Our present purpose is to show how to circumvent the scale problem for such traits. The low-level concentration phenomenon implies sparsity in the Fourier domain, which allows the use of ideas from compressive sensing to estimate the genome-to-trait function from relatively few observations.

Two prerequisites must be met if we are to implement a compressive sensing scheme: (i) a sparse (or compressible) representation of the data of interest, and (ii) a sensing modality that is "incoherent" with respect to the sparsifying basis.
The first condition is satisfied based on the assumed model of the traits we are interested in. The second condition is conveniently fulfilled in our model since the rows of the Fourier matrix are known to be incoherent relative to the standard basis. This is a manifestation of the uncertainty principle, which dictates that localization in one domain implies its dual is "spread out." This provides a rule of thumb central to the philosophy of the compressive sensing method — by subsampling in the spread-out domain (as opposed to the sparse domain) we are essentially guaranteed to gather nontrivial measurements. As the Fourier transform is a global operator, each of these measurements yields some information about the sparse domain of interest. A review of compressive sensing can be found in Section 4.1.

[Footnote: We recently became aware of new work on detecting gene interactions using Fourier analysis on non-abelian groups [4]. However, it does not address our main theme of smooth/modular traits nor utilize compressive sensing theory.]

It is compelling to note that compressive sensing has a connection with biological science through the study of combinatorial group testing, which has its roots in designing (the minimum number of) trials to screen for venereal disease in soldiers during World War II [5]. Indeed, many previous applications of compressive sensing to genetics have also used group testing in their modi operandi [6, 7, 8, 9]. Other studies have taken a more traditional compressive sensing approach [10, 11]. By and large, what these investigations have in common is the assumption that DNA sequences of interest can be represented as a sparse signal, e.g., because there are relatively few gene markers participating (either directly or differentially), or because certain genetic diseases are rare, or due to the fact that only a few agents in a pool are present or are defective. Most of the work is then devoted to: (i) designing the screening method, minimizing the observations (e.g., the number of sequencing probe spots, the number of pools, or just the sheer number of subjects), and the mathematics of the associated sensing matrix; and (ii) the resulting challenges in recovery of the sparse signal.

Our work has little in common with these earlier approaches — rather than a DNA sequence, we assume it is a trait's Fourier transform that is the underlying sparse object of interest. Although previous theoreticians have suggested the application of Fourier analysis to trait "fitness" [12], to the best of our knowledge, the study presented here is the first to provide the theory on when and why a trait's Fourier transform may be compressible, and to also connect it with ideas from compressive sensing. Guidance can be gleaned from previous work on reconstructing a sparse time-domain signal from incomplete frequency information (e.g., [13]), except we treat traits as real-valued functions on the Boolean hypercube of possible genomes, and examine the dual scenario with assumed sparsity in the frequency domain dominated by low-level components, accompanied by incomplete, random observations in the trait space. We also deal with vector sizes that are exponentially larger than typically reported in the compressive sensing literature.
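As a toy illustration of this dual scenario (sparsity in the network domain, random observations in the trait space), the following Python sketch builds a small synthetic gene network whose significant coefficients sit at low levels, subsamples the corresponding trait, and recovers the network greedily. All sizes, the support, and the coefficient values here are illustrative and not taken from the paper's simulations.

```python
import numpy as np

def hadamard(n):
    # Sylvester-type Hadamard matrix of order 2**n (the Fourier basis on Z_2^n).
    H = np.array([[1]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n = 8                                  # toy problem: 2**8 = 256 possible genomes
N = 2 ** n
H = hadamard(n)

# Synthetic "smooth" gene network: 4 significant interactions, all at levels <= 2.
support, values = [1, 3, 4, 12], [16.0, -12.0, 9.0, -6.0]
g = np.zeros(N)
g[support] = values

t = H @ g                              # full trait over all 2**n genomes
M = 128                                # observe only M of the N genomes
rows = rng.choice(N, size=M, replace=False)
y, A = t[rows], H[rows, :].astype(float)

# Greedy sparse recovery (orthogonal matching pursuit) of the gene network.
resid, S = y.copy(), []
for _ in range(len(support)):
    S.append(int(np.argmax(np.abs(A.T @ resid))))
    coef, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
    resid = y - A[:, S] @ coef
g_hat = np.zeros(N)
g_hat[S] = coef
print(sorted(S))                       # the recovered support
```

The design point being illustrated is incoherence: each observed trait value is a ±1-weighted combination of every network coefficient, so even a random subset of genomes carries global information about the sparse network.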
2. Modular Traits
In our model of gene interaction, a trait t is a real-valued function determined by n genes, a vector in a 2^n-dimensional space. These 2^n numbers can, in principle, be empirically determined by measuring the trait for a selection of individuals of each genome. But in practice, if n is large (let us say, an order of ten or more), the task exceeds the number of organisms. It would seem that converting these 2^n numbers to their 2^n Fourier transform coefficients does not obviate this scale problem, but there are certain traits where we might expect the Fourier coefficients to be small, or even vanish, at high levels.

Consider what we call "gauntlet processes," multi-stage survival tests. Imagine a sequence of m filters, f_1, f_2, ..., f_m, and denote P_i as the probability of surviving past filter f_i. Then the overall survival rate is the product

    P = P_1 × P_2 × ··· × P_m.   (1)

The paradigm for such processes relates to the original idea behind Darwinian fitness. An organism is tested by a series of barriers to survival and to successful reproduction. Each barrier is associated with a certain probability of successful passage. These probabilities are, to good approximation, independent, and overall survival (or reproduction) requires success with every barrier. However, the product of probabilities can lead to undesired interaction terms. Instead, we want a measure for "fitness" which is additive. This can be achieved by defining "survivability" as the logarithm of the probability of surviving a gauntlet of challenges. Hence, (1) becomes

    log(P) = log(P_1) + log(P_2) + ··· + log(P_m).

It has been noted in prior studies of antibiotic resistance in bacteria [12] that the logarithm of survival is the appropriate version of a fitness trait, and not raw survival percentages.
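The additivity of log-survival is easy to check numerically; a minimal sketch with made-up filter probabilities:

```python
import math

P = [0.9, 0.5, 0.8]        # made-up survival probabilities for m = 3 filters

P_total = math.prod(P)      # multiplicative overall survival, as in (1)
survivability = sum(math.log(p) for p in P)   # additive log-survival

# The two views agree: the log of the product equals the sum of the logs.
assert math.isclose(math.log(P_total), survivability)
print(round(P_total, 6))    # 0.36
```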
The gauntlet processes above may be generalized. Filters need not be organized sequentially: it is sufficient that a trait be organized into functional components, with interactions concentrated within each component. Biological functions are typically organized into modules, each with an associated suite of genes [14], breaking a task into an array of subtasks. In turn, subtasks may themselves be composite, leading to a hierarchical structure. This organizing principle facilitates evolution, since a submodule may be modified without inducing global complications. This explains why those exceptional proteins which interact with many other proteins are very stable throughout evolution [15]. Thus, modularity largely confines gene interactions to those within a module, i.e., locally. This motivates our governing hypothesis: due to modularity, many gene networks are, or can be approximated as, sparse, with the vast majority of large coefficients concentrated into the lower levels. If compressive sensing detects clustering of interactions, it can help identify the module structure of genome interactions [16].

The notion of modules may be made less abstract with a few concrete examples. It has long been conjectured that human intelligence is due to distinct functional components, or "factors" in Thomson's terminology [17]. Further, GWAS have identified 52 gene locations significantly associated with the trait of intelligence [18]. A typical IQ test involves counting the percentage of correct answers to a collection of questions, in other words, estimating a probability of success. It is plausible then that the task of "problem solving" is composed of genetically distinct modules related, e.g., to correct comprehension, sufficient memory, accurate calculation, etc.

The analysis of myopia genes may also follow this paradigm.
Here, we are concerned with survival of useful visual information, which must endure the successive effects of the cornea, of the lens, and then the blur due to excessive axial length of the eye. If the genes causing steep cornea, dense lens, and elongated globe are distinct, then their effects should be roughly additive, when quantified by diopter. Recent work [19] is beginning to identify functional modules in the genetics of myopia.

[Footnote: "Levels" {L_k} are introduced in Section 3.3 and formally defined in (13). In words, coefficients in the k-th level of a network measure the strength of interaction within clusters consisting of precisely k loci. For instance, for k = 2, level L_2 deals with interactions for all possible pairs of loci.]

[Figure 1: A simple network composed of n = 4 genes. Gene 0 and Gene 1 lie in series on the left branch, and Gene 2 and Gene 3 in series on the right branch; the two branches run in parallel from Start to Goal.]

Gene loci often code for enzymes that establish a metabolic network with multiple functions. Further, these networks often resemble a logic network, although there is no exact correspondence. Under suitable restrictions of depth and size, the Fourier components of such logic networks have power spectra concentrated at low levels [20]. Metabolic networks achieve a desired state via a wide variety of genetically controlled transitions, where genes correspond to edges that permit a transition from one state to the next both in series and in parallel (see Figure 1). Such a structure is sufficiently general as it can represent many possible processes: such as catalysis of a chemical reaction, transport across a membrane, activation of a receptor, etc.

Figure 1 shows a small, but nontrivial network; its n = 4 genes give rise to a gene network whose significant coefficients are restricted to the lower levels. Suppose the parallel branches are combined so as to emulate a logical OR gate. Thus either the left or right branch can achieve the "goal" of completing some subtask, or perhaps, having or not having some dominant trait. Example 1 on page 7 explains the mathematical detail of a trait t based on this Boolean circuit and contains its resulting truth table showing the genome-to-trait relationship: how the i-th trait value t_i is a function of genome i = [i_3, i_2, i_1, i_0]. From (8) and (11), the associated gene network is g = H t, shown in the right-hand table (note, the order of g has been permuted so that its indices/coefficients are grouped into their respective levels). Observe that seven rows (j = 0, 1, 2, 4, 8, 3, 12) are emboldened, highlighting the fact that the large-magnitude coefficients g_j occupy the lower levels, 0 ≤ k ≤ 2. Admittedly, the dynamic range between the "large" and "small" coefficients of g is slight in this case (i.e., the magnitudes 7 and 3 versus 1 in the table on page 7). However, for larger and more complicated networks, significantly greater dynamic ranges will occur, which means the gene networks will be compressible and thus well-approximated by a K-sparse representation. In this sense, we can interpret the |g_j| = 1 coefficients here as "insignificant." Hence, vector g can be loosely characterized as "7-sparse" since it has K = 7 significant coefficients.

Let us analyze each level of the gene network in further depth:

Example 1. The Fourier relationship between a full trait t and its gene network g. Suppose we are interested in the n = 4 genes of Figure 1; there are 2^4 =
16 possible genomes: these codes and their decimal equivalents are shown in the Boolean lattice of Figure 2. An organism, due to its genetic code, either DOES or DOES NOT "have the trait," which we represent with a '1' or '0', respectively. Hence the trait value t_i for the organism with genome i = [i_3, i_2, i_1, i_0] is

    t_i = (i_0 AND i_1) OR (i_2 AND i_3).   (2)

Suppose we have access to the organisms with the 16 unique genomes. We measure each individual's trait value t_i and store it in the vector t as the i-th entry (see the table below on the left). Given the full trait t, we can take its Fourier transform (11) to obtain the associated gene network g, which is defined on its own Boolean lattice, again shown in Figure 2. Let us permute the order of g so that its coefficients are grouped into levels 0 ≤ k ≤ 4. Recall, k corresponds to the number of participating loci identified by the '1's in the binary code of index j. The table below on the right shows this level-ordered gene network. Rows j = 0, 1, 2, 4, 8, 3, 12 are highlighted (marked *) since they contain the larger magnitude g_j — notice they occur in the lower levels: 0 ≤ k ≤ 2.

    Trait, t                        Gene network, g (level-ordered, unnormalized)
    i   i3 i2 i1 i0   t_i           j   j3 j2 j1 j0   k   g_j
    0   0  0  0  0    0           * 0   0  0  0  0    0    7
    1   0  0  0  1    0           * 1   0  0  0  1    1   -3
    2   0  0  1  0    0           * 2   0  0  1  0    1   -3
    3   0  0  1  1    1           * 4   0  1  0  0    1   -3
    4   0  1  0  0    0           * 8   1  0  0  0    1   -3
    5   0  1  0  1    0           * 3   0  0  1  1    2    3
    6   0  1  1  0    0             5   0  1  0  1    2   -1
    7   0  1  1  1    1             6   0  1  1  0    2   -1
    8   1  0  0  0    0             9   1  0  0  1    2   -1
    9   1  0  0  1    0            10   1  0  1  0    2   -1
    10  1  0  1  0    0           *12   1  1  0  0    2    3
    11  1  0  1  1    1             7   0  1  1  1    3    1
    12  1  1  0  0    1            11   1  0  1  1    3    1
    13  1  1  0  1    1            13   1  1  0  1    3    1
    14  1  1  1  0    1            14   1  1  1  0    3    1
    15  1  1  1  1    1            15   1  1  1  1    4   -1
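The trait and network of Example 1 can be reproduced with a few lines of Python; this sketch assumes the unnormalized transform g = H t, where H is the Sylvester-type Hadamard matrix of (8):

```python
import numpy as np

def hadamard(n):
    # Sylvester-type Hadamard matrix of order 2**n, as in (8).
    H = np.array([[1]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

n = 4
# Trait (2): t_i = (i0 AND i1) OR (i2 AND i3), with i0 the least-significant bit.
t = np.array([1 if ((i & 1) and (i >> 1) & 1) or ((i >> 2) & 1 and (i >> 3) & 1)
              else 0 for i in range(2 ** n)])

g = hadamard(n) @ t                     # gene network, unnormalized

big = sorted(j for j in range(2 ** n) if abs(g[j]) > 1)
levels = [bin(j).count("1") for j in big]
print(big, levels)   # [0, 1, 2, 3, 4, 8, 12] [0, 1, 1, 2, 1, 1, 2]
```

The seven large-magnitude coefficients land exactly on the highlighted rows, all at levels k ≤ 2.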
• For k = 0, in level L_0 = {0} the index j = 0 corresponds to the all-ones row of H. As a result, g_0 = (1/16) Σ_i t_i = 7/16, the average value of the trait (listed unnormalized as 7 in the table).

• For k = 1, the indices in level L_1 = {1, 2, 4, 8} represent singletons, or individual genes. Notice the corresponding binary codes for each j have precisely one j_l = 1, identifying the locus l being evaluated. Here the g_j for all j in L_1 are relatively large. This makes sense because each gene locus independently has a strong direct effect on the trait.

• For k = 2, indices in level L_2 = {3, 5, 6, 9, 10, 12} refer to pairs of genes. However, only g_3 and g_12 are relatively large. The binary code for j = 3 represents loci l = 0, 1, which are in series on the left branch of Figure 1 — we would expect this pair of loci to have a strong interaction — and similarly for index j = 12, which represents the interaction between loci l = 2, 3. The remaining indices in L_2 correspond to g_j that are relatively small. For example, j = 9 represents loci l = 0, 3. Figure 1 affirms that these loci are on different branches and therefore do not strongly interact.

• For k = 3, 4, the indices in the highest levels L_3 = {7, 11, 13, 14} and L_4 = {15} correspond solely to relatively small coefficients. For example, the binary code for j = 7 represents loci l = 0, 1, 2. Coefficient g_7 is small because these three loci do not strongly interact as a triad to affect the trait. Similarly for j =
15, the cluster of all four loci, which do not strongly interact as a quad.

In summary, even this very small example demonstrates the key property that we conjecture: the larger coefficients of the gene network are confined to the lower levels (in this case, 0 ≤ k ≤ 2 versus 3 ≤ k ≤ 4). For larger networks, the effect will be much more pronounced.

Effect from two modules. Consider an arbitrary quantitative trait governed by n genes, illustrated below. As previously mentioned, there are 2^n possible combinations in which the loci can interact. Now suppose the trait is composed of two subtasks, with n_1 genes in Module 1 dedicated to the first subtask and n_2 = n − n_1 genes in Module 2 to the second subtask. Let us assume only local interactions, where the loci of Module 1 do not communicate with those of Module 2.

[Diagram: a trait governed by n genes, partitioned into Module 1 with n_1 genes and Module 2 with n_2 genes.]

The mere act of partitioning the genes into two modules with no cross-interactions naturally leads to a (very) sparse gene network. To see this, let us estimate the density of significant coefficients at level k. The problem is isomorphic to that of calculating the probability for k stones of the same color to be drawn at random from an urn with n_1 white stones and n_2 black stones. For the moment, assume n_1 is larger than n_2. Then, as k increases, the white stone entries will asymptotically dominate the white-to-black ratio of successes. In fact, once k exceeds n_2, all successes are white, and once k exceeds n_1, there will be no way of selecting a monocolored k-set. Thus there is a cutoff level k_cut = max(n_1, n_2).

The calculation is quite transparent if we are returning each stone to the urn after selection. There is a probability α_1 = n_1/n of selecting a white stone and a corresponding α_2 = n_2/n for a black stone, so that α_1^k is the density of all-white k-subsets and α_2^k is the density of all-black k-subsets. Then by analogy, α_1^k + α_2^k bounds the density of significant coefficients in level k from Modules 1 and 2. Notice that this argument slightly overestimates the probability of drawing k stones of the same color, because it counts some cases where a stone is replaced. In turn, this can only overestimate the density of significant coefficients. Yet, it can be shown with a bit more effort that the case of choosing stones without replacement yields a similar result.

The important conclusion is that modular traits have two properties which immediately imply sparsity of significant coefficients:
1. A cutoff level k_cut beyond which there are essentially no significant coefficients. It follows that k_cut provides a convenient way of delineating the "low" and "high" levels.
2. An exponentially decaying density with the vast majority of its mass concentrated in levels {L_k}, where k is small relative to n.

The above analysis examined sparsity density for each level k. Another approach is to first assess the overall sparsity and then compare it to the size of the ambient space. Denote K as the total number of significant interactions from all levels. The loci of Module 1 have at most 2^{n_1} interactions, and the loci of Module 2 have at most 2^{n_2}, and so

    K ≤ 2^{n_1} + 2^{n_2} ≤ 2 · 2^{k_cut}.   (3)

There exists c > 1 such that k_cut = n/c; therefore the sparsity ratio of K relative to the number of possible interactions for the whole network is

    K / 2^n ≤ 2 · 2^{−(1 − 1/c) n},   (4)

which decays exponentially with n. The effect of exponential drop-off becomes even more pronounced as soon as there are three or more subtasks, or if the subtasks themselves consist of finer subtasks. Thus for large enough n, it is clear that a trait with modules limited to local interactions can give rise to K ≪ 2^n, i.e., very strong sparsity in the gene network. Section 6.1 extends this analysis to M modules, also including other scenarios, such as when the sparsity K is polynomial in the number of genes n.
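The with-replacement bound above is easy to tabulate; a small sketch with an illustrative split (n = 20 loci, n1 = 12, n2 = 8 — these sizes are made up for demonstration and are not from the paper):

```python
# Illustrative two-module split; the sizes are made up for demonstration.
n, n1, n2 = 20, 12, 8
a1, a2 = n1 / n, n2 / n

# With replacement, a1**k + a2**k bounds the density of significant
# coefficients at level k, and it decays geometrically with k.
density = [a1 ** k + a2 ** k for k in range(1, n + 1)]
assert all(d2 < d1 for d1, d2 in zip(density, density[1:]))

# Sparsity bound (3)-(4): K <= 2**n1 + 2**n2 <= 2 * 2**k_cut, k_cut = max(n1, n2).
K_bound = 2 ** n1 + 2 ** n2
print(K_bound, K_bound / 2 ** n)   # 4352 0.004150390625
```

Even in this modest example the bound on K is under half a percent of the 2^20 possible interactions; the fraction shrinks exponentially as n grows.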
Next, we extend the level-wise density estimate developed in the previous example to an arbitrary trait affected by n genes that is partitioned into M modules. For 1 ≤ m ≤ M, let the m-th module have n_m loci and probability ratio α_m = n_m/n, with n = Σ_m n_m. Temporarily assume the loci are evenly distributed across all modules so that each n_m = n/M and α_m = 1/M. Then the density of k-loci clusters in any module is just α_m^k = 1/M^k. Summing over all M modules yields

    p_k = 1/M^{k−1},   1 ≤ k ≤ n.   (5)

If we now account for a nonuniform distribution of loci per module, then the density at level k is simply p_k = Σ_m α_m^k. However, asymptotically one of the α_m will dominate the summation, which leads to a more general form of the density, or probability that a k-loci cluster can significantly interact:

    p_k = b / a^{k−1},   1 ≤ k ≤ n,   (6)

where a, b > 0 are chosen such that 0 ≤ p_k ≤ 1. This form can also represent the merging or melding of different networks. In analogy with the denominator of (5), we observe the parameter a informally represents the "effective number of modules." Although (6) is simplistic and not likely to be exact in any real biological system, it captures the essence of the problem at hand — that modularity strongly favors interactions between fewer genes rather than many.

Table 1: Probabilistic determination of low-level concentration for the first 13 levels of an arbitrary gene network with n =
26 loci. Column 1: the level index k specifying the number of loci in a cluster. Column 2: from (6) with a = 4.56 and b = 1, the probability p_k for an arbitrary k-loci cluster to have a nonzero interaction. Column 3: from (14), the number |L_k| of possible k-loci clusters in level k. Column 4 and figure: from (7), the expected sparsity K_k for level k.

The number of possible k-loci clusters that can be drawn from n loci is |L_k| = (n choose k) (see (14)). Thus for each level k, multiplying p_k by |L_k| yields K_k, the expected number of k-loci clusters that can have significant interaction energy:

    K_k = ⌊ p_k · |L_k| ⌋,   0 ≤ k ≤ n,   (7)

where ⌊x⌋ denotes the integer part of x. This can also be interpreted as the expected sparsity of level L_k. Note, it is always the case that K_0 = |L_0| = 1.

The effect of (7) on a gene network is a natural concentration of significant coefficients into its lower levels. This is because the polynomial growth of |L_k| cannot "outrun" the rate of exponential decay of p_k. Therefore when modeled by (7), modular traits are necessarily smooth, as defined in Section 1.1. For example, suppose some quantitative trait is governed by n = 26 genes, with a = 4.56 and b = 1. Table 1 lists p_k, |L_k|, and K_k as a function of index k for the first 13 levels. Observe the faster decay rate of p_k versus the rise of |L_k|. The resulting low-level concentration of the trait's gene network is clearly evident in the plot on the right showing sparsity K_k for all levels 0 ≤ k ≤ 26. Note, the cutoff level is k_cut = 11 and K_low = Σ_{k=0}^{k_cut} K_k = 782 elements.

While modular traits may not exactly obey strict local interactions nor the asymptotic analysis above, this ideal scenario illuminates how the mechanics of modules naturally lead to low-level gene interactions. Subsequently, we can loosen the assumptions of this model to also allow for high-level interactions. In the simulations of Section 5 we do this by employing (7) in addition to permitting a limited number of random high-level coefficients.

To conclude this section, many quantitative biological traits may appear to be simple in nature, especially when they are measured on a linear scale, but they actually represent the result of error-free activities in multiple successive processes. From the arguments above, it is reasonable to look at traits in the Fourier domain, expecting the bulk of the information about gene interactions to be concentrated in components of low level. Theoretical and practical approaches may diverge here.
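The level-wise expected sparsities of Table 1 follow directly from (6), (7), and (14); a minimal sketch, assuming the running-example parameter values a = 4.56 and b = 1 (these values reproduce the stated total K_low = 782):

```python
from math import comb, floor

n, a, b = 26, 4.56, 1.0       # running-example parameters for (6)

K = []
for k in range(n + 1):
    p_k = min(b / a ** (k - 1), 1.0)   # density (6), capped at 1
    K.append(floor(p_k * comb(n, k)))  # expected level sparsity (7)

k_cut = max(k for k in range(n + 1) if K[k] > 0)
print(k_cut, sum(K))          # 11 782
```

The polynomial growth of (n choose k) is overtaken by the geometric decay of p_k, so all expected sparsity beyond level k_cut = 11 rounds to zero.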
1. Theoretical. We can pursue mathematical analysis of precise, but sometimes unrealistic, scenarios, so as to understand the overall landscape of fitness functions and their role in evolution. The theory of evolution in its modern synthesis requires a more robust mathematical theory of percolation in a Boolean lattice, endowed with a fitness function. We advocate a broad view that the Fourier transform of the fitness is an appropriate way to classify fitness landscapes and that large percolating subsets are associated with smooth transforms.

2. Practical. Assume we have identified a candidate trait which can be expressed as a real number and which we expect to be smooth. It is impossible to measure the overall impact of these genes over every possible combination. However, techniques of compressive sensing could begin to reconstruct values of the trait, with error estimates, from incomplete and inaccurate data and model mismatch (discussed in more detail in Section 4.1).
3. The Mathematical Formalism
The use of Fourier techniques on functions on the Boolean lattice is a solidly established field, fully explained in books by O'Donnell [21] and Garban and Steif [22]. As a consequence, there are a variety of notations and viewpoints. We briefly review the matter to establish notation and orient the reader. Throughout, vectors and matrices are represented with boldface letters, while their elements are non-boldface; thus the i-th component of a vector x is x_i (all indices commence at zero). The transposition of a matrix A is denoted A^T. For 1 ≤ p < ∞, the ℓ_p-norm of x is defined as ‖x‖_p := (Σ_i |x_i|^p)^{1/p}. For index i in the set {0, 1, ..., 2^n − 1}, let [i_{n−1}, ..., i_1, i_0] be its canonical n-bit binary representation, with i_0 and i_{n−1} as the least- and most-significant bits, respectively (note, the i_l are non-italicized). The Gaussian distribution with mean μ and variance σ^2 is denoted as N(μ, σ^2), and the (n, k)-th binomial coefficient is (n choose k).

Suppose we are interested in how n particular genes of an organism affect a given quantitative trait. Arbitrarily assign the genes to n loci. We will use a simplified model where the gene at each locus has just two possible alleles (though the model can be extended to accommodate factors that are non-binary). Hence, a genome can be represented by the n-bit word [i_{n−1}, ..., i_1, i_0] (note the reverse order of subscripts), where i_l = 1 denotes one allele at locus l, and i_l = 0 the other. A binary model is sufficient to capture the essential nature of the scale problem, and is standard in the literature [23]. Position in the binary word does not correspond to position on the chromosome, and different positions in the word may encode alleles from different chromosomes, or even alleles from mitochondria. The model only focuses on a certain subset of the organism's genes in order to quantify their interactions; the rest are treated as a fixed background.
Similarly, features of the environment that may influence a trait are considered fixed. Although simple, this setup generalizes the usual discussion of dominant and recessive genes, which focuses on only two homologous loci. In our model of a genome, not all n loci necessarily must be genes. If desired, an obvious extension is to let some loci represent other binary factors, e.g., influences of the natural or man-made world.

Figure 2: A 4-D hypercube with its sixteen labeled vertices. (Left) Decimal indices. (Right) Equivalent 4-bit codes. (Source: commons.wikimedia.org/wiki/File:Hypercubeorder.svg)
The n-bit words that represent distinct genomes in a given population can be represented by the 2^n vertices of an n-dimensional Boolean lattice (or hypercube). For example, Figure 2 shows a small hypercube with n = 4 and 2^4 = 16 vertices, labeled in decimal and binary. Throughout, we explicitly reserve the index 'i' to refer to the "trait side."
That is, suppose the binary word [i_{n−1}, ..., i_1, i_0] encodes the specific allele combination for n specific genes of some organism. Then we can either use this binary code or its decimal-equivalent index i to identify that individual, which then permits representing its trait value as t_i (if there are multiple organisms with the same genome, then t_i represents a mean of their trait values). Hence, the full trait t = {t_i}_{i=0}^{2^n−1} is defined on the n-dimensional Boolean lattice.

The group G = Z_2^n acts on n-length binary strings in the obvious way (i.e., component-wise addition modulo 2). The group elements may be taken as the basis for a real vector space of dimension 2^n, and a trait is defined as an element of this vector space, assigning a real value to every possible genome. But the space R^G is also a group algebra with a multiplication corresponding to convolution, and also a different operation of pointwise multiplication. A character of the group G is a homomorphism to the reals. Because G is abelian, the set of characters is itself a different group C under pointwise multiplication: another isomorphic copy of Z_2^n. The characters provide an alternate basis for the vector space of functions on G.

The character table for our groups G and C is the Sylvester-type Hadamard matrix H_n of order 2^n, defined for n ≥ 1 by the recursion

    H_n := [ H_{n−1}   H_{n−1}
             H_{n−1}  −H_{n−1} ]    (8)

with H_0 := 1. (These matrices were introduced as "tessellated pavements" by J. J. Sylvester, who commends their versatility, "furnishing interesting food for thought, or a substitute for the want of it, alike to the analyst at his desk and the fine lady in her boudoir" [24]. Note, power-of-two Hadamard matrices are often referred to en masse as "Walsh–Hadamard matrices," although formally, this connotes a different ordering of the rows and columns from the Sylvester-type defined in (8).) Row and column indices should be labeled 0 to 2^n − 1, or in their binary equivalents. The Sylvester–Hadamard matrix changes basis from G to C, and back. By definition, Hadamard matrices are orthogonal, obeying

    H_n H_n^T = 2^n I    (9)

where I is the identity matrix of order 2^n. Further, it is well known that Sylvester-type matrices are symmetric: H_n = H_n^T, hence

    H_n^{−1} = H_n / 2^n.    (10)

There is a duality between the G and C representations of functions, where pointwise multiplication on one domain corresponds to convolution in the other. Analogues of Plancherel's and Parseval's formulae hold, with appropriately chosen scaling factors owing to (9). More details on Hadamard matrices and their use in group theory can be found in [25].

Formally, we define a gene network g to be the (forward) Fourier transform F of a corresponding trait t, each defined on its own Boolean lattice. However, to apply the character table H_n to an n-dimensional Boolean lattice, we must reshape it into a vector of length 2^n ordered in terms of its vertices' indices. Hence the Fourier coefficients are related to a given trait by the matrix-vector multiplication

    g := F(t) = H_n t.    (11)

Clearly, given a gene network g, we can take its inverse Fourier transform using (10) to find its full trait for the whole population of genomes:

    t = F^{−1}(g) = H_n g / 2^n.    (12)

The choice of the labels '0' or '1' for the generators at a particular locus is a matter of convenience. It is easy to show that flipping the bit of an arbitrary locus will only change the sign, and not the magnitude, of its associated network coefficients.

We emphasize that the gene network g is defined on its own Boolean lattice and reserve the index 'j' to refer to its coefficients. We further point out that the binary codes of the indices contain crucial information on how the genes communicate. That is, for index j, the locations of the {j_l = 1} in its binary code [j_{n−1}, ..., j_1, j_0] are simply "flags" indicating which gene loci {l} participate in a given cluster. The corresponding coefficient g_j then represents the amount of interaction between the genes within this cluster. For instance, the flags in the binary code for index j = 3 = [0 ... 0 1 1] indicate participation of the two rightmost loci, and the value of coefficient g_3 tells us how strongly/weakly these two loci interact. Compared to classical discrete Fourier analysis, this is a different interpretation of an index j in the frequency domain.

The indices of the gene network's Boolean lattice are organized into so-called "levels," where each level contains the binary codes with the same number of '1's. With the number of loci n fixed and for 0 ≤ k ≤ n, define the k-th level, L_k, as the set of indices whose n-bit code contains precisely k ones:

    L_k := { j = [j_{n−1}, ..., j_1, j_0] | Σ_{l=0}^{n−1} j_l = k }.    (13)

(The levels are also a function of n, so we should formally be discussing the "(n, k)-th level, L_{n,k}." However, since n is fixed, we opt for a slightly cleaner notation, relying on the reader to recognize when to utilize its dependence on n, e.g., in (14). Further, for convenience we sometimes refer to L_k just as "level k," and occasionally also refer to a coefficient g_j as being "in" or "of" level k, when in reality it is its index j that is in L_k.) As such, level L_1 contains indices with just k = 1 one, i.e., a single j_l = 1, indicating that an individual locus l is being examined. Next, level L_2 contains indices that have k = 2 ones, flagging different pairs of loci, and so on. It should be clear that the number of indices (or codes) in each level is "n choose k":

    |L_k| = (n choose k).    (14)

Returning to the example of n = 4, the (now different) Boolean lattice in Figure 2 should now be viewed as the indices of the corresponding gene network.
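The recursion (8), the transform pair (11)–(12), and the level structure (13)–(14) are all easy to check numerically. A sketch with NumPy for a small n (all names are ours):

```python
import numpy as np
from math import comb

def sylvester_hadamard(n):
    """Build the Sylvester-type Hadamard matrix H_n of order 2**n via (8)."""
    H = np.array([[1]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

n = 4
N = 2 ** n
H = sylvester_hadamard(n)

# Orthogonality (9) and symmetry, hence the inverse (10)
assert np.array_equal(H @ H.T, N * np.eye(N, dtype=int))
assert np.array_equal(H, H.T)

# Forward transform (11) and inverse transform (12)
t = np.random.default_rng(0).standard_normal(N)  # an arbitrary trait vector
g = H @ t                                        # gene network g = H_n t
t_back = H @ g / N                               # t = H_n g / 2**n
assert np.allclose(t_back, t)

# The "level" of index j is the popcount of its binary code, cf. (13)-(14)
levels = [bin(j).count("1") for j in range(N)]
assert all(levels.count(k) == comb(n, k) for k in range(n + 1))
```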
Notice the right panel shows the 4-bit codes partitioned into their five respective levels {L_k}_{k=0}^{4}, and that the cardinality of each level confirms |L_k| = (4 choose k). Example 1 on page 7 should help clarify how to interpret the network for the genes in Figure 1 (note, the order of g has been permuted so that its indices/coefficients are grouped into their respective levels).

Remark 1. In summary: on the "trait side," the vertices of the Boolean lattice (indexed by i) represent specific n-bit genomes of a population — the coefficients {t_i} are the trait values associated with these genetic codes. On the "gene network side," the vertices of its Boolean lattice (indexed by j) represent specific clusters of k of the n gene loci organized into levels {L_k} — the coefficients {g_j} reflect the interaction strength of these clusters.

Figure 3 illustrates a simulated trait-gene network Fourier transform pair, related by (11) and (12), for n = 26 genes. The top panel shows the full trait t over all N = 2^26 ≈ 6.7 × 10^7 genomes, and the bottom panel shows the corresponding gene network g over all N gene cluster combinations. This exemplifies the uncertainty principle mentioned in Section 1.2: the two representations contain the exact same genetic information, yet the "energy" of the trait is spread over its whole population, while the gene network's is focused into just K = 812 of its possible N coefficients; the rest are identically zero. The plotted points of the full trait are too dense to see individual values, so refer to the middle panel to see a zoomed-in portion of the trait values for the first 100 genomes; notice how the trait is spread out with mostly nonzero values. This justifies collecting trait values from relatively few random genomes in order to discover the gene network.

Although the mathematics of the two domains, the gene network space (11) and trait space (12), are formally symmetric, our knowledge about the two is not. We can physically measure the trait value t_i of the i-th organism, but the scale problem means that we only have access to a very small subset of genomes. At the same time, because of the smoothness assumption, we have some statistical knowledge about the distribution of all the Fourier coefficients in the gene network. Further, from (12) each trait value measurement corresponding to a particular genome may be considered as sampling a weighted average of the network coefficients. That is, each t_i encodes partial information of the full vector g. This setup perfectly complements the model of compressive sensing.

Figure 3: A simulated Fourier transform pair consisting of (top) an arbitrary quantitative trait t affected by n = 26 genes and (bottom) its corresponding hidden gene network g, related by (11) and (12). (Middle) Detail of the trait values for the first 100 genomes. The genetic information in both representations is identical, yet nonzero trait values are spread across all of the possible N = 2^26 ≈ 67 million genomes, while the gene network is concentrated or "compressed" into just K = 812 of its possible N coefficients. This phenomenon is due to the Fourier uncertainty principle. The sparsity ratio is K/N ≈ 0.0012%.
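The compression phenomenon of Figure 3 is easy to reproduce at toy scale: a network with a handful of nonzero coefficients yields, via (12), a trait vector that is nonzero almost everywhere. A sketch under our own toy parameters:

```python
import numpy as np

def sylvester_hadamard(n):
    H = np.array([[1.0]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n = 10
N = 2 ** n   # 1024 possible genomes (toy scale, instead of 2**26)
K = 8        # a very sparse gene network

g = np.zeros(N)
g[rng.choice(N, K, replace=False)] = rng.standard_normal(K)

t = sylvester_hadamard(n) @ g / N   # full trait via (12)

print(np.count_nonzero(g))   # only K nonzero network coefficients...
print(np.count_nonzero(t))   # ...yet the trait is dense (nonzero nearly everywhere)
```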
4. Compressive Sensing
Important insights have been gained in the last decade into signal and information processing. In particular, there has been a shift away from the long-standing assumption of bandlimited real-world signals to one that assumes a model of a sparse representation [26]. This point of view has enabled many new ways of acquiring, storing, and processing data that were not previously possible. Notable among these has been the field of compressive sensing, which offers degrees of undersampling that did not exist before. Compressive sensing is a combined sampling-reconstruction modality that is appropriate whenever it is expensive, or even impossible, to acquire many observations of a signal of interest. In that case, and under the correct conditions, we can take relatively few samples and still be able to reconstruct a signal with high fidelity.

"Expensive" may be due to the cost of a sensor's material or its size, or it may be related to the time required to acquire the data. For instance, CMOS megapixel sensors made of silicon currently cost less than one U.S. dollar, while those made of indium gallium arsenide (InGaAs) are on the order of $10,000. Shortwave infrared compressive imaging systems exist with just one InGaAs detector instead of an array of pixels, which offers significant financial savings [27, 28]. In terms of time, there are scenarios that require a very short duration of sensing. One of the original compressive sensing applications addressed magnetic resonance imaging (MRI) [29]. An incident is described in [30] where relatively few MRI samples were sufficient to image the organs and save the life of a child who could not safely endure the high dose of anesthesia necessary for a full scan. The U.S. Food and Drug Administration has recently approved the use of compressive sensing in commercial MRI machines [31, 32].

In the context of the genomic analysis explored in this study, we are interested in quantitative traits affected by n genes.
The 2^n combinatoric possibilities mean that it is essentially impossible to access and measure all individual organisms of a population once n becomes large. Hence, compressive sensing may be an appropriate tool: it permits measuring the trait values from relatively few genomes, while still being able to analyze and quantify certain gene-to-trait functions that have been beyond realistic observational and computational means.

We now review the specifics of compressive sensing in more detail. Interested readers can find more information in the foundational and related papers, e.g., [13, 33, 34, 35, 36, 37]. A less formal introduction is available in different survey articles, such as [38, 39].

Suppose we are interested in observing a real-valued 1-D discrete signal g of length N. Rather than traditional point sampling, suppose further that we acquire general linear measurements via a sensing/measurement matrix A of size M × N, with M < N. The basic compressive sensing model is embodied by the observation/measurement vector

    y = A g + e    (15)

where e is an unknown additive noise vector of length M. In general, there is no hope to recover g since this is an underdetermined system of equations (there are fewer equations than unknowns). However, if at most K of the possible entries of g are nonzero (commonly referred to as a K-sparse signal), and if matrix A possesses a necessary "incoherence" property relative to the sparsifying basis, then with as few as

    M = const · K log^r(N)    (16)

measurements (for some r ≥ 1) we can reconstruct g, e.g., by solving the program

    min_{g̃ ∈ R^N} ‖g̃‖_1  subject to  ‖y − A g̃‖_2 ≤ ε    (17)

where ε is some assumed or known measure of the energy of the noise e. In words, rather than the ambient dimension N, (16) reveals that we only need to sample at a rate M roughly commensurate with the underlying sparsity K. This is then sufficient for (17) to find the best candidate g̃ whose image under A coincides closely with y, while also being of minimal ℓ_1-norm. The convex ℓ_1-norm constraint is used since it is known to promote sparsity. Nonconvex ℓ_p-quasi-norms are also possible with 0 < p < 1, but they tend to be less computationally tractable. Alternative and equivalent formulations of (17) exist as well.

Remark 2. The beauty and power of compressive sensing occurs when g is extremely sparse compared to the ambient dimension. In this case K ≪ N means we can severely undersample with M ≪ N due to (16).

Let ĝ be the solution to (17). A typical measure of the absolute error between the recovered and original signals is

    ‖ĝ − g‖_2 ≤ C_0 K^{−1/2} ‖g − g_K‖_1 + C_1 ε    (18)

where g_K ∈ R^N is the best K-sparse approximation to g and ‖e‖_2 ≤ ε. In theory, constants C_0 and C_1 are not too large and can be calculated; however, it is not always so easy to do so in practice. Clearly, there is no error when g is precisely K-sparse and there is no noise. However, (18) tells us that ĝ can still be an excellent approximation even if g has more than K nonzero elements and when the measurements are taken in the presence of moderate noise, both of which are encountered in real-world applications. It is well known, e.g., that vectors whose sorted coefficients decay according to a power law admit a K-sparse approximation with minimal error.

A great deal of effort in the compressive sensing community has focused on finding the theoretical conditions on A that lead to successful recovery of g from (15)–(17). Injecting randomness into A is the easiest way to guarantee robust results with high probability. For example, A may be a random matrix with independent and identically distributed (i.i.d.) elements drawn from either a Gaussian or Bernoulli distribution. Although these sensing matrices yield excellent performance, they are plagued by the need to store all elements of A, which becomes a nontrivial burden when M · N is large. A more realistic embodiment that avoids this problem is to randomly subsample M rows of an N × N unitary matrix, e.g., a discrete Fourier or Hadamard matrix.
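As a minimal numerical illustration of (15)–(17), the sketch below senses a K-sparse vector with randomly subsampled rows of a scaled Hadamard matrix and recovers it by iterative soft thresholding (ISTA), a simple unconstrained proxy for the program (17). All sizes and parameters here are our own toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sylvester_hadamard(n):
    H = np.array([[1.0]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

# Toy instance of (15)-(17): M randomly subsampled rows of a unitary matrix
n, M, K = 8, 80, 5
N = 2 ** n
U = sylvester_hadamard(n) / np.sqrt(N)   # orthogonal matrix (H_n / sqrt(N))
A = U[rng.choice(N, M, replace=False)]   # sensing matrix: M random rows

g = np.zeros(N)
g[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
y = A @ g                                # noiseless measurements, cf. (15)

# ISTA: gradient step on ||y - Ax||^2 / 2, then soft thresholding.
# Step size 1 is valid here because A has orthonormal rows (||A||_2 = 1).
lam = 1e-3                               # small l1 penalty weight
x = np.zeros(N)
for _ in range(5000):
    x = x - A.T @ (A @ x - y)
    x = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(np.linalg.norm(x - g) / np.linalg.norm(g))   # small relative error
```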
These matrices do not need to be explicitly stored because they either have a closed-form expression or a "black box" implementation. Further, as mentioned earlier, Fourier and Hadamard matrices are known to be incoherent relative to the standard basis, as well as to wavelet bases and others. This makes them ideal sensing operators necessary to admit (16).

In practical applications, however, the measurement matrix may be more elusive. For example, we may assume (15) properly models the measurement procedure, but when it is implemented in the field A may be unknowingly perturbed so that the measurements are actually of the form y = (A + E)g + e. In this case recovering with (17) incurs a so-called "model mismatch" that does not account for the extra multiplicative (rather than additive) noise term Eg. Previous work [40, 41, 42] has extended the error expression (18) to include this real-world phenomenon.

A different, but related, issue occurs when we do not have the liberty of choosing the type of measurement matrix to be used. That is, A is often dictated by the sensing modality or by specific physical, chemical, or biological constraints. For example, the magnetic coils of MRI machines emit sinusoidal radio waves. These spawn measurements in the frequency domain that are naturally modeled by the canonical discrete Fourier transform (the so-called "DFT"), hence A consists of rows of the DFT matrix. The previous work in genetics [6, 7, 8, 9, 10, 11] mentioned in Section 1.3 exemplifies scenarios where one must design A based on realities and restrictions of the sensing environment. In the study presented here, the genome is modeled as an n-dimensional Boolean lattice and the trait, or fitness landscape, is a vector which takes the elements of the Boolean lattice as a basis.
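For the Hadamard case, one such "black box" is the fast Walsh–Hadamard transform, which applies H_n in O(N log N) operations without ever forming the 2^n × 2^n matrix. A sketch in Sylvester ordering:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform in Sylvester ordering: returns H_n @ x
    in O(N log N) time without forming the 2**n x 2**n matrix H_n."""
    x = np.asarray(x, dtype=float).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

# Check against the explicit recursion (8) for a small n
def sylvester_hadamard(n):
    H = np.array([[1.0]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

v = np.arange(8.0)
assert np.allclose(fwht(v), sylvester_hadamard(3) @ v)
```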
From (11) and (12), the trait and gene network vector spaces are related by the Fourier transform characterized by the Sylvester–Hadamard matrix H_n, whose rows naturally form our sensing matrix.

Procedure 1: Simulation to recover an unknown gene network.
1. Initialization. Generate a (hidden) gene network g according to the method outlined in Procedure 2.
2. Observations. Generate M indices {i_0, i_1, ..., i_{M−1}} from the set {0, 1, ..., N − 1}, uniformly at random. The associated binary codes of these indices represent the genomes for M random organisms within a population whose trait values we can measure.
   (a) Subsample the rows of matrix H_n / N at these M indices; this constitutes the random sensing matrix A.
   (b) Generate the vector of randomly observed trait values y in (15).
3. Reconstruction. Given y and A, use (17), or a proxy, to recover an approximation ĝ to the unknown gene network g.
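Procedure 1 can be exercised end to end at toy scale. In the sketch below the parameters are our own, and orthogonal matching pursuit stands in as the "proxy" recovery method of Step 3:

```python
import numpy as np

rng = np.random.default_rng(1)

def sylvester_hadamard(n):
    H = np.array([[1.0]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

# Toy scale: n = 10 loci instead of 26; OMP stands in for the program (17)
n, K, M = 10, 8, 120
N = 2 ** n
H = sylvester_hadamard(n)

# Step 1: a hidden K-sparse gene network g
g = np.zeros(N)
S = rng.choice(N, K, replace=False)
g[S] = rng.standard_normal(K)

# Step 2: trait values of M random genomes, i.e., M random rows of H_n / N
rows = rng.choice(N, M, replace=False)
A = H[rows] / N
clean = A @ g
e = rng.standard_normal(M)
e *= 0.01 * np.linalg.norm(clean) / np.linalg.norm(e)   # mild additive noise
y = clean + e

# Step 3: greedy support recovery (orthogonal matching pursuit),
# then least squares on the recovered support
S_hat, r = [], y.copy()
for _ in range(K):
    S_hat.append(int(np.argmax(np.abs(A.T @ r))))
    coef, *_ = np.linalg.lstsq(A[:, S_hat], y, rcond=None)
    r = y - A[:, S_hat] @ coef
g_hat = np.zeros(N)
g_hat[S_hat] = coef

print(np.linalg.norm(g_hat - g) / np.linalg.norm(g))    # small relative error
```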
5. Simulations and Results
We are tasked with the problem of discovering a genome-to-trait function affected by n loci, which can be expressed on a Boolean lattice that captures all N = 2^n possible combinations of participating alleles. Instead of explicitly identifying the genome-to-trait function t, it suffices to discover its associated gene network g. However, we do not have direct access to the gene network — hence the challenge is to discover a "hidden" gene network from the trait values of a limited number of genomes. A summary of the steps of the simulation can be found in Procedure 1. We conducted 10 trials, the results of which are recorded in Table 2 on page 24.

To assess how well our combined compressive sensing-recovery method works, we used two different measures of error. For a general length-N vector x, recall that its typical root-mean-square (RMS) and mean-absolute (MA) values are defined as RMS(x) := (Σ_{i=0}^{N−1} |x_i|² / N)^{1/2} and MA(x) := Σ_{i=0}^{N−1} |x_i| / N. Accordingly, the relative (or normalized) RMS and MA errors between x and its approximation x̂ are defined as:

    ε_RMS(x̂) := RMS(x̂ − x) / RMS(x) = ‖x̂ − x‖_2 / ‖x‖_2,   ε_MA(x̂) := MA(x̂ − x) / MA(x) = ‖x̂ − x‖_1 / ‖x‖_1.    (19)

5.1. The hidden gene network, g

Based on the theory of smooth traits presented in Section 2, we can assume with high probability that the gene network g is sparse or compressible, with the bulk of significant coefficients occupying the lower levels. For convenience we employ a strictly sparse model, although we achieved similar results with a compressible model whose coefficients decayed similarly to a power law. To make the simulations more realistic, we relaxed the assumption of only low-level concentration and allowed some nonzero coefficients to occupy higher levels. This corresponds to possible nonlocal interactions across modules, as explained in Section 2.1.
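Returning to the error measures defined in (19), they translate directly into code (function names are ours):

```python
import numpy as np

def eps_rms(x_hat, x):
    """Relative RMS error of (19): ||x_hat - x||_2 / ||x||_2."""
    return np.linalg.norm(x_hat - x, 2) / np.linalg.norm(x, 2)

def eps_ma(x_hat, x):
    """Relative MA error of (19): ||x_hat - x||_1 / ||x||_1."""
    return np.linalg.norm(x_hat - x, 1) / np.linalg.norm(x, 1)

x = np.array([1.0, -2.0, 3.0])
print(eps_ma(x + 0.1, x))   # 0.3 / 6 = 0.05
```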
In what follows, denote S_k ⊆ L_k as the support set of nonzero coefficients in level k. There are many ways to enforce low-level concentration on the gene network g in a computer simulation. We implemented the model outlined in Procedure 2 on page 20. Step 1(a) follows Section 2.1.4 to determine the sparsity K_k for each of the lower levels. Step 1(b) then chooses K_k indices uniformly at random from level L_k and assigns them to the subset S_k. The complete low-level support S_low in (20) is the union of the support sets {S_k} up to cutoff level k_cut (for level k = 0, the support is trivially S_0 = L_0 := {0}). Step 2 introduces a limited number of high-level interactions for some K_high ≪ K_low; these indices are chosen at random from the union of the levels above k_cut and assigned to high-level support S_high. Finally in Step 3, the total support S and overall sparsity K are just the result of joining S_low and S_high, and their respective sparsities K_low and K_high.

Many natural phenomena possess a high degree of organization. As such, we suspect there will be significant structure within the support S. That is, the support sets {S_k} for real-world gene networks most probably have some degree of dependence. For example, suppose some coefficient g_ABC is nonzero; thus the index corresponding to the triad of loci A, B, C will be in S_3. Then it is reasonable to assume that these loci also interact pairwise to some extent, hence coefficients g_AB, g_AC, g_BC will also be nonzero, with their respective indices in S_2. In this case there would be a correlation between these indices in support sets S_2 and S_3. A priori knowledge of this kind of structure could make reconstructions of the gene network faster and with better accuracy. At the same time, counterexamples can be found. Without empirical evidence as a guide, we therefore chose not to incorporate this phenomenon into the model of the gene network for our simulations; thus the K random indices of S in (21) are uncorrelated with each other.

Coefficient values

There are also many ways to assign values to the K nonzero coefficients corresponding to support set S. For example, it may be that the magnitudes decay with increasing level k, say, according to some power law. However, as above, lacking empirical evidence, we chose to simply assign i.i.d. random Gaussian numbers to these coefficients, as seen in Step 4. More sophisticated models are, of course, possible.
But this, as well as our method for generating the support set S in Steps 1–3, is sufficient for our goal: to demonstrate how compressive sensing can deal with the scale problem.

5.2. Generating the gene network, g

Next, we explain the choice of parameters used in Procedure 2 to generate the gene network g for use in our simulations. We were limited by available computing resources, so chose arbitrary quantitative traits affected by just n = 26 genes. We utilized the probabilistic model described in Section 2.1.4, with parameters a = 0.56 and b in (6), to determine the low-level sparsity distribution. The distribution of the nonzero coefficients in levels 0 ≤ k ≤ 12 is shown in Table 1, with cutoff level k_cut = 11, resulting in low-level sparsity K_low = 782 coefficients. For the higher levels we permitted K_high = 30 randomly located coefficients. This yielded a K-sparse vector g with K = 812 meaningful interactions, representing a sparsity ratio of merely K/N ≈ 0.0012%.

We also wanted the trait values generated by t in (12) to have a realistic range. For example, measurements of myopia are usually in the range of roughly ±2, so we simply chose σ accordingly (though its value may seem large, it is countered by the 2^n scaling constant in the denominator of (12)). The support set S and corresponding values {g_j | j ∈ S} of the gene network were randomly regenerated for each simulation trial, as described in Procedure 2. The bottom of Figure 3 shows a particular gene network g of size N = 2^26 used in one trial of our simulations. (Some of these coefficients will probably be small enough to be considered "insignificant." However, this does not contradict our assumption of a smooth trait.)

Procedure 2: Method to generate a gene network g with primarily low-level concentration.
Generate the support set:
1. Low levels.
Fix parameters a, b in (6). For each level L_k, 1 ≤ k ≤ k_cut:
   (a) Determine the sparsity K_k from (7).
   (b) Generate the support set S_k by collecting K_k indices uniformly at random from L_k.
The low-level support is

    S_low = ∪_{k=0}^{k_cut} S_k    (20)

with sparsity K_low = Σ_{k=0}^{k_cut} K_k.
2. High levels. Generate support set S_high containing K_high ≪ K_low indices chosen uniformly at random from the union of levels above the cutoff level: ∪_{k=k_cut+1}^{n} L_k.
3. Total support. The resulting support set of the gene network g is

    S = S_low ∪ S_high    (21)

with sparsity K = K_low + K_high.
Generate the coefficients:
4. For each j ∈ {0, 1, ..., N − 1}, set

    g_j = N(0, σ²) if j ∈ S;  0 otherwise.    (22)

From Step 1 of Procedure 1, this is the example "hidden" gene network that we next attempt to discover. The full trait associated with this gene network, simply generated from g using (12), is seen in the top of Figure 3. Its histogram is shown in the bottom-right of Figure 4. We remark how it has a bell-like shape, which is reassuring as many traits in nature often have a similar distribution.

5.3. The observations, y

Consider the gene network g in the bottom of Figure 3, which we do not have the ability to directly measure. Assume we are constrained to observe only a subset of the organisms of its associated full trait t, shown in the top of Figure 3. This is a valid assumption because with n =
26 genes, the population of size N = 2^26 ≈ 6.7 × 10^7 is already quite large. If M genomes of t are subsampled essentially uniformly at random from (12), this is equivalent to the (noiseless) observation vector y = Ag in (15), with matrix A defined as the corresponding M randomly chosen rows of H_n / N.

To implement Step 2 of Procedure 1, we chose values of const and r in (16) that yielded M = 1,336 measured genome-trait values. This represents an extremely strong undersampling compared to the ambient dimension N = 2^26: the subsampling ratio is just M/N ≈ 0.002%. We then added white Gaussian noise e to Ag in (15), such that ‖Ag‖_2 / ‖e‖_2 = 10. In terms of decibels (dB), the signal-to-noise ratio (SNR) is 10 log_10(‖Ag‖_2² / ‖e‖_2²) =
20 dB.

Figure 4: (Top) The M = 1,336 observations y (red), which are random and noisy samples of the N = 2^26 genome-trait values contained in the full trait t (blue). The undersample ratio is M/N ≈ 0.002%. (Middle) Detail of the extremely low sample rate: the trait values for just 4 genomes (red) randomly sampled from the first 6,000 genomes. (Bottom-left) Histogram of the random observations y. (Bottom-right) Histogram of the true full trait t.

The top of Figure 4 depicts these randomly chosen and noisy trait measurements with red dots superimposed on the true full trait t in blue (duplicated from the top of Figure 3). Although the red dots seem plentiful, the zoomed-in portion in the middle panel reveals that only 4 genomes and their perturbed trait values have been sampled at random from the first 6,000 true trait values — this extremely small sampling of 4/6,000 ≈ 0.067% echoes the severe global undersample ratio M/N. The distribution of the values of the true full trait and its noisy subsamples are shown in the bottom of Figure 4. Although the means and standard deviations of the histograms of y (left) and t (right) differ by about 7%, the severe random undersampling and perturbed observations have captured the range of trait values contained in the true full trait with only about 0.002% of the data.

5.4. Recovery of the gene network, ĝ

For Step 3 of Procedure 1, given measurements y and matrix A in the previous section, we recovered an approximate gene network ĝ by implementing (17) in two substeps. First, we used the "Fast Adaptive Shrinkage/Thresholding Algorithm" (FASTA) [43] to recover the approximate support Ŝ of the gene network. Define A_Ŝ as the M × |Ŝ| submatrix of A whose columns are restricted to the indices of Ŝ. With M ≥ |Ŝ| and A_Ŝ of full column rank, we then performed a traditional least-squares regression on ĝ_Ŝ, the restriction of ĝ to the support Ŝ:

    ĝ_Ŝ = (A_Ŝ^T A_Ŝ)^{−1} A_Ŝ^T y.    (23)

Inserting zeros on the complement of Ŝ yields ĝ.

The top of Figure 5 shows the gene network ĝ recovered from noisy observations y. Compare this with the hidden gene network g we are trying to discover in the bottom of Figure 3. Although they look quite similar, there are small differences between them, seen in the pointwise error ĝ − g in the bottom of Figure 5. Some of these differences are due to the FASTA algorithm either missing the smallest coefficients of g, or introducing relatively small coefficients with incorrect support in ĝ. Applying (19) to ĝ and g in this example, we first compute the ℓ_2- and ℓ_1-norms of the ground truth, ‖g‖_2 and ‖g‖_1, and then the absolute ℓ_2- and ℓ_1-errors ‖ĝ − g‖_2 and ‖ĝ − g‖_1, which yield the relative errors ε_RMS(ĝ) and ε_MA(ĝ) reported in Table 2. Thus, for an arbitrary quantitative trait affected by n = 26 genes with K = 812 meaningful clusters, when M = 1,336 randomly sampled trait values are corrupted with 20 dB of AWGN, on average we can recover its associated gene network with about 97.
1% accuracy. Clearly, the error can be further reduced if one has access to more trait values. It is also possible to fine-tune the parameters in the FASTA algorithm to either permit fewer measurements and/or achieve lower error, but these results are fairly representative for the scenario presented. The key takeaway of these simulations is that the significant coefficients of the gene network can be faithfully recovered with minimal error. In turn, we see in the next section that we can also dependably predict trait values. (We also conducted simulations with noiseless measurements y = Ag. Note that (18) predicts perfect reconstruction in the case of strictly K-sparse vectors recovered from noiseless observations. Our simulations essentially confirm this: both relative errors ε_RMS(ĝ) and ε_MA(ĝ) were close to machine epsilon.)

Figure 5: (Top) The recovered gene network ĝ using (17) from just y (Fig. 4) and A. Compare with the hidden gene network g (bottom of Fig. 3) that we are attempting to discover. (Bottom) The resulting pointwise error ĝ − g between the recovered and hidden gene networks. The relative errors from (19) for this trial are reported in Table 2.

5.5. The predicted trait, t̂

In analogy with (12), the approximate full trait t̂ corresponding to ĝ is

    t̂ = F^{−1}(ĝ) = H_n ĝ / N.    (24)

Recognize, though, we do not have to predict the full trait; we may only be interested in the trait value t̂_i for some previously unobserved i-th genome. This is easily accomplished by taking the inner product of the i-th row of H_n / N with ĝ.

Clearly, the effects of any perturbations in ĝ are inherited by t̂, so we are interested in assessing deviations from the ground truth t. A standard measure of this is an average error over all predicted trait values. But by assumption, for large n we most probably do not have access to the full trait t. Suppose, however, that we have access to the ℓ_2- and ℓ_1-norms and errors of the gene network, either through measurement, approximation, or some other a priori knowledge; with these we can establish theoretical error bounds on t̂ in terms of ĝ.

From the unitarity of the Fourier transform (i.e., ‖H_n / √N‖_2 = 1), the absolute error of t̂ in terms of the ℓ_2-norm is just a scaled version of that of ĝ:

    ‖t̂ − t‖_2 = ‖H_n (ĝ − g) / N‖_2 = ‖ĝ − g‖_2 / √N    (25)

while the relative ℓ_2-errors are identical:

    ε_RMS(t̂) = ‖t̂ − t‖_2 / ‖t‖_2 = (‖ĝ − g‖_2 / √N) / (‖g‖_2 / √N) = ε_RMS(ĝ).    (26)

This is significant: we can precisely determine the absolute and relative ℓ_2-errors of the trait solely based on those of the gene network, due to the unitarity of the Fourier transform.

However, similar analysis of the errors in terms of the ℓ_1-norm does not yield such equalities. In particular, the absolute ℓ_1-error of t̂ is upper-bounded by that of ĝ:

    ‖t̂ − t‖_1 ≤ ‖H_n / N‖_1 ‖ĝ − g‖_1 ≤ ‖H_n / √N‖_2 ‖ĝ − g‖_1 = ‖ĝ − g‖_1    (27)

where in the second inequality we used the fact that ‖X‖_1 ≤ √N ‖X‖_2 for a matrix X with N rows. Unfortunately, the lower bound on the ℓ_1-norm of t incurs a significant scaling factor:

    ‖t‖_1 ≥ ‖H_n g / N‖_2 = ‖g‖_2 / √N ≥ ‖g‖_1 / √(KN)    (28)

(We could have defined the transforms in (11) and (12) to be g = H_n t / √N and t = H_n g / √N, respectively. The symmetry would have then precisely yielded ‖t̂ − t‖_2 = ‖ĝ − g‖_2.)
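The identity (26), namely that the relative ℓ2-errors in the two domains coincide, together with the ℓ1 bound (27), is easy to confirm numerically (toy sizes; names ours):

```python
import numpy as np

def sylvester_hadamard(n):
    H = np.array([[1.0]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(2)
n = 6
N = 2 ** n
H = sylvester_hadamard(n)

g = rng.standard_normal(N)
g_hat = g + 0.05 * rng.standard_normal(N)   # a perturbed "recovery"

t = H @ g / N                               # full trait via (12)
t_hat = H @ g_hat / N                       # predicted trait via (24)

eps_g = np.linalg.norm(g_hat - g) / np.linalg.norm(g)
eps_t = np.linalg.norm(t_hat - t) / np.linalg.norm(t)
assert np.isclose(eps_g, eps_t)             # (26): eps_RMS(t_hat) = eps_RMS(g_hat)

# (27): the absolute l1-error of the trait never exceeds that of the network
assert np.linalg.norm(t_hat - t, 1) <= np.linalg.norm(g_hat - g, 1)
```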
. . . . . . . . . . . . . . . . . Table 2: Relative errors ε RMS and ε MA (in %) of the reconstructed gene network ˆ g and predicted trait ˆ t for 10 di ff erentruns of Procedure 1. Note, ε RMS (ˆ t ) = ε RMS ( ˆ g ) from (26). The emboldened row corresponds to the errors for the exampletrial shown in Figs. 3–6. The bottom row indicates the average of each column. where in the first inequality we used (cid:107) x (cid:107) ≥ (cid:107) x (cid:107) for general vector x , and (cid:107) x (cid:107) ≥ (cid:107) x (cid:107) / √ K for K -sparse x in the second inequality. Thus from (27) and (28), the relative (cid:96) -error of ˆ t in termsof ˆ g is ε MA (ˆ t ) = (cid:107) ˆ t − t (cid:107) (cid:107) t (cid:107) ≤ √ KN · ε MA ( ˆ g ) . (29)This theoretical bound does not provide any meaningful utility owing to the fact that √ KN willbe extremely large for any n of practical interest.In our numerical simulations, however, we have access to all of ˆ t and t . For the trial we areexamining, the top of Figure 6 shows the predicted full trait ˆ t reconstructed from ˆ g using (24),which is very similar to the true trait t in the top of Figure 3 — this is a ffi rmed in the middlepanel showing a zoomed-in portion of the predicted trait values for the first 100 genomes (red‘ × ’s) plotted relative to the first 100 true trait values (blue dots). The di ff erences are recorded inthe error ˆ t − t , however, plotting this as a function of all N genomes contains too many points tobe a useful visual aid. Instead, we show the distribution of error values in the bottom of Figure 6.The histogram of the prediction error has a bell-like shape, with mean µ (ˆ t − t ) ≈ σ (ˆ t − t ) = . t . Note, it is un-necessary to compute the (cid:96) -error since (26) guarantees ε RMS (ˆ t ) = ε RMS ( ˆ g ) = . (cid:96) -norm of the ground-truth was (cid:107) t (cid:107) = . × . The absolute (cid:96) -error was (cid:107) ˆ t − t (cid:107) = . × , thus ε MA (ˆ t ) = . 
ε RMS (ˆ t ) and ε MA (ˆ t ) were very close.This is noteworthy and worth investigating further since it is rare for relative (cid:96) - and (cid:96) -errorsto coincide so closely (in contrast, compare ε RMS ( ˆ g ) and ε MA ( ˆ g )). Moreover, ε MA (ˆ t ) is onlyslightly larger than ε MA ( ˆ g ), verifying how gross the bound is in (29). In summary, for our as-sumed model of an arbitrary quantitative trait a ff ected by n =
26 genes with K =
812 meaningfulclusters, when M = ,
336 random trait measurements are corrupted with 20 dB of AWGN,both ε RMS (ˆ t ) and ε MA (ˆ t ) reveal that we can predict an individual organism’s trait on average withabout 97 .
6% accuracy. 24 igure 6: (
Top ) The predicted full trait ˆ t from the recovered ˆ g using (24). Compare with the true full trait t (top ofFig. 3). ( Middle ) Detail of the predicted trait values for the first 100 genomes (red ‘ × ’s) superimposed on the first 100true trait values (blue dots). ( Bottom ) Histogram of the error ˆ t − t between the predicted and true trait values. Therelative errors from (19) are ε RMS (ˆ t ) ≈ ε MA (ˆ t ) ≈ .
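The identities (25)–(27) are straightforward to verify numerically. The following minimal sketch is our own illustration (toy sizes, our own variable names, not the paper's simulation code): it builds H_n by the Sylvester construction and checks the relations for a random network g and a perturbed estimate ĝ.

```python
import numpy as np

def hadamard(n_bits: int) -> np.ndarray:
    """Sylvester construction of the 2^n x 2^n Hadamard matrix H_n."""
    H = np.array([[1.0]])
    for _ in range(n_bits):
        H = np.block([[H, H], [H, -H]])
    return H

n_bits = 6
N = 2 ** n_bits
H = hadamard(n_bits)

rng = np.random.default_rng(0)
g = rng.normal(size=N)                    # a "true" gene network
g_hat = g + 0.05 * rng.normal(size=N)     # a perturbed estimate of it

t = H @ g / N                             # trait from the network, as in (24)
t_hat = H @ g_hat / N

# (25): the absolute l2-error of t-hat is that of g-hat scaled by 1/sqrt(N)
assert np.isclose(np.linalg.norm(t_hat - t),
                  np.linalg.norm(g_hat - g) / np.sqrt(N))

# (26): the relative l2-errors coincide exactly
eps_rms_t = np.linalg.norm(t_hat - t) / np.linalg.norm(t)
eps_rms_g = np.linalg.norm(g_hat - g) / np.linalg.norm(g)
assert np.isclose(eps_rms_t, eps_rms_g)

# (27): the l1-error of t-hat never exceeds that of g-hat
assert np.linalg.norm(t_hat - t, 1) <= np.linalg.norm(g_hat - g, 1)
```

The same check at larger n would use a fast Walsh–Hadamard transform rather than the dense matrix, since H grows as 2^n × 2^n.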
6. Additional Comments
Dorogovtsev and Mendes aptly point out in [44], "One should note that a number of effects in networks cannot be explained without accounting for their finite size. In this sense, most real networks are mesoscopic objects." Such are the gene networks we focus on. There are traits (such as certain diseases) with very small gene determinants for which our theory is unneeded, and there may well be networks so large as to be computationally beyond reach. We claim only that our theory is suitable for traits in some middle zone.

Nevertheless, it may be of interest to examine the asymptotic behavior of the number of significant network coefficients K as the number of loci n increases. Assume the trait is partitioned into M modules, and that the significant entries are due only to local interactions within each module. For 1 ≤ m ≤ M, let n_m be the number of loci in module m, with n = Σ_m n_m. As before, the cutoff level is the size of the largest module: k_cut = max({n_m}). We want to find upper bounds that make K ≪ 2^n. Such bounds may be found almost ad lib., by postulating various ways in which k_cut may depend on n. Clearly, smaller k_cut gives more stringent bounds. The following three cases cover realistic scenarios we may encounter:

1. If k_cut < n/c, where c > 1, then K is bounded by c·2^(n/c).
2. If k_cut is O(ln n), then K is bounded by a polynomial in n.
3. If k_cut is O(1), then K is bounded by a linear function of n.

These estimates are all derived in a similar way: first estimate the contribution from the largest module, and then bound the remaining contributions in terms of the largest one. Notice that Case 1 is just an extension of (3). The most interesting is Case 2: here k_cut < c ln n, for some constant c. The largest module contributes at most 2^(c ln n) to K, which can be rewritten as n^(c ln 2). Although the number of modules M ≤ n, grossly multiplying our polynomial estimate by n still leaves it polynomial, albeit one degree higher. Thus K is bounded by a polynomial in n.

In conclusion, when compared to the ambient dimension N = 2^n, the sparsity ratio K/N decays exponentially in Case 1, and even faster in Cases 2 and 3. Therefore K ≪ N in all of these cases, as desired. We comment that the worst-case scenario of Case 1 is probably not very realistic; i.e., k_cut most probably cannot grow without bound as a fixed proportion of n. At the other extreme, the best-case scenario of Case 3 occurs when k_cut has some fixed upper bound independent of n, presumably due to some biological constraint. A thorough survey of empirical data from many traits is needed to ascertain if, when, and why these (or some other) cases may occur.

It is important to recognize that the recovery method implemented in Section 5 did not take advantage of any a priori information about the gene network. In particular, the assumption of low-level concentration was not utilized, and therefore the algorithm in (17) optimized over all N = 2^n coefficients of the estimate ĝ with equal weight.
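Cases 2 and 3 can be made concrete numerically. In the sketch below (our own construction; the helper functions and the even partition into modules are assumptions for illustration), K is bounded by Σ_m 2^(n_m), the total number of within-module subsets, and compared against the polynomial and linear bounds above.

```python
import math

def K_upper(module_sizes):
    """Upper bound on significant coefficients when interactions are
    purely local: each module of size n_m contributes at most 2**n_m."""
    return sum(2 ** m for m in module_sizes)

def partition(n, k_cut):
    """Split n loci into modules of size at most k_cut."""
    full, rem = divmod(n, k_cut)
    return [k_cut] * full + ([rem] if rem else [])

c = 2.0
for n in (32, 64, 128, 256):
    k2 = max(1, int(c * math.log(n)))     # Case 2: k_cut = O(ln n)
    k3 = 4                                # Case 3: k_cut = O(1)
    K2 = K_upper(partition(n, k2))
    K3 = K_upper(partition(n, k3))
    assert K2 <= n * n ** (c * math.log(2))   # polynomial in n (Case 2)
    assert K3 <= (2 ** k3) * (n // k3 + 1)    # linear in n (Case 3)
    assert K2 < 2 ** n and K3 < 2 ** n        # both are << N = 2^n
```

Even at n = 256, where N = 2^256 is astronomically large, the Case 2 bound stays below n·n^(c ln 2), a fixed-degree polynomial.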
With a priori information, we could have achieved similar results by recovering only the specific levels {L_k} with meaningful coefficients. This "semi-oracle" method offers certain advantages and disadvantages, discussed below.

Denote by L′ the set of indices contained in the union of the (not necessarily contiguous) desired levels, with collective cardinality N′ = |L′|. This reduces the size of the ambient solution space from N to a potentially much smaller N′. We can then interpret the observations (15) to be of the form y = A_L′ g_L′ + e, where A_L′ is the M × N′ submatrix of A whose columns are restricted to the indices of L′, and the length-N′ vector g_L′ is similarly defined.

If we are fortunate enough to have access to enough measurements so that M ≥ N′, then, assuming A_L′ has full rank, we can perform a traditional least-squares regression on ĝ_L′, similar to (23). Otherwise, if M < N′, then we can adapt (16) and (17) in terms of L′ and N′. For the sparseness assumption to hold, it is desirable to choose enough levels so that g_L′ has relatively few large coefficients, yet still keep N′ small enough for processing with the available computing resources. Then we can recover

    ĝ_L′ = argmin over g̃_L′ in R^N′ of ‖g̃_L′‖₁ subject to ‖y − A_L′ g̃_L′‖₂ ≤ ε.

There are two advantages to reducing the size of the ambient solution space. First, it offers a way to attack the problem with somewhat fewer measurements, which may be a real-world constraint. For the M < N′ case, the reduction is due to the logarithmic factor in (16) operating on a smaller ambient space N′, as well as potentially to a smaller effective K based on the choice of L′.
Second, and more importantly, since N = 2^n, even with moderately sized n the exponentially large vectors can quickly outstrip the capabilities of many computing platforms. For instance, the desirable "fast Fourier/Hadamard transform" may be difficult to implement. As such, the freedom to map the problem to a smaller ambient space is attractive, as it may be the only practical path to a solution. The main disadvantage of recovering only specific levels is that it clearly prevents us from discovering coefficients that may reside outside the targeted levels, which would incur some error.

Of course, we can use this technique to recover just the lower levels, since we believe the vast majority of meaningful coefficients to be located there (whether or not due to modularity). Then, at the expense of possibly missing some higher-level coefficients, we can approximate the gene network up to some maximum level k_max. In this case L′ = ⋃_{k=0}^{k_max} L_k and N′ = Σ_{k=0}^{k_max} |L_k| = Σ_{k=0}^{k_max} C(n, k). If k_max is not too large, then for n not too small we have N′ ≪ N. For example, suppose we are investigating a quantitative trait affected by n = 100 genes, and are only interested in discovering the gene interactions up to and including clusters of size k_max. The resulting N′, compared to N = 2^100, shows the dramatic reduction in ambient space size, while the corresponding logarithmic factors in (16) result in a reduction in the number of necessary measurements by a factor of about 3.
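The M ≥ N′ regime reduces to ordinary least squares on the restricted columns. Below is a minimal sketch under our own toy assumptions (n = 10 loci, levels up to k_max = 2, and the Sylvester index ordering, so that a column's level is the Hamming weight of its index; none of the sizes are from the paper's simulations).

```python
import math
import numpy as np

n_bits, k_max = 10, 2
N = 2 ** n_bits

# L' = union of levels L_0, ..., L_kmax: in the Sylvester ordering,
# column j sits in level popcount(j) (the Hamming weight of j).
L_prime = [j for j in range(N) if bin(j).count("1") <= k_max]
N_prime = len(L_prime)
assert N_prime == sum(math.comb(n_bits, k) for k in range(k_max + 1))  # 56

def H_entry(i, j):
    """Hadamard entry H[i, j] = (-1)^popcount(i & j)."""
    return -1.0 if bin(i & j).count("1") % 2 else 1.0

rng = np.random.default_rng(1)
g_Lp = rng.normal(size=N_prime)       # network supported on the low levels

# Observe M >= N' random genomes; keep only the columns indexed by L'
M = 100
rows = rng.choice(N, size=M, replace=False)
A_Lp = np.array([[H_entry(i, j) / N for j in L_prime] for i in rows])

y = A_Lp @ g_Lp                       # noiseless trait observations
g_rec, *_ = np.linalg.lstsq(A_Lp, y, rcond=None)
assert np.allclose(g_rec, g_Lp)       # exact recovery when M >= N'
```

Here M = 100 measurements suffice for N′ = 56 unknowns, versus the N = 1024 coefficients of the unrestricted problem; with noise, the same least-squares step yields the usual regression estimate instead of exact recovery.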
7. Conclusion and Future Work
In this paper, we define a spectrum of gene interactions that permits the analysis of effects as a sum of contributions from all possible sets of gene loci. The exponential expansion of combinations leads to the scale problem. We show that compressive sensing offers a path toward addressing the scale problem of describing the effects of genes on quantitative traits.

There is a class of traits whose smoothness allows a precise description based upon a much smaller number of observations than the 2^n that would theoretically be required. This occurs in traits composed from more-or-less independent modules. Such modular structures greatly generalize previous attempts to reconstruct a trait from independent allele effects. For this class of traits (and possibly others), the Fourier transform provides a natural "compression" of the data in the gene network. The uncertainty principle for the discrete group Z_2^n ensures that, since the gene network space is sufficiently concentrated, the effects in trait space are pervasive, so that a limited number of randomly chosen organisms will provide useful data. In effect, the "needles in the haystack" of the gene network cannot hide, because they will poke all of the genomes. Much work still remains to find traits with the requisite smoothness. This may involve recasting biological traits by a monotone transformation.

The gene network is defined in a way independent of any population. The network coefficients constructed from any data set should be considered as proxies, which should improve over time as larger data sets are collected. Knowledge of the gene network is valuable for two reasons. First, it provides insight into the important interactions of the gene loci and overall gene function. Finding such interactions might help in designing drugs and other gene therapies, screening individuals for drug trials and warning of adverse interactions, analyzing disease susceptibility, and anticipating the effects of gene editing (e.g., CRISPR). Second, the gene network is a concise formulation predicting average trait values for unobserved organisms via the (inverse) Fourier transform (24).

Our reconstruction algorithm did not use any a priori information about the gene network (e.g., the low-level concentration phenomenon or correlation across its support). In a more realistic setting we may have knowledge assigning genes to distinct functional modules [19]. Similarly, our simulations took trait values attached to randomly chosen genomes. But we may have previous knowledge of the genome structure of an observed population that is the result of selective cross-breeding. We may also have knowledge either of linkage equilibrium or disequilibrium. Specialized or machine learning algorithms may exploit prior information to improve calculation of the recovered gene network and reduce error. For example, even partial knowledge of the support and distribution of coefficients of g can enable a weighted ℓ1-minimization in (17).

Due to limited computing resources, we simulated trait–gene network pairs affected by only n = 26 genes in an ambient space of size N = 2^26. The hidden gene networks had K = 812 meaningful coefficients, which we recovered from the M = ,336 noisy trait measurements with about 97.6% accuracy in the presence of 20 dB noise.

We showed in Section 6.1 scenarios where the sparsity K of a gene network can be much less than the ambient space N = 2^n. According to (16), the relative number of necessary measurements M will also be much smaller than N; in some circumstances it is even polynomial. Recalling Remark 2 on page 17, this corresponds to an ideal scenario in which to employ compressive sensing. The sparsity and undersampling ratios in our simulations were both small fractions of a percent (K/N = 812/2^26 ≈ 0.0012%). Even though for larger problems K, and therefore M, needs to be proportionally larger, it is astonishing to imagine the relative paucity of data necessary to analyze trait–gene network pairs affected by, say, n > 50 genes. However, there are significant challenges associated with memory, data communication, and processing of exponentially large vectors in the reconstruction step.

Some simplifications and assumptions were made in the conducted simulations for the sake of demonstration. However, our results clearly illustrate the viability and potential of this approach to the scale problem. Yet there are several open questions and issues that need to be resolved in the future.

• Crucially, will empirical data affirm the fundamental assumption of smooth traits?
• Can we determine with high probability the unique gene network signatures for different classes of quantitative traits? If so, then we can use this a priori knowledge to assist in recovering similar traits.
• For a specific quantitative trait affected by n genes, what is the necessary number of measurements M? Clearly, as the number of measurements increases, we expect better approximations to ĝ and t̂, and thus smaller errors as defined in (19).
• We do not necessarily have the freedom to choose which genomic sequences to measure. Moreover, actual trait sampling may be limited to a narrow, non-diverse subset of a population. How does this constraint of non-randomness in the subsampling step (15) affect the reconstruction?
• The noise model used in (15) is standard in the signal processing literature. Is there a better model appropriate for genomic data that also complements the Boolean lattice topology? Can results from [40, 41, 42] be applied?

In future work, we are interested in pursuing the theoretical case where all interactions are of level k = 2, so that the gene network is a graph with labeled edges. If there are interactions, even of this restricted type, the central limit theorem need not hold and the distribution of a trait need not converge to a Gaussian curve. The bell-shaped but fat-tailed distribution of biological traits is consistent with the notion that smooth traits are ubiquitous. In summary, if empirical data support the existence of smooth traits, then compressive sensing may prove to have a significant impact in mapping genes to traits.
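As a small, self-contained illustration of the sparse recovery step discussed above, the sketch below solves a LASSO surrogate of (17) by (optionally weighted) iterative soft-thresholding (ISTA), followed by a least-squares debiasing pass on the detected support. The sizes, parameters, and the solver itself are our own toy choices, not the procedure used in our simulations.

```python
import numpy as np

rng = np.random.default_rng(7)
n_bits = 8
N, M, K = 2 ** n_bits, 160, 4

# Partial Hadamard sensing matrix with unit-norm columns:
# H[i, j] = (-1)^popcount(i & j), rows subsampled at random.
H = np.array([[(-1.0) ** bin(i & j).count("1") for j in range(N)]
              for i in range(N)])
rows = rng.choice(N, size=M, replace=False)
A = H[rows] / np.sqrt(M)

# K-sparse "gene network" with entries of magnitude >= 1
g = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
g[support] = rng.choice([-1.0, 1.0], size=K) * rng.uniform(1.0, 3.0, size=K)
y = A @ g                                  # noiseless measurements

# ISTA: gradient step, then (weighted) soft-thresholding.
# step = M/N = 1/||A||_2^2, since the subsampled rows stay orthogonal.
lam, step, w = 0.01, M / N, np.ones(N)
x = np.zeros(N)
for _ in range(1000):
    x = x + step * (A.T @ (y - A @ x))
    x = np.sign(x) * np.maximum(np.abs(x) - step * lam * w, 0.0)

# Debias: least squares restricted to the detected support
S = np.flatnonzero(np.abs(x) > 0.3)
coef, *_ = np.linalg.lstsq(A[:, S], y, rcond=None)
x_db = np.zeros(N)
x_db[S] = coef

assert np.linalg.norm(x_db - g) / np.linalg.norm(g) < 1e-6
```

Partial knowledge of the likely support could be encoded by shrinking the corresponding entries of the weight vector w, which is the weighted ℓ1 variant mentioned above.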
References

[1] J.-H. Park, M. H. Gail, C. R. Weinberg, R. J. Carroll, C. C. Chung, Z. Wang, S. J. Chanock, J. F. Fraumeni, N. Chatterjee, Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants, Proceedings of the National Academy of Sciences 108 (44) (2011) 18026–18031.
[2] F. Galton, Natural Inheritance, Macmillan and Co., London, New York, 1889.
[3] I. G. Szendro, M. F. Schenk, J. Franke, J. Krug, J. A. G. M. de Visser, Quantitative analyses of empirical fitness landscapes, Journal of Statistical Mechanics: Theory and Experiment 2013 (01) (2013).
[4] D. Uminsky, M. Banuelos, L. González-Albino, R. Garza, S. A. Nwakanma, Detecting higher order genomic variant interactions with spectral analysis, in: 2019 27th European Signal Processing Conference (EUSIPCO), 2019, pp. 1–5.
[5] A. C. Gilbert, M. A. Iwen, M. J. Strauss, Group testing and sparse signal recovery, in: 42nd Asilomar Conference on Signals, Systems, and Computers, 2008.
[6] W. Dai, O. Milenkovic, M. A. Sheikh, R. G. Baraniuk, Probe design for compressive sensing DNA microarrays, in: 2008 IEEE International Conference on Bioinformatics and Biomedicine, 2008, pp. 163–169.
[7] N. Shental, A. Amir, O. Zuk, Identification of rare alleles and their carriers using compressed se(que)nsing, Nucleic Acids Research 38 (2010) e179.
[8] Y. Erlich, A. Gordon, M. Brand, G. J. Hannon, P. P. Mitra, Compressed genotyping, IEEE Transactions on Information Theory 56 (2) (2010) 706–723.
[9] Y. Erlich, A. Gilbert, H. Ngo, A. Rudra, N. Thierry-Mieg, M. Wootters, D. Zielinski, O. Zuk, Biological screens from linear codes: theory and tools, bioRxiv (2015).
[10] F. Parvaresh, H. Vikalo, S. Misra, B. Hassibi, Recovering sparse signals using sparse measurement matrices in compressed DNA microarrays, IEEE Journal of Selected Topics in Signal Processing 2 (3) (2008) 275–285.
[11] S. Vattikuti, J. J. Lee, C. C. Chang, S. D. H. Hsu, C. C. Chow, Applying compressed sensing to genome-wide association studies, GigaScience 3 (10) (Jun 2014).
[12] J. A. G. M. de Visser, J. Krug, Empirical fitness landscapes and the predictability of evolution, Nature Reviews Genetics 15 (7) (2014) 480–490.
[13] E. J. Candès, J. Romberg, T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory 52 (2) (Feb. 2006) 489–509.
[14] L. H. Hartwell, J. J. Hopfield, S. Leibler, A. W. Murray, From molecular to modular cell biology, Nature 402 (1999) 47–52, http://dx.doi.org/10.1038/35011540.
[15] D. Waxman, J. R. Peck, Pleiotropy and the preservation of perfection, Science 279 (5354) (1998) 1210–1213.
[16] A. Clauset, Lecture 15, in: Inference, Models and Simulation for Complex Systems, class notes, Fall 2011, http://tuvalu.santafe.edu/~aaronc/courses/7000/csci7000-001 2011 L15.pdf.
[17] G. H. Thomson, The factorial analysis of human ability, British Journal of Educational Psychology 9 (2) (1939) 188–195.
[18] S. Sniekers, et al., Genome-wide association meta-analysis of 78,308 individuals identifies new loci and genes influencing human intelligence, Nature Genetics 49 (2017) 1107–1112.
[19] T. V. Tkatchenko, D. Troilo, A. Benavente-Perez, A. V. Tkatchenko, Gene expression in response to optical defocus of opposite signs reveals bidirectional mechanism of visually guided eye growth, PLOS Biology 16 (10) (2018) 1–26.
[20] N. Linial, Y. Mansour, N. Nisan, Constant depth circuits, Fourier transform, and learnability, Journal of the ACM 40 (3) (July 1993).
[21] R. O'Donnell, Analysis of Boolean Functions, Cambridge University Press, 2014.
[22] C. Garban, J. E. Steif, Noise Sensitivity of Boolean Functions and Percolation, Cambridge University Press, 2014.
[23] M. Nowak, Evolutionary Dynamics, Harvard University Press, 2006.
[24] J. J. Sylvester, Thoughts on inverse orthogonal matrices, simultaneous sign-successions, and tessellated pavements in two or more colors, with applications to Newton's rule, ornamental tile-work, and the theory of numbers, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 34 (232) (1867) 461–475.
[25] K. J. Horadam, Hadamard Matrices and Their Applications, Princeton University Press, 2007.
[26] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, 2010.
[27] L. McMackin, M. A. Herman, B. Chatterjee, M. Weldon, A high-resolution SWIR camera via compressed sensing, in: Proc. SPIE: Infrared Technology and Applications XXXVIII, Vol. 8353, Baltimore, MD, May 2012.
[28] M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, R. G. Baraniuk, Single-pixel imaging via compressive sampling, IEEE Signal Processing Magazine (2008).
[29] M. Lustig, D. L. Donoho, J. M. Santos, J. M. Pauly, Compressed sensing MRI, IEEE Signal Processing Magazine 25 (2) (2008) 72–82.
[30] J. Ellenberg, Fill in the blanks: Using math to turn lo-res datasets into hi-res samples, Wired Magazine (Feb 2010).
[31] FDA, 510(k) premarket notification of HyperSense (GE Medical Systems), Tech. rep., 2017.
[32] FDA, 510(k) premarket notification of Compressed Sensing Cardiac Cine (Siemens), Tech. rep., 2017.
[33] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory 52 (4) (2006) 1289–1306.
[34] E. J. Candès, J. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Comm. Pure Appl. Math. 59 (2006) 1207–1223.
[35] E. J. Candès, T. Tao, Near-optimal signal recovery from random projections: Universal encoding strategies?, IEEE Transactions on Information Theory 52 (12) (2006) 5406–5425.
[36] M. Rudelson, R. Vershynin, Sparse reconstruction by convex relaxation: Fourier and Gaussian measurements, in: 2006 40th Annual Conference on Information Sciences and Systems, 2006, pp. 207–212.
[37] E. J. Candès, Y. Plan, A probabilistic and RIPless theory of compressed sensing, IEEE Transactions on Information Theory 57 (11) (2011) 7235–7254.
[38] R. G. Baraniuk, Compressive sensing, IEEE Signal Processing Magazine 24 (4) (2007) 118–121.
[39] E. J. Candès, M. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine 25 (2) (2008) 21–30.
[40] M. A. Herman, T. Strohmer, General deviants: An analysis of perturbations in compressed sensing, IEEE Journal of Selected Topics in Signal Processing: Special Issue on Compressive Sensing 4 (2) (Apr 2010).
[41] M. A. Herman, D. Needell, Mixed operators in compressed sensing, in: Proc. IEEE Conf. on Inf. Sci. and Syst., Princeton, NJ, Mar. 2010.
[42] Y. Chi, L. L. Scharf, A. Pezeshki, A. R. Calderbank, Sensitivity to basis mismatch in compressed sensing, IEEE Transactions on Signal Processing 59 (5) (2011) 2182–2195.
[43] T. Goldstein, C. Studer, R. Baraniuk, FASTA: A generalized implementation of forward-backward splitting, http://arxiv.org/abs/1501.04979 (January 2015).
[44] S. Dorogovtsev, J. Mendes, Evolution of Networks: From Biological Nets to the Internet and WWW, OUP Oxford, 2013.