GPU accelerated enumeration and exploration of HP model genotype-phenotype maps for protein folding
Abstract
Evolution can be broadly described in terms of mutations of the genotype and the subsequent selection of the phenotype. The full enumeration of a given genotype-phenotype (GP) map is therefore a powerful technique for examining evolutionary landscapes. However, because the number of genotypes typically grows exponentially with genome length, such calculations rapidly become intractable. Here I apply graphics processing unit (GPU) techniques to the hydrophobic-polar (HP) model for protein folding. This GP map is a simple and well-studied model for the complex process of protein folding. Prior studies on relatively small 2D and 3D lattices have been carried out exclusively with conventional central processing unit (CPU) approaches. By using GPU techniques, I was able to reproduce the pioneering calculations of Li et al.[1] with a speed up of 580-700 fold over a CPU. I was also able to perform the largest enumeration to date, of the HP model on the 6 × 6 lattice.
S. Owen - Thesis submitted for MPhys. Oxford
In the 1890s, the Dutch botanist Hugo de Vries rediscovered and expanded upon Mendel's groundbreaking work on heredity[2, 3]. This work revolutionized the field of genetics. Within 20 years, Mendel's principles of inheritance had been applied convincingly to the model organism
Drosophila melanogaster[4], and the mathematician R. A. Fisher had begun to develop the mathematical framework of population genetics[5], which incorporated Mendel's principles into evolutionary theory. Fisher's work was the first quantification of the distribution and propagation of heritable genetic information. Ever since these great shifts in approach and understanding, scientists across many disciplines have been fascinated by the relationship between the information that is inherited by, and the resulting properties of, an organism[6, 7].

In 1911, Johannsen introduced the fundamental words phenotype, genotype and gene to biology[8], vital terminology for the discussion of the nature of evolution. Phenotype, a broad term, refers to a set of observable characteristics such as the functional shape of a protein fold or the function of a gene regulatory network. The corresponding genotypes are encoded in the DNA and can be coarse-grained to the amino acid composition of a protein or the set of interacting genes in a regulatory network[9]. Thus the genotype refers to the information that generates the phenotype.

In 1991, Alberch formalized the idea of a mapping between genotype and phenotype[10], leading to the introduction of the term genotype-phenotype (GP) map. Since then this concept has proven very fruitful, because a GP map details which phenotypes are accessible to which genotypes. The number of genotypes that map to a phenotype is called its designability[1]. The GP map also encodes the possible consequences of mutations to the genotype. For example, it is well known that many mutations are effectively neutral; they don't change the phenotype[11]. A quantitative measure of the frequency of these neutral mutations is the robustness: phenotype robustness is the average probability that a mutation to one of the phenotype's genotypes is neutral, i.e.
maps back to the same phenotype[12].

Exploring these properties of GP maps requires the exhaustive enumeration of all genotypes and phenotypes in the model. This can prove difficult for two reasons. Firstly, for any biologically realistic system, the simulation of a GP map is computationally expensive. Secondly, for most biological systems an accurate mathematical description of the GP map is not known. Because of this, and despite their utility, studies have typically been limited to relatively simple model systems. Examples of GP maps that have been explored include simplistic sequence-structure relationships of molecules such as RNA or proteins[13]. GP maps of higher level systems such as gene transcription networks[14] and gene-regulatory networks have also been explored[9]. Even for these simple systems computational expense remains a critical bottleneck for progress.

It should be kept in mind that the environment also plays an important role in evolution. The GP map is not a complete description of evolution, but without starting somewhere, progress can't be made. In fact, a great deal of work has been done on the evolutionary implications of data from model GP maps[15, 16, 17]. It is thought that the interplay of properties such as designability and robustness is important for understanding broader principles such as evolvability: the ability of evolution to generate novel heritable phenotypes. For example, higher designability means that a phenotype is more likely to arise by random selection of a genotype. Similarly, high robustness means that genotypes are more likely to form a percolating neutral network (NN) of genotypes that all map onto the same phenotype. A NN in turn allows a population to travel through evolutionary space over time without losing functionality[18], but at the same time increases the number of alternative phenotypes that may be mutationally accessible. Thus NNs are of key importance in providing adaptability to a system[19, 20].
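The quantities introduced above can be made concrete with a small toy example. The following sketch is mine, not from the report: the threshold-based phenotype function is an arbitrary illustration, but the definitions of designability (number of genotypes per phenotype) and robustness (average fraction of neutral single point mutations) follow those given here.

```python
from itertools import product
from collections import Counter

K, L = 2, 4  # toy alphabet size and genome length

def phenotype(g):
    # hypothetical toy GP map: phenotype determined by the number of 1s
    return "high" if sum(g) >= 3 else "low"

genotypes = list(product(range(K), repeat=L))
designability = Counter(phenotype(g) for g in genotypes)  # N_q for each phenotype q

def neighbors(g):
    # all (K-1)*L single point mutants of genotype g
    for i in range(L):
        for a in range(K):
            if a != g[i]:
                yield g[:i] + (a,) + g[i + 1:]

def robustness(q):
    # average fraction of neutral single point mutations for phenotype q
    members = [g for g in genotypes if phenotype(g) == q]
    neutral = sum(phenotype(n) == q for g in members for n in neighbors(g))
    return neutral / (len(members) * (K - 1) * L)
```

With this toy map, `designability["high"]` is 5 (the five genotypes with at least three 1s) and `robustness("high")` is 0.4: on average 40% of mutations to those genotypes are neutral.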
Spaghetti vs plum pudding model
One key question for GP maps is whether or not the neutral networks of two phenotypes can be connected by single point mutations. One extreme is the 'spaghetti bowl' or 'spaghetti' model (Figure 1A), where the NNs are widely distributed and so phenotypes can make contact with many others. The other extreme is the 'plum pudding' model (Figure 1B), where NNs are much more isolated from one another. This distinction is important because the 'spaghetti' model is thought to make a system much more evolvable, since such a system can more easily access evolutionary novelty through neutral exploration of its NNs.
Relationships between designability andcomplexity
Biologically realistic GP maps typically have a small number of highly designable phenotypes to which most of the genotypes map[21, 22]. Discovering what properties make some phenotypes exceptionally designable is an interesting problem in and of itself. Such a discovery may also shed light on why only a relatively small number of fundamentally different protein folds are observed in nature[23], whereas on theoretical grounds many orders of magnitude more are expected to be possible.

Figure 1: Two hypothetical continuous genotype-phenotype spaces with different topologies; A depicts the 'spaghetti bowl' model and B depicts the 'plum pudding' model. Regions of color represent neutral networks, with the area representing the quantity of component genotypes. Different colors represent different phenotypes. Gray represents deleterious phenotypes — for example, in a GP map of protein folding, proteins that don't fold. In the 'spaghetti' model, phenotypes percolate the space and make connections with many others. In the 'plum pudding' model, phenotypes are isolated and make only limited contact with others.

The properties of phenotypes that have so far been explored have necessarily depended upon the models used. Several authors have attempted to calculate which properties of phenotypes correlate with designability[24, 25]. One example that has been investigated is the symmetry of protein folds in a 2D lattice protein folding model. Li et al. examined protein folding on a 6 × 6 lattice and various symmetries of the folds[25], and found that the designability of a fold increased on average with x-y symmetry and rotational symmetry.

Whilst attempting to find a model-independent measure that correlates with designability, Louis et al. (unpublished) noted that highly symmetric structures may minimize a property called Kolmogorov complexity.
Kolmogorov complexity is, put simply, the minimum length of a defining description of an object[26]. Highly symmetric structures include redundant information and as such may require a shorter description to specify. This measurement of symmetry may therefore be an indirect approximation of Kolmogorov complexity[27]. Further investigation into more complete measures of complexity may elucidate correlations between the designability of phenotypes and their complexity.

GP maps for protein folding models
There are 20 amino acids which make up the standard alphabet for protein sequences. Chains of amino acids spontaneously fold into complex, ordered structures called proteins. Modeling folding proteins has been of great theoretical interest since the classic experiment by Anfinsen in 1972[28], which showed that for small globular proteins, all of the information a protein needs to fold is encoded in its amino acid sequence.

The protein folding problem is a grand challenge in science and many authors have approached it[29, 30, 31]. Predicting the structure of any protein from its amino acid sequence is theoretically possible but practically impossible given current technology. The 'Hydrophobic-Polar' (HP) model is a much simplified protein folding model that attempts to tackle this complexity issue. The HP model was proposed by Dill in 1985[32] and exploits the fact that, to first order, all 20 amino acids can be separated into two categories: hydrophobic (H) and polar (P). Secondly, the configuration space is greatly simplified to simple lattices. The HP model therefore reduces a protein to a binary string of polar and hydrophobic amino acids following a self-avoiding walk on a lattice. It has been widely studied, and despite its simplicity, it is thought to reproduce some key aspects of the physics of protein folding. A schematic showing how a 2D fold on a small lattice is similar to a biological fold comprised of alpha helices and beta sheets is shown in Figure 2.

Figure 2: An example of a path on a 2D lattice and a biologically similar structure with alpha helices (turquoise) and beta sheets (pink). This shows how 2D and 3D lattice structures can be translated into biological substructures.

It is more biologically realistic to calculate the solvation energy of protein folds with all 20 amino acids.
This is typically done using the Miyazawa-Jernigan (MJ) energies, which are calculated statistically from known contacts within proteins[33]. The results of this model on simple 2D and 3D lattices have been compared to the simplified HP system; both models produce similar results, giving justification that the very simple and less computationally expensive HP model is biologically relevant[33].

The simplification provided by the HP model is necessary in producing a GP map because calculating every genotype and phenotype for a model using all 20 amino acids is an intensive process; for a very modest protein of length 27 on a 3 × 3 × 3 lattice, there are 20^27 ≈ 10^35 possible sequences and 103367 folds for which the energy of each sequence must be evaluated. Another example that illustrates the size of the genotype space for proteins is that one copy of every possible protein of length 37 would together weigh approximately twice the mass of the Earth. When enumerating all sequences is impossible, sampling a significant number of genotypes is often used to provide some insight into the structure of the GP space.

Although simplistic and therefore less computationally demanding than the MJ model, the HP folding problem is still NP-complete[34]. This means that it is believed to be a computationally intractable problem: the solution space grows exponentially with the problem size[35] and there is no known shortcut to an exact solution[36]. In these models, even if we only look at a sample of all possible sequences, each sequence must still be exhaustively folded through all possible configurations in order to ensure the ground state is not missed. This is the primary reason why the acceleration of calculations is essential for examining and creating GP maps for systems larger than those studied to date. The HP and MJ models therefore present a biologically relevant problem with a great need for acceleration and expansion.
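The core computation described here — exhaustively folding each sequence over every compact structure to find a unique ground state — can be sketched at toy scale. This is my CPU illustration, not the report's CUDA code; the 3 × 3 lattice and unit H-H contact energy are illustrative stand-ins for the larger libraries discussed in the text.

```python
def compact_folds(n=3):
    """All directed compact self-avoiding walks (Hamiltonian paths) on an n x n grid."""
    cells = {(x, y) for x in range(n) for y in range(n)}
    paths = []

    def extend(path, used):
        if len(path) == n * n:
            paths.append(tuple(path))
            return
        x, y = path[-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in cells and nxt not in used:
                path.append(nxt)
                used.add(nxt)
                extend(path, used)
                path.pop()
                used.remove(nxt)

    for start in cells:
        extend([start], {start})
    return paths

def dedupe(folds, n=3):
    """Keep one representative per symmetry class (8 dihedral images of the square)."""
    seen, reps = set(), []
    images = (
        lambda x, y: (x, y),             lambda x, y: (n - 1 - x, y),
        lambda x, y: (x, n - 1 - y),     lambda x, y: (n - 1 - x, n - 1 - y),
        lambda x, y: (y, x),             lambda x, y: (n - 1 - y, x),
        lambda x, y: (y, n - 1 - x),     lambda x, y: (n - 1 - y, n - 1 - x),
    )
    for f in folds:
        canon = min(tuple(g(x, y) for x, y in f) for g in images)
        if canon not in seen:
            seen.add(canon)
            reps.append(f)
    return reps

def energy(seq, fold):
    """HP contact energy: -1 per H-H lattice contact between chain-nonadjacent monomers."""
    pos = {p: i for i, p in enumerate(fold)}
    e = 0
    for i, (x, y) in enumerate(fold):
        for nb in ((x + 1, y), (x, y + 1)):  # each lattice bond counted once
            j = pos.get(nb)
            if j is not None and abs(i - j) > 1 and seq[i] == "H" and seq[j] == "H":
                e -= 1
    return e

def ground_state(seq, folds):
    """Return the unique minimum-energy fold, or None (deleterious) if degenerate."""
    energies = [energy(seq, f) for f in folds]
    m = min(energies)
    return folds[energies.index(m)] if energies.count(m) == 1 else None
```

On the 3 × 3 grid every compact fold has exactly four contacts, so the all-H sequence is always degenerate (deleterious); for real lattices the fold library is vastly larger, which is exactly why the per-sequence cost explodes and GPU acceleration pays off.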
Several authors have derived properties of the GP map for these models on small lattices. In 2008, Goldstein investigated the HP model on a small 2D lattice and suggested that the genotype-phenotype space for the HP model in general is 'plum pudding' like[37]. This would mean that the phenotypes are isolated, having limited connections with other 'plums'. Ferrada and Wagner explored the same idea in 2012[22], suggesting as well that the phenotypes for proteins were less connected than the 'highly interwoven' nature of the RNA GP map.

GPU computing applied to protein folding
General purpose graphics processing unit computing is the use of graphics processing units (GPUs) to accelerate programs that would traditionally run on a central processing unit (CPU)[38]. This technique has seen a relatively recent rise in usage for a variety of scientific problems[39]. It has achieved particular success in an Ising spin model (20-fold speed up)[40] and a non-hydrostatic weather model (80-fold speed up)[41]. The Folding@home project has been able to utilize consumer GPUs on laptops and home computers to achieve an average 60-fold speed up on participants' GPUs compared to their CPUs[42].

This project brought together two exciting aspects of current research: 1) the analysis of biological genotype-phenotype spaces and 2) the use of GPUs to accelerate scientific simulations. I wrote and verified a program that carried out the enumeration and sampling of HP and MJ models on given 2D and 3D lattices on the GPU. I was then able to investigate whether the HP GP space is more like 'spaghetti' or 'plum pudding', and whether the previous results from Ferrada and Goldstein were due to finite size effects. I then used the HP and MJ models on the 2D lattice to investigate the link between various measurements of the information content or complexity of a phenotype and its designability. The minimum defining chain length and compression approximations of Kolmogorov complexity were evaluated for this lattice.

The HP and MJ models use the H and P monomers or an alphabet of 20 amino acids respectively. These models are on-lattice and require the full library of structures that a sequence can fold to. This library of possible folds is equivalent to all compact, self-avoiding random walks on the lattice. The use of only the compact structures is justifiable in that they quite accurately mimic globular proteins and that the most compact configuration of a protein is typically the most stable[43].
A stable protein fold is the lowest energy state for its sequence, so finding the ground state is equivalent to minimizing the Hamiltonian

H = Σ_{i<j} E_{σ_i σ_j} Δ(r_i − r_j),  (1)

where σ_i is the type of the monomer at chain position i, E_{σ_i σ_j} is the contact energy between monomer types σ_i and σ_j, and Δ(r_i − r_j) = 1 if monomers i and j are nearest neighbors on the lattice but not adjacent along the chain, and 0 otherwise.

Table 2: CUDA terminology
Kernel — A function within the CUDA script, called by the host (CPU) and executed on the device (GPU) by an array of threads.
Thread — A basic element of data to be processed, delivered to a single processing unit.
Warp — A group of 32 threads. Threads are launched in warps. One warp is scheduled on one multiprocessor.
Block — A group of threads that can cooperate via shared memory and synchronization. Should be a multiple of the warp size.
Grid — A structure for launching multiple blocks.

Figure 4: Schematic of a GPU and its host CPU. The different components of the GPU are described in the main text.

The memory within a computer with a GPU is organized into a hierarchy of five structures. Each multiprocessor includes thousands of registers which can be accessed locally, and as such this memory access is very fast. Each multiprocessor has access to 48 kB of shared memory which allows synchronization between threads in a block; this is the second fastest memory that can be accessed by a thread. There is approximately 4 GB of global memory on separate DRAM chips that can be accessed by every thread, but this is the slowest memory access. Texture memory is a region of read-only memory that is specifically set aside for fast access and is available locally to each thread. The host memory is on the CPU and cannot be accessed by threads; any memory transfers from the CPU to the GPU must take place outside of running kernels. This memory hierarchy is illustrated in Figure 4.

Two different NVIDIA cards were used: the GeForce GTX 750 Ti (Maxwell architecture) and the Tesla C2070 (Fermi architecture). The Intel® Core™ i5-480M CPU was used for all timed CPU calculations.

Compute Unified Device Architecture (CUDA) and the design of GPU code
CUDA was chosen as the language for this program as it has mature tools, including a debugger and profiler. The definitions of some important CUDA terms are given in Table 2. I created the new program from scratch in CUDA, taking inspiration from in-house CPU code.

To efficiently solve a problem with parallelism using CUDA on a GPU, the program should have the following qualities[40]:
1. Local calculations, keeping the need for communication between threads to a minimum.
2. Coherent threads with as little thread divergence as possible.
3. A number of threads much greater than the number of available multiprocessors.
4. Many more arithmetic operations and uses of shared memory than global memory accesses.

Figure 5: Time per fold against threads per block for (a) the HP model on the 2D lattice and (b) the HP model on the 3 × 3 × 3 lattice, for different numbers of total threads and for int and double variables. The three variables that affect the running time of the CUDA program are the data type used (integer or double), the total number of threads launched by the kernel and the way these threads are split into blocks. The size of each block is important as each multiprocessor runs one block at a time. These results show that the fastest set-ups for the HP model on both lattices use integer variables and 64 threads per block.

To optimize the running time of code on the GPU, it is important to ensure that the threads per block and block sizes specified in the launch of a CUDA kernel do not increase latency. Threads are launched in warps of 32, so to avoid wasted processors the number of threads per block should be a multiple of 32. The grid size and total threads per block are best determined within the context of the model.
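The warp-alignment rule can be captured in a small helper. This sketch is my illustration, not code from the report: it rounds a requested block size up to a multiple of the warp size and chooses the grid size by ceiling division so that every element gets a thread.

```python
WARP = 32  # threads are scheduled in warps of 32

def launch_config(total_threads, requested_tpb=64):
    """Return (blocks, threads_per_block) with tpb rounded up to a warp multiple."""
    tpb = ((requested_tpb + WARP - 1) // WARP) * WARP
    blocks = (total_threads + tpb - 1) // tpb  # enough blocks to cover all threads
    return blocks, tpb
```

For example, `launch_config(10000, 64)` gives `(157, 64)`, while a request for 50 threads per block is rounded up to 64 so that no partial warp is scheduled.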
I ran tests for varying block and grid size parameters to find the optimum set-up. I converted the floating point operations to integer operations in order to avoid issues with rounding floating point variables; if floating points are used, results can vary between different GPU architectures and between the GPU and the CPU[47]. Integer arithmetic operations sometimes use fewer cycles and therefore run faster than those using floats or doubles[48]. To examine the effect of integer or double variables on the speed of the code, I converted the decimal values in the energy matrices (Table 1 and Appendix A.1) to integer values and ran the code on the GeForce GTX 750 Ti GPU. The speed of the code using integers and doubles is compared in Figure 5, which shows that when comparing optimal integer and double runs, the integer computations run approximately twice as fast as those using doubles. Figure 5 also shows the comparisons between different total thread numbers and block sizes: code launched with a sufficiently large total number of threads ran approximately four times as fast as runs with few threads. The optimal threads per block are shown by the minima of the graphs. The fastest configurations for the HP model on both the 2D and 3 × 3 × 3 lattices used 64 threads per block. Subsequent runs were executed with these optimized parameters.

I wrote the core program for the project with the capability to run the HP or MJ models on either of the 2D lattices or the 3 × 3 × 3 lattice. I verified the accuracy of the code by running cross-checks against the results of Helling et al.[49]: the code gave the same most designable folds for the MJ model on the 2D lattice and the HP model on the 3 × 3 × 3 lattice as in Helling et al. (Figure 7). Various other cross-checks were carried out and can be seen in Appendix B.

The concepts of designability and robustness can be expressed mathematically and used to classify the topology of the GP map[21].
In this report the word 'fold' is used interchangeably with 'phenotype' and the word 'sequence' interchangeably with 'genotype'. For a model with alphabet size K and genotype length L there are K^L possible genotypes. The designability N_q of a phenotype q is defined as the size of the set of genotypes that fold to phenotype q. To define robustness we must consider the neighborhood of a genotype and a quantity called φ_pq. The 'neighborhood' of a genotype is the set of all sequences that differ from it by a single point mutation. The number of neighbors of a sequence is therefore (K − 1)L. If the designability (number of genotypes) of a phenotype is N_q then the total number of neighbors of the phenotype is N_q(K − 1)L.

φ_pq is the average probability of a mutation from phenotype q ⇒ p. It is therefore the fraction of neighbors of q that map to p:

φ_pq = (1/N_q) Σ_{i=1}^{N_q} Φ_p(g_i),  (2)

where Φ_p(g_i) is the fraction of the (K − 1)L neighbors of genotype g_i that result in phenotype p, and the sum averages over the N_q genotypes that fold to phenotype q.

The phenotype robustness can now be defined as ρ_p = φ_pp. Robustness is the average probability of a neutral mutation p ⇒ p. Another interesting property of a phenotype is φ_del, which measures the probability that a mutation results in the deleterious phenotype. The deleterious phenotype in the protein folding model is assigned to sequences with no unique ground state fold. An example of a genotype space and the nature of different φ_pq values is shown in Figure 6[21].

Figure 6: (A) A hypothetical genotype space with different colors and shapes representing different phenotypes. Neutral connections between genotypes are shown in color, highlighting the neutral networks formed. (B) A depiction of how mutations of the original phenotype contribute to φ_ij values; the thickness of the arrows represents the size of the φ value. Here the robustness has the largest φ value and mutation from the circle phenotype to the orange diamond has the smallest φ value. Taken from [21].

Before discussing the expected values of these quantities in the 'plum pudding' and 'spaghetti' models, I introduce the null model for quantitative comparison. The frequency of phenotypes is always a constraint defined by the physics of the problem (Equation 1); it is the location of phenotypes in the space that is variable. In the null model phenotypes are randomly located in the genotype space, so the phenotypes of neighbors are not correlated. A relevant quantity in the null model is the threshold γ_pq: the value of the global frequency of phenotype p such that it appears on average once in the neighborhood of q. Below this frequency the expected number of occurrences of p in the neighborhood of q drops below 1:

γ_pq = 1/(N_q(K − 1)L).  (3)

In the 'plum pudding' model, the only significant values of φ are φ_pp (the robustness) and φ_del, as the 'plums' make limited contact with others. In the 'spaghetti bowl' model, many values of φ_pq would exceed the null model for many phenotypes q, as the 'spaghetti' phenotypes make a greater number of contacts with other phenotypes.

All 2^27 ≈ 1.3 × 10^8 sequences in the HP 3 × 3 × 3 model were exhaustively enumerated to produce and analyze the full GP map. φ_pq was calculated for the neighborhood of the most designable fold (see Figure 7), and the robustness and φ_del were calculated for every phenotype in the model.

Figure 7: The most designable fold for the MJ model on the 2D lattice and the HP model on the 3 × 3 × 3 lattice.

The phenotypic robustness, φ_pq and φ_del in the 3 × 3 × 3 HP model were compared to what would be expected from the 'spaghetti', 'plum pudding' and null models.

Complexity and Designability

Current investigations into the GP maps of RNA and gene networks have suggested connections between complexity and designability (Louis et al., unpublished).
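Equations 2 and 3 can be checked on a toy GP map. In this sketch (mine, not from the report; the phenotype rule is an arbitrary stand-in, with "del" playing the role of the deleterious phenotype) φ_pq is the fraction of the single-mutation neighborhood of phenotype q that maps to p, and γ_pq is the null-model threshold frequency.

```python
from itertools import product

K, L = 2, 6
ALPHABET = (0, 1)

def phenotype(g):
    # hypothetical toy rule: middling sequences are 'deleterious'
    ones = sum(g)
    if ones == 3:
        return "del"
    return "A" if ones > 3 else "B"

gpmap = {g: phenotype(g) for g in product(ALPHABET, repeat=L)}

def phi(p, q):
    """Eq. 2: fraction of the (K-1)L point-mutant neighborhood of q mapping to p."""
    members = [g for g, ph in gpmap.items() if ph == q]
    hits = 0
    for g in members:
        for i in range(L):
            for a in ALPHABET:
                if a != g[i]:
                    hits += gpmap[g[:i] + (a,) + g[i + 1:]] == p
    return hits / (len(members) * (K - 1) * L)

def gamma(q):
    """Eq. 3: global frequency at which a phenotype appears on average once around q."""
    n_q = sum(ph == q for ph in gpmap.values())
    return 1 / (n_q * (K - 1) * L)
```

Here the robustness of phenotype A is φ(A, A) = 6/11, its deleterious fraction is φ(del, A) = 5/11, and φ(B, A) = 0: A and B never touch, a 'plum pudding' situation in miniature.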
This suggestion prompted the investigation within this project into the relationship between complexity and designability in the 2D lattice model.

There are many definitions of complexity and the measurement method can vary between models. One notion, pioneered by Kolmogorov, is Kolmogorov complexity[50]. The Kolmogorov complexity of a string x is defined as 'the size of the shortest string y from which the universal Turing machine produces x'[51]. Exact Kolmogorov complexity is incomputable, but there are many methods for approximating it. One method used in this project found an upper bound on the Kolmogorov complexity using a compression technique: the fold is written as a sequence of Up, Down, Left and Right moves (numerically represented by 0, 1, 2, 3), this string is compressed, and its resultant length is calculated. This compression estimate of Kolmogorov complexity was compared to designability for the 2D lattice phenotypes.

Another approximation to the Kolmogorov complexity I used was the minimum defining chain length (MDCL). It has been suggested that the minimum length of unique chain in a fold correlates with its designability (Dingle, unpublished). The calculation of the MDCL comprises finding the shortest path for each fold such that once that path is followed, given the starting position and the size of the lattice, the rest of the fold is fixed. This is a promising approximation to the Kolmogorov complexity as it is analogous to the shortest definition of a fold. For example, the most designable fold for the MJ model on the 2D lattice has a minimum defining chain length of 7, shown by the red path in Figure 8.

Figure 8: The most designable fold for the MJ model on the 2D lattice. The minimum defining chain length for this fold given the lattice and start point is shown in red. Once the red line is fixed, the rest of the fold is completely determined.
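The compression upper bound described above can be sketched with a general-purpose compressor. This illustration is mine, using zlib as a stand-in for whatever compressor the project used: a fold is encoded as its string of moves and the compressed length serves as a crude Kolmogorov-complexity proxy. Note that for realistically short folds the compressor's fixed overhead dominates, so longer move strings are used for the demonstration.

```python
import zlib

STEP = {(1, 0): "R", (-1, 0): "L", (0, 1): "U", (0, -1): "D"}

def moves(fold):
    """Encode a lattice fold as its string of Up/Down/Left/Right moves."""
    return "".join(STEP[(x2 - x1, y2 - y1)]
                   for (x1, y1), (x2, y2) in zip(fold, fold[1:]))

def complexity_upper_bound(move_string):
    """Compressed length in bytes: an upper-bound proxy for Kolmogorov complexity."""
    return len(zlib.compress(move_string.encode(), 9))

# a highly regular (symmetric) walk compresses better than an irregular one
regular = "RULD" * 16
irregular = "RRULDDLURULDRRUULLDDRULDURRDLLUD" * 2  # fixed 'random-looking' string
```

Running this gives `complexity_upper_bound(regular) < complexity_upper_bound(irregular)`, mirroring the suggestion that symmetric folds have lower complexity.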
An attempt was made to calculate the MDCL for all compact phenotypes on the 2D lattice. These estimates of complexity for each fold were compared with the designability of the folds to investigate whether there was any correlation between them.

I used CUDA to write programs for the GPU that calculated the ground state fold for every genotype for a given lattice and alphabet. The GPU gave a significant speed up, which enabled various explorations of the properties of the genotype-phenotype spaces of the models.

The code was designed to fit the criteria in Section 2.2. Each thread generated or imported a sequence and computed the sequence's lowest energy fold using arithmetic operations. The threads were completely independent and coherent in the operations they applied. The number of threads launched was several orders of magnitude larger than the number of processors. Finally, the number of global memory accesses was minimal, as the majority of each thread's calculations were either arithmetic operations or comparisons with values stored in registers. The CUDA code is in Appendix C and can be seen to exhibit all of these properties.

The speed ups shown in Table 3 were obtained on the Tesla C2070 card, compiled with NVCC using CUDA v3.2[46], and were compared to speeds on the Intel® Core™ i5-480M CPU. The values in Table 3 are from random sampling of a subset of the possible sequences, due to the impracticality of full enumeration on the CPU. Full enumeration is even faster per fold on the GPU, because threads do not have to access random numbers in global memory which have been transferred from the CPU.

Table 3: Time per fold for the CPU and GPU and the equivalent speed up per sequence.

The full enumeration of the 6 × 6 model, which would take over 180 years on a CPU, is underway at the time of writing and should take a total of approximately 110 days to complete on a single GPU chip.
It is worth noting that theGPU provided approximately equal accelerationsfor calculations on the × and × × lattices.This is because the limiting factor on both theCPU and GPU was the number of folds that haveto be tested for each sequence. This determinedthe number of calculations required on the CPUand determined the number of memory accesseson the GPU. There are twice as many folds on the × × lattice compared to the × lattice and assuch the GPU and CPU ran approximately twiceas fast on the × lattice compared to the × × lattice. φ pq andthe topology of the HP × × GPmap All of the results in this subsection are for the HPmodel on the × × lattice. The null model used for comparison in these results is one in whichthe location of phenotypes within the genotypespace is random.The full GP map for this model was enumerated.The distribution of designability for this model isshown in Appendix B as part of the code verifica-tion. With the phenotype for all genotypes known,I calculated the robustness of all the model’s phe-notypes. I then examined the phenotypes in theneighborhood of the top fold φ pq , and the connec-tions to the deleterious phenotype in the neigh-borhood of all phenotypes. I used these results todeduce how well the model is described by boththe ‘spaghetti’ and ‘plum pudding’ metaphors.When all sequences in this model are enumer-ated, there are 4255 out of 51704 possible foldswhich do not occur at all as ground states in the × × model, i.e. they have zero designabil-ity. This means that there are only 47449 possiblephenotypes for sequences on this lattice. In thismodel, 4.76% of all sequences have a ground statefold. All other genotypes have non-unique groundstates and therefore 95.24% of sequences have thedeleterious phenotype. Robustness The robustness against designabil-ity for all phenotypes in the HP × × model is shown in Figure 9. Figure 9: Robustness against designability for all foldsin the HP model on × × lattice. 
In the null model designability is equal to robustness (red line). The frequency of neutral mutations in the null model would fall on this line, as it would correlate exactly with the global frequency of each phenotype.

The robustness of a significant fraction of phenotypes (43.6%) is larger than the null model predicts. This suggests that in this HP model, many phenotypes are likely to have neutral networks and thereby percolate the space. The points that fall on the null model (red line) are phenotypes whose robustness equals what would be expected if there were no correlations between a fold and its neighbors. The large number of phenotypes with significantly larger robustness than the null model is evidence that a given phenotype is correlated with its neighborhood.

Phenotypes in the neighborhood of the top fold. φ_pq is the frequency of mutations that result in phenotype p in the single-mutation neighborhood of phenotype q. Here I looked at the neighborhood of the most designable fold. Figure 10 shows the local frequency of phenotypes in the neighborhood of the most designable fold against their global frequency. The distribution that would be produced by the null model is shown by the green line in Figure 10. The graph shows that in the HP 3 × 3 × 3 model there are many phenotypes which have more connections to the most designable phenotype than would be expected in the null model (i.e. those above the green diagonal).

Figure 10: log φ_pq against log frequency for connections to the top fold on the 3 × 3 × 3 lattice. Many phenotypes, despite having a frequency below the threshold, are connected to the top fold. The green line shows the null model, for which the frequency of phenotypes in the neighborhood of the top fold is random and so is equal to their global frequency. The pink vertical line shows the threshold γ, below which a phenotype is so infrequent that it is not expected to occur in a randomly connected GP map. The top right point is the robustness φ_pp (connections from the most designable fold back to itself). Because of the log scale, phenotypes with zero frequency are plotted at the bottom of the graph.

The graph also shows that below the threshold γ there are many connections from the most designable fold to other phenotypes; the existence of these connections would not be expected in the null model. The phenotypes plotted at the bottom of the graph (zero local frequency) do not occur in the immediate neighborhood of this phenotype. The robustness of the top fold is the largest φ_pq value, at the top right of the graph. This graph shows that due to its large robustness, the most designable fold is likely to percolate the space and form a large neutral network. The neighborhood of a phenotype is not random and is correlated with the phenotype itself. The large number of non-zero connections from the top fold's phenotype to other phenotypes suggests that the HP 3 × 3 × 3 model is not 'plum pudding'. In the HP 3 × 3 × 3 model, 0.86% of possible phenotypes are connected to the top fold, a larger value than would be predicted by the 'plum pudding' model.

Deleterious phenotypes. The distribution of connections to the deleterious phenotype is shown in Figure 11. In the null model the local frequency of deleterious mutations would simply equal the global frequency of deleterious mutations, which is 0.9524. The figure shows that the deleterious phenotypes are underrepresented compared to the null model, with a mean of 0.8618.

Figure 11: Local frequency of deleterious phenotypes φ_del compared to the global frequency of deleterious phenotypes f_del for all phenotypes (folds) in the HP model on the 3 × 3 × 3 lattice. The mean value of the distribution is marked in cyan. The randomised null model would have a distribution centered around f_del = 0.9524, shown in green.

Taking the robustness and φ_del together, I investigated whether all of the non-deleterious neighbors were accounted for by the robustness of a phenotype. A histogram of the fraction of neighbors that are neither deleterious nor neutral is shown in Figure 12. I found that only 890 phenotypes (1.9%) are completely isolated from any other, and that on average 4.0% of a phenotype's neighbors are other, non-deleterious phenotypes. This means that in this HP model, the vast majority of phenotypes possess many more connections to other phenotypes than if the model were described by the 'plum pudding'. The main contribution to the histogram for the 'plum pudding' model would be from phenotypes with zero distinct neighbors. Figure 12 therefore presents compelling evidence that the HP model on the 3 × 3 × 3 lattice is not well represented by the 'plum pudding' model; the phenotypes of the GP map for the HP 3 × 3 × 3 model are better described by the 'spaghetti bowl' metaphor.

Figure 12: A histogram of the frequency of non-deleterious, distinct phenotypes in the neighborhood of all phenotypes in the HP model on the 3 × 3 × 3 lattice. On average 4.0% of the neighbors of a phenotype are non-deleterious, non-neutral neighbors.

The data shown in Figures 9-11 show that the HP model on the 3 × 3 × 3 lattice fits neither the 'plum pudding' nor the 'spaghetti bowl' metaphor perfectly. The GP map contains at least one large and well connected neutral network (Figure 10) as well as many other significant neutral networks (Figure 9). However, 98.1% of phenotypes are connected to other phenotypes (Figure 12), meaning this GP map is best described by the 'spaghetti' model. The distinction cannot be made between all phenotypes being connected via mutation and their forming a small number of subnetworks; the latter can be thought of as balls of 'spaghetti'.
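The definitions used above can be made concrete with a small worked example. The sketch below uses a deliberately tiny toy GP map (phenotype = half the number of set bits in a binary genotype; this is an illustrative assumption, not the HP model) to compute a phenotype's global frequency f_p and its robustness φ_pp; for this toy map, as for many HP phenotypes, φ_pp comes out larger than the null-model prediction f_p.

```cpp
#include <cassert>

// Toy GP map (an illustration of the definitions only, NOT the HP model):
// genotypes are length-L bit strings, phenotype = popcount(g) / 2.
const int L = 10;

int phenotype(unsigned g) { return __builtin_popcount(g) / 2; }

// Global frequency f_p: fraction of all genotypes mapping to phenotype p.
double globalFrequency(int p) {
    long total = 1L << L, count = 0;
    for (unsigned g = 0; g < (1u << L); g++)
        if (phenotype(g) == p) count++;
    return double(count) / double(total);
}

// Phenotype robustness phi_pp: averaged over all genotypes with phenotype p,
// the fraction of the L single-point mutations that are neutral.
// The null model (no genotype-phenotype correlations) predicts
// phi_pp == f_p; an excess indicates a correlated neighborhood.
double robustness(int p) {
    long neutral = 0, muts = 0;
    for (unsigned g = 0; g < (1u << L); g++) {
        if (phenotype(g) != p) continue;
        for (int i = 0; i < L; i++) {   // flip each bit in turn
            muts++;
            if (phenotype(g ^ (1u << i)) == p) neutral++;
        }
    }
    return double(neutral) / double(muts);
}
```

For p = 2 this gives f_p ≈ 0.451 but φ_pp ≈ 0.545; that excess over the red null-model line is exactly the signature of a neutral network described above.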
The complexity-designability relationship can be explored by sampling the genotype space, because a sufficient distribution of the most common folds (designability) is obtained without full enumeration. Only the 6 × 6 model was used to fully analyze the relationship between complexity and designability, because spotting visual distinctions between folds is easier on the 2D lattice. The 50 most designable and 50 least designable structures are shown in Appendix B. Visual inspection shows that they are clearly different. The challenge is to discover whether this difference is correlated with a simple complexity measure.

Figure 13 shows the compression complexity approximation against designability; no pattern emerges from this comparison.

Figure 13: Compression complexity against designability for the MJ model on the 6 × 6 lattice (for random sequences). There is no correlation between complexity and designability for this measure.

Figure 14 shows the minimum defining chain length (MDCL) against designability. There is no clear correlation, but there is some indication that the least designable folds have long minimum defining chain lengths. It is possible that with improved calculation of the MDCL a clearer correlation may emerge, as the current measure does not determine the MDCL correctly for all folds. This complexity measure deserves further consideration because the MDCL is clearly a form of minimum information content. This could be biologically relevant, as it suggests that short motifs may define larger structural features that are not necessarily conserved in their amino acid sequence. The 6 × 6 lattice is, of course, artificial in its constraints and, as such, any future connection found must still be interpreted within that context.

Figure 14: Minimum defining chain length against designability for the MJ model on the 6 × 6 lattice (for random sequences). There is no clear correlation between MDCL and designability for this model.
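The general idea behind a compression-style complexity measure can be sketched on the direction string of a 2D lattice fold. The run-count proxy below is an illustrative stand-in, not the measure actually used for Figure 13.

```cpp
#include <cassert>
#include <string>

// A 6x6 fold can be written as a self-avoiding walk string over {U,D,L,R}.
// A crude compression-style complexity proxy (illustrative only): the
// number of runs in the walk. Regular folds with long straight segments
// compress to few runs; irregular folds need many.
int runComplexity(const std::string& walk) {
    if (walk.empty()) return 0;
    int runs = 1;
    for (size_t i = 1; i < walk.size(); i++)
        if (walk[i] != walk[i - 1]) runs++;   // a new run starts here
    return runs;
}
```

A zig-zag walk such as "URURUR" scores its full length, while "UUURRR" scores 2, so the proxy separates regular from irregular folds in the intended direction.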
This particular GP mapping problem was extremely well suited to the application of GPU computing. The very large speed up (580-700 fold) was due to the properties of the HP and MJ models matching many of those required for an efficient GPU application. These qualities are typical of GP mapping: locality, coherence, the dominance of arithmetic calculations and the multitude of calculations needed. Many GP mappings are therefore likely to be greatly sped up on the GPU, so the result generalises. This applicability is already evident in my work with Chico Camargo (University of Oxford), who has produced a 1000-fold speed up in the calculation of the GP map for gene networks by implementing his model on a GPU, for which he used concepts from my code (Appendix C).

There are two simple improvements that could further reduce the time per fold I have achieved. Firstly, a network of GPUs would increase computational throughput. Secondly, generating random numbers on the GPU would reduce the number of required memory accesses, in contrast to the current method, which requires copying large quantities of random numbers from the CPU; this would increase performance for any model requiring random sampling. Given the relative inexpensiveness of GPUs and the ever expanding range of toolkits provided by NVIDIA, both options would be worth pursuing in further research. A full enumeration of the HP model on the 6 × 6 lattice on the GPU is currently running and will allow the exploration of its full GP map. Before this acceleration, such a calculation was impossible due to the extreme length of time it would have taken (approximately 180 years).

The investigation of the 3 × 3 × 3 HP model has shown that, contrary to Goldstein's postulation[37], the HP model does not fit the 'plum pudding' description, and that their results are an artefact of the small size of their lattice.
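As a back-of-envelope check on the feasibility claim above, the quoted 180-year CPU estimate and the measured 580-700 fold speedup imply a GPU run time of a few months:

```cpp
#include <cassert>

// Converts the quoted CPU estimate (in years) into GPU days for a given
// speedup factor; the 365-day year is the only assumption added here.
double gpuDays(double cpuYears, double speedup) {
    return cpuYears * 365.0 / speedup;
}
```

With the figures from the text, gpuDays(180, 700) ≈ 94 days and gpuDays(180, 580) ≈ 113 days, which is why the 6 × 6 full enumeration becomes practical on a GPU.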
It is important to bear in mind that the HP model on the 3 × 3 × 3 lattice has a very large genotype space (2^27 ≈ 1.3 × 10^8 genotypes) and that 95.24% of its genotypes have the deleterious phenotype (do not fold). If the GP map, given these constraints, fitted the 'plum pudding' model, we would expect to see more than 1.8% of phenotypes being completely isolated and many fewer than 4% of mutations resulting in new phenotypes. This HP model is even more poorly described by the null model, as there are significant correlations between neighbors. The HP model is best described by the 'spaghetti' model, but no simplistic metaphor describes this space perfectly. Future work should examine GP maps for both the MJ model and larger lattices, in order to investigate how the topology changes with larger alphabets and lattices. The MJ model would also allow investigation of how the connections and neutral networks within a GP map are affected by changes to the energy threshold.

The attempt to discover underlying properties of phenotypes that may correlate with designability on the 6 × 6 lattice did not produce conclusive results. However, further investigation into the relationship between complexity and designability would be worthwhile. In particular, the computation of the defining chain length can be perfected, as the current complexity measure is incomplete.

This project has obtained a very significant acceleration using GPU methods which is highly translatable both to other attempts at GP mapping and to NP-complete problems. I have been able to analyze the topology of a new GP map and to enumerate the previously unobtainable 6 × 6 lattice.

Acknowledgments

I would like to thank Professor Ard Louis, Chico Camargo, Dr Flavio Romano and Dr Sam Greenbury for their help in completing this project.

References

[1] Hao Li, Robert Helling, Chao Tang, and Ned Wingreen. Emergence of preferred structures in a simple model of protein folding. Science, 273(5275):666–669, 1996.
[2] Hugo De Vries.
Das Spaltungsgesetz der Bastarde. Borntraeger, 1900.
[3] Gregor Mendel. Versuche über Pflanzenhybriden. Verhandlungen des naturforschenden Vereines in Brünn, 4:3–44, 1866.
[4] Thomas Hunt Morgan. The Physical Basis of Heredity. JB Lippincott, 1919.
[5] Ronald A Fisher. XV. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(02):399–433, 1919.
[6] BW Geer and MM Green. Genotype, phenotype and mating behavior of Drosophila melanogaster. American Naturalist, pages 175–181, 1962.
[7] Peter J Schwartz, Silvia G Priori, Carla Spazzolini, Arthur J Moss, G Michael Vincent, Carlo Napolitano, Isabelle Denjoy, Pascale Guicheney, Günter Breithardt, Mark T Keating, et al. Genotype-phenotype correlation in the long-QT syndrome: gene-specific triggers for life-threatening arrhythmias. Circulation, 103(1):89–95, 2001.
[8] Wilhelm Johannsen. The genotype conception of heredity. International Journal of Epidemiology, 43(4):989–1000, 2014.
[9] Stefano Ciliberti, Olivier C Martin, and Andreas Wagner. Innovation and robustness in complex regulatory gene networks. Proceedings of the National Academy of Sciences, 104(34):13591–13596, 2007.
[10] P Alberch. From genes to phenotype: dynamical systems and evolvability. Genetica, 84(1):5–11, 1991.
[11] L Duret. Neutral theory: the null hypothesis of molecular evolution. Nature Education, 1(1):218, 2008.
[12] Joanna Masel and Meredith V. Trotter. Robustness and evolvability. Trends in Genetics, 26(9):406–414, 2010.
[13] David J Lipman and W John Wilbur. Modelling neutral and selective evolution of protein folding. Proceedings of the Royal Society of London. Series B: Biological Sciences, 245(1312):7–11, 1991.
[14] Eduardo Izquierdo and Chrisantha Fernando. The evolution of evolvability in gene transcription networks. In ALIFE, pages 265–273. Citeseer, 2008.
[15] Günter P Wagner and Lee Altenberg. Perspective: complex adaptations and the evolution of evolvability.
Evolution, pages 967–976, 1996.
[16] Mark Shackleton, R Shipman, and Marc Ebner. An investigation of redundant genotype-phenotype mappings and their role in evolutionary search. In Proceedings of the 2000 Congress on Evolutionary Computation, volume 1, pages 493–500. IEEE, 2000.
[17] Massimo Pigliucci. Genotype-phenotype mapping and the end of the 'genes as blueprint' metaphor. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1540):557–566, 2010.
[18] JM Smith. Natural selection and the concept of a protein space. Nature, 225(5232):563, 1970.
[19] Erik van Nimwegen, James P Crutchfield, and Martijn Huynen. Neutral evolution of mutational robustness. Proceedings of the National Academy of Sciences, 96(17):9716–9720, 1999.
[20] Mark D Rendel. Adaptive evolutionary walks require neutral intermediates in RNA fitness landscapes. Theoretical Population Biology, 79(1):12–18, 2011.
[21] Steffen Schaper and Ard A Louis. The arrival of the frequent: How bias in genotype-phenotype maps can steer populations to local optima. PLoS ONE, 9(2):e86635, 2014.
[22] Evandro Ferrada and Andreas Wagner. A comparison of genotype-phenotype maps for RNA and proteins. Biophysical Journal, 102(8):1916–1925, 2012.
[23] Rachel Kolodny, Leonid Pereyaslavets, Abraham O Samson, and Michael Levitt. On the universe of protein folds. Annual Review of Biophysics, 42:559–582, 2013.
[24] Yigal D Nochomovitz and Hao Li. Highly designable phenotypes and mutational buffers emerge from a systematic mapping between network topology and dynamic output. Proceedings of the National Academy of Sciences of the United States of America, 103(11):4180–4185, 2006.
[25] Tairan Wang, Jonathan Miller, Ned S. Wingreen, Chao Tang, and Ken A. Dill. Symmetry and designability for lattice protein models. The Journal of Chemical Physics, 113(18):8329–8336, 2000.
[26] Ming Li and Paul MB Vitányi. An Introduction to Kolmogorov Complexity and Its Applications.
Springer Science & Business Media, 2009.
[27] D Bonchev, D Kamenski, and V Kamenska. Symmetry and information content of chemical structures. Bulletin of Mathematical Biology, 38(2):119–133, 1976.
[28] Christian B Anfinsen et al. Principles that govern the folding of protein chains. Science, 181(4096):223–230, 1973.
[29] Morten Källberg, Haipeng Wang, Sheng Wang, Jian Peng, Zhiyong Wang, Hui Lu, and Jinbo Xu. Template-based protein structure modeling using the RaptorX web server. Nature Protocols, 7(8):1511–1522, 2012.
[30] S Ołdziej, C Czaplewski, A Liwo, M Chinchio, M Nanias, JA Vila, M Khalili, YA Arnautova, A Jagielska, M Makowski, et al. Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7547–7552, 2005.
[31] Sitao Wu, Jeffrey Skolnick, and Yang Zhang. Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biology, 5(1):17, 2007.
[32] Ken A Dill. Theory for the folding and stability of globular proteins. Biochemistry, 24(6):1501–1509, 1985.
[33] Sanzo Miyazawa and Robert L Jernigan. Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules, 18(3):534–552, 1985.
[34] Bonnie Berger and Tom Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. Journal of Computational Biology, 5(1):27–40, 1998.
[35] Martin Mann and Rolf Backofen. Exact methods for lattice protein models. Bio-Algorithms and Med-Systems, 10(4):213–225, 2014.
[36] Elwyn R Berlekamp, Robert J McEliece, and Henk CA van Tilborg. On the inherent intractability of certain coding problems. IEEE Transactions on Information Theory, 24(3):384–386, 1978.
[37] Richard A Goldstein. The structure of protein evolution and the evolution of protein structure. Current Opinion in Structural Biology, 18(2):170–177, 2008.
[38] David Luebke et al. General-purpose computation on graphics hardware. In Workshop, SIGGRAPH, 2004.
[39] John D Owens, Mike Houston, David Luebke, Simon Green, John E Stone, and James C Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.
[40] Martin Weigel. Simulating spin models on GPU. Computer Physics Communications, 182(9):1833–1836, 2011.
[41] Takashi Shimokawabe, Takayuki Aoki, Chiashi Muroi, Junichi Ishida, Kohei Kawano, Toshio Endo, Akira Nukada, Naoya Maruyama, and Satoshi Matsuoka. An 80-fold speedup, 15.0 TFlops full GPU acceleration of non-hydrostatic weather model ASUCA production code. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–11. IEEE, 2010.
[42] Adam L Beberg, Daniel L Ensign, Guha Jayachandran, Siraj Khaliq, and Vijay S Pande. Folding@home: Lessons from eight years of volunteer distributed computing. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–8. IEEE, 2009.
[43] Thomas E Creighton. Protein folding. Biochemical Journal, 270(1):1–2, 1990.
[44] André R Brodtkorb, Trond R Hagen, and Martin L Sætra. Graphics processing unit (GPU) programming strategies and trends in GPU computing. Journal of Parallel and Distributed Computing, 73(1):4–13, 2013.
[45] Ian Buck. GPU computing: Programming a massively parallel processor. Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO), page 17, 2007.
[46] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.
[47] J.M. Muller, N. Brisebarre, F. de Dinechin, C.P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres. Handbook of Floating-Point Arithmetic, pages 206–207. Birkhäuser Boston, 2009.
[48] A. Arora and S. Bansal. Unix and C Programming, page 325.
Laxmi Publications Pvt Limited, 2005.
[49] Robert Helling, Hao Li, Régis Mélin, Jonathan Miller, Ned Wingreen, Chen Zeng, and Chao Tang. The designability of protein structures. Journal of Molecular Graphics and Modelling, 19(1):157–167, 2001.
[50] Andrei N Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.
[51] Osamu Watanabe. Kolmogorov Complexity and Computational Complexity. Springer, 1992.
[52] CA Floudas, HK Fung, SR McAllister, M Mönnigmann, and R Rajgaria. Advances in protein structure prediction and de novo protein design: A review. Chemical Engineering Science, 61(3):966–988, 2006.

Appendix A. Results

A.1 Miyazawa-Jernigan contact energies

The contact energies for the MJ1985 model are for the standard alphabet of 20 amino acids. The values used are from the work of Miyazawa and Jernigan[33] and are given in the matrix in Figure 15.

Figure 15: The values on and above the diagonal of this matrix give the amino acid interaction energies[33].

A.2 Checking the results of the GPU version of the model

Figure 16 is taken from Li et al. and, along with the fold percentages and top fold data in Table 4, was used to verify the results from the GPU program. The histogram for the full enumeration of the HP1996 3x3x3 lattice model is shown in Figure 16A. The data from this project (Figure 17) are in agreement with Li et al. Note that only half the sequence-structure relationships have actually been computed in the process of forming these histograms, because the GP map is completely symmetric.

Figure 18 shows the histogram of designabilities for a random sampling of the HP1996 model on the 6x6 lattice. The top fold given in 1996 by Li et al.[1] is not identical to the fold we found. However, Li et al.
used a comparatively small sample: their most designable fold had only 200 sequences folding to it (Figure 16B), in contrast with the most designable fold in this project, which had over 3000 sequences folding to it (Figure 18). Their small sample size could mean the top ranking of their most designable fold is an artifact of the small sample. Furthermore, there is a highly unusual detail in Li et al.'s data (Figure 16B, inset). The furthest right point is the one with the highest designability, and their graph suggests that there are in fact two folds that give this value; there is no mention of the second fold in their results, and this strange distribution is probably also a result of their small sample size.

Figure 16: Figure used for verification of the code, from Li et al.[1].

Table 4: Values from the GPU code compared with those of Li et al.[1]

    HP1996 3x3x3        Li et al.   Project
    Fold percentage     4.75%       4.76%
    N_s top fold        3794        3794

Figure 17: Histogram of designability for the HP1996 model on the 3x3x3 lattice.

Figure 18: Histogram of designability for the HP1996 model on the 6x6 lattice.

Appendix B. Designability and Complexity

The following folds on the 6x6 lattice were found to be the most (Figure 19) and least (Figure 20) designable for the MJ1985 model.

Figure 19: 50 most designable folds on the 6x6 lattice, labelled with rank.

Figure 20: 50 least designable folds on the 6x6 lattice, labelled with rank.

Appendix C. Code

C.1 GPU Code

GPU programs involve a compute kernel and the launching of that kernel from within the CPU code. The following code is the version for the HP1996 model on the 6x6 lattice. Lines 27-29 define the number of blocks and threads per block. These parameters are used to launch the kernel and were set to the optimal values found in the results. Lines 46-100 detail the kernel, which computes and outputs the lowest energies.
The most time consuming part for all random sampling versions is in lines 62-64, where the random sequence values are read in by each thread. Line 215 is where the kernel is launched from the CPU.

    /* CUDA ENERGY CALCULATION - RANDOM SAMPLING - SJOWEN - 20/03/15

       In this code, for each sequence (counted in binary), the lowest
       energy fold is found, and if unique, the id of that fold is stored.
       The calculations are done reversing the sign of the cross energy and
       the max energy / unique fold is found. HP1996 model for 6x6. */

    // number of 6x6 folds
    // length of sequence (from 3x3x3)
    //
    // threadsperblock
    // min kT*10 between ground and 1st exc.

    inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
    {
        if (code != cudaSuccess) {
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if (abort) exit(code);
        }
    }

    // cuda kernel
    // calculates foldid of lowest unique energy fold for a sequence that equals threadID
    __global__ void energycalc(int V[N_FOLD][2*NBP], int SeqRnd[numseq][length],
                               int *ID_fold, int *E_dif)
    {
        int sequence[length];
        int blockId = blockIdx.x + blockIdx.y * gridDim.x;
        int threadId = blockId * blockDim.x + threadIdx.x; // global id of thread
        int min_E_sequence = 0;
        int min_E_sequence2 = 0;
        int E[4];
        int min_E_fid = -1;
        int energy_fold = 0;
        ID_fold[threadId] = -1; // The ID of the min E fold.

        // fill sequence with random amino acids
        for (int jj = 0; jj < length; jj++) /* ... */

        /* ... */
    }

    /* ... */

    T* V_array = new T[N_FOLD];
    S* SeqRnd = new S[numseq];
    S* dev_SeqRnd = new S[numseq];
    float pct;
    srand(time(NULL));

    // Fill V_array with -1s
    for (int n = 0; n < N_FOLD; n++) /* ... */

    /* ... */

            GSfolds << E_dif[zz] << std::endl;
            ijk++;
        }
        double f = double(numseq);
        pct += ijk / f;
        std::cout << "Percentage: " << pct / runs << std::endl;
        counter.close();
        GSfolds.close();
    }

    // Free memory on device
    cudaFree(dev_V);
    cudaFree(dev_SeqRnd);
    cudaFree(dev_Edif);
    cudaFree(dev_foldID);
    cudaGetLastError();
    return 0;
    }
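The kernel derives a unique global thread id from CUDA's built-in block and thread indices, and this index arithmetic can be checked on the CPU. The grid and block dimensions below are small illustrative values, not the tuned launch parameters reported in the results.

```cpp
#include <cassert>

// CPU replica of the id computation in energycalc:
//   blockId  = blockIdx.x + blockIdx.y * gridDim.x
//   threadId = blockId * blockDim.x + threadIdx.x
// Each (block, thread) pair maps to a distinct sequence index.
int globalThreadId(int blockIdxX, int blockIdxY, int threadIdxX,
                   int gridDimX, int blockDimX) {
    int blockId = blockIdxX + blockIdxY * gridDimX;
    return blockId * blockDimX + threadIdxX;
}
```

For a 2x3 grid of blocks with 4 threads each, the ids run from 0 (block (0,0), thread 0) to 23 (block (1,2), thread 3) without collisions, so every thread processes a different sequence.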