PHAST: Protein-like heteropolymer analysis by statistical thermodynamics
aa r X i v : . [ phy s i c s . c o m p - ph ] F e b PHAST: Protein-like Heteropolymer Analysis by StatisticalThermodynamics
Rafael B. Frigori
Universidade Tecnol´ogica Federal do Paran´a, Rua Cristo Rei 19, CEP 85902-490, Toledo (PR), Brazil
Abstract
PHAST is a software package written in standard Fortran, with MPI and CUDA exten-sions, able to efficiently perform parallel multicanonical Monte Carlo simulations of singleor multiple heteropolymeric chains, as coarse-grained models for proteins. The outcomedata can be straightforwardly analyzed within its microcanonical Statistical Thermody-namics module, which allows for computing the entropy, caloric curve, specific heat andfree energies. As a case study, we investigate the aggregation of heteropolymers bioinspiredon Aβ − fragments and their cross-seeding with IAP P − isoforms. Excellent parallelscaling is observed, even under numerically difficult first-order like phase transitions, whichare properly described by the built-in fully reconfigurable force fields. Still, the package isfree and open source, this shall motivate users to readily adapt it to specific purposes. PROGRAM SUMMARY
Authors:
Rafael Bertolini Frigori
Program Title:
PHAST
Journal Reference:Catalogue identifier:Licensing provisions:
GNU/GPL version 3
Programming language:
FORTRAN 90, MPICH 3.0.4, CUDA 8.0
Computer: PC Operating system:
GNU/Linux (3.13.0-46), should also work on any Unix-based operational system.
Supplementary material:Keywords:
Folding: thermodynamics, statistical mechanics, models, and pathways; Biopolymers, biopoly-merization; Monte Carlo methods; Thermodynamics and statistical mechanics
PACS:
Classification:
Email address: [email protected] (Rafael B. Frigori)
Preprint submitted to Elsevier February 8, 2017 xternal routines/libraries: cuRAND (CUDA), Grace-5.1.22 or higher
Subprograms used:Nature of the problem:
Nowadays powerful multicore processors (CPUs) and Graphical ProcessingUnits (GPUs) became much popular and cost-effective, so enabling thermostatistical studies ofcomplex molecular systems to be performed even in personal computers. PHAST provides not onlyan easily-reconfigurable parallel program for Monte Carlo simulations of linear heteropolymers, ascoarse-grained models of proteins, but also permits the automatized microcanonical thermodynamicanalysis of those systems.
Solution method:
PHAST has three main modules for complementary tasks as: to map any .pdb file-sequence to its inner lexicon by using a configurable hydrophobic scale, while the main simulationalmodule performs parallel Monte Carlo simulations in the multicanonical (MUCA) ensemble and,the analysis module which extracts microcanonical observables from MUCA weights.
1. Introduction
Linear heteropolymeric chains of amino acids are known as polypeptides. Proteins,on the other hand, are a large class of biological polymers containing at least one longpolypeptide. Together, they constitute a set of macromolecules performing a vast arrayof functions within organisms, whose specificities strongly depend on their detailed in-tramolecular interactions that leads to three-dimensional (native) structures by a foldingmechanism. The thermodynamic hypothesis [1] states that the native structure of a proteinis a unique, stable and kinetically accessible minimum of the free energy solely determinedby its (primary) sequence of amino acids. However, eventual misfolding may produce dys-functional proteins, so inducing the formation of cytotoxic amyloid aggregates, which canculminate on degenerative diseases as type 2 Diabetes Mellitus (DM2) [2] or Alzheimer(AD) [3].Once those biopolymers are typical representatives of the so-called small-systems, wherein general the equivalence of statistical ensembles does not hold (see [4] and referencestherein), the direct computation of their density of states is a well suited investigationmethod to result on the system microcanonical thermodynamics [5]. Nevertheless, perform-ing such calculations at a high enough accuracy implies on accumulating large amounts ofstatistical data through Monte Carlo methods, as Wang-Landau [6] or the multicanonical(MUCA) ensemble [7]. Thus, it consists of a huge computational challenge, which can bereasonably alleviated just by conjugating efficient parallel algorithms and coarse-grainedmolecular force fields.To accomplish these demands we designed PHAST, a simple and easy-to-use opensource package that enables users to simulate and analyze the microcanonical thermo-statistics of general models for semi-flexible linear polymers [8, 9], as coarse-grained pro-teins [10, 11, 12, 13]. The program brings fully reconfigurable built-in force fields that canbe readily modified and extended, this makes it specially apt for prototyping of many-bodyinteracting systems, as polymers embedded in crowded solutions [14]. It is written in stan-dard Fortran and possess MPI and CUDA extensions [15, 16], so being able to efficientlyrun on most modern parallel computer architectures. The distribution is forbidden for2ommercial or military purposes, but we hope that PHAST is useful and could be eventu-ally incorporated in other programs, while we kindly request to be informed accordingly.Still, it is naturally supposed to be acknowledged in all resulting publications.The article is organized as follows, in the Section 2 we review some simulation back-ground, this includes coarse-grained modelling and the derivation of microcanonical ther-mostatistics from parallel multicanonical simulations. The Section 3 encompasses an in-dept overview of application modules. In the Section 4 two concrete case studies involvingheteropolymers bioinspired on amyloidogenic proteins are presented, so the aggregation of Aβ − fragments and their further cross-seeding with IAP P − isoforms are investigatedin the context of a minimal coarse-grained model. The parallel scaling of such simulationsis rigorously addressed, while their physical significance is outlined, when possible, bycomparisons to other results from the literature. The Section 5 brings our conclusions andperspectives for further developments.
2. Simulation background
Coarse-graining is a widely employed modelling strategy to reduce the degrees of free-dom of many-body macromolecular systems as heteropolymers [9, 17]. Moreover, the studyof protein folding and aggregation, specially in crowded media [18], has also largely bene-fited from such approach [17]. In this vein, PHAST incorporates a configurable force fieldable to describe semi-flexible linear heteropolymers [8, 19, 20, 21] and general propertiesof protein-like structural phase transitions [10, 11, 12, 13]. In the present model, a pri-mary sequence is translated to a simpler one according to a predefined rule (i.e. sortingeach residue by its hydrophobic character), so the amino acids on a proteic backbone aremapped to a coarser chain of pseudo-atoms (beads).Once original molecular interactions are mapped into simpler ones, non-bonded inter-actions among beads are represented through effective hydrophobic forces [22], as givenby a Lennard-Jones potential ( LJ ). Thus, it is natural to adopt a specific hydrophobicscale [23] to perform the aforesaid sequence translation, see Figure 1. By considering thathydrophobic residues ( A ) form more strongly-interactive cores than hydrophilic ones ( B ) , the coupling constant for pairs of nonadjacent beads ( σ i , σ j ) may be defined by C LJ ( σ i , σ j ) = +1 σ i = σ j = A + / σ i = σ j = B − / σ i = σ j (1)Notwithstanding the fact that the protein backbones are rather stiff, they may be wellmodelled as bead-spring heteropolymers. Therefore, the bounds between successive beadslocated at a distance r l,l +1 are described by the usual finitely extensible nonlinear elastic While it is a considerable limitation, once in chemically relevant situations side chains may becomecrucial, this implementation is computationally quite effective. igure 1: A chosen hydrophobicity scale is employed to map the FASTA sequence (A–Z) of a protein (onthe left) extracted from the Protein Data Bank to another sequence, expressed by a two-letter code (A–B),which is ultimately used to build the coarse-grained one-bead protein model (on the right). potential (FENE) [24], while a bending potential term between three successive beads isconsidered as proportional to (1 − cos θ k ) . The resulting energy for a single chain composedby N beads may be written as H single = κ F R P N − l =1 C F ( σ l , σ l +1 ) ln (cid:0) − [( r l,l +1 − r ) /R ] (cid:1) + κ C P N − k =1 (1 − cos θ k ) + κ LJ INTRA P N − i =1 P Nj =1+1 (cid:2) r − ij − C LJ ( σ i , σ j ) r − ij (cid:3) , (2)where the first term (FENE potential) has κ F , r and R as general coupling constantsfor homopolymers, while C F ( σ l , σ l +1 ) allows for introducing different spring stiffness forheteropolymers. Moreover, κ C sets the chain rigidity against bends by an angle θ k , while κ LJ INTRA defines the coupling constant of the non-bonded intramolecular LJ potentialbetween pairs of beads ( i, j ) at a distance r ij .Finally, M different chains are made to constitute a many-body interacting system bythe following multi-chain potential H multi = M X k =1 H single,k + X l>k N X i,j =1 κ LJ INTER h r − l i k j − C LJ (cid:0) σ l i , σ k j (cid:1) r − l i k j i! , (3)where H single,k is given by Eq.(2), and the second term describes the non-bonded inter-molecular LJ potential between pairs of beads ( l i , k j ) at a distance r l i k j , whose couplingconstant is κ LJ INTER . .2. Parallel multicanonical simulations It has been shown that microcanonical analysis of Monte Carlo simulations providea powerful tool not only to precisely characterize structural phase transitions on small-systems [5, 8], where ensemble equivalence does not hold (see [4], and references therein),but also allows for neatly distinguish between good and bad folders among protein-likeheteropolymers [25]. Most of straightforward implementations of such methodology relyon the accurate determination of systems density of states Ω ( E ) , which may be achievedby the Wang-Landau method [6], or with generalized ensembles as the multicanonical [7].Once knowing Ω ( E ) , the microcanonical thermostatistics can be established by theusual Boltzmann formula for the entropy S ( E ) , as follows S ( E ) = k B ln Ω ( E ) . (4)This leads to the microcanonical connection to thermodynamical observables [5]. Forinstance, numerical derivatives of the entropy S ( E ) can be taken to compute quantitiesof interest, as the microcanonical caloric curve, relating the temperature T to the internalenergy E, given by k B β ( E ) ≡ T ( E ) − = ∂S∂E . (5)Also, the microcanonical specific heat C V ( E ) is expressed as C V ( E ) = dEdT = − (cid:18) ∂S∂E (cid:19) (cid:18) ∂ S∂E (cid:19) − , (6)while the free energy F ( E ) , at a fixed temperature T c , is computed by F ( E ) = E − (cid:18) ∂S∂E (cid:19) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E = E ( T c ) S ( E ) . (7)As previously stated, the multicanonical ensemble offers a quite practical method toestimate Ω ( E ) through Monte Carlo simulations, and so, to obtain the microcanonicalthermostatistics of any physical system. Basically, on this approach the entropy is writtenas a piecewise function of the discretized energies E k and a set of MUCA coefficients { β k , α k } , as follows S muca ( E k ) = β k E k − α k . (8)The Monte Carlo simulations in the multicanonical ensemble naturally employ the so-called generalized MUCA weights ω muca ( E k ) ∝ e − β k E k + α k , instead of the usual Boltzmannweights ω B ∝ e − βE , which enhances energetic tunnelings by improving the sampling ofrare configurations.However, those MUCA coefficients { β k , α k } are a priori unknowns, so requiring analgorithm for their determination before a production run can be pursued. A success-ful technique to estimate MUCA weights is iterative [7] and involves computing his-tograms H muca ( E ) of the energies E k ∈ [ E , . . . , E max ] . Thus, in the beginning, one sets5 igure 2: The iteration-cycle scheme for parallel multicanonical simulation: N processors (nodes) areinitialized with different random seeds and so, evolve independent Markov Chains whose individual His-tograms are accumulated on H ( E ) at each MUCA iteration to update the weight W ( E ) . The resultingcoefficients ( α, β ) on MUCA weights are then broadcasted. ω muca ( E k ) = 1 for all energies E k , which is used on an usual METROPOLIS simulationto collect the initial ( n = 0) histogram H muca ( E ) . Thereafter, it is applied a recursiveupdate scheme for the MUCA weights ω n +1 muca ( E ) = ω nmuca ( E ) /H nmuca ( E ) , before startinga new simulation with the lately estimated weights. This process is repeated till conver-gence is detected by a flatness check on the histograms. In the end, at a large- n limit, themulticanonical entropy is ensured to recover the microcanonical one as S muca ( E ) → S ( E ).The multicanonical implementation on PHAST employs accumulated error-weightedhistograms, so statistics for weight estimations improves for every repetition, while con-vergence is attained by Berg’s weight recursion[7]. Additionally, its parallel estimator forMUCA weights — initially proposed by Zierenberg et. al. [26] to study spin-systems— present scaling properties computationally exceeding more traditional approaches [27].This is due to a simple and quite efficient update scheme, illustrated on Figure 2. First,each processor (node) is initialized with the same MUCA coefficients, but a particularrandom seed, then they perform independent Markov Chain Monte Carlo (MCMC) simu-lations to accumulate local energy histograms. After one iteration cycle those histogramsare combined on the master node, which updates and then broadcasts the new MUCAweights, the cycle is repeated till convergence is achieved.
3. Software framework
The PHAST package is constituted by the following modules • SET INPUT: prepares input protein sequences from PDB files • PHAST: the main engine to run parallel multicanonical simulations • ANALYST: computes the microcanonical thermostatistics from MUCA simulationsIt is not possible to give here a highly detailed description of all program routines. In-stead, we address the main features and data workflows of aforementioned modules in thefollowing subsections, which afterwards is illustrated by some concrete study cases.6 .1. SET INPUT
A single text file named
ProteinSequences.txt is to be provided as input to thesimulation module. But while users may manually encode each of its line entries, sodescribing protein (or polymeric) sequences according to an AB-like lexicon (e.g. Eq. 1),this can be alternatively generated in a quicker manner. To this end, the SET INPUTmodule was designed to facilitate the mapping of protein structures, downloaded from theProtein Data Bank (PDB) [28], by using a built-in hidrophobicity scale [23] configured onfile hscale.h .The aforesaid mapping is attained by invoking the setup module ( ./set input.o )and then typing the path to a ( .pdb or .seq ) input file, whereas the software shallrecognize among valid PDB ( .pdb ) or three-letter code ( .seq ) sequences. Still, by suc-cessively calling this module with different file sequences, more lines are appended to ProteinSequences.txt , so making it quite easy to prepare input files even for simula-tions of multiple-proteic systems. This is the main module of the package, here multicanonical algorithms are implementedto simulate the coarse-grained models of Section 2. In principle, PHAST just needs to reada
ProteinSequences.txt text file before starting any run. However, the software can befully reconfigured (before compilation) through editing its .h files, whose content is • vars.h : the main configuration file that sets multicanonical parameters. This in-cludes, but is not limited to, the quantity and size of energy binnings, the number ofMCMC sweeps and MUCA iterations, the definition of common blocks, the maximalsize and number of proteins as well as of the residue families allowed in a simulation. • set hamiltonian.h : enables to tune the coupling constants specifying a particularHamiltonian from Eq.(2) and Eq.(3). • box size.h : defines the size of the steric spherical container for proteins and somegeometric cutoffs of polymer backbones. • set randvec size.h : random numbers may be optionally generated as large vectorsin GPUs, the size can be optimally adjusted to fit memory caches.Still, the main module is composed by the following briefly summarized routines • MAIN.
F: is the principal routine of the simulation module, contains all MPI commu-nication structures and callings of I/O and MUCA engine functions. • HAMILTONIAN. f: given a polymeric configuration this routine computes the Hamilto-nian defined by Eq.(2) and Eq.(3) with the constants adjusted at set hamiltonian.h. • MONTE CARLO. f: this is a wrap routine that employs the Metropolis algorithm, withgeneralized MUCA weights, to perform MCMC stochastic evolutions. Additionally,it updates the local energy histograms at every simulation step.7
GEOMETRY. f: the simulation code uses inner coordinates, as bond-lengths and anglesbetween beads, to evolve individual chains. This routine allows to convert them to(external) cartesian coordinates for computing inter-chain distances. • UPDATE ENGINE. f: this function stochastically updates all bead-positions of the chainscontained in the 3d-spherical box defined at box size.h . It employs a mix of op-timized updating algorithms, to know, spherical-caps : that changes internal anglesof each monomer along the backbone, pivotings : rotates a chain over a pivot direc-tion, bond-length changes : by dilations or shrinks around the equilibrium value r (if κ F = 0), and Galilean-group updates : i.e. rotations plus translations of chainsaround the box center of mass. More details on references in [13] . • SET HF-MATRIX. f: this routine permits to manually set the interaction coefficients C LJ ( σ i , σ j ) and C F ( σ i , σ j ) of the Hamiltonian in Eq.(2). It shall be modified ac-cordingly to hscale.h, which allows for consistent generalizations of the AB-modellexicon such as [29]. • SET MUCA. f: a bundle of auxiliary functions needed to set MUCA simulations. • SET PROT-SEQ. f: reads in proteins from the AB-sequence file
ProteinSequences.txt . • SET-SAVE SIM. f: reads or writes to disk the simulation files, as protein configurations aAcidPosRxyz *.pdb with extremal energies, produced during MCMC evolutions. • SIM IO-FILES. f: reads or writes to disk the inner control files of the simulation. • RANDOM.f : the wrap function for random number generation on CPUs or GPUs. • cudaRandom.cu : enables random number generation on GPUs.It also shall be noted that as long as the simulations progress a serie of output files is au-tomatically saved to disk. While files as Protein3DPositions.* , aAcid3DPositions.* and last.con are just intended for inner control of simulations, being specially neededwhen restarting simulations, there are also many other output files with *.dat or *.pdb extensions. As an example beta n.dat, alpha n.dat, g n.dat, and hist n.dat re-spectively denote files containing data from the n -th MUCA repetition, as the coefficients { β k , α k } n , the density of states g n ( E k ) and histograms H n ( E k ) — which are read by theANALYST module afterwards — as well as the aforementioned aAcidPosRxyz *.pdb files. This module is the kernel of the package for microcanonical thermostatistical anal-ysis. A set of files, originally saved by the simulation module — i.e. beta *.dat,alpha *.dat, and last.con — is read and semi-automatically processed. As an out-come neat xmGrace [30] files are produced according to template files, named as template *.agr .Some internal parameters shall be set before compilation, depending on the intended out-put features, to know 8 icontrol : set intensive units of energy, as “energy per volume”, on the outputs. • iAGR: defines if the thermodynamical observables of Eqs.(4)–(7) are written to indi-vidual .agr files, or combined into a single graph otherwise. • Tol Beta & Tol L: respectively set (i) the numerical tolerance among three succes-sive β values and (ii) the minimal latent heat to be searched for during a Maxwellconstruction [5] on β ( ε ) × ε .In particular, when the ANALYST module is called ( ./analyst.o ) the last.con file isread and supplies the software with parameters (set at vars.h ) used on that former MUCAsimulation. Then, the user is offered the option to type the configuration-ranges of theiteration files to be analyzed. Finally, the following additional information is requested to • “ Enter the number of monomers ”: necessary to output data as intensive quantities. • “ Enter the finite difference step-size, dE: [0.01,9.9] ”: the energy step dE is used tocompute the finite-difference derivatives taken on S ( E ) . • “ Enter the β value to compute Free Energy, or 0 for Maxwell construction ”: duringfirst-order like structural phase transitions, the S-bends appear on microcanonicalcaloric curves. So users may wish to provide a specific β c to compute the Eq.(7), orlet the program find this pseudo-critical value by a Maxwell construction.
4. Example runs
As case studies we have investigated two proteic systems by using the AB-model limit [10] of the Hamiltonian Eq.(3). Such modelling does not allows for prediction of proteinstructures, but may provide an useful method to learn about some general thermodynam-ical mechanisms underlying biological structural phase transitions [12]. For instance, thisidea was successfully employed to compute aggregation propensities of peptides [13]. a typical "set_hamiltonian.h" file!-------------------------------------------------------------! Hamiltonian Parameters!-------------------------------------------------------------! Set c_KF, c_KC, c_KLJ_intra or c_KLJ_inter to 0.0d0 to! Shutdown any of that interactions!-------------------------------------------------------------! FENE parameters: c_KF*R^2*ln(1-[(r-r0)/R]^2)!-------------------------------------------------------------r0 = 1.00d0 ! Distance of equilibrium between beads Note that by taking κ F = 0 the FENE interaction is disabled and so all bond-lengths stay fixed to r (e.g. r = 1), therefore the bead-stick limit of AB-model is recovered. = 1.2d0 ! Scale parameterc_KF = 0.0d0 ! FENE coupling constant!-------------------------------------------------------------! Curvature parameters: c_KC*(1-cos*(Phi))!-------------------------------------------------------------c_KC = 0.25d0 ! Curvature coupling constant!-------------------------------------------------------------! Non-bonded interaction parameters: c_KLJ*(1/R^12-sigma/R^6)!-------------------------------------------------------------c_KLJ_intra = 4.00d0 ! Intra-protein LJ coupling constantc_KLJ_inter = 4.00d0 ! Inter-protein LJ coupling constant Therefore, while keeping in mind the limitations of our coarse-grained approach, wechoose some natural amyloidogenic protein sequences as inspiration to generate aggregation-prone heteropolymeric chains through their hydrophobic mappings. So, we first simulatethe aggregation of highly amyloidogenic segments of the Amyloid β protein, associated tothe Alzheimer disease — i.e. the Aβ − segment, PDB: — a system also appliedfor assessing the parallel scaling (Speedup) of our code. Additionally, we have simulatedthe aggregation of an heterogeneous molecular system, composed by segments of Aβ − and Amylin isoforms (related to the type 2 Diabetes Mellitus, see [13]) — i.e. the hu-man hIAP P − PDB: and rat rIAP P − PDB: — to test for catalyticeffects of crowding [31] and cross-seeding [32] on the onset of aggregation that inducessuch degenerative diseases. Aβ − segments: parallel speedup To simulate the aggregation of Aβ − segments, we initially downloaded the 2LFM filefrom PDB [28] and SET INPUT converted it to an AB-sequence. This sequence was thenduly cut, cloned and used to prepare an input file to simulate the molecular aggregation(dimerization), as can be checked in the next data sample ProteinSequences.txtBBBBBAAABBBBBBAAAB
The box-radius ( R ) was set to R = 40 for these simulations and, overall statistics collectedby the multiple processors was fixed to 10 MCMC sweeps per MUCA iterations (1 . × for 128 processors), in a total of 10 repetitions.Before performing the data analysis of the resulting MUCA coefficients { β k , α k } n , itwas necessary to ensure they had reasonably converged after some minimal number ofMUCA iterations ( N conv. ) . To do so, we inspected the flatness of energy histograms H n ( E k )in the the full range of MUCA iterations, i.e. n ∈ [1 , . As an adequacy criteriawe imposed that the Coefficient of Variation (c
Var = σ H / µ H ) , defined as the ratio of thestandard deviation ( σ H ) over the average value of a histogram ( µ H ), should be smaller10
000 3000 4000 5000 6000 7000 k H n ( E k ) n: 100n: 150n: 175n: 400 Figure 3: Representative energy histograms H n ( E k ) employed for flatness checks, which also allows foridentifying the convergence of MUCA coefficients { β k , α k } n during simulations of Aβ − segments.For histograms in the iteration range n ∈ (175 , Var ≤ , see text. than 5%. Additionally, local peaks on any supposedly flat histogram were to be no largerthan 20% of its average value. Considering both such constraints, we found acceptableconvergence (i.e. enough histogram flatness) was established for data in the iterationrange n ∈ [ N conv. > , N S ≥ N conv. )and ending ( N E ≤ n ∈ [ N S , N E ] of MUCA coefficients { β k , α k } n were tested, as well as the size of data-blocks △ N (see Binning method, [33]) tobe grouped in between. We also required the numerical stability and inter-compatibility(within the statistical errors) when computing the pseudo-critical inverse temperatures β c , which was particularly important to improve the curve matchings during data collapses(Figure 4). Despite tedious, this procedure also enabled controlled checks of autocorrelationtime effects over the statistical error bars.The physical consistency of our parallel results was validated by the data collapse ofmultiple simulations, performed with increasing number of CPUs (up to 128 processors)at the IBM P750 machine [34], see a representative output on Figure 4 for parameters[ N S , N E ] = [200 , △ N = 50. The general microcanonical behavior of the systemwas finally checked by an automatized plotting of the thermostatistical observables savedto an .agr file, as can be seen from Figure 5 (where [ N S , N E ] = [700 , △ N = 25).It is observed a first-order like structural phase transition, characterized by a reentrant onthe microcanonical entropy S ( ε ), which induces an S-bend on the caloric curve β × ε andalso a region with negative values of the microcanonical specific heat C v ( ε ) . See [5] fordetails and a comprehensive review on this formalism. It is also worth to note that suchtransitions are signaled by the presence of energetic barriers on the free energy, a featurewe have previously studied on similar bioinspired molecular systems [12, 13].11 ε β ( ε ) Microcanonical caloric curve bin = 0.10 and dE = 1.50
Figure 4: The data collapse of representative curves, of the inverse temperature β ( ε ) × ε, illustrates theequivalence on thermodynamical results from simulations employing up to 128 processors. -0.4 -0.2 0 0.2 ε -40-2002040 c v ( ε ) -0.4 -0.2 0 0.20246810 β ( ε ) -0.4 -0.2 0 0.2 ε f = ε - T . s -0.4 -0.2 0 0.2-40-30-20-100 S ( ε ) Figure 5: The microcanonical thermodynamic observables as seen from ANALYST module for the aggre-gation of Aβ − fragments, to know: the entropy S ( ε ) × ε , caloric curve β ( ε ) × ε, the (shifted) freeenergy f (cid:0) = ε − β − c · S (cid:1) × ε, and specific heat c v ( ε ) × ε. igure 6: The parallel Speedup obtained for Multicanonical simulations present almost linear scaling upto hundred processors, even at numerically difficult first-order like phase transitions. To conclude, the wall times for simulations presenting equivalent thermodynamicalresults — which was ensured by performing a previous data collapse — were employedto analyze the parallel speedup of our code, seen on Figure 6. So we plotted the parallelSpeedup factor S p = t / t p , defined as the ratio of the serial t to the parallel t p times spentto accomplish the convergence of MUCA weights. A power-law model S p = a · P b wasproposed for regression, and so established an almost linear scaling b = 0 .
97 (2) withthe number of processors P . This corroborates the nice scaling properties of the parallelMUCA algorithm found on mild first-order like phase transitions [26, 35]. Aβ − and IAP P − isoforms Recent clinical studies have suggested that DM2 can be a catalyst factor to the emer-gence of some neurodegenerative diseases, as the AD [36]. A microscopic molecular expla-nation, from in vitro studies [37], points to the cross-seeding of Aβ and IAP P as a keyfactor for the appearance of heterogeneous amyloid fibrils with considerable cytotoxicity.This phenomena would be somehow mimicked by the aggregation of bioinspired protein-likeheteropolymers, investigated by the methodology previously applied to study the aggrega-tion of Aβ − . Therefore, an input file was prepared by editing the AB-sequences obtainedfrom SET INPUT, once it is feed with the 2LFM ( Aβ ) and 2KB8 (hIAPP) files downloadedfrom PDB website [28]. That is ProteinSequences.txtBBBABAAABB However, for a fixed total simulation statistics the error-bar sizes are enlarged by increasing thenumber of processors. This probably indicates an eventual speedup saturation, possibly connected to thecorrelation times, as already observed in [26, 35]. ε -200-1000100200 c v ( ε ) -0.8 -0.6 -0.4 -0.2 0 0.2 01234 β ( ε ) -0.8 -0.6 -0.4 -0.2 0 0.2 ε f = ε - T . s -0.8 -0.6 -0.4 -0.2 0 0.2-75-50-250 S ( ε ) hIAPP+ABrIAPP+AB ∆ f h ∆ f r ∆ε h ∆ε r Figure 7: The microcanonical thermodynamic observables as seen from ANALYST module for the aggre-gation of Aβ − fragments cross-seeded by Amylin isoforms hIAP P − or rIAP P − . To know:the entropy S ( ε ) × ε , caloric curve β ( ε ) × ε including the Maxwell construction, the free energies f (cid:0) = ε − β − c · S (cid:1) × ε, and the specific heat c v ( ε ) × ε. While the energetic barrier △ f h = 0 . △ f r = 0 . , its groundstate is found at much higher energies on the landscape. This explains the larger latent heatsfor the rodent system △ ε r = 0 .
584 (5) [to be compared to △ ε h = 0 .
513 (5)], and so its increased stabilityagainst aggregation. The energy axis ε = E/N is shown in intensive units.
BBBABAAABBBBBBBAAABBBBBBAAAB also, an equivalent system for IAPP molecules of rats (PDB: was built, as shown
ProteinSequences.txtBBBABAAAAABBBABAAAAABBBBBAAABBBBBBAAAB
These simulations were run on a considerably denser environment, once the box-radiuswas set to R = 20 . The 480 CPUs at [38] shared a workload of 10 MCMC sweeps perMUCA iteration, in a total of 10 MUCA repetitions. The obtained MUCA weights in theiteration range [ N S , N E ] = [700 , △ N = 25. Thermodynamical results are summarized for all microcanonical observ-ables of both systems on Figure 7, while details for representative structures are exhibitedon Figure 8. It is clearly observed that while the energetic barrier for the aggregation ofthe human molecular system (i.e. Aβ − and hIAP P − ) △ f h = 0 . △ f r = 0 . , the latter transition happens at muchdeeper values of the free energy. This fact justifies the larger latent heat found on the14 igure 8: The microcanonical caloric curve β ( ε ) × ε with representative molecular configurations for thecross-seeding of fragments of Aβ − and Amylin isoforms: hIAP P − or rIAP P − . rodent system △ ε r = 0 .
584 (5) [to be compared to that of humans △ ε h = 0 .
513 (5)], whichexplains thermodynamically its improved stability against aggregation.Interesting enough, we also detected the presence of multiple backbendings on β ( ε ) × ε (see Figure 8), which are related to nucleation hierarchies, as broadly studied on [39]. Ini-tially one observes the formation of one hetero-oligomer made of one segment of Aβ − and an Amylin isoform ( hIAP P − or rIAP P − ) , whose thermodynamical signatureemerges by the first bump on the caloric curves. Following, another hetero-oligomer issimilarly built and, then finally the aggregation process is completed by a global collapseof both hetero-oligomers which results on a globular structure. It deserves to be noted thatthose aggregates are clearly not amyloidic ones, as no cross-beta structure is formed, whichwould require specific interactions for their stabilization a feature absent in our Hamilto-nian. However, these protein-like aggregation transitions exhibit hierarchical nucleationsas would also be expected by cooperative cross-seeding mechanisms on real proteins.Therefore, some intriguing conclusions may be addressed from our microcanonical anal-ysis, mainly regarding the stability of Amylin isoforms when cross-seeded by Aβ segments.For example, a biotechnologically designed protein named Pramlintide [40] — originallyinspired on PDB:2KJ7 — has been recently used as an adjuvant to treat DM2 in humans,curiously such protein becomes identical to rIAP P when mapped by the hydrophobicscale [23] to the AB-model. Thus, our simulations indicate that Pramlintide could also beemployed for damping the aggregation of heterogeneous amyloidogenic systems, as onesmade by Aβ and IAP P isoforms. This result is corroborated not only by aggregationpropensities previously computed from the AB-model [13], but also by studies of similarsystems [31, 32] using all-atom force fields. 15 . Conclusions
We have presented PHAST, a package for simulating coarse-grained models of proteinsand bioinspired polymers in the multicanonical ensemble. Despite of its simple structure,the software comprises a reconfigurable force field which allows for an excellent parallelspeedup. Additionally, its modules enable users to quickly map proteins downloaded fromPDB to an inner AB-lexicon — or its generalizations as [29] — as well as to automaticallycompute most of systems microcanonical thermostatistics. Still, the program codes areopen and free, which shall motivate students and researchers to readily adapt them totheir specific purposes, as modelling many-body protein-protein interactions or the realis-tic effects of crowding on polymeric systems. We plan to gradually improve the built-inforce fields, besides moving their calculations to GPUs, and implement faster evolutionalgorithms as hybrid MCMC [41].
Acknowledgements
This is a long-term project whose author has benefited from enlightening discussionswith Leandro G. Rizzi, Lieverton H. Queiroz, Mathias S. Costa, Mikhael C. Chrum, andNelson A. Alves. The Brazilian laboratories CENAPAD-SP and LNCC-RJ are acknowl-edged by providing machine time and helpful technical support, respectively on their IBMP750 [34] and Santos Dumont [38] supercomputing facilities.