[PDF] PHAST: Protein-like heteropolymer analysis by statistical thermodynamics

Abstract

PHAST is a software package written in standard Fortran, with MPI and CUDA extensions, able to efficiently perform parallel multicanonical Monte Carlo simulations of single or multiple heteropolymeric chains, as coarse-grained models for proteins. The outcome data can be straightforwardly analyzed within its microcanonical Statistical Thermodynamics module, which allows for computing the entropy, caloric curve, specific heat and free energies. As a case study, we investigate the aggregation of heteropolymers bioinspired on A\beta_{25-33} fragments and their cross-seeding with IAPP_{20-29} isoforms. Excellent parallel scaling is observed, even under numerically difficult first-order like phase transitions, which are properly described by the built-in fully reconfigurable force fields. Still, the package is free and open source, this shall motivate users to readily adapt it to specific purposes.

Full PDF

aa r X i v : . [ phy s i c s . c o m p - ph ] F e b PHAST: Protein-like Heteropolymer Analysis by StatisticalThermodynamics

Rafael B. Frigori

Universidade Tecnol´ogica Federal do Paran´a, Rua Cristo Rei 19, CEP 85902-490, Toledo (PR), Brazil

Abstract

PHAST is a software package written in standard Fortran, with MPI and CUDA exten-sions, able to eﬃciently perform parallel multicanonical Monte Carlo simulations of singleor multiple heteropolymeric chains, as coarse-grained models for proteins. The outcomedata can be straightforwardly analyzed within its microcanonical Statistical Thermody-namics module, which allows for computing the entropy, caloric curve, speciﬁc heat andfree energies. As a case study, we investigate the aggregation of heteropolymers bioinspiredon Aβ − fragments and their cross-seeding with IAP P − isoforms. Excellent parallelscaling is observed, even under numerically diﬃcult ﬁrst-order like phase transitions, whichare properly described by the built-in fully reconﬁgurable force ﬁelds. Still, the package isfree and open source, this shall motivate users to readily adapt it to speciﬁc purposes. PROGRAM SUMMARY

Authors:

Rafael Bertolini Frigori

Program Title:

PHAST

Journal Reference:Catalogue identiﬁer:Licensing provisions:

GNU/GPL version 3

Programming language:

FORTRAN 90, MPICH 3.0.4, CUDA 8.0

Computer: PC Operating system:

GNU/Linux (3.13.0-46), should also work on any Unix-based operational system.

Supplementary material:Keywords:

Folding: thermodynamics, statistical mechanics, models, and pathways; Biopolymers, biopoly-merization; Monte Carlo methods; Thermodynamics and statistical mechanics

PACS:

Classiﬁcation:

Email address: [email protected] (Rafael B. Frigori)

Preprint submitted to Elsevier February 8, 2017 xternal routines/libraries: cuRAND (CUDA), Grace-5.1.22 or higher

Subprograms used:Nature of the problem:

Nowadays powerful multicore processors (CPUs) and Graphical ProcessingUnits (GPUs) became much popular and cost-eﬀective, so enabling thermostatistical studies ofcomplex molecular systems to be performed even in personal computers. PHAST provides not onlyan easily-reconﬁgurable parallel program for Monte Carlo simulations of linear heteropolymers, ascoarse-grained models of proteins, but also permits the automatized microcanonical thermodynamicanalysis of those systems.

Solution method:

PHAST has three main modules for complementary tasks as: to map any .pdb ﬁle-sequence to its inner lexicon by using a conﬁgurable hydrophobic scale, while the main simulationalmodule performs parallel Monte Carlo simulations in the multicanonical (MUCA) ensemble and,the analysis module which extracts microcanonical observables from MUCA weights.

1. Introduction

Linear heteropolymeric chains of amino acids are known as polypeptides. Proteins,on the other hand, are a large class of biological polymers containing at least one longpolypeptide. Together, they constitute a set of macromolecules performing a vast arrayof functions within organisms, whose speciﬁcities strongly depend on their detailed in-tramolecular interactions that leads to three-dimensional (native) structures by a foldingmechanism. The thermodynamic hypothesis [1] states that the native structure of a proteinis a unique, stable and kinetically accessible minimum of the free energy solely determinedby its (primary) sequence of amino acids. However, eventual misfolding may produce dys-functional proteins, so inducing the formation of cytotoxic amyloid aggregates, which canculminate on degenerative diseases as type 2 Diabetes Mellitus (DM2) [2] or Alzheimer(AD) [3].Once those biopolymers are typical representatives of the so-called small-systems, wherein general the equivalence of statistical ensembles does not hold (see [4] and referencestherein), the direct computation of their density of states is a well suited investigationmethod to result on the system microcanonical thermodynamics [5]. Nevertheless, perform-ing such calculations at a high enough accuracy implies on accumulating large amounts ofstatistical data through Monte Carlo methods, as Wang-Landau [6] or the multicanonical(MUCA) ensemble [7]. Thus, it consists of a huge computational challenge, which can bereasonably alleviated just by conjugating eﬃcient parallel algorithms and coarse-grainedmolecular force ﬁelds.To accomplish these demands we designed PHAST, a simple and easy-to-use opensource package that enables users to simulate and analyze the microcanonical thermo-statistics of general models for semi-ﬂexible linear polymers [8, 9], as coarse-grained pro-teins [10, 11, 12, 13]. The program brings fully reconﬁgurable built-in force ﬁelds that canbe readily modiﬁed and extended, this makes it specially apt for prototyping of many-bodyinteracting systems, as polymers embedded in crowded solutions [14]. It is written in stan-dard Fortran and possess MPI and CUDA extensions [15, 16], so being able to eﬃcientlyrun on most modern parallel computer architectures. The distribution is forbidden for2ommercial or military purposes, but we hope that PHAST is useful and could be eventu-ally incorporated in other programs, while we kindly request to be informed accordingly.Still, it is naturally supposed to be acknowledged in all resulting publications.The article is organized as follows, in the Section 2 we review some simulation back-ground, this includes coarse-grained modelling and the derivation of microcanonical ther-mostatistics from parallel multicanonical simulations. The Section 3 encompasses an in-dept overview of application modules. In the Section 4 two concrete case studies involvingheteropolymers bioinspired on amyloidogenic proteins are presented, so the aggregation of Aβ − fragments and their further cross-seeding with IAP P − isoforms are investigatedin the context of a minimal coarse-grained model. The parallel scaling of such simulationsis rigorously addressed, while their physical signiﬁcance is outlined, when possible, bycomparisons to other results from the literature. The Section 5 brings our conclusions andperspectives for further developments.

2. Simulation background

Coarse-graining is a widely employed modelling strategy to reduce the degrees of free-dom of many-body macromolecular systems as heteropolymers [9, 17]. Moreover, the studyof protein folding and aggregation, specially in crowded media [18], has also largely bene-ﬁted from such approach [17]. In this vein, PHAST incorporates a conﬁgurable force ﬁeldable to describe semi-ﬂexible linear heteropolymers [8, 19, 20, 21] and general propertiesof protein-like structural phase transitions [10, 11, 12, 13]. In the present model, a pri-mary sequence is translated to a simpler one according to a predeﬁned rule (i.e. sortingeach residue by its hydrophobic character), so the amino acids on a proteic backbone aremapped to a coarser chain of pseudo-atoms (beads).Once original molecular interactions are mapped into simpler ones, non-bonded inter-actions among beads are represented through eﬀective hydrophobic forces [22], as givenby a Lennard-Jones potential ( LJ ). Thus, it is natural to adopt a speciﬁc hydrophobicscale [23] to perform the aforesaid sequence translation, see Figure 1. By considering thathydrophobic residues ( A ) form more strongly-interactive cores than hydrophilic ones ( B ) , the coupling constant for pairs of nonadjacent beads ( σ i , σ j ) may be deﬁned by C LJ ( σ i , σ j ) =  +1 σ i = σ j = A + / σ i = σ j = B − / σ i = σ j (1)Notwithstanding the fact that the protein backbones are rather stiﬀ, they may be wellmodelled as bead-spring heteropolymers. Therefore, the bounds between successive beadslocated at a distance r l,l +1 are described by the usual ﬁnitely extensible nonlinear elastic While it is a considerable limitation, once in chemically relevant situations side chains may becomecrucial, this implementation is computationally quite eﬀective. igure 1: A chosen hydrophobicity scale is employed to map the FASTA sequence (A–Z) of a protein (onthe left) extracted from the Protein Data Bank to another sequence, expressed by a two-letter code (A–B),which is ultimately used to build the coarse-grained one-bead protein model (on the right). potential (FENE) [24], while a bending potential term between three successive beads isconsidered as proportional to (1 − cos θ k ) . The resulting energy for a single chain composedby N beads may be written as H single = κ F R P N − l =1 C F ( σ l , σ l +1 ) ln (cid:0) − [( r l,l +1 − r ) /R ] (cid:1) + κ C P N − k =1 (1 − cos θ k ) + κ LJ INTRA P N − i =1 P Nj =1+1 (cid:2) r − ij − C LJ ( σ i , σ j ) r − ij (cid:3) , (2)where the ﬁrst term (FENE potential) has κ F , r and R as general coupling constantsfor homopolymers, while C F ( σ l , σ l +1 ) allows for introducing diﬀerent spring stiﬀness forheteropolymers. Moreover, κ C sets the chain rigidity against bends by an angle θ k , while κ LJ INTRA deﬁnes the coupling constant of the non-bonded intramolecular LJ potentialbetween pairs of beads ( i, j ) at a distance r ij .Finally, M diﬀerent chains are made to constitute a many-body interacting system bythe following multi-chain potential H multi = M X k =1 H single,k + X l>k N X i,j =1 κ LJ INTER h r − l i k j − C LJ (cid:0) σ l i , σ k j (cid:1) r − l i k j i! , (3)where H single,k is given by Eq.(2), and the second term describes the non-bonded inter-molecular LJ potential between pairs of beads ( l i , k j ) at a distance r l i k j , whose couplingconstant is κ LJ INTER . .2. Parallel multicanonical simulations It has been shown that microcanonical analysis of Monte Carlo simulations providea powerful tool not only to precisely characterize structural phase transitions on small-systems [5, 8], where ensemble equivalence does not hold (see [4], and references therein),but also allows for neatly distinguish between good and bad folders among protein-likeheteropolymers [25]. Most of straightforward implementations of such methodology relyon the accurate determination of systems density of states Ω ( E ) , which may be achievedby the Wang-Landau method [6], or with generalized ensembles as the multicanonical [7].Once knowing Ω ( E ) , the microcanonical thermostatistics can be established by theusual Boltzmann formula for the entropy S ( E ) , as follows S ( E ) = k B ln Ω ( E ) . (4)This leads to the microcanonical connection to thermodynamical observables [5]. Forinstance, numerical derivatives of the entropy S ( E ) can be taken to compute quantitiesof interest, as the microcanonical caloric curve, relating the temperature T to the internalenergy E, given by k B β ( E ) ≡ T ( E ) − = ∂S∂E . (5)Also, the microcanonical speciﬁc heat C V ( E ) is expressed as C V ( E ) = dEdT = − (cid:18) ∂S∂E (cid:19) (cid:18) ∂ S∂E (cid:19) − , (6)while the free energy F ( E ) , at a ﬁxed temperature T c , is computed by F ( E ) = E − (cid:18) ∂S∂E (cid:19) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E = E ( T c ) S ( E ) . (7)As previously stated, the multicanonical ensemble oﬀers a quite practical method toestimate Ω ( E ) through Monte Carlo simulations, and so, to obtain the microcanonicalthermostatistics of any physical system. Basically, on this approach the entropy is writtenas a piecewise function of the discretized energies E k and a set of MUCA coeﬃcients { β k , α k } , as follows S muca ( E k ) = β k E k − α k . (8)The Monte Carlo simulations in the multicanonical ensemble naturally employ the so-called generalized MUCA weights ω muca ( E k ) ∝ e − β k E k + α k , instead of the usual Boltzmannweights ω B ∝ e − βE , which enhances energetic tunnelings by improving the sampling ofrare conﬁgurations.However, those MUCA coeﬃcients { β k , α k } are a priori unknowns, so requiring analgorithm for their determination before a production run can be pursued. A success-ful technique to estimate MUCA weights is iterative [7] and involves computing his-tograms H muca ( E ) of the energies E k ∈ [ E , . . . , E max ] . Thus, in the beginning, one sets5 igure 2: The iteration-cycle scheme for parallel multicanonical simulation: N processors (nodes) areinitialized with diﬀerent random seeds and so, evolve independent Markov Chains whose individual His-tograms are accumulated on H ( E ) at each MUCA iteration to update the weight W ( E ) . The resultingcoeﬃcients ( α, β ) on MUCA weights are then broadcasted. ω muca ( E k ) = 1 for all energies E k , which is used on an usual METROPOLIS simulationto collect the initial ( n = 0) histogram H muca ( E ) . Thereafter, it is applied a recursiveupdate scheme for the MUCA weights ω n +1 muca ( E ) = ω nmuca ( E ) /H nmuca ( E ) , before startinga new simulation with the lately estimated weights. This process is repeated till conver-gence is detected by a ﬂatness check on the histograms. In the end, at a large- n limit, themulticanonical entropy is ensured to recover the microcanonical one as S muca ( E ) → S ( E ).The multicanonical implementation on PHAST employs accumulated error-weightedhistograms, so statistics for weight estimations improves for every repetition, while con-vergence is attained by Berg’s weight recursion[7]. Additionally, its parallel estimator forMUCA weights — initially proposed by Zierenberg et. al. [26] to study spin-systems— present scaling properties computationally exceeding more traditional approaches [27].This is due to a simple and quite eﬃcient update scheme, illustrated on Figure 2. First,each processor (node) is initialized with the same MUCA coeﬃcients, but a particularrandom seed, then they perform independent Markov Chain Monte Carlo (MCMC) simu-lations to accumulate local energy histograms. After one iteration cycle those histogramsare combined on the master node, which updates and then broadcasts the new MUCAweights, the cycle is repeated till convergence is achieved.

3. Software framework

The PHAST package is constituted by the following modules • SET INPUT: prepares input protein sequences from PDB ﬁles • PHAST: the main engine to run parallel multicanonical simulations • ANALYST: computes the microcanonical thermostatistics from MUCA simulationsIt is not possible to give here a highly detailed description of all program routines. In-stead, we address the main features and data workﬂows of aforementioned modules in thefollowing subsections, which afterwards is illustrated by some concrete study cases.6 .1. SET INPUT

A single text ﬁle named

ProteinSequences.txt is to be provided as input to thesimulation module. But while users may manually encode each of its line entries, sodescribing protein (or polymeric) sequences according to an AB-like lexicon (e.g. Eq. 1),this can be alternatively generated in a quicker manner. To this end, the SET INPUTmodule was designed to facilitate the mapping of protein structures, downloaded from theProtein Data Bank (PDB) [28], by using a built-in hidrophobicity scale [23] conﬁgured onﬁle hscale.h .The aforesaid mapping is attained by invoking the setup module ( ./set input.o )and then typing the path to a ( .pdb or .seq ) input ﬁle, whereas the software shallrecognize among valid PDB ( .pdb ) or three-letter code ( .seq ) sequences. Still, by suc-cessively calling this module with diﬀerent ﬁle sequences, more lines are appended to ProteinSequences.txt , so making it quite easy to prepare input ﬁles even for simula-tions of multiple-proteic systems. This is the main module of the package, here multicanonical algorithms are implementedto simulate the coarse-grained models of Section 2. In principle, PHAST just needs to reada

ProteinSequences.txt text ﬁle before starting any run. However, the software can befully reconﬁgured (before compilation) through editing its .h ﬁles, whose content is • vars.h : the main conﬁguration ﬁle that sets multicanonical parameters. This in-cludes, but is not limited to, the quantity and size of energy binnings, the number ofMCMC sweeps and MUCA iterations, the deﬁnition of common blocks, the maximalsize and number of proteins as well as of the residue families allowed in a simulation. • set hamiltonian.h : enables to tune the coupling constants specifying a particularHamiltonian from Eq.(2) and Eq.(3). • box size.h : deﬁnes the size of the steric spherical container for proteins and somegeometric cutoﬀs of polymer backbones. • set randvec size.h : random numbers may be optionally generated as large vectorsin GPUs, the size can be optimally adjusted to ﬁt memory caches.Still, the main module is composed by the following brieﬂy summarized routines • MAIN.

F: is the principal routine of the simulation module, contains all MPI commu-nication structures and callings of I/O and MUCA engine functions. • HAMILTONIAN. f: given a polymeric conﬁguration this routine computes the Hamilto-nian deﬁned by Eq.(2) and Eq.(3) with the constants adjusted at set hamiltonian.h. • MONTE CARLO. f: this is a wrap routine that employs the Metropolis algorithm, withgeneralized MUCA weights, to perform MCMC stochastic evolutions. Additionally,it updates the local energy histograms at every simulation step.7

GEOMETRY. f: the simulation code uses inner coordinates, as bond-lengths and anglesbetween beads, to evolve individual chains. This routine allows to convert them to(external) cartesian coordinates for computing inter-chain distances. • UPDATE ENGINE. f: this function stochastically updates all bead-positions of the chainscontained in the 3d-spherical box deﬁned at box size.h . It employs a mix of op-timized updating algorithms, to know, spherical-caps : that changes internal anglesof each monomer along the backbone, pivotings : rotates a chain over a pivot direc-tion, bond-length changes : by dilations or shrinks around the equilibrium value r (if κ F = 0), and Galilean-group updates : i.e. rotations plus translations of chainsaround the box center of mass. More details on references in [13] . • SET HF-MATRIX. f: this routine permits to manually set the interaction coeﬃcients C LJ ( σ i , σ j ) and C F ( σ i , σ j ) of the Hamiltonian in Eq.(2). It shall be modiﬁed ac-cordingly to hscale.h, which allows for consistent generalizations of the AB-modellexicon such as [29]. • SET MUCA. f: a bundle of auxiliary functions needed to set MUCA simulations. • SET PROT-SEQ. f: reads in proteins from the AB-sequence ﬁle

ProteinSequences.txt . • SET-SAVE SIM. f: reads or writes to disk the simulation ﬁles, as protein conﬁgurations aAcidPosRxyz *.pdb with extremal energies, produced during MCMC evolutions. • SIM IO-FILES. f: reads or writes to disk the inner control ﬁles of the simulation. • RANDOM.f : the wrap function for random number generation on CPUs or GPUs. • cudaRandom.cu : enables random number generation on GPUs.It also shall be noted that as long as the simulations progress a serie of output ﬁles is au-tomatically saved to disk. While ﬁles as Protein3DPositions.* , aAcid3DPositions.* and last.con are just intended for inner control of simulations, being specially neededwhen restarting simulations, there are also many other output ﬁles with *.dat or *.pdb extensions. As an example beta n.dat, alpha n.dat, g n.dat, and hist n.dat re-spectively denote ﬁles containing data from the n -th MUCA repetition, as the coeﬃcients { β k , α k } n , the density of states g n ( E k ) and histograms H n ( E k ) — which are read by theANALYST module afterwards — as well as the aforementioned aAcidPosRxyz *.pdb ﬁles. This module is the kernel of the package for microcanonical thermostatistical anal-ysis. A set of ﬁles, originally saved by the simulation module — i.e. beta *.dat,alpha *.dat, and last.con — is read and semi-automatically processed. As an out-come neat xmGrace [30] ﬁles are produced according to template ﬁles, named as template *.agr .Some internal parameters shall be set before compilation, depending on the intended out-put features, to know 8 icontrol : set intensive units of energy, as “energy per volume”, on the outputs. • iAGR: deﬁnes if the thermodynamical observables of Eqs.(4)–(7) are written to indi-vidual .agr ﬁles, or combined into a single graph otherwise. • Tol Beta & Tol L: respectively set (i) the numerical tolerance among three succes-sive β values and (ii) the minimal latent heat to be searched for during a Maxwellconstruction [5] on β ( ε ) × ε .In particular, when the ANALYST module is called ( ./analyst.o ) the last.con ﬁle isread and supplies the software with parameters (set at vars.h ) used on that former MUCAsimulation. Then, the user is oﬀered the option to type the conﬁguration-ranges of theiteration ﬁles to be analyzed. Finally, the following additional information is requested to • “ Enter the number of monomers ”: necessary to output data as intensive quantities. • “ Enter the ﬁnite diﬀerence step-size, dE: [0.01,9.9] ”: the energy step dE is used tocompute the ﬁnite-diﬀerence derivatives taken on S ( E ) . • “ Enter the β value to compute Free Energy, or 0 for Maxwell construction ”: duringﬁrst-order like structural phase transitions, the S-bends appear on microcanonicalcaloric curves. So users may wish to provide a speciﬁc β c to compute the Eq.(7), orlet the program ﬁnd this pseudo-critical value by a Maxwell construction.

4. Example runs

As case studies we have investigated two proteic systems by using the AB-model limit [10] of the Hamiltonian Eq.(3). Such modelling does not allows for prediction of proteinstructures, but may provide an useful method to learn about some general thermodynam-ical mechanisms underlying biological structural phase transitions [12]. For instance, thisidea was successfully employed to compute aggregation propensities of peptides [13]. a typical "set_hamiltonian.h" file!-------------------------------------------------------------! Hamiltonian Parameters!-------------------------------------------------------------! Set c_KF, c_KC, c_KLJ_intra or c_KLJ_inter to 0.0d0 to! Shutdown any of that interactions!-------------------------------------------------------------! FENE parameters: c_KF*R^2*ln(1-[(r-r0)/R]^2)!-------------------------------------------------------------r0 = 1.00d0 ! Distance of equilibrium between beads Note that by taking κ F = 0 the FENE interaction is disabled and so all bond-lengths stay ﬁxed to r (e.g. r = 1), therefore the bead-stick limit of AB-model is recovered. = 1.2d0 ! Scale parameterc_KF = 0.0d0 ! FENE coupling constant!-------------------------------------------------------------! Curvature parameters: c_KC*(1-cos*(Phi))!-------------------------------------------------------------c_KC = 0.25d0 ! Curvature coupling constant!-------------------------------------------------------------! Non-bonded interaction parameters: c_KLJ*(1/R^12-sigma/R^6)!-------------------------------------------------------------c_KLJ_intra = 4.00d0 ! Intra-protein LJ coupling constantc_KLJ_inter = 4.00d0 ! Inter-protein LJ coupling constant Therefore, while keeping in mind the limitations of our coarse-grained approach, wechoose some natural amyloidogenic protein sequences as inspiration to generate aggregation-prone heteropolymeric chains through their hydrophobic mappings. So, we ﬁrst simulatethe aggregation of highly amyloidogenic segments of the Amyloid β protein, associated tothe Alzheimer disease — i.e. the Aβ − segment, PDB: — a system also appliedfor assessing the parallel scaling (Speedup) of our code. Additionally, we have simulatedthe aggregation of an heterogeneous molecular system, composed by segments of Aβ − and Amylin isoforms (related to the type 2 Diabetes Mellitus, see [13]) — i.e. the hu-man hIAP P − PDB: and rat rIAP P − PDB: — to test for catalyticeﬀects of crowding [31] and cross-seeding [32] on the onset of aggregation that inducessuch degenerative diseases. Aβ − segments: parallel speedup To simulate the aggregation of Aβ − segments, we initially downloaded the 2LFM ﬁlefrom PDB [28] and SET INPUT converted it to an AB-sequence. This sequence was thenduly cut, cloned and used to prepare an input ﬁle to simulate the molecular aggregation(dimerization), as can be checked in the next data sample ProteinSequences.txtBBBBBAAABBBBBBAAAB

The box-radius ( R ) was set to R = 40 for these simulations and, overall statistics collectedby the multiple processors was ﬁxed to 10 MCMC sweeps per MUCA iterations (1 . × for 128 processors), in a total of 10 repetitions.Before performing the data analysis of the resulting MUCA coeﬃcients { β k , α k } n , itwas necessary to ensure they had reasonably converged after some minimal number ofMUCA iterations ( N conv. ) . To do so, we inspected the ﬂatness of energy histograms H n ( E k )in the the full range of MUCA iterations, i.e. n ∈ [1 , . As an adequacy criteriawe imposed that the Coeﬃcient of Variation (c

Var = σ H / µ H ) , deﬁned as the ratio of thestandard deviation ( σ H ) over the average value of a histogram ( µ H ), should be smaller10

000 3000 4000 5000 6000 7000 k H n ( E k ) n: 100n: 150n: 175n: 400 Figure 3: Representative energy histograms H n ( E k ) employed for ﬂatness checks, which also allows foridentifying the convergence of MUCA coeﬃcients { β k , α k } n during simulations of Aβ − segments.For histograms in the iteration range n ∈ (175 , Var ≤ , see text. than 5%. Additionally, local peaks on any supposedly ﬂat histogram were to be no largerthan 20% of its average value. Considering both such constraints, we found acceptableconvergence (i.e. enough histogram ﬂatness) was established for data in the iterationrange n ∈ [ N conv. > , N S ≥ N conv. )and ending ( N E ≤ n ∈ [ N S , N E ] of MUCA coeﬃcients { β k , α k } n were tested, as well as the size of data-blocks △ N (see Binning method, [33]) tobe grouped in between. We also required the numerical stability and inter-compatibility(within the statistical errors) when computing the pseudo-critical inverse temperatures β c , which was particularly important to improve the curve matchings during data collapses(Figure 4). Despite tedious, this procedure also enabled controlled checks of autocorrelationtime eﬀects over the statistical error bars.The physical consistency of our parallel results was validated by the data collapse ofmultiple simulations, performed with increasing number of CPUs (up to 128 processors)at the IBM P750 machine [34], see a representative output on Figure 4 for parameters[ N S , N E ] = [200 , △ N = 50. The general microcanonical behavior of the systemwas ﬁnally checked by an automatized plotting of the thermostatistical observables savedto an .agr ﬁle, as can be seen from Figure 5 (where [ N S , N E ] = [700 , △ N = 25).It is observed a ﬁrst-order like structural phase transition, characterized by a reentrant onthe microcanonical entropy S ( ε ), which induces an S-bend on the caloric curve β × ε andalso a region with negative values of the microcanonical speciﬁc heat C v ( ε ) . See [5] fordetails and a comprehensive review on this formalism. It is also worth to note that suchtransitions are signaled by the presence of energetic barriers on the free energy, a featurewe have previously studied on similar bioinspired molecular systems [12, 13].11 ε β ( ε ) Microcanonical caloric curve bin = 0.10 and dE = 1.50

Figure 4: The data collapse of representative curves, of the inverse temperature β ( ε ) × ε, illustrates theequivalence on thermodynamical results from simulations employing up to 128 processors. -0.4 -0.2 0 0.2 ε -40-2002040 c v ( ε ) -0.4 -0.2 0 0.20246810 β ( ε ) -0.4 -0.2 0 0.2 ε f = ε - T . s -0.4 -0.2 0 0.2-40-30-20-100 S ( ε ) Figure 5: The microcanonical thermodynamic observables as seen from ANALYST module for the aggre-gation of Aβ − fragments, to know: the entropy S ( ε ) × ε , caloric curve β ( ε ) × ε, the (shifted) freeenergy f (cid:0) = ε − β − c · S (cid:1) × ε, and speciﬁc heat c v ( ε ) × ε. igure 6: The parallel Speedup obtained for Multicanonical simulations present almost linear scaling upto hundred processors, even at numerically diﬃcult ﬁrst-order like phase transitions. To conclude, the wall times for simulations presenting equivalent thermodynamicalresults — which was ensured by performing a previous data collapse — were employedto analyze the parallel speedup of our code, seen on Figure 6. So we plotted the parallelSpeedup factor S p = t / t p , deﬁned as the ratio of the serial t to the parallel t p times spentto accomplish the convergence of MUCA weights. A power-law model S p = a · P b wasproposed for regression, and so established an almost linear scaling b = 0 .

97 (2) withthe number of processors P . This corroborates the nice scaling properties of the parallelMUCA algorithm found on mild ﬁrst-order like phase transitions [26, 35]. Aβ − and IAP P − isoforms Recent clinical studies have suggested that DM2 can be a catalyst factor to the emer-gence of some neurodegenerative diseases, as the AD [36]. A microscopic molecular expla-nation, from in vitro studies [37], points to the cross-seeding of Aβ and IAP P as a keyfactor for the appearance of heterogeneous amyloid ﬁbrils with considerable cytotoxicity.This phenomena would be somehow mimicked by the aggregation of bioinspired protein-likeheteropolymers, investigated by the methodology previously applied to study the aggrega-tion of Aβ − . Therefore, an input ﬁle was prepared by editing the AB-sequences obtainedfrom SET INPUT, once it is feed with the 2LFM ( Aβ ) and 2KB8 (hIAPP) ﬁles downloadedfrom PDB website [28]. That is ProteinSequences.txtBBBABAAABB However, for a ﬁxed total simulation statistics the error-bar sizes are enlarged by increasing thenumber of processors. This probably indicates an eventual speedup saturation, possibly connected to thecorrelation times, as already observed in [26, 35]. ε -200-1000100200 c v ( ε ) -0.8 -0.6 -0.4 -0.2 0 0.2 01234 β ( ε ) -0.8 -0.6 -0.4 -0.2 0 0.2 ε f = ε - T . s -0.8 -0.6 -0.4 -0.2 0 0.2-75-50-250 S ( ε ) hIAPP+ABrIAPP+AB ∆ f h ∆ f r ∆ε h ∆ε r Figure 7: The microcanonical thermodynamic observables as seen from ANALYST module for the aggre-gation of Aβ − fragments cross-seeded by Amylin isoforms hIAP P − or rIAP P − . To know:the entropy S ( ε ) × ε , caloric curve β ( ε ) × ε including the Maxwell construction, the free energies f (cid:0) = ε − β − c · S (cid:1) × ε, and the speciﬁc heat c v ( ε ) × ε. While the energetic barrier △ f h = 0 . △ f r = 0 . , its groundstate is found at much higher energies on the landscape. This explains the larger latent heatsfor the rodent system △ ε r = 0 .

584 (5) [to be compared to △ ε h = 0 .

513 (5)], and so its increased stabilityagainst aggregation. The energy axis ε = E/N is shown in intensive units.

BBBABAAABBBBBBBAAABBBBBBAAAB also, an equivalent system for IAPP molecules of rats (PDB: was built, as shown

ProteinSequences.txtBBBABAAAAABBBABAAAAABBBBBAAABBBBBBAAAB

These simulations were run on a considerably denser environment, once the box-radiuswas set to R = 20 . The 480 CPUs at [38] shared a workload of 10 MCMC sweeps perMUCA iteration, in a total of 10 MUCA repetitions. The obtained MUCA weights in theiteration range [ N S , N E ] = [700 , △ N = 25. Thermodynamical results are summarized for all microcanonical observ-ables of both systems on Figure 7, while details for representative structures are exhibitedon Figure 8. It is clearly observed that while the energetic barrier for the aggregation ofthe human molecular system (i.e. Aβ − and hIAP P − ) △ f h = 0 . △ f r = 0 . , the latter transition happens at muchdeeper values of the free energy. This fact justiﬁes the larger latent heat found on the14 igure 8: The microcanonical caloric curve β ( ε ) × ε with representative molecular conﬁgurations for thecross-seeding of fragments of Aβ − and Amylin isoforms: hIAP P − or rIAP P − . rodent system △ ε r = 0 .

584 (5) [to be compared to that of humans △ ε h = 0 .

513 (5)], whichexplains thermodynamically its improved stability against aggregation.Interesting enough, we also detected the presence of multiple backbendings on β ( ε ) × ε (see Figure 8), which are related to nucleation hierarchies, as broadly studied on [39]. Ini-tially one observes the formation of one hetero-oligomer made of one segment of Aβ − and an Amylin isoform ( hIAP P − or rIAP P − ) , whose thermodynamical signatureemerges by the ﬁrst bump on the caloric curves. Following, another hetero-oligomer issimilarly built and, then ﬁnally the aggregation process is completed by a global collapseof both hetero-oligomers which results on a globular structure. It deserves to be noted thatthose aggregates are clearly not amyloidic ones, as no cross-beta structure is formed, whichwould require speciﬁc interactions for their stabilization a feature absent in our Hamilto-nian. However, these protein-like aggregation transitions exhibit hierarchical nucleationsas would also be expected by cooperative cross-seeding mechanisms on real proteins.Therefore, some intriguing conclusions may be addressed from our microcanonical anal-ysis, mainly regarding the stability of Amylin isoforms when cross-seeded by Aβ segments.For example, a biotechnologically designed protein named Pramlintide [40] — originallyinspired on PDB:2KJ7 — has been recently used as an adjuvant to treat DM2 in humans,curiously such protein becomes identical to rIAP P when mapped by the hydrophobicscale [23] to the AB-model. Thus, our simulations indicate that Pramlintide could also beemployed for damping the aggregation of heterogeneous amyloidogenic systems, as onesmade by Aβ and IAP P isoforms. This result is corroborated not only by aggregationpropensities previously computed from the AB-model [13], but also by studies of similarsystems [31, 32] using all-atom force ﬁelds. 15 . Conclusions

We have presented PHAST, a package for simulating coarse-grained models of proteinsand bioinspired polymers in the multicanonical ensemble. Despite of its simple structure,the software comprises a reconﬁgurable force ﬁeld which allows for an excellent parallelspeedup. Additionally, its modules enable users to quickly map proteins downloaded fromPDB to an inner AB-lexicon — or its generalizations as [29] — as well as to automaticallycompute most of systems microcanonical thermostatistics. Still, the program codes areopen and free, which shall motivate students and researchers to readily adapt them totheir speciﬁc purposes, as modelling many-body protein-protein interactions or the realis-tic eﬀects of crowding on polymeric systems. We plan to gradually improve the built-inforce ﬁelds, besides moving their calculations to GPUs, and implement faster evolutionalgorithms as hybrid MCMC [41].

Acknowledgements

This is a long-term project whose author has beneﬁted from enlightening discussionswith Leandro G. Rizzi, Lieverton H. Queiroz, Mathias S. Costa, Mikhael C. Chrum, andNelson A. Alves. The Brazilian laboratories CENAPAD-SP and LNCC-RJ are acknowl-edged by providing machine time and helpful technical support, respectively on their IBMP750 [34] and Santos Dumont [38] supercomputing facilities.