A step towards a reinforcement learning de novo genome assembler
Kleber Padovani, Roberto Xavier, Andre Carvalho, Anna Reali, Annie Chateau, Ronnie Alves
Federal University of Pará, Computer Science Graduate Program, Belém-PA, 66.075-110, Brazil
University of the State of Amazonas, Itacoatiara-AM, 69.101-416, Brazil
Vale Technology Institute, Sustainable Development, Belém-PA, 66.055-090, Brazil
Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos-SP, 13566-590, Brazil
Polytechnic School, University of São Paulo, São Paulo-SP, 05508-010, Brazil
Laboratory of Informatics, Robotics and Microelectronics of Montpellier, University of Montpellier 2, Montpellier, 34090, France
* [email protected]

ABSTRACT
Genome assembly is one of the most relevant and computationally complex tasks in genomics projects. It aims to reconstruct a genome through the analysis of several small textual fragments of that genome, named reads. Ideally, besides ignoring any errors contained in the reads, the reconstructed genome should also optimally combine these reads, thus reaching the original genome. The quality of the genome assembly is relevant because more reliable genomes allow a more accurate understanding of the characteristics and functions of living beings, generating many positive impacts on society, including the prevention and treatment of diseases. The assembly becomes even more complex (and is termed de novo in this case) when the assembler software is not supplied with a similar genome to be used as a reference. Current assemblers have predominantly used heuristic strategies on computational graphs. Despite being widely used in genomics projects, there is still no irrefutably best assembler for any genome, and the proper choice of these assemblers and their configurations depends on Bioinformatics experts. The use of reinforcement learning has proven very promising for solving complex activities without human supervision during the learning process; however, its successful applications are predominantly focused on fictional and entertainment problems, such as games. Based on the above, this work aims to shed light on the application of reinforcement learning to solve this relevant real-world problem, genome assembly. By expanding the only approach found in the literature that addresses this problem, we carefully explored the aspects of intelligent agent learning, performed by the Q-learning algorithm, to understand its suitability for scenarios whose characteristics are more similar to those faced by real genome projects. The improvements proposed here include changing the previously proposed reward system and adding state-space exploration optimization strategies based on dynamic pruning and mutual collaboration with evolutionary computing. These investigations were tried on 23 new environments with larger inputs than those used previously. All these environments are freely available on the internet for the evolution of this research by the scientific community. The results suggest consistent performance progress using the proposed improvements; however, they also demonstrate their limitations, especially those related to the high dimensionality of the state and action spaces. We also present the paths that can be traced to tackle genome assembly efficiently in real scenarios, considering recent, successful reinforcement learning applications (including deep reinforcement learning) from other domains dealing with high-dimensional inputs.
Introduction
Genome assembly is one of the most complex and computationally exhaustive tasks in genomics projects. Additionally, it is among the most important tasks, as it enables genome sequence analyses. The main goal of genome assembly is to reconstruct a genome using an assembler software, which does so by analyzing several small fragments coming from a genome, commonly referred to as the target genome. These fragments, named reads, are obtained from an equipment called the DNA sequencer.

A high-performance assembler is highly desired among researchers, as it implies more accurate genomes, allowing researchers to reach a better understanding of the traits and functions of living organisms. As a result, the knowledge acquired from these whole genomes produces positive impacts in several fields, such as medicine, biotechnology and biological sciences.

A DNA sequencer is a machine responsible for the initial, but fractional, reading of the genetic code of living organisms. However, the genomes of most organisms, even microorganisms, are too long to be read in a single run in the sequencer. To surpass this limitation, a technique called shotgun sequencing is applied, consisting of cutting up the genome into small pieces and producing small fragments of DNA that can be entirely (or mostly) read by the sequencer, representing the corresponding genetic information in text sequences (the reads).

Since DNA molecules are formed by sequential pairs of complementary nucleotides (each composed of Adenine-Thymine or Guanine-Cytosine), reads represent only a single nucleotide from each pair, sequentially written as a character in the text. In biological terms, we can say that only one (out of two) strand of each DNA fragment is read. The reading of each nucleotide is represented by its corresponding initial (A, C, G or T). Thus, the number of characters in each read is commonly referred to as base pairs (or bp).

The genome of an organism is the sequence of all nucleotides from its DNA molecules, represented by letters (A, C, T, G). Each isolated nucleotide does not represent any relevant biological information; however, when we put them all together, the corresponding sequences provide deep knowledge about the organism. Within the organism's genome, for example, there are (among other information) the species' genes. Genes are continuous fragments of the genome whose nucleotide sequences define species traits and behaviors (e.g. the human eye color). A single read, however, generally cannot represent the complete information of even a single gene, thus genome assembly is commonly required to obtain the whole genome.

Genome sequencing technology defines the maximum number of base pairs that will be read from each DNA fragment, which, in turn, defines the size of the produced reads and directly affects the number of reads produced during the sequencing process. As genome assembly is a computational task aiming to order reads in an attempt to reconstruct the original DNA sequence, the number of reads and their sizes directly impact the complexity of the assembly process: the more and larger the reads, the higher the complexity of assembling them.

Genome assembly is usually carried out by assemblers adopting the de novo strategy and/or the comparative strategy. The comparative approach is relatively simpler and computationally treatable; however, it requires a previously assembled genome as reference (e.g.
the genome of a similar species) to guide the assembly process by comparing the produced reads with the reference genome, whereas the de novo approach has no such dependence.

The de novo strategy is particularly important given that only a small number of reference genomes are currently available, compared to the number of existing and non-sequenced genomes; it is estimated that the vast majority of microorganisms' genomes are still unknown. However, although the de novo approach allows the assembly of new genomes without requiring a reference genome, it is a highly complex combinatorial problem, falling into the theoretically intractable class of computational problems referred to as NP-hard.

In computer science, the strategies commonly applied to the de novo genome assembly process are based on heuristics and graphs, and they are known as Greedy, Overlap-Layout-Consensus (OLC), and De Bruijn graph. For instance, in the OLC strategy, each read is represented as a node in a graph (named overlap graph) and edges represent the overlap between reads. Thus, the reconstructed genome corresponds to the reads along the path traversing all the nodes; finding such a path corresponds to the Hamiltonian path problem. Another computational formulation of genome assembly is to find the shortest common superstring (SCS) formed by the reads, which can also be polynomially reduced to the Travelling Salesman Problem (TSP).
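From a computational perspective, the OLC formulation can be sketched in a few lines. The Python fragment below is a minimal illustration rather than a production assembler: it builds a weighted overlap graph from exact suffix-prefix overlaps (the approaches discussed later score overlaps with Smith-Waterman instead) and searches for the best layout by brute force, which is only viable for a handful of reads and hints at why the general problem is NP-hard:

from itertools import permutations

def suffix_prefix_overlap(a, b):
    # Length of the longest suffix of `a` that equals a prefix of `b`.
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads):
    # OLC-style overlap graph: each read is a node; edge (i, j) is weighted
    # by the suffix-prefix overlap between read i and read j.
    return {(i, j): suffix_prefix_overlap(reads[i], reads[j])
            for i in range(len(reads)) for j in range(len(reads)) if i != j}

def best_layout(reads):
    # Exhaustive search over all layouts (the Hamiltonian-path flavour of OLC);
    # the factorial cost makes this infeasible beyond toy inputs.
    edges = overlap_graph(reads)
    score = lambda order: sum(edges[(order[i], order[i + 1])]
                              for i in range(len(order) - 1))
    return max(permutations(range(len(reads))), key=score)

print(best_layout(["ATGGC", "GGCAT", "CATTA"]))  # toy reads -> (0, 1, 2)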
Regardless of these difficulties and limitations, current de novo assemblers are capable of producing acceptable solutions. However, the use of de novo assemblers normally requires specific bioinformatics knowledge to correctly set the assemblers' configurations and parameters. Even so, optimal results are not always guaranteed, given the high complexity surrounding this theme.

Despite the great contributions of current assemblers to scientific discoveries in the genomic analyses of organisms, genome assembly is not yet a fully solved problem. It is therefore very important to continue developing new and more robust assemblers, in order to assemble DNA sequences faster and with improved accuracy. This challenge has been the aim of numerous ongoing research efforts, which apply computational techniques to genomics in the search for better solutions, including the use of machine learning (ML).

Although machine learning is an alternative to heuristics for dealing with highly complex computational problems (as is the case of NP-hard problems), few approaches apply machine learning to the assembly problem. According to the literature review presented by Souza et al., the few scientific investigations evaluating the application of machine learning techniques to genome assembly were published recently: only one study dates to the end of the previous century, while 12 others date from later. For comparison, in a similar review, Henrique et al. reported 140 publications applying machine learning techniques to the problem of financial credit risk.

With recent access to computational advances, including the increased processing and storage power of computers, the investigation of machine learning applications for complex and large-scale computational problems has started to increase in the scientific community, and some good results have been reported. This greater availability of resources has also enabled the return of reinforcement learning applications for these problems.

Reinforcement learning (RL), although scarcely applied in machine learning development, has shown some surprisingly positive results, especially for games. Two recent, great and commonly cited examples of the application of reinforcement learning are the training of agents to play the classic board game Go, as well as several Atari games, showing better results than previous approaches and even superior performance when compared to trained humans.

However, successful RL applications predominate in problems that rely on accurate environment simulators, such as games, where the rules and environments are known and allow the development of simulations for the intensive training of intelligent agents.
Despite the importance of games to both society and computer science, there is a growing expectation, and some initial efforts have been made, to extend the success of RL in games to real-world problems, for which it is generally impossible or impracticable to create the simulators that would provide appropriate training for the agents.

Such low use of RL in real-world problems is also observed in the specific scenario of genome assembly, where only a single study can be found, proposed by Bocicor et al. This approach, which is also the object of this study and will be henceforward referred to as the seminal approach, proposes the use of an episodically trained agent (i.e. whose training has been divided into episodes) applying the Q-learning reinforcement learning algorithm to learn the correct order of a set of reads and, consequently, to reach the corresponding genome.

One of the most valued characteristics of reinforcement learning is that it grants the agent autonomy for learning. In supervised learning, for example, a great deal of intervention is required (usually by human specialists) during the learning process, given that all information provided to the machine must be previously and properly labeled. In reinforcement learning, the agent learns through the consequences of its successive failures and successes.

This is particularly useful for solving tasks whose solutions extrapolate human knowledge and capacity, such as genome assembly. The assembly complexity starts with the assembler's choice, as assemblers using similar strategies may produce different results, and extends to the assembler's configuration following the user's decisions. Obtaining intelligent agents trained by reinforcement learning is important in this scenario as it could eliminate the need for human specialists.

Another relevant aspect of the application of reinforcement learning is the capacity of agents to deal with large volumes of data and extract new rules associated with the main task, rules which were not explicitly provided before and, in some cases, were not identified by humans. In games, as mentioned above, agents were able to play new games on their own, without human supervision, and in some cases they were able to outperform the best known human players.

Considering that the Q-learning algorithm requires the definition of a Markov decision process with established sets of states and actions, together with a reward system to be followed by the agent at each action in every state, the problem was modeled by the authors through a space of states capable of representing all possible read permutations, so that there exists only one initial state and, in each state, there is one action to be taken by the agent for each read in the pool to be assembled.

Following these definitions, from the graph theory perspective, the proposed state space for n reads can be visualized as a complete n-ary tree with height equal to n, as the set of states presents only one initial state and forms a connected and acyclic graph. Thus, we can derive the number of existing states in the state space, given by Equation 1:

number of states = (n^{n+1} - 1) / (n - 1)    (1)

The authors defined that each state requiring n actions to be reached (n being the number of reads to be assembled) is an absorbing state. A small and constant reward (e.g. 0.1) was assigned to actions reaching non-absorbing states.
This reward was also assigned to every action leading to absorbing states reached through repeated reads. Finally, actions leading to the other absorbing states (the final states) produce a reward corresponding to the sum of overlaps between all pairs of consecutive reads used to reach these states.

Figure 1 presents a simple example of a space of states for a set of only 2 reads, identified as A and B. In this example, we can observe a single initial state, two actions associated with non-absorbing states, and four absorbing states (highlighted in black) reached after taking any two actions.

Figure 1. Example of a state space for a set of two reads, here referred to as A and B.

In this figure, two of the absorbing states are highlighted by the letter X. These states, unlike the others, are final states, as they are the only ones in the space of states reached from the initial state without repeated actions: one is reached by taking the action read A and then read B, and the other by taking read B and then read A.

The Smith-Waterman algorithm (SW) was applied to obtain the overlaps between pairs of reads, which were summed to obtain the rewards of the actions that led to final states. The sum of overlaps when reaching a final state s, here referred to as the Performance Measure (PM), is described in Equation 2, where read_s corresponds to the sequence of reads associated with the actions taken to reach s:

PM(s) = \sum_{i=1}^{n-1} sw(read_s[i], read_s[i+1])    (2)
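The PM measure of Equation 2 can be reproduced with a textbook Smith-Waterman implementation. In the sketch below, the scoring parameters (match, mismatch and gap penalties) are assumptions for illustration, since they are not specified by the equations above; only the local-alignment recurrence and the pairwise summation follow the given definitions:

def smith_waterman(a, b, match=1.0, mismatch=-0.3, gap=-1.0):
    # Best local-alignment score between sequences a and b
    # (scoring parameters are illustrative assumptions).
    h = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0.0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

def performance_measure(reads_in_order):
    # PM(s): sum of SW scores over consecutive pairs of reads (Equation 2).
    return sum(smith_waterman(reads_in_order[i], reads_in_order[i + 1])
               for i in range(len(reads_in_order) - 1))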
Using these definitions, the seminal approach produced attractive results; however, it was evaluated by the authors against only two small sets of reads: one with 4 reads of less than 10 bp each, and a second with 10 reads of 8 bp each. These reads were obtained by simulating the sequencing process, assuming as target genome only a small fragment (a microgenome) of the real genome of the bacterium Escherichia coli.

In order to perform a scalability analysis of the seminal approach, Xavier et al. evaluated its performance against 18 datasets produced following the same simulation methods. The first dataset corresponds to the set of 10 reads with 8 bp from the seminal approach, originated from a 25 bp microgenome. From this same microgenome and from a 50 bp microgenome, a total of 17 new datasets were generated (8 from the smaller microgenome and 9 from the larger one), each containing 10, 20 or 30 reads, with 8 bp, 10 bp or 15 bp.

Almost all definitions made by Bocicor et al. were replicated by Xavier et al. in this analysis, but the parameters α and β were tuned experimentally (the latter set to 0.9), and the space of actions was slightly reduced, so that actions associated with reads already taken were removed from the available actions. In the state space depicted in Figure 1, for example, the leftmost and rightmost leaves (i.e. absorbing states) would be removed by this change. Although this change reduces the number of states, the size of the state space continues to grow exponentially, as we can observe in Equation 3:

number of states = \sum_{i=0}^{n} n! / (n - i)!    (3)

This study confirmed the positive results previously found by the seminal approach on the first dataset; however, as the dataset size increases, the performance of the seminal approach decreases considerably, reaching the target microgenome in only 2 out of the 17 larger datasets. According to the authors, such poor results may be related to the high cost required for the agent to explore such a vast state space (as seen in Equation 1, this space grows exponentially) and also to the failures of the proposed reward system.

In order to continue the investigation of the application of reinforcement learning to the genome assembly problem, and targeting the current challenge of applying reinforcement learning to real-world problems, here we propose to analyze the limits of reinforcement learning on a real-world problem that is also a key problem for the development of science. This analysis was carried out by evaluating the performance of strategies that are complementary to those previously studied and that could be incorporated into the seminal approach to obtain improved genome assembly.
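The growth expressed by Equations 1 and 3 is easy to verify numerically. The snippet below is a direct transcription of the two formulas; it shows that even the reduced space of Equation 3 explodes quickly, and that for n = 30 Equation 1 already yields roughly 2 x 10^44 states, the figure quoted later in the Discussion:

from math import factorial

def states_seminal(n):
    # Equation 1: complete n-ary tree of height n.
    return (n ** (n + 1) - 1) // (n - 1)

def states_reduced(n):
    # Equation 3: actions for already-used reads removed.
    return sum(factorial(n) // factorial(n - i) for i in range(n + 1))

for n in (4, 10, 20, 30):
    print(n, states_seminal(n), states_reduced(n))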
Methods

In this study, 7 experimental approaches were evaluated against the seminal approach, referred to as approaches 1.1, 1.2, 1.3, 1.4, 2, 3.1, and 3.2, and their methodologies are described in the next 3 subsections. In each approach, as in the seminal approach, the main goal is to obtain an agent trained by reinforcement learning capable of identifying the correct order of reads from a sequenced genome. Figure 2 illustrates this proposal, where the set of reads from the sequencing process is initially represented as the agent's interaction environment; the agent learns the order of reads by observing the environment's current state and the rewards produced by the successive actions taken, until it (ideally) reaches the correct order for the analyzed reads.

Figure 2. Illustration of the application of reinforcement learning to the genome assembly problem. The set of reads, which are obtained in random order by the sequencer, is represented computationally by a reinforcement learning environment. Through successive interactions with the environment, caused by taking actions, the agent ideally learns the correct order of reads, thus allowing the target genome to be reached.

These approaches consider the findings of the scalability analysis from the previously mentioned work of Xavier et al. Efforts were made to improve the reward system adopted in the seminal approach (especially in the approaches described in the next subsection) and to optimize the agent's exploration (in the approaches described in the last two subsections).
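All approaches share a tabular Q-learning core: choose an action, take it, update the Q-table, and end the episode at an absorbing state. The skeleton below is a minimal sketch of that shared basis; the environment interface (including the available_actions helper) and the hyperparameter values are placeholders, not the exact configuration of the experiments:

import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.5, gamma=0.9, epsilon=0.1):
    q = defaultdict(float)                           # Q[(state, action)]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:                              # until an absorbing state
            actions = env.available_actions(state)   # assumed helper
            if random.random() < epsilon:            # epsilon-greedy choice
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            future = max((q[(next_state, a)]
                          for a in env.available_actions(next_state)), default=0.0)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
    return q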
Approaches 1: Tackling sparse rewards
In approaches 1, the reward system of the seminal approach, formalized in Equation 4, was improved to prevent permutations of reads that are inconsistent in terms of alignment from producing high reward values for the agent. This is an undesirable behavior, as the agent's learning process is based on maximizing the accumulated rewards; ideally, high rewards should be associated with high-quality responses.

r(s, a, s') = PM(s') if s' is a final state, and 0.1 otherwise    (4)

A first source of inconsistency lies in the Smith-Waterman (SW) algorithm, chosen for calculating the local alignment between two given sequences. With this algorithm, a numeric score is calculated to represent the largest (even if partial) alignment between two sequences. However, the SW algorithm imposes no constraint on the order between sequences; it does not consider that a sequence (in our case, a read) must be aligned either left or right of the other one.

In the case of the genome assembly problem, where the alignment between subsequent reads is expected to follow an order, the overlap score obtained from the SW algorithm may induce the agent to find read permutations with high pairwise overlap values that nevertheless do not present a consecutive (suffix-prefix) alignment between the reads in the set.

In Figure 3 we can observe this type of inconsistency through an example, which presents two different permutations of a set of reads, identified by the letters A to J, obtained from a given microgenome. The first permutation is formed, in this order, by the reads
A, B, C, D, E, F, G, H, I, and J, and according to SW the estimated overlap is 40.34. This permutation is the optimal one, because the union of subsequent reads through the overlap between the suffix of the previous read and the prefix of the posterior read produces the target microgenome (provided at the bottom of the figure). On the other hand, the second permutation, formed by the reads H, G, F, E, C, B, D, A, J, and I, respectively, is not optimal; however, for this permutation SW presents an overlap of 43.02, which is greater than that obtained for the optimal permutation.

Another aspect to be considered in the reward system of the seminal approach is that the agent can receive high rewards only when it takes actions leading to sparse final states in the state space. That is, considering that the applied training is episodic, in each training episode the agent receives a non-constant and high reward for only 1 of the n actions taken.

Given that, the reward system was adjusted in 4 different ways, in order to explore two aspects: (a) the use of an overlap score that considers the relative order of reads and/or (b) the use of dense rewards. These new reward systems are referred to here as approaches 1.1, 1.2, 1.3 and 1.4, as follows.

Figure 3. Illustration showing that the measure used as a reward to train agents does not produce maximum values for optimal outputs in some cases. Above, an optimal permutation of reads, for which the PM is 40.34 and whose corresponding genome is equal to the target genome itself; below, another permutation whose output differs from the target genome, but whose PM is greater than the PM of the optimal permutation.
As proposed in the seminal approach, the reward system of approach 1.1 defines that actions leading to final states produce a bonus reward (of 1.0) added to a numerical overlap score computed over all subsequent reads used since the initial state. Thus, these actions produce a reward corresponding to the sum of the normalized overlap scores (each ranging from 0 to 1) of the pairs of consecutive reads, taking their relative order into account.

Still, every action that leads to a non-final state produces a constant and low reward (0.1). Equation 5 formalizes the reward system of approach 1.1, with PM_norm(s') representing the normalized overlap between the reads used to reach s' (see more information on the normalized overlap calculation in Section 2 of the supplementary material):

r(s, a, s') = PM_norm(s') + 1.0 if s' is a final state, and 0.1 otherwise    (5)

Although approach 1.1 considers the relative order of reads, it is still susceptible to the sparse rewards problem, as is the seminal approach: although it often produces a small, constant and usually positive reward, rather than the zero-valued reward traditionally applied by sparse reward systems, only a few sparse state-action pairs produce higher rewards.

We can observe in both systems (Equations 4 and 5) that no reward is provided during the learning process to guide the agent towards its goal, since any read incorporated before a final state produces a reward of 0.1. Therefore, the changes proposed in approaches 1.2, 1.3 and 1.4 mainly focus on improving this aspect; for this, the higher rewards, previously obtained only at the end of the episode, were distributed over each action taken in each episode.

Thus, these approaches, in addition to making the reward system dense instead of sparse as originally proposed, focus on reducing or eliminating the occurrence of inconsistencies which, from the genome assembly perspective, would allow permutations of unaligned reads to produce maximum accumulated rewards.

Equations 6, 7 and 8 represent, respectively, the reward systems of approaches 1.2, 1.3 and 1.4, where ol_norm(s, s') represents the normalized overlap between the two subsequent reads that allowed reaching state s and then s':

r(s, a, s') = PM_norm(s')    (6)

r(s, a, s') = PM_norm(s') + 1.0 if s' is a final state, and ol_norm(s, s') otherwise    (7)

r(s, a, s') = ol_norm(s, s') + 1.0 if s' is a final state, and ol_norm(s, s') otherwise    (8)
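Translated into code, the four reward systems differ only in what is paid on intermediate versus final transitions. The functions below transcribe Equations 5-8 directly; the pm_norm_s2 and ol_norm_pair arguments stand for precomputed values of PM_norm(s') and ol_norm(s, s'), which are assumed to be obtained from the normalized overlap calculations of the supplementary material:

def reward_1_1(pm_norm_s2, ol_norm_pair, is_final):
    # Equation 5: order-aware overlap plus a 1.0 bonus on final states; still sparse.
    return pm_norm_s2 + 1.0 if is_final else 0.1

def reward_1_2(pm_norm_s2, ol_norm_pair, is_final):
    # Equation 6: dense reward, normalized PM of the partial assembly at every step.
    return pm_norm_s2

def reward_1_3(pm_norm_s2, ol_norm_pair, is_final):
    # Equation 7: pairwise overlap at each step, PM-based bonus at the end.
    return pm_norm_s2 + 1.0 if is_final else ol_norm_pair

def reward_1_4(pm_norm_s2, ol_norm_pair, is_final):
    # Equation 8: pairwise overlap at each step, plus a 1.0 bonus at the end.
    return ol_norm_pair + 1.0 if is_final else ol_norm_pair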
Approach 2: Pruning-based action elimination

One of the great challenges for applying reinforcement learning to real-world problems is the high dimensionality of the state space that must be explored by the agents. The Q-learning algorithm is especially susceptible to the curse of dimensionality, as the number of states (as well as the number of actions) directly affects the data structure required for the agent's learning. Given that the number of actions in each state directly affects the number of states in the seminal proposal, eliminating actions is a strategy for dealing with the high dimensionality.

To reduce the state space proposed by the seminal approach even further, a heuristic procedure was applied to eliminate actions whose directly or indirectly reachable states had already been fully explored and whose maximum accumulated reward is smaller than the accumulated reward obtained by taking some other action available in the same state.

Figure 4 shows an example of this action elimination for a state space with 3 reads. Looking again at the modified state space as a tree (i.e. with the actions associated with already-used reads removed), we see 16 states in this illustration, 6 of which are absorbing states (at the base of the tree) that also correspond to final states under the proposed modeling.

To better understand the pruning process, note that 3 out of the 6 final states are highlighted in black, while the remaining states are in gray and white. Black states correspond to explored final states (i.e. already visited by the agent). Gray states, such as the one reached by taking action a in the initial state, represent states whose children have all been fully visited during the learning process. Finally, white states (final or not) are those not yet explored and/or with unexplored children (e.g. the initial state, where one child is unexplored and the other only partially explored).
Figure 4. Illustration of the pruning procedure for a state space referring to the assembly of 3 reads, referred to as a, b and c. The generic pruning procedure is defined in detail by Algorithm 1.

When reaching an unexplored final state, such as the rightmost final state in Figure 4, the reward accumulated since the initial state is kept and propagated to its predecessors, each of which retains only the highest value propagated by its children. Each reward is represented by an integer number within the states in the figure. Thus, each non-final state stores the highest accumulated reward reached from it during the training process.

Based on this information, it is possible to prune irrelevant actions: those that, once taken, cannot produce the maximum accumulated reward. An example is action a of the initial state in Figure 4. Note that all states reachable after taking this action have been explored, and that the maximum accumulated reward obtained is 6, achieved by consecutively taking the actions a, b, c. Also note that action c in the initial state, even though not fully explored, is capable of producing a higher reward, equal to 8, obtained by taking the actions c, a and b, in that order.

Thus, when the agent first goes through the sequence of states corresponding to the actions c, a and b, the pruning mechanism propagates the maximum reward value up to the initial state and, at that moment, cuts action a from the initial state. Algorithm 1 presents the procedure that updates the pruning process whenever the last explored final state (referred to as state) is reached with the corresponding accumulated reward (referred to as newReward).
Algorithm 1. Pruning procedure.

procedure PRUNE(state: treeNode, newReward: float)
    if state != null and (state.unseen or newReward > state.maxReward) then    # state.unseen starts true, for all states
        state.unseen <- false
        state.maxReward <- newReward
        if not state.final then
            PRUNEUSELESSCHILDREN(state)    # i.e. prune all fully explored children where maxReward < newReward
        end if
        PRUNE(state.parent, newReward)
    end if
end procedure
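A possible Python rendering of Algorithm 1 is given below. The tree-node fields follow the figure (parent link, final flag, best accumulated reward); how "fully explored" is tracked is simplified here into a flag assumed to be maintained by the training loop:

class TreeNode:
    def __init__(self, parent=None, final=False):
        self.parent, self.final = parent, final
        self.unseen, self.max_reward = True, float("-inf")
        self.fully_explored = False      # maintained by the training loop (assumption)
        self.children = {}               # action -> TreeNode

def prune(state, new_reward):
    # Propagate the newly observed accumulated reward towards the root,
    # cutting branches that can no longer beat the best-known reward.
    if state is None or not (state.unseen or new_reward > state.max_reward):
        return
    state.unseen = False
    state.max_reward = new_reward
    if not state.final:
        prune_useless_children(state)
    prune(state.parent, new_reward)

def prune_useless_children(state):
    # Drop fully explored children whose best reward is below the new maximum.
    state.children = {a: c for a, c in state.children.items()
                      if not c.fully_explored or c.max_reward >= state.max_reward}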
Approaches 3: Evolutionary-based exploration

In this proposal, the potential for mutual collaboration between reinforcement learning and evolutionary computing was investigated, by applying the elitist selection of the genetic algorithm, to optimize the exploration of the state space. To assess the individual contribution of the genetic algorithm to this hybrid proposal, the approach was divided into two smaller approaches, referred to as approaches 3.1 and 3.2 and presented in the following subsections.
Approach 3.1: Evolutionary-aided reinforcement learning assembly
The strategy of applying ε-greedy, and its variations, is a classic solution for expanding the exploration of agents trained by the Q-learning algorithm, as it allows a broader initial exploration while still reaching the optimal policy once the state space has been sufficiently explored. However, the trade-off between exploitation and exploration remains a major and challenging problem for reinforcement learning in high-dimensional environments.

Searching for a more efficient exploration process, and also considering the good performance of genetic algorithms in a similar genome assembly approach carried out by Oliveira et al., the interaction between reinforcement learning and evolutionary computation was introduced into the exploration process.

This approach is based on the traditional operation of the Q-learning algorithm. However, in each Q-learning episode, the sequence of actions is stored, and at the end of the episode it is transformed into a chromosome of an initial population that will later evolve. This procedure is presented in Figure 5, where we can see the list of actions taken in each episode being used to construct a new individual.

New chromosomes are inserted into the initial population until the number of chromosomes reaches the predefined size of this population. At this point, agent training is interrupted and m genetic generations are carried out, m being also predetermined (for details see Section 4 of the supplementary material), applying the normalized sum of overlaps between reads as the fitness function (the same measure applied in Equation 8 and detailed in Section 2 of the supplementary material).

After m generations, the individual that is fittest according to the objective function is used to conduct the next episode of the agent's reinforcement learning training, hitherto interrupted. As each gene of the individual's chromosome corresponds to one possible action, the complete gene sequence contains the distinct successive actions to be taken by the agent in the current episode, producing a mutual collaboration between reinforcement learning and the genetic algorithm: the initial populations of the genetic algorithm are produced by reinforcement learning and, in return, the result of the evolution of the genetic algorithm is injected into an episode of reinforcement learning.

Figure 5. Illustration of the proposed interaction between reinforcement learning and genetic algorithm. At each reinforcement learning episode, the actions taken by the agent are converted into the chromosome (having each action as a gene) of an individual of the initial population of the genetic algorithm, whose size n is predefined. After n episodes (and thus n individuals in the initial population), this population evolves for an also predefined number of generations through the genetic algorithm. Then, the most adapted individual of the last generation is obtained; in the end, that individual's chromosome genes are used as actions in the next episode of reinforcement learning.
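This interaction can be summarized in a short sketch. Below, run_episode is an assumed Q-learning episode runner that returns the action sequence taken (optionally scripted by a guiding chromosome), and the GA operators (permutation-preserving crossover, swap mutation, elitist selection) are simplified placeholders rather than the exact configuration described in the supplementary material:

import random

def evolve(population, fitness, generations, elite_frac=0.2):
    # Elitist GA: keep the best individuals, refill with crossover + mutation.
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[:max(2, int(elite_frac * len(ranked)))]
        children = []
        while len(elite) + len(children) < len(population):
            p1, p2 = random.sample(elite, 2)
            cut = random.randrange(1, len(p1))
            child = p1[:cut] + [g for g in p2 if g not in p1[:cut]]  # keeps a valid permutation
            if random.random() < 0.1:                                # swap mutation
                i, j = random.sample(range(len(child)), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        population = elite + children
    return max(population, key=fitness)

def hybrid_training(env, fitness, pop_size, generations, episodes):
    # Approach 3.1 skeleton: episodes fill a GA population; the evolved best
    # chromosome scripts the actions of the episode that follows.
    population, guide = [], None
    for _ in range(episodes):
        population.append(run_episode(env, guide))   # one chromosome per episode
        guide = None
        if len(population) == pop_size:
            guide = evolve(population, fitness, generations)
            population = []
    return guide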
Since all the aforementioned approaches share the common basis of the Q-learning algorithm, Figure 6 presents a flowchart showing the updates proposed by each of them. The elements highlighted in gray represent the procedures and conditions of the Q-learning algorithm, which, briefly, is based on choosing an action, taking it, and updating the Q-values in the Q-table over several episodes, each starting in the initial state and ending when an absorbing state is reached.

As a consequence, these elements (in gray) fully represent approaches 1.1, 1.2, 1.3, and 1.4, as their proposal is focused on replacing the reward system of the seminal approach. Approaches 2 and 3.1 complement this common basis with specific procedures, namely the insertion of the dynamic pruning mechanism in approach 2 (represented by the dashed element with double edges) and the introduction of the mutual contribution with the genetic algorithm to improve the agent's exploration performance (represented by the dashed elements with single edges).

Figure 6. Flowchart representing approaches 1.1, 1.2, 1.3, 1.4, 2 and 3.1: approaches 1.1, 1.2, 1.3, and 1.4 are defined by the elements in gray, approach 2 by the dashed element with double edges, and approach 3.1 by the dashed elements with single border.

Approach 3.2: Evolutionary-based assembly
Finally, to estimate the contribution of the genetic algorithm to approach 3.1, which applies a mutual collaboration between reinforcement learning and the genetic algorithm, the assembly performance of the genetic algorithm was evaluated separately, following the same configuration as the previous approach but, this time, adopting as starting population a set of individuals whose chromosomes were built from random permutations (without repetition) of reads.

Datasets
To assess the performance of all approaches (including the seminal one), in addition to the 18 datasets proposed and made available by Xavier et al., 5 novel datasets derived from other microgenomes extracted from the genome of the previous studies were created here. These microgenomes are not arbitrary genome fragments, as the previously used microgenomes (of 25 bp and 50 bp) were, but represent larger fragments of previously annotated genes from the corresponding organism (i.e. E. coli). The experiments were then carried out on 23 datasets, whose microgenome sizes, numbers of reads and read sizes are presented in Table 1; the last 5 rows correspond to the 5 datasets derived from genes.

As each of these datasets corresponds to a reinforcement learning environment, an environment for each of them was created with the OpenAI Gym toolkit, in order to share such reinforcement learning challenges in a simple way. These environments are provided at http://github.com/kriowloo/gymnome-assembly (see Section 1 of the supplementary material for additional technical information) and use the reward system proposed in approach 1.4. The identification name of each environment is presented in the last column of Table 1. The seminal reward system is also implemented and available; to run it, use version 1, replacing v2 by v1 in the environment name.
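Assuming the repository registers the environments with the classic Gym API, a dataset can be exercised with a few lines. The registration import below is hypothetical (the actual module name is documented in the repository README), and the agent here is a random policy used only to illustrate the interaction loop:

import gym
import gymnome_assembly  # hypothetical import; see the repository README

env = gym.make("GymnomeAssembly_25_10_8-v2")   # environment names follow Table 1
state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()          # random agent, for illustration
    state, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)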
Two experiments were then carried out to evaluate the approaches. In each experiment, 20 successive runs of each evaluated approach were performed for all 23 datasets, totaling 460 runs per approach. Given that the approaches have different levels of complexity, the real execution time of each approach was considered when comparing them. To reduce the interference of external factors on execution time, all experiments were individually and sequentially performed on the same station (with Ubuntu 16.04, on an AWS EC2 instance of the r5a.large type, dual core, with 16 GB RAM and 30 GB of storage).

In the first experiment, here referred to as Experiment A, the objective was to verify the impact of progressively including the previously described strategies. For this, the performance of the seminal approach (as defined by Bocicor et al.) was evaluated against approaches 1.1, 1.2, 1.3 and 1.4 (which modify the reward system), 2 (which adds the dynamic pruning) and 3.1 (which uses the GA as a complementary factor).
In the second experiment, referred to as Experiment B, the main objective was to compare the performance of the new RL-based approaches against the performance of the GA alone. Therefore, in addition to approaches 1.1, 1.2, 1.3, 1.4, 2 and 3.1, approach 3.2 (which uses the GA alone) was run for an equivalent time.

For performance measurement in each experiment, two percentage measures were calculated, called the distance-based measure (DM) and the reward-based measure (RM). Evaluations of de novo assembly are commonly performed using dedicated metrics, such as the N50. These metrics were created because, as previously indicated, de novo assemblies are not supported by a reference genome; in some scenarios, it is thus not possible to accurately assess the results obtained by assemblers, as the optimal output is unknown. Here, although a de novo assembler is evaluated, its assessment environment is restricted and the target genomes are known, which allows the use of specific (and exact) evaluations such as the DM and RM metrics.

DM considers a run successful when the consensus sequence resulting from the order of reads produced in that run is identical to the expected sequence. RM, in turn, considers a run successful when the proposed order of reads presents a sum of PM_norm higher than or equal to the sum of PM_norm of the optimal read sequence (for details, see Section 3 of the supplementary material).
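Both measures reduce to simple per-run checks. The sketch below assumes a pm_norm helper implementing the normalized overlap sum of the supplementary material; DM compares the resulting consensus with the known target, while RM compares reward sums:

def dm_success(assembled_genome, target_genome):
    # DM: the consensus built from the proposed read order must equal the target.
    return assembled_genome == target_genome

def rm_success(proposed_order, optimal_order, pm_norm):
    # RM: the proposed order's normalized overlap sum must reach the optimum's.
    return pm_norm(proposed_order) >= pm_norm(optimal_order)

def success_rate(runs, check):
    # Percentage of successful runs, as reported in Tables 2 and 3.
    return 100.0 * sum(1 for r in runs if check(r)) / len(runs)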
Results

The results obtained from Experiment A are presented in Table 2, where it is possible to observe that the seminal approach, besides consuming the longest running time (23 hours and 34 minutes), also presented the lowest average performances, obtaining an optimal response in only 16.96% of the runs (i.e. 78 out of 460 executions) in terms of distance between the produced and the expected genome (DM) and in 21.30% (98 out of 460) in terms of maximum reward (RM). This difference stems from the previously mentioned inconsistency of the proposed reward system, which allowed non-optimal permutations to produce maximum accumulated rewards.

Beyond that, following the updates to the reward system, the DM and RM performances of approaches 1.2, 1.3, and 1.4 surpassed those of the seminal approach, while also consuming about 4 hours less (19 hours and 38 minutes of execution time).
Microgenome size   Number of reads   Read size   Env. name
25                 10                8           GymnomeAssembly_25_10_8-v2
25                 10                10          GymnomeAssembly_25_10_10-v2
25                 10                15          GymnomeAssembly_25_10_15-v2
50                 10                8           GymnomeAssembly_50_10_8-v2
50                 10                10          GymnomeAssembly_50_10_10-v2
50                 10                15          GymnomeAssembly_50_10_15-v2
25                 20                8           GymnomeAssembly_25_20_8-v2
25                 20                10          GymnomeAssembly_25_20_10-v2
25                 20                15          GymnomeAssembly_25_20_15-v2
50                 20                8           GymnomeAssembly_50_20_8-v2
50                 20                10          GymnomeAssembly_50_20_10-v2
50                 20                15          GymnomeAssembly_50_20_15-v2
25                 30                8           GymnomeAssembly_25_30_8-v2
25                 30                10          GymnomeAssembly_25_30_10-v2
25                 30                15          GymnomeAssembly_25_30_15-v2
50                 30                8           GymnomeAssembly_50_30_8-v2
50                 30                10          GymnomeAssembly_50_30_10-v2
50                 30                15          GymnomeAssembly_50_30_15-v2
381                20                75          GymnomeAssembly_381_20_75-v2
567                30                75          GymnomeAssembly_567_30_75-v2
726                40                75          GymnomeAssembly_728_40_75-v2
930                50                75          GymnomeAssembly_930_50_75-v2
4224               230               75          GymnomeAssembly_4224_230_75-v2
Table 1. Datasets used in the experiments. The first column shows the size (in bp) of the microgenome used to generate the reads of each set; the second column shows the number of reads generated; the third column shows the size of the generated reads; and the fourth column shows the name of the environment built for each set in the OpenAI Gym toolkit (accessible at http://github.com/kriowloo/gymnome-assembly).

Part of these gains is due to the improved agent performance on one of the sets in which the sum of rewards for the optimal permutation of reads was not maximal under the previous reward system (as presented in Figure 3). Despite the gains obtained from the updated reward system, the results show that the previously mentioned inconsistencies were not completely resolved: in some of the datasets, the agent reached and even surpassed the maximum expected accumulated reward without obtaining the target genome.

A minor improvement is observed with approach 2, which incorporates the dynamic pruning and presents a performance slightly superior to that of approaches 1 while requiring approximately one hour less of processing. The highlight of this comparison, however, is approach 3.1, which, benefiting from the environment exploration improved by the GA, reached a significantly improved result in a reduced execution time.
Experiment A    Seminal   Appr. 1.1   Appr. 1.2   Appr. 1.3   Appr. 1.4   Appr. 2   Appr. 3.1
Average DM      16.96%    9.57%       18.48%      20.00%      20.43%      20.65%
Average RM      21.30%    13.70%      21.30%      24.35%      24.78%      25.00%
Total runtime   23h34m    19h38m      19h38m      19h38m      19h38m      18h41m    17h03m
Table 2. Results of Experiment A, which compares the performances of agents trained with different reinforcement learning strategies. The performance of each approach is expressed using the distance-based (DM) and reward-based (RM) metrics (see Methods for details).

In order to analyze the contribution of each technique to the performance gain of approach 3.1, a second experiment was performed, here referred to as Experiment B, comparing the performance of the RL-based approaches (approaches 1.4, 2 and 3.1) with the isolated use of the GA with random initial populations (approach 3.2). The results of Experiment B are presented in Table 3.
Experiment B    Appr. 1.4   Appr. 2   Appr. 3.1   Appr. 3.2
Average DM      13.91%      12.39%    14.78%
Average RM      17.61%      16.30%    14.78%
Total runtime   01h36m      01h36m    01h42m      01h34m
Table 3. Experimental performances considering similar running times. Performances are expressed using the distance-based measure (DM) and the reward-based measure (RM) (see Methods).

Given the remarkable performance of approach 3.2, Experiment B used as its time reference the time taken by the GA to find an optimal solution in terms of RM for 22 out of the 23 datasets used (i.e. 95.65% of the datasets of Experiment B). For the remaining dataset, the largest one, the GA was additionally run for a much longer time (37h58m; see Figure 7). Even with this expressive increase in execution time, no optimal solution was reached in any of the 20 runs for this dataset. However, as presented in Figure 7, it is possible to observe a consistent gain in performance, both in terms of RM (which was higher in all longer runs) and DM (which presented a shorter distance for the longer runs).
Figure 7.
Performances obtained by the GA in experiments with short (1h34m) and long (37h58m) execution times, in terms of sums of rewards and of distances between obtained and expected genomes.
Discussion
As mentioned, genome assembly is among the most complex problems confronted by computer scientists in the context of genomics projects, notwithstanding the importance of its results for scientific development. In computational terms, this complexity places the problem of finding an optimal permutation of sequenced reads that reaches the target genome in the class of problems called NP-hard, which comprises the most difficult problems in computer science.

This high complexity is particularly expressed in the vast space of states required to represent the assembly problem under the modeling proposed by the seminal approach and, consequently, under the approaches proposed here. To achieve the optimal solution for sets of only 30 reads in the seminal approach, for example, the agent would have to explore a state space composed of approximately
2 x 10^44 states (a number that exceeds the estimated number of stars in the universe and the number of sand grains on Earth). It is also worth noting that, in real-world scenarios, it is common to sequence millions of reads per genome, which increases the corresponding state space even further.

The approaches proposed in this study aimed to expand the agent's learning based on two difficulties observed in the seminal approach for applying reinforcement learning to the genome assembly problem: (1) the reward system and (2) the agent's exploration strategies. Both the updates to the reward system and the incorporation of new exploration strategies improved the agent's learning performance, as demonstrated by the results.

The definition of extrinsic rewards is one of the most challenging tasks in the construction of environments suitable for agent learning. The updates to the reward systems proposed here favored improved learning. However, they are not yet an optimal solution to the problem, as some non-optimal solutions still produced maximum accumulated rewards in some datasets. This explains the occurrence of RM percentages higher than DM percentages in several experiments.

The dynamic pruning mechanism showed a discreet improvement in the learning process. However, the relationship between the additional processing cost of this mechanism and the benefit obtained from it did not indicate a reasonable net gain from its use as a workaround for the problem arising from the high dimensionality of the state space.

In this sense, the RL strategy combined with the GA in the hybrid approach presented a much more attractive performance for supporting the exploration process in the long term. This combination proved to be advantageous, probably because of the curse of dimensionality faced by the Q-learning algorithm, as the GA strongly supported the agent in conducting the RL exploration.

Despite the performance improvements, the insufficiency of the approaches when applied to real-world scenarios remains evident. This insufficiency is most evident in the experiments performed with the largest dataset analyzed, whose size corresponds to a gene of approximately 4 Kbp. Despite being the largest dataset employed, this gene remains much smaller than the smallest genomes of living organisms.
None of the applied approaches presented an optimal response in this scenario, not even the isolated genetic algorithm running for a time far longer than that applied in the tests with the other datasets.

This finding, together with the superior results obtained by the GA alone, allows us to conclude that applying the Q-learning algorithm to solve the genome assembly problem by searching for an optimal read permutation, as originally proposed in the seminal approach, is infeasible. However, given the absence of other approaches in the literature tackling the problem through reinforcement learning, and considering the promising results obtained with RL in other domains, especially when combined with deep learning, further investigations of the applicability of reinforcement learning are required, including the use of different modeling approaches and algorithms.

Considering the importance of reproducibility in scientific studies, and with the special intention of supporting future investigations, all experiments performed in this study, as well as the reinforcement learning environments applied to simulate the assembly problems, are available at https://osf.io/tp4zj/files and are open for reuse (for details on how to reproduce the experiments, see Section 5 of the supplementary material).

One of the major challenges for applying reinforcement learning to real-world problems is the low sample efficiency of the algorithms, and the genome assembly problem treated here is no different. The time required by the agent trained with the Q-learning algorithm to reach an optimal solution reveals the need for numerous interactions with the data. Considering that real-world inputs are even bigger than those experimentally applied here, a sample-efficient algorithm would be required to support the development of solutions applicable to real-world instances of the problem.

One aspect that directly interferes with the agent's sample efficiency is the optimization of state-space exploration. In this sense, the use of intrinsic motivation could alternatively be investigated to bypass the exploration problem, given the high dimensionality of the proposed space of states.

As previously mentioned, it is important to continue investigating updates to, and replacements for, the Q-learning algorithm, as well as the use of distributed approaches and/or algorithms using eligibility traces. Although simple, Q-learning is a very powerful reinforcement learning method and remains in use in real-world applications. As with other RL methods, however, its use in real-world problems with very vast state spaces is not recommended. The use of approximation methods for the Q function may prove opportune in such cases, especially given the promising results obtained in other domains through the combination of RL with deep learning.

Although it still faces obstacles to implementation in commercial applications, it is reasonable to consider the recent achievements of deep reinforcement learning in games, which present a computational problem equivalent to that of genome assembly. One of the great challenges for producing equivalent proposals for the assembly problem is that several promising works apply convolutional neural networks, which use images as inputs.

The transformation of real problems into games is a possibility for reusing the promising technology developed for games.
In this sense, modeling the genome assembly problem as a game can work as an alternative representation of the problem and may reduce the space of states explored by the agent. Representing the assembly problem as a maze with multiple targets/objectives could be an example of such a modeling approach.

One of the main benefits of representing the problem as a game is the reduction of the space of actions, which, in the modeling approach proposed here, increases with the number of reads. Mnih et al., for example, achieved a common deep reinforcement learning architecture capable of solving several Atari games, and it is remarkable that the best performances were achieved in games with fewer actions, such as Breakout, which requires only two actions (move right or left). In this game, the agent was able to learn, from its own performance and without instruction, that producing a gap is a valid strategy for optimizing its results (illustrated in Figure 8).
Figure 8. Example of the application of deep reinforcement learning in games where the agent was able to find strategies to optimize accumulated rewards. Here, the agent discovered that carving a tunnel could maximize the reward gain (images adapted from a video demonstration of learning progress).

The use of graph embeddings may act as a modeling option allowing the use of deep reinforcement learning without requiring the conversion of the problem into an image, especially considering that the genome assembly problem may be represented through a graph, in the shape of the Traveling Salesman Problem (TSP). As presented by Vesselinova et al., numerous studies have investigated the application of deep reinforcement learning to graph problems, including the TSP.

Finally, one other aspect to be considered before the adoption of reinforcement learning for the genome assembly problem is the generalization of the agent's learning, a major challenge for the use of RL in real-world problems. As designed in the RL environment for the genome assembly problem, the learning acquired by the agent when assembling one set of reads will hardly be applicable to the assembly of a new set.

References
Brief. Bioinforma.
DOI: 10.1093/bib/bby072 (2018). Li, Z. et al.
Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph.
Briefings Funct. Genomics , 25–37, DOI: 10.1093/bfgp/elr035 (2011). Manzoni, C. et al.
Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences.
Briefings Bioinforma. , 286–302, DOI: 10.1093/bib/bbw114 (2016). Paszkiewicz, K. & Studholme, D. J. De novo assembly of short sequence reads.
Briefings Bioinforma. , 457–472, DOI:10.1093/bib/bbq020 (2010). Heather, J. M. & Chain, B. The sequence of sequencers: The history of sequencing DNA.
Genomics , 1–8, DOI:10.1016/j.ygeno.2015.11.003 (2016). Rodríguez-Ezpeleta, N., Hackenberg, M. & Aransay, A. M. (eds.)
Bioinformatics for High Throughput Sequencing (Springer New York, 2012). Portin, P. & Wilkins, A. The evolving definition of the term “gene”.
Genetics , 1353–1364, DOI: 10.1534/genetics.116.196956 (2017). Ji, P., Zhang, Y., Wang, J. & Zhao, F. MetaSort untangles metagenome assembly by reducing microbial communitycomplexity.
Nat. Commun. , DOI: 10.1038/ncomms14306 (2017). Wong, H. L., MacLeod, F. I., White, R. A., Visscher, P. T. & Burns, B. P. Microbial dark matter filling the niche inhypersaline microbial mats.
Microbiome , DOI: 10.1186/s40168-020-00910-0 (2020). Medvedev, P., Georgiou, K., Myers, G. & Brudno, M. Computability of models for sequence assembly. In
Lecture Notes inComputer Science , 289–301, DOI: 10.1007/978-3-540-74126-8_27 (Springer Berlin Heidelberg, 2007).
11. Pop, M. Genome assembly reborn: recent computational challenges. Briefings Bioinforma., 354–366, DOI: 10.1093/bib/bbp026 (2009).
12. Böckenhauer, H.-J. & Bongartz, D. DNA sequencing. In Algorithmic Aspects of Bioinformatics, 171–209, DOI: 10.1007/978-3-540-71913-7_8 (Springer Berlin Heidelberg, 2007).
13. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 1072–1075, DOI: 10.1093/bioinformatics/btt086 (2013).
14. Bocicor, M.-I., Czibula, G. & Czibula, I.-G. A reinforcement learning approach for solving the fragment assembly problem. In 13th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, DOI: 10.1109/synasc.2011.9 (IEEE, 2011).
15. Henrique, B. M., Sobreiro, V. A. & Kimura, H. Literature review: Machine learning techniques applied to financial market prediction. Expert. Syst. with Appl., 226–251, DOI: 10.1016/j.eswa.2019.01.012 (2019).
16. LeCun, Y. 1.1 deep learning hardware: Past, present, and future. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), DOI: 10.1109/isscc.2019.8662396 (IEEE, 2019).
17. Botvinick, M. et al. Reinforcement learning, fast and slow. Trends Cogn. Sci., 408–422, DOI: 10.1016/j.tics.2019.02.006 (2019).
18. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature, 529–533, DOI: 10.1038/nature14236 (2015).
19. Silver, D. et al. Mastering the game of Go without human knowledge. Nature, 354–359, DOI: 10.1038/nature24270 (2017).
20. Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 350–354, DOI: 10.1038/s41586-019-1724-z (2019).
21. Nian, R., Liu, J. & Huang, B. A review on reinforcement learning: Introduction and applications in industrial process control. Comput. & Chem. Eng., 106886, DOI: 10.1016/j.compchemeng.2020.106886 (2020).
22. Dulac-Arnold, G., Mankowitz, D. J. & Hester, T. Challenges of real-world reinforcement learning. In ICML 2019 Workshop on Reinforcement Learning for Real Life (RLRL) (2019).
23. Vollmers, J., Wiegand, S. & Kaster, A.-K. Comparing and evaluating metagenome assembly tools from a microbiologist’s perspective - not only size matters! PLOS ONE, e0169662, DOI: 10.1371/journal.pone.0169662 (2017).
24. van der Walt, A. J. et al. Assembling metagenomes, one community at a time. BMC Genomics, DOI: 10.1186/s12864-017-3918-9 (2017).
25. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (A Bradford Book, Cambridge, MA, USA, 2018).
26. Cormen, T. H., Leiserson, C. E., Rivest, R. L. & Stein, C. Introduction to Algorithms, Third Edition (The MIT Press, 2009), 3rd edn.
27. Grinstead, C. & Snell, J. Introduction to Probability (American Mathematical Society, 2012).
28. Smith, T. & Waterman, M. Identification of common molecular subsequences. J. Mol. Biol., 195–197, DOI: 10.1016/0022-2836(81)90087-5 (1981).
29. Xavier, R., de Souza, K. P., Chateau, A. & Alves, R. Genome assembly using reinforcement learning. In Kowada, L. & de Oliveira, D. (eds.) Advances in Bioinformatics and Computational Biology, 16–28 (Springer International Publishing, 2020).
30. Trott, A., Zheng, S., Xiong, C. & Socher, R. Keeping your distance: Solving sparse reward tasks using self-balancing shaped rewards. In Wallach, H. M. et al. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, 10376–10386 (2019).
31. Zahavy, T., Haroush, M., Merlis, N., Mankowitz, D. J. & Mannor, S. Learn what not to learn: Action elimination with deep reinforcement learning. In Bengio, S. et al. (eds.) Advances in Neural Information Processing Systems 31, 3562–3573 (Curran Associates, Inc., 2018).
32. Dulac-Arnold, G. et al. An empirical investigation of the challenges of real-world reinforcement learning. CoRR abs/2003.11881 (2020).
33. de Wiele, T. V., Warde-Farley, D., Mnih, A. & Mnih, V. Q-learning in enormous action spaces via amortized approximate maximization. CoRR abs/2001.08116 (2020).
34. Baluja, S. & Caruana, R. Removing the genetics from the standard genetic algorithm. In Proceedings of ICML’95, 38–46 (Morgan Kaufmann Publishers, 1995).
35. Konar, A. Evolutionary computing algorithms. In Computational Intelligence, 323–351, DOI: 10.1007/3-540-27335-2_12 (Springer Berlin Heidelberg, 2005).
36. Gimelfarb, M., Sanner, S. & Lee, C.-G. Epsilon-BMC: A Bayesian ensemble approach to epsilon-greedy exploration in model-free reinforcement learning. In Adams, R. P. & Gogate, V. (eds.) Proceedings of Machine Learning Research, vol. 115, 476–485 (PMLR, Tel Aviv, Israel, 2020).
37. Peterson, E. J. & Verstynen, T. D. A way around the exploration-exploitation dilemma. bioRxiv DOI: 10.1101/671362 (2019).
38. Oliveira, R. R. M. et al. GAVGA: A genetic algorithm for viral genome assembly. In Oliveira, E. C., Gama, J., Vale, Z. A. & Cardoso, H. L. (eds.) Progress in Artificial Intelligence - 18th EPIA Conference on Artificial Intelligence, EPIA 2017, Porto, Portugal, September 5-8, 2017, Proceedings, vol. 10423 of Lecture Notes in Computer Science, 395–407, DOI: 10.1007/978-3-319-65340-2_33 (Springer, 2017).
39. Brockman, G. et al. OpenAI Gym (2016). arXiv:1606.01540.
40. Bradnam, K. R. et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience, DOI: 10.1186/2047-217x-2-10 (2013).
41. Roughgarden, T. Algorithms Illuminated (Part 4): Algorithms for NP-Hard Problems. Algorithms Illuminated (Soundlikeyourself Publishing, LLC, 2020).
42. Yu, Y. Towards sample efficient reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, 5739–5743 (AAAI Press, 2018).
43. Barto, A. G. Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems, 17–47, DOI: 10.1007/978-3-642-32375-1_2 (Springer Berlin Heidelberg, 2012).
44. Chakole, J., Kolhe, M., Mahapurush, G., Yadav, A. & Kurhekar, M. A Q-learning agent for automated trading in equity stock markets. Expert. Syst. with Appl., DOI: 10.1016/j.eswa.2020.113761 (2021).
45. Abdulhai, B., Pringle, R. & Karakoulas, G. J. Reinforcement learning for true adaptive traffic signal control. J. Transp. Eng., 278–285, DOI: 10.1061/(asce)0733-947x(2003)129:3(278) (2003).
46. Fjelland, R. Why general artificial intelligence will not be realized. Humanit. Soc. Sci. Commun., DOI: 10.1057/s41599-020-0494-4 (2020).
47. Reis, S., Reis, L. P. & Lau, N. Game adaptation by using reinforcement learning over meta games. Group Decis. Negot. DOI: 10.1007/s10726-020-09652-8 (2020).
48. Cook, W. J. Pushing the Limits, 211–212 (Princeton University Press, 2012).
49. Vesselinova, N., Steinert, R., Perez-Ramirez, D. F. & Boman, M. Learning combinatorial optimization on graphs: A survey with applications to networking. IEEE Access, 120388–120416, DOI: 10.1109/ACCESS.2020.3004964 (2020).
50. Ponsen, M., Taylor, M. E. & Tuyls, K. Abstraction and generalization in reinforcement learning: A summary and framework. In Adaptive and Learning Agents, 1–32, DOI: 10.1007/978-3-642-11814-2_1 (Springer Berlin Heidelberg, 2010).
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) –Finance Code 001.
Author contributions statement
K.P., R.X. and R.A. conceived the experiments; K.P. and R.X. conducted the experiments; K.P., R.A., A.C. and A.R. analysed the results. All authors reviewed the manuscript.
Additional information
Supplementary files: supplementary material and reproduction codes are available at https://osf.io/tp4zj/files.
Competing interests: The authors declare no competing interests.