Deep Reinforcement Learning for Neural Control
Jimin Kim
Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195. [email protected]
Eli Shlizerman
Department of Applied Mathematics, Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195. [email protected]
Abstract
We present a novel methodology for control of neural circuits based on deep reinforcement learning. Our approach achieves aimed behavior by generating external continuous stimulation of existing neural circuits (neuromodulation control) or modulations of neural circuit architecture (connectome control). Both forms of control are challenging due to the nonlinear and recurrent complexity of neural activity. To infer candidate control policies, our approach maps neural circuits and their connectome into a grid-world like setting and infers the actions needed to achieve aimed behavior. The actions are inferred by adaptation of deep Q-learning methods known for their robust performance in navigating grid-worlds. We apply our approach to a model of C. elegans which simulates the full somatic nervous system with muscles and body. Our framework successfully infers neuropeptidic currents and synaptic architectures for control of chemotaxis. Our findings are consistent with in vivo measurements and provide additional insights into neural control of chemotaxis. We further demonstrate the generality and scalability of our methods by inferring chemotactic neural circuits from scratch.
Introduction

The objective of neural control is to manipulate neural circuits to achieve a target function. The target function could take various forms. In its simplest form, the function could be a change in the neural activity of an individual neuron, i.e., suppression or excitation of its membrane potential. For such a target, neural control would typically be achieved by considering an individual neuron, and the action of the control would correspond to external stimulation. Typically, target functions are less precise and of a more implicit nature, e.g., setting the target as a profile that a population of neurons would follow. Such a function would be typical in Brain Machine Interface (BMI) methodologies, which aim to set neural population activity such that it corresponds to desired responses [1–3]. Perhaps the most implicit and challenging aim for neural control, and also the next step for BMI, would be the achievement of a target which is set to be a particular behavior.

Setting the behavior as a target comes with implicitness which reflects the complexity of the neural control. Neural circuits consist of multiple layers that are both static (e.g., connectome) and dynamic (e.g., neural interactions, neuromodulation). These interact together in a complex and nonlinear fashion. Therefore, a reasonable approach toward a first step of achieving a behavioral target could come
in the form of controlling one of these layers at a time: neuromodulation control and connectome control. Neuromodulation control corresponds to modulating neural activity in existing circuits, leading to target behavior by applying continuous stimuli into groups of neurons. Such a control method is of a closed loop type, where it adapts to internal or external states of the circuit. On the other hand, connectome control is control of an architecture, where a particular connectivity realization facilitating the aimed behavioral target is sought. For both types of control, advanced control strategies effectively mapping neural states and dynamics to implicit behavioral targets are needed.

Reinforcement learning based control appears to be the most viable option to be considered for these tasks, since it is generically designed to optimize implicit tasks and is efficient in selection of promising control strategies when implemented as deep reinforcement learning. The approach we present is a top-down approach, where we review the problems in which deep RL succeeds and then formulate the two types of neural control in terms of these problems. In particular, we formulate neural control in terms of grid-world navigation, a typical control example in which RL showed great success. We apply our methods to a full model of the nervous system and body of C. elegans to control both neuromodulation and the connectome to achieve aimed chemotactic behavior. We validate our results against previous findings in vivo and show that the deep RL methodology can serve as a viable neural control framework for generic neural circuits.
Related Work

Both model-based and data-driven methods have been proposed for neuromodulation control of neural circuits. Such methods take neural activity as an input and output an external stimulation to achieve target modulated neural activity. Model-based approaches propose to make use of an underlying simplified model and establish the control to leverage both computational and analytical features of the model. For example, nonlinear control methods for coordination of spike trains in local circuits used a simplified model of neurons to infer external stimuli to adjust oscillations related to animal locomotion patterns [4, 5]. Model-based approaches were also shown to be effective in inference of stimuli to trigger neurons' natural rhythms [6] and in adding feedback to regulate oscillations in large networks [7, 8].

The use of a model is advantageous since it could provide guarantees for observability, controllability and stability of the proposed control. However, such approaches are limited since they require construction of a model. This is not the case in many applications in which circuit dynamics are unknown and the only available data are sampled activities of the neurons. For these scenarios, model-free data-driven neural control methods have been proposed. In particular, deep reinforcement learning methods, such as Deep Deterministic Policy Gradient (DDPG), were developed to control stimuli into multiple neurons in oscillatory spike models and showed equivalent or preferable performance to model-based methods [9]. Another example of applying RL in neural control is deep brain stimulation to suppress neural activity associated with epileptic seizures [10, 11, 2]. In this application, ensemble learning and batch mode Q-learning implemented an adaptive closed loop neuromodulation control to eliminate seizures. The work showed that reinforcement learning methods can achieve successful neuromodulation, and also demonstrated the sensitivity of the methods to various definitions of both state and reward functions. These results warrant a generic framework for RL for neuromodulation control, which we define in our methods.

Connectome control can be approached from model-based and data-driven perspectives as well. Model-based approaches are more common since they incorporate simplifying assumptions regarding the connectivity. For example, it was shown that the phase reduction method can be used to optimize the coupling matrix between a pair of limit-cycle oscillators to synchronize their activity [12]; in addition, connectivity within the antennal lobe in insects was inferred by imposing model constraints [13]. Approaches for data-driven connectome control are scarcer due to limited knowledge of the connectivity of many circuits, and even more so as the application of connectome modulation requires advanced biological means. Recent progress in neuroimaging and bioengineering suggests that such capabilities are becoming available. Connectomes of several organisms have been fully or partially resolved and many smaller circuits have been mapped [14–18]. For connectome modulation, until recently, the available tools were limited to ablations of neurons or synapses in the connectome [19–22].
However, recent optogenetic methods and methods to genetically edit the connectome have been proposed and show the promise of synthesizing neural circuits with a variety of architectures [23–28]. Notably, while connectome control would correspond to typically discrete modulation, it would nevertheless require advanced nonlinear dynamic control methods rather than static methods. The reason is that neural activity supported by the connectome and its relation to the target function are nonlinear dynamic processes [29]. The complexity is evident from the
C. elegans nervous system, whose connectome was fully mapped; however, ablations of the connectome cause outcomes that are difficult to predict without examination of neural activity and behavior [30–32].

As the complexity of data-driven neural control becomes daunting, a deep reinforcement learning based control strategy stands out as the most plausible and general candidate to address these complexities. There have been a number of simple yet successful control problems in the field of deep RL that can be reformulated as either neuromodulation control or connectome control.
Grid-world navigation is one such example, which shares several aspects with neural control in that its target function can be illustrated as an episodic event. In the methods section below, we develop this analogy further by employing a grid-world like learning environment as a template for building neural control state, action and reward functions.
Methods

Agents in Reinforcement Learning. The goal of the agent in reinforcement learning is to learn the best action given the environment to maximize scalar rewards. This interaction of the agent and the environment is formalized as a Markov Decision Process (MDP), i.e., as a tuple $[S, A, T, r, \gamma]$ governed by

$$T(s, a, s') = P[S_{t+1} = s' \mid S_t = s, A_t = a],$$

where $S$ is the finite set of states, $A$ is the finite set of actions and $T$ is the stochastic transition function from state $s$ to $s'$ given the action $a$. $r(s, a)$ is the scalar reward function obtained at time $t+1$ given the state $s$ and action $a$, with a discount factor $\gamma \in [0, 1]$. For each state $S_t$ at time $t$, the selection of the action is given by a policy $\pi(s, a)$ which maps each state to a particular action. The discounted reward is defined as $R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k(s_k, a_k)$ and corresponds to the discounted sum of future rewards gained by the agent. The goal of the agent is to learn the optimal policy $\pi^*(s, a)$ which maximizes $R_t$ for all $t$.

An effective way to learn such a policy is to formulate it as a function of the discounted reward for each of the state-action pairs, i.e., the q-values,

$$q_\pi(s, a) = \mathbb{E}_\pi[R_t \mid S_t = s, A_t = a].$$

By setting $\pi = \pi^*$, one can then learn the optimal policy $\pi^*$ by learning the q-values for every state-action pair. Once the q-values are learned, the optimal policy at a given state is to choose the action with the highest value with probability $(1 - \epsilon)$, where $\epsilon$ is the exploration factor. Such selection of the action is called $\epsilon$-greedy action.
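As a concrete illustration, $\epsilon$-greedy selection over q-values takes only a few lines of Python (a minimal sketch; the function name and the use of NumPy are our choices, not from the paper):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Select a random action with probability epsilon, else the highest-value action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: 3 actions, 10% exploration.
rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.2, 1.5, -0.3]), epsilon=0.1, rng=rng)
```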
Deep Q-Learning (DQN). Learning q-values becomes intractable for large state and action spaces. DQN was introduced to address this issue by estimating q-values with deep neural networks [33]. In DQN, the agent is a multi-layered deep neural network which takes $S_t$ as the input and outputs q-values for the possible actions. The DQN agent is trained as follows: for each step, the state $S_t$ is fed into the input layer and the agent outputs the action $A_t$ by following the $\epsilon$-greedy principle. The experience is then defined as the tuple $[S_t, A_t, R_t, S_{t+1}]$, where $R_t$ is the reward corresponding to action $A_t$ and $S_{t+1}$ is the new state. The experience is then stored in the experience replay memory, which holds a finite number of past experiences. The parameters of the DQN agent are then trained using gradient descent to minimize the loss

$$L = \left( R_{t+1} + \gamma \max_{a'} q_{\theta^-}(S_{t+1}, a') - q_\theta(S_t, A_t) \right)^2,$$

where $t$ is a randomly selected time point from the experience replay memory. The training process maintains two networks: the evaluation network with parameters $\theta$, used to evaluate the q-values as well as to choose actions, and the target network with parameters $\theta^-$, which is a periodic copy of the evaluation network that helps the DQN agent achieve stable learning of q-values.
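A minimal PyTorch sketch of this loss (our illustration; `q_net` and `target_net` stand for any networks mapping states to q-values, and sampling from the replay memory is assumed to happen elsewhere):

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error on a sampled batch; q_net holds theta, target_net holds theta-."""
    s, a, r, s_next, done = batch                              # tensors from replay memory
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # q_theta(S_t, A_t)
    with torch.no_grad():                                      # target network held fixed
        q_next = target_net(s_next).max(dim=1).values          # max_a' q_theta-(S_t+1, a')
        target = r + gamma * (1.0 - done) * q_next             # bootstrap unless episode ended
    return nn.functional.mse_loss(q_sa, target)
```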
While DQN was a major leap in reinforcement learning, it also had several limitations, leading to extensions proposed to address them. Among these extensions, we use Double DQN, Prioritized experience replay and Dueling DQN as our DQN agent. Each of these extensions brings improvements: resolving the overestimation bias of q-values, more effective sampling of past experiences, and better evaluation of q-values through a decomposition into value and advantage. Since these extensions often result in synergistic effects when combined, we utilize different combinations of them to improve learning of the agent in different scenarios.

Figure 1: Continuous neuromodulation control for existing circuits with a Deep Q-Learning agent and grid-world like environment.
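For concreteness, the value/advantage decomposition used by Dueling DQN can be sketched as a network head (a generic sketch of the standard dueling architecture, not the authors' exact network):

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Standard dueling decomposition: q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    def __init__(self, n_features, n_actions):
        super().__init__()
        self.value = nn.Linear(n_features, 1)               # state value stream V(s)
        self.advantage = nn.Linear(n_features, n_actions)   # advantage stream A(s,a)

    def forward(self, h):
        v = self.value(h)
        a = self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)          # recombine into q-values
```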
Mapping Neural Control to the Grid-world Environment. A core idea of our method is to formulate the neural control problems as grid-world like problems. In its simplest form, a grid-world consists of spatial grids where the coordinates of each grid represent a state, with an action space of size 4 allowing for going [Up, Down, Left, Right]. At $t = 0$, the agent is placed at the starting grid point and needs to learn the optimal policy to reach the highest reward in a shortest path while avoiding obstacles.

Adapting grid-world to the neural control problem has several advantages. Navigating grid-world is perhaps one of the most successful control tasks utilizing reinforcement learning. Deep Q-learning has shown exceptional performance in solving large and complex grid-world problems [34]. They are also easily scalable, easy to implement, and, in general, warrant convergence to a good control policy. Formulating neural control tasks in a grid-world like setting requires careful definition of environmental states and action spaces. We thereby define each component of the control strategy for both of our neural control tasks.

Continuous Neuromodulation Control. The objective of continuous neuromodulation control is to infer a control input into a set of neurons to achieve target behavior. Such behavior is often defined as a sequence of spatial positions $p(t) = (x(t), y(t))$ or as a function of environmental factors experienced by the neural circuits, such as sensory input $I_s(t)$, for $t \in [t_0, T]$. Let us define the current position and the target position at time $t$ as $p(t)$ and $p^*(t)$ respectively. We define the environment state and possible actions at time $t$ as

$$S_t = [d_x(t), d_y(t), I_s(t), I_c(t)], \quad A_t = [-\delta, 0, +\delta],$$

where $d_x(t), d_y(t)$ are the $x$ and $y$ components of the difference vector $\Delta p(t) = p(t) - p^*(t)$, and $I_s(t)$ and $I_c(t)$ are the environmental stimulation (e.g., sensory input) experienced by the neural circuit and the neuromodulatory control input, respectively, at time $t$. The action space, i.e., the neural control output $A_t$, defines the incremental change to the neuromodulatory control input $\Delta I_c(t)$ at time $t$, where $\delta$ is some small scalar value. Given the definitions of $S_t$ and $A_t$, we define the reward $R_t = r_t(s_t, a_t)$ as follows:

$$r_t(s_t, a_t) = \begin{cases} -\tanh(\alpha \, \Delta d(t+1)) & t < T \\ f(I_s(T)) & t = T \end{cases}$$

where $\Delta d(t+1) = |\Delta p(t+1)| - |\Delta p(t)|$, $\alpha$ is a scalar that adjusts the slope of the $\tanh$ function, and $f$ is an evaluation function that maps $I_s(T)$ to a scalar. The state $S_t$ can be thought of as a grid coordinate in a grid-world at time $t$, where the DQN agent needs to pick which direction to move to get closer to the reward (Figure 1). In our formulation, these directions take the form of increasing, decreasing or not changing the neuromodulatory control input, ensuring continuity while also keeping the action space small and discrete. The reward function is designed such that the agent minimizes the error between the target and the controlled behavior throughout the episode and maximizes the reward at the end of the episode.
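Written out in Python, the per-step reward reads as follows (a direct transcription of the definition above; the function signature and variable names are ours):

```python
import numpy as np

def neuromodulation_reward(dp_t, dp_next, alpha=1.0, terminal=False, I_s_T=None, f=None):
    """Reward from the definition above: dp_t, dp_next are the difference vectors
    dp(t) and dp(t+1); at the final step the evaluation f(I_s(T)) is returned."""
    if terminal:
        return f(I_s_T)                                        # f(I_s(T)) at t = T
    delta_d = np.linalg.norm(dp_next) - np.linalg.norm(dp_t)   # |dp(t+1)| - |dp(t)|
    return -np.tanh(alpha * delta_d)                           # positive if moving closer
```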
Connectome Control. An alternative way to achieve aimed behavior is to modulate the architecture of the circuit. Unlike neuromodulatory control, connectome modulation is static and, once set, does not require the agent. Specifically, the goal of connectome control is to infer the connectome matrix $G^*$ which achieves the target behavior $p^*(t)$ given the sensory input $I_s(t)$ for $t \in [t_0, T]$.

Connectome modulation can be implemented in two forms. One can either start from an existing connectome and make minor changes to the circuit by inserting or deleting connections, or start from a connectome with no connections and attempt to infer the full circuit architecture from scratch. Both of these cases take the form $G^* = G + \Delta G$, where $G$ is the initial connectome matrix and $\Delta G$ records the changes that have been made to the initial connectome. We propose to use a DQN agent to infer $\Delta G$ in a grid-world like setting.

We represent $\Delta G$ as a function of state transition steps similar to a grid-world environment, i.e., $\Delta G = \Delta G_{k_f}$, where $k_f$ is the end step of each episode. We assign an enumeration of all possible neuron pairs within the circuit to serve as transition steps, such that the agent modifies a single connection at a time. The state and the action spaces are then defined as follows:

$$S_k = [\Delta G_k, k, m_k, d_k], \quad A_k = [-1, 0, 1],$$

where $k$ is the step index, i.e., the neuron pair being modified, $\Delta G_k$ is $\Delta G$ at step $k$, $m_k$ is the number of available insertions at step $k$, and $d_k = (k_f - k) - m_k$ is the distance to the trapped state at step $k$. The trapped state is defined as $d_k < 0$ at step $k$, i.e., there are more remaining insertions than transition steps, leading to $m_{k_f} > 0$. $m_k$ monitors whether the agent attempts to insert a new connection when there are no remaining insertions, whereas $d_k$ monitors whether there will be excess insertions by the end of the episode. The action space is a triplet selection between -1, 0 and 1, where -1 means deletion, 0 means no change and 1 means insertion of a new connection with some arbitrary weight $\delta$. The reward function $R_k = r_k(s_k, a_k)$ is defined as follows:

$$r_k(s_k, a_k) = \begin{cases} +0.1 & m_{k+1} \ge 0 \text{ and } d_{k+1} \ge 0, \; |a_k| = 1 \\ -0.1 & m_{k+1} < 0 \text{ or } d_{k+1} < 0, \; |a_k| = 1 \\ f(\Delta G_{k_f}) & k = k_f \end{cases}$$

where $f$ is an evaluation function that maps $\Delta G_{k_f}$ to a scalar. A small positive reward is given to a valid action which doesn't attempt to insert when $m_k = 0$ and whose insertion doesn't lead to the trapped state. A negative reward is given to an invalid action which either attempts to insert when $m_k = 0$ or leads to the trapped state. The formulation is analogous to an agent navigating a grid-world with random obstacles within a limited number of steps, where the grids are replaced with different realizations of the connectomes [35]. The reward function intends to guide the DQN agent to make modifications to the connectome at appropriate steps, leading to maximum reward at the end step while avoiding inserting either too many or too few connections. Notably, connectome control introduces a vast combinatorial complexity in the possible configurations of the architecture, and without an effective method such as DQN it becomes an intractable problem; e.g., a 16 neuron circuit with 8 insertions corresponds to ~840 billion variations.
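The step reward, and the combinatorial count quoted above, can be checked with a few lines of Python (our transcription of the reward; treating the no-change action as reward 0 is our reading of the definition):

```python
from math import comb

def connectome_reward(a_k, m_next, d_next, is_final=False, dG_final=None, f=None):
    """Step reward from the definition above; f scores the final connectome change."""
    if is_final:
        return f(dG_final)                         # f(dG_{k_f}) at the end step
    if abs(a_k) == 1:                              # only insert/delete actions are shaped
        return 0.1 if (m_next >= 0 and d_next >= 0) else -0.1
    return 0.0                                     # no-change action

# Search-space size quoted above: 8 insertions among the C(16, 2) = 120
# possible pairs of a 16 neuron circuit.
print(comb(comb(16, 2), 8))   # 840261910995, i.e., ~840 billion variations
```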
Results

Environment Modeling Setup. The DQN methodology does not constrain the environment at all. As an example, we use the computational neuro-mechanical model of C. elegans to set up our experiments. The model consists of a simulation of full nervous system activity together with the body [31, 32]. In order to mimic a typical chemotaxis environment, we extend the model to support spatial gradient stimuli, which introduce a stimulus into olfactory neurons in a realistic way [36]. For each episode, we place the spatial gradient stimulus on the left side of the spatial plane, and the worm on the right side. At $t = 0$, we initialize the worm with forward motion and simulate until $t$ reaches the final time point. We define the behavior for each episode to be the trajectory of the worm head $s(t) = (x(t), y(t))$ parametrized by time $t$. Here $s(t)$ is a function of the parameters being controlled, such as external control stimulation into a set of neurons or modulated connectomes.
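As a concrete illustration, such a gradient stimulus could be modeled as a smooth bump centered on one side of the plane (a hypothetical Gaussian profile; the shape used in the actual model may differ):

```python
import numpy as np

# A hypothetical spatial gradient with negative values everywhere, placed on the
# left side of the plane; see the results below for why I_s < 0 is used.
def awc_gradient(pos, center=(-1.0, 0.0), amplitude=1.0, width=0.5):
    """Stimulus I_s into olfactory neurons as a function of worm position."""
    r2 = np.sum((np.asarray(pos) - np.asarray(center)) ** 2)
    return -amplitude * np.exp(-r2 / (2.0 * width ** 2))   # I_s < 0 at every point
```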
We apply the methods described in the previous section to the AWC sensory circuit (i.e., AWC sensory neurons and their adjacent neighboring neurons) in the C. elegans nervous system to control chemotaxis behavior. Our results are divided into two parts, for neuromodulation control and connectome control. First, we infer a continuous neuromodulatory control to emulate neuropeptidic currents from AWC sensory neurons to AIA and AIY interneurons to obtain a baseline attraction behavior in response to the AWC spatial gradient. The experiment serves as a validation of literature findings that such currents between AWC and AIA are important in achieving attraction behavior [37]. Next, we apply connectome control to the AWC circuit to modulate the baseline chemotaxis behavior by introducing new electrical synapses, i.e., gap junctions, into the circuit. The results are validated against experimental findings which inserted synthetic gap junctions in particular locations of the AWC circuit and observed a change in chemotaxis behavior [23, 24]. To further test the generality and scalability of connectome control, we then expand it to infer a full connectome which achieves a desired chemotaxis behavior. For each method, we also provide a comparison with other control methods in the supplementary materials.

Figure 2: Connectome control implemented with a Deep Q-Learning agent. The agent chooses a connectome path, simulates the environment and evaluates a reward.

Figure 3: Neuromodulation control in the C. elegans chemotactic circuit to assume a target path of attraction to the AWC gradient stimulus. Top: In the absence of control, the path passively passes through the gradient stimulus. Bottom: Neuromodulation control into AIA and AIY neurons changes the course such that C. elegans is attracted to the center of the AWC gradient stimulus.
Figure 3 (top) shows C. elegans simulated behavior with no control when navigating the AWC stimulus gradient. We construct the AWC stimulus gradient to have a negative distribution (i.e., $I_s < 0$) for all points throughout the space, so that it mimics typical AWC activity during chemotaxis through an odor gradient [36]. In the bottom part, neuromodulation control $I_c$ is applied to AIA and AIY. Both the control input $I_c$ and the signal from the AWC gradient $I_s$ are plotted as functions of time to show the interaction between the two signals, where a larger amplitude of $I_s$ represents the worm getting closer to the target set at the center of the spatial gradient.

From the movement paths, it is clear that neuromodulation control steers the worm toward the AWC gradient, whereas the absence of control leads to the worm escaping the gradient with no interaction. The control input overall maintains a range of ∼ +0. nA but temporarily dips around $t = 10$ s, when the worm encounters a steep gradient slope, i.e., high $|dI_s/dt|$. Remarkably, experiments have reported findings implying such adaptation of neuromodulatory currents in response to a negative derivative of the chemical gradient during chemotaxis [36, 38, 39]. This is particularly surprising since the only constraints provided to the RL agent were the lower and the upper bound for $I_c$.

Having established a baseline chemotaxis behavior to the AWC gradient with neuromodulatory peptidic currents, we proceed to use this baseline, i.e., a closed loop system, for connectome control to introduce new connections into the circuit for novel behavioral output. We apply our method to three experiments in order to measure and test its capability to the full extent.
Baseline behavior modulation. In this scenario we add gap junctions to the existing AWC circuit to modulate its baseline chemotaxis behavior. Specifically, the RL agent is tasked with inserting gap junctions in appropriate locations in the circuit to change the behavior from baseline attraction to repulsion. The target is implicitly set such that, for the given AWC gradient signal $I_s(t)$, the connectome that produces a worm path with large curvature (i.e., a steeper turn) receives a higher reward, as sketched below.

In Figure 4, we show training results for the 2 and 4 insertion scenarios (and the 8 insertion scenario in the supplementary material). Each column shows the learning progress in terms of the number of viable candidates found (i.e., insertions that result in behavior with a score higher than a certain threshold), the movement trajectory with one of these candidates, and a visualization of its circuit architecture. For the 2 insertion scenario, the RL agent consistently finds viable configurations throughout training, as evident from the linear trend in the number of candidates found. One such configuration, which results in behavior very close to the target, connects AWCL with AWCR and AWCR with AIAL. Strikingly, we find that this configuration is confirmed by experiment to indeed cause a behavioral switch from attraction to repulsion in the AWC modulated response [27].

Increasing the available insertions to 4 results in a more challenging search, as it increases both the number of possible configurations and the uncertainty in network activity by introducing more connections. This is indeed reflected in the lower number of candidates throughout training compared to the 2 insertion scenario (Figure 4). Interestingly, we notice that many candidates still retain an insertion between AWC and AIA, implying that the RL agent retains information on which insertions are valuable. We also notice a growing number of candidates that are rather new and unexplored in the literature. Such a candidate is shown in the third column of Figure 4, where, apart from the in-vivo validated insertion between AIAL and AWCL, the other three connections are new and their effects are unknown. This could imply that the RL agent learns how to 'engineer' the circuit such that the combined effects of multiple insertions achieve a target behavior.
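The curvature-based target in this scenario suggests an evaluation function $f$ of the following form (a hypothetical score of total turning along the head path, our illustration rather than the authors' exact evaluation):

```python
import numpy as np

def curvature_score(path):
    """path: (N, 2) array of head positions (x(t), y(t)); higher = steeper turning."""
    v = np.diff(path, axis=0)                            # velocity vectors along the path
    headings = np.unwrap(np.arctan2(v[:, 1], v[:, 0]))   # heading angle at each step
    return float(np.sum(np.abs(np.diff(headings))))      # total turning of the trajectory
```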
Figure 6: Inferring circuit architectures to achieve target chemotactic behavior. Each column shows the connection distribution throughout the neurons in the existing AWC circuit (left), each inferred connection type when the other type was kept static (middle), and inferred circuits of all connection types when initialized with empty circuits (right).

A damaged circuit repair. To further test the robustness of our method, we test whether we can repair a damaged circuit and restore lost functionality by inserting new gap junctions. In particular, we ablate the AIA interneurons from the AWC circuit, leading to an absence of the attraction response [24]. The change in behavior is indeed accurately reproduced in the computational model of C. elegans (Figure 5). We then repeat the same procedure of inserting 2, 4 and 8 gap junctions into the damaged circuit to restore the baseline attraction behavior.

Figure 5 shows training results for the 2 and 4 insertion scenarios (8 in the supplementary material) which lead to behavioral recovery. One of the candidates for the 2 insertion scenario connects AWCL with AIBL and AWCR. Indeed, such a configuration has recently been shown to restore the attraction behavior in the AIA ablated circuit, as the connection bypasses the ablated AIA neurons and lets AWC communicate directly with AIB [24]. We notice that many of the candidates in the 4 insertion scenario continue to connect AWC and AIB, further confirming the essential role of AWC-AIB coupling in behavioral recovery. Interestingly, we obtain a number of candidates that achieve a higher reward for 4 insertions than for 2 insertions. One possible explanation is that since AIA is one of the most connected neurons in the circuit, it is easier for the RL agent to find candidates with 4 insertions than with 2 to compensate for the lost connections.
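In matrix form, such an ablation amounts to disconnecting the targeted neurons (a minimal sketch under the assumption that the circuit is represented as a connectome matrix, as in the Methods; the index choices are hypothetical):

```python
import numpy as np

def ablate(G, ablated):
    """Return a copy of connectome matrix G with the given neurons disconnected."""
    G = np.array(G, dtype=float)
    G[ablated, :] = 0.0   # remove outgoing connections
    G[:, ablated] = 0.0   # remove incoming connections
    return G
```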
Full circuit architecture inference. Being able to insert new connections into an existing circuit implies that the same can be done with a circuit with no connections, i.e., building the connectome from scratch. In this case, the action space of the RL agent turns into a discrete set of numbers, where each action is a synaptic weight to be added to a particular pair of neurons, allowing the agent to infer the full circuit connectivity. To test this extension, we designed three experiments with the same objective of inferring a circuit that can produce the baseline attraction behavior as in the existing AWC circuit with neuromodulation control. Specifically, experiment 1 was tasked with inferring both gap and synaptic connectomes from scratch, whereas experiments 2 and 3 were each tasked with inferring a single connection type, the synaptic or the gap connectome, while keeping the other type of connectome intact. In order to compare the distribution of inferred connections with that of the existing circuit, we set the number of insertions for each type of connectome to be the same as in the existing connectomes.

Figure 6 shows connection distributions for the existing connectomes and for inferred circuits that successfully produce the baseline attraction behavior (averaged over the 10 best candidates). We notice that the connection distributions of inferred connectomes are generally different from those of the existing connectomes, even though inferring all connection types yields circuits more similar to the existing circuit than inferring a single connection type. This implies the existence of several circuit realizations that facilitate attraction. However, inferred circuits also highlight some of the hub neurons found in existing circuits, such as ASI neurons in the gap connectome. It remains an open question why the inferred connectomes do not necessarily converge to that of the existing circuit. One possible explanation is that the inferred connectomes are optimized for a single behavior type, whereas the existing circuit serves multiple behaviors. There is a large body of evidence showing that the AWC circuit partakes in sensory functions other than just olfactory [40–42]. Thus, while there may exist many circuit realizations that facilitate a single behavior type, as our method suggests, they might not produce sensory functions in other areas.

In summary, for both continuous neuromodulatory control and connectome control, our methods adapt a grid-world like setting and were able to successfully control the circuit of interest as well as to infer a full circuit architecture from scratch to achieve aimed behavior.
References

[1] Justin C Sanchez and José C Principe. Brain-machine interface engineering. Synthesis Lectures on Biomedical Engineering, 2(1):1–234, 2007.
[2] Joelle Pineau, Arthur Guez, Robert Vincent, Gabriella Panuccio, and Massimo Avoli. Treating epilepsy via adaptive neurostimulation: a reinforcement learning approach. International Journal of Neural Systems, 19(04):227–240, 2009.
[3] Monzurul Alam, Willyam Rodrigues, Bau Ngoc Pham, and Nitish V Thakor. Brain-machine interface facilitated neurorehabilitation via spinal stimulation after spinal cord injury: Recent progress and future perspectives. Brain Research, 1646:25–33, 2016.
[4] Eric Brown, Jeff Moehlis, and Philip Holmes. On the phase reduction and response dynamics of neural oscillator populations. Neural Computation, 16(4):673–715, 2004.
[5] Jordan Snyder, Anatoly Zlotnik, and Aric Hagberg. Stability of entrainment of a continuum of coupled oscillators. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(10):103108, 2017.
[6] Jeff Moehlis, Eric Shea-Brown, and Herschel Rabitz. Optimal inputs for phase models of spiking neurons. Journal of Computational and Nonlinear Dynamics, 2006.
[7] Gábor Orosz, Jeff Moehlis, and Richard M Murray. Controlling biological networks by time-delayed signals. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 368(1911):439–454, 2010.
[8] Per Danzl, João Hespanha, and Jeff Moehlis. Event-based minimum-time control of oscillatory neuron models. Biological Cybernetics, 101(5-6):387, 2009.
[9] BA Mitchell and LR Petzold. Control of neural systems at multiple scales using model-free, deep reinforcement learning. Scientific Reports, 8(1):1–12, 2018.
[10] Gabriella Panuccio, Arthur Guez, Robert Vincent, Massimo Avoli, and Joelle Pineau. Adaptive control of epileptiform excitability in an in vitro model of limbic seizures. Experimental Neurology, 241:179–183, 2013.
[11] Arthur Guez, Robert D Vincent, Massimo Avoli, and Joelle Pineau. Adaptive treatment of epilepsy via batch-mode reinforcement learning. In AAAI, pages 1671–1678, 2008.
[12] Sho Shirasaka, Nobuhiro Watanabe, Yoji Kawamura, and Hiroya Nakao. Optimizing stability of mutual synchronization between a pair of limit-cycle oscillators with weak cross coupling. Physical Review E, 96(1):012223, 2017.
[13] Eli Shlizerman, Jeffrey A Riffell, and J Nathan Kutz. Data-driven inference of network connectivity for modeling the dynamics of neural codes in the insect antennal lobe. Frontiers in Computational Neuroscience, 8:70, 2014.
[14] John G White, Eileen Southgate, J Nichol Thomson, and Sydney Brenner. The structure of the nervous system of the nematode caenorhabditis elegans. Philos Trans R Soc Lond B Biol Sci, 314(1165):1–340, 1986.
[15] Rickard Ignell, Teun Dekker, Majid Ghaninia, and Bill S Hansson. Neuronal architecture of the mosquito deutocerebrum. Journal of Comparative Neurology, 493(2):207–240, 2005.
[16] Chi-Tin Shih, Olaf Sporns, Shou-Li Yuan, Ta-Shun Su, Yen-Jen Lin, Chao-Chun Chuang, Ting-Yuan Wang, Chung-Chuang Lo, Ralph J Greenspan, and Ann-Shyn Chiang. Connectomics-based analysis of information flow in the drosophila brain. Current Biology, 25(10):1249–1258, 2015.
[17] Romain Franconville, Celia Beron, and Vivek Jayaraman. Building a functional connectome of the drosophila central complex. eLife, 7:e37017, 2018.
[18] C Shan Xu, Michal Januszewski, Zhiyuan Lu, Shin-ya Takemura, Kenneth Hayworth, Gary Huang, Kazunori Shinomiya, Jeremy Maitin-Shepard, David Ackerman, Stuart Berg, et al. A connectome of the adult drosophila central brain. bioRxiv, 2020.
[19] Martin Chalfie, John E Sulston, John G White, Eileen Southgate, J Nicol Thomson, and Sydney Brenner. The neural circuit for touch sensitivity in caenorhabditis elegans. Journal of Neuroscience, 5(4):956–964, 1985.
[20] Cornelia I Bargmann, Erika Hartwieg, and H Robert Horvitz. Odorant-selective genes and neurons mediate olfaction in c. elegans. Cell, 74(3):515–527, 1993.
[21] Junko Hara, Carsten T Beuckmann, Tadahiro Nambu, Jon T Willie, Richard M Chemelli, Christopher M Sinton, Fumihiro Sugiyama, Ken-ichi Yagami, Katsutoshi Goto, Masashi Yanagisawa, et al. Genetic ablation of orexin neurons in mice results in narcolepsy, hypophagia, and obesity. Neuron, 30(2):345–354, 2001.
[22] Samuel H Chung and Eric Mazur. Femtosecond laser ablation of neurons in c. elegans for behavioral studies. Applied Physics A, 96(2):335–341, 2009.
[23] Ithai Rabinowitch and William R Schafer. Engineering new synaptic connections in the c. elegans connectome. Worm, 4:e992668, 2015.
[24] Ithai Rabinowitch, Bishal Upadhyaya, Aaradhya Pant, and Jihong Bai. Repairing neural damage in a c. elegans chemosensory circuit using genetically engineered synapses. bioRxiv, 2020.
[25] Franciszek Rakowski, Jagan Srinivasan, Paul W Sternberg, and Jan Karbowski. Synaptic polarity of the interneuron circuit controlling c. elegans locomotion. Frontiers in Computational Neuroscience, 7:128, 2013.
[26] Eli Shlizerman. Driving the connectome by-wire: Comment on "what would a synthetic connectome look like?" by Ithai Rabinowitch. Physics of Life Reviews, 2019.
[27] Ithai Rabinowitch. What would a synthetic connectome look like? Physics of Life Reviews, 2019.
[28] Brooke L Sinnen, Aaron B Bowen, Jeffrey S Forte, Brian G Hiester, Kevin C Crosby, Emily S Gibson, Mark L Dell'Acqua, and Matthew J Kennedy. Optogenetic control of synaptic composition and function. Neuron, 93(3):646–660, 2017.
[29] Nancy J Kopell, Howard J Gritton, Miles A Whittington, and Mark A Kramer. Beyond the connectome: the dynome. Neuron, 83(6):1319–1328, 2014.
[30] Hexuan Liu, Jimin Kim, and Eli Shlizerman. Functional connectomics from neural dynamics: probabilistic graphical models for neuronal network of caenorhabditis elegans. Philosophical Transactions of the Royal Society B: Biological Sciences, 373(1758):20170377, 2018.
[31] Jimin Kim, William Leahy, and Eli Shlizerman. Neural interactome: Interactive simulation of a neuronal system. Frontiers in Computational Neuroscience, 13, 2019.
[32] Jimin Kim, Julia A Santos, Mark J Alkema, and Eli Shlizerman. Whole integration of neural connectomics, dynamics and bio-mechanics for identification of behavioral sensorimotor pathways in caenorhabditis elegans. bioRxiv, page 724328, 2019.
[33] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[34] Juan Wu, Seabyuk Shin, Cheong-Gil Kim, and Shin-Dug Kim. Effective lazy training method for deep q-network in obstacle avoidance and path planning. In , pages 1799–1804. IEEE, 2017.
[35] Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. arXiv preprint arXiv:1712.00378, 2017.
[36] Eyal Itskovits, Rotem Ruach, and Alon Zaslaver. Concerted pulsatile and graded neural dynamics enables efficient chemotaxis in c. elegans. Nature Communications, 9(1):1–11, 2018.
[37] Sreekanth H Chalasani, Saul Kato, Dirk R Albrecht, Takao Nakagawa, LF Abbott, and Cornelia I Bargmann. Neuropeptide feedback modifies odor-evoked dynamics in caenorhabditis elegans olfactory neurons. Nature Neuroscience, 13(5):615, 2010.
[38] Fleur L Strand. Neuropeptides: regulators of physiological processes. MIT Press, 1999.
[39] Chiara Salio, Laura Lossi, Francesco Ferrini, and Adalberto Merighi. Neuropeptides as synaptic transmitters. Cell and Tissue Research, 326(2):583–598, 2006.
[40] Matthew Beverly, Sriram Anbil, and Piali Sengupta. Degeneracy and neuromodulation among thermosensory neurons contribute to robust thermosensory behaviors in caenorhabditis elegans. Journal of Neuroscience, 31(32):11718–11727, 2011.
[41] Atsushi Kuhara, Masatoshi Okumura, Tsubasa Kimata, Yoshinori Tanizawa, Ryo Takano, Koutarou D Kimura, Hitoshi Inada, Kunihiro Matsumoto, and Ikue Mori. Temperature sensing by an olfactory neuron in a circuit controlling behavior of c. elegans. Science, 320(5877):803–807, 2008.
[42] Christopher V Gabel, Harrison Gabel, Dmitri Pavlichin, Albert Kao, Damon A Clark, and Aravinthan DT Samuel. Neural circuits mediate electrosensory behavior in caenorhabditis elegans.