Deep learning approaches for neural decoding: from CNNs to LSTMs and spikes to fMRI
Jesse A. Livezey and Joshua I. Glaser
[email protected], [email protected]
* equal contribution
Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States
Redwood Center for Theoretical Neuroscience, University of California, Berkeley, Berkeley, California, United States
Department of Statistics, Columbia University, New York, United States
Zuckerman Mind Brain Behavior Institute, Columbia University, New York, United States
Center for Theoretical Neuroscience, Columbia University, New York, United States
May 21, 2020
Abstract
Decoding behavior, perception, or cognitive state directly from neural signals has applications in brain-computer interface research as well as implications for systems neuroscience. In the last decade, deep learning has become the state-of-the-art method in many machine learning tasks ranging from speech recognition to image segmentation. The success of deep networks in other domains has led to a new wave of applications in neuroscience. In this article, we review deep learning approaches to neural decoding. We describe the architectures used for extracting useful features from neural recording modalities ranging from spikes to EEG. Furthermore, we explore how deep learning has been leveraged to predict common outputs including movement, speech, and vision, with a focus on how pretrained deep networks can be incorporated as priors for complex decoding targets like acoustic speech or images. Deep learning has been shown to be a useful tool for improving the accuracy and flexibility of neural decoding across a wide range of tasks, and we point out areas for future scientific development.
Introduction

Using signals from the brain to make predictions about behavior, perception, or cognitive state, i.e., "neural decoding", is becoming increasingly important within neuroscience and engineering. One common goal of neural decoding is to create brain-computer interfaces, where neural signals are used to control an output in real time. This could allow patients with neurological or motor diseases or injuries to, for example, control a robotic arm or cursor on a screen, or produce speech through a synthesizer. Another common goal of neural decoding is to gain a better scientific understanding of the link between neural activity and the outside world. To provide insight, decoding accuracy can be compared across brain regions, cell types, different types of subjects (e.g., with different diseases or genetics), and different experimental conditions [1–8]. Plus, the representations learned by neural decoders can be probed to better understand the structure of neural computation [9–12]. These uses of neural decoding span many different neural recording modalities and a wide range of behavioral outputs (Fig. 1A).

Within the last decade, many researchers have begun to successfully use deep learning approaches for neural decoding. A decoder can be thought of as a function approximator, doing either regression or classification depending on whether the output is a continuous or categorical variable. Given the great successes of deep learning at learning complex functions across many domains [13–22], it is unsurprising that deep learning has become a popular approach in neuroscience. Here, we will review the many uses of deep learning for neural decoding. We will emphasize how different deep learning architectures can induce biases that can be beneficial when decoding from different neural recording modalities and when decoding different behavioral outputs. We hope this will prove useful to deep learning researchers aiming to understand current neural decoding problems and to neuroscience researchers aiming to understand the state-of-the-art in neural decoding.

Deep learning architectures

At their core, deep learning models share a common structure across architectures: 1) simple components formed from linear operations (typically matrix multiplication or convolution) plus a nonlinear operation (for example, rectification or a sigmoid nonlinearity); and 2) composition of these simple components to form complex, layered architectures. There are many formats of neural networks, each with their own set of assumptions. In addition to feedforward neural networks, which have the basic structure described above, common architectures for neural decoding are convolutional neural networks (CNNs) and recurrent neural networks (RNNs). While more complex deep network layer types, e.g., graph neural networks [23] or networks that use attention mechanisms [24], have been developed, they have not seen as much use in neuroscience. Additionally, given that datasets in neuroscience typically have limited numbers of trials, simpler, more shallow deep networks (e.g., a standard convolutional network versus a residual convolutional network [21]) are often used for neural decoding.
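To make this layered structure concrete, below is a minimal sketch of a feedforward decoder in PyTorch. The layer sizes, the choice of ReLU, and the regression readout are illustrative assumptions, not taken from any particular study:

```python
import torch
from torch import nn

# A minimal feedforward decoder: alternating linear and nonlinear
# operations, composed into layers. All sizes are illustrative.
n_features = 96   # e.g., number of recording channels (assumed)
n_outputs = 2     # e.g., x and y cursor velocity (assumed)

decoder = nn.Sequential(
    nn.Linear(n_features, 128),  # linear operation
    nn.ReLU(),                   # pointwise nonlinearity
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, n_outputs),    # linear readout for regression
)

x = torch.randn(32, n_features)  # a batch of 32 input vectors
y_hat = decoder(x)               # predicted outputs, shape (32, 2)
```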
RNNs typically use a sequence of inputs. RNNs are also capable of processing input sequences of varying lengths, which occur in neuroscience data (e.g., trials of differing duration). This is unlike a fully connected network, which requires a fixed-dimensionality input. In an RNN, the inputs are projected into a hidden layer, which connects to itself across time (Fig. 1B). Thus, recurrent networks are commonly used for decoding since they can flexibly incorporate information across time. Finally, the hidden layer projects to an output, which can itself be a sequence (Fig. 1B) or just a single data point.

CNNs can be adapted to input and output data in many different formats. For example, convolutional architectures can take in structured data (1d timeseries, 2d images, 3d volumes) of arbitrary size. The convolutional layers will then learn filters of the corresponding dimensions in order to extract meaningful local structure (Fig. 1C). The convolutional layers will be particularly useful if there are important features that are translation invariant, as in images. This is done hierarchically, in order to learn filters of varying scales (i.e., varying temporal or spatial frequency content). Next, depending on the output that is being predicted, the convolutional layers are fed into other types of layers to produce the final output (e.g., into fully connected layers to classify an image). In general, hierarchically combining local features is a useful prior for image-like datasets.

Weight-sharing, where the weights of some parameters are constrained to be the same, is often used for neural decoding. For instance, the parameters of a convolutional (in time) layer can be made the same for differing input channels or neurons, so that these inputs are filtered in the same way. This is analogous to CNN parameters being shared across space or time in 2d or 1d convolutions. For neural decoding, this can be beneficial for learning a shared set of data-driven features for different recording channels as an alternative to human-engineered features.

Training a neural decoder uses supervised learning, where the network's parameters are learned to predict target outputs based on the inputs. Recent work has combined supervised deep networks with unsupervised learning techniques. These unsupervised methods learn (typically) lower dimensional representations that reproduce one data source (either the input or output), and are especially prevalent when decoding images. One common method, generative adversarial networks (GANs) [25, 26], generates an output, e.g., an image, given a vector of noise as input. GANs are trained to produce images that fool a classifier deep network about whether they are real versus generated images. Another method is convolutional autoencoders, which are trained to encode an image into a latent state and then reconstruct a high fidelity version [27]. These unsupervised methods can produce representations of the decoding input or output that are sometimes more conducive for decoding.
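As one illustration of the unsupervised side, here is a minimal convolutional autoencoder sketch in PyTorch; the image size, channel counts, and latent dimensionality are assumptions for illustration:

```python
import torch
from torch import nn

class ConvAutoencoder(nn.Module):
    """Encode a 1x32x32 image to a latent vector, then reconstruct it.
    All sizes are illustrative assumptions."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=4, stride=2, padding=1),   # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=4, stride=2, padding=1),  # 16 -> 8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 8 * 8, latent_dim),                     # latent state
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16 * 8 * 8),
            nn.Unflatten(1, (16, 8, 8)),
            nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1),   # 16 -> 32
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = ConvAutoencoder()
images = torch.randn(4, 1, 32, 32)
recon, latent = model(images)
loss = nn.functional.mse_loss(recon, images)  # reconstruction objective
```

A neural decoder can then be trained to predict the latent vector from neural activity, with the decoder half of the autoencoder mapping that prediction back to an image.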
Neural recording modalities

To understand how varying neural network architectures can be preferable for processing different neural signals, it is important to understand the basics of neural recording modalities. These modalities differ in their invasiveness and in their spatial and temporal precision.

The most invasive recordings involve inserting electrodes into the brain to record voltages. This allows experimentalists to record spikes, or action potentials: the fast electrical transients that individual neurons use to signal, and the basic unit of neural signaling. To get binary spiking events, the recorded signals are high-pass filtered and thresholded. Datasets with spikes are thus binary time courses from all of the recording channels (Fig. 1A). These invasive measurements also allow recording local field potentials (LFPs), a low-pass filtered version of the recorded voltages [32], as well as the raw wide-band activity. Datasets with LFP and wide-band are continuous time courses of voltages from all the recording channels (Fig. 1A). Note that traditionally, because the distance between recording electrodes is greater than the spatial precision of recording, spatial relationships between electrodes are not utilized for decoding. Spikes, LFP, and wide-band are more commonly recorded from animal models than humans because of their invasive nature.
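As a concrete illustration of the spike-extraction step described above, here is a minimal sketch; the sampling rate, filter cutoff, threshold heuristic, and bin width are common but illustrative choices, not a prescription from any particular study:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 30_000.0                           # sampling rate in Hz (assumed)
rng = np.random.default_rng(0)
voltage = rng.standard_normal(int(fs))  # 1 s of placeholder raw voltage

# High-pass filter to isolate the fast spike band (cutoff is a common choice).
b, a = butter(3, 300.0, btype="highpass", fs=fs)
filtered = filtfilt(b, a, voltage)

# Threshold at a multiple of a robust estimate of the noise level.
noise = np.median(np.abs(filtered)) / 0.6745
threshold = -4.5 * noise
crossings = np.flatnonzero((filtered[1:] < threshold) & (filtered[:-1] >= threshold))

# Binarize spike events into time bins for downstream decoding.
bin_size = int(0.010 * fs)              # 10 ms bins (assumed)
spike_train = np.zeros(len(filtered) // bin_size, dtype=np.int8)
spike_train[np.minimum(crossings // bin_size, len(spike_train) - 1)] = 1
```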
Another invasive technique for recording individual neurons' activities is calcium imaging, which uses microscopy to capture images of fluorescent calcium indicators that are sensitive to neurons' spiking activity [33]. The raw outputs of calcium imaging are videos: pixels measure fluorescence at the times when, and locations where, neurons are active. Calcium imaging is only used with animal models.

Electrical potentials measured from outside of the brain, that is, electrocorticography (ECoG) and electroencephalography (EEG), are common neural recording modalities used in humans. ECoG recordings are from grids that record electrical potentials from the surface of the cortex, require surgical implantation, and often cover large functional areas of cortex. EEG is a non-invasive method that records from the surface of the scalp from up to hundreds of spatially distributed channels. Like LFPs, datasets from ECoG and EEG recordings are continuous time courses of electrical potentials across recording channels (Fig. 1A), but here the spatial layout of the channels is also sometimes used in decoding. Note that as these electrical recording methods get less invasive, spatial precision decreases (from spikes to LFP to ECoG to EEG), which can lead to inferior decoding performance [34, 35]. Still, all these electrical signals can be recorded at high temporal resolution (100s-1000s of Hz), which makes them good candidates for fast time-scale decoding.

Magnetoencephalography (MEG), functional near infrared spectroscopy (fNIRS), and functional magnetic resonance imaging (fMRI) are also noninvasive recording modalities, and are most often used in human decoding experiments. MEG measures the weak magnetic fields that are induced by electrical currents in the brain. Like EEG and ECoG, MEG can be recorded with high temporal precision. fNIRS and fMRI measure blood oxygenation (a proxy for neural activity), through its absorption of light and with magnetic resonance imaging, respectively, and their temporal resolution is limited by the slow dynamics of blood oxygenation. fNIRS and fMRI datasets contain activity signals in different "voxels" (locations) of the brain over time. Due to the limited temporal resolution, sometimes the temporal continuity of this data is not used for decoding purposes (Fig. 1A).
Figure 1: Schematics. A: Schematics of neural decoding, which can use many different neural modalities as input (top) and can predict many different outputs (bottom). Embedded figures are adapted from [28–30]. B: A schematic of a standard recurrent neural network (RNN). Each arrow represents a linear transformation followed by a nonlinearity. Arrows of the same color represent the same transformations occurring. The circles representing the hidden layer typically contain many hidden units. More sophisticated versions of RNNs, which include gates that control information flow through various parts of the network, are commonly used. For example, see [31] for a schematic of an LSTM. C: A schematic of a convolutional neural network. A convolutional transformation takes a learned filter and convolves it with the input (here, a 2d input), and then passes this through a nonlinearity.
Figure 2: Feature engineering for neural decoding. For all plots, the red box indicates a set of features across time, space, or frequency which will be filtered together by the first layer's convolutional or recurrent window. The red arrows indicate axes along which convolution or recurrence may be performed. Sample data from [29]. A: High gamma amplitude, which is selected from a large filterbank of features from B, is shown spatially laid out in the ECoG grid locations. Deep network filters combine hand-engineered high gamma features across space and time. B: Spectrotemporal wavelet decomposition of the raw data, from C, may be used as the input to a deep network. The deep network filter shown combines features across frequency and time and can be shared across channels. C: Raw electrical potential recorded using ECoG across channels. The deep network filter shown combines features across time and can be shared across channels.
Feature engineering

For each of these recording modalities, the raw data are processed to create features that are beneficial for decoding. Sometimes, these features are hand-engineered based on previous knowledge, traditionally with the goal of creating features that are most compatible with linear decoders. Other times, this feature engineering is part of the deep learning architecture. That is, a more raw form of the input is provided to the decoder, and a first stage of the deep network decoder will automatically learn to extract relevant features. Specific neural network architectures can be beneficial for this automatic feature engineering (Fig. 2).

For use in decoding, spikes are typically first converted into firing rates by counting the number of spikes in time bins. Then, these firing rates are fed into the decoder. This general approach of decoding based on firing rates (an assumption of "rate coding") is standard. While using the precise temporal timing of spikes ("temporal coding") for decoding has been done [36], we are not aware of examples using deep learning. Given that firing rates are used as inputs, additional neural network architectures are not used to extract unknown features from the input. However, in future research, it might be advantageous to provide a more raw form of spiking as input and use deep learning architectures to do feature engineering. For rate coding, the best size and temporal placement of time bins could be automatically determined, and for temporal coding, features related to the precise timing of spikes could be learned.

When analyzing calcium imaging data, the videos are typically preprocessed to extract traces of fluorescence over time for each neuron [37]. Sometimes, additional processing will be done to estimate spiking events from the calcium traces [38]. Deep learning tools exist for both of these processing steps [39, 40]. For decoding, either the fluorescences or the estimated firing rates (via the estimated spike trains) are then used as input. While it could be possible to develop an end-to-end decoder that works with the videos as input, this may prove challenging given the potential for overfitting with high-dimensional input.

When decoding from wide-band, LFP, EEG, and ECoG data, it is common to first extract spectrotemporal features from the data, for example the signals in specific frequency bands. Sometimes, only "task-relevant" frequencies will be used for decoding, for instance, using high gamma frequencies in ECoG to decode speech [41, 42] (Fig. 2A). More frequently, many frequencies will be included, to better understand which are contributing to decoding [12, 43]. Similar to frequency selection based on domain knowledge, ECoG grid electrodes and fMRI voxels are often subselected by hand or with statistical tests. In general, these extracted features can then be put into almost any type of decoder, such as linear (or logistic) regression or a deep neural network (e.g., [44]).

It is also possible to let a deep learning architecture do more of the feature extraction. One approach is to first convert each electrode's signal into a frequency domain representation over time (i.e., a spectrogram), often via a wavelet transform. Then, this 2-dimensional representation (like an image) is provided as input to a CNN [35, 45–47] (Fig. 2B). If multiple electrode channels are being used for decoding, each channel can be fed into an independent CNN, or alternatively, the CNN weights for each channel can be shared [35]. The CNN will then learn the relevant frequency domain representation for the decoding.
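The following sketch applies one 2d CNN with shared weights to every electrode's spectrogram by folding the electrode axis into the batch dimension; the shapes, layer choices, and the mean-pooling across electrodes are illustrative assumptions:

```python
import torch
from torch import nn

n_trials, n_electrodes = 8, 16   # assumed dataset shape
n_freqs, n_times = 40, 100       # spectrogram dimensions (assumed)
specs = torch.randn(n_trials, n_electrodes, n_freqs, n_times)

# One CNN whose weights are shared across all electrodes.
shared_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=(5, 5), padding=2),  # frequency x time filters
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),
    nn.Flatten(),                                    # 8*4*4 features per electrode
)

# Fold electrodes into the batch axis so every electrode passes
# through the same filters, then unfold and pool across electrodes.
x = specs.reshape(n_trials * n_electrodes, 1, n_freqs, n_times)
features = shared_cnn(x).reshape(n_trials, n_electrodes, -1)
trial_features = features.mean(dim=1)             # combine electrodes (one simple choice)
readout = nn.Linear(trial_features.shape[-1], 3)  # e.g., 3-class decoder (assumed)
logits = readout(trial_features)
```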
Another approach is to provide the raw input signals to a deep learning architecture (Fig. 2C). To learn temporal features, typically the signal is fed into a 1-dimensional CNN, where the convolutions occur in the time domain. This has been done with a standard CNN [48], in addition to variant architectures. Ahmadi et al. [49] used a temporal convolutional network, which is a more complex version of a 1-dimensional CNN that (among other things) allows for multiple timescales of inputs to affect the output. Li et al. [50] used parameterized versions of temporal filters that target synchrony between electrodes. These convolutional approaches will automatically learn temporal filters (like frequency bands) that are relevant for decoding.

In addition to temporal structure, there is often spatial structure of the electrode channels that can also be leveraged for decoding (Fig. 2A). Convolutional filters can be used in the spatial domain to learn spatial representations that are relevant for decoding, for example local functional correlation structure. It is common for the temporal filters and spatial filters to be learned in successive layers of the network, either temporal followed by spatial [51, 52] or vice-versa [53]. Additionally, 3-dimensional convolutional filters can be learned that simultaneously incorporate both temporal and (2-dimensional) spatial dimensions [54] or 3 spatial dimensions [55]. Including spatial filters, which is most common in EEG and ECoG, can help learn spatial motifs that are most relevant for the task. Moreover, from a practical perspective, convolutional networks are an efficient way of processing high-dimensional spatial data.
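Here is a minimal sketch of the temporal-followed-by-spatial pattern, in the spirit of the architectures in [51, 52], though the exact filter sizes and channel counts here are assumptions: a convolution over time only is applied identically to every channel, followed by a convolution that spans the full channel axis.

```python
import torch
from torch import nn

n_channels, n_times = 64, 500  # e.g., EEG channels and time samples (assumed)

temporal_spatial = nn.Sequential(
    # Temporal filters: convolve over time only, same filters for all channels.
    nn.Conv2d(1, 8, kernel_size=(1, 25), padding=(0, 12)),
    nn.ELU(),
    # Spatial filters: span the full channel axis in one step.
    nn.Conv2d(8, 16, kernel_size=(n_channels, 1)),
    nn.ELU(),
    nn.AvgPool2d(kernel_size=(1, 10)),
    nn.Flatten(),
)

x = torch.randn(4, 1, n_channels, n_times)   # (batch, 1, channels, time)
features = temporal_spatial(x)               # flattened features per trial
classifier = nn.Linear(features.shape[-1], 2)
logits = classifier(features)
```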
Decoding outputs

Neural decoding is used to predict many outputs, including movement, speech, vision, and more. Sometimes, the output variable will be directly predicted from the neural inputs, e.g., when predicting movement velocities. Other times, the decoder may be trained to predict some intermediate representation, which has a predetermined mapping to the output (Fig. 3). For example, a GAN can be trained to generate an image using a small number of latent variables. This mapping from the low-dimensional variables to images can be learned without having to simultaneously record neural activity. Then, to decode an image from neural activity, one can train the decoder to predict the latent variables to be fed into the GAN, rather than the entire high-dimensional image. This two-step approach can be especially beneficial when the output data is complex and high-dimensional, as is often the case in vision or speech. In effect, the generative model can act as a prior on the underconstrained decoding solution. Across the following decoding outputs, researchers have used both the "direct" and "intermediate mapping" approaches (Fig. 3).

Movement decoding

Some of the earliest uses of neural decoding were in the motor system [56]. Researchers have used neural activity from motor cortex to predict many different motor outputs, such as movement kinematics (e.g., position and velocity), muscle activity (EMG), and broad type of movement. Traditionally, this decoding has used methods (e.g., Kalman filter or Wiener filter) that assume a linear mapping from neural activity to the motor output, which has led to many successes [57–60]. To improve the decoders, these methods were extended to allow specific nonlinearities (e.g., unscented Kalman filter and Wiener cascade [61–64]). Within the last decade, deep learning methods have become more common, frequently outperforming linear methods and their direct nonlinear extensions when compared (e.g., [28, 53, 65, 66]).

Deep learning methods for decoding movement have been applied to a wide range of problems. Researchers have used many input signals that have high temporal resolution, including spikes [28, 65–70], wide-band [71, 72], LFP [44, 49], EEG [73, 74], and ECoG [53, 75–77]. Additionally, deep learning has been used to predict many different outputs. Often the output is a continuous variable, such as the position, angle, or velocity of a limb, joint, or cursor [28, 44, 49, 53, 65, 66, 69, 70, 73], or a muscle's EMG [67] (Fig. 3B). Rather than predicting a continuous variable, sometimes the goal is to classify different movement types [71, 72, 74–77], for example, classifying which finger is moving [75]. Finally, deep learning decoders have been used to predict movements from effectors across different parts of the body, including arm [28, 44, 49, 65, 66, 68, 70], leg [65, 69, 73], wrist [67, 71, 72], and finger movements [53, 71, 72, 75–77]. Thus, deep learning methods have shown themselves to be a very flexible tool for movement decoding.

RNNs are by far the most common deep learning architecture for movement decoding. When predicting a continuous movement variable, there is generally a linear mapping from the RNN's output to the movement variable. When classifying movements, there is an additional softmax nonlinearity that determines the movement with the highest probability. From a deep learning perspective, given that this is a problem of converting one sequence (a temporal trace of neural activities) into another sequence (motor outputs), it would be expected that an RNN would be an appropriate architecture. Recurrent architectures also make sense from a scientific perspective: motor cortical activity has dynamics that are important for producing movements [78], plus movements themselves have dynamics.

LSTMs have generally been the most common and successful type of RNN for decoding [28, 44, 53, 65, 67–69, 75–77], although other standard types of RNN architectures (e.g., GRUs [73] and echo state networks [70]) have also proven successful. Additionally, researchers have found that stacking multiple layers of LSTMs [65, 75] can improve performance beyond a single LSTM [65]. LSTMs are likely successful because they are able to learn long-term dependencies better than a standard "vanilla" RNN [31].

A common goal of neural decoding of movement is to be able to create a usable brain-computer interface for patients. While the majority of deep learning uses have been in offline scenarios (decoding after the neural recording), there are several successful examples of real-time uses of deep learning for movement decoding [66, 70–72]. For example, in human patients with tetraplegia who had implanted electrode arrays, Schwemmer et al. [71] were able to classify planned movements of wrist extension, wrist flexion, index extension, and index flexion. They then applied functional electrical stimulation to activate muscles according to this decoder, so that the patient was able to make these movements in real time. In Sussillo et al. [70], monkeys with implanted electrode arrays were able to control the velocity of a cursor on a screen in real time.
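To make the RNN-based approach concrete, here is a minimal LSTM decoder sketch that maps binned spiking activity to cursor velocity at each time bin; the sizes, the single-layer LSTM, and the mean-squared-error objective are illustrative assumptions:

```python
import torch
from torch import nn

class LSTMVelocityDecoder(nn.Module):
    """Map a sequence of binned firing rates to 2d velocities per time bin."""
    def __init__(self, n_units=96, hidden_size=100):
        super().__init__()
        self.lstm = nn.LSTM(n_units, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, 2)  # linear map to (vx, vy)

    def forward(self, rates):
        # rates: (batch, time_bins, n_units)
        hidden, _ = self.lstm(rates)
        return self.readout(hidden)               # (batch, time_bins, 2)

decoder = LSTMVelocityDecoder()
rates = torch.randn(8, 50, 96)       # 8 trials, 50 bins, 96 units (assumed)
velocities = torch.randn(8, 50, 2)   # placeholder measured kinematics
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(decoder(rates), velocities)
loss.backward()
optimizer.step()
```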
Figure 3: Architectures and outputs of decoding. A: Sequential inputs can be processed by RNNs, which can use past context (or past and future in bi-directional RNNs). B: RNN outputs at each timestep can be mapped to behaviors, e.g., movements, measured concurrently. C: The final output of an RNN can be used as the input to a decoding network which can produce a second sequence of a different length, such as text. D: RNNs can produce an intermediate state to be used in a second decoding step. E: Intermediate states can often be structured, such as a spectrogram in this example. F: Intermediate states can be fed into an acoustic model which produces acoustic waveforms. G: Image-like inputs can be processed by CNNs to produce intermediate feature vectors. H: Feature vectors can be fed into generative image models, e.g., a GAN, to produce a more realistic looking image.

While there has been great initial success, there are several challenges associated with using deep learning for real-time decoding for brain-computer interfaces. One challenge is that the source of the recorded neural activity can change across days, for example due to slight movement of implanted electrodes. One approach that has dealt with this is the multiplicative RNN, which allows mappings from the neural input to the motor output to partially change across days [66]. Another challenge is computation time, as there is the need to make predictions through the deep learning architecture at very high temporal resolution. When using a less complicated echo state network, Sussillo et al. [70] were able to decode with less than 25 ms temporal resolution. However, when using a more complex architecture of LSTMs followed by CNNs, Schwemmer et al. [71] decoded at 100 ms resolution, slower than our perception. Relatedly, for linear methods that can be fit rapidly, researchers are able to adapt the decoder in real time to better match the subject's intention (trying to get to a target) to improve performance [58, 62]. Developing similar approaches for deep learning based decoders is an exciting, unexplored area.
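One way to picture the computation-time constraint is a streaming loop that re-runs the decoder on a rolling buffer of the most recent bins at every update; in this sketch the bin width, buffer length, and the stand-in decoder are all hypothetical, and the point is only that each prediction must complete within one bin:

```python
import time
import torch
from torch import nn

# A stand-in decoder (e.g., the LSTM sketched earlier); sizes are assumed.
decoder = nn.Sequential(nn.Flatten(), nn.Linear(50 * 96, 2))
decoder.eval()

buffer = torch.zeros(1, 50, 96)  # rolling window of the last 50 bins

def on_new_bin(new_rates: torch.Tensor) -> torch.Tensor:
    """Called once per 20 ms bin (assumed); returns the newest velocity estimate."""
    global buffer
    buffer = torch.roll(buffer, shifts=-1, dims=1)
    buffer[0, -1] = new_rates                 # append the newest bin
    start = time.perf_counter()
    with torch.no_grad():
        velocity = decoder(buffer)[0]         # (vx, vy) for the current bin
    latency = time.perf_counter() - start     # must stay below one bin width
    return velocity

v = on_new_bin(torch.randn(96))
```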
Speech decoding

Vocal articulation is a complex behavior that engages a large functional area of the brain to produce movements that have a high degree of articulatory temporal and spatial precision [79]. It is also a uniquely human ability, which limits the recording modalities and neuroscientific interventions that can be used to study it. Due to the functional and temporal requirements of decoding speech, cortical surface electrical potentials recorded using ECoG are the typical recording modality, although penetrating electrodes, MEG, EEG, and fNIRS are also used [80–83]. When decoding from ECoG or EEG, researchers commonly use the signals' high gamma amplitude [41], although some use more broad spectrotemporal features as well [41, 43, 84].

Many approaches to decoding speech from neural signals have used some combination of linear methods and shallow probabilistic models. Clustering, SVMs, LDA, linear regression, and probabilistic models have been used with spectrotemporal features of electrical potentials to decode vowel acoustics, speech articulator movements, phonemes, whole words, and semantic categories [41, 43, 80, 85–88].

Deep learning approaches to decoding speech from neural signals have emerged that can potentially learn nonlinear mappings. Some of these approaches have operated on temporally segmented neural data and have thus used fully connected neural network architectures. For example, spectrotemporal features derived from ECoG or EEG have been used to reconstruct perceived spectrograms, classify words or syllables, or classify entire phrases [12, 42, 82–84]. These examples with temporally segmented neural data are useful for increasing understanding about neural representations, and as a step towards decoding natural speech.

Mapping directly from continuous, time-varying neural signals to speech is the goal of speech brain-computer interfaces [89, 90]. Both convolutional and recurrent networks are able to flexibly decode timeseries data and are often used for decoding naturalistic speech. Heelan et al. [91] reconstructed perceived speech audio from multi-unit spike counts from a non-human primate and found that LSTM-based networks outperformed other traditional and deep models. Speech represented as text does not have a simple one-to-one temporal alignment to regularly sampled neural signals. For this reason, speech-to-text decoding networks often use architectures and methods like sequence-to-sequence models or the connectionist temporal classification loss [20, 92], which are commonly used in machine translation or automated speech recognition applications. As such, several groups have decoded directly from neural signals to text using recurrent networks such as sequence-to-sequence models [93, 94] (Fig. 3C).

For decoding intelligible acoustic speech, it is also common to split decoding into a more constrained neural-to-intermediate mapping, followed by a second stage that maps this intermediate format into an acoustic waveform using acoustic priors for speech based on deep learning or hand-engineered methods. For instance, high gamma features recorded using ECoG have been used to decode spectrograms and speech articulator dynamics [54, 95] as intermediate states. Then, either a WaveNet deep network [96] was used to directly produce an acoustic waveform from the spectrogram [54], or an RNN was used to produce acoustic features which were fed into a speech synthesizer [95]. These second stages do not require invasive neural data for training and were trained on a larger second corpus.
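To illustrate the alignment-free text decoding mentioned above, here is a minimal sketch of training with PyTorch's CTC loss; the network, alphabet size, and sequence lengths are illustrative assumptions:

```python
import torch
from torch import nn

n_channels, n_classes = 128, 29  # neural features; characters plus a blank (assumed)

rnn = nn.LSTM(n_channels, 64, batch_first=True, bidirectional=True)
readout = nn.Linear(2 * 64, n_classes)
ctc = nn.CTCLoss(blank=0)

neural = torch.randn(4, 200, n_channels)      # 4 trials, 200 time steps
hidden, _ = rnn(neural)
log_probs = readout(hidden).log_softmax(-1)   # (batch, time, classes)

# Character targets of varying length; no per-timestep alignment is needed.
targets = torch.randint(1, n_classes, (4, 12))
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.tensor([12, 9, 7, 12])

# nn.CTCLoss expects log-probabilities shaped (time, batch, classes).
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```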
Deep learning models have improved the accuracy of primarily offline speech decoding tasks. Many of the preprocessing and decoding methods reviewed here are done offline using acausal or high-latency deep learning models. Developing deep learning methods, software, and hardware for real-time speech decoding is important for clinical applications of brain-computer interfaces [88, 97].

Visual decoding

Similar to decoding acoustic speech, decoding visual stimuli from neural signals requires strong image priors due to the large variability of natural scenes and the relatively small bit-rate of neural recordings. Early attempts to reconstruct the full visual experience restricted decoding to simple images [98] or relied on a filterbank encoding model and a large set of natural images as a sampled prior [99]. Qiao et al. [100] solved the simpler task of classifying perceived object category using one CNN to select a small set of fMRI voxels, which were fed into a second RNN for classification. Similarly, Ellis and Michaelides [101] classified among many visual scenes from calcium imaging data using feedforward or convolutional neural networks.

As mentioned in Deep learning architectures, deep generative image models, such as GANs, can produce realistic images. In addition, CNNs trained to classify large naturalistic image databases [102] (discriminative models) have been shown to encode a large amount of textural and semantic meaning in their activations [103], which can be used as an image prior. Due to the variety of ways that natural image priors can be created with deep networks, there exist decoding methods that combine different aspects of both generative and discriminative networks.

Given a deep generative model of images, a simpler decoder can be trained to map from neural data to the latent space of the model [104, 105], and the generative model can be used for image reconstruction. Similarly, a linear stage reconstruction followed by a deep network that cleans up the image has been used with retinal ganglion cell output [27]. Generative models can also be trained to reconstruct images directly from fMRI responses on real data with data augmentation from a simulated encoding model [106].

Alternatively, generative and discriminative models can be used together. By leveraging a pretrained CNN, a simple decoder can be trained to map neural data to CNN activations that can then be passed into a convolutional image reconstruction model [107]. Additionally, the input image in a pretrained CNN can be optimized so that the CNN activations match predictions given by the fMRI responses [108]. Researchers have also used an end-to-end approach in which they train the generative part directly on neural data with both an adversarial loss and a pretrained CNN feature loss [109]. Along with acoustic speech, decoding naturalistic visual stimuli presents one of the best cases to study the use of data-driven priors derived from deep networks.
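A minimal sketch of the decode-to-latent-space pattern follows; the linear decoder, latent dimensionality, and the stand-in generator are all illustrative assumptions (in practice the generator would be a pretrained GAN, and the target latents would come from encoding the presented images):

```python
import torch
from torch import nn

n_voxels, latent_dim = 5000, 64  # assumed fMRI feature and latent sizes

# Stand-in for a pretrained generative model (frozen during decoder training).
generator = nn.Sequential(
    nn.Linear(latent_dim, 16 * 16), nn.Tanh(),
    nn.Unflatten(1, (1, 16, 16)),
)
for p in generator.parameters():
    p.requires_grad_(False)

# A simple decoder maps neural data into the generator's latent space.
decoder = nn.Linear(n_voxels, latent_dim)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

voxels = torch.randn(16, n_voxels)             # placeholder fMRI responses
target_latents = torch.randn(16, latent_dim)   # latents of presented images (assumed known)

z_hat = decoder(voxels)
loss = nn.functional.mse_loss(z_hat, target_latents)
loss.backward()
optimizer.step()

reconstruction = generator(decoder(voxels))    # images decoded via the generative prior
```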
Other applications

While we have chosen to focus on a few decoding outputs that are prevalent in the literature, deep learning has been used for a myriad of decoding applications. RNNs such as LSTMs have been used to decode an animal's location [28, 35, 110, 111] and direction [112] from spiking activity in the hippocampus and head-direction cells, respectively. LSTMs have been used to decode what is being remembered in a working memory task from human fMRI [113]. Researchers have used LSTMs [114] and feedforward neural networks [115] to classify different classes of behaviors, using spiking activity in animals [115] and fNIRS measurements in humans [114]. LSTMs [116, 117] and CNNs [118] have been used to classify emotions from EEG signals. Feedforward neural networks have been used to determine the source of a subject's attention, using EEG in humans [119, 120] and spiking activity in monkeys [121]. CNNs [46–48], along with LSTMs [48], have been used to predict a subject's stage of sleep from their EEG. For almost any behavioral signal that can be decoded, someone has tried to use deep learning.

Discussion

Deep learning is an attractive method for use in neural decoding because of its ability to learn complex, nonlinear transformations from data. In many of the examples above, deep networks can outperform linear or shallow methods even on relatively small datasets; however, examples exist where this is not the case, especially when using fMRI [122, 123] or fNIRS data [124]. Relatedly, there are many times in which using hand-engineered features can outperform an end-to-end neural network that will learn the features. This is more likely with limited amounts of data, and also when there is strong prior knowledge about the relevant features. One general machine learning approach to efficiently use limited data is transfer learning, in which a neural network trained in one scenario (typically with more data) is used in a separate scenario. This has been used in neural decoding to more effectively train decoders for new subjects [77, 94] and for new predicted outputs [71]. As the capability to generate ever larger datasets develops with automated, long-term experimental setups for single animals [125] and large scale recordings across multiple animals [126], deep learning is well poised to take advantage of this flood of data. As dataset sizes increase, this will also allow more features to be learned through data-driven network training rather than being selected by hand.

Although deep learning will inevitably improve decoding accuracy as neuroscientists collect larger datasets, extracting scientific knowledge from trained networks is still an area of active research. That is, can we understand the transformations deep networks are learning? In computer vision, layers that include spatial attention [127] and methods for performing feature attribution [128] have been developed to understand what parts of the input are important for prediction, although the latter are an active area of research [129]. These methods could be used to attribute what channels, neurons, or time-points are most salient for decoding [128]. Additionally, there are methods for understanding deep network representations in computer vision that examine the representations networks have learned across layers [130, 131]. Using these methods may help to understand the transformations that occur within neural decoders; however, results may be sensitive to the decoder's architecture and not purely the data's structure. While deep learning interpretability methods are not commonly used on decoders trained on neural data, there are a few examples of networks that were built with interpretability in mind or were investigated after training [12, 50, 51, 113].
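As an illustration of the attribution idea, here is a minimal gradient-based saliency sketch; the decoder is a hypothetical stand-in, and gradient magnitudes are only one of many attribution methods [128]:

```python
import torch
from torch import nn

n_channels, n_times = 32, 100  # assumed input shape

decoder = nn.Sequential(nn.Flatten(), nn.Linear(n_channels * n_times, 2))

x = torch.randn(1, n_channels, n_times, requires_grad=True)
score = decoder(x)[0, 1]   # score of the predicted (or chosen) class
score.backward()

# |d score / d input|: which channels and time-points most affect the output.
saliency = x.grad.abs().squeeze(0)         # (channels, time)
channel_importance = saliency.sum(dim=1)   # aggregate over time
time_importance = saliency.sum(dim=0)      # aggregate over channels
```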
When interpreting decoders, it is often assumed that the decoder reveals the information contained in the brain about the decoded variable. It is important to note that this is only partially true when priors are being used for decoding [132], which is often the case when decoding a full image or acoustic speech. In these scenarios, the decoded outputs will be a function of both neural activity and the prior, so one cannot simply determine what information the brain has about the output.

The software used to create, train, and evaluate deep networks has been steadily developed and is now almost as easy to use as other standard machine learning methods. A wide range of cost functions, layer types, and parameter optimization algorithms are implemented and accessible in deep learning libraries such as PyTorch or TensorFlow [133, 134] and libraries in other programming languages. Like other machine learning methods, care must be taken to carefully cross-validate results, as deep networks can easily overfit to the training data.

In addition to their use in neural decoding, deep learning has other prominent uses within neuroscience [135, 136]. Neural networks have a long history in neuroscience as models of neural processing [137, 138]. More recently, there has also been a surge of papers using deep networks as encoding models [9, 11, 139]. There has been a specific focus on using the representations learned by deep networks trained to perform behavioral tasks (e.g., image recognition) to predict neural responses in corresponding brain areas (e.g., across the visual hierarchy [140]). Combining these multiple complementary approaches is one promising approach to understanding neural computation.

Acknowledgements
We would like to thank Ella Batty and Charles Frye for very helpful comments on this manuscript.
Funding
JIG was supported by National Science Foundation NeuroNex Award DBI-1707398 and The Gatsby Foundation AT3708. JAL was supported by the LBNL Laboratory Directed Research and Development program.
References

[1] Rodrigo Quian Quiroga, Lawrence H Snyder, Aaron P Batista, He Cui, and Richard A Andersen. Movement intention is better predicted than attention in the posterior parietal cortex. Journal of Neuroscience, 26(13):3615–3620, 2006.
[2] Stephenie A Harrison and Frank Tong. Decoding reveals the contents of visual working memory in early visual areas. Nature, 458(7238):632–635, 2009.
[3] Soumyadipta Acharya, Matthew S Fifer, Heather L Benz, Nathan E Crone, and Nitish V Thakor. Electrocorticographic amplitude predicts finger positions during slow grasping motions of the hand. Journal of Neural Engineering, 7(4):046002, 2010.
[4] Martin Weygandt, Carlo R Blecker, Axel Schäfer, Kerstin Hackmack, John-Dylan Haynes, Dieter Vaitl, Rudolf Stark, and Anne Schienle. fMRI pattern recognition in obsessive–compulsive disorder. NeuroImage, 60(2):1186–1193, 2012.
[5] Erin L Rich and Jonathan D Wallis. Decoding subjective decisions from orbitofrontal cortex. Nature Neuroscience, 19(7):973, 2016.
[6] Joshua I Glaser, Matthew G Perich, Pavan Ramkumar, Lee E Miller, and Konrad P Kording. Population coding of conditional probability distributions in dorsal premotor cortex. Nature Communications, 9(1):1–14, 2018.
[7] Liberty S Hamilton, Erik Edwards, and Edward F Chang. A spatial map of onset and sustained responses to speech in the human superior temporal gyrus. Current Biology, 28(12):1860–1871, 2018.
[8] Nora Brackbill, Colleen Rhoades, Alexandra Kling, Nishal P Shah, Alexander Sher, Alan M Litke, and EJ Chichilnisky. Reconstruction of natural images from responses of primate retinal ganglion cells. bioRxiv, 2020.
[9] Lane McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen Baccus. Deep learning models of the retinal response to natural scenes. In Advances in Neural Information Processing Systems, pages 1369–1377, 2016.
[10] Tasha Nagamine and Nima Mesgarani. Understanding the representation and computation of multilayer perceptrons: A case study in speech recognition. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2564–2573. JMLR.org, 2017.
[11] Alexander JE Kell, Daniel LK Yamins, Erica N Shook, Sam V Norman-Haignere, and Josh H McDermott. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98(3):630–644, 2018.
[12] Jesse A Livezey, Kristofer E Bouchard, and Edward F Chang. Deep learning as a tool for neural data analysis: speech classification and cross-frequency coupling in human sensorimotor cortex. PLoS Computational Biology, 15(9):e1007091, 2019.
[13] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838, 2015.
[14] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.
[15] Michela Paganini, Luke de Oliveira, and Benjamin Nachman. CaloGAN: Simulating 3d high energy particle showers in multilayer electromagnetic calorimeters with generative adversarial networks. Physical Review D, 97(1):014021, 2018.
[16] Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, et al. Exascale deep learning for climate analytics. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 649–660. IEEE, 2018.
[17] Kristof T Schütt, Huziel E Sauceda, P-J Kindermans, Alexandre Tkatchenko, and K-R Müller. SchNet – a deep learning architecture for molecules and materials. The Journal of Chemical Physics, 148(24):241722, 2018.
[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[20] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[22] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
[23] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[25] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[26] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[27] Nikhil Parthasarathy, Eleanor Batty, William Falcon, Thomas Rutten, Mohit Rajpal, EJ Chichilnisky, and Liam Paninski. Neural networks for efficient bayesian decoding of natural images from retinal neurons. In Advances in Neural Information Processing Systems, pages 6434–6445, 2017.
[28] Joshua I Glaser, Raeed H Chowdhury, Matthew G Perich, Lee E Miller, and Konrad P Kording. Machine learning for neural decoding. arXiv preprint arXiv:1708.00909, 2017.
[29] Kristofer E Bouchard and Edward F Chang. Human ECoG speaking consonant-vowel syllables, 2019. URL https://doi.org/10.6084/m9.figshare.c.4617263.v4.
[30] Samaneh Kazemifar, Kathryn Y Manning, Nagalingam Rajakumar, Francisco A Gomez, Andrea Soddu, Michael J Borrie, Ravi S Menon, Robert Bartha, Alzheimer's Disease Neuroimaging Initiative, et al. Spontaneous low frequency BOLD signal variations from resting-state fMRI are decreased in Alzheimer disease. PLoS ONE, 12(6), 2017.
[31] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[32] György Buzsáki, Costas A Anastassiou, and Christof Koch. The origin of extracellular fields and currents – EEG, ECoG, LFP and spikes. Nature Reviews Neuroscience, 13(6):407–420, 2012.
[33] Tsai-Wen Chen, Trevor J Wardill, Yi Sun, Stefan R Pulver, Sabine L Renninger, Amy Baohan, Eric R Schreiter, Rex A Kerr, Michael B Orger, Vivek Jayaraman, et al. Ultrasensitive fluorescent proteins for imaging neuronal activity. Nature, 499(7458):295–300, 2013.
[34] Robert D Flint, Christian Ethier, Emily R Oby, Lee E Miller, and Marc W Slutzky. Local field potentials allow accurate decoding of muscle activity. Journal of Neurophysiology, 108(1):18–24, 2012.
[35] Markus Frey, Sander Tanni, Catherine Perrodin, Alice O'Leary, Matthias Nau, Jack Kelly, Andrea Banino, Christian F Doeller, and Caswell Barry. DeepInsight: a general framework for interpreting wide-band neural activity. bioRxiv, page 871848, 2019.
[36] André Maia Chagas, Lucas Theis, Biswa Sengupta, Maik Christopher Stüttgen, Matthias Bethge, and Cornelius Schwarz. Functional analysis of ultra high information rates conveyed by rat vibrissal primary afferents. Frontiers in Neural Circuits, 7:190, 2013.
[37] Andrea Giovannucci, Johannes Friedrich, Pat Gunn, Jeremie Kalfon, Brandon L Brown, Sue Ann Koay, Jiannis Taxidis, Farzaneh Najafi, Jeffrey L Gauthier, Pengcheng Zhou, et al. CaImAn: an open source tool for scalable calcium imaging data analysis. eLife, 8:e38173, 2019.
[38] Joshua T Vogelstein, Adam M Packer, Timothy A Machado, Tanya Sippy, Baktash Babadi, Rafael Yuste, and Liam Paninski. Fast nonnegative deconvolution for spike train inference from population calcium imaging. Journal of Neurophysiology, 104(6):3691–3704, 2010.
[39] Somayyeh Soltanian-Zadeh, Kaan Sahingur, Sarah Blau, Yiyang Gong, and Sina Farsiu. Fast and robust active neuron segmentation in two-photon calcium imaging using spatiotemporal deep learning. Proceedings of the National Academy of Sciences, 116(17):8554–8563, 2019.
[40] Artur Speiser, Jinyao Yan, Evan W Archer, Lars Buesing, Srinivas C Turaga, and Jakob H Macke. Fast amortized inference of neural activity from calcium imaging data with variational autoencoders. In Advances in Neural Information Processing Systems, pages 4024–4034, 2017.
[41] Kristofer E Bouchard and Edward F Chang. Neural decoding of spoken vowels from human sensory-motor cortex with high-density electrocorticography. In , pages 6782–6785. IEEE, 2014.
[42] Minda Yang, Sameer A Sheth, Catherine A Schevon, Guy M McKhann II, and Nima Mesgarani. Speech reconstruction from human auditory cortex with deep neural networks. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[43] Emily M Mugler, James L Patton, Robert D Flint, Zachary A Wright, Stephan U Schuele, Joshua Rosenow, Jerry J Shih, Dean J Krusienski, and Marc W Slutzky. Direct classification of all American English phonemes using signals from functional speech motor cortex. Journal of Neural Engineering, 11(3):035015, 2014.
[44] Nur Ahmadi, Timothy G Constandinou, and Christos-Savvas Bouganis. Decoding hand kinematics from local field potentials using long short-term memory (LSTM) network. In , pages 415–419. IEEE, 2019.
[45] Hosein M Golshan, Adam O Hebb, and Mohammad H Mahoor. LFP-Net: A deep learning framework to recognize human behavioral activities using brain STN-LFP signals. Journal of Neuroscience Methods, 335:108621, 2020.
[46] Jialin Wang, Yanchun Zhang, Qinying Ma, Huihui Huang, and Xiaoyuan Hong. Deep learning for single-channel EEG signals sleep stage scoring based on frequency domain representation. In International Conference on Health Information Science, pages 121–133. Springer, 2019.
[47] Zeke Barger, Charles G Frye, Danqian Liu, Yang Dan, and Kristofer E Bouchard. Robust, automated sleep scoring by a compact neural network with distributional shift correction. PLoS ONE, 14(12), 2019.
[48] Akara Supratak, Hao Dong, Chao Wu, and Yike Guo. DeepSleepNet: a model for automatic sleep stage scoring based on raw single-channel EEG. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 25(11):1998–2008, 2017.
[49] Nur Ahmadi, Timothy G Constandinou, and Christos-Savvas Bouganis. End-to-end hand kinematic decoding from LFPs using temporal convolutional network. In , pages 1–4. IEEE, 2019.
[50] Yitong Li, Kafui Dzirasa, Lawrence Carin, David E Carlson, et al. Targeting EEG/LFP synchrony with neural nets. In Advances in Neural Information Processing Systems, pages 4620–4630, 2017.
[51] Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, and Tonio Ball. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping, 38(11):5391–5420, 2017.
[52] Vernon J Lawhern, Amelia J Solon, Nicholas R Waytowich, Stephen M Gordon, Chou P Hung, and Brent J Lance. EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces. Journal of Neural Engineering, 15(5):056013, 2018.
[53] Ziqian Xie, Odelia Schwartz, and Abhishek Prasad. Decoding of finger trajectory from ECoG using deep learning. Journal of Neural Engineering, 15(3):036009, 2018.
[54] Miguel Angrick, Christian Herff, Emily Mugler, Matthew C Tate, Marc W Slutzky, Dean J Krusienski, and Tanja Schultz. Speech synthesis from ECoG using densely connected 3d convolutional neural networks. Journal of Neural Engineering, 16(3):036019, 2019.
[55] Liang Zou, Jiannan Zheng, Chunyan Miao, Martin J McKeown, and Z Jane Wang. 3d CNN based automatic diagnosis of attention deficit hyperactivity disorder using functional and structural MRI. IEEE Access, 5:23626–23636, 2017.
[56] Apostolos P Georgopoulos, Roberto Caminiti, John F Kalaska, and Joseph T Massey. Spatial coding of movement: a hypothesis concerning the coding of movement direction by motor cortical populations. Experimental Brain Research, 49(Suppl. 7):327–336, 1983.
[57] Wei Wu, Michael J Black, Yun Gao, M Serruya, A Shaikhouni, JP Donoghue, and Elie Bienenstock. Neural decoding of cursor motion using a Kalman filter. In Advances in Neural Information Processing Systems, pages 133–140, 2003.
[58] Vikash Gilja, Paul Nuyujukian, Cindy A Chestek, John P Cunningham, M Yu Byron, Joline M Fan, Mark M Churchland, Matthew T Kaufman, Jonathan C Kao, Stephen I Ryu, et al. A high-performance neural prosthesis enabled by control algorithm design. Nature Neuroscience, 15(12):1752, 2012.
[59] Mijail D Serruya, Nicholas G Hatsopoulos, Liam Paninski, Matthew R Fellows, and John P Donoghue. Instant neural control of a movement signal. Nature, 416(6877):141–142, 2002.
[60] Jose M Carmena, Mikhail A Lebedev, Roy E Crist, Joseph E O'Doherty, David M Santucci, Dragan F Dimitrov, Parag G Patil, Craig S Henriquez, and Miguel AL Nicolelis. Learning to control a brain–machine interface for reaching and grasping by primates. PLoS Biology, 1(2), 2003.
[61] Zheng Li, Joseph E O'Doherty, Timothy L Hanson, Mikhail A Lebedev, Craig S Henriquez, and Miguel AL Nicolelis. Unscented Kalman filter for brain-machine interfaces. PLoS ONE, 4(7), 2009.
[62] Trieu Phat Luu, Yongtian He, Samuel Brown, Sho Nakagome, and Jose L Contreras-Vidal. Gait adaptation to visual kinematic perturbations using a real-time closed-loop brain–computer interface to a virtual reality avatar. Journal of Neural Engineering, 13(3):036006, 2016.
[63] Eric A Pohlmeyer, Sara A Solla, Eric J Perreault, and Lee E Miller. Prediction of upper limb muscle activity from motor cortical discharge during reaching. Journal of Neural Engineering, 4(4):369, 2007.
[64] Christian Ethier, Emily R Oby, Matthew J Bauman, and Lee E Miller. Restoration of grasp following paralysis through brain-controlled stimulation of muscles. Nature, 485(7398):368–371, 2012.
[65] Po-He Tseng, Núria Armengol Urpi, Mikhail Lebedev, and Miguel Nicolelis. Decoding movements from cortical ensemble activity using a long short-term memory recurrent network. Neural Computation, 31(6):1085–1113, 2019.
[66] David Sussillo, Sergey D Stavisky, Jonathan C Kao, Stephen I Ryu, and Krishna V Shenoy. Making brain–machine interfaces robust to future neural variability. Nature Communications, 7:13749, 2016.
[67] Stephanie Naufel, Joshua I Glaser, Konrad P Kording, Eric J Perreault, and Lee E Miller. A muscle-activity-dependent gain between motor cortex and EMG. Journal of Neurophysiology, 121(1):61–73, 2019.
[68] Jisung Park and Sung-Phil Kim. Estimation of speed and direction of arm movements from M1 activity using a nonlinear neural decoder. In , pages 1–4. IEEE, 2019.
[69] Yinong Wang, Wilson Truccolo, and David A Borton. Decoding hindlimb kinematics from primate motor cortex using long short-term memory recurrent neural networks. In , pages 1944–1947. IEEE, 2018.
[70] David Sussillo, Paul Nuyujukian, Joline M Fan, Jonathan C Kao, Sergey D Stavisky, Stephen Ryu, and Krishna Shenoy. A recurrent neural network for closed-loop intracortical brain–machine interface decoders. Journal of Neural Engineering, 9(2):026027, 2012.
[71] Michael A Schwemmer, Nicholas D Skomrock, Per B Sederberg, Jordyn E Ting, Gaurav Sharma, Marcia A Bockbrader, and David A Friedenberg. Meeting brain–computer interface user performance expectations using a deep neural network decoding framework. Nature Medicine, 24(11):1669–1676, 2018.
[72] Nicholas D Skomrock, Michael A Schwemmer, Jordyn E Ting, Hemang R Trivedi, Gaurav Sharma, Marcia A Bockbrader, and David A Friedenberg. A characterization of brain-computer interface performance trade-offs using support vector machines and deep neural networks to decode movement intent. Frontiers in Neuroscience, 12:763, 2018.
[73] Sho Nakagome, Trieu Phat Luu, Yongtian He, Akshay Sujatha Ravindran, and Jose L Contreras-Vidal. An empirical comparison of neural networks and machine learning algorithms for EEG gait decoding. Scientific Reports, 10(1):1–17, 2020.
[74] Ewan Nurse, Benjamin S Mashford, Antonio Jimeno Yepes, Isabell Kiral-Kornek, Stefan Harrer, and Dean R Freestone. Decoding EEG and LFP signals using deep learning: heading TrueNorth. In Proceedings of the ACM International Conference on Computing Frontiers, pages 259–266, 2016.
[75] Anming Du, Shuqin Yang, Weijia Liu, and Haiping Huang. Decoding ECoG signal with deep learning model based on LSTM. In TENCON 2018-2018 IEEE Region 10 Conference, pages 0430–0435. IEEE, 2018.
[76] Gang Pan, Jia-Jun Li, Yu Qi, Hang Yu, Jun-Ming Zhu, Xiao-Xiang Zheng, Yue-Ming Wang, and Shao-Min Zhang. Rapid decoding of hand gestures in electrocorticography using recurrent neural networks. Frontiers in Neuroscience, 12:555, 2018.
[77] Venkatesh Elango, Aashish N Patel, Kai J Miller, and Vikash Gilja. Sequence transfer learning for neural decoding. bioRxiv, page 210732, 2017.
[78] Krishna V Shenoy, Maneesh Sahani, and Mark M Churchland. Cortical control of arm movements: a dynamical systems perspective. Annual Review of Neuroscience, 36:337–359, 2013.
[79] Kristofer E Bouchard, Nima Mesgarani, Keith Johnson, and Edward F Chang. Functional organization of human sensorimotor cortex for speech articulation. Nature, 495(7441):327, 2013.
[80] Alexander M Chan, Eric Halgren, Ksenija Marinkovic, and Sydney S Cash. Decoding word and category-specific spatiotemporal representations from MEG and EEG. NeuroImage, 54(4):3028–3039, 2011.
[81] Christian Herff and Tanja Schultz. Automatic speech recognition from neural signals: a focused review. Frontiers in Neuroscience, 10:429, 2016.
[82] Alborz Rezazadeh Sereshkeh, Robert Trott, Aurélien Bricout, and Tom Chau. EEG classification of covert speech using regularized neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2292–2300, 2017.
[83] Jun Wang, Myungjong Kim, Angel W Hernandez-Mulero, Daragh Heitzman, and Paul Ferrari. Towards decoding speech production from single-trial magnetoencephalography (MEG) signals. In , pages 3036–3040. IEEE, 2017.
[84] Hassan Akbari, Bahar Khalighinejad, Jose L Herrero, Ashesh D Mehta, and Nima Mesgarani. Towards reconstructing intelligible speech from the human auditory cortex. Scientific Reports, 9(1):1–12, 2019.
[85] David F Conant, Kristofer E Bouchard, Matthew K Leonard, and Edward F Chang. Human sensorimotor cortex control of directly measured vocal tract movements during vowel production. Journal of Neuroscience, 38(12):2955–2966, 2018.
[86] Spencer Kellis, Kai Miller, Kyle Thomson, Richard Brown, Paul House, and Bradley Greger. Decoding spoken words using local field potentials recorded from the cortical surface. Journal of Neural Engineering, 7(5):056007, 2010.
[87] Christian Herff, Dominic Heger, Adriana De Pesters, Dominic Telaar, Peter Brunner, Gerwin Schalk, and Tanja Schultz. Brain-to-text: decoding spoken phrases from phone representations in the brain. Frontiers in Neuroscience, 9:217, 2015.
[88] Frank H Guenther, Jonathan S Brumberg, E Joseph Wright, Alfonso Nieto-Castanon, Jason A Tourville, Mikhail Panko, Robert Law, Steven A Siebert, Jess L Bartels, Dinal S Andreasen, et al. A wireless brain-machine interface for real-time speech synthesis. PLoS ONE, 4(12), 2009.
[89] Jonathan R Wolpaw, Niels Birbaumer, Dennis J McFarland, Gert Pfurtscheller, and Theresa M Vaughan. Brain–computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767–791, 2002.
[90] Tanja Schultz, Michael Wand, Thomas Hueber, Dean J Krusienski, Christian Herff, and Jonathan S Brumberg. Biosignal-based spoken communication: A survey. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2257–2271, 2017.
[91] Christopher Heelan, Jihun Lee, Ronan O'Shea, Laurie Lynch, David M Brandman, Wilson Truccolo, and Arto V Nurmikko. Decoding speech from spike-based neural population recordings in secondary auditory cortex of non-human primates. Communications Biology, 2(1):1–12, 2019.
[92] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376, 2006.
[93] Pengfei Sun, Gopala K Anumanchipalli, and Edward F Chang. Brain2Char: A deep architecture for decoding text from brain recordings. arXiv preprint arXiv:1909.01401, 2019.
[94] Joseph G Makin, David A Moses, and Edward F Chang. Machine translation of cortical activity to text with an encoder–decoder framework. Technical report, Nature Publishing Group, 2020.
[95] Gopala K Anumanchipalli, Josh Chartier, and Edward F Chang. Speech synthesis from neural decoding of spoken sentences. Nature, 568(7753):493, 2019.
[96] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[97] David A Moses, Matthew K Leonard, Joseph G Makin, and Edward F Chang. Real-time decoding of question-and-answer speech dialogue using human cortical activity. Nature communications, 10(1):1–14, 2019.
[98] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-aki Sato, Yusuke Morito, Hiroki C Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915–929, 2008.
[99] Shinji Nishimoto, An T Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L Gallant. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19):1641–1646, 2011.
[100] Kai Qiao, Jian Chen, Linyuan Wang, Chi Zhang, Lei Zeng, Li Tong, and Bin Yan. Category decoding of visual stimuli from human brain activity using a bidirectional recurrent neural network to simulate bidirectional information flows in human visual cortices. Frontiers in neuroscience, 13, 2019.
[101] Randall Jordan Ellis and Michael Michaelides. High-accuracy decoding of complex visual scenes from neuronal calcium responses. bioRxiv, page 271296, 2018.
[102] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[103] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
[104] Katja Seeliger, Umut Güçlü, Luca Ambrogioni, Yagmur Güçlütürk, and Marcel AJ van Gerven. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage, 181:775–785, 2018.
[105] Yağmur Güçlütürk, Umut Güçlü, Katja Seeliger, Sander Bosch, Rob van Lier, and Marcel AJ van Gerven. Reconstructing perceived faces from brain activations with deep adversarial neural decoding. In
Advances in Neural Information Processing Systems, pages 4246–4257, 2017.
[106] Ghislain St-Yves and Thomas Naselaris. Generative adversarial networks conditioned on brain activity reconstruct seen images. In , pages 1054–1061. IEEE, 2018.
[107] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Jiayue Cao, and Zhongming Liu. Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral Cortex, 28(12):4136–4160, 2018.
[108] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image reconstruction from human brain activity.
PLoS computational biology, 15(1):e1006633, 2019.
[109] Guohua Shen, Kshitij Dwivedi, Kei Majima, Tomoyasu Horikawa, and Yukiyasu Kamitani. End-to-end deep image reconstruction from human brain activity. Frontiers in Computational Neuroscience, 13, 2019.
[110] Ardi Tampuu, Tambet Matiisen, H Freyja Ólafsdóttir, Caswell Barry, and Raul Vicente. Efficient neural decoding of self-location with a deep recurrent network. PLoS computational biology, 15(2):e1006822, 2019.
[111] Mohammad R Rezaei, Anna K Gillespie, Jennifer A Guidera, Behzad Nazari, Saeid Sadri, Loren M Frank, Uri T Eden, and Ali Yousefi. A comparison study of point-process filter and deep learning performance in estimating rat position using an ensemble of place cells. In , pages 4732–4735. IEEE, 2018.
[112] Zishen Xu, Wei Wu, Shawn S Winter, Max L Mehlman, William N Butler, Christine M Simmons, Ryan E Harvey, Laura E Berkowitz, Yang Chen, Jeffrey S Taube, et al. A comparison of neural decoding methods and population coding across thalamo-cortical head direction cells. Frontiers in Neural Circuits, 13, 2019.
[113] Hongming Li and Yong Fan. Interpretable, highly accurate brain decoding of subtly distinct brain states from functional MRI using intrinsic functional networks and long short-term memory recurrent neural networks. NeuroImage, 202:116059, 2019.
[114] So-Hyeon Yoo, Seong-Woo Woo, and Zafar Amad. Classification of three categories from prefrontal cortex using LSTM networks: fNIRS study. In , pages 1141–1146. IEEE, 2018.
[115] Eleanor Batty, Matthew Whiteway, Shreya Saxena, Dan Biderman, Taiga Abe, Simon Musall, Winthrop Gillis, Jeffrey Markowitz, Anne Churchland, John P Cunningham, et al. BehaveNet: nonlinear embedding and Bayesian neural decoding of behavioral videos. In Advances in Neural Information Processing Systems, pages 15680–15691, 2019.
[116] Simon M Hofmann, Felix Klotzsche, Alberto Mariola, Vadim V Nikulin, Arno Villringer, and Michael Gaebler. Decoding subjective emotional arousal during a naturalistic VR experience from EEG using LSTMs. In , pages 128–131. IEEE, 2018.
[117] Anumit Garg, Ashna Kapoor, Anterpreet Kaur Bedi, and Ramesh K Sunkaria. Merged LSTM model for emotion classification using EEG signals. In , pages 139–143. IEEE, 2019.
[118] Samarth Tripathi, Shrinivas Acharya, Ranti Dev Sharma, Sudhanshu Mittal, and Samit Bhattacharya. Using deep and convolutional neural networks for accurate emotion classification on DEAP dataset. In Twenty-Ninth IAAI Conference, 2017.
[119] Gregory Ciccarelli, Michael Nolan, Joseph Perricone, Paul T Calamia, Stephanie Haro, James O'Sullivan, Nima Mesgarani, Thomas F Quatieri, and Christopher J Smalt. Comparison of two-talker attention decoding from EEG with nonlinear neural networks and linear methods.
Scientific reports, 9(1):1–10, 2019.
[120] European Journal of Neuroscience, 2017.
[121] Elaine Astrand, Pierre Enel, Guilhem Ibos, Peter Ford Dominey, Pierre Baraduc, and Suliann Ben Hamed. Comparison of classifiers for decoding sensory and cognitive information from prefrontal neuronal populations. PloS one, 9(1), 2014.
[122] Marc-Andre Schulz, Thomas Yeo, Joshua Vogelstein, Janaina Mourao-Miranda, Jakob Kather, Konrad Kording, Blake A Richards, and Danilo Bzdok. Deep learning for brains?: Different linear and nonlinear scaling in UK Biobank brain images vs. machine-learning datasets. bioRxiv, page 757054, 2019.
[123] Rajat Mani Thomas, Selene Gallo, Leonardo Cerliani, Paul Zhutovsky, Ahmed El-Gazzar, and Guido van Wingen. Classifying autism spectrum disorder using the temporal statistics of resting-state functional MRI data with 3D convolutional neural networks.
Frontiers in Psychiatry, 11:440, 2020.
[124] Johannes Hennrich, Christian Herff, Dominic Heger, and Tanja Schultz. Investigating deep learning for fNIRS based BCI. In , pages 2844–2847. IEEE, 2015.
[125] Ashesh K Dhawale, Rajesh Poddar, Steffen BE Wolff, Valentin A Normand, Evi Kopelowitz, and Bence P Ölveczky. Automated long-term recording and analysis of neural activity in behaving animals. Elife, 6:e27702, 2017.
[126] Allen Brain Observatory. Available at: http://observatory.brain-map.org/visualcoding, 2016.
[127] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
[128] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3319–3328. JMLR.org, 2017.
[129] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pages 9505–9515, 2018.
[130] Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 3(3):e10, 2018.
[131] The OpenAI Microscope. https://microscope.openai.com/models, 2020. Accessed: 2020-05-12.
[132] Nikolaus Kriegeskorte and Pamela K Douglas. Interpreting encoding and decoding models. Current opinion in neurobiology, 55:167–179, 2019.
[133] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In
Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[134] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[135] Tim Christian Kietzmann, Patrick McClure, and Nikolaus Kriegeskorte. Deep neural networks in computational neuroscience. bioRxiv, page 133504, 2018.
[136] Blake A Richards, Timothy P Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, et al. A deep learning framework for neuroscience.
Nature neuroscience, 22(11):1761–1770, 2019.
[137] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
[138] David Zipser and Richard A Andersen. A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331(6158):679–684, 1988.
[139] David Sussillo, Mark M Churchland, Matthew T Kaufman, and Krishna V Shenoy. A neural network that finds a naturalistic solution for the production of muscle activity. Nature neuroscience, 18(7):1025–1033, 2015.
[140] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356, 2016.