Reconstructing Perceptive Images from Brain Activity by Shape-Semantic GAN
Tao Fang, Yu Qi∗, Gang Pan∗
[email protected], [email protected], [email protected]
College of Computer Science and Technology, Zhejiang University
State Key Lab of CAD&CG, Zhejiang University
The First Affiliated Hospital, College of Medicine, Zhejiang University
∗Corresponding authors: Yu Qi and Gang Pan
Abstract
Reconstructing seeing images from fMRI recordings is an absorbing research area in neuroscience and provides a potential brain-reading technology. The challenge lies in that visual encoding in the brain is highly complex and not fully revealed. Inspired by the theory that visual features are hierarchically represented in the cortex, we propose to break the complex visual signals into multi-level components and decode each component separately. Specifically, we decode shape and semantic representations from the lower and higher visual cortex respectively, and merge the shape and semantic information into images by a generative adversarial network (Shape-Semantic GAN). This 'divide and conquer' strategy captures visual information more accurately. Experiments demonstrate that Shape-Semantic GAN improves the reconstruction similarity and image quality, and achieves state-of-the-art image reconstruction performance.
Introduction

Decoding visual information and reconstructing stimulus images from brain activity is a meaningful and attractive task in neural decoding. fMRI signals, which record variations in the blood-oxygen-level-dependent (BOLD) response, can reveal the correlation between brain activity and different visual stimuli by monitoring blood oxygen content. An fMRI-based image reconstruction method can help us understand the visual mechanisms of the brain and provide a way to 'read the mind'.
Previous studies.
According to previous studies, a mapping between activities in the visual cortex and visual stimuli is supposed to exist [1], and perceived images have proved to be decodable from fMRI recordings [2–4]. Early approaches estimated this mapping with linear models such as linear regression [3–7]. These approaches usually first extract specific features from the images, for instance multi-scale local image bases [3] or features of Gabor filters [2], and then learn a linear mapping from fMRI signals to the image features. Linear methods mostly focus on reconstructing low-level features, which is insufficient for reconstructing complex images such as natural images. After the homogeneity between the hierarchical representations of the brain and deep neural networks (DNNs) was revealed [8], methods based on this finding achieved strong reconstruction performance [9, 10]. Shen et al. [9] used convolutional neural network (CNN) models to extract image features and learned the mapping from fMRI signals to the CNN-based image features, which successfully reconstructed natural images. Recently, the development of DNNs has made it possible to learn nonlinear mappings from brain signals to stimulus images in an end-to-end manner [11–15]. DNN-based approaches have remarkably improved reconstruction performance, including encoder-decoder models [12, 16] and generative adversarial network (GAN) based models [11–15]. Shen et al. [11] proposed a DNN-based decoder which learned nonlinear mappings from fMRI signals to the seeing images effectively. Beliy et al. [12] learned a bidirectional mapping between fMRI signals and stimulus images using an unsupervised GAN. These recent approaches achieve higher image quality and can reconstruct more natural-looking images than linear methods.

Learning a mapping from fMRI recordings to the corresponding stimulus images is a challenging problem. The difficulty mostly lies in that brain activity in the visual cortex is complex and not fully revealed. Studies have shown that there exists a hierarchical increase in the complexity of representations in the visual cortex [17], and [6] demonstrated that exploiting information from different visual areas can help improve reconstruction performance. Simple decoding models that ignore this hierarchical information may be insufficient for accurate reconstruction.

The hierarchical structure of information encoding in visual areas has been widely studied [17–19]. On the one hand, activities in the early visual areas respond strongly to low-level image features such as shapes and orientations [20–23]. On the other hand, anterior visual areas are mostly involved in high-level information processing, and their activities correlate strongly with the semantic content of stimulus images [8, 17]. Such high-level image features are more categorical and invariant than low-level features in identification or reconstruction [19]. The hierarchical processing in the visual cortex inspired us to decode low-level and high-level image features from the lower visual cortex (LVC) and higher visual cortex (HVC) separately [8].

In this study, we propose a novel method to reconstruct images from fMRI signals by decomposing the decoding task into hierarchical subtasks: shape decoding in the lower visual cortex and semantic decoding in the higher visual cortex (Figure 1).
In shape decoding, we propose a linear model to predict the outline of the core object from the fMRI signals of the lower visual cortex. In semantic decoding, we propose to learn effective features with a DNN model to represent high-level information from higher visual cortex activities. Finally, the shape and semantic features are combined as the input to a GAN to generate natural-looking images under the shape and semantic conditions. Data augmentation is employed to supplement the limited fMRI data and improve the reconstruction quality.

Experiments are conducted to evaluate the image reconstruction performance of our method in comparison with state-of-the-art approaches. Results show that the Shape-Semantic GAN model outperforms the leading methods. The main contributions of this work can be summarized as follows:

• Instead of directly using end-to-end models to predict seeing images from fMRI signals, we propose to break the complex visual signals into multi-level components and decode each component separately. This 'divide and conquer' approach can extract visual information accurately.

• We propose a linear-model-based shape decoder and a DNN-based semantic decoder, which are capable of decoding shape and semantic information from the lower and higher visual cortex respectively.

• We propose a GAN model to merge the decoded shape and semantic information into images, which can generate natural-looking images given shape and semantic conditions. The performance of GAN-based image generation can be further improved by a data augmentation technique.
Method

The proposed framework is composed of three key components: a shape decoder, a semantic decoder and a GAN image generator, as illustrated in Figure 1. Let x denote the fMRI recordings and y the corresponding images perceived by the subjects during the experiments. Our purpose is to reconstruct each subject's perceived images y from the corresponding fMRI data x. In our method, we use the shape decoder C to reconstruct the stimulus images' shapes r_sp and the semantic decoder S to extract semantic features r_sm from x. The image generator G, implemented by a GAN, reconstructs the stimulus images G(r_sp, r_sm) in the final stage of decoding.

[Figure 1: The framework of the proposed method. The decoding task is divided into two parts: training (A) a semantic decoder (extracting high-level features) and (B) a shape decoder (extracting low-level features) to decode from the higher and lower visual cortex respectively. The decoded shape and semantic information are input to (C) a GAN model to reconstruct the seeing images.]

Dataset. We make use of a publicly available benchmark dataset from [9]. In this dataset, brain activity data were collected in the form of functional images covering the whole brain. The corresponding stimulus images were selected from ImageNet, including 1200 training images from 150 categories and 50 test images from 50 categories. The training and test images have no overlap with each other, and each image has 5 fMRI recordings for training or 24 for test.

The fMRI signals contain information from different visual areas. The early visual areas V1, V2 and V3 were defined by standard retinotopic mapping procedures [9, 24, 25], and V1, V2 and V3 were concatenated as an area named the lower visual cortex [8]. The higher visual cortex is composed of regions including the parahippocampal place area (PPA), lateral occipital complex (LOC) and fusiform face area (FFA), defined as in [8].
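The overall pipeline is simply a composition of the three components. The sketch below is a minimal illustration of this composition; the function names, voxel counts and feature dimensions are our own assumptions for the example, not values fixed by the paper:

```python
# Minimal sketch of the decoding pipeline: y_hat = G(C(x_lvc), S(x_hvc)).
# All sizes and the stand-in models below are hypothetical.
import numpy as np

def reconstruct(x_lvc, x_hvc, C, S, G):
    r_sp = C(x_lvc)        # shape decoder on lower-visual-cortex voxels
    r_sm = S(x_hvc)        # semantic decoder on higher-visual-cortex voxels
    return G(r_sp, r_sm)   # GAN generator conditioned on shape + semantics

# Toy stand-ins so the sketch runs end to end.
C = lambda x: np.zeros((32, 32))             # decoded shape (patch image)
S = lambda x: np.zeros(64)                   # decoded semantic vector
G = lambda sp, sm: np.zeros((256, 256, 3))   # reconstructed image
y_hat = reconstruct(np.random.randn(4000), np.random.randn(3000), C, S, G)
print(y_hat.shape)  # (256, 256, 3)
```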
Shape decoder. In order to obtain the outline of the visual stimulus, we present a shape decoder C that extracts low-level image cues from the lower visual cortex using linear models. Using a simple model to obtain low-level visual features, which has been demonstrated feasible by previous studies [2], avoids the overfitting risk of complex models. The shape decoder C consists of three base decoders trained for V1, V2 and V3 individually, and a combiner that merges the results of the base shape decoders. The process of shape decoding is described in Figure 2. The stimulus images are first preprocessed by shape detection and feature extraction before shape decoder training.

Shape detection.
First, image matting is conducted on the stimulus images to extract the core objects [26] and remove the interference of other parts of the images. The objects in the stimulus images are extracted based on saliency detection [27] and manual annotation. The results are binarized to eliminate the influence of minor variance in the images and to emphasize objects over background. By this preprocessing, the core objects are extracted, and details that help little in shape decoding (such as colors or textures) are eliminated.
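As a concrete illustration of the binarization step, the sketch below thresholds a soft object mask into a binary shape image. The paper obtains the mask from saliency detection [27] plus manual annotation; the simple mean-threshold rule here is only an assumption for illustration:

```python
import numpy as np

def binarize_mask(mask: np.ndarray) -> np.ndarray:
    """Turn a soft [0, 1] object mask into a binary shape image,
    suppressing minor intensity variance (threshold rule assumed)."""
    return (mask > mask.mean()).astype(np.float32)

shape_img = binarize_mask(np.random.rand(256, 256))  # toy mask
```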
Feature extraction.
Second, square image patches are used for feature extraction. Pixel values located in non-overlapping m × m pixel square patches are averaged to give each patch's value. By representing shapes as contrast-defined patch images, the amount of computation is reduced and the invariance to small distortions is improved. In our model, m = 8 is selected.
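A minimal sketch of this contrast-defined patch representation follows; a 256 × 256 input (the generator's input resolution) is assumed for illustration:

```python
import numpy as np

def patch_average(img: np.ndarray, m: int = 8) -> np.ndarray:
    """Average non-overlapping m x m patches of a 2-D shape image."""
    h, w = img.shape
    assert h % m == 0 and w % m == 0, "image side must be divisible by m"
    return img.reshape(h // m, m, w // m, m).mean(axis=(1, 3))

patches = patch_average(np.random.rand(256, 256), m=8)  # -> (32, 32) patch image
```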
Model training. Because V1, V2 and V3 each contain a representation of the visual space, we train a base decoder for each of them and use a linear weighting combiner to merge the decoded shapes. The values of the image patches are normalized to [0, 1] and flattened to one-dimensional vectors p. The decoded shape vector for area k is p*_k = c_k(x_k), where c_k is the base shape decoder whose parameters η_k are optimized by

$$\eta_k = \arg\min_{\eta_k} \lVert c_k(x_k) - p_k \rVert, \quad k \in \{\mathrm{V1}, \mathrm{V2}, \mathrm{V3}\}, \tag{1}$$

where η_k denotes the weights of the base shape decoder c_k and k denotes the visual area that the samples belong to. Each base decoder c_k is implemented by linear regression and trained on the fMRI recordings of V1, V2 and V3 individually. Then a combiner is trained to combine the predicted results p*_V1, p*_V2 and p*_V3:

$$r_{sp}(i, j) = \sum_{k} w_{kij}\, p^{*}_{k}(i, j), \quad k \in \{\mathrm{V1}, \mathrm{V2}, \mathrm{V3}\}, \tag{2}$$

where r_sp refers to the predicted shape computed by the combiner, r_sp(i, j) is the pixel value at position (i, j), and w_kij is the combiner weight for area k at pixel (i, j). The combining weights are computed independently for each pixel. The results r_sp predicted by the combiner are resized to the same size as the stimulus images (256 × 256 pixels).

[Figure 2: Flow chart of the shape decoder. Linear models are trained to predict the shapes from V1 to V3 individually, and the intermediate results p*_k are then combined to obtain the decoded shapes r_sp.]

Semantic decoder. To render semantically meaningful details on shapes, a semantic decoder is used to provide categorical information. Although images can be rendered from shapes alone with a pre-trained GAN model, in practice we find that the results are not always acceptable because of the lack of conditions. The mapping from shapes to real images is not unique in many cases (e.g., a circular shape can be translated into a football, a crystal ball or a golf ball, and all of these translations are judged correct by the discriminator). Besides, noise retained in the shapes interferes with the reconstruction quality in the absence of other conditions. Therefore, reconstructing on the shape condition alone is not sufficient. A semantic context, which guides the GAN model with the image's category, is helpful when incorporated with the shape features in the training phase.

The input to the semantic decoder is the fMRI signal in HVC. HVC covers the regions of LOC, FFA and PPA, whose voxels show significantly high responses to high-level features such as objects, faces or scenes respectively [8]. As shown in Figure 1, a lightweight DNN model is introduced to generate semantic features. The DNN model consists of one input layer (the same size as the number of input voxels), two hidden layers and one output layer. The tanh activation function is used between the hidden layers and the sigmoid activation function is used for classification. When training the DNN model, the fMRI recordings in HVC are classified by the model to infer the categories of their corresponding stimulus images. After the training phase, the DNN model works as a semantic decoder. Since the penultimate layer of a DNN acts as a semantic space supporting the classification task at the output layer [10], the features of this layer are adopted as the semantic representation of the fMRI signals in our method.
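A minimal PyTorch sketch of the semantic decoder just described follows. The hidden-layer widths are assumptions (the paper only specifies one input layer, two hidden layers with tanh between them, and a sigmoid classification output); the penultimate activations are read out as the semantic feature r_sm:

```python
import torch
import torch.nn as nn

class SemanticDecoder(nn.Module):
    """HVC voxels -> category scores; penultimate activations act as r_sm."""
    def __init__(self, n_voxels, n_classes, h1=512, h2=64):  # widths assumed
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(n_voxels, h1),
            nn.Tanh(),                       # tanh between the hidden layers
            nn.Linear(h1, h2),
        )
        self.head = nn.Linear(h2, n_classes)

    def forward(self, x):
        r_sm = self.features(x)                  # semantic representation
        probs = torch.sigmoid(self.head(r_sm))   # sigmoid for classification
        return probs, r_sm

model = SemanticDecoder(n_voxels=3000, n_classes=150)
probs, r_sm = model(torch.randn(8, 3000))        # toy batch of HVC patterns
```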
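Similarly, the shape decoder's base-decoder fitting and per-pixel combination (Eqs. (1) and (2) above) can be sketched with scikit-learn. Fitting the combiner weights on a held-out split is our assumption, since the paper does not specify how the combiner is trained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_shape_decoder(x_by_roi, p, n_fit):
    """x_by_roi: dict ROI name -> (n_samples, n_voxels) fMRI matrix;
    p: (n_samples, n_pixels) flattened patch shapes in [0, 1].
    Returns the base decoders c_k (Eq. 1) and per-pixel combiner
    weights w (Eq. 2), the latter fitted on a held-out split (assumed)."""
    rois = list(x_by_roi)
    base = {k: LinearRegression().fit(x_by_roi[k][:n_fit], p[:n_fit])
            for k in rois}                                       # Eq. (1)
    # Base-decoder predictions on the remaining samples: (n, n_pixels, n_rois).
    preds = np.stack([base[k].predict(x_by_roi[k][n_fit:]) for k in rois],
                     axis=-1)
    target = p[n_fit:]
    # One least-squares weight vector per pixel (Eq. 2).
    w = np.stack([np.linalg.lstsq(preds[:, j], target[:, j], rcond=None)[0]
                  for j in range(p.shape[1])])
    return base, w

rng = np.random.default_rng(0)
x_by_roi = {k: rng.normal(size=(100, 500)) for k in ("V1", "V2", "V3")}
p = rng.random((100, 32 * 32))        # 32 x 32 patch images, flattened
base, w = train_shape_decoder(x_by_roi, p, n_fit=80)
```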
Image generator. To reconstruct images that look more realistic and are filled with meaningful details, an encoder-decoder GAN, following the image translation method of [28], is introduced in the final stage of image reconstruction.

In image reconstruction, many low-level features (such as contours) are shared between the input shapes and the output natural images, and these need to be passed directly across the decoder to reconstruct images with accurate shapes. Therefore, we adopt the U-Net [29], an encoder-decoder structure with skip connections. A traditional encoder-decoder model passes information through a bottleneck structure to extract high-level features, so low-level features such as shapes and textures can be lost, and the few shape features retained in the output can cause deformation in the reconstruction. With the U-Net structure, more low-level features can be passed from the input space to the reconstruction space through the skip connections, without the limitation of the bottleneck.

The generator is composed of a symmetric encoder-decoder pair. The encoder and decoder have eight convolutional or deconvolutional layers with symmetric parameters, and no additional down-sampling or up-sampling layers are used. The input layer takes the 256 × 256 pixel shape images as input. The bottleneck between the encoder and decoder represents the high-level features extracted by the convolutional layers of the encoder; it is modified to take both the semantic features r_sm and these high-level features as input to the decoder. In this way the generator is optimized under the constraint of both semantic and shape conditions. The discriminator takes the shape r_sp and the output of the generator together as input and predicts the similarity of the high-frequency structures between these two domains; this similarity is used to guide the generator training.

Let G_θ denote the U-Net generator and D_φ the discriminator, whose parameters θ and φ are optimized by minimizing the loss function L(θ, φ). The objective of the conditional GAN is composed of two components:

$$L(\theta, \phi) = L_{adv}(\theta, \phi) + \lambda_{img} L_{img}(\theta), \tag{3}$$

where L_adv(θ, φ) and L_img(θ) denote the adversarial loss and the image-space loss, and λ_img defines the weight of the image-space loss L_img in L(θ, φ). As noted in [28], the L1 loss accurately captures the low frequencies, while the GAN discriminator is designed to model the high-frequency structures. By combining these two terms in the loss function, blurred reconstructions are not tolerated by the discriminator, and low-frequency visual features are retained at the same time. The adversarial loss and the image-space loss used in optimizing the generator can be expressed as

$$L_{adv}(\theta, \phi) = -\mathbb{E}_{r_{sp}, r_{sm}}\big[\log\big(D_\phi(r_{sp}, G_\theta(r_{sp}, r_{sm}))\big)\big], \tag{4}$$

$$L_{img}(\theta) = \mathbb{E}_{r_{sp}, r_{sm}, y}\big[\,|y - G_\theta(r_{sp}, r_{sm})|\,\big], \tag{5}$$

where r_sp, r_sm and y refer to the shapes, semantic features and stimulus images. During the training phase, gradient descent is computed on G_θ and D_φ alternately. Instead of directly training G_θ to minimize log(1 − D_φ(r_sp, G_θ(r_sp, r_sm))), we follow the recommendation of [28] and maximize log(D_φ(r_sp, G_θ(r_sp, r_sm))). The objective of the discriminator is

$$L_{discr}(\theta, \phi) = -\mathbb{E}_{r_{sp}, y}\big[\log D_\phi(r_{sp}, y)\big] - \mathbb{E}_{r_{sp}, r_{sm}}\big[\log\big(1 - D_\phi(r_{sp}, G_\theta(r_{sp}, r_{sm}))\big)\big]. \tag{6}$$
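A direct transcription of Eqs. (3)–(6) in PyTorch might look like the following sketch. The discriminator is assumed to output a probability in (0, 1), and the small epsilon for numerical stability (plus the toy stand-ins for D and G) are our additions:

```python
import torch

EPS = 1e-8  # numerical stability; not part of Eqs. (3)-(6)

def generator_loss(D, G, r_sp, r_sm, y, lambda_img=100.0):
    """Eq. (3): non-saturating adversarial term (Eq. 4) + L1 term (Eq. 5)."""
    fake = G(r_sp, r_sm)
    l_adv = -torch.log(D(r_sp, fake) + EPS).mean()   # maximize log D(...)
    l_img = (y - fake).abs().mean()                  # L1 image-space loss
    return l_adv + lambda_img * l_img

def discriminator_loss(D, G, r_sp, r_sm, y):
    """Eq. (6): real pairs {r_sp, y} vs. fake pairs {r_sp, G(r_sp, r_sm)}."""
    fake = G(r_sp, r_sm).detach()  # freeze G while D is updated
    return (-torch.log(D(r_sp, y) + EPS)
            - torch.log(1.0 - D(r_sp, fake) + EPS)).mean()

# Toy stand-ins to exercise the two losses.
G = lambda sp, sm: torch.sigmoid(sp + sm.mean())
D = lambda sp, img: torch.sigmoid((sp * img).mean(dim=(1, 2, 3)))
sp, sm, y = torch.rand(2, 3, 256, 256), torch.rand(2, 64), torch.rand(2, 3, 256, 256)
print(generator_loss(D, G, sp, sm, y), discriminator_loss(D, G, sp, sm, y))
```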
When G_θ is being trained, it tries to optimize θ to reduce the distance between the generated images G_θ(r_sp, r_sm) and the stimulus images y. It also tries to generate images that share a similar high-frequency structure with the shapes r_sp, in order to confuse D_φ and make D_φ predict G_θ(r_sp, r_sm) as correct. When D_φ is trained, it tries to optimize φ to distinguish the pairs {r_sp, y} from the pairs {r_sp, G_θ(r_sp, r_sm)}. Each time one of G_θ or D_φ is trained, the other's parameters are fixed.

Data augmentation. Since the size of the fMRI dataset is limited, we propose to improve the image reconstruction performance by data augmentation in GAN training. We sample the augmented images from the ImageNet dataset. For shape augmentation, the shape-detection and feature-extraction preprocessing described above is conducted on the augmented images, and the contrast-defined, m × m-patch images R_sp serve as the shapes of the augmented images. For semantic augmentation, the category-average semantic feature R_sm is computed as a substitute for the semantic vector; R_sm is defined as the vector obtained by averaging the semantic features of the samples annotated with the same category. By combining the shapes and category-average semantic features generated from the augmented images into {R_sp, R_sm} pairs, the new samples are concatenated with the {r_sp, r_sm} pairs as inputs to G_θ, which eventually enhances the generality of G_θ. Note that in our method the image augmentation can only be conducted within images that correspond to the same classes as the training images. In reconstruction, about 1.2k augmented natural images are randomly selected from the same image dataset as [9] (ILSVRC2012), and they have no overlap with the training or test set.

Implementation details. We implemented the image generator using the PyTorch framework and modified the image translation model provided by [28]. The image generator consists of a U-Net generator G and a discriminator D. In both G and D, the kernel size is (4, 4), the stride is (2, 2) and the padding size is 1 for all layers. The generator is composed of 8 parametrically symmetric convolutional/deconvolutional layers with LeakyReLU (0.2) as the activation function. All the input images (the stimulus images and shape images) of G and D are resized to (256, 256, 3).

In GAN training, minibatch SGD is used and the Adam solver is employed to optimize the parameters with momentum β₁ = 0.9 and β₂ = 0.999. The initial learning rate is 2 × 10⁻⁴ and 10 samples are input in a batch. The weights of the individual loss terms affect the quality of the reconstructed images; in our experiments we set λ_img = 100 to balance the sharpness of the results against their similarity to the stimulus images. The image generator is trained for 200 epochs in total, with the learning rate decaying at epoch 120.

Evaluation. To evaluate the quality of the reconstructed images, we conduct both a visual comparison and a quantitative comparison. In the quantitative comparison, pairwise similarity comparison analysis, introduced in [11], is used to measure the quality of the reconstructed images: each reconstructed image is compared with two candidate images (the ground-truth image and a randomly selected test image) to test whether its correlation with the ground-truth image is higher. In our experiments the structural similarity index (SSIM) [30] is used as the correlation measure; SSIM measures the similarity of the local structure between the reconstructed and original images in spatially close pixels [11].
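The category-average semantic feature R_sm described above can be computed in a few lines; the array-based layout here is an illustrative assumption:

```python
import numpy as np

def category_average_semantics(r_sm, labels):
    """Replace each sample's semantic vector with the mean vector of its
    category, giving the substitute features R_sm for augmented images."""
    means = {c: r_sm[labels == c].mean(axis=0) for c in np.unique(labels)}
    return np.stack([means[c] for c in labels])

labels = np.repeat(np.arange(10), 5)                  # toy category labels
R_sm = category_average_semantics(np.random.rand(50, 64), labels)
```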
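The pairwise similarity protocol can likewise be sketched as follows, assuming a recent scikit-image (`structural_similarity` with the `channel_axis` argument); the exact sampling of the distractor image is our assumption:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def pairwise_ssim_accuracy(recons, truths, seed=0):
    """Fraction of reconstructions whose SSIM with the ground truth exceeds
    their SSIM with a randomly chosen other test image."""
    rng = np.random.default_rng(seed)
    wins, n = 0, len(recons)
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])  # random distractor
        s_true = ssim(recons[i], truths[i], channel_axis=-1, data_range=1.0)
        s_rand = ssim(recons[i], truths[j], channel_axis=-1, data_range=1.0)
        wins += int(s_true > s_rand)
    return wins / n

recons = truths = [np.random.rand(64, 64, 3) for _ in range(5)]
print(pairwise_ssim_accuracy(recons, truths))  # 1.0 for identical pairs
```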
Results

Here we compare the image reconstruction performance with existing approaches. The competitors include [12], [11] and [9]. For visual comparison, we directly use the reconstructed images reported in the papers of [12], [11] and [9], respectively. For quantitative comparison, we use the reported pairwise similarity with SSIM for [11]. For [12], we run the code published along with the paper, using the same data augmentation images as our approach. All pairwise similarity results are averaged over five runs to mitigate the effect of randomness.

Samples of reconstructed images are presented in Figure 3, in comparison with existing approaches. Similar to [9], the test fMRI samples corresponding to the same category are averaged across trials to improve the fMRI signals' signal-to-noise ratio (SNR). The results are reconstructed from the test fMRI recordings of three subjects (150 samples in total), and the performance of this model is compared with the leading methods on the same dataset [9].

[Figure 3: Image reconstruction performance comparison with other methods. (a) Images reconstructed by different methods. (b) Performance comparison with pairwise similarity.]

In the visual comparison, we compare our reconstructed images with the methods of [9, 11] and [12] in Figure 3a. Owing to the U-Net model trained with semantic information, our model's reconstructed images are vivid and close to the real stimulus images in color. Also, under the constraint of the shape conditions, the reconstructed images share similar structures with the original images. In the quantitative comparison, we conduct the pairwise similarity comparison based on SSIM against the existing methods of [12] and [11]. The comparison of the different approaches on the three subjects' fMRI recordings is displayed in Figure 3b. Results show that our method performs slightly better than [12] (ours 65.3% vs. 64.3% on average) and outperforms [11] (62.9% on average).
Decoding performance of different ROIs. In this experiment, we evaluate the decoding performance of shape/semantic information from different ROIs (regions of interest). Forty samples from the original training set are reserved for validation and the rest are used for training the decoders in this experiment.

[Figure 4: Decoding performance of different ROIs. (a) Performance of semantic decoding with different ROIs. (b) Performance of shape decoding with different ROIs.]

To compare the semantic representation performance of different ROIs, semantic decoders (DNN models) are trained on the fMRI signals of each visual area individually. The trained DNN models are used to decode semantic features from the validation fMRI samples and to identify their corresponding categories. To facilitate the comparison, we use 10-category rough labels in this section (see supplementary materials). The identification accuracy of the semantic representations decoded from different ROIs is compared in Figure 4a. Results show that the semantic representations extracted from the fMRI data in HVC outperform those extracted from other areas such as LVC (by 14.5%), suggesting that voxels in more anterior areas like HVC correlate strongly with abstract features.

To compare the decoding performance of shape features from different ROIs, we train shape decoders on different visual areas respectively. The similarity between the decoded shapes and the shapes of the stimulus images is measured by the pairwise similarity comparison based on SSIM in Figure 4b. Results show that decoding shapes from the fMRI data in LVC performs better than in other areas such as HVC (by 19.3%), indicating that signals in LVC respond strongly to low-level image features and details.

The finding that improved performance can be achieved when different decoding models are trained for low/high-level features on lower/higher visual areas respectively is also in line with previous studies [6, 8]. In our experiments, models trained on the whole visual cortex (VC) perform slightly worse than those trained only on LVC/HVC for the shape/semantic decoding tasks, probably because of the interference caused by low-correlation visual areas in VC (such as introducing higher visual areas when decoding low-level features like shapes). Note that since the information processed in HVC should theoretically also be contained in LVC in the form of low-level features, it can be inferred that semantic decoding from signals in LVC may perform better with a deeper model.

[Figure 5: Effectiveness of semantics. (a) Training without semantics works well on part of the samples. (b) Failed/successful reconstructions on some samples without/with semantic conditions. (c) Quantitative comparison of reconstruction with/without semantics.]
Effectiveness of semantics. We conduct an ablation study to evaluate the necessity of introducing semantics in our model. For comparison, two different training methods are used for reconstruction: reconstructing with and without semantic features. For the model without semantics, we remove the semantic decoder and replace the image generator with a standard pix2pix model, which is trained to translate shapes to images directly.

As shown in Figure 5a, using the image translation model to reconstruct images from shapes alone performs well on part of the samples, with results similar to those reconstructed with semantic information. These successful cases without semantic information usually depend on effective and clear shape decoding. However, most fMRI signals have a low SNR [12] and many decoded shapes are similar to one another. As shown in Figure 5b, the GAN model cannot make the right decisions from these noisy or similar shapes alone (left-hand side of Figure 5b), which causes the reconstructed images to be rendered with uncorrelated details (such as colors). Images reconstructed from the same shapes with semantic information are shown on the right-hand side of Figure 5b: the colors of the generated images are corrected under the guidance of the semantic information. Quantitative results are shown in Figure 5c: the images reconstructed with semantics perform better than those without (65.3% vs. 62.5%). These results indicate that, by reconstructing with the categorical information in the semantic features, our model improves the reconstruction performance visually and can reconstruct images more accurately.
Effectiveness of augmentation. To evaluate the improvement in reconstruction quality brought by data augmentation, we train models with and without augmentation respectively and compare them on the test set: one model is trained on the augmented dataset and the other is trained solely on the original training set as a contrast. The results are shown in Figure 6. In the visual comparison, the images reconstructed with augmented data look more natural and closer to the ground-truth images than those reconstructed only from the original training set (Figure 6a). In the quantitative comparison, the model trained with data augmentation performs slightly better than the one without augmentation (65.3% vs. 63.6%). By adding more images to the GAN training phase, the weakness of the limited dataset size is compensated and the model learns the distribution over more natural images, contributing to the improvement in reconstruction.

[Figure 6: Effectiveness of augmentation. (a) Comparison of images reconstructed with/without augmentation. (b) Quantitative comparison of reconstruction with/without augmentation.]
Conclusion

In this paper, we demonstrate the feasibility of reconstructing stimulus images from fMRI recordings by decoding shape and semantic features separately and merging the shape and semantic information into natural-looking images with a GAN. This 'divide and conquer' strategy simplifies the fMRI decoding and image reconstruction task effectively. Results show that the proposed Shape-Semantic GAN improves the reconstruction similarity and image quality.
Broader Impact
The proposed Shape-Semantic GAN method provides a novel solution to visual reconstruction from brain activity and presents a potential brain-reading technique. This method can help people understand human perception and thinking, and may help promote the development of neuroscience. However, the development of such brain-reading methods may invade the privacy of the information within people's minds, and may cause people to worry about freedom of thought.
Acknowledgment
This work was partly supported by grants from the National Key Research and Development Program of China (2018YFA0701400), the National Natural Science Foundation of China (61906166, U1909202, 61925603, 61673340), and the Key Research and Development Program of Zhejiang Province in China (2020C03004).
References

[1] Russell A. Poldrack and Martha J. Farah. Progress and challenges in probing the human brain. Nature, 526(7573):371–379, 2015.
[2] Takashi Yoshida and Kenichi Ohki. Natural images are reliably represented by sparse and variable populations of neurons in visual cortex. Nature Communications, 11(1):1–19, 2020.
[3] Yoichi Miyawaki, Hajime Uchida, Okito Yamashita, Masa-aki Sato, Yusuke Morito, Hiroki C. Tanabe, Norihiro Sadato, and Yukiyasu Kamitani. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5):915–929, 2008.
[4] Yusuke Fujiwara, Yoichi Miyawaki, and Yukiyasu Kamitani. Modular encoding and decoding models derived from Bayesian canonical correlation analysis. Neural Computation, 25(4):979–1005, 2013.
[5] Marcel A. J. van Gerven, Floris P. de Lange, and Tom Heskes. Neural decoding with hierarchical generative models. Neural Computation, 22(12):3127–3142, 2010.
[6] Thomas Naselaris, Ryan J. Prenger, Kendrick N. Kay, Michael Oliver, and Jack L. Gallant. Bayesian reconstruction of natural images from human brain activity. Neuron, 63(6):902–915, 2009.
[7] Shinji Nishimoto, An T. Vu, Thomas Naselaris, Yuval Benjamini, Bin Yu, and Jack L. Gallant. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology, 21(19):1641–1646, 2011.
[8] Tomoyasu Horikawa and Yukiyasu Kamitani. Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications, 8(1):1–15, 2017.
[9] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image reconstruction from human brain activity. PLoS Computational Biology, 15(1):e1006633, 2019.
[10] Haiguang Wen, Junxing Shi, Yizhen Zhang, Kun-Han Lu, Jiayue Cao, and Zhongming Liu. Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral Cortex, 28(12):4136–4160, 2018.
[11] Guohua Shen, Kshitij Dwivedi, Kei Majima, Tomoyasu Horikawa, and Yukiyasu Kamitani. End-to-end deep image reconstruction from human brain activity. Frontiers in Computational Neuroscience, 13, 2019.
[12] Roman Beliy, Guy Gaziv, Assaf Hoogi, Francesca Strappini, Tal Golan, and Michal Irani. From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI. In Advances in Neural Information Processing Systems, pages 6514–6524, 2019.
[13] Ghislain St-Yves and Thomas Naselaris. Generative adversarial networks conditioned on brain activity reconstruct seen images, pages 1054–1061. IEEE, 2018.
[14] Yunfeng Lin, Jiangbei Li, and Hanjing Wang. DCNN-GAN: Reconstructing realistic image from fMRI, pages 1–6. IEEE, 2019.
[15] Changde Du, Changying Du, and Huiguang He. Sharing deep generative representation for perceived image reconstruction from human brain activity, pages 1049–1056. IEEE, 2017.
[16] Rufin VanRullen and Leila Reddy. Reconstructing faces from fMRI patterns using deep generative neural networks. Communications Biology, 2(1):1–10, 2019.
[17] Daniel J. Felleman and David C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.
[18] David Marvin Green and John Arthur Swets. Signal Detection Theory and Psychophysics. 1966.
[19] Stephanie Ding, Christopher J. Cueva, Misha Tsodyks, and Ning Qian. Visual perception as retrospective Bayesian decoding from high- to low-level features. Proceedings of the National Academy of Sciences of the United States of America, 114(43):201706906, 2017.
[20] D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, 1962.
[21] John B. Reppas, Sourabh Niyogi, Anders M. Dale, Martin I. Sereno, and Roger B. H. Tootell. Representation of motion boundaries in retinotopic human visual cortical areas. Nature, 388(6638):175–179, 1997.
[22] G. Skiera, D. Petersen, M. Skalej, and M. Fahle. Correlates of figure-ground segregation in fMRI. Vision Research, 40(15):2047–2056, 2000.
[23] Janine D. Mendola, Anders M. Dale, Bruce Fischl, Arthur K. Liu, and Roger B. H. Tootell. The representation of illusory and real contours in human cortical visual areas revealed by functional magnetic resonance imaging. The Journal of Neuroscience, 19(19):8560–8572, 1999.
[24] Stephen A. Engel, David E. Rumelhart, Brian A. Wandell, Adrian T. Lee, Gary H. Glover, Eduardo-Jose Chichilnisky, and Michael N. Shadlen. fMRI of human visual cortex. Nature, 1994.
[25] Martin I. Sereno, A. M. Dale, J. B. Reppas, K. K. Kwong, J. W. Belliveau, T. J. Brady, B. R. Rosen, and R. B. Tootell. Borders of multiple visual areas in humans revealed by functional magnetic resonance imaging. Science, 268(5212):889–893, 1995.
[26] Kohitij Kar, Jonas Kubilius, Kailyn Schmidt, Elias B. Issa, and James J. DiCarlo. Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. Nature Neuroscience, 22(6):974–983, 2019.
[27] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[28] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[30] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.