Decontextualized learning for interpretable hierarchical representations of visual patterns
R. Ian Etheredge (1, 2, 3, *), Manfred Schartl (4, 5, 6, 7), and Alex Jordan (1, 2, 3)

1 Department of Collective Behaviour, Max Planck Institute of Animal Behavior, Konstanz, Germany
2 Centre for the Advanced Study of Collective Behaviour, University of Konstanz, Konstanz, Germany
3 Department of Biology, University of Konstanz, Konstanz, Germany
4 Centro de Investigaciones Científicas de las Huastecas Aguazarca, A.C., Calnali, Hidalgo, Mexico
5 Developmental Biochemistry, Biocenter, University of Würzburg, Würzburg, Bavaria, Germany
6 Hagler Institute for Advanced Study, Texas A&M University, College Station, TX, USA
7 Xiphophorus Genetic Stock Center, Texas State University San Marcos, San Marcos, TX, USA

September 22, 2020

Summary
Apart from discriminative models for classification and object detection tasks, the application of deep convolutional neural networks to basic research utilizing natural imaging data has been somewhat limited, particularly in cases where a set of interpretable features for downstream analysis is needed, a key requirement for many scientific investigations. We present an algorithm and training paradigm designed specifically to address this: decontextualized hierarchical representation learning (DHRL). By combining a generative model chaining procedure with a ladder network architecture and latent space regularization for inference, DHRL addresses the limitations of small datasets and encourages a disentangled set of hierarchically organized features. In addition to providing a tractable path for analyzing complex hierarchical patterns using variational inference, this approach is generative and can be directly combined with empirical and theoretical approaches. To highlight the extensibility and usefulness of DHRL, we demonstrate this method in application to a question from evolutionary biology.

Keywords: Generative Modeling · Interpretable AI · Disentangled Representation Learning · Hierarchical Features · Image Analysis · Small Data

* Corresponding author: [email protected]

The application of deep convolutional neural networks (CNNs) to supervised tasks is quickly becoming ubiquitous, even outside of standardized visual classification tasks. In the life sciences, researchers are leveraging these powerful models for a broad range of domain-specific discriminative tasks, such as automated tracking of animal movement, the detection and classification of cell lines, and mining genomics data.

A key motivation for the expanded use of deep feed-forward networks lies in their capacity to capture increasingly abstract and robust representations. However, outside of the objective function they have been optimized on, building interpretability into these representations is often difficult, as networks naturally absorb all correlations found in the sample data, and the features which are useful for defining class boundaries can become highly complex (Figure S1). For many investigations the main objective falls outside of a clearly defined detection or classification task, e.g. identifying a set of descriptive features for downstream analysis, and interpretability and generalizability are much more important. Because of this, in contrast to many traditional computer vision algorithms, the application of more expressive approaches built on CNNs and other deep networks to research has been limited (Figure 2).

Unsupervised learning, a family of algorithms designed to uncover unknown patterns in data without the use of labeled samples, offers an alternative for compression, clustering, and feature extraction using deep networks. Generative modeling techniques, i.e. generative adversarial networks (GANs) and variational autoencoders (VAEs), have been especially effective in capturing the complexity of natural images. VAEs in particular offer an intuitive way for analyzing data.
As an extension of variational inference, VAEs combine an inference model, which performs amortized inference (typically a CNN) to approximate the true posterior distribution and encode samples into a set of latent variables, q_φ(z|x), with a generative model, p_θ(x|z), which generates new samples from those latent variables. Instead of optimizing on a discriminative task, the objective function in VAEs is less strictly defined but typically seeks to minimize the reconstruction error between inputs x and outputs p_θ(q_φ(x)) (reconstruction loss), as well as the divergence between the distribution of latent variables q_φ(z|x) and the prior distribution p(z) (latent regularization).

In VAEs, two problems often arise which are of primary concern to researchers using natural imaging data. First, the mutual information between x and z can become vanishingly small, resulting in an uninformative latent code and overfitting to sample data (the information preference problem); this is particularly true when using the powerful convolutional decoders needed to create realistic model output. Second, in contrast to the hierarchical representations produced by deep feed-forward networks used for discriminative tasks, in generative models local feature contexts become emphasized at the cost of large-scale spatial relationships. This is a product of the restrictive mean-field assumption of pixel-wise comparisons and produces generative models capable of reproducing complex image features using only local feature contexts, without capturing higher-order spatial relationships within the latent encoding.

Table 1: Key requirements for investigating natural image data, the corresponding meta-priors, and example enforcement approaches.

Requirement | Meta-prior | Example approach
Disentangling factors of variation | Limited number of shared factors of variation | Latent regularization
Capturing spatial relationships | Hierarchical organization of representation | Hierarchical model architecture
Incorporating existing knowledge | Local variation on manifolds | Structured latent codes
Connect analyses and experiments | Local variation on manifolds | Generative models
Inference | Probability mass and local variation on manifolds | Variational inference
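To ground the VAE terminology used above and in Table 1, the following is a minimal sketch of the two-part objective (reconstruction plus latent regularization) in TensorFlow, which the released code uses. The layer sizes are illustrative only, and the analytic KL term here stands in for the MMD regularizer actually used in this work (described in the Methods); this is not the paper's architecture.

```python
import tensorflow as tf

latent_dim = 10

# Inference model q_phi(z|x): encodes a sample into latent means/log-variances.
encoder_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(256, 256, 4)),
    tf.keras.layers.Conv2D(32, 4, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2 * latent_dim),  # outputs [mu, log_var]
])

# Generative model p_theta(x|z): decodes latent variables back to an image.
decoder_net = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(latent_dim,)),
    tf.keras.layers.Dense(127 * 127 * 32, activation="relu"),
    tf.keras.layers.Reshape((127, 127, 32)),
    tf.keras.layers.Conv2DTranspose(4, 4, strides=2, activation="sigmoid"),
])

def vae_loss(x):
    mu, log_var = tf.split(encoder_net(x), 2, axis=-1)
    eps = tf.random.normal(tf.shape(mu))           # reparameterization trick
    z = mu + tf.exp(0.5 * log_var) * eps           # z ~ q_phi(z|x)
    x_hat = decoder_net(z)
    recon = tf.reduce_mean(tf.square(x - x_hat))   # pixel-wise reconstruction
    kl = -0.5 * tf.reduce_mean(                    # latent regularization
        1.0 + log_var - tf.square(mu) - tf.exp(log_var))
    return recon + kl
```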
The basis of a more expressive and robust approach for investigating natural image data has some key requirements: provide a useful representation which disentangles factors of variation along a set of interpretable axes; capture feature contexts and hierarchical relationships; incorporate existing knowledge of feature importance and relationships between samples; allow for statistical inference of complex traits; and provide direct connections between analytical, virtual, and experimental approaches. Here we integrate meta-prior enforcement strategies taken from representation learning to specifically address these requirements of researchers using natural image data (Table 1).

We propose to address the limitations of existing approaches, and to incorporate the specific requirements of researchers, using a combination of meta-prior enforcement strategies. VAEs with a ladder network architecture have been shown to better capture a hierarchy of features by mitigating the explaining-away problem of lower-level features, allowing for bottom-up and top-down feedback. Additionally, combining pixel-wise error with a perceptual loss function adapted from neural style transfer may also reduce the restrictive assumptions of amortized inference and pixel-wise reconstruction error by balancing them against abstract measures of visual similarity. In terms of the latent regularization, a disentangled representation of causal factors requires an information-preserving latent code; choosing a regularization technique which mitigates the trade-off between inference and data fit can encourage the disentanglement of generative factors along a set of variables in an interpretable way. We also propose a novel training paradigm inspired by GAN chaining, decontextualized learning, which further relaxes the natural covariances in the data and uses the restrictive assumptions of GAN generator networks to our advantage, overcoming the limitations of small datasets typical for many studies in the natural sciences and further increasing the disentanglement of generative factors (Figure 1, Methods 4.2).

While several metrics have been proposed for assessing interpretability and disentanglement, these metrics rely heavily on associated labels, well-defined features, or stipulations from classification or detection competitions. In addition to being highly domain specific, for most practical investigations in the natural sciences these types of labels do not exist and we must often rely on fundamentally qualitative assessments. In many cases labeled data is not available, and interpreting traversals of the latent code (Figure S2) may introduce our own perceptual biases. Here, we adapt an approach from explainable AI, integrated gradients, to latent variable exploration. This provides a direct assessment of latent variables, quantifying latent feature attributions without the necessity of labeled data, and allows for exploring latent variables without adding additional human biases (Methods 4.3). We demonstrate the proposed framework using two example datasets, male guppy ornamentation and butterfly wing patterns, from the discipline of sensory ecology and evolution (see Appendix A for motivation and background on existing approaches).

Figure 1: Overview. a) Many patterns (e.g. male guppy ornaments) consist of combinations of several elements which have hierarchical relationships, spatial dependence, and feature contexts which may hold distinct biological importance. In our proposed framework, small sample sizes are supplemented using a generative (GAN) model which learns image statistics sufficient to produce novel out-of-sample examples (b). This model can be used to produce an unlimited number of novel samples and also reduces the covariances within sample data, which is advantageous for disentangling generative features. We use these "decontextualized" generated outputs as input (c) to a variational autoencoder (VAE). Via a specific combination of meta-prior enforcement strategies and network architectures, we capture the hierarchical structure, disentangling factors of variation across multiple scales of increasing abstraction (z_1 through z_n). Using the learned distribution over these variables, the latent representation, parameterized by a mean and variance term, we (d) define a color-pattern space. Using this low-dimensional representation we can (e) interface with downstream models such as evolutionary algorithms and (f) produce photo-realistic outputs to be used in playback experiments and immersive VR. Interpolations through the color-pattern space with animated models and VR allow researchers to manipulate generated output for experimental tests. Techniques represented with dashed lines, (d) capturing a hierarchy of visual features, (e) combining a low-dimensional latent representation with virtual experiments, and (f) playback experiments, represent the current gaps in our analytical and experimental framework. Our approach directly addresses these shortcomings to span these gaps, creating a robust, integrated framework for investigating natural visual stimuli.

While biological datasets are typically small, they are usually highly structured and standardized compared to large classification datasets (e.g. ImageNet). This provides an advantage for controlling noise and uninformative covariates in the data. Using a modified InfoGAN architecture, we incorporate prior knowledge about the structure of our sample data to generate realistic samples from the complex image distribution conditioned on a set of latent variables. Here, we incorporate prior knowledge about our samples of male guppy ornamentation images by providing a 32-class categorical latent code (Figure 3b, top right). These 32 classes represent the 32 individual tanks, unique subsets of the overall sample with shared traits related to guppy ornamentation patterns.
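For concreteness, here is a sketch of assembling the generator input for this decontextualizing InfoGAN stage: a random basis noise vector concatenated with a categorical latent code (one class per tank). The sizes follow the text above; the helper function name is ours, not the repository's API.

```python
import tensorflow as tf

noise_dim, n_classes = 100, 32  # 100-unit noise, 32-class categorical code

def sample_generator_input(batch_size):
    z = tf.random.normal([batch_size, noise_dim])                # basis noise
    c = tf.random.uniform([batch_size], 0, n_classes, tf.int32)  # category
    # Return the sampled class as well, since Q is trained to recover it.
    return tf.concat([z, tf.one_hot(c, n_classes)], axis=-1), c
```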
The categories learned by the trained model possess unique features which also covary in the sample data, e.g. a distinct black bar and orange stripe which characterize one guppy species, P. wengei (Figure S2, a). While generated samples share characteristics and even resemble known varieties, they possess decontextualized combinations of features across examples (Figure S2, a). We use these decontextualized samples as input to our variational (VAE) model for our "decontextualized" training paradigm. GAN training and VAE training are performed in separate steps so that the models are not jointly optimized. The generated samples from the trained GAN model are used as training data for a variational model (Figure 1) with a hierarchical model architecture which consists of 10 latent variables in each of four codes (z_1, ..., z_4) of increasing expressivity (Methods 4.2.2).

We observe distinct clusters in the latent space of the trained model which correspond to sample categories and differ qualitatively from two existing methods (raw pixel and perceptual loss embeddings using t-SNE, Figure 2). The four latent encodings capture unique factors of variation in the sample data in a scale-dependent way (Figure S2, Figure S3). In this model z_1, the latent code with the lowest capacity, captures local traits such as the color and intensity of discrete patches; e.g. z_1 encodes variation in the intensity of an orange spot (Figure S2, b, left). At higher levels (z_2, ..., z_4), latent variables encode complex traits which combine multiple elements (Figure S2, b, right). We use this same latent representation to describe the relationship between samples and to calculate likelihood estimates. Samples with rare traits, such as the "Tr5" strain in our sample data which are distinctly melanated, cluster together in the embedded space and have a low sample likelihood (Figure 3).

Embedding the four 10-dimensional latent codes reveals scale-dependent relationships between elements. In z_1 (Figure S3, left), color values and local features dominate the relationship between points. Nearest-neighbor samples (Minkowski distance in the 10-dimensional space, Figure S3, b) show color similarity, whereas higher-order features, e.g. patterning and morphology, determine the relationships between samples in the more expressive latent spaces (z_2, ..., z_4). Though we find strong covariance between features across scales, in some cases the nearest-neighbor samples differ greatly depending on the scale and feature context (Figure S3, b).

We assess the level of disentanglement of our trained variational models using an established metric, with known class labels as attribute classes (butterfly species, learned class from InfoGAN pre-training, and guppy strain varieties). Across models, we find the most expressive latent codes (z_4) provide the highest degree of disentanglement between known classes, with the highest overall disentanglement score achieved by our decontextualized DHRL method (see Table 2).

Figure 2: Comparing with existing techniques. 2-dimensional embedding of (left) raw pixel distributions, (middle) a perceptual similarity score, and (right) our framework. a) guppies, b) butterflies. Colors indicate unique subgroups for each sample (guppy variety and butterfly species).

Figure 3: Sample likelihood estimates. a) Embedded samples. b) Normalized (standard score) likelihood estimates for each sample.

Table 2: Disentanglement (D) and completeness (C) metrics of the VLAE inference network across datasets, and when using our decontextualized learning approach (DHRL).

VLAE training data | D(z_1), C(z_1) | D(z_2), C(z_2) | D(z_3), C(z_3) | D(z_4), C(z_4)
Butterflies (n=9531) | 0.64, 0.60 | 0.67, 0.55 | 0.63, 0.51 | 0.88, 0.60
Guppies (n=987) | 0.29, 0.32 | 0.12, 0.13 | 0.13, 0.16 | 0.56, 0.66
Guppies (gen., n=19k) | 0.12, 0.16 | 0.13, 0.18 | 0.32, 0.42 | 0.62, 0.75
Guppies (DHRL) | | | |
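The scale-dependent nearest-neighbor comparisons above reduce to simple queries once the per-level latent means are available. A sketch, using placeholder arrays standing in for the encoder means of the guppy samples at levels z_1 through z_4:

```python
import numpy as np
from scipy.spatial import cKDTree

# Placeholder data: 987 samples, four 10-dimensional latent codes.
latent_codes = {f"z_{l}": np.random.randn(987, 10) for l in range(1, 5)}

for level, z in latent_codes.items():
    tree = cKDTree(z)
    _, idx = tree.query(z, k=6, p=2)  # p sets the Minkowski order
    neighbours = idx[:, 1:]           # drop the self-match in column 0
```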
We also provide a qualitative approach for attributing latent variables to image features using network gradients (Methods 4.3), for when labels are unknown. In Figure 4, a-d, we visualize one variable of z_1, the least expressive latent variable space of the DHRL-trained guppy latent variable model. We find that the same latent variable controls the relative intensity of green color patches across individuals. Looking at a single variable of a more expressive latent code of the trained butterfly model (Figure 4, e-h), we find that this latent variable controls the size of yellow patches on the lower wings relative to the size of yellow patches on the upper wings (when patches are not present this variable has no effect, Figure 4, f). Further investigation of latent variables can be performed using the provided tool (https://github.com/ietheredge/VisionEngine/notebooks/IntegratedGradients.ipynb).

Using the latent representation z of our DHRL-trained variational model of guppy ornaments as input, we apply an evolutionary algorithm (Figure 5), defined by a fitness function from the guppy literature: oranger, higher-contrast males are preferred by females. Starting from a parent population initialized by our sample embedding (900 samples), we simulate 500 generations under these selective forces. We observe exaggerated and more numerous orange and black patches in novel configurations compared to the initial population (Figure 5, b). Projecting the latent representation of generations 1, 250, and 500, we find that instead of a single peak, after several generations many novel solutions are optimized (Figure 5, a). Investigating the values of the latent variables over generations reveals two distinct latent factors driven to fixation in the population under these selective forces (Figure S4). We also observe the population-level optimization of latent factors over time in Movie S5. Using a single Titan Xp GPU with 12 GB memory we could simulate a population size of 1000 individuals in an average of 19.5 seconds per generation.

Figure 4: Latent variable feature attribution. a-d) Four samples of generated guppy images, visualizing integrated-gradients feature attributions (see Methods 4.3) for a latent variable of z_1 of the DHRL-trained variational model. e-h) Butterfly images, using a latent feature of a more expressive code. Heatmap values have been normalized using a standard score. Images to the left are generated with the latent feature set to its lowest value in the sample, to the right with the highest value in the sample.

Figure 5: Virtual experiments. a) Kernel density plot of samples over generations 1, 250, and 500, selecting for orange ornaments and contrast. After 500 generations the population has shifted from the initial sample distribution, finding two peaks which maximize the fitness function. b) Samples of the initial parent population with the highest fitness (left), compared to those with the highest fitness after 500 generations. Samples in later generations show higher numbers of brighter orange and dark melanated patches and increased within-body contrast.
Discussion

Supervised discriminative learning algorithms are already becoming an integral tool for researchers across disciplines, whereas unsupervised generative modeling approaches remain a relatively young and active area of machine learning research. Already, highly expressive generative models like the ones presented here are transforming the way we interact with image data. By solving problems in a more general way, generative modeling approaches provide more direct connections between hypothesis testing and observation. Here, we demonstrate how these approaches may serve as an engine for more integrative studies of animal coloration patterns, and natural image data more generally, directly connecting analytical and experimental approaches.

Analytically, our approach captures important hierarchical features across spatial scales that existing approaches do not account for (Figure 2, Figure S3, Appendix A.1), it removes the inherent biases of predefined filters by learning features directly from the sample data, and it disentangles complex factors of variation into a useful, meaningful representation (Figure S2, Figure 4). More than compressing data into a low-dimensional space, this approach is generative and can create novel out-of-sample examples with high fidelity. This is a potentially transformative extension for researchers in the natural sciences which is not offered by existing approaches, allowing researchers to test analytical results with virtual experiments, and empirically, by using virtual reality playback experiments or observational studies (see Movie S6).

These techniques can be adapted to many domain-specific questions (see A.1 for a specific discussion regarding the potential impact of this approach on the study of color pattern evolution). As the latency between input and output decreases in video playback experiments, integrating instantaneous behavioral feedback and in-the-loop methods for hypothesis testing may be used to design complex real-time assays. More sophisticated virtual experiments may also incorporate agent-based models and evolutionary algorithms working directly on the latent representation to create complex simulations (e.g. as in Figure 5). In our demonstration, we are able to simulate 1000 individuals in under 20 seconds per generation with very little optimization, and asynchronous approaches may already be possible. Analytically, as research in machine learning aimed at understanding how information is organized and used by algorithms advances, a growing theoretical framework with a basis in statistical mechanics and information theory may provide additional avenues for investigating the statistical properties of color pattern spaces and their evolution.

4 Methods

4.1 Data

Guppy images were collected from a maintained stock at the University of Wuerzburg under authorization 568/300-1870/13 of the Veterinary Office of the District Government of Lower Franconia, Germany, in accordance with the German Animal Protection Law (TierSchG). Individuals were imaged on a white background with fixed lighting conditions using a Canon 600D digital camera. Images were downsampled and center cropped to a final size of 256 x 256 pixels. The dataset consists of 977 standardized RGB images across three species and 13 individual strains.

Butterfly images were downloaded from the Natural History Museum, London under a creative commons license (DOIs: https://doi.org/10.5519/qd.gvq3p7xq, https://doi.org/10.5519/qd.pw8srv43).
This dataset consists of 9531 RGB images.

For each dataset, we segmented samples from the background using a customized object segmentation network adapted from a published one-shot segmentation approach. For each dataset we annotated 8 samples to train the segmentation network. All samples were cropped and resized to 256 x 256 and placed on a transparent background (RGBA). For calculating the perceptual loss during training, images were translated to 3-channel images with a white background using alpha blending.

Updated links to the original data repositories can be accessed here: https://github.com/ietheredge/VisionEngine/README.md. All models were implemented using TensorFlow 2.2 and can be accessed here: https://github.com/ietheredge/VisionEngine, including installation and evaluation scripts to reproduce our results. Instructions for creating new data loaders for training on new datasets using this method can be found at https://github.com/ietheredge/VisionEngine/data_loaders/datasets/README.md.

4.2 DHRL

DHRL relies on a three-step process of sequential training: first, a generative adversarial network is trained to transform a noise sample into realistic out-of-sample examples; next, a variational autoencoder is pre-trained on the generated samples; finally, the pre-trained variational model is fine-tuned on the original samples.
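To make the ordering of these three stages concrete, here is a toy skeleton. The two training functions are trivial stand-ins, not the repository's InfoGAN/VLAE implementations; only the sequence of steps reflects the text.

```python
import numpy as np

rng = np.random.default_rng(0)
real_images = rng.random((32, 64, 64, 3))  # placeholder for the real samples

def train_gan(images):
    """Stand-in for InfoGAN training; returns a sampler of novel images."""
    mean = images.mean(axis=0)
    return lambda n: np.clip(
        mean + 0.1 * rng.standard_normal((n, *mean.shape)), 0.0, 1.0)

def train_vlae(images, init=None):
    """Stand-in for (pre-)training the VLAE; `init` mimics warm-starting."""
    return {"n_train": len(images), "warm_start": init is not None}

generator = train_gan(real_images)         # step 1: GAN on the real samples
vlae = train_vlae(generator(128))          # step 2: pre-train on generated data
vlae = train_vlae(real_images, init=vlae)  # step 3: fine-tune on real samples
```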
4.2.1 Generative model (InfoGAN)

We use an unsupervised approach to disentangle discrete and continuous latent factors, adapted from InfoGAN, which modifies the minimax game typically used for training GANs such that:

min_{G,Q} max_D V_I(D, G, Q) = V(D, G) − λ L_I(G, Q)    (1)

where V(D, G) is the original GAN objective and L_I(G, Q) approximates the lower bound of the mutual information I(c; G(z, c)) using Monte Carlo sampling, such that L_I(G, Q) ≤ I(c; G(z, c)). Like the generator G and discriminator D, Q is parameterized as a neural network and shares all convolutional layers with D. Both discrete latent codes Q(c_d | x) and continuous latent codes Q(c_c | x) are provided, with continuous latent codes treated as a factored Gaussian distribution. Importantly, InfoGAN does not require supervision and no labels are provided.

Key methods (figure caption). Top left: the distributions of our latent representation may be parameterized by a number of continuous or discrete variables. In InfoGAN, a categorical latent code is combined with continuous latent codes, which allows for disentangling substructures in the sample data without labeled samples. Top right: example structure of a generative adversarial network. Here, a noise vector z_i is input to the generator network G(z), which produces a reconstructed output x̂_i. A real sample x_i and a generated sample x̂_i are subsequently passed through a separate discriminator network D(x), which determines whether the sample is real or generated. In InfoGAN, the latent encoding of generated samples is optimized by an additional network Q which shares all convolutional layers with D. Bottom left: the generic architecture of a variational ladder autoencoder (VLAE). Multiple latent spaces (z_1, z_2, ..., z_k) are learned, with each successive input layer having increasing expressivity and abstraction (ability to combine features across spatial scales). Bottom middle: structure of a variational autoencoder (VAE). x_i and x̂_i are an example input and its reconstructed output; the probabilistic encoder or inference model q_φ(z|x) performs posterior inference, learning shared model parameters φ across samples and approximating the true posterior distribution. The probabilistic decoder p_θ(x|z) learns a joint distribution of the encoded space Z and the data space X. The low-dimensional bottleneck Z is a distribution of latent variables capable of reconstructing sample inputs, parameterized by a vector of means µ and standard deviations σ. The noise term ε allows the parameters of this multivariate distribution to be optimized using back propagation, known as the reparameterization trick. Right: perceptual loss models use a pretrained network φ, e.g. VGG-16. Two samples, the original input and reconstructed output, are input to the model and the max-pooling layer activations for each are used as outputs. The distance between these activations emphasizes higher-level similarity rather than standard pixel-wise differences. Outputs from the shallow layer ℓ_1 represent low-level local features, whereas output from the deeper layer ℓ_k contains information from across spatial scales and more abstract representations. The Euclidean distance between these activation outputs gives a metric for the similarity of the two inputs as "perceived" by a network pre-trained on a much broader dataset. Perceptual loss functions can be used as a stand-alone transfer-learning approach to finding perceptual differences between samples, or as part of any network as an additional or alternative reconstruction loss.

We substitute the original generator and discriminator models with the architecture described in prior work and increase the flexibility of the latent code, providing additional continuous and discrete latent codes. For guppy experiments, we provide two continuous and 19 discrete codes (samples were drawn from 19 paternal lines). For the basis noise vector input to the generator, we used a 100-unit random noise vector.

4.2.2 Hierarchical variational model (VLAE)

In contrast to hierarchical architectures with a single chain of latent variables, we learn a hierarchy of features by using multiple latent codes with increasing levels of abstraction, i.e. q_φ(z_1, ..., z_L | x). The expressivity of z_i is determined by its depth. The encoder q_φ(z_1, ..., z_L | x) consists of four blocks such that:

H_ℓ = G_ℓ(H_{ℓ−1})    (2)
z_ℓ ∼ N(µ_ℓ(H_ℓ), I)    (3)

where H_ℓ, G_ℓ, and µ_ℓ are neural networks.
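A schematic sketch of the ladder encoder in equations (2)-(3): each block G_ℓ deepens a shared feature path H_ℓ, and a separate head µ_ℓ reads a latent code z_ℓ ~ N(µ_ℓ(H_ℓ), I) off every rung. For brevity each G_ℓ here is a single convolution (the actual blocks, described next, stack four convolutions each); channel and latent sizes follow the text.

```python
import tensorflow as tf

def conv_block(channels):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(channels, 3, strides=2, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(),
    ])

blocks = [conv_block(c) for c in (16, 64, 256, 1024)]  # G_1 .. G_4
heads = [tf.keras.Sequential([tf.keras.layers.GlobalAveragePooling2D(),
                              tf.keras.layers.Dense(10)])  # mu_l
         for _ in blocks]

def encode(x):
    h, codes = x, []
    for g, mu in zip(blocks, heads):
        h = g(h)                                     # H_l = G_l(H_{l-1})
        mean = mu(h)
        z = mean + tf.random.normal(tf.shape(mean))  # z_l ~ N(mu_l(H_l), I)
        codes.append(z)
    return codes  # four 10-dimensional latent codes, z_1 .. z_4
```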
For our encoder model, each G_ℓ is a stack of convolutional, batch normalization, and leaky rectified linear unit activation (Conv-BN-LeakyReLU) blocks; we stack four Conv-BN-LeakyReLU blocks for each G_ℓ, with an increasing number of channels in each subsequent convolutional layer, i.e. N/2, N, N, N*2 channels, where N is 16, 64, 256, and 1024 for G_1, G_2, G_3, and G_4, respectively. We apply spectral normalization to all convolutional layers (see below). Because we want to preserve feature localization, we use average pooling followed by a squeeze-excite block to apply a context-aware weighting to each channel (see below).

Similarly, the decoder p_θ(x | z_1, ..., z_L) is composed of blocks such that:

z̃_ℓ = U_ℓ([z̃_{ℓ+1}; V_ℓ(z_ℓ)])    (4)

where [.;.] denotes channel-wise concatenation. Parallel to the G_ℓ blocks in the encoder, the U_ℓ are composed of Conv-BN-ReLU blocks (note the use of ReLU and not LeakyReLU in the decoder) with a decreasing number of channels in each convolutional layer, i.e. N*2, N, N, N/2, where N is 1024, 256, 64, 16. No spectral normalization wrappers or squeeze-excite layers are applied in the decoder.

Squeeze-excite layers

Squeeze-and-excitation networks were proposed to improve feature interdependence by adaptively weighting each channel within a feature map based on filter relevance, applying a channel-wise recalibration. Here we apply squeeze-excite (SE) layers prior to the variational layer such that each embedding z_i captures features with cross-channel dependencies. Each SE layer consists of a global average pooling layer which averages channel-wise features, followed by two fully connected layers with ReLU activations, the first of size channels/16 and the second with the same size as the number of input channels. Finally a sigmoid "excite" layer assigns channel-wise probabilities, which are then multiplied channel-wise with the original inputs.

Reconstruction loss

We minimize the negative log likelihood of the sample data by minimizing the mean squared error between input and output, jointly optimizing the reconstruction loss for each sample x:

L_pixel-wise = E_{p_data(x)} E_{q_φ(z|x)} [log p_θ(x|z)] = (1/n) Σ_{i=1}^{n} (x_i − p_θ(q_φ(x_i)))²    (5)

To relax the restrictive mean-field assumption which is implicit in minimizing the pixel-wise error, we jointly optimize the similarity between inputs and outputs using intermediate layers of a pretrained network, VGG16, as feature maps. Here we calculate the Gram matrices of the feature maps, which match the feature distributions of real and generated outputs for each layer:

G^ℓ_{ab} = (Σ_{cd} F^ℓ_{cda}(x) F^ℓ_{cdb}(x)) / CD    (6)

L_perceptual = Σ_{ℓ=1}^{L} (1/n) Σ_{i=1}^{n} (G^ℓ_{ab}(x_i) − G^ℓ_{ab}(p_θ(q_φ(x_i))))² / L    (7)

for feature maps F_a and F_b in layer ℓ across spatial locations c, d. This measures the correlation between image filters and is equivalent to minimizing the distance between the distribution of features across feature maps, independently of feature position. The combined reconstruction loss is a weighted sum of the perceptual loss and pixel-wise error:

L_reconstruction = α L_perceptual + β L_pixel-wise    (8)

where α and β are Lagrange multipliers controlling the influence of each loss term.
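A sketch of the Gram-matrix perceptual loss in equations (6)-(7), using VGG16 pooling-layer activations. The specific layer list is an illustrative assumption, and VGG input preprocessing is omitted for brevity.

```python
import tensorflow as tf

vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
layer_names = ["block1_pool", "block2_pool", "block3_pool", "block4_pool"]
feature_model = tf.keras.Model(
    vgg.input, [vgg.get_layer(n).output for n in layer_names])

def gram(features):
    # Correlations between filter maps, averaged over spatial positions.
    b, h, w, c = tf.unstack(tf.shape(features))
    f = tf.reshape(features, [b, h * w, c])
    return tf.matmul(f, f, transpose_a=True) / tf.cast(h * w, tf.float32)

def perceptual_loss(x, x_hat):
    fx, fy = feature_model(x), feature_model(x_hat)
    return tf.add_n([tf.reduce_mean(tf.square(gram(a) - gram(b)))
                     for a, b in zip(fx, fy)]) / len(layer_names)
```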
Here we set α = 1e-6 and β = 1e5 to balance the contribution of the reconstruction terms with the variational loss (see below).

Latent regularization

We use the maximum mean discrepancy (MMD) approach to maximize the similarity between the statistical moments of p(z) and q_φ(z) using the kernel embedding trick:

MMD(p(z) ‖ q_φ(z)) = E_{p(z),p(z')}[k(z, z')] + E_{q_φ(z),q_φ(z')}[k(z, z')] − 2 E_{p(z),q_φ(z')}[k(z, z')]    (9)

using a Gaussian kernel

k(z, z') = e^{−‖z − z'‖² / σ}    (10)

to measure the similarity between p(z) and q_φ(z) in Euclidean space. We measured similarity using multiple kernels with varying degrees of smoothness controlled by the value of σ, i.e. multi-kernel MMD, with varying bandwidths: σ = 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 5, 10, 15, 20, 25, 30, 35, 100, 1e3, 1e4, 1e5, 1e6.

The influence of the MMD kernel differences on the combined objective function is controlled by a Lagrange multiplier λ applied across each latent code, giving the combined objective:

L_total = (Σ_{i=1}^{L} λ MK-MMD(q_φ(z_i) ‖ p(z_i))) + L_reconstruction    (11)

where L is the number of hierarchical latent codes, z_i is the n-dimensional latent code, the prior is p(z_i) = N(0, I), and L_reconstruction is defined above. Here, we set λ = 1.

Denoising

In addition to further relaxing the contribution of pixel-wise error, adding a denoising criterion has been shown to yield better sample likelihood by learning to map both training data and corrupted inputs to the true posterior, providing more robust training for out-of-sample data. We implement this with the addition of a noise layer which samples a corrupted input x̃ from input x before passing x̃ to the encoder q_φ(z | x̃). We apply random binomial noise (salt and pepper) to ten percent of pixels.

Spectral normalization

Spectral normalization has been proposed as a method to prevent exploding gradients when using rectified linear units, stabilizing GAN training via a global regularization on the weight matrix of each layer, as opposed to gradient clipping, to provide bounded first derivatives (the Lipschitz constraint).
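Returning to the latent regularization above, a sketch of the multi-kernel MMD in equations (9)-(11): Gaussian kernels at several bandwidths compare the encoded batch q_φ(z) to samples from the prior p(z) = N(0, I). Only a subset of the bandwidths listed in the text is shown.

```python
import tensorflow as tf

SIGMAS = [1e-6, 1e-4, 1e-2, 1.0, 5.0, 10.0, 100.0, 1e4, 1e6]  # subset shown

def gaussian_kernel(a, b, sigma):
    sq = tf.reduce_sum(tf.square(a[:, None, :] - b[None, :, :]), axis=-1)
    return tf.exp(-sq / sigma)

def mk_mmd(z_q):
    z_p = tf.random.normal(tf.shape(z_q))  # samples from the prior N(0, I)
    total = 0.0
    for s in SIGMAS:
        total += (tf.reduce_mean(gaussian_kernel(z_p, z_p, s))
                  + tf.reduce_mean(gaussian_kernel(z_q, z_q, s))
                  - 2.0 * tf.reduce_mean(gaussian_kernel(z_p, z_q, s)))
    return total
```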
4.3 Latent feature attribution

Understanding the importance of features for model predictions is an active area of research. Integrated gradients assigns feature importance, determining relationships between predictions and image features, by summing the gradients along a path between a baseline x' and the input x:

IG_i(x) := (x_i − x'_i) × ∫_{α=0}^{1} [∂P(x' + α(x − x')) / ∂x_i] dα    (12)

We adapt this procedure to investigate the contribution of each latent variable, where we use a baseline z', an encoding of a single sample x, and iterate a single latent axis z_j while holding all other axes constant, summing the gradients of the decoder p_θ(x|z) such that:

IG^approx_i(p_θ(x|z^j)) := (p_θ(x|z^j)_i − p_θ(x|z^{j'})_i) × Σ_{k=1}^{m} [∂P(p_θ(x|z^{j'}) : z^{j'}_j = z^{j'}_j + (k/m)(z^j_j − z^{j'}_j)) / ∂p_θ(x|z^j)_i] × (1/m)    (13)

where j is the axis of the latent code being interpolated, i is the individual feature (pixel), p_θ(x|z) is the reconstructed output, p_θ(x|z') is the baseline reconstructed output, k is the perturbation constant, and m is the number of steps in the approximation of the integral. We use the Riemann sum approximation of the integral over the interpolated path P, which involves computing the gradient in a loop over the inputs for k = 1, ..., m. Here, we use m = 300 and k = 2 max(|z|) for each z_j, starting from a baseline p_θ(x|z^{j'}) : z_j = −max(|z|).
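As a concrete illustration of the latent-variable adaptation in equation (13), the sketch below sweeps a single latent dimension j from a baseline to its observed value and accumulates the per-pixel sensitivity of the decoder output along the path. `decoder` is a stand-in for p_θ(x|z), and using forward-mode autodiff for the per-pixel derivatives is our implementation choice, not necessarily the repository's.

```python
import tensorflow as tf

def latent_integrated_gradients(decoder, z, j, baseline_val, m=300):
    z = tf.convert_to_tensor(z, tf.float32)     # shape [1, latent_dim]
    tangent = tf.one_hot([j], z.shape[-1])      # direction along dimension j
    path_grads = 0.0
    for k in range(1, m + 1):
        alpha_k = baseline_val + (k / m) * (z[0, j] - baseline_val)
        z_k = z + (alpha_k - z[0, j]) * tangent  # move only dimension j
        with tf.autodiff.ForwardAccumulator(z_k, tangent) as acc:
            x_k = decoder(z_k)
        path_grads += acc.jvp(x_k)               # per-pixel d x / d z_j
    z_base = z + (baseline_val - z[0, j]) * tangent
    # Scale the averaged path gradients by the output difference, as in (13).
    return (decoder(z) - decoder(z_base)) * path_grads / m
```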
Disentanglement metrics

We use an established framework for assessing disentanglement, measuring the relative entropy of latent factors for predicting class labels. The disentanglement D_i of each latent code is measured as D_i = (1 − H_K(P_i)), where H_K is the entropy and P_i is the relative importance of the generative factor. We also include a metric of completeness C_j, approximating the degree to which the generative factor is captured by a single latent variable, where C_j = (1 − H_D(P_j)) and P_j is the unweighted contribution of generative factors. Here, in the absence of labeled features, we use species (butterflies), breeding line variants (guppies), and the predicted class of the generative model (generated guppies, 4.2.1, above) as approximate class labels for each model (one class). This approximation naturally overestimates D_i and underestimates C_j, as there is some overlap between classes in terms of visual features (see Figure 2, Figure S3). While the original framework proposes a third term, informativeness I, to evaluate representations, we found that this value was highly coupled to the choice of the Lagrange multiplier λ used for latent regularization (above).

Virtual experiments

For demonstrating an example virtual experiment, we use a genetic algorithm with a parent population of 1000 random samples, evolved over 500 generations. Parent samples are randomly initialized across the latent variables of each latent code. Fitness was calculated as an equally weighted sum of the total percentage of pixels within two ranges (orange: between rgb(0.9, 0.55, 0.0) and rgb(1.0, 0.75, 0.1); black: between rgb(0.0, 0.0, 0.0) and rgb(0.2, 0.2, 0.2)), measured on the generated output, a simplification of empirical results from the literature. During each generation, the predicted fitness of each sample in the population was measured by the fitness of the nearest neighboring value in a reference table (for processing speed). To simulate weak selective pressure on the fitness function, we drew 500 random parent subsamples weighted by their proportional fitness. An additional 200 samples were drawn without the proportional fitness weighting. From the 700 subsamples in each generation, we then drew 300 random pairs; the "alleles" of each sample (the specific latent variable values) were chosen randomly with equal probability to create a combined offspring of the two samples. Each combined offspring then had two alleles randomly mutated, one by drawing from a random normal distribution and the other by replacing an existing value with zero (similar to destabilizing and stabilizing mutations). The next generation thus consisted of 1000 samples: 700 parent samples + 300 offspring. This process was repeated for 500 generations.
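A sketch of this fitness function and one generation step. `decode` is a stand-in for the trained decoder p_θ(x|z); the RGB thresholds and population counts follow the text, while treating the color ranges as inclusive elementwise bounds, the random seed, and the omission of the reference-table speedup are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(img):
    """Equally weighted fraction of 'orange' and 'black' pixels (img in [0, 1])."""
    orange = np.all((img >= (0.9, 0.55, 0.0)) & (img <= (1.0, 0.75, 0.1)), axis=-1)
    black = np.all(img <= (0.2, 0.2, 0.2), axis=-1)
    return orange.mean() + black.mean()

def next_generation(pop, decode):
    fit = np.array([fitness(decode(z)) for z in pop])
    parents = pop[rng.choice(len(pop), 500, p=fit / fit.sum())]  # weak selection
    extras = pop[rng.choice(len(pop), 200)]                      # unweighted draw
    keep = np.concatenate([parents, extras])                     # 700 survivors
    pairs = rng.choice(len(keep), (300, 2))
    mask = rng.random((300, pop.shape[1])) < 0.5                 # random alleles
    offspring = np.where(mask, keep[pairs[:, 0]], keep[pairs[:, 1]])
    for child in offspring:                                      # two mutations
        i, k = rng.choice(pop.shape[1], 2, replace=False)
        child[i] = rng.normal()   # destabilizing: redraw from N(0, 1)
        child[k] = 0.0            # stabilizing: zero out one allele
    return np.concatenate([keep, offspring])                     # 1000 samples
```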
Acknowledgements

We would like to thank members of the Department of Collective Behaviour, Max Planck Institute of Animal Behavior, and the Centre for the Advanced Study of Collective Behaviour, University of Konstanz, for comments on earlier versions of the manuscript, as well as the Max Planck Computing and Data Facility for use of computational resources.
Author contributions

RIE conceived the approach and designed the methodology; MS and RIE collected sample data; RIE wrote the manuscript; AJ secured funding. All authors contributed to editing and approving the manuscript.
Competing interests

The authors have no financial or non-financial competing interests.
References

Yann LeCun, Koray Kavukcuoglu, and Clément Farabet. Convolutional networks and applications in vision. In Proceedings of 2010 IEEE International Symposium on Circuits and Systems, pages 253–256. IEEE, 2010.
Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
Katarzyna Bozek, Laetitia Hebert, Alexander S. Mikheyev, and Greg J. Stephens. Towards dense object tracking in a 2D honeybee hive. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4185–4193, 2018.
Alexander Mathis, Pranav Mamidanna, Kevin M. Cury, Taiga Abe, Venkatesh N. Murthy, Mackenzie W. Mathis, and Matthias Bethge. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 2018.
Talmo D. Pereira, Diego E. Aldarondo, Lindsay Willmore, Mikhail Kislin, Samuel S.-H. Wang, Mala Murthy, and Joshua W. Shaevitz. Fast animal pose estimation using deep neural networks. Nature Methods, 16(1):117–125, 2019.
Jacob M. Graving, Daniel Chae, Hemal Naik, Liang Li, Benjamin Koger, Blair R. Costelloe, and Iain D. Couzin. DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife, 8:e47994, 2019.
Thorsten Falk, Dominic Mai, Robert Bensch, Özgün Çiçek, Ahmed Abdulkadir, Yassine Marrakchi, Anton Böhm, Jan Deubner, Zoe Jäckel, Katharina Seiwald, et al. U-Net: deep learning for cell counting, detection, and morphometry. Nature Methods, 16(1):67–70, 2019.
Julian Riba, Jonas Schoendube, Stefan Zimmermann, Peter Koltay, and Roland Zengerle. Single-cell dispensing and 'real-time' cell classification using convolutional neural networks for higher efficiency in single-cell cloning. Scientific Reports, 10(1):1–9, 2020.
Claire McQuin, Allen Goodman, Vasiliy Chernyshev, Lee Kamentsky, Beth A. Cimini, Kyle W. Karhohs, Minh Doan, Liya Ding, Susanne M. Rafelski, Derek Thirstrup, et al. CellProfiler 3.0: Next-generation image processing for biology. PLoS Biology, 16(7):e2005970, 2018.
Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T. Afshar, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10):983–987, 2018.
Paul V. C. Hough. Method and means for recognizing complex patterns. US Patent 3,069,654, December 18, 1962.
Christopher G. Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey Vision Conference, volume 15, pages 10–5244. Citeseer, 1988.
D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, September 1999.
David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.
Briana D. Ezray, Drew C. Wham, Carrie E. Hill, and Heather M. Hines. Unsupervised machine learning reveals mimicry complexes in bumblebees occur along a perceptual continuum. Proceedings of the Royal Society B, 286(1910):20191501, 2019.
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. 2014.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286. PMLR, 2014.
Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, August 2013.
Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. 33:5885–5892, 2019.
Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520, 2007.
Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from deep generative models. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 4091–4099. JMLR.org, 2017.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016.
Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pages 262–270. MIT Press, 2015.
Leon Gatys, Alexander Ecker, and Matthias Bethge. A neural algorithm of artistic style. Journal of Vision, 16(12):326–326, September 2016.
Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2017.
Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2649–2658. PMLR, 2018.
Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.
Grégoire Mesnil, Yann Dauphin, Xavier Glorot, Salah Rifai, Yoshua Bengio, Ian Goodfellow, Erick Lavoie, Xavier Muller, Guillaume Desjardins, David Warde-Farley, Pascal Vincent, et al. Unsupervised and transfer learning challenge: a deep learning approach. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pages 97–110, 2012.
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 2017.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
Dmitry Kobak and Philipp Berens. The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1):1–14, 2019.
Richard Dubes and Anil K. Jain. Clustering methodologies in exploratory data analysis. In Advances in Computers, volume 19, pages 113–228. Elsevier, 1980.
Anne E. Houde. Mate choice based upon naturally occurring color-pattern variation in a guppy population. Evolution, 41(1):1–10, 1987.
David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, pages 2451–2463. Curran Associates, Inc., 2018. https://worldmodels.github.io.
Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and Surya Ganguli. Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics, 2020.
Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. Pages 368–377, 1999.
Darrell J. Kemp. Female mating biases for bright ultraviolet iridescence in the butterfly Eurema hecabe (Pieridae). Behavioral Ecology, 19(1):1–8, 2008.
Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 221–230, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015.
Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. Pages 779–788. IEEE Computer Society, 2016.
Philip Bachman. An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems, pages 4826–4834, 2016.
Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), pages 2230–2236. ijcai.org, 2017.
Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pages 1205–1213, 2012.
Daniel Jiwoong Im, Sungjin Ahn, Roland Memisevic, and Yoshua Bengio. Denoising criterion for variational auto-encoding framework. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. OpenReview.net, 2018.
John A. Endler and Anne E. Houde. Geographic variation in female preferences for male traits in Poecilia reticulata. Evolution, 49(3):456–468, 1995.
Henry Walter Bates. The Naturalist on the River Amazons. London; Toronto: J. M. Dent; New York: E. P. Dutton, 1863.
Charles Darwin. The Descent of Man: and Selection in Relation to Sex. John Murray, 1871.
Alfred Russel Wallace. The colors of animals and plants. The American Naturalist, 11(11):641–662, 1877.
Fritz Müller. Über die Vortheile der Mimicry bei Schmetterlingen. 1878.
Edward Bagnall Poulton. The Colours of Animals: Their Meaning and Use, Especially Considered in the Case of Insects. D. Appleton, 1890.
Gerald Handerson Thayer. Concealing-Coloration in the Animal Kingdom: An Exposition of the Laws of Disguise Through Color and Pattern: Being a Summary of Abbott H. Thayer's Discoveries. Macmillan Company, 1918.
Hugh Bamford Cott. Adaptive Coloration in Animals. 1940.
John Maynard Smith. Natural selection and the concept of a protein space. Nature, 225(5232):563–564, 1970.
Jakob von Uexküll. A stroll through the worlds of animals and men: A picture book of invisible worlds. Semiotica, 89(4):319–391, 1992.
Cedric P. van den Berg, Jolyon Troscianko, John A. Endler, N. Justin Marshall, and Karen L. Cheney. Quantitative colour pattern analysis (QCPA): A comprehensive framework for the analysis of colour patterns in nature. Methods in Ecology and Evolution, 11(2):316–332, 2020.
Rafael Maia, Hugo Gruson, John A. Endler, and Thomas E. White. pavo 2: New tools for the spectral and spatial analysis of colour in R. Methods in Ecology and Evolution, 2019.
Mary Caswell Stoddard, Rebecca M. Kilner, and Christopher Town. Pattern recognition algorithm reveals how birds evolve individual egg pattern signatures. Nature Communications, 5:4117, 2014.
Felipe M. Gawryszewski. Color vision models: Some simulations, a general n-dimensional model, and the colourvision R package. Ecology and Evolution, 8(16):8159–8170, 2018.
Cynthia Tedore and Sönke Johnsen. Using RGB displays to portray color realistic imagery to animal eyes. Current Zoology, 63(1):27–34, 2016.
John A. Endler. Variation in the appearance of guppy color patterns to guppies and their predators under different visual conditions. Vision Research, 31(3):587–608, 1991.
John A. Endler and Paul W. Mielke Jr. Comparing entire colour patterns as birds see them. Biological Journal of the Linnean Society, 86(4):405–431, 2005.
John A. Endler. A framework for analysing colour pattern geometry: adjacent colours. Biological Journal of the Linnean Society, 107(2):233–253, 2012.
Jolyon Troscianko and Martin Stevens. Image calibration and analysis toolbox: a free software suite for objectively measuring reflectance, colour and pattern. Methods in Ecology and Evolution, 6(11):1320–1331, 2015.
Eleanor M. Caves and Sönke Johnsen. AcuityView: An R package for portraying the effects of visual acuity on scenes observed by an animal. Methods in Ecology and Evolution, 9(3):793–797, 2018.
John A. Endler, Gemma L. Cole, and Alexandrea M. Kranz. Boundary strength analysis: Combining colour pattern geometry and coloured patch visual properties for use in predicting behaviour and fitness. Methods in Ecology and Evolution, 9(12):2334–2348, 2018.
Mary Caswell Stoddard and Daniel Osorio. Animal coloration patterns: Linking spatial vision to quantitative analysis. The American Naturalist, 193(2):164–186, 2019.
H. F. Nijhout. Elements of butterfly wing patterns. Journal of Experimental Zoology, 291(3):213–225, 2001.
G. Th. Fechner. Ueber die subjectiven Nachbilder und Nebenbilder. Annalen der Physik, 126(7):427–470, 1840.
Ray Fuller and Jorge A. Santos. Human Factors for Highway Engineers. Pergamon, Amsterdam, The Netherlands, 2002.
Gemma L. Cole and John A. Endler. Male courtship decisions are influenced by light environment and female receptivity. Proceedings of the Royal Society B: Biological Sciences, 283(1839):20160861, 2016.
Andreas Nieder, David J. Freedman, and Earl K. Miller. Representation of the quantity of visual items in the primate prefrontal cortex. Science, 297(5587):1708–1711, 2002.
S. Fujita, T. Kitayama, N. Mizoguchi, Y. Oi, N. Koshikawa, and M. Kobayashi. Spatiotemporal profiles of transcallosal connections in rat insular cortex revealed by in vivo optical imaging. Neuroscience, 206:201–211, 2012.
Scott Cheng-Hsin Yang, Mate Lengyel, and Daniel M. Wolpert. Active sensing in the categorization of visual patterns. eLife, 5:e12215, 2016.
Laura A. Kelley and Jennifer L. Kelley. Animal visual illusion and confusion: the importance of a perceptual perspective. Behavioral Ecology, 25(3):450–463, 2014.
Sami Merilaita, Nicholas E. Scott-Samuel, and Innes C. Cuthill. How camouflage works. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1724):20160341, 2017.
Clelia Gasparini, Giovanna Serena, and Andrea Pilastro. Do unattractive friends make you look better? Context-dependent male mating preferences in the guppy. Proceedings of the Royal Society B: Biological Sciences, 280(1756):20123072, 2013.
David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Henry Holt and Co., New York, NY, 1982.
David H. Hubel and Torsten N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, 1962.
Diego E. Pafundo, Mark A. Nicholas, Ruilin Zhang, and Sandra J. Kuhlman. Top-down-mediated facilitation in the visual cortex is gated by subcortical neuromodulation. Journal of Neuroscience, 36(10):2904–2914, 2016.
Spencer J. Ingley, Mohammad Rahmani Asl, Chengde Wu, Rongfeng Cui, Mahmoud Gadelhak, Wen Li, Ji Zhang, Jon Simpson, Chelsea Hash, Trisha Butkowski, et al. anyFish 2.0: an open-source software platform to generate and share animated fish models to study behavior. SoftwareX, 3:13–21, 2015.
John R. Stowers, Maximilian Hofbauer, Renaud Bastien, Johannes Griessner, Peter Higgins, Sarfarazhussain Farooqui, Ruth M. Fischer, Karin Nowikovsky, Wulf Haubensak, Iain D. Couzin, et al. Virtual reality for freely moving animals. Nature Methods, 14(10):995, 2017.
H. Naik, R. Bastien, N. Navab, and I. D. Couzin. Animals in virtual environments. IEEE Transactions on Visualization and Computer Graphics, 26(5):2073–2083, 2020.
Kuo-Hua Huang, Peter Rupprecht, Thomas Frank, Koichi Kawakami, Tewis Bouwmeester, and Rainer W. Friedrich. A virtual reality system to analyze neural activity and behavior in adult zebrafish. Nature Methods, pages 1–9, 2020.
John R. Stowers, Anton L. Fuhrmann, Maximilian Hofbauer, Martin Streinzer, Axel Schmid, Michael H. Dickinson, and Andrew D. Straw. Reverse engineering animal vision with virtual reality and genetics. Computer, 47:38–45, 2014.
Charles Darwin. On the Origin of Species. John Murray, 1859.
Ronald A. Fisher. The evolution of sexual preference. The Eugenics Review, 7(3):184, 1915.
Russell Lande. Models of speciation by sexual selection on polygenic traits. Proceedings of the National Academy of Sciences, 78(6):3721–3725, 1981.
Mark Kirkpatrick. Sexual selection and the evolution of female choice. Evolution, 36(1):1–12, 1982.
Yoh Iwasa and Andrew Pomiankowski. The evolution of mate preferences for multiple sexual ornaments. Evolution, 48(3):853–867, 1994.
Richard O. Prum. The Lande-Kirkpatrick mechanism is the null model of evolution by intersexual selection: implications for meaning, honesty, and design in intersexual signals. Evolution, 64(11):3085–3100, 2010.
Eleanor M. Caves, Nicholas C. Brandley, and Sönke Johnsen. Visual acuity and the evolution of signals. Trends in Ecology & Evolution, 33(5):358–372, 2018.
Mathieu Joron and James L. B. Mallet. Diversity in mimicry: paradox or paradigm? Trends in Ecology & Evolution, 13(11):461–466, 1998.
William A. Searcy and Stephen Nowicki. The Evolution of Animal Communication: Reliability and Deception in Signaling Systems. Princeton University Press, 2005.
Derek A. Roff. The evolution of mate choice: a dialogue between theory and experiment. Annals of the New York Academy of Sciences, 1360(1):1–15, 2015.
John A. Endler. Predation, light intensity and courtship behaviour in Poecilia reticulata (Pisces: Poeciliidae). Animal Behaviour, 35(5):1376–1385, 1987.
Appendix A Example Application to the Evolution of Color Patterns: Background
The incredible variety of color patterns seen in nature evolved under the selective forces imposed by the environment and the visual experience of their receivers.
Quantifying this diversity and reliably testing the functional significance of these traits is fundamental to understanding fitness landscapes and underlies many subdisciplines of sensory ecology, cognitive neuroscience, collective behavior, and evolution. Creating quantitative descriptions of color patterns which take into account the unique sensory and semiotic worlds of their receivers has been a central challenge in visual ecology. Many tools have been developed: Quantitative Colour Pattern Analysis, PAVO, and Natural Pattern Match, among others. Each of these tools uses one or an ensemble of complementary metrics from image analysis and computer vision, e.g. image statistics, edge detection, and landmark-based filters.

Still, fundamental gaps remain. One of these gaps is the difficulty of building quantitative descriptions of complex features with multiple subelements. Most existing approaches fail to capture the full complexity of many color patterns; the algorithms themselves are insufficiently expressive. This is particularly true when spatial or scale-dependent relationships between features exist, e.g. the irregular patterns of male guppy ornamentation or butterfly wing patterns, where similar sets of elements are arranged in species-specific configurations. Recently, researchers have begun employing machine learning algorithms such as non-linear dimensionality reduction, e.g. t-distributed stochastic neighbor embedding (t-SNE,
Figure 2), and deep neural networks (Figure 2, Figure S1). Still, while these techniques can better represent more complex relationships between pixel values within an image, current implementations do not disentangle features across scales or provide extensions to downstream experiments.
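As a point of reference for the non-linear dimensionality reduction mentioned above, the following is a minimal sketch of a 2D t-SNE embedding of per-image feature vectors using scikit-learn. The `codes` array and its dimensions are illustrative stand-ins, not the encodings produced in this study:

```python
# Minimal sketch: 2D t-SNE embedding of per-image feature vectors.
# `codes` is a random stand-in for real latent encodings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
codes = rng.normal(size=(500, 40))  # 500 images x 40 latent dimensions (assumed)

# Non-linear 2D embedding; perplexity sets the effective neighborhood size.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(codes)
print(embedding.shape)  # (500, 2): one 2D point per image
```

Embeddings of this kind underlie visualizations such as Figure S3, where each hierarchical latent code can be embedded separately to compare the structure captured at each scale.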
While complex traits may be difficult to quantify, they are nonetheless biologically relevant in terms of feature context and the perception of shape, motion, and attention. And in the brain, we know that perception is hierarchically organized, and representations made at higher levels of the visual cortex and its homologs heavily influence the perception of low-level features. While measuring local features across an image provides important insight into regularity and the nature of wide-field variation, a collection of local feature descriptions across space is fundamentally different from a feature description built across scales.

Another gap is in building direct connections between approaches. Establishing spectral sensitivity, acuity, and feature importance is typically done using stimulus playback experiments or behavioral assays. However, beyond using statistical descriptions of features to guide researchers in the creation of stimuli, there are few explicit connections between analysis and experiment. The current state of the art, immersive virtual reality (VR) and low-latency playback experiments with fully animated, photo-realistic 3D models, provides a rich experimental basis for investigating the relationship between visual inputs, neural activity, and behavior.
VR systems are also beginning to better account for species-specific sensory biases, including photoreceptor sensitivity, flicker fusion rate, acuity, and depth perception.
Still, these approaches currently rely on human-in-the-loop interventions to create stimuli with even moderate complexity.

Additionally, because color pattern traits have evolved under selective pressure from multiple receivers, establishing these types of evolutionary trade-offs is important to our understanding. However, experimental approaches often require large, highly disruptive manipulations such as translocation experiments or large-scale crossbreeding experiments. Simulations and virtual experiments allow researchers to be explicit about the stimulus being tested and greatly reduce the number of subjects needed (Methods 4.4).
A.1 The potential impacts of this approach on the study of evolution
This platform may be used to address many outstanding questions regarding the functional significance of color pattern traits; here, we discuss some of these questions. What are the constraints on the evolvability of a given trait? By identifying the topographical relationships between different traits within the color pattern space, we can test predictions about the selective forces acting on them related to their geometric relationships, e.g. the axes of variation in traits meant to communicate viability should show increased orthogonality compared to co-occurring traits which have evolved under a Fisherian process. Categorical perception is an important perceptual mechanism for understanding the evolution of color signals, but in systems where color patterns are used for mimicry or novelty, investigating the boundaries between complex traits is fundamental. By performing traversals across the distribution of the latent variables, interpolating between samples can allow for tests of continuous versus categorical perception of complex traits (see the sketch at the end of this section). Many color pattern traits have evolved under selective pressure from multiple receivers, e.g. both females and predators shape the diversity of male guppy ornaments.
Establishing these types of evolutionary trade-offs is difficult and often requires large, highly disruptive manipulations such as translocation experiments. Using evolutionary models similar to the ones presented here, researchers can simulate multiple fitness landscapes and evolutionary trajectories simultaneously to perform a broad range of virtual experiments. Importantly, while each of these examples places either analytical, experimental, or virtual results at the center, by using the platform presented here they maintain direct connections across approaches. Furthermore, they can incorporate existing techniques as image preprocessing routines, during playback, or as constraints on virtual experiments.
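To make the traversal and interpolation ideas above concrete, here is a minimal sketch under stated assumptions: `generator` is a hypothetical stand-in for a trained decoder, the 10-dimensional latent size is illustrative, and the [-2, 2] traversal range mirrors the one used in Figure S2:

```python
# Minimal sketch of latent interpolation and traversal, assuming a trained decoder.
import numpy as np

def generator(z: np.ndarray) -> np.ndarray:
    """Hypothetical decoder; replace with the trained generative model."""
    return np.tanh(z)  # placeholder "image"

rng = np.random.default_rng(0)
z_a = rng.normal(size=10)  # latent code of embedded sample A
z_b = rng.normal(size=10)  # latent code of embedded sample B

# Interpolation between two samples, e.g. for continuous-vs-categorical tests.
interpolants = [generator((1 - t) * z_a + t * z_b) for t in np.linspace(0, 1, 8)]

# Traversal of a single latent variable over [-2, 2], holding the rest at zero.
traversal = []
for value in np.linspace(-2, 2, 8):
    z = np.zeros(10)
    z[3] = value  # vary only latent dimension 3 (arbitrary choice)
    traversal.append(generator(z))
```

Presenting the interpolants to receivers in a playback experiment is one way to test whether a perceptual boundary between two complex traits is sharp (categorical) or graded (continuous).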
Appendix B Supplemental Figures
Figure S1: Convolutional layers. In typical supervised discriminative models, the objective being optimized is well defined, e.g. accurate classification or localization. As such, the representations provided by the downstream convolutional layers of deep networks take on characteristics optimized for task performance. At higher and higher network layers, the boundaries between classes can become complex and specialized to this objective because of the usefulness of such representations for identifying complex boundaries. Middle: features learned at lower layers relate to color patches or gestures, whereas at higher levels (bottom) features become complex and interpretability can be difficult. Left: image pixels which are activated by pretrained image filters (yellow represents higher activations). Right: the maximally activating image feature for each filter.
Figure S2: Exploring latent variables. a) We incorporated knowledge of our sample data by providing a 32-class categorical latent code and were able to generate examples from distinct classes learned by the model which capture meaningful combinations of features in our sample data. b) Latent traversal of three latent variables. Top and middle are two embedded samples and bottom a latent code initialized at zero. For each latent variable (rows) we traverse values between -2 and 2 for the generated output. We see that each latent code has consistent effects. All samples in both a and b are generated from the generative models (b and c in Figure 1).

Figure S3: Exploring latent representations. a) 2D embedding (using t-SNE) of butterfly images using the 4 hierarchical latent encodings. The relationships between images at lower levels are dominated by color value similarities, whereas at higher layers pattern elements at increasing spatial scales define the relationships between samples. b) Nearest neighbors of 8 random samples based on the Minkowski distance within the 10-dimensional space of each latent code.

Figure S4: Latent variable "alleles" over generations. a) We find two alleles are driven to fixation in the population after several generations of selecting for more orange and higher-contrast color patterns in guppies.
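A toy version of the selection dynamics summarized in Figure S4 can be sketched as a simple loop. Everything here is an illustrative assumption: the fitness function (rewarding two hypothetical latent dimensions standing in for "more orange" and "higher contrast"), the population size, and the mutation scale are not the study's actual settings:

```python
# Toy sketch of selection on latent "alleles" over generations.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(size=(100, 10))  # 100 individuals x 10 latent dimensions

def fitness(pop: np.ndarray) -> np.ndarray:
    # Hypothetical preference score: reward dims 0 and 1
    # (stand-ins for "orange" and "contrast").
    return pop[:, 0] + pop[:, 1]

for generation in range(50):
    survivors = population[np.argsort(fitness(population))[-50:]]  # keep top half
    offspring = survivors[rng.integers(0, 50, size=100)]           # resample parents
    population = offspring + rng.normal(scale=0.05, size=offspring.shape)  # mutate

print(population[:, :2].mean(axis=0))  # the selected dimensions drift upward
```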
Figure S5: Movie 1: The combined pattern space over 500 generations, visualized in 2D using t-SNE.
Figure S6: