Improved Conditional Flow Models for Molecule to Image Synthesis
Karren Yang, Samuel Goldman, Wengong Jin, Alex Lu, Regina Barzilay, Tommi Jaakkola, Caroline Uhler
Abstract
In this paper, we aim to synthesize cell microscopy images under different molecular interventions, motivated by practical applications to drug development. Building on the recent success of graph neural networks for learning molecular embeddings and flow-based models for image generation, we propose Mol2Image: a flow-based generative model for molecule to cell image synthesis. To generate cell features at different resolutions and scale to high-resolution images, we develop a novel multi-scale flow architecture based on a Haar wavelet image pyramid. To maximize the mutual information between the generated images and the molecular interventions, we devise a training strategy based on contrastive learning. To evaluate our model, we propose a new set of metrics for biological image generation that are robust, interpretable, and relevant to practitioners. We show quantitatively that our method learns a meaningful embedding of the molecular intervention, which is translated into an image representation reflecting the biological effects of the intervention.
1 Introduction

High-content cell microscopy assays are gaining traction in recent years as the rich morphological data from the images proves to be more informative for drug discovery than conventional targeted screens [6, 12, 54]. Motivated by these developments, we aim to build, to our knowledge, the first generative model to synthesize cell microscopy images under different molecular interventions, translating molecular information into a high-content and interpretable image representation of the intervention. Such a system has numerous practical applications in drug development: for example, it could enable practitioners to virtually screen compounds based on their predicted morphological effects on cells, allowing more efficient exploration of the vast chemical space and reducing the resources required to perform extensive experiments [45, 52, 56]. In contrast to conventional models that predict specific chemical properties, a molecule-to-image synthesis model has the potential to produce a panoptic view of the morphological effects of a drug that captures a broad spectrum of properties such as mechanisms of action [32, 33, 44] and gene targets [4].

To build our molecule-to-image synthesis model (Mol2Image), we integrate state-of-the-art graph neural networks for learning molecular representations with flow-based generative models. Flow-based models are a relatively recent class of generative models that learn the data distribution by directly inferring the latent distribution and maximizing the log-likelihood of the data [9, 10, 25]. Compared to other classes of deep generative models such as variational autoencoders (VAEs) [26] and generative adversarial networks (GANs) [15], flow-based models do not rely on approximate posterior inference or adversarial training and are less prone to training instability and mode collapse, making them advantageous for biological applications [53].
∗ Massachusetts Institute of Technology, Cambridge, MA, USA; † University of Toronto, Toronto, ON, Canada; ‡ To whom correspondence should be addressed: [email protected], [email protected]

Figure 1: (a) Model architecture. (Red box) Our flow-based model architecture based on a Haar wavelet image pyramid. Information flow follows the black arrows during training/inference and the red arrows during generation. The dashed lines represent conditioning and are used in both training and generation. (Green box) Molecular information is processed and input to the network via a graph neural network g. (b) Training strategy. Our training strategy for effective molecule-to-image synthesis. See text for details.

Nevertheless, molecule-to-image synthesis is a challenging task that highlights key, unsolved problems in flow-based generation. Current flow architectures cannot scale to full-resolution cell images (e.g., 512 × 512) and are unable to separately generate image features at multiple spatial resolutions, which is important to disentangle coarse features (e.g., cell distribution) from fine features (e.g., subcellular localization of proteins). While separate generation of image features at different resolutions has been demonstrated using GANs [8], this is still an open problem for flow-based models. Furthermore, existing formulations of flow models do not effectively leverage conditioning information when the relationship between the image and the conditioning information is complex and/or subtle, as is the case with molecular interventions. This results in generated samples that do not reflect the conditioning information.

Contributions.
In this work, we develop (to our knowledge) the first molecule-to-image synthesis model for generating high-resolution cell images conditioned on molecular interventions. Specifically:

• We develop a new architecture and approach for flow-based generation based on a Haar wavelet image pyramid, which generates image features at different spatial resolutions separately and enables scaling to high-resolution images.

• We propose a new training strategy based on contrastive learning to maximize the mutual information between the latent variables of the flow model and the embedding of the molecular graph, ensuring that generated images reflect the molecular intervention.

• We establish a set of evaluation metrics specific to biological image generation that are robust, interpretable, and relevant to biological practitioners.

• We demonstrate that our method outperforms the baselines by a large margin, indicating potential for application to virtual screening.

Although we focus on molecule-to-image synthesis in this work, our generative approach can potentially extend to other applications, e.g., text-to-image synthesis [46].
2 Related Work

Biological Image Generation.
Osokin et al. use GAN architectures to generate cellular images of budding yeast to infer missing fluorescence channels (stained proteins) in a dataset where only two channels can be observed at a time [43]. Separately, Goldsborough et al. qualitatively evaluate the use of different GAN variants in generating three-channel images of human breast cancer cell lines [14]. While these works consider the task of generating single-cell images, neither considers the generation of cells conditioned on complex inputs nor the generation of multi-cell images, which is useful for observing cell-to-cell interactions [41] and variability [38]. A separate, similar line of investigation in histopathology and medical imagery has used GAN models to refine and generate synthetic datasets for training downstream classifiers but does not address the difficulty of conditional image generation necessary to capture drug interventions [21, 37, 60]. While both high-throughput image-based drug screens [5] and molecular structures [59] have been used to generate representations of small molecules, little work has focused on learning representations of these modalities jointly.

Graph Neural Networks for Molecules.
A neural network formulation on graphs was first proposed by Gori et al. [16] and Scarselli et al. [50] and later extended to various graph neural network (GNN) architectures [30, 7, 40, 27, 18, 29, 55, 58]. In the context of molecular property prediction, Duvenaud et al. [11] and Kearnes et al. [23] first applied GNNs to learn neural fingerprints for molecules. Gilmer et al. [13] further enhanced GNN performance by using set2set readout functions and adding virtual nodes to molecular graphs. Yang et al. [59] provided extensive benchmarking of various GNN architectures and demonstrated the advantage of GNNs over traditional Morgan fingerprints [48] as well as domain-specific features. While these works mainly focused on predicting numerical chemical properties, we here focus on using GNNs to learn rich molecular representations for molecule-to-image synthesis.
Flow-Based Generative Models.
A flow-based generative model (e.g., Glow) is a sequence of invertible networks that transforms the input distribution into a simple latent distribution such as a spherical Gaussian [9, 10, 19, 25, 34, 47]. Conditional variants of Glow have recently been proposed for image segmentation [36, 57], modality transfer [28, 53], image super-resolution [57], and image colorization [1]. These applications are variants of image-to-image translation tasks and leverage the spatial correspondence between the conditioning information and the generated image. Other conditional models perform generation given an image class [25] or a binary attribute vector [31]. Since the condition is categorical, these models apply auxiliary classifiers in the latent space to ensure that the model learns the correspondence between the condition and the image. Unlike these works, we generate images from molecular graphs; here spatial correspondence is not present and the conditioning information cannot be learned using a classifier. Therefore, we must leverage other techniques to ensure correspondence between the generated images and the conditioning information. In addition to conditioning on molecular structure, our flow model architecture is based on an image pyramid, which conditions the generation of fine features at a particular spatial resolution on a coarse image from another level of the pyramid. Flow-based generation of images conditioned on other images has been explored in various previous works [1, 28, 36, 53, 57], but in contrast to these works, our flow-based model leverages conditioning to break generation into successive steps and refine features at different scales. Our approach is inspired by methods such as Laplacian pyramid GANs [8] that break GAN generation into successive steps. A key design choice here is our use of a Haar wavelet image pyramid instead of a Laplacian pyramid, which avoids introducing redundant variables into the model, an important consideration for flow-based models.
Ardizzone et al. [1] use the Haar wavelet transform to improve training stability, but they do not consider the framework of an image pyramid for separately generating features at different spatial resolutions.
3 Method

Our approach is to develop a flow-based generative model for synthesizing cell images conditioned on the molecular embeddings of a graph neural network. We first provide an overview of graph neural networks (Section 3.1) and generative flows (Section 3.2). In Section 3.3, we describe our novel multi-scale flow architecture, which generates images in a coarse-to-fine process based on the framework of an image pyramid. The architecture separates the generation of image features at different spatial resolutions and scales to high-resolution cell images. In Section 3.4, we describe a novel training strategy using contrastive learning for effective molecule-to-image synthesis.
3.1 Graph Neural Networks

A molecule y can be represented as a labeled graph G_y whose nodes are the atoms in the molecule and whose edges are the bonds between the atoms. Each node v has a feature vector f_v including its atom type, valence, and other atomic properties. Each edge (u, v) is also associated with a feature vector f_uv indicating its bond type. A graph neural network (GNN) g learns to embed a graph G_y into a continuous vector g(y). In this paper, we adopt the GNN architecture from [7, 59], which associates hidden states h_v with each node v and updates these states by passing messages m_{u→v} over edges (u, v). Each message m^{(0)}_{u→v} is initialized at zero. At time step t, the messages are updated as follows:

    m^{(t+1)}_{u→v} = MLP( f_u, f_uv, Σ_{w ∈ N(u), w ≠ v} m^{(t)}_{w→u} )   ∀ (u, v) ∈ G_y,   (1)

where N(u) is the set of neighbor nodes of u and MLP stands for a multilayer perceptron. After T message passing steps, we compute the hidden states h_v as well as the final representation g(y) as

    h_u = MLP( f_u, Σ_{v ∈ N(u)} m^{(T)}_{v→u} ),   g(y) = MLP( Σ_{u ∈ G_y} h_u ).   (2)

3.2 Generative Flows

A generative flow f consists of a sequence of invertible functions f = f_1 ∘ ··· ∘ f_L that transform an input variable x into a Gaussian latent variable z. The generative process is defined as:

    z ∼ N(µ, Σ),   h_L = z,   h_{L−1} = f_L^{−1}(h_L),   ···,   h_0 = f_1^{−1}(h_1),   x = h_0,   (3)

where {h_i}_{i ∈ 0···L} are the intermediate variables that arise from applying the inverses of the individual flow functions {f_i}_{i ∈ 1···L}. By the change-of-variables formula, the log-likelihood of sampling x is

    log p(x) = log p_N(z; µ, Σ) + Σ_{i=1}^{L} log | det dh_i / dh_{i−1} |,   (4)

where p_N is the Gaussian probability density function. In this paper, we adopt the flow functions from the Glow model [25], in which each flow step consists of actnorm, invertible 1×1 convolution, and coupling layers (see [25] for details).
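As an aside on why the log-likelihood in Equation (4) is tractable, the sketch below implements a single affine coupling layer in numpy. The two small random matrices are illustrative stand-ins for the learned scale and shift networks of a real Glow step (actnorm and the 1×1 convolution are omitted); the point is that the layer is exactly invertible and its log-determinant is a cheap sum.

```python
import numpy as np

def coupling_forward(x, s_mat, t_mat):
    """Affine coupling: pass x1 through unchanged, affinely transform x2.
    The Jacobian is triangular, so log|det| is just sum(log(scale))."""
    x1, x2 = np.split(x, 2)
    scale = np.exp(np.tanh(s_mat @ x1))   # toy stand-in for the scale network
    y2 = x2 * scale + t_mat @ x1          # toy stand-in for the shift network
    return np.concatenate([x1, y2]), np.sum(np.log(scale))

def coupling_inverse(y, s_mat, t_mat):
    """Exact inverse: recompute scale/shift from the untouched half."""
    y1, y2 = np.split(y, 2)
    scale = np.exp(np.tanh(s_mat @ y1))
    return np.concatenate([y1, (y2 - t_mat @ y1) / scale])

rng = np.random.default_rng(0)
d = 4
s_mat = rng.normal(size=(d // 2, d // 2))
t_mat = rng.normal(size=(d // 2, d // 2))
x = rng.normal(size=d)

z, logdet = coupling_forward(x, s_mat, t_mat)
x_rec = coupling_inverse(z, s_mat, t_mat)
# Log-likelihood under a standard Gaussian latent (Eq. 4, single-layer case).
log_p = -0.5 * np.sum(z ** 2) - 0.5 * d * np.log(2 * np.pi) + logdet
```

Stacking many such layers (with the roles of the two halves alternating) gives an expressive yet exactly invertible map, which is the basic building block reused throughout our architecture.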
The Jacobian matrices of these transformations are triangular and hence have log-determinants that are easy to compute. As a result, the log-likelihood of the data is tractable and can be efficiently optimized with respect to the parameters of the flow functions.

3.3 Multi-Scale Flow Architecture

Existing multi-scale architectures for generative flows [10, 25] do not separately generate features for different spatial resolutions and cannot scale to full-resolution cell images. In the following, we propose a novel multi-scale architecture that generates cell images in a coarse-to-fine fashion and enables scaling to high-resolution images. Our architecture integrates flow units into the framework of an image pyramid generated by recursive 2D Haar wavelet transforms.
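To make the wavelet machinery concrete before the details below, here is a numpy sketch of one level of the 2D Haar transform and its inverse, using the orthogonal 1/2 normalization (the scaling convention in the actual model may differ). The round trip is exact, which is what makes the transform usable inside an invertible flow.

```python
import numpy as np

def haar2d(img):
    """One level of the 2D Haar transform: 2x2 blocks -> 1 coarse + 3 fine maps."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]   # top-left, top-right pixels
    c, d = img[1::2, 0::2], img[1::2, 1::2]   # bottom-left, bottom-right pixels
    coarse = (a + b + c + d) / 2.0            # averaging kernel
    fine = ((a - b + c - d) / 2.0,            # horizontal detail
            (a + b - c - d) / 2.0,            # vertical detail
            (a - b - c + d) / 2.0)            # diagonal detail
    return coarse, fine

def ihaar2d(coarse, fine):
    """Inverse transform: exactly reconstruct the higher-resolution image."""
    h, v, g = fine
    a, b = (coarse + h + v + g) / 2.0, (coarse - h + v - g) / 2.0
    c, d = (coarse + h - v - g) / 2.0, (coarse - h - v + g) / 2.0
    out = np.empty((2 * coarse.shape[0], 2 * coarse.shape[1]))
    out[0::2, 0::2], out[0::2, 1::2] = a, b
    out[1::2, 0::2], out[1::2, 1::2] = c, d
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
coarse, fine = haar2d(img)       # coarse is the 4x4 downsampled image
img_rec = ihaar2d(coarse, fine)  # exact reconstruction
```

Applying `haar2d` recursively to the coarse output produces the image pyramid used below, and `ihaar2d` plays the role of the inverse operation I in the reconstruction step.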
Haar Wavelets.
Wavelets are functions that can be used to decompose an image into coarse and fine components. The Haar wavelet transform generates the coarse component in a way that is equivalent to nearest-neighbor downsampling. The coarse component is obtained by convolving the image with an averaging kernel followed by sub-sampling by a factor of 2, and the fine components are obtained by convolving the image with three difference kernels followed by sub-sampling by a factor of 2:

    M_average = (1/2) [ 1  1 ; 1  1 ],   M_diff1 = (1/2) [ 1 −1 ; 1 −1 ],   M_diff2 = (1/2) [ 1  1 ; −1 −1 ],   M_diff3 = (1/2) [ 1 −1 ; −1  1 ].   (5)

To generate an image pyramid that captures features at different spatial resolutions, we recursively apply Haar wavelet transforms to the coarse image. Specifically, let [x_0, x_1, ···, x_k] be a pyramid of downsampled images, where x_i represents the image x after i applications of the coarse operation. We apply the fine operation to each downsampled image except the last, resulting in the image pyramid [x̃_0, x̃_1, ···, x̃_{k−1}, x_k]. The image at each spatial resolution can be reconstructed recursively as

    x_i = I([U(x_{i+1}), x̃_i]),

where U represents spatial upsampling, the brackets indicate concatenation, and I represents the inverse of the linear operation corresponding to the 2D Haar wavelet transform; see Equation (5).

Haar Pyramid Generative Flow.
Our flow architecture f consists of multiple blocks b_0, ···, b_k, each responsible for generating the fine features for a different level of the Haar image pyramid conditioned on a coarse image from the next level of the pyramid; see Figure 1a, red box. Note that each block b_i consists of multiple invertible flow units, i.e., b_i = f^{(i)}_1 ∘ ··· ∘ f^{(i)}_L, and can be treated independently as a generative flow from Section 3.2. The generative process is defined as follows. First we generate the final downsampled image of the pyramid,

    z_k ∼ N(µ_k, Σ_k),   x_k = b_k^{−1}(z_k),   (6)

by sampling a latent vector that corresponds to the coarsest features and passing it through the first block. Then we recursively sample latent vectors corresponding to finer spatial features and generate the other images in the Haar image pyramid as follows:

    z_i ∼ N(µ_i(x_{i+1}), Σ_i(x_{i+1})),   x̃_i = b_i^{−1}(z_i, x_{i+1}),   x_i = I([U(x_{i+1}), x̃_i]),   0 ≤ i < k,

where x_0 = x is the final full-resolution image. To perform conditioning on the coarse image x_{i+1}, we provide it as an additional input both to the prior distribution of z_i and to the individual flow units in b_i. Computation of the log-likelihood within the image pyramid framework is straightforward, since the Haar wavelet transform is an invertible linear transformation with a block-diagonal Jacobian matrix that adds a constant factor to the log-determinant in Equation (4).

Conditioning on a Molecular Graph.
To condition the generation of features by block b_i on a molecular intervention y, we condition the distribution of the latent variables z_i on the output of a graph neural network. Specifically, we let µ_i, Σ_i take g(y) as input, where g is the graph neural network described in Section 3.1; see Figure 1b, green box.

3.4 Training Strategy Based on Contrastive Learning

The challenge of training a conditional flow model using log-likelihood is that it may not sufficiently leverage the shared information between the input image and the molecular intervention. Intuitively, the flow model can achieve a high log-likelihood by converting the input image distribution into a Gaussian distribution without using the condition. This is especially true for molecule-to-image synthesis because the effect of the molecular intervention on the cells is subtle in the image space. To ensure that the conditional flow model extracts useful information from the molecular graph for generation, we propose a training strategy based on contrastive learning. As shown in Figure 1b, during training, we use contrastive learning to maximize the mutual information between the latent variables from the flow model f and the molecular embedding from the graph neural network g. During generation, information flow is reversed through the flow model to generate an image that is tightly coupled to the conditioning molecular information.

The objective of contrastive learning is to learn embeddings of x and y that maximize their mutual information. Specifically, these embeddings should distinguish "matched" samples from the joint distribution p_xy from "mismatched" samples from the product of the marginals p_x p_y. To obtain these embeddings, we train a critic function h to assign high values to matched samples and low values to mismatched samples by minimizing the following contrastive loss:

    L_contrastive = − E_{(x,y) ∼ p_xy, y_1, ···, y_N ∼ p_y} [ log ( h(x, y) / Σ_{i=1}^{N} h(x, y_i) ) ].   (7)

In practice, we compute h(x, y) by taking the cosine similarity of f(x) and g(y), where f is the flow model and g is the graph neural network that embeds the molecular structure graph y:

    h(x, y) = exp( (f(x) · g(y)) / (τ ‖f(x)‖ ‖g(y)‖) ),   (8)

where τ > 0 is a temperature hyperparameter. Minimizing the contrastive loss in Equation (7) is equivalent to maximizing a lower bound on the mutual information between f(x) and g(y) and has been used in previous work on representation learning [42]. Our key insight is in leveraging contrastive learning in a conditional flow model to maximize the mutual information between the latent image variables f(x) and the molecular embedding g(y), such that reversing the information flow through f generates images that share a high degree of information with the molecular graph y.

4 Experiments

Dataset.
We perform our experiments on the Cell Painting dataset introduced by Bray et al. [2, 3] and preprocessed by Hofmarcher et al. [20]. The dataset consists of 284K cell images collected from 10.5K molecular interventions. We divide the dataset into a training set of 219K images corresponding to 8.5K molecular interventions and hold out the remainder of the data for evaluation. The held-out data consists of images corresponding to each of the 8.5K molecules in the training set as well as images corresponding to 2K molecules that are not in the training set.

Figure 2: Examples of cell images generated by our method vs. the baselines.
Implementation and Training Details.
Our model for the molecule-to-image generation task consists of six flow modules that construct different levels of the Haar wavelet image pyramid, generating images from a resolution of 16 × 16 up to 512 × 512. The lowest-resolution module consists of 64 flow units, and each of the other modules consists of 32 flow units. Each of the modules is trained to maximize the log-likelihood of the data (Equation 4). Additionally, the three flow modules that process low-resolution images (up to 64 × 64 resolution) are also trained to maximize the mutual information between the latent variables and the molecular features using contrastive learning. We train each flow module for approximately 50K iterations using Adam [24], during which the highest-resolution block sees over 1M images and the lowest-resolution block sees over 10M images.

Robust Evaluation Metrics for Biological Image Generation.
For a molecule-to-image synthesis model to be useful to practitioners, it needs to generate image features that are meaningful from a biological standpoint. It has been shown that machine learning methods can discriminate between microscopy images using features that are irrelevant to the target content [51, 35]. Therefore, in addition to more conventional vision metrics, we propose a new set of evaluation metrics based on CellProfiler cell morphology features [39] that are more robust, interpretable, and relevant to practitioners [49]. We specifically consider the following morphological features:

• Coverage.
The total area of the regions covered by segmented cells. • Cell/Nuclei Count.
The total number of nuclei/cells found in the image. • Cell Size.
The average size of the segmented cells found in the image. • Zernike Shape.
A set of 30 features that describe the shape of cells using a basis of Zernike polynomials (order 0 to order 9). • Expression Level.
A set of five features that measure the level of signal from the different cellular compartments captured in the image: DNA, mitochondria, endoplasmic reticulum, Golgi, cytoplasmic RNA, nucleoli, actin, and plasma membrane.

We extract these features from a subset of images and compute the Spearman correlation between the features of real and generated images corresponding to the same molecule (see Supplementary Material for details). Due to space constraints, we show the mean of the correlation coefficients for the 30 Zernike shape features and the five expression level features.
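As a minimal illustration of how such a metric is computed, Spearman correlation is simply the Pearson correlation of ranks. The per-molecule cell-count values below are hypothetical, and ties are not handled in this sketch:

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks
    (no tie handling in this sketch)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Hypothetical per-molecule cell counts extracted (e.g., with CellProfiler)
# from real images and from generated images of the same six molecules.
real_counts = np.array([12.0, 30.0, 18.0, 25.0, 9.0, 21.0])
gen_counts = np.array([14.0, 28.0, 16.0, 27.0, 11.0, 19.0])
score = spearman(real_counts, gen_counts)   # the rank orderings agree exactly here
```

Because it depends only on ranks, the metric is insensitive to monotone rescalings of the extracted features, which is part of what makes it robust for comparing real and generated images.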
Other Evaluation Metrics.
In addition to these specialized metrics for biological images, we also evaluate our model using the following metrics that are more conventional for image generation tasks:

• Sliced Wasserstein Distance (SWD).
To assess the visual quality of the generated images, we consider the statistical similarity of image patches taken from multiple levels of a Laplacian pyramid representation of generated and real images, as described in [22]. This metric compares the unconditional distributions of patches between generated and real images, but it does not take into account the correspondence between the generated image and the molecular information.

Approach               Coverage  Cell Count  Cell Size  Zernike Shape  Exp. Level  SWD   Corr
CGAN [17]              7.0       4.8         -2.9       -3.9           7.4         5.65  56.6
CGlow [25]             -1.3      3.8         5.8        2.2            6.6         5.01  55.5
w/o pyramid            28.5      36.1        17.5       8.7            26.7        4.96  60.0
w/o contrastive loss   7.7       13.4        12.0       6.8            5.3
Table 1: Evaluation of Mol2Image (our model) vs. the baselines on images generated from molecules in the training set. "Coverage", "Cell Count", "Cell Size", "Zernike Shape", and "Exp. Level" measure Spearman correlation coefficients (×100) between features from a subset of real and generated images; higher is better. "Corr" represents the correspondence classification accuracy of a pretrained model; higher is better, and the accuracy on ground-truth images serves as an upper bound. "SWD" is the sliced Wasserstein distance metric from [22]; lower is better. See text for details.

Approach      Coverage  Cell Count  Cell Size  Zernike Shape  Exp. Level  SWD   Corr
CGAN [17]     6.4       1.9         -1.5       -1.0           9.2         5.60  56.1
CGlow [25]    3.1       -3.7        -3.0       -3.1           3.7         5.40  54.5
w/o pyramid   9.2       1.7

Table 2: Same as Table 1, but evaluated on images generated from held-out molecules. The accuracy of ground truth (upper bound) on the correspondence classification accuracy (Corr) metric is again the reference point.

• Correspondence Classification Accuracy (Corr).
To assess the correspondence between the generated images and the molecular information, we compute the accuracy of a pretrained correspondence classifier on the generated images. The classifier consists of a visual network and a GNN that are trained on a binary classification task: detect whether the input cell image matches the input molecular intervention (positive sample) or whether they are mismatched (negative sample). The accuracy with which the classifier detects correctly matched pairs of real images and molecules serves as the upper bound for this metric.

Baselines and Ablations.
Since molecule-to-image synthesis is a novel task, we develop our own baselines based on well-established generative models and perform ablations to determine the benefit of our approach. Since not all of the methods are capable of generating high-quality images at full 512 × 512 resolution, we compare all of the model results at 64 × 64 spatial resolution.

• Baseline: Conditional GAN with Graph Neural Network (CGAN).
We train a CGAN such that a generator network G is trained to generate images conditioned on the corresponding molecule y. Both the generator G and the discriminator D are conditioned on the molecular representation g(y) learned by the same GNN as above. We use a Wasserstein GAN trained with a gradient penalty [17]. This variant is able to consistently produce qualitatively realistic images in the unconditional setting, in agreement with previous generative models for cell image data [43].

• Baseline: Conditional Glow with Graph Neural Network (CGlow).
Since our model is an improved flow-based model for conditional generation, we develop a baseline approach based on existing work that is a straightforward extension of Glow to the conditional setting [25]. Specifically, this baseline model conditions the distribution of latent variables introduced at every level of the multi-scale architecture on the output of the graph neural network and optimizes the conditional log-likelihood with respect to the model parameters. Alternatively, this model can be seen as an ablation of our model without the pyramid architecture or contrastive training.

• Ablation: Mol2Image without Pyramid Architecture (w/o pyramid).
We train our model without the framework of the image pyramid for separately generating features at different scales. Instead, we directly generate the full-resolution image.

• Ablation: Mol2Image without Contrastive Learning (w/o contrastive loss).
We train our model without using the contrastive loss to maximize the mutual information between the latent variables of the image and the embeddings extracted by the graph neural network.
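For reference, the contrastive objective that this ablation removes (Equation (7) with the cosine critic of Equation (8)) can be sketched over a batch as follows. This is an illustrative numpy version with random stand-in embeddings, not the training code; each image's matched molecule is the positive and the other molecules in the batch act as negatives:

```python
import numpy as np

def contrastive_loss(img_emb, mol_emb, tau=0.1):
    """Batch InfoNCE with a cosine-similarity critic: for each image, its
    matched molecule (same row) is the positive, the rest are negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    logits = img @ mol.T / tau                       # log-critic for all pairs
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
f_x = rng.normal(size=(16, 32))   # stand-in latent image features f(x)
g_y = rng.normal(size=(16, 32))   # stand-in molecular embeddings g(y)

loss_random = contrastive_loss(f_x, g_y)    # uninformative pairing
loss_aligned = contrastive_loss(f_x, f_x)   # perfectly aligned embeddings
```

Aligned embeddings yield a much lower loss than uninformative ones, which is the training signal that ties the latent image variables to the molecular embedding.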
Results.
Tables 1 and 2 show the results of our model in comparison to the baselines. Our conditional flow-based generative model, which is trained with the proposed pyramid architecture and the contrastive loss, outperforms the baselines in generating cell images that reflect the effects of the molecular interventions. Table 1 shows that our model performs well on generating cell images conditioned on molecules that were observed during training. Table 2 shows that our model generalizes better than the baselines to molecules that were held out from the training set.

Molecular Embedding   Random  Morgan Fingerprint  w/o Contrastive Loss  Mol2Image (ours)
Mean AUC              0.569   0.645               0.675
Mean AUC (Held-Out)   0.578   0.665               0.675

Table 3: Evaluation of molecular embeddings on predicting morphological labels. Higher AUC is better. "Random" refers to embeddings from a randomly initialized GNN. "Held-out" refers to held-out molecules from the training set. For reference, a fully-supervised model (in which the parameters of the graph neural network are trained) achieves an AUC of 0.702 on held-out molecules.
Effect of Contrastive Loss.
Our training strategy, which uses a contrastive loss to maximize the mutual information between the image latent variables and the molecular embedding, is essential for effective generation of images conditioned on the molecular intervention. In particular, there is much lower correspondence between the images and the molecular intervention when contrastive learning is omitted. This result holds both when we use the image pyramid framework (i.e., compare "Mol2Image" with "w/o contrastive loss") and when we directly generate 64 × 64 images using the standard multi-scale architecture (i.e., compare "w/o pyramid" with "CGlow"). This demonstrates that contrastive learning can provide a strong signal for learning the relation between the image and the conditioning information in generative modeling, in the absence of categorical labels that could be used in a supervised framework. On the other hand, the contrastive loss does not appear to improve the unconditional quality of the generated images (based on SWD).
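For reference, the SWD metric used here can be sketched as follows: project two sets of flattened patches onto random directions, sort, and compare, since sorting solves optimal transport in 1D. This is an illustrative numpy version of the metric from [22], not the exact evaluation pipeline (which extracts patches from a Laplacian pyramid):

```python
import numpy as np

def sliced_wasserstein(patches_a, patches_b, n_proj=128, seed=0):
    """Approximate SWD between two equal-sized sets of patch vectors:
    average 1D Wasserstein distance over random unit projections."""
    rng = np.random.default_rng(seed)
    d = patches_a.shape[1]
    dirs = rng.normal(size=(d, n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # Sorting each projected set gives the optimal 1D transport matching.
    proj_a = np.sort(patches_a @ dirs, axis=0)
    proj_b = np.sort(patches_b @ dirs, axis=0)
    return np.mean(np.abs(proj_a - proj_b))

rng = np.random.default_rng(1)
real = rng.normal(size=(256, 49))                # e.g., 7x7 patches, flattened
fake_close = rng.normal(size=(256, 49))          # same distribution
fake_far = rng.normal(loc=2.0, size=(256, 49))   # shifted distribution
```

A set of patches drawn from the same distribution as the real set scores much lower than a shifted one, which is why the metric tracks unconditional visual quality rather than image-molecule correspondence.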
Effect of Pyramid Framework.
We proposed the pyramid structure to generate image features at different spatial resolutions, which is important to disentangle higher-level features (e.g., cell distribution) from lower-level features (e.g., cell shape), and to allow our model to scale to high-resolution cell images (512 × 512). Interestingly, we find that the image pyramid framework also improves the conditional generation of 64 × 64 images compared to the baseline model that directly generates images of this size (i.e., compare "Mol2Image" to "w/o pyramid"). We hypothesize that this is because it is more efficient and easier to learn the relation between images and conditions when starting with the low-resolution images at the bottom of the image pyramid. Consistent with our observations, previous works have reported that training GANs starting from lower-resolution images [22] or using an image pyramid [8] is more effective than training directly on full-resolution images.
Qualitative Examples.
Figure 2 shows a qualitative comparison between the baselines (CGAN, CGlow) and our method on generating images conditioned on molecular structure. The generated images from our method (Figure 2, row 3) more closely reflect the real effect of the intervention (Figure 2, row 4) compared to the other methods, both in terms of cell morphology and in terms of channel intensities (representing expression of different cellular components). More qualitative examples (including full-resolution 512 × 512 images) are provided in the Supplementary Material.

Analysis of Molecular Embeddings.
Since our method performs well at generating cell images conditioned on molecular interventions, we hypothesize that the GNN learns a molecular representation that reflects the morphological features of the cell image. To determine whether the molecular embeddings are linearly separable based on the morphology they induce in treated cells, we train a linear classifier to predict a subset of 14 features curated from the morphological analysis of Bray et al. [2] (see the Supplementary Material). For comparison, we consider embeddings from a randomly initialized GNN, Morgan/circular fingerprints [48], and an ablation of our model trained without the contrastive loss. Table 3 shows the average AUC of the various embeddings on this task. The results suggest that our method learns molecular embeddings that are linearly separable based on morphological properties of the treated cells (Table 3, row 1), and that the learned embeddings also generalize to previously unseen molecules (Table 3, row 2).
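The AUC values in Table 3 can be computed directly from the rank-sum (Mann-Whitney) formulation, without any ML library. The probe scores below are hypothetical values for one binary morphological label, and ties are assumed absent:

```python
import numpy as np

def auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) statistic; assumes no ties."""
    order = np.argsort(scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)   # ranks 1..n, ascending score
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical linear-probe scores for one binary morphological label.
labels = np.array([1, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.3, 0.8, 0.5])
probe_auc = auc(scores, labels)   # fraction of (pos, neg) pairs ranked correctly
```

AUC equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so 0.5 corresponds to an uninformative embedding and 1.0 to perfect separation.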
5 Conclusion

We have developed a new multi-scale flow-based architecture and training strategy for molecule-to-image synthesis and demonstrated the benefits of our approach on new evaluation metrics tailored to biological cell image generation. Our work represents a first step towards image-based virtual screening of chemicals and lays the groundwork for studying the shared information in molecular structures and perturbed cell morphology. A promising avenue for future work is integrating side information (e.g., known chemical properties, drug dosage) to impose constraints on the molecular embedding space and improve generalization to previously unseen molecules. Furthermore, even though we have focused on molecule-to-image synthesis in this paper, our contributions to flow-based models can potentially be applied in other contexts, e.g., text-to-image synthesis [46].
Acknowledgements
Karren Dai Yang was supported by an NSF Graduate Research Fellowship and ONR (N00014-18-1-2765). Alex X. Lu was funded by a pre-doctoral award from the Natural Sciences and Engineering Research Council. Regina Barzilay and Tommi Jaakkola were partially supported by the MLPDS Consortium and the DARPA AMD program. Caroline Uhler was partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), IBM, and a Simons Investigator Award.
References

[1] Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392, 2019.
[2] Mark-Anthony Bray, Sigrun M Gustafsdottir, Mohammad H Rohban, Shantanu Singh, Vebjorn Ljosa, Katherine L Sokolnicki, Joshua A Bittker, Nicole E Bodycombe, Vlado Dančík, Thomas P Hasaka, et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the cell painting assay. Gigascience, 6(12):giw014, 2017.
[3] Mark-Anthony Bray, Shantanu Singh, Han Han, Chadwick T Davis, Blake Borgeson, Cathy Hartland, Maria Kost-Alimova, Sigrun M Gustafsdottir, Christopher C Gibson, and Anne E Carpenter. Cell painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nature Protocols, 11(9):1757, 2016.
[4] Marco Breinig, Felix A Klein, Wolfgang Huber, and Michael Boutros. A chemical–genetic interaction map of small molecules using high-throughput imaging in cancer cells. Molecular Systems Biology, 11(12), 2015.
[5] Juan C Caicedo, Claire McQuin, Allen Goodman, Shantanu Singh, and Anne E Carpenter. Weakly supervised learning of single-cell feature embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9309–9318, 2018.
[6] Juan C Caicedo, Shantanu Singh, and Anne E Carpenter. Applications in image-based profiling of perturbations. Current Opinion in Biotechnology, 39:134–142, 2016.
[7] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.
[8] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.
[9] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[11] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
[12] Ulrike S Eggert. The why and how of phenotypic small-molecule screens. Nature Chemical Biology, 9(4):206, 2013.
[13] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
[14] Peter Goldsborough, Nick Pawlowski, Juan C Caicedo, Shantanu Singh, and Anne Carpenter. CytoGAN: generative modeling of cell images. bioRxiv, page 227645, 2017.
[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[16] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN), volume 2, pages 729–734. IEEE, 2005.
[17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), pages 5767–5777, 2017.
[18] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.
[19] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.
[20] Markus Hofmarcher, Elisabeth Rumetshofer, Djork-Arné Clevert, Sepp Hochreiter, and Günter Klambauer. Accurate prediction of biological assays with high-throughput microscopy images and convolutional networks. Journal of Chemical Information and Modeling, 59(3):1163–1171, 2019.
[21] Le Hou, Ayush Agarwal, Dimitris Samaras, Tahsin M Kurc, Rajarsi R Gupta, and Joel H Saltz. Unsupervised histopathology image synthesis. arXiv preprint arXiv:1712.05021, 2017.
[22] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. ICLR, 2018.
[23] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.
[24] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[25] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
[26] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[27] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2017.
[28] Ruho Kondo, Keisuke Kawano, Satoshi Koide, and Takuro Kutsuna. Flow-based image-to-image translation with feature disentanglement. In Advances in Neural Information Processing Systems, pages 4170–4180, 2019.
[29] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. International Conference on Machine Learning, 2017.
[30] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[31] Rui Liu, Yu Liu, Xinyu Gong, Xiaogang Wang, and Hongsheng Li. Conditional adversarial generative flow for controllable image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7992–8001, 2019.
[32] Vebjorn Ljosa, Peter D Caie, Rob Ter Horst, Katherine L Sokolnicki, Emma L Jenkins, Sandeep Daya, Mark E Roberts, Thouis R Jones, Shantanu Singh, Auguste Genovesio, et al. Comparison of methods for image-based profiling of cellular morphological responses to small-molecule treatment. Journal of Biomolecular Screening, 18(10):1321–1329, 2013.
[33] Lit-Hsin Loo, Hai-Jui Lin, Robert J Steininger III, Yanqin Wang, Lani F Wu, and Steven J Altschuler. An approach for extensibly profiling the molecular states of cellular subpopulations. Nature Methods, 6(10):759, 2009.
[34] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2218–2227. JMLR.org, 2017.
[35] Alex Lu, Amy Lu, Wiebke Schormann, Marzyeh Ghassemi, David Andrews, and Alan Moses. The cells out of sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers. In Advances in Neural Information Processing Systems, pages 1852–1860, 2019.
[36] You Lu and Bert Huang. Structured output learning with conditional generative flows. arXiv preprint arXiv:1905.13288, 2019.
[37] Faisal Mahmood, Richard Chen, and Nicholas J Durr. Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE Transactions on Medical Imaging, 37(12):2572–2581, 2018.
[38] Mojca Mattiazzi Usaj, Nil Sahin, Helena Friesen, Carles Pons, Matej Usaj, Myra Paz D Masinas, Ermira Shuteriqi, Aleksei Shkurin, Patrick Aloy, Quaid Morris, et al. Systematic genetics and single-cell imaging reveal widespread morphological pleiotropy and cell-to-cell variability. Molecular Systems Biology, 16(2):e9243, 2020.
[39] Claire McQuin, Allen Goodman, Vasiliy Chernyshev, Lee Kamentsky, Beth A Cimini, Kyle W Karhohs, Minh Doan, Liya Ding, Susanne M Rafelski, Derek Thirstrup, et al. CellProfiler 3.0: Next-generation image processing for biology. PLoS Biology, 16(7), 2018.
[40] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
[41] Laura H Okagaki, Anna K Strain, Judith N Nielsen, Caroline Charlier, Nicholas J Baltes, Fabrice Chrétien, Joseph Heitman, Françoise Dromer, and Kirsten Nielsen. Cryptococcal cell morphology affects host cell interactions and pathogenicity. PLoS Pathogens, 6(6), 2010.
[42] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[43] Anton Osokin, Anatole Chessel, Rafael E Carazo Salas, and Federico Vaggi. GANs for biological image synthesis. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2233–2242, 2017.
[44] Zachary E Perlman, Michael D Slack, Yan Feng, Timothy J Mitchison, Lani F Wu, and Steven J Altschuler. Multidimensional drug profiling by automated microscopy. Science, 306(5699):1194–1198, 2004.
[45] A Srinivas Reddy, S Priyadarshini Pati, P Praveen Kumar, HN Pradeep, and G Narahari Sastry. Virtual screening in drug discovery – a computational perspective. Current Protein and Peptide Science, 8(4):329–351, 2007.
[46] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
[47] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[48] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
[49] Mohammad Hossein Rohban, Shantanu Singh, Xiaoyun Wu, Julia B Berthet, Mark-Anthony Bray, Yashaswi Shrestha, Xaralabos Varelas, Jesse S Boehm, and Anne E Carpenter. Systematic morphological profiling of human gene and allele function via cell painting. eLife, 6:e24060, 2017.
[50] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
[51] L Shamir. Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis. Journal of Microscopy, 243(3):284–292, 2011.
[52] Brian K Shoichet. Virtual screening of chemical libraries. Nature, 432(7019):862–865, 2004.
[53] Haoliang Sun, Ronak Mehta, Hao H Zhou, Zhichun Huang, Sterling C Johnson, Vivek Prabhakaran, and Vikas Singh. Dual-Glow: Conditional flow-based generative model for modality transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 10611–10620, 2019.
[54] David C Swinney and Jason Anthony. How were new medicines discovered? Nature Reviews Drug Discovery, 10(7):507–519, 2011.
[55] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[56] W Patrick Walters, Matthew T Stahl, and Mark A Murcko. Virtual screening – an overview. Drug Discovery Today, 3(4):160–178, 1998.
[57] Christina Winkler, Daniel Worrall, Emiel Hoogeboom, and Max Welling. Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042, 2019.
[58] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.
[59] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, 2019.
[60] Xin Yi, Ekta Walia, and Paul Babyn. Generative adversarial network in medical imaging: A review. Medical Image Analysis, page 101552, 2019.

Appendix A Analysis of Molecular Embeddings

As described in the main text, to evaluate the molecular embedding space learned by our graph neural network, we train a linear classifier to predict a subset of morphological features. We curate the labels for this task as follows. The dataset of Bray et al. [1] provides measured values of the following morphological features for every cell image in their dataset: area, compactness, eccentricity, form factor, major axis length, minor axis length, radius, perimeter, solidity, and cell count. To each small molecule, we assign a continuous-valued vector representing the mean values of these features observed in cells treated with that molecule. Direct prediction of these values is not a meaningful task because the amount of intra-molecular variability is high relative to the inter-molecular variability; much of the variability in the features may be naturally occurring due to stochasticity in cell growth and is not explained by the molecular perturbation. Therefore, we instead predict the presence of atypical morphology caused by a molecule. We convert the continuous values to binary labels – 1 if the value is in the top or bottom 1%/5%/10% of the values for its class, and 0 otherwise – and train a logistic regression model to perform multi-task binary classification. The results in Table 3 of the main text show that the molecular embeddings learned by our graph neural network reflect morphological properties of treated cells and enable linear separation of molecules that cause atypical morphological features. Upon acceptance of the work, we will release the molecular metadata and splits used in our morphology prediction task.

B CellProfiler Evaluation

CellProfiler [2] is a standard open-source software package used for segmenting cells/nuclei and quantifying specific morphological features.
The segmentation of nuclei and cells occurs in two steps: (1) thresholding is performed to identify the nuclei from the DNA stain, and (2) the nuclei are used as reference points for determining boundaries between cells and identifying cell objects. Once the cells are identified, multiple pipelines are available to measure shape and intensity features within each cell. To evaluate the generated images from our model, we extract morphological features for a subset of generated and held-out images and compute the correlation coefficient between the features of generated and real images. To increase the range of phenotypes within the evaluated subset, we focus our evaluation on molecules that are more likely to cause a morphological change in cells, based on the morphology criterion used in Section A. Upon acceptance of the work, we will release the molecular metadata and CellProfiler pipeline used for evaluation.

C Additional Qualitative Examples

See Supplemental Figures 1 and 2 for examples of full-resolution cell images generated by our method.

References

[1] Mark-Anthony Bray, Sigrun M Gustafsdottir, Mohammad H Rohban, Shantanu Singh, Vebjorn Ljosa, Katherine L Sokolnicki, Joshua A Bittker, Nicole E Bodycombe, Vlado Dančík, Thomas P Hasaka, et al. A dataset of images and morphological profiles of 30 000 small-molecule treatments using the cell painting assay. Gigascience, 6(12):giw014, 2017.
[2] Claire McQuin, Allen Goodman, Vasiliy Chernyshev, Lee Kamentsky, Beth A Cimini, Kyle W Karhohs, Minh Doan, Liya Ding, Susanne M Rafelski, Derek Thirstrup, et al. CellProfiler 3.0: Next-generation image processing for biology. PLoS Biology, 16(7), 2018.
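The correlation-based comparison described above can be sketched as follows. This is a minimal NumPy illustration, not our evaluation code: the feature arrays are hypothetical placeholders standing in for per-molecule mean morphological measurements (e.g., CellProfiler shape and intensity features) of real and generated images.

```python
import numpy as np

def feature_correlation(real_feats, gen_feats):
    """Per-feature Pearson correlation between real and generated images.

    real_feats, gen_feats: arrays of shape (n_molecules, n_features),
    e.g. mean morphological measurements per molecule. Returns one
    correlation coefficient per feature.
    """
    def zscore(x):
        # Standardize each feature across molecules (epsilon guards
        # against constant features).
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    r, g = zscore(real_feats), zscore(gen_feats)
    # Mean product of z-scores = Pearson correlation, feature-wise.
    return (r * g).mean(axis=0)

# Hypothetical example: generated features track real ones with noise.
rng = np.random.default_rng(1)
real = rng.normal(size=(200, 10))
gen = real + 0.5 * rng.normal(size=(200, 10))
corr = feature_correlation(real, gen)
print(corr.round(2))
```

A correlation near 1 for a feature indicates that the generative model reproduces the ordering of molecules by that morphological property; correlations near 0 indicate the feature is not captured.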