Source Separation with Deep Generative Priors
Vivek Jayaram*  John Thickstun*

*Equal contribution. Paul G. Allen School of Computer Science and Engineering, University of Washington. Correspondence to: Vivek Jayaram <[email protected]>, John Thickstun <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract
Despite substantial progress in signal source separation, results for richly structured data continue to contain perceptible artifacts. In contrast, recent deep generative models can produce authentic samples in a variety of domains that are indistinguishable from samples of the data distribution. This paper introduces a Bayesian approach to source separation that uses generative models as priors over the components of a mixture of sources, and noise-annealed Langevin dynamics to sample from the posterior distribution of sources given a mixture. This decouples the source separation problem from generative modeling, enabling us to directly use cutting-edge generative models as priors. The method achieves state-of-the-art performance for MNIST digit separation. We introduce new methodology for evaluating separation quality on richer datasets, providing quantitative evaluation of separation results on CIFAR-10. We also provide qualitative results on LSUN.
1. Introduction
The single-channel source separation problem (Davies & James, 2007) asks us to decompose a mixed signal m ∈ X into a linear combination of k components x_1, ..., x_k ∈ X with scalar mixing coefficients α_i ∈ R:

m = g(x) ≡ Σ_{i=1}^{k} α_i x_i.   (1)

This is motivated by, for example, the "cocktail party problem" of isolating the utterances of individual speakers x_i from an audio mixture m captured at a busy party, where multiple speakers are talking simultaneously.
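As a concrete illustration of the mixing model in Equation (1), the following minimal NumPy sketch forms an equally weighted mixture of two sources; the arrays x1 and x2 are random stand-ins for images, not data from the paper.

import numpy as np

def mix(components, alphas):
    # m = sum_i alpha_i * x_i, the linear mixture of Equation (1)
    return sum(a * x for a, x in zip(alphas, components))

x1 = np.random.rand(32, 32, 3)  # stand-in for one source image
x2 = np.random.rand(32, 32, 3)  # stand-in for a second source image
m = mix([x1, x2], [0.5, 0.5])   # observed single-channel mixture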
With no further constraints or regularization, solving Equation (1) for x is highly underdetermined. Classical "blind" approaches to single-channel source separation resolve this ambiguity by privileging solutions to (1) that satisfy mathematical constraints on the components x, such as statistical independence (Davies & James, 2007), sparsity (Lee et al., 1999), or non-negativity (Lee & Seung, 1999). These constraints can be viewed as weak priors on the structure of sources, but the approaches are blind in the sense that they do not require adaptation to a particular dataset.

Recently, most works have taken a data-driven approach. To separate a mixture of sources, it is natural to suppose that we have access to samples x of individual sources, which can be used as a reference for what the source components of a mixture are supposed to look like. This data can be used to regularize solutions of Equation (1) towards structurally plausible solutions. The prevailing way to do this is to construct a supervised regression model that maps an input mixture m to components x_i (Huang et al., 2014; Halperin et al., 2019). Paired training data (m, x) can be constructed by summing randomly chosen samples from the component distributions x_i and labeling these mixtures with the ground truth components.

Instead of regressing against components x, we use samples to train a generative prior p(x); we separate a mixed signal m by sampling from the posterior distribution p(x | m). For some mixtures this posterior is quite peaked, and sampling from p(x | m) recovers the only plausible separation of m into likely components. But in many cases, mixtures are highly ambiguous: see, for example, the orange-highlighted MNIST images in Figure 1. This motivates our interest in sampling, which explores the space of plausible separations. In Section 3 we introduce a procedure for sampling from the posterior, an extension of the noise-annealed Langevin dynamics introduced in Song & Ermon (2019), which we call Bayesian Annealed SIgnal Source separation: "BASIS" separation.

Ambiguous mixtures pose a challenge for traditional source separation metrics, which presume that the original mixture components are identifiable and compare the separated components to ground truth. For ambiguous mixtures of rich data, we argue that recovery of the original mixture components is not a well-posed problem. Instead, the problem we aim to solve is finding components of a mixture that are consistent with a particular data distribution. Motivated by this perspective, we discuss evaluation metrics in Section 4.

Formulating the source separation problem in a Bayesian framework decouples the problem of source generation from source separation. This allows us to leverage pre-trained, state-of-the-art, likelihood-based generative models as prior distributions, without requiring architectural modifications to adapt these models for source separation. Examples of source separation using noise-conditioned score networks (NCSN) (Song & Ermon, 2019) as a prior are presented in Figure 1. Further separation results using NCSN and Glow (Kingma & Dhariwal, 2018) are presented in Section 5.

Figure 1.

Separation results for mixtures of four images from the MNIST dataset (Left) and two images from the CIFAR-10 dataset (Right), using BASIS with the NCSN (Song & Ermon, 2019) generative model as a prior over images. We draw attention to the central panel of the MNIST results (highlighted in orange), which shows how a mixture can be separated in multiple ways.
2. Related Work
Blind separation. Work on blind source separation is data-agnostic, relying on generic mathematical properties to privilege particular solutions to (1) (Comon, 1994; Bell & Sejnowski, 1995; Davies & James, 2007; Huang et al., 2012). Because blind methods have no access to sample components, they face the challenging task of modeling the distribution over unobserved components while simultaneously decomposing mixtures into likely components. It is difficult to fit a rich model to latent components, so blind methods often rely on simple models such as dictionaries to capture the structure of these components.

One promising recent work in the blind setting is Double-DIP (Gandelsman et al., 2019). This work leverages the unsupervised Deep Image Prior (Ulyanov et al., 2018) as a prior over signal components, similar to our use of a trained generative model. But the authors of this work document fundamental obstructions to applying their method to single-channel source separation; they propose using multiple image frames from a video, or multiple mixtures of the same components with different mixing coefficients α. This multiple-mixture approach is common to much of the work on blind separation. In contrast, our approach is able to separate components from a single mixture.

Supervised regression. Regression models for source separation learn to predict components for a mixture using a dataset of mixed signals labeled with ground truth components. This approach has been extensively studied for separation of images (Halperin et al., 2019), audio spectrograms (Huang et al., 2014; 2015; Nugraha et al., 2016; Jansson et al., 2017), and raw audio (Lluis et al., 2019; Stoller et al., 2018b; Défossez et al., 2019), as well as more exotic data domains, e.g. medical imaging (Nishida et al., 1999). By learning to predict components (or equivalently, masks on a mixture) this approach implicitly builds a generative model of the signal components. This connection is made more explicit in recent work that uses GANs to force components emitted by a regression model to match the distribution of a given dataset (Zhang et al., 2018; Stoller et al., 2018a).

The supervised approach takes advantage of expressive deep models to capture a strong prior over signal components. But it requires specialized model architectures trained specifically for the source separation task. In contrast, our approach leverages standard, pre-trained generative models for source separation. Furthermore, our approach can directly exploit ongoing advances in likelihood-based generative modeling to improve separation results.
Signal dictionaries. Much work on source separation is based on the concept of a signal dictionary, most notably the line of work based on non-negative matrix factorization (NMF) (Lee & Seung, 2001). These approaches model signals as combinations of elements in a latent dictionary. Decomposing a mixture into dictionary elements can be used for source separation by (1) clustering the elements of the dictionary and (2) reconstituting a source using elements of the decomposition associated with a particular cluster. Dictionaries are typically learned from data of each source type and combined into a joint dictionary, clustered by source type (Schmidt & Olsson, 2006; Virtanen, 2007). The blind setting has also been explored, where the clustering is obtained without labels by e.g. k-means (Spiertz & Gnann, 2009). Recent work explores more expressive decomposition models, replacing the linear decompositions used in NMF with expressive neural autoencoders (Smaragdis & Venkataramani, 2017; Venkataramani et al., 2017).

When the dictionary is learned with supervision from labeled sources, dictionary clusters can be interpreted as implicit priors on the distributions over components. Our approach makes these priors explicit, and works with generic priors that are not tied to the dictionary model. Furthermore, our method can separate mixed sources of the same type, whereas mixtures of sources with similar structure present a conceptual difficulty for dictionary-based methods.
Generative adversarial separation. Recent work by Subakan & Smaragdis (2018) and Kong et al. (2019) explores the intriguing possibility of optimizing x given a mixture m to satisfy (1), where components x_i are constrained to the manifold learned by a GAN. The GAN is pre-trained to model a distribution over components. Like our method, this approach leverages modern deep generative models in a way that decouples generation from source separation. We view this work as a natural analog to our likelihood-based approach in the GAN setting.

Likelihood-based approaches. Our approach is similar in spirit to older ideas based on maximum a posteriori estimation (Geman & Geman, 1984), likelihood maximization (Pearlmutter & Parra, 1997; Roweis, 2001), and Bayesian source separation (Benaroya et al., 2005). We build upon their insights, with the advantage of increased computational resources and modern expressive generative models.
3. BASIS Separation
We consider the following generative model of a mixed signal m, relaxing the mixture constraint g(x) = m to a soft Gaussian approximation:

x ∼ p,   (2)
m ∼ N(g(x), γ²I).   (3)

This defines a joint distribution p_γ(x, m) = p(x) p_γ(m | x) over signal components x and mixtures m, and a corresponding posterior distribution

p_γ(x | m) = p(x) p_γ(m | x) / p_γ(m).   (4)

In the limit as γ → 0, we recover the hard constraint on the mixture m given by Equation (1).

BASIS separation (Algorithm 1) presents an approach to sampling from (4) based on the discussion in Sections 3.1 and 3.2. In Section 3.3 we discuss the behavior of the gradients ∇_x log p(x), which motivates some of the hyper-parameter choices in Section 3.4. We describe a procedure to construct the noisy models p_{σ_i} required for BASIS in Section 3.5.

Algorithm 1 BASIS Separation

  Input: m ∈ X, {σ_i}_{i=1}^{L}, δ, T
  Sample x_1, ..., x_k ∼ Uniform(X)
  for i ← 1 to L do
    η_i ← δ · σ_i² / σ_L²
    for t ← 1 to T do
      Sample ε_t ∼ N(0, I)
      u^(t) ← x^(t) + η_i ∇_x log p_{σ_i}(x^(t)) + √(2η_i) ε_t
      x^(t+1) ← u^(t) − (η_i / σ_i²) Diag(α) (g(x^(t)) − m)
    end for
  end for

3.1. Langevin Dynamics

Sampling from the posterior distribution p_γ(x | m) looks formidable; just computing Equation (4) requires evaluation of the partition function p_γ(m). But using Langevin dynamics (Neal et al., 2011; Welling & Teh, 2011) we can sample x ∼ p_γ(· | m) while avoiding explicit computation of p_γ(x | m). Let x^(0) ∼ Uniform(X), ε_t ∼ N(0, I), and define a sequence

x^(t+1) ≡ x^(t) + η ∇_x log p_γ(x^(t) | m) + √(2η) ε_t   (5)
        = x^(t) + η ∇_x ( log p(x^(t)) − (1 / 2γ²) ‖m − g(x^(t))‖² ) + √(2η) ε_t.

Observe that ∇_x log p_γ(m) = 0, so this term is not required to compute (5). By standard analysis of Langevin dynamics, as the step size η → 0, lim_{t→∞} D_KL(x^(t) ‖ x | m) = 0 under regularity conditions on the distribution p_γ(x | m).

If the prior p(x) is parameterized by a neural model, then gradients ∇_x log p(x) can be computed by automatic differentiation with respect to the inputs of the generator network. This family of likelihood-based models includes autoregressive models (Salimans et al., 2017; Parmar et al., 2018), the variational autoencoder (Kingma & Welling, 2014; van den Oord et al., 2017), and flow-based models (Dinh et al., 2017; Kingma & Dhariwal, 2018). Alternatively, if gradients of the distribution are modeled directly (Song & Ermon, 2019), then ∇_x log p(x) can be used as-is.

3.2. Noise-Annealed Langevin Dynamics

To accelerate mixing of (5) we adopt a simulated annealing schedule over noisy approximations to the model p(x), extending the unconditional sampling algorithm proposed in Song & Ermon (2019) to accelerate sampling from the posterior distribution p_γ(x | m). Let p_σ(x) denote the distribution of x + ε_σ for x ∼ p and ε_σ ∼ N(0, σ²I). We define the noisy joint likelihood p_{σ,γ}(x, m) ≡ p_σ(x) p_γ(m | x), which induces a noisy posterior approximation p_{σ,γ}(x | m). At high noise levels σ, p_σ(x) is approximately Gaussian and irreducible, so the Langevin dynamics (5) will mix quickly. And as σ → 0, D_KL(p_σ ‖ p) → 0. This motivates defining the modified Langevin dynamics

x^(t+1) ≡ x^(t) + η ∇_x log p_{σ,γ}(x^(t) | m) + √(2η) ε_t.   (6)

The dynamics (6) approximate samples from p(x | g(x) = m) as η → 0, γ → 0, σ → 0, and t → ∞.
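The following is a minimal NumPy sketch of these annealed dynamics, written to mirror Algorithm 1. The callable score_fn, which returns ∇_x log p_σ(x) for a stack of k components, is a hypothetical stand-in for a pretrained noise-conditioned model such as NCSN; it is not the paper's released code, and the default hyper-parameters follow Section 3.4.

import numpy as np

def basis_separate(m, score_fn, alphas, sigmas, delta=2e-5, T=100, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    k = len(alphas)
    alphas_col = np.asarray(alphas).reshape((k,) + (1,) * m.ndim)
    x = rng.uniform(0.0, 1.0, size=(k,) + m.shape)    # x_1, ..., x_k ~ Uniform(X)
    for sigma in sigmas:                               # anneal sigma_1 > ... > sigma_L
        eta = delta * sigma**2 / sigmas[-1]**2         # eta_i = delta * sigma_i^2 / sigma_L^2
        for _ in range(T):
            eps = rng.standard_normal(x.shape)
            x = x + eta * score_fn(x, sigma) + np.sqrt(2 * eta) * eps   # prior step + noise
            residual = m - np.tensordot(alphas, x, axes=1)              # m - g(x)
            x = x + (eta / sigma**2) * alphas_col * residual            # mixture-constraint step
    return x

With sigmas = np.geomspace(1.0, 0.01, 10) and alphas = [0.5, 0.5], a call basis_separate(m, score_fn, alphas, sigmas) returns k candidate components whose weighted sum approximately reconstructs m.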
An implementation of these dynamics, annealing η, γ, and σ as t → ∞ according to the hyper-parameter settings presented in Section 3.4, is presented in Algorithm 1.

We anneal η, γ, and σ using a heuristic introduced in Song & Ermon (2019): the idea is to maintain a constant signal-to-noise ratio (SNR) between the expected size of the posterior log-likelihood gradient term η ∇_x log p_{σ,γ}(x | m) and the expected size of the Langevin noise √(2η) ε:

E_{x∼p_σ}[ ‖ η ∇_x log p_{σ,γ}(x | m) / √(2η) ‖² ]
  = (η/2) E_{x∼p_σ}[ ‖∇_x log p_γ(m | x) + ∇_x log p_σ(x)‖² ].   (7)

Assuming that gradients with respect to the likelihood and the prior are uncorrelated, the SNR is approximately

(η/2) E_{x∼p_σ}[ ‖∇_x log p_γ(m | x)‖² ] + (η/2) E_{x∼p_σ}[ ‖∇_x log p_σ(x)‖² ].   (8)

Observe that log p_γ(m | x) is a concave quadratic with smoothness proportional to 1/γ²; it follows analytically that E[ ‖∇_x log p_γ(m | x)‖² ] ∝ 1/γ². Song & Ermon (2019) found empirically that E[ ‖∇_x log p_σ(x)‖² ] ∝ 1/σ² for the NCSN model; we observe similar behavior for the flow-based Glow model (Kingma & Dhariwal, 2018), and in Section 3.3 we propose a possible explanation for this behavior. Therefore, to maintain a constant SNR, it suffices to set both γ² and σ² proportional to η.

Figure 2.
The behavior of σ · E‖∇_x log p_σ(x)‖ for the NCSN (orange) and Glow (blue) models trained on CIFAR-10 at each of 10 noise levels, as σ decays geometrically from 1.0 to 0.01. For large σ, ‖∇_x log p_σ(x)‖ ≈ 1/σ. This proportional relationship breaks down for smaller σ. Because the expected gradient of the noiseless density log p(x) is finite, its product with σ must asymptotically approach zero as σ → 0.

3.3. Gradients of the Noisy Densities

We remark that the empirical finding E‖∇_x log p_σ(x)‖ ∝ 1/σ discussed in Section 3.2, and the consistency of this observation across models and datasets, could be surprising. Gradients of the noisy densities p_σ can be described by convolution of p with a Gaussian kernel:

∇_x log p_σ(x) = ∇_x log E_{ε∼N(0,I)}[ p(x − σε) ].   (9)

From this expression, assuming p is continuous, we clearly see that the gradients are asymptotically independent of σ:

lim_{σ→0} ∇_x log p_σ(x) = ∇_x log p(x).   (10)

Maintaining proportionality E‖∇_x log p_σ(x)‖ ∝ 1/σ requires the gradients to grow unbounded as σ → 0, but the gradients of the noiseless distribution log p(x) are finite. Therefore, proportionality must break down asymptotically and we conclude that, even though we turn the noise σ down to visually imperceptible levels, we have not reached the asymptotic regime.

We conjecture that the proportionality between the gradients and the noise is a consequence of severe non-smoothness in the noiseless model p(x). The probability mass of this distribution is peaked around plausible images x, and decays rapidly away from these points in most directions. Consider the extreme case where the prior has a Dirac delta point mass. The convolution of a Dirac delta with a Gaussian is itself Gaussian so, near the point mass, the noisy distribution p_σ will be proportional to a Gaussian density with variance σ². If p_σ were exactly Gaussian then analytically

E_{x∼p_σ}[ ‖∇_x log p_σ(x)‖ ] = (1/σ²) E_{x∼p_σ}[ ‖x‖ ] ∝ 1/σ.   (11)

Because the distribution p(x) does not contain actual delta spikes, only approximations thereof, we would expect this proportionality to eventually break down as σ → 0.
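A small numerical illustration of the Dirac-delta argument above, using an exactly Gaussian p_σ rather than a trained model: for a prior concentrated at a single point x_0, the smoothed density is N(x_0, σ²I), whose score is −(x − x_0)/σ², so σ · E‖∇_x log p_σ(x)‖ stays near √d independently of σ.

import numpy as np

rng = np.random.default_rng(0)
d = 32 * 32 * 3                    # dimensionality of a CIFAR-10 image
x0 = np.zeros(d)                   # location of the point mass
for sigma in [1.0, 0.1, 0.01]:
    x = x0 + sigma * rng.standard_normal((10000, d))   # samples from p_sigma
    score = -(x - x0) / sigma**2                        # exact score of N(x0, sigma^2 I)
    print(sigma, sigma * np.linalg.norm(score, axis=1).mean())  # ~ sqrt(d) for every sigma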
Figure 3.
Non-stochastic gradient ascent produces sub-par results. Annealing over smoothed-out distributions (Noise Conditioning) guides the optimization towards likely regions of pixel space, but gets stuck at sub-optimal solutions. Adding Gaussian noise to the gradients (Langevin dynamics) shakes the optimization trajectory out of bad local optima.
Indeed, Figure 2 shows that for both the NCSN and Glow models of CIFAR-10, after maintaining a very consistent proportionality E‖∇_x log p_σ(x)‖ ∝ 1/σ at the higher noise levels, the decay of σ to zero eventually outpaces the growth of the gradients.

3.4. Hyper-parameters

We adopt the hyper-parameters proposed by Song & Ermon (2019) for annealing σ, the proportionality constant δ, and the iteration count T. The noise σ is geometrically annealed from σ_1 = 1.0 to σ_L = 0.01 with L = 10. We set δ = 2 × 10⁻⁵ and T = 100. We find that the same proportionality constant between σ² and η also works well for γ and η, allowing us to set γ = σ. We use these hyper-parameters for both the NCSN and Glow models, applied to each of the three datasets MNIST, CIFAR-10, and LSUN.

3.5. Constructing the Noisy Models p_σ

For noise-conditioned score networks, we can directly compute ∇_x log p_σ(x) by evaluating the score network at the desired noise level. For generative flow models like Glow, these noisy distributions are not directly accessible. We could estimate the distributions p_σ(x) by training Glow from scratch on datasets perturbed by each of the required noise levels σ. But this is not practical; Glow is expensive to train, requiring thousands of epochs to converge and consuming hundreds of GPU-hours to obtain good models even for small, low-resolution datasets.

Instead of training models p_σ(x) from scratch, we apply the concept of fine-tuning from transfer learning (Yosinski et al., 2014). Using pre-trained models of p(x) published by the Glow authors, we fine-tune these models on noise-perturbed data x + ε, where ε ∼ N(0, σ²I). Empirically, this procedure quickly converges to an estimate of p_σ(x), within about 10 epochs.

We remark that adding Gaussian noise to the gradients in the BASIS algorithm is essential. If we set aside the Bayesian perspective, it is tempting to simply run gradient ascent on the pixels of the components to maximize the likelihood of these components under the prior, with a Lagrangian term to enforce the mixture constraint g(x) = m:

x ← x + η ∇_x [ log p(x) − λ ‖g(x) − m‖² ].   (12)

But this does not work. As demonstrated in Figure 3, there are many local optima in the loss surface of p(x) and a greedy ascent procedure simply gets stuck. Pragmatically, the noise term in Langevin dynamics can be seen as a way to knock the greedy optimization (12) out of local maxima.

In the recent literature, pixel-space optimizations following gradients ∇_x of some objective are perhaps associated more with adversarial examples than with desirable results (Goodfellow et al., 2015; Nguyen et al., 2015). We note that there have been some successes of pixel-wise optimization in texture synthesis (Gatys et al., 2015) and style transfer (Gatys et al., 2016). But broadly speaking, pixel-space optimization procedures often seem to go wrong. We speculate that noisy optimizations (6) on smoothed-out objectives like p_σ could be a widely applicable method for making pixel-space optimizations more robust.
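For concreteness, a short sketch of the annealing schedule from Section 3.4, assuming the hyper-parameters stated there; the printed values are the per-level noise σ_i, step size η_i, and mixture noise γ_i used by Algorithm 1.

import numpy as np

L, T, delta = 10, 100, 2e-5
sigmas = np.geomspace(1.0, 0.01, L)         # sigma_1 > ... > sigma_L, geometric decay
etas = delta * sigmas**2 / sigmas[-1]**2    # eta_i = delta * sigma_i^2 / sigma_L^2
gammas = sigmas                             # gamma_i = sigma_i
for i, (s, e, g) in enumerate(zip(sigmas, etas, gammas), start=1):
    print(f"level {i}: sigma={s:.4f}  eta={e:.2e}  gamma={g:.4f}  steps={T}")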
4. Evaluation Methodology
Many previous works on source separation evaluate their results using peak signal-to-noise ratio (PSNR) or structural similarity index (SSIM) (Wang et al., 2004). These metrics assume that the original sources are identifiable; in probabilistic terms, the true posterior distribution p(x | m) is presumed to have a unique global maximum achieved by the ground truth sources (up to permutation of the sources). Under the identifiability assumption, it is reasonable to measure the quality of a separation algorithm by comparing separated sources to ground truth mixture components. PSNR, for example, evaluates separations by computing the mean squared distance between pixel values of the ground truth and separated sources on a logarithmic scale.

For CIFAR-10 source separation, the ground truth source components of a mixture are not identifiable. As evidence for this claim, we call the reader's attention to Figure 4. For each mixture depicted in Figure 4, we present separation results that sum to the mixture and (to our eyes) look plausibly like CIFAR-10 images. However, in each case the separated images exhibit high deviation from the ground truth. This phenomenon is not unusual; Figure 5 shows an un-curated collection of samples from p(x | m) using BASIS, illustrating a variety of plausible separation results for each given mixture. We will later see evidence of non-identifiability again in Figure 7. If we accept that the separations presented in Figures 4, 5, and 7 are reasonable, then source separation on this dataset is fundamentally underdetermined; we cannot measure success using metrics like PSNR that compare separation results to ground truth.

Instead of comparing separations to ground truth, we propose to quantify the extent to which the results of a source separation algorithm look like samples from the data distribution. If a pair of images sum to the given mixture and look like samples from the data distribution, we deem the separation to be a success. This shift in perspective from identifiability of the latent components to the quality of the separated components is analogous to the classical distinction in the statistical literature between estimation and prediction (Shmueli et al., 2010; Bellec et al., 2018).

To this end, we borrow the Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) metrics from the generative modeling literature to evaluate CIFAR-10 separation results. These metrics attempt to quantify the similarity between two distributions given samples. We use them to compare the distribution of components produced by a separation algorithm to the distribution of ground truth images.

In contrast to CIFAR-10, the posterior distribution p(x | m) for an MNIST model is demonstrably peaked. Moreover, BASIS is able to consistently identify these peaks. This constitutes a constructive proof that components of MNIST mixtures are identifiable, and therefore comparisons to the ground-truth components make sense. We report PSNR results for MNIST, which allows us to compare the results of BASIS to other recent work on MNIST image separation (Halperin et al., 2019; Kong et al., 2019).
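For reference, a minimal implementation of the PSNR metric discussed above, assuming pixel values scaled to [0, 1]; this is a generic definition, not code from the paper.

import numpy as np

def psnr(x, x_hat, max_val=1.0):
    # Mean squared distance between ground truth and separated source, on a log scale.
    mse = np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)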
Figure 4.
A curated collection of examples demonstrating color and structural ambiguities in CIFAR-10 mixtures. In each case, the original components differ substantially from the components separated by BASIS using NCSN as a prior. But in each case, the separation results also look like plausible CIFAR-10 images.
5. Experiments
We evaluate results of BASIS on three datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), and LSUN (Yu et al., 2015). For MNIST and CIFAR-10, we consider both NCSN (Song & Ermon, 2019) and Glow (Kingma & Dhariwal, 2018) models as priors, using pre-trained weights published by the authors of these models. For LSUN there is no pre-trained NCSN model, so we consider results only with Glow. For Glow, we fine-tune the weights of the pre-trained models to construct noisy models p_σ using the procedure described in Section 3.5. Code and instructions for reproducing these experiments are available online: https://github.com/jthickstun/basis-separation.

Baselines. On MNIST we compare to results reported for the GAN-based "S-D" method (Kong et al., 2019) and the fully supervised version of Neural Egg separation "NES" (Halperin et al., 2019). Results for MNIST are presented in Section 5.1. To the best of our knowledge there are no previously reported quantitative metrics for CIFAR-10 separation, so as a baseline we ran Neural Egg separation on CIFAR-10 using the authors' published code. CIFAR-10 results are presented in Section 5.2. We present additional qualitative results for LSUN in Section 5.3, which demonstrate that BASIS scales to larger images.

We also consider results for a simple baseline, "Average," that separates a mixture m into two 50% masks x_1 = x_2 = m/2. This is a surprisingly competitive baseline. Observe that if we had no prior information about the distribution of components, and we measure separation quality by PSNR, then by a symmetry argument setting x_1 = x_2 = m/2 is the optimal separation strategy in expectation.
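One way to make this symmetry argument concrete, stated for squared error (to which PSNR is monotonically related for a fixed image) and for equally weighted, independently drawn components: the estimator x̂_1(m) minimizing E‖x̂_1 − x_1‖² is the conditional mean E[x_1 | m], and since x_1 and x_2 are exchangeable given m = x_1 + x_2, we have E[x_1 | m] = E[x_2 | m] = (1/2) E[x_1 + x_2 | m] = m/2.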
Figure 5.
Repeated sampling using BASIS with NCSN as a prior for several mixtures of CIFAR-10 images. While most separations look reasonable, variation in color and lighting makes comparative metrics like PSNR unreliable. This challenges the notion that the ground truth components are identifiable.

In principle we would expect Average to perform very poorly under IS/FID, because these metrics purport to measure similarity of distributions and mixtures should have little or no support under the data distribution. But we find that IS and FID both assign reasonably good scores to Average, presumably because mixtures exhibit many features that are well supported by the data distribution. This speaks to well-known difficulties in evaluating generative models (Theis et al., 2016) and could explain the strength of "Average" as a baseline.

We remark that we cannot compare our algorithm to the separation-like task reported for CapsuleNets (Sabour et al., 2017). The segmentation task discussed in that work is similar to source separation, but the mixtures used for the segmentation task are constructed by applying a non-linear threshold to x_1 + x_2, in contrast to our linear function g. While extending the techniques of this paper to non-linear relationships between x and m is intriguing, we leave this to future work.

Class conditional separation. The Neural Egg separation algorithm is designed with the assumption that the components x_i are drawn from different distributions. For quantitative results on MNIST and CIFAR-10, we therefore consider two slightly different tasks. The first is class-agnostic, where we construct mixtures by summing randomly selected images from the test set. The second is class-conditional, where we partition the test set into two groupings: digits 0-4 and 5-9 for MNIST, animals and machines for CIFAR-10. The former task allows us to compare to S-D results on MNIST, and the latter task allows us to compare to Neural Egg separation on MNIST and CIFAR-10.

There are two different ways to apply a prior for class-conditional separation. First observe that, because x_1 and x_2 are chosen independently,

p(x) = p(x_1, x_2) = p(x_1) p(x_2).   (13)

In the class-agnostic setting, x_1 and x_2 are drawn from the same distribution (the empirical distribution of the test set), so it makes sense to use a single prior p_1 = p_2 = p. In the class-conditional setting, we could potentially use separate priors over components x_1 and x_2. For the MNIST and CIFAR-10 experiments in this paper, we use pre-trained models trained on the unconditional distribution of the training data for both the class-agnostic and class-conditional settings. It is possible that better results could be achieved in the class-conditional setting by re-training the models on class-conditional training data. For LSUN, the authors of Glow provide separate pre-trained models for the Church and Bedroom categories, so we are able to demonstrate class-conditional LSUN separations using distinct priors in Section 5.3.

Sample Likelihoods. Although we do not directly model the posterior likelihood p(x | m), we can compute the log-likelihood of the output samples x. The log-likelihood is a function of the artificial variance hyper-parameter γ, so it is more informative to look at the unweighted squared error ‖m − g(x)‖²; this quantity can be interpreted as a reconstruction error, and measures how well we approximate the hard mixture constraint.
Because we geometrically anneal the variance γ, by the end of optimization the mixture constraint is rigorously enforced; per-pixel reconstruction error is smaller than the quantization level of 8-bit color, resulting in pixel-perfect visual reconstructions.

For Glow, we can also compute the log-probability of samples under the prior. How do the probabilities of sources x_BASIS constructed by BASIS separation compare to the probabilities of data x_test taken directly from a dataset's test set? Because we anneal the noise to a fixed level σ_L > 0, we find it most informative to ask this question using the minimal-noise, fine-tuned prior p_{σ_L}(x). As seen in Table 1, the outputs of BASIS separation are generally comparable in log-likelihood to test set images; BASIS separation recovers sources deemed typical by the prior.
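A small sketch of the reconstruction-error check described above: compare the per-pixel error of a separation against the 8-bit quantization step (1/255 for pixel values in [0, 1]). The arrays m and x_sep are placeholders for an actual mixture and BASIS output, not data from the paper.

import numpy as np

def reconstruction_ok(m, x_sep, alphas, bits=8):
    g_x = np.tensordot(alphas, x_sep, axes=1)             # g(x) = sum_i alpha_i x_i
    per_pixel_error = np.abs(m - g_x)
    return per_pixel_error.max() < 1.0 / (2 ** bits - 1)  # below the 8-bit quantization level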
Table 1. The mean log-likelihood under the minimal-noise Glow prior p_{σ_L}(x) for the test set x_test, and for samples of 100 BASIS separations x_BASIS. The log-likelihood of each test set under the noiseless prior p(x_test) is reported for reference.

Dataset        p(x_test)   p_{σ_L}(x_test)   p_{σ_L}(x_BASIS)
MNIST          0.5         3.6               3.6
CIFAR-10       3.4         4.5               4.7
LSUN (bed)     2.4         4.2               4.4
LSUN (crh)     2.7         4.4               4.4
Figure 6.
The empirical distribution of PSNR for 5,000 class-agnostic MNIST digit separations using BASIS with the NCSN prior (see Table 2 for a comparison of the central tendencies of this and other separation methods).
5.1. MNIST

Quantitative results for MNIST image separation are reported in Table 2, and a panel of visual separation results is presented in Figure 1. For quantitative results, we report the mean PSNR over the separated components. The distribution of PSNR for class-agnostic MNIST separation is visualized in Figure 6. We observe that a large fraction of results exceed the mean PSNR of 29.5, which to our eyes is visually indistinguishable from ground truth.

A natural approach to improve separation performance is to sample multiple x ∼ p(· | m) for a given mixture m. A major advantage of models like Glow, which explicitly parameterize the prior p(x), is that we can approximate the maximum of the posterior distribution with the maximum over multiple samples. By construction, samples from BASIS approximately satisfy g(x) = m, so for the noiseless model we simply declare p(m | x) = 1 and therefore p(x | m) ∝ p(x). We demonstrate the effectiveness of resampling in Table 2 (Glow, 10x) by comparing the expected PSNR of x ∼ p(· | m) to the expected PSNR of arg max_i p(x^(i)) over 10 samples x^(1), ..., x^(10) ∼ p(· | m). Even moderate resampling dramatically improves separation performance. Unfortunately this approach cannot be applied to the otherwise superior NCSN model, which does not model explicit likelihoods p(x).
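A sketch of the resampling heuristic described above: draw several posterior samples and keep the one with the highest prior log-likelihood. Here basis_separate stands for the sampler sketched in Section 3 and log_prob for an explicit-likelihood prior such as Glow; both callables are assumptions of this sketch rather than the paper's code.

def separate_with_resampling(m, basis_separate, log_prob, n_samples=10):
    candidates = [basis_separate(m) for _ in range(n_samples)]
    # Every candidate approximately satisfies g(x) = m, so p(x | m) is proportional
    # to p(x) and ranking by the prior approximates ranking by the posterior.
    return max(candidates, key=lambda x: sum(log_prob(x_i) for x_i in x))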
Table 2. PSNR results for separating 6,000 pairs of equally mixed MNIST images. For class split results, one image comes from labels 0-4 and the other comes from labels 5-9. We compare to S-D (Kong et al., 2019), NES (Halperin et al., 2019), convolutional NMF (class split) (Halperin et al., 2019), and standard NMF (class agnostic) (Kong et al., 2019).

Algorithm            Class Split   Class Agnostic
Average              14.8          14.9
NMF                  16.0           9.4
S-D                  -             18.5
BASIS (Glow)         22.9          22.7
NES                  24.3          -
BASIS (Glow, 10x)    27.7          27.1
BASIS (NCSN)         29.5          29.3
Without any modification, we can apply BASIS to separate mixtures of k > 2 images. We contrast this with regression-based methods, which require re-training to target varying numbers of components. Figure 1 shows the results of BASIS using the NCSN prior applied to mixtures of four randomly selected images. For more mixture components, we observe that identifiability of ground truth sources begins to break down. This is illustrated by looking at the central item in each panel of Figure 1 (highlighted in orange).

5.2. CIFAR-10

Quantitative results for CIFAR-10 image separation are presented in Table 3, and visual separation results are presented in Figure 1.

We can also view image colorization (Levin et al., 2004; Zhang et al., 2016) as a source separation problem by interpreting a grayscale image as a mixture of the three color channels of an image x = (x_r, x_g, x_b) with

g(x) = (x_r + x_g + x_b) / 3.   (14)

Unlike our previous separation problems, the channels of an image are clearly not independent, and the factorization of p given by Equation (13) is unwarranted. But conveniently, a generative model trained on color CIFAR-10 images itself models the joint distribution p(x) = p(x_r, x_g, x_b). Therefore, the same pre-trained generative model that we use to separate images can also be used to color them.

Qualitative colorization results are visualized in Figure 7. The non-identifiability of ground truth is profound for this task (see Section 4 for discussion of identifiability). We draw attention to the two cars in the middle of the panel: the white car that is colored yellow by the algorithm, and the blue car that is colored red. The colors of these specific cars cannot be inferred from a greyscale image; the best an algorithm can do is impute plausible colors.
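A brief sketch of the colorization mixing operator in Equation (14): the grayscale image plays the role of the mixture m and the three color channels play the role of the sources, so the BASIS sampler itself is unchanged.

import numpy as np

def g_colorize(x_rgb):
    # x_rgb has shape [3, H, W]; the grayscale "mixture" is (x_r + x_g + x_b) / 3,
    # i.e., Equation (1) with alphas = (1/3, 1/3, 1/3).
    return x_rgb.mean(axis=0)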
Table 3. Inception Score / FID Score of 25,000 separations (50,000 separated images) of two overlapping CIFAR-10 images using NCSN as a prior. In Class Split, one image comes from the category of animals and the other from the category of vehicles. NES results were obtained using published code from Halperin et al. (2019).
Algorithm           Inception Score   FID
Class Split
NES                 5.29 ± …          …
BASIS (NCSN)        7.83 ± …          …
Class Agnostic
BASIS (Glow)        6.10 ± …          …
BASIS (NCSN)        8.29 ± …          …
Figure 7.
Colorizing CIFAR-10 images. Left: original CIFAR-10 images. Middle: greyscale conversions of the images on the left. Right: imputed colors for the greyscale images, found by BASIS using NCSN as a prior.
Table 4.
Inception Score / FID Score of 50,000 colorized CIFAR-10 images. As measured by IS/FID, the quality of NCSN colorizations nearly matches CIFAR-10 itself.
Data Distribution    Inception Score   FID Score
Input Grayscale      8.01 ± …          …
BASIS (NCSN)         10.53 ± …         …
CIFAR-10 Original    11.24 ± …         …

5.3. LSUN

Qualitative results for LSUN separations are visualized in Figure 8. While the separation results in Figure 8 are imperfect, Table 1 shows that the mean log-likelihood of the separated components is comparable to the mean log-likelihood that the model assigns to images in the test set. This suggests that the model is incapable of distinguishing these separations from better ones: the imperfections are attributable to the quality of the Glow model rather than to the BASIS separation algorithm. This is encouraging, because it means that better separation results should be achievable with improved generative models.
Figure 8. LSUN separation results using Glow as a prior. One mixture component is sampled from the LSUN churches category, and the other component is sampled from LSUN bedrooms.
6. Conclusion
In this paper, we introduced a new approach to source separation that makes use of a likelihood-based generative model as a prior. We demonstrated the ability to swap in different generative models for this purpose, presenting results of our algorithm using both NCSN and Glow. We proposed new methodology for evaluating source separation on richer datasets, demonstrating strong performance on MNIST and CIFAR-10. Finally, we presented qualitative results on LSUN that point the way towards scaling this method to practical tasks such as speech separation, using generative audio models like WaveNets (Oord et al., 2016).
Acknowledgements
We thank Zaid Harchaoui, Sham M. Kakade, Steven Seitz, and Ira Kemelmacher-Shlizerman for valuable discussion and computing resources. This work was supported by the National Science Foundation Grant DGE-1256082.
References
Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

Bellec, P. C., Lecué, G., Tsybakov, A. B., et al. Slope meets lasso: improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603–3642, 2018.

Benaroya, L., Bimbot, F., and Gribonval, R. Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):191–199, 2005.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

Davies, M. E. and James, C. J. Source separation using single channel ICA. Signal Processing, 87(8):1819–1832, 2007.

Défossez, A., Usunier, N., Bottou, L., and Bach, F. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. International Conference on Learning Representations, 2017.

Gandelsman, Y., Shocher, A., and Irani, M. "Double-DIP": Unsupervised image decomposition via coupled deep-image-priors. In The IEEE Conference on Computer Vision and Pattern Recognition, volume 6, pp. 2, 2019.

Gatys, L., Ecker, A. S., and Bethge, M. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 262–270, 2015.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Conference on Computer Vision and Pattern Recognition, pp. 2414–2423, 2016.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.

Halperin, T., Ephrat, A., and Hoshen, Y. Neural separation of observed and unobserved distributions. Advances in Neural Information Processing Systems, 2019.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. Singing-voice separation from monaural recordings using robust principal component analysis. In International Conference on Acoustics, Speech and Signal Processing, pp. 57–60. IEEE, 2012.

Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Singing-voice separation from monaural recordings using deep recurrent neural networks. In International Symposium on Music Information Retrieval, pp. 477–482, 2014.

Huang, P.-S., Kim, M., Hasegawa-Johnson, M., and Smaragdis, P. Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(12):2136–2147, 2015.

Jansson, A., Humphrey, E., Montecchio, N., Bittner, R., Kumar, A., and Weyde, T. Singing voice separation with deep U-Net convolutional networks. 2017.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Kong, Q., Xu, Y., Jackson, P. J. B., Wang, W., and Plumbley, M. D. Single-channel signal separation and deconvolution with generative adversarial networks. In International Joint Conference on Artificial Intelligence, 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.

Lee, D. D. and Seung, H. S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pp. 556–562, 2001.

Lee, T.-W., Lewicki, M. S., Girolami, M., and Sejnowski, T. J. Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6(4):87–90, 1999.

Levin, A., Lischinski, D., and Weiss, Y. Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pp. 689–694, 2004.

Lluis, F., Pons, J., and Serra, X. End-to-end music source separation: is it possible in the waveform domain? Interspeech, 2019.

Neal, R. M. et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.

Nishida, S., Nakamura, M., Ikeda, A., and Shibasaki, H. Signal separation of background EEG and spike by using morphological filter. Medical Engineering & Physics, 21(9):601–608, 1999.

Nugraha, A. A., Liutkus, A., and Vincent, E. Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(9):1652–1664, 2016.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, Ł., Shazeer, N., Ku, A., and Tran, D. Image Transformer. International Conference on Machine Learning, 2018.

Pearlmutter, B. A. and Parra, L. C. Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In Advances in Neural Information Processing Systems, pp. 613–619, 1997.

Roweis, S. T. One microphone source separation. In Advances in Neural Information Processing Systems, pp. 793–799, 2001.

Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. International Conference on Learning Representations, 2017.

Schmidt, M. N. and Olsson, R. K. Single-channel speech separation using sparse non-negative matrix factorization. In International Conference on Spoken Language Processing, 2006.

Shmueli, G. et al. To explain or to predict? Statistical Science, 25(3):289–310, 2010.

Smaragdis, P. and Venkataramani, S. A neural network alternative to non-negative audio models. In International Conference on Acoustics, Speech and Signal Processing, pp. 86–90. IEEE, 2017.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907, 2019.

Spiertz, M. and Gnann, V. Source-filter based clustering for monaural blind source separation. In International Conference on Digital Audio Effects, 2009.

Stoller, D., Ewert, S., and Dixon, S. Adversarial semi-supervised audio source separation applied to singing voice extraction. In International Conference on Acoustics, Speech and Signal Processing, pp. 2391–2395. IEEE, 2018a.

Stoller, D., Ewert, S., and Dixon, S. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. International Symposium on Music Information Retrieval, 2018b.

Subakan, Y. C. and Smaragdis, P. Generative adversarial source separation. In International Conference on Acoustics, Speech and Signal Processing, pp. 26–30. IEEE, 2018.

Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. International Conference on Learning Representations, 2016.

Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image prior. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454, 2018.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.

Venkataramani, S., Subakan, C., and Smaragdis, P. Neural network alternatives to convolutive audio models for source separation. In International Workshop on Machine Learning for Signal Processing, pp. 1–6. IEEE, 2017.

Virtanen, T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):1066–1074, 2007.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.

Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.

Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In European Conference on Computer Vision, pp. 649–666. Springer, 2016.

Zhang, X., Ng, R., and Chen, Q. Single image reflection separation with perceptual losses. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4786–4794, 2018.
A. Experimental Details
A.1. Fine-tuning
We fine-tuned the MNIST, CIFAR-10, and LSUN Glow models at 10 noise levels σ (see Section 3.4) for 50 epochs each on clusters of four 1080Ti GPUs. This procedure converges rapidly, with no further decrease of the negative log-likelihood after the first 10 epochs. Although Glow models theoretically have full support, the noiseless pre-trained models assign vanishing probability to highly noisy images. In practice, this can cause invertibility assertion failures when fine-tuning directly from the noiseless model. To avoid this we took an iterative approach: first fine-tune the lowest noise level σ = 0.01 from the noiseless model, then fine-tune each successive noise level from the model at the previous level.
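A minimal PyTorch-style sketch of the fine-tuning loop described above: maximize the likelihood of noise-perturbed data at a fixed level σ, starting from a pretrained model. The glow object (assumed to expose a log_prob method), the data loader, and the optimizer settings are illustrative assumptions, not the exact configuration used for the paper's experiments.

import torch

def finetune_noisy_prior(glow, train_loader, sigma, epochs=10, lr=1e-5, device="cuda"):
    glow = glow.to(device).train()
    opt = torch.optim.Adam(glow.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in train_loader:                        # batches of (image, label)
            x = x.to(device)
            x_noisy = x + sigma * torch.randn_like(x)    # samples from p_sigma
            loss = -glow.log_prob(x_noisy).mean()        # negative log-likelihood of noisy data
            opt.zero_grad()
            loss.backward()
            opt.step()
    return glow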
A.2. Scaling and Resources

Scaling Algorithm 1 to richer datasets is constrained primarily by the limited availability of strong, likelihood-based generative models for these datasets. For high-resolution images, the running time of Algorithm 1 can also become substantial. Assuming the hyper-parameters T and L discussed in Section 3.4 remain valid at higher resolutions, the computational complexity of BASIS scales linearly with the cost of evaluating gradients of the model (albeit with a large multiplicative constant T × L). Therefore, if a generative model is tractable to train, then it should also be tractable to use for BASIS separation.

In concrete detail, we observe that a batch of BASIS separation results for MNIST or CIFAR-10 using NCSN takes only a few minutes on a single 1080Ti GPU. Running BASIS with Glow is much slower. We observe that substantial time is spent loading and unloading the noisy models p_σ from memory (in contrast to NCSN, which uses a single noise-conditioned model). A batch of BASIS separation results on MNIST or CIFAR-10 using Glow takes about 30 minutes on a 1080Ti. A batch of BASIS separation results on LSUN using Glow takes 2-3 hours on a 1080Ti.
A.3. Visual Comparisons
When using class-agnostic priors, BASIS separation is symmetric in its output components. To facilitate visual comparisons between original images and separated components, we sort the BASIS-separated components to maximize PSNR with respect to the original images. This usually results in the separated components being visually paired with the most similar original components. But due to the deficiencies of PSNR as a comparative metric this is not always the case; the alert reader may have noticed that the yellow and silver car mixture in Figure 1 appears to have been displayed in reverse order. This happens because the separated yellow car component takes the light sky from the original silver car component, and the lightness of the sky dominates the PSNR metric.

For the LSUN separation results, where we use a church model for the first component and a bedroom model for the second, the symmetry is broken. For these results, components naturally sort themselves into church and bedroom components, which can be compared directly to the original images.
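A sketch of this display convention: choose the permutation of separated components that maximizes total PSNR against the originals (for k = 2 this is just a comparison of the two possible pairings). The psnr function is assumed to be the generic one sketched in Section 4.

from itertools import permutations

def best_pairing(originals, separated, psnr):
    def total(perm):
        # total PSNR if separated[perm[i]] is displayed next to originals[i]
        return sum(psnr(o, separated[j]) for o, j in zip(originals, perm))
    return max(permutations(range(len(separated))), key=total)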
B. Intermediate Samples During the Annealing Process
Figure 9.

Intermediate CIFAR-10 separation results (mixture, separated components 1 and 2, and originals) taken at noise levels σ = 1.0, 0.6, 0.36, 0.21, and 0.01 during the annealing process of BASIS separation.

C. MNIST Separation Results Under Different Models and Sampling Procedures
Figure 10.
Uncurated class-agnostic separation results using: (1) samples from the posterior with Glow as a prior; (2) an approximate MAP estimate using the maximum over 10 samples from the posterior with Glow as a prior; (3) samples from the posterior with NCSN as a prior.
D. Extended CIFAR-10 Separation Results
D.1. NCSN Prior
Figure 11.
Uncurated class-agnostic CIFAR-10 separation results using NCSN as a prior.
D.2. Glow Prior
Figure 12.
Uncurated class-agnostic CIFAR-10 separation results using Glow as a prior.
E. Extended CIFAR-10 Colorization Results
E.1. NCSN Prior
Figure 13.
Uncurated CIFAR-10 colorization results using NCSN as a prior.
E.2. Glow Prior
Figure 14.
Uncurated CIFAR-10 colorization results using Glow as a prior.
F. Extended LSUN Separation Results