Deep Probabilistic Programming Languages: A Qualitative Study
Guillaume Baudart
Martin Hirzel
Louis Mandel
ABSTRACT
Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses.
CCS CONCEPTS
• Theory of computation → Probabilistic computation; • Computing methodologies → Neural networks; • Software and its engineering → Domain specific languages;
KEYWORDS
DL, PPL, DSL
A deep probabilistic programming language (PPL) is a language for specifying both deep neural networks and probabilistic models. In other words, a deep PPL draws upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful machine-learning applications.

For decades, scientists have developed probabilistic models in various fields of exploration without the benefit of either dedicated programming languages or deep neural networks [12]. But since these models involve Bayesian inference with often intractable integrals, they sap the productivity of experts and are beyond the reach of non-experts. PPLs address this issue by letting users express a probabilistic model as a program [15]. The program specifies how to generate output data by sampling latent probability distributions. The compiler checks this program for type errors and translates it to a form suitable for an inference procedure, which uses observed output data to fit the latent distributions. Probabilistic models show great promise: they overtly represent uncertainty [6] and they have been demonstrated to enable explainable machine learning even in the important but difficult case of small training data [21, 26, 30].

Over the last few years, machine learning with deep neural networks (deep learning, DL) has become enormously popular. This is because in several domains, DL solves what was previously a vexing problem [10], namely manual feature engineering. Each layer of a neural network can be viewed as learning increasingly higher-level features. In other words, the essence of DL is automatic hierarchical representation learning [22]. Hence, DL powered recent breakthrough results in accurate supervised large-data tasks such as image recognition [20] and natural language translation [33]. Today, most DL is based on frameworks that are well-supported, efficient, and expressive, such as TensorFlow [1] and PyTorch [11].
These frameworks provide automatic differentiation (users need not manually calculate gradients for gradient descent), GPU support (to efficiently execute vectorized computations), and Python-based embedded domain-specific languages [18].

Deep PPLs, which have emerged just recently [29–32], aim to combine the benefits of PPLs and DL. Ideally, programs in deep PPLs would overtly represent uncertainty, yield explainable models, and require only a small amount of training data; be easy to write in a well-designed programming language; and match the breakthrough accuracy and fast training times of DL. Realizing all of these promises would yield tremendous advantages. Unfortunately, this is hard to achieve. Some of the strengths of PPLs and DL are seemingly at odds, such as explainability vs. automated feature engineering, or learning from small data vs. optimizing for large data. Furthermore, the barrier to entry for work in deep PPLs is high, since it requires non-trivial background in fields as diverse as statistics, programming languages, and deep learning. To tackle this problem, this paper characterizes deep PPLs, thus lowering the barrier to entry, providing a programming-languages perspective early when it can make a difference, and shining a light on gaps that the community should try to address.

This paper uses the Stan PPL as a representative of the state of the art in regular (not deep) PPLs [9]. Stan is a main-stream, mature, and widely-used PPL: it is maintained by a large group of developers, has a yearly StanCon conference, and has an active forum. Stan is Turing complete and has its own stand-alone syntax and semantics, but provides bindings for several languages including Python.

Most importantly, this paper uses Edward [31] and Pyro [32] as representatives of the state of the art in deep PPLs. Edward is based on TensorFlow and Pyro is based on PyTorch. Edward was first released in mid-2016 and has a single main maintainer, who is focusing on a new version.
Pyro is a much newer framework (released late 2017), but seems to be very responsive to community questions.

This paper characterizes deep PPLs by explaining them (Sections 2, 3, and 4), comparing them to each other and to regular PPLs and DL frameworks (Section 5), and envisioning next steps (Section 6). Additionally, the paper serves as a comparative tutorial to both Edward and Pyro. To this end, it presents examples of increasing complexity written in both languages, using deliberately uniform terminology and presentation style. By writing this paper, we hope to help the research community contribute to the exciting new field of deep PPLs, and ultimately, combine the strengths of both DL and PPLs.
This section explains PPLs using an example that is probabilistic but not deep. The example, adapted from Section 9.1 of [3], is about learning the bias of a coin. We picked this example because it is simple, lets us introduce basic concepts, and shows how different PPLs represent these concepts.

Figure 1: Graphical model for biased coin tosses. Circles represent random variables. The white circle for θ indicates that it is latent and the gray circle for x_i indicates that it is observed. The arrow represents dependency. The rounded rectangle is a plate, representing N distributions that are IID.

We write x_i = 1 if the i-th coin toss comes up heads and x_i = 0 otherwise. Each x_i follows a Bernoulli distribution with parameter θ: p(x_i = 1 | θ) = θ and p(x_i = 0 | θ) = 1 − θ. The latent (i.e., unobserved) variable θ is the bias of the coin. The task is to infer θ given the results of previously observed coin tosses, that is, p(θ | x_1, x_2, ..., x_N). Figure 1 shows the corresponding graphical model. The model is generative: once the distribution of the latent variable has been inferred, one can draw samples to generate data points similar to the observed data.

We now present this simple example in Stan, Edward, and Pyro. In all these languages we follow a Bayesian approach: the programmer first defines a probabilistic model of the problem. Assumptions are encoded with prior distributions over the variables of the model. Then the programmer launches an inference procedure to automatically compute the posterior distributions of the parameters of the model based on observed data. In other words, inference adjusts the prior distribution using the observed data to give a more precise model.
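For this particular model, the posterior that inference must produce can even be written down exactly. The following sketch is ours (it is not one of the paper's figures) and uses only the standard Beta-Bernoulli conjugacy: with a uniform Beta(1, 1) prior on θ, the exact posterior after observing the tosses is Beta(1 + heads, 1 + tails).

```python
# Exact posterior for the biased-coin model via Beta-Bernoulli conjugacy.
# A uniform prior on theta is Beta(1, 1); after observing the tosses,
# the posterior is Beta(1 + heads, 1 + tails) -- no sampling needed.
def coin_posterior(tosses):
    heads = sum(tosses)
    tails = len(tosses) - heads
    a, b = 1 + heads, 1 + tails              # posterior Beta(a, b)
    mean = a / (a + b)                       # posterior mean of theta
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, var

# Ten hypothetical tosses with two heads.
a, b, mean, var = coin_posterior([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
print(a, b, round(mean, 3))  # -> 3 9 0.25
```

The PPLs below approximate this same posterior by sampling or optimization, which is what makes them applicable to models where no closed form exists.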
Compared to other machine-learning models such as deep neural networks, the result of a probabilistic program is a probability distribution, which allows the programmer to explicitly visualize and manipulate the uncertainty associated with a result. This overt uncertainty is an advantage of PPLs. Furthermore, a probabilistic model has the advantage that it directly describes the corresponding world based on the programmer's knowledge. Such descriptive models are more explainable than deep neural networks, whose representation is big and does not overtly resemble the world they model.

Figure 2(a) solves the biased-coin task in Stan, a well-established PPL [9]. This example uses PyStan, the Python interface for Stan. Lines 2-13 contain code in Stan syntax in a multi-line Python string. The data block introduces observed random variables, which are placeholders for concrete data to be provided as input to the inference procedure, whereas the parameters block introduces latent random variables, which will not be provided and must be inferred. Line 4 declares x as a vector of ten discrete random variables, constrained to only take on values from a finite set, in this case, either 0 or 1. Line 7 declares θ as a continuous random variable, which can take on values from an infinite set, in this case, real numbers between 0 and 1. Stan uses the tilde operator (~) for sampling. Line 10 samples θ from a uniform distribution (same probability for all values) between 0 and 1. Since θ is a latent variable, this distribution is merely a prior belief, which inference will replace by a posterior distribution. Line 12 samples the x_i from a Bernoulli distribution with parameter θ. Since the x_i are observed variables, this sampling is really a check for how well the model corresponds to data provided at inference time. One can think of sampling an observed variable like an assertion in verification [15]. Line 15 specifies the data and Lines 17-18 run the inference using the model and the data.
By default, Stan uses a form of Monte-Carlo sampling for inference [9]. Lines 20-22 extract and print the mean and standard deviation of the posterior distribution of θ.

Figure 2(b) solves the same task in Edward [31]. Line 2 samples θ from the prior distribution, and Line 3 samples a vector of random variables from a Bernoulli distribution with parameter θ, one for each coin toss. Line 5 specifies the data. Lines 7-8 define a placeholder that will be used by the inference procedure to compute the posterior distribution of θ. The shape and size of the placeholder depend on the inference procedure. Here we use Hamiltonian Monte-Carlo (HMC) inference; the posterior distribution is thus computed based on a set of random samples and follows an empirical distribution. The size of the placeholder corresponds to the number of random samples computed during inference. Lines 9-11 launch the inference. The inference takes as parameter the prior:posterior pair theta:qtheta and links the data to the variable x. Lines 13-16 extract and print the mean and standard deviation of the posterior distribution of θ.

Figure 2(c) solves the same task in Pyro [32]. Lines 2-7 define the model as a function coin. Lines 3-5 sample θ from the prior distribution, and Lines 6-7 sample a vector of random variables x from a Bernoulli distribution with parameter θ. Pyro stores random variables in a dictionary keyed by the first argument of the function pyro.sample. Lines 9-10 define the data as a dictionary. Line 12 conditions the model using the data by matching the values of the data dictionary with the random variables defined in the model. Lines 13-15 apply inference to this conditioned model, using importance sampling. Compared to Stan and Edward, we first define a conditioned model with the observed data before running the inference, instead of passing the data as an argument to the inference.
The inference returns a probabilistic model, post, that can be sampled to extract the mean and standard deviation of the posterior distribution of θ in Lines 17-19.

The deep PPLs Edward and Pyro are built on top of two popular deep learning frameworks: TensorFlow [1] and PyTorch [11]. They benefit from efficient computations over large datasets, automatic differentiation, and optimizers provided by these frameworks that can be used to efficiently implement inference procedures. As we will see in the next sections, this design choice also reduces the gap between DL and probabilistic models, allowing the programmer to combine the two. On the other hand, this choice leads to piling up abstractions (Edward/TensorFlow/NumPy/Python or Pyro/PyTorch/NumPy/Python) that can complicate the code. We defer a discussion of these towers of abstraction to Section 5.

Variational Inference
Inference, for Bayesian models, computes the posterior distribution of the latent parameters θ given a set of observations x, that is, p(θ | x). For complex models, computing the exact posterior distribution can be costly or even intractable. Variational inference turns the inference problem into an optimization problem, which tends to scale better to large models and datasets.
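To make the optimization view concrete, here is a self-contained numeric sketch (ours, with hypothetical data; real deep PPLs maximize the same objective with stochastic gradients rather than grid search). For the biased coin with a Beta(a, b) guide q, the objective is the ELBO, E_q[log p(x, θ) − log q(θ)], which equals log p(x) − KL(q(θ) || p(θ | x)); the best guide is therefore the one closest to the true posterior.

```python
import math

def log_beta_pdf(theta, a, b):
    # Log density of Beta(a, b) at theta.
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta) - log_norm

def elbo(a, b, heads, tails, n_grid=2000):
    # E_q[log p(x, theta) - log q(theta)] by simple quadrature over theta.
    # The prior is uniform on (0, 1), so log p(x, theta) = log p(x | theta).
    total, step = 0.0, 1.0 / (n_grid + 1)
    for i in range(1, n_grid + 1):
        t = i * step
        log_lik = heads * math.log(t) + tails * math.log(1 - t)
        log_q = log_beta_pdf(t, a, b)
        total += math.exp(log_q) * (log_lik - log_q) * step
    return total

# Hypothetical data: 2 heads in 10 tosses; candidate guide parameters.
heads, tails = 2, 8
candidates = [(1, 1), (2, 6), (3, 9), (4, 12), (6, 18)]
best = max(candidates, key=lambda ab: elbo(ab[0], ab[1], heads, tails))
print(best)  # the exact posterior Beta(1 + heads, 1 + tails) = Beta(3, 9) wins
```

Because the Beta family contains the exact posterior here, the ELBO is maximized when KL is zero; in general the guide family only approximates the posterior, and the optimizer finds the closest member.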
Figure 4: Multi-layer perceptron (MLP) for classifying images. Circles and squares are probabilistic and non-probabilistic variables. Black rectangles are pure functions. Arrows represent dependencies and forward data flow.
This section explains DL using an example of a deep neural network and shows how to make that probabilistic. The task is multiclass classification: given input features x, e.g., an image of a handwritten digit [23] comprising n_x = 28 · 28 pixels, predict a label l, e.g., one of n_y = 10 digits. Before we explain how to solve this task using DL, let us clarify some terminology. In cases where DL uses different terminology from PPLs, this paper favors the PPL terminology. So we say inference for what DL calls training; predicting for what DL calls inferencing; and observed variables for what DL calls training data. For instance, the observed variables for inference in the classifier task are the image x and label l.

Among the many neural network architectures suitable for this task, we chose a simple one: a multi-layer perceptron (MLP [28]). We start with the non-probabilistic version. Figure 4(a) shows an MLP with a 2-feature input layer, a 3-feature hidden layer, and a 2-feature output layer; of course, this generalizes to wider (more features) and deeper (more layers) MLPs. From left to right, there is a fully-connected subnet where each input feature x_i contributes to each hidden feature h_j, multiplied with a weight w_ji and offset with a bias b_j. The weights and biases are latent variables. Treating the input, biases, and weights as vectors and a matrix lets us cast this subnet in linear algebra as xW_h + b_h, which can run efficiently via vector instructions on CPUs or GPUs. Next, a rectified linear unit ReLU(z) = max(0, z) computes the hidden feature vector h. The ReLU lets the MLP discriminate input spaces that are not linearly separable. The hidden layer exhibits both the advantage and disadvantage of deep learning: automatically learned features that need not be hand-engineered but would require non-trivial reverse engineering to explain in real-world terms. Next, another fully-connected subnet hW_y + b_y computes the output layer y. Then, the softmax computes a vector π of scores that add up to one. The higher the value of π_l, the more strongly the classifier predicts label l.
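The forward computation just described can be sketched in plain Python (our sketch, with made-up weights and the deliberately tiny 2-3-2 shape of Figure 4(a); real frameworks express the same thing with vectorized matrix operations):

```python
import math

def relu(z):
    return [max(0.0, v) for v in z]

def linear(x, W, b):
    # Fully-connected layer: output j is sum_i x[i] * W[j][i] + b[j].
    return [sum(xi * wji for xi, wji in zip(x, row)) + bj
            for row, bj in zip(W, b)]

def softmax(y):
    m = max(y)                        # subtract max for numeric stability
    exps = [math.exp(v - m) for v in y]
    s = sum(exps)
    return [e / s for e in exps]

# Tiny MLP: 2 inputs, 3 hidden features, 2 outputs (weights are arbitrary).
Wh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]; bh = [0.0, 0.1, -0.1]
Wy = [[1.0, -1.0, 0.5], [-0.5, 0.2, 0.3]];  by = [0.0, 0.0]

def mlp(x):
    h = relu(linear(x, Wh, bh))       # hidden feature vector h
    pi = softmax(linear(h, Wy, by))   # scores that add up to one
    return pi

pi = mlp([1.0, 2.0])
label = max(range(len(pi)), key=lambda i: pi[i])  # argmax prediction
print(label)  # -> 1
```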
Using the output of the MLP, the argmax extracts the label l with the highest score.

Traditional methods to train such a neural network incrementally update the latent parameters of the network to optimize a loss function via gradient descent [28]. In the case of hand-written digits, the loss function is a distance metric between observed and predicted labels. The variables and computations are non-probabilistic, in that they use concrete values rather than probability distributions.

Deep PPLs can express Bayesian neural networks with probabilistic weights and biases [5]. One way to visualize this is by replacing rectangles with circles for latent variables in Figure 4(a) to indicate that they hold probability distributions. Figure 4(b) shows the corresponding graphical model, where the latent variable θ denotes all the parameters of the MLP: W_h, b_h, W_y, b_y.

Bayesian inference starts from prior beliefs about the parameters and learns distributions that fit observed data (such as images and labels). We can then sample concrete weights and biases to obtain a concrete MLP. In fact, we do not need to stop at a single MLP: we can sample an ensemble of as many MLPs as we like. Then, we can feed a concrete image to all the sampled MLPs to get their predictions, followed by a vote.

Figure 5(a) shows the probabilistic MLP example in Edward. Lines 3-4 are placeholders for observed variables (i.e., batches of images x and labels l). Lines 5-9 define the MLP parameterized by θ, a dictionary containing all the network parameters. Lines 10-14 sample the parameters from the prior distributions. Line 15 defines the output of the network: a categorical distribution over all possible label values parameterized by the output of the MLP. Lines 17-23 define the guides for the latent variables, initialized with random numbers. Later, the variational inference will update these during optimization, so they will ultimately hold an approximation of the posterior distribution after inference.
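One detail of these guides is worth calling out: the scale (standard deviation) of each Normal is wrapped in a softplus. Variational parameters are optimized over all real numbers, but a standard deviation must be positive; softplus(v) = log(1 + e^v) maps any real number to a positive one. A minimal standalone illustration (ours, not taken from the figures):

```python
import math

def softplus(v):
    # log(1 + e^v): always positive, smooth, and approximately v for large v.
    return math.log1p(math.exp(v))

# Any real-valued variational parameter maps to a valid (positive) scale.
scales = [softplus(v) for v in [-5.0, 0.0, 5.0]]
assert all(s > 0.0 for s in scales)
print([round(s, 4) for s in scales])  # -> [0.0067, 0.6931, 5.0067]
```

This keeps the optimizer unconstrained while guaranteeing that every candidate guide is a well-formed distribution.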
Lines 25-29 set up the inference with one prior:posterior pair for each parameter of the network and link the output of the network to the observed data.

Figure 5(b) shows the same example in Pyro. Lines 2-11 contain the basic neural network, where torch.nn.Linear wraps the low-level linear algebra. Lines 3-6 declare the structure of the net, that is, the type and dimension of each layer. Lines 7-10 combine the layers to define the network. It is possible to use equivalent high-level TensorFlow APIs for this in Edward as well, but we refrained from doing so to illustrate the transition of the parameters to random variables. Lines 14-20 define the model. Lines 15-18 sample priors for the parameters, associating them with object properties created by torch.nn.Linear (i.e., the weight and bias of each layer). Line 19 lifts the MLP definition from concrete to probabilistic. We thus obtain an MLP where all parameters are treated as random variables. Line 20 conditions the model using a categorical distribution over all possible label values. Lines 26-31 define the guide for the latent variables, initialized with random numbers, just like in the Edward version. Line 33 sets up the inference.

After the inference, Figure 6 shows how to use the posterior distribution of the MLP parameters to classify unknown data. In Edward (Figure 6(a)), Lines 2-4 draw several samples of the parameters from the posterior distribution.
batch_size, nx, nh, ny = 128, 28*28, 1024, 10
x = tf.placeholder(tf.float32, [batch_size, nx])
l = tf.placeholder(tf.int32, [batch_size])
def mlp(theta, x):
    h = tf.nn.relu(tf.matmul(x, theta["Wh"]) + theta["bh"])
    yhat = tf.matmul(h, theta["Wy"]) + theta["by"]
    log_pi = tf.nn.log_softmax(yhat)
    return log_pi
theta = {
    'Wh': Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])),
    'bh': Normal(loc=tf.zeros(nh), scale=tf.ones(nh)),
    'Wy': Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])),
    'by': Normal(loc=tf.zeros(ny), scale=tf.ones(ny)) }
lhat = Categorical(logits=mlp(theta, x))
def vr(*shape): return tf.Variable(tf.random_normal(shape))
qtheta = {
    'Wh': Normal(loc=vr(nx, nh), scale=tf.nn.softplus(vr(nx, nh))),
    'bh': Normal(loc=vr(nh), scale=tf.nn.softplus(vr(nh))),
    'Wy': Normal(loc=vr(nh, ny), scale=tf.nn.softplus(vr(nh, ny))),
    'by': Normal(loc=vr(ny), scale=tf.nn.softplus(vr(ny))) }
inference = ed.KLqp({
    theta["Wh"]: qtheta["Wh"], theta["bh"]: qtheta["bh"],
    theta["Wy"]: qtheta["Wy"], theta["by"]: qtheta["by"] },
    data={lhat: l})

(a) Edward

class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.lh = torch.nn.Linear(nx, nh)
        self.ly = torch.nn.Linear(nh, ny)
    def forward(self, x):
        h = F.relu(self.lh(x.view((-1, nx))))
        log_pi = F.log_softmax(self.ly(h), dim=-1)
        return log_pi
mlp = MLP()
def v0s(*shape): return Variable(torch.zeros(*shape))
def v1s(*shape): return Variable(torch.ones(*shape))
def model(x, l):
    theta = {
        'lh.weight': Normal(v0s(nh, nx), v1s(nh, nx)),
        'lh.bias': Normal(v0s(nh), v1s(nh)),
        'ly.weight': Normal(v0s(ny, nh), v1s(ny, nh)),
        'ly.bias': Normal(v0s(ny), v1s(ny)) }
    lifted_mlp = pyro.random_module("mlp", mlp, theta)()
    pyro.observe("obs", Categorical(logits=lifted_mlp(x)), one_hot(l))
def vr(name, *shape):
    return pyro.param(name, Variable(torch.randn(*shape), requires_grad=True))
def guide(x, l):
    qtheta = {
        'lh.weight': Normal(vr("Wh_m", nh, nx), F.softplus(vr("Wh_s", nh, nx))),
        'lh.bias': Normal(vr("bh_m", nh), F.softplus(vr("bh_s", nh))),
        'ly.weight': Normal(vr("Wy_m", ny, nh), F.softplus(vr("Wy_s", ny, nh))),
        'ly.bias': Normal(vr("by_m", ny), F.softplus(vr("by_s", ny))) }
    return pyro.random_module("mlp", mlp, qtheta)()
inference = SVI(model, guide, Adam({"lr": 0.01}), loss="ELBO")

(b) Pyro

Figure 5: Probabilistic multilayer perceptron for classifying images.

def predict(x):
    theta_samples = [ { "Wh": qtheta["Wh"].sample(), "bh": qtheta["bh"].sample(),
                        "Wy": qtheta["Wy"].sample(), "by": qtheta["by"].sample() }
                      for _ in range(args.num_samples) ]
    yhats = [ mlp(theta_samp, x).eval() for theta_samp in theta_samples ]
    mean = np.mean(yhats, axis=0)
    return np.argmax(mean, axis=1)

(a) Edward

def predict(x):
    sampled_models = [ guide(None) for _ in range(args.num_samples) ]
    yhats = [ model(Variable(x)).data for model in sampled_models ]
    mean = torch.mean(torch.stack(yhats), 0)
    return np.argmax(mean, axis=1)

(b) Pyro

Figure 6: Predictions by the probabilistic multilayer perceptron.

Then, Lines 5-6 execute the
MLP with each concrete model. Line 7 computes the score of a label as the average of the scores returned by the MLPs. Finally, Line 8 predicts the label with the highest score. In Pyro (Figure 6(b)), the prediction is done similarly, but we obtain multiple versions of the MLP by sampling the guide (Lines 2-3), not the parameters.

This section showed how to use probabilistic variables as building blocks for a DL model. Compared to non-probabilistic DL, this approach has the advantage of reduced overfitting and accurately quantified uncertainty [5]. On the other hand, this approach requires inference techniques, like variational inference, that are more advanced than classic back-propagation. The next section will present the dual approach, showing how to use neural networks as building blocks for a probabilistic model.

DL IN PROBABILISTIC MODELS
This section explains how deep PPLs can use non-probabilistic deep neural networks as components in probabilistic models. The example task is learning a vector-space representation. Such a representation reduces the number of input dimensions to make other machine-learning tasks more manageable by counter-acting the curse of dimensionality [10]. The observed random variable is x, for instance, an image of a hand-written digit with n_x = 28 · 28 pixels. The latent random variable is z, the vector-space representation, for instance, with n_z = 4 dimensions. The image x depends on the latent representation z in a complex non-linear way (i.e., via a deep neural network). The task is to learn this dependency between x and z. The top half of Figure 7(a) shows the corresponding graphical model. The output of the neural network, named decoder, is a vector µ that parameterizes a Bernoulli distribution over each pixel in the image x. Each pixel is thus associated with a probability of being present in the image. Similarly to Figure 4(b), the parameter θ of the decoder is global (i.e., shared by all data points) and is thus drawn outside the plate. Compared to Section 3, the network here is not probabilistic, hence the square around θ.

The main idea of the VAE [19, 27] is to use variational inference to learn the latent representation. As for the examples presented in the previous sections, we need to define a guide for the inference. The guide maps each x to a latent variable z via another neural network. The bottom half of Figure 7(a) shows the graphical model of the guide. The network, named encoder, returns, for each image x, the parameters µ_z and σ_z of a Gaussian distribution in the latent space. Again, the parameter ϕ of the network is global and not probabilistic. Inference then tries to learn good values for the parameters θ and ϕ, simultaneously training the decoder and the encoder, according to the data and the prior beliefs on the latent variables (e.g., Gaussian distribution).

After the inference, we can generate a latent representation of an image with the encoder and reconstruct the image with the decoder. The similarity of the two images gives an indication of the success of the inference. The model and the guide together can thus be seen as an autoencoder, hence the term variational autoencoder.

Figure 7(b) shows the VAE example in Edward. Lines 4-12 define the decoder: a simple 2-layer neural network similar to the one presented in Section 3.
The parameter θ is initialized with random noise. Line 13 samples the priors for the latent variable z from a Gaussian distribution. Lines 14-15 define the dependency between x and z as a Bernoulli distribution parameterized by the output of the decoder. Lines 17-29 define the encoder: a neural network with one hidden layer and two distinct output layers for µ_z and σ_z. The parameter ϕ is also initialized with random noise. Lines 30-31 define the inference guide for the latent variable, that is, a Gaussian distribution parameterized by the outputs of the encoder. Line 33 sets up the inference, matching the prior:posterior pair for the latent variable and linking the data with the output of the decoder.

Figure 7(c) shows the VAE example in Pyro. Lines 2-12 define the decoder. Lines 13-19 define the model. Lines 14-16 sample the priors for the latent variable z. Lines 18-19 define the dependency between x and z via the decoder. In contrast to Figure 5(b), the decoder is not probabilistic, so there is no need for lifting the network. Lines 34-37 define the guide as in Edward, linking z and x via the encoder defined in Lines 21-33. Line 39 sets up the inference.

This example illustrates that we can embed non-probabilistic DL models inside probabilistic programs and learn the parameters of the DL models during inference. Sections 2, 3, and 4 were about explaining deep PPLs with examples. The next section is about comparing deep PPLs with each other and with their potential.

This section attempts to answer the following research question: At this point in time, how well do deep PPLs live up to their potential? Deep PPLs combine probabilistic models, deep learning, and programming languages in an effort to combine their advantages. This section explores those advantages grouped by pedigree and uses them to characterize Edward and Pyro. Before we dive in, some disclaimers are in order.
First, both Edward and Pyro are young projects, not mature ones with years of improvements based on user experiences, and they enable new applications that are still under active research. We should keep this in mind when we criticize them. On the positive side, early criticism can be a positive influence. Second, since getting even a limited number of example programs to support direct side-by-side comparisons was non-trivial, we kept our study qualitative. A respectable quantitative study would require more programs and data sets. On the positive side, all of the code shown in this paper actually runs. Third, doing an outside-in study risks missing subtleties that the designers of Edward and Pyro may be more expert in. On the positive side, the outside-in view resembles what new users experience.
Probabilistic models support overt uncertainty: they give not just a prediction but also a meaningful probability. This is useful to avoid uncertainty bugs [6], track compounding effects of uncertainty, and even make better exploration decisions in reinforcement learning [5]. Both Edward and Pyro support overt uncertainty well; see, e.g., the lines that query the posterior distribution in Figure 2.

Probabilistic models give users a choice of inference procedures: the user has the flexibility to pick and configure different approaches. Deep PPLs support two primary families of inference procedures: those based on Monte-Carlo sampling and those based on variational inference. Edward supports both, and furthermore flexible compositions, where different inference procedures are applied to different parts of the model. Pyro supports primarily variational inference and focuses less on Monte-Carlo sampling. In comparison, Stan makes a form of Monte-Carlo sampling the default, focusing on making it easy to tune in practice [8].

Probabilistic models can help with small data: even when inference uses only a small amount of labeled data, there have been
high-profile cases where probabilistic models still make accurate predictions [21]. Working with small data is useful to avoid the cost of hand-labeling, to improve privacy, to build personalized models, and to do well on underrepresented corners of a big-data task. The intuition for how probabilistic models help is that they can make up for lacking labeled data for a task by domain knowledge incorporated in the model, by unlabeled data, or by labeled data for other tasks. There are some promising initial successes of using deep probabilistic programming on small data [26, 30]; at the same time, this remains an active area of research.

(a) Graphical models: the model p_θ(x | z) with decoder output µ, and the guide q_ϕ(z | x) with encoder outputs µ_z and σ_z; plates mark the N per-data-point variables.

batch_size, nx, nh, nz = 256, 28*28, 1024, 4
def vr(*shape): return tf.Variable(0.01 * tf.random_normal(shape))
X = tf.placeholder(tf.int32, [batch_size, nx])
def decoder(theta, z):
    hidden = tf.nn.relu(tf.matmul(z, theta['Wh']) + theta['bh'])
    mu = tf.matmul(hidden, theta['Wy']) + theta['by']
    return mu
theta = { 'Wh': vr(nz, nh), 'bh': vr(nh), 'Wy': vr(nh, nx), 'by': vr(nx) }
z = Normal(loc=tf.zeros([batch_size, nz]), scale=tf.ones([batch_size, nz]))
logits = decoder(theta, z)
x = Bernoulli(logits=logits)
def encoder(phi, x):
    x = tf.cast(x, tf.float32)
    hidden = tf.nn.relu(tf.matmul(x, phi['Wh']) + phi['bh'])
    z_mu = tf.matmul(hidden, phi['Wy_mu']) + phi['by_mu']
    z_sigma = tf.nn.softplus(
        tf.matmul(hidden, phi['Wy_sigma']) + phi['by_sigma'])
    return z_mu, z_sigma
phi = { 'Wh': vr(nx, nh), 'bh': vr(nh), 'Wy_mu': vr(nh, nz), 'by_mu': vr(nz),
        'Wy_sigma': vr(nh, nz), 'by_sigma': vr(nz) }
loc, scale = encoder(phi, X)
qz = Normal(loc=loc, scale=scale)
inference = ed.KLqp({z: qz}, data={x: X})

(b) Edward

class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        self.lh = nn.Linear(nz, nh)
        self.lx = nn.Linear(nh, nx)
        self.relu = nn.ReLU()
    def forward(self, z):
        hidden = self.relu(self.lh(z))
        mu = self.lx(hidden)
        return mu
decoder = Decoder()
def model(x):
    z_mu = Variable(torch.zeros(x.size(0), nz))
    z_sigma = Variable(torch.ones(x.size(0), nz))
    z = pyro.sample("z", dist.Normal(z_mu, z_sigma))
    pyro.module("decoder", decoder)
    mu = decoder.forward(z)
    pyro.sample("xhat", dist.Bernoulli(mu), obs=x.view(-1, nx))
class Encoder(nn.Module):
    def __init__(self):
        super(Encoder, self).__init__()
        self.lh = torch.nn.Linear(nx, nh)
        self.lz_mu = torch.nn.Linear(nh, nz)
        self.lz_sigma = torch.nn.Linear(nh, nz)
        self.softplus = nn.Softplus()
    def forward(self, x):
        hidden = F.relu(self.lh(x.view((-1, nx))))
        z_mu = self.lz_mu(hidden)
        z_sigma = self.softplus(self.lz_sigma(hidden))
        return z_mu, z_sigma
encoder = Encoder()
def guide(x):
    pyro.module("encoder", encoder)
    z_mu, z_sigma = encoder.forward(x)
    pyro.sample("z", dist.Normal(z_mu, z_sigma))
inference = SVI(model, guide, Adam({"lr": 0.01}), loss="ELBO")

(c) Pyro

Figure 7: Variational autoencoder for encoding and decoding images.

Probabilistic models can support explainability: when the components of a probabilistic model correspond directly to concepts of a real-world domain being modeled, predictions can include an explanation in terms of those concepts. Explainability is useful when predictions are consulted for high-stakes decisions, as well as for transparency around bias [7]. Unfortunately, the parameters of a deep neural network are just as opaque with as without probabilistic programming. There is cause for hope, though. For instance, Siddharth et al. advocate disentangled representations that help explainability [30]. Overall, the jury is still out on the extent to which deep PPLs can leverage this advantage from PPLs.

Deep learning is automatic hierarchical representation learning [22]: each unit in a deep neural network can be viewed as an automatically learned feature. Learning features automatically is useful to avoid the cost of engineering features by hand.
Fortunately, this DL advantage remains true in the context of a deep PPL. In fact, a deep PPL makes the trade-off between automated and hand-crafted features more flexible than most other machine-learning approaches.

Deep learning can accomplish high accuracy: for various tasks, there have been high-profile cases where deep neural networks beat out earlier approaches with years of research behind them. Arguably, the victory of DL at the ImageNet competition in 2012 ushered in the latest craze around DL [20]. Record-breaking accuracy is useful not just for publicity but also to cross thresholds where practical deployments become desirable, for instance, in machine translation [33]. Since a deep PPL can use deep neural networks, in principle, it inherits this advantage from DL [31]. However, even non-probabilistic DL requires tuning, and in our experience with the example programs in this paper, the tuning burden is exacerbated with variational inference.

Deep learning supports fast inference: even for large models and a large data set, the wall-clock time for a batch job to infer posteriors is short. The fast-inference advantage is the result of the back-propagation algorithm [28], novel techniques for parallelization [25] and data representation [17], and massive investments in the efficiency of DL frameworks such as TensorFlow and PyTorch, with vectorization on CPU and GPU. Fast inference is useful for iterating more quickly on ideas, trying more hyperparameter settings, and wasting fewer resources. Tran et al. measured the efficiency of the Edward deep PPL, demonstrating that it does benefit from the efficiency advantage of the underlying DL framework [31].
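The back-propagation algorithm [28] behind this advantage is reverse-mode automatic differentiation. As a rough illustration only, nothing like the heavily optimized implementations inside TensorFlow or PyTorch, a scalar version can be sketched in a few lines of plain Python (the class name Scalar is ours, purely for exposition):

```python
class Scalar:
    """Toy reverse-mode autodiff on scalars, in the spirit of back-propagation."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        # Local derivatives: d(x*y)/dx = y and d(x*y)/dy = x.
        return Scalar(self.value * other.value,
                      [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Scalar(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, seed=1.0):
        # Accumulate gradients from the output back toward the leaves.
        # (Naive recursion; real frameworks use a topological schedule.)
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x = Scalar(2.0)
w = Scalar(3.0)
y = x * w + x      # y = x*w + x = 8.0
y.backward()
print(x.grad)      # dy/dx = w + 1 = 4.0
print(w.grad)      # dy/dw = x = 2.0
```

Real frameworks vectorize this computation over tensors and schedule it on GPUs, which is where the fast-inference advantage ultimately comes from.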
Programming language design is essential for composability: bigger models can be composed from smaller components. Composability is useful for testing, teamwork, and reuse. Conceptually, both graphical probabilistic models and deep neural networks compose well. On the other hand, some PPLs impose structure in a way that reduces composability; fortunately, this can be mitigated [16]. Both Edward and Pyro are embedded in Python and, as our example programs demonstrate, work with Python functions and classes. For instance, users are not limited to declaring all latent variables in one place; instead, they can compose models, such as MLPs, with separately declared latent variables. Edward and Pyro also work with higher-level DL framework modules such as tf.layers.dense or torch.nn.Linear, and Pyro even supports automatically lifting those to make them probabilistic. Edward and Pyro also do not prevent users from composing probabilistic models with non-probabilistic code, but doing so requires care. For instance, when Monte-Carlo sampling runs the same code multiple times, it is up to the programmer to watch out for unwanted side effects. One area where more work is needed is the extensibility of Edward or Pyro itself [8]. Finally, in addition to composing models, Edward also emphasizes composing inference procedures.

Not all PPLs have the same expressiveness: some are Turing complete, others are not [15]. For instance, BUGS is not Turing complete, but it has nevertheless been highly successful [13]. The field of deep probabilistic programming is too young to judge which levels of expressiveness are how useful. Edward and Pyro are both Turing complete. However, Edward makes it harder to express while-loops and conditionals than Pyro. Since Edward builds on TensorFlow, the user must use special APIs to incorporate dynamic control constructs into the computational graph.
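To make this concrete, consider a model with stochastic control flow: sampling a geometric distribution runs a data-dependent loop, so the number of random draws is itself random. The sketch below is framework-free plain Python (the helper name sample_geometric is ours, not part of Edward or Pyro):

```python
import random

def sample_geometric(p, rng=random):
    """Sample from Geometric(p): number of coin flips until the first success.

    The number of loop iterations, and hence of random draws, is itself
    random. This is exactly the dynamic control flow that a static
    computational graph cannot express without a special loop construct.
    """
    trials = 1
    while rng.random() >= p:   # native, data-dependent while-loop
        trials += 1
    return trials

rng = random.Random(42)
draws = [sample_geometric(0.5, rng) for _ in range(10000)]
print(sum(draws) / len(draws))   # sample mean, close to 1/p = 2
```

A static-graph framework must route such a loop through a dedicated API like TensorFlow's tf.while_loop, whereas a dynamic framework simply runs the Python loop.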
In contrast, since Pyro builds on PyTorch, it can use native Python control constructs, one of the advantages of dynamic DL frameworks [24].

Programming language design affects conciseness: it determines whether a model can be expressed in few lines of code. Conciseness is useful to make models easier to write and, when used in good taste, easier to read. In our code examples, Edward is more concise than Pyro. Pyro arguably trades conciseness for structure, making heavier use of Python classes and functions. Wrapping the model and guide into functions allows compiling them into coroutines, an ingredient for implementing efficient inference procedures [14]. In both Edward and Pyro, conciseness is hampered by the Bayesian requirement for explicit priors and by the variational-inference requirement for explicit guides.

Programming languages can offer watertight abstractions: they can abstract away lower-level concepts and prevent them from leaking out, for instance, using types and constraints [8]. Consider the expression xW + b from Section 3. At face value, this looks like eager arithmetic on concrete scalars, running just once in the forward direction. But actually, it may be lazy arithmetic (building a computational graph) on probability distributions (not concrete values) of tensors (not scalars), running several times (for different Monte-Carlo samples or data batches), possibly in the backward direction (for back-propagation of gradients). Abstractions are useful to reduce the cognitive burden on the programmer, but only if they are watertight. Unfortunately, abstractions in deep PPLs are leaky. Our code examples directly invoke features from several layers of the technology stack (Edward or Pyro, on TensorFlow or PyTorch, on NumPy, on Python). Furthermore, we found that error messages rarely refer to source code constructs.
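The lazy behavior described above can be mimicked with a toy class that overloads arithmetic to build an expression graph instead of computing a value. This is an illustrative sketch, not the actual API of any framework:

```python
class Sym:
    """Toy symbolic value: arithmetic lazily builds an expression graph."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __mul__(self, other):
        return Sym("mul", self, other)

    def __add__(self, other):
        return Sym("add", self, other)

    def eval(self, env):
        # The graph can be evaluated later, and more than once
        # (e.g. per Monte-Carlo sample or per data batch).
        if self.op == "var":
            return env[self.args[0]]
        left, right = (arg.eval(env) for arg in self.args)
        return left * right if self.op == "mul" else left + right

x, W, b = Sym("var", "x"), Sym("var", "W"), Sym("var", "b")
y = x * W + b                # no arithmetic happens here: y is a graph
print(type(y).__name__)      # Sym, not a number
print(y.eval({"x": 2.0, "W": 3.0, "b": 1.0}))  # 7.0
```

Note that the graph records operations, not the Python variable names that held them, which is one reason errors phrased in terms of graph nodes are hard to map back to source code.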
For instance, names of Python variables are erased from the computational graph, making it hard to debug tensor dimensions, a common cause of mistakes. It does not have to be that way. For instance, DL frameworks are good at hiding the abstraction of back-propagation. More work is required to make deep PPL abstractions more watertight.

This paper is a study of two deep PPLs, Edward and Pyro. The study is qualitative and driven by code examples. This paper explains how to solve common tasks, contributing side-by-side comparisons of Edward and Pyro. The potential of deep PPLs is to combine the advantages of probabilistic models, deep learning, and programming languages. In addition to comparing Edward and Pyro to each other, this paper also compares them to that potential. A quantitative study is left to future work. Based on our experience, we confirm that Edward and Pyro combine three advantages out of the box: the overt uncertainty of probabilistic models; the hierarchical representation learning of DL; and the composability of programming languages.

Following are possible next steps in deep PPL research.

• Choice of inference procedures: Pyro in particular should support Monte-Carlo methods at the same level as variational inference.
• Small data: While possible in theory, this has yet to be demonstrated on Edward and Pyro with interesting data sets.
• High accuracy: Edward and Pyro need to be improved to simplify the tuning required to improve model accuracy.
• Expressiveness: While Turing complete in theory, Edward should adopt recent dynamic TensorFlow features for usability.
• Conciseness: Both Edward and Pyro would benefit from reducing the repetitive idioms of priors and guides.
• Watertight abstractions: Both Edward and Pyro fall short on this goal, necessitating more careful language design.
• Explainability: This is inherently hard with deep PPLs, necessitating more machine-learning innovation.

In summary, deep PPLs show great promise and remain an active field with many research opportunities.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Operating Systems Design and Implementation (OSDI).
[2] Workshop at NIPS on Learning with Limited Labeled Data (LLD). https://lld-workshop.github.io/papers/LLD_2017_paper_10.pdf
[3] David Barber. 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press.
[4] J. Amer. Statist. Assoc.
[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight Uncertainty in Neural Networks. In International Conference on Machine Learning (ICML). 1613–1622. http://proceedings.mlr.press/v37/blundell15.html
[6] James Bornholt, Todd Mytkowicz, and Kathryn S. McKinley. 2014. Uncertain<T>: A First-Order Type for Uncertain Data. In Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 51–66. https://doi.org/10.1145/2541940.2541958
[7] Flavio P. Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. 2017. Optimized Pre-Processing for Discrimination Prevention. In Neural Information Processing Systems (NIPS). 3995–4004. http://papers.nips.cc/paper/6988-optimized-pre-processing-for-discrimination-prevention.pdf
[8] Bob Carpenter. 2017. Hello, world! Stan, PyMC3, and Edward. http://andrewgelman.com/2017/05/31/compare-stan-pymc3-edward-hello-world/ (Retrieved February 2018).
[9] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A probabilistic programming language. Journal of Statistical Software.
[10] Pedro Domingos. 2012. A few useful things to know about machine learning. Communications of the ACM (CACM) 55, 10 (Oct. 2012), 78–87. https://doi.org/10.1145/2347736.2347755
[11] Facebook. 2016. PyTorch. http://pytorch.org/ (Retrieved February 2018).
[12] Zoubin Ghahramani. 2015. Probabilistic machine learning and artificial intelligence. Nature 521 (May 2015), 452–459.
[13] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. 1994. A language and program for complex Bayesian modelling. The Statistician 43, 1 (Jan. 1994), 169–177.
[14] Noah D. Goodman and Andreas Stuhlmüller. 2014. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org (Retrieved February 2018).
[15] Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic Programming. In ICSE track on Future of Software Engineering (FOSE). 167–181. https://doi.org/10.1145/2593882.2593900
[16] Maria I. Gorinova, Andrew D. Gordon, and Charles Sutton. 2018. SlicStan: Improving Probabilistic Programming using Information Flow Analysis. In Workshop on Probabilistic Programming Languages, Semantics, and Systems (PPS). https://pps2018.soic.indiana.edu/files/2017/12/SlicStanPPS.pdf
[17] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning (ICML). 1737–1746. http://proceedings.mlr.press/v37/gupta15.pdf
[18] Paul Hudak. 1998. Modular domain specific languages and tools. In International Conference on Software Reuse (ICSR). 134–142. https://doi.org/10.1109/ICSR.1998.685738
[19] Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. https://arxiv.org/abs/1312.6114
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Networks. In Advances in Neural Information Processing Systems (NIPS). http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
[21] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science 350, 6266 (Dec. 2015), 1332–1338. http://science.sciencemag.org/content/350/6266/1332
[22] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. Nature 521 (May 2015), 436–444.
[25] Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. 2011. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Conference on Neural Information Processing Systems (NIPS). 693–701. http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent
[26] Danilo J. Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. 2016. One-shot Generalization in Deep Generative Models. In International Conference on Machine Learning (ICML). 1521–1529. http://proceedings.mlr.press/v48/rezende16.html
[27] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning (ICML). 1278–1286. http://proceedings.mlr.press/v32/rezende14.html
[28] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature 323 (Oct. 1986), 533–536. https://doi.org/10.1038/323533a0
[29] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2 (2016), e55.
[30] N. Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah D. Goodman, Pushmeet Kohli, Frank Wood, and Philip H. S. Torr. 2017. Learning Disentangled Representations with Semi-Supervised Deep Generative Models. In Advances in Neural Information Processing Systems (NIPS). 5927–5937. http://papers.nips.cc/paper/7174-learning-disentangled-representations-with-semi-supervised-deep-generative-models.pdf
[31] Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep Probabilistic Programming. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1701.03757
[32] Uber. 2017. Pyro. http://pyro.ai/ (Retrieved February 2018).
[33] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. https://arxiv.org/abs/1609.08144