Deep Generalized Convolutional Sum-Product Networks
Jos van de Wolfshaar (jos.vandewolfshaar@gmail.com)
Andrzej Pronobis (pronobis@cs.washington.edu)
KTH Royal Institute of Technology, Stockholm, Sweden
MessageBird, Amsterdam, The Netherlands
University of Washington, Seattle, WA, USA
Abstract
Sum-Product Networks (SPNs) are hierarchical, graphical models that combine benefits of deep learning and probabilistic modeling. SPNs offer unique advantages to applications demanding exact probabilistic inference over high-dimensional, noisy inputs. Yet, compared to convolutional neural nets, they struggle with capturing complex spatial relationships in image data. To alleviate this issue, we introduce Deep Generalized Convolutional Sum-Product Networks (DGC-SPNs), which encode spatial features in a way similar to CNNs, while preserving the validity of the probabilistic SPN model. As opposed to existing SPN-based image representations, DGC-SPNs allow for overlapping convolution patches through a novel parameterization of dilations and strides, resulting in significantly improved feature coverage and feature resolution. DGC-SPNs substantially outperform other SPN architectures across several visual datasets and for both generative and discriminative tasks, including image inpainting and classification. These contributions are reinforced by the first simple, scalable, and GPU-optimized implementation of SPNs, integrated with the widely used Keras/TensorFlow framework. The resulting model is fully probabilistic and versatile, yet efficient and straightforward to apply in practical applications in place of traditional deep nets.
Keywords:
Sum-Product Networks, Deep Probabilistic Models, Image Representations
1. Introduction
Sum-Product Networks (Poon and Domingos, 2011) are deep models with unique probabilistic semantics based on a rigorous theoretical framework. They can be seen as both probabilistic graphical models (PGMs) and deep nets, and can be trained using common deep learning techniques (adaptive gradient descent, dropout (Peharz et al., 2019)) as well as those used for PGMs (Zhao et al., 2016; Rashwan et al., 2018). As opposed to specialized CNNs or GANs (Radford et al., 2015), SPNs can perform a wide range of probabilistic inferences efficiently through a single forward and backward pass (including marginal, conditional, joint, and MPE queries), and naturally marginalize out missing data. Whereas CNNs excel at classification, an SPN can perform classification and a variety of generative tasks within a single network. Analogously, GANs, while ideal for sampling, lack probabilistic interpretation and struggle with (or are inefficient for) many generative problems other than sampling. This makes SPNs ideal for domains demanding real-time performance and involving uncertainty and heterogeneous tasks, such as robotics (Pronobis and Rao, 2017; Zheng et al., 2018).

In this work, we propose Deep Generalized Convolutional Sum-Product Networks (DGC-SPNs) that combine the probabilistic properties of SPNs with the ability to capture spatial relationships in a way similar to convolutional neural networks (CNNs). DGC-SPNs are deep, layered models that exploit the inherent structure of image data and hierarchically capture spatial relations through products and weighted sums with local receptive fields. We introduce a novel parameterization of strides, dilation rates, and connectivity for convolutional operations in product layers that makes
DGC-SPNs more general than existing approaches to convolutional SPNs (Sharir et al., 2016; Butz et al., 2019). Unlike other architectures, DGC-SPNs employ overlapping patches in product layers, thus avoiding the loss of feature resolution and coverage; this enhances representations of spatial relations across the input image, while at the same time preserving the constraints that guarantee validity. We demonstrate that this translates to significant improvements in performance for both generative and discriminative tasks, such as image inpainting and image classification.

Our main contributions are (i) the introduction of a novel convolutional SPN architecture, (ii) a comprehensive range of experiments with SPNs on image data where our DGC-SPNs substantially outperform other SPN models, and (iii) the first layer-centered SPN library, libspn-keras, built on top of the widely used TensorFlow and Keras frameworks. The library implements the DGC-SPN architecture as well as many other types of SPN models, and provides ease of use, flexibility, efficiency, and scalability. Our experiments are easily reproduced using this library.
2. Background
We begin with a brief introduction of the theoretical background behind SPNs. A Sum-Product Network represents a joint probability distribution over a set of random variables $\mathbf{X}$. An example of a simple SPN is given in Figure 1. An SPN is a rooted directed acyclic graph whose root node computes the unnormalized probability $S(x)$ of a distribution at $x \in \mathcal{X}$. The leaves of an SPN correspond to individual random variables $X_i$. In the discrete case, they are typically represented as Bernoulli variables, while Gaussian distributions are often used for continuous inputs. In between the leaves and the root, an SPN consists of weighted sum and product operations. The weighted sum nodes have non-negative weights and can be interpreted as probabilistic mixture models over subsets of variables, while products compute joint probabilities by multiplying input values and can be seen as features. The output of a sum node $s_j$ is given by $S_{s_j}(x) = \sum_{i \in \mathrm{ch}(s_j)} w_{ji} C_i$, where $C_i$ is the value of the $i$-th child and $\mathrm{ch}(s_j)$ is the set of children of $s_j$. The output of a product node $p_j$ is given by $S_{p_j}(x) = \prod_{i \in \mathrm{ch}(p_j)} C_i$.

Following (Poon and Domingos, 2011), we define the following concepts related to SPNs:
Figure 1: An example of an SPN over 4 discrete binary variables ($sc(n)$ denotes the scope of $n$).
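To make the sum- and product-node computations above concrete, here is a minimal numeric sketch (a toy example of our own, not the paper's GPU implementation; all names are illustrative):

```python
import numpy as np

def sum_node(child_values, weights):
    # S_sj(x) = sum_i w_ji * C_i, with non-negative weights.
    return float(np.dot(weights, child_values))

def product_node(child_values):
    # S_pj(x) = prod_i C_i, over children with disjoint scopes.
    return float(np.prod(child_values))

# A sum node mixing two product nodes over binary variables {X1, X2}.
leaves = {"x1": 0.8, "not_x1": 0.2, "x2": 0.6, "not_x2": 0.4}
p1 = product_node([leaves["x1"], leaves["x2"]])          # 0.8 * 0.6 = 0.48
p2 = product_node([leaves["not_x1"], leaves["not_x2"]])  # 0.2 * 0.4 = 0.08
root = sum_node([p1, p2], weights=np.array([0.7, 0.3]))
print(root)  # 0.7 * 0.48 + 0.3 * 0.08 = 0.36
```

Each product joins disjoint scopes and the root mixes children with identical scopes, matching the validity conditions defined next.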
Definition 1 (Scope) The scope of a node $n$, denoted $sc(n)$, is the set of variables that are descendants of $n$.

In other words, the scope of a node is the union of the scopes of its children. Typically, leaf nodes have a singleton scope containing a single variable $X_i$.

Definition 2 (Validity) An SPN is valid if it correctly computes the unnormalized probability for all evidence $E$, where $E \subseteq \mathbf{X}$.

A sufficient set of conditions that ensures validity consists of completeness and decomposability:

Definition 3 (Completeness) An SPN is complete if all children of a sum node have identical scopes.

Definition 4 (Decomposability) An SPN is decomposable if all children of the same product node have pairwise disjoint scopes.
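Both structural conditions can be checked mechanically given explicit scope sets; the following sketch (helper names are our own) illustrates Definitions 3 and 4:

```python
def is_complete(child_scopes):
    # Sum node: all children must share an identical scope.
    return all(s == child_scopes[0] for s in child_scopes)

def is_decomposable(child_scopes):
    # Product node: children must have pairwise disjoint scopes, i.e.
    # the size of the union equals the sum of the individual sizes.
    union = set().union(*child_scopes)
    return len(union) == sum(len(s) for s in child_scopes)

print(is_complete([{"X1", "X2"}, {"X1", "X2"}]))   # True
print(is_decomposable([{"X1"}, {"X2"}]))           # True
print(is_decomposable([{"X1"}, {"X1", "X2"}]))     # False: X1 repeated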
A unique feature of SPNs is the ability to compute the partition function $Z_S = \sum_{x \in \mathcal{X}} S(x)$ using only a single pass through the network. To this end, all evidence is removed from the network's inputs and an upward pass is computed. For a Bernoulli variable $X_i$, both indicators $x_i$ and $\bar{x}_i$ are set to 1. If all $X_i$ are continuous and represented as multiple Gaussian components per variable, then all of the components corresponding to $X_i$ are set to 1. It has been shown that $Z_S = 1$ for a normalized SPN, where the weights of each sum node add up to one. Therefore, a normalized SPN computes a valid probability with a single upward pass: $S(x) = P(x)$ (Peharz, 2015).

For discriminative tasks, SPNs can be trained with traditional gradient descent techniques used in deep learning, such as SGD or Adam (Kingma and Ba, 2014). Generative learning is commonly done with expectation maximization (EM). A single EM step requires one forward and one backward pass through the network. As an alternative to vanilla EM, with 'dense' non-zero training signals for all sum children in the backward pass, hard EM backpropagates sparse signals by selecting only one 'winning' child per sum. For each sum, the child with maximum weighted probability (as computed in the forward pass) is selected as the winning child. The winning child receives a signal of 1, and so its accumulator is incremented by 1 after the corresponding training step, while its siblings receive a signal of 0, yielding no increments for their accumulators. Yet another alternative is unweighted hard EM, which similarly selects one winning child per sum, but relies on unweighted values of the children computed in the forward pass for the selection (Kalra et al., 2018).
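The hard-EM child selection described above can be sketched for a single sum node as follows (an illustrative fragment, not the libspn-keras implementation; weighted selection corresponds to vanilla hard EM, unweighted to the variant of Kalra et al. (2018)):

```python
import numpy as np

def hard_em_step(child_values, weights, accumulators, unweighted=False):
    # Pick the winning child from the forward-pass values, either by
    # weighted probability (vanilla hard EM) or by raw child value (USI).
    scores = child_values if unweighted else weights * child_values
    winner = int(np.argmax(scores))
    accumulators[winner] += 1.0  # winner gets signal 1, siblings get 0
    return winner

weights = np.array([0.8, 0.2])
values = np.array([0.3, 0.9])
acc = np.zeros(2)
print(hard_em_step(values, weights, acc))                   # 0: 0.24 > 0.18
print(hard_em_step(values, weights, acc, unweighted=True))  # 1: 0.9 > 0.3
```

Note how the two selection rules can disagree: a large weight can outvote a larger child value, which is exactly the bias that unweighted selection mitigates.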
3. Related Work on Visual Tasks with SPNs
SPNs are expressive and have previously been used in a wide range of domains (Amer and Todorovic, 2015; Zheng et al., 2018). Within computer vision, SPNs have been applied to both discriminative and generative problems. In the seminal work (Poon and Domingos, 2011), SPNs were used for image inpainting, and assembled following a recursive procedure, where each rectangular image region was split into all possible vertical and horizontal non-overlapping sub-regions. The same architecture was trained with the Extended Baum-Welch (EBW) algorithm to perform image classification (Rashwan et al., 2018). The non-uniform dimension sizes in this architecture do not lend themselves to GPU-optimized tensorized implementations, limiting their scalability.
The fact that SPNs could benefit from the introduction of convolutions was first recognized in (Hartmann, 2014), where a hybrid model consisting of CNN and SPN layers was used for image classification. However, this yields an architecture that is not fully probabilistic, thereby lacking the ability to perform probabilistic inferences all the way down to the inputs of the model. More closely related to our work are Convolutional Arithmetic Circuits (ConvACs) (Sharir et al., 2016), which can be viewed as convolutional SPNs, and Deep Convolutional Sum-Product Networks (DCSPNs) (Butz et al., 2019). Both ConvACs and DCSPNs ensure SPN validity and decomposability of products by using non-overlapping image patches. This effectively reduces feature resolution and fails to capture spatial relations across many regions of the image. In contrast, DGC-SPNs employ a more general approach to convolutions that allows for significantly improved feature coverage and resolution and captures a superset of the probability distributions of other convolutional SPNs.

Apart from spatial SPNs, other works rely on randomly structured SPNs (Peharz et al., 2019), an SVD-based structure learning algorithm (Adel et al., 2015), a structure learning algorithm based on clustering of variables (Dennis and Ventura, 2012), or a structure learning algorithm based on both correlation matrices and clustering (Jaini et al., 2018). Instead, DGC-SPNs take inspiration from CNNs and impose a spatial prior on the structure that corresponds to the inherent spatial relations present in image data. Our experiments show that this translates to gains in performance, as DGC-SPNs yield superior results compared to the aforementioned approaches.
4. Deep Generalized Convolutional SPNs
DGC-SPNs are valid Sum-Product Networks with a unique structure inspired by CNNs, consisting of weighted sum and product operations corresponding to local receptive fields. DGC-SPNs are specifically designed to enable efficient utilization of GPU hardware within tensor-based frameworks. DGC-SPN layers are represented as tensors with one dimension for samples in the batch, an arbitrary number of spatial dimensions (2 for images), and one dimension for channels. From here on, we omit the batch dimension for simplicity and discuss DGC-SPNs for a single image sample. The SPN nodes of a DGC-SPN (products and sums) are aligned spatially. We refer to all nodes corresponding to a single location $(i, j)$ indexed on the spatial axes as a cell. The probabilities computed by the nodes are represented as a tensor $X \in \mathbb{R}^{H \times W \times C}$ for an image with dimensions $W \times H$ and number of channels $C$ (e.g., RGB channels for color images). As shown in Figure 3, nodes are stacked along the channel axis to form cells, while cells are stacked along the spatial axes to form layers. Sum layers and product layers are stacked in alternating fashion to form a deep network. Sum layers compute mixtures at each cell, resulting in new output channels. Spatial products combine inputs locally to form hierarchical spatial features that exponentially increase their receptive field from the leaf layer to the root.

Scopes of nodes within a DGC-SPN have a clear relation to the tensor dimensions mentioned above. If we consider a single cell at the input tensor, we find multiple channels that cover the same variable $X_i$ (e.g., different Gaussian components in case of continuous data or different indicators in case of discrete data). Since channels within a cell cover the same variable, they have the same singleton scope $\{X_i\}$. In other words, scopes within a cell of the input tensor are homogeneous. In contrast, for image data, different cells at the input correspond to different pixels.
If we take any pair of cells, we can say that the scopes of these cells are heterogeneous. By ensuring that sum layers and product layers preserve within-cell homogeneity, across-cell heterogeneity, completeness, and decomposability, we can derive valid convolutional SPN architectures. We now elaborate on how to implement and parameterize such spatial layers.
A sum is complete if its children have identical scopes. Within-cell homogeneity and across-cell heterogeneity dictate that, at each level, sums should only be connected to a single input cell. Yet, multiple single-cell sums can be added to form an arbitrary number of output channels. Hence, the spatial layout of the scopes remains unchanged and the validity propagates up the network.
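In linear space, such a spatial sum layer amounts to a 1 × 1 convolution over the channel axis: each output channel at a cell mixes only the input channels of that same cell. A small sketch (illustrative names; computed naively in linear space rather than with a numerically stable log-sum-exp):

```python
import numpy as np

def spatial_sum(log_probs, weights):
    # log_probs: [H, W, C_in] log-values; weights: [C_out, C_in] with
    # non-negative rows summing to 1. Each cell (i, j) is mixed
    # independently, so spatial scopes are untouched.
    probs = np.exp(log_probs)
    return np.log(np.einsum("hwc,oc->hwo", probs, weights))

x = np.log(np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 4, 3)))
w = np.array([[0.2, 0.3, 0.5],
              [0.6, 0.3, 0.1]])
y = spatial_sum(x, w)
print(y.shape)  # (4, 4, 2): spatial layout unchanged, new channels
```

Because no cross-cell connections exist, every output sum has children with identical (single-cell) scopes, so completeness holds by construction.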
Products are decomposable if they are connected to children with pairwise disjoint scopes. As a result, products at each level can have children from at most one channel per cell, but cover two or more input cells. At the input layer, it is trivial to see that neighboring cells are not only heterogeneous, but also pairwise disjoint. Hence, the products on top of an input layer can join scopes by taking small patches of several cells while selecting only one input channel per cell.
Figure 2: An example of a 'vanilla' convolutional SPN. As opposed to the more general DGC-SPNs depicted in Figure 3, such an architecture does not allow for spatially overlapping patches. Node types are indicated by different colors and some connections are highlighted to improve readability.
Figure 3: An illustration of a DGC-SPN simplified to 1D. Connections for only one cell per layer are highlighted for readability. Layer 0 contains leaf distributions, where each channel corresponds to an indicator for discrete variables or a distribution component (e.g., Gaussian) for continuous variables. Every product layer doubles the dilation rate, starting at a rate of 1. The scopes are indicated by the numbers within each node. Padding nodes have a fixed probability of 1 (or 0 in log-space). The nodes of a single cell share the same scope. All children of the root node R have a scope that contains all input variables.
SPNs are implemented to propagate log-probabilities to avoid underflow. Hence, the local patches of products become local patches of sums. In previous works, such local products in log-space were computed through sum pooling (Sharir et al., 2016; Butz et al., 2019), so that the number of input channels equals the number of output channels. We propose a more general alternative that implements local products in log-space through convolutions using kernels with one-hot weights per cell. One-hot weights are needed so that only one channel per cell has a coefficient of 1, while all other channels have a zero coefficient. We refer to such products realized using convolutions as convolutional log-products (CLPs). In Section 4.5, we describe an even more general view of CLPs.
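A 1D sketch of a CLP (our own illustration, not the paper's GPU-optimized implementation): the product over a patch becomes a sum of log-probabilities, computed by a convolution whose kernel is one-hot along the channel axis.

```python
import numpy as np

def clp_1d(log_probs, channel_choice):
    # log_probs: [width, channels]; channel_choice: one channel index per
    # cell of a size-2 patch, e.g. (0, 1) selects channel 0 at the left
    # cell and channel 1 at the right cell.
    W, C = log_probs.shape
    kernel = np.zeros((2, C))
    kernel[0, channel_choice[0]] = 1.0  # one-hot per cell: exactly one
    kernel[1, channel_choice[1]] = 1.0  # channel enters each product
    out = np.empty(W - 1)
    for i in range(W - 1):              # stride 1 -> overlapping patches
        out[i] = np.sum(kernel * log_probs[i:i + 2])
    return out

logp = np.log(np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]))
print(np.exp(clp_1d(logp, (0, 1))))  # products 0.5*0.8 and 0.2*0.1
```

Each distinct channel choice yields one output channel, so enumerating the choices recovers the combinations of child nodes discussed next.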
In general, there are $C^t$ combinations of child nodes per patch, where $C$ is the number of input channels and $t$ is the number of cells under a patch. Consequently, there are at most $C^t$ output channels in a CLP layer. To limit that number, one can take an arbitrary subset of combinations of child nodes. In this context, sum pooling can be seen as a special case of CLPs where the number of output channels equals the number of input channels. Figure 2 displays a convolutional SPN using CLPs highlighted by the blue and red lines. A red line indicates that the product is connected to the top channel of the previous layer, while a blue line indicates that the product is connected to the bottom channel of the previous layer.

Existing approaches to convolutional SPNs (Sharir et al., 2016; Butz et al., 2019) use non-overlapping patches for their products. Hence, neighboring cells in the output of such layers are not only heterogeneous, but also pairwise disjoint. However, non-overlapping patches require strides larger than 1, thus skipping many combinations of input cells and yielding sub-optimal feature coverage. In fact, this becomes visibly apparent in patch-wise artifacts in image completions (Butz et al., 2019).

To overcome these limitations, DGC-SPNs use generalized convolutional log-products (GCLPs). A GCLP is obtained by (i) 'full' padding (Dumoulin and Visin, 2016), (ii) strides as small as 1, and (iii) exponentially increasing dilation rates. A dilated kernel with a dilation rate of $d$ takes in cells that are $d$ positions apart, leaving gaps of $d - 1$ positions. For example, a kernel of $2 \times 2$ cells with a dilation rate of $2 \times 2$ (the same in both directions) covers a patch of $3 \times 3$ cells of the input. The first layer of GCLPs in Figure 3 uses a dilation rate of 1. We see that the convolutional patches overlap as a consequence of unit strides. Hence, neighboring cells in layer 1 are heterogeneous, but not pairwise disjoint. This forbids us from applying another convolution with a dilation rate of 1.
Instead, we exponentially increase the dilation rate to obtain kernels that 'skip' the input cells that would otherwise yield non-disjoint children. The exponentially increasing dilation rates yield patches covering pairwise disjoint scopes. Hence, GCLPs exhibit significantly improved feature coverage while preserving decomposability. Full padding is required to ensure uniform dimension sizes in tensors along each axis, allowing the use of GPU-optimized convolution implementations.

For readability, the above explanation of DGC-SPNs focuses on 1 spatial dimension and kernel sizes of 2 for all GCLPs. Nevertheless, our architecture generalizes to any number of spatial dimensions and to kernel sizes that vary per layer, as shown in our experiments.
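The dilation schedule can be computed directly from the kernel sizes; a small sketch consistent with footnote 2 (the function name is ours):

```python
def dilation_rates(kernel_sizes):
    # Layer l uses a dilation equal to the product of all preceding
    # kernel sizes, so each product joins pairwise disjoint scopes.
    # For kernel size 2 everywhere this doubles the rate per layer.
    rates, d = [], 1
    for k in kernel_sizes:
        rates.append(d)
        d *= k
    return rates

print(dilation_rates([2, 2, 2, 2]))  # [1, 2, 4, 8]
print(dilation_rates([2, 3, 2]))     # [1, 2, 6] for varying kernels
```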
5. Experiments
We evaluated the generative and discriminative capabilities of DGC-SPNs on two visual tasks: image inpainting and image classification. All experiments were performed using our recently open-sourced libspn-keras library. We first describe the experimental setup for both types of experiments.
We assessed the generative capabilities of DGC-SPNs for unsupervised image inpainting on a varied collection of image types and datasets. We used images of objects belonging to 101 categories from the Caltech 101 dataset (Fei-Fei et al., 2004) and images of faces from the Olivetti dataset (Samaria and Harter, 1994), for which we employed the same crops and train and test splits as in
2. In case of varying kernel sizes, the GCLP dilation rates must equal the product of all preceding GCLP kernel sizes.
(Poon and Domingos, 2011) to ensure a fair comparison. In addition, we used the MNIST handwritten digits dataset (Lecun et al., 1998) and the Fashion MNIST dataset (Xiao et al., 2017) with images of clothing. For these datasets, we used the default train and test splits. All generative experiments were unsupervised, and any class labels in the datasets were ignored. Pixels were normalized sample-wise by subtracting the mean and dividing by the standard deviation for each image.

To build DGC-SPNs, we started with a Gaussian leaf layer followed by a stack of alternating GCLP layers and spatial sum layers. The first GCLP layer computed all possible combinations of products under each patch through one-hot kernels. For the remaining GCLP layers, we used depth-wise convolutions (i.e., sum pooling). All GCLP layers considered $2 \times 2$ patches with exponentially increasing dilation rates. Between each pair of GCLP layers was a spatial sum layer with 16 output channels. Finally, the top GCLP layer consisted of cells with all variables in their scopes, so that their receptive field covered the full image. This layer was followed by a root sum node.

We trained our generative SPNs with online hard EM using a batch size of 128 for 15 epochs. Although hard EM formally requires MPE inference in the forward pass, following the suggestion in (Poon and Domingos, 2011), we used marginal inference instead. For each hard EM iteration, the sum weights were obtained by normalizing the MPE path accumulators with additive smoothing dependent on the number of weights per sum: $w_i = \frac{c_i + \epsilon}{\sum_j c_j + \epsilon}$, where the smoothing constant $\epsilon$ is scaled inversely by the number of weights $|w|$ of the sum. The Gaussian leaf layer was parameterized by 4 univariate components per pixel. Following (Poon and Domingos, 2011), for each pixel, the values over all samples in the train set were divided over 4 quantiles.
The mean of the $i$-th quantile was used as the value of the mean of the $i$-th Gaussian component at the corresponding pixel, and variances were set to 1.

For vanilla hard EM, we observed that the values of nodes sharing the same sum as a parent start to converge as the depth of the SPN increases. Hence, the impact of the sum weights gradually increases, eventually overcoming the impact of the values of the nodes for winning child selection. This forms a self-amplifying chain of signals that only follow the sum child with the maximum weight. Relying on unweighted sum inputs (USI) (Kalra et al., 2018) for selection of the winning child mitigates that effect and allows for capturing more complex data patterns during training.

We performed image inpainting by computing the marginal posterior probability at the Gaussian leaves through partial derivatives (Darwiche, 2003). The predicted pixel value is obtained by linearly combining the modes of the leaf components using the marginal posterior probability (Poon and Domingos, 2011).

We assessed the discriminative abilities of DGC-SPNs on the task of image classification. We used the MNIST (Lecun et al., 1998) and Fashion MNIST (Xiao et al., 2017) datasets, as these were standard benchmarks for discriminative learning of SPNs in previous works. We performed sample-wise normalization of pixel values in the same way as for the generative case.

To build DGC-SPNs for these datasets, we used a Gaussian leaf layer with 32 components per pixel followed by a stack of alternating GCLP and spatial sum layers. For the generative experiments, we applied unit strides and exponentially increasing dilation rates with zero padding for all GCLPs. In contrast, here we obtained better results when first using two GCLP layers with non-overlapping strides and without zero padding. For the remaining GCLP layers, we applied overlapping strides, exponentially increasing dilations, and padding. All GCLP layers used $2 \times 2$ patches.
We used 64 channels for the first 3 spatial sum layers and 128 channels afterwards.
Table 1: Results of generative experiments (averaged over 5 runs). We trained DGC-SPNs with and without unweighted sum inputs (USI) and report the MSE of predictions for the occluded parts of images.

Dataset      Authors                      Method          Bottom   Left
Olivetti     (Butz et al., 2019)          DCSPN           —        —
Olivetti     (Poon and Domingos, 2011)    APVAHRS         918      942
Olivetti     (Dennis and Ventura, 2012)   ClusterVars     820      814
Olivetti     Ours                         DGC-SPN         804      847
Olivetti     Ours                         DGC-SPN + USI   735      811
Caltech 101  (Poon and Domingos, 2011)    APVAHRS         —        —
Caltech 101  Ours                         DGC-SPN         —        —
Caltech 101  Ours                         DGC-SPN + USI   —        —
MNIST        Ours                         DGC-SPN         —        —
MNIST        Ours                         DGC-SPN + USI   —        —
FMNIST       Ours                         DGC-SPN         —        —
FMNIST       Ours                         DGC-SPN + USI   —        —
We used the Adam optimizer (Kingma and Ba, 2014) with its default settings in Keras (learning rate $\alpha = 10^{-3}$, decay rates $\beta_1 = 0.9$, $\beta_2 = 0.999$). We parameterized the sums with log-space accumulators, denoted $c'_i = \log(c_i)$, so that additional projections onto $\mathbb{R}^+$ were not necessary after gradient updates (Peharz et al., 2019). Gaussian means were initialized at equidistant points within a fixed symmetric interval per pixel, and variances were initialized to 1. In contrast to the generative case, the parameters of the Gaussian leaves were adapted after initialization as part of the learning procedure. We used cross-entropy as the loss function.

For regularization, we applied product dropout (PD) (Peharz et al., 2019) by randomly setting product outputs to zero with a fixed rate $p_{\mathrm{PD}}$ throughout the entire network. We also applied input dropout (ID) by setting the components of a variable $X_i$ to 1 at random with a fixed rate $p_{\mathrm{ID}}$ (Peharz et al., 2019), as if the dropped-out variables were excluded from the evidence. Finally, we used a batch size of 64 and trained our SPNs for 400 epochs.

Table 1 shows the results for generative learning and the task of unsupervised image inpainting. We tasked the models with recreating half of an image based on its other half, an extremely difficult problem for a deep architecture. We masked either the left or the bottom half of the images. The performance was assessed based on the mean squared error (MSE) between predicted pixel values and the original images. Pixels were scaled to a common fixed range when computing the MSE.

In all experiments, DGC-SPNs outperformed all other approaches, indicating that the generality of our convolutional architecture translates into practical benefits for image data. Figure 4 shows a random selection of completions for images from the Olivetti dataset for one of our experiments. The performance gains are apparent even without the use of the unweighted sum inputs heuristic (USI in Table 1).
However, using unweighted sum inputs with hard EM further improves MSEs for all datasets. To the best of our knowledge, our results provide the first quantitative evaluation of this heuristic in the SPN literature. We suggest that its benefits can be attributed to the fact that exponentially decreasing variances cause strong biases in path selection, which can be mitigated by reducing the effect of weights on this process.

3. MSEs for (Butz et al., 2019) corrected following github.com/jhonatanoliveira/dcspns/issues/1.

Figure 4: Random selection of original images (top row) and inpainting results for left-occluded (middle row) and bottom-occluded (bottom row) test images from the Olivetti dataset.

Table 2: Results of discriminative experiments (averaged over 5 runs).

Dataset  Algorithm    Architecture  Authors                  Accuracy
MNIST    EBW          APVAHRS       (Rashwan et al., 2018)   —
MNIST    DSPN-SVD     Learned       (Adel et al., 2015)      —
MNIST    Prometheus   Learned       (Jaini et al., 2018)     —
MNIST    SGD          CNN + SPN     (Hartmann, 2014)         —
MNIST    Adam         RAT-SPN       (Peharz et al., 2019)    —
MNIST    Adam         DGC-SPN       Ours                     —
FMNIST   Adam         RAT-SPN       (Peharz et al., 2019)    —
FMNIST   Adam         DGC-SPN       Ours                     —

The results for the task of image classification are shown in Table 2. In the discriminative learning case, DGC-SPNs again outperformed all other SPN approaches in the literature for both the MNIST and Fashion MNIST datasets.

Several other SPN approaches, including (Gens and Domingos, 2012; Hartmann, 2014), require sub-SPNs per class, after which the class-specific SPNs are combined by a single sum node at the root of the SPN. Consequently, these SPN architectures scale poorly with an increasing number of classes. In contrast, DGC-SPNs use only a single stack of product and sum layers shared by all classes, followed by a top layer of $K$ sums (where $K$ is the number of classes) and a root node, resulting in greatly improved scalability.
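The shared-trunk class head can be sketched numerically as follows (a toy illustration with hypothetical names, not the actual DGC-SPN layers): a single stack of shared features feeds $K$ class sums, and the class posterior is obtained by normalizing their outputs.

```python
import numpy as np

def class_posteriors(shared_log_features, class_weights):
    # shared_log_features: [F] log-values from the shared product/sum stack.
    # class_weights: [K, F] non-negative mixture weights, rows sum to 1.
    # Class sum k computes log(sum_f w_kf * exp(feature_f)); normalizing
    # across the K class sums yields P(class | x).
    logits = np.log(class_weights @ np.exp(shared_log_features))
    return np.exp(logits - np.logaddexp.reduce(logits))

feats = np.log(np.array([0.2, 0.5, 0.3]))  # 3 shared features, 2 classes
w = np.array([[0.6, 0.2, 0.2],
              [0.1, 0.8, 0.1]])
print(class_posteriors(feats, w))  # sums to 1 across classes
```

Only the final [K, F] weight matrix grows with the number of classes, while the shared trunk stays fixed, which is the source of the improved scalability noted above.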
6. Conclusions
This paper introduced DGC-SPNs, a novel, scalable, deep convolutional architecture for spatial and image data, applicable to both generative and discriminative tasks. DGC-SPNs are the most general realization of convolutional SPNs to date, allowing for overlapping convolution patches without breaking validity thanks to a unique parameterization of strides and dilations. This translates to a significant improvement in performance. In our experiments, DGC-SPNs offered state-of-the-art results compared to related SPN architectures on several visual tasks and datasets.
DGC-SPNs are fully probabilistic, naturally deal with missing inputs, and are capable of efficient joint, marginal, and conditional queries over complex, noisy data. The general design of the model permits its application to a wide range of domains involving spatial data beyond images and two dimensions. By releasing an implementation based on the well-established Keras and TensorFlow frameworks, we hope to encourage further development and application of spatial SPN architectures.
7. Acknowledgements
This work was supported by the Swedish Research Council (VR) project 2012-4907 SKAEENet. We would like to thank Avinash Raganath for his indispensable contributions to the LibSPN library and insightful discussions during the time we spent together at KTH. Last but not least, we would like to express our gratitude to Prof. Rajesh P. N. Rao for his unwavering support, encouragement, and invaluable advice.
References
T. Adel, D. Balduzzi, and A. Ghodsi. Learning the Structure of Sum-Product Networks via an SVD-based Algorithm. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 32–41, 2015.

M. Amer and S. Todorovic. Sum Product Networks for Activity Recognition. Transactions on Pattern Analysis and Machine Intelligence, 38(4), 2015.

C. Butz, J. Oliveira, A. dos Santos, and A. Teixeira. Deep Convolutional Sum-Product Networks. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), 2019.

A. Darwiche. A Differential Approach to Inference in Bayesian Networks. J. ACM, 50(3):280–305, 2003.

A. Dennis and D. Ventura. Learning the Architecture of Sum-Product Networks Using Clustering on Variables. In Advances in Neural Information Processing Systems 25, pages 2033–2041, 2012.

V. Dumoulin and F. Visin. A Guide to Convolution Arithmetic for Deep Learning. arXiv preprint arXiv:1603.07285, 2016.

L. Fei-Fei, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories. Pages 178–178, June 2004.

R. Gens and P. Domingos. Discriminative Learning of Sum-Product Networks. In Advances in Neural Information Processing Systems 25, pages 3239–3247, 2012.

T. Hartmann. Discriminative Convolutional Sum-Product Networks on GPU. PhD thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, 2014.

P. Jaini, A. Ghose, and P. Poupart. Prometheus: Directly Learning Acyclic Directed Graph Structures for Sum-Product Networks. In Proceedings of the Ninth International Conference on Probabilistic Graphical Models, pages 181–192, 2018.

A. Kalra, A. Rashwan, W.-S. Hsu, P. Poupart, P. Doshi, and G. Trimponias. Online Structure Learning for Feed-Forward and Recurrent Sum-Product Networks. In Advances in Neural Information Processing Systems 31, pages 6944–6954, 2018.

D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980, 2014.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

R. Peharz. Foundations of Sum-Product Networks for Probabilistic Modeling. PhD thesis, Graz University of Technology, 2015.

R. Peharz, A. Vergari, K. Stelzner, A. Molina, X. Shao, M. Trapp, K. Kersting, and Z. Ghahramani. Random Sum-Product Networks: A Simple and Effective Approach to Probabilistic Deep Learning. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence, 2019.

H. Poon and P. Domingos. Sum-Product Networks: A New Deep Architecture. Pages 689–690, IEEE, 2011.

A. Pronobis and R. P. N. Rao. Learning Deep Generative Spatial Models for Mobile Robots. In Proceedings of the 2017 International Conference on Intelligent Robots and Systems (IROS), 2017.

A. Radford, L. Metz, and S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434, 2015.

A. Rashwan, P. Poupart, and C. Zhitang. Discriminative Training of Sum-Product Networks by Extended Baum-Welch. In Proceedings of the Ninth International Conference on Probabilistic Graphical Models, volume 72 of Proceedings of Machine Learning Research, pages 356–367, 2018.

F. S. Samaria and A. C. Harter. Parameterisation of a Stochastic Model for Human Face Identification. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, pages 138–142, 1994.

O. Sharir, R. Tamari, N. Cohen, and A. Shashua. Tensorial Mixture Models. arXiv preprint arXiv:1610.04167, 2016.

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017.

H. Zhao, P. Poupart, and G. J. Gordon. A Unified Approach for Learning the Parameters of Sum-Product Networks. In Advances in Neural Information Processing Systems 29, pages 433–441, 2016.

K. Zheng, A. Pronobis, and R. P. N. Rao. Learning Graph-Structured Sum-Product Networks for Probabilistic Semantic Maps. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 4547–4555, 2018.