Multi-Object Classification and Unsupervised Scene Understanding Using Deep Learning Features and Latent Tree Probabilistic Models
Tejaswi Nimmagadda
University of California, Irvine [email protected]
Anima Anandkumar
University of California, Irvine [email protected]
Abstract
Deep learning has shown state-of-the-art classification performance on datasets such as ImageNet, which contain a single object in each image. However, multi-object classification is far more challenging. We present a unified framework which leverages the strengths of multiple machine learning methods, viz. deep learning, probabilistic models, and kernel methods, to obtain state-of-the-art performance on Microsoft COCO, consisting of non-iconic images. We incorporate contextual information in natural images through a conditional latent tree probabilistic model (CLTM), where the object co-occurrences are conditioned on the fc7 features extracted from a pre-trained ImageNet CNN. We learn the CLTM tree structure using conditional pairwise probabilities for object co-occurrences, estimated through kernel methods, and we learn its node and edge potentials by training a new 3-layer neural network, which takes fc7 features as input. Object classification is carried out via inference on the learnt conditional tree model, and we obtain significant gains in precision-recall and F-measures on MS-COCO, especially for difficult object categories. Moreover, the latent variables in the CLTM capture scene information: the images with top activations for a latent node have common themes, such as being a grassland or a food scene, and so on. In addition, we show that a simple k-means clustering of the inferred latent nodes alone significantly improves scene classification performance on the MIT-Indoor dataset, without the need for any retraining, and without using scene labels during training. Thus, we present a unified framework for multi-object classification and unsupervised scene understanding.
Deep learning has revolutionized performance on a variety of computer vision tasks such as object classification and localization, scene parsing, human pose estimation, and so on. Yet, most deep learning works focus on simple classifiers at the output, and train on datasets such as ImageNet which consist of single object categories. On the other hand, multi-object classification is a far more challenging problem. Currently, many frameworks for multi-object classification use simple approaches: the multi-class setting, which predicts one category out of a set of mutually exclusive categories (e.g. ILSVRC [22]), or binary classification, which makes binary decisions for each label independently (e.g. PASCAL VOC [8]). Neither model, however, captures the complexity of labels in natural images. The labels are not mutually exclusive, as assumed in the multi-class setting. Independent binary classifiers, on the other hand, ignore the relationships between labels and miss the opportunity to transfer and share knowledge among different label categories during learning. More sophisticated classification techniques based on structured prediction are being explored, but in general, they are computationally more expensive and not scalable to large datasets (see related work for a discussion).

In this paper, we propose an efficient multi-object classification framework that incorporates contextual information in images. The context in natural images captures relationships between various object categories, such as co-occurrence of objects within a scene or relative positions of objects with respect to a background scene.
Incorporating such contextual information can vastly improve detection performance, eliminate false positives, and provide a coherent scene interpretation. We present an efficient and unified approach to learn contextual information through probabilistic latent variable models, and combine it with pre-trained deep learning features to obtain a state-of-the-art multi-object classification system. It is known that deep learning produces transferable features, which can be used to learn new tasks that differ from the tasks on which the neural networks were trained [26, 21]. Here, we demonstrate that the transferability of pre-trained deep learning features can be further enhanced by capturing the contextual information in images.

We model the contextual dependencies using a conditional latent tree model (CLTM), where we condition on the pre-trained deep learning features as input. This allows us to incorporate the joint effects of both the pre-trained features and the context for object classification. Note that a hierarchical tree structure is natural for capturing the groupings of various object categories in images; the latent or hidden variables capture the "group" labels of objects. Unlike previous works, we do not impose a fixed tree structure, or even a fixed number of latent variables, but learn a flexible structure efficiently from data. Moreover, since we make these "group" variables latent, there is no need to have access to group labels during training, and we learn the object groups or scene categories in an unsupervised manner. Thus, in addition to efficient multi-object classification, we also learn latent variables that capture semantic information about the scene in an unsupervised manner.
We propose a unified framework for multi-object classification and scene understanding that combines the strengths of multiple machine learning techniques, viz. deep learning, probabilistic models, and kernel methods. We demonstrate significant improvement over state-of-the-art deep learning methods, especially on challenging objects. We learn a conditional latent tree model, where we condition on pre-trained deep learning features. We employ kernel methods to learn the structure of the hierarchical tree model, and we train a new, smaller neural network to learn the node and edge potentials of the model. Multi-object classification is carried out via inference on the tree. All these steps are efficient and scalable to large datasets with a large number of object categories.

We extract features using the pre-trained ImageNet CNN [15] from Caffe [12], and use them as input to the conditional latent tree model (CLTM), a type of conditional random field (CRF). The tree dependency structure for this model is recovered using distance-based methods [4], which require pairwise conditional probabilities of object co-occurrences, conditioned on the input features. We employ the kernel conditional embedding framework [23] to compute these pairwise measures. Using a feed-forward neural network, we train the above energy-based model; the outputs of this neural network yield the node and edge potentials of the CLTM. We test the performance of multi-object classification on the non-iconic image set Microsoft COCO [20], and we test its unsupervised scene learning capabilities on the MIT-Indoor dataset [13].

We recover a natural, coherent tree structure on the MS-COCO dataset, using training images, each of which contains only a few object categories. For instance, objects (e.g. table, chair and couch) that appear in a given scene (living room) are grouped together.
Using our approach, precision-recall performance and F-measures are significantly improved compared to the baseline of a 3-layer neural network with independent binary classifiers, which also takes fc7 features as input. We see across-the-board improvement for all object categories over the entire precision-recall curve. The overall relative gain in F-measure for our method is 7%. For difficult objects like couch, frisbee, cup, bowl, remote, fork, and wine glass, the relative gains in F-measure are 41%, 48%, 50%, 53%, 113%, 122%, and 171%, respectively. Thus, we combine pre-trained deep learning features and the learnt contextual model to obtain state-of-the-art multi-object classification performance.

We also demonstrate how latent nodes can be used for unsupervised scene understanding, without using any scene labels during training. We observe that latent nodes capture high-level semantic information common to images, based on the neighborhoods of object categories in the latent tree. When we consider the top images with the largest activations of the node potential for a given latent node, we find diverse images with different objects, but with a unifying common theme. For instance, for one of the latent variables, the top images capture a grassland scene, but with different animals in different images. Similarly, the latent variable representing an outdoor scene contains diverse images with traffic, beaches, and buildings. As another example, the latent variable representing the food scene shows foods of various different kinds. Thus, we present a flexible framework for capturing thematic information in images in an unsupervised manner.

We also quantitatively show that the latent variables yield efficient scene classification performance on the MIT-Indoor dataset, without any re-training, and without using any scene labels during training. We use the marginal probabilities of the latent variables in our model on test images, and perform k-means clustering.
For validation, we match these clusters to ground-truth scene categories using maximum weight matching [1]. We obtain a 20% improvement in the misclassification rate of the scenes, compared to the neural network baseline. Note that we assume that the scene labels are not present during training for both our method and the neural network baseline. Thus, we demonstrate that our model is capable of capturing rich semantic information about the scenes, without using any scene labels during the training process.

Thus, we present a carefully engineered, unified framework for multi-object classification that combines the strengths of diverse machine learning techniques. While general non-parametric methods are computationally expensive and not scalable to large datasets, we employ kernel methods only to estimate pairwise conditional probabilities, which can be carried out efficiently using randomized matrix techniques [7]. Our tree structure estimation is scalable to large datasets using recent advances in parallel techniques for structure estimation [11]. Instead of training a large neural network from scratch, we train a smaller one, and we use an energy-based model at its output to obtain the node and edge potentials of the latent tree model. Finally, at test time, we have "lightning" fast inference using message passing on the tree model. Thus, we present an efficient and scalable framework for handling large image datasets with a large number of object categories.

Correlations between labels have been explored for detecting multiple object categories before. [6, 5] learn contextual relations between co-occurring objects using a tree-structured graphical model to capture dependencies among different objects. In this model, they incorporate dependencies between object categories, and outputs of local detectors, into one probabilistic framework.
However, simple pre-trained object detectors are typically noisy and lead to performance degradation. In contrast, we employ pre-trained deep learning features as input, and consider a conditional model for context, given the features. This allows us to incorporate both deep learning features and context into our framework.

In many settings, the hierarchical structure representing the contextual relations between different objects is fixed and based on semantic similarity [10], or may rely on text in addition to image information [19]. In contrast, we learn the tree structure from data efficiently; thus, the framework can be adapted to settings where such a tree may not be available, and even if available, may not give the best performance for multi-object classification.

Using pre-trained ImageNet features for other computer vision tasks has been popular in a number of recent works, e.g. [26, 9, 21]. [9] term this supervised pre-training and employ it to train regional convolutional neural networks (R-CNN) for object localization. We note that our framework can be extended to localization, and we plan to pursue this in future work. While [9] employ independent SVM classifiers for each class, we believe that incorporating our probabilistic framework for multi-object localization can significantly improve performance. Recently, [27] proposed improving object detection using Bayesian optimization for fine-grained search and a structured loss function that aims at both classification and localization. We believe that incorporating probabilistic contextual models can further improve performance in these settings.

Recent papers also incorporate deep learning for scene classification. [29, 28] introduce the Places dataset and use CNNs for scene classification. In this framework, scene labels are available during training, while we do not assume access to these labels during our training process.
We demonstrate how introducing latent variables can automatically capture semantic information about the scenes, without the need for labeled data.

Scene understanding is a very rich and active area of computer vision and consists of a variety of tasks such as object localization, pixel labeling, segmentation and so on, in addition to classification tasks. [18] propose a hierarchical generative model that performs multiple tasks in a coherent manner. [17] also consider the use of context by taking into account the spatial location of the regions of interest. While there is a large body of such works which use contextual information (see for instance [17]), they mostly do not incorporate latent variables in their modeling. In the future, we plan to extend our framework to these various scene understanding tasks and expect significant improvement over existing methodologies.

There have been some recent attempts to combine neural networks with probabilistic models. For example, [2] propose to combine CRF and auto-encoder frameworks for unsupervised learning. Markov random fields are employed for pose estimation to encode the spatial relationships between joint locations in [24]. [3] propose a joint framework for deep learning and probabilistic models. They learn deep features which take into account dependencies between output variables. While they train an 8-layer deep network from scratch to learn the potential functions of an MRF, we show how a simpler network can be used if we employ pre-trained features as input to the conditional model. Moreover, we incorporate latent variables that allow us to use a simple tree model, leading to faster training and inference.
Finally, while many works have used MS-COCO for captioning and joint image-text related tasks [14, 25], to the best of our knowledge there have been no attempts to improve multi-object classification over standard deep learning techniques using images alone on MS-COCO, without the text data.

The rest of this paper is organized as follows. Section 2 presents an overview of the model. Section 3 presents the structure learning method using the input distribution of fc7 features. In Section 4, we discuss how we train the CLTM using neural networks. In Section 5, we evaluate the proposed model on the MS-COCO dataset and discuss the results. Finally, Section 6 concludes the paper.
We consider the pre-trained ImageNet CNN [15] as a fixed feature extractor, taking the fc7 layer (a 4096-D vector) as the feature vector for a given input image. We denote this extracted feature as x_i for the i-th image. It is also demonstrated in [26] that such feature vectors can be effectively used for different tasks with different labels. The goal here is to learn models which can label an image with the multiple object categories present in it. Our model predicts a structured output y ∈ {0,1}^L. To achieve this goal, we use a dependency structure that relates different object labels. Such a dependency structure should be able to capture pairwise probabilities of object labels conditioned on input features. We model this dependency structure using a latent tree. Firstly, latent trees allow for more complex structures of dependence compared to a fully observed tree. Secondly, inference on them is tractable.

Algorithm 1 Overview of the Framework
Require:
  Labeled image set I = {(I_1, y_1), ..., (I_n, y_n)}
  {x_1, x_2, ..., x_n} ← ExtractFc7Features(I)
  Estimate the conditional distance matrix using kernel methods:
    D ← CondDistanceMatrix({(x_1, y_1), ..., (x_n, y_n)})
  Extract the tree structure using [4]:
    T ← CLRG(D)
  Train a neural network with randomly initialized weights W:
  repeat
    Randomly select a mini-batch M
    Compute the negative marginalized log-likelihood loss, Eqn. (2):
      L ← Loss(W, T, M)
    W ← BackpropagateGradient(L)
  until convergence
  Given a test image T:
    x_t ← ExtractFc7Features(T)
    Potentials ← FeedForward(W, x_t)
    Prediction: y ← arg min_Y Energy(Y, Potentials)
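As a toy, self-contained sketch of the structure-learning half of Algorithm 1 (all sizes and data are synthetic stand-ins): labels are generated at random in place of real annotations, the distance matrix below uses plain unconditional co-occurrence moments rather than the kernel-conditional version the paper describes, and a minimum spanning tree stands in for the CLGrouping step.

```python
import numpy as np

# Toy pipeline: multi-label matrix -> information-style distance matrix ->
# tree over the labels. Stand-ins only; not the paper's full kernel method.
rng = np.random.default_rng(0)
n, L = 200, 5                                   # toy: images, object labels
Y = (rng.random((n, L)) < 0.3).astype(float)    # toy multi-label matrix

# Distance from second moments: d(k,t) = -log( E[y_k y_t] / sqrt(E[y_k^2] E[y_t^2]) )
M = (Y.T @ Y) / n + 1e-9
D = -np.log(M / np.sqrt(np.outer(np.diag(M), np.diag(M))))

# Minimum spanning tree over D (Prim's algorithm) as the tree-recovery step.
in_tree, edges = {0}, []
while len(in_tree) < L:
    k, t = min(((k, t) for k in in_tree for t in range(L) if t not in in_tree),
               key=lambda e: D[e])
    edges.append((k, t))
    in_tree.add(t)

print(edges)  # L - 1 edges of the recovered tree over the labels
```

Replacing the toy distance with the conditional one from Section 3, and the spanning tree with CLGrouping (which can also introduce latent nodes), recovers the actual procedure.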
Figure 1: Our model takes fc7 features as input and generates node potentials at the output layer of a given neural network. Using these node potentials, our model outputs the MAP configuration and marginal probabilities of observed and latent nodes.

We estimate probabilities of object co-occurrences conditioned on the input fc7 features. We then use a distance-based algorithm to recover the structure from the estimated distance matrix. Once we recover the structure, we model the distribution of observed labels and latent nodes for given input covariates as a discriminative model. We use the conditional latent tree model, a class of CRF belonging to the exponential family of distributions, to model the distribution of output variables given an input. Instead of restricting the potentials (factors) to linear functions of the covariates, we generalize potentials to functions represented by the outputs of a neural network. For a given neural network architecture which takes X as input, we learn the weights W by backpropagating the gradient of the marginalized log-likelihood of the output binary variables. Once we train the given neural network, we consider its outputs as potentials for estimating marginal node beliefs conditioned on the input covariates X. Our model also yields the MAP configuration for given input covariates X. Algorithm 1 gives an overview of our framework.

Using non-parametric methods for end-to-end tasks on large datasets is computationally expensive. So, we restrict kernel methods to evaluating pairwise conditional probabilities only, and here we can use randomized matrix methods to efficiently scale the computations [7]. The tree structure is estimated through the CL grouping algorithm from [4]. Although the method in [4] is serial, we note that there have recently been parallel versions of this method [11]. We then train neural networks to output node and edge potentials for the CLTM. Finally, detection is carried out via inference on the tree model through message-passing algorithms.
Thus, we have an efficient procedure for multi-object detection in images.

We denote the given labeled training set as D = {(x_1, y_1), ..., (x_n, y_n)}, with x_i ∈ R^4096 and y_i ∈ {0,1}^L for all i ∈ {1, 2, ..., n}. We denote the extracted tree by T = (Z, E), where Z indicates the set of observed and latent nodes and E denotes the edge set. Once we recover the structure, we use the conditional latent tree model to model P(Z | X). Conditioned on the input X, we model the distribution of Z as

P(Z | X) = exp( − Σ_{k ∈ Z} φ_k(X, θ) z_k − Σ_{(k,t) ∈ E} φ_{(k,t)}(X, θ) z_k z_t − A(θ, X) ),

where A(θ, X) is the term that normalizes the distribution, also known as the log-partition function, and φ_k(X, θ) and φ_{(k,t)}(X, θ) denote the node and edge potentials of the exponential family distribution, respectively. Instead of restricting the potentials to linear functions of the covariates, we generalize potentials to functions represented by the outputs of a neural network. Sec. 4 explains how we learn the weights of such a neural network.

We learn the dependency structure among object labels from a set of fully labeled images. Traditional distance-based methods use only empirical co-occurrences of objects to learn the structure. Learning a structure that involves strong pairwise relations among objects requires training images to contain many instances of different object categories. In this section, we propose a new structure recovery method without the need for such training sets. This method uses both empirical co-occurrences and the distribution of fc7 features to calculate distances between labels. Since there are very few positive sample images with multiple object categories, training just based on co-occurrence is not sufficient to recover a coherent tree structure. We leverage the extracted features to estimate moments by conditioning on them. We propose a new method to calculate the distance matrix by using an RKHS framework to estimate moments.
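A toy numerical check of the CLTM distribution P(Z | X) defined earlier in this section: for a 3-node chain with fixed numbers standing in for the neural-network potentials, enumerating all configurations verifies that the log-partition function A normalizes the distribution.

```python
import itertools
import numpy as np

# Toy stand-ins for phi_k(x, theta) and phi_kt(x, theta) on a 3-node chain.
phi = np.array([0.5, -1.0, 0.2])            # node potentials
edges = {(0, 1): 0.8, (1, 2): -0.3}         # edge potentials

def energy(z):
    # E(x, z) = sum_k phi_k z_k + sum_(k,t) phi_kt z_k z_t
    return (phi @ z) + sum(w * z[k] * z[t] for (k, t), w in edges.items())

states = [np.array(z) for z in itertools.product([0, 1], repeat=3)]
A = np.log(sum(np.exp(-energy(z)) for z in states))   # log-partition function
probs = {tuple(z): float(np.exp(-energy(z) - A)) for z in states}

print(round(sum(probs.values()), 6))   # normalization holds by construction
```

In the real model the φ values are produced by the trained network for each input x, and the sum over configurations is computed by tree message passing rather than enumeration.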
The estimated distance matrix is then used by distance-based methods for structure recovery [4].

Kernel Embedding of Conditional Distributions
The kernel conditional embedding framework described in [23] gives us methods for modeling conditional and joint distributions. These methods are effective in high-dimensional settings with multi-modal components, such as the current setting.

In the general setting, given transformations φ(X) and Ψ(Y) of X, Y into the RKHS using kernel functions K(x, ·), K′(y, ·), the framework provides the following empirical operators to embed joint distributions into the reproducing kernel Hilbert space (RKHS). Define

Ĉ_XX = (1/N) Σ_{n=1}^{N} φ(x_n) ⊗ φ(x_n),    Ĉ_XY = (1/N) Σ_{n=1}^{N} φ(x_n) ⊗ Ψ(y_n),

and Ĉ_{Y|X} := Ĉ_YX Ĉ_XX^{−1}. We have the following result, which can be used to evaluate Ê_{Y_i Y_j | X}[y_i ⊗ y_j | x] for a given dataset:

Ψ(y)^⊤ Ĉ_{Y|X} φ(x) = Ψ(y)^⊤ Ψ_Y (K_XX + λN I)^{−1} Φ_X^⊤ φ(x).    (1)

We employ Gaussian RBF kernels and use the estimated conditional pairwise probabilities for learning the latent tree structure.

CondDistanceMatrix
Require:
  Input dataset D = {(x_1, y_1), ..., (x_n, y_n)}
  Compute the Gram matrix K_{n×n} using hyper-parameter γ
  for i = 1 to n do
    G = (K + λI)^{−1} K(:, i)
    for all pairs (k, t) with k, t ∈ {1, 2, ..., L} do
      Ê[Y_k ⊗ Y_t | X = x_i] = [y_{1k} ⊗ y_{1t}, y_{2k} ⊗ y_{2t}, ..., y_{nk} ⊗ y_{nt}]^⊤ G
      S_{k,t} = |det(Ê[Y_k ⊗ Y_t | X = x_i])|
    Compute D_i, where D_i[k, t] = −log( S_{k,t} / √(S_{k,k} × S_{t,t}) )
  return D_{L×L} = (1/n) Σ_{i=1}^{n} D_i

A significant amount of work has been done on learning latent tree models. Among the available approaches for latent tree learning, we use the information-distance-based algorithm CLGrouping [4], which has provable computational efficiency guarantees. These algorithms are based on a measure of statistical additive tree distance. For our conditional setting, we use the following form of the distance function:

d̂_{kt} = (1/n) Σ_{i=1}^{n} −log( |det(Ê[Y_k ⊗ Y_t | X = x_i])| / √(S_{k,k} · S_{t,t}) ),

where S_{k,k} := |det(Ê[Y_k ⊗ Y_k | X = x_i])|, and similarly for S_{t,t}, for observed nodes k, t using n samples. We employ CL grouping to learn the tree structure from the estimated distances.

Energy-based learning provides a unified framework for many probabilistic and non-probabilistic approaches to structured output tasks [16], particularly for non-probabilistic training of graphical models and other structured models. Furthermore, the absence of the normalization condition allows for more flexibility in the design of learning machines. Most probabilistic models can be viewed as special types of energy-based models in which the energy function satisfies certain normalizability conditions, and in which the loss function, optimized by learning, has a particular form.
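The CondDistanceMatrix procedure above can be sketched directly in NumPy at toy sizes. For scalar binary labels, Ê[Y_k ⊗ Y_t | X = x_i] is a single number, so the determinant reduces to its absolute value; the regularizer here uses λnI, matching Eqn. (1), and γ, λ and all sizes are toy values.

```python
import numpy as np

# Toy CondDistanceMatrix: RBF Gram matrix, ridge-regularized conditioning
# weights, per-sample conditional second moments, averaged distances.
rng = np.random.default_rng(2)
n, d, L, gamma, lam = 40, 3, 4, 1.0, 1e-2
X = rng.normal(size=(n, d))
Y = (rng.random((n, L)) < 0.5).astype(float)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                            # Gram matrix
G = np.linalg.solve(K + lam * n * np.eye(n), K)    # column i: weights for x_i

D = np.zeros((L, L))
for i in range(n):
    # S[k, t] = |E-hat[Y_k Y_t | X = x_i]| estimated with weights G[:, i]
    S = np.abs(np.einsum('nk,nt,n->kt', Y, Y, G[:, i])) + 1e-12
    D += -np.log(S / np.sqrt(np.outer(np.diag(S), np.diag(S))))
D /= n                                             # averaged over samples

print(D.shape)  # L x L conditional information-distance matrix
```

The resulting D would then be handed to CLGrouping for structure recovery.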
Consider an observed variable X and an output variable Y. Define an energy function E(X, Y) that is minimized when X and Y are compatible. The most compatible Y* given an observed X can be expressed as

Y* = arg min_Y E(Y, X).

The energy function can be expressed as a factor graph, i.e. a sum of energy functions (node and edge potentials) that depend on the input covariates x. Efficient inference procedures for factor graphs can be used to find the optimum configuration Y*. We define the energy function used to model the loss function as

E(x, z, θ) = Σ_{k ∈ Z} φ_k(x, θ) z_k + Σ_{(k,t) ∈ E} φ_{(k,t)}(x, θ) z_k z_t.
Figure 2: F-Measure comparison of individual classes
Training an energy-based model (EBM) consists of finding an energy function that produces the best Y for any X. The search for the best energy function is performed within a family of energy functions indexed by a parameter W. The architecture of the EBM is the internal structure of the parameterized energy function E(W, Y, X). In the case of neural networks, the family of energy functions is the set of neural net architectures and weight values.

For a given neural network architecture, the weights are learned by backpropagating the gradient through some loss function [16]. In the case of structures involving latent variables h, we use the negative marginal log-likelihood loss (2) for training:

L = E[ E(W, x, y, h) | y, x ] − E[ E(W, x, y, h) | x ],    (2)

and the gradient is evaluated as

∂L/∂W = E[ ∂E(W, x, y, h)/∂W | x, y ] − E[ ∂E(W, x, y, h)/∂W | x ].

In this section, we show experimental results of (a) classifying an image into multiple object categories simultaneously and (b) identifying the scenes from which images emerged. We use the non-iconic image dataset MS-COCO [20] to evaluate our model. This dataset contains 83K training images labeled with 80 different object classes. The validation set contains 40K images. We use an independent classifier trained using a 3-layer neural network (Indep. Classifier) as a baseline, and compare precision-recall measures with our proposed conditional latent tree model.
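The marginalized likelihood behind the loss in Eqn. (2) can be made concrete on a toy latent tree: for one observed label vector y and one latent bit h on a 3-node tree (y1, y2 attached to h), −log P(y | x) is computed by enumerating h in the numerator and all (y, h) configurations in the denominator. The potentials are toy stand-ins for network outputs; the gradient of this quantity with respect to the potentials is exactly the difference of clamped and free expectations stated above.

```python
import itertools
import numpy as np

phi = {'y1': 0.3, 'y2': -0.4, 'h': 0.1}          # toy node potentials
edge = {('h', 'y1'): 0.5, ('h', 'y2'): -0.2}     # toy edge potentials

def E(y1, y2, h):
    z = {'y1': y1, 'y2': y2, 'h': h}
    return (sum(phi[k] * z[k] for k in z)
            + sum(w * z[a] * z[b] for (a, b), w in edge.items()))

y_obs = (1, 0)
num = sum(np.exp(-E(*y_obs, h)) for h in (0, 1))            # sum over latent h
den = sum(np.exp(-E(y1, y2, h))                             # full partition sum
          for y1, y2, h in itertools.product((0, 1), repeat=3))
loss = -np.log(num / den)                                    # -log P(y | x)

print(loss)   # strictly positive: the clamped sum is part of the full sum
```

In the full model both sums are computed by message passing on the tree instead of enumeration, which is what makes training scale.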
Implementation
We use our conditional latent tree model as a standalone layer on top of a neural network. The layer takes as input a set of scores φ(x, W) ∈ R^n. These scores correspond to the node potentials of the energy function. To avoid over-fitting, we make the edge potentials independent of the input covariates. Using these potentials, our model outputs the marginal probabilities of all the labels, along with the MAP configuration. During learning, we use stochastic gradient descent and compute ∂L/∂φ, where L is the loss function defined in Eqn. (2). This derivative is then backpropagated to the previous layers represented by φ(x; W). We train the model using a mini-batch size of 250 and dropout. We use the Viterbi message-passing algorithm for exact inference on the conditional latent tree model.
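The exact MAP inference step can be illustrated with min-energy (Viterbi/max-product) message passing on a small star tree: one latent node h connected to three observed labels, with toy numbers standing in for the network's potentials.

```python
import numpy as np

# Toy star tree: latent root h, observed leaves y1..y3.
node = {'h': 0.1, 'y1': -0.6, 'y2': 0.4, 'y3': 0.9}
edge = {('h', 'y1'): -0.3, ('h', 'y2'): 0.2, ('h', 'y3'): 0.5}

# Each leaf sends the root its best (minimum) energy contribution per root state.
def leaf_message(t):
    return [min(node[t] * zt + edge[('h', t)] * zh * zt for zt in (0, 1))
            for zh in (0, 1)]

msgs = {t: leaf_message(t) for t in ('y1', 'y2', 'y3')}
root = [node['h'] * zh + sum(m[zh] for m in msgs.values()) for zh in (0, 1)]
zh = int(np.argmin(root))                      # MAP state of the latent node

# Back-track each leaf's best state given the MAP root state.
ymap = {t: min((0, 1), key=lambda zt: node[t] * zt + edge[('h', t)] * zh * zt)
        for t in ('y1', 'y2', 'y3')}

print(zh, ymap)
```

On a tree, two sweeps of such messages (leaves to root, then back) give the exact minimum-energy configuration, which is why test-time inference is fast.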
Figure 3: Precision-recall comparison: a) all the training images, b) subset of training images containing 2 object categories, and c) subset of training images containing 3 object categories.
Figure 4: Class-wise precision-recall for: a) keyboard, b) baseball glove, c) tennis racket, and d) bed.

Figure 5: Top 12 images producing the largest activation of node potentials for different latent nodes: (from left to right) a latent node with a neighborhood of objects appearing in the living room; one with a neighborhood of objects belonging to the class fruit; one with a neighborhood of objects appearing in outdoor scenes; one with a neighborhood of objects appearing in the kitchen; one with a neighborhood of objects appearing in the forest; and one with a neighborhood of objects appearing on the dining table.

Table 1: F-Measure Comparison

  Model                        Precision  Recall  F-Measure
  1 Layer (Indep. Classifier)  0.715      0.421   0.529
  1 Layer (CLTM)               0.742      0.432   0.546
  2 Layer (Indep. Classifier)  0.722      0.425   0.535
  2 Layer (CLTM)               0.763      0.437   0.556
  3 Layer (Indep. Classifier)  0.731      0.428   0.539
  3 Layer (CLTM)

Figure 6: Heat map of marginal beliefs of nodes activated in different sub-trees for different images.
We use 40K images randomly selected from the training set to learn the tree structure using the distance-based method proposed in Section 3. The recovered tree structure relating the 80 different objects and 22 hidden nodes is shown in the Appendix. From the learned tree structure, we can see that hidden nodes take the role of dividing the tree according to scene category. For instance, the nodes connected to four of the hidden nodes contain objects from the kitchen, bathroom, wild animals and the living room, respectively. Similarly, all the objects that appear in outdoor traffic scenes are clustered around the observed node car. Note that most training images contain fewer than 3 instances of different object categories.

Table 1 shows the comparison of precision, recall and F-measure between the 3-layer neural network independent classifier and conditional latent tree models trained using 1, 2 and 3 layer feed-forward neural networks, respectively. For the 3-layer neural network independent classifier, we use a threshold of 0.5 to make binary decisions for the different object labels. For the CLTM, we use the MAP configuration to make binary decisions. Note that the CLTM improves the F-measure significantly. Fig. 2 shows the comparison of F-measure for each object category between the baseline and the CLTM trained using a 3-layer neural network. Overall, the gain in F-measure using our model is 7% compared to the 3-layer neural network. Note that the F-measure gain for indoor objects is more significant. For difficult objects like skateboard, keyboard, laptop, bowl, cup and wine glass, the F-measure gains are 19%, 20%, 27%, 56%, 50% and 171%, respectively. Fig. 3 shows the precision-recall curves for a) the entire test image set, b) a subset of test images that contain 2 different object categories, and c) a subset of test images that contain 3 different object categories.
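The precision-recall points above can be generated by sweeping a decision threshold over the per-label marginal probabilities against the ground-truth multi-labels. A toy sketch with synthetic marginals (built so that true labels score high) illustrates the micro-averaged computation:

```python
import numpy as np

rng = np.random.default_rng(3)
truth = rng.random((100, 80)) < 0.1                        # toy ground truth
marg = truth * 0.6 + rng.random((100, 80)) * 0.4           # toy marginals

for tau in (0.3, 0.5, 0.7):
    pred = marg >= tau
    tp = (pred & truth).sum()
    prec = tp / max(pred.sum(), 1)                         # micro precision
    rec = tp / max(truth.sum(), 1)                         # micro recall
    f1 = 2 * prec * rec / max(prec + rec, 1e-12)
    print(f"tau={tau}: precision={prec:.2f} recall={rec:.2f} F={f1:.2f}")
```

Sweeping tau finely over [0, 1] traces out a full precision-recall curve of the kind shown in Figs. 3 and 4.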
We consider the marginal probabilities of each observed class produced by our model to measure precision-recall curves for varying threshold values. Fig. 4 compares the precision-recall curves for a subset of object classes: tennis racket, bed, keyboard and baseball glove.

Qualitative Analysis
In this section, we investigate the class of images that triggered the highest activation of node potentials for different latent nodes. Fig. 5 shows the top 12 images from the test set that resulted in the highest activation of different latent nodes. We observe that different latent nodes effectively capture different semantic information common to images containing neighboring object classes. For instance, the top 12 images of six of the latent nodes correspond to classes of images appearing in scenes of forest, dining table, kitchen, living room, traffic, and the fruit category.

The hidden nodes in the CLTM capture scene-relevant information, which can be used to perform scene classification tasks. In this section, we demonstrate the scene classification capabilities of the CLTM. We use 529 images from the MIT-Indoor dataset belonging to 4 different scenes: kitchen, bathroom, living room and bedroom. We perform k-means on the outputs of the CLTM and of the 3-layer neural network independent classifier to cluster the images. We then optimally match these clusters to scenes to evaluate the misclassification rate. Note that we never trained our model using scene labels; we use them only to validate performance. In our experiments, we cluster on: the marginal probabilities of the observed and hidden nodes of the CLTM, the marginal probabilities of the hidden nodes of the CLTM alone, and the class probabilities produced by the 3-layer neural network conditioned on the input features. Table 2 shows the misclassification rates for the different input features used for clustering. Without needing any knowledge of object presence, clustering on the marginal probabilities of the hidden nodes alone resulted in the lowest misclassification rate.

Table 2: Misclassification Rate

  Model                    k=4    k=6
  Observed + Hidden        0.326  0.242
  3-layer neural network   0.390  0.301
  Hidden
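The evaluation protocol above can be sketched end to end on synthetic data: k-means on toy "hidden-node marginal" vectors, then the clusters are optimally matched to ground-truth scene labels (brute force over permutations, which is fine for k = 4) and accuracy is measured. All data here is synthetic.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
k, n_pts, dim = 4, 200, 6
scene = rng.integers(0, k, size=n_pts)                 # toy ground-truth scenes
centers_true = rng.normal(size=(k, dim)) * 3.0
Z = centers_true[scene] + rng.normal(size=(n_pts, dim)) * 0.3  # toy marginals

# Plain Lloyd's k-means.
C = Z[rng.choice(n_pts, size=k, replace=False)]
for _ in range(20):
    assign = np.argmin(((Z[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
    C = np.array([Z[assign == j].mean(axis=0) if (assign == j).any() else C[j]
                  for j in range(k)])

# Confusion counts, then optimal cluster-to-scene matching.
conf = np.zeros((k, k), dtype=int)
for j, s in zip(assign, scene):
    conf[j, s] += 1
best = max(itertools.permutations(range(k)),
           key=lambda p: sum(conf[j, p[j]] for j in range(k)))
acc = sum(conf[j, best[j]] for j in range(k)) / n_pts

print(f"clustering accuracy after matching: {acc:.2f}")
```

For larger k, the brute-force permutation step would be replaced by a maximum weight matching solver, as in the protocol of [1].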
In conclusion, the proposed structure-recovery method recovers the structure of the latent tree. The tree exhibits a natural hierarchy of related objects, placed according to their co-appearance in different scenes. We use neural networks of different architectures to train conditional latent tree models. Evaluating the CLTM on the MS-COCO dataset, we obtain significant gains in precision, recall, and F-measure over the independent 3-layer neural network classifier. The latent nodes capture semantic information that distinguishes high-level image classes, and we use this information for scene labeling in an unsupervised manner. In future work, we aim to model both spatial and co-occurrence knowledge and to apply the model to object localization tasks using CNNs (e.g., R-CNN).