Autoregressive Unsupervised Image Segmentation
Yassine Ouali, Céline Hudelot and Myriam Tami
Université Paris-Saclay, CentraleSupélec, MICS, 91190, Gif-sur-Yvette, France
{yassine.ouali, celine.hudelot, myriam.tami}@centralesupelec.fr

Abstract.
In this work, we propose a new unsupervised image segmentation approach based on mutual information maximization between different constructed views of the inputs. Taking inspiration from autoregressive generative models that predict the current pixel from past pixels in a raster-scan ordering created with masked convolutions, we propose to use different orderings over the inputs, obtained with various forms of masked convolutions, to construct different views of the data. For a given input, the model produces a pair of predictions with two valid orderings, and is then trained to maximize the mutual information between the two outputs. These outputs can either be low-dimensional features for representation learning or output clusters corresponding to semantic labels for clustering. While masked convolutions are used during training, in inference, no masking is applied and we fall back to the standard convolution where the model has access to the full input. The proposed method outperforms the current state-of-the-art on unsupervised image segmentation. It is simple and easy to implement, and can be extended to other visual tasks and integrated seamlessly into existing unsupervised learning methods requiring different views of the data.
Keywords: Image segmentation, Autoregressive models, Unsupervised learning, Clustering, Representation learning.
1 Introduction

Supervised deep learning has enabled great progress and achieved impressive results across a wide number of visual tasks, but it requires large annotated datasets for effective training. Designing such fully-annotated datasets involves a significant effort in terms of data cleansing and manual labeling. This is especially true for fine-grained annotations such as the pixel-level annotations needed for segmentation tasks, where the annotation cost per image is considerably high [5,17]. This hurdle can be overcome with unsupervised learning, where unknown but useful patterns can be extracted from easily accessible unlabeled data. Recent advances in unsupervised learning [23,28,7,39], which have closed the performance gap with supervised counterparts, make it a strong alternative.
Project page: https://yassouali.github.io/autoreg_seg/
Fig. 1. Overview.
Given an encoder-decoder type network F and two valid orderings (o_i, o_j) as illustrated in (c), the goal is to maximize the Mutual Information (MI) between the two outputs over the different views, i.e., different orderings. (a) For Autoregressive Clustering (AC), we output the cluster assignments in the form of a probability distribution over pixels, and the goal is to have similar assignments regardless of the applied ordering. (b) For Autoregressive Representation Learning (ARL), the objective is to have similar representations at each corresponding spatial location and its neighbors over a window of small displacements Ω.

Recent works are mainly interested in two objectives: unsupervised representation learning and clustering. Representation learning aims to learn semantic features that are useful for down-stream tasks, be it classification, regression or visualization. In clustering, the unlabeled data points are directly grouped into semantic classes. In both cases, recent works showed the effectiveness of maximizing Mutual Information (MI) between different views of the inputs to learn useful and transferable features [23,39,44,13] or discover clusters that accurately match semantic classes [28,22].

Another line of study in unsupervised learning is generative modeling. In particular, for image modeling, generative autoregressive models [38,37,43,9], such as PixelCNN, are powerful generative models with tractable likelihood computation. In this case, the high-dimensional data, e.g., an image x, is factorized as a product of conditionals over its pixels. The generative model is then trained to predict the current pixel x_i based on the past values x_{≤i−1} in a raster-scan fashion using masked convolutions [37] (Fig. 3 (a)).

In this work, instead of using a single left-to-right, top-to-bottom ordering, we propose to use several orderings obtained with different forms of masked convolutions and an attention mechanism. The various orderings over the input pixels, or the intermediate representations, are then considered as different views of the input image∗, and the model is then trained to maximize the MI between the outputs over these different views.

Our approach is generic, and can be applied to both clustering and representation learning (see Fig. 1). For a clustering task (Fig. 1 (a)), we apply a pair of distinct orderings over a given input image, producing two pixel-level predictions in the form of probability distributions over the semantic classes. We then maximize the MI between the two outputs at each corresponding spatial location and its intermediate neighbors. Maximizing the MI helps avoid degeneracy (e.g., uniform output distributions) and trivial solutions (e.g., assigning all of the pixels to the same cluster). For representation learning (Fig. 1 (b)), we maximize a lower bound of the MI between the two output feature maps over the different views.

We evaluate the proposed method using standard image segmentation datasets, Potsdam [14] and COCO-stuff [5], and show competitive results.

∗ Throughout the paper, a view refers to the application of a given ordering; both terms are used interchangeably.
We present an extensive ablation study to highlight the contribution of each component within the proposed framework, and to emphasize the flexibility of the method.

To summarize, we propose the following contributions: (i) a novel unsupervised method for image segmentation based on autoregressive models and MI maximization; (ii) various forms of masked convolutions to generate different orderings; (iii) an attention augmented version of masked convolutions for a larger receptive field, and a larger set of possible orderings; (iv) an improved performance above the previous state-of-the-art on unsupervised image segmentation.

2 Related Works

Autoregressive models.
Many autoregressive models [34,15,37,43,40,9,10] for natural image modeling have been proposed. They model the joint probability distribution of high-dimensional images as a product of conditionals over the pixels. PixelCNN [37,38] specifies the conditional distribution of a sub-pixel (i.e., a color channel of a pixel) as a full 256-way softmax, while PixelCNN++ [43] uses a mixture of logistics. In both cases, masked convolutions are used to process the initial image x in an autoregressive manner. In Image [40] and Sparse [10] Transformers, self-attention [46] is used over the input pixels, while PixelSNAIL [9] combines both attention and masked convolutions.

Clustering and unsupervised representation learning.
Recent works in clustering aim at combining traditional clustering algorithms [20] with deep learning, such as using K-means style objectives when training deep networks [6,19,12]. However, such objectives can lead to trivial and degenerate solutions [6]. IIC [28] proposed to use an MI based objective which is intrinsically more robust to such trivial solutions. Unsupervised learning of representations [23,1,39,16] rather aims to train a model mapping the unlabeled inputs into some lower-dimensional space, while preserving semantic information and discarding instance-specific details. The pre-trained model can then be fine-tuned on a down-stream task with fewer labels.
Unsupervised learning and MI maximization.
Maximizing MI for unsupervised learning is not a new idea [20,2], and recent works demonstrated its effectiveness for unsupervised learning. For representation learning, the training objective is to maximize a lower bound of MI over continuous random variables between distinct views of the inputs. These views can be the input image and its representation [24], the global and local features [23], the features at different scales [1], a sequence of extracted patches from an image in some fixed order [39] or different modalities of the image [44]. For a clustering objective, with discrete random variables as outputs, the exact MI can be maximized over the different views, e.g., IIC [28] maximizes the MI between the image and its augmented version.
Unsupervised Image Segmentation.
Methods that learn the segmentation masks entirely from data with no supervision can be categorized as follows: (1) GAN based methods [8,4] that extract and redraw the main object in the image for object segmentation. Such methods are limited to instances with only two classes, a foreground and a background; the proposed method is more generalizable and is independent of the number of ground-truth classes. (2) Iterative methods [25] consisting of a two-step process: the features produced by a CNN are first grouped into clusters using spherical K-means, and the CNN is then trained for better feature extraction to discriminate between the clusters. We propose an end-to-end method simplifying both training and inference. (3) MI maximization based methods [28], where the MI between two views of the same instance at the corresponding spatial locations is maximized; we propose an efficient and effective way to create different views of the input using masked convolutions. Another line of work consists of leveraging the learned representations of a deep network for unsupervised segmentation, e.g., CRFs [30] and deep priors [31].
3 Method

Our goal is to learn a representation that maximizes the MI, denoted as I, between different views of the input. These views are generated using various orderings, capturing different aspects of the inputs. Formally, let x ∼ X be an unlabeled data point, and F : X → Y be a deep representation to be learned as a mapping between the inputs and the outputs. For clustering, Y is the set of possible clusters corresponding to semantic classes, and for representation learning, Y corresponds to a lower-dimensional space of the output features. Let (o_i, o_j) ∈ O be two orderings obtained from the set of possible and valid orderings O (Fig. 2). For two outputs y ∼ F(x; o_i) and y′ ∼ F(x; o_j), the objective is to maximize the predictability of y from y′ and vice-versa, where F(x; o_i) corresponds to applying the function F with a given ordering o_i to process the image x.
Fig. 2. Raster-scan type orderings.

This objective is equivalent to maximizing the MI between the two encoded variables:

max_F I(y; y′)    (1)

We start by presenting different forms of masked convolutions to generate various raster-scan orderings, and propose an attention augmented variant for a larger receptive field, using it to extend the set of possible orderings (Section 3.1). We then formulate the training objective for maximizing Eq. (1) for both clustering and unsupervised representation learning (Section 3.2). We finally conclude with a flexible design of the architecture of the function F (Section 3.3).

3.1 Generating Different Orderings

In neural autoregressive modeling [37,43,9], for an input image x ∈ R^{H×W×3} with 3 color channels, a raster-scan ordering is first imposed on the image (see Fig. 2, ordering o_1). Such an ordering, where the pixel x_i only depends on the pixels that come before it, is maintained using masked convolutions† [37,38] (Fig. 3 (a)).

Our proposition is to use all 8 possible raster-scan type orderings as the set of valid orderings O, as illustrated in Fig. 2. A simple way to obtain them is to use a single ordering o_1 with the standard masked convolution (Fig. 3 (a)), along with geometric transformations g (i.e., image rotations by multiples of 90 degrees and horizontal flips), resulting in 8 versions of the input image. We can then maximize the MI between the two outputs, i.e., I(y; g⁻¹(y′)) with y′ ∼ F(g(x); o_j). In this case, since the masked weights are never trained, we cannot fall back to the normal convolution where the function F has access to the full input during inference, greatly limiting the performance of such an approach.

This point motivates our approach. Our objective is to learn all the weights of the masked convolution during training, and use an unmasked version during inference. This can be achieved by using a normal convolution and, for a given ordering o_i, masking the corresponding weights during the forward pass to construct the desired view of the inputs. Then, in the backward pass, we only update the unmasked weights and the masked weights remain unchanged. In this case, all of the weights will be learned and we will converge to a normal convolution given enough training iterations. During inference, no masking is applied, giving the function F full access to the inputs.

A straightforward way to implement this is to use 8 versions of the standard masked convolution to create the set O (Fig. 3 (d)).
† Note that for a convolution weight tensor of shape [F, F, C_in, C_out], the masking is applied over all values of both channel dimensions, C_in and C_out.
Fig. 3. Masked Convolutions. (a) Standard masked convolution used in autoregressive generative modeling, yielding an ordering o_1. (b) A relaxed version of the standard masked convolution where we have access to the current pixel at each step. (c) A simplified version of the masked convolution with a reduced number of masked weights. (d) The 8 versions of the standard masked convolution to construct all of the possible raster-scan type orderings. (e) The proposed types of masked convolutions with the corresponding shifts to obtain all of the 8 desired raster-scan type orderings. F = 3 in this case.

However, for each forward pass, the majority of the weights are masked, resulting in a reduced receptive field, and a smaller number of weights will be learned at each iteration, leading to some disparity between them.

Given that we are interested in a discriminative task rather than generative image modeling, where access to the current pixel is not allowed, we start by relaxing the conditional dependency and allow the model to have access to the current pixel, reducing the number of masked locations by one (Fig. 3 (b)). To further reduce the number of masked weights, for an F × F convolution, instead of masking the lower rows, we can simply shift the input by the same amount and only mask the weights of the last row. We thus reduce the number of masked weights from ⌊F²/2⌋ (Fig. 3 (b)) to ⌊F/2⌋ (Fig. 3 (c)). With four possible masked convolutions, {Conv A, Conv B, Conv C, Conv D}, and four possible shifts‡, we can create all 8 raster-scan orderings as illustrated in Fig. 3 (e). The proposed masked convolutions do not introduce any additional computational overhead, neither in training nor in inference, making them easy to implement and integrate into existing architectures with minor changes.

‡ e.g., for a given shift and a 3 × 3 convolution, the input of size H × W is first padded on the top, resulting in (H + 1) × W; the last row is then cropped, going back to H × W.
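To make the mechanism concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) of a convolution whose last-row weights are zeroed in the forward pass during training: the masked weights receive no gradient, so only the unmasked weights are updated, and at inference the full kernel is used. The class name, and the choice of masking the weights right of center in the last row (roughly a Conv A for an o_1-style ordering), are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrainTimeMaskedConv2d(nn.Conv2d):
    """Sketch: a normal convolution whose last-row weights (right of the
    center) are masked during training. Masked weights get zero gradient,
    so only unmasked weights are updated; at inference the full, unmasked
    kernel is used. (Hypothetical implementation for illustration.)"""

    def __init__(self, in_ch, out_ch, kernel_size=3, **kwargs):
        super().__init__(in_ch, out_ch, kernel_size,
                         padding=kernel_size // 2, **kwargs)
        mask = torch.ones_like(self.weight)        # (C_out, C_in, F, F)
        mask[..., -1, kernel_size // 2 + 1:] = 0   # floor(F/2) masked weights
        self.register_buffer("mask", mask)

    def forward(self, x, masked=True):
        weight = self.weight * self.mask if masked else self.weight
        return F.conv2d(x, weight, self.bias, self.stride, self.padding)


def shift_down(x, n=1):
    """The accompanying shift: pad on top, crop the last n rows (cf. the
    footnote above), so the kernel's last row aligns with the current row."""
    return F.pad(x, (0, 0, n, 0))[..., :x.shape[-2], :]
```

During training, one forward pass per sampled ordering would use the corresponding mask and shift; calling the layer with masked=False recovers the standard convolution used at inference.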
Fig. 4. Zigzag type orderings.
Attention Augmented Masked Convolutions.
As pointed out by [37], the proposed masked convolutions are limited in terms of expressiveness since they create blind spots in the receptive field (Fig. 6). In our case, by applying different orderings, we will have access to all of the input x over the course of training, and this bug can be seen as a feature, where the blind spots can be considered as an additional restriction. This restricted receptive field can, however, be overcome using the self-attention mechanism [46]. Similar to previous works [47,48,3], we propose to add attention blocks to model long range dependencies that are hard to access through standalone convolutions. Given an input tensor of shape (H, W, C_in), after reshaping it into a matrix X ∈ R^{HW×C_in}, we can apply a masked version of attention [46] in a straightforward manner. The output of the attention operation is:

A = Softmax((QK^⊤) ⊙ M_{o_i}) V    (2)

with Q = XW_q, K = XW_k and V = XW_v, where W_q, W_k ∈ R^{C_in×d} and W_v ∈ R^{C_in×d} are learned linear transformations that map the input X to queries Q, keys K and values V, and M_{o_i} ∈ R^{HW×HW} corresponds to a masking operation that maintains the correct ordering o_i. The output is then projected into the output space using a learned linear transformation W_O ∈ R^{d×C_in}, obtaining X_att = AW_O. The output of the attention operation X_att is concatenated channel-wise with the input X, and then merged using a 1 × 1 convolution.
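As an illustration of Eq. (2), here is a small sketch with assumed shapes and names; in code, the masking operation M_{o_i} is realized by setting the disallowed logits to −∞ before the softmax.

```python
import torch
import torch.nn as nn


class MaskedAttention(nn.Module):
    """Sketch of the masked attention of Eq. (2) (illustrative, not the
    authors' code). The binary mask M_oi of shape (HW, HW) encodes which
    positions each location may attend to under the ordering o_i; masked
    logits are set to -inf before the softmax."""

    def __init__(self, c_in, d):
        super().__init__()
        self.q = nn.Linear(c_in, d, bias=False)    # W_q
        self.k = nn.Linear(c_in, d, bias=False)    # W_k
        self.v = nn.Linear(c_in, d, bias=False)    # W_v
        self.out = nn.Linear(d, c_in, bias=False)  # W_O

    def forward(self, x, mask):                    # x: (HW, C_in), mask: (HW, HW)
        scores = self.q(x) @ self.k(x).t()         # logits QK^T, shape (HW, HW)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        att = torch.softmax(scores, dim=-1) @ self.v(x)  # A of Eq. (2)
        return self.out(att)                       # X_att = A W_O
```

The output X_att is then concatenated with X and merged with a 1 × 1 convolution, as described above.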
Zigzag Orderings. Using attention gives us another benefit: we can extend the set of possible orderings to include the zigzag type orderings introduced in [9] (Fig. 4). With zigzag orderings, the outputs at each spatial location will be mostly influenced by the values of the corresponding neighboring input pixels, which can give rise to more semantically meaningful representations compared to those of raster-scan orderings. This is done by simply using a mask M_{o_i} corresponding to the desired zigzag ordering o_i, resulting in a set O of 16 possible and valid orderings o_i with i ∈ {1, . . . , 16} in total. See Fig. 5 for an example.

3.2 Training Objectives

In information theory, the MI I(X; Y) between two random variables X and Y measures the amount of information learned from the knowledge of Y about X and vice-versa. The MI can be expressed as the difference of two entropy terms:

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)    (3)
Fig. 5. Attention Masks.
Examples of the different attention masks M_{o_i} (raster-scan and zigzag) of shape HW × HW applied for a given ordering o_i, with HW = 9.
Fig. 6. Blind Spots.
Blind spots in the receptive field of a pixel as a result of using a masked convolution for a given ordering o_i.

Intuitively, I(X; Y) can be seen as the reduction of uncertainty in one of the variables when the other one is observed. If X and Y are independent, knowing one variable exposes nothing about the other; in this case, I(X; Y) = 0. Inversely, if the state of one variable is deterministic when the state of the other is revealed, the MI is maximized. Such an interpretation explains the goal behind maximizing Eq. (1): the neural network F must be able to preserve information and extract semantically similar representations regardless of the applied ordering o_i, and learn representations that encode the underlying shared information between the different views. The objective can also be interpreted as having a regularization effect, forcing the function F to focus on the different views and subparts of the input x to produce similar outputs, reducing the reliance on specific objects or parts of the image.

Let p(y, y′) be the joint distribution produced by sampling examples x ∼ X and then sampling two outputs y ∼ F(x; o_i) and y′ ∼ F(x; o_j) with two possible orderings o_i and o_j. In this case, the MI in Eq. (1) can be defined as the Kullback–Leibler (KL) divergence between the joint and the product of the marginals:

I(y, y′) = D_KL(p(y, y′) ∥ p(y) p(y′))    (4)

To maximize Eq. (4), we can either maximize the exact MI for a clustering task over discrete predictions, or a lower bound for an unsupervised learning of representations over the continuous outputs. We will now formulate the loss functions L_AC and L_ARL of both objectives for a segmentation task.
Autoregressive clustering (AC).
In a clustering task, the goal is to train a neural network F to predict a cluster assignment corresponding to a given semantic class k ∈ {1, . . . , K}, with K possible clusters at each spatial location. In this case, the encoder-decoder type network F is terminated with a K-way softmax, outputting y ∈ [0, 1]^{H×W×K} of the same spatial dimensions as the input. Concretely, for a given input image x and two valid orderings (o_i, o_j) ∈ O, we forward pass the input through the network, producing two output probability distributions F(x; o_i) = p(y|x, o_i) and F(x; o_j) = p(y′|x, o_j) over the K clusters and at each spatial location. After reshaping the outputs into two matrices of shape HW × K, with each element corresponding to the probability of assigning pixel x_l with l ∈ {1, . . . , HW} to cluster k, we can compute the joint distribution p(y, y′) of shape K × K as follows:

p(y, y′) = F(x; o_i)^⊤ F(x; o_j)    (5)

The marginals p(y) and p(y′) can then be obtained by summing over the rows and columns of p(y, y′). Similar to IIC [28], we symmetrize p(y, y′) using [p(y, y′) + p(y, y′)^⊤]/2. The loss L_AC in this case can be written as follows:

L_AC = E_{x∼X} [ E_{p(y,y′)} log ( p(y, y′) / (p(y) p(y′)) ) ]    (6)

In practice, instead of only maximizing the MI between two corresponding spatial locations, we maximize it between each spatial location and its intermediate neighbors over small displacements u ∈ Ω (see Fig. 1). This can be efficiently implemented using a convolution operation as demonstrated in [28].
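A compact sketch of the resulting loss follows (our illustration under assumed tensor shapes; the displacement marginalization over Ω, detailed in the supplementary material, is omitted here for brevity):

```python
import torch


def ac_loss(p1, p2, eps=1e-8):
    """Sketch of the AC objective of Eqs. (5)-(6) (illustrative code).
    p1, p2: K-way softmax outputs of shape (B, K, H, W) for two orderings."""
    K = p1.shape[1]
    p1 = p1.permute(0, 2, 3, 1).reshape(-1, K)   # (B*H*W, K)
    p2 = p2.permute(0, 2, 3, 1).reshape(-1, K)
    joint = p1.t() @ p2 / p1.shape[0]            # K x K joint, Eq. (5)
    joint = (joint + joint.t()) / 2              # symmetrize as in IIC [28]
    py = joint.sum(dim=1, keepdim=True)          # marginal p(y)
    py_ = joint.sum(dim=0, keepdim=True)         # marginal p(y')
    mi = (joint * (torch.log(joint + eps)
                   - torch.log(py + eps)
                   - torch.log(py_ + eps))).sum()
    return -mi                                   # minimize the negative MI
```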
Autoregressive representation learning (ARL).
Although the clusteringobjective in Eq. (6) can also be used as a pre-training objective for F , Tschannen et al . [45] recently showed that maximizing the MI does not often results intransferable and semantically meaningful features, especially when the down-stream task is a priori unknown. To this end, we follow recent representationlearning works based on MI maximization [39,23,1,44], where a lower boundestimate of MI ( e.g ., InfoNCE [39], NWJ [36]) is maximized between differentviews of the inputs. These estimates are based on the simple intuitive idea,that if a critic f is able to differentiate between samples drawn from the jointdistribution p ( y , y (cid:48) ) and samples drawn from the marginals p ( y ) p ( y (cid:48) ), then thetrue MI is maximized. We refer the reader to [45] for a detailed discussion.In our case, with image segmentation as the target down-stream task, wemaximize the InfoNCE estimator [39] over the continuous outputs. Specifically,with two outputs ( y , y (cid:48) ) ∈ R H × W × C as C -dimensional feature maps. The train-ing objective is to maximize the infoNCE based loss L ARL : L ARL = E x ∼X (cid:34) log e f ( y l , y (cid:48) l )1 N (cid:80) Nm =1 e f ( y l , y (cid:48) m ) (cid:35) (7)For an input image x and two outputs y and y (cid:48) . Let y l and y (cid:48) m correspond to C -dimensional feature vectors at spatial positions l and m in the first and secondoutputs respectively. We start by creating N pairs of feature vectors ( y l , y (cid:48) m ),with one positive pair drawn from the joint distribution and N − i.e ., a pair ( y l , y (cid:48) m ) with m = l . The negatives are pairs ( y l , y (cid:48) m ) corresponding to two distinct spatialpositions m (cid:54) = l . In practice, we also consider small displacements Ω (Fig. 1) when constructing positives. Additionally, the negatives are generated from twodistinct images, since two feature vectors might share similar characteristics evenwith different spatial positions. By maximizing Eq. (14), we push the model F to produce similar representations for the same spatial location regardless ofthe applied ordering, so that the critic function f is able to give high matchingscores to the positive pairs and low matching to the negatives. We follow [23] anduse separable critics f ( y , y (cid:48) ) = φ ( y ) (cid:62) φ ( y (cid:48) ), where the functions φ /φ non-linearly transform the outputs to a higher vector space, and f ( y l , y (cid:48) m ) producesa scalar corresponding to a matching score between the two representations attwo spatial positions l and m of the two outputs.Note that both losses L AC and L ARL can be applied interchangeably forboth objectives, a case we investigate in our experiments (Section 4.1). For L AC ,we can consider the clustering objective as an intermediate task for learninguseful representations. For L ARL , during inference, K-means [29] algorithm canbe applied over the outputs to obtain the cluster assignments.
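For Eq. (7), a minimal InfoNCE sketch over the critic outputs is given below (assumed shapes; for brevity this version treats all off-diagonal pairs as negatives, whereas, as noted above, negatives are only drawn from distinct images in practice):

```python
import torch
import torch.nn.functional as F


def arl_loss(z1, z2):
    """Sketch of the InfoNCE-based ARL objective of Eq. (7) (illustrative).
    z1, z2: critic outputs phi1(y), phi2(y') of shape (B, C, H, W); the
    positive for each location is the same location in the other output."""
    B, C, H, W = z1.shape
    z1 = z1.permute(0, 2, 3, 1).reshape(B * H * W, C)
    z2 = z2.permute(0, 2, 3, 1).reshape(B * H * W, C)
    scores = z1 @ z2.t()                 # f(y_l, y'_m) for all pairs
    labels = torch.arange(B * H * W, device=z1.device)
    # InfoNCE reduces to a cross-entropy where the positive is the diagonal
    return F.cross_entropy(scores, labels)
```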
3.3 Network Architecture

The representation function F can be implemented in a general manner using three sub-parts, i.e., F = h ∘ g_ar ∘ d, with a feature extractor h, an autoregressive encoder g_ar and a decoder d. With such a formulation, the function F is flexible and can take different forms. With h as an identity mapping, F becomes a fully autoregressive network, where we apply the different orderings directly over the inputs. Inversely, if g_ar is an identity mapping, F becomes a generic encoder-decoder network, where h plays the role of an encoder. Additionally, h can be a simple convolutional stem that plays an important role in learning local features such as edges, or even multiple residual blocks [21] to extract higher-level representations. In this case, the orderings are applied over the hidden features using g_ar. g_ar is similar to h, containing a series of residual blocks, with two main differences: the proposed masked convolutions are used, and the batch normalization [26] layers are omitted to maintain the autoregressive dependency, with an optional attention block. The decoder d can be a simple conv 1 × 1 with K output channels, followed by bilinear upsampling and a softmax operation for a clustering objective. For representation learning, d consists of two separable critics φ_1/φ_2, which are implemented as a series of conv 3 × 3 − BN − ReLU and conv 1 × 1 layers.
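A minimal sketch of this decomposition (module and argument names are ours; h, g_ar and d stand for any of the variants discussed above):

```python
import torch.nn as nn


class AutoregSegNet(nn.Module):
    """Sketch of F as the composition of a feature extractor h, an
    autoregressive encoder g_ar and a decoder d (illustrative only)."""

    def __init__(self, h, g_ar, d):
        super().__init__()
        self.h, self.g_ar, self.d = h, g_ar, d

    def forward(self, x, ordering=None):
        feats = self.h(x)                   # local features (stem / res. blocks)
        feats = self.g_ar(feats, ordering)  # masked in training, unmasked at inference
        return self.d(feats)                # K-way softmax or separable critics
```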
4 Experiments

After stating the experimental setting, we start by presenting an extensive ablation study of the proposed method and its various parts. We then compare the method to state-of-the-art approaches on unsupervised image segmentation.

Datasets.
The experiments are conducted on the newly established and challenging baselines of [28]: Potsdam [14], with 8550 RGBIR satellite images of size 200 × 200 and 6 classes, along with Potsdam-3, a variant with 3 coarser classes. Similarly, we use a reduced version of COCO-Stuff [5], with 164k images and 15 coarse stuff classes, reduced to 52k by taking only images with at least 75% stuff pixels, in addition to COCO-Stuff-3 with only 3 labels: sky, ground and plants.
We report the pixel classification Accuracy (Acc). For a clustering task, there is a mismatch between the learned and ground truth clusters; we follow the standard procedure and find the best one-to-one permutation to match the output clusters to the ground truth classes using the Hungarian algorithm [33]. The Acc is then computed over the labeled examples.
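A sketch of this evaluation step using SciPy's Hungarian solver (the function name and array layout are our own):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hungarian_accuracy(preds, targets, k):
    """Sketch: match output clusters to ground-truth classes one-to-one
    with the Hungarian algorithm [33], then compute pixel accuracy."""
    confusion = np.zeros((k, k), dtype=np.int64)
    for p, t in zip(preds.ravel(), targets.ravel()):
        confusion[p, t] += 1                        # pixels per (cluster, class)
    rows, cols = linear_sum_assignment(-confusion)  # maximize matched pixels
    return confusion[rows, cols].sum() / preds.size
```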
Implementation details.
The different variations of F are trained using Adam to optimize both objectives in Eqs. (6) and (7), with the learning rates given in the supplementary material. We train on 200 × 200 crops for Potsdam and 128 × 128 crops for COCO. The training is conducted on NVidia V100 GPUs, and implemented using the PyTorch framework [41]. For more experimental details, see the sup. mat.
4.1 Ablation Studies

We start by performing comprehensive ablation studies on the different components and variations of the proposed method. Table 1 and Fig. 7 show the ablation results for AC, and Table 2 shows a comparison between AC and ARL, analyzed as follows:
Variations of F. Table 1a compares different variations of the network F. With a fixed decoder d (i.e., a 1 × 1 convolution), we vary h and g_ar, going from a fully autoregressive model (F_1) to a normal encoder-decoder network (F_4 and F_5). When using masked versions, we see an improvement over the normal case, with up to 8 points for Potsdam, and to a lesser extent for Potsdam-3, where the task is relatively easier with only three ground truth classes. When using a fully autoregressive model (F_1), applying the orderings directly over the inputs, maximizing the MI becomes much harder, and the model fails to learn meaningful representations. Inversely, when no masking is applied (F_4 and F_5), the task becomes comparatively simpler, and we see a drop in performance. The best results are obtained when applying the orderings over low-level features (F_2 and F_3). Interestingly, the unmasked versions yield results better than random, and perform competitively with 3 output classes for Potsdam-3, validating the effectiveness of maximizing the MI over small displacements u ∈ Ω. For the rest of the experiments we use F_2 as our model.

Attention and different orderings.
Table 1c shows the effectiveness of adding attention blocks to our model. With a single attention block added at a shallow level, we observe an improvement over the baseline, for both raster-scan and zigzag orderings, and their combination, with up to 4 points for Potsdam. In this case, given the quadratic complexity of attention, we used an output stride of 4.
(a) Variations of F: we compare different variants of the network F using different feature extractors h and autoregressive encoders g_ar. The decoder d is fixed.

Network F = h ∘ g_ar ∘ d
         h            g_ar            POS    POS3
Random   –            –               28.5   38.2
F_1      Id           5 Res. blocks   39.3   56.3
F_2      Stem         5 Res. blocks   46.4   –
F_3      Res. block   4 Res. blocks   –      –
F_4      –            –               –      –
F_5      ResNet-18    Id              40.7   51.9

(b) Number of orderings: we compare different sizes of the set O. For |O| = 2 and |O| = 4, we report the mean and std over 4 runs using different possible pairs and quadruples respectively.

|O|   POS    POS3
2     43.2   –
4     –      –
8     –      –

(c) Attention: we add a single attention block at a shallow level, and change the applied masks to get the desired orderings. Output stride = 4 in this instance.

Raster-Scan   Zigzag   Attention   POS   POS3
✓             ×        ×           –     –
✓             ×        ✓           –     –
×             ✓        ✓           –     –
✓             ✓        ✓           –     –

(d) Sampling of o_i: we compare different possible sampling procedures of the orderings o_i during training.

Sampling o_i   POS    POS3
Random         46.4   –
No Rep.        48.6   64.8
Hard           –      –

(e) Transformations: we apply a given transformation to the inputs of the second forward pass during a single training iteration.

Type          Transf.          POS    POS3
None          –                46.4   66.4
Photometric   Col. Jittering   47.9   65.5
Geometric     Flip             46.7   68.0
Geometric     Rot.             –      –

(f) Dropout: we inspect the addition of dropout to the inner activations of a residual block.

p   POS    POS3
0   46.4   –
Table 1. AC Ablations. Ablation studies conducted on Potsdam (POS) and Potsdam-3 (POS3) for Autoregressive Clustering. We show the pixel classification accuracy (%).
Data augmentations.
For a given training iteration, we pass the same image two times through the network, applying two different orderings at each forward pass. We can, however, pass a transformed version of the image as the second input. We investigate using photometric (i.e., color jittering) and geometric (i.e., rotations and H-flips) transformations. For geometric transformations, we bring the outputs back to the input coordinate space before computing the loss. Results are shown in Table 1e. As expected, we obtain relative improvements with data augmentations, highlighting the flexibility of the approach.
Dropout.
To add some degree of stochasticity to the network, and as an additional regularization, we apply dropout to the intermediate activations within the residual blocks of the network. Table 1f shows a small increase in Acc for Potsdam.

Fig. 7. Overclustering. The Acc obtained on Potsdam-3 (K_gt = 3) and Potsdam (K_gt = 6) when using a number of output clusters K greater than the number of ground truth classes K_gt, with a variable number of images used to find the best many-to-one matching between the outputs and targets.

Orderings.
Until now, at each forward pass, we sample a pair of possible orderings with replacement from the set O. With such a sampling procedure, we might end up with the same pair of orderings for a given training iteration. As an alternative, we investigate two other sampling procedures: first, sampling with no repetition (No Rep.), where we choose two distinct orderings for each training iteration; second, hard sampling, choosing two orderings with opposite receptive fields (e.g., a left-to-right ordering and its right-to-left counterpart). Table 1d shows the obtained results. We see a 2 point improvement when using hard sampling for Potsdam. For simplicity, we use random sampling for the rest of the experiments.

Additionally, to investigate the effect of the number of orderings (i.e., the cardinality of O), we compute the Acc over different choices and sizes of O. Table 1b shows that the best results are obtained when using all 8 raster-scan orderings. Interestingly, for some choices, we observe better results, which may be due to selecting orderings that do not share any receptive fields, like the ones used in hard sampling.

Overclustering.
To compute the Acc for a clustering task using linear assignment, the number of output clusters is chosen to match the ground truth classes, K = K_gt. Nonetheless, we can choose a higher number of clusters K > K_gt, and then find the best many-to-one matching between the output clusters and the ground truths based on a given number of labeled examples. In this case, however, we are not in a fully unsupervised setting, given that we extract some information, although limited, from the labels. Fig. 7 shows that, even with a very limited number of labeled examples used for mapping, we can obtain better results than in the fully unsupervised case.

AC and ARL.
To compare AC and ARL, we apply them interchangeably on both clustering and representation learning objectives. In clustering, for ARL, after PCA whitening, we apply K-means over the output features to get the cluster assignments. In representation learning, we evaluate the quality of the learned representations using both linear and non-linear separability as a proxy for disentanglement, and as a measure of MI between representations and class labels. Table 2 shows the obtained results.
Clustering
Method       POS    POS3
Random CNN   28.5   38.2
AC           –      –
ARL          45.1   57.1

Linear Evaluation
Method   POS   POS3
AC       –     –
ARL      –     –

Non-linear Evaluation
Method   POS    POS3
AC       –      –
ARL      47.6   63.5
Table 2. Comparing ARL and AC.
We compare ARL and AC on a clustering task (left), and investigate the quality of the learned representations by freezing the trained model and reporting the test Acc obtained when training linear (center) and non-linear (right) functions on the labeled training examples.
Method                 COCO-Stuff-3   COCO-Stuff   Potsdam-3   Potsdam
Random CNN             37.3           19.4         38.2        28.3
K-means [42]           52.2           14.1         45.7        35.3
SIFT [35]              38.1           20.2         38.2        28.5
Doersch 2015 [11]      47.5           23.1         49.6        37.2
Isola 2016 [27]        54.0           24.3         63.9        44.9
DeepCluster 2018 [6]   41.6           19.9         41.7        29.2
IIC 2019 [28]          72.3           27.7         65.1        45.4
AC                     –              –            –           –
Table 3. Unsupervised image segmentation.
Comparison of AC with state-of-the-art methods on unsupervised segmentation.
Clustering.
As expected, AC outperforms ARL on a clustering task, given thatthe clusters are directly optimized by computing the exact MI during training.
Quality of the learned representations.
Surprisingly, AC outperforms ARL on both linear and non-linear classification. We hypothesize that unsupervised representation learning objectives that work well on image classification fail in image segmentation due to the dense nature of the task. The model in this case needs to output distinct representations over pixels, rather than over the whole image, which is a harder task to optimize. This might also be due to using only a small number of features (i.e., N pairs) for each training iteration.

4.2 Comparison with the State-of-the-Art

Table 3 shows the results of the comparison. AC outperforms previous work, and by a good margin for harder segmentation tasks with a large number of output classes (i.e., Potsdam and COCO-Stuff), highlighting the effectiveness of maximizing the MI between the different orderings as a training objective. We note that no regularization or data augmentation was used, and we expect that better results can be obtained by combining AC with other procedures, as demonstrated in the ablation studies.
5 Conclusion

We presented a novel method to create different views of the inputs using different orderings, and showed the effectiveness of maximizing the MI over these views for unsupervised image segmentation. We showed that for image segmentation, optimizing over the discrete outputs by computing the exact MI works better for both clustering and unsupervised representation learning, due to the dense nature of the task. Given the simplicity and ease of adoption of the method, we hope that the proposed approach can be adapted to other visual tasks and used in future works.
Acknowledgments.
Y. Ouali is supported by the Randstad corporate research chair in collaboration with CentraleSupélec. We would also like to thank the Saclay-IA platform of Université Paris-Saclay and the Mésocentre computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay for providing the computational resources.
References
1. Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems. pp. 15509–15519 (2019)
2. Becker, S., Hinton, G.E.: Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (6356), 161–163 (1992)
3. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3286–3295 (2019)
4. Bielski, A., Favaro, P.: Emergence of object segmentation in perturbed generative models. In: Advances in Neural Information Processing Systems. pp. 7256–7266 (2019)
5. Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and stuff classes in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1209–1218 (2018)
6. Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 132–149 (2018)
7. Caron, M., Bojanowski, P., Mairal, J., Joulin, A.: Unsupervised pre-training of image features on non-curated data. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2959–2968 (2019)
8. Chen, M., Artières, T., Denoyer, L.: Unsupervised object segmentation by redrawing. In: Advances in Neural Information Processing Systems. pp. 12705–12716 (2019)
9. Chen, X., Mishra, N., Rohaninejad, M., Abbeel, P.: PixelSNAIL: An improved autoregressive generative model. In: International Conference on Machine Learning. pp. 864–872 (2018)
10. Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)
11. Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1422–1430 (2015)
12. Fard, M.M., Thonet, T., Gaussier, E.: Deep k-means: Jointly clustering with k-means and learning representations. arXiv preprint arXiv:1806.10069 (2018)
13. Federici, M., Dutta, A., Forré, P., Kushman, N., Akata, Z.: Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017 (2020)
14. Gerke, M.: Use of the stair vision library within the ISPRS 2D semantic labeling benchmark (Vaihingen) (2014)
15. Germain, M., Gregor, K., Murray, I., Larochelle, H.: MADE: Masked autoencoder for distribution estimation. In: International Conference on Machine Learning. pp. 881–889 (2015)
16. Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
17. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580–587 (2014)
18. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. pp. 249–256 (2010)
19. Haeusser, P., Plapp, J., Golkov, V., Aljalbout, E., Cremers, D.: Associative deep clustering: Training a classification network with no labels. In: German Conference on Pattern Recognition. pp. 18–32. Springer (2018)
20. Hartigan, J.A.: Direct clustering of a data matrix. Journal of the American Statistical Association (337), 123–129 (1972)
21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
22. He, Z., Xu, X., Deng, S.: k-ANMI: A mutual information based clustering algorithm for categorical data. Information Fusion (2), 223–233 (2008)
23. Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
24. Hu, W., Miyato, T., Tokui, S., Matsumoto, E., Sugiyama, M.: Learning discrete representations via information maximizing self-augmented training. In: Proceedings of the 34th International Conference on Machine Learning. pp. 1558–1567. JMLR.org (2017)
25. Hwang, J.J., Yu, S.X., Shi, J., Collins, M.D., Yang, T.J., Zhang, X., Chen, L.C.: SegSort: Segmentation by discriminative sorting of segments. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 7334–7344 (2019)
26. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
27. Isola, P., Zoran, D., Krishnan, D., Adelson, E.H.: Learning visual groups from co-occurrences in space and time. arXiv preprint arXiv:1511.06811 (2015)
28. Ji, X., Henriques, J.F., Vedaldi, A.: Invariant information clustering for unsupervised image classification and segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9865–9874 (2019)
29. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
30. Kanezaki, A.: Unsupervised image segmentation by backpropagation. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1543–1547. IEEE (2018)
31. Kanezaki, A.: Unsupervised image segmentation by backpropagation. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1543–1547. IEEE (2018)
32. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
33. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly (1-2), 83–97 (1955)
34. Larochelle, H., Murray, I.: The neural autoregressive distribution estimator. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. pp. 29–37 (2011)
35. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (2), 91–110 (2004)
36. Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory (11), 5847–5861 (2010)
37. Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al.: Conditional image generation with PixelCNN decoders. In: Advances in Neural Information Processing Systems. pp. 4790–4798 (2016)
38. Oord, A.v.d., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 (2016)
39. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
40. Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., Tran, D.: Image transformer. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4055–4064. PMLR, Stockholmsmässan, Stockholm, Sweden (10–15 Jul 2018)
41. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)
42. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (Oct), 2825–2830 (2011)
43. Salimans, T., Karpathy, A., Chen, X., Kingma, D.P.: PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517 (2017)
44. Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. arXiv preprint arXiv:1906.05849 (2019)
45. Tschannen, M., Djolonga, J., Rubenstein, P.K., Gelly, S., Lucic, M.: On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625 (2019)
46. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008 (2017)
47. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7794–7803 (2018)
48. Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318 (2018)

Supplementary Material
In this supplementary material, we provide architectural details, hyperparameter settings, and further discussions of the loss functions and the masked convolutions. We also provide some qualitative results and implementation details.
A Architectural details
Tables 4 to 7 present the building blocks of the representation function F. Specifically, we describe the architecture of the convolutional stem, the residual blocks, the decoder for AC and the separable critics used for ARL.

Convolutional Stem
Layer                   Output size
Input                   3 × H × W
Conv 3 × 3              64 × H × W
Batch Norm - ReLU       64 × H × W
Max Pool 3 × 3, s = 2   64 × H/2 × W/2
Table 4. Convolutional stem for an output stride of 2. For an output stride of 4, we use an additional strided Conv 3 × 3, resulting in an output of size 64 × H/4 × W/4. For the fully autoregressive case, the Batch Norm is omitted, and the Max Pool is replaced with a strided masked convolution.
Decoder
Layer                    Output size
Input                    C × H/2 × W/2
Conv 1 × 1               K × H/2 × W/2
Bilinear Interpolation   K × H × W
Softmax                  K × H × W
Table 5. Decoder used for a clustering objective. In this case, we have an output stride of 2 and K clusters.

Separable Critics
Layer                              Output size
Input                              C × H × W
Conv 1 × 1                         2C × H × W
Conv 1 × 1                         2C × H × W
Conv 1 × 1                         2C × H × W
Residual Connection + Batch Norm   2C × H × W
Table 6. Separable critics used for representation learning to non-linearly project the outputs to a higher vector space.
Residual Block
Layer                             Output size
Input                             C × H × W
Conv 3 × 3                        2C × H × W
Conv 1 × 1                        2C × H × W
Zero padding of the input to 2C   2C × H × W
Residual Connection               2C × H × W
Conv 1 × 1                        2C × H × W
Conv 1 × 1                        2C × H × W
Residual Connection               2C × H × W
Conv 1 × 1                        2C × H × W
Conv 1 × 1                        2C × H × W
Residual Connection               2C × H × W
Table 7. Architecture of the residual blocks. For the residual blocks used in the autoregressive encoder g_ar, normal convolutions are replaced with masked ones.

B Hyperparameters
We discuss the hyperparameters used in our experiments. We noticed that the network is very sensitive to the initialization; in our case, we initialize the parameters using Xavier initialization [18], and observed somewhat more stable results with such an initialization scheme. The optimizer of choice is Adam [32] with the default parameters (β_1 = 0.9, β_2 = 0.999).

Parameter    COCO-Stuff-3   COCO-Stuff   Potsdam-3   Potsdam
LR           –              –            –           –
Batch size   60             60           30          30
Crop size    128 × 128      128 × 128    200 × 200   200 × 200
L_AC         10             10           10          10
Attention    False          False        True        True
Table 8. Hyperparameters used for training per dataset.
C Loss functions
In this section, we go into more detail about the loss functions introduced in Section 3.2 of the paper. For a given unlabeled input x ∼ X, and two outputs y ∼ F(x; o_i) and y′ ∼ F(x; o_j) with two valid orderings (o_i, o_j) ∈ O, the training objective is to maximize the MI between the two encoded variables:

max_F I(y; y′)    (8)

C.1 Autoregressive Clustering L_AC

To see the benefits of maximizing Eq. (8) for a clustering objective, we expand the objective as the difference between two entropy terms:

I(y; y′) = H(y) − H(y|y′)    (9)

With such a formulation, we can see that maximizing the MI involves maximizing the entropy and minimizing the conditional entropy. The compromise between these two terms helps us avoid both degenerate and trivial solutions. For a degenerate solution, where the model F outputs uniform distributions over all of the pixels, not assigning any cluster to any pixel, the entropy H(y) is maximized; however, the second term H(y|y′) is also maximized, since the outputs are not deterministic and there is no predictability of the second output from the first. Inversely, with trivial solutions, where all of the pixels are assigned to the same cluster, the second output y′ is totally deterministic given the first, and the conditional entropy H(y|y′) is minimized; yet the entropy H(y) is also minimized and we fail to maximize the MI. By balancing the maximization of the first term and the minimization of the second, we are more likely to end up with the correct assignments than if we only maximized the entropy.

Given that the two outputs are generated using the same input and two different orderings, there is a strong statistical dependency between them. In this case, y ∼ F(x; o_i) and y′ ∼ F(x; o_j) are dependent and we compute the joint probability p(y, y′) as a matrix of size K × K:

p(y, y′) = F(x; o_i)^⊤ F(x; o_j)    (10)

In practice, we also marginalize over the batch, with an input x of shape B × 3 × H × W as a batch of B input images. Let x_i correspond to the i-th image in the batch x of B images. In this case the joint probability is computed as follows:

p(y, y′) = (1/B) Σ_{i=1}^{B} F(x_i; o_i)^⊤ F(x_i; o_j)    (11)

Additionally, following [28], we also compute the joint probability over small possible displacements u ∈ Ω. Let the input x^(u) correspond to shifting the input x by u pixels (i.e., zero padding and cropping). In such a case, we also need to marginalize over all possible displacements u as follows:

p(y, y′) = 1/(B|Ω|) Σ_{i=1}^{B} Σ_{u∈Ω} F(x_i; o_i)^⊤ F(x_i^(u); o_j)    (12)

Finally, by summing over the rows and columns of p(y, y′), we can compute the marginals, and then the MI:

I(y, y′) = D_KL(p(y, y′) ∥ p(y) p(y′))    (13)
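A sketch of Eq. (12) follows (our illustration, not the authors' code; for brevity, the shift is implemented with torch.roll rather than the zero padding and cropping described above, and the result is normalized to a proper distribution):

```python
import torch


def joint_over_displacements(p1, p2, displacements):
    """Sketch of the displacement-marginalized joint of Eq. (12).
    p1, p2: softmax outputs of shape (B, K, H, W); displacements: list of
    (du, dv) offsets forming Omega, e.g. [(0, 0), (0, 1), (1, 0)]."""
    B, K, H, W = p1.shape
    a = p1.permute(0, 2, 3, 1).reshape(-1, K)
    joint = torch.zeros(K, K, device=p1.device)
    for du, dv in displacements:
        shifted = torch.roll(p2, shifts=(du, dv), dims=(2, 3))
        b = shifted.permute(0, 2, 3, 1).reshape(-1, K)
        joint += a.t() @ b
    return joint / (B * H * W * len(displacements))
```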
Fig. 8.
Left : Examples of positive and negative pairs for B = 2 and HW = 4. y i refersto the i-th element of the output y corresponding to the i-th image in the input batch. Right : Examples of positive pairs with possible displacements Ω = {− , , } . C.2 Autoregressive Representation Learning L AC For unsupervised representation learning objective, we maximize the infoNCE[39] as a lower bound of MI over the continuous outputs: L ARL = log e f ( y l , y (cid:48) l )1 N (cid:80) Nm =1 e f ( y l , y (cid:48) m ) (14)The goal of Eq. (14) is to push the network F to produce similar featuresbetween the two outputs y and y (cid:48) at the same spatial locations, so that thecritic is able to give high scores between two feature vectors ( y l , y (cid:48) m ) at thesame spatial position m = l , and low scores for feature vectors from distinctspatial position m (cid:54) = l or from two distinct images. To compute the loss inEq. (14), we need to create a set of positive and negative pairs. With a batchof images x of shape B × × H × W , we generate two outputs y and y (cid:48) ofshape B × C × H × W , with C -dimensional output feature maps. In this case theoutput of the critic f ( y , y (cid:48) ) = φ ( y ) (cid:62) φ ( y (cid:48) ) is a matrix of shape BHW × BHW .To construct the positive and negative pairs, we reshape the scoring matrix as B matrices of shape HW , in this case the positives are the diagonals of eachmatrix from the same images with a given shift u ∈ Ω. The negatives are allof the possible combination across the matrices from distinct images. See Fig. 8for an illustration for B = 2 and HW = 4. Note that we avoid using the sameimage to construct negative pairs, and only construct them across images, giventhat even with distinct spatial positions, it is very likely that two feature vectorsshare similar characteristics. D Receptive fields
To further illustrate how a given ordering o_i is constructed, we present a toy example where we plot the receptive field of a given pixel at the center of an image of size 16 × 16. After each application of a masked convolution with the corresponding shift, we compute the gradient of the target pixel and plot the non-zero values in blue, which correspond to the receptive field of the target pixel. The results are illustrated in Fig. 9.
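This probing can be reproduced with a few lines of autograd (a sketch; model stands for any stack of the masked convolutions above):

```python
import torch


def receptive_field_mask(model, h=16, w=16):
    """Sketch of the toy experiment: input positions with a non-zero
    gradient w.r.t. the center output pixel form its receptive field."""
    x = torch.randn(1, 3, h, w, requires_grad=True)
    y = model(x)
    y[0, :, h // 2, w // 2].sum().backward()
    return (x.grad.abs().sum(dim=1)[0] != 0)   # boolean (H, W) mask
```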
Fig. 9.
Examples of the growing receptive field of a pixel for two orderings, over 8 consecutive applications of masked convolutions to get the correct orderings. As expected, after enough convolutions, and with the correct shifts, we can construct the desired ordering. Note that in both cases we have a significant number of pixels in the blind spots, which can be accessed using an attention block. In this case, we use Conv A with two different shifts.

Orderings.
For a given pair of distinct orderings, the resulting dependencies and receptive fields of the two outputs will be different even if the applied orderings are quite similar. It is, however, likely that the two outputs share some overlap in their receptive fields, but such an overlap is small and helps reduce the difficulty of the task. An illustration of the resulting receptive fields for a given pixel using raster-scan orderings is shown in Fig. 10.
Fig. 10.
The resulting receptive fields with the various raster-scan type orderings.
E Qualitative Results
Fig. 11 shows qualitative results of Autoregressive Clustering (AC) on the COCO-Stuff-3 test set, in addition to linear and non-linear evaluations, where the model trained for AC is frozen and the corresponding layers are added on top of the decoder and trained on the train set. Surprisingly, even if the accuracy with linear and non-linear evaluations is higher, we see that, qualitatively, the fully unsupervised method gives slightly better results. This might be due to the dense nature of image segmentation, where the prediction at a given pixel is very dependent on its neighbors, and we lose this locality with linear evaluation, given that we consider each pixel as a standalone data point. This is similar to what we observed with ARL, where we optimize the representations at each spatial location separately. Note that we have noticed some minor annotation errors in the ground truths that might be due to the conversion done by [28]; these are very minor and can be overlooked.

We also present some examples where AC fails in Fig. 12. We observe that the model is very dependent on the appearance and colors for making its predictions. However, in special cases, like tennis courts with grass or asphalt floors, the model predicts the plants or sky classes where the correct prediction is ground. This can be overcome with additional data augmentations like color jittering, or, in cases where a limited amount of labeled examples is available, the model can be fine-tuned to correct such mistakes. We already see some slight improvements with linear and non-linear evaluations.
Fig. 11.
Qualitative results from the COCO-Stuff-3 [5,28] test set, comparing Autoregressive Clustering with the linear and non-linear evaluations (classes: ground, plants, sky; non-stuff pixels are ignored).
Fig. 12.
Failure cases for Autoregressive Clustering from the COCO-Stuff-3 [5,28] test set.