Colorization Transformer
Published as a conference paper at ICLR 2021
Manoj Kumar, Dirk Weissenborn & Nal Kalchbrenner
Google Research, Brain Team
{mechcoder,diwe,nalk}@google.com

ABSTRACT
We present the Colorization Transformer, a novel approach for diverse high-fidelity image colorization based on self-attention. Given a grayscale image, the colorization proceeds in three steps. We first use a conditional autoregressive transformer to produce a low-resolution coarse coloring of the grayscale image. Our architecture adopts conditional transformer layers to effectively condition on the grayscale input. Two subsequent fully parallel networks upsample the coarse colored low-resolution image into a finely colored high-resolution image. Sampling from the Colorization Transformer produces diverse colorings whose fidelity outperforms the previous state-of-the-art on colorizing ImageNet, based on FID results and on a human evaluation in a Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth. The code and pre-trained checkpoints for the Colorization Transformer are publicly available at this url.
1 INTRODUCTION
Figure 1: Samples of our model showing diverse, high-fidelity colorizations.
Image colorization is a challenging, inherently stochastic task that requires a semantic understanding of the scene as well as knowledge of the world. Core immediate applications of the technique include producing organic new colorizations of existing image and video content as well as giving life to originally grayscale media, such as old archival images (Tsaftaris et al., 2014), videos (Geshwind, 1986) and black-and-white cartoons (Sýkora et al., 2004; Qu et al., 2006; Cinarel & Zhang, 2017). Colorization also has important technical uses as a way to learn meaningful representations without explicit supervision (Zhang et al., 2016; Larsson et al., 2016; Vondrick et al., 2018) or as an unsupervised data augmentation technique, whereby diverse semantics-preserving colorizations of labelled images are produced with a colorization model trained on a potentially much larger set of unlabelled images.

The current state-of-the-art in automated colorization are neural generative approaches based on log-likelihood estimation (Guadarrama et al., 2017; Royer et al., 2017; Ardizzone et al., 2019). Probabilistic models are a natural fit for the one-to-many task of image colorization and obtain better results than earlier deterministic approaches, avoiding some of their persistent pitfalls (Zhang et al., 2016). Probabilistic models also have the central advantage of producing multiple diverse colorings that are sampled from the learnt distribution.

In this paper, we introduce the Colorization Transformer (ColTran), a probabilistic colorization model composed only of axial self-attention blocks (Ho et al., 2019b; Wang et al., 2020). The main advantages of axial self-attention blocks are the ability to capture a global receptive field with only two layers and O(D√D) instead of O(D²) complexity.
They can be implemented efficiently using matrix multiplications on modern accelerators such as TPUs (Jouppi et al., 2017). In order to enable colorization of high-resolution grayscale images, we decompose the task into three simpler sequential subtasks: coarse low-resolution autoregressive colorization, parallel color super-resolution and parallel spatial super-resolution. For coarse low-resolution colorization, we apply a conditional variant of the Axial Transformer (Ho et al., 2019b), a state-of-the-art autoregressive image generation model that does not require custom kernels (Child et al., 2019). While Axial Transformers support conditioning by biasing the input, we find that directly conditioning the transformer layers can improve results significantly. By leveraging the semi-parallel sampling mechanism of Axial Transformers, we are able to colorize images at higher resolution and faster than previous work (Guadarrama et al., 2017), which in turn improves colorization fidelity. Finally, we employ fast parallel deterministic upsampling models to super-resolve the coarsely colorized image into the final high-resolution output. In summary, our main contributions are:

• First application of transformers for high-resolution (256 × 256) image colorization.
• We introduce conditional transformer layers for low-resolution coarse colorization in Section 4.1. The conditional layers incorporate conditioning information via multiple learnable components that are applied per-pixel and per-channel. We validate the contribution of each component with extensive experimentation and ablation studies.
• We propose training an auxiliary parallel prediction model jointly with the low-resolution coarse colorization model in Section 4.2.
Improved FID scores demonstrate the usefulness of this auxiliary model.
• We establish a new state-of-the-art on image colorization, outperforming prior methods by a large margin on FID scores and in a 2-Alternative Forced Choice (2AFC) Mechanical Turk test. Remarkably, in more than 60% of cases human evaluators prefer the highest rated among three generated colorings over the ground truth.

2 RELATED WORK
Colorization methods initially relied on human-in-the-loop approaches to provide hints in the form of scribbles (Levin et al., 2004; Ironi et al., 2005; Huang et al., 2005; Yatziv & Sapiro, 2006; Qu et al., 2006; Luan et al., 2007; Tsaftaris et al., 2014; Zhang et al., 2017; Ci et al., 2018) and exemplar-based techniques that involve identifying a reference source image to copy colors from (Reinhard et al., 2001; Welsh et al., 2002; Tai et al., 2005; Ironi et al., 2005; Pitié et al., 2007; Morimoto et al., 2009; Gupta et al., 2012; Xiao et al., 2020). Exemplar-based techniques have recently been extended to video as well (Zhang et al., 2019a). In the past few years, the focus has moved to more automated, neural colorization methods. Deterministic colorization techniques such as CIC (Zhang et al., 2016), LRAC (Larsson et al., 2016), LTBC (Iizuka et al., 2016), Pix2Pix (Isola et al., 2017) and DC (Cheng et al., 2015; Dahl, 2016) involve variations of CNNs that model per-pixel color information conditioned on the intensity.

Generative colorization models typically extend unconditional image generation models to incorporate conditioning information from a grayscale image. Specifically, cINN (Ardizzone et al., 2019) uses conditional normalizing flows (Dinh et al., 2014), VAE-MDN (Deshpande et al., 2017; 2015) and SCC-DC (Messaoud et al., 2018) use conditional VAEs (Kingma & Welling, 2013), and cGAN (Cao et al., 2017) uses GANs (Goodfellow et al., 2014) for generative colorization. Most closely related to ColTran are other autoregressive approaches such as PixColor (Guadarrama et al., 2017) and PIC (Royer et al., 2017), with PixColor obtaining slightly better results than PIC due to its CNN-based upsampling strategy. ColTran is similar to PixColor in its use of an autoregressive model for low-resolution colorization and parallel spatial upsampling. ColTran differs from PixColor in the following ways.
We train ColTran in a completely unsupervised fashion, while the conditioning network in PixColor requires pre-training with an object detection network that provides substantial semantic information. PixColor relies on PixelCNN (Oord et al., 2016), which requires a large depth to model interactions between all pixels, whereas ColTran relies on the Axial Transformer (Ho et al., 2019b) and can model all interactions between pixels with just 2 layers. PixColor uses different architectures for conditioning, colorization and super-resolution, while ColTran is conceptually simpler, as we use self-attention blocks everywhere for both colorization and super-resolution. Finally, we train our autoregressive model on a single coarse channel and a separate color upsampling network that improves fidelity (See: 5.3). The multi-stage generation process in ColTran that upsamples in depth and in size is related to that of Subscale Pixel Networks (Menick & Kalchbrenner, 2018) for image generation, with differences in the order and representation of bits as well as in the use of fully parallel networks. The self-attention blocks that are the building blocks of ColTran were initially developed for machine translation (Vaswani et al., 2017), but are now widely used in a number of other applications including density estimation (Parmar et al., 2018; Child et al., 2019; Ho et al., 2019a; Weissenborn et al., 2019) and GANs (Zhang et al., 2019b).
3 BACKGROUND: AXIAL TRANSFORMER
3.1 ROW AND COLUMN SELF-ATTENTION
Self-attention (SA) has become a standard building block in many neural architectures. Although the complexity of self-attention is quadratic in the number of input elements (here pixels), it has recently become quite popular for image modeling (Parmar et al., 2018; Weissenborn et al., 2019) due to modeling innovations that do not require running global self-attention between all pixels. Following the work of Ho et al. (2019b), we employ standard qkv self-attention (Vaswani et al., 2017) within rows and columns of an image. By alternating row and column self-attention we effectively allow global exchange of information between all pixel positions. For the sake of brevity we omit the exact equations for multihead self-attention and refer the interested reader to Appendix H for more details. Row/column attention layers are the core components of our model. We use them in the autoregressive colorizer, the spatial upsampler and the color upsampler.
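The row/column attention described above can be sketched as follows. This is a minimal single-head illustration with random stand-in projection matrices (the paper uses trained multihead attention); only the attention pattern, within rows and then within columns, reflects the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_attention(x):
    """Self-attention within each row of an (H, W, D) feature map.

    Single-head sketch: Wq, Wk, Wv are random placeholders,
    not trained parameters.
    """
    H, W, D = x.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # each (H, W, D)
    logits = np.einsum('hid,hjd->hij', q, k) / np.sqrt(D)  # attend along W
    return np.einsum('hij,hjd->hid', softmax(logits), v)

def column_attention(x):
    # Column attention is row attention on the transposed map.
    return row_attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)

# Alternating one row- and one column-attention layer connects every pixel
# to every other pixel, at O(H*W*(H+W)) cost instead of O((H*W)^2).
x = np.random.default_rng(1).standard_normal((8, 8, 16))
y = column_attention(row_attention(x))
print(y.shape)  # (8, 8, 16)
```

After the two layers, each output position has received information from its entire row and, through the intermediate row outputs, from every other position, which is the global receptive field the text refers to.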
3.2 AXIAL TRANSFORMER
The Axial Transformer (Ho et al., 2019b) is an autoregressive model that applies (masked) row and column self-attention operations in a way that efficiently summarizes all past information x_{<i,j} for the pixel at position (i, j).
The outer decoder computes a state s^o over all previous rows x_{≤i,·} by applying N layers of full row self-attention followed by masked column self-attention (Eq. 2). s^o is shifted down by a single row, such that the output context o_{i,j} at position (i, j) only contains information about pixels x_{<i,·} from previous rows.
The embeddings to the inner decoder are shifted right by a single column to mask the current pixel x_{i,j}. The context o from the outer decoder conditions the inner decoder by biasing the shifted embeddings. It then computes a final state h by applying N layers of masked row-wise self-attention to infuse additional information from prior pixels of the same row, x_{i,<j}. As shown above, the outer and inner decoder operate on 2-D inputs, such as a single channel of an image. For multi-channel RGB images, when modeling the "current channel", the Axial Transformer incorporates information from prior channels of an image (as per raster order) with an encoder. The encoder encodes each prior channel independently with a stack of unmasked row/column attention layers. The encoder outputs across all prior channels are summed to output a conditioning context c for the "current channel". The context conditions the outer and inner decoder by biasing the inputs in Eq. 1 and Eq. 4, respectively.

Figure 2: Depiction of ColTran. It consists of 3 individual models: an autoregressive colorizer (left), a color upsampler (middle) and a spatial upsampler (right). Each model is optimized independently. The autoregressive colorizer (ColTran core) is an instantiation of the Axial Transformer (Sec. 3.2, Ho et al. (2019b)) with conditional transformer layers and an auxiliary parallel head proposed in this work (Sec. 4.1).
During training, the ground-truth coarse low-resolution image is both the input to the decoder and the target. Masked layers ensure that the conditional distribution for each pixel depends solely on previous ground-truth pixels (see Appendix G for a recap on autoregressive models). ColTran upsamplers are stacked row/column attention layers that deterministically upsample color and space in parallel. Each attention block (in green) is residual and consists of the following operations: layer-norm → multihead self-attention → MLP.

Sampling. The Axial Transformer natively supports semi-parallel sampling that avoids re-evaluation of the entire network to generate each pixel of an RGB image. The encoder is run once per channel, the outer decoder once per row and the inner decoder once per pixel. The context from the outer decoder and the encoder is initially zero. The encoder conditions the outer decoder (Eq. 1), and the encoder + outer decoder condition the inner decoder (Eq. 4). The inner decoder then generates a row, one pixel at a time, via Eqs. (4) to (6). After generating all pixels in a row, the outer decoder recomputes the context via Eqs. (1) to (3) and the inner decoder generates the next row. This proceeds until all pixels in a channel are generated. The encoder then recomputes the context to generate the next channel.

4 PROPOSED ARCHITECTURE

Image colorization is the task of transforming a grayscale image x^g ∈ R^{H×W×1} into a colored image x ∈ R^{H×W×3}. The task is inherently stochastic; for a given grayscale image x^g, there exists a conditional distribution over x, p(x | x^g). Instead of predicting x directly from x^g, we sequentially predict two intermediate low-resolution images x^{s↓} and x^{s↓c↓} with a lower color depth first.
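The semi-parallel sampling schedule above can be sketched as follows. The three networks are stand-in stubs that return random logits (the real ones are the trained attention stacks); only the loop structure, encoder once per channel, outer decoder once per row, inner decoder once per pixel, reflects the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_axial(H, W, C, n_symbols=8):
    """Loop structure of the Axial Transformer's semi-parallel sampler.

    The encoder, outer decoder and inner decoder below are hypothetical
    stubs, not the paper's networks.
    """
    encoder = lambda prev_channels: rng.standard_normal((H, W))
    outer_decoder = lambda rows_so_far, ctx: rng.standard_normal((W,))
    inner_decoder = lambda row_prefix, row_ctx: rng.standard_normal(n_symbols)

    img = np.zeros((H, W, C), dtype=np.int64)
    for c in range(C):                       # encoder: once per channel
        enc_ctx = encoder(img[..., :c])
        for i in range(H):                   # outer decoder: once per row
            row_ctx = outer_decoder(img[:i, :, c], enc_ctx)
            for j in range(W):               # inner decoder: once per pixel
                logits = inner_decoder(img[i, :j, c], row_ctx)
                p = np.exp(logits - logits.max())
                p /= p.sum()
                img[i, j, c] = rng.choice(n_symbols, p=p)
    return img

print(sample_axial(4, 4, 1).shape)  # (4, 4, 1)
```

The point of the schedule is that the expensive row-level context is recomputed only H times per channel, while only the cheap inner decoder runs per pixel.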
Besides decomposing high-resolution image colorization into simpler tasks, the smaller resolution allows for training larger models. We obtain x^{s↓}, a spatially downsampled representation of x, by standard area interpolation. x^{s↓c↓} is a 3-bit per-channel representation of x^{s↓}, that is, each color channel has only 8 intensities. Thus, there are 8³ = 512 coarse colors per pixel, which are predicted directly as a single "color" channel. We rewrite the conditional likelihood p(x | x^g) to incorporate the intermediate representations as follows:

p(x | x^g) = p(x | x^g) · p(x^{s↓c↓}, x^{s↓} | x, x^g) = p(x^{s↓c↓}, x^{s↓}, x | x^g)   (7)
           = p(x | x^{s↓}, x^g) · p(x^{s↓} | x^{s↓c↓}, x^g) · p(x^{s↓c↓} | x^g)   (8)

ColTran core (Section 4.1), a parallel color upsampler and a parallel spatial upsampler (Section 4.3) model p(x^{s↓c↓} | x^g), p(x^{s↓} | x^{s↓c↓}, x^g) and p(x | x^{s↓}, x^g), respectively.

Component      | Unconditional                        | Conditional
Self-Attention | y = Softmax(q k^T / √D) v            | y = Softmax(q_c k_c^T / √D) v_c, where for all z ∈ {k, q, v}: z_c = (c U_z^s) ⊙ z + c U_z^b
MLP            | y = ReLU(x U_1 + b_1) U_2 + b_2      | h = ReLU(x U_1 + b_1) U_2 + b_2; y = (c U_f^s) ⊙ h + c U_f^b
Layer Norm     | y = β Norm(x) + γ                    | y = β_c Norm(x) + γ_c, where for each µ ∈ {β_c, γ_c}: c ∈ R^{H×W×D} → ĉ ∈ R^{HW×D}, µ = (u · ĉ) U_µ, u ∈ R^{HW}

Table 1: We contrast the different components of unconditional self-attention with self-attention conditioned on context c ∈ R^{M×N×D}. Learnable parameters specific to conditioning are denoted by u and U_· ∈ R^{D×D}.

We describe these individual components in detail in the subsections below. From now on we will refer to all low resolutions as M × N and to the high resolution as H × W.
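The 512-symbol coarse color space described above (3 bits per channel, 8³ = 512 colors packed into a single "color" channel) can be sketched as simple bit-packing. The R-high/B-low packing order below is an illustrative assumption; the paper does not specify how the three 3-bit values are ordered.

```python
import numpy as np

def to_coarse(rgb):
    """Map an 8-bit RGB image (H, W, 3) to one channel of 512 coarse colors.

    Each channel is reduced to its top 3 bits (8 intensities) and the
    three 3-bit values are packed into a single symbol in [0, 512).
    Packing order (R in the high bits) is an assumption for illustration.
    """
    bits = rgb.astype(np.int64) >> 5            # 256 -> 8 intensities
    return (bits[..., 0] << 6) | (bits[..., 1] << 3) | bits[..., 2]

def from_coarse(symbols):
    """Unpack 512-way symbols back to a 3-bit-per-channel RGB image."""
    r, g, b = symbols >> 6, (symbols >> 3) & 7, symbols & 7
    return (np.stack([r, g, b], axis=-1) << 5).astype(np.uint8)

img = np.random.default_rng(0).integers(0, 256, (64, 64, 3), dtype=np.uint8)
coarse = to_coarse(img)
assert coarse.max() < 512
assert from_coarse(coarse).shape == img.shape
```

Predicting the 512-way symbol directly lets ColTran core treat the coarse image as a single autoregressive channel rather than three.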
An illustration of the overall architecture is shown in Figure 2.

4.1 COLTRAN CORE

In this section, we describe ColTran core, a conditional variant of the Axial Transformer (Ho et al., 2019b) for low-resolution coarse colorization. ColTran core models a distribution p_c(x^{s↓c↓} | x^g) over 512 coarse colors for every pixel, conditioned on a low-resolution grayscale image in addition to the colors of previously predicted pixels in raster order (Eq. 9):

p_c(x^{s↓c↓} | x^g) = ∏_{i=1}^{M} ∏_{j=1}^{N} p_c(x^{s↓c↓}_{ij} | x^g, x^{s↓c↓}_{<ij})   (9)
Conditional Self-Attention. For every layer in the decoder, we apply six 1 × 1 convolutions to c to obtain three scale and three shift vectors, which we apply element-wise to q, k and v of the self-attention operation (Section 3.1), respectively.

Conditional MLP. A standard component of the transformer architecture is a two-layer pointwise feed-forward network after the self-attention layer. We apply conditional scales and shifts to the output of each MLP, conditioned on c as for self-attention.

Conditional Layer Norm. Layer normalization (Ba et al., 2016) globally scales and shifts a given normalized input using learnable vectors β, γ. Instead, we predict β_c and γ_c as a function of c. We first aggregate c into a global 1-D representation ĉ ∈ R^D via a learnable spatial pooling layer. Spatial pooling is initialized as a mean pooling layer. Similar to 1-D conditional normalization layers (Perez et al., 2017; De Vries et al., 2017; Dumoulin et al., 2016; Huang & Belongie, 2017), we then apply a linear projection on ĉ to predict β_c and γ_c, respectively.

A grayscale encoder consisting of multiple alternating row and column self-attention layers encodes the grayscale image into the initial conditioning context c^g. It serves both as context for the conditional layers and as additional input to the embeddings of the outer decoder. The sum of the outer decoder's output and c^g conditions the inner decoder. Figure 2 illustrates how conditioning is applied in the autoregressive core of the ColTran architecture.

Conditioning every layer via multiple components allows stronger gradient signals through the encoder, and as an effect the encoder can learn better contextual representations. We validate this empirically by outperforming the native Axial Transformer, which conditions context states by biasing (see Section 5.2 and Section 5.4).
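The conditional layer norm described above can be sketched as follows: a learnable spatial pooling (initialized to mean pooling) summarizes the context, and two linear maps predict β_c and γ_c. This is a minimal sketch; the parameter names (`u`, `U_beta`, `U_gamma`) are mine, not the paper's.

```python
import numpy as np

def conditional_layer_norm(x, c, params):
    """Conditional layer norm (sketch).

    x: (H, W, D) input; c: (H, W, D) conditioning context.
    The pooling weights u are a learnable vector initialized to mean
    pooling; beta/gamma are linear functions of the pooled context.
    """
    H, W, D = x.shape
    u = params.get('u', np.full(H * W, 1.0 / (H * W)))  # learnable pooling
    c_hat = c.reshape(H * W, D)
    pooled = u @ c_hat                                  # (D,) global summary
    beta = pooled @ params['U_beta']                    # predicted scale
    gamma = pooled @ params['U_gamma']                  # predicted shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mu) / np.sqrt(var + 1e-6)
    return beta * normed + gamma

rng = np.random.default_rng(0)
D = 16
params = {'U_beta': np.eye(D) + 0.02 * rng.standard_normal((D, D)),
          'U_gamma': 0.02 * rng.standard_normal((D, D))}
y = conditional_layer_norm(rng.standard_normal((8, 8, D)),
                           rng.standard_normal((8, 8, D)), params)
print(y.shape)  # (8, 8, 16)
```

Because `u` is learnable, each conditional layer norm can aggregate the context differently, which is exactly the property the mean-pooling ablation in Section 5.2 tests.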
4.2 AUXILIARY PARALLEL MODEL

We additionally found it beneficial to train an auxiliary parallel prediction model p̃_c(x^{s↓c↓} | x^g) directly on top of the representations learned by the grayscale encoder (Eq. 10):

p̃_c(x^{s↓c↓} | x^g) = ∏_{i=1}^{M} ∏_{j=1}^{N} p̃_c(x^{s↓c↓}_{ij} | x^g)   (10)

Intuitively, this forces the model to compute richer representations and global color structure already at the output of the encoder, which can help conditioning and therefore has a beneficial, regularizing effect on learning. We apply a linear projection U_parallel ∈ R^{D×512} on top of c^g (the output of the grayscale encoder) to obtain a per-pixel distribution over 512 coarse colors. It was crucial to tune the relative contribution of the autoregressive and parallel predictions to improve performance, which we study in Section 5.3.

4.3 COLOR & SPATIAL UPSAMPLING

In order to produce high-fidelity colorized images from low-resolution, coarse color images and a given high-resolution grayscale image, we train color and spatial upsampling models. They share the same architecture while differing in their respective inputs and the resolution at which they operate. Similar to the grayscale encoder, the upsamplers comprise multiple alternating layers of row and column self-attention. The output of the encoder is projected to compute the logits underlying the per-pixel color probabilities of the respective upsampler. Figure 2 illustrates the architectures.

Color Upsampler. We convert the coarse image x^{s↓c↓} ∈ R^{M×N×1} of 512 colors back into a 3-bit RGB image with 8 symbols per channel. The channels are embedded using separate embedding matrices into x^{s↓c↓}_k ∈ R^{M×N×D}, where k ∈ {R, G, B} indicates the channel. We upsample each channel individually, conditioning only on the respective channel's embedding.
The channel embedding is summed with the respective grayscale embedding for each pixel and serves as input to the subsequent self-attention layers (encoder). The output of the encoder is further projected to per pixel-channel probability distributions p̃_{c↑}(x^{s↓}_k | x^{s↓c↓}, x^g) ∈ R^{M×N×256} over 256 color intensities for all k ∈ {R, G, B} (Eq. 11):

p̃_{c↑}(x^{s↓} | x^g, x^{s↓c↓}) = ∏_{i=1}^{M} ∏_{j=1}^{N} p̃_{c↑}(x^{s↓}_{ij} | x^g, x^{s↓c↓})   (11)

Spatial Upsampler. We first naively upsample x^{s↓} ∈ R^{M×N×3} into a blurry high-resolution RGB image using area interpolation. As above, we then embed each channel of the blurry RGB image and run a per-channel encoder exactly as for the color upsampler. The output of the encoder is finally projected to per pixel-channel probability distributions p̃_{s↑}(x_k | x^{s↓}, x^g) ∈ R^{H×W×256} over 256 color intensities for all k ∈ {R, G, B} (Eq. 12):

p̃_{s↑}(x | x^g, x^{s↓}) = ∏_{i=1}^{H} ∏_{j=1}^{W} p̃_{s↑}(x_{ij} | x^g, x^{s↓})   (12)

In our experiments, similar to Guadarrama et al. (2017), we found parallel upsampling to be sufficient for high-quality colorizations. Parallel upsampling has the huge advantage of fast generation, which would be notoriously slow for fully autoregressive models at high resolution. To avoid plausible minor color inconsistencies between pixels, instead of sampling each pixel from the predicted distributions in Eq. 11 and Eq. 12, we simply take the argmax. Even though this slightly limits the potential diversity of colorizations, in practice we observe that sampling only coarse colors via ColTran core is enough to produce a great variety of colorizations.

Objective. We train our architecture to minimize the negative log-likelihood of the data (Eq. 13).
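The argmax decoding used by the parallel upsamplers can be sketched in one line; the point is that determinism here avoids per-pixel color inconsistencies, while diversity still comes from sampling the coarse colors upstream in ColTran core. The shapes below are illustrative.

```python
import numpy as np

def argmax_decode(logits):
    """Parallel upsampler decoding: per pixel-channel argmax.

    logits: (H, W, 3, 256) per pixel-channel distributions over the
    256 intensities. Taking the argmax instead of sampling keeps
    neighbouring pixels consistent.
    """
    return logits.argmax(axis=-1).astype(np.uint8)  # (H, W, 3) image

rng = np.random.default_rng(0)
img = argmax_decode(rng.standard_normal((32, 32, 3, 256)))
print(img.shape, img.dtype)  # (32, 32, 3) uint8
```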
The likelihoods p_c, p̃_c, p̃_{c↑} and p̃_{s↑} are maximized independently, and λ is a hyperparameter that controls the relative contribution of p_c and p̃_c:

L = (1 − λ) log p_c + λ log p̃_c + log p̃_{c↑} + log p̃_{s↑}   (13)

Figure 3: Per-pixel log-likelihood of coarse colored 64 × 64 images over the validation set as a function of training steps. We ablate the various components of ColTran core in each plot. Left: ColTran with conditional transformer layers vs a baseline Axial Transformer which conditions via addition (ColTran-B). ColTran-B 2x and ColTran-B 4x refer to wider baselines with increased model capacity. Center: removing each conditional sub-component one at a time (no cLN, no cMLP and no cAtt). Right: conditional shifts only (Shift), conditional scales only (Scale), removal of kq conditioning in cAtt (cAtt, only v) and fixed mean pooling in cLN (cLN, mean pool). See Section 5.2 for more details.

5 EXPERIMENTS

5.1 TRAINING AND EVALUATION

We evaluate ColTran on colorizing 256 × 256 grayscale images from the ImageNet dataset (Russakovsky et al., 2015). We train ColTran core, the color upsampler and the spatial upsampler independently on 16 TPUv2 chips for 450K, 300K and 150K steps, respectively. We use axial attention blocks in each component of our architecture and optimize with RMSprop (Tieleman & Hinton, 2012) at a fixed learning rate. We set apart 10000 images from the training set as a holdout set to tune hyperparameters and perform ablations. To compute FID, we generate samples conditioned on the grayscale images from this holdout set. We use the public validation set to display qualitative results and report final numbers.

5.2 ABLATIONS OF COLTRAN CORE

The autoregressive core of ColTran models downsampled, coarse-colored images of resolution 64 × 64 with 512 coarse colors, conditioned on the respective grayscale image.
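The combined objective in Eq. 13 can be sketched as a weighted sum of the four log-likelihood terms. The λ value below is an illustrative placeholder, not the paper's tuned setting (the effect of λ is studied in Section 5.3).

```python
import numpy as np

def coltran_loss(logp_core, logp_parallel, logp_color_up, logp_spatial_up,
                 lam=0.01):
    """Eq. 13 as a negative log-likelihood to minimize (sketch).

    Inputs are per-pixel log-probabilities of the ground truth under the
    autoregressive core, the auxiliary parallel head, the color upsampler
    and the spatial upsampler. lam trades off the core against the
    auxiliary parallel head; lam=0.01 is a hypothetical default.
    """
    ll = ((1 - lam) * np.mean(logp_core)
          + lam * np.mean(logp_parallel)
          + np.mean(logp_color_up)
          + np.mean(logp_spatial_up))
    return -ll

rng = np.random.default_rng(0)
terms = [np.log(rng.uniform(0.1, 1.0, (64, 64))) for _ in range(4)]
print(coltran_loss(*terms) > 0)  # True: NLL of proper probabilities
```

Since the four models share no parameters, minimizing this sum is equivalent to training them independently, as the text states.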
In a series of experiments we ablate the different components of the architecture (Figure 3). In the sections below, we refer to the conditional self-attention, conditional layer norm and conditional MLP sub-components as cAtt, cLN and cMLP, respectively. We report the per-pixel log-likelihood over coarse colors on the validation set as a function of training steps.

Impact of conditional transformer layers. The left side of Figure 3 illustrates the significant improvement in loss that ColTran core (with conditional transformer layers) achieves over the original Axial Transformer (marked ColTran-B). This demonstrates the usefulness of our proposed conditional layers. Because conditional layers introduce a higher number of parameters, we additionally compare to and outperform original Axial Transformer baselines with 2x and 4x wider MLP dimensions (labeled ColTran-B 2x and ColTran-B 4x). Both ColTran-B 2x and ColTran-B 4x have an increased parameter count, which makes for a fair comparison. Our results show that the increased performance cannot be explained solely by the fact that our model has more parameters.

Importance of each conditional component. We perform a leave-one-out study to determine the importance of each conditional component. We remove each conditional component one at a time and retrain the ablated model. The curves no cLN, no cMLP and no cAtt in the middle of Figure 3 quantify our results. While each conditional component improves final performance, cAtt plays the most important role.

Multiplicative vs additive interactions. Conditional transformer layers employ both conditional shifts and scales, consisting of additive and multiplicative interactions, respectively. The curves Scale and Shift on the right-hand side of Figure 3 demonstrate the impact of these interactions via ablated architectures that use conditional shifts only or conditional scales only.
While both types of interactions are important, multiplicative interactions have a much stronger impact.

Figure 4: Left: FID of generated 64 × 64 coarse samples as a function of training steps for two values of λ. Center: final FID scores as a function of λ. Right: FID as a function of log-likelihood.

Context-aware dot-product attention. Self-attention computes the similarity between pixel representations using a dot product between q and k (see Eq. 15). cAtt applies conditional shifts and scales to q and k, allowing this similarity to be modified based on contextual information. The curve cAtt, only v on the right of Figure 3 shows that removing this property, by conditioning only on v, leads to worse results.

Fixed vs adaptive global representation. cLN aggregates global information with a flexible, learnable spatial pooling layer. We experimented with a fixed mean pooling layer, forcing all cLN layers to use the same global representation with the same per-pixel weight. The curve cLN, mean pool on the right of Figure 3 shows that enforcing this constraint causes inferior performance, even as compared to having no cLN. This indicates that different aggregations of the global representation are important for different cLN layers.

5.3 OTHER ABLATIONS

Auxiliary parallel model. We study the effect of the hyperparameter λ, which controls the contribution of the auxiliary parallel prediction model described in Section 4.2. For a given λ, we optimize p̂_c(λ) = (1 − λ) log p_c(·) + λ log p̃_c(·) instead of just log p_c(·). Note that p̃_c(·) models each pixel independently, which is more difficult than modeling each pixel conditioned on previous pixels as in p_c(·). Hence, employing p̂_c(λ) as a holdout metric would just lead to a trivial solution at λ = 0. Instead, the FID of the generated coarse 64 × 64 samples provides a reliable way to find an optimal value of λ.
In Figure 4, at a small value of λ, our model converges to a better FID faster, with a marginal but consistent final improvement. At higher values the performance deteriorates quickly.

Upsamplers. Upsampling coarse colored, low-resolution images to a higher resolution is much simpler. Given ground-truth 64 × 64 coarse images, the ColTran upsamplers map these to fine-grained 256 × 256 images without any visible artifacts, with an FID of 16.4. For comparison, the FID between two random sets of 5000 samples from our holdout set is 15.5. It is furthermore extremely important to provide the grayscale image as input to each of the individual upsamplers, without which the generated images appear highly smoothed out and the FID drops to 27.0. We also trained a single upsampler for both color and resolution. The FID in this case drops marginally to 16.6.

5.4 FRÉCHET INCEPTION DISTANCE

We compute FID using colorizations of 5000 grayscale images of resolution 256 × 256 from the ImageNet validation set, as done in (Ardizzone et al., 2019). To compute the FID, we ensure that there is no overlap between the grayscale images that condition ColTran and those in the ground-truth distribution. In addition to ColTran, we report two additional results, ColTran-S and ColTran-B (Baseline). ColTran-B refers to the baseline Axial Transformer that conditions via addition at the input. PixColor samples smaller 28 × 28 colored images autoregressively, compared to ColTran's 64 × 64. As a control experiment, we train an autoregressive model at resolution 28 × 28 (ColTran-S) to disentangle architectural choices from the inherent stochasticity of modeling higher resolution images. ColTran-S and ColTran-B obtain FID scores of 21.9 and 21.6, which significantly improve over the previous best FID of 24.32. Finally, ColTran achieves the best FID score of 19.71. All results are presented in Table 2 (left).
Models    | FID
ColTran   | 19.71
ColTran-B | 21.6
ColTran-S | 21.9

Models           | AMT fooling rate
ColTran (Oracle) | 62.0%
ColTran (Seed 3) | 41.7%

Table 2: We outperform various state-of-the-art colorization models both on FID (left) and human evaluation (right). We obtain the baseline FID scores from (Ardizzone et al., 2019) and the baseline human evaluation results from (Guadarrama et al., 2017). ColTran-B is a baseline Axial Transformer that conditions via addition, and ColTran-S is a control experiment where we train ColTran core (see: 4.1) on smaller 28 × 28 colored images.

Figure 5: We display the per-pixel maximum predicted probability over 512 colors as a proxy for uncertainty.

Correlation between FID and log-likelihood. For each architectural variant, Figure 4 (right) illustrates the correlation between log-likelihood and FID after 150K training steps. There is a moderately positive correlation of 0.57 between log-likelihood and FID. Importantly, even an absolute improvement on the order of 0.01 - 0.02 can improve FID significantly. This suggests that designing architectures that achieve better log-likelihood values is likely to lead to improved FID scores and colorization fidelity.

5.5 QUALITATIVE EVALUATION

Human evaluation. For our qualitative assessment, we follow the protocol used in PixColor (Guadarrama et al., 2017). ColTran colorizes 500 grayscale images, with 3 different colorizations per image, denoted as seeds. Human raters assess the quality of these colorizations with a two-alternative forced choice (2AFC) test. We display both the ground-truth and the recolorized image sequentially, for one second each, in random order. The raters are then asked to identify the image with fake colors. For each seed, we report the mean fooling rate over 500 colorizations and 5 different raters. For the oracle method, we use the human ratings to pick the best-of-three colorization.
ColTran's best seed achieves a fooling rate of 42.3% compared to 35.4% for PixColor's best seed. The ColTran oracle achieves a fooling rate of 62%, indicating that human raters prefer ColTran's best-of-three colorizations over the ground-truth image itself.

Visualizing uncertainty. The autoregressive core model of ColTran should be highly uncertain at object boundaries where colors change. Figure 5 illustrates the per-pixel maximum predicted probability over 512 colors as a proxy for uncertainty. We observe that the model is indeed highly uncertain at edges and within more complicated textures.

6 CONCLUSION

We presented the Colorization Transformer (ColTran), an architecture that relies entirely on self-attention for image colorization. We introduce conditional transformer layers, a novel building block for conditional, generative models based on self-attention. Our ablations show the superiority of this mechanism over a number of different baselines. Finally, we demonstrate that ColTran can generate diverse, high-fidelity colorizations on ImageNet which are largely indistinguishable from the ground truth even for human raters.

REFERENCES

Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392, 2019.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

Yun Cao, Zhiming Zhou, Weinan Zhang, and Yong Yu. Unsupervised diverse colorization via generative adversarial networks, 2017.

Zezhou Cheng, Qingxiong Yang, and Bin Sheng. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 415–423, 2015.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

Yuanzheng Ci, Xinzhu Ma, Zhihui Wang, Haojie Li, and Zhongxuan Luo.
User-guided deep anime line art colorization with conditional adversarial networks. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1536–1544, 2018.
Ceyda Cinarel and Byoung-Tak Zhang. Into the colorful world of webtoons: Through the lens of neural networks. In , volume 3, pp. 35–40. IEEE, 2017.
Ryan Dahl. Automatic colorization, 2016.
Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pp. 6594–6604, 2017.
Aditya Deshpande, Jason Rock, and David Forsyth. Learning large-scale automatic image colorization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 567–575, 2015.
Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, Min Jin Chong, and David Forsyth. Learning diverse image colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6837–6845, 2017.
Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. arXiv preprint arXiv:1610.07629, 2016.
David M Geshwind. Method for colorizing black and white footage, August 19 1986. US Patent 4,606,625.
Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
Sergio Guadarrama, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, and Kevin Murphy. PixColor: Pixel recursive colorization. arXiv preprint arXiv:1705.07208, 2017.
Raj Kumar Gupta, Alex Yong-Sang Chia, Deepu Rajan, Ee Sin Ng, and Huang Zhiyong. Image colorization using similar images. In Proceedings of the 20th ACM international conference on Multimedia, pp.
369–378, 2012.
Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019a.
Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019b.
Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510, 2017.
Yi-Chin Huang, Yi-Shin Tung, Jun-Cheng Chen, Sung-Wen Wang, and Ja-Ling Wu. An adaptive edge detection based colorization algorithm and its applications. In Proceedings of the 13th annual ACM international conference on Multimedia, pp. 351–354, 2005.
Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (ToG), 35(4):1–11, 2016.
Revital Ironi, Daniel Cohen-Or, and Dani Lischinski. Colorization by example. In Rendering Techniques, pp. 201–210. Citeseer, 2005.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C.
Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit, 2017.
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European conference on computer vision, pp. 577–593. Springer, 2016.
Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. In ACM SIGGRAPH 2004 Papers, pp. 689–694. 2004.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738, 2015.
Qing Luan, Fang Wen, Daniel Cohen-Or, Lin Liang, Ying-Qing Xu, and Heung-Yeung Shum. Natural image colorization. In Proceedings of the 18th Eurographics conference on Rendering Techniques, pp. 309–320, 2007.
Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608, 2018.
Safa Messaoud, David Forsyth, and Alexander G. Schwing. Structural consistency and controllability for diverse colorization.
In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
Yuji Morimoto, Yuichi Taguchi, and Takeshi Naemura. Automatic colorization of grayscale images using multiple images on the web. In SIGGRAPH 2009: Talks, pp. 1–1. 2009.
Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. arXiv preprint arXiv:1802.05751, 2018.
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
François Pitié, Anil C Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007.
Yingge Qu, Tien-Tsin Wong, and Pheng-Ann Heng. Manga colorization. ACM Transactions on Graphics (TOG), 25(3):1214–1220, 2006.
Erik Reinhard, Michael Adhikhmin, Bruce Gooch, and Peter Shirley. Color transfer between images. IEEE Computer graphics and applications, 21(5):34–41, 2001.
Amelie Royer, Alexander Kolesnikov, and Christoph H Lampert. Probabilistic image colorization. arXiv preprint arXiv:1705.04258, 2017.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
Daniel Sýkora, Jan Buriánek, and Jiří Žára. Unsupervised colorization of black-and-white cartoons. In Proceedings of the 3rd international symposium on Non-photorealistic animation and rendering, pp. 121–127, 2004.
Yu-Wing Tai, Jiaya Jia, and Chi-Keung Tang. Local color transfer via probabilistic segmentation by expectation-maximization. In , volume 1, pp. 747–754.
IEEE, 2005.
Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
Sotirios A Tsaftaris, Francesca Casadio, Jean-Louis Andral, and Aggelos K Katsaggelos. A novel visualization tool for art history and conservation: Automated colorization of black and white archival photographs of works of art. Studies in conservation, 59(3):125–135, 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos. In Proceedings of the European conference on computer vision (ECCV), pp. 391–408, 2018.
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.07853, 2020.
Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. arXiv preprint arXiv:1906.02634, 2019.
Tomihisa Welsh, Michael Ashikhmin, and Klaus Mueller. Transferring color to greyscale images. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pp. 277–280, 2002.
Chufeng Xiao, Chu Han, Zhuming Zhang, Jing Qin, Tien-Tsin Wong, Guoqiang Han, and Shengfeng He. Example-based colourization via dense encoding pyramids. In Computer Graphics Forum, volume 39, pp. 20–33. Wiley Online Library, 2020.
Liron Yatziv and Guillermo Sapiro. Fast image and video colorization using chrominance blending. IEEE transactions on image processing, 15(5):1120–1129, 2006.
Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao.
LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
Bo Zhang, Mingming He, Jing Liao, Pedro V Sander, Lu Yuan, Amine Bermak, and Dong Chen. Deep exemplar-based video colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8061, 2019a.
Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363. PMLR, 2019b.
Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European conference on computer vision, pp. 649–666. Springer, 2016.
Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999, 2017.

Figure 6: Left: FID vs. training steps, with and without Polyak averaging. Right: The effect of K in top-K sampling on FID. See Appendix B and E.

ACKNOWLEDGEMENTS

We would like to thank Mohammad Norouzi, Rianne van den Berg and Mostafa Dehghani for their useful comments on the draft, and Avital Oliver for assistance with the Mechanical Turk setup.

A CODE, CHECKPOINTS AND TENSORBOARD FILES

Our implementation is open-sourced in the google-research framework at https://github.com/google-research/google-research/tree/master/coltran. Our full set of hyperparameters is available here. We provide pre-trained checkpoints of the colorizer and upsamplers on ImageNet at https://console.cloud.google.com/storage/browser/gresearch/coltran. The colorizer and spatial upsampler for these checkpoints were trained longer, for 600K and 300K steps, which gave us a slightly improved FID score of ≈ .
Finally, reference tensorboard files for our training runs are available at colorizer tensorboard, color upsampler tensorboard and spatial upsampler tensorboard.

B EXPONENTIAL MOVING AVERAGE

We found that using an exponential moving average (EMA) of our checkpoints is crucial for generating high-quality samples. In Figure 6, we display the FID as a function of training steps, with and without EMA. With EMA, our FID score improves steadily over time.

C NUMBER OF PARAMETERS AND INFERENCE SPEED

Inference speed. ColTran core can sample a batch of 20 64x64 grayscale images in around 3.5-5 minutes on a P100 GPU, whereas PixColor takes 10 minutes to colorize 28x28 grayscale images on a K40 GPU. Sampling 28x28 colorizations with ColTran takes around 30 seconds. The upsampler networks run in the order of milliseconds.

Further, in our naive implementation we recompute the activations c^U_zs, c^U_zb, c^U_fs, c^U_fb in Table 1 to generate every pixel in the inner decoder. Instead, we can compute these activations once per grayscale image in the encoder and once per row in the outer decoder, and reuse them. This is likely to speed up sampling even more; we leave this engineering optimization for future work.

Number of parameters. ColTran has a total of ColTran core (46M) + color upsampler (14M) + spatial upsampler (14M) = 74M parameters. In comparison, PixColor has conditioning network (44M) + colorizer network (11M) + refinement network (28M) = 83M parameters.

D LOWER COMPUTE REGIME

We retrained the autoregressive colorizer and color upsampler on 4 TPUv2 chips (the lowest configuration) with reduced batch sizes of 56 and 192, respectively. For the spatial upsampler, we found that a batch size of 8 was sub-optimal and led to a large deterioration in loss.

Figure 7: Ablated models. Gated: gated conditioning layers as done in (Oord et al., 2016). cAtt + cMLP, global: global conditioning instead of pointwise conditioning in cAtt and cLN.
We thus used a smaller spatial upsampler with 2 axial attention blocks and a batch size of 16, trained likewise on 4 TPUv2 chips. The FID drops from 19.71 to 20.9, which is still significantly better than the other models in Table 2. We note that in this experiment we use only 12 TPUv2 chips in total, while PixColor (Guadarrama et al., 2017) uses a total of 16 GPUs.

E IMPROVED FID WITH TOP-K SAMPLING

We can improve colorization fidelity and remove artifacts due to unnatural colors via top-K sampling, at the cost of reduced colorization diversity. In this setting, for a given pixel ColTran generates a color from the top-K colors (instead of all 512 colors) as determined by the predicted probabilities. Our results in Figure 6 for K = 4 and K = 8 demonstrate a performance improvement over the baseline ColTran model with K = 512.

F ADDITIONAL ABLATIONS

Figure 7 reports additional ablations of our conditional transformer layers, none of which improved performance:
• Conditional transformer layers based on gated layers (Oord et al., 2016) (Gated).
• A global conditioning layer instead of pointwise conditioning in cAtt and cLN (cAtt + cMLP, global).

G AUTOREGRESSIVE MODELS

Autoregressive models are a family of probabilistic methods that model the joint distribution of data P(x) over a sequence of symbols (x_1, x_2, ..., x_N) as a product of conditionals, P(x) = \prod_{i=1}^{N} P(x_i \mid x_{<i}).
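As a concrete illustration of this factorization, combined with the top-K sampling described in Appendix E, the following NumPy sketch draws symbols one at a time from a toy conditional distribution. The names here (`toy_logits`, `top_k_sample`, `sample_autoregressive`) are our own illustrative placeholders, not part of the released ColTran code; in the real model, `logit_fn` would be the autoregressive network conditioned on the grayscale image and previously sampled pixels.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample an index from softmax(logits) restricted to the k largest logits."""
    top = np.argsort(logits)[-k:]                 # indices of the k most likely colors
    p = np.exp(logits[top] - logits[top].max())   # stable softmax over the top-k
    p /= p.sum()
    return int(rng.choice(top, p=p))

def sample_autoregressive(logit_fn, n, k=512, seed=0):
    """Draw x_1..x_n one symbol at a time from P(x_i | x_<i).

    logit_fn(prefix) returns logits over the color vocabulary given the
    previously sampled prefix; k=512 corresponds to full (untruncated) sampling.
    """
    rng = np.random.default_rng(seed)
    x = []
    for _ in range(n):
        x.append(top_k_sample(logit_fn(x), k, rng))
    return x

# Toy conditional that prefers colors close to the previous one (illustrative only).
def toy_logits(prefix, num_colors=512):
    last = prefix[-1] if prefix else num_colors // 2
    return -0.01 * (np.arange(num_colors) - last) ** 2

sample = sample_autoregressive(toy_logits, n=16, k=8)
```

With k=1 this reduces to greedy decoding; smaller k trades diversity for fidelity, matching the FID behavior reported in Appendix E.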
In the following we describe row self-attention; that is, we omit the height dimension, as all operations are performed in parallel for each row.

Figure 8: We train our colorization model on ImageNet and display high resolution colorizations from LSUN.

Given the representation of a single row of an image x_{i,·} ∈ R^{W×D}, a row-wise self-attention block is applied as follows:

[q, k, v] = LN(x_{i,·}) U_{qkv},   U_{qkv} ∈ R^{D × 3·D_h}   (14)
A = softmax(q k^T / √D_h),   A ∈ R^{W×W}   (15)
SA(x_{i,·}) = A v   (16)
MSA(x_{i,·}) = [SA_1(x_{i,·}), SA_2(x_{i,·}), ..., SA_k(x_{i,·})] U_out,   U_out ∈ R^{k·D_h × D}   (17)

LN refers to the application of layer normalization (Ba et al., 2016). Finally, we apply residual connections and a feed-forward neural network with a single hidden layer and ReLU activation (MLP) after each self-attention block, as is common practice in transformers:

x'_{i,·} = MSA(x_{i,·}) + x_{i,·},   x̂_{i,·} = MLP(LN(x'_{i,·})) + x'_{i,·}   (18)

Column-wise self-attention over x_{·,j} ∈ R^{H×D} works analogously.

I OUT OF DOMAIN COLORIZATIONS

We use our colorization model trained on ImageNet to colorize high-resolution grayscale images from LSUN (Yu et al., 2015) and low-resolution grayscale images from Celeb-A (Liu et al., 2015). Note that these models were trained only on ImageNet and were not finetuned on Celeb-A or LSUN.

Figure 9: We train our colorization model on ImageNet and display low resolution colorizations from Celeb-A.

Figure 10: Top: colorizations. Bottom: ground truth. From left to right, our colorizations have a progressively higher fooling rate.
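The row-wise self-attention block of Equations 14-18 can be sketched in NumPy as follows. This is a minimal illustration with randomly drawn weights standing in for learned parameters; the function and parameter names (`row_self_attention_block`, `params`) are our own and not from the released code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LN over the feature dimension (Ba et al., 2016), without learned scale/shift."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def row_self_attention_block(x_row, params, heads, d_h):
    """One row-wise self-attention block for a single row x_row of shape (W, D)."""
    qkv = layer_norm(x_row) @ params["U_qkv"]            # Eq. 14: (W, 3*heads*d_h)
    q, k, v = np.split(qkv, 3, axis=-1)
    outs = []
    for h in range(heads):
        qh = q[:, h*d_h:(h+1)*d_h]
        kh = k[:, h*d_h:(h+1)*d_h]
        vh = v[:, h*d_h:(h+1)*d_h]
        a = qh @ kh.T / np.sqrt(d_h)                     # Eq. 15: (W, W) scores
        a = np.exp(a - a.max(-1, keepdims=True))
        a = a / a.sum(-1, keepdims=True)                 # softmax over the row
        outs.append(a @ vh)                              # Eq. 16: per-head output
    msa = np.concatenate(outs, -1) @ params["U_out"]     # Eq. 17: project heads back to D
    x1 = msa + x_row                                     # residual connection
    hid = np.maximum(0, layer_norm(x1) @ params["W1"])   # single-hidden-layer MLP, ReLU
    return hid @ params["W2"] + x1                       # Eq. 18

rng = np.random.default_rng(0)
W_len, D, heads, d_h = 8, 16, 4, 4
params = {
    "U_qkv": rng.normal(0, 0.02, (D, 3 * heads * d_h)),
    "U_out": rng.normal(0, 0.02, (heads * d_h, D)),
    "W1":    rng.normal(0, 0.02, (D, 4 * D)),
    "W2":    rng.normal(0, 0.02, (4 * D, D)),
}
out = row_self_attention_block(rng.normal(size=(W_len, D)), params, heads, d_h)
```

Applying this block to every row in parallel, and its transpose-analogue to every column, yields the full axial attention layer; the O(W²) cost per row (rather than O((HW)²) for full attention) is what gives the complexity advantage discussed in the introduction.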
J NUMBER OF AXIAL ATTENTION BLOCKS

We did a very small hyperparameter sweep using the baseline axial transformer (no conditional layers) with the following configurations:
• hidden size = 512, number of blocks = 4
• hidden size = 1024, number of blocks = 2
• hidden size = 512, number of blocks = 2
Once we found the optimal configuration, we fixed it for all future architecture design.

K ANALYSIS OF MTURK RATINGS

Figure 11: In each column, we display the ground truth followed by 3 samples. Left: diverse and real. Center: realism improves from left to right. Right: failure cases.

Figure 12: We display the per-pixel, maximum predicted probability over 512 colors as a proxy for uncertainty.

We analyzed our samples on the basis of the MTurk ratings in Figure 11. On the left, we show images where all the samples have a fooling rate > 60%. Our model is able to show diversity in color for both high-level structure and low-level details. In the center, we display samples that have a high variance in MTurk ratings, with a difference of 80% between the best and the worst sample. All of these are complex objects that our model is able to colorize reasonably well given multiple attempts. On the right of Figure 11, we show failure cases where all samples have a fooling rate of 0%. In these cases, our model is unable to colorize highly complex structures that would arguably be difficult even for a human.

L MORE PROBABILITY MAPS

We display additional probability maps to visualize uncertainty, as done in Section 5.5.

M M
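The per-pixel maximum-probability maps used as an uncertainty proxy (Figure 12 and Appendix L) can be computed from the model's output logits as in the following sketch. The `logits` tensor here is synthetic and the function name `max_probability_map` is our own illustrative choice, not from the released code.

```python
import numpy as np

def max_probability_map(logits):
    """Per-pixel maximum predicted probability over the color vocabulary.

    A low maximum probability means the model spreads mass over many colors,
    i.e. it is uncertain. logits has shape (H, W, num_colors).
    """
    z = logits - logits.max(-1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z)
    probs /= probs.sum(-1, keepdims=True)
    return probs.max(-1)                         # (H, W), values in (0, 1]

# Synthetic logits: confident on the left half, near-uniform on the right,
# mimicking a flat region next to a complicated texture.
rng = np.random.default_rng(0)
logits = rng.normal(0, 0.1, (4, 4, 512))
logits[:, :2, 7] += 10.0                         # left half: one dominant color
umap = max_probability_map(logits)
```

On real samples, thresholding or simply plotting `umap` reproduces the qualitative behavior described in Section 5.5: low values concentrate at object boundaries and textured regions.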