Semantic-driven Colorization
A Preprint
Man M. Ho
Hosei University, Japan [email protected]
Lu Zhang
Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164, France [email protected]
Alexander Raake
Audiovisual Technology Group, TU Ilmenau, Germany [email protected]
Jinjia Zhou
Hosei University, Japan [email protected]
Figure 1: In automatic colorization, the previous methods Zhang'16 [1], Larsson'16 [2], Iizuka'16 [3], and Zhang'17 [4] synthesize the same color tone for the bird and the background. Meanwhile, our result gives harmonious and discriminative colors. (Panels, left to right: our detected semantic map, ours, Zhang'17, Iizuka'16, Larsson'16, Zhang'16.)

Abstract
Recent deep colorization works predict semantic information implicitly while learning to colorize black-and-white photographic images. As a consequence, the generated colors are prone to overflow, and semantic faults are invisible. Drawing on human experience in coloring, a person first recognizes which objects are in the photo and where they are, imagines which colors are plausible for those objects in real life, and then colorizes the photo. In this study, we simulate that human-like process: our network first learns to segment what is in the photo, then colorizes it. Therefore, our network can choose a plausible color under a semantic constraint for specific objects and give discriminative colors between them. Moreover, the segmentation map becomes understandable and interactable for the user. Our models are trained on PASCAL-Context and evaluated on selected images from the public domain and COCO-Stuff, which contains several categories unseen in the training data. As the experimental results show, our colorization system can provide plausible colors for specific objects and generate harmonious colors competitive with state-of-the-art methods.
Keywords
Colorization · Deep Learning
Colorization generates colors for old black-and-white photos. Thanks to developments in machine learning, computers can handle this task surprisingly well. The common approaches are data-driven automatic colorization and user-guided colorization.
Data-driven automatic techniques generate plausible colors by learning a color mapping from the training data. Generally, an end-to-end deep neural network is employed to manipulate colors directly. Larsson et al. [2] use the image-classification pre-trained model of VGG16 [5] to extract contextual information as hypercolumns. Iizuka et al. [3] divide their network into two streams: colorization and object recognition. They scale the input image and feed it into the recognition stream, then fuse the recognized features into the middle of the colorization stream. Meanwhile, the work of Vitoria et al. [6] lets the network learn object recognition internally. Instead of semantic recognition, Yoo et al. [7] present a memory network that learns color features, then modify the features in the colorization network. Zhang et al. [1] address the multimodal nature of the colorization problem by learning probability distributions over the possible colors. Zhao et al. [8] improve the work of Zhang et al. [1] by letting the network learn the meaning of pixels through semantic segmentation. Afterward, they improve their colorization to learn semantic segmentation inside the network, as described in [9]. However, semantic detection is limited on gray-scale images, leading to semantic-related problems such as color bleeding and color inconsistency. Moreover, semantic faults also bring monochrome-like problems. Figure 1 shows the problem in existing works [3, 2, 1]: the bird has the same color tone as the background. This effect can be reproduced by reducing saturation and overlaying a color filter on the original image. To solve this problem, we provide a scheme to learn the semantic map associated with the gray context at the low-level features. The advantage of our approach is that the gray features are effectively combined with the semantic information during training, which helps to define the features of an object's segmentation and its gray-scale region. The combined features, rich in semantic information, help to synthesize discriminative colors. Consequently, this work provides more discriminative colors, especially between the bird and the background.
User-guided edit propagation techniques rely on the user's suggestions to colorize the image. The user-guided input can be scribbles, dots, image contents, etc., to control the color. Levin et al. [10] use an optimization method to match the user's strokes with the gray image. Bahng et al. [11] extract a palette to guide their network based on user-provided words. Besides, the methods [12, 13, 14, 15, 16, 17] use a reference to transfer histograms. Zhang et al. [4] introduce a network that uses several user-guided colorful dots with suggested colors, associated with a reference. Yi Xiao et al. [18] adopt the work of Zhang et al. [4] to build a network that supports both a guided reference and user-guided color dots simultaneously. They extract the palette using not only their recommendation system but also the user's input. Recently, methods leveraging user-guided color dots achieve state-of-the-art performance. However, a given gray pixel has various plausible colors, especially colors with the same intensity, and it is challenging for the user to avoid incompatible colors between objects. In fact, the harmony of a colorful image depends on the colors of specific objects and the colors between them. Most of the user-guided methods that use colorful dots try to answer the question: "There are guided color dots; what should the color of this picture be?" Meanwhile, our approach aims to answer the question: "There is a dog on the grass under the sky; which colors should this picture have?" Thus, our method synthesizes color based not only on gray-scale values but also on semantic information.

In this study, we modify the U-Net [19], combine it with GridNet [20], and present a colorization framework simulating human-like action, first recognition and then colorization, to achieve better results. Our motivation comes from the following insight: (a) the generated colors can be more natural if we constrain gray-scale values by their semantic information (e.g., which "red" is plausible for the gray-scale values of a "fire truck"). While solving (a), we consider how to exploit semantic information effectively: (b) object recognition tasks (e.g., image classification, semantic segmentation) are usually limited by the lack of color information, and semantic faults cannot be adjusted; (c) the instance-specific contrast information in the features of gray-scale images is complicated to train on, especially for learning natural color at a semantic level. To solve (a) and (b), we isolate semantic segmentation from colorization; consequently, semantic faults are easily adjusted. However, feeding a concatenated input to a typical network (e.g., U-Net) cannot solve problem (c), learning from normalized gray-scale features together with vanilla semantic features. We thus modify U-Net to have two streams of data: the gray-scale image and the segmentation map. In inference, only the gray-scale features are Instance Normalized (IN), and the modified U-Net observes gray-scale values and semantic information using shared weights. Our contributions are as follows:

• Mimicking human-like action in colorizing a photo, we present a colorization framework that localizes objects, generalizes color based on prior semantic information, and synthesizes natural colors competitive with previous works. Furthermore, the semantic map is intervenable.
• In our colorization network, we modify the U-Net to simplify gray-scale features.
• We build an application for semantic map modification to study how semantic information influences the generated color.
Figure 2: Overview of the proposed system: the gray input passes through Semantic Segmentation to produce a semantic map (e.g., cow, grass), which the Colorization network then uses.
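The two-stage pipeline in Figure 2 (segment first, then colorize) can be sketched at the shape level as follows. This is a minimal mock, not the trained networks: the stand-in functions, the toy sizes, and the class count are all illustrative; only the 1-channel luma / 2-channel chroma split of CIE Lab is taken from the text.

```python
import numpy as np

H, W = 64, 64  # toy spatial size

def semantic_segmentation(luma: np.ndarray) -> np.ndarray:
    """Stand-in for the GridNet branch: gray image -> per-pixel class ids."""
    assert luma.shape == (H, W, 1)
    return np.zeros((H, W, 1), dtype=np.int64)  # dummy map (a single class)

def colorization(luma: np.ndarray, sem_map: np.ndarray) -> np.ndarray:
    """Stand-in for the modified U-Net: (luma, semantic map) -> 2-channel chroma."""
    return np.zeros((H, W, 2), dtype=np.float32)

luma = np.random.rand(H, W, 1).astype(np.float32)    # X: gray-scale input
sem_map = semantic_segmentation(luma)                # coarse semantic map (user-editable)
chroma = colorization(luma, sem_map)                 # predicted ab channels
lab_image = np.concatenate([luma, chroma], axis=-1)  # final Lab image
print(lab_image.shape)  # (64, 64, 3)
```

The user edits `sem_map` between the two stages; re-running `colorization` then updates the result.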
As shown in Figure 2, the proposed colorization framework is composed of two components: Semantic Segmentation using GridNet [20] and colorization using the modified U-Net. The gray image is first processed to obtain a semantic map. Second, our colorization network leverages the semantic map combined with the gray values to generate the initial color. The plausible color is synthesized based on the predicted semantic information. Furthermore, because the map is visible to humans, the user can adjust the interactive semantic map from coarse to refined using our application (https://minhmanho.github.io/semantic-driven_colorization/).

In our study, we use the CIE Lab color space, as in [3, 2, 1, 4], whose luma and chroma represent the gray-scale and color components, respectively. Given the luma X ∈ R^{H×W×1} and a filled semantic map S, our target is to generate the chroma, a plausible color Ŷ ∈ R^{H×W×2}, which can be concatenated with X to produce a satisfactory colorful image. An accurate semantic map S guides the colorization network to learn a plausible color for a specific semantic segmentation. Because filling in the map S is inconvenient for the user, we first use the modified GridNet [20], trained on gray-scale images, to detect a coarse semantic map Ŝ. Starting from the predicted Ŝ, the user can constantly adjust the map from coarse to fine to produce a satisfactory output.

The advantages of our method are as follows: 1) The Semantic Segmentation can be improved independently by leveraging previous research on semantic segmentation. 2) Our colorization network can learn features that effectively combine the semantic map and gray values. Unlike traditional methods, which use concatenation for multiple inputs, the network leverages the originally extracted features of the semantic map and instance-normalized features of the gray-scale content. 3) The user can constantly adjust the coarse semantic map from Semantic Segmentation.

Semantic information plays an important role in colorization. It can be combined with gray-scale values to provide harmoniously plausible colors. Therefore, an accurate semantic map is crucial for our colorization network. Since it is inconvenient for the user to fill in the map, we use Semantic Segmentation in our system to support the user and make colorization automatic. In this study, we choose GridNet [20] for our Semantic Segmentation. The network has five rows and six columns, with row depth dimensions of [16, …]. All convolutions share the same kernel size, padding, and stride, except for the sampling convolutions. The down-sampling layers use a convolution stride of 2 to reduce the spatial size, while up-sampling uses transposed convolutions with a stride of 2. Cross-entropy is used for the segmentation loss L_seg.

Our colorization network synthesizes color by leveraging the semantic map associated with the gray-scale image, as in an image-to-image translation task. We thus modify the well-known U-Net [19], which achieves outstanding performance in transforming images to images [21]. However, the plain U-Net is limited in handling each type of feature. To make our colorization more proficient, we modify the U-Net to have two streams, as shown in Figure 3. Our modified U-Net contains two parts: an encoder and a decoder. The components in the encoder share their weights across the two streams.
Figure 3: The modified U-Net for our colorization.
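The shared-weight, two-stream idea in Figure 3 can be illustrated with a toy layer: the same kernel processes both the gray stream and the semantic stream, and the two feature maps are then concatenated depth-wise. The kernel values, sizes, and naive single-channel convolution here are illustrative, not the trained model.

```python
import numpy as np

def conv2d_valid(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive single-channel 'valid' convolution (correlation) for illustration."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * kernel).sum()
    return out

shared_kernel = np.ones((3, 3)) / 9.0  # ONE set of weights for BOTH streams
gray = np.random.rand(8, 8)            # gray-scale stream
semantic = np.random.rand(8, 8)        # semantic-map stream

gray_feat = conv2d_valid(gray, shared_kernel)
sem_feat = conv2d_valid(semantic, shared_kernel)
features = np.stack([gray_feat, sem_feat], axis=-1)  # depth concat doubles channels
print(features.shape)  # (6, 6, 2)
```

Sharing the kernel forces both streams through the same learned filters, which is what lets the encoder tie semantic information to the gray context.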
Figure 4: Convolution Layer (CL) and Deconvolution Layer (DL) in Figure 3. CL: Conv2D (stride 2) → LeakyReLU → Conv2D → LeakyReLU; DL (with IN): Deconv2D (stride 2) → LeakyReLU → Conv2D → LeakyReLU.

One stream takes the gray-scale image and the other the semantic map. The encoder with shared weights allows the network to establish constraints between the semantic information and the gray context. Furthermore, each stream can still be handled individually. Each part of the network has five layers built from two types: the convolution layer (CL) and the deconvolution layer (DL). CL is used in the encoder to reduce the spatial dimension with a convolution stride of 2. On the decoder side, as in image-to-image transformation, DL expands the features back to the original size by transposed convolution with a stride of 2. The semantic information and gray context are recovered from encoder to decoder by skip connections. As shown in Figure 4, a layer has two convolutions, and each convolution is followed by a Leaky Rectified Linear Unit (Leaky ReLU) with a fixed negative slope. The first convolution adjusts the spatial dimension as described, except in the first and final layers. The same kernel size is set for all convolutions. The depth dimensions flow as [32, …] in the encoder and are doubled, up to [1024, …], because of the concatenation of the semantic information and gray context.

Instance Normalization.
Our network can provide plausible colors based on gray values and a semantic map. However, the deep neural network can make mistakes because of complicated gray values. Color bleeding and improper colors reflecting context confusion still exist, even when we can obtain accurate semantic information. To address these problems, we utilize Instance Normalization (IN) to remove the instance-specific information and simplify the gray features right before the concatenation. IN has also proven its excellent performance in generation tasks [22].

Figure 5: Effectiveness of instance normalization (IN) in our scheme. Artifacts are removed, providing more harmonious colors. Top to bottom: our network trained without IN and trained with IN.
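The instance normalization step (per-sample, per-channel statistics, as in [22]) can be sketched as follows; the epsilon and tensor shapes are illustrative.

```python
import numpy as np

def instance_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Instance normalization: normalize each (sample, channel) map independently.

    x: (N, C, H, W) feature tensor.
    """
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# gray features with arbitrary instance-specific statistics
feats = np.random.rand(2, 4, 16, 16) * 10 + 3
normed = instance_norm(feats)
# each per-instance, per-channel map is now ~zero-mean, ~unit-variance
print(np.allclose(normed.mean(axis=(2, 3)), 0, atol=1e-6))  # True
```

In the framework above, this normalization is applied only to the gray stream, right before its features are concatenated with the (unnormalized) semantic features.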
Loss function.
In colorization, we face a multimodal problem that becomes complicated because of the various plausible colors Ŷ for a given gray-scale value. We thus choose the Huber loss, as in [4], with δ = 1 to learn the true colors from the ground truth Y:

L_color(Y, Ŷ) = ½(Y − Ŷ)²  for |Y − Ŷ| ≤ δ;  δ(|Y − Ŷ| − ½δ)  otherwise.   (1)

Training dataset. Our models are trained on the PASCAL-Context dataset [23], which supplies everything our colorization expects, such as filled semantic maps, various pixel-wise labels, and categories. The dataset contains … images in the trainval set, with the … most frequently appearing classes, as considered in [23].

Preprocessing data. The utilized color space is CIE Lab, which allows us to separate the gray-scale and color components of a typical colorful image. CIE Lab consists of L for the lightness and a, b for the color components. The values of L, a, b are scaled and normalized into the range [−…, …].

Data augmentation. From the various sizes of the training images, we scale the data to … × … and then randomly crop them to … × … during training. Random flips are also used to increase the diversity of our training data.

Training details. We train our models with the Adam optimizer [24] with β₁ = 0.…, β₂ = 0.…, an initial learning rate of 0.…, and a batch size of …. Most of the models are trained for … epochs on a Tesla V100, taking about one week each.

Compared images. Most of the compared images are from the validation set of COCO-Stuff [25] and the public-domain Unsplash collection.
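The color loss of Eq. (1), the Huber loss with δ = 1, can be sketched as follows (a standard formulation with the usual 1/2 factors, assumed here):

```python
import numpy as np

def huber_loss(y: np.ndarray, y_hat: np.ndarray, delta: float = 1.0) -> float:
    """Huber loss of Eq. (1), averaged over all chroma values."""
    diff = np.abs(y - y_hat)
    quadratic = 0.5 * diff ** 2            # small residuals: L2-like
    linear = delta * (diff - 0.5 * delta)  # large residuals: L1-like
    return float(np.where(diff <= delta, quadratic, linear).mean())

y = np.array([0.0, 0.0])
y_hat = np.array([0.5, 2.0])
# residual 0.5 -> 0.5 * 0.25 = 0.125; residual 2.0 -> 1 * (2 - 0.5) = 1.5
print(huber_loss(y, y_hat))  # 0.8125
```

The linear branch keeps the gradient bounded for large residuals, which suits the multimodal chroma targets better than a pure L2 loss.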
We visualize the effect of IN by training two models, with and without IN, under the same conditions. As shown in Figure 5, the model with IN gives more harmonious and plausible colors and effectively resolves the color bleeding problem.
Color/semantic correction.
The coarse semantic map from Semantic Segmentation often causes color inconsistency within an object. Our colorization system allows the user to correct semantic faults using strokes. Figure 6 shows our model's performance in color correction. Moreover, the colors of pixels around a fixed semantic region adapt when the user manipulates the semantic map.

Figure 6: Color correction. From left to right, the images show three types of problems: (a) a patchy semantic mistake on the grass region, (b) inharmonious color and green bleeding on the bridge, (c) green bleeding on the bear and rock. The top row shows the results before the adjustment; the bottom row shows the results after the adjustment. Several parts are highlighted for a quick comparison.
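Stroke-based correction amounts to overwriting class ids inside a user-drawn mask and re-running colorization on the edited map. A minimal sketch, with illustrative class ids and mask:

```python
import numpy as np

GRASS, ROAD = 7, 12  # illustrative class ids

def apply_stroke(sem_map: np.ndarray, stroke_mask: np.ndarray, new_class: int) -> np.ndarray:
    """Overwrite the semantic map where the user painted a stroke."""
    corrected = sem_map.copy()
    corrected[stroke_mask] = new_class
    return corrected  # re-running colorization on this map updates nearby colors

sem_map = np.full((6, 6), ROAD, dtype=np.int64)  # coarse map: everything "road"
stroke = np.zeros((6, 6), dtype=bool)
stroke[2:4, 2:4] = True                          # user brushes the grass region
fixed = apply_stroke(sem_map, stroke, GRASS)
print(int((fixed == GRASS).sum()))  # 4
```

Because the colorization network conditions on this map, correcting the labels also adapts the colors around the fixed region, as described above.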
Demonstration on semantic adjustment.
Generated color can become implausible in several cases, even when the semantic prediction is correct. Therefore, besides semantic correction, we remove or change the semantic values of pixels having an implausible color. For example, as shown in our supplemental video, the girl has a little pink on her teeth, even though the semantic values are correct. This may be because our training data contains mostly people with closed mouths and pink lips; as a consequence, that little pink represents her lips where it should be white for her teeth. Therefore, we build an application and show that removing or changing the semantic information of a region having implausible color can potentially solve this problem. Regarding the user interface, our application includes two widgets, one for user intervention and one for showing the results. The tools include the semantic brush status, a combo box for selection, and a semantic picker. The computation time of our colorization is … milliseconds per … × … image on an NVIDIA GeForce GT 730.

We compare our research with the current state-of-the-art methods, namely the works of Zhang et al. [1] as Zhang'16, Larsson et al. [2] as Larsson'16, Iizuka et al. [3] as Iizuka'16, and the interactive colorization of Zhang et al. [4] as Zhang'17 (without guided colors).
Qualitative comparison.
Besides Zhang'16, Larsson'16, and Iizuka'16, which are purely automatic methods, Zhang'17 uses a black image and a zero mask as its suggestion. Meanwhile, our method uses the coarse semantic map detected by the Semantic Segmentation. The results in Figure 7 reveal that our method provides more plausible colors for objects in a harmonious way. In detail, the man's face in our result looks more yellow and has a tone discriminative from his coat in row 1; meanwhile, the other works give the same tone. The girl's skin color in our work and Zhang'17 [4] in row 2 is discriminative; furthermore, it is brighter and more colorful in ours. Regarding context confusion, the small details colorized by the other works are often infected by the color of the background or nearby objects, especially in the highlighted rectangles in rows 3, 4, and 5. In contrast, we provide natural, discriminative colors for specific objects. Additionally, our results show more varied colors. In summary, our method can fix semantic-related faults and provide plausible colors competitive with the others, though some artifacts remain because of the coarse semantic map.

User Study.
Our target is to generate natural colors, as in a generative task. Therefore, we conduct a user study with ten people under three criteria, Semantic Correctness, Saturability, and Edges Keeping, following the method of Zhao et al. [8]. Naturalness is the average score of those three criteria. Additionally, we conduct a comparison on naturalness (*) by showing the results of the different methods for the same context (with shuffled positions) and letting users score them. All measurements are on a scale of 100. As shown in Table 1, ours outperforms the others.
Figure 7: Comparison in automatic colorization. Top part, left to right: Zhang'16 [1], Larsson'16 [2], Iizuka'16 [3], Zhang'17 [4], and our method. Furthermore, we highlight the places with context confusion as red rectangles. Six compared images are shown in each row. The final row shows the coarse semantic maps we use for our colorization.
Method          Semantic Correctness  Saturability  Edges Keeping  Naturalness  Naturalness (*)
Original        85.76                 87.88         89.82          87.8         66.62
Iizuka'16 [3]   67.16                 65.5          70.52          67.7         43.5
Zhang'16 [1]    61.02                 63.88         62.96          62.6         42.0
Larsson'16 [2]  67.3                  62.62         70.26          66.7         45.13
Zhang'17 [4]    64.8                  60            70.98          65.3         43.0
Ours            …                     …             …              …            …

Table 1: User study on three criteria (Semantic Correctness, Saturability, Edges Keeping), as described in the work of [8], on a scale of 100. Higher scores are better.
Following human experience in photo coloring, a person step by step recognizes the content, imagines the real-life color of that content (semantic-driven), and finally colorizes it. We present a colorization framework that behaves like this human action in coloring a photo. As a result, our method gives discriminative colors with more naturalness, comparable to previous works. Furthermore, the semantic information can be adjusted using our interactive application.
Multi-color Suggestion and User-guided Colorization.
Currently, the system directly uses the recommended natural color for specific objects and harmony between objects. It can be extended to recommend more than one color for user-guided colorization, as in [4]. The suggested colors should then be more natural, and the semantic boundaries should also improve. Also, the coarse semantic map can be improved adaptively via user intervention suggesting direct colors.
Coarse Semantic Map in Automatic Colorization.
In fact, deep neural networks make mistakes, as Zhang et al. [4] mention about color, Larsson et al. [2] about unrecognized objects, and Xiao et al. [18] about incorrect semantic segmentation detection. These mistakes cause many problems in colorization, such as color inconsistency, color bleeding, etc. Our target is to solve these problems using suitable semantic information. However, our method also makes mistakes; consequently, our results show some symptoms of the mentioned problems. Thanks to the efficiency of IN (presented in Section 4.1), those problems are largely, but not completely, handled. Therefore, we build an application that can adjust the visible semantic map in the middle of our colorization framework.
References

[1] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[2] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision, pages 577–593. Springer, 2016.
[3] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (TOG), 35(4):110, 2016.
[4] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S. Lin, Tianhe Yu, and Alexei A. Efros. Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999, 2017.
[5] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[6] Patricia Vitoria, Lara Raad, and Coloma Ballester. ChromaGAN: Adversarial picture colorization with semantic class distribution. In The IEEE Winter Conference on Applications of Computer Vision (WACV), March 2020.
[7] Seungjoo Yoo, Hyojin Bahng, Sunghyo Chung, Junsoo Lee, Jaehyuk Chang, and Jaegul Choo. Coloring with limited data: Few-shot colorization via memory augmented networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11283–11292, 2019.
[8] Jiaojiao Zhao, Li Liu, Cees G. M. Snoek, Jungong Han, and Ling Shao. Pixel-level semantics guided image colorization. arXiv preprint arXiv:1808.01597, 2018.
[9] Jiaojiao Zhao, Jungong Han, Ling Shao, and Cees G. M. Snoek. Pixelated semantic colorization. International Journal of Computer Vision, pages 1–17, 2019.
[10] Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. In ACM Transactions on Graphics (TOG), volume 23, pages 689–694. ACM, 2004.
[11] Hyojin Bahng, Seungjoo Yoo, Wonwoong Cho, David Keetae Park, Ziming Wu, Xiaojuan Ma, and Jaegul Choo. Coloring with words: Guiding image colorization through text-based palette generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 431–447, 2018.
[12] Alex Yong-Sang Chia, Shaojie Zhuo, Raj Kumar Gupta, Yu-Wing Tai, Siu-Yeung Cho, Ping Tan, and Stephen Lin. Semantic colorization with internet images. In ACM Transactions on Graphics (TOG), volume 30, page 156. ACM, 2011.
[13] Aditya Deshpande, Jiajun Lu, Mao-Chuang Yeh, Min Jin Chong, and David Forsyth. Learning diverse image colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6837–6845, 2017.
[14] Raj Kumar Gupta, Alex Yong-Sang Chia, Deepu Rajan, Ee Sin Ng, and Huang Zhiyong. Image colorization using similar images. In Proceedings of the 20th ACM International Conference on Multimedia, pages 369–378. ACM, 2012.
[15] Mingming He, Dongdong Chen, Jing Liao, Pedro V. Sander, and Lu Yuan. Deep exemplar-based colorization. ACM Transactions on Graphics (TOG), 37(4):47, 2018.
[16] Safa Messaoud, David Forsyth, and Alexander G. Schwing. Structural consistency and controllability for diverse colorization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 596–612, 2018.
[17] Tomihisa Welsh, Michael Ashikhmin, and Klaus Mueller. Transferring color to greyscale images. In ACM Transactions on Graphics (TOG), volume 21, pages 277–280. ACM, 2002.
[18] Yi Xiao, Peiyao Zhou, and Yan Zheng. Interactive deep colorization with simultaneous global and local inputs. arXiv preprint arXiv:1801.09083, 2018.
[19] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[20] Damien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Tremeau, and Christian Wolf. Residual conv-deconv grid network for semantic segmentation. arXiv preprint arXiv:1707.07958, 2017.
[21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[22] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
[23] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.