Robust Spatial-spread Deep Neural Image Watermarking
A Preprint

Marcin Plata
Department of Fundamentals of Computer Science
Wroclaw University of Science and Technology
Wroclaw, Poland
[email protected]

Piotr Syga
Vestigit
Wroclaw, Poland
[email protected]

ABSTRACT
Watermarking is the operation of embedding information into an image in a way that allows identifying the ownership of the image despite applying some distortions to it. In this paper, we present a novel end-to-end solution for embedding and recovering a watermark in a digital image using convolutional neural networks. The method is based on spreading the message over the spatial domain of the image, hence reducing the local bits-per-pixel capacity. To obtain the model, we used adversarial training and applied noiser layers between the encoder and the decoder. Moreover, we broadened the spectrum of typically considered attacks on the watermark and, by grouping the attacks according to their scope, we achieved high general robustness, most notably against JPEG compression, Gaussian blurring, subsampling and resizing. To aid the training of the models, we also proposed a precise differentiable approximation of JPEG.

Keywords: Blind watermarking, JPEG, Attacks, Adversarial training, Neural networks
In recent years the multimedia market has been steadily growing. The services provide access to a vast range of desired multimedia and are getting more convenient to use, e.g. Netflix offers offline access to movies and TV shows [1]. This causes an increase in illegal redistribution of copyrighted content. One of the most efficient methods to prevent such behaviour utilizes embedding a human-invisible watermark in the digital content.

Watermarking uses the fact that the current bandwidth of digital image signals is much higher than the amount of information which can be properly received and interpreted by a human. It is well known that human eyesight is more sensitive to the luminance component of a color space than to chrominance, i.e. one can recognize even a small difference in the brightness of an image, but small color perturbations are imperceptible to the human visual system. This is one among many properties operating on the surplus bandwidth that are used in such applications as compression, steganography or watermarking.

Watermarking describes a model of communication where a user needs to embed a message into a digital image and send it to a recipient. Afterwards the image may be manipulated by an attacker; however, a legitimate user who shares a set of joint strategies of embedding and extracting the message should be able to recover the embedded message from the (possibly manipulated) image. The goal of the attacker is to modify the image, without significant deterioration, in order to destroy the embedded message, yet preserving the commercial value of the original data.

During the work on watermarking techniques, we need to handle the three following requirements [2]:

1. transparency concerns the quality of the image after the watermark encoding. In general, the original and watermarked images need to be perceptually similar. All distortions caused by the watermark embedding
should be invisible to the human eye, so that the value of the data for the consumers does not deteriorate. In our work, we utilized the peak signal-to-noise ratio (PSNR), which measures the pixel-wise difference between two images;

2. robustness describes the user's ability to decode the message from the encoded images after applying some signal processing operations to them. These operations could be applied intentionally, in order to destroy the watermark, or be a result of technical requirements or limitations. In this work, we use the term attacks to refer to these operations. Examples of attacks include cropping, resizing, Gaussian blur or JPEG compression;

3. capacity was defined in [3] as "the number of bits a watermark encodes within a unit of time or work". In this paper, we considered the local or block bits-per-pixel capacity. The size of the block can be delimited by calculating the longest distance over which information about any pixel is spread by an encoder built as a sequence of convolutional layers, e.g. for one convolutional layer with kernel size $k \times k$ the block size is $k \times k$, and for two such layers it grows to $(2k-1) \times (2k-1)$.

In this paper, we introduce a novel technique of embedding a secret message into a digital image and extracting it using convolutional neural networks. We propose a method of spatial spreading of the secret message over the image, which significantly reduces the local (block) bits-per-pixel capacity and at the same time preserves robustness to spatial attacks, such as rotation or cropping. Additionally, using the spatial spreading method significantly reduces the time of the training phase in comparison to previous solutions. The proposed method has been validated against a wide group of attacks including lossy compression techniques, such as subsampling and JPEG compression, and spatial attacks, such as rotation and cropping.
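To make the block-size computation above concrete, here is a short illustrative helper (ours, not the paper's code; stride 1 and square kernels are assumed) that computes the receptive field of a stack of convolutional layers:

```python
def receptive_field(kernel_sizes):
    """Receptive field (block size, in pixels) of a stack of stride-1
    convolutional layers with the given square kernel sizes."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the field by (k - 1)
    return rf

# One k x k layer spreads information over a k x k block; stacking a
# second layer widens the block to (2k - 1) x (2k - 1), and so on.
print(receptive_field([3]), receptive_field([3, 3]))  # -> 3 5
```

In other words, the deeper the encoder's convolutional stack, the larger the block over which a single message fragment is smeared, which is exactly the quantity the local bits-per-pixel capacity is measured against.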
Although such attacks have been considered by the multimedia community throughout the history of 'classic' watermarking, some of them were neglected by the authors of recent watermark encoding solutions using neural networks, even though the attacks are easy to apply and some of them are common components of lossy compression techniques. We also divided the considered attacks into groups based on the scope in which they affect the image. Next, we show that it is essential to apply attacks from various groups in order to build a robust deep learning system for watermarking. Finally, we evaluate the robustness of our method against the attacks in terms of the quality of the image measured by the peak signal-to-noise ratio (PSNR).

Our contribution is (1) a new architecture of the spatial-spread encoder and decoder as well as (2) the formulation of a loss function matching the architecture. (3) We improved the robustness against particular attacks in comparison to the current state-of-the-art methods, especially the JPEG lossy compression algorithm, resizing and Gaussian blurring. (4) We handled new types of attacks, such as subsampling, which is a part of the JPEG algorithm. (5) We proposed a precise differentiable approximation of the JPEG compression algorithm. (6) The resulting training framework required half the time in comparison to prior solutions. (7) We prepared an analysis of attack types – we grouped the typical attacks according to their scope, which is helpful in choosing an appropriate and balanced set of attacks applied in the noiser layers and deriving dependencies between them.
The main goal of watermarking methods is to embed additional information, called a watermark, into a digital image, called the cover image, in a way that allows recovering the watermark by a legitimate user. The watermark needs to be robust against some signal processing operations, called attacks. Our method is a blind technique, thus the decoder is not provided with access to the original image. In this work, we considered the following attacks: cropping, cropout, dropout, rotation, Gaussian smoothing, subsampling 4:2:0, JPEG compression and resizing. All attacks, as well as the watermark encoding, need to preserve the transparency of the image.

We aim to encode a binary message $m \in \{0,1\}^L$, where $L$ is a positive integer, in the cover image $I_c$ of shape $H \times W \times Ch$. The result of this operation is the encoded image $I_e$ containing the hidden watermark $m$. Both images $I_c$ and $I_e$ need to be perceptually indistinguishable. Next, an attacker modifies $I_e$ by applying selected attacks in order to prevent extraction of $m$ from the encoded image. The output after the modification is a noised image $I_n$ which has three channels and unspecified width and height. Finally, we try to extract a hidden message $m' \in \{0,1\}^L$ from $I_n$ that satisfies $\|m - m'\| < \delta$.

The problem of transparent and robust embedding of additional information into a digital image has been deeply studied for many years. Watermarking solutions can be divided into two types: non-blind and blind. The non-blind solutions require an original copy of the image for the detection step, whereas blind methods are able to detect a message encoded into the covertext without any additional data. Due to their easier application in a real-life environment, most recent
Figure 1: The sketch of the pipeline architecture.

works have been focused on the blind approaches. Many solutions use spatial-to-frequency domain transformations, such as the Discrete Fourier Transform (DFT) [4], the Discrete Wavelet Transform (DWT) [5, 6, 7, 8, 9], the Discrete Cosine Transform (DCT) [10, 11, 12] and others [13, 14]. The Extreme Learning Machine (ELM) is another technique used for embedding watermarks into digital images which has been gaining popularity over the last years [15, 6, 16]. Another method widely used for handling the watermarking problem is Singular Value Decomposition (SVD), utilized in [8, 17, 18, 9, 19] among others. Many works handled watermarking with a combination of two or more techniques (e.g. [8, 6]).

In recent years, we could also observe increased interest in applying deep learning methods to the watermarking field. The authors of [20] proposed a framework for training encoder and decoder networks in an end-to-end manner by adding noiser layers between the encoder and the decoder, with an adversary network deciding whether images were encoded or not. The message was spread over all pixels of the image, which allowed achieving impressive robustness against cropping attacks. The paper was followed by [21], where the authors introduced a novel method of training the original architecture, called adversarial training. They reported high robustness against the attacks; however, it resulted in a relatively low quality of the encoded images as measured by the PSNR. Another interesting approach for improving the robustness of message detection, introduced in [22], used an additional attack neural network for generating generic distortions. The authors of [23] designed a fully automated deep learning-based system for watermark extraction from camera-captured images. In [24], the authors used convolutional neural networks for zero-watermarking, which does not modify the image but extracts some characteristics from the image in order to link it with an owner.
The paper [25] described a deep learning solution robust against JPEG compression and rotation. In RedMark [26], a special transform layer was applied to the image before feeding it to the encoding neural network, and the authors worked out a differentiable approximation of JPEG.
The architecture proposed in the paper consists of six main components. Three of them are trainable neural networks called the encoder $E_\phi$, the decoder $D_\gamma$ and the adversarial discriminator/critic $C_\omega$, where $\phi$, $\gamma$, $\omega$ are trainable parameters. The next component is an additional attacker network $A$ used for performing the attacks from the previous section on the encoded image. We also specified two deterministic algorithms called the message propagator $P$ and the message translator $T$. The overall sketch of the architecture is presented in Figure 1.

We denote the $i$-th bit of the message $m$ as $m_i$. Moreover, we represent the message using a sequence of tuples $(i, m_{ki}, m_{ki+1}, \ldots, m_{ki+k-1})$, where $i \in \{0, 1, \ldots, \frac{L}{k}-1\}$ and $1 \le k \le L$. In particular, for $k = 1$, we are able to represent the message as a trivial sequence of tuples $(i, m_i)$ for $i \in \{0, 1, \ldots, L-1\}$. We also define a function $\mathrm{bin}_n : \mathbb{N} \to \{0,1\}^n$ which for a given value returns its binary representation of length $n$.

The propagator $P_{nkb} : \{0,1\}^L \to \{0,1\}^{\frac{H}{b} \times \frac{W}{b} \times (n+k)}$ is a function which executes the following steps:

1. converts the message $m$ into a sequence of tuples $(s_0, s_1, \ldots, s_{\frac{L}{k}-1})$,
Figure 2: The visualization of the steps of the propagator $P^{ext}_{nkb}$ for parameters $n = 2$, $k = 2$, $b = 2$, $L = 8$, $W = 4$ and $H = 4$. The numbers under the arrows refer to the propagator's steps.

2. for every $i$, converts the first element of the tuple $s_i$ to the binary representation $\mathrm{bin}_n(s_i)$, flattens the tuple $s_i$, and unsqueezes it to $s_i \in \{0,1\}^{1 \times 1 \times (n+k)}$,

3. fills a spatially-spread message $M \in \{0,1\}^{\frac{H}{b} \times \frac{W}{b} \times (n+k)}$ with the unsqueezed tuples $s_i$ for $i \in \{0, 1, \ldots, \frac{L}{k}-1\}$. Note that we allow producing redundant data in $M$, i.e. inserting more than one copy of a tuple $s_i$.

We also need to extend $M$ if the message is an input to the encoder; then we go through one additional step:

4. repeat every tuple in $M$ $b$ times in the horizontal and vertical directions (each $M_{ij} \in \{0,1\}^{1 \times 1 \times (n+k)}$ is converted to $M_{ij} \in \{0,1\}^{b \times b \times (n+k)}$).

If the additional step needs to be executed, we denote the propagator by $P^{ext}_{nkb}$ and obtain $M^{ext} \in \{0,1\}^{H \times W \times (n+k)}$. The visualization of the propagator is presented in Figure 2.

The output of the propagator $P^{ext}_{nkb}$ together with the cover image $I_c$ is used by the encoder $E_\phi$ to produce the encoded image $I_e$, i.e.:

$$I_e = E_\phi(I_c, M^{ext}). \quad (1)$$

We follow by applying the attacks on the image $I_e$:

$$I_a, M = A(I_e, I_c, M^{ext}). \quad (2)$$

Note that some attacks require the cover image $I_c$, e.g. dropout. For the crop attack, we also cropped the message $M$ during the training. The decoder tries to predict the message $M$ having access only to $I_a$:

$$M' = D_\gamma(I_a) \in \{0,1\}^{\frac{H}{b} \times \frac{W}{b} \times (n+k)}. \quad (3)$$

Additionally, we use $C_\omega$ to rate whether $I_e$ is similar to $I_c$:

$$C_\omega(I \in \{I_e, I_c\}) \in [0, 1]. \quad (4)$$

The last element of the architecture is the message translator $T_o$. It is a deterministic function which calculates the final message $m'$ based on the decoded message $M'$. The process of the calculation is similar to the k-Nearest Neighbours algorithm. For every $i \in \{0, 1, \ldots, \frac{L}{k}-1\}$, we find $o$ tuples from $M'$ whose first $n$ values (the binary index) are closest to $\mathrm{bin}_n(i)$, i.e., we choose a tuple with coordinates $xy$ if $\|\mathrm{bin}_n(i) - M'_{xy}[0, \ldots, n-1]\|$ is one of the $o$ lowest values. Then, we calculate the mean values of every element encoding a bit of the message, i.e. the elements of the tuple at positions $(n, \ldots, n+k-1)$. Based on this, we are able to predict all bits of the message $m'$.

The whole architecture allows encoding the message $m$ in the cover image $I_c$ while reducing the local (block) bits-per-pixel capacity. The state-of-the-art and recent architectures of encoders [20, 21, 22] are based on convolutional layers. This means that the encoder embeds the watermark message locally, without access to the whole image. Such an encoder architecture admits two ways of encoding the message. (1) Encoding only a subset of the whole message depending on the pixel color space, e.g. encoding some bits only if the tone of the pixel is close to blue; this way of encoding is risky and unreliable. (2) Attempting to encode the whole message locally (in a block of pixels). An analysis of the robustness results, in particular the high accuracy against the cropping attack, indicates that the second way of message encoding is the more likely one. Thus, we proposed a solution that reduces the local bits-per-pixel capacity and improves the robustness against attacks, especially smoothing-like attacks.
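The propagator and translator described above can be sketched in a few lines of numpy. This is a simplified illustration under our own assumptions (cyclic rather than random redundant placement, noiseless decoding, hypothetical function names), not the paper's implementation:

```python
import numpy as np

def propagate(m, n, k, Hb, Wb):
    """Sketch of the propagator P: spread message m (length L) over an
    Hb x Wb grid as (n+k)-element tuples (bin_n(i), m_ki..m_{ki+k-1}).
    For simplicity the redundant placement here is cyclic, not random."""
    L = len(m)
    tuples = []
    for i in range(L // k):
        idx_bits = [int(b) for b in format(i, f"0{n}b")]  # bin_n(i)
        tuples.append(idx_bits + list(m[k * i:k * (i + 1)]))
    M = np.empty((Hb, Wb, n + k), dtype=np.float32)
    for h in range(Hb):
        for w in range(Wb):
            M[h, w] = tuples[(h * Wb + w) % len(tuples)]  # redundancy
    return M

def translate(M_pred, n, k, L, o=3):
    """Sketch of the translator T_o: for each patch index i, average the
    k message bits of the o grid cells whose index part is closest to
    bin_n(i), then threshold at 0.5."""
    flat = M_pred.reshape(-1, n + k)
    m_rec = []
    for i in range(L // k):
        target = np.array([int(b) for b in format(i, f"0{n}b")],
                          dtype=np.float32)
        dist = np.linalg.norm(flat[:, :n] - target, axis=1)
        nearest = flat[np.argsort(dist)[:o], n:]  # o closest tuples
        m_rec.extend(int(v > 0.5) for v in nearest.mean(axis=0))
    return m_rec

m = [1, 0, 1, 1, 0, 0, 1, 0]                 # L = 8
M = propagate(m, n=2, k=2, Hb=4, Wb=4)
print(translate(M, n=2, k=2, L=8))           # -> [1, 0, 1, 1, 0, 0, 1, 0]
```

In the real pipeline the grid $M'$ comes out of the decoder network after attacks, so the index bits are noisy floats; the nearest-neighbour search over the index part and the averaging over the $o$ best matches is what makes the recovery tolerant to corrupted cells.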
The proposed architecture spreads fractions of the message $m$ over the image in the form of tuples $(i, m_{ki}, m_{ki+1}, \ldots, m_{ki+k-1})$, where $i \in \{0, 1, \ldots, \frac{L}{k}-1\}$ and $1 \le k \le L$. Note that the spreading is performed in a block fashion rather than assigning the whole message $m$ to every single pixel. For example, we could encode a message of length $L = 32$ by splitting it into 8 patches of length 4 ($k = 4$ and $n = 3$). Thus, we are able to encode each patch with 7 bits, where we need 3 bits for the index of the patch and 4 bits for the corresponding fraction of the message. During our experiments, we achieved the best results for $k = 2$.

We formulated a novel loss function for training our models using a gradient descent algorithm. Our general objective contains three separate loss functions $L_E$, $L_D$ and $L_C$, for training the encoder $E_\phi$, the decoder $D_\gamma$ and the critic $C_\omega$, respectively. The noiser layers $A$ are inside the training pipeline and do not contain trainable parameters. Furthermore, the message propagator $P_{nkb}$ and the translator $T_o$ are deterministic algorithms outside of the training pipeline.

The aim of the loss function $L_E$ is keeping the images $I_c$ and $I_e$ similar. It was formulated as follows:

$$L_E(I_c, I_e) = \mathrm{MSE}(I_c, I_e) = \frac{1}{H \cdot W \cdot Ch} \|I_c - I_e\|_2^2, \quad (5)$$

where MSE is the standard Mean Square Error function. The loss function $L_D$ works on the similarity between the propagated messages $M$ and $M'$. However, as $M$ contains redundant data, i.e. repeated tuples, we do not need to perfectly recover the message. Our aim is to extract a subset of tuples with "high confidence" of information.
Thus, we formulated the loss function $L_D$ as a combination of mean and variance functions:

$$L^{mean}_D(M, M') = \frac{b^2}{H \cdot W} \sum_{h=0}^{H_b} \sum_{w=0}^{W_b} \mathrm{Mean}(|M_{hw} - M'_{hw}|) \quad (6)$$
$$= \frac{b^2}{H \cdot W \cdot (n+k)} \|M - M'\|_1 \quad (7)$$

and

$$L^{var}_D(M, M') = \frac{b^2}{H \cdot W} \sum_{h=0}^{H_b} \sum_{w=0}^{W_b} \mathrm{Var}(|M_{hw} - M'_{hw}|), \quad (8)$$

where $H_b = \frac{H}{b} - 1$ and $W_b = \frac{W}{b} - 1$, and the operator $|\cdot|$ returns the absolute value of every element of the vector. The final loss function is $L_D = \lambda^{mean}_D L^{mean}_D + \lambda^{var}_D L^{var}_D$. Such a formulation of the loss function promotes learning all elements in some tuples over some elements in all tuples.

We also defined adversarial training for the encoder $E_\phi$ and the critic $C_\omega$. By this, we were able to achieve better visual similarity of the images $I_c$ and $I_e$. For the encoder $E_\phi$, we expect it to produce images following the transparency requirement, thus we defined the loss function $L^E_C = \log(1 - C_\omega(I_e))$. On the other hand, the role of the critic $C_\omega$ is to distinguish between the "real" images $I_c$ and the modified images $I_e$, thus in this case we defined the loss function $L^C_C = \log(1 - C_\omega(I_c)) + \log(C_\omega(I_e))$.

Finally, we run the gradient descent algorithm on the $\phi$ and $\gamma$ parameters in order to minimize the loss function over the distribution of images $I_c$ and messages $M$:

$$\mathbb{E}_{I_c, M}\left[\lambda_E L_E + \lambda^{mean}_D L^{mean}_D + \lambda^{var}_D L^{var}_D + \lambda_C L^E_C\right], \quad (9)$$

where the $\lambda$-s are weights for the particular losses. We simultaneously conduct the training of the critic $C_\omega$ with parameters $\omega$ by minimizing the loss function over the distribution of images $I_c$: $\mathbb{E}_{I_c}[L^C_C]$.

The main building block of the neural networks, i.e. the encoder $E_\phi$, the decoder $D_\gamma$ and the critic $C_\omega$, is a sequential structure of a convolutional layer with 64 channels, kernel size $3 \times 3$, stride $1 \times 1$ and padding $1 \times 1$, followed by a batch normalization layer and a ReLU activation.
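The decoder loss of Eqs. (6)-(8) can be written down in a few lines; the following numpy sketch is our illustration (the $\lambda$ weights are placeholders, and the real pipeline computes this on autograd tensors):

```python
import numpy as np

def decoder_loss(M, M_pred, lam_mean=1.0, lam_var=1.0):
    """L_D = lam_mean * L_D^mean + lam_var * L_D^var: the mean term is
    the average absolute error over all cells and channels (Eq. (7)),
    the variance term averages, over the grid cells, the variance of
    the absolute error across the (n+k) channels of each cell (Eq. (8))."""
    diff = np.abs(M - M_pred)          # shape: H/b x W/b x (n+k)
    l_mean = diff.mean()
    l_var = diff.var(axis=2).mean()    # per-cell channel variance
    return lam_mean * l_mean + lam_var * l_var

M = np.zeros((4, 4, 6))
assert decoder_loss(M, M) == 0.0                   # perfect recovery
# A uniformly wrong prediction has a zero variance term:
assert decoder_loss(M, np.ones_like(M)) == 1.0
```

The variance term penalizes cells whose channels are only partially correct, so minimizing it pushes each cell toward being entirely right or entirely wrong, which is precisely the "all elements in some tuples" preference stated above.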
All networks operate on images in the YCbCr color space.

The encoder $E_\phi$ contains five sequential blocks, where the first block is fed the concatenated tensor of the image $I_c$ and the spread message $M^{ext}$. Next, the tensor $[I_c, M^{ext}]$ is also concatenated with the input before every second convolutional layer, i.e. the 1st, 3rd and 5th layers have access to the cover image and the spread message. The last encoder layer is a convolution with 3 channels and default other parameters. Note that the number of layers in the encoder $E_\phi$ does not exceed that of the other state-of-the-art methods, e.g. [20, 21, 22, 26]. This is important in the context of time performance, as in many practical scenarios (e.g. streaming) the encoder needs to work in real time.

The decoder $D_\gamma$ takes the encoded image $I_e$ and puts it through 6 sequential blocks. Then, we apply an adaptive average pooling layer which produces a tensor of size $\frac{H}{b} \times \frac{W}{b} \times 64$. Next, the tensor is fed to a sequential block with 64 channels and kernel size $1 \times 1$. The last layer is a separate convolutional layer with $k + n$ channels and the same kernel size. Thus, the decoder returns a tensor of the same size as $M$. The last two convolutional layers imitate fully connected layers applied over the channels for every spatial element of the output. Note that during our experiments we did not change the size of the tensor produced by the adaptive pooling, i.e. the decoder returned an output tensor of the same size also after cropping or resizing attacks. Executing different actions depending on the attack type could improve the robustness of the method, but it would require a way to recognize the attack type and would counter the end-to-end approach, thus we decided to return $M'$ with the same size in every case.

The critic $C_\omega$ consists of three sequential blocks, an adaptive average pooling layer which produces a 64-dimensional vector, and then a fully connected layer.
The critic returns a value describing the similarity of the input image to real images.

We selected the noiser layers which we later applied during the training process. We exposed the neural networks to various kinds of distortions which they needed to handle in order to increase the performance. By this, we were able to determine the way of training the neural networks. The types of selected distortions included cropping, cropout, dropout, Gaussian smoothing, rotation, subsampling 4:2:0, an approximation of JPEG and resizing.

The crop distortion returns a cropped square of the image $I_e$ of a specified area ratio $p = \frac{H_{new} \cdot W_{new}}{H \cdot W}$. The cropout attack works similarly to crop: it crops a square of the image $I_e$, but instead of discarding the rest of the image, it replaces the remaining area with the image $I_c$. As in [20], we decided to use the image $I_c$ as the background for the encoded image $I_e$, because this simulates a binary symmetric channel (BSC), a standard model considered in information theory, where the receiver does not know whether an obtained bit is correct or wrong. Applying a monotone or random color to the pixels of the remaining area imitates another simple communication model called the binary erasure channel (BEC). We parameterized the cropout attack with a value $p$ equal to the ratio of the cropped area to the entire image area. The dropout attack keeps a percentage $p$ of the pixels of the image $I_e$ and replaces the remaining pixels with the corresponding pixels of the image $I_c$. As with cropout, this procedure also simulates the BSC model. The Gaussian smoothing was done with a parameter $\sigma$ (the kernel width).

The next four attacks are our extension of those presented in [20, 21]. The rotation attack rotates the image by $\alpha$ degrees. Subsampling 4:2:0 is applied in many digital compression algorithms, such as JPEG or MPEG, and is the most popular of all its variants (e.g. 4:2:2, 4:1:1).
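The BSC-style mixing described above is simple to picture in code. A minimal numpy sketch of the dropout noiser (function name and image shapes are ours, for illustration only):

```python
import numpy as np

def dropout_noiser(I_e, I_c, p, rng):
    """Keep a fraction p of the encoded image's pixels and replace the
    rest with the corresponding cover-image pixels (BSC-like mixing)."""
    keep = rng.random(I_e.shape[:2]) < p        # per-pixel keep mask
    return np.where(keep[..., None], I_e, I_c)  # broadcast over channels

rng = np.random.default_rng(0)
I_c = np.zeros((32, 32, 3))                     # stand-in cover image
I_e = np.ones((32, 32, 3))                      # stand-in encoded image
I_n = dropout_noiser(I_e, I_c, p=0.5, rng=rng)
# Roughly half of the pixels survive from I_e; the decoder never knows
# which ones, which is what makes this a BSC-style channel.
```

Cropout works analogously, except that the kept region is a contiguous square instead of an i.i.d. per-pixel mask.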
It reduces the image channels Cb and Cr by calculating the average value of every $2 \times 2$ square. The procedure can be implemented using a 2D convolutional layer with one channel, kernel size $2 \times 2$, stride $2 \times 2$ and weights set to $0.25$. We also used the resize attack with a scale factor $s = \frac{H_{new}}{H} = \frac{W_{new}}{W}$. We handled two types of interpolation – Nearest neighbours and Lanczos.

Approximation of JPEG.
Lossy compression algorithms can be considered among the most efficient attacks against a wide range of watermarking protocols. This comes from the fact that algorithms such as JPEG are very efficient at removing barely visible elements which are not essential for the viewer, while all watermarking techniques aim at changing the image in a way that is hardly noticeable for the viewer. Thus, it was necessary to include them in the training pipeline in order to obtain an appropriate design for the encoder and decoder training.

The main inconvenience of JPEG is the rounding operation applied to the quantized frequency-domain elements of the image. The derivative of the round function is indeterminate for points $x \in \mathbb{Z}$ and equal to $0$ in the rest of the domain. Thus, using the rounding function in the middle of the training pipeline is impossible, as it halts the update of the neural network weights by the gradient descent algorithm. We proposed an approximation of JPEG compression which executes the following steps for the image $I$: (1) converting to the YCbCr color space, (2) subsampling 4:2:0, (3) splitting every channel separately into blocks of $8 \times 8$, (4) applying the Discrete Cosine Transform (DCT), (5) dividing by the quantization table $Q$ and (6) applying the approximation of the rounding. The last two steps we formulated as follows:

$$I'_{ij} = \begin{cases} 0, & \text{if } -0.5 \le \frac{I_{ij}}{Q_{ij}} \le 0.5, \\ \frac{I_{ij}}{Q_{ij}} + \frac{\delta}{Q_{ij}}, & \text{otherwise}, \end{cases} \quad (10)$$

where $\delta \sim \mathcal{N}(0, \sigma)$, $I_{ij}$ is the frequency-domain element of the image and $Q_{ij}$ is the related element of the quantization table. For our experiments, we set $\sigma = 0.1$. We used the standard quantization table for the quality parameter $q = 50$ and modified the elements of the table $Q$ for different $q$ in accordance with the JPEG standard [27, 28]. To the best of our knowledge, this is the most precise differentiable approximation of the classic JPEG applied to a deep learning training pipeline for watermarking.
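The quantize-and-round approximation in steps (5)-(6), Eq. (10), can be sketched as follows. This is a standalone numpy illustration of that single step with example coefficients of our choosing, not the full differentiable JPEG layer (which runs on DCT blocks inside the autograd pipeline):

```python
import numpy as np

def soft_quantize(F, Q, sigma=0.1, rng=None):
    """Approximate JPEG quantization + rounding: coefficients that
    standard rounding would send to zero are zeroed exactly, while the
    remaining quantized values get small Gaussian noise in place of the
    non-differentiable round operation."""
    rng = rng or np.random.default_rng(0)
    ratio = F / Q                                   # step (5): divide by Q
    out = ratio + rng.normal(0.0, sigma, size=F.shape) / Q
    out[np.abs(ratio) <= 0.5] = 0.0                 # rounded to zero by JPEG
    return out

F = np.array([[4.0, 40.0], [-3.0, 100.0]])          # example DCT block
Q = np.full((2, 2), 16.0)                           # example (flat) Q table
out = soft_quantize(F, Q)
# Coefficients with |F/Q| <= 0.5 vanish, as in real JPEG; the others
# stay close to their quantized values F/Q.
```

Zeroing the small coefficients reproduces JPEG's most destructive effect exactly, while the Gaussian perturbation keeps the operation differentiable almost everywhere for the surviving coefficients.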
For the evaluation procedure, we used the standard JPEG.

The method was trained on the COCO dataset [29]. We used 10000 randomly sampled cover images for the training subset and 1000 for the validation subset. Both subsets were disjoint. The messages were sampled randomly and the spatial spreading was also random. The loss weights $\lambda_E$, $\lambda^{mean}_D$, $\lambda^{var}_D$ and $\lambda_C$ were set to fixed values, and we used Adam [30] with its default parameters for the stochastic gradient descent optimization. The models were trained with a fixed batch size. The final training, with all noiser layers applied, took 100 epochs.

We observed that most of the attacks considered by us can be organized into more general groups based on their specific characteristics. Thus, we classified the attacks with regard to the way in which they affect the image. We also assumed that after any attack the content of the image needs to remain visible and its quality has to be acceptable to the customers. With these assumptions, we consider the following types of attacks:

• Pixel-specific, where we modify only single pixels (without considering any others) by changing color, adding noise, replacing pixels with other random ones, removing some pixels or changing their position in the image. In this group we can specify two subgroups: one that applies one modification to all pixels, and the other that applies one modification to a subset of pixels. A characteristic of this group is that after an attack we either have access to a smaller subset of non-modified pixels, or all pixels were transformed in the same specific way. To this group we assigned attacks such as color space conversions, cropping, cropout, dropout and rotation.

• Local, where we modify pixels with regard to their local neighborhoods. In this group, all pixels are modified during attacks, but only the neighbours of a pixel affect the result (e.g., subsampling, Gaussian blur and resizing).
• Domain, where modifications are domain-specific and even small changes in limited neighbourhoods can globally affect the image represented in a different domain. This group of attacks includes all transform methods, e.g. the Discrete Cosine Transform (DCT).

• Mixed, where the final modification is a combination of methods from the other groups. Here we can distinguish JPEG, which combines color space conversion, subsampling and locally applied DCT.

The analysis of attack types can be important and helpful in the context of designing a training pipeline. Most of the recent deep learning solutions for watermarking use additional noiser layers in order to improve the robustness against particular attacks (e.g. [20, 21, 26]). This requires choosing a finite set of attacks to be applied during the training process. Moreover, all attacks in the training pipeline need to be differentiable, as the noiser layers are usually embedded before the neural network responsible for the message's detection. As such, it requires differentiable approximations of non-differentiable attacks, e.g.
JPEG compression. An appropriate choice of attacks for a training pipeline can yield high robustness also against attacks which were not applied in the training pipeline. In [22], where the authors proposed a distortion-agnostic method using adversarial neural networks, we can observe that even small perturbations generated by attacks classified by us into the local group noticeably decrease the accuracy of the message detection. This could mean that the attack neural network generated distortions belonging to the pixel-specific or domain groups and ignored attacks similar to those from the local group. In our work, we focused on selecting a special set of attacks which covers all four groups.

We conducted an experiment where we trained the pipeline with only a subset of the attacks, from only one of the mentioned groups, and observed its impact on the robustness against the attacks from the same group and from the others. The results are presented in Table 1. The experiments confirmed that there exists a correlation between the ways particular attacks modify the image, and stronger correlations are noticeable between attacks belonging to the same group. It is trivial to notice that the crop and cropout attacks do not modify the complete (whole) patch of an image, i.e. the decoder has access to a non-modified patch. The dropout attack changes random pixels, but the decoder
Figure 3: The visualisation of the attacks' applications. The upper row refers to the noised image $I_n$ and the lower row refers to the normalized difference between the noised image $I_n$ and the encoded image $I_e$. We used min-max normalization. The cover image $I_c$ used is presented in Figure 4.

Table 1: The results of the experiment of applying attacks from the same group during the training process. The values in the table refer to bit accuracies. The red color indicates the attacks which were used during the training and the blue color refers to the best accuracy achieved for the non-applied attacks. Note that the best results were achieved within the same groups of attacks.

| Attacks | Noiser layers: Identity | Crop + Dropout | Gaussian + Subsampling (4:2:0) |
|---|---|---|---|
| Identity | 0.999 | 0.991 | 0.985 |
| Crop | 0.847 | 0.894 | 0.833 |
| Cropout | 0.793 | 0.875 | 0.672 |
| Dropout | 0.530 | 0.972 | 0.574 |
| Rotate (α = 5°) | 0.754 | 0.821 | 0.780 |
| Gaussian (σ = 5) | 0.823 | 0.564 | 0.981 |
| Subsampling (4:2:0) | 0.524 | 0.623 | 0.980 |
| Resize (m = L) | 0.511 | 0.532 | 0.735 |
| JPEG (q = 95) | 0.502 | 0.512 | 0.783 |

could detect the message based on the non-modified pixels. Thus, by applying only a subset of the attacks during training we could achieve a more general robustness across a wider collection of attacks from the same group.

In this section, we present the evaluation of our method and the comparison with the current state-of-the-art solutions. The experiments were done for fixed-size images and a message of length $L = 32$. Our main goal was reducing the local bits-per-pixel capacity, thus we set $k = 2$. With this, each tuple stored two bits of the message plus the related index, which took four bits, so the number of bits required for storing a patch (tuple) was equal to 6 and the number of patches was equal to 16. In order to spread all patches over the image, we needed to locate 16 blocks of $b \times b$ pixels.
This indicates the smallest image size that can carry all the patches. The final method was trained with all types of attacks applied in the noiser layers. We considered the bit accuracy as the metric for measuring the robustness against attacks. The results of the robustness against attacks are presented in Table 2.

Lossy compression algorithms and watermark encoders work in the same subdomain of the image, i.e. they try to modify visually insignificant pixels in order to, respectively, reduce the size of the image or encode additional information in it. Thus, we considered these algorithms as a special and sophisticated group of attacks. Assuming invisibility of the watermark, the encoder tends to change exactly those pixels which are removed or modified by the lossy compression algorithms. Therefore,
Table 2: The results of the bit accuracy for selected types of attacks and the comparison with state-of-the-art methods. Note that the resizing modes were not specified in [22] and [26].

Attacks                   Our      HiDDeN [20]   DADW [22]   RedMark [26]
Identity                  1.0      1.0           1.0         1.0
Crop(p = 0.…)             0.832    –             …           …
Cropout(p = 0.…)          0.902    –             0.925       …
Dropout(p = 0.…)          0.962    ≈ …           …           …
Rotate(α = 5°)            …        –             –           –
Gaussian(σ = 2)           …        …             …           …
Gaussian(σ = 4)           …        –             –           –
Resize(s = 0.…, m = N)    …        –             0.671       0.819
Resize(s = 0.…, m = L)    …        –             …           …
JPEG(q = 50)              …        …             …           …

Figure 4: The encoded image I_e (middle row) and the cover image I_c (top row). The bottom row shows the min-max normalized difference between the cover image I_c and the encoded image I_e.

in our work we mainly focus on preserving robustness against lossy compression techniques, as contemporary multimedia applications and services use lossy compression algorithms by default and it is impossible to skip the compression step due to technical reasons, such as the limitations of the broadcast bandwidth.

The method was evaluated at a PSNR equal to … dB, … dB and … dB for the Y, Cb and Cr channels, respectively. The quality of the encoded image is similar to the results achieved in [20]. In [22, 26], the authors reported slightly higher values of the PSNR. All methods achieved image quality similar to that of lossy compression algorithms [31], where the average PSNR over all channels is typically above … dB. We did not take into consideration the robustness results from [21], because their method modifies the image significantly. In order to compare the distortion level, we calculated the PSNR for our validation dataset after applying the JPEG compression algorithm with the quality factor q = 50 and 4:2:0 subsampling; we achieved a PSNR equal to … dB, … dB and … dB for the Y, Cb and Cr channels, respectively. Without the subsampling, we achieved … dB, … dB and … dB. The PSNR results suggest that the message was encoded chiefly on the Y channel.
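The per-channel transparency metric used above can be sketched as follows. This is a generic illustration of computing PSNR separately for the Y, Cb and Cr channels; the array names and the synthetic stand-in images are hypothetical, not the authors' data.

```python
# Illustrative per-channel PSNR (the transparency metric reported above)
# between a cover image I_c and an encoded image I_e in YCbCr space.
import numpy as np

def psnr(channel_a, channel_b, peak=255.0):
    """PSNR in dB between two single-channel uint8 images."""
    diff = channel_a.astype(np.float64) - channel_b.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")                    # identical channels
    return 10.0 * np.log10(peak ** 2 / mse)

# Synthetic stand-ins for a cover/encoded pair (real use: YCbCr images).
rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)       # I_c
noise = rng.integers(-2, 3, size=cover.shape)                          # small watermark residual
encoded = np.clip(cover.astype(int) + noise, 0, 255).astype(np.uint8)  # I_e
for ch, name in enumerate(("Y", "Cb", "Cr")):
    print(f"PSNR({name}) = {psnr(cover[..., ch], encoded[..., ch]):.2f} dB")
```

Comparing the three per-channel values reveals where the message energy was placed: a lower PSNR on one channel relative to the others indicates that channel absorbed most of the embedding distortion.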
In this paper we proposed a watermarking method based on spatial spreading of the message. Our architecture is built from convolutional neural networks and is scalable to any size of an image. We developed a dedicated architecture for the encoder network, where the cover image and the message are fed into every second layer. We also formulated a novel, custom loss function for training the neural networks. In comparison to previous methods, our watermarking system provides a significant improvement in robustness against Gaussian smoothing, resizing and JPEG, which are attacks from the local group. The work was extended with additional attack types, such as 4:2:0 subsampling and rotation. We also achieved bit accuracies above 0.… for all considered attacks. This indicates that the method accomplished high general robustness, exceeding previous solutions. As a way to obtain our results, we additionally provided a precise differentiable approximation of JPEG compression and grouped the attacks on the watermark based on their scope.

In future work we would like to continue to improve the robustness against the attacks as well as to apply and evaluate multi-attack scenarios. We would like to increase the message capacity and extend the solution to the video domain and video-specific compression algorithms. Moreover, some other quality measures, like the one presented in [32], may be considered in order to adjust the transparency.

References

[1] Netflix's Support Site. Downloading TV shows and movies on Netflix. https://help.netflix.com/en/node/54816, last accessed on 2020-01-08.
[2] V. M. Potdar, S. Han, and E. Chang. A survey of digital image watermarking techniques. In
INDIN ’05: 2005 3rd IEEE International Conference on Industrial Informatics, pages 709–716, Aug 2005.
[3] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon. Secure spread spectrum watermarking for multimedia.
IEEE Transactions on Image Processing, 6(12):1673–1687, Dec 1997.
[4] C. Pun. A novel DFT-based digital watermarking system for images. In , volume 2, Nov 2006.
[5] E. Najfi. A robust embedding and blind extraction of image watermarking based on discrete wavelet transform. In
Mathematical Sciences, volume 11, pages 307–318, Dec 2017.
[6] R. P. Singh, N. Dabas, Nagendra, and V. Chaudhary. Weighted extreme learning machine for digital watermarking in DWT domain. In , pages 393–396, Sep 2015.
[7] Chandan Kumar, Anuj Kumar Singh, and Priyadarshni Kumar. Improved wavelet-based image watermarking through SPIHT.
Multimedia Tools and Applications, pages 1–14, 2018.
[8] Nasrin M. Makbol and Bee Ee Khoo. A new robust and secure digital image watermarking scheme based on the integer wavelet transform and singular value decomposition.
Digital Signal Processing, 33:134–147, 2014.
[9] Chih-Chin Lai and Cheng-Chih Tsai. Digital image watermarking using discrete wavelet transform and singular value decomposition.
IEEE Transactions on Instrumentation and Measurement, 59(11):3060–3063, 2010.
[10] R. Dubolia, R. Singh, S. S. Bhadoria, and R. Gupta. Digital image watermarking by using discrete wavelet transform and discrete cosine transform and comparison based on PSNR. In , pages 593–596, June 2011.
[11] J. L. Divya Shivani and Ranjan K. Senapati. Robust image embedded watermarking using DCT and listless SPIHT.
Future Internet, 9:33, 2017.
[12] Jagdish C. Patra, Jiliang E. Phua, and Cedric Bornand. A novel DCT domain CRT-based watermarking scheme for image authentication surviving JPEG compression.
Digital Signal Processing, 20(6):1597–1611, 2010.
[13] S. Natu, P. Natu, and T. Sarode. Improved robust digital image watermarking with SVD and hybrid transform. In , pages 177–181, Dec 2017.
[14] B. Ahmaderaghi, F. Kurugollu, J. M. D. Rincon, and A. Bouridane. Blind image watermark detection algorithm based on discrete shearlet transform using statistical decision theory.
IEEE Transactions on Computational Imaging, 4(1):46–59, March 2018.
[15] A. Mishra, A. Goel, R. Singh, G. Chetty, and L. Singh. A novel image watermarking scheme using extreme learning machine. In
The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–6, June 2012.
[16] A. Rajpal, A. Mishra, and R. Bala. Fast digital watermarking of uncompressed colored images using bidirectional extreme learning machine. In , pages 1361–1366, May 2017.
[17] Akshya Kumar Gupta and Mehul S. Raval. A robust and secure watermarking scheme based on singular values replacement.
Sadhana, 37(4):425–440, 2012.
[18] Nasrin M. Makbol and Bee Ee Khoo. Robust blind image watermarking scheme based on redundant discrete wavelet transform and singular value decomposition.
AEU-International Journal of Electronics and Communications, 67(2):102–112, 2013.
[19] Khaled Loukhaoukha, Jean-Yves Chouinard, and Mohamed Haj Taieb. Optimal image watermarking algorithm based on LWT-SVD via multi-objective ant colony optimization.
Journal of Information Hiding and Multimedia Signal Processing, 2(4):303–319, 2011.
[20] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. HiDDeN: Hiding Data with Deep Networks. In
The European Conference on Computer Vision (ECCV), Sep 2018.
[21] Bingyang Wen and Sergul Aydore. ROMark: A Robust Watermarking System Using Adversarial Training. arXiv e-prints, page arXiv:1910.01221, October 2019.
[22] Xiyang Luo, Ruohan Zhan, Huiwen Chang, Feng Yang, and Peyman Milanfar. Distortion Agnostic Deep Watermarking. arXiv e-prints, page arXiv:2001.04580, January 2020.
[23] Xin Zhong and Frank Y. Shih. A Robust Image Watermarking System Based on Deep Neural Networks. arXiv e-prints, page arXiv:1908.11331, August 2019.
[24] A. Fierro-Radilla, M. Nakano-Miyatake, M. Cedillo-Hernandez, L. Cleofas-Sanchez, and H. Perez-Meana. A Robust Image Zero-watermarking using Convolutional Neural Networks. In , pages 1–5, May 2019.
[25] Ippei Hamamoto and Masaki Kawamura. Neural watermarking method including an attack simulator against rotation and compression attacks.
IEICE Transactions on Information and Systems, E103.D:33–41, January 2020.
[26] Mahdi Ahmadi, Alireza Norouzi, Nader Karimi, Shadrokh Samavi, and Ali Emami. RedMark: Framework for residual diffusion watermarking based on deep networks.
Expert Systems with Applications, 146:113157, 2020.
[27] William B. Pennebaker and Joan L. Mitchell.
JPEG Still Image Data Compression Standard. Kluwer Academic Publishers, USA, 1st edition, 1992.
[28] Michael Parker. Chapter 25 - Image and video compression fundamentals. In Michael Parker, editor,
Digital Signal Processing 101 (Second Edition), pages 329–346. Newnes, second edition, 2017.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors,
Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
[30] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization.
International Conference on Learning Representations, Dec 2014.
[31] Mauro Barni.
Document and Image Compression. CRC Press, 2006.
[32] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In