Robust watermarking with double detector-discriminator approach

Abstract

In this paper we present a novel deep framework for watermarking, a technique of embedding a transparent message into an image in a way that allows retrieving the message from a (perturbed) copy, so that copyright infringement can be tracked. For this technique, it is essential to extract the information from the image even after some digital processing operations have been applied to it. Our framework outperforms recent methods in robustness not only against a spectrum of attacks (e.g. rotation, resizing, Gaussian smoothing) but also against compression, especially JPEG. The bit accuracy of our method is at least 0.86 for all types of distortions. We also achieved 0.90 bit accuracy for JPEG, while recent methods provided at most 0.83. Our method retains high transparency and capacity as well. Moreover, we present our double detector-discriminator approach, a scheme to detect and discriminate whether the image contains the embedded message or not, which is crucial for real-life watermarking systems and has not so far been investigated using neural networks. With this, we design a testing formula to validate our extended approach and compare it with a common procedure. We also present an alternative method of balancing image quality against robustness to attacks, which is easily applicable to the framework.



Marcin Plata, Department of Fundamentals of Computer Science, Wroclaw University of Science and Technology, Wroclaw, Poland

Piotr Syga, Vestigit, Wroclaw, Poland

The research on watermarking techniques in the multimedia sector is rising significantly: the presented methods achieve ever-increasing accuracy while mitigating the flaws of embedding additional information into the image, and the need for pervasive marking of intellectual property emerges due to the increasing rate of piracy. As multimedia services work on simplifying content delivery systems and broadening their accessibility, the increased number of online assets makes it easy to intercept content and redistribute it illegally for the violator's own benefit, with the sheer amount of media online helping to hide the infringement.
In order to prevent it, many techniques are used, with watermarking being one of the most effective methods [1, 2]. The idea of digital watermarking assumes that one embeds an additional message into an image or a video, so that when it is copied, the owner of the content may prove his rights. Moreover, when digital media is made available to a wide range of users, embedding a watermark that is unique for each user makes it possible, in the case of an unauthorised redistribution, to indicate which user has leaked the content. The aim is to allow any authorized party easy localization, extraction and identification of the watermark, even if the copy has been manipulated. Since a person who aims to violate the copyright wants to destroy the embedded watermark, it has to be robust against image processing attacks that remove the identification information encoded in it. Moreover, as the main interest of the copyright owner is to provide the highest possible quality of his property to the clients, the embedded watermark should not be visible to the end user (transparency), nor should it diminish the quality of the image. It is easy to note that the requirements of transparency and robustness are to some extent opposing (assuming a constant capacity), hence it is of paramount importance for a watermarking system to provide an appropriate balance between the two. Naturally, the higher the demand for the content, the better it should be protected, hence the watermark should have a capacity that allows identifying a large number of unique end users.

Note that not only adversarial attacks hinder the effectiveness of watermarking. In order to maximize the network transfer, images and videos are compressed so that almost all surplus information, i.e., the information that is not properly interpreted by human perception, is removed, leaving only the data that are important for the visual quality. An example of the information that is removed in compression (e.g. JPEG) is chrominance (Cr, Cb), to which human sight is less sensitive than to the luminance (Y) of an object. Since information in the chroma components of the colorspace is trimmed significantly, the watermark has to be embedded into the components that the end user is more aware of. Note that, in simplification, video watermarking may be viewed as an extension of image watermarking with an obstacle: MPEG compression, which encodes primary frames (I) using JPEG and other frames (P and B) by relying on references to the primary ones.

Preprint. Under review. arXiv [cs.MM], June.

Related Work.

Various methods have been used to provide robust and transparent watermarking. In [3, 4] the authors used the Discrete Cosine Transform (DCT) in order to comply with JPEG compression, whereas a different frequency approach, combined with Singular Value Decomposition (SVD), was proposed in [5, 6] (a sharp frequency localized contourlet transform and a discrete wavelet transform were used, respectively), as frequency-domain modifications make it easy to 'spread' the information across the visible image. In [7], the authors used redundancy in their dual watermarks to improve the robustness against cropping. A similar effect was achieved by an end-to-end encoder-noiser-decoder framework proposed in [8] that spread the embedded message over all pixels. A follow-up paper [9] introduced adversarial training, which resulted in further improvement of the robustness, albeit at the expense of the transparency. In fact, the resulting image significantly deviates from the well established ranges of 'acceptable' PSNR, diminishing its commercial value. A spread spectrum watermarking with adaptive strength and differential quantization allowed the authors of [10] to improve PSNR guarantees. The robustness of the algorithm was the focus of [11], where the authors added another neural network that generates generic distortions during training. The presented model improved on previous accuracy, however the authors did not provide tests of robustness against such common attacks as rotation and subsampling. The authors of [12] focused on the local capacity of an image, presenting a method that improves the robustness of the watermark against a wide range of attacks, as well as proposing a differentiable approximation of JPEG compression. The latter was also investigated in [13]. Coping with the difficulties introduced by image compression was likewise the focus of [14].

Our Contribution.

In this paper, we introduced (1) a novel end-to-end watermarking system utilizing an additional component to inspect whether images contain a watermark. (2) We enhanced the architecture of the system and extended the encoder by a watermark adapter, resulting in a significant improvement of the robustness against some attacks and compression algorithms, such as rotation, resizing or JPEG. (3) We proposed a novel evaluation method coping with false suspicions of a copyright violation and demonstrated the efficiency of our discriminator-detector approach. (4) We also provided an alternative method to increase the transparency of encoded images and showed its efficiency.

The aim of watermarking techniques is to embed some binary data $m \in \{0, 1\}^L$ into a cover image $I_{co}$ of shape $W \times H \times C$. To achieve that, we use the encoder $E$, which returns the encoded image $I_{en}$. The encoded image needs to pass the transparency requirement, i.e., be perceptually similar to the cover $I_{co}$. Moreover, the encoded image should be robust against some processing operations called attacks (we utilize noisers in the training phase to simulate attacks). Then, we process the distorted image $I'_{en}$ using the decoder $D$ to extract the message $m' \in \{0, 1\}^L$. We aim to obtain $|m' - m| < \delta$, i.e. the extracted message should be similar to the embedded one. Note that in the real-life scenario we also need to determine whether the investigated image contains the additional data or not. To distinguish between distorted images $I'_{co}$ and $I'_{en}$, we use the discriminator $F$.

Our system contains eight components, three of which are trainable neural networks: the encoder $E$, the decoder $D$ and the discriminator $F$. The next two are differentiable layers, the noiser $N$ and the prenoiser $N_{pre}$, used for adding artificial noise to the images. The last two elements, the propagator $P$ and the translator $T$, are required to propagate the message to a spatial form and to revert this operation. We also distinguish an auxiliary neural network called the adapter $A$, which extends the propagator $P$. The overall architecture is presented in Fig. 1.

Propagator, adapter and translator.

In our solution, we applied a method to reduce the local bits-per-pixel capacity. Instead of assigning a representation of the message $m$ to every pixel of the cover image $I_{co}$, as in [8, 11], among others, we used the spatial spreading algorithm proposed in [12]. The algorithm converts the message $m$ to a sequence $s = (s_0, s_1, \dots, s_{\lceil L/k \rceil - 1})$, where $s_i$ is a tuple containing a slice of the message $m_{[ik, \dots, ik+k-1]}$ and a reference index in binary form $\mathrm{bin}(i)$. Here $k$ is the size of the slice and $i \in \{0, \dots, \lceil L/k \rceil - 1\}$. It is easy to notice that we need at most $\lceil \log_2(L/k) \rceil$ bits to store the binary index $\mathrm{bin}(i)$. For example, for $k = 2$ and $L = 32$, we need to prepare a sequence $s$ of length 16, with each tuple $s_i$ containing 6 bits (4 bits for $\mathrm{bin}(i)$ and 2 bits for $m_{[2i, 2i+1]}$). We also define $k' = k + \lceil \log_2(L/k) \rceil$ for simplicity of further formulations.

The propagator $P$ converts the message $m$ to a sequence $s$ and generates two spatially-spread representations of the message, i.e. $M \in \{0, 1\}^{\frac{W}{b} \times \frac{H}{b} \times k'}$ and $M_{ext} \in \{0, 1\}^{W \times H \times k'}$, where $b$ is an argument of the propagator that sets the size of a unitary block of shape $b \times b \times k'$, which contains some tuple $s_i$ replicated $b$ times in two directions. The tuples $s_i$ and the unitary blocks are randomly sampled in order to fill the messages $M$ and $M_{ext}$, respectively. In the case of $M$, we assign tuples $s_i$ to cells of a grid of shape $\frac{W}{b} \times \frac{H}{b}$, while in the case of $M_{ext}$, to every cell of the grid we assign a unitary block of shape $b \times b \times k'$, thus the final shape of $M_{ext}$ is equal to $W \times H \times k'$. The translator $T$ works with the message $M$ and reverts the operations applied by the propagator $P$.
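The spreading step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `message_to_tuples` and `propagate`, the ordering of bits inside a tuple (slice first, index second), the zero-padding of a short last slice, and building $M_{ext}$ by upsampling $M$ are all assumptions made for illustration.

```python
import math
import random


def message_to_tuples(message, k):
    """Split a list of bits into tuples: k message bits followed by bin(i)."""
    n_slices = math.ceil(len(message) / k)
    # Number of index bits, ceil(log2(L/k)); at least 1 bit (assumption).
    index_bits = max(1, math.ceil(math.log2(n_slices)))
    tuples = []
    for i in range(n_slices):
        chunk = message[i * k:(i + 1) * k]
        chunk = chunk + [0] * (k - len(chunk))  # zero-pad last slice (assumption)
        index = [int(c) for c in format(i, f"0{index_bits}b")]
        tuples.append(chunk + index)  # slice-then-index order is an assumption
    return tuples


def propagate(message, k, b, width, height, rng=random):
    """Return (M, M_ext): a grid of tuples and its b-fold replicated copy."""
    tuples = message_to_tuples(message, k)
    grid_w, grid_h = width // b, height // b
    # M: every grid cell receives a randomly sampled tuple.
    M = [[rng.choice(tuples) for _ in range(grid_w)] for _ in range(grid_h)]
    # M_ext: each cell of M is replicated into a b x b unitary block.
    M_ext = [[M[y // b][x // b] for x in range(width)] for y in range(height)]
    return M, M_ext
```

For $k = 2$ and $L = 32$, `message_to_tuples` returns 16 tuples of 6 bits each, matching the worked example in the text.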
In order to extract some slice of the message $m$, the translator chooses $n$ tuples (cells) whose stored binary indices are closest to the index of the slice, then the elements of the chosen tuples that refer to the slice of the message are averaged.

We extended the propagator $P$ by adding the adapter $A$, a convolutional neural network used to adapt the propagator output to a form convenient for the encoder $E$. A similar approach was presented in [11], where linear layers were used to adapt the message $m$ directly. The adapter $A$ was separated from the encoder $E$ as this allows producing the result of $A(M_{ext})$ independently of the processing steps of the images. It could be particularly relevant when working with a sequence of frames. The adapter $A$ is built with four convolutional blocks called conv-bn-relu, each of which contains a convolutional layer, batch normalization and ReLU activation. The last block outputs 6 channels.

Discriminator.

The primary role of the discriminator $F$ was the application of the adversarial training approach [15, 16] in order to improve the perceptual similarity of the encoded and cover images. We also utilized it to indicate whether the image contains the message or not. The details and motivation for using the discriminator in this way are presented in Sect. 4. The discriminator is built with 3 conv-bn-relu blocks, a global average pooling layer, and a linear layer with a single output unit and Sigmoid activation.

Encoder and decoder.

The encoder $E$ and decoder $D$ are the two main components of our watermarking system. Both networks need to cooperate during training in order to find the balance between the transparency and robustness requirements of the watermarking. Moreover, they determine a joint "scheme" of the encoding-decoding procedure which is known only to the trainable components of the pipeline. The watermarking system is designed to work with some real-life limitations. As an example, note that watermarking videos is currently in much higher demand than watermarking strictly still images, hence any proper image watermarking method should be easily extended to marking movies (e.g., on a frame-by-frame basis). In over-the-top (OTT) media, services providing video streaming are severely constrained by the content servers (origin or cache) that, due to storage limitations, are not able to provide different contents to all users (with a key per user). As a consequence, to allow uniquely watermarked media for each user, the encoder needs to work on the client's side and handle a high quality video [17]. This indicates that the proposed architecture of the encoder (and indirectly of the other neural networks) has to be relatively small and shallow. Additionally, it explains why the adapter $A$ is considered a separate component, which allows it to be used once for all frames.

The encoder processes the cover image using three conv-bn-relu blocks. Next, the output is concatenated with the cover image and the adapted message taken from $A$, followed by the application of two conv-bn-relu blocks. Finally, we use a convolutional layer with the kernel size equal to on the concatenation of the cover image and the prior output. The decoder is based on eight conv-bn-relu blocks. We apply average pooling with the kernel size and stride equal to $b$ to the output. Finally, we use a conv-bn-relu block and a convolution, with the kernel sizes equal to for both.

Attacks and noiser layers.

In our work, we considered eight attack types, all implemented in the noiser layer $N$. The crop attack crops a square from the encoded image and is parameterized by $p$, the ratio of the squared area. The cropout also crops the square, while the rest of the image is replaced by the cover image. The dropout attack chooses pixels with probability $p$ and replaces them with the corresponding pixels of the cover image. Both cropout and dropout imitate a binary symmetric channel, which in information theory describes a more challenging model of communication than the binary erasure channel (that may be simulated by salt and pepper noising) [8]. We also included common computer vision operations (resizing, rotation and Gaussian smoothing) as well as the most common compression algorithm, JPEG (with quality parameter $q$), and the 4:2:0 chroma subsampling procedure. The JPEG algorithm includes non-differentiable operations, e.g. quantization, thus we could not apply it in the training pipeline without halting the updates of the neural network weights. To handle this problem, we used an approximation of JPEG proposed in [13, 12]. In the experiments described in Sect. 5, the noiser layer $N_{pre}$ is executed before $N$ and always applies the dropout.

Training details and hyper parameters.

To train and test our method, we used the COCO dataset [18]. We sampled images for the training subset and for the testing subset. Both subsets were disjoint. We resized images to × pixels and encoded a message of length $L = 32$. The messages were sampled at random, and the spatial spreading was random as well. We used Adam [19] with learning rate equal to . (other parameters were set to default). The pipeline was trained with the batch size for epochs and all attacks were applied (one type in each iteration of the training). At the end, we froze all the weights except the discriminator's and ran the discriminator's part of the pipeline for 20 epochs. To train the system, we used two GPUs (Nvidia RTX 2080Ti 11GB). One epoch of the pipeline training took about 370 seconds, while during inference we were able to process about 45 images per second by one component using one GPU. The default parameters for the convolutional layers were as follows: the channel size , kernel size and reflection padding applied. For the linear layers we set units by default. All neural networks were fed images in the YCrCb color space, and the same space was used for the images returned by the encoder.

The training procedure and objectives.

In one iteration of the pipeline, first, we take a message $m$ and pass it to the propagator $P$, which returns two variants of the spatially-spread message, i.e. $M_{ext}, M = P(m)$. Next, we use the adapter $A$ to transform the message $M_{ext}$ and encode the cover image: $I_{en} = E(I_{co}, A(M_{ext}))$. The output of the operation is the encoded image $I_{en}$, which is distorted by the noiser layer $N$ afterward. Exactly the same distortion needs to be applied to the cover image $I_{co}$, thus we have $I'_{en}, I'_{co}, M = N(I_{en}, I_{co}, M)$. Note that some types of attacks require the cover image $I_{co}$ in order to distort the encoded image $I_{en}$. Moreover, for the types of noise which affect the encoded image spatially, we also calibrate the message $M$, e.g. for cropping attacks, we also crop the message $M$.

The decoder $D$ is fed a noised encoded image $I'_{en}$ and predicts the encoded message $M' = D(I'_{en})$, while the discriminator $F$ distinguishes between the noised cover and encoded images, $I'_{co}$ and $I'_{en}$ respectively, i.e., $F(I' \in \{I'_{co}, I'_{en}\}) \in [0, 1]$.

We need to determine a loss component in order to train the discriminator used for two purposes: improving the perceptual similarity and serving as an auxiliary component of our watermarking system for the double discriminator-decoder approach. Naturally, the main focus is to keep the transparency and robustness at the highest possible level.

We seek to ensure the transparency using the mean square error between $I_{co}$ and $I_{en}$, thus $L_E(I_{en}, I_{co}) = \frac{1}{WHC} \|I_{en} - I_{co}\|^2$, where $\|\cdot\|$ is the Frobenius norm. In order to handle the message decoding, we used the mean-variance approach proposed in [12], given by:

$$L_D(M, M') = \frac{b^2}{WH} \sum_{w=0}^{W_{/b}} \sum_{h=0}^{H_{/b}} \left[ \lambda_D^{mean} \, \mathrm{mean}(|M_{hw} - M'_{hw}|) + \lambda_D^{var} \, \mathrm{var}(|M_{hw} - M'_{hw}|) \right],$$

where $W_{/b} = \frac{W}{b} - 1$, $H_{/b} = \frac{H}{b} - 1$ and $|\cdot|$ returns element-wise absolute values.
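As a rough illustration of the two losses just defined, the following pure-Python sketch evaluates $L_E$ on flattened pixel lists and $L_D$ on a grid of tuples. The function names, the use of population variance, and the default weights of 1.0 are assumptions for this example and are not taken from the paper.

```python
def mse_loss(encoded, cover):
    """L_E: mean squared difference over flat pixel lists of equal length."""
    return sum((e - c) ** 2 for e, c in zip(encoded, cover)) / len(cover)


def mean_variance_loss(M, M_pred, lam_mean=1.0, lam_var=1.0):
    """L_D: weighted mean and variance of |M - M'| per cell, averaged over cells.

    M and M_pred are grids (lists of rows) whose cells are lists of k' values.
    The weights are placeholders, not the values used by the authors.
    """
    total, cells = 0.0, 0
    for row, row_pred in zip(M, M_pred):
        for cell, cell_pred in zip(row, row_pred):
            err = [abs(a - p) for a, p in zip(cell, cell_pred)]
            mean = sum(err) / len(err)
            # Population variance of the absolute errors (assumption).
            var = sum((e - mean) ** 2 for e in err) / len(err)
            total += lam_mean * mean + lam_var * var
            cells += 1
    # Dividing by the number of cells equals the b^2 / (W H) factor.
    return total / cells
```

The variance term penalizes cells in which some bits are decoded well and others poorly, so whole tuples are pushed towards being either reliable or unreliable.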
This formulation of the loss converges to a case in which some tuples $M'_{ij}$ contain possibly high quality of the overall data, instead of returning the correct predictions for a proper subset of the indices for all tuples. Note that, due to the high redundancy of tuples in $M'$, this way of convergence is advisable.

We provided the adversarial training of the encoder $E$ using the discriminator $F$. The aim of the encoder is to produce an image recognized as the cover image by the discriminator, while in our pipeline the image is further distorted by the noiser $N$. Thus, we defined $L_E^F(I'_{en}) = \log(F(I'_{en}))$. The aim of the discriminator is to distinguish between the distorted cover and encoded images, which was formulated as $L_F^F(I'_{co}, I'_{en}) = \log(F(I'_{co})) + \log(1 - F(I'_{en}))$.

To obtain the parameters for the encoder $E$, adapter $A$ and decoder $D$, we minimize the objective over the distribution of images and messages, namely $\mathbb{E}_{I_{co}, M}[\lambda_E L_E + \lambda_F L_E^F + L_D]$, with $\lambda$ weights. To train the discriminator $F$, we minimize the objective over the distribution of images: $\mathbb{E}_{I_{co}}[L_F^F]$.

In this section we present the efficiency of the image encoding-decoding procedure of our watermarking solution. In order to compare our framework with the well established ones, we applied the common evaluation approach, which relies on calculating the bit accuracy of the detected messages. We set the propagator's parameters to $k = 2$ and $b = 16$ and the translator parameter $n = 3$. We split the message $m$ into tuples and further replicate them to fill the unitary blocks of the size × × . The expected number of the blocks' redundancy was equal to . We trained the pipeline with the following parameters: $\lambda_E$ = 2. , $\lambda_D^{mean}$ = 1. , $\lambda_D^{var}$ = 1. and $\lambda_F$ = 0. . The comparison of the bit accuracy between our solution and the state-of-the-art works is presented in Tab. 1.
The results were achieved for the PSNR equal to . dB, . dB and . dB for the Y, Cb and Cr channels, respectively. Examples of the encoded images are presented in Figure 2. Our solution exhibited significant improvement for the rotation, resizing and JPEG attacks. Moreover, it provided the highest overall accuracy among all methods, measured as the lowest bit accuracy over all attacks. For our solution, the lowest bit accuracy was equal to 0.86, while it was . for [12] and below . for other methods.

Table 1: The robustness against selected attacks for our method and state-of-the-art systems with respect to the average bit accuracy. Note that the resizing modes were not declared in [11, 13].
Methods (columns): Our, Spatial [12], HiDDeN [8], DADW [11], RedMark [13]
No attack: 1.000 1.000 1.000 1.000 1.000
Crop(p = 0. ): 0.860 0.832 -
Cropout(p = 0. ): 0.921 0.902 - 0.925
Dropout(p = 0. ): 0.981 0.962 ≈
Rotate(α = 5°):
Gaussian(σ = 2):
Gaussian(σ = 4):
Resize(s = 0. , m = N):
Resize(s = 0. , m = L): 0.886 -
JPEG(q = 50):

Figure 2: The cover images $I_c$ (above), the encoded images $I_e$ (middle) and their differences with the min-max normalization (below).

In the real-life environment, watermarking systems are used only in a small subset of the total multimedia content worldwide. Thus, in order to detect the message, we need to follow one of the following approaches:
1. naïve: apply the detection procedure on every image and label the suspects when the extracted key shares at least $t$ bits with any of the keys from the database,
2. double: at first execute a procedure to distinguish whether the given content comes from our watermarked sources or not.

The former relies on using a highly effective detection procedure. For example, let us assume that $t = 29$ (estimated for accuracy of about 90%) and the watermark contains 32 bits. Having one million random keys in the database, the probability of the event that at least one key from the database contains at least 29 shared bits is equal to $1 - \left( \sum_{i=0}^{28} \binom{32}{i} 0.5^i \, 0.5^{32-i} \right)^{10^6} \approx 0.72$.
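The probability in the example above can be checked numerically. The short sketch below assumes, as the example does, that the database keys are independent and uniformly random; the function name `false_match_probability` is introduced here for illustration only.

```python
from math import comb


def false_match_probability(bits=32, threshold=29, keys=10**6):
    """Probability that at least one of `keys` uniformly random keys shares
    at least `threshold` of `bits` bits with a given extracted bit string."""
    # Chance that a single random key matches in >= threshold positions.
    p_single = sum(comb(bits, i) for i in range(threshold, bits + 1)) / 2 ** bits
    # Complement of "no key in the database matches".
    return 1.0 - (1.0 - p_single) ** keys


print(round(false_match_probability(), 2))
```

With the values from the text (32 bits, $t = 29$, one million keys) the single-key match probability is 5489 / 2^32, and the overall probability evaluates to roughly 0.72.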
Thus, the chance of failure is high even for a relatively small database of keys and a high accuracy of the decoder. The latter could significantly improve the overall efficacy of the watermarking system, but it requires an auxiliary subsystem to distinguish between the content sources.

Instead of using the critic $F$ only to rate whether the encoded image is similar to the original one, we moved its position in the system and placed it after the noiser $N$ in the training pipeline. By this, we fed the critic $F$ with noised images $I_{no}$. Therefore, cover images $I_{co}$ also needed to be processed by the noiser $N$ in the same way as encoded images $I_{en}$; otherwise the critic would be able to learn features which are effects of processing by the noiser rather than by the encoder. This modification still allowed us to use the critic $F$ for adversarial training aimed at improving the transparency of the encoded images.

Used metrics.

During the tests, we assume that an image $I$ contains the watermark if $F(I) > t_F$, where $t_F \in [0, 1]$ is a threshold for the critic's outputs. We utilized the standard metrics: true positive (TP), when $I$ is an encoded image and $F(I) > t_F$; false positive (FP), when $I$ does not contain an embedded message and $F(I) > t_F$; true negative (TN), when $I$ does not contain an embedded message and $F(I) \le t_F$; false negative (FN), when $I$ is an encoded image and $F(I) \le t_F$. While designing

Table 2: The results of the robustness against some attacks validated using our novel testing formula. The notably better scores are presented in bold.
Columns: Double (TIR, FIR_en, FIR_co), Naïve (TIR, FIR_en, FIR_co)
No attack: 0.999 0.001 0.000 0.999 0.001 0.005
Crop(p = 0. ): 0.219
Cropout(p = 0. ): 0.550
Dropout(p = 0. ): 0.940 0.007 0.008 0.951 0.004 0.006
Rotate(α = 5°): 0.983 0.016
Gaussian(σ = 2): 0.950 0.000 0.474 0.998 0.000
Gaussian(σ = 4): 0.956 0.001 0.410 0.999 0.001
Subsampling(4:2:0): 0.971 0.000 0.000 0.975 0.000 0.001
Resize(s = 0. , m = N): 0.630
Resize(s = 0. , m = L): 0.540
JPEG(q = 50): 0.712 0.069

(1) true identification rate (TIR), defined as the probability of extracting the appropriate key provided a true encoded image, i.e., $\Pr(\text{extract true key} \mid \mathrm{TP})$, (2) false identification rate (FIR), defined as the probability of indicating a wrong key provided a true encoded image, or of falsely classifying an image as encoded, i.e., $\Pr(\text{true key not extracted} \mid \mathrm{TP}) \Pr(I_{en}) + \Pr(\mathrm{FP}) \Pr(I_{co})$. It is easy to note that TIR covers the maximization problem discussed above, whereas FIR complies with the minimization. Working with both rates gives us the ability to precisely validate the watermarking system. We also consider FIR for false identification of encoded images $I_{en}$ and cover images $I_{co}$ separately. Hence, we define $\mathrm{FIR}_{en} = \Pr(\text{true key not extracted} \mid \mathrm{TP})$ to test the indication of wrong keys from encoded images, and $\mathrm{FIR}_{co} = \Pr(\mathrm{FP})$ to validate falsely classified cover images.

Robustness of the discriminator-decoder approach.

To the best of our knowledge, recent work on neural network-based watermarking did not include experiments following conditions similar to those described in Sect. 4. The vast majority of the research focuses on improving only the detection procedures and the transparency of encoded images. We validate our system using the procedure common in recent works, e.g. [12, 8, 11, 13] (see Sect. 3), as well as TIR and FIR.

Taking advantage of this testing formula, we showed the efficiency of our discriminator-decoder approach compared to the naïve system with a single decoder. The experiments were done for images. We assumed that the size of the keypool is equal to . The results are shown in Table 2. The threshold was adjusted to obtain the highest TIR for the naïve approach; however, due to the high sensitivity of FIR_co, we chose a lower one if the difference between them was at most . Then, we adjusted the rates for the double approach keeping a similar TIR. Our results confirmed the high efficiency of the double approach. In most cases, the obtained results were better than in the naïve one, and for some attacks the difference was significant. The double approach improves the overall performance significantly for some of the attacks for which the decoder presents relatively low bit accuracy, i.e., lower than . . For example, having similar values for TIR, we completely discarded FIR_co from around . and reduced FIR_en around - times for cropping and cropout attacks. In turn, we observed that the naïve approach is as efficient as the double approach if the decoder demonstrates a high robustness against some attacks. Moreover, we observed that with an inefficient discriminator and a robust detector, the results could be better for the naïve approach, e.g., for Gaussian smoothing, the decoder's bit accuracy was close to . and the discriminator's . (on balanced data), and finally the overall performance was higher for the naïve approach.

Figure 3: The visualization of the experiment of improving the transparency of images using dropout. (a) PSNR for the Bernoulli parameter $p$. (b) Bit accuracy for the Bernoulli parameter $p$.

One of the most challenging problems of watermarking techniques is encoding a message into an image in a transparent manner that enables accurate decoding of the message from a distorted image. A robust watermarking needs to store the message in a part of the image that is the most resistant to distortions resulting from attacks as well as compression algorithms.

Finding a suitable trade-off between transparency and robustness is possible by a proper tuning of the hyper parameters of the loss function during the training of the pipeline. Despite the fact that the tuning for our pipeline is a time-consuming and exhaustive process, including changing parameters in the middle of the training procedure, the pipeline of three neural networks was usually not balanced in the way that was expected. Additionally, when creating a system that produces images with different transparency levels, the basic approach requires training separate neural networks. Thus, we extended our framework by a component which allows improving the quality of the encoded image.

We designed a method to improve the transparency of the encoded image that can be applied after the training process. The method performs the following steps:
1. select a mask $S \in \{0, 1\}^{W \times H \times C}$, where $S_{whc} \sim B(p)$, i.e., sampled from the Bernoulli distribution with the parameter $p$,
2. update $I_{en} \leftarrow S \circ I_{en} + (1 - S) \circ I_{co}$, where $\circ$ denotes element-wise multiplication.

Results.

We trained our pipeline with the hyper parameters set in a way that firmly favours robustness against attacks, i.e. we set $\lambda_E$ = 3. , $\lambda_D^{mean}$ = 1. , $\lambda_D^{var}$ = 1. and $\lambda_F$ = 0. . Then, we compared the encoded image quality and the robustness against attacks for various values of the parameter $p$ of the Bernoulli distribution. The results are presented in Figure 3. They confirm that an increase of the images' transparency has a negative impact on the robustness against some attacks. In [13], the authors presented another method of increasing the transparency by calculating a linear combination of the cover and encoded images. Note that their system is not designed with consideration of spatial attacks (e.g. rotation), whereas we presented a method which handles this type of attack as well.

In our work, we introduced a novel end-to-end watermarking solution based on neural networks. We significantly improved the robustness against some attacks, e.g. JPEG compression and rotation. The method stands out with the highest overall accuracy, which is equal to 0.86. It is also characterized by one of the highest transparencies of the encoded images. We added a new component called the adapter to the pipeline and used a novel architecture of the pipeline in which we were able to utilize the discriminator to distinguish between cover and encoded images. We proposed an evaluation method which fits the real-life environment of the watermarking system, based on three metrics, namely TIR, FIR_en and FIR_co. We evaluated a naïve watermarking system and our double decoder-discriminator architecture following our evaluation method. It confirmed the high efficiency of our double approach. Finally, we explored the problem of robustness versus transparency of the watermarking systems and proposed a flexible solution to handle the problem.
In the future work, wewould like to continue to improve the robustness of the watermarking system, including multi-attackscenarios and new types of attacks. Moreover, we would like to enhance the capacity of the watermarkas well as apply other quality measures in the training pipeline that could improve the transparencyof the encoded images. Broader Impact

This paper presents a complete framework for image watermarking that allows both embedding and identifying the message encoded in the image. The natural environment for implementing the solution is copyright management. Due to its high accuracy and low impact on image quality, the solution may be used to protect intellectual property in the form of professional or amateur pictures, computer-generated graphics or movies. Encoding a watermark may make it possible to identify the author of the intellectual property as well as help to identify the person who leaked the medium and thereby enabled the breach of copyright. For the latter case, the watermark must be unique for each client. An additional branch that may benefit from the watermarking technique is the data annotation business, which is emerging in response to the needs of deep learning solutions. Both of the above cases present a scenario where content creators (or owners) are protected against copyright infringement and may prove their authorship or present proof in legal actions. Naturally, such actions may be taken only in the case of high certainty. Despite its high accuracy, the method is not foolproof; hence, it may give a false positive identification and has to be sufficiently supervised so that no false accusation is made.

Aside from the intended use, the popularity of watermarking may have some negative consequences. As the legal system adjusts to accepting watermarks as proof, we may witness 'copyright trolling' similar to recent patent rivalries between companies, as some individuals may try to watermark not-yet-encoded content with their own signature in order to claim ownership. This problem may be amplified by the false positive identification mentioned above.

Watermarking systems are a proposition for the multimedia industry to protect its businesses against piracy. The solution considers many aspects required of solutions working in a real-life environment. The proposed solution is flexible with respect to the size of the image. It is a 'lightweight solution', as its time performance is relatively high, i.e. a 30 FPS movie could be encoded in real time using a graphics processing unit (GPU). The solution is also robust against video compression algorithms (the most informative units of a compressed movie, I-frames, are encoded using the JPEG algorithm).

References

[1] IBC's website. Using forensic watermarking to protect UHD content. Last accessed on 2020-04-20.

[2] Niels Thorwirth, Erik Hietbrink, Jaap Haitsma, Gwenaël Doërr, Glenn Deen, Mike Wilkinson, Robin Wilson, Brian Stevenson, and Christopher White. Forensic Watermarking Implementation Considerations for Streaming Media. Streaming Video Alliance, July 2018.

[3] E. Najafi. A robust embedding and blind extraction of image watermarking based on discrete wavelet transform. In Mathematical Sciences, volume 11, pages 307–318, Dec 2017. doi: 10.1007/s40096-017-0233-1.

[4] Chandan Kumar, Anuj Kumar Singh, and Priyadarshni Kumar. Improved wavelet-based image watermarking through SPIHT. Multimedia Tools and Applications, pages 1–14, 2018.

[5] E. Najafi and K. Loukhaoukha. Hybrid secure and robust image watermarking scheme based on SVD and sharp frequency localized contourlet transform. Journal of Information Security and Applications, 44:144–156, 2019. ISSN 2214-2126. doi: 10.1016/j.jisa.2018.12.002.

[6] J. Liu, J. Huang, Y. Luo, L. Cao, S. Yang, D. Wei, and R. Zhou. An optimized image watermarking method based on HD and SVD in DWT domain. IEEE Access, 7:80849–80860, 2019.

[7] Ching-Sheng Hsu and Shu-Fen Tu. Enhancing the robustness of image watermarking against cropping attacks with dual watermarks. Multimedia Tools and Applications, 79(17):11297–11323, May 2020. ISSN 1573-7721. doi: 10.1007/s11042-019-08367-6. URL https://doi.org/10.1007/s11042-019-08367-6

[8] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. HiDDeN: Hiding Data with Deep Networks. In The European Conference on Computer Vision (ECCV), Sep 2018.

[9] Bingyang Wen and Sergul Aydore. ROMark: A Robust Watermarking System Using Adversarial Training. arXiv e-prints, art. arXiv:1910.01221, October 2019.

[10] Y. Huang, B. Niu, H. Guan, and S. Zhang. Enhancing image watermarking with adaptive embedding parameter and PSNR guarantee. IEEE Transactions on Multimedia, 21(10):2447–2460, 2019.

[11] Xiyang Luo, Ruohan Zhan, Huiwen Chang, Feng Yang, and Peyman Milanfar. Distortion Agnostic Deep Watermarking. arXiv e-prints, art. arXiv:2001.04580, January 2020.

[12] Marcin Plata and Piotr Syga. Robust Spatial-spread Deep Neural Image Watermarking. arXiv e-prints, art. arXiv:, March 2020.

[13] Mahdi Ahmadi, Alireza Norouzi, Nader Karimi, Shadrokh Samavi, and Ali Emami. Redmark: Framework for residual diffusion watermarking based on deep networks. Expert Systems with Applications, 146:113157, 2020. ISSN 0957-4174. doi: 10.1016/j.eswa.2019.113157.

[14] Ippei Hamamoto and Masaki Kawamura. Neural watermarking method including an attack simulator against rotation and compression attacks. IEICE Transactions on Information and Systems, E103.D:33–41, January 2020. doi: 10.1587/transinf.2019MUP0007.

[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

[16] Jamie Hayes and George Danezis. Generating steganographic images via adversarial training. In Advances in Neural Information Processing Systems 30, pages 1954–1963, 2017. URL http://papers.nips.cc/paper/6791-generating-steganographic-images-via-adversarial-training.pdf

[17] Friend MTS. Comparing subscriber watermarking technologies for premium pay TV content. Sep 2018.

[18] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10602-1.

[19] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, Dec 2014.