CIS-Net: A Novel CNN Model for Spatial Image Steganalysis via Cover Image Suppression
Songtao Wu, Sheng-hua Zhong*, Member, IEEE, Yan Liu, and Mengyuan Liu
Abstract—Image steganalysis is a special binary classification problem that aims to distinguish natural cover images from suspected stego images, which result from embedding very weak secret message signals into covers. How to effectively suppress cover image content, and thus make the classification of cover images and stego images easier, is the key to this task. Recent research shows that Convolutional Neural Networks (CNN) are very effective at detecting steganography by learning discriminative features between cover images and their stegos. Several deep CNN models have been proposed that incorporate domain knowledge of image steganography/steganalysis into the design of the network and achieve state-of-the-art performance on standard databases. Following this direction, we propose a novel model called the Cover Image Suppression Network (CIS-Net), which improves the performance of spatial image steganalysis by suppressing cover image content as much as possible during model learning. Two novel layers, the Single-value Truncation Layer (STL) and the Sub-linear Pooling Layer (SPL), are proposed in this work. Specifically, STL truncates input values to a single threshold when they fall outside a predefined interval. Theoretically, we prove that STL reduces the variance of the input feature map without deteriorating useful information. SPL utilizes a sub-linear power function to suppress large-valued elements introduced by cover image content and aggregates the weak embedded signal via average pooling. Extensive experiments demonstrate that the proposed network equipped with STL and SPL achieves better performance than rich model classifiers and existing CNN models on challenging steganographic algorithms.
Index Terms—Steganalysis, steganography, convolutional neural network, cover image content suppression.
I. INTRODUCTION
With the development of social media, a huge number of digital images are uploaded to the internet every day. This proliferation of digital images also gives ill-intentioned users easily accessible media for criminal purposes. Image steganography, the science and art of hiding secret messages in images by slightly modifying their pixels/coefficients, is one of the key methods for covert communication with digital images [1-8]. As the counterpart of image steganography, image steganalysis is the technique of revealing the presence of secret messages in a digital image [9-14]. Because of its importance to information security, image steganalysis has developed greatly in recent years [15-16].
Songtao Wu and Sheng-hua Zhong are with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. Email: {csstwu, csshzhong}@szu.edu.cn. Sheng-hua Zhong is the corresponding author of this paper. Yan Liu is with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China. Email: [email protected]. Mengyuan Liu is now with Tencent Research, Shenzhen, China. Email: [email protected].
Steganalysis of natural images in the spatial domain proves to be a difficult task. Modern steganographic algorithms [2-8] embed secret messages into cover images by modifying each pixel of a cover image with a very small amplitude (±1). Furthermore, the STC embedding scheme [17] enables steganographers to change pixels located in complex, cluttered or noisy regions, which are difficult to model accurately with statistical methods. Previous research [11,18] indicated that it is difficult to learn discriminative features to classify cover images and their stegos when they are fed into a binary classifier directly. Consequently, designing effective features and learning methods that both preserve the embedded message and suppress the cover image content is essential for image steganalysis. In [11], the authors proposed to model the differences between adjacent pixels rather than the original values for feature extraction. This operation suppresses the cover image content by removing its low frequency components and thus obtains a great detection improvement on LSB matching steganography. Extending this approach, Fridrich in [12-13] proposed the Spatial Rich Model (SRM) method, which uses thirty linear and nonlinear high-pass filters to extract noise residuals from input covers and stegos. With the paired training method, i.e. cover images and their stegos are input to the classifier simultaneously, SRM learns a subset of features most sensitive to message embedding for steganalysis. The same idea is also used in the Projection-SRM (PSRM) [14], which utilizes many different random vectors to project cover images and their stegos into low dimensions, in order to highlight the hidden message and suppress the cover image content as much as possible. Although much effort has been put into designing features for image steganalysis, it is still hard to detect steganography accurately and move these methods into real applications [19].

The development of deep convolutional neural networks has opened a new gate for image steganalysis. Recent progress of CNNs on image related tasks [20-24] has demonstrated their powerful ability to describe the distribution of natural images. This ability can be used to model statistical differences between natural cover images and unnatural stego images. Additionally, the deep CNN architecture with convolution, pooling and nonlinear mapping gives steganalyzers a larger space to extract more effective features than hand-crafted ones [25]. For these reasons, many CNN models have been proposed for image steganalysis in recent years. Tan in [26] first proposed a stacked auto-encoder network to detect steganography. Results in that paper showed that a CNN model suitable for image recognition may not be directly applicable to image steganalysis. In [27], Qian et al. proposed a new neural network equipped with a fixed KV high-pass filtering layer and a Gaussian activation function to detect steganography. This is the first deep learning based steganalyzer that uses domain knowledge of steganalysis, i.e. layers specially designed to suppress cover image content. Along this direction, Xu [28] proposed a new CNN model which contains an absolute value layer, batch normalization layers, TanH layers and 1 × 1 convolutional layers. The purpose of these designs is to make the network specialized to the image steganalysis task and prevent overfitting. Following the Xu-network, Li in [29] extended the model with diverse activation modules and made the network achieve much better performance.
Wu et al. in [30-31] proposed to use residual connections in a steganalytic network and obtained low detection error rates when cover images and their stegos are trained and tested in pairs. Different from previous models that only use one fixed kernel to suppress cover content, Ye et al. in [32] utilized all thirty SRM high-pass filters in the first layer of their network. Additionally, the authors proposed a linear truncation layer and a module to incorporate the selection channel information into the design, obtaining a significant performance improvement over the classic SRM steganalyzer. Recently, Wang in [33] and Boroumand in [34] proposed two clean end-to-end architectures to detect steganography with residual learning. Without any pre-calculated convolutional layers, both networks can automatically learn high-pass filters to suppress cover image content.

In parallel with steganalysis in the spatial domain, deep learning based steganalyzers for JPEG images have also been developed in recent years. Based on the Xu-network, Chen in [35] and Zeng in [36] replaced the KV kernel with JPEG-phase-aware filters, such as DCT basis patterns and 2D Gabor filters, to suppress JPEG image content. In [37], Xu proposed a deep residual network with fixed DCT preprocessing filters for JPEG image steganalysis. Using the same end-to-end model as in the spatial domain case, Boroumand [34] trained the network with JPEG images and achieved state-of-the-art performance on the BOSS database. In summary, whether the model uses predefined or automatically learned kernels, how to effectively suppress cover image content without destroying the embedded message is central for spatial and compressed domain steganalysis with deep neural networks.

Although much effort has been put into incorporating the domain knowledge of steganalysis into the design of CNN models, how to effectively suppress cover image content is not fully explored. Along this research direction, we propose two novel layers, the Single-value Truncation Layer (STL) and the Sublinear Pooling Layer (SPL), for cover image content reduction. STL truncates input data into a predefined interval using a single threshold, which differs from a general linear truncation layer that rounds off large/small values to two different (positive/negative) thresholds. Intuitively, STL reduces the variance introduced by out-of-interval elements that would otherwise be truncated to two different thresholds. This assumption is supported by mathematical analysis based on the distribution of natural image pixels. SPL uses a sublinear power function, i.e. a power function whose exponent is smaller than 1, for cover image content suppression. To avoid destroying the embedded message, SPL aggregates the feature map with average pooling before sublinear suppression. By unifying STL and SPL in a single model, a novel neural network called the Cover Image Suppression Network (CIS-Net) is proposed in this paper. Experiments on several challenging steganographic algorithms demonstrate the superiority of CIS-Net over classic SRMs and existing CNN based steganalyzers. Based on the proposed network, we also explore the possibility that a well-learned CNN model can roughly estimate the embedding probability map of a given steganographic algorithm.

The rest of the paper is organized as follows. In Section II, we introduce the proposed network for image steganalysis in detail. The proposed STL and SPL are described and analyzed in this section.
In Section III, we conduct several experiments on a standard database to demonstrate the effectiveness of the proposed network over existing hand-crafted methods and deep learning based methods. In the same section, we use the Class Activation Mapping (CAM) [38] technique to draw the attentional maps learned by CIS-Net for different steganographic algorithms and compare them with ground truth message embedding probability maps. The paper closes with the conclusion in Section IV.

II. PROPOSED NETWORK
In this section, we introduce the proposed CIS-Net model for image steganalysis. First, the overall architecture of CIS-Net is described in detail. Then, we introduce the proposed single-value truncation and sublinear pooling, and explain the rationale behind them for image steganalysis based on theoretical analysis and experiments.
A. Overall Architecture
As illustrated in Fig. 1, the proposed network contains a preprocessing block, a feature fusion block, two Type-1 blocks and two Type-2 blocks. These building blocks are described in detail as follows:
Preprocessing block: This block contains several high-pass filters (HPFs) and an STL to preprocess input images. Note that image steganalysis classifies cover images against stego images, which are the results of adding a very weak high frequency message signal to cover images; preprocessing input images to make the classification easier is therefore necessary. Specifically, we follow the Ye-network's [32] design of using several SRM high-pass filters to remove low frequency components and a truncation layer to further filter out large elements of the cover image. However, there are two main differences between the proposed network and the Ye-network. First, we refine the SRM filters and only select twenty of them for high frequency component extraction. Among all thirty SRM filters, we find that the 4-th order HPFs are not beneficial to the effectiveness of our model, so they are discarded in the design. The selected high-pass filters are shown in Fig. 2. Second, rather than using the traditional truncation method, we propose a new single-valued truncation layer to filter out large elements of cover images. The main advantage of single-valued truncation is that it reduces the dynamic range of cover image content compared to the traditional truncation method, without destroying the preserved information. Details about the method can be found in Part B of this section.

Fig. 1: The proposed CIS-Net model for image steganalysis. The whole architecture consists of a preprocessing block, a feature fusion block, two Type-1 blocks and two Type-2 blocks. The preprocessing block suppresses cover image content by extracting high frequency components and using a single-valued truncation layer. The feature fusion block combines different preprocessed information for the following processing. The Type-1 and Type-2 blocks learn discriminative features for image steganalysis through the suppression of cover image content and the aggregation of the embedded message signal.
Feature fusion block: This block bridges the image preprocessing block and the following feature learning blocks. Several convolutional layers in the block fuse the different high frequency components extracted by the preprocessing block and augment the features into higher dimensions. Instead of the popular ReLU activation layer, we use a Parametric ReLU (PReLU) after the convolutional layer, since it allows information in the negative region to pass through the layer and thus avoids the information loss caused by a ReLU layer.

Type-1 block: Each block uses a unit containing a convolutional layer, a ReLU activation layer and an average pooling layer to extract discriminative features for image steganalysis. The design is motivated by VGG-net [39], which recursively uses 3 × 3 convolutional kernels and pooling layers in the network. There are no batch normalization layers in the block, since they may make training unstable when the mean and variance are not accurately estimated [25].

Type-2 block: Each block consists of a convolutional layer, a ReLU activation layer and a sublinear pooling layer. With the help of the proposed sublinear pooling layer (details are introduced in Part C of this section), the Type-2 block learns discriminative features from input feature maps by simultaneously aggregating the embedded message signal and suppressing cover image content. In order to summarize message information across the whole stego image, we force the second Type-2 block to use a large-kernel sublinear pooling. In addition, dilated convolution [40] is utilized in this block in order to extract long range correlations from the input features.

Fig. 2: Twenty high-pass kernels selected from the SRM filters as the HPF layer in the proposed CIS-Net. (a) Second order and third order high-pass filters. (b) KB, KV filters and their variations.

To summarize, the proposed CIS-Net uses a series of methods to suppress cover image content and preserve the embedded message in the network design. The numbers of convolutional kernels in the network are all optimized for the task.
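To make the two block types concrete, a minimal PyTorch sketch is given below. The kernel size, dilation rate and pooling configuration are illustrative assumptions rather than the paper's exact settings, and `spl` stands for the sublinear pooling module sketched later in Section II-C.

```python
import torch.nn as nn

# Kernel sizes, dilation and channel counts below are illustrative
# assumptions; the paper tunes the kernel numbers per task.
def type1_block(c_in, c_out):
    # convolution -> ReLU -> average pooling, with no batch normalization
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.AvgPool2d(kernel_size=3, stride=2, padding=1),
    )

def type2_block(c_in, c_out, spl):
    # (dilated) convolution -> ReLU -> sublinear pooling; the paper states
    # dilated convolution for the second Type-2 block in particular
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=2, dilation=2),
        nn.ReLU(inplace=True),
        spl,  # an instance of the SPL module sketched in Sec. II-C below
    )
```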
B. Single-valued Truncation

Truncating data into a predefined interval proves to be useful both for traditional hand-crafted feature based steganalysis [11-13] and for deep learning based steganalysis [25,32]. Compared with cover image content, the signal introduced by the embedded message is usually of low amplitude. Truncation can therefore filter out cover image content, whose elements usually have large amplitudes, without greatly destroying the secret message. In addition,
truncation can reduce the dynamic range of the input feature map, making the modeling of the data's distribution easier [12]. In this subsection, we propose an effective data truncation method for image steganalysis. The method preserves the signal of the embedded message while reducing the variance of the truncated elements.
1) Motivation: In [12,32], truncation is defined as the following equation:
$$\mathrm{Trunc}(x) = \begin{cases} -T, & x < -T \\ x, & -T \le x \le T \\ T, & x > T \end{cases} \quad (1)$$

where $T$ is a predefined positive threshold. The $\mathrm{Trunc}(\cdot)$ function preserves elements in the interval $[-T, T]$ while mapping all other elements to two different values (we call it bi-valued truncation in this paper): $T$ for those larger than the positive threshold and $-T$ for those smaller than the negative threshold. Generally, the elements of high-pass filtered images are symmetrically distributed around zero [41], so the mean of the feature map is zero. Based on this conclusion and Eq. (1), we can write the variance of the feature map after bi-valued truncation in three parts:

$$\sigma_b^2 = \int_{-\infty}^{-T} (-T)^2 p(x)\,dx + \int_{-T}^{T} x^2 p(x)\,dx + \int_{T}^{+\infty} T^2 p(x)\,dx \quad (2)$$

where $\sigma_b^2$ is the element variance after bi-valued truncation and $p(x)$ denotes the probability distribution of an element $x$ after high-pass filtering. In Eq. (2), the first and third terms are introduced by the two truncation thresholds, while the second term is introduced by the preserved elements in $[-T, T]$. However, the two truncated values $T$ and $-T$ do not provide any useful information for the classification of cover images and stego images; they merely increase the variance of the feature map by adding the first and third terms of Eq. (2). To reduce the influence of these artificially introduced terms, we propose a novel truncation method called single-valued truncation, defined as:

$$\mathrm{STL}(x) = \begin{cases} T, & |x| > T \\ x, & -T \le x \le T \end{cases} \quad (3)$$

The main difference between single-valued truncation and bi-valued truncation is that all elements outside the predefined interval $[-T, T]$ are mapped to the same threshold $T$. The variance of a feature map element after single-valued truncation can be written as:

$$\sigma_s^2 = \int_{-\infty}^{-T} (T - \mu_s)^2 p(x)\,dx + \int_{-T}^{T} (x - \mu_s)^2 p(x)\,dx + \int_{T}^{+\infty} (T - \mu_s)^2 p(x)\,dx \quad (4)$$

where $\sigma_s^2$ and $\mu_s$ represent the element variance and mean after single-valued truncation, respectively. For a symmetrically distributed function $p(x)$, it is easy to validate that $\mu_s$ is a positive value smaller than $T$. Intuitively, we can conclude that the first and third terms in Eq. (4) are decreased compared to the corresponding terms in Eq. (2). In the following part, we give a strict proof that $\sigma_s^2$ is always smaller than $\sigma_b^2$ for natural images processed by high-pass filters.

Fig. 3: Standard deviation and training loss curves. (a) Standard deviations across high-pass filter types, calculated on 100 randomly selected cover images processed by the two truncation methods. (b) Training loss curves, obtained with the proposed architecture where only the truncation layer differs, detecting S-UNIWARD steganography at 0.4 bpp. Both bi-valued truncation and single-valued truncation use the same threshold, i.e. T = 5.
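As a concrete illustration, a minimal PyTorch sketch of the single-valued truncation of Eq. (3) is given below, with bi-valued truncation (Eq. (1)) included for comparison; the threshold T = 5 follows the setting used in Fig. 3.

```python
import torch

def bi_valued_truncation(x: torch.Tensor, T: float = 5.0) -> torch.Tensor:
    # Eq. (1): clip to [-T, T], mapping out-of-interval values to -T or +T
    return torch.clamp(x, -T, T)

def single_valued_truncation(x: torch.Tensor, T: float = 5.0) -> torch.Tensor:
    # Eq. (3): values inside [-T, T] pass through unchanged; every
    # out-of-interval value is mapped to the single threshold +T
    return torch.where(x.abs() > T, torch.full_like(x, T), x)
```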
2) Theoretical Analysis: In the formulation, we follow the conclusion of previous research that each pixel of a natural image processed by a zero-mean high-pass filter follows the "generalized Laplace distribution" [41-42]:

$$p(x) = \frac{1}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}} \quad (5)$$

where $\alpha$ and $s$ are the two parameters of the distribution and $Z$ is the normalization constant that makes the integral of $p(x)$ equal to 1. For bi-valued truncation, the variance $\sigma_b^2$ of Eq. (2) can be written as the following formula based on the above distribution:

$$\sigma_b^2 = 2\int_{T}^{\infty} \frac{T^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + \int_{-T}^{T} \frac{x^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx \quad (6)$$

Based on the symmetry of Eq. (5), the mean $\mu_s$ of a feature map element after single-valued truncation can be obtained:

$$\mu_s = \int_{-\infty}^{-T} \frac{T}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + \int_{-T}^{T} \frac{x}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + \int_{T}^{\infty} \frac{T}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx = 2\int_{T}^{\infty} \frac{T}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx \quad (7)$$

and the variance of Eq. (4) is calculated as:

$$\sigma_s^2 = 2\int_{T}^{\infty} \frac{(T - \mu_s)^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + \int_{-T}^{T} \frac{(x - \mu_s)^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx \quad (8)$$

After some mathematical manipulation, illustrated in the Appendix, the variance $\sigma_s^2$ can be rewritten as:

$$\sigma_s^2 = \int_{-T}^{T} \frac{x^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + 2\int_{T}^{\infty} \frac{T^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx - \mu_s^2 \quad (9)$$

Based on Eq. (6) and Eq. (9), the difference between $\sigma_s^2$ and $\sigma_b^2$ is:

$$\sigma_s^2 - \sigma_b^2 = -\mu_s^2 < 0 \quad (10)$$

Eq. (10) indicates that, for any positive threshold ($T > 0$), the variance of a feature map element after single-valued truncation is always smaller than the variance after bi-valued truncation. This demonstrates that the proposed single-valued truncation reduces the variance of traditional bi-valued truncation without deteriorating the preserved elements in the interval $[-T, T]$.

Beyond the theoretical analysis, bi-valued truncation and single-valued truncation are also compared on real images and steganographic algorithms. We randomly select 100 cover images from BOSSbase 1.01 and process them with the two truncation methods. In addition, the proposed model with bi-valued truncation and with single-valued truncation is trained to detect Spatial UNIversal WAvelet Relative Distortion (S-UNIWARD) steganography [6] at payload 0.4 bpp. Fig. 3(a) shows that single-valued truncation decreases the standard deviation across all high-pass filters, while Fig. 3(b) demonstrates that the model with single-valued truncation converges much faster than the model with bi-valued truncation.
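The inequality of Eq. (10) can also be checked numerically. The short script below integrates the generalized Laplace density of Eq. (5) for assumed parameter values (the choices of alpha, s and T are arbitrary, for illustration only) and confirms that the variance gap equals exactly μ_s²:

```python
import numpy as np
from scipy import integrate

# Assumed parameters of the generalized Laplace model of Eq. (5)
alpha, s, T = 0.8, 2.0, 5.0

p_un = lambda x: np.exp(-np.abs(x / s) ** alpha)      # unnormalized density
Z = 2 * integrate.quad(p_un, 0, np.inf)[0]            # normalization constant
p = lambda x: p_un(x) / Z

tail = integrate.quad(p, T, np.inf)[0]                # P(x > T)
mu_s = 2 * T * tail                                   # Eq. (7)

mid2 = 2 * integrate.quad(lambda x: x**2 * p(x), 0, T)[0]
var_b = mid2 + 2 * T**2 * tail                        # Eq. (6), bi-valued

body = integrate.quad(lambda x: (x - mu_s)**2 * p(x), -T, T)[0]
var_s = body + 2 * (T - mu_s)**2 * tail               # Eq. (8), single-valued

print(var_b - var_s, mu_s**2)                         # equal, per Eq. (10)
```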
C. Sublinear Pooling

For deep learning based image steganalysis, it is important to aggregate the weak signal of the embedded message. Previous research [27-32] shows that average pooling is better than max pooling for image steganalysis, since it can effectively merge the embedded message signal within a local region and strengthen it across the whole stego image. However, average pooling only focuses on the aggregation of the embedded message and does not take the suppression of cover image content into account. To overcome this limitation, we propose a novel pooling method called sublinear pooling, depicted in Fig. 4. The proposed sublinear pooling unifies embedded message aggregation and cover content suppression in one block. Mathematically, sublinear pooling is defined as follows:
$$\mathrm{SPL}(x; \gamma_1, \gamma_2) = f(\mathrm{avg\_pool}(f(x; \gamma_1)); \gamma_2) \quad (11)$$

where $\mathrm{avg\_pool}$ denotes average pooling, and both $f(\cdot\,;\gamma_1)$ and $f(\cdot\,;\gamma_2)$ are the element-wise sublinear power function $f$ parameterized by a value $\gamma$:

$$f(x; \gamma) = |x|^{\gamma} \circ \mathrm{sgn}(x), \quad 0 < \gamma \le 1 \quad (12)$$

where $|\cdot|$, $\mathrm{sgn}(\cdot)$ and $\circ$ represent the element-wise absolute value, sign function and multiplication. The element-wise function used in sublinear pooling is motivated by the "generalized $\alpha$-pooling" proposed in [43]. This design is easy to implement in code and can be optimized by the back-propagation algorithm. Additionally, the power function satisfies the following inequality:

$$|x|^{\gamma} \le |x| \quad \text{for } |x| \ge 1, \; 0 < \gamma \le 1 \quad (13)$$

This property can be used for cover image suppression when the parameter is set accordingly.

Fig. 4: Schematic illustration of the proposed sublinear pooling layer. The layer adds an element-wise power function before and after an average pooling layer. To make the power functions sublinear, $\gamma_1$ and $\gamma_2$ in the layer should be positive and smaller than 1.

The main difference between the proposed pooling and average pooling is that our method adds sublinear power functions before (pre-sublinear) and after (post-sublinear) an ordinary average pooling layer. This change brings two advantages for image steganalysis:

• Sublinear pooling suppresses cover image content adaptively. In the proposed pooling method, the sublinear power functions decrease the values of elements with large amplitudes. The larger an element's amplitude, the more the sublinear function reduces it. Since large-valued elements are mainly generated by the cover image, sublinear pooling decreases their amplitudes and thus suppresses cover image content;

• Sublinear pooling aggregates the embedded message signal effectively. Between the two sublinear power functions, the average pooling merges message signals from input feature maps whose cover image content has already been suppressed by the pre-sublinear function. The post-sublinear function then further reduces the cover image content in feature maps whose embedded message signals have already been augmented by the average pooling. This "suppression-aggregation-suppression" scheme is more effective for image steganalysis than a single average pooling.

To validate the effectiveness of the proposed pooling method, we compare the training and testing detection error rates of the proposed architecture on S-UNIWARD steganography at 0.4 bpp in two different cases. In the first case, both the Type-1 and Type-2 blocks use average pooling throughout the architecture. In the second case, the Type-2 blocks use sublinear pooling, as shown in Fig. 1. Results in TABLE I demonstrate that sublinear pooling not only decreases the detection error rate but also decreases the performance gap between the training set and the testing set, indicating that our new pooling method improves the model's generalization ability in detecting steganographic algorithms.

TABLE I: Detection error rates of the proposed model with average pooling and sublinear pooling for training, testing and their difference on S-UNIWARD steganography at 0.4 bpp. The SPL parameters γ1 and γ2 are both set to 0.9 according to the experimental results.

Pooling method     Training   Testing   Difference
Average pooling    6.72%      15.11%    8.39%
Proposed SPL       10.76%     14.82%    4.06%
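A minimal PyTorch sketch of the SPL of Eqs. (11)-(12) follows. The kernel size and stride of the internal average pooling are illustrative assumptions; γ1 = γ2 = 0.9 matches the setting of TABLE I, and a small eps keeps the gradient of the power function finite at zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SublinearPooling(nn.Module):
    """Sketch of SPL, Eqs. (11)-(12): element-wise sublinear power functions
    wrapped around an ordinary average pooling, i.e. the
    "suppression-aggregation-suppression" scheme described above."""

    def __init__(self, gamma1=0.9, gamma2=0.9, kernel_size=3, stride=2):
        super().__init__()
        self.gamma1, self.gamma2 = gamma1, gamma2          # fixed, per ablation
        self.kernel_size, self.stride = kernel_size, stride

    @staticmethod
    def _f(x, gamma, eps=1e-8):
        # f(x; gamma) = |x|^gamma * sgn(x); eps keeps the gradient finite at 0
        return (x.abs() + eps).pow(gamma) * torch.sign(x)

    def forward(self, x):
        x = self._f(x, self.gamma1)                          # pre-sublinear
        x = F.avg_pool2d(x, self.kernel_size, self.stride)   # aggregation
        return self._f(x, self.gamma2)                       # post-sublinear
```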
III. EXPERIMENTS
In this section, we conduct extensive experiments to demonstrate the effectiveness of the proposed CIS-Net for image steganalysis. First, we introduce the implementation details of the proposed model, including parameter settings, the optimization method and the model training strategy. Second, we validate the proposed model on challenging steganographic algorithms and compare it with state-of-the-art steganalytic methods. Third, we use Class Activation Mapping (CAM) [38] to demonstrate that the proposed CIS-Net can extract the selection channel information effectively even though no such information is provided during model learning. Finally, we conduct experiments to validate the effectiveness of the proposed network when the database and steganographic algorithms used in training mismatch those used in testing.

Fig. 5: Ablation study of the proposed CIS-Net. (a) Detection error rates of CIS-Net with STL and BTL at different truncation thresholds. (b) Detection error rates of CIS-Net at different configurations of (γ1, γ2).

A. Steganographic Algorithms and Database
We use the BOSSbase 1.01 database [44], which contains 10,000 uncompressed natural images of size 512 × 512, in all the following experiments. For performance evaluation, the detection error rate $P_E$ [11-13] is utilized to measure the detection ability of steganalytic algorithms:

$$P_E = \min_{P_{FA}} \frac{1}{2}\left(P_{MD} + P_{FA}\right) \quad (14)$$

where $P_{MD}$ and $P_{FA}$ represent the missed detection probability and the false alarm probability, respectively. Since image steganalysis is a detection problem, we also evaluate the performance of different steganalytic methods with ROC curves at selected payloads.

Our experiments are conducted on state-of-the-art steganographic schemes. Three representative adaptive steganographic algorithms, namely the Wavelet Obtained Weights steganography (WOW) [5], S-UNIWARD [6], and the HIghpass Low-pass Low-pass steganography (HILL) [7], are adopted for performance evaluation. For all steganographic algorithms, we use the MATLAB version rather than the C++ implementation, to avoid the problem noted in [45] that all images are embedded with the same key.
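For reference, Eq. (14) can be evaluated from a steganalyzer's soft outputs as in the short sketch below; `scores_cover` and `scores_stego` are assumed to be "stego-ness" scores produced by any detector, which is an assumption about the interface rather than part of the paper.

```python
import numpy as np

def detection_error(scores_cover, scores_stego):
    """P_E of Eq. (14): the minimum over decision thresholds of the average
    of the missed-detection and false-alarm probabilities."""
    thresholds = np.unique(np.concatenate([scores_cover, scores_stego]))
    best = 1.0
    for t in thresholds:
        p_fa = np.mean(scores_cover >= t)   # covers flagged as stego
        p_md = np.mean(scores_stego < t)    # stegos missed
        best = min(best, 0.5 * (p_md + p_fa))
    return best
```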
B. Implementations
We implement our model on the PyTorch platform, and the source code can be found at this link. In the implementation, all weights in the model, except those of the last fully connected layer, are initialized by He's "improved Xavier" method [46]:

$$W_{ij} \sim N\left(0, \frac{2}{o_n}\right) \quad (15)$$

where $N(\cdot, \cdot)$ denotes the Gaussian distribution and $o_n$ represents the number of output channels of the convolutional layer. For the fully connected layer, we initialize the weights with zero-mean Gaussian random variables, but the variance is set to 0.01. This setting avoids the large variance the fully connected layer would have if the "improved Xavier" method were used for initialization (the variance would be 1.0), which may make model training unstable and hard to converge. The Adam optimizer [47] is used to update the model's parameters in the learning phase. The mini-batch size is set to 16, containing 8 cover images and their 8 corresponding stego images.

In our model, all convolutional layers, including the fully connected layer, contain biases. Unlike general methods that simply initialize biases with random values, we calculate the bias of each convolutional layer from input cover-stego pairs in the initialization stage. For the $n$-th convolutional layer, its bias $b_n$ is set by the following formula:

$$b_n = -\frac{1}{2|S|} \sum_{i \in S} E\left[\mathrm{vec}\left(c_n^{x_i}\right) + \mathrm{vec}\left(c_n^{y_i}\right)\right] \quad (16)$$

where $c_n^{x_i}$ and $c_n^{y_i}$ denote the feature maps of the $n$-th convolutional layer for the $i$-th cover image $x_i$ and its stego $y_i$, $\mathrm{vec}(\cdot)$ represents the vectorization operator, and $E(\cdot)$ is the expectation. $S$ is the set of cover/stego images used to initialize the biases of the convolutional/fully-connected layers. This is in fact a mean-only version of the shared normalization proposed in [25]. Its advantage is that it cancels the non-zero mean introduced by the STL and makes all feature map elements distributed around zero, leading to fast convergence. In our experiments, the size of $S$ is set to 100, containing 50 randomly selected cover images and their corresponding stegos from the training set.
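The sketch below illustrates both initializations in PyTorch. It follows Eqs. (15)-(16) as reconstructed here; the per-channel mean over the calibration feature maps stands in for the expectation over vectorized feature maps, and the maps are assumed to be precomputed with zero bias.

```python
import math
import torch
import torch.nn as nn

@torch.no_grad()
def init_cisnet_weights(module):
    # Conv layers: Eq. (15), W ~ N(0, 2/o_n), o_n = number of output channels
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, 0.0, math.sqrt(2.0 / module.out_channels))
    # FC layer: zero-mean Gaussian with variance 0.01, i.e. std 0.1
    elif isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, 0.0, 0.1)

@torch.no_grad()
def init_bias_mean_only(layer, feature_maps):
    """Mean-only shared normalization in the spirit of Eq. (16).
    feature_maps: (2|S|, C, H, W) outputs of `layer` (computed with zero
    bias) for the calibration covers and their stegos."""
    layer.bias.copy_(-feature_maps.mean(dim=(0, 2, 3)))
```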
TABLE II: Performance comparisons between the proposed network and SRM, maxSRMd2 on WOW, S-UNIWARD and HILL steganography at five different payloads (0.1-0.5 bpp). The BOSSbase 1.01 dataset is used for validation.

Steganography  Detection algorithm    0.1 bpp  0.2 bpp  0.3 bpp  0.4 bpp  0.5 bpp
WOW            SRM + ensemble         40.26%   32.10%   25.53%   20.60%   16.83%
               maxSRMd2 + ensemble    29.97%   23.39%   18.86%   15.43%   13.06%
               The proposed network
S-UNIWARD      SRM + ensemble         40.24%   31.99%   25.71%   20.37%   16.40%
               maxSRMd2 + ensemble    36.60%   28.86%   23.60%   19.08%   15.51%
               The proposed network
HILL           SRM + ensemble         43.64%   36.11%   29.96%   24.82%   20.55%
               maxSRMd2 + ensemble    37.71%   30.91%   25.73%   21.84%   18.14%
               The proposed network
Fig. 6: Detection error rates of SRM, maxSRMd2 and the proposed network for three different steganographic algorithms at five different payloads. (a) WOW steganography. (b) S-UNIWARD steganography. (c) HILL steganography.
Fig. 7: ROC curves of SRM, maxSRM and the proposed network for three different steganographic algorithms at payloads 0.4 bpp and 0.2 bpp. (a) WOW steganography. (b) S-UNIWARD steganography. (c) HILL steganography.

Instead of using the same learning rate for all layers in the network, we utilize layer-wise learning rates in the training phase. Specifically, the learning rates for the convolutional layer in the feature fusion block, the two Type-1 blocks, the two Type-2 blocks and the fully-connected layer are set to 0.01, 0.001, 0.0001 and 0.0001, respectively. As the network is optimized for image steganalysis, feature maps of cover images and their stegos are more discriminative in later layers than in initial layers. The layer-wise learning rate strategy therefore lets the different layers of the network be optimized at the same speed. To make the HPFs adaptive to the network, we also let them be updated in the learning stage and set their learning rate to a small value. During training, all learning rates of our model decay with an exponential factor of 0.985:

$$\alpha_i(t) = \alpha_i \cdot 0.985^t \quad (17)$$

where $t$ is the epoch number and $\alpha_i$ denotes the learning rate of the HPF layer or of the convolutional/fully-connected layers listed above.

It is usually hard for deep learning based methods to directly learn discriminative features between cover images and stego images when the payload is low [25,32]. Compared with steganography at high payloads, the modifications introduced by steganographic embedding at low payloads are very weak, and almost all of them are located in regions with highly varied intensities. In this case, deep models have difficulty discriminating cover images from their stegos, since the high frequency components of cover images may swamp the embedded message, making image steganalysis more challenging. To handle this difficulty, we use curriculum learning [48] to detect steganographic algorithms at low payloads. Specifically, the CIS-Net trained for steganalysis at a lower payload, e.g. 0.4 bpp, is refined from the network trained at a higher payload, e.g. 0.5 bpp. The advantage of curriculum learning is that the attentional features learned at higher payloads can guide/regularize the search for locations modified by steganographic algorithms at low payloads, which makes the task much easier. To avoid training samples being reused for testing at different payloads, we force the cover images and stegos used for network training/testing at a lower payload to be the same as those used for network training/testing at a higher payload.

TABLE III: Performance comparisons between the proposed network and several state-of-the-art CNN models on S-UNIWARD and HILL at five different payloads. The BOSSbase 1.01 dataset is used for validation.

Steganography  CNN model            0.1 bpp  0.2 bpp  0.3 bpp  0.4 bpp  0.5 bpp
S-UNIWARD      Xu-network           40.57%   33.33%   26.32%   19.88%   16.46%
               Ye-network           40.29%   33.51%   25.62%   22.64%   17.64%
               SN-network
               ReST-Net (SRM)
               The proposed network
HILL           Xu-network           41.07%   33.25%   26.86%   21.31%   18.18%
               Ye-network           43.55%   34.65%   27.98%   23.08%   21.14%
               SN-network           36.86%   29.63%   23.60%   19.87%   16.29%
               ReST-Net (SRM)       38.77%   30.87%   24.84%   19.75%   16.53%
               The proposed network
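As an illustration of the layer-wise schedule of Eq. (17), the following PyTorch sketch builds one Adam parameter group per block. The module names, channel sizes and the small HPF rate (1e-5) are placeholders, since the exact HPF learning rate is not recoverable from the text.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the CIS-Net blocks; the real modules are assumed to
# exist elsewhere, and the channel/kernel sizes here are placeholders.
model = nn.ModuleDict({
    "hpf":    nn.Conv2d(1, 20, 5, padding=2),    # twenty SRM high-pass filters
    "fusion": nn.Conv2d(20, 32, 1),
    "blocks": nn.Conv2d(32, 32, 3, padding=1),   # Type-1/Type-2 feature blocks
    "fc":     nn.Linear(32, 2),
})

optimizer = torch.optim.Adam([
    {"params": model["hpf"].parameters(),    "lr": 1e-5},  # assumed small rate
    {"params": model["fusion"].parameters(), "lr": 0.01},
    {"params": model["blocks"].parameters(), "lr": 0.001},
    {"params": model["fc"].parameters(),     "lr": 0.0001},
])
# Eq. (17): every group's rate decays as alpha_i * 0.985^t, t = epoch index
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.985)

for epoch in range(10):
    # ... a real epoch would backprop the loss over cover/stego pairs here ...
    optimizer.step()      # placeholder update; a no-op without gradients
    scheduler.step()
```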
C. Ablation Study
This section conducts an ablation study of the proposed network. We examine the behavior of CIS-Net under different configurations of STL and SPL. S-UNIWARD steganography at 0.4 bpp is used for performance validation in the following experiments.
Truncation threshold T in STL. In this part, we evaluate the performance of our CIS-Net at different truncation thresholds T. To demonstrate the effectiveness of the proposed STL, the network equipped with a Bi-valued Truncation Layer (BTL) is also compared in the experiment. Fig. 5(a) shows the detection error rates of the network with STL and BTL at seven different thresholds, ranging from T = 1 to T = +∞, where +∞ means that no truncation is applied in the model. The results indicate that the model with STL is systematically better than the model with BTL across the different values of T, which demonstrates the effectiveness of the proposed data truncation method for image steganalysis. For a small value of T, e.g. T = 1, the detection error rate of CIS-Net is very high, because excessive truncation of feature map elements destroys discriminative features between cover images and stego images. For large values of T, the network's performance degrades due to the negative influence of large elements in the cover image content. Setting an appropriate truncation threshold T is therefore important for CNN based image steganalysis.

Power factors (γ1, γ2) in SPL. This part studies how the powers of the sublinear functions in an SPL affect the performance of CIS-Net in detecting steganography. Both γ1 and γ2 in the proposed SPL are selected from a given set of candidate values in (0, 1]. When both γ1 and γ2 are equal to 1.0, the SPL becomes a normal average pooling. In the experiment, the two Type-2 blocks use the same setting of γ1 and γ2. Fig. 5(b) shows the detection error rates for different configurations of γ1 and γ2 in CIS-Net. From the figure, we can observe that the detection error rates are high when γ1 and γ2 are small. The reason is that low power factors not only reduce cover image content but also remove the embedded message, making the classification of cover images and stego images difficult. The best performance is obtained when the two power factors are close to 1.0. This result indicates that a slight sublinear suppression of the feature map after average pooling is beneficial for image steganalysis. In the following experiments, we use this setting to train the model for the different steganographic algorithms at five payloads.

D. Performance Comparisons with Prior Arts
In this section, we conduct experiments to demonstrate the effectiveness of the proposed CIS-Net for image steganalysis. Two kinds of methods, i.e. hand-crafted feature based methods and deep learning based methods, are compared in the experiment. For hand-crafted feature based methods, we compare the performance of the proposed CIS-Net with the classic SRM steganalysis [12] and its selection-channel-aware version, the maxSRMd2 steganalysis [13]. For deep learning based methods, four state-of-the-art CNN models, including Xu-Net [28], Ye-Net [32], the Shared Normalization Network (SN-Net) [25] and ReST-Net [29], are selected for performance comparison. For ReST-Net, we compare against the model equipped with SRM high-pass filters. Three steganographic algorithms, WOW, S-UNIWARD and HILL, at five different payloads [0.1, 0.2, 0.3, 0.4, 0.5] are evaluated. In the experiment, 5,000 cover images and their corresponding stego images are randomly selected to train the model, and the remaining 5,000 cover images and their stegos are used for testing. To make the results reliable, all reported detection error rates of the proposed model are the average over 5 runs.

Fig. 8: Comparisons of attentional maps and embedding probability maps on different images. (a) Five selected cover images from BOSSbase 1.01. (b) Ground truth embedding probability maps of the selected cover images for S-UNIWARD steganography at 0.4 bpp. (c) Attentional maps of the selected cover images extracted from CIS-Net based on CAM.

TABLE II and Fig. 6 show the detection error rates of SRM, maxSRMd2 and the proposed CIS-Net on the three steganographic algorithms at five payloads. Additionally, ROC curves of the three methods at payloads 0.4 bpp and 0.2 bpp are provided in Fig. 7. Compared with SRM steganalysis, our model obtains significant performance gains on the different steganographic algorithms. Furthermore, the proposed CIS-Net outperforms maxSRMd2 steganalysis even though no selection channel information is provided. An interesting phenomenon in TABLE II is that, relative to SRM steganalysis, CIS-Net's performance gain over maxSRMd2 steganalysis decreases as the payload decreases. There are two main reasons for this phenomenon. One is that secret messages are embedded in complex/cluttered regions of a cover image when the payload is low; after high-pass filtering, those regions of cover images are statistically similar to stego images. The other is that maxSRMd2 steganalysis extracts features exactly at the positions where secret messages are embedded, since it is provided with the embedding probability map, whereas CIS-Net only has a rough estimate of the embedding positions, provided by the network trained at high payloads.

Besides the traditional steganalytic methods, we also compare the proposed method with four deep CNN models on S-UNIWARD and HILL steganography at five payloads. For the Xu-network and ReST-Net, we report their performance according to Li's paper [29]. The Ye-network is originally a CNN model optimized for 256 × 256 input images; Li in [29] implemented a version of Ye-Net that can detect 512 × 512 spatial images, and we use that result for performance comparison. For the SN-Network, the performances reported in [25] are used. Recently, several research papers [34][51] used both BOSS [44] and BOWS [52] as the training set and obtained promising performance. However, these networks are only optimized for downsampled images (256 × 256) and use more training data in model learning. This setting differs greatly from our case; therefore, these networks are not compared in our experiments. Results in TABLE III show that the proposed CIS-Net outperforms the Xu-network, Ye-network, SN-Network and ReST-Net (SRM) in all configurations.

Li in [29] boosted the performance of ReST-Net by ensembling three networks (ReST-Net ensemble), in which each network is equipped with a different set of high-pass filters, i.e. SRM filters, Gabor filters and max-min nonlinear filters, for feature extraction. In our experiment, we also compare the proposed network with ReST-Net (ensemble) on S-UNIWARD and HILL. TABLE IV shows that our CIS-Net outperforms ReST-Net (ensemble) in most configurations, even though the latter is augmented by model ensembling.

Fig. 9: Comparisons of embedding probability maps and attentional maps for different steganographic embedding methods. (a) and (d) are the embedding probability map and attentional map of WOW steganography at 0.4 bpp; (b) and (e) are those of S-UNIWARD steganography at 0.4 bpp; (c) and (f) are those of HILL steganography at 0.4 bpp.

TABLE IV: Detection error rates of the proposed CIS-Net and ReST-Net (ensemble) on S-UNIWARD and HILL at five different payloads.

Recent research in deep learning shows that data augmentation is important for improving the performance of various CNN models [49-50]. In this experiment, we use data augmentation to decrease the detection error rates of CIS-Net on the steganographic algorithms. Following the setting in [25], we randomly split the 10,000 BOSSbase samples into 5,000 training images and 5,000 testing images. The training images are rotated by 90, 180 and 270 degrees in the counter-clockwise direction, which generates a new training set of 20,000 samples. Then, the three steganographic algorithms embed secret messages into the augmented training set and the test set. The proposed network is trained on this new training set of 20,000 covers/stegos and finally validated on the test set of 5,000 covers/stegos. To keep the experiment simple, we only report the performance of the proposed CIS-Net at payload 0.4 bpp. The detection error rates in TABLE V indicate that data augmentation improves the performance of the proposed CIS-Net on the different algorithms.

TABLE V: Detection error rates of the proposed model with and without augmentation on three steganographic algorithms at payload 0.4 bpp.

Method            WOW      S-UNIWARD  HILL
No augmentation   12.13%   14.62%     18.10%
Augmentation
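The rotation augmentation described above can be written compactly. A minimal sketch follows, assuming covers are stored as a NumPy array and, as in the text, that embedding happens after augmentation:

```python
import numpy as np

def augment_with_rotations(covers):
    """Rotate each training cover by 90, 180 and 270 degrees
    counter-clockwise, growing 5,000 training images into 20,000.
    `covers` is an array of shape (N, H, W)."""
    rotated = [covers] + [np.rot90(covers, k, axes=(1, 2)) for k in (1, 2, 3)]
    return np.concatenate(rotated, axis=0)
```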
E. Attentional Map Extraction for CIS-Net
In [38], Zhou et al. showed that an image classification CNN exposes the implicit attention of the model on an image.
This ability of CNN models can be used to localize the most discriminative regions contributing to image classification. For steganalytic CNN models, a similar idea was reported in [32]: the network can defeat the selection-channel-aware maxSRMd2 steganalytic algorithm, demonstrating that such networks implicitly learn the distribution of the selection channel for a specific embedding scheme. In this experiment, we aim to draw the attentional features learned by the proposed CIS-Net for given images. For image steganalysis, such an attentional feature is in fact an estimate of the embedding probability map of the steganographic algorithm. The motivation is to understand whether a well-trained CNN can indeed extract the embedding probability map implicitly even though no such information is provided. Additionally, we want to analyze the difference between the estimated embedding probability map and the true embedding probability map, which may reveal limitations of CNN models for image steganalysis.

Class Activation Mapping (CAM) [38] is an effective method for extracting the attentional maps learned by a CNN model. Specifically, it computes a weighted sum of the CNN's last feature maps as follows:

$$M_c(x, y) = \sum_k w_k^c f_k(x, y) \quad (18)$$

where $f_k(x, y)$ represents the feature map of unit $k$ in the last convolutional layer at spatial location $(x, y)$, and $w_k^c$ is the learned weight in the fully connected layer corresponding to class $c$ for unit $k$. CAM highlights the discriminative visual patterns for class $c$ represented by $f_k(x, y)$ using the weights $w_k^c$. For adaptive steganography, the discrimination between cover images and stego images mainly comes from the noisy/cluttered regions where secret messages are mostly embedded. Therefore, the attentional feature map extracted by CAM for image steganalysis is an estimate of the embedding probability map.

Following the CAM method, we compute a weighted sum of CIS-Net's feature maps at the global average pooling in the second SPL to obtain attentional maps. To make the size of the attentional maps comparable to the input images, we simply resize them to 512 × 512 with "imresize" in Matlab. In the experiment, we randomly select several cover images from BOSSbase 1.01 and also provide their ground truth embedding probability maps for S-UNIWARD steganography at 0.4 bpp. Fig. 8 shows five cover images, their ground truth embedding probability maps and the attentional maps calculated by CAM. From the figure, we can easily observe that the attentional maps extracted by CAM are visually similar to the ground truth embedding probability maps. This observation indicates that our proposed CIS-Net can implicitly estimate the positions of embedded messages even when no selection channel information is provided.

In addition, we compare the CAM attentional maps of the three steganographic algorithms in Fig. 9. Compared to S-UNIWARD and HILL steganography, the attentional map of WOW is almost zero in regions without message embedding. This demonstrates that CIS-Net optimized for WOW only extracts discriminative features at message embedding regions, so its detection error rate should be low. The attentional map of HILL, however, is still activated in regions without message embedding. These noisy activations are harmful for extracting discriminative features between cover images and their stegos, and thus make image steganalysis difficult; the detection error rate should therefore be high. This analysis of the extracted attentional maps of the three steganographic algorithms is consistent with the results reported in TABLE II, indicating that the quality of the CAM attentional map tracks the performance of the CNN model on the corresponding steganographic algorithm. The reason why the attentional maps of different steganographic algorithms show different visual qualities is that the embedding methods of WOW and S-UNIWARD crowd all secret messages into complex regions, while HILL uses a "spreading strategy" that distributes messages around complex regions. This strategy not only decreases the embedding intensity in a local region but also spreads secret messages into the high frequency components of cover images. In this case, a CNN model has difficulty classifying cover images and their stegos, since the embedded message signals and the high frequency cover image components are mixed together.
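A minimal sketch of the CAM computation of Eq. (18) in PyTorch follows; the feature-map and weight shapes are interface assumptions, and the bilinear upsampling stands in for Matlab's "imresize" used in the paper.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weight, cls, out_size=(512, 512)):
    """CAM of Eq. (18): weighted sum of the last feature maps, with weights
    taken from the final fully-connected layer for the target class.

    feature_maps: (K, H, W) activations before global average pooling
    fc_weight:    (num_classes, K) weight matrix of the final FC layer
    cls:          target class index (e.g. the 'stego' class)"""
    cam = torch.einsum("k,khw->hw", fc_weight[cls], feature_maps)
    # upsample to the input resolution, analogous to Matlab's imresize
    cam = F.interpolate(cam[None, None], size=out_size, mode="bilinear",
                        align_corners=False)
    return cam[0, 0]
```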
IV. CONCLUSION

In this paper, we propose a novel CNN model called CIS-Net to detect adaptive steganography in the spatial domain. Two new layers, i.e. the single-valued truncation layer and the sublinear pooling layer, are designed to suppress cover image content. The single-valued truncation layer uses a single truncation threshold to reduce the variance introduced by the truncated data, while the sublinear pooling layer adaptively suppresses large elements of the cover image content and aggregates the weak embedded message signal with average pooling. Compared with previous data truncation and feature pooling methods, the proposed two layers accelerate learning and improve the generalization ability of the CNN model. Additionally, we use the class activation mapping method to demonstrate that the proposed CIS-Net can learn the embedding probability map of steganographic algorithms when no selection channel information is provided. The result shows that CNN models have the ability to estimate message embedding positions implicitly. In future work, we will extend our methods to compressed domain images.

V. APPENDIX
In this appendix, we prove that the proposed STL reduces the variance of the traditional data truncation method. We expand Eq. (4) as follows:

$$\begin{aligned} \sigma_s^2 &= 2\int_{T}^{\infty} \frac{(T - \mu_s)^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + \int_{-T}^{T} \frac{(x - \mu_s)^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx \\ &= 2\int_{T}^{\infty} \frac{T^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + \int_{-T}^{T} \frac{x^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx \\ &\quad + \int_{-T}^{T} \frac{\mu_s^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + 2\int_{T}^{\infty} \frac{\mu_s^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx \\ &\quad - 2\mu_s \int_{-T}^{T} \frac{x}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx - 4\mu_s \int_{T}^{\infty} \frac{T}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx \end{aligned} \quad (19)$$

The third line of Eq. (19) can be written as:

$$\int_{-T}^{T} \frac{\mu_s^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx + 2\int_{T}^{\infty} \frac{\mu_s^2}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx = \mu_s^2 \int_{-\infty}^{\infty} \frac{1}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx = \mu_s^2 \quad (20)$$

Since $p(x)$ is a symmetric function, the following integral is equal to zero:

$$2\mu_s \int_{-T}^{T} \frac{x}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx = 0 \quad (21)$$

Based on Eq. (7), we obtain:

$$4\mu_s \int_{T}^{\infty} \frac{T}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx = 2\mu_s \cdot 2\int_{T}^{\infty} \frac{T}{Z} e^{-\left|\frac{x}{s}\right|^{\alpha}}\,dx = 2\mu_s^2 \quad (22)$$

Combining Eq. (20), Eq. (21) and Eq. (22), $\sigma_s^2$ can be written as Eq. (9).

REFERENCES
[1] J. Mielikainen, "LSB matching revisited," IEEE Signal Processing Letters, 13(5):285-287, 2006.
[2] X. Zhang and S. Wang, "Efficient steganographic embedding by exploiting modification direction," IEEE Communications Letters, 10(11):781-783, 2006.
[3] W. Luo, F. Huang, and J. Huang, "Edge adaptive image steganography based on LSB matching revisited," IEEE Transactions on Information Forensics and Security, 5(2):201-214, 2010.
[4] T. Filler and J. Fridrich, "Gibbs construction in steganography," IEEE Transactions on Information Forensics and Security, 5(4):705-720, 2010.
[5] V. Holub and J. Fridrich, "Designing steganographic distortion using directional filters," IEEE Workshop on Information Forensics and Security, 2012.
[6] V. Holub, J. Fridrich, and T. Denemark, "Universal distortion function for steganography in an arbitrary domain," EURASIP Journal on Information Security, 1(1):1-13, 2014.
[7] B. Li, M. Wang, J. Huang, and X. Li, "A new cost function for spatial image steganography," IEEE International Conference on Image Processing, pp. 4206-4210, 2014.
[8] T. Denemark and J. Fridrich, "Improving steganographic security by synchronizing the selection channel," Proceedings of the 3rd ACM Workshop on Information Hiding and Multimedia Security, 2015.
[9] S. Lyu and H. Farid, "Detecting hidden messages using higher-order statistics and support vector machines," International Workshop on Information Hiding, 2002.
[10] J. Fridrich, "Feature-based steganalysis for JPEG images and its implications for future design of steganographic schemes," International Workshop on Information Hiding, pp. 67-81, 2004.
[11] T. Pevny, P. Bas, and J. Fridrich, "Steganalysis by subtractive pixel adjacency matrix," IEEE Transactions on Information Forensics and Security, 5(2):215-224, 2010.
[12] J. Fridrich and J. Kodovsky, "Rich models for steganalysis of digital images," IEEE Transactions on Information Forensics and Security, 7(3):868-882, 2012.
[13] T. Denemark, V. Sedighi, V. Holub, R. Cogranne, and J. Fridrich, "Selection-channel-aware rich model for steganalysis of digital images," IEEE Workshop on Information Forensics and Security, 2014.
[14] V. Holub and J. Fridrich, "Random projections of residuals for digital image steganalysis," IEEE Transactions on Information Forensics and Security, 8(12):1996-2006, 2013.
[15] H. Yin, W. Hui, H. Li, C. Lin, and W. Zhu, "A novel large-scale digital forensics service platform for internet videos," IEEE Transactions on Multimedia, 14(1):178-186, 2012.
[16] H. Zhou, K. Chen, W. Zhang, C. Qin, and N. Yu, "Feature-preserving tensor voting model for mesh steganalysis," IEEE Transactions on Visualization and Computer Graphics, DOI:10.1109/TVCG.2019.2929041, 2019.
[17] T. Filler, J. Judas, and J. Fridrich, "Minimizing additive distortion in steganography using syndrome-trellis codes," IEEE Transactions on Information Forensics and Security, 6(3):920-935, 2011.
[18] T. Pevny and J. Fridrich, "Merging Markov and DCT features for multi-class JPEG steganalysis," Proceedings of SPIE Electronic Imaging, 2007.
[19] A. D. Ker, P. Bas, R. Bohme, R. Cogranne, S. Craver, T. Filler, J. Fridrich, and T. Pevny, "Moving steganography and steganalysis from the laboratory into the real world," Proceedings of the First ACM Workshop on Information Hiding and Multimedia Security, pp. 45-58, 2013.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[22] C. Dong, C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295-307, 2015.
[23] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481-2495, 2017.
[24] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Transactions on Image Processing, 26(7):3142-3155, 2017.
[25] S. Wu, S. Zhong, and Y. Liu, "A novel convolutional neural network for image steganalysis with shared normalization," IEEE Transactions on Multimedia, DOI:10.1109/TMM.2019.2920605, 2019.
[26] S. Tan and B. Li, "Stacked convolutional auto-encoders for steganalysis of digital images," Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1-4, 2014.
[27] Y. Qian, J. Dong, W. Wang, and T. Tan, "Deep learning for steganalysis via convolutional neural networks," SPIE Media Watermarking, Security, and Forensics, vol. 9409, 2015.
[28] G. Xu, H. Z. Wu, and Y. Q. Shi, "Structural design of convolutional neural networks for steganalysis," IEEE Signal Processing Letters, 23(5):708-712, 2016.
[29] B. Li, W. Wei, A. Ferreira, and S. Tan, "ReST-Net: Diverse activation modules and parallel subnets-based CNN for spatial image steganalysis," IEEE Signal Processing Letters, 25(5):650-654, 2018.
[30] S. Wu, S. Zhong, and Y. Liu, "Deep residual learning for image steganalysis," Multimedia Tools and Applications, pp. 1-17, 2017.
[31] S. Wu, S. Zhong, and Y. Liu, "Residual convolution network based steganalysis with adaptive content suppression," IEEE International Conference on Multimedia and Expo (ICME), 2017.
[32] J. Ye, J. Ni, and Y. Yi, "Deep learning hierarchical representations for image steganalysis," IEEE Transactions on Information Forensics and Security, 12(11):2545-2557, 2017.
[33] W. Wang, J. Dong, Y. Qian, and T. Tan, "Deep steganalysis: End-to-end learning with supervisory information beyond class labels," arXiv:1806.10443v1, 2018.
[34] M. Boroumand, M. Chen, and J. Fridrich, "Deep residual network for steganalysis of digital images," IEEE Transactions on Information Forensics and Security, 14(5):1181-1193, 2018.
[35] M. Chen, V. Sedighi, M. Boroumand, and J. Fridrich, "JPEG-phase-aware convolutional neural network for steganalysis of JPEG images," Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, pp. 75-84, 2017.
[36] J. Zeng, S. Tan, B. Li, and J. Huang, "Large-scale JPEG image steganalysis using hybrid deep-learning framework," IEEE Transactions on Information Forensics and Security, 13(5):1200-1214, 2017.
[37] G. Xu, "Deep convolutional neural network to detect J-UNIWARD," Proceedings of the 5th ACM Workshop on Information Hiding and Multimedia Security, pp. 67-73, 2017.
[38] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[39] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," International Conference on Learning Representations, 2015.
[40] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," International Conference on Learning Representations, 2016.
[41] J. Huang, "Statistics of natural images and models," PhD Thesis, Brown University, 2000.
[42] A. Srivastava, A. B. Lee, E. P. Simoncelli, and S.-C. Zhu, "On advances in statistical modeling of natural images," Journal of Mathematical Imaging and Vision, 18(1):17-33, 2003.
[43] M. Simon, Y. Gao, T. Darrell, J. Denzler, and E. Rodner, "Generalized orderless pooling performs implicit salient matching," IEEE International Conference on Computer Vision, 2017.
[44] P. Bas, T. Filler, and T. Pevny, "Break our steganographic system: the ins and outs of organizing BOSS," International Workshop on Information Hiding, pp. 59-70, 2011.
[45] L. Pibre, J. Pasquet, D. Ienco, and M. Chaumont, "Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source mismatch," Media Watermarking, Security, and Forensics, Part of IS&T International Symposium on Electronic Imaging, 2016.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: surpassing human-level performance on ImageNet classification," IEEE International Conference on Computer Vision, 2015.
[47] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations, 2015.
[48] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," International Conference on Machine Learning, 2009.
[49] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," arXiv:1712.04621, 2017.
[50] Y. Xu, R. Jia, L. Mou, G. Li, Y. Chen, Y. Lu, and Z. Jin, "Improved relation classification by deep recurrent neural networks with data augmentation," arXiv:1601.03651v2, 2016.
[51] R. Zhang, F. Zhu, J. Liu, and G. Liu, "Depth-wise separable convolutions and multi-level pooling for an efficient spatial CNN-based steganalysis," IEEE Transactions on Information Forensics and Security, 15:1138-1150, 2020.