Deep Convolutional Neural Network for Identifying Seam-Carving Forgery
Seung-Hun Nam, Wonhyuk Ahn, In-Jae Yu, Myung-Joon Kwon, Minseok Son, Heung-Kyu Lee
Abstract—Seam carving is a representative content-aware image retargeting approach to adjust the size of an image while preserving its visually prominent content. To maintain visually important content, seam-carving algorithms first calculate the connected path of pixels, referred to as the seam, according to a defined cost function and then adjust the size of an image by removing and duplicating repeatedly calculated seams. Seam carving is actively exploited to overcome diversity in the resolution of images between applications and devices; hence, detecting the distortion caused by seam carving has become important in image forensics. In this paper, we propose a convolutional neural network (CNN)-based approach to classifying seam-carving-based image retargeting for reduction and expansion. To attain the ability to learn low-level features, we designed a CNN architecture comprising five types of network blocks specialized for capturing subtle signals. An ensemble module is further adopted to both enhance performance and comprehensively analyze the features in the local areas of the given image. To validate the effectiveness of our work, extensive experiments based on various CNN-based baselines were conducted. Compared to the baselines, our work exhibits state-of-the-art performance in terms of three-class classification (original, seam inserted, and seam removed). In addition, our model with the ensemble module is robust to various unseen cases. The experimental results also demonstrate that our method can be applied to localize both seam-removed and seam-inserted areas.
Index Terms—Image forensics, content-aware image retargeting, seam-carving forgery, convolutional neural network, fine-grained local artifact extraction.
I. INTRODUCTION

With the recent spread of mobile devices, including smartphones and tablet computers, and the use of social networking services, sharing images has become a familiar phenomenon. The majority of users resize a given image to their preferred size and aspect ratio before sharing it [1]. In addition, resizing is generally used to overcome incompatibility between modules because the size and aspect ratio supported by each device and application are different [2]–[6]. To adjust an image to the target size, traditional resizing techniques (e.g., linear scaling and center cropping) have been actively employed in various tasks [7]. However, these approaches, which only consider geometric constraints, have the disadvantage that the visually prominent areas of images can be distorted or discarded during the resizing process [8], [9].
S.-H. Nam, W. Ahn, I.-J. Yu, M.-J. Kwon, M. Son, and H.-K. Lee are with the School of Computing, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, South Korea (e-mail: [email protected]).
Corresponding author: Heung-Kyu Lee.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

To address this issue, content-aware image retargeting, also known as content-based image resizing, has been introduced [3]–[9]. Unlike the conventional approaches, this promising technique allows us to adjust the size of an image while preserving important content, as illustrated in Fig. 1. As a specific example, when applying linear scaling that reduces or expands at an equal rate along one axis, as shown in Fig. 2(b), the aspect ratio of the area in the original image is transformed, distorting the identity of the main object [1], [2]. When applying center cropping to crop part of the original image, as depicted in Fig. 2(c), some of the prominent content (i.e., the small flower on the right side) can be lost [8], [9]. Content-aware image retargeting reduces the visual quality deterioration caused by image resizing by maintaining the original content of areas where prominent objects exist and allowing distortion of relatively less important areas (Fig. 2(d)).

In other words, content-aware image retargeting aims to preserve as much important content as possible, and a representative approach to this is the seam-carving technique [8]–[10]. Seam-carving-based image retargeting computes energy or saliency maps for a given image and preferentially selects seams with small amounts of energy [3]–[6]. The computed seams, drawn in green, form connected paths of pixels and are located in the relatively less important areas, as illustrated in Figs. 1(b) and 1(e). By removing seams or inserting duplicates of existing seams according to the priority determined by the energy value, the image size can be adjusted while preserving important objects. Figs. 1(c) and 1(f) and Figs. 1(d) and 1(g) are examples of seam-removed and seam-inserted images, respectively.

Seam carving's ability to generate naturally resized scenes can be deliberately exploited to distort or remove original content; therefore, detecting artifacts of seam carving has become an important topic in image forensics [11], [12]. Fig. 1 reveals that seam carving may leave no visual clues for the human visual system while subtly altering the underlying statistics of an image. In addition, it is challenging to model and analyze the artifacts of seam carving from resized images alone because the locations of seam insertion and seam removal differ according to the content and area of the image [7], [11], [12]. In other words, compared to linearly scaled images with periodic characteristics, it is more difficult to classify retargeted images generated by the seam-carving method because the computed seams are scattered globally according to the inherent characteristics of the given image.

Fig. 1. Examples of content-aware image retargeting using the seam-carving method [3]: (a) original images (width 512), (b) visualization of the computed 10% seams marked in green, (c) 10% seam-removed images (width 461), (d) 10% seam-inserted images (width 563), (e) visualization of the computed 20% seams marked in green, (f) 20% seam-removed images (width 410), and (g) 20% seam-inserted images (width 614).
This paper proposes a convolutional neural network (CNN)-based forensic approach to classifying seam-carved images with three-class classification: original, seam insertion, and seam removal. This work is an extended version of our previous work [7], which was presented at the IEEE International Conference on Image Processing (ICIP) 2019 and was referred to as LFNet.
In the ICIP 2019 work, we focused on ideas and concepts for learning the subtle local artifacts caused by seam-carving-based image retargeting. In this paper, we propose a network architecture with improved low-level feature learning (ILFNet), which is more sophisticated than the architecture that we initially proposed in [7]. We expect the components for local residual learning and local feature fusion employed in the residual dense block (RDB) to help our model learn the forensic features caused by seam carving. The effectiveness of our work is demonstrated through extensive experiments based on the BOSSbase [13] and UCID [14] datasets. In addition, an ensemble module to improve the classification performance of CNN-based classifiers is introduced. Our main contributions are summarized as follows:

• Compared to CNN-based approaches [7], [15]–[21] and the handcrafted feature-based approach [22], the proposed ILFNet exhibits state-of-the-art performance.
• This work is the first attempt to classify artifacts caused by four types of seam-carving algorithms [3]–[6].
• The ensemble module of this study improves the detection performance of CNN-based classifiers without further training.
• The superiority of the proposed ILFNet is validated through extensive experiments, including seam-carving forgery classification, robustness testing against unseen cases (e.g., saving format, seam-carving algorithm, noise addition, and retargeting ratios), and localization.

Fig. 2. Examples of image resizing techniques: (a) an original image (width 512), (b) the linear scaling result (width 410), (c) the center cropping result (width 410), and (d) the 20% seam-removed image (width 410) based on the seam-carving method [3]. Seam-carving-based image retargeting tends to preserve the prominent content of the given image, and our goal is to design a forensic method to classify seam-carved images that are resized naturally.

The remainder of this paper is organized as follows. Section II presents the seam-carving methods and reviews relevant existing work on classifying low-level features. The proposed methodology is presented in Section III, and the performance of the proposed method is demonstrated in Section IV. Finally, Section V concludes this paper.
II. RELATED WORK
In this section, we review seam-carving methods and previous forensic approaches related to our work.
A. Seam-carving-based Image Retargeting
Seam carving is a representative content-aware image retargeting approach for adjusting the size of an image while preserving its visually important objects [23]. Various seam-carving algorithms have been introduced, and this section provides a review of four types of approaches [3]–[6]. Avidan and Shamir first proposed the concept of seam carving, an image operator that identifies the pixels with the lowest energy in ascending order [3]. The goal of seam carving is to compute a monotonic and connected path of low-energy pixels (i.e., a seam) in an image. As depicted in the first row of Figs. 3(a) and 3(d), the vertical seam represented by the green monotonic line crosses the image from top to bottom and contains only one pixel in each row [3]. The ordering of
the seams can be determined by the energy function defining the importance of the pixels.

Fig. 3. Examples of content-aware image retargeting generated by four types of seam-carving algorithms: (a) and (d) are original images (width 512) with visualization of the computed 20% seams using [3]–[6], (b) and (e) are 20% seam-removed images (width 410), and (c) and (f) are 20% seam-inserted images (width 614). The green, red, cyan, and yellow connected paths of pixels represent seams computed through the seam-carving algorithms of Avidan et al. [3], Rubinstein et al. [4], Achanta et al. [5], and Frankovich et al. [6], respectively.

The measure of energy used in [3] is defined by the $L_1$-norm of the gradient:

$$e_g(I) = \left|\frac{\partial}{\partial x} I\right| + \left|\frac{\partial}{\partial y} I\right|, \qquad (1)$$

where $I$ denotes the grayscale intensity of the image with a size of $W \times H$. Given $e_g$, the optimal vertical seam $\hat{s}$ that minimizes the total energy $E$ of a seam $s$ can be obtained as $\hat{s} = \arg\min_s E(s) = \arg\min_s \sum_{k=1}^{H} e_g(I(s_k))$, where $k$ is an index of $s$, which is the path of $H$ connected pixels. With the dynamic programming approach, an optimal $\hat{s}$ can be found by updating the cumulative energy matrix $m$ for all possible connected seams:

$$m(i,j) = e_g(i,j) + \min\big(m(i-1, j-1),\, m(i, j-1),\, m(i+1, j-1)\big), \qquad (2)$$

where $(i,j)$ indicates the location of a particular pixel. At the end of this process, $\hat{s}$ is obtained by backtracking from the minimum element in the last row of $m$. By repeatedly removing and inserting the minimum-cost seam, the size of the image can be reduced or enlarged while maintaining visually prominent content (see the seam-carved and seam-inserted examples in the first row of Fig. 3).

In [4], Rubinstein et al. noted that the original operator in [3] only focuses on finding seams with the minimum energy cost, ignoring energy that is re-introduced by joining previously nonadjacent pixels. To address this issue, the authors presented a forward energy criterion for finding the optimal seam by measuring the effect of seam carving on the retargeted image:

$$m(i,j) = e_g(i,j) + \min\big(C_L(i,j) + m(i-1, j-1),\, C_U(i,j) + m(i, j-1),\, C_R(i,j) + m(i+1, j-1)\big). \qquad (3)$$

Here, $C_L$, $C_U$, and $C_R$ are the three possible vertical seam-step costs for pixel $(i,j)$, and these costs are computed as follows:

$$C_L(i,j) = C_U(i,j) + |I(i, j-1) - I(i-1, j)|,$$
$$C_U(i,j) = |I(i+1, j) - I(i-1, j)|,$$
$$C_R(i,j) = C_U(i,j) + |I(i, j-1) - I(i+1, j)|. \qquad (4)$$

With the newly added cost terms, seam removal that introduces the least amount of energy into the retargeted image is possible.

Unlike previous work [3], [4] relying on a gradient map of intensity, Achanta et al. introduced a saliency map-based seam-carving algorithm [5]. The saliency value is computed by evaluating the Euclidean distance between the average of all Lab pixel vectors of the original image $I$ and each pixel value of the Gaussian-blurred image $I_G$: $e_s(i,j) = \|I_\mu - I_G(i,j)\|$, where $I_\mu$ represents the average of all pixel vectors in the Lab color space. Inspired by Equation (4), the authors in [5] further presented color information-based cost terms by replacing the scalar differences of the grayscale intensity $I$ with vector distances of $I$ in the Lab color space.
After applying $e_s(i,j)$ and the newly defined cost function to Equation (3), the optimal seam is found using the dynamic programming approach [3], [4].

In [6], Frankovich and Wong extended the backward and forward energy cost functions in [3], [4] by incorporating an absolute energy cost function in the optimization process. As in the absolute energy cost case, the optimal seam can be calculated using the dynamic programming process by updating the following cumulative energy matrix:

$$m(i,j) = e_a(i,j) + \min\big(C_L(i,j) + m(i-1, j-1),\, C_U(i,j) + m(i, j-1),\, C_R(i,j) + m(i+1, j-1)\big), \qquad (5)$$

where $e_a(i,j) = e_g(i,j) + |e_g(i+1,j) - e_g(i,j)| + |e_g(i,j+1) - e_g(i,j)|$. The newly designed cost function penalizes seam candidates that cross areas of local extrema, which characterize regions containing a high concentration of key features.

Fig. 3 illustrates the retargeted results generated by the seam-carving approaches [3]–[6] reviewed in this section. The seams computed via [3]–[6] are represented as connected paths of green, red, cyan, and yellow pixels, respectively. Because a difference exists in the functions determining the energy and saliency values, the form of the seams corresponding to 10% of the image width computed using each technique is different (see the first and fourth columns in Fig. 3). The approaches in [4], [6], which extend [3] using $e_g$, generally calculate seams similar to the results of [3], whereas the seam-carving method in [5], using the newly defined $e_s$, calculates relatively different forms of seams compared to those of [3], [4], [6].

In this paper, the seams have various characteristics due to the inherent properties of the content (e.g., the shape of the object and background) and the predefined rule of each seam-carving algorithm. To deal with this issue, we designed a network specialized for learning and capturing low-level features so that manipulation identification can be performed even in areas with few traces of seam-carving forgery. Considering the case in which seam-carving traces are scattered throughout the image, we further aim to improve the classification performance by comprehensively analyzing the results of multiple local patches through the ensemble module. More details are provided in Section III.
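To make the dynamic program of Equations (1) and (2) concrete, a minimal NumPy sketch of backward-energy seam computation follows. The function and variable names are ours, not from [3]; the forward-energy variant of Equation (3) would only replace the three candidate terms inside the minimum with the cost-augmented ones of Equation (4).

```python
import numpy as np

def gradient_energy(gray):
    """L1-norm of the horizontal and vertical gradients, Eq. (1)."""
    gx = np.abs(np.gradient(gray, axis=1))
    gy = np.abs(np.gradient(gray, axis=0))
    return gx + gy

def find_vertical_seam(gray):
    """Return the column index of the minimum-cost seam in each row, Eq. (2)."""
    e = gradient_energy(gray.astype(np.float64))
    h, w = e.shape
    m = e.copy()  # cumulative energy matrix
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 1, w - 1)
            m[i, j] += m[i - 1, lo:hi + 1].min()
    # Backtrack from the minimum element in the last row of m.
    seam = np.empty(h, dtype=np.int64)
    seam[-1] = int(np.argmin(m[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 1, w - 1)
        seam[i] = lo + int(np.argmin(m[i, lo:hi + 1]))
    return seam

def remove_vertical_seam(gray, seam):
    """Drop one pixel per row; repeated calls shrink the width as in [3]."""
    h, w = gray.shape
    keep = np.ones((h, w), dtype=bool)
    keep[np.arange(h), seam] = False
    return gray[keep].reshape(h, w - 1)
```

Repeatedly calling `find_vertical_seam` and `remove_vertical_seam` shrinks the width one pixel at a time, which mirrors how the retargeted images studied in this paper are generated; seam insertion duplicates the selected seams instead of removing them.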
B. Seam-carving Artifact Detection

In the sections below, we cover conventional handcrafted feature-based (i.e., non-CNN-based) approaches for classifying seam-carved images and CNN-based classifiers for capturing local artifacts caused by various manipulations.
1) Conventional Handcrafted Feature-based Approach:
To capture the inherent statistical changes caused by seam removal or seam insertion in retargeted images, handcrafted feature-based approaches [11], [12], [22], [24]–[29] have been presented. Sarkar et al. proposed a forensic approach [24] exploiting 324-dimensional Markov features (i.e., Shi-324), consisting of a 2D difference histogram in the discrete cosine transform domain, within a support vector machine (SVM) framework. In [25], Fillion and Sharma demonstrated that an SVM-based trained model employing hand-designed features based on energy bias, seam behavior, and wavelet absolute moments is suitable for detecting seam-carving forgery.

In [11], Wei et al. introduced an SVM-based approach using small blocks (called mini squares) and three types of patch transition probability matrices. To highlight the local texture artifacts, Yin et al. produced a set of features by combining half-seam features, energy features, and noise-based features from the local binary pattern domain [12]. The author of [29] revealed that a set of directional derivative-based and Gabor residual-based features generally performed well in the given forensic task. In [26], Liu and Chen presented an approach using calibrated neighboring joint density and demonstrated that an ensemble classifier [30] with rich models (e.g., CC-JRM [28] and SRMQ1 [27] features) for image steganalysis is effective for seam-carved forgery detection. In [22], Ryu et al. presented a feature vector that combines energy features, seam features, and noise features for exploring artifacts of seam removal and analyzed the relationship among neighboring pixels to estimate the seam insertion.

The described conventional approaches have shown acceptable performance, but they do not fully meet the needs of forensics for seam-carving detection because forensic traces can be lost during the generation of handcrafted features [17], [21]. In addition, in some cases, two independent algorithms are required to detect seam insertion and seam removal [7], [23]. In other words, two tests must be run to authenticate an image, which leads to a high false alarm rate. To address the inherent problems of these hand-designed feature-based approaches, forensic techniques using deep learning frameworks that let the network automatically learn forensic features have been proposed. These are described in the next section.
2) Convolutional Neural Network-based Approach:
Inspired by high-level vision tasks (e.g., ImageNet [31] classification and object detection) that have achieved significant advances using deep learning, various approaches to CNN-based multimedia forensics have been proposed [32]–[34]. While CNNs for computer vision are capable of learning features from the data, in their general form they tend to learn high-level features of the content of a given image [17], [21]. To address this issue, CNN-based forensic approaches [7], [15]–[21] have been designed to learn forensic features while suppressing the content of the image by exploiting preprocessing layers or network components specialized for learning low-level features.

In [17], Bayar and Stamm introduced a constrained convolution layer that forces the CNN model, called BayarNet in this paper, to learn prediction error filters that produce low-level forensic features. In [16], He et al. suggested a residual network (ResNet) based on skip-connections for residual learning, which has had a positive effect on improving the performance of CNNs for forensics [7], [32] and steganalysis [21]. In particular, the revised ResNet-34 (rResNet) without the initial pooling layer, which prevents the loss of noise-like features, exhibited stable and outstanding performance in capturing seam-carving forgery, as introduced in [7]. In [19], Nam et al. proposed H-VGG, a combination of VGGNet [35] with high-pass filtering (HPF), and the model successfully
detected double compression artifacts in the decoded intra-coded frames (I-frames) of H.264 video.

Regarding detecting relocated I-frames in H.264 video, He et al. presented HeNet, consisting of a component for extracting high-frequency features to eliminate the influence of diverse video content [18]. To derive results specialized for deepfake detection, Rössler et al. [36] constructed the FaceForensics dataset consisting of fake videos and experimentally demonstrated that Xception [15] effectively detects artifacts that occur during the generation of a fake face. Boroumand et al. presented SRNet [21], the first end-to-end framework for steganalysis in both the spatial and JPEG domains. In addition, SRNet, in which the pooling layer is excluded from the network blocks in the early and middle stages, is effective for exploring low-level artifacts. Ye et al. proposed a CNN-based method, referred to as YeNet, in which an HPF-based preprocessing layer is placed at the front of the network for seam-carved image detection [20].

Inspired by the CNN-based approaches covered in this section, we aim to design a CNN architecture specialized in micro-signal detection. In particular, without the aid of heuristic components (e.g., a preprocessing layer and hand-designed features), we focus on a framework that learns forensic features in an end-to-end fashion. To this end, five types of network blocks are introduced in our work, and the proposed ILFNet, built on the advantages obtainable from each block, can effectively detect traces caused by seam carving. To reveal the effectiveness of our work, we conducted extensive experiments comparing the conventional approach [22] and CNN-based approaches [7], [15]–[21]; a detailed description of ILFNet is provided in the next section.

Fig. 4. Overview of the forensic approach classifying seam-carving-based image retargeting. In the process of training the proposed ILFNet, the mini-batch consists of randomly selected original, seam-inserted, and seam-removed images. In the process of testing, the trained model with the classification loss L_c enables three-class classification for a given suspicious image.

III. PROPOSED FRAMEWORK
This paper focuses on designing a CNN-based framework to capture local artifacts caused by seam carving. Learning fine-grained forensic features requires a different approach from CNNs that are specialized for learning content-dependent features. To overcome this obstacle, we explore forensic features through the proposed ILFNet, formed by considering the role of each network block. As illustrated in Fig. 4, our architecture consists of five block types (BTs), from BT-1 to BT-5. Unlike previous work using heuristic components, the proposed work automatically learns forensic features in an end-to-end fashion. Next, we present the motivation for our approach and detailed descriptions of the proposed ILFNet.
A. Motivation and Strategy
Distinguishing between original, seam-inserted, and seam-removed images can be regarded as a three-class classification problem. Given the set of training data $(x_1, y_1), \ldots, (x_N, y_N)$ of $N$ samples, $x$ represents the image, and $y$ denotes its corresponding class (0: original image I_OR, 1: seam-inserted image I_SI, and 2: seam-removed image I_SR). Fig. 4 depicts the overview of our framework for classifying seam-carving forgery and illustrates that I_SR and I_SI are generated by removing or duplicating the less visually important areas of I_OR through the seam-carving algorithm. The tendency of the calculated seams, drawn in green, varies according to the given image content and the energy calculation approach of the seam-carving algorithm (Section II-A). In addition, the CNN-based classification of seam-carving artifacts faces the obstacle of handling data of different sizes (e.g., the sizes of I_OR, I_SR, and I_SI) at the same time.

We first considered an approach that resizes data to the same size using scaling, but it has the disadvantage that local texture artifacts caused by seam carving can be lost. In addition, employing interpolation-based scaling introduces unintended traces into the given image [1]. Thus, data samples with a size of W × H were generated using cropping rather than scaling [7]. Because the intrinsic content of natural images is very diverse, we judged that cropping a sufficiently large area at a fixed location would include various cases of seam-carving artifacts in the cropped sample. That is, because the form of the calculated seams is affected by the content of the image, such as the object and background, various cases of local artifacts can be observed in the obtained samples. As illustrated in the middle part of Fig. 4, the input data of the proposed network were generated by cropping an area of W × H from the upper left of the images, including I_OR, I_SR, and I_SI.

Unlike the paired mini-batch training methodology [37] focusing on the difference between the paired original and
its corresponding manipulated data, we constructed the mini-batch through random sampling from the training set. Through this, the proposed model was induced to consider various cases (e.g., content information and user-preferred ratio parameters for retargeting) at each iteration of the training phase. Before being input into the network, the input in the RGB color space is converted to grayscale (W × H × 3 → W × H × 1) in our work. The proposed ILFNet consists of five types of network blocks, from BT-1 to BT-5. Through the combination of network blocks, the following fundamental abilities for forensic feature learning are included in ILFNet: (i) local texture artifact learning, (ii) refined feature learning via local feature fusion, and (iii) hierarchical feature learning and classification.

Fig. 5. Architecture of the proposed ILFNet for classifying low-level artifacts caused by seam-carving-based image retargeting: d and s indicate the number of output feature maps and the stride of each layer, respectively.

Unlike existing approaches exploiting preprocessing [18]–[20] and hand-designed features [33], [34], the proposed network learns and extracts subtle traces of seam carving from the input in an end-to-end fashion. To do this, we first placed network blocks for extracting low-level features, also known as residual noise [21] and prediction residuals [17], in the shallow layers of the network. Because fine-grained artifacts caused by manipulation are vulnerable to destruction by the pooling layer [38]–[40], the pooling layer, which suppresses noise-like signals, is excluded from the first to fifth network blocks comprising BT-1 and BT-2. In particular, BT-2 is equipped with a skip-connection to help propagate gradients to the upper layers [21], which has proven effective for residual learning [16]. We expected the front segment for extracting low-level features to play a role similar to a high-pass filter, which is verified by visualizing the feature maps in Section IV.

Next, we further improve the classification ability by adapting the RDB [39] from super-resolution into BT-3, constituting the middle segment. Super-resolution is a task that reconstructs high-resolution images from low-resolution images [40] by improving textured details. It differs from the given classification task in that it is intended for image restoration, but the two share the perspective of dealing with low-level signals. Because the RDB is specialized in extracting abundant features, we expected the sub-components for local residual learning and local feature fusion in BT-3 to help our model learn meaningful features from the feature maps generated by BT-2. In addition, inspired by [7], [21], we kept the number of feature maps d of the first to seventh network blocks constant.

Finally, the last segment, comprising BT-4 and BT-5, is used for dimensionality reduction of the feature maps generated by the middle segment and performs the three-class classification. The higher-level features are learned from the lower-level features obtained from the front and middle segments through consecutively placed blocks of BT-4. For BT-5, global average pooling (AvgPool) [41] is exploited to replace the numerous neurons of fully connected (FC) layers to mitigate the chance of overfitting. Based on the experiments, we empirically determined the number and arrangement of network blocks that constitute the proposed ILFNet. The network automatically explores forensic features, training from randomly initialized parameters in an end-to-end fashion.
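The input preparation of this section (a fixed upper-left W × H crop followed by grayscale conversion) reduces to a few lines. In the sketch below, the crop size and the luma-based conversion are our assumptions; the paper does not specify either value here.

```python
import numpy as np

# Hypothetical crop size W x H; the exact value used in the paper is not
# reproduced here.
CROP_W, CROP_H = 256, 256

def prepare_input(rgb):
    """Upper-left W x H crop, then RGB -> grayscale (W x H x 3 -> W x H x 1)."""
    patch = rgb[:CROP_H, :CROP_W, :].astype(np.float32)
    # BT.601 luma weights, a common grayscale conversion (the paper does not
    # state which conversion is used).
    gray = patch @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return gray[..., None]  # shape (H, W, 1)
```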
B. Network Architecture
Fig. 5 depicts the detailed configuration of each network block constituting the proposed ILFNet. The network consists of 11 network blocks using five BTs, as displayed in the figure. In this section, each block type is described, and then the differences from our previous work [7] are detailed. Finally, we introduce a loss function for training our model and an ensemble module for further performance improvement in the testing phase.

BT-1 is composed of a 3 × 3 convolutional (Conv) layer with stride 1, which is followed by batch normalization [42] to alleviate the potential of overfitting and uses the rectified linear unit (ReLU) [43] as an activation function. In content classification tasks, the AvgPool layer, a representative pooling layer, is employed to reinforce the content [21] and reduce the dimensionality of the feature maps, but it suppresses subtle signals by averaging adjacent information [7], [38]. Therefore, if the AvgPool layer is placed in the initial layers, it prevents the network from learning subtle pixel-value-dependent low-level features.
Inspired by the insights and approaches in [7], [21], [39], we prevent the noise-like signal from disappearing by excluding the pooling layer from the BT-1 configuration. As depicted in Fig. 5, we induce the extraction of shallow features from the input data by placing two BT-1s in the early layers of ILFNet.

In addition, BT-2 is designed to improve the ability of ILFNet to extract forensic features using a skip-connection [16], [44]. Like BT-1, the pooling layer is excluded from BT-2. The main stream of BT-2 consists of two Conv layers, each of which is followed by batch normalization, with ReLU as a nonlinear activation function. Unlike the approaches in [17], [18], in which a 1 × 1 Conv layer is placed on a deeper layer, we sequentially placed a 3 × 3 Conv layer that learns the relationship between neighboring elements and a 1 × 1 Conv layer that learns the association between the feature maps in the shallow layers. As displayed in Fig. 5, the feature map input into BT-2 is reused through the skip-connection for residual learning. The skip-connection was proposed to alleviate the vanishing gradient problem that adversely affects the convergence of deep-structured CNNs [16], [44], and this component helps propagate the gradient to the upper layers. Thus, local residual learning through the skip-connection can help in learning the local texture artifacts caused by seam carving. We expect the proposed ILFNet to learn low-level features through the front segment comprising consecutive BT-1s and BT-2s.

Inspired by the RDB in [39], which addresses low-level signals, the structure of BT-3, constituting the middle segment, was determined. Because the RDB is specialized in extracting abundant features [40], we expected the sub-components for local residual learning and local feature fusion in BT-3 to help ILFNet learn refined features from the feature maps generated by the previous block. In this work, a lightweight version of the RDB, consisting of two connected BT-1s followed by the component for local feature fusion, is adopted. Fig. 5 indicates that the architecture of BT-3 not only enables the feature maps of the previous block to connect with each Conv layer of the current BT-3 but also comprehensively learns abundant local features through local feature fusion [39]. Based on concatenation and 1 × 1 Conv layers, the extracted local features are fused; then, local residual learning through the skip-connection is performed. Inspired by [7], [21], we kept the number of feature maps d of the first to seventh network blocks constant (16), and the growth rate of the Conv layers in BT-3 was set to 16. In summary, through the abilities of BT-3 in terms of abundant local feature extraction and comprehensive feature learning, the proposed ILFNet can learn and explore refined forensic features.

For hierarchical feature learning, three consecutive BT-4s are placed at the front of the last segment. To learn and extract a higher-level representation of the previously learned features, BT-4 consists of Conv layers for higher-level feature learning and a pooling layer for dimensionality reduction. The main path of BT-4 employs two 3 × 3 Conv layers, each of which is followed by batch normalization and ReLU, and an AvgPool layer with a stride of 2 is applied at the last layer for the dimensionality reduction of the feature maps. Inspired by [7], [15], the skip-connection has a 1 × 1 Conv layer with a stride of 2 to perform element-wise addition, and this component enables the fusion of feature representations of multiple resolutions.
We increase the number of filters d by a factor of 2 whenever a BT-4 is inserted into ILFNet.

Finally, the condensed feature maps are directly passed to BT-5, which is designed for the three-class classification after the consecutive dimensionality reductions. For BT-5, constituting the last segment, global AvgPool is exploited to replace the numerous neurons of FC layers to alleviate overfitting and improve the generalization ability [41]. Next, BT-5 consists of an FC layer that has three output neurons and a softmax layer. In our work, ŷ denotes the output of the FC layer. The proposed ILFNet, consisting of a combination of BT-1 to BT-5 with unique characteristics and purposes, automatically explores the forensic features of seam-carving forgery in an end-to-end fashion.
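To illustrate how the blocks described above could be realized, the following Keras sketch builds BT-2 (3 × 3 and 1 × 1 Conv layers with a skip-connection) and a lightweight RDB-style BT-3 (dense concatenation, 1 × 1 local feature fusion, and a local residual). This is a sketch under our reading of Fig. 5, not the authors' released code; everything beyond d = 16 and the growth rate of 16 is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel):
    """Conv -> batch normalization -> ReLU, stride 1 (the BT-1 pattern)."""
    x = layers.Conv2D(filters, kernel, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def bt2(x, d=16):
    """BT-2: 3x3 then 1x1 Conv with a skip-connection (no pooling).
    The input x is assumed to already have d channels."""
    y = conv_bn_relu(x, d, 3)
    y = conv_bn_relu(y, d, 1)
    return layers.Add()([x, y])  # local residual learning

def bt3(x, d=16, growth=16):
    """BT-3: lightweight residual dense block; dense concatenation,
    1x1 local feature fusion, then a local residual connection."""
    f1 = conv_bn_relu(x, growth, 3)
    f2 = conv_bn_relu(layers.Concatenate()([x, f1]), growth, 3)
    fused = layers.Conv2D(d, 1, padding="same")(
        layers.Concatenate()([x, f1, f2]))  # local feature fusion
    return layers.Add()([x, fused])
```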
C. Differences From the Original Work

In [7], we proposed a network for low-level feature learning, referred to as LFNet. Compared to our original work, the new architecture was designed with considerable refinement, and the design choices were motivated by extensive experiments. Unlike the original network block for learning subtle signals, in which a 3 × 3 Conv layer and a 1 × 1 Conv layer were sequentially placed in BT-1, we believe that it is important to learn the relationship between the neighboring pixel elements in the shallow layers. Therefore, to focus on learning low-level features based on the relationship between adjacent pixels, only the 3 × 3 Conv layer was employed for BT-1 of ILFNet. In addition, the proportion of BT-2 blocks adopting residual learning was increased, and the proportion of BT-1 blocks without a skip-connection was reduced.

In particular, the new architecture contains a middle segment for local feature fusion and refined feature learning. Inspired by the RDB addressing low-level signals for super-resolution in [39], we newly adopted BT-3 to learn comprehensively refined features from lower-level features. With the sub-components of BT-3 (e.g., the contiguous memory mechanism for passing local features, 1 × 1 Conv layer-based feature fusion, and a skip-connection for local residual learning), the proposed ILFNet can learn refined and higher-level features from the shallow features obtained through the front segment. For BT-4, for learning hierarchical features, an AvgPool layer with a stride of 2 was employed instead of the max-pooling (MaxPool) layer used in the original architecture. This was determined based on the insights obtained from [18], [20], [39] and the performance analysis experiments on the pooling layer type (i.e., AvgPool and MaxPool layers). From BT-1 to BT-4, batch normalization [42] is used to alleviate the overfitting problem. Lastly, from BT-2 to BT-4, shortcut connections are employed to help propagate the gradient to the higher layers.
D. Loss Function
In this section, the loss function for the three-class classification is defined, where ŷ_j refers to the output for class j among the three classes, j ∈ {0, 1, 2}. In our work, the original, seam-inserted, and seam-removed images
were set to Class 0, Class 1, and Class 2, respectively. The probability $P(\hat{y} = j)$ can be computed from $\hat{y}_j$ using the following softmax function:

$$P(\hat{y} = j) = \frac{e^{\hat{y}_j}}{\sum_{j'=0}^{2} e^{\hat{y}_{j'}}},$$

where $y$ is a one-hot vector and $y_j$ denotes the specific class. If a given image corresponds to Class 0, $y$ is defined as $y = [y_0; y_1; y_2] = [1; 0; 0]$. The classification loss $L_c$ is computed as the cross-entropy:

$$L_c = -\sum_{j=0}^{2} y_j \log P(\hat{y} = j).$$

The proposed ILFNet trained with the defined $L_c$ classifies suspicious test images into three classes (i.e., original, seam insertion, and seam removal) with high accuracy.
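A minimal TensorFlow sketch of $L_c$ follows, assuming ŷ holds the raw outputs (logits) of the three-neuron FC layer; it is numerically equivalent to the built-in softmax cross-entropy.

```python
import tensorflow as tf

def classification_loss(labels, logits):
    """Cross-entropy L_c over the softmax probabilities of the three classes
    (0: original, 1: seam insertion, 2: seam removal)."""
    probs = tf.nn.softmax(logits)                        # P(y_hat = j)
    onehot = tf.one_hot(labels, depth=3)                 # one-hot vector y
    return -tf.reduce_sum(onehot * tf.math.log(probs + 1e-12), axis=-1)
```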
E. Ensemble Module

To enhance the performance of our trained model in the testing phase, we propose an ensemble module. As described in Section III-A, our approach crops an area of W × H from the upper left of the input data before inputting it into the network in the training process. This also applies to the testing phase, as illustrated at the top of Fig. 6. Here, Θ indicates the number of patches sampled by applying cropping to the suspicious test image I_S. Inspired by data augmentation-based self-ensembling [45], we aim to improve the classification performance by generating multiple samples from a given I_S, providing them to the trained model to obtain multiple outputs, and considering these outputs comprehensively (see the bottom of Fig. 6).

Fig. 6. Overview of the ensemble module to improve the performance of the trained model for classifying seam-carving forgery in the testing phase.

If Θ is equal to k, the proposed ensemble module acquires k samples in the testing phase from I_S, which has a size of W_S × H_S. The upper-left coordinate for generating the i-th sample I_{S_i}, represented by $(r_{x_i}, r_{y_i})$, is uniformly sampled according to

$$r_{x_i} \sim U(0,\, W_S - W), \qquad r_{y_i} \sim U(0,\, H_S - H), \qquad (6)$$

where $U$ stands for the uniform distribution, and an area of size W × H is cropped based on $(r_{x_i}, r_{y_i})$, where $i = \{1, \ldots, k-1\}$. As an exception, we fix $(r_{x_0}, r_{y_0})$ to $(0, 0)$, which is the same as the case with Θ = 1. If the value of i is between 1 and k − 1, then $(r_{x_i}, r_{y_i})$ and its corresponding $I_{S_i}$ are obtained based on Equation (6). In the case of Θ = k, the trained model takes the k patches sampled from I_S as input, and k predicted probabilities are generated. We average the k predicted probabilities to determine the final output score. This simple ensemble module does not require additional training of a separate model. Through the ensemble module, we expect the proposed forensic framework with multiple samples to comprehensively explore the local texture artifacts of seam carving scattered throughout a given I_S. In addition, we found that the ensemble module provides an additional performance gain for both our work and the comparative CNN-based approaches in terms of classifying seam-carving artifacts. The effectiveness of the ensemble module is covered in detail in the following section.
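The ensemble module itself requires only sampling and averaging. The sketch below follows Equation (6), pins the first crop to the upper-left corner as described, and assumes `model` maps a batch of patches to softmax probabilities.

```python
import numpy as np

def ensemble_predict(model, image, crop_w, crop_h, theta=10, rng=None):
    """Average the model's softmax outputs over theta crops of a suspicious
    image, Eq. (6); crop 0 is fixed to the upper-left corner."""
    rng = rng or np.random.default_rng()
    h_s, w_s = image.shape[:2]
    coords = [(0, 0)] + [
        (rng.integers(0, w_s - crop_w + 1), rng.integers(0, h_s - crop_h + 1))
        for _ in range(theta - 1)
    ]
    patches = np.stack([image[y:y + crop_h, x:x + crop_w] for x, y in coords])
    probs = model(patches)          # shape (theta, 3), softmax outputs
    return np.mean(probs, axis=0)   # final score; argmax gives the class
```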
IV. EXPERIMENTS

To assess the performance of our ILFNet for classifying seam-carving forgery, we conducted a set of experiments and analyses. This section provides detailed descriptions of the experimental setup and the results of the extensive experiments.
A. Dataset
In the experiments, the BOSSbase [13] and UCID [14] datasets were used to generate 10,000 original images. Inspired by the various uses of JPEG compression reported in [32]–[34], the obtained original images were saved using JPEG compression with a quality factor of 100. Based on the algorithm in [3], the original images were retargeted using vertical seam removal from 10% to 50% in 10% steps, resulting in 50,000 seam-carved images in total. Similarly, the original images were enlarged using vertical seam insertion [3] from 10% to 50% in 10% steps, resulting in 50,000 seam-inserted images. Like the original images, the generated seam-removed and seam-inserted images were also saved using JPEG compression with a quality factor of 100. In total, 110,000 images were obtained. We divided the images into three sets for training, validation, and testing. In the training and testing processes, the ratio of image data corresponding to each class was kept balanced for the three-class classification. Before being input into the network, samples of size W × H were cropped from the images in the generated dataset.

To demonstrate the effectiveness of our approach, we generated additional testing sets for experiments assessing robustness in unseen cases (i.e., the seam-carving algorithms [4]–[6], retargeting ratios from 4% to 8% in 2% steps, noise-signal addition to interfere with classification, and an uncompressed image format, namely BMP). In addition, we designed experiments regarding horizontal seam-carving classification and spatial localization and created testing sets for them. In these cases, training based on a new methodology was applied to the model. The additionally generated testing sets are described in detail in each experimental section.
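A sketch of the dataset-generation loop is shown below; `seam_carve_width` is a hypothetical helper standing in for the seam removal/insertion of [3], and the file layout is our own.

```python
from pathlib import Path
from PIL import Image

RATIOS = [0.1, 0.2, 0.3, 0.4, 0.5]  # 10% to 50% in 10% steps

def generate_retargeted_set(src_dir, dst_dir, seam_carve_width):
    """Save seam-removed and seam-inserted variants of every original image
    with JPEG quality factor 100, mirroring the dataset construction above."""
    for path in Path(src_dir).glob("*.jpg"):
        img = Image.open(path).convert("RGB")
        for r in RATIOS:
            for sign, tag in ((-1, "removed"), (+1, "inserted")):
                target_w = round(img.width * (1 + sign * r))
                out = seam_carve_width(img, target_w)  # hypothetical helper
                out.save(Path(dst_dir) / f"{path.stem}_{tag}_{int(r*100)}.jpg",
                         quality=100)
```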
B. Training Settings

We built our network using the TensorFlow framework and ran the experiments on an NVIDIA GeForce RTX 2080 Ti GPU.
Fig. 7. Training accuracy (a) and training loss (b) tendencies of each network (Xception, rResNet, BayarNet, HeNet, H-VGG, YeNet, LFNet, SRNet, and ILFNet) over 50 epochs.
In the experiments, we use the Adam optimizer [46] with a fixed learning rate, momentum coefficients β₁ and β₂, and a numerical stability constant ε. The size of the mini-batch was set to 24. In the training process, the mini-batch was constructed by randomly sampling data from the training set. The proposed ILFNet is trained for 50 epochs, and the best model is selected as the one that maximizes the classification accuracy on the validation set.
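A minimal sketch of this training configuration follows. The learning rate and Adam coefficients shown are illustrative stand-ins (the paper's exact values are not reproduced here), and `model`, `train_ds`, and `val_ds` are assumed to be defined elsewhere.

```python
import tensorflow as tf

# Illustrative hyperparameter values, not the paper's exact settings.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4,
                                     beta_1=0.9, beta_2=0.999, epsilon=1e-8)

model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])
model.fit(train_ds.shuffle(10_000).batch(24),  # mini-batch size 24
          validation_data=val_ds.batch(24),
          epochs=50)  # select the epoch with the best validation accuracy
```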
C. Baselines

To demonstrate the effectiveness of the proposed ILFNet, we designed an experiment to analyze the classification performance of our work versus the comparative approaches. As CNN-based approaches for classifying manipulation artifacts and low-level signals, Xception [15], rResNet [16], BayarNet [17], HeNet [18], H-VGG [19], YeNet [20], LFNet [7], and SRNet [21] were employed as baselines in the experiments. For the three-class classification, the CNN components were lightly modified so that the last layer provides three predicted probabilities. Inspired by the setting in [7], in the case of rResNet, the ResNet-34 model [16] with the initial pooling layer excluded was employed. The hyperparameters of the comparative CNN-based approaches were set as described in each paper [7], [15]–[21], and the batch size and optimizer were set using the methodology specified in Section IV-B. For fair experiments, the weight initialization was set equally, and heuristic adjustment of the learning rate was excluded from the training process. Rotation-based data augmentation was used only in the training phase of the horizontal seam-carving classifier.

As illustrated in Fig. 7, each model was trained for 50 epochs until it sufficiently converged in terms of training accuracy and loss. We found that only trivial improvement occurred in the performance of each model after 50 epochs. Like ILFNet, the best model was selected as the one that maximizes the validation accuracy. In addition, we conducted a comparative experiment using a conventional handcrafted feature-based approach [22], and the results are provided in Section IV-I. The parameters of the conventional forensic method were set as specified in [22]. Because traditional seam-carving classifiers only allow two-class classification (i.e., seam insertion versus original and seam removal versus original), we employed multiple classifiers in the experiments. In the case of the proposed ILFNet, only one trained model for three-class classification was used.
D. Evaluation Metrics
We employed classification accuracy as the base evaluation criterion, defined as $(n_c / n_t) \times 100\,(\%)$, where $n_t$ and $n_c$ indicate the total number of testing samples and the number of correctly predicted samples, respectively. When calculating the accuracy on the testing set, the ratio of data corresponding to each class was kept balanced, except for the specially designed two-class classification tests. In addition, receiver operating characteristic (ROC) curves were computed to evaluate the performance of the proposed and comparative models. The ROC curve is defined as a plot of the true positive rate against the false positive rate, and the area under the curve (AUC) is further used as a metric for performance evaluation. A model is considered to have outstanding performance if its AUC value is close to 1.
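Both metrics can be computed directly with scikit-learn; a short sketch, assuming `probs` holds the per-class softmax outputs of a model over the testing set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

def evaluate(y_true, probs):
    """y_true: integer label array in {0, 1, 2}; probs: (n_t, 3) softmax outputs."""
    acc = 100.0 * accuracy_score(y_true, probs.argmax(axis=1))  # n_c / n_t * 100
    curves, aucs = {}, {}
    for c in range(3):  # one-vs-rest ROC curve and AUC per class
        binary = (np.asarray(y_true) == c).astype(int)
        fpr, tpr, _ = roc_curve(binary, probs[:, c])
        curves[c] = (fpr, tpr)
        aucs[c] = roc_auc_score(binary, probs[:, c])
    return acc, curves, aucs
```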
E. Performance Evaluation of Networks

We first evaluated the performance of the proposed ILFNet and the eight comparative networks [7], [15]–[21] by measuring the accuracy of the three-class classification (i.e., original, seam insertion, and seam removal). The results of measuring the accuracy over the retargeting ratio parameters are listed in Table I. The bottom part of the table lists the classification results for a mixed test set that contains retargeting ratios of 10% to 50%. The accuracy values of ILFNet for the 10% to 50% retargeting ratios are 88.17%, 94.93%, 98.40%, 99.43%, and 99.53%, respectively. The classification performance for the mixed set is 96.56%, which is state-of-the-art compared to the CNN-based baselines. The proposed work achieved 0.87% higher performance than SRNet [21], which exhibits outstanding performance for steganalysis in the spatial and JPEG domains. Compared with the performance of LFNet [7], 1.17% higher accuracy was achieved, which confirms that the architecture refinement applied to ILFNet was effective.
TABLE I
PERFORMANCE EVALUATION OF ILFNET AND COMPARATIVE CONVOLUTIONAL NEURAL NETWORKS FOR THREE-CLASS CLASSIFICATION ON VARIOUS RETARGETING RATIOS (%).

TABLE II
PERFORMANCE EVALUATION OF ILFNET AND COMPARATIVE CONVOLUTIONAL NEURAL NETWORKS FOR TWO-CLASS CLASSIFICATION ON VARIOUS RETARGETING RATIOS (%).
Notes: OR, SI, and SR denote original, seam insertion, and seam removal, respectively.
Other networks [15]–[20] demonstrated acceptable performance, but each accuracy value was less than 90%. As listed in Table I, all networks exhibited lower accuracy when the retargeting ratio was 10% because a smaller ratio leaves fewer traces of forgery in the image.

Next, we applied the models for three-class classification to two-class classification tasks. Table II lists the results of the two-class classification (i.e., seam insertion versus original and seam removal versus original) on various retargeting ratios. The proportion of data corresponding to each class was kept balanced. As listed in Table II, our work demonstrated outstanding performance on both types of classification task. In the experiments, we found that our work and the baselines tend to explore the traces of seam insertion better than the artifacts caused by seam removal. Capturing artifacts of seam removal may be more difficult than capturing those of seam insertion because forensic feature extraction proceeds by focusing on the differences between adjacent pixels, and information is lost during removal. For the mixed set, the proposed ILFNet showed 1.74% higher accuracy on the seam-insertion task than on the seam-removal task.

To evaluate the performance of ILFNet and the comparative networks in detail, we computed the ROC curves for the three-class classification. As presented in Fig. 8, the ROC curve for each class was generated, and the AUC value of each ROC curve was calculated (see the legend of each subfigure in Fig. 8). Because the ROC curves of ILFNet, LFNet, and SRNet were closer to the top-left corner than those of the other comparative networks [15]–[20], we conclude that ILFNet, LFNet, and SRNet perform better than the other networks. Furthermore, our work attains AUC values of 0.991 and 0.996 on the first two classes, with the per-class values reported in the legend of Fig. 8.
TABLE III
CLASSIFICATION ACCURACY OF THE PROPOSED ILFNET ON TYPES OF POOLING LAYER (MAXPOOL VERSUS AVGPOOL) (%).

We also compared the two pooling-layer types for ILFNet by training variants in which the pooling layers are MaxPool and AvgPool layers with a stride of 2, respectively. As listed in Table III, the AvgPool-based model exhibits 96.56% accuracy for the mixed-ratio set, which is greater than that of the MaxPool-based model by 0.57%. When the retargeting ratios were 10% and 20%, the MaxPool-based model demonstrated accuracy slightly above that of the AvgPool-based model, but overall, the model with the AvgPool layer performed better. Based on the results in the table, we decided to place the AvgPool layer in BT-4 of ILFNet.

Fig. 8. Receiver operating characteristic (ROC) curves of each network and the computed area under the curve (AUC) values for each class: (a)–(i) show the results for Xception, rResNet, BayarNet, HeNet, H-VGG, YeNet, LFNet, SRNet, and the proposed ILFNet, respectively.
F. Performance Evaluation of Networks with Ensemble Module
In Section III-E, we presented a methodology to improve the test performance of the trained model using the ensemble module, which does not require additional training. Through the ensemble module, we expected the trained CNN-based models, given multiple samples, to comprehensively explore the local texture artifacts of seam carving scattered throughout a given suspicious image. As mentioned, Θ indicates the number of patches sampled by applying cropping to the suspicious image. When Θ is equal to 1, it indicates the models that classify forgery from a single sample, as in the previous section. In this experiment, we applied the ensemble module with Θ = 1, 5, 10 to our trained and comparative models and analyzed the classification performance of each model against seam-carving forgery.

Table IV lists the performance evaluation of ILFNet and the comparative CNNs with an ensemble module for three-class classification on various retargeting ratios. When the number of samples (Θ) provided to the ensemble module was set to 1, 5, and 10, the classification accuracies of the proposed ILFNet were 96.56%, 97.02%, and 97.18%, respectively. Likewise, the performance of the comparative models improved due to the ensemble module, as presented in Table IV. For the cases in which multiple samples were provided to the ensemble module (i.e., Θ = 5, 10), the proposed ILFNet achieved the highest accuracy, and LFNet exhibited the second-best performance. YeNet achieved a large improvement with the adoption of the ensemble module; when the value of Θ increases from 1 to 10, its performance improves by 3.26%.

In the seam-carving-based retargeting process, the distribution of the computed seams is affected by the content of the image; thus, local regions generally expand or shrink throughout the given image. Therefore, by providing samples of various local areas of the seam-carved image to the model, the probability that samples containing abundant forensic traces caused by seam-carving forgery are provided to the model increases. The results in Table IV support why the ensemble module should be adopted in the proposed forensic framework for classifying seam carving.
TABLE IV
PERFORMANCE EVALUATION OF ILFNET AND COMPARATIVE CONVOLUTIONAL NEURAL NETWORKS WITH AN ENSEMBLE MODULE FOR THREE-CLASS CLASSIFICATION ON VARIOUS RETARGETING RATIOS (%).

Fig. 9. Confusion matrices of the proposed ILFNet for classifying seam carving with an ensemble module: (a) the result of ILFNet with Θ = 1, (b) the result of ILFNet with Θ = 5, and (c) the result of ILFNet with Θ = 10. Here, OR, SI, and SR denote original, seam insertion, and seam removal, respectively.
Based on the results in Table IV, the ensemble module can provide an additional performance gain in terms of classifying seam-carving artifacts.

For a more detailed analysis, we generated confusion matrices for the three-class classification using ILFNet with the ensemble module. As the number of samples provided to ILFNet increased
(from Θ = 1 to Θ = 5 and 10), the number of correct predictions for the corresponding true class increased (see Fig. 9). In the case of ILFNet with Θ = 10, the correctly predicted probability values for the true classes of original, seam insertion, and seam removal were 0.997, 0.993, and 0.922, respectively. Furthermore, we found that seam-removed images are often misclassified as original images, as observed in Fig. 9.
G. Performance Evaluation of Networks for Unseen Cases
In this section, the results of the robustness experiments for unseen cases that were not considered in the training process of the models are provided. It is important in multimedia forensics to ensure robustness against unconsidered environments, as in digital watermarking [47]–[49], which must be robust to various attacks in the distribution process. From this perspective, it is beneficial for a CNN-based forensic approach to be robust against unseen cases. To demonstrate the effectiveness of the proposed ILFNet, we further conducted extended experiments on testing sets of unseen cases, meaning testing environments that were not considered in the training phase. We employed the trained models for the three-class classification described in the previous sections and conducted the robustness tests without additional training for the unseen cases. The following unseen cases are covered in this section:

• unseen seam-carving algorithms [4]–[6],
• unseen retargeting ratios of 0%, 4%, 6%, and 8%,
• unseen post-processing by noise addition,
• an unseen uncompressed image format, namely BMP.

First, we conducted the performance evaluation of ILFNet and the comparative CNNs on unseen seam-carving algorithms [4]–[6]. As described in Section II-A, the computed seams have various characteristics due to the inherent properties of the content (e.g., the shape of the object and background) and the predefined function of each seam-carving algorithm.
TABLE V
PERFORMANCE EVALUATION OF ILFNET AND COMPARATIVE CONVOLUTIONAL NEURAL NETWORKS WITH AN ENSEMBLE MODULE FOR THREE-CLASS CLASSIFICATION ON UNSEEN SEAM-CARVING ALGORITHMS (%).

SC method               Θ    Xception  rResNet  BayarNet  HeNet  H-VGG  YeNet  LFNet  SRNet  ILFNet
Avidan et al. [3]       1    85.93     87.06    84.83     83.96  88.66  88.83  95.39  95.69  96.56
                        5    87.62     89.46    85.67     85.75  89.75  91.36  96.96  96.82  97.02
                        10   87.99     90.18    86.31     86.12  89.89  92.09  97.08  97.02  97.18
Rubinstein et al. [4]   1    79.33     83.91    75.62     75.67  81.66  83.00  89.53  89.80  90.10
                        5    80.70     87.10    78.97     79.50  83.53  86.26  89.77  90.13  90.73
                        10   81.14     88.53    79.40     80.43  83.46  86.71  90.20  89.92  91.18
Achanta et al. [5]      1    56.83     60.93    58.73     69.57  74.43  70.03  79.91  76.53  79.27
                        5    60.77     63.82    61.20     75.10  79.13  76.73  81.00  76.32  81.37
                        10   60.46     64.50    61.93     74.98  79.23  75.62  81.33  76.90  81.71
Frankovich et al. [6]   1    80.27     83.03    77.20     77.46  81.33  84.45  88.52  90.13  89.57
                        5    84.43     86.37    79.23     80.06  84.15  87.60  89.55  90.23  90.37
                        10   83.70     87.80    80.53     80.47  85.50  87.82  90.73  90.45  91.38
Number of samples provied to the trained models A cc u r a c y ( % ) XceptionrResNetBayarNetHeNetH-VGGYeNetLFNetSRNetILFNet
Fig. 10. Performance evaluation of ILFNet and comparative convolutional neural networks with an ensemble module on the unseen retargeting ratio of 0%. In the experiment, the models were applied to the testing set consisting of single-compressed and double-compressed original images.

First, we evaluated the performance of ILFNet and the comparative CNNs on the unseen seam-carving algorithms [4]–[6]. As described in Section II-A, the computed seams have various characteristics owing to the inherent properties of the content (e.g., the shape of the object and background) and the predefined energy function of each seam-carving algorithm. To address this issue, we designed a network architecture that learns forensic features even in areas with few artifacts, and we proposed an ensemble module-based methodology that improves performance by comprehensively analyzing multiple local samples (a minimal sketch of such fusion follows this paragraph). In this experiment, we verified whether these attempts were effective: models trained using only the seam-carving algorithm of [3] were used to measure accuracy on the unseen algorithms [4]–[6]. Table V lists the classification accuracy of ILFNet and the comparative CNNs on the unseen seam-carving algorithms [4]–[6]; the table contains the results for a mixed test set covering retargeting ratios of 10% to 50% for each algorithm. In this comprehensive analysis, the proposed ILFNet extracted the artifacts of the seam-carving algorithms [4]–[6], which were not considered during training, better than the other comparative CNNs.
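To make the role of Θ concrete, the following is a minimal sketch of fusing predictions over Θ local samples into one image-level decision. It assumes soft averaging of softmax probabilities; the aggregation rule actually used by the ensemble module may differ (e.g., majority voting), and `model`/`samples` are placeholder names.

```python
import torch
import torch.nn.functional as F

def ensemble_predict(model, samples):
    """Fuse predictions over Theta local samples cropped from one image.

    samples: tensor of shape (Theta, C, H, W). This sketch averages softmax
    probabilities over the Theta samples before taking the argmax; the exact
    aggregation rule of the paper's module may differ (e.g., majority voting).
    """
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(samples), dim=1)  # (Theta, 3) class scores
    return probs.mean(dim=0).argmax().item()      # fused three-class decision
```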
With an ensemble module of Θ = 10, ILFNet achieved classification accuracies of 91.18%, 81.71%, and 91.38% for the unseen algorithms [4], [5], and [6], respectively. As listed in Table V, models trained on the training set for [3] tended to detect forensic traces in the test sets for [4] and [6] better than in the test set for [5]. The seam-carving algorithms in [4] and [6], which extend [3] using e_g, generally calculate seams similar to those of [3], whereas the approach in [5], which uses the newly defined e_s, calculates relatively different forms of seams compared to [3], [4], and [6] (Section II-A). Therefore, the models exhibit higher classification accuracy for the unseen algorithms [4] and [6], and the proposed ILFNet achieved more acceptable performance on the testing set for [5] than the other comparative models.

We further conducted experiments on unlearned retargeting ratios of seam carving. As described in Section IV-A, the original images were retargeted using the seam-carving process of [3] from 10% to 50% in 10% steps for training the models. For these experiments, we generated testing sets with unseen retargeting ratios of 0%, 4%, 6%, and 8% based on the algorithm of [3]. When the ratio equals 0%, the original image is subjected to JPEG compression (quality factor = 100) without enlargement or reduction; thus, in this case, the models were applied to a testing set consisting of single-compressed and double-compressed original images. Because rounding and truncation errors occur during encoding and decoding, respectively, in JPEG compression with a fixed quality factor, the differences between single-compressed and double-compressed original images are not exactly zero [32] (illustrated by the sketch after this paragraph). Fig. 10 presents the classification results of this experiment, through which we aimed to show that the proposed model distinguishes single- and re-compressed images in which no seam-carving forgery has been applied to the original content. In this experiment, the proposed ILFNet and SRNet achieved outstanding performance, and the classification accuracies of our model were 96.12%, 98.05%, and 98.90% when Θ in the ensemble module was set to 1, 5, and 10, respectively. In addition, based on the seam-carving algorithm of [3], we conducted a performance evaluation on testing sets with retargeting corresponding to the 4%, 6%, and 8% ratios of the image width.
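As a side illustration of the re-compression effect noted above, the following sketch, using Pillow as a stand-in codec, shows that single- and double-compressed images at the same quality factor generally do not coincide exactly; the synthetic test image is illustrative.

```python
import io
import numpy as np
from PIL import Image

def jpeg_roundtrip(arr, quality=100):
    """Round-trip an RGB uint8 array through JPEG at the given quality factor."""
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format="JPEG", quality=quality)
    return np.asarray(Image.open(io.BytesIO(buf.getvalue())))

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
single = jpeg_roundtrip(original)  # single-compressed
double = jpeg_roundtrip(single)    # double-compressed at the same quality
# Rounding/truncation errors keep the two from matching exactly in general [32].
diff = np.abs(single.astype(int) - double.astype(int))
print("mean absolute difference:", diff.mean())
```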
TABLE VI
PERFORMANCE EVALUATION OF ILFNET AND COMPARATIVE CONVOLUTIONAL NEURAL NETWORKS WITH AN ENSEMBLE MODULE FOR THREE-CLASS CLASSIFICATION ON UNSEEN RETARGETING RATIOS (%)

Ratio  Θ   Xception  rResNet  BayarNet  HeNet  H-VGG  YeNet  LFNet  SRNet  ILFNet
4%     1   50.02     56.87    49.66     48.43  58.74  59.73  72.20  70.90  72.44
       5   50.83     58.10    49.02     49.63  58.50  59.90  72.47  71.26  72.53
       10  51.86     58.56    50.03     49.77  59.22  59.96  73.50  71.77  72.70
6%     1   57.26     62.14    54.47     54.23  64.13  64.50  77.33  75.83  77.14
       5   57.33     64.95    54.86     56.20  65.83  67.03  78.78  77.06  77.40
       10  58.07     65.53    55.42     57.26  64.57  66.60  79.86  77.74  78.10
8%     1   61.20     66.07    58.23     58.47  68.30  68.43  80.76  78.60  80.95
       5   61.86     69.96    59.70     61.30  69.03  70.53  82.00  79.50  82.13
       10  62.50     70.43    60.13     62.16  69.60  71.70  82.06  79.96  82.30

TABLE VII
PERFORMANCE EVALUATION OF ILFNET AND COMPARATIVE CONVOLUTIONAL NEURAL NETWORKS FOR THREE-CLASS CLASSIFICATION ON POST-PROCESSING OF ADDITIVE WHITE GAUSSIAN NOISE (%)

σ    Xception  rResNet  BayarNet  HeNet  H-VGG  YeNet  LFNet  SRNet  ILFNet
0.1  85.25     86.76    84.62     84.50  88.77  88.83  95.06  95.38  96.10
0.2  84.50     85.41    83.43     81.40  85.67  87.29  92.73  93.29  93.43
0.3  83.34     84.97    81.02     73.21  79.03  85.93  87.23  91.36  90.37
0.4  82.81     84.80    78.57     68.51  72.26  79.63  82.80  87.56  86.63
0.5  82.23     83.77    75.23     63.40  67.70  74.36  80.53  81.40  83.93
TABLE VIII
PERFORMANCE EVALUATION OF ILFNET WITH AN ENSEMBLE MODULE FOR THREE-CLASS CLASSIFICATION ON POST-PROCESSING OF ADDITIVE WHITE GAUSSIAN NOISE (%)

σ    Θ = 1   Θ = 5   Θ = 10
0.1  96.10   —       97.01
0.2  93.43   —       96.30
0.3  90.37   —       93.83
0.4  86.63   —       92.03
0.5  83.93   —       88.50

As listed in Table VI, all networks exhibited lower accuracy when the unseen retargeting ratio was smaller, which may be caused by the fewer traces of forgery remaining in the given samples. In the results of Table VI, ILFNet, SRNet, and LFNet demonstrated acceptable performance, whereas the accuracy values of the other networks [15]–[20] were below 72%, even with the ensemble module. In these experiments, the accuracy values of ILFNet with an ensemble module of Θ = 10 were 72.70%, 78.10%, and 82.30% for retargeting ratios of 4%, 6%, and 8%, respectively. In particular, when the unseen ratio was 4% or 6%, LFNet achieved the best performance, and when the ratio was 8%, ILFNet exhibited the highest accuracy. Analyzing the results for the unseen retargeting ratios of 0%, 4%, 6%, and 8% comprehensively, our work demonstrated stable and outstanding performance.

Next, we conducted experiments on unseen post-processing with additive white Gaussian noise (AWGN). AWGN can be applied during the distribution and manipulation of images [47]–[49]; hence, robustness against AWGN is important for practical forensics. In this experiment, we applied AWGN with σ values from 0.1 to 0.5 to a mixed test set based on the seam-carving algorithm of [3] containing retargeting ratios of 10% to 50% (a minimal sketch of this noise addition is given after this paragraph). Table VII lists the classification accuracy obtained by applying the trained models (Θ = 1) to test images with AWGN that was not considered in the training phase. In these experiments, the proposed ILFNet achieved outstanding performance, with accuracy values of 96.10%, 93.43%, 90.37%, 86.63%, and 83.93% for σ ranging from 0.1 to 0.5. We also found that the accuracy of all networks decreased as σ increased, suggesting that the low-level features caused by seam-carving forgery were affected by the added noise signal. In addition, we analyzed the robustness of the proposed ILFNet against AWGN using the ensemble module with Θ = 1, 5, and 10. As listed in Table VIII, our work exhibited a performance improvement when using the ensemble module. In particular, comparing the results for Θ = 1 and Θ = 10, performance improved by 0.91%, 2.87%, 3.46%, 5.40%, and 4.57% for the respective σ values.
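For reference, below is a minimal sketch of the noise addition used to build such a test set. It assumes pixel values normalized to [0, 1] so that σ ∈ {0.1, ..., 0.5} matches the sweep in Table VII; the paper's exact pixel scale is an assumption here.

```python
import numpy as np

def add_awgn(image, sigma, seed=0):
    """Add white Gaussian noise with standard deviation sigma to a test image.
    Pixel values are assumed normalized to [0, 1] so that sigma in
    {0.1, ..., 0.5} matches the sweep in Table VII."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep the result a valid image
```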
Finally, we experimented with the unseen case of the image format. As mentioned in Section IV-A, inspired by the widespread use of JPEG compression reported in [33], [34], we saved the training images in the JPEG format. In the process, we minimized the additional distortion (i.e., the rounding and truncation errors of JPEG compression [32]) on the training set by setting the quality factor of the JPEG compression to 100, inducing the proposed network to focus on the deformation caused by seam carving. BMP is a representative uncompressed image format used as a standard format in various applications. To analyze the robustness to an unseen image format of our model, which learned the local artifacts of seam carving applied to JPEG images with a quality factor of 100, we created a testing set in the BMP format. To do this, we saved the data (i.e., the original, seam-removed, and seam-inserted images before JPEG compression) corresponding to the testing set used in the experiment in Table I in the BMP format.

TABLE IX
PERFORMANCE EVALUATION OF ILFNET AND COMPARATIVE CONVOLUTIONAL NEURAL NETWORKS WITH AN ENSEMBLE MODULE FOR THREE-CLASS CLASSIFICATION ON THE UNSEEN IMAGE FORMAT (%)

Format  Θ   Xception  rResNet  BayarNet  HeNet  H-VGG  YeNet  LFNet  SRNet  ILFNet
JPEG    1   85.93     87.06    84.83     83.96  88.66  88.83  95.39  95.69  96.56
        5   87.62     89.46    85.67     85.75  89.75  91.36  96.96  96.82  97.02
        10  87.99     90.18    86.31     86.12  89.89  92.09  97.08  97.02  97.18
BMP     1   76.73     80.06    73.73     75.63  71.57  78.33  81.57  79.50  80.26
        5   79.27     82.74    77.60     79.23  73.00  81.80  83.97  81.27  82.93
        10  79.56     83.77    78.92     80.73  73.52  81.93  84.97  82.60  84.26

Fig. 11. Examples of enlargement and reduction of the image height based on computed horizontal seams. The examples are the results of seam removal and seam insertion corresponding to 20% of the height, obtained by applying the seam-carving algorithm [3] to the original image. From top to bottom, original images with the computed seams visualized, seam-removed images, and seam-inserted images are provided.

Table IX presents the classification accuracies obtained by applying the models trained on the JPEG format to the testing set in the BMP format. In the experiment, LFNet achieved the highest accuracy for each Θ, and ILFNet exhibited the second-best performance. The accuracy values of our model were 80.26%, 82.93%, and 84.26% when Θ in the ensemble module was set to 1, 5, and 10, respectively. In particular, all networks tended to improve in performance as the number of provided samples increased. Compared to the results for the JPEG format in Table IX, the classification performance of all networks on the BMP format is relatively low. We estimate that this degradation was caused by the rounding and truncation errors of JPEG compression remaining in the images of the training set. In other words, the pixel distortions caused by JPEG compression with a quality factor of 100 are minor, but ILFNet, which is specialized in extracting fine-grained signals, learns these artifacts in addition to the forensic features caused by seam carving. Therefore, the performance of a model trained on JPEG images may deteriorate on a testing set consisting of uncompressed BMP images.

Summarizing the results in this section, the proposed ILFNet achieved stable and high performance for the four types of unseen cases. Therefore, this work is more suitable for practical forensics (i.e., real-world approaches) than the other comparative networks [7], [15]–[21].

H. Performance Evaluation of Classifying Horizontal Seam Carving
The horizontal seam is similar to the vertical seam except that the connection runs from left to right. As stated in [3], a dynamic programming approach based on the energy function selects the horizontal seam along the horizontal direction (i.e., from left to right). Fig. 11 presents the result of image retargeting corresponding to 20% of the image height, based on horizontal seams calculated using the seam-carving algorithm [3]. Fig. 11 confirms that the image height can be adjusted while maintaining the prominent content of the image through horizontal seam carving. Although the results in Figs. 1 and 11 are alike in that image retargeting is performed using the same seam-carving algorithm, differences exist in the direction of the calculated seams; therefore, different forensic features remain in the retargeted images. Consequently, because the CNN models employed in the previous sections are trained to capture the forensic features of vertical seam carving, performance degradation occurs when those models are applied directly to horizontally seam-carved images.

In this section, a training methodology using a data augmentation approach to classify horizontally seam-carved artifacts is introduced. To avoid the cost of creating a new training set for horizontal seam carving, we conducted the training process by applying rotations of 90° and 270° to the training set, which was created using vertical seam carving in Section IV-A.
TABLE X
PERFORMANCE EVALUATION OF LFNET, SRNET, AND ILFNET WITH AN ENSEMBLE MODULE FOR CLASSIFYING SEAM-CARVING FORGERY OF HORIZONTAL DIRECTION (%), WITH DATA AUGMENTATION, FOR Θ = 1, 5, 10 AND RETARGETING RATIOS OF 10% TO 50%

In the training phase of the CNNs, such as LFNet, SRNet, and ILFNet, the images in each mini-batch obtained from the training set were rotated by 90° or 270° before being input into the network (a minimal sketch of this augmentation is given at the end of this subsection). For a fair performance evaluation, a new testing set was created by applying horizontal seam carving to the original images corresponding to the testing set in Table I. Table X lists the accuracy values obtained by applying the newly trained models to the testing set containing horizontal seam-carved artifacts. Compared to LFNet and SRNet, ILFNet achieved higher performance in all cases. In particular, the accuracy values of ILFNet were 92.13%, 94.16%, and 95.52% when Θ was set to 1, 5, and 10, respectively. Thus, we confirmed that, through this data augmentation-based training methodology, seam-carved artifacts can be explored in directions different from those in the data constituting the training set.
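The augmentation step described above can be sketched as follows; the helper name and the (N, C, H, W) tensor layout are illustrative assumptions.

```python
import torch

def rotate_batch(batch):
    """Rotate a mini-batch by 90 or 270 degrees so that a network trained on
    vertical seam carving sees horizontal seam-carved statistics.
    batch: (N, C, H, W) tensor of square patches; k = 1 or 3 quarter-turns."""
    k = int(torch.randint(0, 2, (1,))) * 2 + 1       # randomly pick 1 or 3
    return torch.rot90(batch, k=k, dims=(2, 3))      # rotate the H-W plane
```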
I. Performance Evaluation with a Conventional Approach

To further demonstrate the effectiveness of our work, we evaluated the performance of ILFNet and the conventional handcrafted feature-based method [22], referred to as the Ryu method, by measuring the accuracy of two-class classification (i.e., seam insertion versus original and seam removal versus original). The Ryu method consists of two algorithms for detecting seam insertion and seam removal, whereas a single trained ILFNet model classifies both types of seam-carving forgery. In the Ryu method, seam insertion is detected using a candidate map focused on the relationship between adjacent pixels, and seam-removal detection is performed by learning feature vectors with an SVM classifier. The weighting factor t was set to 0.85, and a LIBSVM classifier with a radial basis function kernel of r = 0.125 was employed.
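For illustration only, the following sketch mirrors the reported classifier configuration of the Ryu method (an RBF-kernel SVM with r = 0.125), using scikit-learn as a stand-in for LIBSVM; the handcrafted features themselves are not reproduced, so the feature matrix below is a synthetic placeholder.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic placeholder features; the Ryu method's handcrafted feature
# extraction is not reproduced here, only the classifier configuration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))        # hypothetical feature dimensionality
y = (X[:, 0] > 0).astype(int)         # 0 = original, 1 = seam removed

clf = SVC(kernel="rbf", gamma=0.125)  # RBF kernel with r = 0.125, as above
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```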
TABLE XI
PERFORMANCE EVALUATION OF ILFNET AND THE CONVENTIONAL HANDCRAFTED FEATURE-BASED METHOD ON VARIOUS RETARGETING RATIOS (%), REPORTING SEAM-INSERTION AND SEAM-REMOVAL ACCURACY FOR THE RYU METHOD AND ILFNET AT RETARGETING RATIOS OF 10% TO 50%

The testing set that was used to derive the results listed in Table II was also exploited in this experiment. Table XI reports the classification accuracy of ILFNet and the Ryu method as measured over changes in the retargeting ratio. For the mixed testing set, the proposed ILFNet achieved higher accuracy values than the Ryu method: 97.64% and 95.90% for the seam-insertion and seam-removal classifications, respectively. For two-class classification, the Ryu method performed worse than the proposed ILFNet but achieved slightly higher accuracy than some CNN-based approaches.

J. Feature Map Visualization
In this section, feature maps obtained from the network blocks, including BT-1, BT-2, BT-3, and BT-4, are visualized to analyze the ability of ILFNet to explore and extract seam-carving artifacts. To do this, we input samples cropped from images with 20% seam removal and seam insertion into the trained model. Fig. 12 presents the visualized feature maps, where B_i and F_B_i denote the i-th network block constituting ILFNet and the feature maps obtained from B_i, respectively. As illustrated in Fig. 12, we visualized the feature maps of five network blocks by averaging over the channels of each feature map (a minimal sketch of this channel-averaging step is given at the end of this subsection). Lighter-colored areas refer to areas with higher energy values. Based on the visualization results for specific F_B_i, we analyze whether each block constituting ILFNet learns and operates as intended.

First, BT-1 and BT-2, constituting the front segment of ILFNet, were induced to extract noise-like signals by focusing on the differences between adjacent pixels of the sample. The visualizations of the BT-1 and BT-2 feature maps show that these blocks are activated on subtle differences between adjacent pixels, as we intended. Next, we induced higher-level, refined features to be extracted and learned through a middle segment comprising consecutive BT-3 blocks. Unlike the visualization results for the front segment, where the energy is concentrated on the edges of the prominent object, the energy in the BT-3 feature maps is distributed globally over the entire region. We attribute this to local residual learning and to the ability of the local feature fusion-based BT-3 to learn refined, higher-level features from the feature maps generated by the previous block.

Finally, we induced hierarchical feature learning by placing BT-4, which contains the AvgPool layer for the dimensionality reduction of feature maps, in the deeper layers of ILFNet. The visualization results of the BT-4 feature maps exhibit larger energy values in areas similar to those where local artifacts are generated by seam carving (see the examples in the 1st, 7th, and 13th columns of Fig. 12). Therefore, ILFNet can focus on the local area where seam carving is applied. In particular, the visualization results for seam-inserted images exhibit a higher contrast and more meaningful predictions than those for seam removal. Capturing artifacts of seam removal is presumably more difficult than capturing those of seam insertion because forensic feature extraction proceeds by focusing on the differences between adjacent pixels, and seam removal involves a loss of information.
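The channel-averaging step used for Fig. 12 can be sketched as follows; the function name and the (C, H, W) tensor layout are illustrative assumptions.

```python
import torch

def channel_mean_map(feature_map):
    """Collapse a (C, H, W) feature map to one energy map by averaging over
    channels, then min-max normalize it for display; lighter (larger) values
    correspond to the higher-energy areas shown in Fig. 12."""
    m = feature_map.mean(dim=0)        # average across the channel axis
    m = m - m.min()
    return m / (m.max() + 1e-8)        # normalize to [0, 1]
```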
Fig. 12. Visualization of feature maps obtained from the network blocks of ILFNet. Image retargeting corresponds to 20% of the width of the original image using vertical seam carving [3], and the cropped samples, represented in yellow, are input into ILFNet.

Fig. 13. Localization results for seam-removed artifacts of the proposed network: (a) original images with seams corresponding to 5% of the width marked in green, (b) 5% seam-removed images, (c) results of seam-removal localization, (d) original images with seams corresponding to 10% of the width marked in green, (e) 10% seam-removed images, and (f) results of seam-removal localization. For (c) and (f), seam-removed regions are marked in red.

K. Localization Results
This section presents the results of seam-removed and seam-inserted region localization. For this experiment, we newly trained ILFNet using sample images of size 128 × 128 cropped from the training set described in Section IV-A, where W = H = 128. This was done to enable more precise localization and to evaluate the scalability of ILFNet with respect to image size. In particular, localization using a small patch is more effective when the test image is small. Thus, based on the training methodology in Section IV-B, ILFNet was newly trained on 128 × 128 images, and the performance of the trained model is given in Table XII.
TABLE XII
CLASSIFICATION ACCURACY OF THE PROPOSED ILFNET TRAINED ON 128 × 128 IMAGES (%)

Model   10%    20%    30%    40%    50%    Mixed
ILFNet  80.83  88.79  92.87  95.53  96.27  91.17

The accuracy of ILFNet on the mixed testing set is 91.17%. This performance is 5.39% lower than that of the model trained on the larger samples used in the previous sections, which occurs because the traces of seam carving decrease as the sample size decreases.

We then localized locally seam-carved areas by applying the trained model to large test images. In the experiments, image retargeting corresponding to 5% and 10% of the width was applied to test images (4,224 pixels wide) from the RAISE [50] dataset. The seam-carved images, covering both enlargement and reduction, were divided into patches with a fixed stride, and we performed patch-level classification to localize the manipulated regions (a sliding-window sketch of this procedure is given at the end of this subsection). Figs. 13 and 14 illustrate the localization results of the proposed model for seam removal and seam insertion, respectively. The figures reveal that ILFNet localizes the manipulated local areas relatively accurately. For seam insertion, fewer false positives were found than for seam removal. In addition, localization becomes more accurate as the retargeting ratio increases, which may be because the traces of seam carving in the image become more abundant at larger ratios. Although some error cases exist, the proposed ILFNet effectively explores and captures the artifacts of seam-carving forgery.
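A minimal sketch of this patch-level localization loop is given below; the stride value and helper names are illustrative assumptions, as the exact stride is not restated in this section.

```python
import torch

def localize(model, image, patch=128, stride=64):
    """Patch-level localization sketch: slide a window over a large test
    image, classify each 128 x 128 patch, and return a grid of labels
    (0 = original, 1 = seam inserted, 2 = seam removed). The stride value
    here is illustrative; the exact stride is not restated in this section."""
    model.eval()
    _, height, width = image.shape  # image: (C, H, W) tensor
    rows = []
    with torch.no_grad():
        for y in range(0, height - patch + 1, stride):
            row = []
            for x in range(0, width - patch + 1, stride):
                p = image[:, y:y + patch, x:x + patch].unsqueeze(0)
                row.append(model(p).argmax(dim=1).item())
            rows.append(row)
    return torch.tensor(rows)  # label map to overlay on the test image
```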
V. CONCLUSION

This paper proposes a CNN-based forensic framework that learns and captures the local texture artifacts caused by seam-carving forgery. Learning low-level forensic features requires a different approach from that of a general CNN learning content-dependent features. To address this issue, we designed the proposed ILFNet, comprising five types of network blocks specialized for learning forensic features. Furthermore, an ensemble module was presented for enhancing classification performance and comprehensively analyzing the features in the local areas of the given test images. To demonstrate the effectiveness of the proposed ILFNet, extensive experiments were conducted with comparative CNNs and a non-CNN-based approach. Compared to the comparative classifiers, our work exhibits state-of-the-art performance in classifying seam-forgery artifacts. In addition, our trained model with the ensemble module also demonstrated high performance on the testing sets of unseen cases. The experimental results further demonstrate that our method can be applied to localize both seam-removed and seam-inserted areas. In future work, we will apply ILFNet to datasets with various JPEG quality factors and improve the classification performance by refining the network architecture.
REFERENCES
[1] S. Battiato, G. M. Farinella, G. Puglisi, and D. Ravi, "Saliency-based selection of gradient vector flow paths for content aware image resizing," IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 2081–2095, 2014.
[2] A. C. Popescu and H. Farid, "Exposing digital forgeries by detecting traces of resampling," IEEE Transactions on Signal Processing, vol. 53, no. 2, pp. 758–767, 2005.
[3] S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," ACM Transactions on Graphics (TOG), vol. 26, no. 3, p. 10, 2007.
[4] M. Rubinstein, A. Shamir, and S. Avidan, "Improved seam carving for video retargeting," ACM Transactions on Graphics (TOG), vol. 27, no. 3, pp. 1–9, 2008.
[5] R. Achanta and S. Süsstrunk, "Saliency detection for content-aware image resizing," in IEEE International Conference on Image Processing (ICIP). IEEE, 2009, pp. 1005–1008.
[6] M. Frankovich and A. Wong, "Enhanced seam carving via integration of energy gradient functionals," IEEE Signal Processing Letters, vol. 18, no. 6, pp. 375–378, 2011.
[7] S.-H. Nam, W. Ahn, S.-M. Mun, J. Park, D. Kim, I.-J. Yu, and H.-K. Lee, "Content-aware image resizing detection using deep neural network," in IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 106–110.
[8] D. Cho, J. Park, T.-H. Oh, Y.-W. Tai, and I. So Kweon, "Weakly- and self-supervised learning for content-aware deep image retargeting," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4558–4567.
[9] W. Dong, F. Wu, Y. Kong, X. Mei, T.-Y. Lee, and X. Zhang, "Image retargeting by texture-aware synthesis," IEEE Transactions on Visualization & Computer Graphics, no. 2, pp. 1088–1101, 2016.
[10] M. Rubinstein, D. Gutierrez, O. Sorkine, and A. Shamir, "A comparative study of image retargeting," ACM Transactions on Graphics (TOG), vol. 29, no. 6, p. 160, 2010.
[11] J.-D. Wei, Y.-J. Lin, and Y.-J. Wu, "A patch analysis method to detect seam carved images," Pattern Recognition Letters, vol. 36, pp. 100–106, 2014.
[12] T. Yin, G. Yang, L. Li, D. Zhang, and X. Sun, "Detecting seam carving based image resizing using local binary patterns," Computers & Security, vol. 55, pp. 130–141, 2015.
[13] P. Bas, T. Filler, and T. Pevný, "'Break our steganographic system': The ins and outs of organizing BOSS," in International Workshop on Information Hiding. Springer, 2011, pp. 59–70.
[14] G. Schaefer and M. Stich, "UCID: An uncompressed color image database," in Storage and Retrieval Methods and Applications for Multimedia 2004, vol. 5307. International Society for Optics and Photonics, 2003, pp. 472–481.
[15] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2017.
[16] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] B. Bayar and M. C. Stamm, "Design principles of convolutional neural networks for multimedia forensics," Electronic Imaging, vol. 2017, no. 7, pp. 77–86, 2017.
[18] P. He, X. Jiang, T. Sun, S. Wang, B. Li, and Y. Dong, "Frame-wise detection of relocated I-frames in double compressed H.264 videos based on convolutional neural network," Journal of Visual Communication and Image Representation, vol. 48, pp. 149–158, 2017.
[19] S.-H. Nam, J. Park, D. Kim, I.-J. Yu, T.-Y. Kim, and H.-K. Lee, "Two-stream network for detecting double compression of H.264 videos," in IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 111–115.
[20] J. Ye, Y. Shi, G. Xu, and Y.-Q. Shi, "A convolutional neural network based seam carving detection scheme for uncompressed digital images," in International Workshop on Digital Watermarking. Springer, 2018, pp. 3–13.
[21] M. Boroumand, M. Chen, and J. Fridrich, "Deep residual network for steganalysis of digital images," IEEE Transactions on Information Forensics and Security, 2019.
[22] S.-J. Ryu, H.-Y. Lee, and H.-K. Lee, "Detecting trace of seam carving for forensic analysis," IEICE Transactions on Information and Systems, vol. 97, no. 5, pp. 1304–1311, 2014.
[23] Z. K. Senturk and D. Akgun, "Seam carving based image retargeting: A survey." IEEE, 2019, pp. 1–6.
[24] A. Sarkar, L. Nataraj, and B. S. Manjunath, "Detection of seam carving and localization of seam insertions in digital images," in Proceedings of the 11th ACM Workshop on Multimedia and Security. ACM, 2009, pp. 107–116.
[25] C. Fillion and G. Sharma, "Detecting content adaptive scaling of images for forensic applications," in Media Forensics and Security II, vol. 7541. International Society for Optics and Photonics, 2010, p. 75410Z.
[26] Q. Liu and Z. Chen, "Improved approaches with calibrated neighboring joint density to steganalysis and seam-carved forgery detection in JPEG images," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 4, pp. 1–30, 2014.
[27] J. Fridrich and J. Kodovský, "Rich models for steganalysis of digital images," IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
[28] J. Kodovský and J. Fridrich, "Steganalysis of JPEG images using rich models," in Media Watermarking, Security, and Forensics 2012, vol. 8303. International Society for Optics and Photonics, 2012, p. 83030A.
[29] Q. Liu, "An improved approach to exposing JPEG seam carving under recompression," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 1907–1918, 2018.
[30] J. Kodovský, J. Fridrich, and V. Holub, "Ensemble classifiers for steganalysis of digital media," IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 432–444, 2011.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[32] W. Ahn, S.-H. Nam, M. Son, H.-K. Lee, and S. Choi, "End-to-end double JPEG detection with a 3D convolutional network in the DCT domain," Electronics Letters, 2019.
[33] J. Park, D. Cho, W. Ahn, and H.-K. Lee, "Double JPEG detection in mixed JPEG quality factors using deep convolutional neural network," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 636–652.
[34] M. Barni, L. Bondi, N. Bonettini, P. Bestagini, A. Costanzo, M. Maggini, B. Tondi, and S. Tubaro, "Aligned and non-aligned double JPEG detection using convolutional neural networks," Journal of Visual Communication and Image Representation, vol. 49, pp. 153–163, 2017.
[35] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[36] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics: A large-scale video dataset for forgery detection in human faces," arXiv preprint arXiv:1803.09179, 2018.
[37] J.-S. Park, H.-G. Kim, D.-G. Kim, I.-J. Yu, and H.-K. Lee, "Paired mini-batch training: A new deep network training for image forensics and steganalysis," Signal Processing: Image Communication, vol. 67, pp. 132–139, 2018.
[38] B. Bayar and M. C. Stamm, "Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection," IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2691–2706, 2018.
[39] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2472–2481.
[40] T. Dai, J. Cai, Y. Zhang, S.-T. Xia, and L. Zhang, "Second-order attention network for single image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11065–11074.
[41] M. Lin, Q. Chen, and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[42] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[43] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[44] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision. Springer, 2016, pp. 630–645.
[45] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, "Enhanced deep residual networks for single image super-resolution," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
[46] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[47] S.-M. Mun, S.-H. Nam, H. Jang, D. Kim, and H.-K. Lee, "Finding robust domain from attacks: A learning framework for blind watermarking," Neurocomputing, vol. 337, pp. 191–202, 2019.
[48] S.-H. Nam, W.-H. Kim, S.-M. Mun, J.-U. Hou, S. Choi, and H.-K. Lee, "A SIFT features based blind watermarking for DIBR 3D images," Multimedia Tools and Applications, vol. 77, no. 7, pp. 7811–7850, 2018.
[49] S. Nam, S. Mun, W. Ahn, D. Kim, I. Yu, W. Kim, and H. Lee, "NSCT-based robust and perceptual watermarking for DIBR 3D images," IEEE Access, vol. 8, pp. 93760–93781, 2020.
[50] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, "RAISE: A raw images dataset for digital image forensics," in Proceedings of the 6th ACM Multimedia Systems Conference. ACM, 2015.