Deep Learning-based Forgery Attack on Document Images
Lin Zhao, Student Member, IEEE, Changsheng Chen, Senior Member, IEEE, and Jiwu Huang, Fellow, IEEE
Abstract—With the ongoing popularization of online services, digital document images have been used in various applications. Meanwhile, some deep learning-based text editing algorithms have emerged which alter the textual information of an image in an end-to-end fashion. In this work, we present a low-cost document forgery algorithm built on existing deep learning-based technologies to edit practical document images. To achieve this goal, the limitations of existing text editing algorithms towards complicated characters and complex backgrounds are addressed by a set of network design strategies. First, the unnecessary confusion in the supervision data is avoided by disentangling the textual and background information in the source images. Second, to capture the structure of some complicated components, the text skeleton is provided as auxiliary information and the continuity in texture is considered explicitly in the loss function. Third, the forgery traces induced by the text editing operation are mitigated by some post-processing operations which consider the distortions from the print-and-scan channel. Quantitative comparisons of the proposed method and the existing approach show the advantages of our design: the reconstruction error measured in MSE is reduced by about 2/3, while the reconstruction quality measured in PSNR and in SSIM is improved by 4 dB and 0.21, respectively. Qualitative experiments confirm that the reconstruction results of the proposed method are visually better than those of the existing approach for both complicated characters and complex textures. More importantly, we demonstrate the performance of the proposed document forgery algorithm under a practical scenario where an attacker is able to alter the textual information in an identity document using only one sample in the target domain. The forged-and-recaptured samples created by the proposed text editing attack and recapturing operation have successfully fooled some existing document authentication systems.
Index Terms—Document Image, Text Editing, Deep Learning
I. INTRODUCTION
Due to the COVID-19 pandemic, we have observed an unprecedented demand for online document authentication in e-commerce and e-government applications. Some important document images are uploaded to online platforms for various purposes. However, the content of a document can be altered by image editing tools or deep learning-based technologies. As an illustration, Fig. 1(a) shows an example from the Document Forgery Attack dataset of the Alibaba Tianchi Competition [1] forged with the proposed document forgery approach. Some key information on the original image is edited and then the document is recaptured to conceal the forgery trace. It is a low-cost (automatic, and without the need of a skilled professional) and dangerous act if an attacker uses such forge-and-recapture document images to launch an illegal attack.

The authors are with the Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, and National Engineering Laboratory for Big Data System Computing Technology, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China. They are also with the Shenzhen Institute of Artificial Intelligence and Robotics for Society, China (e-mail: [email protected], [email protected], [email protected]).

Fig. 1. Illustration of three types of document images processed by the proposed document forgery approach (ForgeNet, as outlined in Sec. III). The edited regions are boxed out in blue.

Recently, it has been demonstrated that characters and words in natural images can be edited with convolutional neural networks [2]–[4] in an end-to-end fashion. Similar to the framework of DeepFake [5], these models are trained to disentangle different components in the document images, such as text, style and background. During the process of text editing, the input textual information (plain text with the targeted contents) is converted to a text image with the targeted style and background. It should be noted that these works [2]–[4] were originally proposed for visual translation and AR translation applications. To the best of our knowledge, there is no existing work evaluating the impact of the above deep learning-based textual content generation schemes on document security. The edited text images have not been investigated from a forensic aspect.

Authentication of hardcopy documents with digitally acquired document images is a forensic research topic of broad interest. Although an edited document image in the digital domain can be traced with some existing tamper detection and localization schemes [6], it has been shown that detection of document forgery with a small manipulation region (e.g., key information in a document) is challenging [7]. Moreover, a recapturing operation (replay attack) is an effective way to conceal the forgery trace [8], [9]. A formal attack model with two scenarios is shown in Fig. 2. For a common document (e.g., an identity card), the attacker's own copy can be edited to perform an impersonation attack of a target identity. For a document with a specific template, the attacker would steal a digital copy of the document and forge his/her own document image to get unauthorized access.

Fig. 2. Two representative forge-and-recapture attack scenarios. (a) The attacker scans his/her own identity document to obtain an identity document image and forges the document of a target identity to perform an impersonation attack. (b) The attacker steals an identity document image and forges his/her own document to obtain unauthorized access.

To understand the security threat, one should note that detecting a recapturing attack in digital documents is very different from detecting spoofing in other media, e.g., face and natural images. For example, the forensic traces from depth in face [10], [11] and natural images [9], [12], as well as Moiré pattern [13] artifacts in displayed images, are not available in document images.
Both the captured and recaptured versions of a hardcopy document are acquired from flat paper surfaces, which lack the distinct differences between a 3D natural scene and a flat surface or a pixelated display. Thus, the advancement of deep learning technologies in text editing may have already put our document images at stake.

In this work, we build a deep learning-based document forgery network to attack the existing digital document authentication system under a practical scenario. The approach can be divided into two stages, i.e., document forgery and document recapturing. In the document forgery stage, the target text region is disentangled to yield the text, style and background components. To allow text editing of characters with complicated structure under complex background, several important strategies are introduced. First, to avoid confusion among different components of the source images (e.g., between complex background textures and texts), the textual information is extracted by subsequently performing inpainting and differentiation on the input image. Second, to capture the structure of some complicated components, the text skeleton is provided as auxiliary information and the continuity in texture is considered explicitly in the loss function. Last but not least, the forgery traces between the forgery and background regions are mitigated by post-processing operations with consideration of distortions from the print-and-scan process. In the recapturing stage, the forged document is printed and scanned with some off-the-shelf devices. In the experiment, the network is trained with a publicly available document image dataset and some synthetic textual images with complicated backgrounds. An ablation study shows the importance of our strategies in designing and training our document forgery network. Moreover, we demonstrate the document forgery performance under a practical scenario where an attacker generates a forgery document with only one sample in the target domain. In our case, an identity document with complex background can also be edited by a single-sample fine-tuning operation. Finally, the edited images are printed and scanned to conceal the forgery traces. We show that the forge-and-recapture samples produced by the proposed attack have successfully fooled some existing document authentication systems.

The main contributions of this work are summarized as follows.

• We propose the first deep learning-based text editing network towards document images with complicated characters and complex background. Together with the recapturing attack, we show that the forge-and-recapture samples have successfully fooled some state-of-the-art document authentication systems.

• We mitigate the visual artifacts introduced by the text editing operation by color pre-compensation and inverse halftoning operations, which consider the distortions from the print-and-scan channel, to produce a high-quality forgery result.

• We demonstrate the document forgery performance under a practical scenario where an attacker alters the textual information in an identity document (with Chinese characters and complex texture) by fine-tuning the proposed scheme with one sample in the target domain.

The remainder of this paper is organized as follows. Section II reviews the related literature on deep learning-based text editing. Section III introduces the proposed document forgery method. Section IV describes the datasets and training procedure of our experiments.
Section V compares the proposed algorithm with the existing text editing methods and demonstrates the feasibility of attacking the existing document authentication systems with the forge-and-recapture attack. Section VI concludes this paper.

Fig. 3. Text editing under complex background by SRNet [2] (fine-tuned with 5,000 and 20,000 Chinese character images with complex and simple background, respectively). (a) Source image with styled text and background. (b) Target characters. (c) Target styled characters with background. Text artifacts and background discontinuities can be found in the boxed region of (c).
II. RELATED WORK
Recently, text image synthesis has become a hot topic in the field of computer vision. Text synthesis tasks have been implemented on scene images for visual translation and augmented reality applications. GAN-based text synthesis techniques render more realistic text regions in natural scene images. Wu et al. first addressed the problem of word- or text-line-level scene text editing with an end-to-end trainable Style Retention Network (SRNet) [2]. SRNet consists of three learnable modules, including a text conversion module, a background inpainting module and a fusion module, which are used for text editing, background erasure, as well as text and background fusion, respectively. The design of the network allows the modules to be pre-trained separately, which reduces the difficulty of end-to-end training of a complicated network. Compared with works on character replacement, SRNet works at the word level, which is a more efficient and intuitive way of document editing. Experimental results show that SRNet is able to edit the textual information in some natural scene images. Roy et al. [3] designed a Scene Text Editor using a Font Adaptive Neural Network (STEFANN) to edit texts in scene images. However, a one-hot encoding of length 26 of the target character is adopted in STEFANN to represent the 26 upper-case English alphabets in the latent feature space. Such one-hot encoding is expandable to lower-case English alphabets and Arabic numerals. However, it is not applicable to Chinese, which has a much larger character set (more than 3,000 characters in common use) [14]. Thus, STEFANN is not suitable for editing Chinese documents. Yang et al. [4] proposed an image text swapping scheme (SwapText) in scenes with special attention on the performance on perspective and curved text images. In the following, we mainly focus on SRNet [2], since it is the most relevant work to our task of editing text in document images, for two reasons. First, it is applicable to Chinese characters, unlike STEFANN [3]. Second, it keeps a relatively simple network structure compared to SwapText [4], which considers curved texts that are uncommonly found in a document.

The difficulties of editing Chinese text in document images mainly lie in background inpainting and text style conversion. In the background inpainting process, we need to fill the background after erasing the textual region. The image background, as an important visual cue, is the main factor affecting the similarity between the synthesized and the ground-truth text images. However, as shown in Fig. 3, the reconstructed regions show discontinuity in texture, which degrades the visual quality. This is mainly because the background reconstruction loss of SRNet compares the inpainted and original images pixel by pixel and weights the distortions in different regions equally, while humans inspect the results mainly through the structural components, e.g., texture.

Fig. 4. Text editing of complicated Chinese characters by SRNet [2] (fine-tuned with 20,000 Chinese character images with solid color background). (a) Source image with styled text and background. (b) Target characters. (c) Target styled characters with background. The edited text languages are English (top row) and Chinese (bottom row). It should be noted that Chinese text image synthesis performs worse than English.

In the text style conversion process, SRNet inputs the source image (with source text, target style and background) to the text conversion subnet. However, as shown in Fig. 4(c), the text style has not been transferred from (a) to (c).
In particular, Chinese characters with more strokes are distorted more seriously than English alphabets. This is because the different components (source text, target style, and background) in the source image introduce confusion in the text style conversion process. It should be noted that such distortion is more obvious for Chinese characters for two reasons. On the one hand, the number of Chinese characters is huge, with more than 3,000 characters in common use. It is more difficult to train a style conversion network for thousands of Chinese characters than for dozens of English alphabets. On the other hand, the font composition of Chinese characters is complex, as it consists of five standard strokes with multiple radicals. Therefore, text editing of Chinese characters in documents with complex background still presents great challenges.

In addition, most of the target contents of the existing works are scene images rather than document images. This only requires the artifacts in the synthesized text image to be unobtrusive to the human visual system, rather than undetectable under forensic tools. Therefore, the existing works [2]–[4] have not considered further processing the text editing results with regard to the distortions from the print-and-scan channel, such as color degradation and halftoning [15].
III. PROPOSED METHOD
Fig. 5. Overview of the proposed document forgery approach. A forge-and-recapture attack is performed on a captured document image.

As shown in Fig. 5, the document forgery attack is divided into the forgery (through the proposed deep network, ForgeNet) and recapturing steps. For the forgery process, the document image acquired by an imaging device is employed as input to the ForgeNet. It is divided into three regions, i.e., text region, image region, and background region (the areas that are not included in the first two categories). The background region is processed by the inverse halftoning module (IHNet) to remove the halftone dots in the printed document. The original content in the image region is replaced by the target image, and the resulting image is fed into the print-and-scan pre-compensation module (PCNet) and IHNet. It should be noted that PCNet deliberately distorts the color and introduces halftone patterns in the edited region such that the discrepancies between the edited and background regions are compensated. The text region is subsequently forwarded to the text editing module (TENet), PCNet and IHNet. After being processed by the ForgeNet, the three regions are stitched together to form a complete document image. Lastly, the forged document image is recaptured by cameras or scanners to finish the forge-and-recapture attack. For clarity, the definitions of the main symbols in our work are summarized in Tab. I. In the following paragraphs, the TENet, PCNet, and IHNet within the ForgeNet will be elaborated.
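To make the data flow concrete, the following is a minimal sketch of the region-wise processing and stitching described above. It assumes TENet, PCNet and IHNet are image-to-image callables and that the attacker supplies binary masks for the text and image regions; these interfaces are assumptions for illustration, not the exact implementation.

```python
import torch

def forge_document(doc, text_mask, image_mask, target_text, target_image,
                   tenet, pcnet, ihnet):
    """doc: (1,3,H,W) captured document; *_mask: (1,1,H,W) binary masks."""
    bg_mask = 1.0 - torch.clamp(text_mask + image_mask, 0.0, 1.0)

    # Background region: only remove the halftone dots of the captured copy.
    background = ihnet(doc) * bg_mask

    # Image region: replace the content, then re-introduce channel distortion
    # (PCNet) and clean the halftone pattern (IHNet) for consistency.
    image_region = ihnet(pcnet(target_image)) * image_mask

    # Text region: edit the text first, then apply the same compensation.
    edited_text = tenet(doc * text_mask, target_text)
    text_region = ihnet(pcnet(edited_text)) * text_mask

    # Stitch the three regions back into a complete document image.
    return background + image_region + text_region
```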
A. The Text Editing Network (TENet)
In this part, a deep learning-based architecture, TENet, is proposed to edit the textual information in document images. As shown in Fig. 6, TENet consists of three subnets. The background inpainting subnet generates a complete background by filling the original text region with the predicted content. The text conversion subnet replaces the text content of the source image I_s with the target text I_t while preserving the original style. The fusion subnet merges the outputs from the last two subnets and yields the edited image with the target text and original background.
1) Background Inpainting Subnet:
Prior to performing text editing, we need to erase the text in the original text region and fill in the background. In this part, we adopt the original encoder-decoder structure in SRNet [2] to complete the background inpainting. The L1 loss and the adversarial loss [16] are employed to optimize the initial background inpainting subnet. The loss function of the background inpainting subnet is written as

$$\mathcal{L}_b = \mathbb{E}\big[\log D_b(I_b, I_s) + \log(1 - D_b(O_b, I_s))\big] + \lambda_b \lVert I_b - O_b\rVert_1, \quad (1)$$

where E denotes the expectation operation, D_b denotes the discriminator network of the background inpainting subnet, O_b is the output of the background inpainting subnet, I_b is the ground-truth background image, and λ_b is the weighting factor that is set to 10 to balance the adversarial loss and the L1 loss in our experiment.

TABLE I
SUMMARY OF SYMBOL DEFINITIONS

Symbols in TENet:
  I_s    source image with background and text
  I'_s   source image without background
  I_t    target textual image with solid background
  O_t    output of the text conversion subnet in TENet
  T_sk   text skeleton image of the target styled characters
  I_st   target styled characters with solid background
  I_b    background image of I_s
  O_b    output of the background inpainting subnet in TENet
  O_cf   output of the coarse fusion subnet
  O_ff   output of the fine fusion subnet
  T_e    edge map extracted from I_f
  M_t    binary mask of I_st

Symbols in PCNet:
  I_o    original digital document image
  I_p    I_o with print-and-capture distortion
  I_d    denoised version of I_p
  O_p    output of PCNet

Symbols in IHNet:
  I_h    generated halftone image used for training IHNet
  O_c    output of CoarseNet in IHNet
  I_e    edge map extracted from I_h
  O_e    output of EdgeNet in IHNet
  O_d    output of DetailNet in IHNet
  O_de   edge map obtained by feeding O_d to EdgeNet

As shown in Fig. 3, the background inpainting performance degrades seriously under complex backgrounds. As discussed in Sec. II, the texture continuity in the background region was not considered in the existing network designs [2], [4]. In our approach, we adopt the background inpainting subnet in SRNet for a rough reconstruction, and the fine details of the background will be reconstructed in the fusion subnet (Sec. III-A3).
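A minimal sketch of the loss in Eq. (1) is given below, assuming the discriminator D_b is a conditional network that outputs probabilities in (0, 1); the generator and the discriminator optimize this objective in opposite directions, as in a standard conditional GAN.

```python
import torch
import torch.nn.functional as F

def background_inpainting_loss(d_b, i_b, o_b, i_s, lambda_b=10.0):
    """Eq. (1): adversarial term of discriminator d_b conditioned on the
    source image i_s, plus a weighted L1 term between the ground-truth
    background i_b and the inpainted output o_b."""
    real = d_b(i_b, i_s)                          # score for real pairs
    fake = d_b(o_b, i_s)                          # score for generated pairs
    adv = torch.mean(torch.log(real + 1e-8) + torch.log(1.0 - fake + 1e-8))
    return adv + lambda_b * F.l1_loss(o_b, i_b)
```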
2) Text Conversion Subnet:
The purpose of the text conversion subnet is to convert the target texts to the style of the source texts. In this subnet, the text properties that can be transferred include fonts, sizes, colors, etc.

However, the performance of the text conversion subnet in [2] degrades significantly (as shown in Fig. 3) if the background region of the source image I_s contains complex textures. Therefore, we propose to isolate the text region from the background texture before carrying out text conversion. First, the background image O_b is obtained by the background inpainting subnet proposed in Sec. III-A1. Second, we differentiate the background image O_b and the source image I_s to get the source text image without background, I'_s. Due to the subtle differences between O_b and the corresponding ground-truth I_b, there will be some residuals in the differential image of I_s and O_b. These residuals can be removed by post-processing operations, such as filtering and binarization, and the source text image without background I'_s is obtained. The target text image I_t and I'_s are fed into the text conversion subnet, which follows the encoder-decoder FCN framework. The network can then convert I_t according to the style of I'_s without interference from the background region.

Fig. 6. The framework of TENet. It contains three subnets: the background inpainting subnet, the text conversion subnet and the fusion subnet. The background inpainting subnet generates a complete background by filling the original text region with the predicted content. The text conversion subnet replaces the text content of the source image with the target text while preserving the original style. The fusion subnet merges the output from the last two subnets and yields the edited image with the target text and original background.

However, different from the training data provided in [2], our target documents (as shown in Fig. 1) contain a significant amount of Chinese characters, which have more complex structures than English alphabets and Arabic numerals. Besides, the number of Chinese characters is huge, with more than 3,000 characters in common use. Therefore, instead of using a ResBlock-based text skeleton extraction subnet as in [2], we directly adopt a hard-coded component [17] for text skeleton extraction in our implementation to avoid unnecessary distortions. Such a design avoids the training overhead for Chinese characters, though the flexibility of the network is reduced.

Intuitively, the L1 loss can be applied to train the text conversion subnet. However, without weighting the text and background regions, the output of the text conversion subnet may leave visible artifacts on character edges. We propose to add a binary mask of the target styled text image, M_t, to weight different components in the loss function. The loss of the text conversion subnet can be written as

$$\mathcal{L}_t = |M_t| \cdot M_t \cdot \mathcal{L}_{l_1} + (1 - |M_t|) \cdot (1 - M_t) \cdot \mathcal{L}_{l_1}, \quad (2)$$

where |M_t| is the L1 norm of M_t, and L_{l_1} is the L1 loss between the output of the text conversion subnet O_t and the corresponding ground-truth. It should be noted that during testing, T_sk is replaced with the text skeleton image of the intermediate result O'_t after performing decoding.
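The sketch below illustrates two pieces of this subnet: the hard-coded skeleton extraction (approximated here with scikit-image's skeletonize, which is an assumption; the paper only states that a non-trainable component [17] is used) and the mask-weighted loss of Eq. (2), where |M_t| is interpreted as the normalized L1 norm of the mask.

```python
import torch
import numpy as np
from skimage.morphology import skeletonize

def text_skeleton(binary_text):
    """binary_text: (H, W) array with 1 on text strokes, 0 elsewhere."""
    return skeletonize(binary_text.astype(bool)).astype(np.float32)

def text_conversion_loss(o_t, gt, m_t):
    """o_t, gt: (N,3,H,W) output / ground-truth; m_t: (N,1,H,W) binary mask."""
    l1 = torch.abs(o_t - gt)          # per-pixel L1 distance
    w = m_t.mean()                    # |M_t| as the fraction of text pixels
    text_term = w * (m_t * l1).mean()
    background_term = (1.0 - w) * ((1.0 - m_t) * l1).mean()
    return text_term + background_term
```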
3) Fusion Subnet:
We use the fusion subnet to fuse the output of the background inpainting subnet O_b and the output of the text conversion subnet O_t. In order to improve the quality of the text editing result, we further divide the fusion subnet into a coarse fusion subnet and a fine fusion subnet.

The coarse fusion subnet follows a generic encoder-decoder architecture. We first perform three layers of downsampling on the text-converted output O_t. Next, the downsampled feature maps are fed into 4 residual blocks (ResBlocks) [18]. It is noteworthy that we connect the feature maps of the background inpainting subnet to the corresponding feature maps of the same resolutions in the decoding layers of the coarse fusion subnet to allow a straight path for feature reuse. After decoding and up-sampling, the coarse fusion image O_cf is obtained. The loss function of the coarse fusion subnet is adopted from SRNet [2] as

$$\mathcal{L}_{cf} = \mathbb{E}\big[\log D_f(I_f, I_t) + \log(1 - D_f(O_{cf}, I_t))\big] + \lambda_{cf} \lVert I_f - O_{cf}\rVert_1, \quad (3)$$

where D_f denotes the discriminator network of the coarse fusion subnet, I_f is the ground-truth, O_cf is the output of the coarse fusion subnet, and λ_cf is the balance factor, which is set to 10 in our implementation.

Next, we further improve the quality by considering the continuity of the background texture in the fine fusion subnet. The input to this subnet is a single feature tensor which is obtained by concatenating the coarsely fused image O_cf and the edge map T_e along the channel axis, that is, [O_cf, T_e]^T. It should be noted that T_e is extracted from the ground-truth using the Canny edge detector in the training process, while in the testing process, T_e is the edge map extracted from the output of the coarse fusion subnet O_cf.

In the fine fusion subnet, the edge map of the ground-truth plays a role in correcting the details in the background area and maintaining texture continuity [19]. We attach [O_cf, T_e]^T to 4 ResBlocks to enhance the high-frequency details in the image and to remove the artifacts created by the low-frequency reconstruction in the coarse fusion subnet. The loss function of the fine fusion subnet is defined as

$$\mathcal{L}_{ff} = \lVert I_f - O_{ff}\rVert_1, \quad (4)$$

where O_ff is the output of the fine fusion subnet.

In order to reduce perceptual image distortion, we introduce a VGG-loss based on VGG-19 [20]. The VGG-loss is divided into a perceptual loss [21] and a style loss [22], which are

$$\mathcal{L}_{vgg} = \lambda_{g_1} \cdot \mathcal{L}_{per} + \lambda_{g_2} \cdot \mathcal{L}_{style}, \quad (5)$$

$$\mathcal{L}_{per} = \mathbb{E}\big[\lVert \phi_i(I_f) - \phi_i(O_{cf})\rVert\big], \quad (6)$$

$$\mathcal{L}_{style} = \mathbb{E}\big[\lVert G_{\phi_i}(I_f) - G_{\phi_i}(O_{cf})\rVert\big], \quad (7)$$

where i ∈ [1, 5] indexes the layers from the relu1_1 to relu5_1 layers of the VGG-19 model, φ_i is the activation map of the i-th layer, G_{φ_i} is the Gram matrix of the i-th layer, and the weighting factors λ_{g_1} and λ_{g_2} are set to 1 and 500, respectively.

The whole loss function for the fusion subnet is defined as

$$\mathcal{L}_f = \mathcal{L}_{cf} + \mathcal{L}_{vgg} + \mathcal{L}_{ff}. \quad (8)$$

Eventually, the overall loss for TENet can be written as

$$\mathcal{L}_{TENet} = \arg\min_{G}\max_{D_b, D_f} (\mathcal{L}_b + \mathcal{L}_t + \mathcal{L}_f), \quad (9)$$

where G is the generator of TENet.
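A sketch of the VGG-based perceptual and style losses of Eqs. (5)-(7) is given below, using torchvision's pre-trained VGG-19; the tapped layer indices correspond to relu1_1 through relu5_1, and averaging over the taps is an implementation assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VGGLoss(nn.Module):
    def __init__(self, lambda_per=1.0, lambda_style=500.0):
        super().__init__()
        self.features = vgg19(pretrained=True).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.taps = {1, 6, 11, 20, 29}          # relu1_1 ... relu5_1
        self.lambda_per, self.lambda_style = lambda_per, lambda_style

    @staticmethod
    def gram(feat):
        n, c, h, w = feat.shape
        f = feat.view(n, c, h * w)
        return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

    def forward(self, pred, target):
        per, style, x, y = 0.0, 0.0, pred, target
        for idx, layer in enumerate(self.features):
            x, y = layer(x), layer(y)
            if idx in self.taps:
                per = per + torch.abs(x - y).mean()
                style = style + torch.abs(self.gram(x) - self.gram(y)).mean()
        return self.lambda_per * per + self.lambda_style * style
```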
B. Pre-Compensation Network (PCNet)

Fig. 7. Architecture of PCNet. The general architecture follows an encoder-decoder structure.

The edited text regions are digital images (without print-and-scan distortions), whereas the background regions have been through the print-and-scan process. If the edited text and background regions are stitched together directly, the boundary artifacts will be obvious. We therefore propose to pre-compensate the text regions with print-and-scan distortion before combining the different regions. The print-and-scan process introduces nonlinear distortions, such as changes in contrast and brightness and various sources of noise, which can be modelled as a non-linear mapping function [15]. However, it is difficult to model the distortion parametrically under uncontrolled conditions. Inspired by the display-camera transfer simulation in [23], we propose PCNet with an auto-encoder structure (shown in Fig. 7) to simulate the intensity variation and noise in the print-and-scan process.

We choose the local patch-wise texture matching loss function of the more lightweight VGG-16 network in order to improve the overall performance of the network [19], that is,

$$\mathcal{L}_{tm}(I_p, O_p) = \mathbb{E}\big[\lVert G_{\phi_i}(I_p) - G_{\phi_i}(O_p)\rVert\big]. \quad (10)$$

The loss function of PCNet is defined as

$$\mathcal{L}_{PCNet} = \lVert I_p - O_p\rVert_1 + \lambda_p \cdot \mathcal{L}_{tm}(I_p, O_p), \quad (11)$$

where O_p is the output of PCNet, and I_p is the ground-truth of O_p. The local patch-wise texture matching loss between O_p and I_p with weight λ_p is also considered. In our experiment, the weight λ_p is set to 0.02. In practice, the original document image I_o is not accessible by the attacker. Therefore, a denoised version of the document image, I_d, is employed in the training process as an estimation of the original document image. In our experiment, the denoised images are generated by the NoiseWare plugin of Adobe Photoshop [24].

Essentially, PCNet learns the intensity mapping and noise distortion of the print-and-scan channel. As shown in Sec. V-B2, the distortion model can be trained adaptively with a small amount of fine-tuning samples to pre-compensate the channel distortion.

C. Inverse Halftoning Network (IHNet)
According to [25], halftoning is a technique that simulates the continuous intensity variation in a digital image by changing the size, frequency or shape of the ink dots during printing or scanning. After the print-and-scan process, or after processing by our PCNet, the document image can be regarded as clusters of halftone dots. If the image is re-printed and recaptured without restoration, the halftone patterns generated during the first and second printing processes will interfere with each other and introduce aliasing distortions, e.g., Moiré artifacts [26]. In order to make the forge-and-recapture attack more realistic, IHNet is proposed to remove the halftoning pattern in the forged document images before recapturing.

We follow the network design in [19] to remove the halftoning dots in the printed document images. The IHNet can be divided into two steps. The first step extracts the shape and color (low-frequency features) and the edges (high-frequency features) of the document image via CoarseNet and EdgeNet, respectively. The resulting features are fed into the second stage, where image enhancements such as recovering missing texture details are implemented. However, a much simpler structure is adopted since the content of a document image is much more regular and simpler than that of a natural image. The simplification includes removing the high-level network components (e.g., the object classification subnet) and the discriminator in [19]. With such simplification, the network is much more efficient.

Fig. 8. Architecture of IHNet. It consists of three sub-networks: CoarseNet, EdgeNet and DetailNet.

Specifically, CoarseNet, with an encoder-decoder structure, is employed for the rough reconstruction of the shape and color of the halftone input images. Besides the L1 loss, a global texture loss function (defined in Eq. 10) based on the VGG-16 structure is used to measure the loss in texture statistics. Therefore, the overall loss function of CoarseNet is defined as

$$\mathcal{L}_{CoarseNet} = \lVert I_d - O_c\rVert_1 + \lambda_c \cdot \mathcal{L}_{tm}(I_d, O_c), \quad (12)$$

where O_c is the output of CoarseNet, I_d is the denoised version of the document image, and λ_c is the weighting factor, set to 0.02 in our implementation.

Due to the downsampling operation in the encoder part of CoarseNet, the high-frequency features are not preserved in the reconstructed images. However, the high-frequency components, such as the edges and contours of objects, are important visual landmarks in the image reconstruction task. Therefore, the edge map is provided as auxiliary information to the reconstruction process.

Instead of detecting edges with the Canny edge detector (as in the fusion subnet in Sec. III-A3), an end-to-end convolutional network is proposed here to extract the contours of characters and background texture from I_p. This is because a traditional edge detector would also detect the edges of the halftone dots in I_p, which should be removed by the IHNet. Due to the binary nature of an edge map, the cross-entropy function is used as the loss function of EdgeNet, that is,

$$\mathcal{L}_{EdgeNet}(O_e) = \mathbb{E}\big[-I_e \log(O_e) - (1 - I_e)\log(1 - O_e)\big], \quad (13)$$

where I_e and O_e are the edge maps of the ground-truth and the output of EdgeNet, respectively.

The output maps from CoarseNet and EdgeNet are concatenated along the channel axis to form a single feature tensor before being fed into the DetailNet, that is, [O_c, O_e]^T. DetailNet adopts a residual network that integrates low- and high-frequency features.
It clears the remaining artifacts in the low-frequency reconstruction and enhances the details. The loss function of the network is defined as

$$\mathcal{L}_{DetailNet} = \lambda_{d_1}\lVert I_d - O_d\rVert_1 + \lambda_{d_2}\mathcal{L}_{EdgeNet}(O_{de}) + \lambda_{d_3}\mathcal{L}_{tm}(I_d, O_d), \quad (14)$$

where O_d is the output of DetailNet and O_de is the edge map obtained by feeding O_d to EdgeNet. The weights λ_{d_1}, λ_{d_2} and λ_{d_3} balance the three terms, with λ_{d_1} = 100 and the other two weights set to small fractional values.
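A minimal sketch of the IHNet training losses of Eqs. (13)-(14) is given below. The texture_loss argument is assumed to implement the Gram-matrix loss L_tm of Eq. (10), and the values of λ_{d_2} and λ_{d_3} are placeholders since only λ_{d_1} = 100 is stated explicitly above.

```python
import torch.nn.functional as F

def edgenet_loss(o_e, i_e):
    """Eq. (13): o_e are predicted edge probabilities, i_e the binary edge map."""
    return F.binary_cross_entropy(o_e, i_e)

def detailnet_loss(o_d, i_d, o_de, i_e, texture_loss,
                   lambda_d1=100.0, lambda_d2=0.1, lambda_d3=0.1):
    """Eq. (14): L1 reconstruction + edge consistency + texture statistics."""
    return (lambda_d1 * F.l1_loss(o_d, i_d)
            + lambda_d2 * edgenet_loss(o_de, i_e)
            + lambda_d3 * texture_loss(i_d, o_d))
```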
Fig. 9. Examples of synthetic Chinese characters in dataset D_t. (a) I_s, source image with styled text and background. (b) I_t, target characters. (c) T_sk, text skeleton of I_t. (d) I_st, target styled characters without background. (e) I_b, background image. (f) I_f, target styled characters with background. (g) T_e, edge map from I_f. (h) M_t, binary mask of I_st.

IV. DATASETS AND TRAINING PROCEDURE
A. Datasets

1) Synthetic Character Dataset:
The editing objects of our task contain a large number of Chinese characters. To train TENet, we construct a synthetic character dataset D_t including text in Chinese characters, English alphabets and Arabic numerals. As shown in Fig. 9, the dataset consists of eight types of images, which are summarized as follows:

• I_s: a source image which consists of a background image and generated characters with random content and length, including Chinese characters (about 5 characters per image), English alphabets (about 10 alphabets per image) and Arabic numerals (about 10 numerals per image); the colors, fonts and rotation angles are also randomly determined.
• I_t: a gray background image with a fixed font for the target character(s).
• T_sk: a text skeleton image of I_t.
• I_st: target styled character(s) with a gray solid background.
• I_b: the background image of the source image.
• I_f: an image consisting of both the background of the source image and the target styled character(s).
• T_e: the edge map extracted from I_f.
• M_t: the binary mask of I_st.

The synthetic text dataset D_t contains a total of 400,000 images, with 50,000 images of each type.
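For illustration, one training sample of D_t could be organized as in the following sketch; the field names mirror Tab. I and Fig. 9, and the container itself is an implementation assumption.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SyntheticSample:
    i_s: np.ndarray   # source image: styled text on a background
    i_t: np.ndarray   # target characters on a plain gray background
    t_sk: np.ndarray  # text skeleton of i_t
    i_st: np.ndarray  # target styled characters on a solid background
    i_b: np.ndarray   # background image of i_s
    i_f: np.ndarray   # target styled characters composited on i_b
    t_e: np.ndarray   # edge map extracted from i_f
    m_t: np.ndarray   # binary mask of i_st
```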
2) Student Card Image Dataset:
To facilitate the training of our ForgeNet, a high-quality dataset consisting of document images captured with various devices is needed. As shown in Fig. 10, we use the student card dataset from our group [27]. The original images in this dataset are synthesized using Adobe CorelDRAW and printed on acrylic plastic material by a third-party manufacturer. It contains a total of 12 student cards from 5 universities. The dataset is collected with 11 off-the-shelf imaging devices, including 6 camera phones (XiaoMi 8, RedMi Note 5, Oppo Reno, Huawei P9, Apple iPhone 6 and iPhone 6s) and 5 scanners (Brother DCP-1519, Benq K810, Epson V330, Epson V850 and HP Laserjet M176n). In total, the dataset consists of 132 high-quality captured images of student cards. In our experiments, these document images are used in the forgery and recapture operations. This dataset is denoted as D_c.

Fig. 10. Some captured images of student cards used in our experiment. The originals are synthesized with Adobe CorelDRAW.
B. Training Procedure of ForgeNet
The training process of the proposed ForgeNet is carried out in several phases. The TENet, PCNet and IHNet are pre-trained separately.
1) Training TENet:
For training TENet, we use the synthetic Chinese character dataset D_t in Sec. IV-A1. In order to cater for the network input dimension, we adjust the height of all images to 128 pixels and keep the original aspect ratio. In addition, the 400,000 images in the dataset are divided into training, validation and testing sets in an 8:1:1 ratio. Different portions of the dataset are fed into the corresponding inputs of the network for training. With a given training dataset, the model parameters (with random initialization) are optimized by minimizing the loss function. We implement a pix2pix-based network architecture [28] and train the model using the Adam optimizer. The batch size is set to 8. Since it is not simple to conduct end-to-end joint training on such a complicated network, we first input the corresponding images into the background inpainting subnet and the text conversion subnet for pre-training with a training period of 10 epochs. Subsequently, the fusion subnet joins the end-to-end training with a training period of 20 epochs, and the learning rate is gradually decreased. We use an NVIDIA TITAN RTX GPU card for training, with a total training duration of 3 days.
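The two-phase schedule can be sketched as follows; the Adam hyper-parameters and learning-rate values shown here are placeholders (the exact numbers are not reproduced in this text), while the batch size of 8 and the 10 + 20 epoch schedule follow the description above.

```python
import torch

def train_tenet(bg_subnet, text_subnet, fusion_subnet, loader):
    pre_opt = torch.optim.Adam(
        list(bg_subnet.parameters()) + list(text_subnet.parameters()),
        lr=1e-4, betas=(0.5, 0.999))                       # placeholder values
    for epoch in range(10):                                # phase 1: pre-training
        for batch in loader:
            ...  # compute L_b and L_t, back-propagate with pre_opt

    joint_opt = torch.optim.Adam(
        list(bg_subnet.parameters()) + list(text_subnet.parameters())
        + list(fusion_subnet.parameters()), lr=1e-4, betas=(0.5, 0.999))
    sched = torch.optim.lr_scheduler.ExponentialLR(joint_opt, gamma=0.9)
    for epoch in range(20):                                # phase 2: joint training
        for batch in loader:
            ...  # compute L_TENet = L_b + L_t + L_f, back-propagate
        sched.step()                                       # decay the learning rate
```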
2) Training PCNet:
As shown in Fig. 11, training PCNet requires pairs of the original and the captured document images. PCNet learns the mapping from the original image to the captured image to simulate the print-and-scan distortions. We use dataset D_c in Sec. IV-A2 to train PCNet. The dataset D_c consists of 12 original images I_o and 132 captured images I_p of the documents. One may employ I_o and I_p to train PCNet so as to learn the distortions of the print-and-scan channel. However, in practice, it is often difficult for an attacker to obtain the original document image I_o. Alternatively, we use the denoised versions of the captured document images, I_d, as an approximation of the original images. In our experiment, the NoiseWare plugin in Adobe Photoshop [24] is employed to remove the distortions in the captured images.

Fig. 11. Illustration of the training data of PCNet. (a) I_d, the denoised version of the captured document image. (b) I_p, the captured document image.

Fig. 12. Illustration of the training data of IHNet. (a) I_h, the generated halftone image. (b) I_e, the edge image. (c) I_d, the denoised version of the captured document image.

In order to accommodate the network input size, all images in the dataset D_c are cropped into fixed-size image patches. In addition, data augmentation strategies such as rotation, cropping, and mirroring are carried out to expand the dataset D_c to 20,000 image patches, with 80% of the data used for training, 10% for validation, and 10% for testing. PCNet is trained for 20 epochs using the Adam optimizer with no weight decay, and the batch size is set to 8. The α parameters of the LeakyReLU and ELU activation functions are also fixed during training. The training process lasts for 1 day on an NVIDIA TITAN RTX GPU card.
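A minimal sketch of the patch extraction and augmentation used to expand D_c is given below; the patch size and the number of patches per image are placeholders, since the exact resolution is not reproduced in this text, while rotation, cropping and mirroring follow the description above.

```python
import random
import numpy as np

def augment_patches(image, patch=256, n_patches=200):
    """image: (H, W, 3) uint8 array; returns a list of augmented patches."""
    h, w = image.shape[:2]
    out = []
    for _ in range(n_patches):
        y = random.randint(0, h - patch)
        x = random.randint(0, w - patch)
        p = image[y:y + patch, x:x + patch].copy()   # random cropping
        if random.random() < 0.5:
            p = p[:, ::-1]                           # horizontal mirroring
        p = np.rot90(p, k=random.randint(0, 3))      # 90-degree rotations
        out.append(p)
    return out
```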
3) Training IHNet:
As shown in Fig. 12, the denoised document image I_d in dataset D_c is also used to train IHNet. The edge image I_e of I_d and the artificially generated halftone image I_h are also needed in the training process. The halftone image I_h is generated by applying color halftone patterns to I_d in Photoshop with the amplitude modulation technique and various parameters (random halftone angles for different color channels). The edge image I_e is an edge map of I_d obtained by Canny edge detection. Similarly, all images in the dataset are cropped into fixed-size patches to fit the size of the network input. Data augmentation strategies are also employed to expand the dataset to 20,000 images, which are then divided in a ratio of 8:1:1 into training, validation and testing sets. IHNet is trained with the Adam optimizer. Since the network is divided into three subnets, we first pre-train CoarseNet and EdgeNet, respectively. After 10 epochs, DetailNet joins the end-to-end training with a decaying rate of 0.9. The batch size is set to 8. The training stops after 20 epochs and lasts for 2 days on an NVIDIA TITAN RTX GPU card.

Fig. 13. Comparisons of SRNet [2] and different configurations of the proposed TENet on the synthetic character dataset D_t. (a) Original images. (b) Edited by SRNet [2]. (c) Edited by TENet without image differentiation (ID). (d) Edited by TENet without fine fusion (FF). (e) Edited by TENet without skeleton supervision (SS). (f) Edited by the proposed TENet. (g) Ground-truth. Differences between the results from TENet and the ground-truth are boxed out in blue. The SSIM metric computed from each edited document and the ground-truth is shown under each image from (b) to (f).
V. EXPERIMENTAL RESULTS
In the following, we first evaluate the performance of the proposed TENet on both the synthetic character dataset and the student card dataset without distortions from the print-and-scan channel. Then, the performance of ForgeNet (including TENet, PCNet and IHNet) is studied under practical setups, including forgery under channel distortion, forgery with a single sample, and attacking state-of-the-art document authentication systems. Finally, some future research directions on the detection of such forge-and-recapture attacks are discussed.
A. Performance Evaluation on TENet

1) Performance on the Synthetic Character Dataset:
In Sec. III-A, we propose the text editing network TENet by adapting SRNet [2] to our task. However, SRNet was originally designed for editing English alphabets and Arabic numerals in scene images for visual translation and augmented reality applications. As shown in Figs. 3, 4 and 13(b), it does not perform well on Chinese characters with complicated structures, especially in documents with complex backgrounds. In this part, we qualitatively and quantitatively examine the modules in TENet which are different from SRNet so as to show the effectiveness of our approach. The three main differences between SRNet and the proposed TENet are as follows. First, we perform an image differentiation operation between the source image I_s and the output O_b of the background inpainting subnet to obtain the styled text image without background, I'_s. Second, I'_s is then fed into a hard-coded component to extract the text skeleton of the styled text, which is then directly input to the text conversion subnet as supervision information. Third, instead of only using a general U-Net structure to fuse different components (as in SRNet), we adopt a fine fusion subnet in TENet with consideration of texture continuity. We randomly select 500 images from our synthetic character dataset D_t as a testing set for comparison. Quantitative analysis with three commonly used metrics is performed to evaluate the resulting image distortion, including Mean Square Error (MSE, a.k.a. l2 error), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) [29]. The edited results by different approaches are compared with the ground-truth to calculate these metrics.

TABLE II
COMPARISONS OF SRNET [2] AND DIFFERENT SETTINGS OF TENET. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD.

  Method          MSE      PSNR     SSIM
  SRNet [2]       0.032    16.44    0.519
  TENet w/o ID    0.027    17.37    0.687
  TENet w/o FF    0.019    19.14    0.635
  TENet w/o SS    0.015    19.75    0.708
  TENet
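The three metrics reported in Tab. II can be computed as in the following sketch, using scikit-image and assuming images are float arrays in [0, 1] of identical shape.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(edited, ground_truth):
    """Return (MSE, PSNR, SSIM) between an edited image and its ground-truth."""
    mse = float(np.mean((edited - ground_truth) ** 2))
    psnr = peak_signal_noise_ratio(ground_truth, edited, data_range=1.0)
    ssim = structural_similarity(ground_truth, edited,
                                 channel_axis=-1, data_range=1.0)
    return mse, psnr, ssim
```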
Image Differentiation (ID).
After removing the image differentiation part, we find that the text generation gets worse (as shown in Fig. 13(c)). The distortion is more severe in the case of source images with complex backgrounds. Due to the interference of the background textures, the text conversion subnet cannot accurately distinguish the foreground (characters) from the background. This leads to difficulty in extracting text style features, and character strokes are therefore distorted. For example, the residuals of the original characters are still visible in the background of the last two figures in Fig. 13(c). In contrast, using the differential image of the source image I_s and the output O_b as input to the text conversion subnet can avoid background interference, allowing the text conversion subnet to focus on extracting text styles without confusion from the background. This leads to a better text conversion performance. From Tab. II, we can see that without image differentiation, there is a significant degradation in MSE and PSNR compared to the proposed TENet. The above experiments indicate that the differentiation operation is essential in producing high-quality styled text.

Fine Fusion (FF).
The performance of TENet under complex backgrounds mainly relies on the fine fusion subnet. If the fine fusion subnet is removed, the resulting image suffers from a loss of high-frequency details. This is because the remaining subnets (the background inpainting subnet and the text conversion subnet) are of U-Net based structure, which downsamples the input images before reconstruction. As shown in Fig. 13(d), the resulting text images are blurry. Besides, SRNet does not take into account the continuity of the background texture during image reconstruction, so the texture components in its resulting images are discontinuous. The results in Tab. II show that the impact of removing the fine fusion component is much more significant than that of the others. This is due to the fact that the background region is much larger than the foreground region in our test images, and the contribution of the fine fusion subnet is mainly in the background.

Skeleton Supervision (SS). Visually, Chinese characters are much more complex than English alphabets and Arabic numerals in terms of the number of strokes and the interaction of the strokes within a character. The skeleton supervision information is important in providing accurate supervision on the skeleton of Chinese characters. If the skeleton is extracted using a general trainable network (as designed in SRNet) instead of a hard-coded component applied to the styled text, the text skeleton extraction performance will be degraded. As shown in Fig. 13(e), by removing the skeleton supervision component, the character strokes in the resulting images appear distorted and the characters are not styled correctly. From Tab. II, we learn that the skeleton supervision has less impact on the overall image quality, as it only affects the character stroke generation. However, the style of characters is vital in creating high-quality forgery samples.

In summary, the results look unrealistic in the absence of these three components, as shown in the ablation study in Fig. 13(c)-(e). The importance of image differentiation, fine fusion, and skeleton supervision is reflected in the quality of the characters, the background texture, and the character skeleton, respectively. Both the quantitative analysis and the visual examples clearly indicate the importance of the three components.

Fig. 14. Some failure cases. (a) Original. (b) Target text. (c) TENet. (d) Ground-truth. The SSIM metric computed from the edited document and the ground-truth is shown under image (c).

Although TENet shows excellent text editing performance on most document images, it still has some limitations. When the structure of the target characters is complex or the number of characters is large, TENet may fail. Fig. 14 shows two failure cases. In the top row, the performance of the text conversion subnet is degraded due to the complex structure and large number of strokes of the target characters, and thus the editing results show distortion of the strokes. In the bottom row, the text conversion is across languages and with different character lengths. In dataset D_t, we follow the dataset generation strategy of SRNet [2], where the source and the target styled characters have the same geometric attribute settings (e.g., size, position). However, for pairs of texts with different lengths, the strategy for setting the text geometry attributes is to make the overall style of the text with fewer characters converge to that of the text with more characters. Inevitably, some geometric attributes of the text with fewer characters are missing. The text conversion process of TENet faithfully transfers the geometric attributes from the source text to the target styled text, thus causing the generated results to deviate from the ground-truth. These failures occur because the number and variety of samples in the training data are insufficient, which leads to unsatisfactory generalization performance of the model. We believe that these problems could be alleviated by adding more complex characters and more font attributes to the training set.

Fig. 15. Visual comparison on the images of student cards without print-and-scan distortions. (a) Original image. (b) Edited by SRNet [2]. (c) Edited by the proposed TENet.
2) Performance on the Student Card Forgeries:
In Sec. V-A1, we performed an ablation study of the text editing module on a target text region of the document. However, it does not reflect the forgery performance on the entire image, including text, image and background regions, as shown in Fig. 5. In this part, we perform text editing on the captured student card images and stitch the edited text regions with the other regions to yield the forged document image. It should be noted that the print-and-scan distortion is not considered in this experiment since we are evaluating the performance of TENet.

In this experiment, SRNet [2] and the proposed TENet are compared in the text editing task with some student cards of different templates from dataset D_c. The training data contain 50,000 images of each type introduced in Sec. IV-A1. The height of all images is fixed to 128 pixels, and the original aspect ratio is maintained. The edited text fields are name, student number and expiry date, including Chinese characters, English alphabets and Arabic numerals. It should be noted that the text lengths may be different before and after editing. As can be seen from Fig. 15, the proposed TENet significantly improves the performance in character style conversion.

B. Performance Evaluation on ForgeNet

1) Ablation Study of PCNet and IHNet:
This part shows the tampering results of ForgeNet under print-and-scan distortion. The ForgeNet consists of three modules, namely, TENet, PCNet, and IHNet. We perform an ablation study to analyze the role of each module.

The role of TENet is to alter the content of the text region. However, as shown in Fig. 16(b), the resulting text regions from TENet are not consistent with the surrounding pixels. This is because the edited region has not been through the print-and-scan channel. The main channel distortions include color differences introduced by illumination conditions, different color gamuts and calibration in different devices, as well as halftoning patterns.

One of the most significant differences is in color, because the printing and scanning processes work with different color gamuts, and the resulting color will thus be distorted. Another difference is at the micro-scale of the image, which is introduced by the halftoning process and various sources of noise in the print-and-scan process. Thus, the role of PCNet is to pre-compensate the output images with print-and-scan distortions. As shown in Fig. 16(c), both the edited and background regions are more consistent after incorporating PCNet. However, the halftoning artifacts (visible yellow dots) remain. The remaining halftoning artifacts interfere with the halftoning patterns introduced in the recapturing (print-and-capture) process. Thus, IHNet removes the visible halftoning artifacts (as shown in Fig. 16(a) and (d)) before the recapturing attack is performed. The resulting image processed with both PCNet and IHNet is closer to the original image, which shows that all three modules in ForgeNet play important roles.

Fig. 16. Examples of the ablation study on ForgeNet. (a) The proposed ForgeNet (TENet+PCNet+IHNet). (b) TENet only (w/o PCNet & IHNet). (c) w/o IHNet. (d) w/o PCNet. The edited regions are boxed out in red.

Fig. 17. Visual comparison on the identity card images. (a) Original image. (b) Edited by SRNet [2]. (c) Edited by the proposed TENet. Both SRNet and TENet are fine-tuned with a single sample.
2) Document Forgery with a Single Sample:
In the previous section, we showed the performance of the proposed ForgeNet in editing student card images. However, the background regions of these samples are relatively simple, usually with solid colors or simple geometric patterns. In this part, we choose the Resident Identity Card of the People's Republic of China, which has a complex background, as the target document. Identity card tampering is a more practical and challenging task for evaluating the performance of the proposed ForgeNet. However, an identity card contains private personal information. It is very difficult to obtain a large number of scanned identity cards as training data. Thus, we assume the attacker has access to only one scanned identity card image, which is his/her own copy according to our threat model in Fig. 2(a). This identity card image is regarded as both the source document image (to be edited) and the sample in the target domain for fine-tuning TENet, PCNet and IHNet. The attacker then tries to forge the identity card image of a target person by editing the text.

The identity card is scanned with a Canoscan 5600F scanner at a resolution of 1200 DPI. The whole image is cropped according to the different network input sizes, and data augmentation is performed. In total, 5,000 image patches are generated to fine-tune the network. It is worth noting that the complex textures of the identity card background pose a significant challenge to the text editing task. To improve the background reconstruction performance, the attacker could include some additional texture images which are similar to the identity card background for fine-tuning. Some state-of-the-art texture synthesis networks can be employed to generate such textures automatically [30]. The image patches are fed to TENet, PCNet, and IHNet for fine-tuning. In order to obtain sensitive identity card information, we collected personal information from our research group to complete the forgery test. Ten sets of personal information (e.g., name, identity number) were gathered for a small-scale ID card tampering test, and 10 forged identity card images were generated accordingly. As shown in Fig. 17, some key information on the identity cards is mosaicked to protect personal privacy. It is shown that ForgeNet achieves a good forgery performance by fine-tuning with only one image, while the text and background in the image reconstructed by SRNet are distorted.

TABLE III
IDENTITY DOCUMENT AUTHENTICATION UNDER FORGE-AND-RECAPTURE ATTACK ON MEGVII FACE++ AI [31]. THE ITEMS "EDITED", "PHOTOCOPY", "IDENTITY PHOTO" AND "SCREEN" DENOTE THE PROBABILITIES OF IMAGE EDITING, PHOTOCOPIES, IDENTITY CARD IMAGES AND SCREEN RECAPTURING, RESPECTIVELY.

  No.   Edited   Photocopy   Identity Photo   Screen
  01    0        0           0.994            0.006
  02    0        0           0.996            0.004
  03    0        0           0.941            0.059
  04    0        0           0.913            0.087
  05    0.009    0           0.991            0
  06    0        0           0.958            0.042
  07    0        0           0.958            0.042
  08    0        0           0.989            0.011
  09    0        0           0.983            0.017
  10    0        0           0.977            0.023
3) Forge-and-Recapture Document Attack Authentication:
In this part, the forged identity card images obtained in Sec. V-B2 are processed by the print-and-scan channel to demonstrate the threat posed by the forge-and-recapture attack. The printing and scanning devices used in the recapturing process are a Canon G3800 and a Canoscan 5600F, respectively. The highest printing quality setting is employed. The print substrate is Kodak glossy photo paper. The scanned images are in TIFF or JPEG formats, with scanning resolutions (ranging from 300 DPI to 1200 DPI) adjusted according to the required sizes of the different authentication platforms.

The popular off-the-shelf document authentication platforms in China include Baidu AI, Tencent AI, Alibaba AI, Netease AI, Jingdong AI, MEGVII Face++ AI, iFLYTEK AI, Huawei AI, etc. However, the document authentication platforms which detect identity card recapturing and tampering are Baidu AI [32], Tencent AI [33], and MEGVII Face++ AI [31]. We uploaded the tampered results to these three state-of-the-art document authentication platforms for validation of the forge-and-recapture identity documents.

The authentication results on MEGVII Face++ AI are shown in Tab. III. It is shown that the 10 forge-and-recapture identity images in our test are all successfully authenticated. All the tested images also pass the other two authentication platforms (including inspection against editing, recapturing, etc.). Given the fact that the state-of-the-art document authentication platforms have difficulties in distinguishing the forge-and-recapture document images, this fully demonstrates the success of our attack. It calls for immediate research effort in detecting such attacks.

C. Discussion on Detection of the Forge-and-Recapture Attack
As discussed in Section I, the main focus of this work is tobuild a deep learning-based document forgery network to studythe risk of existing digital document authentication system.Thus, developing forensics algorithm against the forge-and-recapture attack is not in the scope of this work. Moreover,in order to study such attack, a large and well-recognizeddataset of forge-and-recapture document images is needed.However, no such dataset is currently available in the publicdomain. Without such resource, some data-driven benchmarksin digital image forensics with hundreds or thousands featuredimensions [34], [35] are not applicable. Meanwhile, this workenables an end-to-end framework for generating high qualityforgery document, which facilitates the construction of a large-scale and high-quality dataset. Last but not least, it has beenshown in our parallel work [27] that the detection of documentrecapturing attack alone (without forgery) is not a trivial taskwhen the devices in training and testing sets are different.The performance of generic data-driven approaches (e.g.,ResNet [18]) and traditional machine learning approach withhandcrafted features (e.g., LBP+SVM [36]) are studied. Thedetection performance degraded seriously in a cross datasetexperimental protocol where different printing and imagingdevices are used in collecting the training and testing dataset.VI. C
VI. CONCLUSION
In this work, the feasibility of employing deep learning-based technology to edit document images with complicated characters and complex backgrounds is studied. To achieve good editing performance, we address the limitations of existing text editing algorithms towards complicated characters and complex backgrounds by avoiding unnecessary confusion among different components of the source images (via the image differentiation component introduced in Sec. III-A2), and by constructing a texture continuity loss and providing auxiliary skeleton information (via the fine fusion and skeleton supervision components in Sec. III-A3). Comparisons with the existing text editing approach [2] confirm the importance of our contributions. Moreover, we propose to mitigate the visual artifacts of the text editing operation by post-processing (color pre-compensation and inverse halftoning) that considers the print-and-scan channel. Experimental results show that the consistency among different regions in a document image is maintained by this post-processing. We also demonstrate the document forgery performance under a practical scenario where an attacker generates an identity document with only one sample in the target domain. Finally, the recapturing attack is employed to cover the forensic traces of the text editing and post-processing operations. The forge-and-recapture samples created by the proposed attack have successfully fooled some state-of-the-art document authentication systems. From this study, we conclude that the advancement of deep learning-based text editing techniques has already introduced significant security risks to our document images.
REFERENCES

[1] Aliyun, "Security AI Challenger Program Phase 5: Counterattacks on Forged Images," Retrieved 7-Dec. 2020, from: https://tianchi.aliyun.com/competition/entrance/531812/introduction.
[2] L. Wu, C. Zhang, J. Liu, J. Han, J. Liu, E. Ding, and X. Bai, "Editing Text in the Wild," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1500–1508.
[3] P. Roy, S. Bhattacharya, S. Ghosh, and U. Pal, "STEFANN: Scene Text Editor using Font Adaptive Neural Network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13228–13237.
[4] Q. Yang, J. Huang, and W. Lin, "SwapText: Image based Texts Transfer in Scenes," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14700–14709.
[5] P. Korshunov and S. Marcel, "Deepfakes: A New Threat to Face Recognition? Assessment and Detection," arXiv preprint arXiv:1812.08685, 2018.
[6] S. Shang, X. Kong, and X. You, "Document Forgery Detection using Distortion Mutation of Geometric Parameters in Characters," Journal of Electronic Imaging, vol. 24, no. 2, p. 023008, 2015.
[7] S. Abramova et al., "Detecting Copy–move Forgeries in Scanned Text Documents," Electronic Imaging, vol. 2016, no. 8, pp. 1–9, 2016.
[8] T. Thongkamwitoon, H. Muammar, and P.-L. Dragotti, "An Image Recapture Detection Algorithm based on Learning Dictionaries of Edge Profiles," IEEE Transactions on Information Forensics and Security, vol. 10, no. 5, pp. 953–968, 2015.
[9] S. Agarwal, W. Fan, and H. Farid, "A Diverse Large-scale Dataset for Evaluating Rebroadcast Attacks," IEEE, 2018, pp. 1997–2001.
[10] H. Li, P. He, S. Wang, A. Rocha, X. Jiang, and A. C. Kot, "Learning Generalized Deep Feature Representation for Face Anti-spoofing," IEEE Transactions on Information Forensics and Security, vol. 13, no. 10, pp. 2639–2652, 2018.
[11] H. Chen, G. Hu, Z. Lei, Y. Chen, N. M. Robertson, and S. Z. Li, "Attention-based Two-stream Convolutional Networks for Face Spoofing Detection," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 578–593, 2019.
[12] H. Li, S. Wang, and A. C. Kot, "Image Recapture Detection with Convolutional and Recurrent Neural Networks," Electronic Imaging, vol. 2017, no. 7, pp. 87–91, 2017.
[13] D. C. Garcia and R. L. de Queiroz, "Face-spoofing 2D-detection based on Moiré-pattern Analysis," IEEE Transactions on Information Forensics and Security, vol. 10, no. 4, pp. 778–786, 2015.
[14] X. Zhang and Y. LeCun, "Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?" arXiv preprint arXiv:1708.02657, 2017.
[15] L. Zhang, C. Chen, and W. H. Mow, "Accurate Modeling and Efficient Estimation of the Print-capture Channel with Application in Barcoding," IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 464–478, 2018.
[16] Y. Zhou, Z. Zhu, X. Bai, D. Lischinski, D. Cohen-Or, and H. Huang, "Non-stationary Texture Synthesis by Adversarial Expansion," ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–13, 2018.
[17] K. Saeed, M. Tabędzki, M. Rybnik, and M. Adamski, "K3M: A Universal Algorithm for Image Skeletonization and a Review of Thinning Techniques," International Journal of Applied Mathematics and Computer Science, vol. 20, no. 2, pp. 317–335, 2010.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[19] T.-H. Kim and S. I. Park, "Deep Context-aware Descreening and Rescreening of Halftone Images," ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–12, 2018.
[20] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-scale Image Recognition," arXiv preprint arXiv:1409.1556, 2014.
[21] J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual Losses for Real-time Style Transfer and Super-resolution," in European Conference on Computer Vision. Springer, 2016, pp. 694–711.
[22] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image Style Transfer using Convolutional Neural Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
[23] E. Wengrowski and K. Dana, "Light Field Messaging with Deep Photographic Steganography," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1515–1524.
[24] Imagenomic, "Noiseware for Adobe Photoshop," Retrieved 7-Dec. 2020, from: https://imagenomic.com/Products/Noiseware.
[25] R. V. Barry and J. Ambro, "Technique for Generating Additional Colors in a Halftone Color Image through Use of Overlaid Primary Colored Halftone Dots of Varying Size," May 3 1994, US Patent 5,309,246.
[26] C. Chen, M. Li, A. Ferreira, J. Huang, and R. Cai, "A Copy-Proof Scheme Based on the Spectral and Spatial Barcoding Channel Models," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1056–1071, 2019.
[27] C. Chen, S. Zhang, and F. Lan, "A Database for Digital Image Forensics of Recaptured Document," arXiv preprint arXiv:2101.01404, 2021.
[28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image Translation with Conditional Adversarial Networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[29] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image Quality Assessment: from Error Visibility to Structural Similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[30] W. Xian, P. Sangkloy, V. Agrawal, A. Raj, J. Lu, C. Fang, F. Yu, and J. Hays, "TextureGAN: Controlling Deep Image Synthesis with Texture Patches," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[34] IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
[35] H. Li, W. Luo, X. Qiu, and J. Huang, "Identification of Various Image Operations using Residual-based Features," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 1, pp. 31–45, 2016.
[36] D. Wen, H. Han, and A. K. Jain, "Face Spoof Detection with Image Distortion Analysis,"