Efficient video integrity analysis through container characterization

Pengpeng Yang, Student Member, IEEE, Daniele Baracchi, Massimo Iuliani, Dasara Shullani, Rongrong Ni, Yao Zhao, Senior Member, IEEE, and Alessandro Piva, Fellow, IEEE
Abstract—Most video forensic techniques look for traces within the data stream that are, however, mostly ineffective when dealing with strongly compressed or low resolution videos. Recent research highlighted that useful forensic traces are also left in the video container structure, thus offering the opportunity to understand the life-cycle of a video file without looking at the media stream itself. In this paper we introduce a container-based method to identify the software used to perform a video manipulation and, in most cases, the operating system of the source device. As opposed to the state of the art, the proposed method is both efficient and effective and can also provide a simple explanation for its decisions. This is achieved by using a decision-tree-based classifier applied to a vectorial representation of the video container structure. We conducted an extensive validation on a dataset of video files including both software-manipulated contents (ffmpeg, Exiftool, Adobe Premiere, Avidemux, and Kdenlive), and videos exchanged through social media platforms (Facebook, TikTok, Weibo and YouTube). This dataset has been made available to the research community. The proposed method achieves an accuracy of . in distinguishing pristine from tampered videos and classifying the editing software, even when the video is cut without re-encoding or when it is downscaled to the size of a thumbnail. Furthermore, it is capable of correctly identifying the operating system of the source device for most of the tampered videos.

Index Terms—video forensics, video container, social media, integrity, authentication, video tampering, decision trees, machine learning.
This work was supported in part by the National Key Research and Development of China (No. 2016YFB0800404), the National NSF of China (Nos. 61672090, 61532005, U1936212) and the Fundamental Research Funds for the Central Universities (Nos. 2018JBZ001, 2017YJS054). It was also supported in part by the Air Force Research Laboratory and in part by the Defense Advanced Research Projects Agency under Grant FA8750-16-2-0188. Finally it was supported by the Italian Ministry of Education, Universities and Research MIUR under Grant 2017Z595XS.
P. Yang, R. Ni, and Y. Zhao are with the Beijing Key Laboratory of Advanced Information Science and Network Technology and the Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China.
D. Baracchi, M. Iuliani, D. Shullani, and A. Piva are with the Department of Information Engineering, University of Florence, Via di S. Marta, 3, 50139 Florence, Italy.
M. Iuliani and A. Piva are with the FORLAB, Multimedia Forensics Laboratory, PIN Scrl, Piazza G. Ciardi, 25, 59100 Prato, Italy.
Please address correspondence to Rongrong Ni (email: [email protected]) and Alessandro Piva (email: alessandro.piva@unifi.it).
Digital Object Identifier 10.1109/JSTSP.2020.3008088
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
I. INTRODUCTION
Digital videos are becoming more and more relevant in the communication among users and in providing information. Recent statistics show that the current global average of video consumption per day stands at 84 minutes and it is expected to increase and hit 100 minutes per day by 2021. Therefore, it is not surprising that digital videos are often involved in investigations and other forensic analysis. At the same time, video editing programs, both open source (e.g. ffmpeg) and commercial (e.g. Adobe Premiere), allow users to easily cut and manipulate videos to create fake contents.
Video Forensics develops algorithms for assessing video integrity and authenticity by looking at the digital traces left during the video life-cycle [1]. Most of the existing video forensic techniques verify the authenticity of a video file by investigating the presence of inconsistencies in pixel statistics. For example, double encoding or manipulation can be detected by analyzing prediction residuals [2] or macroblock types [3, 4, 5]. Similarly, traces of frame rate up-conversion can be used to prove malicious video processing [6, 7]. Recent works have also successfully employed deep-learning techniques to detect video forgeries [8]. Jamimamul et al. [9] focused on inter-frame video forgery detection by designing a 3D convolutional neural network. Verde et al. [10] introduced a CNN-based approach to detect and localize splicing manipulation by learning video codec traces.
A major drawback of most of those techniques is their high computational cost; furthermore, strong compressions and downsampling often hide forensic traces, thus severely restricting the number of scenarios where those methods can be employed.
Recently, a new research branch highlighted that video integrity can be determined using information hidden in the whole video file and not just in the video stream [12].
Video files, in fact, are written to disk using a specific structure called container, comprising multiple streams (video, audio, subtitles) and metadata, which are exploited by decoding software to correctly reproduce the video. Guera et al. [13] showed how to identify forged videos without looking at the pixel space. To do so, they extracted high level features (multimedia stream descriptors) related to video coding using ffprobe. However, the tree-shaped container structure is not taken into account, thus discarding most of the overall available information. Iuliani et al. [14] highlighted that video integrity can be assessed by looking at the video file container structure, since any post-processing operation alters the content and the position of some atoms and field-values. This approach also turned out to be promising in classifying the source brand of native videos.
However, Iuliani et al. [14] merely detects a loss of integrity, without providing a human-interpretable explanation of the reasoning behind its decisions. Furthermore, the method has a linear computational cost since it requires checking the dissimilarity of the probe video with all available reference containers. As a consequence, an increase of the reference dataset size leads to a higher computational effort.
Note that integrity and authenticity are different concepts. Integrity is proved when the imagery is complete and unaltered, from the time of acquisition or generation through the life of the imagery; indeed, content authentication is used to determine whether the visual content depicted in imagery is a true and accurate representation of subjects and events. More details can be found in the Best Practices for Image Authentication of the Scientific Working Group on Digital Evidences [11].
Furthermore, both [13, 14] do not provide any characterization of manipulated videos, nor any explainability of the achieved outcome.
In this paper we introduce an efficient method for the analysis of video file containers that allows both to characterize identified manipulations and to provide an explanation for the outcome. The proposed approach is based on Decision Trees [15], a non-parametric learning method used for classification problems in many signal processing fields. Their key feature is the ability to break down a complex decision-making process into a collection of simpler decisions. We enriched the tool with a likelihood ratio framework designed to automatically clean up the container elements that only contribute to source intra-variability.
With respect to the state of the art, the proposed method, simply called EVA from now on, offers new forensic opportunities, such as: identifying the manipulating software (e.g. Adobe Premiere, ffmpeg, ...); providing additional information related to the original content history, such as the source device operating system.
The process is extremely efficient since a decision can be taken by checking the presence of a small number of features, independently of the video length or size. Furthermore, EVA can provide a simple explanation for the process leading to an outcome, since the container symbols used to take a decision can be inspected. To the best of our knowledge, this is the first video forensic method with all these desirable traits. Experiments have been performed using videos produced by modern smartphones from some of the most popular brands on the market, e.g. Apple, Samsung, LG, Huawei. Tampered contents were generated using both automated processing and manual user operations. This approach allowed us to build a sizeable, realistic dataset. Manipulations include contents generated using
Exiftool, ffmpeg, Adobe Premiere, Avidemux and Kdenlive.
Eventually, we investigated whether a container-based approach is effective when dealing with videos exchanged through Facebook, TikTok, Weibo, and YouTube. Overall, the experimental validation involved seven thousand videos.
This paper is organised as follows: Section II describes the video container standard; Section III introduces the mathematical tools to represent and analyse the video file container; Sections IV and V are devoted to the experimental validation of the proposed techniques; finally, Section VI draws some final remarks and outlines future works.

Fig. 1. Pictorial representation of an MP4-like video container structure (root; ftyp with majorBrand, minorVersion and compatibleBrands fields; moov with mvhd creationTime, modificationTime, timescale and duration, plus trak sub-trees; mdat).

II. VIDEO FILE FORMAT
Most smartphones and compact cameras output videos in mp4, mov, or format. These video packagings refer to the same standard, ISO/IEC 14496 Part 12 [16], that defines the main features of MP4 [17] and MOV [18] containers while leaving a wide margin for those who implement it. In Fig. 1 we provide an example of an MP4-like container, a tree-like structure describing the video file with respect to three aspects: how the bytes are organized (physical aspect); how the audio/video streams are synchronized (temporal aspect); and how the latter two aspects are linked (logical aspect). Each node (atom) is identified by a unique 4-byte code. It consists of a header which describes its role in the container and possibly some associated data. The first atom to appear in a container has to be ftyp, since it defines the best usage and compatibility of the video content. The video structural information is separated from the data itself; indeed, the former is stored in the movie atom (moov) and the latter in the media data atom (mdat). The moov atom links the logical and timing relationships of the video samples, and provides pointers to their mdat location. It is worth noting that the moov sub-tree can contain one or more trak atoms, depending on the number of streams present in a video (i.e. visual-stream and/or audio-stream).

III. PROPOSED APPROACH
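To make the atom layout of Section II concrete before formalizing its representation, a minimal ISO BMFF box walker can be sketched as follows. This is our own illustration, not the paper's code (the experiments rely on the MP4 Parser library [23]), and the list of container atoms is deliberately partial:

```python
import struct

# Container atoms whose payload is itself a sequence of atoms (partial list).
CONTAINERS = (b"moov", b"trak", b"mdia", b"minf", b"stbl", b"udta")

def parse_atoms(data, offset=0, end=None, depth=0):
    """Walk the box tree of an ISO/IEC 14496-12 (MP4/MOV-like) buffer.

    Each atom starts with a 4-byte big-endian size and a 4-byte type
    code; a size of 1 means a 64-bit size follows, 0 means "to end".
    Returns (depth, type, size) tuples in document order."""
    end = len(data) if end is None else end
    atoms = []
    while offset + 8 <= end:
        size, kind = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if size == 1:                       # 64-bit "largesize" variant
            size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
            header = 16
        elif size == 0:                     # atom extends to end of file
            size = end - offset
        if size < header:                   # malformed atom: stop walking
            break
        atoms.append((depth, kind.decode("latin-1"), size))
        if kind in CONTAINERS:              # recurse into sub-atoms
            atoms += parse_atoms(data, offset + header, offset + size,
                                 depth + 1)
        offset += size
    return atoms
```

On a real file, the resulting (depth, type) paths, together with the field values stored in each atom, are exactly the raw material from which the field- and value-symbols defined below are built.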
We can represent a video container as a labelled tree where internal nodes and leaves correspond to, respectively, atoms and field-value attributes. A video container X can be characterised by the set of symbols {s_1, ..., s_m}, where s_i can be: (i) the path from the root to any field (value excluded), also called field-symbols; (ii) the path from the root to any field-value (value included), also called value-symbols. An example of this representation can be:

s_1 = [ftyp/@majorBrand]
s_2 = [ftyp/@majorBrand/isom]
...
s_i = [moov/mvhd/@duration]
s_{i+1} = [moov/mvhd/@duration/73432]
...

Overall, we denote with Ω the set of all unique symbols s_1, ..., s_M available in the world set of digital video containers X = {X_1, ..., X_N}. Similarly, C = {C_1, ..., C_s} denotes a set of possible origins (e.g., Huawei P9, Apple iPhone 6s). Given a container X, the different structure of its symbols {s_1, ..., s_m} can be exploited to assign the video to a specific class C_u. Note that @ is used to identify atom parameters, and root is used for visualization purposes but is not part of the container data.
For this purpose binary decision trees [19] are employed to build a set of hierarchical decisions. In each internal tree node the input data is tested against a specific condition; the test outcome is used to select a child as the next step in the decision process. Leaf nodes represent decisions taken by the algorithm. An example is reported in Fig. 3. More specifically, in our approach we adopted the growing-pruning-based Classification And Regression Trees (CART) [20].
Given the size of unique symbols |Ω| = M, a video container X is converted into a vector of integers X ↦ (x_1, ..., x_M) where x_i is the number of times that s_i occurs in X. This approach is inspired by the bag-of-words representation [21] used to reduce variable-length documents to a fixed-length vectorial representation.
Note that X contains several symbols that are not representative of any class, thus contributing to class intra-variability only (e.g. information related to video length, acquisition date and time). This information is expected to introduce noise in the decision process and should possibly be removed. Thus, we pre-filtered the data in Ω by using the likelihood ratio framework. Given two classes C_u, C_v, u ≠ v, and a symbol s_i, the log-likelihood ratio (LLR)

log L_{u,v}(s_i) = log [ P(s_i | C_u) / P(s_i | C_v) ]    (1)

is computed by approximating the conditional probabilities as P(s_i | C_u) = W_{C_u}(s_i) and P(s_i | C_v) = W_{C_v}(s_i), where W_{C_u}(s_i) and W_{C_v}(s_i) are the frequencies of s_i in C_u and C_v respectively (we avoid null frequencies by adding one to both the numerator and the denominator). The symbol s_i is preserved only if ∃ u, v, u ≠ v : log L_{u,v}(s_i) > τ, with τ a threshold; otherwise it is considered useless and then removed from Ω. It should be noted that using the likelihood ratio we can possibly keep a field-symbol while discarding its corresponding value-symbol, or vice-versa. In this way we can automatically understand whether the value or the field is relevant for the classification.
As an example, we consider two classes: C_u: iOS devices, native videos; C_v: iOS devices, dates modified through Exiftool (see Section V for details). In Fig. 2 some achieved LLRs are reported. The symbols moov/udta/XMP_/@stuff, moov/udta/XMP_/@count, wide/@stuff, wide/@count are clearly relevant in identifying this kind of operation on devices equipped with iOS. On the other hand, symbols like free/@stuff will possibly be filtered since their LLR is close to zero. In this case, the manipulation only affects a small set of symbols. Indeed, the decision tree can detect such a processing in a single step, by looking, for instance, at the presence of moov/udta/XMP_/@stuff, as shown in Fig. 3.

Fig. 2. LLRs of symbols obtained when comparing native videos (Native-iOS) with ones altered through Exiftool (Exiftool-iOS). Values far from zero are automatically included in the analysis. Values close to zero in all compared classes are excluded from the analysis. Note that @ is used to identify a tree leaf. (Shown symbols include root/wide/@count, root/wide/@stuff, root/@modelName, root/free/@count, root/free/@stuff, root/uuid/@userType, root/moov/udta/@count, root/moov/udta/@stuff, root/moov/udta/XMP_/@count and root/moov/udta/XMP_/@stuff.)

Fig. 3. Decision tree applied to distinguish iOS native videos from iOS videos with modified dates through Exiftool. The decision is easily explainable since it is taken by simply looking at the presence of the symbol moov/udta/XMP_/@stuff.
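As an illustrative sketch of this filtering step (our own code, not the authors'; a container is modelled simply as the set of symbols it exhibits, and the add-one smoothing is one plausible reading of the paper's description), the LLR of Eq. (1) and the pruning rule can be written as:

```python
import math

def log_likelihood_ratio(symbol, videos_u, videos_v):
    """Smoothed LLR of a container symbol between two classes.

    Each class is a list of containers, a container being the set of
    symbols it exhibits; one is added to numerator and denominator
    counts to avoid log(0), following the paper's description."""
    count_u = sum(symbol in v for v in videos_u) + 1
    count_v = sum(symbol in v for v in videos_v) + 1
    freq_u = count_u / (len(videos_u) + 1)
    freq_v = count_v / (len(videos_v) + 1)
    return math.log(freq_u / freq_v)

def filter_symbols(all_symbols, classes, tau):
    """Keep a symbol only if some ordered pair of classes yields an
    LLR above the threshold tau; the rest are dropped from the set."""
    kept = set()
    for s in all_symbols:
        if any(log_likelihood_ratio(s, u, v) > tau
               for u in classes for v in classes if u is not v):
            kept.add(s)
    return kept
```

Because L_{v,u} = 1/L_{u,v}, scanning all ordered pairs with a one-sided threshold also catches symbols whose LLR is strongly negative for some pair, matching the "values far from zero" criterion of Fig. 2.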
IV. INTEGRITY VERIFICATION
The first relevant experimental question is whether the proposed approach is capable of distinguishing between pristine and tampered videos. To answer that we created a new collection of videos, starting from VISION [22], a publicly available dataset that includes native videos from smartphones of different brands. As it would not have been feasible to perform the editing operations, upload, and download of all the videos in VISION, we selected videos for each device, thus obtaining a total of pristine videos. Then, we created ( × editing operations) tampered videos, both automatically generated with ffmpeg and Exiftool, and manually created through Kdenlive, Avidemux and Adobe Premiere. More specifically:

• cut with re-encoding: each video was cut through ffmpeg and re-encoded (ffmpeg version 3.4.6, command ffmpeg -i $file -ss 00:00:03 -t 00:00:05 -vcodec libx264 -acodec copy $name);
• cut without re-encoding: each video was cut through ffmpeg by copying the audio and video coding parameters to minimize the traces left by the operation (ffmpeg -i $file -ss 00:00:03 -t 00:00:05 -c copy $name);
• speed up: each video was speeded up through ffmpeg (ffmpeg -i $file -vf "setpts=0.25*PTS" $name for all the other devices);
• slow down: each video was cut through ffmpeg and slowed down (ffmpeg -i $file -ss 00:00:03 -t 00:00:15 -vf "setpts=4*PTS" $name);
• cut + downscale: each video was cut through ffmpeg and downscaled to the resolution of 320 × 240 (ffmpeg -i $file -ss 00:00:03 -t 00:00:15 -vf scale=320:240 $name);
• cut-kd: each video was manually cut through Kdenlive (v17.12.3) by keeping to seconds, and then the video was saved with the "MP4 - the dominating format (H264/AAC)" setting;
• cut-av: each video was manually cut through Avidemux (v2.7.4) by keeping to seconds, and then the video was saved with the copy and MP4 Muxer settings;
• cut-ap: each video was manually cut through Adobe Premiere Pro CC 2019 by keeping to seconds and by saving as H.264 with medium bitrate setting;
• date change: each video was manually processed through Exiftool (v11.37) to change the date information within the metadata.

We considered ffmpeg, Exiftool, Avidemux and Kdenlive for two main reasons: 1) some of them can forge videos in an automated way, thus allowing us to create a dataset of tampered videos large enough to obtain statistically significant results; 2) they allow even a novice to create persuasive forged videos, for instance by cutting specific frames, slowing down or speeding up the streams.
Indeed, some real-world forged videos involved such operations. The White House suspended access to CNN's Jim Acosta, after he refused to give up the microphone while asking a question about the Russia investigation at a news conference with President Trump. However, the video reporting the event was possibly speeded up. Another example is a viral clip of Nancy Pelosi that has been edited to give the impression that the Democratic House speaker was drunk or unwell. We also considered videos manually forged with Adobe Premiere, a proficient video editing tool that can be used by an expert to produce fake contents.
Furthermore, all the produced contents ( pristine videos and tampered ones) were exchanged through different social media platforms:
• YouTube videos: manual upload on YouTube and automated download through youtube-dl (command line youtube-dl -f mp4 -o "%(title)s.%(ext)s" "videos_list_link");
• Facebook videos: manual upload and download from Facebook with the 'SD' setting;
• TikTok videos: manual upload and download from TikTok 10.0.0 via a HUAWEI Mate 30 Pro 5G device with the system of EMUI 10.0.0, Android 10. Several accounts were used to overcome the uploading limitation;
• Weibo videos: videos (from now on the EVA-7K Dataset).

EVA-7K is available for download from our research group site https://lesc.dinfo.unifi.it/en/datasets. The date change operation is performed with the command exiftool "-AllDates=1986:11:05 12:00:00" $videos. Sources for the real-world examples above: https://bit.ly/2vniNi5 and https://bit.ly/2Vx2BGj.
The container structure, described in Section II, is extracted from each video by means of the MP4 Parser library [23]. Note that, due to how the dataset was built, some value-symbols are always present in some classes even if they are not relevant for their identification. For instance, all the cut videos have the same duration even if this is not, per se, relevant for identifying the editing. As this could lead to artificially higher performance, we manually removed the value-symbols associated to the following fields: @author, @count, @creationTime, @depth, @duration, @entryCount, @flags, @gpscoords, @matrix, @modelName, @modificationTime, @name, @sampleCount, @segmentDuration, @size, @stuff, @timescale, @version, @width, @height, @language.
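This value-symbol removal can be sketched as follows (our own simplified illustration, not the authors' code; the field list is abbreviated and the path convention follows the paper's symbol notation):

```python
# Fields whose *values* are stripped (abbreviated; see the text above
# for the full set used in the experiments).
STRIPPED_FIELDS = {"@duration", "@creationTime", "@modificationTime",
                   "@size", "@count", "@stuff"}

def strip_value_symbols(symbols):
    """Drop value-symbols (field/value paths) whose terminal field is
    blacklisted, while keeping the corresponding field-symbols."""
    kept = []
    for s in symbols:
        parts = s.split("/")
        # in a value-symbol the last component is the value and the
        # one before it is the @field name
        if len(parts) >= 2 and parts[-2] in STRIPPED_FIELDS:
            continue
        kept.append(s)
    return kept
```

Note that the field-symbol itself (e.g. moov/mvhd/@duration) survives the filter; only the attached value does not, which is exactly the asymmetry the text motivates.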
TABLE I
BALANCED ACCURACIES OBTAINED IN THE BASIC SCENARIO FOR EACH DEVICE.

Device  Balanced Accuracy    Device  Balanced Accuracy    Device  Balanced Accuracy
D01     1.00                 D13     1.00                 D26     1.00
D02     1.00                 D14     1.00                 D27     1.00
D03     1.00                 D15     1.00                 D28     1.00
D04     0.99                 D16     1.00                 D29     1.00
D05     1.00                 D18     1.00                 D30     1.00
D06     1.00                 D19     1.00                 D31     1.00
D07     1.00                 D20     1.00                 D32     1.00
D08     1.00                 D21     1.00                 D33     1.00
D09     1.00                 D22     1.00                 D34     1.00
D10     1.00                 D23     1.00                 D35     1.00
D11     1.00                 D24     1.00
D12     0.50                 D25     1.00
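Table I's per-device figures come from a leave-one-device-out protocol with frequency-balanced class weights, detailed in the following paragraphs. With scikit-learn such a protocol can be sketched as follows (variable names are illustrative assumptions, not the authors' code):

```python
# 'features' is the per-video symbol-count matrix, 'labels' the target
# classes, 'devices' the source device id of each video (all names
# illustrative).
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(features, labels, devices):
    # class_weight="balanced" mirrors the inverse-frequency class weights
    clf = DecisionTreeClassifier(class_weight="balanced")
    logo = LeaveOneGroupOut()           # each fold holds out one device
    return cross_val_score(clf, features, labels, groups=devices,
                           cv=logo, scoring="balanced_accuracy")
```

Holding out whole devices, rather than random videos, is what makes the reported accuracies estimates of performance on unseen hardware.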
It should also be noted that VISION is composed of several iOS/Android devices and a single Windows phone. We removed this latter device (D17) from our tests since it is not a representative sample for Windows Phone devices. For this reason, our approach aims to distinguish between iOS and Android videos only. This is a negligible limitation given that Windows Phone devices represent less than . of the mobile devices market [24].
In order to estimate the real-world performance of the proposed method we adopted an exhaustive leave-one-out cross-validation strategy. We partitioned our dataset in subsets, each one of them containing pristine, manipulated, and social-exchanged videos belonging to a specific device. We performed each of the experiments hereby described times, each time keeping one of the subsets out as test set, and using the remaining for training our model. In this way, test accuracies collected after each iteration are computed on videos belonging to an unseen device. We reported the mean accuracies obtained among all the iterations as confusion matrices. During the training we assigned to each class a weight inversely proportional to the class frequency. We used the decision tree algorithms included in scikit-learn [25], a freely available Python toolkit for machine learning.
We trained our method to distinguish between the two classes "Pristine" (containing videos) and "Tampered" (containing videos). We obtained a global balanced accuracy of . , failing only for videos produced by D12 (see Table I). The low accuracy obtained on such a device is reasonably due to the fact that it is the sole Sony smartphone in our dataset.
As a consequence of our strict leave-one-device-out strategy, we have no videos belonging to a Sony device in our training set when D12 is tested. Thus, our algorithm cannot learn the features needed to correctly classify those videos. This limitation does not always apply, as different camera models can exhibit very similar containers. In such a case, a native video can be correctly classified even if the specific originating device is unavailable in the training set. This is the case of the LG D290 (D04), which reaches an accuracy of . .
We also compared our method with two recently proposed algorithms for video integrity [13, 14]. In Table II we report the mean global accuracy and the average runtime per fold for the proposed approach and for those two methods.

TABLE II
COMPARISON OF OUR METHOD WITH THE STATE OF THE ART. VALUES OF ACCURACY AND TIME ARE AVERAGED OVER THE FOLDS.

                   Balanced accuracy  Training time  Test time
Guera et al. [13]  0.67               347 s          <
EVA                                                  <

A. Discussion
EVA provides several improvements with respect to the state of the art. In comparison with Güera et al. [13] we achieve a higher accuracy. This can be reasonably attributed to their use of a smaller feature space; indeed, only a subset of the available pieces of information is extracted without considering their position within the video container. On the contrary, EVA features also include the path from the root to the value, thus providing a stronger discriminating power. Indeed, this approach allows to distinguish between two videos where the same information is stored in different atoms. When compared with Iuliani et al. [14], EVA is capable of obtaining better classification performance with a lower computational cost. In Iuliani et al. [14] O(N) comparisons are required since all the N reference-set examples must be compared with a tested video; on the contrary, the cost for a decision tree analysis is O(1) since the output is reached in a constant number of steps.
Furthermore, EVA allows a simple explanation for the outcome. For the sake of example, we report in Fig. 4(a) a sample tree from the integrity verification experiment: the decision is taken by up to four checks, just based on the presence of the symbols ftyp/@minorVersion=0, uuid/@userType, moov/udta/XMP_ and moov/udta/auth. We also report in Fig. 4(b) a tree from the blind scenario experiment: in this case the tree needs to check the absence of just one atom to classify a YouTube video; at the same time a series of more complex checks is used to assign a video to other classes. This shows how a single decision tree can handle both easy- and hard-to-classify cases at the same time. Neither [14] nor [13] provide an equivalent feature. Moreover, EVA is equipped with a formal likelihood ratio framework that can estimate the relevance of symbols for specific tasks. This framework has been used to automatically remove symbols that only contribute to class intra-variability.
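The kind of explanation discussed above comes for free with standard decision-tree tooling. As an illustration (toy data and hypothetical symbol names loosely borrowed from Fig. 4, not the actual trained model), scikit-learn can print the learned symbol checks directly:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical symbols and toy occurrence counts for four videos.
symbols = ["ftyp/@minorVersion=0", "uuid/@userType", "moov/udta/XMP_"]
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
y = ["native", "native", "tampered", "tampered"]

clf = DecisionTreeClassifier().fit(X, y)
# The tree itself is the explanation: each split names a symbol check.
print(export_text(clf, feature_names=symbols))
```

Reading the printed rules off the tree is precisely the "inspectable decision" property that the distance-based approach of [14] and the descriptor-based approach of [13] lack.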
V. MANIPULATION CHARACTERIZATION
We also performed a set of experiments designed to show that the proposed method, as opposed to the state of the art, is also capable of identifying the manipulating software and the operating system of the originating device. More specifically, we tried to answer the following questions:
A) Software identification: Is the proposed method capable of identifying the software used to manipulate a video? If yes, is it possible to identify the operating system of the original video?
B) Integrity Verification on Social Media: Given a video from a social media platform (YouTube, Facebook, TikTok or Weibo), can we determine whether the original video was pristine or tampered?
C) Blind scenario: Given a video that may or may not have been exchanged through a social media platform, is it possible to retrieve some information on the video origin?

Fig. 4. Pictorial representation of some of the generated decision trees: (a) integrity verification classifier; (b) detail of a blind scenario classifier.
A. Software identification
In this scenario we only analyze videos that either are native, or that have undergone a manipulation. This time, however, we trained our algorithm to classify which software has been used to tamper the video, if any. Our classes are thus: "native" ( videos), "Avidemux" ( videos), "Exiftool" ( videos), "ffmpeg" ( videos), "Kdenlive" ( videos), and "Premiere" ( videos).
In this experiment EVA obtained a global balanced accuracy of . ; the detailed results reported in Table III show that the algorithm achieved a slightly lower accuracy in identifying ffmpeg with respect to the other tools. This is reasonably due to the fact that the ffmpeg library is used by other software and, internally, by Android devices.
We also trained our algorithm to classify both the editing software used to tamper the video, if any, and the operating system of the device originally used for the acquisition. The classes for this scenario are: "Android-native" ( videos), "iOS-native" ( videos), "Android-avidemux" ( videos), "iOS-avidemux" ( videos), "Android-exiftool" ( videos), "iOS-exiftool" ( videos), "Android-ffmpeg" ( videos), "iOS-ffmpeg" ( videos), "Android-kdenlive" ( videos), "iOS-kdenlive" ( videos), "Android-premiere" ( videos), and "iOS-premiere" ( videos).
A summary of the results obtained by this experiment is reported in Table IV. Our approach maintains good performance in correctly identifying the editing software. We notice, however, that the operating system used for videos manipulated with Kdenlive or with Adobe Premiere is often misclassified. At the same time, both those programs are always identified correctly. This indicates that the container structure of videos saved by Kdenlive and Adobe Premiere is probably reconstructed in a software-specific way.

TABLE III
CONFUSION MATRIX FOR THE SOFTWARE IDENTIFICATION SCENARIO.

          Native  Avidemux  Exiftool  ffmpeg  Kdenlive  Premiere
Native
Avidemux  -       1.00      -         -       -         -
Exiftool
ffmpeg    -       0.01      -         0.90    0.09      -
Kdenlive  -       -         -         -       1.00      -
Premiere  -       -         -         -       -         1.00
B. Integrity Verification on Social Media
In this scenario we tested YouTube, Facebook, TikTok and Weibo videos to determine whether they were pristine or manipulated prior to the upload.
A summary of the results obtained by our method is reported in Table V. We achieved global balanced accuracies of 0.76, 0.80, 0.79, and 0.60 on Facebook, TikTok, Weibo, and YouTube, respectively. Such results are characterised by low true negative rates, and thus the method cannot be considered effective in this scenario, as many tampered videos are incorrectly classified as pristine.
The poor performance is mainly due to the social media transcoding process, which flattens the containers almost independently of the video origin. As an example, after YouTube transcoding, videos produced by Avidemux and by Exiftool have exactly the same container representation. We do not know how the videos are processed by the considered platforms due to the lack of public documentation, but we can assume that uploaded videos undergo custom/multiple processing. Indeed, social media videos need to be viewable on a great range of platforms, and thus need to be transcoded to multiple video codecs and adapted for multiple resolutions and bitrates. Thus, it seems plausible that those operations could discard most of the original container structure.
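For reference, TPR, TNR and one common definition of balanced accuracy can be computed from binary confusion counts as in the sketch below (a generic illustration, not the authors' code; the paper's figures are additionally averaged over devices and folds, so they need not equal the plain mean of TPR and TNR):

```python
def balanced_metrics(tp, fn, tn, fp):
    """TPR, TNR and balanced accuracy from binary confusion counts."""
    tpr = tp / (tp + fn)        # true positive rate (sensitivity)
    tnr = tn / (tn + fp)        # true negative rate (specificity)
    return tpr, tnr, (tpr + tnr) / 2
```

The low-TNR pattern in Table V corresponds to a large fp term here: tampered-then-uploaded videos being accepted as pristine.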
C. Blind scenario
In this scenario we considered videos that may or may not have been exchanged through a social media platform, and we would like to extract the most complete information possible. We used all the videos in our dataset and we trained our classifier to distinguish (i) whether the video was downloaded from a social media platform; (ii) whether the video was tampered and, if so, which software was used; (iii) whether the original video belonged to an Android or iOS device.
A summary of the results obtained by our method is reported in Table VI. Even without any prior knowledge of the video origin, we are still able to distinguish between native and tampered videos. Our method is also capable of correctly identifying videos belonging to YouTube, Facebook, TikTok and Weibo, even though in those cases it is not possible to make further claims on the video authenticity. In most cases we are also able to correctly classify the operating system of the source device.

TABLE IV
CONFUSION MATRIX FOR THE SOFTWARE IDENTIFICATION SCENARIO WHEN THE OS IS TAKEN INTO ACCOUNT.

                    Native          Avidemux        Exiftool        ffmpeg          Kdenlive        Premiere
                    Android  iOS    Android  iOS    Android  iOS    Android  iOS    Android  iOS    Android  iOS
Native    Android   0.95     -      -        -      0.05     -      -        -      -        -      -        -
          iOS       -        1.00   -        -      -        -      -        -      -        -      -        -
Avidemux  Android   -        -      0.95     0.05   -        -      -        -      -        -      -        -
          iOS       -        -      -        1.00   -        -      -        -      -        -      -        -
Exiftool  Android   0.01     -      -        -      0.99     -      -        -      -        -      -        -
          iOS       -        -      -        -      -        1.00   -        -      -        -      -        -
ffmpeg    Android   0.01     -      0.01     0.05   -        -      0.75     -      -        0.15   0.04     -
          iOS       -        -      -        -      -        -      -        1.00   -        -      -        -
Kdenlive  Android   -        -      -        -      -        -      -        -      0.75     0.25   -        -
          iOS       -        -      -        -      -        -      -        -      0.38     0.62   -        -
Premiere  Android   -        -      -        -      -        -      -        -      -        -      0.79     0.21
          iOS       -        -      -        -      -        -      -        -      -        -      0.37     0.63

TABLE V
PERFORMANCE ACHIEVED FOR INTEGRITY VERIFICATION ON SOCIAL MEDIA CONTENTS. WE REPORT FOR EACH SOCIAL NETWORK THE OBTAINED ACCURACY, TRUE POSITIVE RATE (TPR), AND TRUE NEGATIVE RATE (TNR). ALL THESE PERFORMANCE MEASURES ARE BALANCED.

          Accuracy  TNR   TPR
Facebook  0.76      0.40  0.86
TikTok    0.80      0.51  0.75
Weibo     0.79      0.45  0.82
YouTube   0.60      0.36  0.74
ONCLUSIONS
In this paper we proposed an efficient forensic method for checking video integrity. If a manipulation is detected, the proposed method makes it possible to identify the editing software and, in most cases, whether the original video belonged to an Android or iOS device. This is achieved by exploiting a decision tree classifier applied to a vector-based representation of the video container structure, enriched with the likelihood-ratio framework that is employed to automatically remove container elements that only contribute to source intra-variability. In case of tampered videos, the proposed method is able to characterise the software that performed the manipulation with an accuracy of ., even when the video is cut without re-encoding. Except for manipulations performed with Adobe Premiere and Kdenlive, the proposed method correctly determines the operating system of the video source device.

As opposed to the state of the art, the proposed method is extremely efficient and can provide a simple explanation for its decisions. A new experimental dataset of 7000 videos was also created and shared with the research community, including contents generated with five editing tools (ffmpeg, Exiftool, Adobe Premiere, Avidemux, and Kdenlive) and four social media platforms (Facebook, TikTok, Weibo and YouTube).

The current limitation of the method is that a container-based approach can identify whether a video belongs to a social media platform like YouTube, Facebook, TikTok or Weibo, but it cannot be effectively applied on such contents for authenticity assessment, since the transcoding operation wipes out most of the forensic traces from the video container. Future work will aim to improve our tool by adding handcrafted features to improve the performance on social media contents.

ACKNOWLEDGMENT
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory or the U.S. Government. Pengpeng Yang would like to acknowledge the China Scholarship Council, State Scholarship Fund, that supports his joint Ph.D. program.
TABLE VI
CONFUSION MATRIX FOR THE BLIND SCENARIO.

                     Native        Avidemux      Exiftool      ffmpeg        Kdenlive      Premiere      Facebook  TikTok  Weibo  YouTube
                     Android iOS   Android iOS   Android iOS   Android iOS   Android iOS   Android iOS
Native    Android    1.00    -     -       -     -       -     -       -     -       -     -       -     -         -       -      -
          iOS        -       1.00  -       -     -       -     -       -     -       -     -       -     -         -       -      -
Avidemux  Android    -       -     0.95    -     -       -     -       -     -       -     -       -     -         -       0.05   -
          iOS        -       -     -       1.00  -       -     -       -     -       -     -       -     -         -       -      -
Exiftool  Android    0.01    -     -       -     0.99    -     -       -     -       -     -       -     -         -       -      -
          iOS        -       -     -       -     -       1.00  -       -     -       -     -       -     -         -       -      -
ffmpeg    Android    0.05    -     0.01    -     -       -     0.75    -     -       0.15  -       -     -         -       0.05   -
          iOS        -       -     -       -     -       -     -       1.00  -       -     -       -     -         -       -      -
Kdenlive  Android    -       -     -       -     -       -     -       -     0.75    0.25  -       -     -         -       -      -
          iOS        -       -     -       -     -       -     -       -     0.38    0.62  -       -     -         -       -      -
Premiere  Android    -       -     -       -     -       -     -       -     -       -     0.80    0.20  -         -       -      -
          iOS        -       -     -       -     -       -     -       -     -       -     0.37    0.63  -         -       -      -
Facebook             -       -     -       -     -       -     -       -     -       -     -       -     1.00      -       -      -
TikTok               -       -     -       -     -       -     -       -     -       -     -       -     -         1.00    -      -
Weibo                -       -     -       -     -       -     -       -     -       -     -       -     -         -       1.00   -
YouTube              -       -     -       -     -       -     -       -     -       -     -       -     -         -       -      1.00