Provenance Filtering for Multimedia Phylogeny
Allan Pinto, Daniel Moreira, Aparna Bharati, Joel Brogan, Kevin Bowyer, Patrick Flynn, Walter Scheirer, Anderson Rocha
Department of Computer Science and Engineering, University of Notre Dame, IN, U.S.A.
Institute of Computing, University of Campinas, SP, Brazil
ABSTRACT
Departing from traditional digital forensics modeling, which seeks to analyze single objects in isolation, multimedia phylogeny analyzes the evolutionary processes that influence digital objects and collections over time. One of its integral pieces is provenance filtering, which consists of searching a potentially large pool of objects for the ones most related to a given query, in terms of possible ancestors (donors or contributors) and descendants. In this paper, we propose a two-tiered provenance filtering approach to find all the potential images that might have contributed to the creation process of a given query q. In our solution, the first (coarse) tier aims to find the most likely "host" images (the major donor or background) contributing to a composite/doctored image. The search is then refined in the second tier, in which we search for more specific (potentially small) parts of the query that might have been extracted from other images and spliced into the query image. Experimental results with a dataset containing more than a million images show that the two-tiered solution, underpinned by the context of the query, is highly useful for solving this difficult task.

Index Terms: Provenance Filtering; Multimedia Phylogeny; Phylogeny Graph; Provenance Context Incorporation.
1. INTRODUCTION AND RELATED WORK
Rather than focusing on checking the integrity of a single multimedia object (as most methods proposed from the early 2000s until recently did), some researchers in digital forensics are now seeking to leverage all possible information associated with a pool of objects, analyzing their space and time relationships. Such recent efforts are made possible by a research field known as Multimedia Phylogeny [3, 1], a relatively new discipline that studies the evolutionary processes that influence multimedia objects and collections, as well as the relationships among transformed versions of an object, looking for causal and ancestry relationships, the types of transformations, and the order in which they were applied. Such new developments are necessary to adapt forensics methods to a rapidly evolving society. The increasingly frequent occurrence of image and video compositions on the Internet and social media renders the applications of phylogeny very useful in practical scenarios such as content tracking, forensics, and copyright enforcement [3, 1]. Within this new reality, forensics analysts are interested not only in determining whether a digital object is fake or real but also
This material is based on research sponsored by DARPA and the Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173. Hardware support was generously provided by the NVIDIA Corporation. We also thank the financial support of FAPESP (Grant
[Fig. 1, panel (a): semantically-similar and near-duplicate images (original; cropping + resizing; exposure + saturation). Panel (b): a multiple-parenting multimedia phylogeny setup, with an image composition, its potential host (major donor), and several alien donors.]
Fig. 1. Contrasting multimedia phylogeny applied to near-duplicate images (a) and image composites with several donors (b). While the former focuses on finding relationships among images that have similar overall context, the latter aims at finding the genealogy of an asset, including all possible near duplicates of the composition itself and of its donors. Example in (a) from [1]; example in (b) from the NIST Nimble 2016 dataset [2].

in pinpointing who created it, what happened, and when and how (genealogy) an asset was created. This process might be of significant importance in the era of post-truth [4, 5, 6] for determining how a composition was crafted, what parts went into creating the composite, and whether there was re-staging, re-purposing, or an overall change of semantics [7]. Nonetheless, before analyzing a pool of objects looking for possible kinship relationships, we need to be able to comb through large quantities of data looking for the very pieces potentially associated with a given query q. This task needs to be performed prior to subsequent multimedia phylogeny steps, namely the pairwise image dissimilarity calculations and the phylogenetic graph analysis and construction, and it is referred to herein as provenance filtering. Most of the work thus far in multimedia phylogeny has overlooked the provenance filtering task, considering it to be a reasonably well-solved problem [3, 1]. The rationale behind that assumption was that most phylogeny works focused on finding the evolutionary processes associated with near-duplicate [3] and semantically-similar images [1]. In both setups, original images may undergo transformations over time but cannot have their overall semantics changed. When we consider forged and composite images, we bring new elements to the table.
In this case, we now have the appearance of multiple parenting phylogeny [8], a setup in which an image might be the composite result of several other images, each with its own evolutionary chain of modifications. The composite image itself might also have its own chain of descendants, and so on. Fig. 1(a) shows an example of semantically-similar images in which an original image might undergo several transformations and generate offspring. Each child can also generate others. However, the transformations tend to keep the overall meaning of the scene. In turn, as we see in Fig. 1(b), an image in a multiple parenting setup might be the result of combining several others, each of which has its own chain of ancestors and descendants. Near-duplicate detection (NDD) methods [9, 10, 11, 12, 13] work properly for the task of finding semantically-similar images (Fig. 1(a)), upon which phylogeny graph construction algorithms could operate later on. However, NDD methods might fail in the presence of multiple donors (Fig. 1(b)), given that the context and meaning of each donor is too diverse to be represented and captured by current methods. Moreover, each donor might undergo several transformations in the composition creation process, including color, geometric, and affine operations. For those cases, even partial near-duplicate detection methods could fail [14]. Likewise, traditional content-based image retrieval (CBIR) methods [15] would not work directly either, as they often aim to determine the overall meaning of the scene and its generalization to provide the user with similar images respecting the principles of novelty and diversity [16]. While related work for multimedia phylogeny abounds, prior work on provenance filtering is almost non-existent. In terms of phylogeny, Dias et al. [3] presented a minimum spanning tree-based algorithm to find a directed graph that represents the phylogeny tree of a group of near-duplicate images.
This work was extended to deal with images from multiple cameras and their near duplicates [1]. Other media have also been considered, such as videos [17, 18], audio [19] and text [20]. Oliveira et al. [8] extended the image phylogeny formulation to deal with multiple donors and descendants simultaneously, which is more aligned with the context of this paper. However, their work assumes the candidate images are known a priori. Important advances have been made on finding ancestral relationships between pairs of images; nevertheless, the performance of such algorithms is significantly degraded if a good set of potentially related images is not found beforehand. In this vein, we extend upon image representation and indexing techniques (common in the NDD and CBIR areas) to deal with provenance filtering for multiple donor and composite images. Our technique comprises two stages: in the first, we query an image collection for the most likely donors that might have contributed to the creation of the query, if it is a composite. This is done following a traditional CBIR pipeline, which involves image representation through appropriate features and the adoption of a subsequent indexing mechanism (more details in Sec. 2). The top retrieved results are then analyzed and compared to the query using scale- and rotation-invariant points of interest [21], the nearest-neighbor distance ratio policy [22], and geometric alignment [23]. After finding the best possible match to the query, we use that image along with the query to calculate a contextual mask that serves as an activation of possible regions that are different between them. Such regions are candidate regions for possible donors. We then proceed with the second stage of the search, querying the collection for images that are similar to the selected regions of interest in the query, as pointed out by the contextual mask. Ultimately, we aggregate the different rankings to create a final ranked list of images related to the query in terms of possible donors contributing to its creation process, thus closing the loop for provenance filtering.

[Fig. 2: offline stage (collection indexing and image characterization) and online stage (querying, best match r_best, and context incorporation).]

Fig. 2. Method's pipeline. After retrieving related images, we compare the best result with q, incorporate the search's context, and perform a second search to refine the list of possible donors.

The contributions of this work are (i) the exploration of different querying and indexing techniques for the new problem of provenance filtering; (ii) the incorporation of provenance context to single out possible candidate regions related to donors in the creation process of a query; and (iii) the study of the efficiency and effectiveness tradeoffs involved in the provenance filtering task when dealing with very large collections of images.
2. PROPOSED METHOD
In this section, we present the proposed approach to provenance filtering. Given a query q, such as the image in the center of Fig. 1(b), the objective is to search a collection of images C for all potential donors r_i contributing to the creation of q, including possible near duplicates r_ij of r_i. Near duplicates of q are also of interest, as they would be important for tracing the offspring of q over time. Our approach to this problem involves two stages (c.f., Fig. 2). In the first stage, we design a fast image retrieval solution to recover the (likely) donor images with high precision. We then exploit the context of the results to find the best match r_best (respecting geometric constraints) with respect to q and refine the donor list. Regions that are different between q and its top-related image r_best are of interest, as they show regions that might have been incorporated into q by combining pieces of different images in C. Leveraging the contextual mask, the second stage of the search examines C a second time, focusing on finding potential localized donors. In the example of Fig. 1(b), when querying the collection for potential donors (first tier/stage), we would likely retrieve the image with the table, flower and their background, or the hand (as both are major contributors to the composite q). Calculating the contextual mask gives the region of the hand as a potential donor spliced from another source image(s). Therefore, when performing the second search, we look for images similar to that region, which would result in the donor for the hand as well as the other pieces. This process can be repeated a number of times if necessary. The different retrieved lists of results might be combined through rank-aggregation techniques based on the confidence of the retrieved results. The first step of our approach needs to represent each image in a robust manner so as to allow us to retrieve partially related images in a large collection.
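The two-tier loop just described can be sketched in a few lines of Python. This is an illustrative skeleton only: `search_collection`, `best_geometric_match`, and `contextual_mask` are hypothetical stand-ins for the retrieval, geometric-verification, and masking components detailed in the remainder of this section.

```python
def two_tier_filtering(query, collection, search_collection,
                       best_geometric_match, contextual_mask, top_k=50):
    """Sketch of the two-tier search; rankings are (image_id, votes) lists."""
    # Tier 1: retrieve the likely hosts / major donors for the whole query.
    first_rank = search_collection(query, collection, top_k)

    # Pick the best geometrically consistent match (likely the host).
    r_best = best_geometric_match(query, first_rank)
    if r_best is None:  # nothing in common with the query: stop at tier 1
        return first_rank

    # Regions where query and host differ are candidate spliced donors.
    mask = contextual_mask(query, r_best)

    # Tier 2: re-query the collection using only the masked regions.
    second_rank = search_collection((query, mask), collection, top_k)

    # Aggregate both rankings by retrieval confidence (vote counts).
    combined = {}
    for rank in (first_rank, second_rank):
        for image_id, votes in rank:
            combined[image_id] = combined.get(image_id, 0) + votes
    return sorted(combined.items(), key=lambda kv: -kv[1])
```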
In this context, using bags of words [15] or deep learning techniques [24] would likely fail, as they would be good for retrieving similar images in general but would not capture possible transformed donors, especially the small or heavily processed ones. In addition, a deep learning solution would require large image collections spanning different forgeries for proper training and, in forensics, such collections are simply not available. In face of these limitations, we opted to represent each image using points of interest robust to image transformations, as forgeries often employ such transformations for more photorealistic montages. For that, we rely upon Speeded-Up Robust Features (SURF) [21]. We represent an image with about 2000 keypoints for small-scale experiments and with about 500 keypoints for large-scale ones.
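The per-image keypoint budget mentioned above (about 2000 keypoints for small-scale runs, about 500 for large-scale ones) can be enforced by keeping only the strongest detections. A minimal sketch, assuming each keypoint comes with a detector response score (as SURF detectors provide); the function name is illustrative:

```python
import numpy as np

def strongest_keypoints(descriptors, responses, budget):
    """Keep only the `budget` keypoints with the highest detector
    response, capping per-image indexing cost at large scale."""
    responses = np.asarray(responses)
    descriptors = np.asarray(descriptors)
    if len(responses) <= budget:
        return descriptors
    top = np.argsort(responses)[::-1][:budget]  # strongest first
    return descriptors[top]
```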
Given a query image q and a collection of images C for searching, we need to represent the images in C in a very compact fashion so as to allow fast querying. For that, we use an indexing algorithm for finding nearest neighbors of q, in terms of their representative keypoints. More specifically, after extracting the points of interest for all images in C, we need to find the k-nearest points to each keypoint in q. We further perform majority voting to infer the similarity between the query image q and each image in C based on the nearest keypoints retrieved from the gallery. As the number of points of interest extracted from C might reach hundreds of millions, comparing q against all images in C using brute-force search is impracticable. Therefore, we investigated algorithms for ε-approximate nearest neighbors, adequate for large-scale searches. According to Arya [25], an approximate search can be achieved by considering (1 + ε)-approximate nearest neighbors, i.e., points k for which dist(k, l) ≤ (1 + ε) · dist(p, l), where p is the true nearest neighbor of the query point l. Nonetheless, these solutions might lose effectiveness depending on the heuristic adopted to speed up the search. For this reason, here we compare four indexing approaches in terms of runtime, memory footprint, and quality of the search: KD-Trees and KD-Forests [26], Hierarchical Clustering [27], and Product Quantization [28]. To retrieve the donor images with high recall rates, we propose a query-refinement process, referred to as context incorporation, in which we use the ranking result obtained in the first tier to reformulate the query so that small objects used to compose the spliced image can be retrieved more accurately. First, we need to make sure the query is well represented in terms of describing keypoints. The overrepresentation of the query q aims at guaranteeing we sample basically all of its regions, including the background.
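A minimal sketch of the first-tier voting follows, using SciPy's cKDTree, whose `eps` argument implements exactly this (1 + ε)-approximate search. The descriptor matrices and the per-keypoint image labels are assumed inputs; this is one possible realization, not the exact implementation used in the paper.

```python
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def rank_gallery(query_desc, gallery_desc, gallery_image_ids, k=5, eps=1.0):
    """Rank gallery images by majority voting over the
    (1+eps)-approximate k-nearest neighbors of each query keypoint."""
    tree = cKDTree(gallery_desc)
    # eps > 0 returns (1+eps)-approximate nearest neighbors.
    _, idx = tree.query(query_desc, k=k, eps=eps)
    votes = Counter()
    for neighbors in np.atleast_2d(idx):
        for j in neighbors:
            votes[gallery_image_ids[j]] += 1  # one vote per matched keypoint
    return votes.most_common()
```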
Although SURF descriptors are robust for describing objects in a scene in general, this approach will most likely fail to find interest points inside very small objects, mainly when such objects are placed in a complex background. To overcome this problem, we perform a query refinement by computing the intersection between q and the best-matching retrieved image (most likely the host/background donor). This leads to a new query image containing just the information about the objects added to the host image. Our second search stage consists of querying the collection using the keypoints falling within the selected regions of interest. We combine the different ranked lists using the confidence of the retrieved images (number of votes and keypoints matched).

Fig. 3. Example of a query, its top-related donor, and the contextual mask. In the top row, the contextual mask captures the added rocks, person, bird, and red-dirt region. In turn, the mask in the second row captures the added umbrella, the content-smoothed sand on the left, and the deleted white bird.
To find the contextual mask, we perform an image registration between q and the top-match image r_best in the ranked list obtained in the first tier of the search. We match SURF features extracted from both images, select the best-matching keypoints, and calculate the distance between the two images using the selected pairs of matches. We then estimate the geometric transformation of r_best with respect to q via image homography. Next, we compute the mask that indicates the candidate regions in which we might have spliced objects. We generate this mask by computing the difference between the geometrically aligned images, followed by a morphological opening and a median filter to reduce the residual noise present in the mask. We also perform color quantization to 32 levels before computing the difference between the two images to further reduce noise in the mask. There are some extreme cases of this approach that are worth discussing. First, when the top retrieved image does not have anything in common with q, the calculated mask should be null. In this case, there should be no search in the second tier. In turn, when q itself is not a composite, the top retrieved image might be non-related at all (case one above) or a near-duplicate of q, in which case the mask is virtually identical to q. In the latter case, the search in the second tier should result in basically the same images retrieved in the first tier. Fig. 3 depicts examples of a query q, its top result r, and the calculated contextual masks.
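The masking steps can be sketched with NumPy/SciPy as follows, under two simplifying assumptions: the two images are already geometrically aligned (the homography estimation described above is omitted), and the structuring-element and median-filter sizes, which are not specified here, are taken as 3x3.

```python
import numpy as np
from scipy import ndimage

def contextual_mask(query, r_best, levels=32, thresh=2):
    """Candidate spliced regions between two aligned grayscale images
    (uint8 arrays of identical shape).  Homography-based alignment is
    assumed to have been applied beforehand."""
    # Coarse quantization suppresses compression/interpolation noise.
    qq = (query.astype(np.int32) * levels) // 256
    rr = (r_best.astype(np.int32) * levels) // 256
    diff = np.abs(qq - rr) >= thresh
    # Morphological opening removes small, noisy activations...
    mask = ndimage.binary_opening(diff, structure=np.ones((3, 3)))
    # ...and a median filter cleans residual speckle.
    return ndimage.median_filter(mask.astype(np.uint8), size=3)
```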
3. EXPERIMENTS AND RESULTS
In this section, we present and discuss the experiments performed to validate the proposed method. We report the quality of the results in terms of Recall@k, which measures the fraction of correct images among the top-k retrieved results. The source code of all proposed methods is freely available.

Datasets.
We adopt the Nimble Challenge 2016 (NC2016) and 2017 (NC2017) datasets, provided by the National Institute of Standards and Technology (NIST) [2], which focus on forensics, provenance filtering, and phylogeny tasks. These datasets comprise a query set containing different kinds of manipulated images (e.g., copy-move and compositions) and a gallery set containing the source images used to produce the queries. The datasets also comprise distractor images. The probe sets of NC2016 and NC2017 contain composite images, and the gallery sets contain the corresponding source images. We also embed the datasets within one million images (distractors) provided by RankOne Inc. (http://medifor.rankone.io/), as recommended by NIST for evaluating scalability. (The source code is freely available at https://gitlab.com/notredame-provenance/filtering.)

[Fig. 4: four Recall@K panels (KD-Tree, KD-Forest, PQ, HCAL), each comparing first-tier ranking results with context incorporation.]
Fig. 4. First- and second-tier results for the NC2017 dataset in terms of Recall@k. Context incorporation is important regardless of the indexing technique used.
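The Recall@k figures reported throughout can be computed with a few lines of Python; `retrieved` is the ranked list of image IDs returned for a query and `relevant` is the ground-truth set of donor/host images for that query (names are illustrative):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant (ground-truth donor/host) images
    that appear among the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)
```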
Table 1. Runtime (in seconds) and memory usage (GB), per query, in the first tier, for different indexing techniques on the NC2017 and NC2017 + World1M datasets. KD-Forest comprises two trees. * denotes that the method did not scale.

Method               KD-Tree   KD-Forest   PQ     HCAL
Runtime              . s       . s         . s    . s
Memory               . GB      . GB        . GB   . GB
Runtime (World1M)    . s       . s         *      *
Memory (World1M)     . GB      . GB        *      *

Indexing Method.
We now analyze (see Table 1) different indexing approaches for NC2017 and NC2017 + World1M in terms of memory footprint and efficiency (results for NC2016 are similar), considering an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz with 24 cores and 512GB of RAM. Although PQ is more efficient in terms of storage at a small scale, it does not scale to World1M. The clustering in HCAL also prevented it from scaling to one million images; more work involving approximate clustering and sampling would be necessary in this case. KD-Tree shows a good storage and efficiency tradeoff.
Context Incorporation and Ranking Aggregation.
In this section, we evaluate the proposed approach to improve ranking results for donor images. Fig. 4 shows the performance results in terms of recall at the top-k retrieved images, considering the retrieval of donor images in the first and second tiers of the proposed method. Although not shown here, the performance for retrieving the host image is always above 95%, as it shares much content with q. The challenge in provenance filtering is in retrieving the donors.

Large-scale Image Retrieval.
We now evaluate the proposed approach in a more challenging scenario, in which we embed the NC2016 and NC2017 datasets into one million images, hereinafter referred to as the World1M dataset. The World1M dataset contains several images that are semantically similar to the images composing both datasets. Table 2 shows the results obtained in this experiment. There is a gain when retrieving donors for NC2016 when we compare the results obtained in the first and second tiers. The results for NC2017 are slightly lower given that the composite images in this dataset are more difficult, more photorealistic, and smaller with respect to the whole image, which also impacts the context incorporation in the second tier (first- and second-tier results remain equal in this case). Future work includes improving the context-incorporation mask to better capture small donors such as those present in NC2017.

Table 2. Performance results for the NC2016 and NC2017 datasets embedded in one million images, using KD-Forest (2 trees). Bold highlights improvements in the second tier.

Dataset              Type    Tier   Recall@10
NC2016 + World1M     Host    2nd    %
NC2016 + World1M     Donor   2nd    %
NC2017 + World1M     Host    2nd    .
NC2017 + World1M     Donor   2nd    .

Fig. 5. Queries and results for KD-Forest with two trees. The first and third rows refer to the first-tier results, while the second and fourth refer to the second tier. The green border denotes the matched host, while the blue ones denote donors. The search in the second tier allows the retrieval of donors that were not present in the first tier.
Qualitative Analysis.
Fig. 5 shows the results of two queries for KD-Forest with two trees in the first and second tiers.
4. CONCLUSIONS
In this paper, we introduced a first method for provenance filtering designed to improve the retrieval of donor images for composite images. Reliable provenance filtering is highly useful for selecting the most promising candidates for more complex analyses in the multimedia phylogeny pipeline, such as graph construction and inference of the directionality of donors and descendants. The challenge in this problem is the retrieval of small objects from a large image gallery. By incorporating the context of the top results with respect to the query itself, we can improve the retrieval results and better find possible donors of a given composite (forged) query q. Experiments with different indexing techniques have also shown that KD-Forests seem to be the most effective but not the most efficient. KD-Trees, on the other hand, are more efficient but less effective. In our experiments, PQ did not perform well for large galleries. Future research efforts will focus on better characterizing small forged regions, incorporating forgery detectors in the process of context analysis, and bringing the user into the loop with relevance-feedback methods.

5. REFERENCES

[1] Zanoni Dias, Siome Goldenstein, and Anderson Rocha, "Toward image phylogeny forests: Automatically recovering semantically similar image relationships," Forensic Science International.
[2] National Institute of Standards and Technology (NIST), Nimble Challenge 2016/2017 datasets.
[3] Zanoni Dias, Anderson Rocha, and Siome Goldenstein, "Image phylogeny by minimal spanning trees," IEEE Transactions on Information Forensics and Security (TIFS), vol. 7, no. 2, pp. 774–788, April 2012.
[4] Ralph Keyes, The Post-Truth Era: Dishonesty and Deception in Contemporary Life, Macmillan, 2004.
[5] Jonathan Mahler, "The problem with self-investigation in a post-truth era," The New York Times Magazine, January 1st, 2017. Available online at http://tinyurl.com/juufufc.
[6] Katherine Schulten and Amanda Christy Brown, "Evaluating sources in a 'post-truth' world: Ideas for teaching and learning about fake news," The New York Times, January 19th, 2017. Available online at http://tinyurl.com/h3w7rp8.
[7] A. Rocha, W. Scheirer, T. E. Boult, and S. Goldenstein, "Vision of the unseen: Current trends and challenges in digital image and video forensics," ACM Computing Surveys (CSUR), vol. 43, pp. 1–42, 2011.
[8] Alberto A. de Oliveira, Pasquale Ferrara, Alessia De Rosa, Alessandro Piva, Mauro Barni, Siome Goldenstein, Zanoni Dias, and Anderson Rocha, "Multiple parenting phylogeny relationships in digital images," IEEE Transactions on Information Forensics and Security, vol. 11, no. 2, pp. 328–343, 2016.
[9] Yan Ke, Rahul Sukthankar, and Larry Huston, "Efficient near-duplicate detection and sub-image retrieval," in ACM Intl. Conference on Multimedia, 2004, pp. 869–876.
[10] Wengang Zhou, Yijuan Lu, Houqiang Li, Yibing Song, and Qi Tian, "Spatial coding for large scale partial-duplicate web image search," in ACM Int. Conference on Multimedia, New York, NY, USA, 2010, pp. 511–520.
[11] S. Tang, H. Chen, K. Lv, and Y. D. Zhang, "Large visual words for large scale image classification," in IEEE Int. Conference on Image Processing (ICIP), Sept. 2015, pp. 1170–1174.
[12] J. Yuan and X. Liu, "Product tree quantization for approximate nearest neighbor search," in IEEE Int. Conference on Image Processing (ICIP), Sept. 2015, pp. 2035–2039.
[13] K. H. Zeng, Y. C. Lin, A. Farhadi, and M. Sun, "Semantic highlight retrieval," in IEEE Int. Conference on Image Processing (ICIP), Sept. 2016, pp. 3359–3363.
[14] Wei Dong, Zhe Wang, Moses Charikar, and Kai Li, "High-confidence near-duplicate image detection," in ACM Int. Conference on Multimedia Retrieval, New York, NY, USA, 2012, pp. 1:1–1:8.
[15] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Computing Surveys (CSUR), vol. 40, no. 2, p. 5, 2008.
[16] Thomas Deselaers, Tobias Gass, Philippe Dreuw, and Hermann Ney, "Jointly optimising relevance and diversity in image retrieval," in ACM Int. Conference on Multimedia Retrieval, 2009, p. 39.
[17] Zanoni Dias, Anderson Rocha, and Siome Goldenstein, "Video phylogeny: Recovering near-duplicate video relationships," in IEEE Int. Workshop on Information Forensics and Security (WIFS), 2011, pp. 1–6.
[18] Silvia Lameri, Paolo Bestagini, Ambra Melloni, Simone Milani, Anderson Rocha, Marco Tagliasacchi, and Stefano Tubaro, "Who is my parent? Reconstructing video sequences from partially matching shots," in IEEE Int. Conference on Image Processing (ICIP), 2014, pp. 5342–5346.
[19] Matteo Nucci, Marco Tagliasacchi, and Stefano Tubaro, "A phylogenetic analysis of near-duplicate audio tracks," in IEEE Int. Workshop on Multimedia Signal Processing (MMSP), 2013, pp. 99–104.
[20] Nicholas Andrews, Jason Eisner, and Mark Dredze, "Name phylogeny: A generative model of string variation," in Intl. Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, 2012, pp. 344–355.
[21] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, June 2008.
[22] David G. Lowe, "Object recognition from local scale-invariant features," in IEEE Int. Conference on Computer Vision and Pattern Recognition (CVPR), 1999, vol. 2, pp. 1150–1157.
[23] Barbara Zitová and Jan Flusser, "Image registration methods: A survey," Image and Vision Computing, vol. 21, no. 11, pp. 977–1000, 2003.
[24] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016.
[25] Sunil Arya, David M. Mount, Nathan S. Netanyahu, Ruth Silverman, and Angela Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," Journal of the ACM, vol. 45, no. 6, pp. 891–923, Nov. 1998.
[26] Jon Louis Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509–517, Sept. 1975.
[27] Michael Steinbach, George Karypis, and Vipin Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
[28] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011.