Unshredding of Shredded Documents: Computational Framework and Implementation
Asia Pacific Journal of Education, Arts and Sciences Vol. xx No. yy, Month 2015
Lei Kristoffer R. Lactuan and Jaderick P. Pabico
Institute of Computer Science, University of the Philippines Los Baños, College 4031, Laguna
{lkrlactuan, jppabico}@up.edu.ph

Abstract – A shredded document $D$ is a document whose pages have been cut into strips for the purpose of destroying private, confidential, or sensitive information $I$ contained in $D$. Shredding has become a standard means for government organizations, businesses, and private individuals to destroy archival records that have been officially classified for disposal. It can also be used to destroy documentary evidence of wrongdoing by entities who are trying to hide $I$ (e.g., as alleged by the whistle blowers in the P10-Billion pork barrel-JLN NGOs scam). Shredding does not really destroy $I \in D$; it simply jumbles $I$ by cutting the pages of $D$ into strips. Let $D$ be a set of $n$ ordered pages $\{p_0, p_1, \ldots, p_{n-1}\}$, and let each $i$th $(X \times Y)$-dimension page $p_{i<n} \in D$ contain a 2-dimensional ordered arrangement of the information elements $J_{i,(x,y)} \in \{0,1\}$, $\forall x < X, y < Y$. When each $J_{i,(x,y)}$ is presented to a reader $R$ in the correct order, $I$ is transferred to $R$. When $p_i$ is shredded into $m$ vertical strips $s_{i,0}, s_{i,1}, \ldots, s_{i,m-1}$, the $j$th strip $s_{i,0 \le j < m}$ contains $X/m \times Y$ information elements, with $J_{i,(a(j),0)}$ in its upper left corner and $J_{i,(b(j),Y-1)}$ in its lower right, where $a(r) = (x - m)r/m$ and $b(r) = (r + 1)x/m$. However, what we are interested in are the information elements $J_{i,(a(j),0 \le y < Y)}$ and $J_{i,(b(j),0 \le y < Y)}$ along the left and right edges, respectively. If two strips $s_{i,p}$ and $s_{i,q}$ are actually adjacent to each other in the unshredded $p_i$, then either $J_{i,(b(p),0 \le y < Y)} \equiv J_{i,(a(q),0 \le y < Y)}$ or $J_{i,(a(p),0 \le y < Y)} \equiv J_{i,(b(q),0 \le y < Y)}$, where $\equiv$ is a similarity function that we defined.
In this paper, we present an optimal $O((n \times m)^2)$ algorithm $A$ that reconstructs an $n$-page $D$, where each page $p$ is shredded into $m$ strips. We also present the efficacy of $A$ in reconstructing three document types: hand-written, machine-typeset, and images.

Keywords – shredding, unshredding, reconstruction, optimal algorithm, documents

I. INTRODUCTION
In the 2012 Corruption Perceptions Index (CPI), the Philippines ranked 105th out of 176 countries with a CPI score of 34/100, where a CPI score of 100 is perceived to be corruption-free [1]. A rank of one is perceived to be the least corrupt country, while a rank of 176 is perceived to be the most corrupt. The Philippines ranked fifth out of the 10 Southeast Asian countries in the same year. These 2012 rankings are an improvement over the 2011 rankings, in which the country ranked 129th internationally and seventh in Southeast Asia [2]. The country's corruption perception, and thus its CPI ranking, is expected to get worse in the election year of 2016 due to the surfacing of alleged corruption cases that triggered the resurfacing of older corruption accusations. For example, the discovery of the 10-billion peso pork barrel scam triggered the resurfacing of the 2004 Fertilizer scam. It is alleged that the 10-billion peso pork barrel scam funneled public funds to several bogus non-governmental organizations (NGOs) under the name of one person in connivance with several high-ranking nationally-elected legislators [3]. Recently, about 240 bank accounts and assets of the Philippine Vice President Jesus Jose “Jejomar” C. Binay, Sr. were ordered frozen by the Philippines' Court of Appeals on the petition of the Anti-Money Laundering Council, based on documentary evidence gathered by the Office of the Ombudsman [4].

Many fingers are being pointed at various people, as well as at various organizations, as either the masterminds or the beneficiaries of the alleged crimes. However, justice cannot be served because of the lack of proper documentary evidence to support the claims of the supposed whistle blowers. The whistle blowers allege that they themselves were once (knowingly or unknowingly) participants in the crime and could only narrate what had happened.
This is because most, if not almost all, of the documentary evidence was destroyed to cover up the alleged crime [5]. Almost all of the oral narrations, however, corroborated each other, which provided the investigating prosecutors with the modes of operation of the alleged perpetrators and their cohorts [3].

Destroying documentary evidence may be done in several ways, such as by burning and by mechanical shredding. Documents that were destroyed with toxic chemicals, with water, or with fire are very difficult, if not impossible, to reconstruct given the current state of technology. Shredded documents, however, may be reconstructed even if some of the shreds have already been destroyed. For example, in 1979 when the United States Embassy in Iran was about to be taken over by Iranian students supporting the Iranian Revolution [6], the embassy personnel shredded their classified documents. After the Embassy was taken over, the Iranians hired carpet weavers to reconstruct the shredded pieces of paper by hand [7].

Shredding an $n$-page document $D$ is the process of cutting each of the $n$ pages $p_{i<n} \in D$ into $m$ rectangular pieces, called strips, by a mechanical shredder. The information $I$ contained in $D$ is supposedly destroyed because the pages have become unordered and the contents of each page have become unordered as well. However, given that all of the $n \times m$ strips are available, a very determined person can patiently reconstruct $D$ by hand, as in the example of the carpet weavers who were tasked to reconstruct shredded U.S. Embassy documents during the Iran-U.S. Hostage Crisis of 1979 [6]. The reconstruction may be intuitively done by taking any two strips $s_{i,p}$ and $s_{i,q}$ and visually matching their edges to see if there is continuity between them. Matching the edges of $s_{i,p}$ and $s_{i,q}$ takes four steps at the most (see Algorithm 1).
The optimal reconstruction can be achieved via a brute-force method that takes all possible pairs of $s_{i,p}$ and $s_{i,q}$, totaling at most $(n \times m)^2$ combinations (or exactly $n^2 \times m^2 - n \times m$), and resulting in at most $4 \times (n \times m)^2$ visual matches, four per pairwise combination of strips.

Algorithm 1. Matching of two strips $s_{i,p}$ and $s_{i,q}$.
1) Match between the right edge of $s_{i,p}$ and the left edge of $s_{i,q}$
2) Match between the right edge of $s_{i,p}$ and the inverted right edge of $s_{i,q}$
3) Match between the inverted left edge of $s_{i,p}$ and the left edge of $s_{i,q}$
4) Match between the inverted left edge of $s_{i,p}$ and the inverted right edge of $s_{i,q}$

Although it has been reported that this problem was recently solved [8,9], through the announcement of the winner of the Defense Advanced Research Projects Agency's (DARPA) Almost-Impossible Challenge [10], none of those who won, or even submitted solutions, have published the details of their solutions or made them available to the public [11-13]. We want to make our detailed solution public, and so in this paper we present our computational framework for the automatic reconstruction of shredded documents. We use the term unshredding interchangeably with reconstruction. We then present our computational implementation, which uses image processing techniques on images of strips as a pre-processing step. We then present our computational framework, which we optimized to take fewer steps than the intuitive solution mentioned above. We also present our results in unshredding various types of documents using our computational implementation. We tested the efficacy of our solution in reconstructing shredded papers of various types, namely hand-written, machine-typeset, and images.

II. COMPUTATIONAL FRAMEWORK
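As a point of comparison for the framework described in this section, the brute-force baseline built on Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary layout of a strip, the function names, and the use of a simple row-agreement fraction as a stand-in for the visual match are assumptions, not the implementation reported in this paper.

```python
from itertools import permutations


def edge_agreement(edge_a, edge_b):
    """Fraction of rows on which two equal-length binary edges agree (assumed score)."""
    same = sum(1 for a, b in zip(edge_a, edge_b) if a == b)
    return same / len(edge_a)


def four_way_match(p, q):
    """Return the four comparisons of Algorithm 1 for strips p and q.

    A strip is assumed to be a dict with 'left' and 'right' edge sequences;
    an inverted edge is the same edge read bottom to top.
    """
    def inverted(edge):
        return edge[::-1]

    return [
        edge_agreement(p["right"], q["left"]),                       # step 1
        edge_agreement(p["right"], inverted(q["right"])),            # step 2
        edge_agreement(inverted(p["left"]), q["left"]),              # step 3
        edge_agreement(inverted(p["left"]), inverted(q["right"])),   # step 4
    ]


def brute_force_matches(strips):
    """Score every ordered pair of distinct strips with its best of four matches."""
    scores = {}
    for a, b in permutations(range(len(strips)), 2):
        scores[(a, b)] = max(four_way_match(strips[a], strips[b]))
    return scores
```

For $N = n \times m$ strips the loop visits $N^2 - N$ ordered pairs and performs four comparisons for each, which stays within the $4 \times (n \times m)^2$ bound quoted in the introduction.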
We discuss our framework under the assumption that we are given $n \times m$ unordered strips $s_{i<n,\,0 \le j<m}$, where $n$ and $m$, as well as the indexes $i$ and $j$, are originally unknown. We can easily infer both $n$ and $m$ if we have prior knowledge of the dimension $X \times Y$ of the papers used and we are able to measure the length $x$ and height $y$ of each strip $s_{i,j}$. The strips trivially have $y = Y$, while each page clearly has $m = X/x$. To implement this framework, we need to do the following, each of which is described in detail in the subsections that follow:

1. To scan all $n \times m$ strips $s_{i<n,\,0 \le j<m}$ to convert them into a representation that the computer can automatically process;
2. To conduct image pre-processing to reduce the number of strips to consider, and therefore also reduce the number of pairwise combinations of strips;
3. To compute the similarity $\mathrm{sim}(s_{i,p}, s_{i,q})$ of each pair and use this similarity metric to score a match between two strips as adjacent strips in the original unshredded $p_{i<n}$;
4. To stitch all strips, using image processing techniques, according to the adjacency that was found in step three; and
5. To manually test the efficacy of the above steps under three document types: hand-written, machine-typeset, and images.

A. Collecting and scanning of shredded papers
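Once the strips have been collected and one of them has been measured, the inference of $m$ and $n$ described above reduces to simple arithmetic. The sketch below is illustrative only: the function name and the millimetre values in the usage line are hypothetical, and it assumes that every page was cut into the same number of equally wide strips and that no strip has been lost.

```python
def infer_counts(page_width, strip_width, total_strips):
    """Infer (n, m) from the page width X, the strip width x, and the strip count."""
    # m = X / x, rounded to absorb small measurement errors.
    strips_per_page = round(page_width / strip_width)
    if strips_per_page < 1:
        raise ValueError("strip width exceeds page width")
    # n is the number of complete pages that the collected strips account for.
    pages, leftover = divmod(total_strips, strips_per_page)
    if leftover:
        raise ValueError("strip count is not a multiple of m; strips may be missing")
    return pages, strips_per_page


# Hypothetical measurements: 210 mm wide pages, 6 mm wide strips, 105 strips collected.
n, m = infer_counts(210.0, 6.0, 105)
```

Raising an error on a leftover is a design choice for the sketch; a practical tool would more likely report the shortfall and continue with the strips that are available.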
We collected all strips of shredded papers and carefully laid them on a flatbed scanner, making sure that the strips did not touch or overlap each other, as shown in Figure 1. To easily identify each strip in each scan step, we used a black background to delineate the strips in the image. We used a simple image processing technique [14,15] to identify the strips, cut each from the image, and save each to a file with $i$ and $j$ as identifiers. We will also refer to these scanned strips as $s_{i<n,\,0 \le j<m}$ throughout this paper without loss of generality. For testing our framework, we simulated this step by taking the image of a page and manually shredding it with the aid of an image editor.

Figure 1. How the shredded strips are supposed to be laid out on the flatbed scanner. Notice the dark cover, which makes the mostly white strips stand out against the background.

B. Pre-processing of image strips
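The pre-processing in this subsection operates on the individual strip images that were cut out of each scan in II.A. One simple way to do that cutting, assuming a grayscale scan stored as a list of pixel rows and strips laid upright with dark gaps between them, is a thresholded column projection. This is an illustrative stand-in for the edge-detection techniques in [14,15]; the function names and the threshold value are hypothetical.

```python
def find_strip_spans(image, threshold=128):
    """Return (start, end) column ranges that contain at least one bright pixel.

    `image` is a list of rows of grayscale values (0 = black background).
    `end` is exclusive, so a span can be used directly as a slice.
    """
    width = len(image[0])
    occupied = [any(row[x] >= threshold for row in image) for x in range(width)]

    spans, start = [], None
    for x, filled in enumerate(occupied):
        if filled and start is None:
            start = x                  # a strip begins at this column
        elif not filled and start is not None:
            spans.append((start, x))   # the strip ended just before x
            start = None
    if start is not None:              # a strip touches the right border
        spans.append((start, width))
    return spans


def cut_strips(image, spans):
    """Crop one sub-image per span, keeping every row of the scan."""
    return [[row[a:b] for row in image] for a, b in spans]
```

Because a column counts as occupied when any of its pixels is bright, dark text inside a strip does not split it; only a gap that is dark over the full height of the scan does.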
To reduce the number of strips to evaluate, we used image processing to automatically remove all strips that will not be used in the reconstruction of pages. In this step, blank strips were removed because they are of no use in the process. Usually, these blank strips compose the margins of $p_{i<n}$, unless, of course, the margins themselves have annotations, doodles, or any identifying marks that will be useful in the reconstruction of the pages. We could have removed these blank strips in the scanning step described above, but we included this step here because we were thinking of also automating that manual process.

We also used printed character recognition techniques to set each strip in its upright position, whenever possible, which reduces the number of match steps described in Algorithm 1 to 25% (one match per pair instead of four).

We then extracted the first two pixel columns at the left side of each strip, as well as the last two pixel columns at the right side, as four binary arrays. We termed them $J_{i,(a(j),0 \le y < Y)}$ and $J_{i,(a(j)+1,0 \le y < Y)}$ for the left side, and $J_{i,(b(j),0 \le y < Y)}$ and $J_{i,(b(j)-1,0 \le y < Y)}$ for the right side, where $a(r) = (x - m)r/m$ and $b(r) = (r + 1)x/m$. We used these arrays to represent the edges of $s_{i<n,\,0 \le j<m}$ in a data structure.

C. Computation of similarity
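The similarity computation in this subsection takes as input the edge arrays produced in II.B. A minimal sketch of the blank-strip filter and of the edge extraction follows, assuming each strip is already a binary matrix (a list of rows, with 1 for ink and 0 for paper) that is at least two pixels wide. The function names, the dictionary keys, and the ink tolerance are hypothetical.

```python
def is_blank(strip, max_ink_fraction=0.0):
    """True when the share of ink pixels does not exceed the tolerance."""
    total = sum(len(row) for row in strip)
    ink = sum(sum(row) for row in strip)
    return ink <= max_ink_fraction * total


def edge_columns(strip):
    """Extract the two outermost pixel columns on each side of a strip.

    The keys mirror the four arrays of II.B: columns a(j) and a(j)+1 on the
    left, and columns b(j)-1 and b(j) on the right.
    """
    def column(x):
        return [row[x] for row in strip]

    return {
        "left": column(0),
        "left_inner": column(1),
        "right_inner": column(-2),
        "right": column(-1),
    }


def usable_edges(strips):
    """Drop blank strips, then keep only the edge arrays of the remaining ones."""
    return [edge_columns(s) for s in strips if not is_blank(s)]
```

Keeping only four columns per strip means the later pairwise comparisons never touch the strip interiors, so their cost depends on the strip height $Y$ and not on the strip width.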
We defined a metric we called the similarity $\mathrm{sim}(s_{i,p}, s_{i,q})$ to score a match between two strips as adjacent strips in the original unshredded $p_{i<n}$. We constructed a $4 \times 4$ matrix $Q$ composed of four adjacent rows of each of $J_{i,(a(j),0 \le y < Y)}$, $J_{i,(a(j)+1,0 \le y < Y)}$, $J_{i,(b(j),0 \le y < Y)}$ and $J_{i,(b(j)-1,0 \le y < Y)}$ such that, for example, $Q$ is built, in order, from $J_{b(j)-1,0 \le y < Y}$, $J_{b(j),0 \le y < Y}$, $J_{a(j),0 \le y < Y}$, and $J_{a(j)+1,0 \le y < Y}$. Using similarity of matrices, we matched $Q$ against each matrix template $T$, some interesting examples of which are shown in Figure 2. Most other templates are vertical or horizontal mirror images, horizontally or vertically translated images, or rotated versions of those shown here. Mathematically, $Q$ is similar to any of the $T$ if $Q = R^{-1}TR$, where $R$ is any $4 \times 4$ invertible matrix. $R$ is invertible if $RR^{-1} = R^{-1}R = I$, where $I$ is the $4 \times 4$ identity matrix; any such $R$ can be used without affecting the outcome of the metric. The result of the metric, therefore, is one that is in the neighborhood of $I$, if not exactly $I$. If $\mathrm{sim}(s_{i,p}, s_{i,q}) \equiv I$, then we stop computing the similarities of all pairs that include the right side of $s_{i,p}$ or the left side of $s_{i,q}$, further reducing the number of pairs to be considered. We defined the symbol $\equiv$ to mean that both sides are reduced to a scalar quantity that is 1 if exactly $I$, and greater than 1 if just in the neighborhood of $I$.

Figure 2. Some interesting templates that we used in the similarity test. Shown are templates for (a) a horizontal line, (b) a vertical line, (c) a diagonal line, (d) a horizontal edge of a polygon part, (e) a vertical edge of a polygon, and (f) a diagonal edge of a polygon part. Dark colored circles have a value of 1 in the binary matrix, while light colored ones have 0.

D. Stitching of matched strips into pages
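Stitching depends on the pairwise scores from II.C. The sketch below illustrates the general idea of template matching across a cut, but it deliberately replaces the matrix-similarity test $Q = R^{-1}TR$ with a plain cell-by-cell comparison of each $4 \times 4$ window against a set of templates. The single template shown, the window layout, the dictionary keys, and the score (the fraction of windows that hit a template) are all assumptions made for illustration.

```python
# Hypothetical template: a horizontal stroke crossing the cut on the second row.
HORIZONTAL_LINE = (
    (0, 0, 0, 0),
    (1, 1, 1, 1),
    (0, 0, 0, 0),
    (0, 0, 0, 0),
)
TEMPLATES = {HORIZONTAL_LINE}


def windows(p_edges, q_edges, size=4):
    """Yield every size-row window across the cut between strips p and q.

    Each window row holds, in order, the pixels of columns b(p)-1, b(p),
    a(q) and a(q)+1 at one height y.
    """
    cols = [p_edges["right_inner"], p_edges["right"],
            q_edges["left"], q_edges["left_inner"]]
    height = len(cols[0])
    for y in range(height - size + 1):
        yield tuple(tuple(col[y + dy] for col in cols) for dy in range(size))


def template_score(p_edges, q_edges, templates=TEMPLATES):
    """Fraction of windows across the cut that equal one of the templates."""
    all_windows = list(windows(p_edges, q_edges))
    if not all_windows:
        return 0.0
    hits = sum(1 for w in all_windows if w in templates)
    return hits / len(all_windows)
```

A fuller template set would also contain the mirrored, translated, and rotated variants mentioned in the text, so that a stroke is recognized wherever it falls inside the window.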
We considered the two strips $s_{i,p}$ and $s_{i,q}$ with the highest similarity $\mathrm{sim}(s_{i,p}, s_{i,q})$ as adjacent strips. Using image processing techniques [17,18], we stitched $s_{i,p}$ and $s_{i,q}$ together in that order. We considered the stitching of any page $p_{i<n}$ completed once we had stitched at most $m$ strips together.

E. Manual Evaluation
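The pages inspected in this evaluation are produced by the pairing rule of II.D. One way that rule could be realized is a greedy pass over the scored pairs, sketched below. The function name, the score dictionary keyed by (left strip, right strip), and the union-find bookkeeping that keeps chains acyclic and no longer than $m$ strips are assumptions, not a description of the reported implementation.

```python
def greedy_chains(scores, strip_count, m):
    """Link strips left to right, best-scoring pairs first.

    `scores[(p, q)]` is the similarity of placing strip q directly to the
    right of strip p. Returns one list of strip indexes per assembled chain.
    """
    right_of, left_of = {}, {}
    parent = list(range(strip_count))   # union-find forest over chains
    size = [1] * strip_count            # chain length stored at each root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (p, q), _ in ranked:
        if p in right_of or q in left_of:
            continue                         # one neighbor per side
        root_p, root_q = find(p), find(q)
        if root_p == root_q or size[root_p] + size[root_q] > m:
            continue                         # would form a cycle or exceed a page
        right_of[p], left_of[q] = q, p
        parent[root_q] = root_p
        size[root_p] += size[root_q]

    chains = []
    for start in range(strip_count):
        if start not in left_of:             # leftmost strip of a chain
            chain = [start]
            while chain[-1] in right_of:
                chain.append(right_of[chain[-1]])
            chains.append(chain)
    return chains
```

Each returned chain lists strip indexes from left to right, which is the order in which the strip images would be pasted side by side.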
We evaluated our method by reconstructing three different types of documents: hand-written, machine-typeset, and images. Since our methodology is heavily biased toward machine-typeset documents (e.g., the printed character recognition technique described in II.B), we hypothesized that this type of document would be reconstructed with less error. We used only a manual method for evaluating the reconstructed documents since we lack a metric for doing this automatically.

III. EVALUATION RESULTS
Figure 3a shows the images of sample documents before they were shredded. These images are examples from among the many that we evaluated. Figure 3b shows the images of the stitched documents after the documents in Figure 3a were shredded and processed by our computational framework. Since we removed the blank strips, which mostly compose the margins of the original documents, it can be clearly seen that the reconstructed pages lack margins. Margins, however, are not important, since what we want to reconstruct is the information contained in the pages, unless, of course, the margins themselves contain information such as an editor's annotations, doodles, or other identifying marks. Such extra information in the margins would have been caught by our framework.

Figure 3. Sample documents of each type (a) before shredding, and (b) after reconstruction using our computational framework.
We tested our framework using the subjective evaluation of human users. Although there exist several advanced optical character and image recognizers (see for example [19]), as well as state-of-the-art natural language processors (see for example [20]), we did not automatically, and thus objectively, evaluate the output of our framework using these advanced systems because of their respective inherent implementation complexities [21]. Intuitively, subjective evaluation should suffice for the purpose of helping people automatically unshred shredded documents.

IV. SUMMARY AND CONCLUSION
In this paper, we presented the computational framework and implementation of an automated process for reconstructing shredded documents. We initially estimated that the process would take $4 \times (n \times m)^2$ matches, which we reduced to about $(n \times m)^2$ through our implementation of the optical character reader. We further reduced $n$ and $m$ by expunging the blank strips from the set of strips. These blank strips are usually parts of the paper margins. We subjectively evaluated the results of our framework and found that it is able to unshred shredded papers by stitching two strips $s_{i,p}$ and $s_{i,q}$ together if their respective edges have high similarity scores $\mathrm{sim}(s_{i,p}, s_{i,q})$. We defined our similarity metric using matrix similarity principles. We further reduced the number of matches in our framework whenever two strips $s_{i,p}$ and $s_{i,q}$ score a similarity that matches the identity matrix $I$.

V. ACKNOWLEDGMENTS, DISCLOSURE OF EARLIER PRESENTATION, AND AUTHOR CONTRIBUTIONS
This research effort was funded by the Institute of Computer Science (ICS) Core Fund and was conducted during the First and Second Semesters of AY 2013-2014 under the directorship of Prof. Jaime M. Samaniego. We used the Application Programmer Interface in the ImageLab Library of Image Processing Functions provided for academic and research purposes by Prof. Vladimir Y. Mariano, Ph.D., Associate Professor at ICS.

An earlier version of this paper was presented as an oral paper at the NCAS Auditorium, UPLB on 13 December 2013.

The following are the respective contributions of the authors: (1) JPP formulated the computational solution to the problem; (2) LKRL implemented the computational solution; (3) Both LKRL and JPP conducted the computational experiments and the performance analyses; and (4) Both LKRL and JPP prepared and edited the final manuscript. Both authors declare no conflict of interest.

REFERENCES
[1] Transparency International. 2012. Corruption Perceptions Index 2012.
[2] Philippines Improves in Latest Transparency International Corruption Perceptions Index. National Competitive Council of the Philippines.
[3] Janet Napoles and the Pork Barrel Scam: An Inquirer Special Report. Philippine Daily Inquirer.
[4] Court freezes Binay assets: CA order covers 242 account of VP, kin, pals. Philippine Daily Inquirer (http://newsinfo.inquirer.net/690936/court-freezes-binay-assets).
[5] N. Gutierrez. 2013. Senate Probe: Napoles Shredded Evidence. Rappler.
[6] Iran-U.S. Hostage Crisis (1979-1981). The History Guy.
[7] Documents from the U.S. Espionage Den. Muslim Students Following the Line of the Imam.
[8] J. Aron. 2011. DARPA's Shredder Challenge solved two days early. New Scientist Online.
[9] DARPA’s Almost-Impossible Challenge to Reconstruct Shredded Documents: Solved. Gizmodo.com (http://gizmodo.com).
[10] ACM News. 2011. DARPA's Almost-Impossible Challenge. Communications of the ACM (http://cacm.acm.org/news/).
[11] B. Biesinger, C. Schauer, B. Hu, and G. R. Raidl. 2013. Reconstructing cross cut shredded documents with a genetic algorithm with solution archive. In Extended Abstracts of the 14th International Conference on Computer Aided Systems Theory, Las Palmas de Gran Canaria, Spain. pp 226-228.
[12] A. Deever and A. Gallagher. 2012. Semi-Automatic assembly of real, cross-cut shredded documents. IEEE International Conference on Image Processing (ICIP) 2012.
[13] A. Skeoch. 2006. An Investigation into Automated Shredded Document Reconstruction using Heuristic Search Algorithms. Unpublished Ph.D. Thesis, University of Bath, U.K. pp 107.
[14] R. Haralick. 1984. Digital step edges from zero crossing of second directional derivatives. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[15] Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision.
[16] Matrix Analysis. Cambridge University Press: London. pp 575.
[17] G. Ward. 2006. Hiding seams in high dynamic range panoramas. In Proceedings of the 3rd Symposium on Applied Perception in Graphics and Visualization. ACM International Conference Proceeding Series 153.
[18] S. Suen, E. Lam, and K. Wong. 2007. Photographic stitching with optimized object and color matching based on image derivatives. Optics Express.
[19] The state of the art in online handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[20] Models of natural language understanding. Proceedings of the National Academy of Sciences of the United States of America.
[21] Formalizing semantic of natural language through conceptualization from existence. International Journal of Innovation, Management and Technology 2(1):37-42.