Universal Layout Emulation for Long-Term Database Archival
Raja Appuswamy
EURECOM, Biot
Vincent Joguin
EUPALIA, Hyères
ABSTRACT
Research on alternate media technologies, like film, synthetic DNA, and glass, for long-term data archival has received a lot of attention recently due to the media obsolescence issues faced by contemporary storage media like tape, Hard Disk Drives (HDD), and Solid State Disks (SSD). While researchers have developed novel layout and encoding techniques for archiving databases on these new media types, one key question remains unaddressed: How do we ensure that the decoders developed today will be available and executable by a user who is restoring an archived database several decades later in the future, on a computing platform that potentially does not even exist today?
In this paper, we make the case for Universal Layout Emulation (ULE), a new approach for future-proof, long-term database archival that advocates archiving decoders together with the data to ensure successful recovery. In order to do so, ULE brings together concepts from Data Management and Digital Preservation communities by using emulation for archiving decoders. In order to show that ULE can be implemented in practice, we present the design and evaluation of Micr’Olonys, an end-to-end long-term database archival system that can be used to archive databases using visual analog media like film, microform, and archival paper.
Driven by the promise of machine learning and data analytics, enterprises routinely gather vast amounts of data from diverse data sources. Analysts have reported that enterprise data stored in databases, data warehouses, and data lakes, is growing 40% annually and will account for 60% of the 160 Zettabyte Global Datasphere by 2025 [14]. However, not all data is accessed uniformly. Studies have reported that only 20% of data stored is performance-critical and accessed frequently. The remaining 80% is cold and accessed infrequently [11]. Historic data used for trend forecasting, archival data stored for meeting legal and regulatory audits, and backup data accessed during failures, are examples of such cold data. Cold data has been identified as the fastest growing data segment with a 60% annual growth rate, and also as the segment with the longest lifetime (window between creation and deletion date) with retention periods lasting 50–60 years [18]. Thus, enterprises are in desperate need of cost-effective options for long-term storage of cold data.
Traditionally, DBMS have used a tiered storage hierarchy composed of a DRAM or SSD-based performance tier, a HDD-based capacity tier, and a tape-based archival tier. Thus, infrequently-accessed archival data was stored using tape, as it has the lowest cost/GB among all commercially available storage technologies. Unfortunately, tape suffers from two fundamental limitations that complicate long-term database archival. First, tape has a limited lifetime of 10 to 20 years. In contrast, enterprises routinely archive data for over 50 years in order to meet legal and regulatory compliance requirements [18]. Second, tape density improves at an annual rate of 30% [1, 8], and tape vendors retain backwards compatibility only up to two generations. As a result of these two limitations, using tape for long-term storage mandates periodic, expensive data migration to deal with device failures and technology upgrades.
These limitations make tape a less-than-ideal medium for long-term archival of cold data in enterprise databases.
Recently, several new initiatives have emerged from both industry and academia in an effort to develop new long-term storage technologies that can overcome the limitations of tape. Libraries and archives have long used analog media, like microfilm and archival paper, for protecting journals and magazines across several decades [6]. For instance, LE-500 rated microfilms and ISO 9706 rated archival paper are designed to last 500 years or more when stored in proper conditions. Recently, photographic media (film) has been used for the preservation of the Declaration of Children’s Rights document in collaboration with the UN in the Arctic World Archive [13]. Researchers have also demonstrated the use of synthetic DNA [2, 3, 9] and glass [7] as promising storage media with orders of magnitude higher density, and thousands of years of longevity. While these initiatives address the storage lifetime challenges associated with long-term archival, enterprise DBMS also face another challenge in archiving data: one related to format obsolescence.
All modern DBMS use proprietary layout formats for storing data. These layouts employ sophisticated compression, partitioning, deduplication, and data organization techniques to improve performance and reduce space utilization. As layout formats evolve to provide new functionality, most commercial DBMS maintain backwards compatibility to ensure that an upgrade to a newer DBMS version does not render a database stored in an older format unusable. However, in the context of long-term preservation, data must be stored in a layout format that is forwards compatible with all future versions of DBMS software.
Due to this reason, the state-of-the-art approach for long-term database archival is to convert data from a machine-readable, high-performance binary layout to a human-readable, textual representation that uses well-established, publicly-available standards like CSV and XML [4, 10]. The typical approach is to use external tools that communicate with the DBMS using well-established interfaces, and “dump” a database into a generic text file that is then archived using a long-term storage medium.
Unfortunately, this approach suffers from two major drawbacks. First, the switch from binary to textual layout leads to severe data bloat as it strips away the benefit of techniques like compression and deduplication that the database can apply using its knowledge of the schema. As databases continue to be extended with support for more complex data types and more data models, this approach also requires continuously refining the definition of a “standard” archival layout to accommodate new data types. This is less than ideal especially when taken in the context of long-lasting storage media, as data needs to be migrated from one standard to the next periodically, and a suite of archival tools that provides compatibility across generations of standards needs to be maintained across decades. Second, while the switch to DBMS-agnostic layout solves the format problem at the database level, it does not solve the problem at the media level; the text file generated from a database dump still has to be converted into a “physical” layout format suitable for the long-term storage media. For instance, storing a database on film requires encoding it from its digital form, which is a sequence of bits, into an analog form, which is typically a sequence of barcodes. Similarly, storing a database on DNA requires encoding bits into a sequence of DNA strands.
While researchers have developed novel media layout techniques for storing data on these new media types, little attention has been paid to the fact that in order to recover data back successfully, the layout decoders and their parameters should also be archived together with the data.
In this work, we present Micr’Olonys, an end-to-end, long-term database archival solution that solves all the aforementioned problems by design. At the core of Micr’Olonys is a new approach to archival we refer to as Universal Layout Emulation (ULE). The central idea behind ULE is to archive data together with the layout decoders necessary for retrieving the data to ensure that data stored in a custom, binary, compressed layout format can be retrieved using any computing environment in the future. To do so, ULE uses emulation to create a software processor with a custom Instruction Set Architecture (ISA), and archives layout algorithms by porting them to this ISA. Micr’Olonys implements ULE by porting both database and media layout decoding algorithms to DynaRisc, a custom 23-instruction software processor. The instruction stream to be emulated, and the DynaRisc emulator itself are stored together with the data. Using a novel nested emulator design, Micr’Olonys makes it possible for any user in the future to bootstrap the DynaRisc emulator by writing less than 300 lines of code in any programming language, runtime, or operating system, including ones that might not even exist today. We show that empowered by ULE, Micr’Olonys can perform long-term archival of databases across several types of visual analog media including film, microform, and archival paper.
The problem of preserving software in such a way that it can be executed several decades later on an unknown computing platform is not unique to database archival. Libraries and museums have long faced this problem as they need to permanently preserve an increasingly larger collection of born-digital software artefacts and documents that have historic or cultural significance [15]. One state-of-the-art approach used by researchers in Digital Preservation for preserving software is to use emulation [16, 17]. Emulation refers to the technique that enables a host system to run software or use peripheral devices designed for a different guest system. Emulators simulate the processor and associated guest hardware entirely in software by interpreting the instructions of the guest processor. Thus, emulators can run unmodified software compiled for a guest processor architecture on a host processor with a different architecture. Emulation differs from virtualization whose goal is to provide mediated shared access to underlying hardware. Hypervisors implement virtualization and provide virtual machines that can host unmodified guest OS and applications with minimal overhead by directly executing the guest instructions on the host processor, and by exploiting processor specific virtualization extensions. Thus, virtualization is fundamentally tied to the underlying host hardware. In contrast to virtualization which extends the host processing environment to the guest, emulation tries to faithfully reproduce a guest processing environment on a different host. Thus, emulators prioritize portability and compatibility over performance. Due to these reasons, emulation is being actively used today for preserving historic software designed for old, obsolete computing environments by emulating them on modern processing environments.
One approach of using emulation for database archival is to archive the DBMS software stack and store it together with the data.
At restoration time, an emulator can be used to create the right hardware and software environment for emulating the archived DBMS version in order to access the data. Unfortunately, this approach is not feasible in practice as it suffers from several drawbacks. First, this approach requires the entire DBMS software stack, including supporting libraries, runtimes and OS, to be meticulously archived each time data is archived. This is no simple task given that modern DBMS engines are complex pieces of software with many dependencies. Second, archiving DBMS software also implies that each restoration will emulate one specific older version of the DBMS. Thus, the user now has to perform manual synchronization between the archived versions and the latest version. Third, this approach complicates licensing as emulated older versions have to be potentially licensed differently from non-emulated current versions. Fourth, and perhaps most importantly, the use of emulation for archiving DBMS software simply shifts the problem, as it implicitly assumes the existence of an emulator in the future that can faithfully reproduce the computing environment for the DBMS. As modern DBMS engines are advanced pieces of software that often use processor specific extensions for accelerating performance, they would require emulators to continuously keep pace with advances in instruction set extensions like vector extensions for SIMD, transactional extensions for Hardware Transactional Memory, etcetera, for every architecture supported by the DBMS. This is clearly a non-trivial, non-scalable endeavor.
The ULE approach avoids these problems by not emulating the DBMS software and instead, only emulating the decoders necessary for retrieving data. Such an emulator does not need to emulate a complex architecture like x86 as the goal of the emulator is not to faithfully reproduce unmodified, existing x86 software. Rather, the goal is to be able to simply archive the decoding logic for later execution.
Thus, rather than designing an emulator for a given architecture determined by software, we can write the decoding software to target a pre-designed emulator. Such an emulator can simulate a much simpler RISC processor with a limited instruction set that is sufficient for implementing decoders. Note that the processor emulated no longer needs to correspond to any real processor. Thus, the emulator here functions in principle like an interpreter that reads instructions corresponding to the program and interprets them. We refer to such a strategy as Universal Emulation to highlight the hardware and architecture-agnostic nature of this approach, and to distinguish it from traditional hardware emulation. Universal Emulation was originally proposed as an approach for long-term software preservation in digital libraries [12] and has not been used for database archival.
Universal Emulation has several advantages for database archival. First, the emulator itself is dramatically simpler than a traditional hardware emulator due to the limited ISA of the virtual processor. Further, the ISA is a fixed interface that will never be extended, unlike current processors that expose new functionality via ISA extensions. On those grounds, there is no reason to continuously maintain and port the emulator across years. In fact, decoders can be archived by simply archiving their instruction stream together with a description of the fixed ISA they are programmed against. During restoration, the ISA description can be used to implement the Universal Emulator using any programming language, OS, or runtime, and the decoder can be executed using the simulated virtual processor. The second advantage of ULE is that it seamlessly extends the current archival infrastructure. During archival, layout encoders can compress the textual database archive using database-specific binary layouts and transform them for storage using media-specific layouts. During restoration, the decoders are executed by the emulator to convert the data back into a software-independent format. Thus, ULE uses the same interfaces as traditional archiving for getting data into and out of a DBMS. But, by using universal emulation of decoders, ULE enables the use of structure-aware, media-specific layouts for archiving databases efficiently using long-term storage media. Third, ULE obviates the need for emulating a full DBMS. Thus, queries can be executed at bare-metal performance without any overhead. There is also no need to synchronize data across multiple versions.
Figure 1: A sample emblem generated by MOCoder from digital data that can be printed to analog media.
Micr’Olonys is a ULE-based archival system we have developed for archiving databases using analog media, like film, microform, and archival paper. In this section, we will explain the design of the three fundamental building blocks of Micr’Olonys, namely, DBCoder, the database layout encoder/decoder, MOCoder, the media layout encoder/decoder, and Olonys, the nested universal emulator.
DBCoder manages compression of archived databases from their textual, software-independent format into a compressed binary layout. Our current DBCoder supports a generic compression scheme based on LZ77 and arithmetic coding that can achieve compression performance close to 7-Zip’s LZMA for compressing all database files into a single archive. We are working on supporting more advanced database-specific, compressed, columnar layout schemes as part of on-going research. Irrespective of the layout used, DBCoder is expected to produce a compressed bit stream for further encoding by MOCoder, our media layout coder.
In order to store the compressed bit stream generated by DBCoder using analog media, the bit stream should be converted into visual signals and then printed as pictures. The generated visual signals must be robust to a range of errors that can be introduced during both filming (writing) and scanning (reading). Retrieving digital data stored on film can be jeopardized mainly in two ways. First, the film itself can distort to a small extent over time and become damaged in various ways with fading, hot spots, scratches, etcetera. Second, film scanners are a possible factor of image degradation as they use lenses which can change straight lines into curves, usually near the edge of the field of view. Moreover, as with paper scanners, especially Automatic Document Feeders, the mechanical motion in linear array scanners will often introduce small perturbations or unsteady movements while scanning. Dust can also be a source of degradation in microform, on the film itself, on the glass plates used to hold film while scanning, and also on the surface being filmed which is usually a flat screen with modern microform writers.
One possible approach to overcoming these problems is to use two-dimensional barcodes, like QR codes and Data Matrix, to convert bit streams into barcodes.
QR codes, for instance, represent a sequence of bits as an ordered sequence of pixels in a square grid, with each black or white pixel representing one bit. In addition to these data cells, QR codes also contain a bidimensional clocking system that takes the form of guiding marks placed at specific predefined locations within the field of black and white dots to compensate for distortions. In particular, a QR code always contains 3 position patterns (at 3 corners of the QR code), 2 timing patterns (one for each dimension), and a varying number of alignment patterns, depending on the size of the QR code. These support the decoding algorithm that recovers data back from the QR code by keeping the data bits synchronized with the black and white dots. However, QR codes have been designed considering large-scale distortions, such as an indirect viewing angle using a smartphone camera. Barcodes designed for data archival, in contrast, must be able to cope with small-scale distortions incurred by lenses and unsteady movements of scanners. QR codes are also designed based on the assumption that each dot that makes up the QR code is captured using many pixels, that is, the capture resolution is significantly higher than the QR code resolution. Thus, QR codes and other 2D barcodes typically store a few kilobytes of information at best, and are mainly used as tags or placeholders for short textual information. Archival barcodes, in contrast, should be designed to store multi-megabyte data streams spread over many barcodes.
MOCoder is the media layout encoder used in Micr’Olonys that performs the “physical” layout of bits across barcodes on visual analog media. We refer to the barcodes generated by MOCoder as emblems to distinguish them from traditional 2D barcodes. Similar to other 2D codes, MOCoder maps bits to pixels.
However, unlike other 2D codes, MOCoder does not use a separate clocking system. Instead, it pairs the bit signal and the clock signal in an approach similar to Differential Manchester encoding used in floppy disks, to generate a self-synchronizing data stream. This approach ensures robust, local clock recovery without having to rely on an independent reference clocking system that could itself be affected by a different level of distortion, as is the case with other 2D barcode schemes. Further, the area that represents data bits is surrounded in each emblem by a thick black square and large-scale black and white dots that allow fast and robust initial detection of the emblem geometry and type, and therefore allow the decoding algorithm to be positioned precisely on the data area in the scanned image as shown in Figure 1.

Arithmetic            Logical       Control/Data
ADC(carry) Rd, Rs     AND Rd, Rs    MOVE Rd, Rs
SBB(borrow) Rd, Rs    OR Rd, Rs     LDI Rd,

Table 1: DynaRisc instructions. Rd and Rs refer to destination and source data registers. [Dx] refers to source or destination memory pointer register.

On top of the physical, visual representation of bits, MOCoder also uses a Reed-Solomon code-based error correcting mechanism to deal with bit erasures that could be caused by media degradation or dust. In particular, MOCoder uses bidimensional error correction with nested Reed-Solomon (RS) codes. The inner RS code works on blocks of data, each holding 223 bytes of user data and 32 redundancy bytes, spread over the entire emblem. This intra-emblem mechanism will automatically correct up to 7.2% damaged data within a single emblem. The outer code, or inter-emblem mechanism, protects against whole-emblem failures, by including three parity emblems with each set of 17 data emblems. This results in the full bit-for-bit restoration of data contained within a series of 20 emblems in which any three are missing altogether.
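The error-correction figures above can be checked with a short calculation. The sketch below is illustrative only: it uses the block and emblem counts quoted in the text together with the standard Reed-Solomon property that 2t redundancy symbols correct t symbol errors at unknown positions. Reading the 7.2% figure as correctable bytes relative to user data bytes is our assumption, as are all variable names.

```python
# Inner (intra-emblem) code: figures quoted in the text.
USER_BYTES = 223        # user data bytes per block
REDUNDANCY_BYTES = 32   # redundancy bytes per block
BLOCK_BYTES = USER_BYTES + REDUNDANCY_BYTES

# Standard Reed-Solomon property: 2t redundancy symbols correct up to
# t symbol errors at unknown positions.
correctable_errors = REDUNDANCY_BYTES // 2

# Assumption: the quoted 7.2% is measured against the user data bytes.
damage_vs_user_data = correctable_errors / USER_BYTES
damage_vs_whole_block = correctable_errors / BLOCK_BYTES

# Outer (inter-emblem) code: 3 parity emblems per 17 data emblems.
DATA_EMBLEMS = 17
PARITY_EMBLEMS = 3
GROUP_EMBLEMS = DATA_EMBLEMS + PARITY_EMBLEMS

# Fraction of stored bytes that carry user data after both codes.
inner_rate = USER_BYTES / BLOCK_BYTES
outer_rate = DATA_EMBLEMS / GROUP_EMBLEMS
combined_rate = inner_rate * outer_rate

if __name__ == "__main__":
    print(f"block size: {BLOCK_BYTES} bytes, correctable errors: {correctable_errors}")
    print(f"damage tolerated vs user data:   {damage_vs_user_data:.1%}")
    print(f"damage tolerated vs whole block: {damage_vs_whole_block:.1%}")
    print(f"emblems per group: {GROUP_EMBLEMS}, combined code rate: {combined_rate:.1%}")
```

Under these assumptions, roughly three quarters of the stored bytes carry user data, which is the price paid for tolerating both local damage and the loss of up to three whole emblems per group.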
So far, we described the encoding components of DBCoder and MOCoder that transform data from a textual format to emblems. As these encoders are intended to be used by enterprises today, they are written using a contemporary programming language (C
DynaRisc. Table 1 provides a sample of various arithmetic, logical, control transfer and data movement instructions supported by DynaRisc. Further information about the register file, instruction/operand formats, and addressing modes is available in a prior patent publication about Olonys [5]. The key takeaway from Table 1 is that DynaRisc supports several instructions that are also provided by modern processors. Unlike the encoding parts, the decoding parts of DBCoder and MOCoder are implemented in DynaRisc assembly. In order to execute the decoders in the future, a user would need to write an emulator that can, in effect, interpret each instruction. Given the limited ISA, writing an emulator for DynaRisc is a much simpler endeavor than writing, for instance, an x86 emulator. However, in order to minimize the amount of work that must be done in the future, and to simplify the task of writing an emulator, Olonys adopts a novel nested emulation strategy.
Instead of emulating just DynaRisc, Olonys internally emulates two ISAs, DynaRisc and an even simpler, four-instruction software processor we refer to as VeRisc. The four instructions in the VeRisc ISA are (i) LD &address (load from memory to general-purpose register R), (ii) ST &address (store from register R to memory), (iii) SBB &address (subtract from register R the value located at given address), and (iv) AND &address (logical AND of register R with value at given address). Using just these four VeRisc instructions, we have built an emulator that can interpret the broader DynaRisc ISA. A user now only has to write an emulator for VeRisc, which effectively means implementing an interpreter for just four basic instructions using any computing environment. The VeRisc emulator running on a host computer can then load and instantiate the DynaRisc emulator written in VeRisc, which, in turn, can load the instruction stream of the decoders written in DynaRisc and instantiate them.
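To give a feel for how small a four-instruction interpreter can be, the following is a minimal sketch of a VeRisc-style machine. Only the four instruction names and their informal meanings come from the text. The 16-bit word size, the opcode numbering, the two-word instruction format, the memory-mapped program counter (so that a store can act as a jump), the halt convention, and the borrow flag are all hypothetical choices made for illustration; the actual specification is the one given in the Bootstrap and in [5].

```python
MASK = 0xFFFF                 # assumed 16-bit machine word
PC_ADDR = 0                   # assumed: program counter is stored at memory address 0
HALT = 0xFFFF                 # assumed: a program counter of 0xFFFF stops the machine
LD, ST, SBB, AND = range(4)   # assumed opcode numbering


def run(mem, max_steps=10_000):
    """Interpret VeRisc-style instructions in `mem` until the PC equals HALT."""
    r = 0        # the single general-purpose register R
    borrow = 0   # borrow flag consumed and produced by SBB (assumed semantics)
    for _ in range(max_steps):
        pc = mem[PC_ADDR]
        if pc == HALT:
            return mem
        opcode, address = mem[pc], mem[pc + 1]
        # Advance the PC first, so that an ST to PC_ADDR behaves as a jump.
        mem[PC_ADDR] = (pc + 2) & MASK
        if opcode == LD:
            r = mem[address]
        elif opcode == ST:
            mem[address] = r
        elif opcode == SBB:
            diff = r - mem[address] - borrow
            borrow = 1 if diff < 0 else 0
            r = diff & MASK
        elif opcode == AND:
            r &= mem[address]
        else:
            raise ValueError(f"unknown opcode {opcode} at address {pc}")
    raise RuntimeError("step limit reached without halting")


def demo(a, b, mask=0x000F):
    """Compute ((a - b) AND mask) on the toy machine and return the result."""
    mem = [0] * 32
    mem[PC_ADDR] = 1                      # execution starts at address 1
    mem[1:13] = [LD, 20, SBB, 21, AND, 22, ST, 23, LD, 24, ST, PC_ADDR]
    mem[20:25] = [a, b, mask, 0, HALT]    # operands, result slot, halt constant
    return run(mem)[23]


if __name__ == "__main__":
    print(demo(7, 5))
```

In this toy layout the last two instructions load the halt constant and store it into the program counter, which is how the sketch expresses control flow without a dedicated jump instruction.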
Thus, the nested emulation strategy minimizes user effort at restoration time.
Another important design aspect of Olonys is the approach used for bootstrapping the nested emulator. As described earlier, the decoder parts of both DBCoder and MOCoder are implemented using DynaRisc so that Olonys can emulate them, and Olonys internally uses a DynaRisc emulator implemented using VeRisc for executing these decoders. Thus, in order to adhere to the ULE philosophy of storing decoders with the data, we have to store (i) the binary instruction streams of the two decoders, (ii) the binary instruction stream of the DynaRisc emulator, and (iii) a description of the VeRisc emulator together with the archived database on analog media. As described earlier, the database itself will be stored in the form of emblems that are generated by MOCoder. Interestingly, we can also convert the DynaRisc instruction stream of DBCoder into emblems and store it similarly to data. This allows for high-density storage of database layout decoding algorithms. Unlike DBCoder, although MOCoder is also programmed using DynaRisc, it cannot be stored as emblems, as it itself is the media layout decoder responsible for decoding and reverse translating emblems into binary. Similarly, the DynaRisc emulator also cannot be stored as emblems as the emulator is needed to execute MOCoder before emblems can be converted back into binary data. For this reason, we convert the binary, VeRisc instruction stream corresponding to MOCoder and DynaRisc emulators into a list of textual characters using a text encoding where letters A to P are used to encode hexadecimal values 0xF to 0x Bootstrap.
So far, we described the internals of the three major ULE components in Micr’Olonys. In this section, we will provide an end-to-end usage overview and describe the interaction between various components.
Archival.
Figure 2(a) shows the internal steps involved in archiving a database using Micr’Olonys. In the first step, existing database tools are used to extract data out of a database for archival. In the second step, DBCoder is used to convert this data into a compact binary form. The third step shown in Figure 2(a) takes the binary data from DBCoder and uses MOCoder to convert it into emblems. The output from MOCoder is a series of high-resolution images that we refer to as data emblems. So far, we have described the steps involved in archiving the data. The second column in Figure 2(a) shows the steps involved in archiving the decoders. The fourth step in the overall procedure is writing the decoding parts of DBCoder and MOCoder using DynaRisc. The result of this step is the instruction streams corresponding to these decoders. In the fifth step, the DBCoder DynaRisc instruction stream is passed to MOCoder to generate a new set of emblems, which we refer to as system emblems, to distinguish them from data emblems. Thus, the DBCoder is itself stored as emblems on analog media.
Figure 2: End-to-end use case; (a) Steps involved in ULE approach of encoding a database together with associated layout decoders using cinema film. (b) Steps involved in decoding and retrieving data back from the film archive.
The sixth step is the archival of MOCoder and the DynaRisc emulator. As described earlier, the DynaRisc instructions of MOCoder and VeRisc instructions of the DynaRisc emulator are converted into a list of letters. These letters are appended to a simple, plain-text pseudocode of the VeRisc emulator to form the Bootstrap. Finally, in the seventh step, the data and system emblems together with the Bootstrap text are physically “written” to the analog media via microfilming (for microfilm), shooting (for cinema film), or printing (for archival paper). Figure 2(a) shows a real cinema film that was generated using this approach. While we have described all the steps in Figure 2, it is important to note that archival is very simple from the user’s point of view. The programming of decoders using DynaRisc and generation of the Bootstrap text is a one-time procedure that is performed in advance by the Olonys developers without any user involvement.
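The letter encoding used in the sixth step can be sketched as follows. The text states only that the letters A to P encode the sixteen hexadecimal digit values; the direction of the mapping (here A for 0x0 through P for 0xF), the high-nibble-first ordering, and the function names are assumptions made for illustration and may differ from the encoding described in the Bootstrap.

```python
# Assumed mapping: 'A' encodes 0x0, 'B' encodes 0x1, ..., 'P' encodes 0xF.
ALPHABET = "ABCDEFGHIJKLMNOP"


def bytes_to_letters(data: bytes) -> str:
    """Encode each byte as two letters, high nibble first (assumed order)."""
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)


def letters_to_bytes(text: str) -> bytes:
    """Decode letters back to bytes, ignoring whitespace such as OCR line breaks."""
    letters = "".join(text.split())
    if len(letters) % 2:
        raise ValueError("expected an even number of letters")
    # str.index raises ValueError for any character outside A..P.
    nibbles = [ALPHABET.index(c) for c in letters]
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))


if __name__ == "__main__":
    stream = bytes([0x00, 0xFF, 0x1A])
    encoded = bytes_to_letters(stream)
    print(encoded)
    print(letters_to_bytes(encoded) == stream)
```

Restricting the text to sixteen plain capital letters keeps the printed instruction stream easy to read back with OCR or by hand, at the cost of two printed characters per byte.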
Restoration.
Figure 2(b) shows the steps involved in restoring the database using Micr’Olonys. The first step in restoring the data is scanning the microform to generate high-resolution images corresponding to each frame. The user extracts the three pages of alphabetical characters that correspond to the instruction streams of the DynaRisc emulator and MOCoder from the images. Any OCR program can be used to automate this task. Similarly, the user converts the images containing emblems into a linear flat array of pixel intensities as described in the Bootstrap. Any standard image handling libraries can be used for automating this task. With image preprocessing done, in the second step, the user implements the VeRisc emulator using the algorithm pseudocode in the Bootstrap. It is important to note here that we make no assumptions about the underlying hardware or software environment on which the emulator will run. The pseudocode is less than 500 lines of code that can be implemented by anyone with a basic programming background. The user then executes the code, thereby emulating a basic, four-instruction virtual machine. In the third step, the Bootstrap code reads the alphabetical characters, decodes them, and instantiates a new DynaRisc emulator as an application inside the VeRisc environment, and executes MOCoder within the DynaRisc emulator. In the fourth step, the system emblems are then read, decoded by MOCoder, and used to load DBCoder. In the fifth step, the remaining data emblems are then read, and decoded by MOCoder first and then by DBCoder to output the ASCII database archive files. Finally, in the sixth and final step, the user loads the standardized data into any future DBMS using then-current tools and interfaces.
Notice that unlike archival, all the steps during restoration are performed by the user without our involvement.
Micr’Olonys is deliberately designed in such a way that the most complex step, the implementation of the VeRisc emulator, is as simple as possible so that the user will still be able to recover the data even if the creators of the system are no longer alive. The Bootstrap provides technical instructions to precisely guide the user through the process of setting up the environment necessary for decoding data.
To demonstrate the feasibility of ULE, we performed several end-to-end experiments where we used Micr’Olonys to archive and restore data using analog media. In this section, we provide details about our experiments that show that the ULE approach can be realized in practice, and analog media can indeed be integrated into the database storage hierarchy.
Paper archive.
For our first experiment, we used the industry-standard TPC-H benchmark to generate a test dataset. We loaded the data into a PostgreSQL database and used pg_dump to generate the database archive in the text-based SQL format. We configured the TPC-H scale factor to produce an archive file that was roughly 1MB in size (1.2MB). We used Micr’Olonys to encode this archive into 26 emblems that were directly printed to A4 paper at 600 dpi using a network-attached Canon ImageRunner Advance 6255i Laser printer. Thus, we achieved a density of roughly 50KB per page. Replacing our A4 paper with an archival-grade one would be the only change required for archiving a database to permanent paper. The combined encoding and printing process took 6 minutes on a low-end Windows laptop equipped with Intel Core i7-6500U CPU clocked at 2.5GHz, and 16GB of DRAM. In order to test our decoding process with a different computing platform locally, we used the same equipment to scan the emblems back as 26 pdf files. We then implemented the VeRisc emulator in C++ and executed it on a Linux server equipped with an Intel Core i9-10920X CPU clocked at 3.5 GHz. The decoding process successfully restored the SQL archive file in 3 minutes and 20 seconds.
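As a rough cross-check of the figures in this experiment, the per-page density and processing rates can be derived from the reported numbers. The sketch below is only arithmetic on values quoted above; it assumes 1 MB = 10^6 bytes and one emblem per A4 page, and it does not know whether the 26 emblems include parity emblems. The variable names are ours.

```python
# Figures quoted in the paper-archive experiment.
ARCHIVE_BYTES = 1.2e6              # "1.2MB", assuming 1 MB = 10**6 bytes
EMBLEMS = 26                       # assumed: one emblem per printed A4 page
ENCODE_PRINT_SECONDS = 6 * 60      # combined encoding and printing time
DECODE_SECONDS = 3 * 60 + 20       # decoding time on the Linux server

bytes_per_page = ARCHIVE_BYTES / EMBLEMS
encode_seconds_per_page = ENCODE_PRINT_SECONDS / EMBLEMS
decode_seconds_per_page = DECODE_SECONDS / EMBLEMS
decode_bytes_per_second = ARCHIVE_BYTES / DECODE_SECONDS

if __name__ == "__main__":
    print(f"archive bytes per page:  {bytes_per_page:,.0f}")
    print(f"encode+print s per page: {encode_seconds_per_page:.1f}")
    print(f"decode s per page:       {decode_seconds_per_page:.1f}")
    print(f"decode throughput (B/s): {decode_bytes_per_second:,.0f}")
```

Under these assumptions each page carries about 46 KB of the archive, which is consistent with the density of roughly 50KB per page reported above.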
Microfilm archive.
In order to show that Micr’Olonys also works with microform, we targeted 16mm microfilm as the archival medium for our second experiment. We used the EPM/Kodak IMAGELINK 9600 Archive Writer for “writing” to microfilm. With this equipment, each frame written to film is a 3888 (width) x 5498 (length) pixel black and white (bitonal) TIFF image. With such a system, Micr’Olonys is capable of storing 1.3GB in a single 66 meter reel. Due to time and budget constraints, we were only able to use Micr’Olonys to encode a 102KB TIFF image (the Olonys logo), instead of the 1MB PostgreSQL database. The image was encoded into 3 emblems by Micr’Olonys, and these emblems were written together with the Bootstrap to the 16mm microfilm. A standard microfilm reader was used to scan back the emblems. The produced scans were also bitonal with a high resolution of about 5000 x 7000 pixels. We used our VeRisc emulator to convert the emblems back to the source image without any errors.
Cinema film archive.
A similar experiment was conducted with 35mm black and white cinema film shown in Figure 2. The same Olonys logo was “shot” as 3 emblems in 3 full-aperture frames (equivalent to the 4/3 image ratio) with a resolution of 2048 x 1556 pixels (2K) using the Arrilaser digital film recorder. The frames were then scanned in grayscale 4K resolution (4096 x 3120 pixels) using a Scanity Immersion from DFT. Both shooting and scanning use the specific DPX image format used for raw cinema frames. Compared to microfilm scanners, we found cinema film scanners to produce sharper, low-distortion images. We used our VeRisc emulator to convert these emblems back to the source image successfully.
While we are pursuing large-scale database archival experiments as a part of on-going research, we believe that our preliminary results with the 102KB image demonstrate clearly that Micr’Olonys can work with any visual analog backend.
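To put the reported microfilm capacity of 1.3GB per 66 meter reel in perspective for larger archives, the following sketch estimates the number of reels and the total film length a given archive would require. It assumes decimal units (1GB = 10^9 bytes) and that capacity scales linearly with the number of reels. The function names are hypothetical.

```python
REEL_CAPACITY_BYTES = 1_300_000_000  # 1.3GB per reel (decimal units assumed)
REEL_LENGTH_METERS = 66


def reels_needed(archive_bytes):
    """Number of whole reels required (integer ceiling division)."""
    return -(-archive_bytes // REEL_CAPACITY_BYTES)


def film_length_km(archive_bytes):
    """Total film length in kilometers for the required reels."""
    return reels_needed(archive_bytes) * REEL_LENGTH_METERS / 1000


for label, size in [("1TB", 10**12), ("1PB", 10**15)]:
    print(f"{label}: {reels_needed(size)} reels, "
          f"{film_length_km(size):.2f} km of film")
```

Under these assumptions, a 1TB archive needs several hundred reels and a 1PB archive several hundred thousand, which suggests that microfilm is better suited to small and medium-sized archives.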
Portability and user friendliness.
The aforementioned experiments demonstrated that the ULE approach can be realized in practice, and Micr’Olonys can successfully archive data to analog media. However, in order to ensure that a user in the distant future can implement the VeRisc emulator on a computing platform that is unknown today, we also undertook two additional tasks. First, we requested people with diverse technical backgrounds, including first-year undergraduate students (at Lycée Bonaparte, Toulon), engineers at a partner institute (CNES), and research staff at EURECOM, to implement the VeRisc emulator in any language and system of their choice. Thus, the emulator was implemented on Windows and Linux in several programming languages including JavaScript, Python, C++, and C
With the growing adoption of data-driven decision making, enterprises are increasingly facing the need to archive data over long time periods to meet legal and regulatory compliance requirements. In this paper, we introduced Universal Layout Emulation as a new approach for long-term database archival that uses universal emulation for archiving layout decoders together with the data. In order to show that ULE can be realized in practice, we presented Micr’Olonys, an end-to-end long-term data archival system based on ULE. Using an experimental evaluation, we demonstrated that Micr’Olonys is portable, easy to use, and can be used to archive databases using analog media.
In future work, we plan to extend Micr’Olonys on several fronts. First, we are working on adding support for compressed, columnar layout encoding schemes in DBCoder that are well-known to provide an order of magnitude reduction in storage utilization over the generic compression support available today. Second, despite its longevity, analog media might not be suited for extremely large databases due to density issues. For example, Micr’Olonys can store 1.3GB in a 66m microfilm reel. This implies that one would need roughly 800 reels for Terabyte-sized data lakes, and hundreds of thousands of reels for Petabyte-sized data lakes. Thus, while microfilm might be a feasible solution for small or medium-sized archives, it is unsuitable for extremely large archives. DNA, in contrast, has a theoretical density of 1EB per mm³. Thus, one avenue of future work we are pursuing is extending Micr’Olonys to be used in conjunction with a DNA-based database archive [2].
https://history.denverlibrary.org/news/wait-minute-you-still-use-microfilm
REFERENCES
[1] Raja Appuswamy, Goetz Graefe, Renata Borovica-Gajic, and Anastasia Ailamaki. 2019. The Five-Minute Rule 30 Years Later and Its Impact on the Storage Hierarchy. CACM 62, 11 (2019).
[2] Raja Appuswamy, Kevin Lebrigand, Pascal Barbry, Marc Antonini, Oliver Madderson, Paul Freemont, James MacDonald, and Thomas Heinis. 2019. OligoArchive: Using DNA in the DBMS storage hierarchy. In CIDR.
[3] James Bornholt, Randolph Lopez, Douglas M. Carmean, Luis Ceze, Georg Seelig, and Karin Strauss. 2016. A DNA-Based Archival Storage System. In ASPLOS.
[4] Stefan Brandl and Peter Keller-Marxer. 2007. Long-term Archiving of Relational Databases with Chronos. In First Intl. Workshop on Database Preservation.
[5] David Carrere and Vincent Joguin. Patent WO/2003/052542, 17 June 2004. Data processing method and device.
[6] Dana M. Caudle, Cecilia M. Schmitz, and Elizabeth J. Weisbrod. 2013. Microform Not extinct yet: Results of a long-term microform use study in the digital age. Library Collections, Acquisitions, and Technical Services 37, 1 (2013).
[7] Patrick Anderson et al. 2018. Glass: A New Media for a New Era? In HotStorage.
[8] Robert E. Fontana and Gary M. Decad. 2018. Moore’s law realities for recording systems and memory storage components: HDD, tape, NAND, and optical. AIP Advances 8, 5 (2018).
[9] Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos, and Ewan Birney. 2013. Toward Practical High-capacity Low-maintenance Storage of Digital Information in Synthesised DNA. Nature
The First ACM+IEEE Joint Conf. on Digital Libraries
International Journal of Legal Information 26, 1-3 (1998).
[17] Jeff Rothenberg. 2000. Using Emulation to Preserve Digital Documents.