Revealing digital documents: Concealed structures in data

Jakob Voß

Verbundzentrale des GBV (VZG), Göttingen, Germany, [email protected]
Abstract. This short paper gives an introduction to a research project that analyzes how digital documents are structured and described. Using a phenomenological approach, this research will reveal common patterns that are used in data, independent of the particular technology in which the data is available. The ability to identify these patterns, on different levels of description, is important for several applications in digital libraries. A better understanding of data structuring will not only help to better capture singular characteristics of data by metadata, but will also recover intended structures of digital objects, beyond long-term preservation.
Keywords: data, data description, metadata, data modeling, patterns
Despite the growing importance of digital documents in libraries, the theoretical underpinning of data in library and information science is still insufficient. The majority of bibliographic descriptions only exist in digital form. Increasingly, documents only exist as digital objects, which impacts traditional concepts such as 'document', 'page', 'edition', and 'copy'. Meanwhile most metadata consists of digital documents that describe other digital documents. With the advent of networked environments, these documents basically exist as streams of bits, abstracted from any storage medium and location. Although in practice concrete forms, such as 'files', 'records', and 'objects', are dealt with, these forms are different views on the same thing rather than inherent properties of a digital document. So what is this 'same thing' when we talk about a digital document? It has been shown that the nature of documents can better be defined in terms of function rather than format [5], and that the key properties which constitute and identify a document depend on context [26]. This highlights the importance of descriptive metadata to put data in context, but it does not eliminate the need to actually look at data at some level of description. In practice, we often have to deal with heterogeneous documents provided as data that must be indexed and preserved, or with metadata that is aggregated from diverse sources, without an exact description of the data on a higher level.

This paper proposes that a deeper look at data is required to reveal how digital documents are actually structured and described. The question should not be answered by simply pointing to concrete technologies and formats, which are subject to rapid change and obsolescence (an ongoing trend that is most visible in hypes, for instance, today, cloud computing and the Semantic Web), but at a more fundamental level. The main hypothesis of this research is that all methods to structure and describe data share common patterns, independent of technology and level of description.
The concept of data is used in many disciplines with various meanings. Ballsun-Stanton explores how different individuals understand data in different "philosophies of data" [3]: the concept can range from the product of objective, reproducible measurements ("data as hard numbers") to the product of any recorded observations ("data as observations"), or processable encodings of information and knowledge ("data as bits"). With this research I commit to the third philosophy, which is often found in computer science and in library and information science. (This does not imply that results will not be applicable to other philosophies of data; revealed data patterns may also reflect typical structures of data observation and measurement, but this is beyond the scope of this work.) However, both disciplines do not use data as a core concept but relate it to information as the main topic of interest. The growing amount of freely available "open data", and of tools to analyze this data, has brought up ideas of "data science" and "data journalism". Both deal with aggregating, filtering, and visualizing large sets of data, based on statistical methods of data analysis. The growth of data-driven science, combined with principles of Open Access, also raises awareness of the need to publish and share data sets. Library institutions begin to recognize this need and start to provide infrastructure for collecting and identifying research data [1]. Data discussed in this context is mainly seen as "data as observations", and the main concern of data science evangelists seems to be "big data", that is, "when the size of the data itself becomes part of the problem" [17]. The problem that I want to tackle does not depend on the size of the data or on problems of preservation [19], but on the inherent complexity of data, independent of its applications.

While disciplines that deal with physical documents, such as codicology and palaeography, have long been acknowledged as part of library and information science, there is no established curriculum of data studies as yet. The best examination of data by libraries so far can be found in long-term preservation of digital material and in metadata research and practice. The former is still in an early stage of development. (By definition one can only speak retrospectively of successful long-term preservation, but most digital objects are too young to judge.) It provides two general strategies to cope with the rapid change and decay of technologies: either you need to emulate the environment of digital objects, or you must regularly migrate them to other environments and formats. Both strategies require good descriptions of the data to be archived. As time passes, these descriptions themselves become subject of preservation, and digital objects may get buried in nested layers of metadata. Metadata research deals with data on a more explicit level. Although metadata has become one of the core concepts of library and information science, there is no commonly agreed-upon definition. The general "data about data" definition at least makes clear that metadata is data about something. Coyle's definition of metadata as something constructed, constructive, and actionable [6] highlights the relevance of function and context, as Buckland [5] and Yeo [26] do for documents. As a result, there are numerous ways to describe the same object by data, and the same data can describe different things. Metadata research provides at least some guidelines for interoperability by metadata registries, application profiles, and crosswalks. However, in practice a lot of manual work is needed to make use of metadata, because context and function are not fully known, or creators of data just do not comply with assumed standards. Currently, the Resource Description Framework (RDF) and persistent identifiers promise to solve most problems. However, as also confirmed by the preliminary results below, there is no silver bullet in data description.
Data is always a simplified, context-dependent image of the information, knowledge, or reality that an attempt has been made to encode in data. Some good criticism of the expressive power of particular data encoding languages has been given by Kent [14,15,16].

Patterns, as structured methods of describing good design practice, were first introduced by Alexander et al. in the field of architecture [2]. In their words, "each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem." Patterns were later adopted in other fields of engineering, especially in (object-oriented) software design [4,9]. There are some works that describe patterns for specific data structuring or data modeling languages, among them Linked Data in RDF [8], markup languages [7], data models in enterprises [11,20], and meta models [12,21]. A general limitation of existing approaches is their focus on one specific formalization method. This practical limitation obstructs the view of more general data patterns, independent of a particular encoding, and it conceals blind spots and weaknesses of the chosen formalism.
A preliminary analysis of different structuring methods shows that each data language highlights some structuring features that are then overused and may even conceal intended structures of data. For instance, the nesting and order of elements in an XML document can be chosen with intent. However, they can also be chosen in an arbitrary manner, because an XML document must always be an ordered tree. For this reason we cannot rely only on official descriptions and specifications to reveal patterns in data. Most existing approaches to analyze data structuring are either normative (theoretical descriptions of how data should be) or empirical but limited. Existing empirical approaches only view data at one level of description, in order to have a base for statistical methods (data mining) and other automatic methods (machine learning). In contrast, I use a phenomenological research method that includes all aspects of data structuring and description. Besides technical standards that specify data, one must also consider software that shapes data, typical examples of data, and how data is actually used by people.

The phenomenological method views data as social artifacts that cannot be described from an absolute, objective point of view. Instead, data are studied as "'phenomena': appearances of things, or things as they appear in our experience" [23]. The analysis begins with a detailed review of methods and systems for structuring and describing data, from simple character encodings to data languages and even graphical notations. The focus is on conceptual properties, while details of implementation, such as performance and security, are only mentioned where they show how and why specific techniques have evolved.
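The arbitrariness of XML's ordered tree, noted above, can be made concrete with a minimal sketch using Python's standard XML parser (the element names are invented for illustration): two documents that differ only in sibling order are distinct XML trees, even if that order was never chosen with intent.

```python
# Sketch: XML always imposes sibling order, whether or not the order is intended.
import xml.etree.ElementTree as ET

a = ET.fromstring("<doc><title>Data</title><author>Voss</author></doc>")
b = ET.fromstring("<doc><author>Voss</author><title>Data</title></doc>")

def children(elem):
    """Return (tag, text) pairs of an element's children, in document order."""
    return [(c.tag, c.text) for c in elem]

# As ordered trees, the two documents differ ...
assert children(a) != children(b)
# ... although as unordered field sets they carry the same information.
assert set(children(a)) == set(children(b))
```

Whether the difference between the two documents is meaningful cannot be decided from the XML alone; it depends on the (often undocumented) intent behind the format.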
The first outcome of this work is a broad typology of existing methods to structure and describe data. These methods are normally described as data codes, systems, languages, or models, without consistent terminology across technologies. The following groups of methods can be identified, each with a primary but not exclusive purpose:

– character and number encodings to express data
– identifiers and query languages to identify data
– file systems and databases to store data
– data structuring languages and markup languages to structure data
– schema languages to define and constrain data
– conceptual modeling languages to abstract and describe data

These methods are rarely discussed together as general structures with data as their common domain. Instead, a strong focus on trends and families of basic technologies is found, which often concentrates on one specification or implementation. Examples include the dominance of the Structured Query Language (SQL) and the hype around the Extensible Markup Language (XML) in the late 1990s. With size and speed as the main driving forces of development, there is little progress at the conceptual level. An example is the large gap between research and practice in conceptual modeling languages, which are mostly used in the form of an oversimplified Entity-Relationship Model (ERM) [22].

The main empirical part of the analysis consists of a detailed description and placement of the most relevant instances and subgroups from the typology above. It is shown how each structuring method has its strengths and limitations, and how each method shapes digital objects independent of the object's characteristic properties. A deeper look at data also shows that the most influential technologies of data structuring are not used in one exact and established form, but occur as groups of slightly differing variants. For example, the set of data expressible in RDF/XML differs from the full RDF triple model.
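The gap between an abstract model and its serializations can be illustrated with a simplified sketch of the triple model: a graph is an unordered set of triples, while any serialization must impose an order the model itself lacks. (Plain Python tuples stand in for a real RDF library here; the URIs and prefixes are invented for illustration.)

```python
# Sketch: an RDF-like graph as an unordered set of (subject, predicate, object)
# triples. Insertion order carries no meaning in the model.
graph = {
    ("ex:doc1", "dc:title", '"Revealing digital documents"'),
    ("ex:doc1", "dc:creator", "ex:voss"),
    ("ex:voss", "foaf:name", '"Jakob Voss"'),
}

def serialize(g):
    """Any serialization must impose an order that the model itself lacks."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in sorted(g))

# The same triples written down in another order yield an equal graph ...
same = {
    ("ex:voss", "foaf:name", '"Jakob Voss"'),
    ("ex:doc1", "dc:creator", "ex:voss"),
    ("ex:doc1", "dc:title", '"Revealing digital documents"'),
}
assert graph == same
# ... and only the serialization makes the order (here: sorted) visible at all.
assert serialize(graph) == serialize(same)
```

Note what the sketch leaves out: blank nodes, literals with datatypes, and entailment, which is precisely where the variants discussed below begin to diverge.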
This triple model, with URIs, blank nodes, and literals, also has different characteristics and limitations depending on which technology (serialization, triple store, SPARQL, reasoners, etc.) and which entailment regime (simple, RDF, OWL, etc.) is applied. Other examples include SQL databases, which substantially differ from the original relational database model, and the family of XML-related standards. In many cases, confusion originates from differences between syntaxes and implementations on one side and abstract models on the other.

To some degree, common patterns can be derived from specific systems by modeling them in a higher-level modeling language, such as ERM and Object Role Modeling (ORM), or in schema and ontology languages such as Backus-Naur Form (BNF), XML Schema (XSD), and the Web Ontology Language (OWL). In software engineering it is common practice to use domain-specific modeling languages in nested layers of abstraction [13]. These languages exist in many variants as tools to communicate between levels of description. As a result, each language highlights a specific subset of patterns and makes other patterns less visible or more difficult to apply. Typical instances of data further show that in practice, patterns and levels of abstraction often overlap, and that methods of structuring are often used against their original purpose. Typical examples include the creation of dummy values for non-existing mandatory elements and the use of separators to add lists to non-repeatable fields. It appears that in practice it is often difficult to judge which properties of data are intended and which arise as artifacts from the constraints of a given modeling language. The example of XML was already mentioned above: XML structures data in the form of an ordered tree, but many instances of XML documents use this feature to apply patterns other than hierarchy and strict ordering. Figure 1 shows an example from a yet to be finished catalog of data patterns.
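The separator workaround just mentioned can be sketched in a few lines (a hypothetical field encoding, not any particular library format). It also shows two implications the pattern catalog records: the separator character becomes forbidden inside values, and an empty list becomes indistinguishable from a list of one empty value.

```python
# Sketch: a list of values squeezed into one non-repeatable string field,
# using a separator. The separator is then forbidden inside the values
# ("forbidden objects" implication of the separator pattern).
SEPARATOR = ";"

def pack(values):
    for v in values:
        if SEPARATOR in v:
            raise ValueError(f"separator {SEPARATOR!r} not allowed in value {v!r}")
    return SEPARATOR.join(values)

def unpack(field):
    # An empty field and a field holding one empty value look identical:
    # one of the structures such ad-hoc encodings silently conceal.
    return field.split(SEPARATOR)

assert unpack(pack(["Kernighan", "Ritchie"])) == ["Kernighan", "Ritchie"]
assert unpack(pack([])) == [""]   # the empty list is lost in the round trip
```

From the stored field alone, a later consumer cannot tell whether the pattern in use is a single value, a separated list, or an empty list, which is exactly the kind of concealed structure this analysis aims to reveal.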
sequence pattern

idea: strictly order multiple objects, one after another

context: a collection of multiple objects

implementations:
– If objects have a known size, they can be directly concatenated. If objects have the same size, this results in the array pattern.
– The separator pattern can be used to separate each object from its successor. To distinguish objects and separators, this implies the forbidden objects pattern. If separators may occur directly after each other, this may also imply the empty object pattern.
– You can link one object to its successor with an identifier. To avoid link structures that result in other patterns (tree, graph, . . . ), additional constraints must apply.
– If objects have consecutive positions, a sequence is implied by their order.

examples:
– string of ASCII characters (array)
– string of Unicode characters in UTF-8 (each character has a known size)
– 'Kernighan and Ritchie' (sequence with ' and ' as separator)
– extract → transform, transform → load (sequence of linked steps)

counter examples: files in a file system, records in a database table, any unordered collection

motivation: sequences are a natural method to model one-dimensional phenomena, for instance sequences of events in time. As digital storage is structured as a sequence of bits, sequences seem to be the natural form of data, and counterexamples, such as formal diagrams and visual programming languages, are often not considered as data.

problems: empty sequences and sequences of only one element are difficult to spot, as in other collection patterns.

similar patterns: without context, sequences are difficult to distinguish from other collection patterns. Many implementations of other patterns use sequences on a lower level.

implied patterns: position pattern

specialized patterns: array, ordered set, ring

Fig. 1. Example of a pattern description. Pattern names are underlined.

The preliminary results show a large variety of methods to structure and describe data. The research hypothesis can be confirmed, as common patterns like identifiers, repeatability, grouping, sequences, and ordering are used, explicitly and in different variants, on all levels of description. The ability to identify and apply these patterns is crucial for several applications in digital libraries. Some patterns are already recognized, but the results show that a more systematic view is lacking, independent of the constraints of particular technologies. A better understanding of methods to structure and describe data can help both the creation of data and its consumption. These applications are briefly illustrated in the following.

Creation of data in libraries is most notably present as creation of metadata. This process is guided by complex cataloging rules and specialized formats. Both are deeply intertwined and often criticized as barriers to innovation. However, simpler forms of metadata do not provide a solution [25]. Remarkably, alternatives are most visible as technologies, for instance XML [24] or RDF [6]. Despite the strengths of each technology, it is unlikely that one method will provide the ultimate tool to express all metadata. Instead, a look at metadata patterns can aid the construction of more precise and interoperable metadata that better captures an object's unique characteristics. The nature of patterns in general shows that data creation is not an automatic process, but a creative act of design. Recognizing the artificial nature of data will, to some degree, free data designers from apologies and unquestioned habits that are justified as enforced by natural needs or technical requirements.

Consumption of data can benefit even more from an understanding of data patterns. Since the invention of digital computers, technologies and formats have changed rapidly. This fluctuation is unlikely to slow down, because it is also driven by trends, as progress in data description (in contrast to quantitative data processing) is difficult to measure. The results show that many description methods result in structures other than those originally intended, when the patterns that are actually applied are examined. Relevant structures are less visible if you concentrate on single technologies. Knowledge of general data patterns can therefore help to reveal concealed structures in digital documents. This application could be named "data archeology". Data archeology, in contrast to long-term preservation, which tries to prevent the need for the former, deals with the retrospective analysis of incompletely defined or unknown data. (The most closely related existing discipline is digital forensics, which has a more specific scope; its application to more complex and heterogeneous methods of data structuring, e.g. databases, is in an early stage of development [18].) Similar to traditional archeology, data archeology belongs to the humanities, as it involves the study of the cultural context of data creation and usage. Existing techniques from computer science, like data mining and knowledge discovery, provide useful tools to discover detailed views on data. However, they cannot reveal its meaning as part of social practice.
Data patterns can provide a contribution to intellectual data analysis, which is needed to underpin and interpret algorithmic data analysis.

Besides the creation of a catalog of the most common data patterns, as basic primitives and derived patterns, there are some open tasks that may be answered by the analysis described in this paper. It is assumed that no closed system or meta-system can fully describe all aspects of practical data. This thesis could be proved, at least for formal systems of description, based on results of Gödel [10]. Further research, which will probably not be covered fully in this work, includes how to best find known patterns in given data using semi-automatic methods, and which methods are best suited to express a given set of patterns.

In any case, libraries can benefit from a general understanding of data and data patterns, at least as deep as the current understanding of physical publication types and material.
References
1. Special issue on research data. D-Lib Magazine 17(1/2) (January 2011), http://dx.doi.org/10.1045/january2011-contents
2. Alexander, C., Ishikawa, S., Silverstein, M.: A Pattern Language: Towns, Buildings, Construction. Oxford University Press (1977)
3. Ballsun-Stanton, B.: Asking about data: Experimental philosophy of information technology. In: 5th International Conference on Computer Sciences and Convergence Information Technology. pp. 119–124 (2010)
4. Beck, K., Cunningham, W.: Using pattern languages for object oriented programs. In: Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) (1987)
5. Buckland, M.: What is a "digital document"? Document Numérique 2(2), 221–230 (1998)
6. Coyle, K.: Understanding the semantic web: Bibliographic data and metadata. Library Technology Reports 46(1) (2010)
7. Dattolo, A., Iorio, A.D., Duca, S., Feliziani, A.A., Vitali, F.: Structural patterns for descriptive documents. In: Baresi, L., Fraternali, P., Houben, G.J. (eds.) ICWE. Lecture Notes in Computer Science, vol. 4607, pp. 421–426. Springer (2007)
8. Dodds, L., Davis, I.: Linked Data Patterns (2010), http://patterns.dataincubator.org/book/
9. Gamma, E., Helm, R., Johnson, R., Vlissides, J.M.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1 edn. (1994)
10. Gödel, K.: Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme. Monatshefte für Mathematik und Physik 38(1), 173–198 (1931)
11. Hay, D.C.: Data Model Patterns: Conventions of Thought. Dorset House Publishing (1995)
12. Hay, D.C.: Data Model Patterns: A Metadata Map. Morgan Kaufmann (2006)
13. Kelly, S., Tolvanen, J.P.: Domain-Specific Modeling: Enabling Full Code Generation. Wiley (2008)
14. Kent, W.: Data and Reality: Basic Assumptions in Data Processing Reconsidered. North-Holland (1978)
15. Kent, W.: The many forms of a single fact. In: Proceedings of the IEEE COMPCON (1988)
16. Kent, W.: The unsolvable identity problem. In: Extreme Markup Languages (2003)
17. Loukides, M.: What is data science? O'Reilly Radar (June 2010), http://radar.oreilly.com/2010/06/what-is-data-science.html
18. Olivier, M.S.: On metadata context in database forensics. Digital Investigation 5(3-4), 115–123 (2009)
19. Rosenthal, D.S.H.: Bit preservation: A solved problem? The International Journal of Digital Curation 5(1), 134–148 (2010)
20. Silverston, L.: The Data Model Resource Book: A Library of Universal Data Models for All Enterprises, vol. 1. John Wiley & Sons (2001)
21. Silverston, L., Agnew, P.: The Data Model Resource Book, Vol. 3: Universal Patterns for Data Modeling. Wiley (2009)
22. Simsion, G.: Data Modeling Theory and Practice. Technics Publications (2007)
23. Smith, D.W.: Phenomenology. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy. Summer 2009 edn. (June 2009), http://plato.stanford.edu/archives/sum2009/entries/phenomenology/
24. Tennant, R.: MARC must die. Library Journal (October 2002)
25. Tennant, R.: Digital libraries: Metadata's bitter harvest. Library Journal 12 (2004)
26. Yeo, G.: 'Nothing is the same as something else': Significant properties and notions of identity and originality. Archival Science 10(2), 85–116 (2010)