Infrastructure-Agnostic Hypertext
Jakob Voß [email protected] des GBV (VZG)
ABSTRACT
This paper presents a novel and formal interpretation of the original vision of hypertext: infrastructure-agnostic hypertext is independent from specific standards such as data formats and network protocols. Its model is illustrated with examples and references to existing technologies that allow for implementation and integration in current information infrastructures such as the Internet.
CCS CONCEPTS
• Information systems → Hypertext languages; Document representation; • Human-centered computing → Hypertext / hypermedia.
KEYWORDS
Xanadu, hypermedia, transclusion, documents
INTRODUCTION
The original vision of hypertext as proposed by Ted Nelson [14, 22] still waits to be realized. His influence is visible in the people who, inspired by his works, shaped the computer world of today, not least the Web [20]. Nelson’s core idea, a network of visibly connected documents called Xanadu, goes beyond the Web in several aspects. In particular it promises non-breaking links and it uses links to build documents (with versions, quotations, overlay markup...) instead of using documents to build links [18]. The concept of hypertext or more general ‘hypermedia’ has also been used differently from Nelson, both in the literary community (that focused on simple links), and in the hypertext research community (that focused on tools) [30].
This paper tries to get back to the original vision of hypertext by specification of a formal model that puts transcludeable documents at its heart. Apart from Nelson’s works [14–17, 19, 22] this paper draws from research on data format analysis [11, 28], content-based identifiers [12, 26], existing transclusion technologies for the Web [1, 2, 6, 25] and hypertext systems beyond link-based models [3].
Limited by the state of document processing tools and by submission guidelines, this paper is not a demo of hypertext. On a closer look however there are traces of transclusion links that have been processed into this paper. See figure 2 for an overview and https://github.com/jakobib/hypertext2019 for details and sources. See [5] for a good example of what an actual demo paper might look like.
OUTLINE
The architecture of an infrastructure-agnostic hypertext system consists of four basic elements and their relations:
1) documents include all finite, digital objects
2) document identifiers reference individual documents
3) content locators reference segments within documents
4) edit lists combine parts of existing documents into new ones
Instances of each element can further be grouped by data formats. The elements and their relations are each described in the following sections after a formal definition.
Formal model
A hypertext system is a tuple $\langle D, I, C, E, S, R, U, T, A \rangle$ where:
$D$ is a set of documents
$I$ is a set of document identifiers with $I \subset D$
$C$ is a set of content locators with $C \subset D$
$E$ is a set of edit lists with $E \subset D$
$S$ is a set of document segments with $S \subset C \times D$
Document sets can each be grouped into (possibly overlapping) data formats. The hypertext system further consists of:
$R$ is a retrieval function with $R : I \to D$
$U$ is a segments usage function with $U : E \to \mathcal{P}(S)$
$T$ is a transclusion function with $T : S \to D$
$A$ is a hypertext assemble function with $A : E \to D$
A practical hypertext system needs executable implementations of the functions $R$, $U$, $T$, $A$ and a method to tell whether a given combination $\langle c, d \rangle \in C \times D$ is part of $S$ to allow its use with $T$.
Documents
A document is a finite sequence of bytes. This definition roughly equates with the definition of data as documents [8, 29]. The notion of hypertext used in this paper therefore subsumes all kinds of hyperdata (datasets that transclude other datasets). Documents are static by content [24] but they may be processed dynamically. Documents are grouped by non-disjoint data formats such as UTF-8, CSV, SVG, PDF and many more.
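The definitions so far can be written down as types. The following Python sketch is only an illustration under stated assumptions: the names `HypertextSystem`, `Format`, `is_utf8` and `is_ascii` are hypothetical and not part of the model, and a data format is approximated by a membership test on byte sequences.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# A document is a finite sequence of bytes. Identifiers, content locators
# and edit lists are documents too (I, C and E are subsets of D).
Document = bytes
Identifier = bytes
Locator = bytes
EditList = bytes
Segment = Tuple[Locator, Document]  # an element of S, a subset of C x D

# Assumption of this sketch: a data format (a set of documents) is
# represented by a function that tests membership in that set.
Format = Callable[[Document], bool]


@dataclass(frozen=True)
class HypertextSystem:
    """The functions of the tuple <D, I, C, E, S, R, U, T, A>."""
    retrieve: Callable[[Identifier], Document]        # R : I -> D
    usage: Callable[[EditList], FrozenSet[Segment]]   # U : E -> P(S)
    transclude: Callable[[Segment], Document]         # T : S -> D
    assemble: Callable[[EditList], Document]          # A : E -> D
    is_segment: Callable[[Locator, Document], bool]   # is <c, d> part of S?


def is_utf8(doc: Document) -> bool:
    """Format test: does the byte sequence decode as UTF-8?"""
    try:
        doc.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_ascii(doc: Document) -> bool:
    """Format test: are all bytes in the ASCII range?"""
    return all(byte < 128 for byte in doc)


# Formats are not disjoint: an ASCII document is also a UTF-8 document.
sample = b"Hello, Alice!"
formats_of_sample = [name for name, test in
                     [("ASCII", is_ascii), ("UTF-8", is_utf8)] if test(sample)]
```

The same byte sequence belongs to both formats here, which is the non-disjointness mentioned above. A real system would need far richer format descriptions than a boolean test.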
Document identifiers
A document identifier is a relatively short document that refers to another document. Identifiers have properties depending on the particular identifier system they belong to [29, pp. 59-71]. Identifiers in infrastructure-agnostic hypertext must first be unambiguous (an identifier must reference only one document), persistent (the reference must not change over time), and actionable (hypertext systems should provide methods to retrieve documents via the retrieval function $R$). Properties that should be fulfilled at least to some degree include uniqueness (a document should not be referenced by too many identifiers), performance (identifiers should be easy to compute and to validate), and distributedness (identifiers should not require a central institution). The actual choice of an identifier system depends on weighting of specific requirements. A promising choice is the application of content-based identifiers as proposed at least once in the context of hypertext systems [12].
Content locators
A content locator is a document that can be used to select parts of another document via transclusion. Nelson refers to these locators as “reference pointers” [19], exemplified with spans of bytes or characters in a document. Content locators depend on data formats and document models. For instance locator languages XPath, XPointer, and XQuery act on XML documents, which can be serialized in different forms (therefore it makes no sense to locate parts of an XML document with positions of bytes). Other locator languages apply to tabular data (SQL, RFC 7111), to graphs (SPARQL, GraphQL), or to two-dimensional images (IIIF), to name a few. Whether and how parts of a document can be selected with a content locator language depends on which data format the document is interpreted in. For instance an SVG image file can be processed at least as image, as XML document, or as Unicode string, each with its own methods of locating document segments.
Content locators can be extended to all executable programs that reproducibly process some documents into other documents. This generalization can be useful to track data processing pipelines as hyperdata such as discussed for executable papers and reproducible research. Restriction of content locators to less powerful query languages might make sense from a security point of view.
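How a locator depends on the format a document is read in can be illustrated with a small Python sketch. The sample SVG documents and the helper names `locate_chars` and `locate_xml` are assumptions made for this illustration. The element path is the limited path syntax of Python's `xml.etree.ElementTree`, used here as a stand-in for a full XPath implementation, and the helpers return strings instead of documents for brevity.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Two serializations of the same XML document model (hypothetical samples).
doc_a = '<svg xmlns="http://www.w3.org/2000/svg"><title>Voß</title></svg>'.encode("utf-8")
doc_b = "<svg xmlns='http://www.w3.org/2000/svg'>\n  <title>Voß</title>\n</svg>".encode("utf-8")


def locate_chars(document: bytes, start: int, end: int) -> str:
    """Read the document as a Unicode string and select characters start..end-1."""
    return document.decode("utf-8")[start:end]


def locate_xml(document: bytes, path: str) -> str:
    """Read the document as XML and select the text of the first matching element."""
    element = ET.fromstring(document).find(path)
    if element is None:
        raise LookupError(f"no element matches {path!r}")
    return element.text or ""


# Character offset of the title text in the first serialization.
start = doc_a.decode("utf-8").index("Voß")

as_xml_a = locate_xml(doc_a, SVG_NS + "title")      # same result for both
as_xml_b = locate_xml(doc_b, SVG_NS + "title")      # serializations
as_chars_a = locate_chars(doc_a, start, start + 3)  # matches only doc_a
as_chars_b = locate_chars(doc_b, start, start + 3)  # shifted by whitespace
as_bytes_a = doc_a[start:start + 3]                 # cuts a UTF-8 sequence
```

The XML locator selects the same segment in both serializations. The character locator is only valid for the serialization it was computed for, and the byte span with the same numbers cuts the two-byte character "ß" in half.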
Edit lists
An edit list is a document that describes how to construct a new document from parts of other documents. Edit lists are known as Edit Decision Lists in Xanadu and the idea was borrowed from film making [15]. Simplified forms of edit lists are implemented in version control systems and in collaborative tools such as wikis and real-time editing. Hypertext edit lists go beyond this one-dimensional case by support of multiple source documents and by more flexible methods of document processing in addition to basic operations such as insert, delete, and replace. The actual processing steps tracked by an edit list depend on data formats of transcluded documents. Just like content locators, edit lists could be extended to arbitrary executable programs that implement the hypertext assemble function $A$ for some subset of edit lists. To ensure reproducibility and reliable transclusion, these programs must not access unstable external information such as documents that may change [24]. If document models or segments change, locators may not be applicable anymore [6].
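A minimal sketch of such an edit list for plain character strings is given below. The classes `Span`, `Edit` and `EditList`, the functions `assemble` and `usage`, the document names and the sample texts are all hypothetical. The sketch covers only the basic operations insert, delete, and replace, and it simplifies segments usage to the set of transcluded spans.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Span:
    """Characters start..end-1 of a source document."""
    source: str
    start: int
    end: int


@dataclass(frozen=True)
class Edit:
    """Replace characters start..end-1 of the working text with a span.

    start == end makes the edit an insert, span None makes it a delete.
    """
    start: int
    end: int
    span: Optional[Span] = None


@dataclass(frozen=True)
class EditList:
    base: str                # name of the document to start from
    edits: Tuple[Edit, ...]  # applied in the given order


def assemble(edit_list: EditList, docs: Dict[str, str]) -> str:
    """Pure function of the edit list and the (immutable) source documents."""
    text = docs[edit_list.base]
    for edit in edit_list.edits:
        piece = ""
        if edit.span is not None:
            piece = docs[edit.span.source][edit.span.start:edit.span.end]
        text = text[:edit.start] + piece + text[edit.end:]
    return text


def usage(edit_list: EditList) -> Set[Span]:
    """Spans of other documents that the edit list transcludes."""
    return {edit.span for edit in edit_list.edits if edit.span is not None}


docs = {
    "draft": "Dear X, see you soon.",
    "names": "Alice and Bob",
    "closing": " Kind regards",
}
edit_list = EditList(base="draft", edits=(
    Edit(5, 6, Span("names", 0, 5)),       # replace "X" with "Alice"
    Edit(19, 24),                          # delete " soon"
    Edit(20, 20, Span("closing", 0, 13)),  # insert the closing at the end
))
result = assemble(edit_list, docs)
```

Because `assemble` reads nothing but the edit list and the given source documents, repeating it yields the same result as long as the sources do not change. This is the reproducibility requirement stated above.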
Data formats
An often neglected fundamental property of digital documents is their grounding in data formats. A data format is a set of documents that share a common data model, also known as their document model, and a common serialization (see fig. 1). Models define elements of a document in terms of sets, strings, tuples, graphs or similar structures. These structures are mathematically rigorous in theory [24] but more based on descriptive patterns in practice [28]. The meaning of these elements (for instance “words”, “sentences”, and “paragraphs” in a document model) is based on ideas that we at least assume to be consistent among different people.
Figure 1: levels of data modeling (idea, model, format, ..., serialization)
Data modeling, the act of mapping between ideas, models, and formats, is an unsolved problem because ideas can be expressed in many models and models can be interpreted in many ways [11]. Data models can further be expressed in multiple formats although these formats should fully be convertible between each other, at least in theory. Formats can further be serialized in multiple forms, which for their part are based on other data models. For instance the RDF model can be serialized in RDF/XML format which is based on the XML model. XML can be serialized in XML syntax which is based on the Unicode model, and Unicode can be serialized in UTF-8. At the end of these chains of abstraction eventually all documents can be serialized as sequence of bytes. Serializations are seldom simple mappings as most serialization formats allow insignificant variances such as additional whitespace. To check whether a data object conforms to a serialization, formats are often described with a formal grammar that can also give insights about the format’s model. Infrastructure-agnostic hypertext does not impose any limits on possible data formats and their models.
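The effect of insignificant variances can be shown with two JSON serializations of one data object. JSON is chosen here only for convenience. The function `canonical` is a hypothetical normalization (sorted keys, no optional whitespace, UTF-8) and not a standardized canonical form.

```python
import hashlib
import json

# Two byte sequences that serialize the same data object.
doc_a = b'{"name": "Alice", "langs": ["en", "de"]}'
doc_b = b'{ "langs" : ["en","de"],\n  "name" : "Alice" }'


def canonical(doc: bytes) -> bytes:
    """One possible canonical form: sorted keys, no optional whitespace."""
    value = json.loads(doc)
    text = json.dumps(value, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def sha1(doc: bytes) -> str:
    """Content-based identifier of a byte sequence."""
    return hashlib.sha1(doc).hexdigest()


same_model = json.loads(doc_a) == json.loads(doc_b)        # equal as data
same_bytes = doc_a == doc_b                                # differ as bytes
same_id_raw = sha1(doc_a) == sha1(doc_b)                   # differ as identifiers
same_id_canonical = sha1(canonical(doc_a)) == sha1(canonical(doc_b))
```

The two documents are equal on the level of the model but differ as byte sequences, so content-based identifiers computed from the raw bytes differ as well. Identifiers computed from a canonical form coincide.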
EXAMPLE
The following example may illustrate the formal model and its elements. Let $\langle D, I, C, E, S, R, U, T, A \rangle$ be a hypertext system with $D$ the set of printable ASCII character strings and some documents:
$d_1$ = ‘My name is Alice’
$d_2$ = ‘Alice’
$c_1$ = ‘char=11,15’
$d_3$ = ‘Hello, !’
$c_2$ = ‘char=7’
$d_4$ = ‘Hello, Alice!’
If $c_1$ and $c_2$ are read in content locator syntax as defined by RFC 5147 so that $T(\langle c_1, d_1 \rangle) = d_2$ then $d_4$ can be constructed by transcluding a document segment of $d_1$ into $d_3$ at position $c_2$. The corresponding edit list $e \in E$ with $A(e) = d_4$ could look like this:
take 995f37f2e066b7d8893873ca4d780da5bf017184
insert at 48ba94c47b45390b6dd27824cfc0d8468c2cbc71
from fcb59267e2e6641140578235c8cb6d38eaf6abc1
segment c5b794c7ae5d490f52a414d9d19311b9a19f61b3
The values in $e$ are SHA-1 hashes of $d_3$, $c_2$, $d_1$, and $c_1$ respectively. Retrieval function $R$ maps them back to strings. Hyperlinks are given by $U(e) = \{\langle c_2, d_3 \rangle, \langle c_1, d_1 \rangle\}$ used for editing $d_3$ to $d_4$ (versioning) and for referencing of a segment of $d_1$ in $d_4$ (transclusion).
In practice it’s often unknown whether two data formats actually share the same model, especially if models are only given implicitly by definition of their formats.
Formats may also exist purely implicitly in form of sample instances from which grammar, model, and ideas must be guessed by reverse engineering [28].
IMPLEMENTATIONS
One of the problems faced by project Xanadu was that it long required new developments such as computer networks, document processing, and graphical user interfaces ahead of their time. Today we can build on many existing technologies:
networks: storage and communication networks are ubiquitous with several protocols (HTTP, IPFS, BitTorrent...).
identifier systems: document identifiers should be part of the URI/IRI identifier system. More specific candidates of relevant identifier systems include URLs and content-based identifiers.
formats: hypertext systems should not be limited to their own document formats (such as the Web’s focus on HTML/DOM) but allow for integration of all kinds of digital objects.
content locators: as shown above, several content locator and query formats exist, at least for some document models.
Access to documents via a retrieval function $R$ can be implemented with existing network and identifier technologies. Obvious solutions build on top of HTTP and URL but these identifiers are far from unambiguous and persistent. Content-based identifiers are guaranteed to always reference the same document but they require network and storage systems to be actionable. The set of supported data formats is only limited by availability of applications to view and to edit documents. Full integration into a hypertext system however requires appropriate content locator formats to select, transclude, and link to segments from these documents. Existing content locator technologies include URI Fragment Identifiers [25], patch formats (JSON Patch, XML Patch, LD Patch...), and domain-specific query languages as long as they can guarantee reproducible builds. The IIIF Image API, with focus on content locators in images [2], and hypothes.is, with a combination of locator methods [6], popularized at least simple forms of transclusion on the Web.
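As a rough sketch of how these pieces could fit together, the following Python code implements $R$ with SHA-1 content-based identifiers on top of an in-memory store and reuses the strings of the example section. Everything here is an assumption of the sketch: the class `ContentStore` stands in for real network and storage systems, the edit list is parsed from the four-line syntax shown in the example, and the range ‘char=11,15’ is read with an inclusive end because that reading yields ‘Alice’ (RFC 5147 itself defines ranges via positions between characters). The identifiers are computed at run time and the sketch does not verify that they equal the hashes printed in the example.

```python
import hashlib
from typing import Dict, Set, Tuple


class ContentStore:
    """In-memory stand-in for the storage and network systems behind R."""

    def __init__(self) -> None:
        self._docs: Dict[str, bytes] = {}

    def put(self, doc: bytes) -> str:
        """Store a document and return its content-based identifier."""
        ident = hashlib.sha1(doc).hexdigest()
        self._docs[ident] = doc
        return ident

    def retrieve(self, ident: str) -> bytes:
        """R: a KeyError means the identifier is not actionable here."""
        doc = self._docs[ident]
        if hashlib.sha1(doc).hexdigest() != ident:
            raise ValueError("stored document does not match its identifier")
        return doc


def parse_edit_list(e: bytes) -> Dict[str, str]:
    """Read lines such as 'insert at <identifier>' into a dictionary."""
    fields = {}
    for line in e.decode("ascii").splitlines():
        key, _, value = line.rpartition(" ")
        fields[key] = value
    return fields


def transclude(locator: bytes, doc: bytes) -> bytes:
    """T for 'char=a,b' on ASCII documents, end read inclusively (assumption)."""
    start, end = locator.decode("ascii").split("=", 1)[1].split(",")
    return doc[int(start):int(end) + 1]


def assemble(e: bytes, store: ContentStore) -> bytes:
    """A for edit lists with the fields take, insert at, from, segment."""
    f = parse_edit_list(e)
    base = store.retrieve(f["take"])
    position = int(store.retrieve(f["insert at"]).decode("ascii").split("=", 1)[1])
    segment = transclude(store.retrieve(f["segment"]), store.retrieve(f["from"]))
    return base[:position] + segment + base[position:]


def usage(e: bytes, store: ContentStore) -> Set[Tuple[bytes, bytes]]:
    """U: pairs of content locator and document used by the edit list."""
    f = parse_edit_list(e)
    return {(store.retrieve(f["insert at"]), store.retrieve(f["take"])),
            (store.retrieve(f["segment"]), store.retrieve(f["from"]))}


store = ContentStore()
d1, c1 = b"My name is Alice", b"char=11,15"
d3, c2 = b"Hello, !", b"char=7"
e = "\n".join([
    "take " + store.put(d3),
    "insert at " + store.put(c2),
    "from " + store.put(d1),
    "segment " + store.put(c1),
]).encode("ascii")
d4 = assemble(e, store)
links = usage(e, store)
```

Content-based identifiers make `retrieve` self-verifying: a stored document that does not hash to its identifier is rejected. What the store cannot provide on its own is actionability, since an identifier that was never stored simply fails to resolve.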
Challenges
Despite the availability of technologies to build on, creation of a xanalogical hypertext system is challenging for several reasons. The general problems involved with transclusion have been identified [1]. Other or more specific challenges include (ordered by severity):
storage: data storage is cheap, but someone has to pay for it.
normalization: most documents (including identifiers) can be serialized in different forms. To support unique document identifiers, a hypertext system should support normalization of documents to canonical forms.
link services: databases of links have been proposed as central part of Open Hypermedia Systems [3] but they are not available for the Web because of commercial interest. Links are ideally derived from edit lists with segments usage function $U$. Recent developments such as Webmention and OpenCitations may help to improve collection of links.
visualization and navigation: this most recognizable element of hypertext has mostly been reduced to simple links while Nelson’s ideas seem to have been forgotten or ignored [27]. Nevertheless the creation of tools for visualization and navigation in hypertext structures is less challenging than getting hold of the underlying documents and hyperlinks.
edit list formats: despite edit lists being the very core of the idea of hypertext [15], they have rarely been implemented in reusable data formats. Proper hypertext implementations therefore require establishing new formats with support of hypertext assemble function $A$ and segments usage function $U$.
editing tools: applications to create and modify digital objects may track changes but don’t provide this information in form of reusable edit lists, if at all. Hypermedia authoring needs to be integrated into existing editing tools to succeed [7].
copyright and control: who should be allowed to use which documents under which conditions? The answers primarily depend on legal, social, and political requirements.
A more practical edit list syntax $E$ could also allow the direct embedding of small document instances from which SHA-1 hashes can be computed. If implemented carefully, this could also reconcile transclusion with copy-and-paste.
New standards such as IPFS Multihashes and BitTorrent Merkle-Hashes look promising but these types of identifiers are not specified as part of the URI system (yet) [26].
Differences to Xanadu
Project Xanadu promised a comprehensive hypertext system including elements for content (xanadocs), network (servers), rights (micropayment), and interfaces (viewers), years before each of these concepts made it into the computer mainstream. Today a xanalogical hypertext system can build more on existing technologies. The infrastructure-agnostic model of hypertext tries to capture the core parts of the original vision of hypertext by concentrating on its documents and document formats. For this reason some requirements listed by Xanadu Australia [23] or mentioned by Nelson in other publications are not incorporated explicitly:
a) identified servers as canonical sources of documents
b) identified users and access control
c) copyright and royalty system via micropayment
d) user interfaces to navigate and edit hypertexts
Meeting these requirements in actual implementations is possible nevertheless. Identified servers (a) and users (b) were part of Tumbler identifiers (that combined document identifiers and content locators) [17] but the current OpenXanadu implementation uses plain URLs as part of its Xanadoc edit list format [21]. Canonical sources of documents (a) could also be implemented by blockchains or alternative technology to prove that a specific document existed on a specific server at a specific time. Such knowledge of a document’s first insertion into the hypertext system would also allow for royalty systems (c). Identification of users and access control (b) could also be implemented in several ways but this feature depends much more on network infrastructures and socio-technical environments, including rules of privacy, intellectual property, and censorship. Last but not least, a hypertext system needs applications to visualize, navigate, and edit hypermedia (d). Several user interfaces have been invented in the history of hypertext [13] and it is unlikely that there will be one final application because user interfaces depend on use-cases and file formats.
In particular by search engines and by spammers. A criterion to judge the success of a hypertext system is whether it is popular enough to attract link spam.
Differences to other hypertext models
The focus of models from the hypertext research community [3] is more on services and tools than on Nelson’s requirements [30]. This paper rather looks at the neglected “within-component layer” of the Dexter Hypertext Reference Model [10] than at issues of storage, presentation and interaction with a hypertext system. Extension with content locators (“locSpecs” in [9]) could align Dexter more closely with infrastructure-agnostic hypertext but existing models rarely put traceable edit lists and transclusion into their core.
SUMMARY AND CONCLUSION
This paper presents a novel interpretation of the original vision of hypertext [14, 15]. Its infrastructure-agnostic model does not require or exclude specific data formats or network protocols. Abstracting from these ever-changing technologies, the focus is on hypermedia content (documents) and connections (hyperlinks). Core elements of hypertext systems are identified as documents, document identifiers, content locators, and edit lists. A formal model defines their relations based on knowledge of data formats and models. It is shown which technologies can be used to implement such a hypertext system integrated into current information infrastructures (especially the Internet and the Web) and which challenges still exist (in particular support of edit lists in editing tools).
Figure 2: Proto-transclusion links of this paper (nodes: Markdown, TeX, PDF, TikZ, SVG, HTML, Wikidata, CSL JSON, BibTeX, Pandoc template (TeX), Pandoc template (HTML), CSS)
Copyright detection was easy to implement with mandatory registration such as partly required in the United States until 1976. Authors might also register documents with cryptographic hashes without making them public in the first place.
Tim Berners-Lee’s first Web browser originally supported editing.
“It would be folly to attempt a generic model covering all of these data types.” [10]
REFERENCES
[1] Robert Akscyn. 2015. The Future of Transclusion.
Intertwingled (2015), 113–122. https://doi.org/10.1007/978-3-319-16925-5_15
[2] Michael Appleby, Tom Crane, Robert Sanderson, Jon Stroop, and Simeon Warner (Eds.). 2017. IIIF Image API (2.1.1 ed.). IIIF Consortium. https://iiif.io/api/image/
[3] Claus Atzenbeck, Thomas Schedel, Manolis Tzagarakis, Daniel Roßner, and Lucas Mages. 2017. Revisiting Hypertext Infrastructure. Proceedings of the 28th ACM Conference on Hypertext and Social Media, 35–44. https://doi.org/10.1145/3078714.3078718
[4] Tim Berners-Lee. 1989. Information Management: A Proposal
[5] [...] The Semantic Web: ESWC 2015 Satellite Events (2015), 26–30. https://doi.org/10.1007/978-3-319-25639-9_5
[6] Kristóf Csillag. 2013. Fuzzy Anchoring. Hypothesis (22 4 2013). https://web.hypothes.is/blog/fuzzy-anchoring/
[7] Angelo Di Iorio and Fabio Vitali. 2005. From the writable web to global editability. Proceedings of the sixteenth ACM conference on Hypertext and hypermedia (2005), 35–45. https://doi.org/10.1145/1083356.1083365
[8] Jonathan Furner. 2016. “Data”: The data. Information Cultures in the Digital Age: a Festschrift in honor of Rafael Capurro
[9] [...] Proceedings of the seventh ACM conference on Hypertext, 149–160.
[10] Frank G. Halasz and Mayer D. Schwartz. 1990. The Dexter Hypertext Reference Model. Technical Report.
[11] William Kent. 1988. The Many Forms of a Single Fact. Proceedings of IEEE COMPCON 89, 438–443. https://doi.org/10.1109/CMPCON.1989.301972
[12] Tuomas Lukka and Benja Fallenstein. 2002. Freenet-like GUIDs for implementing xanalogical hypertext. Proceedings of the thirteenth ACM conference on Hypertext and hypermedia (2002), 194–195. https://doi.org/10.1145/513338.513386
[13] Matthias Müller-Prove. 2002. Vision and Reality of Hypertext and Graphical User Interfaces. Ph.D. Dissertation. https://mprove.de/visionreality/
[14] Ted Nelson. 1965. Complex information processing: a file structure for the complex, the changing and the indeterminate. Proceedings of the 1965 20th national conference, 84–100. https://doi.org/10.1145/800197.806036
[15] Ted Nelson. 1967. Getting It Out of Our System. Information Retrieval: A Critical Review, 191–210.
[16] Ted Nelson. 1974. Computer Lib / Dream Machines.
[17] Ted Nelson. 1980. Literary Machines. Mindful Press.
[18] Ted Nelson. 1997. Embedded Markup Considered Harmful. World Wide Web Journal 2, 4 (1997), 129–134.
[19] Ted Nelson. 1999. Xanalogical structure, needed now more than ever: parallel documents, deep links to content, deep versioning, and deep re-use. Comput. Surveys 31, 4es (12 1999). https://doi.org/10.1145/345966.346033
[20] Ted Nelson. 2008. Geeks Bearing Gifts. Mindful Press.
[21] Ted Nelson and Nicholas Levin. 2014. OpenXanadu. http://xanadu.com/xanademos/MoeJusteOrigins.html
[22] Ted Nelson, Robert Adamson Smith, and Marlene Mallicoat. 2007. Back to the future: hypertext the way it used to be. Proceedings of the eighteenth conference on Hypertext and hypermedia, 227–228. https://doi.org/10.1145/1286240.1286303
[23] Andrew David Pam. 2002. Xanadu FAQ
[24] [...] Proceedings of the American Society for Information Science and Technology 45, 1 (3 6 2009). https://doi.org/10.1002/meet.2008.14504503143
[25] Jeni Tennison. 2012. Best Practices for Fragment Identifiers and Media Type Definitions
[26] [...] Hash URI Specification (Initial Draft). Technical Report. https://github.com/hash-uri/hash-uri
[27] Fernanda Viégas, Martin M. Wattenberg, and Kushal Dave. [n. d.]. Studying Cooperation and Conflict between Authors with history flow Visualizations. CHI ’04 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 575–582. https://doi.org/10.1145/985692.985765
[28] Jakob Voss. 2013. Describing Data Patterns. Ph.D. Dissertation. http://aboutdata.org/
[29] Jakob Voss. 2013. Was sind eigentlich Daten? LIBREAS 23 (2013). https://doi.org/10.18452/9038
[30] Noah Wardrip-Fruin. 2004. What hypertext is.