Infrastructure-Agnostic Hypertext

Abstract

This paper presents a novel and formal interpretation of the original vision of hypertext: infrastructure-agnostic hypertext is independent from specific standards such as data formats and network protocols. Its model is illustrated with examples and references to existing technologies that allow for implementation and integration in current information infrastructures such as the Internet.


arXiv [cs.DL], June

Jakob Voß, Verbundzentrale des GBV (VZG)


CCS CONCEPTS

• Information systems → Hypertext languages; Document representation; • Human-centered computing → Hypertext / hypermedia.

KEYWORDS

Xanadu, hypermedia, transclusion, documents

INTRODUCTION

The original vision of hypertext as proposed by Ted Nelson [14, 22] still waits to be realized. His influence is visible through people who, inspired by his works, shaped the computer world of today, not least the Web [20]. Nelson’s core idea, a network of visibly connected documents called Xanadu, goes beyond the Web in several aspects. In particular it promises non-breaking links and it uses links to build documents (with versions, quotations, overlay markup, ...) instead of using documents to build links [18]. The concept of hypertext, or more generally ‘hypermedia’, has also been used differently from Nelson, both in the literary community (which focused on simple links) and in the hypertext research community (which focused on tools) [30].

This paper tries to get back to the original vision of hypertext by specifying a formal model that puts transcludable documents at its heart. Apart from Nelson’s works [14–17, 19, 22] this paper draws from research on data format analysis [11, 28], content-based identifiers [12, 26], existing transclusion technologies for the Web [1, 2, 6, 25] and hypertext systems beyond link-based models [3].

Limited by the state of document processing tools and by submission guidelines, this paper is not a demo of hypertext. On a closer look, however, there are traces of transclusion links that have been processed into this paper. See figure 2 for an overview and https://github.com/jakobib/hypertext2019 for details and sources. See [5] for a good example of what an actual demo paper might look like.

OUTLINE

The architecture of an infrastructure-agnostic hypertext system consists of four basic elements and their relations:

1) documents include all finite, digital objects
2) document identifiers reference individual documents
3) content locators reference segments within documents
4) edit lists combine parts of existing documents into new ones

Instances of each element can further be grouped by data formats. The elements and their relations are each described in the following sections after a formal definition.

Formal model

A hypertext system is a tuple ⟨D, I, C, E, S, R, U, T, A⟩ where:

D is a set of documents
I is a set of document identifiers with I ⊂ D
C is a set of content locators with C ⊂ D
E is a set of edit lists with E ⊂ D
S is a set of document segments with S ⊂ C × D

Document sets can each be grouped into (possibly overlapping) data formats. The hypertext system further consists of:

R is a retrieval function with R : I → D
U is a segments usage function with U : E → P(S), where P(S) is the power set of S
T is a transclusion function with T : S → D
A is a hypertext assemble function with A : E → D

A practical hypertext system needs executable implementations of the functions R, U, T, A and a method to tell whether a given combination ⟨c, d⟩ ∈ C × D is part of S to allow its use with T.

Documents

A document is a finite sequence of bytes. This definition roughly equates with the definition of data as documents [8, 29]. The notion of hypertext used in this paper therefore subsumes all kinds of hyperdata (datasets that transclude other datasets). Documents are static by content [24] but they may be processed dynamically. Documents are grouped by non-disjoint data formats such as UTF-8, CSV, SVG, PDF and many more.
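The signatures of the formal model can be written down as type definitions. The following Python sketch is only an illustration: the names `Document`, `Segment`, `HypertextSystem`, and `is_segment` are assumptions made for this example, since the model itself only fixes the sets and the function signatures.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# A document is a finite sequence of bytes. Identifiers, content locators
# and edit lists are documents too (I, C and E are subsets of D).
Document = bytes
Identifier = Document
Locator = Document
EditList = Document
Segment = Tuple[Locator, Document]  # an element of S, a subset of C x D


@dataclass(frozen=True)
class HypertextSystem:
    """Hypothetical container for the executable parts of the model."""

    retrieve: Callable[[Identifier], Document]        # R : I -> D
    usage: Callable[[EditList], FrozenSet[Segment]]   # U : E -> P(S)
    transclude: Callable[[Segment], Document]         # T : S -> D
    assemble: Callable[[EditList], Document]          # A : E -> D
    is_segment: Callable[[Locator, Document], bool]   # membership test for S
```

Keeping every element a plain byte string mirrors the subset relations of the model: nothing prevents an identifier or an edit list from being stored, identified, and transcluded like any other document.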

Document identifiers

A document identifier is a relatively short document that refers to another document. Identifiers have properties depending on the particular identifier system they belong to [29, pp. 59-71]. Identifiers in infrastructure-agnostic hypertext must first be unambiguous (an identifier must reference only one document), persistent (the reference must not change over time), and actionable (hypertext systems should provide methods to retrieve documents via the retrieval function R). Properties that should be fulfilled at least to some degree include uniqueness (a document should not be referenced by too many identifiers), performance (identifiers should be easy to compute and to validate), and distributedness (identifiers should not require a central institution). The actual choice of an identifier system depends on the weighting of specific requirements. A promising choice is the application of content-based identifiers, as proposed at least once in the context of hypertext systems [12].

Content locators

A content locator is a document that can be used to select parts of another document via transclusion. Nelson refers to these locators as “reference pointers” [19], exemplified with spans of bytes or characters in a document. Content locators depend on data formats and document models. For instance, the locator languages XPath, XPointer, and XQuery act on XML documents, which can be serialized in different forms (therefore it makes no sense to locate parts of an XML document with positions of bytes). Other locator languages apply to tabular data (SQL, RFC 7111), to graphs (SPARQL, GraphQL), or to two-dimensional images (IIIF), to name a few. Whether and how parts of a document can be selected with a content locator language depends on which data format the document is interpreted in. For instance, an SVG image file can be processed at least as an image, as an XML document, or as a Unicode string, each with its own methods of locating document segments.

Content locators can be extended to all executable programs that reproducibly process some documents into other documents. This generalization can be useful to track data processing pipelines as hyperdata, as discussed for executable papers and reproducible research. Restricting content locators to less powerful query languages might make sense from a security point of view.
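The dependence of content locators on the data format can be shown with a span locator, the simplest kind of reference pointer. In this sketch the helper names `select_bytes` and `select_chars` and the half-open `(start, end)` span convention are assumptions chosen for illustration. The same span selects different segments depending on whether the document is read as a byte sequence or as UTF-8 encoded Unicode.

```python
def select_bytes(document: bytes, start: int, end: int) -> bytes:
    """Select a span of bytes (document read as a plain byte sequence)."""
    return document[start:end]


def select_chars(document: bytes, start: int, end: int) -> bytes:
    """Select a span of characters (document read as UTF-8 text)."""
    text = document.decode("utf-8")
    return text[start:end].encode("utf-8")


document = "Jakob Voß".encode("utf-8")   # 'ß' occupies two bytes in UTF-8

as_chars = select_chars(document, 6, 9)  # the three characters 'Voß'
as_bytes = select_bytes(document, 6, 9)  # stops in the middle of 'ß'
```

The byte span ends inside a multi-byte character and is not valid UTF-8 on its own, which is one reason why byte positions are a poor fit for documents that are meant to be read in a higher-level format.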

Edit Lists

An edit list is a document that describes how to construct a new document from parts of other documents. Edit lists are known as Edit Decision Lists in Xanadu and the idea was borrowed from film making [15]. Simplified forms of edit lists are implemented in version control systems and in collaborative tools such as wikis and real-time editing. Hypertext edit lists go beyond this one-dimensional case by supporting multiple source documents and more flexible methods of document processing in addition to basic operations such as insert, delete, and replace. The actual processing steps tracked by an edit list depend on the data formats of the transcluded documents. Just like content locators, edit lists could be extended to arbitrary executable programs that implement the hypertext assemble function A for some subset of edit lists. To ensure reproducibility and reliable transclusion, these programs must not access unstable external information such as documents that may change [24]. If document models or segments change, locators may not be applicable anymore [6].
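The one-dimensional case mentioned above, a single source document edited into a new version, can be sketched with Python's standard `difflib` module. The operation names `copy` and `new` and both function names are made up for this illustration; they are not an edit list format from Xanadu or from this paper.

```python
from difflib import SequenceMatcher


def derive_edit_list(old: str, new: str) -> list:
    """Describe `new` as spans copied from `old` plus newly written text."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))    # reuse a span of the old version
        elif tag in ("insert", "replace"):
            ops.append(("new", new[j1:j2])) # text without a source in `old`
        # 'delete' needs no entry: the span is simply not copied
    return ops


def apply_edit_list(old: str, ops: list) -> str:
    """Rebuild the new version from the old one and the edit list."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


old, new = "Hello, !", "Hello, Alice!"
ops = derive_edit_list(old, new)
rebuilt = apply_edit_list(old, ops)
```

A hypertext edit list generalizes this by letting a copy operation refer to any number of source documents rather than to a single previous version.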

Data formats

An often neglected fundamental property of digital documents is their grounding in data formats. A data format is a set of documents that share a common data model, also known as their document model, and a common serialization (see fig. 1). Models define elements of a document in terms of sets, strings, tuples, graphs or similar structures. These structures are mathematically rigorous in theory [24] but based more on descriptive patterns in practice [28]. The meaning of these elements (for instance “words”, “sentences”, and “paragraphs” in a document model) is based on ideas that we at least assume to be consistent among different people.

[Figure: idea, model, format, ..., serialization]

Figure 1: levels of data modeling
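To make the distinction between model and serialization concrete, consider two byte sequences that differ as documents but carry the same model instance. JSON is used here only as a familiar stand-in, and the function `canonical` is a hypothetical normalization chosen for this sketch, not a standardized canonical form.

```python
import json

a = b'{"name": "Alice", "lang": "en"}'
b = b'{ "lang":"en",\n  "name":"Alice" }'


def canonical(document: bytes) -> bytes:
    """Re-serialize a JSON document with sorted keys and no extra whitespace."""
    model = json.loads(document.decode("utf-8"))
    text = json.dumps(model, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


same_bytes = a == b                            # False: two different documents
same_model = json.loads(a) == json.loads(b)    # True: one model instance
same_canonical = canonical(a) == canonical(b)  # True after normalization
```

Byte-level identifiers treat `a` and `b` as unrelated documents; only a format-aware normalization step makes them comparable.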

Data modeling, the act of mapping between ideas, models, and formats, is an unsolved problem because ideas can be expressed in many models and models can be interpreted in many ways [11]. Data models can further be expressed in multiple formats, although these formats should be fully convertible between each other, at least in theory. (In practice it is often unknown whether two data formats actually share the same model, especially if models are only given implicitly by the definition of their formats.) Formats can further be serialized in multiple forms, which for their part are based on other data models. For instance, the RDF model can be serialized in RDF/XML format, which is based on the XML model. XML can be serialized in XML syntax, which is based on the Unicode model, and Unicode can be serialized in UTF-8. At the end of these chains of abstraction eventually all documents can be serialized as a sequence of bytes. Serializations are seldom simple mappings, as most serialization formats allow insignificant variances such as additional whitespace. To check whether a data object conforms to a serialization, formats are often described with a formal grammar that can also give insights about the format’s model. (Formats may also exist purely implicitly, in the form of sample instances from which grammar, model, and ideas must be guessed by reverse engineering [28].) Infrastructure-agnostic hypertext does not impose any limits on possible data formats and their models.

EXAMPLE

The following example illustrates the formal model and its elements. Let ⟨D, I, C, E, S, R, U, T, A⟩ be a hypertext system with D the set of printable ASCII character strings and some documents:

d1 = ‘My name is Alice’
d2 = ‘Alice’
c1 = ‘char=11,15’
d3 = ‘Hello, !’
c2 = ‘char=7’
d4 = ‘Hello, Alice!’

If c1 and c2 are read in content locator syntax as defined by RFC 5147 so that T(⟨c1, d1⟩) = d2, then d4 can be constructed by transcluding a document segment of d1 into d3 at position c2. The corresponding edit list e ∈ E with A(e) = d4 could look like this:

take 995f37f2e066b7d8893873ca4d780da5bf017184
insert at 48ba94c47b45390b6dd27824cfc0d8468c2cbc71
from fcb59267e2e6641140578235c8cb6d38eaf6abc1
segment c5b794c7ae5d490f52a414d9d19311b9a19f61b3

The values in e are SHA-1 hashes of d3, c2, d1, and c1 respectively. Retrieval function R maps them back to strings. Hyperlinks are given by U(e) = {⟨c2, d3⟩, ⟨c1, d1⟩}, used for editing d3 to d4 (versioning) and for referencing a segment of d1 in d4 (transclusion).

IMPLEMENTATIONS
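As a first, deliberately small implementation sketch, the example from the previous section can be replayed in Python. Several points are assumptions of this sketch rather than part of the paper: the helper names `store`, `put`, `parse_locator`, and `parse_edit_list` are invented, and the end of a `char=` range is read as inclusive so that the first locator selects ‘Alice’ as stated in the example (RFC 5147 itself counts positions between characters, where that span would be written `char=11,16`). Digests are computed over the plain ASCII bytes; whether they reproduce the hashes printed in the example depends on the exact serialization that was hashed, which the text does not specify.

```python
import hashlib

store = {}  # content-addressed storage: SHA-1 hex digest -> document


def put(document: str) -> str:
    """Store a document and return its content-based identifier."""
    identifier = hashlib.sha1(document.encode("ascii")).hexdigest()
    store[identifier] = document
    return identifier


def R(identifier: str) -> str:
    """Retrieval function: map an identifier back to its document."""
    return store[identifier]


def parse_locator(locator: str) -> list:
    """Parse 'char=N' or 'char=N,M' into a list of integers."""
    scheme, _, value = locator.partition("=")
    if scheme != "char":
        raise ValueError("unsupported locator: " + locator)
    return [int(n) for n in value.split(",")]


def T(locator: str, document: str) -> str:
    """Transclusion of a character range, end read as inclusive (assumption)."""
    start, end = parse_locator(locator)
    return document[start:end + 1]


def parse_edit_list(edit_list: str) -> dict:
    """Read 'take', 'insert at', 'from', 'segment' lines into a dict."""
    fields = {}
    for line in edit_list.strip().splitlines():
        key, _, identifier = line.rpartition(" ")
        fields[key] = identifier
    return fields


def A(edit_list: str) -> str:
    """Assemble function for the four-line edit list syntax of the example."""
    e = parse_edit_list(edit_list)
    target = R(e["take"])
    (position,) = parse_locator(R(e["insert at"]))
    segment = T(R(e["segment"]), R(e["from"]))
    return target[:position] + segment + target[position:]


def U(edit_list: str) -> set:
    """Segments usage: the (locator, document) pairs an edit list relies on."""
    e = parse_edit_list(edit_list)
    return {(R(e["insert at"]), R(e["take"])),
            (R(e["segment"]), R(e["from"]))}


d1, c1 = "My name is Alice", "char=11,15"
d3, c2 = "Hello, !", "char=7"
e = "\n".join([
    "take " + put(d3),
    "insert at " + put(c2),
    "from " + put(d1),
    "segment " + put(c1),
])
d4 = A(e)  # 'Hello, Alice!'
```

Because the edit list contains only identifiers, assembling d4 works with any storage that can resolve them; the in-memory dictionary is merely the smallest such storage.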

One of the problems faced by project Xanadu was that it long required new developments such as computer networks, document processing, and graphical user interfaces ahead of their time. Today we can build on many existing technologies:

networks: storage and communication networks are ubiquitous with several protocols (HTTP, IPFS, BitTorrent, ...).
identifier systems: document identifiers should be part of the URI/IRI identifier system. More specific candidates of relevant identifier systems include URLs and content-based identifiers.
formats: hypertext systems should not be limited to their own document formats (such as the Web’s focus on HTML/DOM) but allow for integration of all kinds of digital objects.
content locators: as shown above, several content locator and query formats exist, at least for some document models.

Access to documents via a retrieval function R can be implemented with existing network and identifier technologies. Obvious solutions build on top of HTTP and URL but these identifiers are far from unambiguous and persistent. Content-based identifiers are guaranteed to always reference the same document but they require network and storage systems to be actionable. The set of supported data formats is only limited by the availability of applications to view and to edit documents. Full integration into a hypertext system, however, requires appropriate content locator formats to select, transclude, and link to segments from these documents. Existing content locator technologies include URI Fragment Identifiers [25], patch formats (JSON Patch, XML Patch, LD Patch, ...), and domain-specific query languages as long as they can guarantee reproducible builds. The IIIF Image API, with its focus on content locators in images [2], and hypothes.is, with a combination of locator methods [6], popularized at least simple forms of transclusion on the Web.
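The trade-off described above, content-based identifiers being unambiguous but not actionable on their own, suggests a retrieval function that asks one or more storage backends and verifies whatever comes back. The sketch below assumes SHA-1 hex digests as identifiers and models backends as plain Python callables. The names `Backend`, `identify`, and `retrieve` and the two in-memory backends are hypothetical; a real system would put HTTP, IPFS, or similar clients behind the same interface.

```python
import hashlib
from typing import Callable, Iterable, Optional

# A backend returns the stored bytes, or None if the document is unknown.
Backend = Callable[[str], Optional[bytes]]


def identify(document: bytes) -> str:
    """Content-based identifier of a document (SHA-1 hex digest)."""
    return hashlib.sha1(document).hexdigest()


def retrieve(identifier: str, backends: Iterable[Backend]) -> bytes:
    """Return the first document whose hash matches the identifier."""
    for backend in backends:
        document = backend(identifier)
        if document is not None and identify(document) == identifier:
            return document     # verified against the identifier
    raise KeyError(identifier)  # no backend could make the identifier actionable


document = b"Hello, Alice!"
doc_id = identify(document)

tampered = {doc_id: b"Hello, Mallory!"}.get  # backend serving wrong content
honest = {doc_id: document}.get              # backend serving the document

result = retrieve(doc_id, [tampered, honest])
```

Verification makes the choice of backend irrelevant for correctness: an untrusted or outdated source can at worst fail to deliver, it cannot substitute a different document.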

Challenges

Despite the availability of technologies to build on, creation of a xanalogical hypertext system is challenging for several reasons. The general problems involved with transclusion have been identified [1]. Other or more specific challenges include (ordered by severity):

storage: data storage is cheap, but someone has to pay for it.
normalization: most documents (including identifiers) can be serialized in different forms. To support unique document identifiers, a hypertext system should support normalization of documents to canonical forms.
link services: databases of links have been proposed as a central part of Open Hypermedia Systems [3] but they are not available for the Web because of commercial interest. Links are ideally derived from edit lists with the segments usage function U. Recent developments such as Webmention and OpenCitations may help to improve the collection of links.
visualization and navigation: this most recognizable element of hypertext has mostly been reduced to simple links while Nelson’s ideas seem to have been forgotten or ignored [27]. Nevertheless the creation of tools for visualization and navigation in hypertext structures is less challenging than getting hold of the underlying documents and hyperlinks.
edit list formats: despite edit lists being the very core of the idea of hypertext [15], they have rarely been implemented in reusable data formats. Proper hypertext implementations therefore require establishing new formats with support for the hypertext assemble function A and the segments usage function U.
editing tools: applications to create and modify digital objects don’t provide tracked changes in the form of reusable edit lists, if they track changes at all. Hypermedia authoring needs to be integrated into existing editing tools to succeed [7].
copyright and control: who should be allowed to use which documents under which conditions? The answers primarily depend on legal, social, and political requirements.

A more practical edit list syntax E could also allow the direct embedding of small document instances from which SHA-1 hashes can be computed. If implemented carefully, this could also reconcile transclusion with copy-and-paste.

New standards such as IPFS Multihashes and BitTorrent Merkle-Hashes look promising but these types of identifiers are not specified as part of the URI system (yet) [26].

Differences to Xanadu

Project Xanadu promised a comprehensive hypertext system including elements for content (xanadocs), network (servers), rights (micropayment), and interfaces (viewers), years before each of these concepts made it into the computer mainstream. Today a xanalogical hypertext system can build more on existing technologies. The infrastructure-agnostic model of hypertext tries to capture the core parts of the original vision of hypertext by concentrating on its documents and document formats. For this reason some requirements listed by Xanadu Australia [23] or mentioned by Nelson in other publications are not incorporated explicitly:

a) identified servers as canonical sources of documents
b) identified users and access control
c) copyright and royalty system via micropayment
d) user interfaces to navigate and edit hypertexts

Meeting these requirements in actual implementations is possible nevertheless. Identified servers (a) and users (b) were part of Tumbler identifiers (which combined document identifiers and content locators) [17] but the current OpenXanadu implementation uses plain URLs as part of its Xanadoc edit list format [21]. Canonical sources of documents (a) could also be implemented by blockchains or alternative technology to prove that a specific document existed on a specific server at a specific time. Such knowledge of a document’s first insertion into the hypertext system would also allow for royalty systems (c). Identification of users and access control (b) could also be implemented in several ways but this feature depends much more on network infrastructures and socio-technical environments, including rules of privacy, intellectual property, and censorship. Last but not least, a hypertext system needs applications to visualize, navigate, and edit hypermedia (d). Several user interfaces have been invented in the history of hypertext [13] and there is unlikely to be one final application because user interfaces depend on use cases and file formats.

Commercial interest in links comes in particular from search engines and from spammers. A criterion to judge the success of a hypertext system is whether it is popular enough to attract link spam.

Differences to other hypertext models

The focus of models from the hypertext research community [3] is more on services and tools than on Nelson’s requirements [30]. This paper looks at the neglected “within-component layer” of the Dexter Hypertext Reference Model [10] rather than at issues of storage, presentation and interaction with a hypertext system. Extension with content locators (“locSpecs” in [9]) could align Dexter more closely with infrastructure-agnostic hypertext, but existing models rarely put traceable edit lists and transclusion at their core.

SUMMARY AND CONCLUSION

This paper presents a novel interpretation of the original vision of hypertext [14, 15]. Its infrastructure-agnostic model does not require or exclude specific data formats or network protocols. Abstracting from these ever-changing technologies, the focus is on hypermedia content (documents) and connections (hyperlinks). Core elements of hypertext systems are identified as documents, document identifiers, content locators, and edit lists. A formal model defines their relations based on knowledge of data formats and models. It is shown which technologies can be used to implement such a hypertext system integrated into current information infrastructures (especially the Internet and the Web) and which challenges still exist (in particular support of edit lists in editing tools).

[Figure: nodes labeled Markdown, TeX, PDF, TikZ, SVG, HTML, Wikidata, CSL JSON, BibTeX, Pandoc template (TeX), Pandoc template (HTML), CSS]

Figure 2: Proto-transclusion links of this paper

Copyright detection was easy to implement with mandatory registration, such as was partly required in the United States until 1976. Authors might also register documents with cryptographic hashes without making them public in the first place.

Tim Berners-Lee’s first Web browser originally supported editing.

“It would be folly to attempt a generic model covering all of these data types.” [10]

REFERENCES

[1] Robert Akscyn. 2015. The Future of Transclusion. Intertwingled (2015), 113–122. https://doi.org/10.1007/978-3-319-16925-5_15
[2] Michael Appleby, Tom Crane, Robert Sanderson, Jon Stroop, and Simeon Warner (Eds.). 2017. IIIF Image API (2.1.1 ed.). IIIF Consortium. https://iiif.io/api/image/
[3] Claus Atzenbeck, Thomas Schedel, Manolis Tzagarakis, Daniel Roßner, and Lucas Mages. 2017. Revisiting Hypertext Infrastructure. Proceedings of the 28th ACM Conference on Hypertext and Social Media, 35–44. https://doi.org/10.1145/3078714.3078718
[4] Tim Berners-Lee. 1989. Information Management: A Proposal [...]
[5] [...] The Semantic Web: ESWC 2015 Satellite Events (2015), 26–30. https://doi.org/10.1007/978-3-319-25639-9_5
[6] Kristóf Csillag. 2013. Fuzzy Anchoring. Hypothesis (22 4 2013). https://web.hypothes.is/blog/fuzzy-anchoring/
[7] Angelo Di Iorio and Fabio Vitali. 2005. From the writable web to global editability. Proceedings of the sixteenth ACM conference on Hypertext and hypermedia (2005), 35–45. https://doi.org/10.1145/1083356.1083365
[8] Jonathan Furner. 2016. “Data”: The data. Information Cultures in the Digital Age: a Festschrift in honor of Rafael Capurro [...]
[9] [...] Proceedings of the seventh ACM conference on Hypertext, 149–160.
[10] Frank G. Halasz and Mayer D. Schwartz. 1990. The Dexter Hypertext Reference Model. Technical Report.
[11] William Kent. 1988. The Many Forms of a Single Fact. Proceedings of IEEE COMPCON 89, 438–443. https://doi.org/10.1109/CMPCON.1989.301972
[12] Tuomas Lukka and Benja Fallenstein. 2002. Freenet-like GUIDs for implementing xanalogical hypertext. Proceedings of the thirteenth ACM conference on Hypertext and hypermedia (2002), 194–195. https://doi.org/10.1145/513338.513386
[13] Matthias Müller-Prove. 2002. Vision and Reality of Hypertext and Graphical User Interfaces. Ph.D. Dissertation. https://mprove.de/visionreality/
[14] Ted Nelson. 1965. Complex information processing: a file structure for the complex, the changing and the indeterminate. Proceedings of the 1965 20th national conference, 84–100. https://doi.org/10.1145/800197.806036
[15] Ted Nelson. 1967. Getting It Out of Our System. Information Retrieval: A Critical Review, 191–210.
[16] Ted Nelson. 1974. Computer Lib / Dream Machines.
[17] Ted Nelson. 1980. Literary Machines. Mindful Press.
[18] Ted Nelson. 1997. Embedded Markup Considered Harmful. World Wide Web Journal 2, 4 (1997), 129–134.
[19] Ted Nelson. 1999. Xanalogical structure, needed now more than ever: parallel documents, deep links to content, deep versioning, and deep re-use. Comput. Surveys 31, 4es (12 1999). https://doi.org/10.1145/345966.346033
[20] Ted Nelson. 2008. Geeks Bearing Gifts. Mindful Press.
[21] Ted Nelson and Nicholas Levin. 2014. OpenXanadu. http://xanadu.com/xanademos/MoeJusteOrigins.html
[22] Ted Nelson, Robert Adamson Smith, and Marlene Mallicoat. 2007. Back to the future: hypertext the way it used to be. Proceedings of the eighteenth conference on Hypertext and hypermedia, 227–228. https://doi.org/10.1145/1286240.1286303
[23] Andrew David Pam. 2002. Xanadu FAQ [...]
[24] [...] Proceedings of the American Society for Information Science and Technology 45, 1 (3 6 2009). https://doi.org/10.1002/meet.2008.14504503143
[25] Jeni Tennison. 2012. Best Practices for Fragment Identifiers and Media Type Definitions [...]
[26] [...] Hash URI Specification (Initial Draft). Technical Report. https://github.com/hash-uri/hash-uri
[27] Fernanda Viégas, Martin M. Wattenberg, and Kushal Dave. [n. d.]. Studying Cooperation and Conflict between Authors with history flow Visualizations. CHI ’04 Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 575–582. https://doi.org/10.1145/985692.985765
[28] Jakob Voss. 2013. Describing Data Patterns. Ph.D. Dissertation. http://aboutdata.org/
[29] Jakob Voss. 2013. Was sind eigentlich Daten? LIBREAS 23 (2013). https://doi.org/10.18452/9038
[30] Noah Wardrip-Fruin. 2004. What hypertext is.