Geert Jan Bex | Researchain

Publication

Featured research published by Geert Jan Bex.

ACM Transactions on Database Systems | 2006

Expressiveness and complexity of XML Schema

Wim Martens; Frank Neven; Thomas Schwentick; Geert Jan Bex

The common abstraction of XML Schema by unranked regular tree languages is not entirely accurate. To shed some light on the actual expressive power of XML Schema, intuitive semantical characterizations of the Element Declarations Consistent (EDC) rule are provided. In particular, it is obtained that schemas satisfying EDC can only reason about regular properties of ancestors of nodes. Hence, with respect to expressive power, XML Schema is closer to DTDs than to tree automata. These theoretical results are complemented with an investigation of the XML Schema Definitions (XSDs) occurring in practice, revealing that the extra expressiveness of XSDs over DTDs is only used to a very limited extent. As this might be due to the complexity of the XML Schema specification and the difficulty of understanding the effect of constraints on typing and validation of schemas, a simpler formalism equivalent to XSDs is proposed. It is based on contextual patterns rather than on recursive types and it might serve as a light-weight front end for XML Schema. Next, the effect of EDC on the way XML documents can be typed is discussed. It is argued that a cleaner, more robust, larger but equally feasible class is obtained by replacing EDC with the notion of 1-pass preorder typing (1PPT): schemas that allow one to determine the type of an element of a streaming document when its opening tag is met. This notion can be defined in terms of grammars with restrained competition regular expressions and there is again an equivalent syntactical formalism based on contextual patterns. Finally, algorithms for recognition, simplification, and inclusion of schemas for the various classes are given.
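The idea of typing an element as soon as its opening tag is met can be illustrated with a minimal sketch. It assumes the simplest situation, in which an element's type depends only on its parent's type and its own name, so a stack of ancestor types is all the state a streaming reader needs. The schema table `CHILD_TYPE`, the event format, and the function name `type_stream` are hypothetical and are not taken from the paper, whose 1PPT class is more general than this.

```python
from typing import Iterable, Iterator

# Hypothetical schema: (parent type, element name) -> type of the child element.
# The same name "item" gets a different type depending on where it occurs.
CHILD_TYPE = {
    ("Root", "store"): "Store",
    ("Store", "item"): "StoreItem",
    ("Store", "order"): "Order",
    ("Order", "item"): "OrderItem",
}


def type_stream(
    events: Iterable[tuple[str, str]],
    child_type: dict[tuple[str, str], str],
    root: str = "Root",
) -> Iterator[tuple[str, str]]:
    """Yield (element name, type) at each opening tag of a streamed document."""
    stack = [root]  # types of the currently open ancestors
    for kind, name in events:
        if kind == "open":
            # The type is fixed here, before any of the element's content is read.
            element_type = child_type[(stack[-1], name)]
            stack.append(element_type)
            yield name, element_type
        else:  # a closing tag
            stack.pop()
```

A document is passed in as a sequence of `("open", name)` and `("close", name)` events. An element name that the table does not allow under the current parent type raises `KeyError`, which a real validator would report as a validation error.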

International Workshop on the Web and Databases | 2004

DTDs versus XML Schema: a practical study

Geert Jan Bex; Frank Neven; Jan Van den Bussche

Among the various proposals answering the shortcomings of Document Type Definitions (DTDs), XML Schema is the most widely used. Although DTDs and XML Schema Definitions (XSDs) differ syntactically, they are still quite related on an abstract level. Indeed, freed from all syntactic sugar, XML Schemas can be seen as an extension of DTDs with a restricted form of specialization. In the present paper, we inspect a number of DTDs and XSDs harvested from the web and try to answer the following questions: (1) which of the extra features/expressiveness of XML Schema not allowed by DTDs are effectively used in practice; and, (2) how sophisticated are the structural properties (i.e. the nature of regular expressions) of the two formalisms. It turns out that at present real-world XSDs only sparingly use the new features introduced by XML Schema: on a structural level the vast majority of them can already be defined by DTDs. Further, we introduce a class of simple regular expressions and obtain that a surprisingly high fraction of the content models belong to this class. The latter result sheds light on the justification of simplifying assumptions that sometimes have to be made in XML research.

Information Systems | 2002

A formal model for an expressive fragment of XSLT

Geert Jan Bex; Sebastian Maneth; Frank Neven

The extension of the eXtensible Stylesheet Language (XSL) by variables and passing of data values between template rules has generated a powerful XML query language: eXtensible Stylesheet Language Transformations (XSLT). An informal introduction to XSLT is given, on the basis of which a formal model of a fragment of XSLT is defined. This formal model is in the spirit of tree transducers, and its semantics is defined by rewrite relations. It is shown that the expressive power of the fragment is already beyond that of most other XML query languages. Finally, important properties such as termination and closure under composition are considered.

ACM Transactions on Database Systems | 2010

Inference of concise regular expressions and DTDs

Geert Jan Bex; Frank Neven; Thomas Schwentick; Stijn Vansummeren

We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that basically reduces to learning concise regular expressions from positive example strings. We identify two classes of concise regular expressions, the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs), that capture the vast majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton-to-regular-expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however (for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm crx that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small datasets.
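The first step described above, inferring an automaton from example strings, can be sketched as follows. The abstract does not say how the automaton is built, so this sketch assumes one simple construction: one state per symbol, with an edge for every pair of symbols seen next to each other in some example. The names `infer_soa`, `accepts`, `SRC` and `SNK` are hypothetical, and the translation of the automaton into a SORE is not shown.

```python
from typing import Iterable

# Hypothetical markers for the start and the end of a word.
SRC, SNK = "<src>", "<snk>"

Edge = tuple[str, str]


def infer_soa(samples: Iterable[list[str]]) -> set[Edge]:
    """Collect an edge for every pair of adjacent symbols in the samples."""
    edges: set[Edge] = set()
    for word in samples:
        path = [SRC, *word, SNK]
        edges.update(zip(path, path[1:]))
    return edges


def accepts(edges: set[Edge], word: list[str]) -> bool:
    """A word is accepted when every adjacent pair in it is a known edge."""
    path = [SRC, *word, SNK]
    return all(pair in edges for pair in zip(path, path[1:]))
```

Because every symbol has exactly one state, the automaton generalizes beyond the sample: it accepts any word whose adjacent pairs were all observed, not only the example strings themselves.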

International World Wide Web Conferences | 2008

Learning deterministic regular expressions for the inference of schemas from XML data

Geert Jan Bex; Wouter Gelade; Frank Neven; Stijn Vansummeren

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
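The occurrence bound k in the definition of k-OREs can be computed directly from an expression. The sketch below assumes content models written as plain strings in which element names are identifier-like tokens and everything else is operator syntax. The function name `occurrence_bound` and the tokenizing pattern are hypothetical, and the sketch does not implement the learning algorithm itself.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k for which expr is a k-ORE (0 if it has no symbols)."""
    # Assumption: element names look like identifiers; operators such as
    # '|', ',', '*', '?', '+' and parentheses are skipped by the pattern.
    symbols = re.findall(r"[A-Za-z_][\w.-]*", expr)
    return max(Counter(symbols).values(), default=0)
```

An expression with bound 1 uses every symbol at most once, which matches the single occurrence regular expressions (SOREs) of the earlier work listed on this page.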

ACM Transactions on The Web | 2010

Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Geert Jan Bex; Wouter Gelade; Frank Neven; Stijn Vansummeren

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.

International World Wide Web Conferences | 2005

Expressiveness of XSDs: from practice to theory, there and back again

Geert Jan Bex; Wim Martens; Frank Neven; Thomas Schwentick

On an abstract level, XML Schema increases the limited expressive power of Document Type Definitions (DTDs) by extending them with a recursive typing mechanism. However, an investigation of the XML Schema Definitions (XSDs) occurring in practice reveals that the vast majority of them are structurally equivalent to DTDs. This might be due to the complexity of the XML Schema specification and the difficulty of understanding the effect of constraints on typing and validation of schemas. To shed some light on the actual expressive power of XSDs, this paper studies the impact of the Element Declarations Consistent (EDC) and the Unique Particle Attribution (UPA) rules. An equivalent formalism based on contextual patterns rather than on recursive types is proposed which might serve as a light-weight front end for XML Schema. Finally, the effect of EDC and UPA on the way XML documents can be typed is discussed. It is argued that a cleaner, more robust, stronger but equally efficient class is obtained by replacing EDC and UPA with the notion of 1-pass preorder typing: schemas that allow one to determine the type of an element of a streaming document when its opening tag is met. This notion can be defined in terms of restrained competition regular expressions and there is again an equivalent syntactical formalism based on contextual patterns.

International Conference on Management of Data | 2009

Simplifying XML Schema: effortless handling of nondeterministic regular expressions

Geert Jan Bex; Wouter Gelade; Wim Martens; Frank Neven

Whether beloved or despised, XML Schema is currently the only industrially accepted schema language for XML and is unlikely to become obsolete any time soon. Nevertheless, many nontransparent restrictions unnecessarily complicate the design of XSDs. For instance, complex content models in XML Schema are constrained by the infamous unique particle attribution (UPA) constraint. In formal language theoretic terms, this constraint restricts content models to deterministic regular expressions. As the latter constitute a semantic notion and no simple corresponding syntactical characterization is known, it is very difficult for non-expert users to understand exactly when and why content models do or do not violate UPA. In the present paper, we therefore investigate solutions to relieve users of the burden of UPA by automatically transforming nondeterministic expressions into concise deterministic ones defining the same language or constituting good approximations. The presented techniques facilitate XSD construction by reducing the design task at hand more towards the complexity of the modeling task. In addition, our algorithms can serve as a plug-in for any model management tool which supports export to XML Schema format.
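Whether a given content model is deterministic can be tested mechanically, even though the property is hard to judge by eye. The sketch below uses the standard position-automaton (Glushkov) test: an expression is deterministic when no two distinct positions carrying the same symbol compete, either at the start or after any single position. The tuple-based syntax tree and the names `glushkov` and `is_deterministic` are hypothetical. The sketch only detects nondeterminism and does not reproduce the paper's transformation algorithms.

```python
from itertools import count

# Hypothetical AST: ("sym", name), ("cat", e1, e2), ("alt", e1, e2),
# ("star", e), ("opt", e).


def glushkov(expr):
    """Return (first, follow, symbol_of) over numbered symbol positions."""
    ids = count()
    symbol_of = {}  # position -> symbol
    follow = {}     # position -> positions that may come directly after it

    def walk(node):
        """Return (nullable, first positions, last positions) of a subtree."""
        kind = node[0]
        if kind == "sym":
            p = next(ids)
            symbol_of[p] = node[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind in ("star", "opt"):
            _, first, last = walk(node[1])
            if kind == "star":  # iteration: the end loops back to the start
                for p in last:
                    follow[p] |= first
            return True, first, last
        if kind not in ("cat", "alt"):
            raise ValueError(f"unknown node kind: {kind}")
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:  # concatenation: ends of the left link to starts of the right
            follow[p] |= f2
        return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())

    _, first, _ = walk(expr)
    return first, follow, symbol_of


def is_deterministic(expr) -> bool:
    """True when no set of competing positions contains a symbol twice."""
    first, follow, symbol_of = glushkov(expr)
    for positions in [first, *follow.values()]:
        symbols = [symbol_of[p] for p in positions]
        if len(symbols) != len(set(symbols)):
            return False
    return True
```

For example, `(a|b)*a` fails the test because a reader seeing `a` cannot tell which occurrence it matches, while `b*a(b*a)*` defines the same language and passes.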

Lecture Notes in Computer Science | 2000

A Formal Model for an Expressive Fragment of XSLT

Geert Jan Bex; Sebastian Maneth; Frank Neven

The aim of this paper is twofold. First, we want to show that the recent extension of XSL with variables and passing of data values between template rules has increased its expressiveness beyond that of most other current XML query languages. Second, in an attempt to increase the understanding of this already widespread but not so transparent language, we provide an essential and powerful fragment with a formal syntax and a precise semantics.

Molecular Ecology | 2010

Phylogenetic diversity of Sri Lankan freshwater crabs and its implications for conservation

Natalie Beenaerts; Rohan Pethiyagoda; Peter K. L. Ng; Darren C. J. Yeo; Geert Jan Bex; Mohomed M. Bahir; Tom Artois

As part of a Global Biodiversity Hotspot, the conservation of Sri Lanka’s endemic biodiversity warrants special attention. With 51 species (50 of them endemic) occurring on the island, the biodiversity of freshwater crabs is unusually high for such a small area (65 600 km²). Freshwater crabs have successfully colonized most moist habitats and all climatic and elevational zones in Sri Lanka. We assessed the biodiversity of these crabs in relation to the different elevational zones (lowland, upland and highland) based on both species richness and phylogenetic diversity. Three different lineages appear to have radiated simultaneously, each within a specific elevational zone, with little interchange thereafter. The lowland and upland zones show a higher species richness than the highland zone while, unexpectedly, phylogenetic diversity is highest in the lowland zone, illustrating the importance of considering both these measures in conservation planning. The diversity indices for the species in the various IUCN Red List categories in each of the three zones suggest that risk of extinction may be related to elevational zone. Our results also show that overall more than 50% of Sri Lanka’s freshwater crab species (including several as yet undescribed ones), or approximately 72 million years of evolutionary history, are threatened with extinction.
