An Object-Oriented and Fast Lexicon for Semantic Generation
Maarten Hijzelendoorn and Crit Cremers
Leiden University Centre for Linguistics (LUCL)
1. Introduction
ID:A+B
HEAD: CONCEPT:discover
      PHON:ontdekt
      SLF:discover
      SYNSEM: ETYPE:event
              FLEX:fin
              NUMBER:sing
              PERSON:or([2,3])
              TENSEOP:at-pres
              VTYPE:transacc
PHON:C
PHONDATA:lijnop(ontdekt,A+B,[arg(right(-1),0,D),arg(left(11),wh,E)],C)
SLF:{{[F&(B+G)
Figure 1: A lemma for ontdekt 'discovers' (2nd/3rd pers. pres. sing.)

The lemmas are completely defined by HPSG-style feature-value specifications (Sag et al. 2003). Lemmas are complex symbols, and can be represented by Directed Acyclic Graphs (DAGs). Typically, they have a different number of features. A lemma may or may not specify a certain value for a certain feature; in the latter case, the lemma is underspecified. Besides atoms and numbers, values can be complex structures themselves, defining sub-graphs. A lemma will contain sub-graphs for semantically and/or syntactically related phrases. Unification applies to these sub-graphs, that is, two graphs A and B unify whenever B unifies with a designated sub-graph of A, in which case A is called primary and B secondary. By definition, the primary graph constrains the secondary graph in every relevant aspect: morphologically, syntactically and semantically. Thus, the Delilah lemma is a natural way of expressing collocational effects, from weak combinatory effects to rigid combinations. In fact, every lemma defines the domain for collocational effects. The lemma essentially separates constraints on sub-phrases of a structure from properties of the overall phrase. Inheritance and co-indexing are specified by using the same variable as value at different places in the graph. The lexicon is a collection of explicitly defined, spelled-out linguistic entities. They are 'unrelated' in the sense that they are stored independently of each other, and 'autonomous' in the sense that, once retrieved, they are operated on independently of each other. A different approach is followed in Cornetto, which defines a combinatorial and relational, i.e. implicit, network on the word level (Vossen et al. 2007). In Delilah such an information network, including, for example, collocations, has been 'compiled away', yielding real linguistic entities to start with, e.g. for generation purposes.
The following illustrates the lexicon's size and growth and the storage and access problem of a large computational lexicon. Adding the lemma for hij 'he' means adding 1 lexical entry, while adding the lemma for gelopen 'walked' (past part.) means adding 19 entries, adding the lemma for heeft 'has' (3rd pers. pres. sing. aux) means adding 133 entries, and adding the whole paradigm for verven 'to paint' means adding 226 entries. Clearly, the fully written-out specification of lexical entries introduces an exponential storage factor. (These figures are to be interpreted relatively and don't mean anything on their own. They reflect the current state of the lexicon.) Furthermore, for Delilah's grammar-driven generation component efficient access to the lexicon is crucial, because a word form should only be produced when its lexical specification matches certain constraints specified by the grammar and by the generation algorithm. It has been observed that searching and finding lexical entries is the main business of the generator. Therefore, efficient retrieval methods are required that can search and match complex lexical graphs against lexical constraints, which is hard. Finally, we developed our system in Prolog (Clocksin and Mellish 1984) for historical reasons. We implemented standards as much as possible, including ISO Prolog (Deransart et al. 1996).
2. Models
We will concentrate on a computational model of a lexicon. Such a lexicon can be stored either in the internal, working memory of the computer, or in external memory, e.g. on a hard disk. Storing it internally means at least loading all lexical entries, which takes much time when the lexicon is big. Access to data structures in internal memory is usually very fast, but Prolog's internal database is not so fast. Working memory cannot be extended beyond some gigabytes, which is not large enough for our purposes, while extension by virtual memory gives bad performance. The enormous proportions of the lexicon, let alone its foreseen expansion, and the limitations of current hardware rule out the option of storing it in internal memory. Storing it externally poses no space problem, because hard disks are 'big enough' and cheap nowadays. An external medium, however, is slower, and harder to access. As the lexicon is of a static nature once it has been generated, a full-featured database management system, including update facilities, is unnecessary. We can restrict ourselves to implementing a lexicon as a saved set of 'read-only' lexical entries, and provide efficient access methods for at least the most important features used by the generator, being the syntactic type, the semantic concept and the word form. The lexicon, built as a collection of lexical entries, can be regarded as a set of records in a database. There are a number of database models for database management. The Relational Model (Codd 1970) is based on two parts of mathematics: first-order predicate logic and the theory of relations. It is data-based and well-known from relational database management systems (RDBMS). As the programming language Prolog is based upon the Relational Model, it seems straightforward to consider this model in more detail.
Alternatively, the Object-Oriented Model (Meyer 1997) is investigated, because it was noted that a lexical entry can be seen as a 'linguistic object' in the OO Model. Such an object stores data (lexical specification) and procedures (e.g. for linearization). The OO Model is knowledge-based. Furthermore, the OO Model is often recommended when there is a need for high-performance processing of complex data, e.g. binary multimedia objects. The OO Model is well-known from graphical user interfaces, while object database management systems are emerging.
The Relational Model was the first formal database model, solidly founded on well-understood mathematical principles and explained by Date (2003). It was invented in the days when computer memory was scarce and expensive. A relational database consists of a number of relations ('tables'), in which all data is stored. Each relation is a set of tuples that all contain the same attributes (the horizontal rows, 'records'). A tuple is an unordered set of attribute values (the vertical columns, 'fields'). An n-tuple is an unordered set of n values of n attributes. All of the attribute values should be in the same domain, that is, they should be a valid value for the data type of the attribute, and they should obey the same constraints. Data types must be scalar, like integer or string, and cannot be compound, like a graph. Constraints provide a way of restricting the data that can be stored, either in tuples or in attributes. A relation is said to be n-ary iff it consists of a set of n-tuples. A special kind of constraint is a key. A key is an m-tuple of an n-ary relation, where m < n, that enforces the uniqueness of the combination of the m attribute values for each tuple. Key values are usually kept in an index or hash table, which is stored in internal memory for fast access. Keys prevent storing duplicate data. A relational variable can be assigned a subset of tuples as the result of a query. Queries are stated in a query language, typically SQL (Chamberlin and Boyce 1974). A query can be simplex or complex, consulting one or more tables, respectively, possibly under the condition of some constraints, of which the key, which links tables, is the most important one. Our lexical entries are complex, non-atomic data structures. The Relational Model only allows atomic data types. It would be possible to pre-compile them into flat, atomic strings.
However, flattening structured information is in general not a good idea when it comes to retrieval on the basis of some highly detailed substructure. Lexical entries are DAGs, that is, recursive data structures. Although it would be possible to pre-compile all feature-value paths of a DAG to a number of tuples in different tables by means of a recursive procedure, it would be impossible to retrieve them, as a relational database system does not provide for recursive processing (Hirao 1990). On the other hand, when we regard the lexicon as knowledge, we can linearize and store the graphs in a relational database, and use the powerful processing of recursion for inferences by Prolog, Prolog being mobilized as a powerful query language. It is possible to successfully and easily store objects in a relational database by following a step-by-step procedure (Ambler 2000). A disadvantage is that quite a number of tables might be involved, as graphs typically hold large numbers of features, while for semantic generation no less than complete lexical specifications are required. Pre-compiling a lexical entry for a noun will typically yield a different number of tuples (in just as many tables) than pre-compiling a verbal entry. The top level, a main table that is to represent complete lexical entries, has to span all the tuples into one large main tuple, in which each attribute represents a path, and where the attribute's value is either the value of the path, or an ID that links to another table. This implies that there will be more than one top level: one table for each combination of attributes, and consequently more than one main table. As we don't want to impose a restriction on the internal dependencies of a graph, there is no restriction on the recursion depth of features. Thus, we allow for dependencies in multi-word expressions of any kind.
Consequently, the number of different main tables can be large in practice, which decreases performance by orders of magnitude. Thus, lexical entries cannot be regarded as single, homogeneous relations in the Relational Model, let alone be retrieved as such. Furthermore, lexical entries use variables. The Relational Model does not allow attribute values to be variables. A variable is not an atomic constant, and thus not distinguishable from other values for the same attribute. As a consequence, an attribute that has a variable in its data domain cannot be indexed, which could lead to bad overall performance of the database. It would be possible to pre-compile variables into constants, and to de-compile them during retrieval. Calling meta-predicates, however, is in general rather time-consuming. We conclude that the Relational Model cannot efficiently accommodate (logic) variables, recursive features and, consequently, the top level. It does not meet high-performance demands on complex (recursive) data structures. It is inappropriate for our purposes.
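To make the storage problem concrete, the pre-compilation of a feature graph into flat path-value tuples, as considered above, can be sketched as follows. This is an illustrative Python sketch, not the Delilah implementation; the nested-dict encoding and the feature names are assumptions.

```python
def flatten(dag, prefix=()):
    """Recursively flatten a feature-value graph (here: a nested dict)
    into (path, value) tuples, the flat shape a relational table needs."""
    rows = []
    for feature, value in dag.items():
        path = prefix + (feature,)
        if isinstance(value, dict):      # complex value: recurse into sub-graph
            rows.extend(flatten(value, path))
        else:                            # atomic value: emit one tuple
            rows.append((":".join(path), value))
    return rows

# A toy fragment loosely modeled on the lemma of figure 1 (hypothetical).
lemma = {"CONCEPT": "discover",
         "PHON": "ontdekt",
         "SYNSEM": {"NUMBER": "sing", "PERSON": "or([2,3])"}}
rows = flatten(lemma)   # e.g. ("SYNSEM:NUMBER", "sing") among the tuples
```

Each tuple would become a row keyed by its path attribute; the sketch also shows why retrieval is the hard direction, since reassembling the graph requires the recursion the database itself lacks.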
In the Object-Oriented Model, information and control are represented in the form of interacting 'objects', as is well-known from the Object-Oriented Programming (OOP) paradigm. An object can be seen as an information processor. It accepts commands from other objects, processes data by executing procedures ('methods'), data and methods both being stored in the object, and sends commands to other objects to be executed by those objects. By keeping data (properties) and procedures (operations) together in one local unit, an object holds all characteristics related to some concept. This implies that an object is a complex data structure. It is independent of other objects, it has its own ID, and its own role. OOP, then, is modeling a problem by distinguishing different (abstract) levels of objects (called 'classes' and 'subclasses', which are maintained in a 'class hierarchy'), and defining their cooperation and interactions. A class defines the general characteristics of a concept in terms of the problem domain. An object is a particular instance of a class, from which it inherits all properties and methods, and to which it may add its own information, or overwrite inherited information. Multiple inheritance is inheriting from more than one class, combining properties and methods. Classes are the structuring elements (modules) in OOP, which hide the details of the code to be accessed by objects that stem from other classes.
Prolog is a declarative programming language with procedural semantics different from the OOP paradigm. In Prolog, the term 'object' refers to things that can be represented by terms, Prolog's only data type. It does not refer to a data structure that inherits from a class hierarchy. Some vendors have extended Prolog with object-oriented features (SICS 2008). However, in general, mixing programming styles will lead to unmanageable and incompatible code. Therefore, we will stick to Prolog's execution mechanism and to its term data type, and explore the possibility of objects as a data representation for lexical entries, or 'linguistic objects'.
Linguistic objects are generated by deriving information from one or more generic classes of templates, called 'constructions', and by adding local information. This fits nicely in the OOP concept of objects that are constructed by a specialized 'constructor' method and by (multiply) inheriting information from classes. A linguistic object, a complex, recursive graph, can be mapped onto an OOP object, which is a complex data structure. Shared variables are, in fact, an abbreviation for a unification procedure, deferred until runtime. Encoding them by an OOP method is straightforward. Linguistic objects are independent units and, thus, uniquely identifiable, as are OOP objects. This makes an object, including all properties and methods, accessible by one ID, which is extremely important for efficient storage and access by data-intensive processes, such as semantic generation. ISO Prolog terms, being complex data types, are well-equipped to represent OOP objects, including variables, recursion, and unique identification. On the other hand, linguistic objects are built from classes, and any combinatory difference is compiled out as a difference between objects. These objects may differ from each other minimally, yielding a significant amount of overlap and introducing an exponential space factor. For example, the objects for the 2nd and 3rd person of a regular verb only differ in the person and phonological features. Objects can adopt different states when they get involved in some process. 'Persistent' objects are objects whose initial state has been saved. We conclude that the Object-Oriented Model provides a natural environment for representing linguistic objects. It lifts the drawbacks of the Relational Model with respect to data modeling, and potentially enables fast access by unique identifiers. Its data redundancy is a small price to pay, given the considerable decrease of the price/performance ratio of hard disks each year.
Future developments in runtime file (de-)compression techniques might weaken this disadvantage. Prolog’s term data type is suitable to represent linguistic data objects.
An Object-Oriented database system must satisfy two criteria: it should be a database management system, and it should be an object-oriented system, i.e., to the extent possible, it should be consistent with the current crop of object-oriented programming languages (Atkinson et al. 1989).

We illustrate the operation of the generator. It produces free, grammatical and meaningful sentences. In categorial grammar, a category consists of a head, and zero or more arguments to its left and/or right side. For generation purposes, the category can be seen as an agenda. The generation algorithm keeps already produced heads in an unordered list, and still to be produced arguments on a stack. It handles them by inserting and deleting elements into/from an arbitrary position in the list, or onto/from the top of the stack, respectively. It starts with a random semantic concept, and finds one of its realisations in the lexicon. The head of its category is inserted in the list. The arguments are shifted on the stack. Each argument is produced by either reduction with some head in the list, or by reduction with a new category to be found in the lexicon that has it as head. The topmost argument of the stack is replaced by the arguments of the new category, or removed completely when it does not have arguments of itself. When the argument stack is empty, there are two possibilities. If the heads list does not consist of exactly one of the sentential categories s ('sentence') or q ('query'), the lexicon is consulted for a category that has the non-sentential category as an argument. Its head is inserted in the heads list, while its arguments are shifted on the argument stack. Otherwise, the algorithm stops. Categories, being complex symbols, are unified on each reduction step, yielding both linearization and underspecified logical form. In table 1 the procedure for deriving the sentence die Nederlander ontdekt diepe betekenissen 'that Dutchman discovers deep meanings' is listed.
Only categories and number features are shown, instead of full complex symbols.
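The agenda mechanism just described can be sketched in Python. This is a deliberately simplified illustration under strong assumptions: categories are reduced to plain head/argument symbols, unification and linearization are omitted, and the lexicon lookup deterministically picks the first matching entry, whereas Delilah selects among realisations.

```python
# Toy categories as (head, [arguments]); e.g. np/n becomes ("np", ["n"]).
LEXICON = {
    "betekenissen": ("n", []),
    "diepe": ("np", ["n"]),
    "ontdekt": ("s", ["np", "np"]),
    "die": ("np", ["n"]),
    "Nederlander": ("n", []),
}

def generate(start_word):
    """Keep produced heads in a list and pending arguments on a stack.
    Each popped argument is reduced either against an existing head, or
    against a fresh lexical category that has the argument as its head;
    the new category's own arguments are then pushed on the stack."""
    head, args = LEXICON[start_word]
    heads, stack, produced = [head], list(args), [start_word]
    while stack:
        needed = stack.pop()
        if needed in heads:                       # reduce with a produced head
            heads.remove(needed)
        else:                                     # consult the lexicon instead
            word, (_head, new_args) = next(
                (w, c) for w, c in LEXICON.items() if c[0] == needed)
            produced.append(word)
            stack.extend(new_args)                # its arguments go on the stack
    return heads, produced
```

Starting from ontdekt, the loop empties the argument stack and leaves the single sentential head s, mirroring the stopping condition above; the sanity check on a non-sentential remainder and the word-order bookkeeping done by unification are left out of the sketch.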
1. Action: betekenissen 'meanings', cat=n, num=plur
   Result: betekenissen 'meanings'; Heads: {n plur}; Args: {}
2. Action: diepe 'deep', cat=np/n, where n has num=plur
   Result: diepe betekenissen 'deep meanings'; Heads: {np plur}; Args: -
3. Action: ontdekt 'discovers', cat=s\np1/np2, where s has num=sing, np1 has num=sing and np2 has num unspecified
   Result: ontdekt diepe betekenissen 'discovers deep meanings'; np2 is assigned num=plur; Heads: {s sing}; Args: {s-np sing}
4. Action: die 'that', cat=np/n, where n has num=sing
   Result: die _ ontdekt diepe betekenissen 'that _ discovers deep meanings'; Heads: {s sing}; Args: {s-n sing}
5. Action: Nederlander 'Dutchman', cat=n, num=sing
   Result: die Nederlander ontdekt diepe betekenissen 'that Dutchman discovers deep meanings'; Heads: {s sing}; Args: {}
Table 1: Generating a sentence

3. Methods
We will describe methods that store linguistic objects as objects in an OO lexicon, and methods, some borrowed from the Relational Model, that retrieve these objects efficiently and, better yet, fast. The methods have been implemented in ISO Prolog. They should obey the Resource Principle, which is stated thus: "Deploy working memory when performance is the key factor, and deploy external memory when storage is the main aspect". This principle is a practical phrasing of the insights that working memory will never be large enough to hold our current and planned number of linguistic objects, that working memory is needed for real computation tasks, that working memory is faster than external memory, and that external disk space is abundant and cheap. We refrain from implementing interfaces to third-party products, e.g. the (semi-commercial) Berkeley DB library for external storage of terms (SICS 2008), because we prefer 'light', compatible, and manageable interfaces that are portable to other platforms. Sections 3.1, 3.2, and 3.3 describe the techniques of index tables, caching, hashing and compression. In later sections their use is described.
The 'Edinburgh' Prolog standard offers sequential read and write access to files. Although linear access is still efficient in complexity terms, searching will take an unacceptable amount of time in the worst case. In ISO Prolog, the concept of a file has been improved on, and been replaced by the concept of a 'stream'. A stream is a file with random access, that is, each byte in the file can be located in constant time. An index table is a table that relates unique, simplex values to the positions in the context where these values can be found. In our case, this translates to a table per feature that relates each feature's value to all occurrences of objects that contain that feature with that value. An index table can be implemented by a collection of clauses of a 2-ary relation with the first argument as key. Many Prolog systems use 'first-argument indexing' to locate the correct clause, given its key, in constant time. This technique is not part of ISO Prolog, but it is regarded as an essential facility for interpreting Prolog programs efficiently. A fast lexicon for semantic generation hinges on the stream concept and indexing techniques.
Internal memory can be accessed much faster than external memory. Applications often need the same data again, and again. This has led to the development of cache memories. A disk cache is a storage mechanism in working memory that keeps the most recently read data plus the data of adjoining sectors in a buffer. As soon as the application asks for some data, the buffer is consulted first, saving access time to the hard disk. When the required data does not fit in the buffer, an extra disk access is necessary. To exploit disk caching, the data must be ordered in a way that corresponds to a relevant search criterion. Disk caching might be implemented at the application level, as advocated by Ceri et al. (1989). Hashing (Knuth 1998) is a method that converts an arbitrary, complex value to a simplex one (the 'hash' or 'key') by applying a hash function to it. Typically, hashing converts to an integer, because it is the most economic data type, and the easiest to use by a computer program. A hash value enables an index table to be used. Searching for an object in a collection of n objects, given an unhashed value of some constraint, will take O(n) comparison operations. With a hashed value and objects that are hashed on the appropriate constraint by applying one and the same hash function, searching an arbitrary object only takes O(1) time, assuming that the index table can be directly accessed. This is easy to implement in Prolog systems that can do first-argument indexing. As each piece of data ever to be searched for is fixed, the lexicon is a collection of 'static search sets'. For such sets, a 'perfect hash function' (PHF) can be designed, that is a function that will never assign the same hash to different data structures. However, depending on the complexity of the data structure, the hash can exceed the range of the integer data type on some computer platforms.
A 'near perfect hash function' takes the integer range into account, but allows for duplicates ('collisions'). Duplicates increase space and time complexity. A 64-bit platform extends the integer range by orders of magnitude, compared to 32-bit, enabling a PHF to be used, yielding zero collisions. Number grouping is a compression technique that replaces a set of adjacent integers by a range, starting with the lowest number, and ending with the highest number. The space complexity for a range is constant, instead of linear for a set of numbers. The time complexity for determining the intersection of two ranges is constant, while it is linear for intersecting ordered sets. The techniques described so far are used for creating and accessing the object lexicon (section 3.4) and for the index tables (sections 3.5 and 3.6). Section 3.7 discusses lexical retrieval in general, and gives some complexity results. The object lexicon and its access methods are generated by an off-line process, which is not subject to the Resource Principle. During generation, the objects are checked for well-formedness and validity and are assigned an ID for reference only, corresponding to the order they are generated in. The process of creating them is not described here. Once they are created, they are linearized into a series of bytes, and stored as persistent objects, formatted as standard Prolog terms. A persistent object can be seen as a variable-length 'record', a concept from the Relational Model. The object's ID corresponds to the 'record number'. The physical address of the persistent object's first byte is kept in an external access table. Obviously, the object lexicon and the access table have the same large number of entries. Storing them externally agrees with the Resource Principle. The access table has records (addresses) of fixed size. Consequently, it can be addressed by a function that maps an ID=n to the physical address of the n'th entry in the access table.
The access table, opened as a stream, gives direct access to this address. The information to be found there points to the physical location of the n'th object in the object lexicon, which, as a stream, has direct access. Thus, objects are retrieved by ID via an indirect addressing scheme in O(1) time, at the cost of one extra disk access, one function in working memory, and one auxiliary access table in external memory. Prolog's read predicate loads the objects as native Prolog terms. Efficient data structures are the basis for efficient algorithms. Therefore, the persistent objects need to be ordered by some criterion. A parser would benefit from an ordering on the phonological field for lexical look-up and tagging purposes and would exploit a disk cache by accessing all objects with the same surface form using a buffered read operation. A generator would take advantage of an ordering on the most frequently used deep constraint for handling the agenda. As the generator only picks one carefully selected object at a time, no physical ordering seems to be beneficial. However, physical ordering by some criterion corresponds to a very compact index table for that criterion, because each entry's set of ID's can be represented by a range. It turns out that a physical ordering on syntactic type will prove helpful to the generator. This makes it clear that a lexicon that has to be deployed by both a parser and a generator needs to meet two sets of requirements. As the class of syntactic types is much smaller than the class of surface forms, a physical ordering on the former will yield fewer and bigger ranges, and consequently a more compact index table than an ordering on the latter, saving more memory, in line with the Resource Principle, and speeding up intersection operations. For semantic generation, we require maximum performance on the retrieval of linguistic objects, specified by constraints on the features for semantic concept, syntactic type, and word form.
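The two-step addressing scheme sketched above, one seek into the fixed-size access table to obtain the object's address, then one seek into the object lexicon, can be illustrated in Python. The 8-byte little-endian offsets and newline-terminated records are assumptions of this sketch, not the Delilah file format.

```python
import io
import struct

ADDR_SIZE = 8  # one fixed-size access-table record: a 64-bit file offset

def fetch_object(lexicon, access_table, obj_id):
    """Retrieve the obj_id'th persistent object in O(1):
    a direct seek into the access table yields the object's physical
    address, and a second seek into the lexicon stream yields the object."""
    access_table.seek(obj_id * ADDR_SIZE)            # ID -> table entry
    (offset,) = struct.unpack("<Q", access_table.read(ADDR_SIZE))
    lexicon.seek(offset)                             # entry -> object
    return lexicon.readline().rstrip(b"\n")          # one linearized term

# Demo on in-memory 'files'; on disk these would be random-access streams.
lexicon = io.BytesIO(b"obj0\nobj1\nobj2\n")
table = io.BytesIO(struct.pack("<3Q", 0, 5, 10))
assert fetch_object(lexicon, table, 1) == b"obj1"
```

Because the access-table records have fixed size, the first seek position is computed by a plain multiplication, which is what makes retrieval by ID constant-time regardless of the lexicon's size.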
For each object, these features are stored in three auxiliary index tables in working memory. If an entry does not exist, it is created and added to the index table together with the object's ID. If it does already exist, only the ID is added, in order. Number grouping is applied to the sets of ID's as much as possible. Each combinatory type is hashed by a PHF into a unique string, instead of an integer, because, theoretically, types are unlimited in size, which may yield too many collisions. Each word form and each semantic concept is hashed to an integer by applying a near-PHF to their alphabetic letters. The hash value is kept as small as possible by taking into account the usage frequency of letters in Dutch. This method is a variant on Huffman (1952) coding. Provisions are made for handling collisions, which result from lexical ambiguities. In general, retrieving all ID's of objects that satisfy either the concept, type, or phon constraint takes O(1) time, when first-argument indexing is applied by the Prolog system to the respective index tables, and zero disk accesses. The index tables have a space complexity that is a linear function, with a small factor due to number grouping, of the number of objects. The index table on the feature that is used for physically ordering the objects has a space complexity that is a linear function of the number of the feature's values. The index tables are kept in working memory for fast access, in accordance with the Resource Principle. The following lines are extracted from the type index, encoded by clauses of the predicate yx/2. This predicate encodes which objects contain which type.

yx('RHa/Hc',[45591+1130]).
yx('R/HcHc',[46722+568]).
yx('N/HcHc',[47291+575]).

The string 'RHa/Hc' in the first argument of the first clause is the hash of type s\np/np, shared by the objects with ID's in the range 45591 up to 46722, encoded by 45591+1130 in the second argument.
The string ‘R/HcHc’ is the hash for type s/{np,np}, and the objects with ID’s in the range 46722 up to 47291 share this type. The lines demonstrates the economic encoding of ranges of object ID’s. Prolog’s first-argument indexing technique can be applied to the predicate yx/2, because the hashes are unique values. To find some object with type s\np/np, the type is hashed, yielding ‘RHa/Hc’. The hash is looked up in the type index, yielding a range of object ID’s, including discovers , shown in figure 1.
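The yx/2 clauses can be mimicked in Python to show how a hashed key plus number grouping yields an ID set. This is an illustrative sketch: the (start, span) tuples stand for the Prolog start+span terms, read here so that start+span is the last ID of the range, consistent with the adjacency of the ranges above.

```python
# Hashed type -> number-grouped ID ranges, mirroring the yx/2 clauses.
TYPE_INDEX = {
    "RHa/Hc": [(45591, 1130)],   # hash of s\np/np
    "R/HcHc": [(46722, 568)],    # hash of s/{np,np}
    "N/HcHc": [(47291, 575)],
}

def ids_for(type_hash):
    """Expand the number-grouped ranges back into the full ordered ID list;
    (start, span) covers the IDs start .. start+span inclusive."""
    ids = []
    for start, span in TYPE_INDEX.get(type_hash, []):
        ids.extend(range(start, start + span + 1))
    return ids

# The last ID of the s/{np,np} range lies one below the next range's start.
assert ids_for("R/HcHc")[-1] == 47290
```

A dict lookup plays the role of first-argument indexing here: the hashed key locates the clause in constant time, after which only the compact range representation has to be expanded or intersected.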
Additionally, the semantic generator may select objects that are specified by constraints on other features. We don't demand maximum performance on queries of this kind. For each linguistic object, each value and full path, starting at the top of the graph, of each feature ever to be retrieved, and not being concept, type or phon, is stored in one auxiliary index table in working memory. This 'meta' index table spans more than one feature. The ID's of all objects that specify the same value for some feature are number-grouped and stored in an external meta data table. Additionally, the set of ID's of objects that do not specify any value for the feature is calculated and stored. The storage address per feature-value pair is kept in the meta index table. Retrieval of all ID's of objects that match one of these constraints takes O(1) time, when first-argument indexing is applied by the Prolog system, and one extra disk access. The space complexity of the meta index table is a linear function of the number of unique feature-value combinations, occupying only a fraction of the working memory, consistent with the Resource Principle. Retrieval of linguistic objects is performed by executing a search task, specified by one or more constraints. When the search task is a graph, typically an argument sub-graph of some object involved in the generation process, it is flattened into a series of constraints. Each constraint is looked up in an appropriate index table. We distinguish strict and liberal constraints, which require objects to match the constraint explicitly, or allow objects that are unspecified for the constraint, respectively. Liberal constraints follow from graphs that may be underspecified. When a strict constraint is a restriction on semantic concept, syntactic type, or word form, the set of ID's of the objects satisfying the constraint is found immediately in the respective index tables.
For other features, including liberal constraints, the set of ID's is found after issuing an extra disk access to the meta data table. In all cases, a constraint is replaced by a set of ID's, corresponding to objects that are consistent or not inconsistent with the constraint. The objects that satisfy all constraints are identified by ID's that result from intersecting all sets of ID's. As these sets are ordered, calculating their intersection is a linear function of the size of the biggest set. The function factor may be very small when ranges are involved. In general, however, the ranges get fragmented after a few intersection operations. By applying directly accessible pre-compiled access and index tables, searching is reduced to looking up with great performance, in contrast to Prolog's own depth-first backtracking search algorithm, which is inefficient. In summary: finding one object, given its ID, takes O(1) time and one disk access. Finding the set of ID's of objects satisfying a constraint on concept, type or phon only, takes O(1) time and zero disk accesses. Finding the set of ID's of objects satisfying another constraint takes O(1) time and one disk access. Finding the set of ID's of objects that satisfy n arbitrary constraints takes O(n) time and at most n disk accesses.
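The two intersection regimes discussed above, constant-time on intact ranges and linear-time on fragmented ordered sets, can be sketched as follows (an illustrative Python sketch of the general techniques, not the Delilah code):

```python
def intersect_sorted(a, b):
    """Linear-time merge intersection of two ordered ID lists;
    cost is bounded by the length of the longer list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_ranges(r, s):
    """Constant-time intersection of two (low, high) ID ranges,
    or None when the ranges are disjoint."""
    low, high = max(r[0], s[0]), min(r[1], s[1])
    return (low, high) if low <= high else None

assert intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]) == [3, 5]
assert intersect_ranges((10, 50), (40, 90)) == (40, 50)
```

As long as the candidate sets stay number-grouped, `intersect_ranges` applies; once fragmentation sets in, the merge walk of `intersect_sorted` takes over, which is exactly the cost behaviour described above.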
4. Conclusions and discussion
We showed the design and access of an external lexicon to be used by the semantic generator component of an NLP system in a Prolog context. The major design goal was to develop a fast lexicon, as searching and finding complete linguistic data is the major activity of the generator. It appeared that some properties of our data (variables, recursion) prevented them from being stored in the Relational Model, e.g. in a relational database. Furthermore, it turned out that this model is inherently unsuitable for fast performance on complete linguistic data, because the data is spread across a large number of tables. The Object-Oriented Model proved to be a natural environment for our linguistic data when regarded as objects. The nature of categorial grammar, being the basis of our objects, and the nature of objects, being distinct copies derived from one or more classes, introduce an exponential space demand when storing them in an object-oriented way. It was felt that this is an inevitable consequence of the way Delilah was designed. The space issue was remedied by putting in a bigger storage device, which is cheap nowadays. While the storage of objects is redundant, it was shown that an object can still be efficiently retrieved using advanced index tables in working memory, by compression techniques, and by implementing the lexicon and retrieval functions in ISO Prolog, using Prolog's first-argument indexing technique and ISO Prolog's stream I/O. In this respect, ISO Prolog surpasses 'Edinburgh' Prolog by an order of magnitude. Efficient runtime performance was achieved by executing an off-line compilation process of the lexical resources into efficient data structures. Currently, the lexicon holds ca. 70K fully inflected entries, specified by ca. 1,500 lemmas and over 200 constructions, occupying ca. 63 MB of disk space. The sum of all external index and auxiliary tables is under 3 MB.
In the near future we hope to incorporate the valency information of the Alpino lexicon (Bouma et al. 2001).

References
Ambler, S.W. (2000). Mapping objects to relational databases, IBM DeveloperWorks Magazine.

Atkinson, M., F. Bancilhon, D. DeWitt, K. Dittrich, D. Maier, and S. Zdonik (1989). The Object-Oriented Database System Manifesto, Proceedings of the First International Conference on Deductive and Object-Oriented Databases, Elsevier Science Publishers, pp. 223-240.

Baldridge, J., and G.-J.M. Kruijff (2003). Multi-Modal Combinatory Categorial Grammar, Proceedings of the 10th Annual Meeting of the European Association for Computational Linguistics, ACL, Morristown, NJ, USA, pp. 211-218.

Bouma, G., G. van Noord, and R. Malouf (2001). Alpino: Wide-coverage Computational Analysis of Dutch, Proceedings of CLIN 2000, Rodopi, Amsterdam, pp. 45-59.

Ceri, S., G. Gottlob, and G. Wiederhold (1989). Efficient Database Access from Prolog, IEEE Transactions on Software Engineering (2), pp. 153-164.

Chamberlin, D.D., and R.F. Boyce (1974). SEQUEL: A Structured English Query Language, Proceedings of the ACM SIGFIDET Workshop on Data Description, Access and Control, ACM, New York, NY, USA, pp. 249-264.

Clocksin, W.F., and C.S. Mellish (1984). Programming in Prolog, 2nd ed., Springer-Verlag.

Codd, E.F. (1970). A Relational Model of Data for Large Shared Data Banks, Communications of the ACM (6), pp. 377-387, ACM Press, New York, NY, USA.

Cremers, C. (1993). On parsing coordination categorially, PhD thesis, Leiden University, number 5 in HIL dissertations.

Cremers, C. (2002). (’n) Betekenis berekend, Nederlandse Taalkunde (4), pp. 375-395.

Croft, W. (2001). Radical Construction Grammar, Oxford University Press.

Date, C.J. (2003). Introduction to Database Systems, 8th ed., Addison-Wesley.

Deransart, P., A. Ed-Dbali, and L. Cervoni (1996). Prolog: The Standard, Springer-Verlag.

Hirao, T. (1990). Extension of the relational database semantic processing model, IBM Systems Journal (4), pp. 539-550.

Huffman, D.A. (1952). A Method for the Construction of Minimum-Redundancy Codes, Proceedings of the Institute of Radio Engineers (9), pp. 1098-1101.

Knuth, D. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed., Addison-Wesley.

RBN Manual, Vrije Universiteit, Amsterdam.

Meyer, B. (1997). Object-Oriented Software Construction, 2nd ed., Prentice-Hall.

Moortgat, M. (1997). Categorial Type Logics, in J. van Benthem and A. ter Meulen, editors, Handbook of Logic and Language, Elsevier, Amsterdam and The MIT Press, Cambridge MA, pp. 93-177.

Sag, I., T. Wasow, and E. Bender (2003). Syntactic Theory: A Formal Introduction, CSLI, University of Chicago Press.

SICS (2008). Documentation and Manuals for SICStus Prolog 4.

Steedman, M. (2000). The Syntactic Process, The MIT Press.

Vossen, P., editor (1998). EuroWordNet: a multilingual database with lexical semantic networks for European Languages, Kluwer, Dordrecht.

Vossen, P., K. Hofmann, M. de Rijke, E. Tjong Kim Sang, and K. Deschacht (2007). The Cornetto Database: Architecture and User-Scenarios,