Interactive Query Formulation using Point to Point Queries

Abstract

Effective information disclosure in the context of databases with a large conceptual schema is known to be a non-trivial problem. In particular the formulation of ad-hoc queries is a major problem in such contexts. Existing approaches for tackling this problem include graphical query interfaces, query by navigation, and query by construction. In this article we propose the point to point query mechanism, which can be combined with the existing mechanisms into an unprecedented computer supported query formulation mechanism. In a point to point query a path through the information structure is built. This path can then be used to formulate more complex queries. A point to point query is typically useful when users know some object types which are relevant for their information need, but do not (yet) know how they are related in the conceptual schema. Part of the point to point query mechanism is therefore the selection of the most appropriate path between object types (points) in the conceptual schema. This article discusses both some of the pragmatic issues involved in the point to point query mechanism, and the theoretical issues involved in finding the relevant paths between selected object types.


Asymetrix Report 94-1

H.A. Proper
Asymetrix Research Laboratory
Department of Computer Science
University of Queensland
Australia
[email protected] of February 4, 2021 at 1:24

Published as: H.A. (Erik) Proper. Interactive Query Formulation using Point to Point Queries. Technical report, Asymetrix Research Laboratory, University of Queensland, Brisbane, Queensland, Australia, 1994.


Most present day organisations make use of some automated information system. This usually means that a large body of vital corporate information is stored in these information systems. As a result an obvious, yet crucial, function of information systems is the support of disclosure of this information. Without a set of adequate information disclosure avenues an information system becomes worthless since there is no use in storing information that will never be retrieved. An adequate support for information disclosure, however, is far from a trivial problem. Most query languages do not provide any support for the users in their quest for information. Furthermore, the conceptual schemata of real-life applications tend to be quite large and complicated. As a result, the users may easily become ‘lost in conceptual space’ and they will end up retrieving irrelevant (or even wrong) objects and may miss out on relevant objects. Retrieving irrelevant objects leads to a low precision, while missing relevant objects has a negative impact on the recall ([SM83]).

The disclosure of information stored in an information system has some clear parallels to the disclosure problems encountered in document retrieval systems. To draw this parallel in more detail, we quote the information retrieval paradigm as introduced in [BW92]. The paradigm starts with an individual or company having an information need they wish to fulfil. This need is typically a vague notion and needs to be made more concrete in terms of an information request (the query) in some (formal) language. The information request should be as good as possible a description of the information need. The information request is then passed on to an automated system, or a human intermediary, who will then try to fulfil the information request using the information stored in the system. This is illustrated in the information disclosure, or information retrieval paradigm, presented in figure 1 which is taken from [BW92].

[Figure 1: The information retrieval paradigm. Diagram labels: Information Need, Information Request ($q$), Information Base ($K$), Characterisation ($\chi$); processes: Formulation, Matching, Indexing.]

We now briefly discuss why the information retrieval paradigm for document retrieval systems is also applicable for information systems. For a more elaborate discussion on the relation between information systems and document (information) retrieval systems in the context of the information retrieval paradigm, refer to [Pro94a]. In the paradigm, the retrievable information is modelled as a set $K$ of information objects constituting the information base (or population).

In a document retrieval system the information base will be a set of documents ([SM83]), while in the case of an information system the information base will contain a set of facts conforming to a conceptual schema. Each information object $o \in K$ is characterised by a set of descriptors $\chi(o)$ that facilitates its disclosure. The characterisation of information objects is carried out by a process referred to as indexing. In an information system, the stored objects (the population or information base) can always be identified by a set of (denotable) values, the identification of the object. For example, an address may be identified as a city name, street name, and house number. The characterisation of objects in an information system is directly provided by the reference schemes of the object types.

The actual information disclosure is driven by a process referred to as matching. In document retrieval applications this matching process tends to be rather complex. The characterisation of documents is known to be a hard problem ([Mar77], [Cra86]), although newly developed approaches turn out to be quite successful ([Sal89]). In information systems the matching process is less complex as the objects in the information base have a clearer characterisation (the identification). In this case, the identification of the objects (facts) is simply related to the query formulation $q$ by some (formal) query language.

The remaining problem is the query formulation process itself. An easy and intuitive way to formulate queries is absolutely essential for an adequate information disclosure. Quite often, the quest from users to fulfil their information need can be aptly described by ([Bru93]):

I don’t know what I’m looking for, but I’ll know when I find it.

In document retrieval systems this problem is attacked by using query by navigation ([BW92], [Bru93]) and relevance feedback mechanisms ([Rij89]). The query by navigation interaction mechanism between a searcher and the system is well-known from the Information Retrieval field, and has proven to be useful. It should come as no surprise that these mechanisms also apply to the query formulation problem for information systems. In [BPW93], [BPW94], [HPW94b], [Pro94a] such applications of the query by navigation and relevance feedback mechanisms have been described before. When combining the query by navigation and manipulation mechanisms with the ideas behind visual interfaces for query formulation as described in e.g. [ADD+92] and [Ros94], powerful and intuitive tools for computer supported query formulation become feasible. Such tools will also heavily rely on the ideas of direct manipulation interfaces ([Sch83]) as used in present day computer interfaces.

One important step in the improvement of the information disclosure of information systems is the introduction of query languages on a conceptual level. Examples of such conceptual query languages are RIDL ([Mee82]), LISA-D ([HPW93], [HPW94a]), and FORML ([HHO92]). By letting users formulate queries on a conceptual level, they are safeguarded from having to know the exact mapping to internal representations (e.g. a set of tables conforming to the relational model), as would be required when formulating queries in a non-conceptual language such as SQL. The next step is to introduce ways to support users in the formulation of queries in such conceptual query languages (CQL).

In line with the above discussed information retrieval paradigm and the notion of relevance feedback, a query formulation process (both for a document retrieval system, and an information system) can be said to roughly consist of the following four phases:

1. The explorative phase. What information is there, and what does it mean?
2. The constructive phase. Using the results of phase 1, the actual query is formulated.
3. The feedback phase. The result from the query formulated in phase 2 may not be completely satisfactory. In this case, phases 1 and 2 need to be re-done and the result refined.
4. The presentation phase. In most cases, the result of a query needs to be incorporated into a report or some other document. This means that the results must be grouped or aggregated in some form.

Depending on the user’s knowledge of the system, the importance of the respective phases may change. For instance, a user who has a good working knowledge of the structure of the stored information may not require an elaborate first phase and would like to proceed with the second phase as soon as possible.

In this paper, we discuss one of the mechanisms to support automated disclosure of information stored in information systems. As stated before, the related notions of query by navigation and query by construction have already been discussed in [BPW93], [PW95], [Pro94a]. This article is concerned with the point to point query (PPQ) mechanism as an additional avenue for information disclosure. A point to point query starts by selecting two or more object types from a conceptual schema. Then the system will return a list of possible (non-cyclic) paths through the information structure between the specified object types. For obvious reasons, the paths in this list should be ordered according to some relevance criterion. This style of querying corresponds to a situation in which users know some aspects (object types) about which they want to be informed, but do not yet know the exact details of their information need and the underlying information structure. The query by navigation mechanism, on the other hand, is intended to support users who do not have an overview of the stored information.

Despite how simple the above scenario may seem, the point to point query mechanism required to realise this query is far from trivial. There are two main problems involved. Firstly, all (non-cyclic) paths through the conceptual schema must be found. This corresponds to finding all (non-cyclic) paths between two nodes in a graph, which is in general an exponential (NP hard) problem. (Finding the shortest paths is polynomial!) The second main problem is the order in which the results should be presented to the user. It is clear that (especially when there is an abundance of possible paths to choose from) the alternatives should be presented to the user in some order of relevance. We believe to have found an approach to these two problems that makes a point to point query mechanism feasible.

The structure of this article is as follows. In section 2, we discuss an example PPQ session, and elaborate briefly on its integration with query by navigation and query by construction. Section 3 deals with the representation of a conceptual schema as a graph. Searching for a path through this graph is covered in section 4. In section 5 an optimisation strategy is introduced based on a pre-compilation of the conceptual schema graph. Finally, section 6 concludes this article. For the reader who is unfamiliar with the notation style used in this report, it is advisable to first read [Pro94b].
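The contrast drawn above between enumerating all non-cyclic paths and finding a shortest path can be illustrated with a small Python sketch. It is not part of the report: the helper names (`all_simple_paths`, `shortest_path_length`, `complete_graph`) and the use of complete graphs as a worst case are assumptions chosen for illustration only.

```python
from collections import deque


def all_simple_paths(adj, current, goal, path=None):
    """Yield every non-cyclic path from `current` to `goal` (depth-first)."""
    path = [current] if path is None else path
    if current == goal:
        yield list(path)
        return
    for nxt in adj[current]:
        if nxt not in path:  # never revisit a node, so the path stays acyclic
            path.append(nxt)
            yield from all_simple_paths(adj, nxt, goal, path)
            path.pop()


def shortest_path_length(adj, start, goal):
    """Breadth-first search: number of edges on a shortest path, or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def complete_graph(n):
    """Hypothetical worst case: every node is adjacent to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


# Number of non-cyclic paths between two fixed nodes, for 2 to 7 nodes.
counts = [
    sum(1 for _ in all_simple_paths(complete_graph(n), 0, n - 1))
    for n in range(2, 8)
]
print(counts)
print(shortest_path_length(complete_graph(7), 0, 6))
```

In this worst case the number of non-cyclic paths grows faster than exponentially with the number of nodes, whereas the breadth-first search visits each node at most once.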

In this section we discuss a sample session involving a point to point query, and also discuss briefly the relationship to query by navigation and query by construction. The discussed example operates on a conceptual schema for the administration of the election of American presidents. The example schema itself is not shown; the structure of the domain will become clear from the sample session. Note that the quality of the verbalisations of path expressions used in the examples in this section should be improved. However, this is the subject of further research.

In figure 2, a possible screen is depicted for building queries using a point to point query mechanism. The upper window is concerned with the point to point query itself, whereas the lower window contains the complete query under construction. When specifying a point to point query a user specifies a sequence of object types: the points. For each point, the user is offered a listbox containing all object types present in the conceptual schema. The order of the object types in the listbox should preferably be based on some notion of conceptual importance ([CH94]). In figure 3 an existing point to point query path from president to election is extended with another point.

[Figure 2: Building a PPQ query. InfoAssistant screen: the Point to Point Query window (From President to Election, with a listbox of object types such as Age, Administration, Election, Elect. results, Hobby, Person) above the Query by Construction window (constructs AND ALSO, OR ELSE, BUT NOT, INSTANCE; query: politician is president of administration inaugurated in year 1920).]

An important underlying practical issue is whether the selection of the points in a point to point query should be done graphically or textually. The theoretical discussions in the remainder of this article are not influenced by such a choice; but this should be taken into serious consideration when implementing the point to point query mechanism. Although the ideal situation may seem to be a graphical selection mechanism using the conceptual schema itself by clicking on object type ellipses, this may turn out to be impossible due to the limited size of PC screens. Graphical based visual query formulation interfaces ([ADD+ [...] Go! button in the point to point query window. In figure 4, this process is illustrated. The sample PPQ involves three points. Therefore, two paths through the conceptual schema will result. We now shift our attention from the point to point query window to the query by construction window. Note that the small box containing the PPQ abbreviation is now replaced by the paths resulting from the point to point query (i.e. President winning election which resulted in nr of votes). The system initially inserts a most likely path. The user can, however, select alternative paths using a listbox. Note that not all alternative paths between the two points are listed in the listbox. The reason for this is the NP completeness of the path searching problem. To avoid the NP completeness problem, only the best paths are listed initially. However, potentially all paths can be selected (which still remains NP complete) by repeatedly selecting the MORE option. In the remainder of this article we will discuss this in more detail.

Since every path resulting from a query by navigation session connects two points in the conceptual schema, any path through the conceptual schema displayed in the query by construction screen can be used as a starting point for a query by navigation session, and vice versa. This is illustrated in figure 5. In this session, the user has selected the box which contains the two paths politician is president of administration and inaugurated in year for a query by navigation session. The upper window now displays a node in the query by navigation session, with the path politician is president of administration inaugurated in year as its focus. If the user had selected the inaugurated in year listbox, the initial focus would have been administration inaugurated in year.

The query by construction window is basically a syntax directed editor. In the left part of the window all possible constructs from the query language are listed. In our examples we have used the constructs defined in LISA-D. Once the FORML and LISA-D languages have been merged, a more complete language for the query by construction part will result.

For the purpose of finding a path between object types in a conceptual schema, the schema first needs to be translated to a graph. We start out from a formalisation of ORM based on the one used in ([HP95]). However, since only a very limited part of the formalisation is needed, we do not cover the formalisation in full detail.

[Figure 3: Extending the PPQ path. The Point to Point Query window now offers From President via Election to a further point, chosen from a listbox of object types such as Nr of children, Nr of votes, Nr of years, Person, Politician, President.]

A conceptual schema is presumed to consist of a set of types $TP$. Within this set of types two subsets can be distinguished: the relationship types $RL$, and the object types $OB$. Furthermore, let $RO$ be the set of roles in the conceptual schema. The fabric of the conceptual schema is then captured by two functions and two predicates. The set of roles associated to a relationship type is provided by the partition $Roles : RL \to \wp(RO)$. Using this partition, we can define the function $Rel$ which returns for each role the relationship type in which it is involved: $Rel(r) = f \iff r \in Roles(f)$. Every role has an object type at its base called the player of the role, which is provided by the function $Player : RO \to TP$. Subtyping and polymorphism of object types is captured by the predicates $SpecOf \subseteq OB \times OB$ and $HasMorph \subseteq OB \times OB$ respectively. For any ORM conceptual schema the following (undirected) labelled graph $G = \langle N, E \rangle$ can be defined:

$$
\begin{aligned}
N &\triangleq TP && (1)\\
E &\triangleq \{\, \langle \{Player(r), Rel(r)\}, r \rangle \mid r \in RO \,\} && (2)\\
  &\quad \cup\ \{\, \langle \{x, y\}, SpecOf \rangle \mid x \; SpecOf \; y \,\} && (3)\\
  &\quad \cup\ \{\, \langle \{x, y\}, HasMorph \rangle \mid x \; HasMorph \; y \,\} && (4)
\end{aligned}
$$

The edges in the resulting graph have the format $\langle \{x, y\}, l \rangle$, where $x$ and $y$ are the source/destination (no order) of the edge, and $l$ is the label of the edge. The labels on the edges either result from the roles in the relationship types (2), or they result from specialisation and polymorphism (3, 4). In the remainder, the graph $G$ will be used as an implicit parameter for all introduced functions and operations. As a convention, the nodes of graph $G$ are accessed by $G.N$, and the edges by $G.E$.

[Figure 4: Completing a PPQ. The Point to Point Query window lists From President via Election to Nr of votes; the Query by Construction window shows the inserted paths president winning election and which resulted in nr of votes, with a listbox offering president winning election, president participating in the election, and MORE ...]

As an example consider the conceptual schema depicted in figure 6. For this schema we have:

$$
\begin{aligned}
TP &= \{A, B, C, D, f, g\} & Roles(f) &= \{r, s\}, \quad Roles(g) = \{t, u\}\\
RL &= \{f, g\} & Player(r) &= A, \; Player(s) = B, \; Player(t) = C, \; Player(u) = A\\
OB &= \{A, B, C, D\} & A \; &HasMorph \; C, \quad A \; HasMorph \; g\\
RO &= \{r, s, t, u\} & D \; &SpecOf \; B
\end{aligned}
$$

From this schema the graph as depicted in figure 7 can be derived.

For point to point queries paths in the graph need to be found. In this article, a path through the graph is denoted as a sequence of alternating nodes and labels:

$$[x_0, l_1, x_1, \ldots, l_n, x_n]$$

Note that if $x$ is a node, then $[x]$ denotes the path consisting of node $x$ only. In the remainder $+\!+$ is used as the concatenation operation for sequences. Not all alternating sequences of nodes and labels correspond to a proper path. For a path to be a proper one, it must adhere to two properties:

1. The nodes in the path must originate from the graph: $\forall_{0 \le i \le n} \, [\, x_i \in G.N \,]$.
2. The labels in the path must originate from the proper edges in the graph: $\forall_{1 \le i \le n} \, [\, \langle \{x_{i-1}, x_i\}, l_i \rangle \in G.E \,]$.
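The graph construction (1) to (4) and the two properties of a proper path can be sketched in Python using the example schema of figure 6. This is a minimal illustration, not the report's implementation: the dictionary and set encodings, and the names `build_graph` and `is_proper_path`, are assumptions; the schema values are transcribed from the example in the text.

```python
# Example schema of figure 6, as listed in the text.
TYPES = {"A", "B", "C", "D", "f", "g"}            # TP
ROLES = {"f": {"r", "s"}, "g": {"t", "u"}}        # Roles : RL -> P(RO)
PLAYER = {"r": "A", "s": "B", "t": "C", "u": "A"}  # Player : RO -> TP
SPEC_OF = {("D", "B")}                            # D SpecOf B
HAS_MORPH = {("A", "C"), ("A", "g")}              # A HasMorph C, A HasMorph g


def rel(role):
    """Rel(r) = f  iff  r is one of the roles of relationship type f."""
    for rel_type, roles in ROLES.items():
        if role in roles:
            return rel_type
    raise KeyError(role)


def build_graph():
    """Return (N, E); each edge is a pair (frozenset({x, y}), label)."""
    nodes = set(TYPES)                                                # (1)
    edges = {(frozenset({PLAYER[r], rel(r)}), r) for r in PLAYER}     # (2)
    edges |= {(frozenset({x, y}), "SpecOf") for x, y in SPEC_OF}      # (3)
    edges |= {(frozenset({x, y}), "HasMorph") for x, y in HAS_MORPH}  # (4)
    return nodes, edges


def is_proper_path(path, nodes, edges):
    """Check both properties for a path [x0, l1, x1, ..., ln, xn]."""
    if len(path) % 2 == 0:          # must start and end with a node
        return False
    xs, ls = path[0::2], path[1::2]
    if not all(x in nodes for x in xs):
        return False
    return all(
        (frozenset({xs[i], xs[i + 1]}), label) in edges
        for i, label in enumerate(ls)
    )


NODES, EDGES = build_graph()
print(is_proper_path(["B", "s", "f", "r", "A"], NODES, EDGES))
print(is_proper_path(["A", "s", "B"], NODES, EDGES))
```

Using `frozenset` for the end points mirrors the unordered pair $\{x, y\}$ in the edge format, so an edge can be traversed in either direction.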

[Figure 5: Switching to query by navigation. The upper window is now a Query by Navigation window with focus politician is president of administration inaugurated in year and a list of neighbouring path expressions; the lower window is the Query by Construction window.]

In the remainder of this article, $Path$ denotes the set of all valid paths for any graph $G$ resulting from an ORM schema. On such paths the following three functions can be defined: $Begin : Path \to G.N$, $End : Path \to G.N$, and $Length : Path \to \mathbb{N}$, which are identified by:

$$
\begin{aligned}
Begin([x_0, l_1, x_1, \ldots, l_n, x_n]) &\triangleq x_0\\
End([x_0, l_1, x_1, \ldots, l_n, x_n]) &\triangleq x_n\\
Length([x_0, l_1, x_1, \ldots, l_n, x_n]) &\triangleq n
\end{aligned}
$$

Furthermore, the $\in$ and $\notin$ operations can be extended to paths as well, expressing the occurrence of a node on a path:

$$
\begin{aligned}
x \in [x_0, l_1, x_1, \ldots, l_n, x_n] &\iff x \in \{x_0, \ldots, x_n\}\\
x \notin p &\iff \neg (x \in p)
\end{aligned}
$$

A badness level is associated with every path through the conceptual schema, expressing its conceptual irrelevance. The badness is used to order the alternative paths in the listboxes. Badness is defined in terms of a penalty point system where a high penalty point score corresponds to a low conceptual relevance. Two ways of earning penalty points exist: the relative conceptual irrelevance of the object types in the path, and the length of the path.

For the first class of penalty points the existence of a function $CWeight : TP \to \mathbb{N}$ is presumed. This function should capture the conceptual importance of the types in the conceptual schema, which can for instance be derived from the abstraction level at which the type is present ([CH94]). For each object type occurring in a path, the number of penalty points added to the total badness of the path depends on the deviation of its conceptual importance from the maximum conceptual importance in the conceptual schema.

[Figure 6: Example Conceptual Schema]

[Figure 7: Example Graph. Nodes A, B, C, D, f, g; edge labels r, s, t, u, Spec, Poly.]

The second way for a path to earn penalty points is the length of the path. For every object type occurring in the path a basic amount of penalty points is added. In order to maintain uniformity of the penalty points added, this basic amount is set equal to the maximum conceptual importance of object types in the schema.

Finally, sometimes one would like to be able to control the influence of the two ways to earn penalty points in the final outcome. For this purpose, we introduce the (user definable) constant $C_{weight} \in [0, 1)$. This leads to the following definition of the badness of a non-cyclic path in a graph $G$:

$$
\begin{aligned}
&Badness : Path \to \mathbb{N}\\
&Badness(p) \triangleq \sum_{o \in p} \bigl( C_{weight} \times (MaxCWeight - CWeight(o)) + (1 - C_{weight}) \times MaxCWeight \bigr)
\end{aligned}
$$

where $MaxCWeight = \max_{x \in G.N} CWeight(x)$. Note that the $\in$ operation used in the expression $o \in p$ is the above defined inclusion operation for paths through graphs, and that an object type $o$ occurs at most once in $p$, so that the summation over $o \in p$ is correct with respect to the length of the path. An important property of the $Badness$ function is the following:

Lemma 3.1 The function $Badness$ is strictly monotonically increasing, i.e.: if $p +\!+\, q \in Path$ is an acyclic path and $q$ is non-empty, then $Badness(p) < Badness(p +\!+\, q)$.

Proof: Follows directly from the definition of $Badness$ and the observation that the set $\{\, o \mid o \in p \,\}$ is a proper subset of $\{\, o \mid o \in p +\!+\, q \,\}$. $\Box$

The above property allows us to incrementally search for the paths with the lowest badness since the badness of a path never decreases when extending it. Note that one might also want to introduce additional fitness factors. For instance, one could take the correlation of the (verbalisation of the) path to a set of keywords describing the user's interests into consideration. However, the badness should remain a strictly monotonically increasing function.
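The penalty point system can be made concrete with a short Python sketch of the Badness definition. The function name `badness`, the parameter name `c_weight`, and the conceptual weights in `CWEIGHT` are assumptions invented for this illustration; the report does not supply concrete weights.

```python
def badness(path, cweight, c_weight=0.5):
    """Badness of a path [x0, l1, x1, ..., ln, xn].

    `cweight` maps every type of the schema to its conceptual weight
    (CWeight); `c_weight` is the user definable constant C_weight.
    """
    max_cw = max(cweight.values())   # MaxCWeight over all nodes of the graph
    nodes = path[0::2]               # labels do not earn penalty points
    return sum(
        c_weight * (max_cw - cweight[o]) + (1 - c_weight) * max_cw
        for o in nodes
    )


# Hypothetical conceptual weights for the types of the example schema.
CWEIGHT = {"A": 3, "B": 1, "C": 2, "D": 1, "f": 2, "g": 2}

short = ["A"]
longer = ["A", "r", "f"]
longest = ["A", "r", "f", "s", "B"]
print(badness(short, CWEIGHT), badness(longer, CWEIGHT), badness(longest, CWEIGHT))
```

With $C_{weight} < 1$ and a positive maximum weight, every node contributes a strictly positive amount, which is what Lemma 3.1 relies on: extending a path can only raise its badness.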

The Quest

This section is concerned with finding a path through the conceptual schema (graph) between two points (nodes). Although a point to point query typically involves more than two object types, it can always be expressed as a combination of a set of point to point queries over two points. As an example consider figure 4. The newly added point to point query involves three points and is represented as two point to point queries (listboxes) over two points.

In searching paths between two points (nodes) in the graph, an incremental strategy is followed. Two pools of paths are maintained during the entire search: a pool $P$ of paths which could lead to a possible solution and a pool $S$ of found solutions. In every step (increment) of the algorithm these pools are updated. The best (lowest badness) potential solutions in pool $P$ are selected for further extension. By selecting the best paths in $P$ for further extension it can be guaranteed that the first solutions found are the ones with the lowest badness. Within a pool of possible solutions $P$, and in the context of a graph $G$, the set of best candidates is provided by:

$$
\begin{aligned}
&Best : \wp(Path) \to \wp(Path)\\
&Best(P) \triangleq \{\, p \in P \mid Badness(p) = \min_{q \in P} Badness(q) \,\}
\end{aligned}
$$

The first operation we introduce calculates the increment as described above for a pair of pools. It tries to extend the paths in $P$, and updates the set of found solutions in $S$ if new ones have been found.
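The incremental strategy with the two pools can be sketched in Python as follows. This is a hedged illustration rather than the report's implementation: the names `best`, `increment` and `list_paths`, the representation of paths as tuples, the example graph, and the stand-in badness `node_count` (the number of nodes on a path, i.e. $C_{weight} = 0$ and a maximum weight of 1) are all assumptions. `increment` follows the increment operator and `list_paths` the driver operation defined formally in this section.

```python
import math


def node_count(path):
    """Stand-in badness (assumption): number of nodes on the path."""
    return len(path[0::2])


def best(pool, badness):
    """Best(P): the paths in the pool with the lowest badness."""
    if not pool:
        return set()
    lowest = min(badness(p) for p in pool)
    return {p for p in pool if badness(p) == lowest}


def increment(edges, target, pool, solutions, badness=node_count):
    """One increment step; returns the triple (P', S', R')."""
    if not pool:
        return pool, solutions, set(solutions)
    chosen = best(pool, badness)
    extended = set()
    for p in chosen:
        end, visited = p[-1], set(p[0::2])
        for ends, label in edges:
            if end in ends:
                for n in ends - {end}:
                    if n not in visited:          # maintain acyclicity
                        extended.add(p + (label, n))
    new_solutions = solutions | {s for s in extended if s[-1] == target}
    new_pool = (pool | extended) - chosen - new_solutions
    bound = min((badness(q) for q in new_pool), default=math.inf)
    proper = {r for r in new_solutions if badness(r) <= bound}
    return new_pool, new_solutions, proper


def list_paths(edges, target, pool, solutions, shown, badness=node_count):
    """Driver: iterate until new proper solutions appear; None when exhausted."""
    while True:
        new_pool, new_solutions, proper = increment(
            edges, target, pool, solutions, badness)
        if proper != shown:
            return new_pool, new_solutions, proper
        if not pool:
            return None
        pool, solutions = new_pool, new_solutions


# Hypothetical graph with a short and a long route from F to T.
EDGES = {
    (frozenset({"F", "X"}), "a"), (frozenset({"X", "T"}), "b"),
    (frozenset({"F", "Y"}), "c"), (frozenset({"Y", "Z"}), "d"),
    (frozenset({"Z", "T"}), "e"),
}
first = list_paths(EDGES, "T", {("F",)}, set(), set())
more = list_paths(EDGES, "T", *first)      # the user selects MORE
done = list_paths(EDGES, "T", *more)       # nothing left to list
print(first[2], more[2], done)
```

In this example the first call lists only the shorter route, the second call (MORE) adds the longer one, and a third call reports that the pool is exhausted.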
For any start node $f$ and end node $t$, we define the increment operator as:

$$
\begin{aligned}
&Increment_{f,t} : \wp(Path) \times \wp(Path) \to \wp(Path) \times \wp(Path) \times \wp(Path)\\
&Increment_{f,t}(P, S) \triangleq
\begin{cases}
\langle P', S', R' \rangle & \text{if } P \neq \emptyset\\
\langle P, S, S \rangle & \text{otherwise}
\end{cases}
\end{aligned}
$$

where:

$$
\begin{aligned}
N &= \{\, p +\!+\, [l, n] \mid p \in Best(P) \wedge \langle \{End(p), n\}, l \rangle \in G.E \wedge n \notin p \,\}\\
S' &= S \cup \{\, s \in N \mid End(s) = t \,\}\\
P' &= (P \cup N) - Best(P) - S'\\
R' &= \{\, r \in S' \mid Badness(r) \le \min_{q \in P'} Badness(q) \,\}
\end{aligned}
$$

For defining the set of (best) extended paths $N$, all best paths ($p \in Best(P)$) in the existing pool of possible solutions are extended with an appropriate edge from the graph ($\langle \{End(p), n\}, l \rangle \in G.E$) while maintaining acyclicity ($n \notin p$). The new set of solutions ($S'$) is simply the old set of solutions extended with the solutions found after extending the best paths. In the new pool of possible solutions ($P'$) the newly found solutions are removed since they should not be extended any further. Although a path in $S'$ has the proper begin and end point, it is not considered to be a proper solution until its badness is no higher than that of the paths in the pool of potential solutions $P'$. The set of proper solutions is returned in $R'$.

For the increment operation we have the following property:

Lemma 4.1 If $Increment_{f,t}(P, S) = \langle P', S', R' \rangle$ and $Increment_{f,t}(P', S') = \langle P'', S'', R'' \rangle$, then:

$$R' \subseteq R'' \;\wedge\; \forall_{r \in R'' - R'} \bigl[\, Badness(r) > \max_{q \in R'} Badness(q) \,\bigr]$$

Proof:

We first prove that $R' \subseteq R''$. If $r \in R'$, then since $R' \subseteq S' \subseteq S''$ we also have $r \in S''$. Furthermore, from the definition of $R'$ follows: $\mathit{Badness}(r) \leq \min_{q \in P'} \mathit{Badness}(q)$. Due to the monotonic behaviour of the $\mathit{Badness}$ function, we immediately have $\mathit{Badness}(r) \leq \min_{q \in P''} \mathit{Badness}(q)$, since $P''$ contains the extended paths. From the definition of $R''$ it then follows that $r \in R''$.

Now we prove that $\forall_{r \in R'' - R'} \left[ \mathit{Badness}(r) > \max_{q \in R'} \mathit{Badness}(q) \right]$. If $r \in R'' - R'$, then $r \notin R'$. From this and the definition of $R'$ follows that $r \notin S'$ or $\mathit{Badness}(r) > \min_{q \in P'} \mathit{Badness}(q)$. So we have:

1. Let $r \notin S'$. Since $r \in R'' - R'$, we know that $r$ is a newly found solution in $R''$. So there is a $p \in \mathit{Best}(P')$ and an $e \in G.E$ such that $p \mathbin{+\!\!+} [e] = r$. From the monotonicity of $\mathit{Badness}$, it then follows that $\mathit{Badness}(p) < \mathit{Badness}(r)$. If $x \in R'$, then from the definition of $R'$ follows that $\mathit{Badness}(x) \leq \min_{q \in P'} \mathit{Badness}(q)$. Since we just concluded that $\mathit{Badness}(p) < \mathit{Badness}(r)$ for a certain $p \in P'$, we at least have $\min_{q \in P'} \mathit{Badness}(q) < \mathit{Badness}(r)$. Combining both, $\mathit{Badness}(x) < \mathit{Badness}(r)$ for every $x \in R'$, so in particular: $\mathit{Badness}(r) > \max_{q \in R'} \mathit{Badness}(q)$.

2. Let $\mathit{Badness}(r) > \min_{q \in P'} \mathit{Badness}(q)$. From the definition of $R'$ follows that if $x \in R'$ then $\mathit{Badness}(x) \leq \min_{q \in P'} \mathit{Badness}(q)$, which means that $\mathit{Badness}(x) < \mathit{Badness}(r)$. From this finally follows: $\mathit{Badness}(r) > \max_{q \in R'} \mathit{Badness}(q)$. $\Box$

This property implies that the result (the $R$) of a point to point query is returned in monotonically increasing steps. This means that when presenting the results to the user, the list box can be filled in incremental steps by repeatedly selecting the MORE option.

The increment operation, as such, cannot yet be used to calculate the best solutions which are presented in the listboxes as depicted in figure 4. For this latter purpose the $\mathit{List}_{f,t}$ operation is introduced, which serves as a ‘driver’ function for the entire process.

\[
\mathit{List}_{f,t} : \wp(\mathit{Path}) \times \wp(\mathit{Path}) \times \wp(\mathit{Path}) \rightarrow \wp(\mathit{Path}) \times \wp(\mathit{Path}) \times \wp(\mathit{Path})
\]
\[
\mathit{List}_{f,t}(P, S, R) \triangleq
\begin{cases}
\langle P', S', R' \rangle & \text{if } R' \neq R \\
\mathit{List}_{f,t}(P', S', R') & \text{else if } P \neq \emptyset \\
\bot & \text{otherwise}
\end{cases}
\]

where $\langle P', S', R' \rangle = \mathit{Increment}_{f,t}(P, S)$. This function calls the increment operation until it has come up with some new solutions ($R' \neq R$) or the pool of potential solutions has been exhausted ($P = \emptyset$). To provide the initial filling for the listboxes of a point to point query from $f$ to $t$, this function should be applied as:

\[
\langle P, S, R \rangle = \mathit{List}_{f,t}(\{[f]\}, \emptyset, \emptyset)
\]

Now $R$ contains the set of found paths to be listed in the listbox. If users desire to see more options they can select the MORE option (see figure 4). This results in another call of the $\mathit{List}_{f,t}$ function using $P$, $S$, $R$ as parameters.

Finally, the paths resulting from the search through the graph need to be translated into linear path expressions. For more details on linear path expressions please refer to [HPW93]. In a later stage, however, the current definition of the linear path expressions as provided in [HPW93] needs to be changed to better match our requirements. Every (proper) path through the graph can be translated into a linear path expression by the following recursive function:

\[
\mathit{PathExpr} : \mathit{Path} \rightarrow \mathit{PathExpressions}
\]
\[
\mathit{PathExpr}([x_0, l_1, x_1, \ldots, l_n, x_n]) \triangleq x_0 \; \mathit{Connector}(l_1, x_1) \; x_1 \; \ldots \; \mathit{Connector}(l_n, x_n) \; x_n
\]

where

\[
\mathit{Connector}(l, x) \triangleq
\begin{cases}
\circ & \text{if } l \in \{\mathit{HasMorph}, \mathit{SpecOf}\} \\
\circ\; l \;\circ & \text{if } x \in \mathit{RL} \land l \in \mathit{Roles}(x) \\
\circ\; l^{\leftarrow} \circ & \text{otherwise}
\end{cases}
\]

Note that when $n = 0$, we have: $\mathit{PathExpr}([x_0]) = x_0$.

The linear path expressions are for internal use only. They can be mapped to proper SQL queries on the one hand, and verbalised as semi-natural language sentences using the verbalisation information as provided in the conceptual schema on the other hand. As stated before, the verbalisation of path expressions is the subject of further research.

When humans look at a conceptual schema to find a path between two object types, they are usually able to identify parts of the schema which can safely be ignored when searching for the actual path. In figure 8 such a situation is illustrated. In this schema, $F$ is the starting point of the point to point query and $T$ the end point. The three clouds represent subschemas. It is clear that subschema III can be safely ignored when searching for a path from $F$ to $T$, since the only way in or out of subschema III is through object type $B$. If a path were to enter subschema III through $B$ (continuing via either fact type $f$ or $g$), the path would not be able to leave the subschema without creating a cycle.
Such situations are not rare in real-life applications. For instance, [Hal95] contains quite a number of schemes of real-life applications with a similar pattern.

In this section a strategy is developed that allows us to reduce the graph associated with a conceptual schema before actually commencing a search. A possible way to approach this is to define a pruning algorithm that repeatedly cuts off irrelevant leaves from the graph. However, in the situation sketched in figure 8, subschema III contains a cycle, making it impossible for such a simple pruning algorithm to remove the entire subschema III. In this section we therefore develop a strategy for the removal of parts of the graph that may contain cycles. First a clustering algorithm is developed. After this, we use the simple pruning algorithm to remove irrelevant leaves, resulting in the removal of irrelevant subschemas even when they contain cycles (for instance subschema III).

The first step in our approach is the clustering itself. Clustering can be done by a pre-compilation of the conceptual schema. This pre-compilation should be done after the conceptual schema has been finished, but before the users start formulating queries. Although a conceptual schema is expected to evolve in the course of time ([FOP92], [Pro94a], [OPF94]), the pre-compilation we propose here will not be costly to redo after each evolution step. For small schemes one might even consider combining the pre-compilation with the search through the conceptual schema itself.

Formally, a clustering of a graph can be modelled as a function $C : \mathbb{N} \rightarrowtail \wp(\mathit{Node})$. An existing cluster $i$ within an existing clustering $C$ can be extended with nodes $E$, using the $C \oplus_i E$ operation. This operation is defined by:

\[
(C \oplus_i E)(j) \triangleq \textbf{if } i = j \textbf{ then } C(j) \cup E \textbf{ else } C(j) \textbf{ fi} \quad \text{for each } j \in \mathbb{N}
\]

In the returned clustering the existing cluster $C(i)$ is grown to $C(i) \cup E$.

A clustered graph can in itself be seen as a graph. The nodes are the clusters (effectively subgraphs of the original graph), and the edges can be derived from the original graph by having an edge between nodes (clusters) in the hypergraph if nodes in the clusters are connected in the original graph. This corresponds to the notion of a hypergraph since the clusters are treated as nodes. Obviously, the hypergraph can also be clustered, leading to yet another hypergraph. In the algorithms discussed here, we will repeatedly make use of the hypergraph notion. The idea is to use the hypergraphs to repeatedly simplify the graph, allowing us to identify irrelevant subschemas (which will correspond to nodes in one of the hypergraphs).

Figure 8: Connected subschemas

An important concept when clustering is the degree of a node. Let $G.E$ be the set of edges in a graph, then we define the degree of a node $n$ within the context of a set of nodes $N$ to be the number of nodes in $N$ that can be reached from $n$ by an edge in $G.E$. The word reached should be interpreted here as either directly connected, or one of the nodes contained in an involved cluster is connected. This notion of degree can be captured formally as:

\[
\mathit{Deg} : \wp(\mathit{Node}) \times \mathit{Node} \rightarrow \mathbb{N}
\]
\[
\mathit{Deg}(N, n) \triangleq \bigl|\{\, m \in N - \{n\} \mid m \leftrightsquigarrow n \,\}\bigr|
\]

The principle of nodes (which could actually be clusters) being reachable from other nodes is represented by the $\leftrightsquigarrow$ operator:

\[
n \leftrightsquigarrow m \iff \exists_{x,y} \left[ \{x, y\} \in \pi_1(G.E) \land x \prec n \land y \prec m \right]
\]

The expression $x \prec y$ captures the intuition of a node $x$ being equal to node $y$ or node $x$ being contained in the cluster $y$ (note that we will later on use hyper clustering, resulting in nodes which contain other nodes). The $\prec$ operator is therefore defined by the following recursive rule:

\[
n \prec m \iff n = m \lor \exists_{m' \in m} \left[ n \prec m' \right]
\]

From the above defined notion of degree we derive the so-called normalised degree of a node. The idea behind the normalised degree is that the leaves of a graph (nodes with $\mathit{Deg}(N, n) = 1$) can safely be ignored by the clustering algorithm as they will never lead to a cycle. For any node with a higher degree, a closer investigation is required, i.e. the clustering algorithm needs to be applied. Informally, the normalised degree is the number of neighbouring nodes with a degree higher than 1. The formal definition of the normalised degree is:

\[
\mathit{NDeg} : \wp(\mathit{Node}) \times \mathit{Node} \rightarrow \mathbb{N}
\]
\[
\mathit{NDeg}(N, n) \triangleq \bigl|\{\, m \in N - \{n\} \mid \mathit{Deg}(N, m) > 1 \land m \leftrightsquigarrow n \,\}\bigr|
\]

Figure 9: Normalised degrees of nodes.

As an example, consider figure 9. There a graph is depicted where each node is labelled with the $\mathit{NDeg}$ of the node.

We now have enough primitives to define the clustering algorithm itself. The clustering needs to be done in such a way that the clusters themselves contain no branches (nodes $n$ with $\mathit{NDeg}(n) > 2$). For clustering three functions are introduced. The first function is simply used to provide a nice interface to the other two clustering functions:

\[
\mathit{Cluster} : \wp(\mathit{Node}) \rightarrow (\mathbb{N} \rightarrowtail \wp(\mathit{Node}))
\]
\[
\mathit{Cluster}(N) \triangleq \mathit{DoCluster}(\langle N, C \rangle, 1)
\]

where $C$ is the initial clustering defined as $C(i) = \emptyset$ for $1 \leq i \leq |N|$. To cluster a graph $G$, this function should be invoked as: $\mathit{Cluster}(G.N)$.

The second clustering function ($\mathit{DoCluster}$) is the ‘driver’ function of the clustering algorithm, and the third function ($\mathit{Propagate}$) the ‘propagation’ function. The driver function takes three parameters. The first parameter represents the set of nodes that have not been placed in any cluster yet. The second parameter is the current clustering, and the last parameter is the number of the cluster that is currently being formed. This function selects, if there is any node left to be clustered, a node with the minimal normalised degree and forms a cluster for this node (using the specified cluster number). Each time a new cluster is formed, the ‘propagation’ function $\mathit{Propagate}$ is called, which will then try to extend the newly formed cluster. The driver function is defined by:

\[
\mathit{DoCluster} : (\wp(\mathit{Node}) \times (\mathbb{N} \rightarrowtail \wp(\mathit{Node}))) \times \mathbb{N} \rightarrow (\mathbb{N} \rightarrowtail \wp(\mathit{Node}))
\]
\[
\mathit{DoCluster}(\langle N, C \rangle, i) \triangleq
\begin{cases}
C & \text{if } N = \emptyset \\
\mathit{DoCluster}(\mathit{Propagate}(N - \{n\}, C \oplus_i \{n\}, i), i + 1) & \text{otherwise}
\end{cases}
\]

where $n \in N$ such that $\mathit{NDeg}(N, n) \leq \min_{m \in N} \mathit{NDeg}(N, m)$.

The propagation function tries to extend the size of the cluster. It does so by trying to find unclustered nodes which have a normalised degree that is less than or equal to two, and are connected to a node which is already contained in the new cluster. Note that in the determination of the normalised degree for the unclustered nodes, the nodes that are already contained in the new cluster are still treated as unclustered nodes. The clause ‘less than or equal to two’ is absolutely essential to maintain the idea of a cluster not (directly) containing any branches. Doing this allows us to safely prune the hypergraph, i.e. to cut away irrelevant parts. Furthermore, any simple leaves ($\mathit{Deg}(N, n) = 1$) connected to a node in the new cluster become part of the cluster as well. The definition of the $\mathit{Propagate}$ function is now provided by:

\[
\mathit{Propagate} : \wp(\mathit{Node}) \times (\mathbb{N} \rightarrowtail \wp(\mathit{Node})) \times \mathbb{N} \rightarrow \wp(\mathit{Node}) \times (\mathbb{N} \rightarrowtail \wp(\mathit{Node}))
\]
\[
\mathit{Propagate}(N, C, i) \triangleq
\begin{cases}
\mathit{Propagate}(N - I, C \oplus_i I, i) & \text{if } I \neq \emptyset \\
\langle N, C \rangle & \text{otherwise}
\end{cases}
\]

where

\[
\begin{aligned}
I = {} & \{\, n \in N - C(i) \mid \exists_{n' \in C(i)} \left[ \mathit{NDeg}(N \cup C(i), n') \leq 2 \land n \leftrightsquigarrow n' \right] \,\} \\
{} \cup {} & \{\, n \in N - C(i) \mid \exists_{n' \in C(i)} \left[ \mathit{Deg}(N \cup C(i), n') = 1 \land n \leftrightsquigarrow n' \right] \,\}
\end{aligned}
\]

Figure 10: Example Clustering

As an example consider the graph depicted in figure 10. Each node is labelled with the number of the cluster it is part of. The arrows indicate the start node of each cluster.

Cluster 1: At the start of the algorithm four nodes have an $\mathit{NDeg}$ of 1: the two rightmost nodes of cluster 1, and the top and bottom nodes of cluster 3. We arbitrarily choose the first node of cluster 1 as our starting point. The second node of cluster 1 is added because it has a neighbour (in cluster 1) with an $\mathit{NDeg}$ of 1. The third node (the leftmost one) of cluster 1 is then added to cluster 1 as it also has a neighbour (in cluster 1) with an $\mathit{NDeg}$ of 1 (the middle node of cluster 1). After adding the third node to cluster 1, no more nodes can be added since the third node has an $\mathit{NDeg}$ of 4. Now that cluster 1 cannot be extended any further, a new cluster is formed.

Cluster 2: At this moment, again four nodes have an $\mathit{NDeg}$ of 1: the two rightmost nodes of cluster 2, and the top and bottom nodes of cluster 3 (remember, the nodes from cluster 1 have now been ‘removed’ from the $\mathit{NDeg}$ count). Again we arbitrarily select the rightmost node of cluster 2 as a starting point. The three other nodes of cluster 2 are then added consecutively from right to left since they each have neighbours (in cluster 2) with an $\mathit{NDeg}$ less than or equal to 2. The last node added to cluster 2 has an $\mathit{NDeg}$ of 4 and therefore no further nodes can be added.

Cluster 3: After the completion of cluster 2, once again four nodes with an $\mathit{NDeg}$ of 1 remain. They are the four rightmost nodes of cluster 3. Note that the middle node of cluster 3 has an $\mathit{NDeg}$ of 1 as well since it has only one neighbour with a degree higher than 1. One arbitrary node is selected, and the other three nodes of cluster 3 are added consecutively.

Clusters 4, 5 and 6: All nodes that remain are the nodes of clusters 4, 5 and 6. All three clusters are simple cycles of three or six nodes. After selecting one node on each of the cycles, the remaining nodes on the cycles are added to the clusters.

For the clustering algorithm we can prove some useful properties. Firstly, the clustering algorithm leads to a partition of the nodes in the graph:

Lemma 5.1 For every graph $G$ the function $\mathit{Cluster}(G.N)$ results in a partition of the nodes in graph $G$.

Proof: This follows immediately from the following observations:

1. The clustering algorithm only stops when all nodes are clustered (the $N = \emptyset$ clause in the definition of $\mathit{DoCluster}$).

2. The clustering algorithm removes every clustered node from the ‘to do’ set (the $N - \{n\}$ and $N - I$ clauses in the definitions of $\mathit{DoCluster}$ and $\mathit{Propagate}$ respectively). $\Box$

The clusters do not contain nodes with a normalised degree that is higher than 2.

Lemma 5.2 If $C = \mathit{Cluster}(G.N)$, then for all $i \in \mathit{Dom}(C)$: $\forall_{n \in C(i)} \left[ \mathit{NDeg}(C(i), n) \leq 2 \right]$

Proof: Follows directly from the definition of $I$ in the definition of $\mathit{Propagate}$. $\Box$

The consequence of this lemma is, as stated before, that there are no real decision points (branches) within one cluster. A node may have more than one leaf neighbour, but will not have more than two non-leaf neighbours. The result of this is that we can safely treat the clusters of one (hyper) graph as nodes on the next hypergraph, and remove them if they are found irrelevant for the point to point query. The different levels of hypergraphs are built using the hyper cluster function, which continues clustering until a graph results that does not contain any cycles:

\[
\mathit{HCluster} : \wp(\mathit{Node}) \times \mathbb{N} \rightarrow \wp(\mathit{Node}) \times \mathbb{N}
\]
\[
\mathit{HCluster}(N, i) \triangleq
\begin{cases}
\langle N, i \rangle & \text{if } |E| = |N| - 1 \\
\mathit{HCluster}(N', i + 1) & \text{otherwise}
\end{cases}
\]

where

\[
\begin{aligned}
N' &= \mathit{ran}(\mathit{Cluster}(N)) \\
E  &= \{\, \{x, y\} \subseteq N \mid x \neq y \land x \leftrightsquigarrow y \,\}
\end{aligned}
\]

Note that $\mathit{ran}(f)$ returns the range of function $f$. Note furthermore that a connected graph $G$ is acyclic exactly when $|G.E| = |G.N| - 1$. The initial call of the hyper cluster function is $\mathit{HCluster}(G.N, 0)$.

Figure 11: First Level Hypergraph

In figure 11 the hypergraph that can be associated to the clustering in figure 10 is depicted, together with a clustering of the hypergraph. Based on the clustering of this hypergraph, a second level hypergraph as depicted in figure 12 can be derived. As this graph is acyclic, the $\mathit{HCluster}$ function terminates after the completion of this clustering.

Figure 12: Second Level Hypergraph
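The clustering and hyper clustering functions can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used for the report: the graph is a plain symmetric adjacency dictionary (edge labels are dropped), each hypergraph level is flattened to integer cluster ids instead of nested node sets, ties between seed candidates are broken by sorted order, and a cluster grows either through a member whose normalised degree is at most 2 or by absorbing a simple leaf. The names `deg`, `ndeg`, `cluster`, `quotient`, `is_tree` and `hcluster`, as well as the no-reduction guard, are illustrative and do not appear in the original.

```python
def deg(adj, scope, n):
    """Deg(N, n): number of nodes in `scope` (other than n) adjacent to n."""
    return sum(1 for m in adj[n] if m in scope and m != n)


def ndeg(adj, scope, n):
    """NDeg(N, n): neighbours of n in `scope` whose own degree exceeds 1."""
    return sum(1 for m in adj[n]
               if m in scope and m != n and deg(adj, scope, m) > 1)


def cluster(adj):
    """One clustering pass; returns a list of disjoint node sets."""
    remaining = set(adj)
    clusters = []
    while remaining:
        # Seed: an unclustered node with minimal normalised degree
        # (assumption: ties are broken by sorted order).
        seed = min(sorted(remaining), key=lambda v: ndeg(adj, remaining, v))
        current = {seed}
        remaining.discard(seed)
        while True:
            # Nodes already in the new cluster still count for the degrees.
            scope = remaining | current
            grow = set()
            for n in remaining:
                members = adj[n] & current
                if not members:
                    continue
                # Join via a cluster member that is not a branch point,
                # or because n itself is a simple leaf.
                if (any(ndeg(adj, scope, c) <= 2 for c in members)
                        or deg(adj, scope, n) == 1):
                    grow.add(n)
            if not grow:
                break
            current |= grow
            remaining -= grow
        clusters.append(current)
    return clusters


def quotient(adj, clusters):
    """Hypergraph: one node per cluster, edges inherited from `adj`."""
    owner = {n: i for i, c in enumerate(clusters) for n in c}
    hyper = {i: set() for i in range(len(clusters))}
    for n, neighbours in adj.items():
        for m in neighbours:
            if owner[n] != owner[m]:
                hyper[owner[n]].add(owner[m])
    return hyper


def is_tree(adj):
    """A connected graph is acyclic exactly when |E| = |N| - 1."""
    edges = sum(len(v) for v in adj.values()) // 2
    return edges == len(adj) - 1


def hcluster(adj):
    """Cluster repeatedly until the (hyper)graph is acyclic; returns all levels."""
    levels = [adj]
    while not is_tree(levels[-1]):
        clusters = cluster(levels[-1])
        if len(clusters) == len(levels[-1]):  # guard (assumption): no reduction
            break
        levels.append(quotient(levels[-1], clusters))
    return levels
```

For a triangle `a`, `b`, `c` with a tail `c`, `d`, `e`, a single pass of this sketch groups the tail with its branch point and the rest of the triangle separately, after which the two-node hypergraph is acyclic and `hcluster` stops. Flattening the levels keeps the sketch short; keeping the nested node sets instead would be needed to map a hypergraph node back to the original object types.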

An important aspect of the clustering algorithm is the complexity of both the storage of the clusters and the algorithm itself. One call of the $\mathit{Cluster}$ function is clearly linear in terms of the total number of nodes in the graph: $\Omega(|G.N|)$. The calculation of the complexity of the $\mathit{HCluster}$ function, however, is a bit more complicated. The $\mathit{HCluster}$ function repeatedly tries to reduce the number of nodes in the (hyper) graph by calling the $\mathit{Cluster}$ function. To get a grip on the complexity of the $\mathit{HCluster}$ function we thus need to analyse the expected number of times that the $\mathit{Cluster}$ function will be called, i.e. how many levels of hypergraphs we expect to have.

Figure 13: Alternative situations for remaining nodes

A first observation we make is that we are always dealing with a connected graph, since a conceptual schema is a connected graph. We now prove that performing $\mathit{Cluster}$ on a connected (hyper) graph with $n > 2$ nodes always leads to a connected hypergraph with at most $n - 2$ nodes.

Lemma 5.3 $C = \mathit{Cluster}(N) \Rightarrow |\mathit{Dom}(C)| \leq |N| - 2$, i.e. applying $\mathit{Cluster}$ leads to a reduction in the number of nodes of at least 2.

Proof: If $|N| = 3$, the $\mathit{Cluster}(N)$ graph only contains 1 node, implying a reduction of exactly two nodes. This follows directly from the fact that in a graph with three nodes $\mathit{NDeg}$ has a maximum value of 2.

Let $|N| > 3$. The $\mathit{Cluster}$ function does not terminate until all nodes have been clustered. Now let us consider the three nodes in $N$ that are selected by $\mathit{Cluster}$ to be clustered last. In figure 13, the four possible ways in which these three nodes can be connected are depicted. In the first two cases, the three nodes would be clustered together, thus leading to a reduction of at least two nodes. The two remaining cases require a closer examination:

1. In case three, the remaining three nodes obviously result in two clusters, leading to a reduction of only one node. However, we can prove that there must exist another cluster with at least two elements, thus implying that the total reduction size is still at least two. As the original graph is a connected one, node $n$ must have been connected to some node(s) which are already clustered. Let $M$ be the set of node(s) connected to node $n$ that have been clustered last. So the nodes in $M$ are all part of the same cluster, and have the highest cluster number of $n$'s neighbours. If $M$ contains more than one node, this consequently means that there is at least one other cluster with at least two nodes, thus leading to a reduction of at least two nodes. If $M$ contains only one node, say $m$, this implies that at the moment that the cluster containing node $m$ was formed, node $n$ was only connected to $m$. So node $n$ is a leaf node of the graph at that moment with an $\mathit{NDeg}$ of 1. This means that node $n$ should have been clustered in the same cluster as $m$, which implies that $M$ contains at least two nodes.

2. In case four, the remaining three nodes lead to three separate clusters. However, we will prove that there are enough clusters with more than one element to ensure a reduction of 2 nodes. Let $M_1$ be the set of node(s) connected to $n_1$ that were clustered last, and similarly $M_2$ the set of node(s) connected to $n_2$ that were clustered last. As the original graph was a connected graph, such nodes must exist. If $M_1$ or $M_2$ contains only one node, then they should have been clustered already (same argument as above). So $M_1$ and $M_2$ both contain at least two nodes. This is not yet enough, since $M_1$ and $M_2$ may overlap. However, for $i \in \{1, 2\}$ we have the following: if $M_i$ has two elements, $n_i$ must have an $\mathit{NDeg}$ of at most 2. This means that one of the two nodes in $M_j$ must have an $\mathit{NDeg}$ of at most 2, since the clustering always starts with a node with the minimal $\mathit{NDeg}$. This in turn means that $n_i$ is connected to a node with an $\mathit{NDeg}$ that is less than or equal to two, and should therefore have been in the same cluster as $M_i$. As a result, $M_1$ and $M_2$ contain at least three nodes. So even if they overlap the reduction of two nodes is still guaranteed. $\Box$

The connectedness of a hypergraph after applying $\mathit{Cluster}$ to a connected graph follows directly from the way in which the edges are derived from the original graph. The above lemma allows us to identify the (execution) complexity of the $\mathit{HCluster}$ algorithm. Since the number of nodes in the graph decreases by at least two in every clustering step of the $\mathit{HCluster}$ algorithm, the number of calls of $\mathit{HCluster}$ to $\mathit{Cluster}$ is limited to $\lceil |G.N| / 2 \rceil$. Therefore, the total complexity of the $\mathit{HCluster}$ function is: $\Omega(|G.N|^2)$. However, since most conceptual schemes result in a sparse graph (not containing many edges), the results are likely to be better for most schemes.

Another important issue is the complexity of the memory used. Every clustering requires the storage of the nodes in the cluster. Let $n = |G.N|$ and $k = \lfloor n/2 \rfloor$, then we have the following worst case with respect to the number of nodes that need to be stored:

\[
\begin{aligned}
\sum_{i=0}^{k-1} (n - 2i) + 1 &= \sum_{i=1}^{k} (n + 2 - 2i) + 1 \\
&= nk + 2k + 1 - 2 \sum_{i=1}^{k} i \\
&= nk + 2k + 1 - 2\,\frac{k(k+1)}{2} \\
&= nk + 2k + 1 - k(k+1) \\
&= nk - k^2 + k + 1
\end{aligned}
\]

Since $k = \lfloor n/2 \rfloor$, this amounts to at most $\frac{n^2}{4} + \frac{n}{2} + 1$ nodes.

As an example of a worst case graph, consider figure 14. This figure depicts the original graph, and the three associated hypergraphs after subsequent clustering steps. The original graph contains 7 nodes, and it takes 3 steps to reduce it. The total number of nodes that needs to be stored is 16.
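The worst-case storage count can be checked numerically. The sketch below assumes, as in the worst case described here, that every clustering step removes exactly two nodes, so the stored levels have sizes $n, n-2, \ldots$ for $k = \lfloor n/2 \rfloor$ levels plus one final node; it compares that sum with the closed form $nk - k^2 + k + 1$. The function names `level_sizes` and `closed_form` are illustrative.

```python
def level_sizes(n):
    """Sizes of the stored (hyper)graphs when each step removes two nodes."""
    k = n // 2
    return [n - 2 * i for i in range(k)] + [1]


def closed_form(n):
    """Closed form n*k - k^2 + k + 1 with k = floor(n / 2)."""
    k = n // 2
    return n * k - k * k + k + 1


# The seven-node example: levels of 7, 5, 3 and 1 nodes.
print(level_sizes(7), sum(level_sizes(7)), closed_form(7))
```

For the seven-node example the levels are 7, 5, 3 and 1, which gives the 16 stored nodes mentioned in the text.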

Figure 14: Worst Case Clustering

Using the hypergraphs, we can now reduce the size of the graph prior to searching the paths by means of the algorithm defined in section 4. In general, a (hyper)graph is reduced by:

\[
\mathit{ReduceHG}_{f,t} : \wp(\mathit{Node}) \rightarrow \wp(\mathit{Node})
\]
\[
\mathit{ReduceHG}_{f,t}(N) \triangleq
\begin{cases}
N & \text{if } N = N' \\
\mathit{ReduceHG}_{f,t}(N') & \text{otherwise}
\end{cases}
\]

where

\[
N' = \{\, n \in N \mid \mathit{SDeg}(N, n) > 1 \lor t \prec n \lor f \prec n \,\}
\]

Note that, as mentioned before, the edges of the (hyper) graphs used in the above algorithm are derived from the original graph $G$. The surrounding degree ($\mathit{SDeg}$) of a node $n$ in the (hyper) graph is the number of nodes in the original graph $G$ that are reachable from $n$, and that are not contained in (or equal to) the current node $n$. These nodes are the surroundings of node $n$. The formal definition therefore is:

\[
\mathit{SDeg} : \wp(\mathit{Node}) \times \mathit{Node} \rightarrow \mathbb{N}
\]
\[
\mathit{SDeg}(N, n) \triangleq \bigl|\{\, y \in G.N \mid \exists_{m \in N - \{n\}} \left[ n \leftrightsquigarrow m \land y \prec m \right] \,\}\bigr|
\]

Figure 15: Example Point to Point Query

The reduction function $\mathit{ReduceHG}_{f,t}$ simply removes all nodes that contain neither the start nor the end of the point to point query and have a surrounding of only one node (e.g. subschema III in figure 8). The completely reduced graph is calculated by the following ‘driver’ function:

\[
\mathit{Reduce}_{f,t} : \wp(\mathit{Node}) \times \mathbb{N} \rightarrow \wp(\mathit{Edge}) \times \wp(\mathit{Node})
\]
\[
\mathit{Reduce}_{f,t}(N, n) \triangleq
\begin{cases}
\langle E', N' \rangle & \text{if } N = N' \\
\mathit{Reduce}_{f,t}(\textstyle\bigcup N', n - 1) & \text{otherwise}
\end{cases}
\]

where $N' = \mathit{ReduceHG}_{f,t}(N)$ and $E' = \{\, \langle P, l \rangle \in E \mid P \subseteq N \,\}$.

If $\mathit{HCluster}(G.N, 0) = \langle N', n \rangle$ for a certain graph $G$, then the search should be performed in the reduced graph: $\langle E'', N'' \rangle = \mathit{Reduce}_{f,t}(N', n)$. The interesting question now is when $\mathit{HCluster}(G.N, 0)$ should be calculated. Since this latter call does not depend on the source and destination of a specific point to point query, it could be calculated once after the completion of the conceptual schema (from which graph $G$ is derived). Alternatively, one could calculate the hyper clustering each time a point to point query needs to be completed, which is likely to be very costly.

Figure 16: Point to Point Query on the Second Level Hyper Graph

Figure 17: Point to Point Query on the First Level Hyper Graph

As an example reduction, consider the point to point query denoted in figure 15. The reduction algorithm starts with the reduction of the top level hypergraph as depicted in figure 16. Obviously, this graph cannot be reduced in any way. The next hypergraph that is considered by the reduction algorithm is shown in figure 17. Cluster 3 (containing the large cycle) is connected to only one other node and does not contain the start or end of the point to point query. Therefore, it can safely be removed from the graph. Finally, the original search graph can be reduced; the resulting graph is shown in figure 18. In this graph, the two leaf nodes from cluster 4 can be removed, as well as the two right nodes of cluster 1. Note that the nodes from cluster 3 have already been removed as the entire cluster was already removed in the previous step of the $\mathit{Reduce}$ function.
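The last reduction step, followed by the search itself, can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the report's algorithm: it works on the base graph only (so the surrounding degree of a node is just its number of remaining neighbours and no cluster hierarchy is consulted), the graph is a symmetric adjacency dictionary without edge labels, and path length stands in for the $\mathit{Badness}$ function. The search is a plain best-first enumeration that, like the increment operator, never extends a path that has reached $t$ and delivers solutions in order of non-decreasing badness. The names `reduce_graph` and `paths_by_badness` are illustrative.

```python
import heapq


def reduce_graph(adj, f, t):
    """Repeatedly drop nodes with at most one remaining neighbour, keeping f and t."""
    nodes = set(adj)
    while True:
        kept = {n for n in nodes if n in (f, t) or len(adj[n] & nodes) > 1}
        if kept == nodes:
            # Fixed point reached: restrict the adjacency to the kept nodes.
            return {n: adj[n] & nodes for n in nodes}
        nodes = kept


def paths_by_badness(adj, f, t):
    """Yield acyclic paths from f to t, shortest first (length as badness)."""
    heap = [(0, [f])]
    while heap:
        badness, path = heapq.heappop(heap)
        if path[-1] == t:
            yield path  # a solution is reported and not extended further
            continue
        for n in sorted(adj[path[-1]]):
            if n not in path:  # maintain acyclicity
                heapq.heappush(heap, (badness + 1, path + [n]))


graph = {
    "F": {"A", "X"}, "A": {"F", "T", "B"}, "T": {"A", "Y"},
    "B": {"A", "C"}, "C": {"B"}, "X": {"F", "Y"}, "Y": {"X", "T"},
}
reduced = reduce_graph(graph, "F", "T")
for path in paths_by_badness(reduced, "F", "T"):
    print(path)
```

Because `paths_by_badness` is a generator, asking for the next path corresponds loosely to selecting the MORE option. Note that this leaf pruning alone cannot remove a dangling part that contains a cycle; that is the case the clustering levels are meant to handle.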

Figure 18: The Reduced Search Graph

In this article we introduced a novel way for computer supported query formulation called point to point queries. We provided a sample session with a provisional tool supporting point to point queries, and briefly discussed the relationship to query by navigation and query by construction. Together with these mechanisms a powerful query formulation tool could be built. Furthermore, a search algorithm was introduced to search for the relevant paths between the specified points. Finally, an optimisation strategy for the search process was discussed based on a pre-compiled clustering of the conceptual schema graph.

As a next step, the path expressions should be further developed to suit our needs. Furthermore, elegant verbalisations of the path expressions should be catered for.

References

[ADD+92] A. Auddino, Y. Dennebouy, Y. Dupont, E. Fontana, S. Spaccapietra, and Z. Tari. SUPER – Visual Interaction with an Object–based ER Model. In G. Pernul and A.M. Tjoa, editors, volume 340-356 of Lecture Notes in Computer Science, pages 423–439, Berlin, Germany, EU, 1992. Springer.

[BPW93] C.A.J. Burgers, H.A. (Erik) Proper, and Th.P. van der Weide. Organising an Information System as Stratified Hypermedia. In H.A. Wijshoff, editor, Proceedings of the Computing Science in the Netherlands Conference, pages 109–120, November 1993.

[BPW94] C.A.J. Burgers, H.A. (Erik) Proper, and Th.P. van der Weide. An Information System organized as Stratified Hypermedia. In N. Prakash, editor, CISMOD94, International Conference on Information Systems and Management of Data, pages 159–183, October 1994.

[Bru93] P.D. Bruza. Stratified Information Disclosure: A Synthesis between Information Retrieval and Hypermedia. PhD thesis, University of Nijmegen, Nijmegen, The Netherlands, EU, 1993.

[BW92] P.D. Bruza and Th.P. van der Weide. Stratified Hypermedia Structures for Information Disclosure. The Computer Journal, 35(3):208–220, 1992.

[CH94] L.J. Campbell and T.A. Halpin. Abstraction Techniques for Conceptual Schemas. In R. Sacks–Davis, editor, Proceedings of the 5th Australasian Database Conference, volume 16, pages 374–388, Christchurch, New Zealand, January 1994. Global Publications Services.

[Cra86] T.C. Craven. String Indexing. Academic Press, New York, New York, USA, 1986.

[FOP92] E.D. Falkenberg, J.L.H. Oei, and H.A. (Erik) Proper. Evolving Information Systems: Beyond Temporal Information Systems. In A.M. Tjoa and I. Ramos, editors, Proceedings of the Data Base and Expert System Applications Conference (DEXA‘92), Valencia, Spain, EU, pages 282–287, Berlin, Germany, EU, September 1992. Springer. ISBN 3211824006.

[Hal95] T.A. Halpin. Conceptual Schema and Relational Database Design. Prentice–Hall, Englewood Cliffs, New Jersey, USA, 2nd edition, 1995.

[HHO92] T.A. Halpin, J. Harding, and C-H. Oh. Automated Support for Subtyping. In B. Theodoulidis and A.G. Sutcliffe, editors, Proceedings of the Third Workshop on the Next Generation of CASE Tools, pages 99–113, May 1992.

[HP95] T.A. Halpin and H.A. (Erik) Proper. Subtyping and Polymorphism in Object–Role Modelling. Data & Knowledge Engineering, 15:251–281, 1995.

[HPW93] A.H.M. ter Hofstede, H.A. (Erik) Proper, and Th.P. van der Weide. Formal definition of a conceptual language for the description and manipulation of information models. Information Systems, 18(7):489–523, October 1993.

[HPW94a] A.H.M. ter Hofstede, H.A. (Erik) Proper, and Th.P. van der Weide. A Conceptual Language for the Description and Manipulation of Complex Information Models. In G. Gupta, editor, Seventeenth Annual Computer Science Conference, volume 16 of Australian Computer Science Communications, pages 157–167, Christchurch, New Zealand, January 1994. University of Canterbury. ISBN 047302313.

[HPW94b] A.H.M. ter Hofstede, H.A. (Erik) Proper, and Th.P. van der Weide. Supporting Information Disclosure in an Evolving Environment. In D. Karagiannis, editor, Proceedings of the 5th International Conference DEXA‘94 on Database and Expert Systems Applications, Athens, Greece, EU, volume 856 of Lecture Notes in Computer Science, pages 433–444, Berlin, Germany, EU, September 1994. Springer. ISBN 3540584358.

[Mar77] M.E. Maron. On Indexing, Retrieval and the Meaning of About. Journal of the American Society for Information Science, 28(1):38–43, 1977.

[Mee82] R. Meersman. The RIDL Conceptual Language. Technical report, International Centre for Information Analysis Services, Control Data Belgium, Inc., Brussels, Belgium, EU, 1982.

[OPF94] J.L.H. Oei, H.A. (Erik) Proper, and E.D. Falkenberg. Evolving Information Systems: Meeting the Ever–Changing Environment. Information Systems Journal, 4(3):213–233, 1994.

[Pro94a] H.A. (Erik) Proper. A Theory for Conceptual Modelling of Evolving Application Domains. PhD thesis, University of Nijmegen, Nijmegen, The Netherlands, EU, 1994. ISBN 909006849X.

[Pro94b] H.A. (Erik) Proper. Introduction to Formal Notations. Technical report, Asymetrix Research Laboratory, University of Queensland, Brisbane, Queensland, Australia, 1994.

[PW95] H.A. (Erik) Proper and Th.P. van der Weide. Information Disclosure in Evolving Information Systems: Taking a shot at a moving target. Data & Knowledge Engineering, 15:135–168, 1995.

[Rij89] C.J. van Rijsbergen. Towards an information logic. In Proceedings of the 12th annual international ACM SIGIR conference on Research and development in information retrieval, Cambridge, Massachusetts, USA, pages 77–86, New York, New York, USA, June 1989. ACM.

[Ros94] P. Rosengren. Using Visual ER Query Systems in Real World Applications. In G.M. Wijers, S. Brinkkemper, and T. Wasserman, editors, Proceedings of the Sixth International Conference CAiSE‘94 on Advanced Information Systems Engineering, Utrecht, The Netherlands, EU, volume 811 of Lecture Notes in Computer Science, pages 394–405, Berlin, Germany, EU, June 1994. Springer.

[Sal89] G.E Salton. Automatic Text Processing–The Transformation, Analysis, and Retrieval of Information by Computer. Addison Wesley, Reading, Massachusetts, USA, 1989.

[Sch83] B. Schneiderman. Direct Manipulation: A Step Beyond Programming Languages. IEEE Computer, 16(8):57–69, 1983.

[SM83] G.E Salton and M.J. McGill.