An Improved Algorithm for Fast K-Word Proximity Search Based on Multi-Component Key Indexes
Alexander B. Veretennikov
Ural Federal University, Yekaterinburg, Russia
Chair of Calculation Mathematics and Computer Science
[email protected]
WWW home page: http://veretennikov.org
This is a pre-print of a contribution published in Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol. 1251, published by Springer, Cham. The final authenticated version is available online at: https://doi.org/10.1007/978-3-030-55187-2_37.
Abstract.
A search query consists of several words. In a proximity full-text search, we want to find documents that contain these words near each other. This task requires much time when the query consists of high-frequently occurring words. If we cannot avoid this task by excluding high-frequently occurring words from consideration, by declaring them stop words, then we can optimize our solution by introducing additional indexes for faster execution. In a previous work, we discussed how to decrease the search time with multi-component key indexes. We showed that additional indexes can be used to improve the average query execution time by up to 130 times for queries that consist of high-frequently occurring words. In this paper, we present another search algorithm that overcomes some limitations of our previous algorithm and provides even more performance gain.
Keywords: full-text search, search engines, inverted indexes, additional indexes, proximity search, term proximity, information retrieval, query processing, document-at-a-time, DAAT.
1 Introduction

Full-text search is a cornerstone of information retrieval. Given a list of words, a user can obtain relevant documents that contain these words. Inverted files are used for this search [23, 11, 21, 2]. Words occur in documents with different frequencies, and an example of the frequency distribution of words is given by Zipf's law [22], which is presented in Figure 1. On the horizontal axis, we plot words from high-frequently occurring to low-frequently occurring. On the vertical axis, we plot the number of occurrences of the corresponding word in a typical text collection.
Fig. 1. A typical word frequency distribution.
The most frequently occurring words (see Figure 1, on the left side) occur significantly more often than ordinary words (see Figure 1, on the right side). This factor can affect the search performance in some cases. If the user needs only the documents that contain the query words, then the search query time depends only on the number of documents in the collection. For each document and each word that occurs in the document, we need to store in the index exactly one record, which represents the fact that the word occurs somewhere in the document.

For other kinds of full-text searches, we need to store a record for each occurrence of each word in each document [12, 8], which considerably affects performance. In this case, the search time depends on the occurrence frequencies of the queried words in the texts, and it is common to observe a search system that evaluates queries containing high-frequently occurring words in a significantly longer time than queries that consist only of ordinary words (see the left side and the right side of Figure 1, respectively). See an example in [14].

One way is to skip the most frequently occurring words. However, there are some concerns about this approach [18]. A high-frequently occurring word may have a unique meaning in the context of a specific query. The authors of [18] stated literally that "stopping or ignoring common words will have an unpredictable effect". Examples are provided in [18, 14]. We can consider as an example the query "Who are you who". The word "Who" has a specific meaning in this query: The Who are an English rock band, and "Who are You" is one of their songs.

If the user needs the documents that contain the query as a phrase, that is, the queried words must exist in the document in sequential order one after another, then additional phrase indexes can be used for performance improvement [18]. However, phrase indexes cannot be used for proximity full-text search, that is, when the user needs a document that contains the queried words near each other. In the latter kind of search, some other words are allowed in the text between the queried words. We have proposed other methods to solve this task [14, 16, 17].

In our methodology, we define several types of queries, depending on the kinds of words they contain. For each type of query, we can use specific types of additional indexes. The kinds of words are specified on the basis of word frequencies.

The importance of proximity full-text search is determined by the use of the proximity factor in modern information-retrieval methods [20, 3, 13, 10].

Early-termination approaches [1, 9] can reduce the query processing time by sorting the posting lists in the index according to relevance in decreasing order. In this case, irrelevant records, which are located at the end of each posting list, can be skipped. However, these methods cannot be used in an effective way when we need proximity full-text search [17]. When we sort posting lists according to some factors, they are sorted independently of each other. However, in a specific query, we have several words linked together, and we cannot skip any part of any posting list, because there is always a possibility that records for a document that contains the queried words near each other occur at the end of a posting list, due to the document having low relevance according to the non-proximity factors. This problem was investigated in [20], but only for two-word queries, which is a significant limitation.

In the following sections, we introduce several lemma types and several index types whose definitions are based on the defined lemma types. Then, we provide an overview of previously developed search algorithms. Then, the new algorithm is described. Finally, the results of the experiments are presented.
We use a morphological analyzer for lemmatization. For each word in the dictionary, the analyzer returns the list of lemmas, i.e., basic or canonical forms. Our dictionary now supports two languages.

Let us sort all lemmas in decreasing order of their occurrences in the texts. We call such a sorted list the FL-list. Let the first SWCount elements of the FL-list be "stop" lemmas. Let the second FUCount elements of the FL-list be "frequently used" lemmas. Let all remaining lemmas be "ordinary" lemmas. SWCount and FUCount are parameters whose representative example values are 700 and 2100.

We use FL-numbers to establish an order on the set of all lemmas. For example, "you" < "who" because "you" has FL-number 47, and "who" has FL-number 293.

Examples of each type of lemma are as follows:
stop lemmas: "are", "war", "time", "be";
frequently used lemmas: "beautiful", "red", "hair";
ordinary lemmas: "glorious", "promising".

Although we introduce the notion of a "stop lemma", we do not exclude such lemmas from the search. This division of lemmas is performed only to introduce different optimization methods for each kind of lemma.
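To make this classification concrete, the following C++ sketch maps a lemma's FL-number to its kind using the example parameter values given above. The names (Classify, LemmaKind) and the zero-based treatment of FL-numbers are assumptions of the sketch, not something the paper prescribes.

#include <cstdint>

enum class LemmaKind { Stop, FrequentlyUsed, Ordinary };

constexpr uint32_t SWCount = 700;   // the first SWCount lemmas of the FL-list
constexpr uint32_t FUCount = 2100;  // the next FUCount lemmas

// flNumber is the lemma's position in the FL-list (treated as zero-based
// here; this is an assumption of the sketch).
LemmaKind Classify(uint32_t flNumber) {
    if (flNumber < SWCount) return LemmaKind::Stop;
    if (flNumber < SWCount + FUCount) return LemmaKind::FrequentlyUsed;
    return LemmaKind::Ordinary;
}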
The expanded (f, s, t) index, or three-component key index [16, 17], is the list of occurrences of the lemma f for which lemmas s and t both occur in the text at distances less than or equal to MaxDistance from f. We create the expanded (f, s, t) index only when f, s, and t are all stop lemmas and only for the case in which f ≤ s ≤ t. Here, MaxDistance is a parameter which may have a value of 5, 7, 9 or even more. Each posting in the index includes the distance between f and s in the text and the distance between f and t in the text.

The expanded (w, v) index, or two-component key index [16, 17], is the list of occurrences of the lemma w for which lemma v occurs in the text at a distance less than or equal to MaxDistance from w. We create the expanded (w, v) index only when w is a frequently used lemma and v is a frequently used or ordinary lemma. Each posting in the index includes the distance between w and v in the text. If both w and v are frequently used lemmas, then we create an index for them only if w < v.

An ordinary inverted index with NSW records [16, 17] contains the posting lists for each frequently used and ordinary lemma. Each posting includes an NSW (near stop word) record. This record contains information about all stop lemmas that occur in the texts near the position of the specified posting. NSW records can also be skipped if required.

Different types of indexes can be used depending on the types of lemmas in the query. For example, if the query contains several ordinary lemmas and a frequently used lemma, then (w, v) indexes can be used instead of an ordinary index for obtaining information about the occurrences of these lemmas in the texts.

If the query contains only stop lemmas, we use (f, s, t) indexes, because two-component indexes do not provide enough performance [14]. This case is the most complex from the performance point of view. We investigated other types of queries in our previous work [16], and the task for them seems to be solved. In the current work, we investigate only queries that consist solely of stop lemmas.

Let us consider an example. Let us have two documents, D0 and D1. The words are numbered, and these numbers are zero-based.

D0: Who are you is the album by The Who.
D1: Who has reality, who is real, who is true.

Stop lemmas: who, are, you, is, the, by, etc. We have several three-component keys here, for example: (you, are, who), (have, who, who), (the, by, who), (the, you, are), (be, who, who), etc. Let us note that "be" is the lemma of "is", and "have" is the lemma of "has".

For the key (be, who, who) we have, for example, the following record in the index: (0, 3, −3, 5). Here, 0 is the identifier of the document, 3 is the position of "is" in the document, (−3) is the distance between the first "who" and "is", and 5 is the distance between the second "who" and "is".

For the key (you, are, who) we have the record (0, 2, −1, −2). Here, 0 is the identifier of the document, 2 is the position of "you" in the document, (−1) is the distance between "are" and "you" in the document, and (−2) is the distance between "you" and "who" in the document.
The expanded (f, s, t) index contains (ID, P, D1, D2) records, in which ID is the identifier of the document, P is the position of the word in the document, D1 is the distance between s and f, and D2 is the distance between t and f. If we have two records, A = (ID1, P1, X1, X2) and B = (ID2, P2, Y1, Y2), then A < B if one of the following conditions is met: ID1 < ID2, or ID1 = ID2 and P1 < P2.

We create an iterator object for each key that is used in the search. The iterator object has the Next method, with which we move to the next record. The iterator object also has the Value property to access the current record and the Key property to access the key of the iterator. We can access specific components of the three-component key by Key[0], Key[1], and Key[2].
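The following C++ sketch illustrates one possible shape of these records and iterators. The type names and the boolean-returning Next are assumptions of the sketch, not the paper's actual API.

#include <cstdint>

struct Posting {
    uint32_t id;  // ID: document identifier
    uint32_t p;   // P: position of the lemma f in the document
    int32_t d1;   // D1: distance from f to s (may be negative)
    int32_t d2;   // D2: distance from f to t (may be negative)
};

// Ordering of records: by document identifier, then by position.
inline bool operator<(const Posting& a, const Posting& b) {
    if (a.id != b.id) return a.id < b.id;
    return a.p < b.p;
}

// One iterator per multi-component key used in the query.
class PostingIterator {
public:
    virtual ~PostingIterator() = default;
    virtual bool Next() = 0;  // advance; returning false on exhaustion
                              // is an assumption of this sketch
    virtual const Posting& Value() const = 0;  // the current record
    virtual uint32_t Key(int i) const = 0;     // component i (0..2) of the key
};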
We have defined several search algorithms for multi-component key indexes.

In the Main-Cell algorithm [17], we need to select the most frequently occurring lemma in the query. This lemma is called the main lemma of the query. Then, we form a list of multi-component keys. The main lemma is always the first component of each key. For the other components, we use the remaining lemmas of the query. We create an iterator object for each key. Then, in each iterator object, we use the Next method to move to an equal position. After all iterator objects have an equal position (ID, P), we check that all lemmas are present near that position and calculate the size of the fragment of the text which contains the query. The drawback of this algorithm is that we need to duplicate the main lemma in several keys.

In the Intermediate-Lists algorithm [14], we do not need to select a main lemma. We select a list of multi-component keys in such a way that each lemma of the query is used in some key. For each key, we have a list of records (ID, P, D1, D2). Each record (ID, P, D1, D2) produces three records — (ID, P), (ID, P + D1), (ID, P + D2) — that correspond to occurrences in the texts of f, s, and t, respectively.

The algorithm works as follows. We move in each iterator object to the same document. Then, for each iterator, we read all records for this document and produce three intermediate streams of records. Each intermediate stream contains a list of occurrences of a specific lemma in the document. Then, we combine the intermediate streams to produce the results. The drawback of this algorithm is that we need to produce the intermediate streams.
Fig. 2. The search algorithm.
In the Optimized-Intermediate-Lists algorithm [15], we obtain more performance gain by applying optimized key selection methods. However, we still need to produce intermediate posting lists with this algorithm.

In this paper, we present a novel algorithm in which we combine the several posting lists for multi-component keys into a list of results without creating intermediate posting lists. We call this algorithm the Combiner algorithm.
Let us have a subquery Q, which is a list consisting of n lemmas. The search algorithm consists of the following steps (see Figure 2):
1) Lemmatization.
2) Building the list of subqueries.
3) Processing the subqueries.
4) Combining the results.

Let us consider the following query: "who are you who". After lemmatization, we have [who] [are, be] [you] [who], because the word "are" has two lemmas in our dictionary, namely, "are" and "be". For a query that consists of high-frequently occurring words, we need to create subqueries, that is, [who] [are] [you] [who] and [who] [be] [you] [who]. We need to have each subquery in the form of a list of lemmas. Then, we evaluate each subquery and combine the results. A sketch of building the list of subqueries is given below.

The processing of the evaluation of a subquery consists of the following stages (see Figure 3):
1) Selection of the keys.
2) Building an iterator for each key.
3) Search.
4) Calculation of the relevance.

Fig. 3. The processing of the evaluation of the subquery.
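The list of subqueries can be built as the Cartesian product of the per-word lemma alternatives. The C++ sketch below (assumed names) illustrates this; for the example above it yields exactly the two subqueries shown.

#include <string>
#include <vector>

using Lemmas = std::vector<std::string>;

// perWord[i] lists the lemmas of the i-th query word, e.g.
// {{"who"}, {"are", "be"}, {"you"}, {"who"}}.
std::vector<Lemmas> BuildSubqueries(const std::vector<Lemmas>& perWord) {
    std::vector<Lemmas> result{{}};  // one empty prefix to start from
    for (const Lemmas& alternatives : perWord) {
        std::vector<Lemmas> next;
        for (const Lemmas& prefix : result)
            for (const std::string& lemma : alternatives) {
                Lemmas extended = prefix;
                extended.push_back(lemma);
                next.push_back(std::move(extended));
            }
        result = std::move(next);
    }
    return result;  // [who][are][you][who] and [who][be][you][who]
}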
We have a subquery that is a list of lemmas. The elements of the list are numbered starting with zero. For each key, we need to select three components, and we need to select them using lemmas that occur at different indexes in the subquery. We also want to exclude duplicates from consideration if the subquery has some lemmas that appear several times. When we select a lemma as a component of a key, we "mark" it as "used". Let Used be a set of lemmas that is initially empty. When we mark a lemma, we include it into Used. We use the occurrence frequency of the lemmas in the texts as a factor for selection.

For the first component of the first key, we select the most frequently occurring unused lemma in the query.

For the second component of the key, we need to select an unused lemma whose index in the query is different from the index of the first component of the key. If we have several acceptable lemmas, we select the one among them that is the least frequently occurring in the texts. If we do not have any acceptable lemma, then we select a lemma using the aforementioned conditions except that we ignore the "used" mark, and we mark this component with * to designate it as a duplicate.

For the third component of the key, we need to select an unused lemma whose index in the query is different from the indexes of the first and the second components of the key. If we have several acceptable lemmas, we select the one among them that is the least frequently occurring in the texts. If we do not have any acceptable lemma, then we select a lemma using the aforementioned conditions except that we ignore the "used" mark, and we mark this component with * to designate it as a duplicate.

Then, we mark all selected lemmas as "used". If we have any unused lemmas, we repeat the process and form another key; otherwise, all keys are selected. A sketch of these selection rules follows the example below.

Let us consider an example. Let us say that we have the query "Who are you and why did you say what you did". This query can be found in Cecil Scott Forester's novel "Lord Hornblower". Let us consider its subquery [who] [are] [you] [and] [why] [do] [you] [say] [what] [you] [do]. With FL-numbers, the query has the following appearance: [who: 293] [are: 268] [you: 47] [and: 28] [why: 528] [do: 154] [you: 47] [say: 165] [what: 132] [you: 47] [do: 154].

We select "and: 28" as the first component of the first key because it is the most frequently occurring lemma; that is, it has the least FL-number (28) among the lemmas of the subquery. Then, we select "why: 528" as the second component of the key and "who: 293" as the third component of the key. We mark "and", "why" and "who" as used.

We have other unused lemmas and can select another key. We select "you: 47" as the first component of the second key, and "are: 268" and "say: 165" as the second and the third components of the second key. We mark all selected lemmas as "used". Then, we select "what: 132" as the first component of the third key. We select "do: 154" as the second component of the key. There are no "unused" lemmas remaining. Therefore, we ignore the "used" mark and select "why*: 528" as the third component of the third key. We mark this component with * because it is a duplicate.
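The following C++ sketch (assumed names and helper structures) implements these selection rules as described; on the example above it reproduces the keys (and, why, who), (you, are, say), and (what, do, why*).

#include <array>
#include <cstdint>
#include <string>
#include <vector>

struct QueryLemma {
    std::string lemma;
    uint32_t flNumber;  // index in the FL-list; lower = more frequent
    bool used = false;  // membership in the Used set (tracked per lemma)
};

struct Component {
    int index;       // index of the selected lemma in the subquery
    bool duplicate;  // the (*) mark
};

// Pick a lemma for one key component: prefer unused lemmas at query
// indexes not already taken by this key; if none fits, ignore the
// "used" mark and flag the component as a duplicate.
static Component Pick(const std::vector<QueryLemma>& q, bool mostFrequent,
                      const std::vector<int>& taken) {
    auto isTaken = [&](int i) {
        for (int t : taken) if (t == i) return true;
        return false;
    };
    for (int pass = 0; pass < 2; ++pass) {  // pass 1 ignores "used"
        int best = -1;
        for (int i = 0; i < (int)q.size(); ++i) {
            if (isTaken(i) || (pass == 0 && q[i].used)) continue;
            if (best < 0 || (mostFrequent ? q[i].flNumber < q[best].flNumber
                                          : q[i].flNumber > q[best].flNumber))
                best = i;
        }
        if (best >= 0) return {best, pass == 1};
    }
    return {-1, false};  // subquery has fewer than three distinct positions
}

// Build three-component keys until every lemma of the subquery is used.
std::vector<std::array<Component, 3>> SelectKeys(std::vector<QueryLemma>& q) {
    auto markUsed = [&](int index) {  // marking a lemma marks all positions
        if (index < 0) return;
        for (auto& l : q)
            if (l.lemma == q[index].lemma) l.used = true;
    };
    std::vector<std::array<Component, 3>> keys;
    for (;;) {
        bool anyUnused = false;
        for (const auto& l : q) anyUnused = anyUnused || !l.used;
        if (!anyUnused) break;
        std::array<Component, 3> key{};
        std::vector<int> taken;
        key[0] = Pick(q, true, taken);   // most frequently occurring
        taken.push_back(key[0].index); markUsed(key[0].index);
        key[1] = Pick(q, false, taken);  // least frequently occurring
        taken.push_back(key[1].index); markUsed(key[1].index);
        key[2] = Pick(q, false, taken);
        markUsed(key[2].index);
        keys.push_back(key);
    }
    return keys;
}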
Queries are usually evaluated by Term-at-a-Time (TAAT), Document-at-a-Time (DAAT) or Score-at-a-Time (SAAT) approaches [4, 7]. DAAT approaches have advantages over SAAT and TAAT approaches [7]. We use a Document-at-a-Time (DAAT) kind of algorithm. An iterator allows reading the posting list for a key from the start to the end. The posting list is sorted in increasing order.

The search procedure is a three-level process (see Figure 4):

Step 1. We move to a document. All iterators are positioned on the specific document.
Step 2. We move to a position in the document at which the queried lemmas are near each other.
Step 3. We put information about the positions of the lemmas into special tables. We use the tables to check that all queried lemmas are present and to calculate the exact position of the result in the text.

If we do not have an acceptable position in the document in Step 2, then we move to the next document (move to the start of Step 1); otherwise, we repeat it. If we do not have another document in Step 1, then we exit from the search; otherwise, we repeat it.
Fig. 4. Search procedure diagram.

Step 1

If we have read and processed all postings, then we exit from the search; otherwise, we perform the following in a loop:
1) Let S be the iterator with the minimum document identifier, which is defined by its Value.ID.
2) Let E be the iterator with the maximum document identifier.
3) If S.ID = E.ID, then we exit from the loop and move to Step 2; otherwise, we perform S.Next() and move to the start of the loop again.

The cost of one iteration of this loop is O(log n). Both S and E are local variables of this procedure.
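A minimal sketch of Step 1, reusing the Posting/PostingIterator types from the earlier sketch. For brevity it finds S and E with a linear scan, whereas a binary heap over the iterators gives the O(log n) per-iteration cost mentioned in the text.

#include <vector>

// Returns false when some posting list is exhausted (exit from the search);
// returns true when all iterators point at the same document (go to Step 2).
bool AlignToSameDocument(std::vector<PostingIterator*>& its) {
    for (;;) {
        PostingIterator* s = its[0];  // minimum document identifier (S)
        PostingIterator* e = its[0];  // maximum document identifier (E)
        for (PostingIterator* it : its) {
            if (it->Value().id < s->Value().id) s = it;
            if (it->Value().id > e->Value().id) e = it;
        }
        if (s->Value().id == e->Value().id) return true;
        if (!s->Next()) return false;
    }
}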
Step 2

We perform the search in the document ID. We perform the following in a loop:
1) If we have read all postings for the document ID in some iterator, we break the loop; otherwise, we perform the following sub-steps 2-4.
2) Let S be the iterator with the minimum position value Value.P.
3) Let E be the iterator with the maximum position value. Let Delta = E.Value.P − S.Value.P.
4) If Delta < MaxDistance × 2, then we break the loop and move to Step 3; otherwise, we perform S.Next() and move to the start of the loop again.

The cost of one iteration of this loop is O(log n). We can use binary heaps [19] to implement this approach.

Step 2 ends when one of the following cases occurs:
1) After the execution of S.Next, we move to another document in S, or we do not have any other postings in S. In this case, we break the loop and move to Step 1.
2) We have Delta < MaxDistance × 2. In this case, we have a place in the document that can potentially contain all queried lemmas near each other. Then, we break the loop and move to Step 3.
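Step 2 can be sketched in the same style (again with a linear scan standing in for the heap; the names are assumptions of the sketch).

#include <cstdint>
#include <vector>

enum class Step2Result { FoundWindow, NextDocument };

// All iterators are positioned on document docId when this is called.
Step2Result FindWindow(std::vector<PostingIterator*>& its,
                       uint32_t docId, uint32_t maxDistance) {
    for (;;) {
        PostingIterator* s = its[0];  // minimum position (S)
        PostingIterator* e = its[0];  // maximum position (E)
        for (PostingIterator* it : its) {
            if (it->Value().p < s->Value().p) s = it;
            if (it->Value().p > e->Value().p) e = it;
        }
        if (e->Value().p - s->Value().p < maxDistance * 2)
            return Step2Result::FoundWindow;  // Delta < MaxDistance * 2
        if (!s->Next() || s->Value().id != docId)
            return Step2Result::NextDocument; // back to Step 1
    }
}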
Step 3
At the start of this step, we have established that at the current position in the document we have all keys, meaning that all queried lemmas are near each other. We have two tables, namely, the Lemma table and the Position table. We use these tables to obtain the fragments of the text that contain the queried lemmas. Afterward, we move to Step 2 again.
Lemma Table
The Lemma table is used for the following:
1) to check whether all queried lemmas exist in the text or not;
2) to determine the start and the end of the fragment of the text that contains the queried lemmas. The fragment of the text must have the minimum length among the acceptable fragments.

The Lemma table contains an array that consists of SWCount entries. Each entry corresponds to one stop lemma by its FL-number. Each entry has two fields, namely, Max and Count. Max is the count of the occurrences of the corresponding lemma in the query; we initialize this field at the start of the search. Count is the count of occurrences of the corresponding lemma in the current fragment of the text, which is initially zero.

The Lemma table itself also has Max and Count fields. The Max field of the Lemma table equals the length of the subquery; we initialize this field at the start of the search.
The idea of the use of the Lemma table is the following. We need two queues.

Let us have a queue of records (P, Lem), where the lemma Lem is some lemma from the subquery and P is the position of the lemma Lem in the document. We call this queue Source. Let the queue be sorted in increasing order of P.

Let us also have a second queue, which we call Processed. This queue will also be sorted in increasing order of P.

We process all elements of Source from the start to the end. We perform the following in a loop while any element remains in Source:
1) Let E be the first, that is, the minimum, element of the Source queue (see 3.2 in Figure 4).
2) We remove E from the Source queue (see 3.3 in Figure 4).
3) We place E at the end of the Processed queue (see 3.3 in Figure 4).
4) We add information about E into the Lemma table (see 3.4 in Figure 4). This includes the following (4.a-4.c):
a. We obtain the entry Entry by the value of E.Lem.
b. If Entry.Max > Entry.Count, then we increment the Lemma.Count field of the Lemma table.
c. We increment Entry.Count.
5) We check the Lemma table (see 3.5 in Figure 4).
Checking the Lemma Table (Step 3.5 in Figure 4)

If Lemma.Count ≠ Lemma.Max, then we do nothing. Otherwise, Lemma.Count = Lemma.Max, and we have in the text all required lemmas. Then, we need to obtain the minimum fragment of the text which contains the queried lemmas. For this, we use the Processed queue. We perform the following in a loop:
1) Let S be the first element of the Processed queue.
2) We obtain the entry Entry by the value of S.Lem.
3) If Entry.Count > Entry.Max, then we can decrease the length of the fragment of the text. In this case, we decrease Entry.Count, remove S from Processed, and go to 1); otherwise, we break the loop.

When we exit from the loop, S defines the start of the fragment of the text, and E defines the end of the fragment of the text. A sketch of this bookkeeping is given below.
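A sketch of the Lemma table bookkeeping and of the fragment minimization over the Processed queue; the names are assumptions, and std::deque stands in for the Processed queue.

#include <cstdint>
#include <deque>
#include <vector>

struct LemmaEntry { uint32_t max = 0, count = 0; };  // per-lemma fields
struct Occurrence { uint32_t p; uint32_t lem; };     // position, lemma id

struct LemmaTable {
    // One entry per lemma, indexed by FL-number or by a local number
    // (see Lemma Table Renumbering below).
    std::vector<LemmaEntry> entries;
    uint32_t max = 0, count = 0;  // table-level Max and Count
};

// Steps 3.3-3.4: move one element from Source into Processed and update
// the table. Returns true when all queried lemmas are present (3.5).
bool AddOccurrence(LemmaTable& t, std::deque<Occurrence>& processed,
                   const Occurrence& e) {
    processed.push_back(e);
    LemmaEntry& entry = t.entries[e.lem];
    if (entry.max > entry.count) ++t.count;
    ++entry.count;
    return t.count == t.max;
}

// Called when AddOccurrence returned true: shrink the fragment from the
// left while it still contains every queried lemma often enough.
void MinimizeFragment(LemmaTable& t, std::deque<Occurrence>& processed) {
    for (;;) {
        LemmaEntry& entry = t.entries[processed.front().lem];
        if (entry.count <= entry.max) break;  // front is now the start S
        --entry.count;
        processed.pop_front();
    }
}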
The next question is where we obtain the Source queue and how we perform the sorting of Source with O(1) computational complexity.

Position Table

The Position table has the method Set(P, Lem), where P is the position of the lemma Lem in the document. It has the property Start, which specifies the start of the current fragment of the text that is interesting for us. It has the method Shift(P), which can be used to set the value of Start. At the start of Step 3, we execute the method Shift(P − min(P, MaxDistance)), where P is the minimum current position value Value.P among all iterators.

In the internal implementation of the Position table, we use three buffers, each with a length of WindowSize. Each buffer is an array which contains WindowSize entries. The following condition must be met: MaxDistance × 2 ≤ WindowSize ≤ 64. Each buffer also has a corresponding 64-bit Mask field. Each entry of the buffer has a corresponding bit in the Mask field of the buffer. Each entry of the buffer has three fields: Lem, P, and Next.

When we execute Set(P, Lem), we calculate the relative position in the Position table, that is, R = P − Start. Then, we define the buffer by performing R / WindowSize. Then, we define a relative position in the buffer, RelativeP = R % WindowSize (let % be the modulus operator). The variable RelativeP defines the target entry T in the buffer. We set T.Lem = Lem and T.P = P for the target entry. We also set the bit with number RelativeP in the Mask field of the buffer to 1.

From the buffer, we can produce a queue, which is a sorted linked list. We can use the Bit Scan Forward operation, which is one processor command, on the Mask field to determine the index of the first entry of the queue. We can reset the bit of the entry to zero and perform the Bit Scan Forward operation again to move to the second entry, and so on. To build the queue, we use the Next fields of the entries. The queue will be sorted by this creation process.

The problem here is that one buffer has a limited length. To solve this problem, we use three buffers. Let WindowFlushBorder be WindowSize × 1.5, that is, the center of the second buffer.
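The following sketch (assumed names) shows the Set operation and the draining of one buffer in sorted order. std::countr_zero from C++20 plays the role of the Bit Scan Forward instruction; WindowSize is fixed at 64 here, and the linked-list construction via the Next fields is elided in favor of emitting entries directly.

#include <bit>
#include <cstdint>

constexpr uint32_t WindowSize = 64;  // MaxDistance * 2 <= WindowSize <= 64

struct BufferEntry { uint32_t lem; uint32_t p; uint32_t next; };

struct Buffer {
    uint64_t mask = 0;  // bit RelativeP set => the entry is occupied
    BufferEntry entries[WindowSize];
};

struct PositionTable {
    Buffer buffers[3];
    uint32_t start = 0;  // the Start property

    // Precondition (guaranteed by the feeding loop):
    // start <= p < start + 3 * WindowSize.
    void Set(uint32_t p, uint32_t lem) {
        uint32_t r = p - start;                // relative position R
        Buffer& b = buffers[r / WindowSize];   // which of the three buffers
        uint32_t relativeP = r % WindowSize;   // target entry in the buffer
        b.entries[relativeP] = {lem, p, 0};
        b.mask |= uint64_t{1} << relativeP;
    }
};

// Emit the occupied entries of one buffer in increasing position order,
// as when populating the Source queue from the first buffer.
template <typename Fn>  // Fn: void(const BufferEntry&)
void Drain(Buffer& b, Fn emit) {
    while (b.mask != 0) {
        int i = std::countr_zero(b.mask);  // Bit Scan Forward
        b.mask &= b.mask - 1;              // reset the lowest set bit
        emit(b.entries[i]);
    }
}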
For each iterator IT, we perform the following. We execute the methods
Set(IT.Value.P, IT.Key[0]),
Set(IT.Value.P + IT.Value.D1, IT.Key[1]),
Set(IT.Value.P + IT.Value.D2, IT.Key[2]),
and perform IT.Next(). We perform all these actions while IT.Value.P < Start + WindowFlushBorder (see 3.1 in Figure 4).

We also take into consideration the (*) marks of the components of IT.Key. We perform Set only for those components that do not have the (*) mark. That is, if the third component has the (*) mark, we do not perform Set(IT.Value.P + IT.Value.D2, IT.Key[2]).

Each call of Set defines an occurrence P of a lemma in the document. This means that for each three-component key, we may produce up to three values, and each of them defines an occurrence of a lemma in the document. We are sure that all values with the condition P < Start + WindowSize are already produced. This is because we have WindowSize ≥ MaxDistance × 2, and for any iterator IT, all records with the condition IT.Value.P < Start + WindowFlushBorder are already processed. For any iterator IT, any next record has

IT.Value.P ≥ Start + WindowFlushBorder = Start + WindowSize × 1.5 ≥ Start + WindowSize + MaxDistance.

Therefore, IT.Value.P + IT.Value.D1 ≥ Start + WindowSize, because we have −MaxDistance ≤ IT.Value.D1 ≤ MaxDistance. For IT.Value.D2, we have the same.

After we process all the iterators, we put all updated entries from the first buffer into the Source queue (we use Bit Scan Forward for this). This completes step 3.1 in Figure 4.
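The feeding loop for one iterator can be sketched as follows, reusing the types from the earlier sketches; document-boundary handling is omitted, and the duplicate-mark array is an assumed representation of the (*) marks.

// duplicate[i] is true when component i of the key carries the (*) mark;
// the first component is never marked.
void FeedIterator(PostingIterator& it, PositionTable& table,
                  const bool duplicate[3], uint32_t flushBorder) {
    while (it.Value().p < table.start + flushBorder) {
        const Posting& v = it.Value();
        table.Set(v.p, it.Key(0));
        if (!duplicate[1])
            table.Set(static_cast<uint32_t>(v.p + v.d1), it.Key(1));
        if (!duplicate[2])
            table.Set(static_cast<uint32_t>(v.p + v.d2), it.Key(2));
        if (!it.Next()) break;  // further boundary checks omitted here
    }
}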
Then, we use the Lemma table to produce a list of search results (see 3.2-3.5 in Figure 4). Each search result is a fragment of the document which contains the queried lemmas. In this production process, all elements of the Source queue will be processed. However, after the processing, some entries may remain in the Processed queue.
Let us note that if any item exists in the Processed queue, then this item can only belong to the first buffer. After the Source queue is processed, we can go to the start of Step 3 again. We renumber the buffers in a cyclic way: the first buffer becomes the third buffer, the second buffer becomes the first buffer, and the third buffer becomes the second buffer.

We require three buffers because of these entries in the Processed queue. We cannot reuse these entries while they remain in the Processed queue. However, we have no problems here. For any iterator IT, after the buffer switch, we will read all records with the condition IT.Value.P < Start_new + WindowFlushBorder. This means that these records will affect the entries only in the first and second buffers; the new third buffer, that is, the former first buffer, will not be affected.

We also remove from the Processed queue each entry Entry that satisfies the following condition: (Start + WindowSize − Entry.P) > MaxDistance × 2. The entries that can be added into the Processed queue in the next iteration of Step 3 can lie only in the new first buffer, that is, the former second buffer. These entries will be far from the entries that we remove, so we can safely free them. In fact, we need to free them, because it may happen that no records are added into Processed in the next iteration, and this cleaning of the Processed queue ensures that no item that belongs to the first or the second buffer exists in the Processed queue at the start of Step 3.

Finally, we set Start = Start + WindowSize.
Lemma Table Renumbering

To reduce the size of the Lemma table, we can assign a local number, with respect to the subquery, to each lemma. In this case, the size of the Lemma table equals the count of unique lemmas in the subquery.
11 Experiment 1
In our experiments, we use the collection of texts from [16, 17, 14], which consists of approximately 195 000 documents of plain text, fiction and magazine articles with a total size of 71.5 GB. The average document text size is approximately 384.5 KB. In our previous experiments, we used MaxDistance = 5, SWCount = 700, and FUCount = 2100. We need to use the same parameters to perform a comparison between the algorithms. The search experiments were conducted using the experimental methodology from [17]. We assume that in typical texts the words are distributed similarly, as Zipf stated in [22]. Therefore, the results obtained with our text collection will be relevant for other collections.

We used the following computational resources:
CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz.
HDD: 7200 RPM. RAM: 24 GB. OS: Microsoft Windows 2008 R2 Enterprise.

We created the following indexes.
Idx1: the ordinary inverted index without any improvements, such as NSW records [17]. The total size is 95 GB.

Idx2: our indexes, including the ordinary inverted index with the NSW records and the (w, v) and (f, s, t) indexes, where MaxDistance = 5. The total size is 746 GB.

Please note that the total size of each type of index includes the size of the repository (the indexed texts in compressed form), which is 47.2 GB.

We compare the following search engine configurations:
SE1: all queries are evaluated using the standard inverted index Idx1.
SE2.1: all queries are evaluated using Idx2 and the Main-Cell algorithm [17].
SE2.2: all queries are evaluated using Idx2 and the Intermediate-Lists algorithm [14].
SE2.3: all queries are evaluated using Idx2 and the Optimized-Intermediate-Lists algorithm [15].
SE2.4: all queries are evaluated using Idx2 and the new Combiner algorithm.
Fig. 5. Average query execution times for SE1, SE2.1, SE2.2, SE2.3, and SE2.4.

Fig. 6. Average query execution times for SE2.1, SE2.2, SE2.3, and SE2.4.

Average query execution times: SE1: 31.27 sec., SE2.1: 0.33 sec., SE2.2: 0.29 sec., SE2.3: 0.24 sec., and SE2.4: 0.22 sec.

Average data read sizes per query: SE1: 745 MB, SE2.1: 8.45 MB, SE2.3: 6.16 MB, and SE2.4: 6.2 MB.

Average numbers of postings per query: SE1: 193 million, SE2.1: 765 thousand, SE2.2: 559 thousand, SE2.3: 419 thousand, and SE2.4: 423 thousand.

We improved the query processing time by a factor of 94.7 with SE2.1, by a factor of 109.2 with SE2.2, by a factor of 130.3 with SE2.3, and by a factor of 142.1 with SE2.4, in comparison with the ordinary inverted file SE1.

Fig. 7. Experiment 1. Average data read sizes per query for SE1, SE2.1, SE2.2, SE2.3, and SE2.4.

We also improved the average data read size per query by a factor of 1.1 with the new algorithm.
12 Experiment 2
For the second experiment, the GOV2 text collection and the following queries are used: title queries from TREC Robust Task 2004 (250 queries in total), TREC Terabyte Task from 2004 to 2006 (150 queries in total), and TREC Web Task from 2009 to 2014 (300 queries in total), with 700 queries in total. The GOV2 text collection consists of approximately 25 million documents with a total size of approximately 426 GB, that is, approximately 167 GB of plain text (after HTML tag removal). The average document text size is approximately 7 KB. We used MaxDistance = 5, SWCount = 500, and FUCount = 1050, and only the English dictionary. The value of SWCount is very near to the 421 from [5].

We created the following indexes.

Idx1: the ordinary inverted index without any improvements, such as NSW records [17]. The total size is 143 GB (including the total size of the indexed texts in compressed form, that is, 57.3 GB).

Idx2: our indexes, including the ordinary inverted index with the NSW records and the (w, v) and (f, s, t) indexes, where MaxDistance = 5. The total size is 1.29 TB.

The query set can be divided into the following groups depending on the lemmas in a concrete query:
Q1) Only stop lemmas: 12 queries.
Q2) Stop and frequently used and/or ordinary lemmas (i.e., the query has one or several stop lemmas and some other lemmas that may be frequently used or ordinary): 298.
Q3) Only frequently used lemmas: 9.
Q4) Frequently used lemmas and ordinary lemmas: 151.
Q5) Only ordinary lemmas: 230.

According to [16], different algorithms are applied to each kind of query when multi-component key indexes are used.
For Q1 queries, (f, s, t) indexes are used, and we have the following results.

Average query times: SE1: 77.673 sec., SE2.1: 5.072 sec., SE2.2: 2.072 sec., SE2.3: 1.057 sec., and SE2.4: 0.662 sec.

Average data read sizes per query: SE1: 2.027 GB, SE2.1: 190.6 MB, SE2.3: 33.5 MB, and SE2.4: 19.18 MB.

Average numbers of postings per query: SE1: 511.5 million, SE2.1: 17.3 million, SE2.2: 5.08 million, SE2.3: 2.9 million, and SE2.4: 1.6 million.

We improved the average query processing time by a factor of 15.3 with SE2.1, by a factor of 37.5 with SE2.2, by a factor of 73.5 with SE2.3, and by a factor of 117.3 with SE2.4, in comparison with SE1.

For Q2 queries, we have the following results.

Average query times: Idx1: 13.37 sec., Idx2: 0.521 sec.

Average data read sizes per query: Idx1: 376 MB, Idx2: 12.85 MB.

Average numbers of postings per query: Idx1: 90.2 million, Idx2: 0.81 million.

The average query processing time improvement for multi-component key indexes (Idx2) is similar to that of the Q1 case.
Example queries. Let us consider the following query: "how to find the mean". The query times: SE1: 173.457 sec., SE2.1: 0.468 sec., SE2.2: 0.062 sec., SE2.3: 0.109 sec., and SE2.4: 0.094 sec. The query can be executed with multi-component key indexes significantly faster than with the ordinary index SE1.
Duplicates. Although the average query times for SE2.3 and SE2.4 are close, the new algorithm additionally handles queries with duplicate lemmas directly, overcoming a limitation of the previous algorithms (see the Conclusion).
13 Incremental solution
Let us consider the search query "Who I need you" and the following text.

The book that you are looking at is about the famous rock band "The Who". Their songs include "I Need You", "You", "One at a Time" and "Who are you".

The partial tracing of the search is presented below. We use the following values in this example: MaxDistance = 7, WindowSize = 14. The words are numbered, and these numbers are 1-based.
Shift, Start = 4 (start of Step 3; we use 4 to demonstrate the buffer switch).
Read the posting (19, 20, 15), key (i, need, who) (3.1).
Set (position 19, key i), buffer 1 (3.1).
Set (position 20, key need), buffer 1 (3.1).
Set (position 15, key who), buffer 0 (3.1).
Read the posting (21, 20, 15), key (you, need*, who*) (3.1).
Set (position 21, key you), buffer 1 (3.1).
Read the posting (21, 20, 28), key (you, need*, who*) (3.1).
Set (position 21, key you), buffer 1 (3.1).
Read the posting (22, 20, 15), key (you, need*, who*) (3.1).
Set (position 22, key you), buffer 1 (3.1).
Read the posting (22, 20, 28), key (you, need*, who*) (3.1).
Set (position 22, key you), buffer 1 (3.1).
Populate the Source queue using the data from the first buffer (3.1).
Fetch (position 15, key who) from the Source queue (3.2, 3.3).
Add (key who) into the Lemma table, Lemma.Count ≠ Lemma.Max (3.4).
Buffer switch, Start = 18 (3.6).
Populate the Source queue using the data from the first buffer (3.1).
Fetch (position 19, key i) from the Source queue (3.2, 3.3).
Add (key i) into the Lemma table, Lemma.Count ≠ Lemma.Max (3.4).
Fetch (position 20, key need) from the Source queue (3.2, 3.3).
Add (key need) into the Lemma table, Lemma.Count ≠ Lemma.Max (3.4).
Fetch (position 21, key you) from the Source queue (3.2, 3.3).
Add (key you) into the Lemma table, Lemma.Count = Lemma.Max (3.4).
Checking the Lemma table (3.5) → Result (from 15 to 21).

Please note that WindowSize should be 64 for better performance.
14 Conclusion
In this paper, we presented a new fast algorithm for proximity full-text search for queries that consist of high-frequently occurring words. In the first experiment, we improved the average query processing time by a factor of 1.09 with the new algorithm in comparison with the algorithm from [15], by a factor of 1.3 in comparison with the algorithm from [14], and by a factor of 1.5 in comparison with the algorithm from [17]. This improvement is achieved by exploiting the requirement that we need to find documents which contain the queried words near each other. We use three-component key indexes to solve the task.

In the second experiment (the GOV2 text collection), we improved the average query processing time by a factor of 1.59 with the new algorithm in comparison with the algorithm from [15], by a factor of 3.1 in comparison with the algorithm from [14], and by a factor of 7.6 in comparison with the algorithm from [17].

We have presented the results of the experiments, showing that the average query execution time with our indexes is 142.13 times less (with MaxDistance = 5) than that required when using ordinary inverted indexes, when queries that consist of high-frequently occurring words are evaluated.

As we discussed in [17], three-component key indexes occupy an important part of our holistic full-text search methodology. With them, we can evaluate queries that consist of high-frequently occurring words in an effective way. Other query types can be evaluated by using different additional indexes; those tasks are solved in [16] and, therefore, lie outside the scope of the current paper.

The new algorithm overcomes some limitations of our previous algorithms [14, 15], namely, working with duplicate lemmas in the query and creating additional intermediate data structures of considerable size in the memory.

In the future, it will be useful to optimize the index creation times for large values of MaxDistance. The new algorithm can also be used with any multi-component indexes and one-component indexes. The author is now creating indexes for relatively small values of MaxDistance, such as 5, 7, and 9.

The limitation of the proposed indexes is that we search only for documents that contain the queried words near each other. Documents in which the queried words occur at distances greater than MaxDistance can be skipped. This limitation can be overcome by combining the proximity search with additional indexes with the search without distance [17]. While the former requires a word-level index, the latter requires only a document-level index. In the search without distance, we need only those documents that contain the queried words anywhere. This approach can produce fine results from the performance point of view if the documents are relatively large, e.g., several hundreds of kilobytes each. On the other hand, modern approaches for calculating the relevance presuppose that the relevance of a document is inversely proportional to the square of the distance between the searched words in the document [20]. Using this consideration, we can easily select a value of MaxDistance that is large enough such that all relevant documents will be found by our additional indexes.
References
1. Anh, V.N., de Kretser, O., Moffat, A.: Vector-space ranking with effective early termination. In: SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 35-42. New Orleans, Louisiana, USA (2001). doi:10.1145/383952.383957
2. Borodin, A., Mirvoda, S., Porshnev, S., Ponomareva, O.: Improving generalized inverted index lock wait times. Journal of Physics: Conference Series, vol. 944, no. 1, Article number 012022 (2018). doi:10.1088/1742-6596/944/1/012022
3. Büttcher, S., Clarke, C., Lushman, B.: Term proximity scoring for ad-hoc retrieval on very large text collections. In: SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 621-622 (2006). doi:10.1145/1148170.1148285
4. Daoud, C.M., Silva de Moura, E., Carvalho, A., Soares da Silva, A., Fernandes, D., Rossi, C.: Fast top-k preserving query processing using two-tier indexes. Inf. Process. Manage., vol. 52, no. 5, pp. 855-872 (2016). doi:10.1016/j.ipm.2016.03.005
5. Fox, C.: A stop list for general text. ACM SIGIR Forum, vol. 24, pp. 19-35 (1989). doi:10.1145/378881.378888
6. Jansen, B.J., Spink, A., Saracevic, T.: Real life, real users, and real needs: a study and analysis of user queries on the web. Inf. Process. Manage., vol. 36, no. 2, pp. 207-227 (2000). doi:10.1016/S0306-4573(99)00056-4
7. Jiang, D., Leung, K.W.-T., Yang, L., Ng, W.: TEII: topic enhanced inverted index for top-k document retrieval. Knowl.-Based Syst., vol. 89, pp. 346-358 (2015). doi:10.1016/j.knosys.2015.07.014
8. Gall, M., Brost, G.: K-word proximity search on encrypted data. In: 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 365-372 (2016). doi:10.1109/WAINA.2016.104
9. Garcia, S., Williams, H.E., Cannane, A.: Access-ordered indexes. In: ACSC '04: Proceedings of the 27th Australasian Conference on Computer Science, Dunedin, New Zealand, pp. 7-14 (2004).
10. Lu, X., Moffat, A., Culpepper, J.S.: Efficient and effective higher order proximity modeling. In: ICTIR '16: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, pp. 21-30 (2016). doi:10.1145/2970398.2970404
11. Luk, R.W.P.: Scalable, statistical storage allocation for extensible inverted file construction. Journal of Systems and Software, vol. 84, no. 7, pp. 1082-1088 (2011). doi:10.1016/j.jss.2011.01.049
12. Sadakane, K.: Fast algorithms for k-word proximity search. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 84, no. 9, pp. 2311-2318 (2001).
13. Rasolofo, Y., Savoy, J.: Term proximity scoring for keyword-based retrieval systems. In: European Conference on Information Retrieval (ECIR) 2003: Advances in Information Retrieval, pp. 207-218 (2003). doi:10.1007/3-540-36618-0_15
14. Veretennikov, A.B.: Proximity full-text search with a response time guarantee by means of additional indexes with multi-component keys. In: Selected Papers of the XX International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2018), Moscow, Russia, October 9-12, 2018, pp. 123-130 (2018). http://ceur-ws.org/Vol-2277
15. Veretennikov, A.B.: Proximity full-text search by means of additional indexes with multi-component keys: in pursuit of optimal performance. In: Manolopoulos, Y., Stupnikov, S. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2018. Communications in Computer and Information Science, vol. 1003, pp. 111-130 (2019), Springer, Cham. doi:10.1007/978-3-030-23584-0_7
16. Veretennikov, A.B.: Proximity full-text search with a response time guarantee by means of additional indexes. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing, vol. 868, pp. 936-954 (2019), Springer, Cham. doi:10.1007/978-3-030-01054-6_66
17. Veretennikov, A.B.: Proximity full-text search with response time guarantee by means of three component keys. Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering, vol. 7, no. 1, pp. 60-77 (2018). (in Russian)
18. Williams, H.E., Zobel, J., Bahle, D.: Fast phrase querying with combined indexes. ACM Transactions on Information Systems (TOIS), vol. 22, no. 4, pp. 573-594 (2004). doi:10.1145/1028099.1028102
19. Williams, J.W.J.: Algorithm 232: heapsort. Communications of the ACM, vol. 7, no. 6, pp. 347-348 (1964).
20. Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.-R.: Efficient term proximity search with term-pair indexes. In: CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, pp. 1229-1238 (2010). doi:10.1145/1871437.1871593
21. Yang, Y., Ning, H.: Block linked list index structure for large data full text retrieval. In: 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pp. 2123-2128 (2017).
22. Zipf, G.: Relative frequency as a determinant of phonetic change. Harvard Studies in Classical Philology, vol. 40, pp. 1-95 (1929). doi:10.2307/408772
23. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Computing Surveys, vol. 38, no. 2, Article 6 (2006). doi:10.1145/1132956.1132959