Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes

Abstract

Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high frequency. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments.

This is a pre-print of a contribution published in Supplementary Proceedings of the XXII International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, P. 336-350, published by CEUR Workshop Proceedings. The final authenticated version is available online at: this http URL


Alexander B. Veretennikov, Ural Federal University, 620002 Mira street, Yekaterinburg, Russia


Keywords: Full-text search · Inverted indexes · Proximity search · Term proximity · Query processing · DAAT.

In full-text search, a query is a list of words. The result of the search is a list of documents containing these words. Consider a query that consists of words occurring with high frequency. In [16], we improved the average query processing time by a factor of 130 for these queries relative to traditional inverted indexes. A methodology for high-performance proximity full-text searches that covers different types of queries was discussed in [17].

The factor of proximity or nearness between the queried words in the indexed document plays an important role in modern information retrieval [19, 12, 3]. We assume that a document should contain query words near each other to be relevant for the user in the context of the search query. Taking this factor into account is essential if the query consists of frequently occurring words.

Some words occur in texts significantly more frequently than others. We can illustrate this [17] by referring to Zipf's law [20]. An example of a typical word occurrence distribution is presented in Fig. 1. The horizontal axis represents different words in decreasing order of their occurrence in texts. On the vertical axis, we plot the number of occurrences of each word. This peculiarity of language has a strong influence on the search performance.

The inverted index [21, 2] is a commonly used data structure for full-text searches. The traditional inverted index contains records (ID, P), where ID is the identifier of the document and P is the position of the word in the document, for example, an ordinal number of the word in the document. Each such record corresponds to an occurrence of a word in a document. These (ID, P) records are called “postings”. The inverted index enables us to obtain a list of postings that corresponds to the given word. These lists of postings are used for the search.

For proximity full-text searches, we need to store the (ID, P) record for every occurrence of every word in the indexed document [15, 8, 9]. In other words, for proximity searches, we require a word-level inverted index instead of a document-level index [8]. Consequently, if a word occurs frequently in the texts, then its list of postings is long [17]. The query search time is proportional to the number of occurrences of the queried words in the indexed documents. A search system needs much more time to process a query that contains words occurring with high frequency (the left side of Fig. 1) than a query that contains only ordinary words (the right side of Fig. 1).
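The word-level index described above can be illustrated with a minimal Python sketch. The function name `build_word_level_index`, the use of the list index as the document ID, and the use of the token index as the position P are assumptions made for this illustration only; they are not taken from the implementation used in the paper.

```python
from collections import defaultdict


def build_word_level_index(documents):
    """Map every word to its list of (ID, P) postings.

    `documents` is a list of token lists. The list index serves as the
    document ID and the token index as the position P (conventions
    chosen for this sketch, not prescribed by the paper).
    """
    index = defaultdict(list)
    for doc_id, tokens in enumerate(documents):
        for position, word in enumerate(tokens):
            # One posting per occurrence: this is what makes the index
            # word-level rather than document-level.
            index[word].append((doc_id, position))
    return dict(index)


docs = [
    "to be or not to be".split(),
    "the question is to be".split(),
]
index = build_word_level_index(docs)
to_postings = index["to"]
question_postings = index["question"]
```

In this toy collection, the posting list of “to” has three entries while that of “question” has one, which mirrors the observation that frequently occurring words produce long posting lists.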

Fig. 1. Example of a word frequency distribution.

According to [13], we can consider a full-text query as a “simple inquiry” [17]. In this case, we may require that the search results be provided within two seconds, as stated in [13], to prevent the interruption of the thought continuity of the user. To enhance the performance, the following approaches can be used.

Early-termination approaches can be employed for full-text searches [1, 11]. However, these methods are not effective in the case of proximity full-text searches [16]. Usually, early-termination approaches are applied to document-level indexes. It is difficult to combine the early-termination approach with the incorporation of term-proximity data into relevance models.

Additional indexes can improve the search performance. In [14, 18], additional indexes were used to improve phrase searches. However, the approaches reported in [14, 18] cannot be used for proximity full-text searches; their area of application is limited to phrase searches. We have overcome this limitation. With our additional indexes, an arbitrary query can be executed very quickly [17].

In an example given in [16], we indexed a subset of the Project Gutenberg web site using Apache Lucene and Apache Tika and performed some searches. A query that contained words occurring with high frequency was evaluated within 21 sec. On the other hand, another query that contained ordinary words was evaluated within 172 milliseconds. This difference is considerable. However, our algorithms reported in [16, 17] can help address this issue.

The goals and questions of this paper are as follows:
1) We need to examine how search performance is improved with consideration of different values of MaxDistance and other parameters.
2) We need to investigate search performance on the commonly used text collection.
3) Does the search performance depend on the document size?
4) What are other factors that can affect the performance?
5) We evaluate the performance with respect to both short and long queries.

Key points of our research are the following. We use the word-level index. We use the DAAT approach [6, 10]. We include in the indexes information about all lemmas. Our indexes support incremental updates. We can store one posting list in several data streams. We read entire posting lists when searching, and no early termination is used.

For this paper, we use an English dictionary with approximately 92 thousand English lemmas. This dictionary is used by a morphology analyzer. The analyzer produces a list of numbers of lemmas, that is, basic or canonical forms, for every word from the dictionary. Usually, a word has one lemma, but some words have several lemmas. For example, the word “mine” has two lemmas, namely, “mine” and “my”.

Consider an array of all lemmas. Let us sort all lemmas in decreasing order of their occurrence frequency in the texts. We call the result of sorting the FL-list [17]. The number of a lemma w in the FL-list is called its FL-number [17] and is denoted by FL(w). Let us say that lemma “earth” > “day” because FL(earth) = 309, FL(day) = 199, and 309 > 199. We use the FL-numbers to define the order of the lemmas in the collection of all lemmas.

In our search methodology [16], we defined three types of lemmas, namely, stop lemmas, frequently used lemmas and ordinary lemmas.

The first SWCount most frequently occurring lemmas are stop lemmas. Examples include “earth”, “yes”, “who”, “day”, “war”, “time”, “man” and “be”.

The next FUCount most frequently occurring lemmas are frequently used lemmas. Examples include “red”, “beautiful”, and “mountain”.

All other lemmas are ordinary lemmas, e.g., “fiber” and “undersea”.

SWCount and FUCount are the parameters. We use SWCount = 500 and FUCount = 1050 in the experiments described here. Let MaxDistance be a parameter that can take a value of 5, 7, 9 or even greater.

The value of SWCount is close to the value of 421 from [7]. However, as we include information about all lemmas in the indexes, we can use very different values of the parameters. If an ordinary lemma, q, occurs in the text so rarely that FL(q) is irrelevant, then we can say that FL(q) = ∼. Here, “∼” denotes a large number.

Let us discuss the stop-word approach, in which some words occurring with high frequency or their lemmas are excluded from the search. We do not agree with this approach. A word cannot be excluded from the search because even a word occurring with high frequency can have a specific meaning in the context of a specific query [16, 18]. Therefore, excluding some words from the search can lead to search quality degradation or unpredictable effects [18]. Additionally, stop words are often employed in higher-order term proximity feature models [5].

Consider the query “who are you who” [17]. The Who are an English rock band, and “Who are You” is one of their works. The word “Who” has a specific meaning in the context of this query. Therefore, in our approach, we include information about all of the words in the indexes. Moreover, we can easily see that modern search systems, such as Google, do not skip stop words in a search.
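The three lemma types can be expressed as a small Python sketch. The function name `lemma_type` is hypothetical, and the strict comparison FL(w) < SWCount for stop lemmas is an assumption taken from the index rules stated later in the paper; the FL-numbers in the usage lines are the ones quoted in the query examples of this paper.

```python
SW_COUNT = 500   # SWCount used in the experiments
FU_COUNT = 1050  # FUCount used in the experiments


def lemma_type(fl_number, sw_count=SW_COUNT, fu_count=FU_COUNT):
    """Classify a lemma by its FL-number (a lower number means a more frequent lemma)."""
    if fl_number < sw_count:
        return "stop"
    if fl_number < sw_count + fu_count:
        return "frequently used"
    return "ordinary"


earth = lemma_type(309)            # stop lemma
mountain = lemma_type(704)         # frequently used lemma
undersea = lemma_type(15873)       # ordinary lemma
rare = lemma_type(float("inf"))    # FL(q) = "~", modelled here as infinity
```

Modelling “∼” as infinity is a convenience of this sketch; any value above SWCount + FUCount would classify the lemma as ordinary.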

The three-component key (f, s, t) index [17] is the list of the occurrences of the lemma f for which lemmas s and t both occur in the text at distances that are less than or equal to MaxDistance from f. Each posting includes the distance between f and s in the text and the distance between f and t in the text. An (f, s, t) index is created only for the case in which f ≤ s ≤ t. Here, f, s, and t are all stop lemmas.

In [16], we considered queries consisting of words occurring with high frequency. In addition, we showed that the average query execution time can be improved with three-component key indexes by up to 15.6 times relative to the time necessary using two-component key indexes only. Therefore, (f, s, t) indexes are required if we need to evaluate queries that contain words occurring with high frequency.

The two-component key (w, v) index [17] is the list of occurrences of the lemma w for which lemma v occurs in the text at a distance that is less than or equal to MaxDistance from w. Each posting includes the distance between w and v in the text. Here, w denotes a frequently used lemma, and v denotes a frequently used or ordinary lemma.

Let us consider the traditional index with near stop word (NSW) records. For each occurrence of each ordinary or frequently used lemma in each document, we include a posting record (ID, P, NSW) in the index. ID can be the ordinal number of the specific document, and P is the position of the word in the document, e.g., the ordinal number of the word in the document. The NSW record contains information about all high-frequency lemmas, that is, stop lemmas, occurring near position P in the document (at a distance that is less than or equal to MaxDistance from P). Examples of the indexes are given in [17].

The posting list of a key is stored in several data streams [17]. For the traditional index with NSW records, we can use up to three streams for one key: one stream for (ID), one for (P) and one for (NSW). Alternatively, for this index, we can use two streams: the first is (ID, P), and the second is (NSW). The actual choice depends on the length of the posting list. For short lists, we use two streams; for long lists, we use three streams. This architecture allows us to skip NSW records when they are not required. For the (w, v) and (f, s, t) indexes, we use one or two data streams for every key.
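As an illustration of the two-component key (w, v) index defined above, the following Python sketch builds such an index for a toy collection. It is a simplified reading of the definition, not the author's implementation: the function name `build_wv_index`, the set-based lemma classes, and the storage of the distance as a signed offset are assumptions of this sketch.

```python
from collections import defaultdict


def build_wv_index(documents, frequently_used, stop, max_distance=5):
    """Return {(w, v): [(ID, P, offset), ...]} for a list of lemma lists.

    w must be a frequently used lemma; v may be any non-stop lemma that
    occurs within max_distance positions of w. The signed offset
    (position of v minus position of w) is an assumption of this sketch.
    """
    index = defaultdict(list)
    for doc_id, lemmas in enumerate(documents):
        for p, w in enumerate(lemmas):
            if w not in frequently_used:
                continue
            start = max(0, p - max_distance)
            end = min(len(lemmas), p + max_distance + 1)
            for q in range(start, end):
                v = lemmas[q]
                if q == p or v in stop:
                    continue  # skip w itself and stop lemmas
                index[(w, v)].append((doc_id, p, q - p))
    return dict(index)


docs = [["the", "mountain", "pass", "trail"]]
wv_index = build_wv_index(docs, frequently_used={"mountain", "pass"}, stop={"the"})
narrow_index = build_wv_index(
    docs, frequently_used={"mountain", "pass"}, stop={"the"}, max_distance=1
)
```

With `max_distance=1`, the key (mountain, trail) disappears because the two lemmas are two positions apart, which shows how MaxDistance bounds the size of the index.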

We create the following indexes:

Idx0: the traditional inverted index without any enhancements, such as NSW records. The total size is 143 GB. This value includes the total size of indexed texts in compressed form, which is 57.3 GB.

Idx5: our indexes, including the traditional inverted index with the NSW records and the (w, v) and (f, s, t) indexes, where MaxDistance = 5. The total size is 1.29 TB, the total size of the (w, v) index is 104 GB, the total size of the (f, s, t) index is 727 GB, and the total size of the traditional index with NSW records is 192 GB.

Idx7: our indexes, where MaxDistance = 7. The total size is 2.16 TB, the total size of the (w, v) index is 148 GB, the total size of the (f, s, t) index is 1.422 TB, and the total size of the traditional index with NSW records is 239 GB.

Idx9: our indexes, where MaxDistance = 9. The total size is 3.27 TB, the total size of the (w, v) index is 191 GB, the total size of the (f, s, t) index is 2.349 TB, and the total size of the traditional index with NSW records is 283 GB.

For the experiment, the GOV2 [4] text collection and the following queries are used: title queries from TREC Robust Task 2004 (250 queries in total), title queries from TREC Terabyte Task from 2004 to 2006 (150 queries in total), title queries from TREC Web Task from 2009 to 2014 (300 queries in total), and queries from TREC 2007 Million Query Track (10 000 queries in total).


The total size of the query set after duplicate removal is 10 665 queries. The GOV2 text collection contains 25 million documents. The total size of the collection is approximately 426 GB, and after HTML tag removal, there is approximately 167 GB of plain text. The average document text size is approximately 7 KB.

We used the following computational resources: CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz. HDD: 7200 RPM. RAM: 24 GB. OS: Microsoft Windows 2008 R2 Enterprise.

The query set can be divided into the following subsets depending on the lemmas in a concrete query [17]. All queries are evaluated within one program thread.

If we have a query with a length greater than MaxDistance, then we should divide it into several parts. For example, when the value of MaxDistance is 5, the query “to be or not to be that is the question” should be rewritten as “(to be or not to) (be that is the question)”, and these two parts should be evaluated independently; then, the results should be combined.
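The division of a long query into parts can be sketched in Python as follows. The function name `split_query` is hypothetical, and the rule of cutting the query into consecutive, non-overlapping parts of at most MaxDistance words is an assumption inferred from the single example above; how the partial results are combined is not covered here.

```python
def split_query(words, max_distance):
    """Split a query into consecutive parts of at most max_distance words."""
    return [
        words[i:i + max_distance]
        for i in range(0, len(words), max_distance)
    ]


query = "to be or not to be that is the question".split()
parts = split_query(query, max_distance=5)
```

For the example query and MaxDistance = 5, the sketch yields the two parts “to be or not to” and “be that is the question”, which matches the rewriting given in the text.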

In the first subset (Q1), every query contains only stop lemmas. There are 119 queries in this subset. Examples include the following: to be or not to be that is the question, kids earth day activities. With FL-numbers, we have the following queries: [to: 9] [be: 7] [or: 38] [not: 64] [to: 9] [be: 7] [that: 40] [be: 7] [the: 1] [question: 305] and [kid: 447] [earth: 309] [day: 199] [activity: 247].

For these queries, the (f, s, t) indexes are used [16].

Average query processing times: Idx0: 51.4 s, Idx5: 0.82 s, Idx7: 0.86 s, Idx9: 1.05 s (see Fig. 2).
Average data read sizes per query: Idx0: 1.3 GB, Idx5: 11.1 MB, Idx7: 15.6 MB, Idx9: 20.1 MB.
Average numbers of postings per query: Idx0: 317.8 million, Idx5: 0.88 million, Idx7: 1.15 million, Idx9: 1.5 million.

We improved the query time by a factor of 62.7 with Idx5, by a factor of 59.4 with Idx7, and by a factor of 48.7 with Idx9.

Fig. 2. The average query execution times for Idx0, Idx5, Idx7, and Idx9.

In the second subset (Q2), every query contains one or several stop lemmas. The query also contains some other lemmas that may be frequently used or ordinary. There are 7 244 queries in the subset. Examples include the following: History of Physicians in America. With FL-numbers, we have the following query: [history: 598] [of: 4] [physician: 1760] [in: 14] [America: 1391]

For these queries, we need to read one posting list with NSW records from the traditional index for some query lemma. This lemma is designated as the “main” lemma of the query. If there is a frequently used lemma in the query, then we can use two-component key indexes, like (physician, history), for other lemmas. If the query consists of only stop and ordinary lemmas, then we use posting lists from the traditional index for the remaining ordinary lemmas but without reading the NSW records. The details are described in [17].

Average query times: Idx0: 50 s, Idx5: 1.9 s, Idx7: 2 s, Idx9: 2.14 s.
Average data read sizes per query: Idx0: 1.3 GB, Idx5: 64.3 MB, Idx7: 68.7 MB, Idx9: 76 MB.
Average numbers of postings per query: Idx0: 324.7 million, Idx5: 3.7 million, Idx7: 3.4 million, Idx9: 3.3 million.

The number of postings decreases slightly from Idx5 to Idx7 and from Idx7 to Idx9. We improved the query processing time by a factor of 25.6 with Idx5, by a factor of 24.7 with Idx7, and by a factor of 23.3 with Idx9.

In the next subset, every query contains only frequently used lemmas. There are 79 queries in this subset. Examples include the following: california mountain pass. With FL-numbers, we have the following query: [california: 518] [mountain: 704] [pass: 528]

Two-component (w, v) key indexes are used. For example, we can use the (california, pass) and (*pass, mountain) two-component key indexes.

Average query times: Idx0: 2.5 s, Idx5: 0.32 s, Idx7: 0.33 s, Idx9: 0.34 s.
Average data read sizes per query: Idx0: 53.3 MB, Idx5: 4.69 MB, Idx7: 4.79 MB, Idx9: 4.93 MB.
Average numbers of postings per query: Idx0: 9.4 million, Idx5: 0.59 million, Idx7: 0.6 million, Idx9: 0.61 million.

We improved the query time by a factor of 7.83 with Idx5, by a factor of 7.58 with Idx7, and by a factor of 7.44 with Idx9.

In the next subset, every query contains frequently used lemmas and ordinary lemmas. There are 1 388 queries in this subset. Examples include the following: Scalable Vector Graphics. With FL-numbers, we have the following query: [scalable: ∼] [vector: 2953] [graphics: 921]

Each query contains a frequently used lemma; therefore, two-component key indexes can be used. For example, we can use the (graphics, scalable) and (graphics, vector) two-component key indexes.

Average query times: Idx0: 2.2 s, Idx5: 0.36 s, Idx7: 0.34 s, Idx9: 0.32 s.
Average data read sizes per query: Idx0: 41.7 MB, Idx5: 1.5 MB, Idx7: 1.6 MB, Idx9: 1.8 MB.
Average numbers of postings per query: Idx0: 8.1 million, Idx5: 0.17 million, Idx7: 0.17 million, Idx9: 0.17 million.

We improved the query time by a factor of 6 with Idx5, by a factor of 6.4 with Idx7, and by a factor of 6.8 with Idx9.

In the last subset, every query contains only ordinary lemmas. There are 1 835 queries in this subset. Examples include the following: Undersea Fiber Optic Cable. With FL-numbers, we have the following query: [undersea: 15873] [fiber: 3127] [optic: 2986] [cable: 2771]

We use the traditional index only for these queries. We do not need to read the NSW records because they are stored in separate streams of data.

Average query times: Idx0: 0.789 s, Idx5: 0.819 s, Idx7: 0.8 s, Idx9: 0.81 s.
Average data read sizes per query: Idx0: 14.2 MB, Idx5: 15.6 MB, Idx7: 15.7 MB, Idx9: 15.8 MB.
Average numbers of postings per query: Idx0: 2.89 million, Idx5: 2.89 million, Idx7: 2.89 million, Idx9: 2.89 million.

The query processing time for these queries does not need any improvement.
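The division of the query set into the five subsets described above can be sketched in Python. The function name `query_subset` and the returned labels are hypothetical; the thresholds follow SWCount = 500 and FUCount = 1050 from the experiments, and the FL-numbers in the usage lines are the ones listed in the examples above, with “∼” modelled as infinity.

```python
SW_COUNT = 500
FU_COUNT = 1050


def query_subset(fl_numbers, sw_count=SW_COUNT, fu_count=FU_COUNT):
    """Assign a query, given as the FL-numbers of its lemmas, to a subset."""
    has_stop = any(n < sw_count for n in fl_numbers)
    has_frequent = any(sw_count <= n < sw_count + fu_count for n in fl_numbers)
    has_ordinary = any(n >= sw_count + fu_count for n in fl_numbers)
    if has_stop and not has_frequent and not has_ordinary:
        return "only stop lemmas"
    if has_stop:
        return "stop lemmas and other lemmas"
    if has_frequent and not has_ordinary:
        return "only frequently used lemmas"
    if has_frequent:
        return "frequently used and ordinary lemmas"
    return "only ordinary lemmas"


kids_earth_day = query_subset([447, 309, 199, 247])
history_physicians = query_subset([598, 4, 1760, 14, 1391])
california_pass = query_subset([518, 704, 528])
vector_graphics = query_subset([float("inf"), 2953, 921])
undersea_cable = query_subset([15873, 3127, 2986, 2771])
```

Each example query falls into the subset under which it is listed in the text, which gives a quick consistency check of the thresholds.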

Over all queries in the query set, we have the following results.

Average query times: Idx0: 34.9 s, Idx5: 1.51 s, Idx7: 1.57 s, Idx9: 1.66 s.
Average data read sizes per query: Idx0: 0.93 GB, Idx5: 46.4 MB, Idx7: 49.5 MB, Idx9: 54.5 MB.
Average numbers of postings per query: Idx0: 225.7 million, Idx5: 3.08 million, Idx7: 2.85 million, Idx9: 2.76 million.

We improved the query time by a factor of 23.1 with Idx5, by a factor of 22.4 with Idx7, and by a factor of 21 with Idx9.

The average time that is needed for the search with Idx5, Idx7, and Idx9 is approximately 1-2 sec. for every query type.

For Idx0, we need 2-2.5 sec. for the search if the query consists only of ordinary or frequently used lemmas. For the queries that contain any stop lemma (Q1, Q2), Idx0 requires approximately 50 sec. on average (51.4 s for Q1 and 50 s for Q2).

This means that the search in Idx5, Idx7, and Idx9 is fast for every query type, whereas with Idx0, the search system has a performance problem if the query contains any lemma occurring with high frequency.

The improvement factors obtained here for GOV2 are lower than the factor of 130 reported in [16]. The difference can be explained by the way posting lists are encoded. Let us consider two documents, D0 and D1, and a word w. For w, we have the following posting list in the traditional index: (0, 1), (0, …), (0, …), (1, 2), (1, …).

We use two very common encoding schemes. The idea of the first scheme is as follows. We group the records that are related to a specific document. We convert the original posting list as follows: (0, (1, …, …)), (1, (2, …)). For other kinds of indexes, we do the same. Therefore, we store in the index the ID of the document and then the list of the word's positions.

The second scheme is a delta-encoding scheme: every position after the first one in a document is replaced by its difference from the previous value in the list. For example, in the group (1, …, …), the second value is stored as its difference from 1, where 1 is the previous value in the list.

The first scheme is much more effective for text collections that consist of large documents. This explains the difference in the “improvement factor” in the experiments with the two aforementioned collections. However, with our method, we can use any other encoding scheme and any other inverted index organization.

In addition, the “improvement factor” can depend on the structure of the query set. To analyze this question, we performed an additional experiment. We formed another query set, using the method from [16]. Consequently, we selected a document from the TREC GOV2 collection. We used the content of the document to produce a set of 3500 queries. We performed our experiments again, using this query set. The “improvement factor” was similar to the foregoing results that we have for TREC GOV2.

The next question is, how can we select the values of our parameters? The value of MaxDistance affects relevance and should be determined according to the selected relevance function [17]. The results of the experiments allow us to predict how a change in MaxDistance affects the search time and index size. Let us now discuss how the values of SWCount and FUCount affect these metrics.

The value of SWCount is more important than the value of FUCount because it affects Q1 and Q2 queries, which are the most complex from the performance point of view. Let us consider SWCount now, using the Q2 query subset.
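Returning briefly to the two posting-list encoding schemes discussed earlier in this section, the following Python sketch shows grouping by document followed by delta-encoding of the positions. The function names and the concrete position values are hypothetical and chosen only for illustration; the sketch does not reproduce the byte-level format of the indexes used in the experiments.

```python
from itertools import accumulate, groupby


def group_by_document(postings):
    """[(ID, P), ...] sorted by ID and P -> [(ID, [P, ...]), ...]."""
    return [
        (doc_id, [p for _, p in group])
        for doc_id, group in groupby(postings, key=lambda posting: posting[0])
    ]


def delta_encode(grouped):
    """Keep the first position of each document; store later ones as differences."""
    return [
        (doc_id, [positions[0]] + [b - a for a, b in zip(positions, positions[1:])])
        for doc_id, positions in grouped
    ]


def delta_decode(encoded):
    """Restore the flat [(ID, P), ...] list from the delta-encoded form."""
    return [
        (doc_id, p)
        for doc_id, deltas in encoded
        for p in accumulate(deltas)
    ]


# Hypothetical postings of a word w in documents 0 and 1.
postings = [(0, 1), (0, 5), (0, 7), (1, 2), (1, 4)]
grouped = group_by_document(postings)
encoded = delta_encode(grouped)
restored = delta_decode(encoded)
```

In a large document, a word tends to have many positions under a single document ID, so both the grouping and the small deltas save space; this is consistent with the remark that the first scheme is more effective for collections of large documents.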

We use Zipf's law [20] as a representation of our word occurrence distribution. According to this law, the second most frequently occurring lemma occurs half as often as the first, the third most frequently occurring lemma occurs 1/3 as often as the first, and so on. Let $V$ be the number of occurrences of the most frequently occurring lemma. Therefore, our lemmas have the following numbers of occurrences: $V, V/2, V/3, \ldots$

Let us consider a query $q = (q_1, q_2, \ldots, q_n)$ from Q2. Here, $q_i$ is the number of the corresponding lemma in the FL-list. The query contains one or several stop lemmas and some other lemmas that may be frequently used or ordinary. To evaluate the query using the traditional index, we need to read the following number of postings from the index: $\sum_{i=1}^{n} V/q_i$.

Let, without loss of generality, $q_1$ and $q_2$ be stop lemmas and $q_n$ be the main lemma of the query. With the use of NSW records, we need to read the following number of postings: $\sum_{i=3}^{n} V/q_i + (V/q_n) \times (NSWFactor - 1)$. We do not need to read the posting lists of $q_1$ and $q_2$. However, for $q_n$ we need to read its posting list with NSW records. The size of the posting (ID, P, NSW) in bytes is up to NSWFactor = 4.… times the size of the posting (ID, P). The experiments with MaxDistance = 5 show this. Therefore, we can calculate the “planned performance gain” as follows:

$PPG(q) = \left( \sum_{i=1}^{n} 1/q_i \right) / \left( \sum_{i=3}^{n} 1/q_i + (1/q_n) \times (NSWFactor - 1) \right)$.

If we have a query set $Q$, then we can calculate the average planned performance gain, $APPG(Q) = \frac{1}{|Q|} \sum_{q \in Q} PPG(q)$. Let us now use only Q2 queries for the estimations. If a query $q$ is not a Q2 query, then let $PPG(q)$ be 1. Let $APPG(Q, SWCount)$ be the average planned performance gain that is calculated for a specific value of SWCount.

For the query set that we use in this paper, i.e., 10 665 queries, we have the following: APPG(Q, …), APPG(Q, …), APPG(Q, …).

However, this model does not take into account the (w, v) and (f, s, t) indexes. In the future, we plan to develop more precise models. However, from this model, we can predict that SWCount should not be increased.

The foregoing results need a more detailed examination. Let us consider a search query. The query consists of some set of lemmas. Let the Min-FL-number be the minimum FL-number among all lemmas of the query. A lower FL-number corresponds to a more frequently occurring lemma. If the Min-FL-number of a query is a small number, then the query can induce performance problems because the query contains some lemma occurring with high frequency.

Then, we divide the entire query set into subsets based on the Min-FL-number of queries. We select 100 as the division step. In the first subset, we include all queries with Min-FL-numbers from 0 to 99; in the second subset, we include all queries with Min-FL-numbers from 100 to 199; and so on. In the following diagrams, we consider the first 21 subsets.

In Fig. 3, the average query execution time for Idx0 is shown. We can see that the first subset induces some performance problems. The first 8 subsets have an average query execution time of more than two seconds.

Fig. 3. The average query execution times for Idx0 (seconds); the query set is divided based on the Min-FL-number with a step of 100.
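The planned performance gain model defined above can be evaluated with a short Python sketch. The function names `ppg` and `appg`, the predicate `is_q2`, and the value `nsw_factor=4.0` are assumptions of this sketch: the exact NSWFactor is not reproduced here, so 4.0 is only a placeholder. As in the model, the FL-numbers are ordered so that the first two belong to stop lemmas and the last one to the main lemma.

```python
def ppg(fl_numbers, nsw_factor):
    """Planned performance gain PPG(q) for one Q2 query.

    fl_numbers is (q_1, ..., q_n), where q_1 and q_2 are stop lemmas and
    q_n is the main lemma. The factor V cancels out of the ratio, so only
    the reciprocals 1/q_i are needed.
    """
    traditional = sum(1.0 / q for q in fl_numbers)
    with_nsw = (
        sum(1.0 / q for q in fl_numbers[2:])
        + (1.0 / fl_numbers[-1]) * (nsw_factor - 1.0)
    )
    return traditional / with_nsw


def appg(queries, nsw_factor, is_q2):
    """Average planned performance gain; non-Q2 queries contribute a gain of 1."""
    gains = [ppg(q, nsw_factor) if is_q2(q) else 1.0 for q in queries]
    return sum(gains) / len(gains)


NSW_FACTOR = 4.0  # placeholder value, not the figure from the paper

# Hypothetical queries: two stop lemmas plus a main lemma, and a query without stop lemmas.
q2_query = [2, 4, 1000]
other_query = [600, 2000]

gain = ppg(q2_query, NSW_FACTOR)
average_gain = appg(
    [q2_query, other_query],
    NSW_FACTOR,
    is_q2=lambda q: min(q) < 500 <= max(q),
)
```

For the hypothetical query, the traditional index reads postings proportional to 1/2 + 1/4 + 1/1000, whereas the NSW-based evaluation reads only 4/1000, so the gain is dominated by the stop lemmas that no longer need to be read.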

Fig. 4. The average query execution times for Idx… (seconds); the query set is divided based on the Min-FL-number with a step of 100.

In Fig. 4, the average query execution time for Idx… is shown.

Fig. 5. The improvement factor for Idx…; the query set is divided based on the Min-FL-number with a step of 100.

In Fig. 5, we show the improvement factor for Idx… [...] the Min-FL-number of a query crosses the SWCount of 500. From this diagram, we can propose that the value of SWCount can be lowered to 100. Consequently, we created another index, Idx/SW100, where SWCount = 100 and FUCount = 1450.

In Fig. 6, the average query execution time for Idx/SW100 is shown.

Fig. 6. The average query execution times for Idx/SW100 (seconds); the query set is divided based on the Min-FL-number with a step of 100.

However, let us consider Q1 queries when the value of SWCount is 500, which are the queries that consist only of lemmas with FL-number < 500. [...]

Fig. 7. The average query execution times for Idx0, Idx5, and Idx/SW100 (seconds); the query set Q1 is divided based on the Min-FL-number with a step of 100.

Let us divide query set Q1 into five subsets based on the Min-FL-number value of the concrete query. In Fig. 7, the average query execution time for Idx0, Idx5, and Idx/SW100 is shown.

[...] The (f, s, t) indexes work better for these queries than do other types of indexes, as shown by Fig. 7. The search in Idx/SW100 is significantly slower than that in Idx5. When the search is performed using Idx5, the (f, s, t) indexes are used for every query in T1. When the search is performed using Idx/SW100, the (f, s, t) indexes are used only if FL(f) < FL(s) < FL(t) < 100 [...] FL-number ≥ [...].

The (w, v) indexes allow one to achieve a better performance improvement, as shown in Fig. 6. For subsets starting from FL-number = 100, no (f, s, t) indexes or NSW records are used when we do the search using Idx/SW100. [...] Idx/SW100 is better than Idx… [...] FL-number < [...] the (w, v) indexes and NSW records are required. The first bar in Figures 3, 4 and 6 supports this, and our previous experiments in [16] confirm it.

The original schema can be represented by the following rules:
1) (f, s, t) indexes, where

FL(f), FL(s), FL(t) < SWCount.
2) (w, v) indexes, where SWCount ≤ FL(w) < SWCount + FUCount, and SWCount ≤ FL(v).
3) Traditional indexes (x) with NSW records, SWCount ≤ FL(x); the NSW records contain information about all lemmas y with the condition FL(y) < SWCount that occur near lemma x in the text.

In the proposed schema, the following parameters are used.
EHFCount = 100 – for the most frequently occurring lemmas.
HFCount = 400 – for high-frequency occurring lemmas.
FUCount = 1050 – for frequently used lemmas.

We propose using the following indexes.
1) (f, s, t) indexes that can be used for T1 queries, where
FL(f), FL(s), FL(t) < EHFCount + HFCount = 500,
2) (w, v) indexes that can be used for T2 and T3 queries, where
100 = EHFCount ≤ FL(w) < EHFCount + HFCount + FUCount = 1550,
100 = EHFCount ≤ FL(v).

3) Traditional indexes (x) with NSW records, 100 = EHFCount ≤ FL(x); the NSW records contain information about all lemmas y with the condition FL(y) < EHFCount = 100 that occur near lemma x in the text. These indexes can be used for T3 queries.

The concrete values of the parameters are provided only as an example and can differ across languages and text collections.

In this paper, we investigated how multi-component key indexes help to improve search performance. We used the well-known GOV2 text collection. We proposed a method of analyzing the search performance by considering different types of queries. By following this method, we found that the performance can be improved further and proposed a new index schema.

We analyzed how the value of MaxDistance affects the search performance. With an increase in the value of MaxDistance from 5 to 9, the average search time using multi-component key indexes increased from 1.51 sec. to 1.66 sec. Therefore, the value of MaxDistance can be increased even further, and the main limitations here are the disk space and the indexing time. Our multi-component indexes are relatively large for large values of MaxDistance. However, large hard disk drives are now available. In many cases, it would be preferable to spend several TB of disk space to increase the search speed by a factor of 20 or more.

We found that multi-component key indexes work significantly better on text collections with large documents (e.g., documents with sizes of approximately several hundred KB or more) than on text collections that consist of small documents; thus, an improvement factor of 20 can be considered a minimum improvement factor. In the future, it is important to consider how different compression technologies can reduce the total index size, which will increase the search speed even more.

The proposed indexes with multi-component keys have one limitation. If a document contains the queried words and the distance between these words is greater than MaxDistance, then this document can be absent from the search results. First, this is usually not a problem if the average document size in the text collection is relatively large, e.g., several hundred kilobytes. In this case, after the proximity search with multi-component key indexes, we can run a search without distance. Whereas the former requires the word-level index, the latter needs only the document-level index and works significantly faster.

Second, in the majority of modern relevance models, the weight of a document is inversely proportional to the square of the distance between the queried words in the document [19]. With a relatively large value of MaxDistance, we can be sure that all relevant documents will occur in the search results. Although the first consideration is questionable for the GOV2 collection, because its documents are small, the second remains valid.
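
The two considerations can be illustrated with a small Python sketch. It is not the implementation used in the experiments: the in-memory dictionaries, the function names, and the use of the minimal span between queried words as the "distance" are assumptions made only for illustration. The first stage plays the role of the proximity search bounded by MaxDistance, and the second stage plays the role of the search without distance that needs only a document-level index.

```python
from itertools import product


def min_span(position_lists):
    # Smallest (max - min) over all ways to pick one position per queried word.
    return min(max(c) - min(c) for c in product(*position_lists))


def proximity_weight(distance):
    # Hypothetical weight, inversely proportional to the squared distance.
    return 1.0 / (distance * distance) if distance > 0 else 1.0


def proximity_stage(word_level_index, query, max_distance):
    # Stage 1: keep documents whose queried words fit within max_distance.
    hits = {}
    for doc_id, postings in word_level_index.items():
        if all(word in postings for word in query):
            span = min_span([postings[word] for word in query])
            if span <= max_distance:
                hits[doc_id] = proximity_weight(span)
    return hits


def document_stage(doc_level_index, query):
    # Stage 2: documents that contain every queried word, positions ignored.
    doc_sets = [doc_level_index.get(word, set()) for word in query]
    return set.intersection(*doc_sets) if doc_sets else set()


# Toy data (assumed): word positions per document, and the matching
# document-level index that stores only document identifiers per word.
word_level_index = {
    "d1": {"who": [0], "are": [1], "you": [2]},
    "d2": {"who": [0], "are": [10], "you": [30]},
    "d3": {"who": [4], "are": [5]},
}
doc_level_index = {
    "who": {"d1", "d2", "d3"},
    "are": {"d1", "d2", "d3"},
    "you": {"d1", "d2"},
}
query = ["who", "are", "you"]

near = proximity_stage(word_level_index, query, max_distance=5)
far = document_stage(doc_level_index, query) - set(near)

print(near)         # {'d1': 0.25}
print(sorted(far))  # ['d2']
```

In this toy run, document d2 is missed by the bounded proximity stage and recovered by the document-level stage. Under the inverse-square weight, a document whose words are 9 positions apart already receives a weight of only 1/81, which suggests why documents beyond a reasonably large MaxDistance contribute little to the ranking.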

References

1. Anh, V.N., de Kretser, O., Moffat, A.: Vector-space ranking with effective early termination. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 35–42. SIGIR ’01, ACM, New York, NY, USA (2001). https://doi.org/10.1145/383952.383957
2. Borodin, A., Mirvoda, S., Porshnev, S., Ponomareva, O.: Improving generalized inverted index lock wait times. Journal of Physics: Conference Series (1), 012022 (Jan 2018). https://doi.org/10.1088/1742-6596/944/1/012022
3. Broschart, A., Schenkel, R.: High-performance processing of text queries with tunable pruned term and term pair indexes. ACM Trans. Inf. Syst. (1) (Mar 2012). https://doi.org/10.1145/2094072.2094077
4. Büttcher, S., Clarke, C., Soboroff, I.: The TREC 2006 terabyte track. In: Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006. pp. 128–141 (2006)
5. Crane, M., Culpepper, J.S., Lin, J., Mackenzie, J., Trotman, A.: A comparison of document-at-a-time and score-at-a-time query evaluation. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. pp. 201–210. WSDM ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018726
6. Daoud, C.M., Silva de Moura, E., Carvalho, A., Soares da Silva, A., Fernandes, D., Rossi, C.: Fast top-k preserving query processing using two-tier indexes. Inf. Process. Manage. (5), 855–872 (Sep 2016). https://doi.org/10.1016/j.ipm.2016.03.005
7. Fox, C.: A stop list for general text. SIGIR Forum (1-2), 19–21 (Sep 1989). https://doi.org/10.1145/378881.378888
8. Gall, M., Brost, G.: K-word proximity search on encrypted data. In: 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA). pp. 365–372. IEEE (2016). https://doi.org/10.1109/WAINA.2016.104
9. Imran Rafique, M., Hassan, M.: Utilizing distinct terms for proximity and phrases in the document for better information retrieval. In: 2014 International Conference on Emerging Technologies (ICET). pp. 100–105. IEEE (2014). https://doi.org/10.1109/ICET.2014.7021024
10. Jiang, D., Leung, K.W.T., Yang, L., Ng, W.: TEII: Topic enhanced inverted index for top-k document retrieval. Know.-Based Syst. (C), 346–358 (Nov 2015). https://doi.org/10.1016/j.knosys.2015.07.014
11. Lin, J., Trotman, A.: Anytime ranking for impact-ordered indexes. In: Proceedings of the 2015 International Conference on The Theory of Information Retrieval. pp. 301–304. ICTIR ’15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2808194.2809477
12. Lu, X., Moffat, A., Culpepper, J.S.: Efficient and effective higher order proximity modeling. In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. pp. 21–30. ICTIR ’16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2970398.2970404
13. Miller, R.B.: Response time in man-computer conversational transactions. In: Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I. pp. 267–277. AFIPS ’68 (Fall, part I), ACM, USA (1968). https://doi.org/10.1145/1476589.1476628
14. Petri, M., Moffat, A.: On the cost of phrase-based ranking. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 931–934. SIGIR ’15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2766462.2767769
15. Sadakane, K., Imai, H.: Fast algorithms for k-word proximity search. In: IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences. vol. 84, pp. 2311–2318 (2001)
16. Veretennikov, A.B.: Proximity full-text search by means of additional indexes with multi-component keys: In pursuit of optimal performance. In: Manolopoulos Y., Stupnikov S. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2018. Communications in Computer and Information Science. vol. 1003, pp. 111–130. Springer. https://doi.org/10.1007/978-3-030-23584-0_7
17. Veretennikov, A.B.: Proximity full-text search with a response time guarantee by means of additional indexes. In: Arai K., Kapoor S., Bhatia R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing. vol. 868, pp. 936–954. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_66
18. Williams, H.E., Zobel, J., Bahle, D.: Fast phrase querying with combined indexes. ACM Trans. Inf. Syst. (4), 573–594 (Oct 2004). https://doi.org/10.1145/1028099.1028102
19. Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.R.: Efficient term proximity search with term-pair indexes. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1229–1238. CIKM ’10, ACM, New York, NY, USA (2010). https://doi.org/10.1145/1871437.1871593
20. Zipf, G.K.: Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press (1932)
21. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38