Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes
Alexander B. Veretennikov, Ural Federal University, 620002 Mira street, Yekaterinburg, Russia, [email protected]
Abstract.
Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document as relevant if the queried words are near each other in the document. The proximity factor is even more significant for the case where the query consists of frequently occurring words. Proximity full-text search requires the storage of information for every occurrence in documents of every word that the user can search. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at distances from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can be used to improve the average query execution time by up to 130 times for queries that consist of words occurring with high frequency. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments for reproducibility of the results. We propose a new index schema after the analysis of the results of the experiments.
Keywords: Full-text search · Inverted indexes · Proximity search · Term proximity · Query processing · DAAT.
In full-text search, a query is a list of words. The result of the search is a list of documents containing these words. Consider a query that consists of words occurring with high frequency. In [16], we improved the average query processing time by a factor of 130 for these queries relative to traditional inverted indexes.
A methodology for high-performance proximity full-text searches that covers different types of queries was discussed in [17].

The factor of proximity or nearness between the queried words in the indexed document plays an important role in modern information retrieval [19, 12, 3]. We assume that a document should contain query words near each other to be relevant for the user in the context of the search query. Taking this factor into account is essential if the query consists of frequently occurring words.

Some words occur in texts significantly more frequently than others. We can illustrate this [17] by referring to Zipf’s law [20]. An example of a typical word occurrence distribution is presented in Fig. 1. The horizontal axis represents different words in decreasing order of their occurrence in texts. On the vertical axis, we plot the number of occurrences of each word. This peculiarity of language has a strong influence on the search performance.

The inverted index [21, 2] is a commonly used data structure for full-text searches. The traditional inverted index contains records (ID, P), where ID is the identifier of the document and P is the position of the word in the document, for example, an ordinal number of the word in the document. This record corresponds to an occurrence of a word in a document. These (ID, P) records are called “postings”. The inverted index enables us to obtain a list of postings that corresponds to the given word. These lists of postings are used for the search.

For proximity full-text searches, we need to store the (ID, P) record for every occurrence of every word in the indexed document [15, 8, 9]. In other words, for proximity searches, we require a word-level inverted index instead of a document-level index [8]. Consequently, if a word occurs frequently in the texts, then its list of postings is long [17]. The query search time is proportional to the number of occurrences of the queried words in the indexed documents. To process a search query that contains words occurring with high frequency, as shown on the left side of Fig. 1, a search system requires much more time than for a query that contains only ordinary words, as shown on the right side of Fig. 1.
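The word-level index described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function name, the whitespace tokenizer, and the two sample documents are assumptions made only for illustration, and no lemmatization is performed.

```python
from collections import defaultdict


def build_word_level_index(documents):
    """Map each word to its list of (ID, P) postings.

    ID is the ordinal number of the document and P is the ordinal
    number of the word inside that document (both start at 0 here).
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Whitespace splitting stands in for real tokenization.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


# Hypothetical sample documents.
docs = ["to be or not to be", "the question is to be asked"]
index = build_word_level_index(docs)

# A frequently occurring word has a longer posting list than a rare one.
print(len(index["be"]), len(index["question"]))
```

Because every occurrence produces a posting, the list for a frequent word grows with the collection, which is the cost the additional indexes are meant to reduce.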
Fig. 1.
Example of a word frequency distribution.
According to [13], we can consider a full-text query as a “simple inquiry” [17]. In this case, we may require that the search results be provided within two seconds, as stated in [13], to prevent the interruption of the thought continuity of the user. To enhance the performance, the following approaches can be used.
Early-termination approaches can be employed for full-text searches [1, 11]. However, these methods are not effective in the case of proximity full-text searches [16]. Usually, early-termination approaches are applied to document-level indexes. It is difficult to combine the early-termination approach with the incorporation of term-proximity data into relevance models.

Additional indexes can improve the search performance. In [14, 18], additional indexes were used to improve phrase searches. However, the approaches reported in [14, 18] cannot be used for proximity full-text searches. Their area of application is limited to phrase searches. We have overcome this limitation. With our additional indexes, an arbitrary query can be executed very quickly [17].

In an example given in [16], we indexed a subset of the Project Gutenberg web site using Apache Lucene and Apache Tika and performed some searches. A query that contained words occurring with high frequency was evaluated within 21 sec. On the other hand, another query that contained ordinary words was evaluated within 172 milliseconds. This difference is considerable. However, our algorithms reported in [16, 17] can help address this issue.

The goals and questions of this paper are as follows:
1) We need to examine how search performance is improved with consideration of different values of MaxDistance and other parameters.
2) We need to investigate search performance on the commonly used text collection.
3) Does the search performance depend on the document size?
4) What are other factors that can affect the performance?
5) We evaluate the performance with respect to both short and long queries.

Key points of our research are the following. We use the word-level index. We use the DAAT approach [6, 10]. We include in the indexes information about all lemmas. Our indexes support incremental updates. We can store one posting list in several data streams. We read entire posting lists when searching, and no early termination is used.
For this paper, we use an English dictionary with approximately 92 thousand English lemmas. This dictionary is used by a morphology analyzer. The analyzer produces a list of numbers of lemmas, that is, basic or canonical forms, for every word from the dictionary. Usually, a word has one lemma, but some words have several lemmas. For example, the word “mine” has two lemmas, namely, “mine” and “my”.

Consider an array of all lemmas. Let us sort all lemmas in decreasing order of their occurrence frequency in the texts. We call the result of sorting the FL-list [17]. The number of a lemma w in the FL-list is called its FL-number [17] and is denoted by FL(w). Let us say that lemma “earth” > “day” because FL(earth) = 309, FL(day) = 199, and 309 > 199. We use the FL-numbers to define the order of the lemmas in the collection of all lemmas.

In our search methodology [16], we defined three types of lemmas, namely, stop lemmas, frequently used lemmas and ordinary lemmas.

The first SWCount most frequently occurring lemmas are stop lemmas. Examples include “earth”, “yes”, “who”, “day”, “war”, “time”, “man” and “be”.

The next FUCount most frequently occurring lemmas are frequently used lemmas. Examples include “red”, “beautiful”, and “mountain”.

All other lemmas are ordinary lemmas, e.g., “fiber” and “undersea”.

SWCount and FUCount are the parameters. We use SWCount = 500 and FUCount = 1050 in the experiments described here. Let MaxDistance be a parameter that can take a value of 5, 7, 9 or even greater.

The value of SWCount is close to the value of 421 from [7]. However, as we include information about all lemmas in the indexes, we can use very different values of the parameters. If an ordinary lemma, q, occurs in the text so rarely that FL(q) is irrelevant, then we can say that FL(q) = ∼. Here, “∼” denotes a large number.

Let us discuss the stop-word approach, in which some words occurring with high frequency or their lemmas are excluded from the search. We do not agree with this approach. A word cannot be excluded from the search because even a word occurring with high frequency can have a specific meaning in the context of a specific query [16, 18]. Therefore, excluding some words from the search can lead to search quality degradation or unpredictable effects [18]. Additionally, stop words are often employed in higher-order term proximity feature models [5].

Consider the query “who are you who” [17]. The Who are an English rock band, and “Who are You” is one of their works. The word “Who” has a specific meaning in the context of this query. Therefore, in our approach, we include information about all of the words in the indexes. Moreover, we can easily see that modern search systems, such as Google, do not skip stop words in a search.
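The three lemma types can be expressed as a small classification function. The sketch below is illustrative only: the function name and the returned labels are hypothetical, and the boundary conditions (a lemma is a stop lemma when its FL-number is below SWCount) follow the index rules stated later in the paper rather than a definition given in this section. An unknown or very rare lemma is represented by infinity, standing in for “∼”.

```python
SW_COUNT = 500   # SWCount used in the experiments
FU_COUNT = 1050  # FUCount used in the experiments


def lemma_type(fl_number, sw_count=SW_COUNT, fu_count=FU_COUNT):
    """Classify a lemma by its FL-number.

    Assumed boundaries: FL < SWCount is a stop lemma,
    SWCount <= FL < SWCount + FUCount is a frequently used lemma,
    and everything else is an ordinary lemma.
    """
    if fl_number < sw_count:
        return "stop"
    if fl_number < sw_count + fu_count:
        return "frequently_used"
    return "ordinary"


# FL-numbers quoted in the paper's query examples.
for lemma, fl in [("earth", 309), ("mountain", 704), ("fiber", 3127)]:
    print(lemma, lemma_type(fl))

# A lemma too rare to have a meaningful FL-number ("~").
print(lemma_type(float("inf")))
```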
The three-component key (f, s, t) index [17] is the list of the occurrences of the lemma f for which lemmas s and t both occur in the text at distances that are less than or equal to MaxDistance from f. Each posting includes the distance between f and s in the text and the distance between f and t in the text. An (f, s, t) index is created only for the case in which f ≤ s ≤ t. Here, f, s, and t are all stop lemmas.

In [16], we considered queries consisting of words occurring with high frequency. In addition, we showed that the average query execution time can be improved with three-component key indexes by up to 15.6 times relative to the time necessary using two-component key indexes only. Therefore, (f, s, t) indexes are required if we need to search queries that contain words occurring with high frequency.

The two-component key (w, v) index [17] is the list of occurrences of the lemma w for which lemma v occurs in the text at a distance that is less than or equal to MaxDistance from w. Each posting includes the distance between w and v in the text. Here, w denotes a frequently used lemma, and v denotes a frequently used or ordinary lemma.

Let us consider the traditional index with near stop word (NSW) records. For each occurrence of each ordinary or frequently used lemma in each document, we include a posting record (ID, P, NSW) in the index. ID can be the ordinal number of the specific document, and P is the position of the word in the document, e.g., the ordinal number of the word in the document. The NSW record contains information about all high-frequency lemmas, that is, stop lemmas, occurring near position P in the document (at a distance that is less than or equal to MaxDistance from P). Examples of the indexes are given in [17].

The posting list of a key is stored in several data streams [17]. For the traditional index with NSW records, we can use up to three streams for one key: one stream for (ID), one for (P) and one for (NSW). Alternatively, for this index, we can use two streams: the first is (ID, P), and the second is (NSW). The actual choice depends on the length of the posting list. For short lists, we use two streams; for long lists, we use three streams. This architecture allows us to skip NSW records when they are not required. For the (w, v) and (f, s, t) indexes, we use one or two data streams for every key.
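To make the (w, v) definition concrete, the following Python sketch collects two-component postings for a single lemmatized document. It is a simplified illustration under stated assumptions, not the indexing code from [17]: the function name, the `lemma_types` dictionary, the posting layout (document ID, position of w, signed offset to v), and the sample document are all hypothetical, and the stream-based storage is not modeled.

```python
from collections import defaultdict


def build_two_component_index(doc_id, lemmas, lemma_types, max_distance):
    """Collect (w, v) postings for one document.

    lemmas      -- list of lemmas in document order
    lemma_types -- dict: lemma -> "stop" | "frequently_used" | "ordinary"
                   (lemmas missing from the dict are treated as ordinary)
    Each posting is (doc_id, position_of_w, offset), where offset is the
    signed distance from w to v; this layout is chosen for the sketch.
    """
    index = defaultdict(list)
    for p, w in enumerate(lemmas):
        # w must be a frequently used lemma.
        if lemma_types.get(w, "ordinary") != "frequently_used":
            continue
        lo = max(0, p - max_distance)
        hi = min(len(lemmas), p + max_distance + 1)
        for q in range(lo, hi):
            v = lemmas[q]
            # v must be a frequently used or ordinary lemma, not a stop lemma.
            if q == p or lemma_types.get(v, "ordinary") == "stop":
                continue
            index[(w, v)].append((doc_id, p, q - p))
    return dict(index)


# Hypothetical lemmatized document and type assignments.
types = {
    "california": "frequently_used",
    "mountain": "frequently_used",
    "pass": "frequently_used",
    "be": "stop",
}
doc = ["california", "mountain", "pass", "be", "high"]
wv_index = build_two_component_index(0, doc, types, max_distance=2)
print(wv_index[("california", "pass")])
```

A query such as “california mountain pass” can then be answered from the short (california, pass) and (pass, mountain) lists instead of the full posting lists of each lemma.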
We create the following indexes:
Idx0: the traditional inverted index without any enhancements, such as NSW records. The total size is 143 GB. This value includes the total size of indexed texts in compressed form, which is 57.3 GB.

Idx5: our indexes, including the traditional inverted index with the NSW records and the (w, v) and (f, s, t) indexes, where MaxDistance = 5. The total size is 1.29 TB, the total size of the (w, v) index is 104 GB, the total size of the (f, s, t) index is 727 GB, and the total size of the traditional index with NSW records is 192 GB.

Idx7: our indexes, where MaxDistance = 7. The total size is 2.16 TB, the total size of the (w, v) index is 148 GB, the total size of the (f, s, t) index is 1.422 TB, and the total size of the traditional index with NSW records is 239 GB.

Idx9: our indexes, where MaxDistance = 9. The total size is 3.27 TB, the total size of the (w, v) index is 191 GB, the total size of the (f, s, t) index is 2.349 TB, and the total size of the traditional index with NSW records is 283 GB.

For the experiment, the GOV2 [4] text collection and the following queries are used: title queries from TREC Robust Task 2004 (with 250 queries in total), title queries from TREC Terabyte Task from 2004 to 2006 (with 150 queries in total), title queries from TREC Web Task from 2009 to 2014 (with 300 queries in total), and queries from TREC 2007 Million Query Track (10000 queries in total).
The total size of the query set after duplicate removal is 10 665 queries. The GOV2 text collection contains 25 million documents. The total size of the collection is approximately 426 GB, and after HTML tag removal, there is approximately 167 GB of plain text. The average document text size is approximately 7 KB.

We used the following computational resources: CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67 GHz. HDD: 7200 RPM. RAM: 24 GB. OS: Microsoft Windows 2008 R2 Enterprise.

The query set can be divided into the following subsets depending on lemmas in a concrete query [17]. All queries are evaluated within one program thread.

If we have a query with a length greater than MaxDistance, then we should divide it into several parts. For example, when the value of MaxDistance is 5, the query “to be or not to be that is the question” should be rewritten as “(to be or not to) (be that is the question)”, and these two parts should be evaluated independently; then, the results should be combined.
Q1. Every query in this subset contains only stop lemmas. There are 119 queries in this subset. Examples include the following: “to be or not to be that is the question” and “kids earth day activities”. With FL-numbers, we have the following queries:
[to: 9] [be: 7] [or: 38] [not: 64] [to: 9] [be: 7] [that: 40] [be: 7] [the: 1] [question: 305]
[kid: 447] [earth: 309] [day: 199] [activity: 247]

For these queries, the (f, s, t) indexes are used [16].

Average query processing times: Idx0: 51.4 s, Idx5: 0.82 s, Idx7: 0.86 s, Idx9: 1.05 s (see Fig. 2).
Average data read sizes per query: Idx0: 1.3 GB, Idx5: 11.1 MB, Idx7: 15.6 MB, Idx9: 20.1 MB.
Average numbers of postings per query: Idx0: 317.8 million, Idx5: 0.88 million, Idx7: 1.15 million, Idx9: 1.5 million.

We improved the query time by a factor of 62.7 with Idx5, by a factor of 59.4 with Idx7, and by a factor of 48.7 with Idx9 relative to Idx0.

Fig. 2. The average query execution times for Idx0, Idx5, Idx7, and Idx9.
Q2. Every query in this subset contains one or several stop lemmas. The query also contains some other lemmas that may be frequently used or ordinary. There are 7 244 queries in the subset. Examples include the following: “History of Physicians in America”. With FL-numbers, we have the following query:
[history: 598] [of: 4] [physician: 1760] [in: 14] [America: 1391]

For these queries, we need to read one posting list with NSW records from the traditional index for some query lemma. This lemma is designated as the “main” lemma of the query. If there is a frequently used lemma in the query, then we can use two-component key indexes, like (physician, history), for other lemmas. If the query consists of only stop and ordinary lemmas, then we use posting lists from the traditional index for the remaining ordinary lemmas but without reading the NSW records. The details are described in [17].

Average query times: Idx0: 50 s, Idx5: 1.9 s, Idx7: 2 s, Idx9: 2.14 s.
Average data read sizes per query: Idx0: 1.3 GB, Idx5: 64.3 MB, Idx7: 68.7 MB, Idx9: 76 MB.
Average numbers of postings per query: Idx0: 324.7 million, Idx5: 3.7 million, Idx7: 3.4 million, Idx9: 3.3 million.

The number of postings read decreases from Idx5 to Idx9. We improved the query processing time by a factor of 25.6 with Idx5, by a factor of 24.7 with Idx7, and by a factor of 23.3 with Idx9 relative to Idx0.

Q3. Every query in this subset contains only frequently used lemmas. There are 79 queries in this subset. Examples include the following: “california mountain pass”. With FL-numbers, we have the following query:
[california: 518] [mountain: 704] [pass: 528]

Two-component (w, v) key indexes are used. For example, we can use the (california, pass) and (*pass, mountain) two-component key indexes.

Average query times: Idx0: 2.5 s, Idx5: 0.32 s, Idx7: 0.33 s, Idx9: 0.34 s.
Average data read sizes per query: Idx0: 53.3 MB, Idx5: 4.69 MB, Idx7: 4.79 MB, Idx9: 4.93 MB.
Average numbers of postings per query: Idx0: 9.4 million, Idx5: 0.59 million, Idx7: 0.6 million, Idx9: 0.61 million.

We improved the query time by a factor of 7.83 with Idx5, by a factor of 7.58 with Idx7, and by a factor of 7.44 with Idx9 relative to Idx0.

Q4. Every query in this subset contains frequently used lemmas and ordinary lemmas. There are 1 388 queries in this subset. Examples include the following: “Scalable Vector Graphics”. With FL-numbers, we have the following query:
[scalable: ∼] [vector: 2953] [graphics: 921]

Each query contains a frequently used lemma; therefore, two-component key indexes can be used. For example, we can use the (graphics, scalable) and (graphics, vector) two-component key indexes.

Average query times: Idx0: 2.2 s, Idx5: 0.36 s, Idx7: 0.34 s, Idx9: 0.32 s.
Average data read sizes per query: Idx0: 41.7 MB, Idx5: 1.5 MB, Idx7: 1.6 MB, Idx9: 1.8 MB.
Average numbers of postings per query: Idx0: 8.1 million, Idx5: 0.17 million, Idx7: 0.17 million, Idx9: 0.17 million.

We improved the query time by a factor of 6 with Idx5, by a factor of 6.4 with Idx7, and by a factor of 6.8 with Idx9 relative to Idx0.

Q5. Every query in this subset contains only ordinary lemmas. There are 1 835 queries in this subset. Examples include the following: “Undersea Fiber Optic Cable”. With FL-numbers, we have the following query:
[undersea: 15873] [fiber: 3127] [optic: 2986] [cable: 2771]

We use the traditional index only for these queries. We do not need to read the NSW records because they are stored in separate streams of data.

Average query times: Idx0: 0.789 s, Idx5: 0.819 s, Idx7: 0.8 s, Idx9: 0.81 s.
Average data read sizes per query: Idx0: 14.2 MB, Idx5: 15.6 MB, Idx7: 15.7 MB, Idx9: 15.8 MB.
Average numbers of postings per query: Idx0: 2.89 million, Idx5: 2.89 million, Idx7: 2.89 million, Idx9: 2.89 million.

The query processing time for these queries does not need any improvement.
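The division of the query set into the five subsets described above can be sketched as follows. This is a hedged illustration: the function names and the descriptive subset labels are hypothetical, and the type boundaries assume that a lemma is a stop lemma when its FL-number is below SWCount and frequently used when it is below SWCount + FUCount.

```python
def lemma_type(fl_number, sw_count=500, fu_count=1050):
    """Classify a lemma by FL-number (assumed boundaries, see lead-in)."""
    if fl_number < sw_count:
        return "stop"
    if fl_number < sw_count + fu_count:
        return "frequently_used"
    return "ordinary"


def query_subset(fl_numbers, sw_count=500, fu_count=1050):
    """Return a descriptive label for the subset a query belongs to."""
    types = {lemma_type(n, sw_count, fu_count) for n in fl_numbers}
    if types == {"stop"}:
        return "only_stop"
    if "stop" in types:
        return "stop_and_other"
    if types == {"frequently_used"}:
        return "only_frequently_used"
    if "frequently_used" in types:
        return "frequently_used_and_ordinary"
    return "only_ordinary"


# FL-numbers of the example queries quoted in the text;
# float("inf") stands in for the "~" of a very rare lemma.
examples = {
    "kids earth day activities": [447, 309, 199, 247],
    "History of Physicians in America": [598, 4, 1760, 14, 1391],
    "california mountain pass": [518, 704, 528],
    "Scalable Vector Graphics": [float("inf"), 2953, 921],
    "Undersea Fiber Optic Cable": [15873, 3127, 2986, 2771],
}
for text, fl in examples.items():
    print(text, "->", query_subset(fl))
```

The label decides which index family is read: (f, s, t) indexes for queries with only stop lemmas, NSW records plus (w, v) or traditional lists when stop lemmas are mixed with others, (w, v) indexes when a frequently used lemma is present, and the traditional index alone otherwise.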
For the entire query set, the results are as follows.

Average query times: Idx0: 34.9 s, Idx5: 1.51 s, Idx7: 1.57 s, Idx9: 1.66 s.
Average data read sizes per query: Idx0: 0.93 GB, Idx5: 46.4 MB, Idx7: 49.5 MB, Idx9: 54.5 MB.
Average numbers of postings per query: Idx0: 225.7 million, Idx5: 3.08 million, Idx7: 2.85 million, Idx9: 2.76 million.

We improved the query time by a factor of 23.1 with Idx5, by a factor of 22.4 with Idx7, and by a factor of 21 with Idx9 relative to Idx0.

The average time that is needed for the search for Idx5, Idx7, and Idx9 is approximately 1-2 sec. for every query type.

For Idx0, we need 2-2.5 sec. for the search if the query consists only of ordinary or frequently used lemmas. For the queries that contain any stop lemma (Q1, Q2), Idx0 requires approximately 50 sec.
This means that the search in
Idx Idx
7, and
Idx
Idx
0, the search system has a performanceproblem if the query contains any high-frequency occurring lemma. The differ-ence in performance between
Idx
Idx Idx
Idx D D
1. Let us consider a word w . For w , we have thefollowing posting list in the traditional index: (0 , , (0 , , (0 , , (1 , , (1 , . We use two very common encoding schemes. The idea of the first schemeas follows. We group the records that are related to a specific document. Weconvert the original posting list as follows: (0 , (1 , , , (1 , (2 , . For other kindsof indexes, we have the same.Therefore, we store in the index the ID of the document and then the list ofword’s positions. The second scheme is a delta-encoding scheme. Consequently,we have the following. (0 , (1 , , , (1 , (2 , . For example, consider (1 , , −
1, where 1 is the previous value in the list.

The first scheme is much more effective for text collections that consist of large documents. This explains the difference in the “improvement factor” in the experiments with the two aforementioned collections. However, with our method, we can use any other encoding scheme and any other inverted index organization.

In addition, the “improvement factor” can depend on the structure of the query set. To analyze this question, we performed an additional experiment. We formed another query set using the method from [16]: we selected a document from the TREC GOV2 collection and used the content of the document to produce a set of 3500 queries. We performed our experiments again using this query set. The “improvement factor” was similar to the foregoing results that we have for TREC GOV2.

The next question is how to select the values of our parameters. The value of MaxDistance affects relevance and should be determined according to the selected relevance function [17]. The results of the experiments allow us to predict how a change in MaxDistance affects the search time and index size. Let us now discuss how the values of SWCount and FUCount affect these metrics.

The value of SWCount is more important than the value of FUCount because it affects Q1 and Q2 queries, which are the most complex from the performance point of view. Let us consider SWCount now. Let us consider the Q2 query subset.
We use Zipf’s law [20] as a representation of our word occurrence distribution. According to this law, the second most frequently occurring lemma occurs half as often as the first. The third most frequently occurring lemma occurs 1/3 as often as the first, and so on. Let $V$ be the number of occurrences of the most frequently occurring lemma. Therefore, our lemmas have the following numbers of occurrences: $V, V/2, V/3, \ldots$

Let us consider a query $q = (q_1, q_2, \ldots, q_n)$ from Q2. Here, $q_i$ is the number of the corresponding lemma in the FL-list. The query contains one or several stop lemmas and some other lemmas that may be frequently used or ordinary. To evaluate the query using the traditional index, we need to read the following number of postings from the index:

$$\sum_{i=1}^{n} V / q_i .$$

Let, without loss of generality, $q_1$ and $q_2$ be stop lemmas and $q_n$ be the main lemma of the query. With the use of NSW records, we need to read the following number of postings:

$$\sum_{i=3}^{n} V / q_i + (V / q_n) \times (NSWFactor - 1) .$$

We do not need to read the posting lists of $q_1$ and $q_2$. However, for $q_n$ we need to read its posting list with NSW records. The size of the posting (ID, P, NSW) in bytes is up to $NSWFactor$ times larger than the size of the posting (ID, P), where $NSWFactor$ is a constant of at least 4. The experiments with MaxDistance = 5 show this. Therefore, we can calculate the “planned performance gain” as follows:

$$PPG(q) = \left( \sum_{i=1}^{n} 1 / q_i \right) \Big/ \left( \sum_{i=3}^{n} 1 / q_i + (1 / q_n) \times (NSWFactor - 1) \right) .$$

If we have a query set $Q$, then we can calculate the average planned performance gain, $APPG(Q) = \frac{1}{|Q|} \sum_{q \in Q} PPG(q)$. Let us now use only Q2 queries for estimations. If a query $q$ is not a Q2 query, then let $PPG(q)$ be 1. Let $APPG(Q, SWCount)$ be the average planned performance gain that is calculated for a specific value of SWCount.

For the query set that we use in this paper, i.e., 10 665 queries, we have the following:
APPG(Q, …), APPG(Q, …), APPG(Q, …).

However, this model does not take into account the (w, v) and (f, s, t) indexes. In the future, we plan to develop more precise models. However, from this model, we can predict that SWCount should not be increased.

The foregoing results need a more detailed examination. Let us consider a search query. The query consists of some set of lemmas. Let the Min-FL-number be the minimum FL-number among all lemmas of the query. A lower FL-number corresponds to a more frequently occurring lemma. If the Min-FL-number of a query is a small number, then the query can induce performance problems because the query contains some high-frequency occurring lemma.

Then, we divide the entire query set into subsets based on the Min-FL-number of queries. We select 100 as the division step. In the first subset, we include all queries with Min-FL-numbers from 0 to 99; in the second subset, we include all queries with Min-FL-numbers from 100 to 199; and so on. In the following diagrams, we consider the first 21 subsets.

In Fig. 3, the average query execution time for Idx
Fig. 3.
The average query execution times for
Idx
Min - F L -number with a step of 100. that the first subset induces some performance problems. The first 8 subsetshave an average query execution time of more than two seconds.
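The grouping by Min-FL-number that underlies these diagrams can be sketched in Python. The function name and the input layout are assumptions, and the timings in the sample data are made-up placeholders, not measurements from the experiments; only the FL-numbers are taken from the example queries in the text. The sketch also assumes integer FL-numbers, so queries made up entirely of “∼” lemmas are not handled.

```python
from collections import defaultdict


def average_time_by_min_fl(results, step=100):
    """Average execution time per Min-FL-number subset.

    results -- iterable of (fl_numbers, seconds) pairs, one per query
    Subset k holds the queries whose Min-FL-number lies in
    [k * step, (k + 1) * step - 1].
    """
    totals = defaultdict(lambda: [0.0, 0])
    for fl_numbers, seconds in results:
        subset = min(fl_numbers) // step
        totals[subset][0] += seconds
        totals[subset][1] += 1
    return {k: s / n for k, (s, n) in sorted(totals.items())}


# FL-numbers from the paper's examples; the timings are hypothetical.
sample = [
    ([9, 7, 38], 4.0),
    ([598, 4, 1760], 2.0),
    ([518, 704, 528], 1.0),
    ([15873, 3127, 2986, 2771], 0.5),
]
averages = average_time_by_min_fl(sample)
print(averages)
```

Subset 0 collects every query that contains at least one of the 100 most frequent lemmas, which is why it dominates the execution time in the traditional index.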
Fig. 4.
The average query execution times for
Idx
Min - F L -number with a step of 100.
In Fig. 4, the average query execution time for
Idx
Fig. 5.
The improvement factor for
Idx
Idx
Min - F L -number with a step of 100.
In Fig. 5, we show the improvement factor for
Idx
Idx
M in - F L -number of a query crosses the
SW Count of 500.By this diagram, we can propose that the value of
SW Count can be loweredto 100. Consequently, we created another index
Idx /SW SW Count =100 and
F U Count = 1450.In Fig. 6, the average query execution time for
Idx
Idx Fig. 6.
The average query execution times for
Idx /SW
100 (seconds); the query setis divided based on the
Min - F L -number with a step of 100.
However, let us consider Q1 queries when the value of
SW Count is 500, whichare the queries that consist only of lemmas with
F L -number < Idx
Idx /SW Idx
Fig. 7.
The average query execution times for
Idx Idx
5, and
Idx /SW
100 (seconds);the query set Q1 is divided based on the
Min - F L -number with a step of 100.
Let us divide query set Q1 into five subsets based on the
M in - F L -numbervalue of the concrete query. In Fig. 7, the average query execution time for
Idx Idx
Idx
Idx
Idx
F L -number < f, s, t ) indexes work better for these queries than do other types ofindexes, as shown by Fig. 7. The search in Idx /SW
100 is significantly slowerthan that in
Idx
5. When the search is performed using
Idx
5, the ( f, s, t )indexes are used for every query in T1. When the search is performed using
Idx /SW f, s, t ) indexes are used only if F L ( f ) < F L ( s ) < F L ( t ) < F L -number ≥ ast K-Word Proximity Search Based on Multi-component Key Indexes 13 The ( w, v ) indexes allow one to achieve a better performance improvement,as shown in Fig. 6. For subsets starting from
F L -number = 100, no ( f, s, t )indexes or NSW records are used when we do the search using
Idx /SW Idx /SW
100 is better than
Idx
F L -number < w, v ) indexes and NSW records are required. The first bar in Figures3, 4 and 6 supports this, and our previous experiments in [16] confirm it. The original schema can be represented by the following rules:1) ( f, s, t ) indexes, where
FL(f), FL(s), FL(t) < SWCount.
2) (w, v) indexes, where SWCount ≤ FL(w) < SWCount + FUCount, and SWCount ≤ FL(v).
3) Traditional indexes (x) with NSW records, SWCount ≤ FL(x); the NSW records contain information about all lemmas y with the condition FL(y) < SWCount that occur near lemma x in the text.

The new schema uses the following parameters: EHFCount = 100; HFCount = 400, for high-frequency occurring lemmas; and FUCount = 1050, for frequently used lemmas.

We propose using the following indexes.
1) (f, s, t) indexes that can be used for T1 queries, where FL(f), FL(s), FL(t) < EHFCount + HFCount = 500.
2) (w, v) indexes that can be used for T2 and T3 queries, where 100 = EHFCount ≤ FL(w) < EHFCount + HFCount + FUCount = 1550, and 100 = EHFCount ≤ FL(v).
3) Traditional indexes (x) with NSW records, 100 = EHFCount ≤ FL(x); the NSW records contain information about all lemmas y with the condition FL(y) < EHFCount = 100 that occur near lemma x in the text. These indexes can be used for T3 queries.

The concrete values of the parameters are provided only as an example and can be different for different languages and text collections.

In this paper, we investigated how multi-component key indexes help to improve search performance. We used the well-known GOV2 text collection. We proposed a method of analyzing the search performance by considering different types of queries. By following this method, we found that the performance can be improved further and proposed a new index schema.

We analyzed how the value of MaxDistance affects the search performance. With an increase in the value of MaxDistance from 5 to 9, the average search time using multi-component key indexes increased from 1.51 sec. to 1.66 sec. Therefore, the value of MaxDistance can be increased even further, and the main limitations here are the disk space and the time of indexing. Our multi-component indexes are relatively large for large values of MaxDistance. However, large hard disk drives are now available.
In many cases, it would be preferable to spend several TB of disk space but to increase the search speed by a factor of 20 or more.

We found that multi-component key indexes work significantly better on text collections with large documents (e.g., documents with sizes of approximately several hundred KB or more) than on text collections that consist of small documents; thus, an improvement factor of 20 can be considered a minimum improvement factor. In the future, it is important to consider how different compression technologies can reduce the total index size, which will increase the search speed even more.

The proposed indexes with multi-component keys have one limitation. If we have a document that contains queried words and the distance between these words is greater than MaxDistance, then this document can be absent from the search results. First, this is usually not a problem if the average document size in the text collection is relatively large, e.g., several hundreds of kilobytes. In this case, after the proximity search with multi-component key indexes, we can run a search without distance. While the former requires the word-level index, the latter needs only the document-level index and works significantly faster.

Second, in the majority of modern relevance models, the weight of the document is defined to be inversely proportional to the square of the distance between queried words in the document [19]. With a relatively large value of MaxDistance, we can be sure that all relevant documents will occur in the search results. While the first consideration is questionable for the GOV2 collection, because the documents are small, the second is still valid.

References

1. Anh, V.N., de Kretser, O., Moffat, A.: Vector-space ranking with effective early termination. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 35–42. SIGIR ’01, ACM, New York, NY, USA (2001). https://doi.org/10.1145/383952.383957
2. Borodin, A., Mirvoda, S., Porshnev, S., Ponomareva, O.: Improving generalized inverted index lock wait times. Journal of Physics: Conference Series (1), 012022 (Jan 2018). https://doi.org/10.1088/1742-6596/944/1/012022
3. Broschart, A., Schenkel, R.: High-performance processing of text queries with tunable pruned term and term pair indexes. ACM Trans. Inf. Syst. (1) (Mar 2012). https://doi.org/10.1145/2094072.2094077
4. Büttcher, S., Clarke, C., Soboroff, I.: The TREC 2006 terabyte track. In: Proceedings of the Fifteenth Text REtrieval Conference, TREC 2006. pp. 128–141 (2006)
5. Crane, M., Culpepper, J.S., Lin, J., Mackenzie, J., Trotman, A.: A comparison of document-at-a-time and score-at-a-time query evaluation. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. pp. 201–210. WSDM ’17, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018726
6. Daoud, C.M., Silva de Moura, E., Carvalho, A., Soares da Silva, A., Fernandes, D., Rossi, C.: Fast top-k preserving query processing using two-tier indexes. Inf. Process. Manage. (5), 855–872 (Sep 2016). https://doi.org/10.1016/j.ipm.2016.03.005
7. Fox, C.: A stop list for general text. SIGIR Forum (1-2), 19–21 (Sep 1989). https://doi.org/10.1145/378881.378888
8. Gall, M., Brost, G.: K-word proximity search on encrypted data. In: 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA). pp. 365–372. IEEE (2016). https://doi.org/10.1109/WAINA.2016.104
9. Imran Rafique, M., Hassan, M.: Utilizing distinct terms for proximity and phrases in the document for better information retrieval. In: 2014 International Conference on Emerging Technologies (ICET). pp. 100–105. IEEE (2014). https://doi.org/10.1109/ICET.2014.7021024
10. Jiang, D., Leung, K.W.T., Yang, L., Ng, W.: TEII: Topic enhanced inverted index for top-k document retrieval. Know.-Based Syst. (C), 346–358 (Nov 2015). https://doi.org/10.1016/j.knosys.2015.07.014
11. Lin, J., Trotman, A.: Anytime ranking for impact-ordered indexes. In: Proceedings of the 2015 International Conference on The Theory of Information Retrieval. pp. 301–304. ICTIR ’15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2808194.2809477
12. Lu, X., Moffat, A., Culpepper, J.S.: Efficient and effective higher order proximity modeling. In: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval. pp. 21–30. ICTIR ’16, ACM, New York, NY, USA (2016). https://doi.org/10.1145/2970398.2970404
13. Miller, R.B.: Response time in man-computer conversational transactions. In: Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I. pp. 267–277. AFIPS ’68 (Fall, part I), ACM, USA (1968). https://doi.org/10.1145/1476589.1476628
14. Petri, M., Moffat, A.: On the cost of phrase-based ranking. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 931–934. SIGIR ’15, Association for Computing Machinery, New York, NY, USA (2015). https://doi.org/10.1145/2766462.2767769
15. Sadakane, K., Imai, H.: Fast algorithms for k-word proximity search. In: IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences. vol. 84, pp. 2311–2318 (2001)
16. Veretennikov, A.B.: Proximity full-text search by means of additional indexes with multi-component keys: In pursuit of optimal performance. In: Manolopoulos Y., Stupnikov S. (eds) Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2018. Communications in Computer and Information Science. vol. 1003, pp. 111–130. Springer. https://doi.org/10.1007/978-3-030-23584-0_7
17. Veretennikov, A.B.: Proximity full-text search with a response time guarantee by means of additional indexes. In: Arai K., Kapoor S., Bhatia R. (eds) Intelligent Systems and Applications. IntelliSys 2018. Advances in Intelligent Systems and Computing. vol. 868, pp. 936–954. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_66
18. Williams, H.E., Zobel, J., Bahle, D.: Fast phrase querying with combined indexes. ACM Trans. Inf. Syst. (4), 573–594 (Oct 2004). https://doi.org/10.1145/1028099.1028102
19. Yan, H., Shi, S., Zhang, F., Suel, T., Wen, J.R.: Efficient term proximity search with term-pair indexes. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1229–1238. CIKM ’10, ACM, New York, NY, USA (2010). https://doi.org/10.1145/1871437.1871593
20. Zipf, G.K.: Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press (1932)
21. Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput. Surv. 38