Linear-Space Data Structures for Range Mode Query in Arrays
S. Durocher † J. Morrison ‡ January 20, 2011
Abstract
A mode of a multiset $S$ is an element $a \in S$ of maximum multiplicity; that is, $a$ occurs at least as frequently as any other element in $S$. Given a list $A[1:n]$ of $n$ items, we consider the problem of constructing a data structure that efficiently answers range mode queries on $A$. Each query consists of an input pair of indices $(i, j)$ for which a mode of $A[i:j]$ must be returned. We present an $O(n^{2-2\epsilon})$-space static data structure that supports range mode queries in $O(n^{\epsilon})$ time in the worst case, for any fixed $\epsilon \in [0, 1/2]$. When $\epsilon = 1/2$, this corresponds to the first linear-space data structure to guarantee $O(\sqrt{n})$ query time. We then describe three additional linear-space data structures that provide $O(k)$, $O(m)$, and $O(|j - i|)$ query time, respectively, where $k$ denotes the number of distinct elements in $A$ and $m$ denotes the frequency of the mode of $A$. Finally, we examine generalizing our data structures to higher dimensions.
Mode and Range Queries.
The frequency of an element $x$ in a multiset $S$, denoted $\mathrm{freq}_S(x)$, is the number of occurrences (i.e., the multiplicity) of $x$ in $S$. A mode of $S$ is an element $a \in S$ such that for all $x \in S$, $\mathrm{freq}_S(x) \le \mathrm{freq}_S(a)$. A multiset $S$ may have multiple distinct modes; the frequency of the modes of $S$, denoted by $m$, is unique.
Along with the mean and median of a multiset, the mode is a fundamental statistic of data analysis for which efficient computation is necessary. Given a sequence of $n$ elements ordered in a list $A$, a range query seeks to compute the corresponding statistic on the multiset determined by a subinterval of the list: $A[i:j]$. The objective is to preprocess $A$ to construct a data structure that supports efficient response to one or more subsequent range queries, where the corresponding input parameters $(i, j)$ are provided at query time.
We assume the RAM model of computation with word size $\Theta(\log u)$, where elements are drawn from a universe $U = \{0, \ldots, u-1\}$. Although the complete set of possible queries can be precomputed and stored using $\Theta(n^2)$ space, practical data structures require less storage while still enabling efficient response time. For all $i$, if $i = j$, then a range query must report $A[i]$. Consequently, any range query data structure for a list of $n$ items requires $\Omega(n)$ storage space in the worst case [7]. This leads to a natural question: how quickly can an $O(n)$-space data structure answer range queries?
The problem of constructing efficient data structures for range median queries has been analyzed extensively [7, 9, 10, 11, 23, 24, 26, 28, 29, 30, 33, 34]. A range mean query is equivalent to a normalized range sum query (partial sum query), for which a precomputed prefix-sum array provides a linear-space static data structure with constant query time [30]. As expressed recently by Brodal et al. regarding the current status of the range mode query problem: “The problem of finding the most frequent element within a given array range is still rather open.” [9, page 2]. See Section 2 for an overview of the current state of the range mode query problem.
Our Results.
Given an array $A[1:n]$ of $n$ items, we present an $O(n^{2-2\epsilon})$-space static data structure that supports range mode queries in $O(n^{\epsilon})$ time in the worst case, for any fixed $\epsilon \in [0, 1/2]$. When $\epsilon = 1/2$, this corresponds to the first linear-space data structure to guarantee $O(\sqrt{n})$ query time. Prior to our work, the previous fastest linear-space data structure by Krizanc et al. [30] supported range mode queries in $O(\sqrt{n} \log \log n)$ time; our data structure borrows ideas developed by Krizanc et al. and augments their data structure to eliminate dependence on predecessor queries (see Lemma 4). We describe three additional $O(n)$-space data structures that provide $O(k)$, $O(m)$, and $O(|j - i|)$ query time, respectively, where $k$ denotes the number of distinct elements in $A$ and $m$ denotes the frequency of the mode of $A$. Finally, we discuss generalizations of our data structures to $d$ dimensions for any fixed $d$. To the authors' knowledge, this is the first examination of multidimensional range mode query.
∗ Work supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC). † University of Manitoba, Winnipeg, Canada. ‡ University of Manitoba, Winnipeg, Canada.
Computing a Mode.
The mode of a multiset $S$ of $n$ items can be found in $O(n \log n)$ time by sorting $S$ and scanning the sorted list to identify the longest sequence of identical items. Due to the corresponding lower bound on the worst-case time for solving the element uniqueness problem, finding a mode requires $\Omega(n \log n)$ time in the worst case; that is, the decision problem of determining whether $m > 1$ requires $\Omega(n \log n)$ time in the worst case [36]. Better bounds on the worst-case time are obtained by parameterizing in terms of $m$ or $k$. A worst-case time of $O(n \log k)$ is easily achieved by inserting the $n$ elements into a balanced search tree in which each node stores a key and its frequency. Munro and Spira [32] describe an $O(n \log(n/m))$-time algorithm for finding a mode and a corresponding lower bound of $\Omega(n \log(n/m))$ on the worst-case time.
If distinct elements in $S$ can be mapped efficiently (i.e., in constant time) to distinct integers in the range $\{1, \ldots, k'\}$, for some $k'$, then a mode of $S$ can be found in $O(n + k')$ time using $O(n + k')$ space. This is achieved by identifying a maximum element in a frequency table for $S$ of size $k'$. This method is analogous to counting sort. A similar algorithm for computing a mode can be implemented using hash tables.
We include the following lemma to which we refer in Section 3:
Lemma 1 (Krizanc et al. [30]) Let $A$ and $B$ be any multisets. If $c$ is a mode of $A \cup B$ and $c \notin A$, then $c$ is a mode of $B$.
Range Mode Query.
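As a baseline for the data structures discussed in this paper, a mode of $A[i:j]$ can be computed directly, without preprocessing, using the hash-table variant of the frequency-table method. The following is a minimal sketch only; the function name `naive_range_mode` and the 1-indexed inclusive convention for `i` and `j` are assumptions made for illustration.

```python
from collections import Counter

def naive_range_mode(A, i, j):
    """Return (mode, frequency) for A[i..j], 1-indexed and inclusive.

    Runs in time linear in the length of the range; ties are broken
    in favour of the element that appears first in the range.
    """
    counts = Counter(A[i - 1:j])          # hash-table frequency table
    mode = max(counts, key=counts.get)    # an element of maximum frequency
    return mode, counts[mode]

A = [10, 20, 20, 30, 10, 20]
print(naive_range_mode(A, 1, 6))  # (20, 3)
```

Each query costs time proportional to $|j - i| + 1$, which is the cost the preprocessing-based structures aim to beat for long ranges.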
Naturally, a mode of the query interval $A[i:j]$ can be computed directly without preprocessing using any of the methods described in Section 2. Krizanc et al. [30] describe data structures that provide constant-time queries using $O(n^2 \log \log n / \log n)$ space and $O(n^{\epsilon} \log n)$-time queries using $O(n^{2-2\epsilon})$ space, for any fixed $\epsilon \in (0, 1/2]$. Petersen and Grabowski [34] improve the first bound to constant time and $O(n^2 \log \log n / \log^2 n)$ space and Petersen [33] improves the second bound to $O(n^{\epsilon})$-time queries using $O(n^{2-2\epsilon})$ space, for any fixed $\epsilon \in [0, 1/2)$. When $\epsilon = 1/2$, the data structure of Krizanc et al. [30] requires only linear space and provides $O(\sqrt{n} \log \log n)$ query time. Although its space requirement is almost linear in $n$ as $\epsilon$ approaches $1/2$, the data structure of Petersen [33] requires $\omega(n)$ space. Furthermore, the construction becomes impractical as $\epsilon$ approaches $1/2$ (a quantity in the construction grows to $\infty$ as $\epsilon \to 1/2$) and no obvious modification reduces its space requirement to $O(n)$.
Greve et al. [25] prove a lower bound of $\Omega(\log n / \log(s \cdot w / n))$ query time for any data structure that uses $s$ memory cells of $w$ bits.
Bose et al. [7] consider approximate range mode queries, in which the objective is to return an element whose frequency is at least $\alpha \cdot m$. They give a data structure that requires $O(n/(1-\alpha))$ space and answers approximate range mode queries in $O(\log \log_{1/\alpha} n)$ time for any fixed $\alpha \in (0, 1)$, as well as data structures that provide constant-time queries for $\alpha \in \{1/2, 1/3, 1/4\}$, using space $O(n \log n)$, $O(n \log \log n)$, and $O(n)$, respectively. Greve et al. [25] give a linear-space data structure that supports approximate range mode queries in constant time for $\alpha = 1/3$, and an $O(n \cdot \alpha/(1-\alpha))$-space data structure that supports approximate range mode queries in $O(\log(\alpha/(1-\alpha)))$ time for any fixed $\alpha \in [1/2, 1)$.
Continuous Space versus Array Input.
A vast literature studies the problems of geometric range searching in continuous Euclidean space; that is, data points are positioned arbitrarily in $\mathbb{R}^d$. See the survey by Agarwal [1] for an overview of results. The range query problems considered in this paper, however, restrict attention to array input. Although a range query on an array can be viewed as a restricted case of a more general range searching problem (e.g., a point set with regular spacing), the algorithmic techniques differ greatly between the two settings when $d \ge 2$. When $d = 1$, however, a geometric range mode query problem reduces to array range mode query. In particular, the rank of each data point in Euclidean space corresponds to its array index. It suffices to compute the ranks of the respective successor and predecessor of the endpoints of the query interval to identify the indices $i$ and $j$, and to return the corresponding array range mode query on $A[i:j]$.
In addition to results on the median, mode, and sum range query problems discussed in Sections 1 and 2, other range query problems examined on arrays include semigroups [2, 38, 39], extrema (e.g., range minimum or maximum) [4, 6, 13, 19, 20, 18, 21, 22], selection or quantiles (for which the median is a special case) [23, 24, 28, 29], dominance or rank (counting the number of elements in the query range that exceed a given input threshold) [27, 28], coloured range (counting/enumerating the distinct elements in the query range) [23], and $k$-frequency (determining whether any element has frequency $k$) [25]. Recently, range query problems have been examined on multidimensional arrays, including partial sums [12], range minimum [3, 8, 13, 35, 40], median [24], and selection [23].
$O(n^{\epsilon})$ Query Time and $O(n^{2-2\epsilon})$ Space
In the worst case, for every range mode query processed, the data structure of Krizanc et al. [30] makes a sequence of $\Theta(n^{\epsilon})$ predecessor queries, each requiring $\Theta(\log \log n)$ time, for a total query time of $\Theta(n^{\epsilon} \log \log n)$. We build on the data structure of Krizanc et al. and introduce a different technique that avoids predecessor search entirely. Section 3 establishes the following theorem and the corresponding corollary that follows when $\epsilon = 1/2$:
Theorem 2 Given an array $A[1:n]$ of $n$ items, for any $\epsilon \in [0, 1/2]$ there exists a data structure requiring $O(n^{2-2\epsilon})$ storage space that supports range mode queries on $A$ in $O(n^{\epsilon})$ time in the worst case.
Corollary 3 Given an array $A[1:n]$ of $n$ items, there exists a data structure requiring $O(n)$ storage space that supports range mode queries on $A$ in $O(\sqrt{n})$ time in the worst case.
Data Structure Precomputation.
Suppose the elements of $A[1:n]$ are drawn from an ordered bounded universe $U$. Let $D = \{a_1, \ldots, a_k\} \subseteq U$ denote the set of distinct elements stored in $A$. Construct an array $B[1:n]$ such that for each $i$, $B[i]$ stores the rank of $A[i]$ in $D$. Therefore, $B[i] \in \{1, \ldots, k\}$. For any $a$, $i$, and $j$, $B[a]$ is a mode of $B[i:j]$ if and only if $A[a]$ is a mode of $A[i:j]$. Performing computation on array $B$ instead of array $A$ allows direct array referencing using the values stored in $B$ as indices. For simplicity, we describe our data structures in terms of array $B$; a table look-up provides a direct bijective mapping from $\{1, \ldots, k\}$ to $D$. Set $D$, array $B$, and the value $k$ are independent of any query range and can be computed in $O(n \log k)$ time during preprocessing.
Given fixed $a$ and $b$, array $C[1:k]$ is a frequency table for $B[a:b]$ if, for each $i$, $C[i]$ stores the number of occurrences of element $i$ in $B[a:b]$. For any $j > i$, if $C_i[1:k]$ is a frequency table for $B[1:i]$ and $C_j[1:k]$ is a frequency table for $B[1:j]$, then for each $x$, $C_j[x] - C_i[x]$ is the frequency of element $x$ in $B[i+1:j]$.
For each $a \in \{1, \ldots, k\}$, let $Q_a = \{b \mid B[b] = a\}$. That is, $Q_a$ is the set of indices $b$ such that $B[b] = a$. For any $a$, a range counting query for element $a$ in $B[i:j]$ can be answered by searching for the predecessors of $i$ and $j$, respectively, in the set $Q_a$; the difference of the indices of the two predecessors is the frequency of $a$ in $B[i:j]$ [30]. As noted above, implementing such a range counting query using an efficient predecessor data structure requires $\Theta(\log \log n)$ time in the worst case.
Figure 1: Example of the sparse mode table method data structure. The number of list items is $n = 24$, of which $k = 5$ are distinct. If $\epsilon = 3/8$, the array is partitioned into $t = \lceil n/s \rceil = 6$ blocks of size $s = \lceil n^{\epsilon} \rceil = 4$. The query range is $A[i:j] = A[7:19]$, for which the unique mode is 20, occurring with frequency 5. The corresponding mode of $B[i:j]$ is 2. The query range $B[7:19]$ is partitioned into the prefix $B[7:8]$, the span $B[9:16]$, and the suffix $B[17:19]$. The span covers blocks $b_i = 2$ to $b_j = 3$, for which the corresponding mode is $S[2,3] = 2$, occurring with frequency $S'[2,3] = 4$.
The following related decision problem, however, can be answered in constant time by a linear-space data structure: does $B[i:j]$ contain at least $q$ instances of element $B[i]$? This question can be answered by a select query that returns the index of the $q$th instance of $B[i]$ in $B[i:n]$. For each $a \in \{1, \ldots, k\}$, store the set $Q_a$ as an ordered array (also denoted $Q_a$ for simplicity). Define a rank array $B'[1:n]$ such that for all $b$, $B'[b]$ denotes the rank of $B[b]$ in $B[1:n]$ (i.e., the index of $b$ in $Q_{B[b]}$). Given any $q$, $i$, and $j$, to determine whether $B[i:j]$ contains at least $q$ instances of $B[i]$ it suffices to check whether $Q_{B[i]}[B'[i] + q - 1] \le j$. Since array $Q_{B[i]}$ stores the sequence of indices of instances of element $B[i]$ in $B$, looking ahead $q - 1$ positions in $Q_{B[i]}$ returns the index of the $q$th occurrence of element $B[i]$ in $B[i:n]$; if this index is at most $j$, then the frequency of $B[i]$ in $B[i:j]$ is at least $q$. If the index $B'[i] + q - 1$ exceeds the size of $Q_{B[i]}$, then the query returns a negative answer. This gives the following lemma:
Lemma 4
Given an array $A[1:n]$ of $n$ items, there exists a data structure requiring $O(n)$ storage space that can determine in constant time for any $\{i, j\} \subseteq \{1, \ldots, n\}$ and any $q$ whether $A[i:j]$ contains at least $q$ instances of element $A[i]$.
Following Krizanc et al. [30], given any $\epsilon \in [0, 1/2]$ we partition array $B$ into $t$ blocks of size $s = \lceil n^{\epsilon} \rceil$, where $t = \lceil n/s \rceil \le \lceil n^{1-\epsilon} \rceil$. That is, for each $i \in \{0, \ldots, t-2\}$, the $i$th block spans $B[i \cdot s + 1 : (i+1)s]$ and the last block spans $B[(t-1) \cdot s + 1 : n]$. We precompute tables $S[0:t-1, 0:t-1]$ and $S'[0:t-1, 0:t-1]$, each of size $\Theta(t^2)$, such that for any $\{b_i, b_j\} \subseteq \{0, \ldots, t-1\}$, $S[b_i, b_j]$ stores a mode of $B[b_i \cdot s + 1 : (b_j+1)s]$ and $S'[b_i, b_j]$ stores the corresponding frequency.
Finally, we need a frequency table $C[1:k]$ of size $k$, initialized to zero. The arrays $Q_1, \ldots, Q_k$ can be constructed in $O(n)$ total time in a single scan of array $B$. The arrays $S$ and $S'$ can be constructed in $O(n^{2-\epsilon})$ time by scanning array $B$ $t$ times, computing one row of each array $S$ and $S'$ per scan. Thus, the total precomputation time required to initialize the data structure is $O(n^{2-\epsilon})$.
Range Mode Query Algorithm.
Given a query range $B[i:j]$, let $b_i = \lceil (i-1)/s \rceil$ and $b_j = \lfloor j/s \rfloor - 1$ denote the respective indices of the first and last blocks completely contained within $B[i:j]$. We refer to $B[b_i \cdot s + 1 : (b_j+1)s]$ as the span of the query range, to $B[i : \min\{b_i \cdot s, j\}]$ as its prefix, and to $B[\max\{(b_j+1)s + 1, i\} : j]$ as its suffix. One or more of the prefix, span, and suffix may be empty; in particular, if $b_i > b_j$, then the span is empty. See the example in Figure 1.
The value $c = S[b_i, b_j]$ is a mode of the span with corresponding frequency $f_c = S'[b_i, b_j]$. If the span is empty, then let $f_c = 0$. By Lemma 1, either $c$ is a mode of $B[i:j]$ or some element of the prefix or suffix is a mode of $B[i:j]$. Thus, to identify a mode of $B[i:j]$, we verify for every element in the prefix and suffix whether its frequency in $B[i:j]$ exceeds $f_c$ and, if so, we identify this element as a candidate mode and count its additional occurrences in $B[i:j]$. We present the details of this procedure for the prefix; an analogous procedure is applied to the suffix.
We now describe how to compute the frequency of all candidate elements in the prefix over the range $B[i:j]$, storing these values in the frequency table $C$. Sequentially scan the items in the prefix starting at the leftmost index, $i$, and let $x$ denote the index of the current item. If $C[B[x]] > 0$, then an instance of element $B[x]$ appears in $B[i:x-1]$ and its frequency has already been computed; skip $B[x]$ and increment $x$. If $C[B[x]] = 0$, check whether $Q_{B[x]}[B'[x] + f_c - 1] \le j$ (i.e., verify whether $B[x]$ is a candidate). If so, then the frequency of $B[x]$ in $B[i:j]$ is at least $f_c$. The exact frequency of $B[x]$ in $B[i:j]$ can be counted by a linear scan of $Q_{B[x]}$, starting at index $B'[x] + f_c - 1$ and terminating upon reaching an index $y$ such that $Q_{B[x]}[y] > j$ or the end of array $Q_{B[x]}$ (i.e., $y = |Q_{B[x]}| + 1$). That is, $Q_{B[x]}[y]$ denotes the index of the first instance of element $B[x]$ that lies beyond the query range $B[i:j]$ (or no such element exists). Consequently, the frequency of $B[x]$ in $B[i:j]$ is $y - B'[x]$. Store this value in $C[B[x]]$.
An analogous procedure is repeated for the suffix. Upon completing the scans of the prefix and suffix, we identify a maximum value in array $C$; its index corresponds to a mode of $B[i:j]$. Only non-zero entries in $C$ need be examined (and subsequently reset to zero); this is achieved by making a second scan of the prefix and suffix and examining the corresponding elements in array $C$.
Storage Space and Query Time.
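The precomputation and query procedure of the sparse mode table method can be summarized in the following sketch. It is an illustration under stated assumptions rather than the authors' implementation: the class name `SparseModeTable`, the parameter `eps`, and the method `query` are hypothetical; indices are 0-based and inclusive; ranks run from $0$ to $k-1$; a dictionary `seen` stands in for the frequency table $C$; and the suffix is handled by mirroring the prefix procedure (scanning right to left and looking backward in $Q$).

```python
import math

class SparseModeTable:
    def __init__(self, A, eps=0.5):
        n = self.n = len(A)
        self.vals = sorted(set(A))                     # D, the distinct elements
        rank = {v: r for r, v in enumerate(self.vals)}
        B = self.B = [rank[v] for v in A]              # ranks 0..k-1
        k = len(self.vals)
        s = self.s = max(1, math.ceil(n ** eps))       # block size
        t = self.t = math.ceil(n / s)                  # number of blocks
        self.Q = [[] for _ in range(k)]                # Q[a]: sorted positions of a
        self.Bp = [0] * n                              # B'[x]: index of x within Q[B[x]]
        for x, a in enumerate(B):
            self.Bp[x] = len(self.Q[a])
            self.Q[a].append(x)
        # S[bi][bj], Sf[bi][bj]: a mode of blocks bi..bj and its frequency.
        self.S = [[0] * t for _ in range(t)]
        self.Sf = [[0] * t for _ in range(t)]
        for bi in range(t):                            # one scan of B per row
            cnt, best, bf = [0] * k, 0, 0
            for x in range(bi * s, n):
                cnt[B[x]] += 1
                if cnt[B[x]] > bf:
                    best, bf = B[x], cnt[B[x]]
                if (x + 1) % s == 0 or x == n - 1:     # x closes block x // s
                    self.S[bi][x // s], self.Sf[bi][x // s] = best, bf

    def query(self, i, j):
        """Return (mode, frequency) of A[i..j], 0-indexed inclusive."""
        B, Q, Bp, s, n = self.B, self.Q, self.Bp, self.s, self.n
        bi = -(-i // s)                                # first block starting at or after i
        bj = self.t - 1 if j == n - 1 else (j + 1) // s - 1
        if bi <= bj:
            best, fc = self.S[bi][bj], self.Sf[bi][bj]
            prefix = range(i, bi * s)
            suffix = range(j, min((bj + 1) * s, n) - 1, -1)
        else:                                          # empty span: scan the whole range
            best, fc = None, 0
            prefix, suffix = range(i, j + 1), range(0)
        q = max(fc, 1)                                 # threshold for the Lemma 4 test
        seen = {}                                      # candidate -> frequency in B[i..j]
        for x in prefix:
            a = B[x]
            if a in seen:
                continue
            y = Bp[x] + q - 1
            if y < len(Q[a]) and Q[a][y] <= j:         # at least q instances in B[x..j]
                while y < len(Q[a]) and Q[a][y] <= j:
                    y += 1
                seen[a] = y - Bp[x]
        for x in suffix:                               # mirrored procedure
            a = B[x]
            if a in seen:
                continue
            y = Bp[x] - q + 1
            if y >= 0 and Q[a][y] >= i:                # at least q instances in B[i..x]
                while y >= 0 and Q[a][y] >= i:
                    y -= 1
                seen[a] = Bp[x] - y
        for a, f in seen.items():
            if f > fc:
                best, fc = a, f
        return self.vals[best], fc
```

With `eps=0.5` the two tables hold $O(n)$ entries in total, matching the linear-space setting of Corollary 3; smaller values of `eps` trade space for query time.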
If the prefix and suffix are empty, then $S[b_i, b_j]$ is a mode of $B[i:j]$, and this value is returned in constant time. Without loss of generality, suppose the prefix contains at least one item. Consider an arbitrary index $x \in \{i, \ldots, b_i \cdot s\}$ during the scan of the prefix. If $C[B[x]] > 0$, then $B[x]$ is processed in constant time. Therefore, suppose $C[B[x]] = 0$. That is, $x$ corresponds to the index of the first instance of $B[x]$ in the prefix. Consequently, the frequency of $B[x]$ in $B[i:j]$, denoted $f_x$, is equal to its frequency in $B[x:j]$. By Lemma 4, determining whether $f_x \ge f_c$ requires only constant time. Any item $B[x]$ that is not a candidate is processed in constant time. Therefore, suppose $B[x]$ is a candidate. Since the prefix and suffix each have size at most $s - 1$, $f_c \le f_x \le f_c + 2(s-1)$. Consequently, $B[x]$ incurs a cost of $O(f_x - f_c)$ time for its first occurrence, and $O(1)$ time for subsequent occurrences. Since $f_c$ is the frequency of the mode of the span, at least $f_x - f_c$ instances of $B[x]$ must occur in the prefix or suffix. In other words, instances of element $B[x]$ incur a total cost of $O(c_x)$ time, where $c_x$ denotes the frequency of $B[x]$ in the prefix and suffix. Since the number of items in the prefix and suffix is at most $2(s-1)$, the total cost for processing the prefix is $O(s)$. By an analogous argument, the total cost for processing the suffix is also $O(s)$. Identifying the maximum element in array $C$ and re-initializing $C$ to zero requires $O(s)$ time. Therefore, a range mode query requires $O(s) = O(n^{\epsilon})$ time in the worst case.
The data structure requires $O(n)$ space to store the arrays $A$, $B$, and $B'$, $O(n)$ total space to store the arrays $Q_1, \ldots, Q_k$, and $O(t^2) = O(n^{2-2\epsilon})$ space to store the tables $S$ and $S'$. This gives $O(n^{2-2\epsilon})$ total space for $O(n^{\epsilon})$ worst-case query time for any $\epsilon \in [0, 1/2]$. Regardless of the value of $\epsilon$, $\Omega(n)$ space is required. Therefore, increasing $\epsilon$ beyond $1/2$ increases the query time without reducing the space below linear.

We apply results from Section 3 to obtain three additional $O(n)$-space data structures, giving the following theorem:
Theorem 5 Given an array $A[1:n]$ of $n$ items, there exists a data structure requiring $O(n)$ storage space that supports range mode queries on any $A[i:j]$ in $O(\min\{\sqrt{n}, k, |j - i|, m + \log \log n\})$ time in the worst case, where $k$ denotes the number of distinct elements in $A$ and $m$ denotes the frequency of the mode of $A$.
$O(k)$ Query Time and $O(n)$ Space
We now describe an $O(k + s)$ query time and $O(n + n \cdot k/s)$-space data structure for any fixed $s \in [1, n]$. When $s \in \Theta(k)$, our data structure requires $O(n)$ space and supports range mode queries in $O(k)$ time. A value of $s \in o(k)$ (respectively, $s \in \omega(k)$) results in $\omega(n)$ space ($\omega(k)$ time) without any reduction in query time (space).
Figure 2: Example of the sparse frequency table method data structure. The number of list items is $n = 16$, of which $k = 5$ are distinct. The array is partitioned into four blocks of size $s = 4$. The query range is $A[i:j] = A[6:15]$, for which elements 10 and 20 are modes, each occurring with frequency 3. The corresponding modes of $B[i:j]$ are 1 and 2. Thus, $C[1] = C[2] = 3$ is the maximum value in the frequency array $C$.
Data Structure Precomputation.
For each $p \in \{0, \ldots, n\}$ such that $p \bmod s = 0$, construct a frequency table $C_p[1:k]$ for the range $B[1:p]$. Create one additional array $C[1:k]$, initialized to zero. There are at most $\lceil n/s \rceil + 1$ such arrays $C_p$. See Figure 2. The preprocessing time required is $O(n + n \cdot k/s)$ (or $O(n \log k + n \cdot k/s)$ time if $k$ or $B$ must be computed).
Range Mode Query Algorithm.
Array $B$ is partitioned into blocks of size $s$ as in Section 3. Given a query range $B[i:j]$, we refer to the sequence of blocks completely covered by $B[i:j]$ as the span, and to the remaining subarrays as the prefix and suffix, respectively. A query on $B[i:j]$ is performed as follows:
1. Let $p = s \lfloor (i-1)/s \rfloor$ and let $p' = s \lfloor j/s \rfloor$. That is, $p$ is the largest $p \le i - 1$ such that array $C_p$ is defined. Similarly, $p'$ is the largest $p' \le j$ such that array $C_{p'}$ is defined.
2. Create an array $C[1:k]$ such that for each $x$, $C[x] \leftarrow C_{p'}[x] - C_p[x]$. Upon completing this step, $C$ is a frequency table for $B[p+1:p']$.
3. For each $x \in \{p+1, \ldots, i-1\}$, set $C[B[x]] \leftarrow C[B[x]] - 1$. For each $x \in \{p'+1, \ldots, j\}$, set $C[B[x]] \leftarrow C[B[x]] + 1$. Upon completing this step, $C$ is a frequency table for the entire query range $B[i:j]$.
4. Find a maximum value in $C$. If $x'$ is an index that maximizes $C[x']$, then $x'$ is a mode of $B[i:j]$.
Storage Space and Query Time.
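Steps 1 through 4 of the sparse frequency table query can be sketched as follows. This is a minimal illustration under assumptions: the class name `SparseFrequencyTable` and its method names are hypothetical, ranks in `B` are taken from $\{1, \ldots, k\}$, query indices are 1-based and inclusive as in the text, and the sample array is invented for the example (it is not the array of Figure 2).

```python
class SparseFrequencyTable:
    def __init__(self, B, k, s):
        self.B, self.k, self.s = B, k, s
        # tables[q] is the frequency table C_p for p = q * s, covering B[1:p].
        cnt = [0] * (k + 1)                  # slot 0 is unused
        self.tables = [cnt[:]]               # C_0 is all zeros
        for x, a in enumerate(B, start=1):
            cnt[a] += 1
            if x % s == 0:
                self.tables.append(cnt[:])

    def query(self, i, j):
        """Return (mode, frequency) of B[i..j], 1-indexed inclusive."""
        B, k, s = self.B, self.k, self.s
        # Step 1: nearest stored tables at or before i - 1 and j.
        p, pp = s * ((i - 1) // s), s * (j // s)
        Cp, Cpp = self.tables[p // s], self.tables[pp // s]
        # Step 2: frequency table for B[p+1 : p'].
        C = [Cpp[x] - Cp[x] for x in range(k + 1)]
        # Step 3: remove B[p+1 : i-1], then add B[p'+1 : j].
        for x in range(p + 1, i):
            C[B[x - 1]] -= 1
        for x in range(pp + 1, j + 1):
            C[B[x - 1]] += 1
        # Step 4: an index of maximum value in C is a mode.
        mode = max(range(1, k + 1), key=C.__getitem__)
        return mode, C[mode]

B = [1, 2, 3, 1, 2, 2, 1, 4, 1, 2, 5, 1, 3, 2, 4, 1]   # n = 16, k = 5
ds = SparseFrequencyTable(B, k=5, s=4)
print(ds.query(6, 15))  # (1, 3)
```

In this sample, elements 1 and 2 both occur three times in $B[6:15]$; `max` returns the smaller index when frequencies tie.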
The data structure consists of arrays $A$ and $B$, requiring $O(n)$ space, and at most $\lceil n/s \rceil + 1$ frequency tables of size $k$. Thus, the total space required by the data structure is $O(n + n \cdot k/s)$. Steps 1 through 4 of the algorithm require $O(1)$, $O(k)$, $O(s)$, and $O(k)$ time, respectively. This gives $O(n + n \cdot k/s)$ total space for $O(k + s)$ query time.
$O(m + \log \log n)$ Query Time and $O(n)$ Space
Using a combination of ideas from Section 3 and from an approximate range mode query data structure of Greve et al. [25], we briefly describe a range mode data structure parameterized in terms of the frequency of the mode, $m$, with good bounds on space and query time when $m$ is small (e.g., $m \in O(\sqrt{n})$).
As in Section 3, the rank array $B'$ and the arrays $Q_1, \ldots, Q_k$ are constructed, and array $B$ is partitioned into blocks of size $s$. For each $i \in \{1, \ldots, n\}$ such that $i \bmod s = 0$, construct an array $F_i[1:m]$ such that for each $x$, $F_i[x]$ stores the largest $j \le n$ such that the mode of $B[i:j]$ has frequency at most $x$; a corresponding mode is also stored. A query range $B[i:j]$ is divided into prefix, span, and suffix subarrays as before. Let $p = s \lceil i/s \rceil$ denote the index of the first element of the span. Using the technique of Greve et al. [25], a mode of the span and its frequency are computed by finding the successor of $j$ in $F_p$; this can be achieved in $O(\log \log n)$ time by an $O(n)$-space data structure (e.g., a van Emde Boas tree [15, 17, 16] or a y-fast trie [37]). By Lemma 4, determining whether the frequency of an element in the prefix or suffix exceeds that of the mode of the span requires only constant time per element, or $O(s)$ total time. The resulting worst-case query time is $O(s + \log \log n)$ using $O(n + n \cdot m/s)$ space. Choosing $s \in \Theta(m)$ gives $O(n)$ space and $O(m + \log \log n)$ query time.
$O(|j - i|)$ Query Time and $O(n)$ Space
We briefly describe an $O(|j - i|)$-time and $O(n)$-space data structure. No actual precomputation is necessary other than constructing the array $B$, finding $k$, and initializing a frequency table $C[1:k]$ to zero, all of which can be achieved in $O(n \log k)$ precomputation time. This algorithm is similar to counting sort: compute a frequency table for $B[i:j]$ stored in $C[1:k]$, then identify a maximum element in $C[1:k]$. When computing the maximum, the running time is bounded to $O(|j - i|)$ by only examining indices in $C$ that correspond to elements in $B[i:j]$ (these are exactly the elements of $C$ that have non-zero values). This procedure is repeated after identifying the maximum to reset $C[1:k]$ to zero. Each step requires $\Theta(|j - i|)$ time and the total space required by the data structure is $O(n)$.

A natural question is whether our results for one-dimensional range mode query extend to arbitrary dimensions. The array $B[1:n]$ is replaced by a $d$-dimensional array $B[1:n_1, \ldots, 1:n_d]$, containing $n$ elements in total with dimensionality $n_1, \ldots, n_d$, where $n = n_1 \times \cdots \times n_d$. Within Section 5 we refer to a $d$-dimensional tuple (e.g., $\vec{i} = [i_1, \ldots, i_d]$) as an array index (e.g., $B[\vec{i}]$). We say a tuple $\vec{i}$ dominates another tuple $\vec{j}$ if and only if $i_t \le j_t$ for all $t \in \{1, \ldots, d\}$. We denote the input array as $B[\vec{1} : \vec{n}]$, where $\vec{n} = [n_1, \ldots, n_d]$. A range is defined over a $d$-dimensional rectangle of indices, uniquely determined by two indices, $[\vec{i} : \vec{j}]$, where $\vec{i} \le \vec{j}$.
A key element of our one-dimensional data structures is the use of frequency tables. In $d$ dimensions, array $C[1:k]$ is a frequency table for $B[\vec{a} : \vec{b}]$ if, for each $i \in \{1, \ldots, k\}$, $C[i]$ stores the number of occurrences of element $B[\vec{x}] = i$ in $B[\vec{a} : \vec{b}]$. Unlike the one-dimensional case, if $C_{\vec{i}}[1:k]$ is a frequency table for $B[\vec{1} : \vec{i}]$ and $C_{\vec{j}}[1:k]$ is a frequency table for $B[\vec{1} : \vec{j}]$, then $C_{\vec{j}}[B[\vec{x}]] - C_{\vec{i}}[B[\vec{x}]]$ is not the frequency of $B[\vec{x}]$ in $B[\vec{i} : \vec{j}]$ in general. In one dimension, $\vec{i}$ dominates all indices that are to be excluded from the count, whereas this is not the case in higher dimensions. Instead, the $2^d$ corners of the $d$-rectangle $[\vec{i} : \vec{j}]$ can be used to compute the frequency table with typical inclusion-exclusion rules [14]. The result is computed using $2^d$ $d$-directional range counting queries to determine the frequency of $B[\vec{x}]$ in $B[\vec{a} : \vec{b}]$. In the range searching literature it is typical to assume $d$ to be a small known constant and for the corresponding factors of $d$ to be omitted from the evaluation of space and time requirements.
Counting Method.
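The one-dimensional counting method of Section 4.3 can be sketched as follows; a $d$-dimensional version would run the same three passes over the indices of the query rectangle instead of a contiguous range. The class name `CountingRangeMode` is hypothetical, ranks in `B` are assumed to lie in $\{1, \ldots, k\}$, and query indices are 1-based and inclusive.

```python
class CountingRangeMode:
    def __init__(self, B, k):
        self.B = B
        self.C = [0] * (k + 1)     # shared frequency table, all zero between queries

    def query(self, i, j):
        """Return (mode, frequency) of B[i..j] in time proportional to the range length."""
        B, C = self.B, self.C
        for x in range(i - 1, j):              # pass 1: count occurrences
            C[B[x]] += 1
        # pass 2: examine only the entries of C touched by the range
        mode = max((B[x] for x in range(i - 1, j)), key=C.__getitem__)
        freq = C[mode]
        for x in range(i - 1, j):              # pass 3: reset touched entries
            C[B[x]] = 0
        return mode, freq
```

Resetting only the touched entries is what keeps the query time independent of $k$.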
The counting method described in Section 4.3 does not depend on any properties of one-dimensional data and extends to $d$-dimensional data and queries. The query time is directly proportional to the cardinality of the query range $[\vec{i} : \vec{j}]$: $O\left(\prod_{l=1}^{d} (j_l - i_l + 1)\right)$. Precomputation time, query time, and space requirements are analogous to those of the one-dimensional data structure.
Sparse Frequency Table Method.
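The inclusion-exclusion computation over prefix frequency tables can be illustrated in two dimensions, where the $2^d = 4$ corners of the query rectangle are combined. The sketch below is a simplification for clarity: it stores a frequency table at every index (that is, $O(n \cdot k)$ space), whereas the sparse method keeps tables only for a subset $T$ of indices. The function names are hypothetical, ranks are assumed to lie in $\{1, \ldots, k\}$, and row and column indices are 0-based and inclusive.

```python
def prefix_frequency_tables(B, k):
    """P[r][c][a] = number of occurrences of a in rows 0..r-1, columns 0..c-1 of B."""
    rows, cols = len(B), len(B[0])
    P = [[[0] * (k + 1) for _ in range(cols + 1)] for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            cell = P[r + 1][c + 1]
            for a in range(1, k + 1):
                cell[a] = P[r][c + 1][a] + P[r + 1][c][a] - P[r][c][a]
            cell[B[r][c]] += 1
    return P

def rectangle_mode(P, k, r1, c1, r2, c2):
    """Return (mode, frequency) for rows r1..r2 and columns c1..c2."""
    C = [0] * (k + 1)
    for a in range(1, k + 1):
        # Inclusion-exclusion over the four corners of the rectangle.
        C[a] = (P[r2 + 1][c2 + 1][a] - P[r1][c2 + 1][a]
                - P[r2 + 1][c1][a] + P[r1][c1][a])
    mode = max(range(1, k + 1), key=C.__getitem__)
    return mode, C[mode]

B = [[1, 2, 2],
     [3, 2, 1],
     [1, 1, 2]]
P = prefix_frequency_tables(B, 3)
print(rectangle_mode(P, 3, 0, 1, 1, 2))  # (2, 3)
```

Each query costs $O(2^d k)$ time for $d = 2$, independent of the size of the rectangle.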
We now consider a generalization to $d$ dimensions of the sparse frequency table method described in Section 4.1. As in the one-dimensional data structure, for every $\vec{t} \in T$ we precompute a frequency table $C_{\vec{t}}[1:k]$ for the range $B[\vec{1} : \vec{t}]$, where $T \subseteq [\vec{1} : \vec{n}]$ is a fixed subset of indices. If $T$ is a sparse set whose elements are distributed regularly across $[\vec{1} : \vec{n}]$, then a frequency table for the span can be computed in $O(2^d k)$ time and $O(n)$ space using the inclusion-exclusion principle. The remainder of the query algorithm consists of examining each index $\vec{w}$ in the enclosing set $W = [\vec{i} : \vec{j}] \setminus [\vec{b}_i : \vec{b}_j]$ (known as the suffix and prefix in the one-dimensional case) and incrementing the corresponding frequency count $C[B[\vec{w}]]$. Finally, the maximum value of the frequency table $C$ determines the frequency of the mode; this maximum is identified in $O(k)$ time. Therefore, the total query time is $O(2^d k + |W|)$.
The regular positioning of the indices in $T$ forms a $d$-dimensional grid that divides $B[\vec{1} : \vec{n}]$ evenly into $|T|$ cells, each of which is a $d$-rectangle of cardinality $s = n/|T|$. Each frequency table has size $k$. In order for the space occupied by the frequency tables to remain linear there can be at most $O(n/k)$ such tables (e.g., let $|T| = \lceil n/k \rceil$ and $s = k$). We set the width of each cell in the $l$th dimension to be $O(n_l (k/n)^{1/d})$. Observe that $\prod_{l=1}^{d} n_l (k/n)^{1/d} = k$. Since there are $s = k$ items in a cell, the number of items on the cell's surface perpendicular to dimension $l$ is
$$O\left(\frac{k}{n_l}\left(\frac{n}{k}\right)^{1/d}\right) = O\left(\frac{k^{(d-1)/d}\, n^{1/d}}{n_l}\right).$$
Observe that $|W|$ is at most $s$ times the number of cells on the external surfaces of the $d$-rectangle specified by the query range $[\vec{i} : \vec{j}]$. The total number of items on the external surface perpendicular to some dimension $l \in \{1, \ldots, d\}$ is $O(n/n_l)$. Thus the number of cells on that external surface is
$$O\left(\frac{n}{n_l} \cdot \frac{n_l}{k^{(d-1)/d}\, n^{1/d}}\right) = O\left(\left(\frac{n}{k}\right)^{(d-1)/d}\right).$$
Therefore, $|W| \in O(d \cdot k (n/k)^{(d-1)/d}) = O(d \cdot n^{(d-1)/d} k^{1/d})$, resulting in a total query time of $O(2^d k + d \cdot n^{(d-1)/d} k^{1/d})$. If $k$ is constant, then the query time can be improved to $O(2^d k)$ using $O(n \cdot k)$ space by including a frequency table for every item in $B$.
Sparse Mode Table Method.
The sparse mode table method described in Section 3 and the sparse frequency table method both specify a subset $T$ of indices positioned at regular intervals for which any pair determines a span within the array $B$. Instead of storing frequencies for all elements in $D$, however, the sparse mode table method stores a precomputed mode of the span between any two indices in $T$. The mode of the query range is then found by searching for elements in the prefix and suffix whose frequency exceeds that of the mode of the span.

This data structure exemplifies the space-time trade-off. The $O(\sqrt{n})$ query time and $O(n)$ space bounds of the one-dimensional data structure are possible because the cardinality of the prefix and suffix can be kept small while minimizing the time required to measure the frequency of elements in the prefix and suffix. In particular, the one-dimensional data structure supports a constant-time query to determine whether the frequency of a given element exceeds that of the mode of the span. This is achieved by referring to the arrays $Q_1, \ldots, Q_k$. These arrays, however, do not generalize easily to higher dimensions. A corresponding decision query would be: "Does element $B[\vec{x}]$ occur at least $m$ times in the block $B[\vec{i} : \vec{j}]$?" Replacing the arrays $Q_1, \ldots, Q_k$ with orthogonal range counting data structures answers the query: "How frequently does element $B[\vec{x}]$ occur in the block $B[\vec{i} : \vec{j}]$?" A range counting query computed using $kd$-trees gives a linear-space data structure with $O(|Q[B[\vec{x}]]|^{1-1/d})$ query time [31]. Bentley and Maurer [5] describe a linear-space data structure with a faster query time of $O(|Q[B[\vec{x}]]|^{\epsilon})$ for any fixed $\epsilon < 1$, where the time and space bounds omit constant factors of $\epsilon$.

As in Section 5, let $W$ denote the enclosing set of indices (i.e., the indices of the query range not contained in the span). Let $D_W$ denote the set of distinct elements contained in $W$. Thus the range mode query time is
$$O\left(\max\left\{\sum_{u \in D_W} |Q[u]|^{(d-1)/d},\ |W|\right\}\right) \subseteq O\left(\max\left\{n^{(d-1)/d},\ |W|\right\}\right). \tag{1}$$
Our data structure includes $kd$-trees. In the corresponding analysis of Lee and Wong [31], $d$ is assumed to be constant; consequently, constants dependent upon $d$ do not appear in (1).

Tables $S$ and $S'$ respectively store a mode and its frequency for the span $B[\vec{b}_i : \vec{b}_j]$ for all $\{\vec{b}_i, \vec{b}_j\} \subseteq T$. Maintaining linear space requires that $\Theta(|T|) = \Theta(s) = \Theta(\sqrt{n})$. We set the number of elements per cell in the $l$th dimension to be $O(\sqrt{n_l})$. Thus the number of elements on the surface of the cell perpendicular to the $l$th dimension is $O(\sqrt{n/n_l})$. The total number of elements on the external surface perpendicular to some dimension $l \in \{1, \ldots, d\}$ is $O(n/n_l)$. Thus the number of cells on the external surface is $O((n/n_l)\sqrt{n_l/n}) = O(\sqrt{n/n_l})$. Therefore,
$$|W| \in O\left(n \sum_{l=1}^{d} \frac{1}{\sqrt{n_l}}\right). \tag{2}$$
If all values $n_l$ are equal, then (2) simplifies to $O(d \cdot n^{1 - 1/(2d)})$.

Generalizing Mode.
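The generalization considered in this paragraph, returning the $x$th most frequently occurring element of a query range, can be illustrated with a short Python sketch. The sketch is only indicative and rests on assumptions: the function names `select_kth_largest` and `xth_most_frequent` are hypothetical, the frequency table is built by scanning the range directly (in the spirit of the counting method, using $O(|j - i|)$ time), indices are 0-based and inclusive, and randomized quickselect stands in for a selection algorithm, so its running time is linear in expectation rather than in the worst case.

```python
import random
from collections import Counter


def select_kth_largest(items, x, key):
    """Return the item of rank x (1 = largest key) using randomized quickselect."""
    items = list(items)
    while True:
        pivot = key(random.choice(items))
        larger = [it for it in items if key(it) > pivot]
        equal = [it for it in items if key(it) == pivot]
        if x <= len(larger):
            items = larger                      # answer lies among the larger keys
        elif x <= len(larger) + len(equal):
            return equal[0]                     # pivot key has rank x
        else:
            x -= len(larger) + len(equal)       # discard larger and equal keys
            items = [it for it in items if key(it) < pivot]


def xth_most_frequent(a, i, j, x):
    """Return (element, frequency) of the x-th most frequent element of a[i..j]."""
    freq = Counter(a[i:j + 1])                  # frequency table for the range
    if not 1 <= x <= len(freq):
        raise ValueError("x must lie between 1 and the number of distinct elements")
    return select_kth_largest(freq.items(), x, key=lambda kv: kv[1])
```

Calling `xth_most_frequent(a, i, j, 1)` returns a mode of the range; ties between equally frequent elements are broken arbitrarily.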
The sparse frequency table and counting methods described in Sections 4.1 and 4.3, respectively, can be generalized to return the $x$th most frequently occurring element in the query range $A[i : j]$ for any $x \in \{1, \ldots, k\}$ by employing a linear-time ($O(\min\{k, |j - i|\})$ time) selection algorithm to find the $x$th largest element in the frequency table for $A[i : j]$. For the sparse mode table method described in Section 3, an analogous generalization seems unlikely without a significant increase in space, owing to the method's dependence on the precomputed modes stored in array $S$.

Open Problem 1
Given a list $A[1 : n]$ of $n$ items, construct an $O(n)$-space data structure for identifying the $x$th most frequently occurring element in the range $A[i : j]$ with $O(\sqrt{n})$ query time, where $i$, $j$, and $x$ are provided at query time.

Dynamic Range Mode Query.
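One of the obstacles to dynamic updates is the cost of maintaining the sorted position arrays $Q_1, \ldots, Q_k$ when an assignment $A[i] \leftarrow x$ is made. The following Python sketch illustrates that cost under stated assumptions: the class name `PositionLists` and its methods are hypothetical, indices are 0-based and inclusive, and frequencies are obtained by binary search rather than by the direct indexing used in the static data structure. Deleting from or inserting into a sorted Python list shifts elements, so a single update takes time linear in the length of the affected list.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class PositionLists:
    """Sorted position lists, one per distinct element (hypothetical sketch)."""

    def __init__(self, a):
        self.a = list(a)
        self.q = defaultdict(list)      # q[x]: sorted indices at which x occurs
        for idx, x in enumerate(self.a):
            self.q[x].append(idx)

    def frequency(self, x, i, j):
        """Number of occurrences of x in a[i..j], via two binary searches."""
        pos = self.q.get(x, [])
        return bisect_right(pos, j) - bisect_left(pos, i)

    def assign(self, i, x):
        """Perform a[i] <- x, keeping every position list sorted."""
        old = self.a[i]
        if old == x:
            return
        pos = self.q[old]
        del pos[bisect_left(pos, i)]    # linear time: later entries shift left
        if not pos:
            del self.q[old]             # the set of distinct elements shrinks
        insort(self.q[x], i)            # linear time: later entries shift right
        self.a[i] = x
```

A balanced search tree could replace each sorted list to obtain logarithmic-time updates, but that would give up the direct indexing on which the constant-time frequency test relies.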
Prior discussion has been restricted to static data structures for range mode query. Dynamically updating the list of items is a natural operation: $A[i] \leftarrow x$. Unlike the range median query problem, for which dynamic data structures exist [9, 10, 24, 28], none of the previous data structures for range mode query [7, 25, 30, 33, 34] support efficient updates. We briefly discuss some of the challenges of making our data structures dynamic.

Both the sparse frequency table and counting methods described in Sections 4.1 and 4.3, respectively, permit straightforward constant-time updates when the set of distinct elements, $D$, remains unchanged. Updates that modify $D$, however, require careful consideration. A key issue in defining dynamic data structures analogous to the static data structures described in this paper is to generalize the mapping defined by array $B$ (see Section 3) to support efficient updates. We have preliminary results demonstrating that such updates are possible for implementing a dynamic version of the counting method. As the data structure for the sparse frequency table method is currently specified, however, updates that modify $D$ require $\Theta(n)$ time in the worst case. The sparse mode table method described in Section 3 does not suggest itself as a good candidate for efficient updates. In particular, the table $S$ requires $\Theta(n)$ updates in the worst case, even if $D$ remains unchanged. Also challenging is the problem of updating the arrays $Q_1, \ldots, Q_k$. Each set $Q_x$ is stored as a sorted array to enable direct indexing, resulting in $\Theta(n)$ update time in the worst case. Thus, the problem of defining an efficient dynamic range mode query data structure remains open.

Open Problem 2
Given an array $A[1 : n]$ of $n$ items, construct a dynamic data structure that supports efficient range mode queries and updates.

Geometric Range Mode Query.
The range mode problem has a natural definition in Euclidean space:
Open Problem 3
Given a multiset $P$ of $n$ points in $\mathbb{R}^d$, construct a data structure to support queries that return a mode of $P \cap R$ for an arbitrary (orthogonal) query range $R \subseteq \mathbb{R}^d$. What is the time complexity of such a range query for a given space bound?

Equivalently, consider a set of points $P' \subseteq \mathbb{R}^d$ such that each point $p \in P'$ is assigned a colour. In this case, the mode of $R \cap P'$ is the most frequently occurring colour in the query region. As discussed in Section 2, when $d = 1$, this problem reduces to range mode query on an array. When $d \geq 2$, however, solution techniques tend to differ extensively for range searching problems set in continuous Euclidean space versus those restricted to array input.

A range reporting query can be combined with a mode-finding algorithm (e.g., the counting method described in Section 4.3) to identify the multiset of points within the query range and then compute its mode. Such a solution requires enumerating all elements in the query range, possibly resulting in poor query time (e.g., when $|R \cap P| \in \Theta(|P|)$). A more ingenious solution might reduce query time by avoiding the use of a range reporting query. Other than a basic combination approach such as that described above, the range mode query problem in the continuous setting remains open.

Lower Bounds.
Recently, Greve et al. [25] showed that any data structure that uses $s$ memory cells of $w$ bits requires $\Omega(\log n / \log(s \cdot w / n))$ time to answer a range mode query. For linear-space data structures in the RAM model, $s \cdot w \in \Theta(n \log n)$, corresponding to a lower bound of $\Omega(\log n / \log\log n)$ query time. Other than the bound of Greve et al. and the lower bounds on the problem of computing a mode of a multiset (see Section 2), little is known regarding non-trivial lower bounds for the time complexity of the range mode query problem. In particular, it is unknown whether there exists a linear-space data structure that supports $o(\sqrt{n})$ query time.

Open Problem 4
Identify a function $f(n)$ such that any $O(n)$-space data structure that supports range mode query on an array of $n$ items requires $\Omega(f(n))$ query time in the worst case, where $f(n) \in \omega(\log n / \log\log n)$, or provide an $O(n)$-space data structure that supports $O(\log n / \log\log n)$-time queries.

The corresponding question for range selection query was recently solved by Jørgensen and Larsen [29], who showed a lower bound of $\Omega(\log r / \log\log n)$ and a linear-space data structure with $O(\log r / \log\log n + \log\log n)$ query time, where $r$ denotes the rank of the selection query.

Acknowledgements.
The authors thank Peyman Afshani, Timothy Chan, Francisco Claude, Meng He, Ian Munro, Patrick Nicholson, Matthew Skala, and Norbert Zeh for discussing various topics related to range searching.
References

[1] P. K. Agarwal. Range searching. In J. Goodman and J. O'Rourke, editors, Handbook of Discrete and Computational Geometry, pages 809–837. CRC Press, New York, 2nd edition, 2004.
[2] N. Alon and B. Schieber. Optimal preprocessing for answering on-line product queries. Technical Report 71/87, Tel-Aviv University, 1987.
[3] A. Amir, J. Fischer, and M. Lewenstein. Two-dimensional range minimum queries. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM), volume 4580 of Lecture Notes in Computer Science, pages 286–294. Springer, 2007.
[4] M. A. Bender and M. Farach-Colton. The LCA problem revisited. In Proceedings of the Latin American Theoretical Informatics Symposium (LATIN), volume 1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000.
[5] J. L. Bentley and H. A. Maurer. Efficient worst-case data structures for range searching. Acta Informatica, 13(2):155–168, 1980.
[6] O. Berkman, D. Breslauer, Z. Galil, B. Schieber, and U. Vishkin. Highly parallelizable problems. In Proceedings of the ACM Symposium on the Theory of Computing (STOC), pages 309–319, 1989.
[7] P. Bose, E. Kranakis, P. Morin, and Y. Tang. Approximate range mode and range median queries. In Proceedings of the International Symposium on Theoretical Aspects of Computer Science (STACS), volume 3404 of Lecture Notes in Computer Science, pages 377–388. Springer, 2005.
[8] G. S. Brodal, P. Davoodi, and S. S. Rao. On space efficient two dimensional range minimum data structures. In Proceedings of the European Symposium on Algorithms (ESA), volume 6346/6347 of Lecture Notes in Computer Science. Springer, 2010.
[9] G. S. Brodal, B. Gfeller, A. G. Jørgensen, and P. Sanders. Towards optimal range medians. Theoretical Computer Science, 2011. In press.
[10] G. S. Brodal and A. G. Jørgensen. Data structures for range median queries. In Proceedings of the International Symposium on Algorithms and Computation (ISAAC), volume 5878 of Lecture Notes in Computer Science, pages 822–831. Springer, 2009.
[11] T. Chan and M. Pătraşcu. Counting inversions, offline orthogonal range counting, and related problems. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 161–173, 2010.
[12] B. Chazelle and B. Rosenberg. Computing partial sums in multidimensional arrays. In Proceedings of the ACM Symposium on Computational Geometry (SoCG), pages 131–139, 1989.
[13] E. D. Demaine, G. M. Landau, and O. Weimann. On Cartesian trees and range minimum queries. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP), volume 5555 of Lecture Notes in Computer Science, pages 341–353. Springer, 2009.
[14] V. Dujmović, J. Howat, and P. Morin. Biased range trees. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 486–495, 2009.
[15] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proceedings of the IEEE Symposium on Foundations of Computer Science, pages 75–84, 1975.
[16] P. van Emde Boas. Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters, 6(3):80–82, 1977.
[17] P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 10:99–127, 1976.
[18] J. Fischer. Optimal succinctness for range minimum queries. In Proceedings of the Latin American Theoretical Informatics Symposium (LATIN), volume 6034 of Lecture Notes in Computer Science, pages 158–169. Springer, 2010.
[19] J. Fischer and V. Heun. Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM), volume 4009 of Lecture Notes in Computer Science, pages 36–48. Springer, 2006.
[20] J. Fischer and V. Heun. A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In Proceedings of the International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies (ESCAPE), volume 4614 of Lecture Notes in Computer Science, pages 459–470. Springer, 2007.
[21] J. Fischer and V. Heun. Finding range minima in the middle: Approximations and applications. Mathematics in Computer Science, 3(1):17–30, 2010.
[22] H. N. Gabow, J. L. Bentley, and R. E. Tarjan. Scaling and related techniques for geometry problems. In Proceedings of the ACM Symposium on the Theory of Computing (STOC), pages 135–143, 1984.
[23] T. Gagie, S. J. Puglisi, and A. Turpin. Range quantile queries: Another virtue of wavelet trees. In Proceedings of the String Processing and Information Retrieval Symposium (SPIRE), volume 5721 of Lecture Notes in Computer Science, pages 1–6. Springer, 2009.
[24] B. Gfeller and P. Sanders. Towards optimal range medians. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP), volume 5555 of Lecture Notes in Computer Science, pages 475–486. Springer, 2009.
[25] M. Greve, A. G. Jørgensen, K. D. Larsen, and J. Truelsen. Cell probe lower bounds and approximations for range mode. In Proceedings of the International Colloquium on Automata, Languages, and Programming (ICALP), volume 6198 of Lecture Notes in Computer Science, pages 605–616. Springer, 2010.
[26] S. Har-Peled and S. Muthukrishnan. Range medians. In Proceedings of the European Symposium on Algorithms (ESA), volume 5193 of Lecture Notes in Computer Science, pages 503–514. Springer, 2008.
[27] J. JáJá, C. W. Mortensen, and Q. Shi. Space-efficient and fast algorithms for multidimensional dominance reporting and counting. In Proceedings of the International Symposium on Algorithms and Computation (ISAAC), volume 3341 of Lecture Notes in Computer Science, pages 558–568. Springer, 2004.
[28] A. G. Jørgensen. Data Structures: Sequence Problems, Range Queries, and Fault Tolerance. PhD thesis, Aarhus University, 2010.
[29] A. G. Jørgensen and K. D. Larsen. Range selection and median: Tight cell probe lower bounds and adaptive data structures. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2011. To appear.
[30] D. Krizanc, P. Morin, and M. Smid. Range mode and range median queries on lists and trees. Nordic Journal of Computing, 12:1–17, 2005.
[31] D. T. Lee and C. K. Wong. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica, 9(1):23–29, 1977.
[32] J. I. Munro and M. Spira. Sorting and searching in multisets. SIAM Journal on Computing, 5(1):1–8, 1976.
[33] H. Petersen. Improved bounds for range mode and range median queries. In Proceedings of the Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), volume 4910 of Lecture Notes in Computer Science, pages 418–423. Springer, 2008.
[34] H. Petersen and S. Grabowski. Range mode and range median queries in constant time and sub-quadratic space. Information Processing Letters, 109:225–228, 2009.
[35] C. K. Poon. Optimal range max datacube for fixed dimensions. In Proceedings of the International Conference on Database Theory (ICDT), volume 2572 of Lecture Notes in Computer Science, pages 158–172. Springer, 2003.
[36] S. Skiena. The Algorithm Design Manual. Springer, 2nd edition, 2008.
[37] D. E. Willard. Log-logarithmic worst-case range queries are possible in space Θ(N). Information Processing Letters, 17:81–84, 1983.
[38] A. C. Yao. Space-time tradeoff for answering range queries. In Proceedings of the ACM Symposium on the Theory of Computing (STOC), pages 128–136, 1982.
[39] A. C. Yao. On the complexity of maintaining partial sums. SIAM Journal on Computing, 14:277–288, 1985.
[40] H. Yuan and M. J. Atallah. Data structures for range minimum queries. In