Cantor Mapping Technique
Kaustubh Joshi [email protected]
Abstract: A new technique specific to string ordering, called "Cantor Mapping", is explained in this paper and used to perform comparative string sorting in loglinear time while using linear extra space.
Index Terms: Cantor Mapping, Precision Specific, String Sorting
I. INTRODUCTION
The whole process of comparative sorting is to compare two elements with each other to find out which one possesses the higher or lower value (or whether they are equivalent). Because of this, the time complexity of every comparative sorting algorithm depends on a factor $f(l)$, where $l$ is the size of the representation of the object being compared. Standard merge sort takes $O(n \log n)$ time for numbers and $O(n \cdot l \cdot \log n)$ time for strings. This $f(l)$ factor is $O(1)$ for numbers and $O(l)$ for linear sequences.

Due to the nature of the $f(l)$ factor, the time complexity of searching and sorting algorithms increases by a factor of $f(l)$ depending on the object type. This is what causes string sorting to take longer than standard numerical sorting. Non-comparative methods overcome this by using additional memory (which is not always available to machines), as shown by counting sort for integers with fixed upper and lower bounds and by burst sort, which uses a stable trie structure.

II. PROBLEM STATEMENT
Given a random unsorted array of strings, what are the best space and time complexities with which it can be sorted? If both cannot be optimal at once, is there a middle ground?

The standard approach is to treat this as a key-based sorting problem and solve it with a multi-array sort. This approach, a multi-valued sort extended to strings, has a time complexity of $O(n \cdot \bar{l} \cdot \log n)$ and a space complexity of $O(1)$.

Another approach is to build a key index tree (or trie) and then perform a depth-first search to obtain the sorted values. This approach has a time complexity of $O(n \cdot \bar{l})$ and a space complexity of $O(n \cdot l)$, and is called burst sort [5].

The value-comparison approach is preferred when memory is of the highest importance and the system is relaxed about running time. The burst sort approach is preferred when the system has sufficient memory and requires fast execution. There is a trade-off in choosing either of the two; using Cantor Mapping, we propose a middle ground with $O(n)$ memory and $O(n \log n)$ time.

III. CANTOR MAPPING
The idea of Cantor Mapping can be considered a fusion of Cantor sets and closed-space hashing. A Cantor set is a closed set with remarkable properties: it is nowhere dense, perfect, of low capacity dimension, recursively defined, etc.

Length of each segment at depth $d$: $l_d = (1/3)^d$
Number of segments at depth $d$: $n_d = 2^d$
Total length at depth $d$: $L_d = n_d \cdot l_d$

Fig. 1. A Cantor Set
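The three quantities above can be tabulated directly. The following is a minimal sketch, assuming the standard middle-thirds construction; the helper name `cantor_level` is hypothetical and exact rationals are used only to avoid rounding.

```python
from fractions import Fraction


def cantor_level(d):
    """Return (l_d, n_d, L_d) for the middle-thirds Cantor set at depth d."""
    segment_length = Fraction(1, 3) ** d   # l_d = (1/3)^d
    segment_count = 2 ** d                 # n_d = 2^d
    total_length = segment_count * segment_length  # L_d = n_d * l_d
    return segment_length, segment_count, total_length


for depth in range(4):
    l_d, n_d, L_d = cantor_level(depth)
    print(depth, l_d, n_d, L_d)
```

The total length shrinks by a factor of 2/3 at every depth, while the number of segments doubles.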
Hashing is a method for assigning values to objects using the modular polynomial ring method to ensure that the co-domain is bounded. This technique is used in maps to achieve $O(1)$ lookups and in Rabin-Karp [1] for single-valued string search.

Cantor Mapping is a technique that provides a string with an equivalent numerical value. The unique property of this mapping is that any string can be searched and ordered by this numerical value rather than by iterating over and comparing its elements. Unsigned floating-point numbers are assigned monotonically, so ordering the numbers also orders the corresponding index-based string values. While this method is analogous to polynomial hashing, it should not be used to replace it: unlike hashing, the mapping is injective by nature.

A. Technique

We need a function $f(x): \mathbb{R} \to \mathbb{N}$ for performing ordinal ranking. This function needs to be strictly monotonic [7], i.e. $f(x_1) < f(x_2) \implies x_1 < x_2$.

We could use an intermediate function $m(x): \mathbb{R} \to \mathbb{R}$ to increase or decrease the spacing between numbers; exponential, linear, and polynomial functions are examples of what can be used. Weakly monotonic functions cannot be used, as they would lead to ordinal ranking violations when performing $f(x)$.

We also need a function $g(z): \text{String} \to \mathbb{N}$, with $z \in \{Z^* \cup T\}$ (where $T$ is a terminal and $Z$ is a set of non-terminals), which acts as a series converter of sorts. The main purpose of this function is to reduce the length of the data sequence.

Here we use the composite function $f \circ m \circ g$, specified as follows:

$T(x): \text{String} \to \mathbb{N}$, where $T(x) = \sum_{i=0}^{|S|-1} s_i / x^i$

The value of $x$ should be $|T| + \epsilon$, with $\epsilon$ large enough to satisfy the monotonicity condition derived after Algorithm 1. For this paper the total alphabet size was 26 and $\epsilon = 4$.

Algorithm 1 Cantor Mapping
    S ← A string
    M ← A character to number order map
    ε := 4
    x := |T| + ε
    Result := 0
    for (i = |S| − 1; i ≥ 0; i−−) do
        Result = ((Result / x) + M.get(S[i]))
    end for
    return Result

The proof that the function is monotonic (given a proper value of $x$) is as follows.

Assume two strings $B$ and $C$ (where $B$ is dictionary ordered before $C$) which have a common prefix of size $l$:

$\implies T(B[0, l-1]) = T(C[0, l-1])$ and $B[l] \neq C[l]$

As per the procedure of Cantor Mapping:

$T(B[l, |B|-1]) = \sum_{i=l}^{|B|-1} B[i] / x^i$
$T(C[l, |C|-1]) = \sum_{i=l}^{|C|-1} C[i] / x^i$

Since $B$ should come before $C$, it suffices that:

$T(C[l]) > T(B[l, |B|-1])$
$\implies C[l]/x^l > B[l]/x^l + B[l+1]/x^{l+1} + \ldots$
$\implies (C[l] - B[l])/x^l > B[l+1]/x^{l+1} + B[l+2]/x^{l+2} + \ldots$

As each character is bounded by the maximum attainable value $\zeta$ (where $\zeta$ is the value of the last non-terminal):

$\implies (C[l] - B[l])/x^l > \zeta/x^{l+1} + \zeta/x^{l+2} + \ldots = (\zeta/x^{l+1}) \cdot (1/(1 - (1/x)))$
$\implies (C[l] - B[l]) > (\zeta/x) \cdot (x/(x-1))$
$\implies x > \zeta/(C[l] - B[l]) + 1$

Hence, by maximizing the right-hand side of the above inequality, we obtain the value of $x$ that ensures monotonicity. The right-hand side is largest when $C[l] - B[l] = 1$, which gives $x > \zeta + 1$.

IV. UTILITY
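As a reference point for the applications in this section, Algorithm 1 can be sketched in Python. This is an illustrative sketch rather than the paper's implementation: it assumes a lowercase alphabet mapped to 1..26 (so that 0 is left for the implicit terminal), and it uses exact rationals instead of unsigned floats so that the ordering check is not affected by rounding. The names `ORDER`, `X`, and `cantor_map` are hypothetical.

```python
from fractions import Fraction

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Assumption: 'a' -> 1 ... 'z' -> 26, so zeta = 26 and 0 is never a character value.
ORDER = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
EPSILON = 4
X = len(ALPHABET) + EPSILON  # x = |T| + epsilon = 30, which satisfies x > zeta + 1


def cantor_map(s, order=ORDER, x=X):
    """Evaluate sum(order[s[i]] / x**i) from right to left, as in Algorithm 1."""
    result = Fraction(0)
    for ch in reversed(s):
        result = result / x + order[ch]
    return result


words = ["banana", "apple", "app", "cherry", "apricot", "b"]
# Sorting by the mapped value should agree with lexicographic order.
print(sorted(words, key=cantor_map))
print(sorted(words, key=cantor_map) == sorted(words))
```

With this character map a proper prefix always receives a smaller value than its extensions, for example `cantor_map("a") == 1` and `cantor_map("ab") == 16/15`.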
Here three major types of sorting methods relevant to string operations are presented as examples. They are as follows:
A. Standard String Sorting
Let $A$ be a collection of strings that we wish to sort in lexicographical order. Using the Cantor Mapping technique, we first preprocess every string and store its mapped value at its respective index. This task takes exactly $n \cdot \bar{l}$ time and requires $O(n)$ space. Once this is completed, we can use a standard comparison sort which compares the unsigned float values and returns the index-ordered array. This process takes $O(n \log n)$ time.

Hence the total time complexity is $O(n \cdot \bar{l} + n \log n)$. Whichever of $\bar{l}$ and $\log n$ is larger determines the time complexity in which the algorithm runs. ($\bar{l}$ is the average word length and $n$ is the number of strings in the collection.)

Fig. 2. Standard String Sorting
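The two phases described above (one preprocessing pass, then a comparison sort on the mapped values) can be sketched as follows. This is a minimal sketch under stated assumptions: lowercase input mapped to 1..26, $x = 30$, and exact rationals standing in for the unsigned float values. The names `cantor_key` and `cantor_sort_indices` are hypothetical.

```python
from fractions import Fraction


def cantor_key(s, x=30):
    # Assumption: lowercase a-z mapped to 1..26; x = 26 + 4 as in the paper.
    value = Fraction(0)
    for ch in reversed(s):
        value = value / x + (ord(ch) - ord("a") + 1)
    return value


def cantor_sort_indices(strings):
    # Preprocessing pass: one key per string, touching each character once.
    keys = [cantor_key(s) for s in strings]
    # Comparison sort on the keys only; returns the index-ordered array.
    return sorted(range(len(strings)), key=keys.__getitem__)


A = ["pear", "fig", "apple", "figs", "kiwi"]
order = cantor_sort_indices(A)
print([A[i] for i in order])  # ['apple', 'fig', 'figs', 'kiwi', 'pear']
```

Exact rationals are used here because a 64-bit float carries 53 bits of significand and each base-30 digit consumes about 4.9 bits, so float keys would distinguish only roughly the first ten characters. Comparisons on exact rationals are not constant-time, so this sketch illustrates the ordering rather than the stated running time.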
While comparing floating-point numbers is more complex than comparing standard integers, the run time of a floating-point comparison is $\Omega(\text{integer comparison})$. Hardware optimizations are possible using a Floating Point Unit, which can reduce the constant factor even more. Software-optimized algorithms can be used too [6].

B. Split-wise String Sorting

Let $A$ be a collection of strings that we wish to sort in lexicographical order. Assuming there is a system, problem, or hardware restriction on computing precise floating-point numbers, we can create chunks of a certain size and perform a standard multi-array sort on them.

This sorting requires $O(n \cdot \max(l_i) \cdot \log(n) / k)$ time and $O(n \cdot \max(l_i) / k)$ space (where $k$ is the split size and $l_i$ is the size of the $i$-th string). The worst case occurs when $k = 1$, as this corresponds to standard comparative string sorting [3].

C. Suffix String Sorting
Let $S$ be the string from which we wish to create a suffix array. We can preprocess the initial index-based values for all indexes in $O(n)$ total time.

Following this, we can subject the array to a comparison sort, giving suffix-sorted results in $O(l \log l)$ time. While this is comparable to older suffix sorting algorithms [2], newer ones [4] sort in $O(l)$ time, so this approach is not a viable alternative to modern suffix sorting algorithms.
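The suffix-sorting idea can be sketched in Python: a single right-to-left pass computes the mapped value of every suffix by reusing the value of the next shorter suffix, and a comparison sort on those values yields the suffix array. This is a sketch under the same assumptions as before (lowercase input mapped to 1..26, $x = 30$, exact rationals instead of floats); the name `cantor_suffix_array` is hypothetical, and values are stored by suffix start position for readability.

```python
from fractions import Fraction


def cantor_suffix_array(s, x=30):
    # Assumption: lowercase a-z mapped to 1..26; x = 26 + 4 as in the paper.
    n = len(s)
    values = [Fraction(0)] * n
    current = Fraction(0)
    # Right-to-left pass: the value of suffix i reuses the value of suffix i + 1.
    for i in range(n - 1, -1, -1):
        current = current / x + (ord(s[i]) - ord("a") + 1)
        values[i] = current
    # Comparison sort of suffix start positions by mapped value.
    return sorted(range(n), key=values.__getitem__)


print(cantor_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The output lists suffix start positions in lexicographic order: "a", "ana", "anana", "banana", "na", "nana".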
Algorithm 2 Suffix pre-processing in O(n) time
    cantorMap = Function that calculates the Cantor mapping
    S ← String we wish to find the suffix array of
    A[] := Array of size |S|
    ε := 4
    x := |T| + ε
    currentVal := 0
    for (i = 0; i < |S|; i++) do
        currentVal = ((currentVal / x) + cantorMap(S[|S| − 1 − i]))
        A[i] = currentVal
    end for

D. Calculation optimizations
In the case of frequently occurring prefixes, or split-wise string sorting, we use a hash table and then perform the remaining calculation. Assume a certain prefix exists and the string S = prefix + remainder. The result is then

$\text{hashTable}[\text{prefix}] + \text{cantorMap}(\text{remainder}) / x^{|\text{prefix}|}$

V. CONCLUSION