Ge Nong | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ge Nong is active.

Explore More

Publication

Featured researches published by Ge Nong.

data compression conference | 2009

Linear Suffix Array Construction by Almost Pure Induced-Sorting

Ge Nong; Sen Zhang; Wai Hong Chan

We present a linear time and space suffix array (SA) construction algorithm called the SA-IS algorithm.The SA-IS algorithm is novel because of the LMS-substrings used for the problem reduction and the pure induced-sorting (specially coined for this algorithm)used to propagate the order of suffixes as well as that of LMS-substrings, which makes the algorithm almost purely relying on induced sorting at both its crucial steps.The pure induced-sorting renders the algorithm an elegant design and in turn a surprisingly compact implementation which consists of less than 100 lines of C code.The experimental results demonstrate that this newly proposed algorithm yields noticeably better time and space efficiencies than all the currently published linear time algorithms for SA construction.

IEEE Transactions on Computers | 2011

Two Efficient Algorithms for Linear Time Suffix Array Construction

Ge Nong; Sen Zhang; Wai Hong Chan

We present, in this paper, two efficient algorithms for linear time suffix array construction. These two algorithms achieve their linear time complexities, using the techniques of divide-and-conquer, and recursion. What distinguish the proposed algorithms from other linear time suffix array construction algorithms (SACAs) are the variable-length leftmost S-type (LMS) substrings and the fixed-length d-critical substrings sampled for problem reduction, and the simple algorithms for sorting these sampled substrings: the induced sorting algorithm for the variable-length LMS substrings and the radix sorting algorithm for the fixed-length d-critical substrings. The very simple sorting mechanisms render our algorithms an elegant design framework, and, in turn, the surprisingly succinct implementations. The fully functional sample implementations of our proposed algorithms require only around 100 lines of C code for each, which is only 1/10 of the implementation of the KA algorithm and comparable to that of the KS algorithm. The experimental results demonstrate that these two newly proposed algorithms yield the best time and space efficiencies among all the existing linear time SACAs.

ACM Transactions on Information Systems | 2014

Suffix Array Construction in External Memory Using D-Critical Substrings

Ge Nong; Wai Hong Chan; Sen Zhang; Xiao Feng Guan

We present a new suffix array construction algorithm that aims to build, in external memory, the suffix array for an input string of length n measured in the magnitude of tens of Giga characters over a constant or integer alphabet. The core of this algorithm is adapted from the framework of the original internal memory SA-DS algorithm that samples fixed-size d-critical substrings. This new external-memory algorithm, called EM-SA-DS, uses novel cache data structures to construct a suffix array in a sequential scanning manner with good data spatial locality: data is read from or written to disk sequentially. On the assumed external-memory model with RAM capacity Ω((nB)0.5), disk capacity O(n), and size of each I/O block B, all measured in log n-bit words, the I/O complexity of EM-SA-DS is O(n/B). This work provides a general cache-based solution that could be further exploited to develop external-memory solutions for other suffix-array-related problems, for example, computing the longest-common-prefix array, using a modern personal computer with a typical memory configuration of 4GB RAM and a single disk.

combinatorial pattern matching | 2009

Linear Time Suffix Array Construction Using D-Critical Substrings

Ge Nong; Sen Zhang; Wai Hong Chan

In this paper we present in detail a new efficient linear time and space suffix array construction algorithm(SACA), called the D-Critical-Substring algorithm. The algorithm is built upon a novel concept called fixed-size D-Critical-Substrings, which allow us to compute suffix arrays through a balanced combination of the bucket-sort and the induction sort. The D-Critical-Substring algorithm is very simple, a fully-functioning sample implementation of which in C++ is embodied in only about 100 effective lines. The results of the experiment that we conducted on the data from the Canterbury and Manzini-Ferragina corpora indicate that our algorithm outperforms the two previously best-known linear time algorithms: the Karkkainen-Sanders (KS) and the Ko-Aluru (KA) algorithms.

ACM Transactions on Information Systems | 2013

Practical linear-time O (1)-workspace suffix sorting for constant alphabets

Ge Nong

This article presents an O(n)-time algorithm called SACA-K for sorting the suffixes of an input string T[0, n-1] over an alphabet A[0, K-1]. The problem of sorting the suffixes of T is also known as constructing the suffix array (SA) for T. The theoretical memory usage of SACA-K is n log K + n log n + K log n bits. Moreover, we also have a practical implementation for SACA-K that uses n bytes + (n + 256) words and is suitable for strings over any alphabet up to full ASCII, where a word is log n bits. In our experiment, SACA-K outperforms SA-IS that was previously the most time- and space-efficient linear-time SA construction algorithm (SACA). SACA-K is around 33% faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA. Given K=O(1), SACA-K runs in linear time and O(1) workspace. To the best of our knowledge, such a result is the first reported in the literature with a practical source code publicly available.

data compression conference | 2008

Fast and Space Efficient Linear Suffix Array Construction

Sen Zhang; Ge Nong

Let S be an n-character string terminated with an unique smallest sentinel, its suffix array SA(S) is an array of pointers for all the suffixes in S sorted in the lexicographically ascending order. Specially, the Burrows-Wheeler transform for building efficient compression solutions can be quickly computed by fast suffix sorting based on suffix array construction algorithms (SACAs). The existing well-known practical linear SACAs are those two contemporarily reported in 2003 by Karkkainen and Sanders (KS) (J. Karkkaiinen and P. Sanders, 2003) and Ko and Aluru (KA) (P. Ko and S. Aluru, 2003).

ACM Transactions on Information Systems | 2015

Induced Sorting Suffixes in External Memory

Ge Nong; Wai Hong Chan; Sheng Qing Hu; Yi Wu

We present in this article an external memory algorithm, called disk SA-IS (DSA-IS), to exactly emulate the induced sorting algorithm SA-IS previously proposed for sorting suffixes in RAM. DSA-IS is a new disk-friendly method for sequentially retrieving the preceding character of a sorted suffix to induce the order of the preceding suffix. For a sizen string of a constant or integer alphabet, given the RAM capacity Ω ((nW)0.5), where W is the size of each I/O buffer that is large enough to amortize the overhead of each access to disk, both the CPU time and peak disk use of DSA-IS are O(n). Our experimental study shows that on average, DSA-IS achieves the best time and space results of all of the existing external memory algorithms based on the induced sorting principle.

combinatorial pattern matching | 2008

Computing Inverse ST in Linear Complexity

Ge Nong; Sen Zhang; Wai Hong Chan

The Sort Transform (ST) can significantly speed up the block sorting phase of the Burrows-Wheeler transform (BWT) by sorting only limited order contexts. However, the best result obtained so far for the inverse ST has a time complexity O(Nlogk) and a space complexity O(N), where Nand kare the text size and the context order of the transform, respectively. In this paper, we present a novel algorithm that can compute the inverse ST in an O(N) time/space complexity, a linear result independent of k. The main idea behind the design of the linear algorithm is a set of cycle properties of k-order contexts we explored for this work. These newly discovered cycle properties allow us to quickly compute the longest common prefix (LCP) between any pair of adjacent k-order contexts that may belong to two different cycles, leading to the proposed linear inverse ST algorithm.

string processing and information retrieval | 2015

Induced Sorting Suffixes in External Memory with Better Design and Less Space

Wei Jun Liu; Ge Nong; Wai Hong Chan; Yi Wu

Recently, several attempts have been made to extendi¾źthe internal memory suffix array SA construction algorithm SA-IS to the external memory model, e.g., eSAIS, EM-SA-DS and DSA-IS. While the developed programs for these algorithms achieve remarkable performance in terms of I/O complexity and speed, their designs are quite complex and their disk requirements remain rather heavy. Currently, the core algorithmic part of each of these programs consists of thousands of lines in C++, and the average peak disk requirement is over 20n bytes for an input string of size

Software - Practice and Experience | 2016