IEEE Transactions on Information Theory | 2021

On the Approximation Ratio of Ordered Parsings


Abstract


Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing $b$ is NP-complete, a popular gold standard is $z$, the number of phrases in the Lempel-Ziv parse of the text, which is computed in linear time and yields the least number of phrases when those can be copied only from the left. Almost nothing has been known for decades about the approximation ratio of $z$ with respect to $b$. In this paper we prove that $z = O(b \log(n/b))$, where $n$ is the text length. We also show that the bound is tight as a function of $n$, by exhibiting a text family where $z = \Omega(b \log n)$. Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating $b$ with $r$, the number of equal-letter runs in the Burrows-Wheeler transform of the text.
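To make the greedy left-to-right parse concrete, the following is a naive quadratic sketch of an LZ76-style parse: each phrase is the longest prefix of the remaining suffix that also occurs starting at a strictly earlier position (the source may overlap the phrase itself), and a single fresh character otherwise. This is only an illustration of the parsing rule, not the linear-time algorithm the abstract refers to:

```python
def lz_parse(t: str):
    """Greedy left-to-right (Lempel-Ziv) parse: each phrase is the longest
    prefix of the remaining suffix that also occurs starting at an earlier
    position (the source may overlap the phrase); when no earlier occurrence
    exists, the phrase is a single fresh character."""
    phrases, i, n = [], 0, len(t)
    while i < n:
        l = 1
        # grow the candidate while its leftmost occurrence starts before i
        while i + l <= n and t.find(t[i:i + l]) < i:
            l += 1
        l = max(1, l - 1)  # last length that still had an earlier source
        phrases.append(t[i:i + l])
        i += l
    return phrases
```

For example, `lz_parse("abab")` yields the three phrases `a`, `b`, `ab`, so $z = 3$; on `aaaa` the overlapping source is exploited and the parse is `a`, `aaa`, so $z = 2$.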
We continue by observing that Lempel-Ziv is just one particular case of greedy parses (those that obtain the smallest parse by scanning the text and maximizing the phrase length at each step) and of ordered parses (those where phrases are larger than their sources under some order). As a new example of ordered greedy parses, we introduce lexicographical parses, where phrases can only be copied from lexicographically smaller text locations. We prove that the size $v$ of the optimal lexicographical parse is also obtained greedily in $O(n)$ time, that $v = O(b \log(n/b))$, and that there exists a text family where $v = \Omega(b \log n)$. Interestingly, we also show that $v = O(r)$, because $r$ also induces a lexicographical parse, whereas $z = \Omega(r \log n)$ holds on some text families. We obtain further results on parsing complexity and size that hold on general classes of greedy ordered parses. Along the way, we also prove other relevant bounds between compressibility measures, especially those related to the smallest grammars of various types generating (only) the text.
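Under one natural reading of "copied from lexicographically smaller text locations" (a phrase may take its source from any position whose suffix is lexicographically smaller than the suffix where the phrase starts), the greedy lexicographical parse can be sketched naively as below. This quadratic sketch is only illustrative of the rule; the paper's precise formulation may differ in details, and its $O(n)$ algorithm is based on different machinery:

```python
def lex_parse(t: str):
    """Greedy lexicographical parse (naive quadratic sketch): each phrase is
    the longest prefix of the remaining suffix that occurs at some other
    position j whose suffix t[j:] is lexicographically smaller than t[i:];
    a single fresh character when no such source exists."""
    n = len(t)
    phrases, i = [], 0
    while i < n:
        def has_smaller_source(l):
            # is there an occurrence of t[i:i+l] at a lexicographically
            # smaller suffix position?
            sub = t[i:i + l]
            return any(t.startswith(sub, j) and t[j:] < t[i:]
                       for j in range(n) if j != i)
        l = 1
        while i + l <= n and has_smaller_source(l):
            l += 1
        l = max(1, l - 1)  # last length that still had a valid source
        phrases.append(t[i:i + l])
        i += l
    return phrases
```

For instance, `lex_parse("ab")` produces two explicit characters (no suffix of `ab` is smaller than another occurrence's), while on `aab` the second `a` can be copied from position 0, since the suffix `aab` is lexicographically smaller than `ab`.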

Volume 67
Pages 1008-1026
DOI 10.1109/TIT.2020.3042746
Language English
Journal IEEE Transactions on Information Theory
