DeepZip: Lossless Data Compression using Recurrent Neural Networks
Mohit Goyal†,γ, Kedar Tatwawadi∗, Shubham Chandak∗ and Idoia Ochoaγ
†Department of Electrical Engineering, Indian Institute of Technology Delhi, India
∗Department of Electrical Engineering, Stanford University, CA, USA
γElectrical and Computer Engineering, University of Illinois, Urbana, IL, USA
[email protected], [email protected]

Abstract
Sequential data is being generated at an unprecedented pace in various forms, including text and genomic data. This creates the need for efficient compression mechanisms to enable better storage, transmission and processing of such data. To solve this problem, many of the existing compressors attempt to learn models for the data and perform prediction-based compression. Since neural networks are known as universal function approximators with the capability to learn arbitrarily complex mappings, and in practice show excellent performance in prediction tasks, we explore and devise methods to compress sequential data using neural network predictors. We combine recurrent neural network predictors with an arithmetic coder and losslessly compress a variety of synthetic, text and genomic datasets. The proposed compressor outperforms Gzip on the real datasets and achieves near-optimal compression for the synthetic datasets. The results also help understand why and where neural networks are good alternatives to traditional finite context models. The code and data are available at https://github.com/mohit1997/DeepZip.

Introduction
There has been a tremendous surge in the amount of data generated in the past years. Along with image and textual data, new types of data such as genomic, 3D VR, and point cloud data are being generated at a rapid pace. A lot of human effort is spent in analyzing the statistics of these new data formats for designing good compressors. From information theory, we understand that a good predictor naturally leads to good compression. In the recent past, recurrent neural network (RNN) based models have proved extremely effective in natural language processing tasks such as language translation, semantic parsing and, more specifically, in the task of language modeling, which includes predicting the next symbol/character in a sequence [1]. This raises a natural question:
Can RNN-based models be utilized for effective lossless compression?
In this work, we propose a neural network based lossless compressor for sequential data, named DeepZip. DeepZip consists of two major blocks: an RNN based probability estimator and an arithmetic coding based encoder [2]. Before describing the compression framework in detail, we take a look at the existing literature on lossless sequence compression. We then assess the performance of DeepZip on synthetic data as well as real textual and genomic datasets. We conclude by discussing some observations and future extensions.

Related Work

Ever since Shannon introduced information theory [3] and showed that the entropy rate is the fundamental limit on the compression rate for any stationary process, there have been multiple works attempting to achieve this optimum. Perhaps the most common compression tool is Gzip. Gzip is based on LZ77 [4] and Huffman coding [5]. LZ77 is a universal compressor, i.e., it asymptotically achieves the optimal compression rate for any stationary source, without the knowledge of the source statistics. LZ77 works by searching for matching substrings in the text appearing before the current position and storing pointers to the matches. Gzip achieves further improvements by using Huffman coding to compress the pointers and other streams generated by LZ77. LZMA is another popular compressor which combines LZ77 with arithmetic coding (described later in detail). More generally, arithmetic coding is a technique for compressing data streams given a probability model for the sequence. A large class of compressors model the data using a conditional probability distribution and then use arithmetic coding as the entropy coding technique. This class includes context-tree weighting (CTW) [6] and PPM [7], which efficiently use a mixture of multiple models to generate their predictions.

There has been some related work in the past on lossless compression using neural networks. [8] discussed the application of a character-level RNN model for text, and observed that it gives competitive compression performance as compared with the existing compressors. However, as vanilla RNNs were used, the performance was not very competitive for complex sources with longer memory. Recently, [9] introduced a different framework for using neural networks for text compression. An RNN was used as a context mixer for mixing the opinions of multiple compressors, to obtain improved compression performance. This was later improved upon by the CMIX compressor [10], which is based on a similar approach that mixes together more than 2000 models using an LSTM context mixer. However, unlike DeepZip, it still requires designing the individual context based compressors, which can heavily depend on the kind of source being analyzed. More recently, there has also been some work on word-based and semantically aware models for text compression [11].
Methods
Consider a data stream $S^N = \{S_1, S_2, \ldots, S_N\}$ over an alphabet $\mathcal{S}$ which we want to compress losslessly. We next describe in detail the DeepZip compression framework for such a stream, and the specific models used in the experiments.

Framework Overview
The compressor framework can be broken down into two blocks:
Probability predictor: For a sequence $S^N = \{S_1, S_2, \ldots, S_N\}$, the probability predictor block estimates the conditional probability distribution of $S_r$ based on the $K$ previously observed symbols, where $K$ is a hyperparameter. This probability estimate $\hat{P}(S_r \mid S_{r-1}, \ldots, S_{r-K})$ is then fed into the arithmetic encoding block. The probability predictor block is modeled as a neural network based predictor.

Arithmetic coder block: This block can be thought of as a reversible Finite-State-Machine (FSM) which takes in the probability distribution estimate for the next symbol $S_r$, $\hat{P}(S_r \mid S_{r-1}, \ldots, S_{r-K})$, and encodes it into a state. The final state is encoded using bits which form the compressed representation of the sequence.
The encoding-decoding operations proceed as follows (see Figure 1):

1) The neural network model in the probability predictor block is trained on the sequence to be compressed for multiple epochs. Once the training is complete, the model weights are stored, to allow their usage during decompression.

2) The probability predictor block uses the trained model weights to output a probability distribution over each symbol, which is then used by the arithmetic encoder to perform compression. For the initial $K$ symbols, any prior can be chosen by the arithmetic encoding block. In our framework, we choose a uniform prior, known also to the decoder. Figure 1a depicts this process for the special case of $K = 1$.

3) The operations of the decoder are exactly symmetrical to the encoder, as shown in Figure 1b. The arithmetic decoder decodes the initial $K$ symbols using a uniform prior distribution, whereas the subsequent symbols are decoded by using the probability distribution provided by the NN-based predictor block. The predictor block utilizes the stored model weights to produce exactly the same probability estimates as the compressor.

Two properties are of utmost importance for the correct functionality of DeepZip. Firstly, the probability predictor block needs to be causal, with input features based only on the past symbols, to ensure all the necessary information is available to the decoder. Secondly, the probability predictor block needs to be perfectly symmetric between encoder and decoder, so that the decoder recovers exactly the same probability distributions, guaranteeing successful reconstruction of the encoded sequence.
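As a concrete illustration, the following is a minimal sketch of the encoding loop. The names `model.predict_next` and the `encoder` object are hypothetical placeholders, not the actual API of the DeepZip implementation:

```python
import numpy as np

def deepzip_encode(sequence, model, encoder, K, alphabet_size):
    # Feed causal probability estimates to the arithmetic encoder, one symbol
    # at a time; the first K symbols use a uniform prior known to the decoder.
    uniform = np.full(alphabet_size, 1.0 / alphabet_size)
    for r, symbol in enumerate(sequence):
        if r < K:
            probs = uniform
        else:
            # \hat{P}(S_r | S_{r-1}, ..., S_{r-K}) from the trained predictor
            probs = model.predict_next(sequence[r - K:r])
        encoder.encode_symbol(symbol, probs)
    encoder.finish()  # flush the final arithmetic coder state to bits
```

The decoder runs the same loop with the encode step replaced by a decode step, feeding each decoded symbol back into the context, so that both sides see identical probability estimates.

Probability Predictor Block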
We explored several models for the probability predictor block, ranging from fully connected networks to recurrent neural networks such as LSTMs, GRUs, and other variants. This section describes some specific models and the motivation for their use.
Fully connected/dense models (FC): A dense or fully connected neural network is a combination of multiple fully connected layers. Mathematically, a dense layer with input $X$ of shape $(batchsize, n)$ with $n$ features can be defined as:

$H = \sigma(XW^T + B)$,   (1)

where $\sigma(\cdot)$ is the activation function, $W$ is the weight matrix, and $B$ is the bias term. In the context of sequence compression, the input to the model is the previous $K$ symbols, and the output is a multinomial distribution for the next character, $(p_1, p_2, \ldots, p_{|\mathcal{S}|})$, where $|\mathcal{S}|$ is the alphabet size of the sequence. This is obtained by adding a softmax layer at the end, defined as:

$\mathrm{softmax}(z)_r = p_r = \frac{e^{z_r}}{\sum_{j=1}^{|\mathcal{S}|} e^{z_j}}$   (2)

Figure 1: Encoder-Decoder Framework.

LSTM/GRU single output framework: LSTMs and GRUs (Gated Recurrent Units) belong to the class of gated RNNs, and have been used very effectively in the past few years for various natural language processing applications. For every symbol $S_r$, the input consists of the past $K$ symbols $S_{r-K}, \ldots, S_{r-1}$. Each of the symbols is an input to a bi-directional GRU cell. Finally, a softmax layer is applied on the final hidden state obtained from the LSTM/GRU cell. The model architecture is illustrated in Figure 2a. For our experiments, we use a bi-directional variant of the GRU, denoted biGRU.

Figure 2: (a) A multi-input single-output architecture for the probability predictor block. (b) A multi-input multi-output (concatenated) architecture for the NN based predictor.
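For concreteness, the FC and biGRU predictors can be written in a few lines of Keras. The layer sizes below are illustrative assumptions, not the exact hyperparameters used in the experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def fc_model(K, alphabet_size, hidden=128):
    # Dense predictor: flattened one-hot context in, softmax over alphabet out.
    return models.Sequential([
        tf.keras.Input(shape=(K * alphabet_size,)),
        layers.Dense(hidden, activation='relu'),
        layers.Dense(alphabet_size, activation='softmax'),  # Equation (2)
    ])

def bigru_model(K, alphabet_size, units=32):
    # biGRU single-output: bidirectional GRU over the K-symbol context,
    # softmax applied to the final hidden state (Figure 2a).
    return models.Sequential([
        tf.keras.Input(shape=(K,), dtype='int32'),
        layers.Embedding(alphabet_size, 8),   # learned symbol embeddings
        layers.Bidirectional(layers.GRU(units)),
        layers.Dense(alphabet_size, activation='softmax'),
    ])
```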
LSTM multi output (concat.) framework:
The LSTM/GRU single output framework suffers from the vanishing gradients issue, making the dependence on farther symbols weak. To alleviate this issue, we consider an LSTM based framework which includes explicit dependence on all the past $K$ symbols via a fully connected layer on top of the LSTM embeddings. The architecture is depicted in Figure 2b. The input consists of the past $K$ symbols, which are fed to $K$ LSTM cells. The LSTM cell outputs are concatenated, followed by a dense layer and a subsequent softmax layer which gives the final probability distribution to be used for arithmetic encoding.
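A corresponding sketch of this multi-output (concatenated) predictor, again with illustrative sizes, using the Keras functional API:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def lstm_multi_model(K, alphabet_size, units=32):
    inp = tf.keras.Input(shape=(K,), dtype='int32')
    x = layers.Embedding(alphabet_size, 8)(inp)
    # return_sequences=True exposes the LSTM output at every one of the K
    # steps, giving each context symbol an explicit path to the dense layer.
    x = layers.LSTM(units, return_sequences=True)(x)
    x = layers.Flatten()(x)  # concatenate the K per-step outputs
    x = layers.Dense(64, activation='relu')(x)
    out = layers.Dense(alphabet_size, activation='softmax')(x)
    return Model(inp, out)
```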
Training
To train the NN predictor based on the $K$ previously encountered symbols, with $K$ chosen to be 64, the sequence is divided into overlapping segments of length $K+1$ (shifted by one), where the first $K$ symbols in each segment form the input and the last symbol acts as the output. In all of the models described above, the Adam optimizer [12] is used to minimize the categorical cross entropy loss (with default parameters and a batch size fixed at 1024). The model is optimized for a maximum of 10 epochs, where the training is terminated early if significant improvement is not observed. For every epoch the training data is shuffled, which helps in achieving convergence. We update the stored model weights after an epoch only if there is a decrease in the average loss from the previous minimum (initially $\infty$). In the LSTM and GRU based models, a cuDNN accelerated implementation [13] is used, which reduces the training time by approximately $7\times$. Note that we do not use cross-validation during the training and in fact attempt to overfit on the training data. This is because the proposed framework stores the model weights as part of the compressed representation, and the trained model is used only for prediction on the training data.
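A sketch of the training-data construction described above; the helper name is ours, and the commented training call follows the setup in this section:

```python
import numpy as np

def make_training_pairs(sequence, K=64):
    # Overlapping windows of length K+1 (shifted by one): the first K symbols
    # of each window are the input, the last symbol is the prediction target.
    seq = np.asarray(sequence)
    X = np.stack([seq[i:i + K] for i in range(len(seq) - K)])
    y = seq[K:]
    return X, y

# X, y = make_training_pairs(sequence)
# model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(X, y, batch_size=1024, epochs=10, shuffle=True)
```

Arithmetic coder block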
Arithmetic coding [2] is an entropy coding technique to compress a stream of data given a probability estimate for every symbol, conditioned on the past. Arithmetic coding maintains a range in the interval $[0, 1)$, which is successively narrowed down according to the estimated probability of each encoded symbol. The models in DeepZip are trained with the categorical cross entropy loss $C(Y, \hat{Y})$, where $Y$ is the one-hot encoded ground truth, $\hat{Y}$ is the predicted probability, $|\mathcal{S}|$ is the alphabet size and $N$ is the sequence length:

$C(Y, \hat{Y}) = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{|\mathcal{S}|} y_k^{(n)} \log_2 \frac{1}{\hat{y}_k^{(n)}}$   (3)

Using the chain rule for probabilities, this expression can be rewritten as shown in Equation 4, where $S^N$ is the sequence and $\hat{p}$ is the joint probability distribution obtained from the predictor block.

$C(Y, \hat{Y}) = \frac{1}{N} \log_2 \frac{1}{\hat{p}(S^N)}$   (4)

Finally, Equation 5 shows that $\bar{L}_{AE}$, the average number of bits used per symbol by arithmetic coding, is very close to the loss function from Equation 4. Thus, the categorical cross entropy loss is in fact the optimal loss function to consider while training the models in the DeepZip framework.

$\frac{1}{N} \log_2 \frac{1}{\hat{p}(S^N)} \leq \bar{L}_{AE} \leq \frac{1}{N} \log_2 \frac{1}{\hat{p}(S^N)} + \frac{2}{N}$   (5)

The arithmetic coder in DeepZip is based on an open source Python implementation [14]. We achieved significant speedups by parallelizing the encoding and decoding operations. While the computation of predicted probabilities can be easily parallelized during the encoding process, parallelizing the decoding is slightly non-trivial because the computation of probabilities itself depends on the previously decompressed symbols. Thus, we divide the original sequence into $B$ non-overlapping segments during encoding (by default $B = 1000$). At each step, the probabilities for these segments are computed independently and in parallel by creating a batch of size $B$. This is followed by separate arithmetic coding steps for each segment. The decompression process is symmetric to the compression process, and the segments are decoded independently in parallel and concatenated at the end to produce the decoded file.
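To make the interval narrowing concrete, here is a toy arithmetic coder using exact fractions (a practical coder, such as the one DeepZip adapts from [14], instead keeps a fixed-precision integer state with renormalization):

```python
from fractions import Fraction
from math import ceil, log2

def interval_for(sequence, prob_model):
    # Narrow [0, 1) once per symbol; prob_model must be causal and return
    # exact Fractions summing to 1 over the alphabet.
    low, high = Fraction(0), Fraction(1)
    for r, s in enumerate(sequence):
        probs = prob_model(sequence[:r])
        cum = sum(probs[:s], Fraction(0))   # cumulative mass below symbol s
        span = high - low
        low, high = low + span * cum, low + span * (cum + probs[s])
    return low, high

# Example: an iid Bern(0.1) model over the binary alphabet {0, 1}.
model = lambda past: [Fraction(9, 10), Fraction(1, 10)]
low, high = interval_for([0, 0, 1, 0], model)
print(ceil(-log2(float(high - low))) + 1)  # ~ log2(1/p(seq)), cf. Equation (5)
```

Experiments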
We benchmark the performance of our neural network based compressor DeepZip against Gzip, BSC [15], and some dataset specific compressors like GeCo [16] (for genomic data) and ZPAQ [17] (for text). For DeepZip we consider the three introduced probability predictor blocks: FC, biGRU and LSTM-multi. BSC is a BWT-based compressor which improves over Gzip while being computationally efficient. Several synthetic and real datasets are considered, to evaluate the compression that can be achieved with our method and to highlight the advantages that this work provides.

- Real datasets:
We consider a wide variety of data types including genomic and text data. These datasets were chosen as they benefit in practice from lossless compression.
Human chr1 dataset:
We consider the chromosome 1 DNA reference sequence of the Human Genome Project [18]. The alphabet of a DNA sequence typically consists of {A, C, G, T, N}, where {A, C, G, T} represent the possible nucleotides (bases), and the symbol N represents an unknown nucleotide. Although it is well known that genome sequences have significant repeated regions, state-of-the-art compressors have been unable to capture these repeats, making this a difficult source to compress.

C. Elegans chr1:
We consider chromosome 1 of the C. Elegans genomic reference sequence for compression, available at ftp://ftp.ensembl.org/pub/release-94/fasta/caenorhabditis_elegans/dna/.

C. Elegans whole genome: We also consider the C. Elegans whole genome sequence for compression, obtained by concatenating its six chromosomes.
PhiX virus quality value data:
Along with sequenced nucleotides, raw genomic sequencing data also consists of quality values that represent the probability of error in the obtained nucleotides. We consider 100MB of quality value data for the PhiX virus, where each symbol takes 4 possible values. Unlike the nucleotide sequences, the quality value sequences are highly compressible, since most quality values are the same and correspond to the best quality.

text8 dataset:
Along with the genomic datasets, we also consider the text8 dataset, which is an ASCII text dataset of size 100MB. The text8 dataset has been widely studied and experimented on in the literature. It can be accessed at http://mattmahoney.net/dc/text8.zip.

- Synthetic datasets: We generate data from synthetic sources of known entropy rate. Since the entropy rate provides a lower bound on the achievable compression rate, it allows us to gauge the performance of a compression algorithm against this ideal bound. The following sources are considered:
Independent and identically distributed (IID):
IID binary data distributed as Bern(0.1) is considered, since existing compressors perform fairly well on IID sequences.

k-order Markov (XOR):
The Markov-$k$ sources are generated as follows:

$S_{n+1} = S_n + S_{n-k} \pmod{M}$,   (6)

where $M$ is the alphabet size (2 by default). This source is closely related to the lagged Fibonacci pseudorandom generator [19] and hence is difficult to compress for most traditional compressors, even though the entropy rate of Markov-$k$ sources is in fact 0. We consider $k \in \{20, 30, 40, 50\}$ for our experiments.

Hidden Markov Model (HMM): An HMM is a statistical Markov model where the system being modeled is assumed to be a Markov process with unobserved hidden states [20]. We simulate an HMM source where the hidden state follows the Markov-$k$ sequence described earlier. Specifically, the HMM process is generated as follows:

$S_{n+1} = X_n + X_{n-k} + Z_n \pmod{M}$.   (7)

Here, the hidden process $X_n + X_{n-k} \pmod{M}$ is Markov-$k$, and $Z_n$ is the added IID noise. We consider $Z_n \sim$ Bern(0.1) and $k \in \{20, 30, 40\}$ for our experiments.

For all our experiments we used an NVIDIA TITAN X GPU (12GB). All the training and encoding-decoding scripts are available at: https://github.com/mohit1997/DeepZip
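For reference, a sketch of the synthetic source generators under the reconstruction of Equations (6) and (7); the seeding of the initial symbols is our own assumption:

```python
import numpy as np

def xor_source(n, k, M=2, rng=None):
    # Markov-k (XOR) source: S_{n+1} = S_n + S_{n-k} (mod M), Equation (6).
    rng = rng or np.random.default_rng(0)
    s = np.zeros(n, dtype=int)
    s[:k + 1] = rng.integers(0, M, k + 1)  # arbitrary initial symbols
    for i in range(k + 1, n):
        s[i] = (s[i - 1] + s[i - 1 - k]) % M
    return s

def hmm_source(n, k, M=2, p=0.1, rng=None):
    # Noisy observation of the hidden Markov-k process, Equation (7).
    rng = rng or np.random.default_rng(0)
    x = xor_source(n, k, M, rng)
    z = rng.binomial(1, p, n)              # IID Bern(0.1) noise
    return (x + z) % M

seq = hmm_source(10_000_000, k=30)         # e.g., the HMM30 dataset
```

Results and discussion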
Table 1 shows the compression results for the real datasets. We compare the general purpose compressors Gzip and BSC to the proposed neural network based compressor DeepZip. We observe that the proposed compressor outperforms Gzip by about 20% on text and genomic data. As compared to BSC, DeepZip usually achieves comparable results, with slightly better compression on the C. Elegans genome. We also observe that for DeepZip, biGRU exhibits in general the best performance.

Table 1: Compression results (in Bytes) for the real datasets (H. chr1, C. E. chr1, C. E. genome, text8, PhiX quality) with Gzip, BSC and DeepZip (FC, biGRU, LSTM-multi). Best results are boldfaced.

We also tested some specialized compressors for these datasets. For the human and C. Elegans genomes, we used GeCo, which achieves 5-10% smaller size as compared to DeepZip. Similarly, for text compression, ZPAQ achieves a compressed size of 17.5MB on the text8 dataset [21], which is 25% lower than that for DeepZip. These results are to be expected, since the specialized compressors typically involve hand-crafted contexts and mechanisms which are highly optimized for the dataset statistics. Also, they can improve the compression performance by taking into account that the datasets can in fact be non-stationary. In contrast, the proposed compressor achieves reasonably good results on a wide variety of datasets.

Table 2 shows the breakdown of size between the model weights and the arithmetic coded stream for the proposed compressor. We observe that the model size contributes significantly to the overall size, especially when the sequence length is small. Currently the model weights are represented as 32 bit floats without further compression. We attempted to use 16 bit floats and TensorFlow Lite [22], but faced stability and compatibility issues. We believe that the model size can be reduced significantly without losing compression performance by using techniques similar to those outlined in [23]. Furthermore, in some cases, the model can be shared between different sequences, for example when compressing genomes of different individuals, which are very similar.

Table 2: Breakdown of the compressed size (in Bytes) into model size and size of the sequence compressed with arithmetic coding, for DeepZip.
Table 3: Compression results (in Bytes) for the synthetic datasets (IID, XOR20-50, HMM20-40, each of length 10M) with Gzip, BSC and DeepZip (FC, biGRU, LSTM-multi). Best results are boldfaced.

To better understand the capabilities of the proposed framework, we also experimented with some synthetic data of low Kolmogorov complexity which are not compressed well by traditional compressors. Table 3 shows the results for these datasets (IID, XOR and HMM). We see that as we move towards sequences with long-term dependencies, the traditional compressors fail to achieve good compression, only achieving 1 bit per binary symbol. The proposed compressor DeepZip, on the other hand, is able to exploit the structure in these sequences to achieve much better compression. There is still some gap from the entropy of the sequences because of the space needed to store the model and some overhead associated with arithmetic coding. Note that in some cases, e.g., for the XOR40 dataset, the DeepZip performance is significantly dependent on the training parameters. This can be understood from the fact that the source is pseudo-random, making it difficult for the optimization process to find the appropriate minima.

Regarding the running time of DeepZip, we observe that for typical datasets of size 10MB, every training epoch requires 1-2 hrs (with a 12GB NVIDIA TITAN X GPU and a batch size of 1024). We typically train every dataset for 3-4 epochs. The encoding/decoding requires performing a single forward pass through the NN model, and takes approximately 5-10 mins, depending upon the model.
Conclusion
We proposed a neural network prediction based framework for lossless compression of sequential data. The proposed compressor DeepZip achieves improvements over Gzip for a variety of real datasets and achieves near-optimal compression for synthetic datasets. Future work involves improving the performance of the compressor, for example by using attention models to improve the prediction and hence the overall compression. We also plan to work on improving compression of non-stationary sources by allowing the model weights to be fine-tuned as the sequence is compressed/decompressed, so as to adapt quickly to changing statistics.

Finally, we believe the compression experiments should also help in improving our understanding of the neural network models themselves. The well established information theoretic framework for data compression can potentially be useful for this cause.
References

[1] Andrej Karpathy, “The Unreasonable Effectiveness of Recurrent Neural Networks,” http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
[2] Ian H. Witten, Radford M. Neal, and John G. Cleary, “Arithmetic coding for data compression,” Commun. ACM, vol. 30, no. 6, pp. 520–540, June 1987.
[3] Claude Elwood Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[4] Jacob Ziv and Abraham Lempel, “A universal algorithm for sequential data compression,” IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, 1977.
[5] David A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.
[6] Frans M. J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens, “The context-tree weighting method: basic properties,” IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653–664, 1995.
[7] John Cleary and Ian Witten, “Data compression using adaptive coding and partial string matching,” IEEE Transactions on Communications, vol. 32, no. 4, 1984.
[8] Jürgen Schmidhuber and Stefan Heil, “Sequential neural text compression,” IEEE Transactions on Neural Networks, vol. 7, no. 1, pp. 142–146, 1996.
[9] Matthew V. Mahoney, “Fast text compression with neural networks,” in Proceedings of the Thirteenth International Florida Artificial Intelligence Research Society Conference. AAAI Press, 2000, pp. 230–234.
[10] Byron Knoll, “CMIX.”
[11] David Cox, “Syntactically informed text compression with recurrent neural networks,” arXiv preprint arXiv:1608.02893, 2016.
[12] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
[13] Google TensorFlow, “CudnnLSTM TensorFlow documentation,” 2018.
[14] “Arithmetic Coding Library.”
[15] Ilya Grebnov, “BSC,” http://libbsc.com/.
[16] Diogo Pratas, Armando J. Pinho, et al., “Efficient compression of genomic sequences,” in Data Compression Conference (DCC). IEEE, 2016, pp. 231–240.
[17] M. Mahoney, “ZPAQ compressor,” http://mattmahoney.net/dc/zpaq.html.
[18] Mark P. Sawicki, Ghassan Samara, et al., “Human genome project,” The American Journal of Surgery, vol. 165, no. 2, pp. 258–264, 1993.
[19] George Marsaglia, Arif Zaman, and Wai Wan Tsang, “Toward a universal random number generator,” Stat. Prob. Lett., vol. 9, no. 1, pp. 35–39, 1990.
[20] Wikipedia, “Hidden Markov model,” https://en.wikipedia.org/wiki/Hidden_Markov_model.
[21] “Text8 results,” http://mattmahoney.net/dc/textdata.html.
[22] “TensorFlow Lite.”
[23] Song Han, Huizi Mao, and William J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.