Machine Learning for Error Correction with Natural Redundancy
Pulakesh Upadhyaya,
Student Member, IEEE
Anxiao (Andrew) Jiang,
Senior Member, IEEE
Abstract
The persistent storage of big data requires advanced error correction schemes. The classical approach is to use error correcting codes (ECCs). This work studies an alternative approach, which uses the redundancy inherent in data itself for error correction. This type of redundancy, called Natural Redundancy (NR), is abundant in many types of uncompressed or even compressed files. The complex structures of Natural Redundancy, however, require machine learning techniques. In this paper, we study two fundamental approaches to use Natural Redundancy for error correction. The first approach, called Representation-Oblivious, requires no prior knowledge on how data are represented or compressed in files. It uses deep learning to detect file types accurately, and then mines Natural Redundancy for soft decoding. The second approach, called Representation-Aware, assumes that such knowledge is known and uses it for error correction. Furthermore, both approaches combine the decoding based on NR and ECCs. Both experimental results and analysis show that such an integrated scheme can substantially improve the error correction performance.
Index Terms
Machine learning, deep learning, LDPC codes, natural redundancy.
This work was supported in part by NSF Grant CCF-1718886. Parts of this paper were presented at the 2017 Allerton Conference on Communication, Control, and Computing, the 2017 Information Theory and Applications (ITA) Workshop, and the 2019 IEEE International Conference on Communications (ICC). The authors are with the Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840, USA (e-mail: [email protected]; [email protected]). This paper was submitted to IEEE Journal on Selected Areas in Information Theory (special issue on Deep Learning: Mathematical Foundations and Applications to Information Science).

I. INTRODUCTION
A large amount of data is generated on the Internet every day, and the feasibility of storing useful data permanently has become a key concern. The most effective classic approach to improve data reliability is to add external redundancy to data using Error Correcting Codes (ECCs). We call such redundancy artificial redundancy. However, over time, errors accumulate in storage systems and can exceed the decoding threshold of ECCs. To ensure permanent reliability of data, many techniques have been explored to improve the error correction capabilities of long-term storage systems. Recent progress in machine learning has offered an opportunity to employ novel techniques to improve data reliability. One such approach is to use Natural Redundancy in data for error correction.
By Natural Redundancy (NR), we refer to the redundancy that is inherent in data and not artificially added by ECCs. It is abundant in many types of uncompressed or even compressed files. For instance, consider the English language. When LZW (Lempel-Ziv-Welch) coding is used with a fixed dictionary of patterns (larger than many LZW codes in practical systems), the language can be compressed to 2.94 bits/character. State-of-the-art compression algorithms (e.g., syllable-based Burrows-Wheeler Transform) can further reduce it to 2 bits/character [26]. However, even with such advanced compression techniques, the result is still far from Shannon's estimation of 1.34 bits/character, which is an upper bound for the entropy of printed English [47]. For images, residual redundancy can also be abundant after compression, as made evident by recent inpainting techniques of deep learning [56]. Such abundant Natural Redundancy can be an excellent resource for error correction.

There are two fundamental ways to utilize Natural Redundancy in an information system. The first way is enhanced data compression, which often uses deep learning to remove redundancy further than before [30], [41]. It is a new and active research area, and compression ratios higher than those of classic compression algorithms have been achieved in some cases (e.g., for high-distortion regimes).

The second way, which is the focus of this paper, is to use Natural Redundancy for error correction. That is, a new decoder is designed to mine the Natural Redundancy in data, and utilize it for error correction. The decoder can be further combined with an ECC's decoder for better performance. A strong motivation for this method is that modern storage systems already store a massive amount of data, which would be very costly to reprocess. The Natural Redundancy (NR) based decoder does not require systems to examine or modify any existing data. It only requires an enhancement to the decoding algorithm itself. Therefore, it is compatible with storage systems and convenient to use.

In this paper, we study two fundamental approaches to use Natural Redundancy for error correction. The first approach, called Representation-Oblivious, requires no prior knowledge on how data are represented or compressed in files. It uses deep learning to detect file types accurately, and then mines Natural Redundancy for soft decoding. The second approach, called
Representation-Aware, assumes that such knowledge is known and uses it for error correction. Furthermore, both approaches combine the decoding based on NR and ECCs.

The Representation-Oblivious approach is useful for many storage systems where error correction is a low-level function. In those systems, such as hard drives or solid-state drives (SSDs), the controllers for error correction often have no access to information such as file types or compression schemes. Deep learning is a very useful tool for learning the complex patterns in data from scratch. And deep learning based classifiers are also suitable for decoding such data with Natural Redundancy. The Representation-Aware approach is useful for storage systems where error correction is a higher-level function. With knowledge on how data are represented, better error correction performance can be achieved with suitable machine learning techniques.

This paper studies NR-based error correction for several types of data of common file types, including HTML files, JPEG files, PDF files, LaTeX files and language-based texts. It presents new deep learning techniques for mining Natural Redundancy, and presents both soft-decoding and hard-decoding algorithms based on NR that can be combined with LDPC codes. It presents both experimental results and theoretical analysis for measuring the amount of Natural Redundancy mined for error correction, and the results show that NR-based decoding can substantially improve the error correction performance. (For instance, the Representation-Aware scheme can improve the decoding threshold for erasures of LDPC codes by a factor of five.) Furthermore, we also analyze the computational complexity of using Natural Redundancy for error correction versus for data compression.

The rest of the paper is organized as follows. In Section II, we review related works. In Section III, we present the Representation-Oblivious scheme, and combine it with LDPC codes to achieve enhanced error-correction performance. In Section IV, we present a Representation-Aware scheme for language-based texts, and analyze the performance of two approaches for combining NR-based decoders with LDPC decoders: a sequential decoding scheme and an iterative decoding scheme. In Section V, we study the computational complexity of using NR for error correction versus for data compression. In Section VI, we present the conclusions.
II. RELATED WORK
In this section, we review related works, including joint source-channel coding (JSCC), denoising, recent results on NR-based error correction, and deep learning for information theory.

The idea of using the leftover redundancy at a source encoder to improve the performance of ECCs has been studied within the field of joint source-channel coding (JSCC) [4], [16], [17], [18], [20], [24], [39], [40], [42]. However, few works have considered the Representation-Oblivious scheme. Furthermore, not many works have considered JSCC specifically for language-based sources. Related to JSCC, denoising is also an interesting and well-studied technique [1], [6], [9], [11], [31], [37], [38], [44], [55], [57]. A denoiser can use the statistics and features of input data to reduce its noise level for further processing. However, how to combine denoisers with the recent progress in LDPC codes and machine learning has remained under-explored.

In recent works (including results from the authors of this work), machine learning and algorithmic techniques have been used to exploit NR to correct errors in data [21], [22], [23], [27], [33], [48], [49], [53], [54]. This work studies the Representation-Oblivious scheme for the first time, and also presents new theoretical analysis for the Representation-Aware scheme.

In parallel, there have been numerous recent works on using deep learning for information theory [19], [43], especially for wireless and optical communications. They mainly focus on using deep learning to model complex channels, to design codes, and to approximate or improve decoding algorithms [3], [8], [12], [25], [35]. In contrast to those works, this paper focuses on using machine learning for data with complex structures (instead of for complex channels), and on exploring error correction for such complex data. These two different directions are complementary in a communication or storage system, and can be integrated.
III. REPRESENTATION-OBLIVIOUS NR-DECODING
In this section, we study the Representation-Oblivious scheme for Natural Redundancy (NR) based decoding. In this scheme, no prior information on the data is needed, including how data are represented or compressed, which file type (e.g., HTML, JPEG, etc.) they belong to, or how meta-data are appended to payload bits. This scheme has the benefit of placing only minimal requirements on practical storage systems such as hard drives and SSDs. Controllers of storage systems can read out blocks of data and perform error correction (aided by NR-decoding) as usual, without having to access file systems for additional information on the data. However, the task is also challenging. For example, without knowing the data compression algorithm, we cannot use its codebook to find patterns in the data. The patterns in data are highly complex, and vary greatly for different file types. (For instance, bit patterns in HTML files and JPEG files are very distinct from each other.) To address the challenges of this new error correction paradigm, we use deep learning to perform error correction in three consecutive steps: (1) detect the file type of the given block of noisy bits; (2) perform NR-based soft decoding for the block of noisy bits; (3) use the NR-based soft-decoding results to improve the performance of ECC decoding.

Our coding scheme for Representation-Oblivious error correction using NR is illustrated in Fig. 1. When files are stored, each file is partitioned into segments of k bits, and each file segment is encoded by a systematic (n, k) ECC into a codeword of n bits. Then each ECC codeword passes through a noisy channel, which models the errors in a storage device. During decoding, first, a deep neural network (DNN) uses the k noisy information bits to recognize the file type (e.g., HTML, LaTeX, PDF or JPEG) of the file segment. Then, a second DNN for that file type performs soft decoding on the k noisy information bits based on Natural Redundancy, and outputs k probabilities, where for i = 1, 2, ..., k, the i-th output is the probability for the i-th information bit to be 1. The k probabilities are given as additional information to the ECC's decoder. The ECC decoder then performs its decoding and outputs the final result. (In our experiments, the ECC is a systematic LDPC code, and the k probabilities are combined with the initial LLRs (log-likelihood ratios) for information bits to obtain their updated LLRs. The LDPC code then runs its belief-propagation (BP) decoding algorithm.) In the following, we present the detailed designs.

Fig. 1. Encoding and decoding scheme for a noisy file segment of an initially unknown file type. The k-bit file segment is encoded by a systematic (n, k) ECC into an n-bit codeword. The codeword is transmitted through a channel to get a noisy codeword. Two neural networks use NR to decode the k noisy information bits: the first network determines the file type of the file segment, and then a corresponding neural network for that file type performs soft decoding for the k noisy information bits. The soft decoding result and the noisy codeword are both given to the ECC decoder for further error correction.
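To make the decoding flow of Fig. 1 concrete, the following Python sketch strings the three steps together. It is a minimal illustration under stated assumptions, not the paper's implementation: file_type_cnn, soft_dnns and ldpc_bp are hypothetical stand-ins for the trained networks and the belief-propagation decoder described in the rest of this section.

```python
import numpy as np

def nr_ldpc_decode(noisy_codeword, k, p, file_type_cnn, soft_dnns, ldpc_bp):
    """Three-step Representation-Oblivious decoding of one n-bit codeword.

    noisy_codeword -- hard bits read from the channel (BSC with BER p)
    file_type_cnn  -- model: k noisy info bits -> scores over T file types
    soft_dnns      -- dict: file type -> soft-decoding DNN (returns q_i's)
    ldpc_bp        -- belief-propagation decoder taking per-bit LLRs
    """
    info_bits = noisy_codeword[:k]

    # Step 1: recognize the file type from the k noisy information bits.
    file_type = int(np.argmax(file_type_cnn(info_bits)))

    # Step 2: NR-based soft decoding; q[i] estimates Pr{i-th info bit = 1}.
    q = soft_dnns[file_type](info_bits)

    # Step 3: combine the DNN beliefs with channel LLRs and run BP decoding
    # (the LLR formulas are given in Subsection C below).
    llr = np.where(noisy_codeword == 0, np.log((1 - p) / p), np.log(p / (1 - p)))
    llr[:k] += np.log((1 - q) / q)      # parity bits keep channel LLRs only
    return ldpc_bp(llr)
```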
A. File Type Recognition using Deep Learning

We present here a Deep Neural Network (DNN) for file type recognition. The DNN takes a noisy file segment of k bits, (y_1, y_2, ..., y_k), as input, and outputs one of T file types (e.g., HTML, LaTeX, PDF or JPEG). The errors in the file segment come from a binary-symmetric channel (BSC) of bit-error rate (BER) p. We first introduce the architecture of the DNN and its training method. We then present the experimental results, which show that it achieves high accuracy for file type recognition.
1) DNN Architecture and Training:
Our DNN architecture is shown in Fig. 2. It is a Convolutional Neural Network (CNN) that takes the k bits of a noisy file segment as input. In our experiments, we let k = 4095. (The LDPC code we use is a (4376, 4095) code designed by MacKay [34], which can tolerate a BER of 0.2%. Both the code length and the BER are in the typical range of parameters for storage systems.) The CNN has T outputs that correspond to the T possible file types, namely, the T classification results. The output with the highest value leads to the selection of the corresponding file type. In our experiments, we consider four file types: HTML, LaTeX, PDF and JPEG. So T = 4. Note that HTML and LaTeX files are both text sequences but have different file structures; PDF files contain both texts and images; and JPEG files are images. In the following, we will present DNNs and experiments using these parameters for the convenience of presentation. Note that the designs can be extended to other file-segment lengths and more file types.

Fig. 2. Architecture of the CNN (convolutional neural network) for File Type Recognition. Its input is a noisy file segment of 4095 bits, and its output corresponds to T = 4 candidate file types (HTML, LaTeX, PDF and JPEG). The CNN uses ReLU and sigmoid as the activation function of its convolutional layers and output layer, respectively. It uses cross entropy as its loss function. Its optimizer is chosen to be an AdaDelta optimizer.

A large dataset has been used to train and test the CNN. For each of the T = 4 file types, 24,000 noiseless file segments are used for training data, 4,000 noiseless file segments are used for validation data, and 4,800 noiseless file segments are used for test data. During training and testing, random errors of BER p are added to each file segment, where each file segment uses an independently generated error pattern.
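As a concrete illustration of this architecture, here is a Keras sketch together with the BSC noise injection used to corrupt training segments. The number of conv/pool blocks, the filter counts and the kernel sizes are assumptions made for illustration (the paper does not list them); the ReLU convolutions, sigmoid output, cross-entropy loss and AdaDelta optimizer follow the description of Fig. 2.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

K, T = 4095, 4   # bits per file segment, number of candidate file types

def build_ftr_cnn(num_blocks=5, filters=32, kernel=9):
    """CNN for file type recognition; hyper-parameters are illustrative."""
    x = inp = layers.Input(shape=(K, 1))     # one bit per input position
    for _ in range(num_blocks):              # CONV + MAXPOOL stacks (Fig. 2)
        x = layers.Conv1D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(T, activation="sigmoid")(x)   # T classification outputs
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adadelta(),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def add_bsc_noise(bits, p, rng=np.random.default_rng()):
    """Flip each bit independently with probability p (the BSC model)."""
    return np.bitwise_xor(bits, (rng.random(bits.shape) < p).astype(bits.dtype))
```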
2) Experimental Performance:
The (4376, 4095) LDPC code used in our experiments can correct errors of BER up to 0.2% by itself. (That is, when it is used in the conventional way without the extra help of Natural Redundancy, it has a decoding threshold of 0.2%.) Our goal is to use the Natural Redundancy in file segments to correct errors of substantially higher BERs. So we have selected the target BER p with substantially higher values, ranging from 0.2% to 1.6%. We then train the CNN with the given target BER p.
TABLE I
Bit Error Rate (BER) vs Test Accuracy for File Type Recognition (FTR). Here the "overall test accuracy" is for all types of files together. The last four columns show the test accuracy for each individual type of files. (Their average value is the overall test accuracy.)

BER     Overall    HTML      JPEG      PDF       LaTeX
0.2%    99.61%     99.98%    99.52%    99.17%    99.77%
0.4%    99.69%     99.96%    99.60%    99.25%    99.96%
0.6%    99.60%     99.94%    99.48%    99.06%    99.90%
0.8%    99.69%     99.98%    99.50%    99.35%    99.92%
1.2%    99.66%     99.96%    99.23%    99.48%    99.96%
1.6%    99.58%     99.96%    99.60%    98.83%    99.92%

We measure the performance of the CNN by the accuracy of file type recognition (FTR), which is defined as the fraction of file segments whose file types are recognized correctly. The test performance is shown in Table I. It can be seen that file types can be recognized by the CNN with high accuracy: for all BERs, the accuracy is close to 1.

We can also examine the accuracy for recognizing each file type, and see if there is variance in performance from file type to file type. The results are shown in the last four columns of Table I. It can be seen that overall, the accuracy is consistently high for all file types.

The CNN's performance compares favorably with existing results on FTR, which has been studied previously for applications such as disk recovery. The work [7] considered a classification method for a pair of file types using Fisher's linear discriminant and longest-common-subsequence methods; its accuracy varies depending on which pair of file types is considered. The work [15] introduced an NLP (natural language processing) based method, where unigram and bigram counts of bytes and other statistics are used to generate a feature representation, which is then followed by a support vector machine (SVM) for the classification of various file types; its classification accuracy varies across JPEG, PDF and HTML files. The work [2] used PCA (principal component analysis) and a feed-forward auto-associative unsupervised neural network for feature extraction, and a three-layer multi-layer perceptron network for the classification of six file types, while considering entire files instead of file segments. Our deep-learning based method can be seen to achieve high performance, without the need to train separate modules for feature extraction and classification.

The CNN has robust performance because it works well not only for the BER it is trained for, but also for other BERs in the considered range. (For example, a CNN trained for BER = 1.0% also works well for other BERs in the range [0.2%, 1.6%].) For succinctness we skip the details. The robustness of the overall error correction performance for different BERs will be presented in Subsection C.

B. Soft NR-decoding by Deep Neural Networks
In this subsection, we study how to design DNNs that can perform soft decoding on noisy file segments. For each of the T file types, we will design and train a different DNN, because different types of files have different types of Natural Redundancy. Given a file type, we will design a DNN whose input is a noisy file segment of k bits Y = (y_1, y_2, ..., y_k). As before, the errors in the noisy file segment come from a binary-symmetric channel (BSC) of bit-error rate (BER) p. The output of the DNN is a vector Q = (q_1, q_2, ..., q_k), where for i = 1, 2, ..., k, the real-valued output q_i ∈ [0, 1] represents the DNN's belief that for the i-th bit in the file segment, the probability that its correct value should be 1 is q_i. In other words, if we use X = (x_1, x_2, ..., x_k) to denote an error-free file segment, and let it pass through a BSC of BER p to obtain a noisy file segment Y = (y_1, y_2, ..., y_k), then q_i is the DNN's estimation for Pr{x_i = 1 | Y, p}. Note that the k bits are not independent of each other because of the Natural Redundancy in them. So Pr{x_i = 1 | Y, p} depends on not only y_i and p, but also the overall value of Y. The goal of the DNN is to learn the Natural Redundancy in file segments, and use it to make the probability estimation q_i be as close to the true probability Pr{x_i = 1 | Y, p} as possible, for each i and for each possible value Y of the noisy file segment. To train the DNN, our optimization objective is to minimize the loss function

L = −(1/k) Σ_{i=1}^{k} [ x_i log q_i + (1 − x_i) log(1 − q_i) ],

which measures the cross-entropy between (x_1, x_2, ..., x_k) and (q_1, q_2, ..., q_k), over all samples in the training dataset.

The architecture of the DNN is presented in Fig. 3. It is related to auto-encoders, which are good choices for various applications related to denoising [52], [32]. The DNN model consists of L convolutional layers followed by L deconvolutional layers. (Deconvolutional layers may be seen as reverse operations of convolutional layers. Interested readers can refer to [10] for more details.) The L convolutional layers have one-dimensional filters of size s_1, s_2, ..., s_L, respectively, and the number of feature maps at the output of each layer is m_1, m_2, ..., m_L, respectively. The filter sizes and the numbers of feature maps for the deconvolutional layers change in the reverse order.

Fig. 3. General architecture of deep neural networks (DNNs) for NR-based soft decoding of noisy file segments. It consists of L convolutional layers followed by L deconvolutional layers. The activation function for the last layer is ReLU, and is sigmoid for the other layers. It uses cross-entropy as the loss function, and uses the Adam optimizer.

We optimize the hyper-parameters of the DNNs (including filter sizes, number of feature maps, etc.) for each file type. Their performance is robust: an optimized DNN usually performs soft decoding well for a wide range of BERs. However, the performance can be slightly improved further if the hyper-parameters are also optimized based on BERs. Such optimization results are presented in Fig. 4.

Fig. 4. Hyper-parameters of optimized DNN models for NR-based soft decoding, for T = 4 file types and different bit error rates. Here L is the number of convolutional/deconvolutional layers, s = (s_1, s_2, ..., s_L) represents the filter sizes, and m = (m_1, m_2, ..., m_L) represents the numbers of feature maps.
For PDF and JPEG files, the hyper-parameters are optimized based on two sub-ranges of BERs. We will present the decoding performance of these DNNs (when combined with ECC decoding) and their robustness in the next subsection.
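The following Keras sketch mirrors this conv-deconv structure. It is a sketch under assumptions: the specific filter sizes s and feature-map counts m below are placeholders (the optimized values are those in Fig. 4), and binary cross-entropy serves as the per-bit loss defined above.

```python
from tensorflow.keras import layers, models

def build_soft_decoder(K=4095, L=3, s=(9, 9, 9), m=(32, 64, 128)):
    """L conv layers followed by L deconv layers; outputs one q_i per bit."""
    x = inp = layers.Input(shape=(K, 1))
    for i in range(L):                       # convolutional half
        x = layers.Conv1D(m[i], s[i], padding="same", activation="sigmoid")(x)
    for i in reversed(range(L)):             # deconvolutional half
        x = layers.Conv1DTranspose(m[i], s[i], padding="same",
                                   activation="sigmoid")(x)
    # Final layer: one value per bit; the paper uses ReLU on the last layer.
    out = layers.Conv1D(1, 1, activation="relu")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```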
C. Combine Soft NR-decoding with Soft LDPC-decoding

In this subsection, we present a scheme that combines the soft NR-decoding, which applies deep learning to noisy file segments of different file types, with soft LDPC-decoding. The experimental results confirm that the scheme substantially improves the reliability of different types of files.

We adopt a robust scheme here: the DNNs for file-type recognition and for soft decoding have been trained with a constant BER p_DNN, but they are used for a wide range of BERs p of the BSC channel. (For example, the DNNs may be trained just for p_DNN = 1.0%, but are used for any BER p from 0.2% to 1.6% in the experiments here.) We choose this robust scheme because when DNNs are designed, the future BER in data can be highly unpredictable.

Given a noisy systematic LDPC codeword, we first use a DNN to recognize its file type based on its k noisy information bits. Then a second DNN for that file type is used to do soft decoding for the k noisy information bits, and output k probabilities: for i = 1, 2, ..., k, the i-th output q_i represents the estimated probability for the i-th information bit to be 1. Those k probabilities can be readily turned into LLRs (log-likelihood ratios) for the information bits using the formula

LLR_i^DNN = log((1 − q_i)/q_i).

For i = 1, 2, ..., n, let LLR_i^channel be the LLR for the i-th codeword bit (with 1 ≤ i ≤ k for information bits, and k + 1 ≤ i ≤ n for parity-check bits) derived for the binary-symmetric channel, which is either log((1 − p)/p) (if the received codeword bit is 0) or log(p/(1 − p)) (if the received codeword bit is 1). Then we let the initial LLR for the i-th codeword bit be

LLR_i^init = LLR_i^channel + LLR_i^DNN for 1 ≤ i ≤ k, and LLR_i^init = LLR_i^channel for k + 1 ≤ i ≤ n.

We then perform belief-propagation (BP) decoding using the initial LLRs, and obtain the final result.
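In code, this LLR combination is a one-liner per bit. A minimal NumPy helper, assuming the LLR convention log Pr(0)/Pr(1) used above:

```python
import numpy as np

def initial_llrs(noisy_codeword, q, p, k):
    """LLR_i^init for a systematic codeword: channel LLRs for all n bits,
    plus the DNN term log((1 - q_i)/q_i) for the k information bits."""
    y = np.asarray(noisy_codeword)
    llr = np.where(y == 0, np.log((1 - p) / p), np.log(p / (1 - p)))
    q = np.clip(np.asarray(q, dtype=float), 1e-12, 1 - 1e-12)  # avoid log(0)
    llr[:k] += np.log((1 - q) / q)
    return llr   # parity-check bits (i > k) keep their channel LLRs
```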
Note that there is a positive – although very small – chance that the file type will be recognized incorrectly. In that case, the incorrect soft-decoding DNN will be used, which is accounted for in the overall decoding performance for fair evaluation. We measure the performance of the error correction scheme by the percentage of codewords that are decoded correctly, which we call the Decoding Success Rate. (Let us call the scheme the NR-LDPC decoder, since it combines decoding based on Natural Redundancy and the LDPC code.) We focus on BERs that are beyond the decoding threshold of the LDPC code, because NR becomes helpful in such cases. Note that the (4376, 4095) LDPC code used in our experiments has a decoding threshold of BER = 0.2%. In our experiments, we focus on BERs p that are not only beyond the decoding threshold, but also can be significantly larger: p ∈ [0.2%, 1.6%].

The experimental results for p_DNN = 1.0% are presented in Fig. 5 (a). Here the x-axis is the channel error probability p, and the y-axis is the Decoding Success Rate. (For each p, 1000 file segments with independent random error patterns have been used in experiments.) The curve for "ldpc" is the performance of the LDPC decoder alone, and the curve for "nr-ldpc" is for the NR-LDPC decoder. It can be seen that the NR-LDPC decoder achieves significantly higher performance; at certain BERs, its decoding success rate is approximately 4 times as high as that of the LDPC decoder.

The figure also shows the performance for each of the 4 file types. (The 4 curves are labelled "html", "latex", "pdf" and "jpeg", respectively. Their average value becomes the curve for "nr-ldpc".) It shows that the error correction performance for HTML and LaTeX files is significantly better than for PDF and JPEG files. This is probably because the former two mainly consist of language, for which the soft-decoding DNNs are better at finding patterns and mining natural redundancy, while PDF is a mixture of language and images and JPEG is image only. It is interesting to notice that even for JPEG files, when p is sufficiently large, the NR-LDPC decoder again performs better than the LDPC decoder, which means the DNNs can extract Natural Redundancy from images, too. Fig. 5 (b) to Fig. 5 (d) show the performance for p_DNN = 1.2%, 1.4% and 1.6%, respectively. The NR-LDPC decoder performs equally well in those cases, which proves the value of Natural Redundancy for decoding.

Fig. 5. Decoding success rate vs BER for (a) p_DNN = 1.0%, (b) p_DNN = 1.2%, (c) p_DNN = 1.4%, (d) p_DNN = 1.6%.

In summary, although no prior information is known on the data representation, deep learning can recognize file types with high accuracy, and perform soft decoding effectively. When combined with ECCs, it can improve the error correction performance substantially. It is expected that with future improvements in deep learning, more natural redundancy can be mined from data to improve the reliability of storage systems even further.
IV. REPRESENTATION-AWARE NR-DECODING
The previous section studied Representation-Oblivious schemes. In this section, we study schemes that are Representation-Aware: how the source data is mapped to bits in files is known. This is also a highly useful scenario, especially when error correction is performed at a high level in computer systems, or when the controllers in storage devices perform their own compression schemes. In this work, we focus on language-based data, which form an important part of big data. In particular, we focus on the English language compressed by LZW algorithms. The results can be generalized to more languages and other sequential compression algorithms, such as Huffman codes [27], etc.
In this section, we first present an NR-based hard-decoding algorithm for languages, and analyze its performance. We then study two important cases for combining NR-decoding with ECC decoding: the sequential decoding scheme, and the iterative decoding scheme. For both cases, we study how NR-decoding improves the decoding thresholds of LDPC codes. Both the experimental results and the theoretical analysis show the ability of NR-decoding to enhance the reliability of storage systems.
A. NR-decoder for Languages
Consider English texts compressed by an LZW (Lempel-Ziv-Welch) algorithm that uses a fixed dictionary of size 2^ℓ. In our experiments, we use ℓ = 20, which gives a dictionary of 2^20 patterns (larger than many practical LZW codes). The dictionary has 2^ℓ text strings (called patterns) of variable lengths, where every pattern is encoded as an ℓ-bit codeword. Given a text to compress, the LZW algorithm scans it and partitions it into patterns, and maps them to codewords. For example, if we compress "Flash memory is an ...", "Flash m" gets mapped to a 20-bit codeword, "emory i" gets mapped to another codeword, and so on. The LZW code has been constructed using the Wikipedia corpus. It can compress English texts to 2.94 bits/character, which is substantially better (lower) than the rate of 4.59 bits/character achieved by the commonly used character-level Huffman codes. The fixed dictionary of the LZW code also makes it easy to use in practice.

In this section, we focus on bit-erasure channels. For long LZW-compressed texts with erasures, to make NR-decoding efficient, we present a decoding algorithm based on sliding windows of variable lengths, as follows.
1) Baseline Algorithm:
Let n_min and n_max be two integers with n_min < n_max, and let ℓ be the length of LZW codewords. We first use a sliding window of n_min·ℓ bits to scan the compressed text (where every such window contains exactly n_min LZW codewords of size ℓ), and obtain candidate solutions for each window based on the validity of words. (Specifically, if the bits in the window contain t erasures, there are 2^t possible solutions, each of which can be mapped back to a text string. If all the whole words in the text string are valid words, the solution is considered a candidate solution.) We then increase the size of the window to (n_min + 1)ℓ, (n_min + 2)ℓ, ..., n_max·ℓ, and do decoding for each size with the following dynamic-programming approach.

Consider a window of kℓ bits that contains k LZW codewords C_1, C_2, ..., C_k. Let S_1 ⊆ {0, 1}^((k−1)ℓ) be the set of candidate solutions for the sub-window that contains the LZW codewords C_1, C_2, ..., C_{k−1}, and let S_2 ⊆ {0, 1}^((k−1)ℓ) be the set of candidate solutions for the sub-window that contains the LZW codewords C_2, C_3, ..., C_k. (Both S_1 and S_2 have been obtained in the previous round of decoding.) We now obtain the set S of candidate solutions for the current window, which contains C_1, C_2, ..., C_k, this way. A bit sequence (b_1, b_2, ..., b_{kℓ}) is in S only if it satisfies two conditions: (1) its first (k − 1)ℓ bits are a solution in S_1, and its last (k − 1)ℓ bits are a solution in S_2; (2) the decompressed text corresponding to it contains no invalid words (except on the boundaries). This way, potential solutions filtered out by smaller windows will not enter solutions for larger windows, making decoding more efficient. As a final step, an erased bit is decoded this way: if any of the windows of size n_max·ℓ containing it (note that there are up to n_max such windows) can recover its value, decode it to that value; otherwise it remains an erasure.
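A sketch of the candidate-enumeration step for a single window is shown below. The dictionary lookup and the word-validity test (dictionary, is_valid_text) are hypothetical stand-ins for the fixed LZW dictionary and the validity check described above; erased bits are represented as None.

```python
from itertools import product

def window_candidates(window_bits, dictionary, is_valid_text, l=20):
    """Enumerate the 2^t fills of the t erased bits in one window and keep
    those whose decompressed text passes the word-validity check."""
    erased = [i for i, b in enumerate(window_bits) if b is None]
    candidates = []
    for fill in product((0, 1), repeat=len(erased)):
        trial = list(window_bits)
        for pos, bit in zip(erased, fill):
            trial[pos] = bit
        # Map each l-bit codeword back to its text pattern, then validate.
        text = "".join(dictionary[tuple(trial[j:j + l])]
                       for j in range(0, len(trial), l))
        if is_valid_text(text):
            candidates.append(tuple(trial))
    return candidates
```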
2) Phrase and Word Length Filter:
To make the above decoding algorithm more efficient, we also use phrases (such as "information theory", "flash memory") and features such as word/phrase lengths. If a solution for a window contains a valid word or phrase that is particularly long, we may remove other candidate solutions that contain only short words. That is because long words and phrases are very rare: their density among bit sequences of the same length decreases exponentially fast as the length increases [21]. So if they appear, the chance that they are the correct solution is high based on Bayes' rule. The thresholds for such word/phrase lengths can be set sufficiently high such that the probability of making a decoding error is sufficiently small.

Fig. 6. Co-location relationship between words and phrases. (a) A sample paragraph from Wikipedia (part of which was omitted to save space). (b) Phrases in it that have the co-location relationship with "flash memory".
3) Co-location Filter:
We also enhance the decoding performance by using the co-location relationship. Co-location means that certain pairs of words/phrases appear unusually frequently in the same context (because they are closely associated), such as "dog" and "bark", or "information theory" and "channel capacity". If two words/phrases with the co-location relationship are detected among candidate solutions for two windows close to each other, we may keep them as candidate solutions and remove other less likely solutions. The reason for this approach is similar to that for long words/phrases. The co-location relationship can appear in multiple places in a text, and therefore help decoding in non-trivial ways. For example, for the text in Fig. 6 (a), the words/phrases that have the co-location relationship with the phrase "flash memory" are shown in Fig. 6 (b). How to find words/phrases with the co-location relationship from a corpus of training texts is a well-known technique in Natural Language Processing (NLP) [36]. So we skip the details here.

We present the above decoding algorithm's performance for the binary erasure channel (BEC). The output of the NR-decoder has both erasures and errors (which will be further decoded by the ECC later on). Let ε ∈ [0, 1] be the raw bit-erasure rate (RBER) of the BEC. After NR-decoding, for an originally erased bit, let δ ∈ [0, 1] denote the probability that it remains an erasure, and let ρ ∈ [0, 1 − δ] denote the probability that it is decoded to 0 or 1 incorrectly. Then the amount of noise after NR-decoding can be measured by the entropy of the noise (erasures and errors) per bit:

E_NR(ε) ≜ ε (δ + (1 − δ) H(ρ/(1 − δ))),

where H(p) = −p log p − (1 − p) log(1 − p) is the entropy function. Some typical values of E_NR(ε) are shown in Table II. The reduction in noise by NR-decoding is (ε − E_NR(ε))/ε. The table shows that noise is reduced very effectively (from 88.0% to 91.6%) for the LZW-compressed data (without any help from ECC), over a wide range of RBERs for storage systems.

TABLE II
Noise reduction by the NR-based language decoder for different erasure rates ε. For each of six RBER values, the table lists δ, ρ, E_NR(ε) and the resulting noise reduction; across the six RBERs (in increasing order), the noise reduction is 91.6%, 91.1%, 90.6%, 89.8%, 89.0% and 88.0%.
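As a small worked sketch, the noise measure defined above can be evaluated directly (binary entropy in bits):

```python
import math

def binary_entropy(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def nr_noise(eps, delta, rho):
    """Return E_NR(eps) and the relative noise reduction (eps - E_NR)/eps,
    for erasure-survival probability delta and miscorrection probability rho."""
    e_nr = eps * (delta + (1 - delta) * binary_entropy(rho / (1 - delta)))
    return e_nr, (eps - e_nr) / eps
```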
Suppose that the LZW-codewords, seen as information bits, are protected by a systematic ECC. The NR-decoder can work collaboratively with the ECC decoder to maximize the number of correctable erasures. We now study two important cases for combining NR-decoding with ECC decoding: the sequential decoding scheme, and the iterative decoding scheme.

B. Sequential Decoding by NR and LDPC code

Fig. 7. Two schemes for combining NR-decoding with LDPC-decoding. (a) A sequential decoding scheme by NR and LDPC code. (b) An iterative decoding scheme by NR and LDPC code.
This subsection discusses the combination of the NR-decoder with LDPC codes. We protect the compressed text, as information bits, by a systematic LDPC code of rate R. The NR-decoder studied here generalizes the one presented in the previous subsection: it decodes the information bits by NR, and possibly the parity-check bits as well, using their relations with the information bits. The decoding process is a concatenation of two decoders: (1) first, the NR-decoder corrects erasures and outputs a partially corrected codeword; (2) then, the LDPC decoder takes that codeword as input (where the erasure and error probabilities result from the NR-decoding), and uses belief propagation (BP) for decoding. (See Fig. 7 (a) for an illustration.) We present a theoretical analysis of the decoding performance, and show that the NR-decoder can substantially improve the performance of LDPC codes.

Consider a binary-erasure channel (BEC) with erasure probability ε. Let us call the non-erased bits fixed bits. Assume that after NR-decoding, a non-fixed bit (i.e., an erasure) remains an erasure with probability p(ε) ∈ [0, 1], becomes an error (0 or 1) with probability (1 − p(ε))γ(ε) ∈ [0, 1 − p(ε)], and is decoded correctly (as 0 or 1) with probability (1 − p(ε))(1 − γ(ε)). (In general, p(ε) and γ(ε) can be functions of ε. Note that if the NR-decoder decodes only information bits, and an erasure in the information bits remains an erasure with probability p'(ε), then p(ε) = R·p'(ε) + (1 − R). Also note that the LDPC decoder needs to decode all bits with both errors and erasures.)
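The channel seen by the LDPC decoder after NR-decoding can be simulated directly from this model; a minimal sketch (labels per bit, using the probabilities defined above):

```python
import numpy as np

def simulate_post_nr_channel(n, eps, p_remain, gamma, seed=0):
    """Label each of n codeword bits after NR-decoding as 'fixed', 'erasure',
    'error' or 'correct', following the model in the text."""
    rng = np.random.default_rng(seed)
    state = np.full(n, "fixed", dtype=object)
    erased = rng.random(n) < eps                 # bits erased by the BEC
    u = rng.random(n)
    state[erased & (u < p_remain)] = "erasure"   # remains an erasure
    err_hi = p_remain + (1 - p_remain) * gamma
    state[erased & (u >= p_remain) & (u < err_hi)] = "error"
    state[erased & (u >= err_hi)] = "correct"    # decoded correctly by NR
    return state
```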
1) Decoding Algorithm:
We design the following iterative LDPC decoding algorithm, which generalizes both the peeling decoder for the BEC and the Gallager-B decoder for the BSC [46]:
Algorithm 1.
Generalized LDPC decoding algorithm.
1) Let π ∈ [1, d_v − 1] and τ ∈ [1, d_v − 1] be two integer parameters.
2) In each iteration, for a variable node v that is an erasure, if π or more non-erased message bits come from its d_v − 1 check nodes and they all have the same value, set v to that bit value.
3) If v is not a fixed bit and not an erasure (but possibly an error) in this iteration, change v to the opposite bit value if τ or more non-erased message bits come from its d_v − 1 check nodes and they all have that opposite value. (The updated value of v will be sent to the remaining check node in the next iteration.)

2) Density Evolution Analysis: We now analyze the density evolution for the decoding algorithm, for an infinitely long and randomly constructed LDPC code of regular degrees.

For t = 0, 1, ..., let α_t and β_t be the fractions of codeword bits that are errors or erasures, respectively, after t iterations of LDPC decoding. We have α_0 = ε(1 − p(ε))γ(ε) and β_0 = ε p(ε). Let κ_0 = ε(1 − p(ε))(1 − γ(ε)).
Theorem 2. For a regular (d_v, d_c) LDPC code with variable-node degree d_v and check-node degree d_c, we have

α_{t+1} = α_0 C_t + κ_0 D_t + β_0 μ_t,

where

C_t = 1 − (1 − A_t)^(d_v−1) + Σ_{i=0}^{τ−1} C(d_v−1, i) B_t^i (1 − A_t − B_t)^(d_v−1−i),
D_t = Σ_{j=τ}^{d_v−1} C(d_v−1, j) A_t^j (1 − A_t − B_t)^(d_v−1−j),
μ_t = Σ_{m=π}^{d_v−1} C(d_v−1, m) A_t^m (1 − A_t − B_t)^(d_v−1−m),

whose component variables are computed iteratively as

A_t = [(1 − β_t)^(d_c−1) − (1 − 2α_t − β_t)^(d_c−1)] / 2,
B_t = [(1 − β_t)^(d_c−1) + (1 − 2α_t − β_t)^(d_c−1)] / 2.

For the LDPC code, we also have

β_{t+1} = β_0 (1 − μ_t − ν_t),

where

ν_t = Σ_{m=π}^{d_v−1} C(d_v−1, m) B_t^m (1 − A_t − B_t)^(d_v−1−m).
Proof: Consider the root variable node of a computation tree. After t iterations, let A_t denote the probability that an incoming message to the root node from a neighboring check node is an error, and let B_t denote the probability that the message is correct. Then 1 − A_t − B_t is the probability that the message is an erasure. Let μ_t (respectively, ν_t) be the probability that among the d_v − 1 incoming messages from neighboring check nodes to the root node, π or more messages are errors (respectively, correct) and the remaining messages are all erasures. In the (t + 1)-th iteration, we can have an error in the root node in one of the following cases:

1) The root node was initially (namely, before decoding begins) an error (which has probability α_0), and either of two disjoint events happens: (a) fewer than τ check-node messages are correct and the remaining messages are all erasures, which happens with probability Σ_{i=0}^{τ−1} C(d_v−1, i) B_t^i (1 − A_t − B_t)^(d_v−1−i); (b) at least one check-node message is an error, which happens with probability 1 − (1 − A_t)^(d_v−1). The probability that either of the two events occurs is C_t = 1 − (1 − A_t)^(d_v−1) + Σ_{i=0}^{τ−1} C(d_v−1, i) B_t^i (1 − A_t − B_t)^(d_v−1−i).

2) The root node was initially correct (which has probability κ_0), but τ or more check-node messages are errors and the rest are all erasures (which happens with probability D_t = Σ_{j=τ}^{d_v−1} C(d_v−1, j) A_t^j (1 − A_t − B_t)^(d_v−1−j)).

3) The root node was initially an erasure (which has probability β_0), and π or more check-node messages are errors and the rest are all erasures (which happens with probability μ_t).

Therefore the error rate after t + 1 iterations will be α_{t+1} = α_0 C_t + κ_0 D_t + β_0 μ_t.

In the (t + 1)-th iteration, we correct an erasure at the root node correctly if the root node was initially an erasure, and π or more check-node messages are correct and the rest are all erasures. This happens with probability β_0 ν_t. The root node will remain an erasure if it is neither corrected mistakenly nor corrected correctly. So the erasure rate after t + 1 iterations will be β_{t+1} = β_0 (1 − μ_t − ν_t).

Now we need to find the values of A_t, B_t, μ_t and ν_t. The incoming message from a check node to the root node is correct if, out of the d_c − 1 non-root variable nodes connected to the check node, an even number of nodes are errors and the rest are all correct (i.e., neither errors nor erasures). That probability is B_t = Σ_{k=0}^{⌊(d_c−1)/2⌋} C(d_c−1, 2k) α_t^{2k} (1 − α_t − β_t)^(d_c−1−2k) = [(1 − β_t)^(d_c−1) + (1 − 2α_t − β_t)^(d_c−1)]/2. The incoming message from a check node to the root node is an error if, out of the d_c − 1 non-root variable nodes connected to the check node, an odd number of nodes are errors and the rest are all correct. That probability is A_t = Σ_{k=1}^{⌊d_c/2⌋} C(d_c−1, 2k−1) α_t^{2k−1} (1 − α_t − β_t)^(d_c−2k) = [(1 − β_t)^(d_c−1) − (1 − 2α_t − β_t)^(d_c−1)]/2. The probability that π or more neighboring check-node messages are errors and the rest are all erasures is μ_t = Σ_{m=π}^{d_v−1} C(d_v−1, m) A_t^m (1 − A_t − B_t)^(d_v−1−m). The probability that π or more neighboring check-node messages are correct and the rest are all erasures is ν_t = Σ_{m=π}^{d_v−1} C(d_v−1, m) B_t^m (1 − A_t − B_t)^(d_v−1−m). This completes the proof.
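The recursion of Theorem 2 is straightforward to iterate numerically. The sketch below implements it directly (with constant p and γ for simplicity; in general they depend on ε):

```python
from math import comb

def density_evolution(eps, p, gamma, dv, dc, pi, tau, iters=200):
    """Iterate Theorem 2: returns (alpha_t, beta_t) after `iters` iterations
    for a regular (dv, dc) LDPC code preceded by NR-decoding."""
    a0 = eps * (1 - p) * gamma          # alpha_0: initial error rate
    b0 = eps * p                        # beta_0: initial erasure rate
    k0 = eps * (1 - p) * (1 - gamma)    # kappa_0: erasures NR-decoded correctly
    alpha, beta = a0, b0
    for _ in range(iters):
        A = ((1 - beta) ** (dc - 1) - (1 - 2 * alpha - beta) ** (dc - 1)) / 2
        B = ((1 - beta) ** (dc - 1) + (1 - 2 * alpha - beta) ** (dc - 1)) / 2
        E = max(1.0 - A - B, 0.0)       # erasure probability of a message
        C = 1 - (1 - A) ** (dv - 1) + sum(comb(dv - 1, i) * B**i * E**(dv - 1 - i)
                                          for i in range(tau))
        D = sum(comb(dv - 1, j) * A**j * E**(dv - 1 - j) for j in range(tau, dv))
        mu = sum(comb(dv - 1, m) * A**m * E**(dv - 1 - m) for m in range(pi, dv))
        nu = sum(comb(dv - 1, m) * B**m * E**(dv - 1 - m) for m in range(pi, dv))
        alpha, beta = a0 * C + k0 * D + b0 * mu, b0 * (1 - mu - nu)
    return alpha, beta
```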
3) Erasure Threshold:
Define the erasure threshold ε∗ as the maximum erasure probability (for ε) for which the LDPC code can decode successfully (which means the error/erasure probabilities α_t and β_t both approach 0 as t → ∞). Let us show how the NR-decoder can substantially improve ε∗. Consider a regular LDPC code with d_v = 5 and d_c = 100, which has rate 0.95 (a typical code rate for storage systems). Without NR-decoding, the erasure threshold is ε̃∗ = 0.036. Now let π = 1 and τ = 4. For LZW-compressed texts, when ε = 0.18, the NR-decoder in the previous subsection yields values of p(ε) and γ(ε) for which the LDPC decoder has lim_{t→∞} α_t = 0 and lim_{t→∞} β_t = 0. (The same happens for ε < 0.18.) So with NR-decoding, ε∗ ≥ 0.18, which means an improvement in the erasure threshold by a factor of five.
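The threshold itself can be found by a bisection search on ε over the recursion above; a sketch (treating p and γ as constants, and assuming success is monotone in ε):

```python
def erasure_threshold(p, gamma, dv=5, dc=100, pi=1, tau=4, tol=1e-4):
    """Largest eps for which density_evolution (defined above) drives both
    the error and erasure rates to (numerically) zero."""
    def succeeds(eps):
        alpha, beta = density_evolution(eps, p, gamma, dv, dc, pi, tau,
                                        iters=2000)
        return alpha + beta < 1e-9
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if succeeds(mid) else (lo, mid)
    return lo
```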
C. Iterative Decoding by NR and LDPC code

In this subsection, we study the decoding performance when we use iterative decoding between the LDPC decoder and the NR-decoder, as shown in Fig. 7 (b). (In the last subsection's study, the NR-decoder is followed by the LDPC decoder, without iterations between them.) As before, we focus on languages and systematic LDPC codes, and present a theoretical model for compressed languages as follows.

Let T = (b_0, b_1, b_2, ...) be a compressed text. Partition T into segments S_0, S_1, S_2, ..., where each segment S_i = (b_{il}, b_{il+1}, ..., b_{il+l−1}) has l bits. Consider erasures in the compressed text. Let θ ∈ [0, 1], l_θ ≜ ⌊lθ⌋ and p ∈ [0, 1] be parameters. We assume that when a segment S_i has at most l_θ erasures, the NR-decoder can decode it by checking the validity of up to 2^{l_θ} candidate solutions (based on the validity of their corresponding words/phrases, grammar, etc.), and either determines (independently) the correct solution with probability p or makes no decision with probability 1 − p. (Note that an NR-decoder does not have to check the 2^{l_θ} candidate solutions one by one. For example, the NR-decoder introduced earlier can remove many invalid solutions early on without exhaustive search.) And this NR-decoding operation can be performed only once for each segment (because if the correct solution cannot be determined by such an NR-based operation the first time, there is no guarantee that such operations in the future will find the correct solution).

The parameter l_θ here is used to bound the computational complexity and erasure-correction capability of the NR-decoder in the worst case, and p models the probability of making an error-free decision. This is a simplification of the practical NR-decoder shown in the previous subsection, which makes very high-confidence – although not totally error-free – decisions. The model is suitable for compression algorithms such as LZW coding with a fixed dictionary, Huffman coding, etc., where each segment can be decompressed to a piece of text. The greater l is, the better the model is.
1) Iteration with LDPC Decoder:
The compressed text T is protected as information bits by a systematic LDPC code. The LDPC code uses the peeling decoder for the BEC (where d_c − 1 incoming messages of known values at a check node determine the value of the outgoing message on the remaining edge) to correct erasures. See the decoding model in Fig. 7 (b). In each iteration, the LDPC decoder runs one iteration of BP decoding, then the NR-decoder tries to correct those l-information-bit segments that contain at most l_θ erasures (if those segments were never decoded by the NR-decoder in any of the previous iterations). Let ε < 1 be the BEC's erasure rate. Let ε'_t and ε_t be the LDPC codeword's erasure rates after the t-th iteration of the LDPC decoder and the NR-decoder, respectively. Next, we analyze the density evolution for regular (d_v, d_c) LDPC codes of rate R = 1 − d_v/d_c.

Note that since the NR-decoder decodes only information bits, for the LDPC decoder, the information bits and parity-check bits will have different erasure rates during decoding. Furthermore, information bits consist of l-bit segments, while parity-check bits do not. For such an l-bit segment, if the NR-decoder can decode it successfully when it has no more than l_θ erasures, let us call the segment lucky; otherwise, call it unlucky. Lucky and unlucky segments will have different erasure rates during decoding, too.

Every l-information-bit segment is lucky with probability p, and unlucky with probability 1 − p. A lucky segment is guaranteed to be decoded successfully by the NR-decoder once the number of erasures in it becomes less than or equal to l_θ; and an unlucky segment can be considered as never to be decoded by the NR-decoder (because such decoding will not succeed). Since whether a segment is lucky or not is independent of the parity-check constraints and the LDPC decoder, for analysis we can consider it as an inherent property of the segment (which exists even before the decoding begins).
2) Density Evolution Analysis:
Define q_0 = 1, q_t ≜ ε_t/ε'_t and d_t ≜ ε'_t/ε_{t−1} for t ≥ 1 (with ε_0 ≜ ε). Note that decoding will end after t iterations if one of these conditions occurs: (1) ε'_t = 0, because all erasures are corrected by the t-th iteration; (2) d_t = 1, because the LDPC decoder corrects no erasure in the t-th iteration, and nor will the NR-decoder, since the input codeword is identical to its previous output. We now study the density evolution before those boundary cases occur.

For t = 1, 2, ... and k = 0, 1, ..., l, let f_k(t) denote the probability that a lucky segment contains k erasures after t iterations of decoding by the NR-decoder.

Lemma 3.

f_k(1) = Σ_{i=0}^{l_θ} C(l, i) (ε'_1)^i (1 − ε'_1)^(l−i),  if k = 0;
f_k(1) = 0,  if 1 ≤ k ≤ l_θ;
f_k(1) = C(l, k) (ε'_1)^k (1 − ε'_1)^(l−k),  if l_θ + 1 ≤ k ≤ l.
Consider the LDPC-decoding and the NR-decoding in the first iteration. Since theinitial erasure rate is (cid:15) , the erasure rate after LDPC decoding will now be (cid:15) (cid:48) = q (cid:15) (1 − (1 − (cid:15) ) d c − ) d v − where q = 1 by definition. The probability that an l -information-bit segmentcontains exactly i erasures is given by (cid:0) li (cid:1) ( (cid:15) (cid:48) ) i (1 − (cid:15) (cid:48) ) l − i , which is independent of whetherthe segment is lucky or unlucky. Thus the probability that a lucky segment contains up to l θ erasures is given by (cid:80) l θ i =0 (cid:0) li (cid:1) ( (cid:15) (cid:48) ) i (1 − (cid:15) (cid:48) ) l − i . All such segments are decoded by the NR-decodersuccessfully, while the remaining segments are not. That leads to the conclusion. Lemma 4.
Lemma 4. The erasure rate after the first iteration of NR-decoding is

ε_1 = ε_0 d_1 ((1 − R) + R(1 − p)) + (Σ_{k=l_θ+1}^{l} (k/l) f_k(1)) R p.
Proof: After NR-decoding, the erasure rate of a lucky segment with k erasures is k/l, and the erasure rate for unlucky segments and parity-check bits is still ε'_1. We have d_1 = ε'_1/ε_0. Hence the overall erasure rate after the 1st iteration of NR-decoding is ε_1 = ε_0 d_1 ((1 − R) + R(1 − p)) + (Σ_{k=l_θ+1}^{l} (k/l) f_k(1)) R p. (See Fig. 8 (b) for an illustration of the computation tree for density evolution. For comparison, we show the tree for classic BP decoding for the BEC in Fig. 8 (a).)
Lemma 5. The erasure rate after the second iteration of LDPC-decoding is

ε'_2 = q_0 q_1 ε_0 (1 − (1 − ε_1)^(d_c−1))^(d_v−1).
Proof: We have q_1 = ε_1/ε'_1. Since the NR-decoding of the 1st iteration reduces the overall erasure probability by a factor of q_1 (from ε'_1 to ε_1), and the root variable node of a computation tree is chosen uniformly at random from the infinitely long and randomly constructed LDPC code, the root node in the tree for the 2nd iteration of LDPC decoding now has the erasure probability q_1 ε_0. (See Fig. 8 (b).) Hence the equation for the LDPC decoder for the 2nd iteration is ε'_2 = q_0 q_1 ε_0 (1 − (1 − ε_1)^(d_c−1))^(d_v−1). Note that LDPC decoding is independent of NR-decoding because the parity-check constraints are independent of the bits being lucky-segment bits, unlucky-segment bits or parity-check bits. And note that d_2 = ε'_2/ε_1 is the probability that an erasure remains an erasure after the LDPC decoding. If d_2 = 1, no change was made by the LDPC decoder; if d_2 = 0, all erasures have been corrected. In both cases, the decoding will end.
Lemma 6. For t ≥ 2,

f_k(t) = f_k(t−1) + Σ_{i=l_θ+1}^{l} Σ_{j=0}^{l_θ} f_i(t−1) C(i, j) (d_t)^j (1 − d_t)^(i−j),  if k = 0;
f_k(t) = 0,  if 1 ≤ k ≤ l_θ;
f_k(t) = Σ_{i=k}^{l} f_i(t−1) C(i, k) (d_t)^k (1 − d_t)^(i−k),  if l_θ + 1 ≤ k ≤ l.
Now consider the second iteration of NR-decoding. We only consider the case when < d < . A lucky segment has zero errors after the second iteration if an only if either oneof the two cases happen : 1) the segment already has zero errors after the first iteration, or 2)the segment had l θ + 1 or more errors after the first iteration and it has at most l θ erasures aftersecond iteration of the LDPC-decoding. Thus if k = 0 , f k (2) = f k (1) + l (cid:88) i = l θ +1 l θ (cid:88) j =0 f i (1) (cid:18) ij (cid:19) ( d ) j (1 − d ) i − j A lucky segment cannot have k ≤ l θ erasures (with k ≥ ) after the second iteration of NR-decoding (because if so, it would have corrected those erasures). So we have f k (2) = 0 forthat case. Finally, a lucky segment has l θ + 1 ≤ k ≤ l erasures if and only if it had k or moreerasures after the first iteration of NR-decoding and it has k erasures after the second iterationof LDPC-decoding. Thus f k (2) = l (cid:88) i = k f i (1) (cid:18) ik (cid:19) ( d ) k (1 − d ) i − k if l θ + 1 ≤ k ≤ l The remaining cases can be analyzed similarly. That leads to the conclusion.We now present the analytical formulas for the density evolution of the iterative LDPC-NRdecoding scheme. Its proof follows the previous lemmas.
Theorem 7.
For t ≥ 1,

ε_t = ((1 − R) + R(1 − p)) ε_0 (Π_{i=1}^{t} d_i) + R p Σ_{k=l_θ+1}^{l} (k/l) f_k(t),

ε'_t = (Π_{m=0}^{t−1} q_m) ε_0 (1 − (1 − ε_{t−1})^(d_c−1))^(d_v−1).
Proof: The decoding performance for the 2nd iteration of the LDPC-decoding has been analyzed in Lemma 5. The erasure rate in unlucky-segment bits and parity-check bits was decreased from ε'_1 to ε'_1 d_2 = ε_0 d_1 d_2 by the LDPC-decoding. Now the NR-decoder corrects those lucky segments that had more than l_θ erasures before the LDPC-decoding but have at most l_θ erasures after the LDPC-decoding. So ε_2 = ε_0 d_1 d_2 ((1 − R) + R(1 − p)) + (Σ_{k=l_θ+1}^{l} (k/l) f_k(2)) R p. The analysis for the following iterations is similar to the 2nd iteration. In general, since in the i-th iteration the NR-decoder reduces the overall erasure rate by a factor of q_i, the root variable node in the computation tree for the t-th iteration of LDPC decoding has the erasure probability (Π_{i=0}^{t−1} q_i) ε_0. That leads to the conclusion.
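Theorem 7, together with Lemmas 3 and 6, gives a complete recipe for tracking ε_t numerically. A direct implementation sketch:

```python
from math import comb

def iterative_nr_ldpc_de(eps, R, p, l, l_theta, dv, dc, iters=50):
    """Track the codeword erasure rate eps_t of the iterative NR-LDPC
    scheme (Theorem 7). Returns the list of eps_t values."""
    q_prod, d_prod = 1.0, 1.0      # prod of q_0..q_{t-1} and d_1..d_t
    eps_prev, f, history = eps, None, []
    for t in range(1, iters + 1):
        eps_ldpc = q_prod * eps * (1 - (1 - eps_prev) ** (dc - 1)) ** (dv - 1)
        if eps_ldpc == 0.0:
            history.append(0.0)
            break                   # all erasures corrected
        d_t = eps_ldpc / eps_prev
        d_prod *= d_t
        if f is None:               # Lemma 3: binomial counts at t = 1
            f = [comb(l, k) * eps_ldpc**k * (1 - eps_ldpc) ** (l - k)
                 for k in range(l + 1)]
            f[0] = sum(f[:l_theta + 1])
            for k in range(1, l_theta + 1):
                f[k] = 0.0
        else:                       # Lemma 6: erasures thinned by d_t
            g = [0.0] * (l + 1)
            g[0] = f[0] + sum(f[i] * comb(i, j) * d_t**j * (1 - d_t) ** (i - j)
                              for i in range(l_theta + 1, l + 1)
                              for j in range(l_theta + 1))
            for k in range(l_theta + 1, l + 1):
                g[k] = sum(f[i] * comb(i, k) * d_t**k * (1 - d_t) ** (i - k)
                           for i in range(k, l + 1))
            f = g
        eps_t = ((1 - R) + R * (1 - p)) * eps * d_prod \
            + R * p * sum(k / l * f[k] for k in range(l_theta + 1, l + 1))
        history.append(eps_t)
        if abs(d_t - 1.0) < 1e-12:
            break                   # LDPC decoder made no progress
        q_prod *= eps_t / eps_ldpc  # q_t
        eps_prev = eps_t
    return history
```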
3) Performance:
We now numerically show that the iterative NR-LDPC decoder can improvethe decoding threshold for erasures significantly. Note that the analysis in this subsection is basedon the assumption that the NR-decoder corrects erasures but does not create errors. However, allour existing NR-decoders still create errors with small probabilities (such as − in Table II)which, although small, are still non-zero due to the complexity of languages. Extending the NR-decoders here to correct both erasures and errors is beyond the scope of this paper. Therefore,the following analysis is based on the same assumption as above, and the parameters of theNR-decoder are chosen reasonably based on existing experimental evidence: let each segmenthave l = 120 bits (which corresponds to 6 LZW codewords of 20 bits each); and let l θ = 30 .(Note that in the experiments of the previous two subsections, sliding windows of the same sizeand more erasures have been considered.) Let the LDPC code be a regular code with d v = 5 Fig. 8. Comparison of the computation tree for density evolution analysis. (a) First three iterations of classic BP decoding(alone) for BEC. (b) First three iterations of BP-decoding and NR decoding. and d c = 100 .Recall that p is the probability that an NR-decoder can correct the erasures in a segmentsuccessfully when the segment has at most l θ erasures. Based on the previous analysis, given thevalue of p , we can obtain the corresponding decoding threshold for erasures for the iterative NR-LDPC decoder. The results are shown in Fig. 9 (a). It can be seen that as p increases, the decodingthreshold (cid:15) ∗ increases quickly. Note that without the NR-decoder, the decoding threshold of theLDPC code alone for erasures is ˜ (cid:15) ∗ = 0 . . In Fig. 9 (a), the decoding threshold increases from0.039 to 0.224, all of which are higher than ˜ (cid:15) ∗ . Based on Table II, it is reasonable to consider p = 0 . . In this case, the decoding threshold is (cid:15) ∗ = 0 . , which represents a . increasefrom ˜ (cid:15) ∗ .We also study how quickly decoding converges in the iterative decoding scheme. The resultsare shown in Fig. 9 (b). Here, the BER of the BEC channel is (cid:15) = 0 . (which is above thedecoding threshold of the LDPC code alone). It can be seen that decoding converges faster as p increases. In particular, when p = 0 . , it takes only about 7 iterations for decoding to converge.In summary, with knowledge on how data are represented by bits, effective NR-based decodingschemes can be designed. Both sequential and iterative schemes are presented for combining NR- Fig. 9. Performance of the iterative NR-LDPC decoding. (a) Here parameter p is the probability that the NR-decoder correctserasures in a segment when it has at most l θ erasures, and parameter (cid:15) ∗ is the decoding threshold for erasures of the iterativeNR-LDPC decoder. The figure shows that (cid:15) ∗ increases rapidly as p increases, and it can substantially outperform the decodingthreshold of the LDPC code alone (which is 0.036). (b) Here t is the number of iterations of the iterative NR-LDPC decodingprocess, and (cid:15) t is the overall bit erasure rate of the LDPC codeword after the t -th iteration. The bit erasure rate of the BECchannel is set to be (cid:15) = 0 . . The figure shows that the higher p is, the more quickly decoding ends. decoders with LDPC codes, and their performance is rigorously analyzed. 
In summary, with knowledge of how data are represented by bits, effective NR-based decoding schemes can be designed. Both sequential and iterative schemes have been presented for combining NR-decoders with LDPC codes, and their performance has been rigorously analyzed. The results show that the inclusion of NR-decoding can improve LDPC decoding substantially, and that iterative decoding between the two decoders can further improve performance.

V. COMPUTATIONAL-COMPLEXITY TRADEOFF FOR NR-BASED CODING
In the Introduction, we mentioned that the Natural Redundancy in data can be used for both compression and error correction. How to use it suitably depends on many factors, such as available coding techniques, hardware design, etc. In this section, we discuss one such tradeoff of central importance: the computational complexity of using NR for compression versus error correction. Real NR is hard to model precisely, so we explore this topic from a theoretical point of view and consider NR in general forms. We show that certain types of redundancy are computationally efficient for compression, while others are so for error correction. Note that there exist works analyzing the hardness of certain types of source coding schemes [28], [29], [45] and channel coding schemes [5], [13], [14], [50], [51]. In contrast, here we focus on the tradeoff between the two, and the analysis is NR-oriented.

Let $B = (b_1, b_2, \cdots, b_n) \in \{0,1\}^n$ be an $n$-bit message with NR. Define $V: \{0,1\}^n \to \{0,1\}$ as a validity function: $B$ is a valid message if and only if $V(B) = 1$. The set of all valid messages of $n$ bits is $\mathcal{M} \triangleq \{B \in \{0,1\}^n \mid V(B) = 1\}$. For simplicity, for both source and channel coding, assume that the valid messages in $\mathcal{M}$ are equally likely.

First, consider source coding. Let $k = \lceil \log_2 |\mathcal{M}| \rceil$. Define an optimal lossless compression scheme to be an injective function $C_{opt}: \mathcal{M} \to \{0,1\}^k$ that compresses any valid message $B \in \mathcal{M}$ to a distinct $k$-bit vector $C_{opt}(B)$. Define the Data Compression Problem as follows: given a validity function $V$, find an injective function $C_{opt}: \mathcal{M} \to \{0,1\}^k$.

Next, consider channel coding. Assume that a valid message $X = (x_1, x_2, \cdots, x_n) \in \mathcal{M}$ is transmitted through a binary symmetric channel (BSC) and is received as a noisy message $Y = (y_1, y_2, \cdots, y_n) \in \{0,1\}^n$. Maximum likelihood (ML) decoding requires us to find a message $Z = (z_1, z_2, \cdots, z_n) \in \mathcal{M}$ that minimizes the Hamming distance $d_H(Y, Z)$. Define the Error Correction Problem as follows: given a validity function $V$ and a message $Y \in \{0,1\}^n$, find a valid message $Z \in \mathcal{M}$ that minimizes the Hamming distance $d_H(Y, Z)$.

Let $\mathcal{F}$ be the set of all functions from the domain $\{0,1\}^n$ to the codomain $\{0,1\}$. (We have $|\mathcal{F}| = 2^{2^n}$.) The function $V$ represents NR in data. In practice, different types of data have different types of NR. Let us define the latter concept formally. For any subset $T \subseteq \mathcal{F}$, let $T$ be called a type of validity functions (which represents a type of NR). When $V$ can only be a function in $T$ (instead of $\mathcal{F}$), we denote the Data Compression Problem and the Error Correction Problem by $P^T_{dc}$ and $P^T_{ec}$, respectively. The hardness of the problems $P^T_{dc}$ and $P^T_{ec}$ depends on $T$.

Let $S_{dc=NP,ec=P}$ denote the set of types $T$ (where each type is a subset of $\mathcal{F}$) for which the Data Compression Problem $P^T_{dc}$ is NP-hard while the Error Correction Problem $P^T_{ec}$ is polynomial-time solvable. Similarly, let $S_{dc=P,ec=NP}$ (or $S_{dc=P,ec=P}$, $S_{dc=NP,ec=NP}$, respectively) denote the set of types $T$ for which $P^T_{dc}$ is polynomial-time solvable while $P^T_{ec}$ is NP-hard (or $P^T_{dc}$ and $P^T_{ec}$ are both polynomial-time solvable, or both NP-hard, respectively). The following theorem shows that there exist validity-function types for each of those four possible cases.
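To make the two problems concrete, here is a brute-force sketch for tiny $n$ (our own illustration, not part of the paper's constructions). Both routines enumerate all $2^n$ vectors and are exponential in $n$, which is exactly why the complexity classification below matters; the even-parity validity function in the usage example is an arbitrary stand-in.

from itertools import product
from math import ceil, log2

def valid_messages(V, n):
    # M = {B in {0,1}^n : V(B) = 1}; brute force, small n only.
    return [B for B in product((0, 1), repeat=n) if V(B)]

def compress_opt(V, n):
    # One optimal lossless compressor C_opt: number the valid messages
    # and map each to a distinct k-bit index, k = ceil(log2 |M|).
    M = valid_messages(V, n)                  # assumes M is non-empty
    k = ceil(log2(len(M)))
    return {B: format(i, '0%db' % k) for i, B in enumerate(M)}, k

def ml_decode(V, n, Y):
    # Error Correction Problem: a valid Z minimizing d_H(Y, Z).
    return min(valid_messages(V, n),
               key=lambda Z: sum(y != z for y, z in zip(Y, Z)))

V = lambda B: sum(B) % 2 == 0                 # stand-in validity function
code, k = compress_opt(V, 4)                  # |M| = 8, so k = 3
print(k, ml_decode(V, 4, (1, 0, 0, 0)))       # -> 3 (0, 0, 0, 0)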
Theorem 8. The four sets $S_{dc=NP,ec=P}$, $S_{dc=P,ec=NP}$, $S_{dc=P,ec=P}$ and $S_{dc=NP,ec=NP}$ are all non-empty.

Proof:
We first prove that $S_{dc=NP,ec=P} \neq \emptyset$, namely, that there exists a validity-function type $T_{NP,P} \subseteq \mathcal{F}$ that makes the Data Compression Problem $P^{T_{NP,P}}_{dc}$ NP-hard while making the Error Correction Problem $P^{T_{NP,P}}_{ec}$ polynomial-time solvable.

We define a validity function $V_{NP,P}: \{0,1\}^n \to \{0,1\}$ as follows, which takes $n$ binary variables $b_1, b_2, \cdots, b_n$ as its input. Let $f_{SAT}(b_1, b_2, \cdots, b_{n-1})$ be a 3-SAT Boolean formula in Conjunctive Normal Form (CNF), where each clause contains 3 variables (such as $(b_1 \vee \bar{b}_2 \vee \bar{b}_3) \wedge (b_2 \vee b_4 \vee b_5) \wedge (\bar{b}_1 \vee b_3 \vee b_6) \wedge \cdots$, where $\vee$ is the OR operation, $\wedge$ is the AND operation, and $\bar{x}$ is the NOT of the Boolean variable $x$). Define a function $f_{even}(b_1, b_2, \cdots, b_n)$ as follows: $f_{even}(b_1, b_2, \cdots, b_n)$ equals 1 if $\sum_{i=1}^{n} b_i$ is even, and equals 0 otherwise. Similarly, define a function $f_{odd}(b_1, b_2, \cdots, b_n)$ as follows: $f_{odd}(b_1, b_2, \cdots, b_n)$ equals 1 if $\sum_{i=1}^{n} b_i$ is odd, and equals 0 otherwise. Finally, define the validity function $V_{NP,P}(b_1, b_2, \cdots, b_n)$ as
$$V_{NP,P}(b_1, b_2, \cdots, b_n) \triangleq \bigl(f_{SAT}(b_1, b_2, \cdots, b_{n-1}) \wedge f_{even}(b_1, b_2, \cdots, b_n)\bigr) \vee f_{odd}(b_1, b_2, \cdots, b_n).$$
(The validity-function type $T_{NP,P}$ is the set of all specific forms of the function $V_{NP,P}$. The same holds for the types $T_{P,NP}$, $T_{P,P}$ and $T_{NP,NP}$ to be discussed later.)

Given the validity function $V_{NP,P}$, we can see that the set of valid messages $\mathcal{M}$ has cardinality $|\mathcal{M}| = |\{B \in \{0,1\}^n \mid V_{NP,P}(B) = 1\}| \geq |\{B \in \{0,1\}^n \mid f_{odd}(B) = 1\}| = 2^{n-1}$, because all the messages whose bits have odd parity must be valid. So whether $|\mathcal{M}| > 2^{n-1}$ or not (that is, whether $k = \lceil \log_2 |\mathcal{M}| \rceil > n-1$ or not) depends on whether the 3-SAT formula $f_{SAT}(b_1, b_2, \cdots, b_{n-1})$ has a satisfying solution: if there is a satisfying solution to $b_1, b_2, \cdots, b_{n-1}$ that makes $f_{SAT}(b_1, b_2, \cdots, b_{n-1})$ be 1, then we can let $b_n = \oplus_{i=1}^{n-1} b_i \in \{0,1\}$ (where $\oplus$ is the exclusive-OR operation), which gives us an $n$-bit message of even parity that is valid (because here $f_{SAT}(b_1, b_2, \cdots, b_{n-1}) \wedge f_{even}(b_1, b_2, \cdots, b_n) = 1$, and therefore $V_{NP,P}(b_1, b_2, \cdots, b_n) = 1$), so $k > n-1$ (which means $k = n$); otherwise, there is no valid message of even parity, so $k = n-1$. So determining whether $k = n$ or $n-1$ is equivalent to solving the 3-SAT problem $f_{SAT}(b_1, b_2, \cdots, b_{n-1})$, which is a known NP-complete problem. To solve the Data Compression Problem $P^{T_{NP,P}}_{dc}$, it is necessary to know the value of $k$. So the Data Compression Problem $P^{T_{NP,P}}_{dc}$ is NP-hard.

Consider the Error Correction Problem $P^{T_{NP,P}}_{ec}$ given the same validity function $V_{NP,P}$. Given an input noisy message $Y \in \{0,1\}^n$, we compute $V_{NP,P}(Y)$. If $V_{NP,P}(Y) = 1$, then $Y$ is valid and we let $Z = Y$ be the decoded message; otherwise, since all messages of odd parity are valid, we just need to flip any bit of $Y$ to get a valid message $Z$ of odd parity. In both cases, we have minimized the Hamming distance $d_H(Y, Z)$ (which is either 0 or 1). So the Error Correction Problem $P^{T_{NP,P}}_{ec}$ is polynomial-time solvable, as illustrated by the sketch below. So $S_{dc=NP,ec=P} \neq \emptyset$.
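A minimal sketch of this polynomial-time decoder follows (our own illustration; the 3-SAT formula shown is an arbitrary stand-in for $f_{SAT}$, since the argument holds for any fixed formula).

def f_sat(bits):
    # Stand-in 3-SAT formula on b_1, ..., b_{n-1} (arbitrary example clauses).
    b = bits
    return (b[0] or not b[1] or not b[2]) and (not b[0] or b[2] or b[3])

def V_np_p(bits):
    # V_{NP,P}(B) = (f_SAT(b_1..b_{n-1}) AND f_even(B)) OR f_odd(B).
    if sum(bits) % 2 == 1:                    # odd parity: always valid
        return True
    return f_sat(bits[:-1])                   # even parity: needs f_SAT = 1

def correct(Y):
    # ML decoding for V_{NP,P} with one evaluation of V per candidate:
    # a valid Y is its own decoding; otherwise flipping any single bit
    # yields odd parity, hence a valid Z at Hamming distance 1.
    if V_np_p(Y):
        return Y
    return (1 - Y[0],) + Y[1:]

print(correct((1, 1, 0, 1, 0)))               # odd parity: returned unchanged
print(correct((0, 1, 1, 0, 0)))               # even parity: checked via f_sat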
Next, we prove that $S_{dc=P,ec=NP} \neq \emptyset$. Let $H$ be an $r \times n$ binary matrix of rank $r < n$. Define the validity function $V_{P,NP}: \{0,1\}^n \to \{0,1\}$ as follows: $V_{P,NP}(b_1, \cdots, b_n) = 1$ if and only if $H \cdot (b_1, \cdots, b_n)^T \equiv 0 \pmod 2$. (That is, the valid messages form a linear code.) Then the Data Compression Problem $P^{T_{P,NP}}_{dc}$ becomes polynomial-time solvable: we can view $H$ as the parity-check matrix of an ECC, find its corresponding generator matrix, and use it to compress any valid $n$-bit message into a distinct vector of $k = n - r$ bits (e.g., through Gaussian elimination; see the sketch at the end of this section). The details are well known in coding theory, so we skip them here. The Error Correction Problem $P^{T_{P,NP}}_{ec}$ is the same as the ML decoding problem of linear codes, which is known to be NP-hard [5]. So $S_{dc=P,ec=NP} \neq \emptyset$.

To prove that $S_{dc=P,ec=P} \neq \emptyset$, we can let the validity function $V_{P,P}(b_1, \cdots, b_n) = 1$ for all inputs. In this case, all messages are valid, so both data compression and error correction become trivial problems. So $S_{dc=P,ec=P} \neq \emptyset$.

Now we prove that $S_{dc=NP,ec=NP} \neq \emptyset$. Let $f_{SAT}(b_1, \cdots, b_n)$ be a 3-SAT Boolean formula as defined before (except that here it takes $n$ bits, instead of $n-1$ bits, as input). Let the function $f_0(b_1, \cdots, b_n)$ be defined this way: it equals 1 if $b_1 = b_2 = \cdots = b_n = 0$, and 0 otherwise. Let the validity function be $V_{NP,NP}(b_1, \cdots, b_n) = f_{SAT}(b_1, \cdots, b_n) \vee f_0(b_1, \cdots, b_n)$.

For the Data Compression Problem $P^{T_{NP,NP}}_{dc}$, $k > 0$ (namely, $|\mathcal{M}| > 1$) if and only if $f_{SAT}(b_1, \cdots, b_n)$ has a satisfying solution whose bits are not all zeros, which is NP-complete to determine. So $P^{T_{NP,NP}}_{dc}$ is NP-hard.

For the Error Correction Problem $P^{T_{NP,NP}}_{ec}$, let the input noisy message $Y \in \{0,1\}^n$ be $Y = (1, 1, \cdots, 1)$. Then $\min_{Z \in \mathcal{M}} d_H(Y, Z) < n$ if and only if $f_{SAT}(b_1, \cdots, b_n)$ has a satisfying solution whose bits are not all zeros, which is NP-complete to determine. So $P^{T_{NP,NP}}_{ec}$ is NP-hard. So $S_{dc=NP,ec=NP} \neq \emptyset$.

The above result shows a wide range of possibilities for the computational-complexity tradeoff between source and channel coding. In practice, it is worthwhile to study the properties of Natural Redundancy (e.g., whether the redundancy is mainly local or global, which differs for different types of data), and to choose appropriate coding schemes based on computational complexity along with other important factors.
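As referenced in the proof of Theorem 8, here is a minimal sketch (our own illustration) of compressing the valid messages of the linear-code validity type $T_{P,NP}$: Gaussian elimination over GF(2) brings $H$ into reduced row-echelon form, the $r$ pivot positions act as parity bits, so the $k = n - r$ free positions of a valid message form its compressed representation, and decompression re-solves the pivot bits from $H \cdot B^T \equiv 0 \pmod 2$. The toy matrix at the end is an arbitrary example.

import numpy as np

def rref_gf2(H):
    # Reduced row-echelon form over GF(2); returns (R, pivot_columns).
    R, pivots, row = H.copy() % 2, [], 0
    for col in range(R.shape[1]):
        hits = [r for r in range(row, R.shape[0]) if R[r, col]]
        if not hits:
            continue
        R[[row, hits[0]]] = R[[hits[0], row]]     # move a pivot row up
        for r in range(R.shape[0]):               # clear the rest of the column
            if r != row and R[r, col]:
                R[r] ^= R[row]
        pivots.append(col)
        row += 1
    return R, pivots

def compress(B, H):
    # Keep only the k = n - r free (non-pivot) coordinates of a valid B.
    _, pivots = rref_gf2(H)
    return B[[i for i in range(H.shape[1]) if i not in pivots]]

def decompress(info, H):
    # Place the free bits, then solve each pivot bit from its RREF row.
    R, pivots = rref_gf2(H)
    free = [i for i in range(H.shape[1]) if i not in pivots]
    B = np.zeros(H.shape[1], dtype=int)
    B[free] = info
    for r, c in enumerate(pivots):
        B[c] = int(R[r, free] @ B[free] % 2)      # pivot = XOR of its free bits
    return B

H = np.array([[1, 1, 0, 1, 0],                    # toy parity-check matrix
              [0, 1, 1, 0, 1]])
B = np.array([1, 0, 1, 1, 1])                     # valid: H @ B % 2 == 0
print(compress(B, H), decompress(compress(B, H), H))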
VI. CONCLUSION
This paper explores the use of Natural Redundancy in data for error correction. It presents new NR-decoders, which are based on deep learning and machine learning, and combines them with ECC decoding. For storage systems accommodating big data, the vast amount of Natural Redundancy offers the opportunity to improve data reliability significantly. Two important paradigms are studied in the paper. In the Representation-Oblivious paradigm, no information on data representation is needed a priori. In the Representation-Aware paradigm, both sequential and iterative decoding schemes are analyzed. The experimental and analytical results verify that machine learning can mine Natural Redundancy effectively from complex data and improve error correction substantially. Using Natural Redundancy for error correction also adds minimal overhead to big storage systems, since it does not require the modification of existing data.

REFERENCES
[1] L. Alvarez, P. Lions and J. Morel, "Image Selective Smoothing and Edge Detection by Nonlinear Diffusion. II," SIAM Journal on Numerical Analysis, vol. 29, no. 3, pp. 845–866, 1992.
[2] M. Amirani, M. Toorani and A. Beheshti, "A New Approach to Content-based File Type Detection," Proc. IEEE Symposium on Computers and Communications, pp. 1103–1108, Marrakech, Morocco, 2008.
[3] F. A. Aoudia and J. Hoydis, "End-to-end Learning of Communications Systems without a Channel Model," arXiv preprint arXiv:1804.02276, 2018.
[4] R. Bauer and J. Hagenauer, "On Variable Length Codes for Iterative Source/Channel Decoding," Proceedings of Data Compression Conference, pp. 273–282, 2001.
[5] J. Bruck and M. Naor, "The Hardness of Decoding Linear Codes with Preprocessing," IEEE Transactions on Information Theory, vol. 36, no. 2, pp. 381–385, March 1990.
[6] A. Buades, B. Coll and J. Morel, "A Review of Image Denoising Algorithms, with a New One," Multiscale Modeling & Simulation, vol. 4, no. 2, pp. 490–530, 2005.
[7] W. C. Calhoun and D. Coles, "Predicting the Types of File Fragments," Digital Investigation, vol. 5, supplement, pp. S14–S20, 2008.
[8] S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, "End-to-end Learning for Physical Layer Communications," Proc. International Zurich Seminar on Information and Communication (IZS), pp. 51–52, 2018.
[9] P. Chatterjee and P. Milanfar, "Is Denoising Dead?" IEEE Transactions on Image Processing, vol. 19, no. 4, pp. 895–911, 2010.
[10] F. Chollet, Deep Learning with Python, Manning Publications Co., 2017.
[11] R. Coifman and D. Donoho, Translation-invariant De-noising, Springer, 1995.
[12] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep Learning Based Communication over the Air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, Feb. 2018.
[13] I. Dumer, D. Micciancio, and M. Sudan, "Hardness of Approximating the Minimum Distance of a Linear Code," IEEE Transactions on Information Theory, vol. 49, no. 1, pp. 22–37, January 2003.
[14] U. Feige and D. Micciancio, "The Inapproximability of Lattice and Coding Problems with Preprocessing," Journal of Computer and System Sciences, vol. 69, no. 1, pp. 45–67, August 2004.
[15] S. Fitzgerald, G. Mathews, C. Morris and O. Zhulyn, "Using NLP Techniques for File Fragment Classification," Digital Investigation, vol. 9, supplement, pp. S44–S49, 2012.
[16] M. Fresia and G. Caire, "Combined Error Protection and Compression with Turbo Codes for Image Transmission Using a JPEG2000-like Architecture," Proc. International Conference on Image Processing, pp. 821–824, 2006.
[17] L. Guivarch, J. Carlach and P. Siohan, "Joint Source-channel Soft Decoding of Huffman Codes with Turbo-codes," Proceedings of Data Compression Conference (DCC), pp. 83–92, 2000.
[18] J. Hagenauer, "Source-controlled Channel Decoding," IEEE Transactions on Communications, vol. 43, no. 9, pp. 2449–2457, 1995.
[19] Y. Ichiki, G. Song, K. Cai, S. Lu, and J. Cheng, "Neural Network Detection of LDPC-coded Random Access CDMA Systems," International Symposium on Information Theory and Its Applications (ISITA), Singapore, 2018.
[20] M. Jeanne, J. Carlach and P. Siohan, "Joint Source-channel Decoding of Variable-length Codes for Convolutional Codes and Turbo Codes," IEEE Transactions on Communications, vol. 53, no. 1, pp. 10–15, 2005.
[21] A. Jiang, Y. Li and J. Bruck, "Error Correction through Language Processing," Proc. IEEE Information Theory Workshop (ITW), 2015.
[22] A. Jiang, P. Upadhyaya, Y. Wang, K. R. Narayanan, H. Zhou, J. Sima, and J. Bruck, "Stopping Set Elimination for LDPC Codes," Proc. 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2017.
[23] A. Jiang, P. Upadhyaya, E. F. Haratsch and J. Bruck, "Correcting Errors by Natural Redundancy," Information Theory and Applications Workshop (ITA), 2017.
[24] A. N. Kim, S. Sesia, T. Ramstad, and G. Caire, "Combined Error Protection and Compression using Turbo Codes for Error Resilient Image Transmission," Proceedings of International Conference on Image Processing (ICIP), vol. 3, pp. III-912-15, 2005.
[25] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh and P. Viswanath, "Communication Algorithms via Deep Learning," Proc. International Conference on Learning Representations (ICLR), Vancouver, 2018.
[26] J. Lansky, K. Chernik and Z. Vlckova, "Syllable-Based Burrows-Wheeler Transform," Proc. Annual International Workshop on Databases, Texts, Specifications and Objects, pp. 1–10, 2007.
[27] Y. Li, Y. Wang, A. Jiang and J. Bruck, "Content-assisted File Decoding for Nonvolatile Memories," Proc. 46th Asilomar Conference on Signals, Systems and Computers, pp. 937–941, Pacific Grove, CA, 2012.
[28] J. Lin, "Vector Quantization for Image Compression: Algorithms and Performance," Doctoral Dissertation, Brandeis University, 1992.
[29] J. Lin, J. Storer and M. Cohn, "Optimal Pruning for Tree-structured Vector Quantization," Information Processing and Management, vol. 28, no. 6, 1992.
[30] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning Convolutional Networks for Content-Weighted Image Compression," Proc. IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 2018.
[31] M. Lindenbaum, M. Fischer, and A. Bruckstein, "On Gabor's Contribution to Image Enhancement," Pattern Recognition, vol. 27, no. 1, pp. 1–8, 1994.
[32] J. Long, E. Shelhamer and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, Boston, MA, 2015.
[33] J. Luo, Q. Huang, S. Wang and Z. Wang, "Error Control Coding Combined with Content Recognition," Proc. 8th International Conference on Wireless Communications and Signal Processing, pp. 1–5, 2016.
[34] D. MacKay, Encyclopedia of Sparse Graph Codes.
[35] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein and Y. Be'ery, "Deep Learning Methods for Improved Decoding of Linear Codes," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 119–131, 2018.
[36] C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[37] E. Ordentlich, G. Seroussi, S. Verdu, and K. Viswanathan, "Universal Algorithms for Channel Decoding of Uncompressed Sources," IEEE Transactions on Information Theory, vol. 54, no. 5, pp. 2243–2262, May 2008.
[38] E. Ordentlich, G. Seroussi, S. Verdu, M. Weinberger and T. Weissman, "A Discrete Universal Denoiser and Its Application to Binary Images," Proc. International Conference on Image Processing, vol. 1, pp. 117, 2003.
[39] Z. Peng, Y. Huang and D. Costello, "Turbo Codes for Image Transmission – A Joint Channel and Source Decoding Approach," IEEE Journal on Selected Areas in Communications (JSAC), vol. 18, no. 6, pp. 868–879, 2000.
[40] C. Poulliat, D. Declercq, C. Lamy-Bergot, and I. Fijalkow, "Analysis and Optimization of Irregular LDPC Codes for Joint Source-channel Decoding," IEEE Communications Letters, vol. 9, no. 12, pp. 1064–1066, 2005.
[41] A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer, "Semantic Perceptual Image Compression using Deep Convolution Networks," Proc. Data Compression Conference (DCC), pp. 250–259, 2017.
[42] L. Pu, Z. Wu, A. Bilgin, M. Marcellin, and B. Vasic, "LDPC-based Iterative Joint Source-channel Decoding for JPEG2000," IEEE Transactions on Image Processing, vol. 16, no. 2, pp. 577–581, 2007.
[43] M. Qin, Q. Sun, and D. Vucinic, "Robustness of Neural Networks against Storage Media Errors," arXiv preprint arXiv:1709.06173, 2017.
[44] L. Rudin, S. Osher and E. Fatemi, "Nonlinear Total Variation based Noise Removal Algorithms," Physica D: Nonlinear Phenomena, vol. 60, no. 1, pp. 259–268, 1992.
[45] M. Ruhl and H. Hartenstein, "Optimal Fractal Coding is NP-hard," Proc. Data Compression Conference (DCC), April 1997.
[46] W. Ryan and S. Lin, Channel Codes: Classical and Modern, Cambridge University Press, 2009.
[47] C. E. Shannon, "Prediction and Entropy of Printed English," Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.
[48] P. Upadhyaya and A. Jiang, "On LDPC Decoding with Natural Redundancy," Proc. 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2017.
[49] P. Upadhyaya and A. Jiang, "Representation-Oblivious Error Correction by Natural Redundancy," Proc. IEEE International Conference on Communications (ICC), Shanghai, China, May 2019.
[50] A. Vardy, "Algorithmic Complexity in Coding Theory and the Minimum Distance Problem," Proc. 29th ACM Symposium on Theory of Computing (STOC), pp. 92–109, El Paso, Texas, May 1997.
[51] A. Vardy, "The Intractability of Computing the Minimum Distance of a Code," IEEE Transactions on Information Theory, vol. 43, no. 6, pp. 1757–1766, November 1997.
[52] P. Vincent, H. Larochelle, Y. Bengio and P. A. Manzagol, "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion," Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[53] Y. Wang, K. R. Narayanan and A. Jiang, "Exploiting Source Redundancy to Improve the Rate of Polar Codes," IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, June 2017.
[54] Y. Wang, M. Qin, K. R. Narayanan, A. Jiang and Z. Bandic, "Joint Source-channel Decoding of Polar Codes for Language-based Sources," Proc. IEEE Global Communications Conference (Globecom), Washington, D.C., December 2016.
[55] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu and M. Weinberger, "Universal Discrete Denoising: Known Channel," IEEE Transactions on Information Theory, vol. 51, no. 1, pp. 5–28, 2005.
[56] J. Xie, L. Xu and E. Chen, "Image Denoising and Inpainting with Deep Neural Networks," Advances in Neural Information Processing Systems, pp. 341–349, 2012.
[57] L. Yaroslavsky and M. Eden,