Linear hash-functions and their applications to error detection and correction
Boris Ryabko
Federal Research Center for Information and Computational Technologies and Novosibirsk State University, Novosibirsk, Russian Federation. Email: [email protected]
Abstract
We describe and explore so-called linear hash functions and show how they can be used to build error detection and correction codes. The method can be applied to different types of errors (for example, burst errors). When the method is applied to a model in which the number of distorted letters is bounded, the obtained estimate of its performance is slightly better than the known Varshamov-Gilbert bound. We also describe a random code whose performance is close to the same bound, but whose construction is much simpler. In some cases the obtained methods are simpler and more flexible than the known ones. In particular, the complexity of the obtained error detection code is close to that of the well-known CRC code, but the proposed code, unlike CRC, can detect with certainty any errors whose number does not exceed a predetermined limit.
I. INTRODUCTION
Error detection and correction codes are commonly used in telecommunication and data storage systems, and there are many effective and practically used constructions of such codes; see [1], [2], [3] for a review. Currently, the cyclic redundancy check (CRC), which was proposed in [4], is one of the most popular error detection codes, while block codes [2] are the basis of many error correction methods.
DRAFT
In short, error correction and detection systems can be described as follows: a binary word $x_1 \ldots x_L$ is transmitted through a communication channel, and the recipient receives a message $y_1 \ldots y_L$ in which some letters $y_i$ may differ from $x_i$. The purpose of an error detection code is to inform the receiver that some letters sent were changed during the transmission (i.e., at least one $x_i \ne y_i$). The purpose of an error correction code is not only to report that errors have occurred, but also to find all the letters that were changed (that is, all $i$ for which $x_i \ne y_i$). (We consider the most popular model in which messages are words over the binary alphabet $\{0,1\}$, but the main results can be easily extended to any finite alphabet.)

The main part of both types of codes can be described as follows: the transmitted word $x_1 \ldots x_L$ contains two subwords, say $x_1 \ldots x_{L-l}$ and $x_{L-l+1} \ldots x_L$, $L > l \ge 1$, where the first subword contains information bits, and the second one contains so-called check bits (or parity bits). When the sender wants to send $L-l$ bits, he first sets them to $x_1 \ldots x_{L-l}$, and then calculates the check bits $x_{L-l+1} \ldots x_L$. The receiver receives the word $y_1 \ldots y_L$ and uses it to detect or correct errors that may have occurred during transmission. Generally speaking, the check bits are given by a function $\lambda$, which is defined on the set of $(L-l)$-bit words and takes values in the set of $l$-bit words. In the area of error detection codes, $\lambda$ is often called a hash function. It is worth noting that sometimes the check bits are not at the end of the message, but in other places.

The simplest example of this scheme is the parity-bit, or check-bit, method. In this method, the sequence of information bits is $x_1 \ldots x_{L-1}$ and the check bit is $x_L$ (i.e., $l = 1$). If the total number of 1-bits in the string $x_1 \ldots x_{L-1}$ is even, then $x_L = 0$, otherwise $x_L = 1$. When the receiver obtains $y_1 \ldots y_L$ he calculates the total number of 1-bits.
If this value is odd, it means that an error has occurred. Thus, this method makes it possible to detect one error, but, obviously, it does not detect two errors (or any even number of errors).

Naturally, the larger the number of information bits (i.e., $L-l$), the better the code one can construct. That is why the question of codes with the largest number of information bits has attracted the attention of many researchers (see [2] for a review). In order to describe some known results in this field we need some definitions. The expression $|X|$ denotes the number of elements if $X$ is a set, and the length of $X$ if $X$ is a word. Let $u, v$ be finite binary words of the same length. We denote the Hamming weight of $u$, i.e., the number of 1's in the word $u$, by $||u||$ and, by definition, the Hamming distance $d_h(u,v) = ||u \oplus v||$, where $\oplus$ is bitwise XOR (addition modulo 2). Let $U$ be a set of binary words of the same length with $|U| > 1$. The minimal Hamming distance of $U$ is defined by $d_h(U) = \min_{u,v \in U, u \ne v} d_h(u,v)$. Let $U$ be a set of binary words of some length $L$, $L \ge 1$. The Varshamov-Gilbert bound states that
$$ \max_{U \subset \{0,1\}^L : \, d_h(U)=d} \log |U| \;\ge\; L - \Big\lceil \log \Big(1 + \sum_{i=0}^{d-2} \binom{L-1}{i}\Big) \Big\rceil , \qquad (1) $$
see [2], Theorem 2.9.3. There exist some improvements of this bound, but they do not change its asymptotics (see [5], [6] for a review).

The ability of a code to detect and correct errors is simply related to the Hamming distance $d_h(U)$. To show this, we first define
$$ B^m_n \subset \{0,1\}^m, \ n \le m, \ \text{as the set of words of length } m \text{ which contain } n \text{ or fewer 1's}. \qquad (2) $$
(That is, $B^m_n$ contains all words whose Hamming weight is not greater than $n$.) Now, take $U \subset \{0,1\}^L$ and consider a method where $U$ is the set of messages transmitted and $v$ is the word of errors that occurred, that is, the message transmitted is $x \in U$ and the message received is $y = x \oplus v$. Suppose that $d_h(U) = d$, $d \ge 1$.
It turns out that $d-1$ errors can be detected (i.e., $v \in B^L_{d-1}$), and $\lfloor (d-1)/2 \rfloor$ errors can be corrected (i.e., $v \in B^L_{\lfloor (d-1)/2 \rfloor}$). Indeed, if $x \in U$ and $v \in B^L_{d-1}$, then $y = x \oplus v$ does not belong to $U$, and this indicates an error. In order to correct errors, the word closest to $y$ is considered sent.

We briefly reviewed a model in which errors are letter distortions and their number is limited by a certain bound. There are other models of possible errors that describe various systems for transmitting and storing information, for example, packet errors. This general case is also considered in this work, in Section IV.

In this work we describe new classes of error detection and correction codes, which are based on so-called linear hash functions. Linear hash functions are defined as follows: any map $\lambda$ defined on $L$-bit binary words whose values $\lambda(x)$ are taken from the set of $l$-bit binary words, $l < L$, is called a hash function. (Formally, $\lambda : \{0,1\}^L \to \{0,1\}^l$, $L > l \ge 1$.) A hash function $\lambda$ is called linear if for any $L$-bit words $x$ and $y$
$$ \lambda(x \oplus y) = \lambda(x) \oplus \lambda(y). $$
Linear hash functions are well known and date back at least to Zobrist [8].

The proposed methods allow us to build a code for any set of errors (including the case when errors occur in packets). In particular, this method can be used to detect errors whose number does not exceed a predetermined limit (for example, detecting any three errors). Note that the well-known cyclic redundancy check (CRC) codes do not detect a predetermined number of errors with certainty; rather, CRC makes it possible to detect a predetermined number of errors (say, 3) only with a certain probability. It is also worth noting that the performance of the proposed codes slightly exceeds the well-known Varshamov-Gilbert (VG) bound [2].

When considering error correction and detection codes, the problem of the complexity of the method is very important.
Three questions arise: the complexity of i) encoding, ii) decoding, and iii) constructing the encoding and decoding methods. In the case of error detection, encoding and decoding are quite simple, whereas the complexity of constructing the encoding and decoding methods is relatively large. To overcome this, we propose a randomized algorithm for constructing an encoder and decoder whose performance is close to optimal, but whose complexity is much smaller.

The rest of the paper is organised as follows. The next section contains a description of some of the properties of linear hash functions, as well as a general scheme of their application to codes. Section III is devoted to a model in which errors are letter distortions and an upper bound on their number is given. First, we describe a code which meets the VG bound. This method is then generalized in two directions: we describe a modification that performs slightly better than the VG estimate, and we propose a randomized algorithm. The last section describes general methods of error detection and correction.
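The parity-bit scheme described above can be sketched in a few lines (a minimal illustration; the function names are ours, not the paper's):

```python
def encode_parity(info_bits):
    """Append a check bit so that the total number of 1s is even."""
    return info_bits + [sum(info_bits) % 2]

def has_error(received):
    """An odd number of 1s signals that an error occurred."""
    return sum(received) % 2 == 1

x = encode_parity([1, 0, 1, 1, 0, 1])    # L = 7 with L - 1 = 6 information bits
assert not has_error(x)                  # no distortion: nothing is reported
y = x[:]
y[2] ^= 1                                # one flipped bit is detected
assert has_error(y)
y[5] ^= 1                                # a second flip goes undetected,
assert not has_error(y)                  # as noted in the text
```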
II. LINEAR HASH FUNCTIONS AND THEIR APPLICATIONS TO ERROR DETECTION AND CORRECTION
A. Representation of linear hash functions as sums of words
Consider a linear hash function $\lambda$ defined on the set of $L$-bit binary words $\{0,1\}^L$ and taking values $\lambda(x)$ in the set of $l$-bit binary words $\{0,1\}^l$. It will be convenient to denote by $e^k_i$ a string of $k$ bits that contains 1 at the $i$-th position and zeros at all others, and let $e^k$ be the string of length $k$ consisting only of 0's.

Let $x = x_1 \ldots x_L$ be an $L$-bit word and $v_1, \ldots, v_L$ be any $l$-bit words. Define a function
$$ \lambda(x) = x_1 \times v_1 \oplus x_2 \times v_2 \oplus \ldots \oplus x_L \times v_L, \qquad (3) $$
where $x_i \in \{0,1\}$ and we assume $0 \times v = 00\ldots0$, $1 \times v = v$. For any two vectors $x, y$ we obtain
$$ \lambda(x \oplus y) = (x_1 \oplus y_1) \times v_1 \oplus (x_2 \oplus y_2) \times v_2 \oplus \ldots \oplus (x_L \oplus y_L) \times v_L = $$
$$ (x_1 \times v_1) \oplus (y_1 \times v_1) \oplus (x_2 \times v_2) \oplus (y_2 \times v_2) \oplus \ldots \oplus (x_L \times v_L) \oplus (y_L \times v_L) = \lambda(x) \oplus \lambda(y). $$
So, the hash function (3) is linear. On the other hand, for any linear hash function $\lambda'$
$$ \lambda'(x) = x_1 \times \lambda'(e^L_1) \oplus x_2 \times \lambda'(e^L_2) \oplus \ldots \oplus x_L \times \lambda'(e^L_L) $$
and, hence, $\lambda'$ is represented in the form (3), where $v_i = \lambda'(e^L_i)$. Thus, we derived the following: Theorem 1.
A hash function $\lambda$ is linear if and only if it can be represented as
$$ \lambda(x) = x_1 \times v_1 \oplus x_2 \times v_2 \oplus \ldots \oplus x_L \times v_L \qquad (4) $$
for some $l$-bit words $v_1, \ldots, v_L \in \{0,1\}^l$.

Note that the CRC code is a linear hash function and, hence, can be represented as in (3). It is also worth noting that the calculation of (3) does not require multiplication or other time-consuming operations, and can be performed in linear time.
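Representation (3) also suggests a direct implementation: computing $\lambda(x)$ is just an XOR of the words $v_i$ that correspond to the 1-bits of $x$. The sketch below (our own naming; $l$-bit words are modelled as integers) checks the linearity property on random inputs:

```python
import random

def linear_hash(x_bits, v):
    """lambda(x) per representation (3): XOR the v_i with x_i = 1."""
    h = 0
    for xi, vi in zip(x_bits, v):
        if xi:                      # x_i * v_i is v_i if x_i = 1, else the zero word
            h ^= vi
    return h

L, l = 16, 5
random.seed(1)
v = [random.randrange(2 ** l) for _ in range(L)]

# Check linearity: lambda(x XOR y) = lambda(x) XOR lambda(y).
for _ in range(100):
    x = [random.randrange(2) for _ in range(L)]
    y = [random.randrange(2) for _ in range(L)]
    xy = [a ^ b for a, b in zip(x, y)]
    assert linear_hash(xy, v) == linear_hash(x, v) ^ linear_hash(y, v)
```

Note that, as the text observes, no multiplications are needed: the "product" $x_i \times v_i$ is just a conditional XOR.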
B. A scheme for using a linear hash function to detect and correct errors
Consider the following data transfer scheme: there is a set of $L$-bit messages $A$ and a set of possible distortions (or errors) $D \subset \{0,1\}^L$. If the message $x \in A$ is sent through the channel, a distortion $d \in D$ may occur, that is, the recipient receives the message $x \oplus d$. (For example, if $D$ contains all words with two 1's, this means that two-bit errors may occur during the transfer.)

A key component is a linear hash function $\lambda$ such that
$$ \lambda(x) = e^l \ \text{for all} \ x \in A \quad \text{and} \quad \lambda(d) \ne e^l \ \text{for all} \ d \in D, \qquad (5) $$
where the set $A$ is constructed as follows: any message $x = x_1 \ldots x_L$ consists of $L-l$ information symbols $x_{i_1} \ldots x_{i_{L-l}}$, while the remaining $l$ symbols are used as check symbols. (Generally, we will use $x_1 \ldots x_{L-l}$ as information symbols and $x_{L-l+1} \ldots x_L$ as check symbols.) The check symbols are chosen in such a way that $\lambda(x) = e^l$ for all $x \in A$. If the distortion $d \in D$ occurs, the received message $y$ can be presented as $y = x \oplus d$. (If no error occurs, then $y = x$.) We can see from this equation that this method gives a possibility to detect any distortion $d \in D$, because
$$ \lambda(x) = e^l, \qquad \lambda(y) = \lambda(x \oplus d) = \lambda(x) \oplus \lambda(d) = \lambda(d) \ne e^l. \qquad (6) $$
Thus, this scheme allows us to detect any distortion $d \in D$, because the inequality $\lambda(y) \ne e^l$ means that some $d$ occurred, and, conversely, the equation $\lambda(y) = e^l$ informs about the absence of an error.

This system can be used to correct errors if the following additional property holds: all values of $\lambda(d)$ are different, i.e., $\lambda(d_i) \ne \lambda(d_j)$ for all $d_i, d_j \in D$, $d_i \ne d_j$. Indeed, in this case the decoder may first compute $\lambda(y) = \lambda(d)$, see (6). All $\lambda(d)$ are different and therefore the decoder can find $d$ from $\lambda(d)$ and compute $x = y \oplus d$.

III. CODES FOR A LIMITED NUMBER OF LETTER ERRORS
We consider codes which can detect or correct a limited number of bit errors, that is, the possible distortions belong to the ball $B^L_d$ of a certain radius $d$, $L > d \ge 1$. For this purpose we develop methods for constructing a linear hash function $\lambda$, $\lambda : \{0,1\}^L \to \{0,1\}^l$, $l \le L$, and a set $A$ such that
$$ d_h(A) = d, \ d \ge 2, \quad \lambda(x) = e^l \ \text{for any} \ x \in A \quad \text{and} \quad \lambda(y) \ne e^l \ \text{for any} \ y \in \{0,1\}^L \setminus A. \qquad (7) $$
The following property of this construction will play an important role.

Theorem 2.
Let there be a linear hash function $\lambda$, an integer $d$, $d \ge 2$, and a set $A$ for which $\lambda(x) = e^l$ for any $x \in A$ and $\lambda(y) \ne e^l$ for any $y \in \{0,1\}^L \setminus A$. Then $d_h(A) \ge d$, $d > 1$, if and only if $\lambda(v) \ne e^l$ for any $v \in B^L_{d-1} \setminus \{e^L\}$.

Proof. Suppose that $d_h(A) = d$. Then, for any $x \in A$ and any $v \in B^L_{d-1} \setminus \{e^L\}$, the word $x \oplus v$ does not belong to $A$, because $d_h(x, (x \oplus v)) = ||v|| \le d-1$. Hence, $\lambda(x \oplus v) \ne e^l$. From this we obtain $\lambda(v) = \lambda(x) \oplus \lambda(v) = \lambda(x \oplus v) \ne e^l$.

Let us prove the opposite statement. Suppose $\lambda(v) \ne e^l$ for all $v \in B^L_{d-1} \setminus \{e^L\}$. Let $x \in A$, $v \in B^L_{d-1}$. We can see that $x \oplus v$ does not belong to $A$, because $\lambda(x \oplus v) = \lambda(x) \oplus \lambda(v) = e^l \oplus \lambda(v) \ne e^l$. So, if $0 < ||v|| < d$ for some $v$, then $x \oplus v$ does not belong to $A$ and, hence, $d_h(A) \ge d$.

The construction (7) can be directly used in error detection and correction codes. Indeed, as mentioned in the introduction, those codes are as follows: either
i) a code that can detect $d-1$ or fewer bit errors, or
ii) a code that can correct $\lfloor (d-1)/2 \rfloor$ or fewer bit errors.
In accordance with this, we will call the hash function $\lambda$ and the set $A$ in (7) a code. In this section we consider three methods for constructing such codes. The first method produces a code that matches the VG bound and can be easily randomized. A slightly improved estimate will be valid for the second method, while the third method is a greatly simplified version of the first one, obtained using randomization.
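Theorem 2 is easy to verify by brute force on a small instance. In the sketch below (a toy choice of parameters, not from the paper) we take $L = 7$, $l = 3$ and let $\lambda(e^L_i)$ be the binary representation of $i$; this is the Hamming-code assignment that reappears in the examples of Section III, and it satisfies the condition of the theorem for $d = 3$:

```python
from itertools import combinations, product

L, l = 7, 3
v = list(range(1, L + 1))                 # lambda(e^L_i) = i written in binary

def lam(x):                               # lambda(x) per representation (3)
    h = 0
    for xi, vi in zip(x, v):
        if xi:
            h ^= vi
    return h

# lambda(w) != 0 for every nonzero w of weight <= d - 1 = 2, so Theorem 2
# promises d_h(A) >= 3 for A = {x : lambda(x) = 0}.
for wt in (1, 2):
    for pos in combinations(range(L), wt):
        w = tuple(1 if i in pos else 0 for i in range(L))
        assert lam(w) != 0

A = [x for x in product((0, 1), repeat=L) if lam(x) == 0]
dmin = min(sum(a != b for a, b in zip(u, w)) for u in A for w in A if u != w)
assert dmin == 3

# Correcting one error: the value lambda(y) = lambda(d) identifies d.
x = A[5]
y = list(x)
y[4] ^= 1                                 # distort one position
assert lam(y) == v[4]                     # the syndrome names the flipped position
```

Here the assignment $v_i = i$ happens to satisfy the condition of Theorem 2 for $d = 3$; Algorithm 1 below produces such assignments systematically.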
A. Method which meets VG bound
Here our goal is to build a code (7) for given integers $L$ and $d$, $L > d \ge 2$. It means that we should find methods i) to calculate $l$, ii) to build $\lambda$, and iii) to describe how to find, for any information symbols $x_1 \ldots x_{L-l}$, the check symbols $x_{L-l+1} \ldots x_L$ for which $\lambda(x_1 \ldots x_L) = e^l$ (that is, $x_1 \ldots x_L \in A$).
1) Building the hash function $\lambda$: The following algorithm (Algorithm 1) is intended to find $l$ and $\lambda$, while a method for performing iii) will be described immediately afterwards. The input is two integers $L, d$. The output is
$$ l = \left\lceil \log \left( \sum_{i=0}^{d-2} \binom{L-1}{i} + 1 \right) \right\rceil , \qquad (8) $$
a linear hash function $\lambda : \{0,1\}^L \to \{0,1\}^l$ for which (7) holds true, and the set $A$ (here and below $\log = \log_2$). If $l$ in (8) is not defined or $l \ge L$, then the algorithm stops and answers that the solution does not exist.

Algorithm 1.

First step.
Calculate $l$ in (8) and define
$$ \hat\lambda(e^L_1) = e^l_1, \ \hat\lambda(e^L_2) = e^l_2, \ \ldots, \ \hat\lambda(e^L_l) = e^l_l. \qquad (9) $$

Second step.
For $i = l+1, l+2, \ldots, L$ define $\hat\lambda(e^L_i)$ as follows:
$$ \hat\lambda(e^L_i) = v_i, \ \text{where } v_i \text{ is any word from } \{0,1\}^l \setminus \hat\lambda(B^{i-1}_{d-2}). \qquad (10) $$
Here and below $\hat\lambda(Z) = \bigcup_{z \in Z} \{\hat\lambda(z)\}$ for any set $Z$, and $B^{i-1}_{d-2}$ is the set of all words $b_1 b_2 \ldots b_L$ from $B^L_{d-2}$ such that $b_i = b_{i+1} = \ldots = b_L = 0$. Note that i) the values $\hat\lambda(e^L_j)$, $j = 1, 2, \ldots, i-1$, are already defined when $\hat\lambda(B^{i-1}_{d-2})$ is calculated, and ii) the set $\{0,1\}^l \setminus \hat\lambda(B^{i-1}_{d-2})$ is not empty for $i = l+1, \ldots, L$, due to $|\hat\lambda(B^{i-1}_{d-2})| \le \sum_{j=0}^{d-2} \binom{i-1}{j}$ and the definition of $l$ in (8).

From this definition we can see that $\hat\lambda(e^L_i) \ne \hat\lambda(w)$ for any $w \in B^{i-1}_{d-2}$ and, hence, $\hat\lambda(w') \ne e^l$ for any $w' \in B^i_{d-1} \setminus \{e^L\}$, for $i = l+1, l+2, \ldots, L$. From (9) and (10) we can see that
$$ \hat\lambda(u) \ne e^l \ \text{for every} \ u \in B^L_{d-1} \setminus \{e^L\}. \qquad (11) $$
Final step.
The goal of this step is to permute the values of the hash function $\hat\lambda$ in such a way that the last $l$ values $\lambda(e^L_{L-l+1}), \ldots, \lambda(e^L_L)$ will be the first $l$ values of $\hat\lambda$. Clearly, this step is not a mandatory procedure, but it simplifies the encoding of the information symbols. Note that any permutation of coordinates of the set $B^m_n$ does not change it, so the following procedure is correct: define $\lambda$ using $\hat\lambda$ as follows: $\lambda(e^L_i) = \hat\lambda(e^L_{i+l})$ for $i = 1, \ldots, L-l$ and $\lambda(e^L_{L-l+i}) = \hat\lambda(e^L_i)$ for $i = 1, \ldots, l$. Note that
$$ \lambda(e^L_{L-l+i}) = e^l_i \qquad (12) $$
for $i = 1, \ldots, l$, see (9). From (11) we obtain
$$ \lambda(u) \ne e^l \ \text{for all} \ u \in B^L_{d-1} \setminus \{e^L\}. \qquad (13) $$
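Algorithm 1 can be sketched as a greedy search (the final permutation step is omitted, since it only reorders coordinates; the function names and concrete parameters below are our own):

```python
from itertools import combinations
from math import comb, ceil, log2

# Greedy sketch of Algorithm 1: pick hat_lambda(e^L_i) outside
# hat_lambda(B^{i-1}_{d-2}), so that no nonzero word of weight at most d-1
# hashes to the zero word; Theorem 2 then gives d_h(A) >= d.

def build_hash(L, d):
    l = ceil(log2(sum(comb(L - 1, i) for i in range(d - 1)) + 1))   # formula (8)
    v = [1 << (l - 1 - i) for i in range(l)]     # step (9): unit words
    for i in range(l, L):                        # step (10)
        used = {0}                               # hashes of words of weight <= d-2
        for w in range(1, d - 1):                # supported on the first i positions
            for pos in combinations(range(i), w):
                h = 0
                for p in pos:
                    h ^= v[p]
                used.add(h)
        v.append(next(c for c in range(2 ** l) if c not in used))
    return l, v

L, d = 15, 4
l, v = build_hash(L, d)

# Conclusion (11): lambda(u) != 0 for every nonzero u of weight < d.
for w in range(1, d):
    for pos in combinations(range(L), w):
        h = 0
        for p in pos:
            h ^= v[p]
        assert h != 0
```

The choice of $l$ in (8) guarantees that the set of "used" values never exhausts $\{0,1\}^l$, so the greedy step always succeeds.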
2) Description of the set $A$, or encoding: Now we can describe the set $A$, that is, the method of encoding the information symbols. Let $x_1 \ldots x_{L-l}$ be a set of information symbols for which we want to find the check symbols $x_{L-l+1} \ldots x_L$. To do this, first pad $x_1 \ldots x_{L-l}$ with $l$ zeros at the end and denote the obtained string by $u = x_1 \ldots x_{L-l} 0 \ldots 0$. Then calculate $\lambda(u) = w_1 \ldots w_l$ and define $x_{L-l+1} = w_1, x_{L-l+2} = w_2, \ldots, x_L = w_l$. Taking into account (12), we can see that $\lambda(00 \ldots 0\, x_{L-l+1} \ldots x_L) = \lambda(00 \ldots 0\, w_1 \ldots w_l) = (w_1 \ldots w_l)$. From this we obtain
$$ \lambda(x_1 \ldots x_L) = \lambda(x_1 \ldots x_{L-l} 0 \ldots 0) \oplus \lambda(00 \ldots 0\, x_{L-l+1} \ldots x_L) = (w_1 \ldots w_l) \oplus (w_1 \ldots w_l) = e^l. $$
So, for any information symbols $x_1 \ldots x_{L-l}$ we find check symbols $x_{L-l+1} \ldots x_L$ such that $\lambda(x_1 \ldots x_L) = e^l$.

It will be convenient to describe the properties of the described algorithm as follows:

Theorem 3. i) The described algorithm is correct, that is, $d_h(A) \ge d$;
ii) the following equality is valid for the number of information symbols $L-l$:
$$ L - l = L - \left\lceil \log \left( \sum_{i=0}^{d-2} \binom{L-1}{i} + 1 \right) \right\rceil . \qquad (14) $$

Proof.
Taking into account (13), we obtain the first statement i) from Theorem 2. Statement ii) follows from (8).
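The encoding step of Section III-A2 can be sketched as follows; here the values $v_i$ for the information positions are random stand-ins for those produced by Algorithm 1, since only property (12) for the last $l$ positions matters for the encoding to work:

```python
import random

random.seed(7)
L, l = 12, 4
# Stand-in values for the information positions (nonzero, otherwise arbitrary),
# followed by the unit words required by property (12).
v = [random.randrange(1, 2 ** l) for _ in range(L - l)]
v += [1 << (l - 1 - i) for i in range(l)]

def lam(bits):
    h = 0
    for b, vi in zip(bits, v):
        if b:
            h ^= vi
    return h

def encode(info):                                # info = x_1 ... x_{L-l}
    w = lam(list(info) + [0] * l)                # hash of the zero-padded word
    checks = [(w >> (l - 1 - i)) & 1 for i in range(l)]
    return list(info) + checks

x = encode([1, 0, 1, 1, 0, 0, 1, 1])
assert lam(x) == 0                               # every codeword hashes to e^l
```

By linearity, the check symbols cancel the hash of the padded information word, exactly as in the derivation above.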
3) The complexity:
Now consider the complexity of the proposed method. There are the following important characteristics to consider: i) the time $T$ to construct the hash function using the algorithm described, and ii) the encoding time ($t_{enc}$) and decoding time ($t_{dec}$) if the method is used for error detection or correction. It is important to note that the hash function must be constructed only once and then used many times (for different inputs), while encoding and decoding are performed for each input.

Claim 1. i) The time $T$ is proportional to $\sum_{j=0}^{d-2} \binom{L-1}{j}$. If $L$ grows to $\infty$ and $d$ is a constant, then $T = O((L \log L)^{d-2})$; if $L \to \infty$ and $\lim d/L$ equals some $\alpha$, then $T = 2^{L H(\alpha)}$.
ii) If this algorithm is used for error detection, then $t_{enc} = t_{dec} = O((d-2)\, L \log L)$. For error correction $t_{enc}$ is the same, but $t_{dec}$ is proportional to $T$ in i). (Here $H(\alpha) = -(\alpha \log \alpha + (1-\alpha) \log(1-\alpha))$ is the Shannon entropy, see [7].)

Proof.
The proof is based on a direct estimation of the number of bit operations and known estimates of the binomial coefficients, see [7], [9].
4) Examples:
We start with the case $d = 2$, which gives a possibility to detect one error. So, the input of the algorithm is $L \ge 2$ and $d = 2$. From (8) we see that $l = \log(1+1) = 1$. From (9) we obtain $\hat\lambda(e^L_1) = 1$. Taking into account that $B^{i-1}_0$ consists of the single zero word, we can see from (10) that $\hat\lambda(e^L_i) = 1$ for all $i = 2, \ldots, L$. From this and (12) we can see that $\lambda(e^L_i) = 1$ for all $i = 1, 2, \ldots, L$. The information symbols are $x_1 \ldots x_{L-1}$, while the check symbol is $x_L$. If this method is applied to error detection, the encoder calculates $x_L = \bigoplus_{i=1}^{L-1} (x_i \times \lambda(e^L_i)) = \bigoplus_{i=1}^{L-1} x_i$, while the decoder calculates $\bigoplus_{i=1}^{L} (y_i \times \lambda(e^L_i)) = \bigoplus_{i=1}^{L} y_i$. If this sum is 1, then one error occurred; otherwise no error occurred. Thus, in this case, the code based on linear hash functions coincides with the parity-check method.

The second example is $d = 3$. Now the input of the algorithm is $L$ and $d = 3$. From (8) we obtain $l = \lceil \log((L-1) + 1 + 1) \rceil = \lceil \log(L+1) \rceil$. According to the first step (see (9)), $\hat\lambda(e^L_1) = e^l_1, \hat\lambda(e^L_2) = e^l_2, \ldots, \hat\lambda(e^L_l) = e^l_l$. Having taken into account (10), we can see that the values $\hat\lambda(e^L_i)$, $i = l+1, \ldots, L$, will be assigned different words from $\{0,1\}^l \setminus \{e^l, e^l_1, \ldots, e^l_l\}$. From the final step of the algorithm we can see that $\lambda(e^L_{L-l+1}) = e^l_1, \lambda(e^L_{L-l+2}) = e^l_2, \ldots, \lambda(e^L_L) = e^l_l$, while the other values of $\lambda$ are different words from $\{0,1\}^l \setminus \{e^l, e^l_1, \ldots, e^l_l\}$. The encoding is carried out according to the method of Section III-A2 described above.

Note that encoding and decoding can be implemented in such a way that there is no need to store the values $\lambda(e^L_1), \lambda(e^L_2), \ldots, \lambda(e^L_L)$. Indeed, one can select the values $\lambda(e^L_1), \lambda(e^L_2), \ldots, \lambda(e^L_{L-l})$ in lexicographical order and calculate these values sequentially during encoding and decoding.

It is interesting that, in fact, the described method is the well-known Hamming code, which can either detect two errors or correct one [2]. (Indeed, the described code can correct one error as follows: if the transmitted (or stored) message is $y$ and one error occurred, then for some $i$, $\lambda(y) = \lambda(e^L_i)$. This means that the error occurred in the $i$-th position.)

B. Methods whose performance outperforms the VG bound
The method proposed here is a modification of the previous one. The only difference is the choice of the new value $\hat\lambda(e^L_i)$ in (10). That is why we describe only those parts of the algorithm that are different, i.e., the output and the second step. It will be convenient to describe the method and the purpose of the modifications together.

First, we describe the main idea of the proposed modification. From (10) we can see that the value of $l$ is determined by the size of the set $\hat\lambda(B^{L-1}_{d-2})$, because it must not be greater than $2^l - 1$. Then we use the obvious inequality $|\hat\lambda(B^{L-1}_{d-2})| \le |B^{L-1}_{d-2}|$ and the requirement $|B^{L-1}_{d-2}| \le 2^l - 1$ instead of $|\hat\lambda(B^{L-1}_{d-2})| \le 2^l - 1$. In what follows we build a hash function $\hat\lambda$ such that $|\hat\lambda(B^{L-1}_{d-2})|$ is less than $|B^{L-1}_{d-2}|$. For this purpose we find subsets $U, V$ of $B^{L-1}_{d-2}$ such that $U \cap V = \emptyset$ and $\hat\lambda(U) = \hat\lambda(V)$. Taking into account that $|\hat\lambda(Z)| \le |Z|$ for any $Z$ and the last equation, we can see that
$$ |\hat\lambda(B^{L-1}_{d-2})| = |\hat\lambda(B^{L-1}_{d-2} \setminus U)| \le |B^{L-1}_{d-2} \setminus U| = |B^{L-1}_{d-2}| - |U|. \qquad (15) $$
So, if we find such sets $U$ and $V$, we have the upper bound $|\hat\lambda(B^{L-1}_{d-2})| \le |B^{L-1}_{d-2}| - |U|$ instead of $|\hat\lambda(B^{L-1}_{d-2})| \le |B^{L-1}_{d-2}|$ and, hence, can reduce the number of check bits $l$.
Now we can describe the modified algorithm. As we mentioned, the only difference is the second step and the definition of $l$, which are as follows:
$$ l = \left\lceil \log \left( \sum_{i=0}^{d-2} \binom{L-1}{i} - \sum_{s=1}^{d-2} \binom{d-1}{s} \sum_{j=1}^{s-1} \binom{L-d-1}{j} + 1 \right) \right\rceil , \qquad (16) $$
and

Second step.
For $i = l+1, l+2, \ldots, L$ define $\hat\lambda(e^L_i)$ as follows:
$$ \hat\lambda(e^L_i) = w_i, \ \text{where } w_i \text{ is any word from } \hat\lambda(B^{i-1}_{d-1}) \setminus \hat\lambda(B^{i-1}_{d-2}). \qquad (17) $$
Note that $\sum_{s=1}^{d-2} \left( \binom{d-1}{s} \sum_{j=1}^{s-1} \binom{L-d-1}{j} \right)$ corresponds to $|U|$ in (15).

Now we describe the sets $U$ and $V$. From (17) we can see that for $i = L-1$, $\hat\lambda(e^L_{L-1}) = w_{L-1} \in \hat\lambda(B^{L-2}_{d-1}) \setminus \hat\lambda(B^{L-2}_{d-2})$. By definition, $\hat\lambda(e^L_{L-1}) = \hat\lambda(00\ldots010)$. On the other hand, from (17) we see that there exists $x = x_1 \ldots x_{L-2}$ which contains $d-1$ ones and $\hat\lambda(x) = w_{L-1}$. Hence, $\hat\lambda(00\ldots010) = \hat\lambda(x)$. The word $x$ contains $d-1$ ones among $x_1 \ldots x_{L-2}$. To simplify the notation we suppose that $x_1 = 1, \ldots, x_{d-1} = 1$, whereas the other $x_i = 0$. (We can do this without loss of generality due to the symmetry of the set $B^{L-2}_{d-1} \setminus B^{L-2}_{d-2}$.) So,
$$ \hat\lambda(00\ldots010) = \hat\lambda(11\ldots10\ldots00), \ \text{where } 11\ldots1 \text{ is } d-1 \text{ ones}. \qquad (18) $$
Define
$$ Z = \{ z : z = xy, \ \text{where } |x| = d-1, \ |y| = L-d-1, \ ||x|| + ||y|| \le d-3, \ \text{and } ||y|| \le ||x|| - 1 \}. \qquad (19) $$
Now we define the following sets:
$$ U = \{ u : u = 00\ldots010 \oplus z, \ z \in Z \}, \quad V = \{ v : v = 11\ldots10\ldots00 \oplus z, \ z \in Z \}, \qquad (20) $$
where, as before in (18), $11\ldots1$ is $d-1$ ones.

Claim 2. i) $U \cap V = \emptyset$;
ii) $\hat\lambda(U) = \hat\lambda(V)$;
iii) $U \subset B^{L-1}_{d-2}$, $V \subset B^{L-1}_{d-2}$;
iv)
$$ |U| = \sum_{s=1}^{d-2} \binom{d-1}{s} \sum_{j=1}^{s-1} \binom{L-d-1}{j}. \qquad (21) $$

Corollary. From i)-iii) we can see that $\hat\lambda(B^{L-1}_{d-2}) = \hat\lambda(B^{L-1}_{d-2} \setminus U)$. From this and iv) we obtain $|\hat\lambda(B^{L-1}_{d-2})| = |\hat\lambda(B^{L-1}_{d-2} \setminus U)| \le |B^{L-1}_{d-2}| - |U| = \sum_{i=0}^{d-2} \binom{L-1}{i} - \sum_{s=1}^{d-2} \left( \binom{d-1}{s} \sum_{j=1}^{s-1} \binom{L-d-1}{j} \right)$. Taking into account that $|\hat\lambda(B^{L-1}_{d-2})|$ must not be greater than $2^l - 1$, we obtain (16).

Proof. i) The two last digits of any $u \in U$ are 10, whereas the two last digits of any $v \in V$ are 00, see (19) and (20).
ii) $|U| = |V|$ (see (19), (20)), and for any $u \in U$ there exists $v \in V$ such that $\hat\lambda(u) = \hat\lambda(v)$. (Indeed, for any $u$: $u = 00\ldots010 \oplus z$, $z \in Z$. Hence, taking into account the linearity of $\hat\lambda$ and (18), for $v = 11\ldots10\ldots00 \oplus z$ we obtain $\hat\lambda(u) = \hat\lambda(v)$.)
iii) From the definition of $U$ in (19), (20) we can see that $||u|| = ||x|| + ||y|| + 1$. Taking into account that $||x|| + ||y|| \le d-3$, we obtain $||u|| \le d-2$, that is, $u \in B^{L-1}_{d-2}$. Let us consider the set $V$. From (19), (20) we can see that $||v|| = (d-1) - ||x|| + ||y||$ and $||y|| + 1 \le ||x||$. From the latter two relations we obtain $||v|| \le d-2$, that is, $v \in B^{L-1}_{d-2}$.
iv) The equality $|U| = |Z|$ follows from (20). Let now $s = ||x||$, $j = ||y||$. From (19) we can see that $||y|| \le ||x|| - 1$, that is, $1 \le j \le s-1$. Taking into account that $||x|| \le d-2$, we can see that $1 \le s \le d-2$. Using common combinatorial formulas, we obtain
$$ |U| = \sum_{s=1}^{d-2} \binom{d-1}{s} \sum_{j=1}^{s-1} \binom{L-d-1}{j}. $$
The claim is proven.
It will be convenient to summarize the properties of the algorithm just described as follows:
Theorem 4.
For the modified Algorithm 2, the following equality is valid for the number of information symbols $L-l$:
$$ L - l = L - \left\lceil \log \left( \sum_{i=0}^{d-2} \binom{L-1}{i} - \sum_{s=1}^{d-2} \binom{d-1}{s} \sum_{j=1}^{s-1} \binom{L-d-1}{j} + 1 \right) \right\rceil . \qquad (22) $$
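As a rough numeric illustration, the numbers of check bits given by (14) and (22) can be compared directly. The summation limits below follow our reading of formula (16), so the exact figures are indicative only; since the subtracted term is non-negative, the modified bound never requires more check bits:

```python
from math import comb, ceil, log2

def l_vg(L, d):
    """Check bits per formula (8) / (14)."""
    return ceil(log2(sum(comb(L - 1, i) for i in range(d - 1)) + 1))

def l_mod(L, d):
    """Check bits per formula (16) / (22), under our reading of its limits."""
    S = sum(comb(L - 1, i) for i in range(d - 1))
    U = sum(comb(d - 1, s) * sum(comb(L - d - 1, j) for j in range(1, s))
            for s in range(1, d - 1))
    return ceil(log2(S - U + 1))

for L, d in [(63, 5), (255, 7), (1023, 9)]:
    assert l_mod(L, d) <= l_vg(L, d)     # the correction can only help
```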
C. A randomised algorithm whose performance is close to the VG bound.
In this part we consider a randomised algorithm whose performance is close to the VG bound, but whose complexity is much smaller.

Let, as before, the block length be $L$, the required code distance be $d$, and $l = \lceil \log (\sum_{i=0}^{d-2} \binom{L-1}{i} + 1) \rceil$, see (8). Define $l_\Delta = l + \Delta$, where $\Delta$ is an integer such that $L - l_\Delta \ge 1$.

The only difference between the new randomised algorithm and Algorithm 1 is in the second step (10). In the new algorithm the values $\hat\lambda(e^L_i)$, $i = l_\Delta + 1, \ldots, L$, are chosen randomly from $\{0,1\}^{l_\Delta}$ according to the uniform distribution. We call this method Algorithm 3, or the randomised algorithm.

Our goal is to estimate the probability of the following event:
$$ \Pi = \{ \text{for } i = l_\Delta + 1, \ldots, L, \text{ the (randomly chosen) word } \hat\lambda(e^L_i) \text{ belongs to } \{0,1\}^{l_\Delta} \setminus \hat\lambda(B^{i-1}_{d-2}) \}, \qquad (23) $$
see (10). In turn, if $\Pi$ occurs, then this gives a possibility to build an encoding set $A$ for which $d_h(A) \ge d$. Define
$$ \Pi_i = \{ \text{a uniformly chosen word } u \text{ belongs to } \{0,1\}^{l_\Delta} \setminus \hat\lambda(B^{i-1}_{d-2}) \}. \qquad (24) $$
Clearly, $\Pi = \Pi_{l_\Delta+1} \cap \ldots \cap \Pi_L$, and the following chain of relations is valid:
$$ P(\Pi) = P(\Pi_{l_\Delta+1} \cap \ldots \cap \Pi_L) = P(\Pi_{l_\Delta+1})\, P(\Pi_{l_\Delta+2} \,|\, \Pi_{l_\Delta+1})\, P(\Pi_{l_\Delta+3} \,|\, \Pi_{l_\Delta+2} \Pi_{l_\Delta+1}) \ldots P(\Pi_L \,|\, \Pi_{L-1} \ldots \Pi_{l_\Delta+1}) $$
$$ \ge \prod_{i=l_\Delta+1}^{L} \frac{|\{0,1\}^{l_\Delta} \setminus \hat\lambda(B^{i-1}_{d-2})|}{2^{l_\Delta}} \ge \prod_{i=l_\Delta+1}^{L} \Big( 1 - |B^{i-1}_{d-2}| / 2^{l_\Delta} \Big) \ge \prod_{i=l_\Delta+1}^{L} \Big( 1 - \sum_{j=0}^{d-2} \binom{i-1}{j} / 2^{l_\Delta} \Big) \ge 1 - 2^{-l_\Delta} \sum_{i=l_\Delta+1}^{L} \sum_{j=0}^{d-2} \binom{i-1}{j}, \qquad (25) $$
where $|B^{i-1}_{d-2}| = \sum_{j=0}^{d-2} \binom{i-1}{j}$. Here we used the two following inequalities: $|\lambda(Z)| \le |Z|$ for any hash function $\lambda$ and any set $Z$, and $(1-a)(1-b) \ge 1 - (a+b)$ for non-negative $a$ and $b$.

This rather cumbersome expression can be simplified to obtain an asymptotic estimate. Indeed,
$$ \sum_{i=l_\Delta+1}^{L} \sum_{j=0}^{d-2} \binom{i-1}{j} \le \sum_{j=0}^{d-2} \sum_{m=0}^{L-1} \binom{m}{j}, \qquad (26) $$
where, by definition, $\binom{a}{b} = 0$ if $a < b$ or $b < 0$.
Now we apply the well-known identity
$$ \sum_{m=0}^{n} \binom{m}{k} = \binom{n+1}{k+1}, $$
which is sometimes called the hockey-stick identity (see, for example, [9]). From this and (26) we obtain
$$ \sum_{i=l_\Delta+1}^{L} \sum_{j=0}^{d-2} \binom{i-1}{j} \le \sum_{j=0}^{d-2} \sum_{m=0}^{L-1} \binom{m}{j} = \sum_{j=0}^{d-2} \binom{L}{j+1}. $$
From this and (25) we obtain the inequality
$$ P(\Pi) \ge 1 - 2^{-l_\Delta} \sum_{j=0}^{d-2} \binom{L}{j+1}, \qquad (27) $$
which can be used instead of the more complicated right part of (25).

Considering that the occurrence of the event $\Pi$ guarantees that $d_h(A) \ge d$ for the constructed encoding set $A$, and combining this with (27), we obtain the following Theorem 5.
Let $L$, $d$ and $\Delta$ be integers and let $l$ correspond to the VG bound, see (8). If Algorithm 3 (randomised) is applied and the number of check symbols is $l + \Delta$ (that is, the values of the hash function are chosen randomly from $\{0,1\}^{l+\Delta}$ according to the uniform distribution), then the probability of the event $\Pi^*$ that the encoding set $A$ satisfies $d_h(A) \ge d$ obeys the following inequality:
$$ P(\Pi^*) \ge 1 - 2^{-(l+\Delta)} \sum_{j=0}^{d-2} \binom{L}{j+1}. $$

Corollary.
Clearly, $\sum_{j=0}^{d-2} \binom{L}{j+1} < 2^{2l}$. From this and the theorem we obtain $-\log(1 - P(\Pi^*)) \ge \Delta + O(1)$ as $L \to \infty$.

So we can see that the probability of getting an encoding set $A$ with $d_h(A) \ge d$ is mainly determined by the value of $\Delta$, that is, by the number of extra bits that are added to the VG bound $l$. This gives a possibility to build simple error detection codes for which the number of extra check bits $\Delta$ depends neither on the length of the message ($L$) nor on the number of errors that can be detected ($d-1$).

IV. A GENERAL METHOD FOR ERRORS OF ANY TYPE
Now we consider a general case where there is a length $L$ of transmitted (or stored) messages and a set of possible distortions $D \subset \{0,1\}^L$ such that any input message $x$ can be received as $x \oplus d$, $d \in D$. For example, let $L = 5$ and $D = \{11100, 01110, 00111\}$. It means that three consecutive letters can be changed. If $x = 01010$ and $d = 11100$, the output message is $y = 10110$.

So far, we have considered the case where the check symbols are located at the end of the message. Now it will be convenient to assume that the check symbols can be located in different positions, but, of course, their positions will be known to the encoder and decoder. This generalization allows us to simplify the notation slightly.

A. Error detection.
Let us describe an algorithm for calculating a hash function $\lambda$ that gives a possibility to detect any distortion $d \in D$, $D \subset \{0,1\}^L \setminus \{e^L\}$. That is, for any message $x$ and any $d \in D$
$$ \lambda(x \oplus d) = \lambda(d) \ne 00\ldots0, \qquad \lambda(x) = 00\ldots0. \qquad (28) $$
In order to describe the algorithm we define the sets $D_i$ and $D'_i$, $i = 1, 2, \ldots, L$, by
$$ D_i = \{ d = d_1 \ldots d_L : d \in D, \ d_i = 1 \ \text{and} \ d_{i+1} = 0, d_{i+2} = 0, \ldots, d_L = 0 \}, $$
$$ D'_i = \{ d' : \exists\, d \in D_i \ \text{for which} \ d' = (d \oplus e^L_i) \}, \qquad (29) $$
that is, $D_i$ contains all $d = d_1 \ldots d_L$ from $D$ for which $d_i = 1$ and $d_{i+1} = 0, d_{i+2} = 0, \ldots, d_L = 0$, while $D'_i$ contains all the words from $D_i$ in which $d_i$ is changed to 0.

Input.
A message length $L$ and a set of possible distortions $D \subset \{0,1\}^L \setminus \{e^L\}$.

Output.
An integer $l$ such that
$$ 2^l - 1 \ge \max_{i=1,\ldots,L} |D'_i| \qquad (30) $$
and a linear hash function $\lambda : \{0,1\}^L \to \{0,1\}^l$ for which (28) is true. If $l$ in (30) is not defined or $l \ge L$, the algorithm stops and answers that the solution does not exist.

The algorithm.

First step.
Calculate $l$ in (30) and define
$$ \lambda(e^L_1) = e^l_1, \ \lambda(e^L_2) = e^l_2, \ \ldots, \ \lambda(e^L_l) = e^l_l. \qquad (31) $$

Second step.
For $i = l+1, l+2, \ldots, L$ define $\lambda(e^L_i)$ as follows:
$$ \lambda(e^L_i) = v_i, \ \text{where } v_i \text{ is any word from } \{0,1\}^l \setminus \lambda(D'_i). \qquad (32) $$
From (31) and (32) we can see that
$$ \lambda(u) \ne e^l \ \text{for any} \ u \in D. \qquad (33) $$
Note that the values $\lambda(e^L_j)$, $j = 1, 2, \ldots, i-1$, are already defined when $\lambda(D'_i)$ is calculated, see (31) and (32).

Now we can describe the method for encoding and decoding. The positions $1, \ldots, l$ are used for check symbols, while the other $L-l$ are used for information symbols. When encoding, the encoder first puts the information symbols into positions $l+1, \ldots, L$ and 0's into positions $1, \ldots, l$. Denote the obtained word by $x^*$ and calculate $\lambda(x^*) = w_1 \ldots w_l$. Then put the letters $w_1 \ldots w_l$ into the check positions $1, \ldots, l$ and denote the obtained word by $x = x_1 \ldots x_L$. It should be clear that $\lambda(x) = 00\ldots0$. Indeed, $\lambda(x) = \lambda(x^*) \oplus \lambda(x \oplus x^*) = w_1 \ldots w_l \oplus ((e^l_1 \times w_1) \oplus (e^l_2 \times w_2) \oplus \ldots \oplus (e^l_l \times w_l)) = w_1 \ldots w_l \oplus w_1 \ldots w_l = 00\ldots0$. (Here we used the definition (31).)

It will be convenient to describe the properties of the algorithm above as follows:

Theorem 6.
The algorithm is correct, that is, if an error d ∈ D has occurred, then λ(received message) ≠ 00...0, and λ(received message) = 00...0 if no error occurred.

Proof. Suppose that the input message is x = x_1...x_L and the output message is y = y_1...y_L. Then

λ(y) = 00...0 ⊕ λ(y) = λ(x) ⊕ λ(y) = λ(x ⊕ y),

and x ⊕ y ∈ D if some distortion from D has occurred.
Taking into account (33), from these equations we can see that λ(y) ≠ 00...0 if y ≠ x, and λ(y) = 00...0 if y = x.
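The construction (29)-(32) translates directly into code. Below is a minimal Python sketch of it under the notation above; the function and variable names are mine, and representing words as bit tuples is an implementation choice, not the paper's:

```python
# Sketch of the detection construction (29)-(32); words are bit tuples,
# and all names (build_detection_hash, lam_of, ...) are illustrative.
from itertools import product

def build_detection_hash(L, D):
    """Return (l, lam_of): lam_of is a linear hash {0,1}^L -> {0,1}^l
    that is nonzero on every distortion in D, so every d in D is detected."""
    # D_i: words of D whose last 1 is in position i; D'_i: the same words
    # with that 1 changed to 0 -- definition (29).
    Dprime = {i: [] for i in range(1, L + 1)}
    for d in D:
        i = max(k for k in range(L) if d[k] == 1) + 1
        Dprime[i].append(tuple(0 if k == i - 1 else b for k, b in enumerate(d)))
    # Smallest l with 2^l - 1 >= max |D'_i| -- condition (30).
    m = max(len(v) for v in Dprime.values())
    l = 1
    while 2 ** l - 1 < m:
        l += 1
    lam = {}  # lam[i] = value of the hash on the unit vector e_i
    def lam_of(word):  # linear extension: XOR of lam[i] over the 1-bits
        h = (0,) * l
        for k, b in enumerate(word):
            if b:
                h = tuple(a ^ c for a, c in zip(h, lam[k + 1]))
        return h
    for i in range(1, l + 1):  # first step (31): lam(e_i) = e_i^l
        lam[i] = tuple(1 if j == i - 1 else 0 for j in range(l))
    for i in range(l + 1, L + 1):  # second step (32): avoid lam(D'_i)
        forbidden = {lam_of(d) for d in Dprime[i]}
        lam[i] = next(v for v in product((0, 1), repeat=l) if v not in forbidden)
    return l, lam_of

# Distortions flipping two adjacent bits of a 6-bit word, as in (34):
D = [(0,0,0,0,1,1), (0,0,0,1,1,0), (0,0,1,1,0,0), (0,1,1,0,0,0), (1,1,0,0,0,0)]
l, lam_of = build_detection_hash(6, D)
print(l)                                       # -> 1
print(all(lam_of(d) != (0,) * l for d in D))   # -> True
```

Applied to the adjacent-pair distortion set of the example below, the sketch finds l = 1 and a hash that is nonzero on every distortion, i.e., every such error is detected.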
Now consider the complexity of the proposed method. There are two important characteristics:the time of encoding and decoding and the construction time of the hash function. It is importantto note that the hash function must be prepared once, and then can be used for a long time,while encoding and decoding are performed repeatedly.
Claim 3.
The number of bit-operations (t) for encoding and decoding is not greater than O(L·l) as L grows to ∞. The number of bit-operations (T) for building the hash function λ is proportional to |D|·l.

The proof is based on a direct estimation of the number of bit-operations.

Let us consider a simple example illustrating the described method. Suppose that a system should transmit 6-bit messages, but two consecutive letters may be distorted. It means that the set of possible distortions is

D = { 000011, 000110, 001100, 011000, 110000 }. (34)

(That is, any message x_1...x_6 may change into x ⊕ d_i during the transmission, where d_i is the i-th word from D.) Our goal is to build a code which can detect any distortion from D that occurs during the transmission. For this, we first build a linear hash function λ as described in this part. According to (29) we find that D_1 and D'_1 are empty sets and

D_2 = {110000}, D_3 = {011000}, D_4 = {001100}, D_5 = {000110}, D_6 = {000011},
D'_2 = {100000}, D'_3 = {010000}, D'_4 = {001000}, D'_5 = {000100}, D'_6 = {000010}.

Clearly, max_i |D'_i| = 1 and from (30) we obtain that 2^l − 1 ≥ 1 and, hence, it is enough to put l = 1. Recall that it means that there will be one check symbol and 5 information ones and, besides, λ will take values in {0,1}.

Now we can find λ. According to (31) we obtain λ(e_1) = 1. Then, based on (32), we calculate all the rest of the values of λ as follows: λ(e_2) should be chosen from the set {0,1} \ {1} = {0}. So, λ(e_2) = 0. Analogously, λ(e_3) = 1, λ(e_4) = 0, λ(e_5) = 1, λ(e_6) = 0. Or, to put it shortly, λ(e_even) = 0, λ(e_odd) = 1. From (34) we can see that λ(d) = 1 for any distortion d ∈ D and, hence, any distortion from this set is detected.

Now we can finish the description of the code. We know that l = 1 and, hence, the first message symbol x_1 is a check symbol, while x_2...x_6 are information ones. Suppose that the information symbols are 11001. The encoder forms the word x* = 011001, calculates λ(x*) = 1 and, hence, x = 111001.
If no error occurs, then λ(x) = 0 and the receiver obtains the information symbols 11001. If a distortion d occurs (say, d = 011000), the receiver obtains the word y = x ⊕ d = 100001, calculates λ(100001) = 1 and sees that the message was corrupted during the transmission.

B. Error-correction.
In this part we describe an algorithm for calculating a hash function λ that gives a possibility to correct any distortion d ∈ D, where D is a given subset of {0,1}^L. We say that the system corrects distortions from D if

λ(x) = 00...0 for any input message x, and all λ(d), d ∈ D, are different and non-equal to 00...0 (hence, λ(x ⊕ d) = λ(d) ≠ 00...0). (35)

Note that this property gives a possibility to find d and the original message x = y ⊕ d.

To describe the algorithm for constructing λ, we will define some auxiliary variables. For any word x_1...x_L and 1 ≤ i ≤ L we define x|_i = x_1 x_2 ... x_i 0...0 and let

D+ = D ∪ {00...0},
G_i = { d|_i : d ∈ D+ }, H_i = G_i \ G_{i−1} (we put H_1 = G_1),
F_i = { all f for which there exist g ∈ G_{i−1} and h ∈ H_i such that f = g ⊕ h ⊕ e_i^L }. (36)

Input.
A message length L and a set of possible distortions D ⊂ {0,1}^L \ {00...0}.

Output.
An integer l such that

2^l − 1 ≥ max_{i=1,...,L} |F_i| (37)

and a linear hash function λ : {0,1}^L → {0,1}^l for which

all λ(d), d ∈ D, are different and non-equal to 00...0, (38)

see (35). If l in (37) is not defined or l ≥ L, the algorithm stops and answers that the solution does not exist.

The algorithm.

First step. Define

λ(e_1^L) = e_1^l, λ(e_2^L) = e_2^l, ..., λ(e_l^L) = e_l^l. (39)

Second step.
For i = l+1, l+2, ..., L define

λ(e_i^L) = v, where v is any word from {0,1}^l \ λ(F_i). (40)

Note that λ(e_j^L), j = 1, 2, ..., i−1, are already defined when λ(F_i) is calculated, see (39) and (40).

The key property of the described algorithm is the following

Theorem 7.
All λ(u), u ∈ D+, are different. (41)

Proof.
We prove this by induction on i for G_i, i = 1, ..., L, where G_L = D+. For i = 1, ..., l the property (41) follows from (39), because λ(x|_l) = x_1 x_2 ... x_l for any x = x_1...x_L. Suppose that (41) is proven for G_i, and let us prove it for G_{i+1}. Let u, v ∈ G_{i+1}. We need to show that λ(u) ≠ λ(v). There are the following three possibilities:

i) u, v ∈ G_i. Then λ(u) ≠ λ(v), because it is proven for G_i.

ii) u, v ∈ G_{i+1} \ G_i (= H_{i+1}). In this case u ⊕ e_{i+1}^L ∈ G_i and v ⊕ e_{i+1}^L ∈ G_i (i.e. both belong to G_i) and, hence, λ(u ⊕ e_{i+1}^L) ≠ λ(v ⊕ e_{i+1}^L). So, λ(u) ≠ λ(v).

iii) u ∈ G_i, v ∈ H_{i+1}. In this case λ(u) ⊕ λ(v) = λ(u ⊕ v ⊕ e_{i+1}^L) ⊕ λ(e_{i+1}^L). From the definition (36) we can see that u ⊕ v ⊕ e_{i+1}^L belongs to F_{i+1}. Taking into account (40), we can see that λ(e_{i+1}^L) ≠ λ(u ⊕ v ⊕ e_{i+1}^L). Hence, λ(u ⊕ v) ≠ 00...0 and λ(u) ≠ λ(v).

So, for i) - iii) the inequality λ(u) ≠ λ(v) is proven and the induction step is completed. The claim (41) is proven.
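To make the construction (36)-(40) concrete, here is a minimal Python sketch (names and the bit-tuple representation are mine; the values v chosen in step (40) may differ from those in the worked example below, since any word outside λ(F_i) is allowed):

```python
# Sketch of the correction construction (36)-(40): build the sets G_i, H_i,
# F_i and pick lam(e_i) outside lam(F_i). Names and representation are mine.
from itertools import product

def build_correction_hash(L, D):
    xor = lambda a, b: tuple(x ^ y for x, y in zip(a, b))
    zero = (0,) * L
    e = lambda i: tuple(1 if k == i - 1 else 0 for k in range(L))  # e_i^L
    Dplus = [zero] + list(D)                                       # D+
    prefix = lambda d, i: d[:i] + (0,) * (L - i)                   # d|_i
    G = {0: {zero}}
    H, F = {}, {}
    for i in range(1, L + 1):                                      # sets (36)
        G[i] = {prefix(d, i) for d in Dplus}
        H[i] = G[1] if i == 1 else G[i] - G[i - 1]
        F[i] = {xor(xor(g, h), e(i)) for g in G[i - 1] for h in H[i]}
    m = max(len(F[i]) for i in range(1, L + 1))
    l = 1
    while 2 ** l - 1 < m:                                          # condition (37)
        l += 1
    lam = {i: tuple(1 if j == i - 1 else 0 for j in range(l))      # step (39)
           for i in range(1, l + 1)}
    def lam_of(word):  # linear extension: XOR of lam[i] over the 1-bits
        h = (0,) * l
        for k, b in enumerate(word):
            if b:
                h = xor(h, lam[k + 1])
        return h
    for i in range(l + 1, L + 1):                                  # step (40)
        forbidden = {lam_of(f) for f in F[i]}
        lam[i] = next(v for v in product((0, 1), repeat=l) if v not in forbidden)
    return l, lam_of

# The distortion set (34): two adjacent bits of a 6-bit word may flip.
D = [(0,0,0,0,1,1), (0,0,0,1,1,0), (0,0,1,1,0,0), (0,1,1,0,0,0), (1,1,0,0,0,0)]
l, lam_of = build_correction_hash(6, D)
print(l)                              # -> 4 check symbols
print(len({lam_of(d) for d in D}))    # -> 5: all syndromes are distinct
```

By Theorem 7, all λ(d) are distinct and nonzero, so each syndrome λ(y) identifies the distortion d uniquely.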
Now we can describe the methods for encoding and decoding. The encoding coincides withthe method for error-detection. Namely, the positions , ..., l are used for check symbols, whilethe other L − l are used for information symbols. The encoder first puts the information symbolsinto positions { , ..., L } and 0’s into positions , ..., l . Denote the obtained word x ∗ andcalculate λ ( x ∗ ) = w ...w l . Then put letters w ...w l into the check positions , ..., l and denotethe obtained word by x = x ...x L . It should be clear that λ ( x ) = 00 ... . Indeed, λ ( x ) = λ ( x ∗ ) ⊕ λ ( x ⊕ x ∗ ) = w ...w l ⊕ (( e l × w ) ⊕ ( e l × w ) ⊕ ( e ll × w l )) = w ...w l ⊕ w ...w l = 00 ... . (Here we used the definition (39).) The decoder calculates λ ( y ) for the received (or stored) y .If λ ( y ) = 00 ... , then no error occurred, otherwise a distortion d has occurred, for which λ ( d )= λ ( y ) (and, hence y ⊕ d is the original message).From the property (41) we can see that the described method is correct.Let us consider the complexity of the proposed method. There are three important charac-teristics: the encoding time ( t enc ), decoding one ( t dec ) and the construction time of the hashfunction in accordance with the described algorithm ( T ). It is clear that t enc = O ( L log L ) and t dec = O ( | D | L log L ) . Basing on (IV-B) and (40) we can obtain an estimate T = O ( | D | log L ) .Let us consider an example. Let the set of possible distortions D be (34). Then, from (IV-B)we obtain D + = { , , , , , } ,G = { , } , G = { , , } ,G = { , , , } , G = { , , , , } ,G = { , , , , , } , G = D + ,H = G = { , } , H = { , } , H = { , } ,H = { , } , H = { , } , H = { } ,F = { , } , F = { , } , F = { , , , } ,F = { , , , , , } , DRAFT2 F = { , , , , , , , } ,F = { , , , , , } . According to (37) we find max i =1 ,...,L | F i | = | F | = 8 and, hence, − ≥ , l = 4 . From(39) we obtain λ ( e ) = 1000 , λ ( e ) = 0100 , λ ( e ) = 0010 , λ ( e ) = 0001 . 
Then, according to (40), we calculate

{0,1}^4 \ λ(F_5) = {0,1}^4 \ {0000, 0001, 0010, 0011, 0110, 0111, 1100, 1101} = {0100, 0101, 1000, 1001, 1010, 1011, 1110, 1111}.

Any of these words can be chosen as the value of λ(e_5). So, let λ(e_5) = 1111. Analogously,

{0,1}^4 \ λ(F_6) = {0,1}^4 \ {0000, 0001, 0011, 1001, 1100, 1111} = {0010, 0100, 0101, 0110, 0111, 1000, 1010, 1011, 1101, 1110}.

Thus, we can define λ(e_6) = 1000. (We can check that all λ(d), d ∈ D, are different: λ(000011) = 0111, λ(000110) = 1110, λ(001100) = 0011, λ(011000) = 0110, λ(110000) = 1100.) So, a linear hash function has been constructed; the number of information symbols is L − l = 6 − 4 = 2, the number of check symbols is l = 4. Suppose that the information symbols are 10. Then, according to the encoding method, x* = 000010, λ(x*) = 1111, x = 111110. Suppose that the distortion d = 011000 has occurred. Then y = x ⊕ d = 100110 and λ(y) = 0110. Note that λ(011000) = 0110; thus, λ(y) = λ(011000). It means that the decoder has found the distortion d = 011000 and can find x = y ⊕ d = 100110 ⊕ 011000 = 111110. So, the error is corrected.
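The worked example can also be traced end to end in a few lines of Python; the λ values below are exactly the ones chosen above, while the integer encoding of bit words and the helper names are mine:

```python
# The worked example above, end to end: the hash values chosen in the text,
# encoding of the information word 10, and decoding of a corrupted message.
LAM = {1: 0b1000, 2: 0b0100, 3: 0b0010, 4: 0b0001, 5: 0b1111, 6: 0b1000}
L, l = 6, 4
D = [0b000011, 0b000110, 0b001100, 0b011000, 0b110000]   # distortions (34)

def lam_of(word):
    """Linear hash: XOR of LAM[i] over the 1-bits of a 6-bit word."""
    h = 0
    for i in range(1, L + 1):
        if word & (1 << (L - i)):   # bit x_i, counted from the left
            h ^= LAM[i]
    return h

def encode(info):                   # info occupies positions l+1..L
    x_star = info                   # check positions 1..l hold 0's
    w = lam_of(x_star)
    return (w << (L - l)) | info    # put w into positions 1..l

def decode(y):
    s = lam_of(y)                   # syndrome
    if s == 0:
        return y                    # no error occurred
    d = next(d for d in D if lam_of(d) == s)
    return y ^ d                    # corrected message

x = encode(0b10)                    # information symbols "10"
y = x ^ 0b011000                    # distortion d = 011000
print(format(x, "06b"), format(y, "06b"), format(decode(y), "06b"))
```

Running it prints 111110 100110 111110: the codeword, the corrupted word, and the corrected word, matching the calculation above.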
V. CONCLUSION
In this paper we have shown how linear hash functions can be used for error detection and error correction. It turns out that it is possible to build error detection and correction codes for any possible set of distortions.

The case when the number of errors does not exceed a predetermined value is discussed in more detail. We consider a method whose performance is slightly better than the Varshamov-Gilbert bound [2]. In addition, we propose a randomized algorithm whose performance is close to this bound, but whose construction and encoding times are close to linear.

ACKNOWLEDGMENT
This work was supported by the Russian Foundation for Basic Research (grant 18-29-03005).

REFERENCES

[1] S. Lin and D. J. Costello, Error Control Coding. Prentice Hall, 2001.
[2] V. Pless and W. C. Huffman, Fundamentals of Error-Correcting Codes. 2003.
[3] T. Klove and V. Korzhik, Error Detecting Codes: General Theory and Their Application in Feedback Communication Systems, Vol. 335. Springer Science & Business Media, 2012.
[4] W. W. Peterson and D. T. Brown, "Cyclic codes for error detection," Proceedings of the IRE, vol. 49, no. 1, pp. 228-235, 1961.
[5] T. Jiang and A. Vardy, "Asymptotic improvement of the Gilbert-Varshamov bound on the size of binary codes," IEEE Transactions on Information Theory, vol. 50, no. 8, pp. 1655-1664, Aug. 2004.
[6] P. Gaborit and G. Zemor, "Asymptotic improvement of the Gilbert-Varshamov bound for linear codes," IEEE Transactions on Information Theory, vol. 54, no. 9, pp. 3865-3872, Sept. 2008.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley.