Energy and Sampling Constrained Asynchronous Communication
Aslan Tchamkerten, Venkat Chandar, and Giuseppe Caire
Abstract—The minimum energy, and, more generally, the minimum cost, to transmit one bit of information has been recently derived for bursty communication when information is available infrequently at random times at the transmitter. This result assumes that the receiver is always in the listening mode and samples all channel outputs until it makes a decision. If the receiver is constrained to sample only a fraction ρ ∈ (0, 1) of the channel outputs, what is the cost penalty due to sparse output sampling? Remarkably, there is no penalty: regardless of ρ > 0, the asynchronous capacity per unit cost is the same as under full sampling, i.e., when ρ = 1. Moreover, there is not even a penalty in terms of decoding delay, the elapsed time between when information is available until when it is decoded. This latter result relies on the possibility to sample adaptively; the next sample can be chosen as a function of past samples. Under non-adaptive sampling, it is possible to achieve the full sampling asynchronous capacity per unit cost, but the decoding delay gets multiplied by 1/ρ. Therefore adaptive sampling strategies are of particular interest in the very sparse sampling regime.

Index Terms—Asynchronous communication; bursty communication; capacity per unit cost; energy; error exponents; hypothesis testing; sequential decoding; sensor networks; sparse communication; sparse sampling; synchronization
I. INTRODUCTION

In many emerging technologies, communication is sparse and asynchronous, but it is essential that when data is available, it is delivered to the destination as timely and reliably as possible. Examples are sensor networks monitoring rare but critical events, such as earthquakes, forest fires, or epileptic seizures. For such settings, [1] characterized the asynchronous capacity per unit cost based on the following model.
This work was supported in part by an Excellence Chair Grant from the French National Research Agency (ACE project). A. Tchamkerten is with the Department of Communications and Electronics, Telecom ParisTech, 75634 Paris Cedex 13, France. Email: [email protected]. V. Chandar is with MIT Lincoln Laboratory, Lexington, MA 02420, USA. Email: [email protected]. G. Caire is with the Viterbi School of Engineering, University of Southern California, Los Angeles, USA. Email: [email protected].
There are B bits of information that are made available to the transmitter at some random time ν, and need to be communicated to the receiver. The B bits are coded and transmitted over a memoryless channel using a sequence of symbols that have costs associated with them. The rate R per unit cost is the total number of bits divided by the cost of the transmitted sequence. Asynchronism is captured here by the fact that the random time ν is not known a priori to the receiver. However, both transmitter and receiver know that ν is distributed (e.g., uniformly) over a time horizon [1, . . . , A]. At all times before and after the actual transmission, the receiver observes "pure noise." The noise distribution corresponds to a special input "idle symbol" ⋆ being sent across the channel (for example, in the case of a Gaussian channel, this would be the 0, i.e., no transmit signal). The goal of the receiver is to reliably decode the information bits by sequentially observing the outputs of the channel.

A main result in [1] is a single-letter characterization of the asynchronous capacity per unit cost C(β), where

β ≜ (log A)/B

denotes the timing uncertainty per information bit. While this result holds for arbitrary discrete memoryless channels and arbitrary input costs, the underlying model assumes that the receiver is always in the listening mode: every channel output is observed until decoding happens. What happens when the receiver is constrained to observe a fraction 0 < ρ ≤ 1 of the channel outputs?

In this paper, it is shown that the asynchronous capacity per unit cost is not impacted by sparse output sampling. More specifically, the asynchronous capacity per unit cost satisfies

C(β, ρ) = C(β, 1)

for any asynchronism level β > 0 and sampling frequency 0 < ρ ≤ 1. Moreover, the decoding delay is minimal: the elapsed time between when information starts being sent and when it is decoded is the same as under full sampling. This result uses the possibility for the receiver to sample adaptively: the next sample can be chosen as a function of past observed samples. In fact, under non-adaptive sampling, it is still possible to achieve the full sampling asynchronous capacity per unit cost, but the decoding delay gets multiplied by a factor 1/ρ or (1 + ρ)/ρ depending on whether or not ⋆ can be used for code design. Therefore, adaptive sampling strategies are of particular interest in the very sparse regime.

We end this section with a brief review of studies related to the above communication model. This model was introduced in [2], [3]. Both of these works focused mainly on the synchronization threshold, the largest level of asynchronism under which it is still possible to communicate reliably. In [3], [4] communication rate is defined with respect to the decoding delay, the expected elapsed time between when information is available and when it is decoded. Capacity upper and lower bounds are established and shown to be tight for certain channels. In [4] it is also shown that so-called training-based schemes, where synchronization and information transmission use separate degrees of freedom, need not be optimal, in particular in the high rate regime.

The finite message regime has been investigated by Polyanskiy in [5] when capacity is defined with respect to the codeword length, i.e., the same setting as [1] but with unit cost per transmitted symbol. A main result in [5] is that dispersion, a fundamental quantity that relates rate and error probability in the finite block length regime, is unaffected by the lack of synchronization.
Whether or not this remains true under sparse output sampling is an interesting open issue. Note that the seemingly similar notions of rates investigated in [3], [4] and [1], [5] are in fact very different. In particular, capacity with respect to the expected decoding delay remains in general an open problem.

A "slotted" version of the above communication model was considered in [6] by Wang, Chandar, and Wornell, where communication can happen only in one of a number of consecutive slots of the size of a codeword. For this model, the authors investigated the tradeoff between the false-alarm event (the decoder declares a message before it is even sent) and the miss event (the decoder misses the sent codeword).

The previous works consider point-to-point communication. A (diamond) network configuration was recently investigated by Shomorony, Etkin, Parvaresh, and Avestimehr in [7], who provided bounds on the minimum energy needed to convey one bit of information across the network.

In the above models, although communication is bursty, information transmission is contiguous since it always lasts the codeword duration. A complementary setup proposed by Khoshnevisan and Laneman [8] considers a bursty communication scenario caused by an intermittent codeword transmission. This model can be seen as a slotted variation of the purely insertion channel model, the latter being a particular case of the general insertion, deletion, and substitution channel introduced by Dobrushin [9].

This paper is organized as follows. Section II contains some background material and extends the model developed in [1] to allow for sparse output sampling. Section III contains the main results and briefly discusses extensions to a decoder-universal setting and to a multiple access setup. Finally, Section IV is devoted to the proofs.

II. MODEL AND PERFORMANCE CRITERION
The asynchronous communication model we consider captures the following general features:
• Information is available at the transmitter at a random time;
• The transmitter can choose when to start sending information based on when information is available and based on what message needs to be transmitted;
• There is a cost associated to each channel input;
• Outside the information transmission period the transmitter stays idle and the receiver observes noise;
• The decoder is sampling constrained and can observe only a fraction of the channel outputs;
• Without knowing a priori when information is available, the decoder should decode reliably and as early as possible, on a sequential basis.

The model is now specified. Communication is discrete-time and carried over a discrete memoryless channel characterized by its finite input and output alphabets X ∪ {⋆} and Y, respectively, and transition probability matrix Q(y|x), for all y ∈ Y and x ∈ X ∪ {⋆}. The alphabet X may or may not include ⋆. Without loss of generality, we assume that for all y ∈ Y there is some x ∈ X ∪ {⋆} for which Q(y|x) > 0.

Given B ≥ 1 information bits to be transmitted, a codebook C consists of M = 2^B codewords of length n ≥ 1 composed of symbols from X.

A randomly and uniformly chosen message m arrives at the transmitter at a random time ν, independent of m, and uniformly distributed over [1, . . . , A], where the integer

A = 2^{βB}

characterizes the asynchronism level between the transmitter and the receiver, and where the constant β ≥ 0 denotes the timing uncertainty per information bit, see Fig. 1.

[Fig. 1. Time representation of what is sent (upper arrow) and what is received (lower arrow). The "⋆" represents the "idle" symbol. Message m arrives at time ν and starts being sent at time σ. The receiver samples at the (random) times S_1, S_2, . . . and decodes at time S_τ based on τ output samples.]

We consider one-shot communication, i.e., only one message arrives over the period [1, 2, . . . , A]. If A = 1, the channel is said to be synchronous.

Given ν and m, the transmitter chooses a time σ(ν, m) to start sending codeword c^n(m) ∈ C assigned to message m. Transmission cannot start before the message arrives or after the end of the uncertainty window, hence σ(ν, m) must satisfy ν ≤ σ(ν, m) ≤ A almost surely. In the rest of the paper, we suppress the arguments ν and m of σ when these arguments are clear from context.

Before and after the codeword transmission, i.e., before time σ and after time σ + n − 1, the receiver observes "pure noise." Specifically, conditioned on the event {ν = t}, t ∈ {1, . . . , A}, and on the message to be conveyed m, the receiver observes independent channel outputs Y_1, Y_2, . . . , Y_{A+n−1} distributed as follows. For 1 ≤ i ≤ σ(t, m) − 1 or σ(t, m) + n ≤ i ≤ A + n − 1, the Y_i's are "pure noise" symbols, i.e.,

Y_i ∼ Q(·|⋆).

For σ ≤ i ≤ σ + n − 1,

Y_i ∼ Q(·|c_{i−σ+1}(m))

where c_i(m) denotes the i-th symbol of the codeword c^n(m).

The receiver operates according to a sampling strategy and a sequential decoder.
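As an aside, the output process just described is straightforward to simulate. The sketch below is a minimal illustration, with a made-up binary-output channel and parameters; the distributions and the choice σ = ν are assumptions for the example, not taken from the paper.

```python
import random

random.seed(0)

# Output process of Section II: pure noise Q(.|*) everywhere except
# during the codeword. Binary output alphabet; values illustrative.
Q = {'*': 0.1, '0': 0.2, '1': 0.9}   # Q(Y=1 | x) for each input x

def channel_outputs(codeword, A, n):
    nu = random.randint(1, A)        # arrival time, uniform over [1, A]
    sigma = nu                       # here the transmitter starts at once
    ys = []
    for t in range(1, A + n):        # t = 1, ..., A + n - 1
        x = codeword[t - sigma] if sigma <= t < sigma + n else '*'
        ys.append(1 if random.random() < Q[x] else 0)
    return nu, ys

nu, ys = channel_outputs("0110", A=50, n=4)
print(nu, ''.join(map(str, ys)))
```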
A sampling strategy consists of "sampling times" which are defined as an ordered collection of random time indices

S = {(S_1, . . . , S_ℓ) ⊆ {1, . . . , A + n − 1} : S_i < S_j, i < j}

where S_j is interpreted as the j-th sampling time. The sampling strategy is either non-adaptive or adaptive. It is non-adaptive when the sampling times given by S are all known before communication starts, hence S is independent of Y^{A+n−1}. The strategy is adaptive when the sampling times are a function of past observations. This means that S_1 is an arbitrary value in {1, . . . , A + n − 1}, possibly random but independent of Y^{A+n−1}, and, for j ≥ 2, S_j = g_j({Y_{S_i}}_{i<j}) for some functions g_j.
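Operationally, a non-adaptive strategy fixes its sampling times up front, while an adaptive one computes S_j from the previously observed samples. A minimal sketch of the two interfaces; the concrete densification rule is an illustrative assumption, not the paper's scheme:

```python
def nonadaptive_times(horizon, rho):
    """Sampling times fixed before communication starts."""
    return [t for t in range(1, horizon + 1) if t % round(1 / rho) == 0]

def adaptive_times(ys, rho):
    """S_1 arbitrary; S_j = g_j(past samples). Toy rule: sample every
    1/rho steps, but sample the very next output after seeing a 1."""
    times, t = [], round(1 / rho)
    while t <= len(ys):
        times.append(t)
        t += 1 if ys[t - 1] == 1 else round(1 / rho)
    return times

ys = [0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(nonadaptive_times(len(ys), rho=0.25))   # [4, 8, 12]
print(adaptive_times(ys, rho=0.25))           # densifies after a 1: [4, 5, 6, 10]
```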
Definition 5 (Asynchronous Capacity per Unit Cost under Sampling Constraint). R is an achievable rate per unit cost at timing uncertainty per information bit β and sampling frequency ρ if there exists a sequence of codes {C_B} and a sequence of positive numbers ε_B with ε_B → 0 as B → ∞ such that for all B large enough
1) C_B operates at timing uncertainty per information bit β;
2) the maximum error probability P(E|C_B) is at most ε_B;
3) the rate per unit cost B/K(C_B) is at least R − ε_B;
4) the sampling frequency satisfies ρ(C_B, ε_B) ≤ ρ + ε_B;
5) the delay satisfies (1/B) log(d(C_B, ε_B)) ≤ ε_B.
Notice that the last requirement asks for a subexponential delay.

The asynchronous capacity per unit cost, denoted by C(β, ρ), is the supremum of achievable rates per unit cost. Two basic observations:
• C(β, ρ) is a non-increasing function of β for fixed ρ;
• C(β, ρ) is a non-decreasing function of ρ for fixed β.
In particular, for any fixed β ≥ 0 and ρ ≥ 1,

C(β, ρ) = C(β, 1).

Capacity per unit cost under full sampling C(β, 1) is characterized in the following theorem:
Theorem 1 ([1, Theorem 1]). For any β ≥ 0,

C(β, 1) = max_X min { I(X;Y) / E[k(X)] , (I(X;Y) + D(Y‖Y_⋆)) / (E[k(X)](1 + β)) }   (2)

where max_X denotes maximization with respect to the channel input distribution P_X, where (X, Y) ∼ P_X(·)Q(·|·), where Y_⋆ denotes the random output of the channel when the idle symbol ⋆ is transmitted (i.e., Y_⋆ ∼ Q(·|⋆), so that Y_⋆ can be interpreted as "pure noise"), where I(X;Y) denotes the mutual information between X and Y, and where D(Y‖Y_⋆) denotes the divergence (Kullback-Leibler distance) between the distributions of Y and Y_⋆.

Let P_{X*} be a capacity per unit cost achieving input distribution, i.e., X* achieves the maximum in (2). As shown in the converse of the proof of [1, Theorem 1], codes that achieve the capacity per unit cost can be restricted to codes of (asymptotically) constant composition P_{X*}. Specifically, we have

B / (n_B(P_{X*}) E[k(X*)]) = C(β, 1)(1 − o(1))   (B → ∞)

where n_B(P_{X*}) denotes the length of the P_{X*}-constant composition codes achieving C(β, 1). Now define

n*_B ≜ min_{P_{X*}} n_B(P_{X*}) = min_{X ∈ P} B / (C(β, 1) E[k(X)])

where

P ≜ {X : X achieves the maximum in (2)}.

From the achievability and converse of [1, Theorem 1], {n*_B} represent the smallest achievable delays for codes {C_B} achieving the asynchronous capacity per unit cost under full sampling C(β, 1), in the sense that

d(C_B, ε_B) ≥ n*_B (1 − o(1))   (B → ∞)

for any ε_B → 0 as B → ∞.

Our results, stated in the next section, say that the capacity per unit cost under sampling frequency 0 < ρ < 1 is the same as under full sampling, i.e., ρ = 1. To achieve this, non-adaptive sampling is sufficient. However, if we also want to achieve minimum delay, then adaptive sampling is necessary. In fact, non-adaptive sampling strategies that achieve capacity per unit cost have a delay that grows at least as n*_B/ρ or n*_B(1 + ρ)/ρ depending on whether or not ⋆ ∈ X.

We end this section with a few notational conventions. Throughout the paper, log is always to the base 2. We use P_X to denote the set of distributions over the finite alphabet X. Recall that the type of a string x^n ∈ X^n, denoted by P̂_{x^n}, is the probability distribution over X that assigns to each a ∈ X the number of occurrences of a within x^n divided by n [10, Chapter 1.2]. For instance, if x = (0, 0, 1), then P̂_x(0) = 2/3 and P̂_x(1) = 1/3. The joint type P̂_{x^n, y^n} induced by a pair of strings x^n ∈ X^n, y^n ∈ Y^n is defined similarly. The set of strings of length n that have type P is denoted by T^n_P. The set of all types over X of strings of length n is denoted by P_{X^n}. Finally, we use poly(·) to denote a function that does not grow or decay faster than polynomially in its argument. Throughout the paper we use the standard "big-O" Landau notation to characterize growth rates (see, e.g., [11, Chapter 3]).
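To make the single-letter expression (2) concrete, the following sketch evaluates it numerically by a grid search over input distributions. The channel, idle-symbol distribution, and costs are made-up illustrative values, not taken from the paper.

```python
import numpy as np

# Numerical evaluation of (2) for a toy channel (values are made up).
# Inputs: idle symbol '*' plus X = {0, 1}; binary output alphabet.
Q = {'*': np.array([0.9, 0.1]),   # Q(.|*): pure-noise output distribution
     '0': np.array([0.8, 0.2]),
     '1': np.array([0.1, 0.9])}
k = {'0': 1.0, '1': 2.0}          # input costs k(x), illustrative

def C_full_sampling(beta, grid=1000):
    """max over P_X of min{I/E[k], (I+D)/(E[k](1+beta))}, eq. (2)."""
    y_star = Q['*']
    best = 0.0
    for p in np.linspace(1e-6, 1 - 1e-6, grid):  # P_X(0) = p, P_X(1) = 1-p
        px = {'0': p, '1': 1 - p}
        y = p * Q['0'] + (1 - p) * Q['1']        # output distribution of Y
        I = sum(px[x] * Q[x] @ np.log2(Q[x] / y) for x in px)  # I(X;Y)
        D = y @ np.log2(y / y_star)              # D(Y || Y_star)
        Ek = sum(px[x] * k[x] for x in px)       # E[k(X)]
        best = max(best, min(I / Ek, (I + D) / (Ek * (1 + beta))))
    return best

print(C_full_sampling(beta=0.5))
```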
III. RESULTS

In the sequel we denote by C_a(β, ρ) and C_na(β, ρ) the capacity per unit cost when restricted to adaptive and non-adaptive sampling, respectively. Our first result characterizes the capacity per unit cost under non-adaptive sampling.

Theorem 2 (Non-adaptive sampling). Under non-adaptive sampling it is possible to achieve the full-sampling capacity per unit cost, i.e.,

C_na(β, ρ) = C(β, 1)

for any β > 0, ρ > 0. Furthermore, codes {C_B} that achieve rate γC(β, 1), 0 ≤ γ ≤ 1, satisfy

lim_{γ→1} liminf_{B→∞} d(C_B, ε_B)/n*_B ≥ 1/ρ

when ⋆ ∈ X, and satisfy

lim_{γ→1} liminf_{B→∞} d(C_B, ε_B)/n*_B ≥ (1 + ρ)/ρ

when ⋆ ∉ X. Finally, the above delay bounds are tight: for any ε > 0 and γ close enough to 1 there exist {C_B} and ε_B → 0 as B → ∞ such that

liminf_{B→∞} d(C_B, ε_B)/n*_B ≤ 1/ρ + ε

for the case ⋆ ∈ X, and similarly for the case ⋆ ∉ X.

Hence, even with a negligible fraction of the channel outputs it is possible to achieve the full-sampling capacity per unit cost. However, this comes at the expense of delay, which gets multiplied by a factor 1/ρ or (1 + ρ)/ρ depending on whether or not ⋆ can be used for code design. This disadvantage is overcome by adaptive sampling:

Theorem 3 (Adaptive sampling). Under adaptive sampling it is possible to achieve the full-sampling capacity per unit cost, i.e.,

C_a(β, ρ) = C(β, 1)

for any β > 0, ρ > 0. Moreover, there exist {C_B} and ε_B → 0 as B → ∞ such that

d(C_B, ε_B) = n*_B(1 + o(1)).

The first part of Theorem 3 immediately follows from the first part of Theorem 2 since the set of adaptive sampling strategies includes the set of non-adaptive sampling strategies. The interesting part of Theorem 3 is that adaptive sampling strategies guarantee minimal delay regardless of the sampling rate ρ, as long as it is non-zero.

What is an optimal adaptive sampling strategy? Intuitively, such a strategy should sample sparsely, with a sampling frequency of no more than ρ, under pure noise, for otherwise the sampling constraint is violated. It should also sample the entire sent codeword, and so densely sample during message transmission, for otherwise a rate per unit cost penalty is incurred. The main characteristic of a good adaptive sampling strategy is the criterion under which the sampling mode switches from sparse to dense. If the criterion is too conservative, i.e., if the probability of switching under pure noise is too high, we might sample only part of the codeword, thereby incurring a cost loss. By contrast, if this probability is too low, we might not be able to accommodate the desired sampling frequency.

The proposed asymptotically optimal sampling/decoding strategy operates as follows; details are deferred to the proof of Theorem 3. The strategy starts in the sparse mode, taking samples at times S_j = ⌈j/ρ⌉, j = 1, 2, . . . . At each S_j, the receiver computes the empirical distribution (or type) of the last log(n) samples. If the probability of observing this type under pure noise is greater than 1/n, the mode is kept unchanged and we repeat this test at the next round j + 1. Instead, if it is smaller than 1/n, then we switch to the dense sampling mode, taking samples continuously for at most n time steps. At each of these steps the receiver applies a standard typicality decoding based on the past n output samples. If no codeword is typical with the channel outputs after these n time steps, sampling is switched back to the sparse mode.
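The mode-switching rule can be illustrated in simulation. In the sketch below, the channel, window size, and constants are illustrative assumptions; the type-class probability under pure noise is evaluated via its large-deviations exponent (cf. Fact 2 in Section IV), so "P_⋆(T) ≤ 1/n" is tested as an exponent comparison.

```python
import math
import random

random.seed(0)

# Sparse-to-dense switching: at each sparse sample, test whether the
# type of the recent window is unlikely under pure noise. Illustrative.
P_NOISE = [0.9, 0.1]    # Q(.|*)
P_CODE  = [0.3, 0.7]    # output distribution while the codeword is sent

def kl(p, q):
    """D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def unlikely_under_noise(window, n):
    """By Fact 2, P*(T_type) = 2^(-l D(type||Q*)) up to poly(l), so
    'P*(T) <= 1/n' is tested as l * D(type||Q*) >= log2(n), roughly."""
    l = len(window)
    phat = [window.count(0) / l, window.count(1) / l]
    return l * kl(phat, P_NOISE) >= math.log2(n)

n, rho = 512, 0.05
nu = random.randrange(5000, 20000)      # message arrival time
w = max(4, int(math.log2(n)))           # window: ~log(n) samples

window, j = [], 1
while True:
    t = math.ceil(j / rho)              # sparse-mode sampling times
    dist = P_CODE if nu <= t < nu + n else P_NOISE
    window = (window + [0 if random.random() < dist[0] else 1])[-w:]
    if len(window) == w and unlikely_under_noise(window, n):
        print(f"switch to dense mode at t={t} (codeword starts at {nu})")
        break
    j += 1
```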
As it turns out, the threshold 1/n can be replaced by any decreasing function of n that decreases at least as fast as 1/n but not faster than polynomially in n.

We end this section by considering the specific case when β = 0, i.e., when the channel is synchronous. For a given sampling frequency ρ, the receiver gets to see only a fraction ρ of the transmitted codeword (whether sampling is adaptive or non-adaptive) and hence

C(0, ρ) = ρ C(0, 1)

for any ρ ≥ 0. How is it possible that sparse output sampling induces a rate per unit cost loss for synchronous communication (β = 0), but not for asynchronous communication (β > 0), as we saw in Theorems 2 and 3? The reason for this is that when β > 0, the level of asynchronism is exponential in B. Therefore, even if the receiver is constrained to sample only a fraction ρ of the channel outputs, it may still occasionally sample fully over, say, Θ(B) channel outputs, and still satisfy the overall constraint that the fraction shouldn't exceed ρ. (If over a long trip we have a high-mileage drive, we can still push the car a few times without impacting the overall mileage.)
Remark 1. Theorems 2 and 3 remain valid under universal decoding, i.e., the only element from the channel that the decoder needs to know is its output alphabet Y. This is briefly discussed at the end of Section IV.
Remark 2. Consider a multiple access generalization of the point-to-point setting where, instead of one transmitter, there are

U = 2^{υB}

transmitters who communicate to a common receiver, where υ, 0 ≤ υ ≤ β, denotes the occupation parameter of the channel. The message arrival times {ν_1, ν_2, . . . , ν_U} at the transmitters are jointly independent and uniformly distributed over [1, . . . , A] with A = 2^{βB} as before. Communication takes place as in the previous point-to-point case, each user uses the same codebook, and transmissions start at the times {σ_1, σ_2, . . . , σ_U}. Whenever a user tries to access the channel while it is occupied, the channel outputs random symbols, independent of the input (collision model). The receiver operates sequentially and declares U messages at the times

S_{τ_1}, S_{τ_1+τ_2}, . . . , S_{τ_1+τ_2+···+τ_U}

where stopping time τ_i, 1 ≤ i ≤ U, is with respect to the output samples Y_{S_{τ_1+···+τ_{i−1}+1}}, Y_{S_{τ_1+···+τ_{i−1}+2}}, . . . .

It is easy to check (say, from the Birthday problem [12]) that if υ < β/2, and hence

U = o(√A) = o(2^{βB/2}),

the collision probability goes to zero as B → ∞. Hence in the regime of large message size, the transmitters are (essentially) operating orthogonally, and each user can achieve the point-to-point capacity per unit cost assuming a per-user error probability. We may refer to this regime as the regime of "sparse transmissions," relevant in a sensor network monitoring independent rare events.

Note that since the users use the same codebook, the receiver does not know which transmitter conveys what information. The receiver can only recognize the set of transmitted messages. If the receiver is also required to identify the messages and their transmitters, then each transmitter effectively conveys B(1 + υ) information bits and the capacity per unit cost gets multiplied by 1/(1 + υ).
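As a quick sanity check on the Birthday-problem estimate, the sketch below computes the probability that at least two of U arrival times drawn uniformly from {1, . . . , A} coincide (actual transmission overlaps behave similarly up to a factor of the codeword length); the parameter values are illustrative.

```python
import math

def collision_prob(U, A):
    """P(at least two of U uniform arrival times in {1..A} coincide)
    = 1 - prod_{i=0}^{U-1} (1 - i/A), the Birthday-problem formula."""
    log_no_collision = sum(math.log1p(-i / A) for i in range(U))
    return 1.0 - math.exp(log_no_collision)

beta, upsilon = 1.0, 0.4          # upsilon < beta/2, so U = o(sqrt(A))
for B in (20, 40, 50):
    U, A = round(2 ** (upsilon * B)), round(2 ** (beta * B))
    print(B, collision_prob(U, A))   # decays to 0 as B grows
```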
IV. ANALYSIS
The following two standard type results are often used in our analysis.
Fact 1 ([10, Lemma 1.2.2]). |P_{X^n}| = poly(n).

Fact 2 ([10, Lemma 1.2.6]). If X^n is independent and identically distributed (i.i.d.) according to P ∈ P_X, then

(1/poly(n)) · 2^{−nD(P̄‖P)} ≤ P(X^n ∈ T_{P̄}) ≤ 2^{−nD(P̄‖P)}

for any P̄ ∈ P_{X^n}.

Achievability of Theorem 2: Fix some arbitrary distribution P on X. Let X be the input having that distribution and let Y be the corresponding output, i.e., (X, Y) ∼ P(·)Q(·|·).

Given B bits of information to be transmitted, the codebook C is randomly generated as follows. For each message m = 1, . . . , M, randomly generate length-n sequences x^n i.i.d. according to P, until x^n belongs to the "constant composition" set

A_n = {x^n : ‖P̂_{x^n} − P‖ ≤ 1/log n}   (3)

where ‖·‖ refers to the L1-norm. If (3) is satisfied, then let c^n(m) = x^n and move to the next message. Stop when a codeword has been assigned to all messages. From Chebyshev's inequality, for any fixed m, no repetition will be required with high probability to generate c^n(m), i.e.,

P^n(A_n) → 1 as n → ∞   (4)

where P^n denotes the order-n product distribution of P. The obtained codewords are thus essentially of constant composition, i.e., each symbol appears roughly the same number of times, and have cost nE[k(X)](1 + o(1)) as n → ∞, where k(·) is the input cost function of the channel.

Case ⋆ ∈ X: Information transmission is as follows. For simplicity let us first assume that 1/ρ is an integer. Codeword symbols can be transmitted only at multiples of 1/ρ. Times that are integer multiples of 1/ρ are from now on referred to as transmission times. Given a message m available at time ν, the transmitter sends the corresponding codeword c^n(m) during the first n transmission times coming at time ≥ ν. In between transmission times the transmitter sends ⋆. Hence, the transmitter sends

c_1(m) ⋆ . . . ⋆ c_2(m) ⋆ . . . ⋆ c_3(m) . . . c_n(m)

starting at time

σ = σ(ν) = min{t ≥ ν : t is an integer multiple of 1/ρ}.

The receiver operates as follows. Sampling is performed only at the transmission times. At transmission time t, the decoder computes the empirical distributions

P̂_{c^n(m), y^n}(·, ·)

induced by the last n output samples y^n and all the codewords {c^n(m)}. If there is a unique message m for which

‖P̂_{c^n(m), y^n}(·, ·) − P(·)Q(·|·)‖ ≤ 1/log n,

the decoder stops and declares that message m was sent. If two (or more) codewords c^n(m) and c^n(m′) relative to two different messages m and m′ are typical with y^n, the decoder stops and declares one of the corresponding messages at random. If no codeword is typical with y^n, the decoder repeats the procedure at the next transmission time. If by the time of the last transmission time no message has been declared, the decoder outputs a random message.

We first compute the error probability averaged over codebooks and messages. Suppose message m is transmitted. The error event that the decoder declares some specific message m′ ≠ m can be decomposed as

{m → m′} = E_1 ∪ E_2,   (5)

where the error events E_1 and E_2 are defined as
• E_1: the decoder stops at a time t between σ and σ + (2n − 2)/ρ (including σ and σ + (2n − 2)/ρ) and declares m′;
• E_2: the decoder stops either at a time t before time σ or from time σ + (2n − 1)/ρ onwards and declares m′.
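The interleaving of codeword symbols with idle symbols at transmission times, with the receiver sampling only those times, can be sketched as follows. This is a toy illustration; the alphabet and parameters are assumptions for the example.

```python
import math

def transmitted_sequence(codeword, nu, horizon, rho):
    """Case with idle symbol in X: send c_i(m) at the first n
    transmission times >= nu (multiples of 1/rho), '*' elsewhere."""
    step = round(1 / rho)                    # assume 1/rho is an integer
    sigma = math.ceil(nu / step) * step      # first transmission time >= nu
    tx = {sigma + i * step: c for i, c in enumerate(codeword)}
    return [tx.get(t, '*') for t in range(1, horizon + 1)]

# The receiver samples only at transmission times t = step, 2*step, ...
seq = transmitted_sequence("0110", nu=7, horizon=30, rho=0.25)
samples = [(t, seq[t - 1]) for t in range(4, 31, 4)]
print(''.join(seq))
print(samples)   # codeword symbols appear at transmission times >= sigma
```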
Note that when event E_1 happens, the observed sequence is (at least partially) generated by the sent codeword. By contrast, when event E_2 happens, the observed sequence is generated only by pure noise.

Using analogous arguments as in the achievability of [1, Proof of Theorem 1], we obtain the upper bounds

P_m(E_1) ≤ 2^{−n(I(X;Y)−ε)}

and

P_m(E_2) ≤ A · 2^{−n(I(X;Y)+D(Y‖Y_⋆)−ε)},

which are both valid for any fixed ε > 0 provided that n is large enough. (Notice that the decoder outputs a message with probability one.) Combining, we get

P_m(m → m′) ≤ 2^{−n(I(X;Y)−ε)} + A · 2^{−n(I(X;Y)+D(Y‖Y_⋆)−ε)}.

Hence, taking a union bound over all possible wrong messages, we obtain that for all ε > 0,

P(E) ≤ 2^B (2^{−n(I(X;Y)−ε)} + A · 2^{−n(I(X;Y)+D(Y‖Y_⋆)−ε)}) ≜ ε(n)   (6)

for n large enough.

We now show that the delay of our coding scheme in the sense of Definition 4 is at most n/ρ. Suppose a specific (non-random) codeword c^n(m) ∈ A_n is sent. If τ > σ + (n − 1)/ρ, then necessarily c^n(m) is not typical with Y_σ^{σ+(n−1)/ρ}. By Sanov's theorem this happens with vanishing error probability and hence

P(τ − σ ≤ (n − 1)/ρ) = 1 − ε′(n)

with ε′(n) → 0 as n → ∞. Hence, since ν ≤ σ < ν + 1/ρ, we get

P(τ − ν ≤ n/ρ) = 1 − ε′(n).

The proof can now be concluded. From inequality (6) there exists a specific code C ⊂ A_n whose error probability, averaged over messages, is less than ε(n). Removing the half of the codewords with the highest error probability, we end up with a set C′ of 2^{B−1} codewords whose maximum error probability P(E) is such that

P(E) ≤ 2ε(n),   (7)

and whose delay satisfies d(C′, ε(n)) ≤ n/ρ.
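To see how the bound (6) forces condition (8) below, one can check numerically when both exponents are positive. The values of I(X;Y), D(Y‖Y_⋆), and β below are illustrative assumptions:

```python
# Exponent check behind (6)-(8): with A = 2^(beta*B) and r = B/n,
# P(E) <= 2^(B - n(I - eps)) + 2^(B(1 + beta) - n(I + D - eps)) -> 0
# iff r < I and r(1 + beta) < I + D (ignoring eps). Values made up.
I, D, beta = 0.5, 0.3, 1.0   # I(X;Y), D(Y||Y*), beta

def decay_exponents(r):
    """Per-bit decay exponents of the two terms in (6)."""
    return (I / r - 1, (I + D) / r - (1 + beta))

for r in (0.2, 0.39, 0.41, 0.6):
    e1, e2 = decay_exponents(r)
    verdict = "P(E) -> 0" if min(e1, e2) > 0 else "bound diverges"
    print(f"B/n = {r}: exponents {e1:+.2f}, {e2:+.2f} -> {verdict}")
```

With these numbers the threshold is B/n < min{0.5, 0.8/2} = 0.4, matching (8).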
Now fix the ratio B/n and substitute A = 2^{βB} in the definition of ε(n) (see (6)). Then, P(E) goes to zero as B → ∞ whenever

B/n < min { I(X;Y) , (I(X;Y) + D(Y‖Y_⋆))/(1 + β) }.   (8)

Recall that by construction, all the codewords have cost nE[k(X)](1 + o(1)) as n → ∞. Hence, for any η > 0 and all n large enough,

k(C′) ≤ nE[k(X)](1 + η).   (9)

Condition (8) is thus implied by the condition

B/K(C′) < min { I(X;Y) / ((1 + η)E[k(X)]) , (I(X;Y) + D(Y‖Y_⋆)) / (E[k(X)](1 + η)(1 + β)) }.   (10)

Maximizing over all input distributions and using the fact that η > 0 is arbitrary proves that C(β, 1), where C(β, 1) is defined in Theorem 1, is asymptotically achieved by non-random codes with delay no larger than n/ρ with probability approaching one as n → ∞.

Finally, if 1/ρ is not an integer, it suffices to define transmission times as

t_j = ⌊j/ρ⌋, j = 1, 2, . . . .

This guarantees the same asymptotic performance as for the case where 1/ρ is an integer.

Case ⋆ ∉ X: Parse the entire sequence {1, 2, . . . , A + n − 1} into consecutive superperiods of size n/ρ (take ⌊n/ρ⌋ if n/ρ is not an integer). The periods of duration n occurring at the end of each superperiod are referred to as transmission periods. Given ν, the codeword starts being sent over the first transmission period starting at a time > ν. In particular, if ν happens over a transmission period, then the transmitter delays the codeword transmission to the next superperiod.

The receiver sequentially samples only the transmission periods. At the end of a transmission period, the decoder computes the empirical distributions P̂_{c^n(m), y^n}(·, ·) induced by the last n output samples y^n and all the codewords {c^n(m)}. If there is a unique message m for which

‖P̂_{c^n(m), y^n}(·, ·) − P(·)Q(·|·)‖ ≤ 1/log n,

the decoder stops and declares that message m was sent. If two (or more) codewords c^n(m) and c^n(m′) relative to two different messages m and m′ are typical with y^n, the decoder stops and declares one of the corresponding messages at random. If no codeword is typical with y^n, the decoder waits for the next transmission period to occur, samples it, and repeats the decoding procedure. Similarly as for the previous case, if at the end of the last transmission period no message has been declared, the decoder outputs a random message.

Following the same arguments as for the case ⋆ ∈ X, we deduce that (10) also holds in this case and that for the delay we have

P(τ − ν ≤ n + n/ρ) = 1 − ε(n)

for some ε(n) → 0 as n → ∞. To see this, note that a superperiod has duration n/ρ and that if ν happens during a transmission period, then the actual codeword transmission is delayed to the next transmission period.
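The superperiod parsing for the case ⋆ ∉ X can be sketched as follows (illustrative parameters; non-integer n/ρ floored as in the text):

```python
def transmission_periods(horizon, n, rho):
    """Each superperiod of length n/rho ends with a transmission
    period of length n (its last n slots)."""
    sp = int(n / rho)                                # floor(n / rho)
    periods, end = [], sp
    while end <= horizon:
        periods.append(range(end - n + 1, end + 1))  # last n slots
        end += sp
    return periods

def codeword_slot(nu, periods):
    """First transmission period starting strictly after nu."""
    return next(p for p in periods if p.start > nu)

periods = transmission_periods(horizon=200, n=10, rho=0.25)
print([(p.start, p.stop - 1) for p in periods[:3]])  # (31,40), (71,80), (111,120)
slot = codeword_slot(nu=35, periods=periods)
print(slot.start, slot.stop - 1)  # nu inside a period -> pushed to 71..80
```

Here τ − ν ≈ 45 ≤ n + n/ρ = 50, consistent with the delay statement above.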
Delay Converse of Theorem 2: We consider the cases ⋆ ∈ X and ⋆ ∉ X separately.

Case ⋆ ∈ X: Pick some arbitrary 0 < ρ < 1 and β > 0 such that C(β, ρ) > 0, and 0 < γ < 1. Consider a code C_B with length-n_B codewords that achieves rate per unit cost γC(β, ρ) − ε_B > 0, maximum error probability at most ε_B, sampling frequency ρ(C_B, ε_B) ≤ ρ + ε_B, and delay d_B = d(C_B, ε_B), for some ε_B → 0 as B → ∞. The sampling strategy S is supposed to be non-adaptive, and for the moment also non-randomized.

Denote by I_γ the event that the decoder samples at least γn*_B samples of the sent codeword (recall that n*_B refers to the minimal codeword length, see Section III). Then, by the converse of [1, Theorem 1],

P_m(I_{γ′}) = 1 − o(1)   (B → ∞)   (11)

for any message m, where γ′ = γ′(γ) satisfies γ′(γ) > 0 for any γ > 0 and lim_{γ→1} γ′(γ) = 1.

Further, by our assumption on the error probability and on the delay (see Definition 4), we have for any message m

P_m(E^c_m ∩ {τ − ν ≤ d_B − 1}) → 1   (B → ∞),

where E^c_m denotes the successful decoding event. This implies that for any message m

P_m(0 ≤ τ − ν ≤ d_B − 1) → 1   (B → ∞),

since the error probability is bounded away from zero whenever τ < ν. It then follows that

P_m({0 ≤ τ − ν ≤ d_B − 1} ∩ I_{γ′}) = 1 − o(1)   (B → ∞).   (12)

Hence, since ν is uniformly distributed over {1, 2, . . . , A}, for B large enough we have

P_m({0 ≤ τ − t ≤ d_B − 1} ∩ I_{γ′} | ν = t) > 0

for at least (1 − o(1))A values of t ∈ {1, 2, . . . , A}. Now, conditioned on {ν = t}, if event {0 ≤ τ − t ≤ d_B − 1} ∩ I_{γ′} happens (i.e., with non-zero probability), then necessarily the period {t, t + 1, . . . , t + d_B − 1} contains at least γ′n*_B sampling times (here we use the fact that S is non-randomized). It then follows that

|S| ≥ ⌊(1 − o(1)) A/d_B⌋ · γ′ n*_B.   (13)

Now if

ρ d_B ≤ n*_B (1 − ε)   (14)

for some arbitrary fixed 0 < ε < 1, then

⌊(1 − o(1)) A/d_B⌋ γ′ n*_B ≥ (γ′/(1 − ε)) ρA (1 − o(1))   (15)

as B → ∞. Hence, by taking γ′ (and hence γ) close enough to 1 and by taking B large enough,

(1 − o(1)) γ′/(1 − ε) > 1.

Therefore, if (14) holds, from (13) and (15) we get

|S| ≥ ρ(1 + ε′)A   (16)

for B large enough and some ε′ > 0 such that ε′ → 0 as ε → 0. Inequality (16) implies that the sampling constraint is violated, as we now show.

Fix an arbitrary 0 < ε″ < 1. For an arbitrary integer 1 ≤ k ≤ A + n − 1 and any message m,

P_m(S_τ ≥ ρτ(1 + ε″)) ≥ P_m(S_τ ≥ ρτ(1 + ε″) | τ ≥ k) P_m(τ ≥ k)
≥ P_m(S_k ≥ ρ(A + n − 1)(1 + ε″) | τ ≥ k) P_m(τ ≥ k)
= P_m(S_k ≥ (1 + ε″)ρA(1 + o(1))) P_m(τ ≥ k)   (17)

where for the second inequality we used the fact that the sampling times S_1, S_2, . . . are non-decreasing and the fact that τ ≤ A + n − 1. We now show that both terms P_m(S_k ≥ (1 + ε″)ρA(1 + o(1))) and P_m(τ ≥ k) are bounded away from zero in the limit B → ∞, for an appropriate choice of k. This, by (17), implies that

liminf_{B→∞} P_m(S_τ ≥ ρτ(1 + ε″)) > 0,

i.e., that sampling frequency ρ is not achievable whenever (14) holds. In other words, to achieve a sampling frequency ρ it is necessary that delay and codeword length satisfy

d_B ≥ (n*_B/ρ)(1 − o(1)).

Let

k = (1 + 2ε″)ρA.   (18)
Since S_k ≥ k, we have S_k ≥ (1 + 2ε″)ρA, and so by choosing ε″ > 0 small enough we get

P(S_k ≥ (1 + ε″)ρA(1 + o(1))) = 1   (19)

for B large enough.

Since C_B achieves (maximum) error probability ≤ ε_B, we have for any message m

ε_B ≥ P_m(E) ≥ P_m(E | τ < k, ν ≥ k) P_m(τ < k, ν ≥ k) ≥ (1/2) P_m(τ < k | ν ≥ k) P(ν ≥ k) = (1/2) P_⋆(τ < k) P(ν ≥ k).   (20)

For the third inequality in (20), note that event {τ < k, ν ≥ k} ensures that the decoder stops on the basis of observations whose distribution is that of pure noise, in which case the conditional probability of error is at least 1/2.
Achievability of Theorem 3: We show that C(β, ρ) = C(β, 1) for any β > 0 and 0 < ρ ≤ 1, and that C(β, 1) can be achieved with codes {C_B} with delay d(C_B, ε_B) = n*_B(1 + o(1)) as B → ∞.

Let P be the distribution achieving C(β, 1) (see Theorem 1). We generate 2^B codewords of length n − log(n) as in the proof of Theorem 2, according to distribution P. Each codeword starts with a common preamble that consists of log(n) repetitions of a symbol x such that Q(·|x) ≠ Q(·|⋆).

For the proposed asymptotically optimal sampling/decoding strategy, it is convenient to introduce the following notation. Let Ỹ_a^b denote the random vector obtained by extracting the components of the output process Y_t at t ∈ [a, b] of the form t = ⌈j/ρ⌉ for non-negative integer j. Notice that, for any t ≥ ℓ and ℓ ≫ 1, Ỹ_{t−ℓ+1}^t contains ≈ ρℓ samples.

The strategy starts in the sparse mode, taking samples at times S_j = ⌈j/ρ⌉, j = 1, 2, . . . . At each j, the receiver computes the empirical distribution (or type)

P̂_j = P̂_{Ỹ_{S_j−log(n)+1}^{S_j}}

of the sampled output in the most recent window of length log(n). If the probability of this type under pure noise is large enough, i.e., if

P_⋆(T_{P̂_j}) > 1/n,

the mode is kept unchanged and we repeat this test at the next round j + 1. Instead, if

P_⋆(T_{P̂_j}) ≤ 1/n,

then we switch to the dense sampling mode, taking samples continuously for at most n time steps. At each of these steps the receiver applies the same sequential typicality decoder as in the proof of Theorem 2, based on the past n − log n output samples. If no codeword is typical with the channel outputs after these n time steps, sampling is switched back to the sparse mode.

We compute the error probability of the above scheme, its relative number of samples, and its delay. For the error probability, a similar analysis as for the non-adaptive case in the proof of Theorem 2 still applies, with n replaced by n − log n. In particular, after fixing the ratio B/n and thereby imposing a delay linear in B, equation (10) holds with ρ = 1.

For the relative number of samples, we now show that

P_m(τ/S_τ ≥ ρ + ε_B) → 0   (n → ∞)   (25)

with ε_B = 1/poly(B), from which we then conclude that C(β, ρ) ≥ C(β, 1). To do this, it is convenient to introduce Z_i, 1 ≤ i ≤ A + n − 1, which is equal to one if at time i the receiver switches to the dense mode and samples the next n channel outputs, and equal to zero otherwise. Then it follows that

τ ≤ ρS_τ + n Σ_{i=1}^{S_τ} Z_i.   (26)

To see this, note that the number of samples involved in the sparse mode is at most ρS_τ and that the number of samples involved in the dense mode is at most n Σ_{i=1}^{S_τ} Z_i (it is actually equal to n Σ_{i=1}^{S_τ} Z_i if we ignore the boundary discrepancy that we cannot sample beyond time A + n − 1).

From (26),

P(τ/S_τ ≥ ρ + ε) ≤ P(n Σ_{i=1}^{S_τ} Z_i ≥ S_τ ε)
≤ P(n Σ_{i=1}^{S_τ} Z_i ≥ S_τ ε, ν ≤ S_τ ≤ ν + 2n − 2) + P(S_τ < ν or S_τ > ν + 2n − 2).   (27)

We now show that the right-hand side of the second inequality in (27) vanishes as B → ∞. For the first term on the right-hand side of the second inequality in (27), since the Z_i's are nonnegative,

P(n Σ_{i=1}^{S_τ} Z_i ≥ S_τ ε; ν ≤ S_τ ≤ ν + 2n − 2) ≤ P(n Σ_{i=1}^{ν+2n−2} Z_i ≥ νε).   (28)

Now, conditioned on ν = t, the Z_i's, 1 ≤ i ≤ t − 1, are binary random variables distributed according to pure noise.
Hence,

P(n Σ_{i=1}^{t+2n−2} Z_i ≥ tε | ν = t) ≤ P_⋆(n Σ_{i=1}^{t−1} Z_i ≥ tε − (2n − 1)n)
≤ ((t − 1)/n) / (tε/n − (2n − 1) − (t − 1)/n)² = o(1)   (t → ∞)   (29)

where the second inequality follows from Chebyshev's inequality and by noting that for 1 ≤ i ≤ t − 1 we have

Var(Z_i) ≤ E[Z_i] ≤ 1/n,

since the variance of a Bernoulli random variable is at most its mean, which, in turn, is at most 1/n here. Therefore,

P(n Σ_{i=1}^{ν+2n−2} Z_i ≥ νε) ≤ P(ν ≤ √A) + (1/A) Σ_{t=⌈√A⌉+1}^{A} P(n Σ_{i=1}^{t+2n−2} Z_i ≥ tε | ν = t) = o(1)   (B → ∞)   (30)

where the last equality follows from (29) and the fact that ν is uniformly distributed over {1, 2, . . . , A} with A = 2^{βB}. From (28) and (30) we get
P(n Σ_{i=1}^{S_τ} Z_i ≥ S_τ ε; ν ≤ S_τ ≤ ν + 2n − 2) = o(1)

as B → ∞. We now show that

P(S_τ < ν or S_τ ≥ ν + 2n − 1) → 0   (B → ∞).   (31)

That P(S_τ < ν) → 0 follows from the fact that P_m(E_2) → 0, where E_2 is defined in the proof of Theorem 2. That P(S_τ ≥ ν + 2n − 1) → 0 follows from the fact that, with probability tending to one, the sampling strategy changes mode over the transmitted codeword and the typicality decoder makes a decision by time ν + n(1 + o(1)). This last argument can also be used for the delay, to show that d(C_B, ε_B) = n(1 + o(1)) for some ε_B → 0.
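The accounting behind (25) and (26) is easy to probe empirically: the sketch below runs the sparse/dense sampler over a horizon of pure noise (so dense excursions are rare false switches) and reports the empirical sampling frequency τ/S_τ. All parameters, the channel, and the schedule re-alignment after a dense excursion are illustrative assumptions.

```python
import math
import random

random.seed(1)

# Empirical check of (25)-(26): count samples of the sparse/dense
# sampler under pure noise; tau/S_tau should stay near rho.
n, rho, A = 256, 0.1, 200_000
P_NOISE = [0.9, 0.1]
w = max(4, int(math.log2(n)))          # sparse-mode window length

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

samples, t, j, window = 0, 0, 1, []    # 'samples' plays the role of tau
while t < A:
    t = math.ceil(j / rho)             # next sparse sampling time
    j += 1
    window = (window + [0 if random.random() < P_NOISE[0] else 1])[-w:]
    samples += 1
    if len(window) == w:
        phat = [window.count(0) / w, window.count(1) / w]
        if w * kl(phat, P_NOISE) >= math.log2(n):  # P*(T) <= 1/n, roughly
            samples += n               # dense mode: n consecutive samples
            t += n                     # ... advancing time by n steps
            j = int(t * rho) + 1       # re-align the sparse schedule
            window = []

print(f"tau/S_tau = {samples / t:.4f}  (target rho = {rho})")
```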
Finally, by optimizing the input distribution to minimize delay (see the paragraph after Theorem 1), we deduce that C(β, 1) = C(β, ρ) and that the capacity per unit cost is achievable with delay n*_B(1 + o(1)).

We end this section with a few words concerning Remark 1 at the end of Section III. To prove the claim, it suffices to slightly modify the achievability schemes yielding Theorems 2 and 3 to make them universal at the decoder.

The first modification is needed to estimate the pure noise distribution Q_⋆ with a negligible fraction of channel outputs. An estimate of this distribution is obtained by sampling the first √A output symbols. At the end of this estimation phase, the receiver declares the pure noise distribution as being equal to P̂_{Y^{√A}}. Note that since ν is uniformly distributed over {1, 2, . . . , A}, we have

P(‖P̂_{Y^{√A}} − Q_⋆‖ ≥ ε_B) → 0

as B → ∞, for some ε_B → 0. Note also that this estimation phase requires a negligible amount of sampling, i.e., sublinear in A.

The second modification concerns the typicality decoder, which is replaced by an MMI (Maximum Mutual Information) decoder (see [10, Chapter 2]).

It is straightforward to verify that the modified schemes indeed achieve the asynchronous capacity per unit cost. The formal arguments are similar to those used in [3, Proof of Theorem 2] (see also [5, Theorem 3], which proves the claim under full sampling and unit input cost) and are thus omitted.
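For concreteness, here is a minimal sketch of MMI decoding: the decoder picks the codeword whose empirical mutual information with the observed samples is largest, and therefore needs no knowledge of Q. The toy codebook and received string are made up for illustration.

```python
import math
from collections import Counter

def empirical_mi(xs, ys):
    """Mutual information of the joint type of (xs, ys), in bits."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mmi_decode(codebook, ys):
    """Declare the message whose codeword maximizes empirical MI."""
    return max(codebook, key=lambda m: empirical_mi(codebook[m], ys))

codebook = {0: "00110101", 1: "11100100"}   # toy binary codewords
received = "00110111"                       # noisy version of codeword 0
print(mmi_decode(codebook, received))       # -> 0
```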
REFERENCES

[1] V. Chandar, A. Tchamkerten, and D. Tse, "Asynchronous capacity per unit cost," IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1213–1226, Mar. 2013.
[2] V. Chandar, A. Tchamkerten, and G. Wornell, "Optimal sequential frame synchronization," IEEE Transactions on Information Theory, vol. 54, no. 8, pp. 3725–3728, 2008.
[3] A. Tchamkerten, V. Chandar, and G. Wornell, "Communication under strong asynchronism," IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 4508–4528, 2009.
[4] A. Tchamkerten, V. Chandar, and G. W. Wornell, "Asynchronous communication: Capacity bounds and suboptimality of training," IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1227–1255, Mar. 2013.
[5] Y. Polyanskiy, "Asynchronous communication: Exact synchronization, universality, and dispersion," IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1256–1270, Mar. 2013.
[6] D. Wang, V. Chandar, S. Chung, and G. Wornell, "Error exponents in asynchronous communication," in Proc. IEEE International Symposium on Information Theory (ISIT), 2011, pp. 1071–1075.
[7] I. Shomorony, R. Etkin, F. Parvaresh, and A. Avestimehr, "Bounds on the minimum energy-per-bit for bursty traffic in diamond networks," in Proc. IEEE International Symposium on Information Theory (ISIT), 2012, pp. 801–805.
[8] M. Khoshnevisan and J. Laneman, "Achievable rates for intermittent communication," in Proc. IEEE International Symposium on Information Theory (ISIT), Jul. 2012, pp. 1346–1350.
[9] R. L. Dobrushin, "Asymptotic bounds on the probability of error for the transmission of messages over a memoryless channel using feedback," Probl. Kibern., vol. 8, pp. 161–168, 1963.
[10] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Channels. New York: Cambridge University Press, 2011.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. MIT Press / McGraw-Hill, 2000.
[12] G. Grimmett and D. Stirzaker, Probability and Random Processes. Oxford University Press, 2001.