On the Capacity of a Class of Signal-Dependent Noise Channels
Hamid Ghourchian, Gholamali Aminian, Amin Gohari, Mahtab Mirmohseni, Masoumeh Nasiri-Kenari
Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran. E-mails: {h_ghourchian, aminian}@ee.sharif.edu, {aminzadeh, mirmohseni, mnasiri}@sharif.edu

Abstract
In some applications, the variance of additive measurement noise depends on the signal that we aim to measure. For instance, additive Gaussian signal-dependent noise (AGSDN) channel models are used in molecular and optical communication. Herein we provide lower and upper bounds on the capacity of additive signal-dependent noise (ASDN) channels. The idea of the first lower bound is an extension of the majorization inequality, while the second one uses calculations based on the fact that h(Y) > h(Y|Z). Both lower bounds are valid for all additive signal-dependent noise (ASDN) channels defined in the paper. The upper bound is based on a previous idea of the authors ("symmetric relative entropy") and applies to additive Gaussian signal-dependent noise (AGSDN) channels. These bounds indicate that in ASDN channels (unlike the classical AWGN channels), the capacity does not necessarily become larger by making the variance function of the noise smaller. We also provide sufficient conditions under which the capacity becomes infinity. This is complemented by a number of conditions that imply the capacity is finite and a unique capacity-achieving measure exists (in the sense of the output measure).

Keywords:
Signal-dependent noise channels, molecular communication, channels with infinite capacity, existence of capacity-achieving distribution.
An additive Gaussian signal-dependent noise (AGSDN) channel with input x and output y is defined by

f_{Y|X}(y|x) = (1/(√(2π)·σ(x))) · exp(−(y − x)²/(2σ(x)²)),

where σ(·) is a given function from ℝ to [0, ∞). Alternatively, we may describe the AGSDN channel by Y = X + σ(X)·Z, where Z ∼ N(0, 1) is a standard Gaussian random variable, independent of the input X. For the constant function σ(x) = c, the AGSDN channel reduces to a simple additive Gaussian channel. More generally, we may relax the Gaussian assumption on Z and consider an additive signal-dependent noise (ASDN) channel defined by

Y = X + σ(X)·Z,    (1)

where the noise Z is assumed to be a continuous random variable with a given pdf f_Z(z), independent of the input X. For instance, one can consider an ASDN channel in which Z is a truncated version of the Gaussian distribution, as a better model in an application where we know that the output Y has minimum and maximum values.

∗ This work was supported by INSF Research Grant on “Nano-Network Communications”. The first two authors contributed equally to this work.
† See Definition 3 for the definition of continuous random variables.

Below we provide a number of applications in which the ASDN channel arises.

1. The AGSDN channel appears in optical communications when modeling the shot noise or the optical amplification noise, with σ(x) = √(c_1 + c_2·x) [1].

2. In molecular communication, the AGSDN channel with σ(x) = c√x arises in the ligand receptor model, the particle sampling noise, the particle counting noise, and the Poisson model for an absorbing receiver [2, 3, 4]. In all cases, the reason for the appearance of a Gaussian signal-dependent noise is the approximation of a binomial or Poisson distribution with a Gaussian distribution. Observe that the mean and variance of a binomial distribution with parameters (n, p) relate to each other: the mean is np and the variance is np(1 − p). As a result, the mean and variance of the approximating Gaussian distribution also relate to each other (see [5, Section II.B] for a detailed overview).

3.
Besides the above applications of ASDN in molecular communications, we shall provide two other cases where this channel model is helpful: Consider the Brownian motion of a particle with no drift over a nonhomogeneous medium, with σ(x) denoting the diffusion coefficient of the medium at location x. The diffusion coefficient σ(x) describes the movement variance of a particle when in location x. More specifically, the motion of the particle is described by the stochastic differential equation

dX_t = σ(X_t) dB_t,

where B_t is the standard Wiener process (standard Brownian motion). Alternatively, we can express the above equation using the following Itô integral:

X_{t+s} − X_t = ∫_t^{t+s} σ(X_u) dB_u.    (2)

Let us denote the position of the particle at time 0 by X = X_0, and its position after t seconds by Y = X_t. If t is a small and fixed number, (2) reduces to

Y = X + √t·σ(X)·Z,

where Z ∼ N(0, 1), since t is small.

4. As another example, consider the molecular timing channel in a time-varying medium. In a molecular timing channel, information is encoded in the release time of molecules. A molecule released at time X hits the receiver after a delay Z, at time Y = X + Z. Molecules are absorbed once they hit the receiver. As such, the distribution of Z is that of the first arrival time. The existing literature only studies this problem when the medium is time-invariant (see [6, 7, 8, 9]): if the medium is uniform, time-invariant and one-dimensional, Z is distributed according to the inverse Gaussian distribution (if there is a flow in the medium) or the Lévy distribution (if there is no flow in the medium). As a result, the channel is called the additive inverse Gaussian noise channel, or the additive Lévy noise channel, in the literature. However, in a time-varying medium (or when the distance between the transmitter and receiver varies over time), the distribution of Z depends on the release time X. As a result, we obtain a signal-dependent additive noise component.
For instance, the additive noise can have a Lévy distribution with a scale parameter that depends on the input X. Using the scaling property of the Lévy distribution, we can express this as σ(X)·Z, where Z has the standard Lévy distribution and σ(X) is the scale parameter. This would be an ASDN channel.

5. In the third item, we discussed Brownian motion after a small time elapse. A Brownian motion with no drift is an example of a martingale. Now let us consider a martingale after a large time elapse. Here, the AGSDN channel also arises as a conditional distribution in any process that can be modeled by a discrete-time martingale with bounded increments. Assume that X_0, X_1, X_2, ... is such a martingale. Then E[X_n] = E[X_0]. Furthermore, by the martingale central limit theorem, the conditional distribution of X_n given X_0 = x for large values of n can be approximated by a Gaussian distribution with mean x and a variance σ_n²(x) that depends on X_0 = x.

6. Finally, we relate the ASDN channel to real fading channels with a direct line of sight. Consider a scalar Gaussian fading channel

Y = X + HX + N,    (3)

where X is the input, H ∼ N(0, c_1) is the Gaussian fading coefficient, and N ∼ N(0, c_2) is the additive environment noise. The first X term on the right-hand side of (3) corresponds to the direct line of sight, while the HX term is the fading term. The distribution of Y given X = x is N(x, c_1·x² + c_2). Thus, (3) can be expressed as Y = X + σ(X)·Z, where σ(x) = √(c_1·x² + c_2) and Z ∼ N(0, 1). A fast fading setting in which H varies independently over each channel use corresponds to a memoryless ASDN channel.

The purpose of this paper is to study the capacity of a memoryless additive signal-dependent noise (ASDN) channel defined via Y = X + σ(X)·Z under input cost constraints. The memoryless assumption implies that the noise Z is drawn independently from f_Z(z) in each channel use.

Related works:
In [10], vector AGSDN channels subject to cost constraints are studied. It is shown that under some assumptions, the capacity-achieving distribution is a discrete distribution. The AGSDN channel with σ(x) = √(c_1 + c_2·x) is investigated in [1], wherein capacity upper and lower bounds are derived under peak and average constraints.

Note that the memoryless AGSDN channel includes the additive white Gaussian noise (AWGN) channel as a special case. The capacity of the AWGN channel under a power constraint is classical and is achieved by a Gaussian input random variable. Its capacity under both average and peak power constraints is quite different, as the capacity-achieving input distribution is discrete with a finite number of mass points [11]. See [12, 13] for further results on the capacity of the AWGN channel with both average and peak power constraints.

Our contributions:
Our contributions in this work can be summarized as follows:

• We provide a new tool for bounding the capacity of continuous input/output channels. Note that

I(X; Y) = h(Y) − h(Y|X).
We provide two sufficient conditions under which h(Y) ≥ h(X), which results in

I(X; Y) ≥ h(X) − h(Y|X),

and leads to lower bounds on the channel capacity of an ASDN channel.

• It is known that increasing the noise variance of an AWGN channel decreases its capacity. However, we show that this is no longer the case for signal-dependent noise channels: the constraint σ_2(x) ≥ σ_1(x) for all x does not necessarily imply that the capacity of an AGSDN channel with σ_2(x) is less than or equal to the capacity of an AGSDN channel with σ_1(x).

• We identify conditions under which the capacity of the ASDN channel becomes infinity. In particular, this implies that the capacity of an AGSDN channel with σ(x) = √(c_1·x² + c_2) tends to infinity as c_2 tends to zero. Thus, the capacity of the real Gaussian fast fading channel given earlier in this section tends to infinity as c_2 tends to zero. This parallels a similar result given in [14] for complex Gaussian fading channels.

• We provide a new upper bound for the AGSDN channel based on the KL symmetrized upper bound of [15]. This upper bound is suitable for the low SNR regime, when σ(x) is large. This is in contrast with the upper bound of [1, Theorems 4, 5] for AGSDN channels with σ(x) = √(c_1 + c_2·x), which is suitable for large values of peak and average constraints. Furthermore, we give our upper bound for a large class of functions σ(x), while the technique of [1] is tuned for σ(x) = √(c_1 + c_2·x).

This paper is organized as follows. Section 2 includes some primary definitions and notation. In Section 3, our main results are given. This includes two lower bounds and one upper bound on the capacity of the ASDN channel. Some useful lemmas used in the paper are collected in Section 4. Numerical results and plots are given in Section 5. The proofs of our results are given in Section 6.
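To make the channel model (1) concrete, the following minimal Python sketch (with illustrative constants c_1, c_2 of our own choosing, not taken from the paper) simulates the AGSDN channel Y = X + σ(X)·Z with σ(x) = √(c_1 + c_2·x) and checks empirically that the conditional law of Y given X = x is N(x, σ(x)²):

```python
import numpy as np

rng = np.random.default_rng(0)

# sigma(x) = sqrt(c1 + c2*x), as in the optical-communication model;
# c1 and c2 are illustrative values of our own choosing.
c1, c2 = 0.1, 1.0
sigma = lambda x: np.sqrt(c1 + c2 * x)

def agsdn(x, n):
    """Draw n outputs of the AGSDN channel Y = x + sigma(x)*Z for a fixed input x."""
    z = rng.standard_normal(n)        # Z ~ N(0, 1), independent of the input
    return x + sigma(x) * z

# The conditional law of Y given X = x is N(x, sigma(x)^2):
# the noise variance grows with the input level x.
for x0 in (0.0, 1.0, 4.0):
    y = agsdn(x0, 200_000)
    print(f"x = {x0}: mean(Y) = {y.mean():.3f}, var(Y) = {y.var():.3f}, "
          f"sigma(x)^2 = {c1 + c2 * x0:.3f}")
```

The sample mean stays at x while the sample variance tracks σ(x)², which is the defining feature of signal-dependent noise.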
In this section we review the definitions of continuous and discrete random variables, as well as entropy, differential entropy, relative entropy, and mutual information.

Throughout this paper all logarithms are in base e. Random variables are denoted by capital letters, and probability measures are denoted by the letter µ. The collection of Borel measurable sets in ℝ is denoted by B(ℝ). We sometimes use a.e. and µ-a.e. as shorthand for "almost everywhere" and "µ-almost everywhere", respectively. The set A is µ-a.e. when ∫_{A^c} dµ = 0. The set A is a.e. if it is µ-a.e. when µ is the Lebesgue measure.

Definition 1 (Relative Entropy). [16, Section 1.4] For random variables X and Y with probability measures µ_X and µ_Y, the relative entropy between X and Y is defined as follows:

D(µ_X ‖ µ_Y) = D(X ‖ Y) := E[log (dµ_X/dµ_Y)(X)] if µ_X ≪ µ_Y, and +∞ otherwise,

where dµ_X/dµ_Y is the Radon–Nikodym derivative, and µ_X ≪ µ_Y means that µ_X is absolutely continuous w.r.t. µ_Y, i.e., µ_X(A) = 0 for all A ∈ B with µ_Y(A) = 0, where B is the Borel σ-field of the space over which the measures are defined.

Definition 2 (Mutual Information). [16, Section 1.6] For random variables X, Y with joint probability measure µ_{X,Y}, the mutual information between X and Y is defined as follows:

I(X; Y) = D(µ_{X,Y} ‖ µ_X µ_Y),

where µ_X µ_Y is the product measure defined by (µ_X µ_Y)(A, C) = µ_X(A)·µ_Y(C), with A ∈ B_X, the Borel σ-field of the space over which µ_X is defined, and C ∈ B_Y, the Borel σ-field of the space over which µ_Y is defined.

Similarly, for three random variables X, Y, Z with joint measure µ_{X,Y,Z}, the conditional mutual information
I(X; Y|Z) is defined as I(X; Y, Z) − I(X; Z).

Definition 3 (Continuous Random Variable). [10] Let X be a real-valued random variable that is measurable with respect to B(ℝ). We call X a continuous random variable if its probability measure µ_X, induced on (ℝ, B), is absolutely continuous with respect to the Lebesgue measure on B(ℝ) (i.e., µ_X(A) = 0 for all A ∈ B with zero Lebesgue measure). We denote the set of all absolutely continuous probability measures by AC. Note that the Radon–Nikodym theorem implies that for each random variable X with measure µ_X ∈ AC there exists a B(ℝ)-measurable function f_X : ℝ → [0, ∞) such that for all A ∈ B(ℝ) we have

µ_X(A) = Pr{X ∈ A} = ∫_A f_X(x) dx.    (4)

The function f_X is called the probability density function (pdf) of X [16, p. 21]. We denote pdfs of absolutely continuous probability measures by the letter f.

Definition 4 (Discrete Random Variable). [10] A random variable X is discrete if it takes values in a countable alphabet set X ⊂ ℝ. The probability mass function (pmf) of a discrete random variable X with probability measure µ_X is denoted by p_X and defined as follows:

p_X(x) := µ_X({x}) = Pr{X = x}, ∀x ∈ X.

Definition 5 (Entropy and Differential Entropy). [17, Chapter 2] We define the entropy H(X) of a discrete random variable X with measure µ_X and pmf p_X as

H(X) = H(µ_X) = H(p_X) := Σ_x p_X(x) log(1/p_X(x)),

if the summation converges. Observe that H(X) = E[log(1/p_X(X))].

For a continuous random variable X with measure µ_X and pdf f_X, we define the differential entropy h(X) as

h(X) = h(µ_X) = h(f_X) := ∫_{−∞}^{+∞} f_X(x) log(1/f_X(x)) dx,

if the integral converges. Similarly, the differential entropy satisfies h(X) = E[log(1/f_X(X))].

Similarly, for two random variables
X, Y with measure µ_{X,Y}, if for all x, µ_{Y|X}(·|x) is discrete with pmf p_{Y|X}(·|x), the conditional entropy H(Y|X) is defined as

H(Y|X) = E[log(1/p_{Y|X}(Y|X))].

Likewise, for two random variables
X, Y with measure µ_{X,Y}, if for all x, µ_{Y|X}(·|x) is absolutely continuous with pdf f_{Y|X}(·|x), the conditional differential entropy h(Y|X) is defined as

h(Y|X) = E[log(1/f_{Y|X}(Y|X))].

We allow the differential entropy to be +∞ or −∞ if the integral diverges to +∞ or −∞; i.e., we say that h(X) = +∞ if and only if

∫_{A_+} f_X(x) log(1/f_X(x)) dx = +∞, and ∫_{A_−} f_X(x) log(1/f_X(x)) dx converges to a finite number,

where A_+ = {x : f_X(x) ≤ 1} and A_− = {x : f_X(x) > 1}. Similarly, we define h(X) = −∞. When we write that h(X) > −∞, we mean that the differential entropy of X exists and is not equal to −∞. The following example, from [14], demonstrates that the differential entropy can be +∞ or −∞.

Example 1.
Differential entropy becomes plus infinity for the following pdf defined over ℝ [14]:

f(x) = 1/(x·(log x)²) for x > e, and f(x) = 0 for x ≤ e.

On the other hand, as shown in [14], differential entropy is minus infinity for

g(x) = 1/(−x·log x·(log(−log x))²) for 0 < x < e^{−e}, and g(x) = 0 otherwise.

Definition 6 (Riemann integrable functions). Given −∞ ≤ ℓ < u ≤ +∞, in this work we utilize Riemann integrable functions g : (ℓ, u) → ℝ on the open interval (ℓ, u). Such functions satisfy the property that for any c ∈ (ℓ, u), the function

h(x) = ∫_c^x g(t) dt

is well-defined. By the fundamental theorem of calculus, h(·) is continuous on (ℓ, u) (but not necessarily differentiable unless g is continuous).

As an example, consider the function g(x) = 1/x for x ≠ 0, and g(0) = 0. This function is Riemann integrable on the restricted domain (0, ∞), but not integrable on (−1, 1).

We are interested in the capacity of an ASDN channel with the input X taking values in a set X and satisfying the cost constraints E[g_i(X)] ≤ 0 for all i = 1, 2, ..., k, for some functions g_i(·). The common power constraint corresponds to g_i(x) = x² − p for some p ≥ 0, but we allow for more general constraints. Then, given a density function f_Z(z) for the noise Z and a function σ(·), we consider the following optimization problem:

C = sup_{µ_X ∈ F} I(X; Y),    (5)

where X and Y are related via (1) and

F = {µ_X | supp(µ_X) ⊆ X, E[g_i(X)] ≤ 0, i = 1, ..., k}.    (6)

We sometimes use supp(X) to denote supp(µ_X), the support of the measure µ_X, when the probability measure on X is clear from the context.

As an example, if, in an application, the input X satisfies ℓ ≤ X ≤ u, the set X can be taken to be [ℓ, u] to reflect this fact; similarly, the constraint 0 < X ≤ u reduces to X = (0, u], and 0 ≤ ℓ ≤ |X| ≤ u reduces to X = [−u, −ℓ] ∪ [ℓ, u].

The rest of this section is organized as follows: in Section 3.1, we provide conditions that imply finiteness of the capacity of an ASDN channel. In Section 3.2, we review the ideas used for obtaining lower bounds in previous works and also in this work. Then, based on the new ideas introduced in this work, we provide two different lower bounds in Sections 3.3 and 3.4. Finally, in Section 3.5, we provide an upper bound for AGSDN channels.

Theorem 1.
Assume that an ASDN channel satisfies the following properties:

• X is a closed and bounded subset of ℝ, i.e., there exists u ≥ 0 such that X ⊆ [−u, u];
• real numbers 0 < σ_ℓ < σ_u exist such that σ_ℓ ≤ σ(x) ≤ σ_u for all x ∈ X;
• positive reals m and γ exist such that f_Z(z) ≤ m < ∞ (a.e.), and E[|Z|^γ] = α < ∞;
• the cost constraint functions g_i(·) are bounded over X.

Then, the capacity of the ASDN channel is finite. Furthermore, there is a capacity-achieving probability measure; in other words, the capacity C can be expressed as a maximum rather than a supremum:

C = max_{µ_X ∈ F} I(X; Y).

Moreover, the output distribution is unique, i.e., if µ_X^{(1)} and µ_X^{(2)} both achieve the capacity, then

f_{Y_1}(y) = f_{Y_2}(y), ∀y ∈ ℝ,

where f_{Y_1} and f_{Y_2} are the pdfs of the output of the channel when the input probability measures are µ_X^{(1)} and µ_X^{(2)}, respectively.

Remark 1.
The above theorem is a generalization of that given in [10, Theorem 1] for the special case of Gaussian noise Z. The proof can be found in Section 6.1.

To give a partial converse of the above theorem, consider the case where the second assumption of the above theorem fails, i.e., when there is a sequence {x_i} of elements in X such that σ(x_i) converges to zero or infinity. The following theorem shows that the input/output mutual information can be infinite in such cases.

Theorem 2.
Consider an ASDN channel with σ : X → [0, +∞), where X is not necessarily a closed set. Suppose one can find a sequence {x̃_i} of elements in X such that σ(x̃_i) converges to 0 or +∞, and such that

• as a sequence of real numbers, {x̃_i} has a limit (possibly outside X), which we denote by c. The limit c can be plus or minus infinity;
• one can find another real number c′ ≠ c such that the open interval E = (c, c′) (or E = (c′, c), depending on whether c′ > c or c′ < c) belongs to X. Furthermore, x̃_i ∈ E, and σ(·) is monotone and continuous over E.

Then one can find a measure µ_X defined on E such that I(X; Y) = ∞, provided that Z is a continuous random variable satisfying the following regularity conditions:

|h(Z)| < ∞, and there exists δ > 0 such that Pr{Z > δ} > 0 and Pr{Z < −δ} > 0.

Furthermore, there is more than one measure µ_X that makes I(X; Y) = ∞. In fact, the input X can be either a continuous or a discrete random variable, i.e., one can find both an absolutely continuous measure with pdf f_X and a discrete pmf p_X such that I(X; Y) is infinite when the measure on the input is either f_X or p_X.

The proof can be found in Section 6.2 and uses some of the results that we prove later in the paper.
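The mechanism behind Theorem 2 can be seen numerically: when σ(·) vanishes along a sequence approaching a limit point, one can place ever more mass points near that point while keeping the noise at each point too small to cause confusion. In the rough sketch below (the constants, mass points, and sample sizes are ours, purely illustrative), a uniform input on n dyadic points through an AGSDN channel with σ(x) = c·x gives I(X; Y) ≈ log n, which is unbounded in n:

```python
import numpy as np

def mi_dyadic(n, c=0.01, n_samples=100_000, seed=0):
    """Monte Carlo estimate of I(X;Y) for X uniform on {1, 1/2, ..., 2^-(n-1)}
    through an AGSDN channel with sigma(x) = c*x (illustrative constants)."""
    rng = np.random.default_rng(seed)
    xs = 2.0 ** -np.arange(n)               # mass points accumulating at 0
    sd = c * xs                             # noise scale vanishes as x -> 0
    k = rng.integers(n, size=n_samples)
    y = xs[k] + sd[k] * rng.standard_normal(n_samples)

    # log f_Y(y) for the equal-weight Gaussian mixture, via log-sum-exp
    lp = (-0.5 * ((y[:, None] - xs) / sd) ** 2
          - np.log(sd) - 0.5 * np.log(2.0 * np.pi) - np.log(n))
    m = lp.max(axis=1)
    log_fy = m + np.log(np.exp(lp - m[:, None]).sum(axis=1))

    h_y = -log_fy.mean()                    # Monte Carlo estimate of h(Y)
    h_y_given_x = np.mean(np.log(sd)) + 0.5 * np.log(2.0 * np.pi * np.e)
    return h_y - h_y_given_x                # I(X;Y) = h(Y) - h(Y|X)

for n in (2, 4, 8):
    print(f"n = {n}: I(X;Y) ≈ {mi_dyadic(n):.3f} nats (log n = {np.log(n):.3f})")
```

Because the mass points are roughly 100 noise standard deviations apart, the mixture components barely overlap and the mutual information essentially equals the input entropy log n.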
Remark 2.
As an example, consider an AGSDN channel with X = (0, u) for an arbitrary u > 0, and σ(x) = x^α for α ≠ 0. For this channel, we have C = +∞ if we impose no input cost constraints. Setting α = 1, this shows that the capacity of the fast-fading channel given in (3) is infinite if c_2 = 0, that is, when there is no additive noise. This parallels a similar result given in [14] for complex Gaussian fading channels. (In Theorem 2, we only require monotonicity, not strict monotonicity.)

Remark 3. It is known that increasing the noise variance of an AWGN channel decreases its capacity. However, we show that this is no longer the case for signal-dependent noise channels. Consider two AGSDN channels with parameters σ_1(x) and σ_2(x), respectively, defined over X = (0, 1) with the following formulas:

σ_1(x) = 1, σ_2(x) = 1/x.

No input cost constraints are imposed. It is clear that σ_2(x) > σ_1(x) for all x ∈ X. However, under the constraint 0 < X < 1, from Theorem 1 we obtain that the capacity of the first channel is finite, while from Theorem 2 we obtain that the capacity of the second channel is ∞. Therefore, the constraint σ_2(x) > σ_1(x) for all x ∈ X does not necessarily imply that the capacity of an AGSDN channel with σ_2(x) is less than or equal to the capacity of an AGSDN channel with σ_1(x).

To compute the capacity from (5), one has to take a maximum over probability measures in a potentially large class F. Practically speaking, one can only find a finite number of measures µ_1, µ_2, ..., µ_k in F and evaluate the input/output mutual information for them. Ideally, {µ_i} should form an ε-covering of the entire F (with an appropriate distance metric), so that the mutual information at every arbitrary measure in F can be approximated by one of the measures µ_i. This can be computationally cumbersome, even for measures defined on a finite interval. As a result, it is desirable to find explicit lower bounds on the capacity.
Observe that I(X; Y) = h(Y) − h(Y|X). To compute the term h(Y|X), observe that given X = x, we have Y = x + σ(x)·Z and thus h(Y|X = x) = log σ(x) + h(Z) (see Lemma 2). Thus,

h(Y|X) = E[log σ(X)] + h(Z).

However, the term h(Y) is more challenging to handle. The authors in [1] consider an AGSDN channel with σ(x) = √(c_1 + c_2·x) for x ≥ 0, and show that h(Y) ≥ h(X), hence I(X; Y) ≥ h(X) − h(Y|X). This implies that instead of maximizing I(X; Y), one can maximize h(X) − h(Y|X) to obtain a lower bound.

The proof of the relation h(Y) ≥ h(X) in [1] is non-trivial; we review it here to motivate our own techniques in this paper. First consider the special case of c_2 = 0. In this case, σ(x) = √c_1 is constant and the AGSDN reduces to the AWGN channel Y = X + Z. In this special case, one obtains the desired equation by writing

h(Y) ≥ h(Y|Z) = h(X + Z|Z) = h(X|Z) = h(X).    (7)

However, the above argument does not extend to the case of c_2 > 0, where σ(x) = √(c_1 + c_2·x) depends on x. As argued in [1], without loss of generality, one may assume that c_1 = 0; this is because one can express a signal-dependent noise channel with σ(x) = √(c_1 + c_2·x) as Y = X + √(c_2·X)·Z_1 + √c_1·Z_2, where Z_1 and Z_2 are independent standard normal variables. Thus, we can write Y = Y_1 + √c_1·Z_2 where Y_1 = X + √(c_2·X)·Z_1. From the argument for AWGN channels, we have that h(Y) ≥ h(Y_1). Thus, it suffices to show that h(Y_1) ≥ h(X). This is the special case of the problem for c_1 = 0 and corresponds to σ(x) = c√x.

To show h(Y) ≥ h(X) when Y = X + c√X·Z, more advanced ideas are utilized in [1]. The key observation is the following: assume that

X ∼ g_X(x) = (1/α)·e^{−x/α}·1[x ≥ 0]

is exponentially distributed with mean E[X] = α. Then Y has density

g_Y(y) = (1/√(α(α + 2c²))) · exp((√α·y − √(α + 2c²)·|y|)/(√α·c²)).

Then, for any arbitrary input distribution f_X, from the data processing property of the relative entropy, we have

D(f_Y ‖ g_Y) ≤ D(f_X ‖ g_X),

where f_Y is the output density for the input density f_X. Once simplified, this inequality leads to h(f_Y) ≥ h(f_X).

The above argument crucially depends on the particular form of the output distribution corresponding to the input exponential distribution. It is a specific argument that works for the specific choice of σ(x) = √(c_1 + c_2·x) and the normal distribution for Z, and cannot be readily extended to other choices of σ(·) and f_Z(z). In this paper, we propose two approaches to handle more general settings:

• (Idea 1:) We provide the following novel general lemma that establishes h(Y) ≥ h(X) for a large class of ASDN channels.

Lemma 1.
Take an arbitrary channel characterized by the conditional pdf f_{Y|X}(·|x) satisfying

∫_X f_{Y|X}(y|x) dx ≤ 1, ∀y ∈ Y,    (8)

where X and Y are the supports of the channel input X and channel output Y, respectively. Take an arbitrary input pdf f_X(x) on X resulting in an output pdf f_Y(y) on Y. Assuming that h(X) and h(Y) exist, we have h(Y) ≥ h(X).

The proof is provided in Section 6.7.

As an example, Lemma 1 yields an alternative proof of the result of [1] for an AGSDN channel. Note that, as we mentioned before, in order to prove that h(Y) ≥ h(X) for σ(x) = √(c_1 + c_2·x), we only need to prove it for σ(x) = c√x. To this end, observe that since X ⊆ [0, +∞), we have

∫_X f_{Y|X}(y|x) dx ≤ ∫_0^∞ (1/√(2πc²x))·e^{−(y−x)²/(2c²x)} dx = ∫_0^∞ (2/(√(2π)·c))·e^{−(y−v²)²/(2c²v²)} dv,    (9)

where x = v² and v ≥ 0; the integral equals 1 for y ≥ 0 and e^{2y/c²} < 1 for y < 0, so in either case it is at most 1. The proof of equation (9) is given in Appendix A.

• (Idea 2:) We provide a variation of the type of argument given in (7) by introducing a number of new steps. This adapts the argument to ASDN channels.

In the following sections, we discuss the above two ideas separately.

3.3 First Idea for Lower Bound

Theorem 3.
Assume an ASDN channel defined in (1), where σ : (ℓ, u) → (0, +∞) with −∞ ≤ ℓ < u ≤ +∞. If X is a continuous random variable with pdf f_X(x), and

1/σ(x) is Riemann integrable on (ℓ, u),    (10)

∫_ℓ^u (1/σ(x))·f_Z((y − x)/σ(x)) dx ≤ 1, ∀y ∈ ℝ,    (11)

then

I(X; Y) ≥ h(ϕ(X)) − h(Z),    (12)

where

ϕ(x) = ∫_c^x (1/σ(t)) dt

for an arbitrary c ∈ (ℓ, u), provided that h(ϕ(X)) and h(Z) exist.
Remark 4. Note that for any c ∈ (ℓ, u), ϕ(x) is well defined (see Definition 6). By selecting a different c′ ∈ (ℓ, u) we obtain a different function ϕ′(x) such that

ϕ′(x) − ϕ(x) = ∫_{c′}^{c} (1/σ(t)) dt < ∞.

However, h(ϕ(X)) is invariant with respect to adding constant terms, and thus invariant with respect to the different choices of c ∈ (ℓ, u).

The above theorem is proved in Section 6.3.
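As a numerical sanity check of the lower bound I(X; Y) ≥ h(ϕ(X)) − h(Z) (with parameters of our own choosing), the sketch below takes σ(x) = √(c_1 + c_2·x) on (0, u), picks the input density f_X(x) = 1/(L·σ(x)) that makes W = ϕ(X) uniform — so that h(ϕ(X)) = log L with L = ∫_0^u dx/σ(x) — and verifies by quadrature that I(X; Y) indeed exceeds log L − h(Z):

```python
import numpy as np

# Illustrative parameters (ours): sigma(x) = sqrt(c1 + c2*x) on X = (0, u).
c1, c2, u = 0.01, 1.0, 10.0
sigma = lambda t: np.sqrt(c1 + c2 * t)

x = np.linspace(1e-6, u, 2001)
dx = x[1] - x[0]
L = np.sum(1.0 / sigma(x)) * dx        # int_0^u dx/sigma(x); closed form 2(sqrt(c1+c2*u)-sqrt(c1))/c2
f_x = 1.0 / (L * sigma(x))             # input pdf that makes W = phi(X) uniform: h(phi(X)) = log L

# Output density f_Y(y) = int f_X(x) f_{Y|X}(y|x) dx for Gaussian Z
y = np.linspace(-3.0, 28.0, 3101)
dy = y[1] - y[0]
kern = np.exp(-0.5 * ((y[:, None] - x[None, :]) / sigma(x)[None, :]) ** 2) \
       / (np.sqrt(2.0 * np.pi) * sigma(x)[None, :])
f_y = kern @ (f_x * dx)

h_z = 0.5 * np.log(2.0 * np.pi * np.e)                   # h(Z) for Z ~ N(0, 1)
h_y = -np.sum(f_y * np.log(np.maximum(f_y, 1e-300))) * dy
mi = h_y - (np.sum(f_x * np.log(sigma(x))) * dx + h_z)   # I = h(Y) - h(Y|X)
lower = np.log(L) - h_z                                  # lower bound h(phi(X)) - h(Z)

print(f"I(X;Y) ≈ {mi:.3f} nats, lower bound log L - h(Z) ≈ {lower:.3f} nats")
```

The gap between the two numbers is exactly the quantity h(Y) − h(X) that Lemma 1 guarantees to be non-negative for this σ(·).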
Corollary 1.
Let W = ϕ(X). Since ϕ(·) is a one-to-one function (as σ(x) > 0), we obtain

max_{µ_X ∈ F∩AC} h(ϕ(X)) − h(Z) = max_{f_W ∈ G} h(W) − h(Z),

where F is defined in (6), and W ∼ f_W belongs to

G = {f_W(·) | µ_W ∈ AC, supp(µ_W) ⊆ ϕ(X), E[g_i(ϕ^{−1}(W))] ≤ 0 for all i = 1, ..., k}.

Here ϕ(X) = {ϕ(x) : x ∈ X}. Hence, from Theorem 3 we obtain that

max_{µ_X ∈ F} I(X; Y) ≥ max_{f_W ∈ G} h(W) − h(Z).

In order to find the maximum of h(W) over f_W ∈ G, we can use known results on maximum entropy probability distributions; e.g., see [16, Chapter 3.1].

Corollary 2.
Consider an ASDN channel satisfying (10) and (11). Assume that the only input constraint is X = (ℓ, u), i.e., ℓ < X < u. Then, from Corollary 1, we obtain the lower bound

max_{f_W ∈ G} h(W) − h(Z) = log(∫_ℓ^u (1/σ(x)) dx) − h(Z),

by taking a uniform distribution for f_W(w) over ϕ(X) if this set is bounded [16, Section 3.1]. Otherwise, if ϕ(X) has infinite length, the capacity is infinite, as seen by choosing a pdf for W whose differential entropy is infinity (see Example 1). The equivalent pdf f_X(x) for X is the pdf of ϕ^{−1}(W).

Example 2.
Consider an AWGN channel (namely, an AGSDN channel with constant σ(x) = σ) with X = ℝ and Z ∼ N(0, 1). Let us restrict attention to measures that satisfy the power constraint E[X²] ≤ P; that is, g_1(x) = x² − P. Since

∫_ℝ (1/σ(x))·f_Z((y − x)/σ(x)) dx = ∫_ℝ (1/√(2πσ²))·e^{−(y−x)²/(2σ²)} dx = 1,

we can apply Corollary 1. Here W = ϕ(X) = X/σ; thus, the lower bound is

C ≥ max_{f_W(·): E[W²] ≤ P/σ²} h(W) − h(Z) = (1/2) log(P/σ²),    (13)

where the maximum is achieved by the Gaussian distribution W ∼ N(0, P/σ²) [17, Section 12.1]. It is well known that the capacity of the AWGN channel is

C = (1/2) log(1 + P/σ²).    (14)

Comparing (14) and (13), we see that the lower bound is very close to the capacity in the high SNR regime.

As another example, consider the constraints X ≥ 0 and E[X] ≤ α on admissible input measures. Here, we obtain the lower bound

max_{f_W(·): W ≥ 0, E[W] ≤ α/σ} h(W) − h(Z) = (1/2) log(α²e/(2πσ²)),

where we used the fact that the maximum is achieved by the exponential distribution f_W(w) = (σ/α)·exp(−wσ/α) for w ≥ 0 and f_W(w) = 0 for w < 0 [17, Section 12.1]. Unlike the first example above, an exact capacity formula for this channel is not known.

We now provide another lower bound, which is more appropriate for channels in which Z is either non-negative or non-positive and σ(x) is a monotonic function. An example of such a channel is the molecular timing channel discussed in the introduction.

Theorem 4.
Assume an ASDN channel defined in (1) with σ : (ℓ, u) → (0, ∞) for −∞ ≤ ℓ < u ≤ +∞. If X is a continuous random variable with pdf f_X(x), and

σ(x) is continuous and monotonic over (ℓ, u),    (15)

1/σ(x) is Riemann integrable on (ℓ, u),    (16)

then

I(X; Y) ≥ α h(ψ(X)) − β,

provided that α, β are well-defined and α > 0. In order to define the constants α, β and the function ψ(x), take some arbitrary δ > 0 and proceed as follows:

• If the function σ(x) is increasing over (ℓ, u), let

ψ(x) = δ log σ(x) + ∫_c^x (1/σ(t)) dt,
α = Pr{Z ≥ δ}, β = α h(Z | Z ≥ δ) + H(α).

• If the function σ(x) is decreasing over (ℓ, u), let

ψ(x) = −δ log σ(x) + ∫_c^x (1/σ(t)) dt,
α = Pr{Z ≤ −δ}, β = α h(Z | Z ≤ −δ) + H(α).

Here c ∈ (ℓ, u) is arbitrary, and H(p) := −p log p − (1 − p) log(1 − p).

Remark 5.
Observe that in both cases, ψ(x) is a strictly increasing function of x over (ℓ, u), as σ(x) > 0 and log(x) is increasing. Similar to Remark 4, the choice of c ∈ (ℓ, u) does not affect the value of h(ψ(X)), and hence the lower bound. However, the choice of δ > 0 does affect the lower bound.

The above theorem is proved in Section 6.4.
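To get a feel for the constants in Theorem 4 when Z is a standard Gaussian, the sketch below evaluates α = Pr{Z ≥ δ}, the truncated-Gaussian entropy h(Z | Z ≥ δ) (via a closed form we derived ourselves), and β, and then sweeps δ in the resulting lower bound α·log[δ·|log(σ(u⁻)/σ(ℓ⁺))| + ∫_ℓ^u dx/σ(x)] − β for the illustrative increasing choice σ(x) = e^x on (0, 50); neither this σ nor the interval comes from the paper:

```python
import math

def alpha(delta):                      # Pr{Z >= delta} for Z ~ N(0, 1)
    return 0.5 * math.erfc(delta / math.sqrt(2.0))

def phi(delta):                        # standard normal pdf
    return math.exp(-0.5 * delta * delta) / math.sqrt(2.0 * math.pi)

def h_trunc(delta):
    # Differential entropy of Z conditioned on {Z >= delta} (closed form we derived):
    # h = log(alpha) + (1/2) log(2*pi) + 1/2 + delta*phi(delta)/(2*alpha)
    a = alpha(delta)
    return math.log(a) + 0.5 * math.log(2.0 * math.pi) + 0.5 + delta * phi(delta) / (2.0 * a)

def Hb(p):                             # binary entropy H(p), in nats
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

# Lower bound for sigma(x) = e^x on (0, u): sigma is increasing,
# |log(sigma(u-)/sigma(0+))| = u and int_0^u dx/sigma(x) = 1 - e^{-u}.
u = 50.0

def bound(delta):
    a = alpha(delta)
    beta = a * h_trunc(delta) + Hb(a)
    return a * math.log(delta * u + (1.0 - math.exp(-u))) - beta

best_val, best_delta = max((bound(0.05 * k), 0.05 * k) for k in range(1, 100))
print(f"best delta ≈ {best_delta:.2f} gives lower bound ≈ {best_val:.3f} nats")
```

The sweep makes Remark 5 concrete: the bound is maximized at an intermediate δ, since large δ kills α while small δ removes the δ·log σ term of ψ.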
Corollary 3.
Similar to Corollary 1, let V = ψ(X). Since ψ(·) is a one-to-one (strictly increasing) function, we obtain

max_{µ_X ∈ F∩AC} α h(ψ(X)) − β = max_{f_V ∈ G} α h(V) − β,

where F is defined in (6), and V ∼ f_V belongs to

G = {f_V(·) | µ_V ∈ AC, supp(µ_V) ⊆ ψ(X), E[g_i(ψ^{−1}(V))] ≤ 0 for all i = 1, ..., k}.

Hence, from Theorem 4 we obtain that

max_{µ_X ∈ F} I(X; Y) ≥ α max_{f_V ∈ G} h(V) − β,

where α and β are the constants defined in Theorem 4. As mentioned earlier, to maximize h(V) over f_V ∈ G, we can use known results on maximum entropy probability distributions; e.g., see [16, Chapter 3.1].

Corollary 4.
Consider an ASDN channel satisfying (15) and (16). Assume that the only input constraint is X = (ℓ, u), i.e., ℓ < X < u. Then, from Corollary 3, we obtain the lower bound

α max_{f_V ∈ G} h(V) − β = α log[δ·|log(σ(u⁻)/σ(ℓ⁺))| + ∫_ℓ^u (1/σ(x)) dx] − β,

where α and β are defined in Theorem 4, and

σ(ℓ⁺) := lim_{x↓ℓ} σ(x), σ(u⁻) := lim_{x↑u} σ(x).

The lower bound is achieved by taking a uniform distribution for f_V(v) over ψ(X) if this set is bounded [16, Section 3.1]. Otherwise, if ψ(X) has infinite length, the capacity is infinite, as seen by choosing a pdf f_V(v) such that h(V) = +∞ (see Example 1). The equivalent pdf f_X(x) for X is the pdf of ψ^{−1}(V).

3.5 An Upper Bound

We begin by reviewing the upper bound given in [1] to motivate our own upper bound. The upper bound in [1] works by utilizing Topsøe's inequality [18] to bound the mutual information I(X; Y) from above as follows:

I(X; Y) ≤ E_{µ_X}[D(f_{Y|X}(·|X) ‖ q(·))],

for an arbitrary pdf q(y) on the output Y. The distribution q(y) is chosen carefully to allow for calculation of the above KL divergence. The particular form of σ(x) = √(c_1 + c_2·x) makes explicit calculations possible. The second difficulty in calculating the above expression is that we need to take the expected value over the input measure µ_X. However, the capacity-achieving input measure is not known. This difficulty is addressed by the technique of "input distributions that escape to infinity", under some assumptions about the peak constraint.

In this part, we give an upper bound based on the KL symmetrized upper bound of [15]. The idea is that

I(X; Y) = D(µ_{X,Y} ‖ µ_X µ_Y) ≤ D(µ_{X,Y} ‖ µ_X µ_Y) + D(µ_X µ_Y ‖ µ_{X,Y}) ≜ D_sym(µ_{X,Y} ‖ µ_X µ_Y).

Our upper bound has the advantage of being applicable to a large class of σ(x).
To state this upper bound, let Cov(X, Y) := E[XY] − E[X]E[Y] denote the covariance of two random variables X and Y.

Theorem 5.

For any AGSDN channel defined in (1), we have

I(X; Y) ≤ −(1/2) Cov( X² + σ²(X), 1/σ²(X) ) + Cov( X, X/σ²(X) ),

provided that the covariance terms on the right-hand side are finite.

The proof can be found in Section 6.5.
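As an illustration (ours, not part of the paper), the following sketch compares the exact mutual information of a two-point input with the covariance bound of Theorem 5, assuming σ(x) = √(1 + x), u = 5, and X equiprobable on {0, u}:

```python
import math

# Sanity check of Theorem 5 (illustrative sketch): for X uniform on {0, u}
# and sigma(x) = sqrt(1 + x), compare the exact I(X;Y) with the bound
#   I(X;Y) <= -0.5*Cov(X^2 + sigma^2(X), 1/sigma^2(X)) + Cov(X, X/sigma^2(X)).

u = 5.0
sig0, sigu = 1.0, math.sqrt(1.0 + u)          # sigma(0), sigma(u)

def normal_pdf(y, mean, std):
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

# Output pdf: equal mixture of N(0, sig0^2) and N(u, sigu^2).
def f_Y(y):
    return 0.5 * normal_pdf(y, 0.0, sig0) + 0.5 * normal_pdf(y, u, sigu)

# h(Y) by a Riemann sum on a wide grid.
dy, lo, hi = 0.005, -15.0, 30.0
ys = [lo + i * dy for i in range(int((hi - lo) / dy))]
hY = -sum(f_Y(y) * math.log(f_Y(y)) for y in ys) * dy

# h(Y|X) = E[log sigma(X)] + 0.5*log(2*pi*e)   (Lemma 2, Gaussian Z)
hYgX = 0.5 * (math.log(sig0) + math.log(sigu)) + 0.5 * math.log(2 * math.pi * math.e)
mi = hY - hYgX

# For a two-point input with p = 1/2, Cov(w(X), v(X)) = (1/4)(w(u)-w(0))(v(u)-v(0)).
cov1 = 0.25 * ((u ** 2 + sigu ** 2) - sig0 ** 2) * (1 / sigu ** 2 - 1 / sig0 ** 2)
cov2 = 0.25 * (u * u / sigu ** 2 - 0.0)
bound = -0.5 * cov1 + cov2
print(mi, bound)   # the bound dominates the true I(X;Y)
```

For this input, I(X;Y) is at most log 2 ≈ 0.69 nats, while the bound evaluates to 25/6 ≈ 4.17 nats, so the inequality holds with room to spare.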
Corollary 5.
For an AGSDN channel with parameters σ(x), Z ∼ N(0, 1), and X = [0, u], if the functions σ(x) and x/σ²(x) are increasing over X, σ(0) > 0, and x² + σ²(x) is convex over X, then

max_{μ_X : 0 ≤ X ≤ u, E[X] ≤ α} I(X; Y) ≤ { F/8 if α ≥ u/2;  (α/(2u))(1 − α/u) F if α < u/2 },

where

F = u²/σ²(u) + u²/σ²(0) + σ²(0)/σ²(u) + σ²(u)/σ²(0) − 2.

The corollary is proved in Section 6.6.
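A small helper (ours, for illustration) evaluates this bound for the family σ(x) = √(c₁ + c₂x); both branches of the bound are shown:

```python
import math

def corollary5_bound(u, alpha, sigma):
    """Evaluate the Corollary 5 upper bound (in nats) for an AGSDN channel
    with peak constraint u, average constraint alpha, and coefficient sigma."""
    s0, su = sigma(0.0), sigma(u)
    F = (u / su) ** 2 + (u / s0) ** 2 + (s0 / su) ** 2 + (su / s0) ** 2 - 2.0
    if alpha >= u / 2:
        return F / 8.0
    return (alpha / (2 * u)) * (1 - alpha / u) * F

# Example: sigma(x) = sqrt(1 + x), u = 5, so sigma(0) = 1 and sigma(5) = sqrt(6).
sigma = lambda x: math.sqrt(1.0 + x)
b_loose = corollary5_bound(5.0, 2.5, sigma)   # alpha >= u/2  ->  F/8
b_tight = corollary5_bound(5.0, 1.0, sigma)   # alpha <  u/2  ->  smaller bound
print(b_loose, b_tight)
```

With these numbers F = 100/3, giving F/8 = 25/6 ≈ 4.17 nats in the unconstrained-mean regime and 8/3 ≈ 2.67 nats when α = 1.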
Remark 6.
Even though Corollary 5 assumes σ(0) > 0, if we formally set σ(0) = 0, both F and the upper bound on capacity become infinite. This is consistent with Theorem 2 when σ(0) = 0.

Corollary 6.
The particular choice σ(x) = √(c₁ + c₂x), motivated by the applications discussed in the Introduction, has the property that σ(x) and x/σ²(x) are increasing, and hence Theorem 5 can be applied.

Some Useful Lemmas
In this section, we provide three lemmas that are used in the proofs of the theorems of this paper.
Lemma 2.
In an ASDN channel defined in (1), with a continuous noise random variable Z with pdf f_Z(·), and noise coefficient σ(x) > 0 (μ_X-a.e.), the conditional measure μ_{Y|X}(·|x) has the following pdf:

f_{Y|X}(y|x) = (1/σ(x)) f_Z( (y − x)/σ(x) ),  μ_{X,Y}-(a.e.).

Moreover, Y is a continuous random variable with the pdf

f_Y(y) = E[ (1/σ(X)) f_Z( (y − X)/σ(X) ) ].

Furthermore, if h(Z) exists, h(Y|X) can be defined and is equal to

h(Y|X) = E[log σ(X)] + h(Z).

The lemma is proved in Section 6.8.
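The mixture formula for f_Y can be checked numerically; a Monte Carlo sketch (ours), assuming X ∼ Uniform(0, 5), σ(x) = √(1 + x), and standard Gaussian Z:

```python
import math, random

# Illustration of Lemma 2 (our sketch): estimate the output pdf
#   f_Y(y) = E[(1/sigma(X)) f_Z((y - X)/sigma(X))]
# by Monte Carlo over X ~ Uniform(0, 5), with sigma(x) = sqrt(1 + x) and
# Z ~ N(0, 1), then check that f_Y integrates to 1.

random.seed(0)
sigma = lambda x: math.sqrt(1.0 + x)
phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

xs = [random.uniform(0.0, 5.0) for _ in range(1000)]

def f_Y(y):
    return sum(phi((y - x) / sigma(x)) / sigma(x) for x in xs) / len(xs)

dy, lo, hi = 0.02, -10.0, 20.0
mass = sum(f_Y(lo + i * dy) for i in range(int((hi - lo) / dy))) * dy
print(mass)   # close to 1
```

Each Monte Carlo sample contributes a properly normalized scaled copy of f_Z, so the estimated f_Y integrates to 1 up to grid and truncation error only.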
Lemma 3.
Let X be a continuous random variable with pdf f_X(x). For any function σ : (ℓ, u) → [0, +∞) such that σ(x) is Riemann integrable over (ℓ, u) and σ(x) > 0 (a.e.), where −∞ ≤ ℓ < u ≤ +∞, we have that

h(X) + E[log σ(X)] = h(ϕ(X)),  (17)

where

ϕ(x) = ∫_c^x σ(t) dt,  (18)

and c ∈ (ℓ, u) is an arbitrary constant. Note that if the left-hand side does not exist, or becomes ±∞, the same occurs for the right-hand side, and vice versa.

The lemma is proved in Section 6.9.
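A numerical check of (17)–(18) on a concrete example (ours): take X ∼ Uniform(0, 1), σ(x) = 2x, and c = 0, so ϕ(x) = x² and the lemma predicts h(X²) = 0 + E[log 2X] = log 2 − 1:

```python
import math, random

# Numerical check of Lemma 3 (our illustrative example): X ~ Uniform(0,1),
# sigma(x) = 2x (> 0 a.e.), c = 0, so phi(x) = integral_0^x 2t dt = x^2.
# Lemma 3 predicts h(phi(X)) = h(X) + E[log sigma(X)] = 0 + (log 2 - 1).

random.seed(1)
lhs = math.log(2.0) - 1.0               # closed form for h(X) + E[log 2X]

# Histogram estimate of h(X^2) from samples.
n, bins = 10 ** 6, 2000
counts = [0] * bins
for _ in range(n):
    y = random.random() ** 2            # Y = phi(X) = X^2 in (0, 1)
    counts[min(int(y * bins), bins - 1)] += 1
w = 1.0 / bins
rhs = sum(-(c / n) * math.log((c / n) / w) for c in counts if c > 0)
print(lhs, rhs)   # both close to log(2) - 1 ~ -0.307
```

The histogram estimator carries a small bias near y = 0 (where the density 1/(2√y) blows up), but the two sides agree to within a few hundredths of a nat.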
Lemma 4.
Let X be a random variable with probability measure μ_X, and let the functions w(x) and v(x) be increasing over [ℓ, u], where −∞ < ℓ < u < +∞. If v(x) is convex over [ℓ, u], then

max_{μ_X : ℓ ≤ X ≤ u, E[X] ≤ α} Cov( w(X), v(X) ) ≤ β [w(u) − w(ℓ)][v(u) − v(ℓ)],  (19)

where

β = { 1/4 if α ≥ (ℓ + u)/2;  (u − α)(α − ℓ)/(u − ℓ)² if α < (ℓ + u)/2 }.

Furthermore, for the case α ≥ (ℓ + u)/2, a maximizer of (19) is the pmf p_X(ℓ) = p_X(u) = 1/2. For the case α < (ℓ + u)/2, if v(x) is linear, a maximizer of (19) is the pmf p_X(ℓ) = 1 − p_X(u) = (u − α)/(u − ℓ).

The proof is given in Section 6.10.
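A randomized sanity check of (19) (ours): with w(x) = eˣ and v(x) = x² on [0, 2] (both increasing, v convex) and α = 1.5 ≥ (ℓ + u)/2, the bound is (1/4)(e² − 1)(4 − 0), attained by the two-point pmf p_X(0) = p_X(2) = 1/2:

```python
import math, random

# Randomized sanity check of Lemma 4 (our sketch): w(x) = e^x, v(x) = x^2 on
# [0, 2], alpha = 1.5 >= (l+u)/2 = 1, so beta = 1/4 and the bound is
# (1/4)(w(2)-w(0))(v(2)-v(0)). Random feasible pmfs must stay below it.

random.seed(2)
l, u, alpha = 0.0, 2.0, 1.5
w, v = math.exp, lambda x: x * x
bound = 0.25 * (w(u) - w(l)) * (v(u) - v(l))

def cov(xs, ps, f, g):
    Ef = sum(p * f(x) for x, p in zip(xs, ps))
    Eg = sum(p * g(x) for x, p in zip(xs, ps))
    return sum(p * f(x) * g(x) for x, p in zip(xs, ps)) - Ef * Eg

best = 0.0
for _ in range(20000):
    xs = [random.uniform(l, u) for _ in range(4)]
    ps = [random.random() for _ in range(4)]
    s = sum(ps); ps = [p / s for p in ps]
    if sum(p * x for x, p in zip(xs, ps)) <= alpha:   # feasibility E[X] <= alpha
        best = max(best, cov(xs, ps, w, v))
print(best, bound)   # best stays below the bound
```

No random four-point distribution with E[X] ≤ α exceeds the bound, while the equiprobable endpoint pmf meets it exactly.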
Figure 1: Capacity and the symmetrized-divergence (KL) upper bound, in nats per channel use, in terms of c₁ for an AGSDN channel with A = 5, α = 2.5, c₂ = 1, and σ(x) = √(c₁ + c₂x).
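The closed-form lower bounds used in these numerical comparisons can be reproduced with a short stand-alone script (ours; it assumes σ(x) = √(c₁ + x) with c₁ = 1 and takes A = 100 as a sample point, maximizing over δ on a grid as described in the text):

```python
import math

# Sketch (ours) of the two closed-form lower bounds, for sigma(x) = sqrt(c1 + x)
# with c1 = 1 and peak constraint A = 100.

c1, A = 1.0, 100.0
hZ = 0.5 * math.log(2 * math.pi * math.e)            # h(Z), Z ~ N(0, 1)
intinv = 2 * (math.sqrt(A + c1) - math.sqrt(c1))     # integral_0^A dx/sigma(x)

# Corollary 2 lower bound: log(integral_0^A dx/sigma(x)) - h(Z).
lb2 = math.log(intinv) - hZ

phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
Q = lambda z: 0.5 * math.erfc(z / math.sqrt(2))      # Pr{Z >= z}

def h_trunc(delta, dz=0.005, width=10.0):
    """Differential entropy of Z conditioned on Z >= delta (numeric)."""
    q, h, z = Q(delta), 0.0, delta
    while z < delta + width:
        f = phi(z) / q
        h -= f * math.log(f) * dz
        z += dz
    return h

# Corollary 4 lower bound, maximized over delta on a grid.
def lb4(delta):
    a = Q(delta)                                     # alpha = Pr{Z >= delta}
    beta = a * h_trunc(delta) - a * math.log(a) - (1 - a) * math.log(1 - a)
    return a * math.log(delta * abs(math.log(math.sqrt((A + c1) / c1))) + intinv) - beta

best4 = max(lb4(0.01 * k) for k in range(1, 301))
print(lb2, best4)   # the Corollary 2 bound is larger here
```

Consistent with the discussion above, the Corollary 2 bound dominates: the coefficient α ≤ 1/2 caps the Corollary 4 bound well below it at this value of A.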
Figure 2: Capacity and the lower bounds of Corollaries 2 and 4, in nats per channel use, in terms of A for an AGSDN channel with σ(x) = √(c₁ + x).

In this section, some numerical results are given for σ(x) = √(c₁ + x) and Z ∼ N(0, 1). Figure 1 considers the peak constraint A = 5 and the average constraint α = 2.5. Figure 2 depicts the capacity and the lower bounds of Corollaries 2 and 4 for σ(x) = √(c₁ + x) in terms of the peak constraint A; here, c₁ = 1 is assumed. The lower bound of Corollary 2 for 0 < X < A is computed by the following closed-form formula:

log( ∫_0^A dx/√(c₁ + x) ) − h(Z) = log( 2√(A + c₁) − 2√c₁ ) − (1/2) log(2πe),

while the lower bound of Corollary 4 equals

α log( δ |log( σ(A)/σ(0) )| + ∫_0^A dx/√(c₁ + x) ) − β = α log( δ |log √((A + c₁)/c₁)| + 2√(A + c₁) − 2√c₁ ) − β,

where δ > 0, α = Pr{Z ≥ δ}, and β = α h(Z|Z ≥ δ) − α log α − (1 − α) log(1 − α). We maximized over δ in order to find the lower bound of Corollary 4. The first lower bound is better than the second one, mainly because of the multiplicative coefficient α in the second lower bound: since the second lower bound holds for a more general class of channels, we must restrict attention to the positive (or negative) part of the support of Z, causing a multiplicative coefficient of at most 1/2. If the support of Z consists only of the positive (or negative) reals, the two lower bounds do not differ much.

Proofs
6.1 Proof of Theorem 1

Finiteness of capacity:
The first step is to show that the capacity is finite:

sup_{μ_X ∈ F} I(X; Y) < ∞.  (20)

To prove this, it suffices to show that the suprema of both h(Y) and h(Y|X) over μ_X ∈ F are finite, i.e.,

|h(Y)|, |h(Y|X)| < +∞, uniformly over μ_X ∈ F.  (21)

Utilizing Lemma 2, the existence and boundedness of h(Y|X) is obtained as follows:

|h(Y|X)| ≤ max{ |log σ_ℓ|, |log σ_u| } + |h(Z)| < ∞, uniformly over F.

From Lemma 2, we obtain that Y is continuous with a pdf f_Y(y). To prove that the integral defining h(Y) converges to a finite value (existence of the entropy), and furthermore converges to a value that is bounded uniformly over F, it is sufficient to show that there are positive reals γ, m̄ and v such that for any μ_X ∈ F, we have [19]:

sup_{y ∈ R} f_Y(y) < m̄,  (22)
E[|Y|^γ] < v.  (23)

Also, from Lemma 2, we obtain that for any μ_X ∈ F,

f_Y(y) ≤ m/σ_ℓ.

Thus, (22) holds with m̄ = m/σ_ℓ. In order to prove (23), note that

E[|Y|^γ] ≤ E[(|X| + σ_u |Z|)^γ] ≤ 2^γ E[ max{ |X|^γ, σ_u^γ |Z|^γ } ] ≤ 2^γ E[|X|^γ] + (2σ_u)^γ E[|Z|^γ] ≤ 2^γ u^γ + (2σ_u)^γ α,

uniformly over F. Thus, h(Y) is well-defined and uniformly bounded over F. Hence, from the definition of mutual information we obtain that

I(X; Y) = h(Y) − h(Y|X)  (24)

is bounded uniformly for μ_X ∈ F.

Existence of a maximizer:
Let

C = sup_{μ_X ∈ F} I(X; Y) < ∞.  (25)

We would like to prove that the above supremum is a maximum. Equation (25) implies the existence of a sequence of measures {μ_X^(k)}_{k=1}^∞ in F such that

lim_{k→∞} I(X_k; Y_k) = C,

where X_k ∼ μ_X^(k), and Y_k ∼ μ_Y^(k) is the output of the channel when the input is X_k. Furthermore, without loss of generality, we can assume that {μ_X^(k)}_{k=1}^∞ converges (in the Lévy metric) to a measure μ*_X ∈ F. The reason is that since X is compact, the set F is also compact with respect to the Lévy metric [10, Proposition 2]. Thus, any sequence of measures in F has a convergent subsequence, and with no loss of generality we can take this subsequence as {μ_X^(k)}_{k=1}^∞. Thus, from convergence in the Lévy metric, we know that there is μ*_X ∈ F such that

lim_{k→∞} E[g(X_k)] = E[g(X*)],  (26)

for all bounded continuous g : R → C. We would like to prove that

I(X*; Y*) = C,  (27)

where Y* ∼ μ*_Y is the output measure of the channel when the input measure is μ*_X. This will complete the proof. From the argument given in the first part of the proof on "Finiteness of capacity", h(Y*|X*) and h(Y*) are well-defined and finite. As a result, to show (27) we only need to prove that

lim_{k→∞} h(Y_k|X_k) = h(Y*|X*),  (28)
lim_{k→∞} h(Y_k) = h(Y*).  (29)

Since −∞ < log σ_ℓ ≤ log σ_u < +∞, (28) is obtained from (26) and Lemma 2. In order to prove (29), we proceed as follows:

• Step 1: We begin by showing that the sequence {μ_Y^(k)}_{k=1}^∞ is a Cauchy sequence with respect to the total variation distance, i.e.,

∀ε > 0, ∃N : m, n ≥ N ⇒ ‖μ_Y^(m) − μ_Y^(n)‖_V ≤ ε,  (30)

where for any two probability measures μ_A and μ_B, the total variation distance is defined by [16, p. 31]

‖μ_A − μ_B‖_V := sup_Δ Σ_i | μ_A(E_i) − μ_B(E_i) |,

where Δ = {E_1, …, E_m} ⊆ B(R) ranges over all finite measurable partitions.

• Step 2: Having established Step 1, we utilize the fact that the space of probability measures is complete with respect to the total variation metric. To show this, note that by Lemma 2 all the Y_k's have pdfs, and hence the total variation can be expressed in terms of the L₁ norm between pdfs [16, Lemma 1.5.3]. From [20, p. 276] we obtain that this space of pdfs is complete with respect to the L₁ norm. As a result, μ_Y^(k) converges to some measure Ŷ ∼ μ̂_Y with respect to the total variation metric. We further claim that this convergence implies that

lim_{k→∞} h(Y_k) = h(Ŷ).  (31)

The reason is that from (22) and (23), we see that {f_Y^(k)} and f_Ŷ are uniformly bounded and have finite γ-moments. Therefore, (31) follows from [19, Theorem 1]. Thus, in Step 2, we obtain that the sequence h(Y_k) has a limit.

• Step 3: We show that the limit found in Step 2 is equal to h(Y*), i.e.,

h(Ŷ) = h(Y*).  (32)

This completes the proof of (29). Hence, it only remains to prove (30) and (32).

Proof of (30): Since {I(X_k; Y_k)}_{k=1}^∞ converges to C, for any ε′ >
0, there exists N such that:

|C − I(X_k; Y_k)| ≤ ε′, ∀k ≥ N.

Now, consider m, n ≥ N. Let Q be a uniform Bernoulli random variable, independent of all previously defined variables. When Q = 0, we sample from the measure μ_X^(m), and when Q = 1, we sample from the measure μ_X^(n). This induces the measure X̃ ∼ μ_X̃ defined as follows:

μ_X̃ = (1/2) μ_X^(m) + (1/2) μ_X^(n).

Let Ỹ ∼ μ_Ỹ be the output of the channel when the input is X̃. We have a Markov chain Q − X̃ − Ỹ. Note that

I(X̃; Ỹ | Q) = (1/2) I(X_m; Y_m) + (1/2) I(X_n; Y_n).

From concavity of mutual information in the input measure, we obtain that:

I(X̃; Ỹ) ≥ (1/2) I(X_m; Y_m) + (1/2) I(X_n; Y_n) ≥ C − ε′.

Since F is an intersection of half-spaces, it is convex, and as a result μ_X̃ ∈ F. Thus, I(X̃; Ỹ) ≤ C, and we obtain that

I(X̃; Ỹ) − I(X̃; Ỹ | Q) ≤ ε′.

Because of the Markov chain Q − X̃ − Ỹ, we have I(Ỹ; Q | X̃) = 0 and as a result:

I(Ỹ; Q) ≤ ε′  ⟹  D( μ_{Ỹ,Q} ‖ μ_Ỹ μ_Q ) ≤ ε′.

From Pinsker's inequality we obtain that

‖ μ_{Ỹ,Q} − μ_Ỹ μ_Q ‖_V ≤ √(2ε′),  (33)

where ‖ μ_{Ỹ,Q} − μ_Ỹ μ_Q ‖_V is the total variation between the measures μ_{Ỹ,Q} and μ_Ỹ μ_Q. Note that

‖ μ_{Ỹ,Q} − μ_Ỹ μ_Q ‖_V = (1/2) ‖ μ_Y^(m) − μ_Ỹ ‖_V + (1/2) ‖ μ_Y^(n) − μ_Ỹ ‖_V.  (34)

Therefore, from (33) and (34), we obtain that ‖ μ_Y^(m) − μ_Ỹ ‖_V, ‖ μ_Y^(n) − μ_Ỹ ‖_V ≤ 2√(2ε′). As a result,

‖ μ_Y^(m) − μ_Y^(n) ‖_V ≤ 4√(2ε′).

Choosing ε′ ≤ ε²/
32, we obtain that {μ_Y^(k)}_{k=1}^∞ is a Cauchy sequence.

Proof of (32): To this end, it suffices to prove that

Φ_Ŷ(ω) = Φ_{Y*}(ω), ∀ω ∈ R,

where Φ_X(ω) := E[exp(jωX)] is the characteristic function of the random variable X. Since Y_k converges to Ŷ in total variation, and convergence in total variation is stronger than weak convergence [16, p. 31], the characteristic functions Φ_{Y_k}(ω) converge to Φ_Ŷ(ω) pointwise. Hence, it suffices to prove that Φ_{Y_k}(ω) converges to Φ_{Y*}(ω) pointwise. From (1), we obtain that

Φ_{Y_k}(ω) = E[ e^{jω(X_k + σ(X_k)Z)} ] = E[ e^{jωX_k} Φ_Z(σ(X_k)ω) ].

Similarly, Φ_{Y*}(ω) = E[ e^{jωX*} Φ_Z(σ(X*)ω) ]. Since {X_k} converges to X* in the Lévy metric and the function g(x) = e^{jωx} Φ_Z(σ(x)ω) is bounded:

|g(x)| = | e^{jωx} | | Φ_Z(σ(x)ω) | ≤ 1,

from (26) we obtain that E[g(X_k)] = Φ_{Y_k}(ω) converges to E[g(X*)] = Φ_{Y*}(ω) pointwise.

Uniqueness of the output pdf:
The proof is the same as the first part of the proof of [10, Theorem 1]. This completes the proof.
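The mixture identity (34) and the Pinsker step (33) used in the proof above can be verified on a small discrete example (ours):

```python
import math, random

# Discrete sanity check (our sketch) of the mixture argument above: with
# mix = (mu_m + mu_n)/2 and Q a fair coin selecting the component,
# (34): ||mu_{Y,Q} - mu_Y mu_Q||_1 = (1/2)||mu_m - mix||_1 + (1/2)||mu_n - mix||_1,
# and Pinsker gives ||.||_1 <= sqrt(2 I(Y;Q)) in nats.

random.seed(3)
k = 10
mu_m = [random.random() for _ in range(k)]; s = sum(mu_m); mu_m = [p / s for p in mu_m]
mu_n = [random.random() for _ in range(k)]; s = sum(mu_n); mu_n = [p / s for p in mu_n]
mix = [(p + q) / 2 for p, q in zip(mu_m, mu_n)]

# L1 distance between the joint law of (Y, Q) and the product of its marginals.
l1_joint = sum(abs(0.5 * p - 0.5 * m) for p, m in zip(mu_m, mix)) + \
           sum(abs(0.5 * q - 0.5 * m) for q, m in zip(mu_n, mix))
l1_sum = 0.5 * sum(abs(p - m) for p, m in zip(mu_m, mix)) + \
         0.5 * sum(abs(q - m) for q, m in zip(mu_n, mix))

# I(Y;Q) = D(joint || product) in nats.
mi = 0.5 * sum(p * math.log(p / m) for p, m in zip(mu_m, mix) if p > 0) + \
     0.5 * sum(q * math.log(q / m) for q, m in zip(mu_n, mix) if q > 0)
print(l1_joint, l1_sum, mi)
```

The two total-variation expressions agree exactly, and the L₁ distance stays below √(2 I(Ỹ; Q)), as used to derive (30).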
6.2 Proof of Theorem 2

For a continuous input measure, we utilize a later result of the paper, namely Theorem 4, by choosing ℓ = c, u = c′ when c′ > c, or ℓ = c′, u = c when c′ < c. To use Corollary 4, observe that the image of E under ψ(·) has infinite length. This is because the sequence {x̃_i} in E was such that the monotone function σ(·) converges to zero or infinity on that sequence. Then, it is obtained that any pdf f_X(·) such that h(ψ(X)) = +∞ makes I(X; Y) infinite if |h(Z|Z > δ)| < ∞ (which leads to |β| < ∞), where ψ(x) is the bijective function of x defined in the statement of Theorem 4. In order to prove that |h(Z|Z > δ)| < ∞, let the random variable Z̄ be Z conditioned on Z > δ. Due to the continuity of Z and the fact that Pr{Z > δ} >
0, we obtain that Z̄ has a valid pdf f_Z̄(z) defined by

f_Z̄(z) = { (1/θ) f_Z(z) if z > δ; 0 if z ≤ δ },

where θ := Pr{Z > δ} >
0. Since h(Z) exists and |h(Z)| < ∞, we obtain that E[ |log f_Z(Z)| ] < ∞. Hence,

|h(Z̄)| ≤ E[ |log f_Z̄(Z̄)| ] ≤ −log θ + (1/θ) E[ |log f_Z(Z)| ] < ∞.

Therefore, h(Y | Z > δ) exists and |h(Y | Z > δ)| < ∞. A similar treatment proves |h(Y | Z < −δ)| < ∞.

It remains to construct a discrete pmf with infinite mutual information. The statement of the theorem assumes the existence of a sequence {x̃_i} in an open interval E = (c, c′) ⊆ X (or E = (c′, c) if c′ < c) such that

1. c is the limit of the sequence {x̃_i},
2. σ(x̃_i) converges to 0 or +∞,
3. σ(·) is monotone and continuous over E.

We now make the following claim about the existence of another sequence {x_i}_{i=1}^∞ ⊆ E with certain nice properties:

Claim:
Suppose that one cannot find a non-empty interval [x′, x″] ⊆ E such that σ(x) = 0 for all x ∈ [x′, x″]. Then, there exist 0 < a < b < ∞ and a sequence {x_i}_{i=1}^∞ ⊆ E such that

• If σ(x) is increasing,

Pr{a < Z < b} > 0,  (35)
(x_i + aσ(x_i), x_i + bσ(x_i)) ∩ (x_j + aσ(x_j), x_j + bσ(x_j)) = ∅, ∀i ≠ j ∈ N,  (36)
0 < σ(x_i) < ∞, ∀i ∈ N.  (37)

• If σ(x) is decreasing,

Pr{−b < Z < −a} > 0,
(x_i − bσ(x_i), x_i − aσ(x_i)) ∩ (x_j − bσ(x_j), x_j − aσ(x_j)) = ∅, ∀i ≠ j ∈ N,
0 < σ(x_i) < ∞, ∀i ∈ N.

We continue with the proof assuming that this claim is correct; we give the proof of the claim later. To show how the claim can be used to construct a discrete pmf with infinite mutual information, first consider the possibility that the assumption of the claim fails: σ(x) = 0 for all x ∈ [x′, x″]. Then Y = X when X ∈ [x′, x″]. Therefore, we can choose any discrete distribution on that interval such that H(X) = ∞; as a result, I(X; Y) = I(X; X) = H(X) = ∞. Thus, we only need to consider the case that the assumption of the claim holds. Assume that σ(x) is increasing; the construction when σ(x) is decreasing is similar. Fix a, b, and {x_i}_{i=1}^∞ satisfying (35) and (36). Take an arbitrary pmf {p_i}_{i=1}^∞ such that

Σ_i p_i log(1/p_i) = +∞.  (38)

Then, we define a discrete random variable X taking values in {x_i}_{i=1}^∞ with Pr{X = x_i} = p_i. We claim that I(X; Y) = +∞. To this end, it suffices to show

I(X; Y) ≥ Pr{a < Z < b} I(X; Y | a < Z < b) − H(Pr{a < Z < b}),  (39)
I(X; Y | a < Z < b) = ∞.
(40)

Proof of (39): Define the random variable E as follows:

E = { 0 if Z ∈ (a, b); 1 if Z ∉ (a, b) }.

From the definition of mutual information, we have that

I(X; Y|E) − I(X; Y) = I(Y; E|X) − I(Y; E) ≤ H(E).

Since

I(X; Y|E) = Pr{E = 0} I(X; Y|E = 0) + Pr{E = 1} I(X; Y|E = 1),

we conclude (39).

Proof of (40): Since

I(X; Y | a < Z < b) = H(X) − H(X | Y, a < Z < b),

it suffices to show that

H(X) = ∞,  H(X | Y, a < Z < b) = 0.  (41)

The equality H(X) := −Σ_i p_i log p_i = +∞ follows from (38). To prove the other equality, note that Y belongs to the interval (x_i + aσ(x_i), x_i + bσ(x_i)) when X = x_i. Therefore, since the intervals (x_i + aσ(x_i), x_i + bσ(x_i)) are disjoint, X can be recovered from Y. Thus, X is a function of Y when a < Z < b, and the second equality of (41) is proved. Now, it only remains to prove our claim on the existence of a, b, and {x_i}_{i=1}^∞. We assume that σ(x) is increasing; the proof when σ(x) is decreasing is similar. From the assumption on Z that Pr{Z ≥ δ} >
0, we obtain that there exists δ < b < ∞ such that Pr{δ < Z < b} >
0. As a result, we select a = δ. Since σ(x) is monotone, we cannot have σ(x′) = σ(x″) = 0 for two distinct x′ and x″ in E, since this would imply σ(x) = 0 for all x between x′ and x″. As a result, we need not worry about the constraint (37) on {x_i}: σ(x_i) = 0 can occur for at most one index i, and we can delete that element from the sequence to ensure (37).

To show the existence of {x_i}_{i=1}^∞, we provide a method to find x_{i+1} from x_i. The method is described below and illustrated in Figure 3. Take x_1 to be an arbitrary element of E. Observe that since σ(x) is continuous and increasing over E, the functions x + aσ(x) and x + bσ(x) are continuous and strictly increasing over E, with

x + aσ(x) < x + bσ(x), ∀x ∈ E.

Therefore, for the case c′ > c (occurring when σ(x̃_i) converges to 0),

lim_{x→c} ( x + aσ(x) ) = lim_{x→c} ( x + bσ(x) ) = c.

Hence, for a given x_i ∈ E, by the intermediate value theorem there exists a unique x_{i+1} satisfying c < x_{i+1} < x_i < c′ such that

x_{i+1} + bσ(x_{i+1}) = x_i + aσ(x_i).

Similarly, for the case c′ < c (occurring when σ(x̃_i) converges to +∞), if x_i ∈ E, there exists a unique x_{i+1} satisfying c > x_{i+1} > x_i > c′ such that

x_{i+1} + aσ(x_{i+1}) = x_i + bσ(x_i).

The intervals created this way are disjoint, and the process does not terminate after finitely many steps. Therefore, the theorem is proved.

Figure 3: Possible cases for σ(x) when |c| < ∞. The four panels show the curves x ± aσ(x) and x ± bσ(x) for σ → +∞ or σ → 0, with σ ascending or descending.

6.3 Proof of Theorem 3

From Lemma 2 we obtain that h(Y|X) exists.
Hence, utilizing Lemma 1, we can write

h(Y) ≥ h(X)  ⟹  I(X; Y) ≥ h(X) − h(Y|X),  (42)

provided that

∫_ℓ^u f_{Y|X}(y|x) dx ≤ 1,

which is satisfied because

∫_ℓ^u f_{Y|X}(y|x) dx = ∫_ℓ^u (1/σ(x)) f_Z( (y − x)/σ(x) ) dx ≤ 1,

where the last inequality comes from the assumption of the theorem. From Lemma 2, we have that

h(Y|X) = E[log σ(X)] + h(Z).

Therefore, (42) can be written as

I(X; Y) ≥ h(X) − E[log σ(X)] − h(Z).

Exploiting Lemma 3 we obtain that

h(X) − E[log σ(X)] = h(ϕ(X)),

where ϕ(X) is defined in (12). Hence, the proof is complete.

6.4 Proof of Theorem 4

We only prove the case where σ(x) is an increasing function over (ℓ, u). The proof of the theorem for decreasing functions is similar to the increasing case; we only need to substitute Z ≥ δ with Z ≤ −δ. We claim that

I(X; Y) ≥ α I(X; Y | Z ≥ δ) − H(α).  (43)

Consider the random variable E defined as follows:

E = { 0 if Z ≥ δ; 1 if Z < δ }.
From the definition of mutual information, we have that

I(X; Y|E) − I(X; Y) = I(Y; E|X) − I(Y; E) ≤ H(E).

Therefore, since

I(X; Y|E) = Pr{Z ≥ δ} I(X; Y | Z ≥ δ) + Pr{Z < δ} I(X; Y | Z < δ),

we conclude (43). Now, we find a lower bound for I(X; Y | Z ≥ δ). From Lemma 2 we obtain that Y is a continuous random variable. We claim that

I(X; Y | Z ≥ δ) = h(Y | Z ≥ δ) − h(Y | X, Z ≥ δ)
= h(Y | Z ≥ δ) − ( E[log σ(X)] + h(Z | Z ≥ δ) )  (44)
= h(Y | Z ≥ δ) − h(X) − E[ log(1 + Zσ′(X)) | Z ≥ δ ] + h(X) + E[ log( (1 + Zσ′(X))/σ(X) ) | Z ≥ δ ] − h(Z | Z ≥ δ),  (45)

where (44) is obtained from Lemma 2 and the fact that the random variable Z conditioned on Z ≥ δ is also continuous when Pr{Z ≥ δ} >
0. Moreover, (45) is obtained by adding and subtracting the term E[log(1 + Zσ′(X)) | Z ≥ δ]. Note that we have not assumed that σ(x) is differentiable; we have only assumed that σ : (ℓ, u) → (0, ∞) is continuous and monotonic over (ℓ, u). However, every monotonic function is differentiable almost everywhere, i.e., the set of points at which σ(x) is not differentiable has Lebesgue measure zero. We define σ′(x) to be zero wherever σ(x) is not differentiable, and to be the derivative of σ(x) wherever it is differentiable. With this definition of σ′(x), and from the continuity of σ(x), the integral of σ′(x)/σ(x) gives us back the function log σ(x). Since σ(x) is an increasing positive function and Z ≥ δ >
0, we conclude that

E[ log( (1 + Zσ′(X))/σ(X) ) | Z ≥ δ ] ≥ E[ log( (1 + δσ′(X))/σ(X) ) ].  (46)

From Lemma 3 and the fact that the integral of σ′(x)/σ(x) gives us back the function log σ(x), we obtain that

h(X) + E[ log( (1 + δσ′(X))/σ(X) ) ] = h(ψ(X)),

where ψ(x) is defined in Theorem 4. As a result, from (45) we obtain that

I(X; Y | Z ≥ δ) ≥ h(Y | Z ≥ δ) − h(X) − E[ log(1 + Zσ′(X)) | Z ≥ δ ] + h(ψ(X)) − h(Z | Z ≥ δ).  (47)

Using this inequality in conjunction with (43), we obtain a lower bound on I(X; Y). The lower bound that we would like to prove in the statement of the theorem is

I(X; Y) ≥ α h(ψ(X)) − α h(Z | Z ≥ δ) − H(α).

As a result, it suffices to prove that for all continuous random variables X with pdf f_X(x) we have

h(Y | Z ≥ δ) − h(X) − E[ log(1 + Zσ′(X)) | Z ≥ δ ] ≥ 0.

To this end, observe that h(Y | Z ≥ δ) ≥ h(Y | Z, Z ≥ δ). Thus, if we show that

h(Y | Z, Z ≥ δ) = h(X) + E[ log(1 + Zσ′(X)) | Z ≥ δ ],  (48)

the proof is complete. We can write

h(Y | Z, Z ≥ δ) = ∫_δ^∞ f_{Z′}(z) h(Y | Z = z) dz,

where Z′ is Z conditioned on Z ≥ δ, with pdf f_{Z′}(z). Defining the function r_z(x) := x + zσ(x), we obtain that Y_z = r_z(X), where Y_z is Y conditioned on Z = z ≥ δ. Since σ(x) is a continuous increasing function, r_z(x) is a bijection for all z ≥ δ, and so its inverse function r_z^{−1}(y) exists. Moreover, since X is continuous and r_z(·) is a bijection, Y_z is also a continuous random variable with pdf

f_{Y_z}(y) = f_X(x) / (1 + zσ′(x)),

where x = r_z^{−1}(y).
Thus, we have that

h(Y | Z = z) = E[ log( 1/f_{Y_z}(Y_z) ) ] = E[ log( 1/f_X(X) ) ] + E[ log(1 + zσ′(X)) ] = h(X) + E[ log(1 + zσ′(X)) ].

By taking the expected value over Z ≥ δ on both sides, (48) is achieved. Therefore, the theorem is proved.¹

6.5 Proof of Theorem 5

Based on [15] we obtain that

I(X; Y) ≤ D_sym( μ_{X,Y} ‖ μ_X μ_Y ).

Utilizing Lemma 2, we obtain that the pdfs f_Y(y) and f_{Y|X}(y|x) exist and are well-defined. Therefore,

D_sym( μ_{X,Y} ‖ μ_X μ_Y ) = D( μ_{X,Y} ‖ μ_X μ_Y ) + D( μ_X μ_Y ‖ μ_{X,Y} )
= E_{μ_{X,Y}}[ log( f_{Y|X}(Y|X)/f_Y(Y) ) ] + E_{μ_X μ_Y}[ log( f_Y(Y)/f_{Y|X}(Y|X) ) ]
= −E_{μ_{X,Y}}[ log( 1/f_{Y|X}(Y|X) ) ] + E[ log( 1/f_Y(Y) ) ] + E_{μ_X μ_Y}[ log( 1/f_{Y|X}(Y|X) ) ] − E[ log( 1/f_Y(Y) ) ]
= E_{μ_X μ_Y}[ log( 1/f_{Y|X}(Y|X) ) ] − h(Y|X).

Again, from Lemma 2, since Z ∼ N(0, 1),

log( 1/f_{Y|X}(y|x) ) = log( σ(x)√(2π) ) + (y − x)²/(2σ²(x)).

Therefore, since Z = (Y − X)/σ(X), we obtain that

h(Y|X) = E[ log( σ(X)√(2π) ) ] + 1/2.  (49)

¹The measure-zero set of points where σ(x) is not differentiable affects f_{Y_z}(y) only on a set of measure zero. Note that F_{Y_z}(y) = F_X(r_z^{−1}(y)) is always correct, and thus the values of f_{Y_z}(y) on a measure-zero set of points are not important.
In addition,

E_{μ_X μ_Y}[ log( 1/f_{Y|X}(Y|X) ) ] = E[ log( √(2π) σ(X) ) ] + E_{μ_X μ_Y}[ (Y − X)²/(2σ²(X)) ].

By expanding, we obtain that

E_{μ_X μ_Y}[ (Y − X)²/σ²(X) ] = E[Y²] E[1/σ²(X)] + E[X²/σ²(X)] − 2 E[Y] E[X/σ²(X)].

By substituting Y with X + σ(X)Z and simplifying, we can write

E_{μ_X μ_Y}[ (Y − X)²/σ²(X) ]
= E[X²] E[1/σ²(X)] + E[σ²(X)] E[1/σ²(X)] + E[X²/σ²(X)] − 2 E[X] E[X/σ²(X)]
= E[X²] E[1/σ²(X)] + E[σ²(X)] E[1/σ²(X)] − E[ (X² + σ²(X))/σ²(X) ] + 1 + 2( E[X²/σ²(X)] − E[X] E[X/σ²(X)] ),

which equals

1 − Cov( X² + σ²(X), 1/σ²(X) ) + 2 Cov( X, X/σ²(X) ).

Combining this with (49) yields D_sym( μ_{X,Y} ‖ μ_X μ_Y ) = −(1/2) Cov( X² + σ²(X), 1/σ²(X) ) + Cov( X, X/σ²(X) ). Therefore, from all of the above equations, the theorem is proved.
6.6 Proof of Corollary 5

Observe that

(1/2) F = (1/2) ( u²/σ²(u) + u²/σ²(0) + σ²(0)/σ²(u) + σ²(u)/σ²(0) − 2 )
= (1/2) ( u² + σ²(u) − σ²(0) ) ( 1/σ²(0) − 1/σ²(u) ) + u²/σ²(u).

Then, using Theorem 5, it suffices to prove the following two inequalities:
Cov( X² + σ²(X), −1/σ²(X) ) ≤ β ( u² + σ²(u) − σ²(0) ) ( 1/σ²(0) − 1/σ²(u) ),  (50)

and

Cov( X, X/σ²(X) ) ≤ β u²/σ²(u),  (51)

where

β = { 1/4 if α ≥ u/2;  (α/u)(1 − α/u) if α < u/2 }.

Since σ(x) is increasing, we obtain that x² + σ²(x) and −1/σ²(x) are also increasing over [0, u]. Therefore, from Lemma 4, equation (50) is proved. Similarly, (51) is also obtained from Lemma 4, because x and x/σ²(x) are increasing functions.

6.7 Proof of Lemma 1

From Definition 5, we obtain that

h(X) − h(Y) = E[ log( f_Y(Y)/f_X(X) ) ].

Now, utilizing the inequality log x ≤ x −
1, it suffices to prove that

E[ f_Y(Y)/f_X(X) ] ≤ 1.

To this end, we can write

E[ f_Y(Y)/f_X(X) ] = ∫_X ∫_Y f_{X,Y}(x, y) ( f_Y(y)/f_X(x) ) dy dx = ∫_X ∫_Y f_{Y|X}(y|x) f_Y(y) dy dx = ∫_Y f_Y(y) ∫_X f_{Y|X}(y|x) dx dy ≤ ∫_Y f_Y(y) dy = 1,

where the last inequality holds because of the assumption of the lemma. Therefore, the lemma is proved.

6.8 Proof of Lemma 2

The conditional pdf f_{Y|X}(y|x) can be easily obtained from the definition of the channel in (1). In order to calculate h(Y|X), using Definition 5 we can write

h(Y|X) = E[ log( 1/f_{Y|X}(Y|X) ) ] = E[log σ(X)] + E[ log( 1/f_Z( (Y − X)/σ(X) ) ) ].

Exploiting the fact that (Y − X)/σ(X) = Z, h(Y|X) is obtained. It only remains to prove that Y is continuous. To this end, from the definition of the channel in (1), we obtain that

F_Y(y) = Pr{ Z ≤ (y − X)/σ(X) } = E_{μ_X}[ F_Z( (y − X)/σ(X) ) ],

where F_Y(y) and F_Z(z) are the cdfs of the random variables Y and Z, defined by F_Y(y) = Pr{Y ≤ y} and F_Z(z) = Pr{Z ≤ z}, respectively. In order to prove the claim about f_Y(y), we must show that

∫_{−∞}^y E[ (1/σ(X)) f_Z( (t − X)/σ(X) ) ] dt = E[ F_Z( (y − X)/σ(X) ) ],

for all y ∈ R. Because of Fubini's theorem [20, Chapter 2.3], this is equivalent to

lim_{n→∞} E[ F_Z( (−n − X)/σ(X) ) ] = 0,

i.e., that for any ε >
0, there exists m such that

E[ F_Z( (−n − X)/σ(X) ) ] ≤ ε, ∀n > m.  (52)

Since lim_{z→−∞} F_Z(z) = 0, there exists ℓ ∈ R such that F_Z(z) ≤ ε/2 for all z ≤ ℓ. Therefore, since F_Z(z) ≤ 1, we can write

E[ F_Z( (−n − X)/σ(X) ) ] ≤ ε/2 + Pr{ (−n − X)/σ(X) ≥ ℓ }.

We can write

Pr{ (−n − X)/σ(X) ≥ ℓ } = Pr{ X + ℓσ(X) ≤ −n }.

Now, we can take m large enough such that

Pr{ (−n − X)/σ(X) ≥ ℓ } ≤ ε/2, ∀n > m.

As a result, (52) is proved.

6.9 Proof of Lemma 3
Since σ(x) is Riemann integrable, ϕ(x) is continuous, and since σ(x) > 0 (a.e.), ϕ(x) is a strictly increasing function over the support of X. It follows that ϕ(x) is injective and there exists an inverse function ϕ^{−1}(·) for ϕ(·). Now, define the random variable Y = ϕ(X), and let f_X(x) be the pdf of X. Since X is a continuous random variable and ϕ(x) is a bijection, Y is also a continuous random variable with the following pdf:

f_Y(y) = f_X(x) / ( (d/dx) ϕ(x) ),

where x = ϕ^{−1}(y). Hence, we have that

f_Y(y) = f_X( ϕ^{−1}(y) ) / σ( ϕ^{−1}(y) ).

Now, we can calculate the differential entropy of Y as follows:

h(Y) = E[ log( 1/f_Y(Y) ) ] = E[ log( σ(ϕ^{−1}(Y)) / f_X(ϕ^{−1}(Y)) ) ] = E[ log( σ(X)/f_X(X) ) ] = h(X) + E[log σ(X)].

Therefore, the lemma is proved.

6.10 Proof of Lemma 4
First, assume that v(x) = ax + b, with a > 0; we will prove the general case later. In this case, we claim that the support of the optimal solution needs only two members. To this end, note that the following problem is equivalent to the original problem defined in (19):

max_{γ ≤ α} max_{μ_X : ℓ ≤ X ≤ u, E[X] = γ} Cov( w(X), v(X) ).

Since v(x) = ax + b, for a given γ we would like to maximize

Cov( w(X), v(X) ) = E[ w(X) v(X) ] − (aγ + b) E[ w(X) ],

which is a linear function of μ_X, subject to E[X] = γ, which is also a linear function of μ_X. By the standard cardinality-reduction technique (Fenchel's extension of the Carathéodory theorem), we can reduce the support of μ_X to at most two members (see [21, Appendix C] for a discussion of the technique). Assume that the support of μ_X is {x_1, x_2}, where ℓ ≤ x_1 ≤ x_2 ≤ u, with pmf p_X(x_1) = 1 − p_X(x_2) = p. Thus, we can simplify Cov( w(X), v(X) ) as

Cov( w(X), v(X) ) = Σ_{i=1}^2 p_X(x_i) w(x_i) v(x_i) − Σ_{i=1}^2 Σ_{j=1}^2 p_X(x_i) p_X(x_j) w(x_i) v(x_j) = p(1 − p) ( w(x_2) − w(x_1) ) ( v(x_2) − v(x_1) ),

where the last equality is obtained by expanding the sums. Thus, the problem defined in (19) equals the following:

max p(1 − p) ( w(x_2) − w(x_1) ) ( v(x_2) − v(x_1) ) over 0 ≤ p ≤ 1, ℓ ≤ x_1 ≤ x_2 ≤ u, px_1 + (1 − p)x_2 ≤ α.

We claim that the optimal choice for x_1 is x_1 = ℓ. To see this, observe that w(x) and v(x) are increasing functions, and hence

pℓ + (1 − p)x_2 ≤ px_1 + (1 − p)x_2 ≤ α

and

( w(x_2) − w(ℓ) ) ( v(x_2) − v(ℓ) ) ≥ ( w(x_2) − w(x_1) ) ( v(x_2) − v(x_1) ).

Hence, x_1 = ℓ is optimal. Substituting v(x) = ax + b, we obtain that the problem is equivalent to the following:

a · max p(1 − p) ( w(x_2) − w(ℓ) ) ( x_2 − ℓ ) over 0 ≤ p ≤ 1, ℓ ≤ x_2 ≤ u, pℓ + (1 − p)x_2 ≤ α.
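Before solving this reduced problem analytically, it can be explored by brute force; a sketch (ours), with the illustrative choices w(x) = eˣ, v(x) = x on [ℓ, u] = [0, 2] and α = 0.6:

```python
import math

# Brute-force exploration (our sketch) of the reduced problem above, with
# l = 0, u = 2, alpha = 0.6 and w(x) = e^x (v(x) = x, i.e., a = 1, b = 0):
#   maximize p*(1-p)*(w(x2) - w(l))*(x2 - l)  s.t.  p*l + (1-p)*x2 <= alpha.

l, u, alpha = 0.0, 2.0, 0.6
w = math.exp

best, arg = -1.0, None
steps = 200
for i in range(steps + 1):                   # grid over p
    p = i / steps
    for j in range(steps + 1):               # grid over x2
        x2 = l + (u - l) * j / steps
        if p * l + (1 - p) * x2 <= alpha + 1e-12:
            val = p * (1 - p) * (w(x2) - w(l)) * (x2 - l)
            if val > best:
                best, arg = val, (p, x2)
print(arg, best)   # maximum at x2 = u with p = (u - alpha)/(u - l) = 0.7
```

The grid search lands on x₂ = u with p = (u − α)/(u − ℓ) = 0.7, matching the maximizer derived next via the KKT conditions.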
Utilizing the KKT conditions, one obtains that the optimal solution is
\[
\begin{cases}
p^* = \frac{1}{2}, & x_1^* = \ell,\; x_2^* = u, \quad \alpha \ge \frac{\ell + u}{2}, \\[4pt]
p^* = \frac{u - \alpha}{u - \ell}, & x_1^* = \ell,\; x_2^* = u, \quad \alpha < \frac{\ell + u}{2}.
\end{cases}
\]
Now, we consider the general case of $v(x)$ being a convex function (but not necessarily linear). Since $v(x)$ is convex, we obtain that
\[
v(x) \le v(\ell) + (x - \ell)\,\frac{v(u) - v(\ell)}{u - \ell}, \qquad \forall x \in [\ell, u].
\]
The right-hand side is the line connecting $(\ell, v(\ell))$ and $(u, v(u))$; this line lies above the curve $x \mapsto v(x)$ for any $x \in [\ell, u]$. Therefore,
\[
\mathbb{E}\left[v(X)\right] \le v(\ell) + \left(\mathbb{E}[X] - \ell\right)\frac{v(u) - v(\ell)}{u - \ell}.
\]
Thus, $\mathbb{E}[X] \le \alpha$ implies that $\mathbb{E}[v(X)] \le \Delta$, where
\[
\Delta = v(\ell) + (\alpha - \ell)\,\frac{v(u) - v(\ell)}{u - \ell}.
\]
Now, we relax the optimization problem and consider
\[
\max_{\substack{\mu_X:\; \ell \le X \le u \\ \mathbb{E}[v(X)] \le \Delta}} \mathrm{Cov}\left(w(X), v(X)\right).
\]
The solution of the above optimization problem is an upper bound for the original problem because the feasible set of the original problem is a subset of the feasible set of the relaxed optimization problem. Now, using similar ideas as in the linear case, we conclude that the support of the optimal $\mu_X$ has at most two members, and the optimal solution is
\[
\begin{cases}
p^* = \frac{1}{2}, & x_1^* = \ell,\; x_2^* = u, \quad \alpha \ge \frac{\ell + u}{2}, \\[4pt]
p^* = \frac{v(u) - \Delta}{v(u) - v(\ell)}, & x_1^* = \ell,\; x_2^* = u, \quad \alpha < \frac{\ell + u}{2}.
\end{cases}
\]
It can be verified that
\[
\frac{v(u) - \Delta}{v(u) - v(\ell)} = \frac{u - \alpha}{u - \ell},
\]
since the definition of $\Delta$ gives $v(u) - \Delta = \left(v(u) - v(\ell)\right)\frac{u - \alpha}{u - \ell}$.
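The stated optimum can be cross-checked by brute force on a hypothetical instance (the instance, not the result, is ours): $\ell = 0$, $u = 2$, $\alpha = 0.5 < (\ell+u)/2$, $w(x) = x^2$, $v(x) = x$ (so $a = 1$, $b = 0$). The KKT solution predicts $p^* = (u-\alpha)/(u-\ell) = 0.75$ with $x_1^* = \ell$, $x_2^* = u$.

```python
# Brute-force grid search over (p, x1, x2) for the problem
#   max p(1-p) (w(x2) - w(x1)) (x2 - x1)
#   s.t. 0 <= p <= 1, l <= x1 <= x2 <= u, p*x1 + (1-p)*x2 <= alpha,
# on a hypothetical instance with w(x) = x^2, v(x) = x.
l, u, alpha = 0.0, 2.0, 0.5
w = lambda x: x ** 2

def objective(p, x1, x2):
    return p * (1 - p) * (w(x2) - w(x1)) * (x2 - x1)

best = -1.0
for i in range(101):                      # p in {0, 0.01, ..., 1}
    p = i / 100
    for j in range(41):                   # x1 in {0, 0.05, ..., 2}
        x1 = l + j * (u - l) / 40
        for k in range(41):               # x2 in {0, 0.05, ..., 2}
            x2 = l + k * (u - l) / 40
            if x1 <= x2 and p * x1 + (1 - p) * x2 <= alpha + 1e-12:
                best = max(best, objective(p, x1, x2))

p_star = (u - alpha) / (u - l)            # KKT prediction: 0.75
predicted = objective(p_star, l, u)
print(best, predicted)  # the grid maximum matches the KKT prediction
```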
Note that in the case $\alpha > (\ell + u)/2$, we obtain that $\mathbb{E}[X^*] = (\ell + u)/2 < \alpha$, where $X^*$ is distributed according to the optimal probability measure. As a result, the constraint $\mathbb{E}[X] \le \alpha$ is redundant. Therefore, the support of the optimal $\mu_X$ has two members, which shows that the upper bound is tight in this case.

Conclusion

In this paper, we studied the capacity of a class of signal-dependent additive noise channels. These channels are of importance in molecular and optical communication; we also gave a number of new applications of such channels in the introduction. A set of necessary and a set of sufficient conditions for finiteness of capacity were given. We then introduced two new techniques for proving explicit lower bounds on the capacity. As a result, we obtained two lower bounds on the capacity. These lower bounds were helpful in inspecting when the channel capacity becomes infinite. We also provided an upper bound using the symmetrized KL divergence bound.
References

[1] S. M. Moser, "Capacity results of an optical intensity channel with input-dependent Gaussian noise," IEEE Transactions on Information Theory, vol. 58, no. 1, pp. 207–223, 2012.
[2] M. Pierobon and I. F. Akyildiz, "Diffusion-based noise analysis for molecular communication in nanonetworks," IEEE Transactions on Signal Processing, vol. 59, no. 6, pp. 2532–2547, 2011.
[3] G. Aminian, M. F. Ghazani, M. Mirmohseni, M. Nasiri-Kenari, and F. Fekri, "On the capacity of point-to-point and multiple-access molecular communications with ligand-receptors," IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 1, no. 4, pp. 331–346, 2016.
[4] H. Arjmandi, A. Gohari, M. Nasiri-Kenari, and F. Bateni, "Diffusion-based nanonetworking: A new modulation technique and performance analysis," IEEE Communications Letters, vol. 17, no. 4, pp. 645–648, 2013.
[5] A. Gohari, M. Mirmohseni, and M. Nasiri-Kenari, "Information theory of molecular communication: Directions and challenges," to appear in IEEE Transactions on Molecular, Biological and Multi-Scale Communications, 2016.
[6] K. V. Srinivas, A. W. Eckford, and R. S. Adve, "Molecular communication in fluid media: The additive inverse Gaussian noise channel," IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4678–4692, 2012.
[7] M. N. Khormuji, "On the capacity of molecular communication over the AIGN channel," in Information Sciences and Systems (CISS), 2011 45th Annual Conference on, pp. 1–4, IEEE, 2011.
[8] H. Li, S. M. Moser, and D. Guo, "Capacity of the memoryless additive inverse Gaussian noise channel," IEEE Journal on Selected Areas in Communications, vol. 32, no. 12, pp. 2315–2329, 2014.
[9] N. Farsad, Y. Murin, A. W. Eckford, and A. Goldsmith, "Capacity limits of diffusion-based molecular timing channels," arXiv:1602.07757, 2016.
[10] T. H. Chan, S. Hranilovic, and F. R. Kschischang, "Capacity-achieving probability measure for conditionally Gaussian channels with bounded inputs," IEEE Transactions on Information Theory, vol. 51, no. 6, pp. 2073–2088, 2005.
[11] J. G. Smith, "The information capacity of amplitude- and variance-constrained scalar Gaussian channels," Information and Control, vol. 18, no. 3, pp. 203–219, 1971.
[12] R. Jiang, Z. Wang, Q. Wang, and L. Dai, "A tight upper bound on channel capacity for visible light communications," IEEE Communications Letters, vol. 20, no. 1, pp. 97–100, 2016.
[13] A. Lapidoth, S. M. Moser, and M. A. Wigger, "On the capacity of free-space optical intensity channels," IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 4449–4461, 2009.
[14] R. R. Chen, B. Hajek, R. Koetter, and U. Madhow, "On fixed input distributions for noncoherent communication over high-SNR Rayleigh-fading channels," vol. 50, no. 12, pp. 3390–3396, 2004.
[15] G. Aminian, H. Arjmandi, A. Gohari, M. Nasiri-Kenari, and U. Mitra, "Capacity of diffusion-based molecular communication networks over LTI-Poisson channels," IEEE Transactions on Molecular, Biological and Multi-Scale Communications, vol. 1, no. 2, pp. 188–201, 2015.
[16] S. Ihara, Information Theory for Continuous Systems. Singapore: World Scientific, 1993.
[17] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, 2nd ed., 2006.
[18] F. Topsoe, "An information theoretical identity and a problem involving capacity," Studia Scientiarum Math. Hungarica, vol. 2, no. 10, pp. 291–292, 1967.
[19] H. Ghourchian, A. Gohari, and A. Amini, "Existence and continuity of differential entropy for a class of distributions," IEEE Communications Letters, 2017.
[20] E. M. Stein and R. Shakarchi, Real Analysis: Measure Theory, Integration, and Hilbert Spaces. New Jersey: Princeton University Press, 2005.
[21] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
A Proof of Equation (9)
Take some arbitrary $c > 0$. Then, equation (9) holds because
\[
\int_0^\infty \frac{1}{\sqrt{2\pi c v}}\, e^{-\frac{(y-v)^2}{2cv}}\, \mathrm{d}v
= \int_0^\infty \frac{1}{\sqrt{2\pi}} \left( \frac{v+y}{2v\sqrt{vc}} + \frac{v-y}{2v\sqrt{vc}} \right) e^{-\frac{(y-v)^2}{2cv}}\, \mathrm{d}v
\]
\[
= e^{\frac{2y}{c}} \int_0^\infty \frac{1}{\sqrt{2\pi}}\, \frac{v-y}{2v\sqrt{vc}}\, e^{-\frac{(y+v)^2}{2cv}}\, \mathrm{d}v
+ \int_0^\infty \frac{1}{\sqrt{2\pi}}\, \frac{v+y}{2v\sqrt{vc}}\, e^{-\frac{(y-v)^2}{2cv}}\, \mathrm{d}v
\]
(using $\frac{(y-v)^2}{2cv} = \frac{(y+v)^2}{2cv} - \frac{2y}{c}$)
\[
= e^{\frac{2y}{c}} \int_0^1 \frac{1}{\sqrt{2\pi}}\, \frac{v-y}{2v\sqrt{vc}}\, e^{-\frac{(y+v)^2}{2cv}}\, \mathrm{d}v
+ e^{\frac{2y}{c}} \int_1^\infty \frac{1}{\sqrt{2\pi}}\, \frac{v-y}{2v\sqrt{vc}}\, e^{-\frac{(y+v)^2}{2cv}}\, \mathrm{d}v
\]
\[
+ \int_0^1 \frac{1}{\sqrt{2\pi}}\, \frac{v+y}{2v\sqrt{vc}}\, e^{-\frac{(y-v)^2}{2cv}}\, \mathrm{d}v
+ \int_1^\infty \frac{1}{\sqrt{2\pi}}\, \frac{v+y}{2v\sqrt{vc}}\, e^{-\frac{(y-v)^2}{2cv}}\, \mathrm{d}v.
\]
Now utilize the change of variables
\[
v \mapsto u_1 = \frac{y-v}{\sqrt{vc}}, \qquad v \mapsto u_2 = \frac{y+v}{\sqrt{vc}}
\]
to re-express the above integrals. Note that
\[
\mathrm{d}u_1 = -\frac{v+y}{2v\sqrt{vc}}\, \mathrm{d}v, \qquad \mathrm{d}u_2 = \frac{v-y}{2v\sqrt{vc}}\, \mathrm{d}v.
\]
For $y > 0$: if $v = 0$ then $u_1 = +\infty$, $u_2 = +\infty$; if $v = +\infty$ then $u_1 = -\infty$, $u_2 = +\infty$; and if $v = 1$ then $u_1 = (y-1)/\sqrt{c}$, $u_2 = (y+1)/\sqrt{c}$. For $y < 0$: if $v = 0$ then $u_1 = -\infty$, $u_2 = -\infty$; if $v = +\infty$ then $u_1 = -\infty$, $u_2 = +\infty$; and if $v = 1$ then $u_1 = (y-1)/\sqrt{c}$, $u_2 = (y+1)/\sqrt{c}$. Now, for $y > 0$, the above sum equals
\[
- e^{\frac{2y}{c}} \int_{\frac{y+1}{\sqrt{c}}}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
+ e^{\frac{2y}{c}} \int_{\frac{y+1}{\sqrt{c}}}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
+ \int_{\frac{y-1}{\sqrt{c}}}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
+ \int_{-\infty}^{\frac{y-1}{\sqrt{c}}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u = 1,
\]
and for $y < 0$ it equals
\[
e^{\frac{2y}{c}} \int_{-\infty}^{\frac{y+1}{\sqrt{c}}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
+ e^{\frac{2y}{c}} \int_{\frac{y+1}{\sqrt{c}}}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
- \int_{-\infty}^{\frac{y-1}{\sqrt{c}}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
+ \int_{-\infty}^{\frac{y-1}{\sqrt{c}}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u
= e^{\frac{2y}{c}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, \mathrm{d}u = e^{\frac{2y}{c}}.
\]
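The identity can also be checked by direct numerical integration; the sketch below evaluates $\int_0^\infty (2\pi c v)^{-1/2} e^{-(y-v)^2/(2cv)}\, \mathrm{d}v$ for a few sample values of $y$ with the illustrative choice $c = 1$ (the quadrature parameters are ours, not from the paper).

```python
import math

# Numerical check of the identity proved above:
#   integral over v in (0, inf) of (2*pi*c*v)^(-1/2) * exp(-(y-v)^2/(2cv)) dv
# equals 1 for y > 0 and exp(2y/c) for y < 0. Illustrative choice: c = 1.
def integral(y, c=1.0, v_max=60.0, n=200000):
    total, dv = 0.0, v_max / n
    for i in range(1, n + 1):            # midpoint rule, avoiding v = 0
        v = (i - 0.5) * dv
        total += math.exp(-(y - v) ** 2 / (2 * c * v)) / math.sqrt(2 * math.pi * c * v) * dv
    return total

for y in (0.5, 1.0, 3.0):
    print(y, integral(y))                       # each close to 1
for y in (-0.5, -2.0):
    print(y, integral(y), math.exp(2 * y))      # close to e^(2y/c)
```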