Learning Task-Oriented Communication for Edge Inference: An Information Bottleneck Approach
Jiawei Shao, Student Member, IEEE, Yuyi Mao, Member, IEEE, and Jun Zhang, Senior Member, IEEE
Abstract
This paper investigates task-oriented communication for edge inference, where a low-end edge device transmits the extracted feature vector of a local data sample to a powerful edge server for processing. It is critical to encode the data into an informative and compact representation for low-latency inference given the limited bandwidth. We propose a learning-based communication scheme that jointly optimizes feature extraction, source coding, and channel coding in a task-oriented manner, i.e., targeting the downstream inference task rather than data reconstruction. Specifically, we leverage an information bottleneck (IB) framework to formalize a rate-distortion tradeoff between the informativeness of the encoded feature and the inference performance. As the IB optimization is computationally prohibitive for high-dimensional data, we adopt a variational approximation, namely the variational information bottleneck (VIB), to build a tractable upper bound. To reduce the communication overhead, we leverage a sparsity-inducing distribution as the variational prior for the VIB framework to sparsify the encoded feature vector. Furthermore, considering dynamic channel conditions in practical communication systems, we propose a variable-length feature encoding scheme based on dynamic neural networks to adaptively adjust the activated dimensions of the encoded feature to different channel conditions. Extensive experiments evidence that the proposed task-oriented communication system achieves a better rate-distortion tradeoff than baseline methods and significantly reduces the feature transmission latency in dynamic channel conditions.
The authors are with the Department of Electronic and Information Engineering, Hong Kong Polytechnic University, Hong Kong (E-mail: [email protected], {yuyi-eie.mao, jun-eie.zhang}@polyu.edu.hk). (The corresponding author is J. Zhang.)

Index Terms
Learning to communicate, task-oriented communication, edge inference, information bottleneck, variational approximation.
I. INTRODUCTION
The recent revival of artificial intelligence (AI) has led to its adoption in a broad spectrum of application domains, ranging from speech recognition [1] and natural language processing (NLP) [2] to computer vision [3] and augmented/virtual reality (AR/VR) [4]. Most recently, the potential of AI technologies has also been exemplified in communication systems [5], [6]. Aiming at delivering data with extreme levels of reliability and efficiency, various design problems of data-oriented communication, including transceiver structures [7], source/channel coding [8], signal detection [9], and radio resource management [10], have been revisited intensively using AI techniques, especially deep neural networks (DNNs), breeding the emerging area of "learning to communicate". It is widely perceived that learning-driven techniques are critical complements to traditional model-driven approaches for communication system designs that rely heavily on expert knowledge, and will undoubtedly transform wireless networks towards the next generation [11].

Meanwhile, emerging AI applications also raise new communication problems [12], [13]. To provide immersive user experience, DNN-based mobile applications need to be executed within the edge of wireless networks, which eliminates the excessive latency incurred by routing data to the Cloud, and is referred to as edge inference [13], [14]. Edge inference can be implemented by deploying DNNs at an edge server located in close proximity to mobile devices, known as edge-only inference. However, the transmission latency remains a bottleneck for applications with stringent delay requirements [15]–[17], as a huge volume of data (e.g., 3D images, high-definition videos, and point cloud data) needs to be uploaded. On the other hand, the resource-demanding nature of DNNs often makes it infeasible to deploy them as a whole locally for device-only inference due to the limited on-device computational resources [18].
Device-edge co-inference appears to be a prominent solution for fast edge inference [14], [19], [20], which reduces the communication overhead by harvesting the available computational resources at both the edge servers and mobile devices. A mobile device first extracts a compact feature vector from the raw input data using an affordable neural network and then uploads it for server-based processing. Nevertheless, most existing device-edge co-inference proposals simply split a pre-trained DNN into two subnetworks to be deployed at a device and a server, leaving feature compression and transmission to a traditional communication module [20]. Such a decoupled treatment ignores the interplay between wireless communications and the inference tasks, and thus fails to exploit the full benefits of collaborative inference, since the communication strategies can be adapted to specific tasks. To address this limitation and improve the inference performance, in this paper, we propose a task-oriented communication principle for edge inference and develop an innovative learning-driven approach under the framework of the information bottleneck (IB) [21].
A. Related Works and Motivations
The line of research on "learning to communicate" stems from the introductory article on deep learning for the physical layer design in [7], where information transmission was viewed as a data reconstruction task, and a communication system can thus be modeled by a DNN-based autoencoder with the wireless channel simulated by a non-trainable layer. The autoencoder-based framework for communication systems was later extended to a deep joint source-channel coding (JSCC) architecture for wireless image transmission in [8], which enjoys significant improvement of image reconstruction quality over separate source/channel coding techniques. JSCC has also been applied to natural language processing for text transmission, which was accomplished by incorporating the semantic information of sentences using recurrent neural networks [22]. It is worth noting that the aforementioned works focus on data-oriented communication, which targets transmitting data reliably given the limited radio resources.

Nevertheless, the objective of feature transmission for accurate and low-latency edge inference is not aligned with that of data-oriented communication, as part of the raw input data (e.g., nuisance, task-irrelevant information) is meaningless for the task. Thus, recovering the original data sample with high fidelity at the edge server results in redundant communication overhead, which leaves room for further compression. This insight is also supported by a basic principle from representation learning [23]: a good representation should be insensitive (or invariant) to nuisances such as translations, rotations, and occlusions. Thus, we advocate task-oriented communication for applications such as edge inference, to improve the efficiency by transmitting sufficient but minimal information for the downstream task.

There have been recent studies on feature compression and transmission for edge inference. Inspired by structured network compression [24], a two-step pruning algorithm was developed in [25] to trim the model while reducing the communication load. An orthogonal approach for efficient task-oriented communication is to compress the feature vector before transmission [26]–[28]. In particular, for the image classification task, an end-to-end architecture was proposed in [26] to jointly optimize the feature compression and encoding by integrating deep JSCC. In contrast to data-oriented communication that concerns data recovery metrics (e.g., the l2-distance or the bit error rate), the proposed architecture was directly trained with the cross-entropy loss for the targeted classification task and ignored the data reconstruction quality. The end-to-end training facilitates the mapping of task-relevant information to the channel symbols and omits the irrelevant part.
A similar idea was utilized to design feature compression and encoding schemes for image retrieval tasks at the wireless network edge in [29] and for point cloud data processing in [30].

While the end-to-end learning-driven architectures for task-oriented communication have been proven effective in saving communication bandwidth, there remain multiple unsolved restrictions that prevent them from unleashing their full potential. First, there lacks a systematic way to quantify the informativeness of the encoded feature vector and its impact on the inference task, which hinders achieving the best inference performance given the available resources. Besides, the dynamic wireless channel condition necessitates an adaptive encoding scheme for reliable feature transmission, which has received less attention in existing frameworks (e.g., [26]–[28], [31]). These form the main motivations of our study.

Data-oriented communication relies on classical source coding and channel coding theory, which, however, is not optimized for task-oriented communication. Recently, an information-theoretic design principle, named the information bottleneck (IB) [21], has been applied to investigate deep learning, which seeks the right balance between data fit and generalization by using the mutual information as both a cost function and a regularizer. Particularly, the IB framework maximizes the mutual information between the latent representation and the label of the data sample to promote high accuracy, while minimizing the mutual information between the representation and the input sample to promote generalization. Such a tradeoff between preserving the relevant information and finding a compact representation fits nicely with bandwidth-limited edge inference, and thus will be adopted as the main design principle in our study of task-oriented communication. The IB framework is inherently related to the communication problem of remote source coding (RSC) [32], and it has recently attracted great attention from both the machine learning and information theory communities [33]–[36]. Nevertheless, applying it to task-oriented communication demands additional optimization, which forms the main technical contribution of our study.

B. Contributions
In this paper, we develop effective methods for task-oriented communication for device-edge co-inference based on the IB principle [21]. Our major contributions are summarized as follows:

• We design the task-oriented communication system by formalizing a rate-distortion tradeoff using the IB framework. Our formulation aims at maximizing the mutual information between the inference result and the encoded feature while minimizing the mutual information between the encoded feature and the input data. Thus, it addresses the objectives of improving the inference accuracy and reducing the communication overhead, respectively. To the best of our knowledge, this is the first time that the IB is introduced to design wireless edge inference systems.

• As the mutual information terms in the IB formulation are generally intractable for DNNs with high-dimensional features, we leverage a variational approximation, known as the variational information bottleneck (VIB), to devise a tractable upper bound. Besides, by selecting a sparsity-inducing distribution as the variational prior, the VIB framework identifies and prunes the redundant dimensions of the encoded feature vector to reduce the communication overhead. The proposed method is named variational feature encoding (VFE).

• We extend the proposed task-oriented communication scheme to dynamic communication environments by enabling flexible adjustment of the transmitted signal length. In particular, we develop a variable-length variational feature encoding (VL-VFE) scheme based on dynamic neural networks that can adaptively adjust the active dimensions according to different channel conditions.

• The effectiveness of the proposed task-oriented communication schemes is validated in both static and dynamic channel conditions on image classification tasks. Extensive simulation results demonstrate that VFE and VL-VFE outperform the traditional communication design and existing learning-based joint source-channel coding for data-oriented communication.
C. Organization
The rest of the paper is organized as follows. Section II introduces the system model and describes the design objective of task-oriented communication. Section III and Section IV propose the task-oriented communication schemes in static and dynamic channel conditions, respectively. In Section V, we provide extensive simulation results to evaluate the performance of the proposed task-oriented communication schemes. Finally, Section VI concludes the paper.
D. Notations
Throughout this paper, upper-case letters (e.g., X and Y) and lower-case letters (e.g., x and y) stand for random variables and their realizations, respectively. The entropy of Y and the conditional entropy of Y given X are denoted as H(Y) and H(Y|X), respectively. The mutual information between X and Y is represented as I(X, Y), and the Kullback-Leibler (KL) divergence between two probability distributions p(x) and q(x) is denoted as D_KL(p||q). The statistical expectation of X is denoted as E(X). We further denote the Gaussian distribution with mean µ and covariance matrix Σ as N(µ, Σ) and use I to represent the identity matrix.

II. SYSTEM MODEL AND PROBLEM DESCRIPTION
A. System Model
We consider task-oriented communication in a device-edge co-inference system as shown in Fig. 1b, where two DNNs are deployed at the mobile device and the edge server, respectively, so that they can cooperate to perform inference tasks, e.g., image classification and object detection. The input data x and its target variable y (e.g., the label) are deemed as different realizations of a pair of random variables (X, Y). The encoded feature, the received (noise-corrupted) feature, and the inference result are respectively instantiated by the random variables Z, Ẑ, and Ŷ. These random variables constitute the following probabilistic graphical model:

Y → X → Z → Ẑ → Ŷ,    (1)

which satisfies the factorization p(ŷ, ẑ, z | x) = p_θ(ŷ|ẑ) p_channel(ẑ|z) p_φ(z|x), with DNN parameters θ and φ to be discussed below.

As shown in Fig. 1b, the on-device network defines the conditional distribution p_φ(z|x) parameterized by φ, which consists of a feature extractor and a JSCC encoder. The extractor first identifies the task-relevant feature from the raw input x, and then the JSCC encoder maps the feature values to the channel input symbols z. (While two components, i.e., a feature extractor and a JSCC encoder, are shown in Fig. 1b at the device, they can be regarded as a single DNN. We consider resource-constrained devices that can only afford light DNNs, which are unable to complete the inference task with sufficient accuracy. More details of the adopted neural network architecture will be discussed in Section V.)
(a) Data-oriented communication for device-edge co-inference. (b) Task-oriented communication for device-edge co-inference.
Fig. 1: Two kinds of communication schemes for device-edge co-inference: learning-based data-oriented and task-oriented communication. The green region corresponds to a mobile device, and the red region corresponds to an edge server. In data-oriented communication (top), a mobile device transmits the encoded feature z of the original data x (e.g., an image). Then, an edge server attempts to decode the data x̂ based on the noise-corrupted feature ẑ, and further utilizes x̂ as input to obtain the inference result ŷ (e.g., the label of the input data). In contrast, task-oriented communication (bottom) extracts and encodes useful information z jointly by the on-device network, and the receiver directly leverages ẑ to obtain the inference result ŷ, without recovering the original data. Therefore, z can be a highly compressed representation since the task-unrelated information can be discarded.

Since both the extractor and the encoder are parameterized by DNNs, these two modules can be jointly trained in an end-to-end manner. Then, the encoded feature z is transmitted to the server over the noisy wireless channel, and the server receives the noise-corrupted feature ẑ. In this paper, we assume a scalar Gaussian channel between the mobile device and the edge server for simplicity, which is modeled as a non-trainable layer with the transfer function ẑ = z + ε. The additive channel noise ε is sampled from a zero-mean Gaussian distribution with σ² as the noise variance, i.e., ε ∼ N(0, σ²I). To account for the limited transmit power at the mobile device, we constrain the power of each dimension of the encoded feature vector to be below P, i.e., z_i² ≤ P, ∀ i = 1, ..., n, with n the dimension of the encoded feature vector. Thus, the channel condition can be characterized by the peak signal-to-noise ratio (PSNR), defined as

PSNR = 10 \log_{10}\left(\frac{P}{\sigma^2}\right) \text{ (dB)}.

Note that the adopted channel model can be easily extended to other communication environments by modeling the channel as a non-trainable neural network layer with a continuous transfer function [37]. Finally, the server-based network leverages ẑ for further processing and outputs the inference result ŷ with the distribution p_θ(ŷ|ẑ) parameterized by θ.
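For concreteness, the following is a minimal PyTorch-style sketch of the non-trainable Gaussian channel layer and the PSNR definition above. The class and function names are our own illustration and are not taken from the authors' implementation.

```python
import math
import torch

class GaussianChannel(torch.nn.Module):
    """Non-trainable AWGN layer modeling z_hat = z + eps with eps ~ N(0, sigma^2 I)."""
    def forward(self, z: torch.Tensor, sigma: float) -> torch.Tensor:
        return z + sigma * torch.randn_like(z)

def psnr_db(peak_power: float, sigma: float) -> float:
    """PSNR = 10 log10(P / sigma^2) for the per-dimension constraint z_i^2 <= P."""
    return 10.0 * math.log10(peak_power / sigma ** 2)

# Example: with unit peak power (Tanh output) and sigma = 0.1, psnr_db(1.0, 0.1) = 20 dB.
```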
B. Problem Description

The communication overhead is characterized by the number of nonzero dimensions of the output of the JSCC encoder. Intuitively, if symbols over more dimensions are transmitted, the edge server receives a higher-quality feature vector, which leads to higher inference accuracy, but it also induces higher communication overhead and latency. There is thus an inherent tradeoff between the inference performance and the communication overhead, which is a key ingredient in the design of task-oriented communication and can be regarded as a new and special kind of rate-distortion tradeoff. Therefore, we resort to the information bottleneck (IB) principle [21] to formulate an optimization problem that minimizes the following objective function:

\mathcal{L}_{IB}(\phi) = \underbrace{-I(\hat{Z}, Y)}_{\text{Distortion}} + \beta \underbrace{I(\hat{Z}, X)}_{\text{Rate}}
= \mathbb{E}_{p(\mathbf{x}, y)}\big\{\mathbb{E}_{p_{\phi}(\hat{\mathbf{z}}|\mathbf{x})}[-\log p(y|\hat{\mathbf{z}})] + \beta D_{KL}(p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) \| p(\hat{\mathbf{z}}))\big\} - \underbrace{H(Y)}_{\text{constant}}
\equiv \mathbb{E}_{p(\mathbf{x}, y)}\big\{\mathbb{E}_{p_{\phi}(\hat{\mathbf{z}}|\mathbf{x})}[-\log p(y|\hat{\mathbf{z}})] + \beta D_{KL}(p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) \| p(\hat{\mathbf{z}}))\big\},    (2)

where the equivalence in the last row is in the sense of optimization, ignoring the constant term H(Y). The objective function is a weighted sum of two mutual information terms, with β > 0 controlling the tradeoff. Specifically, the quantity I(Ẑ, X) is interpreted as the information about X preserved in Ẑ, measured by the minimum description length [38] (or rate). Besides, since the entropy of Y, i.e., H(Y), is a constant determined by the data distribution, minimizing the term -I(Ẑ, Y) is equivalent to minimizing the conditional entropy H(Y|Ẑ), which characterizes the uncertainty (distortion) of the inference result Y given the received noise-corrupted feature vector Ẑ. (Note that the IB objective function does not depend on the parameters θ, since the distribution p(y|ẑ) is fully determined by p(x, y), p_φ(z|x), and p_channel(ẑ|z).) Thus, the IB principle formalizes a rate-distortion tradeoff for edge inference systems and minimizes the conditional mutual information I(X, Ẑ|Y), which corresponds to the amount of redundant information that needs to be transmitted. Compared with data-oriented communication, the IB framework retains only the task-relevant information and results in an I(Ẑ, X) that is much smaller than H(X), which reduces the communication overhead.

C. Main Challenges
The IB framework is promising for task-oriented communication as it explicitly quantifies the informativeness of the encoded feature vector and offers a formalization of the rate-distortion tradeoff in edge inference. However, there are three main challenges when applying it to develop practical feature encoding methods, listed as follows.

• Estimation of mutual information: Computing the mutual information terms for high-dimensional data with unknown distributions is challenging, since the empirical estimate of the probability distribution requires the number of samples to increase exponentially with the dimension [39]. Therefore, developing a tractable estimator for the mutual information is critical to make the problem solvable.

• Effective control of the communication overhead: Minimizing the mutual information between the input data and the feature vector indeed reduces the redundancy about task-unrelated information. However, there is no direct link between redundancy reduction and feature sparsification, which controls the communication overhead of a JSCC encoder. Thus, to reduce the communication overhead, an effective method is needed to aggregate the nuisance into the expendable dimensions so that the number of symbols to be transmitted is minimized.

• Dynamic channel conditions: The hostile wireless channel always poses significant challenges for communication systems, and in particular the channel dynamics have to be accounted for. Dynamically adjusting the encoded feature length with DNNs is nontrivial, as the neural network structure is fixed after initialization, and changing the activation of neurons according to the channel conditions calls for additional control modules.

The following two sections tackle these challenges and develop effective methods for task-oriented communication. The effectiveness of the proposed methods will be tested in Section V.
III. VARIATIONAL FEATURE ENCODING
In this section, we develop a variational information bottleneck (VIB) framework to resolve the difficulty of the mutual information computation in the original IB objective in (2). Besides, we show that, by selecting a sparsity-inducing distribution as the variational prior, minimizing the mutual information between the raw input data X and the noise-corrupted feature Ẑ facilitates the sparsification of Ẑ by pruning the task-irrelevant dimensions. Such an activation pruning scheme, i.e., removing neurons in a DNN, is effective in reducing the overhead of task-oriented communication. Based on this idea, we name our proposed method variational feature encoding (VFE). This section assumes a static channel condition, while dynamic channels will be treated in Section IV.

A. Variational Information Bottleneck Reformulation
The variational method is a natural way to approximate intractable computations based on some adjustable parameters (e.g., the weights of DNNs), and it has been widely applied in machine learning, e.g., in the variational autoencoder [40]. In the VIB framework, the central idea is to introduce a set of approximating densities for the intractable distributions.

Revisiting the probabilistic graphical model in (1), the distribution p_φ(ẑ|x) is determined by the on-device DNN and the channel model, i.e., p_φ(ẑ|x) = ∫ p_channel(ẑ|z) p_φ(z|x) dz. In particular, as we adopt a deterministic on-device network, p_φ(z|x) can be regarded as a Dirac delta function. Then, we have p_φ(ẑ|x) = N(z(x; φ), σ²I), where the deterministic function z(x; φ), parameterized by φ, maps x to z. For notational simplicity, we write p_φ(ẑ|x) = N(ẑ | z(x; φ), σ²I) as p_φ(ẑ|x) = N(z, σ²I).

With a known distribution p_φ(ẑ|x) and the joint data distribution p(x, y), the distributions p(ẑ) and p(y|ẑ) are fully characterized by the underlying Markov chain Y ↔ X ↔ Ẑ. Unfortunately, these two distributions are intractable due to the following high-dimensional integrals:

p(\hat{\mathbf{z}}) = \int p(\mathbf{x}) p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) \, d\mathbf{x}, \qquad p(y|\hat{\mathbf{z}}) = \int \frac{p(\mathbf{x}, y) p_{\phi}(\hat{\mathbf{z}}|\mathbf{x})}{p(\hat{\mathbf{z}})} \, d\mathbf{x}.

To overcome this issue, we apply two variational distributions q(ẑ) and q_θ(y|ẑ) to approximate the true distributions p(ẑ) and p(y|ẑ), respectively, where θ denotes the parameters of the server-based network shown in Fig. 1b that computes the inference result ŷ. Therefore, we recast the objective function in (2) as follows:

\mathcal{L}_{VIB}(\phi, \theta) = \mathbb{E}_{p(\mathbf{x}, y)}\big\{\mathbb{E}_{p_{\phi}(\hat{\mathbf{z}}|\mathbf{x})}[-\log q_{\theta}(y|\hat{\mathbf{z}})] + \beta D_{KL}(p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) \| q(\hat{\mathbf{z}}))\big\}.    (3)

The above formulation is termed the variational information bottleneck (VIB) [36], which provides an upper bound on the IB objective function in (2); details of the derivation are deferred to Appendix A. By further applying the reparameterization trick [40] and Monte Carlo sampling, we can obtain an unbiased estimate of the gradient and hence optimize the objective using stochastic gradient descent. In particular, given a mini-batch of data {(x_m, y_m)}_{m=1}^{M} and sampling the channel noise L times for each pair (x_m, y_m), we have the following empirical estimate:

\mathcal{L}_{VIB}(\phi, \theta) \simeq \frac{1}{M} \sum_{m=1}^{M} \left\{ \frac{1}{L} \sum_{l=1}^{L} [-\log q_{\theta}(y_m|\hat{\mathbf{z}}_{m,l})] + \beta D_{KL}(p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}_m) \| q(\hat{\mathbf{z}})) \right\},    (4)

where ẑ_{m,l} = z_m + ε_{m,l} and ε_{m,l} ∼ N(0, σ²I).

In the next subsection, we illustrate that minimizing the VIB objective helps to prune the redundant dimensions of the encoded feature vector, and thus it serves as a suitable and tractable objective for task-oriented communication.
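As an illustration, a minimal PyTorch-style sketch of the Monte Carlo estimate in (4) is given below. The encoder, classifier, and the KL term kl_fn (determined by the chosen variational prior, cf. (5)-(6) below) are assumed to be user-supplied modules; all names are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def vib_loss(encoder, classifier, kl_fn, x, y, sigma, beta, num_samples=1):
    """Monte Carlo estimate of the VIB objective (3)-(4) for a deterministic
    encoder z(x; phi) and an AWGN channel; kl_fn supplies the KL term for the
    chosen variational prior."""
    z = encoder(x)                                   # encoded feature, shape (M, n)
    ce = z.new_zeros(())
    for _ in range(num_samples):                     # L channel-noise samples per data point
        z_hat = z + sigma * torch.randn_like(z)      # z_hat_{m,l} = z_m + eps_{m,l}
        ce = ce + F.cross_entropy(classifier(z_hat), y)
    ce = ce / num_samples                            # averaged -log q_theta(y_m | z_hat_{m,l})
    return ce + beta * kl_fn(z, sigma).mean()        # distortion + beta * rate
```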
B. Redundancy Reduction and Feature Sparsification

As we leverage the IB principle instantiated via a variational approximation, minimizing the KL-divergence term D_KL(p_φ(ẑ|x) || q(ẑ)) shall reduce the redundancy in the feature Ẑ. However, it does not guarantee sparse activations in the feature encoding process. For example, if the reduced redundancy is distributed equally across all dimensions and each dimension still preserves task-related information, the encoded feature vector may have a high dimension that leads to a high communication overhead. To obtain a feature vector Ẑ that aggregates the task-irrelevant information into certain expendable dimensions through end-to-end training, we adopt the log-uniform distribution as the variational prior q(ẑ) to induce sparsity [41]. In particular, we choose the mean-field variational approximation [40] to alleviate the computational complexity, i.e., given an n-dimensional ẑ, q(ẑ) = ∏_{i=1}^{n} q(ẑ_i). Specifically, for each dimension ẑ_i, the variational prior distribution is chosen as

q(\log |\hat{z}_i|) = \text{constant}.

Since p_φ(ẑ|x) = ∏_{i=1}^{n} p_φ(ẑ_i|x), the KL-divergence term in (3) can be decomposed into a summation:

D_{KL}(p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) \| q(\hat{\mathbf{z}})) = \sum_{i=1}^{n} D_{KL}(p_{\phi}(\hat{z}_i|\mathbf{x}) \| q(\hat{z}_i)).    (5)

Nevertheless, as the KL-divergence terms in (5) do not have closed-form expressions, we utilize the approximation proposed in [42]:

-D_{KL}(p_{\phi}(\hat{z}_i|\mathbf{x}) \| q(\hat{z}_i)) = \frac{1}{2}\log \alpha_i - \mathbb{E}_{\epsilon \sim \mathcal{N}(1, \alpha_i)} \log |\epsilon| + C \approx k_1 S(k_2 + k_3 \log \alpha_i) - 0.5 \log\left(1 + \alpha_i^{-1}\right) + C,    (6)

where α_i = σ²/z_i², k_1 = 0.63576, k_2 = 1.87320, k_3 = 1.48695, C is a constant, z_i is the i-th dimension of z, and S(·) denotes the sigmoid function. It can be verified that the approximate KL-divergence approaches its minimum when α_i goes to infinity (i.e., z_i goes to zero), so minimizing this term encourages the value of z_i to be small. Empirical results in Section V show that the selected sparsity-inducing distribution sparsifies some dimensions of z, i.e., z_i ≡ 0 for arbitrary inputs, which can then be pruned to reduce the communication overhead.
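Below is a hedged sketch of the approximate KL-divergence in (6), written as a PyTorch function that drops the additive constant C; it can serve as the kl_fn argument in the VIB loss sketch of the previous subsection. The numerical constants follow the approximation of [42], and the function name is our own.

```python
import math
import torch
import torch.nn.functional as F

# Constants of the approximation in [42] for the KL divergence to a log-uniform prior.
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def kl_log_uniform(z: torch.Tensor, sigma: float, eps: float = 1e-8) -> torch.Tensor:
    """Approximate sum_i D_KL(p_phi(z_hat_i|x) || q(z_hat_i)) from (5)-(6),
    with alpha_i = sigma^2 / z_i^2 and the additive constant C dropped.
    Returns one value per sample in the batch."""
    log_alpha = 2.0 * (math.log(sigma) - torch.log(z.abs() + eps))
    # 0.5 * log(1 + 1/alpha) = 0.5 * softplus(-log_alpha)
    neg_kl = K1 * torch.sigmoid(K2 + K3 * log_alpha) - 0.5 * F.softplus(-log_alpha)
    return -neg_kl.sum(dim=-1)
```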
While the selected variational prior helps to promote sparsity in the feature vector, we still need an effective method to determine which dimensions can be pruned. Maintaining z_i ≡ 0 requires all the weights and the bias corresponding to z_i in this layer to converge to zero. However, checking each parameter is time-consuming in a large-scale DNN. To develop an efficient solution, we revisit the fully-connected (FC) layer, in which neurons have full connections to all activations in the previous layer, so their activations can be computed by a matrix multiplication followed by an offset:

FC(\mathbf{x}) = W\mathbf{x} + \mathbf{b} = \widetilde{W}\tilde{\mathbf{x}},

where \widetilde{W} = [W, \mathbf{b}] is an augmented matrix and \tilde{\mathbf{x}} = [\mathbf{x}^T, 1]^T is an augmented vector. Note that z_i ≡ 0 is equivalent to the i-th row of \widetilde{W}, i.e., \widetilde{W}_{i\cdot}, being a zero vector. Therefore, we introduce a dimension importance vector γ as the scale factor for each row of \widetilde{W}, where γ_i is the i-th dimension of γ. For each output element z_i (i.e., the i-th element of z(x; φ)), the mapping from x to z is defined as follows:

z_i = \text{Tanh}\left( \gamma_i \frac{\widetilde{W}_{i\cdot}}{\|\widetilde{W}_{i\cdot}\|_2} f(\mathbf{x}) \right),    (7)

where the function f(x) is defined by the on-device network and its output is the activation of the last layer. Since the Tanh activation function has an output range from -1 to 1, the peak transmit power P is constrained to 1. Both γ and \widetilde{W} are learnable parameters in φ. As the weight vector \widetilde{W}_{i\cdot} is normalized by its l2-norm, the magnitude of z_i depends strongly on γ_i. When γ_i is close to zero, z_i is also close to zero, and the corresponding p_φ(ẑ|x) degrades to the channel noise distribution carrying no valid information. Based on this idea, we eliminate the redundant dimensions whose parameter γ_i falls below a threshold γ_0. Note that the formula in (7) can be easily extended to convolutional layers by replacing the matrix multiplication with a convolution. Such a variational pruning process is one of the main components of the proposed VFE method. The training procedure for VFE is illustrated in Algorithm 1.
Algorithm 1: Training Variational Feature Encoding (VFE)
Input: T (number of iterations), n (output dimension of the encoder), L (number of channel noise samples per data point), batch size M, channel noise variance σ², and pruning threshold γ_0.
while epoch t = 1 to T do
    Select a mini-batch of data {(x_m, y_m)}_{m=1}^{M}
    Compute the encoded feature vectors {z_m}_{m=1}^{M} based on (7)
    Compute the approximate KL-divergence based on (6)
    for m = 1 to M do
        Sample the channel noise {ε_{m,l}}_{l=1}^{L} ∼ N(0, σ²I)
    end for
    Compute the loss L_VIB(φ, θ) based on (4)
    Update the parameters φ, θ through backpropagation
    for i = 1 to n do
        if γ_i ≤ γ_0 then
            Prune the i-th dimension of the encoded feature vector
        end if
    end for
end while
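For illustration, the following PyTorch-style sketch implements the encoder output layer in (7) together with the threshold-based pruning used in Algorithm 1. It is a simplified reading of the method under our own naming, and the default threshold value is an assumption; the authors' released code may differ.

```python
import torch
import torch.nn as nn

class VFEOutputLayer(nn.Module):
    """Sketch of the encoder output layer in (7): a row-normalized linear map
    scaled by a learnable importance gamma_i and squashed by Tanh."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_dim, in_dim + 1))  # W_tilde = [W, b]
        self.gamma = nn.Parameter(torch.ones(out_dim))                        # dimension importance

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        feat_aug = torch.cat([feat, torch.ones_like(feat[:, :1])], dim=1)     # x_tilde = [x^T, 1]^T
        w_norm = self.weight / self.weight.norm(dim=1, keepdim=True)          # row-wise l2 normalization
        return torch.tanh(self.gamma * (feat_aug @ w_norm.t()))

    def active_dims(self, gamma0: float = 0.01) -> torch.Tensor:
        """Indices kept after variational pruning, i.e., gamma_i > gamma_0 (gamma0 is illustrative)."""
        return (self.gamma.abs() > gamma0).nonzero(as_tuple=True)[0]
```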
IV. VARIABLE-LENGTH VARIATIONAL FEATURE ENCODING
The task-oriented communication scheme developed in Section III assumes static wireless channels. In practice, wireless data transmission may experience changes due to various factors such as beam blockage and signal attenuation. This necessitates instant link adaptation to improve the efficiency of feature encoding for low-latency inference. In this section, we extend our findings in Section III and propose a new encoding scheme, namely variable-length variational feature encoding (VL-VFE), by designing a dynamic neural network, which admits flexible control of the encoded feature dimension.
A. Background on Dynamic Neural Networks
Dynamic neural networks, referring to DNNs that are able to adapt their architectures to the given input, are effective in improving the efficiency of DNN processing via selective execution. For example, several prior works (e.g., [43]–[45]) proposed to learn a binary gating module to adaptively skip layers or prune channels based on the input data. Besides, there are also some variants of dynamic neural networks, including slimmable neural networks and the "Once-for-All" architecture. In particular, the inventors of slimmable neural networks [46] proposed to train a single model that supports layers with arbitrary widths, while the authors of [18] proposed the "Once-for-All" architecture with a progressive shrinking algorithm, which trains one network that supports diverse sub-networks. In this work, we employ the idea of selective activation, as shown in Fig. 2, to learn an encoder that can adjust the number of activated neurons according to the channel conditions.
B. Selective Activation for Dynamic Channel Conditions
We propose the variable-length variational feature encoding (VL-VFE) scheme, which is empowered with the capability of adjusting its output length under different channel conditions. Such a channel-adaptive feature encoding scheme should have the following two properties:

• The activated dimensions of the feature z can be adjusted in the DNN forward propagation according to the channel conditions. More dimensions should be activated under bad channel conditions, and vice versa.

• The activated dimensions start consecutively from the first dimension (shown in Fig. 2b), which avoids transmitting the indexes of the activated dimensions using extra communication resources.

(a) Random activation. (b) Consecutive activation.
Fig. 2: Two types of selective activations: random activation and consecutive activation. Under different channel conditions (e.g., different PSNRs), the same DNN can be executed with different activated dimensions to balance the achievable inference performance and the incurred communication overhead. Random activation does not require the dimensions to be activated in order, while consecutive activation forces the activated dimensions to be consecutive, starting from the first dimension.
In practical communication systems, the mobile device can be aware of the channel condition via a feedback channel. Therefore, the channel condition can be incorporated into the feature encoding process. Because the amplitude of the encoded feature vector is constrained to 1 by the Tanh function, the noise variance σ² suffices to represent the PSNR and is adopted as an extra input of the feature encoder. In the training process, the noise level σ is regarded as a random variable distributed within a given range to model the dynamic channel conditions. For simplicity, we sample σ from a uniform distribution p(σ). As the noise level is independent of the dataset, we have p(x, y, σ) = p(x, y) p(σ). The loss function in (3) is thus revised as follows:

\tilde{\mathcal{L}}_{VIB}(\phi, \theta) = \mathbb{E}_{p(\mathbf{x}, y, \sigma)}\big\{\mathbb{E}_{p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}, \sigma)}[-\log q_{\theta}(y|\hat{\mathbf{z}})] + \beta D_{KL}(p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}, \sigma) \| q(\hat{\mathbf{z}}))\big\}.    (8)

Similarly, we adopt Monte Carlo sampling as in (4) to estimate \tilde{\mathcal{L}}_{VIB}:

\tilde{\mathcal{L}}_{VIB}(\phi, \theta) \simeq \frac{1}{M} \sum_{m=1}^{M} \left\{ \frac{1}{L} \sum_{l=1}^{L} [-\log q_{\theta}(y_m|\hat{\mathbf{z}}_{m,l})] + \beta D_{KL}(p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}_m, \sigma_m) \| q(\hat{\mathbf{z}})) \right\},    (9)

where ẑ_{m,l} = z_m + ε_{m,l}, σ_m ∼ p(σ), ε_{m,l} ∼ N(0, σ_m²I), and, for a given z_m, the channel noise is sampled L times. Then, as the encoding scheme should be channel-adaptive, we have p_φ(ẑ|x, σ) = N(z(x; φ, σ), σ²I), where the function z(x; φ, σ) determined by the on-device network incorporates σ as an input variable. Hence, the mapping in (7) is modified as follows:

z_i = \text{Tanh}\left( \gamma_i(\sigma) \frac{\widetilde{W}_{i\cdot}}{\|\widetilde{W}_{i\cdot}\|_2} f(\mathbf{x}, \sigma) \right),    (10)

where both the dimension importance γ_i(σ) (i.e., the i-th element of γ(σ)) and f(x, σ) are functions of the channel condition (i.e., the channel noise level σ). Rather than directly training a gating network to control the activated dimensions as in other dynamic neural networks (e.g., [43]–[45]), γ(σ) can adaptively prune the redundant dimensions of the encoded feature vector for different σ, thanks to the intrinsic sparsity discussed in Section III. As a result, in the device-edge co-inference system, the activated dimensions of the encoded feature vector can easily be decided by setting a threshold on γ(σ). Besides, as VL-VFE needs to meet the consecutive activation property, we define the function γ(σ) to induce a particular group sparsity pattern; for the i-th element γ_i(σ), the expression is constructed as follows:

\gamma_i(\sigma) = \sum_{j=i}^{n} g_j(\sigma),    (11)

where g_j(·) denotes the j-th output dimension of the function g(·), which is parameterized by a lightweight multi-layer perceptron (MLP). By constraining the range of the parameters in the MLP, each function g_j(σ) can be made a non-negative increasing function, which naturally leads to γ_i(σ) ≥ γ_j(σ), ∀ j > i, and γ_i(σ) ≥ γ_i(σ̂), ∀ σ ≥ σ̂. Therefore, given a threshold γ_0, the VL-VFE method summarized in Algorithm 2 activates the dimensions consecutively, and more dimensions are activated under adverse channel conditions. Details of the MLP structure and parameter constraints are deferred to Appendix B.
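The construction of γ(σ) in (11) can be sketched as follows, using the weight-constrained MLP described in Appendix B to obtain a non-negative, increasing g(σ). The class name, layer sizes, and initialization scale are illustrative; the paper only specifies an MLP with three hidden layers of 16 units.

```python
import torch
import torch.nn as nn

class GammaNetwork(nn.Module):
    """Sketch of gamma(sigma) in (11): an MLP g(sigma) with non-negative weights
    and tanh activations yields non-negative outputs that are increasing in sigma;
    a reversed cumulative sum gives gamma_i(sigma) = sum_{j>=i} g_j(sigma)."""
    def __init__(self, out_dim: int, hidden: int = 16):
        super().__init__()
        self.layers = nn.ParameterList([
            nn.Parameter(0.1 * torch.rand(hidden, 1)),
            nn.Parameter(0.1 * torch.rand(hidden, hidden)),
            nn.Parameter(0.1 * torch.rand(hidden, hidden)),
            nn.Parameter(0.1 * torch.rand(out_dim, hidden)),
        ])

    def forward(self, sigma: torch.Tensor) -> torch.Tensor:
        h = sigma.view(1, 1)
        for w in self.layers:
            h = torch.tanh(h @ w.abs().t())   # abs(W) keeps each g_j non-negative and non-decreasing
        g = h.squeeze(0)                      # g_j(sigma) >= 0
        return torch.flip(torch.cumsum(torch.flip(g, dims=[0]), dim=0), dims=[0])
```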
C. Training Procedure for the Dynamic Neural Network

To train a dynamic neural network with selective activation under different channel conditions, we naturally average the losses sampled from different channel conditions. In each training iteration, for simplicity, we uniformly sample σ from the possible PSNR range. Different from the training procedure in Algorithm 1, VL-VFE deactivates each dimension with γ_i(σ) ≤ γ_0 rather than permanently pruning it, as the function γ(σ) is not stable until convergence. The details are summarized in Algorithm 2.
Algorithm 2: Training Variable-Length Variational Feature Encoding (VL-VFE)
Input: T (number of iterations), n (output dimension of the encoder), L (number of channel noise samples per data point), batch size M, noise variance distribution p(σ), and deactivation threshold γ_0.
while epoch t = 1 to T do
    Get a mini-batch of data {(x_m, y_m)}_{m=1}^{M}
    Sample the channel noise levels {σ_m}_{m=1}^{M} ∼ p(σ)
    Compute the encoded feature vectors {z_m}_{m=1}^{M} based on (10)
    for m = 1 to M do
        Sample the channel noise {ε_{m,l}}_{l=1}^{L} ∼ N(0, σ_m²I)
        for i = 1 to n do
            if γ_i(σ_m) ≤ γ_0 then
                Deactivate the i-th dimension of z_m in this iteration
            end if
        end for
    end for
    Compute the approximate KL-divergence based on (6)
    Compute the loss L̃_VIB(φ, θ) based on (9)
    Update the parameters φ, θ through backpropagation
end while
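For completeness, a hedged PyTorch-style sketch of one iteration of Algorithm 2 is given below. It reuses the kl_log_uniform helper sketched in Section III, assumes the encoder takes (x, σ) as input and already applies the γ(σ) scaling of (10), and only masks the dimensions whose importance falls below the threshold for the current iteration; the names, threshold, and noise-level range are illustrative.

```python
import torch
import torch.nn.functional as F

def vlvfe_training_step(encoder, gamma_net, classifier, optimizer, x, y, beta,
                        sigma_range=(0.056, 0.316), gamma0=0.01, num_samples=1):
    """One illustrative iteration of Algorithm 2. The sigma range roughly
    corresponds to PSNRs of 25-10 dB under unit peak power (assumption)."""
    sigma = float(torch.empty(1).uniform_(*sigma_range))        # sample the channel condition
    gamma = gamma_net(torch.tensor(sigma))                       # gamma(sigma) as in (11)
    mask = (gamma > gamma0).float()
    z = encoder(x, sigma) * mask                                 # deactivate redundant dimensions
    loss = beta * kl_log_uniform(z, sigma).mean()                # rate term, reusing the (6) sketch
    for _ in range(num_samples):                                 # Monte Carlo channel samples
        z_hat = z + sigma * torch.randn_like(z)
        loss = loss + F.cross_entropy(classifier(z_hat), y) / num_samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```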
V. PERFORMANCE EVALUATION
In this section, we evaluate the performance of the proposed task-oriented communication schemes in both static and dynamic channel environments and investigate the rate-distortion tradeoff. An ablation study is also conducted to illustrate the importance of an appropriate choice of the variational prior distribution discussed in Section III, i.e., a sparsity-inducing prior distribution can force some dimensions of the encoded feature vector to zero without over-shrinking the other dimensions.
A. Experimental Setup

1) Datasets:
We test the proposed task-oriented communication schemes on two benchmark image classification datasets, MNIST [47] and CIFAR-10 [48]. The MNIST dataset of handwritten digits from "0" to "9" has a training set of 60,000 sample images and a test set of 10,000 sample images. The CIFAR-10 dataset consists of 60,000 color images in 10 classes, with 5,000 training images per class and 10,000 test images.
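A minimal torchvision-based loading sketch for the two datasets is shown below. The preprocessing (plain tensor conversion for MNIST and standard per-channel normalization for CIFAR-10) is an assumption, as the exact augmentation pipeline is not specified in this section.

```python
import torch
from torchvision import datasets, transforms

mnist_train = datasets.MNIST("./data", train=True, download=True,
                             transform=transforms.ToTensor())
cifar_train = datasets.CIFAR10("./data", train=True, download=True,
                               transform=transforms.Compose([
                                   transforms.ToTensor(),
                                   transforms.Normalize((0.4914, 0.4822, 0.4465),
                                                        (0.2470, 0.2435, 0.2616))]))
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=128, shuffle=True)
```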
2) Baselines:
We compare the proposed methods against two learning-based communication schemes for device-edge co-inference: DeepJSCC [8] and learning-based quantization [49].

• DeepJSCC: DeepJSCC is a learning-based JSCC method for data-oriented communication, which maps the input data directly to the channel symbols via a JSCC encoder. The communication cost of DeepJSCC is proportional to the output dimension of the feature encoder.

• Learning-based Quantization: This scheme quantizes the floating-point values in the encoded feature vector into low-precision data representations (e.g., the 2-bit fixed-point format). Such a quantization method imitates lossy source coding, and it therefore requires an extra step of channel coding before transmission for error correction. Note that designing a universally optimal channel coding scheme for different channel conditions in the finite block-length regime is highly nontrivial [50]. For fair comparisons, we assume an adaptive channel coding scheme that achieves the following communication rate:

C(P, \sigma^2) = \min\left\{ \log_2\left(1 + \sqrt{\frac{2P}{\pi e \sigma^2}}\right), \ \frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right) \right\} \text{ (bits per symbol)},    (12)

where P/σ² is the PSNR. This formula was shown to be a tight upper bound on the capacity of the amplitude-limited scalar Gaussian channel in [51].
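The rate bound in (12) and the resulting latency of the quantization baseline can be computed with the following sketch, assuming one channel symbol per channel use at a symbol rate equal to the bandwidth; the helper names and defaults are our own.

```python
import math

def rate_bound_bits_per_symbol(peak_power: float, sigma2: float) -> float:
    """Capacity upper bound in (12) for the amplitude-limited Gaussian channel."""
    mckellips = math.log2(1.0 + math.sqrt(2.0 * peak_power / (math.pi * math.e * sigma2)))
    awgn = 0.5 * math.log2(1.0 + peak_power / sigma2)
    return min(mckellips, awgn)

def quantization_latency_ms(n_dims: int, bits_per_dim: int,
                            peak_power: float, sigma2: float,
                            bandwidth_hz: float = 12.5e3) -> float:
    """Latency of the quantization baseline: total bits divided by the product of
    the symbol rate (assumed equal to the bandwidth) and the bits per symbol."""
    symbols = n_dims * bits_per_dim / rate_bound_bits_per_symbol(peak_power, sigma2)
    return 1e3 * symbols / bandwidth_hz
```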
3) Metrics:
We mainly focus on the rate-distortion tradeoff in task-oriented communication. For the classification tasks, we use the classification accuracy to measure the inference performance (corresponding to the "distortion"), and adopt the communication latency as an indicator of the "rate". In the following experiments, we set the bandwidth W to 12.5 kHz, which corresponds to the bandwidth of a very high frequency (VHF) narrowband channel [52], to calculate the latency.
4) Neural Network Architecture:
Carefully designing the on-device network is important due to the limited onboard computation and memory resources. Besides, as the DNN structure affects both the inference performance and the communication overhead, all methods adopt the same architecture for fair comparisons, as described below. (The code to reproduce the simulation results will be made available soon.)

• For the MNIST classification experiment, we assume a microcontroller unit (e.g., the ARM STM32F4 series) as the mobile device, whose memory (RAM) is less than 0.5 MB.

TABLE I: The DNN structure for the MNIST classification task.
Layer                                Output Dimensions
On-device Network
  Fully-connected Layer + Tanh       n
Server-based Network
  Fully-connected Layer + ReLU       1024
  Fully-connected Layer + ReLU       256
  Fully-connected Layer + Softmax    10
TABLE II: The DNN structure for CIFAR-10 classification task.
Layer                                Output Dimensions
On-device Network
  [Convolutional Layer + ReLU] × …
  …

Therefore, we use only one fully-connected layer as the on-device network to meet the memory constraint. At the edge server, we select an MLP as the server-based network. The corresponding network structure is shown in Table I. Note that a 4-layer MLP achieves an error rate of 1.38%, as reported in [36].

• For the CIFAR-10 classification task, we assume a single-board computer (e.g., the Raspberry Pi series) as the mobile device and adopt ResNet [53] as the backbone for CIFAR-10 processing, which can achieve a classification accuracy of around 92%. As the single-board computer has much more memory than a microcontroller, we deploy convolutional layers on the mobile device to extract a compact representation. Besides, to reduce the communication overhead, we add a fully-connected layer at the end of the on-device network to map the intermediate tensor to an n-dimensional encoded feature. Correspondingly, there is a fully-connected layer in the server-based network that maps the received feature vector back to a tensor, and several server-based layers are adopted for further processing. The network structure is shown in Table II.

Since the proposed VFE method can prune the redundant dimensions of the encoded feature vector, we initialize n to 128 and 64 for the MNIST and CIFAR-10 classification tasks, respectively. Moreover, the function g(·) in (11) for variable-length encoding is an MLP with three hidden layers of 16 hidden units each, which adds negligible computation compared with the other computation-intensive modules.
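As a concrete reading of Table I, a PyTorch-style sketch of the MNIST on-device and server-based networks is given below (the CIFAR-10 networks of Table II are omitted since their per-layer dimensions are not fully listed here). In the actual VFE method, the on-device output layer is replaced by the γ-scaled layer in (7); the plain linear layer below only reflects the backbone structure.

```python
import torch.nn as nn

def build_mnist_networks(n: int):
    """Sketch of the Table I architecture for MNIST: a single fully-connected
    encoder with Tanh on the device and a 3-layer MLP classifier at the server."""
    on_device = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, n),
        nn.Tanh(),                      # bounds each symbol, so peak power P = 1
    )
    server = nn.Sequential(
        nn.Linear(n, 1024), nn.ReLU(),
        nn.Linear(1024, 256), nn.ReLU(),
        nn.Linear(256, 10),             # class logits; softmax is applied in the loss
    )
    return on_device, server
```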
B. Results for Static Channel Conditions

In this set of experiments, we assume that the wireless channel has the same PSNR in both the training and test phases. We then record the inference accuracy achieved under different communication latencies to obtain the rate-distortion tradeoff curves. In the proposed VFE method, varying the weighting parameter β adjusts the encoded feature length; β is swept over a range of values in the MNIST and CIFAR-10 experiments. The communication latency of DeepJSCC is determined by the encoded feature dimension n, while for the learning-based quantization method, the communication latency is determined by the dimension n and the number of quantization levels. Adjusting these parameters affects both the communication latency and the accuracy. The rate-distortion tradeoff curves are shown in Fig. 3 and Fig. 4 for the MNIST and CIFAR-10 classification tasks, respectively. They show that our proposed method outperforms the baselines by achieving a better rate-distortion tradeoff, i.e., for a given latency requirement, a higher classification accuracy is maintained, and vice versa. This is because the proposed VFE method is able to identify and eliminate the redundant dimensions of the encoded feature vector for task-oriented communication. Besides, we also depict the noisy feature vectors ẑ in the MNIST classification task in Fig. 5 using a two-dimensional t-distributed stochastic neighbour embedding (t-SNE) [54]. Since the IB principle preserves less nuisance from the input and makes ẑ less affected by the channel noise, our VFE method can better distinguish the data from different classes compared with DeepJSCC.

Next, we test the robustness of the proposed method by further evaluating its inference performance over different channel conditions, under a fixed transmission latency tolerance. (Note that there is a tradeoff between the on-device computation latency and the communication overhead, caused by the complexity of the on-device network [28]. In this paper, as we assume an extremely bandwidth-limited situation, we mainly consider the communication overhead in the experiments.)

Fig. 3: The rate-distortion curves in the MNIST classification task with (a) PSNR = 10 dB and (b) PSNR = 20 dB.
Fig. 4: The rate-distortion curves in the CIFAR-10 classification task with (a) PSNR = 10 dB and (b) PSNR = 20 dB.

Since the achievable channel rate decreases as the PSNR decreases, the learning-based quantization method needs to reduce the encoded data size to meet the latency constraint. The latency constraint can also be translated into an encoded feature vector with fewer than 32 dimensions for both the VFE method and DeepJSCC. (Theoretically, based on the channel capacity bound in (12), transmitting an MNIST image takes around 240 ms when PSNR = 25 dB and 550 ms when PSNR = 10 dB. Similarly, transmitting a CIFAR-10 image takes around 1 s when PSNR = 25 dB and 2500 ms when PSNR = 10 dB.)

(a) DeepJSCC: Accuracy = 96.77%, dimension n = 24. (b) Proposed VFE: Accuracy = 97.39%, dimension n = 24.
Fig. 5: Two-dimensional t-SNE embedding of the received feature in the MNIST classification task with PSNR = 20 dB.

TABLE III: Classification accuracy (%) under different PSNRs with the communication latency constrained below the latency tolerance.
MNIST
              10 dB    15 dB    20 dB    25 dB
DeepJSCC      97.04    97.13    97.45    97.56
Quantization  95.32    95.96    96.81    97.12
Proposed      97.29    97.79    98.01    98.17

CIFAR-10
              10 dB    15 dB    20 dB    25 dB
DeepJSCC      91.58    91.60    91.67    91.72
Quantization  90.68    91.07    91.53    91.65
Proposed      91.62    91.72    91.90    92.04

Table III shows the classification accuracy under various values of the PSNR for the MNIST and CIFAR-10 tasks. It is observed that our method consistently outperforms the two baselines, implying that the IB framework can effectively identify the task-related information in the encoding scheme and that our VFE method is capable of achieving resilient transmission for task-oriented communication.
C. Results for Dynamic Channel Conditions
In this subsection, we evaluate the performance of the proposed VL-VFE method in dynamic channel conditions. We assume the PSNR changes from 10 to 25 dB. As the peak transmit power is constrained below 1 by the Tanh activation function, this equivalently means that the channel noise variance σ² varies in [3.2 × 10⁻³, 0.1]. We compare the inference performance of the proposed method and DeepJSCC when testing over a wide range of PSNRs. DeepJSCC is trained with PSNR = 20 dB, and its feature dimension is set to n = 36 in the MNIST classification and n = 16 in the CIFAR-10 classification. The training process of our method follows Algorithm 2, and the channel noise level σ, as a random variable, is uniformly sampled from this range. The hyperparameter β is fixed for the MNIST and CIFAR-10 classification tasks, respectively.

Fig. 6 shows the latency and inference accuracy for the two classification tasks. From this figure, it can be seen that the proposed VL-VFE method achieves a higher accuracy compared with DeepJSCC. Besides, the proposed VL-VFE method can adaptively adjust the encoded feature dimension according to the instantaneous channel noise level, and thus it can reduce the communication latency in the high-PSNR regime. Specifically, when the channel conditions are unfavorable, VL-VFE tends to activate more dimensions for transmission to make the received feature vector robust and maintain the inference performance, which is analogous to adding redundancy for error correction in conventional channel coding techniques. On the contrary, when the channel conditions are good enough, VL-VFE tends to activate fewer dimensions to reduce the communication overhead.

(a) The MNIST classification task. (b) The CIFAR-10 classification task.
Fig. 6: Communication latency and error rate as functions of the channel PSNR in the dynamic channel environments.

(a) The γ values with a Gaussian distribution as the variational prior. Task accuracy = 95.91%, dimension n = 32. (b) The γ values with a log-uniform distribution as the variational prior. Task accuracy = 97.99%, dimension n = 32.
Fig. 7: The γ values in the MNIST classification task with (a) a Gaussian distribution as the variational prior and (b) a log-uniform distribution as the variational prior. The red dashed line denotes the pruning threshold γ_0.

(a) The γ values with a Gaussian distribution as the variational prior. Task accuracy = 91.18%, dimension n = 21. (b) The γ values with a log-uniform distribution as the variational prior. Task accuracy = 91.83%, dimension n = 21.
Fig. 8: The γ values in the CIFAR-10 classification task with (a) a Gaussian distribution as the variational prior and (b) a log-uniform distribution as the variational prior. The red dashed line denotes the pruning threshold γ_0.

D. Ablation Study
To verify the effectiveness of the log-uniform distribution as the variational prior q(ẑ) for sparsity induction, we further conduct an ablation study that selects a Gaussian distribution with a diagonal covariance matrix for comparison. Note that the Gaussian distribution is widely used in previous variational approximation studies (e.g., [35], [40]) as it generally yields closed-form solutions. Since the Gaussian distribution is not a parameter-free distribution, its mean and covariance matrix are optimized in the training process to minimize the KL-divergence D_KL(p(ẑ|x) || q(ẑ)). The experiments are conducted for MNIST and CIFAR-10 classification assuming PSNR = 20 dB. The values of γ obtained with the different variational prior distributions are shown in Fig. 7 and Fig. 8, where the dashed line corresponds to the threshold γ_0 used to prune the dimensions. From these two figures, it can be seen that, although using the Gaussian distribution can also confine some dimensions of γ to close-to-zero values, it is prone to shrinking the remaining informative dimensions, which eventually results in inference accuracy degradation.

VI. CONCLUSIONS
In this work, we investigated task-oriented communication for edge inference, where a low-end edge device transmits the extracted feature vector of a local data sample to a powerful edge server for processing. Our proposed methodology is built upon the information bottleneck (IB) framework, which provides a principled way to characterize and optimize a new rate-distortion tradeoff in edge inference. Assisted by a variational approximation with a log-uniform distribution as the variational prior to promote sparsity in the output feature, we obtained a tractable formulation that is amenable to end-to-end training, named variational feature encoding. We further extended our method to a variable-length variational feature encoding scheme based on dynamic neural networks, which makes it adaptive to dynamic channel conditions. The effectiveness of our methods was verified by extensive simulations on image classification datasets.

Through this study, we would like to advocate rethinking the communication system design for emerging applications such as edge inference. In these applications, communication will keep playing a critical role, but it will serve the downstream task rather than data reconstruction as in the classical communication setting. Thus, we should take a task-oriented perspective to design the communication module for such applications. New design tools and methodologies will be needed, and the IB framework is a promising candidate: it bridges machine learning and information theory and leverages theory and tools from both fields. There are many interesting future research directions on this exciting topic, e.g., applying the IB-based framework to other applications, developing a theoretical understanding of the new rate-distortion tradeoff, and improving the robustness of the method.

APPENDIX A
DERIVATION OF THE VARIATIONAL UPPER BOUND
Recall that the IB objective in (2) has the form L_IB(φ) = -I(Ẑ, Y) + β I(Ẑ, X). Writing it out in full, it becomes:

\mathcal{L}_{IB}(\phi) = -I(\hat{Z}, Y) + \beta I(\hat{Z}, X)
= -\int p(y|\hat{\mathbf{z}}) p(\hat{\mathbf{z}}) \log \frac{p(y|\hat{\mathbf{z}})}{p(y)} \, dy \, d\hat{\mathbf{z}} + \beta \int p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) p(\mathbf{x}) \log \frac{p_{\phi}(\hat{\mathbf{z}}|\mathbf{x})}{p(\hat{\mathbf{z}})} \, d\mathbf{x} \, d\hat{\mathbf{z}}
= -\int p(y|\hat{\mathbf{z}}) p(\hat{\mathbf{z}}) \log p(y|\hat{\mathbf{z}}) \, dy \, d\hat{\mathbf{z}} + \beta \int p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) p(\mathbf{x}) \log \frac{p_{\phi}(\hat{\mathbf{z}}|\mathbf{x})}{p(\hat{\mathbf{z}})} \, d\mathbf{x} \, d\hat{\mathbf{z}} - H(Y)
= \underbrace{-\int p(y|\hat{\mathbf{z}}) p(\hat{\mathbf{z}}) \log q_{\theta}(y|\hat{\mathbf{z}}) \, dy \, d\hat{\mathbf{z}} + \beta \int p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) p(\mathbf{x}) \log \frac{p_{\phi}(\hat{\mathbf{z}}|\mathbf{x})}{q(\hat{\mathbf{z}})} \, d\mathbf{x} \, d\hat{\mathbf{z}}}_{\mathcal{L}_{VIB}(\phi, \theta)}
\ \underbrace{- \int p(y|\hat{\mathbf{z}}) p(\hat{\mathbf{z}}) \log \frac{p(y|\hat{\mathbf{z}})}{q_{\theta}(y|\hat{\mathbf{z}})} \, dy \, d\hat{\mathbf{z}}}_{-D_{KL}(p(y|\hat{\mathbf{z}}) \| q_{\theta}(y|\hat{\mathbf{z}})) \le 0}
\ \underbrace{- \beta \int p_{\phi}(\hat{\mathbf{z}}|\mathbf{x}) p(\mathbf{x}) \log \frac{p(\hat{\mathbf{z}})}{q(\hat{\mathbf{z}})} \, d\mathbf{x} \, d\hat{\mathbf{z}}}_{-\beta D_{KL}(p(\hat{\mathbf{z}}) \| q(\hat{\mathbf{z}})) \le 0}
\ \underbrace{- H(Y)}_{\text{constant}}.

Here, \mathcal{L}_{VIB}(\phi, \theta) is the VIB objective function in (3). As the KL-divergence is nonnegative and the entropy of Y is a constant, \mathcal{L}_{VIB}(\phi, \theta) is a variational upper bound on the IB objective \mathcal{L}_{IB}(\phi).
APPENDIX B
MLP STRUCTURE OF THE FUNCTION g(σ)
We parameterize g(σ) by a K-layer MLP, and thus it can be written as a composition of K non-linear functions:

g(\sigma) = h_K \circ h_{K-1} \circ \cdots \circ h_1(\sigma),    (13)

where h_k represents the k-th layer of the MLP and has the following form:

h_k(\mathbf{x}) = \tanh(W^{(k)} \mathbf{x}).    (14)

(Here tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}) and tanh'(x) = 1 - tanh²(x); for simplicity, we define tanh(x) and tanh'(x) as element-wise functions.) To maintain the desired properties of the proposed VL-VFE method, each function g_j(σ) (the j-th output dimension of the vector-valued function g(σ)) should be non-negative and increasing with the noise level σ. Therefore, the functions g_j(σ) should satisfy the following constraints:

g_j(\sigma) \ge 0, \qquad g_j'(\sigma) = \frac{\partial g_j(\sigma)}{\partial \sigma} \ge 0,

and g_j(σ) can be written as

g_j(\sigma) = h_{K,j} \circ h_{K-1} \circ \cdots \circ h_1(\sigma),

where h_{K,j} is the j-th output dimension of h_K. The derivative of g_j(σ) can be obtained via the chain rule as the product of the layer Jacobians:

g_j'(\sigma) = h'_{K,j} \, h'_{K-1} \cdots h'_1(\sigma),

where we denote the Jacobian matrix of h_k as h'_k, h'_{K,j} is the j-th row of h'_K, and each Jacobian is evaluated at the output of the preceding layers. The derivatives work out as

h'_k(\mathbf{x}) = \mathrm{diag}\big(\tanh'(W^{(k)} \mathbf{x})\big) \cdot W^{(k)}.

To guarantee that each g_j(σ) is a non-negative increasing function, we set W^{(k)} = abs(\widehat{W}^{(k)}), which ensures that g_j(σ) outputs a non-negative value and that all entries of the Jacobian matrices are non-negative. (abs(·) denotes the element-wise absolute value function, and \widehat{W}^{(k)} are the actual parameters of the K-layer MLP.)

REFERENCES
REFERENCES

[1] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
[2] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in International Conference on Machine Learning, 2008, pp. 160–167.
[3] R. Szeliski, Computer Vision: Algorithms and Applications. Springer Science & Business Media, 2010.
[4] X. Hou, S. Dey, J. Zhang, and M. Budagavi, “Predictive view generation to enable mobile 360-degree and VR experiences,” in Proceedings of the Morning Workshop on Virtual Reality and Augmented Reality Network, 2018, pp. 20–26.
[5] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, “Artificial neural networks-based machine learning for wireless networks: A tutorial,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3039–3071, 2019.
[6] J. Downey, B. Hilburn, T. O’Shea, and N. West, “Machine learning remakes radio,” IEEE Spectrum, vol. 57, no. 5, pp. 35–39, 2020.
[7] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.
[8] E. Bourtsoulatze, D. Burth Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.
[9] N. Samuel, T. Diskin, and A. Wiesel, “Learning to detect,” IEEE Transactions on Signal Processing, vol. 67, no. 10, pp. 2554–2564, 2019.
[10] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “Graph neural networks for scalable radio resource management: Architecture design and theoretical analysis,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 101–115, 2021.
[11] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Communications Magazine, vol. 57, no. 8, pp. 84–90, 2019.
[12] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Toward an intelligent edge: Wireless communication meets machine learning,” IEEE Communications Magazine, vol. 58, no. 1, pp. 19–25, 2020.
[13] Y. Shi, K. Yang, T. Jiang, J. Zhang, and K. B. Letaief, “Communication-efficient edge AI: Algorithms and systems,” IEEE Communications Surveys and Tutorials, vol. 22, no. 4, pp. 2167–2191, 2020.
[14] E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” in Proceedings of the Workshop on Mobile Edge Communications, 2018, pp. 31–36.
[15] A. I. Maqueda, A. Loquercio, G. Gallego, N. García, and D. Scaramuzza, “Event-based vision meets deep learning on steering prediction for self-driving cars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5419–5427.
[16] X. Hou, S. Dey, J. Zhang, and M. Budagavi, “Predictive view generation to enable mobile 360-degree and VR experiences,” in Proceedings of the Morning Workshop on Virtual Reality and Augmented Reality Network, 2018, pp. 20–26.
[17] L. Liu, H. Li, and M. Gruteser, “Edge assisted real-time object detection for mobile augmented reality,” in Annual International Conference on Mobile Computing and Networking, 2019, pp. 1–16.
[18] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once for all: Train one network and specialize it for efficient deployment,” in International Conference on Learning Representations, 2020.
[19] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
[20] H. Li, C. Hu, J. Jiang, Z. Wang, Y. Wen, and W. Zhu, “JALAD: Joint accuracy- and latency-aware deep structure decoupling for edge-cloud execution,” in International Conference on Parallel and Distributed Systems, 2018, pp. 671–678.
[21] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Annual Allerton Conference on Communication, Control and Computing, 2000, pp. 368–377.
[22] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source-channel coding of text,” in International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 2326–2330.
[23] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[24] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Proceedings of the International Conference on Neural Information Processing Systems, 2016, pp. 2082–2090.
[25] W. Shi, Y. Hou, S. Zhou, Z. Niu, Y. Zhang, and L. Geng, “Improving device-edge cooperative inference of deep learning via 2-step pruning,” arXiv preprint arXiv:1903.03472, 2019.
[26] J. Shao and J. Zhang, “BottleNet++: An end-to-end approach for feature compression in device-edge co-inference systems,” in International Conference on Communications Workshops, 2020, pp. 1–6.
[27] M. Jankowski, D. Gündüz, and K. Mikolajczyk, “Wireless image retrieval at the edge,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 89–100, 2021.
[28] J. Shao and J. Zhang, “Communication-computation trade-off in resource-constrained edge inference,” IEEE Communications Magazine, 2020.
[29] M. Jankowski, D. Gündüz, and K. Mikolajczyk, “Deep joint source-channel coding for wireless image retrieval,” in International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 5070–5074.
[30] J. Shao, H. Zhang, Y. Mao, and J. Zhang, “Branchy-GNN: A device-edge co-inference framework for efficient point cloud processing,” arXiv preprint arXiv:2011.02422, 2020.
[31] K. Choi, K. Tatwawadi, A. Grover, T. Weissman, and S. Ermon, “Neural joint source-channel coding,” in International Conference on Machine Learning, 2019, pp. 1182–1192.
[32] R. Dobrushin and B. Tsybakov, “Information transmission with additional noise,” IRE Transactions on Information Theory, vol. 8, no. 5, pp. 293–304, 1962.
[33] Z. Goldfeld and Y. Polyanskiy, “The information bottleneck problem and its applications in machine learning,” IEEE Journal on Selected Areas in Information Theory, 2020.
[34] A. Zaidi, I. Estella-Aguerri et al., “On the information bottleneck problems: Models, connections, applications and information theoretic views,” Entropy, vol. 22, no. 2, p. 151, 2020.
[35] A. Achille and S. Soatto, “Information dropout: Learning optimal representations through noisy computation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 12, pp. 2897–2905, 2018.
[36] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in International Conference on Learning Representations, 2017.
[37] S. Dörner, S. Cammerer, J. Hoydis, and S. t. Brink, “Deep learning based communication over the air,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2018.
[38] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2012.
[39] Z. Wang and D. W. Scott, “Nonparametric density estimation for high-dimensional data—algorithms and applications,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 11, no. 4, p. e1461, 2019.
[40] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference on Learning Representations, 2014.
[41] D. P. Kingma, T. Salimans, and M. Welling, “Variational dropout and the local reparameterization trick,” in Advances in Neural Information Processing Systems, 2015, pp. 2575–2583.
[42] D. Molchanov, A. Ashukha, and D. Vetrov, “Variational dropout sparsifies deep neural networks,” in International Conference on Machine Learning, 2017, pp. 2498–2507.
[43] X. Wang, F. Yu, Z.-Y. Dou, T. Darrell, and J. E. Gonzalez, “SkipNet: Learning dynamic routing in convolutional networks,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 409–424.
[44] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, “BlockDrop: Dynamic inference paths in residual networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8817–8826.
[45] Z. Chen, Y. Li, S. Bengio, and S. Si, “You look twice: GaterNet for dynamic filter selection in CNNs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9172–9180.
[46] J. Yu, L. Yang, N. Xu, J. Yang, and T. Huang, “Slimmable neural networks,” in International Conference on Learning Representations, 2019.
[47] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[48] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[49] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[50] Y. Polyanskiy, H. V. Poor, and S. Verdu, “Channel coding rate in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
[51] A. L. McKellips, “Simple tight bounds on capacity for the peak-limited discrete-time channel,” in International Symposium on Information Theory Proceedings, 2004, pp. 348–348.
[52] V. Dantona, C. Hofmann, S. Lattrell, and B. Lankl, “Spectrally efficient multilevel CPM waveforms for VHF narrowband communications,” in International ITG Conference on Systems, Communications and Coding, 2015, pp. 1–6.
[53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[54] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,”