Semantic Communication Systems for Speech Transmission
Zhenzi Weng,
Student Member, IEEE, and Zhijin Qin,
Member, IEEE
Abstract—Semantic communications can significantly improve transmission efficiency by exploiting the semantic information of the input. Motivated by breakthroughs in deep learning (DL), we make an effort to recover the transmitted speech signals in semantic communication systems, which minimizes the error at the semantic level rather than at the bit or symbol level as in traditional communication systems. Particularly, we design a DL-enabled semantic communication system for speech signals, named DeepSC-S. Based on an attention mechanism employing squeeze-and-excitation (SE) networks, DeepSC-S is able to identify the essential speech information and assign high values to the weights corresponding to that information when training the neural network. Moreover, in order to enable the proposed DeepSC-S to cater to dynamic channel environments, we seek a general model that copes with various channel conditions without retraining. Furthermore, to verify the model adaptation in practice, we investigate DeepSC-S in telephone systems as well as in multimedia transmission systems, which usually require higher data rates and lower transmission latency. The simulation results demonstrate that the proposed DeepSC-S achieves higher system performance than traditional communications in both telephone systems and multimedia transmission systems in terms of the speech metrics signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ). Besides, DeepSC-S is more robust to channel variations than the traditional approaches, especially in the low signal-to-noise ratio (SNR) regime.
Index Terms—Deep learning, semantic communication, speech transmission, squeeze-and-excitation networks.
Zhenzi Weng and Zhijin Qin are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK (email: [email protected], [email protected]).

I. INTRODUCTION

INTELLIGENT communications have attracted intensive attention as an evolution of traditional communication systems [1]. Inspired by its success in various areas, deep learning (DL) has been considered as a promising candidate for communications to achieve higher system performance with more intelligence [2]. Particularly, DL has shown great potential for solving existing technical problems in both physical layer communications [3]–[5] and wireless resource allocation [6], [7]. Typically, a DL-based communication system is designed to reduce complexity and/or improve system performance by merging one or multiple communication modules of the traditional block-wise architecture and using deep neural networks (DNNs) with trainable parameters to represent the intelligent transceiver. However, even if communication systems utilizing DL techniques yield better performance and/or lower complexity in some scenarios and conditions, most of the literature focuses on performance improvement at the bit or symbol level, usually taking bit-error rate (BER) or symbol-error rate (SER) as the performance metric. Particularly, the major task in traditional communication systems, as well as in the developed DL-enabled systems, is to recover the transmitted message, represented by digital bit sequences, accurately and effectively. In the past decades, such wireless communication systems have experienced significant development from the first generation (1G) to the fifth generation (5G), and the system capacity is approaching the Shannon limit.

Shannon and Weaver [8] categorized communications into three levels:
• Level A: How accurately can the symbols of communication be transmitted? (The technical problem.)
• Level B: How precisely do the transmitted symbols convey the desired meaning? (The semantic problem.)
• Level C: How effectively does the received meaning affect conduct in the desired way? (The effectiveness problem.)

This indicates the feasibility of transmitting semantic information, instead of bits or symbols, to achieve higher system efficiency. Besides, due to the increasing deployment of intelligent IoT applications, e.g., human-computer interaction and machine-to-machine communications, semantic-irrelevant communications are no longer ideal for the future. Motivated by this, researchers have dedicated themselves to developing new systems that process and exchange semantic information for more efficient communications.

Semantic theory, in contrast to the information theory exploited in existing communication systems, takes into account the meaning and veracity of source information because messages can be both informative and factual [9]. This enables semantic communication systems to recover information at the receiver by minimizing the difference in meaning between the input and the recovered signals, instead of minimizing BER or SER. However, the exploration of semantic communications went through decades of stagnation after it was first identified, because some fundamental problems, i.e., the lack of a mathematical model, could not be formulated and solved properly when semantic information exchange is considered [10], e.g., how to define efficiency and reliability in semantic communications?
According to recent efforts in [11], semantic data can be compressed to a proper size for transmission using a lossless method that utilizes the semantic relationships between different messages, while traditional lossless source coding represents a signal with the minimum number of binary bits by exploiting the dependencies or statistical properties of the input signals. In addition, end-to-end (E2E) communication systems have been developed in [12] to address the bottlenecks of traditional block-wise communication systems, which are sub-optimal because conventional signal processing can hardly capture the many imperfections and non-linearities of practical channel environments. Inspired by this, different types of sources have been considered in recent investigations of E2E semantic communication systems, which mainly focus on image and text transmission [23]–[31]. An investigation of semantic communication for speech signal transmission is still missing.

In the area of speech signal processing, cutting-edge DL applications have been developed to convert speech signals into text, e.g., automatic speech recognition (ASR). The core of ASR is to generate the corresponding text by passing speech signals through an acoustic model that maps each phoneme into a single alphabet, and then concatenating the alphabets into an understandable word sequence via a language model, which pays no attention to the characteristics of the speech signals, e.g., the speaking speed and tone [13]. However, our work is to recover speech signals, which includes the recovery of these speech characteristics. Thus, language-model-based ASR technologies are not applicable, because the speech characteristics are abandoned when speech signals are converted into text, and the process is irreversible. Moreover, most DL algorithms pre-process speech signals to obtain magnitudes, spectra, or Mel-frequency cepstra by various operations, such as the discrete cosine transform (DCT), before feeding them into a learning system. Such operations are employed to capture the unique features of speech signals, e.g., the inconsistent speaking speeds of a person at different moments, the different frequencies of female and male voices, and the distinct tones of persons at different ages, which runs counter to the motivation of intelligence. Therefore, a DL algorithm that learns features directly from speech signals is desired. DL-based semantic communication systems that learn the semantic information of speech signals directly are of great interest and importance for the next generation of communication systems.

In this paper, we explore semantic communication systems for speech signals by utilizing DL techniques. Particularly, a DL-enabled semantic communication system for speech signals, named DeepSC-S, is proposed, which learns and extracts speech features and then recovers the signals at the receiver directly from the received features. The main contributions of this article can be summarized as fourfold:
• A novel semantic communication system for speech signals, named DeepSC-S, is first proposed, which treats the transmitter and the receiver as two trainable DNNs, and jointly designs the speech coding and the channel coding to deal with source distortion and channel effects.
• Particularly, in the proposed DeepSC-S, the squeeze-and-excitation (SE) network [14] is employed to learn and extract essential speech semantic information, as well as to assign high values to the weights corresponding to the essential information during the training phase. By exploiting the attention mechanism based on SE networks, DeepSC-S improves the accuracy of signal recovery.
• Moreover, by training DeepSC-S under a fixed fading channel and SNR and then applying the trained model under different testing channel conditions, the proposed DeepSC-S is shown to be highly robust to dynamic channel environments without network tuning and retraining.
• To verify the model adaptation to practical communication scenarios, the proposed DeepSC-S is applied to telephone systems and multimedia transmission systems, respectively. The performance is also compared with traditional approaches to prove its superiority. Simulation results show that DeepSC-S outperforms the traditional systems, especially in the low SNR regime.

The rest of this article is structured as follows. The related work is presented in Section II. Section III introduces the model of the semantic communication system for speech transmission and the related performance metrics. In Section IV, the details of the proposed DeepSC-S are presented. Simulation results are discussed in Section V. Section VI draws conclusions.
Notation: Boldface letters represent vectors or matrices, and plain capital letters denote integers. Given a vector x, x_i indicates its i-th component and ‖x‖ denotes its Euclidean norm. Given a matrix Y, Y ∈ R^{M×N} indicates that Y is a real-valued matrix of size M × N. Superscript swash letters refer to a block in the system, e.g., T in θ^T represents the parameters of the transmitter. CN(m, V) denotes a multivariate circularly symmetric complex Gaussian distribution with mean vector m and covariance matrix V. Moreover, a ∗ b represents the convolution of the vectors a and b.

II. RELATED WORK
As aforementioned, E2E learning of communication systems has been developed to address the challenges of traditional communication systems; however, it analyzes and improves the system performance at the bit or symbol level. Moreover, the existing applications of DL-enabled semantic communication systems are mainly based on text and image source information.
A. End-to-End Learning in Communication Systems
DL-based E2E communication systems have achieved highly competitive block-error rate (BLER) performance compared to traditional baselines in various scenarios [12], e.g., uncoded binary phase shift keying (BPSK) and Hamming-coded BPSK. In addition, E2E learning has shown great potential in processing complicated communication tasks. For example, E2E learning systems have been employed in orthogonal frequency division multiplexing (OFDM) systems [15], [16], as well as in multiple-input multiple-output (MIMO) systems [17], [18]. Besides, channel estimation is a challenging problem in DL-enabled E2E systems. In [19], reinforcement learning (RL) has been utilized to estimate channel state information (CSI) by treating the channel layer and the receiver as the environment and the transmitter as the agent, which takes actions to interact with the environment based on a policy. This has been regarded as a cutting-edge approach, although an additional reliable channel is still required to send losses back from the receiver to the transmitter during the training phase. Another novel channel-agnostic solution has been proposed in [20], which replaces the realistic channels with a neural network (NN) by exploiting a conditional generative adversarial network (GAN).

Furthermore, due to the complexity of NN training in E2E learning models, high training efficiency with low energy consumption is desirable when deploying such systems in practical scenarios. Transfer learning has been considered as a promising technology for adapting E2E communication systems to uncontrollable and unpredictable channel environments by training them over a statistical channel model [21]. In addition, another appealing solution is to obtain a trained model that yields the expected performance after a small number of stochastic gradient descent (SGD) iterations. Particularly, a model-agnostic meta-learning enabled E2E communication system has been investigated in [22], which finds a common initialization achieving fast convergence after one or several iterations for various channel conditions.
B. Semantic Communications
Initial research on semantic communication systems for text has been developed in [23], which mitigates the semantic error to achieve a Nash equilibrium by integrating semantic inference and physical layer communication to optimize the whole transceiver. However, such a text-based semantic communication system only measures the difference between the transmitted and received sentences at the word level instead of the sentence level. Thus, a further investigation of semantic communications for text transmission, named DeepSC, has been carried out in [24] to deal with the semantic error at the sentence level for sentences of various lengths. Powered by the Transformer [25], the semantic encoder and the channel encoder are jointly designed as a trainable autoencoder to minimize the semantic error, rather than the BER or SER as in traditional communications, and to improve the system capacity. By doing so, semantic communications for text can be realized. Moreover, the increasing deployment of smart IoT devices requires them to implement more complicated tasks, such as training a DNN independently, which runs counter to the limited computing capability of IoT devices. Inspired by this, a lite distributed semantic communication system for text transmission, named L-DeepSC, has been proposed in [26] to enable IoT devices to perform intelligent tasks by pruning parameters to reduce the size of the trained models, as well as to reduce the communication cost between the IoT devices and the server.

In the area of semantic communications for images, a DL-enabled semantic communication system for image transmission, named JSCC, has been developed in [27], which employs a convolutional neural network (CNN) with five convolutional layers to jointly design the source-channel encoder, and a CNN with five transposed convolutional layers at the receiver to realize the source-channel decoder. Based on JSCC, an image transmission system integrating noiseless or noisy channel output feedback has been investigated to improve image reconstruction [28], where the channel output is backpropagated to the transmitter after a unit delay, and additive white Gaussian noise (AWGN) is added to generate a weight vector for the NN at the transmitter. By utilizing the feedback mechanism, the quality of the reconstructed image is improved compared to the model without feedback signals and to the traditional approaches. Similar to text transmission, IoT applications for image transmission have been carried out. Particularly, a joint image transmission-recognition system has been developed in [29] by applying two DNNs as the transmitter at the IoT devices and the receiver at the server edge, which shows superior recognition accuracy compared to the traditional approaches and, via transfer learning, has the great advantage of requiring few computation resources.
In [30], a deep joint source-channel coding architecture, named DeepJSCC, is combined with a network pruning technique to perform image classification at the edge server, which enables IoT devices to process images with low computational complexity and reduces the required transmission bandwidth. Moreover, an application based on JSCC to retrieve images at the wireless edge has been proposed in [31], which aims to address the transmission delay incurred when IoT devices send the whole image by employing a DNN to realize retrieval-oriented image compression.

Given the intensive investigations of semantic communication for text and image information, as well as the challenges of traditional communications for speech transmission, e.g., poor telephone communication quality at the airport, it is significant to carry out research on speech communication systems that utilize semantic information.

III. SYSTEM MODEL
In this section, we first introduce the considered system model, and then present its details along with the adopted performance metrics.
A. System Settings
The considered system transmits the original speech signals via an NN-based speech semantic communication system, which comprises two major tasks, as shown in Fig. 1: i) learning and extracting the semantic information of the speech signals; and ii) mitigating the effects of wireless channels. Due to the variation of speech characteristics, this is quite a challenging problem. In a practical communication scenario, the signal passing through the physical channel suffers from distortion and attenuation. Therefore, the considered DL-enabled system aims to recover the original speech signals and achieve better performance than the traditional approaches while coping with complicated channel distortions.
B. Transmitter
The proposed system model is shown in Fig. 1. The input of the transmitter is a speech sample sequence s = [s_1, s_2, ..., s_W] with W samples, where s_w is the w-th element of s and is a scalar value, i.e., a positive number, a negative number, or zero. At the transmitter, the input s is mapped into symbols x to be transmitted over physical channels.

Fig. 1: The model structure of the DL-enabled speech semantic communication system.

As shown in Fig. 1, the transmitter consists of two individual components: the speech encoder and the channel encoder, each implemented by an independent NN. Denote the NN parameters of the speech encoder and the channel encoder as α and β, respectively. Then the encoded symbol sequence x can be expressed as

    x = T_β^C(T_α^S(s)),    (1)

where T_α^S(·) and T_β^C(·) indicate the speech encoder and the channel encoder with respect to (w.r.t.) parameters α and β, respectively. Here we denote the NN parameters of the transmitter as θ^T = (α, β).

The mapped symbols x are transmitted over a physical channel. Note that normalization of the transmitted symbols x is required to ensure the total transmission power constraint E[‖x‖²] = 1.

The whole transceiver in Fig. 1 is designed for a single communication link, in which the channel layer, represented by p_h(y|x), takes x as the input and produces the received signal y as the output. Denote the coefficients of a linear channel as h; then the transmission process from the transmitter to the receiver can be modeled as

    y = h ∗ x + w,    (2)

where w ~ CN(0, σ²I) indicates independent and identically distributed (i.i.d.) Gaussian noise, σ² is the noise variance for each channel, and I is the identity matrix.
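For intuition, the channel model in (2) can be simulated in a few lines of NumPy. The sketch below is illustrative only: it applies element-wise flat fading rather than a full convolution with h, the function name apply_channel is ours, and the signal power is assumed to be normalized to E[‖x‖²] = 1 as required above.

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_channel(x, snr_db, fading="awgn"):
    """Toy version of Eq. (2), y = h * x + w, at a given SNR in dB."""
    if fading == "awgn":
        h = np.ones_like(x)
    elif fading == "rayleigh":
        # Flat Rayleigh fading: one CN(0, 1) coefficient per symbol.
        h = (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)) / np.sqrt(2)
    else:
        raise ValueError(f"unknown fading model: {fading}")
    noise_power = 10 ** (-snr_db / 10)          # signal power is normalized to 1
    w = np.sqrt(noise_power / 2) * (rng.standard_normal(x.shape)
                                    + 1j * rng.standard_normal(x.shape))
    return h * x + w, h

x = rng.standard_normal(1024) + 1j * rng.standard_normal(1024)
x /= np.sqrt(np.mean(np.abs(x) ** 2))           # enforce the power constraint
y, h = apply_channel(x, snr_db=8, fading="rayleigh")
```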
C. Receiver

Similar to the transmitter, the receiver also consists of two cascaded parts: the channel decoder and the speech decoder. The channel decoder mitigates the channel distortion and attenuation, and the speech decoder recovers the speech signals from the learned and extracted speech semantic features. Denote the NN parameters of the channel decoder and the speech decoder as χ and δ, respectively. As depicted in Fig. 1, the decoded signal ŝ can be obtained from the received signal y by

    ŝ = R_δ^S(R_χ^C(y)),    (3)

where R_χ^C(·) and R_δ^S(·) indicate the channel decoder and the speech decoder w.r.t. parameters χ and δ, respectively. Denote the NN parameters of the receiver as θ^R = (χ, δ).

The objective of the whole transceiver system is to recover speech signals as close as possible to the original, which raises two challenges. The first is the design of an efficient and intelligent speech encoder/decoder that utilizes the semantic information to recover speech signals, especially under poor channel conditions, such as the low SNR regime. The second is the design of the channel encoder/decoder to alleviate symbol errors caused by the physical channels by adding redundant information. In traditional communications, advanced channel coding techniques operate at the bit level to target a low BER/SER. However, no bit-to-symbol transformation is involved in our proposed system. The raw speech signals are directly mapped into a transmitted symbol stream by the speech encoder and the channel encoder, and recovered at the receiver via the inverse operations. Thus, we treat the speech recovery process as a signal reconstruction task that minimizes the errors between the signal values in s and ŝ by exploiting the characteristics of speech signals. The mean-squared error (MSE) is therefore used as the loss function in our system to measure the difference between s and ŝ:

    L_MSE(θ^T, θ^R) = (1/W) Σ_{w=1}^{W} (s_w − ŝ_w)²,    (4)

where s_w and ŝ_w indicate the w-th elements of the vectors s and ŝ, respectively, and W is the length of these two vectors.

Assume that the NN models of the whole transceiver are differentiable w.r.t. their parameters, which can then be optimized via gradient descent based on the MSE loss function. It is worth mentioning that the speech encoder/decoder and the channel encoder/decoder are jointly designed. Besides, given prior CSI, both parameter sets θ^T and θ^R can be adjusted at the same time. Denoting the NN parameters of the whole system as θ = (θ^T, θ^R), we adopt the SGD algorithm for training in this paper. The parameters θ are then updated iteratively as

    θ^(i+1) ← θ^(i) − η ∇_{θ^(i)} L_MSE(θ^T, θ^R),    (5)

where η > 0 is the learning rate and ∇ indicates the differential operator.
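The joint update of θ^T and θ^R in (4) and (5) is a standard autoencoder training step. Below is a minimal PyTorch sketch under simplifying assumptions: the four components are stand-in linear layers rather than the CNN/SE-ResNet blocks of Section IV, and the channel is reduced to additive noise.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the four trainable components in Fig. 1.
speech_enc  = nn.Linear(128, 64)   # T_alpha^S
channel_enc = nn.Linear(64, 64)    # T_beta^C
channel_dec = nn.Linear(64, 64)    # R_chi^C
speech_dec  = nn.Linear(64, 128)   # R_delta^S

params = list(speech_enc.parameters()) + list(channel_enc.parameters()) \
       + list(channel_dec.parameters()) + list(speech_dec.parameters())
opt = torch.optim.SGD(params, lr=1e-3)        # eta in Eq. (5)

s = torch.randn(32, 128)                      # a batch of speech frames
x = channel_enc(speech_enc(s))                # transmitted symbols, Eq. (1)
y = x + 0.1 * torch.randn_like(x)             # AWGN stand-in for Eq. (2)
s_hat = speech_dec(channel_dec(y))            # recovered signal, Eq. (3)

loss = nn.functional.mse_loss(s_hat, s)       # Eq. (4)
loss.backward()                               # gradients w.r.t. theta = (theta_T, theta_R)
opt.step(); opt.zero_grad()                   # joint update, Eq. (5)
```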
D. Performance Metrics

In our model, the system is committed to reconstructing the raw speech signals. Hence, the signal-to-distortion ratio (SDR) [32] is employed to measure the ℓ₂ error between s and ŝ. It is one of the commonly used metrics for speech transmission and can be expressed as

    SDR = 10 log₁₀ ( ‖s‖² / ‖s − ŝ‖² ).    (6)

A higher SDR means that the speech information is recovered with better quality, i.e., it is easier for human beings to understand. According to (4), the MSE loss reflects the SDR: the lower the MSE, the higher the SDR.

Perceptual evaluation of speech quality (PESQ) [33] is another metric for the quality of the speech signals at the receiver, which takes the short-term memory in human perception into consideration. PESQ is a speech quality assessment model combining the perceptual speech quality measure (PSQM) and the perceptual analysis measurement system (PAMS), and it is adopted in International Telecommunication Union (ITU-T) recommendation P.862 [34]. PESQ is a good candidate for evaluating the quality of speech messages under various conditions, e.g., background noise, analog filtering, and variable delay, by scoring the speech quality in the range from −0.5 to 4.5.
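The SDR in (6) is straightforward to compute; a small NumPy sketch with synthetic signals is given below (the helper name sdr is ours).

```python
import numpy as np

def sdr(s, s_hat):
    """Signal-to-distortion ratio in dB, per Eq. (6)."""
    return 10 * np.log10(np.sum(s ** 2) / np.sum((s - s_hat) ** 2))

s = np.sin(np.linspace(0, 8 * np.pi, 16000))       # clean reference signal
s_hat = s + 0.01 * np.random.randn(s.size)         # noisy reconstruction
print(f"SDR = {sdr(s, s_hat):.1f} dB")             # about 37 dB here
```

Scoring PESQ programmatically requires a P.862 implementation; open-source packages exist (for example, a pesq package on PyPI), though we do not rely on any specific one here.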
IV. PROPOSED SEMANTIC COMMUNICATION SYSTEM FOR SPEECH SIGNALS

To address the aforementioned challenges, we design a DL-enabled speech semantic communication system, named DeepSC-S. Specifically, an attention-based two-dimensional (2D) CNN is used for the speech encoder/decoder, and a 2D CNN is adopted for the channel encoder/decoder. The details of the developed DeepSC-S are introduced in this section.
A. Model Description
As shown in Fig. 2, the input of the proposed DeepSC-S, denoted as S ∈ R^{B×W}, is a set of speech sample sequences s drawn from the speech dataset 𝕊, where B is the batch size. The dataset consists of a large number of speech signals, collected by recording the speech of different persons. The input sample set S is framed into m ∈ R^{B×F×L} before passing through an attention-based encoder, i.e., the speech encoder, where F indicates the number of frames and L is the length of each frame. Note that the framing operation only reshapes S without any feature learning or extraction. The speech encoder directly learns the speech semantic information from m and outputs the learned features b ∈ R^{B×F×L×D}. The speech encoder is detailed in part B of this section. Afterwards, the channel encoder, denoted as a CNN layer with 2D CNN modules, converts b into U ∈ R^{B×F×N}. In order to transmit U over a physical channel, it is reshaped into symbol sequences X ∈ R^{B×FN×2} via a reshape layer. The channel layer takes the reshaped symbol sequences X as the input and produces Y at the receiver, which is given by

    Y = HX + W,    (7)

where H consists of B channel coefficient vectors h, and W is Gaussian noise consisting of B noise vectors w.

The received symbol sequences Y are reshaped into V ∈ R^{B×F×N} before being fed into the channel decoder, represented by a CNN layer with 2D CNN modules. The output of the channel decoder is b̂ ∈ R^{B×F×L×D}. Afterwards, an attention-based decoder, i.e., the speech decoder, converts b̂ into m̂ ∈ R^{B×F×L}, and m̂ is recovered into Ŝ via the inverse operation of framing, named deframing, where the size of Ŝ is the same as that of S at the transmitter. The loss is calculated at the end of the receiver and backpropagated to the transmitter, so that the trainable parameters of the whole system can be updated simultaneously.
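The tensor shapes above can be traced with a short NumPy sketch. The values of N and the encoder outputs below are illustrative placeholders, and the exact symbol layout of the reshape layer may differ from this simplification.

```python
import numpy as np

B, F, L = 8, 128, 128        # batch size, frames per sequence, frame length
W = F * L                    # 16,384 samples per sequence
N = 16                       # illustrative channel-encoder output dimension

S = np.random.randn(B, W)                # batch of speech sample sequences
m = S.reshape(B, F, L)                   # framing: pure reshaping, no feature extraction
# speech encoder:  m -> b, shape (B, F, L, D)
# channel encoder: b -> U, shape (B, F, N)
U = np.random.randn(B, F, N)             # stand-in for the channel encoder output
X = U.reshape(B, -1)                     # flatten into a symbol stream for transmission
                                         # (the paper's exact symbol layout may differ)
S_hat = m.reshape(B, W)                  # deframing is the exact inverse reshape
assert S_hat.shape == S.shape
```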
B. Speech Encoder and Decoder

The core of the proposed DeepSC-S is the NN-enabled speech encoder and speech decoder based on an attention mechanism, named SE-ResNet, which is capable of learning and extracting essential information. In this work, we focus on the essential speech semantic information, e.g., the signal magnitude, which increases sharply when an important message is emphasized, and the signal frequency, which decreases abruptly when the speaking speed slows down to express a message more clearly. Particularly, SE-ResNet is employed to identify the essential information, and the weights corresponding to the essential information are assigned high values as the weights are updated and adjusted during the training phase.

As shown in Fig. 3, in the SE-ResNet module, a Split layer splits the input m into multiple blocks, which is achieved by multiple convolution kernels, and all the blocks are concatenated. Then a transition layer is utilized to reduce the dimension of the concatenated blocks, and the output is denoted as p ∈ R^{M×N×C}, which consists of C features, each of size M × N. In the SE layer, a squeeze operation is employed to aggregate the 2D spatial dimensions of each input feature; then an operation named excitation outputs the attention factor of each feature by learning the inter-dependencies among the features in p. The output of the SE layer, z ∈ R^{1×1×C}, includes C scale coefficients, which act as attention factors that scale the importance of the extracted features in p: the features in p corresponding to essential information are multiplied by the high scale coefficients in z. By doing so, the weights of m are reassigned, i.e., the weights corresponding to essential speech information receive more attention. Note that the SE layer is considered an independent unit, and one or multiple SE-ResNet modules can be connected sequentially. With more SE-ResNet modules, the performance of learning and extracting essential information improves; however, the computational complexity also increases. Therefore, a tradeoff between learning performance and complexity should be considered during training. Additionally, a residual connection is adopted to alleviate the problem of gradient vanishing caused by the network depth by adding m to the output of the SE-ResNet module, as shown in Fig. 3.

Particularly, the speech encoder comprises multiple SE-ResNet modules to convert the input m into b, corresponding to Fig. 2. For the speech decoder, in addition to several SE-ResNet modules, a last layer, consisting of a 2D CNN module with a single kernel, is utilized to reduce the output size of the speech decoder, m̂, as the sizes of m and m̂ should be equal.
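A minimal PyTorch sketch of one SE-ResNet module is given below. It is an assumption-laden simplification: the split and transition layers are collapsed into a single convolution, and the reduction ratio in the excitation is an arbitrary choice; only the squeeze-excitation-scale-residual pattern follows [14] and Fig. 3.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-excitation over C feature maps of size M x N (cf. [14])."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, p):                          # p: (batch, C, M, N)
        z = p.mean(dim=(2, 3))                     # squeeze: (batch, C)
        z = self.fc(z)                             # excitation: factors in [0, 1]
        return p * z.view(z.size(0), z.size(1), 1, 1)  # scale each feature map

class SEResNetBlock(nn.Module):
    """One SE-ResNet module: conv -> SE scaling -> residual connection."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.se = SELayer(channels)

    def forward(self, m):
        return m + self.se(torch.relu(self.conv(m)))  # residual eases gradient flow

# e.g., SEResNetBlock(32)(torch.randn(4, 32, 128, 128)) keeps the input shape.
```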
C. Model Training and Testing

Given prior knowledge of the CSI, the transmitter and receiver parameters, θ^T and θ^R, can be updated simultaneously. As aforementioned, the objective of the proposed DeepSC-S is to train a model that captures the essential information in speech signals and works well under various channels and a wide SNR regime.
1) Training Stage:
As illustrated in Fig. 2, the training algorithm of DeepSC-S is described in Algorithm 1. During the training stage, in order to facilitate fast convergence of the MSE loss, the NN parameters, θ = (θ^T, θ^R), are initialized by a variance-scaling initializer instead of zeros. Besides, to achieve a valid training task, training continues until the MSE loss no longer decreases. The number of SE-ResNet modules is an important hyperparameter, chosen to balance good performance of the speech encoder/decoder against a reasonable training time. Moreover, the channel encoder/decoder mitigates the distortion and attenuation introduced by the physical channels, and in the channel layer the noise W is generated according to a fixed SNR value.

After passing through the whole transceiver, the sample sequence set S is recovered into Ŝ, where the sizes of S and Ŝ are equal. Furthermore, the loss is computed at the end of the receiver according to (4), and the parameters are updated by (5). When the training stage is finished, the trained networks are available for testing.

Fig. 2: The proposed system architecture for the speech semantic communication system. (a) Attention-based speech encoder. (b) Attention-based speech decoder.

Fig. 3: The proposed speech encoder and speech decoder based on SE-ResNet.
2) Testing Stage:
Based on the trained networks T_α^S(·), T_β^C(·), R_χ^C(·), and R_δ^S(·) obtained from Algorithm 1, the testing algorithm of DeepSC-S is illustrated in Algorithm 2. Note that the speech sample sequences used for testing are different from those used for training.

As shown in Algorithm 2, the model trained under a fixed channel condition is employed to test the performance under various fading channels directly, without model retraining. Note that transfer learning could be adopted as a promising technique for efficient retraining when coping with dynamic environments [24]. In this work, however, we aim for a single model with strong adaptation.
Algorithm 1: Training algorithm of the proposed DeepSC-S
Initialization: initialize parameters θ^T(0) and θ^R(0); i = 0.
Input: speech sample sequences S from the speech dataset 𝕊; a fading channel h; noise w generated under a fixed SNR value.
  Frame S into m.
  while the stop criterion is not met do
    T_α^S(m) → b.
    T_β^C(b) → X.
    Transmit X over the physical channel and receive Y via (2).
    R_χ^C(Y) → b̂.
    R_δ^S(b̂) → m̂.
    Deframe m̂ into Ŝ.
    Compute loss L_MSE(θ^T, θ^R) via (4).
    Update the trainable parameters simultaneously via SGD:
      θ^T(i+1) ← θ^T(i) − η ∇_{θ^T(i)} L_MSE(θ^T, θ^R),    (8)
      θ^R(i+1) ← θ^R(i) − η ∇_{θ^R(i)} L_MSE(θ^T, θ^R).    (9)
    i ← i + 1.
  end while
Output: trained networks T_α^S(·), T_β^C(·), R_χ^C(·), and R_δ^S(·).
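As a rough illustration, Algorithm 1 reduces to the following PyTorch loop. The names model and loader and the toy stand-ins are hypothetical; the real DeepSC-S forward pass includes the channel layer with noise generated at a fixed SNR.

```python
import torch
import torch.nn as nn

def train_deepsc_s(model, loader, lr=1e-3, patience=5, max_epochs=200):
    """Sketch of Algorithm 1: SGD on the MSE loss until it stops decreasing."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        if stale >= patience:                    # stop criterion of Algorithm 1
            break
        epoch_loss = 0.0
        for s in loader:                         # batches of framed speech
            s_hat = model(s)                     # transmitter -> channel -> receiver
            loss = nn.functional.mse_loss(s_hat, s)        # Eq. (4)
            opt.zero_grad(); loss.backward(); opt.step()   # Eqs. (8)-(9)
            epoch_loss += loss.item()
        best, stale = (epoch_loss, 0) if epoch_loss < best else (best, stale + 1)
    return model

# Toy usage: an autoencoder stand-in and random "speech" batches.
toy_model = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 128))
toy_loader = [torch.randn(16, 128) for _ in range(10)]
train_deepsc_s(toy_model, toy_loader)
```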
V. EXPERIMENT AND NUMERICAL RESULTS

In this section, we compare the performance of the proposed DeepSC-S, the traditional communication systems, and a system with an extra feature encoder for speech transmission under the AWGN channels, the Rayleigh channels, and the Rician channels, where accurate CSI is assumed. The details of the adopted benchmarks are introduced in part A of this section. Moreover, in order to verify that DeepSC-S adapts well to practical environments, DeepSC-S is tested over telephone systems and multimedia transmission systems, respectively. Note that the channel environment is modeled as a fading channel with a fixed SNR of 8 dB during the training stage, and the testing channel set H includes the AWGN channels, the Rayleigh channels, and the Rician channels.

In the whole experiment, we adopt the speech dataset from Edinburgh DataShare, which comprises a trainset of more than 10,000 .wav files and a testset of 800 .wav files with a sampling rate of 16 kHz. In the traditional telephone systems and multimedia transmission systems, the sampling rates for speech signals are 8 kHz and 44.1 kHz, respectively. Thus, for the experiments regarding telephone systems, the input samples are down-sampled to 8 kHz, and for multimedia communications, the input samples are up-sampled to 44.1 kHz. Note that the number of speech samples in different .wav files is inconsistent. In the simulation, we fix W = 16,384, and each sample sequence in m consists of F = 128 frames with frame length L = 128.
Algorithm 2: Testing algorithm of the proposed DeepSC-S
Input: speech sample sequences S from the speech dataset 𝕊; trained networks T_α^S(·), T_β^C(·), R_χ^C(·), and R_δ^S(·); testing channel set H; a wide range of SNR values.
  Frame S into m.
  for each channel condition h drawn from H do
    for each SNR value do
      Generate Gaussian noise w under the SNR value.
      T_α^S(m) → b.
      T_β^C(b) → X.
      Transmit X over the physical channel and receive Y via (2).
      R_χ^C(Y) → b̂.
      R_δ^S(b̂) → m̂.
      Deframe m̂ into Ŝ.
    end for
  end for
Output: recovered speech sample sequences Ŝ under different fading channels and various SNR values.

A. Benchmark Model
We use the following three benchmarks. In traditional communications, speech transmission over telephone systems has lower accuracy requirements than multimedia transmission systems. For instance, audio signals in video are required to be extremely clear, whereas background noise and echo commonly occur when speaking over the phone.
1) Benchmark 1:
According to the ITU-T G.711 standard, 64 Kbps pulse code modulation (PCM) is recommended for speech source coding in telephone systems, with 2⁸ = 256 quantization levels [35]. Moreover, 16-bit PCM is adopted in our work for speech transmission in multimedia transmission systems, with 2¹⁶ = 65,536 quantization levels. Note that A-law PCM and uniform PCM are adopted in the telephone systems and the multimedia transmission systems, respectively. For the channel coding, turbo codes with the soft output Viterbi algorithm (SOVA) are considered to improve error detection and correction at the receiver [36]; the coding rate is 1/3, the block length is 512, and the number of decoding iterations is 5. In addition, to make the number of transmitted symbols in the traditional systems the same as that in DeepSC-S, 64-QAM is adopted for modulation in the benchmark. The settings of the typical communication systems for the two transmission applications are summarized in Table I.
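For reference, a minimal sketch of G.711-style A-law companding followed by uniform quantization is shown below; the turbo coding and 64-QAM stages of the benchmark are omitted, and the helper names are ours.

```python
import numpy as np

A = 87.6  # standard A-law compression parameter (ITU-T G.711)

def alaw_compress(x):
    """A-law companding of samples x in [-1, 1] before uniform quantization."""
    ax = np.maximum(np.abs(x), 1e-12)        # avoid log(0)
    y = np.where(ax < 1 / A,
                 A * ax / (1 + np.log(A)),
                 (1 + np.log(A * ax)) / (1 + np.log(A)))
    return np.sign(x) * y

def quantize(y, bits=8):
    """Uniform quantization of companded samples to 2**bits levels."""
    levels = 2 ** bits
    return np.round((y + 1) / 2 * (levels - 1)).astype(int)

x = np.clip(0.3 * np.random.randn(1000), -1, 1)   # toy speech samples
codes = quantize(alaw_compress(x))                # 256-level codes, as in G.711
```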
2) Benchmark 2:
The second benchmark combines a feature encoder with the traditional model and is named the semi-traditional communication system, as shown in Fig. 4. At the training stage, the feature encoder takes the speech samples s as the input, and its output is fed into the feature decoder directly. The received signal is converted into the speech information ŝ by the feature decoder. Based on the signals s and ŝ, the MSE loss is computed at the end of the receiver, and the trainable parameters of the feature encoder and the feature decoder are updated via SGD at the same time. For the end-to-end testing, the pre-trained feature learning system is split into the feature encoder and the feature decoder, which are placed before the traditional transmitter and after the traditional receiver, respectively. Note that the signal processing blocks of the traditional communication system follow the settings in Table I. During the training stage, the feature encoder and the feature decoder are treated as the extraction and recovery operations without considering communication problems. During the end-to-end testing stage, the semi-traditional system aims to yield efficient transmission as well as to mitigate the channel effects. The network settings of the semi-traditional system are shown in Table II.

TABLE I: Parameter settings of the traditional communication systems.

                        Telephone Systems     Multimedia Systems
    Sample rate         8 kHz                 44.1 kHz
    Signal length       16,384                16,384
    Number of frames    128                   128
    Frame length        128                   128
    Source coding       8-bit A-law PCM       16-bit uniform PCM
    Channel coding      Turbo codes           Turbo codes
    Modulation          64-QAM                64-QAM

Fig. 4: The benchmark model combining a feature encoder with the traditional transmission system.

TABLE II: Parameter settings of the semi-traditional system.

    Layer name              Kernels     Activation
    Feature encoder:
      6 x CNN layers        6 x 32      ReLU
      CNN layer             1           None
    Feature decoder:
      6 x CNN layers        6 x 32      ReLU
      CNN layer             1           None
    Learning rate (η)       0.001
3) Benchmark 3:
In order to emphasize the improvement brought by the attention mechanism, we also include in the simulation a semantic communication system without attention, built on 2D CNN modules and named the CNN-based system. The network settings of the CNN-based system are shown in Table III.
TABLE III: Parameter settings of the CNN-based system for telephone systems.

    Layer name              Kernels     Activation
    Transmitter:
      6 x CNN modules       6 x 32      ReLU
      CNN layer             8           ReLU
    Receiver:
      CNN layer             8           ReLU
      6 x CNN modules       6 x 32      ReLU
      Last layer (CNN)      1           None
    Learning rate (η)       0.001

B. Experiments over Telephone Systems
In this experiment, we first investigate the robustness of the system to various channel conditions by training DeepSC-S under a fixed channel condition and then testing the MSE loss of the trained model under all adopted fading channels. Besides, we test the SDR and PESQ of DeepSC-S, the traditional system, the semi-traditional system, and the CNN-based system for speech transmission over telephone systems. Particularly, the number of SE-ResNet modules in the speech encoder/decoder is 6, and the number of 2D CNN modules in the channel encoder/decoder is 1, with 8 kernels. For each SE-ResNet module, the number of kernels in the split layer and the transition layer is 32. The learning rate is set to 0.001. The network settings of the proposed DeepSC-S are shown in Table IV.

TABLE IV: Parameter settings of the proposed DeepSC-S for telephone systems.
    Layer name                  Kernels     Activation
    Transmitter:
      6 x SE-ResNet modules     6 x 32      ReLU
      CNN layer                 8           ReLU
    Receiver:
      CNN layer                 8           ReLU
      6 x SE-ResNet modules     6 x 32      ReLU
      Last layer (CNN)          1           None
    Learning rate (η)           0.001

As shown in Fig. 5a, in terms of the MSE loss tested under the AWGN channels, DeepSC-S trained under the AWGN channels outperforms the models trained under the Rayleigh and Rician channels when the SNR is higher than about 6 dB; however, it yields higher MSE loss values in the low SNR regime. Besides, according to Fig. 5b, DeepSC-S trained under the AWGN channels performs quite poorly in terms of MSE loss when tested under the Rayleigh channels. Furthermore, Fig. 5c shows that the models trained under all three adopted channels achieve low MSE loss values when tested under the Rician channels. Therefore, DeepSC-S trained under the Rician channels is considered a robust model that is capable of coping with various channel environments.

Fig. 6 compares the SDR performance of the traditional communication system, the semi-traditional system, the CNN-based system, and the proposed DeepSC-S under the AWGN channels, the Rayleigh channels, and the Rician channels. From the figure, the semi-traditional system yields a higher SDR score than the traditional one under all tested channel environments, while it performs unreliably when the SNR is low. Besides, the CNN-based system and DeepSC-S achieve better SDR than the semi-traditional and traditional systems under the Rayleigh and Rician channels, as well as under the AWGN channels over most of the tested SNRs. In addition, DeepSC-S performs steadily when coping with different channels and SNRs, whereas the semi-traditional and traditional systems perform quite poorly under dynamic channel conditions, especially in the low SNR regime. Moreover, thanks to the SE-ResNet attention mechanism, the proposed DeepSC-S achieves a higher SDR score than the CNN-based system under all adopted SNRs and fading channels, which proves the effectiveness of DeepSC-S.

The PESQ score comparison is shown in Fig. 7. From the figure, the CNN-based system and DeepSC-S provide high-quality speech recovery and outperform the semi-traditional and traditional systems under various fading channels and SNRs. Moreover, similar to the SDR results, DeepSC-S obtains good PESQ scores when coping with channel variations, while the traditional system provides poor scores in the low SNR regime. DeepSC-S also achieves a higher score than the CNN-based system under all adopted channel conditions. Based on these results, the proposed DeepSC-S yields better speech transmission for telephone systems under complicated communication scenarios than the traditional systems, especially in the low SNR regime.

C. Experiments over Multimedia Transmission Systems
In this part, we present the SDR and PESQ performance comparison between DeepSC-S and the traditional systems for speech transmission in multimedia applications, as well as the CNN-based system, similar to the telephone communications experiment. The network settings of the CNN-based system and the proposed DeepSC-S are similar to those in Table III and Table IV, respectively, except that the number of kernels of the CNN layer at the transmitter and the receiver in both the CNN-based system and DeepSC-S is 16.

Fig. 8 depicts the SDR performance comparison for multimedia communications among the traditional system, the semi-traditional system, the CNN-based system, and the proposed DeepSC-S under the AWGN and Rician channels. For the traditional system under the AWGN channels, the SDR score increases sharply when the SNR exceeds 8 dB, and it achieves SDR scores over 80 at high SNRs due to the high PCM quantization accuracy. However, DeepSC-S reaches universally strong SDR values for all tested SNRs and fading channels. Moreover, DeepSC-S outperforms the semi-traditional and traditional systems under the Rician channels, as well as under the AWGN channels in the low SNR regime. Furthermore, DeepSC-S achieves a higher SDR score than the CNN-based system because the SE-ResNet module is utilized to learn and extract the essential information.

The PESQ results for multimedia communications are illustrated in Fig. 9. From the figure, the proposed DeepSC-S outperforms the semi-traditional and traditional systems under the Rician channels at all tested SNRs, as well as under the AWGN channels at low SNR values. Moreover, similar to the SDR results, the proposed DeepSC-S achieves a higher PESQ score than the CNN-based system
under all adopted channel conditions. Thus, the investigated DeepSC-S has greater adaptability than the traditional system for speech-based multimedia communications when coping with channel variations.

Fig. 5: MSE loss tested for (a) AWGN, (b) Rayleigh, and (c) Rician channels with the models trained under various channels.

Fig. 6: SDR score versus SNR for speech-based telephone communications with the traditional system, the semi-traditional system, the CNN-based system, and the proposed DeepSC-S for (a) AWGN channels, (b) Rayleigh channels, and (c) Rician channels.

Fig. 7: PESQ score versus SNR for speech-based telephone communications with the traditional system, the semi-traditional system, the CNN-based system, and the proposed DeepSC-S for (a) AWGN channels, (b) Rayleigh channels, and (c) Rician channels.

Fig. 8: SDR score versus SNR for speech-based multimedia communications with the traditional system, the semi-traditional system, the CNN-based system, and the proposed DeepSC-S for AWGN channels and Rician channels.

Fig. 9: PESQ score versus SNR for speech-based multimedia communications with the traditional system, the semi-traditional system, the CNN-based system, and the proposed DeepSC-S for AWGN channels and Rician channels.

VI. CONCLUSION
In this article, we investigated a DL-enabled semantic communication system for speech transmission, named DeepSC-S, which achieves more efficient transmission than the traditional approaches by utilizing the semantic information of speech signals. Particularly, we jointly designed the speech encoder/decoder and the channel encoder/decoder to learn and extract the speech features, as well as to mitigate the channel distortion and attenuation in practical communication scenarios. Additionally, an attention mechanism based on squeeze-and-excitation (SE) networks is utilized to improve the recovery accuracy by minimizing the mean-squared error of the speech signals. Moreover, in order to enable DeepSC-S to work well over various physical channels, a DeepSC-S model with strong robustness to channel variations is investigated. The proposed DeepSC-S is evaluated under telephone systems and multimedia transmission systems to verify its adaptability. Simulation results demonstrated that DeepSC-S outperforms the traditional communication systems as well as the semi-traditional system with an extra feature encoder, especially when the SNR is low. Hence, our proposed DeepSC-S is a promising candidate for speech semantic communication systems.

REFERENCES
[1] Z. Qin, H. Ye, G. Y. Li, and B.-H. F. Juang, "Deep learning in physical layer communications," IEEE Wireless Commun., vol. 26, no. 2, pp. 93–99, Apr. 2019.
[2] Z. Qin, G. Y. Li, and H. Ye, "Federated learning and wireless communications," https://arxiv.org/abs/2005.05265, May 2020.
[3] T. Gruber, S. Cammerer, J. Hoydis, and S. T. Brink, "On deep learning-based channel decoding," in Proc. IEEE 51st Annu. Conf. Inf. Sci. Syst. (CISS), Baltimore, MD, USA, Mar. 2017, pp. 1–6.
[4] H. Ye, G. Y. Li, and B.-H. F. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114–117, Feb. 2018.
[5] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO detection," in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun., Sapporo, Japan, Dec. 2017, pp. 690–694.
[6] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Trans. Signal Process., vol. 66, no. 20, pp. 5438–5453, Oct. 2018.
[7] L. Liang, H. Ye, G. Yu, and G. Y. Li, "Deep-learning-based wireless resource allocation with application to vehicular networks," Proc. IEEE, vol. 108, no. 2, pp. 341–356, Feb. 2020.
[8] C. E. Shannon and W. Weaver, The Mathematical Theory of Communications. The University of Illinois Press, 1949.
[9] R. Carnap and Y. Bar-Hillel, An Outline of a Theory of Semantic Information. RLE Technical Report 247, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, Oct. 1952.
[10] J. Bao, P. Basu, M. Dean, C. Partridge, A. Swami, W. Leland, and J. A. Hendler, "Towards a theory of semantic communication," in Proc. IEEE Netw. Sci. Workshop, West Point, NY, USA, Jun. 2011, pp. 110–117.
[11] P. Basu, J. Bao, M. Dean, and J. Hendler, "Preserving quality of information by using semantic relationships," Pervasive Mob. Comput., vol. 11, pp. 188–202, Apr. 2014.
[12] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
[13] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 33–42, Jan. 2012.
[14] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, UT, USA, Jun. 2018, pp. 7132–7141.
[15] M. Kim, W. Lee, and D.-H. Cho, "A novel PAPR reduction scheme for OFDM system based on deep learning," IEEE Commun. Lett., vol. 22, no. 3, pp. 510–513, Mar. 2018.
[16] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. T. Brink, "OFDM-autoencoder for end-to-end learning of communications systems," in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun., Kalamata, Greece, Jun. 2018, pp. 1–5.
[17] T. J. O'Shea, T. Erpek, and T. C. Clancy, "Physical layer deep learning of encodings for the MIMO fading channel," in Proc. Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Monticello, IL, USA, Oct. 2017, pp. 76–80.
[18] H. He, C. Wen, S. Jin, and G. Y. Li, "Model-driven deep learning for MIMO detection," IEEE Trans. Signal Process., vol. 68, pp. 1702–1715, Feb. 2020.
[19] F. A. Aoudia and J. Hoydis, "Model-free training of end-to-end communication systems," IEEE J. Sel. Areas Commun., vol. 37, no. 11, pp. 2503–2516, Nov. 2019.
[20] H. Ye, L. Liang, G. Y. Li, and B.-H. Juang, "Deep learning based end-to-end wireless communication systems with conditional GAN as unknown channel," IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3133–3143, May 2020.
[21] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[22] S. Park, O. Simeone, and J. Kang, "Meta-learning to communicate: Fast end-to-end training for fading channels," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Barcelona, Spain, May 2020, pp. 5075–5079.
[23] B. Guler, A. Yener, and A. Swami, "The semantic communication game," IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 4, pp. 787–802, Sep. 2018.
[24] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, "Deep learning enabled semantic communication systems," https://arxiv.org/abs/2006.10685, May 2020.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances Neural Inf. Process. Syst. (NIPS'17), Long Beach, CA, USA, Dec. 2017, pp. 5998–6008.
[26] H. Xie and Z. Qin, "A lite distributed semantic communication system for Internet of Things," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 142–153, Jan. 2021.
[27] E. Bourtsoulatze, D. Burth Kurka, and D. Gündüz, "Deep joint source-channel coding for wireless image transmission," IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 3, pp. 567–579, Sep. 2019.
[28] D. B. Kurka and D. Gündüz, "DeepJSCC-f: Deep joint source-channel coding of images with feedback," IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 178–193, Apr. 2020.
[29] C. Lee, J. Lin, P. Chen, and Y. Chang, "Deep learning-constructed joint transmission-recognition for Internet of Things," IEEE Access, vol. 7, pp. 76547–76561, Jun. 2019.
[30] M. Jankowski, D. Gündüz, and K. Mikolajczyk, "Joint device-edge inference over wireless links with pruning," in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun., Atlanta, GA, USA, Aug. 2020, pp. 1–5.
[31] M. Jankowski, D. Gündüz, and K. Mikolajczyk, "Wireless image retrieval at the edge," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 89–100, Jan. 2021.
[32] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Language Process., vol. 14, no. 4, pp. 1462–1469, Jun. 2006.
[33] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Salt Lake City, UT, USA, May 2001, pp. 749–752.
[34] Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, ITU-T Recommendation P.862, Mar. 2018.
[35] R. Cox, "Three new speech coders from the ITU cover a range of applications," IEEE Commun. Mag., vol. 35, no. 9, pp. 40–47, Sep. 1997.
[36] Y. Wu and B. Woerner, "The influence of quantization and fixed point arithmetic upon the BER performance of turbo codes," in