Introducing a Novel Data over Voice Technique for Secure Voice Communication

Abstract

The current increasing need for privacy-preserving voice communications is leading to new ideas for securing voice transmission. This paper refers to a relatively new concept of sending encrypted data or speech as pseudo-speech in the audio domain over existing voice communication infrastructures, like 3G cellular network and Voice over IP (VoIP). The distinctive characteristic of such a communication system is that it relies on the robust transmission of binary information in the form of audio signal. This work presents a novel Data over Voice (DoV) technique based on codebooks of short harmonic waveforms. The technique provides a sufficiently fast and reliable data rate over cellular networks and many VoIP applications. The new method relies on general principles of Linear Predictive Coding for voice compression (LPC voice coding) and is more versatile compared to solutions trained on exact channel models. The technique gives by design a high control over the desired rate of transmission and provides robustness to channel distortion. In addition, an efficient codebook design approach inspired by quaternary error correcting codes is proposed. The usability of the proposed DoV technique for secure voice communication over cellular networks and VoIP has been successfully validated by empirical experiments. The paper details the system parameters, putting a special emphasis on system's security and technical challenges.


A Preprint (arXiv, cs.CR)

Piotr Krasnowski, Jerome Lebrun, Bruno Martin

Univ. Côte d’Azur, I3S-CNRS, 2000, Route des Lucioles, BP 121, 06903 Sophia Antipolis-Cedex, France
{krasnowski,lebrun,bruno.martin}@i3s.unice.fr

Keywords

Secure Voice Communications · Data over Voice · Digital Voice Channels · VoIP · Quaternary ECC

Secure data transmission over voice channels (telephone lines) dates back to the early 1940s and the development of what was probably the first secret telephony system for use during World War II, nicknamed SIGSALY or ‘Green Hornet’ [1]. However, the term modem (‘modulator-demodulator’), in the sense of a device converting data into a format suitable for transmission, originated in the late 1950s in reference to the Bell 101 modem developed by Bell Labs. The modems were part of SAGE (Semi-Automatic Ground Environment), the largest computer-aided information system of that time [2]. Their role was to transmit digital radar pictures over telephone wires using frequency-shift keying (FSK) modulation, initially at 750 bps and later at 1300 bps and 2100 bps. In the following years, rapid progress in modem technology contributed to the creation of the digital subscriber line (DSL), with data rates exceeding 10 kbps [3].

Somewhat surprisingly, the possibility of exploiting cellular vocal networks for data transmission went unnoticed until the seminal work of Katugampala, Vilette, and Kondoz in 2003 [4], in which the authors suggested sending encrypted bits of compressed voice between two cellular phones. Since then, cellular vocal networks have attracted attention as a potential high-priority, low-bandwidth data communication channel with errors. The work on Data over Voice (DoV) technology in cellular networks enabled new applications, such as the emergency call system eCall [5], messaging over voice [6], point of sale (POS) financial transactions [7], automatic network address translator (NAT) traversal [8], and secure data and voice communications [9, 10].

With the quickly expanding data-driven 4G networks and the deployment of 5G networks, the use of voice channels for sending data is diminishing. Nevertheless, DoV techniques are still crucial in secure voice communications, for example, those provided by Crypto Phones or other specialized devices [11]. On the other hand, voice channels can be maliciously used for extruding private data or in Advanced Persistent Threat (APT) attacks [12].

The crucial challenges related to DoV are a consequence of the principles underlying digital voice channels. Namely, voice channels aim at preserving speech intelligibility and quality while reducing perceptually redundant information. In contrast to classical data channels, voice channels significantly distort the sent signal due to transcoding and audio processing. Moreover, modern digital voice channels are selective to signal parameters conforming to the speech model adopted in a particular system. To mitigate signal degradation caused by voice channels, several authors proposed DoV techniques based on encoding the data signal into speech-like parameters, codebook training, or modulation techniques.

Katugampala et al. [4] proposed a system that uses predefined codebooks to map bits into vocal parameters: energy, pitch, and spectral envelope (encoded as line spectral pairs, LSP [13]). The encoded parameters are transformed into a pseudo-speech signal adapted to transmission over a cellular network. Data extraction is done by a paired speech analyzer, which restores vocal parameters from the signal and decodes codebook indices. The system enabled transmission over a real GSM voice channel at a rate of 3000 bps with 2.9% BER [9]. Similar techniques were presented by Ozkan et al. [14] and Rashidi et al. [15], who achieved transmission rates of 1600 bps and 2000 bps, respectively, in simulations.

LaDue et al. [16], and Sapozhnykov and Fienberg [17] investigated genetic and pattern matching algorithms to construct codebooks of short speech-like waveforms. Instead of synthesizing pseudo-speech, the authors proposed encoding the bitstream directly into a sequence of symbols selected from a trained wavetable. Upon reception, received symbols were decoded with a bank of matched filters.
The technique achieved a remarkable 4000 bps with 2.3% BER over an enhanced full rate (EFR) voice channel. Unfortunately, the training process was time-consuming and required considerable computational resources. Moreover, the obtained wavetable was compatible with a single channel model and hence impractical in real communication.

The problem of long and heavy computations has been tackled by Shahbazi et al. [18] and Boloursaz et al. [19], who simplified the codebook construction by limiting the search to signals from the TIMIT speech database [20]. In parallel, Kazemi et al. [21] proposed exploiting sphere packing techniques to construct waveforms with a large minimum distance and an improved detection rate.

Finally, there exists a range of DoV techniques based on well-established, classical signal modulation. Zhan Xu [22], and Chmayssani and Baudoin [23] tested phase shift keying (PSK) and quadrature amplitude modulation (QAM) in simulations, and achieved bitrates within the range of 1 to 3 kbps. Ali et al. [24] exploited M-ary frequency shift keying (M-FSK), whereas Dhananjay et al. [6] introduced a modified binary FSK (BFSK) tolerant to a small frequency deviation. Chen and Guo [10] reported a solution using orthogonal frequency division multiplexing (OFDM) modulation combined with PSK.

An inspiring technique based on Amplitude Shift Keying (ASK), named PCCD-OFDM-ASK, has been presented by Mezgec et al. [7]. Phase-Continuity and Context Dependency (PCCD) refers to techniques providing phase continuity of the modulated signal. In PCCD-OFDM-ASK, blocks of 8-bit sequences are encoded onto eight orthogonal harmonics, numbered from 1 to 8. In contrast to classical OFDM, each bit in the 8-bit block is represented by the presence or absence of an orthogonal carrier. For instance, the binary 8-bit sequence ‘10001010’ is mapped to a symbol with harmonics present only at positions 1, 5, and 7.
The scheme offers robust transmission up to 500 bps over real cellular voice channels.

This article introduces a new DoV codebook-based modulation over cellular networks and VoIP for the needs of secure voice communication. The novelty comes from our simplified and universal codebook design process, compared with the usual extensive codebook training on a selected voice model. Nevertheless, the method can be adapted to a particular channel, avoiding codebook over-tuning in the presence of fluctuating channel characteristics. Modulation parameters are easily adjustable in order to balance the transmission bitrate and the robustness to errors.

The proposed technique was thoroughly tested with real voice calls. The scheme achieves up to 6.4 kbps over VoIP voice channels using a 4G wireless network and 2.4 kbps over 3G cellular calls (see Section 5.3). It also enables safe voice transmission with an effective bit error rate significantly below 1%.

This paper is organized as follows. Section 2 outlines challenges related to sending data over voice channels with LPC-based speech compression. Section 3 investigates signal distortion introduced by three selected LPC coders: AMR, Speex, and Opus-Silk. Section 4 describes the novel DoV technique, including codebook construction, signal generation, and demodulation. Section 5 presents performance results obtained by simulations and real-world experiments, and Section 6 proposes a secure voice communication scheme using DoV. Finally, Section 7 concludes the article.


This section introduces critical challenges related to data transmission over voice channels. It outlines the specific behavior of voice channels, which differs markedly from that of classical communication channels, and highlights the desired properties of DoV signals.

In real-world implementations, a complete voice channel is typically the concatenation of algorithms that transform a speech signal into binary data suitable for transmission over the network. Despite the lossy nature of speech processing, the received binary information is sufficient to re-synthesize speech perceptually similar to the original. However, from a DoV perspective, it is more convenient to consider voice channels as communication channels with particular constraints and signal distortion characteristics.

The core elements of any digital voice channel are voice codecs, which compress and encode sampled speech waveforms by exploiting principles of speech production and perception [25]. Real-time voice coders usually process speech on a frame basis by mapping portions of a speech waveform into sets of vocal parameters. These algorithms may perform high-pass filtering, differential encoding, and adaptive quantization to improve the compression ratio depending on the available network throughput. Unfortunately, such operations add memory and latency to a voice channel, and make it non-linear and non-stationary.

In addition to voice compression, modern voice communication systems apply techniques such as Voice Activity Detection (VAD) [26], Adaptive Gain Control (AGC) [27], or Noise Suppression (NS) [28]. Unlike voice coders, the implementation of these algorithms is rarely public and their impact on DoV cannot be fully predicted.

When all the mentioned elements of real voice channels are combined, an analytic model of signal distortion is usually intractable. Nevertheless, it is still worthwhile to consider the most fundamental properties of voice channels and to construct a DoV scheme that is agnostic to small variations of the voice channel characteristics.

Most of the voice coders operating in the upper-middle bitrate range (10 kbps to 16 kbps) that are listed in ITU, IETF, and 3GPP standards and widely adopted in cellular and VoIP systems rely on Linear Predictive Coding (LPC). LPC coders take their inspiration from a simplified speech production model, often referred to as the source-filter model [29, 30]. According to the model, voice sound originates from a single source $e(t)$ and is filtered by a vocal tract with an impulse response $v(t)$. Such a simplification is justified for voiced and stationary sounds, which can be approximately represented by the buzzing excitation produced in the glottis and shaped when passing through the pharynx and between the tongue, teeth, and lips. The resulting signal has the form $s(t) = e(t) * v(t)$, where $*$ denotes the convolution product.

However, considering voice as the convolution of excitation and vocal tract shaping would be of little practical value without effective methods for separating these components. The excitation and vocal tract characteristics can be well approximated during LPC analysis (hence LPC coders). The outputs of LPC analysis consist of a linear prediction filter describing the vocal tract’s filtering effect and a residual that can be viewed as an excitation signal.

As an example, Fig. 1a and Fig. 1b present 100 ms of a real recording of the vowel /a/ in the time and the frequency domain. It can be noticed that this spectrum has a harmonic structure and could be accurately parameterized by its energy, spectral envelope, and fundamental frequency. The dashed line in Fig. 1b, which coincides with the spectral envelope of the vowel, represents the frequency response of the estimated LPC filter. On the other hand, the peaks of the residual signal in Fig. 1c correspond to a buzzing excitation from the glottis. Finally, the frequency spectrum of the residual in Fig. 1d is relatively flat and has fewer distinct formants (acoustic harmonic resonances) than the initial spectrum in Fig. 1b.
Thus, we can reach the intuitive conclusion that LPC analysis separates the spectral envelope from the harmonic content of the signal.

Source-filter separation emphasizes the relevant vocal information, which is advantageous in signal compression. Figure 2 depicts a simplified diagram of speech analysis and synthesis by a generic LPC coder. The encoder estimates LPC coefficients and calculates the excitation of a small portion of speech (typically 5 ms to 20 ms). Lossy excitation encoding puts stress on preserving the harmonic content of the speech, whereas LPC filters are often weighted to boost formants, taking advantage of the human auditory system’s specificities and information redundancy. It is therefore understandable that vocal parameters in a waveform are usually well preserved during compression, while the less speech-like components are removed. The output waveform is also smoothed in the time and spectral domains to remove ringing effects caused by frame-based processing.
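The source-filter separation described above can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to perform LPC analysis; the text does not state which estimator the studied coders use. The synthetic frame (a 100 Hz pulse train passed through a single two-pole resonator at 700 Hz), the order-2 predictor, and all function names are illustrative assumptions, not the setup behind Fig. 1.

```python
import math

def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a finite frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error
    return a, err

def lpc_residual(x, a):
    """Inverse-filter the frame: e[n] = sum_j a[j] * x[n - j]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

# Toy "voiced" frame (assumption): glottal pulses every 80 samples (100 Hz at
# 8 kHz) shaped by one resonance at 700 Hz with pole radius 0.95.
FS = 8000
PITCH_PERIOD = 80
b1 = 2 * 0.95 * math.cos(2 * math.pi * 700.0 / FS)
b2 = -0.95 * 0.95
excitation = [1.0 if n % PITCH_PERIOD == 0 else 0.0 for n in range(800)]
speech = []
for n, e in enumerate(excitation):
    s1 = speech[n - 1] if n >= 1 else 0.0
    s2 = speech[n - 2] if n >= 2 else 0.0
    speech.append(e + b1 * s1 + b2 * s2)

a, err = levinson_durbin(autocorrelation(speech, 2), 2)
residual = lpc_residual(speech, a)
signal_energy = sum(s * s for s in speech)
residual_energy = sum(e * e for e in residual)
print("LPC coefficients:", a)
print("residual / signal energy:", residual_energy / signal_energy)
```

In this toy case the estimated filter approximates the resonator (`a[1]` close to `-b1`, `a[2]` close to `-b2`), and the residual carries only a small fraction of the frame energy, concentrated around the excitation pulses. Real speech needs a higher order, such as the 12th-order filter of Fig. 1.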

Linear Predictive Coding achieves remarkable results in representing and compressing smoothly varying voiced sounds but often struggles with encoding short and noisy plosives (like /p/ or /t/), which do not fit into the source-filter speech model. To improve the robustness for noisy sounds, LPC coders incorporate more flexibility into the excitation encoder. This observation suggests that the potential performance of the DoV technique would mostly depend on the accuracy and reaction time of excitation encoding.

Despite preserving core speech intelligibility, time-domain LPC coding destroys the fine time-structure of compressed signals. Thus, it is not obvious how voice channels equipped with LPC coders modify the sent signal. In Section 3, we describe a simplified framework that will allow us to evaluate the typical distortion introduced by LPC voice channels.

Figure 1: LPC analysis of vowel /a/: (a) time-domain waveform, (b) spectrum of the waveform (blue solid line) and frequency response of the 12th-order LPC filter (red dashed line), (c) residual of LPC analysis (solid blue line) and excitation peaks (red dashed line), (d) frequency spectrum of the residual. Panels (a) and (c) are plotted against time (ms); panels (b) and (d) against frequency (kHz).

Figure 2: Simplified diagram of LPC encoder and decoder.


This section presents a novel DoV technique based on codebooks of phase-modulated harmonic waveforms. The proposed solution is the result of extensive simulation experiments with three representative LPC narrow-band coders: AMR 12.2 kbps [31], Speex v1.2 11 kbps [32], and Opus-Silk v1.3.1 12 kbps [33].

The section begins with a thorough analysis of signal distortion characteristics caused by the selected voice compression algorithms. The investigation leads to a significant improvement in harmonic signal demodulation. Finally, the section proposes a simplified codebook design approach.

Because they are optimized for vowel sounds, LPC coders are well suited to synthesizing multi-harmonic signals. On the other hand, the versatility of excitation encoding allows easy manipulation of phase information, which above 2 kHz typically plays a lesser role in speech intelligibility [25, 34]. Combining phase modulation with multiple subcarriers is particularly interesting, as it opens the possibility of applying spectrally efficient orthogonal frequency-division multiplexing (OFDM) modulation [35]. The OFDM approach has already been analyzed in the context of DoV in [10]. That solution is based on 27 independently modulated carriers and achieved a high bitrate of 2.4 kbps over the (now obsolete) RPE-LTP GSM voice coder at an acceptably low error rate.

Figure 3: Energy-normalized variance of spectral distortion and related error rates of phase detection in multi-tone signals compressed by a selection of LPC coders (AMR, Speex, Silk): (a) radial variance of distortion, (b) error rate of phase detection. The initial multi-tone signal consisted of four independently phase-modulated harmonics at frequencies 400 Hz, 800 Hz, 1200 Hz, and 1600 Hz, with a modulation order 4 and a modulation rate of 200 baud. Then, the set of carriers was expanded by adding harmonics at 2000 Hz, 2400 Hz, ..., 3200 Hz with a 400 Hz step. Colored bars denote the lowest and highest values among harmonics, and dots indicate the average.

Since LPC coders process the signal jointly, it is not clear how the presence of other harmonics affects the distortion of each component. Figure 3 presents the energy-normalized variance of spectral distortion and related error rates of phase detection in multi-tone signals compressed by a selection of LPC coders. It can be noticed that there is a direct relation between the variance of distortion and the error rate. In addition, as the number of harmonics in the multi-tone signal goes up, the variability of error rates rises. Moreover, harmonics are not distorted uniformly, which is especially noticeable for Silk. This is because the codec puts more emphasis on preserving lower frequencies [33], which are especially important for the auditory perception of voice [36].


The distortion introduced by each studied coder has a similar nature, as presented in Fig. 4. Apart from random noise-like distortion, all samples are subject to a constant phase shift (this effect was also observed in [12, 22]). The phase shift depends on the frequency and the specific LPC coder, but not on the symbol duration. The phase shift is probably introduced during speech synthesis by the LPC reconstruction filter with a non-uniform phase response.

The sample density distributions of the variable part of distortion are approximately Gaussian, like those presented in Fig. 5. As the frequency goes up, the width (i.e., variance) becomes larger. This observation supports the intuition that harmonics at lower frequencies are generally less distorted by compression.

Figure 4: Scatter plots of a four-harmonic signal compressed by AMR. Each plot represents the distortion of one phase-modulated harmonic at (a) 400 Hz, (b) 800 Hz, (c) 1200 Hz, and (d) 1600 Hz, with a modulation rate of 200 baud. Blue points correspond to compressed symbols, whereas red dots denote the initial phase constellation. The angle of the phase shift (restricted by black rays) varies with frequency.

Figure 5: Sample probability density function of the variable part of distortion of two harmonics at frequencies 800 Hz and 1600 Hz, compressed by a selection of LPC coders (AMR, Speex, Silk). The compressed signal consisted of four independently phase-modulated carriers at frequencies 400 Hz, 800 Hz, 1200 Hz, and 1600 Hz, with a modulation order 4 and a modulation rate of 200 baud. Distortion in the transversal axis is centered using mean phase shift compensation, and the x-axes are normalized to the initial amplitude value of each harmonic.

Figure 6: Statistical parameters of spectral distortion in multi-tone signals compressed by a selection of LPC coders (AMR, Speex, Silk): (a) skewness, (b) kurtosis, (c) inter-harmonic correlation, (d) time correlation. The initial multi-tone signal consisted of four independently phase-modulated harmonics at frequencies 400 Hz, 800 Hz, 1200 Hz, and 1600 Hz, with a modulation order 4 and a modulation rate of 200 baud. Then, the set of carriers was expanded by adding harmonics at 2000 Hz, 2400 Hz, ..., 3200 Hz with a 400 Hz step. Colored bars denote the lowest and highest values among harmonics, and dots indicate the average.

Figures 6a and 6b present Mardia’s bivariate skewness and kurtosis of the variable part of distortion. Mardia’s skewness and kurtosis of a $p$-variate random sample $x_1, \ldots, x_n$ with sample mean vector $\bar{x}$ and sample covariance $S$ are defined as [37]:

$$\text{skewness} = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{\ell=1}^{n} \left[ (x_k - \bar{x})^{T} S^{-1} (x_\ell - \bar{x}) \right]^3, \tag{1}$$

$$\text{kurtosis} = \frac{1}{n} \sum_{k=1}^{n} \left[ (x_k - \bar{x})^{T} S^{-1} (x_k - \bar{x}) \right]^2. \tag{2}$$

For a sample taken from a $p$-variate normal distribution, the statistics simplify to:

$$\text{skewness} = 0 \quad \text{and} \quad \text{kurtosis} = p(p + 2). \tag{3}$$

It can be noticed that in the case of AMR and Speex (and to some extent Silk), the computed Mardia’s skewness and kurtosis are close to 0 and 8, respectively, which are the values characterizing a symmetric bivariate normal distribution [38]. Crucially, distortion is not significantly correlated either in time or between harmonics (Fig. 6c and Fig. 6d). As a result, there is some evidence to treat the variable part of signal distortion as independent and memoryless. This can be seen as an advantage for demodulation but is also quite surprising, because the analyzed coders are deterministic and non-linear. It suggests that distortion characteristics depend not only on LPC coders but also on statistical properties of the modulated signal.
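Eqs. (1) and (2) can be evaluated directly for the bivariate case ($p = 2$). The sketch below is illustrative only: the function name `mardia` is hypothetical, the covariance is normalized by $1/n$ (an assumption, since the text only says "sample covariance"), and the Gaussian test sample is synthetic rather than measured codec distortion.

```python
import random

def mardia(sample):
    """Mardia's skewness and kurtosis, Eqs. (1)-(2), for a list of 2-D points."""
    n = len(sample)
    mx = sum(p[0] for p in sample) / n
    my = sum(p[1] for p in sample) / n
    centered = [(x - mx, y - my) for x, y in sample]
    # Covariance entries with 1/n normalization (assumption, see lead-in).
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det   # entries of S^-1

    def form(u, v):
        """Bilinear form u^T S^-1 v."""
        return u[0] * (ixx * v[0] + ixy * v[1]) + u[1] * (ixy * v[0] + iyy * v[1])

    skewness = sum(form(u, v) ** 3 for u in centered for v in centered) / n ** 2
    kurtosis = sum(form(u, u) ** 2 for u in centered) / n
    return skewness, kurtosis

# Synthetic bivariate normal sample: Eq. (3) predicts skewness 0, kurtosis 8.
rng = random.Random(1)
gaussian = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(400)]
g_skew, g_kurt = mardia(gaussian)

# Four points of an ideal 4-PSK constellation, for contrast.
square = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
s_skew, s_kurt = mardia(square)
print(g_skew, g_kurt, s_skew, s_kurt)
```

For a finite Gaussian sample the statistics only approach 0 and 8, which is why the text speaks of values "close to" these numbers. The four constellation points give a kurtosis of 4, far from 8, so the statistic does discriminate between a noise-like cloud and a discrete pattern.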

An open question remains, though, for other LPC coders at similar compression rates. Specifically, LPC coding’s basic principles do not imply the independence of distortion in the time and the frequency domain. On the other hand, it is arguable that such properties of the proposed modulation as harmonicity and constant spectral amplitude are compatible with LPC coding’s fundamental properties. Therefore, it should be suitable for the vast majority of LPC coders.

Figure 7 depicts the typical diagram of a data transmission system over a voice channel, which uses a codebook of $M$ predefined discrete-time audio waveforms. Signal generation is a two-step procedure that first encodes the binary input into a sequence of indices $(m_1, m_2, \ldots)$ and then maps these indices into a concatenation of codebook symbols $s = (s_{m_1}, s_{m_2}, \ldots)$. Finally, the resulting discrete-time audio signal $s$ is played to the (digital) audio input of a voice channel.

Figure 7: Modulation and demodulation of a discrete DoV signal using a codebook of $M$ predefined discrete audio waveforms.

On the reception side, the demodulator splits the received sampled audio signal $r = (r_{m_1}, r_{m_2}, \ldots)$ into short chunks of fixed length corresponding to the symbol duration, and then performs symbol-by-symbol matched filtering with all codebook entries. In the last steps, the demodulator extracts the indices of the codebook symbols giving the highest correlation value and decodes the binary information.

In the proposed DoV technique, a codebook symbol is a vector of waveform samples $s_m = [s_m[0], \ldots, s_m[N-1]]$ sampled at 8 kHz and of duration between 2.5 and 10 ms. Each symbol consists of a small number $K$ (between 7 and 10) of orthogonal harmonics modulated by quadrature phase-shift keying (4-PSK):

$$s_m[n] = \Re\left( \sum_{k=0}^{K-1} C_{m,k} \exp\left( j (k + k_0)\, \omega_0 \frac{n}{N} \right) \right), \quad n = 0, 1, \ldots, N-1, \tag{4}$$

where $0 \le m < M$ is the symbol index, $\omega_0$ denotes the fundamental angular frequency, and $k_0$ is the subband of the lowest harmonic. Finally, $C_m = \{ C_{m,k} \mid 0 \le k < K \}$ denotes a sequence of $K$ complex PSK symbols over the phase-amplitude plane:

$$C_{m,k} = A \cdot \exp(j 2\pi \varphi_{m,k} / 4), \quad k = 0, \ldots, K-1, \tag{5}$$

where $A$ is the amplitude and $\Phi_m = \{ 2\pi \varphi_{m,k} / 4 \mid 0 \le k < K,\ \varphi_{m,k} \in \mathbb{Z}_4 \}$ denotes a sequence of PSK phases (the selection of phase sequences will be detailed in Section 4.1). Examples of such waveforms are presented in Fig. 8.

Figure 8: Three discrete-time codebook waveforms (respectively blue, red, and green dots) of duration 5 ms and consisting of 10 harmonics at frequencies 600 Hz, 800 Hz, ..., 2400 Hz, with a 200 Hz step.

The symbol structure is equivalent to the discrete-time base-band representation of 4PSK-OFDM modulation [35]. Therefore, the received symbols can be processed in a similar manner using subband de-multiplexing. Let $\tilde{C}_m = \{ \tilde{C}_{m,k} \mid 0 \le k < K \}$ be the sequence of PSK symbols obtained from some received codebook symbol $r_m$. Assuming a typical AWGN (Additive White Gaussian Noise) channel, the maximum likelihood OFDM symbol detection can be expressed by the L2 norm minimization in the complex plane [39]:

$$\tilde{m} = \arg\min_{m} \sum_{k=0}^{K-1} \left| \tilde{C}_{m,k} - A \exp(j 2\pi \varphi_{m,k} / 4) \right|^2. \tag{6}$$

However, the experiments in Section 3.1 indicated that compression by the selected LPC coders causes group delay in the processed signal and alters each harmonic with a distortion of different variance. The estimated phase shift $\hat{\phi}_k$ and the variance of distortion $\hat{\sigma}_k^2$ of each harmonic can be computed using a training sequence and the following estimators for sample mean and sample variance [40]:

$$\hat{\mu}_k = |\hat{\mu}_k| \exp\left( j \hat{\phi}_k \right) = \frac{1}{L} \sum_{\ell=0}^{L-1} \tilde{C}_{m_\ell,k} \exp(-j 2\pi \varphi_{m_\ell,k} / 4), \tag{7}$$

$$\hat{\sigma}_k^2 = \frac{1}{L-1} \sum_{\ell=0}^{L-1} \left| \tilde{C}_{m_\ell,k} \exp(-j 2\pi \varphi_{m_\ell,k} / 4) - \hat{\mu}_k \right|^2, \tag{8}$$

where $\tilde{C}_{m_\ell} = \{ \tilde{C}_{m_\ell,k} \mid 0 \le k < K \}$ denotes the $\ell$-th sequence of PSK symbols measured at the reception side and $\Phi_{m_\ell} = \{ 2\pi \varphi_{m_\ell,k} / 4 \mid 0 \le k < K,\ \varphi_{m_\ell,k} \in \mathbb{Z}_4 \}$ denotes the initial phases of the corresponding codebook symbols in the training sequence.

With the estimated $\hat{\phi}_k$ and $\hat{\sigma}_k^2$, one may apply the phase shift compensation and spectral weighting of distortion in the demodulation rule from Eq. (6):

$$\tilde{m} = \arg\min_{m} \sum_{k=0}^{K-1} \left| \tilde{C}_k \exp(-j \hat{\phi}_k) - A \exp(j 2\pi \varphi_{m,k} / 4) \right|^2 / \hat{\sigma}_k^2. \tag{9}$$

Finally, rewriting Eq. (9) and removing the constant terms gives a more convenient demodulation rule, which maximizes the real part of a complex dot product [39]:

$$\tilde{m} = \arg\max_{m} \Re\left( \sum_{k=0}^{K-1} \tilde{C}_k \cdot \frac{A}{\hat{\sigma}_k^2} \exp\left( -j 2\pi \varphi_{m,k} / 4 - j \hat{\phi}_k \right) \right). \tag{10}$$

In contrast to time-domain matched filtering, the proposed demodulation rule enables phase and variance correction of the channel distortion. It also becomes more efficient as the codebook size grows. Instead of performing $M$ matched-filtering operations on a symbol of length $N$, this demodulator needs to compute the in-phase/quadrature (I/Q) representations of $K < N$ PSK symbols and to correlate them with $M$ different phase sequences. As an example, given the triple $(K, M, N) = (8, 256, 40)$, matched filtering in the time domain requires at least $256 \cdot 40 = 10240$ real-value multiplications. On the other hand, demodulation using Eq. (10) involves computing the complex PSK symbols ($2 \cdot 8 \cdot 40 = 640$ real-value multiplications) and comparing the obtained sequence with all phase combinations in the codebook ($8 \cdot 256 = 2048$ complex multiplications, or at least [...] real-value multiplications).
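The chain formed by Eqs. (4), (5), (7), (8), and (10) can be sketched end to end. Everything specific below is an assumption made for illustration: $\omega_0 = 2\pi$ and $k_0 = 3$ (a 200 Hz harmonic spacing starting at 600 Hz for a 5 ms symbol), a 16-entry toy codebook, and a toy channel that rotates each harmonic by a fixed made-up angle and adds Gaussian noise. It is a stand-in for a voice codec, not a model of AMR, Speex, or Silk, and the function names are hypothetical.

```python
import cmath
import math
import random

N = 40      # samples per symbol: 5 ms at 8 kHz
K = 8       # harmonics per symbol
K0 = 3      # subband of the lowest harmonic (assumed): 600 Hz
A = 1.0     # harmonic amplitude

def synthesize(phases):
    """Eqs. (4)-(5) with omega_0 = 2*pi (assumed): sum of K 4-PSK harmonics."""
    return [sum(A * math.cos(2 * math.pi * (k + K0) * n / N + math.pi / 2 * phi)
                for k, phi in enumerate(phases))
            for n in range(N)]

def analyze(samples):
    """Subband de-multiplexing: project onto each harmonic to obtain C~_k."""
    return [2.0 / N * sum(x * cmath.exp(-2j * math.pi * (k + K0) * n / N)
                          for n, x in enumerate(samples))
            for k in range(K)]

def estimate_channel(rx_training, tx_phases):
    """Eqs. (7)-(8): per-harmonic phase shift and distortion variance."""
    count = len(rx_training)
    shifts, variances = [], []
    for k in range(K):
        derotated = [c[k] * cmath.exp(-1j * math.pi / 2 * ph[k])
                     for c, ph in zip(rx_training, tx_phases)]
        mu = sum(derotated) / count
        shifts.append(cmath.phase(mu))
        variances.append(sum(abs(d - mu) ** 2 for d in derotated) / (count - 1))
    return shifts, variances

def demodulate(c_rx, codebook, shifts, variances):
    """Eq. (10): index of the codeword maximizing the weighted correlation."""
    def score(phases):
        return sum((c * (A / v) * cmath.exp(-1j * (math.pi / 2 * phi + s))).real
                   for c, phi, s, v in zip(c_rx, phases, shifts, variances))
    return max(range(len(codebook)), key=lambda m: score(codebook[m]))

# Toy codebook (assumption): the 16 quaternary codewords (a + b*k) mod 4.
CODEBOOK = [[(a + b * k) % 4 for k in range(K)] for a in range(4) for b in range(4)]
TRUE_SHIFT = [0.3 * (k + 1) for k in range(K)]    # made-up phase shift per harmonic

def toy_channel(samples, rng, sigma=0.05):
    """Toy distortion applied to the PSK symbols: rotation plus Gaussian noise."""
    return [c * cmath.exp(1j * TRUE_SHIFT[k])
            + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for k, c in enumerate(analyze(samples))]

rng = random.Random(0)
train_idx = [rng.randrange(16) for _ in range(50)]
rx_train = [toy_channel(synthesize(CODEBOOK[m]), rng) for m in train_idx]
shifts, variances = estimate_channel(rx_train, [CODEBOOK[m] for m in train_idx])

sent = [rng.randrange(16) for _ in range(200)]
decoded = [demodulate(toy_channel(synthesize(CODEBOOK[m]), rng),
                      CODEBOOK, shifts, variances)
           for m in sent]
print("symbol errors:", sum(s != d for s, d in zip(sent, decoded)))
```

The training step recovers the per-harmonic rotation, so the decision in `demodulate` stays correct even though some harmonics are rotated by more than a quarter turn, which would defeat the uncompensated rule of Eq. (6).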


Despite the computational improvement, the codebook’s preferable size ranges between 64 and 256 elements andshould not overreach elements. These values would make the real-time demodulation computationally practical onportable devices, especially if the codebook has a symmetric structure that enables further computational optimizations.Another factor in the process of selecting the codebook size is the transmission bitrate. Full 4PSK-OFDM modulationoffers transmission up to K = log (4 K ) information bits per symbol. However, the modulation is susceptible toexcessive distortion or attenuation of some harmonics in spectrally selective voice channels. Instead, it is advisable tochoose only a subset of all possible OFDM phase combinations to enlarge the minimum distance between symbols. Thisapproach makes a transmission over voice channels more robust to spectrally selective distortion, as a large distortion ofsome harmonics would be compensated by a moderate distortion of the others. On the other hand, smaller modulationorder M < K decreases the bitrate. Construction of a suitable DoV codebook relies on ﬁnding (or training) a subset of harmonic symbols with a largeminimum distance. However, this task becomes challenging as the number of symbol combinations increases. Thissubsection gives a proposition of a suboptimal codebook design method, which produces a set of harmonic waveformssufﬁciently different from each other.For x , x ∈ C K , let d E ( x , x ) be the Euclidean metric over the complex space and for y , y ∈ Z K , let d L ( y , y ) be the Lee metric over Z K : d L ( y , y ) = K − (cid:88) k =0 min( | y ,k − y ,k | , − | y ,k − y ,k | ) . In addition, let us deﬁne the bijective function f : C K → Z K which takes the phase indices ϕ m,k of every4-PSK sequence C m = { A · exp( j πϕ m,k / | ≤ k < K, ϕ m,k ∈ Z } , and maps to a quaternary codeword f ( C m ) = { ϕ m,k | ≤ k < K } over Z K . 
For any two 4-PSK sequences $C_{m_1}$ and $C_{m_2}$, we get an isometric property:

$$2A^2 \, d_L\big(f(C_{m_1}), f(C_{m_2})\big) = d_E^2\big(C_{m_1}, C_{m_2}\big).$$

Indeed, on each harmonic a Lee distance of 1 contributes $|A - jA|^2 = 2A^2$ to the squared Euclidean distance, and a Lee distance of 2 contributes $|A - (-A)|^2 = 4A^2$. It can be noticed that the same relation holds for the minimum distance between all PSK sequences in the OFDM codebook and elements of the associated quaternary codewords. The selection of the most distinct OFDM symbols can thus be replaced by the construction of a quaternary code $\mathcal{C} \subset \mathbb{Z}_4^K$ (not necessarily a subgroup) that maximizes the minimum Lee distance.

From the perspective of non-binary codes with a defined minimum distance, these OFDM symbols can be seen as error correcting codes encoded in the spectral domain [41]. In consequence, quaternary codes provide a new degree of freedom in the DoV codebook design. By some sensible manipulation of the number of harmonics $K$, the symbol duration $N$, and the minimum distance between codebook symbols $d$, it is possible to find a codebook providing the required bitrate and maintaining sufficient robustness to distortion. Moreover, the codebook generation is computationally constrained mostly by finding quaternary codes, which is a much faster process compared to training a full codebook of waveforms. Finally, quaternary codes can be reused to produce waveforms of different duration and harmonic frequencies. It is also worth noting that the above motivation for exploiting non-binary codes is slightly different from other works, which focus mainly on reducing the peak-to-mean energy ratio of the OFDM signal [42, 43, 44, 45].

Table 1: Minimum Lee distance of additive quaternary codes of length n = 7, […], found by Algorithm 1. Parameter k denotes the number of (quaternary) information bits of the code. From the perspective of OFDM symbols, the value n is related to the number of harmonics, while k determines the codebook size. (Rows: n; columns: k.)
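The search for a quaternary code with a large minimum Lee distance can be illustrated with a small greedy selection. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the candidate set is all of Z_4^K for a small K, the pair {c, −c} is interpreted as a codeword and its antipodal counterpart (every phase index shifted by 2, which negates the waveform), and the tie-break towards a uniform phase distribution is a simple histogram heuristic. All function names are hypothetical.

```python
import cmath
import itertools


def lee_distance(y1, y2):
    """Lee distance between two codewords over Z_4."""
    return sum(min(abs(a - b), 4 - abs(a - b)) for a, b in zip(y1, y2))


def to_psk(codeword, amplitude=1.0):
    """Map phase indices in Z_4 to the 4-PSK sequence A*exp(j*pi*phi/2)."""
    return [amplitude * cmath.exp(1j * cmath.pi * phi / 2) for phi in codeword]


def squared_euclidean(x1, x2):
    """Squared Euclidean distance between two complex sequences."""
    return sum(abs(a - b) ** 2 for a, b in zip(x1, x2))


def antipodal(codeword):
    """Assumed meaning of -c: shift every phase by pi (index + 2 mod 4)."""
    return tuple((phi + 2) % 4 for phi in codeword)


def all_codewords(num_harmonics):
    """Every word of Z_4^K; only practical for small K."""
    return list(itertools.product(range(4), repeat=num_harmonics))


def min_distance_to(codeword, codebook):
    return min(lee_distance(codeword, c) for c in codebook)


def phase_imbalance(codebook):
    """Spread between the most and least frequent phase index."""
    counts = [0, 0, 0, 0]
    for c in codebook:
        for phi in c:
            counts[phi] += 1
    return max(counts) - min(counts)


def codebook_search(candidates, size, initial=None):
    """Greedy selection of `size` (even) codewords in antipodal pairs."""
    first = initial if initial is not None else candidates[0]
    codebook = [first, antipodal(first)]
    while len(codebook) < size:
        remaining = [c for c in candidates if c not in codebook]
        # Keep the candidates farthest (in Lee distance) from the codebook.
        best = max(min_distance_to(c, codebook) for c in remaining)
        shortlist = [c for c in remaining if min_distance_to(c, codebook) == best]
        # Heuristic tie-break: prefer the most uniform phase histogram.
        chosen = min(
            shortlist,
            key=lambda c: phase_imbalance(codebook + [c, antipodal(c)]),
        )
        codebook += [chosen, antipodal(chosen)]
    return codebook


def minimum_lee_distance(codebook):
    return min(lee_distance(a, b) for a, b in itertools.combinations(codebook, 2))


codebook = codebook_search(all_codewords(4), 8)
print(codebook, minimum_lee_distance(codebook))
```

The helper `squared_euclidean(to_psk(a, A), to_psk(b, A))` can be compared against `2 * A**2 * lee_distance(a, b)` to check the relation between the two metrics numerically. An exhaustive candidate set grows as 4^K, so a practical search would restrict the candidates, for instance to an additive code.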


Due to some rotational symmetries of quaternary codes, there is no unique codebook with the largest minimum distance. This gives more flexibility in fine-tuning the codes to make them more suitable in real operation. It is advisable to select a codebook with a distribution of phase values that is as uniform as possible and to remove symbols with the highest maximum amplitude. Table 1 presents the minimum distance of several quaternary codes found by a greedy algorithm named CodebookSearch (Algorithm 1). The subroutine ChooseInitial inserts a random or pre-defined initial codeword into the codebook, while the subroutine ChooseCodeword iteratively selects a codeword that keeps the distribution of phase values in the expanded set close to uniform.

To improve the computational efficiency of demodulation, one may exploit the reflection symmetry of the codebook produced by the algorithm. Since for any even index $0 \le m < M$ we have $s_m = -s_{m+1}$, it is sufficient to correlate the received PSK sequence only with the codebook symbols having even indices and then to check the sign of the result.

Algorithm 1: CodebookSearch(C, M)
    Data: the set of quaternary codewords C, an even size of codebook M
    Result: a set Cb of M quaternary codes
    Cb ← ∅
    // select the first codeword (random or pre-defined)
    c_0 ← ChooseInitial(C)
    Cb ← Cb ∪ {c_0, −c_0}
    for i ← 1 to ⌊M/2⌋ − 1 do
        // select codewords in C with a maximum Lee distance from Cb
        S ← MaxLeeDistance(C, Cb)
        // select a codeword from S respective to uniform distribution
        c_i ← ChooseCodeword(S, Cb)
        Cb ← Cb ∪ {c_i, −c_i}
    end

This section presents the performance results of the DoV scheme described in Section 4. Simulations are followed by experimental tests over 3G and VoIP. Examples of some DoV signals recorded during tests are available online.

Efficient detection of received DoV symbols, described by Eq. (10) in Section 3, requires voice channel characterization using the training sequence. Intuitively, the larger the number of symbols in the sequence, the more accurate the estimation. We estimated the standard error SE of the phase shift $\hat{\phi}_k(t)$ and of the variance of distortion $\hat{\sigma}_k^2(t)$ as a function of the training duration $t$, using Monte Carlo simulations and the following formulas:

$$\widehat{SE}^2_{\phi_k}(t) = \frac{1}{L} \sum_{\ell=1}^{L} \left( \hat{\phi}_{k,\ell}(t) - \bar{\phi}_k \right)^2, \tag{11}$$

$$\widehat{SE}^2_{\sigma_k^2}(t) / \bar{\sigma}_k^4 = \frac{1}{L} \sum_{\ell=1}^{L} \left( \hat{\sigma}_{k,\ell}^2(t) - \bar{\sigma}_k^2 \right)^2 / \bar{\sigma}_k^4, \tag{12}$$

where $L$ is the number of Monte Carlo runs, $\hat{\phi}_{k,\ell}(t)$ and $\hat{\sigma}_{k,\ell}^2(t)$ denote respectively the estimated phase shifts and the variances of distortion in the $\ell$-th Monte Carlo run, and the reference values $\bar{\phi}_k$ and $\bar{\sigma}_k^2$ were obtained from a sequence of 50000 DoV symbols (250 seconds of signal). Figure 9 depicts the maximum standard error of $\hat{\phi}_k(t)$ and $\hat{\sigma}_k^2(t)/\bar{\sigma}_k^2$ taken over all harmonics $k$ and for every $t$ between 0.5 and 2.5 seconds with a 0.05 second step. It can be observed that 2 seconds of training should give a sufficiently accurate channel characterization.
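The dependence of the estimation error on the training duration can be reproduced qualitatively with a toy Monte Carlo experiment. The sketch below is a simplified illustration, not the simulation used in the paper: it assumes a single harmonic of unit amplitude, a channel that applies a fixed phase shift and adds circular Gaussian noise (instead of a real voice coder), a correlation-based phase estimator, and the true phase shift as the reference value. The error is computed as the root of the mean squared deviation from that reference over all runs. All names and default values are hypothetical.

```python
import cmath
import math
import random


def simulate_training(num_symbols, phase_shift, noise_std, rng):
    """Return (sent, received) 4-PSK training symbols for one harmonic."""
    sent = [cmath.exp(1j * math.pi * rng.randrange(4) / 2)
            for _ in range(num_symbols)]
    received = [
        s * cmath.exp(1j * phase_shift)
        + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
        for s in sent
    ]
    return sent, received


def estimate_phase_shift(sent, received):
    """Phase of the correlation between received and known symbols."""
    return cmath.phase(sum(r * s.conjugate() for s, r in zip(sent, received)))


def standard_error(duration_s, baud=200, runs=200,
                   phase_shift=0.3, noise_std=0.4, seed=1):
    """Monte Carlo estimate of the phase-shift standard error (radians)."""
    rng = random.Random(seed)
    num_symbols = int(round(duration_s * baud))
    squared_sum = 0.0
    for _ in range(runs):
        sent, received = simulate_training(num_symbols, phase_shift,
                                           noise_std, rng)
        squared_sum += (estimate_phase_shift(sent, received) - phase_shift) ** 2
    return math.sqrt(squared_sum / runs)


for duration in (0.5, 1.0, 2.0):
    print(duration, standard_error(duration))
```

Under this toy channel the error shrinks roughly with the square root of the number of training symbols; the behaviour behind a real codec may differ, which is why the paper measures it per coder.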
https://github.com/PiotrKrasnowski/Data_over_Voice

[Figure 9, two panels plotted against the training period duration (s) for AMR, Speex and Silk: (a) standard error of phase-shift estimation (rad); (b) standard error of variance estimation (unitless).]

Figure 9: Estimated standard error of the phase-shift $\hat{\phi}_k$ and the normalized variance $\hat{\sigma}_k^2/\bar{\sigma}_k^2$ estimators of distortion introduced by a selection of coders. The graphs present the maximum standard error over all harmonics $k$, and for every $t$ between 0.5 and 2.5 seconds with a 0.05 second step. Results are based on 1000 Monte Carlo runs. The reference values $\bar{\phi}_k$ and $\bar{\sigma}_k^2$ were computed from a sample of 50000 symbols. The DoV signal consisted of 8 harmonics at frequencies 400 Hz, 800 Hz, ..., 3200 Hz with a modulation rate of 200 baud.

The symbol error rate primarily depends on the distortion variance and the minimum distance between codebook symbols. For example, it can be noticed in Fig. 10a that compression by AMR leads to significantly lower error rates than compression using the Silk codec. This result agrees with the experimental outcomes shown in Fig. 3 in Section 3. Nevertheless, when the voice channel's capacity goes up, the amount of distortion, and thus the error rate, gradually decreases, as indicated by Fig. 10b.

The characteristic staircase shape of the graphs in Fig. 10a and Fig. 10b corresponds to the codebook minimum distance $d$ as a function of the codebook size (see Table 1). Thus, the symbol error rates obtained can be viewed as the approximate probability of the signal distortion exceeding the distance $d/2$. Consequently, it is generally advantageous to design the codebook with a larger number of orthogonal harmonics, leading to increased minimum distance and improved robustness.

Despite its simplicity, the presented scheme suffers from the large size of the codebooks used, especially at higher bitrates. The exponentially growing number of correlations becomes a major practical limitation for real-time signal demodulation. The problem can be tackled by scaling down the symbol duration at the expense of higher relative distortion and a smaller number of orthogonal frequency slots. As shown by Fig. 10c, a modulation based on smaller codebooks of shorter symbols provides similar performance at a much lower computational cost.

The DoV technique has been tested over a real voice channel between mobile phones, using pre-computed DoV signals. The phones selected for the experiments were two iPhone 6 devices running iOS 12 and a Huawei P8 Lite running Android 8, each registered to a different major French mobile network operator. The DoV performance over 3G calls is displayed in Table 2, and the performance over VoIP calls using a 4G wireless network is shown in Table 3. The duration of the training period was extended to 4 seconds to ensure the reliability of the experiments.

In the case of the 3G connection, the overall symbol error rates given in Table 2 are higher than the simulation results presented in Fig. 10. Additional signal distortion is possibly caused by several signal processing stages in the phones and also by multiple voice compression in the network [4]. Nevertheless, the DoV signal based on faster modulation and smaller codebook sizes again demonstrated lower error rates. Finally, the results emphasize the importance of voice channel estimation, which significantly improves the symbol error rate. Figure 11 displays a small fragment of the DoV signal sent over the 3G channel.

[Figure 10, three panels plotting the symbol error rate (log. scale) against the bitrate (kbps) and the codebook size: (a) AMR, Speex and Silk, each for two numbers of harmonics, codebook sizes 256 to 65536; (b) Silk at three codec bitrates, codebook sizes 256 to 65536; (c) AMR, Speex and Silk at a rate of 400 baud, codebook sizes 32 to 4096.]

Figure 10: Decoding symbol error rate of a DoV signal compressed by AMR, Speex and Opus-Silk. To ensure reliability of the simulations, the duration of the training period was extended to 4 seconds. If not indicated otherwise, the symbol rate equals 200 baud. DoV signals consisted of symbols produced according to the output of a built-in pseudo-random generator with a pre-defined seed.

Table 2: Symbol error rate of DoV signal over 3G call with and without channel estimation.
[Columns: bitrate; 4 s training period; no training. The first listed configuration uses 10 harmonics and a symbol duration of 5 ms, with rows starting at 1.0 kbps. Numeric entries not recoverable.]

Table 3: Symbol error rate of DoV signal over VoIP. 8 harmonics, symbol duration 2.5 ms, 4 s training period.
[Columns: bitrate; FaceTime; Skype; Signal Messenger; WhatsApp. Rows start at 4.0 kbps. Numeric entries not recoverable.]

Contrary to 3G, VoIP enables very high DoV bitrates, up to full OFDM narrowband transmission at 6.4 kbps. The improved results provided in Table 3 are achieved thanks to the mild signal distortion that results from high throughput and network stability. However, since VoIP is a packet-based system without any guarantee of Quality of Service (QoS), short interruptions in the network connection may cause many packet dropouts. The negative impact of dropouts is typically mitigated by the re-synthesis of lost frames by the VoIP application, which causes non-recoverable damage to the DoV signal and hinders the system's re-synchronization.

Figure 11: DoV signal at the bitrate 2.8 kbps, before (black line) and after (red line) transmission over the 3G network. The fragment displays eight consecutive DoV symbols of duration 2.5 ms consisting of 8 harmonics at frequencies 400 Hz, 800 Hz, ..., 3200 Hz, with a 400 Hz step.

This section provides a detailed proposition of a scheme for secure voice communication over 3G and VoIP, using small portable devices with limited battery capacity. The system has been successfully tested in a controlled, real-world environment and with pre-computed DoV signals. The performance results are followed by a short discussion on security and computational complexity.


Figure 12 presents a simplified diagram of a system for secure voice communication over a voice channel, which transforms consecutive portions of speech into DoV frames of the same duration. The scheme substantially resembles a classical digital communication system: it consists of speech encoding, followed by encryption, error correction, and data modulation blocks. Although the input and output signals of the processing chain are analog, all internal processing is performed digitally.

Figure 12: Encrypted speech over voice channel scheme.

The system settings should be a trade-off between operational constraints (restricted bandwidth, real-time processing, synchronization) and the desired security level against eavesdroppers and active attackers from within the network. Depending on the voice channel type, two modes of operation may be considered: a low mode designed for 3G cellular calls and a high mode for VoIP. The system parameters selected in the following experiments are presented in Table 4 and are used only for illustration.

Table 4: Selected parameters of the secure voice communication system.

version:                    low mode (3G)              high mode (VoIP)

DoV frame
  codebook size:            64                         4096
  DoV symbol order:         6 bits                     12 bits
  modulation rate:          400 baud                   400 baud
  bitrate:                  2400 bps                   4800 bps
  frame duration:           80 ms                      60 ms
  frame length:             32 symbols / 192 bits      24 symbols / 288 bits

Reed-Solomon coding
  RS symbol order:          6 bits                     6 bits
  message length:           20 symbols / 120 bits      28 symbols / 168 bits
    · encrypted speech:     96 bits                    144 bits
    · frame counter:        16 bits                    16 bits
    · control checksum:     8 bits                     8 bits
  code length:              28 symbols / 168 bits      40 symbols / 240 bits
  redundancy:               8 symbols / 48 bits        12 symbols / 72 bits

Voice enciphering           AES 256 (CTR mode)         AES 256 (CTR mode)

Voice compression           Codec2 1200 bps            Codec2 2400 bps

The processing chain starts with low-bitrate speech compression. In this work, voice is encoded by Codec2, an open-source algorithm developed by D. Rowe and J.-M. Valin, which offers speech compression down to 450 bps [46]. In the next step, the encoded voice frames are enciphered by AES in the counter mode of operation, with a secret key of 256 bits and a random initial value (IV).

The encrypted binary stream is protected against channel errors by shortened Reed-Solomon (RS) codes with erasures [47, 48] and 6-bit symbols. The error correction capabilities of RS codes depend only on the redundancy length, which is not the case for Turbo and LDPC codes [49]. Moreover, non-binary symbol processing of RS codewords seems suitable for symbol-to-symbol demodulation of the DoV signal. In particular, one or more RS symbols can be represented by a single DoV symbol.

https://rowetel.com
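The entries of Table 4 are tied together by simple arithmetic, which the short check below makes explicit. It is only a consistency sketch built from the table values: the derived quantities (symbols left over for a frame header, and the time after which the 16-bit frame counter wraps) are computed here rather than quoted from the table, and the dictionary keys are hypothetical names.

```python
RS_SYMBOL_BITS = 6
COUNTER_BITS = 16
CHECKSUM_BITS = 8

# Values copied from Table 4; the keys are illustrative names.
MODES = {
    "low (3G)": dict(codebook=64, baud=400, frame_ms=80,
                     rs_code=28, rs_msg=20, speech_bps=1200),
    "high (VoIP)": dict(codebook=4096, baud=400, frame_ms=60,
                        rs_code=40, rs_msg=28, speech_bps=2400),
}


def derive(p):
    """Recompute the dependent frame parameters of one mode."""
    bits_per_symbol = p["codebook"].bit_length() - 1  # log2 of a power of two
    frame_symbols = p["baud"] * p["frame_ms"] // 1000
    rs_code_bits = p["rs_code"] * RS_SYMBOL_BITS
    payload_symbols = rs_code_bits // bits_per_symbol
    speech_bits = p["speech_bps"] * p["frame_ms"] // 1000
    header_symbols = frame_symbols - payload_symbols
    return {
        "bits_per_symbol": bits_per_symbol,
        "bitrate_bps": bits_per_symbol * p["baud"],
        "frame_symbols": frame_symbols,
        "frame_bits": frame_symbols * bits_per_symbol,
        "speech_bits": speech_bits,
        "message_bits": speech_bits + COUNTER_BITS + CHECKSUM_BITS,
        "rs_message_bits": p["rs_msg"] * RS_SYMBOL_BITS,
        "redundancy_symbols": p["rs_code"] - p["rs_msg"],
        "payload_symbols": payload_symbols,
        "header_symbols": header_symbols,
        "header_ms": 1000 * header_symbols / p["baud"],
        "counter_wrap_minutes": 2 ** COUNTER_BITS * p["frame_ms"] / 60000,
    }


for name, params in MODES.items():
    print(name, derive(params))
```

In both modes the speech, counter, and checksum bits exactly fill the RS message, and the RS codeword leaves four DoV symbols (10 ms at 400 baud) of each frame unused by the payload. In the high mode two 6-bit RS symbols fit into one 12-bit DoV symbol.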


Erasure decoding improves the correction capabilities of RS codes, provided that the locations of the errors are known. The demodulator may try to guess the erroneous symbols, using a straightforward metric that considers the symbol energy and its distance to the closest codebook symbol. Thus, when the first decoding attempt fails, the decoder may repeat decoding with newly estimated erasure positions until the 8-bit control checksum (CRC-8) matches.

Table 5: Performance of encrypted voice transmission over cellular voice channels and VoIP.
[Columns: 3G; FaceTime; Skype; Signal Messenger; WhatsApp. Rows: effective BER; effective FER. Numeric entries not recoverable.]

In the proposed scheme, each RS codeword is directly encoded into one DoV frame, as described in Fig. 13. A constant header and a counter (CTR) enable decoding and decryption of DoV frames independently from each other, simplifying the re-synchronization in the presence of signal dropouts. Extensive experiments have shown that a 10-ms header is usually sufficiently long to keep signal synchronization or detect a DoV frame after signal restoration. In addition, the 16-bit counter permits re-synchronization after more than one hour of lost connection.

The duration of a DoV frame is equal to the portion of speech encoded by this frame, which is a valid requirement for real-time communication. The selected voice compression rates, 1.2 kbps or 2.4 kbps depending on the mode, are low enough to append error correction redundancy at the end of each DoV frame.

The system was tested over cellular and VoIP calls. Table 5 presents the decoding results of several minutes of speech recording sent using 4G mobile data connectivity between two iPhone 6 devices registered to different network operators. The effective bit error rates (BER) and frame error rates (FER) take into account errors due to system de-synchronizations and short signal dropouts.

Figure 14 shows the consecutive waveforms of a signal processed by a 3G network. The initial speech waveform presented in Fig. 14a is compressed, encrypted, and encoded into the DoV signal of equal duration in Fig. 14b. The received signal displayed in Fig. 14c is strongly attenuated after less than two seconds of transmission, being classified by the Voice Activity Detector (VAD) as non-speech-like. However, correct decoding is still possible as long as the harmonic structure of the signal is preserved, as shown in Fig. 14d.

The distortion of the received signal can vary, depending on the network type and the phones used for communication. To counteract the blockage of stationary signals by VAD and Noise Suppression, several authors suggest alternating two DoV codebooks defined over two non-overlapping bandwidths [50, 17]. This work proposes another, complementary technique: periodic silence insertion in place of some DoV frames, as depicted in Fig. 15. It was observed that, depending on the chosen rate of silence insertion and the type of connection, these silences significantly postpone or even prevent signal suppression. On the reception side, these inserted silences can be classified as lost frames and re-synthesized.

Figure 13: DoV frame structure in a low (top) and a high (bottom) mode of operation. The numbers indicate the lengths of frame sections, given as a number of DoV symbols. In the high mode, one DoV symbol represents two RS symbols.
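The periodic silence insertion described above can be sketched in a few lines. This is a minimal illustration under assumptions not stated in the paper: the last frame of every block of `period` frames is the one replaced by silence, silence is represented by `None`, and lost frames are concealed by repeating the previously decoded frame, whereas a real receiver would rely on the speech decoder's own re-synthesis. The function names are hypothetical.

```python
def insert_silences(frames, period=16):
    """Replace every `period`-th DoV frame with silence (None)."""
    return [None if (i + 1) % period == 0 else frame
            for i, frame in enumerate(frames)]


def conceal_lost_frames(received, fallback=None):
    """Treat silent frames as lost and repeat the last good frame."""
    output = []
    last_good = fallback
    for frame in received:
        if frame is None:
            output.append(last_good)  # placeholder for re-synthesis
        else:
            output.append(frame)
            last_good = frame
    return output


# Usage: 32 frames with a silence inserted every 16th frame.
sent = insert_silences([f"frame-{i}" for i in range(32)], period=16)
restored = conceal_lost_frames(sent)
print(sent[15], restored[15], sent[31], restored[31])
```

With a period of 16, one frame in sixteen carries no speech payload, so the insertion rate trades a small loss of speech frames against the risk of the whole signal being suppressed.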


Figure 14: Consecutive stages of the signal in secure voice communication over a 3G call. From top to bottom: the initial speech, the sent DoV signal, the received DoV signal, and the re-synthesized speech. The received signal was fully decodable despite strong signal attenuation.

Introducing a dedicated system for voice communication is a response to an increased risk of interception. Thus, a cryptographic scheme should reflect higher requirements for secrecy and authentication. A major risk is the recording and off-line cryptanalysis of the network traffic by passive eavesdroppers. Securing the communication against eavesdroppers is especially important because the encrypted, non-speech signal can be easily detected by some advanced Data Leakage Prevention (DLP) and Content Monitoring and Filtering (CMF) systems protecting against unauthorized data extrusion [51, 52, 12]. Active attackers controlling the network are more likely to block or distort the fragile DoV signal, which is technically very simple. However, a powerful and knowledgeable attacker who can synthesize a compatible DoV signal in real time may modify the signal or insert their own.

The chosen AES cipher in the counter mode of operation, if implemented correctly, is believed to provide security against passive eavesdroppers [53, 54]. On the other hand, enciphering in counter mode does not guarantee data integrity [55], leaving some room for adversarial manipulations. Therefore, the common practice is to combine AES in counter mode with a cryptographic message authentication function [56]. Unfortunately, due to severe bandwidth limitations, appending an authentication check is not viable. Instead, it would be possible to randomly shuffle the positions of encrypted bits within one DoV frame [57, 58]. The motivation for this is to prevent malicious attackers from making intentional modifications to the transmitted content. While still capable of replacing several DoV symbols, the attacker should not benefit from distorting the transmitted signals.
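One possible way to shuffle bit positions within a frame is to derive a fresh permutation from the shared key and the frame counter. The sketch below is a hypothetical construction for illustration only: it is not taken from the paper or from [57, 58], it ranks bit positions by an HMAC-SHA256 tag, and it should not be read as a replacement for message authentication. The key and function names are assumptions.

```python
import hashlib
import hmac


def frame_permutation(key: bytes, frame_counter: int, num_bits: int):
    """Derive a per-frame permutation by ranking positions with HMAC tags."""
    def tag(index):
        message = frame_counter.to_bytes(2, "big") + index.to_bytes(2, "big")
        return hmac.new(key, message, hashlib.sha256).digest()
    return sorted(range(num_bits), key=tag)


def shuffle_bits(bits, permutation):
    """Output position i takes the bit from position permutation[i]."""
    return [bits[source] for source in permutation]


def unshuffle_bits(shuffled, permutation):
    """Invert shuffle_bits for the same permutation."""
    restored = [0] * len(permutation)
    for destination, source in enumerate(permutation):
        restored[source] = shuffled[destination]
    return restored


# Usage with a placeholder key and a 96-bit encrypted speech payload.
key = bytes(range(32))
payload = [(i * 7) % 2 for i in range(96)]
perm = frame_permutation(key, frame_counter=5, num_bits=len(payload))
assert unshuffle_bits(shuffle_bits(payload, perm), perm) == payload
```

Because the permutation changes with the frame counter, an attacker who flips a chosen DoV symbol cannot easily predict which plaintext bits are affected; whether this is sufficient in practice depends on the constructions in the cited works.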


Figure 15: Comparison of the received DoV signal in a 3G call (top) without and (bottom) with silence insertion every 16th frame. Depending on the connection type and the silence insertion rate, this technique may postpone or prevent signal suppression.

Finally, it is assumed that both users share a common secret cryptographic key used for encryption. Secure key exchange can become challenging when the voice channel is the only available communication channel. With decentralized implementations of the proposed system, there would be no practical possibility of adding a Trusted Third Party for user authentication. A few protocols overcome this limitation by using vocal verification [59, 60, 11]. In such a scenario, users compare freshly generated random strings vocally while challenging the other speaker's voice profile.

The goal of real-time operation on small portable devices puts a strong emphasis on computational optimization of the proposed system. It can be noticed that the PSK-OFDM modulation [61], AES-CTR encryption [56, 62], Reed-Solomon error correction [63, 64], and speech encoding [65] algorithms mentioned in this work have already been widely adopted in wireless communication with mobile phones or in computationally constrained environments, including real-time applications. However, the presented system was implemented in the GNU Octave environment, serving as a proof of concept only. There is still considerable work to be done to efficiently integrate all these elements into a single system operating on a device with limited resources, like a mid-range smartphone.

In this article, we detailed a new and versatile Data over Voice technique for secure voice communications over LPC-based voice channels, like cellular networks and VoIP. Based on codebooks with harmonic symbols, the proposed solution is well-grounded in the fundamental principles of LPC coding.

A thorough analysis of OFDM signals compressed by some prominent voice coders revealed that the distortion is statistically close to a symmetric bivariate Gaussian distribution over the complex phase-amplitude plane. However, this distortion is not uniformly distributed in the spectral domain. Thus, we proposed an optimized demodulation metric based on a spectrally weighted Euclidean distance with phase shift correction.

The tedious design process of DoV codebooks has been considerably simplified by using quaternary error correction codes. With OFDM symbols being treated as codes over a quaternary ring, codebook construction reduces to finding a set of quaternary codes that maximizes the minimum Lee distance.

The performance of our DoV technique has been evaluated through simulations and tests over real voice connections between two mobile phones. Bitrates of 2.4 kbps over a 3G call and 6.4 kbps over VoIP have been achieved with acceptably low symbol error rates. These tests highlight the need to properly characterize the channel distortion before transmission.


Finally, the work described a scheme for secure voice communications over voice channels in high and low bitrate modes of operation. The system has been practically validated for real-time voice transmission over cellular networks and VoIP with small effective bit error rates. To mitigate the negative impact of VAD, we also proposed a new method based on the insertion of repetitive silences.

The promising results presented in this work suggest some further investigation of the proposed DoV technique. Strong emphasis has to be put on signal synchronization on the reception side and on reducing the computational cost of signal demodulation. Additionally, sensible codebook structuring, combined with the exploitation of phase symmetries, may significantly lower the number of correlations in a demodulator.

This work is supported by grant DGA Cifre-Defense program No 01D17022178 DGA/DS/MRIS and AID program No SED0456JE75.

References

[1] William Bennett. Secret Telephony as a Historical Example of Spread-Spectrum Communication. IEEE Transactions on Communications, 31(1):98–104, 1983. doi:https://doi.org/10.1109/TCOM.1983.1095724.
[2] Hans Dieter Hellige. From SAGE via Arpanet to Ethernet: Stages in computer communications concepts between 1950 and 1980. History and Technology, 11(1):49–75, 1994. doi:https://doi.org/10.1080/07341519408581854.
[3] Philip Golden, Herve Dedieu, and Krista Jacobsen. Fundamentals of DSL technology. Auerbach Publications, Boca Raton, FL, 2006.
[4] Nilantha Katugampala, Stephane Villette, and Ahmet M. Kondoz. Secure voice over GSM and other low bit rate systems. In IEE Seminar on Secure GSM and Beyond (Digest No. 2003/10059). IET, 2003.
[5] Marc Werner, Christian Pietsch, Christoph Joetten, Christian Sgraja, Georg Frank, Wolfgang Granzow, and Jeff Huang. Cellular In-Band Modem Solution for eCall Emergency Data Transmission. In VTC Spring 2009 - IEEE 69th Vehicular Technology Conference. IEEE, 2009. doi:https://doi.org/10.1109/VETECS.2009.5073434.
[6] Aditya Dhananjay, Ashlesh Sharma, Michael Paik, Jay Chen, Trishank Karthik Kuppusamy, Jinyang Li, and Lakshminarayanan Subramanian. Hermes: data transmission over unknown voice channels. In Proceedings of the sixteenth annual international conference on Mobile computing and networking. Association for Computing Machinery, 2010. doi:https://doi.org/10.1145/1859995.1860010.
[7] Zdenko Mezgec, Amor Chowdhury, Bojan Kotnik, and Rajko Svečko. Implementation of PCCD-OFDM-ASK robust data transmission over GSM speech channel. Informatica, 20, 2009. doi:https://doi.org/10.15388/Informatica.2009.237.
[8] Ashish Patro, Yadi Ma, Fatemah Panahi, Jordan Walker, and Suman Banerjee. A system for audio signalling based NAT Traversal. In […], pages 1–10. IEEE, 2011. doi:https://doi.org/10.1109/COMSNETS.2011.5716432.
[9] Nilantha N. Katugampala, Kholdoon T. Al-Naimi, Stephane Villette, and Ahmet M. Kondoz. Real-time end-to-end secure voice communications over GSM voice channel. In […], pages 1–4. IEEE, 2005. ISBN 978-160-4238-21-1.
[10] Liquan Chen and Qi Guo. An OFDM-based secure data communicating scheme in GSM voice channel. In […]. IEEE, 2011. doi:https://doi.org/10.1109/ICECC.2011.6066715.
[11] Piotr Krasnowski, Jerome Lebrun, and Bruno Martin. Introducing a Verified Authenticated Key Exchange Protocol over Voice Channels for Secure Voice Communication. In […], pages 683–690. Scitepress Digital Library, 2020. doi:https://doi.org/10.5220/0009156506830690.
[12] Seungjoon Lee, Youngmok Ha, Sungjun Yoon, Hyunju Jo, Soohee Jang, Jiyeon Lee, Youngan Kim, and Ji Yoon. The Vulnerability Exploitation Conveying Digital Data Over Mobile Voice Call Channels. Wireless Personal Communications, 96:1–28, 2017. doi:https://doi.org/10.1007/s11277-017-4229-9.
[13] Frank Soong and Biing-H. Juang. Line spectrum pair (LSP) and speech data compression. In ICASSP '84. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 9, pages 37–40. IEEE, 1984. doi:https://doi.org/10.1109/ICASSP.1984.1172448.

[14] Mehmet Akif Özkan and S Berna Örs. Data transmission via GSM voice channel for end to end security. In […]. IEEE, 2015. doi:https://doi.org/10.1109/ICCE-Berlin.2015.7391285.
[15] Mahsa Rashidi, Abolghasem Sayadiyan, and Pejman Mowlaee. A Harmonic Approach to Data Transmission over GSM Voice Channel. In […], pages 1–4. IEEE, 2008. doi:https://doi.org/10.1109/ICTTA.2008.4530052.
[16] Christoph K. LaDue, Vitaily V. Sapozhnykov, and Kurt S. Fienberg. A Data Modem for GSM Voice Channel. IEEE Transactions on Vehicular Technology, 57(4):2205–2218, 2008. doi:https://doi.org/10.1109/TVT.2007.912322.
[17] Vitaliy V. Sapozhnykov and Kurt S. Fienberg. A low-rate data transfer technique for compressed voice channels. Journal of Signal Processing Systems, 2012. doi:https://doi.org/10.1007/s11265-011-0594-x.
[18] Ali Shahbazi, Amir Hossein Rezaie, Abolghasem Sayadiyan, and Saeed Mosayyebpour. A novel speech-like symbol design for data transmission through GSM voice channel. In […]. IEEE, 2009. doi:https://doi.org/10.1109/ISSPIT.2009.5407541.
[19] Mahdu Boloursaz, Amirhossein Hadavi, Rohullah Kazemi, and Fereidoon Behnia. A data modem for GSM Adaptive Multi Rate voice channel. In East-West Design Test Symposium (EWDTS 2013), pages 1–4. IEEE, 2013. doi:https://doi.org/10.1109/EWDTS.2013.6673152.
[20] Victor Zue, Stephanie Seneff, and James Glass. Speech database development at MIT: Timit and beyond. Speech Communication, 9(4):351–356, 1990. doi:https://doi.org/10.1016/0167-6393(90)90010-7.
[21] Reza Kazemi, Mahdi Boloursaz Mashhadi, Mohsen Heidari Khoozani, and Fereydoon Behnia. Modem based on sphere packing techniques in high-dimensional Euclidian sub-space for efficient data over voice communication through mobile voice channels. IET Communications, 9, 2015. doi:https://doi.org/10.1049/iet-com.2014.0610.
[22] Zhan Xu. Data transmission method based on single carrier over GSM voice channel. Revista de la Facultad de Ingeniera, 32(9):23–29, 2017.
[23] Toufic Chmayssani and Genevieve Baudoin. Data transmission over voice dedicated channels using digital modulations. In […]. IEEE, 2008. doi:https://doi.org/10.1109/RADIOELEK.2008.4542682.
[24] Bechir Taleb Ali, Genevieve Baudoin, and Olivier Venard. Data transmission over mobile voice channel based on M-FSK modulation. In […], pages 4416–4421. IEEE, 2013. doi:https://doi.org/10.1109/WCNC.2013.6555289.
[25] Lawrence R. Rabiner and Ronald W. Schafer. Theory and applications of digital speech processing. Pearson, Upper Saddle River, NJ, 2011.
[26] Tom Bäckström. Speech Coding with Code-Excited Linear Prediction. Springer, Cham, Switzerland, 2017.
[27] P. Heitkamper. Optimization of an Acoustic Echo Canceller Combined with Adaptive Gain Control. In […], 1995. doi:https://doi.org/10.1109/ICASSP.1995.479488.
[28] Dionysis E. Tsoukalas, John N. Mourjopoulos, and George Kokkinakis. Speech Enhancement Based on Audible Noise Suppression. IEEE Transactions on Speech and Audio Processing, 5, 1997. doi:https://doi.org/10.1109/89.641296.
[29] Gunnar Fant. Acoustic Theory of Speech Production: With Calculations based on X-Ray Studies of Russian Articulations. De Gruyter Mouton, Berlin, Boston, 1960. doi:https://doi.org/10.1515/9783110873429.
[30] Carol Lochbaum and John Kelly. Speech synthesis. In Proceedings of the Speech Communication Seminar, pages 583–596. Speech Transmission Laboratory, 1962.
[31] 3GPP. Adaptive Multi-Rate (AMR) speech codec; Transcoding functions. Technical Report TS 26.090, Release 15, 3GPP, 2018.
[32] Greg Herlein, Jean-Marc Valin, Alfred Heggestad, and Aymeric Moizard. RTP Payload Format for the Speex Codec. Technical Specification RFC 5574, IETF, 2009. URL https://tools.ietf.org/html/rfc5574.
[33] Jean-Marc Valin, Koen Vos, and Timothy Terriberry. Definition of the Opus Audio Codec. Technical Specification RFC 6716, IETF, 2012. URL https://tools.ietf.org/html/rfc6716.
[34] Ana Alves-Pinto, Alan R. Palmer, and Enrique A. Lopez-Poveda. Perception and coding of high-frequency spectral notches: potential implications for sound localization. Frontiers in Neuroscience, 8:112, 2014. doi:https://doi.org/10.3389/fnins.2014.00112.
[35] Richard van Nee and Ramjee Prasad. OFDM for Wireless Multimedia Communications. Artech House, Boston, MA, 2000.

[36] Ben Gold, Nelson Morgan, and Dan Ellis. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons, Hoboken, NJ, 2011.
[37] Kanti V. Mardia. Measures of multivariate skewness and kurtosis with applications. Biometrika, 57, 1970. doi:https://doi.org/10.2307/2334770.
[38] Kanti V. Mardia. Applications of Some Measures of Multivariate Skewness and Kurtosis in Testing Normality and Robustness Studies. Sankhyā: The Indian Journal of Statistics, Series B, 32(2):115–128, 1974. URL .
[39] Henrik Schulze and Christian Lüders. Theory and Applications of OFDM and CDMA: Wideband Wireless Communications. John Wiley & Sons, Chichester, GB, 2005.
[40] Robert Witte and John Witte. Statistics. Wiley, Hoboken, NJ, 2017. ISBN 978-1-119-25451-5.
[41] Tim A. Wilkinson and Alan E. Jones. Minimisation of the Peak to Mean Envelope Power Ratio of Multicarrier Transmission Schemes by Block Coding. In , volume 2. IEEE, 1995. doi:https://doi.org/10.1109/VETEC.1995.504983.
[42] James A Davis and Jonathan Jedwab. Peak-to-mean power control in OFDM, Golay complementary sequences, and Reed-Muller codes. IEEE Transactions on Information Theory, 45, 1999. doi:https://doi.org/10.1109/18.796380.
[43] Houshou Chen and Hsinying Liang. Combined selective mapping and binary cyclic codes for PAPR reduction in OFDM systems. IEEE Transactions on Wireless Communications, 6, 2007. doi:https://doi.org/10.1109/TWC.2007.060145.
[44] Thilak Ginige, Nandana Rajatheva, and Kazi M. Ahmed. Dynamic spreading code selection method for PAPR reduction in OFDM-CDMA systems with 4-QAM modulation. IEEE Communications Letters, 5, 2001. doi:https://doi.org/10.1109/4234.957377.
[45] Miguel Angel Hisojo, Jerome Lebrun, and Luc Deneire. Low PAPR and spatial diversity for OFDM schemes by using L2-orthogonal CPM ST-codes with fast decoding. In . Springer, 2014. doi:https://doi.org/10.1007/s11277-013-1108-x.
[46] Stefan Erhardt, Thomas Kurin, Fabian Lurz, Robert Weigel, and Alexander Koelpin. An Open-Source Speech Codec at 450 bit/s with Pseudo-Wideband Mode. In , pages 1048–1051. IEEE, 2019.
[47] Shu Lin and Daniel J. Costello. Error control coding, Second Edition. Prentice Hall, Lebanon, IN, 2001.
[48] Andre Neubauer, Jurgen Freudenberger, and Volker Kuhn. Coding Theory: Algorithms, Architectures and Applications. John Wiley & Sons, Chichester, UK, 2007.
[49] Bashar Tahir, Stefan Schwarz, and Markus Rupp. BER comparison between convolutional, Turbo, LDPC, and Polar codes. In . IEEE, 2017. doi:https://doi.org/10.1109/ICT.2017.7998249.
[50] Ali Shahbazi, Amir Hossein Rezaei, Abolghasem Sayadiyan, and Saeed Mosayyebpour. Data transmission over GSM adaptive multi rate voice channel using speech-like symbols. In . IEEE, 2010. doi:https://doi.org/10.1109/ICSAP.2010.72.
[51] Cheol-Joo Chae, Yongju Shin, Kiseok Choi, Ki-Bong Kim, and Kwang-Nam Choi. A privacy data leakage prevention method in P2P networks. Peer-to-Peer Networking and Applications, 9, 05 2015. doi:https://doi.org/10.1007/s12083-015-0371-x.
[52] Barbara Hauer. Data and Information Leakage Prevention Within the Scope of Information Security. IEEE Access, 3:2554–2565, 2015. doi:https://doi.org/10.1109/ACCESS.2015.2506185.
[53] Helger Lipmaa, Phillip Rogaway, and David Wagner. CTR-mode encryption. In , volume 39, 2000.
[54] Jakob Jonsson. On the security of CTR + CBC-MAC. In Selected Areas in Cryptography. Springer, 2003. doi:https://doi.org/10.1007/3-540-36492-7_7.
[55] Jonathan Katz and Yehuda Lindell. Introduction to modern cryptography. CRC Press, Boca Raton, FL, 2015.
[56] Russell Housley. Using Advanced Encryption Standard (AES) counter mode with IPsec encapsulating security payload (ESP). Technical Specification RFC 3686, IETF, 2004. URL https://tools.ietf.org/html/rfc3686.
[57] Ben Morris, Phillip Rogaway, and Till Stegers. How to Encipher Messages on a Small Domain. In Advances in Cryptology - CRYPTO 2009. Springer, 2009. doi:https://doi.org/10.1007/978-3-642-03356-8_17.

[58] Emil Stefanov and Elaine Shi. FastPRP: Fast Pseudo-Random Permutations for Small Domains. IACR Cryptology ePrint Report 2012/254, 2012. URL https://eprint.iacr.org/2012/254.pdf.
[59] Sylvain Pasini and Serge Vaudenay. SAS-Based Authenticated Key Agreement. In Public Key Cryptography - PKC 2006, pages 395–409. Springer, 2006. doi:https://doi.org/10.1007/11745853_26.
[60] Jon Callas, Alan Johnston, and Philip Zimmermann. ZRTP: Media path key agreement for unicast secure RTP. Technical Specification RFC 6189, IETF, 2011. URL https://tools.ietf.org/html/rfc6189.
[61] 3GPP. LTE; Evolved Universal Terrestrial Radio Access (E-UTRA); Physical channels and modulation. Technical Report TS 36.211, Release 16, 3GPP, 2020. URL https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=2425.
[62] Jin-Hyung Park, Jung-Ha Paik, and Dong-Hoon Lee. Efficient implementation of AES CTR Mode for a Mobile Environment. Journal of the KIISC, 21(5):47–58, 2011. URL .
[63] Lionel Biard and Dominique Noguet. Reed-Solomon Codes for Low Power Communications. Journal of Communications, 3, 2008. doi:https://doi.org/10.4304/jcm.3.2.13-21.
[64] 3GPP. Universal Mobile Telecommunications System (UMTS); Multiplexing and Channel Coding (FDD). Technical Report TS 25.212, Release 15, 3GPP, 2017. URL https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=468.
[65] Sumek Wisayataksin. An Efficient Hardware Architecture of Codec2 Low Bit-rate Speech Decoder. In 2019 5th International Conference on Engineering, Applied Sciences and Technology (ICEAST)