Speech-to-speech Translation between Untranscribed Unknown Languages
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
Nara Institute of Science and Technology, Japan
RIKEN, Center for Advanced Intelligence Project AIP, Japan
{andros.tjandra.ai6, ssakti, s-nakamura}@is.naist.jp

ABSTRACT
In this paper, we explore a method for training speech-to-speech translation tasks without any transcription or linguistic supervision. Our proposed method consists of two steps: First, we train and generate a discrete representation with unsupervised term discovery based on a discrete quantized autoencoder. Second, we train a sequence-to-sequence model that directly maps the source language speech to the target language's discrete representation. Our proposed method can directly generate target speech without any auxiliary or pre-training steps with a source or target transcription. To the best of our knowledge, this is the first work that performs pure speech-to-speech translation between untranscribed unknown languages.
Index Terms — speech translation, sequence-to-sequence, zero-resource modeling, unit discovery, autoencoder
1. INTRODUCTION
Information exchanges among different countries continue to increase. International travelers for tourism, emigration, or foreign study are becoming increasingly diverse, heightening the need for devising a means to offer effective interaction among people who speak different languages. Since automatic speech-to-speech translation (S2ST) provides an opportunity for people to communicate in their own languages, it significantly overcomes language barriers and closes cross-cultural gaps.

Many researchers have been developing S2ST systems over the past several decades. A traditional approach to S2ST requires effort to construct several components, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis, all of which are trained and tuned independently. Given speech input, ASR processes and transforms speech into text in the source language, MT transforms the source language text into the corresponding text in the target language, and finally TTS generates speech from the text in the target language. Significant progress has been made and various commercial speech translation systems are already available for several language pairs. However, more than 6000 languages, spoken by 350 million people, have not been covered yet. Critically, over half of the world's languages actually have no written form; they are only spoken.

Recently, end-to-end deep learning frameworks have shown impressive performance on many sequence-related tasks, such as ASR, MT, and TTS [1, 2, 3]. Their architecture commonly uses an attention-based encoder-decoder mechanism, which allows the model to learn the alignments between the source and the target sequence and to perform end-to-end mapping tasks across different modalities. Many complicated hand-engineered models can also be simplified by letting neural networks find their own way to map from input to output spaces. Thus, the approach provides the possibility of learning a direct mapping between variable-length source and target sequences that are often not known a priori. Several works extended the coverage of sequence-to-sequence models by directly performing end-to-end speech translation using only a single neural network architecture instead of separately focusing on its components (ASR, MT, and TTS).

Although the first feasibility was shown by Duong et al. [4], they focused on the alignment between the speech in the source language and the text in the target language because their speech-to-word model did not yield any useful output. The first full-fledged end-to-end attention-based speech-to-text translation system was successfully demonstrated by Bérard et al. on a small French-English synthetic corpus [5], but its performance was only compared with statistical MT systems. Weiss et al. [6] demonstrated that end-to-end speech-to-text models on Spanish-English language pairs outperformed neural cascade models. Kano et al. then proved that this approach is possible for distant language pairs such as Japanese-to-English translation [7]. Similar to the model by Weiss et al. [6], although it does not explicitly transcribe the speech into text in the source language, it also does not require supervision from the ground truth of the source language transcription during training. However, most of these works remain limited to speech-to-text translation and require text transcription in the target language.
Recently, Jia et al. [8] proposed a deep learning model that is trained end-to-end and learns to map speech spectrograms into target spectrograms in another language corresponding to the translated content (in the same or a different canonical voice). Unfortunately, since training without auxiliary losses leads to extremely poor performance, they provided a solution by integrating auxiliary decoder networks to predict phoneme sequences that correspond to the source and/or target speech. Despite much progress in direct speech translation research, completely direct speech-to-speech translation without any text transcription in the source and target languages during training has not been achieved yet. Therefore, it remains difficult to scale up the existing approach to unknown languages without written forms or available transcription data.

On the other hand, there is a project held by the speech community to push toward developing unsupervised, data-driven systems that are less reliant on linguistic expertise. Zero-resource modeling is an approach where completely unsupervised techniques learn the elements of a language's speech hierarchy solely from untranscribed audio data. This means that only spoken audio data are available in a specific language, while transcriptions, annotations, and prior knowledge of it are all unavailable. The Zero Resource Speech Challenge series [9, 10, 11] was constructed to progress incrementally toward a system that learns an end-to-end spoken dialog (SD) system in an unknown language from scratch, using only information available to language-learning infants. The ZeroSpeech 2019 challenge [11] confronts the problem of constructing a speech synthesizer without any text or phonetic labels: TTS without T. It is a continuation of the subword unit discovery track of ZeroSpeech 2015 and 2017 [9, 10]. Nineteen systems were submitted, but few studies proposed end-to-end frameworks [12, 13, 14]. Among these proposed systems, the vector quantized variational autoencoder (VQ-VAE) approach provides better naturalness based on the mean opinion score (MOS) of the generated speech and the character error rate after human transcription of the synthesized speech. Further details of the results are available online.

In this paper, we take a step beyond the task of the current ZeroSpeech 2019 and propose a method for training speech-to-speech translation tasks without any transcription or linguistic supervision. Instead of only discovering subword units and synthesizing them within a certain language, our approach discovers subword units that are directly translated to another language. Our proposed method consists of two steps: (1) we train and generate a discrete representation with unsupervised term discovery, which is also based on a discrete quantized autoencoder; (2) we train a sequence-to-sequence model to directly map the source language speech to the target language discrete representation. Our proposed method can directly generate target speech without any auxiliary or pre-training steps with source or target transcription. To the best of our knowledge, this is the first work that performs pure speech-to-speech translation between untranscribed unknown languages.
2. UNSUPERVISED UNIT DISCOVERY WITH VQ-VAE
A speech signal can be disentangled into independent factors of variation such as context and speaking style. In the speech domain, we assume the context has a similar property to phonemes or subwords, which are represented with a limited set of discrete symbols. Therefore, to capture the context without any supervision, we use a generative model named the vector quantized variational autoencoder (VQ-VAE) [15] to extract the discrete symbols. There are several distinctions between a VQ-VAE and a normal autoencoder [16] or a normal variational autoencoder (VAE) [17]. The VQ-VAE encoder maps the input features to a limited number of discrete latent variables, whereas a standard VAE encoder maps the input features into continuous latent variables. Therefore, a VQ-VAE encoder has a many-to-one mapping due to restricting the representation to the nearest codebook vector, while the standard VAE encoder has a one-to-one mapping between the input and latent variables.

We illustrate the VQ-VAE model in Fig. 1 and define $E = [e_1, .., e_K] \in \mathbb{R}^{K \times D_e}$ as a collection of codebook vectors and $V = [v_1, .., v_L] \in \mathbb{R}^{L \times D_v}$ as a collection of speaker embeddings. During the encoding step, input $x$ is speech features such as MFCC or mel-spectrogram, and the speaker identity of input $x$ is denoted by $s \in \{1, .., L\}$. In Fig. 2, we show the details of the residual blocks inside the encoder and decoder modules. Encoder $q_\theta(y \mid x)$ generates a discrete latent variable $y \in \{1, .., K\}$ ($y$ can also be represented as a one-hot vector). To transform a continuous representation into a discrete random variable, the encoder first produces an intermediate continuous representation $z \in \mathbb{R}^{D_e}$. Later, we find which codebook vector in $E$ has the minimum distance to $z$.

Fig. 1. VQ-VAE for unsupervised unit discovery consists of several parts: encoder $\mathrm{Enc}^{VQ}_\theta(x) = q_\theta(y \mid x)$, decoder $\mathrm{Dec}^{VQ}_\phi(y, s) = p_\phi(x \mid y, s)$, codebooks $E = [e_1, .., e_K]$, and (optional) speaker embedding $V = [v_1, .., v_L]$.

Fig. 2. Building blocks inside the VQ-VAE encoder and decoder: a) encoder residual block and 1D convolution with stride 2 to downsample the input sequence length; b) decoder residual block and 1D transposed convolution with stride 2 to upsample the codebook sequence back to the original input length.

Mathematically, we formulate the operation:

$$q_\theta(y = c \mid x) = \begin{cases} 1 & \text{if } c = \operatorname{argmin}_i \mathrm{Dist}(z, e_i) \\ 0 & \text{else} \end{cases} \quad (1)$$

$$e_c = \mathbb{E}_{q_\theta(y \mid x)}[E] \quad (2)$$
$$\phantom{e_c} = \sum_{i=1}^{K} q_\theta(y = i \mid x)\, e_i, \quad (3)$$

where $\mathrm{Dist}(\cdot,\cdot): \mathbb{R}^{D_e} \times \mathbb{R}^{D_e} \rightarrow \mathbb{R}$ is a function that calculates the distance between two vectors. In this paper, we define $\mathrm{Dist}(a, b) = \|a - b\|_2$ as the L2-norm distance.

After we find the closest codebook index $c \in \{1, .., K\}$, we substitute the intermediate variable $z$ with the corresponding codebook vector $e_c$. To reconstruct the input data, decoder $p_\phi(x \mid y, s)$ reads codebook vector $e_c$ and speaker embedding $v_s$ and generates reconstruction $\hat{x}$. The following is the learning objective for the VQ-VAE:

$$\mathcal{L}_{VQ} = -\log p_\phi(x \mid y, s) + \gamma\, \|z - \mathrm{sg}(e_c)\|_2^2, \quad (4)$$

where function $\mathrm{sg}(\cdot)$ stops the gradient and is defined as:

$$\mathrm{sg}(x) = x \quad (5)$$
$$\frac{\partial\, \mathrm{sg}(x)}{\partial x} = 0. \quad (6)$$

The first term is a negative log-likelihood that measures the reconstruction loss between the original input $x$ and the reconstruction $\hat{x}$, used to optimize encoder parameters $\theta$ and decoder parameters $\phi$. The second term minimizes the distance between the intermediate representation $z$ and the nearest codebook vector $e_c$, but its gradient is only back-propagated into the encoder parameters $\theta$ as a commitment loss. The commitment loss can be scaled with an additional hyperparameter $\gamma$. To update the codebook vectors, we use an exponential moving average (EMA) [18]. With the EMA update rule for training codebook $E$, the model achieves more stable results during training and avoids the posterior collapse issue [19].
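To make the quantization step concrete, the following is a minimal PyTorch sketch of the VQ bottleneck only: nearest-codebook lookup (Eqs. 1-3), the commitment term of Eq. 4, and straight-through gradients. The convolutional encoder/decoder, speaker embedding, and the EMA codebook update of [18] are omitted; the class name, default sizes, and $\gamma = 0.25$ are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

class VectorQuantizerSketch(torch.nn.Module):
    """Illustrative VQ bottleneck: nearest-codebook lookup, commitment loss,
    and straight-through gradient estimation."""

    def __init__(self, num_codes=64, code_dim=64, gamma=0.25):
        super().__init__()
        # Codebook E = [e_1, .., e_K]; in the paper it is updated with EMA, not by gradient.
        self.codebook = torch.nn.Parameter(torch.randn(num_codes, code_dim))
        self.gamma = gamma  # commitment-loss weight

    def forward(self, z):
        # z: (batch, time, D_e) continuous encoder outputs.
        codes = self.codebook.unsqueeze(0).expand(z.size(0), -1, -1)  # (B, K, D_e)
        dist = torch.cdist(z, codes)                                  # L2 distances, (B, T, K)
        y = dist.argmin(dim=-1)                                       # discrete unit indices, (B, T)
        e_c = self.codebook[y]                                        # substitute z with nearest e_c
        commit_loss = self.gamma * F.mse_loss(z, e_c.detach())        # second term of Eq. (4)
        e_st = z + (e_c - z).detach()                                 # straight-through estimator
        return y, e_st, commit_loss
```

In a full model, the decoder would consume `e_st` (together with a speaker embedding) to reconstruct the input, and the reconstruction loss of Eq. 4 would be added to `commit_loss`.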
3. SEQUENCE-TO-SEQUENCE FROM SPEECH TO CODEBOOK
Our speech-to-speech translation model is built on an attention-based sequence-to-sequence (seq2seq) framework [20, 21]. Assume a paired source sequence $X = [x_1, ..., x_S]$ and target sequence $Y = [y_1, ..., y_T]$. A sequence-to-sequence model directly learns the mapping $P_\psi(Y \mid X)$, parameterized by $\psi$. In this paper, we specify $X \in \mathbb{R}^{S \times D_s}$ to represent speech features such as MFCC or mel-spectrogram and $Y = [y_1, ..., y_T]$ with $y_t \in \{1, .., K\}$ to represent codebook $E$ indices. Fig. 3 illustrates a seq2seq model with an attention mechanism. Inside a seq2seq model, there are three different components:

1. The encoder module reads the whole sequence of speech features and represents it with $h^E = [h^E_1, .., h^E_S] \in \mathbb{R}^{S \times M}$, where $h^E = \mathrm{Enc}^{SS}_\psi(X)$.

2. The attention module assists the decoder in finding which part of the encoder states contains information related to the current decoding state [20]. Given decoder state $h^D_t \in \mathbb{R}^N$, the attention module generates attention $a_t \in \mathbb{R}^S$ and context $c_t \in \mathbb{R}^N$:

$$a_t[s] = \frac{\exp\left(\mathrm{Score}(h^E_s, h^D_t)\right)}{\sum_{s'=1}^{S} \exp\left(\mathrm{Score}(h^E_{s'}, h^D_t)\right)} \quad (7)$$

$$c_t = \sum_{s=1}^{S} a_t[s]\, h^E_s, \quad (8)$$

where function $\mathrm{Score}(\cdot,\cdot): \mathbb{R}^M \times \mathbb{R}^N \rightarrow \mathbb{R}$ predicts the relevancy value between the encoder and decoder states. Many Score functions exist, including dot-product [22], MLP [20], or modified MLP with history [23].

3. The decoder module predicts the class probability $p_t = \mathrm{Dec}^{SS}_\psi(y_t \mid c_t, y_{<t})$ of the next codebook index given the context vector and the previously emitted indices. During training, we minimize the negative log-likelihood loss:

$$\mathcal{L}_{NLL} = -\sum_{t=1}^{T} \log P_\psi(y_t \mid c_t, y_{<t}). \quad (9)$$

Fig. 3. Sequence-to-sequence model with attention mechanism. Here the encoder input is speech features $X = [x_1, .., x_S]$, and the decoder predicts codebook index $y_t$ for each time-step.

4. CODEBOOK INVERTER

A codebook inverter is a module that synthesizes the corresponding speech utterance from a sequence of codebook indices. Its input is a sequence of codebook embeddings $[E[y_1], .., E[y_{T_Y}]]$, and the output target is a sequence of speech representations (e.g., a linear magnitude spectrogram) $X^R = [x^R_1, .., x^R_{T_X}]$.

We illustrate our codebook inverter architecture in Fig. 4. Our codebook inverter is composed of several residual 1D blocks, followed by stacked bidirectional LSTMs [24], and finally several more residual 1D blocks. Fig. 5 shows the details inside the block. Under certain circumstances, the codebook sequence length $T_Y$ might be shorter than $T_X$ because the VQ-VAE encoder $q_\theta(y \mid x)$ has convolutions with a stride larger than 1. Therefore, to align the codebook sequence with the speech representation target sequence, we duplicate each codebook vector $E[y_t]$ into $r$ copies side-by-side, where $r = T_X / T_Y$. To train the codebook inverter, we set the objective function:

$$\mathcal{L}_{INV} = \|X^R - \hat{X}^R\|_2^2 \quad (10)$$

to minimize the L2-norm between the predicted spectrogram $\hat{X}^R = \mathrm{Inv}_\rho([E[y_1], ..., E[y_{T_Y}]])$ and the ground-truth spectrogram $X^R$. We define $\mathrm{Inv}_\rho$ as the inverter parameterized by $\rho$. In the inference stage, we used Griffin-Lim [27] to reconstruct the phase from the spectrogram and applied an inverse short-time Fourier transform (STFT) to invert it into a speech waveform.

Fig. 4. Codebook inverter: given codebook sequence $[E[y_1], .., E[y_{T_Y}]]$, we predict the corresponding linear magnitude spectrogram $\hat{X}^R = [x^R_1, .., x^R_{T_X}]$. If the lengths $T_Y$ and $T_X$ differ, we consecutively duplicate each codebook vector $r$ times.
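The sketch below illustrates, under stated assumptions, the two data-handling steps around the inverter: length-aligning the codebook sequence to the spectrogram frames by side-by-side duplication, and inverting a predicted linear magnitude spectrogram to a waveform with Griffin-Lim at inference time. The inverter network itself is omitted, the helper names are illustrative, and the 16 kHz sampling rate (hop 160, window 400 samples) is an assumption not stated in the text.

```python
import numpy as np
import librosa

def prepare_inverter_input(codebook_vectors, t_x):
    """Duplicate each codebook vector r times (r = T_X / T_Y) so the inverter
    input length matches the T_X target spectrogram frames."""
    t_y = codebook_vectors.shape[0]          # (T_Y, D_e)
    r = t_x // t_y
    return np.repeat(codebook_vectors, r, axis=0)   # (T_Y * r, D_e)

def spectrogram_to_waveform(magnitude_spectrogram, hop_length=160, win_length=400):
    """Invert a (1025, frames) linear magnitude spectrogram to audio with
    Griffin-Lim phase reconstruction followed by the inverse STFT."""
    return librosa.griffinlim(magnitude_spectrogram, n_iter=60,
                              hop_length=hop_length, win_length=win_length)
```

Consecutive duplication with `np.repeat` matches the side-by-side copying described above, and `librosa.griffinlim` performs both the iterative phase estimation and the inverse STFT in one call.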
5. TRAINING AND INFERENCE

In this section, we explain our proposed method in detail, step by step. To train our proposed model, we set up three different modules: VQ-VAE (Section 2), a speech-to-codebook seq2seq model (Section 3), and a codebook inverter (Section 4). Fig. 6 shows which modules are trained in each step. Initially, we define $\{X^M_{src}, X^M_{tgt}\}$ as paired parallel speech, where $X^M_{src}$ is the MFCC features from the source language and $X^M_{tgt}$ is the MFCC features from the target language. $Y_{tgt}$ is the codebook sequence generated by the VQ-VAE encoder $\mathrm{Enc}^{VQ}_\theta(\cdot)$ given $X^M_{tgt}$ as the input. $\hat{X}^R_{tgt}$ is the predicted linear spectrogram of the target language. $\mathcal{L}_{VQ}$, $\mathcal{L}_{INV}$, and $\mathcal{L}_{NLL}$ are calculated by the formulas in Eqs. (4), (10), and (9).

1. First, we trained the VQ-VAE model on the target language MFCC $X^M_{tgt}$. We also trained the codebook inverter to predict the corresponding linear spectrogram $X^R_{tgt}$.

2. Second, we trained the seq2seq model from the source language speech to the target language codebook. Given paired parallel MFCCs from the source and target languages $\{X^M_{src}, X^M_{tgt}\}$, we extracted codebook sequence $Y_{tgt} = \mathrm{Enc}^{VQ}_\theta(X^M_{tgt})$ from the VQ-VAE encoder. Later, we trained the seq2seq translation model to predict $\hat{Y}_{tgt} = \mathrm{Seq2Seq}_\psi(X^M_{src})$ and minimize loss $\mathcal{L}_{NLL}$ between $\hat{Y}_{tgt}$ and $Y_{tgt}$.

3. In the inference step, given source language speech $X^M_{src}$, we decoded a target language codebook index sequence $\hat{Y}_{tgt} = \mathrm{Seq2Seq}_\psi(X^M_{src})$ and synthesized it into target language speech $\hat{X}^R_{tgt} = \mathrm{Inverter}(\hat{Y}_{tgt})$.

Fig. 5. The residual 1D block combines multiscale 1D convolutions with different kernel sizes and "SAME" padding, the LeakyReLU [25] activation function, and batch normalization [26].

6. EXPERIMENTAL SETUP

6.1. Dataset

In this paper, we ran our experiments on the Basic Travel Expression Corpus (BTEC) [28, 29], which has several language pairs. We chose two tasks: French-to-English and Japanese-to-English. For both language pairs, we used the BTEC1 set, which consists of 162,318 training sentences and 510 test sentences. Since speech utterances for these sentences are unavailable, we generated speech with the Google text-to-speech API for all language pairs. Despite the lack of a natural speech dataset in this paper, the VQ-VAE and codebook inverter can be applied to and have shown great performance on multispeaker natural speech [14, 13]. Some papers [30, 31, 32] also show that performance improvements obtained on a synthetic dataset can carry over to a real dataset.

For the source and target language speech utterances, we represented the speech with mel-frequency cepstral coefficients (MFCCs) with 13 dimensions $+\Delta+\Delta^2$ (39 dimensions in total). For the target language speech utterances, we also generated a linear magnitude spectrogram with 1025 dimensions as the training target for the codebook inverter (Section 4). For each frame, we extracted the MFCCs and the linear magnitude spectrogram with a 25-millisecond window and 10-millisecond time-steps. We extracted both the MFCCs and the linear magnitude spectrogram with the Librosa [33] library; a minimal extraction sketch is given after the figure caption below.

Fig. 6. a) Train the VQ-VAE to represent continuous MFCC vectors with a codebook sequence, and train the codebook inverter to generate a linear magnitude spectrogram from the generated codebook sequence; b) train a seq2seq model from source language MFCC to target language codebook; c) in the inference stage, the seq2seq model takes source language MFCC and predicts codebook sequences, and the codebook inverter then generates the target language speech representation.
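The following is a minimal sketch of this feature pipeline with librosa, assuming a 16 kHz sampling rate and an FFT size of 2048 (which yields the 1025 frequency bins mentioned above); the function name is illustrative and not from the paper.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """Return (frames, 39) MFCC+delta+delta-delta and (frames, 1025) linear magnitude spectrogram."""
    y, _ = librosa.load(wav_path, sr=sr)
    win, hop = int(0.025 * sr), int(0.010 * sr)       # 25 ms window, 10 ms step

    # 13 MFCCs + delta + delta-delta -> 39-dimensional frames (VQ-VAE / seq2seq input).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=2048, win_length=win, hop_length=hop)
    mfcc39 = np.vstack([mfcc,
                        librosa.feature.delta(mfcc),
                        librosa.feature.delta(mfcc, order=2)])

    # 1025-bin linear magnitude spectrogram, the codebook-inverter training target.
    lin_spec = np.abs(librosa.stft(y, n_fft=2048, win_length=win, hop_length=hop))

    return mfcc39.T, lin_spec.T
```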
For an objective evaluation of the target speech utterances, there is currently no standard method that measures translation quality directly on the speech. Therefore, we utilized an ASR model pre-trained on the English BTEC dataset and evaluated the generated transcriptions. For the ASR architecture, the encoder module has three stacked Bi-LSTMs with 512 hidden units, and the decoder has one LSTM with 512 hidden units. For the attention module, we utilized MLP attention with multiscale location history [23]. For the output unit, we used word-level tokens from the English transcription. Because there is a performance gap between the ASR output and the ground truth caused by imperfect transcription, we assume the metric (calculated based on the ASR transcription) is a lower bound for the corresponding translation model. We utilized two metrics to evaluate the translation performance from the transcribed text: BLEU [34] and METEOR [35], computed with the Multeval toolkit [36] (a minimal scoring sketch is given at the end of this section). Our pre-trained ASR model achieved a 2.84% WER, a 94.9 BLEU, and a 69.1 METEOR on English speech utterances from the BTEC test set, and we set those scores as the ground-truth topline scores.

7. RESULTS AND DISCUSSION

In this section, we present our experimental results, followed by a discussion.

For the baseline translation task, we modified the Tacotron [3] model by changing the source input from a one-hot character embedding into a continuous vector. Basically, we replaced the embedding layer in the encoder with a linear projection layer. Therefore, this model directly translated the source language MFCC into a target language mel-spectrogram. However, this approach did not converge at all and produced no audible speech. [8] also observed a similar result in a similar scenario.

We set the topline performance by using a cascade of ASR and TTS systems. First, we train the ASR system using the source language MFCC as the input and the target language character transcription as the output. Second, we train a TTS model based on Tacotron [3] to generate speech from the target language characters to the target language speech representation.

Table 1. Experimental results on BTEC French-English speech-to-speech translation.

  Model (FR-EN)                        Codebook  Time reduction  BLEU  METEOR
  Baseline: Tacotron with MFCC input      -           -            -      -
  Proposed: Speech2Code                  32           4          19.4   19.1
                                         32           8          23.8   22.2
                                         32          12          23.2   22.1
                                         64           4          16.1   16.9
                                         64           8          24.4   22.9
                                         64          12          25.0   23.2
                                        128           4          16.9   17.4
                                        128           8          23.3   22.1
                                        128          12          24.2   21.9
  Topline: Cascade ASR -> TTS

Table 1 shows our experimental results across different codebook sizes and time-reduction factors. Our best performance was produced by a codebook size of 64 and a time-reduction factor of 12, with a score of 25.0 BLEU and 23.2 METEOR.

Table 2. Experimental results on BTEC Japanese-English speech-to-speech translation.

  Model (JA-EN)                        Codebook  Time reduction  BLEU  METEOR
  Baseline: Tacotron with MFCC input      -           -            -      -
  Proposed: Speech2Code                  32           4          14.8   15.0
                                         32           8          14.2   15.6
                                         32          12          16.0   16.0
                                         64           4          10.8   12.1
                                         64           8          14.2   14.7
                                         64          12          14.7   14.8
                                        128           4          11.9   13.5
                                        128           8          15.3   15.3
                                        128          12          14.9   14.5
  Topline: Cascade ASR -> TTS

Table 2 shows our experimental results across different codebook sizes and time-reduction factors. Our best performance was produced by a codebook size of 128 and a time-reduction factor of 8, with a score of 15.3 BLEU and 15.3 METEOR.
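Both tables report BLEU and METEOR computed on ASR transcriptions of the generated speech. As an illustration of that scoring step only, the sketch below computes corpus BLEU with sacrebleu; the paper itself uses the Multeval toolkit [36], and the file names here are hypothetical.

```python
import sacrebleu

# Hypothetical files: one sentence per line, aligned with the 510 BTEC test utterances.
with open("asr_transcripts_of_generated_speech.txt") as f:
    hypotheses = [line.strip() for line in f]
with open("english_references.txt") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```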
In Table 3, we provide some transcription examples from the ground truth, our proposed speech-to-code model, and the topline cascade ASR-TTS models. In the first example, all models' translations convey a meaning similar to the ground truth. In the second example, all models still preserve semantics similar to the ground truth; however, compared to the topline, the speech-to-code model does not produce the additional translation for "as soon as he comes in". In the third example, our proposed method only translates the beginning of the sentence correctly and produces an incorrect result in the latter part. Based on the transcription results, the missing parts and arbitrary transcriptions in the latter half would be interesting to investigate in the future. For further information and translation samples, the reader can refer to: https://sp2code-translation-v1.netlify.com/.

Table 3. Transcription examples from the ground truth, our proposed Speech2Code, and the topline (cascade ASR-TTS) model.

  Model               Transcription Result
  Groundtruth         how long are you going to stay
  Speech2Code FR-EN   how long are you going to stay
  Speech2Code JA-EN   how long will it take
  Topline FR-EN       how long are you staying
  Topline JA-EN       how long are you staying

  Groundtruth         please tell him to call me as soon as he comes in
  Speech2Code FR-EN   please tell him to call me back
  Speech2Code JA-EN   please tell him that i called
  Topline FR-EN       please tell her to call me and check it
  Topline JA-EN       please ask him to call me as soon as possible

  Groundtruth         i would like a balcony seat please
  Speech2Code FR-EN   i would like to have this film please
  Speech2Code JA-EN   i would like a seat near the seat
  Topline FR-EN       i would like a balcony seat please
  Topline JA-EN       i would like a balcony seat

8. CONCLUSION

In this paper, we proposed a novel approach for training speech-to-speech translation between two languages without any transcription. First, we trained a discrete quantized autoencoder to generate a discrete representation from the target speech features. Second, we trained a sequence-to-sequence model to predict the codebook sequence given the source speech representation. This method is applicable to any type of language, with or without a written form, because the target speech representations are trained and generated in an unsupervised manner. Based on our experimental results, our model can perform direct speech-to-speech translation on French-English and Japanese-English.

9. ACKNOWLEDGMENTS

Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101 and JP17K00237.

10. REFERENCES
[1] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 577-585.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014.
[3] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: Towards end-to-end speech synthesis," arXiv preprint arXiv:1703.10135, 2017.
[4] Long Duong, Antonios Anastasopoulos, David Chiang, Steven Bird, and Trevor Cohn, "An attentional model for speech translation without transcription," in NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, 2016, pp. 949-959.
[5] Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier, "Listen and translate: A proof of concept for end-to-end speech-to-text translation," CoRR, vol. abs/1612.01744, 2016.
[6] Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen, "Sequence-to-sequence models can directly translate foreign speech," in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, 2017, pp. 2625-2629.
[7] Takatomo Kano, Sakriani Sakti, and Satoshi Nakamura, "Structured-based curriculum learning for end-to-end English-Japanese speech translation," in Proc. Interspeech 2017, 2017, pp. 2630-2634.
[8] Ye Jia, Ron J Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu, "Direct speech-to-speech translation with a sequence-to-sequence model," arXiv preprint arXiv:1904.06037, 2019.
[9] Maarten Versteegh, Roland Thiolliere, Thomas Schatz, Xuan Nga Cao, Xavier Anguera, Aren Jansen, and Emmanuel Dupoux, "The zero resource speech challenge 2015," in Sixteenth Annual Conference of the International Speech Communication Association, 2015, pp. 3169-3173.
[10] Ewan Dunbar, Xuan Nga Cao, Juan Benjumea, Julien Karadayi, Mathieu Bernard, Laurent Besacier, Xavier Anguera, and Emmanuel Dupoux, "The zero resource speech challenge 2017," IEEE, 2017, pp. 323-330.
[11] Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie Miskic, Charlotte Dugrain, Lucas Ondel, Alan W. Black, Laurent Besacier, Sakriani Sakti, and Emmanuel Dupoux, "The zero resource speech challenge 2019: TTS without T," CoRR, vol. abs/1904.11469, 2019.
[12] Andy T. Liu, Po-chun Hsu, and Hung-yi Lee, "Unsupervised end-to-end learning of discrete linguistic units for voice conversion," CoRR, vol. abs/1905.11563, 2019.
[13] Suhee Cho, Yeonjung Hong, Yookyung Shin, and Youngsun Cho, "VQVAE with speaker adversarial training," https://github.com/Suhee05/Zerospeech2019, 2019.
[14] Andros Tjandra, Berrak Sisman, Mingyang Zhang, Sakriani Sakti, Haizhou Li, and Satoshi Nakamura, "VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019," CoRR, vol. abs/1905.11449, 2019.
[15] Aaron van den Oord, Oriol Vinyals, et al., "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017, pp. 6306-6315.
[16] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096-1103.
[17] Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer, "Fast decoding in sequence models using discrete latent variables," in International Conference on Machine Learning, 2018, pp. 2395-2404.
[19] Aurko Roy, Ashish Vaswani, Niki Parmar, and Arvind Neelakantan, "Towards a better understanding of vector quantized autoencoders," OpenReview, 2018.
[20] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[21] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[22] Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective approaches to attention-based neural machine translation," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, Sept. 2015, pp. 1412-1421, Association for Computational Linguistics.
[23] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Multi-scale alignment and contextual history for attention mechanism in sequence-to-sequence model," IEEE, 2018, pp. 648-655.
[24] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[25] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li, "Empirical evaluation of rectified activations in convolutional network," arXiv preprint arXiv:1505.00853, 2015.
[26] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[27] Daniel Griffin and Jae Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236-243, 1984.
[28] Genichiro Kikui, Eiichiro Sumita, Toshiyuki Takezawa, and Seiichi Yamamoto, "Creating corpora for speech-to-speech translation," in Eighth European Conference on Speech Communication and Technology, 2003.
[29] Gen-ichiro Kikui, Seiichi Yamamoto, Toshiyuki Takezawa, and Eiichiro Sumita, "Comparative study on corpora for speech translation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1674-1682, 2006.
[30] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Listening while speaking: Speech chain by deep learning," IEEE, 2017, pp. 301-308.
[31] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Machine speech chain with one-shot speaker adaptation," Proc. Interspeech 2018, pp. 887-891, 2018.
[32] A. Tjandra, S. Sakti, and S. Nakamura, "End-to-end feedback loss in speech chain framework via straight-through estimator," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6281-6285.
[33] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, "librosa: Audio and music signal analysis in Python," 2015.
[34] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311-318.
[35] Satanjeev Banerjee and Alon Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, June 2005, pp. 65-72, Association for Computational Linguistics.
[36] Jonathan H Clark, Chris Dyer, Alon Lavie, and Noah A Smith, "Better hypothesis testing for statistical machine translation: Controlling for optimizer instability," in