Automatic Quality Assessment for Speech Translation Using Joint ASR and MT Features
Ngoc-Tien Le · Benjamin Lecouteux · Laurent Besacier
Received: date / Accepted: date
Abstract

This paper addresses automatic quality assessment of spoken language translation (SLT). This relatively new task is defined and formalized as a sequence labeling problem in which each word of the SLT hypothesis is tagged as good or bad according to a large feature set. We propose several word confidence estimators (WCE) based on our automatic evaluation of transcription (ASR) quality, translation (MT) quality, or both (combined ASR+MT). This research work is made possible by a specific corpus we built, which contains 6.7k utterances for which a quintuplet is available: ASR output, verbatim transcript, text translation, speech translation and post-edition of the translation. The conclusion of our multiple experiments using joint ASR and MT features for WCE is that MT features remain the most influential, while ASR features can bring interesting complementary information. Our robust quality estimators for SLT can be used for re-scoring speech translation graphs or for providing feedback to the user in interactive speech translation or computer-assisted speech-to-text scenarios.

Keywords Quality estimation · Word confidence estimation (WCE) · Spoken Language Translation (SLT) · Joint Features · Feature Selection
1 Introduction

Automatic quality assessment of spoken language translation (SLT), also named confidence estimation (CE), is an important topic because it makes it possible to know whether a system produces user-acceptable outputs. In interactive speech-to-speech translation, CE helps to judge whether a translated turn is uncertain
(Authors' affiliation: Laboratoire d'Informatique de Grenoble, University of Grenoble Alpes, France; Building IMAG, 700 Centrale, 38401 Saint Martin d'Hères; Tel.: +33 457421454)

(and ask the speaker to rephrase or repeat). For speech-to-text applications, CE may tell us whether output translations are worth being corrected or whether they require retranslation from scratch. Moreover, accurate CE can also help to improve SLT itself through second-pass N-best list re-ranking or search-graph re-decoding, as has already been done for text translation in [2] and [19], or for speech translation in [4]. Consequently, building a method capable of pointing out the correct parts, as well as detecting the errors, in a speech translation output is crucial to tackle the above issues.

Given a signal x_f in the source language, spoken language translation (SLT) consists in finding the most probable target-language sequence ê = (e_1, e_2, ..., e_N) such that

  ê = argmax_e { p(e | x_f, f) }    (1)

where f = (f_1, f_2, ..., f_M) is the transcription of x_f.

Now, if we perform confidence estimation at the word level, the problem is called Word-level Confidence Estimation (WCE) and we can represent this information as a sequence q (of the same length N as ê), where q = (q_1, q_2, ..., q_N) and q_i ∈ {good, bad}.

Then, integrating automatic quality assessment into our SLT process can be done as follows:

  ê = argmax_e Σ_q p(e, q | x_f, f)    (2)
  ê = argmax_e Σ_q p(q | x_f, f, e) · p(e | x_f, f)    (3)
  ê ≈ argmax_e { max_q { p(q | x_f, f, e) · p(e | x_f, f) } }    (4)

In the product of (4), the SLT component p(e | x_f, f) and the WCE component p(q | x_f, f, e) contribute together to find the best translation output ê. In the past, WCE has been treated separately in ASR or MT contexts; we propose here a joint estimation of word confidence for a spoken language translation (SLT) task involving both ASR and MT.

This journal paper is an extended version of a paper published at ASRU 2015 [4], but we focus more on the WCE component and on the best approaches to estimate p(q | x_f, f, e) accurately.

Contributions
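The rescoring view in equation (4) can be illustrated with a minimal Python sketch. This is not the system described in the paper, only an illustration of the decomposition: each N-best hypothesis carries an SLT model score p(e | x_f, f), the per-word WCE term max_q p(q | x_f, f, e) is approximated word-by-word from good-label probabilities, and the two are multiplied. All names and the input format are illustrative assumptions.

```python
# Illustrative sketch of equation (4): rescore an N-best list by
# p(e | x_f, f) * max_q p(q | x_f, f, e), the latter approximated as an
# independent per-word product. Input format is a hypothetical one:
# each hypothesis is (slt_prob, list of per-word good-label probabilities).

def wce_score(word_good_probs):
    """Approximate max_q p(q | ...) by taking the more confident label per word."""
    score = 1.0
    for p_good in word_good_probs:
        score *= max(p_good, 1.0 - p_good)
    return score

def rescore_nbest(nbest):
    """Return the index of the hypothesis maximizing SLT score * WCE score."""
    best_i, best_score = -1, float("-inf")
    for i, (slt_prob, word_good_probs) in enumerate(nbest):
        score = slt_prob * wce_score(word_good_probs)
        if score > best_score:
            best_i, best_score = i, score
    return best_i
```

With this toy scoring, a slightly less probable hypothesis whose words are confidently labeled can overtake the SLT 1-best, which is the intended effect of the WCE component in (4).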
The main contributions of this journal paper are the following:
– A corpus (distributed to the research community²) dedicated to WCE for SLT was initially published in [3]. We present, in this paper, its extension from 2643 to 6693 speech utterances.
– While our previous work on quality assessment was based on two separate WCE classifiers (one for quality assessment in ASR and one for quality assessment in MT), we propose here a unique joint model based on different feature types (ASR and MT features).
– This joint model allows us to operate feature selection and analyze which features (from ASR or MT) are the most efficient for quality assessment in speech translation.
– We also experiment with two ASR systems that have different performance, in order to analyze the behavior of our SLT quality assessment algorithms at different levels of word error rate (WER).

1. q_i could also take more than 2 labels, or even scores, but this paper only deals with error detection (binary set of labels).
2. https://github.com/besacier/WCE-SLT-LIG

Outline
The outline of this paper goes as follows: section 2 reviews the state of the art on confidence estimation for ASR and MT. Our WCE system using multiple features is then described in section 3. The experimental setup (notably our specific WCE corpus) is presented in section 4, while section 5 evaluates our joint WCE system. Feature selection for quality assessment in speech translation is analyzed in section 6 and, finally, section 7 concludes this work and gives some perspectives.

2 Related work
Several previous works tried to propose effective confidence measures in order to detect errors in ASR outputs. Confidence measures were introduced for Out-Of-Vocabulary (OOV) detection by [1]. [27] extends this work and introduces the use of word posterior probability (WPP) as a confidence measure for speech recognition. The posterior probability of a word is most of the time computed using the hypothesis word graph [10]. More recent approaches [16] for confidence measure estimation use side information extracted from the recognizer: normalized likelihoods (WPP), the number of competitors at the end of a word (hypothesis density), decoding process behavior, linguistic features, acoustic features (acoustic stability, duration features) and semantic features.

In parallel, the Workshop on Machine Translation (WMT) introduced in 2013 a WCE task for machine translation. [9] and [21] employed the Conditional Random Fields (CRF) [12] model as their machine learning method to address the problem as a sequence labelling task. Meanwhile, [5] extended their initial proposition by dynamic training with adaptive weight updates in their neural network classifier. As far as prediction indicators are concerned, [5] proposed seven word feature types and found among them the "common cover links" (the links that point from the leaf node containing this word to other leaf nodes in the same subtree of the syntactic tree) the most outstanding. [9] focused only on various n-gram combinations of target words. Inheriting most of the previously recognized features, [21] integrated a number of new indicators relying on graph topology, pseudo reference, syntactic behavior (constituent label, distance to the semantic tree root) and polysemy characteristics. The estimation of the confidence score mainly uses classifiers such as Conditional Random
Fields [9,18], Support Vector Machines [13] or Perceptron [5]. Some investigations were also conducted to determine which features seem to be the most relevant. [13] proposed to filter features using a forward-backward algorithm to discard linearly correlated features. Using Boosting as the learning algorithm, [20] was able to take advantage of the most significant features. Finally, several toolkits for WCE were recently proposed: TranscRater for ASR³, Marmot for MT⁴, as well as WCE-LIG [25]⁵, which will be used to extract MT features in the experiments of this journal paper.

To our knowledge, the first attempt to design WCE for speech translation, using both ASR and MT features, is our own work [3,4], which is further extended in this journal paper.

3 Word confidence estimation for SLT

The WCE component solves the equation:

  q̂ = argmax_q { p_SLT(q | x_f, f, e) }    (5)

where q = (q_1, q_2, ..., q_N) is the sequence of quality labels on the target language. This is a sequence labelling task that can be solved with several machine learning techniques, such as Conditional Random Fields (CRF) [12]. However, this requires a large amount of training data for which a quadruplet (x_f, f, e, q) is available. In this work, we use a corpus extended from [3], which contains 6.7k utterances. We will investigate whether this amount of data is enough to train and evaluate a joint model p_SLT(q | x_f, f, e).

As it is much easier to obtain data containing either the triplet (x_f, f, q) (automatically transcribed speech with manual references and quality labels inferred from word error rate estimation) or the triplet (f, e, q) (automatically translated text with manual post-editions and quality labels inferred using tools such as TERp-A [26]), we can also recast the WCE problem with the following equation:

  q̂ = argmax_q { p_ASR(q | x_f, f)^α · p_MT(q | e, f)^(1−α) }    (6)

where α is a weight giving more or less importance to WCE_ASR (quality assessment on transcription) compared to WCE_MT (quality assessment on translation). It is important to note that p_ASR(q | x_f, f) corresponds to the quality estimation of the words in the target language based on features calculated on the source language (ASR). For that, we project source quality scores to the target using word-alignment information between the e and f sequences. This alternative approach (equation 6) will also be evaluated in this work.
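The projection-and-combination scheme of equation (6) can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's implementation: alignments are given as a hypothetical target-to-source index map, an unaligned target word falls back to the previous word's score, and the α-weighted merge is a per-word geometric interpolation.

```python
# Illustrative sketch of equation (6): project source-side (ASR) confidence
# scores onto target words through word alignments, then combine them with
# target-side (MT) scores using the weight alpha.

def project_asr_scores(asr_scores, align, n_target):
    """align: dict mapping a target index to the list of aligned source indices.
    Multiply-aligned words get the average score; unaligned words copy the
    previous target word's score (0.5 if there is none)."""
    projected = []
    for j in range(n_target):
        sources = align.get(j, [])
        if sources:
            projected.append(sum(asr_scores[i] for i in sources) / len(sources))
        else:
            projected.append(projected[-1] if projected else 0.5)
    return projected

def combine(asr_proj, mt_scores, alpha=0.5):
    """Per-word p_ASR^alpha * p_MT^(1-alpha), as in equation (6)."""
    return [(a ** alpha) * (m ** (1.0 - alpha))
            for a, m in zip(asr_proj, mt_scores)]
```

With alpha = 0.5 (the untuned value used later in the experiments), the combination reduces to the geometric mean of the two confidence scores.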
3. https://github.com/hlt-mt/TranscRater
4. https://github.com/qe-team/marmot
5. https://github.com/besacier/WCE-LIG
Table 1 List of MT features extracted.

 1 Proper Name                 10 Stop Word                      19 WPP Max
 2 Unknown Stem                11 Word Context Alignments        20 Nodes
 3 Num. of Word Occ.           12 POS Context Alignments         21 Constituent Label
 4 Num. of Stem Occ.           13 Stem Context Alignments        22 Distance To Root
 5 Polysemy Count – Target     14 Longest Target N-gram Length   23 Numeric
 6 Backoff Behaviour – Target  15 Longest Source N-gram Length   24 Punctuation
 7 Alignment Features          16 WPP Exact
 8 Occur in Google Translate   17 WPP Any
 9 Occur in Bing Translator    18 WPP Min

In both approaches, joint (p_SLT(q | x_f, f, e)) and combined (p_ASR(q | x_f, f) + p_MT(q | e, f)), some features need to be extracted from the ASR and MT modules. They are detailed in the next subsections.

3.1 WCE features for speech transcription (ASR)

In this work, we extract several types of features, which come from the ASR graph, from language model scores and from a morphosyntactic analysis. These features are listed below (more details can be found in [3]):
– Acoustic features: word duration (F-dur).
– Graph features (extracted from the ASR word confusion networks): number of alternative paths between two nodes (F-alt); word posterior probability (F-post).
– Linguistic features (based on language model probabilities): the word itself (F-word), 3-gram probability (F-3g), log probability (F-log), back-off level of the word (F-back), as proposed in [6].
– Lexical features: Part-of-Speech (POS) tag of the word (F-POS).
– Context features: Part-of-Speech tags in the neighborhood of a given word (F-context).

For each word in the ASR hypothesis, we estimate the 9 features (F-word, F-3g, F-back, F-log, F-alt, F-post, F-dur, F-POS, F-context) described above. In a preliminary experiment, we evaluate these features for quality assessment in ASR only (the WCE_ASR task). Two different classifiers are used: a variant of the boosting classification algorithm called bonzaiboost [14] (implementing the Adaboost.MH boosting algorithm over deeper trees) and Conditional Random Fields [12].

3.2 WCE features for machine translation (MT)

A number of knowledge sources are employed for extracting features, for a total of 24 major feature types (see Table 1). It is important to note that we extract features regarding tokens in the machine translation (MT) hypothesis sentence; in other words, one set of features is extracted for each token in the MT output. So, in Table 1, "target" refers to a feature coming from the MT hypothesis and "source" refers to a feature extracted from the source word aligned to the considered target word. More details on some of these features are given in the next subsections.
These features are given by the machine translation system, which outputs additional data such as the N-best list. The Word Posterior Probability (WPP) and Nodes features are extracted from a confusion network, which comes from the N-best list output of the machine translation system. WPP Exact is the WPP value of the word concerned at the exact same position in the graph. WPP Any extracts the same information at any position in the graph. WPP Min gives the smallest WPP value concerned by the transition and WPP Max its maximum.
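The "exact" versus "any" distinction can be sketched on a weighted N-best list directly (a simplification: in the paper these features come from a confusion network built from the N-best list). Each hypothesis votes with its normalized posterior weight; "exact" counts the word only at the same position, "any" counts it anywhere in the hypothesis. The input format is an illustrative assumption.

```python
# Illustrative sketch of WPP-style features: posterior-weighted votes from a
# hypothetical N-best list given as (weight, token list) pairs.

def wpp_features(nbest, word, position):
    """Return (wpp_exact, wpp_any) for `word` at `position`."""
    total = sum(w for w, _ in nbest)
    exact = sum(w for w, toks in nbest
                if position < len(toks) and toks[position] == word)
    anywhere = sum(w for w, toks in nbest if word in toks)
    return exact / total, anywhere / total
```

By construction wpp_any is always at least wpp_exact; a large gap between the two signals a word whose position, rather than identity, is uncertain.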
Below is the list of the external features used:
– Proper Name: indicates whether a word is a proper name (similar binary features are extracted to know whether a token is Numeric, Punctuation or a Stop Word).
– Unknown Stem: indicates whether the stem of the considered word is known or not.
– Number of Word/Stem Occurrences: counts the occurrences of a word/stem in the sentence.
– Alignment context features (Alignment Features):
  • Source alignment context features: the combinations of the target word, the source word (with which it is aligned), and one source word before and one source word after (left and right contexts, respectively).
  • Target alignment context features: the combinations of the source word, the target word (with which it is aligned), and one target word before and one target word after.
– Longest Target (or Source) N-gram Length: we seek the length (n+1) of the longest left sequence ending at the current word (w_i) and known by the language model (LM) concerned (source and target sides). For example, if the longest left sequence w_{i−2}, w_{i−1}, w_i appears in the target LM, the longest target n-gram value for w_i will be 3. This value ranges from 0 to the maximum order of the LM concerned. We also extract a redundant feature called Backoff Behaviour Target.
– The target word's constituent label (Constituent Label) and its depth in the constituent tree (Distance to Root) are extracted using a syntactic parser.
– Target Polysemy Count: we extract the polysemy count, which is the number of meanings of a word in a given language.
– Occurrences in Google Translate and Occurrences in Bing Translator⁶: we (optionally) test the presence of the target word of the translation hypothesis in on-line translations given respectively by Google Translate and Bing Translator.

A very similar feature set was used for a simple WCE_MT task (English-Spanish MT, WMT 2013 and 2014 quality estimation shared tasks) and obtained very good performance [17]. This experience of participating in the WCE shared tasks in 2013 and 2014 led us to the following observation: while feature processing is very important to achieve good performance, it requires calling a set of heterogeneous NLP tools (for lexical, syntactic and semantic analyses). Thus, we recently proposed to unify the feature processing, together with the call of machine learning algorithms, in order to facilitate the design of confidence estimation systems. The proposed open-source toolkit (written in Python and made available on github⁷) integrates standard as well as in-house features that have proven useful for WCE (based on our experience in WMT 2013 and 2014).

In this paper, we use only Conditional Random Fields [12] (CRFs) as our machine learning method, with the WAPITI toolkit [15], to train our WCE estimator based on MT features.

4 Experimental setup

4.1 Post-editions and quality labels for MT

For a French-English translation task, we used our SMT system to obtain translation hypotheses for 10,881 source sentences taken from news corpora of the WMT (Workshop on Machine Translation) evaluation campaigns (from 2006 to 2010). Post-editions were obtained from non-professional translators using a crowdsourcing platform. More details on the baseline SMT system can be found in [22] and more details on the post-edited corpus in [23]. It is worth mentioning, however, that a subset (311 sentences) of these collected post-editions was assessed by a professional translator and 87.1% of the post-editions were judged to improve the hypothesis.
6. Using this kind of feature is controversial; however, we observed that such features are available in general use-case scenarios, so we decided to include them in our experiments. Contrastive results without these two features are also given later on.
7. http://github.com/besacier/WCE-LIG
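The Longest Target N-gram Length feature described above can be sketched as follows, under the assumption that the LM can be queried as a set of known n-grams (real implementations would query an ARPA/KenLM-style model instead; the helper name is illustrative).

```python
# Illustrative sketch of the "Longest Target (or Source) N-gram Length"
# feature: the length of the longest LM-known left sequence ending at w_i,
# ranging from 0 up to the LM order.

def longest_ngram_length(tokens, i, known_ngrams, max_order=3):
    """known_ngrams: set of word tuples assumed to be known by the LM."""
    best = 0
    for n in range(1, max_order + 1):
        if i - n + 1 < 0:
            break
        ngram = tuple(tokens[i - n + 1:i + 1])
        if ngram in known_ngrams:
            best = n
    return best
```

A value of 0 (the word itself is unknown to the LM) is exactly the situation the redundant Backoff Behaviour feature also captures from the decoder side.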
Table 2 Example of training labels obtained using TERp-A.

  Reference:        The consequence of the fundamentalist
  Labels:               S           S
  Hyp After Shift:  The result      of the hard-line

  Reference:        movement also has its importance .
  Labels:           Y        I        D   P
  Hyp After Shift:  trend is also    important .
Then, the word label setting for WCE was done using the TERp-A toolkit [26]. Table 2 illustrates the labels generated by TERp-A for one hypothesis/post-edition pair. Each word or phrase in the hypothesis is aligned to a word or phrase in the post-edition with different types of edit: "I" (insertions), "S" (substitutions), "T" (stem matches), "Y" (synonym matches), and "P" (phrasal substitutions). The lack of a symbol indicates an exact match and is replaced by "E" thereafter. We do not consider the words marked with "D" (deletions), since they appear only in the reference. However, later on, we will train binary classifiers (good/bad), so we re-categorize the obtained 6-label set into a binary set: E, T and Y belong to the good (G) category, whereas S, P and I belong to the bad (B) category.

The dev set and tst set of this corpus were recorded by French native speakers. Each sentence was uttered by 3 speakers, leading to 2643 and 4050 speech recordings for the dev set and tst set, respectively. For each speech utterance, a quintuplet containing the ASR output (f_hyp), the verbatim transcript (f_ref), the English text translation output (e_hyp_mt), the speech translation output (e_hyp_slt) and the post-edition of the translation (e_ref) was made available. This corpus is available on a github repository. More details are given in Table 3. The total length of the dev and tst speech corpora is 16h52, since some utterances were pretty long.

Table 3 Details on our dev and test corpora for SLT.

  Corpus  #sentences  #utterances  #speakers             duration
  dev     881         2643         15 (9 women + 6 men)  5h51
  tst     1350        4050         –                     –
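The 6-label to binary re-categorization described above can be sketched directly:

```python
# Minimal sketch of the re-categorization of TERp-A edit labels into binary
# good/bad labels. "D" (deletion) tokens exist only on the reference side,
# so they produce no label on the hypothesis.

GOOD, BAD = "G", "B"
LABEL_MAP = {"E": GOOD, "T": GOOD, "Y": GOOD,   # exact/stem/synonym matches
             "S": BAD, "P": BAD, "I": BAD}      # substitutions/phrasal subs/insertions

def binarize(terpa_labels):
    return [LABEL_MAP[lab] for lab in terpa_labels if lab != "D"]
```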
4.2 ASR systems

To obtain the ASR output (f_hyp), we built a French ASR system based on the KALDI toolkit [24]. Acoustic models are trained using several corpora (ESTER, REPERE, ETAPE and BREF120) representing more than 600 hours of transcribed French speech.

The baseline GMM system is based on mel-frequency cepstral coefficient (MFCC) acoustic features (13 coefficients expanded with delta and double-delta features and energy: 40 features) with various feature transformations, including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT). The GMM acoustic model produces initial phoneme alignments of the training data set for the subsequent DNN acoustic model training.

The speech transcription process is carried out in two passes: an automatic transcript is generated with a GMM-HMM model of 43,182 states and 250,000 Gaussians. Then, the word-graph outputs obtained during the first pass are used to compute an fMLLR-SAT transform for each speaker. The second pass is performed using a DNN acoustic model trained on acoustic features normalized with the fMLLR matrix. CD-DNN-HMM acoustic models are trained (43,182 context-dependent states) using the GMM-HMM topology.

We use two 3-gram language models trained on the French ESTER corpus [8] as well as on the French Gigaword corpus (vocabulary sizes are respectively 62k and 95k). The ASR systems' LM weight parameters are tuned through WER on the dev corpus. Details on these two language models can be found in Table 4.

Table 4 Details on the language models (LM) used in our two ASR systems.

  LM            1-gram  2-grams  3-grams
  small (ASR1)  62K     1M       59M
  big (ASR2)    95K     49M      301M

In our experiments, we propose two ASR systems based on the previously described language models. The first system (ASR1) uses the small language model, allowing a fast ASR system (about 2x real time), while in the second system (ASR2) lattices are rescored with the big language model (about 10x real time) during a third pass.

Table 5 presents the performance obtained by the two ASR systems described above.
Table 5 ASR performance (WER) on our dev and test sets for the two different ASR systems.

  Task  dev set  tst set
  ASR1  21.86%   17.37%
  ASR2  16.90%   12.50%
These WER values may appear rather high for the task (transcribing read news). A deeper analysis shows that these news items contain many foreign named entities, especially in our dev set. This part of the data is extracted from French media dealing with the European economy in the EU. This could also explain why the scores differ significantly between the dev and test sets. In addition, automatic post-processing is applied to the ASR output in order to match the requirements of standard input for machine translation.

4.3 SMT System

We used the moses phrase-based translation toolkit [11] to translate French ASR output into English (e_hyp). This medium-size system was trained using a subset of the data provided for the IWSLT 2012 evaluation [7]: the Europarl, TED and News-Commentary corpora. The total amount is about 60M words. We used an adapted target language model trained on specific data (News Crawled corpora) similar to our evaluation corpus (see [22]). This standard SMT system is used in all experiments reported in this paper.

4.4 Obtaining quality assessment labels for SLT

After building an ASR system, we have a new element of our desired quintuplet: the ASR output f_hyp. It is the noisy version of our already available verbatim transcripts, called f_ref. This ASR output (f_hyp) is then translated by the exact same SMT system [22] mentioned in subsection 4.3. This new output translation is called e_hyp_slt, and it is a degraded version of e_hyp_mt (the translation of f_ref).

At this point, a strong assumption we made has to be revealed: we re-used the post-editions obtained from the text translation task (called e_ref) to infer the quality (G, B) labels of our speech translation output e_hyp_slt. The word label setting for WCE is again done using the TERp-A toolkit [26] between e_hyp_slt and e_ref.
This assumption, namely that the initial MT post-editions can also be used to infer the labels of an SLT task, is reasonable in view of the results (later presented in Tables 8 and 9), which show that there is not a huge difference between the MT and SLT performance (evaluated with BLEU).

The remark above is important, and this is what makes the value of this corpus. For instance, other corpora such as the TED corpus compiled by LIUM also contain a quintuplet with ASR output, verbatim transcript, MT output, SLT output and target translation. But there are two main differences: first, their target translation is a manual translation of the prior subtitles, so it is not a post-edition of an automatic translation (and we have no guarantee that the good/bad labels extracted from it would be reliable for WCE training and testing); secondly, in our corpus, each sentence is uttered by 3 different speakers, which introduces speaker variability into the database and allows us to deal with different ASR outputs for a single source sentence.

4.5 Final corpus statistics

The final corpus obtained is summarized in Table 6, where we also clarify how the WCE labels were obtained. For the test set, we now have all the data needed to evaluate WCE for 3 tasks:
– ASR: extract good/bad labels by calculating WER between f_hyp and f_ref,
– MT: extract good/bad labels by calculating TERp-A between e_hyp_mt and e_ref,
– SLT: extract good/bad labels by calculating TERp-A between e_hyp_slt and e_ref.

Table 6 Overview of our post-edition corpus for SLT.

  Data       dev utt  test utt  method to obtain WCE labels
  f_ref      881      1350      –
  f_hyp      2643     4050      wer(f_hyp, f_ref)
  e_hyp_mt   881      1350      terpa(e_hyp_mt, e_ref)
  e_hyp_slt  2643     4050      terpa(e_hyp_slt, e_ref)
  e_ref      881      1350      –
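The "wer(f_hyp, f_ref)" labeling of the ASR task can be sketched as follows. This is an illustration rather than the toolchain used in the paper: Python's difflib matching-block alignment stands in for a true WER (edit-distance) alignment; hypothesis words matched to the reference are tagged good, substituted or inserted words are tagged bad (deletions leave no hypothesis word to tag).

```python
# Illustrative sketch of extracting good/bad labels for ASR hypothesis words
# from an alignment against the verbatim transcript.

from difflib import SequenceMatcher

def asr_labels(hyp_tokens, ref_tokens):
    labels = ["B"] * len(hyp_tokens)          # default: bad (sub/ins)
    matcher = SequenceMatcher(a=hyp_tokens, b=ref_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.a, block.a + block.size):
            labels[k] = "G"                   # matched word: good
    return labels
```

On the first transcript of Table 7 ("comme notre cerveau chauffe" against "quand notre cerveau chauffe"), this sketch yields B G G G, matching the labels shown there.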
Table 7 gives an example of the quintuplet available in our corpus. One transcript (f_hyp1) has 1 error while the other one (f_hyp2) has 4. This leads to 2 B labels (e_hyp_slt1) and 4 B labels (e_hyp_slt2), respectively, in the speech translation output, while e_hyp_mt has only one B label.

Table 7 Example of quintuplet with associated labels.

  f_ref        quand notre cerveau chauffe
  f_hyp1       comme notre cerveau chauffe
  labels ASR1  B G G G
  f_hyp2       qu' entre serbes au chauffe
  labels ASR2  B B B B G
  e_hyp_mt     when our brains chauffe
  labels MT    G G G B
  e_hyp_slt1   as our brains chauffe
  labels SLT1  B G G B
  e_hyp_slt2   between serbs in chauffe
  labels SLT2  B B B B
  e_ref        when our brain heats up

Tables 8 and 9 summarize the baseline ASR, MT and SLT performances obtained on our corpora, as well as the distribution of good (G) and bad (B) labels inferred for both tasks. Logically, the percentage of (B) labels increases from the MT task to the SLT task under the same conditions.
Table 8 MT and SLT performances on our dev set.

  Task        ASR (WER)  MT (BLEU)  % G (good)  % B (bad)
  MT          0%         49.13%     76.93%      23.07%
  SLT (ASR1)  21.86%     26.73%     62.03%      37.97%
  SLT (ASR2)  16.90%     28.89%     63.87%      36.13%
Table 9 MT and SLT performances on our tst set.

  Task        ASR (WER)  MT (BLEU)  % G (good)  % B (bad)
  MT          0%         57.87%     81.58%      18.42%
  SLT (ASR1)  17.37%     30.89%     61.12%      38.88%
  SLT (ASR2)  12.50%     33.14%     62.77%      37.23%

5 Experiments

The word alignment between f_hyp and e_hyp is used to project the WCE scores coming from ASR to the SLT output. In all experiments reported in this paper, we evaluate the performance of our classifiers using the average of the F-measure for good labels and the F-measure for bad labels, computed with the common evaluation metrics: precision, recall and F-measure for good/bad labels. Since two ASR systems are available, F-mes1 is obtained for SLT based on ASR1 and F-mes2 is obtained for SLT based on ASR2. For the results of Table 10, the classifier is evaluated on the tst part of our corpus and trained on the dev part.
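The evaluation metric used throughout (the average of the per-class F-measures for G and B) can be sketched as:

```python
# Sketch of the evaluation metric: F-measure computed separately for the
# "G" and "B" classes, then averaged.

def f_measure(ref, hyp, label):
    tp = sum(1 for r, h in zip(ref, hyp) if r == label and h == label)
    pred = sum(1 for h in hyp if h == label)
    gold = sum(1 for r in ref if r == label)
    if tp == 0:
        return 0.0
    precision, recall = tp / pred, tp / gold
    return 2 * precision * recall / (precision + recall)

def avg_f_measure(ref, hyp):
    return (f_measure(ref, hyp, "G") + f_measure(ref, hyp, "B")) / 2
```

Averaging the two per-class F-measures prevents the majority good class (around 62-82% of the labels in Tables 8 and 9) from dominating the score.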
Table 10 WCE performance with different feature sets for the tst set (training is made on the dev set)*.

  task        WCE for ASR  WCE for ASR  WCE for SLT  WCE for SLT
  feat. type  ASR feat.    ASR feat.    MT feat.     ASR feat.
  model       p(q|x_f,f)   p(q|x_f,f)   p(q|f,e)     p(q|x_f,f)
              (CRFs)       (Boosting)                projected to e
  F-mes1      –            –            –            –
  F-mes2      –            –            –            –

*For MT feat., removing the OccurInGoogleTranslate and OccurInBingTranslate features leads to 59.40% and 58.11% for F-mes1 and F-mes2, respectively.
Concerning WCE for ASR, we observe that the F-measure decreases when the ASR WER is lower (F-mes2 < F-mes1 while WER_ASR2 < WER_ASR1). So quality assessment in ASR seems to become harder as the ASR system improves. This could be due to the fact that the ASR1 errors recovered by the bigger LM of the ASR2 system were easier to detect. In any case, this conclusion should be considered with caution, since both results (F-mes1 and F-mes2) are not directly comparable: they are evaluated on different references (the proportion of good/bad labels differs as the ASR system differs). The effect of the classifier (CRF or Boosting) is not conclusive, since CRF is better for F-mes1 and worse for F-mes2. We nevertheless decided to use CRF for all further experiments, since this is the classifier integrated in the WCE-LIG toolkit [25].

Concerning WCE for SLT, we observe that the F-measure is better using MT features rather than ASR features (quality assessment for SLT depends more on MT features than on ASR features). Again, the F-measure decreases when the ASR WER is lower (F-mes2 < F-mes1 while WER_ASR2 < WER_ASR1). For MT features, removing the OccurInGoogleTranslate and OccurInBingTranslate features leads to 59.40% and 58.11% for F-mes1 and F-mes2, respectively.

In the next subsection, we investigate whether the use of both MT and ASR features improves quality assessment for SLT.

5.2 SLT quality assessment using both MT and ASR features

We now report in Table 12 WCE-for-SLT results obtained using both MT and ASR features. More precisely, we evaluate two different approaches (combination and joint):
– The first system (WCE for SLT / MT+ASR feat.) combines the outputs of two separate classifiers based on ASR and MT features. In this approach, the ASR-based confidence score of the source is projected to the target SLT output and combined with the MT-based confidence score as shown in equation 6 (we did not tune the α coefficient and set it a priori to 0.5).
– The second system (joint feat.) trains a single WCE system for SLT, evaluating p(q | x_f, f, e) as in equation 5 using joint ASR and MT features. All ASR features are projected to the target words using automatic word alignments. However, a problem occurs when a target word does not have any source word aligned to it; in this case, we duplicate the ASR features of the previous target word. Another problem occurs when a target word is aligned to more than one source word; in that case, there are several strategies to infer the 9 ASR features: average or max over numerical values, selection or concatenation over symbolic values (for F-word and F-POS), etc. Three different variants of these strategies (shown in Table 11) are evaluated here.

Table 11 Different strategies to project ASR features to a target word when it is aligned to more than one source word. *It should be noted that F-context features are the combinations of the source word (F-word) with one source-word POS (F-POS) before and one source-word POS (F-POS) after.

  ASR Feat   Joint 1                Joint 2                Joint 3
  F-post     avg(F-post1, F-post2)  avg(F-post1, F-post2)  avg(F-post1, F-post2)
  F-log      avg(F-log1, F-log2)    avg(F-log1, F-log2)    avg(F-log1, F-log2)
  F-back     avg(F-back1, F-back2)  avg(F-back1, F-back2)  avg(F-back1, F-back2)
  F-dur      max(F-dur1, F-dur2)    max(F-dur1, F-dur2)    max(F-dur1, F-dur2)
  F-3g       max(F-3g1, F-3g2)      max(F-3g1, F-3g2)      max(F-3g1, F-3g2)
  F-alt      max(F-alt1, F-alt2)    max(F-alt1, F-alt2)    max(F-alt1, F-alt2)
  F-word     F-word1                F-word2                F-word1 F-word2
  F-POS      F-POS1                 F-POS2                 F-POS1 F-POS2
  F-context  F-context*             F-context*             F-context*
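The numeric part of these projection strategies can be sketched for the Joint 1 variant. This is an illustration under the assumption that each aligned source word carries a dict of the 9 ASR features; the averaged/maxed feature groups follow Table 11, and for the symbolic features Joint 1 is taken here as keeping the first aligned source word's value.

```python
# Illustrative sketch of the "Joint 1" projection strategy of Table 11 for a
# target word aligned to several source words: average some numeric features,
# take the max of others, keep the first source word's symbolic values.

AVG_FEATS = ("F-post", "F-log", "F-back")
MAX_FEATS = ("F-dur", "F-3g", "F-alt")
SYM_FEATS = ("F-word", "F-POS")

def merge_joint1(source_feats):
    """source_feats: non-empty list of per-source-word feature dicts."""
    merged = {}
    for f in AVG_FEATS:
        merged[f] = sum(d[f] for d in source_feats) / len(source_feats)
    for f in MAX_FEATS:
        merged[f] = max(d[f] for d in source_feats)
    for f in SYM_FEATS:
        merged[f] = source_feats[0][f]   # Joint 1: keep the first word's value
    return merged
```

Joint 2 would keep the second word's symbolic values and Joint 3 would concatenate both, everything else being identical.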
Table 12 WCE performance with combination (MT+ASR) or joint (MT, ASR) feature sets for the tst set (training is made on the dev set)*.

  task        WCE for SLT                           WCE for SLT    WCE for SLT    WCE for SLT
  feat. type  MT+ASR feat.                          Joint feat. 1  Joint feat. 2  Joint feat. 3
  model       p_ASR(q|x_f,f)^α · p_MT(q|e,f)^(1−α)  p(q|x_f,f,e)   p(q|x_f,f,e)   p(q|x_f,f,e)
  F-mes1      –                                     –              –              –
  F-mes2      –                                     –              –              –

*For Joint 1 feat., removing the OccurInGoogleTranslate and OccurInBingTranslate features leads to 59.14% and 57.75% for F-mes1 and F-mes2, respectively.
The results of Table 12 show that joint ASR and MT features do not improve WCE performance: F-mes1 and F-mes2 are slightly worse than those of Table 10 (WCE for SLT with MT features only). We also observe that the simple combination (MT+ASR) degrades WCE performance. This latter observation may be due to the different behaviors of the WCE_MT and WCE_ASR classifiers, which makes the weighted combination ineffective. Moreover, the disappointing performance of our joint classifier may be due to an insufficient training set (only 2643 utterances in dev!). Finally, removing the OccurInGoogleTranslate and OccurInBingTranslate features for Joint 1 lowered the F-measure by between 1% and 1.5%.

These observations led us to investigate the behaviour of our WCE approaches over a large range of good/bad decision thresholds, and with a new protocol in which we reverse dev and tst. So, in the next experiments of this subsection, we report WCE evaluation results obtained on dev (2643 utt.) with classifiers trained on tst (4050 utt.). Finally, the different strategies used to project ASR features when a target word is aligned to more than one source word do not lead to very different performance: we will use strategy Joint 1 in the future.

Fig. 1 Evolution of system performance (y-axis: F-mes1, ASR1) for the dev corpus (2643 utt.) along the decision-threshold variation (x-axis); training is made on 4050 utt. The curves show F_avg(all) of WCE-SLT using MT features, ASR features, MT+ASR feature sets, and joint feature sets.

While the previous tables provided WCE performance for a single point of interest (good/bad decision threshold set to 0.5), the curves of figures 1 and 2 show the full picture of our WCE systems (for SLT) using speech transcription systems ASR1 and ASR2, respectively. We observe that the classifier based on ASR features has a very different behaviour than the classifier based on MT features, which explains why their simple combination (MT+ASR) does not
00 0 .
25 0 .
50 0 .
75 1 . F avg ( all ) of WCE-SLT using MT feature F avg ( all ) of WCE-SLT using ASR feature F avg ( all ) of WCE-SLT using MT+ASR feature sets F avg ( all ) of WCE-SLT using joint feature sets 1 Fig. 2
Evolution of system performance (y-axis -
F-mes2 - ASR2) for dev corpus (2683utt) along decision threshold variation (x-axis) - training is made on 4050 utt. work very well for the default decision threshold (0.5). However, for thresholdabove 0.5, the use of both ASR and MT features is beneficial. This is interestingbecause higher thresholds improves the Fmeasure on bad labels (so improveserror detection). Both curves are similar whatever the ASR system used. Theseresults suggest that with enough development data for appropriate thresholdtuning (which we do not have for this very new task), the use of both ASRand MT features should improve error detection in speech translation (blueand red curves are above the green curve for higher decision threshold ). Wealso analyzed the F measure curves for bad and good labels separately : if weconsider, for instance ASR bad labels is equivalent (60%) for 3 systems (
Joint, MT+ASR and MT ) while the Fmeasure on good labels is 61% when using MT featuresonly, 66% when using Joint features and 68% when using
MT+ASR features.In other words, for a fixed performance on bad labels, the Fmeasure on good labels is improved using all information available (ASR and MT features).
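The threshold analysis above can be reproduced with a simple sweep. The sketch below (hypothetical helper names, standard library only) computes, for each good/bad decision threshold, the F-measure on each label class and their average:

```python
def f_measure(gold, pred, cls):
    """F1 for one class, given gold and predicted label sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def sweep(gold, p_bad, thresholds):
    """Average F-measure over good/bad labels at each decision threshold.

    gold  -- gold labels, 'good' or 'bad', one per word
    p_bad -- estimated probability that each word is 'bad'
    """
    results = {}
    for t in thresholds:
        pred = ["bad" if p >= t else "good" for p in p_bad]
        results[t] = (f_measure(gold, pred, "good")
                      + f_measure(gold, pred, "bad")) / 2
    return results
```

Raising the threshold trades F-measure on good labels against F-measure on bad labels, which is the effect visible in Figures 1 and 2.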
10. Corresponding to the optimization of the F-measure on bad labels (errors).
11. Not reported here due to space constraints.
Finally, if we focus on Joint versus MT+ASR, we notice that the range of thresholds over which performance is stable is larger for Joint than for MT+ASR.

In this section, we try to better understand the contribution of each (ASR or MT) feature by applying feature selection on our joint WCE classifier. In these experiments, we decide to keep the OccurInGoogleTranslate and OccurInBingTranslate features.

We choose the Sequential Backward Selection (SBS) algorithm, a top-down algorithm starting from a feature set noted Y_k (the set of all features) and sequentially removing the most irrelevant feature x, i.e. the one that maximizes the mean F-measure MF(Y_k − x). In our work, we iterate until the set Y_k contains only one remaining feature. Algorithm 1 summarizes the whole process.

Algorithm 1  Sequential Backward Selection (SBS) algorithm for feature selection. Y_k denotes the set of all features and x is the feature removed at each step of the algorithm.

    while size of Y_k > 1 do
        maxval ← 0
        for x ∈ Y_k do
            if maxval < MF(Y_k − x) then
                maxval ← MF(Y_k − x)
                worstfeat ← x
            end if
        end for
        remove worstfeat from Y_k
    end while

The results of the SBS algorithm can be found in Table 13, which ranks all joint features used in WCE for SLT by order of importance after applying the algorithm on dev. We can see that the SBS algorithm is not very stable and is clearly influenced by the ASR system (ASR1 or ASR2) considered in SLT. Anyway, if we focus on the features that are in the top-10 best in both cases, we find that the most relevant ones are:
–
Occur in Google Translate and Occur in Bing Translate (diagnostics from other MT systems),
– Longest Source N-gram Length and Target Backoff Behaviour (source or target N-gram features),
– Stem Context Alignment (source-target alignment feature).

We also observe that the most relevant ASR features (in bold in Table 13) are F-3g, F-POS and F-back (lexical and linguistic features), whereas ASR acoustic and graph-based features are among the worst (
F-post, F-alt, F-dur). So, in our experimental setting, it seems that MT features are more influential than ASR features. Another surprising result is the relatively low rank of word posterior probability (WPP) features, whereas we were expecting to see them among the top features (as shown in [20], where WPP Any is among the best features for WCE in MT).

Table 13  Rank of each feature according to the Sequential Backward Selection algorithm (WCE for SLT task, joint ASR+MT features); feature selection applied on the dev corpus for both ASR1 and ASR2. ASR features are in bold.

    ASR1  ASR2  Feature                      ASR1  ASR2  Feature
      1     –   –                             18     –   –
      2     –   … N-gram Length               19    30   Proper Name
      3     5   Target Backoff Behaviour      20    20   Unknown Stem
      4    22   Constituent Label             21    24   Number of Word Occurrences
      5     1   Occur in Bing Translate       22    23   F-alt
      6     –   F-3g                          23    15   Nodes
      7    16   WPP Exact                     24     8   F-log
      8     –   F-context                     25     –   –
      9     –   … N-gram Length               26     –   –
     10    12   Number of Stem Occurrences    27     6   WPP Any
     11    21   Polysemy Count - Target       28    29   POS Context Alignment
     12     3   F-POS                         29    10   F-post
     13    18   Stop Word                     30    28   Word Context Alignment
     14    25   Distance to Root              31    31   F-dur
     15    13   F-back                        32     9   Alignment Features
     16    26   WPP Min                       33    33   F-word
     17    27   Punctuation

Figure 3 and Figure 4 present the evolution of WCE performance for the dev and tst corpora when feature selection using the SBS algorithm is made on dev, for ASR1 and ASR2 respectively; this means that feature selection is done on dev with classifiers trained on tst. After that, the best feature subsets (using 33, 32, 31, ..., down to a single feature) are applied on the tst corpus (with classifiers trained on dev). On both figures, we observe that only half of the features contribute to the WCE process, since the best performances are observed with only 10 to 15 features. We also notice that optimal WCE performance is not necessarily obtained with the full feature set: it can be obtained with a subset of it.
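As a sketch, Algorithm 1 can be implemented in a few lines. Here `mf_score` stands for the mean F-measure obtained by retraining and evaluating the classifier on a feature subset (a hypothetical callback, since the actual training pipeline is outside this snippet):

```python
def sbs(features, mf_score):
    """Sequential Backward Selection (Algorithm 1).

    Repeatedly removes the feature whose removal maximizes the mean
    F-measure mf_score(subset), until one feature remains. Returns a
    ranking: index 0 is the most important feature (removed last).
    """
    remaining = list(features)
    removal_order = []
    while len(remaining) > 1:
        # the "worst" feature is the one whose removal hurts least
        worst = max(remaining,
                    key=lambda x: mf_score([f for f in remaining if f != x]))
        remaining.remove(worst)
        removal_order.append(worst)
    # survivor first, then features in reverse order of removal
    return remaining + removal_order[::-1]
```

Note that the loop evaluates `mf_score` O(n²) times for n features, which is why such selection is usually run once on a development set.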
12. Three data sets would have been needed to (a) train classifiers, (b) apply feature selection, and (c) evaluate WCE performance. Since we only have a dev and a tst set, we found this procedure acceptable.
Fig. 3  Evolution of WCE performance for dev (features selected) and tst corpora when feature selection using the SBS algorithm is made on dev (ASR1). x-axis: number of best features ranked by the feature selection process on the dev corpus; y-axis: F_avg(all); curves: selection on dev, application on tst.

7.1 Conclusion

A specific corpus, distributed to the research community, was built for this purpose. We formalized WCE for SLT and proposed several approaches based on several types of features: machine translation (MT) based features, automatic speech recognition (ASR) based features, as well as combined or joint features using ASR and MT information. The proposal of a unique joint classifier based on different feature types (ASR and MT features) allowed us to operate feature selection and analyze which features (from ASR or MT) are the most efficient for quality assessment in speech translation. Our conclusion is that MT features remain the most influential, while ASR features can bring interesting complementary information. In all our experiments, we systematically evaluated with two ASR systems that have different performance, in order to analyze the behavior of our quality assessment algorithms at different levels of word error rate (WER). This allowed us to observe that WCE performance decreases as the ASR system improves.

For reproducible research, the corpus is available at https://github.com/besacier/WCE-SLT-LIG, and most features and algorithms used in this paper are available through our toolkit called WCE-LIG (https://github.com/besacier/WCE-LIG; MT features already available, ASR features available soon). This package is made available on a GitHub repository under the licence GPL V3. We hope that the availability of our corpus and toolkit could lead, in the near future, to a new shared task dedicated to quality estimation for speech translation. Such a shared task could be proposed in venues such as IWSLT (International Workshop on Spoken Language Translation) or WMT (Workshop on Machine Translation), for instance.

Fig. 4  Evolution of WCE performance for dev (features selected) and tst corpora when feature selection using the SBS algorithm is made on dev (ASR2). Axes and curves as in Fig. 3.

7.2 SLT redecoding using WCE

A direct application of this work is the use of WCE labels to re-decode speech translation graphs and (hopefully) improve speech translation performance. Preliminary results were already obtained and recently published by the authors of this paper [4]. The main idea is to carry out a second speech translation pass that considers every word and its quality assessment label, as shown in Equation (4). The speech translation graph is re-decoded following this principle: words labeled as good in the search graph should be "rewarded" by reducing their cost; on the contrary, those labeled as bad should be "penalized". To illustrate this direct application of our work, we present examples of speech translation hypotheses (SLT) obtained with or without graph re-decoding in Table 14 (table taken from [4]).

Example 1 illustrates a first case where re-decoding slightly improves the translation hypothesis. Analysis of the labels from the confidence estimator indicates that the words a (start of sentence) and penalty were labeled as bad here. Thus, a better hypothesis arose from the second pass, although

Table 14
Examples of French SLT hypotheses with and without graph re-decoding (taken from [4]).

Example 1
  f_ref: une démobilisation des employés peut déboucher sur une démoralisation mortifère
  f_hyp: une démobilisation des employés peut déboucher sur une démoralisation mort y faire
  e_hyp (baseline): a demobilisation employees can lead to a penalty demoralisation
  e_hyp (with re-decoding): a demobilisation of employees can lead to a demoralization death
  e_ref: demobilization of employees can lead to a deadly demoralization

Example 2
  f_ref: celui-ci a indiqué que l'intervention s'était parfaitement bien déroulée et que les examens post-opératoires étaient normaux
  f_hyp: celui-ci a indiqué que l'intervention c'était parfaitement bien déroulés, et que les examens post opératoire étaient normaux.
  e_hyp (baseline): it has indicated that the speech that was well conducted, and that the tests were normal post route
  e_hyp (with re-decoding): he indicated that the intervention is very well done, and that the tests after operating were normal
  e_ref: he indicated that the operation went perfectly well and the post-operative tests were normal

Example 3
  f_ref: general motors repousse jusqu'en janvier le plan pour opel
  f_hyp: general motors repousse jusqu'en janvier le plan pour open
  e_hyp (baseline): general motors postponed until january the plan to open
  e_hyp (with re-decoding): general motors puts until january terms to open
  e_ref: general motors postponed until january the plan for opel

the transcription error could not be recovered. In example 2, the confidence estimator labeled as bad the following word sequences: "it has", "speech that was" and "post route". A better translation hypothesis is found after re-decoding (correct pronoun, better quality at the end of the sentence). Finally, example 3 shows a case where, this time, the end of the first-pass translation deteriorated after re-decoding.
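The reward/penalty principle behind these second-pass hypotheses can be sketched as a simple edge-cost adjustment. This is a hypothetical illustration; the actual weighting scheme of [4] may differ, and `bonus` is an assumed tunable constant, not a value from the paper:

```python
def adjust_cost(cost, wce_label, bonus=1.0):
    """Adjust the cost of a word edge in the speech translation graph
    according to its WCE label: reward 'good' words by lowering their
    cost, penalize 'bad' words by raising it (hypothetical sketch)."""
    if wce_label == "good":
        return cost - bonus   # reward: decoding prefers keeping this word
    return cost + bonus       # penalty: decoding is pushed away from it
```

In a real decoder, `bonus` would be tuned on development data, since over-penalizing can push the search toward even worse paths, as example 3 shows.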
Analysis of the confidence estimator output shows that the phrase "to open" was (correctly) labeled as bad, but the re-decoding gave rise to an even worse hypothesis. The reason is that the system could not recover the named entity opel, since this word was not in the speech translation graph.

7.3 Other perspectives

In addition to re-decoding SLT graphs, our quality assessment system can be used in interactive speech translation scenarios, such as news or lecture subtitling, to improve human translator productivity by giving him/her feedback on automatic transcription and translation quality. Another application would be the adaptation of our WCE system to interactive speech-to-speech translation scenarios, where feedback on the transcription and translation modules is needed to improve communication. On these latter subjects, it would also be nice to move from a binary (good or bad labels) to a 3-class decision problem (good, asr-error, mt-error). The outcome material of this paper (corpus, toolkit) can definitely be used to address such a new problem.

References
1. Asadi, A., Schwartz, R., Makhoul, J.: Automatic detection of new words in a large vocabulary continuous speech recognition system. In: Proc. of the International Conference on Acoustics, Speech and Signal Processing (1990)
2. Bach, N., Huang, F., Al-Onaizan, Y.: Goodness: A method for measuring machine translation confidence. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 211–219. Portland, Oregon (2011)
3. Besacier, L., Lecouteux, B., Luong, N.Q., Hour, K., Hadjsalah, M.: Word confidence estimation for speech translation. In: Proceedings of The International Workshop on Spoken Language Translation (IWSLT). Lake Tahoe, USA (2014)
4. Besacier, L., Lecouteux, B., Luong, N.Q., Le, N.T.: Spoken language translation graphs re-decoding using automatic quality assessment. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). Scottsdale, Arizona, United States (2015). DOI 10.1109/ASRU.2015.7404804. URL https://hal.archives-ouvertes.fr/hal-01289158
5. Bicici, E.: Referential translation machines for quality estimation. In: Proceedings of the Eighth Workshop on Statistical Machine Translation, pp. 343–351. Association for Computational Linguistics, Sofia, Bulgaria (2013)
6. Fayolle, J., Moreau, F., Raymond, C., Gravier, G., Gros, P.: CRF-based combination of contextual features to improve a posteriori word-level confidence measures. In: Interspeech (2010)
7. Federico, M., Cettolo, M., Bentivogli, L., Paul, M., Stüker, S.: Overview of the IWSLT 2012 evaluation campaign. In: Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT) (2012)
8. Galliano, S., Geoffrois, E., Gravier, G., Bonastre, J.F., Mostefa, D., Choukri, K.: Corpus description of the ESTER evaluation campaign for the rich transcription of French broadcast news. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pp. 315–320 (2006)
9. Han, A.L.F., Lu, Y., Wong, D.F., Chao, L.S., He, L., Xing, J.: Quality estimation for machine translation using the joint method of evaluation criteria and statistical modeling. In: Proceedings of the Eighth Workshop on Statistical Machine Translation, pp. 365–372. Association for Computational Linguistics, Sofia, Bulgaria (2013)
10. Kemp, T., Schaaf, T.: Estimating confidence using word lattices. In: Proc. of the European Conference on Speech Communication and Technology, pp. 827–830 (1997)
11. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: Open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 177–180. Prague, Czech Republic (2007)
12. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of ICML-01, pp. 282–289 (2001)
13. Langlois, D., Raybaud, S., Smaïli, K.: LORIA system for the WMT12 quality estimation shared task. In: Proceedings of the Seventh Workshop on Statistical Machine Translation, pp. 114–119 (2012)
14. Laurent, A., Camelin, N., Raymond, C.: Boosting bonsai trees for efficient features combination: application to speaker role identification. In: Interspeech (2014)
15. Lavergne, T., Cappé, O., Yvon, F.: Practical very large scale CRFs. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 504–513 (2010)
16. Lecouteux, B., Linarès, G., Favre, B.: Combined low level and high level features for out-of-vocabulary word detection. In: Interspeech (2009)
17. Luong, N.Q., Besacier, L., Lecouteux, B.: Word confidence estimation and its integration in sentence quality estimation for machine translation. In: Proceedings of The Fifth International Conference on Knowledge and Systems Engineering (KSE 2013). Hanoi, Vietnam (2013)
18. Luong, N.Q., Besacier, L., Lecouteux, B.: LIG system for word level QE task at WMT14. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 335–341. Baltimore, Maryland, USA (2014)
19.
Luong, N.Q., Besacier, L., Lecouteux, B.: Word confidence estimation for SMT N-best list re-ranking. In: Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT) during EACL. Gothenburg, Sweden (2014). URL http://hal.inria.fr/hal-00953719
20. Luong, N.Q., Besacier, L., Lecouteux, B.: Towards accurate predictors of word quality for machine translation: Lessons learned on French–English and English–Spanish systems. Data and Knowledge Engineering, p. 11 (2015)
21. Luong, N.Q., Lecouteux, B., Besacier, L.: LIG system for WMT13 QE task: Investigating the usefulness of features in word confidence estimation for MT. In: Proceedings of the Eighth Workshop on Statistical Machine Translation, pp. 386–391. Association for Computational Linguistics, Sofia, Bulgaria (2013)
22. Potet, M., Besacier, L., Blanchon, H.: The LIG machine translation system for WMT 2010. In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR (WMT2010). Uppsala, Sweden (2010)
23. Potet, M., Esperança-Rodier, E., Besacier, L., Blanchon, H.: Collection of a large database of French-English SMT output corrections. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC). Istanbul, Turkey (2012)
24. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society (2011). IEEE Catalog No.: CFP11SRW-USB
25. Servan, C., Le, N.T., Luong, N.Q., Lecouteux, B., Besacier, L.: An open source toolkit for word-level confidence estimation in machine translation. In: The 12th International Workshop on Spoken Language Translation (IWSLT'15). Da Nang, Vietnam (2015). URL https://hal.archives-ouvertes.fr/hal-01244477