Fine-grained evaluation of German-English Machine Translation based on a Test Suite
Vivien Macketanz, Eleftherios Avramidis, Aljoscha Burchardt, Hans Uszkoreit
German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
Abstract
We present an analysis of 16 state-of-the-art MT systems on German-English based on a linguistically-motivated test suite. The test suite has been devised manually by a team of language professionals in order to cover a broad variety of linguistic phenomena that MT often fails to translate properly. It contains 5,000 test sentences covering 106 linguistic phenomena in 14 categories, with an increased focus on verb tenses, aspects and moods. The MT outputs are evaluated in a semi-automatic way through regular expressions that focus only on the part of the sentence that is relevant to each phenomenon. Through our analysis, we are able to compare systems based on their performance on these categories. Additionally, we reveal strengths and weaknesses of particular systems and we identify grammatical phenomena where the overall performance of MT is relatively low.
Introduction

The evaluation of Machine Translation (MT) has mostly relied on methods that produce a numerical judgment of the correctness of a test set. These methods are either based on the human perception of the correctness of the MT output (Callison-Burch et al., 2007), or on automatic metrics that compare the MT output with a reference translation (Papineni et al., 2002; Snover et al., 2006). In both cases, the evaluation is performed on a test set containing articles or small documents that are assumed to be a random representative sample of texts in the domain. Moreover, this kind of evaluation aims at producing average scores that express a generic sense of correctness for the entire test set and at comparing the performance of several MT systems.

Although this approach has proven valuable for MT development and the assessment of new methods and configurations, it has been suggested that a more fine-grained evaluation, associated with linguistic phenomena, may lead to a better understanding of the errors, and also of the efforts required to improve the systems (Burchardt et al., 2016). This is done through the use of test suites, which are carefully devised corpora whose test sentences include the phenomena that need to be tested. In this paper we present the fine-grained evaluation results of 16 state-of-the-art MT systems on German-English, based on a test suite covering 106 German grammatical phenomena, with a focus on verb-related phenomena.
Related Work

The use of test suites in the evaluation of NLP applications (Balkan et al., 1995), and of MT systems in particular (King and Falkedal, 1990; Way, 1991), was already proposed in the 1990s. For instance, test suites were employed to evaluate state-of-the-art rule-based systems (Heid and Hildenbrand, 1991). The idea of using test suites for MT evaluation was revived recently with the emergence of Neural MT (NMT), as the produced translations reached significantly better levels of quality, leading to a need for more fine-grained qualitative observations. Recent works include test suites that focus on the evaluation of particular linguistic phenomena (e.g. pronoun translation; Guillou and Hardmeier, 2016) or more generic test suites that aim at comparing different MT technologies (Isabelle et al., 2017; Burchardt et al., 2017) and Quality Estimation methods (Avramidis et al., 2018). These works differ in the number of phenomena and the language pairs they cover.

This paper extends the work presented in Burchardt et al. (2017) by including more test sentences and better coverage of phenomena. In contrast to that work, which applied the test suite in order to compare three different types of MT systems (rule-based, phrase-based and NMT), the evaluation in the publication at hand has been applied to 16 state-of-the-art systems, the majority of which follow the NMT approach.

The Test Suite

The test suite is a manually devised test set, aiming to investigate MT performance against a wide range of linguistic phenomena and other qualitative requirements (e.g. punctuation). It contains a set of sentences in the source language, written or chosen by a team of linguists and professional translators with the aim to cover as many linguistic phenomena as possible, and particularly the ones that MT often fails to translate properly. Each sentence of the test suite serves as a paradigm for investigating exactly one particular phenomenon. Given the test sentences, the evaluation tests the ability of the MT systems to properly translate the associated phenomena. The phenomena are organized in categories (e.g. although each verb tense is tested separately with the respective test sentences, the results for all tenses are aggregated in the broader category of verb tense/aspect/mood).

Our test suite contains about 5,000 test sentences, covering 106 phenomena organized in 14 categories. For each phenomenon, at least 20 test sentences were devised to allow better generalizations about the capabilities of the MT systems. With 88%, the majority of the test suite covers verb phenomena, but other categories, such as negation, long distance dependencies, valency or multi-word expressions, are included as well. A full list of the phenomena and their categories can be seen in Table 1. An example list of test sentences with correct and incorrect translations is available on GitHub (https://github.com/DFKI-NLP/TQ_AutoTest).
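To make this organization concrete, the following is a minimal Python sketch of how a single test item of such a suite could be represented. The class and field names are our own illustration (the paper does not publish a schema), and the regular expressions are simplified stand-ins of the kind described below.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TestItem:
    """One test sentence of the suite, targeting exactly one phenomenon."""
    source: str       # German test sentence
    phenomenon: str   # one of the 106 phenomena, e.g. "question tag"
    category: str     # one of the 14 categories, e.g. "Function word"
    # Regexes that match correct translations of the relevant sentence part.
    positive: List[str] = field(default_factory=list)
    # Regexes that match known incorrect translations.
    negative: List[str] = field(default_factory=list)

# Illustrative item for the negation category, using Example (2) from the
# results section below; the regexes are hypothetical.
item = TestItem(
    source="Tim wäscht seine Kleidung nie selber.",
    phenomenon="negation particle",
    category="Negation",
    positive=[r"\bnever washes\b"],
    negative=[r"\bis washing\b"],
)
```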
The test suite was constructed in a way that allows a semi-automatic evaluation method, in order to assist the efficient evaluation of many translation systems. A simplified sketch of the test suite construction is shown in Figure 1. First (Figure 1, stage a), the linguist chooses or writes the test sentences in the source language with the help of translators. The test sentences are manually written or chosen, based on whether their translation has demonstrated, or is suspected to demonstrate, MT errors of the respective error categories. Test sentences are selected from various parallel corpora or drawn from existing resources, such as the TSNLP Grammar Test Suite (Lehmann et al., 1996) and online lists of typical translation errors. Then (stage b) the test sentences are passed as input to some sample MT systems and their translations are fetched.

Based on the output of the sample MT systems and the types of the errors, the linguist devises a set of hand-crafted regular expressions (stage c), while the translator ensures the correctness of the expressions. The regular expressions are used to automatically check whether the output correctly translates the part of the sentence that is related to the phenomenon under inspection. There are regular expressions that match correct translations (positive) as well as regular expressions that match incorrect translations (negative).

[Figure 1: Example of the preparation and application of the test suite for one test sentence]

During the evaluation phase, the test sentences are given to several translation systems and their outputs are acquired (stage d). The regular expressions are applied to the MT outputs (stage e) to automatically check whether the MT outputs translate the particular phenomenon properly. An MT output is marked as correct (pass) if it matches a positive regular expression. Similarly, it is marked as incorrect (fail) if it matches a negative regular expression. In cases where the MT output matches neither a positive nor a negative regular expression, the automatic evaluation flags an uncertain decision (warning). Then, the results of the automatic annotation are given to a linguist or a translator who manually checks the warnings (stage f) and optionally refines the regular expressions in order to cover similar future cases. It is also possible to add full sentences as valid translations, instead of regular expressions. In this way, the test suite grows constantly, whereas the required manual effort is reduced over time.

Finally, for every system we calculate the phenomenon-specific translation accuracy:

\[
\text{accuracy} = \frac{\text{correct translations}}{\text{sum of test sentences}}
\]

that is, the number of test sentences for the phenomenon that were translated properly, divided by the number of all test sentences for this phenomenon. This also allows us to perform comparisons among the systems, focusing on particular phenomena. The significance of every comparison between two systems is confirmed with a two-tailed Z-test with α = 0.05, testing the null hypothesis that the difference between the two respective percentages is zero.
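The decision logic and the statistics just described can be summarized in a short sketch. This is not the actual TQ-AutoTest implementation; the function names are ours, and we assume, as one plausible convention, that a positive match takes precedence if an output happens to match both a positive and a negative expression.

```python
import re
from math import erf, sqrt

def classify(output, positive, negative):
    """Label one MT output: 'pass' if a positive (correct-translation)
    regex matches, 'fail' if a negative (known-error) regex matches,
    and 'warning' if neither matches, so that a human must decide."""
    if any(re.search(p, output) for p in positive):
        return "pass"
    if any(re.search(n, output) for n in negative):
        return "fail"
    return "warning"

def accuracy(labels):
    """Phenomenon-specific translation accuracy:
    correct translations / sum of test sentences."""
    return sum(l == "pass" for l in labels) / len(labels)

def z_test_p_value(correct_a, n_a, correct_b, n_b):
    """Two-tailed two-proportion Z-test of the null hypothesis that the
    accuracies of systems A and B are equal; returns the p-value."""
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (correct_a / n_a - correct_b / n_b) / se
    # Standard normal CDF expressed via the error function.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

For instance, with positive=[r"never washes"] and negative=[r"is washing"], the Online-G output in Example (2) below would be labeled fail.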
Experiment Setup

The evaluation of the MT outputs was performed with TQ-AutoTest (Macketanz et al., 2018), a tool that organizes the test items in a database, allowing the application of the regular expressions to new MT outputs. For the purpose of this study, we compared the 16 systems submitted to the test suite task of the EMNLP 2018 Conference on Machine Translation (WMT18) for German→English. At the time this paper was written, the creators of 11 of these systems had made their development characteristics available, 10 of them stating that they follow an NMT approach and one of them a method combining phrase-based SMT and NMT.

After the application of the existing regular expressions to the outputs of these 16 systems, there was a considerable amount of warnings (i.e. uncertain judgments), varying between 10% and 45% per system. A manual inspection of the outputs was consequently performed (Figure 1, stage f) by a linguist, who invested approximately 80 hours of manual annotation. A small-scale manual inspection of the automatically assigned pass and fail labels indicated that the percentage of erroneously assigned labels is negligible. The manual inspection therefore focused on warnings and reduced their amount to less than 10% per system. In particular, 32.1% of the original system outputs ended in warnings after the application of the regular expressions, whereas the manual inspection and the refining of the regular expressions additionally validated 14,000 of these system outputs, i.e. 15.7% of the original test suite.

In order to analyze the results with respect to the existence of warnings, we performed two different types of analysis:

1. Remove all sentences from the overall comparison that have even one warning for one system and calculate the translation accuracy on the remaining segments. The unsupervised systems are completely excluded from this analysis in order to keep the sample big enough. This way, all systems are compared on the same set of segments.

2. Remove the sentences with warnings per system and calculate the translation accuracy on the remaining segments. The unsupervised systems can be included in this analysis. In this way, the systems are not compared on the same set of segments, but more segments can be included altogether.

Results

The final results of the evaluation can be seen in Table 2, based on Analysis 1 (which, as explained above, does not take into account the two unsupervised systems), and Table 3, based on Analysis 2. Results for verb-related phenomena based on Analysis 1 are detailed in Tables 4 and 5, and other indicative phenomena in Table 6. The filtering prior to Analysis 1 left a small number of test sentences per category, which limits the possibility to identify significant differences between the systems. Analysis 2 allows better testing of each system's performance, but observations need to be treated with caution, since the systems are tested against different test sentences and therefore the comparisons between them are not as expressive as in Analysis 1. Moreover, the interpretability of the overall averages of these tables is limited, as the distribution of the test sentences and the linguistic phenomena does not represent an objective notion of quality.

We have calculated the mean values per system as a non-weighted average and as a weighted average. The non-weighted average was calculated by dividing the sum of all correct translations by the sum of all test sentences. The weighted average for a system was computed by taking the mean of the averages per category. We have not calculated statistical significances for the weighted averages, as these are less meaningful due to the dominance of the verb tense/aspect/mood category.
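As an illustration of the two analyses and the two averages, the following sketch recomputes them from pass/fail/warning labels. The data layout and names are our own, not those of the authors' tooling.

```python
from typing import Dict, List, Tuple

# labels[system] holds one 'pass'/'fail'/'warning' label per test sentence;
# categories[i] names the linguistic category of test sentence i.

def analysis1(labels: Dict[str, List[str]]) -> Dict[str, float]:
    """Analysis 1: discard every test sentence that yields a warning for
    at least one system, so all systems are scored on the same set."""
    n = len(next(iter(labels.values())))
    keep = [i for i in range(n)
            if all(labels[s][i] != "warning" for s in labels)]
    return {s: sum(labels[s][i] == "pass" for i in keep) / len(keep)
            for s in labels}

def analysis2(labels: Dict[str, List[str]]) -> Dict[str, float]:
    """Analysis 2: discard warnings per system, so each system is scored
    on its own (larger) set of decided test sentences."""
    result = {}
    for system, ls in labels.items():
        decided = [l for l in ls if l != "warning"]
        result[system] = sum(l == "pass" for l in decided) / len(decided)
    return result

def averages(labels: List[str], categories: List[str]) -> Tuple[float, float]:
    """Non-weighted average: all correct translations over all decided
    test sentences. Weighted average: mean of per-category accuracies."""
    decided = [(l, c) for l, c in zip(labels, categories) if l != "warning"]
    non_weighted = sum(l == "pass" for l, _ in decided) / len(decided)
    per_cat: Dict[str, List[bool]] = {}
    for l, c in decided:
        per_cat.setdefault(c, []).append(l == "pass")
    weighted = sum(sum(v) / len(v) for v in per_cat.values()) / len(per_cat)
    return non_weighted, weighted
```

Under this reading, the non-weighted average is dominated by the verb tense/aspect/mood category (88% of the suite), whereas the weighted average counts each category equally.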
The following results are based on Analysis 1. The system that achieves the highest accuracy in most linguistic phenomena, as compared to the rest of the systems, is UCAM, which is in the first significance cluster for 11 out of the 12 decisive error categories in Analysis 1 and achieves an 86.0% non-weighted average accuracy over all test sentences. UCAM obtains a significantly better performance than all other systems concerning verb tense/aspect/mood, reaching an 86.9% accuracy, 1.5% better than MLLP and NTT, which follow in this category. The different performance may be explained by the fact that UCAM differs from the other systems in that it combines several different neural models together with a phrase-based SMT system in a syntactic MBR-based scheme (Stahlberg et al., 2016). Despite its good performance on grammatical phenomena, UCAM has a very low accuracy regarding punctuation (52.9%). The system with the highest weighted average score is RWTH. Even though it reaches higher accuracies for some categories than UCAM, the differences are not statistically significant.

Another system that achieves the best accuracies in 11 out of the 12 categories is Online-A. This system performs close to the average of all systems concerning verb tense/aspect/mood, but it shows a significantly better performance on the category of punctuation (96.1%). Then, 6 systems (JHU, NTT, Online-B, Online-Y, RWTH, Ubiqus) have the best performance in the same number of categories (10 out of 12), having lost the first position in punctuation and verb tense/aspect/mood.

Two systems that have the lowest accuracies in several categories are Online-F and Online-G. Online-F has severe problems with punctuation (3.9%), since it fails to produce proper quotation marks in the output and mistranslates other phenomena, such as commas and the punctuation in direct speech (see Table 6). Online-G has the worst performance concerning verb tense/aspect/mood (45.8%). Additionally, these two systems together demonstrate the worst performance on coordination/ellipsis and negation.

The unsupervised systems form a special category of systems trained only on monolingual corpora. Their outputs suffer from adequacy problems, often being very "creative" or very far from a correct translation. Thus, the automatic evaluation failed to check a vast amount of test sentences for these systems. Therefore, we conducted Analysis 2. As seen in Table 3, the unsupervised systems suffer mostly on MWE (11.1% - 17.4% accuracy), function words (15.7% - 21.7%), ambiguity (26.9% - 29.1%) and non-verbal agreement (38.3% - 39.6%).
Despite the significant progress in MT quality, we managed to devise test sentences indicating that the submitted systems have a mediocre performance for several linguistic categories. On average, all current state-of-the-art systems suffer mostly on punctuation (and particularly quotation marks), MWE, ambiguity and false friends, with an average accuracy of less than 64% (based on Analysis 1). Verb tense/aspect/mood, non-verbal agreement, function words and coordination/ellipsis are also far from good, with average accuracies around 75%.

The two categories verb valency and named entities/terminology cannot lead to comparisons of the performance of individual systems, since all systems achieve equal or insignificantly different performance on them. The former has an average accuracy of 81.4%, while the latter has an average accuracy of 83.4%.

We would like to present a few examples in order to provide a better understanding of the linguistic categories and the evaluation. Example (1) is taken from the category of punctuation. Among others, we test the punctuation in the context of direct speech: while in German it is introduced by a colon, in English it is introduced by a comma. In this example, the NTT system produces a correct output (therefore highlighted in boldface), whereas the other two systems produce incorrect translations with a colon.

(1) Punctuation
source: Er rief: „Ich gewinne!“
NTT: He shouted, “I win!”
Online-F: He called: “I win!”
Ubiqus: He cried: “I win!”

We may assume that these errors are attributed to the fact that punctuation is often manipulated by hand-written pre- and post-processing tools, whereas the ability of the neural architecture to properly convey the punctuation sequence has attracted little attention and is rarely evaluated properly.
Negation is one of the most important categories for meaning preservation. Two commercial systems (Online-F and Online-G) show the lowest accuracy for this category, and it is disappointing that they miss 4 out of 10 negations. In Example (2), the German negation particle "nie" should be translated as "never", but Online-G omits the whole negation. In other cases it negates the wrong element in the sentence.

(2) Negation
source: Tim wäscht seine Kleidung nie selber.
Online-B: Tim never washes his clothes himself.
Online-G: Tim is washing his clothes myself.
MWE, such as idioms or collocations, are prone to errors in MT, as they cannot be translated via their separate elements. Instead, the meaning of the expression has to be translated as a whole. Example (3) focuses on the German idiom "auf dem Holzweg sein", which can be translated as "being on the wrong track". However, a literal translation of "Holzweg" would be "wood(en) way", "wood(en) track" or "wood(en) path". As can be seen in the example, MLLP and UCAM provide a literal translation of the separate segments of the MWE rather than translating its meaning as a whole, resulting in a translation error.

(3) MWE
source: Du bist auf dem Holzweg.
MLLP: You’re on the wood track.
RWTH: You’re on the wrong track.
UCAM: You’re on the wooden path.
As mentioned above, a large part of the test suite is made up of verb-related phenomena. Therefore, we have conducted a more fine-grained analysis of the category verb tense/aspect/mood. In Table 4 we have grouped the phenomena by verb tense, while Table 5 shows the results for the verb-related phenomena grouped by verb type. Regarding the verb tenses, future II and future II subjunctive show the lowest accuracy, with a maximum accuracy of about 30%. The highest average accuracy (weighted and non-weighted) is achieved by UCAM, with 63.5% and 61.5% respectively. UCAM is the only system that is among the best-performing systems for all the verb tenses as well as for all the verb types. The second-best system on average for verb tenses and verb types is NTT. While the accuracy scores for the verb tenses range between 33.4% and 63.5%, the scores for the verb types are higher, at 45.7% - 86.9%.

Table 6 shows interesting individual phenomena with at least 15 valid test sentences. The accuracy for compounds and location is generally quite high. Other phenomena exhibit a larger range of accuracy scores, for example quotation marks, with an accuracy ranging from 0% to 94.7% among the systems. The system Online-F fails on all test sentences with quotation marks. The failure results from the system generating the quotation marks analogously to the German punctuation, e.g. introducing direct speech with a colon, as seen in Example (1). Online-F furthermore fails on all test sentences with question tags, as does NJUNMT. For the phenomenon location, on the other hand, none of the systems is significantly better than any other; they all perform similarly well, with an accuracy ranging from 86.7% to 100%. RWTH is the only system that reaches an accuracy of 100% twice in these selected phenomena.
Conclusion

We used a test suite in order to perform a fine-grained evaluation of the output of the state-of-the-art systems submitted to the shared task of WMT18. One system (UCAM), which uses a syntactic MBR-based combination of several NMT and phrase-based SMT components, stands out regarding verb-related phenomena. Additionally, two systems fail to translate 4 out of 10 negations. Generally, the submitted systems suffer on punctuation (and particularly quotation marks, with the exception of Online-A), MWE, ambiguity and false friends, and also on translating the German future II tense. Six systems have approximately the same performance in a large number of linguistic categories.

Fine-grained evaluation would ideally provide the potential to identify particular flaws in the development of the translation systems and to suggest specific modifications. Unfortunately, at the time this paper was written, few details about the development characteristics of the respective systems were available, so we could provide only a few assumptions based on our findings. The differences observed may be attributed to the design of the models, to pre- and post-processing tools, to the amount, the type and the filtering of the corpora, and to other development decisions. We believe that the findings are still useful for the original developers of the systems, since they are aware of all their technical decisions and have the technical possibility to better inspect the causes of specific errors.
Acknowledgments
This work was supported by XXX through the project Open Source Lab and by the German Federal Ministry of Education and Research (BMBF) through the project DEEPLEE (01IW17001). Special thanks to Arle Lommel and Kim Harris, who helped with their input in earlier stages of the experiment, to Renlong Ai and He Wang, who developed and maintained the technical infrastructure, and to Aylin Cornelius, who helped with the evaluation.
References
Eleftherios Avramidis, Vivien Macketanz, Arle Lommel, and Hans Uszkoreit. 2018. Fine-grained evaluation of Quality Estimation for Machine Translation based on a linguistically-motivated Test Suite. In Proceedings of the First Workshop on Translation Quality Estimation and Automatic Post-Editing, pages 243-248, Boston, MA, USA.

Lorna Balkan, Doug Arnold, and Siety Meijer. 1995. Test suites for natural language processing. In Aslib Proceedings, volume 47, pages 95-98. MCB UP Ltd.

Aljoscha Burchardt, Kim Harris, Georg Rehm, and Hans Uszkoreit. 2016. Towards a Systematic and Human-Informed Paradigm for High-Quality Machine Translation. In Language Resources and Evaluation (LREC), Portoroz, Slovenia. European Language Resources Association.

Aljoscha Burchardt, Vivien Macketanz, Jon Dehdari, Georg Heigold, Jan-Thorsten Peter, and Philip Williams. 2017. A Linguistic Evaluation of Rule-Based, Phrase-Based, and Neural MT Engines. The Prague Bulletin of Mathematical Linguistics, 108:159-170.

Chris Callison-Burch, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136-158, Prague, Czech Republic. Association for Computational Linguistics.

Liane Guillou and Christian Hardmeier. 2016. PROTEST: A Test Suite for Evaluating Pronouns in Machine Translation. In Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Ulrich Heid and Elke Hildenbrand. 1991. Some practical experience with the use of test suites for the evaluation of SYSTRAN. In Proceedings of the Evaluators' Forum, Les Rasses. Citeseer.

Pierre Isabelle, Colin Cherry, and George Foster. 2017. A Challenge Set Approach to Evaluating Machine Translation. In EMNLP 2017: Conference on Empirical Methods in Natural Language Processing.

Margaret King and Kirsten Falkedal. 1990. Using test suites in evaluation of machine translation systems. In Proceedings of the 13th Conference on Computational Linguistics, volume 2, pages 211-216, Morristown, NJ, USA. Association for Computational Linguistics.

Sabine Lehmann, Stephan Oepen, Sylvie Regnier-Prost, Klaus Netter, Veronika Lux, Judith Klein, Kirsten Falkedal, Frederik Fouvry, Dominique Estival, Eva Dauphin, Herve Compagnion, Judith Baur, Lorna Balkan, and Doug Arnold. 1996. TSNLP - Test Suites for Natural Language Processing. In Proceedings of the 16th International Conference on Computational Linguistics (COLING 1996).

Ambiguity: lexical ambiguity, structural ambiguity
Composition: phrasal verb, compound
Coordination & ellipsis: sluicing, right-node raising, gapping, stripping
False friends
Function word: focus particle, modal particle, question tag
Long-distance dependency (LDD) & interrogative: multiple connectors, topicalization, polar question, WH-movement, scrambling, extended adjective construction, extraposition, pied-piping
Multi-word expression: prepositional MWE, verbal MWE, idiom, collocation
Named entity (NE) & terminology: date, measuring unit, location, proper name, domain-specific term
Negation
Non-verbal agreement: coreference, internal possessor, external possessor
Punctuation: comma, quotation marks
Subordination: adverbial clause, indirect speech, cleft sentence, infinitive clause, relative clause, free relative clause, subject clause, object clause
Verb tense/aspect/mood: tense/aspect: future I, future II, perfect, pluperfect, present, preterite, progressive; mood: indicative, imperative, subjunctive, conditional; type: ditransitive, transitive, intransitive, modal, reflexive
Verb valency: case government, passive voice, mediopassive voice, resultative predicates
Table 1: Categorization of the grammatical phenomena

Vivien Macketanz, Renlong Ai, Aljoscha Burchardt, and Hans Uszkoreit. 2018. TQ-AutoTest: An Automated Test Suite for (Machine) Translation Quality. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 7-12, Miyazaki, Japan. European Language Resources Association (ELRA).

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Biennial Conference of the Association for Machine Translation in the Americas, pages 223-231, Cambridge, MA, USA. International Association for Machine Translation.

Felix Stahlberg, Adrià de Gispert, Eva Hasler, and Bill Byrne. 2016. Neural Machine Translation by Minimising the Bayes-risk with Respect to Syntactic Translation Lattices. CoRR, abs/1612.03791.

Andrew Way. 1991. Developer-Oriented Evaluation of MT Systems. In Proceedings of the Evaluators' Forum, pages 237-244, Les Rasses, Vaud, Switzerland. ISSCO.
[Table 2: System accuracy (%) on each error category, based on Analysis 1, having removed all test sentences whose evaluation remained uncertain, even for one of the systems. Boldface indicates the significantly best systems in each category. Systems: JHU, LMU, MLLP, NJUNMT, NTT, onl-A, onl-B, onl-F, onl-G, onl-Y, RWTH, Ubiqus, UCAM, uedin.]

[Table 3: System accuracy (%) on each error category, based on Analysis 2, having removed only the system outputs whose evaluation remained uncertain; additionally includes the unsupervised systems LMU-uns and RWTH-uns.]

[Table 4: System accuracy (%) on linguistic phenomena related to verb tenses: future I, future I subjunctive II, future II, future II subjunctive II, perfect, pluperfect, pluperfect subjunctive II, present, preterite, preterite subjunctive II.]

[Table 5: System accuracy (%) on linguistic phenomena related to verb types: ditransitive, intransitive, modal, modal negated, reflexive, transitive.]

[Table 6: System accuracy (%) on specific linguistic phenomena with more than 15 test sentences: compound, quotation marks, phrasal verb, question tag, collocation, location, modal particle.]