The Algorithmic Inflection of Russian and Generation of Grammatically Correct Text
T.M. SADYKOV AND T.A. ZHUKOV
Abstract.
OpenCorpora [18].

1. Introduction
Automatic inflection of words in a natural language is necessary for a variety of theoretical and applied purposes such as parsing, topic-to-question generation [3], speech recognition and synthesis, machine translation [13], tagset design [17], information retrieval [12], content analysis, etc. [1, 2]. Various approaches towards automated inflection have been used to deal with particular aspects of inflection [5, 23] in predefined languages [6, 8, 9, 15, 16, 19] or in an unspecified inflected language [7].

Despite substantial recent progress in the field [16, 21, 22], automatic inflection still represents a problem of formidable computational complexity for many natural languages in the world. Most state-of-the-art approaches make use of extensive manually annotated corpora that currently exist for all major languages [20]. Real-time handling of a dictionary that contains millions of inflected word forms and tens of millions of relations between them is not an easy task [10]. Besides, no dictionary can ever be complete. For these reasons, algorithmic coverage of the grammar of a natural language is important provided that inflection in this language is complex enough.

Russian is a highly inflected language whose grammar is known for its complexity [21, 23]. In Russian, inflection of a word may require changing its prefix, root and ending simultaneously, while the rules of inflection are highly complex [11, 23]. The form of a word can depend on as many as five grammatical categories such as number, gender, person, tense, case, voice, animacy etc. By an estimate based on [18], the average number of different grammatical forms of a Russian adjective is 11.716. A Russian verb has, on average, 44.069 different inflected forms, counting participles of all kinds and the gerunds.

In the present paper we describe a fully algorithmic dictionary-free approach towards automatic inflection of Russian. The algorithms described in the present paper are implemented in the C# programming language.
The described functionality is freely available
This research was conducted in the framework of the basic part of the scientific research state task in the field of scientific activity of the Ministry of science and education of the Russian Federation, project no. 2.9577.2017. The first author was also supported by the grant of the Government of the Russian Federation for investigations under the guidance of the leading scientists of the Siberian Federal University (contract No. 14.Y26.31.0006).
2. Algorithms and implementation
The web-service passare.ru offers a variety of functions for inflection of single Russian words, word matching, and synthesis of grammatically correct text. In particular, the inflection of a Russian noun by number and case, the inflection of a Russian adjective by number, gender, and case, and the inflection of a Russian adverb by the degrees of comparison are implemented. The Russian verb is the part of speech whose inflection is by far the most complicated in the language. The presented algorithm provides inflection of a Russian verb by tense, person, number, and gender. It also allows one to form the gerunds and the imperative forms of a verb. Besides, functions for forming and inflecting active present and past participles as well as passive past participles are implemented. The passive present participle is the only verb form not currently supported by the website, due to the extreme level of its irregularity and its absence for numerous verbs in the language.

The algorithmic coverage of the Russian language provided by the web-service passare.ru aims to balance grammatical accuracy and ease of use. For that reason, a few simplifying assumptions have been made: the Russian letters "ё" and "е" are identified; no information on the stress in a word is required to produce its inflected forms; for inflectional functions, the existence of an input word in the language is determined by the user. Furthermore, the animacy of a noun is not treated as a variable category in the noun-inflecting function despite the existence of 1037 nouns (about 1.4% of the nouns in the OpenCorpora database [18]) with unspecified animacy. This list of nouns has been manually reviewed on a case-by-case basis and the decision has been made in favor of the form that is more frequent in the language. The other form can be obtained by calling the same function with a different case parameter (Nominative or Genitive instead of Accusative).

Similarly, the perfectiveness is not implemented as a parameter in the verb-inflecting function although by [18] there exist 1038 verbs (about 3.2% of the verbs in the database) in the language whose perfectiveness is not specified. For such verbs, the function produces forms that correspond to both perfective and imperfective inflections.

The inflectional form of a Russian word defined by a choice of grammatical categories (such as number, gender, person, tense, case, voice, animacy etc.) is in general not uniquely defined. This applies in particular to many feminine nouns, feminine forms of adjectives and to numerous verbs. For such words, the algorithms implemented in the web-service passare.ru only aim at finding one of the inflectional forms, typically the one which is the most common in the language.

Due to the rich morphology of the Russian language and to the high complexity of its grammar, a detailed description of the algorithms of Russian inflection cannot be provided in a journal paper. The algorithm for the generation of the perfective gerund form of a verb is presented in Fig. 1. Most of the notation in Figure 1 is the same as that of the C# programming language. Furthermore, NF denotes the input normal form (the infinitive) of a verb to be processed. GetPerfectness() is a boolean function which detects whether a verb is perfective or not.
Verb() is the function which inflects a given verb with respect to person, number, gender and tense (see the notation in Section 5). BF denotes the basic form of a Russian verb which is most suitable for constructing the perfective gerund of that verb. We found it convenient to use one of the three different basic forms depending on the type of the input verb to be inflected. The list vowels comprises all vowels in the Russian alphabet.

Figure 1. Generation of the perfective gerund form of a verb
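To make the role of a basic form and of the vowel list more concrete, the following Python fragment sketches the textbook rule for perfective gerunds. It is not the algorithm of Fig. 1: the function name perfective_gerund is hypothetical, the input is assumed to be the masculine singular past form (such as a verb-inflecting function could supply), and irregular verbs are deliberately ignored.

```python
# Illustrative sketch only: a textbook rule, not the algorithm of Fig. 1.
VOWELS = "аеёиоуыэюя"  # all vowels of the Russian alphabet


def perfective_gerund(past_masc: str) -> str:
    """Form a perfective gerund from the masculine singular past form.

    Hypothetical helper: the basic form is assumed to be given by the caller.
    """
    reflexive = past_masc.endswith("ся")
    base = past_masc[:-2] if reflexive else past_masc
    if len(base) > 1 and base.endswith("л") and base[-2] in VOWELS:
        # Vowel stem: drop the past-tense marker and add the short or long suffix.
        stem = base[:-1]
        suffix = "вши" if reflexive else "в"
    else:
        # Consonant stem: the past form itself serves as the stem.
        stem = base
        suffix = "ши"
    return stem + suffix + ("сь" if reflexive else "")


print(perfective_gerund("изучил"))    # изучив
print(perfective_gerund("вернулся"))  # вернувшись
```

Verbs whose gerund does not follow this pattern would have to be handled through exception lists before the rule is applied.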
The algorithms have been implemented in the C# programming language. The implementation comprises about 35,000 lines of code and has been compiled into a 571 kB executable file.

3. Software speed tests and verification of results
The software being presented has been tested against one of the largest publicly available corpora of Russian, OpenCorpora [18]. The tests were run on an Intel Core i5-2320 processor clocked at 3.00 GHz with 16 GB RAM under Windows 7. The results are summarized in Table 1.
Table 1: Inflection speed and agreement rates of passare.ru and OpenCorpora

Part of     Total number  Total processing  Number of forms      Processing time  Agreement rate
speech      of words      time, min:sec     computed (per word)  per word, msec   with OpenCorpora
Noun        74633         02:36             12                   2                98.557 %
Verb        32358         05:49             24                   10               98.678 %
Adjective   42920         00:06             28                   0.14             98.489 %
Adverb      1507          <

All of the words whose inflected forms did not show full agreement with the OpenCorpora database have been manually reviewed on a case-by-case basis. In the case of nouns, 26.76% of all error-producing input words belong to the class of Russian nouns whose animacy cannot be determined outside the context (e.g. "ёж", "жучок" and the like). For verbs, 11.26% of the discrepancies result from the verbs whose perfectiveness cannot be determined outside the context without additional information on the stress in the word (e.g. "насыпать", "пахнуть" etc.).

Besides, a substantial number of errors in OpenCorpora have been discovered. The classification of flaws in OpenCorpora is beyond the scope of the present work and we only mention that the inflection of the verb "застелить" as well as the gerund forms of the verbs "выместить" and "напечь" appear to be incorrect in this database at the time of writing.

4. Synthesis of grammatically correct text
Using the basic functions described above, one can implement automated synthesis of grammatically correct Russian text on the basis of any logical, numerical, financial, factual or any other precise data. The website passare.ru provides examples of such metafunctions that generate a grammatically correct weather forecast and a currency exchange rates report on the basis of real-time data available online. Besides, it offers a function that converts a correct arithmetic formula into Russian text.
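When text is synthesized from numerical data, a noun that follows a numeral has to agree with it. The Python fragment below is a minimal sketch of how the number category could be chosen for an integer, using the codes n1, n2 and n5 from the parameter list of the API section. The function name number_category is hypothetical, and the rule shown is the standard rule of Russian grammar rather than code taken from passare.ru.

```python
def number_category(n: int) -> str:
    """Return the number code (n1, n2 or n5) for a noun counted by n.

    Hypothetical helper illustrating the standard agreement rule.
    """
    n = abs(n)
    if n % 100 in (11, 12, 13, 14):
        return "n5"  # 11-14 always take the form used with 5, 6, 7, ...
    last_digit = n % 10
    if last_digit == 1:
        return "n1"  # 1, 21, 31, ...: singular
    if last_digit in (2, 3, 4):
        return "n2"  # 2, 3, 4, 22, 23, 24, ...
    return "n5"      # 0, 5-9 and everything else


for amount in (1, 3, 7, 11, 21, 24):
    print(amount, number_category(amount))
```

The resulting code can then be passed as the number argument when the noun is inflected.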
The following piece of C# code is the core of one of the central functions which generate a grammatically correct report on exchange rates of currencies.

    [SynthFunction("(Change in exchange rate)")]
    public static LogicSet TrendChange(LogicSet input, LogicSolver solver)
    {
        TimeSeries ts = new TimeSeries();
        var cc1 = input.ElementAt(0).ToString().Substring(1);
        var cc2 = input.ElementAt(1).ToString().Substring(1);
        string rdt = cc1 + cc2;
        var currency1y = Financialdata.GetData(rdt, "CURRENCY", "1Y", "d,c");
        for (int i = 0; i < currency1y.Count; i++)
        {
            int ri = currency1y.Count - i - ...
            ... - i - ...
            ... new TimeSeriesDataPoint(d,
                    double.Parse(currency1y[ri][1].Replace('.', ','))));
        }
        var result = ts.BuildMontlyTrends();
        if (result.Count == 0) return new LogicSet();
        var last = result.Last();
        var istp = ts.IsTrendPresent(last, DateTime.Today.AddDays(- ...
        int ttlmonths = DateTime.Today.Month - midtrendtime.Month;
        var id3 = solver.OpenParamGroup();
        var lset = new LogicSet();
        if (ttlmonths > ...
        {
            var id2 = solver.OpenParamGroup();
            var a1 = solver.Construct("number of months " + ttlmonths, id2);
            var dtest = solver.Construct("past time (months ago)", id3);
            solver.Apply(dtest);
            solver.CollapseLongBranches(dtest);
            solver.CloseParamGroup(id2);
        }
        else ............
        solver.CloseParamGroup(id3);
        return lset;
    }

5. Automated API access of main functions

interface: socket
ip: 46.173.208.127
port: 9999
character encoding: UTF8
To access a function, one needs to connect to the server, send a query string ending with the zero byte, receive the response string ending with the zero byte and close the connection.

The API accessible functions of the website provide inflection of the following parts of speech:
• Verbs (ru verb) with the arguments: verb (the infinitive); person; number; gender; tense;
• Nouns (ru noun) with the arguments: noun (the singular nominative form); number; case;
• Adjectives (ru adjective) with the arguments: adjective (the singular masculine nominative form); number; gender; case; animacy;
• Adverbs (ru adverb) with the arguments: adverb; comparative/superlative form;
• Numerals (ru numeral):
    Cardinals with the arguments: number; card;
    Ordinals with the arguments: number; ordi;
    Fractions with the arguments: number (e.g. 1/2); frac.
• Do we have API accessible functions for participles and gerund form?
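The connect, send, receive and close sequence described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names build_query, read_response and call are hypothetical, the function names are passed exactly as they are printed in this section, and the host and port are the ones listed above. Nothing here has been checked against the live service.

```python
import socket

HOST, PORT = "46.173.208.127", 9999  # values quoted in the text


def build_query(function: str, *args: str) -> bytes:
    """Join the function name and its arguments with ';' and add the zero byte."""
    return ";".join((function,) + args).encode("utf-8") + b"\0"


def read_response(sock: socket.socket) -> str:
    """Read from the socket until the terminating zero byte (or end of stream)."""
    buffer = bytearray()
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        buffer += chunk
        if b"\0" in chunk:
            break
    return bytes(buffer).split(b"\0", 1)[0].decode("utf-8")


def call(function: str, *args: str, timeout: float = 8.0) -> str:
    """Hypothetical wrapper: one connection per query, closed after the reply."""
    with socket.create_connection((HOST, PORT), timeout=timeout) as sock:
        sock.sendall(build_query(function, *args))
        return read_response(sock)
```

The response is decoded only after the whole message has been collected, so a multi-byte UTF-8 character split across two reads does not cause a decoding error.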
The lists of possible values of the parameters in the above functions are as follows:

Person: p1 - 1st person; p2 - 2nd person; p3 - 3rd person.
Number: n1 - Singular; nx - Indefinite plural; n2 - Plural for numerals like 2, 3, 4, 22, 23, 24, etc.; n5 - Plural for numerals like 5, 6, 7, 8, etc.
Gender: gm - Masculine; gf - Feminine; gn - Neuter.
Tense: tc - Present; tp - Past; tf - Future.
Case:
    ci - Imenitelnyj (Nominative)
    cr - Roditelnyj (Genitive)
    cd - Datelnyj (Dative)
    cv - Vinitelnyj (Accusative)
    ct - Tvoritelnyj (Instrumental)
    cp - Predlozhnyj (Prepositional)
Animacy: a - Animate; an - Inanimate.
Adverb form: fc - Comparative; fs - Superlative.
Type of a numeral: card - Cardinal; ordi - Ordinal; frac - Fractional.

Examples of query strings:
    ru adverb;быстро;fc
    ru verb;изучить;p3;n1;gm;tc
    ru adjective;русский;nx;gf;ti;na
    ru noun;язык;n1;cp
    ru numeral;24;card
    ru numeral;7;ordi
    ru numeral;11/12;frac

Example of implementation:
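PHP

    $host = "46.173.208.127";
    $port = 9999;
    $waitTimeoutInSeconds = 8;
    $fp = fsockopen($host, $port, $errCode, $errStr, $waitTimeoutInSeconds);
    if ($fp) {
        // Send the query string terminated by the zero byte.
        fwrite($fp, "ru noun;машина;cr;nx" . "\0");
        // Read the response up to its terminating zero byte, then close the connection.
        echo stream_get_line($fp, 65535, "\0");
        fclose($fp);
    }
    // Output: машин

6. Discussion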
There exist several other approaches towards automated Russian inflection and synthesis of grammatically correct Russian text, e.g. [14, 16]. Besides, numerous programs attempt automated inflection of a particular part of speech or synthesis of a document with a rigid predefined structure [4]. Judging by publicly available information, most such programs make extensive use of manually annotated corpora, which might cause failure when the word to be inflected is different enough from the elements in the database.

The solution presented in this paper has been designed to be as independent of any dictionary data as possible. However, due to numerous irregularities in the Russian language, several lists of exceptional linguistic objects (like the list of indeclinable nouns or the list of verbs with strongly irregular gerund forms, see Fig. 1) have been composed and used throughout the code. Whenever possible, rational descriptions of exceptional cases have been adopted to keep the numbers of elements in such lists to the minimum.
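The interplay between a short exception list and a general rule can be illustrated by the following Python fragment. It is a toy sketch, not code from the presented system: the set INDECLINABLE holds only a handful of sample nouns, the function name prepositional_singular is hypothetical, and the only rule covered is the prepositional singular of nouns ending in "а".

```python
# Tiny illustrative sample; a real list of indeclinable nouns is much longer.
INDECLINABLE = {"пальто", "кино", "метро", "кофе", "такси"}


def prepositional_singular(noun: str) -> str:
    """Toy rule with an exception list consulted first (hypothetical helper)."""
    if noun in INDECLINABLE:
        return noun  # indeclinable nouns keep the same form in every case
    if noun.endswith("а"):
        return noun[:-1] + "е"  # машина -> машине
    raise NotImplementedError("only nouns ending in 'а' are covered by this sketch")


print(prepositional_singular("машина"))  # машине
print(prepositional_singular("пальто"))  # пальто
```

Checking the exception list before the rule keeps the rule itself simple, at the cost of maintaining the list.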
References

[1] G.G. Belonogov, A.A. Horoshilov, and A.A. Horoshilov. Automation of the English-Russian bilingual phraseological dictionaries based on arrays of bilingual texts (bilingual), Automatic Documentation and Mathematical Linguistics :3 (2010), 103-110.
[2] G.G. Belonogov and R. Kotov. Automated Information-Retrieval Systems. Moscow: Mir, 1971.
[3] Y. Chali and S.A. Hasan. Towards topic-to-question generation, Computational Linguistics :1 (2015), 20p.
[4] B.V. Chernikov and A.M. Karminsky. Specificities of lexicological synthesis of text documents, Procedia Computer Science (2014), 431-439.
[5] D. Conway. An algorithmic approach to English pluralization, Proceedings of the Second Annual Perl Conference. San Jose, California, USA. COPE, D., 2001.
[6] D. Elworthy. Tagset design and inflected languages, arXiv:cmp-lg/9504002v2.
[7] M. Faruqui, Yu. Tsvetkov, G. Neubig, and C. Dyer. Morphological inflection generation using character sequence to sequence learning, Human Language Technologies, NAACL HLT (2016), 634-643.
[8] W.D. Foust. Automatic English inflection, Proceedings of the National Symposium on Machine Translation, UCLA (1960), 229-233.
[9] H. Fukś. Inflection system of a language as a complex network, IEEE Toronto International Conference - Science and Technology for Humanity (2009), 491-496.
[10] J. Goldsmith. Unsupervised learning of the morphology of a natural language, Computational Linguistics :2 (2001), 153-198.
[11] M. Halle and O. Matushansky. The morphophonology of Russian adjectival inflection, Linguistic Inquiry :3 (2006), 351-404.
[12] L.L. Iomdin. Natural language processing as a source of linguistic knowledge, Proceedings of the International Conference on Machine Learning, Models, Technologies and Applications (2003), 68-74.
[13] L.L. Iomdin, O. Streiter, and I.L. Sagalova. Learning lessons from bilingual corpora: Benefits for machine translation, International Journal of Corpus Linguistics :2 (2000), 199-230.
[14] M.I. Kanovich and Z.M. Shalyapina. The RUMORS system of Russian synthesis, Proceedings of the 15th conference on Computational linguistics - Vol. 1 (1994), 177-179.
[15] Kasmir Raja S. V., V. Rajitha, and Meenakshi Lakshmanan. Computational model to generate case-inflected forms of masculine nouns for word search in Sanskrit E-text, Journal of Computer Science :11 (2014), 2260-2268.
[16] M. Korobov. Morphological analyzer and generator for Russian and Ukrainian languages, Communications in Computer and Information Science (2015), 330-342.
[17] E.A. Kuzmenko. Morphological analysis for Russian: Integration and comparison of taggers, Communications in Computer and Information Science (2017), 162-171.
[18] OpenCorpora:
[19] An algorithm for suffix stripping, Program :3 (1980), 130-137.
[20] I. Segalovich. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine, Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications (2003), 273-280.
[21] A. Sorokin. Using longest common subsequence and character models to predict word forms, Proceedings of the 14th Annual SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology (2016), 54-61.
[22] T. Xiao, J. Zhu, and T. Liu. Bagging and boosting statistical machine translation systems, Artificial Intelligence (2013), 496-527.
[23] A.A. Zaliznyak. Russian Nominal Inflection. (Russian) Nauka, 1967.
Department of Mathematics and Computer Science,
Plekhanov Russian University
115054, Moscow, Russia