A Comparative Study of Feature Types for Age-Based Text Classification
Anna Glazkova, Yury Egorov, and Maksim Glazkov

University of Tyumen, ul. Volodarskogo 6, 625003 Tyumen, Russia
[email protected], [email protected]

"Organization of cognitive associative systems" LLC, ul. Gertsena 64, 625000 Tyumen, Russia
[email protected]

Abstract.
The ability to automatically determine the age audience of a novel provides many opportunities for the development of information retrieval tools. Firstly, developers of book recommendation systems and electronic libraries may be interested in filtering texts by the age of the most likely readers. Further, parents may want to select literature for children. Finally, it will be useful for writers and publishers to determine which features influence whether the texts are suitable for children. In this article, we compare the empirical effectiveness of various types of linguistic features for the task of age-based classification of fiction texts. For this purpose, we collected a text corpus of book previews labeled with one of two categories: children's or adult. We evaluated the following types of features: readability indices, sentiment, lexical, grammatical and general features, and publishing attributes. The results obtained show that features describing the text at the document level can significantly increase the quality of machine learning models.
Keywords:
Text classification · Fiction · Corpus · Age audience · Content rating · Text difficulty · RuBERT · Neural network · Natural language processing · Machine learning.
⋆ Supported by the grant of the President of the Russian Federation no. MK-637.2020.9.

1 Introduction

Nowadays, there are quite a few approaches to text classification according to document subject, genre, author, or other attributes. However, modern challenges in the field of natural language processing (NLP) and information retrieval (IR) increasingly require classification based on more complex characteristics. For example, it may be necessary to determine whether a text contains elements of propaganda or whether its plot is similar to that of other texts. One such urgent and complex classification task is the division of literary texts into those suitable for children and those for adults. Age-based classification tools could find wide practical application. For instance, they would be useful in the personal selection of fiction or in filtering content not intended for children.

Although many scholars have considered the issue of text difficulty estimation, text difficulty does not formally indicate the age of the intended reader. Whether the features describing text difficulty are suitable for age-based classification needs to be investigated. In addition, the prominence of these features depends on the text genre. It is necessary to find out how these features manifest in literary text and whether they carry information about the age audience of the text.

In this paper, we systematically evaluate different feature types on the age-based classification task. In addition to popular text difficulty features, we consider special publishing attributes intrinsic to fiction books, such as age rating scores and abstract features. We collected a corpus of Russian fiction texts and applied two commonly used machine learning models: random forest (RF) and a linear support vector classifier (LSVC).
For comparison, we evaluated a transformer model based on RuBERT and a convolutional neural network (CNN) trained on Word2Vec embeddings. Finally, we evaluated a feedforward neural network (FNN) trained on RuBERT text embeddings and age rating scores.

The LSVC model using a combination of a baseline and publishing attributes showed the best result of 95.77% (F1-score). RuBERT achieves 90.16%. The FNN model combining RuBERT embeddings and age ratings showed 94.78%. The results show that features describing the text at the document level give an advantage in the case of long texts. Moreover, publishing attributes provide valuable information for the age-based classifier. We also found that some features used to determine text difficulty positively affect the quality of age-based classification.

The paper is organised as follows. In Section 2 we present a brief review of related work. Section 3 describes the feature types evaluated in the paper. Section 4 contains the description of our dataset. Section 5 presents the structure of the models and the evaluation results. Finally, Section 6 concludes the paper.
2 Related Work

In the modern world, the constant growth of information resources gives rise to the need for filtering and ranking texts. One of the significant characteristics of a text is its complexity. The question of determining text difficulty is naturally related to the task of age-based classification.

The task of estimating texts by complexity is not new. It appeared at the beginning of the last century in the context of evaluating the readability of educational texts. During the XX century, researchers proposed a number of tests to determine readability based on the quantitative characteristics of texts, such as counts of syllables, words, and sentences. There are several common readability tests for text difficulty estimation, for instance: the Flesch–Kincaid readability test, the Coleman–Liau index, the automated readability index (ARI), the SMOG grade, and the Dale–Chall formula [6,10].
– The Flesch–Kincaid readability test is based on the idea that the shorter the sentences and words, the simpler the text. The formula is:

  RF = 206.835 − 1.015 · ASL − 84.6 · ASW,   (1)

  where ASL is the average sentence length (in words) and ASW is the average number of syllables per word (i.e., the number of syllables divided by the number of words).

– The Coleman–Liau index uses letters instead of syllables. The formula takes into account the average number of letters per word and the average number of words per sentence:

  RC = 0.0588 · L − 0.296 · S − 15.8,   (2)

  where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.

– The ARI formula takes into account the number of letters. In the past, this allowed the index to be used to measure the complexity of texts in real time on electric typewriters:

  RA = 4.71 · (characters / words) + 0.5 · (words / sentences) − 21.43,   (3)

  where characters is the number of letters and digits, words is the number of words, and sentences is the number of sentences.

– The main idea of the SMOG grade is that the complexity of the text is most affected by complex words, i.e., words with many syllables (more than 3). The more syllables, the more complicated the word:

  RS = 1.043 · √(polysyllables · 30 / sentences) + 3.1291,   (4)

  where polysyllables is the number of polysyllabic words and sentences is the number of sentences.

– The Dale–Chall formula uses a count of "hard" words, i.e., words that do not appear on a specially designed list of common words familiar to most 4th-grade students:

  RD = 0.1579 · (difficult words / words · 100) + 0.0496 · (words / sentences),   (5)

  where difficult words is the number of difficult words, words is the number of words, and sentences is the number of sentences.

In addition to the above, there are many other readability tests that are also actively used, e.g., the Fry Graph readability formula, the Spache index, and the Linsear Write formula. The values obtained from readability tests are called readability indices. A readability index characterizes the difficulty of perceiving a text or the expected level of education required to understand it.
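As a concrete illustration, indices (1)-(4) can be computed in a few lines of Python. This is a sketch using the original English-language coefficients shown above (the paper itself applies Russian-adapted coefficients from [41]); the Dale–Chall formula (5) is omitted because it additionally requires the list of common words, and syllables are approximated by counting vowel letters.

```python
import re

VOWELS = "аеёиоуыэюяaeiouy"

def count_syllables(word):
    # Approximate syllable count as the number of vowel letters
    # (exact for Russian, a rough heuristic for English).
    return sum(ch in VOWELS for ch in word.lower())

def readability_indices(text):
    """Compute readability indices (1)-(4) with the classic coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-zА-Яа-яЁё]+", text)
    letters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(count_syllables(w) > 3 for w in words)
    n_w, n_s = len(words), max(len(sentences), 1)

    asl = n_w / n_s        # ASL: average sentence length in words
    asw = syllables / n_w  # ASW: average syllables per word
    return {
        "flesch": 206.835 - 1.015 * asl - 84.6 * asw,               # (1)
        "coleman_liau": 0.0588 * (letters / n_w * 100)
                        - 0.296 * (n_s / n_w * 100) - 15.8,         # (2)
        "ari": 4.71 * (letters / n_w) + 0.5 * (n_w / n_s) - 21.43,  # (3)
        "smog": 1.043 * (polysyllables * 30 / n_s) ** 0.5 + 3.1291, # (4)
    }
```

Note the opposite orientations: a higher Flesch score means an easier text, while higher ARI and SMOG values mean a harder one.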
The readability formulas listed above are metrics for English texts. At the same time, the quantitative characteristics of other languages can differ significantly. For instance, Russian sentences are on average shorter than English ones, while words are longer. Therefore, the readability formulas need to be adapted for use in other languages. Up to now, several studies have suggested adaptations of readability tests for Russian. For example, I. Oborneva [27] proposed coefficients for the Flesch–Kincaid formula for Russian texts. The project [41] offers adaptations of several readability formulas. M. Solnyshkina et al. presented a new approach to reading difficulty prediction in Russian texts [37,38].

Readability is, however, only one aspect of age-based classification. Scholars have proposed more complex techniques for text complexity estimation using features of different natures. Thus, Yu. Tomina [42] considered the lexical and syntactic features of text complexity. A. Laposhina et al. [22] evaluated a wide range of feature types, such as readability, semantic, lexical, grammatical and others. M. Shafaei et al. [33] estimated the age suitability rating of movie dialogs using genre and sentiment features. L. Flekova et al. [16] proposed an approach to describing story complexity for literary texts. Y. Bertills [4] wrote about the features of literary characters and named entities in books for children. Finally, in our previous research, we evaluated the informativeness of some quantitative and categorical features for age-based text classification [12].

The modern methodology for text difficulty estimation is in most cases based on machine learning approaches. Thus, R. Balyan et al. [3] showed that applying machine learning methods increased accuracy by more than 10% compared to classic readability metrics (e.g., the Flesch–Kincaid formula).
To date, a number of studies have confirmed the effectiveness of various machine learning techniques for text difficulty estimation, such as support vector machines (SVM) [36,39], random forests [26], and neural networks [2,7,35].

Another aspect of assessing the age category of text readers is the safety of the information the text contains. Currently, in many countries, publishers are required to label books (including fiction) and other informational sources [9,11,14,18,30] according to their age rating. For these purposes, there are special laws that rank information in terms of the potential harm it can bring. In Russia, this is the Russian Age Rating System (RARS).

The RARS was introduced in 2012 when Federal Law of the Russian Federation no. 436-FZ of 2010-12-23, "On Protection of Children from Information Harmful to Their Health and Development", was passed [15]. The law prohibits the distribution of "harmful" information that depicts violence, unlawful activities, substance abuse, or self-harm. The RARS includes five categories: for children under the age of six (0+), for children over the age of six (6+), for children over the age of twelve (12+), for children over the age of sixteen (16+), and prohibited for children (18+). As a rule, an age rating is assigned to a book by editors or experts. As far as we know, there is currently no published research on how age rating correlates with other attributes of text, such as readability.

The reviewed studies and sources clearly indicate that age-based classification of fiction texts includes several aspects. First, the research topic is related to works on text difficulty evaluation. Text difficulty is characterized by features of different types: lexical, semantic, grammatical and others. However, the measure of text difficulty does not guarantee that a text is targeted at a particular age audience. It is required to evaluate the effectiveness of the existing text difficulty features for age-based classification. In addition, it would be interesting to evaluate the role of publishing attributes (for example, age rating labels) as classification features. Finally, the studies presented thus far provide evidence that machine learning approaches show the highest results in the task of estimating texts by difficulty. Based on this, it is reasonable to evaluate text features for age-based classification using machine learning methods.
3 Feature Types

According to related works, we consider the following types of classification features.

1. General features. This type includes features that reflect the quantitative characteristics of the text:
   – the average and median length of words (avg_words_len, med_words_len);
   – the average and median length of sentences (avg_sent_len, med_sent_len), i.e. the average or median number of symbols in each sentence;
   – the average number of syllables (avg_count_syl);
   – the percentage of long words with more than 4 syllables (many_syllables);
   – the Type-Token Ratio, TTR (ttr) [40]. The main idea of the metric is that if the text is more complex, the author uses a more varied vocabulary, so there is a larger number of unique words. The TTR value is calculated as the number of unique words divided by the total number of words; thus, the higher the TTR, the higher the variety of words;
   – the TTR for nouns (ttr_n), adjectives (ttr_a), and verbs (ttr_v), i.e. the values of TTR calculated separately for parts of speech;
   – the NAV metric (nav), a TTR-based ratio (TTR_A + TTR_N)/TTR_V proposed in [37].

2. Readability features. We used the readability formulas with the coefficients for the Russian language proposed by the project [41]. In this study, we evaluated five readability indices using the following metrics: the Flesch–Kincaid readability test (index_fk); the Coleman–Liau index (index_cl); the ARI index (index_ari); the SMOG grade (index_smog); the Dale–Chall formula (index_dc).

3. Lexical features. In this category, we included features constructed by evaluating the text in accordance with frequency dictionaries. As frequency dictionaries, we used the lists of frequent Russian words presented in [25,34]:
   – the percentage of words included in the list of the 5000 most frequent Russian words;
   – the average frequency of the words included in the 5000 most frequent words;
   – the average frequency of words per 1 million occurrences (ipm, words_fr);
   – the average frequency of nouns, verbs, adjectives, adverbs and proper names per 1 million occurrences (s_fr, v_fr, adj_fr, adv_fr, prop_fr);
   – the average number of topic segments of the corpus where the word was encountered (out of 100 possible, words_r);
   – the average number of the corresponding topic segments for nouns, verbs, adjectives, adverbs and proper names (s_r, v_r, adj_r, adv_r, prop_r);
   – the average value of Juilland's usage coefficient (words_d). Juilland's usage coefficient measures the dispersion of a word's subfrequencies over n equally-sized subcategories of the corpus [17];
   – the average values of Juilland's usage coefficients for nouns, verbs, adjectives, adverbs and proper names (s_d, v_d, adj_d, adv_d, prop_d);
   – the number of documents in the corpora in which a word occurs (averaged over the text, words_doc);
   – the average number of documents in the corpora in which a word occurs, for nouns, verbs, adjectives, adverbs and proper names (s_doc, v_doc, adj_doc, adv_doc, prop_doc).

4. Grammatical features. We evaluated the percentages of nouns, verbs, and adjectives (count_n, count_v, count_a).

5. Sentiment features. These features were obtained with the Russian Sentiment Lexicon [24]. We separately evaluated the percentage of positive and negative words for each of the topic categories: opinion, feeling (private state), or fact (sentiment connotation) (neg_opinion, neg_feeling, neg_fact, pos_opinion, pos_feeling, pos_fact).

6. Publishing features. Here we included features based on publishing attributes, i.e. on book characteristics assigned by an editor or publisher, such as the age rating according to the RARS (age_rating) and TF-IDF scores for book abstracts.
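Several of the general features above can be computed directly from raw text. The following is a minimal illustrative sketch (the function and dictionary key names are ours, not necessarily the paper's exact identifiers); the POS-dependent metrics (ttr_n, nav, count_n, ...) would additionally require a morphological analyzer such as Pymorphy2, which the paper uses.

```python
import re

VOWELS = "аеёиоуыэюяaeiouy"

def general_features(text):
    """Document-level 'general features': word/sentence lengths, syllable
    statistics, and the Type-Token Ratio (TTR)."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.lower() for w in re.findall(r"\w+", text)]
    syllables = [sum(ch in VOWELS for ch in w) for w in words]
    word_lens = sorted(len(w) for w in words)
    sent_lens = sorted(len(s) for s in sentences)  # length in symbols
    median = lambda xs: xs[len(xs) // 2]
    return {
        "avg_words_len": sum(word_lens) / len(words),
        "med_words_len": median(word_lens),
        "avg_sent_len": sum(sent_lens) / len(sentences),
        "med_sent_len": median(sent_lens),
        "avg_count_syl": sum(syllables) / len(words),
        # share of long words with more than 4 syllables
        "many_syllables": sum(s > 4 for s in syllables) / len(words),
        # unique words divided by total words
        "ttr": len(set(words)) / len(words),
    }
```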
4 Dataset

For feature evaluation, we collected a dataset of fiction books published in Russian. Due to copyright restrictions, the full texts of the books are not publicly available. Therefore, we used a collection of previews presented in online libraries in the public domain. Typically, a preview is 5-10% of the total book volume. The corpus consists of 5592 texts of children's and adult book previews.

We divided the texts into two parts. The first part included 4492 texts and was used to train the models. The remaining 1000 texts served as an independent test sample. The main characteristics of the data are presented in Table 1. Table 2 shows short text examples of the adult and children's categories.

(The frequency dictionary was created on the basis of the modern subcorpus of the Main Corpus and the Oral Corpus of the Russian National Corpus (1950-2007) [32] with a total volume of 92 million tokens [25].)
Table 1.
Characteristics of the corpus.

Characteristic          | Training sample      | Test sample
                        | Children's | Adult   | Children's | Adult
Number of texts         | 2108       | 2384    | 500        | 500
Avg number of symbols   | 3134.38    | 3326.11 | 3048.69    | 3319.86
Avg number of tokens    | 488.55     | 499.52  | 479.3      | 498.16
Avg number of sentences | 37.35      | 35.2    | 36.05      | 36.49
Table 2: Example short fragments.

Category: Adults. Age rating: 16+. Genre: Modern romance novels.
Fragment: A tall young man dressed in jeans, an inconspicuous jacket and a baseball cap pulled down with a visor over his eyes, approached the entrance of a seventeen-story apartment building and stood as if waiting for someone, and when a mother with a stroller appeared at the door, he quickly jumped inside - he did not know the code. He walked up to the fifth floor, putting on thin gloves on the go, looked around, and then deftly opened the door of one of the apartments. On the threshold he froze and listened for a while, but it was quiet. The man turned the baseball cap over the visor and began a leisurely survey of the apartment, opening the doors of the cupboards and looking into the drawers. The first thing he did was to open the sliding wardrobe door in the hallway, and oversized men's slippers, an empty box, and a bright purple scarf fell out onto the floor. The man grimaced and shoved everything back, muttering, "I thought so." In a large room that served as a bedroom, he lingered a little longer and grunted ironically at the sight of a luxurious couch with a carved back.
(From "Men We Choose" by Evgeniya Perova, translated from Russian.)

Category: Adults. Age rating: 12+. Genre: Historical adventures.
Fragment: What a wonderful autumn it was in Southern Poland that year! Almost without rain and cold winds, tenderly warm, quiet, crimson-gold. Fabulous autumn - in such an autumn it is good, having climbed into the spurs of the Beskydy, from dawn to noon to wander along the slopes of hills overgrown with beech and hazel, and to your fill, drunk to breathe in the cool and crystal clear mountain air. And then, on the cozy terrace of a small mountain tavern, eat a good portion of hot, fiery-spicy bigos with pork legs, washed down with icy "okocim". And in the evening, having walked up to aching knees, kindle a fire on a platform open to all the winds above a shallow ravine, and, sitting on unbound logs, look at the stars that suddenly poured out in incredible numbers overhead. And, peering to the north, in the transparent thickening blue of the air, distinguish the lights of distant Krakow or Nova Huta, or maybe Bochnia or Wieliczka, who knows?
(From "Sold Poland" by Alexander Usovsky, translated from Russian.)

Category: Children's. Age rating: 6+. Genre: Children's adventures.
Fragment: In a big, big city, where there are many, many houses, many, many cars and even more people, and the crows cannot be counted at all, there lived a ginger cat on a short street consisting of only two courtyards. His name was Ostrich.
(From "Greetings from cutlets" by Evgenia Malinkina, translated from Russian.)

Category: Children's. Age rating: 12+. Genre: Children's fantastic tales.
Fragment: The hands of the clock were approaching half past seven, but the setting sun, reluctantly sliding behind the houses, continued to burn the city with rays, and the approaching twilight did not promise the long-awaited coolness. Friday night was hot and stuffy, and the city roofs were so hot during the day that no sane cat would dare to run over them without burning their paws. August was coming to an end, and the sun knew that it was the most important thing in the city, so from the very morning it climbed everywhere, trying to melt the asphalt on the streets, drying the grass on the lawns and sneaking into the apartments to flood them with heat and stuffiness.
(From "Vlad and the Secret Ghost" by Sasha Gotti, translated from Russian.)

Table 3 lists the most informative quantitative features. The informativeness is measured using the method of cumulative frequencies [1,44]. The main idea of this method consists in dividing the range of feature values for each class into n intervals. The cumulative frequency of feature values is calculated for each interval. The informativeness indicator is then the maximum absolute difference between the accumulated frequencies of the corresponding intervals in the two classes.

Table 3.
Top-10 most informative quantitative features (according to the method of cumulative frequencies): avg_sent_len, med_sent_len, index_dc, adj_doc, index_ari, adj_fr, s_doc, index_fk, v_doc, adv_fr.
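The cumulative-frequencies informativeness measure described above can be sketched as follows; `cumulative_frequency_informativeness` is our illustrative name, and the interval count n is a free parameter.

```python
import numpy as np

def cumulative_frequency_informativeness(x_a, x_b, n_intervals=10):
    """Informativeness of one quantitative feature: split the joint value
    range into n intervals, accumulate the per-class relative frequencies,
    and return the maximum absolute difference between the two cumulative
    curves (a Kolmogorov-Smirnov-style statistic)."""
    lo = min(x_a.min(), x_b.min())
    hi = max(x_a.max(), x_b.max())
    edges = np.linspace(lo, hi, n_intervals + 1)
    freq_a, _ = np.histogram(x_a, bins=edges)
    freq_b, _ = np.histogram(x_b, bins=edges)
    cum_a = np.cumsum(freq_a) / len(x_a)
    cum_b = np.cumsum(freq_b) / len(x_b)
    return float(np.max(np.abs(cum_a - cum_b)))
```

Identical feature distributions in the two classes yield 0, while fully separated distributions yield 1, so the score ranks features by how well they discriminate the classes.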
Figure 1 presents the distribution of the age rating labels (age_rating is a categorical feature) in the classes of the training data. It is interesting to note that some of the books from the children's class are labelled with the 18+ age category. Notable examples of this type of book are love stories for teens.
Fig. 1.
The distribution of age rating categories.
5 Models and Evaluation

This section describes our feature evaluation experiments. We built two types of baseline models and sequentially enriched them with different types of features. Further, we compared the results obtained with our models against the results of a CNN and RuBERT. Our dataset and models are available at [5].
We built two classifiers for model evaluation. The first one was a random forest classifier trained on bootstrap samples. The number of trees in the forest was equal to 100, and the Gini impurity was used to measure the quality of a split. The second model was a linear support vector classifier with the "l2" penalty and the squared hinge loss function. Both models were implemented using Scikit-learn [29] and Python 3.6.
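Under the stated configuration, the two classifiers can be instantiated in Scikit-learn roughly as follows. This is a sketch on synthetic toy data standing in for the TF-IDF and feature matrices used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# RF: 100 trees, Gini impurity, bootstrap sampling (as described above).
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            bootstrap=True, random_state=42)

# LSVC: "l2" penalty with the squared hinge loss (Scikit-learn defaults).
lsvc = LinearSVC(penalty="l2", loss="squared_hinge")

# Toy data stands in for the TF-IDF + feature vectors of the corpus.
X, y = make_classification(n_samples=200, n_features=50, random_state=0)
rf.fit(X, y)
lsvc.fit(X, y)
```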
To preprocess our data, we used min-max normalization. Moreover, some of the features are obviously correlated. For instance, most readability indices show a cross-correlation greater than 0.8. Other examples of correlated feature pairs are the average and median lengths of sentences, or the TTR values for all words and for particular parts of speech (Figure 2). To reduce the influence of feature correlation on the LSVC model, we applied linear dimensionality reduction using singular value decomposition of the data, with the minimum number of principal components such that 95% of the variance is retained.
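One possible reading of this preprocessing step, sketched with Scikit-learn's MinMaxScaler and TruncatedSVD (the paper does not name the exact implementation, and the helper name is ours):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler

def reduce_features(X, variance=0.95):
    """Min-max normalization followed by SVD, keeping the smallest number
    of components whose cumulative explained variance reaches `variance`."""
    X = MinMaxScaler().fit_transform(X)
    svd = TruncatedSVD(n_components=min(X.shape) - 1, random_state=0)
    X_t = svd.fit_transform(X)
    # First index where the cumulative explained-variance ratio >= variance.
    cum = np.cumsum(svd.explained_variance_ratio_)
    k = min(int(np.searchsorted(cum, variance)) + 1, X_t.shape[1])
    return X_t[:, :k], k
```

On strongly correlated features (such as the readability indices mentioned above), k comes out much smaller than the original dimensionality.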
Fig. 2.
Correlation matrices for readability and general features.
We used models trained on TF-IDF vectors as baselines. Further, we systematically evaluated each type of features. To begin this process, we concatenated the TF-IDF vector of the text with the corresponding vector of features of a certain type. Book previews are rather long texts. Since the RuBERT model that participated in the comparison can only process a sequence of limited length, we used the same fragments of 256 tokens both to train the neural networks and to construct TF-IDF vectors. TF-IDF vectors were built over the top 2000 words ordered by term frequency across the corpus. At the same time, the values of the additional features were calculated on the full preview texts.

To build the TF-IDF vectors and the CNN, the texts were preprocessed. The preprocessing included the following steps: special character removal, lowercasing, lemmatization, and stop-word removal. Text preprocessing was implemented using NLTK [23] and Pymorphy2 [19].

Table 4 shows the results obtained for each type of features (e.g. RF baseline + readability, LSVC baseline + readability) and the results of the models trained only on the additional features without TF-IDF vectors (e.g. RF (readability), LSVC (readability)). For publishing attributes, we evaluated two separate types of models. The first type used book abstracts as supplementary information; in other words, we added the texts of the abstracts to the book previews and built new TF-IDF vectors. The second type used baseline TF-IDF vectors with an additional age rating feature. Finally, we evaluated three types of combined models: using all considered features, using only all additional features, and using all features with the exception of the publishing attributes. The results obtained were compared with the results of three neural models:

– RuBERT [20], based on the BERT architecture [8]. BERT showed state-of-the-art results on a wide range of NLP tasks. RuBERT was trained on the Russian part of Wikipedia and news data.
The model was implemented using the PyTorch [28] and Transformers [43] libraries and was trained for 3 epochs (the maximum sequence length for BERT is 512 tokens; moreover, due to the rather large volume of the corpus, we were limited in computational resources);

– an FNN trained on fine-tuned RuBERT [20] text embeddings obtained with the PyTorch library [28]. Text embeddings were calculated by averaging the token vectors of the last hidden state. The age rating was presented as a one-hot numeric array, which is the most widely used coding scheme [31]. The FNN consisted of three layers: an input layer, a hidden layer of 1024 units with a hyperbolic tangent activation function, and an output layer with a softmax activation function. We used Adam as the optimizer and a binary cross-entropy loss. The FNN model was implemented using the Keras [13] library;

– a CNN trained on Word2Vec embeddings [21]. The CNN consisted of four building units: three convolutional units (CU) and a fully connected unit. Each CU contained the following sequence of layers: C − BN − C − BN − P, where C is a convolutional layer (CL), BN is a batch normalization layer,
P is a pooling layer. After every CL, the LeakyReLU activation function was applied. At the first CU we used 512 filters of size 7 × 7; at the second CU we applied 1024 filters of size 5 × 5. As the pooling strategy at each layer we used max pooling with a 2 × 2 kernel. The fully connected unit consisted of the sequence of layers FL − BN − FN, where FL is a hidden layer with 32 neurons and FN is an output layer. We applied ReLU as an activation function and used stochastic gradient descent with Nesterov momentum as the optimizer. The model was implemented with PyTorch [28].

Table 4. Age-based classification results (%).

Model                                      | Accuracy | F1-score | Precision | Recall
Baselines
RF (TF-IDF)                                | 85.8     | 86.37    | 90        | 83.03
LSVC (TF-IDF)                              | 83.7     | 84.01    | 85.6      | 82.47
Readability features
RF baseline + readability                  | 86.5     | 86.91    | 89.6      | 84.37
LSVC baseline + readability                | 84.9     | 85.63    | 90        | 81.67
RF (readability)                           | 60.6     | 61.6     | 63.2      | 60.08
LSVC (readability)                         | 60.8     | 61.53    | 52.6      | 74.11
Sentiment features
RF baseline + sentiment                    | 85.4     | 86.04    | 90        | 82.42
LSVC baseline + sentiment                  | 83.8     | 84.09    | 85.6      | 82.63
RF (sentiment)                             | 59.4     | 62.2     | 66.8      | 58.19
LSVC (sentiment)                           | 68       | 67.01    | 65        | 69.15
Lexical features
RF baseline + lexical                      | 83.2     | 84.23    | 89.9      | 79.23
LSVC baseline + lexical                    | 84.2     | 84.45    | 85.8      | 83.14
RF (lexical)                               | 63.1     | 64.62    | 67.4      | 62.06
LSVC (lexical)                             | 61.8     | 62.91    | 64.8      | 61.13
Grammatical features
RF baseline + grammatical                  | 85.5     | 86.26    | –         | –
LSVC baseline + grammatical                | –        | –        | 91        | –
Combined features
RF (all features)                          | 94.7     | 94.67    | 94.2      | 95.15
LSVC (all features)                        | 94.2     | 94.09    | 92.4      | 95.85
LSVC baseline + all features               | 95.8     | 95.77    | 95        | 96.54
RF baseline + all features – publ. attr.   | 86.1     | 86.41    | 88.4      | 84.51
LSVC baseline + all features – publ. attr. | 87.3     | 87.54    | 89.2      | 85.93
Neural models
RuBERT                                     | 90.5     | 90.16    | –         | 93.55
FNN (RuBERT embs + age rating)             | 94.8     | 94.78    | 94.4      | 95.31
CNN                                        | 82.1     | 80.2     | 89.6      | 72.59

The results show that additional features in most cases improve the quality of the baselines. According to F1-score, this concerns readability features, general features, and publishing attributes. We assume that the advantage of these features is that they describe the text at the document level and allow the model to evaluate the whole text, not just a fragment. It can also be seen that using the abstracts and the age rating feature significantly improves the quality of the classification. The best results were obtained by the LSVC model using all considered features: 95.8% accuracy, 95.77% F1-score, 95% precision, and 96.54% recall (see Table 4). Among the models that did not use publishing attributes, the best results were shown by RuBERT (90.5% accuracy, 90.16% F1-score, and 93.55% recall) and the LSVC baseline with grammatical features (91% precision).

6 Conclusion

The purpose of the current study was to evaluate different types of features for the task of age-based text classification. The results of this investigation show that features used in text difficulty evaluation can improve the quality of age-based classification. In addition, in this study we considered publishing attributes (such as book abstracts and age ratings) as classification features. The results showed that the use of these attributes in digital libraries and recommendation systems could significantly improve the quality of machine learning approaches. Our further research will focus on studying other types of features, such as named entity analysis or plot and character features.
References
1. Aivazyan, S.A., Bukhshtaber, V.M., Enyukov, I.S. et al.: Applied Statistics: Clas-sification and Dimension Reduction: A Handbook. Finansy i statistika, Moscow(1989).2. Azpiazu, I.M., Pera, M.S.: Multiattentive recurrent neural network architecture formultilingual readability assessment. Trans. of the Association for Comp. Ling. ,421–436. https://doi.org//10.1162/tacl a 002783. Balyan, R., McCarthy. K.S., McNamara, D.S.: Applying Natural LanguageProcessing and Hierarchical Machine Learning Approaches to Text Diffi-culty Classification. Int. J. of Art. Intelligence in Education, 1–34 (2020).https://doi.org/10.1007/s40593-020-00201-74 A. Glazkova et al.4. Bertills, Y.: Beyond identification : proper names in children’s literature. AboAkademi University Press, Turku (2003).5. Corpus and Baselines for Age-Based Text Classification, https://github.com/oldaandozerskaya/age_based_classification . Lastaccessed 24 Sep 2020.6. Crossley, S.A., Skalicky, S., Dascalu, M. et al.: Predicting text com-prehension, processing, and familiarity in adult readers: New ap-proaches to readability formulas. Discourse Processes , 5–6 (2017).https://doi.org/10.1080/0163853x.2017.12962647. Cuzzocrea, A., Bosco, G. L., Pilato, G. et al.: Multi-class text complex-ity evaluation via deep neural networks. LNCS , 313–322 (2019).https://doi.org/0.1007/978-3-030-33617-2 328. Devlin, J., Chang, M.W., Lee, K. et al.: Bert: Pre-training of deep bidirectionaltransformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).9. Dogruel, L., Joeckel, S.: Video game rating systems in the US and Europe: Com-paring their outcomes. International Communication Gazette
7, 672–692 (2013).10. Didegah, F., Thelwall, M.: Which factors help authors produce the highest impactresearch? Collaboration, journal and document properties. J. of informetrics (4),861–873 (2013). https://doi.org/10.1016/j.joi.2013.08.00611. Grealy, L., Driscoll, C., Cather, K.: A history of age-based film classification inJapan. Japan Forum (2020). https://doi.org/10.1080/09555803.2020.177805812. Glazkova, A.: An Approach to Text Classification based on AgeGroups of Addressees. SPIIRAS Proceedings (3), 51–69 (2017).https://doi.org/10.15622/sp.52.313. Gulli, A., Pal, S. Deep learning with Keras. Packt Publishing Ltd (2017).14. Kim, S.W. et al. A Global Comparative Study on the Game Rating System. Journalof Digital Convergence
12, 91–108 (2019).
15. Federal Law of Dec. 29, 2010 N 436-FZ "On the Protection of Children from Information Harmful to Their Health and Development". Last accessed 23 Jul 2020.
16. Flekova, L., Stoffel, F., Gurevych, I. et al.: Content-based Analysis and Visualization of Story Complexity. In: Visualisierung sprachlicher Daten, pp. 185–223. Heidelberg University Publishing, Heidelberg (2018).
17. Juilland, A.G., Brodin, D.R., Davidovitch, C.: Frequency dictionary of French words. Hague, Paris (1971).
18. Hamid, R.S., Shiratuddin, N.: Age Classification of the Existing Digital Game Content Rating System Across the World: A Comparative Analysis. In: Proceedings of KMICe, pp. 218–222 (2018).
19. Korobov, M.: Morphological analyzer and generator for Russian and Ukrainian languages. In: International Conference on Analysis of Images, Social Networks and Texts, pp. 320–332. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26123-2_31
20. Kuratov, Y., Arkhipov, M.: Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213 (2019).
21. Kutuzov, A., Kuzmenko, E.: WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models. CCIS, 155–161 (2017). https://doi.org/10.1007/978-3-319-52920-2_15
22. Laposhina, A.N., Veselovskaya, T.S., Lebedeva, M.U. et al.: Automated text readability assessment for Russian second language learners. In: Komp. Lingv. i Intel. Tehn., pp. 396–406 (2018).
23. Loper, E., Bird, S.: NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002).
24. Loukachevitch, N., Levchik, A.: Creating a General Russian Sentiment Lexicon. In: Proc. of LREC-2016, pp. 1171–1176 (2016).
25. Lyashevskaya, O.N., Sharov, S.A.: Frequency Dictionary of the Modern Russian Language (based on the materials of the National Corpus of the Russian Language). Azbukovnik, Moscow (2009).
26. 
Mukherjee, P., Leroy, G., Kauchak, D.: Using Lexical Chains to Identify Text Difficulty: A Corpus Statistics and Classification Study. J. of Biomed. and Health Informatics (5), 2164–2173 (2019). https://doi.org/10.1109/jbhi.2018.2885465
27. Oborneva, I.V.: Automated estimation of complexity of educational texts on the basis of statistical parameters. Pedagogy Cand. Diss., Moscow (2006).
28. Paszke, A., Gross, S., Massa, F. et al.: PyTorch: An imperative style, high-performance deep learning library. In: Adv. in Neural Information Processing Systems, pp. 8026–8037 (2019).
29. Pedregosa, F., Varoquaux, G., Gramfort, A. et al.: Scikit-learn: Machine learning in Python. The J. of Machine Learning Research, 2825–2830 (2011).
30. Piasecki, S., Malekpour, S.: Morality and religion as factors in age rating computer and video games: ESRA, the Iranian games age rating system. Online-Heidelberg Journal of Religions on the Internet, 11.
31. Potdar, K., Pardawala, T.S., Pai, C.D.: A comparative study of categorical variable encoding techniques for neural network classifiers. International Journal of Computer Applications
4, 7–9 (2017). https://doi.org/10.5120/ijca2017915495
32. Russian National Corpus, https://ruscorpora.ru/new/en/index.html. Last accessed 23 Jul 2020.
33. Shafaei, M., Samghabadi, N.S., Kar, S., Solorio, T.: Age Suitability Rating: Predicting the MPAA Rating Based on Movie Dialogues. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 1327–1335 (2020).
34. Sharoff, S.: Meaning as use: exploitation of aligned corpora for the contrastive study of lexical semantics. In: Proc. of LREC02, pp. 447–452. Las Palmas, Spain (2002).
35. Schicchi, D., Pilato, G., Bosco, G.L.: Deep Neural Attention-Based Model for the Evaluation of Italian Sentences Complexity. In: 2020 IEEE 14th ICSC, pp. 253–256. https://doi.org/10.1109/icsc.2020.00053
36. Schwarm, S.E., Ostendorf, M.: Reading level assessment using support vector machines and statistical language models. In: Proc. of ACL'05, pp. 523–530 (2005). https://doi.org/10.3115/1219840.1219905
37. Solnyshkina, M., Ivanov, V., Solovyev, V.: Readability Formula for Russian Texts: A Modified Version. In: Proc. of MICAI 2018, part II, pp. 142–145. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-04497-8_11
38. Solovyev, V., Solnyshkina, M., Ivanov, V. et al.: Prediction of reading difficulty in Russian academic texts. J. of Int. & Fuzzy Systems (5), 4553–4563 (2019). https://doi.org/10.3233/jifs-179007
39. Sung, Y.T., Chen, J.L., Cha, J.H. et al.: Constructing and validating readability models: the method of integrating multilevel linguistic features with machine learning. Behavior Research Methods (2), 340–354 (2015). https://doi.org/10.3758/s13428-014-0459-x
40. Templin, M.C.: Certain language skills in children; their development and interrelationships. Univ. of Minnesota Press, Minneapolis (1957).
41. Text readability rating, http://readability.io/