HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition
Avihay Chriqui, Inbal Yahav
Abstract
Sentiment analysis of user-generated content (UGC) can provide valuable information across numerous domains, including marketing, psychology, and public health. Currently, there are very few Hebrew models for natural language processing in general, and for sentiment analysis in particular; indeed, it is not straightforward to develop such models because Hebrew is a Morphologically Rich Language (MRL) with challenging characteristics. Moreover, the only available Hebrew sentiment analysis model, based on a recurrent neural network, was developed for polarity analysis (classifying text as "positive", "negative", or neutral) and was not used for detection of finer-grained emotions (e.g., anger, fear, joy). To address these gaps, this paper introduces HeBERT and HebEMO. HeBERT is a Transformer-based model for modern Hebrew text, which relies on a BERT (Bidirectional Encoder Representations from Transformers) architecture. BERT has been shown to outperform alternative architectures in sentiment analysis, and is suggested to be particularly appropriate for MRLs. Analyzing multiple BERT specifications, we find that while model complexity correlates with high performance on language tasks that aim to understand terms in a sentence, a more parsimonious model better captures the sentiment of an entire sentence. Notably, regardless of the complexity of the BERT specification, our BERT-based language model outperforms all existing Hebrew alternatives on all common language tasks. HebEMO is a tool that uses HeBERT to detect polarity and extract emotions from Hebrew UGC. HebEMO is trained on a unique Covid-19-related UGC dataset that we collected and annotated for this study. Data collection and annotation followed an active learning procedure that aimed to maximize predictability. We show that HebEMO yields a high F1-score of 0.96 for polarity classification. Emotion detection reaches F1-scores of 0.78-0.97 for various target emotions, with the exception of surprise, which the model failed to capture (F1 = 0.41). These results are better than the best-reported performance, even among English-language models of emotion detection.
Preprint submitted to XX, December 2020

1. Introduction

Sentiment analysis, also referred to as opinion mining or subjectivity analysis (Liu and Zhang 2012), is probably one of the most common tasks in natural language processing (NLP) (Liu 2012, Zhang et al. 2018). The goal of sentiment analysis is to systematically extract, from written text, what people think or feel toward entities such as products, services, individuals, events, news articles, and topics. Sentiment analysis includes multiple types of tasks, one of the most common being polarity classification: the binning of overall sentiment into the three categories of positive, neutral, or negative. Another prominent sentiment analysis task is emotion detection - a process for extracting finer-grained emotions such as happiness, anger, and fear from human language. These emotions, in turn, can shed light on individuals' beliefs, behaviors, or mental states. Both polarity classification and emotion detection have proven to yield valuable information in diverse applications. Research in marketing, for example, has shown that emotions that users express in online product reviews affect products' virality and profitability (Chitturi et al. 2007, Ullah et al. 2016, Adamopoulos et al. 2018). In finance, Bellstam et al. (2020) extracted sentiments from financial analysts' textual descriptions of firm activities, and used those sentiments to measure corporate innovation. In psychology, sentiment analysis has been used to detect distress in psychotherapy patients (Shapira et al. 2020), and to identify specific emotions that might be indicative of suicidal intentions (Desmet and Hoste 2013). Notably, recent studies suggest that the capacity to identify certain emotions (e.g., fear or distress) can contribute towards the understanding of individuals' behaviors and mental health in the Covid-19 pandemic (Ahorsu et al.
2020, Pfefferbaum and North 2020). The literature offers a considerable number of methods and models for sentiment analysis, with a strong bias towards polarity detection. Models for emotion detection, though less common, are also accessible to the research community in multiple languages. As yet, however, emotion detection models do not support the Hebrew language. In fact, to our knowledge, only one study thus far has developed a Hebrew-language model for sentiment analysis of any kind (specifically, polarity classification; Amram et al. (2018)). Notably, existing sentiment analysis methods developed for other languages are not easily adjustable to Hebrew, due to unique linguistic and cultural features of this language. A key challenge in the development of Hebrew-language sentiment analysis tools relates to the fact that Hebrew is a Morphologically Rich Language (MRL), defined as a language "in which significant information concerning syntactic units and relations is expressed at word-level" (Tsarfaty et al. 2010). In Hebrew, as in other MRLs (e.g., Arabic), grammatical relations between words are expressed via the addition of affixes (suffixes, prefixes), instead of the addition of particles. Moreover, the word order in Hebrew sentences is rather flexible. Many words have multiple meanings, which change depending on context. Further, written Hebrew contains vocalization diacritics, known as
Niqqud ("dots"), which are missing in non-formal scripts; other Hebrew characters represent some, but not all, of the vowels. Thus, it is common for words that are pronounced differently to be written in the same way. These unique characteristics of Hebrew pose a challenge in developing appropriate Hebrew NLP models. Architectural choices should be made with care, to ensure that the features of the language are well represented. The current best practice for Hebrew NLP is the use of the multilingual BERT model (mBERT, based on the BERT [Bidirectional Encoder Representations from Transformers] architecture, discussed further below; Devlin et al. (2018)), which was trained on a small-sized Hebrew dictionary. When tested on Arabic (the closest language to Hebrew), mBERT was shown to have significantly lower performance than a language-specific BERT model on multiple language tasks (Antoun et al. 2020). This paper achieves two main goals related to the development of Hebrew-language sentiment analysis capabilities. First, we pre-train a language model for modern Hebrew, called
HeBERT, which can be implemented in diverse NLP tasks, and is expected to be particularly appropriate for sentiment analysis (as compared with alternative model architectures). HeBERT is based on the well-established BERT architecture (Devlin et al. 2018); the latter was originally trained for the unsupervised fill-in-the-blank task (known as Masked Language Modeling, or MLM; Fedus et al. (2018)). We train HeBERT on two large-scale Hebrew corpora - Hebrew Wikipedia and OSCAR (Open Super-large Crawled ALMAnaCH corpus, a huge multilingual corpus based on open web crawl data; Ortiz Suárez et al. (2020)). We then evaluate HeBERT's performance on five key NLP tasks, namely, fill-in-the-blank, out-of-vocabulary (OOV), Named Entity Recognition (NER), Part of Speech (POS) tagging, and sentiment (polarity) analysis. We examine several architectural choices for our model and put forward and test hypotheses regarding their relative performance, ultimately selecting the best-performing option. Specifically, we show that while model complexity correlates with high performance on language tasks that aim to understand terms in a sentence, a more parsimonious model better captures the sentiment of an entire sentence. Second, we develop a tool to detect sentiments - specifically, polarity and emotions - from user-generated content (UGC). Our sentiment detector, called
HebEMO, is based on HeBERT and operates on a document level. We apply HebEMO to user-generated comments, from three major news sites in Israel, that were posted in response to Covid-19-related articles during 2020. We chose this dataset on the basis of findings that the Covid-19 pandemic intensified emotions in multiple communities (Pedrosa et al. 2020), suggesting that online discourse regarding the pandemic is likely to be highly emotional. Comments were selected for annotation following an innovative semi-supervised iterative labeling approach that aimed to maximize predictability. We show that HebEMO achieves a high performance of weighted average F1-score = 0.96 for polarity classification. Emotion detection reaches F1-scores of 0.8-0.97 for the various target emotions, with the exception of surprise, which the model failed to capture (F1 = 0.41). These results are better than the best reported performance, even when compared to English-language models for emotion detection (Ghanbari-Adivi and Mosleh 2019, Mohammad et al. 2018). The remainder of this paper is organized as follows. In the next section, we provide a brief overview of the state of the art in sentiment analysis in general and emotion recognition in particular; we also briefly discuss considerations that must be taken into account when developing pre-trained language models for sentiment analysis. Next, we present HeBERT, our language model, elaborating on how we address some of the unique challenges associated with the Hebrew language. We subsequently describe HebEMO and evaluate its performance on our UGC data.
2. Background
Psychologists and psychoanalysts have long known that, despite the importance of non-verbal behavior, words are the most natural way to externally express an inner emotional world (Ortony et al. 1987). In line with this premise, theories of emotions stress that emotional experience and its intensity can be inferred from spoken or written language (Argaman 2010). Yet, emotions vary across cultures (Rosaldo et al. 1984), and, consequently, languages differ in the degree of emotionality they convey and in the ways in which emotions are expressed in words (Wierzbicka 1994, Kövecses 2003). In particular, as noted by Kövecses (2003), the verbalization of emotions commonly relies on the use of metaphorical and metonymic expressions, which may differ across languages. Religion is another source of variation in emotional experience and its associated expression (Kim-Prieto and Diener 2009). One study showed how the moral system of a culture - and specifically, a Middle Eastern culture - can be linked to certain types of emotions, and suggested that differences in culturally dominant emotions can play a decisive role in cultural clashes (Fattah and Fierke 2009). The above discussion implies that emotion detection tools that are implemented in one language might not be easily transferable to other languages, particularly languages that are culturally distant. Accordingly, sentiment analysis tools must be tailored to specific language models in order to provide informative results. The current paper proposes such a tool for the Hebrew language - one that takes into account specific linguistic challenges associated with Hebrew, elaborated in subsequent sections.
Many studies offer comprehensive overviews of common sentiment analysis methods (e.g., Liu et al. 2019, Hemmatian and Sohrabi 2019, Yue et al. 2019, Yadav and Vishwakarma 2020). We present here the main points, with an emphasis on models that form the basis of this study. Most of the models described below were developed primarily for polarity analysis; however, as noted in the following subsection, the architectures are applicable to other sentiment analysis tasks such as emotion detection. Current reviews on sentiment analysis tend to categorize the various approaches according to the granularity level of text that they accommodate (Liu et al. 2019): document level, that is, evaluating whether an entire document expresses a particular type of sentiment (e.g., positive or negative); sentence level - assigning a sentiment to each sentence in the document separately; and aspect level, that is, assigning sentiment to each "aspect" discussed in the text. The latter requires a pre-processing step to extract aspects from a written text. In this paper we follow a document-level approach, elaborated further below. Sentiment classification approaches can further be categorized according to their underlying methodologies. The first, and perhaps the most popular, methodology is the lexicon-based approach. Based on the theory of emotions, this approach uses sentiment terms to score emotions in an input text.
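The lexicon-based approach can be sketched in a few lines: every token found in an emotion lexicon votes for its emotion categories. The miniature English lexicon below is invented purely for illustration and is far smaller than any psychometrically validated dictionary.

```python
from collections import Counter

# Toy emotion lexicon (invented for illustration only).
EMO_LEXICON = {
    "love": {"joy", "positive"},
    "great": {"joy", "positive"},
    "afraid": {"fear", "negative"},
    "terrible": {"negative"},
}

def score_text(text: str) -> Counter:
    """Count lexicon hits per emotion category in a whitespace-tokenized text."""
    scores = Counter()
    for token in text.lower().split():
        for category in EMO_LEXICON.get(token, ()):
            scores[category] += 1
    return scores

print(score_text("I love this great product"))
```

Note that such a scorer treats every occurrence of a term identically, which is exactly the context-blindness (sarcasm, ambiguity, idioms) discussed next.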
Linguistic inquiry and word count (LIWC), for example, is a popular software program that was developed to assess (among other features) emotions in text, using a psychometrically validated internal dictionary (Pennebaker et al. 2001). The main advantage of the lexicon-based approach is that it is unsupervised, meaning that it can be applied without any training or labeled data (Yue et al. 2019). The main limitation of this approach is that it does not account for the context of terms in the lexicon, and thus overlooks complex linguistic features such as sarcasm, ambiguity, and idioms (Liu 2012). Accordingly, its accuracy is fairly low compared with the alternative approaches. The second sentiment classification approach is Deep Learning (DL)-based. DL approaches are supervised methods that are based on multiple-layer neural networks. DL-based sentiment classification models differ by their network architecture. Common architectures include the following: (1)
Convolutional Neural Networks (CNNs), which transform a structured input layer (e.g., sentences or documents represented as bag-of-words or word-embedding vectors), via convolutional layers, into a sentiment class (Kim 2014); (2)
Recursive or Recurrent Neural Networks (RNNs), which handle unstructured sequential data, such as textual sentences, and learn the relations between the sequential elements (Dong et al. 2014); and (3)
Long Short-Term Memory (LSTM), a popular variant of RNN, which can catch long-term dependencies between data segments, in one direction (e.g., left to right) or in both (denoted bidirectional LSTM, or
BiLSTM architecture) (Hochreiter and Schmidhuber 1997). In a recent paper, Amram et al. (2018) raised the question of the relationship between the characteristics of a language and the DL architectural choices of a sentiment classifier. They analyzed this question for the morphologically rich Hebrew language. Specifically, they compared the performance of CNN and BiLSTM architectures on a polarity classification task. They assumed that the latter method would implicitly capture main morphological signatures, and thus outperform the former. Interestingly, and in contrast to findings in English sentiment analysis (Yin et al. 2017, Acheampong et al. 2020), they found that CNN yielded overall better performance (accuracy = 0.89) than BiLSTM, even when the latter was trained on morphologically segmented inputs. As far as we know, this is the only paper that developed and evaluated a sentiment analysis model for the Hebrew language. The last sentiment classification method, which we adopt in this paper, is the transfer learning-based approach. Transfer learning is the act of carrying knowledge gained from one problem and applying it to another, similar problem (Pan and Yang 2009). In NLP, transfer learning is implemented via
Transformers (Tay et al. 2020). Similarly to RNNs, Transformers use a DL approach to process sequential data. The primary advantage of the Transformer is its unique attention mechanism, which eliminates the need to process data in order, and allows for parallelization (Vaswani et al. 2017). With Transformers, a target language is first algorithmically learned, irrespective of the target language task (e.g., sentiment analysis task). To this end, a language model is trained on a pre-selected unsupervised NLP task (see Section 3 for details). Then the language model is transferred to the target task. This process is called fine-tuning. Various pre-trained language models have been used in transfer learning for NLP; these include fastText (Joulin et al. 2016), ELMo (Embeddings from Language Models, based on forward and backward LSTMs) (Peters et al. 2018), GPT (Generative Pre-trained Transformer) (Radford et al. 2018), and BERT (Devlin et al. 2018). Of these, BERT is one of the most common Transformer models for NLP. For sentiment analysis tasks, BERT models - and Transformer models in general - are widely used and produce the best results compared with alternatives (Zampieri et al. 2019, Patwa et al. 2020). For the Hebrew language, the only BERT model available is mBERT (Devlin et al. 2018), which was trained on a small-sized Hebrew dictionary (about 2000 tokens). Notably, for the Arabic language, which is the closest MRL to Hebrew, Antoun et al. (2020) showed that a pre-trained Arabic BERT model achieved better performance on polarity analysis than did any other architecture (an improvement of 1% to 6% in accuracy). The Arabic-specific model also achieved better performance compared with mBERT.
Emotion recognition is a sub-task in sentiment analysis that offers a finer sentiment granularity compared with polarity analysis. Two definitions of human emotions dominate the NLP literature, with no clear preference between them (Kratzwald et al. 2018). The first definition, based on a theory developed by Ekman (1999), considers emotions as distinct categories, meaning that each emotion differs from the others in important ways rather than simply in intensity. Ekman (1999) identified six basic emotions, consistent across cultures, that fit facial expressions: anger, disgust, fear, happiness, sadness and surprise. The second definition is based on a theory by Plutchik (1980), who stressed that emotions can be treated as dimensional constructs, and that there are relations between occurrences and intensities of basic emotions. In particular, Plutchik (1980) defined a "wheel" comprising four polar pairs of basic emotions: joy-sadness, anger-fear, trust-disgust, and surprise-anticipation. Combinations of dyads or triads of emotions define another set of 56 emotions. For example, envy is a combination of sadness and anger. This wheel serves as the theoretical basis of common automated emotion detection algorithms (Medhat et al. 2014). Notably, for the purpose of emotion detection, the two conceptualizations of emotion are generally compatible with each other, as they agree on the set of emotions defined as "basic" emotions. Though common, emotion recognition is not as widespread as polarity analysis, and it is considered more challenging (Acheampong et al. 2020). A key challenge is that, whereas any text can be classified according to its polarity, not all texts contain emotions, and thus it is harder to infer emotions via a lexicon-based approach. This challenge is further compounded by the fact that labeled data are commonly not available.
Further, existing datasets are rather imbalanced. Naturally, the lack of data availability is more severe in non-English languages (Ahmad et al. 2020). In general, the emotion detection task is treated as a multi-label classification task, and models for emotion recognition are similar in architecture to polarity detection models. Recent research has shown that, in emotion detection tasks, pre-trained BiLSTM architectures provide advantages over CNN and unidirectional RNN models (Acheampong et al. 2020), and that Transformers are preferable to other DL approaches (Chatterjee et al. 2019, Zhong et al. 2019). For example, in a recent SemEval competition (Chatterjee et al. 2019) that included an emotion detection task for three emotions (angry, happy, sad), Transformer-based models were shown to give the best performance (performance ranges: F1-score = 0.75-0.8; precision = 0.78-0.85; recall = 0.78-0.85).
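The multi-label formulation can be sketched minimally: one independent sigmoid output per emotion, thresholded into a label set, so a single document can express several emotions at once. The emotion list and logit values below are illustrative, not learned parameters.

```python
import math

# Illustrative emotion inventory (a subset of the basic emotions).
EMOTIONS = ["anger", "fear", "joy", "sadness"]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_emotions(logits, threshold=0.5):
    """Turn one real-valued logit per emotion into a set of predicted labels.

    Unlike softmax-based (single-label) classification, each emotion is
    decided independently, which is what makes the task multi-label.
    """
    return {e for e, z in zip(EMOTIONS, logits) if sigmoid(z) >= threshold}

# A document may trigger several labels simultaneously:
print(predict_emotions([2.1, 0.3, -1.5, 1.0]))
```

In a Transformer-based classifier, the logits would come from a linear head on the encoder's pooled output; the thresholding step is the same.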
As noted above, transfer learning for polarity analysis and/or emotion recognition requires a pre-trained language model. To develop and train a language model, one needs to make the following three basic decisions:

1. Input representation (tokenization): What is the granularity of the tokens that are fed to the model? Common granularity levels include characters, n-gram-based sub-words (using the WordPiece algorithm (Schuster and Nakajima 2012)), morpheme-based sub-words, and full words (see Figure 1 for the differences between the approaches).

2. Architectural choices: What is the exact architecture and specification of the neural network?

3. Output: What is the (unsupervised) task that the model is trained on?
Figure 1: Input representation alternatives
Regarding input representation, the choice of representation affects the features that the language model is able to capture, and the training complexity. Character-based representation is better for learning word morphology, especially for low-frequency words and MRLs (Belinkov et al. 2017, Vania et al. 2018), but it comes with longer training time and a deeper architecture, compared with other representations (Bojanowski et al. 2015). Word-based representation, in turn, treats each word as a separate token, and thus is considered better for understanding semantics (Pota et al. 2019). With this representation, however, words that differ by prefix or suffix are considered different, necessitating storage of a very large vocabulary. Moreover, out-of-vocabulary (OOV) tokens are not represented. The intermediate option is to use a sub-word representation, which provides some balance between the character- and word-based representations; moreover, it overcomes the OOV problem associated with the word-based representation, and its vocabulary requirements are more manageable (Wu et al. 2016). With sub-words, words can be broken either into n-gram characters, or according to morphemes that have lingual meaning (but also higher computational costs). Previous literature has produced mixed results regarding the extent to which a morpheme-based approach can improve upon the n-gram-based approach (Bareket and Tsarfaty 2020). Recently, Klein and Tsarfaty (2020) showed that sub-word splitting in the multilingual BERT model (mBERT, Devlin et al. (2018)) is sub-optimal for capturing morphological information. For the question of architecture selection, Devlin et al. (2018) and Radford et al. (2019) showed that for similar model size, BERT outperforms other architectures such as GPT and ELMo on sentiment tasks. With respect to the model output, there are two tasks on which a model can be trained.
The first is predict-the-future, meaning that the model is trained to predict the last token of a sentence. This task accounts for uni-directional contexts only. The second is the fill-in-the-blank task, where the model is trained to fill in a missing token within a sentence. This task takes into account the full (bi-directional) sentence context, and is able to better capture the meanings of tokens, both syntactically and semantically (Devlin et al. 2018). Recently, Levine et al. (2020) offered a method to optimize these tasks, called
Pointwise Mutual Information (PMI) masking. The authors suggested that instead of filling in a single random token, the model should be trained to fill in a set of tokens that carry mutual information.
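The notion of token pairs that "carry mutual information" can be made concrete by scoring adjacent pairs with pointwise mutual information. The sketch below uses a tiny made-up corpus; actual PMI masking computes such statistics at corpus scale and masks whole high-PMI spans rather than single tokens.

```python
import math
from collections import Counter

# Toy corpus (invented): "new york" is a collocation, "is big" is generic.
corpus = "new york is big . new york is far . the city is big .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1: str, w2: str) -> float:
    """Pointwise mutual information of an adjacent token pair:
    log p(w1, w2) / (p(w1) * p(w2))."""
    p_joint = bigrams[(w1, w2)] / n_bi
    p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_joint / (p1 * p2))

# The collocation scores higher than the generic pair, so it would be
# preferentially selected for joint masking:
assert pmi("new", "york") > pmi("is", "big")
```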
3. HeBERT: Language Model
In this section we develop an unsupervised Hebrew BERT model, which we will later fine-tune for the tasks of polarity analysis and emotion recognition.
We begin by addressing the three key modeling decisions outlined in the previous section - input representation (tokenization), architecture, and output - in the context of the Hebrew language. Recall that, as discussed in the introduction, Hebrew is an MRL with the following important characteristics: (i) grammatical relations in Hebrew are expressed via the addition of affixes; (ii) Hebrew sentences are nearly order-free; (iii) many Hebrew words have multiple meanings, which change depending on context; (iv) Hebrew contains vocalization diacritics that are missing in non-formal scripts, implying that words that are pronounced differently can be written in the same way. Bearing these features in mind, we first address the last two questions, of architectural choice and model output. As discussed in previous sections, BERT has been shown to outperform alternative architectures in sentiment analysis tasks (Radford et al. 2019); moreover, the literature offers evidence that BERT networks effectively capture linguistic information and phrase-level information (Jawahar et al. 2019), a necessary requirement for MRLs (Tsarfaty et al. 2020). Accordingly, we decided to use BERT as our base model, with the default architecture. For the output task, we used BERT's default fill-in-the-blank task.
Fill-in-the-blank has the advantage of understanding bi-directional context, which corresponds to the order-free property of Hebrew sentences. With respect to the input - the granularity of the tokens - the literature on MRLs, and Hebrew specifically, is inconclusive. Belinkov et al. (2017) and Vania et al. (2018) showed that character-based representation, which is becoming increasingly popular, is better than word-based representation for learning Hebrew morphology, especially for low-frequency words. For sentiment tasks, however, Amram et al. (2018) and Tsarfaty et al. (2020) showed that a word-based representation yields better predictions than a char-based representation. With regard to sub-word representations, Klein and Tsarfaty (2020) suggested (but did not verify) that, for BERT for Hebrew, morpheme-based sub-words are likely to be preferable to n-gram-based sub-words. A similar argument was made for Arabic, which is the closest MRL to Hebrew (Antoun et al. 2020). To understand what causes differences in findings between different researchers, consider the following three examples:

1. First is the word NA'AL. NA'AL can be translated as either locked (e.g., he locked the door), a shoe, or the past singular tense of the verb to wear (a shoe). It is also often used as a slang term for stupid. The actual semantic meaning of NA'AL in a sentence is derived from the context. In that respect, a high-level text granularity (such as a word-based representation) might be the preferable choice for representing Hebrew, as it is better at capturing semantic meanings in context (Pota et al. 2019).

2. Next is the word NA'ALO, which is an inflection of the word NA'AL with the suffix "O". NA'ALO can refer to either "his shoe" or "locked it". In that respect, a finer text granularity, such as char-based, which is better at learning morphology, might be preferred.

3. Finally, consider the splitting of the word NA'ALO. Here, a meaningful splitting would be NA'AL-O. However, such a splitting can only be achieved with morpheme-based sub-words, using a tool such as YAP (Yet Another Parser, by More et al. (2019)). The alternative, n-gram-based sub-words, will result in additional splitting, which might have lower semantic meaning than morpheme-based sub-words, yet higher robustness to OOV.

Given the above discussion, we hypothesize that sub-word representations (n-gram- or morpheme-based), which balance semantic meaning with morphology, will best capture the features of the Hebrew language, and will yield better performance on various language tasks, as compared with character-based and word-based representations. Comparing n-gram-based sub-words with morpheme-based sub-words, we expect the latter to have an advantage on token-level tasks that require a good "understanding" of the language features; yet, a morpheme-based representation might not have such an advantage in document-level downstream tasks. To examine our hypothesis, we first train and evaluate multiple small-size BERT models that differ by the granularity of the input. We then choose the best-performing architecture, and re-train the model on a much larger corpus.
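The sub-word trade-off in the examples above can be made concrete with a greedy longest-match tokenizer in the style of WordPiece. The toy vocabulary below uses the transliterated forms from the examples; real BERT vocabularies are learned from the corpus, so the exact splits would differ.

```python
# Toy vocabulary: a stem plus suffix pieces ("##" marks a word-internal piece).
VOCAB = {"na'al", "##o", "##im", "shoe", "door"}

def wordpiece(word: str):
    """Split a word greedily into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:  # no piece matched: the word is unrepresentable
            return ["[UNK]"]
        start = end
    return pieces

# The inflected form, absent from the vocabulary, still decomposes into
# stem + suffix instead of becoming an unknown token:
print(wordpiece("na'alo"))
```

Because pieces are reused across inflections, the OOV problem of a word-level vocabulary largely disappears, at the cost of splits that need not align with true morpheme boundaries.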
We examine five alternative text representations: char-based; two n-gram-based sub-word representations, which differ in the total vocabulary size (30K tokens vs. 50K tokens); a morpheme-based sub-word representation; and a word-based representation, which considers all words in the corpus, after trimming terms in the lowest 5th quantile according to their term frequency (vocabulary size of over 53K tokens). To compare between the input alternatives, we first train small-sized base-BERTs on a Hebrew Wikipedia dump (as of September 2013; retrieved from https://u.cs.biu.ac.il/~yogo/hebwiki/; the dataset includes over 63 million words and 3.8 million sentences). Our working assumption is that the performance of a small-sized BERT is monotonic with the model's performance when trained on a larger corpus with the same parameters, yet requires significantly fewer resources. We evaluate the models' performances on two common unsupervised language tasks and on three downstream tasks:
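The frequency-based trimming used for the word-based vocabulary can be sketched as follows. One plausible reading of "lowest 5th quantile" is dropping the bottom 5% of distinct terms ranked by count; the function below implements that reading and is an interpretation, not the authors' exact procedure.

```python
from collections import Counter

def trim_vocabulary(tokens, quantile=0.05):
    """Drop the rarest `quantile` fraction of distinct terms by frequency."""
    counts = Counter(tokens)
    ranked = sorted(counts, key=counts.get)  # rarest terms first
    n_drop = int(len(ranked) * quantile)
    return set(ranked[n_drop:])

# With quantile=0.5 on a 3-term toy corpus, the single rarest term is dropped:
print(trim_vocabulary("a a a b b c".split(), quantile=0.5))
```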
1. Unsupervised language tasks:

(a) Fill-in-the-blank - the ability to fill in a missing token; tested on a newspaper article and a fairy-tale dataset. Performance was measured with sequence perplexity (PP(W)) - a common measure of the ability of a language model to evaluate the correctness of sentences in a sample set. The perplexity of a sequence W with N tokens (W = {w_1, w_2, ..., w_N}) is calculated as the exponentiated average negative log-likelihood of the sequence: PP(W) = exp{-(1/N) * sum_{i=1}^{N} log p_theta(w_i | w_{<i})}.
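Perplexity can be computed directly from the per-token conditional probabilities p_theta(w_i | w_<i); the probability values below are made up for illustration.

```python
import math

def perplexity(token_probs):
    """PP(W) = exp(-(1/N) * sum_i log p_i) for per-token probabilities p_i."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model assigning each of four tokens probability 0.25 has perplexity 4,
# i.e., it is as uncertain as a uniform choice among 4 tokens:
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower perplexity means the model finds the evaluation sentences more probable, which is why it is used here to compare the fill-in-the-blank ability of the candidate models.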
In line with the specifications outlined above, we trained a large-size BERT on both the Wikipedia corpus and the OSCAR corpus (Ortiz Suárez et al. 2020), with a small-size n-gram-based sub-word dictionary. For the Hebrew language, OSCAR contains a corpus of size 9.8 GB, including 1 billion words and over 20.8 million sentences (after de-duplicating the original data). We used a PyTorch implementation of Transformers in Python (Wolf et al. 2020) to train a base-BERT network for 4 epochs, with learning rate = 5e-5, using the Adam optimizer in batches of 128 sentences each. The performance of the final model is reported in Table 2, and compared to the performance of (i) the (non-BERT) models reported in Amram et al. (2018), More et al. (2019), and Bareket and Tsarfaty (2020), the only other models developed for NLP tasks in Hebrew (denoted SOTA, or "state of the art"); and (ii) mBERT.

Task          Fill-in-the-blank   OOV    NER          POS          Polarity analysis
Metric        (Perplexity)        (%)    (F-1 score)  (F-1 score)  (F-1 score)
HeBERT        3.24                ∼
Current SOTA  (Not reported)      8%                               0.84
Table 2: HeBERT performance, compared to alternative models. Best results for each task are in bold.
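The pre-training setup reported above (4 epochs, learning rate 5e-5, Adam optimizer, batches of 128 sentences) can be collected as a configuration fragment. Key names loosely follow the HuggingFace `TrainingArguments` convention; this is a sketch of the reported settings, not the authors' actual training script.

```python
# Pre-training hyperparameters as reported in the text (sketch only).
PRETRAIN_CONFIG = {
    "num_train_epochs": 4,
    "learning_rate": 5e-5,
    "optimizer": "adam",
    "per_device_train_batch_size": 128,  # sentences per batch
}
```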
The results show that while mBERT outperformed HeBERT on an unsupervised task (fill-in-the-blank), HeBERT performed better on supervised tasks, even when compared to the current SOTA. Of note, mBERT contains only 2,000 tokens in Hebrew (compared to 30K in HeBERT). HeBERT's higher performance in supervised tasks is thus not surprising.
4. HebEMO: A Model for Polarity Analysis and Emotion Recognition
In this section we develop HebEMO - a model for sentiment analysis, including polarity analysis and emotion recognition. HebEMO, which is based on HeBERT, predicts sentiments at a document level; as elaborated in what follows, in our case a "document" is a single user-generated comment on a news website. The development of the model is based on three main elements: (i) data collection; (ii) data annotation; and (iii) fine-tuning of HeBERT.
The data collected for this study were compiled from user comments that were posted to Israeli news websites in response to Covid-19-related articles during the Covid-19 pandemic (Jan-Dec 2020) - a highly emotional period (Pedrosa et al. 2020). Our selection of news sites was inspired by a 2016 statement by Israel's president, Reuven (Rubi) Rivlin, according to which Israeli society is composed of four equal-sized "tribes" which are culturally different (and hence might express emotions slightly differently); of these, three comprise Hebrew-speaking Jews - namely, secular, national-religious, and ultra-Orthodox ("Haredi") - and the fourth "tribe" is Israel's Arab population (Steiner 2016). Each group is represented in both politics and the media. Accordingly, we collected data from three popular Israeli news sites that, respectively, represent the three Hebrew-speaking "tribes". Specifically, our dataset contained all Covid-19-related articles from
Ynet, which is identified with the secular "tribe" (with a slight left-wing political leaning); Israel Hayom (translation: "Israel Today"), which is identified with the national-religious "tribe" (with a slight right-wing political leaning); and Be-Hadre Haredim (translation: "In Haredis' Rooms"), which represents the ultra-Orthodox group.

For each article, we collected the article's text, its date of publication, the section of the news site in which it was published (e.g., news, health, sports), the author, and the comments section. We excluded from the dataset comments that did not contain Hebrew words, as well as comments with fewer than 3 words. We further merged repeated consecutive characters (e.g., three or more identical punctuation symbols) and removed links and double spaces. The compiled corpus, summarized in Table 3, contained over half a million comments on 10,794 titles in various sections.

Table 3: Description of the collected data (by source and section)

Figure 2: Iterative annotation process

We annotated a total of 4,000 comments. Comments were selected for annotation following active learning principles (Li et al. 2012) to minimize the well-known imbalance problem in the emotion recognition literature (Acheampong et al. 2020). The annotation process we used is described below and illustrated in Figure 2.

Our iterative process was initialized in step 1 with a naive unsupervised lexicon-based approach. For this step, we Google-translated EmoLex: a freely available English-language polarity and emotion dictionary (Mohammad and Turney 2013). EmoLex contains a list of manually collected (via crowdsourcing) English words, each classified according to one or more of the eight basic emotions and two polarity values (positive and negative). We then used the translated dictionaries to score the entire set of lemmatized comments in our dataset. Lemmatization was achieved with UDPipe (Straka et al. 2016).

In step 2, given the initial sentiment scores generated in step 1, we selected a set of 150 comments, of which 75 had received the highest positive polarity scores and 75 had received the highest negative polarity scores. Similarly, for each of the eight emotions, we selected a set of 75 comments in which the emotion was highly expressed, and another 75 comments in which the emotion was not expressed. The resulting set, after removing duplicate comments, comprised a total of 1,500 initially labeled comments.

We then turned to Prolific, a trusted online labor and research platform, to manually re-annotate the 1,500 comments. Each comment was annotated by at least three distinct native Hebrew-speaking Prolific workers. Specifically, annotators were asked to rate each comment's polarity on a symmetric 5-point scale of {strongly negative, negative, neutral, positive, strongly positive}, and to rate the expression of each emotion in the comment on a polar 3-point scale of {not expressed (in the comment), expressed, strongly expressed}. The participants were given the context of the comment (i.e., the title of the news article on which the comment was posted). Each participant annotated 20 randomly selected comments.

The reliability of the workers' annotations was then computed with Krippendorff's alpha (Krippendorff 1970), a measure of inter-rater agreement. We measured reliability independently for each sentiment in a comment, using coarser sentiment scales of polarity = {positive, neutral, negative} and emotion = {expressed, not expressed}. For example, if two raters, i and j, rated the emotion "anger" in a comment c as L^i_{c,anger} = "expressed" and L^j_{c,anger} = "strongly expressed", we computed their mutual response as "agreement" (formally, the observed disagreement between the raters was δ(L^i_{c,anger}, L^j_{c,anger}) = 0).
If the ratings were L^i_{c,anger} = "expressed" (or "strongly expressed") and L^j_{c,anger} = "not expressed", we computed the raters' mutual response as "disagreement" (δ(L^i_{c,anger}, L^j_{c,anger}) = 1). We then excluded comments' sentiment annotations with a Krippendorff's alpha lower than 0.75.

In step 3, we trained an initial HeBERT-based (supervised) sentiment classifier (see details in Section 4.3) on the crowd-annotated data, and predicted polarity and emotion scores for the remainder of the corpus. We then repeated steps 2 and 3 until the performance of our classifier converged. Convergence occurred after three iterations, yielding a total of 4,000 partially labeled comments ("partially" meaning that the raters agreed on at least one sentiment). Tables 4 and 5 summarize the number of comments for each sentiment (polarity and emotion, respectively) for which there was high agreement among raters, and the percentage of the comments that express this sentiment. For example, the expression/non-expression of the emotion "anger" was labelled in 1,979 distinct comments; among these, "anger" was expressed in 78% of the comments, and in 22% it was not expressed.

Table 4: Summary of the polarity data
Table 5: Summary of the emotion data
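The coarsened agreement computation described above can be sketched as follows: a minimal implementation of Krippendorff's alpha for nominal data, assuming ratings are first collapsed to the coarse scales (e.g., "strongly expressed" → "expressed"); the rating data are hypothetical:

```python
from collections import Counter

# Collapse the 3-point emotion scale to the coarse 2-point scale
COARSEN = {"strongly expressed": "expressed",
           "expressed": "expressed",
           "not expressed": "not expressed"}

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal ratings.

    units: one list of ratings per comment (only comments with at least
    two ratings are pairable and contribute to the statistic).
    """
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)          # total pairable values
    totals = Counter(v for u in units for v in u)
    # Observed disagreement: differing within-unit rating pairs
    d_obs = 0.0
    for u in units:
        m, cnt = len(u), Counter(u)
        disagreeing_pairs = m * m - sum(c * c for c in cnt.values())
        d_obs += disagreeing_pairs / (m - 1)
    d_obs /= n
    # Expected disagreement under the pooled marginal distribution
    d_exp = (n * n - sum(c * c for c in totals.values())) / (n * (n - 1))
    return 1.0 if d_exp == 0 else 1.0 - d_obs / d_exp

# Three raters per comment, collapsed to the coarse emotion scale:
ratings = [
    [COARSEN["strongly expressed"], COARSEN["expressed"], COARSEN["expressed"]],
    [COARSEN["not expressed"], COARSEN["not expressed"], COARSEN["expressed"]],
]
alpha = krippendorff_alpha_nominal(ratings)
```

A comment's sentiment annotation would then be excluded whenever its alpha falls below the 0.75 cutoff, mirroring the filtering step described above.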
Interestingly, though we attempted to balance the expression and non-expression of each sentiment in our labelled data, our raters had significantly lower agreement on positive sentiments - specifically, positive polarity, expression of happiness, surprise, and trust, and non-expression of anger and disgust. In line with the theory of Plutchik (1980), we observed high negative correlation between emotions that are located opposite each other in Plutchik's wheel of emotion, and positive correlation between closely related emotions (see Table 6). The final classification model was denoted HebEMO.

              Anger  Disgust  Anticipation  Fear  Joy   Sadness  Surprise  Trust  Polarity
Anger         1.00
Disgust       0.46   1.00
Anticipation  0.10   0.09     1.00
Fear          0.15   0.11     0.14          1.00
Joy           0.25   0.27     0.12          0.11  1.00
Sadness       0.21   0.16     0.13          0.28  0.12  1.00
Surprise      0.06   0.04     0.10          0.15  0.05  0.12     1.00
Trust         0.27   0.31     0.11          0.07  0.41  0.08     0.07      1.00
Polarity      0.47   0.44     0.11          0.09  0.36  0.14     0.05      0.40   1.00

Table 6: Pearson correlation scores among the emotions identified by human raters

4.3. Fine-Tuning of HeBERT: The Classification Model
We modeled our classification algorithm by fine-tuning HeBERT for a document-level classification task. Prediction probabilities were computed with a softmax activation function. We treated the polarity task as a multinomial problem with three classes (positive, neutral, negative); emotions were modeled as independent dichotomous classification tasks (expressed, not expressed), as multiple emotions can co-exist in a single comment. Attempts to merge emotion pairs (e.g., joy-sadness) into a single classification category yielded lower performance. To train and evaluate our model, we randomly partitioned the corpus into training (70%), validation (15%), and test (15%) sets. In order to avoid data leakage, the tokenization process (in HeBERT) was not trained on the UGC dataset. We repeated the training and evaluation process following a bootstrap approach with 50 samples (each generating a different data partition) and examined the stability of our results.
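The decision rule implied by this head design can be sketched as one three-way softmax head for polarity plus an independent binary (sigmoid) head per emotion. This is an illustrative simplification: the logits would come from HeBERT's fine-tuned classification heads, and the 0.5 decision threshold is our assumption:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

POLARITY_CLASSES = ["negative", "neutral", "positive"]

def predict(polarity_logits, emotion_logits, threshold=0.5):
    """Polarity: argmax of a 3-way softmax; emotions: independent binary heads."""
    probs = softmax(polarity_logits)
    polarity = POLARITY_CLASSES[probs.index(max(probs))]
    emotions = {emo: sigmoid(z) >= threshold
                for emo, z in emotion_logits.items()}
    return polarity, emotions
```

For example, `predict([0.1, 0.2, 2.0], {"anger": 3.0, "joy": -2.0})` yields `("positive", {"anger": True, "joy": False})`; several emotion heads can fire simultaneously, unlike the mutually exclusive polarity classes.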
5. Results
We applied HebEMO to our annotated dataset and examined its performance, as measured by precision, recall, F1-score, and overall accuracy of the expressed sentiment. Table 7 presents the performance of our model on the polarity task, and Table 8 presents the performance for emotion recognition. The weighted average performance across all sentiments is an F1-score of 0.931 and an overall accuracy of 0.91. With the exception of the emotion "surprise", the model's F1-scores and accuracies range between 0.78 and 0.97. These performance levels, as far as we know, exceed those of state-of-the-art English-language models for UGC emotion recognition (Ghanbari-Adivi and Mosleh 2019, Mohammad et al. 2018).

          Precision  Recall  F1-score
Positive  0.96       0.92    0.94
Neutral   0.83       0.56    0.67
Negative  0.97       0.99    0.98
Accuracy                     0.97

Table 7: HebEMO performance on the polarity task in the UGC data

              F1    Precision  Recall  Accuracy
Anger         0.97  0.97       0.97    0.95
Disgust       0.96  0.97       0.95    0.93
Anticipation  0.85  0.83       0.87    0.84
Fear          0.80  0.84       0.77    0.80
Joy           0.88  0.89       0.87    0.97
Sadness       0.84  0.83       0.84    0.79
Surprise      0.41  0.47       0.37    0.78
Trust         0.78  0.88       0.70    0.95

Table 8: HebEMO performance on the emotion detection task in the UGC data

The emotion "surprise" is known to be hard to detect. As mentioned in Zhou et al. (2020), the best reported F1-score for this emotion in English was found to be as low as 0.19 (Mohammad et al. 2018). In our dataset, the amount of labeled data for "surprise" - as well as for its opposing counterpart on the wheel of emotion, "anticipation" (Plutchik 1980) - was also the lowest among all emotions (see Table 5), implying that this pair poses a challenging labeling task even for human annotators.

Next, we re-trained HebEMO on the polarity data reported by Amram et al. (2018). Amram et al. (2018) collected comments that were written in response to official tweets posted by the Israeli president, Mr. Reuven Rivlin, between June and August, 2014 (a total of 12,804 Hebrew comments). The authors manually annotated the comments with the following labels - supportive (positive), criticizing (negative), or off-topic (neutral) - and published a partitioned dataset (training and validation) for the benefit of comparisons between language models.

The performance of our model is presented in Table 9, along with the improvement or deterioration in performance relative to the SOTA model reported in Amram et al. (2018). The results show that, in most respects, with the exception of off-topic precision, our model's performance exceeds that of the SOTA model. The improvement is significant at the 95% confidence level.

           Precision     Recall   F1-score
Positive   0.95 (+.03)   (+.01)   (+.01)
Negative   0.89 (+.05)   (+.02)   (+.04)
Off-topic  0.70 (-.3)    (+.55)   (+.03)
Accuracy                          0.93 (+.03)

Table 9: The performance of HebEMO when trained on the polarity corpus reported by Amram et al. (2018)

6. Summary and Discussion

This paper presented two new tools that contribute to the development of Hebrew-language sentiment analysis capabilities: (i)
HeBERT - the first Hebrew BERT model, and a new state-of-the-art model for multiple Hebrew NLP tasks; and (ii)
HebEMO - a tool for polarity analysis and emotion recognition from Hebrew UGC.

Although HeBERT was developed for the purpose of optimizing sentiment analysis, we showed that it outperforms mBERT in a variety of supervised language tasks. This finding is consistent with the literature proposing that language-specific models outperform multilingual models. HeBERT also showed better performance than the current (non-BERT) SOTA Hebrew-language model.

For the task of extracting sentiments from UGC, we showed that a morpheme-based model, which aims to "understand" features of the language, performed less well than a model that did not address the language features (n-gram-based sub-words). For the latter input representation, a smaller dictionary was better than a larger one. A plausible explanation for these results is that UGC contains unofficial language, including non-lexical words such as slang and typos. Over-fitting a model to the official language in this case may overlook the unique characteristics of the unofficial language. In future work we plan to examine the performance of HebEMO when HeBERT is trained on a PMI masking task, rather than fill-in-the-blank.

References
Acheampong FA, Wenyu C, Nunoo-Mensah H (2020) Text-based emotion detection: Advances, challenges, and opportunities. Engineering Reports e12189.
Adamopoulos P, Ghose A, Todri V (2018) The impact of user personality traits on word of mouth: Text-mining social media platforms. Information Systems Research.
Expert Systems with Applications.
International Journal of Mental Health and Addiction.
Amram A, David AB, Tsarfaty R (2018) Representations and architectures in neural sentiment analysis for morphologically rich languages: A case study from modern Hebrew. Proceedings of the 27th International Conference on Computational Linguistics, 2242–2252.
Antoun W, Baly F, Hajj H (2020) AraBERT: Transformer-based model for Arabic language understanding. arXiv preprint arXiv:2003.00104.
Argaman O (2010) Linguistic markers and emotional intensity. Journal of Psycholinguistic Research.
Bareket D, Tsarfaty R (2020) Neural modeling for named entities and morphology (NEMO²). arXiv preprint arXiv:2007.15620.
Belinkov Y, Durrani N, Dalvi F, Sajjad H, Glass J (2017) What do neural machine translation models learn about morphology? arXiv preprint arXiv:1704.03471.
Bellstam G, Bhagat S, Cookson JA (2020) A text-based analysis of corporate innovation. Management Science.
Bojanowski P, Joulin A, Mikolov T (2015) Alternative structures for character-level RNNs. arXiv preprint arXiv:1511.06303.
Chatterjee A, Narahari KN, Joshi M, Agrawal P (2019) SemEval-2019 task 3: EmoContext contextual emotion detection in text. Proceedings of the 13th International Workshop on Semantic Evaluation, 39–48.
Chitturi R, Raghunathan R, Mahajan V (2007) Form versus function: How the intensities of specific emotions evoked in functional versus hedonic trade-offs mediate product preferences. Journal of Marketing Research.
Expert Systems with Applications.
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dong L, Wei F, Tan C, Tang D, Zhou M, Xu K (2014) Adaptive recursive neural network for target-dependent Twitter sentiment classification. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 49–54.
Ekman P (1999) Basic emotions. Handbook of Cognition and Emotion.
European Journal of International Relations.
arXiv preprint arXiv:1801.07736.
Ghanbari-Adivi F, Mosleh M (2019) Text emotion detection in social networks using a novel ensemble classifier based on Parzen tree estimator (TPE). Neural Computing and Applications.
Artificial Intelligence Review.
Neural Computation.
Joulin A, Grave E, Bojanowski P, Douze M, Jégou H, Mikolov T (2016) FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Kim-Prieto C, Diener E (2009) Religion as a source of variation in the experience of positive and negative emotions. The Journal of Positive Psychology.
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, 204–209.
Kövecses Z (2003) Metaphor and Emotion: Language, Culture, and Body in Human Feeling (Cambridge University Press).
Kratzwald B, Ilić S, Kraus M, Feuerriegel S, Prendinger H (2018) Deep learning for affective computing: Text-based emotion recognition in decision support. Decision Support Systems.
Krippendorff K (1970) Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement.
arXiv preprint arXiv:2010.01825.
Li S, Ju S, Zhou G, Lin X (2012) Active learning for imbalanced sentiment classification. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 139–148.
Liu B (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies.
Mining Text Data, 415–463 (Springer).
Liu R, Shi Y, Ji C, Jia M (2019) A survey of sentiment analysis based on transfer learning. IEEE Access.
Ain Shams Engineering Journal.
Mohammad S, Bravo-Marquez F, Salameh M, Kiritchenko S (2018) SemEval-2018 task 1: Affect in tweets. Proceedings of the 12th International Workshop on Semantic Evaluation, 1–17.
Mohammad SM, Turney PD (2013) Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29(3):436–465.
Mordecai NB, Elhadad M (2005) Hebrew named entity recognition.
More A, Seker A, Basmova V, Tsarfaty R (2019) Joint transition-based models for morpho-syntactic parsing: Parsing strategies for MRLs and a case study from modern Hebrew. Transactions of the Association for Computational Linguistics.
Ortiz Suárez PJ, Romary L, Sagot B (2020) A monolingual approach to contextualized word embeddings for mid-resource languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1703–1714 (Online: Association for Computational Linguistics).
Ortony A, Clore GL, Foss MA (1987) The referential structure of the affective lexicon. Cognitive Science.
IEEE Transactions on Knowledge and Data Engineering.
Pedrosa AL, Bitencourt L, Fróes ACF, Cazumbá MLB, Campos RGB, de Brito SBCS, e Silva ACS (2020) Emotional, behavioral, and psychological impact of the Covid-19 pandemic. Frontiers in Psychology.
(Mahwah: Lawrence Erlbaum Associates).
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
Pfefferbaum B, North CS (2020) Mental health and the Covid-19 pandemic. New England Journal of Medicine.
Plutchik R (1980) A general psychoevolutionary theory of emotion. Theories of Emotion, 3–33 (Elsevier).
Pota M, Marulli F, Esposito M, De Pietro G, Fujita H (2019) Multilingual POS tagging by a composite deep architecture based on character-level features and on-the-fly enriched word embeddings. Knowledge-Based Systems.
OpenAI Blog.
Schuster M, Nakajima K (2012) Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5149–5152 (IEEE).
Shapira N, Lazarus G, Goldberg Y, Gilboa-Schechtman E, Tuval-Mashiach R, Juravski D, Atzil-Slonim D (2020) Using computerized text analysis to examine associations between linguistic features and clients' distress during psychotherapy. Journal of Counseling Psychology.
Sima'an K, Itai A, Winter Y, Altman A, Nativ N (2001) Building a tree-bank of modern Hebrew text. Traitement Automatique des Langues.
Straka M, Hajič J, Straková J (2016) UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), 4290–4297.
Tay Y, Dehghani M, Bahri D, Metzler D (2020) Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.
Tsarfaty R, Bareket D, Klein S, Seker A (2020) From SPMRL to NMRL: What did we learn (and unlearn) in a decade of parsing morphologically-rich languages (MRLs)? arXiv preprint arXiv:2005.01330.
Tsarfaty R, Seddah D, Goldberg Y, Kübler S, Versley Y, Candito M, Foster J, Rehbein I, Tounsi L (2010) Statistical parsing of morphologically rich languages (SPMRL): What, how and whither. Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages, 1–12.
Ullah R, Amblee N, Kim W, Lee H (2016) From valence to emotions: Exploring the distribution of emotions in online product reviews. Decision Support Systems.
arXiv preprint arXiv:1808.09180.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008.
Wierzbicka A (1994) Emotion, language, and cultural scripts.
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, Davison J, Shleifer S, von Platen P, Ma C, Jernite Y, Plu J, Xu C, Scao TL, Gugger S, Drame M, Lhoest Q, Rush AM (2020) Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45 (Online: Association for Computational Linguistics).
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al. (2016) Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Yadav A, Vishwakarma DK (2020) Sentiment analysis using deep learning architectures: A review. Artificial Intelligence Review.
Yin W, Kann K, Yu M, Schütze H (2017) Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.
Yue L, Chen W, Li X, Zuo W, Yin M (2019) A survey of sentiment analysis in social media. Knowledge and Information Systems.
arXiv preprint arXiv:1903.08983.
Zhang L, Wang S, Liu B (2018) Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
arXiv preprint arXiv:1909.10681.
Zhou D, Wu S, Wang Q, Xie J, Tu Z, Li M (2020) Emotion classification by jointly learning to lexiconize and classify. Proceedings of the 28th International Conference on Computational Linguistics, 3235–3245.