Benchmarking Automatic Detection of Psycholinguistic Characteristics for Better Human-Computer Interaction
Sanja Štajner a,∗, Seren Yenikent a, Marc Franco-Salvador a

a Symanto Research, Pretzfelder Str. 15, 90425 Nürnberg, Germany
Abstract
When two people pay attention to each other and are interested in what the other has to say or write, they almost instantly adapt their writing/speaking style to match the other. For a successful interaction with a user, chatbots and dialogue systems should be able to do the same. We propose a framework consisting of five psycholinguistic textual characteristics for better human-computer interaction. We describe the annotation processes used for collecting the data, and benchmark five binary classification tasks, experimenting with different training sizes and model architectures. We perform experiments in English, Spanish, German, Chinese, and Arabic. The best architectures noticeably outperform several baselines and achieve macro-averaged F-scores between 72% and 96%, depending on the language and the task. Similar results are achieved even with a small amount of training data. The proposed framework proved to be fairly easy to model for various languages, even with a small amount of manually annotated data, if the right architectures are used. At the same time, it showed potential for improving user satisfaction if applied in existing commercial chatbots.

Keywords: natural language processing, machine learning, deep learning, psycholinguistics, emotionality, communication styles

∗ Corresponding author.
Email addresses: [email protected] (Sanja Štajner), [email protected] (Seren Yenikent), [email protected] (Marc Franco-Salvador)

Preprint submitted to journal, January 14, 2021
1. Introduction
Empathy is the central pillar of social interactions and a highly complex feature that requires the capacity to understand and encompass a broad range of psychological characteristics, requiring understanding of the user's personality and situational context (Omdahl, 1995). In the last few years, much attention has been given to trying to build personalised dialogue agents, emotionally intelligent virtual agents, and empathetic chatbots and dialogue systems (Paiva et al., 2017; Rashkin et al., 2019; Lin et al., 2019b; Shin et al., 2019; Lin et al., 2019a). It has been shown that adapting the style of the answers to the emotional state of a user leads to longer and more satisfactory user engagement not only in general conversation (i.e. chit-chat) chatbots, but also in goal-oriented chatbots, as well as in human-to-human conversations in various tasks and domains (Rashkin et al., 2019). Some studies (Ma et al., 2019; Lee et al., 2019) suggested that for building emotionally intelligent virtual agents, it is not enough to adapt them only to the emotion of the user, but also to the user's personality. Ma et al. (2019) show that adding personality traits to virtual agents, especially a submissive trait, leads to significantly better perceived emotional intelligence of such systems.

From the computational perspective, automatically detecting someone's personality is difficult, and rarely surpasses even the majority-class baselines in a binary setting, even in cases where large amounts of textual data per user are available (Plank and Hovy, 2015; Celli and Lepri, 2018; Štajner and Yenikent, 2020). With the addition of audio and video data, computational models for the automatic detection of personality traits show significant improvements (Kampman et al., 2018). However, building fully empathetic virtual agents is even more demanding. Both approaches, theoretically-based and empirically-based, have some major shortcomings.
Theoretically-based approaches suffer from the "lost in translation" problem, using theories based on behavioural data obtained from different angles, accumulated through various studies and abstracted into computational models, whereas empirically-based approaches suffer from data biases, are domain dependent, and are not easily generalizable to other application scenarios (Paiva et al., 2017).

Communication represents the manifestation of personality, which showcases one's psychological state. In human-to-human communication (either written or oral), people instinctively attempt to be in synchronization with each other to the extent that they want to commit to the interaction, engaging themselves in a so-called conversational dance (Pennebaker, 2011). In order to stay engaged and show their interest, parties invest efforts to switch their attention and adjust to the communication style of the other party. Research indicates that people who are perceived as skilled at communication are also perceived to have appealing, intellectual, helpful, and attractive personalities (Leung and Bond, 2001). It has also been shown that people prefer to interact with others who have better-resonating personality styles due to fewer levels of information processing (Wu et al., 2012).

We argue that for building successful task-oriented chatbots, which would be perceived as skilled human agents that resonate well with the users, it is not necessary to model deep psychological characteristics, but rather the surface realizations of psychological states and user preferences expressed through five basic psycholinguistic characteristics. In this study, we make contributions to the field of human-computer interaction by:

1. Proposing a new framework for automatically detecting five psycholinguistic characteristics that reflect the current psychological state and communication preferences of the user;
2. Benchmarking the task of automatically detecting the five psycholinguistic characteristics proposed in our framework;
3. Building state-of-the-art neural models for binary classification of textual utterances according to the five proposed psycholinguistic characteristics, in five languages: English, Spanish, German, Chinese, and Arabic;
4. Showing that high-performing models can be built for various languages even with a small amount of human-annotated data;
5. Providing an error analysis of our best models, showing the current limitations of the proposed approach.
2. Methods
To design a framework that can be used on short textual utterances, we focused on emotionality and four communication styles as the main aspects of instant communication. Based on those, we propose a framework consisting of five essential psycholinguistic characteristics (EPC-5) that reflect the psychological and contextual states of the user during an interactive communication. A task-oriented conversational agent should be able to comprehend the psychological state of the user. The models can be used for academic purposes for free via our API: https://developers.symanto.net/.

Emotionality influences fundamental psychological experiences such as personal interests and attentiveness (Krapp, 2002; Tausczik and Pennebaker, 2010). Research suggests that it is highly related to, and could be detected with, linguistic features (Tausczik and Pennebaker, 2010). By using various linguistic features automatically extracted from text, it has been shown that people tend to use the same levels of emotionality throughout a conversation (Pennebaker, 2011). Therefore, correctly detecting and responding to the emotional state of the user is an important aspect of an empathetic communication system. Emotionality does not only refer to the similarity between the sender and the receiver of the message, but rather to any emotional responsiveness that takes place during the interaction (Hoffman, 2001). Emotional experiences and reactions of a person change instantly depending on the personal condition and triggers. Hence, the capability of an empathetic dialogue system to keep up with those dynamic states of the user and tailor the responses accordingly is fundamental for successful communication with the user. While emotionality clearly is a variable ranging anywhere between completely non-emotional and highly emotional, for the annotation purposes, we define emotionality as a binary label:

• Emotional: Feeling-focused statements that focus on values and emotions;
• Non-emotional: Emotionally-neutral statements that provide logical and analytical meaning.

Table 1: Examples for the emotionality aspect in communication.

Non-emotional:  These are customised earphones with secure hooks.
                I would prefer to buy this car since it is hybrid and cost effective.
Emotional:      They offer a friendly service with great choice of drinks and stunning view!
                It was an awesome day with my new classmates.
Mixed:          You're looking at the newest member of our team. She is ready to tear it up!
                Research suggests that you can have hydrated, beautiful skin in 7 days.

The decision to treat emotionality as a binary label of course had a strong impact on the observed inter-annotator agreement, leading to expected disagreements on instances which had a combination of emotional and non-emotional signals. Table 1 presents two cases of clearly non-emotional and emotional instances (on which all annotators agreed), and two cases of instances which contained a mixture of emotional and non-emotional signals, on which we did not observe perfect inter-annotator agreement.
An empathetic communication process includes effective stylistic associations. When the communication is tailored based on the style, it generates positive interactions, e.g. the feeling of being liked by others as part of the social function (Bell and Daly, 1984). Our approach to communication style is based on the Four-sides Communication Model (Schulz von Thun, 1981). According to this model, every utterance reveals important information about the sender, the receiver, and the topic in four different layers:

1. Experience layer, which provides self-revealing information about the user;
2. Factual layer, which contains facts and data-related information;
3. Appeal layer, which contains the desires and effects that the user seeks;
4. Relationship layer, which provides indicators of how the sender feels about the receiver through intonation, body language, and gestures.

As the model relies on the dynamic aspect of the communication concept, it allows us to apply the layers to short text snippets which are the products of situational and domain-specific triggers, similar to those found in communication with goal-oriented communication agents. To incorporate this framework into the context of human-to-human or human-computer textual interaction, we use only the first three layers of the model, as the fourth layer (relationship layer) is based on face-to-face communication characteristics and thus not applicable to purely textual contexts. Focusing on goal-oriented human-computer communication (though also applicable in human-to-human communication), we further break down the appeal layer into two separate characteristics (action-seeking and information-seeking), to capture more granular aspects of the appeal, necessary to better respond to the user's needs. Therefore, we define the four communication styles as follows:

• Self-revealing: Statements in which the speaker shares personal information or experiences;
• Fact-oriented: Factual and objective statements;
• Action-seeking: Direct or indirect requests, suggestions, and recommendations that expect action from other people;
• Information-seeking: Direct or indirect questions searching for information.

Two examples for each of the four communication styles are provided in Table 2.

Table 2: Examples of different communication styles.

Self-revealing:       I love the ambience and will definitely come again.
                      My husband was also diagnosed with a lung cancer.
Fact-oriented:        For this phone, battery lasts about 20 minutes but excellent for price.
                      Rashes that develop in the sun can indicate lupus.
Action-seeking:       Try contacting the customer service, here's the link.
                      Please support the UNICEF with your donations.
Information-seeking:  Where can I watch tonight's game?
                      I would like to know if anyone would be interested in helping.
In this section, we describe the sources used for data collection with some basic statistics, the annotation procedure, and inter-annotator agreement, followed by a brief discussion on the complexity of the task.

Table 3: Average length of posts in sentences (S) and in words (W).

Language   Emotionality   Fact-oriented   Self-revealing   Action-seeking   Info-seeking
            S     W        S     W         S     W          S     W          S     W
English    3.2   47.4     2.7   37.3      3.5   53.1       3.0   44.2       3.1   46.9
German     5.5   78.4     4.6   65.2      5.4   78.0       4.5   62.9       4.6   65.6
Spanish    3.3   62.8     3.0   52.2      2.8   51.0       2.8   49.7       2.7   49.2
Chinese    4.9   90.0     4.5   80.4      4.3   72.3       3.6   57.6       3.8   61.2
Arabic     1.3   27.2     1.3   28.1      1.3   33.0       1.2   24.9       1.3   28.3
Data was collected from various sources, ranging from those that contain very short posts (Facebook, Twitter, YouTube), through Amazon reviews, to forums where multiple users interact with each other. In all cases, each post by one user was treated as a separate instance, and we did not allow multiple posts from the same user from any of the sources (to avoid user bias as much as possible). We used the same sources and topics for each of the five languages (English, German, Spanish, Arabic, and Chinese), as much as possible, to make our results comparable across multiple languages. The average number of sentences and words per post, for each language and psycholinguistic characteristic, is presented in Table 3.
For each of the five languages, we hired three annotators, all native speakers. We provided a workshop in which we briefly explained the theory behind each of the five psycholinguistic characteristics, and showed them a number of examples of each. A question and answer session followed to clarify potential misunderstandings. After that, the annotators were provided with a small test set of 100 instances and asked to annotate each instance for each of the five characteristics (emotionality, self-revealing, fact-oriented, action-seeking, and information-seeking) by assigning it a yes or no label. After the test phase was finished, we conducted a second workshop to further clarify the potential difficulties of the annotation task. After that, each annotator was given the final annotation set.

Table 4: Percentage of cases in which all three annotators agreed (for this task, the expected agreement by chance would be 25%).

Language   Emotionality   Fact-oriented   Self-revealing   Action-seeking   Info-seeking
English    53             52              63               73               80
German     35             50              61               69               79
Spanish    49             33              40               69               73
Chinese    54             55              52               82               90
Arabic     51             45              59               83               90

Table 4 shows the percentage of instances on which all three annotators agreed, for each psycholinguistic characteristic separately. We observed a similar pattern for all languages: the highest agreement was on the information-seeking and action-seeking characteristics, followed by the agreement on the self-revealing characteristic. Only in Spanish did we observe a significantly lower agreement for the self-revealing characteristic than in any of the other four languages. The communication characteristic on which we observed the highest disagreement was the fact-oriented one (the exceptions being Chinese and German), followed by emotionality. These results indicate that in certain languages, agreeing on some of the psycholinguistic characteristics of a post can pose difficulties even for trained human annotators.

To better understand the causes of disagreement, we had a closer look at the English instances in which annotators disagreed on some of the psycholinguistic characteristics, and noticed two main sources of disagreement:

1. The disagreements on the emotionality aspect were mostly observed for longer instances which contained a mixture of emotional and non-emotional signals.
By talking to the annotators, we discovered that there were two main causes for annotation disagreements in those cases: (a) different annotators focused on different parts of the instances; and (b) annotators had different thresholds for the number of emotional signals they deemed necessary to label a certain instance as emotional. For example, in the following post, the first part conveys an emotional expression that is followed by a rational statement including reasoning:
You just hit the nail on the head! They have Kante so he's not going to get much playing time.
2. The instances which are emotional and fact-oriented at the same time led to the most disagreements. This revealed that humans tend to associate fact-oriented statements with being non-emotional, and non-fact-oriented ones with being emotional. Although it is more common to have statements that are both non-emotional and fact-oriented, or both emotional and non-fact-oriented, statements that are emotional and fact-oriented at the same time also exist, as in the following examples:
• I never mentioned that I don't like how it works, I asked is it accurate!! How do you know it isn't accurate, if it wasn't it wouldn't be on the mmi at all would it? Some specs are turned on for specific countries depending on legal requirements.
• All this is a scam to push the EV market and away from diesel fuel this was all pre set up for everyone to start buying electric and not go towards "Clean Diesel".
Such examples were a common source of disagreements on both the emotionality and fact-oriented labelling tasks.

The tasks of emotionality labelling in German, and fact-oriented and self-revealing labelling in Spanish, were the tasks in which we observed the lowest percentage of instances with perfect inter-annotator agreement. Therefore, we asked two psycholinguists (one a native speaker of German, and the other a native speaker of Spanish) to double-check whether those examples were indeed difficult cases for binary classification by humans, even after two rounds of annotation training. Their analysis confirmed that the reported inter-annotator agreement results stem from sociolinguistic characteristics of those languages and an unfortunate choice of particular instances, rather than from low quality of annotations.
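The 25% chance-agreement figure quoted alongside Table 4 follows directly from three annotators independently assigning binary labels: if each label is chosen with probability 0.5, all three coincide with probability 2 × 0.5³ = 0.25. A minimal sketch of this calculation (illustrative, not the authors' code):

```python
def chance_all_agree(p, n_annotators=3):
    """Probability that n independent annotators all assign the same
    binary label, if each picks the positive label with probability p."""
    return p ** n_annotators + (1 - p) ** n_annotators

# With unbiased annotators and three of them:
print(chance_all_agree(0.5))  # 0.25
```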
In this section, we describe the datasets, classification models, and evaluation measures used in our experiments, as well as the experimental design and the motivations behind it.
For our main experiments (all except those presented in Sections 3.2.1 and 3.2.2), we took into account only instances on which all three annotators agreed, for each of the five communication characteristics separately. Given the high number of instances that are difficult to annotate even by humans (see Section 2.2.3), we did not want to bring noise into the comparison of model performances among various architectures and training sizes by having a high number of 'gray zone' cases in the test sets. Therefore, we opted for the main test sets to consist only of instances on which all three annotators agreed ('clear cases'). When it comes to training sets, we experimented with both scenarios: (1) using only the cases on which all three annotators agreed ('clear cases' only); and (2) using the same portion of data randomly sampled from the whole annotated datasets, taking the majority label as the 'gold label'. The first scenario yielded significantly better results than the second, and therefore we here discuss only those results and present the statistics of those datasets (Table 5). This scenario, on the one hand, ensured the good quality of the training data (due to the training dataset consisting of only 'clear cases'), and on the other hand, ensured an easier discussion of the results, knowing that trained human annotators had perfect agreement on the test set.

We performed a stratified division of the data, aiming to have 1,000 test instances for each combination of language and psycholinguistic characteristic. The rest of the annotated data with perfect inter-annotator agreement was used as the main (largest) training dataset. The data was highly unbalanced for certain tasks in some of the languages (Table 5).
As the data in all languages was collected from the same sources and covers similar topics, the different distributions of classes among different languages do not stem from the data collection procedure, but are rather the consequences of certain cultural or language properties, and of the number of 'clear cases' that the specific combination contained. In our main classification experiments, we also tried balancing the training data.

Table 5: Statistics of the dataset partitions in different languages.

Lang.  Set    Emotionality    Fact-oriented   Self-revealing  Action-seeking  Info-seeking
              No      Yes     No      Yes     No      Yes     No      Yes     No      Yes
EN     Train  6,557   6,538   11,047  2,097   2,548   13,162  15,969  1,945   17,259  3,177
       Test   496     504     827     173     160     840     884     116     844     156
ES     Train  7,354   3,129   5,475   3,830   3,087   7,385   14,864  710     14,628  2,075
       Test   676     324     581     419     306     694     953     47      882     118
DE     Train  2,855   1,868   4,791   1,380   2,029   5,615   7,905   818     8,142   2,024
       Test   605     395     776     224     249     751     916     84      799     201
AR     Train  4,901   4,483   8,664   1,197   811     12,100  12,456  486     12,212  2,70
       Test   530     470     871     129     64      936     954     46      806     194
ZH     Train  4,660   533     3,036   2,264   3,020   1,880   7,908   412     8,284   947
       Test   894     106     580     420     571     429     944     56      891     109
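The 'clear cases' filtering and the stratified train/test division described above can be sketched as follows (a simplified illustration with hypothetical field names, not the authors' code):

```python
import random
from collections import defaultdict

def clear_cases(instances):
    """Keep only instances on which all three annotators agreed."""
    return [x for x in instances if len(set(x["labels"])) == 1]

def stratified_split(instances, test_size, seed=0):
    """Sample a test set that preserves the overall label distribution."""
    by_label = defaultdict(list)
    for x in instances:
        by_label[x["labels"][0]].append(x)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # take a share of the test set proportional to the class frequency
        k = round(test_size * len(group) / len(instances))
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test
```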
Additional test set. To assess the usefulness of our models in a real-world scenario, where the given utterances might not all be 'clear cases', we built an additional test set for the emotionality classification task in English by randomly choosing 500 instances on which the annotators did not agree (where the 'gold label' is the majority label) and 500 instances on which all three annotators agreed, from a newly collected set of instances from Twitter, one year after the original datasets were collected. In this annotation round, we additionally asked the annotators to label the instances as easy or difficult cases. Each instance that was annotated as difficult by at least two annotators obtained the 'gold label' difficult. This resulted in 482 instances being labelled as easy cases, and 518 instances being labelled as difficult cases. We exploited those additional labels to calculate the accuracy of our best models on the test set consisting of only 'difficult cases' (Section 3.2.1), and to check how they relate to the class probabilities in our best models (Section 3.2.2).

We build our classification models using state-of-the-art (neural network) deep learning architectures. Aiming to perform a highly accurate classification, our model includes pre-trained general-purpose language representations and mechanisms to identify the most relevant information within a word sequence. We first use a fine-tuned Bidirectional Encoder Representations from Transformers (henceforth referred to as
BERT) model (Devlin et al., 2018) to obtain the word embeddings from the input text. These feed a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997). An attention mechanism (Yang et al., 2016) combines its hidden states. Finally, we obtain the classification using three fully connected dense layers together with dropout and layer normalisation (Ba et al., 2016). The model architecture is shown in Figure 1. We fine-tune
a different BERT model for each training partition. We note that the BERT word representations are derived from sub-word information, which makes the model versatile in terms of vocabulary size and spelling errors. For more details about the fine-tuned BERT model and its parameters, please refer to Appendix A. The classifiers use 5% of the training data as a development partition.

Figure 1: Classification model architecture (BERT model → Bi-LSTM → Attention → Dense (ReLU) → Dense (ReLU) → Dropout + Norm. layer → Dense (Sigmoid) → classification).

We also explored two additional baseline models: (1) a classical word and character n-grams with a Support Vector Machine (Chang and Lin, 2011) classifier (henceforth referred to as ng-SVM), and (2) representing the input text by characters with a character-based Convolutional Neural Network (henceforth referred to as c-CNN) (LeCun et al., 2015). We used the neural architecture proposed by Zhang et al. (2015), but adding a character embedding layer as input. We note that the original performance was similar, but the embedding layer contributes to making the model smaller and to converging faster at training time.

Footnote: For more details about BERT, its pretrained models, and how to fine-tune it, please refer to: https://github.com/google-research/bert. The ng-SVM parameters include the top 20k word { }-grams and the top 20k character { }-grams with word boundaries, weighted by Term Frequency-Inverse Document Frequency (TF-IDF).
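The attention step in this architecture (Yang et al., 2016) pools the Bi-LSTM hidden states into a single sentence vector by scoring each time step and taking a softmax-weighted sum of the states. A minimal NumPy sketch of that pooling operation, with toy dimensions (illustrative only, not the actual implementation):

```python
import numpy as np

def attention_pool(H, W, b, u):
    """Attention pooling over hidden states H of shape (T, d).
    W (d, a), b (a,) and u (a,) stand in for learned parameters."""
    scores = np.tanh(H @ W + b) @ u          # one relevance score per time step
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ H, weights              # pooled (d,) vector and the weights

rng = np.random.default_rng(0)
T, d, a = 5, 8, 4                            # 5 time steps, toy layer sizes
H = rng.standard_normal((T, d))
pooled, w = attention_pool(H, rng.standard_normal((d, a)),
                           np.zeros(a), rng.standard_normal(a))
```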
Evaluation Metrics.
As evaluation metrics, we use the macro-averaged and the per-class Precision (P), Recall (R), and F-score (F) for all models. We selected the macro-averaged metrics to consider each class equally important regardless of its frequency. This eases the analysis of results with our imbalanced test sets.

In this study, we performed three sets of experiments:

1. Comparison of the performances of models with different architectures;
2. Benchmarking all language-task pairs with the best models trained on the full training datasets;
3. Exploring the influence of the training size on the performance of our best model.
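Macro-averaging computes precision, recall, and F per class and then takes the unweighted mean, so a minority class counts exactly as much as the majority class. A minimal sketch of the macro-averaged F-score (illustrative, not the evaluation code used in the paper):

```python
def macro_f1(y_true, y_pred, classes=("no", "yes")):
    """Unweighted mean of per-class F-scores."""
    f_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f_scores) / len(f_scores)

# Predicting only the majority class is penalised: the minority class
# contributes an F of 0 to the macro average.
print(macro_f1(["no", "no", "yes", "yes"], ["no", "no", "no", "no"]))
```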
Comparison of Model Performances.
The first set of experiments focused on model comparison using the English data. We selected our models for completeness reasons. We were interested in a baseline model (ng-SVM) based on basic n-grams, as it is widely known for its competitive performance. Next, we wanted to compare a state-of-the-art character neural model (c-CNN) to study the performance of strong lexical features on the evaluated tasks. Finally, we included one of the current state-of-the-art models for text classification (BERT), which offers an excellent semantic understanding of the meaning of texts. While the first baseline is meant to show the difficulty of the task, the second and third models aim to compare which kind of features (lexical or semantic) have higher relevance for the evaluated tasks. We conducted the same set of experiments for the other available languages. However, we omit presenting those tables of results and their analysis, as the conclusions were identical to the English setting.

Footnote: The ng-SVM features use the TF-IDF weighting scheme (Salton et al., 1983) with a minimum DF of 2. The c-CNN parameters include 128-dimensional trainable random embeddings, the top 20k characters, a maximum text length of 400, 1024 filters, and kernels of size 3, 5 and 7.

Benchmarking.
In the second set of experiments, we benchmarked all five newly-proposed binary classification tasks in five languages (English, German, Spanish, Chinese, and Arabic), using the model architecture that obtained the best results in the first set of experiments for all tasks and languages: BERT. To assess the usefulness of our models in a real-world scenario where the given utterances might not be 'clear cases', in this set of experiments we also evaluated the performance of our best model for emotionality classification in English on the additional test set (see Section 2.3.1 for the full description of the additional test set), consisting of 1,000 posts on which the three annotators had perfect agreement in only half of the cases. Furthermore, using the same additional test set, but this time also using the additional labels indicating easy and difficult cases, we explored the possibility of using class probabilities from the best models for assessing the emotional intensity of the posts in English.

Table 6: Macro-averaged performances (in %) of the compared models for all English tasks.
Model    Emotionality   Fact-oriented   Self-revealing   Action-seeking   Info-seeking
         P   R   F      P   R   F       P   R   F        P   R   F        P   R   F
c-CNN    89  89  89     89  80  83      86  86  86       85  81  83       94  93  94
ng-SVM   90  90  90     90  87  88      89  86  87       91  81  85       94  88  91
BERT     95  95  95     94  95  95      94  96  95       92  87  90       96  96  96
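The ng-SVM baseline in this comparison relies on TF-IDF-weighted word and character n-grams (the latter with word boundaries). The extraction step can be sketched as follows (a hypothetical helper; the exact n-gram ranges used in the paper are not reproduced here):

```python
def word_ngrams(text, n):
    """Contiguous word n-grams of a lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Character n-grams, padding with spaces to mark word boundaries."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(word_ngrams("Where can I watch", 2))
# ['where can', 'can i', 'i watch']
```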
Influence of the Training Size.
Knowing that human annotation is expensive and time-consuming, in the third set of experiments we experimented with smaller portions of the training data, to explore whether similar results can be achieved with less training data. We thus conducted seven additional classification experiments for each language and task, taking training sizes of 1,000, 2,000, ..., 7,000 instances from the original full training dataset. Each larger portion of the training set contained all instances present in the smaller training sets. All models were evaluated using the same test sets and evaluation metrics as in the second set of experiments (for the distribution of classes in the test sets, see Table 5).
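The nested training portions described above (each larger portion containing all instances of the smaller ones) can be obtained as prefix slices of a single shuffled copy of the training data. A sketch (illustrative, not the authors' code):

```python
import random

def nested_training_subsets(train, sizes=range(1000, 8000, 1000), seed=0):
    """Prefix slices of one shuffled copy, so every larger subset
    contains all instances of the smaller ones."""
    pool = list(train)
    random.Random(seed).shuffle(pool)
    return [pool[:k] for k in sizes if k <= len(pool)]

subsets = nested_training_subsets(range(10000))
# seven subsets of sizes 1000, 2000, ..., 7000
```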
3. Results
For each set of experiments, we present the results in a separate subsection.
The macro-averaged results of the classification experiments on all five psycholinguistic characteristics and for all three model architectures, using the full English training datasets (see Table 5 for the full training set sizes), are presented in Table 6, while the per-class results can be found in Tables 7 and 8.

Table 7: Per-class performances (in %) of the compared models for the English emotionality, fact-oriented and self-revealing tasks.
         Emotionality              Fact-oriented             Self-revealing
         No          Yes           No          Yes           No          Yes
Model    P   R   F   P   R   F     P   R   F   P   R   F     P   R   F   P   R   F
c-CNN    87  91  89  91  86  88    92  98  95  85  62  72    77  78  77  96  95  96
ng-SVM   91  90  90  90  91  90    95  97  96  84  77  80    83  74  78  95  97  96
BERT     97  93  95  93  97  95    98  98  98  90  91  91    89  95  92  99  98  98
Table 8: Per-class performances (in %) of the compared models for the English action-seeking and information-seeking tasks.
         Action-seeking            Information-seeking
         No          Yes           No          Yes
Model    P   R   F   P   R   F     P   R   F   P   R   F
c-CNN    95  97  96  76  64  69    98  98  98  91  88  90
ng-SVM   95  99  97  87  63  73    96  99  97  91  78  84
BERT     97  99  98  88  76  81    99  98  99  93  93  93
As can be observed, ng-SVM performed better than c-CNN on average, but showed lower performance on the information-seeking classification task. This is not surprising, considering that the information-seeking task relies on the detection of lexical signals, and the neural model might do a better job at detecting the most salient character components, i.e. text fragments indicative of seeking information (interrogative constructions with question marks). Interestingly, the ng-SVM baseline obtained an average F-score of around 88%, showing that our corpora contain clear signals which enable models to classify the psycholinguistic characteristics correctly. We conducted an additional experiment with external social media data (Section 3.2.1) to discard any kind of data artifact that could ease the classification of the test partitions.

As can be seen from the per-class performances in Tables 7 and 8, BERT was more successful in overcoming the problems of unbalanced classes in the training datasets (see Table 5 for the class distributions of the English training datasets). The 180,052,354 trainable parameters of its neural model clearly excelled at distinguishing between the classes. This benefited the overall performance of the model in this comparison.

BERT obtained the highest results in all the experiments, with an average F-score of 94%. This subword-based model captured deeper lexical and semantic characteristics due to the millions of parameters of its attention-based neural language model. Note that the BERT language model was trained with millions of texts to be a general-purpose model for natural language processing. We must highlight that the use of a fine-tuned neural language model eases the learning task, i.e. the neural training does not start from a random weight distribution but from a solid language model that undergoes a transfer learning process at fine-tuning and training time.
This is shown in Figure 2, where we calculate the macro-averaged F-score of the models as a function of the training size. As can be seen, BERT starts with a considerable advantage compared to the other models, which narrows as the size of the training set increases. For the rest of our experiments, we use the BERT model, as the model with the highest performances.

Figure 2: Macro-averaged F-score of the compared models (c-CNN, ng-SVM, BERT, and the majority-class baseline) as a function of the training size, for the emotionality, fact-oriented, self-revealing, action-seeking, and information-seeking tasks.

Table 9: Macro-averaged performances (in %) for all the languages and tasks.

Lang.  Emotionality   Fact-oriented   Self-revealing   Action-seeking   Info-seeking
       P   R   F      P   R   F       P   R   F        P   R   F        P   R   F
EN     95  95  95     94  95  95      94  96  95       92  87  90       96  96  96
ES     94  93  94     96  96  96      93  93  93       79  76  77       97  96  96
DE     86  86  86     88  88  88      93  92  93       88  84  86       95  92  94
AR     88  88  88     83  79  81      90  84  86       80  67  72       93  93  93
ZH     91  90  91     92  92  92      90  91  90       87  86  87       95  96  95
The macro-averaged results of the classification experiments for each of the five psycholinguistic characteristics and each of the five languages, using the full training datasets (see Table 5 for the full training set sizes), are presented in Table 9, while per-class results can be found in Tables 10 and 11. As can be seen, our models significantly outperform the majority-class baseline (50% F-score) in all languages. The highest F-score (over 96%) is achieved for the information-seeking classification task in Spanish and English and the fact-oriented one in Spanish. The emotionality classification in English achieves an F-score of over 95%.

Similarly to the English results analyzed in Section 3.1, the unbalanced scenario did not produce large per-class differences with BERT in most of the languages and tasks. However, the Arabic and Spanish action-seeking tasks show a bias to the majority class. We comment more on this issue in Section 3.3.

Table 10: Per-class performances (in %) for all the languages and our emotionality, fact-oriented and self-revealing tasks.

           Emotionality             Fact-oriented             Self-revealing
         YES         NO           NO          YES           NO          YES
Lang.  P   R   F   P   R   F    P   R   F   P   R   F    P   R   F   P   R   F
EN    93  97  95  97  93  95   98  98  98  90  91  91   89  95  92  99  98  98
ES    94  88  91  94  97  96   96  97  97  96  94  95   89  91  90  96  96  96
DE    84  82  83  88  90  89   95  95  95  81  82  81   91  87  89  96  97  96
AR    87  88  88  89  88  89   94  96  95  71  62  66   81  69  75  98  99  98
ZH    85  81  83  98  98  98   93  93  93  91  91  91   92  91  92  88  90  89
Table 11: Per-class performances (in %) for all the languages and our action-seeking and information-seeking tasks.

          Action-seeking           Information-seeking
         NO          YES           NO          YES
Lang.  P   R   F   P   R   F    P   R   F   P   R   F
EN    97  99  98  88  76  81   99  99  99  93  93  93
ES    98  98  98  61  53  57   99  99  99  94  93  94
DE    97  98  98  78  69  73   97  98  97  93  87  90
AR    97  99  98  64  35  45   97  97  97  90  89  89
ZH    98  99  98  76  73  74   99  99  99  91  93  92
On the additional test set consisting of external social media data collected a year after the main datasets (see Section 2.3.1 for the description of the additional test set), comprising 500 instances on which all three annotators agreed and 500 instances on which not all three annotators agreed (where the majority class was used to assign the 'gold labels'), the full English emotionality model achieves an 81% macro-averaged F-score. On the cases labeled as difficult, it achieves a 74% macro-averaged F-score. Considering the additional difficulty of social media texts, the topic-domain shift produced by the year between the crawling of the training and the test sets, and the inclusion of difficult cases in the test set, we consider this result a proof of our models' satisfactory performance in detecting the five essential psycholinguistic characteristics studied in this work. These results also extend the validity of our best models to real-world applications, where quality and the versatility to adapt to new domains are a must.

We further applied our best-performing emotionality model for English to this additional test set, which contains 1000 instances, out of which 482 were labeled as easy cases and 518 as difficult cases. The results of this experiment suggested that the class probability of the model might be used to assess the emotional intensity of a post. In over 94% of the test instances, easy cases of non-emotional instances had a probability of the no class between 0.85 and 0.99, while difficult cases had a probability of the no class between 0.4 and 0.6. These results indicate that the class probability might be used as a measure of emotional intensity in applications where a binary label is not sufficient.

In the third set of experiments, we investigated the influence of the training size on the performance of our best models.
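The mapping from class probability to a coarse emotional-intensity label described above can be sketched as a simple post-processing step. This is an illustrative sketch only: the function name is our own, and the thresholds follow the probability ranges we observed rather than any tuned values:

```python
def emotional_intensity(p_no):
    """Map the classifier's probability of the 'no' (non-emotional) class
    to a coarse emotional-intensity label for a post.

    The ranges follow the observations reported above: easy non-emotional
    cases clustered in [0.85, 0.99], while difficult (mid-range
    emotionality) cases clustered in [0.4, 0.6].
    """
    if p_no >= 0.85:
        return "low"      # confidently non-emotional
    if 0.4 <= p_no <= 0.6:
        return "mid"      # borderline: mid-range emotionality
    if p_no <= 0.15:
        return "high"     # confidently emotional
    return "unclear"      # outside the observed clusters
```

Such a three-way label could feed a response generator where a binary emotional/non-emotional decision is too coarse.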
The macro-averaged F-score obtained in those experiments is plotted in Figure 3. As can be seen, even with only 1000 training instances, our model achieves over 92% macro-averaged F-score for emotionality detection in English.

The binary classifiers trained on smaller portions of the training data (even only 1000 instances) for the four communication characteristics also achieved very high performances, significantly outperforming the majority-class baseline in all cases except for the action-seeking detection task in Spanish and Arabic, where the models showed a bias towards the majority class (horizontal "curves" in Figure 3). We used the exact same set of parameters for all our BERT models, as we consider further tuning to be out of the scope of this work. The same problem was observed with the other neural architectures that we explored, and seems to be a reflection of the class distribution in those classification tasks and languages rather than a shortcoming of the used architectures. Due to the lack of data in German and Chinese, for some tasks we could only train up to a certain number of instances, e.g. up to 5000 when detecting emotionality in Chinese posts.
Figure 3: Macro-averaged F-score of the BERT models as a function of the training size and language (Arabic, German, English, Spanish, Chinese), compared to the majority-class baseline, with one panel per task (emotionality, fact-oriented, self-revealing, action-seeking, information-seeking).

4. Discussion

In this section, we present a general discussion of the results obtained in our three sets of experiments, the findings of the manual error analysis performed on the full test sets in English, and the envisioned usage of our framework in goal-oriented chatbots from the psycholinguistic perspective.
The comparison of the models based on different architectures and types of features led us to several conclusions. The self-revealing and information-seeking tasks seem to rely more on lexical signals, which allowed the character convolutional model to excel in that scenario. However, the character and word n-gram SVM baseline obtained very competitive results for all tasks, showing that our datasets contain the clear signals necessary for the successful detection of psycholinguistic characteristics. Finally, the BERT model achieved the highest results in all our experiments, indicating that its fine-tuned neural language model did an excellent job at detecting the lexical and semantic characteristics relevant for the evaluated tasks.

The BERT model performances showed similar trends across different languages and tasks. All the experiments outperformed the baseline by a notable margin and yielded average F-scores of around 90%. The most challenging languages were German and Arabic, where the original classes were severely unbalanced and fewer instances were available for certain tasks. For those two languages, BERT performances were below the averages across all languages, especially for the action-seeking and fact-oriented classification tasks, which were also the most challenging tasks in general.

The analysis of the effect of the training size on the BERT model performances across different languages proved once again that the more training data, the better. However, it also showed that our models perform well even when trained with small amounts of training data, i.e. 1000 instances; the increase in training sizes led to only small overall improvements. Such high performance of BERT models trained on 1000 instances is a result of the pretrained neural language model, which only needs a few hundred instances to fine-tune itself and the subsequent classification layers for our tasks.
As seen in Section 3.1, this was not the case with the other two models that we explored.

Finally, we explored the association between the class probabilities and the emotional intensity of the posts. We found that classification probabilities can be used in emotionality classification to successfully distinguish between difficult and easy cases, used as proxies for mid-range emotionality (difficult cases) and high and low emotionality (easy cases).

We performed a manual error analysis on all instances misclassified by the English models trained with BERT on the full training datasets. The majority of errors in emotionality detection seem to stem from the model's inability to recognize sarcasm and some fixed expressions which humans easily recognize as emotional, e.g. “
Uber and other companies are under fire.”, or “
It’s like an Uber for the air. Check out this travel start up.”.

In self-revealing classification, we found two types of recurrent errors, both being false positives. The first type stems from the high emotionality of the post being confused with self-revealing, e.g. “
Apple pay sucks.”. The second type stems from first-person pronouns, which in those cases do not indicate self-revealing posts, e.g. “
I sent you an email. It wouldn’t let me post it below.”. The last example was classified wrongly in both self-revealing and fact-oriented classifications (as self-revealing and as non-fact-oriented). In fact-oriented classification, we have mostly seen false negatives, where the models failed to recognize fact-oriented posts, e.g. “
For this phone, battery lasts about 20 minutes but excellent for price.”. The errors in fact-oriented and self-revealing classification most likely originate from the fact that a great majority of self-revealing posts are at the same time emotional, and a great majority of fact-oriented posts are at the same time non-emotional. Therefore, the self-revealing and fact-oriented classifiers seem to pick up noise coming from emotionality signals, due to the high imbalance of the two emotionality classes in both training datasets.

In action-seeking classification, we found that the model cannot recognize action-seeking requests that start with
I wish..., e.g. “
I wish you could change your attitude.”. However, the model correctly identifies action-seeking requests which come in a polite form using modal verbs, e.g. “
Can you please recommend me something?”. In information-seeking classification, the model was not able to recognize polite information-seeking requests hidden under a layer of modal verbs, e.g. “
I would like to know if anyone would be interested in helping.”. The other common type of error for this model was false positives for rhetorical questions, such as “
Why can’t you help me?”.

Apart from the above-mentioned patterns of misclassified instances in the action-seeking classification task, we also noticed that there are fewer false negatives in shorter posts (≤ 40 words) than in longer posts (> 40 words): 62% and 75% of all misclassified instances, respectively. In the self-revealing classification task, in contrast, we found that longer posts (> 40 words) lead to noticeably fewer false negatives (20%) than shorter posts (81%).

Figure 4: An emotional, self-revealing and information-seeking utterance
Most commercial chatbots used for customer service and troubleshooting use generic databases to create utterances without resonating with the user's psycholinguistic characteristics and communication needs, which leads to customers being unsatisfied and leaving the conversation quickly. An effective chatbot should be able to do what most humans intuitively do: adapt to the user's psycholinguistic characteristics, thus giving the impression of being interested and fully involved in the conversation with the user.

In terms of emotional tonality, an effective chatbot should respond with equal emotional tonality (emotional or non-emotional). When encountering a user with a self-revealing communication style (Figure 4), the chatbot should show its adaptive nature and refer to the user using second-person pronouns, thus acknowledging the user's shared personal experience and focusing on it. When encountering a fact-oriented communication style (Figure 5), the chatbot should keep the wording concise and based on facts.
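These adaptation rules can be sketched as a simple dispatch over the detected characteristics. The sketch below is only an illustration of the strategy described in this section; the function name, dictionary keys, and hint strings are our own, not part of any implemented system:

```python
def response_strategy(characteristics):
    """Derive response-style hints from the five detected binary
    psycholinguistic characteristics of a user's utterance.

    `characteristics` maps each characteristic name to True/False,
    e.g. {"emotional": True, "self_revealing": True, ...}.
    """
    hints = []
    # Mirror the user's emotional tonality.
    hints.append("emotional tone" if characteristics.get("emotional")
                 else "non-emotional tone")
    if characteristics.get("self_revealing"):
        # Acknowledge the shared personal experience.
        hints.append("use second-person pronouns")
    if characteristics.get("fact_oriented"):
        hints.append("concise, fact-based wording")
    if characteristics.get("action_seeking") or characteristics.get("information_seeking"):
        # Assure the user that their request will be addressed.
        hints.append("use assurance words (e.g. 'recommend', 'offer')")
    return hints
```

A dialogue manager could pass such hints to its natural-language-generation component instead of producing one generic answer style for all users.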
In the case of either action-seeking or information-seeking signals from the user's side, if the chatbot cannot immediately give the requested answer but has to ask something first (Figures 6 and 4), its answer should contain assurance words such as recommend and offer, thus confirming that the user's seeking needs will be met in the subsequent answers.

Figure 5: A non-emotional, fact-oriented, action-seeking and information-seeking utterance

Figure 6: A non-emotional and information-seeking utterance

To test our hypothesis, we obtained 50 real conversations of users with a commercial goal-oriented chatbot, and manually annotated them for user satisfaction (satisfied, neutral, or dissatisfied) based on how the user ended the conversation, considering the following scenarios:

• User finishes the conversation abruptly and on an angry note (dissatisfied)
• User finishes the conversation abruptly without any emotional marker (neutral)
• User finishes the conversation on a happy note or thanking for the conversation (satisfied)

We asked our annotators to label the users' psycholinguistic characteristics in those conversations. We further annotated the chatbot answers according to the hypothesis above, for whether they match the users' emotionality and communication styles or not. The level of matching between the chatbot's and the user's psycholinguistic characteristics was calculated as the percentage of psycholinguistic characteristics detected in the user's utterances that were adequately matched by the chatbot answers. A comparison of the users' satisfaction with the level of matching between the chatbot's and the user's psycholinguistic characteristics confirmed our hypothesis that there is a significant association between the two.
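The matching level used in this comparison can be computed directly from the annotations. A minimal sketch, assuming (as an illustrative representation of our own) that characteristics are given as sets of labels:

```python
def matching_level(user_characteristics, matched_by_chatbot):
    """Percentage of psycholinguistic characteristics detected in the
    user's utterances that were adequately matched by the chatbot answers.

    `user_characteristics`: set of characteristics detected for the user.
    `matched_by_chatbot`: set of those characteristics that the chatbot's
    answers adequately matched.
    """
    if not user_characteristics:
        return 0.0
    matched = user_characteristics & matched_by_chatbot
    return 100.0 * len(matched) / len(user_characteristics)
```

For example, a user showing emotionality, self-revealing, and information-seeking styles whose chatbot matched only the first and last would score roughly 66.7%.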
Although these results can be considered only preliminary, due to the small sample size and the single chatbot solution taken into account, they indicate the potential of our framework to be used in real-world chatbot scenarios for better user satisfaction.
5. Summary
In this study, we proposed a framework well-suited for building conversational agents that lead to better user satisfaction. The framework encompasses five psycholinguistic characteristics which are essential for successful goal-oriented human-computer and human-to-human communication according to the existing psycholinguistic literature: emotionality and four communication styles (self-revealing, fact-oriented, information-seeking, and action-seeking).

We showed that, unlike deeper personality characteristics, these five essential psycholinguistic characteristics can be modelled with high performances (up to 96% macro-averaged F-score) for various languages, even using only a very small amount of annotated data for training. We thus benchmarked the newly proposed binary classification tasks, for each of the five psycholinguistic characteristics separately, using a well-known state-of-the-art neural architecture that proved best in our model comparison.

Furthermore, we pinpointed some difficulties that can be encountered in data collection and annotation, should the same approach be taken for any other language, and presented the current limitations of our systems and their envisioned usage.

Appendix A: BERT Model Details and Parameters
Our model is based on TensorFlow (Abadi et al., 2016). The English and Chinese experiments use the pretrained uncased_L-12_H-768_A-12 and chinese_L-12_H-768_A-12 models, respectively. The experiments in other languages use the pretrained multi_cased_L-12_H-768_A-12 model. These models have 12 Transformer layers (Vaswani et al., 2017) and produce 768-dimensional embeddings. Only the multilingual model is case sensitive. We found these models to be the most adequate for each language during our prototyping phase.

The model architecture parameters are the following: maximum text length of 128, 256 LSTM units, hidden dense layers of size 256 and 128, respectively, dropout rate of 0.5, batch size of 256, and the binary cross-entropy loss. The model training parameters are the following: 15 training epochs, early stopping with a patience of two epochs, and the Adam weight optimizer (Kingma and Ba, 2014).

Acknowledgements
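The early-stopping rule listed above (patience of two epochs) can be illustrated with a small stand-alone sketch. In practice this is handled by the framework's training callbacks; the function and the validation losses in the test are our own, for demonstration only:

```python
def stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training stops, i.e. the epoch
    after which the validation loss has failed to improve for `patience`
    consecutive epochs, or the last epoch if that never happens."""
    best = float("inf")
    since_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_improvement = 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return epoch
    return len(val_losses)
```

With a patience of two, a single noisy epoch without improvement does not terminate training, but two in a row do, which bounds wasted epochs while tolerating plateaus.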
We thank Angelo Basile for his support and comments. We also thank all current and past employees of Symanto Research who were involved in earlier stages of this project and thus made this publication possible.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al., 2016. TensorFlow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI).

Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krapp, A., 2002. Structural and dynamic aspects of interest development: Theoretical considerations from an ontogenetic perspective. Learning and Instruction 12(4), 383–409.

LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444.

Lee, S.y., Lee, G., Kim, S., Lee, J., 2019. Expressing personalities of conversational agents through visual and verbal feedback. Electronics 8.

Leung, S.K., Bond, M.H., 2001. Interpersonal communication and personality: Self and other perspectives. Asian Journal of Social Psychology 4(1), 69–86.

Lin, Z., Madotto, A., Shin, J., Xu, P., Fung, P., 2019a. MoEL: Mixture of empathetic listeners. CoRR abs/1908.07687. URL: http://arxiv.org/abs/1908.07687.

Lin, Z., Xu, P., Winata, G.I., Liu, Z., Fung, P., 2019b. CAiRE: An end-to-end empathetic chatbot. CoRR abs/1907.12108. URL: http://arxiv.org/abs/1907.12108.

Ma, X., Yang, E.P., Fung, P., 2019. Exploring perceived emotional intelligence of personality-driven virtual agents in handling user challenges, in: The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, pp. 1222–1233. URL: https://doi.org/10.1145/3308558.3313400.

Omdahl, B.L., 1995. Cognitive Appraisal, Emotion, and Empathy. Lawrence Erlbaum Associates.

Paiva, A., Leite, I., Boukricha, H., Wachsmuth, I., 2017. Empathy in virtual agents and robots: A survey. ACM Trans. Interact. Intell. Syst. 7. URL: https://doi.org/10.1145/2912150.

Pennebaker, J.W., 2011. The secret life of pronouns: What our words say about us. Bloomsbury Press.

Plank, B., Hovy, D., 2015. Personality traits on Twitter—or—how to get 1,500 personality tests in a week, in: Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics, Lisbon, Portugal, pp. 92–98.

Rashkin, H., Smith, E.M., Li, M., Boureau, Y.L., 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, pp. 5370–5381.

Salton, G., Fox, E.A., Wu, H., 1983. Extended Boolean information retrieval. Communications of the ACM 26, 1022–1036.

Shin, J., Xu, P., Madotto, A., Fung, P., 2019. HappyBot: Generating empathetic dialogue responses by improving user experience look-ahead. CoRR abs/1906.08487. URL: http://arxiv.org/abs/1906.08487.

Tausczik, Y.R., Pennebaker, J.W., 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1), 24–54.

Schulz von Thun, F., 1981. Miteinander reden: Störungen und Klärungen. Psychologie der zwischenmenschlichen Kommunikation. Rowohlt Taschenbuch.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Advances in Neural Information Processing Systems, pp. 5998–6008.

Štajner, S., Yenikent, S., 2020. A survey of automatic personality detection from texts, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), pp. 6284–6295.