Understanding Meanings in Multilingual Customer Feedback

Abstract

Understanding and being able to react to customer feedback is the most fundamental task in providing good customer service. However, there are two major obstacles for international companies to automatically detect the meaning of customer feedback in a global multilingual environment. Firstly, there is no widely acknowledged categorisation (classes) of meaning for customer feedback. Secondly, the applicability of one meaning categorisation, if it exists, to customer feedback in multiple languages is questionable. In this paper, we extracted representative real-world samples of customer feedback from Microsoft Office customers in multiple languages (English, Spanish and Japanese) and arrived at a five-class categorisation (comment, request, bug, complaint and meaningless) for meaning classification that could be used across languages in the realm of customer feedback analysis.

Chao-Hong Liu

ADAPT Centre, Ireland Dublin City University chaohong.liu@adaptcentre.ie

Declan Groves

Microsoft Ireland Leopardstown, Dublin 18 degroves@microsoft.com

Akira Hayakawa, Alberto Poncelas and Qun Liu

ADAPT Centre, Ireland Trinity College Dublin Dublin City University {akira.hayakawa, alberto.poncelas, qun.liu}@adaptcentre.ie

Introduction

In this paper we discuss the results of an ADAPT-Microsoft joint research project which aims to assess the performance of the internal tools of Microsoft on multilingual customer feedback analysis. The current approach to multilingual customer feedback analysis is to translate non-English feedback into English using machine translation (MT) systems and use English-based tools for the analysis. Due to the variability of MT quality, in particular when translating user-generated content, a reduction in the ability of such tools to accurately analyse customer feedback in non-English languages is to be expected. To test this assumption, two languages, Spanish and Japanese, were considered in this assessment. In summary, the tools perform well if MT quality is good in general, as exemplified by Spanish customer feedback. The true positive rate is about 20% lower in Japanese feedback, where the lower MT quality might be one of the causes of the comparative underperformance.

The results suggest that building native tools might be necessary for languages where MT quality is not satisfactory. To do this, a corpus of annotated customer feedback for each language should be prepared, which raises two questions: 1) what categorisation of customer feedback should be used as the annotation scheme, and 2) does this categorisation apply to multiple languages? To answer these questions and to improve the ability to understand international customer feedback, we summarised directly from free-form meaning annotations on both Spanish and Japanese feedback and arrived at a five-class categorisation (comment, request, bug, complaint, and meaningless) which might be used across languages, including English, for future customer feedback analysis.

Sentiment analysis itself has commonly been used in customer feedback analysis as part of meaning-based analysis in Microsoft Office and in categorisation approaches within other companies and organisations (Salameh et al., Afli et al.). In this paper, however, we separate sentiment from what was expressed, the meanings or intentions, in customer feedback. Customer feedback analysis nowadays has become an industry in its own right; there are dozens of notable internet companies (which we refer to as ‘app companies’) who are performing customer feedback analysis for other, often much larger, companies.

The business model for these app companies is to acquire customer feedback data from their clients, perform analysis using their internal tools and provide reports to their clients periodically (Freshdesk, Burns). However, most app companies not only treat the contents of these reports as confidential material, which is understandable, but also regard things such as the set of categories they used for grouping customer feedback as business secrets.

To the best of our knowledge, there are three different openly available categorisations from these app companies. The first is the most commonly used categorisation, which can be found on many websites, i.e. the five-class Excellent-Good-Average-Fair-Poor and its variants (Yin et al., SurveyMonkey). The second is a combined categorisation of sentiment and responsiveness, i.e. a five-class Positive-Neutral-Negative-Answered-Unanswered, used by an app company (Freshdesk). The third is used by another app company called Sift and is a seven-class Refund-Complaint-Pricing-Tech Support-Store Locator-Feedback-Warranty Info categorisation (Keatext). There are certainly many other possible categorisations for customer feedback analysis; however, most of them are not publicly available (Equiniti, UseResponse, Inmoment).

In this paper, we try to answer the question of whether there is a suitable categorisation of meanings for customer feedback which could be used in multiple languages. In Section 2, we give a brief description of how we acquired the corpora of multilingual customer feedback. The observations of meanings in Japanese and Spanish customer feedback are presented in Sections 3 and 4. Section 5 details the summarised five-class categorisation which we propose to use for multilingual customer feedback analysis as a common annotation scheme in the future. Finally, the conclusions are given in Section 6.

© 2017 Chao-Hong Liu, Declan Groves, Akira Hayakawa, Alberto Poncelas and Qun Liu. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

Preparation of Customer Feedback Corpora

Microsoft Office collects customer feedback via several different channels, which is aggregated and analysed via the internal Office Customer Voice (OCV) system. OCV gathers submitted user data across 70+ languages, and implements classification and on-demand clustering, using semi-supervised techniques, together with a rich web UI to facilitate further analysis and reporting by product owners. These product owners can carry out trend analysis to identify issues that require specific attention or feature requests from users. Due to the large quantities of customer feedback received on a regular basis, manual processing is not feasible. Therefore, the ability to quickly identify meaning is key to help establish the actionability and importance of customer feedback.

The system implements supervised logistic regression to create two multi-class models, one for area and one for issue (Bentley & Batra, 2016). Additionally, inferences are provided via SysSieve, an internal inference engine (Potharaju, Jain & Nita-Rotaru, 2013), and two-class sentiment analysis via Azure ML’s sentiment classifier (https://studio.azureml.net). As mentioned previously, the system operates only on English-language feedback. For non-English feedback, which constitutes on average 58% of total monthly feedback in Microsoft Office, MT is used as a pre-processing technology provided via Microsoft’s general-domain Translator APIs.

To illustrate the classification process, Bentley & Batra (2016) provide the example of the verbatim feedback “I am having trouble saving an Excel spreadsheet with charts in it.” OCV assigns it the possible issue types of “Excel\Charts” and “Excel\Save”, but not “Excel\Print”, whereas SysSieve may infer that it relates to “Problem” and Azure ML’s sentiment classifier determines that it has a higher probability of carrying “Negative” sentiment than “Positive”. This type of coarse-grained classification is typically sufficient for a product owner to identify emerging trends and issues that require subsequent triage.
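The processing flow described above (translate non-English feedback into English, then apply English-only classifiers) can be sketched as follows. This is a minimal illustration under stated assumptions, not the OCV implementation: `Feedback`, `translate_to_english`, `classify_issue`, `analyse` and the keyword table are hypothetical names introduced here, and a keyword match stands in for the trained logistic regression models.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical keyword table standing in for a trained multi-class issue model.
ISSUE_KEYWORDS = {
    "Excel\\Charts": ("chart",),
    "Excel\\Save": ("save", "saving"),
    "Excel\\Print": ("print",),
}


@dataclass
class Feedback:
    text: str
    language: str  # e.g. "en", "es", "ja"


def translate_to_english(text: str, language: str) -> str:
    """Placeholder for the MT pre-processing step (not a real API call)."""
    raise NotImplementedError("plug an MT system in here")


def classify_issue(english_text: str) -> list[str]:
    """Toy multi-label issue classifier: keyword match instead of a model."""
    lowered = english_text.lower()
    return [
        issue
        for issue, keywords in ISSUE_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def analyse(item: Feedback, translate=translate_to_english) -> list[str]:
    """Apply the English-only classifier, translating non-English items first."""
    if item.language == "en":
        text = item.text
    else:
        text = translate(item.text, item.language)
    return classify_issue(text)


example = Feedback(
    "I am having trouble saving an Excel spreadsheet with charts in it.", "en"
)
print(analyse(example))
```

For the English example quoted from Bentley & Batra, the toy classifier returns the Charts and Save issue types but not Print. Any error introduced by the translation step is passed straight to the classifier, which is the dependency on MT quality that the assessment in this paper examines.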

Data Selection

In terms of content, for this study we made the decision to focus on feedback received via the “send-a-smile” feature, which is the prime source of in-application (“in-app”) feedback for Microsoft Office products. This allows a user to provide feedback directly from within the application, including verbatim textual feedback and screenshot information if the user wishes to do so. We chose two target languages to focus on: Spanish and Japanese. Based on previous internal qualitative evaluations, they represent contrasting languages with respect to the expected quality of MT; Spanish content typically performs very well, whereas Japanese is a more difficult language for MT.

Table I: MT Quality Scale

| Score | Adequacy | Fluency |
|---|---|---|
| 4 | All meaning of the source correctly expressed in the translation | Completely fluent. Good word choice & structure. No editing required. |
| 3 | Most of the source expressed in the translation | Almost fluent. Few errors which don’t impact the overall meaning. |
| 2 | Little of the source expressed in the translation | Not very fluent. About half the translation contains errors & requires editing. |
| 1 | No source meaning expressed in the translation | Incomprehensible. Needs to be translated from scratch. |

Taking an initial 12-month snapshot of 4,254 Japanese and 28,352 Spanish pieces of in-app feedback for the same product, we randomly sampled 2,000 items for each language (i.e. 4,000 pieces in total), ensuring that the items selected had been assigned a label by OCV’s automatic classifiers and by the inference engine (i.e. we excluded any manually labelled items or items that for any reason were not assigned a label). We subsequently had the quality of the machine-translated verbatim feedback judged by human evaluators, who assigned both fluency and adequacy scores on a scale of 1-4 (cf. Table I). We did not carry out any filtering on source quality, but it is acknowledged that typographical errors in the user input, slang, idiomatic expressions and abbreviations can all have considerable impact on the comprehensibility of the translation, which is why the MT quality for this domain may often be lower than expected.

Table 2 provides the results of the human MT quality judgements for Spanish and Japanese feedback. We performed some initial analyses to measure the impact of MT quality on classification accuracy (for area, issue and sentiment) and found that, overall and on a per-language basis, although improving MT quality does result in improvements in area, issue and sentiment classification accuracy (an overall improvement of approx. 10% was observed, on average, with Japanese benefiting more than Spanish), the impact was not significant enough to warrant the exclusion of any of the data from further analyses, i.e. we can get useful classification even with less than perfect MT output.

Table 2: MT Quality for Customer Feedback

| Language | Fluency | Adequacy | Mean |
|---|---|---|---|
| Spanish | 2.89 | 3.16 | 3.03 |
| Japanese | 2.35 | 2.46 | 2.41 |
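The Mean column in Table 2 is consistent with the arithmetic mean of the fluency and adequacy scores, rounded half-up to two decimals. The sketch below checks this under that assumption (the text does not state how the mean was computed); `SCORES` and `mean_score` are names introduced here for illustration. Decimal arithmetic is used so that values ending in 5 at the third decimal place round predictably.

```python
from decimal import Decimal, ROUND_HALF_UP

# Fluency and adequacy scores as reported in Table 2.
SCORES = {
    "Spanish": (Decimal("2.89"), Decimal("3.16")),
    "Japanese": (Decimal("2.35"), Decimal("2.46")),
}


def mean_score(fluency: Decimal, adequacy: Decimal) -> Decimal:
    """Arithmetic mean of the two scores, rounded half-up to two decimals."""
    mean = (fluency + adequacy) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


means = {language: mean_score(*pair) for language, pair in SCORES.items()}
for language, mean in means.items():
    print(f"{language}: {mean}")
```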

The feedback items were also manually labelled by human annotators for area and inference type. OCV will typically automatically assign a large number of potential area classes to each item, together with a probability score derived from the classifier indicating the likelihood that the verbatim text belongs to that class. To ensure there were no data sparsity issues, we mapped the large initial set of area classes to a smaller set of 16 (the human annotators were requested to use this smaller set). Sentiment consisted of three classes (positive, negative and neutral) and there were 7 possible inference types (e.g. “problem”, “delighter”, “suggestion”).

Analysis of Meanings in Japanese Customer Feedback

In this section, we use “meanings” instead of “Issues” for discussion purposes. A native speaker of Japanese was asked to annotate the meaning of each customer feedback item; no pre-defined taxonomy was given and the native speaker was instructed to add or modify the “meanings” if they found it necessary or more appropriate to do so. It was recommended to the annotators for both Spanish and Japanese to aim for a small set of master labels in order to mitigate the possibility of data sparseness in future analyses. The resulting taxonomy of the meanings is as follows:

1. Opinion/Comment (662)
2. Complaint (568)
3. Request (274)
4. NA/No meaning (32)
5. Appreciation (3)
6. Apology (1)
7. Sarcasm (1)

The large majority of items were annotated with multiple labels as feedback often reflects multiple meanings. It is interesting to see that feedback items intended to give ideas (opinion/comment) and to request improvements to the software comprise approximately two thirds of the items (936 mentions), while complaints relate to approximately only a third (568 mentions).

According to the native speaker, there are two clearly distinct genres in the Japanese feedback. One is from casual users, or consumers, of Microsoft Office software and the other is from task-critical users, typically representing enterprise customers. Feedback of the second genre tends to be polite and gives useful information that could be used to improve the software. This could be part of the reason why these items (opinion/comment and request) comprise the bulk of the feedback.

We also looked in detail at some of the major semantic classes. Here are the fine-grained semantic sub-classes for “Request” and “Complaint”:

1. Request:
   a. Add feature
   b. User's guide
   c. B2C communication
   d. Feature change
   e. Standardisation
   f. Compatibility/Hardware compatibility
   g. Solution to reliability problem
   h. Improvement
2. Complaint:
   a. Add feature
   b. Overall performance/Software performance
   c. Bug report
   d. UI/UI design
   e. Feature change/Feature setting
   f. B2C communication
   g. Customer service
   h. Standardisation
   i. Improvement
   j. Product concept
   k. Printout display
   l. Usability
   m. MS server (generalised as software interoperability problem)
   n. Wrong usage of Japanese (generalised as language usage problem)

This 7-class taxonomy of meanings and its fine-grained categorisation are summarised directly from Japanese customer feedback sentences in the Japanese corpus. Although the corpus is mainly for Microsoft Office products, we contend that they are general enough for customer feedback analysis for other software products.
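The proportions quoted for the Japanese labels can be checked with a short tally. This is a sketch that assumes the shares are taken over all label mentions rather than over the 2,000 sampled items, which the text does not specify; the variable and function names are introduced here for illustration.

```python
# Label counts from the Japanese taxonomy listed in this section.
JAPANESE_LABELS = {
    "Opinion/Comment": 662,
    "Complaint": 568,
    "Request": 274,
    "NA/No meaning": 32,
    "Appreciation": 3,
    "Apology": 1,
    "Sarcasm": 1,
}

total_mentions = sum(JAPANESE_LABELS.values())
# Items giving ideas or requesting improvements.
idea_mentions = JAPANESE_LABELS["Opinion/Comment"] + JAPANESE_LABELS["Request"]
complaint_mentions = JAPANESE_LABELS["Complaint"]


def share(count: int, total: int = total_mentions) -> float:
    """Fraction of all label mentions (assumed denominator)."""
    return count / total


print(f"ideas and requests: {idea_mentions} ({share(idea_mentions):.0%})")
print(f"complaints: {complaint_mentions} ({share(complaint_mentions):.0%})")
```

Under this assumption the two groups account for roughly 61% and 37% of mentions, in line with the "two thirds" and "a third" described in the text.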

Criticality in Japanese Customer Feedback

We also observed a new linguistic concept called “criticality” for the annotation of customer feedback, which applies to both “meanings” and “sentiments.” This criticality concept indicates whether the customer sees her/his problem as critical to the task at hand and as requiring attention, regardless of whether it has been addressed or not. Negative feedback in terms of sentiment might not be of “critical” importance in some cases. For example, a Japanese item which translates into English as “I need a simple manual on how it is used.” was annotated as negative sentiment with “minor” criticality.

There are only 40 items annotated as critical in meanings. Most items are not annotated with criticality values and it seems this could be a good indicator to identify which items are of interest to customer service.

1. Critical (40)
2. Medium (87)
3. Minor (72)
4. N/A (648)

The “N/A” (not applicable) label refers to those items where criticality is not expressed in the item, for example a Japanese item with the English translation “Easy to use.” It is interesting to see some critical items and their English translations.

1. Item 1742: “If there was a ‘Select File Format and Copy’ feature it would be perfect.”
2. Item 1904: “Problem solved. For months it was so slow that it cannot do anything. It is great that this is solved in the last update.”
3. Item 1977: “All right. (I am) satisfied. Editing and browsing are now running smoothly.”

Item 1742 is annotated as “request” (to add a feature) while items 1904 and 1977 are annotated as “appreciation” (on bug fixing) in meanings. The contents showed that the users are either eager to have a useful feature added or very satisfied with the improvements to the software.

Analysis of Meanings in Spanish Customer Feedback

In the annotation of Spanish customer feedback, the taxonomy used internally by OCV was likewise not exposed to the native Spanish speaker. The native speaker was free to annotate the meaning of each item as appropriate.

The native Spanish annotator noted that users in Spanish-speaking countries often use sarcasm or humour in their responses, and the feedback frequently reflects their tendency towards freedom of expression and frankness. In addition, not knowing a priori the cultural background of the users who have provided the feedback makes evaluating the politeness of sentences more complex, as different Spanish-speaking regions tend to express themselves in different ways. An example of this is how a speaker refers to other people within the context of feedback: the sentence “son los mejores” (translated as “you are the best”) would be considered as neutral in tone in Latin American countries, while in Spain it would be polite (“sois los mejores” would be more typical of how someone from Spain would express this in a more casual way).

In our opinion, Spanish feedback lends itself well to the detection and identification of problems when seeking frank opinions from customers after the launch of a new feature or software product.

There are 2,051 items in the Spanish corpus; two items are written in Catalan. The resulting taxonomy of meanings is as follows.

1. Congratulate (1243)
2. Request (420)
3. Bug (267)
4. Usability (61)
5. Complaint (28)
6. Nonsense (19)
7. Sarcasm (8)
8. Meaningless (6)

We first noticed the high proportion of “Congratulate” and saw it could be a regional and cultural phenomenon in Spanish-speaking countries. Examples of this feedback include “everything is very useful thank you” and “excellent application”. The Spanish feedback is also notable for the customers’ short responses, e.g. “I like it”, “Good” and “Simple”.

There are not many “complaint” types in Spanish feedback when compared to Japanese feedback. “Usability” and “Bug” are the native speaker’s own labels, while in the Japanese annotation these two categories are classified as part of “complaint.”

The native speaker also distinguished the concepts of “Nonsense” and “Meaningless”. In “Meaningless”, users expressed messages indicating that they need more time to give proper feedback. An example of this is “I just start using it. I will give my opinions once tried using it for one month”. In “Nonsense”, users input text that is not relevant to customer feedback, e.g. “Best regards” and “Bad don't let me go”.

Common Categorisation of Meanings for Customer Feedback

We summarised the categorisations in Table II for comparison purposes. It seems that, despite the cultural differences, meanings can be generalised for both Spanish and Japanese customer feedback into the following five classes.

1. Comment (including Congratulate, Apology and Sarcasm)
2. Request (e.g. a new feature or improvement of existing features)
3. Bug (Reporting)
4. Complaint (including Usability)
5. Meaningless (in the context of customer feedback)

Table II: Summarised Meaning Categorisation for Customer Feedback

| Common Categorisation | Native Spanish Speaker Categorisation | Native Japanese Speaker Categorisation |
|---|---|---|
| Comment | Congratulate (1243), Usability (61), Sarcasm (8) | Opinion / Comment (662), Appreciation (3), Apology (1), Sarcasm (1) |
| Request | Request (420) | Request (274) |
| Bug | Bug (267) | Bug report (185) |
| Complaint | Complaint (28) | Complaint (383) |
| Meaningless | Nonsense (19), Meaningless (6) | NA / No meaning (32) |

It should be noted that the five classes above are not necessarily exclusive. For example, a piece of feedback might be both a bug and a complaint at the same time. Secondly, the sense of “comment” is constrained by the other classes. For example, a ‘negative comment’ will be annotated as a ‘complaint’ rather than a comment.

Despite the five-class mapping suggested in this paper, it should be noted that in certain circumstances a finer-grained language-specific categorisation might still be of interest. For future work, we plan on investigating whether a larger number of language-specific finer-grained categorisation sets could be combined and generalised to adequately represent multiple languages.

Conclusions

In this paper, we addressed the problem of understanding the meanings of multilingual customer feedback. Real-world customer feedback from Microsoft Office customers was collected and analysed in three languages. Customer feedback in Spanish and Japanese was annotated by native speakers with the meanings they saw fit for the sentences in each feedback text, without any pre-defined categorisation. A five-class categorisation (i.e. comment, request, bug, complaint and meaningless) was summarised from the free-form meaning annotation, which we propose to use as a fundamental annotation scheme for meaning classification in multilingual customer feedback analysis.

For future work, we would like to train a classifier using the suggested annotation scheme and compare the performance of the new classifier against the existing OCV classification. Although we did discover that variability in MT quality did not have a significant impact on classification accuracy, we would still be interested in seeing whether improved MT quality provided by the latest Microsoft neural network MT systems impacts the classification of customer feedback.

Acknowledgements

This research is supported by the ADAPT Centre for Digital Content Technology, funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant 13/RC/2106).

References

Afli, Haithem, Sorcha McGuire, and Andy Way. Sentiment Translation for low-resourced languages: Experiments on Irish General Election Tweets. In Proceedings of the 18th International Conference on Intelligent Text Processing and Computational Linguistics, Budapest, Hungary, 2017.

Bentley, Michael and Batra, Soumya. Giving Voice to Office Customers: Best Practices in How Office Handles Verbatim Text Feedback. In IEEE International Conference on Big Data

th USENIX Symposium on Network Systems Design and Implementation (NSDI 13). pp. 127–141, 2013.

Salameh, Mohammad, Saif M Mohammad, and Svetlana Kiritchenko. Sentiment after translation: A case-study on Arabic social media posts. In Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 323–332, 2016. ACM.