An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection
A PREPRINT
Anurag Roy, Shalmoli Ghosh, Kripabandhu Ghosh, and Saptarshi Ghosh
Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
Department of Computer Science and Application, Indian Institute of Science Education and Research Kolkata, Mohanpur, India
{anu15roy, shalmolighosh94, kripa.ghosh}@gmail.com, [email protected]

ABSTRACT
A large fraction of textual data available today contains various types of 'noise', such as OCR noise in digitized documents, noise due to the informal writing style of users on microblogging sites, and so on. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised, and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that does not need any training data or human intervention. The proposed algorithm is applicable to text in different languages, and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection, as compared to that using several baseline text normalization methods. Implementation of our algorithm can be found at https://github.com/ranarag/UnsupClean.

Keywords: Data cleansing; unsupervised text normalization; morphological variants; retrieval; stance detection; microblogs; OCR noise
(This work will be appearing in ACM Journal of Data and Information Quality.)

1 Introduction

Digitization and the advent of Web 2.0 have resulted in a staggering increase in the amount of online textual data. As more and more textual data is generated online, or digitized from offline collections, the presence of noise in the text is a growing concern. Here noise refers to the incorrect forms of a valid word (of a given language). Such noise can be of two broad categories [1]: (1) machine-generated, such as those produced by Optical Character Recognition (OCR) systems, Automatic Speech Recognition (ASR) systems, etc. while digitizing content, and (2) human-generated, produced by the casual writing style of humans, typographical errors, etc. For instance, social media posts (e.g., microblogs) and SMS contain frequent use of non-standard abbreviations (e.g., 'meds' for 'medicines', 'tmrw' for 'tomorrow').

There exist plenty of important resources of textual data containing both categories of noise. With regard to machine-generated noise, old documents like legal documents, defense documents, collections produced from large-scale book digitization projects (e.g., https://archive.org/details/millionbooks), etc. are either in scanned version or in hard-copy format. Digitization of these vital collections involves OCR-ing the scanned versions. The print quality, font variability, etc. lead to poor performance of the OCR systems, which introduces substantial error in the resulting text documents. Moreover, these noisy documents can be present in many languages, which also adds to the challenge, since the OCR systems for many languages may not be well-developed. The unavailability of error-free versions of these collections makes error modelling infeasible.

For the second category of noise (user-generated), social networking sites such as Twitter and Weibo have emerged as effective resources for important real-time information in many situations [2, 3, 4].
Informal writing by users in such online media generates noise in the form of spelling variations, arbitrary shortening of words, etc. [5].

The presence of various types of noise is known to adversely affect Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning (ML) applications such as search/retrieval, classification, stance detection, and so on [6]. Hence, normalization/cleansing of such noisy text has a significant role in improving the performance of downstream NLP / IR / ML tasks. The essence of a text normalization technique lies in mapping the different variations of a word to a single 'normal' form; achieving this successfully can lead to remarkable improvement in many downstream tasks. There have been several works that have attempted text normalization, both in supervised and unsupervised ways (see Section 2 for a survey). Supervised algorithms require large training data (e.g., parallel corpora of noisy and clean text) which is expensive to obtain. This leads to a necessity for unsupervised methods, on which we focus in this paper.

There already exist several unsupervised text normalization approaches. However, most of the existing works have experimented with only one type of noise. For instance, Ghosh et al. [7] developed a method for correcting OCR errors, while others [8, 9] addressed variations in tweets. The types of errors in various text collections can be quite different – usually in tweets the error does not occur at the beginning of a word, whereas OCR errors can appear at any location in a word, which makes noisy text normalization more challenging. So, a robust text normalization system for generic noise should be adept at handling many random and unseen error patterns. But to our knowledge, no one has attempted to address different categories of noise at the same time.

In this paper, we propose a novel unsupervised, language-independent algorithm – called UnsupClean – capable of automatically identifying the noisy variants produced by any of the aforementioned categories of noise, viz. machine-generated and human-generated.

We perform extensive experiments where we compare the proposed UnsupClean algorithm with four baseline text normalization algorithms – Aspell, Enelvo [9], and the algorithms developed by Sridhar [8] and Ghosh et al. [7]. To compare the performance of the various text normalization algorithms, we focus on two practical downstream applications – retrieval/search, and stance detection. We judge the performance of a text normalization algorithm by checking the performance of state-of-the-art retrieval/stance detection models on datasets cleaned by the normalization algorithm. Experiments are performed over several standard datasets spanning two languages (English and Bengali), both containing OCR noise, as well as over microblogs containing human-generated noise. The empirical evaluation demonstrates the efficacy of our proposed algorithm over the baseline normalization algorithms – while we produce significantly better performance on OCR collections, our performance is competitive with all the baselines over the microblog collections. To our knowledge, no prior work on text normalization/cleansing has reported such detailed comparative evaluation as done in this work – over two downstream applications, and over six datasets spanning two languages and both machine-generated and user-generated noise.

To summarize, the strengths of the proposed text normalization algorithm UnsupClean are as follows: (1) The algorithm is completely unsupervised, and does not need any training data (parallel corpus), dictionaries, etc. Thus, the algorithm is suitable for application on resource-poor languages. (2) The algorithm can handle different types of noise, including both intentional misspellings by human users, as well as machine-generated noise such as OCR noise. (3) The algorithm is language-independent, and can be readily applied to text in different languages. We make the implementation of UnsupClean publicly available at https://github.com/ranarag/UnsupClean.

The remainder of the paper is organized as follows. Section 2 briefly surveys various types of text normalization/cleaning algorithms. The proposed text normalization algorithm (UnsupClean) is detailed in Section 3. The baseline text normalization algorithms with which the performance of UnsupClean is compared are described in Section 4. Next, to check the effectiveness of various text normalization algorithms, we focus on two downstream applications over noisy text – retrieval and stance detection. The experiments related to these two downstream applications are described in Section 5 and Section 6 respectively. Each of these sections describes the datasets used, the retrieval / stance detection models used, and the experimental results. Finally, we conclude the study in Section 7.
2 Related Work

Textual content can contain different types of noise [1], and a significant amount of research has been done on noisy text. Some prior works attempted to understand the effect of noise present in text data on applications such as classification [6]. Also, some works have explored the use of artificially generated noise for cleansing noisy text. For instance, Gadde et al. [10] proposed methods for artificially injecting noise in text, through random character deletion, replacement, phonetic substitution, simulating typing errors, and so on.

Several methods have been developed for the normalization of noisy text, especially microblogs [11, 12, 13]. Based on the approach taken, text normalization methods can be broadly classified into two categories:
Supervised text normalization methods:
Such approaches rely on a training dataset, which is usually a parallel corpus of canonical (correct) and noisy strings. For example, in [14], a noisy channel model was employed to perform text normalization. Incorporation of phonetic information has been shown to help text normalization, e.g., Toutanova et al. [15] included word pronunciation information in a noisy channel framework. GNU Aspell (http://aspell.net/), a popular spelling correction tool, also uses phonetic and lexical information to find normalized words.

More recently, neural models such as RNNs [16] and encoder-decoder models [17] have also been used for text normalization. Some prior works have also modeled the text normalization problem as machine translation from the noisy text to the clean version [18].

Supervised methods require huge amounts of training data in the form of a parallel corpus, i.e., pairs of noisy and clean text. Such a parallel corpus is often difficult to get, especially for low-resource languages. Even for resource-rich languages, creating a training dataset involves a lot of human intervention, thus making it a costly process. Unsupervised methods try to mitigate some of these problems.
Unsupervised text normalization methods:
These methods make use of the contextual and lexical information available from the corpora to perform normalization, without the need for any training data. For instance, Kolak et al. [19] applied a noisy channel based pattern recognition approach to perform OCR text correction. Ghosh et al. [7] used a combination of contextual and syntactic information to find the morphological variants of a word. Word embeddings (e.g., Word2vec [20]) have shown great potential in capturing the contextual information of a word, and hence word embedding based contextual information has been used in many recent works [8, 9]. For instance, Sridhar [8] used contextual information to find a set of candidate variants, which is then refined based on a lexical similarity score.

In this work, we propose an unsupervised text normalization algorithm, and show that it performs competitively with some of the methods mentioned above [8, 7].
Task-independent vs. Task-specific text normalization methods:
Text normalization methods can also be categorized into two broad categories based on whether a method is meant for a specific task. Most of the methods stated above, including spelling correction tools, are task-independent – they aim to correct the misspelled words in a corpus. On the other hand, some task-specific text normalization methods have also been proposed. For instance, Satapathy et al. [21] proposed a method for normalizing tweets specifically for sentiment analysis. Vinciarelli et al. [6] attempted normalization of noisy text specifically for categorization. Again, Ghosh et al. [7] used co-occurrence counts and edit-distance measures to find morphological variants of query words for the task of information retrieval from OCR-ed documents.

The proposed method is task-independent, and the experiments in the paper show that the data cleansing performed by the proposed method can be helpful for various tasks such as retrieval and stance detection.
3 Proposed Algorithm: UnsupClean

This section describes the proposed unsupervised text normalization algorithm, which we call UnsupClean. The implementation is publicly available at https://github.com/ranarag/UnsupClean. We start this section with an overview of our proposed algorithm, and then describe every step in detail.
We look at text normalization from the perspective of identifying morphological variations of words. To understand this objective, we first need to understand what a morpheme is. According to Morphology (the area of linguistics concerned
with the internal structure of words), a morpheme is the smallest unit of a word with a meaning. Structural variations to a morpheme or a combination of morphemes are known as morphological variations. There can be various types of morphological variations, including the plural and possessive forms of nouns, the past tense, past participle and progressive forms of verbs, and so on. Our approach attempts to identify such (and other) morphological variants of words (which are combinations of morphemes).

Intuitively, most morphological variations of a word have high string similarity with the canonical form of the word. So a string similarity based approach can be used to find candidate morphological variants of a word. However, string similarities can be misleading, e.g., the words 'Kashmiri' and 'Kashmira' have high string similarity, but one is an adjective (meaning 'belonging to the state of Kashmir, India') while the other is the name of an Indian actress. In such cases, it is useful to check the contextual similarity of the words.

An important question is how to measure contextual similarity between words. One potential way is to check co-occurrence counts of the words, i.e., how frequently the words appear in the same document (same context). However, for short-length documents (e.g., microblogs), the co-occurrence counts will mostly be zero (or very low), since two variants of the same word would usually not occur in the same microblog. An alternative method is to utilize word embeddings that capture contextual features of a word (e.g., Word2vec [20]). Hence, our algorithm uses a mixture of string similarity, word co-occurrence counts, and embeddings to find the morphological variants of words.

Table 1: Notations used in this paper.

C : corpus of documents which needs to be cleaned
L : lexicon of noisy words
L_c : lexicon of clean words
w_L : noisy word
w_Lc : clean word
Cl_wLc : cluster of morphological variants of the word w_Lc
CL : set of clusters
BLCS(w1, w2) : Bi-character Longest Common Subsequence between w1 and w2
LCSR(w1, w2) : Longest Common Subsequence Ratio between words w1 and w2
BLCSR(w1, w2) : BLCS Ratio between w1 and w2
ES(w1, w2) : Edit Similarity between words w1 and w2
α : threshold used to form the cluster of words whose BLCSR with w_Lc exceeds α
A(w_Lc) : cluster of words having lexical similarity with w_Lc greater than α
G(w) : graph with words as nodes and edges weighted with the contextual similarity amongst them
β : threshold used to prune the graph G
γ : threshold used in Ghosh et al. [7]
n : threshold used in Enelvo [9]

Table 1 lists the notations used in this paper. Our proposed model considers as input a corpus C of noisy documents (that need to be cleaned). Let L be the set of all distinct words contained in all the documents in C. Let L_c be a lexicon of words with correct spellings, and let w_Lc ∈ L_c be a correct word. For each word w_Lc ∈ L_c, our model finds a set Cl_wLc of morphological variants of w_Lc, where every element of Cl_wLc is a term in L. To find the set Cl_wLc of morphological variants of w_Lc, our model employs the following four steps: (1) First, we use structural/lexical similarity measures to generate a set of candidate morphological variants for w_Lc. (2) Then we construct a similarity graph over the candidate variants, based on the similarity between word embeddings and co-occurrence counts of words. (3) We fragment the graph into clusters of similar morphological variants, using a graph clustering algorithm [22]. This step forms clusters of words which have both high semantic similarity and high syntactic similarity with w_Lc. (4) Lastly, we choose as Cl_wLc that cluster of morphological variants which contains the member with the highest structural similarity with w_Lc. We then merge Cl_wj into Cl_wLc for every word w_j belonging to Cl_wLc, and all words in Cl_wLc are subsequently replaced by w_Lc. This results in non-overlapping clusters of morphological variants; finally, all words belonging to a particular cluster are replaced by the same word.

Figure 1 shows an overview of our proposed algorithm UnsupClean. We explain each of the above steps in the following subsections.
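The four steps above can be sketched in miniature. The following is our own simplified illustration, not the actual implementation: the `similarity` callable stands in for the BLCSR/embedding machinery defined below, and connected components stand in for the Louvain clustering of Step 3; all function names are ours.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance with a rolling one-row table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def cluster_variants(clean_word, candidates, similarity, threshold):
    """Steps 2-4 in miniature: build a similarity graph over the candidates,
    split it into connected components (a toy stand-in for Louvain communities),
    and return the component containing the candidate lexically closest to
    clean_word (Step 4's minimum-edit-distance selection)."""
    # Step 2: graph as an adjacency map; edge iff similarity > threshold.
    adj = {w: set() for w in candidates}
    for u, v in combinations(candidates, 2):
        if similarity(u, v) > threshold:
            adj[u].add(v)
            adj[v].add(u)
    # Step 3: connected components via DFS.
    comps, seen = [], set()
    for w in candidates:
        if w in seen:
            continue
        comp, stack = set(), [w]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    # Step 4: pick the component whose best member is lexically closest.
    return min(comps, key=lambda c: min(edit_distance(clean_word, w) for w in c))
```

For instance, with candidates {'cooool', 'cool1', 'kashmira'} for the clean word 'cool' and a similarity that links the two 'cool'-like variants, the cluster {'cooool', 'cool1'} is returned.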
For a given word w_Lc (that is considered to be taken from the lexicon of correct words L_c), we first construct a set of morphological variants A(w_Lc), which should include all possible noisy variants of w_Lc that occur in the given corpus. We observed that the various noisy variants of words can be broadly classified into the following categories:

(1) Vowels are often added/deleted from words, especially in social media posts. E.g., 'cool' is written as 'cooool', and 'thanks' is written as 'thnks'.

(2) Platforms like microblogging sites (e.g., Twitter, Weibo) allow very short posts, with a hard limit on the number of characters. On such platforms, users shorten words arbitrarily, omitting not only vowels but also many consonants. For instance, 'tomorrow' is shortened as 'tmrw', and 'medicines' as 'meds'. Even named entities can be shortened, e.g., 'Kathmandu' as 'ktm' [5].

(3) People sometimes misspell words due to the phonetic similarity between the correct word and the misspelled word.
Figure 1: An overview of the proposed algorithm UnsupClean.
For example, the phrase 'smell like men's cologne' can be misspelled as 'smell like men's colon'.

(4) Different from the types of user-generated noise described above, there is also machine-generated noise (e.g., OCR noise) where spelling errors can occur in any part of the word (including the first character), whereas in case of human-generated variants, the first few characters usually match with those of the correct word. Also, in case of OCR noise, there is a high chance of the correct word not being present in the corpus at all, while a corpus containing user-generated errors usually contains both the clean and noisy variants of words.

To capture all these different types of noisy variants, we include in A(w_Lc) every word whose Bi-character Longest Common Subsequence Ratio (BLCSR) similarity with w_Lc is greater than a threshold α ∈ (0, 1). Formally:

A(w_Lc) = { w_L | BLCSR(w_L, w_Lc) > α }    (1)

where BLCSR is the Bi-character Longest Common Subsequence Ratio [23] based string similarity measure between two strings, and α ∈ (0, 1) is a hyper-parameter. BLCSR is LCSR computed over bi-grams of characters:

BLCSR(w1, w2) = BLCS(w1, w2) / (maxlength(w1, w2) − 1)    (2)

where BLCS(w1, w2) is the Longest Common Subsequence considering bi-grams of characters, and maxlength(w1, w2) − 1 is the number of character bi-grams in the longer word. For example, if w1 is 'ABCD' and w2 is 'ACD', the bi-grams of w1 are (AB, BC, CD) and those of w2 are (AC, CD); their longest common bi-gram subsequence is (CD), so the BLCS value is 1. Once A(w_Lc) is formed, we remove the word w_Lc itself from A(w_Lc).

As stated earlier in this section, a string similarity based measure alone is not sufficient for capturing the morphological variations of words. Hence, we also consider the contextual similarity between words. We employ two ways of measuring contextual similarity between words – (i) observing co-occurrence counts, which works better for longer documents (that are more likely to contain multiple variants of the same word), and (ii) computing similarity between word embeddings, which works better for short documents such as microblogs (since embeddings tend to lose the context over long windows).
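As a concrete illustration of Eq. (2), the following is a minimal implementation of BLCS and BLCSR (our own sketch, not the authors' code); it reproduces the 'ABCD' / 'ACD' example above.

```python
def bigrams(word):
    """Character bi-grams of a word, e.g. 'ABCD' -> ['AB', 'BC', 'CD']."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def blcs(w1, w2):
    """Longest common subsequence length over the bi-gram sequences."""
    a, b = bigrams(w1), bigrams(w2)
    # Classic LCS dynamic program over the two bi-gram sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def blcsr(w1, w2):
    """BLCS normalized by (max word length - 1), i.e. the bi-gram count
    of the longer word, as in Eq. (2)."""
    return blcs(w1, w2) / (max(len(w1), len(w2)) - 1)
```

For 'ABCD' and 'ACD', blcs returns 1 and blcsr returns 1/3; for identical words, blcsr is 1.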
Let emb(w_i) denote the embedding of the word w_i. Specifically, we obtain the word embeddings using Word2vec [24] over the given corpus. Word2vec is a neural network based word-embedding generation model [24]. Unlike one-hot vectors for words, word2vec embeddings capture some contextual information – two words having similar word2vec embeddings have been used in similar contexts in the corpus on which Word2vec has been trained. For instance, after training word2vec on the 'TREC 2005 confusion' dataset, words like 'program', 'programs', 'program1', and 'program:s' have high cosine similarity amongst their embeddings. Here, a word w_i is considered to be used in a similar context as another word w_j if w_i frequently appears 'nearby' w_j, i.e., within a fixed context window around w_j. Another advantage of word2vec embeddings over one-hot vectors is that the size of these embeddings does not increase with the vocabulary. Two different versions of the Word2vec model exist depending on how the embeddings are generated – (1) skip-gram, and (2) Continuous Bag of Words (CBoW). For our algorithm, we used Word2vec with the skip-gram approach and a context window of size 3. We chose skip-gram over CBoW since it generates better word embeddings for infrequent words. Also, we kept the context window size 3 so as to ensure good word embeddings for infrequent noisy words.

We construct a graph G(w_Lc) with all words from the set A(w_Lc) as the nodes. We consider an edge between two nodes (words) w_i, w_j ∈ A(w_Lc) only if the cosine similarity between the embeddings emb(w_i) and emb(w_j) is greater than a threshold β. We calculate the β value as a weighted average of the cosine similarities between the embeddings of all word-pairs in the set A(w_Lc), with the BLCSR value between the words as the corresponding weights.
That is:

β = Σ_{w_i, w_j ∈ A(w_Lc)} BLCSR(w_i, w_j) · cosineSim(emb(w_i), emb(w_j)) / Σ_{w_i, w_j ∈ A(w_Lc)} BLCSR(w_i, w_j)    (3)

This is done to find a threshold which takes both the syntactic similarity and the semantic similarity into consideration. If cosineSim(emb(w_i), emb(w_j)) is greater than β, the word-embedding model is highly confident of w_i being a morphological variant of w_j (and vice versa), and hence we connect the two words with an edge. We weight the edge between the two words w_i, w_j with weight W_ij defined as:

W_ij = cosineSim(emb(w_i), emb(w_j)) × Cooccur(w_i, w_j)    (4)

where Cooccur(w_i, w_j) is the co-occurrence count of the words w_i and w_j in the same document (averaged over all documents in the corpus). We take the product of the two similarities since the cosine similarity of word embeddings captures similarity of the words in a local context (e.g., in a context window of 3 words) and the co-occurrence count captures their similarity in a global context (of a whole document).

In the graph G(w_Lc), edges exist only between those nodes (words) which have at least a certain amount of similarity (cosineSim(emb(w_i), emb(w_j)) > β). Next, we want to identify groups of words which have a high level of similarity among themselves. To this end, community detection or graph clustering algorithms can be used to identify groups of nodes that have 'strong' links (high-weight edges) among them. There are many graph clustering algorithms in the literature [22]. We specifically use the popular Louvain graph clustering algorithm [25]. This algorithm functions on the principle of maximizing a modularity score [26] for each cluster/community, where the modularity quantifies the quality of an assignment of nodes to clusters, by evaluating how much more densely connected the nodes within a cluster are, compared to how connected they would be in a random network.
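Eqs. (3)-(4) and the Louvain step can be sketched as follows. This is our own illustrative sketch, not the paper's implementation: it assumes networkx (version 3 or later, which ships `louvain_communities`), and the `blcsr` and `cooccur` callables are placeholders for the measures defined above.

```python
from itertools import combinations
import networkx as nx

def cosine_sim(u, v):
    """Cosine similarity of two dense vectors (e.g. word2vec embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

def build_and_cluster(candidates, emb, blcsr, cooccur):
    """Compute the data-driven threshold beta (Eq. 3), build the similarity
    graph with edge weights W_ij (Eq. 4), and cluster it with Louvain."""
    pairs = list(combinations(candidates, 2))
    # Eq. (3): BLCSR-weighted average of pairwise cosine similarities.
    num = den = 0.0
    for u, v in pairs:
        w = blcsr(u, v)
        num += w * cosine_sim(emb[u], emb[v])
        den += w
    beta = num / den if den else 0.0
    # Graph with an edge only where cosine similarity exceeds beta;
    # edge weight is cosineSim * co-occurrence (Eq. 4).
    G = nx.Graph()
    G.add_nodes_from(candidates)
    for u, v in pairs:
        cs = cosine_sim(emb[u], emb[v])
        if cs > beta:
            G.add_edge(u, v, weight=cs * cooccur(u, v))
    return nx.community.louvain_communities(G, weight="weight", seed=0)
```

With three toy candidates where two share near-identical embeddings, the two similar words end up in one community and the dissimilar word forms its own.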
Thus, it can be expected that all the words in a particular cluster (as identified by the Louvain algorithm) have strong semantic similarity amongst them. (An implementation of Word2vec is available at https://code.google.com/archive/p/word2vec/.)

Out of the clusters identified from G(w_Lc), we now need to find the cluster Cl_wLc whose members are most likely to be the morphological variants of the given word w_Lc. We consider Cl_wLc to be that cluster which contains the word with the minimum edit distance from the word w_Lc (ties broken arbitrarily). Thus, for a given word w_Lc, we identify a set Cl_wLc of morphological variants. Some examples of the clusters identified by the proposed algorithm are shown later in Table 6 and Table 7.

4 Baseline Text Normalization Methods

We compare the performance of our proposed text normalization algorithm with those of several baseline normalization methods that are described in this section. Since the implementations of these baselines are not publicly available
(except Aspell and Enelvo), we implemented the algorithms following the descriptions in the respective papers, with some modifications as described below. Code for Enelvo is publicly available, but as a normalizer package for the Portuguese language; hence we implemented the Enelvo algorithm as well.

Figure 2: An overview of GNU Aspell.
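The baseline methods described next score candidate pairs using lexical measures built from the longest common subsequence and the edit distance. A minimal sketch of these primitives (our own illustration; the function names are ours):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcsr(a, b):
    """Longest Common Subsequence Ratio: LCS / max word length."""
    return lcs_len(a, b) / max(len(a), len(b))

def edit_distance(a, b):
    """Levenshtein distance with a rolling one-row table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]
```

For instance, Sridhar's score (Eq. 6 below) divides LCSR by the edit distance: for 'tmrw' vs 'tomorrow', LCSR is 4/8 = 0.5 and the edit distance is 4, giving a score of 0.125.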
Aspell is an open-source spell checking library which uses a combination of Lawrence Philips' Metaphone search algorithm [27] and Ispell's near-miss strategy. Given a (misspelled) word, the algorithm first finds all the metaphones of the word. It then finds all the words that are within edit distance 1 or 2 from any of the metaphones of the word. Figure 2 shows a flowchart describing the functioning of Aspell. We used the GNU implementation of Aspell with the Python wrapper available at https://github.com/WojciechMula/aspell-python. For the RISOT dataset (in Bengali) that we use later in the paper, we used the Aspell Bengali dictionary available at https://ftp.gnu.org/gnu/aspell/dict/0index.html.

Unlike a spell-checker like Aspell, the approach by Sridhar [8] uses both contextual information and structural information for text normalization. Figure 3 gives a schematic flowchart describing the method. The approach involves a lexicon L_c of correct words (i.e., w_Lc ∈ L_c) such that there exists a word embedding for each w_Lc. The noisy versions of each w_Lc are found from the corpus as:

noisy-version[w_Lc] = argmax-k_{w_L ∈ L, w_L ∉ L_c} cosineSim(emb(w_Lc), emb(w_L))    (5)

where w_L is a noisy word from the corpus C with lexicon L. For each w_Lc, the k = 25 nearest neighbors are found and stored in the map noisy-version. Note that only contextual similarity among word embeddings is considered for identifying noisy versions in this initial step.

Now, for a given noisy word w_L ∈ L, a list of all those w_Lc is obtained for which noisy-version[w_Lc] contains w_L. The words w_Lc in this list are scored, where the score is calculated as:

score(w_Lc, w_L) = LCSR(w_L, w_Lc) / ED(w_L, w_Lc)    (6)
Figure 3: The overview of the algorithms of both Sridhar [8] and Enelvo [9]. The only difference between the two algorithms is the way in which the scores are calculated.

where ED(w_L, w_Lc) is the edit distance between w_L and w_Lc, and LCSR(w_L, w_Lc) is the Longest Common Subsequence Ratio [23], calculated as:

LCSR(w_L, w_Lc) = LCS(w_L, w_Lc) / maxlength(w_L, w_Lc)    (7)

where LCS(w_L, w_Lc) is the Longest Common Subsequence length between the words w_L and w_Lc.

Finally, the w_Lc with the highest score (as computed above) is considered to be the normalized version of the noisy word w_L. Note that the computation of the score considers only the lexical similarity between words. The integration of contextual similarity with lexical similarity is achieved by first selecting the w_Lc words based on contextual similarity with w_L, and then computing the lexical similarity score.

In the original paper [8], the author created L_c (the lexicon of correct words) by training a word embedding model on some randomly selected sentences from the corpus which were manually normalized by a professional transcriber. Since our objective is to not use human resources, we chose pre-trained word vectors to construct L_c. Specifically, for the RISOT dataset (in Bengali), we used pre-trained word2vec word vectors trained on Bengali Wikipedia data (available at https://github.com/Kyubyong/wordvectors), and for the English / microblog datasets, we used pre-trained Word2Vec word vectors trained on the English Google News data (available at https://github.com/eyaler/word2vec-slim).

This method [9] is inherently similar to that of Sridhar [8]. Figure 3 gives a schematic flowchart describing both these methods. Enelvo involves a lexicon L_c of correct words (i.e., w_Lc ∈ L_c) with frequency at or above a threshold in the given corpus. The noisy versions w_L ∈ L (the lexicon of the given corpus) of the word w_Lc are then found as:

noisy-version[w_Lc] = argmax-k_{w_L ∈ L, w_L ∉ L_c} cosineSim(emb(w_Lc), emb(w_L))    (8)

where w_Lc is the correct word from the lexicon L_c, and w_L is the noisy word from the lexicon L of the corpus C. For each w_Lc, the top k = 25 most similar noisy words w_L ∈ L are found and stored in the map noisy-version. This step is similar to the first step in the Sridhar baseline [8], and considers only contextual similarity among word embeddings.
Next, for a given noisy word w_L ∈ L, a list of all those w_Lc is chosen for which noisy-version[w_Lc] contains w_L. The words w_Lc in this list are then scored. The score is computed as:

score(w_L, w_Lc) = n × lexical-similarity(w_L, w_Lc) + (1 − n) × cosineSim(emb(w_L), emb(w_Lc))    (9)

where n ∈ (0, 1) is a hyperparameter, and lexical-similarity(w_L, w_Lc) is measured as:

lexical-similarity(w_L, w_Lc) = LCSR(w_L, w_Lc) / MED(w_L, w_Lc), if MED(w_L, w_Lc) > 0; LCSR(w_L, w_Lc), otherwise    (10)

Here MED(w_L, w_Lc) = ED(w_L, w_Lc) − DS(w_L, w_Lc) is the modified edit distance between w_L and w_Lc, where ED(w_L, w_Lc) is the edit distance and DS(w_L, w_Lc) is the diacritical symmetry between w_L and w_Lc (a feature that is useful in Portuguese and some other non-English languages – see [9] for details). LCSR(w_L, w_Lc) is the Longest Common Subsequence Ratio, calculated as:
LCSR(w_L, w_Lc) = (LCS(w_L, w_Lc) + DS(w_L, w_Lc)) / maxlength(w_L, w_Lc)    (11)

where LCS(w_L, w_Lc) is the Longest Common Subsequence of the words w_L and w_Lc. Hence, the computation of the score integrates both lexical similarity and contextual similarity, which is a point of difference from the Sridhar baseline [8]. Finally, the w_Lc with the highest score is chosen as the normalized version of w_L.

Thus, the integration of contextual similarity and lexical similarity is achieved in Enelvo as follows – first, the top k noisy words w_L are chosen for each clean word w_Lc using contextual similarity. Then, for a given noisy word w_L, the best clean word w_Lc is selected using a scoring function that is a weighted combination of contextual similarity and lexical similarity.

For the embeddings, the authors of [9] trained a Word2Vec skip-gram model on a conglomeration of two corpora – Twitter posts written in Portuguese and a product review corpus containing both noisy and correct texts [28]. Since our datasets are in English and Bengali, we used pre-trained embeddings for the correct words w_Lc ∈ L_c, and trained word2vec models on the respective noisy corpora for the embeddings of the noisy words w_L ∈ L. For the embeddings of w_Lc in the RISOT dataset, we used pretrained Word2Vec embeddings of the Bengali Wikipedia dataset (available at https://github.com/Kyubyong/wordvectors). Similarly, for the English and microblog datasets, we used pretrained word embeddings of the Google English News dataset (available at https://github.com/eyaler/word2vec-slim).

Ghosh et al. [7] employ a five-step approach to find the morphological variants of a word (see Figure 4). The first step, called Segregation, involves the removal of words with string similarity ≤ α (where α ∈ (0, 1)) with the correct word w_Lc (or query word), resulting in the formation of A(w_Lc).
The next step is Graph Formation, where a graph G = (V, E) is formed, with the words in A(w_{L_c}) as the vertices V, and edge weights given by the co-occurrence counts between the corresponding words. Next comes the Pruning step, in which the graph G is pruned. Pruning is carried out only if the maximum edge-weight (max_ew) in G is greater than a user-specified threshold γ. In this step, edges whose weights are less than β% of max_ew are removed from G, resulting in a graph G_r = (V, E′) where E′ ⊂ E. The next step, called Congregation, further fragments the graph G_r into clusters based on the edge-weights. Two vertices v1 and v2 belong to the same cluster if they satisfy either of the following conditions: (i) v1 is the strongest neighbour of v2, or (ii) v2 is the strongest neighbour of v1. This results in the formation of a set of k clusters CL = {Cl_1, Cl_2, ..., Cl_k}. The Congregation step is followed by the final step, called Melding. In this step, one cluster from the set CL is chosen as the cluster of morphological variants of the query word w_{L_c}: the cluster containing the word lexically most similar to w_{L_c}. The lexical similarity is measured using Edit Similarity (ES), an edit-distance based similarity metric, having the formula:

ES(w_1, w_2) = 1 − ED(w_1, w_2) / max(stringLength(w_1), stringLength(w_2))   (12)

where ED(w_1, w_2) is the edit-distance between the two words w_1 and w_2.

Difference of proposed algorithm with baselines:
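As an illustration, the pruning, congregation, and melding steps described above can be sketched as follows. This is a simplified sketch, not the reference implementation of [7]; the function names, the union-find formulation of congregation, and the toy edge-weights below are ours.

```python
def edit_distance(a, b):
    # rolling-row dynamic-programming Levenshtein distance
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def edit_similarity(w1, w2):
    # Eqn (12): ES = 1 - ED / max(string length)
    return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))

def prune(edges, beta, gamma):
    # Drop edges lighter than a beta fraction of the maximum edge-weight,
    # but only when that maximum exceeds the user threshold gamma.
    max_ew = max(edges.values())
    if max_ew <= gamma:
        return dict(edges)
    return {e: w for e, w in edges.items() if w >= beta * max_ew}

def congregate(vertices, edges):
    # Union two vertices whenever one is the strongest (highest-weight)
    # neighbour of the other; the resulting components are the clusters.
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for v in sorted(vertices):
        nbrs = {}
        for (x, y), w in edges.items():
            if x == v:
                nbrs[y] = max(nbrs.get(y, 0), w)
            elif y == v:
                nbrs[x] = max(nbrs.get(x, 0), w)
        if nbrs:
            parent[find(max(nbrs, key=nbrs.get))] = find(v)
    clusters = {}
    for v in vertices:
        clusters.setdefault(find(v), set()).add(v)
    return list(clusters.values())

def meld(clusters, query):
    # Pick the cluster holding the word lexically closest to the query word.
    return max(clusters, key=lambda cl: max(edit_similarity(w, query) for w in cl))
```

For example, with co-occurrence edges linking `color`/`colour`/`colr` strongly and `weather`/`wether` strongly, pruning removes a weak cross-edge, congregation yields two clusters, and melding on the query `colors` returns the `color` cluster.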
The baseline normalization algorithms discussed in this section are significantly different from the proposed algorithm UnsupClean. Sridhar and Enelvo both use contextual and lexical similarity, as does UnsupClean; however, the approach for integrating contextual similarity and lexical similarity is
Figure 4: An overview of the algorithm by Ghosh et al., derived from the original paper [7].

different in
UnsupClean as compared to Sridhar and Enelvo. Additionally, neither Sridhar nor Enelvo is a graph-based method, whereas UnsupClean relies on forming a similarity graph and then clustering the graph. The Ghosh baseline is most similar to UnsupClean: both use contextual as well as lexical similarity, and both are graph-based methods. But importantly, Ghosh measures contextual similarity using only co-occurrence statistics, whereas UnsupClean uses a combination of co-occurrence statistics and word embeddings. The advantage of using word embeddings will be clear from our experiments described in the subsequent sections.

In the next two sections, we compare the performance of the proposed UnsupClean text normalization algorithm with that of the baseline normalization algorithms described in this section. To this end, we consider two downstream applications: retrieval (Section 5) and stance detection (Section 6).
In this section, we evaluate the performance of a text normalization algorithm by observing the performance of a retrieval (search) system over a noisy corpus that is normalized by the algorithm.

For evaluation of the various text normalization methods, we experimented with the following three datasets, each having a different source of noise. The statistics of the datasets are given in Table 2.

(1) RISOT Bengali OCR Data: This dataset was released in the FIRE RISOT track. The text versions of 62,825 news articles of a premier Bengali newspaper, Anandabazar Patrika (2004–2006), were scanned and OCRed to create the collection. The dataset also contains 66 queries, and their sets of relevant documents (gold standard).

(2) TREC 2005 Confusion Track Data: This dataset was created for the TREC Confusion track 2005 [29]. The dataset consists of the text versions of Federal Register English documents (of the year 1994), which were scanned and
Dataset | No. of documents | No. of queries
FIRE RISOT Bengali | 62,825 | 66
TREC Confusion 2005 | 55,600 | 49
TREC Microblog 2011 | 16,074,238 | 50

Table 2: Description of the retrieval datasets used for empirical comparison of the proposed and baseline text normalization methods.
OCR-ed. We use the 5% version for our experiments, where the approximate character error rate is 5%. Further details can be found in [29].

(3) TREC 2011 Microblog Track Data: We consider an English tweet collection, viz. the TREC Microblog track collection 2011 (https://trec.nist.gov/data/microblog2011.html), containing noise produced by the cavalier writing style of human users. Unlike the other two datasets, this is a much larger collection (containing over 16 million tweets), and is also expected to test the scalability of any text normalization method.

Thus, we use three noisy collections having different types of errors (OCR errors, errors due to users’ writing styles) in two different languages (English and Bengali) to test the efficacy and language independence of the various text normalization methods. For brevity, we refer to the datasets as RISOT, Confusion, and Microblog respectively.
We used the popular Information Retrieval system Indri, which is a component of the Lemur Toolkit, for the retrieval experiments. Indri is a language-independent retrieval system that combines language modeling and inference network approaches [30]. We used the default settings of Indri for all experiments.

A particular corpus is first normalized using a normalization algorithm (either UnsupClean, or one of the baselines Sridhar, Enelvo, Ghosh, and Aspell), and the normalized corpus is then indexed using Indri. The queries are also normalized using the same normalization algorithm, and then the retrieval is carried out.
The proposed algorithm has one hyperparameter α (the similarity threshold for BLCSR). We obtained the best value for α via grid search in the range [0., .] with a step size of 0.. The best values are α = 0. for RISOT, α = 0. for Confusion, and α = 0. for the Microblog dataset.

Ghosh et al. [7] has three hyper-parameters, namely α, β and γ. We followed the same methodology as stated in the original paper [7] to set the parameters: we varied α and β while keeping the γ value equal to 50. We obtained the best values for α and β via grid search in the range [0., .] with a step size of 0.. The best values are α = 0., β = 0. for RISOT, α = 0., β = 0. for TREC 2011 Microblog, and α = 0., β = 0. for the Confusion dataset.

Both Sridhar [8] and Enelvo [9] have a hyper-parameter k (the number of nearest neighbors of a particular word w that are considered as potential variants of w). Both these works directly specify k = 25 without any tuning; hence, we adhered to the value k = 25 provided by the authors in their papers. Enelvo has another hyper-parameter n that specifies the relative significance of contextual similarity and lexical similarity while computing score; we used n = 0., the value used by the authors in the original paper.

As stated earlier, to evaluate the performance of a text normalization algorithm, we measure the performance of retrieval on the noisy datasets (described in Section 5.1) cleaned using the normalization algorithm. We use standard metrics to evaluate the retrieval performance, as follows. For the RISOT and Microblog datasets, we measure Recall@100 (at rank 100), and Mean Average Precision at ranks 100 and 1000 (mAP@100 and mAP@1000). The TREC 2005 Confusion dataset, however, has only one relevant document per query. Hence, it is more common to report the Mean Reciprocal Rank (MRR) metric for this dataset.
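As an illustration of the metrics above, they can be computed directly from a ranked result list and a set of relevant document identifiers. This is a simplified sketch (in the actual experiments, Indri produces the rankings and standard TREC tooling computes the scores):

```python
def recall_at_k(ranked, relevant, k=100):
    # fraction of all relevant documents found in the top-k results
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision_at_k(ranked, relevant, k=1000):
    # mean of the precision values at each rank where a relevant document
    # appears, normalised by the total number of relevant documents
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant)

def mean_reciprocal_rank(all_ranked, all_relevant):
    # 1/rank of the first relevant document, averaged over all queries;
    # natural when each query has a single relevant document (Confusion)
    total = 0.0
    for ranked, relevant in zip(all_ranked, all_relevant):
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / i
                break
    return total / len(all_ranked)
```

mAP over a query set is then simply the mean of `average_precision_at_k` across queries.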
Normalization Algorithm | Recall@100 | mAP@100 | mAP@1000
Raw data (without any cleaning) | 50.08% | 17.21% | 18.50%
Ghosh et al. [7] | 54.48% | 19.48% | 20.98%
Enelvo [9] | 49.24% | 17.04% | 18.37%
Sridhar [8] | 47.75% | 17.38% | 18.76%
Aspell | 24.40% | 5.43% | 5.58%
UnsupClean (Proposed) | ^{E,S,A} | ^{E,S,A} | ^{E,S,A}

Table 3: Retrieval performance on the RISOT Bengali OCR dataset, using the Indri IR system. Best performances are in bold font. Superscripts E, S, A indicate that the proposed method is statistically significantly better than Enelvo, Sridhar and Aspell respectively.
Normalization Algorithm | Recall@100 | mAP@100 | mAP@1000
Raw data (without any cleaning) | 37.63% | 16.13% | 19.77%
Ghosh et al. [7] | 37.62% | 15.97% | 19.66%
Enelvo [9] | 38.79% | 15.84% | 19.39%
Sridhar [8] | | |
UnsupClean (Proposed) | 39.23% | |

Table 4: Retrieval performance on the TREC 2011 Microblog dataset, using the Indri IR system. Best performances are in bold font (none of the differences in performance are statistically significant).

5.5 Results
Table 3 compares the retrieval performance using different text normalization algorithms (the proposed method and the baselines) for the RISOT Bengali dataset. Also shown are the retrieval performances over the raw data, i.e., without any normalization. Similarly, Table 4 shows the results for the TREC 2011 Microblog dataset, while Table 5 shows results for the TREC 2005 Confusion dataset. The tables also report results of statistical significance testing: the superscripts A, E, G, S indicate that the proposed method is statistically significantly better at the 95% confidence level (p < 0.05) than Aspell, Enelvo, Ghosh et al., and Sridhar respectively.

For the Microblog dataset (see Table 4), the proposed UnsupClean performs competitively with the baselines. Retrieval using the Sridhar baseline achieves higher Recall@100, while retrieval using UnsupClean achieves higher mAP@100 and mAP@1000 scores. None of the differences are statistically significant.

For the other two datasets, which primarily contain OCR noise (see Table 3 and Table 5), UnsupClean enables much better retrieval than the baselines. Especially for the RISOT dataset (in the Bengali language), the baselines Enelvo, Sridhar and Aspell perform very poorly. In fact, in most cases, retrieval performance is better over the raw RISOT data than after normalization by these three baselines. In contrast, UnsupClean leads to significant improvement in retrieval performance. Hence, the proposed model can generalize to different types of noise as well as different languages, better than most of the baselines.

Out of all the text normalization algorithms, retrieval using Aspell performs most poorly. In general, spelling correction algorithms such as Aspell are unable to segregate words with high string similarity but low semantic similarity
Normalization Algorithm | Mean Reciprocal Rank
Raw data (without any cleaning) | 62.93%
Ghosh et al. [7] | 63.78%
Enelvo [9] | 55.20%
Sridhar [8] | 54.36%
Aspell | 38.93%
UnsupClean (Proposed) | ^{E,S,A}

Table 5: Retrieval performance on the TREC 2005 Confusion dataset, using the Indri IR system. Best results are in bold font. Superscripts E, S, A indicate the same as in Table 3.
Word | UnsupClean | Ghosh | Enelvo | Sridhar
release | release, released, releases | release | release, realease, relea | releasing, released, realease, release, releases, relea
organic | organic, organik, organi | organic | turmeric, organic | organic
msnbc | msnbc | msnbc | cbsnews, tpmdc, msnbc, foxnews, nbcnews | msnbc, nbcnews, cbsnews, tpmdc, masrawy

Table 6: Resulting clusters of morphological variants of some specific query-words, on the TREC 2011 Microblog dataset.
Word | UnsupClean | Ghosh | Enelvo | Sridhar
program | programt, 44program, programs, program1, program, programs1, sprogram, program11, program, programs | program | programs1, program1, programs, prodects | programs, program, programs1
document | document1, documents, documents1, locument1, locuments, locument, document, locuments1 | documents, document | document, forepart, locument, document1, documents | locument, documents, document1, document

Table 7: Resulting clusters of morphological variants of some specific query-words, on the TREC 2005 Confusion dataset.

(‘industrious’ and ‘industrial’), or homonyms (such as ‘living’ and ‘leaving’), etc. Such variants can be identified by the proposed algorithm, as well as by the other baselines which use context information in the identification of variants.

In the rest of this section, we report some insights on how the proposed UnsupClean model differs from the other baselines Sridhar [8], Enelvo [9] and Ghosh et al. [7]. For this analysis, Table 6 and Table 7 show some examples of morphological variants identified by these algorithms for the same word.
UnsupClean vs Sridhar/Enelvo:
Both Sridhar and Enelvo first consider the contextual similarity for identifying a candidate set of morphological variants (see Eqn. 5 and Eqn. 8 in the descriptions of Sridhar and Enelvo respectively). They initially ignore the structural/lexical similarity between words, which at times introduces incorrect variants into the candidate set. For instance, the word ‘turmeric’ is identified as a candidate variant of ‘organic’ by Enelvo, while the words ‘nbcnews’ and ‘tpmdc’ are identified by both methods as candidate variants of ‘msnbc’ (see Table 6), due to the high contextual similarity between these words. On the other hand, the proposed method considers w_j to be a candidate variant of w_i only if their lexical similarity is higher than a threshold; hence, such wrong variants are not included in A(w_i) (on which contextual similarity is later applied).

UnsupClean vs. Ghosh et al.: Ghosh et al. [7] computes the contextual similarity among words using only the co-occurrence counts of words in the same document. This method works well for long documents having a small vocabulary, but performs poorly when the vocabulary size is very large or when the documents are very short (in which case the co-occurrence matrix is very sparse). In particular, two error variants of the same word rarely co-occur in the same microblog. Hence, Ghosh et al. [7] performs well on the RISOT dataset, but poorly on the Microblog and Confusion datasets. On the other hand, the proposed method employs both co-occurrence counts and cosine similarity of embeddings to capture contextual similarity among words, and hence generalizes better to different types of noise and documents.
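The idea of blending the two contextual signals can be illustrated with a simple weighted combination. This is an illustrative sketch only: the function names are ours, and the exact formula inside UnsupClean differs.

```python
import math

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cooccurrence_sim(w1, w2, docs):
    # fraction of documents in which the two words co-occur: informative
    # for long documents, but near-zero for short microblogs, where two
    # error variants of the same word rarely appear together
    both = sum(1 for d in docs if w1 in d and w2 in d)
    either = sum(1 for d in docs if w1 in d or w2 in d)
    return both / either if either else 0.0

def contextual_sim(w1, w2, docs, emb, weight=0.5):
    # weighted blend of co-occurrence evidence and embedding cosine,
    # so that embeddings can compensate when co-occurrence is sparse
    return weight * cooccurrence_sim(w1, w2, docs) + (1 - weight) * cosine(emb[w1], emb[w2])
```

With sparse co-occurrence (e.g., `msnbc` and `nbcnews` sharing only one short document), the embedding term still yields a high combined similarity.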
We conducted an ablation analysis in order to understand the importance of the various steps/components of our algorithm. To this end, we applied UnsupClean on the RISOT dataset in various settings: with all steps, and by removing one step at a time. Table 8 shows the results of the ablation analysis.
Normalization Algorithm | Recall@100 | mAP@100 | mAP@1000
UnsupClean (with all components) | | |
UnsupClean without β thresholding | 57.0 | 20.3 | 21.5
UnsupClean without similarity graph of morphological variants | 54.9 | 18.6 | 19.1
UnsupClean without graph fragmentation | 52.6 | 19.1 | 20.7

Table 8: Ablation analysis of UnsupClean: how retrieval performance (on the RISOT Bengali OCR dataset, using the Indri IR system) varies with the removal of various components.
The β thresholding (see Step 2 in Section 3.3) is done to consider only words which have a high contextual similarity. This is done to ensure high precision of the retrieved results. As can be seen from Table 8, the removal of this threshold results in a drop in the mAP scores. The Recall@100, however, does not reduce much, since the morpheme cluster increases in size.

The construction of the similarity graph of morphological variants (see Step 2 in Section 3.3) is a crucial step that helps capture the contextual similarity among words with different morphological structures. Removal of this step results in a significant drop in both mAP and recall scores compared to UnsupClean.

If the graph fragmenting step (see Step 3 in Section 3.4) is skipped, we miss out on identifying the strongly connected component of the graph, and thus weakly connected components also get drawn into the final cluster, causing a decrease in both mAP and recall scores.

This ablation analysis shows that all the steps of the UnsupClean algorithm are essential, and removing any of the steps adversely affects the retrieval performance. In particular, skipping the graph formation step leads to substantial degradation in both precision and recall of retrieval, while the graph fragmenting step is crucial particularly for recall.

From the experiments and analyses stated in this section, we can conclude that our proposed algorithm UnsupClean is very helpful in enhancing retrieval from noisy text. UnsupClean outperforms almost all baselines across different types of noisy corpora (containing OCR noise, and noise due to the cavalier writing style of human users) and across multiple languages (English and Bengali). Especially for corpora with OCR noise, UnsupClean enables statistically significantly better retrieval than most baselines, while it achieves very competitive performance on noisy microblogs. Additionally, UnsupClean is completely unsupervised, and does not need any manual intervention at any step, thus making it suitable for use on large corpora in various languages.
Online platforms such as Twitter, Facebook, etc. have become popular forums to discuss various socio-political topics and express personal opinions regarding these topics. In this scenario, ‘stance’ refers to the inherent sentiment expressed by an opinion towards a particular topic; the topic is referred to as the ‘target’. The ‘stance’ expressed by a blog/tweet can have three types of sentiment towards a given topic: it can either support (or favor) the target, it can oppose (or be against) the target, or it can be neutral, i.e., neither support nor oppose the target. The problem of ‘stance detection’ refers to the process of automatically identifying this stance or sentiment of a post expressed towards a target. The reader is referred to [31] for a recent survey on stance detection.

Stance detection is frequently carried out on crowdsourced data such as tweets/blogs, which are noisy in nature due to frequent use of informal language, non-standard abbreviations, and so on. There exist several popular methods for stance detection from such noisy crowdsourced content. In this section, we check whether using text normalization algorithms improves the performance of these stance detection models on noisy microblogs.

We consider datasets made available by the SemEval 2016 Task 6A challenge [4], which are commonly used datasets for stance detection and have been used in a number of prior works. The datasets consist of tweets (microblogs) for different topics; for each topic, there are three types of tweets: those in favor of (supporting) the topic, those against (opposing) the topic, and tweets that are neutral to the topic. We consider the following three SemEval datasets:

(i) Atheism (AT): Here the target is ‘atheism’, and the tweets are in favor of, or against, the idea of atheism (or neutral).

(ii) Climate change is a real concern (CC): For this dataset the target is ‘Climate change is a real concern’.

(iii) Hillary Clinton (HC): Here the target is ‘Hillary Clinton’, and the tweets are either in support of or against the politician.

We refer to these datasets as AT, CC and HC respectively for brevity in the rest of the paper. The datasets are already partitioned into train and test sets, as defined in the SemEval 2016 Task 6A challenge [4]. The number of tweets in each dataset is stated in Table 9. We have also provided some examples of tweets from the datasets in Table 10.

We consider two stance detection models for the experiments.
Dataset | Train: Favor | Train: Against | Train: Neutral | Test: Favor | Test: Against | Test: Neutral
SemEval-AT | 92 | 304 | 117 | 32 | 160 | 28
SemEval-CC | 212 | 15 | 168 | 123 | 11 | 35
SemEval-HC | 112 | 361 | 166 | 45 | 172 | 78

Table 9: Statistics of the standard SemEval 2016 Task 6A datasets for stance detection (divided into training and test sets).
Tweet text | Label

Tweets from SemEval-AT (Atheism) dataset:
Absolutely f**king sick & tired of the religious and their "We’re persecuted" bollocks So f**king what? Pissoff!

Tweets from SemEval-CC (Climate Change is a Real Concern) dataset:
We cant deny it, its really happening.

Tweets from SemEval-HC (Hillary Clinton) dataset:
@HuffPostPol If @HillaryClinton can do half of what he did then she would be doing is a favor
Table 10:
Examples of posts from the SemEval datasets. Some characters in abusive words are replaced with *.

LSTM (Long Short Term Memory): This model was proposed by Hochreiter and Schmidhuber [32] to specifically address the issue of learning long-term dependencies. The LSTM maintains a separate memory cell inside it that updates and exposes its content only when deemed necessary. A number of minor modifications to the standard LSTM unit have been made since. We define the LSTM units at each time step t to be a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t, and a hidden state h_t, where d is the number of LSTM units. The entries of the gating vectors i_t, f_t and o_t lie in [0, 1]. The LSTM transition equations are the following:

i_t = σ(W_i x_t + U_i h_{t−1} + V_i c_{t−1})   (13)
f_t = σ(W_f x_t + U_f h_{t−1} + V_f c_{t−1})   (14)
o_t = σ(W_o x_t + U_o h_{t−1} + V_o c_{t−1})   (15)
c̃_t = tanh(W_c x_t + U_c h_{t−1})   (16)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (17)
h_t = o_t ⊙ tanh(c_t)   (18)

For detecting stance, a simple strategy is to map the input sequence to a fixed-sized vector using one RNN, and then to feed the vector to a softmax layer. Given a text sequence x = [x_1, x_2, ..., x_T], we first use a lookup layer to get the vector representation (embedding) of each word x_i. The output h_T can be regarded as the representation of the whole sequence; it is passed through a fully connected layer followed by a softmax non-linear layer that predicts the probability distribution over stances (in this case ‘FAVOR’, ‘AGAINST’ or ‘NEUTRAL’). The LSTM model has the following hyperparameters, whose values are taken as follows: learning rate of ×e− and dropout of 0..

Target Specific Attention Neural Network (TAN): This model was introduced by Du et al. [33], one of the winning entries of the SemEval 2016 Task 6A challenge [4]. The model is based on a bidirectional LSTM with an attention mechanism.
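The LSTM transition equations (13)–(18) can be written out directly. Below is a single-step sketch in NumPy; the weight matrices are random placeholders, V_* are the peephole connections to the previous cell state, and the candidate cell here follows the standard tanh formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    # Eqns (13)-(15): gates with peephole connections V_* to the previous cell state
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["Vi"] @ c_prev)  # input gate
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["Vf"] @ c_prev)  # forget gate
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["Vo"] @ c_prev)  # output gate
    c_tilde = np.tanh(P["Wc"] @ x_t + P["Uc"] @ h_prev)               # candidate memory, Eqn (16)
    c = f * c_prev + i * c_tilde                                      # Eqn (17)
    h = o * np.tanh(c)                                                # Eqn (18)
    return h, c

d, dx = 4, 3  # number of LSTM units, input embedding size
rng = np.random.default_rng(0)
P = {}
for g in "ifoc":
    P["W" + g] = 0.1 * rng.standard_normal((d, dx))
    P["U" + g] = 0.1 * rng.standard_normal((d, d))
    if g != "c":
        P["V" + g] = 0.1 * rng.standard_normal((d, d))
h, c = lstm_step(rng.standard_normal(dx), np.zeros(d), np.zeros(d), P)
```

Running this step over the sequence and keeping the final h yields the fixed-sized sequence representation h_T described above.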
A target sequence of length N is represented as [z_1, z_2, ..., z_N], where z_n ∈ R^{d′} is the d′-dimensional vector of the n-th word in the target sequence. The target-augmented embedding of a word t for a specific target z is e_t^z = x_t ⊕ z, where ⊕ is the vector concatenation operation. The dimension of e_t^z is (d + d′). An affine transformation maps the (d + d′)-dimensional target-augmented embedding of each word to a scalar value, as per the following Eqn. 19:

a′_t = W_a e_t^z + b_a   (19)

where W_a and b_a are the parameters of the bypass neural network. The attention vector [a′_1, a′_2, ..., a′_T] undergoes a softmax transformation to get the final attention signal vector (Eqn. 20):

a_t = softmax(a′_t) = e^{a′_t} / Σ_{i=1}^{T} e^{a′_i}   (20)

After this, the product of the attention signal a_t and h_t (the corresponding hidden state vector of the RNN) is used to represent the word t in the sequence. The representation of the whole sequence is obtained by averaging the word representations:

s = (1/T) Σ_{t=1}^{T} a_t h_t   (21)

where s ∈ R^d is the vector representation of the text sequence; it can be used as features for text classification:

p = softmax(W_clf s + b_clf)   (22)

where p ∈ R^C is the vector of predicted probabilities for the stance. Here C is the number of stance classes, and W_clf and b_clf are parameters of the classification layer. The TAN model has two hyperparameters, whose values are taken as follows: learning rate of ×e− and dropout of 0..

We set the parameters for the various text normalization algorithms in a way similar to what was described in Section 5.3. For the proposed method, the value of α is set to 0.. For Ghosh et al. [7], the parameters were set to α = 0., β = 0., γ = 50 (decided through grid search, as described in Section 5.3).
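Equations (19)–(22) amount to a softmax-weighted average of the RNN hidden states followed by a linear classifier; a compact NumPy sketch (random placeholder weights, purely illustrative):

```python
import numpy as np

def attention_pool(H, E_z, W_a, b_a, W_clf, b_clf):
    # H: (T, d) hidden states of the (bi)RNN; E_z: (T, d + d') target-augmented embeddings
    a_raw = E_z @ W_a + b_a                        # Eqn (19): one scalar score per word
    a = np.exp(a_raw) / np.exp(a_raw).sum()        # Eqn (20): softmax over the T positions
    s = (a[:, None] * H).mean(axis=0)              # Eqn (21): average of attention-weighted states
    logits = W_clf @ s + b_clf                     # Eqn (22), pre-softmax
    p = np.exp(logits) / np.exp(logits).sum()
    return a, p

T, d, d_aug, C = 5, 4, 6, 3  # sequence length, hidden size, augmented size, no. of stance classes
rng = np.random.default_rng(1)
a, p = attention_pool(rng.standard_normal((T, d)), rng.standard_normal((T, d_aug)),
                      rng.standard_normal(d_aug), 0.0, rng.standard_normal((C, d)), np.zeros(C))
```

The attention weights a sum to one over the sequence, and p is a distribution over the three stance classes.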
For Sridhar [8] and Enelvo [9], the hyperparameter k was set to 25 as specified in the original papers. For Enelvo [9], n was set to 0..

To evaluate the performance of the different models, we use the same metric as reported by the official SemEval 2016 Task A [4]: the macro-average of the F1-scores for ‘favor’ and ‘against’ as the bottom-line evaluation metric:

F_avg = (F_favor + F_against) / 2   (23)

where F_favor and F_against are calculated as shown below:

F_favor = 2 P_favor R_favor / (P_favor + R_favor)   (24)
F_against = 2 P_against R_against / (P_against + R_against)   (25)

Here P_favor and R_favor are the precision and recall of the ‘FAVOR’ class respectively, and P_against and R_against respectively are the precision and recall of the ‘AGAINST’ class; they are defined as:

P_favor = TP_favor / (TP_favor + FP_favor)   (26)
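The macro-averaged metric F_avg can be computed directly from gold and predicted labels; a small self-contained sketch:

```python
def f1_for(label, gold, pred):
    # per-class precision, recall, and F1 from paired label lists
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def f_avg(gold, pred):
    # Eqn (23): macro-average of F1 over FAVOR and AGAINST only; NEUTRAL is
    # excluded from the average but still contributes to FP/FN counts above
    return (f1_for("FAVOR", gold, pred) + f1_for("AGAINST", gold, pred)) / 2
```

Note that a classifier never predicting NEUTRAL can still score well on this metric, since NEUTRAL has no F1 term of its own.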
Normalization Method | AT | CC | HC
Raw data (without any cleaning) | 0.5508 | 0.3968 | 0.4396
Enelvo [9] | 0.5659 | 0.4427 | 0.5336
Sridhar [8] | 0.5476 | 0.4778 | 0.5318
Ghosh et al. [7] | 0.6156 | 0.4641 | 0.5216
Aspell | 0.5923 | 0.5250 | 0.4586
UnsupClean (Proposed) | | |

Table 11: Stance detection results on the SemEval datasets by the TAN model [33]. The metric is as explained in Section 6.4. Highest values are marked in boldface.
Normalization Method | AT | CC | HC
Raw data (without any cleaning) | 0.5497 | 0.3760 | 0.4938
Enelvo [9] | 0.6084 | 0.3911 |
Sridhar [8] | 0.5666 | 0.4558 | 0.5055
Ghosh et al. [7] | 0.6395 | 0.3884 | 0.4893
Aspell | 0.5902 | 0.4008 | 0.5337
UnsupClean (Proposed) | | |

Table 12: Stance classification results on the SemEval datasets by the LSTM model [32]. The metric is as explained in Section 6.4. Highest values are marked in boldface.

R_favor = TP_favor / (TP_favor + FN_favor)   (27)
P_against = TP_against / (TP_against + FP_against)   (28)
R_against = TP_against / (TP_against + FN_against)   (29)

With respect to the ‘FAVOR’ class,
TP_favor refers to the number of true positives for the class, i.e., the number of tweets that are predicted as ‘FAVOR’ by the classifier and are actually of the ‘FAVOR’ class. FP_favor refers to the number of tweets which are not actually of the ‘FAVOR’ class, but that have been predicted as ‘FAVOR’ by the classifier. FN_favor refers to the number of tweets which are actually of the ‘FAVOR’ class but have been wrongly classified by the classifier.

The experiments with a particular dataset are as follows. Five instances of the dataset are normalized, one by each of the five normalization algorithms (UnsupClean and the four baselines). The training set and the test set are normalized separately. Subsequently, a stance detection model (LSTM or TAN) is run on the normalized versions, and their performances are measured.

Table 11 shows the performance of the TAN stance detection model [33] on the different versions of the datasets normalized by the different normalization algorithms. Also shown are the values on the raw datasets, i.e., without any cleaning. From Table 11, it can be seen that all normalization approaches lead to some improvement, as compared to the performance on the raw data. It is evident that the TAN model performs best using our proposed normalization method, for all three datasets (best metric values highlighted in boldface).

Similarly, Table 12 shows the performance of the LSTM model [32] on the different versions of the datasets normalized by the different normalization algorithms, as well as on the raw data. The LSTM model performs best using the proposed normalization method for two datasets (AT and CC), and gives the second-best result for the HC dataset.

We checked the statistical significance of the differences in performance with the various normalization methods using McNemar's test (https://machinelearningmastery.com/mcnemars-test-for-machine-learning/); however, the differences in performance are not statistically significant.

Table 13 shows some examples of tweets from the SemEval datasets, where all/most of the baseline normalization algorithms led to wrong stance detection by the TAN [33] model (for the first two examples) and by the LSTM
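McNemar's test compares two classifiers via their disagreements on the same test set; a minimal exact-binomial version (our own implementation, for illustration):

```python
from math import comb

def mcnemar_exact(gold, pred_a, pred_b):
    # b: items classifier A got right and B got wrong; c: the reverse.
    b = sum(1 for g, a, bb in zip(gold, pred_a, pred_b) if a == g and bb != g)
    c = sum(1 for g, a, bb in zip(gold, pred_a, pred_b) if a != g and bb == g)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree on correctness
    # two-sided exact binomial p-value on the disagreement counts,
    # under H0 that each disagreement favours either classifier equally
    tail = sum(comb(n, k) for k in range(0, min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the items on which the two classifiers disagree carry information, which is why a large accuracy gap can still be non-significant when the number of disagreements is small.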
Tweets from SemEval Datasets | UnsupClean (Proposed) | Enelvo [9] | Sridhar [8] | Ghosh et al. [7] | Aspell

Jesus response to a religious environment was to create a royal environment. It drove people nuts. Still does - thank God

@EstadodeSats Thankyou so much for supportive

@PatVPeters NEITHER ONE !!!! Just making a funny!

can’t all be right, but they can all be wrong.

Table 13: Examples of posts from the SemEval datasets for which the proposed method led to correct stance detection, but all/most baseline normalization methods led to wrong stance detection (by TAN [33] for the first two examples, and by the LSTM model [32] for the last two examples).

model [32] (for the last two examples), but by using the proposed normalization method, the correct stance was identified for each of these tweets.

The first and third examples show a situation where the proposed UnsupClean keeps the words present in a tweet unchanged, but the baseline normalization methods change the words to some non-contextual variants that seem illogical and lead to wrong stance detection. In the second example, the proposed method changes the phrase ‘thank you so much for’ to ‘thank you so much your’, whereas the modifications of different words by the other normalization methods seem wrong, such as from ‘@EstadodeSats’ to ‘stardust’s’ by Aspell, or from ‘thank’ to ‘thang’ by Enelvo. A similar scenario is observed in the fourth example. Thus, the proposed model performs slightly better in understanding which words to modify (into variants) and which words to leave unchanged.

From the experiments described in this section, it can be concluded that the proposed text normalization algorithm UnsupClean competes very favorably with the baseline normalization models for the stance detection task. We demonstrate the efficacy of UnsupClean with two popular stance detection models. Experiments show that both stance detection models perform better after text normalization with UnsupClean, as compared to text normalization with the baseline models.

These results, along with the results of retrieval in the previous Section 5, show that text normalization using
UnsupClean enables superior performance in multiple text processing tasks.
We proposed a language-independent, unsupervised algorithm for normalizing/cleansing noisy text. We conducted experiments for two downstream applications (retrieval and stance detection) over a variety of datasets and types of noise, including OCR noise over English and Bengali, as well as noise due to the informal writing style of humans in microblogs. The experiments show that the proposed method generalizes well to different types of noise (both user-generated and machine-generated), and performs competitively with several baseline text normalization algorithms. The main strengths of the proposed method are that (i) it does not need an expensive parallel corpus for training or any human intervention (unlike many existing algorithms), and (ii) it does not need external resources such as global word embeddings. These features make it an attractive choice for cleaning text in low-resource languages. The implementation of UnsupClean is publicly available at https://github.com/ranarag/UnsupClean.
There are several potential future directions of this work. First, the effectiveness of
UnsupClean can be checked on other types of noise, such as noise due to automated speech-to-text conversion systems. Second, the proposed method can be applied to clean text in low-resource languages that lack external resources and large parallel corpora. Also, in this paper, we demonstrated the benefits of cleaning text using UnsupClean for the two general tasks of retrieval (search) and stance detection. The proposed method can also be tried to improve performance in more specific versions of these tasks, such as identifying specific types of microblogs that aid relief operations in post-disaster scenarios [34], as well as in other tasks such as summarization of noisy social media text [35, 36]. We plan to explore these directions in future work.
Acknowledgements:
The authors thank the anonymous reviewers, whose valuable suggestions helped to improve the paper. The authors also acknowledge Dr. Arnab Bhattacharya of the Indian Institute of Technology Kanpur for useful discussions in the initial stages of the work. The work is partially supported by a project titled “Building Healthcare Informatics Systems Utilising Web Data” funded by the Department of Science & Technology, Government of India. Finally, the authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References

[1] L. Venkata Subramaniam, Shourya Roy, Tanveer A. Faruquie, and Sumit Negi. A survey of types of text noise and techniques to handle noisy text. In Proc. Workshop on Analytics for Noisy Unstructured Text Data (AND), pages 115–122, 2009.

[2] Moumita Basu, Anurag Roy, Kripabandhu Ghosh, Somprakash Bandyopadhyay, and Saptarshi Ghosh. A novel word embedding based stemming approach for microblog retrieval during disasters. In Proc. European Conference on Information Retrieval (ECIR), pages 589–597, 2017.

[3] Jaime Teevan, Daniel Ramage, and Meredith Ringel Morris. #TwitterSearch: A comparison of microblog search and web search. In Proc. ACM Conference on Web Search and Data Mining (WSDM), pages 35–44, 2011.

[4] Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. SemEval-2016 Task 6: Detecting stance in tweets. In Proc. International Workshop on Semantic Evaluation (SemEval), pages 31–41, 2016.

[5] Anurag Roy, Trishnendu Ghorai, Kripabandhu Ghosh, and Saptarshi Ghosh. Combining local and global word embeddings for microblog stemming. In Proc. ACM Conference on Information and Knowledge Management (CIKM), pages 2267–2270, 2017.

[6] Alessandro Vinciarelli. Noisy text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1882–1895, 2005.

[7] Kripabandhu Ghosh, Anirban Chakraborty, Swapan Kumar Parui, and Prasenjit Majumder. Improving information retrieval performance on OCRed text in the absence of clean text ground truth. Information Processing & Management, 52(5):873–884, 2016.

[8] Vivek Kumar Rangarajan Sridhar. Unsupervised text normalization using distributed representations of words and phrases. In Proc. 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 8–16, June 2015.

[9] Thales Felipe Costa Bertaglia and Maria das Graças Volpe Nunes. Exploring word embeddings for unsupervised textual user-generated content normalization. In Proc. Workshop on Noisy User-generated Text (WNUT), pages 112–120, 2016.

[10] Phani Gadde, Rahul Goutam, Rakshit Shah, Hemanth Sagar Bayyarapu, and L. V. Subramaniam. Experiments with artificially generated noise for cleansing noisy text. In Proc. 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data, 2011.

[11] Stephan Gouws, Dirk Hovy, and Donald Metzler. Unsupervised mining of lexical variants from noisy text. In Proc. Workshop on Unsupervised Learning in NLP (with EMNLP), pages 82–90, 2011.

[12] Chen Li and Yang Liu. Improving text normalization via unsupervised model and discriminative reranking. In Proc. ACL 2014 Student Research Workshop, pages 86–93, June 2014.

[13] Bo Han and Timothy Baldwin. Lexical normalisation of short text messages: Makn sens a #twitter. In Proc. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 368–378, 2011.

[14] Eric Brill and Robert C. Moore. An improved error model for noisy channel spelling correction. In Proc. Annual Meeting of the Association for Computational Linguistics (ACL), pages 286–293, 2000.

[15] Kristina Toutanova and Robert Moore. Pronunciation modeling for improved spelling correction. In Proc. ACL, pages 144–151, July 2002.

[16] Maryam Zare and Shaurya Rohatgi. DeepNorm - a deep learning approach to text normalization. CoRR, abs/1712.06994, 2017.

[17] Massimo Lusetti, Tatyana Ruzsics, Anne Göhring, Tanja Samardžić, and Elisabeth Stark. Encoder-decoder methods for text normalization. In Proc. Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 18–28, 2018.

[18] Wang Ling, Chris Dyer, Alan W. Black, and Isabel Trancoso. Paraphrasing 4 microblog normalization. In Proc. EMNLP, pages 73–84, 2013.

[19] Okan Kolak and Philip Resnik. OCR error correction using a noisy channel model. In Proc. HLT, pages 257–262, 2002.

[20] T. Mikolov, W. T. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 746–751, 2013.

[21] R. Satapathy, C. Guerreiro, I. Chaturvedi, and E. Cambria. Phonetic-based microtext normalization for Twitter sentiment analysis. In Proc. IEEE International Conference on Data Mining Workshops (ICDMW), pages 407–413, 2017.

[22] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

[23] I. Dan Melamed. Automatic evaluation and uniform filter cascades for inducing n-best translation lexicons. In Proc. Third Workshop on Very Large Corpora, 1995.

[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, 2013.

[25] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008:P10008, 2008.

[26] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.

[27] Lawrence Philips. The double metaphone search algorithm. C/C++ Users Journal, 18(6):38–43, June 2000.

[28] Nathan Hartmann, Lucas Avanço, Pedro Balage, Magali Duran, Maria das Graças Volpe Nunes, Thiago Pardo, and Sandra Aluísio. A large corpus of product reviews in Portuguese: Tackling out-of-vocabulary words. In Proc. International Conference on Language Resources and Evaluation (LREC), pages 3865–3871, May 2014.

[29] Paul B. Kantor and Ellen M. Voorhees. The TREC-5 confusion track: Comparing retrieval methods for scanned text. Information Retrieval, 2(2-3):165–176, 2000.

[30] Trevor Strohman, Donald Metzler, Howard Turtle, and W. Croft. Indri: A language-model based search engine for complex queries. Information Retrieval, 2005.

[31] Dilek Küçük and Fazli Can. Stance detection: A survey. ACM Computing Surveys (CSUR), 53:1–37, 2020.

[32] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[33] Jiachen Du, Ruifeng Xu, Yulan He, and Lin Gui. Stance classification with target-specific neural attention networks. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2017.

[34] Moumita Basu, Anurag Shandilya, Prannay Khosla, Kripabandhu Ghosh, and Saptarshi Ghosh. Extracting resource needs and availabilities from microblogs for aiding post-disaster relief operations. IEEE Transactions on Computational Social Systems, 6(3):604–618, 2019.

[35] Koustav Rudra, Subham Ghosh, Pawan Goyal, Niloy Ganguly, and Saptarshi Ghosh. Extracting situational information from microblogs during disaster events: A classification-summarization approach. In Proc. ACM Conference on Information and Knowledge Management (CIKM), pages 583–592, 2015.

[36] Mohammed Elsaid Moussa, Ensaf Hussein Mohamed, and Mohamed Hassan Haggag. A survey on opinion summarization techniques for social media.