Novel Keyword Extraction and Language Detection Approaches
Malgorzata Pikies [email protected], United Kingdom
Andronicus Riyono [email protected], Singapore
Junade Ali [email protected], United Kingdom
ABSTRACT
Fuzzy string matching and language classification are important tools in Natural Language Processing pipelines; this paper provides advances in both areas. We propose a fast novel approach to string tokenisation for fuzzy language matching and experimentally demonstrate an 83.6% decrease in processing time with an estimated improvement in recall of 3.1% at the cost of a 2.6% decrease in precision. This approach is able to work even where keywords are subdivided into multiple words, without needing to scan character-to-character. So far there has been little work considering using metadata to enhance language classification algorithms. We provide observational data and find the Accept-Language header is 14% more likely to match the classification than the country language associated with the IP Address.
CCS CONCEPTS
• Human-centered computing → Natural language interfaces; Web-based interaction.

KEYWORDS
string tokenisation, fuzzy string matching, language classification
1 INTRODUCTION
Feature extraction is an important part of Natural Language Processing pipelines, whether extracting important keywords [1] or detecting language [2]. During analysis, text data can be divided into sentences, words, or characters, which can then be treated individually or gathered into groups of n-grams [3]. Fuzzy string matching compares tokenised text to keywords using string similarity algorithms. For example, the Edit Distance (also known as the Levenshtein distance) [4] measures the number of elementary changes needed to transform one string into another, using the following elementary edit operations:
• a change operation, if X ≠ ∅ and Y ≠ ∅;
• a delete operation, if Y = ∅;
• an insert operation, if X = ∅.
Other string similarity algorithms, like Cosine similarity [5] and Dice [3], divide strings into sets of letters known as n-grams. Given these algorithms are not sensitive to the order of characters and n-grams, the size of the n-gram can dramatically alter the accuracy of the algorithm. Equation 1 shows a formula for calculating the Cosine similarity between strings X and Y. Strings are divided into n-grams, where each unique n-gram is a separate dimension in a multi-dimensional vector space. The two vectors U(X) and V(Y) made of strings X and Y are then used to calculate the cosine of the angle between them:

    s(X, Y) = U(X) · V(Y) / (|U(X)| |V(Y)|) = cos θ.    (1)

n-gram based approaches can also be used for language detection, but are notably unreliable on short corpuses of text [2]. Whilst internet standards and web browsers have sought to standardise content language headers [6], there has been little study on the accuracy of these fields.

Our prior work [1] provided a more detailed definition of various string similarity algorithms and provided empirical analysis of their performance for fuzzy string matching.
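As a concrete illustration of Equation 1, the sketch below computes the Cosine similarity over character bigrams (n = 2). The function names and the choice of Python are ours, not from the paper:

```python
from collections import Counter
from math import sqrt

def ngrams(s, n=2):
    """Split a string into overlapping character n-grams with counts."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(x, y, n=2):
    """Cosine of the angle between the n-gram count vectors of x and y."""
    u, v = ngrams(x, n), ngrams(y, n)
    dot = sum(u[g] * v[g] for g in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0
```

Because the vectors only count n-grams, the measure is insensitive to their order: `cosine_similarity("nameserver", "name-server")` is high despite the inserted hyphen.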
We extend upon this work here by developing a novel tokenisation approach for fuzzy string matching. We propose a novel hybrid approach to approximate keyword search. Our focus is on the impact on accuracy and computation speed while using a greedy approach for sub-string selection prior to its tokenisation. We further explore using internet user metadata to improve language classification.

In Section 2 we describe research papers related to language detection and keyword search, including n-grams. In Section 3 we describe our hybrid method for keyword search text classification. Section 4 explores how metadata like web browser headers and IP Addresses can be used to improve the accuracy of language classification algorithms. We summarise our research and present conclusions in Section 5.

2 RELATED WORK
There are many algorithms designed to find exact matches of strings, such as Knuth-Morris-Pratt [7] or the Boyer-Moore algorithm [8]. [9] developed a novel technique to generate variable-length grams (VGRAMs) and showed that VGRAM tokenisation improved the performance of three chosen algorithms. Additionally, [10] describe a novel approach for n-gram based string search in the 'write once read many' context. Their algorithm uses n-gram signatures together with an algorithm similar to the Boyer-Moore algorithm; thus their technique is also focused on exact string matching.

In our approach, we instead wish to perform fuzzy string matching rather than exact matching. [11] proposed a fuzzy-token similarity metric, which is a combination of token and character based similarities. The algorithm looks for a maximum sum of weights between pairs of tokens in two strings from a weighted bigraph. They also proposed an efficient method based on tokens' signatures called Fast-Join. [12] proposed using a wildcard symbol (it can represent any character from the alphabet) in q-grams. They proposed two algorithms, BasicEQ and
OptEQ, that use a concept of string hierarchy, combinatorial analysis, and semi-lattices for selectivity estimation. In [13] the authors proposed two algorithms, the MOst Frequent Minimal Base String Method (MOF) and Lower Bound Estimation (LBS), to perform an estimation of the selectivity of approximate substring queries based on an extended n-gram table with wildcards. [14] showed that the n-gram based frequency method is both inexpensive and effective in document classification. They split the text into n-grams of sizes from one to five (letters) and counted their occurrences using a hash table.

n-grams can also be used in language classification. [2] considers using smoothed n-gram based models for language identification of Twitter messages: the authors compared a smoothed n-gram language model with a TF-IDF weighting scheme, alongside comparing various classifiers (Naive-Bayes, Logistic Regression, SVM, and LLR classifiers). The authors conclude that: "This study validates the fact that when it comes to dealing with very short texts we need to conduct deep investigations based on this domain."

[15] incorporates entity-level information into a pre-trained language model, but to the best of our knowledge there is no such work incorporating metadata into language classification models. [16] found that "there are significant challenges to accurately determining the language of tweets in an automated manner" and notes the challenges of using purely geolocation data for language classification. [6] provides that web browsers may pass language preferences to websites using the Accept-Language header; whilst this has been implemented in modern web browsers, there has been no empirical study of the accuracy of such language headers. [17] experimented with using IP Address information and user interface language to predict the language used in user input forms on a small sample of 510 logs; the authors note that these features alone are not strong indicators for determining query language and more robust dimensions are needed.
[18] found a correlation between country language, interface language, and input language; but, to the authors' surprise, only 24% of queries were in a language associated with the user's country (obtained from their IP Address). The work did not consider other browser headers, and the logs were from a pan-European online library (it is not understood if this context affected the language input of users). We could not identify prior work seeking to use the output of a language classification algorithm together with geolocation data. No prior work has considered using the web browser's Accept-Language as a feature of language classification.

To the best of our knowledge, there exists a gap in the literature that we want to fill. This paper is the first to present the performance of a keyword search using a hybrid method with strings tokenised into words and character-based n-grams, and to present a potential improvement to language prediction in short messages by including country and Accept-Language header as predictor variables.
3 HYBRID KEYWORD SEARCH
In our prior work [1] we described our approach to a ticket classification system based on fuzzy string matching. A keyword search was performed by scanning a corpus of text in windows of the keyword's length. The string similarity was estimated using Cosine similarity, where both text and keyword were divided into n-grams of size 2 (characters). Our prior work [1] found that the Cosine algorithm was not only the most accurate but significantly faster (for two strings of lengths n and m, the computational complexity of the Cosine algorithm is O(n + m) whilst the Edit Distance is O(n × m)). The method was not sensitive to the beginning and end of the string. In this Section we present our solution to this issue.

Algorithm 1
The Greedy algorithm for a keyword search. *The similarity function runs only if the number of characters in P_Y,j is within the bounds (1 − θ) · c_X ≤ c_{P_Y,j} ≤ (1 + θ) · c_X, with j = −1, 0, 1

INPUT: searched string X, text Y, similarity threshold θ
OUTPUT: a similarity S of the first match
  l_X ← length of X
  l_Y ← length of Y
  c_X ← word count of X
  P_X ← profile of X
  P_Y ← profile of Y
  for i ← word from P_Y do
      P_Y,−1 ← c_X − 1 consecutive words in P_Y, starting with i
      P_Y,0 ← c_X consecutive words in P_Y, starting with i
      P_Y,+1 ← c_X + 1 consecutive words in P_Y, starting with i
      w_−1 ← Cosine similarity of P_X and P_Y,−1    ▷ *
      w_0 ← Cosine similarity of P_X and P_Y,0
      w_+1 ← Cosine similarity of P_X and P_Y,+1
      S ← max{w_−1, w_0, w_+1}
      if S ≥ θ then
          return S
      end if
  end for

Our novel approach to fuzzy string matching consists of two tokenisation steps. In the first part we divide both strings into words by white spaces. We create a profile for both strings, which stores all words in the right order together with their lengths. In order to match keywords which are divided into multiple words (e.g. nameservers and name-servers) we calculate the similarity for three cases:
• the searched string and a part of the scanned string one word shorter than the searched string,
• the searched string and a part of the scanned string of the same length as the searched string,
• the searched string and a part of the scanned string one word longer than the searched string.
We calculate a similarity only if the number of characters of the part of the scanned string is within the acceptable bounds with respect to the similarity threshold θ. We scan the ticket body (Y) word by word. In every iteration we choose the highest similarity S from the three above-mentioned cases. If the condition S ≥ θ is fulfilled, the scan stops and the function returns the similarity value.
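A minimal sketch of this greedy search follows. The helper and function names are ours, the length pre-filter uses character counts (our reading of the bound stated in Algorithm 1), and the production classifier of [1] is not reproduced here:

```python
from collections import Counter
from math import sqrt

def bigrams(s):
    """Character-bigram count vector of a string."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(u, v):
    """Cosine similarity of two bigram count vectors."""
    dot = sum(u[g] * v[g] for g in u)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def greedy_search(keyword, text, theta=0.8):
    """Scan `text` word by word; at each position compare the keyword
    against spans one word shorter, equal to, and one word longer than
    the keyword, returning the first similarity >= theta, else None."""
    kw_profile = bigrams(keyword)
    n = len(keyword.split())
    words = text.split()
    for i in range(len(words)):
        best = 0.0
        for size in (n - 1, n, n + 1):
            if size < 1 or i + size > len(words):
                continue
            span = " ".join(words[i:i + size])
            # Pre-filter: only score spans whose character count is
            # within (1 - theta) .. (1 + theta) of the keyword's length.
            if not (1 - theta) * len(keyword) <= len(span) <= (1 + theta) * len(keyword):
                continue
            best = max(best, cosine(kw_profile, bigrams(span)))
        if best >= theta:
            return best
    return None
```

For example, `greedy_search("nameserver", "my name server is broken")` matches because the two-word span "name server" is scored against the single-word keyword, without any character-by-character window scan.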
In order to measure the accuracy of our new approach we gathered 1790 tickets falling into a chosen product category (known as "DNS"). The tickets were processed using the multi-classifier outlined in [1] both before and after modification with our novel approach. With our novel approach, of the 1790 tickets, 1286 were properly classified as DNS, 249 were left unclassified, and 255 were misclassified (with the majority being in three other product areas: Crypto (99), Server Errors (81) and Registrar (25)). The mean processing time was 0.012 seconds per ticket and the median 0.00934 seconds per ticket. With our old method, 1324 tickets were classified as DNS, 236 were left unclassified and 230 were misclassified. The mean processing time of the old approach was 0.073 seconds per ticket and the median 0.056 seconds per ticket. As the new approach is sensitive to the beginning and the end of the string we lose a little coverage, yet the improvement in computing time is valuable (the new approach is more than 6 times faster). In order to estimate precision and recall we gathered 1208 tickets from a product category known as "Crypto". The same multi-classifier was applied.

Table 1: Precision and recall.
Method             Precision   Recall
Old                0.878       0.718
Greedy Approach    0.855       0.740
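The headline figures quoted in the abstract follow from Table 1 and the timing measurements as relative changes:

```python
# Relative changes implied by Table 1 and the reported mean timings.
old_p, old_r = 0.878, 0.718   # old method: precision, recall
new_p, new_r = 0.855, 0.740   # greedy approach: precision, recall
old_t, new_t = 0.073, 0.012   # mean seconds per ticket

recall_gain = (new_r - old_r) / old_r       # ~ +3.1% recall
precision_loss = (old_p - new_p) / old_p    # ~ -2.6% precision
time_reduction = 1 - new_t / old_t          # ~ 83.6% less time
print(f"{recall_gain:.1%} {precision_loss:.1%} {time_reduction:.1%}")
```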
4 METADATA FOR LANGUAGE CLASSIFICATION
So far, we have outlined an improved algorithm for fuzzy string matching; however, improved language classification can also be important to the accuracy of a given Natural Language Processing pipeline. In this instance, we take chat messages from a real-world customer support chatbot which uses the [19] library for language classification. The classification is run on the first message sent by a user, given the need to understand whether the user is using a supported language as early as possible in the conversation. We gathered 3204 chat tickets alongside the language classification, the HTTP Accept-Language header presented by the user's web browser, and the list of languages common to the visitor's country (using MaxMind GeoIP [20] for geolocation based on their IP Address and open-source data [21] to obtain the languages associated with a given country).

Predicting language from the first chat message, as we do here, can be challenging and will not replicate all use-cases. The first message could be just a simple greeting or could be a lengthy description of the issue the customer is experiencing. As the chatbot is used for support purposes on an internet infrastructure product, in some cases visitors even included software log lines in their first message, making it harder to flag whether the visitor would like to get support in a non-English language.

In Fig. 1, we observe that as a message increases in length, so does the probability that the classified language will match the visitor's country language and their browser's Accept-Language header. We also observe a correlation between the classified language and the Accept-Language header, as shown in Fig. 2. Although harder to visualise, a similar correlation can be observed between the classified language and the visitor's country in Fig. 3. In 67% of chats, all parameters were in agreement. In 15%, the classified language only matched the Accept-Language header and a further 5% matched only the country languages. 13% had no agreement between these three parameters.

Figure 1: Message Length and Language Classification Match
Figure 2: Classified Languages and Top 10 Accept-Language
The Accept-Language header was 14% more likely to match the language classification than the IP Address, but 23% more coverage was obtained by allowing either parameter to match the classified language than both, indicating both data sources can add value in a classification system. This observational evidence may be leveraged by future work to experiment with different approaches to incorporating such metadata into novel language classification approaches.
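As an illustration of how these two metadata signals could be checked against a classifier's output, the sketch below parses an Accept-Language header into quality-weighted primary language subtags and reports agreement. The function names and agreement logic are our own sketch, not the production system described above:

```python
def parse_accept_language(header):
    """Return primary language subtags from an Accept-Language header,
    ordered by q-weight, e.g. 'en-GB,en;q=0.9,fr;q=0.8' -> ['en', 'en', 'fr']."""
    tags = []
    for part in header.split(","):
        fields = part.strip().split(";")
        tag = fields[0].strip().lower().split("-")[0]  # primary subtag only
        q = 1.0  # default quality weight when no ;q= parameter is given
        for f in fields[1:]:
            if f.strip().startswith("q="):
                try:
                    q = float(f.strip()[2:])
                except ValueError:
                    pass
        if tag:
            tags.append((q, tag))
    return [t for _, t in sorted(tags, key=lambda x: -x[0])]

def signals_agree(classified, accept_language_header, country_languages):
    """Report which metadata signals match the classified language."""
    return {
        "accept_language": classified in parse_accept_language(accept_language_header),
        "country": classified in country_languages,
    }
```

A downstream system could then, for short messages, require at least one of the two signals to agree with the classifier before acting on its output.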
5 CONCLUSIONS
In this paper, we presented a novel greedy tokenisation approach of strings for use in fuzzy keyword search. Our approach allows for efficient search of keywords using n-gram string similarity algorithms, even where keywords are subdivided into multiple words in the corpus text. Experimental results show that greedy tokenisation decreased processing time by 83.6%, with an estimated improvement in recall of 3.1% and a decrease in precision of 2.6%.

Figure 3: Classified Languages and Top 20 Countries

Figure 4: Classified Language Matches

We also observed that the probability of the metadata matching the classified language increases with the message length. Whilst the Accept-Language header matched the classified language in 82% of instances, the country languages (extracted from the user's IP Address) only matched in 72% of instances. This indicates that Accept-Language likely provides a better signal than the IP Address for user language; but whilst all three signals only matched in 67% of instances, 87% coverage can be obtained if the classified language is allowed to match either the Accept-Language header or the country language.

Whilst further research is needed, our data supports future research into using metadata to support language classification algorithms, particularly for short messages or instances where higher certainty is needed before making language classification decisions. One potential area of study is the creation of a model that receives input from language classification algorithms as well as different sources of metadata, with message length potentially being a further dimension.
REFERENCES
[1] Malgorzata Pikies and Junade Ali. String similarity algorithms for a ticket classification system. In , pages 36–41, April 2019.
[2] Dayvid W Castro, Ellen Souza, Douglas Vitório, Diego Santos, and Adriano LI Oliveira. Smoothed n-gram based models for tweet language identification: A case study of the brazilian and european portuguese national varieties. Applied Soft Computing, 61:1160–1172, 2017.
[3] Grzegorz Kondrak. N-gram similarity and distance. In Proceedings of the 12th International Conference on String Processing and Information Retrieval, SPIRE'05, pages 115–126, Berlin, Heidelberg, 2005. Springer-Verlag.
[4] VI Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707, 1966.
[5] C. D. Manning. Introduction to Information Retrieval.
[6] H. Alvestrand. Content Language Headers. RFC 3282, Internet Engineering Task Force, May 2002.
[7] Donald E. Knuth, James H. Morris, Jr., and Vaughan R. Pratt. Fast pattern matching in strings, 1974.
[8] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, October 1977.
[9] Chen Li, Bin Wang, and Xiaochun Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
[10] Witold Litwin, Riad Mokadem, Philippe Rigaux, and Thomas Schwarz. Fast ngram-based string search over data encoded using algebraic signatures. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 207–218. VLDB Endowment, 2007.
[11] Jiannan Wang, Guoliang Li, and Jianhua Feng. Fast-join: An efficient method for fuzzy token matching based string similarity join. pages 458–469, 04 2011.
[12] Hongrae Lee, Raymond Ng, and Kyuseok Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. pages 195–206, 01 2007.
[13] Hongrae Lee, Raymond Ng, and Kyuseok Shim. Approximate substring selectivity estimation. pages 827–838, 01 2009.
[14] William B. Cavnar and John M. Trenkle. N-gram-based text categorization. 1994.
[15] Shanchan Wu and Yifan He. Enriching pre-trained language model with entity information for relation classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2361–2364, 2019.
[16] Mark Graham, Scott A Hale, and Devin Gaffney. Where in the world are you? geolocation and language identification in twitter. The Professional Geographer