Stopwords in Technical Language Processing
A Preprint
Serhad Sarica
Data-Driven Innovation Lab, Singapore University of Technology and Design, Singapore 487372
[email protected]

Jianxi Luo
Data-Driven Innovation Lab, Singapore University of Technology and Design, Singapore 487372
[email protected]

June 5, 2020

Abstract
Natural language processing techniques are increasingly applied for information retrieval, indexing and topic modelling in engineering contexts. A standard component of such tasks is the removal of stopwords, which are uninformative components of the data. While researchers use readily available stopword lists derived for general English, the technical jargon of engineering fields contains its own highly frequent and uninformative words, and there exists no standard stopword list for technical language processing applications. Here we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative data-driven approaches, and by curating a stopword list ready for technical language processing applications.

Keywords: Stopwords · Technical language · Data-driven
Natural language processing (NLP) and text analysis have become increasingly popular in engineering analytics [1, 2, 3, 4]. To ensure the accuracy and efficiency of NLP tasks such as indexing, topic modelling and information retrieval [5, 6, 7, 8, 9], the uninformative words, often referred to as "stopwords", need to be removed in the pre-processing step in order to increase the signal-to-noise ratio in the unstructured text data. Example stopwords include "each", "about", "such" and "the". Stopwords often appear frequently in many different natural language documents, or in many parts of the text within a document, but carry little information about the part of the text they belong to.

The use of a standard stopword list, such as the one distributed with the popular Natural Language Toolkit (NLTK) [10] Python package, for removal in data pre-processing has become an NLP standard in both research and industry. There have been efforts to identify stopwords from generic knowledge sources such as the Brown Corpus [8, 11], the 20 Newsgroups corpus [6] and book corpora [12], and to curate a generic stopword list for removal in NLP applications across fields. However, the technical language used in engineering or technical texts differs from layperson language and may use stopwords that are less prevalent in layperson language. When it comes to engineering or technical text analysis, researchers and engineers either adopt the readily available generic stopword lists for removal [1, 2, 3, 4], leaving much noise in the data, or identify additional stopwords in a manual, ad hoc or heuristic manner [5, 13, 14, 15]. There exists no standard stopword list for technical language processing applications.

Here, we address this gap by rigorously identifying generic, insignificant, uninformative stopwords in engineering texts beyond the stopwords in general texts, based on the synthesis of alternative data-driven approaches. The resultant stopword list is statistically identified and human-evaluated. Researchers, analysts and engineers working on technology-related textual data and technical language analysis can directly apply it for denoising and filtering of their technical textual data, without conducting the manual and ad hoc discovery and removal of uninformative words by themselves.
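As a minimal illustration of this standard pre-processing step, the sketch below filters tokens against a stopword set. For self-containment it uses a tiny illustrative subset of the NLTK English list; in practice one would load the full 179-word list via `nltk.corpus.stopwords.words("english")`.

```python
# A tiny illustrative subset of the NLTK English stopword list (the real list
# has 179 entries and ships with the nltk package).
GENERIC_STOPWORDS = {"the", "is", "to", "each", "about", "such", "a", "an", "and"}

def remove_stopwords(tokens):
    """Drop tokens found in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in GENERIC_STOPWORDS]

tokens = ["The", "rotor", "is", "coupled", "to", "the", "drive", "shaft"]
print(remove_stopwords(tokens))  # ['rotor', 'coupled', 'drive', 'shaft']
```

The point of this paper is that for technical texts such a generic list is necessary but not sufficient: patent-specific and engineering-specific stopwords remain after this filter.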
To identify stopwords in technical language texts, we statistically analyse the natural texts in patent documents, which are descriptions of technologies at all levels. The patent database is vast and provides the most comprehensive coverage of technological domains. Specifically, our patent text corpus contains 781,156,082 tokens (words, bi-, tri- and four-grams) from 30,265,976 sentences in the titles and abstracts of 6,559,305 utility patents in the complete USPTO patent database from 1976 to 31 December 2019 (access date: 23 March 2020). Non-technical design patents are excluded. Technical description fields are avoided because they include information on contexts, backgrounds and prior art that may be irrelevant to the specific invention and repetitive, leading to statistical bias and increased computational requirements. We also avoided the legal claim sections, which are written in repetitive, disguising legal terms.

In general text analysis for topic modelling or information retrieval, various statistical metrics, such as term frequency (TF) [7, 9], inverse document frequency (IDF) [7], term frequency-inverse document frequency (TFIDF) [5], entropy [6, 12], information content [6], information gain [16] and Kullback-Leibler (KL) divergence [7], are employed to sort the words in a corpus [6, 16]. Herein we use TF, TFIDF and information entropy to automatically identify candidate stopwords.

Furthermore, some technically significant terms such as "composite wall", "driving motion" and "hose adapter" are statistically indistinguishable from such stopwords as "be", "and" and "for", regardless of the statistical metric used for sorting. That is, automatic and data-driven methods by themselves are not accurate and reliable enough to return stopwords. Therefore, we also use a human-reliant step to further evaluate the automatically identified candidate stopwords and confirm a final set of stopwords that do not carry information on engineering and technology.

In brief, the overall procedure, as depicted in Figure 1, consists of three major steps: 1) basic pre-processing of the patent natural texts, including punctuation removal, lower-casing, phrase detection and lemmatization; 2) using multiple statistical metrics from NLP and information theory to identify a ranked list of candidate stopwords; 3) term-by-term evaluation by human experts of their insignificance for technical texts, to confirm stopwords that are uninformative about engineering and technology. In the following, we describe the implementation details of these three steps.

Figure 1: Overall procedure
The patent texts in the corpus are first transformed into a line-sentence format, using the sentence tokenization method in NLTK, and normalized to lowercase letters to avoid additional vocabulary caused by lowercase/uppercase differences of the same words. The punctuation marks in sentences are removed, except "-" and "/". These two special characters are frequently used in word-tuples, such as "AC/DC" and "inter-link", which can be regarded as single terms. The original raw texts are transformed into a collection of 30,265,976 sentences, comprising 796,953,246 unigrams.
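A sketch of this normalization step is shown below: lower-casing and removal of punctuation except "-" and "/", which join word-tuples such as "AC/DC" and "inter-link". (The paper additionally splits sentences with NLTK's sentence tokenizer first; the regex here is our own illustrative choice, not the authors' exact implementation.)

```python
import re

def normalize(sentence):
    """Lowercase and strip punctuation, keeping '-' and '/' inside word-tuples."""
    sentence = sentence.lower()
    # keep word characters, whitespace, '-' and '/'; drop all other punctuation
    sentence = re.sub(r"[^\w\s/-]", "", sentence)
    return sentence.split()

print(normalize("An AC/DC converter, with an inter-link module."))
# ['an', 'ac/dc', 'converter', 'with', 'an', 'inter-link', 'module']
```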
Phrases are detected with the algorithm of Mikolov et al. [17], which finds words that frequently appear together, and infrequently in other contexts, by using a simple statistical method based on the counts of words to give a score to each bigram:

    score(w_i, w_j) = (count(w_i w_j) − δ) · N / (count(w_i) · count(w_j))    (1)

where count(w_i w_j) is the count of w_i and w_j appearing together as a bigram in the collection of sentences, and count(w_i) is the count of w_i in the collection of sentences. δ is a discounting coefficient that prevents too many phrases consisting of very infrequent words; we set δ = 1 to prevent scores higher than 0 for phrases occurring less than twice. The term N = Σ_{t, p ∈ P} n(t, p) represents the total number of tokens in the patent database, where n(t, p) is the count of the term t in the patent p. Bigrams with a score over a defined threshold (T_phrase) are considered as phrases and joined with a "_" character in the corpus, to be treated as single terms. We run the phrasing algorithm of Mikolov et al. [17] on the pre-processed corpus twice to detect n-grams, where n = [2, 4]. The first run detects only bigrams by employing a higher threshold value T_phrase, while the second run can detect n-grams up to n = 4 by using a lower threshold value T_phrase to enable combinations of bigrams. Via this procedure of repeating the phrasing process with decreasing threshold values of T_phrase, we detected phrases that appear more frequently in the first step using the higher threshold value, e.g., "autonomous vehicle", and phrases that are comparatively less frequent in the second step using the lower threshold value, e.g., "autonomous vehicle platooning". In this study, we used the best performing thresholds (5, 2.5) found in a previous study [13].

The phrase detection computation resulted in a vocabulary of 15,435,308 terms, including 13,730,320 phrases.
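A minimal sketch of one pass of this phrase detection, implementing the Eq. (1) score over toy token counts (not the authors' code; production work would typically use a library implementation such as gensim's `Phrases`):

```python
from collections import Counter

def phrase_score(bigram_count, wi_count, wj_count, n_tokens, delta=1):
    # Eq. (1): (count(wi wj) - delta) * N / (count(wi) * count(wj))
    return (bigram_count - delta) * n_tokens / (wi_count * wj_count)

def join_phrases(sentences, threshold, delta=1):
    """One pass of phrase detection over tokenized sentences. The paper runs two
    passes with decreasing thresholds (5, then 2.5) to build up to 4-grams."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))
    n_tokens = sum(unigrams.values())
    merged_sentences = []
    for s in sentences:
        merged, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and phrase_score(
                bigrams[(s[i], s[i + 1])], unigrams[s[i]], unigrams[s[i + 1]],
                n_tokens, delta,
            ) > threshold:
                merged.append(s[i] + "_" + s[i + 1])  # join bigram with "_"
                i += 2
            else:
                merged.append(s[i])
                i += 1
        merged_sentences.append(merged)
    return merged_sentences

# Toy corpus: "autonomous vehicle" co-occurs repeatedly and gets joined.
corpus = [
    ["autonomous", "vehicle", "navigates"],
    ["autonomous", "vehicle", "parks"],
    ["the", "vehicle", "stops"],
]
print(join_phrases(corpus, threshold=1.0))
# [['autonomous_vehicle', 'navigates'], ['autonomous_vehicle', 'parks'], ['the', 'vehicle', 'stops']]
```

Here "autonomous vehicle" scores (2 − 1) · 9 / (2 · 3) = 1.5 > 1, so it is merged, while one-off bigrams score 0 and are left alone.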
Since the adopted phrase detection algorithm is purely based on co-occurrence statistics, the detection of some faulty phrases that include stopwords, such as "the_", "a_", "and_" and "to_", is inevitable. Therefore, the detected phrases are processed one more time to split off the known stopwords from the NLTK [10] and USPTO [18] stopword lists. For example, "an_internal_combustion_engine" is replaced with "an internal_combustion_engine". This reduces the vocabulary to 8,641,337 terms, including 6,900,263 phrases.

Next, all words are represented by their regularized forms to avoid having multiple terms representing the same word or phrase, and thus to decrease the vocabulary size. This step is achieved by first using a part-of-speech (POS) tagger [19] to detect the type of each word in a sentence and then lemmatizing the words accordingly. For example, if the word "learning" is tagged as a VERB, it is regularized to "learn", whereas it remains "learning" if it is tagged as a NOUN. The lemmatization procedure further decreased the vocabulary to 8,144,852 terms, including 6,418,992 phrases.

As a last step, we removed the words contained in the well-known NLTK [10] and USPTO [18] stopword lists. The NLTK stopword list focuses on general stopwords encountered in everyday English, such as "a, an, the, . . . , he, she, his, her, . . . , what, which, who, . . . ", in total 179 words. The USPTO stopword list, on the other hand, includes words that occur very frequently in patent documents and do not carry critical meaning within patent texts, such as "claim, comprise, . . . , embodiment, . . . , provide, respectively, therefore, thereby, thereof, thereto, . . . ", in total 99 words. The union of these two lists contains 220 stopwords.

Additionally, we discarded the words appearing only once in the whole patent database, which leads to a final set of 6,645,391 terms, including 5,834,072 phrases.
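The POS-conditioned regularization can be illustrated with a deliberately simple toy rule. This is only a sketch of the idea; the paper uses a maximum-entropy POS tagger [19] together with a full lemmatizer, not this suffix heuristic.

```python
def lemmatize(word, pos):
    """Toy POS-conditioned lemmatizer: only verbs are suffix-stripped.
    Illustrative only; a real pipeline would use e.g. NLTK's WordNetLemmatizer."""
    if pos == "VERB":
        for suffix, repl in (("ying", "y"), ("ing", ""), ("ed", ""), ("s", "")):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)] + repl
    return word

print(lemmatize("learning", "VERB"))  # 'learn'  (verb reading)
print(lemmatize("learning", "NOUN"))  # 'learning'  (noun reading, kept as-is)
```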
To identify the frequently occurring words or phrases that carry little information content about engineering and technology, we use four metrics together: 1) direct term frequency (TF), 2) inverse document frequency (IDF), 3) term frequency-inverse document frequency (TFIDF) and 4) Shannon's information entropy [20].

Consider a corpus C of P patents, and let n(t, p) denote the count of term t in patent p. The term frequency of t is

    TF(t) = n(t) / n(p)    (2)

where n(p) = Σ_t n(t, p) is the number of terms in the patent p, and n(t) = Σ_{p ∈ P} n(t, p) is the total count of term t across all patents. The term frequency is an important indicator of the commonality of a term within a collection of documents. Stopwords are expected to have high term frequency.

Inverse document frequency (IDF) is calculated as follows:
    IDF(t) = log( |C| / DF(t) )    (3)

where DF(t) = |{p ∈ C : t ∈ p}| is the number of patents containing term t and |C| is the number of patents in the database. This metric penalizes frequently occurring terms and favours those occurring in only a few documents. Its lower bound is 0, attained by terms that appear in every single document in the database. The upper bound, log |C|, is attained by terms appearing in only one document.

Term frequency-inverse document frequency (TFIDF) is calculated as follows:

    TFIDF(t) = (1 / DF(t)) Σ_p [ (n(t, p) / n(p)) · (|C| / DF(t)) ]    (4)

This metric favours terms that appear in a few documents with a considerably high term frequency within those documents. If a term appears in many documents, its TFIDF score is penalized due to its commonality. Here, we did not use the traditional IDF metric but removed the log normalizing function, in order to penalize terms that occur commonly across the entire patent database harder, regardless of their in-document (patent) term frequencies. We eventually used the mean of the single-document TFIDF scores for each term.

The entropy of term t indicates how uneven the distribution of term t is in the corpus C. It is calculated as follows:

    H(t|C) = − Σ_p P(p|t) log P(p|t)    (5)

where P(p|t) = n(t, p) / n(t) is the distribution of term t over patent documents. This indicates how evenly a term is distributed across the patent database. The maximum attainable entropy value for a given collection of documents corresponds to an even distribution over all patents, which leads to log |C|.
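The document-frequency metrics in Eqs. (3) and (4) can be sketched over a toy "corpus" of token lists standing in for patent abstracts. The TFIDF variant here averages per-document scores over the DF(t) documents containing t, using the un-logged |C|/DF(t) factor described in the text; the toy data are illustrative only.

```python
import math

corpus = [
    ["method", "for", "cooling", "rotor"],
    ["method", "of", "rotor", "balancing"],
    ["laser", "cutting", "method"],
]

def df(term):
    """DF(t): number of documents containing the term."""
    return sum(term in doc for doc in corpus)

def idf(term):
    """Eq. (3): log(|C| / DF(t))."""
    return math.log(len(corpus) / df(term))

def tfidf(term):
    """Eq. (4)-style mean of per-document TF * (|C|/DF(t)) scores."""
    docs = [doc for doc in corpus if term in doc]
    scores = [doc.count(term) / len(doc) * len(corpus) / df(term) for doc in docs]
    return sum(scores) / len(scores)

print(round(idf("method"), 3))  # appears in every document -> 0.0
print(round(idf("laser"), 3))   # appears in one document -> log(3) ≈ 1.099
print(round(tfidf("laser"), 3)) # (1/3) * (3/1) = 1.0
```

Low IDF (and, under this variant, a TFIDF dragged down by commonality) flags "method"-like terms as stopword candidates.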
Therefore, terms with higher entropy values carry less information about the patents in which they appear, compared to terms with lower entropy.

We report the distributions of terms in our corpus according to these four metrics in the Appendix (see Figure A1). The term-frequency distribution has a very long right tail, indicating that most terms appear only a few times in the patent database while some words appear very frequently. Our further tests found that the distribution follows a power law [21, 22]. By contrast, the distribution of IDF has a long left tail, indicating the existence of a few terms that appear commonly in all patents. The TFIDF distribution also has a long right tail, indicating the existence of highly common terms in each patent as well as strong domain-specific terms dominating a set of patents. Moreover, the long right tail of the entropy distribution indicates comparatively few high-valued terms that appear commonly across the entire database. Assessing the four metrics together therefore allows us to detect stopwords with varied occurrence patterns.

We formed four different lists of terms sorted by decreasing TF, increasing IDF, increasing TFIDF, and decreasing entropy. Table A1 in the Appendix presents the top-ranked 30 terms in the respective lists. The top 2,000 terms in each of the four lists were then used to form a union set of terms. The union includes only 2,305 terms, which indicates that the lists based on the four alternative statistical metrics overlap significantly. The terms in the union set were then evaluated by two researchers, each with more than 20 years of engineering experience, in terms of whether a term carries information about engineering and technology, to identify stopwords. The researchers initially achieved an inter-rater reliability of 0.83 [23] and then discussed the discrepancies to reach consensus on a final list of 62 insignificant terms.
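The entropy of Eq. (5) can be sketched directly from per-document counts n(t, p). A term spread evenly over all documents attains the maximum log |C|; a term concentrated in one document has zero entropy. The counts below are toy values, not corpus statistics.

```python
import math

def term_entropy(counts_per_doc):
    """Eq. (5): H(t|C) = -sum_p P(p|t) log P(p|t), with P(p|t) = n(t,p)/n(t).
    counts_per_doc holds n(t, p) for each document p."""
    total = sum(counts_per_doc)
    return -sum((c / total) * math.log(c / total) for c in counts_per_doc if c > 0)

# evenly spread over 4 documents -> maximum entropy log(4)
print(round(term_entropy([5, 5, 5, 5]), 3))  # ≈ 1.386
```

High-entropy terms (evenly spread, like "method") are stopword candidates; low-entropy terms are domain-bearing.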
Compared to our previous study [13], which identified a list of stopwords (see Table A2 in the Appendices) by manually reading 1,000 randomly selected sentences from the same patent text corpus, this list includes 26 new uninformative stopwords that the previous list did not cover. In the meantime, we also found that the previous list contains another 25 stopwords that are still deemed qualified stopwords in this study. Therefore, we integrate these 25 stopwords from the previous study with the 62 stopwords identified here to derive a final list of 87 stopwords for technical language
analysis. The final list is presented in Table 1 together with the NLTK stopword list and the USPTO stopword list. It is suggested to apply the three stopword lists together in technical language processing applications across technical fields.

Table 1: Stopword lists for technical language processing applications

NLTK Stopword List [10] (179 words): a, about, above, after, again, against, ain, all, am, an, and, any, are, aren, aren’t, as, at, be, because, been, before, being, below, between, both, but, by, can, couldn, couldn’t, d, did, didn, didn’t, do, does, doesn, doesn’t, doing, don, don’t, down, during, each, few, for, from, further, had, hadn, hadn’t, has, hasn, hasn’t, have, haven, haven’t, having, he, her, here, hers, herself, him, himself, his, how, i, if, in, into, is, isn, isn’t, it, it’s, its, itself, just, ll, m, ma, me, mightn, mightn’t, more, most, mustn, mustn’t, my, myself, needn, needn’t, no, nor, not, now, o, of, off, on, once, only, or, other, our, ours, ourselves, out, over, own, re, s, same, shan, shan’t, she, she’s, should, should’ve, shouldn, shouldn’t, so, some, such, t, than, that, that’ll, the, their, theirs, them, themselves, then, there, these, they, this, those, through, to, too, under, until, up, ve, very, was, wasn, wasn’t, we, were, weren, weren’t, what, when, where, which, while, who, whom, why, will, with, won, won’t, wouldn, wouldn’t, y, you, you’d, you’ll, you’re, you’ve, your, yours, yourself, yourselves

USPTO Stopword List [18] (99 words): a, accordance, according, all, also, an, and, another, are, as, at, be, because, been, being, by, claim, comprises, corresponding, could, described, desired, do, does, each, embodiment, fig, figs, for, from, further, generally, had, has, have, having, herein, however, if, in, into, invention, is, it, its, means, not, now, of, on, onto, or, other, particularly, preferably, preferred, present, provide, provided, provides, relatively, respectively, said, should, since, some, such, suitable, than, that, the, their, then, there, thereby, therefore, thereof, thereto, these, they, this, those, thus, to, use, various, was, were, what, when, where, whereby, wherein, which, while, who, will, with, would

This Study (87 words): able, above-mentioned, accordingly, across, along, already, alternatively, always, among, and/or, anything, anywhere, better, disclosure, due, easily, easy, eg, either, elsewhere, enough, especially, essentially, et al, etc, eventually, excellent, finally, furthermore, good, hence, he/she, him/her, his/her, ie, ii, iii, instead, later, like, little, many, may, meanwhile, might, moreover, much, must, never, often, others, otherwise, overall, rather, remarkably, significantly, simply, sometimes, specifically, straight forward, substantially, thereafter, therebetween, therefor, therefrom, therein, thereinto, thereon, therethrough, therewith, together, toward, towards, typical, typically, upon, via, vice versa, whatever, whereas, whereat, wherever, whether, whose, within, without, yet

This list can be downloaded from our GitHub repository: https://github.com/SerhadS/TechNet
To develop a comprehensive list of stopwords in engineering and technology-related texts, we mined the patent text database with several statistical metrics, from term frequency to entropy, to automatically identify candidate stopwords, and we used human evaluation to validate, screen and finalize the stopwords from the candidates. In this procedure, the automatic data-driven detection based on four statistical metrics yielded highly overlapping results, and the human evaluations also came with high inter-rater reliability, suggesting evaluator independence. Our final stopword list can be used as a complement to the NLTK and USPTO stopword lists in NLP and text analysis tasks related to technology, engineering design, and innovation.
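The suggested usage, applying the three lists together as one filter, can be sketched as follows. The short sets below are illustrative subsets only (the technical words are drawn from this study's examples); the full technical list is distributed at the GitHub repository named above.

```python
# Illustrative subsets of the three stopword lists; real applications would load
# the full NLTK (179), USPTO (99) and technical (87) lists.
NLTK_SUBSET = {"the", "of", "a", "is"}
USPTO_SUBSET = {"claim", "comprise", "embodiment", "thereof"}
TECHNICAL_SUBSET = {"therebetween", "thereinto", "remarkably", "via"}

ALL_STOPWORDS = NLTK_SUBSET | USPTO_SUBSET | TECHNICAL_SUBSET  # union of the lists

def filter_tokens(tokens):
    """Remove generic, patent-legal and technical stopwords in one pass."""
    return [t for t in tokens if t not in ALL_STOPWORDS]

tokens = ["a", "seal", "disposed", "remarkably", "therebetween", "comprise", "rotor"]
print(filter_tokens(tokens))  # ['seal', 'disposed', 'rotor']
```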
References

[1] Danni Chang and Chun-hsien Chen. Product concept evaluation and selection using data mining and domain ontology in a crowdsourcing environment. Advanced Engineering Informatics, 29(4):759–774, Oct 2015.
[2] Yi Zhang, Alan L. Porter, Zhengyin Hu, Ying Guo, and Nils C. Newman. "Term clumping" for technical intelligence: A case study on dye-sensitized solar cells. Technological Forecasting and Social Change, 85:26–39, 2014.
[3] Mattyws F. Grawe, Claudia A. Martins, and Andreia G. Bonfante. Automated patent classification using word embedding. In , pages 408–411. IEEE, Dec 2017.
[4] Qiyu Liu, Kai Wang, Yan Li, and Ying Liu. Data-driven concept network for inspiring designers' idea generation. Journal of Computing and Information Science in Engineering, pages 1–39, 2020.
[5] Antoine Blanchard. Understanding and customizing stopword lists for enhanced patent mapping. World Patent Information, 29(4):308–316, 2007.
[6] Martin Gerlach, Hanyu Shi, and Luis A. Nunes Amaral. A universal information theoretic approach to the identification of stopwords. Nature Machine Intelligence, 2019.
[7] Rachel Tsz-Wai Lo, Ben He, and Iadh Ounis. Automatically building a stopword list for an information retrieval system. In , Utrecht, 2005.
[8] Christopher Fox. A stop list for general text. ACM SIGIR Forum, 24(1-2):19–21, Sep 1989.
[9] W. John Wilbur and Karl Sirotkin. The automatic identification of stop words. Journal of Information Science, 18(1):45–55, 1992.
[10] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., 2009.
[11] Henry Kučera and Winthrop Nelson Francis. Computational analysis of present-day American English. International Journal of American Linguistics, 35(1):71–75, 1969.
[12] Marcelo A. Montemurro and Damián H. Zanette. Towards the quantification of the semantic information encoded in written language. Advances in Complex Systems, 13(2):135–153, 2010.
[13] Serhad Sarica, Jianxi Luo, and Kristin L. Wood. TechNet: Technology semantic network based on patent data. Expert Systems with Applications, 142, 2020.
[14] Kazuhiro Seki and Javed Mostafa. An application of text categorization methods to gene ontology annotation. In SIGIR 2005 - Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 138–145, 2005.
[15] Dan Crow and John DeSanto. A hybrid approach to concept extraction and recognition-based matching in the domain of human resources. In Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI, pages 535–539, 2004.
[16] Masoud Makrehchi and Mohamed S. Kamel. Automatic extraction of domain-specific stopwords from labeled documents. Lecture Notes in Computer Science, 4956 LNCS:222–233, 2008.
[17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS) 26, pages 1–9, 2013.
[18] USPTO. Stopwords, USPTO Full-Text Database.
[19] Kristina Toutanova and Christopher D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In EMNLP '00 Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 63–70, 2000.
[20] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, Jul 1948.
[21] George Kingsley Zipf. The Psychobiology of Language. Routledge, London, 1936.
[22] George Kingsley Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, New York, 1949.
[23] Lee J. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16(3):297–334, Sep 1951.
Appendices
Table A1: Top 30 terms for term frequency, IDF, TFIDF and entropy

Rank | Term-Frequency | IDF | TFIDF | Entropy
1 | method | method | include | method
2 | first | include | method | include
3 | include | one | one | one
4 | second | first | comprise | form
5 | form | form | form | first
6 | one | comprise | system | comprise
7 | system | system | first | system
8 | plurality | second | least | second
9 | device | plurality | second | apparatus
10 | comprise | apparatus | apparatus | plurality
11 | apparatus | device | plurality | least
12 | least | least | receive | disclose
13 | least_one | disclose | disclose | device
14 | may | receive | device | receive
15 | connect | may | connect | may
16 | process | least_one | may | connect
17 | control | connect | position | least_one
18 | portion | control | control | control
19 | receive | process | least_one | process
20 | position | position | portion | position
21 | mean | portion | base | base
22 | surface | base | determine | portion
23 | say | surface | generate | surface
24 | base | determine | make | determine
25 | disclose | generate | surface | make
26 | configure | make | within | generate
27 | determine | mean | process | relate
28 | generate | produce | accord | produce
29 | substrate | configure | end | configure
30 | signal | relate | allow | within
Table A2: The stopwords identified in the previous study. * indicates that the term was also identified in the current study. + indicates that the term is a stopword as defined in the current study. The rest of the terms are no longer considered stopwords as defined in the current study.

able*, above-mentioned+, already*, always*, and/or*, anything+, anywhere+, better*, disclosure+, easily*, eg*, either*, elsewhere+, enough+, especially*, et al+, etc*, eventually+, finally*, furthermore*, he/she+, hence*, him/her+, his/her+, instead*, may*, meanwhile+, might+, moreover+, must*, often+, one, one another, otherwise*, possibly, rather*, remarkably+, significantly+, simply*, sometimes+, straight forward+, substantially, therebetween*, therefor*, therefrom*, therein*, thereinto+, thereon*, therethrough*, therewith*, towards*, typical+, via*, vice versa+, whatever+, whereat+, wherever+, whether*, whose*, within*, without*, wrt, yet*

Figure A1: Distribution of terms by (a) term frequency, (b) IDF, (c) TFIDF and (d) entropy. Term-frequency and TFIDF histograms are arbitrarily filtered (term count ≤ 10^6).