Text Similarity in Vector Space Models: A Comparative Study
Omid Shahmirzadi, Adam Lugowski, and Kenneth Younge
TIS, EPFL, Lausanne, Switzerland ({omid.shahmirzadi,kenneth.younge}@epfl.ch)
Patent Research Foundation, Seattle, USA ([email protected])
Abstract.
Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models on this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedding methods is justified only when (1) the target text is condensed and (2) the similarity comparison is trivial. TFIDF performs surprisingly well in all other cases: in particular, for longer and more technical texts, and for making finer-grained distinctions between nearest neighbors. Unexpectedly, extensions to the TFIDF method, such as adding noun phrases or calculating term weights incrementally, were not helpful in our context.
Keywords: text similarity, vector space model, text embedding, patent, big data
Automatic detection of semantic text similarity between documents plays an important role in many natural language processing applications. Techniques for this task fall into two broad categories: structure-based and structure-agnostic. In the first category, solutions rely on a logical structure of the text and transform it into an intermediate representation, such as aligned trees, to perform the comparison. While useful in many contexts, it is often unclear which structure to use for a particular comparison. In the second category, structure is ignored and the text is represented using a vector space model (VSM). While VSMs often do not capture semantic components of text (e.g., negations), they have nevertheless been shown to measure text similarity well in many applications.

A vector space model converts text into a numeric vector. A key aspect of VSMs is the definition and number of dimensions for each vector. In a common and simple approach, TFIDF defines a space where each term in the vocabulary is represented by a separate and orthogonal dimension. TFIDF measures the term frequency of each term in a text and multiplies it by the logged inverse document frequency of that term across the entire corpus. Despite its simplicity, TFIDF may suffer from an ignorance of n-gram phrases, complications with incremental updates upon addition of new documents, and a large number of dimensions. To deal with such issues, variants of TFIDF have been proposed to incorporate n-grams as new terms, and/or to adjust for the timing of the use of vocabulary across the time line of the corpus.

Other techniques, known as text embedding, attempt to address the high number of dimensions and the loss of semantic information in TFIDF models by transforming each text into a low-dimensional vector.
Text embedding methods can be grouped into two categories: (i) count-based methods built on a bag of words (where the order of words is ignored), and (ii) prediction-based methods built on sequences of words (where the order of words is taken into account). Topic models are an example of the first approach: each document is represented as a probability distribution over a given number of topics (and thus a lower-dimensional space), describing how relevant that document is to each topic. Each topic is a weighted average of a subset of terms, and document vectors are learned from the corpus on the assumption that words with similar meanings will occur in similar documents. Neural models are an example of the second approach: word vectors are learned using a shallow neural network trained on pairs of (target word, context word), where context words are the words observed to surround a target word. The assumption behind neural models is that words with similar meanings tend to occur in similar contexts. Document vectors can then be created from word vectors through an averaging strategy, or by treating each document as a special context token and thereby obtaining document vectors directly. Prior research suggests that topic models and neural models are fundamentally similar in that both arrive at a representation of the document in a lower-order space [13].

In this paper, we are interested in similarity measurement between patents. Patent-to-patent similarity has several applications, such as informing decisions on patent filing, predicting the probability of different types of patent rejections, and forecasting the innovation space. Previous studies [24] have shown that TFIDF is a powerful technique to detect patent-to-patent similarity, but the performance of other vectorization methods is unknown.
We therefore compare the performance of TFIDF to other, newer methods to determine the relative performance of such methods on real-world problems.

This paper continues as follows. In Section 2, we discuss the background material as well as related comparative studies on semantic text similarity. In Section 3, we introduce our data gathering, pre-processing, vectorizing, and performance evaluation pipeline. In Section 4, we present our experimental results. In Section 5, we conclude the paper and propose some avenues for future work.
Background
Vector space models transform text of different lengths (a word, sentence, paragraph, or document) into a numeric vector that can be fed into downstream applications (such as similarity detection or machine learning algorithms). TFIDF, the most basic text vectorization method, defines a space where each term in the vocabulary is represented by a separate and orthogonal dimension. Despite its popularity and simplicity, basic TFIDF may suffer from (i) ignorance of n-gram phrases, (ii) complications with incremental updates upon addition of new documents, and (iii) a large number of dimensions. To deal with the two former issues, variants of TFIDF have been proposed (i) to incorporate n-grams as new terms and/or (ii) to adjust for the timing of the use of vocabulary across the time line of the corpus. To deal with the latter issue, text embedding methods address the high number of dimensions by transforming each text into a low-dimensional vector. Text embedding techniques can be categorized into count-based and prediction-based models. Count-based models (a.k.a. topic models) create a document-term matrix where the weight of each cell is based on the number of times a term appears in the focal document. Prediction-based models (a.k.a. neural models) predict the occurrence of a term/document based on surrounding terms in order to learn a vectorization for each term/document.

In this section, we review the mentioned families of vector space models, namely (i) TFIDF models, (ii) topic models, and (iii) neural models. From each family, we also select candidate methods to be compared on the patent-to-patent similarity detection task.
TFIDF Models.
Term Frequency-Inverse Document Frequency (TFIDF) is one of the most common vectorization techniques for textual data, with many possible variations [21]. TFIDF considers two documents similar if they share rare, but informative, words. In TFIDF, every term is treated as a dimension orthogonal to all other dimensions. Each term is represented by a weight that is positively correlated with its occurrence in the current document and negatively correlated with its occurrence in all other documents in the corpus. The logic behind TFIDF is to downgrade the importance of terms that are common in many documents, on the view that those terms carry less information specific to a focal text. One common weighting scheme for a term t in document d is given in formula 1, where |D| is the total number of documents in the corpus, TF_{t,d} is the total number of occurrences of term t in document d, and DF_{t,D} is the total number of documents in which term t occurs:

TFIDF_{t,d} = TF_{t,d} · log((|D| + 1) / (DF_{t,D} + 1))    (1)

Despite the popularity and applicability of TFIDF in many applications, it suffers from the curse of dimensionality in many downstream tasks (e.g., computing k nearest neighbors [14]), it ignores n-gram phrases, and all IDF weights might need to be updated upon the addition of new documents. The basic model, however, can be extended in several ways to avoid some of these pitfalls. We consider two recently proposed extensions in this study.

First, we consider adding certain n-grams to the term vocabulary. N-grams allow for the combination of terms into higher-level concepts, which may be particularly important for research in the computational social sciences, including patent research [2]. Adding n-grams blindly, however, would vastly increase the size of the vocabulary, and thus the number of vector dimensions. A more manageable approach, therefore, is to add noun phrases based on syntactic properties of the text.
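As a concrete illustration, the weighting scheme of formula 1 above can be computed directly. The following is a minimal sketch in pure Python over a toy corpus; it is an illustration of the formula, not the Spark pipeline used in this study.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Sparse TFIDF vectors per formula (1):
    TFIDF_{t,d} = TF_{t,d} * log((|D| + 1) / (DF_{t,D} + 1))."""
    n_docs = len(corpus)
    df = Counter()                      # document frequency of each term
    for doc in corpus:
        df.update(set(doc))
    return [{t: tf * math.log((n_docs + 1) / (df[t] + 1))
             for t, tf in Counter(doc).items()}
            for doc in corpus]

corpus = [["patent", "claim", "novel"],
          ["patent", "claim", "method"],
          ["neural", "network", "method"]]
vecs = tfidf_vectors(corpus)
# "novel" occurs in 1 of 3 documents, so it outweighs the common "patent".
```

Note that with this smoothing, a term appearing in every document receives a weight of exactly zero, consistent with the goal of discounting uninformative terms.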
We test the phrase extraction technique from [9], which extracts noun phrases with a pattern-based method. The authors extend the simple noun phrase grammar of formula 2 to better support coordination of noun phrases and the handling of textual tags. A finite state transducer extracts text portions that match the input grammar, including nested and overlapping parts, from input text marked with part-of-speech (POS) tags. They impose no upper bound on the size of extracted phrases and show that their method extracts high-quality noun phrases efficiently:

Noun Phrase ≃ (Adj. | Noun)* Noun (Prep. Det.* (Adj. | Noun)* Noun)*    (2)

A second, and separate, extension takes advantage of the timing information of patents to implement incremental IDF [10]. More specifically, whenever a new document is added to the corpus, the corresponding IDF at that point in time is calculated based on the current state of the total corpus (see formula 3, where T is the addition time of a new document and D_T is the corpus available at time T). A term therefore has a high IDF, and high differentiating power, when it is first introduced into the vocabulary, and the IDF attenuates over time as use of the term becomes more common. An example would be a niche term for an emerging technology: the term would have a very high importance at the time of filing the patent, but would decline in importance over time. As a convenient side property, incremental calculation of IDFs also avoids the need to update all TFIDF vectors upon addition of a new document to the corpus.

TFIDF_{t,d,T} = TF_{t,d} · log((|D_T| + 1) / (DF_{t,D_T} + 1))    (3)

Topic Models.
Topic models transform a text into a fixed-size vector whose length equals a given number of latent topics. The vector represents the probability distribution that the focal text relates to each of the different topics. In practice, each topic is a weighted average of a subset of terms. Like TFIDF, topic models treat the text as a bag of words in which word order is ignored. On the down side, interpretation of each topic can be subjective, and determining the right number of topics requires tuning of the model.

Latent Semantic Indexing (LSI) [8,18] is a commonly used topic model for finding low-dimensional representations of words or documents. Latent Dirichlet Allocation (LDA) is another popular topic model that fits a probabilistic model with a special prior to extract topics and document vectors. We choose LSI as the representative of topic models in this study, as LDA models can be hard to reproduce due to their highly probabilistic nature. Given a set of documents d_1, d_2, ..., d_n and a set of vocabulary words w_1, w_2, ..., w_m, LSI builds a document-term matrix X of m × n dimensions, where item x_{i,j} can represent the total occurrences of w_j in d_i (as a raw count, a 0-1 count, or a TFIDF weight). To reduce the dimensions of X, truncated Singular Value Decomposition (SVD) is applied in LSI as in formula 4, where k is the number of topics:

X ≈ U_{m,k} Σ_{k,k} V^T_{n,k}    (4)

The low-dimensional vector of document i can be obtained as Σ_{k,k} d̂_i, where d̂_i is the i-th row of matrix V. The approximation of X comes from selecting the k largest entries of the primary diagonal matrix Σ, and the corresponding columns and rows of the U and V matrices, from the original singular value decomposition. Truncated SVD can be implemented efficiently and updated incrementally on the addition of new documents [5,22].

Neural Models.
Unlike models that simply count terms, neural models capture information from the context of the words that surround a given word, hence taking ordering into account. The most well-known model for predicting word context is W2V (word2vec) [16], in which the authors propose an algorithm based on a shallow, three-layer neural network to learn word vectors. Prior research has shown the W2V model to perform well on analogy and similarity relationships. Given a context window size, the W2V algorithm comes in two forms. In the first form, known as CBOW (Continuous Bag of Words), the model predicts the probability of a target word given a context word. In the second form, known as Skip-Gram, the model predicts the probability of a context word given a target word. We explain the Skip-Gram mechanics in more detail. Consider a corpus with a sequence of words w_1, w_2, ..., w_T and a window of size c, where the c words on the left and right side of a focal word are considered as context. The objective function to be maximized is given in formula 5:

(1/T) Σ_{t=1}^{T} Σ_{-c ≤ i ≤ c, i ≠ 0} log p(w_{t+i} | w_t)    (5)

The probabilities are defined by a softmax. Letting u_w be a target word vector and v_w be a context word vector, probabilities are calculated using formula 6:

p(w_c | w_t) = exp(v_{w_c}^T u_{w_t}) / Σ_{w=1}^{W} exp(v_w^T u_{w_t})    (6)

Due to the computational cost of using softmax as a loss function (computing the gradient has a complexity proportional to the vocabulary size W), two efficient alternatives to softmax have been suggested [17]: hierarchical softmax and negative sampling. Also note that W2V drops frequent words with probability 1 − sqrt(t / f(w_i)), where t is a hyper-parameter of the model used for the down-sampling of frequent words and f(w_i) is the word frequency in the corpus.

To obtain document vectors from word vectors, one can average together all word vectors.
Because simple averaging gives the same weight to important and unimportant words, one may be able to retain more information by assigning TFIDF-based weights during averaging. Alternatively, an extension of W2V, known as D2V (doc2vec) or paragraph vectors [12], has been proposed to obtain document vectors directly by treating each document as a special context token added to the training data, such that the model learns token vectors and treats them as document vectors. Building on W2V, the D2V algorithm also comes in two flavors: Distributed Memory (DM) and Distributed Bag of Words (DBOW). D2V can be implemented incrementally and has shown better performance than previous approaches on some similarity detection benchmarks. On the negative side, however, it has numerous hyper-parameters that must be tuned to harvest its power to the full extent. We consider the D2V model as the representative neural model for the experiments in this study.

Given two vectors, one can measure the similarity between them in many ways: Euclidean distance, angular separation, correlation, and others. Although there are differences between these measures, in this study we adopt cosine similarity (see formula 7), as it is well known and frequently used:

sim(A, B) = (A · B) / (|A| |B|)    (7)

Several recent studies compare different vector space models with respect to their similarity detection power for texts. Very few of them, however, target similarity detection for longer texts (e.g., documents). In [11] the authors compared D2V variants against an averaging W2V, as well as a probabilistic n-gram model, on two similarity tasks. In the first task, the goal is to detect the similarity of forum questions; the second task aims to detect similarity between pairs of sentences. The authors find that D2V is superior in most cases and that training the model on a large corpus can improve their results.
In [19] the authors attempt to detect similarity between sentences and compare several neural models against a baseline that simply averages word vectors. They find that more complex neural models work best for in-domain scenarios (where the training and testing data sets are from the same domain), while a baseline of averaged word vectors is hard to beat for out-of-domain cases. In [3] the authors proposed a method for sentence embedding through the weighted averaging of word vectors as transformed by a dimensionality-reduction technique. They show that their text vectors outperform well-known methods for detecting sentence-to-sentence similarity. In [20] the authors use an unsupervised method to vectorize sentences and show that their method outperforms other state-of-the-art techniques for detecting the similarity of short sequences of words. In each of these studies, however, the objective was to determine the performance of similarity detection algorithms on relatively short sections of text.

There has been much less research on the performance of similarity comparisons for longer text. In [7] the authors compared D2V to a weighted W2V, LDA, and TFIDF to detect the similarity of documents in Wikipedia and arXiv corpora. They find that D2V can outperform other models on Wikipedia, but that D2V could barely beat a simple TFIDF baseline on arXiv. In [1] the authors compared several algorithms to detect similarity between biomedical papers in PubMed and find that advanced embedding techniques again can hardly beat simpler baselines such as TFIDF. This paper adds to the stream of research comparing text vectorization methods for longer text. In particular, we focus on a real-world problem with an objective standard for determining similarity, whereas prior research has had to rely on broad categorizations from repositories such as Wikipedia and arXiv.
To the best of our knowledge, this work is also the first comparative study of semantic similarity methods in the patent space.

From the raw data, we create a pipeline to extract the technical description from each patent and vectorize the text. Our code, based on Python 3, is available on GitHub upon email request. For the purpose of this study, we focus on the following fields:

– Number: unique ID for each issued patent
– Title: patent information of high density
– Abstract: patent information of moderate density
– Description: patent information of low density
– Date: date when the patent application was submitted
– Class: one of 491 patent main class classifications by the USPTO
– Subclass: one of 82,520 patent subclass classifications by the USPTO

All of the necessary data can also be downloaded from https://bulkdata.uspto.gov/.

Pre-Processing.
For pre-processing of the data, we use the DataProc service of Google Cloud Platform. DataProc, a managed Apache Spark [25] service hosted by Google, provides a high-performance infrastructure for rapid implementation of data-parallel tasks such as data pre-processing. We apply several pre-processing steps to the textual fields of the patent data (title, abstract, description):

– remove HTML, non-ASCII characters, terms with digits, terms of fewer than 3 characters, internet addresses, and sequences such as DNA
– stem words and change to lower case
– remove stopwords, including both general NLP and patent-specific ones
– remove rare terms with extremely low total document frequency

We vectorize titles, abstracts and descriptions for each of the following models:
Simple TFIDF.
We use the machine learning library in the Apache Spark framework for our implementation of TFIDF. There are two flavors of TFIDF in Spark to consider. The first method is based on CountVectorizer, where a vocabulary is generated and term frequencies are explicitly counted before being multiplied by inverse document frequencies. The CountVectorizer method, however, creates a highly sparse representation of a document over the vocabulary. The second method takes advantage of HashingTF, which transforms a set of terms into fixed-length feature vectors. The hashing trick can be used to derive dimension indices directly, but it can suffer from collisions. In this work, we use the first method as a slower but more robust technique.
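To make the trade-off concrete, the two flavors can be sketched in a few lines of pure Python. This is a toy stand-in for the idea, not the actual Spark CountVectorizer/HashingTF API: explicit counting needs a stored vocabulary, while the hashing trick maps terms straight to indices at the risk of collisions.

```python
from collections import Counter

def count_tf(doc, vocab):
    """CountVectorizer-style: explicit vocabulary, one dimension per term."""
    tf = Counter(doc)
    return [tf[t] for t in vocab]

def hashing_tf(doc, num_features=8):
    """HashingTF-style: index = hash(term) mod num_features.
    No vocabulary to store, but distinct terms may collide."""
    vec = [0] * num_features
    for term in doc:
        vec[hash(term) % num_features] += 1
    return vec

doc = ["patent", "claim", "patent", "novel"]
vocab = sorted(set(doc))          # ["claim", "novel", "patent"]
print(count_tf(doc, vocab))       # [1, 1, 2]
print(sum(hashing_tf(doc)))       # 4: counts are preserved, positions hashed
```

With only 8 hash buckets, two of the three distinct terms may share a dimension; Spark defaults to a far larger feature space to make collisions rare.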
Incremental TFIDF.
Incremental updating of inverse document frequencies can be implemented in one of two ways: (i) create a new TFIDF model at a fixed time interval for newly arrived patents, or (ii) create a TFIDF model on the whole corpus and then adjust the document frequency vectors by subtracting the document frequencies of future patents with respect to each focal patent. We implement the second approach.
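The ingredients of this adjustment can be sketched as follows, assuming a hypothetical corpus layout of (timestamp, tokens) pairs: document frequencies are aggregated per time period so that the time-restricted IDF of formula 3 can be recovered for any focal time T. This is an illustration of the idea, not our Spark implementation.

```python
import math
from collections import Counter

def df_by_period(corpus):
    """corpus: list of (timestamp, tokens). Per-period document frequencies."""
    out = {}
    for ts, doc in corpus:
        out.setdefault(ts, Counter()).update(set(doc))
    return out

def idf_at(term, T, dfs, corpus):
    """IDF per formula (3): use only documents added up to time T."""
    n_docs = sum(1 for ts, _ in corpus if ts <= T)
    df = sum(c[term] for ts, c in dfs.items() if ts <= T)
    return math.log((n_docs + 1) / (df + 1))

corpus = [(1, ["laser"]), (1, ["diode"]), (1, ["valve"]),
          (2, ["laser"]), (3, ["laser"])]
dfs = df_by_period(corpus)
early = idf_at("laser", 1, dfs, corpus)   # rare at T=1: high IDF
late = idf_at("laser", 3, dfs, corpus)    # common by T=3: lower IDF
```

The toy term "laser" starts rare and becomes ubiquitous, so its IDF attenuates over time, mirroring the niche-term example above.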
Phrase-augmented TFIDF.
We augment the vocabulary of TFIDF with noun phrases as extracted by the open-source NPFST library (http://slanglab.cs.umass.edu/phrasemachine). The library is based on Python 2 and was ported to Python 3 to be compatible with the rest of the pipeline. NPFST can be configured to limit the number of noun phrases based on their frequency and length.
(Google Cloud DataProc: https://cloud.google.com/dataproc)
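For intuition, the noun-phrase grammar of formula 2 can be approximated with a regular expression over POS tags. The toy matcher below only illustrates the pattern, not the NPFST finite state transducer; the tag names and example sentence are invented for the sketch.

```python
import re

# Toy matcher for the noun-phrase grammar of formula (2):
#   (Adj | Noun)* Noun (Prep Det* (Adj | Noun)* Noun)*
# One letter per tag keeps the regular expression readable.
TAG_CODE = {"Adj": "A", "Noun": "N", "Prep": "P", "Det": "D"}
NP = re.compile(r"[AN]*N(PD*[AN]*N)*")

def longest_np(tagged):
    """Return the longest noun phrase starting at position 0, or []."""
    codes = "".join(TAG_CODE.get(tag, "X") for _, tag in tagged)
    m = NP.match(codes)
    return [w for w, _ in tagged[:m.end()]] if m else []

sent = [("optical", "Adj"), ("fiber", "Noun"), ("of", "Prep"),
        ("the", "Det"), ("laser", "Noun"), ("emits", "Verb")]
print(longest_np(sent))  # ['optical', 'fiber', 'of', 'the', 'laser']
```

As in [9], the pattern places no upper bound on phrase length, so prepositional attachments such as "optical fiber of the laser" are captured as a single phrase.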
LSI.
Despite the advantage of Spark for running data-parallel tasks, its support for model-parallel tasks such as LSI is limited. We therefore use the LSI implementation in Gensim, a well-established library for text vectorization. Gensim implements several different vector space models, exploits the parallelism of multiple CPU cores, and supports large corpora that cannot reside in memory. Gensim also implements LSI in a memory-efficient way that supports incremental updates, an important consideration for large corpora. Gensim accepts a TFIDF document-word matrix as input and has several hyper-parameters, of which we consider the following for tuning:

– num-topics: number of latent topics
– chunksize: number of documents in each training chunk
– decay: weight of existing observations relative to new ones (max 1.0)
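The decomposition of formula 4 can be illustrated with a plain truncated SVD in NumPy. This is a minimal sketch on a toy term-document matrix, not Gensim's streamed, incrementally updatable implementation.

```python
import numpy as np

def lsi_vectors(X, k):
    """Truncated SVD per formula (4): X ~ U_k S_k V_k^T.
    Rows of V_k scaled by the singular values give k-dim document vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T * s[:k]               # shape: (n_docs, k)

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 2.0]])
docs = lsi_vectors(X, k=2)
# Documents 0 and 1 share vocabulary and end up with nearly parallel 2-d
# vectors; document 2 shares no terms with document 0 and stays nearly
# orthogonal to it.
```

Picking k = 2 here keeps the two dominant singular directions and discards the smallest, which is exactly the approximation step described above.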
D2V.
We also use the implementation of D2V in Gensim (https://radimrehurek.com/gensim/index.html), as the Gensim version is memory efficient, allows for incremental updates, and can be parallelized over multiple cores. We fix the random seed and the Python hash seed for reproducibility of results. The D2V implementation accepts raw pre-processed text in the form of TaggedDocument elements and has several hyper-parameters, of which we consider the following for tuning:

– dm: flag to choose the distributed memory or distributed bag of words algorithm
– size: dimension of the feature vectors
– window: maximum distance between target and context word in the text
– sample: threshold above which high-frequency words are randomly ignored
– iter: number of iterations over the corpus
– hs: flag to choose the hierarchical softmax or negative sampling method

To evaluate the relative performance of each vectorization method, we construct a classification test on our data, where we use the similarity score between two vectors from a given method to predict whether that pair of patents would be rated as similar (positive label) or not (negative label) according to an independent benchmark. Specifically, our evaluation test requires:

1. a benchmark indicator of similarity as the ground truth,
2. a way to calculate a similarity score between vectors,
3. an evaluation metric to assess the accuracy of predictions, and
4. a mechanism to tune the hyper-parameters of different models.

Benchmark.
Identifying a ground-truth benchmark for similarity is an important requirement for evaluating the performance of similarity detection from automated text vectorization. While a continuous measure of the ground truth would be ideal, we are not aware of a reliable measure of continuous patent-to-patent similarity that is separate from textual analysis.
Instead, we construct a benchmark set of both more-similar pairs of patents (positive cases), as identified by the USPTO, and less-similar pairs of patents (negative cases), and evaluate the relative performance of text vectorization techniques in predicting the difference between the two.

For the selection of more-similar pairs of patents, one might consider patent citations a natural choice, as one would expect patent citations to reference more-similar patents. Within the full set of patent citations, however, there is a subset of "102 rejections" that one would expect to be even more highly similar than an average patent citation. Patent examiners issue a 102 rejection when they believe that a cited patent is similar enough to the citing patent to reject the patent application of the citing patent on the basis that the new invention is not novel. Although 102 rejections are not a perfect indicator of similarity, it is reasonable to believe that 102-rejection pairs are considerably more similar to each other than other comparison sets. We therefore select 102 rejections as our human-labeled indicator of similarity.

For the selection of the comparison set of less-similar patent pairs, one can easily identify patents that are in fact very dissimilar, and therefore very easy to distinguish from the positive cases defined above. Alternatively, one can pick pairs of patents that are, themselves, somewhat similar, and therefore much harder to distinguish from the positive cases.
As such, we select and evaluate the performance of vectorization methods across a range of comparison difficulties. Going from harder to easier separation, we select negative cases for three testing scenarios: (i) pairs of patents selected at random from the same subclass, (ii) pairs of patents selected at random from the same main class, and (iii) pairs of patents simply selected at random.

In summary, we selected a random set of 102-rejection patent pairs for our positive labels, and we selected multiple sets of patent pairs for our negative labels (sets drawn from the same subclass, the same main class, or at random). We chose a test set large enough to avoid selection bias (50K positive-labeled and 50K negative-labeled pairs).
Similarity Calculation.
We calculate the similarity of each patent pair as the cosine similarity of their corresponding vectors in a given vector space. To compare pairs of patents under an incremental TFIDF model, a recent study [10] proposes replacing the IDF of the later patent with that of the earlier patent so that both vectors are on the same time scale. We therefore pre-compute aggregated DFs by month so that IDF replacement can be done efficiently at run time.
(The Patent Research Foundation provided us with pairs of patents in 102 rejections from public records for use in this study. More recently, the USPTO has also released a data set of 102 rejections [15]. Validation tests on both data sets give similar results.)
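Cosine similarity over sparse TFIDF vectors (formula 7) reduces to a few lines. This sketch assumes vectors stored as term-to-weight dictionaries, which is one convenient layout for illustration, not necessarily the one used in our pipeline.

```python
import math

def cosine_sim(a, b):
    """Formula (7): sim(A, B) = (A . B) / (|A| |B|), for sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical TFIDF weights for two patents sharing the term "laser".
a = {"laser": 1.2, "diode": 0.8}
b = {"laser": 0.9, "valve": 1.1}
print(cosine_sim(a, b))
```

Only the terms present in both vectors contribute to the numerator, so sparse dictionaries keep the comparison cheap even in a high-dimensional TFIDF space.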
Evaluation Metric.
The Receiver Operating Characteristic (ROC) curve is a standard method for comparing the accuracy of classifiers when the class distribution can be skewed and the probability threshold for assigning labels is not determined [4]. Given labels and predictions, the Area Under the Curve (AUC) represents the probability that a vectorization method will score a positive case (a randomly selected 102-rejection pair of patents) as more similar than a negative case (a randomly selected pair of patents from a comparison set). The AUC is our preferred metric for comparing models, and we plot ROC curves for visual comparison.
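The AUC can be computed directly from this pairwise reading: the probability that a random positive pair scores above a random negative pair, counting ties as one half. A minimal sketch over hypothetical similarity scores:

```python
def auc(pos_scores, neg_scores):
    """AUC = P(score of a random positive pair > score of a random negative
    pair), with ties counted as one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Similarity scores for 102-rejection pairs (positives) vs. random pairs.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.3]))  # 8 of 9 pairings won
```

This quadratic-time version is only for intuition; with 50K positives and 50K negatives per scenario, a sort-based computation of the same quantity is used in practice.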
Model Tuning.
Models with hyper-parameters (e.g., topic models and neural models) need to be tuned for optimal performance. Studies show that hyper-parameter tuning can be as important as, or even more important than, the choice of the model itself [13]. Nevertheless, the complexity and cost of tuning can escalate quickly as the number of hyper-parameters increases. Classic approaches to tuning, such as grid search or random search, are often a poor choice when the cost of model evaluation is high. For grid search, it is difficult to select grid points for continuous parameters, and every added hyper-parameter increases the tuning cost geometrically. For random search, one can end up evaluating many poor configurations, which is wasteful when model evaluation is expensive.

Bayesian optimization has been shown to be superior to classic hyper-parameter tuning solutions for expensive models [23]. In particular, it provides a global, derivative-free approach that is suitable for tuning black-box models with a high cost of evaluation. The cost of tuning is also linear with respect to the number of hyper-parameters.

Algorithm 1 depicts Bayesian optimization at a high level. Given a function to optimize f and a set of hyper-parameters X, the algorithm creates a set of initial points on the optimization surface (line 2) and saves the obtained values in a set D. For a budget of N evaluations, the algorithm then does the following: (i) fit the distribution of possible functions as a Gaussian process GP to D (line 4); (ii) suggest the next point to assess as the one maximizing the expected value of its goodness under the probabilistic model (line 5); (iii) assess the costly objective function (line 6); and (iv) add the newly assessed point to the set D (line 7).
At the end of the computing budget, the best point observed in D is reported as the optimum. (This method can be extended to suggest several points at a time instead of a single one [6].)

Algorithm 1 Bayesian optimization loop
1: input: f, X
2: D ← InitialSamples(f, X)
3: for i ← |D| to N do
4:   p(y | x, D) ← FitModel(GP, D)
5:   x_i ← argmax_{x ∈ X} EV(x, p(y | x, D))
6:   y_i ← f(x_i)
7:   D ← D ∪ {(x_i, y_i)}
8: end for

We used the scikit-optimize library (https://scikit-optimize.github.io) to implement Bayesian optimization, due to the library's ability to run in parallel and to integrate with other libraries in Python. We set a tuning budget equal to 10 times the number of hyper-parameters for most experiments, but stopped tuning short for the description field due to the high cost and low expectation of further improvement.

Figure 1 plots the performance of patent-to-patent similarity measurement by model. Each row corresponds to a different vectorization approach and each column corresponds to testing on a different text field. Each pane compares a simple TFIDF model as a baseline (solid lines) with the more complicated vectorization method indicated for that row (dashed lines). The comparison is done for each of the three benchmarks mentioned earlier in Section 3.3: a random set of 102-rejection patent pairs with positive labels, and three sets of patent pairs with negative labels (drawn from the same subclass, the same main class, or at random). The cosine similarity of each patent pair is calculated using the two vector space models, and the corresponding ROC curves are plotted, with different colors corresponding to different benchmarks.

In the first row of Fig. 1, TFIDF with noun phrases performs almost identically to the baseline, across all text lengths and benchmarks. This is counter-intuitive, as augmenting a bag-of-words vector with n-grams would seem to add more information. It is possible, however, that adding every noun phrase is too granular.
We performed additional experiments including only the top 50, 100, and 200 noun phrases, but found even worse results than the baseline.

In the second row of Fig. 1, incremental TFIDF also performs almost identically to the baseline, across all text lengths and benchmarks. This result is contrary to recent studies [10] in which incremental TFIDF was presumed to be beneficial. However, even if incremental TFIDF is not better at similarity detection, it is computationally cheaper for a corpus that expands over time, and our results therefore suggest that incremental TFIDF may be a reasonable choice in that context.

In the third row of Fig. 1, a highly-tuned LSI model is able to beat the performance of the baseline, but only for a very easy similarity comparison and only for very short text (titles). Otherwise, LSI performs worse than the baseline.

In the fourth row of Fig. 1, a highly-tuned D2V model is able to beat the baseline, but the gain is considerable only for very short text (titles) and only for an easy similarity comparison. D2V gives only a very slight improvement over the baseline in all other conditions.

[Figure 1 here: a 4 x 3 grid of ROC plots (rows: TFIDF with Phrases, Incremental TFIDF, Highly-Tuned LSI, Highly-Tuned D2V; columns: Title, Abstract, Description; y-axes: sensitivity). Legend: green, rejections vs. random pairs; blue, rejections vs. same-class pairs; red, rejections vs. same-subclass pairs; solid line, TFIDF baseline; dashed line, comparison model.]
Fig. 1.
Inferred accuracy of patent-to-patent similarity by vectorization method. Figures plot ROC curves for the prediction of patent rejection based on the similarity scores of four vectorization methods against a simple TFIDF baseline: TFIDF with noun phrases; TFIDF with incremental calculation of IDF; highly-tuned LSI; and highly-tuned D2V. In each plot, solid lines represent the simple TFIDF model as a baseline, and dashed lines represent the comparison model. Experiments were run by length of text (title, abstract, and description) and difficulty of prediction: easy (in green: discrimination between a 102-rejection pair of patents and a random pair of patents); medium (in blue: discrimination between a 102-rejection pair of patents and a pair of patents from the same patent class); and difficult (in red: discrimination between a 102-rejection pair and a pair of patents from the same patent subclass).

Table 1.
AUC performance comparison of the best VSM to the simple TFIDF.
              Title   Abstract  Description
Sub-class
  Best VSM    0.646   0.749     0.775
  TFIDF       0.643   0.738     0.768
  Improvement 0.4%    1.5%      0.9%
Main-class
  Best VSM    0.723   0.858     0.900
  TFIDF       0.720   0.846     0.886
  Improvement 0.4%    1.4%      1.6%
Random
  Best VSM    0.907   0.977     0.993
  TFIDF       0.786   0.957     0.988
  Improvement 15.4%   2.0%      0.5%
Table 2.
Computation wall time comparison of the best VSM to the simple TFIDF.
           Title    Abstract  Description
Best VSM   hours    days      weeks
TFIDF      seconds  minutes   hours

To summarize, Table 1 reports the AUC, and the percentage improvement in AUC, of the best vectorization method over the simple TFIDF baseline for every combination of text field and test set, while Table 2 gives a rough estimate of the wall time required by each method to compute the corresponding vector space representation of the full corpus on our hardware (excluding preprocessing time). In each case, the best method performs better than a simple TFIDF model, but the percentage improvement is negligible in all cases other than similarity comparison based on titles with very easy distinguishability (102-rejection pairs versus random pairs). Moreover, the computation wall time of the best method is at least two orders of magnitude higher than that of the baseline TFIDF in all scenarios.

To better understand the effect of tuning hyper-parameters with Bayesian optimization, Figure 2 plots the AUC of D2V over the course of tuning the model. Rows correspond to testing difficulty; columns correspond to text fields. We stopped tuning at 10 times the number of parameters (i.e., 60 rounds), except for the description field, where there was little improvement.

Several trends in Figure 2 are clear: 1) easier similarity detection makes hyper-parameter tuning more important; 2) longer text makes hyper-parameter tuning less beneficial; and 3) using D2V with default parameters gives very poor performance, substantially worse than a simple TFIDF baseline. Table 3 shows the values of the tuned hyper-parameters of D2V. Highly-tuned D2V performs the best in all testing scenarios.
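The tuning loop of Algorithm 1 can be illustrated with a small self-contained sketch. This toy uses a Gaussian-process surrogate with an RBF kernel and an expected-improvement acquisition maximized over a fixed candidate grid; it is an illustration only, not the scikit-optimize pipeline used in our experiments, and the objective `f` and the grid below are hypothetical placeholders:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at candidate points Xs."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # k(x, x) = 1 for RBF
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    """EI acquisition for minimization."""
    sigma = np.sqrt(var)
    z = (y_best - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / sqrt(2.0 * pi)
    return sigma * (z * cdf + pdf)

def bayesian_optimize(f, candidates, n_init=3, n_iter=10, seed=0):
    """Algorithm 1: sample initial points, then repeatedly fit the surrogate,
    maximize the acquisition over the candidates, and evaluate f there."""
    rng = np.random.default_rng(seed)
    X = candidates[rng.choice(len(candidates), size=n_init, replace=False)]
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        mu, var = gp_posterior(X, y, candidates)
        x_next = candidates[int(np.argmax(expected_improvement(mu, var, y.min())))]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]

# Toy usage: minimize a 1-D quadratic over a grid (hypothetical objective).
grid = np.linspace(0.0, 4.0, 81).reshape(-1, 1)
x_best, y_best = bayesian_optimize(lambda x: (x[0] - 2.0) ** 2, grid)
```

In a real tuning run, each call to `f` would train a model with the proposed hyper-parameters and return the negated AUC, which is what makes the budget of roughly 10 evaluations per hyper-parameter so attractive compared to grid search.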
[Figure 2: AUC of D2V over tuning rounds, by benchmark (rows: Same Subclass, Same Class, Random) and text field (columns: Title, Abstract, Description). Solid: AUC for D2V during tuning; dotted: AUC for D2V with default parameters; dashed: AUC for the simple TFIDF baseline.]
Fig. 2.
Hyper-parameter tuning of D2V for each testing scenario.
Table 3.
Highly-tuned hyper-parameters for D2V.
Field        Benchmark   dm  hs  size  window  sample   iter  AUC
Title        Sub-class   0   1   374   1       1e-3.28  10    0.647
Title        Main-class  0   1   250   10      1e-3     10    0.725
Title        Random      0   0   321   1       1e-3.08  10    0.907
Abstract     Sub-class   1   1   491   1       1e-4.06  10    0.750
Abstract     Main-class  1   0   290   1       1e-4.01  10    0.859
Abstract     Random      1   0   522   1       1e-4.04  9     0.977
Description  Sub-class   1   0   321   7       1e-4.91  10    0.775
Description  Main-class  1   1   592   1       1e-5.52  10    0.900
Description  Random      1   0   501   1       1e-7     10    0.993
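For concreteness, the simple TFIDF baseline with cosine similarity that all of these comparisons are measured against can be sketched in a few lines of Python. This is an illustrative toy with naive whitespace tokenization and invented example documents, not the production pipeline used for the patent corpus:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TFIDF vectors: term frequency times log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented toy documents: two related "patents" and one unrelated one.
docs = ["rotary engine with turbine blades",
        "turbine engine with rotary blades",
        "method for encrypting network packets"]
vecs = tfidf_vectors(docs)
```

Note that a term appearing in every document receives zero weight (the log of one), which is why near-universal vocabulary contributes nothing to similarity scores in a real corpus.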
In this paper, we evaluated the performance of text vectorization methods for the real-world application of automatic measurement of patent-to-patent similarity. We compared a simple TFIDF baseline to more complicated methods, including extensions to the basic TFIDF model, the LSI topic model, and the D2V neural model. We tested the models on shorter to longer text, and on easier to harder problems of similarity detection.

For our application, we find that simple TFIDF, considering its performance and cost, is a sensible choice. The use of more complex embedding methods that can require extensive tuning, such as LSI and D2V, is only justified if the text is very condensed and the similarity detection task is relatively coarse. Moreover, extensions to the baseline TFIDF, such as adding n-grams or incremental IDFs, do not seem to be beneficial. Although our conclusion is based on experiments over a patent corpus, we believe that it can be generalized to other corpora, given the minimal patent-specific interventions in our pipeline.

Our results are compatible with previous studies on semantic text similarity detection with embedding methods. The focus of prior research, however, has typically been on short text and simple similarity detection problems. Few studies have evaluated the performance of different vector space models on long text or on more challenging benchmarks.

In practice, for the context of this study (patent-to-patent similarity), discriminating between random pairs of patents and rejection pairs of patents is a rather trivial problem that probably does not require a complicated NLP solution. It is only on such problems, however, that D2V and LSI might outperform the TFIDF model considerably. Instead, for many applications, users are looking for the automatic detection of differences between relatively similar patents (e.g., same-subclass pairs versus rejection pairs).
For such problems, where the differences in similarity are small, simple TFIDF appears to be a good choice. The difference in cost and simplicity is such that the use of a simple TFIDF model, which might do slightly worse under certain conditions than more complex models, may still be justified. An extension of TFIDF with incremental IDF calculation could provide the additional benefit of avoiding recalculation of all TFIDF vectors upon the addition of new patents, without sacrificing performance on the similarity detection task.

This study can be extended by future research in several directions, both in theory and in practice. We observed that incorporating noun phrases and incremental timing information did not lead to better detection of similar patents. For noun phrases, perhaps the low weights of locally-filtered phrases miss the signal, and a more global approach to filtering might improve performance. For incremental TFIDF, it appears that adjusting IDF vectors over time has too weak an effect on model performance in our context; it remains to be seen, however, how incremental IDFs may fare in more rapidly evolving domains. Studying the effect of similarity metrics other than cosine similarity could also add another dimension for future work. Last but not least, developing better unsupervised vector space models for similarity detection on longer text, as well as a fundamental understanding of the limitations of current embedding methods in this context, are clearly fertile grounds for research.
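The incremental IDF idea mentioned above can be sketched as follows: freeze each document's IDF weights at the moment the document arrives, using only the corpus seen so far, so that earlier vectors never need to be recomputed. This is a hypothetical illustration of the idea (naive whitespace tokenization, invented example text), not the implementation evaluated in our experiments:

```python
import math
from collections import Counter

class IncrementalTfidf:
    """Vectorize documents as they arrive. Each document's IDF weights are
    computed from the corpus seen up to that point, so adding new documents
    never forces recomputation of earlier vectors."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()  # document frequency of each term seen so far

    def add(self, doc):
        tokens = doc.lower().split()
        self.n_docs += 1
        self.df.update(set(tokens))
        tf = Counter(tokens)
        # IDF uses only the documents seen so far (including this one)
        return {t: tf[t] * math.log(self.n_docs / self.df[t]) for t in tf}

vectorizer = IncrementalTfidf()
v1 = vectorizer.add("rotary engine with turbine blades")
v2 = vectorizer.add("turbine engine housing")
```

One consequence of this scheme is that the very first document receives all-zero weights (each of its terms has appeared in the entire one-document corpus), so in practice the counters would be seeded with an initial batch before incremental vectorization begins.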
References
1. Jon Ezeiza Alvarez. A review of word embedding and document similarity algorithms applied to academic text. Bachelor thesis, University of Freiburg, 2017.
2. Linda Andersson, Mihai Lupu, João Palotti, Allan Hanbury, and Andreas Rauber. When is the time ripe for natural language processing for patent passage retrieval? In CIKM, 2016.
3. Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In ICLR, 2017.
4. Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
5. Matthew Brand. Incremental singular value decomposition of uncertain data with missing values. In ECCV, 2002.
6. Clément Chevalier and David Ginsbourger. Fast computation of the multi-points expected improvement with applications in batch selection. Working paper or preprint, October 2012.
7. Andrew M. Dai, Christopher Olah, and Quoc V. Le. Document embedding with paragraph vectors. CoRR, abs/1507.07998, 2015.
8. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
9. Abram Handler, Matthew Denny, Hanna M. Wallach, and Brendan T. O'Connor. Bag of what? Simple noun phrase extraction for text analysis. In EMNLP, 2016.
10. Bryan Kelly, Dimitris Papanikolaou, Amit Seru, and Matt Taddy. Measuring technological innovation over the long run. Working paper, 2017.
11. Jey Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. CoRR, abs/1607.05368, 2016.
12. Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. CoRR, abs/1405.4053, 2014.
13. Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211–225, 2015.
14. Ting Liu, Andrew W. Moore, Alexander Gray, and Ke Yang. An investigation of practical approximate nearest neighbor algorithms. In NIPS, 2004.
15. Qiang Lu, Amanda Myers, and Scott Beliveau. USPTO patent prosecution research data: Unlocking office action traits. https://ssrn.com/abstract=3024621, 2017.
16. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
17. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
18. Andreea Moldovan, Radu Boț, and Gert Wanka. Latent semantic indexing for patent documents. International Journal of Applied Mathematics and Computer Science, 15:551–560, 2005.
19. Marwa Naili, Anja Habacha Chaibi, and Henda Hajjami Ben Ghezala. Comparative study of word embedding methods in topic segmentation. Procedia Computer Science, 112(C):340–349, 2017.
20. Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. Unsupervised learning of sentence embeddings using compositional n-gram features. CoRR, abs/1703.02507, 2017.
21. Juan Ramos. Using TF-IDF to determine word relevance in document queries. 1999.
22. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. In ICIS, 2002.
23. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.
24. Kenneth Younge and Jeffrey Kuhn. Patent-to-patent similarity: A vector space model. https://ssrn.com/abstract=2709238, 2016.
25. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In