Identification, Tracking and Impact: Understanding the trade secret of catchphrases
Jagriti Jalal, Mayank Singh, Arindam Pal, Lipika Dey, Animesh Mukherjee
Jagriti Jalal
IIT Kharagpur, [email protected]
Mayank Singh
IIT Gandhinagar, [email protected]
Arindam Pal
Data61, CSIRO and Cyber Security CRC, Sydney, New South Wales, [email protected]
Lipika Dey
TCS Innovation Labs, [email protected]
Animesh Mukherjee
IIT Kharagpur, [email protected]
ABSTRACT
Understanding the topical evolution in industrial innovation is a challenging problem. With the advancement in digital repositories in the form of patent documents, it is becoming increasingly more feasible to understand the innovation secrets – ‘catchphrases’ – of organizations. However, searching and understanding this enormous textual information is a natural bottleneck. In this paper, we propose an unsupervised method for the extraction of catchphrases from the abstracts of patents granted by the U.S. Patent and Trademark Office over the years. Our proposed system achieves substantial improvement, both in terms of precision and recall, against state-of-the-art techniques. As a second objective, we conduct an extensive empirical study to understand the temporal evolution of the catchphrases across various organizations. We also show how the overall innovation evolution in the form of introduction of newer catchphrases in an organization’s patents correlates with the future citations received by the patents filed by that organization. Our code and data sets will be placed in the public domain.
CCS CONCEPTS
• Social and professional topics → Patents; • Information systems → Data mining; Information extraction.
KEYWORDS
Patents; digital library;
JCDL ’20, August 1–5, 2020, Virtual Event, China. © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7585-6/20/06. https://doi.org/10.1145/3383583.3398512

1 INTRODUCTION
As software and other products are becoming more complex, the number and size of patent documents are increasing gradually. Automated patent document processing systems are essential to extract information and gain insights from this ever-increasing collection of patent databases. Catchphrases provide a concise representation of the content of a document. A catchphrase is a well-known word or phrase encapsulating the particular concept or subject of a document. Catchphrases contain all the important legal and technical aspects of a document, instead of just summarizing it. They have numerous applications such as document categorization, clustering, summarization, indexing, topic search, quantifying semantic similarity with other documents, and conceptualizing the particular knowledge domain of the document [12, 16]. However, since only a small minority of documents have author-assigned catchphrases, and manual assignment of catchphrases to existing documents is time-consuming, the automation of the catchphrase extraction process is highly desirable. In the current study, catchphrases represent innovation topics. Figure 1 presents example catchphrases from two different patent abstracts.

In this paper, we propose an unsupervised method for the extraction of catchphrases from the abstracts of patents granted by the U.S. Patent and Trademark Office over the years. The key contributions of this paper are as follows.
• We propose an unsupervised technique for catchphrase identification and ranking in patent documents.
• We conduct robust evaluations and comparison against several state-of-the-art baselines.
• As a secondary objective, we study the evolution of catchphrases present in the patents filed by various organizations over time.
• We bring forth some of the unique temporal characteristics of these catchphrases and show how these are correlated to the overall future citation count of the patents filed by an organization.
• The catchphrase evolution study further unfolds that companies get polarized based on whether their patent documents keep re-using the same catchphrases over time or introduce newer catchphrases as time progresses.
2 RELATED WORK
A variety of techniques have been applied for automated keyword extraction: locating important phrases by analyzing markups like capitalization, section headings and emphasized texts [17]; building a phrase dictionary by parts-of-speech (POS) tagging of word sequences [18]; thesaurus-based keyphrase indexing [29]; domain-specific keyphrase extraction [11, 23]; and several other supervised methods such as KEA [30], MAUI [22], back-of-the-book

ID: US06681004
Abstract: The telephone memory aid provides a database to a primary party for storing and retrieving personal information about a secondary party, including summary information related to communication exchanges between the primary and secondary parties. The summary information includes, for example, the date and time of prior telephone calls and the topics discussed. This secondary party information, including the summaries of prior telephone calls, is available for review by the primary party during future phone calls with the secondary party. The telephone memory aid also facilitates entry of information into the database through speech recognition algorithms and through question and answer sessions with the primary and secondary parties.
ID: US06680003
Abstract: The present invention concerns chiral doping agents allowing a modification to be induced in the spiral pitch of a cholesteric liquid crystal, said doping agents including a biactivated chiral unit at least one of whose functions allows a chemical link to be established with an isomerisable group, for example by radiation, said group possibly having a polymerisable or co-polymerisable end chain. These new chiral doping agents find application in particular in a color display.
Figure 1: Example abstracts from USPTO patents US06681004 and US06680003. The highlighted set of words are identified as catchphrases from IPC (described in Section 5).

indexing using catchphrase extraction [10], MAUI with text denoising [26], CSSeer [8], etc. In recent years, artificial neural networks (ANNs) are being used to build predictive models that rank words in a document [5] and then select keywords based on these ranks.

It has been widely recognized that the innovative capability of a firm is a critical determinant of its performance and competitive edge [3, 13, 15]. Since patents are a direct outcome of the inventive process and are broken down by technical fields, they are considered indicators of not only the rate of the innovative activities of a firm but also its direction [1, 2, 4]. Many previous studies have examined the relationships between the patenting activities of a company and its market value [14, 24]. Bornmann and Daniel [6] precisely review the citing behavior of scientists and show the role of citations as a reliable measure of impact. Cheng et al. [9] show that some indicators of patent quality are statistically significant predictors of return on assets. Lee et al. [19] assess future technological impacts by employing the future citation count as a proxy, while Lee et al. [20] employ various patent indicators, such as novelty and scope, as features of an ANN for the early identification of emerging technologies.
3 DATASETS
The current study requires a rich time-stamped dataset. We, therefore, leverage two independent data sources. These are:

(1)
The patent dataset: We compile the first dataset by crawling the full-text patent articles available at the United States Patent and Trademark Office (USPTO, https://bulkdata.uspto.gov/). It comprises patents granted weekly (on Tuesdays) from January 1, 2003, to May 18, 2018 (excluding images/drawings). The patents are available as XML-encoded files with English as the primary language. Out of all the curated documents, in this study we only consider those patents for which the abstract information is present (see Table 1 for statistics).

(2) The newsgroup corpus: We also use another data source, the
20 Newsgroups Dataset (https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups), donated by T. Mitchell in 1999. It includes one thousand Usenet articles each from 20 newsgroups like ‘alt.atheism’, ‘comp.graphics’, ‘talk.politics.guns’, etc. Approximately 4% of the articles are crossposted. This serves as a non-patent corpus to estimate the importance of a word specifically in the domain of the patents with respect to a non-patent domain (see Table 1 for statistics).

Patent
  Year range: 2003–2018
  Number of patents: 3,915,639
  Number of patents with abstract: 3,486,866
Newsgroup
  Year range: 1993–2017
  Number of articles: 19,997
  Number of words: –
  Language: English
Table 1: General statistics about the patent dataset and the newsgroup corpus. A large fraction (89%) of patents have abstract information.

Pre-processing: For both of the above, we performed several pre-processing tasks such as conversion of all text to lowercase, removal of special characters except apostrophes and periods, lemmatization, and removal of multiple white-spaces.
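A minimal sketch of the pre-processing described above; the exact regular expressions are our own illustration (the paper does not list them), and lemmatization is omitted since it requires an external NLP library:

```python
import re

def preprocess(text: str) -> str:
    """Lowercase the text, drop special characters except apostrophes
    and periods, and collapse multiple white-spaces. Lemmatization,
    also used by the authors, is omitted here."""
    text = text.lower()
    # Keep letters, digits, whitespace, apostrophes, and periods.
    text = re.sub(r"[^a-z0-9\s'.]", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("The  Telephone  Memory-Aid   (U.S. Patent)!"))
# the telephone memory aid u.s. patent
```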
4 PROPOSED METHODOLOGY
Catchphrase extraction is a challenging problem, mainly due to the diversity of patent text and the unavailability of large annotated datasets. We, therefore, present an unsupervised method for catchphrase extraction. We propose a two-stage extraction strategy that identifies relevant candidate catchphrases in a given patent article. In the first stage, we select the candidate catchphrases. This is followed by candidate catchphrase ranking in the second stage. Next, we describe the two stages in detail.
In the first stage, we select candidate catchphrases from each patent’s abstract. Empirically, we observe that all catchphrases are n-gram noun phrases, for example, unigrams (e.g., communication, dielectrometry), bigrams (e.g., consecutive bit, voice synthesizer), trigrams (e.g., integrated circuit device, hydrogen chloride gas) or quadrigrams (e.g., commercially available synthesis tool, electric signal processing board). We, therefore, perform part-of-speech (POS) tagging of each abstract text to identify noun phrases. Currently, we leverage Python’s state-of-the-art NLP library SpaCy (https://spacy.io/usage/linguistic-features). Note that we experimented with two text processing approaches before noun phrase identification: (i) with stopwords (WS), and (ii) without stopwords (WOS). WS represents that no stopwords were removed from the abstracts, whereas WOS represents that all stopwords in the abstract text were removed beforehand. Abstracts with stopwords (WS) led to better quality extraction results due to the existence of stopwords in noun phrases. We discuss the results in detail in Section 5. Table 2 presents statistics of extracted candidate phrases from the dataset.
n-gram      | Count
Unigrams    | 208,105
Bigrams     | 2,616,762
Trigrams    | 4,432,251
Quadrigrams | 2,138,696
Total       | 9,395,814

Table 2: Count of n-gram noun phrases generated from the patent dataset.
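To make the first stage concrete, the sketch below buckets noun phrases (for instance, the text of spaCy's `doc.noun_chunks` run over an abstract) into the four n-gram orders; the phrase list here is invented for illustration:

```python
from collections import Counter

def candidate_ngrams(noun_phrases):
    """Group candidate catchphrases by n-gram order. Only noun
    phrases of 1-4 tokens (unigram to quadrigram) are kept,
    matching the empirical observation above."""
    by_order = {n: Counter() for n in (1, 2, 3, 4)}
    for phrase in noun_phrases:
        n = len(phrase.split())
        if 1 <= n <= 4:
            by_order[n][phrase] += 1
    return by_order

phrases = ["voice synthesizer", "integrated circuit device",
           "communication", "voice synthesizer",
           "commercially available synthesis tool",
           "a very long noun phrase indeed"]  # 6 tokens: discarded
counts = candidate_ngrams(phrases)
print(counts[2]["voice synthesizer"])  # 2
```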
Candidate phrases obtained in the first stage are ranked in this stage. The ranking algorithm is based upon two empirical findings: (i) how well the phrase describes the document’s topic, and (ii) how specific the phrase is to the patent literature. Our proposed method unifies both of these findings by combining a frequency-based measure with an information-theoretic measure. Given a patent document d and a set of candidate phrases c_d obtained in the previous stage, we compute the phrase score PS(c, d) for each phrase c ∈ c_d:

  PS(c, d) = [ Σ_{i=1}^{|c|} log(score(t_i)) ] · KLI(c, d)        (1)

where t_i denotes the i-th term in an n-gram candidate phrase c, score(t_i) denotes the score of the i-th term, estimating the importance of the term specifically in the patent domain relative to a non-patent domain, and KLI(c, d) represents the Kullback-Leibler divergence informativeness specifying how well a candidate phrase c represents a document d. The term score(t) in the above equation is computed as

  score(t) = Importance(t, C_p) / (Importance(t, C_n) + 1)        (2)

where C_p and C_n represent the patent collection and the non-patent (in our case, the newsgroup) collection, respectively. The importance of a term t in a given collection C ∈ {C_p, C_n} is measured in terms of the collection frequency CF and the document frequency DF. CF(t, C) represents how many times the term t appeared in the entire collection C. DF(t, C) represents the count of documents where the term t appeared. It is computed as

  Importance(t, C) = CF(t, C) / (DF(t, C) + 1)                    (3)

KLI(c, d) denotes an information-theoretic measure to compute how informative the phrase is in the given document d. It is computed as:

  KLI(c, d) = (TF(c, d) / |d|) · log( (TF(c, d) / |d|) / (CF(c) / n) )        (4)

where TF(c, d) represents how many times c appeared in document d, and CF(c) denotes how many times c appeared in the entire patent collection C_p.
Here, |d| and n represent the total number of n-grams in document d and in C_p, respectively. The above scoring method results in a ranking of candidate phrases. We select the top-ranked candidates (top-5, top-10, top-20, etc.) and evaluate our unsupervised method in the next section.

5 EVALUATION
In this section, we describe the experimental settings, the baselines and the evaluation metrics. We construct a collection of possible catchphrases from the International Patent Classification (IPC) list. This list is maintained by the
World Intellectual Property Organization (WIPO). The IPC provides a hierarchical system of language-independent symbols for the classification of patents and utility models according to the different areas of technology to which they pertain. The hierarchy comprises eight high-level categories:

(1) Cat-1: Human necessities
(2) Cat-2: Performing operations; Transporting
(3) Cat-3: Chemistry; Metallurgy
(4) Cat-4: Textiles; Paper
(5) Cat-5: Fixed constructions
(6) Cat-6: Mechanical engineering; Lighting; Heating; Weapons; Blasting
(7) Cat-7: Physics
(8) Cat-8: Electricity

In each of these high-level categories, several sub-categories exist. An n-gram phrase represents each category. We term these phrases ground truth catchphrases (GTC). Overall, we obtained 22,855 GTC such as "actuators", "cleaning fabrics", "feedback arrangements in control systems", etc. We use GTC to evaluate our proposed catchphrase extraction method. Table 3 presents examples of GTC for each high-level category. Next, we present four state-of-the-art baselines.

(1)
Keyphrase extraction algorithm (KEA): KEA [30] is a supervised machine learning toolkit that extracts keyphrases and ranks them. The original algorithm was trained on scientific documents and uses a trained Naïve Bayes model. We trained KEA for patent documents leveraging a similar training procedure.

Category | Unigrams | Bigrams | Trigrams | Quadrigrams
Cat-1 | rhinoscopes | dental surgery | table service equipment | foodstuffs containing gelling agents
Cat-2 | thwarts | rivet hearths | making plough shares | making plastics bushes bearings
Cat-3 | riboflavin | septic tanks | acetone carboxylic acid | chromising of metallic material surfaces
Cat-4 | carding | carbon filaments | opening fiber bales | drying wet webs in paper-making
Cat-5 | collieries | suspension bridges | setting anchoring bolts | freezing for sinking mine shafts
Cat-6 | thermal | diesel engines | portable accumulator lamps | treating internal-combustion engine exhaust
Cat-7 | ozotypy | investigating abrasion | measuring electric supply | incineration of solid radioactive waste
Cat-8 | rheostats | electric accumulator | thermo magnetic devices | electric amplifiers for amplifying pulse

Table 3: Examples of ground truth catchphrases for each high-level category available in the International Patent Classification (IPC) list.

(2)
Legal: Mandal et al. [21] also follow an unsupervised approach for the identification of catchphrases from legal court cases. The scoring is done as:

  PS(c, C_p, C_np) = log[ Σ_{i=1}^{|c|} Score(t_i, C_p, C_np) ] · KLI(c, d)        (5)

where Score(t_i, C_p, C_np) and KLI(c, d) can be calculated using Equations 2 and 4, respectively. Note the change in the formula in Equation 5 compared to Equation 1. This modification, as we shall see, almost doubles our performance.

(3) KLIP: Tomokiyo and Hurst [27] and Verberne et al. [28] proposed a Kullback-Leibler (KL) divergence based phrase assignment score which is a linear combination of two different scores:

(a)
KL informativeness (KLI): KLI measures how well a candidate phrase represents a document. It is computed using Equation 4.

(b)
KL phraseness (KLP): The KLP score is computed specifically for multi-word phrases. It compensates for the low frequency of multi-word phrases by assigning higher weights to longer phrases:
  KLP(c, d) = (TF(c, d) / |d|) · log( (TF(c, d) / |d|) / Π_{i=1}^{|c|} (freq(t_i, d) / |d|) )        (6)

where t_i is the i-th term of the phrase c, and freq(t_i, d) is the frequency of the term t_i in document d.

(4) BM25: BM25 [25] is a well-known measure for scoring documents with respect to a given query. We use this function for assigning a score to an extracted candidate phrase c in a given document d. The scoring function is:

  score(c, d) = IDF(c) · ( TF(c, d) · (k_1 + 1) ) / ( TF(c, d) + k_1 · (1 − b + b · |d| / avgdl) )        (7)

where TF(c, d) is the term frequency of phrase c in the document d, and k_1 and b are free parameters. We choose k_1 ∈ [1.2, 2.0] and b = 0.75. IDF(c) is the inverse document frequency of the candidate phrase c, calculated as

  IDF(c) = log( (n − DF(c) + 0.5) / (DF(c) + 0.5) )        (8)

where DF(c) is the document frequency of the phrase c in the collection. We select these values as per previous literature [25].

Note that KEA is a supervised machine learning model, whereas Legal, KLIP and BM25 are unsupervised methods. We evaluate our proposed method against the four baselines. We use two standard evaluation measures: (i)
Macro precision, and (ii) Macro recall. These metrics are computed by macro-averaging the precision/recall values computed for every patent:

  Macro precision = ( Σ_{i=1}^{|T|} precision_i ) / |T|        (9)
  Macro recall = ( Σ_{i=1}^{|T|} recall_i ) / |T|             (10)

where precision_i and recall_i are the precision and recall values computed for the i-th patent in our test dataset T. The precision and recall values for the i-th patent are computed as follows:

  precision_i = DCP_i / DC_i        (11)
  recall_i = DCP_i / CPG_i          (12)

where DC_i, DCP_i, and CPG_i represent, respectively, the number of catchphrases in the i-th patent that are detected, that are detected and present in GTC, and that are present in GTC.

As KEA requires training, we partition our dataset into two splits: (i) train and (ii) test. The train split consists of 2,055,588 (65%) patent documents. The test split consists of 1,106,883 (35%) patent documents. For a fair comparison, we evaluate our proposed method against the baselines (described in Section 5.1) using only the test split.

Table 4 compares our proposed catchphrase extraction approach against the state-of-the-art baselines. We outperform all baselines by a substantially high margin. The second best system in terms of precision is KEA, whereas the second best system in terms of recall is a mix between KLIP and KEA. The baseline Legal performed worst among all the baselines, possibly because the authors take the logarithm of the sum of all the scores rather than the sum of the logarithms of the scores. The former measure undermines the contribution of the scores from each term and is therefore ineffective and rather unintuitive.

6 TEMPORAL ANALYSIS OF CATCHPHRASES
In this section, we intend to show the usability of catchphrase extraction. We claim that catchphrase evolution presents a fair understanding of the changing innovation trends of companies. We conduct several interesting temporal studies to understand the emergence of new research topics in the industry.
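Before turning to the temporal studies, the macro-averaged metrics of Equations 9-12 can be condensed into a short sketch; the per-patent counts below are hypothetical:

```python
def macro_precision_recall(per_patent):
    """Macro-averaged precision and recall (Equations 9-12).
    `per_patent` holds one (DC, DCP, CPG) triple per test patent:
    catchphrases detected, detected and present in GTC, and
    present in GTC, respectively."""
    t = len(per_patent)
    macro_p = sum(dcp / dc for dc, dcp, _ in per_patent) / t
    macro_r = sum(dcp / cpg for _, dcp, cpg in per_patent) / t
    return macro_p, macro_r

# Two hypothetical patents: (DC, DCP, CPG).
p, r = macro_precision_recall([(10, 5, 20), (4, 2, 4)])
print(p, r)  # 0.5 0.375
```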
In this study, we select the top-10 companies from three industrial segments: (1) Software, (2) Hardware, and (3) Mobile Phones, drawn from publicly available lists (e.g., https://en.wikipedia.org/wiki/List_of_the_largest_software_companies). Table 5 presents the list of top-10 companies in each of the above three segments. In subsequent sections, we analyze the patents filed by these companies over the years. In our patent dataset, each company can have several variations in name due to multiple research groups, geographical locations, subsidiaries, headquarters, etc. For example, IBM is present as ‘International Business Machines Corporation Armonk’, ‘International Business Machines Laboratory Inc.’, etc. We overcome these inconsistencies by manually annotating name variations. However, we claim that basic string matching techniques can easily automate this normalization. Besides, we eliminate frequently occurring catchphrases like ‘method’, ‘present invention’, etc., to ignore noisy/redundant signals. This filtering process was automated by removing catchphrases with the top-10 document frequencies. We next present how catchphrases can be leveraged in understanding the topical evolution of companies.

In this section, we study the topical evolution of companies. We leverage the
Jaccard Similarity (JS) between the catchphrases to compute the topical overlap between patents filed in consecutive years by a specific company. We conduct this experiment for the 11 years between 2006–2016. Figure 2 shows temporal profiles of a three-year moving average over JS for each of the three segments. We observe that Baidu in the Software segment, and Oppo, Vivo, and OnePlus in the Mobile Phones segment, exhibit relatively low similarity between catchphrases over the years. However, most of the companies have similarity curves with multiple peaks and an overall increase in the JS values over the years. For this analysis, we only considered 2-gram catchphrases. However, we found similar observations for higher n-gram catchphrases. If an organization is filing patents on the same topics over the years, the JS value will only increase; on the other hand, if an organization is continuously filing patents on newer topics, the JS value is expected to decline.
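The similarity profiles above can be sketched as follows; the yearly catchphrase sets are invented for illustration:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of catchphrases."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def yearly_similarity(by_year):
    """JS between catchphrase sets of consecutive years, i.e. the
    similarity profile of one company. `by_year` maps a year to the
    set of (e.g. 2-gram) catchphrases in that year's patents."""
    years = sorted(by_year)
    return {(y1, y2): jaccard(by_year[y1], by_year[y2])
            for y1, y2 in zip(years, years[1:])}

profile = yearly_similarity({
    2006: {"search engine", "client device"},
    2007: {"search engine", "mobile device"},
    2008: {"neural network", "mobile device"},
})
print(round(profile[(2006, 2007)], 3))  # 0.333
```

A three-year moving average over these consecutive-year values would then smooth the profile, as plotted in Figure 2.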
Further, we conduct a nuanced study to understand this temporal behavior. We classify each company’s similarity profile into five categories [7] based on the number and location of peaks. A peak in the similarity profile of a company represents a high topical similarity between consecutive years followed by a topical drifting-off period. We leverage the peak identification method proposed by Chakraborty et al. [7]. Note that peaks occurring in consecutive years are considered as a single peak. The categories are:

(1) MonInc: Similarity profile that monotonically increases. The peak occurs in the last year.
(2) MonDec: Similarity profile that monotonically decreases. The peak occurs in the first year.
(3) PeakInit: Similarity profile that consists of a single peak within the first three years but not the first year.
(4) PeakLate: Similarity profile that consists of a single peak after the initial three years but not in the last year.
(5) PeakMult: Similarity profile consisting of multiple peaks.
(6) Others: Similarity profiles that do not qualify for the above categories are kept in this category. They mainly consist of profiles with extremely low JS values for each year.

Table 6 shows the categorization results. We find no company in the MonDec and PeakInit categories. The majority of the companies are present in the PeakMult category, followed by the PeakLate category. Companies in the Others category have very few filed patents. Three out of four companies in the Others category are recently launched mobile companies.

Even though the PeakMult category consists of multiple peaks, we observe two distinct fluctuation patterns. We term these patterns (i) stable and (ii) unstable. In the stable pattern, the profile looks considerably less fluctuating, whereas the profile fluctuates highly in the unstable pattern. We quantify these fluctuation patterns by leveraging the average value of JS.
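A simplified sketch of this profile categorization; peak detection here uses strict local maxima (plus the endpoints for monotone profiles) rather than the full method of Chakraborty et al. [7], and merging of consecutive-year peaks is omitted:

```python
def classify_profile(js):
    """Assign a yearly JS profile to one of the categories above.
    `js` is the list of consecutive-year similarity values."""
    n = len(js)
    if all(js[i] < js[i + 1] for i in range(n - 1)):
        return "MonInc"   # peak in the last year
    if all(js[i] > js[i + 1] for i in range(n - 1)):
        return "MonDec"   # peak in the first year
    # Strict local maxima (interior points only).
    peaks = [i for i in range(1, n - 1) if js[i - 1] < js[i] > js[i + 1]]
    if len(peaks) > 1:
        return "PeakMult"
    if len(peaks) == 1:
        # Single peak: within years 2-3 -> PeakInit, later -> PeakLate.
        return "PeakInit" if peaks[0] <= 2 else "PeakLate"
    return "Others"

print(classify_profile([0.1, 0.1, 0.2, 0.5, 0.2]))  # PeakLate
```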
Given that JS(c) is the similarity profile for a company c, the average value of JS, avg_JS(c), is computed as:

  avg_JS(c) = ( min(JS(c)) + max(JS(c)) ) / 2

A company is placed in the unstable sub-category if its avg_JS exceeds a threshold (see Table 7).

Citations, in the scholarly world, determine the popularity of research papers/authors/organizations. Here, we adopt a similar analogy for patent articles. A patent citation is a document cited by an applicant, third party, or a patent office examiner because its content relates to a patent application. We compute the citation count of a patent p by summing the citations received by p. For the current study, we construct citer-cited pairs by extracting references present in patent texts and use these pairs to compute patent citation counts.

Next, we create multiple citation zones based on the citation count of a patent. We define four distinctive zones: (i) very low, (ii) low, (iii) medium, and (iv) high, to study the influence of the JS profile of a company on the number of citations received by its patents. Table 8 presents zoning statistics of the complete dataset. Out of 3,829,153 patent articles, 1,499,175 have a zero citation count.

Next, we relate similarity profiles and citation count zones. For each company, we measure the fraction of patents in different citation zones.
Table 4: Comparison of our proposed method against the baselines: precision and recall values at different top ranks (z) of extracted catchphrases. Columns: Our Model, KEA, Legal, BM25, KLIP (precision and recall).

Figure 2: Moving average of catchphrase similarity between consecutive years for Software (left), Hardware (center), and Mobile Phone (right) companies.

Software  | Hardware        | Mobile Phone
Microsoft | Apple           | Samsung
Google    | Samsung         | Apple
IBM       | IBM             | Huawei
Oracle    | Foxconn         | Oppo
Facebook  | Hewlett Packard | Vivo
Tencent   | Lenovo          | Xiaomi
SAP       | Fujitsu         | OnePlus
Accenture | Quanta Computer | Lenovo
TCS       | AsusTek         | Nokia
Baidu     | Compal          | LG

Table 5: Top-10 Software, Hardware, and Mobile Phone companies selected from three publicly available lists.

We leverage histograms as a visualization tool to conduct this study. In Figure 3, we observe that the fraction of patents in the Medium and High citation zones in the PeakLate category is relatively higher than in the MonInc category. This indicates that the introduction of diversity in topics over time helps in enhancing the future citations of the patents filed by a company.

Figure 4 compares the two subcategories of PeakMult. We observe that the fraction of patents falling under the Medium and High citation zones in the unstable category is relatively higher than in the stable category, implying that companies with high fluctuations in similarity profiles perform better in terms of receiving citation counts. A possible explanation is that companies with a relatively specialized research domain file patents which attract fewer citations than companies with a diversified research domain.
Category | Count | Names
MonInc   | 4  | Tencent, Samsung, Xiaomi, Lenovo
MonDec   | 0  | –
PeakInit | 0  | –
PeakLate | 6  | Facebook, TCS, Huawei, AsusTek, Foxconn, Compal
PeakMult | 12 | HP, SAP, Accenture, Nokia, Fujitsu, Quanta Computer, Microsoft, IBM, Oracle, Google, Apple, LG
Others   | 4  | Baidu, Oppo, Vivo, OnePlus

Table 6: Categorization of top-10 Software, Hardware, and Mobile Phone companies based on temporal catchphrase similarity profiles. No company was classified in the MonDec and PeakInit categories.
Lastly, we study the Others category in Figure 5. Quite surprisingly, we observe that the fraction of patents in the Medium and High citation zones in the Others category is relatively higher than in the rest of the categories described above in Figures 3 and 4.
In this section, we analyze the extent of usage of certain catchphrases (bigrams and trigrams) by a company. We rank the catchphrases based on document frequency, i.e., the number of patent documents a catchphrase is present in. Tables 9 and 10 show the top-10 bigrams for companies present in the stable and unstable groups, respectively. Tables 11 and 12 show the top-10 trigrams for the same companies. Last, Table 13 notes the top-10 bigrams and trigrams from the entire stable and unstable categories taken together. While the stable group is concerned more with computer systems, the unstable group is more about electronic device parts.

Company         | avg_JS | Category
Nokia           | 0.040  | stable
Fujitsu         | 0.085  | stable
Quanta Computer | 0.069  | stable
Microsoft       | 0.105  | stable
Accenture       | 0.040  | stable
SAP             | 0.048  | stable
Hewlett Packard | 0.084  | stable
LG              | 0.223  | unstable
Oracle          | 0.117  | unstable
Google          | 0.121  | unstable
Apple           | 0.197  | unstable
IBM             | 0.124  | unstable

Table 7: List of companies in PeakMult that are classified into stable and unstable sub-categories, along with the average value of Jaccard Similarity (avg_JS) used for categorization.

Category | Citation Count | Patent Count
Very Low | x = 0          | 1,499,175
Low      | 0 < x < –      | –
Medium   | – ≤ x < 25     | 840,461
High     | x ≥ 25         | 215,488

Table 8: Patent citation zones with distinct citation count ranges.

Figure 3: Citation count zones vs similarity profiles: Fraction of patents in PeakLate (left) and MonInc (right) category companies in each citation count zone.

Figure 4: Citation count zones vs similarity profiles: Fraction of patents in stable (left) and unstable (right) category companies in each citation count zone.

Figure 5: Citation count zones vs similarity profiles: Fraction of patents in Others in each citation count zone.
In this section, we study the catchphrase evolution of companies. As a popular visualization tool, we leverage word clouds. We create word clouds for each company between the years 2003–2016. Due to space constraints, in Figure 6 we only consider word clouds for one representative company from each of the stable, unstable, peaklate and moninc categories at three representative years. We claim that catchphrase evolution presents a fair understanding of the changing innovation trends of companies. Note that we consider only bigram catchphrases in this study. We can conduct a similar study for any company in different years.

In Figure 6a, we study the catchphrase evolution for Microsoft (a representative company in the stable group). We observe a shift from traditional topics such as client-server models, databases, and basic Web development (in 2003), toward full-fledged Web search and Internet technologies (in 2010). In 2016, the focus shifted to mobile devices and gesture identification. The above trends coincide with several product releases such as Bing (a search engine released in 2009) and Lumia (mobile phones released in 2015).

In Figure 6b, we study the catchphrase evolution for Oracle (a representative company in the unstable category). Oracle seems to have shifted its focus from traditional database topics like relational databases, queries, etc. (in 2003), toward the development of software as a service (SAAS) in 2010. In 2016, it continued to focus on services with a major emphasis on reliable authentication mechanisms in the cloud. These innovation trends resulted in several products like Oracle Cloud (a cloud computing service launched in 2016), Primavera (an enterprise project portfolio management software acquired by Oracle in 2008), etc.
The detailed word clouds for all companies in our dataset are available at http://tinyurl.com/y5ynhj9n.

Figure 6: Word clouds of the representative companies in different similarity-profile-based categories at three distinct years: (a) stable (Microsoft), (b) unstable (Oracle), (c) peaklate (Facebook), and (d) moninc (Samsung).

HP | MS | SAP | Accenture | Nokia | Fujitsu | Quanta Computer
print job | client device | business process | processing device | user interface | closed position | circuit board
one aspect | search result | application server | third party | communication device | inner surface | display panel
printing system | application program | application program | real-world environment | one embodiment | opposite side | second image
second set | user input | software application | mobile device | computer program | upper surface | one side
operating system | computing system | business application | invention concern | telecommunication system | longitudinal axis | second end
second side | search engine | system method | educational material | telecommunication network | another embodiment | battery module
second position | data store | data structure | computer-implemented method | access point | opposite end | second position
second portion | least portion | system software | communication network | data transmission | open position | one end
display device | subject matter | user input | synchronized video | least part | bottom surface | portable computer
present disclosure | client computer | business object | solution information | second device | side wall | power supply
Table 9: Bi-grams with the top 10 document frequency values in STABLE category.
IBM | Oracle | Google | Apple | LG
top surface | application server | user interface | integrated circuit | second electrode
operating system | operating system | present disclosure | second set | one side
storage device | data structure | one example | user input | common electrode
computer program | second set | system method | one example | light source
second set | software application | example method | first set | lcd device
computing system | one technique | content item | operating system | control information
another embodiment | source code | user input | another embodiment | display device
user interface | computer-implemented method | one processor | least portion | array substrate
drain region | another aspect | user device | host device | drain electrode
data structure | database object | subject matter | client device | washing machine

Table 10: Bi-grams with the top 10 document frequency values in the UNSTABLE category.
HP | MS | SAP | Accenture | Nokia | Fujitsu | Quanta Computer
storage area network | host operating system | first data object | dual information system | first base station | user's head | portable electronic apparatus
least one component | client computing device | one general aspect | telecommunication industry taxonomy | packet data network | first second portion | mobile communication device
fluid ejection assembly | mobile communication device | business process model | contact center representative | least one parameter | user's foot | second frequency band
first second set | user's interaction | second user input | contact center system | first network element | least one opening | third conductor arm
least one component | least one implementation | least one service | context-appropriate enforcing completion | user equipment due | thinning spraying irrigation | portable computer system
disclosed embodiment relate | distributed computing system | core software platform | location-based service system | wireless communication device | patient's body | second radiating element
least one surface | application program interface | least one attribute | cognitive educational experience | least one cell | least one side | blade server system
inkjet ink composition | one computing device | second data object | individualized learning experience | wireless communication system | least one aperture | service agent server
central processing unit | client computer system | one exemplary embodiment | user's comprehension | wireless communication device | storied index rating | printed circuit board
graphical user interface | wireless access point | related method system | object recognition analysis | second base station | usda hardiness zone | wireless communication device

Table 11: Tri-grams with the top 10 document frequency values in the STABLE category.
IBM | Oracle | Google | Apple | LG
first conductivity type | current result list | one search result | electronic device housing | light guide plate
field effect transistor | flexible extensible architecture | first search result | scrolling 3d manipulation | digital broadcasting system
second dielectric layer | computer program product | image sensor interface | intuitive hand configuration | second semiconductor layer
gate dielectric layer | distributed computing environment | disclosed subject matter | hand approach touch | liquid crystal cell
data communication network | graphical user interface | image search result | proximity-sensing multi-touch surface | light emitting diode
integrated circuit device | data storage system | client computing device | wireless communication circuitry | main service data
direct physical contact | data processing system | client computing device | antenna resonating element | image display device
second conductivity type | application programming interface | second computing device | computer readable medium | serving base station
database management system | database management system | mobile communication device | wireless communication system | light emitting diode
buried insulator layer | contention management mechanism | distributed storage system | wireless electronic device | first second electrode

Table 12: Tri-grams with the top 10 document frequency values in the UNSTABLE category.
Similarly, in Figure 6c, we study catchphrase evolution for Facebook (a representative company in the peaklate category). As Facebook started its operations in 2004, we present visualizations for three years: 2010, 2013 and 2016. The initial focus was to develop technical features like news feeds, membership, etc. In 2013, these trends shifted toward instant messaging aspects. In 2016, the catchphrases show a distinct innovation pattern of restricting and disclosing data availability. Facebook Messenger (introduced in 2011) is one of the products developed between 2011–2013 (https://en.wikipedia.org/wiki/Facebook).

We study Samsung as a representative company in the moninc category (see Figure 6d). Primarily, Samsung's major focus lies in traditional electronics innovation. Recent trends suggest an increased focus on mobile technologies such as user interfaces, display units, etc.

STABLE Bi-grams | STABLE Tri-grams | UNSTABLE Bi-grams | UNSTABLE Tri-grams
closed position | first second portion | top surface | second semiconductor layer
another embodiment | user's head | user interface | printed circuit board
opposite side | least one opening | second set | light guide plate
inner surface | least one side | operating system | light emitting diode
upper surface | user's foot | present disclosure | digital broadcasting system
one aspect | least one aperture | least portion | first semiconductor layer
longitudinal axis | patient's body | another embodiment | light emitting diode
open position | thinning spraying irrigation | system method | liquid crystal cell
second position | central processing unit | data structure | first second electrode
opposite end | storie index rating | computing system | serving base station

Table 13: Bi-grams and Tri-grams with the top 10 document frequency values.
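Tables 9–13 rank catchphrases by document frequency, i.e. the number of a company's patent abstracts in which an n-gram occurs at least once. The paper's exact tokenization is not specified, so the following is a hedged sketch; the function name `top_df_ngrams` and the sample documents are illustrative.

```python
from collections import Counter
import re

def top_df_ngrams(documents, n=2, k=10):
    """Rank n-grams by document frequency: the number of documents
    in which the n-gram occurs at least once."""
    df = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        # A set per document, so each n-gram is counted at most once per document.
        df.update({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})
    return [(" ".join(g), c) for g, c in df.most_common(k)]

# Toy example: "user interface" appears in both documents, so its
# document frequency (2) exceeds that of every other bigram (1).
docs = ["User interface design.", "The user interface."]
ranking = top_df_ngrams(docs, n=2, k=3)
```

Per-company tables like Table 9 then simply apply this ranking to each company's abstracts separately, with `n=2` for bigrams and `n=3` for trigrams.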
In this paper, we propose an unsupervised catchphrase identification and ranking system. Our proposed system achieves a substantial improvement, both in terms of precision and recall, against state-of-the-art techniques. We demonstrate the usability of this extraction by analyzing how topics evolve in patent documents and how these evolution patterns shape the future citation counts of the patents filed by a company.

In the future, we plan to extend the current work by developing an online interface for automatic catchphrase identification. We also plan to understand the influence of catchphrase evolution on a company's revenue.