Measuring Patent Claim Generation by Span Relevancy
PProceedings of the Thirteenth International Workshopon Juris-informatics (JURISIN 2019) hosted by JSAI-isAI2019
Workshop Co-ChairsMakoto NakamuraSatoshi TojoRaiosha Building, Keio University, Kanagawa, Japan,November 11-12, 2019 easuring Patent Claim Generation by Span Relevancy
Jieh-Sheng Lee [0000-0002-0990-6170] and Jieh Hsiang Department of Computer Science and Information Engineering National Taiwan University, Taiwan
Abstract.
Our long-term goal of patent claim generation is to realize "augment-ed inventing" for inventors by leveraging new Deep Learning techniques. We envision the possibility of building an "auto-complete" function for inventors to conceive better inventions in the era of artificial intelligence. In order to gener-ate patent claims with reasonable quality, a fundamental question is how to measure the quality. We tackle the problem from the perspective of claim span relevancy as a proof of concept. Patent claim language was rarely explored in the NLP field. In this work, we propose a span-based approach and a generic framework to measure patent claim generation quantitatively. In order to study the effectiveness of patent claim generation, we define a metric to measure whether two consecutive spans in a generated patent claims are relevant. We treat such relevancy measurement as a span-pair classification problem, follow-ing the concept of natural language inference. Technically, the span-pair classi-fier is implemented by fine-tuning a pre-trained language model. The patent claim generation is implemented by fine-tuning the other pre-trained model. Specifically, we fine-tune a pre-trained Google BERT model to measure the pa-tent claim spans generated by a fine-tuned OpenAI GPT-2 model. In this way, we re-use two of the state-of-the-art pre-trained models in the NLP field. Our result shows the effectiveness of the span-pair classifier after fine-tuning the pre-trained model. It further validates the quantitative metric of span relevancy in patent claim generation. Particularly, we found that the span relevancy ratio measured by BERT becomes lower when the diversity in GPT-2 text generation becomes higher.
Keywords:
Patent, Claims, Classification, Text Generation, BERT, GPT-2, NLI, NLG, NLP Admitted in New York and passed the USPTO patent bar exam. Currently a Ph.D. candidate focusing on Deep Learning for patents and an in-house patent counsel at Novatek Microelectronics Corp.
Introduction
Patent Law & Deep Learning
Patents are granted to inventions that meet three basic legal requirements in general: utility, novelty, and nonobviousness. Utility is the requirement that an invention must have a useful function of some kind. Novelty is the requirement that an invention must be substantially different from everything that has been published or known before. Nonobviousness is the requirement that an invention cannot have been obvi-ous to a person having ordinary skill in the art. From the perspective of NLP and Deep Learning, we hypothesize that the nonobviousness problem is a reinforcement learning problem between inventor/patent practitioner and patent examiner. The nov-elty problem is a search problem in essence. The utility problem is an NLI (Natural Language Inference) problem. In this paper, we focus on the utility perspective in the patent claim generation problem. Without a way to measure the likelihood of meeting the utility requirement, a patent claim generator may generate something novel and nonobvious but not useful at all.
Augmented Inventing
The ultimate goal of our research is an "augmented inventing" system. We envision an "auto-complete" use case in which, if an inventor is just contemplating and has no whole picture in mind yet, a function like patent claim generation may augment the inventor to conceive better inventions. For example, the interactive augmented-inventing system can suggest next words, phrases, claim spans or even new ideas based on user’s input . Such active learning between human and machine may open a window for both qualitative and quantitative analysis on augmented inventing. By measuring how the inventor responds to the system, it is possible to collect human annotations for supervised learning. In order to facilitate supervised learning in the future, it is essential to generate patent claims with reasonable quality for inventors to appreciate. This paper is a step toward such a direction. We measure the quality by span relevancy and assume that a suitable range of span relevancy means a reasonable quality. It is noted that the relevancy measurement is implemented in an unsupervised fashion. By doing so, we have a chance to combine both unsupervised learning and supervised learning in the future.
A Span-based Approach
In the NLP field, language modeling is the task of predicting what word comes next. Instead of working on word level, we propose a span-based modeling approach to predict what text span may come next. The text spans in this work are claim spans in patent claims. A patent claim defines the scope of the legal protection conferred by a patent. Most of the time a patent has several claims to define its scope. The reason why it might be possible to build a function to evaluate the utility requirement is that a granted claim is presumed to have met the utility requirement. It could be said that granted patents are human-annotated and possible for supervised learning. The prob-lem lies in how to identify and make use of such annotations. We identify two types of human annotation in patent claims: explicit and implicit. The explicit annotation is manifested by the dependency between an independent claim and a dependent claim. For example, a dependent claim such as “2. The method of claim 1, wherein….” defines a dependency between claim 2 and claim 1.
The im-plicit annotation is based on the property of element combination. In patent claim language, an invention could be decomposed into inventive elements and conceptual-ly, for a classification task, the order of the elements describing how they work col-lectively does not matter. The identification and boundary of an inventive element is, therefore, an implicit annotation. Such a property of being able to combine elements in different order is pretty unique, compared with other mainstream NLP research. Leveraging both the explicit and implicit of annotations is the reason why super-vised learning might be feasible for learning the utility requirement. Before building training datasets, a technical problem is how to identify the inventive elements in a patent claim. The format of patent claims provides an answer. A patent claim is re-quired to be a single sentence. Since it defines a technical scope to be protected, a patent claim is usually much longer than an ordinary sentence. Such an unusual length is a challenge to most inventors and even to patent practitioners. Therefore, it is common to split a claim into text spans. A claim span is a segment of claim text. For example, the claim 1 of US9229634B2 is divided into spans as
Fig. 1 . A claim span for readability is a suitable approximation of an inventive element. We assume such approximation sufficient for proof of concept in this work and leave finer approxima-tion to the future, such as training a neural network to split a longer span into shorter ones.
Fig. 1.
Spans in the ‘634 patent .4 Span-pair Classification
Based on the utility requirement, we treat two spans in a patent claim relevant to a useful function of some kind and relevant to each other. We further take such a rele-vance problem as a classification problem and the classification is binary: relevant or irrelevant. Our goal is to train a neural network to predict the relevancy between two claim spans. For example, in
Fig. 1 , the span 1-1, 1-2, 1-3 and 1-4 are relevant to each other. The span 2-1 and 2-2 are relevant to each other. In addition, since the claim 2 is dependent on the claim 1, the span 2-1 and 2-2 are relevant to all spans in the claim 1 too. We collect such intra-claim span pairs and inter-claim span pairs and build a dataset of relevant span pairs. Training a binary classifier needs both positive records and negative records. The relevant span pairs are positive records. As for negative records, we leverage the cur-rent patent classification system, such as CPC (Cooperative Patent Classification), and apply negative sampling to select claim spans from non-overlapped patent clas- ses. For example, the subclass labels of the ‘634 patent are G06F and H04M. The claim spans from patents without any of the same labels are sampled randomly. We assume that such claim spans would be negative records and suitable for building a dataset of irrelevant span pairs. After having both of the relevant or irrelevant span pairs, we formulate the span-pair relevancy problem as a sentence-pair classification problem. It is noted that patents with different subject matters, such as process, machine, manufacture, and composition of matter, may have different average span lengths or different average numbers of spans per claim. We avoid any manual feature engineer-ing and hypothesize that the neural network can learn different subject matters in one model well based on two assumptions: (1) The span boundaries are annotated by hu-mans for easier comprehension. Therefore, it should be feasible for the neural net-work to learn such comprehensible spans. It might also be possible that the neural network can estimate what kind of subject matter a patent claim is. (2) As mentioned in Section 1.3, it should be possible to train a neural network to calculate more fine-grained span boundaries. By doing so, the relation between spans in different subject matters should be similar from a data perspective. In our previous works [1] [2] and this work, we did not encounter any issue requiring different treatments for different subject matters. We leave the validation of the aforementioned assumptions to the future.
Patent Claim Generation with GPT-2
Initially, we tried patent claim generation at span level by using the span-pair classifi-er. We thought that, by ranking the relevancy with respect to all existing spans, it might be possible to provide candidate spans based on user's input. Unfortunately, it did not work. As a binary classifier, its relevancy ranking is polarized to binary results most of the time. The neural network cannot produce ranking results with finer granu- larity. We switched gear to the OpenAI GPT-2 by Radford et al. [3] for text genera-tion word by word. In order to use the span-pair classifier to check the generated patent claims by GPT-2, we add special span separators to patent claims. In this way, GPT-2 can generate patent claims with span separators which make it possible to split a generated patent claim into claim spans. We use the span-pair classifier to check whether two consecutive spans of the generated patent claim are relevant or not. For details in patent claim generation, please refer to our previous work [1]. In brief, our patent claim generation is based on the unconditional random sampling in GPT-2. The sampling is a top_k random sampling algorithm with a default k value 40. It means sorting by probability and zeroing out anything below the 40th token when sampling. The quality of generated text depends on the distribution of reasonable words in the top 40 tokens. If there are many words one could sample from reasona-bly, the quality would be higher. If there are only a few reasonable words to sample from, the quality would be lower. We experiment with different k values in this work. We also hypothesize that a higher k value will generate patent claims with higher diversity and the span-pair relevancy will, therefore, become lower. To our knowledge, our previous work [1] is the first to generate patent claims by transfer learning with Transformer [4]models. This work is the first to observe the correlation between the diversity of text generation by one Transformer and the span relevancy measured by the other Transformer model. Framework
We propose a generic framework for text generation and quality measurement based on Transformer architecture. In
Fig. 2 , on the right-hand side, the quality measure-ment is based on a fine-tuned Transformer Encoder. The Encoder is the BERT model by Devlin et al. [5]. BERT is a language representation model which stands for Bidi-rectional Encoder Representations from Transformers. We build the fine-tuned BERT by fine-tuning the pre-trained BERT model released by Google. In Section 3, we describe how to build a span-pair classifier based on BERT. On the left-hand side, the text generation is based on a fine-tuned Transformer Decoder. The Decoder is the GPT-2 model. We build the fine-tuned GPT-2 by fine-tuning the pre-trained GPT-2 model released by OpenAI. In Section 4, we describe how to work on GPT-2 in more details. By using different k values for the top_k function in GPT-2, we can generate patent claims with different randomness. The next step is to split the generated patent claims into span pairs. The span pairs are then fed to the fine-tuned BERT for measur-ing relevancy. In this framework, we treat the fine-tuned models as building blocks and replacea-ble. In fact, it is preferable but not necessary to use Transformer Decoder for text generation and use Transformer Encoder for quality measurement. A Transformer Encoder can be used for text generation. A Transformer Decoder can be used for quality measurement too. If the Encoder and the Decoder of the same Transformer are https://github.com/google-research/bert https://github.com/openai/gpt-2 o-trained, it is also possible to use the same Transformer for both text generation and quality measurement. Whether this kind of replacement is better is another research topic. At the moment of this writing, BERT and GPT-2 are the best available Encoder and Decoder, respectively. For example, using BERT for text generation is feasible by Wang and Cho [6], but it does not outperform GPT-2. There are several competing Transformer-based models emerging too, such as RoBERTa [7], MASS [8], XLNet [9] and ERNIE 2.0 [10]. This is the reason why we generalize our implementation from the perspective of a framework. Our current work is a baseline for benchmark-ing in the future. Build a Span-pair Classifier based on BERT
Sentence pair classification has been a type of problem for a long time. Conventional-ly the goal of the sentence pair classification is to predict whether the second sentence is an entailment, contradiction, or neutral. In this work, we simplify the prediction as a binary classification to entail relevancy or irrelevancy in terms of the utility re-quirement for patents. One recent approach to text classification problems is fine-tuning a pre-trained BERT model. Such an approach produced the state-of-the-art models for several sentence pair classification tasks, e.g. MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, and RTE. Therefore, we leverage this fine-tuning approach to build our span-pair classifier. Besides, such a fine-tuning approach applies to other text classification problems as well. For example, a new state-of-the-art result for patent classification based on BERT is produced in our previous work [2].
Fig. 2.
Framework of Text Generation & Measurement
Data Pipeline
A data pipeline of preprocessing is required in order to split the raw text of patent claims into span pairs. There are three stages in our data pipeline: (1) raw data collec-tion, (2) claim span identification, and (3) span combination. Although raw data is available on the USPTO Open Data Portal, we found it easier to leverage the Google Patents Public Datasets on BigQuery. A dataset based on SQL lowers the entry barri-er of data preparation. In the appendix, we share our SQL statements as a better way than sharing conventional raw data for two reasons: (1) Separation of concerns. If the shared raw data contains pre-processing or post-processing specific to a problem or solution, it will be harder for other researchers to reuse for different data processing. (2) Clarity and flexibility. A SQL statement is more precise and easier to revise. At the second stage, we split patent claim text into claim spans on a heuristic basis. Very often, patent practitioners use semicolon or comma to separate a long patent claim into multiple lines. The character return between lines was omitted in our que-ried data. Therefore, a convenient trick is to make use of such omissions to identify claim spans. Such a heuristic span identification might be not perfect, but this kind of data approximation is sufficient for proof of concept in our work. At the third stage, the span combination includes two types of data preparation: positive sampling and negative sampling. Generating relevant span pairs as positive records is more intuitive. If the claim is an independent claim having n1 spans, n1 * (n1-1) span pairs will be generated. If the claim is a dependent claim having n2 spans and depending on a claim having n3 spans, n2*n3 spans will be generated in addition to n2*(n2-1) spans. We skipped the scenario of multiple dependencies since it does not occur often. Generating irrelevant span pairs (span1, span2) as negative records is trickier. The number of records based on negative sampling is twice the number of records of posi-tive sampling because we define two types of negative sampling. Type (a) is to ran-domly pick a span2 which does not share any CPC Subclass label (e.g., "G06F") with the span1. Under the CPC hierarchy, a Section label is higher and more inclusive than a Subclass label. Based on this hierarchy, we design type (b) to pick another span2 which shares the same Section label (e.g., "G") with the span1 but does not share any Subclass label. We hypothesize that the two spans in type (a) could be farther away in tensor space and easier for a neural network to learn. By adding the type (b), the two spans could be closer but still irrelevant. This may make the neural network general-ize better. We leave the validation of this hypothesis to the future. One more consideration at the span combination stage is to avoid data explosion. In fact, a patent may have many claims and a claim may have many spans. In this work, we cap both the number of claims per patent and the number of spans per claim to 20. In addition, due to resource constraint, our datasets cover the first month of year 2013 to 2016 only. The number of span pairs in our datasets are 3,826,027 (2013), 3,257,617 (2014), 3,316,532 (2015) and 3,728,207 (2016) respectively, as shown in Table 1 . We found such datasets sufficient to show the stability of accuracy https://developer.uspto.gov/ https://console.cloud.google.com/bigquery?p=patents-public-data n our results. We leave broader coverages and different approaches of span combina-tion to the future. Method & Experimental Setup
According to Devlin et al. [5], a pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. The fine-tuning approach introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters. Computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune pre-trained models. In this work, we follow the transfer learning technique and leverage the released BERT-Base pre-trained model (Uncased: 12-layer, 768-hidden, 12-heads, 110M pa-rameters). We leave other bigger models such as the BERT-Large (340M parameters) to the future since the BERT-Base is sufficient for proof of concept. Our implementa-tion follows the same fine-tuning example and source code released in the BERT project. In order to build a vanilla baseline for future experiments to compare with, we reduce the number of training epoch from 3 to 1 and change the max_seq_length value from 128 to 256 to reflect the fact that claim spans are longer. All other hy-perparameters remain the same.
Results
We show the performance of the span pair classifier in
Table 1 . Each dataset contains both positive sampling (relevant span pairs) and negative sampling (irrelevant span pairs). The accuracy of evaluating relevant span pairs stays above 84% at least. The accuracy of evaluating both relevant and irrelevant span pairs stays above 92%. Such numbers are significant. Take row 3 as an example. The fine-tuning dataset belongs to year 2013 and, at each column, the testing datasets belong to year 2013, 2014, 2015 and 2016 respectively. After fine-tuned, the testing accuracy of dataset 2014 is 93.07% if covering positive and negative samplings. The testing accuracy is 87.19% if covering positive sampling only. The testing accuracies of dataset 2015 are 92.79% and 86.37% respectively. The testing accuracies of dataset 2016 are 92.74% and 86.24% respectively. We think that such accuracy is sufficient to measure text genera-tion in terms of span relevancy.
Table 1.
Evaluation Result (accuracy in percentage)
Test
Fine - tuning pos &