Customizing Contextualized Language Models for Legal Document Reviews

Shohreh Shaghaghian*, Luna (Yue) Feng*, Borna Jafarpour* and Nicolai Pogrebnyakov*†
*Center for AI and Cognitive Computing at Thomson Reuters, Canada
†Copenhagen Business School, Denmark
Emails: [email protected]
Abstract—Inspired by inductive transfer learning in computer vision, many efforts have been made to train contextualized language models that boost the performance of natural language processing tasks. These models are mostly trained on large general-domain corpora such as news, books, or Wikipedia. Although these pre-trained generic language models capture the semantic and syntactic essence of a language well, exploiting them in a real-world, domain-specific scenario still requires some practical considerations to be taken into account, such as token distribution shifts, inference time, memory, and their simultaneous proficiency in multiple tasks. In this paper, we focus on the legal domain and present how different language models trained on general-domain corpora can be best customized for multiple legal document reviewing tasks. We compare their efficiencies with respect to task performance and present practical considerations.
I. INTRODUCTION
Document review is a critical task for many law practitioners. Whether they intend to ensure that client filings comply with relevant regulations, update or re-purpose a brief for a trial motion, negotiate or revise an agreement, examine a contract to avoid potential risks, or review client tax documents, they need to carefully inspect hundreds of pages of legal documents. Recent advancements in Natural Language Processing (NLP) have helped automate the work-intensive and time-consuming review processes in many of these scenarios. Several requirements of the review process have been modeled as common NLP tasks such as information retrieval, question answering, entity recognition, and text classification (e.g., [1]). However, certain characteristics of legal-domain applications limit the deployment of these NLP methodologies. First, while electronic and online versions of many legal resources are available, obtaining the labelled data required for training supervised algorithms still demands resource-consuming annotation processes. Second, not only do legal texts contain terms and phrases that have different semantics when used in a legal context, but their syntax also differs from general language texts. Recently, sequential transfer learning methods [2] have alleviated the first limitation by pre-training numeric representations on a large unlabelled text corpus using variants of language modelling (LM) and then adapting these representations to a supervised target task using fairly small amounts of labelled data. Attention-based Neural Network Language Models (NNLM) pre-trained on large-scale text corpora and fine-tuned on an NLP task achieve state-of-the-art performance in various tasks [3]. However, due to the second limitation, directly using the existing pre-trained models may not be effective in legal text processing tasks. These tasks may benefit from some customization of the language models on legal corpora.

In this paper, we specifically focus on adapting the state-of-the-art contextualized Transformer-based [4] language models to the legal domain and investigate their impact on the performance of several tasks in the review process of legal documents. The contributions of this paper can be summarized as follows.

• While [5]–[7] have fine-tuned the BERT language model [8] on the legal domain, this study is, to the best of our knowledge, the first to provide an extensive comparison between several contextualized language models. Most importantly, unlike the existing works, it evaluates different aspects of the efficacy of these models, which provides NLP practitioners in the legal domain with a more comprehensive understanding of the practical pros and cons of deploying these models in real-life applications and products.

• Rather than experimenting with typical standalone NLP tasks, this work studies the impact of adapting the language models on real scenarios of a legal document review process. The downstream tasks studied in this paper are all based on the features of many existing legal document review products, which are the result of hours of client engagement and user experience studies.

The rest of this paper is organized as follows. We review the existing studies related to our work in Section I-A. In Section II, we describe the legal document review scenarios we study. We briefly overview the language models we use in Section III and present the results of customizing and employing them in the document review tasks in Section IV.
The concluding remarks are made in Section V.

A. Related Works

Large-scale pre-trained language models have proven to compete with or surpass state-of-the-art performance in many NLP tasks such as Named Entity Recognition (NER) and Question Answering (QA). There have been multiple attempts at transfer learning and fine-tuning of these models on English NLP tasks [9], [10]. However, these language models are trained on text corpora of general domains. For example, the Transformer-based language model BERT [8] has been trained on Wikipedia and BooksCorpus [11]. The performance of these general language models has not yet been fully investigated in more specific domains such as biomedical, finance, or legal. To use these models for domain-specific NLP tasks, one can repeat the pre-training process on a domain-specific corpus (from scratch or re-using the general-domain weights) or simply fine-tune the generic version for the domain-specific task. [12] pre-trained generic BERT on multiple large-scale biomedical corpora and called it BioBERT. They show that while fine-tuned generic BERT competes with the state of the art in several biomedical NLP tasks, BioBERT outperforms state-of-the-art models in biomedical NER, relation extraction, and QA tasks. The FinBERT [13] authors pre-train and fine-tune BERT, ULMFiT [14], and ELMo [15] for sentiment classification in the finance domain and evaluate the effects of multiple training strategies, such as language model weight freezing, different training epochs, different corpus sizes, and layer-specific learning rates, on the results. They show that pre-training and fine-tuning BERT leads to better results in financial sentiment analysis. BERT has also been pre-trained and fine-tuned on scientific data [16]; it provides superior results to fine-tuned generic BERT in multiple NLP tasks and sets new state-of-the-art results in some of them in the biomedical and computer science domains. One of the most relevant studies to our work is [5], in which BERT is pre-trained on a proprietary legal contract corpus and fine-tuned for an entity extraction task. Their results show that pre-trained BERT is faster to train for this downstream task and provides superior results. No other language models or NLP tasks were investigated in that work. Another relevant work is [6], in which the authors employ various pre-training and fine-tuning settings on BERT to fulfill classification and NER tasks on legal documents. They report that the BERT language model pre-trained on legal corpora outperforms the generic BERT, especially in the most challenging end-tasks (i.e., multi-label classification), where domain-specific knowledge is more important. In this paper, we present a more comprehensive investigation of Transformer-based language models in the legal domain by pre-training and fine-tuning multiple Transformer-based language models for a variety of legal NLP tasks.

II. LEGAL REVIEW TASKS
The formal and technical language of legal documents (i.e., legalese) is by nature often difficult and time-consuming to read through. A practical facilitating tool is one that navigates human reviewers to their points of interest in a single document or across multiple documents while asking for minimum input. Here, we have identified four main navigation scenarios in a legal document review process that can be facilitated by an automated tool. We elaborate on the main requirements that should be satisfied in each scenario and show how we model each task as an NLP problem. We explain the format of the labelled data we need as well as the learning technique we use to address each task. We eventually introduce a baseline algorithm for each task that can be used as a benchmark when evaluating the impact of using language models. Table I presents a summary of the details of each review task. Due to the complexities of legal texts, we define a snippet as a unit of text that is more general than a grammatical sentence (see Section III-E for an example of a legal text snippet). In our experiments, splitting the text into snippets is performed using a customized rule-based system that relies only on text punctuation.

| Review Task | NLP Task | ML Task | Data Format* | Baseline Algorithm |
|---|---|---|---|---|
| Information Navigation | Passage Retrieval | Binary Classification | question-snippet pairs | Siamese BiLSTM [17] |
| Fact Navigation | Named Entity Recognition | Sequence Labeling | tokens | 2-layer BiLSTM [18] (512 units) + dense layer |
| Comparative Navigation | Text Similarity | Multi-class Classification | snippet pairs | BiLSTM [18] |
| Rule Navigation | Sentiment Analysis | Binary Classification | snippets | XGBoost [19] |

TABLE I: Tasks in a legal document review scenario. *The format of the samples for the classifier.
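For illustration, a minimal punctuation-only splitter along these lines could look as follows. The regular expression here is a simplified stand-in for our rule-based system, not the actual rules:

```python
import re

def split_into_snippets(text: str) -> list:
    # Illustrative punctuation-only rule: break after sentence-final
    # punctuation or a semicolon when followed by whitespace and an
    # upper-case letter or an opening parenthesis.
    parts = re.split(r'(?<=[.;!?])\s+(?=[A-Z(])', text)
    return [p.strip() for p in parts if p.strip()]
```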
A. Information Navigation
Navigating users to the parts of the document where they can find the information required to answer their questions is an essential feature of any document reviewing tool. A typical legal practitioner is more comfortable posing a question in natural language than building proper search queries and keywords. These questions can be either factoid, like "What is the termination date of the lease?", or non-factoid, such as "On what basis can the parties terminate the lease?". Navigating the user to the answers of non-factoid questions is equivalent to retrieving the potentially relevant text snippets and selecting the most promising ones. This step can also be a prior step for factoid questions (see Section II-B) to reduce the search space for finding a short text answer.

We model this task as a classic passage retrieval problem with natural language queries, and investigate the factoid questions for which the user is looking for an exact entity as another task. Given a question $q$ and a pool of candidate text snippets $\{s_1, s_2, \ldots, s_N\}$, the goal is to return the top $K$ best candidate snippets. Both supervised and unsupervised approaches have been proposed for this problem [20]. We model it as a binary text classification problem where a pair of question and snippet $(q_i, s_i)$ receives the label 1 if $s_i$ contains the answer to question $q_i$ and 0 otherwise. We use the pairwise learning model proposed in [17] as a baseline algorithm for this module.
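A minimal sketch of this pair-classification formulation with the HuggingFace Transformers library is shown below. The classification head here is untrained and would in practice be fine-tuned on the labeled question-snippet pairs; the question and candidate snippets are illustrative placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)  # label 1: snippet answers the question

question = "On what basis can the parties terminate the lease?"
snippets = ["Either party may terminate this Lease upon 30 days notice.",
            "Tenant shall keep the Premises in good repair."]

# Each (question, snippet) pair is encoded as a single sequence with two
# segments, exactly like a sentence-pair classification problem.
inputs = tokenizer([question] * len(snippets), snippets,
                   padding=True, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits.softmax(dim=-1)[:, 1]
top_k = scores.topk(k=min(2, len(snippets)))  # indices of the best candidates
```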
B. Fact Navigation

As mentioned in Section II-A, there are scenarios in which a legal professional needs to extract a pre-defined set of facts from a legal document. Examples of these scenarios are proofreading a single document to make sure that it contains the right set of facts, or reviewing multiple documents to answer questions such as "What is the average of settlements awarded in a large set of court decision documents?". We model this task as a sequence labeling problem in which each token is assigned to one of a set of pre-defined classes using a multi-class classifier. For example, in a court decision document, the following set of classes may exist: Date of Argument, Date of Decision, Petitioner, Judge, Sought Damages, and Damages Awarded Monetary Values. This task is similar to the Named Entity Recognition (NER) task. However, in legal review scenarios, there are some additional challenges: (1) Facts in some cases might span a large piece of text as opposed to a few tokens. For example, the commencement date in a real estate lease can be described as "Commencing on the later of: June 1, 2018 or day on which Tenant enters into possession of the Premises (or on such earlier date that Tenant enters into possession of the Premises for the purpose of conducting its business therein)". (2) The task is even more context-dependent. For example, in a general NER problem, the Blackberry token might refer to a company in one sentence and to a fruit in another, and the two have to be distinguished. In a court decision ruling, the Blackberry token that refers to a company might take the role of Plaintiff, Defendant, or no role at all, depending on the context.
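Under this formulation, a token-classification head predicts one class per (sub)token. A minimal sketch follows; the BIO label set is an illustrative subset of our 26 fact classes, and the head shown here would be trained on the annotated tokens:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative subset of fact classes in a BIO scheme.
labels = ["O", "B-Petitioner", "I-Petitioner", "B-Judge", "I-Judge"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

snippet = "BlackBerry is permitted to file an additional supplemental brief."
inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=-1)[0]  # one class per sub-token
for tok, cls in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), pred):
    print(tok, labels[int(cls)])
```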
C. Comparative Navigation
Comparing two pieces of text is an integral part of many legal document review scenarios. Whether the goal is to identify the differences between an amended regulation and its original version, to discover the discrepancies of regulations in different jurisdictions, or to investigate a legal agreement for potential deviations from an industry template, a law practitioner often needs to compare two or multiple legal documents [21]. This task can be considered a classic text similarity problem. However, one of the main challenges in legal text comparison is that the same legal concept may be expressed in very different ways, complicating the application of lexical matching approaches such as tf-idf. For example, "Broker is entitled to a commission from the vendor" and "The seller shall pay the agent a fee" carry the same legal meaning, yet have practically no words in common. This makes the task well-suited for language modeling.

The problem is formulated as follows. We are given one document consisting of $N_r$ text snippets, another document of $N_t$ snippets, and a set of labels $L = \{l_1, l_2, \ldots\}$. The labels can be binary, such as $L = \{$match, no match$\}$, or multi-class, e.g., $L = \{$match, partial match, no match$\}$. We want to assign a label $l_{i,j} \in L$ to every snippet pair $(s_r^i, s_t^j)$, where $i \in \{1, \ldots, N_r\}$ and $j \in \{1, \ldots, N_t\}$. In the experiments of this paper, we focus on the binary label scenarios.
D. Rule Navigation

One of the main purposes of reviewing a legal document by a legal expert is to identify the indispensable rules imposed by deontic modalities. In legal documents, modalities are ubiquitously used for representing legal knowledge in the form of obligations, prohibitions, and permissions [22]. In practice, legal professionals identify deontic modalities by referencing the modal verbs, especially "would", "should", and "shall", that express obligatory and permissive statements. However, solely depending on the modal verbs to automatically identify the deontic modalities is error-prone for three main reasons: (1) It is difficult to quantify the semantic range between modalities [23]. For example, whether a person is allowed or obliged to do something carries different sentiment; (2) Modal verbs can have more than one function and may not indicate a deontic modality. For instance, in the sentence "Licence Agreements shall mean collectively, the Trademark Licence and the Technology Licence", the modal verb "shall" does not indicate any deontic modality; (3) The misuse of modal verbs in the documents adds further complexity to distinguishing real deontic modalities. Therefore, context is important for interpreting what the modal verbs mean and whether deontic modalities are present. This task can be considered a sentiment analysis problem.

To simplify the problem, we model it as a binary classification problem aiming to identify text snippets that contain obligations, particularly positive duties, from the full document, and consider the rest of the snippets as having no deontic modalities. Formally, given a document with text snippets $\{s_1, s_2, \ldots, s_N\}$, the binary classifier labels each snippet based on whether it imposes a positive duty for a party or not.
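The second pitfall above is easy to reproduce: a naive detector that flags any snippet containing a modal verb misclassifies the definitional sentence quoted earlier. The sketch below is only that naive keyword baseline, which the trained binary classifier is meant to improve upon:

```python
import re

MODALS = re.compile(r"\b(shall|should|would)\b", re.IGNORECASE)

def naive_deontic(snippet: str) -> bool:
    # Flags every snippet that merely contains a modal verb.
    return bool(MODALS.search(snippet))

# False positive: this "shall" is definitional, not an obligation.
print(naive_deontic("Licence Agreements shall mean collectively, the "
                    "Trademark Licence and the Technology Licence."))  # True
```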
III. LANGUAGE MODELS

In this section, we first briefly review the four Transformer-based language models we study in this work. Note that the language models we investigate in this paper consist only of the encoder of the Transformer architecture. Then, we elaborate on how we adapt them to the legal domain. Table II provides a summary of different characteristics of the language models studied in this work.
A. BERT
Proposed by [8], Bidirectional Encoder Representations from Transformers (BERT) aims to capture the context from both left-to-right and right-to-left directions when learning a representation for a text snippet. The input tokenization of BERT is WordPiece [24] and its architecture is a multi-layer bidirectional Transformer encoder. Transformer [4] is a sequence transduction model in which the recurrent layers are replaced with multi-headed self-attention (i.e., simultaneous attention to different parts of the sequence). Recurrent Neural Networks (RNNs) need to be unrolled in the order of the input. This constraint makes training and inference time-consuming for long sequences, especially in encoder-decoder architectures. By replacing the recurrent connections, more parallelism is achieved, since each node's output can be calculated solely from the previous layer's output (as opposed to RNN nodes, which need the output of the previous node in the same layer). Due to the lack of recurrence, positional embeddings are trained for each of the input positions. BERT's language model is simultaneously trained with two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). However, ablation studies in [8] and subsequent publications [25] show that NSP has minimal contribution to downstream tasks' performance metrics. In this work, we use bert-base-cased, trained by Google on English Wikipedia and the Toronto Book Corpus [11], as the general-domain version of the BERT language model.

| Model Name | Number of Parameters | Model Size on Disk | Number of Hidden Layers | Number of Tokens | Training Time* (hours/epoch) |
|---|---|---|---|---|---|
| BERT | 110 M | 416 MB | 12 | 28,996 | 7.12 |
| DistilBERT | 66 M | 252 MB | 6 | 28,996 | 7.02 |
| RoBERTa | 125 M | 501 MB | 12 | 50,265 | 7.87 |
| ALBERT | 12 M | 45 MB | 12 | 30,000 | 7.03 |

TABLE II: Comparison of some features of language models. The maximum window length is 512 and the training task is MLM for all four models. *Training times for BERT and DistilBERT are averaged over different tokenization and weight initialization methods.
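As a quick illustration of the MLM objective with bert-base-cased (the example sentence is ours, not from our corpus):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

text = "The Tenant shall pay the [MASK] on the first day of each month."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Predict the most likely token at the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```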
B. DistilBERT
DistilBERT [26] is a smaller and faster version of BERT, which is reported to achieve 97% of BERT's performance on GLUE [27] while reducing the size by 40% and the inference time by 60%. To train this language model, a technique called knowledge distillation, also referred to as teacher-student learning, is used, originally proposed in [28] and [29]. The goal of this technique is to compress a large model (i.e., the teacher) into a smaller model (i.e., the student) that can reproduce the behaviour of the larger model. In this teacher-student setup, the student network is trained by matching the full output distribution of the teacher network rather than by maximizing the probability of the correct class. Therefore, instead of training with a cross-entropy over the hard targets (i.e., the one-hot encoding of the correct class), the knowledge is transferred from the teacher to the student with a cross-entropy loss over the teacher's predicted probabilities, i.e., $L_{ce} = -\sum_i t_i \log(s_i)$, where $t_i$ and $s_i$ are the probabilities estimated by the teacher and the student, respectively. The final training objective is a linear combination of the distillation loss $L_{ce}$ and the supervised training loss, which is the MLM loss. In the process of distilling the BERT model, the student network uses the same architecture as BERT. The token-type embeddings and the pooler are removed, while the number of layers is reduced by a factor of 2. As the general-domain version of DistilBERT, we use distilbert-base-cased.
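A minimal sketch of this objective in PyTorch follows. The temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not the values used to train DistilBERT:

```python
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, mlm_loss,
                           T=2.0, alpha=0.5):
    # Soft-target cross-entropy L_ce = -sum_i t_i * log(s_i), where t_i are
    # the teacher's predicted probabilities rather than one-hot labels.
    t = F.softmax(teacher_logits / T, dim=-1)
    log_s = F.log_softmax(student_logits / T, dim=-1)
    l_ce = -(t * log_s).sum(dim=-1).mean()
    # Final objective: a linear combination of distillation and MLM losses.
    return alpha * l_ce + (1.0 - alpha) * mlm_loss
```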
C. RoBERTa

RoBERTa is a release of the BERT architecture that optimized the training regime and used a larger dataset for pre-training [25]. Its release was prompted by the observation that the original BERT model [8] followed a suboptimal training procedure. Specifically, the changes made in RoBERTa include (i) training only with the MLM objective (removing BERT's NSP objective), (ii) a larger batch size of 8,000 (compared to 256 in BERT), (iii) a higher learning rate, peaking at 4e-4 for the large model and 6e-4 for the base model (vs. 1e-4 for BERT), (iv) a byte-level vocabulary with 50K subword tokens (unlike BERT's character-level vocabulary of 30K tokens), which added 15M parameters to the base version of the model compared to BERT, and (v) a larger training dataset of 160 GB (compared to 13 GB in BERT), including BERT's training set and an additional 76-GB set compiled from news articles. With these modifications, RoBERTa is reported to outperform other language models trained after BERT's release, such as XLNet. Here, we use roberta-base, pre-trained by Facebook, as the general-domain version of the model.
D. ALBERT
Larger language models are shown to lead to better accuracy in many NLP downstream tasks. However, the size of these models is constrained by computation cost, training time, and GPU memory. To address these issues, [30] propose the A Lite BERT (ALBERT) language model, which has significantly fewer parameters than the original BERT model thanks to the following two techniques.

• Factorized embedding parameterization: BERT with vocabulary size $V$ and embedding size $E$ has $V \times E$ parameters in the embedding layer, which are sparsely updated during training. Instead of projecting one-hot encodings directly to the embedding layer, ALBERT first projects them into a smaller space of size $S$ and then into the embedding space of size $E$. Therefore, ALBERT's embedding layer has $V \times S + S \times E$ parameters, which can be significantly fewer than BERT's embedding parameters if $S \ll E$. The parameters of the embedding layer are also less sparse and are updated more frequently (see the sketch after this list for the parameter arithmetic).

• Weight sharing: All weights are shared between all layers of ALBERT. ALBERT also uses a slightly different loss function compared to BERT. In this new loss function, BERT's NSP task is replaced with the sentence order prediction (SOP) task. In the SOP task, positive examples are the same as in NSP, but negative examples are the same two sentences with their order reversed. The authors speculate that NSP is a much simpler task than MLM and that replacing it with SOP helps the model better learn natural language coherence.
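The parameter saving from the factorization is easy to see with BERT-base-like sizes; the bottleneck size $S = 128$ below is illustrative:

```python
V, E, S = 30_000, 768, 128           # vocabulary, embedding, bottleneck sizes

bert_embedding = V * E               # 23,040,000 parameters
albert_embedding = V * S + S * E     # 3,938,304 parameters, roughly 6x fewer
print(bert_embedding, albert_embedding)
```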
E. Legal Domain Language Models
Before being able to propose adjustments to the pre-trained language models, we need to understand how the language of a document written in legalese differs from plain English. In the rest of this paper, we focus on the language of legal agreements as the legal domain language and use a subset of 9,000 legal agreements from the publicly available corpus of the US Securities and Exchange Commission (SEC) as our domain-specific corpus. For all four language models explained above, we use the HuggingFace Transformers library [31] both for training the language models and for the downstream tasks. There are two main distinguishing features in the language of legal documents.

• Domain-Specific Terms: There are some terms and phrases that are unique to law, such as "fee simple" or "novation". However, regular words like "party" or "title" may have different semantics when used in a legal context compared to a general context. In addition, many old words and phrases such as "herein", "hereto", "hereby", and "heretofore", as well as some non-English words like "estoppel" or "habeas corpus", are often employed in legalese. The models trained on general-domain corpora have either never seen these terms or have captured their general semantics.

• Syntactic Structure: A single sentence written in legalese can have a very long and complex construction. For example, the language models pre-trained on general-domain corpora have barely seen complex syntactic structures like this: "In the event of any sale of such interest or transfer of such rights and upon the assumption, in writing, of the obligations of Landlord under this Lease by such assignee or transferee, Landlord herein named (and in case of any subsequent transfer, the then assignor) shall be automatically freed and relieved from and after the date of such transfer of all liability in respect of the performance of any of Landlord's covenants and agreements thereafter accruing, and such transferee shall thereafter be automatically bound by all of such covenants and agreements, subject, however, to the terms of this Lease; it being intended that Landlord's covenants and agreements shall be binding on Landlord, its successors and assigns, only during and in respect of their successive periods of such ownership."

These examples confirm the need for language models that can capture the specific syntactic and semantic features of the legal domain. Table III shows the probability distribution of the number of words in a sentence for a general-domain English corpus [32] and our legal corpus. We see that, using the same sentence splitting rule, there are on average 73 more words in a legal sentence.
IV. EXPERIMENTS

While all the experiments of this paper focus on the domain of commercial real estate agreements, the methodologies for defining the tasks, building the datasets, and customizing the models can be generalized to similar review scenarios in other domains and other types of legal documents. There are three main questions we try to answer in this paper: (1) What are different approaches to adapt a contextualized language model to the legal domain? (2) How do different language model architectures perform for different review tasks? (3) How to choose a model if the computational resources are limited?

• What are different approaches to adapt a contextualized language model to the legal domain?
To answer the first question, we study the impact of two main factors on the training of language models: tokenization and initial weights. We use Sentencepiece (https://github.com/google/sentencepiece) on our legal corpus to generate the same number of cased tokens as in bert-base-cased's token set (see Table II). We call these domain-specific tokens Legal Tokens, while we refer to bert-base-cased's original tokens as General Tokens. Only … of the tokens are common between Legal Tokens and General Tokens. While tokens like attorney, lease, and liability are common unbroken tokens in these two sets, other more domain-specific tokens such as lessor, lessee, and memorandum only exist in the legal set. We also use a hybrid version in which we add the 500 most frequent words in our legal corpus that do not exist as independent unbroken tokens in the set of General Tokens. This set is referred to as Hybrid Tokens. We limited the number of added tokens to 500 because of the overhead they add to the size of the embedding layer and therefore to the training time.

Table III shows the probability distribution functions of the number of tokens in the sentences of our legal corpus using these three tokenization approaches. As we see in the table, by switching from General Tokens to Legal Tokens, we do not see a significant change in the distribution of the number of tokens in sentences. That is due to the fact that the number of generated tokens is the same in these two tokenizations; however, the way a single sequence is tokenized differs between the two approaches. On the other hand, by adding the 500 most frequent tokens in the hybrid tokenization approach, the average number of tokens in sentences decreases. By comparing the tokenization examples, we see that the word contingency is among the top 500 most frequent words of the legal corpus, and Sentencepiece also captures it as a single token in the legal corpus. However, it is probably not a very frequently used token in the general corpus of bert-base-cased, and it is therefore broken into sub-tokens under the general tokenization approach.

| Corpus Domain | Tokenization | Average Tokens per Sentence | Tokenization Examples* |
|---|---|---|---|
| General | Words | 27 | ['attorney-in-fact', 'injunctive', 'self-insurance', 'contingency', 'Condominium'] |
| Legal | Words | 100 | ['attorney-in-fact', 'injunctive', 'self-insurance', 'contingency', 'Condominium'] |
| Legal | General Tokens | 142 | ['attorney', '-', 'in', '-', 'fact', 'injunctive', 'self', '-', 'insurance', …] |
| Legal | Hybrid Tokens | 120 | ['attorney-in-fact', 'injunctive', 'self-insurance', 'contingency', 'Con', …] |

TABLE III: Comparison of distributions of the number of tokens per sentence for different tokenization approaches in different domains, with examples. *Examples show how each tokenization approach tokenizes the sequence "attorney-in-fact injunctive self-insurance contingency Condominium". The table's per-row sentence-length histograms are not reproduced here.
The word Condominium is not among the top 500 tokens of the legal corpus; the tokenization based on Hybrid Tokens therefore breaks it into sub-tokens, while Sentencepiece still captures it as a single unbroken token when creating Legal Tokens. Finally, the hyphenated compound words self-insurance and attorney-in-fact are among the top 500 most frequent words in the legal corpus, but based on the rules of Sentencepiece, they are broken in both General Tokens and Legal Tokens.

Based on the two factors of tokenization and initial weights, we train five different versions of the BERT language model. For the general and hybrid tokenization approaches (in which the majority of tokens are general-domain tokens), we start the training both from the general-domain model weights published with the original papers (i.e., pre-trained initial weights) and from scratch (i.e., random initial weights). Figure 1 shows the moving average of the MLM training loss through 10 epochs of training the BERT language model on a p3.8xlarge AWS instance (i.e., 4 Tesla V100 GPUs with 64 GB of memory) with a batch size of 32 samples and a maximum length of 512 tokens. We used the Adam optimizer with a learning rate of …. As we see in the figure, the training loss saturates much faster when starting from the pre-trained weights. Also, comparing the GR, HR, and LR models, we see that adding more domain-specific tokens delays the saturation in loss.

Fig. 1: Training loss through pre-training of BERT with different tokenization and initial weights. Curves: General Tokens, Random Weights (GR); Hybrid Tokens, Random Weights (HR); Legal Tokens, Random Weights (LR); General Tokens, Pretrained Weights (GP); Hybrid Tokens, Pretrained Weights (HP).
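A minimal sketch of building an HP-style model (hybrid tokens, pre-trained weights) with the HuggingFace Transformers library is shown below; the four added words are examples standing in for the 500 most frequent unbroken legal words:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Stand-ins for the 500 most frequent legal-corpus words that are not
# already unbroken tokens in the general vocabulary.
frequent_legal_words = ["lessor", "lessee", "estoppel", "novation"]
tokenizer.add_tokens(frequent_legal_words)

# New embedding rows are randomly initialized; all pre-trained weights
# are kept, and MLM pre-training then continues on the legal corpus.
model.resize_token_embeddings(len(tokenizer))
```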
• How do different language model architectures perform for different review tasks?

To answer the second question, we compare the results of BERT, DistilBERT, RoBERTa, and ALBERT on our downstream document review tasks. We use the aforementioned five versions of BERT as the teacher to customize the DistilBERT language models. For RoBERTa and ALBERT, however, we only compare the base version (i.e., without any customization for the legal domain) with a single customized version using general tokens and starting from pre-trained weights.

As mentioned in Section III, we use publicly available legal agreements and human-annotated labels for each of the tasks described in Section II. The labelled dataset for each of the four legal review tasks is created by legal practitioners following detailed instructions on the annotation process. However, due to the complexity of the annotation tasks and limited resources (for example, it took a legal expert annotator on average 4 hours to annotate the snippets related to the passage retrieval task in one document), each instance is labelled by only one human, which prevents us from computing an inter-annotator agreement score. We remedied that by providing comprehensive instructions to annotators, engaging annotators in pre-task training, and performing pilot annotations whose results were reviewed with a senior law practitioner before proceeding with the full annotation. Table IV shows examples of data samples for each of the four tasks.

For the passage retrieval task (i.e., information navigation), we use 106 documents and have human legal experts annotate the answers to 31 questions in each one. In a single document, each question may have zero, one, or multiple (consecutive or non-consecutive) snippets as its answers. For each question, negative samples are generated by uniformly sampling 10 snippets from the snippets of the document that are not annotated as the answer to that question. Overall, we use a set of approximately 81,000 question-snippet pairs and compare the impact that different models have on the passage retrieval task based on the F1-score of the classifier.

In the text similarity task (i.e., comparative navigation), we have used the same documents used for the passage retrieval task and sampled pairs of text snippets from these documents. In order to reduce the search space for finding the matches, we limit ourselves to the pieces of text that have been annotated as the answer to the same question in the retrieval task. We have then asked human legal experts to label each pair as a match or not. The snippets of a relevant pair are the ones that address the same legal topic and a specific set of points within that topic, for example, a specific set of responsibilities of a party to the agreement. We build a dataset of approximately 1,500 snippet pairs. The distribution of labels in the data for this task is 40% in the positive class to 60% in the negative class. We present the F1-score as the metric for evaluating different models.

In the entity recognition task (i.e., fact navigation), we use data from 200 documents fully annotated by human experts. This dataset contains approximately 5,000 snippets whose tokens are annotated with at least one of the 26 entities of interest. A text snippet might contain multiple entities. Only 3% of the sentences contain an entity, and only 1% of tokens are part of an entity. We use 20,000 randomly sampled snippets as negative samples and report the token-level micro-averaged F1-score. We use the token-level F1-score as it provides a more fine-grained evaluation than entity-level evaluation. Figure 2 shows the distribution of tokens that are annotated as part of an entity among our 26 classes.

For the sentiment analysis task (i.e., rule navigation), we have human legal experts assign labels to approximately 3,000 text snippets randomly selected from our document corpus. If the snippet contains a positive duty statement, label 1 is assigned to it. Otherwise, the label is 0. After assigning human labels, 24% of the samples are labeled as 1.
We present the F1-score as the metric for comparing the various language models.

Table V reports the performance metrics of the document review tasks obtained by adding a linear layer as the sequence (or token, in the case of NER) classification head on top of the pooled (or hidden-state) output of different versions of the contextualized language models. The base version of each model corresponds to the general-domain language model published with the original paper, without any adjustment for the legal domain. For all four tasks, we split the datasets into training, validation, and test sets with ratios of 80%, 10%, and 10%, respectively. The reported results are the average over 3 different random splits. The standard deviations are reported in parentheses.
| NLP Task | Example | Label |
|---|---|---|
| Passage Retrieval | Question: Does the tenant have the right to challenge tax assessments? Snippet: If the Premises separately assessed, Tenant shall have the right, by appropriate proceedings, to protest or contest in good faith any assessment or reassessment of Taxes, any special assessment, or the validity of any Taxes or of any change in assessment or tax rate; provided, however, that prior to any such challenges must either (a) pay the taxes alleged to be due in their entirety and seek a refund from the appropriate authority, or (b) post bond in an amount sufficient to insure full payment of the Taxes. | 1 |
| Passage Retrieval | Question: Does the tenant have the right to challenge tax assessments? Snippet: Landlord also shall provide Tenant with a copy of the applicable Tax bill or Tax statement from the taxing authority. | 0 |
| Text Similarity | Snippet 1: The exercise of any remedy by either party shall not be deemed an election of remedies or preclude that party from exercising any other remedies in the future, except as expressly set forth herein. Snippet 2: Either party's acceptance of monies under this Lease following a Default by the other shall not waive such party's rights regarding such Default. | 1 |
| Text Similarity | Snippet 1: Provided Tenant has performed all its obligations, Tenant shall peaceably and quietly hold and enjoy the Premises for the Term, subject to the provisions of this Lease. Snippet 2: If either party elects to terminate this Lease as provided in this Section, this Lease shall terminate on the date which is 30 days following the date of the notice of termination. | 0 |
| Named Entity Recognition | MINUTE ORDER IN CHAMBERS - JUDGE ORDER PERMITTING PLAINTIFF TO FILE ADDITIONAL SUPPLEMENTAL BRIEFS REGARDING 35 USC 101 by Judge George H. Wu. BlackBerry is permitted to file an additional supplemental brief by April 23, 2020. | Plaintiff |
| Named Entity Recognition | Tenant will pay a security deposit of Nineteen Million Seven Hundred and Fifty Thousand Dollars ($19,750,000) (payable in cash or, as and to the extent set forth in Section 3.7.1, in the form of a letter of credit reasonably acceptable to Landlord) (the "Security Deposit"). | Form of Security Deposit |
| Sentiment Analysis | Snippet: If the Landlord so requires the Principal Rent shall be paid directly to the Landlord's bankers by bankers standing order. | 1 |
| Sentiment Analysis | Snippet: "Prescribed Rate" shall mean such comparable rate of interest as the Landlord reasonably determines. | 0 |

TABLE IV: Examples of labeled data for each of the 4 review tasks.

Fig. 2: Distribution of named entities' tokens among our 26 classes, ranging from 19.2% to 0.1% of the whole named entity tokens.

In each task, the same stopping criterion (e.g., early stopping based on the validation loss) is used for all models.

As we see in Table V, the relative performance of the different models depends on the task. However, this relative performance is more similar across the three sentence-level tasks (i.e., passage retrieval, text similarity, and sentiment analysis) than for the entity recognition task, which is a token-level one (see Table I). In these three sentence classification tasks, customizing the general BERT language model on the domain-specific corpus improved the performance. The improvement is on average … of the performance of the base version. In these three tasks, the highest performance of DistilBERT, RoBERTa, and ALBERT reaches, respectively, …, …, and … of the best performance of BERT on average. These models also seem to benefit from domain-specific language model customization in fewer cases. Moreover, by comparing the GR and LR versions of BERT and DistilBERT, we see that when starting from scratch, using legal tokens marginally improves the performance compared to using the default general-domain tokens. However, it still does not beat the impact of using pre-trained weights, considering the size of our corpus and the amount of language model training that we have performed (10 epochs). Even extending the pre-trained model with only some legal tokens (see the GP and HP versions) degrades the performance for most of the tasks. These observations are aligned with Figure 1 and reflect the fact that more training time is required to reach the same performance when training a model with a larger number of tokens. For the entity recognition task, the base language models seem to perform better in general.

| Model | Version | Tokens | Initial Weights | Passage Retrieval | Text Similarity | Entity Recognition | Sentiment Analysis |
|---|---|---|---|---|---|---|---|
| BERT | base | General | – | 0.83 (±.05) | 0.73 (±.04) | 0.46 (±.01) | 0.84 (±.01) |
| BERT | GR | General | Random | 0.84 (±.05) | 0.79 (±.05) | 0.21 (±.00) | 0.83 (±.02) |
| BERT | HR | Hybrid | Random | 0.84 (±.04) | 0.78 (±.04) | 0.33 (±.01) | 0.82 (±.02) |
| BERT | LR | Legal | Random | 0.85 (±.03) | 0.79 (±.02) | 0.32 (±.01) | 0.85 (±.04) |
| BERT | GP | General | Pre-trained | … (±.05) | … (±.02) | 0.42 (±.03) | … (±.01) |
| BERT | HP | Hybrid | Pre-trained | 0.85 (±.05) | 0.79 (±.04) | 0.45 (±.01) | 0.89 (±.02) |
| DistilBERT | base | General | – | 0.81 (±.06) | 0.76 (±.03) | … (±.03) | 0.88 (±.01) |
| DistilBERT | GR | General | Random | 0.82 (±.04) | 0.75 (±.04) | 0.42 (±.01) | 0.87 (±.03) |
| DistilBERT | HR | Hybrid | Random | 0.83 (±.04) | 0.73 (±.05) | 0.37 (±.04) | 0.87 (±.02) |
| DistilBERT | LR | Legal | Random | 0.83 (±.05) | 0.76 (±.03) | 0.32 (±.01) | 0.91 (±.02) |
| DistilBERT | GP | General | Pre-trained | 0.81 (±.04) | 0.74 (±.04) | 0.41 (±.03) | 0.89 (±.01) |
| DistilBERT | HP | Hybrid | Pre-trained | 0.82 (±.04) | 0.74 (±.03) | 0.34 (±.03) | 0.88 (±.03) |
| RoBERTa | base | General | – | 0.84 (±.05) | 0.79 (±.01) | 0.38 (±.00) | 0.89 (±.03) |
| RoBERTa | GP | General | Pre-trained | 0.84 (±.05) | 0.75 (±.02) | 0.36 (±.02) | 0.91 (±.03) |
| ALBERT | base | General | – | 0.82 (±.06) | 0.80 (±.03) | 0.33 (±.00) | 0.88 (±.03) |
| ALBERT | GP | General | Pre-trained | 0.80 (±.07) | 0.75 (±.04) | 0.31 (±.03) | 0.86 (±.03) |
| Baseline | – | – | – | 0.72 | 0.64 | 0.27 | 0.77 |

TABLE V: Performance metrics of document review tasks using different versions of language models. For each task, bold font shows the best performance among all models, while the underlined number corresponds to the best performance among different versions of one model.

• How to choose a model if the computational resources are limited?
The downstream performance metrics reflect the quality of the results we can get by using each of the models. However, other factors should be taken into account when implementing the models in a practical setup, such as model size, memory usage, and training and prediction times. To answer the third question, we report the training and inference times for all models. We have compared the language models based on their model size and number of training parameters in Table II. That table also shows the time it takes to train each language model architecture for one epoch on the same GPU machine with a fixed batch size. Table VI presents a summary of the training and prediction times for fine-tuning the network for each of the downstream review tasks. The training and prediction times are reported for the model with the best performance for each task, as reported in Table V. For each task, the GPU machine, learning rate, batch size, and stopping criteria are the same when training different models. We see that the training and prediction times are in general aligned with the number of parameters and layers reported for the different models in Table II.

| Model | Train: Passage Retrieval | Train: Text Similarity | Train: Entity Recognition | Train: Sentiment Analysis | Predict: Passage Retrieval | Predict: Text Similarity | Predict: Entity Recognition | Predict: Sentiment Analysis |
|---|---|---|---|---|---|---|---|---|
| BERT | 1.53 | 0.44 | 1.01 | 0.16 | 63 | 65 | 37 | 65 |
| DistilBERT | 1.71 | 0.26 | 0.35 | 0.03 | 32 | 32 | 20 | 33 |
| RoBERTa | 2.22 | 0.37 | 0.83 | 0.43 | 60 | 63 | 35 | 63 |
| ALBERT | 1.05 | 0.28 | 0.75 | 0.66 | 69 | 68 | 22 | 71 |

TABLE VI: Training times (hours per epoch) and prediction times (milliseconds per sample) for the downstream document review tasks.
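For reference, per-sample prediction time can be measured with a loop like the following sketch; the `mean_prediction_ms` helper is illustrative, not our exact measurement harness:

```python
import time
import torch

def mean_prediction_ms(model, inputs, n_runs=100):
    # Average forward-pass latency in milliseconds for one encoded sample.
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
    return (time.perf_counter() - start) / n_runs * 1000.0
```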
V. CONCLUSION
In this paper, we investigated how different contextualized Transformer-based language models can be employed to automate different tasks of reviewing legal documents. We elaborated on the distinguishing features of legal-domain texts and studied several strategies for adapting the language models to the legal domain. Rather than using standalone NLP tasks, we compared the overall performance of the models on real review scenarios. Our experiments show that while the token-level task performs better with the general-domain pre-trained models, the sentence-level tasks may benefit from some customization of the language models. We also reported other practical aspects of the models, such as memory usage and training and prediction times. As future work, we intend to investigate the performance of the same language models in a multi-task architecture where multiple document review tasks are co-trained.
VI. ACKNOWLEDGEMENTS
We would like to thank the Vector Institute (https://vectorinstitute.ai/) for the support during the NLP industry collaborative project.

REFERENCES

[1] O.-M. Sulea, M. Zampieri, S. Malmasi, M. Vela, L. P. Dinu, and J. Van Genabith, "Exploring the use of text classification in the legal domain," arXiv preprint arXiv:1710.09306, 2017.
[2] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 22, no. 10, pp. 1345–1359, 2009.
[3] K. Jing, J. Xu, and B. He, "A survey on neural network language models," arXiv preprint arXiv:1906.03591, 2019.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[5] E. Elwany, D. Moore, and G. Oberoi, "Bert goes to law school: Quantifying the competitive advantage of access to large legal corpora in contract understanding," in Workshop on Document Intelligence at the International Conference on Neural Information Processing Systems (NeurIPS), 2019.
[6] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, and I. Androutsopoulos, "Legal-bert: The muppets straight out of law school," arXiv preprint arXiv:2010.02559, 2020.
[7] R. Zhang, W. Yang, L. Lin, Z. Tu, Y. Xie, Z. Fu, Y. Xie, L. Tan, K. Xiong, and J. Lin, "Rapid adaptation of bert for information extraction on domain-specific business documents," arXiv preprint arXiv:2002.01861, 2020.
[8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019.
[9] M. E. Peters, S. Ruder, and N. A. Smith, "To tune or not to tune? adapting pretrained representations to diverse tasks," in Workshop on Representation Learning for NLP (RepL4NLP), 2019, pp. 7–14.
[10] W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin, "End-to-end open-domain question answering with bertserini," in Proceedings of the International Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 72–77.
[11] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 19–27.
[12] J. Lee, W. Yoon, S. Kim, D. Kim, C. So, and J. Kang, "Biobert: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics (Oxford, England), vol. 36, no. 4, pp. 1234–1240, 2020.
[13] D. Araci, "Finbert: Financial sentiment analysis with pre-trained language models," arXiv preprint arXiv:1908.10063, 2019.
[14] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2018, pp. 328–339.
[15] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018, pp. 2227–2237.
[16] I. Beltagy, A. Cohan, and K. Lo, "Scibert: Pretrained contextualized embeddings for scientific text," arXiv preprint arXiv:1903.10676, 2019.
[17] M. Tan, C. d. Santos, B. Xiang, and B. Zhou, "Lstm-based deep learning models for non-factoid answer selection," arXiv preprint arXiv:1511.04108, 2015.
[18] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
[19] T. Chen and C. Guestrin, "Xgboost: A scalable tree boosting system," in Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2016, pp. 785–794.
[20] T. Lai, T. Bui, and S. Li, "A review on deep learning techniques applied to answer selection," in Proceedings of the International Conference on Computational Linguistics (COLING), 2018, pp. 2132–2144.
[21] W. Alschner, "Sense and similarity: Automating legal text comparison," Computational Legal Studies: The Promise and Challenge of Data-driven Research, 2019.
[22] J. O'Neill, P. Buitelaar, C. Robin, and L. O'Brien, "Classifying sentential modality in legal language: a use case in financial regulations, acts and directives," in Proceedings of the International Conference on Artificial Intelligence and Law (ICAIL), 2017, pp. 159–168.
[23] J.-C. Verstraete, "Scalar quantity implicatures and the interpretation of modality: Problems in the deontic domain," Journal of Pragmatics, vol. 37, no. 9, pp. 1401–1418, 2005.
[24] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[25] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[26] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.
[27] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[28] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," stat, vol. 1050, p. 9, 2015.
[29] C. Buciluă, R. Caruana, and A. Niculescu-Mizil, "Model compression," in Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2006, pp. 535–541.
[30] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "Albert: A lite bert for self-supervised learning of language representations," in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[31] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Huggingface's transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[32] M. Guo, Z. Dai, D. Vrandečić, and R. Al-Rfou, "Wiki-40b: Multilingual language model dataset," in Proceedings of the Language Resources and Evaluation Conference (LREC), 2020.