Towards Grad-CAM Based Explainability in a Legal Text Processing Pipeline
Łukasz Górski, Shashishekar Ramakrishna, Jędrzej M. Nowosielski
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw · Freie Universität Berlin · EY - AI Labs, Bangalore
Abstract.
Explainable AI (XAI) is a domain focused on providing interpretability and explainability of a decision-making process. In the domain of law, in addition to system and data transparency, it also requires (legal-) decision-model transparency and the ability to understand the model's inner working when arriving at the decision. This paper provides a first approach to using a popular image processing technique, Grad-CAM, to showcase the explainability concept for legal texts. With the help of adapted Grad-CAM metrics, we show the interplay between the choice of embeddings, their consideration of contextual information, and their effect on downstream processing.
Keywords:
Legal Knowledge Representation · Language Models · Grad-CAM · Heatmaps · CNN
Presented at the Workshop on EXplainable & Responsible AI in Law (XAILA) at the 33rd International Conference on Legal Knowledge and Information Systems (JURIX), 9-11 December 2020.

1 Introduction

Advancements in the domain of AI and Law have brought additional considerations regarding model development, deployment, updating and interpretability. This can be seen with the advent of machine-learning-based methods, which naturally exhibit a lower degree of explainability than traditional knowledge-based systems. Yet knowledge representation frameworks that handle legal information, irrespective of their origin, should cover the pragmatics or context around a given concept, and this functionality should be easily demonstrable.

Explainable AI (XAI) is a domain focused on providing interpretability and explainability to a decision-making process. In the domain of law, interpretability and explainability are more than dealing with information/data transparency or system transparency [1] (henceforth referred to as the ontological view). They additionally require (legal-) decision-model transparency, i.e. the ability to understand the model's inner working when arriving at the decision (the epistemic view). In this paper, we aim to present the system's user and architect with a set of tools that facilitate the discovery of the inputs that contribute to a convolutional neural network's (CNN's) output to the greatest degree, by adapting the Grad-CAM method, which originated in the field of computer vision. We adapt this method to the legal domain and show how it can be used to achieve a better understanding of a given system's state, to explain how different embeddings contribute to the end result, and to optimize this system's inner workings. While this work is concerned with the ontological perspective, we intend it as a stepping stone towards another, related perspective, where legally-based positions are connected with the explanation, thus providing the ability to explain the decision to its addressee. This paper addresses mainly the technical aspects, showing how Grad-CAMs can be applied to legal texts and describing the text processing pipeline, taking this as a departure point for deeper analyses in future work. We present this technical implementation, as well as the quantitative comparison metrics, as the main contribution of the paper.

The paper is structured as follows. The state of the art is described in Section 2. Section 3 describes the methodology, which includes the metrics used for results quantification. The architecture used for the experiments is described in Section 4. Section 5 describes the different datasets used and the experimental setup. The outcomes are described in Section 6. Finally, Section 7 provides conclusions and future work.

2 Related Work

The feasibility of using different embeddings, contextual (e.g. BERT) and non-contextual (e.g. word2vec), has already been studied outside the domain of law. In [2], it was found that the usage of more sophisticated, context-aware methods is unnecessary in domains where labelled data and simple language are present. As far as the area of law is concerned, the feasibility of using domain-specific vs. general embeddings (based on word2vec) for the representation of Japanese legal texts was investigated, with the conclusion that general embeddings have the upper hand [3].
The feasibility of using BERT in the domain of law has also been put under scrutiny. In [4] its generic pretrained version was used for embedding generation, and it was found that large computational requirements may be a limiting factor for domain-specific embedding creation. The same paper concluded that the performance of the generic version is lower when compared with law-based non-contextual embeddings. On the other hand, in [5], BERT versions trained on a legal judgments corpus (of 18000 documents) were used, and it was found that training on an in-domain corpus does not necessarily offer better performance compared to generic embeddings. In [6] contradictory conclusions were reached: the system's performance significantly improves when using BERT pre-trained on a legal corpus. Those results suggest that the introduction of XAI-based methods might be a condition sine qua non for a proper understanding of general language embeddings and their feasibility in the domain.

Grad-CAM is an explainability method originating from computer vision [7]. It is a well-established post-hoc explainability technique where CNNs are concerned; moreover, the Grad-CAM method has passed independent sanity checks [8]. Whilst it is mainly connected with the explanation of deep learning networks used with image data, it has already been adapted to other areas of application. In particular, a CNN architecture for text classification was described in [9], and there exists at least one implementation which extends this work with Grad-CAM support for explainability [10]. Grad-CAMs have already been used in the NLP domain, for (non-legal) document retrieval [11]. Herein we build upon this work and investigate the feasibility of using this method in the legal domain, in particular allowing for the visualisation of the context-dependency of various word embeddings.
Legal language is a special register of everyday language and deserves investigation on its own. The evolution of legal vocabulary can be precisely traced to particular statutes and precedential judgments, where it is refined and its boundaries are tested [12]. Many terms thus have a particular legal meaning and efficacy, and tools that can safeguard a final black-box model's adherence to the particularities of legal language are valuable.

Endeavours aimed at using XAI methods in the legal domain, similar to this paper, have already been undertaken recently. In [13] an Attention Network was used for legal decision prediction, coupling it with attention-weight-based highlighting of salient case text (though this approach was found to be lacking). The possibility of explaining BERT's inner workings has also been investigated by other authors, and it has been subject to static as well as dynamic analyses. An interactive tool for the visualisation of its learning process was implemented in [14]. Machine-learning-based evaluation of context importance was performed in [15]; therein it was found that accounting for the content of a sentence's context greatly improves the performance of a legal information retrieval system.

However, the results mentioned hereinbefore do not allow for a direct and easily interpretable comparison of different types of embeddings, and we aim to explore an easy plug-in solution facilitating this aim.
3 Methodology

We study the interplay between the choice of embeddings, their consideration of contextual information, and their effect on downstream processing. For this work, a pipeline for comparison was prepared, with the main modules being the embedder, the classification CNN and the metric-based evaluator. All the parts are easily pluggable, allowing for extendibility and further testing of different combinations of modules.

The CNN used in the pipeline was trained for classification. We use two different datasets for CNN training (as well as testing):

1. The Post-Traumatic Stress Disorder (PTSD) [16] dataset [17], where the rhetorical roles of sentences are classified.
2. The Statutory Interpretation - Identifying Particular (SIIP) dataset [15], where sentences are classified into four categories according to their usefulness for a legal provision's interpretation.

Whilst many methods have already been used for the analysis of the aforementioned datasets (including regular expressions, Naive Bayes, Logistic Regression, SVMs [17], or Bi-LSTMs [18]), we are unaware of papers that use (explainable) CNNs for these tasks. On the other hand, the usage of said CNN should not be treated as the main contribution of this paper, as the classification network is treated only as an exemplary application, warranting conclusions regarding the paper's main contribution, i.e. the context-awareness of various embeddings when used in the legal domain. Section 5.1 provides a detailed discussion of the considered datasets.
Further down the line, the embeddings are used to transform CNN input sentences into vectors, with the vector representation for each word in a sentence concatenated. Herein our implementation is based on prior work [9] [10].
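This step can be sketched as follows; the toy lookup table below stands in for a real word2vec or BERT embedder, and all names are illustrative rather than the pipeline's actual API:

```python
import numpy as np

def sentence_to_matrix(tokens, vectors, dim, max_len):
    """Stack per-token embedding vectors into a (max_len, dim) matrix,
    zero-padding (or truncating) to a fixed sentence length."""
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for i, tok in enumerate(tokens[:max_len]):
        vec = vectors.get(tok)  # unknown tokens stay all-zero
        if vec is not None:
            mat[i] = vec
    return mat

# toy lookup table standing in for word2vec/BERT output
vecs = {"the": np.array([0.1, 0.2]), "court": np.array([0.3, 0.4])}
m = sentence_to_matrix(["the", "court", "held"], vecs, dim=2, max_len=5)
```

The resulting matrix is what the 1D CNN consumes, one row per token.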
3.1 Metrics

Grad-CAM heatmaps are inherently visual tools for data analysis. In computer vision, they are commonly used for qualitative determination of the input image regions that contribute to the final prediction of the CNN. While they are an attractive tool for a qualitative analysis of a single entity, they should be supplemented with other tools for easy comparison of multiple embeddings [19] and to facilitate quantitative analysis. Herein the following metrics are introduced and adapted to the legal domain:

1. Fraction of elements above relative threshold t (F(v, t))
2. Intersection over union with relative thresholds t1 and t2 (I(v1, v2, t1, t2))

The first metric, F(v, t), is designed to measure the CNN network's attention spread over the words present in the given input, i.e. what portion of the input is taken into account by the CNN in the case of a particular prediction. It is defined as the number of elements in a vector that are larger than the relative threshold t multiplied by the maximum vector value, divided by the length of this vector.

The second metric, I(v1, v2, t1, t2), helps to compare the predictions of two different models given the same input sentence. It answers the question of whether two models, when given the same input sentence, 'pay attention' to the same or different chunk(s) of the input sentence. It takes as arguments two Grad-CAM heatmaps (v1 and v2), binarizes them using the relative thresholds (t1 and t2) and finally calculates the standard intersection over union. It quantifies the relative overlap of the words considered important for the prediction by each of the two models.

4 Architecture

The architecture, as shown in Fig. 1, is designed to implement the methodology described in Section 3 and comprises four main modules: the preprocessing module, the embedding module, the classification module and the visualization module. The pre-processing module uses industry de facto standard text processing libraries for spelling correction, sentence detection, irregular character removal, etc.
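The two metrics defined in Section 3.1 can be implemented directly from their definitions; a minimal sketch follows (function names are ours, not part of the described pipeline):

```python
import numpy as np

def fraction_above(v, t):
    """F(v, t): fraction of heatmap elements larger than t * max(v)."""
    v = np.asarray(v, dtype=float)
    return np.count_nonzero(v > t * v.max()) / v.size

def intersection_over_union(v1, v2, t1, t2):
    """I(v1, v2, t1, t2): binarize two heatmaps with their relative
    thresholds and return the IoU of the resulting masks."""
    m1 = np.asarray(v1, dtype=float) > t1 * np.max(v1)
    m2 = np.asarray(v2, dtype=float) > t2 * np.max(v2)
    union = np.count_nonzero(m1 | m2)
    return np.count_nonzero(m1 & m2) / union if union else 0.0
```

For example, `fraction_above([1.0, 0.2, 0.9, 0.1], 0.5)` yields 0.5, since two of the four elements exceed half of the maximum value.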
The embedding module houses a plug-in system to handle different variants of embeddings, in particular BERT and word2vec. The classification module houses a simple 1D CNN which facilitates an explainability method common in computer vision, i.e. Grad-CAM. The visualization module is used for heatmap generation and metric computation.

The output from the pre-processing module is fed into the embedding module. The embeddings used are based on variants of BERT and word2vec. In addition to the pre-trained ones, raw data from the CourtListener [20] dataset was used for the creation of our own embeddings.
Fig. 1.
System Architecture
Within the frame of the classification module, the output from the embedding module is fed into a 1D convolutional layer followed by an average pooling layer and fully-connected layers with dropout and softmax [9]. Although CNN architectures stem from computer vision, where an image forms the input of the network, the use of a CNN on a sequence of word vectors is reasonable. In a sentence, the relative positions of words convey meaning, much as the relative positions of pixels convey information in an image; the difference is dimensionality. A standard image is 2D while a sentence is a 1D sequence of words, therefore we use a 1D CNN for the task of sentence classification.

With the Grad-CAM technique it is possible to produce a class activation map (heatmap) for a given input sentence and predicted class. Each element of the class activation map corresponds to one token and indicates its importance in terms of the score of the particular (usually the predicted) class. The class activation map thus gives information on how strongly the particular tokens present in the input sentence influence the prediction of the CNN.

The software stack used for the development of this system was instrumented under Anaconda 4.8.3 (with Python 3.8.3). Tensorflow v. 2.2.0 was used for CNN instrumentation and Grad-CAM calculations (with the code itself expanding the prior implementation available at [10]). Spacy 2.1.8 and blackstone 0.1.15 were used for CourtListener text cleaning. Various BERT implementations and supporting code were sourced from Huggingface libraries: transformers v. 3.1.0, tokenizers v. 0.8.1rc2, nlp v. 0.4.0. Two computing systems available at ICM University of Warsaw were used for the experiments. Text cleaning was performed on the okeanos system (Cray XC40) and the main calculations were run on the rysy GPU cluster (4x Nvidia Tesla V100 32GB GPUs).
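For illustration, the core Grad-CAM computation for a 1D convolutional layer can be sketched as below, assuming the layer's activations and the gradients of the class score with respect to them have already been obtained from the framework (e.g. via TensorFlow's GradientTape). This is our own sketch of the standard formula [7], not the pipeline's released code:

```python
import numpy as np

def grad_cam_1d(feature_maps, gradients):
    """Grad-CAM heatmap for a 1D conv layer.

    feature_maps: (seq_len, n_filters) activations A of the conv layer
    gradients:    (seq_len, n_filters) d(score_c)/dA for the target class

    The channel weights are the position-averaged gradients; the weighted
    sum of activations is passed through ReLU and normalized to [0, 1].
    """
    alphas = gradients.mean(axis=0)               # one weight per filter
    cam = np.maximum(feature_maps @ alphas, 0.0)  # ReLU of weighted sum
    return cam / cam.max() if cam.max() > 0 else cam
```

Each element of the returned vector corresponds to one position in the token sequence, i.e. to one token of the input sentence.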
5 Experimental Setup

5.1 Datasets

As stated in Section 3, we use two different datasets for the experiments. The PTSD dataset comes from the U.S. Board of Veterans' Appeals (BVA) and covers the years 2013 through 2017. It deals with decisions from adjudicated disability claims by veterans for service-related post-traumatic stress disorder (PTSD) [16]. The dataset itself is well-known and has already been studied by other authors. It annotates a set of sentences originating from 50 decisions issued by the Board according to their function in the decision [17] [21] [22]. The classification consists of six elements:
Finding Sentence, Evidence Sentence, Reasoning Sentence, Legal-Rule Sentence, Citation Sentence, Other Sentence.

The SIIP dataset pertains to the United States Code 5 § 552a(a)(4) provision and aims to annotate the judgments that are most useful for the interpretation of said provision. The seed information for annotation was collected from court decisions retrieved from the Caselaw Access Project data. The sentences are classified into four categories according to their usefulness for the interpretation: High Value, Certain Value, Potential Value, No Value [15].
5.2 Embeddings

We use pre-trained models as well as train domain-specific models for the purpose of vector representation of texts. Many flavours of word2vec and BERT embedders were tested. The paper does not go into any details on the comparison of these pre-trained models (or other similar models) based on performance; this has been addressed in several other papers [23] [24] [14].

For word2vec, a (slimmed down) GoogleNews model was used, with a vocabulary of 300000 words [25]. In addition, Law2vec embeddings were also employed, which were trained on a large freely-available legal corpus, with 200 dimensions [26]. For BERT, the bert-base-uncased model was used, a transformer model consisting of 12 layers, 768 hidden units, 12 attention heads and 110M parameters. In addition to that, a slimmed-down version of BERT, DistilBERT, was also tried, due to its accuracy being on a par with vanilla BERT while offering better performance and a smaller memory footprint.

In addition to pretrained models, we have also tried training our own word2vec and BERT models. For this aim, the CourtListener [20] database was sourced. However, due to the large computational requirements of BERT training, a small subset of this dataset was chosen, consisting of 180MiB of judgments. Moreover, while several legal projects provide access to a vast database of US case law, it was found that the judgments available therein need to be further processed, as the available textual representations usually contain unnecessary elements, such as page numbers or underscores, that hinder their machine processing. Our hand-written parser joined hyphenated words and removed page numbers and artifacts that were probably introduced by OCR-ing; furthermore, the text was split into sentences using the spacy-based blackstone parser. In line with other authors [27], we have found it to be imperfect, failing to segment sentences that contained period-delimited legal abbreviations (e.g. Fed. - Federal). Thus it was supplemented with our own manually-curated list of abbreviations. The training was performed using the DistilBERT model (for ca. 36 hours), as well as word2vec in two flavours, 200-dimensional (in line with the dimensionality of Law2Vec) and 768-dimensional (in line with the BERT embeddings' dimensionality).

As far as the BERT-based embeddings go, there are a number of ways in which they can be extracted from the model. One of the ways is taking the embeddings for the special CLS token, which prefixes any sentence fed into BERT; another technique studied in the literature amounts to concatenating the model's final layer's values. The optimal technique depends on the task and the domain. Herein we have found the latter to offer better accuracy for downstream CNN training. The features for CNN processing consisted of tokenized sentences, together with the embeddings for the special BERT tokens (their absence would cause a slight drop in accuracy as well).

                                        F(·)            F(·)            F(·)
                                        Mean  StdDev    Mean  StdDev    Mean  StdDev
word2vec (GoogleNews)                   0.53  0.31      0.44  0.3       0.35  0.29
Law2vec                                 0.6   0.3       0.52  0.32      0.42  0.33
word2vec (CourtListener, 200d)          0.49  0.28      0.39  0.27      0.29  0.26
word2vec (CourtListener, 768d)          0.48  0.28      0.38  0.28      0.29  0.27
BERT (bert-base-uncased)                0.48  0.32      0.36  0.28      0.24  0.22
DistilBERT (distilbert-base-uncased)    0.67  0.27      0.56  0.27      0.38  0.24
DistilBERT (CourtListener)              0.47  0.39      0.47  0.39      0.44  0.39
Table 1.
Heatmap metric F for the PTSD dataset

                                                          I(·)           I(·)           I(·)
                                                          Mean StdDev    Mean StdDev    Mean StdDev
word2vec (GoogleNews) - BERT (bert-base-uncased)          0.49 0.25      0.41 0.24      0.3  0.21
Law2vec - BERT (bert-base-uncased)                        0.51 0.26      0.43 0.25      0.34 0.25
word2vec (CourtListener, 200d) - Law2Vec                  0.65 0.25      0.58 0.27      0.51 0.31
word2vec (CourtListener, 768d) - DistilBERT
  (distilbert-base-uncased)                               0.44 0.23      0.35 0.22      0.26 0.21
Table 2.
Heatmap metric I for selected pairs of embeddings for the PTSD dataset

6 Results

                                        PTSD   SIIP
word2vec (GoogleNews)                   0.7    0.9
Law2vec                                 0.69   0.85
word2vec (CourtListener, 200d)          0.78   0.93
word2vec (CourtListener, 768d)          0.79   0.94
BERT (bert-base-uncased)                0.84   0.94
DistilBERT (distilbert-base-uncased)    0.85   0.94
DistilBERT (CourtListener)              0.42   0.85

Table 3.

Test set accuracy.

Fig. 2.

A sample heatmap for a correct prediction with the word2vec (CourtListener, 768d) embedding

Fig. 3.

A sample heatmap for a failed prediction with the word2vec (CourtListener, 768d) embedding

Sample heatmaps can be referenced in Fig. 2 and Fig. 3, with a colorbar defining the mapping between colors and values. Fig. 2 clearly shows the area of the CNN's attention, which can be quantified further down the line. The picture shows a properly classified sentence, a statement of evidence, defined by the PTSD dataset's authors as a description of a piece of evidence. The CNN pays most attention to the phrase "medical records", which is in line with the PTSD authors' annotation protocols, where this kind of sentence describes a given piece of evidence (e.g. the records of testimony). We have found the sentence in Fig. 3 hard to classify ourselves; it prima facie seemed to us to be an example of an evidence sentence. In the case of the CNN, no distinctive activations can be spotted.

6.1 Metric-based comparison of embeddings

Yet, we did not perform any detailed analyses of such images. Instead, we focus on two types of comparison using the metrics defined in Section 3.1. The comparisons are designed to capture differences between embeddings, particularly in terms of context handling. First, for a given embedding we calculate the CNN network attention spread over words, quantified by the metric F(t) averaged over all input sentences contained in the test set. Then we can compare the mean fraction of words (tokens) in the input sentences which contribute to the prediction in the case of various embeddings. The criterion deciding whether a particular word contributes to the prediction is, in fact, arbitrary and depends on the class activation map (heatmap) binarization threshold. This is why we test a few thresholds, including 0.15 as suggested in [7] for weakly supervised localization. Essentially, a high value of the fraction F(t) indicates that most word vectors in the input sentence are taken into account by the CNN during inference. Conversely, a low value of the fraction F(t) indicates that most word vectors in the input sentence are ignored by the CNN during inference. The comparison results for the PTSD dataset are shown in Table 1 and Table 2 (the SIIP dataset was omitted for brevity and due to its similarity to the presented PTSD dataset). An outstanding similarity between word2vec and Law2Vec can be spotted in Table 2, as exhibited by the high value of the I metric, due to both of those models belonging to the same class.

6.2 White-listing experiment

The analysis of the heatmaps and metrics presented hereinbefore shows that only a part of a given sentence contributes to a greater extent to the final result. We have hypothesized that it is possible to reduce the CNN's input data to those important parts without compromising the final prediction. In this respect, Grad-CAM was treated as a helpful heuristic that allows the identification of the words most important for a given CNN in its training phase. For this experiment, the value of F, for the threshold of 0.15, was used to select a percentage of the most important words from each training example. This in turn was used to compose a vocabulary (or white-list) of the most important words encountered during training. Further down the line, this white-list was used during inference and only the words present on the list were passed as input to the CNN. Nevertheless, the number of white-listed words allowed coherent sentences to still be passed into the CNN (for example, the PTSD sentence "However, this evidence does not make it clear and", before white-listing, amounted to "However, this evidence does not make it clear and unmistakable."). We have managed to keep accuracy up to the bar of the unmodified dataset using this procedure (e.g. 0.7 for PTSD-word2vec (GoogleNews) and 0.85 for PTSD-DistilBERT (distilbert-base-uncased)).
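A simplified sketch of this white-listing procedure follows. It is our own illustrative variant, with assumed names: here the top fraction of tokens per heatmap is taken directly, whereas the paper selects words via the F metric at the 0.15 threshold:

```python
import numpy as np
from collections import Counter

def build_whitelist(examples, top_fraction=0.15):
    """Collect the most Grad-CAM-salient tokens over a training set.

    examples: iterable of (tokens, heatmap) pairs, where heatmap is a
    per-token Grad-CAM vector. For each example, the top `top_fraction`
    of tokens (by heatmap value) is added to the vocabulary.
    """
    vocab = Counter()
    for tokens, heatmap in examples:
        heatmap = np.asarray(heatmap, dtype=float)
        k = max(1, int(round(top_fraction * len(tokens))))
        for i in np.argsort(heatmap)[::-1][:k]:  # indices of the k largest values
            vocab[tokens[i]] += 1
    return set(vocab)

def filter_sentence(tokens, whitelist):
    """Keep only white-listed tokens before feeding the CNN at inference."""
    return [t for t in tokens if t in whitelist]
```

At inference time, `filter_sentence` is applied to each input sentence before embedding, mirroring the procedure described above.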
7 Conclusions and Future Work

We presented a first approach to using a popular image processing technique, Grad-CAM, to showcase the explainability concept for legal texts. A few conclusions which can be drawn from the presented methodology are:

- The mean value of F(t) is higher in the case of the DistilBERT embedding than in the cases of the word2vec and Law2vec embeddings. This suggests that a CNN trained and utilised with this embedding tends to take into account a relatively larger chunk of the input sentence while making a prediction.
- The described metrics and visualizations provide a peek into the complexity of the context handling aspects embedded in a language model.
- They enable a user to identify and catalog attention words in a sentence type for data optimization in downstream processing tasks.

Some issues which need further investigation are:

- Training of domain-specific models requires time and resources. Apart from algorithmic optimization, data optimization also plays an important role. An extension of this methodology can be used to remove tokens that do not contribute to the final outcome of any downstream processing task. A systematic analysis of the method presented in Section 6.2 is warranted.
- Mapping the metrics from our methodology to standard machine learning metrics could allow us to infer the quality of language models in a given domain (i.e. the legal domain). This would allow us to measure the quality of a model when there is not sufficient gold data for effective training (in line with the concept of semi-supervised learning).
- An extension of this approach could be used when validating the consistency of context in facts, and in turn the legal argument chain which is built based on these facts.
Acknowledgment
This research was carried out with the support of the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, under grant no GR81-14.
References
1. Adrien Bibal, Michael Lognoul, Alexandre de Streel, and Benoît Frénay. Legal requirements on explainability in machine learning. Artificial Intelligence and Law, pages 1-21, 2020.
2. Simran Arora, Avner May, Jian Zhang, and Christopher Ré. Contextual embeddings: When are they worth it?, 2020.
3. Linyuan Tang and Kyo Kageura. An examination of the validity of general word embedding models for processing Japanese legal texts. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts, Montreal, QC, Canada, June 21, 2019, volume 2385 of CEUR Workshop Proceedings, 2019.
4. Charles Condevaux, Sébastien Harispe, Stéphane Mussard, and Guillaume Zambrano. Weakly supervised one-shot classification using recurrent neural networks with attention: Application to claim acceptance detection. In JURIX, pages 23-32, 2019.
5. Julien Rossi and Evangelos Kanoulas. Legal search in case law and statute law. In JURIX, pages 83-92, 2019.
6. Emad Elwany, Dave Moore, and Gaurav Oberoi. BERT goes to law school: Quantifying the competitive advantage of access to large legal corpora in contract understanding. arXiv preprint arXiv:1911.00473, 2019.
7. Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336-359, Oct 2019.
8. Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 9505-9515. Curran Associates, Inc., 2018.
9. Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746-1751, Doha, Qatar, October 2014. Association for Computational Linguistics.
10. Grad-CAM for text. https://github.com/HaebinShin/grad-cam-text. Accessed: 2020-08-05.
11. Jaekeol Choi, Jungin Choi, and Wonjong Rhee. Interpreting neural ranking models using Grad-CAM. arXiv preprint arXiv:2005.05768, 2020.
12. Edwina L. Rissland, Kevin D. Ashley, and Ronald Prescott Loui. AI and law: A fruitful synergy. Artificial Intelligence, 150(1-2):1-15, 2003.
13. L. Karl Branting, Craig Pfeifer, Bradford Brown, Lisa Ferro, John Aberdeen, Brandy Weiss, Mark Pfaff, and Bill Liao. Scalable and explainable legal prediction. Artificial Intelligence and Law, pages 1-26, 2020.
14. Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. exBERT: A visual analysis tool to explore learned representations in Transformer models, 2019.
15. Jaromir Savelka, Huihui Xu, and Kevin D. Ashley. Improving sentence retrieval from case law for statutory interpretation. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, pages 113-122, 2019.
16. Victoria Hadfield Moshiashwili. The downfall of Auer deference: Veterans law at the Federal Circuit in 2014, 2015.
17. Vern R. Walker, Krishnan Pillaipakkamnatt, Alexandra M. Davidson, Marysa Linares, and Domenick J. Pesce. Automatic classification of rhetorical roles for sentences: Comparing rule-based scripts with machine learning. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Texts, Montreal, QC, Canada, June 21, 2019, volume 2385 of CEUR Workshop Proceedings, 2019.
18. S. R. Ahmad, D. Harris, and I. Sahibzada. Understanding legal documents: Classification of rhetorical role of sentences using deep learning and natural language processing. Pages 464-467, 2020.
19. David Krakov and Dror G. Feitelson. Comparing performance heatmaps. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 42-61. Springer, 2013.
20. Free Law Project. CourtListener, 2020.
21. Vern R. Walker, Ji Hae Han, Xiang Ni, and Kaneyasu Yoseda. Semantic types for computational legal reasoning: Propositional connectives and sentence roles in the veterans' claims dataset. ICAIL '17, pages 217-226, New York, NY, USA, 2017. Association for Computing Machinery.
22. Jaromír Savelka, Vern R. Walker, Matthias Grabmair, and Kevin D. Ashley. Sentence boundary detection in adjudicatory decisions in the United States. 2017.
23. Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203-7219, Online, July 2020. Association for Computational Linguistics.
24. Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019.
25. word2vec-slim. https://github.com/eyaler/word2vec-slim. Accessed: 2020-09-21.
26. Law2Vec: Legal word embeddings. https://archive.org/details/Law2Vec