An External Knowledge Enhanced Multi-label Charge Prediction Approach with Label Number Learning
Duan Wei and Li Lin
School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China
Abstract.
Multi-label charge prediction is the task of predicting the corresponding accusations for legal cases, and has recently become a hot topic. However, current studies handle the label number with rough methods: they manually set parameters to select the label number, which affects the final prediction quality. We propose an external knowledge enhanced multi-label charge prediction approach with two phases. One is a charge label prediction phase enhanced with external knowledge from law provisions; the other is a number learning phase with a specially designed number learning network (NLN). Our approach, enhanced by external knowledge, automatically adjusts the threshold to obtain the label number of law cases. It combines the output probabilities of samples and their corresponding label numbers to produce the final prediction results. In experiments, our approach is attached to several state-of-the-art deep learning models. Testing on the biggest published Chinese law dataset, we find that our approach improves all of these models. We further conduct experiments on the multi-label samples of the dataset. In terms of macro-F1, the improvement of baselines with our approach is 3%-5%; in terms of micro-F1, the improvement reaches 5%-15%. The experiment results show the effectiveness of our approach for multi-label charge prediction.
Keywords: multi-label charge prediction · label number learning · external knowledge

1 Introduction

In recent years, NLP (Natural Language Processing) has developed rapidly in many research tasks, such as text classification, NER (Named Entity Recognition), semantic labeling, reading comprehension, etc. With the development of the internet, the unit of the global data total has crossed from GB to ZB. Among this data, various valuable text corpora have greatly promoted the development of the NLP field and help to alleviate the defects of these tasks.

Legal charge prediction is a multi-label text classification task that learns from the case description, called the fact, to predict accusations (labels). The task uses the sample's complex text data to properly classify it into the appropriate labels (accusation types). A sample may have one label or multiple labels, and there are only subtle differences among the texts. For example, someone may have violated the two laws of theft and robbery through a series of criminal activities. Therefore, we need to understand the fact of a case, extract the description of criminals breaking the law, and finally classify the sample into the corresponding accusations. In public Chinese law datasets such as Cail2018, single-label samples are the most numerous. The number of training samples is 154,592, but there is a huge gap between the numbers of single-label and multi-label samples: multi-label samples account for only a small portion of the total, and the more labels a sample has, the fewer such samples there are. For example, there are 58 samples with 5 labels and only 1 sample with 9 labels. Multi-label legal charge prediction is therefore a difficult multi-label text classification task.

Many models perform well on text classification. For example, Kim proposed TextCNN [12], which synthesizes local text content from different receptive fields to capture the most important features of the text; Shi proposed CRNN [23], which captures the local information, global information, and interrelated associations of texts; Johnson proposed DPCNN [10], which obtains more efficient and widely available global information while effectively capturing local information; Vaswani proposed multi-head attention, which finds the degree of correlation between the text and the label [25]; and GRU and LSTM [4, 9] better capture longer-distance text dependencies. We argue that, for the legal charge prediction task, these models can be improved by external knowledge from law provisions and by a suitable way to decide label numbers.

Firstly, all of these models only learn the logic between the content of the case and the accusations; they ignore the correlation between the content of the case and the law provisions of the corresponding accusations. If a model already has such prior knowledge, namely the law provisions, it can more easily learn which case is more relevant to which accusations, better understand the content of the case, and finally make more accurate predictions.

Secondly, these models only output the probability of a sample belonging to each label.
However, there are only two widely used strategies to map the probabilities to an output label number:

(1) Top-k strategy: for the output probabilities of each sample, it selects the k labels with the highest probability as the final prediction result;

(2) Threshold strategy: it selects a value as the threshold, and all labels whose probability is greater than the threshold form the final prediction result.

These two strategies have some disadvantages (both are sketched in code at the end of this section):

(1) We need to analyze the distribution, maximum, and median of the entire set of output probabilities; only after understanding the data can we tune the parameters to obtain the optimal result. This takes a lot of time and effort;

(2) For different models or datasets, the output probability distributions differ, so a threshold or k value must be selected for each of them separately, which again takes time and effort;

(3) No matter how the parameters are selected, these two strategies cause certain errors, and the errors increase greatly for the multi-label task.

Therefore, we propose an external knowledge enhanced multi-label charge prediction approach with automatic label number learning. Meanwhile, we propose an efficient network structure called the number learning network (NLN). For this legal text classification problem, we use external knowledge, the law provision corresponding to each accusation, to assist various deep learning models in understanding legal texts. The output probabilities are then mapped to a probability distribution over label numbers through the NLN. Finally, combining the two yields better results than using popular deep learning models alone. The key component of the network is a special embedding layer that adaptively adjusts the threshold of each corresponding label probability through back propagation. After attaching this network to any deep learning model, the whole model is still end-to-end and effectively solves the above problems.

In this paper, we give an overview of legal text processing, multi-label text classification, the memory mechanism, and label number learning in Section 2. We propose an external knowledge enhanced end-to-end multi-label charge prediction approach with automatic label number learning and a number learning network (NLN); this network can be combined with various deep learning models for multi-label legal charge prediction, and the details of the approach are elaborated in Section 3. For the experiments, we apply six deep learning methods with NLN to the multi-label samples and our full approach to the whole dataset. Our approach brings a large improvement to these models, and the gain is especially significant on the multi-label samples; the results also show that adding label number learning alone improves model performance. The details of the experiments are elaborated in Section 4. Section 5 concludes our work and discusses future work.
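To make the two conventional strategies above concrete, the following minimal sketch (toy probabilities; the k and threshold values are arbitrary) shows that each requires a manually chosen parameter that must be re-tuned per model and dataset:

```python
import numpy as np

def top_k_labels(probs, k=2):
    """Top-k strategy: always return the k highest-probability labels."""
    return np.argsort(probs)[::-1][:k]

def threshold_labels(probs, threshold=0.5):
    """Threshold strategy: return every label whose probability exceeds the threshold."""
    return np.flatnonzero(probs > threshold)

probs = np.array([0.7, 0.1, 0.65, 0.2])   # toy output probabilities for 4 labels
print(top_k_labels(probs, k=2))           # [0 2], regardless of how confident they are
print(threshold_labels(probs, 0.5))       # [0 2], but the set shifts if the threshold moves
```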
2 Related Work

2.1 Legal Text Processing

With the development of deep learning technology, NLP has been applied in many fields. Because of the form of legal text and the large amount of available data, NLP for the law field has become a hot topic. Luo adopted an attention-based neural network supplemented with relevant law articles [14, 32, 33]; some researchers modeled the logical relationship among decisions on applicable law articles, charges, fines, and terms of penalty [38]; some attempted a hybrid approach to summarize legal case reports [6]; and some compared policy differences through embeddings of the entire legal corpus of each institution [19]. More and more researchers are paying attention to legal text and exploiting deep learning models [2, 24, 30]. However, for the classification of legal text, current research is still far from real application.
2.2 Multi-label Text Classification

In the direction of text classification, many models have been proposed and have achieved quite good results, making outstanding contributions to the development of this field. Kim proposed an attention-based classifier for multi-label emotion classification [11, 26]; Yang applied a sequence generation model with a novel decoder structure that exploits the correlations between labels [28, 29, 35]; various attention mechanisms [1, 15, 22, 25] created for NLP have been applied to text classification by others; new modules have been proposed that can speed up training [7, 8]; Zhang achieved text classification by exploiting the correlation between data of different tasks [37]; and [3] explored the influence of different semantic embeddings on multi-label text classification. Various proposed embedding approaches for document representation can also be applied to multi-label text classification [5, 16, 20, 21, 27]. Most researchers put their energy into complex or novel deep learning models for this problem; however, few consider label number learning.
2.3 Memory Mechanism

Existing models have weak storage capacity: they cannot store too much information and easily lose semantic information, so memory networks memorize information in external storage. Facebook AI proposed a memory mechanism to enhance the memory of a network, and later solved the problem that it could not be trained end-to-end [31]. Follow-up studies enlarged the memory and proposed new dynamic memory network structures [17, 18]. We therefore introduce external knowledge through this memory mechanism to enhance the classification ability of the model.
2.4 Label Number Learning

Label number learning is to learn the label number on the basis of the probabilities predicted by a deep learning model. There are two common strategies for this problem: 1. select an appropriate global threshold, so that every label whose predicted probability is greater than it becomes a predicted label; 2. directly fix the number of labels per sample, such as top-k. However, each has its own defects. Yang summarized the whole process of multi-label classification and argued that the threshold over ranked lists needs to be learned and optimized on a per-instance basis [36]; Lenc used a simple neural network that solves the problem by top-k [13]. These papers raise the question of whether there exists a neural network that can learn the label number. If it exists, we can attach it after any deep learning model in the general prediction approach to enhance the performance of the model.
3 Our Approach

Charge prediction is the task of predicting accusations by analyzing the case description, called the fact, in Cail2018. As Figure 1 shows, a fact is a series of precise descriptions of the case and of the criminal behavior. After segmentation of the fact, we define the input as $T_i = \{t_{i1}, t_{i2}, t_{i3}, \ldots, t_{ij}\}$, where each $t_{ij}$ is a word. In Figure 1, the deep learning model analyzes the fact, finds the text most relevant to the accusations (the highlighted part), learns its association with each accusation, and finally predicts that the accusations of the sample are arson and intentional injury. We want the output to be $R_i$, where each $R_i$ is a list, such as $R_i = [1, 0, \ldots, 1, \ldots]$, whose nonzero positions indicate the predicted accusations, here arson and intentional injury.
Fig. 1. Example of charge prediction
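As a minimal illustration of this formulation (the tokens and label indices below are invented; only the 202-label size comes from the dataset):

```python
# Segmented fact T_i: a list of words t_ij from the case description.
T_i = ["defendant", "set", "fire", "injuring", "one", "person"]  # invented toy tokens

# Output R_i: a 0/1 indicator list over the 202 accusations. Here the
# (invented) positions of arson and intentional injury are set to 1.
R_i = [0] * 202
R_i[12] = 1   # arson
R_i[37] = 1   # intentional injury
```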
3.1 Prediction Approach

For multi-label text classification tasks, our proposed prediction approach is: (1) use text preprocessing to obtain text vectors for the input data and the external knowledge; (2) use a deep learning model and an attention mechanism to obtain feature vectors, and combine them to get the output probabilities; (3) train on the output probabilities of the case samples to obtain the probability of the label number of each sample; (4) obtain the final result by a label decision that combines the predicted label number of each case sample with its output probabilities. This new approach with automatic label number learning is still end-to-end, and the third step can be viewed as an auxiliary system. The details of the framework of our approach are shown in Figure 2.

In this approach, any text vectorization method can be used, whether one-hot, word embedding, or direct learning, and so can any kind of deep learning model, whether TextCNN, Bi-GRU, etc. It is therefore easy to experiment with for the charge prediction task. We make use of the framework of the memory network to obtain the correlation between the text and the legal provisions through an attention mechanism during training, and combine the two outputs with joint training to get the results (sketched in code below). Since both outputs are obtained at the same time, the loss function also needs to be adjusted; see Section 3.2 for details. Through this approach, adding a label number learning phase after the general prediction phase effectively solves the multi-label number decision problem. In current research, both the top-k and the threshold strategy applied to the model output require observing the actual outputs and manually tuning the parameters; for the multi-label task, the number of labels is variable and the probability distributions of different labels differ, so both strategies produce large judgment errors. With our approach, the label number for multi-label charge prediction is determined adaptively and the results are more accurate.

Fig. 2. External knowledge enhanced prediction with label number learning
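The paper does not spell out the fusion equations, so the following is only a minimal sketch of the memory-style attention between the encoded fact and the per-accusation law provisions; all function and tensor names are our own, and simple dot-product attention is assumed.

```python
import torch
import torch.nn.functional as F

def knowledge_enhanced_logits(fact_vec, provision_keys, provision_vals, classifier):
    """Attend from the fact representation over the law-provision memory.

    fact_vec:       (batch, d)       fact encoding from any base model
    provision_keys: (num_labels, d)  encoded law provisions (memory keys)
    provision_vals: (num_labels, d)  memory values (often the same encodings)
    classifier:     module mapping (batch, 2d) -> (batch, num_labels) logits
    """
    scores = fact_vec @ provision_keys.t()            # relevance of each provision
    attn = F.softmax(scores, dim=-1)                  # (batch, num_labels)
    knowledge = attn @ provision_vals                 # (batch, d) knowledge summary
    fused = torch.cat([fact_vec, knowledge], dim=-1)  # combine text and knowledge
    return classifier(fused)                          # per-label logits
```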
3.2 Loss Function

Since we hope that the network can learn to understand the logic of the text and can also make decisions by matching the relevance of the text to the legal provisions, the network has multiple outputs. We set the losses of these two outputs to $loss_1$ and $loss_2$ respectively, and set the total loss of the network to their weighted sum with custom weights:

$Loss = w_1 \cdot loss_1 + w_2 \cdot loss_2$ (1)

where $w_i$ is the respective weight of each loss.

3.3 Number Learning Network (NLN)

This paper adds a label number learning phase after the general prediction phase according to [34], and proposes a network that can better solve the label number problem of charge prediction, called the number learning network (NLN). Generally, the input of the number learning network is $D = \{X_i, Y_i\}_{i=1}^{m}$, where $X_i$ is the label output probability obtained from the text prediction phase and $Y_i$ is the corresponding one-hot encoding of the number of labels. The details of $X_i$ and $Y_i$ are given in equations (2) and (3):

$X_i = \{x_{i1}, x_{i2}, x_{i3}, \ldots, x_{ij}\}, \quad x_{ij} \in [0, 1]$ (2)

$Y_i = \{y_{i1}, y_{i2}, y_{i3}, \ldots, y_{ik}\}, \quad y_{ik} \in \{0, 1\}$ (3)

A standard fully connected layer computes

$x_j^{(l+1)} = f\Big(\sum_{i=0}^{m-1} w_{ij} x_i^{(l)} + b_j\Big)$ (4)

So that the network can adaptively adjust the threshold of each corresponding label, this paper modifies such a network into the number learning network (NLN). The most significant difference is that the weight values of the first layer are fixed: each neuron of the input layer is connected only to the corresponding neuron of the first hidden layer. The specific expression is as follows:

$w_{ij} = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}$ (5)

Secondly, the number of neurons in the first hidden layer is fixed to be the same as the number of neurons in the input layer. Through these two steps, this paper creates a specific embedding process before the output probabilities are fed into the FC network. In this embedding process, each $x_{ij}$ represents the probability that the sample belongs to the $j$-th accusation, and it passes the filtering if inequality (6) holds:

$f(x_{ij} + b_j) > 0, \quad i = j$ (6)

where $b_j$ is the threshold corresponding to all $x_{ij}$. The back propagation (BP) process adaptively adjusts each $b_j$ as the threshold of the corresponding label. The following several fully connected (FC) layers then perform fine adjustment and screening: after the specific embedding process, the FC network maps the per-label values to the output probability of each label number. At the same time, the network cannot be too complex, in order to prevent gradient explosion or gradient vanishing. The details of the network are illustrated in Figure 3.
Fig. 3. Number learning network (NLN)
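The paper fixes the diagonal first layer and its learnable per-label bias, but not the exact activation or hidden sizes, so the following is only a minimal sketch, assuming ReLU for $f$, a 1000-unit FC hidden layer as in the experimental details, and label numbers capped at 4; the helper at the end implements the label decision described in the next subsection.

```python
import torch
import torch.nn as nn

class NumberLearningNetwork(nn.Module):
    """Sketch of the NLN: a fixed diagonal first layer whose per-label bias b_j
    is a learnable threshold (equations (5)-(6)), followed by a shallow FC
    network that outputs one score per possible label number."""

    def __init__(self, num_labels=202, max_number=4, hidden=1000):
        super().__init__()
        # Diagonal embedding layer: w_ij = 1 if i == j else 0 (fixed);
        # only the per-label bias b_j is trained by back propagation.
        self.bias = nn.Parameter(torch.zeros(num_labels))
        # Kept shallow to avoid gradient explosion or vanishing.
        self.fc = nn.Sequential(
            nn.Linear(num_labels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, max_number),
        )

    def forward(self, probs):                      # probs: (batch, num_labels) in [0, 1]
        filtered = torch.relu(probs + self.bias)   # x_ij passes if f(x_ij + b_j) > 0
        return self.fc(filtered)                   # logits over label numbers 1..max_number


def decide_labels(probs, number_logits):
    """Label decision for one sample: pick the most likely label number n,
    then return the indices of the top-n label probabilities."""
    n = int(number_logits.argmax()) + 1            # label numbers start at 1
    return torch.topk(probs, n).indices.tolist()
```

Under joint training, the cross entropy over these number logits would play the role of one of the two losses weighted in equation (1).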
3.4 Label Decision

Finally, we obtain the label number probability for each sample and choose the label number with the largest value as the final label number. Setting this label number to $n$, we select the labels with the top $n$ largest values in the corresponding sample's output probabilities as the final output $R_i$.

4 Experiments

In this section, we evaluate our proposed approach on the dataset. We introduce the dataset, evaluation measure, experimental details, and all baseline models. Then we compare our method with the baselines. Finally, we provide the analysis and discussion of the experiment results.
4.1 Dataset

All legal data come from Cail2018 [34], which contains more than 2.6 million criminal cases published by the Supreme People's Court of China and is divided into a small and a big dataset. In this paper, we use the small dataset, which consists of 154,592 samples for training and 32,508 samples for testing. Each sample contains a complex legal text description and three label types: accusation, related article, and term of penalty. This paper only uses the accusation label, which covers 202 accusations, and each sample may have multiple labels. The details of the small dataset are given in Table 1. The ratio of the training dataset to the test dataset is about 5:1, and the number of labels is 202. In the training dataset, single-label samples make up about 4/5 and multi-label samples about 1/5, with 2-label samples forming the majority of the multi-label samples.
Table 1.
The details of the Cail2018 small dataset

Dataset    Training   Valid   Test    Label
Quantity   154592     17131   32508   202

Label number                1        2       3      4     Greater than 4
Quantity in training data   120475   30831   2914   288   96
4.2 Evaluation Measure and Baselines

Because this dataset is provided by the Chinese 'fa yan bei' competition (http://cail.cipsc.org.cn/), we follow the evaluation measure of the competition and respectively use micro-F1 and macro-F1 to evaluate the abilities of the different models on the dataset (a minimal computation sketch follows the baseline list). We compare our proposed method with the following baselines:

TextCNN [12]: it uses a variety of kernels to extract the key information from the text vectors, and then maps this information to a low-dimensional space through an FC network to obtain the output probabilities.

CRNN [23]: it applies a CNN before an RNN, and finally outputs the result through an FC network.

DPCNN [10]: it uses downsampling based on ResNet to obtain more efficient and widely available global information while effectively capturing local information.

CNN + attention [25]: while understanding the text information, it finds the degree of correlation between the text and the label and maps it to the label space.

Bi-GRU/Bi-LSTM [4, 9]: they better capture longer-distance text dependencies.
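For reference, both measures can be computed with scikit-learn over binary indicator matrices; the toy arrays below use 3 labels instead of the 202 accusations.

```python
import numpy as np
from sklearn.metrics import f1_score

# Binary indicator matrices of shape (num_samples, num_labels).
y_true = np.array([[1, 0, 1], [0, 1, 0]])   # toy ground truth
y_pred = np.array([[1, 0, 0], [0, 1, 1]])   # toy predictions

# micro-F1 pools TP/FP/FN over all labels; macro-F1 averages per-label F1,
# so macro-F1 is more sensitive to rare accusations.
print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro"))
```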
4.3 Experimental Details

In the experiments, we obtain word2vec vectors directly after segmenting the legal text, so that each word is mapped into a 512-dimensional vector; the numbers of words of the input data and the knowledge are 400 and 85, respectively. The convolution kernel size is generally 3, and for TextCNN the kernel sizes are [1, 2, 3, 4, 5]. The output dimension of the RNN is 512, and the dropout rate is always set to 0.2. Finally, the first layer of the FC network has 1000 dimensions, and the output of the second layer has 202 dimensions. The loss functions of both the deep learning model and the NLN are cross entropy during training.

For the multi-label charge prediction task, multi-label samples are generally far fewer than single-label samples. If the label number of a sample is too large, such as 8, such samples amount to less than 0.01% of the total, so we ignore them directly. Because the number of samples with more than 4 labels is too small in Table 1, we treat the label number of these samples as 4 in the experiments.
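The stated settings can be collected into a single configuration sketch (the key names are ours, not the authors'):

```python
# Hyperparameters as reported in the paper; key names are illustrative.
CONFIG = {
    "embedding_dim": 512,        # word2vec vector size
    "fact_len": 400,             # words kept from the case fact
    "knowledge_len": 85,         # words kept from the law-provision knowledge
    "textcnn_kernel_sizes": [1, 2, 3, 4, 5],
    "default_kernel_size": 3,
    "rnn_output_dim": 512,
    "dropout": 0.2,
    "fc_hidden_dim": 1000,
    "num_labels": 202,           # accusation classes in Cail2018
    "max_label_number": 4,       # labels > 4 are clipped to 4
    "loss": "cross_entropy",     # for both the base model and the NLN
}
```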
4.4 Results on Multi-label Samples

To prove the effect of our proposed approach with NLN, we extract all samples whose label number is greater than 1 from the test dataset. The details of the multi-label samples are given in Table 2 and the experiment results in Table 3. From Table 3 we can clearly see that on the multi-label samples, the scores of all baselines with the number learning network (NLN) are higher than those obtained with the threshold strategy, and the improvements are substantial.
Table 2.
The details of multi-label samples

Label number   2      3     4   Greater than 4
Quantity       1913   135   6   2
With macro-F1 as the index, the score of the best model, TextCNN, is 1.5% higher than the same model with the threshold strategy, and DPCNN has the largest promotion among the baselines, with a gain of 5%. Meanwhile, with micro-F1 as the index, the scores of all baseline models improve greatly: the best model, TextCNN, scores 5% higher than with the threshold strategy, and DPCNN again gains most, with a promotion of 17%. We can therefore conclude that by adding the number learning network (NLN), these kinds of deep learning models perform better on the multi-label charge prediction task.
Table 3.
Results of baselines with and without NLN on multi-label samples

Model       Add NLN   microF1(%)   macroF1(%)
TextCNN     with
            without   77.39        62.52
CRNN        with
            without   63.90        54.63
DPCNN       with
            without   63.90        54.63
Attention   with
            without   70.39        57.59
GRU         with
            without   60.48        50.54
LSTM        with
            without   63.14        48.49
4.5 Results of Baselines

In this part, we apply the baselines to the whole dataset, including single-label and multi-label samples. The experiment results are given in Table 4. In Table 4, the single-label type means that we select only the label with the highest probability as the final prediction result, treating the task as single-label; the multi-label type means that we select all labels that meet the threshold-strategy condition as the output, where the thresholds were chosen by extensive manual selection. We find that among all the baselines, the RNN models processing the word2vec vectors of the converted fact perform worst, and the TextCNN model with five different convolution kernels performs best. With micro-F1 as the evaluation index, the best model, TextCNN, scores 10% higher than the worst model, Bi-GRU; with macro-F1 as the evaluation index, TextCNN scores 15% higher than Bi-GRU.

Although the data is extremely imbalanced, we can still find that after determining the results with the selected thresholds, the F1 scores improve, but the effect is not satisfactory. On this dataset, the difference between the multi-label results and the single-label results of all baselines is small: based on the evaluation measure, the improvement is less than 1% on average. We can therefore conclude that the widely used threshold strategy has defects in accuracy.
Table 4.
Results of baselines

Model       Type           microF1(%)   macroF1(%)
TextCNN     single-label   84.59        74.54
            multi-label
CRNN        single-label   77.67        61.19
            multi-label    78.04        61.58
DPCNN       single-label   80.55        65.03
            multi-label    80.9         65.49
Attention   single-label   82.69        71.36
            multi-label    83.44        72.10
GRU         single-label   75.48        58.16
            multi-label    75.76        58.44
LSTM        single-label   77.54        60.17
            multi-label    77.96        62.20
4.6 Results with Our Approach

In this part, we apply our proposed approach to all the baseline models. The experiment results are given in Table 5, where we find that our approach improves various baselines. According to Table 5, with micro-F1 as the index, the results using our approach perform better: the scores of all models increase, and some models gain 5%-8%. With macro-F1 as the index, all models also improve greatly, some by 9%-13%, and the scores of most models reach above 70%. Given the definition of macro-F1, we can also conclude that our approach correctly classifies more multi-label samples. The results prove that our approach solves this task better.
Table 5.
Results of baselines with our approach

Model       Our Approach   microF1(%)   macroF1(%)
TextCNN     True
            False          85.95        76.18
CRNN        True
            False          78.04        61.58
DPCNN       True
            False          80.9         65.49
Attention   True
            False          83.44        72.10
GRU         True
            False          75.76        58.44
LSTM        True
            False          77.98        60.66

5 Conclusion

In this paper, we propose an external knowledge enhanced end-to-end multi-label charge prediction approach with automatic label number learning and a number learning network (NLN). We show that the approach improves the performance of deep learning models on the multi-label charge prediction task, with large gains on all models. At the same time, we extract all the multi-label samples from the test dataset and test the approach on them; the experiment results show that our approach outperforms the deep learning models with the threshold strategy. With macro-F1 as the index, the improvement of the various deep learning models is 3%-5%; with micro-F1 as the index, it is 5%-15%.

In the future, we will connect the deep learning models with the number learning network (NLN) during training, rather than obtaining the output probabilities of the samples first, and create a new loss function for training. We hope this can improve the accuracy of charge prediction and reduce the training loss error. In particular, we want to avoid predicting single-label samples as multi-label ones, in order to increase the F1 score. At the same time, we will test the performance of the NLN on other multi-label tasks, hoping that further study can improve its versatility.
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014)
2. Bajwa, I.S., Karim, F., Naeem, M.A., ul Amin, R.: A semi supervised approach for catchphrase classification in legal text documents. JCP (5), 451–461 (2017)
3. Berger, M.J.: Large scale multi-label text classification with semantic word vectors. Tech. rep., Stanford University (2015)
4. Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. pp. 1724–1734 (2014)
5. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018)
6. Galgani, F., Compton, P., Hoffmann, A.: Combining different summarization techniques for legal text. In: Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data. pp. 115–123. Association for Computational Linguistics (2012)
7. Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning. In: ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. pp. 1243–1252 (2017)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 770–778 (2016)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (8), 1735–1780 (1997)
10. Johnson, R., Zhang, T.: Deep pyramid convolutional neural networks for text categorization. In: ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. pp. 562–570 (2017)
11. Kim, Y., Lee, H., Jung, K.: Attnconvnet at semeval-2018 task 1: Attention-based convolutional neural networks for multi-label emotion classification. In: NAACL-HLT, New Orleans, Louisiana, June 5-6, 2018. pp. 141–145 (2018)
12. Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. pp. 1746–1751 (2014)
13. Lenc, L., Král, P.: Word embeddings for multi-label document classification. In: RANLP 2017, Varna, Bulgaria, September 2-8, 2017. pp. 431–437 (2017)
14. Luo, B., Feng, Y., Xu, J., Zhang, X., Zhao, D.: Learning to predict charges for criminal cases with legal basis. In: EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017. pp. 2727–2736 (2017)
15. Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. pp. 1412–1421 (2015)
16. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
17. Miller, A.H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., Weston, J.: Key-value memory networks for directly reading documents. In: EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. pp. 1400–1409 (2016)
18. Miller, A.H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., Weston, J.: Key-value memory networks for directly reading documents. In: EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. pp. 1400–1409 (2016)
19. Nay, J.J.: Gov2vec: Learning distributed representations of institutions and their legal text. In: EMNLP 2016, Austin, TX, USA, November 5, 2016. pp. 49–54 (2016)
20. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. pp. 1532–1543 (2014)
21. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers). pp. 2227–2237 (2018)
22. Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position representations. In: NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers). pp. 464–468 (2018)
23. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. (11), 2298–2304 (2017)
24. Sulea, O., Zampieri, M., Malmasi, S., Vela, M., Dinu, L.P., van Genabith, J.: Exploring the use of text classification in the legal domain. In: ICAIL 2017, London, UK, June 16, 2017 (2017)
25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. pp. 6000–6010 (2017)
26. Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q., Huang, X.: Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Trans. Image Processing (11), 3939–3949 (2015)
27. Wang, Y., Wu, L.: Beyond low-rank representations: Orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering. Neural Networks, 1–8 (2018)
28. Wang, Y., Wu, L., Lin, X., Gao, J.: Multiview spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Netw. Learning Syst. (10), 4833–4843 (2018)
29. Wang, Y., Zhang, W., Wu, L., Lin, X., Fang, M., Pan, S.: Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering. In: IJCAI 2016, New York, NY, USA, 9-15 July 2016. pp. 2153–2159 (2016)
30. Wei, F., Qin, H., Ye, S., Zhao, H.: Empirical study of deep learning for text classification in legal document review. In: Big Data 2018, Seattle, WA, USA, December 10-13, 2018. pp. 3317–3320 (2018)
31. Weston, J., Chopra, S., Bordes, A.: Memory networks. In: ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
32. Wu, L., Wang, Y., Gao, J., Li, X.: Where-and-when to look: Deep siamese attention networks for video-based person re-identification. IEEE Trans. Multimedia (6), 1412–1424 (2019)
33. Wu, L., Wang, Y., Li, X., Gao, J.: Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Trans. Cybernetics (5), 1791–1802 (2019)
34. Xiao, C., Zhong, H., Guo, Z., Tu, C., Liu, Z., Sun, M., Feng, Y., Han, X., Hu, Z., Wang, H., Xu, J.: CAIL2018: A large-scale legal dataset for judgment prediction. CoRR abs/1807.02478 (2018)
35. Yang, P., Sun, X., Li, W., Ma, S., Wu, W., Wang, H.: SGM: sequence generation model for multi-label classification. In: COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018. pp. 3915–3926 (2018)
36. Yang, Y., Gopal, S.: Multilabel classification with meta-level features in a learning-to-rank framework. Machine Learning 88