Techniques to Improve Q&A Accuracy with Transformer-based models on Large Complex Documents
Chejui Liao
Synechron Innovation Lab [email protected]
Tabish Maniar
Synechron Innovation Lab [email protected]
Sravanajyothi N
Synechron Innovation Lab [email protected]
Anantha Sharma
Synechron Innovation Lab [email protected]

Abstract
This paper discusses the effectiveness of various text processing techniques, their combinations, and encodings to achieve a reduction of complexity and size in a given text corpus. The simplified text corpus is sent to BERT (or similar transformer-based models) for question and answering and can produce more relevant responses to user queries. This paper takes a scientific approach to determine the benefits and effectiveness of various techniques and concludes with a best-fit combination that produces a statistically significant improvement in accuracy.
Keywords
BERT, QnA model, Stanford CoreNLP, LexNLP, spaCy, document similarity, document processing, misspelled words, phonetic matching, Soundex, information retrieval
In today's world, BERT [Devlin et al., 2018] is one of the most popular models used to build question and answering systems. BERT generally needs additional training (with a large, relevant text corpus) to maintain relevance when presented with large complex documents (such as regulations, federal or institutional policies, and domain-specific documents). This shortcoming becomes clear when dealing with complex sentence structure and can be alleviated by using innovative text pre-processing techniques on the text corpus and by fine-tuning the model. Why is this needed:

• A fine-tuned BERT QnA model can handle a sequence length of 384 tokens at once. When the input context has more than 384 tokens, the context gets divided into multiple chunks (the length of each chunk can be set by the user); this is popularly known as the "sliding window" approach.
• Each chunk is then processed along with the question by the BERT model, giving an answer text and a probability as output; the final answer text is chosen based on the highest probability.
• We observed that the sliding window approach fails to provide correct answers when the document size is over 1000 tokens, as the BERT model was trained on data with smaller contexts.

This led us to explore possibilities of reducing the complexity of the input text and finding the most relevant text from a large context before passing it on to BERT for inference. This paper explores the possibility of using text processing techniques to reduce the complexity of the text sent to BERT and hence improve BERT's accuracy in answering questions on complex documents without a lot of extra training. The limitations of each technique are also discussed in this paper.

To improve the performance of BERT, we implemented several techniques: Definition Tokenization, Dependency Tokenization, Paragraph Splitting, Relevant Paragraph Ranking, and BERT Fine-tuning.
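The sliding-window chunking described above can be sketched as follows. This is a minimal illustration; the `stride` parameter name and the fixed-overlap scheme are our assumptions, since the paper only says the chunk length is user-set:

```python
def sliding_window(tokens, max_len=384, stride=128):
    """Split a long token sequence into overlapping chunks.

    Each chunk holds at most `max_len` tokens (384 for the fine-tuned
    BERT QnA model); successive chunks start `stride` tokens apart, so
    neighbouring chunks overlap and no answer span is cut off silently.
    """
    if len(tokens) <= max_len:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last chunk reaches the end of the document
        start += stride
    return chunks
```

Each chunk would then be paired with the question, scored by the model, and the highest-probability answer kept, as described above.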
The entire flow for text processing and the BERT question and answering system we propose is shown below:

Figure 1: Proposed Text Processing Flow for BERT QnA
One of our tokenization strategies is Definition Tokenization: we replace the specific terms used in the document with a single token. First, we identify definition sentences in the document via keyword search, using keywords such as "mean" and "define". For example, "Common ownership means a relationship between two companies" is a definition sentence. Second, we identify the subject (including the noun and its modifiers) in the definition sentences through Stanford CoreNLP's dependency annotation [Manning et al., 2014] ("Common ownership" in this case) and tokenize the subject as XnXn (where n is a number and X can be any capitalized character). The result of tokenization is as below:

Original: Common ownership means a relationship between two companies.
Definition Tokenization: X1X1 means a relationship between two companies.

LexNLP [Bommarito II et al., 2018] is also used to help us tokenize compound nouns outside of the definition sentences. LexNLP is trained on financial text and is able to recognize some domain-specific terms. The result is as below:

Original: Financial Institution needs to submit a suspicious activity report.
Definition Tokenization: X1X2 needs to submit a suspicious activity report.

Finally, we replace all the definition subjects with their corresponding tokens throughout the document. With this approach, we are able to reduce several words to a single token, and thus reduce the amount of text in the sentences.

The limitations of this approach include the following:
• Variable amount of tokenization: highly dependent on how often the defined terms are used across the document and how many terms are defined
• Customization of the definition keywords for different documents: each document may use different keywords to introduce a definition sentence
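A minimal sketch of the keyword-search step, assuming a simple capitalized-phrase pattern in place of the CoreNLP dependency parse used in the paper (the function name and regex are our own illustrations, not the authors' code):

```python
import re

# Keywords that signal a definition sentence; customized per document.
DEFINITION_KEYWORDS = r"(?:means|is defined as)"

def definition_tokenize(text):
    """Replace each defined term with an XnXn placeholder token.

    Simplified sketch: the subject is taken as the capitalized phrase
    directly preceding the definition keyword, instead of the subject
    subtree of a dependency parse as in the paper.
    """
    replacements = {}
    pattern = r"([A-Z]\w*(?:\s+\w+)*?)\s+" + DEFINITION_KEYWORDS + r"\b"
    for i, match in enumerate(re.finditer(pattern, text), start=1):
        subject = match.group(1)
        if subject not in replacements:
            replacements[subject] = f"X1X{i}"
    # Replace every occurrence of each defined term across the document.
    for subject, token in replacements.items():
        text = text.replace(subject, token)
    return text, replacements
```

Running this on "Common ownership means a relationship between two companies." maps "Common ownership" to X1X1 everywhere it occurs.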
The other tokenization strategy we use is Dependency Tokenization, which uses Stanford CoreNLP's dependency annotation to tokenize words. We first identify the verb in a sentence. Then we group the subjects and their modifiers and tokenize them; we do the same for objects. Below is the result:

Original: Bank and insurance company need to submit a suspicious activity report.
Dependency Tokenization: X1X3 need to submit X1X4.

In this example, "bank" and "company" are subjects of "need", and "insurance" is the modifier of "company", so we tokenize them all together. "Report" is the object in this case; "a", "suspicious", and "activity" are its modifiers. Complex sentences usually involve a lot of subjects, objects, and modifiers. With this approach, we can simplify such sentences to a great extent.

The limitations of this approach include the following:
• Highly dependent on the accuracy of the dependency parse: in very complicated sentences, the dependency results may not be correct and end up tokenizing the wrong words, distorting the structure of the sentences
• Computationally expensive: extracting dependencies from each sentence requires a lot of computation
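The grouping step can be sketched as follows. To keep the example self-contained, a toy list of (index, word, head, relation) tuples stands in for Stanford CoreNLP's parser output, and the token counter is started at 3 purely to reproduce the X1X3/X1X4 labels above; both are our assumptions:

```python
def dependency_tokenize(parse, start=3):
    """Group subject and object subtrees into XnXn tokens.

    `parse` is a list of (index, word, head_index, relation) tuples
    standing in for CoreNLP's dependency annotation; indices are 1-based
    word positions and head 0 marks the root verb.
    """
    children = {}
    for idx, _, head, _ in parse:
        children.setdefault(head, []).append(idx)

    def subtree(idx):
        # Collect a word together with all of its modifiers, recursively.
        ids = [idx]
        for child in children.get(idx, []):
            ids.extend(subtree(child))
        return sorted(ids)

    words = {idx: word for idx, word, _, _ in parse}
    tokens, n = {}, start
    for idx, _, _, rel in parse:
        if rel in ("nsubj", "obj", "dobj"):  # subject or object head
            phrase = " ".join(words[i] for i in subtree(idx))
            tokens[phrase] = f"X1X{n}"
            n += 1
    return tokens
```

On a parse of "Bank and insurance company need to submit a suspicious activity report", this groups the full subject and object phrases into single tokens, matching the example above.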
The other text processing we do is to split the document into paragraphs and send the most relevant paragraphs with a question for BERT to answer. Since the documents we deal with contain tens of thousands of words, BERT is not able to pick out the answer from such a sea of text. Given the hierarchy of the regulation documents we deal with, we split the document into small paragraphs (each containing a piece of information about a specific regulation) using regular expressions and spaCy's sentence segmentation [Honnibal and Montani, 2017].

The limitations of this approach include the following:
• BERT not able to answer broader questions: after splitting into paragraphs, the information becomes more specific, and BERT cannot revert to a high-level answer
• Splitting strategy very subjective: the developer needs to make a judgement about how far the splitting should go. If a paragraph is too long, BERT will still have issues figuring out the answer; if it is too short, there may not be enough information to match the question. The golden rule is that each split paragraph should contain a complete piece of information

Original (66 tokens): 5 times the amount of the nonvoting capital stock of the Financing Corporation which is outstanding at such time; or the amount of capital stock of the Financing Corporation held by such remaining bank at the time of such determination; by the amounts added to reserves after December 31, 1985, pursuant to the requirement contained in the first 2 sentences of section 1436 of this title.

Definition and Dependency Tokenization (54 tokens): 5 times the amount of the Y1Y300 X1441 which is outstanding at Y1Y1416; or the amount of Y1Y1122 of the Y1Y415 held by such Y1Y1099 at the time of Y1Y651 X1393 added to reserves after december 31, 1985, pursuant to the requirement contained in the first 2 sentences of section 1436 of this title.
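The regular-expression half of the splitting step can be sketched as below. The section-marker patterns are our assumptions about typical regulation formatting; the paper additionally applies spaCy's sentence segmentation, which is omitted here:

```python
import re

def split_paragraphs(document):
    """Split a regulation document into small paragraphs.

    Sketch: splits on blank lines and on newlines that precede a
    section marker such as "(a)" or "1.", so that each resulting
    paragraph holds one piece of a specific regulation.
    """
    pieces = re.split(r"\n\s*\n|\n(?=\(\w\)|\d+\.)", document)
    # Collapse internal whitespace and drop empty fragments.
    return [" ".join(p.split()) for p in pieces if p.strip()]
```

How fine to split remains the subjective judgement discussed above; the patterns would be tuned per document hierarchy.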
To narrow BERT's search space and obtain more accurate answers, we build a similarity ranking model based on Doc2Vec [Le and Mikolov, 2014] and TF-IDF [Ramos et al., 2003] to locate the most relevant paragraphs given a question. We compare the similarity between the paragraphs and the question and send the top relevant paragraphs to BERT for an answer. The number of top paragraphs is a hyperparameter that depends on the size of the document: the larger the document, the more relevant paragraphs are needed. Doc2Vec provides flexibility in the words used in the questions: the user does not need to provide exactly the same words as appear in the documents to get the answer, since the sentences are vectorized and compared according to their context. TF-IDF, in turn, helps us identify the right paragraphs more accurately when the user does provide the exact words. Given the questions and the documents we had, we use a 50-50 weighting between Doc2Vec and TF-IDF.

The limitations of this approach include the following:
• The weight between Doc2Vec and TF-IDF is arbitrary: increasing the weight of Doc2Vec gives more flexibility in the question's wording, but we may lose accuracy
• The number of top relevant paragraphs varies: it depends on how specific the question is, how similar the paragraphs in the document are, and how large the document is
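The TF-IDF half of this ranker can be sketched with the standard library alone; in the paper this score would be averaged 50-50 with a Doc2Vec cosine similarity, which we omit here since it needs a trained model:

```python
import math
from collections import Counter

def tfidf_rank(paragraphs, question, top_k=3):
    """Return the indices of the `top_k` paragraphs most similar to the
    question under TF-IDF cosine similarity (the TF-IDF half of the
    paper's 50-50 Doc2Vec/TF-IDF ranker)."""
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    # Inverse document frequency over the paragraph collection.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {w: tf[w] / len(tokens) * idf.get(w, 0.0) for w in tf}

    def cos(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    qv = vec(question.lower().split())
    ranked = sorted(range(n), key=lambda i: cos(vec(docs[i]), qv), reverse=True)
    return ranked[:top_k]
```

As noted above, `top_k` is the hyperparameter controlling how many paragraphs reach BERT.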
To overcome spelling mistakes in questions, we also implement Soundex-based encodings [Koneru et al., 2016]. Soundex encodes words by their phonetic sound, so even if a word is misspelled in the question, its phonetic encoding remains the same: for example, "Hello" yields an encoding of H400, and "Hallo" also yields H400. The Soundex-encoded terms are used with TF-IDF, making it easier to find the relevant sections of the document. By default, Soundex uses encodings of length 4 (H400), but we decided to use encodings of length 6 (H40000) to accommodate more term varieties in our documents.
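A standard Soundex implementation with the length made configurable, matching the length-6 variant described above (this is the textbook algorithm, not the authors' code):

```python
def soundex(word, length=6):
    """Phonetic Soundex encoding, padded/truncated to `length`.

    Keeps the first letter, maps remaining consonants to digit classes,
    drops vowels and h/w, and collapses adjacent duplicate codes, so
    "Hello" and "Hallo" both encode to H40000 at length 6.
    """
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    if not word or not word[0].isalpha():
        return ""
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate duplicate codes
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "0" * length)[:length]
```

Encoding both the question terms and the document terms this way lets the TF-IDF match survive misspellings.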
We analyzed the performance of a BERT model trained on SQuAD 2.0 [Rajpurkar et al., 2018] and decided to fine-tune it further with a data distribution similar to FDIC documents, since the SQuAD 2.0 dataset has a simple structure compared to the sentence complexity of FDIC documents: answers in FDIC documents span multiple lines, increasing the structural complexity. Because BERT was trained on the SQuAD dataset, the model performed poorly on FDIC documents even for simple "who/where" kinds of questions.

The training data consists of 88 paragraphs/contexts, along with multiple questions and answers for each paragraph, in the SQuAD data format. Similarly, the test data consists of 50 paragraphs and the validation data of 15 paragraphs, following the same format. After experimenting with different sets of hyperparameters, we settled on Adam as the optimizer [Kingma and Ba, 2014] with learning-rate = 3e-5 and train-batch-size = 24, with the other parameters at their default values, since this combination proved optimal in terms of model accuracy when tested.
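For reference, the fine-tuning setup above can be collected into a configuration fragment. The field names are our own, following common BERT training scripts; the paper reports only the values:

```python
# Hypothetical fine-tuning configuration; field names are assumptions.
finetune_config = {
    "optimizer": "Adam",        # [Kingma and Ba, 2014]
    "learning_rate": 3e-5,
    "train_batch_size": 24,
    "max_seq_length": 384,      # tokens per sliding-window chunk
    "train_paragraphs": 88,     # SQuAD-format contexts
    "test_paragraphs": 50,
    "validation_paragraphs": 15,
}
```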
Due to integration difficulties with tokenization, we tested the BERT question and answer system with paragraph splitting and relevant paragraph ranking on three regulation documents from the FDIC website: Suspicious Activity Reports, Fair Housing, and Appraisals. We define two metrics to evaluate BERT performance:

• F1 score: We used the same approach as in SQuAD for evaluation. An answer is divided into tokens, and a confusion matrix is calculated based on the comparison of tokens between BERT's answer and the real answer [Sokolova et al., 2006].
True positives (tp): the number of tokens shared between the actual answer and the predicted answer.
False positives (fp): the tokens in the prediction but not in the actual answer.
False negatives (fn): the tokens in the actual answer but not in the prediction.
Precision: tp / (tp + fp)
Recall: tp / (tp + fn)
F1: 2 * precision * recall / (precision + recall)
• Quality score (Q score): We defined a scoring system and manually evaluated the quality of answers using the following standards.
1: Unacceptable, if BERT does not cover the complete response
2: Partially answered, if BERT partially covers the complete response
3: Completely answered, if BERT covers the complete response to the question

We compared the F1 score and quality score among BERT with the entire document, BERT with a manually selected paragraph, and BERT with text processing techniques. Figures 2-4 show the flows for each system.

Figure 2: BERT with entire document
Figure 3: BERT with manually selected paragraph
Figure 4: BERT with Text Processing

We normalized the quality score as (Quality score - 1) / 2 to get a score ranging from 0 to 1 for better comparison.

Document | Document size (words) | Number of questions | Entire document (F1 / Q) | Manually selected paragraph (F1 / Q) | Text processing techniques (F1 / Q)
Suspicious Activity Report | 1420 | 27 | 9.7% / 3.7% | 58.7% / 79.5% | 56.6% / 74%
Fair Housing | 1780 | 17 | 8.2% / 0% | 44.6% / 67.5% | 41.3% / 61.5%
Appraisals | 5367 | 14 | 6.42% / 0% | 62.1% / 61% | 51.7% / 50%
Table 1: Test Results

Figure 5: F1 Score comparison
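The SQuAD-style token-level F1 metric used above can be sketched as:

```python
from collections import Counter

def token_f1(prediction, actual):
    """Token-level F1 between a predicted and an actual answer,
    following the SQuAD evaluation approach described above."""
    pred, act = prediction.lower().split(), actual.lower().split()
    common = Counter(pred) & Counter(act)
    tp = sum(common.values())          # tokens shared by both answers
    if tp == 0:
        return 0.0
    precision = tp / len(pred)         # tp / (tp + fp)
    recall = tp / len(act)             # tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```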
Figure 6: Q-Score comparison

With text processing techniques such as paragraph splitting and relevant paragraph ranking, we can narrow down the content size and boost BERT's accuracy on large documents by 30-50% in terms of F1 score. The upper bound of our BERT performance is BERT with a manually selected paragraph, since there we directly provide the most relevant content. Compared to this upper bound, BERT with text processing techniques sacrifices only about 5-11% accuracy in F1 score.

Although the tokenization techniques also look promising for simplifying complex documents, during development we ran into the following issues integrating them with our BERT QnA system:
• The handling of duplicate tokens: we should use the same tokens for the same phrases throughout the document.
• The matching of questions and tokens: Dependency Tokenization tokenizes phrases in a more varied fashion (adjectives and adverbs are also tokenized with the noun), making it difficult to match tokens with questions, since the user may not provide exactly the phrases used in the document. Although we tried to iterate through all the possible matching tokens, we were not able to determine which token should be taken, since BERT could not provide us the absolute probability of an answer.

In the future, we will continue to improve our tokenization algorithm so it can fit into our BERT QnA system. One approach is to determine the acronyms in the document and their meanings, linking the sections and their meanings across the document [Banthia and Sharma, 2020]. We will also continue to explore more text processing techniques to improve BERT's performance.
References

[Banthia and Sharma, 2020] Banthia, S. and Sharma, A. (2020). Classification of descriptions and summary using multiple passes of statistical and natural language toolkits.
[Bommarito II et al., 2018] Bommarito II, M. J., Katz, D. M., and Detterman, E. M. (2018). LexNLP: Natural language processing and information extraction for legal and regulatory texts. arXiv preprint arXiv:1806.03688.
[Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
[Honnibal and Montani, 2017] Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Koneru et al., 2016] Koneru, K., Pulla, V. S. V., and Varol, C. (2016). Performance evaluation of phonetic matching algorithms on English words and street names. In Proceedings of the 5th International Conference on Data Management Technologies and Applications, pages 57-64. SCITEPRESS - Science and Technology Publications, Lda.
[Le and Mikolov, 2014] Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196.
[Manning et al., 2014] Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60.
[Rajpurkar et al., 2018] Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD.
[Ramos et al., 2003] Ramos, J. et al. (2003). Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, volume 242, pages 133-142. New Jersey, USA.
[Sokolova et al., 2006] Sokolova, M., Japkowicz, N., and Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In Australasian Joint Conference on Artificial Intelligence, pages 1015-1021. Springer.