Techniques to Improve Q&A Accuracy with Transformer-based models on Large Complex Documents
Chejui Liao
Synechron Innovation Lab [email protected]
Tabish Maniar
Synechron Innovation Lab [email protected]
Sravanajyothi N
Synechron Innovation Lab [email protected]
Anantha Sharma
Synechron Innovation Lab [email protected]

Abstract
This paper discusses the effectiveness of various text processing techniques, their combinations, and encodings to achieve a reduction of complexity and size in a given text corpus. The simplified text corpus is sent to BERT (or similar transformer-based models) for question and answering and can produce more relevant responses to user queries. This paper takes a scientific approach to determine the benefits and effectiveness of various techniques and concludes with a best-fit combination that produces a statistically significant improvement in accuracy.
Keywords
BERT, QnA model, Stanford CoreNLP, LexNLP, spaCy, document similarity, document processing, misspelled words, phonetic matching, Soundex, information retrieval
In today's world, BERT [Devlin et al., 2018] is one of the most popular models used to build question and answering systems. BERT generally needs additional training (with a large, relevant text corpus) to maintain relevance when presented with large complex documents (such as regulations, federal or institutional policies, and domain-specific documents). This shortcoming becomes clear when dealing with complex sentence structure and can be alleviated by using innovative text pre-processing techniques on the text corpus and by fine-tuning the model. Why is this needed:

• A fine-tuned BERT QnA model can handle a sequence length of 384 tokens at once. When the input context has more than 384 tokens, the context gets divided into multiple chunks (the length of each chunk can be set by the user); this is popularly known as the "sliding window" approach.
• Each chunk is then processed along with the question by the BERT model, giving an answer text and a probability as output; the final answer text is chosen based on the highest probability.
• We observed that the sliding window approach fails to provide correct answers when the document size is over 1000 tokens, as the BERT model was trained on data with smaller contexts.

This led us to explore possibilities of reducing the complexity of the input text and finding the most relevant text from a large context before passing it on to BERT for inference. This paper explores the possibility of using text processing techniques to reduce the complexity of the text sent to BERT and hence improve BERT's accuracy in answering questions on complex documents without a lot of extra training. The limitations of each technique are also discussed in this paper.

To improve the performance of BERT, we implemented several techniques: Definition Tokenization, Dependency Tokenization, Paragraph Splitting, Relevant Paragraph Ranking, and BERT Fine-tuning.
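The sliding-window chunking described above can be sketched as follows. This is a minimal illustration; the `stride` parameter name and the fixed-overlap scheme are our assumptions, since the paper only says the chunk length is user-set:

```python
def sliding_window(tokens, max_len=384, stride=128):
    """Split a long token sequence into overlapping chunks.

    Each chunk holds at most `max_len` tokens (384 for the fine-tuned
    BERT QnA model); successive chunks start `stride` tokens apart, so
    neighbouring chunks overlap and no answer span is cut off silently.
    """
    if len(tokens) <= max_len:
        return [tokens]
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last chunk reaches the end of the document
        start += stride
    return chunks
```

Each chunk would then be paired with the question, scored by the model, and the highest-probability answer kept, as described above.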
The entire flow for text processing and the BERT question and answering system we propose is shown below:

Figure 1: Proposed Text Processing Flow for BERT QnA
One of our tokenization strategies is Definition Tokenization: we replace the specific terms used in the document with a single token. First, we identify definition sentences in the document via keyword search, using keywords such as "mean" and "define". For example, "Common ownership means a relationship between two companies" is a definition sentence. Second, we identify the subject (including the noun and its modifiers) in the definition sentences through Stanford CoreNLP's dependency annotation [Manning et al., 2014] ("Common ownership" in this case) and tokenize the subject as XnXn (where n is a number and X can be any capitalized character). The result of tokenization is as below:

Original: Common ownership means a relationship between two companies.
Definition Tokenization: X1X1 means a relationship between two companies.

LexNLP [Bommarito II et al., 2018] is also used to help us tokenize compound nouns outside of the definition sentences. LexNLP is trained on financial text and is able to recognize some domain-specific terms. The result is as below:

Original: Financial Institution needs to submit a suspicious activity report.
Definition Tokenization: X1X2 needs to submit a suspicious activity report.

Finally, we replace all the definition subjects with their corresponding tokens throughout the document. With this approach, we are able to reduce several words to a single token, and thus reduce the amount of text in the sentences.

The limitations of this approach include the following:
• Variable amount of tokenization: highly dependent on how often the defined terms are used across the document and how many terms are defined
• Customization of the definition keywords for different documents: each document may use different keywords to introduce a definition sentence
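A minimal sketch of the keyword-search step, assuming a simple capitalized-phrase pattern in place of the CoreNLP dependency parse used in the paper (the function name and regex are our own illustrations, not the authors' code):

```python
import re

# Keywords that signal a definition sentence; customized per document.
DEFINITION_KEYWORDS = r"(?:means|is defined as)"

def definition_tokenize(text):
    """Replace each defined term with an XnXn placeholder token.

    Simplified sketch: the subject is taken as the capitalized phrase
    directly preceding the definition keyword, instead of the subject
    subtree of a dependency parse as in the paper.
    """
    replacements = {}
    pattern = r"([A-Z]\w*(?:\s+\w+)*?)\s+" + DEFINITION_KEYWORDS + r"\b"
    for i, match in enumerate(re.finditer(pattern, text), start=1):
        subject = match.group(1)
        if subject not in replacements:
            replacements[subject] = f"X1X{i}"
    # Replace every occurrence of each defined term across the document.
    for subject, token in replacements.items():
        text = text.replace(subject, token)
    return text, replacements
```

Running this on "Common ownership means a relationship between two companies." maps "Common ownership" to X1X1 everywhere it occurs.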
The other tokenization strategy we use is Dependency Tokenization, which uses Stanford CoreNLP's dependency annotation to tokenize words. We first identify the verb in a sentence. Then we group the subjects and their modifiers and tokenize them; we do the same for objects. Below is the result:

Original: Bank and insurance company need to submit a suspicious activity report.
Dependency Tokenization: X1X3 need to submit X1X4.

In this example, "bank" and "company" are subjects of "need", and "insurance" is the modifier of "company", so we tokenize them all together. "Report" is the object in this case; "a", "suspicious", and "activity" are its modifiers. Complex sentences usually involve a lot of subjects, objects, and modifiers. With this approach, we can simplify such sentences to a great extent.

The limitations of this approach include the following:
• Highly dependent on the accuracy of the dependency parse: in very complicated sentences, the dependency results may not be correct and end up tokenizing the wrong words, distorting the structure of the sentences
• Computationally expensive: extracting dependencies from each sentence requires a lot of computation
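The grouping step can be sketched as follows. To keep the example self-contained, a toy list of (index, word, head, relation) tuples stands in for Stanford CoreNLP's parser output, and the token counter is started at 3 purely to reproduce the X1X3/X1X4 labels above; both are our assumptions:

```python
def dependency_tokenize(parse, start=3):
    """Group subject and object subtrees into XnXn tokens.

    `parse` is a list of (index, word, head_index, relation) tuples
    standing in for CoreNLP's dependency annotation; indices are 1-based
    word positions and head 0 marks the root verb.
    """
    children = {}
    for idx, _, head, _ in parse:
        children.setdefault(head, []).append(idx)

    def subtree(idx):
        # Collect a word together with all of its modifiers, recursively.
        ids = [idx]
        for child in children.get(idx, []):
            ids.extend(subtree(child))
        return sorted(ids)

    words = {idx: word for idx, word, _, _ in parse}
    tokens, n = {}, start
    for idx, _, _, rel in parse:
        if rel in ("nsubj", "obj", "dobj"):  # subject or object head
            phrase = " ".join(words[i] for i in subtree(idx))
            tokens[phrase] = f"X1X{n}"
            n += 1
    return tokens
```

On a parse of "Bank and insurance company need to submit a suspicious activity report", this groups the full subject and object phrases into single tokens, matching the example above.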
The other text processing we do is to split the document into paragraphs and send the most relevant paragraphs with a question for BERT to answer. Since the documents we deal with contain tens of thousands of words, BERT is not able to pick out the answer from such a sea of text. Given the hierarchy of the regulation documents we deal with, we split the document into small paragraphs (each containing a piece of information about a specific regulation) using regular expressions and spaCy's sentence segmentation [Honnibal and Montani, 2017].

The limitations of this approach include the following:
• BERT not able to answer broader questions: after splitting into paragraphs, the information becomes more specific, and BERT cannot revert to a high-level answer
• Splitting strategy very subjective: the developer needs to make a judgement about how far the splitting should go. If a paragraph is too long, BERT will still have issues figuring out the answer; if it is too short, there may not be enough information to match the question. The golden rule is that each split paragraph should contain a complete piece of information

Original (66 tokens): 5 times the amount of the nonvoting capital stock of the Financing Corporation which is outstanding at such time; or the amount of capital stock of the Financing Corporation held by such remaining bank at the time of such determination; by the amounts added to reserves after December 31, 1985, pursuant to the requirement contained in the first 2 sentences of section 1436 of this title.

Definition and Dependency Tokenization (54 tokens): 5 times the amount of the Y1Y300 X1441 which is outstanding at Y1Y1416; or the amount of Y1Y1122 of the Y1Y415 held by such Y1Y1099 at the time of Y1Y651 X1393 added to reserves after december 31, 1985, pursuant to the requirement contained in the first 2 sentences of section 1436 of this title.
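The regular-expression half of the splitting step can be sketched as below. The section-marker patterns are our assumptions about typical regulation formatting; the paper additionally applies spaCy's sentence segmentation, which is omitted here:

```python
import re

def split_paragraphs(document):
    """Split a regulation document into small paragraphs.

    Sketch: splits on blank lines and on newlines that precede a
    section marker such as "(a)" or "1.", so that each resulting
    paragraph holds one piece of a specific regulation.
    """
    pieces = re.split(r"\n\s*\n|\n(?=\(\w\)|\d+\.)", document)
    # Collapse internal whitespace and drop empty fragments.
    return [" ".join(p.split()) for p in pieces if p.strip()]
```

How fine to split remains the subjective judgement discussed above; the patterns would be tuned per document hierarchy.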
To narrow BERT's search space and obtain more accurate answers, we build a similarity ranking model based on Doc2Vec [Le and Mikolov, 2014] and TF-IDF [Ramos et al., 2003] to locate the most relevant paragraphs given a question. We compare the similarity between the paragraphs and the question and send the top relevant paragraphs to BERT for an answer. The number of top paragraphs is a hyperparameter that depends on the size of the document: the larger the document, the more relevant paragraphs are needed. Doc2Vec provides flexibility in the words used in the questions: the user does not need to provide exactly the same words as appear in the documents to get the answer, since the sentences are vectorized and compared according to their context. TF-IDF, in turn, helps us identify the right paragraphs more accurately when the user does provide the exact words. Given the questions and the documents we had, we use a 50-50 weighting between Doc2Vec and TF-IDF.

The limitations of this approach include the following:
• The weight between Doc2Vec and TF-IDF is arbitrary: increasing the weight of Doc2Vec gives more flexibility in the question's wording, but we may lose accuracy
• The number of top relevant paragraphs varies: it depends on how specific the question is, how similar the paragraphs in the document are, and how large the document is
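The TF-IDF half of this ranker can be sketched with the standard library alone; in the paper this score would be averaged 50-50 with a Doc2Vec cosine similarity, which we omit here since it needs a trained model:

```python
import math
from collections import Counter

def tfidf_rank(paragraphs, question, top_k=3):
    """Return the indices of the `top_k` paragraphs most similar to the
    question under TF-IDF cosine similarity (the TF-IDF half of the
    paper's 50-50 Doc2Vec/TF-IDF ranker)."""
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    # Inverse document frequency over the paragraph collection.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {w: tf[w] / len(tokens) * idf.get(w, 0.0) for w in tf}

    def cos(a, b):
        dot = sum(a[w] * b.get(w, 0.0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    qv = vec(question.lower().split())
    ranked = sorted(range(n), key=lambda i: cos(vec(docs[i]), qv), reverse=True)
    return ranked[:top_k]
```

As noted above, `top_k` is the hyperparameter controlling how many paragraphs reach BERT.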
To overcome spelling mistakes in questions, we also implement Soundex-based encodings [Koneru et al., 2016]. Soundex encodes words by their phonetic sound, so even if a word is misspelled in the question, its phonetic encoding remains the same: for example, "Hello" yields an encoding of H400, and "Hallo" also yields H400. The Soundex-encoded terms are used with TF-IDF, making it easier to find the relevant sections of the document. By default, Soundex uses encodings of length 4 (H400), but we decided to use encodings of length 6 (H40000) to accommodate more term varieties in our documents.
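A standard Soundex implementation with the length made configurable, matching the length-6 variant described above (this is the textbook algorithm, not the authors' code):

```python
def soundex(word, length=6):
    """Phonetic Soundex encoding, padded/truncated to `length`.

    Keeps the first letter, maps remaining consonants to digit classes,
    drops vowels and h/w, and collapses adjacent duplicate codes, so
    "Hello" and "Hallo" both encode to H40000 at length 6.
    """
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    if not word or not word[0].isalpha():
        return ""
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate duplicate codes
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code
    return (encoded + "0" * length)[:length]
```

Encoding both the question terms and the document terms this way lets the TF-IDF match survive misspellings.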
We analyzed the performance of a BERT model trained on SQuAD 2.0 [Rajpurkar et al., 2018] and decided to fine-tune it further with a data distribution similar to FDIC documents, since the SQuAD 2.0 dataset has a simple structure compared to the sentence complexity of FDIC documents: answers in FDIC documents span multiple lines, increasing the structural complexity. Because BERT was trained on the SQuAD dataset, the model performed poorly on FDIC documents even for simple "who/where" kinds of questions.

The training data consists of 88 paragraphs/contexts, along with multiple questions and answers for each paragraph, in the SQuAD data format. Similarly, the test data consists of 50 paragraphs and the validation data of 15 paragraphs, following the same format. After experimenting with different sets of hyperparameters, we settled on Adam as the optimizer [Kingma and Ba, 2014] with learning-rate = 3e-5 and train-batch-size = 24, with the other parameters at their default values, since this combination proved optimal in terms of model accuracy when tested.
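For reference, the fine-tuning setup above can be collected into a configuration fragment. The field names are our own, following common BERT training scripts; the paper reports only the values:

```python
# Hypothetical fine-tuning configuration; field names are assumptions.
finetune_config = {
    "optimizer": "Adam",        # [Kingma and Ba, 2014]
    "learning_rate": 3e-5,
    "train_batch_size": 24,
    "max_seq_length": 384,      # tokens per sliding-window chunk
    "train_paragraphs": 88,     # SQuAD-format contexts
    "test_paragraphs": 50,
    "validation_paragraphs": 15,
}
```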
Due to integration difficulties with tokenization, we tested the BERT question and answer system with paragraph splitting and relevant paragraph ranking on three regulation documents from the FDIC website: Suspicious Activity Reports, Fair Housing, and Appraisals. We define two metrics to evaluate BERT performance:

• F1 score: We used the same approach as in SQuAD for evaluation. An answer is divided into tokens, and a confusion matrix is calculated based on the comparison of tokens between BERT's answer and the real answer [Sokolova et al., 2006].
True positives (tp): the number of tokens shared between the actual answer and the predicted answer.
False positives (fp): the tokens in the prediction but not in the actual answer.
False negatives (fn): the tokens in the actual answer but not in the prediction.
Precision: tp / (tp + fp)
Recall: tp / (tp + fn)
F1: 2 * precision * recall / (precision + recall)
• Quality score (Q score): We defined a scoring system and manually evaluated the quality of answers using the following standards.
1: Unacceptable, if BERT does not cover the complete response
2: Partially answered, if BERT partially covers the complete response
3: Completely answered, if BERT covers the complete response to the question

We compared the F1 score and quality score among BERT with the entire document, BERT with a manually selected paragraph, and BERT with text processing techniques. Figures 2-4 show the flows for each system.

Figure 2: BERT with entire document
Figure 3: BERT with manually selected paragraph
Figure 4: BERT with Text Processing

We normalized the quality score as (Quality score - 1) / 2 to get a score ranging from 0 to 1 for better comparison.

Document | Document size (words) | Number of questions | Entire document (F1 / Q) | Manually selected paragraph (F1 / Q) | Text processing techniques (F1 / Q)
Suspicious Activity Report | 1420 | 27 | 9.7% / 3.7% | 58.7% / 79.5% | 56.6% / 74%
Fair Housing | 1780 | 17 | 8.2% / 0% | 44.6% / 67.5% | 41.3% / 61.5%
Appraisals | 5367 | 14 | 6.42% / 0% | 62.1% / 61% | 51.7% / 50%
Table 1: Test Results

Figure 5: F1 Score comparison
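The SQuAD-style token-level F1 metric used above can be sketched as:

```python
from collections import Counter

def token_f1(prediction, actual):
    """Token-level F1 between a predicted and an actual answer,
    following the SQuAD evaluation approach described above."""
    pred, act = prediction.lower().split(), actual.lower().split()
    common = Counter(pred) & Counter(act)
    tp = sum(common.values())          # tokens shared by both answers
    if tp == 0:
        return 0.0
    precision = tp / len(pred)         # tp / (tp + fp)
    recall = tp / len(act)             # tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```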
Figure 6: Q-Score comparison

With text processing techniques such as paragraph splitting and relevant paragraph ranking, we can narrow down the content size and boost BERT's accuracy on large documents by 30-50% in terms of F1 score. The upper bound of our BERT performance is BERT with a manually selected paragraph, since there we directly provide the most relevant content. Compared to this upper bound, BERT with text processing techniques sacrifices only about 5-11% accuracy in F1 score.

Although the tokenization techniques also look promising for simplifying complex documents, during development we ran into the following issues integrating them with our BERT QnA system:
• The handling of duplicate tokens: we should use the same tokens for the same phrases throughout the document.
• The matching of questions and tokens: Dependency Tokenization tokenizes phrases in a more varied fashion (adjectives and adverbs are also tokenized with the noun), making it difficult to match tokens with questions, since the user may not provide exactly the phrases used in the document. Although we tried to iterate through all the possible matching tokens, we were not able to determine which token should be taken, since BERT could not provide us the absolute probability of an answer.

In the future, we will continue to improve our tokenization algorithm so it can fit into our BERT QnA system. One approach is to determine the acronyms in the document and their meanings, linking the sections and their meanings across the document [Banthia and Sharma, 2020]. We will also continue to explore more text processing techniques to improve BERT's performance.
References

[Banthia and Sharma, 2020] Banthia, S. and Sharma, A. (2020). Classification of descriptions and summary using multiple passes of statistical and natural language toolkits.
[Bommarito II et al., 2018] Bommarito II, M. J., Katz, D. M., and Detterman, E. M. (2018). LexNLP: Natural language processing and information extraction for legal and regulatory texts. arXiv preprint arXiv:1806.03688.
[Devlin et al., 2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
[Honnibal and Montani, 2017] Honnibal, M. and Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Koneru et al., 2016] Koneru, K., Pulla, V. S. V., and Varol, C. (2016). Performance evaluation of phonetic matching algorithms on English words and street names. In Proceedings of the 5th International Conference on Data Management Technologies and Applications, pages 57-64. SCITEPRESS - Science and Technology Publications, Lda.
[Le and Mikolov, 2014] Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196.
[Manning et al., 2014] Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60.
[Rajpurkar et al., 2018] Rajpurkar, P., Jia, R., and Liang, P. (2018). Know what you don't know: Unanswerable questions for SQuAD.
[Ramos et al., 2003] Ramos, J. et al. (2003). Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, volume 242, pages 133-142. New Jersey, USA.
[Sokolova et al., 2006] Sokolova, M., Japkowicz, N., and Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In Australasian Joint Conference on Artificial Intelligence, pages 1015-1021. Springer.