Effective Subword Segmentation for Text Comprehension
Zhuosheng Zhang, Hai Zhao, Kangwei Ling, Jiangtong Li, Zuchao Li, Shexia He, Guohong Fu
Abstract—Representation learning is the foundation of machine reading comprehension and inference. In state-of-the-art models, character-level representations have been broadly adopted to alleviate the problem of effectively representing rare or complex words. However, a character is not a natural minimal linguistic unit for representation or word embedding composition, since it ignores the linguistic coherence of consecutive characters inside a word. This paper presents a general subword-augmented embedding framework for learning and composing computationally derived subword-level representations. We survey a series of unsupervised segmentation methods for subword acquisition and different subword-augmented strategies for text understanding, showing that subword-augmented embedding significantly improves our baselines in various types of text understanding tasks on both English and Chinese benchmarks.
Index Terms—Subword Embedding, Machine Reading Comprehension, Textual Entailment, Word Segmentation
I. INTRODUCTION
The fundamental part of deep learning methods applied to natural language processing (NLP), distributed word representation, namely word embedding, provides a basic solution to text representation for NLP tasks and has proven useful in various applications, including textual entailment [49, 56] and machine reading comprehension (MRC) [10, 42, 52, 53]. However, deep learning based NLP models usually suffer from rare and out-of-vocabulary (OOV) word representation [31, 41], especially for low-resource languages. Besides, most word embedding approaches treat word forms as atomic units, which is problematic for the many words that actually have a complex internal structure. Especially, rare words like morphologically complex words and named entities are often expressed poorly due to data sparsity. Actually, plenty of words share some conjunct written units, such as morphemes, stems and affixes. The models would benefit a lot from distilling these salient units automatically.
Manuscript received November 10, 2018; revised March 24, 2019; accepted June 04, 2019. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (U1836222 and 61733011). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Min Zhang (Corresponding author: Hai Zhao). Zhuosheng Zhang, Hai Zhao, Kangwei Ling, Jiangtong Li, Zuchao Li, and Shexia He are with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai, China (e-mail: [email protected]; [email protected]; [email protected]; keep [email protected]; [email protected]; [email protected]). Guohong Fu is with the Institute of Artificial Intelligence, Soochow University, China (e-mail: [email protected]). Part of this study has been published as
Subword-augmented Embedding for Cloze Reading Comprehension [54] in COLING-2018. This paper extends the previous byte pair encoding (BPE) subword method to introduce a unified segmentation framework, and conducts comprehensive experiments on both reading comprehension and textual entailment tasks, considering multilingual effectiveness, generalization ability on different benchmarks, and thorough case studies. The code has been released at https://github.com/cooelf/subword_seg.

Character-level embedding has been broadly used to refine the word representation [25, 27, 31, 51], proving beneficially complementary to word representations. Concretely, each word is split into a sequence of characters. Character representations are obtained by applying neural networks to the character sequence of the word, and their hidden states form the representation.

However, character is not the natural minimal linguistic unit, which makes it quite valuable to explore the potential unit (subword) between character and word to model subword morphologies or lexical semantics. For English, there are only 26 letters. Using such a small character vocabulary to form the word representations could be too coarse. Even for a language like Chinese with a large set of characters (typically, thousands), many of which are semantically ambiguous, using character embedding below the word level to build the word representations would not be accurate enough, either. For example, for the internet neologism 老司机 (experienced driver), the characters <老 (experienced, old), 司 (manage), 机 (machine)> would be somewhat removed from the meaning of the word, while the subwords <老 (experienced, old), 司机 (driver)>, with proper syntactic and semantic decomposition, give exactly the minimal meaningful units below the word level, which surely improve the later word representation. Thus, in either type of language, effective representation cannot be achieved accurately via a character-based process alone.

In fact, morphological compounding (e.g., sunshine or playground) is one of the most common and productive methods of word formation across human languages, and most rare or OOV words can be segmented into meaningful fine-grained subword units for accurate learning and representation, which inspires us to represent words by meaningful subword units. Recently, researchers have started to work on morphologically informed word representations [2, 4, 7, 18], aiming at better capturing syntactic, lexical and morphological information. With flexible subwords from either source, we do not necessarily need to work with characters, and segmentation can be stopped at the subword level. With related characters grouped into subwords, we hopefully reach a meaningful minimal representation unit.

Splitting a word into subword-level units and using these subwords to augment the word representation may recover the lost syntactic or semantic information that is supposed to be delivered by subwords. For example, understanding could be split into the following subwords: <under, stand, ing>. Previous work usually considered prior linguistic knowledge based methods to tokenize each word into subwords (namely, morphology-based subwords). However, such treatment may encounter two main inconveniences. First, the subwords resulting from linguistic knowledge, typically morphological suffixes, prefixes or stems, may not be suitable for the targeted NLP tasks. Second, linguistic knowledge or related annotated lexicons or corpora may not even be available for a specific language or task. Thus, in this work we consider computationally motivated subword tokenization approaches instead.

We present a unified representation learning framework for subword-level information enhanced text understanding and survey various computationally motivated segmentation methods. Specifically, we consider the subword as the basic unit in our models and manipulate the neural architecture accordingly. The proposed method takes variable-length subwords segmented by unsupervised segmentation measures, without relying on any predefined linguistic resource. First, a goodness score is computed for each n-gram using the selected goodness measure to form a dictionary. Then a segmentation or decoding method is applied to tokenize words into subwords based on the dictionary. The proposed subword-augmented embedding is evaluated on text understanding tasks, including textual entailment and machine reading comprehension, both of which are quite challenging due to the need for accurate lexical-level representation. Furthermore, we empirically survey various subword segmentation methods from a computational perspective and investigate the better way to enhance the tasks, with thoughtful analysis and case studies.

The rest of this paper is organized as follows. The next section reviews the related work. Section 3 demonstrates our subword-augmented learning framework and implementation. Task details and experimental results are reported in Section 4, followed by case studies and analysis in Section 5 and the conclusion in Section 6.
TABLE I: A machine reading comprehension example.
Passage
Robotics is an interdisciplinary branch of engineering and science that includes mechanical engineering, electrical engineering, computer science, and others. Robotics deals with the design, construction, operation, and use of robots, as well as computer systems for their control, sensory feedback, and information processing. These technologies are used to develop machines that can substitute for humans. Robots can be used in any situation and for any purpose, but today many are used in dangerous environments (including bomb detection and deactivation), manufacturing processes, or where humans cannot survive. Robots can take on any form but some are made to resemble humans in appearance. This is said to help in the acceptance of a robot in certain replicative behaviors usually performed by people. Such robots attempt to replicate walking, lifting, speech, cognition, and basically anything a human can do.
Question
What do robots that resemble humans attempt to do?
Answer
replicate walking, lifting, speech, cognition

TABLE II: A textual entailment example.
Premise: Man grilling fish on barbecue
Hypothesis / Label:
The man is cooking fish. (Entailment)
The man is sailing a boat. (Contradiction)
The man likes to eat fish. (Neutral)
II. RELATED WORK
A. Augmented Embedding
To model texts in vector space, the input tokens are represented as embeddings in deep learning models [28, 29, 30, 45, 46, 55, 57]. Previous work has shown that word representations in NLP tasks can benefit from character-level models, which aim at learning language representations directly from characters. Character-level features have been widely used in language modeling [34, 38], machine translation [31, 41] and reading comprehension [42, 51]. Seo et al. [42] concatenated the character and word embeddings to feed a two-layer Highway Network. Cai et al. [6] presented a greedy neural word segmenter to balance word and character embeddings: high-frequency word embeddings are attached to character embeddings via average pooling, while low-frequency words are represented by character embeddings. Miyamoto and Cho [34] introduced a recurrent neural network language model with LSTM units and a word-character gate to adaptively find the optimal mixture of character-level and word-level inputs. Yang et al. [51] explored a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on properties of the words (e.g., named entity and part-of-speech tags).

However, character embeddings show only marginal improvement due to a lack of internal semantics. Recently, many techniques were proposed to enrich word representations with subword information. Bojanowski et al. [3] proposed to learn representations for character n-gram vectors and represent words as the sum of the n-gram vectors. Avraham and Goldberg [1] built a model inspired by Joulin et al. [22], using morphological tags instead of n-grams. They jointly trained their morphological and semantic embeddings, implicitly assuming that morphological and semantic information should live in the same space. Our work departs from previous work on morphologically-driven embeddings by focusing on embedding data-driven subwords. To handle rare words, Sennrich et al. [41] introduced the byte pair encoding (BPE) compression algorithm for open-vocabulary neural machine translation by encoding rare and unknown words as subword units. Zhang et al. [54] applied BPE to cloze-style reading comprehension to handle OOV issues. Different from the motivation of subword segmentation for rare word modeling, our proposed unified subword-augmented embedding framework serves a general purpose without relying on any predefined linguistic resources, and can be adopted to enhance the representation of each word by adaptively altering the segmentation granularity in multiple NLP tasks.

B. Text Comprehension
As a challenging task in NLP, text comprehension aims to read and comprehend a given text, and then answer questions
or make inference based on it. These tasks require a comprehensive understanding of natural languages and the ability to do further inference and reasoning. In this paper, we focus on two types of text comprehension, document-based question answering (Table I) and textual entailment (Table II), which share the similar genre of machine reading comprehension, though the task formations are slightly different.

In the last decade, the MRC tasks have evolved from the early cloze-style test [19, 20, 54] to span-based answer extraction from a passage [36, 39, 40]. The former has the restrictions that each answer should be a single word in the document and that the original sentence without the answer part is taken as the query. For the span-based one, the query is formed as a question in natural language whose answer is a span of text. Notably, Chen et al. [8] conducted an in-depth and thoughtful examination of the comprehension task based on an attentive neural network and an entity-centric classifier, with a careful analysis based on a handful of features. Then, various attentive models have been employed for text representation and relation discovery, including the Attention Sum Reader [23], the Gated-Attention Reader [15], the Self-matching Network [47] and the Attention-over-Attention Reader [12].

With the release of large-scale span-based datasets [21, 35, 39, 40, 48], which constrain answers to all possible text spans within the reference document, researchers are investigating models with more logical reasoning and content understanding [47, 48].

For the other type of text comprehension, natural language inference (NLI) is proposed to serve as a benchmark for natural language understanding and inference, which is also known as recognizing textual entailment (RTE). In this task, a model is presented with a pair of sentences and asked to judge the relationship between their meanings, including entailment, neutral and contradiction. Bowman et al. [5] released the Stanford Natural Language Inference (SNLI) dataset, a high-quality and large-scale benchmark, thus inspiring various significant works.

Most existing NLI models apply attention mechanisms to jointly interpret and align the premise and hypothesis, while transfer learning from external knowledge has become popular recently. Notably, Chen et al. [9] proposed an enhanced sequential inference model (ESIM), which employed recursive architectures in both local inference modeling and inference composition, as well as syntactic parsing information, for a sequential inference model. ESIM is simple with satisfactory performance, and is thus widely chosen as a baseline model. McCann et al. [32] proposed to transfer the LSTM encoder from neural machine translation (NMT) to the NLI task to contextualize word vectors. Pan et al. [37] transferred the knowledge learned from a discourse marker prediction task to the NLI task to augment the semantic representation.

III. OUR UNIFIED REPRESENTATION LEARNING FRAMEWORK
For generality, we consider an end-to-end model for either of the text comprehension tasks. Fig. 1 overviews the unified representation learning framework. The input tokens are segmented into subword units to further obtain the subword embeddings, which are then fed to downstream models along with the word embeddings. For textual entailment, the two input sequences are the premise and the hypothesis, and the output is the label. For reading comprehension, the two input sequences are the document and the question, and the output is the answer.

We apply unsupervised subword segmentation to produce the subwords for each token in the input sequence. Our subwords are formed as character n-grams and do not cross word boundaries. After splitting each word $k$ into a subword sequence, an augmented embedding (AE) is formed to straightforwardly integrate the word embedding $WE(k)$ and the subword embedding $SE(k)$ for a given word $k$:

$AE(k) = WE(k) \diamond SE(k)$  (1)

where $\diamond$ denotes the integration strategy. In this work, we investigate concatenation (concat), element-wise summation (sum) and element-wise multiplication (mul).

Suppose that word $k$ is formed by a sequence of subwords $[s_1, \dots, s_l]$, where $l$ is the number of subwords for word $k$. Then the subword-level representation of $k$ is given by the matrix $C_k \in \mathbb{R}^{d \times l}$, where $d$ denotes the subword embedding dimension. We employ a narrow convolution between $C_k$ and a filter $H \in \mathbb{R}^{d \times w}$ of width $w$ to obtain a feature map $f_k \in \mathbb{R}^{l-w+1}$. Taking one filter operation as an example, the $i$-th element of $f_k$ is given by

$f_k[i] = \tanh(\langle C_k[*, i:i+w-1], H \rangle + b)$  (2)

where $C_k[*, i:i+w-1]$ denotes the $i$-th to $(i+w-1)$-th columns of $C_k$ and $\langle A, B \rangle = \mathrm{Tr}(AB^{T})$ is the Frobenius inner product. Then a max pooling operation is adopted after the convolution, and we fetch the feature corresponding to filter $H$:

$y_k = \max_i f_k[i]$  (3)

Here we have described the process by which one feature is obtained from one filter matrix. For a total of $h$ filters $[H_1, \dots, H_h]$, $y_k = [y_k^1, \dots, y_k^h]$ is the distilled subword-level representation of word $k$. We then feed $y_k$ into a highway network [44] to select features individually for each subword-derived word representation, and the final subword embedding (SE) is obtained by

$SE(k) = t \odot g(W_H y_k + b_H) + (1 - t) \odot y_k$  (4)

where $g$ is a nonlinear function and $t = \sigma(W_T y_k + b_T)$ represents the transform gate; $W_H$, $W_T$, $b_H$ and $b_T$ are parameters.

The downstream model is task-specific. In this work, we focus on textual entailment and machine reading comprehension, which will be discussed later.
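As an illustration of Eqs. (1)-(4), the following NumPy sketch composes one subword-augmented word embedding: a narrow convolution over the subword embedding matrix, max-over-time pooling, a single highway layer, and concatenation with the word embedding. Shapes, variable names, and the choice of tanh for the nonlinearity g are illustrative assumptions, not the released implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def subword_augmented_embedding(word_emb, subword_embs, filters, bias,
                                W_H, b_H, W_T, b_T):
    """Compose a subword-augmented embedding AE(k) for one word.

    word_emb:      (d_w,)  pretrained word embedding WE(k)
    subword_embs:  (d, l)  matrix C_k of l subword embeddings of size d
    filters:       list of h filter matrices H_j, each of shape (d, w_j)
    bias:          (h,)    one bias per filter
    W_H, b_H, W_T, b_T: highway-layer parameters of shapes (h, h) and (h,)
    """
    d, l = subword_embs.shape
    feats = []
    for H, b in zip(filters, bias):
        w = H.shape[1]
        # narrow convolution: Frobenius inner product at every window (Eq. 2)
        f = np.array([np.tanh(np.sum(subword_embs[:, i:i + w] * H) + b)
                      for i in range(l - w + 1)])
        feats.append(f.max())          # max-over-time pooling (Eq. 3)
    y = np.array(feats)                # distilled subword representation y_k

    # one highway layer (Eq. 4); tanh stands in for the nonlinearity g
    t = sigmoid(W_T @ y + b_T)
    se = t * np.tanh(W_H @ y + b_H) + (1.0 - t) * y

    # integration strategy: concatenation of WE(k) and SE(k) (Eq. 1)
    return np.concatenate([word_emb, se])

# toy usage with random parameters, purely illustrative
rng = np.random.default_rng(0)
d, l, h, w, d_w = 8, 5, 6, 3, 16
ae = subword_augmented_embedding(
    word_emb=rng.normal(size=d_w),
    subword_embs=rng.normal(size=(d, l)),
    filters=[rng.normal(size=(d, w)) for _ in range(h)],
    bias=rng.normal(size=h),
    W_H=rng.normal(size=(h, h)), b_H=rng.normal(size=h),
    W_T=rng.normal(size=(h, h)), b_T=rng.normal(size=h))
print(ae.shape)  # (d_w + h,)
```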
A. Unsupervised Subword Segmentation

To segment subwords from a word that is regarded as a character sequence, we adopt and extend the generalized unsupervised segmentation framework proposed by Zhao and Kit [58], which was originally designed only for Chinese word segmentation.
Fig. 1: Architecture of the proposed subword-augmented embedding framework.

The generalized framework can be divided into two collocative parts: a goodness measurement, which evaluates how likely a subword is to be a 'proper' one, and a segmentation or decoding algorithm. The framework generally works in two steps. First, a goodness score $g(w_i)$ is computed for each n-gram $w_i$ (in this paper, gram always refers to character) using the selected goodness measure to form a dictionary $W = \{(w_i, g(w_i))\}_{i=1,\dots,n}$. Then a segmentation or decoding method is applied to tokenize words into subwords based on the dictionary.

Zhao and Kit [58] originally considered two decoding algorithms.

a) Viterbi: This style of segmentation searches for the segmentation with the largest sum of goodness scores for an input unsegmented sequence $T$ (either a word or a Chinese sentence).

b) Maximal Matching (MM): This is a greedy algorithm with respect to a goodness score. It works on $T$, repeatedly outputting the best current subword $w^*$ and continuing with $T = t^*$ for the next round, as follows:

$\{w^*, t^*\} = \arg\max_{wt=T} g(w)$  (5)

with each $\{w, g(w)\} \in W$.

In this work, we additionally introduce a third segmentation algorithm.

c) Byte Pair Encoding (BPE): Byte Pair Encoding (BPE) [17] is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. Different from the previous two algorithms, which segment the input sequence into pieces in a top-down way, BPE segmentation merges a full single-character segmentation into a reasonable segmentation in a bottom-up way. We formulate the generalized BPE-style segmentation as follows (a minimal code sketch of both MM decoding and generalized BPE merging is given after this list).

At the very beginning, all the input sequences are tokenized into sequences of single-character subwords. Then we repeat:
1) Calculate the goodness scores of all bigrams under the current segmentation status of all sequences.
2) Find the bigram with the highest goodness score and merge it in all the sequences. Note that the segmentation status is updated at this point.
3) If the number of merges has not reached the specified limit, go back to 1); otherwise the algorithm ends.
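As a concrete illustration of the two decoding styles above, here is a minimal Python sketch of greedy MM decoding and generalized BPE merging with a pluggable goodness function. The function names (mm_segment, bpe_merge) and the toy substring-frequency goodness are our own illustrative assumptions, not the released implementation.

```python
from collections import Counter

def mm_segment(word, goodness, max_len=6):
    """Greedy maximal-matching decoding (Eq. 5): repeatedly emit the prefix
    with the highest goodness score, then continue on the remaining suffix."""
    subwords = []
    while word:
        candidates = [word[:i] for i in range(min(max_len, len(word)), 0, -1)]
        best = max(candidates, key=goodness)
        subwords.append(best)
        word = word[len(best):]
    return subwords

def bpe_merge(words, goodness, num_merges):
    """Generalized BPE: start from single-character subwords and repeatedly
    merge, in every word, the adjacent pair with the highest goodness score."""
    segs = [list(w) for w in words]
    for _ in range(num_merges):
        pairs = Counter()
        for seg in segs:
            pairs.update(zip(seg, seg[1:]))
        if not pairs:
            break
        a, b = max(pairs, key=lambda p: goodness(p[0] + p[1]))
        for seg in segs:
            i = 0
            while i < len(seg) - 1:
                if seg[i] == a and seg[i + 1] == b:
                    seg[i:i + 2] = [a + b]
                else:
                    i += 1
    return segs

# toy frequency-style goodness: substring counts over a tiny word list
corpus = ["playground", "playing", "ground", "playgrounds"]
counts = Counter(w[i:j] for w in corpus
                 for i in range(len(w)) for j in range(i + 1, len(w) + 1))
print(mm_segment("playgrounds", lambda s: counts.get(s, 0)))
print(bpe_merge(corpus, lambda s: counts.get(s, 0), num_merges=4))
```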
In our work, we investigate three types of goodness measures to evaluate subword likelihood, namely Frequency, Accessor Variety, and Description Length Gain.
Frequency (FRQ): FRQ is simply defined as the count in the entire corpus of each n-gram being a subword candidate. We take a logarithmic form as the goodness score,

$g_{FRQ}(w) = \log \hat{p}(w)$  (6)

where $\hat{p}(w)$ is $w$'s frequency in the corpus.
Accessor Variety (AV): AV was proposed by Feng et al. [16] to measure how likely a subword is to be a true word. (Zhao and Kit [58] considered four types of goodness measures, but Branch Entropy is excluded here due to its similar performance to Accessor Variety according to their results.) The AV of a subword $x_i x_{i+1} \dots x_j$ (also denoted as $x_{i..j}$) is defined as

$AV(x_{i..j}) = \min\{L_{av}(x_{i..j}), R_{av}(x_{i..j})\}$  (7)

where the left and right accessor varieties $L_{av}(x_{i..j})$ and $R_{av}(x_{i..j})$ are the numbers of distinct predecessor and successor characters, respectively. As with FRQ, the goodness score is taken in logarithmic form, $g_{AV}(w) = \log AV(w)$.
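For concreteness, the following sketch computes the FRQ and AV goodness scores from raw word-internal n-gram statistics; the helper names and the boundary-marker trick are illustrative assumptions rather than the paper's exact implementation.

```python
import math
from collections import Counter, defaultdict

def ngram_statistics(words, max_len=6):
    """Count every word-internal n-gram and record its distinct left/right
    neighbor characters (word boundaries count as neighbors)."""
    counts = Counter()
    left, right = defaultdict(set), defaultdict(set)
    for w in words:
        padded = "^" + w + "$"
        for i in range(1, len(padded) - 1):
            for j in range(i + 1, min(i + max_len, len(padded) - 1) + 1):
                sub = padded[i:j]
                counts[sub] += 1
                left[sub].add(padded[i - 1])
                right[sub].add(padded[j])
    return counts, left, right

def g_frq(sub, counts, total):
    """Eq. 6: log of the candidate's relative frequency in the corpus."""
    return math.log(counts[sub] / total) if counts[sub] else float("-inf")

def g_av(sub, left, right):
    """Eq. 7: log of min(#distinct predecessors, #distinct successors)."""
    av = min(len(left[sub]), len(right[sub]))
    return math.log(av) if av else float("-inf")

words = ["understand", "understanding", "standing", "underground"]
counts, left, right = ngram_statistics(words)
total = sum(counts.values())
print(round(g_frq("stand", counts, total), 3), round(g_av("stand", left, right), 3))
```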
Description Length Gain (DLG): Wilks [50] proposed this goodness measure for compression-based segmentation. The DLG from replacing all occurrences of $x_{i..j}$ in a corpus $X = x_1 x_2 \dots x_n$ with a new subword is computed by

$DLG(x_{i..j}) = L(X) - L(X[r \to x_{i..j}] \oplus x_{i..j})$  (8)

where $X[r \to x_{i..j}]$ represents the resultant corpus obtained by replacing all items of $x_{i..j}$ with a new symbol $r$ throughout $X$, and $\oplus$ denotes concatenation. $L(\cdot)$ is the empirical description length of a corpus in bits, which can be estimated by the Shannon-Fano code or Huffman code, following classic information theory [43]:

$L(X) \doteq -|X| \sum_{x \in V} \hat{p}(x) \log \hat{p}(x)$  (9)

where $|\cdot|$ denotes the string length, $V$ is the vocabulary of $X$, and $\hat{p}(x)$ is $x$'s frequency in $X$. The goodness score is given by $g_{DLG}(w) = DLG(w)$.
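A rough sketch of estimating DLG for a single candidate, treating the corpus as a character sequence and using the entropy-based estimate of Eq. 9; the replacement symbol and helper names are illustrative simplifications.

```python
import math
from collections import Counter

def description_length(tokens):
    """Eq. 9: empirical description length (in bits) of a token sequence."""
    counts = Counter(tokens)
    n = len(tokens)
    return -n * sum((c / n) * math.log2(c / n) for c in counts.values())

def dlg(corpus, candidate, symbol="\x00"):
    """Eq. 8: description length gain obtained by replacing every occurrence
    of `candidate` with a fresh symbol r and appending one copy of it."""
    replaced = list(corpus.replace(candidate, symbol)) + list(candidate)
    return description_length(list(corpus)) - description_length(replaced)

corpus = "the playground and the ground and the player"
print(round(dlg(corpus, "the "), 2), round(dlg(corpus, "zzz"), 2))
```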
It is easy to find that BPE-style segmentation with the FRQ goodness measure (denoted as BPE-FRQ) is essentially identical to the BPE subword encoding of [41] in neural machine translation, which was originally motivated by the representation of infrequent (rare or OOV) words. Instead, we aim to refine the word representations by using subwords for both frequent and infrequent words, which is more generally motivated. To this end, we adaptively tokenize words at multiple granularities.

IV. EXPERIMENTS

In this section, we evaluate the performance of subword-augmented embedding on two kinds of challenging text understanding tasks, textual entailment and reading comprehension. Both tasks are quite challenging, and the latest performance improvements have already been very marginal. However, we present a new solution in a new direction instead of heuristically stacking attention mechanisms. Namely, we show that subword embedding has the potential to give further advances due to its meaningful linguistic augmentation, which has not yet been studied for the concerned tasks. Our evaluation aims to answer the following empirical questions:
1) Can subword-augmented embedding enhance the concerned tasks?
2) Can using subword-augmented embedding be generally helpful for different languages?
3) Can subword embedding help effectively model OOV or rare words?
4) Which is the best unsupervised subword segmentation method for text understanding?
5) Which is the best strategy to integrate word and subword embeddings?

TABLE III: Accuracy on SNLI dataset. SOTA is short for state-of-the-art.
Model                    Dev    Test
Baseline (Word + Char)   88.39  87.61
Word + Viterbi-AV        88.35  87.70
Word + Viterbi-FRQ       88.15  87.46
Word + Viterbi-DLG       88.31  87.53
Word + MM-AV             88.58  88.16
Word + MM-FRQ            88.45  88.05
Word + MM-DLG            88.61  88.28
Word + BPE-AV            88.42  88.11
Word + BPE-FRQ           88.56  88.36
Word + BPE-DLG
SOTA [24]                /      88.9
The default subword vocabulary size is set to 10k for the textual entailment task and 1k for the two reading comprehension tasks. The default integration strategy is concatenation in the following experiments. The above choices are based on the model performance on the development sets, and a detailed analysis is given in Section 5. Word embeddings are 200d and pre-trained by the word2vec [33] toolkit on a Wikipedia corpus (https://dumps.wikimedia.org/). Both character and subword embeddings are also 200d and randomly initialized from the uniform distribution on the interval [-0.05, 0.05]. Note that characters can be regarded as the minimal case of subwords, so we depict them separately in our experiments for better comparison and convenient demonstration.

In our preliminary experiments, we thoroughly explored all nine subword segmentation methods, pairing the three segmentation algorithms with the three goodness measures. We found that all Viterbi-based segmentations fail to show satisfactory performance, and we only report the three best-performing segmentation-goodness collocations for each task.

Our baseline models are selected for their simplicity and state-of-the-art performance in each task. We are interested in a subword-based framework that performs robustly across a diverse set of tasks. To this end, we follow the same hyper-parameters for each baseline model as the original settings from the corresponding literature [9, 15, 42], except those specified otherwise (e.g., subword dimension, integration strategy). Since ensemble systems and pre-training enhanced methods are commonly integrated with multiple heterogeneous models and resources and are thus not completely comparable, we only focus on evaluations of single models.

A. Textual Entailment
Textual entailment is the task of determining whether a hypothesis stands in an entailment, contradiction, or neutral relation to a given premise. The Stanford Natural Language Inference (SNLI) corpus [5] provides approximately 570k hypothesis/premise pairs.

Our baseline model is the Enhanced Sequential Inference Model (ESIM) [9], which employs a biLSTM to encode the premise and hypothesis, followed by an attention layer, a local inference layer, and an inference composition layer.
TABLE IV: Data statistics of CMRC-2017, PD and CFT.
CMRC-2017: Train / Valid / Test    PD: Train / Valid / Test    CFT: Test-human

To keep the model simple and concentrate on the performance of subword units, we do not integrate extra syntactic parsing features or increase the dimension of word embeddings. However, with the subword augmentation, our simple sequential encoding model yields substantial gains and achieves competitive performance with more complex state-of-the-art models (we only compare with currently published work from the SNLI leaderboard: https://nlp.stanford.edu/projects/snli/).

The dimensions of all the LSTM and fully connected layers were 300. We set the dropout rate to 0.5 for each LSTM layer and the fully connected layers. All feed-forward layers used ReLU activations. Parameters were optimized using Adam [26] with gradient norms clipped at 5.0. The initial learning rate was 0.001, which was halved every epoch after the second epoch. The batch size was 32.

Results in Table III show that subword-augmented embedding can boost our baseline (Word + Char) by +0.95% on the test set. Among the subword algorithms, BPE-DLG performs the best; its key difference from the other approaches is that BPE-DLG gives finer-grained bigrams like {ri, ch, ne, ss}, which could be potentially important for short-text modeling with a small word vocabulary, as in the textual entailment task.

B. Reading Comprehension
To investigate the effectiveness of the subword-augmented embedding in conjunction with more complex models, we conduct experiments on machine reading comprehension tasks. The reading comprehension task can be described as a triple $<D, Q, A>$, where $D$ is a document (context), $Q$ is a query over the contents of $D$, and a word or span in $D$ is the right answer $A$. The task can be divided into cloze-style and query-style. The former has the restrictions that each answer should be a single word appearing in the document and that the original sentence with the answer removed is taken as the query. For the query-style, the query is formed as a question in natural language whose answer is a span of text. To test the subword-augmented embedding in the multilingual case, we select three Chinese datasets, Chinese Machine Reading Comprehension (CMRC-2017) [14], People's Daily (PD) [11], and Children Fairy Tales (CFT) [11], and two English ones, the Children's Book Test (CBT) [20] and the Stanford Question Answering Dataset (SQuAD) [39], of which the first four are cloze-style and the last one is query-style.
1) Cloze-style: To verify the effectiveness of our proposed model for Chinese, we conduct multiple experiments on the three Chinese machine reading comprehension datasets, namely CMRC-2017, PD and CFT.
TABLE V: Accuracy on CMRC-2017 dataset.
Model                    Dev    Test
Baseline (Word + Char)   76.15  77.73
Word + MM-AV             77.80  77.80
Word + MM-DLG            77.30  77.17
Word + BPE-FRQ           78.95  78.80
SOTA [13]                77.20  78.63
Table IV gives the data statistics. Different from the current cloze-style datasets for English reading comprehension, such as CBT, Daily Mail and CNN [19], the three Chinese datasets do not provide candidate answers. Thus, the model has to find the correct answer from the entire document. Note that the test set of CMRC-2017 and the human evaluation test set (Test-human) of CFT are harder for the machine to answer, because the questions are further processed manually and may not accord with the pattern of automatically generated questions.

Our baseline model is the Gated-Attention (GA) Reader [15], which integrates a multi-hop architecture with a gated attention mechanism between the intermediate states of document and query. We used stochastic gradient descent with Adam updates for optimization. The batch size was 32 and the initial learning rate was 0.001, which was halved every epoch after the second epoch. We also used gradient clipping with a threshold of 10 to stabilize GRU training (Pascanu et al., 2013). We used three attention layers. The GRU hidden units for both the word and subword representations were 128. We applied dropout between layers with a dropout rate of 0.5.

a) CMRC-2017: Table V gives our results on the CMRC-2017 dataset, which shows that our Word + BPE-FRQ model outperforms all other models on the test set, even the state-of-the-art AoA Reader [13]. With the help of the proposed method, the GA Reader yields a new state-of-the-art performance on this dataset. Different from the textual entailment task above, the best subword segmentation tends to be BPE-FRQ instead of BPE-DLG. The divergence indicates that for a task like reading comprehension, involving long paragraphs with a huge vocabulary (the word vocabulary sizes of SNLI and CMRC-2017 are 30k and 90k, respectively), high-frequency words weigh more. In fact, as DLG measures words through more type statistics than direct frequency weighting, it can be seriously biased by noise in the vocabulary. Using frequency instead of DLG lets the segmentation resist the noise by keeping its focus on those high-frequency (and usually regular) words. Since we found stable performance gains in all our preliminary experiments, we focus on BPE-FRQ in the later, similar cloze-style evaluations and comparisons.
TABLE VI: Accuracy on PD and CFT datasets. Results of AS Reader and CAS Reader are from [11]. The result for GA Reader is based on our implementation. The previous state-of-the-art model is marked by †.

Model         PD Valid   PD Test   CFT Test-human
AS Reader     64.1       67.2      33.1
CAS Reader†

TABLE VII: Accuracy on CBT dataset. Results except ours are from previously published works [11, 15, 51]. The previous state-of-the-art model is marked by †.

Model                       CBT-NE Valid   CBT-NE Test   CBT-CN Valid   CBT-CN Test
Human                       -              81.6          -              81.6
LSTMs                       51.2           41.8          62.6           56.0
MemNets                     70.4           66.6          64.2           63.0
AS Reader                   73.8           68.6          68.8           63.4
Iterative Attentive Reader  75.2           68.2          72.1           69.2
EpiReader                   75.3           69.7          71.5           67.4
AoA Reader                  77.8           72.0          72.2           69.4
NSE                         78.2           73.2          74.3           71.9
FG Reader†
GA Reader                   76.8           72.5          73.1           69.6
Word + BPE-FRQ              78.5           74.9          75.0           71.6

b) PD & CFT:
Since there is no training set for the CFT dataset, our model is instead trained on the PD training set. Note that the CFT test set is processed by human evaluation and may not accord with the pattern of the PD training data. The results on the PD and CFT datasets are listed in Table VI, which shows that our Word + BPE-FRQ significantly outperforms the CAS Reader in all types of testing, with improvements of 7.0% on the PD and 8.8% on the CFT test sets, respectively. Considering that the domain and topic of the PD and CFT datasets are quite different, the results indicate the effectiveness of our model for out-of-domain learning.

c) CBT:
To verify whether our method can work for languages other than Chinese, we also evaluate the proposed method on an English benchmark, CBT, whose documents consist of 20 contiguous sentences from the body of a popular children's book, with queries formed by deleting a token from the 21st sentence. We only focus on the subsets where the answer is either a common noun (CN) or a named entity (NE), so that our task here is more challenging, as the answer is likely to be a rare word. For a fair comparison, we simply set the same parameters as before. We evaluate all the models in terms of accuracy, which is the standard evaluation metric for this task.

Table VII shows the results for CBT. We observe that our model outperforms most of the previously published works, with 2.4% gains on the CBT-NE test set compared with the GA Reader, which adopts word and character embedding concatenation. Our Word + BPE-FRQ also achieves performance comparable to the FG Reader, which adopts neural gates to combine word-level and character-level representations with the assistance of extra features including NE, POS and word frequency, while our model is much simpler and faster. This comparison shows that our Word + BPE-FRQ is not restricted to Chinese reading comprehension, but is also effective for other languages.

TABLE VIII: Exact Match (EM) and F1 scores on the SQuAD dev set. BiDAF α denotes BiDAF + Self-Attention and BiDAF β denotes BiDAF + Self-Attention + ELMo.

Model                      EM      F1
BiDAF
  Word + Char              68.23   77.95
  Word + MM-AV             68.86   78.44
  Word + MM-DLG            68.82   78.40
  Word + BPE-FRQ
BiDAF α
  Word + Char              71.22   80.42
  Word + MM-AV             72.46   81.28
  Word + MM-DLG            72.21   81.03
  Word + BPE-FRQ
BiDAF β
  Word + Char              77.43   85.03
  Word + MM-AV             77.49   85.23
  Word + MM-DLG            77.46   85.22
  Word + BPE-FRQ
TABLE IX: Embedding combinations on CMRC-2017.
Model                   Dev    Test
Word Only               74.90  75.80
Char Only               71.25  72.53
BPE-FRQ Only            74.75  75.77
Word + Char             76.15  77.73
Word + BPE-FRQ          78.95  78.80
Word + Char + BPE-FRQ
2) Query-style: The Stanford Question Answering Dataset (SQuAD) [39] contains 100k+ crowd-sourced question-answer pairs where the answer is a span in a given Wikipedia paragraph. Our basic model is Bidirectional Attention Flow (BiDAF) [42], and we improve it by adding a self-attention layer [47] and ELMo [38], similar to [10], to see whether subwords can still improve more complex models. The augmented embeddings of document and query are passed through a bidirectional GRU with shared parameters, and then fed to the BiDAF model. We then obtain the context vectors and pass them through a linear layer with ReLU activations, followed by a self-attention layer against the context itself. Finally, the results are fed through linear layers to predict the start and end tokens of the answer. For the hyper-parameters, the dropout rates for the GRUs and linear layers are 0.2. The dimensions of the GRU and linear layers are 90 and 180, respectively. We optimize the model using Adam. The batch size is 32. Table VIII shows the results on the dev set (since the test set is not released, we train our models on the training set and evaluate them on the dev set). We can see that for all the models, subword embeddings boost the performance significantly. Even for BiDAF α and BiDAF β, BPE-FRQ yields substantial performance gains (+1.57% EM, +1.36% F1 and +0.41% EM, +0.45% F1, respectively).

V. ANALYSIS
The experimental results have shown that the subword-augmented embedding can essentially improve baselines, from the simple to the complicated, across multiple tasks in different languages. Though the performance of BPE-FRQ tends to be the most stable overall, the best practice for subword embedding might be task-specific. This also discloses that there remains potential for a more effective goodness measure or segmentation algorithm to polish up the subword representations.
Fig. 2: Case study of the subword vocabulary size of BPE-FRQ: (a) accuracy on SNLI, (b) accuracy on CMRC-2017, (c) Exact Match and F1 on SQuAD.
A. Using Diverse Embedding Together
To see whether we can obtain further performance improvement by using different embeddings together, we compare the following embeddings: Word Only, Char Only, BPE-FRQ Only, Word + Char, Word + BPE-FRQ, and Word + Char + BPE-FRQ. Table IX shows the results. For each type of embedding alone, word embedding and BPE-FRQ subword embedding turn out to be comparable. BPE-FRQ performs much better than character embedding, which again confirms that subwords are more representative as minimal natural linguistic units than single characters. Any embedding combination improves the performance, as the distributed representations can benefit from different perspectives through diverse granularity. However, using all three types of embeddings shows only marginal improvement. This might indicate that increasing embedding features or dimensions does not necessarily bring much gain, and that seeking natural and meaningful linguistic units for representation is rather significant.
B. Subword Vocabulary Size
The segmentation granularity is highly related to the subword vocabulary size. For BPE-style segmentation, the resulting subword vocabulary size is equal to the number of merges plus the number of single-character types. To gain insight into this influence, we vary the number of BPE-FRQ merges from 0 to 20k and conduct a quantitative study on SNLI, CMRC-2017 and SQuAD. Fig. 2 shows the results. We observe that with 1k merges, the models obtain the best performance on CMRC-2017 and SQuAD, though these two tasks are in different languages, while 10k turns out to be more suitable for SNLI. The results also indicate that for a task like reading comprehension, the subwords, being a highly flexible grained representation between character and word, tend to be more like characters than words. However, when the subwords completely fall back to characters, the model performs the worst. This indicates that the balance between word and character is quite critical and an appropriate grain of character-word segmentation could essentially improve the word representation.

TABLE X: Different merging functions with word embeddings on SNLI and CMRC-2017.

Dataset   Strategy   Dev    Test
SNLI      concat
SNLI      sum        88.30  87.14
SNLI      mul        88.47  87.77
CMRC      concat     77.45  77.47
CMRC      sum        75.95  76.43
CMRC      mul
Fig. 3: Results of n-gram sizes of BPE-FRQ on the SQuAD dataset (Exact Match and F1).

C. Subword and Word Embedding Integration Strategies
We investigate the combination of the subword-augmented embedding with the word embedding. Table X shows the comparisons based on our best models on SNLI and CMRC-2017, BPE-DLG and BPE-FRQ, respectively. The models with concat and mul significantly outperform the model with sum. This reveals that the concat and mul operations might be more informative than sum, and the best choice would be task-specific. Though the concat operation may result in high dimensionality, it keeps more information for downstream models to select from. The superiority of mul might be due to the element-wise product being capable of modeling interactions and eliminating distribution differences between the word and subword embeddings, which is intuitively similar to endowing subword-aware attention over the word embedding. In contrast, sum is too simple to prevent detailed information loss.
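To make the three strategies of Eq. 1 and Table X explicit, here is a tiny illustrative sketch (dimensions are arbitrary; sum and mul additionally require the word and subword embeddings to share the same size):

```python
import numpy as np

def integrate(we, se, strategy="concat"):
    """Combine word embedding WE(k) and subword embedding SE(k) (Eq. 1)."""
    if strategy == "concat":
        return np.concatenate([we, se])   # keeps both signals; dim = len(we) + len(se)
    if strategy == "sum":
        return we + se                    # same dimension, signals blended additively
    if strategy == "mul":
        return we * se                    # element-wise product, one embedding gates the other
    raise ValueError("unknown strategy: " + strategy)

we, se = np.ones(4), np.arange(4.0)
for s in ("concat", "sum", "mul"):
    print(s, integrate(we, se, s))
```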
Fig. 4: Pair-wise attention visualization: (a) embedding of document and query; (b) final document and query representation.
Doc (extract): The cat was going to build a new house. His friends came to help. The elephant went into the woods for logs. The goat and dog cut the logs into planks. Soon afterwards the bear constructs a beautiful house. The cat said happily, "After decorating my house, I'll invite everybody to have a party in it." A few days later, friends came to the party happily. Upon entering the door, the cat fetched a small basin of water and said, "Your shoes will trample the carpet. Please take off your shoes and wash your feet, or leave." The elephant and bear looked at their own feet and the small basin, felt upset, saying, "Forget it, we'll never go in." Since then, no animal played with him any more. The house was his last friend.
Query: fetched a small basin of water and said, "Your shoes will trample the carpet. Please take off your shoes and wash your feet, or leave."
D. Effect of the n-grams

The goodness measures commonly build the subword vocabulary based on neighboring character relationships inside words. This is reasonable for Chinese, where words are commonly formed by two characters, which is also the original motivation for Chinese word segmentation. However, we wonder whether it would be better to use longer n-gram connections. We expand the n-grams of BPE-FRQ from 1 to 4. Fig. 3 shows the quantitative results. We observe that the n-gram size of BPE-FRQ segmentation only slightly influences the results, with 2 or 3 tending to be the better choice.

E. Visualization
To analyze the learning process of our models, we draw the attention distributions at intermediate layers based on an example from the CMRC-2017 dataset. Fig. 4 shows the result of the model with BPE-FRQ. We observe that the right answer (The cat) obtains a high weight after the pair-wise matching of document and query. After attention learning, the key evidence for the answer is collected and irrelevant parts are ignored. This shows that our subword-augmented embedding is effective at selecting the vital points at the fundamental embedding layer, guiding the attention layers to collect more relevant pieces.
F. Subword Observation
In text understanding tasks, if the ground-truth answer is an OOV word or contains OOV word(s), the performance of deep neural networks drops severely due to the incomplete representation, especially for a task like cloze-style reading comprehension where the answer is only one word or phrase. To get an intuitive observation for the task, we collect all 118 questions whose answers are OOV words (with their corresponding documents, denoted as OOV questions) from the CMRC-2017 test set, and use our model to answer these questions. We observe that only 2.54% could be correctly answered by the best Word + Char embedding based model. With BPE-FRQ subword embedding, 12.71% of these OOV questions could be correctly solved. This shows that the subword representations could be essentially useful for modeling rare and unseen words. In fact, the meaning of complex words like indispensability can be accurately refined by segmented subwords, as shown in Table XI. This also shows that subwords could help the models use morphological clues to form robust word representations, which is especially promising for obtaining fine-grained representations for low-resource languages.

TABLE XI: Examples of BPE-FRQ subwords.
Word                       Subwords
indispensability           in disp ens ability
intercontinentalexchange   inter contin ent al ex change
playgrounds                play ground s
大花猫                      大 花 猫
一步一个脚印                一 步 一个 脚 印

VI. CONCLUSION
Embedding is the fundamental part of deep neural networks, and it can also be the bottleneck of model strength. Building a more fine-grained representation at the very beginning could potentially guide the following networks, especially the attention components, to collect more important pieces. This paper presents a general yet effective architecture, subword-augmented embedding, to enhance the word representation
and effectively handle rare or unseen words. Experiments on five datasets from textual entailment and reading comprehension tasks demonstrate significant performance gains over the baselines. Unlike most existing works, which introduce either complex attentive architectures, handcrafted features, or extra knowledge resources, our model is much simpler yet effective. The proposed method takes variable-length subwords segmented by unsupervised segmentation measures, without relying on any predefined linguistic resource. Thus the proposed method is also suitable for various open-vocabulary NLP tasks. Our work discloses that the deep internals of subword-level embeddings are crucial, helping downstream models to absorb different signals.
REFERENCES
[1] Oded Avraham and Yoav Goldberg. The interplay of semantics and morphology in word embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 422–426, 2017.
[2] Toms Bergmanis and Sharon Goldwater. From segmentation to analyses: a probabilistic model for unsupervised morphology induction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 337–346, 2017.
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL), 5:135–146, 2017.
[4] Jan A. Botha and Phil Blunsom. Compositional morphology for word representations and language modelling. Proceedings of the 31st International Conference on Machine Learning (ICML), 32:1899–1907, 2014.
[5] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 20th Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642, 2015.
[6] Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. Fast and accurate neural word segmentation for Chinese. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 608–615, 2017.
[7] Kris Cao and Marek Rei. A joint model for word embedding and word morphology. In The Workshop on Representation Learning for NLP, pages 18–26, 2016.
[8] Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 2358–2367, 2016.
[9] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1657–1668, 2017.
[10] Christopher Clark and Matt Gardner. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 845–855, 2018.
[11] Yiming Cui, Ting Liu, Zhipeng Chen, Shijin Wang, and Guoping Hu. Consensus attention-based neural networks for Chinese reading comprehension. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (COLING), pages 1777–1786, 2016.
[12] Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1832–1846, 2017.
[13] Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 593–602, 2017.
[14] Yiming Cui, Ting Liu, Zhipeng Chen, Wentao Ma, Shijin Wang, and Guoping Hu. Dataset for the first evaluation on Chinese machine reading comprehension. arXiv preprint arXiv:1511.02301, 2017.
[15] Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1832–1846, 2017.
[16] Haodi Feng, Kang Chen, Xiaotie Deng, and Weimin Zheng. Accessor variety criteria for Chinese word extraction. Computational Linguistics, 30(1):75–93, 2004.
[17] Philip Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, 1994.
[18] Harald Hammarstrom and Lars Borin. Unsupervised learning of morphology. Computational Linguistics, 37(2):309–350, 2011.
[19] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28 (NIPS), pages 1693–1701, 2015.
[20] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.
[21] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1601–1611, 2017.
[22] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 427–431, 2017.
[23] Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 908–918, 2016.
[24] Seonhoon Kim, Jin Hyuk Hong, Inho Kang, and Nojun Kwak. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360, 2018.
[25] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), pages 2741–2749, 2016.
[26] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] Haonan Li, Zhisong Zhang, Yuqi Ju, and Hai Zhao. Neural character-level dependency parsing for Chinese. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), pages 5205–5212, 2018.
[28] Zuchao Li, Jiaxun Cai, Shexia He, and Hai Zhao. Seq2seq dependency parsing. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3203–3214, 2018.
[29] Zuchao Li, Shexia He, Jiaxun Cai, Zhuosheng Zhang, Hai Zhao, Gongshen Liu, Linlin Li, and Luo Si. A unified syntax-aware framework for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2401–2411, 2018.
[30] Zuchao Li, Shexia He, Hai Zhao, Yiqing Zhang, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. Dependency or span, end-to-end uniform semantic role labeling. arXiv preprint arXiv:1901.05280, 2019.
[31] Minh-Thang Luong and Christopher D. Manning. Achieving open vocabulary neural machine translation with hybrid word-character models. arXiv preprint arXiv:1604.00788, 2016.
[32] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems 30 (NIPS), 2017.
[33] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[34] Yasumasa Miyamoto and Kyunghyun Cho. Gated word-character recurrent language model. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1992–1997, 2016.
[35] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
[36] Kyosuke Nishida, Itsumi Saito, Atsushi Otsuka, Hisako Asano, and Junji Tomita. Retrieve-and-read: Multi-task learning of information retrieval and reading comprehension. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 647–656, 2018.
[37] Boyuan Pan, Yazheng Yang, Zhou Zhao, Yueting Zhuang, Deng Cai, and Xiaofei He. Discourse marker augmented network with reinforcement learning for natural language inference. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 989–999, 2018.
[38] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[39] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392, 2016.
[40] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 784–789, 2018.
[41] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1715–1725, 2016.
[42] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In ICLR 2017: 5th International Conference on Learning Representations, 2017.
[43] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
[44] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. arXiv preprint arXiv:1507.06228, 2015.
[45] Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. Sentence embedding for neural machine translation domain adaptation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 560–566, 2017.
[46] Rui Wang, Masao Utiyama, Andrew Finch, Lemao Liu, Kehai Chen, and Eiichiro Sumita. Sentence selection and weighting for neural machine translation domain adaptation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1727–1741, 2018.
[47] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 189–198, 2017.
[48] Yizhong Wang, Kai Liu, Jing Liu, Wei He, Yajuan Lyu, Hua Wu, Sujian Li, and Haifeng Wang. Multi-passage machine reading comprehension with cross-passage answer verification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1918–1927, 2018.
[49] Zhiguo Wang, Wael Hamza, and Radu Florian. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 4144–4150, 2017.
[50] Yorick Wilks. Unsupervised learning of word boundary with description length gain. In CoNLL-99: The SIGNLL Conference on Computational Natural Language Learning, pages 1–6, 1999.
[51] Zhilin Yang, Bhuwan Dhingra, Ye Yuan, Junjie Hu, William W. Cohen, and Ruslan Salakhutdinov. Words or characters? Fine-grained gating for reading comprehension. In ICLR 2017: 5th International Conference on Learning Representations, 2017.
[52] Shuailiang Zhang, Hai Zhao, Yuwei Wu, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. Dual co-matching network for multi-choice reading comprehension. arXiv preprint arXiv:1901.09381, 2019.
[53] Zhuosheng Zhang and Hai Zhao. One-shot learning for question-answering in Gaokao history challenge. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 449–461, 2018.
[54] Zhuosheng Zhang, Yafang Huang, and Hai Zhao. Subword-augmented embedding for cloze reading comprehension. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 1802–1814, 2018.
[55] Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, and Hai Zhao. Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pages 3740–3752, 2018.
[56] Zhuosheng Zhang, Yuwei Wu, Zuchao Li, Shexia He, Hai Zhao, Xi Zhou, and Xiang Zhou. I know what you want: Semantic learning for text comprehension. arXiv preprint arXiv:1809.02794, 2018.
[57] Zhuosheng Zhang, Yafang Huang, and Hai Zhao. Open vocabulary learning for neural Chinese Pinyin IME. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
[58] Hai Zhao and Chunyu Kit. An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP), pages 9–16, 2008.
Zhuosheng Zhang received the Bachelor's degree in internet of things from Wuhan University, Wuhan, China, in 2016. He has been working toward the Master's degree in computer science and engineering with the Center for Brain-like Computing and Machine Intelligence of Shanghai Jiao Tong University, Shanghai, China. His research interests lie within deep learning for natural language processing and understanding, and he is particularly interested in question answering and machine reading comprehension.

Hai Zhao received the BEng degree in sensor and instrument engineering and the MPhil degree in control theory and engineering from Yanshan University in 1999 and 2000, respectively, and the PhD degree in computer science from Shanghai Jiao Tong University, China, in 2005. He is currently a full professor at the Department of Computer Science and Engineering, Shanghai Jiao Tong University, which he joined in 2009. He was a research fellow at the City University of Hong Kong from 2006 to 2009, a visiting scholar at Microsoft Research Asia in 2011, and a visiting expert at NICT, Japan, in 2012. He is an ACM professional member, and served as area co-chair in ACL 2017 on Tagging, Chunking, Syntax and Parsing, and as (senior) area chair in ACL 2018 and 2019 on Phonology, Morphology and Word Segmentation. His research interests include natural language processing and related machine learning, data mining and artificial intelligence.

Kangwei Ling received the B.S. degree from Shanghai Jiao Tong University, Shanghai, China, in 2018. He is currently pursuing the M.S. degree in Computer Science at Columbia University. During his undergraduate study, he did research on natural language processing at BCMI, Shanghai Jiao Tong University. His research focuses on machine reading comprehension.

Jiangtong Li is an undergraduate student at Shanghai Jiao Tong University, Shanghai, China. Since 2015, he has been with the Center for Brain-like Computing and Machine Intelligence of Shanghai Jiao Tong University, Shanghai, China. His research focuses on natural language processing, especially dialogue systems.

Zuchao Li received the B.S. degree from Wuhan University, Wuhan, China, in 2017. Since 2017, he has been a Ph.D. student with the Center for Brain-like Computing and Machine Intelligence of Shanghai Jiao Tong University, Shanghai, China. His research focuses on natural language processing, especially syntactic and semantic parsing.

Shexia He received the B.S. degree from the University of Electronic Science and Technology of China in 2017. Since then, she has been a master's student in the Department of Computer Science and Engineering, Shanghai Jiao Tong University. Her research focuses on natural language processing, especially shallow semantic parsing and semantic role labeling.