Sentence Segmentation in Narrative Transcripts from Neuropsychological Tests using Recurrent Convolutional Neural Networks
Marcos Vinícius Treviso, Christopher Shulby, Sandra Maria Aluísio
[email protected] [email protected] [email protected]
Interinstitutional Center for Computational Linguistics (NILC), Institute of Mathematical and Computer Sciences, University of São Paulo
Abstract
Automated discourse analysis tools based on Natural Language Processing (NLP) aiming at the diagnosis of language-impairing dementias generally extract several textual metrics of narrative transcripts. However, the absence of sentence boundary segmentation in the transcripts prevents the direct application of NLP methods which rely on these marks to function properly, such as taggers and parsers. We present the first steps taken towards automatic neuropsychological evaluation based on narrative discourse analysis, presenting a new automatic sentence segmentation method for impaired speech. Our model uses recurrent convolutional neural networks with prosodic and Part of Speech (PoS) features, and word embeddings. It was evaluated intrinsically on impaired, spontaneous speech, as well as normal, prepared speech, and presents better results for healthy elderly (CTL) (F = 0.74) and Mild Cognitive Impairment (MCI) patients (F = 0.70) than the Conditional Random Fields method (F = 0.55 and 0.53, respectively) used in the same context of our study. The results suggest that our model is robust for impaired speech and can be used in automated discourse analysis tools to differentiate narratives produced by MCI and CTL.

Introduction

Mild Cognitive Impairment (MCI) has recently received much attention, as it may represent a pre-clinical state of Alzheimer's disease (AD). MCI can affect one or multiple cognitive domains (e.g. memory, language, visuospatial skills and the executive function); the kind that affects memory, called amnestic MCI, is the most frequent and the one which most often converts to AD (Janoutová et al., 2015).
As dementias are chronic progressive diseases, it is important to identify them in the early stages, because early detection yields a greater chance of success for non-pharmacological treatment strategies such as cognitive training, physical activity and socialization (Teixeira et al., 2012). The definition of MCI diagnostic criteria is conducted mainly by the cognitive symptoms presented by patients in standardized tests and by functional impairments in daily life (McKhann et al., 2011). Difficulties related to narrative discourse deficits (e.g. repetitions or gaps during the narrative) may lead an elderly individual to look for a specialist. Narrative discourse is the reproduction of an experienced episode (necessarily evoking memory), respecting temporal and causal relations among events. Although MCI is clinically characterized by episodic memory deficits, language impairment may also occur.

Certain widely used neuropsychological tests require patients to retell or understand a story. This is the case of the logical memory test, where one reproduces a story after listening to it. The higher the number of recalled elements from the narrative, the higher the memory score (Wechsler, 1997; Bayles and Tomoeda, 1991; Morris et al., 2006). However, the main difficulties in applying these tests are: (i) the time required, since it is a manual task; and (ii) the subjectivity of the clinician. Therefore, automatic analysis of discourse production is seen as a promising solution for MCI diagnosis, because its early detection ensures a greater chance of success in addressing potentially reversible factors (Muangpaisan et al., 2012). Since discourse is a natural form of communication, it favors the observation of the patient's functionality in everyday life. Moreover, it provides data for observing the interface between language and cognitive skills, such as executive functions (planning, organizing, updating and monitoring data).

With regard to the Wechsler Logical Memory (WLM) test, the original narrative used is short, allowing for the use of Automatic Speech Recognition (ASR) output even without capitalization and sentence segmentation, as shown by Lehr et al. (2012) for English. They based their method on automatic alignment of the original and patient transcripts in order to calculate the number of recalled elements.

The evaluation of narrative discourse production from the standpoint of linguistic impairment is an attractive alternative, as it allows for linguistic microstructure analysis, including phonetic-phonological, morphosyntactic and semantic-lexical components, as well as semantic-pragmatic macrostructures. Automated discourse analysis tools based on Natural Language Processing (NLP) resources and tools aiming at the diagnosis of language-impairing dementias via machine learning methods are already available for the English language (Fraser et al., 2015b; Yancheva et al., 2015; Roark et al., 2011) and also for Brazilian Portuguese (BP) (Aluísio et al., 2016). The latter study used a publicly available tool, Coh-Metrix-Dementia, to extract 73 textual metrics of narrative transcripts, comprising several levels of linguistic analysis from word counts to semantics and discourse. However, the absence of sentence boundary segmentation in transcripts prevents the direct application of NLP methods that rely on these marks in order for the tools to function properly. To our knowledge, only one study evaluating automatic sentence segmentation in English transcripts of elderly aphasics exists (Fraser et al., 2015a).

The purpose of this paper is to present our method, DeepBond, for automatic sentence segmentation of the spontaneous speech of healthy elderly (CTL) and MCI patients.
Although it was evaluated on BP data, it can be adapted to other languages as well. (Coh-Metrix-Dementia is available at http://143.107.183.175:22380/.)

Related Work

The sentence boundary detection task has been treated by many researchers. Liu et al. (2006) investigated the imbalanced data problem, since there are many more non-boundary words than boundary words; their study was carried out using two speech corpora, conversational telephone and broadcast news, both for English.

More recent studies have focused on Conditional Random Field (CRF) and neural network models. Wang et al. (2012) and Hasan et al. (2014) use CRF-based methods to identify word boundaries in speech corpora, more specifically on English broadcast news data and English conversational speech (lecture recordings), respectively. Khomitsevich et al. (2015), similar to our work, used a combination of two models, one based on Support Vector Machines to deal with prosodic information, and the other based on CRF to deal with lexical information. They combine the two models using a logistic regression classifier. Xu et al. (2014) use a combination of CRF and a deep neural network (DNN) to detect sentence boundaries in broadcast news data. Che et al. (2016) use two different convolutional neural networks (CNN), one which moves in only one dimension and another which moves in two. They achieved good results on a TED talks dataset. Tilk and Alumäe (2015) use a recurrent neural network (RNN) with long short-term memory units to restore punctuation in speech transcripts from broadcast news and conversations.

Although there are proposed methods for sentence segmentation of Portuguese datasets (Silla Jr and Kaestner, 2004; Batista and Mamede, 2011; López and Pardo, 2015), none of them is used for transcriptions produced in a clinical setting for the elderly with dementia and related syndromes.
The study most similar to our scenario is (Fraser et al., 2015a), which proposes a segmentation method for aphasic speech based on lexical, PoS and prosodic features, using tools and a generic acoustic model trained for English. Their approach is based on a CRF model, and the best results in this study were obtained for non-spontaneous broadcast news data.

Our method uses recurrent convolutional neural networks with prosodic and PoS features, as well as word embeddings, and was evaluated intrinsically on impaired, spontaneous speech and on normal, prepared speech. Although DNNs have already been used for this task, our work is the first, to the best of our knowledge, to evaluate them on impaired speech.

Datasets
A total of 60 participants from a research project on diagnostic tools for language-impairing dementias produced the narratives used to evaluate our method. Two datasets were used to train our model (Sections 3.1 and 3.2). As a preprocessing step, we removed capitalization information and, in order to simulate high-quality ASR, we left all speech disfluencies intact. Demographic information for the participants in our study is presented in Table 1. A third dataset was used in robustness tests (Section 3.3).
Table 1: Demographic information of participants in the Cinderella dataset. Avg. Education is given in years.

Info                 CTL    MCI    AD
Avg. Age             74.8   73.3   78.2
Avg. Education       11.4   10.8   8.6
No. of Male/Female   4/16   6/14   10/10
The Cinderella dataset consists of spontaneous speech narratives produced during a test to elicit narrative discourse with visual stimuli, using a book of sequenced pictures based on the Cinderella story. In the test, an individual verbally tells the story to the examiner based on the pictures. The narrative is manually transcribed by a trained annotator, who scores it by counting the number of recalled propositions.

This dataset consists of 60 narrative texts from BP speakers: 20 controls, 20 with AD, and 20 with MCI, diagnosed at the Medical School of the University of São Paulo and also used in Aluísio et al. (2016). Counting all patient groups, this dataset has an audio duration of 4h 11m and an average of about 30 sentences per narrative. AD narratives were only used for training the lexical model.

The Constitution dataset was made available by the LaPS (Signal Processing Laboratory) at the Federal University of Pará (Batista, 2013), and is composed of articles from Brazil's 1988 constitution, in which the speech is prepared and read. Each file averages 30 seconds. A preprocessing step removed lexical tips which indicate the beginning of articles, sections and paragraphs. This removal was carried out on both the transcripts and the audio. In addition, we separated the new dataset by articles, totaling 357 texts. Then, we marked the end of each article and paragraph and inserted punctuation at the end. Titles and chapters were ignored during this process. We randomly selected 60 texts from this dataset, with the sole condition that the number of sentences in each text was greater than 12. We refer to the large dataset as Constitution L, and to the dataset with the 60 texts as Constitution S. The average number of sentences per text in Constitution L is 7.56, while Constitution S has on average 23.48 sentences per text. The total audio duration of Constitution L is 7h 39m, and of Constitution S 3h 43m.

The Dog Story dataset is available from the BALE (Battery of Language Assessment in Aging, in English) instrument, described in (Jerônimo, 2016). It is composed of transcriptions from a narrative production test based on the presentation of a set of seven pictures telling the story of a boy who hides a dog that he found on the street (Le Boeuf, 1976). This battery was chosen because its aim is to allow for its administration to elderly people who are illiterate and/or of low educational level, who represent the majority of the aged sample assisted by the public health system in Brazil. This dataset consists of narrative transcripts from CTL and MCI participants, where the average number of sentences per narrative and the average sentence length are 16.60 and 6.58 words, respectively. Compared with the Cinderella dataset, it is composed of fewer sentences, and the sentences have fewer words on average.
Features

We divide our lexical features into two groups: PoS features and word embeddings, where every word is represented as a high-dimensional continuous vector.

Figure 1: Architecture of the RCNN for both the lexical and prosodic models.

The PoS features were extracted using a BP morphosyntactic tagger called nlpnet (nilc.icmc.usp.br/nlpnet/), trained on a revised version of the Mac-Morpho corpus (Fonseca et al., 2015), which contains a set of 25 tags. The word embeddings used in this work have 50 dimensions and were trained by Fonseca et al. (2015) with articles from the BP version of Wikipedia and a large journalistic corpus with articles from the news site G1 (g1.globo.com), totaling 240 million tokens and a vocabulary of 160,270 words. All of these tokens were lowercased and trained with the neural language model described in (Collobert et al., 2011).

We used three prosodic features, F0, intensity and duration, which were extracted at the phonetic level using PRAAT (Boersma and others, 2002) from forced-alignment output. Alignment was done using the HTK toolkit (Young et al., 2002) with clean speech corpora and a pronunciation dictionary phonetically transcribed by Petrus (Serrani, 2015) and augmented by our rule-based algorithm to insert multiple pronunciations, rendering a model suitable for ASR. The features were calculated for the first, last, penultimate and antepenultimate vowels of each word, and for pauses. These vowels were chosen based on knowledge of BP, which typically exhibits stress on the penultimate vowel, with notable patterns observed for final-vowel stress, for example in words ending in "i" ("Barueri") or a nasal consonant ("Renan"), and for antepenultimate-vowel stress (usually indicated by a stress diacritic), as in "helicóptero" ("helicopter"), "espírito" ("spirit") and "árvore" ("tree").
Also, Portuguese, like most Western languages, distinguishes sentence types by rising and falling pitch patterns, giving the listener a clue as to whether the speaker has finished a sentence or not. Pause duration was also calculated, since the length of a pause can be indicative of the presence of a punctuation mark (Beckman and Ayers Elam, 1997).

Method

To automatically extract features from the input and also deal with the problem of long dependencies between words, we propose a model based on recurrent convolutional neural networks (RCNN), inspired by the work of Lai et al. (2015). The architecture of our model can be seen in Figure 1. First, we show how to prepare the input for the network; then we go through the network's layers and describe the training procedure; finally, we discuss the experimental settings.
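As an aside, the per-word prosodic vector described in the previous section (three features over four vowel positions, plus pause duration) can be assembled as in the sketch below. The `word_vowels` structure and all numeric values are hypothetical stand-ins for PRAAT measurements over the forced-alignment output; this is an illustration, not the authors' implementation.

```python
# Sketch: assembling a per-word prosodic feature vector.
# Values are hypothetical; in the paper they come from PRAAT measurements.

VOWEL_SLOTS = ("first", "last", "penultimate", "antepenultimate")

def prosodic_vector(word_vowels, pause_after=0.0):
    """word_vowels maps a vowel slot to its (f0, intensity, duration) triple.
    Missing slots (e.g. short words) are zero-filled; a pause duration of
    0.0 denotes no pause after the word."""
    vec = []
    for slot in VOWEL_SLOTS:
        vec.extend(word_vowels.get(slot, (0.0, 0.0, 0.0)))
    vec.append(pause_after)
    return vec

# A two-syllable word followed by a 0.45 s pause:
v = prosodic_vector({"first": (120.0, 65.0, 0.08),
                     "last": (98.0, 60.0, 0.11)},
                    pause_after=0.45)
len(v)  # 4 slots x 3 features + 1 pause duration = 13
```

The zero-filling convention for short words is an assumption made for the sketch; the paper does not spell out how words with fewer than four vowels are handled.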
In our approach, the input to the network is a transcribed narrative, categorized as CTL (healthy elderly individuals) or MCI (MCI patients). The narratives contain a sequence of words w_1, w_2, ..., w_m. Each word is annotated with a label to indicate whether it precedes a boundary (y = B) or not (y = NB). We do not make a distinction between punctuation marks, so a boundary is defined as a period, exclamation mark, question mark, colon or semicolon. With this approach, we can treat the task as a binary classification problem.

Our input contains transcribed narratives with m words each. We represent narrative i as X_i ∈ R^{m×n}, X_i = {x_1, x_2, ..., x_m}, where n is the number of features. We represent the boundaries as Y_i ∈ {0, 1}^m, where 0 stands for NB and 1 denotes B.

Our final model consists of a combination of two models. The first model is responsible for treating only lexical information, while the second treats only prosodic information. Both models have the same architecture, shown in Figure 1. This strategy is based on the idea that we can train the lexical model with even more data, since textual information is easily found on the web. In order to obtain the most probable class y for the word w_j, a linear combination was created between the two models, where one receives the weighted complement of the other:

    α · P_lexical(y | w_j) + (1 − α) · P_prosodic(y | w_j)    (1)

The most probable class is then the one that maximizes this linear combination.

The data input for the lexical model is divided into two features: word embeddings with dimension |e_w|, and PoS tags with dimension |e_t|. Given a word w, its embedding e_w ∈ E_word is fetched and concatenated with the word's PoS vector e_t ∈ E_tag, thus obtaining a new vector of size d = |e_w| + |e_t|.
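Returning to Equation 1, the model combination amounts to a per-word weighted average of two probability estimates followed by an argmax. The sketch below is illustrative: the probability arrays are made up, and alpha is a tunable mixing weight (the paper finds the best value lends more weight to the lexical model; 0.7 here is an assumed example, not the paper's exact figure).

```python
# Sketch of Equation 1: mix lexical and prosodic boundary probabilities
# and take the most probable class (0 = NB, 1 = B) per word.

def combine(p_lexical, p_prosodic, alpha=0.7):
    """p_lexical, p_prosodic: per-word P(y = B | w_j) from each model."""
    labels = []
    for pl, pp in zip(p_lexical, p_prosodic):
        p_b = alpha * pl + (1 - alpha) * pp   # mixed P(y = B | w_j)
        labels.append(1 if p_b >= 0.5 else 0)
    return labels

# Three words: only the prosodic model is confident about the second one.
combine([0.1, 0.4, 0.9], [0.2, 0.9, 0.6])  # -> [0, 1, 1]
```

With alpha = 1.0 the prosodic model is ignored entirely, which is how the lexical-only ablations in the results section can be read.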
Out-of-vocabulary words share a single, randomly generated vector that represents an unknown word.

In the prosodic model, we directly feed information about pitch, intensity and duration from the first, last, penultimate and antepenultimate vowels of each word. Moreover, we feed the pause duration after each word, where a duration of zero seconds denotes no pause. Therefore, for the prosodic model, we have a vector whose dimension d covers the three prosodic features over the four vowel positions, plus the pause duration.

Once we have the matrix formed by the features of the words in the text, it is fed to the convolutional layer, which, in turn, is responsible for the automatic extraction of n_f new features depending on h_c neighboring words (Kim, 2014). The convolutional layer produces a new feature c_j by applying a filter W ∈ R^{h_c · d} to a window of h_c words x_{j−h_c+1:j} in a sentence of length m:

    c_j = f(W · x_{j−h_c+1:j} + b),    h_c ≤ j ≤ m    (2)

where b ∈ R represents a bias term and f is a non-linear function.

Our convolutional layer simply moves in one dimension, one step at a time, which gives us m − h_c + 1 generated features. Since we want to classify exactly m elements, we add p = ⌊h_c / 2⌋ zero-padding entries on both sides of the text. Applying this strategy to each entry x_j yields the complete feature map c ∈ R^{(m − h_c + 1) + 2p}. In addition, we apply a max-pooling operation over time, looking at a region of h_m elements to find the most significant features:

    ĉ_j = max { c_{j−h_m+1 : j} },    1 ≤ j ≤ m    (3)

The newly extracted features are fed into a bidirectional recurrent layer with n_r units. A recurrent layer is able to store historical information by connecting the previous hidden state with the current hidden state at each time t.
The values in the hidden and output layers are computed as follows:

    h_t = f(W_x x_t + W_h h_{t−1} + b_h)    (4)
    y_t = g(W_y h_t + b_y)    (5)

where W_x, W_h, and W_y are the connection weights, b_y and b_h are bias vectors, and f and g are non-linear functions. Here, we use a special unit known as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), which is able to learn long dependencies between words via a purpose-built memory cell. Figure 2 shows a single LSTM memory cell.

Figure 2: Diagram of an LSTM memory cell.

The LSTM updates for time step t are done as described by Jozefowicz et al. (2015), which is a slight simplification of the formulation of Graves and Jaitly (2014), where the memory cell is implemented as follows:

    i_t = σ(W_xi x_t + W_hi h_{t−1} + b_i)
    f_t = σ(W_xf x_t + W_hf h_{t−1} + b_f)
    o_t = σ(W_xo x_t + W_ho h_{t−1} + b_o)
    g_t = tanh(W_xc x_t + W_hc h_{t−1} + b_c)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t
    h_t = o_t ⊙ tanh(c_t)

where σ(z) = 1 / (1 + e^{−z}) is the sigmoid function, h_t ∈ R^{n_r} is the hidden unit, i_t ∈ R^{n_r} is the input gate, f_t ∈ R^{n_r} is the forget gate, o_t ∈ R^{n_r} is the output gate, g_t ∈ R^{n_r} is the input modulation gate, and c_t ∈ R^{n_r} is the memory cell unit, which is the summation of the previous memory cell modulated by the forget gate f_t and a function of the current input and previous hidden state modulated by the input gate i_t.

As in Graves and Jaitly (2014), we use the features by looking at forward states and backward states. This kind of mechanism is known as a bidirectional recurrent neural network (BRNN), since it learns weights based on both past and future elements given a time step t. In order to implement the BRNN, we reversed the sentences as a trick before feeding them to a regular LSTM layer, doubling the number of weights used in the recurrent layer.
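One step of the LSTM updates above can be sketched directly in NumPy. The dictionary-of-gates layout and the random weights are purely illustrative; a real implementation (e.g. in Theano, as used in this work) would fuse the gate matrices for efficiency.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM update. Wx: gate -> (n_r, d) input weights;
    Wh: gate -> (n_r, n_r) recurrent weights; b: gate -> (n_r,) biases."""
    i = sigmoid(Wx["i"] @ x_t + Wh["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(Wx["f"] @ x_t + Wh["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(Wx["o"] @ x_t + Wh["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(Wx["g"] @ x_t + Wh["g"] @ h_prev + b["g"])  # modulation gate
    c = f * c_prev + i * g                                  # memory cell
    h = o * np.tanh(c)                                      # hidden state
    return h, c

d, n_r = 5, 8                      # toy sizes; the paper uses n_r = 100
rng = np.random.default_rng(0)
Wx = {k: rng.normal(size=(n_r, d)) for k in "ifog"}
Wh = {k: rng.normal(size=(n_r, n_r)) for k in "ifog"}
b = {k: np.zeros(n_r) for k in "ifog"}
h, c = lstm_step(rng.normal(size=d), np.zeros(n_r), np.zeros(n_r), Wx, Wh, b)
h.shape  # (8,)
```

Because every gate output is squashed, each component of h stays strictly inside (−1, 1), which is what makes stacking and summing the forward and backward passes numerically well behaved.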
The output of this layer is the summation of the forward and backward outputs:

    y_t = ←y_t + →y_t    (6)

With a bidirectional LSTM layer, we are able to exploit the principle that nearby words have a greater influence on classification, while considering that words farther away can also have some impact. This often happens, for example, in the case of question words and conjunctions: por que ("why"); qual ("which"); quem ("who"); quando ("when"), etc.

After the BRNN layer, dropout is used to prevent co-adaptation of hidden units during forward-backpropagation, randomly ignoring some neurons in order to reduce the chance of overfitting the model (Srivastava et al., 2014). The last layer receives the output of the BRNN at each time step and passes it through a fully connected layer, where the softmax operation is calculated, giving us the probability of whether or not the word precedes a boundary:

    ŷ_t = softmax(W y_t + b)    (7)

where W ∈ R^{n_r × 2} is a matrix of weights, b ∈ R^2 is a bias vector, and softmax is defined as:

    s_j(z) = e^{z_j} / Σ_{k=1}^{K} e^{z_k},    for j = 1, 2, ..., K    (8)

We define all of the parameters to be trained as θ:

    θ = { E_word, E_tag, W^(c), b^(c), W^(f), b^(f), ←W^(r), ←b^(r), →W^(r), →b^(r) }    (9)

where E_word ∈ R^{|V| × |e_w|} is the lookup table for the word embeddings, E_tag ∈ R^{|V_tag| × |e_t|} is the lookup table for PoS tags, and |V| and |V_tag| are the vocabulary sizes for word embeddings and PoS tags, respectively. For the convolutional layer we have the weights W^(c) ∈ R^{n_f × h_c · d} and the bias vector b^(c) ∈ R^{n_f}. For the fully connected layer we have the weight matrix W^(f) ∈ R^{n_r × 2} and the bias vector b^(f) ∈ R^2. The parameters of the BRNN layer are divided into two sets: those from the forward pass and those from the backward pass.
Each set contains the input weights W^(r)_x ∈ R^{n_r × n_f}, the recurrent weights for previous hidden states W^(r)_h ∈ R^{n_r × n_r}, and the bias vectors b^(r) ∈ R^{n_r}, for all gates (i, f, o, g). Additionally, we have the output weights for a time step, W^(r)_y ∈ R^{n_r × n_r}, and a bias vector b_y ∈ R^{n_r}.

We define the loss function L as the categorical cross-entropy (Murphy, 2012), shown in the equation below, which minimizes the negative log-likelihood with respect to the weights. Since we have an unbalanced class problem, we give a different weight to each class, where the weight of the minority class (B) is greater than that of the majority class (NB):

    L(y, ŷ) = − Σ_i y_i log(ŷ_i) · cw_{y_i}    (10)

where y are the true targets, ŷ are our predictions, and cw_ℓ are the class weights for ℓ = B and ℓ = NB, calculated from the inverse frequency of each class:

    cw_ℓ = |y| / |y_{=ℓ}|    (11)

where |y_{=ℓ}| is the number of tokens with label ℓ. We minimize the loss function with respect to all weights θ using the RMSProp algorithm (Tieleman and Hinton, 2012) with backpropagation to compute the gradients ∇L. The update step at time t is made by normalizing the gradients by an exponentially moving average r_t:

    r_t = γ r_{t−1} + (1 − γ) ∇L(θ_t)²    (12)
    θ_{t+1} = θ_t − η ∇L(θ_t) / √(r_t + ε)    (13)

where η is the learning rate and 0 < γ < 1 is the forgetting factor.

Experimental Settings

We break the text into tokens delimited by spaces. We do not remove stopwords from the texts, since they can be important features in our domain. We ran 5-fold cross-validation for the group being analyzed (CTL or MCI), which leaves about 10% of the data for testing and the rest for training. The weight matrix for tag embeddings E_tag was generated randomly from a Gaussian distribution scaled by the fan-in and fan-out (Glorot and Bengio, 2010). Both embedding matrices E_word and E_tag were adjusted during training. We followed previous studies on sentence boundary detection to set the network hyper-parameters (Tilk and Alumäe, 2015; Che et al., 2016).
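The class-weighted cross-entropy of Equation 10 and a single RMSProp step (Equations 12-13) can be sketched as follows. Targets, predictions, gradients and the class weights here are made-up toy values; only gamma = 0.9 and eta = 0.001 follow the paper's settings.

```python
import numpy as np

def weighted_cross_entropy(y, y_hat, cw):
    """y: 0/1 targets; y_hat: predicted P(B) per word; cw: {0: w_NB, 1: w_B}."""
    loss = 0.0
    for t, p in zip(y, y_hat):
        p_true = p if t == 1 else 1 - p   # probability assigned to true class
        loss -= cw[t] * np.log(p_true)
    return loss

def rmsprop_step(theta, grad, r, gamma=0.9, eta=0.001, eps=1e-8):
    """One RMSProp update: running average of squared gradients (Eq. 12),
    then a normalized parameter step (Eq. 13)."""
    r = gamma * r + (1 - gamma) * grad ** 2
    theta = theta - eta * grad / np.sqrt(r + eps)
    return theta, r

y, y_hat = [0, 0, 1], [0.1, 0.2, 0.6]
cw = {0: 1.0, 1: 3.0}                     # minority boundary class weighted up
loss = weighted_cross_entropy(y, y_hat, cw)

theta, r = rmsprop_step(np.array([1.0]), np.array([0.5]), np.zeros(1))
```

Note how the B-class weight (3.0 here) triples the penalty for the missed boundary relative to an unweighted loss, which is the mechanism the paper uses to counter the heavy NB/B imbalance.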
The values for each hyper-parameter are shown in Table 2.

Table 2: RCNN hyper-parameters.

Var.    Parameter         Lexical   Prosodic
|e_w|   Word emb. size    50        -
|e_t|   Tag emb. size     10        -
n_f     Conv. filters     100       8
h_c     Filter length     7         5
h_m     Max-pool size     3         3
n_r     Recurrent units   100       100
γ       Forget factor     0.9       0.9
η       Learning rate     0.001     0.001

We tried three different learning rate values for both the lexical and prosodic models, and found that 0.001 yielded the best results. We trained our network over 20 epochs using a bucketing strategy, which groups training examples into buckets of similar sentence size. Our implementation is based on Theano (Bergstra et al., 2010), a library that defines, optimizes and evaluates mathematical expressions in an efficient way.

Results

We evaluated our method intrinsically and also compared it with the method developed by Fraser et al. (2015a) on all of the datasets. We also performed robustness tests to indicate how well our method responds to (i) test data that differs from the Cinderella training data and (ii) training data that differs from the Cinderella test data.

If we classified all words as NB, our method would have an accuracy above 90%. For this reason, we use the F metric, defined as the harmonic mean between precision and recall. And since we are more interested in knowing whether our method correctly identifies boundaries, we ignore the NB class and calculate F only for the positive class (B).

In this subsection, we evaluate the performance of our classifier (RCNN) on the Cinderella and Constitution datasets. Table 3 summarizes the results. From Table 3 we can see that our approach presents better results for the Constitution dataset than for Cinderella. This may be related to text quality, as the Cinderella transcripts present many disfluencies, characteristic of spontaneous speech. As expected, results for CTL were higher than for MCI, since CTL narratives contain fewer disfluencies. Another important observation is that our method performs much better than the baseline, where the baseline represents the results of a classifier that predicts all words as B. The Constitution results show us that traditional machine learning techniques used in NLP can be applied to this scenario, since the differences from the Cinderella data are few. Another reason that supports this statement is that the F results of related studies on sentence boundary detection based on well-written texts for two classes fall in a comparable range (Wang et al., 2012; Khomitsevich et al., 2015; Tilk and Alumäe, 2015; Che et al., 2016). When we compare the two Constitution sets, we find that corpus size does not greatly affect the results, since the results for Constitution S were slightly better than for Constitution L. We think that, even with less data, our method performs better on Constitution S because of the distribution of sentence quantity in the dataset, where Constitution S has an average of 23.48 sentences per text, while Constitution L has an average of only 7.56 sentences per text.
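The boundary-only evaluation described above, precision, recall and F computed solely for the B class, can be sketched as:

```python
# Precision, recall and F for the boundary class (B = 1) only; correct NB
# predictions are ignored, as described in the text.

def boundary_f1(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One boundary found, one spurious, one missed: P = R = F = 0.5.
boundary_f1([0, 0, 1, 0, 1], [0, 0, 1, 1, 0])  # -> 0.5
```

A degenerate all-NB classifier would score over 90% accuracy on this data yet F = 0 for the B class, which is why accuracy is not reported.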
Table 3: P, R and F for the boundary class, for each feature set, on the Cinderella and Constitution data using our method.

                 Cinderella CTL      Cinderella MCI      Constitution L      Constitution S
Features         P     R     F       P     R     F       P     R     F       P     R     F
Baseline         0.07  1.00  0.13    0.08  1.00  0.14    0.03  1.00  0.07    0.04  1.00  0.08
PoS              0.36  0.82  0.50    0.32  0.83  0.46    0.30  0.89  0.44    0.29  0.79  0.42
Prosody          0.20  0.59  0.30    0.19  0.58  0.29    0.54  0.84  0.66    0.48  0.85  0.61
Embeddings       0.70  0.70  0.70    0.63  0.77  0.69    0.60  0.63  0.63    0.60  0.64  0.62
PoS + Pros.      0.40  0.74  0.52    0.36  0.80  0.49    0.52  0.91  0.66    0.57  0.85  0.68
Emb. + PoS       0.71  0.72  0.71    0.64  0.75  0.69    0.64  0.72  0.68    0.63  0.67  0.65
Emb. + Pros.     0.71  0.74  0.72    0.64  0.77  0.70    0.71  0.83  0.76    0.74  0.81  0.77
All              0.72  0.76  0.74    -     -     0.70    -     -     -       -     -     -

We also evaluated the performance of different feature sets on our datasets. Embeddings have a great impact on both datasets. PoS information was influential on both datasets, but by a small margin, since it makes a small difference when combined with embeddings (0.01 on the Cinderella data and 0.03 on the Constitution data). This tells us that embeddings already carry enough morphosyntactic information. It is evident that the weight of the prosodic features is higher on the Constitution data, which is based on prepared speech, than on Cinderella. This result is consistent with those found by Kolár et al. (2009) and Fraser et al. (2015a). We also believe that the quality of the audio recordings may have affected the weight of the prosodic features, since the Constitution dataset was recorded by speech processing experts in a studio and the Cinderella dataset was recorded in a clinical setting. In light of this, we can see that our method performs best when all features are used. Furthermore, the best results were obtained with an α above 0.5 in the linear combination of Equation 1, showing that our model lends more weight to the lexical model.

In order to compare our model with related work, we replicated the approach proposed by Fraser et al. (2015a), which uses a CRF model for sentence segmentation. To justify the choice of a recurrent convolutional model, we also split our method in three: (i) Multilayer Perceptron (MLP): we removed the convolutional and recurrent layers of our model and added a hidden fully connected layer with 100 units and sigmoid activation; (ii) CNN: we simply removed the recurrent layer from our model and passed the output of the convolutional layer to the fully connected layer; (iii) Recurrent Neural Network (RNN): analogously to the CNN model, we removed the convolutional layer and connected the embedding layer to the recurrent layer.
The results for each method are presented in Table 4. Our method achieved the best results on both datasets. The CRF method, used by Fraser et al. (2015a), obtained the worst results on the Constitution data and, on the Cinderella data, outperformed only the RNN. These results are similar to those reported in their paper, which suggests that our replication was faithful. We believe that the RNN performed poorly because it has a large set of weights to train, and with relatively little data it failed to achieve good results. This may be related to the fact that LSTM units are very complex and need more data to converge. Looking at the RNN results on the Constitution data, which contain about three times more words than the Cinderella data, the improvement illustrates this dependence on corpus size.

The MLP and CNN alone were able to achieve better results than the CRF and RNN, but the MLP results for the MCI subset were not as good as the CNN's, which indicates that the MLP alone is not able to deal with narratives that are potentially impaired. For the Constitution data, however, the MLP obtained results very close to those of our best method.

Our RCNN achieved the best results on both datasets, implying that a union of these models was a good choice for dealing with impaired speech. We believe that the greatest influence came from the CNN, and that the addition of a recurrent layer with LSTM units handled some particular cases, likely over long dependencies, similar to the findings of Tilk and Alumäe (2015), where the CNN alone could not do so due to the fixed filter length in the convolution process, a result also noted in (Che et al., 2016).

Method   Cinderella CTL    Cinderella MCI    Constitution (L)   Constitution (S)
         P    R    F       P    R    F       P    R    F        P    R    F
CRF      0.70 0.45 0.55    0.62 0.46 0.53    0.89 0.36 0.51     0.84 0.34 0.48
MLP      0.59 0.79 0.67    0.47 0.80 0.59    0.75 0.79 0.77     0.76 0.80 0.78
RNN      0.27 0.68 0.39    0.73 0.25 0.37    0.43 0.92 0.58     0.44 0.85 0.57
CNN      0.64 0.79 0.71    0.59 0.77 0.67    0.65 0.85 0.73     0.58 0.89 0.70
RCNN     0.72 0.76 0.74    -    -    0.70    -    -    -        -    -    -

Table 4: Best F results for each method (P = precision, R = recall, F = F-measure).

Robustness was evaluated by measuring F on both out-of-genre and in-genre data. The results for each configuration are presented in Table 5.

Trained on     Tested on          P     R     F
Constitution   Cinderella CTL     0.19  0.29  0.23
Constitution   Cinderella MCI     0.20  0.25  0.22
Cinderella     Dog story CTL      0.72  0.62  0.66
Cinderella     Dog story MCI      0.65  0.64  0.64

Table 5: Results for the robustness tests.

We evaluated our method by changing the corpus genre: training on the Constitution data and testing on the Cinderella dataset. Our method performed poorly in this scenario, probably because the lexical clues differ strongly between these datasets: the Constitution is composed of prepared speech, while Cinderella is composed of spontaneous speech. When we maintain the corpus genre but change the story used in the neuropsychological test, our method can still achieve good results, with only a small drop from our best results for both CTL and MCI. We believe these results are related to the linear combination weight from Equation 1: they were obtained with a smaller α, lending less weight to the prosodic model than in our best configuration. Since the Dog Story and Cinderella datasets are both composed of spontaneous speech, the lexical clues found in this kind of speech helped the method to achieve good performance.

We have shown that our model, a recurrent convolutional neural network, benefits from word embeddings and can achieve promising results even with a small amount of data. We found that our method performs better when speech is planned, since the prosodic features then lend more weight to the classification. Our method achieved good results on impaired speech transcripts even with little data, with an F of 0.74 for the CTL group, which is comparable with the results of other studies using broadcast news and conversational data (Wang et al., 2012; Khomitsevich et al., 2015; Tilk and Alumäe, 2015; Che et al., 2016). Moreover, our method achieved good results in robustness tests when we changed the story used in the neuropsychological test.

As for future work, we plan to evaluate our method on English data for comparison with related work.
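The role of the interpolation weight in Equation 1 and the F-measure reported in Tables 4 and 5 can be sketched as follows. This is a minimal illustration only: it assumes the combination has the form p = α·p_prosodic + (1 − α)·p_lexical with a 0.5 decision threshold, and the function names and toy probabilities are ours, not the authors' implementation.

```python
# Sketch of interpolating prosodic and lexical boundary probabilities
# (assumed form of Equation 1) and of the P/R/F evaluation.

def combine(p_prosodic, p_lexical, alpha):
    """Linearly interpolate per-word boundary probabilities."""
    return [alpha * p + (1 - alpha) * l for p, l in zip(p_prosodic, p_lexical)]

def f_measure(pred, gold):
    """Precision, recall and F1 over predicted sentence boundaries."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: boundary probabilities for six word positions.
prosodic = [0.9, 0.2, 0.1, 0.8, 0.3, 0.7]
lexical  = [0.6, 0.1, 0.4, 0.9, 0.2, 0.2]
scores = combine(prosodic, lexical, alpha=0.5)
pred = [s >= 0.5 for s in scores]          # threshold into boundary decisions
gold = [True, False, False, True, False, True]
print(tuple(round(x, 2) for x in f_measure(pred, gold)))  # -> (1.0, 0.67, 0.8)
```

A larger α lets the prosodic model dominate the decision, which is consistent with the robustness results: a smaller α, favoring lexical clues, worked better for the spontaneous-speech narratives.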
Also, we plan on using more text data to train the lexical model, since it is independent of the prosodic model and carries more weight in our evaluations. Moreover, we will evaluate our method on the output of an ASR system for BP, as a higher word recognition error rate can greatly affect our results. Lastly, we would like to evaluate our method on datasets with higher-quality audio, more robust acoustic models, and a manually aligned portion of the database, since better audio segmentation would greatly improve the model and the usefulness of the prosodic features.

With respect to improvements in the corpus, our dataset consists of spontaneous speech narratives and was annotated only with periods. Since the narratives contain initial conjunctions such as "and", "moreover", and "however", we could also include commas, which would turn our task into a ternary classification problem. This could be handled by increasing the number of neurons in the last layer of our architecture.

Acknowledgments
We thank CNPq for a scholarship granted to the first author.

References

[Aluísio et al.2016] S. Aluísio, A. Cunha, and C. Scarton. 2016. Evaluating progression of Alzheimer's disease by regression and classification methods in a narrative language test in Portuguese. International Conference on Computational Processing of the Portuguese Language, pages 374–384, July.

[Batista and Mamede2011] Fernando Batista and Nuno Mamede. 2011. Recovering Capitalization and Punctuation Marks on Speech Transcriptions. Ph.D. thesis, Instituto Superior Técnico.

[Batista2013] Pedro dos Santos Batista. 2013. Avanços em reconhecimento de fala para português brasileiro e aplicações: ditado no LibreOffice e unidade de resposta audível com Asterisk.

[Bayles and Tomoeda1991] Kathryn Bayles and C. K. Tomoeda. 1991. ABCD: Arizona Battery for Communication Disorders of Dementia. Tucson, AZ: Canyonlands Publishing.

[Beckman and Ayers Elam1997] Mary E. Beckman and Gayle Ayers Elam. 1997. Guidelines for ToBI labelling. The Ohio State University Research Foundation.

[Bergstra et al.2010] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., pages 1–7.

[Boersma and others2002] Paul Boersma et al. 2002. Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341–345.

[Che et al.2016] Xiaoyin Che, Cheng Wang, Haojin Yang, and Christoph Meinel. 2016. Punctuation prediction for unsegmented transcript based on word vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

[Fonseca et al.2015] Erick R. Fonseca, João Luís G. Rosa, and Sandra Maria Aluísio. 2015. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society, 21(1):1.

[Fraser et al.2015a] Kathleen C. Fraser, Naama Ben-David, Graeme Hirst, Naida Graham, and Elizabeth Rochon. 2015a. Sentence segmentation of aphasic speech. In Proceedings of NAACL HLT 2015, the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 862–871.

[Fraser et al.2015b] Kathleen C. Fraser, Jed A. Meltzer, and Frank Rudzicz. 2015b. Linguistic features identify Alzheimer's disease in narrative speech. Journal of Alzheimer's Disease, 49(2):407–422.

[Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pages 249–256.

[Graves and Jaitly2014] Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In ICML, volume 14, pages 1764–1772.

[Hasan et al.2014] Madina Hasan, Rama Doddipatla, and Thomas Hain. 2014. Multi-pass sentence-end detection of lecture speech. In INTERSPEECH, pages 2902–2906.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Janoutová et al.2015] Jana Janoutová, Omar Šerý, Ladislav Hosák, and Vladimír Janout. 2015. Is mild cognitive impairment a precursor of Alzheimer's disease? Short review. Central European Journal of Public Health, 23(4):365.

[Jerônimo2016] Gislaine Machado Jerônimo. 2016. Produção de narrativas orais no envelhecimento sadio, no comprometimento cognitivo leve e na doença de Alzheimer e sua relação com construtos cognitivos e escolaridade. Ph.D. thesis.

[Jozefowicz et al.2015] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. 2015. An empirical exploration of recurrent network architectures. Journal of Machine Learning Research.

[Khomitsevich et al.2015] Olga Khomitsevich, Pavel Chistikov, Tatiana Krivosheeva, Natalia Epimakhova, and Irina Chernykh. 2015. Combining prosodic and lexical classifiers for two-pass punctuation detection in a Russian ASR system. In International Conference on Speech and Computer, pages 161–169. Springer.

[Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. Association for Computational Linguistics.

[Kolář et al.2009] Jáchym Kolář, Yang Liu, and Elizabeth Shriberg. 2009. Genre effects on automatic sentence segmentation of speech: A comparison of broadcast news and broadcast conversations. In ICASSP, pages 4701–4704. IEEE.

[Lai et al.2015] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2267–2273. AAAI Press.

[Le Boeuf1976] Christine Le Boeuf. 1976. Raconte: 55 historiettes en images. L'École.

[Lehr et al.2012] Maider Lehr, Emily Tucker Prudhommeaux, Izhak Shafran, and Brian Roark. 2012. Fully automated neuropsychological assessment for detecting mild cognitive impairment. In INTERSPEECH, pages 1039–1042.

[Liu et al.2006] Yang Liu, Nitesh V. Chawla, Mary P. Harper, Elizabeth Shriberg, and Andreas Stolcke. 2006. A study in machine learning from imbalanced data for sentence boundary detection in speech. Computer Speech and Language, 20(4):468–494.

[López and Pardo2015] Roque López and Thiago A. S. Pardo. 2015. Experiments on sentence boundary detection in user-generated web content. In Computational Linguistics and Intelligent Text Processing, pages 227–237.

[McKhann et al.2011] Guy M. McKhann, David S. Knopman, Howard Chertkow, Bradley T. Hyman, Clifford R. Jack Jr., Claudia H. Kawas, William E. Klunk, Walter J. Koroshetz, Jennifer J. Manly, Richard Mayeux, et al. 2011. The diagnosis of dementia due to Alzheimer's disease: Recommendations from the National Institute on Aging-Alzheimer's Association workgroups on diagnostic guidelines for Alzheimer's disease. Alzheimer's & Dementia, 7(3):263–269.

[Morris et al.2006] John C. Morris, Sandra Weintraub, Helena C. Chui, Jeffrey Cummings, Charles DeCarli, Steven Ferris, Norman L. Foster, Douglas Galasko, Neill Graff-Radford, Elaine R. Peskind, et al. 2006. The Uniform Data Set (UDS): clinical and cognitive variables and descriptive data from Alzheimer Disease Centers. Alzheimer Disease & Associated Disorders, 20(4):210–216.

[Muangpaisan et al.2012] Weerasak Muangpaisan, Chonachan Petcharat, and Varalak Srinonprasert. 2012. Prevalence of potentially reversible conditions in dementia and mild cognitive impairment in a geriatric clinic. Geriatrics & Gerontology International, 12(1):59–64.

[Murphy2012] Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.

[Roark et al.2011] Brian Roark, Margaret Mitchell, John-Paul Hosom, Kristy Hollingshead, and Jeffrey Kaye. 2011. Spoken language derived measures for detecting mild cognitive impairment. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2081–2090.

[Serrani2015] Vanessa Marquiafável Serrani. 2015. Ambiente web de suporte à transcrição fonética automática de lemas em verbetes de dicionários do português do Brasil.

[Silla Jr and Kaestner2004] Carlos N. Silla Jr. and Celso A. A. Kaestner. 2004. An analysis of sentence boundary detection systems for English and Portuguese documents. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 135–141. Springer.

[Srivastava et al.2014] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

[Teixeira et al.2012] Camila Vieira Ligo Teixeira, Lilian Teresa Bucken Gobbi, Danilla Icassatti Corazza, Florindo Stella, José Luiz Riani Costa, and Sebastião Gobbi. 2012. Non-pharmacological interventions on cognitive functions in older people with mild cognitive impairment (MCI). Archives of Gerontology and Geriatrics, 54(1):175–180.

[Tieleman and Hinton2012] Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2).

[Tilk and Alumäe2015] Ottokar Tilk and Tanel Alumäe. 2015. LSTM for punctuation restoration in speech transcripts. In INTERSPEECH.

[Wang et al.2012] Xuancong Wang, Hwee Tou Ng, and Khe Chai Sim. 2012. Dynamic conditional random fields for joint sentence boundary and punctuation prediction. In INTERSPEECH.

[Wechsler1997] David Wechsler. 1997. Wechsler Memory Scale (WMS-III). Psychological Corporation.

[Xu et al.2014] Chenglin Xu, Lei Xie, Guangpu Huang, Xiong Xiao, Engsiong Chng, and Haizhou Li. 2014. A deep neural network approach for sentence boundary detection in broadcast news. In INTERSPEECH, pages 2887–2891.

[Yancheva et al.2015] Maria Yancheva, Kathleen Fraser, and Frank Rudzicz. 2015. Using linguistic features longitudinally to predict clinical scores for Alzheimer's disease and related dementias. Page 134.

[Young et al.2002] Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, et al. 2002. The HTK Book.