From text saliency to linguistic objects: learning linguistic interpretable markers with a multi-channels convolutional architecture
Laurent Vanni, Marco Corneli, Damon Mayaffre, Frédéric Precioso
Univ. Côte d'Azur, BCL, UMR UNS-CNRS 7320, Nice, France
Inria, CNRS, Laboratoire J.A. Dieudonné, Maasai research team, Nice, France
Univ. Côte d'Azur, Center of Modeling, Simulation and Interactions, Nice, France
Univ. Côte d'Azur, I3S, UMR UNS-CNRS 7271, Nice, France
Abstract
A lot of effort is currently made to provide methods to analyze and understand the impressive performance of deep neural networks for tasks such as image or text classification. These methods are mainly based on visualizing the important input features taken into account by the network to build a decision. However, these techniques (let us cite LIME, SHAP, Grad-CAM, or TDS) require extra effort to interpret the visualization with respect to expert knowledge. In this paper, we propose a novel approach to inspect the hidden layers of a fitted CNN in order to extract interpretable linguistic objects from texts exploited in the classification process. In particular, we detail a weighted extension of the Text Deconvolution Saliency (wTDS) measure which can be used to highlight the relevant features used by the CNN to perform the classification task. We empirically demonstrate the efficiency of our approach on corpora from two different languages: English and French. On all datasets, wTDS automatically encodes complex linguistic objects based on co-occurrences and possibly on grammatical and syntactic analysis.

1 Introduction
Each author has a discursive identity made up of identifiable lexical and grammatical choices. Therefore, one of the challenges of deep learning on text is to describe these identities. Although it was shown in the literature that, in terms of accuracy, CNN based approaches outperform existing classifiers based on statistical key-indicators (e.g. the relative word frequency) or other machine learning techniques, it is still not clear if and how CNNs make use of standard features used in text mining (for instance word co-occurrences). We might also go further and assume that, for text classification, CNNs can rely on other complex linguistic structures that might be of interest for linguists. In the attempt to shed some light on this topic, our approach mainly relies on a deconvolution process (i.e. a transpose convolution), allowing us to interpret the CNN features in the input space.

This paper focuses on linguistic object analysis via a multi-channel convolutional architecture. That is, a CNN is trained to associate several parts of transcribed political speeches to their speaker (e.g. E. Macron and D. Trump). Our main contribution is an improvement of an existing measure, the Text Deconvolution Saliency (TDS, Vanni et al., 2018), called weighted Text Deconvolution Saliency (wTDS), allowing us to visualize the linguistic markers used by the CNN to perform the classification of a text, but also to make them fully interpretable for the linguists. In order to provide a relevant description of a dataset, wTDS is included in a model that introduces two further contributions: i) processing the CNN parameters in order to rank text segments assigned to an author from the most to the least representative of that author and ii) introducing a multi-channel CNN architecture in order to exploit additional linguistic information (e.g. lemma or part-of-speech) for each token. The next section describes some of the most representative related works. Two of them are discussed in more detail in order to motivate and better describe our own main contribution.
1.1 Related works

Since the seminal work of Collobert and Weston (2008), adopting CNNs for several NLP tasks (part-of-speech tagging, chunking, named entity recognition and semantic labeling), many researchers have widely used CNNs for similar and other purposes, such as text modeling (e.g. Kalchbrenner et al., 2014) or sentence classification (e.g. Kim, 2014). While CNNs are not the only available deep architecture in text mining, it has been noticed that they have several advantages with respect to recurrent architectures (RNNs, in particular LSTM and GRU) when performing key-phrase recognition (Yin et al., 2017). This supervised classification task is the one we address in this work. In particular, we aim at uncovering linguistic patterns used to highlight similarities and specificities (Feldman and Sanger, 2007; Lebart, Salem and Berry, 1998) in a corpus. Standard text analysis techniques originally relied on statistical scores, for instance on the relative frequency of words (a.k.a. z-scores, see Lafon, 1980). However, these techniques could not exploit more challenging linguistic features, such as syntactical motifs (Mellet and Longrée, 2009). In order to overcome these limitations and to account for long term dependencies in sentences, CNNs have been recently used. Indeed, CNNs being more robust than RNNs to the vanishing gradient problem, they might be able to detect links between different parts of a sentence (Dauphin et al., 2017; Wen et al., 2017; Adel and Schütze, 2017). This property is crucial, since it was shown that long range dependencies emerge in real data (Li et al., 2015). Aiming at inspecting these dependencies as well as other complex linguistic patterns, some tools explaining how CNNs perform the classification task are required. In this regard, a recent crucial contribution is represented by the Local Interpretable Model-agnostic Explanations (LIME, Ribeiro et al., 2016) framework. The basic idea of LIME is to approximate any complex classifier (e.g. a CNN) by a simpler one (e.g. sparse linear) in a neighborhood of a training point $x_i$. A simplified representation $\tilde{x}_i$ of $x_i$ is adopted, and $N$ points in a neighborhood of $\tilde{x}_i$ are sampled uniformly and used to minimize a distance between the original classifier and the simpler one. Once the simpler classifier is trained, it can be used to assess the (positive or negative) contribution of each feature to the classification task as easily as in linear models. This approach provides very interesting results and is generic, since it can provide explanations for any kind of classifier. However, for every training point it involves sampling $N$ neighbors and evaluating the classifier for each one of them. This might be computationally prohibitive, especially for high dimensional data.
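For illustration purposes, the snippet below sketches how LIME could be queried for a text classifier. It is only a sketch under assumptions: `segment` stands for one raw text segment, `predict_proba` for the trained CNN wrapped as a function from strings to class probabilities (here a dummy uniform stand-in so the example runs), and the class names and sample counts are illustrative, not the settings used in this paper.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

# Stand-in for the trained CNN wrapped as strings -> class probabilities;
# a dummy uniform classifier is used here so the sketch runs end to end.
class_names = ["Trump", "Obama", "Johnson"]
def predict_proba(texts):
    return np.full((len(texts), len(class_names)), 1.0 / len(class_names))

segment = "But for too many of our citizens, a different reality exists"
explainer = LimeTextExplainer(class_names=class_names)
explanation = explainer.explain_instance(
    segment,
    predict_proba,
    num_features=10,    # top tokens to report
    num_samples=5000,   # N perturbed neighbors: the computationally expensive part
)
print(explanation.as_list())  # [(token, signed contribution), ...]
```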
In the context of key-phrase recognition, an alternative approach was proposed by Vanni et al. (2018). They considered as input data text segments of fixed size ($M$ tokens). Each data point was represented as an $M \times D$ matrix, where $D$ is the word embedding size. After training a CNN for an author recognition task, they used a Deconvolution Network (Zeiler and Fergus, 2014) to project the feature map back into the input data space. Thus, the "deconvolution" assigns to the $m$-th token in the $i$-th text segment (say $d_{im}$) a vector $x_{im} \in \mathbb{R}^D$. The sum of its entries defines the Text Deconvolution Saliency (TDS) of $d_{im}$. Intuitively, the higher (respectively lower) the TDS of $d_{im}$, the more (less) $d_{im}$ contributed to assign the text segment to its class (i.e. its author). Although this approach returns meaningful results, it may suffer from some inconsistencies in the explanation, as will be shown in Section 2. In order to preserve the computational efficiency of TDS (once the CNN is trained, it can be computed at the cost of one model evaluation per data point) we propose an improved version of TDS (Section 2.2) overcoming the explanation drawbacks.

This paper is organized as follows: Section 2 describes our CNN architecture as well as our contributions. Section 3 illustrates the framework described in Section 2 on two datasets: an English corpus and a French corpus. Section 4 concludes the paper and outlines some perspectives for future research.

2 The model

The first part of this section details our model, a convolutional neural network, trained for author classification tasks. In this work, this task corresponds to an intermediate step but does not represent our final goal. Indeed, the scope is to learn how to exploit a trained CNN to recover linguistic markers specific to the different authors. Thus, after detailing the architecture, we focus on some original contributions to the linguistic feature extraction. Our main contribution, the weighted Text Deconvolution Saliency (wTDS), is described in Section 2.2. Two other contributions, the softmax breakdown ranking and the multi-channel convolutional lemmatization, are discussed in Section 2.3.
Notation.
In the following, $v \in \mathbb{R}^N$ denotes a real vector $v$ with $N$ entries. If not differently stated, it is intended to be a column vector. The notation $A \in \mathbb{R}^{M \times N}$ will be used to define a real matrix with $M$ rows and $N$ columns, and the function $\mathrm{relu}(\cdot)$ is defined as $\mathrm{relu}(x) = \max\{0, x\}$.

2.1 A CNN for author classification

The CNN considered takes as input $N$ text segments $d_1, \dots, d_N$, each containing a fixed number of tokens $M$. In the examples that we consider in Section 3 each segment is part of a presidential speech, so that the number of classes $K$ is the number of considered presidents. An embedding layer is used for word representation. Although this layer might rely on different well known models such as fastText (Bojanowski et al., 2017; Joulin et al., 2017), Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014), as long as a fine tuning of the embedding vectors is allowed during optimization, the choice of the embedding model is not crucial. Once the word feature vectors are obtained, they are concatenated (by row) in such a way to form a matrix with $M$ rows. This resulting matrix can then be input into a convolutional layer applying several filters all having the same width as the dimension of the embedding matrix. One max pooling layer follows, equipped with a non-linear activation function. A deconvolutional layer (up-sampling + convolution with transposed filters) is then introduced to bring the convolutional features back into the word embedding space. Finally, two fully connected layers and a softmax function output for each segment $d_i$ a vector $\hat{z}_i \in [0, 1]^K$, where $K$ is the number of classes/authors. The following multinomial cross-entropy loss function is considered:

$$L(\theta) := -\sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \log\big(\hat{z}_{ik}(\theta)\big) \qquad (1)$$

where $\theta$ denotes the set of all the network trainable parameters and $z \in \mathbb{R}^{N \times K}$ is an observed binary matrix whose $i$-th row encodes the class/author of the $i$-th text segment (thus $z_{ik} = 1$ iff $d_i$ is assigned to the $k$-th class/author). The above loss function is minimized with respect to $\theta$ via an Adam optimizer. In order to avoid overfitting, the whole dataset is split into train and validation sets and the loss function in Eq. (1) is monitored on the validation set during optimization, allowing us to apply early stopping (Prechelt, 1998) (Figure 1). A graphical representation of the model described so far can be seen in Figure 2.

[Figure 1: Model loss and accuracy]

[Figure 2: Three channels convolution/deconvolution for three representations of the input: 1) full-forms (words), 2) part-of-speech (POS), 3) lemma]

2.2 Weighted Text Deconvolution Saliency

After the CNN has been trained on the train dataset, it can assign a text segment $d_i$ (either in the train or in the validation set) to its class/author. We recall that $d_i$ can be viewed as a real matrix with $M$ rows, where $M$ is the number of tokens of $d_i$, and $D$ columns, where $D$ is the embedding size. The $m$-th token of $d_i$, corresponding to the $m$-th row of the matrix, is denoted by $d_{im}$ and it is a vector in $\mathbb{R}^D$. The deconvolutional layer (see Figure 2) assigns to every $d_{im}$ another vector of the same size denoted by $x_{im} \in \mathbb{R}^D$. Note that, since this representation is the output of two convolutional layers, it is sensitive to the context of $d_{im}$ (neighbor tokens). The Text Deconvolution Saliency (TDS, Vanni et al., 2018) of the token $d_{im}$ is defined as

$$TDS(d_{im}) = \sum_{d=1}^{D} x_{imd} \qquad (2)$$

where the real number $x_{imd}$ is the $d$-th entry of $x_{im}$. We stress that, although this measure is defined for each token of $d_i$, it also accounts for the context of that token (see also the experiments in Section 3).
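To make the architecture of Section 2.1 and the quantity summed in Eq. (2) concrete, here is a minimal single-channel sketch in Keras. It is an illustration under assumptions, not our exact configuration: the vocabulary size, filter count, kernel height and layer sizes are placeholders, and a plain Conv1D after up-sampling stands in for the convolution with transposed filters.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

M, D, V, K = 50, 128, 30000, 11   # tokens per segment, embedding size, vocabulary, classes

inp = layers.Input(shape=(M,), dtype="int32")
emb = layers.Embedding(V, D)(inp)                      # (M, D) word feature matrix
conv = layers.Conv1D(128, 3, padding="same", activation="relu")(emb)
pool = layers.MaxPooling1D(2)(conv)
up = layers.UpSampling1D(2)(pool)                      # "deconvolution" block: back in
deconv = layers.Conv1D(D, 3, padding="same")(up)       # R^{M x D}; summing its entries
                                                       # per token gives TDS, Eq. (2)
flat = layers.Flatten()(deconv)                        # X_i of Eq. (3) below
hidden = layers.Dense(256, activation="relu")(flat)    # relu(b + A X_i), size E
logits = layers.Dense(K)(hidden)                       # y_i = d + C relu(b + A X_i)
out = layers.Activation("softmax")(logits)             # z_hat_i

model = Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
early = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(X_train, z_train, validation_data=(X_val, z_val), callbacks=[early])
```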
The authors in Vanni et al. (2018) argue that the higher the TDS of a token, the more the token (conditionally to its context) plays a crucial role in the classification task, according to the CNN. As a matter of fact, even though TDS can correctly highlight the relevant words/contexts in $d_i$ being used by the CNN to classify $d_i$, it cannot tell us how the network uses them. To illustrate this point in more detail, consider the following extract from a speech by Donald Trump:

[...] neighborhoods for their families , and good jobs for themselves . These are just and reasonable demands of righteous people and a righteous public . But for too many of our citizens , a different reality exists : Mothers and children trapped in poverty in our inner cities ; rusted-out [...]

(D. Trump, the 20th of January 2017, Inaugural Address, United States Capitol Building in Washington, DC.)

This text is part of a corpus, described in Section 3, which collects several parts of speeches of the US presidents. Once properly trained for an author recognition task, the CNN detailed in the previous section can correctly recognize this speech as being pronounced by president Trump. In Figure 3a a histogram reports the TDS scores for the tokens of the extract. The higher the bars, the more the corresponding tokens had a key role in the classification task. Now, when comparing these TDSs with the word contributions detected by LIME (Figure 3b), we see that most of the tokens having a high TDS correspond to brown right bars having a positive impact in classifying the speech as "Trump" (e.g. righteous, people). Conversely, according to LIME, the noun "poverty" seems to have a negative boost when performing a binary classification "Trump" or "No Trump". Indeed, if we additionally compute the z-scores of the tokens of $d_i$ with respect to the whole corpus (Figure 4), we see that the noun "poverty" is underused by D. Trump, and this is in line with the explanation provided by LIME. However, this noun is very specific to another president in the corpus: L.B. Johnson. Thus, the importance of the word "poverty" was correctly captured by TDS, but we cannot say if that word contributed for "Trump" or against "Trump".

[Figure 3: Comparing the activation boost of the tokens toward the class "Trump" according to (a) TDS and (b) LIME]

[Figure 4: z-scores for the noun "poverty" for the US presidents in the analyzed corpus]

This motivated us to improve the TDS score initially proposed by Vanni et al. (2018) with two additional features: i) it should be able to go negative to indicate negative contributions of words to some classes and ii) in case of multi-class classification, for a word $d_{im}$ it should be able to quantify its contribution to each class. In order to build such a measure, note that the last two fully connected layers of the CNN basically map the de-convolved features $x_{i1}, \dots, x_{iM}$ into a single vector in $\mathbb{R}^K$, denoted $y_i$ (see Figure 2), where $K$ is the number of classes. If we concatenate $x_{i1}, \dots, x_{iM}$ into a column vector $X_i$ of size $D \cdot M$, the map can be specified as

$$y_i = d + C\big(\mathrm{relu}(b + A X_i)\big) \qquad (3)$$

where $A \in \mathbb{R}^{E \times DM}$, $b \in \mathbb{R}^E$, $C \in \mathbb{R}^{K \times E}$ and $d \in \mathbb{R}^K$, and $E$ is the size of the penultimate layer.
In order to obtain a score that is specific to the token $d_{im}$, we observe that

$$A X_i = \sum_{m=1}^{M} A_m x_{im}^{T} \qquad (4)$$

where $A_m \in \mathbb{R}^{E \times D}$ is the sub-matrix of $A$ obtained by selecting all the rows and the $D$ columns from the $(D(m-1)+1)$-th to the $(D(m-1)+D)$-th. Thus we define

$$wTDS(d_{im}) := d + C\big(\mathrm{relu}\big(b + A_m x_{im}^{T}\big)\big) \qquad (5)$$

Note that, in contrast with $TDS(d_{im})$, $wTDS(d_{im})$ is a vector with $K$ entries. Each entry quantifies the activation boost of the word $d_{im}$ (conditionally to its context) for the class $k$. Moreover, the matrix multiplication $A_m x_{im}^T$ induces $K$ weighted sums of the entries of $x_{im}$, in contrast with the simple sum defined in Eq. (2). For this reason we call the measure in Eq. (5) weighted Text Deconvolution Saliency (wTDS). Figure 5 shows the wTDSs for the class "Trump" of the tokens in the Trump speech reported above. As can be seen, the word "poverty" now has a small negative contribution when classifying the speech as "Trump". We notice that, once the CNN is trained, the computation of the wTDS for one token (for all the classes) has the cost of the matrix multiplications in Eq. (5). This is a huge advantage compared to LIME for two reasons: First, no sampling is required. Second, whereas LIME can only provide us with the token contributions in the binarized problem (e.g. "Trump" vs. "No Trump"), wTDS computes the token contributions to each class in one shot.

[Figure 5: wTDS for classes (a) "Trump" and (b) "Johnson" for the tokens in the sample speech of D. Trump]
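As a sketch of how cheap this computation is, the function below evaluates Eq. (5) in NumPy for every token and every class, assuming the weights $A$, $b$, $C$, $d$ of the two final dense layers and the de-convolved features of one segment have already been extracted from the trained network (names and shapes follow the notation above; the weight extraction itself is framework-dependent and omitted).

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def wtds(x_i, A, b, C, d):
    """Per-token, per-class wTDS of Eq. (5).

    x_i: (M, D) de-convolved features of one segment.
    A: (E, D*M), b: (E,), C: (K, E), d: (K,) -- weights of the two final dense layers.
    Returns an (M, K) array whose entry [m, k] is the boost of token m for class k.
    """
    M, D = x_i.shape
    K = d.shape[0]
    scores = np.empty((M, K))
    for m in range(M):
        A_m = A[:, m * D:(m + 1) * D]                # sub-matrix of A for the m-th token
        scores[m] = d + C @ relu(b + A_m @ x_i[m])   # Eq. (5)
    return scores
```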
2.3 Softmax breakdown ranking and multi-channel convolutional lemmatization

In the previous section, we described how, given an input text segment $d_i$, wTDS can be used to assess the contribution of each token in $d_i$ to the class assignment. Now, we zoom one step out and try to detect the key-segments in the dataset, i.e. the segments being the most representative of each author according to the CNN. In particular, it might be of interest to be able to rank $d_1, \dots, d_N$ from the most to the least representative for each author. A possible way to do that is described in the following. The number of neurons in the last layer of the deep CNN coincides with the number of classes, previously denoted by $K$. In the previous section $y_i \in \mathbb{R}^K$ denoted the value of that layer for the text segment $d_i$. Thus, $y_{ik}$ is the value of the $k$-th neuron and it is a real number. As usual, a softmax activation function is applied to $y_i$ in such a way to obtain $K$ probabilities $\hat{z}_{ik}$ (see Figure 2) lying in the $(K-1)$-simplex:

$$\hat{z}_{ik} = \frac{\exp(y_{ik})}{\sum_{j=1}^{K} \exp(y_{ij})} \qquad (6)$$

Note that the above $\hat{z}_{ik}$ is the very same as in Eq. (1). The highest probability $\hat{z}_{ik}$ corresponds to the class assigned by the network to the observation $d_i$. However, if one entry of $y_i$ is significantly higher than the others, it is mapped to one by the softmax transformation and all the other entries are mapped to zero. For instance, consider two de-convolved features $y_i$ and $y_j$ corresponding to two different documents both assigned to class $k$. Assume also that $y_{ik} > y_{jk}$, so that the document $d_i$ is more representative of the class than $d_j$. If $y_{ik}$ and $y_{jk}$ are large enough, after applying the softmax function they both will be mapped to one and it will no longer be possible to assess whether $d_i$ or $d_j$ is more representative of class $k$. Thus, we make an unconventional use of the trained deep neural network and observe the activation rate of the neurons before applying the softmax transformation, as sketched below. Doing so allows us to sort the learning data (text segments) based on their activation strengths. This simple but efficient method provides us with the most relevant key-segments in the corpus for each class.
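A minimal sketch of this ranking, assuming `model` is a trained Keras network like the one sketched in Section 2.1 and `segments` the encoded corpus (both names are illustrative); the only trick is to read the layer just before the softmax.

```python
import numpy as np
from tensorflow.keras import Model

# Truncate the trained network just before its softmax. In the sketch of
# Section 2.1 the softmax is a separate Activation layer, so the penultimate
# layer output is exactly the raw vector y_i of Eq. (3).
presoftmax = Model(model.input, model.layers[-2].output)
y = presoftmax.predict(segments)   # (N, K) raw activations y_ik

k = 0                              # class/author of interest
ranking = np.argsort(-y[:, k])     # segment indices, most representative first
# Sorting softmax outputs instead would saturate at 1.0 for strongly assigned
# segments; the raw activations keep them distinguishable.
```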
Often, CNNs for images have multiple channels. Indeed, the RGB color encoding can be considered as three different representations of the input. Each representation corresponds to a data matrix, and the convolutional layers apply different filters to each matrix and then later merge the results. Also with texts, it is possible to encode the data in multiple channels that might be used, for instance, to combine different word embedding solutions (skip-gram, cBow or GloVe). Apart from word embedding, a pre-tagging process (Collobert and Weston, 2008) allows data scientists and linguists to get supplementary material on each word, such as the part-of-speech (POS) and the lemma. Both of them are essential for a linguistic interpretation of the key-segments and to observe complex linguistic patterns (a.k.a. syntactical motifs, Mellet and Longrée, 2009). These reasons motivated us to implement a multi-channel CNN to account for the POS and the lemma. However, using a single multi-channel convolutional layer to learn those patterns from each representation is not convenient for our purposes. Indeed, the max pooling operations merge all the information into one channel, thus making it impossible to retrieve which representation (word, POS or lemma) contributed to the classification. Since the aim of our contribution is to interpret the classifier, we split the convolution (and the max pooling) in three parts, one for each channel (see Figure 2 and the sketch below). By doing that, the deconvolution mechanism can be applied to the three channels separately and all the linguistic features can be observed right after the deconvolutional layers. Finally, to combine this information, the features are merged into a global vector and the final dense layers use them to perform the class assignment. In more detail, the $m$-th token of the segment $d_i$ is now represented by three embedding vectors, say $d_{im}^{(w)}$ for the full form, $d_{im}^{(pos)}$ for the POS and $d_{im}^{(l)}$ for the lemma (see Figure 2). After deconvolution, these embedding vectors are mapped to $x_{im}^{(w)}$, $x_{im}^{(pos)}$ and $x_{im}^{(l)}$, respectively. Thus, whereas with a single channel $wTDS(d_{im})$ was a vector in $\mathbb{R}^K$, in a multi-channel environment we can define three wTDS vectors in $\mathbb{R}^K$ for each token. For instance, $wTDS(d_{im}^{(l)})$ refers to the lemma component of the $m$-th token and it can be computed as

$$wTDS(d_{im}^{(l)}) := d + C\Big(\mathrm{relu}\Big(b + A_m^{(l)} \big(x_{im}^{(l)}\big)^{T}\Big)\Big)$$

where $A_m^{(l)}$ denotes the sub-matrix accounting for the lemma channel (the green one in Figure 2) and the $m$-th token $x_{im}^{(l)}$.
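The layer splitting can be sketched as follows in Keras, one branch per representation; the vocabulary sizes, embedding and filter dimensions are illustrative, and as above a plain Conv1D stands in for the transposed-filter convolution.

```python
from tensorflow.keras import layers, Model

def branch(vocab, M=50, D=128):
    """One conv/pool/deconv branch for a single representation (word, POS or lemma)."""
    inp = layers.Input(shape=(M,), dtype="int32")
    h = layers.Embedding(vocab, D)(inp)
    h = layers.Conv1D(128, 3, padding="same", activation="relu")(h)
    h = layers.MaxPooling1D(2)(h)
    h = layers.UpSampling1D(2)(h)
    deconv = layers.Conv1D(D, 3, padding="same")(h)   # per-channel x^(w), x^(pos), x^(l)
    return inp, deconv

(in_w, x_w), (in_pos, x_pos), (in_l, x_l) = branch(30000), branch(40), branch(15000)

# Merge only after the deconvolutions, so each channel remains separately observable
merged = layers.Concatenate()([layers.Flatten()(t) for t in (x_w, x_pos, x_l)])
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(11, activation="softmax")(hidden)
model = Model([in_w, in_pos, in_l], out)
```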
3 Experiments

First, we want to thank the authors of TDS (Vanni et al., 2018) for providing us with their datasets. Political discourse analysis is one of the major challenges for linguistics in textual data analysis. For many years, statistics have provided tools and results that help linguists to interpret political speeches. We will now see how our deep architecture allows us to describe international political discourses. We propose to test our model by analysing two political discourse corpora in two different languages, English and French. For comparison reasons, these two corpora are made from presidential speeches and respect the same chronological span, from the 1960s to today.

The first dataset targets American political discourse. It is a corpus of 1.8 million words from the American presidents, from J.F. Kennedy in 1961 to D. Trump in 2019. Among these 11 presidents, we focus on D. Trump to make a short but profound linguistic analysis of the discourse of the current US president. The second dataset is symmetrical, with the speeches of the French presidents under the 5th Republic from 1958 to today. It covers 8 French presidents, from C. De Gaulle to E. Macron, with 2.7 million words; here we also focus on the current president, E. Macron.

By default, the accuracy of each model (English and French) exceeds 90%, but the markers displayed by the wTDS seem to be too sensitive to low frequencies (very rare linguistic markers) or, on the contrary, to words that are very frequent but unique to one president (high z-score). The purpose of our architecture being to observe new linguistic markers, different from those known by statistics, each corpus has been filtered with precise rules to reduce the weight of these markers. Some words have been replaced: i) proper names, ii) dates, iii) words only present in one president's speeches. These rules reduce model accuracy by about 10% but help to reduce overfitting and to extract relevant key-segments. Table 1 compares these models on the unfiltered (English, French) and filtered (English*, French*) datasets.

dataset    authors  vocab   words      acc
English    11       33279   1 815 839  90%
English*   11       14758   1 815 839  81%
French     8        46978   2 738 652  91%
French*    8        20211   2 738 652  84%

Table 1: English and French datasets.

3.1 The English corpus: D. Trump

Section 2.2 introduced a key-segment of D. Trump detected with the softmax breakdown ranking method, with a simple model using only one channel for the full-forms of words. With the multi-channel convolutional lemmatization (Section 2.3), we now have a wTDS score on each token for each channel, and this selected segment becomes fully interpretable for the linguists thanks to exploitable features on full-forms (blue words), part-of-speech (orange words) and lemmas (green words):

[...] neighborhoods for their families , and good jobs for themselves . These are just and reasonable demands of righteous people and a righteous public SENT But for too many of our citizens , a different reality exists : Mothers and children trapped in poverty in our inner cities ; rusted-out [...]
We highlight here the main activation zones having a wTDS higher than a fixed threshold. As can be seen, there is a redundancy of "righteous people" and "righteous public", being part of a simple and compassionate vocabulary (e.g. "families", "mothers", "children" or simply "good jobs"), which is typical of populist speeches.

"But" appears as a characteristic of the polemical discourse that defines Trump's rhetoric. The president rarely makes a consensual speech. Opposition marks, such as "But", allow him to build a speech setting him apart from the mainstream. "But" being placed at the beginning of the sentence, its full-form wTDS highlights its role as a conjunction of opposition rather than a conjunction of coordination.

We also report that the full-form wTDS for the word "many" is negative (Figure 3a). Since "many" is one of the words most often employed by president Trump (high z-score), a negative wTDS might appear surprising. However, in this context "many" is preceded by "too", which is taken into account by the convolution layer. Thus we checked the z-score of the linguistic pattern "too many", and we found out that it is higher for B. Obama than for D. Trump. This is a very good example of the capability of wTDS to capture the linguistic context.

Finally, the part-of-speech wTDS focuses on a simple but essential marker, the dot (encoded as "SENT"). The overuse of this marker refers to a fundamental rhetorical choice of D. Trump: short sentences. The reduction of the sentence length is a trend that can be observed in most democracies in Europe or in the USA. In the attempt to be accessible to as many people as possible, D. Trump's speech thus plays on syntactic simplification (Norris and Inglehart, 2019). For a long time, political discourse has imitated literature with long sentences and relative or subordinate clauses, but nowadays political discourse imitates popular language with short sentences that include only one subject, one verb and one complement. On average, in the corpus, a Trump sentence counts 14.15 words where an Obama sentence counts 21.51 words (Figure 6). In fact, the end-of-sentence markers characterize the current president.

[Figure 6: Average sentence size]

In 50 words here, Trump seems to take up the linguistic characteristics of populist discourse (Oliver and Rahn, 2016) as it is expressed in the United States and Europe at the beginning of the 21st century.
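For reference, the sentence-length statistic behind Figure 6 can be computed in a few lines, assuming the corpus tokens carry the end-of-sentence tag "SENT" used above (the tokenization and the `trump_tokens` name are illustrative).

```python
def avg_sentence_length(tokens):
    """Average number of words per sentence, with 'SENT' marking sentence ends.

    tokens: list of token strings for one president's sub-corpus.
    """
    lengths, count = [], 0
    for tok in tokens:
        if tok == "SENT":        # sentence boundary: close the current sentence
            if count:
                lengths.append(count)
            count = 0
        else:
            count += 1
    return sum(lengths) / len(lengths) if lengths else 0.0

# e.g. avg_sentence_length(trump_tokens) -> about 14.15 on the corpus above
```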
3.2 The French corpus: E. Macron

This section aims at demonstrating that deep learning can easily adapt to the subtleties of each language. A French presidential corpus is considered. In this dataset, the segment that the model identifies as being the most characteristic of E. Macron's speech gathers remarkable features of the current French president's language. The wTDSs highlight linguistic markers with multiple interpretations:

[...] intérêts industriels et qui construire le opacité PRP PRP:det décisions collectives qu' attendent nos concitoyens . La cinquième clé de notre souveraineté passe par le numérique . ce défi est aussi celui d' une transformation profonde de nos économies , de nos sociétés , de notre imaginaire même . La [...]

(E. Macron, the 26th of September 2017, speech about Europe at the Sorbonne.)

Some main features of E. Macron's speech emerge. First, the French president tries to give a non-ideological and pragmatic talk oriented towards action, movement and efficiency (Colen, 2019). Thus, the lemmas "construire" (to build) and "transformation" are very meaningful in such a discourse, whose main scope is to be dynamic. The word "numérique" (digital) is often at the heart of the speech of a president who talks about changes and who wants to show his technical modernity. Then, from a grammatical and syntactic point of view, most of the time the "PRP PRP:det" sequence (meaning preposition + contracted article, in French) introduces adverbial phrases. Thus, E. Macron avoids the main topics but he is precise about the modalities of the action. In E. Macron's speech, both the subject and the object are less important than the manner of the proposed reforms. Finally, from a lexical point of view, the CNN seems to focus on "concitoyens" (fellow citizens), which allows E. Macron to avoid the term "compatriots", considered too nationalist in the 21st century, in the context of the European integration. A high wTDS also corresponds to the "nos" and "notre" ("our") forms as well as to the lemma "notre". Indeed, the construction of a political "we" appears as the main rhetorical objective of a discourse that aims at gathering the people behind its leader.
4 Conclusion

We have introduced and tested a new method to extract relevant linguistic objects characterizing the different classes/authors in a multi-class classification context. The main focus of the present work is the hidden layers of a trained CNN. In particular, we introduced a measure (wTDS) which, entirely relying on the learned parameters, allowed us to detect the key words that, conditionally to their context, were used by the CNN to assign a text segment to its author. We have proposed a routine to rank the text segments from the most to the least representative for each author, providing a new and different view in author discourse analysis. The way we propose to compute all these features internally to the network leads to a highly reduced computation cost (compared to LIME for instance) and thus allows us to design a multi-channel architecture accounting for the part-of-speech and the lemma, leading to the extraction of enriched linguistic objects at almost no cost.

The linguistic objects that we learn in this multi-class classification framework are those that better discriminate one author with respect to the others. In order to extract not only discriminative spatial linguistic objects (using CNNs) but also to take into account the sequential generation of the discourse based on these linguistic objects, recurrent networks have to be considered. Some tools already explore the hidden layers of such architectures (e.g. LSTMVis, http://lstm.seas.harvard.edu/) and future works might focus on the combination of both approaches, for instance first extracting spatial patterns and then analyzing their sequential organization for an even more in-depth discourse analysis.

References and Notes
Adel, H. and Schütze, H. (2017). Global normalization of convolutional neural networks for joint entity and relation classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1723–1729.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Colen, A. (2019). Emmanuel Macron and the Two Years that Changed France. Manchester University Press.

Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160–167, New York, NY, USA. ACM.

Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. (2017). Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941.

Feldman, R. and Sanger, J. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. New York: Cambridge University Press.

Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2017). Bag of tricks for efficient text classification. EACL 2017, page 427.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665.

Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.

Lafon, P. (1980). Sur la variabilité de la fréquence des formes dans un corpus. Mots, 1(1):127–165.

Lebart, L., Salem, A., and Berry, L. (1998). Exploring Textual Data. Springer.

Li, J., Chen, X., Hovy, E., and Jurafsky, D. (2015). Visualizing and understanding neural models in NLP. arXiv preprint arXiv:1506.01066.

Mellet, S. and Longrée, D. (2009). Syntactical motifs and textual structures. Belgian Journal of Linguistics, 23:161–173.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Norris, P. and Inglehart, R. (2019). Cultural Backlash: Trump, Brexit, and Authoritarian Populism. New York: Cambridge University Press.

Oliver, J. E. and Rahn, W. M. (2016). Rise of the Trumpenvolk: Populism in the 2016 election. The ANNALS of the American Academy of Political and Social Science, 667(1):189–206.

Pennington, J., Socher, R., and Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Prechelt, L. (1998). Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69. Springer.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM.

Vanni, L., Ducoffe, M., Aguilar, C., Precioso, F., and Mayaffre, D. (2018). Textual deconvolution saliency (TDS): a deep tool box for linguistic analysis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 548–557.

Wen, T.-H., Vandyke, D., Mrkšić, N., Gasic, M., Barahona, L. M. R., Su, P.-H., Ultes, S., and Young, S. (2017). A network-based end-to-end trainable task-oriented dialogue system. In