ACCEPTED FOR PUBLICATION AT IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

Relation Extraction from Biomedical and Clinical Text: Unified Multitask Learning Framework
Shweta Yadav, Srivatsa Ramesh, Sriparna Saha, and Asif Ekbal
Abstract — Motivation:
To minimize the accelerating amount of time invested in biomedical literature search, numerous approaches for automated knowledge extraction have been proposed. Relation extraction is one such task, where semantic relations between entities are identified from free text. In the biomedical domain, extraction of regulatory pathways, metabolic processes, adverse drug reactions, or disease models necessitates knowledge of the individual relations, for example, physical or regulatory interactions between genes, proteins, drugs, chemicals, diseases, or phenotypes.
Results:
In this paper, we study the relation extraction task on three major biomedical and clinical problems, namely drug-drug interaction, protein-protein interaction, and medical concept relation extraction. Towards this, we model the relation extraction problem in a multi-task learning (MTL) framework, and introduce for the first time the concept of a structured self-attentive network complemented with an adversarial learning approach for the prediction of relationships from biomedical and clinical text. The fundamental notion of MTL is to learn multiple problems simultaneously by utilizing a shared representation. Additionally, we also build a highly efficient single-task model, which exploits shortest dependency path embeddings learned over an attentive gated recurrent unit, to compare against our proposed MTL models. The framework we propose significantly improves over all the baselines (deep learning techniques) and single-task models for predicting the relationships, without compromising the performance on any of the tasks.
Index Terms — Protein Protein Interaction, Drug Drug Interaction, Medical Concept Relation, Adversarial Learning, Deep Learning, Natural Language Processing, Relation Extraction.
INTRODUCTION
Owing to the rapid growth of the scientific literature, the majority of the available biological facts remain concealed in the form of scientific literature. Over the last two decades, the size of MEDLINE has risen at a compounded annual growth rate of 4.2 percent. MEDLINE currently holds more than , , records from publications, which is more than millions than those indexed in alone. A similar trend has also been observed in the case of healthcare data: IBM reported that nearly . quintillion bytes of healthcare data are generated globally. Encapsulated within this unstructured text is an enormous amount of significant healthcare and biomedical data, which are valuable sources of information for the Biomedical Natural Language Processing (BioNLP) domain.

As a consequence of the exponential rise [1], [2] and complexity of biological and clinical information, it is imperative to advance automatic extraction techniques to assist biologists in detecting, curating, and maintaining databases and to provide automated decision support systems for health professionals. This has led to a rise in the interest of the BioNLP community in automatically detecting and extracting information from the scientific literature and clinical records [3], [4], [5].

Relation extraction is one such task that aims to detect and characterize the semantic relationship between biological/clinical entities. The relation types can vary depending upon the genres and domains, such as interactions between genes, proteins, drugs, or medical concepts (problems, treatments, or tests).

In this paper, we study relation extraction (RE) on the most popular biomedical and clinical tasks, namely drug-drug interaction (DDI), protein-protein interaction (PPI), and clinical relation extraction.

• All the authors are with the Department of Computer Science and Engineering, Indian Institute of Technology Patna, Bihar, India, 801103. E-mail: (shweta.pcs14, sriparna, asif)@iitp.ac.in
DDI detection is a significant area of patient safety research, as these interactions can be very hazardous and boost the cost of health care. Similarly, knowledge about interactions among proteins can help in understanding biological processes and complex diseases, and in guiding drug discovery [6]. In the clinical domain, the ability to recognize relations among medical concepts (treatments, tests, or problems) allows the automatic processing of clinical texts, supporting clinical decision-making, clinical trial screening, and pharmacovigilance.

In the vast literature on relation extraction, several techniques have been proposed to solve the problem, ranging from semantic-injected kernel models [7], [8] to machine learning based models [9], [10]. In the recent past, with the success of deep neural networks, the state-of-the-art models for these tasks have drifted towards deep learning frameworks [11], [12]. However, there have been very few attempts at improving the performance of RE systems irrespective of the task or domain. One potential solution is to model the relation extraction problem in a multi-task learning framework, where a problem can be learned together with other related problems by leveraging a shared representation. This method of multi-task learning provides advantages in (1) minimizing the number of parameters and (2) reducing the risk of over-fitting. The aim of multi-task learning (MTL) is to enhance system performance by integrating other similar tasks [13], [14]. When tasks have commonalities and, in particular, when training data is limited, MTL can yield better results than a model trained on a single dataset, allowing the learner to capitalize on the commonality among the tasks. This can assist the overall model, as one dataset can contain information that is complementary for addressing the individual tasks more accurately when trained jointly [15].

However, most of the existing methods for multi-task classification tend to divide the feature space of each task into shared and private parts, depending solely on hard-parameter or soft-parameter sharing. As such, these task-shared features are prone to being contaminated with external noise and task-specific features, which often causes the system to suffer from feature redundancy.

To combat the contamination of task-shared features, in this paper we propose an adversarial multi-task relation extraction framework. This framework utilizes the concept of adversarial learning to inherently disjoin the task-specific and task-invariant feature spaces. In adversarial learning [16], a model is trained to correctly classify both unmodified and adversarial data through a regularization method. The adversarial learning paradigm provides an assurance that the task-shared feature space is not contaminated with task-specific features and contains only task-invariant features.

In our study, we use bi-directional gated recurrent units (Bi-GRU) [17] as the learning algorithm, which has the capability to learn features by capturing long-range dependency information. The Bi-GRU, unlike other Recurrent Neural Networks (RNNs) such as the Long Short Term Memory (LSTM), is computationally less expensive [18]. The GRU has a gating mechanism similar to the LSTM to control the information flow, but unlike the LSTM, the GRU has no separate memory unit and has only an update and a reset gate. As such, when a down-sampling operation is performed on the output of the GRU, we extract the optimal features from the entire input sequence, covering the complete context information.

In the literature, the attention mechanism has shown promising results in relation extraction by generating optimal and effective features. In the attention mechanism, a simple strategy is followed by computing a weight vector corresponding to each hidden state of the RNN or CNN.
The final hidden states are computed by performing a pooling operation (max, min, average, etc.) on the weighted representations of the hidden states. However, the computed attention weight focuses on a specific aspect of the input sequence. In this work, we attempt to capture multiple aspects of the input sequence by exploiting the self-attention mechanism. Basically, we learn to generate multiple attention weight vectors, which eventually generate multiple final representations of the hidden states, considering the various aspects of the input sequence.

We apply our proposed approach on four popular benchmark datasets, namely AIMed and BioInfer for PPI [19], DrugDDI for DDI [20], and the 2010 i2b2/VA NLP challenge dataset for clinical relation extraction (MCR) [21]. Our proposed MTL model obtained F1-scores of . , . , . , and . on the AIMed, BioInfer, DDI, and i2b2 relation extraction datasets, respectively. We observe an average 5% improvement in F-score in comparison to the single-task learning baseline model and over 3% improvement over the MTL baseline model. Performance does not decrease considerably on any dataset, and it increases significantly on all four datasets. These are promising outcomes that establish the potential of the MTL model for solving the problem of biomedical RE. In addition to the baselines, our proposed model outperforms the state-of-the-art methods on all the datasets. This shows that, when we have tasks in common, multi-task learning can assist over single-task models. The contributions of our proposed work can be summarized as follows:

1) We propose a multi-task learning (MTL) framework for relation extraction that exploits the capabilities of adversarial learning to learn shared complementary features across multiple biomedical and clinical datasets. We also exploit the self-attention mechanism, which allows the final feature representation to directly access previous Bi-GRU hidden states via the attention summation.
Therefore, the Bi-GRU does not need to carry the information from each time step towards its last hidden state.

2) Our proposed model is capable of automatically extracting various relations (such as protein-protein interactions; the drug-drug interaction classes ‘int’, ‘advice’, ‘mechanism’, and ‘effect’; and relations between medical problem and treatment, test and treatment, and treatment and treatment).

3) We validate our proposed framework on four popular benchmark datasets (AIMed, BioInfer, the SemEval 2013 Drug-Drug Interaction task, and the i2b2 medical relation shared task dataset) for relation extraction, having different annotation schemes.

4) Our unified multi-task model achieves state-of-the-art performance and outperforms the strong baseline models for all the tasks on the respective datasets.

RELATED WORKS
There has been a recent surge in the interest of the BioNLP community in automatically detecting and extracting information from the scientific literature and clinical records [22], [23], [24], [25], [26], [27]. In the past decade, there has been a tremendous amount of work on varieties of the relation extraction task, such as extracting relationships between bio-entities (proteins, genes, diseases, etc.) from the biomedical literature. Much previous work uses kernel-based techniques, which allow representation learning of the data in the form of dependency structures and syntactic parse trees. Some of the other prominent techniques for extracting relationships are based on pattern matching. Recently, with the success of deep learning, techniques based on the Convolutional Neural Network, the Recurrent Neural Network, and the Long Short Term Memory network have been widely utilized for extracting relationships from the biomedical literature and clinical records. Based on the tasks, we divide the related works into the following three categories:

• Protein-Protein Interaction task:
Several NLP techniques have been proposed to identify relationships between protein entities. The preliminary studies [28], [29], [30] on this task essentially used pattern-based models, where patterns were extracted from the data based on their syntactic and lexical properties. The main drawback of this approach is the inability to properly handle the complex relationships expressed in coordinating and relational clauses. Dependency-based approaches [31], [32] are more syntax-aware techniques and have broader coverage than naive pattern-based approaches. Some of the studies [33] exploring dependency-based techniques incorporate the dependency information as the shortest dependency path between the entities. Techniques based on kernel methods are also often explored in the area of PPI extraction. Some of the prominent kernel-based approaches for extracting PPIs include the edit-distance kernel [33], bag-of-words kernel [34], all-paths kernel [8], graph kernel [35], and tree kernel [36]. [37] proposed a walk-weighted subsequence kernel that captures the syntactic structure by matching the e-walks and v-walks on the shortest dependency path. [38] proposed a technique based on a convolutional tree kernel that integrates patterns of protein interaction. Recently, various studies have exploited deep learning based
techniques [39], [40], which do not require manual feature engineering, unlike the previous techniques based on kernel, pattern, and dependency methods. [41] first proposed a deep learning technique for extracting relationships between protein pairs. They used a CNN as the base learner over word embeddings generated from the Google News corpus. [42] proposed a neural network framework which integrates several lexical, semantic, and syntactic level features into the CNN model. Their study shows that integrating this additional information provides only minor improvement overall. [43] proposed a two-channel CNN technique for high-level feature extraction. In the first channel, they used words with additional syntactic features such as part-of-speech, chunking information, dependency information, named entities, and word position information w.r.t. the protein entities. In the second channel, they used the parent word information for every word. [12] proposed a greedy layer-wise unsupervised technique to extract PPIs. They utilized an autoencoder on unlabelled data for the parameter initialization of a deep neural network model and applied gradient descent using back-propagation to train the whole network. Various studies [40] on the PPI extraction task have also explored the Recurrent Neural Network framework. [44] proposed a method based on a Bi-directional Long Short Term Memory network (Bi-LSTM) equipped with a stacked attention mechanism. The input to their model is the shortest dependency path between the entity pairs. Their study shows that providing multiple attentions can assist the model in better capturing long-range contextual and structural information. [45] proposed a tree RNN with a structured attention framework for extracting PPI information.

• Drug-Drug Interaction task:
Existing techniques for drug-drug interaction can be categorized into one-stage and two-stage classification schemes [46]. In one-stage classification [47], [48], the aim is to identify the multiple relationships between drug pairs, which could be from any of the interacting classes or the negative class. Several methods have explored multi-class classifiers to capture the relationship between two target drugs in a sentence. In the two-stage classification scheme [49], [50], there are two steps. The first step determines whether the target drug pair is interacting or non-interacting. In the second step, only the interacting sentences are given as inputs to a multi-class classifier. These approaches can further be classified into handcrafted-feature and latent-feature based methods. Techniques [51], [52], [53] based on hand-crafted features mainly utilize support vector machines (SVMs) as the base learner. These techniques rely on several hand-crafted features such as part-of-speech tags, chunks, syntax trees, dependency parses, and trigger words. Such methods are also utilized in other biomedical relation extraction tasks, such as adverse drug reaction extraction [54], [55], protein-protein interaction extraction [56], relation extraction between diseases and genes [57], and relations between medical concepts [58]. These techniques have appeared to perform well; however, they are very domain-specific and dependent on other NLP tools. Approaches exploring latent features are based on deep learning models, which have proved to be powerful alternatives to the feature-based models. Below we provide a detailed survey of the above described methods:

– Linear Methods:
These methods utilize a linear classifier that takes as input domain-specific or manually designed features. The system proposed by Uturku [49] explored the Turku event extraction system (TEES) for identifying drug interaction pairs. TEES utilizes features from dependency parsing and lexicons derived from MetaMap and DrugBank. [50] developed a two-stage classification technique based on SVMs. They explored several hand-crafted features such as lexical, contextual, semantic, and tree-structured features.

– Kernel Methods:
These techniques are more advanced than linear methods, as they explore graph-based features. The all-paths graph kernel [59] learns a classifier based on a weighting scheme over dependency parse tree features and surface features. The k-band shortest path spectrum [60] utilizes the shortest dependency path between the entity pairs to build a classifier. It further permits mismatches for variables and includes all the nodes within k-distance from the shortest path. The shallow linguistic (SL) kernel [61], as the name suggests, captures shallow linguistic features such as part-of-speech, stem, word, and other morphological features, and also explores the properties of the surrounding words. [62] proposed an ensemble-based DDI extraction system. They explored various kernel classifiers in addition to a case-based reasoning technique for classifying drug pairs. [53] also explored various kernel classifiers by integrating multiple kernel methods such as the SL kernel, the mildly extended dependency tree (MEDT) kernel, and the path-enclosed tree (PET) kernel. The PET kernel captures the smallest subtree involving the two entities in a phrase structure parse tree. The MEDT kernel uses linguistically motivated expansions to capture the prominent clue words between the entity pairs. The system proposed by [63] also utilizes several kernels based on feature and tree kernel methods. Precisely, they explored the MEDT, SL, PET, global context, and local context kernels.

– Deep Learning Techniques:
Deep learning techniques are based on the neural network approach, which utilizes latent features instead of hand-crafted features. These techniques encode word-level representations using a neural network to generate sentence-level features. For the final classification, the network uses the sentence-level features. In the recent past, several deep learning methods have been used to address DDI tasks. The SCNN system was proposed by [64]. They used a convolutional neural network to capture a denser sentence representation. SCNN also utilized some additional features such as PoS and dependency tree based features. [46] advanced the state of the art by proposing three different models based on the concepts of LSTM and attentive pooling. All of these models take as input the word and position embeddings and do not rely on any hand-crafted features. Some of the other prominent works that have explored the deep learning framework are [65], [66], [67], [68], [69].

• Medical Concept Relation Extraction task:
Electronic medical records, such as patients' discharge summaries and progress notes, contain information about medical concepts and their relationships. To aid advanced patient care, a technique for the automatic processing of clinical records is required. To address this issue, Informatics for Integrating Biology & the Bedside (i2b2) organized a shared task challenge [21] that aims to identify the relationships between medical concepts, i.e., problems, treatments, or tests, from EMR documents. Identifying the correct relationship between medical concepts requires knowledge of the context in which the two concepts are discussed. Existing techniques for extracting relations between medical concepts can be grouped into semi-supervised and supervised classes. [70] explored a semi-supervised method to determine the relationship between concept pairs. They used maximum entropy as the classification algorithm to separately train three classifiers, one for each concept pair type, i.e., test-problem, treatment-problem, and problem-problem relations. They explored various external features obtained from other NLP pipelines such as cTAKES and MetaMap. Additionally, they also used word-level features, PoS tags, dependency path based features, and distance features that capture the minimal, average, and maximal tree distances to the common ancestor. To overcome the label imbalance problem in the training data, they computed the relatedness between two medical concepts by using pointwise mutual information over MEDLINE and bootstrapping with unlabeled examples. The majority of the research work carried out on this problem is highly dependent on supervised approaches. In those works [71], [72], [73], statistical machine learning classifiers (CRF, SVM) are used to identify the relations. As the i2b2 shared task dataset contained a large portion of concept pairs without any relations, some of the systems [74], [75], [76] proposed two-stage classification methods, where in the first stage, concept pairs were classified as having a relation or no relation.
In the latter stage, only those concept pairs with a relation participate in identifying the specific relationship. All of the participating systems heavily rely on hand-crafted features. They exploited semantic, lexical, syntactic, and domain-specific ontology features. The system proposed by [77] used medical knowledge graph (UMLS) concept identifiers and applied a feature selection technique to capture more relevant features. [73] used linguistic features to complement their machine learning component in extracting medical concept relations. Recently, neural network techniques have been widely adopted for the clinical relation extraction task. [78] explored the capabilities of a convolutional neural network to capture prominent features for extracting relations. Another study conducted by [79] used two variants of RNN-LSTM networks, a segment-level LSTM and a sentence-level LSTM, for encoding the relation. The experimental results show that the proposed approach performs comparably w.r.t. state-of-the-art systems. They also identified that word embeddings from clinical text are more beneficial than those from general text. Another prominent study was conducted by [80] to identify relations among diseases and treatments. They used several neural networks and graphical models. Furthermore, they utilized other hand-crafted features such as lexical, semantic, and syntactic features for classification. They conducted their study on a very small dataset consisting of biomedical research articles. [81] applied CRFs for extracting relations between diseases, treatments, and genes. They proposed a two-step model, where in the first step they identified the entities and in the second step they extracted the relationships. For both steps, they explored CRF as the base learner. [57] developed a system for finding associations between disease, drug, and target in the EU-ADR dataset. They exploited a kernel based method that uses the shallow linguistic and dependency kernels for extracting the relations.
[82] developed a shortest dependency path based deep neural network framework that also utilizes other features such as part-of-speech information, dependency labels, and the types of the entities.
MATERIALS AND METHODS
In this section, we describe the methodology used to extract relations from various biomedical texts in our proposed multi-task learning framework. We begin with the problem statement, followed by an introduction of the Gated Recurrent Unit (GRU), which is used in our models as the base learning algorithm. Following that, we describe in detail the proposed multi-task model that utilizes the concept of adversarial learning.
Problem Statement:
Given an input text sequence $S$ consisting of $n$ words, i.e., $S = \{w_1, w_2, \ldots, w_n\}$, and a pair of entities $(e_1, e_2)$ where $e_1 \in S$ and $e_2 \in S$, the task is to predict the most probable class $\bar{y}$ from the set of class labels $Y$. Mathematically,

$$\bar{y} = \arg\max_{y \in Y} \; \mathrm{prob}(y \mid S, e_1, e_2, \theta) \quad (1)$$

where $\theta$ is the model parameter. Each token $w_i \in S$ of the input sequence $S = \{w_1, w_2, \ldots, w_n\}$ is mapped into a $d$-dimensional word embedding sequence $x = \{x_1, x_2, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$.

In this work, we use the GRU [17] as the base learner. The GRU is an improved version of the recurrent neural network and is used as a variant of the LSTM due to its simpler architecture. Similar to other RNNs, the GRU has an internal memory. This internal memory helps it exhibit temporal dynamic behaviour over a time sequence. Additionally, the GRU is able to mitigate the vanishing gradient problem that affects the standard recurrent neural network. Similar to LSTM units, GRUs also have a gating mechanism to control the flow of information and produce effective hidden state representations.

The GRU has two neural gates, the update gate and the reset gate. The task of the update gate is to help the model determine the amount of information that needs to be carried forward to the future. The reset gate helps the model determine the amount of past information that needs to be forgotten. Specifically, a GRU network successively reads the input token $x_i$ as well as the previous state $h_{i-1}$, and generates the new states $c_i$ and $h_i$:

$$
\begin{aligned}
z_i &= \sigma(W_z x_i + V_z h_{i-1} + b_z) \\
r_i &= \sigma(W_r x_i + V_r h_{i-1} + b_r) \\
c_i &= \tanh(W x_i + V (r_i \odot h_{i-1}) + b) \\
h_i &= z_i \odot c_i + (1 - z_i) \odot h_{i-1}
\end{aligned} \quad (2)
$$

where $z$ and $r$ are the update and reset gates, respectively. The final representation at a given time $t$, from a bi-directional GRU
1. The entity pair can be the protein, disease, problem name, etc.
Fig. 1: Proposed multi-task model for various biomedical relation extraction tasks.

(Bi-GRU) can be computed by concatenating the forward $\overrightarrow{h_t}$ and backward $\overleftarrow{h_t}$ hidden states. From here onward, we will refer to the Bi-GRU as a function having the inputs $x_t$ and $h_{t-1}$ and the output $h_t$.

In the relation extraction framework (c.f. Figure 1), the input sequence $x = \{x_1, x_2, \ldots, x_n\}$ with the corresponding entity pair $(e_1, e_2)$ is transformed into the hidden state representation using the Bi-GRU (c.f. Section 3.1). In order to emphasize the given entity pair, the corresponding entity words in the input sequence are marked with the special token ENTITY and assigned a fixed word embedding. More formally, the hidden state at each time step is calculated as follows:

$$h_t = \text{Bi-GRU}(x_t, h_{t-1}) \quad (3)$$

Let the hidden state dimension for each Bi-GRU unit be $d_h$. We formulate a hidden state matrix $H \in \mathbb{R}^{n \times d_h}$:

$$H = (h_1, h_2, \cdots, h_n) \quad (4)$$

We compute the effective input sequence encoding $\bar{h}$ using a function $fun(\bar{h}; \theta)$ with learning parameter $\theta$. The input sequence encoding $\bar{h}$ is fed to a fully connected softmax layer to generate the probability distribution over the predefined classes:

$$\bar{y} = \mathrm{softmax}(\bar{h}^T W + z) \quad (5)$$

Here, $W$ and $z$ are the weight matrix and bias vector, respectively. The term $\bar{y}$ denotes the predicted probability distribution. Let us have a training dataset with $N$ samples, $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. The network parameters are trained to minimize the loss function, the cross-entropy between the predicted ($\bar{y}$) and true ($y$) class distributions over the $C$ classes:

$$L_{CE}(\bar{y}, y) = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_i^j \log(\bar{y}_i^j) \quad (6)$$
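As a concrete illustration, the GRU recurrence of Eq. (2), the Bi-GRU encoding of Eqs. (3)-(4), and the softmax head of Eq. (5) can be sketched in NumPy as follows. The dimensions and random weights below are toy placeholders, not the values used in the paper, and mean pooling stands in for the attentive encoding function $fun$ introduced later:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_i, h_prev, p):
    """One GRU step, Eq. (2): update gate z, reset gate r,
    candidate state c, and the new hidden state h."""
    z = sigmoid(p["Wz"] @ x_i + p["Vz"] @ h_prev + p["bz"])
    r = sigmoid(p["Wr"] @ x_i + p["Vr"] @ h_prev + p["br"])
    c = np.tanh(p["W"] @ x_i + p["V"] @ (r * h_prev) + p["b"])
    return z * c + (1.0 - z) * h_prev

def bi_gru(xs, p_fwd, p_bwd, d_h):
    """Hidden state matrix H (Eq. 4): forward and backward GRU
    passes, concatenated at each time step."""
    h_f, h_b = np.zeros(d_h), np.zeros(d_h)
    fwd, bwd = [], []
    for x in xs:
        h_f = gru_step(x, h_f, p_fwd)
        fwd.append(h_f)
    for x in xs[::-1]:
        h_b = gru_step(x, h_b, p_bwd)
        bwd.append(h_b)
    return np.hstack([np.array(fwd), np.array(bwd)[::-1]])  # (n, 2*d_h)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, d_h, n, C = 4, 3, 5, 3   # toy embedding size, hidden size, length, classes

def init(d_in, d_hid):
    p = {k: rng.standard_normal((d_hid, d_in)) for k in ("Wz", "Wr", "W")}
    p.update({k: rng.standard_normal((d_hid, d_hid)) for k in ("Vz", "Vr", "V")})
    p.update({k: np.zeros(d_hid) for k in ("bz", "br", "b")})
    return p

xs = rng.standard_normal((n, d))                  # token embeddings
H = bi_gru(xs, init(d, d_h), init(d, d_h), d_h)   # hidden matrix, (5, 6)
h_bar = H.mean(axis=0)                            # stand-in for fun(h; theta)
y_bar = softmax(h_bar @ rng.standard_normal((2 * d_h, C)))  # Eq. (5)
print(H.shape)  # (5, 6)
```

The mean pooling in the last lines is only a placeholder; the paper replaces it with the self-attentive encoding described next.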
2. This function is an adversarial learning based self-attentive network which computes the effective input sequence encoding.
We encode the input sequence by adopting the self-attentive mechanism [83] over the Bi-GRU generated hidden state sequence. The input to the attention mechanism is the Bi-GRU hidden state matrix $H$. The self-attention generates the attention weight vector $v$, computed as follows:

$$p = \tanh\left(U H^T\right), \qquad v = \mathrm{softmax}(w p) \quad (7)$$

Here $U \in \mathbb{R}^{d_a \times d_h}$ and $w \in \mathbb{R}^{d_a}$, where $d_a$ is a hyperparameter and $d_h$ is the size of the hidden state. The final hidden state representation $s$ is computed as the sum of the Bi-GRU hidden states at each time step, weighted by $v$. The major drawback of the aforementioned representation is that it focuses on a specific component of the input sequence. The specific component could be a relation between a given entity and other words in the sequence. We call this an aspect of the input sentence. In order to capture multiple aspects, we require multiple attention weight vectors that capture the various notions of the input sequence. Therefore, we extend the attention mechanism from focusing on a single aspect to multiple aspects. Let us assume we want to extract $a$ aspects from the input sequence. To achieve this, we extend Eq. 7 as follows:

$$p = \tanh\left(U H^T\right), \qquad V = \mathrm{softmax}(W p) \quad (8)$$

Here, $W \in \mathbb{R}^{a \times d_a}$. Formally, we compute the $a$ weighted sums by multiplying the matrix $V$ and the Bi-GRU hidden states $H$. We obtain a matrix representation $M$ of the input sequence by multiplying the attention weight matrix $V$ and the hidden state representation $H$:

$$M = V H \quad (9)$$

We concatenate each row of the matrix representation to get the final vector representation of the input sequence.
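The multi-aspect self-attention of Eqs. (7)-(9) can be sketched as follows; the dimensions and random matrices are toy placeholders for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structured_self_attention(H, U, W):
    """Multi-aspect self-attention (Eqs. 7-9): p = tanh(U H^T),
    V = softmax(W p) holds one attention row per aspect, and
    M = V H gives 'a' weighted sentence representations."""
    P = np.tanh(U @ H.T)          # (d_a, n)
    V = softmax(W @ P, axis=-1)   # (a, n): each row sums to 1 over tokens
    M = V @ H                     # (a, d_h)
    return V, M

rng = np.random.default_rng(2)
n, d_h, d_a, a = 6, 8, 5, 3       # tokens, hidden size, attention size, aspects
H = rng.standard_normal((n, d_h))  # Bi-GRU hidden states
U = rng.standard_normal((d_a, d_h))
W = rng.standard_normal((a, d_a))
V, M = structured_self_attention(H, U, W)
m = M.reshape(-1)                  # concatenated final sentence representation
print(V.shape, M.shape, m.shape)   # (3, 6) (3, 8) (24,)
```

Each row of $V$ attends over all $n$ tokens independently, so the concatenated vector $m$ carries one pooled view of the sentence per aspect.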
In this study, we introduce a novel method for biomedical relation extraction exploiting adversarial learning in a multi-task deep learning framework. Our model leverages joint modeling of the entities and relations in a single model by exploiting an attentive Bi-GRU based recurrent architecture. We propose an adversarial multi-task learning with attention (Ad-MTL) model for the relation extraction task. Multi-task learning exploits the correlation present among similar tasks to improve classification by learning the common features of multiple tasks simultaneously. We build a latent feature space that holds the features common to the various tasks. Specifically, in the model, the outputs generated at each time step of the shared Bi-GRU are considered to be common latent features. We generate the task-specific features for each task by a task-specific Bi-GRU network equipped with the self-attentive network discussed in Section 3.2.1.

We compute two hidden states at each time step $t$ for a given task $k$: a task-specific hidden state $h_t^k$, and a shared hidden state $s_t^k$. The former captures the task-dependent features and the latter captures the task-invariant features. Both hidden state representations are computed similarly to Eqn. 3:

$$s_t^k = \text{Bi-GRU}(x_t^k, s_{t-1}^k, \theta_s) \quad (10)$$

$$h_t^k = \text{Bi-GRU}(x_t^k, h_{t-1}^k, \theta_h) \quad (11)$$

where $\theta_s$ and $\theta_h$ are the Bi-GRU parameters and $x_t^k$ denotes the input at time $t$. We generate the task-specific feature representation by applying self-attention using Equation 9. We also use a feed-forward network with a hidden layer to project the attentive feature representation to another vector space. We call the final task-specific feature representation of task $k$ as $TF^k$. Similar to the task-specific feature generation, we use a feed-forward network to project the shared features into another vector space and call the result the shared feature $SF$.
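To make the Ad-MTL computation concrete, the sketch below composes the shared and task-specific states of Eqs. (10)-(11) with the cross-entropy, adversarial, and total losses (Eqs. 12-14) defined in the remainder of this section. All dimensions and weights are toy placeholders; a plain tanh recurrence stands in for the Bi-GRUs, and mean pooling stands in for the attentive feed-forward projections:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h, T = 4, 3, 4            # toy embedding size, hidden size, number of tasks

def encode(xs, W, V, b):
    """Simplified recurrent encoder standing in for the Bi-GRUs of
    Eqs. (10)-(11); returns one state per time step."""
    h, states = np.zeros(d_h), []
    for x in xs:
        h = np.tanh(W @ x + V @ h + b)
        states.append(h)
    return np.array(states)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def params():
    return (rng.standard_normal((d_h, d)),
            rng.standard_normal((d_h, d_h)),
            np.zeros(d_h))

shared = params()                      # theta_s, shared by all tasks (Eq. 10)
private = [params() for _ in range(T)] # theta_h, one set per task (Eq. 11)

def losses(xs, k, gold_class, W_task, W_disc):
    SF = encode(xs, *shared).mean(0)      # shared feature (pooled stand-in)
    TF = encode(xs, *private[k]).mean(0)  # task-specific feature TF^k
    y_pred = softmax(np.concatenate([TF, SF]) @ W_task)  # task classifier
    L_ce = -np.log(y_pred[gold_class] + 1e-12)           # per-sample Eq. (12)
    d_pred = softmax(SF @ W_disc)         # task discriminator D on shared feature
    # Negative log-likelihood of the true task: D minimizes it, while the
    # generator receives its *negated* gradient (gradient reversal, Eq. 13).
    L_adv = -np.log(d_pred[k] + 1e-12)
    return L_ce, L_adv

C = 3                                     # classes of the current task
W_task = rng.standard_normal((2 * d_h, C))
W_disc = rng.standard_normal((d_h, T))
xs = rng.standard_normal((5, d))          # one sentence of task k = 2
L_ce, L_adv = losses(xs, 2, 1, W_task, W_disc)
alpha, beta = 1.0, 0.05                   # assumed weighting, cf. Eq. (14)
L_total = alpha * L_ce + beta * L_adv     # single-sample snapshot of Eq. (14)
print(L_ce > 0, L_adv > 0)                # True True
```

In a full implementation, both encoders would be Bi-GRUs with self-attention, the losses would be summed over all tasks and samples, and the min-max game of Eq. (13) would be realized through a gradient reversal layer during back-propagation.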
The concatenation of the shared and task-specific features is fed into a fully connected layer followed by a softmax layer. The softmax layer returns the class distribution y_pred^k for the underlying task k. For every task k with training samples (x_i^k, y_i^k), both the task-specific and shared parameters are optimized to minimize the cross-entropy between the predicted (ŷ_i^k) and actual (y_i^k) probability distributions, whose loss is computed as:

L_CE^k = L_CE(ŷ_i^k, y_i^k)    (12)

where L_CE(ŷ, y) is defined as in Eq. (6).

Although the model discussed above is intended to host the shared and task-specific features separately, it provides no guarantee to behave so: shared features may contaminate the task-specific feature space and vice versa. To handle this, we exploit the principle that a good shared feature space has features that make it impossible to predict the source task of a feature. To achieve this, a
Task Discriminator D is used to map the attention-prioritized shared features and estimate the task of their origin. In our case, the Task Discriminator is a fully connected layer with a softmax layer that produces the probability distribution of the shared features belonging to each task. A Bi-GRU works as the
Generator (G) to generate the shared features. This Bi-GRU layer is made to work in an adversarial way, preventing the discriminator from predicting the task and hence preventing contamination of the shared space. The adversarial loss is used to train the model. Similar to [16], [84], we use the following adversarial loss function:

L_adv = min_G ( max_D ( Σ_{t=1}^{T} Σ_{i=1}^{N_t} d_i^t log[ D(G(x^t)) ] ) )    (13)

where d_i^t is the gold label indicating the type of the current task. The min-max optimization problem is addressed by the gradient reversal layer [85]. The total loss of the network is as follows:

L_total = α Σ_{k=1}^{K} L_CE^k + β L_adv    (14)

where α and β are scalar parameters.

EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we begin by briefly describing the various tasks and datasets, followed by the experimental results and analysis. In this study, we focus on the following tasks:
1) Protein-Protein Interaction (PPI): The goal of this task is to classify whether or not a sentence containing a protein pair actually indicates an interaction between the pair. Here, we consider the interacting protein pairs as positive instances and the non-interacting protein pairs as negative instances. If the relationship between a protein pair is not explicitly provided, the pair is considered a negative instance. In order to identify these instances, we extracted all possible protein pairs from the sentences.
We utilize two standard benchmark datasets for the PPI task, namely AiMed and BioInfer. The AiMed dataset is derived from the abstracts of the Database of Interacting Proteins (DIP) and contains manually tagged relationships between the protein entities. BioInfer (Bio Information Extraction Resource) is another manually annotated PPI dataset, developed by the Turku BioNLP group. The numbers of interacting and non-interacting protein pairs in both datasets are reported in Table 1.
2) Clinical Relation Extraction: This task aims to extract the relations between the clinical entities (
Problem, Treatment, and Test) from the EMR. For this, we used the benchmark dataset released as part of the i2b2 2010 clinical information challenge [21]. The dataset was collected from three different hospitals and consists of discharge summaries and progress notes of patients, manually annotated by medical practitioners to identify three major relation types: medical problem–treatment (TrP), medical problem–test (TeP), and medical problem–medical problem (PP) relations. These relations were further fine-grained into the following relation types: treatment caused medical problem (TrCP), treatment administered for medical problem (TrAP), treatment worsened medical problem (TrWP), treatment improved or cured medical problem (TrIP), treatment not administered because of medical problem (TrNAP), test revealed medical problem (TeRP), test conducted to investigate medical problem (TeCP), and medical problem indicates medical problem (PIP). The exact
3. http://corpora.informatik.hu-berlin.de/
4. http://bionlp.utu.fi/
definition of each of these relation types can be found in [21]. It is to be noted that, since we did not have enough training samples for all the relation classes present in the dataset, we removed the following three classes: TrWP, TrIP, and TrNAP. The same strategy was also followed by [78].
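The class-filtering step described above (dropping TrWP, TrIP, and TrNAP for lack of training samples) amounts to removing labels below a frequency threshold; a minimal sketch with a hypothetical helper and toy data:

```python
from collections import Counter

def drop_rare_classes(samples, min_count):
    """Keep only (sentence, label) pairs whose label has at least min_count
    training instances. Hypothetical helper; the threshold is illustrative."""
    counts = Counter(label for _, label in samples)
    return [(x, y) for x, y in samples if counts[y] >= min_count]

data = [("s1", "TrAP"), ("s2", "TrAP"), ("s3", "TrWP")]
kept = drop_rare_classes(data, min_count=2)
```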
3) Drug-Drug Interaction (DDI): Given a sentence with two pharmacological substances, this task aims to classify whether the given drug pair interacts with each other or not. For this task, we utilize the standard benchmark DDI corpus from the SemEval 2013 DDIExtraction challenge [20]. The DDIExtraction 2013 task exploits the DDI corpus, which contains MedLine abstracts on drug-drug interactions as well as documents describing drug-drug interactions from the DrugBank database. The corpus consists of a total of 1,017 abstracts from the MedLine (233) and DrugBank (784) databases, which were manually annotated to obtain 18,491 pharmacological substances and 5,021 drug-drug interactions. Each interacting drug pair is further classified into one of four types, namely mechanism, advice, effect, and int. It is to be noted that during the challenge, the original dataset had 23,756 false samples for training and 4,737 for testing; however, when we obtained the dataset from the shared task organizers, there were only 22,474 false training instances and 4,461 test samples for the false class. Table 3 provides the statistics of the DDI dataset. The detailed statistics of all three datasets are reported in Tables 1, 2, and 3.
Dataset        Interacting Pairs   Non-interacting Pairs
AiMed PPI      1000                4834
BioInfer PPI   2534                7132
TABLE 1: Statistics: AiMed and BioInfer Dataset
Label   Relation   No. of Samples
0       TrIP       203
1       TeRP       3053
2       TrAP       2617
3       PIP        2203
4       TeCP       504
5       TrCP       526
6       TrNAP      174
7       TrWP       133
8       NONE       54530
TABLE 2: Statistics: 2010 i2b2/VA NLP Challenge dataset
Label   Relation    No. of Samples (Train)   No. of Samples (Test)
0       False       22474                     4461
1       effect      1685                      360
2       mechanism   1316                      302
3       advice      826                       221
4       int         188                       96
TABLE 3: Statistics: SemEval 2013 DDIExtraction challenge dataset
5. The actual dataset released during the challenge contained more documents for training and testing; however, we were able to download only a subset of the training and test documents from the i2b2 website, as also pointed out by [78].
NETWORK TRAINING AND HYPER-PARAMETER SETTINGS
We train the network by minimizing its total loss (Eq. 14). In adversarial training, we first pre-train the discriminator to avoid instability in the network. To pre-train the discriminator, we use a Bi-GRU network to obtain representations of the sentences from the different tasks. We show the training process in Algorithm 1. The shared feature extractor model S in Algorithm 1 is a Bi-GRU network followed by a self-attention layer, and it is exploited by all the tasks (c.f. Section 3.3). In the shared feature extractor, there is an additional adversarial learning component, where the feature extractor (Generator) operates adversarially towards a learnable multi-layer perceptron (Discriminator), preventing it from making an accurate prediction about the task the features were generated from. For generating the word embeddings, we used pre-trained embeddings available online (see footnote 6).

We evaluate the performance of all our models using macro-averaged precision, macro-averaged recall, and macro-averaged F1-score. Due to the unavailability of separate validation sets for the AiMed, BioInfer, and i2b2-2010 clinical relation datasets, we adopt a k-fold document-level cross-validation strategy to compute the precision, recall, and F1-score values. We consider a predicted class label as correct only if it exactly matches the ground truth annotation. For the DDI task, we report the performance of the models on the test set.

Inspired by the recent success of deep learning based frameworks in solving the relation extraction task, we develop three strong baselines based on the STL and MTL frameworks for the purpose of comparison. Figure 2 provides the architectures of the baseline models described below.
• Baseline 1:
The first baseline is an STL model constructed by training a Bi-GRU on the features obtained from the embedding layer to capture long-term dependencies, as defined in Section 3. In our experiments, we build an individual model for each dataset.
• Baseline 2:
This single-task learning (STL) model is an advanced version of Baseline 1, where the sentence encoder is additionally equipped with word-level attention [86].
• Baseline 3:
It is a multi-task model with a shared Bi-GRU followed by word-level attention that acts as a shared feature extractor for all the tasks.

The results obtained by our proposed model and the baseline systems for each task are reported in Table 4. The results demonstrate the efficiency of our proposed MTL framework over other models that explore state-of-the-art techniques based on single-task and multi-task neural networks. For the AiMed dataset, our proposed MTL model outperformed the Baseline 1 and Baseline 2 models in terms of F1-score. A similar trend was observed for the BioInfer dataset, where we observe F1-score improvements over both Baseline 1 and Baseline 2 (Table 4).
6. http://evexdb.org/pmresources/vec-space-models/
[Figure 2: Baseline 1 (Bi-GRU + MLP layer), Baseline 2 (Bi-GRU + word-level attention + MLP), and Baseline 3 (shared Bi-GRU + word-level attention) over Task 1: PPI (AiMed), Task 2: PPI (BioInfer), Task 3: DDI, and Task 4: i2b2.]
Fig. 2: Baseline models for various biomedical relation extraction tasks.
[Table 4 body: precision (P), recall (R), and F1 (F) of Baseline 1, Baseline 2, Baseline 3, and the proposed MTL approach on each task, including PPI-AiMed.]
TABLE 4: Evaluation results of the proposed MTL model and the baseline systems. Performance is reported in terms of 'P': Precision, 'R': Recall, and 'F': macro F1-score. Baseline 1 is a single-task learning model based on Bi-GRU. Baseline 2 is an STL model with Bi-GRU + word-level attention. Baseline 3 is an MTL model with a shared Bi-GRU layer. All the results are statistically significant.

Hyper-parameters               Value
Max sentence length            60
Embedding dimension            200
GRU hidden state dimension     64
Attention size (d_a)           350
Attention aspect size (a)      5
Hidden activation              relu
Dropout rate                   0.3
Output activation              softmax
Optimizer                      Adam
Learning rate                  0.001
α
β

TABLE 5: Optimal hyper-parameter values of the proposed model

For the DDI dataset, our proposed method attains F1-score improvements over Baseline 1 and Baseline 2. Lastly, for the i2b2-2010 clinical relation extraction dataset, the proposed method also demonstrates performance improvements over Baseline 1 and Baseline 2. We further observe that our model outperforms the Baseline 3 model on the AiMed, BioInfer, DDI, and i2b2-2010 clinical relation extraction datasets; the exact improvements are reported in Table 4. Overall, the findings show that using self-attention based adversarial multi-task learning to preserve the existing knowledge in a shared layer is helpful for a new task.
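The macro-averaged precision, recall, and F1 reported above weight every class equally, regardless of class frequency; a minimal sketch of their computation (toy labels, not the actual predictions):

```python
def macro_prf(gold, pred, labels):
    """Macro-averaged P/R/F1: per-class scores averaged with equal class weight."""
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Toy DDI-style labels, illustrative only
gold = ["int", "effect", "effect", "none"]
pred = ["int", "effect", "none", "none"]
p, r, f = macro_prf(gold, pred, ["int", "effect", "none"])
```

Note that the macro F1 here averages per-class F1 scores rather than computing F1 from the macro P and R, which is one common convention.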
Ablation Study:
To analyze the impact of the various components of our model, we perform an ablation study by removing one component at a time from the proposed model and evaluating the performance on all four tasks. We carried out two sets of ablation studies: (1) jointly training all four datasets together (c.f. Table 6), and (2) training two similar task datasets at a time, i.e., AiMed and BioInfer together (since both share the characteristic of containing protein-interaction information), and DDI and i2b2-RE together (c.f. Table 7).

One possible reason for the i2b2-RE dataset not benefiting from the adversarial learning component is that it is jointly trained with the DDI corpus, whose origin (clinical notes vs. biomedical literature) and characteristics are not very similar to those of the i2b2-RE dataset, in contrast to AiMed and BioInfer, which are both protein-interaction datasets. Another reason is the large difference between the numbers of samples in DDI and i2b2-RE. In this case, once the batches of the DDI dataset finish arriving, the model trains only with the i2b2-RE dataset, which could lead to a diminishing effect of the adversarial training.

Experimental results show that removing self-attention decreases the performance of the model across all the tasks, in both ablation study scenarios. However, we observe that adversarial learning was not as effective in ablation study 1, where we jointly trained all four datasets together.
Model                                                     AiMed (P/R/F1)   BioInfer (P/R/F1)   DDI (P/R/F1)   i2b2-RE (P/R/F1)
Proposed Model: Multi-task adversarial learning + GRU + self-attention
  − self-attention
  − adversarial learning
TABLE 6: Ablation study by jointly training all four datasets together.
Model                                                     AiMed (P/R/F1)   BioInfer (P/R/F1)   DDI (P/R/F1)   i2b2-RE (P/R/F1)
Proposed Model: Multi-task adversarial learning + GRU + self-attention
  − self-attention
  − adversarial learning
TABLE 7: Ablation study by training two similar datasets (AiMed + BioInfer) and (DDI + i2b2-RE) at a time.
System        Technique                                    AiMed (P / R / F)        BioInfer (P / R / F)
Our System    MTL-Adversarial (Bi-GRU + self-attention)    78.12 / 76.56 / 77.33    76.69 / 75.98 / 76.33
[40]*         LSTM

TABLE 8: Comparison with the SOTA techniques for the PPI task on the AiMed and BioInfer datasets. Performance is reported in terms of 'P': Precision, 'R': Recall, and 'F': F1-score (macro). [40]* denotes the re-implementation of the systems proposed in [32] with the authors' reported experimental setups using their publicly available source codes.

In ablation study 2, where similar tasks were trained together, removing the adversarial learning component led to a drop in performance across all the datasets. This shows that adversarial learning is most helpful for related tasks where the data distributions are similar.

In this section, we conduct a comparative analysis of our proposed method with the state-of-the-art (SOTA) models for all three tasks.
• Protein-Protein Interaction:
We compare our proposed model against the SOTA methods for both datasets, as presented in Table 8. The results show that the proposed model significantly outperforms the SOTA systems. From this we can conclude that our proposed multi-task model is more potent in extracting interacting protein pairs than the CNN based architecture established in [41] and the LSTM framework of [40]. Our adversarial MTL model achieves a significant F-score improvement over the LSTM based model [40] on the AiMed dataset. In the case of the BioInfer dataset, our proposed model was also able to achieve a significant performance improvement over [40], as shown in Table 8. In comparison to [41], we likewise observe an F-score improvement for the proposed model. This clearly demonstrates the effect of neural self-attention based adversarial learning in a multi-task setting.
• Drug-Drug Interaction:
We compare our proposed model with SOTA DDI extraction techniques, as shown in Table 9. Since there was a difference in the dataset statistics, we re-implemented the system proposed by [87] using their publicly available source code on our DDI extraction dataset. Our multi-task adversarial model obtains a significant F-score improvement over the state-of-the-art system [87], which exploited a Bi-LSTM with an attention mechanism. The obtained experimental results also illustrate that the model for the DDI task benefits from the other similar tasks, more specifically from the PPI tasks, because of the high semantic similarity between the sentences of the DDI and PPI tasks.
• Medical Concept Relation:
We were unable to make a direct comparison of our proposed approach with the systems that participated in the i2b2-2010 shared task due to the incomplete dataset. We instead compare our model with [78], as they also experimented with the same dataset. The results are reported in Table 10. We obtain F-score improvements over [78] (irrespective of their use of additional linguistic features) for the TeCP, TrCP, PIP, TrAP, and TeRP classes. This shows the usefulness of self-attention based adversarial learning in a multi-task setup, which gathers the complementary features for medical concept relations and other related tasks and improves the performance of the system.
Here, we closely examine, for each task, the various forms of errors that cause misclassification.
1)
PPI:
In the case of the AiMed and BioInfer datasets, we observe that in a sentence with multiple protein mentions, our proposed model fails to properly identify the interacting pair. For example:
Sentence 1: “We screened proteins for interaction with PROTEIN and cloned the full-length cDNA of human PROTEIN which encoded 1225 amino acids.”
Fig. 3: Graphs (a) and (b) show the Precision-Recall curves for the binary classification datasets, AiMed and BioInfer.

Fig. 4: Heatmap of the attention weight distributions on examples from the different datasets. The intensity of the colour increases with the attention weight.

Here, the actual label was true but our model predicted it as false. Repetitive mentions of proteins behave like noise that can inhibit the model from extracting contextually relevant information. We also made an interesting observation: in the presence of protein-interacting words (such as ‘bind’, ‘interact’), our model predicts the class label as ‘interacting’ (true). For example:
Sentence 1: “PROTEIN binding significantly increased hetero- and homo-oligomerization (except for the BR-II homo-oligomer, which binds ligand poorly in the absence of PROTEIN).”
Sentence 2: “PROTEIN is a muscle-specific HLH protein that binds DNA in vitro as a heterodimer with several widely expressed HLH proteins, such as the PROTEIN gene products E12 and E47.”
This is because these words often occur in the vicinity of the interacting protein mentions.
2)
DDI:
Apart from the highly imbalanced dataset issue, our model fails to capture the exact relationship between DDI pairs when the sentences are long. Another phenomenon that we observed as a source of misclassification was that the “
Int” type was often predicted as “
Effect ” class type. “
Int” class describes a coarse classification, i.e., there exists an interaction between the two drugs. This implies that there could be a positive or negative outcome, which is the main cause of the system often getting confused between the “
Int ” and “
Effect” class labels. For example:
Sentence 1: “Synergistic interaction between DRUGA and DRUGB is sequence dependent in human non small lung cancer with EGFR TKIs resistant mutation.”
We also found that some labels were incorrectly predicted because of class-specific keywords which exist in the sentence but are not related to the concerned entity pair. For example:
Sentence 1: “Interaction study of DRUGA and DRUGB with co administered drugs.”
Sentence 2: “If in certain cases, an DRUGA is considered necessary, it may be advisable to replace tamoxifen with DRUGB.”
In Sentence 1, the model got confused between the class labels ‘Int’ and ‘None’, while in Sentence 2, our model incorrectly predicted the class label as ‘Advice’.
3)
Medical Concept Relation:
We observe that, due to the close similarity between the class labels ‘
TeRP ’ and ‘
TeCP’, our model was often confused between these two classes. For example:
Sentence 1: “TEST x-ray revealed no PROBLEM, no congestive heart failure.”
In the given sentence, our model incorrectly predicted it as class ‘
TeRP’. We also observed that the majority of the misclassifications were between the ‘
PIP ’ and ‘
NONE ’ class. For example:
Sentence 1: “There was PROBLEM atrophy and PROBLEM encephalomalacia.”
Algorithm 1:
Training Process
Set the max number of epochs: epoch_max.
for t in {t_1, t_2, ..., t_K} do
    1. Pack the dataset t into mini-batches: D_t.
    2. Define the task-specific feature extractor model M_t.
    3. Initialize the model parameters Θ_t randomly.
end
Define the shared feature extractor model S and initialize its parameters Θ_s randomly.
Define the discriminator model D and initialize its parameters Θ_d randomly.
Pre-train the discriminator.
for epoch in 1, 2, ..., epoch_max do
    for t in {t_1, t_2, ..., t_K} do
        for batch in D_t do
            1. TF ← generate the task-specific features for batch using model M_t
            2. SF ← generate the shared features for batch using model S
            3. h̄ = TF ⊕ SF
            4. Compute the task-specific loss L_CE^t   // using Eq. (6)
            5. Compute the adversarial loss L_adv      // using Eq. (13)
            6. Update the parameters:
               Θ_t = Θ_t − η ∂L_CE^t/∂Θ_t
               Θ_d = Θ_d − η ∂L_adv/∂Θ_d
               Θ_s = Θ_s − η (∂L_CE/∂Θ_s − ∂L_adv/∂Θ_s)
        end
    end
end

We have carried out a visual analysis (Figure 4) to get an intuitive understanding of the attention weights in the multi-task attention model. Each sentence in the figure shows the attention distribution in the form of a heatmap for an instance of the corresponding dataset. The highlighted colours indicate the most relevant words in the sentence selected by the attention mechanism. For example, in the sentence “co-association of cd26 protein with the protein on the surface of human t lymphocytes”, the model is able to assign more weight to “co-association of cd26 protein”, which is relevant for correctly classifying the given protein pair as an interacting pair. For the binary classification datasets, AiMed and BioInfer, we also plot the Precision-Recall curves (c.f. Figure 3).
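The iteration order of Algorithm 1 and the sign-reversed update of the shared parameters Θ_s can be sketched as follows; task names, dataset sizes, and gradient values are illustrative, not the paper's actual configuration:

```python
def train_schedule(dataset_sizes, batch_size, epochs):
    """Sketch of Algorithm 1's iteration order (toy sizes).
    The real inner step would compute L_CE^t and L_adv for each batch and
    apply the three parameter updates shown in Algorithm 1."""
    steps = []
    for epoch in range(epochs):
        for task, n in dataset_sizes.items():
            for b in range((n + batch_size - 1) // batch_size):
                steps.append((epoch, task, b))
    return steps

def update_shared(theta_s, grad_ce, grad_adv, eta):
    # Shared parameters descend the task loss but ascend the adversarial loss
    # (gradient reversal): theta_s -= eta * (dL_CE/dtheta_s - dL_adv/dtheta_s).
    return theta_s - eta * (grad_ce - grad_adv)

# With imbalanced sizes, one task's batches are exhausted early, illustrating
# the DDI vs. i2b2-RE effect discussed in the ablation study.
steps = train_schedule({"DDI": 25, "i2b2": 100}, batch_size=25, epochs=1)
theta_s = update_shared(1.0, grad_ce=0.2, grad_adv=0.5, eta=0.1)
```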
CONCLUSION
In this paper, we proposed a unified multi-task learning approach that exploits the capabilities of adversarial learning for relation extraction in the biomedical domain. We experimented on three benchmark biomedical relation extraction tasks, i.e., protein-protein interaction, drug-drug interaction, and clinical relation extraction. For that, we utilized four popular datasets: AiMed, BioInfer, the SemEval 2013 DDI shared task dataset, and the i2b2-2010 clinical relation dataset. We demonstrated that our model shows superior performance compared to state-of-the-art models on all the tasks.
Although our model has shown significant improvements over state-of-the-art methods on all the tasks, we observed that our supervised model does not generalize well for class labels with few instances. In future work, we would like to develop a zero-shot learning method that could assist the model with the huge class imbalance issue.
Acknowledgement: Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by the Visvesvaraya Ph.D. Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia), for carrying out this research.

REFERENCES

[1] Z. Lu, “Pubmed and beyond: a survey of web tools for searching biomedical literature,” Database, vol. 2011, 2011.
[2] R. Khare, R. Leaman, and Z. Lu, “Accessing biomedical literature in the current information landscape,” Biomedical Literature Mining, pp. 11–31, 2014.
[3] S. Yadav, A. Ekbal, S. Saha, and P. Bhattacharyya, “Deep learning architecture for patient data de-identification in clinical records,” in Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP), 2016, pp. 32–41.
[4] S. Yadav, A. Ekbal, and S. Saha, “Feature selection for entity extraction from multiple biomedical corpora: A pso-based approach,” Soft Computing, vol. 22, no. 20, pp. 6881–6904, 2018.
[5] S. Yadav, A. Ekbal, S. Saha, and P. Bhattacharyya, “Entity extraction in biomedical corpora: An approach to evaluate word embedding features with pso based feature selection,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 1159–1170.
[6] M. G. Kann, “Protein interactions and disease: computational approaches to uncover the etiology of diseases,” Briefings in Bioinformatics, vol. 8, no. 5, pp. 333–346, 2007.
[7] R. C. Bunescu and R. J. Mooney, “A shortest path dependency kernel for relation extraction,” in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2005, pp. 724–731.
[8] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, “A graph kernel for protein-protein interaction extraction,” in Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. Association for Computational Linguistics, 2008, pp. 1–9.
[9] P. Thomas, M. Neves, T. Rocktäschel, and U. Leser, “Wbi-ddi: drug-drug interaction extraction using majority voting,” in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, 2013, pp. 628–635.
[10] Y. Liu, F. Wei, S. Li, H. Ji, M. Zhou, and H. Wang, “A dependency-based neural network for relation classification,” arXiv preprint arXiv:1507.04646, 2015.
[11] S.-P. Choi, “Extraction of protein–protein interactions (ppis) from the literature by deep convolutional neural networks with various feature embeddings,” Journal of Information Science, p. 0165551516673485, 2016.
[12] Z. Zhao, Z. Yang, H. Lin, J. Wang, and S. Gao, “A protein-protein interaction extraction approach based on deep neural network,” International Journal of Data Mining and Bioinformatics, vol. 15, no. 2, pp. 145–164, 2016.
[13] S. Yadav, A. Ekbal, S. Saha, P. Bhattacharyya, and A. Sheth, “Multi-task learning framework for mining crowd intelligence towards clinical treatment,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), vol. 2, 2018, pp. 271–277.
[14] S. Yadav, A. Ekbal, S. Saha, and P. Bhattacharyya, “A unified multi-task adversarial learning framework for pharmacovigilance mining,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5234–5245.
System                 Technique                                            P      R      F
Our System             Multi-task adversarial learning + GRU + attention    76.52  69.01  72.57
Joint AB-LSTM* [87]    Bi-directional LSTM + attention                      71.84  66.88  68.99
Yi et al.+ [88]        RNN + dynamic WE + multi-attention                   73.67  70.79  72.20
Joint AB-LSTM+ [87]    Bi-directional LSTM + attention                      73.41  69.66  71.48
DCNN+ [89]             Dependency based CNN                                 77.21  64.35  70.19
Liu et al.+ [90]       CNN                                                  75.72  64.66  69.75
SCNN+ [64]             Two stage syntax CNN                                 72.5   65.1   68.6

TABLE 9: Comparison with the SOTA techniques for the DDI task. Performance is reported in terms of 'P': Precision, 'R': Recall, and 'F': F1-score (macro). * denotes the re-implementations of the systems proposed in [32] with the authors' reported experimental setups using their publicly available source code. + denotes that the reported results may not be directly comparable with our proposed system because of the difference in dataset statistics.

[Table 10 body: F-score of Sahu et al. [78] and P/R/F of the proposed approach per relation type, including TeCP.]
TABLE 10: Comparison with the SOTA techniques for the i2b2-2010 clinical relation extraction task. For a fair comparison, we report the weighted F-score.

[15] G. Crichton, S. Pyysalo, B. Chiu, and A. Korhonen, “A neural network multi-task learning approach to biomedical named entity recognition,”
BMC Bioinformatics, vol. 18, no. 1, p. 368, Aug 2017. [Online]. Available: https://doi.org/10.1186/s12859-017-1776-8
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[17] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] S. Pyysalo, A. Airola, J. Heimonen, J. Björne, F. Ginter, and T. Salakoski, “Comparative analysis of five protein-protein interaction corpora,” in BMC Bioinformatics, vol. 9, no. 3. BioMed Central, 2008, p. S6.
[20] I. Segura-Bedmar, P. Martínez, and M. H. Zazo, “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013),” in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, 2013, pp. 341–350.
[21] Ö. Uzuner, B. R. South, S. Shen, and S. L. DuVall, “2010 i2b2/va challenge on concepts, assertions, and relations in clinical text,” Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 552–556, 2011.
[22] M. E. Savery, W. J. Rogers, M. Pillai, J. G. Mork, and D. Demner-Fushman, “Chemical entity recognition for medline indexing,” AMIA Summits on Translational Science Proceedings, vol. 2020, p. 561, 2020.
[23] T. R. Goodwin and D. Demner-Fushman, “A customizable deep learning model for nosocomial risk prediction from critical care notes with indirect supervision,” Journal of the American Medical Informatics Association, vol. 27, no. 4, pp. 567–576, 2020.
[24] A. Srivastava, A. Ekbal, S. Saha, P. Bhattacharyya et al., “A recurrent neural network architecture for de-identifying clinical records,” in Proceedings of the 13th International Conference on Natural Language Processing, 2016, pp. 188–197.
[25] S. Yadav, A. Ekbal, and S. Saha, “Information theoretic-pso-based feature selection: an application in biomedical entity extraction,” Knowledge and Information Systems, vol. 60, no. 3, pp. 1453–1478, 2019.
[26] A. Ekbal, S. Saha, P. Bhattacharyya et al., “A deep learning architecture for protein-protein interaction article identification.” IEEE, 2016, pp. 3128–3133.
[27] S. Yadav, P. Ramteke, A. Ekbal, S. Saha, and P. Bhattacharyya, “Exploring disorder-aware attention for clinical event extraction,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1s, pp. 1–21, 2020.
[28] C. Blaschke, M. A. Andrade, C. A. Ouzounis, and A. Valencia, “Automatic extraction of biological information from scientific text: protein-protein interactions,” in ISMB, vol. 7, 1999, pp. 60–67.
[29] T. Ono, H. Hishigaki, A. Tanigami, and T. Takagi, “Automated extraction of information on protein–protein interactions from the biological literature,” Bioinformatics, vol. 17, no. 2, pp. 155–161, 2001.
[30] R. Bunescu, R. Ge, R. J. Kate, E. M. Marcotte, R. J. Mooney, A. K. Ramani, and Y. W. Wong, “Comparative experiments on learning information extractors for proteins and their interactions,” Artificial Intelligence in Medicine, vol. 33, no. 2, pp. 139–155, 2005.
[31] Y. Miyao, K. Sagae, R. Sætre, T. Matsuzaki, and J. Tsujii, “Evaluating contributions of natural language parsers to protein–protein interaction extraction,” Bioinformatics, vol. 25, no. 3, pp. 394–400, 2008.
[32] N. Daraselia, A. Yuryev, S. Egorov, S. Novichkova, A. Nikitin, and I. Mazo, “Extracting human protein interactions from medline using a full-sentence parser,” Bioinformatics, vol. 20, no. 5, pp. 604–611, 2004.
[33] G. Erkan, A. Özgür, and D. R. Radev, “Semi-supervised classification for extracting protein interaction sentences using dependency parsing,” in EMNLP-CoNLL, vol. 7, 2007, pp. 228–237.
[34] R. Sætre, K. Sagae, and J. Tsujii, “Syntactic features for protein-protein interaction extraction,” LBM (Short Papers), vol. 319, 2007.
[35] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, “All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning,” BMC Bioinformatics, vol. 9, no. 11, p. S2, 2008.
[36] M. Zhang, J. Zhang, J. Su, and G. Zhou, “A composite kernel to extract relations between entities with both flat and structured features,” in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006, pp. 825–832.
[37] S. Kim, J. Yoon, J. Yang, and S. Park, “Walk-weighted subsequence kernels for protein-protein interaction extraction,” BMC Bioinformatics, vol. 11, no. 1, p. 107, 2010.
[38] Y.-C. Chang, C.-H. Chu, Y.-C. Su, C. C. Chen, and W.-L. Hsu, “Pipe: a protein–protein interaction passage extraction module for biocreative challenge,” Database, vol. 2016, 2016.
[39] S. Yadav, A. Kumar, A. Ekbal, S. Saha, and P. Bhattacharyya, “Feature assisted bi-directional lstm model for protein-protein interaction identification from biomedical texts,” arXiv preprint arXiv:1807.02162, 2018.
[40] Y.-L. Hsieh, Y.-C. Chang, N.-W. Chang, and W.-L. Hsu, “Identifying protein-protein interactions in biomedical literature using recurrent neural networks with long short-term memory,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 240–245.
[41] L. Hua and C. Quan, “A shortest dependency path based convolutional neural network for protein-protein relation extraction,”
BioMed ResearchInternational , vol. 2016, 2016.[42] S.-P. Choi, “Extraction of protein–protein interactions (ppis) from theliterature by deep convolutional neural networks with various featureembeddings,”
Journal of Information Science , vol. 44, no. 1, pp. 60–73,2018.[43] Y. Peng and Z. Lu, “Deep learning for extracting protein-protein inter-actions from biomedical literature,” arXiv preprint arXiv:1706.01556 ,2017.[44] S. Yadav, A. Ekbal, S. Saha, A. Kumar, and P. Bhattacharyya, “Featureassisted stacked attentive shortest dependency path based bi-lstm modelfor protein–protein interaction,”
Knowledge-Based Systems , vol. 166, pp.18–29, 2019.
[45] M. Ahmed, J. Islam, M. R. Samee, and R. E. Mercer, "Identifying protein-protein interaction using tree LSTM and structured attention," IEEE, 2019, pp. 224–231.
[46] S. K. Sahu and A. Anand, "Drug-drug interaction extraction from biomedical texts using long short-term memory network," Journal of Biomedical Informatics.
[47] Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 675–683.
[48] N. Hailu, L. E. Hunter, and K. B. Cohen, "UColorado SOM: extraction of drug-drug interactions from biomedical text using knowledge-rich and knowledge-poor features," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 684–688.
[49] J. Björne, S. Kaewphan, and T. Salakoski, "UTurku: drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 651–659.
[50] M. Rastegar-Mojarad, R. D. Boyce, and R. Prasad, "UWM-TRIADS: classifying drug-drug interactions with two-stage SVM and post-processing," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 667–674.
[51] S. Kim, H. Liu, L. Yeganova, and W. J. Wilbur, "Extracting drug–drug interactions from literature using a rich feature-based linear kernel approach," Journal of Biomedical Informatics, vol. 55, pp. 23–30, 2015.
[52] B. Bokharaeian and A. Díaz, "NIL UCM: Extracting drug-drug interactions from text through combination of sequence and tree kernels," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), 2013, pp. 644–650.
[53] M. F. M. Chowdhury and A. Lavelli, "FBK-irst: A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information," in Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, 2013, pp. 351–355.
[54] R. Harpaz, A. Callahan, S. Tamang, Y. Low, D. Odgers, S. Finlayson, K. Jung, P. LePendu, and N. H. Shah, "Text mining for adverse drug events: the promise, challenges, and state of the art," Drug Safety, vol. 37, no. 10, pp. 777–790, 2014.
[55] R. Xu and Q. Wang, "Large-scale automatic extraction of side effects associated with targeted anticancer drugs from full-text oncological articles," Journal of Biomedical Informatics, vol. 55, pp. 64–72, 2015.
[56] L. Qian and G. Zhou, "Tree kernel-based protein–protein interaction extraction from biomedical literature," Journal of Biomedical Informatics, vol. 45, no. 3, pp. 535–543, 2012.
[57] À. Bravo, J. Piñero, N. Queralt-Rosinach, M. Rautschka, and L. I. Furlong, "Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research," BMC Bioinformatics, vol. 16, no. 1, p. 55, 2015.
[58] B. Rink, S. Harabagiu, and K. Roberts, "Automatic extraction of relations between medical concepts in clinical texts," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 594–600, 2011.
[59] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, no. 11, p. S2, Nov. 2008. [Online]. Available: https://doi.org/10.1186/1471-2105-9-S11-S2
[60] D. Tikk, P. Thomas, P. Palaga, J. Hakenberg, and U. Leser, "A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature," PLoS Computational Biology, vol. 6, no. 7, p. e1000837, 2010.
[61] C. Giuliano, A. Lavelli, and L. Romano, "Exploiting shallow linguistic information for relation extraction from biomedical literature," in EACL, vol. 18. Citeseer, 2006, pp. 401–408.
[62] P. Thomas, M. Neves, I. Solt, D. Tikk, and U. Leser, "Relation extraction for drug-drug interactions using ensemble learning," Training, vol. 4, no. 2,402, pp. 21–425, 2011.
[63] M. F. M. Chowdhury, A. B. Abacha, A. Lavelli, and P. Zweigenbaum, "Two different machine learning techniques for drug-drug interaction extraction," Challenge Task on Drug-Drug Interaction Extraction, pp. 19–26, 2011.
[64] Z. Zhao, Z. Yang, L. Luo, H. Lin, and J. Wang, "Drug drug interaction extraction from biomedical literature using syntax convolutional neural network," Bioinformatics, vol. 32, no. 22, pp. 3444–3453, 2016.
[65] W. Zheng, H. Lin, L. Luo, Z. Zhao, Z. Li, Y. Zhang, Z. Yang, and J. Wang, "An attention-based effective neural model for drug-drug interactions extraction," BMC Bioinformatics, vol. 18, no. 1, p. 445, 2017.
[66] R. Kavuluru, A. Rios, and T. Tran, "Extracting drug-drug interactions with word and character-level recurrent neural networks," IEEE, 2017, pp. 5–12.
[67] W. Wang, X. Yang, C. Yang, X. Guo, X. Zhang, and C. Wu, "Dependency-based long short term memory network for drug-drug interaction extraction," BMC Bioinformatics, vol. 18, no. 16, p. 578, 2017.
[68] M. Asada, M. Miwa, and Y. Sasaki, "Extracting drug-drug interactions with attention CNNs," in BioNLP 2017, 2017, pp. 9–18.
[69] S. Lim, K. Lee, and J. Kang, "Drug drug interaction extraction from the literature using a recursive neural network," PLoS ONE, vol. 13, no. 1, p. e0190926, 2018.
[70] B. de Bruijn, C. Cherry, S. Kiritchenko, J. Martin, and X. Zhu, "NRC at i2b2: one challenge, three practical tasks, nine statistical systems, hundreds of clinical records, millions of useful features," in Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data. Boston, MA, USA: i2b2, 2010.
[71] N. Kang, R. J. Barendse, Z. Afzal, B. Singh, M. J. Schuemie, E. M. van Mulligen, and J. A. Kors, "Erasmus MC approaches to the i2b2 challenge," in Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data. Boston, MA, USA: i2b2, 2010.
[72] M. Jiang, Y. Chen, M. Liu, T. Rosenbloom, S. Mani, J. C. Denny, and H. Xu, "Hybrid approaches to concept extraction and assertion classification - Vanderbilt's systems for 2010 i2b2 NLP challenge," in Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data. Boston, MA, USA: i2b2, 2010.
[73] C. Grouin, A. Ben Abacha, D. Bernhard, B. Cartoni, L. Deleger, B. Grau, A.-L. Ligozat, A.-L. Minard, S. Rosset, and P. Zweigenbaum, "CARAMBA: Concept, Assertion, and Relation Annotation using Machine-learning Based Approaches," in i2b2 Medication Extraction Challenge Workshop, Washington, United States, Nov. 2010. [Online]. Available: https://hal.archives-ouvertes.fr/hal-00795663
[74] K. Roberts, B. Rink, and S. Harabagiu, "Extraction of medical concepts, assertions, and relations from discharge summaries for the fourth i2b2/VA shared task," in Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data. Boston, MA, USA: i2b2, 2010.
[75] S. Jonnalagadda, T. Cohen, S. Wu, and G. Gonzalez, "Enhancing clinical concept extraction with distributional semantics," Journal of Biomedical Informatics, vol. 45, no. 1, pp. 129–140, 2012.
[76] B. De Bruijn, C. Cherry, S. Kiritchenko, J. Martin, and X. Zhu, "Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010," Journal of the American Medical Informatics Association, vol. 18, no. 5, pp. 557–562, 2011.
[77] D. Demner-Fushman, E. Apostolova, R. Islamaj Dogan, F.-M. Lang, J. Mork, A. Neveol, S. Shooshan, M. Simpson, and A. Aronson, "NLM's system description for the fourth i2b2/VA challenge," in Proceedings of the 2010 i2b2/VA Workshop on Challenges in Natural Language Processing for Clinical Data. Boston, MA, USA: i2b2, 2010.
[78] S. K. Sahu, A. Anand, K. Oruganty, and M. Gattu, "Relation extraction from clinical texts using domain invariant convolutional neural network," arXiv preprint arXiv:1606.09370, 2016.
[79] Y. Luo, "Recurrent neural networks for classifying relations in clinical notes," Journal of Biomedical Informatics.
[80] Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2004, p. 430.
[81] M. Bundschus, M. Dejori, M. Stetter, V. Tresp, and H.-P. Kriegel, "Extraction of semantic biomedical relations from text using conditional random fields," BMC Bioinformatics, vol. 9, no. 1, p. 207, 2008.
[82] D. Ningthoujam, S. Yadav, P. Bhattacharyya, and A. Ekbal, "Relation extraction between the clinical entities based on the shortest dependency path based LSTM," arXiv preprint arXiv:1903.09941, 2019.
[83] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," arXiv preprint arXiv:1703.03130, 2017.
[84] P. Liu, X. Qiu, and X. Huang, "Adversarial multi-task learning for text classification," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada: Association for Computational Linguistics, July 2017, pp. 1–10. [Online]. Available: http://aclweb.org/anthology/P17-1001
[85] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," arXiv preprint arXiv:1409.7495, 2014.
[86] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, "Hierarchical attention networks for document classification," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
[87] S. K. Sahu and A. Anand, "Drug-drug interaction extraction from biomedical text using long short term memory network," arXiv preprint arXiv:1701.08303, 2017.
[88] Z. Yi, S. Li, J. Yu, Y. Tan, Q. Wu, H. Yuan, and T. Wang, "Drug-drug interaction extraction via recurrent neural network with multiple attention layers," in International Conference on Advanced Data Mining and Applications. Springer, 2017, pp. 554–566.
[89] S. Liu, K. Chen, Q. Chen, and B. Tang, "Dependency-based convolutional neural network for drug-drug interaction extraction," in Bioinformatics and Biomedicine (BIBM), 2016 IEEE International Conference on. IEEE, 2016, pp. 1074–1080.
[90] S. Liu, B. Tang, Q. Chen, and X. Wang, "Drug-drug interaction extraction via convolutional neural networks,"