Reconstruction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates
Maurice HT Ling, Christophe Lefevre, Kevin R. Nicholas, Feng Lin
BioInformatics Research Centre, Nanyang Technological University, Singapore
CRC for Innovative Dairy Products, Department of Zoology, The University of Melbourne, Australia
Victorian Bioinformatics Consortium, Monash University
Abstract.
The exponential increase in the publication rate of new articles is limiting researchers' access to relevant literature. This has prompted the use of text mining tools to extract key biological information. Previous studies have reported extensive modification of existing generic text processors to process biological text. However, this requirement for modification has not been examined. In this study, we constructed Muscorian using MontyLingua, a generic text processor. It uses a previously proposed two-layered generalization-specialization paradigm, in which text is generically processed into a suitable intermediate format before domain-specific data extraction techniques are applied at the specialization layer. Evaluation using a corpus and experts indicated 86-90% precision and approximately 30% recall in extracting protein-protein interactions, which is comparable to previous studies using either specialized biological text processing tools or modified existing tools. Our study also demonstrated the flexibility of the two-layered generalization-specialization paradigm by using the same generalization layer for two specialized information extraction tasks.
Keywords: biomedical literature analysis, protein-protein interaction, montylingua
PubMed currently indexes more than 16 million papers, with about one million and 1.2 million papers added in 2005 and 2006 respectively. A simple keyword search in PubMed showed that nearly 900 thousand papers on mouse and more than 1.3 million papers on rat research have been indexed to date, and in the last four years more than 150 thousand papers have been published on each of mouse and rat research. This growth in the volume of research papers indexed in PubMed over the last 10 years makes it difficult for researchers to maintain an active and productive assessment of relevant literature. Information extraction (IE) has been used as a tool to analyze biological text to derive assertions on specific biological domains [30], such as protein phosphorylation [19] or entity interactions [1].

A number of IE tools used for mining information from biological text can be classified either as tools designed for general application or as tools that treat biological text as specialized text requiring domain-specific processing. The latter view has led to the development of specialized part-of-speech (POS) tag sets (such as SPECIALIST [28]), POS taggers (such as MedPost [33]), ontologies [11], text processors (such as MedLEE [15]), and full IE systems, such as GENIES [16], MedScan [29], MeKE [4], Arizona Relation Parser [10], and GIS [5]. An alternative approach assumes that biological text is not specialized enough to warrant re-development of tools, and that adaptation of existing or generic tools will suffice. To this end, BioRAT [12] modified GATE [8], MedTAKMI [36] modified TAKMI [27], which was originally used in call centres, and Santos [31] used the Link Grammar parser [32].

Although both approaches demonstrated similar performance, both developing specialized systems and modifying existing systems were time consuming [20].
Work by Grover [17] suggested that native generic tools may be used for biological text, and a recent review highlighted successful uses of a generic text processing system, MontyLingua [14, 23], for a number of purposes [22]. For example, MontyLingua has been used to process published economics papers for concept extraction [35]. However, the need to modify generic text processors has not been formally examined, and the question of whether an un-modified, generic text processor can be used in biological text analysis with comparable performance remains to be assessed.

In this study, we evaluated a native, generic text processing system, MontyLingua [23], in a two-layered generalization-specialization architecture [29], where the generalization layer processes biological text into an intermediate knowledge representation from which the specialization layer extracts genic or entity-entity interactions. This system demonstrated 86.1% precision using the Learning Language in Logic 2005 evaluation data [9], and 88.1% and 90.7% precision in extracting protein-protein binding and activation interactions respectively. Our results were comparable to previous work that modified generic text processing systems, which reported precision ranging from 53% [24] to 84% [5], suggesting that such modification may not improve the efficiency of information retrieval.
We have developed a biological text mining system, known as Muscorian, for mining protein-protein inter-relationships in the form of subject-relation-object assertions (for example, protein X bind protein Y). Muscorian is implemented as a 3-module sequential system of entity normalization, text analysis, and protein-protein binding finding, as shown in Figure 1. It is available for academic and non-profit users through http://ib-dwb.sf.net/Muscorian.html.

Fig 1. Schematic Diagram Illustrating the Operations of Muscorian
Entity normalization is the substitution of the long form of either a biological or chemical term with its abbreviated form. This is essential to correct part-of-speech tagging errors, which are common in biological text due to multi-worded nouns. For example, the protein name “phosphatase and tensin homolog deleted on chromosome 10” has to be recognized as a single noun and not a phrase. In this study, we attempt to mine protein-protein interactions and consolidate this knowledge to produce a map. Therefore, the naming convention of the protein entities must be standardized to allow for matching. However, this is not the case for biological text, and synonymous protein names exist for virtually every protein. For example, “MAP kinase kinase”, “MAPKK”, “MEK” and “MAPK/Erk kinase” refer to the same protein. Both of these problems could be either resolved or minimized by reducing multi-worded nouns into their abbreviated forms.

A dictionary-based approach was used for entity normalization to achieve a high level of accuracy and consistency. The dictionary was assembled as follows: firstly, a set of 25000 abstracts from PubMed was used to interrogate Stanford University's BioNLP server [3] to obtain a list of long forms with their abbreviations and a calculated score. Secondly, only results with a score of more than 0.88 were retained, as this is an inflection point of the ROC graph [3] and gives a good balance between obtaining the most information and reducing curation effort. Lastly, the set of long forms and their abbreviations was manually curated with the help of domain experts.

The expert-curated dictionary of long forms and their abbreviated terms was used to construct a regular expression engine that recognizes the long form of a biological or chemical term and substitutes it with its corresponding abbreviated form.

Text Analysis
Entity-normalized abstracts were then analyzed by an un-modified text processing engine, MontyLingua [14], where they were tokenized, part-of-speech tagged, chunked, stemmed and processed into a set of assertions in the form of 3-element subject-verb-object(s) (SVO) tuples, or more generally, subject-relation-object(s) tuples. Thus, the sequential pattern of words that formed an abstract was transformed, through a series of pattern recognition steps, into a set of structurally definable assertions.

Before part-of-speech tagging is possible, an abstract made up of one or more sentences has to be separated into individual sentences. This is done by regular expression recognition of sentence delimiters, such as full-stop, ellipsis, exclamation mark and question mark, at the end of a word (regular expression: ([?!]+|[.][.]+)$), with an exception for acronyms. Acronyms that are commonly written with a full-stop, for example “Dr.”, are not treated as the end of a sentence; such false sentence breaks were generally prevented by an enumeration of common acronyms.

Individual sentences were then separated into constituent words and punctuation marks by a process known as tokenization. Tokenization, which is essential to break a sentence into atomic syntactic building blocks, is generally a simple process of splitting an English sentence into words using the whitespace in the sentence, resulting in a list of tokens (words). However, there were three problems, which were corrected by examining each token. Firstly, punctuation is crucial to understanding a written English sentence, but typographically a punctuation mark is usually joined to the preceding word. Hence, separation of punctuation from the preceding word is necessary. However, this resulted in incorrect tokenization of acronyms and decimal numbers. For example, “... an appt. for ...” will be tokenized to “... an appt . for ...” and “$4.20” will be “$ 4 . 20”.
This problem was prevented by pre-defining acronyms and using regular expressions, such as “^[$][0-9]{1,3}[.][0-9][0-9](?[.]?)$”. Lastly, common contracted words, such as “don't”, were expanded into two tokens, “do” and “n't”.

Despite the above error correction measures, certain text such as mathematical equations, which might be used to describe enzyme kinetics in biological text, will not be tokenized correctly. In spite of this limitation, the described tokenization scheme is still appropriate, as extraction of enzyme kinetics or mathematical representations is not an aim of this study.

Each of the tokens (words and punctuation marks) in a tokenized sentence is then tagged using the Penn Treebank Tag Set [25] by a Brill tagger, trained on the Wall Street Journal and Brown corpora, which operates in two phases. In the first phase, each word is tagged using a lexicon containing the likely tag for each word. This is followed by a phase of correction using lexical and contextual rules, which were learnt by training on a tagged corpus, in this case the Wall Street Journal and Brown corpora. Lexical rules use a combination of the tag previously assigned to the token (word) in question and its prefix or suffix. For example, the rule “NN ing fhassuf 3 VBG” defines that if the current token is tagged as a noun (NN) and has a 3-character suffix of “ing”, then the tag should be changed to a verb (VBG). Contextual rules, on the other hand, use only the preceding or following tags and hence must be applied after lexical rules to be effective. The contextual rule “RB JJ NEXTTAG NN” defines that an adverbial tag (RB) should be changed to an adjective (JJ) if the next token is tagged as a noun (NN). The Penn Treebank Tag Set [25], without punctuation tags, is given in Table 1.

Tag   Description                                Tag   Description
CC    Coordinating conjunction                   PRP$  Possessive pronoun
CD    Cardinal number                            RB    Adverb
DT    Determiner                                 RBR   Adverb, comparative
EX    Existential there                          RBS   Adverb, superlative
FW    Foreign word                               RP    Particle
IN    Preposition or subordinating conjunction   SYM   Symbol
JJ    Adjective                                  TO    to
JJR   Adjective, comparative                     UH    Interjection
JJS   Adjective, superlative                     VB    Verb, base form
LS    List item marker                           VBD   Verb, past tense
MD    Modal                                      VBN   Verb, past participle
NN    Noun, singular or mass                     VBG   Verb, gerund or present participle
NNS   Noun, plural                               VBP   Verb, non-3rd person singular present
NNP   Proper noun, singular                      VBZ   Verb, 3rd person singular present
NNPS  Proper noun, plural                        WDT   Wh-determiner
PDT   Predeterminer                              WP    Wh-pronoun
POS   Possessive ending                          WP$   Possessive wh-pronoun
PRP   Personal pronoun                           WRB   Wh-adverb

Table 1. Penn Treebank Tag Set without Punctuation Tags (Adapted from [25])
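The two kinds of correction rules used in the second phase of Brill tagging can be illustrated with a minimal Python sketch. It is not MontyLingua's implementation: the function names, the simplified rule arguments and the example tokens with their initial tags are assumptions made for illustration, and only the two rules quoted in the text (“NN ing fhassuf 3 VBG” and “RB JJ NEXTTAG NN”) are modelled.

```python
def apply_lexical_rule(tokens, tags, from_tag, suffix, to_tag):
    """fhassuf-style rule: retag a token that carries from_tag and ends with suffix."""
    return [
        to_tag if tag == from_tag and token.endswith(suffix) else tag
        for token, tag in zip(tokens, tags)
    ]


def apply_contextual_rule(tags, from_tag, to_tag, next_tag):
    """NEXTTAG-style rule: retag from_tag as to_tag when the following tag is next_tag."""
    corrected = list(tags)
    for i in range(len(corrected) - 1):
        if corrected[i] == from_tag and corrected[i + 1] == next_tag:
            corrected[i] = to_tag
    return corrected


def correct_tags(tokens, initial_tags):
    """Apply the lexical rule first, then the contextual rule, as described in the text."""
    # "NN ing fhassuf 3 VBG": a noun ending in the 3-character suffix "ing" becomes VBG.
    tags = apply_lexical_rule(tokens, initial_tags, "NN", "ing", "VBG")
    # "RB JJ NEXTTAG NN": an adverb followed by a noun becomes an adjective.
    return apply_contextual_rule(tags, "RB", "JJ", "NN")


# Hypothetical tokens and lexicon-phase tags, chosen only to trigger both rules.
example_tokens = ["insulin", "binding", "likely", "target"]
example_initial_tags = ["NN", "NN", "RB", "NN"]
example_corrected = correct_tags(example_tokens, example_initial_tags)
```

Applying the lexical rule before the contextual rule mirrors the ordering given in the text, since contextual rules depend on neighbouring tags that the lexical rules may have just changed.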
By tagging, the complexity of an English sentence (i.e., the number of ways an English sentence can be grammatically constructed with virtually unlimited words and unlimited ideas) was collapsed into a sequence of part-of-speech tags, in this case from the Penn Treebank Tag Set [25], with only about 40 tags. Tagging therefore reduced the large number of English words to about 40 “words” or tags.

Generally, an English sentence is composed of a noun phrase, a verb, and a verb phrase, where the verb phrase may be reduced into more noun phrases, verbs, and verb phrases. More precisely, English is an example of a subject-verb-object typology, which accounts for 75% of all languages in the world [7]. This concept of English sentence structure is used to process a tagged sentence into higher-order structures of phrases by a process of chunking, which is a precursor to the extraction of semantic relationships between nouns into the SVO structure. Using only the sequence of tags, chunking was performed as a recursive 4-step process: protecting verbs, recognition of noun phrases, unprotecting verbs, and recognition of verb phrases. Firstly, verb tags (VBD, VBG and VBN) were protected by suffixing the tags. The main purpose was to prevent interference in recognizing noun phrases. Secondly, noun phrases were recognized by the following regular expression pattern of tags:

((((PDT )?(DT |PRP[$] |WDT |WP[$] )(VBG |VBD |VBN |JJ |JJR |JJS |, |CC |NN |NNS |NNP |NNPS |CD )*(NN |NNS |NNP |NNPS |CD )+)|((PDT )?(JJ |JJR |JJS |, |CC |NN |NNS |NNP |NNPS |CD )*(NN |NNS |NNP |NNPS |CD )+)|EX |PRP |WP |WDT )POS )?(((PDT )?(DT |PRP[$] |WDT |WP[$] )(VBG |VBD |VBN |JJ |JJR |JJS |, |CC |NN |NNS |NNP |NNPS |CD )*(NN |NNS |NNP |NNPS |CD )+)|((PDT )?(JJ |JJR |JJS |, |CC |NN |NNS |NNP |NNPS |CD )*(NN |NNS |NNP |NNPS |CD )+)|EX |PRP |WP |WDT )
Thirdly, the verb tags protected in the first step were unprotected by removing the suffix appended to the tags. Lastly, verb phrases were recognized by the following regular expression:

(RB |RBR |RBS |WRB )*(MD )?(RB |RBR |RBS |WRB )*(VB |VBD |VBG |VBN |VBP |VBZ )(VB |VBD |VBG |VBN |VBP |VBZ |RB |RBR |RBS |WRB )*(RP )?(TO (RB )*(VB |VBN )(RP )?)?
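As an illustration of chunking over tags alone, the verb-phrase expression above can be run against a space-terminated tag string. The sketch below is an assumption-laden example rather than MontyLingua's code: the function name, the `(?<!\S)` anchor (added here so that a match can only start at a tag boundary) and the conversion of character offsets back to token indices are all choices made for this example.

```python
import re

# Verb-phrase pattern from the text; each tag is followed by a single space.
# The leading (?<!\S) is an addition that forces matches to start at a tag boundary.
VERB_PHRASE = re.compile(
    r"(?<!\S)"
    r"(RB |RBR |RBS |WRB )*(MD )?(RB |RBR |RBS |WRB )*"
    r"(VB |VBD |VBG |VBN |VBP |VBZ )"
    r"(VB |VBD |VBG |VBN |VBP |VBZ |RB |RBR |RBS |WRB )*"
    r"(RP )?(TO (RB )*(VB |VBN )(RP )?)?"
)


def find_verb_phrases(tags):
    """Return (start, end) token index pairs, end exclusive, for each verb phrase."""
    tag_string = "".join(tag + " " for tag in tags)
    spans = []
    for match in VERB_PHRASE.finditer(tag_string):
        # Count whole tags before the match to convert a character offset to a token index.
        start = len(tag_string[:match.start()].split())
        length = len(match.group(0).split())
        spans.append((start, start + length))
    return spans


# Hypothetical tag sequences: "X can not bind the Y" and "X appears to bind Y".
modal_example = find_verb_phrases(["NNP", "MD", "RB", "VB", "DT", "NN"])
infinitive_example = find_verb_phrases(["NN", "VBZ", "TO", "VB", "NN"])
```

Because every alternative in the pattern ends with a space, a tag such as VBZ cannot be partially matched as VB, which is what makes matching on the joined tag string safe.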
After chunking, each word (token) was stemmed into its root or infinitive form. Firstly, each word was matched against a set of rules for specific stemming. For example, the rule “dehydrogenised verb dehydrogenate” defines that if the word “dehydrogenised” is tagged as a verb (VBD, VBG or VBN tag), it is stemmed into “dehydrogenate”. Similarly, the words “binds”, “binding” and “bounded” were stemmed to “bind”. Secondly, irregular words that could not be stemmed by removal of prefixes and suffixes, such as “calves” and “cervices”, were stemmed using a pre-defined dictionary. Lastly, stemming was done by simple removal of prefixes or suffixes from the word, based on a list of common prefixes and suffixes. For example, “regards” and “regarding” were both stemmed into “regard”.

Given that an English sentence is generally an aggregation of a noun phrase, a verb, and a verb phrase, where the verb phrase may be reduced into more noun phrases, verbs, and verb phrases, each verb phrase may be taken as a sentence by itself. This allowed for recursive processing of a chunked and stemmed sentence into SVO(s) by a 3-step process. Firstly, the first terminal noun phrase, delimited by “(NX” and “NX)”, was taken as the subject noun. Secondly, proceeding from the first terminal noun phrase, the first terminal verb was taken as the verb in the SVO. Lastly, the rest of the phrase was scanned for terminal noun phrases, which were taken as the object(s). The recursive nature of SVO extraction also meant that the subject, verb, and object(s) were contiguous, which has been demonstrated to give better precision than non-contiguous SVOs [26].

Protein-Protein Binding Finding
The protein-protein binding finder module is a data miner for protein-protein binding interaction assertions from the entire set of subject-relation-object (SVO) assertions produced by the text analysis process, using a priori knowledge. That is, the set of proteins of interest must be known in advance; the module does not attempt to uncover new protein entities, or their binding relationships with other protein entities, that are not known to the researcher.

Protein-protein binding assertions were extracted in a three-step process. Firstly, a set of SVOs was isolated by the presence of the term “bind” in the verb clause, resulting in a set of “bind-SVOs” assertions. Non-infinitive forms of “bind” (such as “binding” and “binds”) were not used, as verbs were stemmed into their infinitive forms during text processing. Secondly, the set of bind-SVOs was further characterized for the presence of protein entities in both subject and object clauses by comparing with the desired list of protein entities. A pairwise isolation of bind-SVOs for protein entities resulted in a set of bind-SVOs, “entity-bind-SVOs”, containing SVOs describing binding relationships between the protein entities. Lastly, entity-bind-SVOs were cleaned so that the subject and object clauses contain only protein entities. For example, “MAPK in the cytoplasm” in the object clause will be reduced to just the entity name “MAPK”. The full subject and object clauses could be used in other information extraction tasks, such as determining protein localization, but this is not explored in this study. This cleaning step is required to allow for the construction of network graphs, such as with Graphviz, without reference to the list of protein names during construction.
Given that protein_entities is the list of desired proteins, table SVO contains the SVO output from MontyLingua, and table entity_bind_SVO contains the isolated and cleaned SVOs, the pseudocode for the Protein-Protein Binding Finding module is given as:

for subject_protein in protein_entities
    for object_protein in protein_entities
        insert (pmid, subject_protein, object_protein) into entity_bind_SVO
            from select pmid
                 from (select * from SVO where verb = 'bind')
                 where subject is containing subject_protein
                   and object is containing object_protein
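The pseudocode above can be made concrete with a short sketch using Python's built-in sqlite3 module. This is an illustration under stated assumptions and not Muscorian's actual code: the column names of the SVO table (pmid, subject, verb, object), the use of SQLite's instr() for the “is containing” test, and the demonstration rows are all hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory SVO table with a few hypothetical, already-stemmed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE SVO (pmid INTEGER, subject TEXT, verb TEXT, object TEXT)")
    conn.executemany(
        "INSERT INTO SVO VALUES (?, ?, ?, ?)",
        [
            (1, "creb", "bind", "cpg15 promoter in the nucleus"),
            (2, "insulin", "activate", "creb"),
            (3, "mapk in the cytoplasm", "bind", "mek"),
        ],
    )
    return conn


def find_interactions(conn, protein_entities, verb="bind"):
    """Fill entity_bind_SVO with (pmid, subject_protein, object_protein) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entity_bind_SVO "
        "(pmid INTEGER, subject_protein TEXT, object_protein TEXT)"
    )
    for subject_protein in protein_entities:
        for object_protein in protein_entities:
            # Storing only the entity names performs the cleaning step described in the text.
            conn.execute(
                "INSERT INTO entity_bind_SVO "
                "SELECT pmid, ?, ? FROM SVO "
                "WHERE verb = ? AND instr(subject, ?) > 0 AND instr(object, ?) > 0",
                (subject_protein, object_protein, verb, subject_protein, object_protein),
            )
    conn.commit()
    return conn.execute(
        "SELECT DISTINCT pmid, subject_protein, object_protein FROM entity_bind_SVO "
        "ORDER BY pmid, subject_protein, object_protein"
    ).fetchall()


demo_proteins = ["creb", "cpg15", "insulin", "mapk", "mek"]
demo_bindings = find_interactions(make_demo_db(), demo_proteins)
```

Plain substring matching is the simplest reading of “is containing”; it would also match a short name inside a longer one (for example “mapk” inside “mapkk”), so a stricter word-boundary test may be preferable in practice.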
Four experiments were carried out to evaluate the performance of Muscorian and to demonstrate the flexibility of the two-layered generalization-specialization approach in constructing systems that can readily be adapted to related problems. The results are summarized in Table 2.
            LLL05         LLL05            Protein-Protein   Protein-Protein
            Directional   Un-directional   Binding           Activation
Precision   55.8%         86.1%            88.1%             90.7%
Recall      19.8%         30.7%            Not measured      Not measured

Table 2. Summary of the Experimental Results Comparing the Precision and Recall Measures.
The performance of Muscorian, in terms of precision and recall, could only be evaluated using a defined data set with known results. For this purpose, the data set for Learning Language in Logic 2005 (LLL05) [9] was used to benchmark Muscorian on genic interactions, which are a superset of protein-protein binding interactions. LLL05 defined a genic interaction as an interaction between 2 entities (agent and target), but the nature of the interaction was not considered under the challenge task. LLL05 provided a list of protein entities found in the data set, which was used to filter subject-relation-object assertions from the text analysis (MontyLingua) output for those where both subject and object contained protein entities in the given list. The filtered list of assertions was evaluated for precision and recall, which were found to be 55.6% and 19.8% respectively.

LLL05 required the agent and target (subject and object) to be in the correct direction, making the interaction a vector quantity. However, this requirement is not biologically significant for protein-protein binding interactions, which are scalar. For example, “X binds to Y” and “Y binds to X” have no biological difference. Hence, this requirement of directionality was eliminated, and the resulting precision and recall were 86.1% and 30.7% respectively.
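The effect of dropping directionality on the two measures can be seen in a small sketch. The function and the toy interaction pairs below are hypothetical and are not the LLL05 data or its official scorer; they only show how treating (agent, target) pairs as unordered changes which predictions count as correct.

```python
def precision_recall(predicted, gold, directional=True):
    """Score predicted interaction pairs against gold pairs.

    With directional=False each pair is treated as unordered, so (X, Y) and
    (Y, X) count as the same interaction.
    """
    def normalize(pair):
        return tuple(pair) if directional else tuple(sorted(pair))

    predicted_set = {normalize(pair) for pair in predicted}
    gold_set = {normalize(pair) for pair in gold}
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy data: one prediction has agent and target swapped relative to the gold standard.
toy_gold = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]
toy_predicted = [("B", "A"), ("C", "D")]

directional_scores = precision_recall(toy_predicted, toy_gold, directional=True)
undirected_scores = precision_recall(toy_predicted, toy_gold, directional=False)
```

In this toy case the swapped pair is an error under the directional scheme and a correct prediction under the un-directional one, which is the same mechanism by which both measures rise once direction is ignored.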
The precision of Muscorian for mining protein-protein binding interactions from published abstracts was evaluated by manual verification of a sample of assertions (n=135), yielded by the protein-protein binding finder module, against the original abstracts. Each of the sampled assertions was assumed to be atomic, in the form of “X binds Y”. Assertions with more than one target, such as “X binds Y and Z”, were reduced to atomic assertions. In this case, “X binds Y and Z” would be reduced to 2 assertions, “X bind Y” and “X bind Z”. These were then checked against the original abstracts, traceable by their PubMed IDs, and precision was measured as the ratio of the number of correct assertions to the number of sampled atomic assertions (which is 135). A 95% confidence interval was estimated by bootstrapping (re-sampling with replacement) [13] of the manual verification results. Our results suggested a precision of 88.1%, with a 95% confidence interval of 82.4% to 93.7%.

An IE trial was performed using the Protein-Protein Binding Finding module to search for the binding partners of CREB and insulin receptor, and sample network diagrams of the results are shown in Figures 2 and 3 respectively.

Fig 2. Preliminary Protein Binding Network of CREB
Fig 3. Preliminary Protein Binding Network of Insulin Receptor
A large-scale mining of protein-protein binding interactions was carried out using all of the PubMed abstracts on mouse (about 860000 abstracts), which were obtained using “mouse” as the search keyword, with a predefined set of about 3500 abbreviated protein entities as the list of proteins of interest (available from http://cvs.sourceforge.net/viewcvs.py/ib-dwb/muscorian-data/protein_accession.csv?rev=1.2&view=markup). In this experiment, the primary aim was to apply Muscorian to a large data set, and the secondary aim was to look for multiple occurrences of the same interactions, as multiple occurrences might greatly improve confidence in precision.

For example, our lower confidence estimate of the precision of Muscorian in mining protein-protein binding interactions is 82%, which means that every binding assertion has an 18% likelihood of not having a corresponding representation in the published abstracts. However, if 2 abstracts yielded the same binding assertion, the probability of both being wrong is reduced to 3.2% (0.18^2), and the corresponding probability that at least one of the 2 assertions was correctly represented is 96.8% (1-0.18^2). The more times the same assertion was extracted from multiple source texts (abstracts), the higher the probability that the mined interaction was represented at least once in the set of abstracts.
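The arithmetic behind these figures is 1 - (1 - p)^n, where p is the per-assertion precision and n is the number of abstracts yielding the same assertion. A minimal sketch follows; the function name is hypothetical, and the calculation assumes, as the argument in the text does, that extraction errors in different abstracts are independent.

```python
def probability_at_least_one_correct(n_abstracts, precision=0.82):
    """Probability that at least one of n independent extractions is correct."""
    error_rate = 1.0 - precision
    return 1.0 - error_rate ** n_abstracts


# Probabilities for assertions supported by 1 to 7 abstracts, using the 82% lower estimate.
support_probabilities = {
    n: probability_at_least_one_correct(n) for n in range(1, 8)
}
```

The probability rises quickly with n because the chance of every extraction being wrong shrinks geometrically.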
For example, if 5 abstracts yielded the same assertion, the probability that at least one of the 5 assertions was correctly represented would be 99.98% (1-0.18^5).

Our experiment mined a total of 9803 unique protein-protein binding interactions, of which 7049 binding interactions were from one abstract (P=82%), 1297 binding interactions were from two abstracts (P=96.8%), 516 binding interactions were from three abstracts (P=99.4%), 235 binding interactions were from four abstracts (P=99.9%), 164 binding interactions were from five abstracts (P=99.98%), 105 binding interactions were from six abstracts (P=99.997%), 69 binding interactions were from seven abstracts (P=99.9993%), and 398 binding interactions were from more than seven abstracts (P>99.9993%).

In order to demonstrate the adaptability of our proposed two-layered model, a small pilot study for mining protein-protein activation interactions was carried out. For this study, the protein-protein binding finder module, the data mining module for mining protein-protein binding interactions, was replaced with a protein-protein activation finder module.

The protein-protein activation finder was semantically similar to the original protein-protein binding finder module described in Section 3.3. The only difference was that raw assertion output from MontyLingua was filtered for activation-related assertions, instead of binding-related assertions, before analysis for the presence of protein names, from a pre-defined list of proteins of interest, in both subject and object nouns. For example, by modifying the Protein-Protein Binding Finding module to look for the verb 'activate' instead of 'bind', it can then be used for mining protein-protein activation interactions. A trial was done for insulin activation and a subgraph is illustrated in Figure 4 below.
Fig 4. Preliminary Protein Activation Network of Insulin
The precision of Muscorian for mining protein-protein activation interactions was calculated by the same means as described for protein-protein binding interactions. Using a sample of 85 atomic assertions, the precision of Muscorian for mining protein-protein activation interactions was estimated to be 90.7%, with a 95% confidence interval of 84.7% to 96.4% by bootstrapping [13].
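The bootstrapped confidence intervals reported for the binding and activation samples can be approximated with a percentile bootstrap over the per-assertion verification outcomes. The sketch below is an assumption, not the procedure actually used in the study: the percentile method, the number of resamples and the fixed seed are choices made for this example, and the count of 119 correct assertions out of 135 is a hypothetical figure chosen only because it rounds to the reported 88.1%.

```python
import random


def bootstrap_precision_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for precision.

    outcomes is a list of 1 (assertion verified as correct) and 0 (incorrect).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Re-sample with replacement and record the precision of each resample.
    estimates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(outcomes) / n, lower, upper


# Hypothetical verification results: 119 correct and 16 incorrect out of 135.
binding_outcomes = [1] * 119 + [0] * 16
binding_point, binding_lower, binding_upper = bootstrap_precision_ci(binding_outcomes)
```

The percentile bootstrap only needs the list of correct and incorrect judgements, which suits a manually verified sample of this size.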
Discussion
New research articles on gene expression regulation networks, protein-protein interactions and protein docking are emerging faster than most biologists can extract the data and generate working pathways. Information extraction technologies have been successfully used to process research text and automate fact extraction [1]. Previous studies in biological text mining have developed specialized text processing tools, or adapted generic tools, to achieve relatively good performance of more than 80% in precision [5, 11, 20, 31]. However, both specialized tool development and modification of existing tools often require much effort [20]. The need to modify existing tools has not been formally tested, and the possibility of using an un-modified generic text processor on biological text for the purpose of extracting protein-protein interactions remains unresolved. Using a two-layered approach [29] of generalizing biological text into a structured intermediate form, followed by specialized data mining, we present Muscorian, which uses MontyLingua natively in the generalized layer, as a tool for extracting either protein-protein or genic interactions from about 860000 published biological abstracts.

Benchmarking Muscorian against LLL05, a tested data set, demonstrated a precision of 55.6%, which is about 5% higher than that reported in the conference, and a recall of 19.7%, which is similar to that reported by other participants of LLL05 [9]. This may be due to the emphasis of LLL05 on F-measure, which is the harmonic mean of precision and recall, rather than on precision. Nevertheless, this also suggested that Muscorian is able to perform text analysis for the purpose of extracting genic interactions effectively, at a level comparable to the specialized systems reported in LLL05. In addition, directionality of genic interactions was not a concern for protein-protein binding interactions, as a binding interaction is scalar rather than vector.
By eliminating directionality of genic interactions, the precision and recall of Muscorian were 86.1% and 30.7% respectively. This suggested that Muscorian is a suitable tool for mining quality genic interactions from biological text compared to other tools reported in LLL05 [9].

Our results on protein-protein binding and activation interactions show that the insulin receptor binds to the IL-10 promoter through IRF and IRAK-1, which is an important insulin receptor signalling pathway. In addition, our data show that insulin activates CREB via Raf-1, MEK-1 and MAPK, which is consistent with the MAP kinase pathway. Combining these data (Figures 2 and 4) indicated that insulin activates CREB via the MAP kinase pathway, and that CREB binds to the cpg15 promoter in the nucleus. A simple keyword search on PubMed, using the term “cpg15 and insulin” (done on 30th of April, 2007), did not yield any results, suggesting that the effects of insulin on cpg15, also known as neuritin [2], have not been studied thoroughly. This might also suggest limited knowledge shared between insulin investigators and cpg15 investigators, as suggested by Don Swanson in his classical paper describing the links between fish oil and Raynaud's syndrome [34]. Neuritin is a relatively new research area with less than 20 papers published (as of 30th of April, 2007) and has been implicated as a lead for neural network re-establishment [18], suggesting potential collaborations between endocrinologists and neurologists.

Our experiments in extracting two different forms of relations demonstrated that, despite using specialized dictionaries in the generalized layer, the layer is still general to the extent that the specific application (the type of relationships to extract) was not built into it.

At the same time, these 2 experiments also illustrated the relative ease of re-targeting the system for extracting another form of relationship by modifying the specialized layer.
The Protein-Protein Activation Finder module is a slight modification of the original Protein-Protein Binding Finder module, where the original SQL statement that selects 'bind'-related SVOs from total SVOs, “select * from SVO where verb = 'bind'”, was changed to “select * from SVO where verb = 'activate'” to select 'activation'-related SVOs from total SVOs. Hence, it is plausible that similar changes may suffice for extracting other relationships, such as 'inhibition'. This relative ease of re-targeting the system for extracting other relationships also demonstrated the robustness of the generalization layer, as implied by Novichkova et al. [29]: “the adaptability of the system to related problems other than the problem the system was designed for”.

Given large numbers of published abstracts, the precision of Muscorian was comparable with published values for BioRAT (58.7%) [12], GIS (84%) [5], Cooper and Kershenbaum (74%) [6] and CONAN (53%) [24], while Muscorian's recall was comparable with published values for Arizona Relation Parser (35%) [10] and Daraselia et al. (21%) [11]. Poor precision was considered unacceptable because incorrect information is more detrimental than missing information (1 - recall) when protein-protein binding interactions are used to support other biological analyses. Muscorian's mediocre recall of 30% (from the LLL05 test set evaluation) could be compensated for by the fact that the same interaction could be mentioned or described in multiple abstracts; thus, the actual recall when tested on a large corpus may be higher. For example, 30% recall essentially means a loss of 70% of the information; however, if the same information (in this case, protein interactions) is mentioned in 3 or more abstracts, there is still a reasonable chance that the information will be extracted from at least 1 of those abstracts.
This is supported by our results indicating that almost 30% (2754 of 9803) of binding interactions were extracted from more than one abstract.

Multiple isolation of 2754 binding interactions enabled a higher confidence that these interactions were correctly extracted with reference to the source literature. Based on this analysis, 2754 binding interactions could be assigned higher confidence based on their occurrences [21], in this case more than 95% chance of being correct based on literature. In addition, the observation that the number of interactions decreases as the number of abstracts in which they were found increases is in line with expectation. Although this line of argument is based on the assumption that the appearances of protein names across abstracts are independent, the assumption can reasonably be held because this study uses abstracts rather than full text: abstracts tend to describe the main results of the particular article, while the introduction of a full-text article tends to be a brief background review of the field. Hence, independence of protein names can be better assumed in abstracts than in full-text articles.

An evaluation of a sample of atomic assertions (interactions) of binding and activation interactions between entities was performed by domain experts comparing the assertions with their source abstracts. Both approaches gave similar precision measures and are consistent with the evaluation using the LLL05 test set. The ANOVA test demonstrated that there was no significant difference between these three precision measures. Taken together, these evaluations strongly suggested that Muscorian performed with precisions between 86-90% for genic (gene-protein and protein-protein) interactions, which was similar to that reported by studies either modifying existing tools [31] or developing specialized tools [11].
This suggested that MontyLingua could be used natively (un-modified), with good precision, to process biological text into structured subject-verb-objects tuples which could be mined for protein interactions.

Acknowledgments.
We wish to thank Prof. I-Fang Chung, Institute of Biomedical Informatics, National Yang Ming University, Taiwan, for his comments on improving the initial drafts. This work is sponsored by the CRC for Innovative Dairy Products, Australia, and Postgraduate Overseas Research Experience Scholarship, The University of Melbourne, Australia.
References
1. Abulaish, M., and Dey, L. 2007. Biological relation extraction and query answering from MEDLINE abstracts using ontology-based text mining. Data & Knowledge Engineering
29. Novichkova, S., Egorov, S., and Daraselia, N. 2003. MedScan, a natural language processing engine for MEDLINE abstracts. Bioinformatics 19:1699-1706.
30. Rebholz-Schuhmann, D., Kirsch, H., and Couto, F. 2005. Facts from Text - Is Text Mining Ready to Deliver? PLoS Biology 3:e65.
31. Santos, C., Eggle, D., and States, D. J. 2005. Wnt pathway curation using automated natural language processing: combining statistical methods with partial and full parse for knowledge extraction. Bioinformatics 21:1653-1658.
32. Sleator, D., and Temperley, D. 1991. Parsing English with a Link Grammar. Proceedings of the 3rd International Workshop on Parsing Technologies.
33. Smith, L., Rindflesch, T., and Wilbur, WJ. 2004. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1.
34. Swanson, D. R. 1986. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine