Alexander S. Yeh
MITRE Corporation
Publications
Featured research published by Alexander S. Yeh.
International Conference on Computational Linguistics | 2000
Alexander S. Yeh
Statistical significance testing of differences in values of metrics like recall, precision and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, we find in a set of experiments that many commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally-intensive randomization tests.
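A minimal sketch of such a randomization test (an illustration under stated assumptions, not the paper's implementation): each system's output is scored per sentence as a (tp, fp, fn) triple, whole sentences are shuffled between the two systems so that no independence among responses within a sentence is assumed, and the estimated p-value is the fraction of shuffles whose F-score difference is at least as large as the observed one.

```python
import random

def f_score(counts):
    """Balanced F-score from a list of per-sentence (tp, fp, fn) triples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def approximate_randomization(a, b, trials=9999):
    """Estimate the two-sided significance of the F-score difference
    between systems A and B. a and b are per-sentence (tp, fp, fn)
    triples aligned on the same test sentences. Swapping whole sentences
    between systems avoids assuming independence of the responses
    within a sentence."""
    observed = abs(f_score(a) - f_score(b))
    hits = 0
    for _ in range(trials):
        sa, sb = [], []
        for ca, cb in zip(a, b):
            if random.random() < 0.5:  # randomly swap this sentence's outputs
                ca, cb = cb, ca
            sa.append(ca)
            sb.append(cb)
        if abs(f_score(sa) - f_score(sb)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # conservative p-value estimate
```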
Journal of Biomedical Informatics | 2002
Lynette Hirschman; Alexander A. Morgan; Alexander S. Yeh
As the pace of biological research accelerates, biologists are becoming increasingly reliant on computers to manage the information explosion. Biologists communicate their research findings by relying on precise biological terms; these terms then provide indices into the literature and across the growing number of biological databases. This article examines emerging techniques to access biological resources through extraction of entity names and relations among them. Information extraction has been an active area of research in natural language processing and there are promising results for information extraction applied to news stories, e.g., balanced precision and recall in the 93-95% range for identifying person, organization and location names. But these results do not seem to transfer directly to biological names, where results remain in the 75-80% range. Multiple factors may be involved, including absence of shared training and test sets for rigorous measures of progress, lack of annotated training data specific to biological tasks, pervasive ambiguity of terms, frequent introduction of new terms, and a mismatch between evaluation tasks as defined for news and real biological problems. We present evidence from a simple lexical matching exercise that illustrates some specific problems encountered when identifying biological names. We conclude by outlining a research agenda to raise performance of named entity tagging to a level where it can be used to perform tasks of biological importance.
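As a toy illustration of such a lexical matching exercise (the function and lexicon format are assumptions for illustration, not the authors' code), exact lookup against a gene-name lexicon immediately runs into the ambiguity the abstract describes: common English words double as gene names.

```python
def lexical_match(text, lexicon):
    """Greedy longest-match lookup of lexicon terms in whitespace-tokenized
    text. lexicon is a set of lowercased gene names, possibly multiword.
    Returns (token_index, matched_text) pairs."""
    if not lexicon:
        return []
    tokens = text.split()
    max_len = max(len(term.split()) for term in lexicon)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in lexicon:
                hits.append((i, candidate))
                i += n
                break
        else:
            i += 1
    return hits

# 'to' and 'white' are genuine fly gene symbols, so exact matching also
# fires on ordinary English uses of the same strings:
print(lexical_match("Resistance to white light", {"to", "white"}))
# [(1, 'to'), (2, 'white')] -- both matches are spurious in this sentence
```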
Working Conference on Reverse Engineering | 1995
Alexander S. Yeh; David R. Harris; Howard B. Reubenstein
As part of our investigations on recovering software architectural representations from source code, we have developed, implemented and tested an interactive approach to the recovery of implicit abstract data types (ADTs) and object instances from conventional procedural languages such as C. This approach includes automatic recognition and semi-automatic techniques that handle potential recognition pitfalls. We have also developed a scale with which to compare much of the work on recovering ADTs and object classes.
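One simple recognition heuristic in this spirit, sketched below under the assumption that functions operating on the same struct type form a candidate ADT (an illustration, not necessarily the paper's actual recognition criteria):

```python
from collections import defaultdict

def group_candidate_adts(functions):
    """Group C functions into candidate abstract data types by the struct
    type of their first parameter. functions is an iterable of
    (function_name, first_param_type) pairs extracted from source code.
    Clusters with a single member are dropped as weak evidence."""
    adts = defaultdict(list)
    for name, param_type in functions:
        if param_type.startswith("struct"):
            adts[param_type].append(name)
    return {t: fns for t, fns in adts.items() if len(fns) > 1}

fns = [("stack_push", "struct stack *"),
       ("stack_pop", "struct stack *"),
       ("log_msg", "char *")]
print(group_candidate_adts(fns))  # {'struct stack *': ['stack_push', 'stack_pop']}
```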
Journal of Biomedical Informatics | 2004
Alexander A. Morgan; Lynette Hirschman; Marc E. Colosimo; Alexander S. Yeh; Jeffrey B. Colombe
Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
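For reference, the balanced F-measures quoted above are just the harmonic mean of precision and recall, which can be checked directly:

```python
def balanced_f(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(balanced_f(0.50, 0.72), 2))  # 0.59: pattern matching plus filters
print(round(balanced_f(0.88, 0.61), 2))  # 0.72: HMM tagger plus filters
```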
International Conference on Software Engineering | 1997
Alexander S. Yeh; David R. Harris; Melissa P. Chase
Automatically recovered (reverse engineered) architectural views of legacy software are valuable for carrying out many software engineering tasks. However, while directly recovered views provide some relevant information, they often are either too fragmented or too complex to be really useful in practice. This paper describes methods for making views more useful. We automatically combine and/or simplify views to produce hierarchies, hybrids, and abstractions. The methods we use take advantage of the source code fragments underlying the views to cross-reference between parts from different views.
SIGKDD Explorations | 2002
Alexander S. Yeh; Lynette Hirschman; Alexander A. Morgan
This paper presents a background and overview for task 1 (of 2 tasks) of the KDD Challenge Cup 2002, a competition held in conjunction with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), July 23--26, 2002. Task 1 dealt with detecting which papers, in a set of fruitfly genetics papers (texts), contained experimental results about gene products (transcripts and proteins), and also, within each paper, for which genes the paper mentioned experimental results about their products.
Meeting of the Association for Computational Linguistics | 2003
Alexander A. Morgan; Lynette Hirschman; Alexander S. Yeh; Marc E. Colosimo
Machine-learning based entity extraction requires a large corpus of annotated training data to achieve acceptable results. However, the cost of expert annotation of relevant data, coupled with issues of inter-annotator variability, makes it expensive and time-consuming to create the necessary corpora. We report here on a simple method for the automatic creation of large quantities of imperfect training data for a biological entity (gene or protein) extraction system. We used resources available in the FlyBase model organism database; these resources include curated lists of genes and the articles from which the entries were drawn, together with a synonym lexicon. We applied simple pattern matching to identify gene names in the associated abstracts and filtered these entities using the list of curated entries for the article. This process created a data set that could be used to train a simple Hidden Markov Model (HMM) entity tagger. The results from the HMM tagger were comparable to those reported by other groups (F-measure of 0.75). This method has the advantage of being rapidly transferable to new domains that have similar existing resources.
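A minimal sketch of this noisy-labeling step, assuming a hypothetical synonym-to-gene lexicon and a per-article curated gene list (the names and data shapes are illustrative, not the authors' code):

```python
import re

def noisy_gene_labels(abstract, synonym_lexicon, curated_genes):
    """Label gene mentions in an abstract by pattern matching against a
    synonym lexicon, keeping only mentions whose normalized gene appears
    in the article's curated gene list. synonym_lexicon maps each synonym
    string to a normalized gene symbol; curated_genes is the set of
    symbols curated for this article. Returns (start, end, gene) spans
    usable as imperfect training labels for an HMM tagger."""
    labels = []
    for synonym, gene in synonym_lexicon.items():
        if gene not in curated_genes:  # the curated list acts as a filter
            continue
        for m in re.finditer(r"\b" + re.escape(synonym) + r"\b", abstract):
            labels.append((m.start(), m.end(), gene))
    return sorted(labels)
```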
Automated Software Engineering | 1996
David R. Harris; Alexander S. Yeh; Howard B. Reubenstein
Recovery of higher level design information and the ability to create dynamic software documentation is crucial to supporting a number of program understanding activities. Software maintainers look for standard software architectural structures (e.g., interfaces, interprocess communication, layers, objects) that the code developers had employed. Our goals center on supporting software maintenance/evolution activities through architectural recovery tools that are based on reverse engineering technology. Our tools start with existing source code and extract architecture-level descriptions linked to the source code fragments that implement architectural features. Recognizers (individual source code query modules used to analyze the target program) are used to locate architectural features in the source code. We also report on representation and organization issues for the set of recognizers that are central to our approach.
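A toy recognizer along these lines (the paper's recognizers are richer query modules; this sketch and its hard-coded call list are assumptions) can be as small as a source code query that flags interprocess-communication calls and links the feature back to file and line:

```python
import re

IPC_CALLS = re.compile(r"\b(socket|bind|connect|send|recv|pipe|fork)\s*\(")

def ipc_recognizer(source_files):
    """Flag source files containing interprocess-communication calls.
    source_files maps file path to source text; the result maps each
    matching path to the line numbers of the call sites, linking the
    architectural feature to the code fragments that implement it."""
    features = {}
    for path, text in source_files.items():
        lines = [i + 1 for i, line in enumerate(text.splitlines())
                 if IPC_CALLS.search(line)]
        if lines:
            features[path] = lines
    return features
```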
Meeting of the Association for Computational Linguistics | 1998
Alexander S. Yeh; Marc B. Vilain
Determining the attachments of prepositions and subordinate conjunctions is a key problem in parsing natural language. This paper presents a trainable approach to making these attachments through transformation sequences and error-driven learning. Our approach is broad coverage, handling roughly three times as many attachment cases as previous corpus-based techniques. In addition, our approach is based on a simplified model of syntax that is more consistent with the practice in current state-of-the-art language processing systems. This paper sketches syntactic and algorithmic details, and presents experimental results on data sets derived from the Penn Treebank. We obtain an attachment accuracy of 75.4% for the general case, the first such corpus-based result to be reported. For the restricted cases previously studied with corpus-based methods, our approach yields an accuracy comparable to current work (83.1%).
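The learning regime is transformation-based, error-driven learning in the style of Brill. Below is a minimal sketch of the greedy training loop, with illustrative 'low'/'high' attachment labels and a hypothetical rule representation (assumptions for illustration, not the paper's exact formulation):

```python
def tbl_train(instances, gold, candidate_rules, max_rules=50):
    """Greedily learn a transformation sequence. instances is a list of
    feature dicts describing attachment sites, gold the correct labels,
    and candidate_rules a list of (condition, new_label) pairs where
    condition is a predicate over a feature dict. Starting from a naive
    baseline, repeatedly adopt whichever rule most reduces the number of
    attachment errors, then apply it before searching again."""
    labels = ["low"] * len(instances)  # naive baseline: always attach low
    learned = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        for cond, new_label in candidate_rules:
            gain = 0
            for feats, cur, right in zip(instances, labels, gold):
                if cond(feats) and cur != new_label:
                    gain += (new_label == right) - (cur == right)
            if gain > best_gain:
                best_rule, best_gain = (cond, new_label), gain
        if best_rule is None:  # no remaining rule reduces errors
            break
        cond, new_label = best_rule
        labels = [new_label if cond(f) else lab
                  for f, lab in zip(instances, labels)]
        learned.append(best_rule)
    return learned
```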
Intelligent Systems in Molecular Biology | 2003
Christian Blaschke; Lynette Hirschman; Alexander S. Yeh; Alfonso Valencia
An increasing number of groups are now working in the area of text mining, focusing on a wide range of problems and applying both statistical and linguistic approaches. However, it is not possible to compare the different approaches, because there are no common standards or evaluation criteria; in addition, the various groups are addressing different problems, often using private datasets. As a result, it is impossible to determine how well the existing systems perform, and particularly what performance level can be expected in real applications. This is similar to the situation in text processing in the late 1980s, prior to the Message Understanding Conferences (MUCs). With the introduction of common evaluations and standardized metrics as part of these conferences, it became possible to compare approaches, to identify those techniques that did or did not work, and to make progress. This progress has resulted in a common pipeline of processes and a set of shared tools available to the general research community. The field of biology is ripe for a similar experiment. Inspired by this example, the BioLINK group (Biological Literature, Information and Knowledge [1]) is organizing a CASP-like evaluation for the text data-mining community applied to biology. The two main tasks specifically address two major bottlenecks for text mining in biology: (1) the correct detection of gene and protein names in text; and (2) the extraction of functional information related to proteins based on the GO classification system. For further information and participation details, see http://www.pdg.cnb.uam.es/BioLink/BioCreative.eval.html