
Publication


Featured research published by Anni Coden.


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2000

Question-answering by predictive annotation

John M. Prager; Eric W. Brown; Anni Coden; Dragomir R. Radev

We present a new technique for question answering called Predictive Annotation. Predictive Annotation identifies potential answers to questions in text, annotates them accordingly and indexes them. This technique, along with a complementary analysis of questions, passage-level ranking and answer selection, produces a system effective at answering natural-language fact-seeking questions posed against large document collections. Experimental results show the effects of different parameter settings and lead to a number of general observations about the question-answering problem.
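As an illustration of the annotate-and-index idea, the sketch below tags potential answers in text with answer-type labels so they can be indexed alongside ordinary words. The label names and regular-expression patterns here are invented for this example; the paper's annotator uses a far richer set of answer types and analysis.

```python
import re

# Illustrative answer-type patterns (hypothetical labels, not the paper's).
PATTERNS = {
    "YEAR$": re.compile(r"\b(1[89]\d\d|20\d\d)\b"),
    "NAME$": re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b"),
}

def annotate(text):
    """Tag potential answers in text so they can be indexed as terms."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.group(0), label))
    return spans
```

At query time, a question such as "When was the phonograph invented?" would be analyzed into a search for the label (here, YEAR$) rather than for a literal answer string.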


Journal of Biomedical Informatics | 2009

Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model

Anni Coden; Guergana Savova; Igor L. Sominsky; James J. Masanz; Karin Schuler; James W. Cooper; Wei Guan; Piet C. de Groen

We introduce an extensible and modifiable knowledge representation model to represent cancer disease characteristics in a comparable and consistent fashion. We describe a system, MedTAS/P, which automatically instantiates the knowledge representation model from free-text pathology reports. MedTAS/P is based on an open-source framework, and its components use natural language processing principles, machine learning, and rules to discover and populate elements of the model. To validate the model and measure the accuracy of MedTAS/P, we developed a gold-standard corpus of manually annotated colon cancer pathology reports. MedTAS/P achieves F1-scores of 0.97-1.0 for instantiating classes in the knowledge representation model such as histologies or anatomical sites, and F1-scores of 0.82-0.93 for primary tumors or lymph nodes, which require the extraction of relations. An F1-score of 0.65 is reported for metastatic tumors, a lower score due predominantly to the very small number of instances in the training and test sets.


Journal of Biomedical Informatics | 2008

Word sense disambiguation across two domains: Biomedical literature and clinical notes

Guergana Savova; Anni Coden; Igor L. Sominsky; Rie Johnson; Philip V. Ogren; Piet C. de Groen; Christopher G. Chute

The aim of this study is to explore the word sense disambiguation (WSD) problem across two biomedical domains: biomedical literature and clinical notes. A supervised machine learning technique was used for the WSD task. One of the challenges addressed is the creation of a suitable clinical corpus with manual sense annotations. This corpus, in conjunction with the WSD set from the National Library of Medicine, provided the basis for evaluating our method across multiple domains and for comparing our results to published ones. Notably, only 20% of the most relevant ambiguous terms within a domain overlap between the two domains, and those terms have more senses associated with them in the clinical space than in the biomedical literature space. Experimentation with 28 different feature sets yielded a system achieving an average F-score of 0.82 on the clinical data and 0.86 on the biomedical literature.


Conference on Information and Knowledge Management | 2002

Detecting similar documents using salient terms

James W. Cooper; Anni Coden; Eric W. Brown

We describe a system for rapidly determining document similarity among a set of documents obtained from an information retrieval (IR) system. We obtain a ranked list of the most important terms in each document using a rapid phrase recognizer system, store these terms in a database, and compute document similarity with a simple database query. If the number of terms not shared by both documents, relative to the total number of terms in a document, falls below a predetermined threshold, the documents are judged very similar. We compare this approach to the shingles approach.
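The salient-term overlap test can be sketched as follows. The term lists and threshold value are illustrative; in the paper's system the terms come from a phrase recognizer and the comparison is done via a database query rather than in memory.

```python
def very_similar(terms_a, terms_b, threshold=0.2):
    """Judge two documents very similar if the fraction of salient
    terms NOT shared by both falls below a threshold."""
    a, b = set(terms_a), set(terms_b)
    total = len(a | b)
    if total == 0:
        return True  # two empty term lists: trivially similar
    unshared = len(a.symmetric_difference(b))
    return unshared / total < threshold
```

For example, two documents whose salient-term lists are identical score 0 unshared terms and pass the test, while documents with disjoint term lists score 1.0 and fail.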


Journal of Biomedical Informatics | 2005

Domain-specific language models and lexicons for tagging

Anni Coden; Serguei V. S. Pakhomov; Rie Kubota Ando; Patrick H. Duffy; Christopher G. Chute

Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92% accuracy from 87% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost.


IBM Systems Journal | 2001

Toward speech as a knowledge resource

Eric W. Brown; Savitha Srinivasan; Anni Coden; Dulce B. Ponceleon; James W. Cooper; Arnon Amir

Speech is a tantalizing mode of human communication. On the one hand, humans understand speech with ease and use speech to express complex ideas, information, and knowledge. On the other hand, automatic speech recognition with computers is very hard, and extracting knowledge from speech is even harder. Nevertheless, the potential reward for solving this problem drives us to pursue it. Before we can exploit speech as a knowledge resource, however, we must understand the current state of the art in speech recognition and the relevant, successful applications of speech recognition in the related areas of multimedia indexing and search. In this paper we advocate the study of speech as a knowledge resource, provide a brief introduction to the state of the art in speech recognition, describe a number of systems that use speech recognition to enable multimedia analysis, indexing, and search, and present a number of exploratory applications of speech recognition that move toward the goal of exploiting speech as a knowledge resource.


Hawaii International Conference on System Sciences | 2001

Speech transcript analysis for automatic search

Anni Coden; Eric W. Brown

We address the problem of finding collateral information pertinent to a live television broadcast in real time. The solution starts with a text transcript of the broadcast generated by an automatic speech recognition system. Speaker independent speech recognition technology, even when tailored for a broadcast scenario, generally produces transcripts with relatively low accuracy. Given this limitation, we have developed algorithms that can determine the essence of the broadcast from these transcripts. Specifically, we extract named entities, topics, and sentence types from the transcript and use them to automatically generate both structured and unstructured search queries. A novel distance-ranking algorithm is used to select relevant information from the search results. The whole process is performed online and the query results (i.e., the collateral information) are added to the broadcast stream.


Book | 2002

Information retrieval techniques for speech applications

Anni Coden; Eric W. Brown; Savitha Srinivasan

Contents: Traditional Information Retrieval Techniques; Perspectives on Information Retrieval and Speech; Spoken Document Pre-processing; Capitalization Recovery for Text; Adapting IR Techniques to Spoken Documents; Clustering of Imperfect Transcripts Using a Novel Similarity Measure; Extracting Keyphrases from Spoken Audio Documents; Segmenting Conversations by Topic, Initiative, and Style; Extracting Caller Information from Voicemail; Techniques for Multi-media Collections; Speech and Hand Transcribed Retrieval; New Applications; The Use of Speech Retrieval Systems: A Study Design; Speech-Driven Text Retrieval: Using Target IR Collections for Statistical Language Model Adaptation in Speech Recognition; WASABI: Framework for Real-Time Speech Analysis Applications (Demo).


Conference on Information and Knowledge Management | 2001

Towards speech as a knowledge resource

Eric W. Brown; Savitha Srinivasan; Anni Coden; Dulce B. Ponceleon; James W. Cooper; Arnon Amir; Jan Pieper

Speech is a tantalizing mode of human communication. On the one hand, humans understand speech with ease and use speech to express complex ideas, information, and knowledge. On the other hand, automatic speech recognition with computers is still very hard, and extracting knowledge from speech is even harder. In this paper we motivate the study of speech as a knowledge resource and briefly survey a family of related applications and systems being developed at IBM Research aimed towards the goal of exploiting speech as a knowledge resource.


IEEE International Conference on Healthcare Informatics, Imaging and Systems Biology | 2012

SPOT the Drug! An Unsupervised Pattern Matching Method to Extract Drug Names from Very Large Clinical Corpora

Anni Coden; Daniel Gruhl; Neal Lewis; Joe Terdiman

Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. "Understanding" these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs patients take is critical not only for understanding patient health (e.g., for drug-drug or drug-enzyme interactions), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm, the FDA's Orange Book, or NCI, with a focus on prescription drugs. Developing these dictionaries is a challenge, but even more challenging is keeping them up-to-date in the face of a rapidly advancing field: it is critical to identify grapefruit as a "drug" for a patient who takes the prescription medicine Lipitor, due to their known adverse interaction. To discover other, new adverse drug interactions, a large number of patient histories often need to be examined, necessitating not only accurate but also fast algorithms to identify pharmacological substances. In this paper we propose a new algorithm, SPOT, which identifies drug names that can be used as new dictionary entries from a large corpus, where a "drug" is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. Measured against a manually annotated reference corpus, we present precision and recall values for SPOT. SPOT is language- and syntax-independent, can be run efficiently to keep dictionaries up-to-date, and can also suggest words and phrases that may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text.
SPOT is a generalized bootstrapping algorithm, seeded with a known dictionary and automatically extracting the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. Finally, we present the performance tradeoffs depending on the thresholds chosen for these features.
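A minimal sketch of the bootstrapping idea: collect the contexts in which seed-dictionary drugs appear, score each context, and harvest unknown tokens that occur in high-scoring contexts. The one-word-window contexts and the exact definitions of support and confidence below are simplified assumptions for illustration, not SPOT's actual feature definitions (the paper also defines a third feature, prevalence).

```python
from collections import Counter, defaultdict

def context_patterns(corpus, seeds):
    """For each (left-word, right-word) context, count:
    support    - how many seed-drug mentions appear in it,
    confidence - the fraction of its occupants that are seed drugs,
    and collect the non-seed tokens seen in it as candidates."""
    support, total = Counter(), Counter()
    candidates = defaultdict(Counter)
    for tokens in corpus:
        for i, tok in enumerate(tokens):
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
            pattern = (left, right)
            total[pattern] += 1
            if tok.lower() in seeds:
                support[pattern] += 1
            else:
                candidates[pattern][tok.lower()] += 1
    confidence = {p: support[p] / total[p] for p in total}
    return support, confidence, candidates

def new_drugs(corpus, seeds, min_support=2, min_confidence=0.5):
    """Return candidate drug names found in contexts whose support
    and confidence exceed the chosen thresholds."""
    support, confidence, candidates = context_patterns(corpus, seeds)
    found = set()
    for pattern in support:
        if support[pattern] >= min_support and confidence[pattern] >= min_confidence:
            found.update(candidates[pattern])
    return found
```

Raising the thresholds trades recall for precision, which is the tradeoff the abstract refers to.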
