Emilia Apostolova | Researchain

Publication

Featured research published by Emilia Apostolova.

document recognition and retrieval | 2009

Figure content analysis for improved biomedical article retrieval

Daekeun You; Emilia Apostolova; Sameer K. Antani; Dina Demner-Fushman; George R. Thoma

Biomedical images are invaluable in medical education and in establishing clinical diagnoses. Clinical decision support (CDS) can be improved by combining biomedical text with automatically annotated images extracted from relevant biomedical publications. In a previous study on the feasibility of automatically classifying images by their usefulness in finding clinical evidence, we reported 76.6% accuracy using supervised machine learning that combined figure captions and image content. Image content extraction is traditionally applied to entire images or to predetermined image regions. Figure images in articles vary greatly, which limits the benefit of whole-image extraction for CDS beyond gross categorization. However, text annotations and pointers on the images indicate regions of interest (ROI) that are then referenced in the caption or in the discussion in the article text. We have previously reported 72.02% accuracy in text and symbol localization, but we failed to take advantage of the referenced image locality. In this work we combine article text analysis and figure image analysis to localize pointers (arrows, symbols) and extract the ROIs they point to, which can then be used to measure meaningful image content and associate it with the identified biomedical concepts for improved (text and image) content-based retrieval of biomedical articles. Biomedical concepts are identified using the National Library of Medicine's Unified Medical Language System (UMLS) Metathesaurus. Our methods report an average precision and recall of 92.3% and 75.3%, respectively, in identifying pointing symbols in images from a randomly selected image subset made available through the ImageCLEF 2008 campaign.
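
For readers unfamiliar with the two reported measures, the following is a minimal sketch of how precision and recall are computed for a detection task such as pointer identification. The function name, the pointer identifiers, and the counts are hypothetical and are not taken from the paper; the sketch assumes that detections have already been matched to annotated pointers by identifier.

```python
def precision_recall(detected, annotated):
    """Return (precision, recall) for a set of detections against ground truth.

    detected:  identifiers of pointers the system reported (hypothetical IDs)
    annotated: identifiers of pointers marked by human annotators
    """
    detected, annotated = set(detected), set(annotated)
    true_positives = len(detected & annotated)
    # Precision: share of reported pointers that are real pointers.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of real pointers that the system found.
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Illustrative data only: 4 detections, 5 annotated pointers, 3 in common.
p, r = precision_recall({"a1", "a2", "a3", "x9"}, {"a1", "a2", "a3", "a4", "a5"})
```

A high precision with a lower recall, as in the abstract, means that most reported pointers are correct while some true pointers are missed.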

international conference of the ieee engineering in medicine and biology society | 2009

Automatic segmentation of clinical texts

Emilia Apostolova; David S. Channin; Dina Demner-Fushman; Jacob D. Furst; Steven L. Lytinen; Daniela Stan Raicu

Clinical narratives, such as radiology and pathology reports, are commonly available in electronic form. However, they are also commonly entered and stored as free text. Knowledge of the structure of clinical narratives is necessary for enhancing the productivity of healthcare departments and facilitating research. This study attempts to automatically segment medical reports into semantic sections. Our goal is to develop a robust and scalable medical report segmentation system requiring minimum user input for efficient retrieval and extraction of information from free-text clinical narratives. Hand-crafted rules were used to automatically identify a high-confidence training set. This automatically created training dataset was later used to develop metrics and an algorithm that determines the semantic structure of the medical reports. A word-vector cosine similarity metric combined with several heuristics was used to classify each report sentence into one of several pre-defined semantic sections. This baseline algorithm achieved 79% accuracy. A Support Vector Machine (SVM) classifier trained on additional formatting and contextual features was able to achieve 90% accuracy. Plans for future work include developing a configurable system that could accommodate various medical report formatting and content standards.
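
The baseline idea of assigning each sentence to the section whose word vector it most resembles can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the section names, the prototype texts, and the whitespace tokenizer are all hypothetical, and the heuristics mentioned in the abstract are omitted.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercased whitespace tokens into a sparse word vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical prototype text per section; a real system would build these
# vectors from a training set of already-labeled sentences.
SECTION_EXAMPLES = {
    "history": "patient history of cough and fever for two weeks",
    "findings": "the lungs are clear no pleural effusion is seen",
    "impression": "impression no acute cardiopulmonary disease",
}
SECTION_VECTORS = {name: bag_of_words(text) for name, text in SECTION_EXAMPLES.items()}


def classify_sentence(sentence):
    """Return the section whose prototype vector is most similar to the sentence."""
    vec = bag_of_words(sentence)
    return max(SECTION_VECTORS, key=lambda name: cosine(vec, SECTION_VECTORS[name]))
```

For example, `classify_sentence("no pleural effusion in the lungs")` picks the `findings` prototype because it shares the most words with that section.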

empirical methods in natural language processing | 2014

Combining Visual and Textual Features for Information Extraction from Online Flyers

Emilia Apostolova; Noriko Tomuro

Information in visually rich formats such as PDF and HTML is often conveyed by a combination of textual and visual features. In particular, genres such as marketing flyers and infographics often augment textual information with visual cues such as color, size, and positioning. As a result, traditional text-based approaches to information extraction (IE) may underperform. In this study, we present a supervised machine learning approach to IE from online commercial real estate flyers. We evaluated the performance of SVM classifiers on the task of identifying 12 types of named entities using a combination of textual and visual features. Results show that the addition of visual features such as color, size, and positioning significantly increased classifier performance.
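
As a rough illustration of what combining textual and visual features can look like, the sketch below builds one feature dictionary per text span. Everything here is an assumption made for illustration: the `TextSpan` fields, the feature names, and the thresholds are hypothetical and do not reproduce the feature set used in the paper.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    """A piece of flyer text with layout attributes (hypothetical schema)."""
    text: str
    font_size: float    # in points
    color: str          # hex string such as "#000000"
    y_position: float   # 0.0 = top of page, 1.0 = bottom


def extract_features(span, median_font_size):
    """Combine simple textual cues with visual cues for one span."""
    digits_only = span.text.replace(",", "").replace(".", "", 1)
    return {
        # Textual features
        "lower": span.text.lower(),
        "is_numeric": digits_only.isdigit(),
        "is_title_case": span.text.istitle(),
        # Visual features: size relative to the page, color, and position
        "relative_font_size": span.font_size / median_font_size,
        "is_non_black": span.color.lower() != "#000000",
        "in_top_quarter": span.y_position < 0.25,
    }


# Illustrative span: a large red number near the top of the page.
features = extract_features(TextSpan("12,500", 24.0, "#CC0000", 0.1), 12.0)
```

Dictionaries like this can be vectorized and passed to any supervised classifier, so the same learner sees both what the text says and how it is displayed.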

north american chapter of the association for computational linguistics | 2009

Towards Automatic Image Region Annotation - Image Region Textual Coreference Resolution

Emilia Apostolova; Dina Demner-Fushman

Detailed image annotation necessary for reliable image retrieval involves not only annotating the image as a single artifact, but also annotating specific objects or regions within the image. Such detailed annotation is a costly endeavor and the available annotated image data are quite limited. This paper explores the feasibility of using image captions from scientific journals for the purpose of automatically annotating image regions. Salient image clues, such as an object location within the image or an object color, together with the associated explicit object mention, are extracted and classified using rule-based and SVM learners.

international conference on digital image processing | 2015

Genre-based image classification using ensemble learning for online flyers

Payam Pourashraf; Noriko Tomuro; Emilia Apostolova

This paper presents an image classification model developed to classify images embedded in commercial real estate flyers. It is a component of a larger, multimodal system that uses both the texts and the images in the flyers to automatically classify them by property type. The role of the image classifier in the system is to provide the genres of the embedded images (map, schematic drawing, aerial photo, etc.), which are then combined with the texts in the flyer to perform the overall classification. In this work, we used an ensemble learning approach and developed a model in which the outputs of an ensemble of support vector machines (SVMs) are combined by a k-nearest neighbor (KNN) classifier. In this model, the classifiers in the ensemble are strong classifiers, each of which is trained to predict an assigned genre. Not only is our model intuitive, in that it takes advantage of the mutual distinctness of the image genres, it is also scalable. We tested the model using over 3000 images extracted from online real estate flyers. The results showed that our model outperformed the baseline classifiers by a large margin.
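
The combination step described in this abstract, where per-genre classifier outputs are fed to a KNN classifier, can be sketched in a few lines. This is a simplified illustration, not the paper's model: the score vectors below are hand-written stand-ins for the outputs of per-genre SVMs, and the value of `k`, the distance measure, and the three genres shown are assumptions.

```python
import math
from collections import Counter


def knn_predict(query, training, k=3):
    """Majority vote among the k training score vectors closest to the query."""
    nearest = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Each tuple holds hypothetical scores from three per-genre classifiers,
# in the order (map, schematic drawing, aerial photo), with the true genre.
TRAINING = [
    ((0.9, 0.1, 0.2), "map"),
    ((0.8, 0.3, 0.1), "map"),
    ((0.2, 0.9, 0.1), "schematic drawing"),
    ((0.1, 0.8, 0.3), "schematic drawing"),
    ((0.1, 0.2, 0.9), "aerial photo"),
    ((0.3, 0.1, 0.8), "aerial photo"),
]

# Scores for a new image, again illustrative values only.
genre = knn_predict((0.85, 0.2, 0.15), TRAINING)
```

Letting a second-stage classifier look at the whole score vector, rather than simply taking the highest score, allows it to use patterns in how the per-genre classifiers agree or disagree.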

north american chapter of the association for computational linguistics | 2015

Digital Leafleting: Extracting Structured Data from Multimedia Online Flyers

Emilia Apostolova; Payam Pourashraf; Jeffrey Sack

Marketing materials such as flyers and other infographics are a vast online resource. In a number of industries, such as the commercial real estate industry, they are in fact the only authoritative source of information. Companies attempting to organize commercial real estate inventories spend a significant amount of resources on manual data entry of this information. In this work, we propose a method for extracting structured data from free-form commercial real estate flyers in PDF and HTML formats. We modeled the problem as text categorization and Named Entity Recognition (NER) tasks and applied a supervised machine learning approach (Support Vector Machines). Our dataset consists of more than 2,200 commercial real estate flyers and associated manually entered structured data, which was used to automatically create training datasets. Traditionally, text categorization and NER approaches are based on textual information only. However, information in visually rich formats such as PDF and HTML is often conveyed by a combination of textual and visual features. Large fonts, visually salient colors, and positioning often indicate the most relevant pieces of information. We applied novel features based on visual characteristics in addition to traditional text features and show that performance improved significantly for both the text categorization and NER tasks.

text retrieval conference | 2011

A Knowledge-Based Approach to Medical Records Retrieval.

Dina Demner-Fushman; Swapna Abhyankar; Antonio Jimeno-Yepes; Russell F. Loane; Bastien Rance; François-Michel Lang; Nicholas C. Ide; Emilia Apostolova; Alan R. Aronson

meeting of the association for computational linguistics | 2011