Jongwoo Kim | Researchain

Publication

Featured research published by Jongwoo Kim.

document recognition and retrieval | 2000

Automated labeling in document images

Jongwoo Kim; Daniel X. Le; George R. Thoma

The National Library of Medicine (NLM) is developing an automated system to produce bibliographic records for its MEDLINE® database. This system, named Medical Article Record System (MARS), employs document image analysis and understanding techniques and optical character recognition (OCR). This paper describes a key module in MARS called the Automated Labeling (AL) module, which labels all zones of interest (title, author, affiliation, and abstract) automatically. The AL algorithm is based on 120 rules that are derived from an analysis of journal page layouts and features extracted from OCR output. Experiments carried out on more than 11,000 articles in over 1,000 biomedical journals show the accuracy of this rule-based algorithm to exceed 96%.

First International Workshop on Document Image Analysis for Libraries, 2004. Proceedings. | 2004

A dynamic feature generation system for automated metadata extraction in preservation of digital materials

Song Mao; Jongwoo Kim; George R. Thoma

Obsolescence in storage media and the hardware and software for access and use can render old electronic files inaccessible and unusable. Therefore, the long-term preservation of digital materials has become an active area of research. At the U.S. National Library of Medicine (NLM), we are investigating the preservation of scanned and online medical journal articles, though other data types (e.g., video sequences) are also of interest. Metadata of different types have been proposed to save the information needed to preserve digital materials. Given the ever-increasing volume of medical journals and high labor cost of manual data entry, automated metadata extraction is crucial. A system has been developed at NLM to automatically generate descriptive metadata that includes title, author, affiliation, and abstract from scanned medical journals. A module called ZoneMatch is used to generate geometric and contextual features from a set of issues of each journal. A rule-based labeling module (called ZoneCzar) then uses these features to perform labeling independent of journal layout styles. However, if there are significant style variations among the issues of the same journal, the features generated from one set of journal issues may not be very useful for labeling a different set. We describe a dynamic feature updating system in which the features used for labeling a current journal issue are generated from previous issues with similar layout style. This new system can adapt to possible style variations among different issues of the same journal. Experimental results show that the new system delivers improved labeling accuracy.

document recognition and retrieval | 2003

Style-independent document labeling: design and performance evaluation

Song Mao; Jongwoo Kim; George R. Thoma

The Medical Article Records System or MARS has been developed at the U.S. National Library of Medicine (NLM) for automated data entry of bibliographical information from medical journals into MEDLINE, the premier bibliographic citation database at NLM. Currently, a rule-based algorithm (called ZoneCzar) is used for labeling important bibliographical fields (title, author, affiliation, and abstract) on medical journal article page images. While rules have been created for medical journals with regular layout types, new rules have to be manually created for any input journals with arbitrary or new layout types. Therefore, it is of interest to label any journal articles independent of their layout styles. In this paper, we first describe a system (called ZoneMatch) for automated generation of crucial geometric and non-geometric features of important bibliographical fields based on string-matching and clustering techniques. The rule-based algorithm is then modified to use these features to perform style-independent labeling. We then describe a performance evaluation method for quantitatively evaluating our algorithm and characterizing its error distributions. Experimental results show that the labeling performance of the rule-based algorithm is significantly improved when the generated features are used.

document recognition and retrieval | 2003

Automated labeling of bibliographic data extracted from biomedical online journals

Jongwoo Kim; Daniel X. Le; George R. Thoma

A prototype system has been designed to automate the extraction of bibliographic data (e.g., article title, authors, abstract, affiliation and others) from online biomedical journals to populate the National Library of Medicine’s MEDLINE database. This paper describes a key module in this system: the labeling module that employs statistics and fuzzy rule-based algorithms to identify segmented zones in an article’s HTML pages as specific bibliographic data. Results from experiments conducted with 1,149 medical articles from forty-seven journal issues are presented.

document recognition and retrieval | 2010

Naïve Bayes and SVM classifiers for classifying Databank Accession Number sentences from online biomedical articles

Jongwoo Kim; Daniel X. Le; George R. Thoma

This paper describes two classifiers, Naïve Bayes and Support Vector Machine (SVM), to classify sentences containing Databank Accession Numbers, a key piece of bibliographic information, from online biomedical articles. The correct identification of these sentences is necessary for the subsequent extraction of these numbers. The classifiers use words that occur most frequently in sentences as features for the classification. Twelve sets of word features are collected to train and test the classifiers. Each set has a different number of word features ranging from 100 to 1,200. The performance of each classifier is evaluated using four measures: Precision, Recall, F-Measure, and Accuracy. The Naïve Bayes classifier shows performance above 93.91% at 200 word features for all four measures. The SVM shows 98.80% Precision at 200 word features, 94.90% Recall at 500 and 700, 96.46% F-Measure at 200, and 99.14% Accuracy at 200 and 400. To improve classification performance, we propose two merging operators, Max and Harmonic Mean, to combine results of the two classifiers. The final results show a measurable improvement in Recall, F-Measure, and Accuracy rates.
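The two merging operators named in this abstract can be illustrated with a minimal sketch. The abstract does not say which quantities are merged, so this example assumes each classifier emits a per-sentence confidence score in [0, 1]. The function names and the 0.5 decision threshold are hypothetical and are not taken from the paper.

```python
def merge_max(score_nb: float, score_svm: float) -> float:
    """Max operator: keep the more confident of the two scores."""
    return max(score_nb, score_svm)


def merge_harmonic_mean(score_nb: float, score_svm: float) -> float:
    """Harmonic mean operator: high only when both scores are high."""
    total = score_nb + score_svm
    if total == 0:
        return 0.0  # both scores are zero; avoid division by zero
    return 2 * score_nb * score_svm / total


def is_accession_sentence(score_nb, score_svm, merge, threshold=0.5):
    """Label a sentence positive when the merged score reaches the
    (assumed) threshold."""
    return merge(score_nb, score_svm) >= threshold


if __name__ == "__main__":
    # The classifiers disagree: Naive Bayes is unsure, the SVM is confident.
    print(is_accession_sentence(0.3, 0.8, merge_max))
    print(is_accession_sentence(0.3, 0.8, merge_harmonic_mean))
```

Under these assumptions, Max accepts a sentence whenever either classifier is confident, which tends to favor recall, while the harmonic mean is pulled toward the lower score and is therefore more conservative.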

computer-based medical systems | 2009

Inferring grant support types from online biomedical articles

Jongwoo Kim; Daniel X. Le; George R. Thoma

The category of institution or organization underwriting the research reported in a scientific article is a required field (Grant Support type) in the bibliographic record of that article in the MEDLINE® database. We describe a system based on a combination of a Naive Bayes classifier and heuristic rules that automatically infers the Grant Support types from article text. Testing the performance of the system on 2,000 biomedical articles shows Precision, Recall, and F-Measure exceeding 95%.

computer-based medical systems | 2001

Automated medical citation records creation for Web-based on-line journals

Daniel X. Le; Loc Q. Tran; Joseph Chow; Jongwoo Kim; Susan E. Hauser; Chan W. Moon; George R. Thoma

With the rapid expansion and utilization of the Internet and Web technologies, there is an increasing number of online medical journals. Online journals pose new challenges in the areas of automated document analysis and content extraction, database citation records creation, data mining and other document-related applications. New techniques are needed to capture, classify, analyze, extract, modify and reformat Web-based document information for computer storage, access and processing. At the National Library of Medicine (NLM), we are developing an automated system, temporarily code-named WebMARS (Web-based Medical Article Record System), to create citation records for the MEDLINE® database. The system downloads and classifies Web document articles, parses and labels the article contents, extracts and reformats the citation information from the article, presents the entire citation to operators for reconciling (validation), and uploads the citation records to the MEDLINE database.

document recognition and retrieval | 2012

Combining SVM classifiers to identify investigator name zones in biomedical articles

Jongwoo Kim; Daniel X. Le; George R. Thoma

This paper describes an automated system to label zones containing Investigator Names (IN) in biomedical articles, a key item in a MEDLINE® citation. The correct identification of these zones is necessary for the subsequent extraction of IN from these zones. A hierarchical classification model is proposed using two Support Vector Machine (SVM) classifiers. The first classifier is used to identify an IN zone with highest confidence, and the other classifier identifies the remaining IN zones. Eight sets of word lists are collected to train and test the classifiers, each set containing collections of words ranging from 100 to 1,200. Experiments based on a test set of 105 journal articles show a Precision of 0.88, a Recall of 0.97, an F-Measure of 0.92, and an Accuracy of 0.99.
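For readers unfamiliar with the four measures reported in these abstracts, the sketch below shows their standard definitions in terms of confusion-matrix counts. The function names and the example counts are illustrative assumptions; only the Precision and Recall values in the final check come from the abstract above.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the four measures from true/false positive/negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure(precision, recall),
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(metrics_from_counts(tp=8, fp=2, fn=2, tn=88))
    # Consistency check using the Precision and Recall reported above.
    print(round(f_measure(0.88, 0.97), 2))
```

The reported F-Measure of 0.92 agrees with the harmonic mean of the reported Precision (0.88) and Recall (0.97).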

document engineering | 2018

Main Content Detection in HTML Journal Articles

Alastair R. Rae; Jongwoo Kim; Daniel Le; George R. Thoma

Web content extraction algorithms have been shown to improve the performance of web content analysis tasks. This is because noisy web page content, such as advertisements and navigation links, can significantly degrade performance. This paper presents a novel and effective layout analysis algorithm for main content detection in HTML journal articles. The algorithm first segments a web page based on rendered line breaks, then based on its column structure, and finally identifies the column that contains the most paragraph text. On a test set of 359 manually labeled HTML journal articles, the proposed layout analysis algorithm was found to significantly outperform an alternative semantic markup algorithm based on HTML 5 semantic tags. The precision, recall, and F-score of the layout analysis algorithm were measured to be 0.96, 0.99, and 0.98 respectively.
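The final step described above, choosing the column that contains the most paragraph text, can be sketched as follows. This is not the authors' implementation: it assumes the earlier segmentation has already produced columns of text blocks, and the `Block` type, its `is_paragraph` flag, and the use of character counts as the measure of "most text" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A text block in a rendered page column (hypothetical structure)."""
    text: str
    is_paragraph: bool  # True for body paragraphs, False for links, ads, etc.


def paragraph_chars(column: List[Block]) -> int:
    """Total number of characters of paragraph text in one column."""
    return sum(len(b.text) for b in column if b.is_paragraph)


def main_content_column(columns: List[List[Block]]) -> int:
    """Return the index of the column with the most paragraph text."""
    if not columns:
        raise ValueError("page has no columns")
    return max(range(len(columns)), key=lambda i: paragraph_chars(columns[i]))


if __name__ == "__main__":
    page = [
        [Block("Home", False), Block("Archive", False)],      # navigation
        [Block("Title", False), Block("Body text " * 40, True)],  # article
        [Block("Advertisement", False)],                      # sidebar
    ]
    print(main_content_column(page))
```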

computer-based medical systems | 2016

Visualization of Statistics from MEDLINE

Jongwoo Kim; Paul LoBuglio; George R. Thoma

We propose a system to visualize statistics collected from NLM's MEDLINE® database, which contains citations related to biomedical journal articles. The system extracts information from author affiliations in the articles (organization, city, state, country, etc.), categorizes the articles into several groups using this information, collects statistics such as the number of articles published per country each year, and displays the statistics through a Web site using tables and choropleth maps. A Hidden Markov Model (HMM) and statistics are used to extract the information from the affiliations, and Google Map API, JSON, JavaScript and other APIs are used for the development of the site.
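One of the statistics mentioned above, the number of articles published per country each year, amounts to a simple grouped count once the country has been extracted from each affiliation. The sketch below assumes that extraction has already happened; the record layout and the sample values are hypothetical and are not MEDLINE data.

```python
from collections import Counter
from typing import Iterable, Tuple


def articles_per_country_year(
    records: Iterable[Tuple[str, int]]
) -> Counter:
    """Count articles for each (country, year) pair.

    Each record is an assumed (country, publication_year) tuple, one per
    article, produced by an upstream affiliation parser.
    """
    return Counter(records)


if __name__ == "__main__":
    sample = [
        ("USA", 2014),
        ("USA", 2014),
        ("Korea", 2014),
        ("USA", 2015),
    ]
    counts = articles_per_country_year(sample)
    for (country, year), n in sorted(counts.items()):
        print(country, year, n)
```

Counts keyed this way map directly onto a table or a choropleth layer with one value per country for a selected year.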
