
Publication


Featured research published by Jinho D. Choi.


Journal of the American Medical Informatics Association | 2013

Towards comprehensive syntactic and semantic annotations of the clinical narrative

Daniel Albright; Arrick Lanfranchi; Anwen Fredriksen; Will Styler; Colin Warner; Jena D. Hwang; Jinho D. Choi; Dmitriy Dligach; Rodney D. Nielsen; James H. Martin; Wayne H. Ward; Martha Palmer; Guergana Savova

Objective To create annotated clinical narratives with layers of syntactic and semantic labels to facilitate advances in clinical natural language processing (NLP). To develop NLP algorithms and open source components. Methods Manual annotation of a clinical narrative corpus of 127,606 tokens following the Treebank schema for syntactic information, PropBank schema for predicate-argument structures, and the Unified Medical Language System (UMLS) schema for semantic information. NLP components were developed. Results The final corpus consists of 13,091 sentences containing 1,772 distinct predicate lemmas. Of the 766 newly created PropBank frames, 74 are verbs. There are 28,539 named entity (NE) annotations spread over 15 UMLS semantic groups, one UMLS semantic type, and the Person semantic category. The most frequent annotations belong to the UMLS semantic groups of Procedures (15.71%), Disorders (14.74%), Concepts and Ideas (15.10%), Anatomy (12.80%), Chemicals and Drugs (7.49%), and the UMLS semantic type of Sign or Symptom (12.46%). Inter-annotator agreement results: Treebank (0.926), PropBank (0.891–0.931), NE (0.697–0.750). The part-of-speech tagger, constituency parser, dependency parser, and semantic role labeler are built from the corpus and released open source. A significant limitation uncovered by this project is the need for the NLP community to develop a widely agreed-upon schema for the annotation of clinical concepts and their relations. Conclusions This project takes a foundational step towards bringing the field of clinical NLP up to par with NLP in the general domain. The corpus creation and NLP components provide a resource for research and application development that would have been previously impossible.


BMC Bioinformatics | 2012

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Karin Verspoor; Kevin Bretonnel Cohen; Arrick Lanfranchi; Colin Warner; Helen L. Johnson; Christophe Roeder; Jinho D. Choi; Christopher S. Funk; Yuriy Malenkiy; Miriam Eckert; Nianwen Xue; William A. Baumgartner; Michael Bada; Martha Palmer; Lawrence Hunter

Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.


International Joint Conference on Natural Language Processing | 2015

It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool

Jinho D. Choi; Joel R. Tetreault; Amanda Stent

The last few years have seen a surge in the number of accurate, fast, publicly available dependency parsers. At the same time, the use of dependency parsing in NLP applications has increased. It can be difficult for a non-expert to select a good “off-the-shelf” parser. We present a comparative analysis of ten leading statistical dependency parsers on a multi-genre corpus of English. For our analysis, we developed a new web-based tool that gives a convenient way of comparing dependency parser outputs. Our analysis will help practitioners choose a parser to optimize their desired speed/accuracy tradeoff, and our tool will help practitioners examine and compare parser output.
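The kind of comparison the tool supports can be illustrated with a small sketch: given two parsers' outputs for the same sentence as head-index lists, compute how often they agree on a token's head and flag the tokens where they differ. The sentence and head indices below are invented for illustration, not drawn from the paper.

```python
# Hypothetical sketch: comparing two dependency parsers' outputs on the
# same sentence by unlabeled attachment agreement.

def attachment_agreement(heads_a, heads_b):
    """Fraction of tokens to which both parsers assign the same head."""
    assert len(heads_a) == len(heads_b)
    same = sum(1 for a, b in zip(heads_a, heads_b) if a == b)
    return same / len(heads_a)

# "Economic news had little effect" -- head index per token (0 = root)
parser_a = [2, 3, 0, 5, 3]   # output of hypothetical parser A
parser_b = [2, 3, 0, 5, 2]   # output of hypothetical parser B

agreement = attachment_agreement(parser_a, parser_b)
disagreements = [i for i, (a, b) in enumerate(zip(parser_a, parser_b)) if a != b]
```

Side-by-side disagreement lists like this are what let a practitioner inspect where two parsers diverge rather than relying on a single aggregate score.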


International Conference on Tools with Artificial Intelligence | 2016

SelQA: A New Benchmark for Selection-Based Question Answering

Tomasz Jurczyk; Michael Zhai; Jinho D. Choi

This paper presents a new selection-based question answering dataset, SelQA. The dataset consists of questions generated through crowdsourcing and sentence-length answers drawn from the ten most prevalent topics in the English Wikipedia. We introduce a corpus annotation scheme that enhances the generation of large, diverse, and challenging datasets by explicitly aiming to reduce word co-occurrences between the question and answers. Our annotation scheme is composed of a series of crowdsourcing tasks designed to utilize crowdsourcing more effectively in the creation of question answering datasets in various domains. Several systems are compared on the tasks of answer sentence selection and answer triggering, providing strong baseline results for future work to improve upon.
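A minimal sketch of the answer sentence selection task that SelQA benchmarks: score each candidate sentence by word overlap with the question and pick the highest scorer. This naive baseline is exactly what the annotation scheme's reduction of question/answer word co-occurrence is designed to defeat. The question and sentences are invented examples, not SelQA data.

```python
# Word-overlap baseline for answer sentence selection (illustrative only).

def overlap_score(question, sentence):
    """Fraction of question words that also appear in the sentence."""
    q = set(question.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / len(q)

def select_answer(question, candidates):
    """Return the candidate sentence with the highest overlap score."""
    return max(candidates, key=lambda s: overlap_score(question, s))

question = "when was the telephone patented"
candidates = [
    "The telephone was patented in 1876.",
    "Many inventors worked on acoustic telegraphy.",
]
best = select_answer(question, candidates)
```

On a dataset annotated to suppress such lexical overlap, this baseline degrades sharply, which is what makes the benchmark challenging.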


Linguistic Annotation Workshop | 2009

Using Parallel Propbanks to enhance Word-alignments

Jinho D. Choi; Martha Palmer; Nianwen Xue

This short paper describes the use of the linguistic annotation available in parallel PropBanks (Chinese and English) for the enhancement of automatically derived word alignments. Specifically, we suggest ways to refine and expand word alignments for verb-predicates by using predicate-argument structures. Evaluations demonstrate improved alignment accuracies that vary by corpus type.
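The core idea can be sketched as follows: if a source-language predicate and a target-language predicate are already word-aligned, their PropBank arguments (ARG0, ARG1, ...) are likely translations of each other, so shared argument labels can propose new alignment links. The data structures, token indices, and labels below are invented for illustration and simplify argument spans to single tokens.

```python
# Hedged sketch: expanding word alignments via predicate-argument structures.

def expand_alignments(alignments, src_args, tgt_args):
    """Add links between argument tokens sharing a PropBank label
    whenever the predicates they attach to are already aligned."""
    new_links = set(alignments)
    for src_pred, tgt_pred in alignments:
        if src_pred in src_args and tgt_pred in tgt_args:
            for label, src_tok in src_args[src_pred].items():
                tgt_tok = tgt_args[tgt_pred].get(label)
                if tgt_tok is not None:
                    new_links.add((src_tok, tgt_tok))
    return new_links

# Predicate alignment: source token 2 aligned to target token 1.
alignments = {(2, 1)}
src_args = {2: {"ARG0": 0, "ARG1": 4}}   # arguments of the source predicate
tgt_args = {1: {"ARG0": 0, "ARG1": 3}}   # arguments of the target predicate
expanded = expand_alignments(alignments, src_args, tgt_args)
```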


North American Chapter of the Association for Computational Linguistics | 2016

Dynamic Feature Induction: The Last Gist to the State-of-the-Art

Jinho D. Choi

We introduce a novel technique called dynamic feature induction that keeps inducing high-dimensional features automatically until the feature space becomes ‘more’ linearly separable. Dynamic feature induction searches for the feature combinations that give strong clues for distinguishing certain label pairs, and generates joint features from these combinations. These induced features are trained along with the primitive low-dimensional features. Our approach was evaluated on two core NLP tasks, part-of-speech tagging and named entity recognition, and achieved state-of-the-art results for both tasks, an accuracy of 97.64% and an F1-score of 91.00 respectively, at the cost of about a 25% increase in the feature space.
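The gist of the induction step can be illustrated with a toy sketch: find pairs of primitive features that co-occur with only one label, and add their conjunction as a new joint feature. The counting heuristic, threshold, and feature strings below are simplifications invented for illustration, not the paper's exact selection procedure.

```python
# Toy sketch of feature-pair induction for discriminating label pairs.
from collections import Counter
from itertools import combinations

def induce_features(instances, min_count=2):
    """instances: list of (set_of_features, label). Returns conjoined
    feature pairs seen with exactly one label at least min_count times."""
    pair_labels = {}
    pair_counts = Counter()
    for feats, label in instances:
        for pair in combinations(sorted(feats), 2):
            pair_counts[pair] += 1
            pair_labels.setdefault(pair, set()).add(label)
    return {pair for pair, labels in pair_labels.items()
            if len(labels) == 1 and pair_counts[pair] >= min_count}

# "bank" alone is ambiguous; conjoined with its left context it is not.
instances = [
    ({"w=bank", "prev=river"}, "LOC"),
    ({"w=bank", "prev=river"}, "LOC"),
    ({"w=bank", "prev=the"}, "ORG"),
    ({"w=bank", "prev=the"}, "LOC"),
]
induced = induce_features(instances)
```

The induced conjunction ("prev=river", "w=bank") is kept because it reliably signals one label, while the ambiguous pair is discarded; in the paper's setting such joint features are then trained alongside the primitive ones.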


Intelligent User Interfaces | 2015

Real-Time Community Question Answering: Exploring Content Recommendation and User Notification Strategies

Qiaoling Liu; Tomasz Jurczyk; Jinho D. Choi; Eugene Agichtein

Community-based Question Answering (CQA) services allow users to find and share information by interacting with others. A key to the success of CQA services is the quality and timeliness of the responses that users get. With the increasing use of mobile devices, searchers increasingly expect to find more local and time-sensitive information, such as the current special at a cafe around the corner. Yet, few services provide such hyper-local and time-aware question answering. This requires intelligent content recommendation and careful use of notifications (e.g., recommending questions to only selected users). To explore these issues, we developed RealQA, a real-time CQA system with a mobile interface, and performed two user studies: a formative pilot study with the initial system design, and a more extensive study with the revised UI and algorithms. The research design combined qualitative survey analysis and quantitative behavior analysis under different conditions. We report our findings of the prevalent information needs and types of responses users provided, and of the effectiveness of the recommendation and notification strategies on user experience and satisfaction. Our system and findings offer insights and implications for designing real-time CQA systems, and provide a valuable platform for future research.


Social Informatics | 2017

Event Analysis on the 2016 U.S. Presidential Election Using Social Media

Tarrek A. Shaban; Lindsay Hexter; Jinho D. Choi

It is not surprising that social media played an important role in shaping the political debate during the 2016 presidential election. The dynamics of social media provide a unique opportunity to detect and interpret the pivotal events and scandals of the candidates quantitatively. This paper examines several text-based analyses to determine which topics had a lasting impact on the election for the two main candidates, Clinton and Trump. About 135.5 million tweets were collected over the six weeks prior to the election. From these tweets, topic clustering, keyword extraction, and tweeter analysis are performed to better understand the impact of the events that occurred during this period. Our analysis builds upon a social science foundation to provide another avenue for scholars to use in discerning how events detected from social media reveal the impact of campaigns on the election.
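One of the techniques named above, keyword extraction, can be sketched with a standard TF-IDF scoring pass: weight each word by its frequency in a document against its spread across the collection, and keep the top-scoring word per document. The mini "tweets" below are invented stand-ins, not data from the study.

```python
# Minimal TF-IDF keyword extraction over a toy tweet collection.
import math

def tfidf_keywords(docs):
    """Return the highest-TF-IDF word for each tokenized document."""
    n = len(docs)
    df = {}                              # document frequency per word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    keywords = []
    for doc in docs:
        scores = {w: doc.count(w) * math.log(n / df[w]) for w in set(doc)}
        keywords.append(max(scores, key=scores.get))
    return keywords

docs = [
    "the debate the debate tonight".split(),
    "the polls polls open".split(),
    "the debate tomorrow".split(),
]
keywords = tfidf_keywords(docs)
```

Words appearing in every document (like "the") score zero, so the surviving keywords are the ones that distinguish each tweet from the rest of the stream.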


bioRxiv | 2017

Natural Language Processing for Classification of Acute, Communicable Findings on Unstructured Head CT Reports: Comparison of Neural Network and Non-Neural Machine Learning Techniques

Falgun H. Chokshi; Bonggun Shin; Timothy Lee; Andrew Lemmon; Sean Necessary; Jinho D. Choi

Background and Purpose To evaluate the accuracy of non-neural and neural network models in classifying five categories (classes) of acute and communicable findings on unstructured head computed tomography (CT) reports. Materials and Methods Three radiologists annotated 1,400 head CT reports for language indicating the presence or absence of acute communicable findings (hemorrhage, stroke, hydrocephalus, and mass effect). This set was used to train, develop, and evaluate a non-neural classifier, a support vector machine (SVM), in comparison to two neural network models: a convolutional neural network (CNN) and a neural attention model (NAM). Inter-rater agreement was computed using kappa statistics. Accuracy, receiver operating characteristic curves, and area under the curve were calculated and tabulated. P-values < 0.05 were considered significant and 95% confidence intervals were computed. Results Radiologist agreement was 86-94% and Cohen’s kappa was 0.667-0.762 (substantial agreement). Accuracies of the CNN and NAM (range 0.90-0.94) were higher than that of the SVM (range 0.88-0.92). The NAM showed accuracy roughly equal to the CNN for three classes (severity, mass effect, and hydrocephalus), higher accuracy for the acute bleed class, and lower accuracy for the acute stroke class. AUCs of all methods for all classes were above 0.92. Conclusions The neural network models (CNN and NAM) generally had higher accuracies than the non-neural model (SVM), in a range comparable to the inter-annotator agreement of three neuroradiologists. The NAM method adds the ability to hold the algorithm accountable for its classification via heat map generation, thereby adding an auditing feature to this neural network. Abbreviations: NLP, Natural Language Processing; CNN, Convolutional Neural Network; NAM, Neural Attention Model; EHR, Electronic Health Record.
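The inter-rater agreement statistic reported above, Cohen's kappa, corrects raw agreement for the agreement expected by chance. A small self-contained sketch over two invented binary annotation sequences (not the radiologists' data):

```python
# Cohen's kappa for two annotators over the same items.

def cohens_kappa(a, b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    labels = set(a) | set(b)
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

rater1 = [1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical annotations
rater2 = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohens_kappa(rater1, rater2)
```

Raw agreement here is 0.75, but after the chance correction kappa drops to about 0.47, which is why kappa rather than raw agreement is the standard for annotation studies like this one.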


Meeting of the Association for Computational Linguistics | 2017

Text-based Speaker Identification on Multiparty Dialogues Using Multi-document Convolutional Neural Networks

Kaixin Ma; Catherine Xiao; Jinho D. Choi

We propose a convolutional neural network model for text-based speaker identification on multiparty dialogues extracted from the TV show Friends. While most previous work on this task relies heavily on acoustic features, our approach attempts to identify speakers in dialogues using their speech patterns as captured by the show’s transcripts. It has been shown that different individual speakers exhibit distinct idiolectal styles. Several convolutional neural network models are developed to discriminate between differing speech patterns. Our results confirm the promise of text-based approaches, with the best-performing model showing an accuracy improvement of over 6% over the baseline CNN model.
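The text-CNN mechanism behind such models can be sketched as a single forward pass: slide learned filters over a sequence of word embeddings, max-pool each filter's responses over time, and score speakers from the pooled features. The dimensions, weights, and embeddings below are random stand-ins, not trained parameters from the paper.

```python
# Toy forward pass of a text-CNN speaker classifier (untrained, illustrative).
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim, n_filters, width, n_speakers = 8, 16, 4, 3, 6
embeddings = rng.normal(size=(seq_len, emb_dim))       # one utterance
filters = rng.normal(size=(n_filters, width, emb_dim)) # conv filters
w_out = rng.normal(size=(n_speakers, n_filters))       # output layer

# 1-D convolution over time: each filter slides across word windows.
conv = np.array([
    [np.sum(embeddings[t:t + width] * f) for t in range(seq_len - width + 1)]
    for f in filters
])                                   # shape: (n_filters, seq_len - width + 1)
pooled = conv.max(axis=1)            # max-over-time pooling, one value per filter
scores = w_out @ pooled              # one score per candidate speaker
predicted_speaker = int(np.argmax(scores))
```

Max-over-time pooling is what lets the model pick up a characteristic phrase wherever it occurs in the utterance, which suits the idiolectal-style cues the paper relies on.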

Collaboration


Dive into Jinho D. Choi's collaborations.

Top Co-Authors

Martha Palmer (University of Colorado Boulder)
Claire Bonial (University of Colorado Boulder)
Andrew McCallum (University of Massachusetts Amherst)
Arrick Lanfranchi (University of Colorado Boulder)
Ashwini Vaidya (University of Colorado Boulder)
Bhuvana Narasimhan (University of Colorado Boulder)
Colin Warner (University of Pennsylvania)