Network

Latest external collaborations at the country level.

Hotspot

Research topics where Jorge R. Herskovic is active.

Publications

Featured research published by Jorge R. Herskovic.


Journal of the American Medical Informatics Association | 2007

A Day in the Life of PubMed: Analysis of a Typical Day's Query Log

Jorge R. Herskovic; Len Y. Tanaka; William R. Hersh; Elmer V. Bernstam

OBJECTIVE To characterize PubMed usage over a typical day and compare it to previous studies of user behavior on Web search engines. DESIGN We performed a lexical and semantic analysis of 2,689,166 queries issued on PubMed over 24 consecutive hours on a typical day. MEASUREMENTS We measured the number of queries, number of distinct users, queries per user, terms per query, common terms, Boolean operator use, common phrases, result set size, and MeSH categories; we used semantic measures to group queries into sessions and studied the addition and removal of terms across consecutive queries to gauge search strategies. RESULTS The size of the result sets from a sample of queries showed a bimodal distribution, with peaks at approximately 3 and 100 results, suggesting that one large group of queries was tightly focused and another was broad. Like Web search engine sessions, most PubMed sessions consisted of a single query. However, PubMed queries contained more terms. CONCLUSION PubMed's usage profile should be considered when educating users, building user interfaces, and developing future biomedical information retrieval systems.
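
The lexical side of such an analysis is straightforward to reproduce. A minimal sketch in Python, assuming a hypothetical tab-separated log of (timestamp, user_id, query); the field layout is an assumption for illustration, not PubMed's actual log format:

```python
# Minimal sketch: basic query-log statistics from a hypothetical
# tab-separated log of (timestamp, user_id, query_text).
from collections import defaultdict

BOOLEANS = {"AND", "OR", "NOT"}

def log_stats(lines):
    queries_per_user = defaultdict(int)
    term_counts = []
    boolean_queries = 0
    for line in lines:
        _, user_id, query = line.rstrip("\n").split("\t", 2)
        terms = query.split()
        queries_per_user[user_id] += 1
        term_counts.append(len(terms))
        if any(t in BOOLEANS for t in terms):
            boolean_queries += 1
    n = len(term_counts)
    return {
        "queries": n,
        "distinct_users": len(queries_per_user),
        "mean_terms_per_query": sum(term_counts) / n,
        "boolean_fraction": boolean_queries / n,
    }
```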


Journal of the American Medical Informatics Association | 2006

Using citation data to improve retrieval from MEDLINE

Elmer V. Bernstam; Jorge R. Herskovic; Yindalon Aphinyanaphongs; Constantin F. Aliferis; Madurai G. Sriram; William R. Hersh

OBJECTIVE To determine whether algorithms developed for the World Wide Web can be applied to the biomedical literature in order to identify articles that are important as well as relevant. DESIGN AND MEASUREMENTS A direct comparison of eight algorithms: simple PubMed queries, clinical queries (sensitive and specific versions), vector cosine comparison, citation count, journal impact factor, PageRank, and machine learning based on polynomial support vector machines. The objective was to prioritize important articles, defined as being included in a pre-existing bibliography of important literature in surgical oncology. RESULTS Citation-based algorithms were more effective than noncitation-based algorithms at identifying important articles. The most effective strategies were simple citation count and PageRank, which on average identified over six important articles in the first 100 results compared to 0.85 for the best noncitation-based algorithm (p < 0.001). The authors saw similar differences between citation-based and noncitation-based algorithms at 10, 20, 50, 200, 500, and 1,000 results (p < 0.001). Citation lag affects performance of PageRank more than simple citation count. However, in spite of citation lag, citation-based algorithms remain more effective than noncitation-based algorithms. CONCLUSION Algorithms that have proved successful on the World Wide Web can be applied to biomedical information retrieval. Citation-based algorithms can help identify important articles within large sets of relevant results. Further studies are needed to determine whether citation-based algorithms can effectively meet actual user information needs.
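
The two winning strategies are easy to prototype. A sketch under simplified assumptions: the citation graph is a toy adjacency dict mapping each article to the articles it cites, citation count is in-degree, and PageRank is computed by plain power iteration (the damping factor and iteration count are conventional defaults, not the study's settings):

```python
# Sketch: rank articles by citation count (in-degree) and by
# PageRank over a toy citation graph {article: [cited articles]}.

def citation_counts(graph):
    counts = {}
    for citations in graph.values():
        for cited in citations:
            counts[cited] = counts.get(cited, 0) + 1
    return counts

def pagerank(graph, damping=0.85, iterations=50):
    nodes = set(graph) | {c for cs in graph.values() for c in cs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            citations = graph.get(node, [])
            if citations:
                share = damping * rank[node] / len(citations)
                for cited in citations:
                    new_rank[cited] += share
            else:
                # Dangling article: spread its rank uniformly.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Example: articles b and c both cite a; a cites nothing.
toy = {"a": [], "b": ["a"], "c": ["a"]}
print(sorted(pagerank(toy).items(), key=lambda kv: -kv[1]))
```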


Journal of the American Medical Informatics Association | 2014

A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation

Erel Joffe; Michael J. Byrne; Phillip Reeder; Jorge R. Herskovic; Craig W. Johnson; Allison B. McCoy; Dean F. Sittig; Elmer V. Bernstam

INTRODUCTION Clinical databases require accurate entity resolution (ER). One approach is to use algorithms that assign questionable cases to manual review. Few studies have compared the performance of common algorithms for such a task, and previous work has been limited by a lack of objective methods for setting algorithm parameters. We compared the performance of common ER algorithms, using algorithmic optimization rather than manual parameter tuning, for both two-threshold classification (match/manual review/non-match) and single-threshold classification (match/non-match). METHODS We manually reviewed 20,000 randomly selected potential duplicate record-pairs to identify matches (10,000 training set, 10,000 test set). We evaluated the probabilistic expectation maximization, simple deterministic, and fuzzy inference engine (FIE) algorithms. We used particle swarm optimization to tune algorithm parameters for single- and two-threshold classification. We ran 10 iterations of optimization using the training set and report averaged performance against the test set. RESULTS The overall estimated duplicate rate was 6%. FIE and the simple deterministic algorithm produced smaller manual review sets than the probabilistic method (FIE 1.9%, simple deterministic 2.5%, probabilistic 3.6%; p<0.001). For a single threshold, the simple deterministic algorithm performed better than the probabilistic method (positive predictive value 0.956 vs 0.887, sensitivity 0.985 vs 0.887, p<0.001). ER with FIE classified 98.1% of record-pairs correctly (1/10,000 error rate), assigning the remainder to manual review. CONCLUSIONS Optimized deterministic algorithms outperform the probabilistic method. There is a strong case for considering optimized deterministic methods for ER.
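
The two-threshold scheme at the core of the comparison is easy to state in code. A minimal sketch with placeholder thresholds and a naive field-agreement score; the study tuned these parameters with particle swarm optimization rather than setting them by hand:

```python
# Sketch: two-threshold entity-resolution triage with a naive
# field-agreement score. Thresholds are placeholders; the study
# optimized parameters with particle swarm, not by hand.
T_MATCH = 0.90
T_NONMATCH = 0.40

def agreement_score(rec_a, rec_b, fields=("last", "first", "dob")):
    # Fraction of compared fields that agree exactly.
    return sum(rec_a.get(f) == rec_b.get(f) for f in fields) / len(fields)

def triage(score):
    if score >= T_MATCH:
        return "match"
    if score <= T_NONMATCH:
        return "non-match"
    return "manual review"

pair = ({"last": "Smith", "first": "Ann", "dob": "1970-01-01"},
        {"last": "Smith", "first": "Anne", "dob": "1970-01-01"})
print(triage(agreement_score(*pair)))  # 2/3 agree -> "manual review"
```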


Medical Teacher | 2000

Ownership of computers and abilities for their use in a sample of Chilean medical students

P. Herskovic; A. Vásquez; Jorge R. Herskovic; V. Herskovic; A. Roizen; M. T. Urrutia; C. Miranda; M. Beytía

Computer skills are valuable assets for medical students. A survey was conducted among the 283 students of the University of Chile Medical School at the East Campus, and 90% answered. Of these, 75% owned a computer at home; 4% reported having no access to computers at all. The proportions of students reporting each surveyed computer skill were: word processing (94%), Medline search on the Internet (52%), spreadsheets (48%), and email (43%). A significantly larger proportion of male medical students rated themselves as able to perform Medline searches and communicate via email. Computer ownership was associated with better skills. There is a need to improve the computer skills of our medical students.


Journal of the American Medical Informatics Association | 2015

Expert guided natural language processing using one-class classification

Erel Joffe; Emily J Pettigrew; Jorge R. Herskovic; Charles F. Bearden; Elmer V. Bernstam

INTRODUCTION Automatically identifying specific phenotypes in free-text clinical notes is critically important for the reuse of clinical data. In this study, the authors combine expert-guided feature (text) selection with one-class classification for text processing. OBJECTIVES To compare the performance of one-class classification to traditional binary classification; to evaluate the utility of feature selection based on expert-selected salient text (snippets); and to determine the robustness of these models with respect to irrelevant surrounding text. METHODS The authors trained one-class support vector machines (1C-SVMs) and two-class SVMs (2C-SVMs) to identify notes discussing breast cancer. Manually annotated visit summary notes (88 positive and 88 negative for breast cancer) were used to compare the performance of models trained on whole notes labeled as positive or negative to models trained on expert-selected text sections (snippets) relevant to breast cancer status. Model performance was evaluated using a 70:30 split for 20 iterations and on a realistic dataset of 10 000 records with a breast cancer prevalence of 1.4%. RESULTS When tested on a balanced experimental dataset, 1C-SVMs trained on snippets had results comparable to 2C-SVMs trained on whole notes (F = 0.92 for both approaches). When evaluated on a realistic imbalanced dataset, 1C-SVMs performed considerably better (F = 0.61 vs F = 0.17 for the best performing model), attributable mainly to improved precision (0.88 vs 0.09 for the best performing model). CONCLUSIONS 1C-SVMs trained on expert-selected relevant text sections perform better than 2C-SVM classifiers trained on either snippets or whole notes when applied to realistically imbalanced data with low prevalence of the positive class.
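
A minimal sketch of the winning configuration, assuming scikit-learn, with hypothetical snippets and illustrative hyperparameters rather than the study's exact settings:

```python
# Sketch: one-class text classification trained only on
# expert-selected positive snippets (snippets and hyperparameters
# here are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

positive_snippets = [
    "history of invasive ductal carcinoma of the left breast",
    "status post lumpectomy with adjuvant radiation therapy",
    "breast cancer followed by medical oncology",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(positive_snippets)

# nu bounds the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="linear", nu=0.1)
clf.fit(X_train)

def mentions_phenotype(note_text):
    X = vectorizer.transform([note_text])
    return clf.predict(X)[0] == 1  # +1 = inlier, -1 = outlier
```

Training on positives only is what makes the approach robust to class imbalance: no negative examples need to be representative of the (vast, heterogeneous) non-breast-cancer note population.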


BMC Bioinformatics | 2012

Graph-based signal integration for high-throughput phenotyping

Jorge R. Herskovic; Devika Subramanian; Trevor Cohen; Pamela A Bozzo-Silva; Charles F. Bearden; Elmer V. Bernstam

BACKGROUND Electronic Health Records aggregated in Clinical Data Warehouses (CDWs) promise to revolutionize Comparative Effectiveness Research and suggest new avenues of research. However, the effectiveness of CDWs is diminished by the lack of properly labeled data. We present a novel approach that integrates knowledge from the CDW, the biomedical literature, and the Unified Medical Language System (UMLS) to perform high-throughput phenotyping. In this paper, we automatically construct a graphical knowledge model and then use it to phenotype breast cancer patients. We compare the performance of this approach to using MetaMap when labeling records. RESULTS MetaMap's overall accuracy at identifying breast cancer patients was 51.1% (n=428); recall=85.4%, precision=26.2%, and F1=40.1%. Our unsupervised graph-based high-throughput phenotyping had an accuracy of 84.1%; recall=46.3%, precision=61.2%, and F1=52.8%. CONCLUSIONS We conclude that our approach is a promising alternative for unsupervised high-throughput phenotyping.
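
The paper's graphical knowledge model integrates several sources; as a loose, invented illustration of the underlying idea, one can score a patient by how strongly the concepts found in their notes connect to the target phenotype concept in such a graph:

```python
# Toy sketch: score patients by graph connectivity between note
# concepts and a target phenotype concept. The graph and its edge
# weights are invented here; in the actual approach they would be
# derived from the CDW, the literature, and the UMLS.
PHENOTYPE = "breast cancer"

concept_graph = {
    "mammogram": {"breast cancer": 0.7},
    "tamoxifen": {"breast cancer": 0.9},
    "lisinopril": {"hypertension": 0.8},
}

def phenotype_score(note_concepts):
    # Sum edge weights from each note concept to the phenotype.
    return sum(concept_graph.get(c, {}).get(PHENOTYPE, 0.0)
               for c in note_concepts)

def label_patient(note_concepts, threshold=0.5):
    return phenotype_score(note_concepts) >= threshold

print(label_patient({"mammogram", "tamoxifen"}))  # True
print(label_patient({"lisinopril"}))              # False
```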


Journal of Biomedical Informatics | 2007

Using hit curves to compare search algorithm performance

Jorge R. Herskovic; M. Sriram Iyengar; Elmer V. Bernstam

Databases continue to grow, but the metrics available to evaluate information retrieval systems have not changed. Large collections such as MEDLINE and the World Wide Web contain many relevant documents for common queries. Ranking is therefore increasingly important, and successful information retrieval systems, such as Google, have emphasized ranking. However, existing evaluation metrics, such as precision and recall, do not directly account for ranking. This paper describes a novel way of measuring information retrieval performance using weighted hit curves, adapted from the field of statistical detection, to reflect multiple desirable characteristics such as relevance, importance, and methodologic quality. In statistical detection, hit curves have been proposed to represent the occurrence of interesting events during a detection process. Similarly, hit curves can be used to study the position of relevant documents within large result sets. We describe hit curves in light of a formal model of information retrieval, show how hit curves represent system performance including ranking, and define ways to statistically compare the performance of multiple systems using hit curves. We provide example scenarios where traditional measures are less suitable than hit curves and conclude that hit curves may be useful for evaluating retrieval from large collections where ranking performance is crucial.
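
A hit curve itself is simple to compute: walk the ranked result list and accumulate the (optionally weighted) hits seen so far. A minimal sketch with binary relevance weights; the weighting function could equally encode importance or methodologic quality:

```python
# Sketch: a (weighted) hit curve — cumulative hits by rank position.
from itertools import accumulate

def hit_curve(ranked_results, weight):
    """ranked_results: documents in ranked order.
    weight: maps a document to its weight (1.0 if relevant, else
    0.0 for binary relevance; other weightings work the same way)."""
    return list(accumulate(weight(doc) for doc in ranked_results))

# Example: binary relevance on a 6-document result list.
relevant = {"d2", "d3", "d6"}
curve = hit_curve(["d1", "d2", "d3", "d4", "d5", "d6"],
                  lambda d: 1.0 if d in relevant else 0.0)
# curve == [0.0, 1.0, 2.0, 2.0, 2.0, 3.0]
```

A system that ranks relevant documents earlier produces a curve that rises sooner, which is exactly the ranking behavior precision and recall fail to capture.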


Journal of the American Medical Informatics Association | 2012

Predicting biomedical document access as a function of past use

J. Caleb Goodwin; Todd R. Johnson; Trevor Cohen; Jorge R. Herskovic; Elmer V. Bernstam

OBJECTIVE To determine whether past access to biomedical documents can predict future document access. MATERIALS AND METHODS The authors used 394 days of query logs (August 1, 2009 to August 29, 2010) from PubMed users in the Texas Medical Center, the largest medical center in the world. The authors evaluated two document access models based on the work of Anderson and Schooler. The first is based on how frequently a document was accessed; the second on both frequency and recency. RESULTS The model based only on frequency of past access was highly correlated with the empirical data (R²=0.932), whereas the model based on frequency and recency had a much lower correlation (R²=0.668). DISCUSSION The frequency-only model accurately predicted whether a document will be accessed based on past use. Modeling accesses as a function of frequency requires storing only the number of accesses and the creation date for the document. This model has low storage overhead and is computationally efficient, making it scalable to large corpora such as MEDLINE. CONCLUSION It is feasible to accurately model the probability of a document being accessed in the future based on past accesses.
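
A sketch of the bookkeeping behind a frequency-only model, under the simplifying assumption that the predicted probability of future access is the empirical access rate among documents with the same past access count; this illustrates why only a count per document must be stored, and is not the Anderson and Schooler formulation used in the paper:

```python
# Sketch: estimate P(access in a future window) from past access
# frequency alone. Only an access count per document is needed.
from collections import Counter, defaultdict

def frequency_model(train_accesses, test_accesses, all_docs):
    """Empirical P(accessed in test window | k accesses in training)."""
    past = Counter(train_accesses)        # doc -> past access count
    accessed_later = set(test_accesses)
    totals, hits = defaultdict(int), defaultdict(int)
    for doc in all_docs:
        k = past[doc]
        totals[k] += 1
        if doc in accessed_later:
            hits[k] += 1
    return {k: hits[k] / totals[k] for k in sorted(totals)}
```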


Journal of Biomedical Informatics | 2013

SYFSA: a framework for systematic yet flexible systems analysis.

Todd R. Johnson; Eliz Markowitz; Elmer V. Bernstam; Jorge R. Herskovic; Harold W. Thimbleby

Although technological or organizational systems that enforce systematic procedures and best practices can lead to improvements in quality, these systems must also be designed to allow users to adapt to the inherent uncertainty, complexity, and variation in healthcare. We present a framework, called Systematic Yet Flexible Systems Analysis (SYFSA), that supports the design and analysis of Systematic Yet Flexible (SYF) systems (whether organizational or technical) by formally considering the tradeoffs between systematicity and flexibility. SYFSA is based on analyzing a task using three related problem spaces: the idealized space, the natural space, and the system space. The idealized space represents best practice: how the task is to be accomplished under ideal conditions. The natural space captures the task actions and constraints on how the task is currently done. The system space specifies how the task is done in a redesigned system, including how it may deviate from the idealized space and how the system supports or enforces task constraints. The goal of the framework is to support the design of systems that allow graceful degradation from the idealized space to the natural space. We demonstrate the application of SYFSA to the analysis of a simplified central line insertion task. We also describe several information-theoretic measures of flexibility that can be used to compare alternative designs and to measure how efficiently a system supports a given task, the relative cognitive workload, and learnability.
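
The paper defines its information-theoretic measures over the three problem spaces; as a rough illustration of the flavor of such a measure (not SYFSA's exact formulation), the Shannon entropy of the distribution over permitted next actions at a task state rises with the number and evenness of allowed choices:

```python
# Rough sketch: Shannon entropy over permitted next actions as one
# possible flexibility measure (illustrative only).
import math

def flexibility(action_probabilities):
    """Entropy (bits) of the distribution over permitted next actions."""
    return -sum(p * math.log2(p) for p in action_probabilities if p > 0)

# A rigid system that enforces a single action has zero flexibility;
# four equally likely permitted actions give 2 bits.
assert flexibility([1.0]) == 0.0
assert flexibility([0.25] * 4) == 2.0
```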


American Medical Informatics Association Annual Symposium | 2012

Deterministic binary vectors for efficient automated indexing of MEDLINE/PubMed abstracts.

Manuel Wahle; Dominic Widdows; Jorge R. Herskovic; Elmer V. Bernstam; Trevor Cohen
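
No abstract is indexed for this entry, but the title points at generating index vectors deterministically rather than storing random ones. As a hedged illustration of that general idea (the dimensionality, sparsity, and hashing scheme below are invented, not the paper's method), a term can be hashed to a seed so that the same term always reproduces the same sparse binary vector:

```python
# Illustrative sketch: derive a reproducible sparse binary vector
# for a term by seeding a PRNG with a hash of the term itself.
# Dimensionality and sparsity here are invented values.
import hashlib
import random

DIM = 10_000     # vector dimensionality
NONZERO = 20     # number of set bits

def binary_vector(term):
    seed = int.from_bytes(hashlib.sha256(term.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    vec = [0] * DIM
    for pos in rng.sample(range(DIM), NONZERO):
        vec[pos] = 1
    return vec

# The same term always maps to the same vector, so no lookup table
# of random index vectors needs to be stored.
assert binary_vector("MEDLINE") == binary_vector("MEDLINE")
```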

Collaboration

Dive into Jorge R. Herskovic's collaborations.

Top Co-Authors

Elmer V. Bernstam (University of Texas Health Science Center at Houston)
Todd R. Johnson (University of Texas System)
Eliz Markowitz (University of Texas Health Science Center at Houston)
Erel Joffe (University of Texas Health Science Center at Houston)
Trevor Cohen (University of Texas Health Science Center at Houston)
Craig W. Johnson (University of Texas at Austin)
Irmgard Willcockson (University of Texas Health Science Center at Houston)
Charles F. Bearden (University of Texas Health Science Center at Houston)