
Publication


Featured research published by Serguei V. S. Pakhomov.


Journal of Biomedical Informatics | 2005

Domain-specific language models and lexicons for tagging

Anni Coden; Serguei V. S. Pakhomov; Rie Kubota Ando; Patrick H. Duffy; Christopher G. Chute

Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts performance to over 92% accuracy from 87% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy results on a new domain at a relatively small cost.
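As a rough illustration of the corpus-combination idea, the sketch below trains an off-the-shelf perceptron tagger on NLTK's Penn Treebank sample plus a tiny, hypothetical set of domain-tagged sentences; the paper's actual tagger, corpora, and accuracy figures are not reproduced here.

```python
# Minimal sketch, assuming NLTK with the 'treebank' corpus downloaded.
# The two "clinical" sentences are hypothetical stand-ins for a small
# domain-specific annotated corpus.
from nltk.corpus import treebank
from nltk.tag.perceptron import PerceptronTagger

general_sents = list(treebank.tagged_sents())      # large general-English corpus
domain_sents = [                                   # tiny illustrative domain sample
    [("Patient", "NN"), ("denies", "VBZ"), ("chest", "NN"), ("pain", "NN"), (".", ".")],
    [("No", "DT"), ("acute", "JJ"), ("distress", "NN"), (".", ".")],
]

train_sents = general_sents[:-200] + domain_sents  # combine general + domain data
test_sents = general_sents[-200:]                  # in practice, held-out clinical text

tagger = PerceptronTagger(load=False)              # start from an untrained model
tagger.train(train_sents, nr_iter=5)

correct = total = 0
for sent in test_sents:
    words, gold = zip(*sent)
    predicted = [tag for _, tag in tagger.tag(list(words))]
    correct += sum(p == g for p, g in zip(predicted, gold))
    total += len(gold)
print(f"held-out tagging accuracy: {correct / total:.3f}")
```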


International Health Informatics Symposium | 2012

Semantic relatedness study using second order co-occurrence vectors computed from biomedical corpora, UMLS and WordNet

Ying Liu; Bridget T. McInnes; Ted Pedersen; Genevieve Melton-Meaux; Serguei V. S. Pakhomov

Automated measures of semantic relatedness are important for effectively processing medical data for a variety of tasks such as information retrieval and natural language processing. In this paper, we present a context vector approach that can compute the semantic relatedness between any pair of concepts in the Unified Medical Language System (UMLS). Our approach has been developed on a corpus of inpatient clinical reports. We use 430 pairs of clinical concepts manually rated for semantic relatedness as the reference standard. The experiments demonstrate that incorporating a combination of the UMLS and WordNet definitions can improve semantic relatedness measures. The paper also shows that the second-order co-occurrence vector measure is more effective than path-based methods for computing semantic relatedness.
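A minimal sketch of the second-order co-occurrence idea, using a toy corpus and made-up definitions in place of UMLS and WordNet glosses; it is not the authors' implementation.

```python
# Sketch: each word gets a first-order co-occurrence vector from a corpus, a
# concept is represented by averaging the vectors of the words in its
# definition (second order), and relatedness is the cosine between concept
# vectors. Corpus and definitions below are illustrative only.
from collections import defaultdict
import numpy as np

corpus = [
    "aspirin relieves pain and reduces fever".split(),
    "ibuprofen reduces fever and inflammation".split(),
    "insulin lowers blood glucose in diabetes".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# First-order co-occurrence vectors (window = whole sentence).
cooc = defaultdict(lambda: np.zeros(len(vocab)))
for sent in corpus:
    for w in sent:
        for c in sent:
            if c != w:
                cooc[w][index[c]] += 1

def concept_vector(definition):
    """Second-order vector: average the co-occurrence vectors of definition words."""
    vecs = [cooc[w] for w in definition.split() if w in cooc]
    return np.mean(vecs, axis=0) if vecs else np.zeros(len(vocab))

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical definition texts for two concepts.
v1 = concept_vector("drug that relieves pain and reduces fever")
v2 = concept_vector("drug that reduces fever and inflammation")
print("relatedness:", cosine(v1, v2))
```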


Cognitive and Behavioral Neurology | 2010

Computerized analysis of speech and language to identify psycholinguistic correlates of frontotemporal lobar degeneration

Serguei V. S. Pakhomov; Glenn E. Smith; Dustin Alfonso Chacón; Yara Feliciano; Neill R. Graff-Radford; Richard J. Caselli; David S. Knopman

Objective: To evaluate the use of a semiautomated computerized system for measuring speech and language characteristics in patients with frontotemporal lobar degeneration (FTLD). Background: FTLD is a heterogeneous disorder comprising at least 3 variants. Computerized assessment of spontaneous verbal descriptions by patients with FTLD offers a detailed and reproducible view of the underlying cognitive deficits. Methods: Audio-recorded speech samples of 38 patients from 3 participating medical centers were elicited using the Cookie Theft stimulus. Each patient underwent a battery of neuropsychologic tests. The audio was analyzed by the computerized system to measure 15 speech and language variables. Analysis of variance was used to identify characteristics with significant differences in means between FTLD variants. Factor analysis was used to examine the implicit relations between subsets of the variables. Results: Semiautomated measurements of pause-to-word ratio and pronoun-to-noun ratio were able to discriminate between some of the FTLD variants. Principal component analysis of all 14 variables suggested 4 subjectively defined components (length, hesitancy, empty content, grammaticality) corresponding to the phenomenology of FTLD variants. Conclusion: Semiautomated language and speech analysis is a promising novel approach to neuropsychologic assessment that offers a valuable contribution to the toolbox of researchers in dementia and other neurodegenerative disorders.
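By way of illustration, the sketch below computes one of the variables mentioned above, a pronoun-to-noun ratio, from a transcript using NLTK's default tagger; the pause-to-word ratio would additionally require time-aligned audio, and the paper's semiautomated system is not reproduced here.

```python
# Sketch: pronoun-to-noun ratio from a transcript. Assumes NLTK with the
# 'punkt' and 'averaged_perceptron_tagger' resources downloaded. The sample
# transcript is a made-up Cookie Theft-style description.
import nltk

def pronoun_noun_ratio(transcript: str) -> float:
    tokens = nltk.word_tokenize(transcript)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    pronouns = sum(tag in ("PRP", "PRP$") for tag in tags)   # personal/possessive pronouns
    nouns = sum(tag.startswith("NN") for tag in tags)        # all noun tags
    return pronouns / nouns if nouns else 0.0

sample = "She is reaching up there and he is taking the cookies from the jar."
print(round(pronoun_noun_ratio(sample), 2))
```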


Journal of Biomedical Informatics | 2014

Using semantic predications to uncover drug-drug interactions in clinical data

Rui Zhang; Michael J. Cairelli; Marcelo Fiszman; Graciela Rosemblat; Halil Kilicoglu; Thomas C. Rindflesch; Serguei V. S. Pakhomov; Genevieve B. Melton

In this study we report on potential drug-drug interactions between drugs occurring in patient clinical data. Results are based on relationships in SemMedDB, a database of structured knowledge extracted from all MEDLINE citations (titles and abstracts) using SemRep. The core of our methodology is to construct two potential drug-drug interaction schemas, based on relationships extracted from SemMedDB. In the first schema, Drug1 and Drug2 interact through Drug1's effect on some gene, which in turn affects Drug2. In the second, Drug1 affects Gene1, while Drug2 affects Gene2. Gene1 and Gene2, together, then have an effect on some biological function. After checking each drug pair from the medication lists of each of 22 patients, we found 19 known and 62 unknown drug-drug interactions using both schemas. For example, our results suggest that the interaction of lisinopril, an ACE inhibitor commonly prescribed for hypertension, and the antidepressant sertraline can potentially increase the likelihood and possibly the severity of psoriasis. We also assessed the relationships extracted by SemRep from a linguistic perspective and found that the precision of SemRep was 0.58 for 300 randomly selected sentences from MEDLINE. Our study demonstrates that the use of structured knowledge in the form of relationships from the biomedical literature can support the discovery of potential drug-drug interactions occurring in patient clinical data. Moreover, SemMedDB provides a good knowledge resource for expanding the range of drugs, genes, and biological functions considered as elements in various drug-drug interaction pathways.
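A minimal sketch of the first schema over toy subject-predicate-object triples; real use would query SemMedDB, and the drug and gene names below are placeholders, not real pharmacology.

```python
# Sketch of the first interaction schema: Drug1 affects a gene, which in turn
# affects Drug2. The predications are toy triples standing in for rows that
# would come from SemMedDB.
predications = [
    ("drugA", "INHIBITS", "geneX"),
    ("geneX", "INTERACTS_WITH", "drugB"),
    ("drugC", "STIMULATES", "geneY"),
]
drugs = {"drugA", "drugB", "drugC"}   # medication list for one hypothetical patient

def schema_one(drug1, drug2):
    """Genes g such that drug1 affects g and g in turn affects drug2."""
    affected_by_d1 = {o for s, p, o in predications if s == drug1 and o not in drugs}
    return {g for g in affected_by_d1
            if any(s == g and o == drug2 for s, p, o in predications)}

for d1 in sorted(drugs):
    for d2 in sorted(drugs):
        if d1 != d2:
            via = schema_one(d1, d2)
            if via:
                print(f"potential interaction: {d1} -> {sorted(via)} -> {d2}")
```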


Behavior Research Methods | 2011

Computerized assessment of syntactic complexity in Alzheimer's disease: A case study of Iris Murdoch's writing

Serguei V. S. Pakhomov; Dustin Alfonso Chacón; Mark Wicklund; Jeanette K. Gundel

Currently, the majority of investigations of linguistic manifestations of neurodegenerative disorders such as Alzheimer’s disease are conducted based on manual linguistic analysis. Grammatical complexity is one of the language use characteristics sensitive to the effects of Alzheimer’s disease and is difficult to operationalize and measure using manual approaches. In the current study, we demonstrate the application of computational linguistic methods to automate the analysis of grammatical complexity. We implemented the Computerized Linguistic Analysis System (CLAS) based on the Stanford syntactic parser (Klein and Manning, Pattern Recognition, 38(9), 1407–1419, 2005) for longitudinal analysis of changes in syntactic complexity in language affected by neurodegenerative disorders. We manually validated CLAS scoring and used it to analyze writings of Iris Murdoch, a renowned Irish author diagnosed with Alzheimer’s disease. We found clear patterns of decline in grammatical complexity consistent with previous analyses of Murdoch’s writing conducted by Garrard, Maloney, Hodges, and Patterson (Brain, 128, 250–260, 2005). CLAS is a fully automated system that may be used to derive objective and reproducible measures of syntactic complexity in language production and can be particularly useful in longitudinal studies with large volumes of language samples.
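As a simple illustration, the sketch below computes one coarse complexity proxy, parse-tree height, from bracketed constituency parses of the kind the Stanford parser emits; it is not the CLAS scoring described in the abstract.

```python
# Sketch: mean parse-tree height over a set of bracketed parses. The parses
# below are hand-written examples; in practice they would come from running a
# constituency parser over the writing samples.
from nltk import Tree

parses = [
    "(S (NP (PRP She)) (VP (VBD wrote) (NP (DT a) (NN novel))))",
    "(S (NP (PRP She)) (VP (VBD wrote)))",
]

heights = [Tree.fromstring(p).height() for p in parses]
print("mean parse-tree height:", sum(heights) / len(heights))
```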


Journal of the American Medical Informatics Association | 2010

Evaluation of family history information within clinical documents and adequacy of HL7 clinical statement and clinical genomics family history models for its representation: A case report

Genevieve B. Melton; Nandhini Raman; Elizabeth S. Chen; Indra Neil Sarkar; Serguei V. S. Pakhomov; Robert D. Madoff

Family history information has emerged as an increasingly important tool for clinical care and research. While recent standards provide for structured entry of family history, many clinicians record family history data in text. The authors sought to characterize family history information within clinical documents to assess the adequacy of existing models and create a more comprehensive model for its representation. Models were evaluated on 100 documents containing 238 sentences and 410 statements relevant to family history. Most statements were of family member plus disease or of disease only. Statement coverage was 91%, 77%, and 95% for HL7 Clinical Genomics Family History Model, HL7 Clinical Statement Model, and the newly created Merged Family History Model, respectively. Negation (18%) and inexact family member specification (9.5%) occurred commonly. Overall, both HL7 models could represent most family history statements in clinical reports; however, refinements are needed to represent the full breadth of family history data.
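A minimal sketch of a record type capturing the elements the evaluation highlights (family member, observation, negation); the HL7 models themselves are far richer and are not reproduced here.

```python
# Illustrative structure only; not the HL7 Clinical Genomics or Clinical
# Statement model. Field names are assumptions made for this sketch.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FamilyHistoryStatement:
    family_member: Optional[str]          # e.g. "mother"; None for inexact mentions
    observation: str                      # e.g. "breast cancer"
    negated: bool = False                 # "no family history of ..."
    age_at_onset: Optional[int] = None

statements = [
    FamilyHistoryStatement("mother", "breast cancer", age_at_onset=54),
    FamilyHistoryStatement(None, "colon cancer", negated=True),
]
print(sum(s.negated for s in statements), "negated statement(s)")
```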


Journal of the American Medical Informatics Association | 2008

Automatic Classification of Foot Examination Findings Using Clinical Notes and Machine Learning

Serguei V. S. Pakhomov; Penny L. Hanson; Susan S. Bjornsen; Steven A. Smith

We examine the feasibility of a machine learning approach to identification of foot examination (FE) findings from the unstructured text of clinical reports. A Support Vector Machine (SVM) based system was constructed to process the text of the physical examination sections of in- and out-patient clinical notes to identify whether the structural, neurological, and vascular components of an FE showed normal or abnormal findings or were not assessed. The system was tested on 145 randomly selected patients for each FE component using 10-fold cross-validation. The accuracy was 80%, 87%, and 88% for the structural, neurological, and vascular component classifiers, respectively. Our results indicate that using machine learning to identify FE findings from clinical reports is a viable alternative to manual review and warrants further investigation. This application may improve quality and safety by providing an inexpensive and scalable methodology for quality and risk factor assessments at the point of care.
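In the spirit of the approach described above, the sketch below wires a TF-IDF representation into a linear SVM with 10-fold cross-validation using scikit-learn; the snippets and labels are tiny illustrative stand-ins, not the study's clinical notes.

```python
# Sketch: text classification of physical-exam snippets with a linear SVM and
# 10-fold cross-validation. Data here is made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "monofilament sensation intact bilaterally",
    "diminished sensation to monofilament on left foot",
    "pedal pulses not assessed",
] * 10                                    # repeat so 10-fold CV has enough samples
labels = ["normal", "abnormal", "not_assessed"] * 10

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(clf, texts, labels, cv=10)
print("mean 10-fold accuracy:", scores.mean())
```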


Meeting of the Association for Computational Linguistics | 2005

High Throughput Modularized NLP System for Clinical Text

Serguei V. S. Pakhomov; James D. Buntrock; Patrick H. Duffy

This paper presents the results of the development of a high-throughput, real-time, modularized text analysis and information retrieval system that identifies clinically relevant entities in clinical notes, maps the entities to several standardized nomenclatures, and makes them available for subsequent information retrieval and data mining. The performance of the system was validated on a small collection of 351 documents partitioned into 4 query topics and manually examined by 3 physicians and 3 nurse abstractors for relevance to the query topics. We find that simple key phrase searching results in 73% recall and 77% precision. A combination of NLP approaches to indexing improves the recall to 92%, while lowering the precision to 67%.
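For reference, recall and precision figures like those above reduce to simple set arithmetic over retrieved and judged-relevant documents, as in this sketch with made-up document IDs.

```python
# Sketch: recall and precision of a retrieval run against manual relevance
# judgments. Document IDs are illustrative only.
relevant = {"doc1", "doc2", "doc3", "doc4"}     # judged relevant by reviewers
retrieved = {"doc2", "doc3", "doc4", "doc9"}    # returned by key-phrase search

true_positives = relevant & retrieved
recall = len(true_positives) / len(relevant)
precision = len(true_positives) / len(retrieved)
print(f"recall={recall:.2f} precision={precision:.2f}")
```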


Medical Decision Making | 2008

Quality Performance Measurement Using the Text of Electronic Medical Records

Serguei V. S. Pakhomov; Susan S. Bjornsen; Penny L. Hanson; Steven A. Smith

Background. Annual foot examinations (FE) constitute a critical component of care for diabetes. Documented evidence of FE is central to quality-of-care reporting; however, manual abstraction of electronic medical records (EMR) is slow, expensive, and subject to error. The objective of this study was to test the hypothesis that text mining of the EMR results in ascertaining FE evidence with accuracy comparable to manual abstraction. Methods. The text of inpatient and outpatient clinical reports was searched with natural-language (NL) queries for evidence of neurological, vascular, and structural components of FE. A manual medical records audit was used for validation. The reference standard consisted of 3 independent sets used for development (n=200), validation (n=118), and reliability (n=80). Results. The reliability of manual auditing was 91% (95% confidence interval [CI] = 85–97) and was determined by comparing the results of an additional audit to the original audit using the records in the reliability set. The accuracy of the NL query requiring 1 of 3 FE components was 89% (95% CI = 83–95). The accuracy of the query requiring any 2 of 3 components was 88% (95% CI = 82–94). The accuracy of the query requiring all 3 components was 75% (95% CI = 68–83). Conclusions. The free text of the EMR is a viable source of information necessary for quality of health care reporting on the evidence of FE for patients with diabetes. The low-cost methodology is scalable to monitoring large numbers of patients and can be used to streamline quality-of-care reporting.
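A minimal sketch of the query idea, using hypothetical regular expressions as stand-ins for the natural-language queries and a threshold on how many FE components must be found; the study's actual queries are not reproduced.

```python
# Sketch: check a note for evidence of the neurological, vascular, and
# structural FE components and apply a 1-of-3 / 2-of-3 / 3-of-3 threshold.
# The patterns are illustrative assumptions, not the study's queries.
import re

component_patterns = {
    "neurological": re.compile(r"monofilament|sensation|neuropath", re.I),
    "vascular": re.compile(r"pedal pulse|dorsalis pedis|posterior tibial", re.I),
    "structural": re.compile(r"callus|deformity|ulcer|skin (intact|breakdown)", re.I),
}

def foot_exam_documented(note_text: str, min_components: int = 2) -> bool:
    found = sum(bool(p.search(note_text)) for p in component_patterns.values())
    return found >= min_components

note = "Monofilament sensation intact. Pedal pulses 2+ bilaterally. Skin intact."
print(foot_exam_documented(note, min_components=3))
```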


Bioinformatics | 2016

Corpus domain effects on distributional semantic modeling of medical terms

Serguei V. S. Pakhomov; Gregory P. Finley; Reed McEwan; Yan Wang; Genevieve B. Melton

MOTIVATION Automatically quantifying semantic similarity and relatedness between clinical terms is an important aspect of text mining from electronic health records, which are increasingly recognized as valuable sources of phenotypic information for clinical genomics and bioinformatics research. A key obstacle to development of semantic relatedness measures is the limited availability of large quantities of clinical text to researchers and developers outside of major medical centers. Text from general English and biomedical literature is freely available; however, its validity as a substitute for the clinical domain in representing the semantics of clinical terms remains to be demonstrated. RESULTS We constructed neural network representations of clinical terms found in a publicly available benchmark dataset manually labeled for semantic similarity and relatedness. Similarity and relatedness measures computed from text corpora in three domains (Clinical Notes, PubMed Central articles, and Wikipedia) were compared using the benchmark as reference. We found that measures computed from the full text of biomedical articles in the PubMed Central repository (rho = 0.62 for similarity and 0.58 for relatedness) are on par with measures computed from clinical reports (rho = 0.60 for similarity and 0.57 for relatedness). We also evaluated the use of neural network based relatedness measures for query expansion in a clinical document retrieval task and a biomedical term word sense disambiguation task. We found that, with some limitations, biomedical articles may be used in lieu of clinical reports to represent the semantics of clinical terms, and that distributional semantic methods are useful for clinical and biomedical natural language processing applications. AVAILABILITY AND IMPLEMENTATION The software and reference standards used in this study to evaluate semantic similarity and relatedness measures are publicly available as detailed in the article. CONTACT [email protected] SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
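A minimal sketch of the evaluation setup, assuming gensim for the word vectors and SciPy for the rank correlation; the corpus and benchmark pairs below are tiny illustrative stand-ins for the corpora and reference standard described above.

```python
# Sketch: train word vectors on a corpus, then compare model-derived
# similarities against human ratings with Spearman correlation.
from gensim.models import Word2Vec
from scipy.stats import spearmanr

sentences = [
    "myocardial infarction causes chest pain".split(),
    "heart attack is another name for myocardial infarction".split(),
    "metformin treats type two diabetes".split(),
] * 50                                       # repeated toy corpus, illustration only

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=1)

# Hypothetical benchmark: (term1, term2, human rating).
benchmark = [("infarction", "attack", 3.8),
             ("metformin", "diabetes", 2.9),
             ("chest", "diabetes", 0.5)]
human = [score for _, _, score in benchmark]
predicted = [model.wv.similarity(w1, w2) for w1, w2, _ in benchmark]

rho, _ = spearmanr(human, predicted)
print("Spearman rho:", rho)
```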

Collaboration


Dive into Serguei V. S. Pakhomov's collaborations.

Top Co-Authors

Ted Pedersen, University of Minnesota
Rui Zhang, University of Minnesota
Yan Wang, University of Minnesota
Ying Liu, University of Texas at Dallas