Samah Jamal Fodeh
Yale University
Publications
Featured research published by Samah Jamal Fodeh.
Knowledge and Information Systems | 2011
Samah Jamal Fodeh; Bill Punch; Pang Ning Tan
Incorporating semantic knowledge from an ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. We show in this paper that an ontology can be used to greatly reduce the number of features needed to do document clustering. Our hypothesis is that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for document cluster formation. We show that nouns can be efficiently identified in documents and that this alone provides improved clustering. We next show the importance of the polysemous and synonymous nouns in clustering and develop a unique approach that allows us to measure the information gain in disambiguating these nouns in an unsupervised learning setting. In so doing, we can identify a core subset of semantic features that represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 90% or more and still produce clusters that capture the main themes in a text corpus.
Journal of Biomedical Informatics | 2013
Samah Jamal Fodeh; Cynthia Brandt; Thaibinh Luong; Ali Haddad; Martin H. Schultz; Terrence E. Murphy; Michael Krauthammer
The rapidly growing availability of electronic biomedical data has increased the need for innovative data mining methods. Clustering in particular has been an active area of research in many different application areas, with existing clustering algorithms mostly focusing on one modality or representation of the data. Complementary ensemble clustering (CEC) is a recently introduced framework in which K-means is applied to a weighted, linear combination of the co-association matrices obtained from separate ensemble clustering of different data modalities. The strength of CEC is its extraction of information from multiple aspects of the data when forming the final clusters. This study assesses the utility of CEC in biomedical data, which often have multiple data modalities, e.g., text and images, by applying CEC to two distinct biomedical datasets (PubMed images and radiology reports) that each have two modalities. Compared with five different clustering approaches based on the K-means algorithm, CEC exhibited equal or better performance in the metrics of micro-averaged precision and Normalized Mutual Information across both datasets. The reference methods included clustering of single modalities as well as ensemble clustering of separate and merged data modalities. Our experimental results suggest that CEC is equivalent to or more efficient than comparable K-means-based clustering methods using either single or merged data modalities.
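The co-association step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy label lists, and the choice of weights `weight` and `1 - weight` for the two modalities are assumptions made for illustration, and the final K-means pass over the combined matrix is omitted.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def coassociation(runs: Sequence[Sequence[int]]) -> Matrix:
    """Fraction of clustering runs in which items i and j share a cluster."""
    n = len(runs[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    return [[value / len(runs) for value in row] for row in counts]


def combine(a: Matrix, b: Matrix, weight: float) -> Matrix:
    """Weighted linear combination: weight * a + (1 - weight) * b (assumed form)."""
    return [
        [weight * x + (1.0 - weight) * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


# Hypothetical cluster labels from two clustering runs per modality, four items.
text_runs = [[0, 0, 1, 1], [1, 1, 0, 0]]
image_runs = [[0, 0, 0, 1], [0, 1, 1, 1]]

text_ca = coassociation(text_runs)
image_ca = coassociation(image_runs)
combined = combine(text_ca, image_ca, weight=0.5)
```

In this toy data, items 0 and 1 always co-cluster in the text modality but only half the time in the image modality, so their combined entry falls between the two single-modality values.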
Journal of Pain and Symptom Management | 2013
Samah Jamal Fodeh; Mark Lazenby; Mei Bai; Elizabeth Ercolano; Terrence E. Murphy; Ruth McCorkle
CONTEXT Symptoms and subsequent functional impairment have been associated with the biological processes of disease, including the interaction between disease and treatment in a measurement model of symptoms. However, hitherto cluster analysis has primarily focused on symptoms. OBJECTIVES This study among patients within 100 days of diagnosis with advanced cancer explored whether self-reported physical symptoms and functional impairments formed clusters at the time of diagnosis. METHODS We applied cluster analysis to self-reported symptoms and activities of daily living of 111 patients newly diagnosed with advanced gastrointestinal (GI), gynecological, head and neck, and lung cancers. Based on content expert evaluations, the best techniques and variables were identified, yielding the best solution. RESULTS The best cluster solution used a K-means algorithm and cosine similarity and yielded five clusters of physical as well as emotional symptoms and functional impairments. Cancer site formed the predominant organizing principle of composition for each cluster. The top five symptoms and functional impairments in each cluster were Cluster 1 (GI): outlook, insomnia, appearance, concentration, and eating/feeding; Cluster 2 (GI): appetite, bowel, insomnia, eating/feeding, and appearance; Cluster 3 (gynecological): nausea, insomnia, eating/feeding, concentration, and pain; Cluster 4 (head and neck): dressing, eating/feeding, bathing, toileting, and walking; and Cluster 5 (lung): cough, walking, eating/feeding, breathing, and insomnia. CONCLUSION Functional impairments in patients newly diagnosed with late-stage cancers behave as symptoms during the diagnostic phase. Health care providers need to expand their assessments to include both symptoms and functional impairments. Early recognition of functional changes may accelerate diagnosis at an earlier cancer stage.
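K-means with cosine similarity, as named in the results above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: vectors are scaled to unit length so that the dot product equals the cosine, the first `k` points serve as a naive deterministic initialization, and the toy score vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    data = [normalize(p) for p in points]
    centroids = data[:k]  # assumption: naive init from the first k points
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: cosine(p, centroids[c])) for p in data]
        # Recompute each centroid as the re-normalized mean of its members.
        for c in range(k):
            members = [p for p, label in zip(data, labels) if label == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


# Hypothetical two-dimensional score vectors for four patients.
scores = [[1.0, 0.1], [0.1, 1.0], [2.0, 0.3], [0.2, 3.0]]
labels = cosine_kmeans(scores, k=2)
```

Because cosine similarity ignores vector length, patients with the same profile shape but different overall severity land in the same cluster, which is the main design difference from Euclidean K-means.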
Journal of the American Medical Informatics Association | 2016
Jonathan Bates; Samah Jamal Fodeh; Cynthia Brandt; Julie A. Womack
OBJECTIVE To identify patients in a human immunodeficiency virus (HIV) study cohort who have fallen by applying supervised machine learning methods to radiology reports of the cohort. METHODS We used the Veterans Aging Cohort Study Virtual Cohort (VACS-VC), an electronic health record-based cohort of 146 530 veterans for whom radiology reports were available (N=2 977 739). We created a reference standard of radiology reports, represented each report by a feature set of words and Unified Medical Language System concepts, and then developed several support vector machine (SVM) classifiers for falls. We compared mutual information (MI) ranking and embedded feature selection approaches. The SVM classifier with MI feature selection was chosen to classify all radiology reports in VACS-VC. RESULTS Our SVM classifier with MI feature selection achieved an area under the curve score of 97.04 on the test set. When applied to all the radiology reports in VACS-VC, 80 416 of these reports were classified as positive for a fall. Of these, 11 484 were associated with a fall-related external cause of injury code (E-code) and 68 932 were not, corresponding to 29 280 patients with potential fall-related injuries who could not have been found using E-codes. DISCUSSION Feature selection was crucial to improving the classifier's performance. Feature selection with MI allowed us to select the number of discriminative features to use for classification, in contrast to the embedded feature selection method, in which the number of features is chosen automatically. CONCLUSION Machine learning is an effective method of identifying patients who have suffered a fall. The development of this classifier supplements the clinical researcher's toolkit and reduces dependence on under-coded structured electronic health record data.
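The MI ranking step mentioned in the methods can be sketched as below. This is an illustrative sketch only: the tiny report sets, the labels, and the function names are hypothetical, features are treated as binary term presence, and the SVM trained on the selected features is not shown.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set


def mutual_information(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Mutual information in bits between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, l), count in joint.items():
        # p(f, l) * log2( p(f, l) / (p(f) * p(l)) ), written with raw counts.
        mi += (count / n) * math.log2(count * n / (feature_counts[f] * label_counts[l]))
    return mi


def rank_terms(docs: List[Set[str]], labels: List[int]) -> Dict[str, float]:
    """Score every vocabulary term by MI between its presence and the label."""
    vocab = sorted(set().union(*docs))
    return {t: mutual_information([int(t in d) for d in docs], labels) for t in vocab}


# Hypothetical reports as term sets; label 1 marks a report describing a fall.
reports = [{"fell", "fracture"}, {"fell", "hip"}, {"chest", "clear"}, {"chest", "fracture"}]
fall_labels = [1, 1, 0, 0]

scores = rank_terms(reports, fall_labels)
# Keep the two highest-scoring terms; ties are broken alphabetically.
top_terms = sorted(scores, key=lambda t: (-scores[t], t))[:2]
```

Choosing how many ranked terms to keep is the explicit knob that the abstract contrasts with embedded feature selection, where the number of features is determined automatically.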
International Conference on Data Mining | 2015
Samah Jamal Fodeh; Andrea L. Benin; Perry L. Miller; Kyle Lee; Michele Koss; Cynthia Brandt
Timely reporting and analysis of adverse events and medical errors is critical to driving forward programs in patient safety. However, due to the large numbers of event reports accumulating daily in health institutions, manually finding and labeling certain types of errors or events is becoming increasingly challenging. We propose to automatically classify/label event reports via semi-supervised learning, which utilizes labeled as well as unlabeled event reports to complete the classification task. We focused on classifying two types of event reports: patient mismatches and weight errors. We downloaded 9405 reports from the Connecticut Children's Medical Center reporting system. We generated two samples of labeled and unlabeled reports containing 3155 and 255 for the patient mismatch and the weight error use cases respectively. We developed a feature-based Laplacian Support Vector Machine (FS-LapSVM), a hybrid framework that combines feature selection with the Laplacian Support Vector Machine classifier (LapSVM). FS-LapSVM showed superior performance in finding patient weight error reports compared to LapSVM. The FS-LapSVM classifier also outperformed standard LapSVM in classifying patient mismatch reports across all metrics.
Experimental Aging Research | 2015
Samah Jamal Fodeh; Mark Trentalange; Heather G. Allore; Thomas M. Gill; Cynthia Brandt; Terrence E. Murphy
Background/Study Context: The potential of cluster analysis (CA) as a baseline predictor of multivariate gerontologic outcomes over a long period of time has not been previously demonstrated. Methods: Restricting candidate variables to a small group of established predictors of deleterious gerontologic outcomes, various CA methods were applied to baseline values from 754 nondisabled, community-living persons, aged 70 years or older. The best cluster solution yielded at baseline was subsequently used as a fixed explanatory variable in time-to-event models of the first occurrence of the following outcomes: any disability in four activities of daily living, any disability in four mobility measures, and death. Each outcome was recorded through a maximum of 129 months or death. Associations between baseline ordinal cluster level and first occurrence of all three outcomes were modeled over a 10-year period with proportional hazards regression and compared with the associations yielded by the analogous latent class analysis (LCA) solution. Results: The final cluster-defining variables were continuous measures of cognitive status and depressive symptoms, and dichotomous indicators of slow gait and exhaustion. The best solution yielded by baseline values of these variables was obtained with a K-means algorithm and cosine similarity and consisted of three clusters representing increasing levels of impairment. After adjustment for age, sex, ethnic group, and number of chronic conditions, baseline ordinal cluster level demonstrated significantly positive associations with all three outcomes over a 10-year period that were equivalent to those from the corresponding LCA solution. Conclusion: These findings suggest that baseline clusters based on previously established explanatory variables have potential to predict multivariate gerontologic outcomes over a long period of time.
Journal of Biomedical Informatics | 2018
Samah Jamal Fodeh; Aditya Tiwari
Gene Ontology (GO) provides a representation of terms and categories used to describe genes and their molecular functions, cellular components and biological processes. GO has been the standard for describing the functions of specific genes in different model organisms. GO annotation, or the tagging of genes with GO terms, has mostly been a manual and time-consuming curation process. Although many automated approaches have been proposed for annotation, few have utilized knowledge available in the literature. In this manuscript, we describe the development and evaluation of an innovative predictive system to automatically assign molecular functions (GO terms) to genes using the biomedical literature. Because genes could be associated with multiple molecular functions, we posed the GO molecular function annotation as a multi-label classification problem with several classes. We used non-negative matrix factorization (NMF) for feature reduction and then classified the genes. To address the multi-label aspect of the data, we used the binary-relevance method. Although we experimented with several classifiers, the combination of binary-relevance and K-nearest neighbor (KNN) classifier performed best. Our evaluation on the UniProtKB/Swiss-Prot dataset showed a best performance of 0.84 in terms of F1-measure.
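The binary-relevance idea paired with KNN can be sketched as follows: one independent yes/no KNN vote per label, with the positive labels collected into the predicted set. This is a hedged illustration rather than the paper's system. The two-dimensional vectors stand in for NMF-reduced features, and the label names, data, and function names are hypothetical.

```python
import math
from typing import List, Sequence, Set


def knn_vote(train_x: Sequence[Sequence[float]], train_y: Sequence[int],
             query: Sequence[float], k: int) -> int:
    """Binary KNN: return 1 if a strict majority of the k nearest items are positive."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))[:k]
    positives = sum(train_y[i] for i in nearest)
    return 1 if positives * 2 > k else 0


def binary_relevance_predict(train_x: Sequence[Sequence[float]],
                             train_labels: Sequence[Set[str]],
                             query: Sequence[float],
                             label_names: Sequence[str],
                             k: int = 3) -> Set[str]:
    """Run one independent binary KNN per label and collect the positive labels."""
    predicted = set()
    for label in label_names:
        y = [1 if label in labels else 0 for labels in train_labels]
        if knn_vote(train_x, y, query, k):
            predicted.add(label)
    return predicted


# Hypothetical reduced feature vectors for six genes and their label sets.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
train_labels = [
    {"binding"}, {"binding", "kinase activity"}, {"binding", "kinase activity"},
    {"transporter activity"}, {"transporter activity"}, {"transporter activity", "binding"},
]
label_names = ["binding", "kinase activity", "transporter activity"]

pred_a = binary_relevance_predict(train_x, train_labels, [0.82, 0.18], label_names)
pred_b = binary_relevance_predict(train_x, train_labels, [0.12, 0.88], label_names)
```

Binary relevance treats labels as independent, which keeps the approach simple but ignores any correlation between labels.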
BMC Bioinformatics | 2018
Reilly N. Grant; David Kucher; Ana M. León; Jonathan Gemmell; Daniela Stan Raicu; Samah Jamal Fodeh
Background: Suicide is an alarming public health problem accounting for a considerable number of deaths each year worldwide. Many more individuals contemplate suicide. Understanding the attributes, characteristics, and exposures correlated with suicide remains an urgent and significant problem. As social networking sites have become more common, users have adopted these sites to talk about intensely personal topics, among them their thoughts about suicide. Such data has previously been evaluated by analyzing the language features of social media posts and using factors derived by domain experts to identify at-risk users. Results: In this work, we automatically extract informal latent recurring topics of suicidal ideation found in social media posts. Our evaluation demonstrates that we are able to automatically reproduce many of the expertly determined risk factors for suicide. Moreover, we identify many informal latent topics related to suicide ideation such as concerns over health, work, self-image, and financial issues. Conclusions: These informal topics can be more specific or more general. Some of our topics express meaningful ideas not contained in the risk factors and some risk factors do not have complementary latent topics. In short, our analysis of the latent topics extracted from social media containing suicidal ideations suggests that users of these systems express ideas that are complementary to the topics defined by experts but differ in their scope, focus, and precision of language.
BMC Health Services Research | 2016
Karen H. Wang; Joseph L. Goulet; Constance Carroll; Melissa Skanderson; Samah Jamal Fodeh; Joseph Erdos; Julie A. Womack; Erica A. Abel; Harini Bathulapalli; Amy C. Justice; Marcella Nunez-Smith; Cynthia Brandt
Background: Healthcare mobility, defined as healthcare utilization in more than one distinct healthcare system, may have detrimental effects on outcomes of care. We characterized healthcare mobility and associated characteristics among a national sample of Veterans. Methods: Using the Veterans Health Administration Electronic Health Record, we conducted a retrospective cohort study to quantify healthcare mobility within a four-year period. We examined the association between sociodemographic and clinical characteristics and healthcare mobility, and characterized possible temporal and geographic patterns of healthcare mobility. Results: Approximately nine percent of the sample were healthcare mobile. Younger Veterans, divorced or separated Veterans, and those with hepatitis C virus and psychiatric disorders were more likely to be healthcare mobile. We demonstrated two possible patterns of healthcare mobility, related to specialty care and lifestyle, in which Veterans repeatedly utilized two different healthcare systems. Conclusions: Healthcare mobility is associated with young age, marital status changes, and also diseases requiring intensive management. This type of mobility may affect disease prevention and management and has implications for healthcare systems that seek to improve population health.
International Conference on Data Mining | 2013
Samah Jamal Fodeh; Maryan Zirkle; Dezon Finch; Cynthia Brandt; Joseph Erdos; Ruth Reeves
In this paper we introduce a new framework called MedCat to delineate and demonstrate an approach for projecting representations of concept-derived content in clinical notes into a new categorization space to reduce dimensionality and noise in the data. Constructing the MedCat framework required several steps, including manual annotation, knowledge base expansion using MetaMap, concept category construction, automated annotation using NLP to generate a bag of concepts, and finally concept conversion to higher-level abstracted categories. The framework was applied to Post Traumatic Stress Disorder (PTSD) clinical notes for evaluation. A random sample of PTSD clinical note content was automatically recategorized into six PTSD treatment categories using MedCat. Using existing annotations from PTSD notes that were categorized by content experts into treatment categories as the reference standard, the sensitivity of the framework in detecting the treatment categories was greater than 90%. The results suggest that representations of concept-derived content, when categorized by relevance features, can be used to reliably understand and summarize clinical notes.