Publication


Featured research published by Raphael Cohen.


BMC Bioinformatics | 2013

Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

Raphael Cohen; Michael Elhadad; Noémie Elhadad

Background: The increasing availability of Electronic Health Record (EHR) data, and specifically free-text patient notes, presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entity mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models, or does it introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining?

Results: We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic-concept repetition. We observe redundancy levels of about 30% and non-standard distributions of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results.

Conclusions: Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage the available data in the EHR corpus while avoiding the bias introduced by redundancy.
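
The fingerprinting mitigation is easy to illustrate. Below is a minimal Python sketch that flags a note as redundant when its hashed word n-gram fingerprints overlap heavily with those of an earlier note in the same patient record; the shingle size, the Jaccard threshold, and the MD5 hashing are illustrative assumptions, not the paper's exact algorithm.

```python
import hashlib

def fingerprints(text, n=8):
    """Hashed word n-gram (shingle) fingerprints of a note."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest()
        for i in range(max(len(words) - n + 1, 1))
    }

def deduplicate_record(notes, threshold=0.8):
    """Keep only notes whose fingerprint overlap (Jaccard) with every
    previously kept note in the same patient record stays below
    `threshold`; notes are assumed to be in chronological order."""
    kept, kept_fps = [], []
    for note in notes:
        fp = fingerprints(note)
        redundant = any(
            len(fp & old) / max(len(fp | old), 1) >= threshold
            for old in kept_fps
        )
        if not redundant:
            kept.append(note)
            kept_fps.append(fp)
    return kept
```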


Marine Biotechnology | 2007

Estimating the Efficiency of Fish Cross-Species cDNA Microarray Hybridization

Raphael Cohen; Vered Chalifa-Caspi; Timothy Williams; Meirav Auslander; Stephen G. George; James K. Chipman; Moshe Tom

Using an available cross-species cDNA microarray is advantageous for examining multigene expression patterns in non-model organisms, avoiding the need to construct species-specific arrays. The aim of the present study was to estimate the relative efficiency of cross-species hybridizations across bony fishes using bioinformatics tools. The methodology may also serve as a model for similar evaluations in other taxa. The theoretical evaluation was done by substituting comparative whole-transcriptome sequence similarity information into the thermodynamic hybridization equation. Complementary DNA sequence assemblages of nine fish species belonging to common families or suborders and distributed across the bony fish taxonomic branch were selected for transcriptome-wise comparisons. Actual cross-species hybridizations among fish at different taxonomic distances were used to validate, and eventually to calibrate, the theoretically computed relative efficiencies.
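
The "thermodynamic hybridization equation" is not spelled out in the abstract. A commonly used empirical form for duplex melting temperature, into which transcriptome-wide percent identity can be substituted, looks like the following (textbook parameter values; the paper's exact equation may differ):

```latex
% Empirical melting temperature of a DNA duplex of length L,
% with each 1% of sequence divergence lowering T_m by about 1 degree C.
T_m \approx 81.5^{\circ}\mathrm{C}
    + 16.6\,\log_{10}[\mathrm{Na}^{+}]
    + 0.41\,(\%\mathrm{GC})
    - \frac{500}{L}
    - (100 - \%\mathrm{identity})
```

Under a form like this, cross-species hybridization efficiency falls off roughly linearly with sequence divergence, which is what makes whole-transcriptome similarity comparisons informative for predicting hybridization success.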


PLOS ONE | 2014

Redundancy-Aware Topic Modeling for Patient Record Notes

Raphael Cohen; Iddo Aviram; Michael Elhadad; Noémie Elhadad

The clinical notes in a given patient record contain much redundancy, in large part due to clinicians’ documentation habit of copying from previous notes in the record and pasting into a new note. Previous work has shown that this redundancy has a negative impact on the quality of text mining, and on topic modeling in particular. In this paper we describe a novel variant of Latent Dirichlet Allocation (LDA) topic modeling, Red-LDA, which takes into account the inherent redundancy of patient records when modeling the content of clinical notes. To assess the value of Red-LDA, we experiment with three baselines and our novel redundancy-aware topic modeling method: given a large collection of patient records, (i) apply vanilla LDA to all documents in all input records; (ii) identify and remove all redundancy by choosing a single representative document for each record as input to LDA; (iii) identify and remove all redundant paragraphs in each record, leaving partial, non-redundant documents as input to LDA; and (iv) apply Red-LDA to all documents in all input records. Both quantitative evaluation, carried out through log-likelihood on held-out data and topic coherence of the produced topics, and qualitative assessment of the topics carried out by physicians show that Red-LDA produces superior models to all three baseline strategies. This research contributes to the emerging field of understanding the characteristics of the electronic health record and how to account for them in the framework of data mining. The code for the two redundancy-elimination baselines and for Red-LDA is made publicly available to the community.
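
Baseline (ii) is straightforward to sketch. The Python fragment below keeps a single representative note per record (here simply the last note, an assumption for illustration) and fits vanilla LDA with gensim; Red-LDA itself modifies the topic model to account for redundancy and is not reproduced here.

```python
from gensim import corpora, models

def lda_on_representatives(records, num_topics=50):
    """Baseline (ii): one representative document per patient record,
    then vanilla LDA. Each record is a chronologically ordered list of
    note strings; taking the last note is an illustrative choice."""
    docs = [record[-1].lower().split() for record in records]
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(doc) for doc in docs]
    return models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)
```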


Human Mutation | 2010

Syndrome to Gene (S2G): In-Silico Identification of Candidate Genes for Human Diseases

Avitan Gefen; Raphael Cohen; Ohad S. Birk

The identification of genomic loci associated with human genetic syndromes has been significantly facilitated by the generation of high-density SNP arrays. However, optimal selection of candidate genes from within such loci is still a tedious, labor-intensive bottleneck. Syndrome to Gene (S2G) is based on novel algorithms that allow an efficient search for candidate genes in a genomic locus, using known genes whose defects cause phenotypically similar syndromes. S2G (http://fohs.bgu.ac.il/s2g/index.html) includes two components: a phenotype search engine based on Online Mendelian Inheritance in Man (OMIM) that alleviates many of the problems of the existing OMIM search engine (negation phrases, overlapping terms, etc.), and a gene-prioritization engine that uses a novel algorithm to integrate information from 18 databases. When the detailed phenotype of a syndrome is entered into the web-based software, S2G offers a complete, improved search of the OMIM database for similar syndromes. The software then prioritizes a list of genes from within a genomic locus, based on their association with genes whose defects are known to underlie similar clinical syndromes. We demonstrate that in all 30 cases of novel disease genes identified in the past year, the disease gene was within the top 20% of candidate genes predicted by S2G, and in most cases within the top 10%. Thus, S2G provides clinicians with an efficient tool for diagnosis, and researchers with a candidate-gene prediction tool based on phenotypic data and a wide range of gene data resources. S2G can also serve in studies of polygenic diseases, and in finding interacting molecules for any gene of choice. Hum Mutat 30:1–8, 2010.
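
The prioritization idea can be sketched in a few lines: score every gene in the mapped locus by its association with the known genes of phenotypically similar syndromes, weighted by syndrome similarity. All names and the max-scoring rule below are hypothetical; the actual S2G algorithm integrates 18 databases with its own weighting.

```python
def prioritize_locus(locus_genes, similar_syndromes, gene_assoc):
    """Hypothetical S2G-style ranking.
    locus_genes:       genes inside the mapped genomic interval
    similar_syndromes: [(known_disease_gene, syndrome_similarity), ...]
    gene_assoc:        gene_assoc[a][b] in [0, 1], association strength
                       between two genes, integrated from several sources
    """
    def score(gene):
        return max(
            (sim * gene_assoc.get(gene, {}).get(known, 0.0)
             for known, sim in similar_syndromes),
            default=0.0,
        )
    return sorted(locus_genes, key=score, reverse=True)
```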


BMC Bioinformatics | 2011

CSI-OMIM: Clinical Synopsis Search in OMIM

Raphael Cohen; Avitan Gefen; Michael Elhadad; Ohad S. Birk

Background: The OMIM database is a tool used daily by geneticists. Syndrome pages include a Clinical Synopsis section containing a list of known phenotypes comprising a clinical syndrome. The phenotypes are in free text, and different phrases are often used to describe the same phenotype, the differences originating in spelling variations or typing errors, varying sentence structures, and terminological variants. These variations hinder searching for syndromes or using the large amount of phenotypic information for research purposes. In addition, negation forms create false positives when searching the textual description of phenotypes and induce noise in text-mining applications.

Description: Our method allows efficient and complete search of OMIM phenotypes as well as improved data mining of the OMIM phenome. Applying natural language processing, each phrase is tagged with additional semantic information using UMLS and MeSH. Using a grammar-based method, annotated phrases are clustered into groups denoting similar phenotypes. These groups of synonymous expressions enable precise search, as query terms can be matched with the many variations that appear in OMIM, while avoiding over-matching expressions that include the query term in a negative context. On the basis of these clusters, we computed pair-wise similarity among syndromes in OMIM. Using this new similarity measure, we identified 79,770 new connections between syndromes, an average of 16 new connections per syndrome. Our project is Web-based and available at http://fohs.bgu.ac.il/s2g/csiomim

Conclusions: The resulting enhanced search functionality provides clinicians with an efficient tool for diagnosis. This search application is also used for finding similar syndromes for the candidate gene prioritization tool S2G. The enhanced OMIM database we produced can be further used for bioinformatics purposes such as linking phenotypes and genes based on syndrome similarities and the known genes in Morbidmap.
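
The abstract does not give the exact pairwise measure; a minimal stand-in is set overlap between the phenotype clusters assigned to two syndromes, e.g.:

```python
def syndrome_similarity(clusters_a, clusters_b):
    """Jaccard overlap between the sets of phenotype-cluster IDs of two
    OMIM syndromes; a hypothetical stand-in for the paper's measure."""
    a, b = set(clusters_a), set(clusters_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```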


Nucleic Acids Research | 2011

CHILD: a new tool for detecting low-abundance insertions and deletions in standard sequence traces

Ilia Zhidkov; Raphael Cohen; Nophar Geifman; Dan Mishmar; Eitan Rubin

Several methods have been proposed for detecting insertions/deletions (indels) from chromatograms generated by Sanger sequencing. However, most such methods are unsuitable when the mutated and normal variants occur at unequal ratios, as is expected in cancer, in organellar DNA, or with alternatively spliced RNAs. In addition, the current methods do not provide robust estimates of the statistical confidence of their results, and their sensitivity has not been rigorously evaluated. Here, we present CHILD, a tool specifically designed for indel detection in mixtures where one variant is rare. CHILD makes use of standard sequence alignment statistics to evaluate the significance of the results. The sensitivity of CHILD was tested by sequencing controlled mixtures of deleted and undeleted plasmids at various ratios. Our results indicate that CHILD can identify deleted molecules present as just 5% of the mixture. Notably, the results were plasmid- and primer-specific; for some primers and/or plasmids, the deleted molecule was only detected when it comprised 10% or more of the mixture. The false positive rate was estimated to be lower than 0.4%. CHILD was implemented as a user-oriented web site, providing a sensitive and experimentally validated method for the detection of rare indel-carrying molecules in common Sanger sequence reads.
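
One way to picture mixed-trace indel detection: downstream of an indel in the minority molecule, the secondary base calls should match the primary sequence at a fixed offset. The sketch below scores candidate offsets by match fraction; it is an illustration only, as CHILD's actual method assigns significance using standard sequence alignment statistics.

```python
def best_indel_shift(primary, secondary, breakpoint, max_shift=5):
    """Return the offset at which the secondary base calls downstream of
    `breakpoint` best match the primary sequence. A nonzero best offset
    with a high match fraction suggests a low-abundance indel of that
    size. Hypothetical illustration, not CHILD's scoring."""
    tail = secondary[breakpoint:]

    def match_fraction(shift):
        start = max(breakpoint + shift, 0)
        ref = primary[start:start + len(tail)]
        pairs = list(zip(tail, ref))
        return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0

    return max(range(-max_shift, max_shift + 1), key=match_fraction)
```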


Meeting of the Association for Computational Linguistics | 2014

Query-Chain Focused Summarization

Tal Baumel; Raphael Cohen; Michael Elhadad

Update summarization is a form of multi-document summarization where a document set must be summarized in the context of other documents assumed to be known. Efficient update summarization must focus on identifying new information and avoiding repetition of known information. In query-focused summarization, the task is to produce a summary as an answer to a given query. We introduce a new task, Query-Chain Summarization, which combines aspects of the two previous tasks: starting from a given document set, increasingly specific queries are considered, and a new summary is produced at each step. This process models exploratory search: a user explores a new topic by submitting a sequence of queries, inspecting a summary of the result set, and phrasing a new query at each step. We present a novel dataset comprising 22 query-chain sessions of length up to 3, with 3 matching human summaries each, in the consumer-health domain. Our analysis demonstrates that summaries produced in the context of such an exploratory process are different from informative summaries. We present an algorithm for Query-Chain Summarization based on a new LDA topic model variant. Evaluation indicates the algorithm improves on strong baselines.
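
The task setup suggests a simple greedy reference point: at each query in the chain, prefer sentences that overlap the current query while penalizing words already shown in earlier summaries. This word-overlap greedy is only a stand-in; the paper's algorithm is based on an LDA topic-model variant.

```python
def query_chain_summaries(sentences, queries, budget=3):
    """Greedy sketch of the Query-Chain setting: per query, pick the
    `budget` sentences with the best query overlap minus a penalty for
    content already shown. The 0.5 penalty weight is arbitrary."""
    shown = set()
    summaries = []
    for query in queries:
        q = set(query.lower().split())

        def gain(sent):
            w = set(sent.lower().split())
            return len(w & q) - 0.5 * len(w & shown)

        picked = sorted(sentences, key=gain, reverse=True)[:budget]
        summaries.append(picked)
        for s in picked:
            shown |= set(s.lower().split())
    return summaries
```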


PLOS ONE | 2013

Analysis of Free Online Physician Advice Services

Raphael Cohen; Michael Elhadad; Ohad S. Birk

Background: Online consumer health websites are a major source of information for patients worldwide. We focus on another modality, online physician advice. We aim to evaluate and compare the freely available online expert physicians’ advice in different countries, its scope, and the type of content provided.

Setting: Using automated methods for information retrieval and analysis, we compared consumer health portals from the US, Canada, the UK, and Israel (WebMD, NetDoctor, AskTheDoctor, and BeOK). The evaluated content was generated between 2002 and 2011.

Results: We analyzed the different sites, looking at the distribution of questions across health topics, answer lengths, and content type. Answers could be categorized into longer, broad-educational answers versus shorter, patient-specific ones, with different physicians having personal preferences as to answer type. The Israeli website BeOK, providing ten times as many answers as the other three health portals, supplied answers that are shorter on average than those on the other websites. Response times on these sites can be rapid, with 32% of the WebMD answers and 64% of the BeOK answers provided in less than 24 hours. The voluntary contribution model used by BeOK and WebMD enables generation of large numbers of physician expert answers at low cost, providing 50,000 and 3,500 answers per year, respectively.

Conclusions: Unlike health information in online databases or advice and support in patient forums, online physician advice provides qualified specialists’ responses directly relevant to the questions asked. Our analysis showed that high numbers of expert answers can be generated in a timely fashion using a voluntary model. The length of answers varied significantly between the internet sites. Longer answers were associated with educational content, while short answers were associated with patient-specific content. Standard site-specific guidelines for expert answers would allow sites to favor either more desirable content (educational) or better throughput (patient-specific).


Age | 2013

Redefining meaningful age groups in the context of disease

Nophar Geifman; Raphael Cohen; Eitan Rubin

Age is an important factor when considering phenotypic changes in health and disease. Currently, the use of age information in medicine is somewhat simplistic, with ages commonly being grouped into a small number of crude ranges reflecting the major stages of development and aging, such as childhood or adolescence. Here, we investigate the possibility of redefining age groups using the recently developed Age-Phenome Knowledge-base (APK) that holds over 35,000 literature-derived entries describing relationships between age and phenotype. Clustering of APK data suggests 13 new, partially overlapping, age groups. The diseases that define these groups suggest that the proposed divisions are biologically meaningful. We further show that the number of different age ranges that should be considered depends on the type of disease being evaluated. This finding was further strengthened by similar results obtained from clinical blood measurement data. The grouping of diseases that share a similar pattern of disease-related reports directly mirrors, in some cases, medical knowledge of disease–age relationships. In other cases, our results may be used to generate new and reasonable hypotheses regarding links between diseases.
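
As a rough picture of the clustering step: represent each disease as a normalized vector of report counts per age, then group diseases with similar age profiles. Plain k-means, shown below, yields disjoint groups, whereas the paper's 13 groups are partially overlapping, so this is only a stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

def age_profile_groups(counts, n_groups=13):
    """`counts` is a (diseases x ages) array of literature report counts;
    rows are normalized to age profiles and clustered. Assumes every
    disease has at least one report."""
    profiles = counts / counts.sum(axis=1, keepdims=True)
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(profiles)
```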


Mechanisms of Ageing and Development | 2007

Longevity network: Construction and implications

Arie Budovsky; Amir Abramovich; Raphael Cohen; Vered Chalifa-Caspi; Vadim E. Fraifeld

Collaboration


Dive into Raphael Cohen's collaboration.

Top Co-Authors

Michael Elhadad, Ben-Gurion University of the Negev
Ohad S. Birk, Ben-Gurion University of the Negev
Tal Baumel, Ben-Gurion University of the Negev
Avitan Gefen, Ben-Gurion University of the Negev
Dana Elias, Weizmann Institute of Science
Eitan Rubin, Ben-Gurion University of the Negev
Nophar Geifman, Ben-Gurion University of the Negev
Vered Chalifa-Caspi, Ben-Gurion University of the Negev
Amir Abramovich, Ben-Gurion University of the Negev