Publication


Featured research published by Davy Weissenbacher.


Bioinformatics | 2015

Knowledge-driven geospatial location resolution for phylogeographic models of virus migration

Davy Weissenbacher; Tasnia Tahsin; Rachel Beard; Mari Figaro; Robert Rivera; Matthew Scotch; Graciela Gonzalez

Summary: Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles. Motivation: In this article, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a ‘metadata heuristic’). Results: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences. Contact: [email protected]
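The abstract names three disambiguation heuristics but does not define them. The sketch below is one plausible reading and not the authors' implementation: the population heuristic picks the most populous candidate, the distance heuristic picks the candidate closest to other locations already resolved in the same article, and the metadata heuristic prefers candidates in the country given by the GenBank record. The `Candidate` type, all function names, and the approximate coordinates and population figures are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One gazetteer entry that an ambiguous place name could refer to."""
    name: str
    country: str
    lat: float
    lon: float
    population: int


def haversine_km(a: Candidate, b: Candidate) -> float:
    """Great-circle distance between two candidates, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a.lat, a.lon, b.lat, b.lon))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def by_population(candidates):
    """Population heuristic (assumed form): pick the most populous candidate."""
    return max(candidates, key=lambda c: c.population)


def by_distance(candidates, anchors):
    """Distance heuristic (assumed form): pick the candidate with the smallest
    total distance to locations already resolved in the same article."""
    return min(candidates, key=lambda c: sum(haversine_km(c, a) for a in anchors))


def by_metadata(candidates, record_country):
    """Metadata heuristic (assumed form): keep candidates in the country named
    in the sequence record, then fall back to the population heuristic."""
    matches = [c for c in candidates if c.country == record_country]
    return by_population(matches or candidates)


# Illustrative data: approximate coordinates and populations.
paris_fr = Candidate("Paris", "FR", 48.8566, 2.3522, 2_100_000)
paris_tx = Candidate("Paris", "US", 33.6609, -95.5555, 25_000)
dallas = Candidate("Dallas", "US", 32.7767, -96.7970, 1_300_000)
candidates = [paris_fr, paris_tx]

print(by_population(candidates).country)
print(by_distance(candidates, [dallas]).country)
print(by_metadata(candidates, "US").country)
```

In this toy case the population heuristic alone selects the French city, while the context-aware heuristics select the Texan one. That is the kind of disagreement a metadata-driven heuristic is meant to settle.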


Journal of the American Medical Informatics Association | 2016

A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records

Tasnia Tahsin; Davy Weissenbacher; Robert Rivera; Rachel Beard; Mari Firago; Garrick Wallstrom; Matthew Scotch; Graciela Gonzalez

OBJECTIVE: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping it to latitude/longitude coordinates using knowledge derived from external geographical databases. MATERIALS AND METHODS: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitude of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. RESULTS: We found the precision, recall, and F-measure of our system for linking GenBank records to the latitude/longitude of their LOIH to be 0.832, 0.967, and 0.894, respectively. DISCUSSION: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. CONCLUSION: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
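As a quick check on how the three reported figures relate, the balanced F-measure is the harmonic mean of precision and recall. The sketch below uses the standard formula; the function name `f_measure` and the `beta` parameter are illustrative choices, not anything taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when nothing was retrieved correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
reported = round(f_measure(0.832, 0.967), 3)
print(reported)  # 0.894
```

The result matches the F-measure stated in the abstract.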


Meeting of the Association for Computational Linguistics | 2014

Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses

Tasnia Tahsin; Robert Rivera; Rachel Beard; Rob Lauder; Davy Weissenbacher; Matthew Scotch; Garrick Wallstrom; Graciela Gonzalez

Zoonotic viruses represent emerging or re-emerging pathogens that pose significant public health threats throughout the world. It is therefore crucial to advance current surveillance mechanisms for these viruses through outlets such as phylogeography. Despite the abundance of zoonotic viral sequence data in publicly available databases such as GenBank, phylogeographic analysis of these viruses is often limited by the lack of adequate geographic metadata. However, many GenBank records include references to articles with more detailed information and automated systems may help extract this information efficiently and effectively. In this paper, we describe our efforts to determine the proportion of GenBank records with “insufficient” geographic metadata for seven well-studied viruses. We also evaluate the performance of four different Named Entity Recognition (NER) systems for automatically extracting related entities using a manually created gold-standard.


North American Chapter of the Association for Computational Linguistics | 2015

DIEGOLab: An Approach for Message-level Sentiment Classification in Twitter

Abeed Sarker; Azadeh Nikfarjam; Davy Weissenbacher; Graciela Gonzalez

We present our supervised sentiment classification system, which competed in SemEval2015 Task 10B: Sentiment Classification in Twitter (Message Polarity Classification). Our system employs a Support Vector Machine classifier trained using a number of features including n-grams, dependency parses, synset expansions, word prior polarities, and embedding clusters. Using weighted Support Vector Machines to address the issue of class imbalance, our system obtains positive class F-scores of 0.701 and 0.656, and negative class F-scores of 0.515 and 0.478 over the training and test sets, respectively.


Intelligent Systems in Molecular Biology | 2018

Deep neural networks and distant supervision for geographic location mention extraction

Arjun Magge; Davy Weissenbacher; Abeed Sarker; Matthew Scotch; Graciela Gonzalez

Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER, achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER's capability to embed external features to further boost the system's performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.


Journal of Biomedical Informatics | 2018

Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter

Ari Z. Klein; Abeed Sarker; Haitao Cai; Davy Weissenbacher; Graciela Gonzalez-Hernandez

BACKGROUND: Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited. OBJECTIVE: The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis. METHODS: To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user's child has a birth defect, and (ii) accessibility to the user's tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter. RESULTS: We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa). Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users who met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95. CONCLUSIONS: Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
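The inter-annotator agreement reported above is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below applies the standard definition to a small invented example; the function name `cohens_kappa` and the toy labels are assumptions for illustration and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab]
                   for lab in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations: "TP" = user's child has a birth defect, "FP" = mere mention.
annotator_1 = ["TP"] * 5 + ["FP"] * 5
annotator_2 = ["TP"] * 4 + ["FP"] + ["FP"] * 4 + ["TP"]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.6
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.5, and kappa is therefore (0.8 - 0.5) / (1 - 0.5) = 0.6.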


Drug Safety | 2018

Pharmacoepidemiologic Evaluation of Birth Defects from Health-Related Postings in Social Media During Pregnancy

Su Golder; Stephanie Chiuve; Davy Weissenbacher; Ari Klein; Karen O’Connor; Martin Bland; Murray Malin; Mondira Bhattacharya; Linda J. Scarazzini; Graciela Gonzalez-Hernandez

Introduction: Adverse effects of medications taken during pregnancy are traditionally studied through post-marketing pregnancy registries, which have limitations. Social media data may be an alternative data source for pregnancy surveillance studies. Objective: The objective of this study was to assess the feasibility of using social media data as an alternative source for pregnancy surveillance for regulatory decision making. Methods: We created an automated method to identify Twitter accounts of pregnant women. We identified 196 pregnant women with a mention of a birth defect in relation to their baby and 196 without a mention of a birth defect in relation to their baby. We extracted information on pregnancy and maternal demographics, medication intake and timing, and birth defects. Results: Although often incomplete, we extracted data for the majority of the pregnancies. Among women who reported birth defects, 35% reported taking one or more medications during pregnancy compared with 17% of controls. After accounting for age, race, and place of residence, a higher medication intake was observed in women who reported birth defects. The rate of birth defects in the pregnancy cohort was lower (0.44%) compared with the rate in the general population (3%). Conclusions: Twitter data capture information on medication intake and birth defects; however, the information obtained cannot replace pregnancy registries at this time. Development of improved methods to automatically extract and annotate social media data may increase their value to support regulatory decision making regarding pregnancy outcomes in women using medications during their pregnancies.


Bioinformatics | 2018

GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records

Tasnia Tahsin; Davy Weissenbacher; Karen O'Connor; Arjun Magge; Matthew Scotch; Graciela Gonzalez-Hernandez

Summary: GeoBoost is a command-line software package developed to address sparse or incomplete metadata in GenBank sequence records that relate to the location of the infected host (LOIH) of viruses. Given a set of GenBank accession numbers corresponding to virus GenBank records, GeoBoost extracts, integrates and normalizes geographic information reflecting the LOIH of the viruses using integrated information from GenBank metadata and related full-text publications. In addition, to facilitate probabilistic geospatial modeling, GeoBoost assigns probability scores for each possible LOIH. Availability and implementation: Binaries and resources required for running GeoBoost are packed into a single zipped file and freely available for download at https://tinyurl.com/geoboost. A video tutorial is included to help users quickly and easily install and run the software. The software is implemented in Java 1.8, and supported on MS Windows and Linux platforms. Contact: [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.


Database | 2017

Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research

Tasnia Tahsin; Davy Weissenbacher; Demetrius Jones-Shargani; Daniel Magee; Matteo Vaiente; Graciela Gonzalez; Matthew Scotch

Abstract: GenBank is a popular National Center for Biotechnology Information (NCBI) database for submission and analysis of DNA sequences for biomedical research. The resource is part of the Entrez environment, which enables cross-linking of concepts and entries in other participating NCBI databases such as Taxonomy, PubMed and Protein. For example, a GenBank record of an influenza A hemagglutinin gene DNA sequence might have a link to the Taxonomy database for the organism, a link to the related article in PubMed (if published) and a link to the Protein entry for the hemagglutinin protein. Despite their importance in biomedical research such as population genetics, phylogeography and public health surveillance, the host and geospatial metadata of genetic sequences in GenBank are not linked to any database. Therefore, to facilitate biomedical research based on georeferenced DNA sequences and/or DNA sequences with normalized host names, we designed and developed a framework that enriches GenBank entries by linking their host metadata to the NCBI Taxonomy database and their geospatial metadata to a comprehensive knowledge base of geographic locations called GeoNames. Here, we introduce a database created through the application of this framework to virus sequences in GenBank, and evaluate our normalization algorithms on a set of manually annotated records pertaining to viruses. Although currently applied to viruses, our framework can be easily extended to other organisms, and we discuss the potential utilization of our resource for biomedical research. Database URL: https://zodo.asu.edu/zoophydb/


North American Chapter of the Association for Computational Linguistics | 2016

Automatic prediction of linguistic decline in writings of subjects with degenerative dementia

Davy Weissenbacher; Travis A. Johnson; Laura Wojtulewicz; Amylou C. Dueck; Dona E.C. Locke; Richard J. Caselli; Graciela Gonzalez

Given the limited success of medication in reversing the effects of Alzheimer’s and other dementias, much neuroscience research has focused on early detection, in order to slow the progress of the disease through different interventions. We propose a Natural Language Processing approach applied to descriptive writing to attempt to discriminate decline due to normal aging from decline due to predementia conditions. Within the context of a longitudinal study on Alzheimer’s disease, we created a unique corpus of 201 descriptions of a control image written by subjects of the study. Our classifier, computing linguistic features, was able to discriminate normal from cognitively impaired patients with an accuracy of 86.1% using lexical and semantic irregularities found in their writing. This is a promising result towards elucidating the existence of a general pattern in linguistic deterioration caused by dementia that might be detectable from a subject’s written descriptive language.

Collaboration


Davy Weissenbacher's collaborations.

Top Co-Authors

Matthew Scotch
Arizona State University

Tasnia Tahsin
Arizona State University

Abeed Sarker
Arizona State University

Rachel Beard
Arizona State University

Robert Rivera
Arizona State University

Ari Z. Klein
University of Pennsylvania

Arjun Magge
Arizona State University