Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Ehsaneddin Asgari is active.

Publication


Featured research published by Ehsaneddin Asgari.


PLOS ONE | 2015

Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

Ehsaneddin Asgari; Mohammad R. K. Mofrad

We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors, which can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt a Skip-gram neural network approach and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy.
These results indicate that by only providing sequence data for various proteins to this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website, http://llp.berkeley.edu, and Harvard Dataverse, http://dx.doi.org/10.7910/DVN/JMFHTN.
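
As a rough illustration of the embedding idea described above, the sketch below trains a Skip-gram model on overlapping 3-mers of toy protein sequences. It assumes gensim's Word2Vec; the window size, vector dimension, example sequences, and the way sequences are turned into 3-mer "sentences" are illustrative, not the paper's exact pipeline.

```python
# Minimal sketch of the k-mer embedding idea: train a Skip-gram model on
# overlapping 3-mers ("words") of protein sequences. Hyperparameters and
# sequences are toy values, not the paper's settings.
import numpy as np
from gensim.models import Word2Vec

def to_kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy corpus; in practice this would be the Swiss-Prot sequences.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",
]
corpus = [to_kmers(seq) for seq in sequences]

# Skip-gram (sg=1) embeds each 3-mer as a dense n-dimensional vector.
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)

# A whole sequence can then be represented, e.g., by summing its 3-mer vectors.
protvec = np.sum([model.wv[kmer] for kmer in to_kmers(sequences[0])], axis=0)
print(protvec.shape)  # (100,)
```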


North American Chapter of the Association for Computational Linguistics | 2016

Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance

Ehsaneddin Asgari; Mohammad R. K. Mofrad

We introduce a new measure of distance between languages based on word embeddings, called word embedding language divergence (WELD). WELD is defined as the divergence between the unified similarity distributions of words across languages. Using this measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from Bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees the same content in all languages, in many cases languages within the same family interestingly cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of twelve different organisms (four plants, six animals, and two human subjects). Our results confirm a significant high-level difference in the genetic language model of humans and animals versus plants. The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in language classification, genre identification, dialect identification, and evaluation of translations.
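
The sketch below is a toy illustration of the general idea of comparing languages via the divergence of their word-similarity distributions; it is not the exact WELD formulation, and the random matrices stand in for trained word embeddings.

```python
# Toy illustration only: compare two "languages" via the Jensen-Shannon
# distance between their word-similarity distributions. Random matrices
# stand in for trained word embeddings; this is not the exact WELD measure.
import numpy as np
from scipy.spatial.distance import jensenshannon

def similarity_distribution(embeddings):
    """Normalized distribution over all pairwise word similarities."""
    sims = embeddings @ embeddings.T                  # cosine similarities (rows are unit-norm)
    upper = sims[np.triu_indices_from(sims, k=1)]     # unique word pairs
    weights = np.exp(upper)
    return weights / weights.sum()

rng = np.random.default_rng(0)
lang_a = rng.normal(size=(50, 16))                    # 50 "words", 16-dim embeddings
lang_b = rng.normal(size=(50, 16))
lang_a /= np.linalg.norm(lang_a, axis=1, keepdims=True)
lang_b /= np.linalg.norm(lang_b, axis=1, keepdims=True)

divergence = jensenshannon(similarity_distribution(lang_a),
                           similarity_distribution(lang_b))
print(f"toy language divergence: {divergence:.3f}")
```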


World Wide Web | 2014

Integration of scientific and social networks

Mahmood Neshati; Djoerd Hiemstra; Ehsaneddin Asgari; Hamid Beigy

In this paper, we address the problem of scientific-social network integration to find a matching relationship between members of these networks (i.e., the DBLP publication network and the Twitter social network). This task is a crucial step toward building a multi-environment expert finding system, which has recently attracted much attention in the information retrieval community. We divide the problem of social and scientific network integration into two sub-problems: the first concerns finding those profiles in one network that presumably have a corresponding profile in the other network, and the second concerns name disambiguation to find the true matching profiles among candidate profiles. Utilizing several name similarity patterns and contextual properties of these networks, we design a focused crawler to find highly probable matching pairs; the problem of name disambiguation is then reduced to predicting the label of each candidate pair as either a true or a false match. Because the labels of these candidate pairs are not independent, state-of-the-art classification methods such as logistic regression and decision trees, which classify each instance separately, are unsuitable for this task. By defining a matching dependency graph, we propose a joint label prediction model to determine the labels of all candidate pairs simultaneously. Two main types of dependencies among candidate pairs, which are quite intuitive and general, are considered in designing the joint label prediction model. Using discriminative approaches, we utilize various feature sets to train our proposed classifiers. An extensive set of experiments has been conducted on six test collections collected from the DBLP and Twitter networks to show the effectiveness of the proposed joint label prediction model.
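
The first sub-problem, generating candidate matching pairs by name similarity, can be sketched as follows. The profiles, threshold, and similarity measure are hypothetical placeholders; the paper's focused crawler uses richer name patterns and contextual features.

```python
# Sketch of the first sub-problem: generate candidate DBLP/Twitter pairs by
# name similarity. The profiles, threshold and similarity measure are
# hypothetical placeholders, not the paper's crawler.
from difflib import SequenceMatcher

dblp_profiles = [{"name": "Ehsaneddin Asgari"}, {"name": "Mahmood Neshati"}]
twitter_profiles = [
    {"handle": "@easgari", "display_name": "Ehsan Asgari"},
    {"handle": "@randomuser", "display_name": "Random User"},
]

def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidate_pairs = [
    (d["name"], t["handle"])
    for d in dblp_profiles
    for t in twitter_profiles
    if name_similarity(d["name"], t["display_name"]) > 0.6
]
# These high-probability pairs would then be labeled true/false by the
# joint label prediction model over the matching dependency graph.
print(candidate_pairs)
```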


North American Chapter of the Association for Computational Linguistics | 2016

Text Analysis and Automatic Triage of Posts in a Mental Health Forum

Ehsaneddin Asgari; Soroush Nasiriany; Mohammad R. K. Mofrad

We present an approach for automatic triage of message posts in the ReachOut.com mental health forum, which was a shared task at the 2016 Computational Linguistics and Clinical Psychology (CLPsych) workshop. This effort is aimed at providing the trained moderators of ReachOut.com with a systematic triage of forum posts, enabling them to more efficiently support the young users, aged 14-25, communicating with each other about their issues. We use different features and classifiers to predict the users' mental health states, marked as green, amber, red, and crisis. Our results show that random forests achieve significant improvements over our baseline multi-class SVM classifier. In addition, we perform a feature importance analysis to characterize the key features in the identification of critical posts.
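
A minimal sketch of this kind of triage setup is shown below: TF-IDF features with a random forest against a linear SVM baseline over the severity labels. It assumes scikit-learn, and the example posts and parameters are placeholders rather than the shared-task data or the authors' exact feature set.

```python
# Minimal sketch: TF-IDF features with a random forest vs. a linear SVM
# baseline over the triage labels. Example posts and parameters are
# placeholders, not the shared-task data or the authors' feature set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

posts = [
    "feeling great after talking to friends",
    "having a really hard week, can't sleep",
    "I don't see the point anymore",
    "thanks everyone for the support",
]
labels = ["green", "amber", "crisis", "green"]

baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
forest = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=200))

for model in (baseline, forest):
    model.fit(posts, labels)
    print(model.predict(["everything feels hopeless lately"]))
```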


Intelligent Systems for Molecular Biology | 2018

MicroPheno: Predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples

Ehsaneddin Asgari; Kiavash Garakani; Alice C. McHardy; Mohammad R. K. Mofrad

Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis, with proven applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes. Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping the computationally costly sequence alignments required in OTU-picking and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine. Availability and implementation: The software and datasets are available at https://llp.berkeley.edu/micropheno.
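
The k-mer representation of a shallow sub-sample can be sketched as follows. The reads, k, and sub-sample size are toy values, and the bootstrapping framework that decides how many reads suffice is omitted.

```python
# Sketch of a k-mer profile of a shallow sub-sample of 16S rRNA reads.
# Reads, k and sub-sample size are toy values; for classification the
# feature space would be the full 4**k k-mer vocabulary so that vectors
# align across samples, and the paper's bootstrapping decides sample depth.
import random
from collections import Counter
import numpy as np

def kmer_profile(reads, k=6, n_subsample=2):
    """Normalized k-mer frequency vector of a random shallow sub-sample."""
    sample = random.sample(reads, min(n_subsample, len(reads)))
    counts = Counter(
        read[i:i + k] for read in sample for i in range(len(read) - k + 1)
    )
    kmers = sorted(counts)
    vec = np.array([counts[km] for km in kmers], dtype=float)
    return kmers, vec / vec.sum()

reads = ["ACGTACGTGGTTAACGTAC", "TTGGCCAACGTACGTACGA", "GGTTAACGTACGTACGTAA"]
kmers, profile = kmer_profile(reads)
print(len(kmers), profile[:5])
```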


bioRxiv | 2018

Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detection

Ehsaneddin Asgari; Philipp C. Münch; Till R. Lesker; Alice C. McHardy; Mohammad R. K. Mofrad

We propose subsequence-based 16S rRNA data processing as a new paradigm for sequence phenotype classification and biomarker detection. This method and software, called DiTaxa, substitutes standard OTU-clustering or sequence-level analysis by segmenting 16S rRNA reads into the most frequent variable-length subsequences. These subsequences are then used as the data representation for downstream phenotype prediction, biomarker detection and taxonomic analysis. Our proposed sequence segmentation, called nucleotide-pair encoding (NPE), is an unsupervised data-driven segmentation inspired by byte-pair encoding, a data compression algorithm. The identified subsequences represent commonly occurring sequence portions, which we found to be distinctive for taxa at varying evolutionary distances and highly informative for predicting host phenotypes. We compared the performance of DiTaxa to state-of-the-art methods in disease phenotype prediction and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa identified 13 out of 21 taxa with confirmed links to periodontitis (recall = 0.62), relative to 3 out of 21 taxa (recall = 0.14) by the state-of-the-art method. On synthetic benchmark data, DiTaxa obtained full precision and recall in biomarker detection, compared to 0.91 and 0.90, respectively. In addition, machine-learning classifiers trained to predict host disease phenotypes based on the NPE representation performed competitively with the state of the art using OTUs or k-mers. For the rheumatoid arthritis dataset, DiTaxa substantially outperformed OTU features, with a macro-F1 score of 0.76 compared to 0.65. Due to its alignment- and reference-free nature, DiTaxa can efficiently run on large datasets. The full analysis of a large 16S rRNA dataset of 1,359 samples required ≈1.5 hours on 20 cores, while the standard pipeline needed ≈6.5 hours in the same setting. Availability: An implementation of our method, called DiTaxa, is available under the Apache 2 licence at http://llp.berkeley.edu/ditaxa.
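
The byte-pair-encoding-style merging behind NPE can be sketched as below: repeatedly merge the most frequent adjacent symbol pair into a new symbol. The toy sequences and number of merges are illustrative; DiTaxa learns its merges over large 16S rRNA corpora.

```python
# Compact sketch of byte-pair-encoding-style merging: repeatedly merge the
# most frequent adjacent symbol pair into a new symbol. Toy sequences and
# merge count; DiTaxa learns its merges over large 16S rRNA corpora.
from collections import Counter

def learn_merges(sequences, num_merges=10):
    corpora = [list(seq) for seq in sequences]   # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpora:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for i, toks in enumerate(corpora):       # apply the merge everywhere
            out, j = [], 0
            while j < len(toks):
                if j + 1 < len(toks) and toks[j] == a and toks[j + 1] == b:
                    out.append(a + b)
                    j += 2
                else:
                    out.append(toks[j])
                    j += 1
            corpora[i] = out
    return merges, corpora

merges, segmented = learn_merges(["ACGTACGTAC", "ACGTTGACGT"], num_merges=5)
print(merges)
print(segmented)   # variable-length subsequences used as features
```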


bioRxiv | 2018

Probabilistic variable-length segmentation of protein sequences for discriminative motif mining (DiMotif) and sequence embedding (ProtVecX)

Ehsaneddin Asgari; Alice C. McHardy; Mohammad R. K. Mofrad

In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning task in protein bioinformatics. In particular, here we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate it for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct, experimentally verified motif discovery problems, (2) a classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) sequence pattern searching for the nuclear localization signal. DiMotif generally obtained high recall scores, while having an F1 score comparable to other methods in the discovery of experimentally verified motifs. The high recall suggests that DiMotif can be used to create short-lists for further experimental investigation of motifs. In the classification-based evaluation, the extracted motifs could reliably detect integrin, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend the k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new embedding method can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw k-mer features. Availability: Implementations of our method will be available under the Apache 2 licence at http://llp.berkeley.edu/dimotif and http://llp.berkeley.edu/protvecx.
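
The idea of producing multiple segmentations of one protein can be sketched as follows: apply a sampled prefix of previously learned merge operations so that the same sequence receives several alternative segmentations. The hand-made merge list stands in for merges learned on Swiss-Prot, and this is only one plausible reading of the sampling framework described above, not the paper's exact procedure.

```python
# Sketch: multiple segmentations of one protein by applying a sampled prefix
# of learned merge operations. `learned_merges` is a hand-made toy list
# standing in for merges learned on Swiss-Prot; the real PPE sampling may
# differ from this simple prefix-length scheme.
import random

learned_merges = [("A", "L"), ("AL", "K"), ("G", "V"), ("GV", "A")]

def apply_merges(sequence, merges):
    toks = list(sequence)
    for a, b in merges:
        out, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(toks[i])
                i += 1
        toks = out
    return toks

def sample_segmentations(sequence, merges, n=3):
    """Draw several segmentations by sampling how many merges to apply."""
    return [apply_merges(sequence, merges[:random.randint(1, len(merges))])
            for _ in range(n)]

for segmentation in sample_segmentations("MALKGVALKGVA", learned_merges):
    print(segmentation)
```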


Archive | 2018

Overview of Character-Based Models for Natural Language Processing

Heike Adel; Ehsaneddin Asgari; Hinrich Schütze

Character-based models have become increasingly popular for various natural language processing tasks, especially due to the success of neural networks. They make it possible to model text sequences directly, without the need for tokenization, and can therefore enhance the traditional preprocessing pipeline. This paper provides an overview of character-based models for a variety of natural language processing tasks. We group existing work into three categories: tokenization-based approaches, bag-of-n-gram models and end-to-end models. For each category, we present prominent examples of studies with a particular focus on recent character-based deep learning work.
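
As an example of the second category (bag-of-n-gram models), the sketch below uses character n-grams as classification features without any tokenizer. It assumes scikit-learn; the data and the n-gram range are illustrative only.

```python
# Example of a bag-of-n-gram model: character n-grams as features, no
# tokenizer required. Data and n-gram range are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["la maison est grande", "the house is big",
         "le chat dort", "the cat sleeps"]
langs = ["fr", "en", "fr", "en"]

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, langs)
print(model.predict(["the dog runs", "le chien court"]))
```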


Biophysical Journal | 2018

Molecular Insights into the Mechanisms of SUN1 Oligomerization in the Nuclear Envelope

Zeinab Jahed; Darya Fadavi; Uyen T. Vu; Ehsaneddin Asgari; G. W. Gant Luxton; Mohammad R. K. Mofrad


North American Chapter of the Association for Computational Linguistics | 2013

Linguistic Resources and Topic Models for the Analysis of Persian Poems

Ehsaneddin Asgari; Jean-Cédric Chappelier

Collaboration


Dive into Ehsaneddin Asgari's collaboration.

Top Co-Authors

Darya Fadavi (University of California)

Uyen T. Vu (University of California)

Zeinab Jahed (University of California)

Jean-Cédric Chappelier (École Polytechnique Fédérale de Lausanne)