Nic Herndon
Kansas State University
Publications
Featured research published by Nic Herndon.
BMC Genomics | 2015
Jennifer Shelton; Michelle C Coleman; Nic Herndon; Nanyan Lu; Ernest T. Lam; Thomas Anantharaman; Palak Sheth; Susan J. Brown
Background: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles that confound assembly. Thus a variety of tools and approaches are needed to improve draft genomes.
Results: We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb. We used this map for super-scaffolding the T. castaneum sequence assembly, more than tripling its N50 with the program Stitch.
Conclusions: In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the T. castaneum genome.
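The contiguity gains above are reported with the N50 statistic. As a rough illustration of how that metric is computed (this is not the authors' code, just the standard definition): N50 is the length L such that contigs or scaffolds of length at least L cover half the total assembly length.

```python
# Illustrative N50 computation: sort lengths descending and walk the running
# total until it reaches half of the assembly size.

def n50(contig_lengths):
    """Return the N50 of a list of contig/scaffold lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # -> 80
```

Super-scaffolding raises N50 because it merges many short entries into fewer long ones, so fewer, longer pieces are needed to reach the halfway point.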
annual computer security applications conference | 2015
Sankardas Roy; Jordan DeLoach; Yuping Li; Nic Herndon; Doina Caragea; Xinming Ou; Venkatesh Prasad Ranganath; Hongmin Li; Nicolais Guevara
Although Machine Learning (ML) based approaches have shown promise for Android malware detection, a set of critical challenges remain unaddressed. Some of those challenges arise in relation to proper evaluation of the detection approach, while others are related to the design decisions of the approach itself. In this paper, we systematically study the impact of these challenges as a set of research questions (i.e., hypotheses). We design an experimentation framework where we can reliably vary several parameters while evaluating ML-based Android malware detection approaches. The results from the experiments are then used to answer the research questions. Meanwhile, we also demonstrate the impact of some challenges on some existing ML-based approaches. The large (market-scale) dataset of benign and malicious apps we use in these experiments reflects the scale of real-world Android app security analysis. We envision this study to encourage better evaluation strategies and better designs of future ML-based approaches for Android malware detection.
biomedical engineering systems and technologies | 2014
Nic Herndon; Doina Caragea
For many machine learning problems, training an accurate classifier in a supervised setting requires a substantial volume of labeled data. While large volumes of labeled data are currently available for some of these problems, little or no labeled data exists for others. Manually labeling data can be costly and time consuming. An alternative is to learn classifiers in a domain adaptation setting in which existing labeled data can be leveraged from a related problem, referred to as source domain, in conjunction with a small amount of labeled data and a large amount of unlabeled data for the problem of interest, or target domain. In this paper, we propose two similar domain adaptation classifiers based on a naïve Bayes algorithm. We evaluate these classifiers on the difficult task of splice site prediction, essential for gene prediction. Results show that the algorithms classified instances accurately, with highest average area under the precision-recall curve (auPRC) values between 18.46% and 78.01%.
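One common way to realize such a domain-adaptive naïve Bayes classifier is to blend per-class feature counts from the two domains. The sketch below is a hedged illustration under that assumption; the blending weight `lam`, the Laplace smoothing, the uniform class priors, and the toy k-mer features are all illustrative choices, not the paper's exact formulation.

```python
import math
from collections import Counter

def blend(src, tgt, lam):
    """Blend source and target feature counts: lam * source + (1 - lam) * target."""
    out = Counter()
    for f in set(src) | set(tgt):
        out[f] = lam * src[f] + (1 - lam) * tgt[f]
    return out

def train(source, target, classes, lam=0.3, alpha=1.0):
    """source/target: {class: Counter(feature -> count)}. Returns a predictor.

    Uses Laplace smoothing (alpha) and assumes uniform class priors.
    """
    counts = {y: blend(source.get(y, Counter()), target.get(y, Counter()), lam)
              for y in classes}
    vocab = set().union(*counts.values())

    def log_prob(features, y):
        total = sum(counts[y].values()) + alpha * len(vocab)
        return sum(math.log((counts[y][f] + alpha) / total) for f in features)

    def predict(features):
        return max(classes, key=lambda y: log_prob(features, y))

    return predict

# Toy usage with hypothetical 3-mer features around candidate splice sites:
src = {"+": Counter({"GTA": 5, "AGG": 4}), "-": Counter({"TTT": 6})}
tgt = {"+": Counter({"GTA": 1}), "-": Counter({"TTT": 1, "AAA": 1})}
clf = train(src, tgt, {"+", "-"})
print(clf(["GTA", "AGG"]))  # -> '+'
```

The weight `lam` controls how much the abundant source counts influence the model relative to the scarce target counts; in practice it would be tuned on held-out target data.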
IEEE Transactions on Nanobioscience | 2016
Nic Herndon; Doina Caragea
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use both the abundant labeled data from a source domain, the limited labeled data and optionally the unlabeled data from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction - a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
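A minimal sketch of one standard way to adapt logistic regression across domains is to down-weight source examples relative to target examples in the loss. The weighting scheme, learning rate, and toy data below are assumptions for illustration and may differ from the authors' exact formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_lr(rows, labels, weights, lr=0.1, epochs=200):
    """Stochastic gradient descent on importance-weighted logistic loss.

    rows: feature vectors; weights: per-example importance (e.g., source
    examples weighted lower than target examples).
    """
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y, c in zip(rows, labels, weights):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = c * (p - y)  # importance-weighted gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Source examples get weight 0.2, target examples weight 1.0:
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 0, 1, 0]
c = [0.2, 0.2, 1.0, 1.0]
w, b = fit_weighted_lr(X, y, c)
p = sigmoid(w[0] * 1 + w[1] * 1 + b)
print(p > 0.5)  # -> True
```

Down-weighting source examples keeps their abundant signal without letting distributional differences between the domains dominate the scarce target signal.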
biomedical engineering systems and technologies | 2013
Nic Herndon; Doina Caragea
A challenge arising from the ever-increasing volume of biological data generated by next generation sequencing technologies is the annotation of this data, e.g. identification of gene structure from the location of splice sites, or prediction of protein function/localization. The annotation can be achieved by using automated classification algorithms. Supervised classification requires large amounts of labeled data for the problem at hand. For many problems, labeled data is not available. However, labeled data might be available for a similar, related problem. To leverage the labeled data available for the related problem, we propose an algorithm that builds a naive Bayes classifier for biological sequences in a domain adaptation setting. Specifically, it uses the existing large corpus of labeled data from a source organism, in conjunction with any available labeled data and lots of unlabeled data from a target organism, thus alleviating the need to manually label a large number of sequences for a supervised classifier. When tested on the task of predicting protein localization from the composition of the protein, this algorithm performed better than the multinomial naive Bayes classifier. However, on a more difficult task, of splice site prediction, the results were not satisfactory.
international symposium on bioinformatics research and applications | 2015
Nic Herndon; Doina Caragea
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use both the abundant labeled data from a source domain and the limited labeled data from the target domain to train classifiers in a domain adaptation setting. We propose such a classifier, based on logistic regression, and evaluate it for the task of splice site prediction – a difficult and essential step in gene prediction. Our classifier achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
bioinformatics and biomedicine | 2014
Nic Herndon; Karthik Tangirala; Doina Caragea
The reduced cost of next generation sequencing technologies provides opportunities to study non-model organisms. However, one challenge is the large volume of data generated and, thus, the need to use automated approaches to annotate these data. Machine learning algorithms could provide a cost-effective solution, but they need large amounts of labeled data and informative features to represent these data. Our proposed approach addresses both problems by using a domain adaptation classifier in conjunction with features generated with unsupervised techniques to annotate biological sequence data.
international symposium on bioinformatics research and applications | 2015
Karthik Tangirala; Nic Herndon; Doina Caragea
Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we have proposed the use of a community detection approach to construct low dimensional feature sets for nucleotide sequence classification. Our approach uses the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and subsequently uses community detection to identify groups of k-mers that appear frequently in a set of sequences. While this approach worked well for nucleotide sequence classification, it could not be directly used for protein sequences, as the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extend our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
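The network-construction step described above can be sketched as follows: connect k-mers whose Hamming distance falls within a threshold, then pass the resulting graph to a community-detection algorithm (for protein k-mers, the paper replaces the Hamming distance with substitution scores). The threshold and toy k-mers here are illustrative.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length k-mers differ."""
    return sum(x != y for x, y in zip(a, b))

def kmer_graph(kmers, max_dist=1):
    """Return edges between k-mers within max_dist substitutions of each other."""
    return [(a, b) for a, b in combinations(kmers, 2)
            if hamming(a, b) <= max_dist]

edges = kmer_graph(["GTAAG", "GTCAG", "TTTTT"])
print(edges)  # -> [('GTAAG', 'GTCAG')]
```

Communities found in this graph group k-mers that are near-variants of one another, so each community can serve as a single low-dimensional feature instead of many individual k-mers.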
Journal of Contingencies and Crisis Management | 2018
Hongmin Li; Doina Caragea; Cornelia Caragea; Nic Herndon
Social media platforms such as Twitter provide valuable information for aiding disaster response during emergency events. Machine learning could be used to identify such information. However, supervised learning algorithms rely on labelled data, which is not readily available for an emerging target disaster. While labelled data might be available for a prior source disaster, supervised classifiers learned only from the source disaster may not perform well on the target disaster, as each event has unique characteristics (e.g., type, location, and culture) and may cause different social media responses. To address this limitation, we propose to use a domain adaptation approach, which learns classifiers from unlabelled target data in addition to labelled source data. Our approach uses the Naive Bayes classifier together with an iterative Self-Training strategy. Experimental results on the task of identifying tweets relevant to a disaster of interest show that the domain adaptation classifiers outperform supervised classifiers learned only from labelled source data.
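The iterative Self-Training strategy can be sketched as follows: train on the labelled data, move the most confident predictions on unlabelled target examples into the training set, and repeat. The toy one-dimensional nearest-centroid classifier and the confidence threshold below are illustrative assumptions standing in for the paper's Naive Bayes setup.

```python
from collections import defaultdict

class Centroid:
    """Toy nearest-centroid classifier over 1-D features, with a crude confidence."""
    def __init__(self, labelled):
        groups = defaultdict(list)
        for x, y in labelled:
            groups[y].append(x)
        self.mu = {y: sum(v) / len(v) for y, v in groups.items()}

    def predict(self, x):
        dist = {y: abs(x - m) for y, m in self.mu.items()}
        y = min(dist, key=dist.get)
        total = sum(dist.values())
        conf = 1.0 if total == 0 else 1 - dist[y] / total
        return y, conf

def self_train(labelled, unlabelled, rounds=5, threshold=0.75):
    """Iteratively absorb confident predictions on unlabelled data, then retrain."""
    pool = list(unlabelled)
    for _ in range(rounds):
        model = Centroid(labelled)
        scored = [(x, *model.predict(x)) for x in pool]
        keep = [(x, y) for x, y, c in scored if c >= threshold]
        if not keep:
            break
        labelled = labelled + keep
        pool = [x for x, y, c in scored if c < threshold]
    return Centroid(labelled)

# Two labelled source examples, four unlabelled target examples:
model = self_train([(0.0, "a"), (10.0, "b")], [1.0, 9.0, 2.0, 8.0])
print(model.predict(2.0)[0])  # -> 'a'
```

Each round the decision boundary shifts toward the target distribution as confident target examples enter the training set, which is how the unlabelled target data improves on a source-only classifier.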
F1000Research | 2018
Taylor Falk; Nic Herndon; Emily Grau; Sean Buehler; Peter Richter; Jill L. Wegrzyn
Introduction
• CartograTree is a web-based framework for displaying, selecting, and analyzing forest trees.
• TreeGenes is an online database storing forest tree genotypic, phenotypic, and environmental data, combined with project metadata.
• Integrating environmental data with omic data from georeferenced trees enables association mapping and landscape genomics, among other analyses.
• TreeGenes relies on Chado, a GMOD relational database schema focused on biological information, which can be adapted to include metadata.
• Instead of traditional curation by hand, TreeGenes can accept standard data and metadata through TPPS, the Tripal Plant Pop-Gen Submit pipeline.