Nic Herndon
Kansas State University
Publications
Featured research published by Nic Herndon.
BMC Genomics | 2015
Jennifer Shelton; Michelle C Coleman; Nic Herndon; Nanyan Lu; Ernest T. Lam; Thomas Anantharaman; Palak Sheth; Susan J. Brown
Background: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles that confound assembly. Thus a variety of tools and approaches are needed to improve draft genomes.
Results: We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb. We used this map for super-scaffolding the T. castaneum sequence assembly, more than tripling its N50 with the program Stitch.
Conclusions: In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the T. castaneum genome.
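The contiguity gains above are reported with the N50 statistic. As a rough illustration of how that metric is computed (this is not the authors' code, just the standard definition): N50 is the length L such that contigs or scaffolds of length at least L cover half the total assembly length.

```python
# Illustrative N50 computation: sort lengths descending and walk the running
# total until it reaches half of the assembly size.

def n50(contig_lengths):
    """Return the N50 of a list of contig/scaffold lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # -> 80
```

Super-scaffolding raises N50 because it merges many short entries into fewer long ones, so fewer, longer pieces are needed to reach the halfway point.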
annual computer security applications conference | 2015
Sankardas Roy; Jordan DeLoach; Yuping Li; Nic Herndon; Doina Caragea; Xinming Ou; Venkatesh Prasad Ranganath; Hongmin Li; Nicolais Guevara
Although Machine Learning (ML) based approaches have shown promise for Android malware detection, a set of critical challenges remain unaddressed. Some of those challenges arise in relation to proper evaluation of the detection approach, while others are related to the design decisions of the approach itself. In this paper, we systematically study the impact of these challenges as a set of research questions (i.e., hypotheses). We design an experimentation framework where we can reliably vary several parameters while evaluating ML-based Android malware detection approaches. The results from the experiments are then used to answer the research questions. Meanwhile, we also demonstrate the impact of some challenges on some existing ML-based approaches. The large (market-scale) dataset of benign and malicious apps we use in these experiments reflects the scale of real-world Android app security analysis. We envision this study to encourage better evaluation strategies and better designs of future ML-based approaches for Android malware detection.
biomedical engineering systems and technologies | 2014
Nic Herndon; Doina Caragea
For many machine learning problems, training an accurate classifier in a supervised setting requires a substantial volume of labeled data. While large volumes of labeled data are currently available for some of these problems, little or no labeled data exists for others. Manually labeling data can be costly and time consuming. An alternative is to learn classifiers in a domain adaptation setting in which existing labeled data can be leveraged from a related problem, referred to as source domain, in conjunction with a small amount of labeled data and a large amount of unlabeled data for the problem of interest, or target domain. In this paper, we propose two similar domain adaptation classifiers based on a naïve Bayes algorithm. We evaluate these classifiers on the difficult task of splice site prediction, essential for gene prediction. Results show that the algorithms classified instances accurately, with highest average area under the precision-recall curve (auPRC) values between 18.46% and 78.01%.
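One common way to realize such a domain-adaptive naïve Bayes classifier is to blend per-class feature counts from the two domains. The sketch below is a hedged illustration under that assumption; the blending weight `lam`, the Laplace smoothing, the uniform class priors, and the toy k-mer features are all illustrative choices, not the paper's exact formulation.

```python
import math
from collections import Counter

def blend(src, tgt, lam):
    """Blend source and target feature counts: lam * source + (1 - lam) * target."""
    out = Counter()
    for f in set(src) | set(tgt):
        out[f] = lam * src[f] + (1 - lam) * tgt[f]
    return out

def train(source, target, classes, lam=0.3, alpha=1.0):
    """source/target: {class: Counter(feature -> count)}. Returns a predictor.

    Uses Laplace smoothing (alpha) and assumes uniform class priors.
    """
    counts = {y: blend(source.get(y, Counter()), target.get(y, Counter()), lam)
              for y in classes}
    vocab = set().union(*counts.values())

    def log_prob(features, y):
        total = sum(counts[y].values()) + alpha * len(vocab)
        return sum(math.log((counts[y][f] + alpha) / total) for f in features)

    def predict(features):
        return max(classes, key=lambda y: log_prob(features, y))

    return predict

# Toy usage with hypothetical 3-mer features around candidate splice sites:
src = {"+": Counter({"GTA": 5, "AGG": 4}), "-": Counter({"TTT": 6})}
tgt = {"+": Counter({"GTA": 1}), "-": Counter({"TTT": 1, "AAA": 1})}
clf = train(src, tgt, {"+", "-"})
print(clf(["GTA", "AGG"]))  # -> '+'
```

The weight `lam` controls how much the abundant source counts influence the model relative to the scarce target counts; in practice it would be tuned on held-out target data.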
IEEE Transactions on Nanobioscience | 2016
Nic Herndon; Doina Caragea
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use both the abundant labeled data from a source domain, the limited labeled data and optionally the unlabeled data from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction - a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
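A minimal sketch of one standard way to adapt logistic regression across domains is to down-weight source examples relative to target examples in the loss. The weighting scheme, learning rate, and toy data below are assumptions for illustration and may differ from the authors' exact formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_lr(rows, labels, weights, lr=0.1, epochs=200):
    """Stochastic gradient descent on importance-weighted logistic loss.

    rows: feature vectors; weights: per-example importance (e.g., source
    examples weighted lower than target examples).
    """
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y, c in zip(rows, labels, weights):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = c * (p - y)  # importance-weighted gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Source examples get weight 0.2, target examples weight 1.0:
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 0, 1, 0]
c = [0.2, 0.2, 1.0, 1.0]
w, b = fit_weighted_lr(X, y, c)
p = sigmoid(w[0] * 1 + w[1] * 1 + b)
print(p > 0.5)  # -> True
```

Down-weighting source examples keeps their abundant signal without letting distributional differences between the domains dominate the scarce target signal.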
biomedical engineering systems and technologies | 2013
Nic Herndon; Doina Caragea
A challenge arising from the ever-increasing volume of biological data generated by next generation sequencing technologies is the annotation of this data, e.g. identification of gene structure from the location of splice sites, or prediction of protein function/localization. The annotation can be achieved by using automated classification algorithms. Supervised classification requires large amounts of labeled data for the problem at hand. For many problems, labeled data is not available. However, labeled data might be available for a similar, related problem. To leverage the labeled data available for the related problem, we propose an algorithm that builds a naive Bayes classifier for biological sequences in a domain adaptation setting. Specifically, it uses the existing large corpus of labeled data from a source organism, in conjunction with any available labeled data and lots of unlabeled data from a target organism, thus alleviating the need to manually label a large number of sequences for a supervised classifier. When tested on the task of predicting protein localization from the composition of the protein, this algorithm performed better than the multinomial naive Bayes classifier. However, on a more difficult task, of splice site prediction, the results were not satisfactory.
international symposium on bioinformatics research and applications | 2015
Nic Herndon; Doina Caragea
Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use both the abundant labeled data from a source domain and the limited labeled data from the target domain to train classifiers in a domain adaptation setting. We propose such a classifier, based on logistic regression, and evaluate it for the task of splice site prediction – a difficult and essential step in gene prediction. Our classifier achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.
bioinformatics and biomedicine | 2014
Nic Herndon; Karthik Tangirala; Doina Caragea
The reduced cost of next generation sequencing technologies provides opportunities to study non-model organisms. However, one challenge is the large volume of data generated and, thus, the need to use automated approaches to annotate these data. Machine learning algorithms could provide a cost-effective solution, but they need large amounts of labeled data and informative features to represent these data. Our proposed approach addresses both problems by using a domain adaptation classifier in conjunction with features generated with unsupervised techniques to annotate biological sequence data.
international symposium on bioinformatics research and applications | 2015
Karthik Tangirala; Nic Herndon; Doina Caragea
Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we have proposed the use of a community detection approach to construct low dimensional feature sets for nucleotide sequence classification. Our approach uses the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and subsequently uses community detection to identify groups of k-mers that appear frequently in a set of sequences. While this approach worked well for nucleotide sequence classification, it could not be directly used for protein sequences, as the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extend our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
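The network-construction step described above can be sketched as follows: connect k-mers whose Hamming distance falls within a threshold, then pass the resulting graph to a community-detection algorithm (for protein k-mers, the paper replaces the Hamming distance with substitution scores). The threshold and toy k-mers here are illustrative.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length k-mers differ."""
    return sum(x != y for x, y in zip(a, b))

def kmer_graph(kmers, max_dist=1):
    """Return edges between k-mers within max_dist substitutions of each other."""
    return [(a, b) for a, b in combinations(kmers, 2)
            if hamming(a, b) <= max_dist]

edges = kmer_graph(["GTAAG", "GTCAG", "TTTTT"])
print(edges)  # -> [('GTAAG', 'GTCAG')]
```

Communities found in this graph group k-mers that are near-variants of one another, so each community can serve as a single low-dimensional feature instead of many individual k-mers.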
Journal of Contingencies and Crisis Management | 2018
Hongmin Li; Doina Caragea; Cornelia Caragea; Nic Herndon
Social media platforms such as Twitter provide valuable information for aiding disaster response during emergency events. Machine learning could be used to identify such information. However, supervised learning algorithms rely on labelled data, which is not readily available for an emerging target disaster. While labelled data might be available for a prior source disaster, supervised classifiers learned only from the source disaster may not perform well on the target disaster, as each event has unique characteristics (e.g., type, location, and culture) and may cause different social media responses. To address this limitation, we propose to use a domain adaptation approach, which learns classifiers from unlabelled target data in addition to labelled source data. Our approach uses the Naive Bayes classifier together with an iterative Self-Training strategy. Experimental results on the task of identifying tweets relevant to a disaster of interest show that the domain adaptation classifiers outperform supervised classifiers learned only from labelled source data.
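The iterative Self-Training strategy can be sketched as follows: train on the labelled data, move the most confident predictions on unlabelled target examples into the training set, and repeat. The toy one-dimensional nearest-centroid classifier and the confidence threshold below are illustrative assumptions standing in for the paper's Naive Bayes setup.

```python
from collections import defaultdict

class Centroid:
    """Toy nearest-centroid classifier over 1-D features, with a crude confidence."""
    def __init__(self, labelled):
        groups = defaultdict(list)
        for x, y in labelled:
            groups[y].append(x)
        self.mu = {y: sum(v) / len(v) for y, v in groups.items()}

    def predict(self, x):
        dist = {y: abs(x - m) for y, m in self.mu.items()}
        y = min(dist, key=dist.get)
        total = sum(dist.values())
        conf = 1.0 if total == 0 else 1 - dist[y] / total
        return y, conf

def self_train(labelled, unlabelled, rounds=5, threshold=0.75):
    """Iteratively absorb confident predictions on unlabelled data, then retrain."""
    pool = list(unlabelled)
    for _ in range(rounds):
        model = Centroid(labelled)
        scored = [(x, *model.predict(x)) for x in pool]
        keep = [(x, y) for x, y, c in scored if c >= threshold]
        if not keep:
            break
        labelled = labelled + keep
        pool = [x for x, y, c in scored if c < threshold]
    return Centroid(labelled)

# Two labelled source examples, four unlabelled target examples:
model = self_train([(0.0, "a"), (10.0, "b")], [1.0, 9.0, 2.0, 8.0])
print(model.predict(2.0)[0])  # -> 'a'
```

Each round the decision boundary shifts toward the target distribution as confident target examples enter the training set, which is how the unlabelled target data improves on a source-only classifier.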
F1000Research | 2018
Taylor Falk; Nic Herndon; Emily Grau; Sean Buehler; Peter Richter; Jill L. Wegrzyn
Introduction
• CartograTree is a web-based framework for displaying, selecting, and analyzing forest trees.
• TreeGenes is an online database storing forest tree genotypic, phenotypic, and environmental data, combined with project metadata.
• Integrating environmental data with omic data from georeferenced trees enables association mapping and landscape genomics, among other analyses.
• TreeGenes relies on Chado, a GMOD relational database schema focused on biological information, which can be adapted to include metadata.
• Instead of traditional curation by hand, TreeGenes can accept standard data and metadata through TPPS, the Tripal Plant Pop-Gen Submit pipeline.