Abhigyan Nath
Banaras Hindu University
Publication
Featured research published by Abhigyan Nath.
Computers in Biology and Medicine | 2013
Abhigyan Nath; Radha Chaube; Karthikeyan Subbiah
Antifreeze proteins (AFPs) prevent the growth of ice crystals, enabling certain organisms to survive in sub-zero surroundings. AFPs have evolved from different types of proteins and share no significant structural or sequence similarity. Nevertheless, all AFPs perform the same antifreeze function, making them a classical example of convergent evolution. We analyzed fish AFPs at the sequence level, the residue level and the level of physicochemical property group composition to discover the molecular basis for this convergent evolution. Our study of amino acid distribution did not reveal any distinctive feature among AFPs, but a comparative study of the AFPs with their close non-AFP homologs, based on physicochemical property group residues, revealed some useful information. In particular, (a) fish AFP subtypes II, III and IV share a similar pattern of amino acid avoidance and preference: aromatic residues are avoided whereas small residues are preferred; (b) like other psychrophilic proteins, AFPs have a similar pattern of preference/avoidance for most residues except Ile, Leu and Arg; and (c) most of the computed amino acids in the preferred list are key functional residues in the previously predicted model of Doxey et al. This study revealed, for the first time, common patterns of avoidance/preference in fish AFP subtypes II, III and IV. These avoidance/preference lists can further facilitate the identification of key functional residues and shed more light on the mechanism of antifreeze function.
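The comparison of property group composition between AFPs and their non-AFP homologs can be illustrated with a minimal sketch. The group definitions below (aromatic = F, W, Y; small = A, C, D, G, N, P, S, T, V) are a common textbook grouping and an assumption here, since the abstract does not list the groups used; the function names are likewise hypothetical.

```python
from collections import Counter

# Assumed property groups; the paper's exact grouping is not given in the abstract.
PROPERTY_GROUPS = {
    "aromatic": set("FWY"),
    "small": set("ACDGNPSTV"),
}


def group_composition(sequence, groups=PROPERTY_GROUPS):
    """Return the fraction of residues that fall in each property group."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {
        name: sum(counts[res] for res in members) / len(seq)
        for name, members in groups.items()
    }


def preference(afp_seqs, homolog_seqs, groups=PROPERTY_GROUPS):
    """Mean group fraction in AFPs minus the mean in homologs.

    A positive value suggests the group is preferred in AFPs,
    a negative value that it is avoided.
    """
    def mean_composition(seqs):
        comps = [group_composition(s, groups) for s in seqs]
        return {g: sum(c[g] for c in comps) / len(comps) for g in groups}

    afp = mean_composition(afp_seqs)
    hom = mean_composition(homolog_seqs)
    return {g: afp[g] - hom[g] for g in groups}
```

With real data, `afp_seqs` and `homolog_seqs` would hold the sequences of one AFP subtype and its close non-AFP homologs; a statistical test on the per-sequence fractions would be needed before calling a difference significant.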
Computers in Biology and Medicine | 2015
Priyanka Kumari; Abhigyan Nath; Radha Chaube
Identification of potential drug targets is a crucial task in the drug-discovery pipeline. Successful identification of candidate drug targets in entire genomes is very useful, and computational prediction methods can speed up this process. In the current work we have developed a sequence-based prediction method for identifying human drug target proteins and discriminating them from human non-drug target proteins. The training features are sequence-based: amino acid composition, amino acid property group composition, and dipeptide composition. The classification of human drug target proteins is a classic example of class imbalance. We addressed this issue by using SMOTE (Synthetic Minority Over-sampling Technique) as a preprocessing step to balance the training data to a 1:1 ratio between drug targets (minority samples) and non-drug targets (majority samples). Using the ensemble learning method Rotation Forest together with the ReliefF feature-selection technique for selecting the optimal subset of salient features, the best model with selected features achieved 87.1% sensitivity, 83.6% specificity, and 85.3% accuracy, with a Matthews correlation coefficient (MCC) of 0.71, on a tenfold stratified cross-validation test. The subset of identified optimal features may help in assessing the compositional patterns in human drug targets. For further validation, on a rigorous leave-one-out cross-validation test, the model achieved 88.1% sensitivity, 83.0% specificity, 85.5% accuracy, and an MCC of 0.712. The proposed method was also tested on a second dataset, for which the pipeline gave promising results. We suggest that the present approach can be applied as a complementary tool to existing methods for novel drug target prediction.
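The four measures reported above all derive from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical and the counts in the usage note are made up for illustration, not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts.

    tp/fn are drug targets predicted correctly/incorrectly,
    tn/fp are non-drug targets predicted correctly/incorrectly.
    """
    def ratio(num, den):
        # Return 0.0 when a denominator is empty instead of raising.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, denom)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `binary_metrics(tp=90, fn=10, tn=80, fp=20)` gives a sensitivity of 0.9, a specificity of 0.8, an accuracy of 0.85 and an MCC of about 0.70. Unlike accuracy, MCC stays informative when the two classes differ greatly in size, which is why it is commonly reported for imbalanced problems such as this one.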
Computational Biology and Chemistry | 2015
Abhigyan Nath; Karthikeyan Subbiah
Lipocalins are short in sequence length and perform several important biological functions. These proteins have less than 20% sequence similarity among paralogs. Identifying them experimentally is an expensive and time-consuming process. Computational methods that assign putative members to this family on the basis of sequence similarity are also of limited use because of the low sequence similarity among its members. Consequently, machine learning methods that take sequence-derived or structure-derived features as input become a viable alternative for their prediction. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. Near perfect learning can be achieved by training the model with diverse types of input instances belonging to the different regions of the entire input space. Furthermore, prediction performance can be improved by balancing the training set, as imbalanced data sets tend to bias predictions towards the majority class and its sub-classes. This paper aims to achieve (i) high generalization ability without any classification bias through diversified and balanced training sets and (ii) enhanced prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we first used the unsupervised K-means clustering algorithm to create diversified clusters of input patterns and then created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters. Finally, a probability-based classifier fusion scheme was applied to the boosted random forest algorithm (which produced greater sensitivity) and the K nearest neighbour algorithm (which produced greater specificity) to achieve better predictive performance than that of the individual base classifiers. The models trained on the K-means preprocessed training set performed far better than those trained on randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and a sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results establish that diversifying the training set improves the performance of predictive models through superior generalization ability and that balancing the training set improves prediction accuracy. For smaller data sets, unsupervised K-means based sampling can be a more effective way to increase generalization than the usual random splitting method.
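A probability-based fusion of two classifiers can be sketched as follows. The abstract does not state the exact fusion rule, so the weighted mean of the two positive-class probabilities used here is an assumption, and the function name, weight and threshold are hypothetical.

```python
def fuse_probabilities(prob_a, prob_b, weight_a=0.5, threshold=0.5):
    """Fuse positive-class probabilities from two classifiers.

    prob_a and prob_b hold, per test instance, the probability of the
    positive class (for example from a boosted random forest and from
    a k-nearest-neighbour model). Returns the fused probabilities and
    the resulting 0/1 labels.
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("both classifiers must score the same instances")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    fused = [
        weight_a * pa + (1.0 - weight_a) * pb
        for pa, pb in zip(prob_a, prob_b)
    ]
    labels = [1 if p >= threshold else 0 for p in fused]
    return fused, labels
```

Combining a high-sensitivity model with a high-specificity one in this way lets each temper the other's characteristic errors: an instance is labelled positive only when the combined evidence crosses the threshold.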
Computational Biology and Chemistry | 2014
Abhigyan Nath; Karthikeyan Subbiah
Organisms that thrive in extremely cold surroundings are called psychrophiles, and they offer a wealth of knowledge about the sequence adjustments in proteins that occurred during adaptation to low temperatures. In this paper, we propose a new cascading model to investigate the basis of psychrophilicity. In this model, a superior classifier was used to discriminate psychrophilic from mesophilic protein sequences, and then the PART rule-generating algorithm was applied to the input instances correctly classified by that classifier to generate human-interpretable rules. These derived rules were further validated on a structural dataset and finally analyzed to discover the underlying biological basis of psychrophilicity. As input features we used the global patterns of amino acid composition, one of the key features that allow psychrophilic proteins to remain functional in extremely cold surroundings. The rotation forest classifier outperformed all the other classifiers, with a maximum accuracy of 70.5% and a maximum AUC of 0.78. The effect of sequence length on classification accuracy was also investigated. Analysis of the derived rules revealed that the amino acids A, D, G, F, and S are over-represented, and T is under-represented, in psychrophilic proteins. These findings augment the existing domain knowledge of psychrophilic sequence features.
Computers in Biology and Medicine | 2016
Abhigyan Nath; Karthikeyan Subbiah
Bioluminescence plays an important role in nature; for example, it is used for intracellular chemical signalling in bacteria. It also serves as a reagent in analytical research methods ranging from cellular imaging to gene expression analysis. However, identifying and annotating bioluminescent proteins is difficult because they share little sequence similarity. In this paper, we present a novel approach for within-class and between-class balancing, as well as diversifying, of a training dataset by combining the unsupervised K-means algorithm with the Synthetic Minority Oversampling Technique (SMOTE) in order to achieve the true performance of the prediction model. Further, we varied the ratio of positive to negative data in the training dataset to probe for the class distribution that produces the best prediction accuracy. The appropriately balanced and diversified training set resulted in near complete learning with greater generalization on the blind test datasets. The results indicate that an optimal class distribution with a high degree of diversity is an essential factor in achieving near perfect learning. Using random forest as the weak learner in boosting and training it on the optimally balanced and diversified training dataset, we achieved an overall accuracy of 95.3% on a tenfold cross-validation test, and an accuracy of 91.7%, sensitivity of 89.3% and specificity of 91.8% on a holdout test set. The general framework discussed in the current work may well be applicable to other biological datasets to deal with imbalance and incomplete learning problems effectively.
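The core of SMOTE is to create each synthetic minority sample by interpolating between a minority instance and one of its nearest minority neighbours. The sketch below shows only that interpolation step in plain Python; it is not the paper's combined K-means and SMOTE pipeline, and the function name, parameters and default values are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by neighbour interpolation.

    minority: list of feature vectors (lists of floats) of the minority class.
    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank all other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

To reach a chosen positive-to-negative ratio, `n_synthetic` would be set to the number of extra minority samples that ratio requires. Applying the function separately within each K-means cluster of the minority class is one plausible way to approximate within-class balancing, though the abstract does not spell out the procedure.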
International Conference on Computing, Communication and Automation | 2015
Sunil Kumar; Manish Kumar Pandey; Abhigyan Nath; Karthikeyan Subbiah; Manoj Kumar Singh
This is an era of internet computing, and computing delivered as a service over the internet is called cloud computing. Three main services, SaaS (applications), PaaS, and IaaS, are accessed over the internet on demand and on a pay-per-use basis. Quality of Service (QoS) is the main issue in internet-based computing for service providers, and it involves user-dependent as well as user-independent QoS parameters. In the current work we compared different machine learning algorithms for predicting the response time and throughput QoS values from past usage data. Bagging and support vector machines were found to perform better than the other learning algorithms.
Journal of Theoretical Biology | 2016
Abhigyan Nath; Karthikeyan Subbiah
Piezophiles are organisms that can survive under extreme pressure. However, the molecular basis of piezophilic adaptation is still poorly understood. Analysis of the protein sequence adjustments that took place during evolution can help to reveal the sequence adaptation parameters responsible for the functional and structural adaptation of proteins to such high-pressure conditions. In the current work we used an SVM classifier to filter strong instances and generated human-interpretable rules from these strong instances using the PART algorithm. These generated rules were analyzed to gain insights into the molecular signature patterns present in piezophilic proteins. For a detailed comparative study, the experiments were performed on piezophilic groups from three different temperature ranges, namely psychrophilic-piezophilic, mesophilic-piezophilic, and thermophilic-piezophilic. Classification results improved as we moved up the temperature range from psychrophilic-piezophilic to thermophilic-piezophilic. Based on the physicochemical classification of amino acids and on feature ranking algorithms, hydrophilic and polar amino acid groups have higher discriminative ability for the psychrophilic-piezophilic and mesophilic-piezophilic groups, whereas hydrophobic and nonpolar amino acids do so for the thermophilic-piezophilic group. We also observed an overrepresentation of polar, hydrophilic and small amino acid groups in the discriminatory rules of piezophiles from all three temperature ranges, along with aliphatic, nonpolar and hydrophobic groups in the mesophilic-piezophilic and thermophilic-piezophilic groups.
Archive | 2018
Abhigyan Nath; Priyanka Kumari; Radha Chaube
Identification of drug targets and of drug-target interactions are important steps in the drug-discovery pipeline. Successful computational prediction methods can reduce the cost and time demanded by experimental methods. Knowledge of putative drug targets and their interactions can be very useful for drug repurposing. Supervised machine learning methods have been very useful in predicting drug targets and drug-target interactions. Here, we describe in detail how to develop prediction models with supervised learning techniques for human drug targets and their interactions.
Neurocomputing | 2018
Abhigyan Nath; Karthikeyan Subbiah
Antifreeze proteins (AFPs) are proteins that inhibit the ice nucleation process and thereby enable certain organisms to survive in sub-zero habitats. AFPs are thought to have evolved from different protein families to perform the unique function of antifreeze activity, and they are a classical example of convergent evolution. Common sequence similarity search methods have failed to predict putative AFPs because of the poor sequence and structural similarity among the different AFP sub-types. Machine learning techniques are a viable alternative for predicting putative AFPs. In this paper, we discuss the criteria (such as apposite feature selection, balanced data sets and complete learning) that need to be taken into account for the successful application of machine learning methods, and we implement these criteria using a clustering procedure in order to achieve the true performance of the learning algorithms. Diversified and representative training and testing data sets are crucial for perfect learning as well as true testing of machine learning based prediction methods, for two reasons. First, a training data set that lacks a definable subset of input patterns makes prediction of patterns belonging to this subset either difficult or unfeasible (resulting in incomplete learning). Second, a testing data set that lacks a definable subset of input patterns gives no indication of whether the classifier can correctly predict this subset of patterns (resulting in incomplete testing). Moreover, balanced training and testing data sets are equally important for achieving the true (robust) performance of classifiers, because a well-balanced training set eliminates the bias of the classifier toward a particular class or sub-class caused by over-representation or under-representation of input patterns belonging to those classes or sub-classes. We used the K-means clustering algorithm to create diversified and balanced training and testing data sets, overcoming the shortcoming of random splitting, which cannot guarantee representative training and testing sets. The clustering-based optimal splitting criteria proved to be better than random splitting for creating training and testing sets in terms of superior generalization and robust evaluation.
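A clustering-based split can be sketched as follows, assuming that cluster assignments (for example from K-means) have already been computed for every instance. The function name and the proportional per-cluster test fraction are assumptions; the abstract does not give the exact selection rule.

```python
import random
from collections import defaultdict


def cluster_stratified_split(cluster_ids, test_fraction=0.25, seed=0):
    """Split instance indices so every cluster appears in train and test.

    cluster_ids[i] is the cluster label of instance i. Within each
    cluster, members are shuffled and a fixed fraction goes to the
    test set. Singleton clusters stay in the training set.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    train, test = [], []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        rng.shuffle(members)
        if len(members) > 1:
            # At least one test instance, but never the whole cluster.
            n_test = max(1, round(len(members) * test_fraction))
            n_test = min(n_test, len(members) - 1)
        else:
            n_test = 0
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)
```

Because each cluster contributes to both sets, no definable subset of input patterns is absent from training or from testing, which a purely random split cannot guarantee. A balanced variant would additionally cap the number of instances drawn from each cluster at the same count.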
Journal of Theoretical Biology | 2018
Abhigyan Nath; S. Karthikeyan
In yeast and in some mammals, recombination frequencies are high at some genomic locations, known as recombination hotspots, whereas locations where recombination is below average are known as coldspots. Knowledge of the hotspot regions gives clues for understanding the meiotic process and the possible effects of sequence variation in these regions. Moreover, accurate information about the hotspot and coldspot regions can reveal insights into genome evolution. In the present work, we used class-specific autoencoders for feature extraction and reduction. The deep features extracted by the autoencoders were then used to train three different classifiers, namely gradient boosting machines, random forest and deep learning neural networks, for predicting the hotspot and coldspot regions. To select the best set of deep features, a comparative performance analysis was carried out on deep features extracted by autoencoders from different sets of the training data. Learning algorithms trained on features extracted from the combined class-specific autoencoder outperformed the same algorithms trained on other sets of deep features. The combined class-specific autoencoder based feature extraction can therefore be applied to a growing range of biological problems to achieve superior prediction performance.