Frontiers in Bioengineering and Biotechnology | 2021
Editorial: Feature Representation and Learning Methods With Applications in Protein Secondary Structure
Abstract
In recent years, the rise of machine learning methods, especially deep learning, has greatly promoted the prediction of protein secondary structure. Such methods not only make better use of the exponentially growing mass of protein sequence data, but are also able to automatically mine complex, latent patterns hidden in the data. Although significant progress has been made, we still face the challenge of predicting protein secondary structure directly from protein sequences with improved accuracy. Eleven articles were published in the special issue Feature Representation and Learning Methods With Applications in Protein Secondary Structure. The authors describe computational methods and techniques for protein secondary structure prediction, and present and discuss the latest algorithmic developments in feature extraction, dimension reduction, unbalanced classification, and related areas. The papers provide good references both for those new to the field and for experienced researchers.

Guo et al. established a sequence-based model to classify thermophilic and non-thermophilic proteins. After feature extraction with iFeature, MRMD2.0 was applied for feature selection and dimension reduction, and LIBSVM was used to optimize the model parameters and build the prediction model. Compared with LMT, Logistic, Random Forest, BayesNet, REPTree, and J48, this model achieved the highest prediction rates (SE: 95.85%, SP: 96.22%, ACC: 96.02%).

Li et al. constructed a support vector machine-based model, Vote9, to identify antioxidant proteins. Sequence features were extracted using reduced amino acid compositions and the optimal g-gap dipeptide compositions from nine optimal individual models.

Gu et al. distinguished GPCRs from non-GPCRs using CTDC feature extraction and MRMD2.0 dimension reduction.
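Several of these studies build their classifiers on simple sequence-derived descriptors such as amino acid composition and g-gap dipeptide composition. As a minimal illustration (not the actual iFeature implementation, which supports many more descriptors), these two feature types can be sketched in plain Python, assuming sequences over the standard 20-letter amino acid alphabet:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq):
    """Amino acid composition: relative frequency of each of the 20 residues (20-dim)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def g_gap_dipeptide(seq, g=1):
    """g-gap dipeptide composition: frequency of residue pairs whose positions
    are separated by g intervening residues (400-dim)."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = {p: 0 for p in pairs}
    total = len(seq) - g - 1  # number of (i, i+g+1) pairs in the sequence
    for i in range(total):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[p] / total for p in pairs]

# Concatenating descriptors yields a fixed-length feature vector per protein
features = aac("MKTAYIAKQR") + g_gap_dipeptide("MKTAYIAKQR", g=1)  # 20 + 400 = 420 dims
```

Fixed-length vectors like this are what allow variable-length protein sequences to be fed to conventional classifiers such as SVMs or random forests.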
The authors found that different feature extraction methods combined with the same dimensionality reduction method had different effects on distinguishing GPCRs from non-GPCRs. The correct classification rates on the five independent test sets were 90.64, 90.37, 88.04, 93.28, and 95.73%, with an average of 91.61 ± 2.96%.

Jing and Li used amino acid composition, dipeptide composition, position-specific scoring matrix auto-covariance, and auto-covariance of average chemical shift to predict cell wall lytic enzymes. SMOTE was used to counter the imbalanced-data classification problem, and the F-score algorithm was used to remove redundant or irrelevant features. The ACC reached 99.19% under the jackknife test.

Edited and reviewed by: Jean Marie François, Institut Biotechnologique de Toulouse (INSA), France
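The core idea of SMOTE, as used by Jing and Li for the imbalanced cell wall lytic enzyme data, is to synthesize new minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbors. A minimal pure-Python sketch of that idea (production implementations such as imbalanced-learn are vectorized and more configurable):

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples (SMOTE idea): pick a minority
    point, pick one of its k nearest minority neighbors, and interpolate."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x within the minority class (excluding x itself),
        # ranked by squared Euclidean distance
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Example: oversample a 4-point minority class in 2-D feature space
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=4, k=2, seed=42)
```

Because each synthetic point lies on a line segment between two real minority samples, the oversampled class stays within the region the minority data already occupies, unlike naive duplication, which adds no new information.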