bioRxiv | 2021

Exploring Intuitive Approaches to Protein Conformation Clustering Using Regions of High Structural Variance

 

Abstract


This paper presents a method for finding segments of high structural variance across the different conformations of a single protein and clusters them using different distance metrics and interpretations of coordinate and angle data. Three methods are compared: root mean squared deviation (RMSD), a t-distributed Stochastic Neighbor Embedding (t-SNE) based map, and dihedral-based clustering. The methods were applied to the human cyclin-dependent kinase 2 (CDK2) protein (UniProt accession P24941) using a series of Python scripts and clustering packages. We test our methods on CDK2 because it is a highly researched protein that is crucial in the regulation of the cell cycle, because clustering its conformations has practical applications in cancer research, and because a sizeable amount of experimental data has been collected on its conformational structures. While the distance-based RMSD quantifies structure-to-structure dissimilarity between conformations, a simple RMSD matrix lacks the ability to describe subsequence-wise differences in shape and absolute position, which could be the main identifying elements of a protein’s conformation and state. To make up for this loss, we explore an intuitive and more flexible method that can accept multiple high structural variance segments: it takes coordinate-based data through a series of maps, with the help of t-SNE, and represents each segment as a feature in the clustering matrix. This method, however, would require additional testing on other proteins, and possibly modifications, to verify its consistency and test its robustness. Finally, we explore the pros and cons of the three methods as applied to the high structural variance regions. Despite the randomness introduced by the t-SNE used in mapping the coordinates to lower dimensions, the coordinate-based approach consistently performed better than the RMSD and dihedral-based methods in clustering the three groups of the CDK2 protein kinase.
We also found that analyzing only the substructures identified by the high variance detection algorithm consistently provided more distinct clusters with higher multi-class F1 scores.
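
The abstract does not say how the high variance segments are detected, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes every conformation has already been superimposed and reduced to one (x, y, z) point per residue, for example the C-alpha atom. The function names, the `threshold` and `min_length` parameters, and the use of summed per-axis variance are all hypothetical choices made for illustration.

```python
import math
from statistics import pvariance


def per_residue_variance(conformations):
    """Sum of x, y and z population variances for each residue.

    `conformations` is a list of conformations; each conformation is a list
    of (x, y, z) tuples, one per residue, assumed to be pre-aligned.
    """
    n_residues = len(conformations[0])
    variances = []
    for r in range(n_residues):
        total = 0.0
        for axis in range(3):
            total += pvariance([conf[r][axis] for conf in conformations])
        variances.append(total)
    return variances


def high_variance_segments(variances, threshold, min_length=3):
    """Return (start, end) pairs, end exclusive, for contiguous runs of
    residues whose variance exceeds `threshold` (a hypothetical cutoff)."""
    segments = []
    start = None
    for i, value in enumerate(variances):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    if start is not None and len(variances) - start >= min_length:
        segments.append((start, len(variances)))
    return segments


def rmsd(conf_a, conf_b, segment=None):
    """RMSD between two pre-aligned conformations, optionally restricted
    to one (start, end) segment. No superposition is performed here."""
    start, end = segment if segment else (0, len(conf_a))
    squared = sum(math.dist(conf_a[r], conf_b[r]) ** 2 for r in range(start, end))
    return math.sqrt(squared / (end - start))
```

Restricting `rmsd` to a detected segment, rather than the whole chain, mirrors the abstract's observation that analyzing only the high variance substructures gave more distinct clusters. A pairwise matrix of such values could then be passed to any clustering routine that accepts precomputed distances.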

DOI 10.1101/2021.09.05.459014
Language English
Journal bioRxiv

Full Text