E3-targetPred: Prediction of E3-Target Proteins Using Deep Latent Space Encoding
Seongyong Park, Shujaat Khan, Abdul Wahab
Abstract—Understanding E3 ligase and target substrate interactions is important for cell biology and therapeutic development. However, experimental identification of E3-target relationships is not an easy task due to the labor-intensive nature of the experiments. In this article, a sequence-based E3-target prediction model is proposed for the first time. The proposed framework utilizes the composition of k-spaced amino acid pairs (CKSAAP) to learn the relationship between E3 ligases and their target proteins. A class-separable latent space encoding scheme is also devised that provides a compressed representation of the feature space. A thorough ablation study is performed to identify an optimal gap size for CKSAAP and the number of latent variables that can represent the E3-target relationship successfully. The proposed scheme is evaluated on an independent dataset for a variety of standard quantitative measures and, in particular, achieves an average accuracy of 0.729 on that dataset. The source code and datasets used in the study are available at the author's GitHub page (https://github.com/psychemistz/E3targetPred).

Index Terms—Ubiquitination, E3 ligase, Deep Latent Space Encoding, Composition of K-Spaced Amino Acid Pairs, Protein-Protein Interaction
I. INTRODUCTION

In eukaryotic cells, proteolysis is mediated by two major organelles, lysosomes and proteasomes; however, 80-90% of intracellular proteolysis is mediated by proteasomes [1]. Protein degradation by proteasomes is controlled by ubiquitination, in which the target substrate is covalently conjugated, through an enzyme cascade, to a small polypeptide called ubiquitin. Repetition of this reaction produces a poly-ubiquitin chain on the substrate that can be a target of the 26S proteasome [2]. There is a known family of enzymes that mediates the covalent attachment of ubiquitin to substrates, called E1, E2, and E3. E1 is a ubiquitin-activating enzyme that forms a thiol-ester bond at the carboxy-terminal glycine of ubiquitin. Once ubiquitin is activated, E2, a ubiquitin-conjugating enzyme, catalyzes the trans-thiolation reaction between E1 and E2 to form an E2-ubiquitin conjugate. The ubiquitin ligase, E3, mediates the transfer of ubiquitin from the E2-ubiquitin conjugate to the target protein, most commonly onto the ε-amino group of a lysine residue on the protein substrate [3]. Therefore, an E3 ligase is a scaffold protein that recognizes E2 and the target protein simultaneously.
Seongyong Park and Shujaat Khan are with the Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea, 34141 (e-mail: {sypark0215, shujaat}@kaist.ac.kr). Abdul Wahab is with the Department of Mathematics, School of Natural Sciences, National University of Sciences and Technology (NUST), Sector H-12, 44000, Islamabad, Pakistan (e-mail: [email protected]).

The ubiquitin-proteasome system (UPS) plays a central role in various cellular processes such as the cell cycle, signal transduction, gene expression, development, and protein folding [4], and it plays a significant role in disease biology, for instance in neurodegenerative diseases [5] and cancer [6]. Since the substrate specificity of ubiquitination is determined by the E3 ligase, many experimental and computational studies have been conducted to discover the relationships between E3s and their target proteins. Various experimental techniques have been developed to elucidate E3-target interactions, such as global protein stability (GPS) profiling [7], protein microarrays [8], phage display [9], and mass spectroscopy [10]. Unfortunately, due to the low expression levels of the substrates and their inherently weak interactions, it is hard to find these relationships by experimental techniques alone. Towards this end, some recent studies have proposed novel computational approaches [11]-[13]. E3Net was designed to comprehensively collect available E3s and their substrate information through text mining [11]. In [12], UbiNet combined E3Net [11] with experimentally verified E3-targets and also rendered some manually curated E3-target relations to provide an integrated resource for E3-target relations. Recently, Li et al.
[13] designed a Naive-Bayes model that predicts possible E3-target relationships by combining ortholog, network topology, domain, and function information.

In this paper, a novel sequence-based E3-target prediction method is proposed for the first time that does not require complicated feature engineering, e.g., the extraction of ortholog, network topology, domain, and function information. In particular, the proposed method only uses E3 and target protein sequences to extract the composition of k-spaced amino acid pairs (CKSAAP). It is hypothesized that there is sufficient discriminatory information available in sequence-derived features, learnable from known E3-target relationships, to design a generalized E3-target prediction model. To avoid sequence homology problems, the framework is substantiated through known human E3-target relationships.

The article is organized as follows. In Section II, details of the dataset, feature extraction, and latent space encoding are discussed, followed by the ablation study and experimental results in Section III. In Section IV, a discussion on the latent space encoding of predicted E3-targets is given. The findings of the article are concluded in Section V.

II. PROPOSED METHODS
A. Dataset
The gold standard positive E3-target relationships proposed in the literature are used as positive data. In particular, we used the manually curated E3-target interactions of [13], extracted using E3miner (an E3-target relational text-mining tool) from PubMed abstracts. For a gold standard negative dataset, we collected non-interacting E3 PPIs from Negatome 2.0 [14] and NIP [15].

To design the proposed E3-targetPred model, we randomly selected positive and negative E3-target relationships as the training dataset, out of which a subset of positive and negative relationships was randomly held out as the validation set. The remaining samples are used as the test dataset. Furthermore, an independent dataset of E3-target relations was collected from the ESI network study [16] and Negatome 2.0 [14]. The independent dataset is designed to test the model's generalization power in different scenarios, namely different seen and unseen E3-target pairs. Here, "seen" means E3s or targets that are available in the development dataset (used for training, validation, and testing), although their specific pairs are not part of the development dataset. Similarly, "unseen" means that the particular E3s or targets are not part of the development dataset. The statistics of the independent test dataset are summarized in Table I.
TABLE I
SAMPLE STATISTICS SUMMARY OF INDEPENDENT TEST DATASET

Dataset                      N E3   N Target   N Relation   N Positive
Seen E3, Seen Target          108        590          914          321
Seen E3, Unseen Target        115        512          940          680
Unseen E3, Seen Target        112        250          369           84
Unseen E3, Unseen Target       95        350          415           60
Total                         282      1,352        2,638        1,145
B. Feature Extraction
In order to design a machine-learning-based classification model for E3-target prediction, we propose to use CKSAAP features. In CKSAAP, the frequency of amino acid pairs is calculated, where a pair is defined with a gap of j = 0, ..., k residues, i.e., k-spaced amino acid pairs. For example, for k = 2, three different forms of pairs are calculated, spaced by zero, one, and two peptide residues. An illustrative example of CKSAAP composition is presented in Fig. 1. In particular, a feature vector FV using CKSAAP for a gap value of k = 2 can be obtained as

$$FV_{j=0} = \left(\frac{F_{AA}}{N}, \frac{F_{AC}}{N}, \frac{F_{AD}}{N}, \ldots, \frac{F_{YY}}{N}\right), \tag{1}$$

$$FV_{j=1} = \left(\frac{F_{AxA}}{N}, \frac{F_{AxC}}{N}, \frac{F_{AxD}}{N}, \ldots, \frac{F_{YxY}}{N}\right), \tag{2}$$

$$FV_{j=2} = \left(\frac{F_{AxxA}}{N}, \frac{F_{AxxC}}{N}, \frac{F_{AxxD}}{N}, \ldots, \frac{F_{YxxY}}{N}\right), \tag{3}$$

$$FV_{k=2} = FV_{j=0} \mathbin{+\!+} FV_{j=1} \mathbin{+\!+} FV_{j=2} \in \mathbb{R}^{400(k+1)} = \mathbb{R}^{1200}, \tag{4}$$

where N is the sequence length, x is the gap or skipped residue, F is the frequency count of an amino acid pair, ++ denotes the concatenation operation, and A, C, D, ..., Y are the standard symbols for the amino acids (400 being the number of ordered pairs of the 20 standard residues).

To generate a combined representation of both the E3 and its target protein, we generated CKSAAP features from the individual and the concatenated protein sequences. In total, we generated three representations of E3 and target proteins: (1) E3-sequence features (E3SF), (2) target-sequence features (TSF), and (3) concatenated-sequence features called the pair feature vector (PFV). The lengths of all three feature vectors are chosen to be the same, i.e., the same gap value is used to extract CKSAAP features from all sequences. To avoid learning bias towards a particular E3 or its target protein, we designed the modified auto-encoding scheme shown in Fig. 2. The objective is to learn a coupled latent space that can represent both the E3 and its target with minimal complexity. Towards this end, the auto-encoder model is designed to reproduce the PFV from the E3SF and TSF representations. This coupled learning helps avoid input-sequence bias towards a particular E3 or target sequence, which is essential for a multi-input learning scenario. A detailed description of the proposed model is provided in the next section.
C. Deep Latent Space Encoding
One big advantage of the CKSAAP encoding scheme is its high resolution, since it can encode various combinations of amino acid pairs at different levels of granularity. However, high resolution comes at the price of noise and an inflated feature space (i.e., a high number of uninformative or redundant features). In machine learning, it is well understood that a robust classifier needs a feature set that is minimally redundant and maximally relevant to the class label [17], [18]. Therefore, to reduce the dimensions of the feature space, a variety of machine learning strategies have been proposed, such as component analysis [19], information gain [20], and kernel or latent space encoding [21].

Handcrafted features, or the manual curation of important features through feature engineering, are useful when a direct or simple (e.g., linear) relationship between the feature and the class label is required. In other words, manual curation is essential when a linear and easily interpretable relationship between descriptor and target is needed to design experimentally verifiable studies. However, in most machine-learning-based classification tasks, the descriptors are non-linearly related to the class labels and show high inter-class variability. This poses a serious challenge for the selection of a useful feature space.

Effective mapping of the feature space to the target label is a challenging task due to the large feature space, non-linearity, and low inter-class and high intra-class variability. Towards this end, an auto-encoder-based latent space encoding (AE-LSE) scheme is designed. In the AE-LSE scheme, the latent space of the auto-encoder is fed to the classifier, which imposes a constraint on the auto-encoder to learn not only descriptive features that are useful for minimizing the reconstruction loss, but also the most discriminating features, which are helpful for designing a powerful classifier model. Through this simultaneous optimization of the auto-encoder and classifier models, a noise-free latent space representation can be learned. The architecture of the proposed AE-LSE-based model is shown in Fig. 2.
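The simultaneous optimization above amounts to minimizing a weighted sum of the two losses. The following NumPy sketch (illustrative only; `combined_loss` and the toy arrays are not from the paper) shows the joint objective, with a weight `lam` on the decoder's reconstruction loss and `1 - lam` on the classifier's cross-entropy:

```python
import numpy as np

def combined_loss(pfv_true, pfv_pred, y_true, y_pred, lam):
    """Joint AE-LSE objective: lam weighs the decoder's reconstruction
    loss (MSE against the pair feature vector) and (1 - lam) weighs the
    classifier's binary cross-entropy on the E3-target label."""
    mse = np.mean((pfv_true - pfv_pred) ** 2)          # decoder loss
    eps = 1e-12                                        # numerical safety for log
    bce = -np.mean(y_true * np.log(y_pred + eps)
                   + (1 - y_true) * np.log(1 - y_pred + eps))
    return lam * mse + (1 - lam) * bce

# Toy example: a 3-dimensional "PFV", its reconstruction, and one label.
pfv = np.array([0.2, 0.0, 0.1])
recon = np.array([0.1, 0.0, 0.2])
y, p = np.array([1.0]), np.array([0.8])
print(round(combined_loss(pfv, recon, y, p, lam=0.5), 4))  # → 0.1149
```

During training, gradients of this single scalar flow into both the decoder and the classifier branches, which is what forces the shared latent space to be simultaneously descriptive and discriminative.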
Fig. 1. Illustration of CKSAAP descriptor calculation for k = 2.

Fig. 2. Proposed latent-space encoding-based E3-target prediction model.

D. Training Configurations
The model consists of five encoding and five decoding modules, each consisting of batch normalization, dropout, and a ReLU activation function. The first encoding block has 50 neurons, with subsequent blocks narrowing toward the latent dimension. The configuration of the encoding and decoding blocks is chosen to be mirror-symmetric, as suggested in [22]; that is, the number of neurons in the first hidden layer of the encoder is the same as the number of neurons in the last hidden layer of the decoder. The same holds for the second, third, and fourth hidden layers of the auto-encoder, except for the input and output layers, where the number of inputs (E3SF + TSF) is twice the number of outputs (PFV). The classification module consists of three hidden layers. For the output layer, the soft-max activation function is used, and its size is set equal to the number of class labels, i.e., two (E3-target, non-target). The model is implemented in Python on the TensorFlow Keras platform [23] for the variable latent space sizes (LV) and gap values (k) specified in Section II-B. For parameter optimization, two loss functions, i.e., the auto-encoder loss (mean squared error) and the classifier loss (binary cross-entropy), are minimized using the Adam optimizer with default learning rates and early stopping. The model used in this study is furnished online at https://github.com/psychemistz/E3targetPred.

E. Evaluation Parameters
For a classification model, there are many widely used performance statistics. In particular, Youden's index (YI, or Youden's J statistic) [24], the Matthews correlation coefficient (MCC), and balanced accuracy (BACC) are considered to be the most comprehensive performance criteria. The MCC ranges from −1 to 1, where MCC = 1 and MCC = −1 are the best and the worst predictions, respectively, and MCC = 0 indicates a random guess. For highly imbalanced test datasets, balanced accuracy and YI are informative ways of summarizing the results of a diagnostic experiment. YI ranges from 0 to 1, where 0 indicates the worst performance while 1 shows perfect results with no false positives and no false negatives. The proposed algorithm is extensively evaluated on these measures. For class-specific results, we also provide the true positive rate (sensitivity) and true negative rate (specificity). Besides, we also provide reconstruction error statistics of the proposed AE-LSE scheme. The reconstruction error is reported as the peak signal-to-noise ratio (PSNR), calculated between the decoded output (estimated PFV) and the original PFV.

Fig. 3. Work-flow and ablation study diagram for the E3-targetPred method.
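The scalar measures above follow directly from confusion-matrix counts. The sketch below is illustrative (the helper names are mine, and the PSNR assumes a unit-peak signal, which is an assumption rather than a detail stated in the paper):

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, balanced accuracy, Youden's index, and
    Matthews correlation coefficient from confusion-matrix counts."""
    sen = tp / (tp + fn)                        # true positive rate
    spe = tn / (tn + fp)                        # true negative rate
    bacc = (sen + spe) / 2
    yi = sen + spe - 1                          # Youden's J statistic
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "BACC": bacc, "YI": yi, "MCC": mcc}

def psnr(mse, peak=1.0):
    """Reconstruction quality in dB, assuming features scaled to [0, peak]."""
    return 10 * math.log10(peak ** 2 / mse)

m = binary_metrics(tp=80, fn=20, tn=70, fp=30)
print(round(m["BACC"], 2), round(m["YI"], 2))  # → 0.75 0.5
```

Note that BACC and YI carry the same information (YI = 2·BACC − 1), which is why both are well suited to the imbalanced independent test sets in Table I.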
III. RESULTS
A. Ablation Study
To investigate the effect of hyper-parameters on the performance of the proposed model, we designed an ablation study over the number of gaps (k) in CKSAAP and the number of latent variables (LV) in the AE-LSE model. We also compared three weightings of the objective functions, λ = 0.01, 0.50, 0.99, with λ applied to the decoder loss and (1 − λ) to the classifier loss. A workflow diagram summarizing the ablation study is shown in Fig. 3(a). Test balanced accuracies for different combinations of LV, k, and λ are shown in Table II. The objective of this thorough evaluation is to identify the best combination of all three parameters. In particular, we are interested in designing a powerful classifier with minimum complexity, specifically with a latent space that can be used to easily visualize the clusters of E3-target and non-target pairs. The smallest value of the gap ensures minimum redundancy in the original feature space, which also reduces the computational overhead.

In Table II, the means and standard deviations of the test balanced accuracy distributions over random trials are presented for different combinations of LV, k, and λ. Overall, the proposed model shows stable performance for all given configurations, and the best BACC value is achieved with the largest decoder weight, k = 6, and LV = 2. Therefore, the least complex best model in this study (i.e., k = 6 and LV = 2) is chosen for further testing. One big advantage of the LV = 2 model is that its latent space can be visualized easily, which can help in finding useful clusters for further studies.

TABLE II
BALANCED ACCURACY RESULTS OF ABLATION STUDY ON λ, GAP (k), AND LV PARAMETERS

The distributions of performance statistics for the three values of λ are delineated in Fig. 4 to elucidate the effect of λ on the other performance measures. In Figs. 4(A) and 4(B), the distributions of BACC are shown for different LV and gap combinations, respectively. It is evident from the graphs that increasing the decoder weight λ improves the performance in all cases. A similar trend is observed in Figs. 4(C) and 4(D) for the MCC measure. However, it is more pronounced in Figs. 4(E) and 4(F) for the decoder's reconstruction loss, which is measured in terms of the PSNR. To better understand the origin of this performance gain with increased λ, we analyze the latent space encoding schemes later on.

Fig. 4. Ablation study: the number of latent variables (LV) and gaps in the CKSAAP features are optimized within the Gap × LV parameter space. Decoder weights of 0.01 (blue), 0.50 (yellow), and 0.99 (green) are shown for the designated LV and gap values.

B. Performance of the Best Model
In Fig. 3(b), the workflow diagram of the proposed E3-target study is shown. We chose the best model based on the results of the ablation study. As substantiated earlier, LV = 2 and Gap = 6 performed best among all the tested combinations. To evaluate the performance, we used BACC as the representative measure. The difference in performance between the validation and test sets is very small, and the model successfully predicts unseen E3-target relationships in the testing phase (see Table III); in particular, there are only small drops relative to the validation BACC, MCC, and YI, and no difference in the reconstruction error. One important aspect of the proposed model is that the difference between sensitivity and specificity is very low. This suggests that the model is well balanced and has a negligible bias toward a particular class type. This unbiased behavior is important, especially for a novel prediction case where no prior information about the given sample is known.

C. Comparison of Latent Space Encoding Schemes
To understand the origin of the robust performance, we compared the latent space of the proposed model (trained with different weight combinations of the decoder and classifier) with standard dimension-reduction methods. In Fig. 5, PCA, t-SNE [25], UMAP [26], and a plain auto-encoder are compared. It can be easily seen that the conventional methods fail to recover clusters of positive and negative E3-targets from the original feature space. Especially in the case of PCA and the auto-encoder, there is a complete overlap between positive and negative samples. In the case of t-SNE and UMAP, there are a number of small clusters, but there are no clear distinguishing boundaries between the two classes.
TABLE III
PERFORMANCE STATISTICS OF THE (k = 6, LV = 2) MODEL

Mean µ_x and standard deviation σ_x of each measure on the x ∈ {train, valid, test} datasets over random trials. BACC: balanced accuracy, Sen: sensitivity, Spe: specificity, MCC: Matthews correlation coefficient, YI: Youden's index, PSNR: peak signal-to-noise ratio in dB, −10 log₁₀(MSE).

Fig. 5. Comparison of feature space embeddings: (A) PCA, (B) t-SNE [25], (C) UMAP [26], and (D) auto-encoder.
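As a point of reference for the comparison in Fig. 5, a two-component PCA baseline can be sketched via the SVD. This is illustrative code, not the paper's pipeline; the random matrix merely stands in for CKSAAP-sized feature vectors:

```python
import numpy as np

def pca_2d(X):
    """Project samples onto their first two principal components via SVD,
    a stand-in for the PCA baseline of Fig. 5(A)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n_samples, 2) embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1200))                 # e.g., 1200-dim CKSAAP features
Z = pca_2d(X)
print(Z.shape)  # → (100, 2)
```

Because PCA only finds directions of maximal variance, not class separation, such a projection can completely overlap the positive and negative classes, which is exactly what motivates the supervised latent space of the AE-LSE model.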
In Fig. 6, the latent space learned by the proposed method is visualized. Interestingly, the positive and negative classes are mutually exclusive in all cases. However, as expected, with a large decoder weight the model learns a latent space with more variance. Indeed, since the training attention is shifted towards minimizing the reconstruction loss, the model has more freedom to represent individual samples with their variability. On the other hand, with a low decoder weight the attention of learning is towards minimizing the classification loss, and therefore the clusters are more compact. The case of equal weights is more challenging: since both losses are equally weighted, a tug of war arises between the two objectives, the model is likely to learn a noisy latent space, and the performance of both the classifier and the decoder drops (see, e.g., Fig. 4).

D. Validation on Independent Test Dataset
To evaluate the generalization power of the proposed method, we tested the model on a completely independent test dataset. We used the E3-substrate relation dataset curated by the ESI network study [16] and Negatome 2.0 [14], keeping only catalogued E3-target relations that are not included in our development dataset. We estimated the generalization performance of the
Fig. 6. Visualization of the trained latent space (Gap = 6, LV = 2): positive (blue) and negative (yellow) E3-target relations in the training set (left) and test dataset (right), mapped to the latent space trained with λ = 0.01, λ = 0.50, and λ = 0.99 (top to bottom). In one case, negative relations are clustered on the right side of the latent space while positive relations are clustered on the left side. The clustering of positive and negative data points is well maintained for the test dataset as well.

model by dividing it into four cases according to whether the E3s or targets were included in the training dataset (see Table I). The results are summarized in Table IV. Overall, similar to the test dataset, the model shows balanced total performance on the independent dataset. However, as expected, the performance in the seen-protein cases is higher than in the unseen cases. In particular, the balanced accuracy of the best model was around 0.814 in the seen E3/target case, but fell to 0.586 in the unseen E3/target case. In summary, although the performance decreased in totally unseen E3/target cases, our model successfully retrieved about 71% of the known E3-target relations in the independent dataset.

TABLE IV
PERFORMANCE OF THE BEST MODEL ON INDEPENDENT DATASETS

Dataset                      BACC    Sen     Spe     MCC     YI      ACC
Seen E3, Seen Target         0.814   0.763   0.865   0.627   0.628   0.829
Seen E3, Unseen Target       0.770   0.682   0.858   0.484   0.540   0.731
Unseen E3, Seen Target       0.705   0.810   0.600   0.343   0.410   0.648
Unseen E3, Unseen Target     0.586   0.600   0.572   0.121   0.172   0.576
Total                        0.727   0.710   0.743   0.452   0.454   0.729
IV. DISCUSSION
In this study, we proposed an E3-target prediction model based only on the sequences of the two proteins. The proposed model was validated on an independent test dataset drawn from previous publications. We also verified that a trained E3-target prediction model can generalize to unseen E3s as well as unseen targets. This property of our model is attractive compared to previous works.

Several studies have been proposed to predict E3-substrate relationships. For example, Kai-Yao Huang et al. [27] provided an interaction network viewer of E3s and targets. Van-Nui Nguyen et al. [12] proposed a web resource for exploring ubiquitination networks, called UbiNet. Yang Li et al. [13] proposed an integrated platform of E3-target relations called UbiBrowser, providing a Naive-Bayes classifier for E3-target relation prediction based on multiple pieces of evidence such as homology, PPIs, and Gene Ontology. Di Chen et al. [16] proposed a multidimensional characterization of the E3-target interaction network by combining multiple sources; in particular, they trained a classifier of E3-target relations based on expression datasets and features derived from network/pathway information.

The aforementioned works provided alternative views of E3-target relations, but none of the proposed models utilized sequence-derived features alone. We hypothesized that if an E3 and its target protein interact with each other, there are classifiable features embedded in the composite sequence feature space of the E3 and target.

Since our model utilizes only the sequences of the E3 and target, if the positive and negative relationships between E3 and target are not well characterized, the proposed model will be under-powered. Our LSE model has a unique advantage in this regard since it learns a classifiable latent space while keeping the characteristics of the original features. In the proposed LSE, we can inspect whether predicted E3-target relations are closely distributed or not. Fig.
6(A) shows the projection of the data points in the training dataset into the trained latent space with LV = 2. We can see that our model learned a classifiable latent space that yields positive and negative class clusters. Fig. 6(B) shows the projection of the data points in the test dataset. Although the classification boundary is not as clear as in the training case, the test set also shows the distinction between the positive and negative classes. Therefore, this latent-space-based representation of E3-target relations enables us to calculate confidence in, or refine the decision threshold of, a predicted E3-target relation to reduce potential false positives.

V. CONCLUSION
In this study, for the first time, we proposed a sequence-based E3-target relation prediction model called E3-targetPred. Based on a comprehensive ablation study, we characterized the optimal number of gaps and latent variables to be utilized in our model. Besides, we compared the performance of conventional latent space encoding schemes and substantiated that the proposed model provides separable clusters in latent space. E3-targetPred has a unique advantage compared to conventional models, mainly due to the latent-space-learning property of the model. Since it learns a non-linear embedding of the features while keeping the properties of the original feature space, we can further filter out suspicious data points based on the distance between a data point and the class center. This further increases the confidence of the classifier under noisy annotations. The code and dataset utilized in this work are provided at the author's GitHub page (https://github.com/psychemistz/E3targetPred).

REFERENCES

[1] D. H. Lee and A. L. Goldberg, "Proteasome inhibitors: valuable new tools for cell biologists,"
Trends in Cell Biology, vol. 8, no. 10, pp. 397–403, 1998.
[2] A. von Mikecz, "The nuclear ubiquitin-proteasome system," Journal of Cell Science, vol. 119, no. 10, pp. 1977–1984, 2006.
[3] D. Hoeller and I. Dikic, "Targeting the ubiquitin system in cancer therapy," Nature, vol. 458, no. 7237, pp. 438–444, 2009.
[4] M. Kirschner, "Intracellular proteolysis," Trends in Biochemical Sciences, vol. 24, no. 12, pp. M42–M45, 1999.
[5] C. A. Ross and C. M. Pickart, "The ubiquitin–proteasome pathway in Parkinson's disease and other neurodegenerative diseases," Trends in Cell Biology, vol. 14, no. 12, pp. 703–711, 2004.
[6] Z. Ge, J. S. Leighton, Y. Wang, X. Peng, Z. Chen, H. Chen, Y. Sun, F. Yao, J. Li, H. Zhang et al., "Integrated genomic analysis of the ubiquitin pathway across cancer types," Cell Reports, vol. 23, no. 1, pp. 213–226, 2018.
[7] H.-C. S. Yen and S. J. Elledge, "Identification of SCF ubiquitin ligase substrates by global protein stability profiling," Science, vol. 322, no. 5903, pp. 923–929, 2008.
[8] Y. Merbl and M. W. Kirschner, "Large-scale detection of ubiquitination substrates using cell extracts and protein microarrays," Proceedings of the National Academy of Sciences, vol. 106, no. 8, pp. 2543–2548, 2009.
[9] Z. Guo, X. Wang, H. Li, and Y. Gao, "Screening E3 substrates using a live phage display library," PLoS ONE, vol. 8, no. 10, 2013.
[10] K. Yumimoto, M. Matsumoto, K. Oyamada, T. Moroishi, and K. I. Nakayama, "Comprehensive identification of substrates for F-box proteins by differential proteomics analysis," Journal of Proteome Research, vol. 11, no. 6, pp. 3175–3185, 2012.
[11] Y. Han, H. Lee, J. C. Park, and G.-S. Yi, "E3Net: a system for exploring E3-mediated regulatory networks of cellular functions," Molecular & Cellular Proteomics, vol. 11, no. 4, 2012.
[12] V.-N. Nguyen, K.-Y. Huang, J. T.-Y. Weng, K. R. Lai, and T.-Y. Lee, "UbiNet: an online resource for exploring the functional associations and regulatory networks of protein ubiquitylation," Database, vol. 2016, 2016.
[13] Y. Li, P. Xie, L. Lu, J. Wang, L. Diao, Z. Liu, F. Guo, Y. He, Y. Liu, Q. Huang et al., "An integrated bioinformatics platform for investigating the human E3 ubiquitin ligase-substrate interaction network," Nature Communications, vol. 8, no. 1, pp. 1–9, 2017.
[14] P. Blohm, G. Frishman, P. Smialowski, F. Goebels, B. Wachinger, A. Ruepp, and D. Frishman, "Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis," Nucleic Acids Research, vol. 42, no. D1, pp. D396–D400, 2014.
[15] L. Zhang, G. Yu, M. Guo, and J. Wang, "Predicting protein-protein interactions using high-quality non-interacting pairs," BMC Bioinformatics, vol. 19, no. 19, p. 525, 2018.
[16] D. Chen, X. Liu, T. Xia, D. S. Tekcham, W. Wang, H. Chen, T. Li, C. Lu, Z. Ning, X. Liu et al., "A multidimensional characterization of E3 ubiquitin ligase and substrate interaction network," iScience, vol. 16, pp. 177–191, 2019.
[17] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[18] I. Naseem, S. Khan, R. Togneri, and M. Bennamoun, "ECMSRC: A sparse learning approach for the prediction of extracellular matrix proteins," Current Bioinformatics, vol. 12, no. 4, pp. 361–368, 2017.
[19] L. Wang, Z.-H. You, X. Yan, S.-X. Xia, F. Liu, L.-P. Li, W. Zhang, and Y. Zhou, "Using two-dimensional principal component analysis and rotation forest for prediction of protein-protein interactions," Scientific Reports, vol. 8, no. 1, pp. 1–10, 2018.
[20] S. Khan, I. Naseem, R. Togneri, and M. Bennamoun, "RAFP-Pred: Robust prediction of antifreeze proteins using localized analysis of n-peptide compositions," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 15, no. 1, pp. 244–250, 2016.
[21] M. Usman, S. Khan, and J.-A. Lee, "AFP-LSE: Antifreeze proteins prediction using latent space encoding of composition of k-spaced amino acid pairs," Scientific Reports, vol. 10, no. 1, pp. 1–13, 2020.
[22] J. C. Ye, Y. Han, and E. Cha, "Deep convolutional framelets: A general deep learning framework for inverse problems," SIAM Journal on Imaging Sciences, vol. 11, no. 2, pp. 991–1048, 2018.
[23] F. Chollet et al., "Keras," 2015.
[24] W. J. Youden, "Index for rating diagnostic tests," Cancer, vol. 3, no. 1, pp. 32–35, 1950.
[25] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[26] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform manifold approximation and projection for dimension reduction," arXiv preprint arXiv:1802.03426, 2018.
[27] K.-Y. Huang, J. T.-Y. Weng, T.-Y. Lee, and S.-L. Weng, "A new scheme to discover functional associations and regulatory networks of E3 ubiquitin ligases," in