Cell Type Identification from Single-Cell Transcriptomic Data via Semi-supervised Learning
Xishuang Dong, Shanta Chowdhury, Uboho Victor, Xiangfang Li, Lijun Qian
Abstract—Cell type identification from single-cell transcriptomic data is a common goal of single-cell RNA sequencing (scRNAseq) data analysis. Neural networks have been employed to identify cell types from scRNAseq data with high performance. However, this requires a large amount of individual cells with accurately and unbiasedly annotated types to build the identification models. Unfortunately, labeling scRNAseq data is cumbersome and time-consuming, as it involves manual inspection of marker genes. To overcome this challenge, we propose a semi-supervised learning model that uses unlabeled scRNAseq cells together with a limited amount of labeled scRNAseq cells to implement cell identification. Firstly, we transform the scRNAseq cells into "gene sentences", which is inspired by similarities between the natural language system and the gene system. Then the genes in these sentences are represented as gene embeddings to reduce data sparsity. With these embeddings, we implement a semi-supervised learning model based on recurrent convolutional neural networks (RCNN), which includes a shared network, a supervised network, and an unsupervised network. The proposed model is evaluated on macosko2015, a large-scale single-cell transcriptomic dataset with ground truth of individual cell types. It is observed that the proposed model is able to achieve encouraging performance by learning on a very limited amount of labeled scRNAseq cells together with a large number of unlabeled scRNAseq cells.
Index Terms—Single-Cell Sequencing, Semi-supervised Learning, Recurrent Convolutional Neural Networks, Joint Optimization
I. INTRODUCTION
Single-cell RNA sequencing (scRNAseq) enables the profiling of the transcriptomes of individual cells, thus characterizing the heterogeneity of biological samples, since scRNAseq experiments are able to yield high volumes of data. For example, a single experiment can produce expression profiles for many thousands of cells, at the level of the single cell [1]. It is not possible for traditional bulk RNAseq [2] to examine biological samples at such high resolution.

One common goal of scRNAseq data analytics is to identify the cell type of each individual cell that has been profiled. Although labeling cells with known cell types is a supervised learning task, it is currently achieved by unsupervised methods with manual input [3]. To accomplish this, cells are first grouped into different clusters in an unsupervised manner [4], and the number of these clusters allows us to approximately determine how many distinct cell types are present.

X. Dong, S. Chowdhury, U. Victor, X. Li and L. Qian are with the Center of Excellence in Research and Education for Big Military Data Intelligence (CREDIT Center) and Center for Computational Systems Biology (CCSB), Department of Electrical and Computer Engineering, Prairie View A&M University, Texas A&M University System, Prairie View, TX 77446, USA. Email: [email protected], [email protected], [email protected], [email protected], [email protected]
Fig. 1. Framework of the proposed semi-supervised learning. Input x is the cell. Cell types are available only for the labeled inputs, and the associated cross-entropy loss component is evaluated only for those. z′ and z″ are the outputs from the supervised bidirectional LSTM RNN and the unsupervised bidirectional LSTM RNN, respectively. We jointly optimize the cross-entropy loss and the mean squared error loss for supervised learning and unsupervised learning with these outputs. ⊕ is the concatenation operation.

To interpret the identity of each cluster, marker genes are identified as those that are uniquely highly expressed in a cluster compared to all other clusters. These canonical markers are then used to assign cell types to the clusters by cross-referencing the markers with lists of previously characterized cell-type-specific markers. However, this approach has several limitations, including the fact that the clusters may not optimally separate single cell types, and certain cell types may not have previously characterized markers. Moreover, these methods are computationally intensive, especially when the number of cells becomes large.

Recently, novel computational methods based on neural networks have been proposed to overcome these limitations [3], [4], since cell type classification based on a large number of genes is much more robust to noise with machine learning models. For example, Ma et al. proposed ACTINN (Automated Cell Type Identification using Neural Networks) [4] with simple neural networks of three neuron layers, which trains on datasets with predefined cell types and predicts cell types for other datasets based on the trained model. It uses all the genes to capture the features for each cell type instead of relying on a limited number of canonical markers. Furthermore, it is much more computationally efficient than traditional approaches.
However, it requires a large amount of individual cells with accurate and unbiased type labels to build datasets for training and testing.

In this paper, we propose a novel deep semi-supervised learning model for the case when only a very limited number of cells are labeled and a large number of cells are unlabeled. The proposed framework is shown in Figure 1. It is trained on cells with predefined cell types and can then be used to predict cell types on new datasets. The cells in scRNAseq data are transformed into "gene sentences" by taking advantage of similarities between the natural language system and the gene system. Furthermore, to overcome data sparsity, we employ word embedding techniques [5] to represent the genes in these sentences as gene vectors. Then, these vectors are input into the proposed semi-supervised neural networks built on recurrent convolutional neural networks (RCNN) [6]. The model consists of three components, namely, a shared bidirectional Long Short-Term Memory Recurrent Neural Network (LSTM RNN), a supervised bidirectional LSTM RNN, and an unsupervised bidirectional LSTM RNN. One path is composed of the shared bidirectional LSTM RNN and the supervised bidirectional LSTM RNN, while the other path consists of the shared bidirectional LSTM RNN and the unsupervised bidirectional LSTM RNN. All data (labeled and unlabeled) are evaluated to generate the mean squared error loss, while only labeled data are evaluated to calculate the cross-entropy loss. Experimental results on macosko2015 [7] demonstrate the effectiveness of the proposed model even when training it with a very limited amount of labeled cells.

The contributions of this study are as follows.
• We represent cells in scRNAseq data via embedding techniques to reduce the sparsity of gene expression values.
• We propose semi-supervised deep learning models with RCNN by jointly training a supervised RCNN and an unsupervised RCNN.
It is shown that the proposed model can learn on unlabeled cells and labeled cells jointly to identify cell types with high performance.
• The proposed model is validated on a large-scale scRNAseq dataset. Experimental results indicate that the new representations of cells enable cell type identification with promising performance. Moreover, the proposed semi-supervised learning model is able to effectively identify cell types by learning on a very limited number of labeled cells and a large amount of unlabeled cells.

II. PROBLEM FORMULATION
Cell type identification on single-cell transcriptomic data is to classify the individual cells into predefined cell types, which is a supervised learning task from the machine learning point of view. Specifically, it is a multi-class classification problem with N cell types in the set C = {c_1, c_2, c_3, ..., c_N}, where N > 1. Each cell belongs to one of the N different types. The goal is to construct a function which, given a new individual cell, will correctly predict the cell type to which the new individual cell belongs. It is defined by

f(x; θ) → c, (1)

where x is an individual cell, θ denotes the parameters in f(·), and c ∈ C. For the scRNAseq data, x is composed of a sequence of gene expression values of the cell. Generally, we will have tens of thousands of gene expression values if we employ scRNAseq techniques to generate data [3], [4]. These gene expression values are input as features to build machine learning models to complete cell type identification. Due to the high dimensionality and data sparsity of the scRNAseq data [8], it is challenging to solve this problem.

III. PROPOSED METHODOLOGY
We propose a semi-supervised recurrent convolutional neural network (SSRCNN) to address the challenge of lacking labeled individual cells for cell type identification from scRNAseq data. The proposed model is based on RCNN [6], and the detailed architecture is shown in Figure 1. The first step is to preprocess the scRNAseq data to reduce the data sparsity [8], [9] by building "gene sentences" and representing the genes with word embedding techniques [5], [10]. Specifically, each cell in the scRNAseq data is composed of thousands of gene expression values. Unfortunately, most of these values are zeros because of the limitations of current single-cell sequencing techniques [9], which would reduce the performance of machine learning models significantly [11], [12]. Therefore, it is important to solve the data sparsity problem for cell type identification.

To overcome the data sparsity, we propose a new technique of "gene embedding" to represent the cells, based on similarities between the gene system and the natural language system, since gene sequences can be treated as a language when we regard the genome as the "book of life" [13]. For example, words can be combined with others to generate new functional "phrases", while different genes can form pathways to control protein generation [14]. With respect to these similarities, we build gene sentences by selecting k genes and employ word2vec [15] to represent these genes, where word2vec is a powerful technique to overcome data sparsity for natural language processing and understanding [15], [16], [17]. We rank the genes in terms of their expression values and select the top k genes to build the gene sentence. Then the genes in the gene sentence are represented as gene embeddings.
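As a concrete illustration, the ranking step above can be sketched in a few lines of Python. This is a minimal sketch; the gene names and expression values below are toy stand-ins for illustration, not values from the dataset.

```python
import numpy as np

def gene_sentence(expr, gene_names, k=50):
    """Rank genes by expression value (descending) and keep the top k,
    turning a long, sparse expression vector into a short 'gene sentence'."""
    order = np.argsort(expr)[::-1][:k]
    return [gene_names[i] for i in order]

# Toy example: six hypothetical genes for one cell (real cells have tens of thousands).
genes = ["Rho", "Gnat1", "Pde6b", "Sag", "Nrl", "Actb"]
expr = np.array([9.1, 0.0, 3.2, 7.5, 0.0, 5.8])

print(gene_sentence(expr, genes, k=4))  # ['Rho', 'Sag', 'Actb', 'Pde6b']
```

The resulting sentences, one per cell, can then be passed to a word2vec implementation (e.g. gensim's Word2Vec) to learn the gene embeddings.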
Learning of SSRCNN

Require: training samples x_i, the set of labeled samples S, labels y_i for x_i (i ∈ S)
for t in [1, num_epochs] do
    for each minibatch B do
        x′_{i∈B} ← f_r(x_{i∈B})                          ▷ preprocessing
        x″_{i∈B} ← f_e(x′_{i∈B})                         ▷ gene embedding
        z_{i∈B} ← f_{θ_shared}(x″_{i∈B})                 ▷ common feature extraction
        z′_{i∈B} ← f_{θ_sup}(z_{i∈B})                    ▷ supervised representation
        z″_{i∈B} ← f_{θ_unsup}(z_{i∈B})                  ▷ unsupervised representation
        l_CE ← −(1/|B|) Σ_{i∈B∩S} log φ(z′_i)[y_i]       ▷ supervised loss component
        l_MSE ← (1/(C|B|)) Σ_{i∈B} ||z′_i − z″_i||²      ▷ unsupervised loss component
        Loss ← l_CE + w(t) × l_MSE                       ▷ total loss
        update θ_shared, θ_sup, θ_unsup using, e.g., ADAM
return θ_shared, θ_sup, θ_unsup

Although the learning procedure resembles deep multi-task learning, there are significant differences. Compared to deep multi-task learning, the subtasks in the proposed model cover two categories of learning, namely, supervised learning and unsupervised learning, while there is only supervised learning in deep multi-task learning. On the other hand, instead of using a one-path neural network, we apply two independent RNNs to generate the supervised and unsupervised outputs. Furthermore, the proposed model is more flexible, as the two independent RNNs can be tuned in terms of specific goals.

IV. EXPERIMENT
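As a concrete reference point for the experiments, the SSRCNN joint objective described in Section III can be sketched numerically. This is a minimal NumPy sketch under stated assumptions: the linear ramp-up shape for w(t) and the convention of marking unlabeled cells with −1 are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ssrcnn_loss(z_sup, z_unsup, labels, t, ramp_epochs=80):
    """Cross-entropy on the labeled subset plus MSE between the two paths'
    outputs on every sample, weighted by a ramp-up schedule w(t)."""
    B, C = z_sup.shape
    labeled = labels >= 0                    # assumption: -1 marks unlabeled cells
    p = softmax(z_sup)
    l_ce = -np.log(p[labeled, labels[labeled]]).sum() / B
    l_mse = ((z_sup - z_unsup) ** 2).sum() / (C * B)
    w = min(1.0, t / ramp_epochs)            # hypothetical linear ramp-up for w(t)
    return l_ce + w * l_mse

rng = np.random.default_rng(1)
z_sup = rng.normal(size=(8, 3))      # supervised-path outputs for a minibatch of 8
z_unsup = rng.normal(size=(8, 3))    # unsupervised-path outputs
labels = np.array([0, -1, 2, -1, -1, 1, -1, -1])   # only 3 of 8 cells labeled
print(ssrcnn_loss(z_sup, z_unsup, labels, t=10))
```

Note how the cross-entropy term only sees the three labeled cells, while the MSE term consumes the whole minibatch, which is what lets the unlabeled cells contribute to training.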
A. Dataset
We evaluate our proposed method using macosko2015 [7], a retina scRNAseq dataset. It includes 44,825 mouse retinal cells with 39 transcriptionally distinct cell populations. The dataset, with 24,760 genes, contains 12 cell types, namely, rods, cones, muller glia, astrocytes, fibroblasts, vascular endothelium, pericytes, microglia, retinal ganglion, bipolar, horizontal, and amacrine. The cell type distribution is shown in Figure 2. It can be observed that the cell distribution is imbalanced across the different cell types. Therefore, machine learning models built on this data will be biased toward the majority classes. In other words, the models will tend to obtain high performance for identification of majority cell types, but low performance for identification of minority cell types. It is thus a challenge to implement cell type classification with high performance for all cell types.

B. Experimental settings
The dataset is available at https://github.com/olgabot/macosko2015.

In this experiment, our proposed model is employed to implement cell type identification. The key hyperparameters of the proposed model are: embedding size: 256, minibatch size: 128, number of epochs: 300, optimizer: Adam, learning rate: 0.001, learning rate decay: 0.9. They are determined by trial and error.
Fig. 2. Cell distribution in different types.

For the data preprocessing, we select the top 50 genes based on their expression values to build the gene sentence for each cell. Moreover, the details of the model architecture are illustrated in Table I. Specifically, the output of the proposed model contains two parts: the cell type φ(z′) and a new representation z″.

TABLE I
THE PROPOSED NETWORK ARCHITECTURE.

Name               Description
Input              Gene sentence
Gene Embedding     Mikolov model [15], [21]
Shared RNN         256 LSTM cells for each hidden layer; one forward hidden layer; one backward hidden layer
Supervised RNN     256 LSTM cells for each hidden layer; one forward hidden layer; one backward hidden layer; one dense layer with 256 neurons; one max-pooling layer
Unsupervised RNN   256 LSTM cells for each hidden layer; one forward hidden layer; one backward hidden layer; one dense layer with 256 neurons; one max-pooling layer
Output             Cell type φ(z′) and a new representation z″

C. Evaluation metric
We applied different evaluation metrics to evaluate the performance of our proposed model, including accuracy, macro-average Precision (MacroP), macro-average Recall (MacroR), and macro-average F-score (MacroF) [22]. Accuracy is calculated by dividing the number of cells identified correctly by the total number of testing cells:

Accuracy = N_correct / N_total. (16)

Macro-averaging [23] calculates metrics such as Precision, Recall, and F-score independently for each cell type and then takes the average of these metrics. It evaluates the overall performance of classifying cell types:

MacroF = (1/C) Σ_{c=1}^{C} Fscore_c, (17)

MacroP = (1/C) Σ_{c=1}^{C} Precision_c, (18)

MacroR = (1/C) Σ_{c=1}^{C} Recall_c, (19)

where C denotes the total number of cell types and Fscore_c, Precision_c, Recall_c are the F-score, Precision, and Recall values for the c-th cell type, which are defined by

Fscore = 2 × Precision × Recall / (Precision + Recall), (20)

where Precision measures the capability of a model to report only correct cell types and Recall measures its ability to recover all corresponding correct cell types:

Precision = TP / (TP + FP), (21)

Recall = TP / (TP + FN), (22)

where TP (True Positive) counts the total number of cells matched with their annotated types, FP (False Positive) counts the cells whose recognized type does not match the annotated type, and FN (False Negative) counts the annotated cells that are not matched by the predictions. The main goal of learning from imbalanced datasets such as macosko2015 [7] is to improve the recall without hurting the precision. However, recall and precision goals are often conflicting: when increasing the true positives (TP) for a minority class, the number of false positives (FP) can also increase, which reduces the precision [24].

In addition, we employ three deep supervised learning models as baselines: 1) word-level CNN (Word CNN) [25], 2) attention-based bidirectional RNN (Att RNN) [26], and 3) recurrent CNN (RCNN) [6]; these models perform well on similar problems such as text classification. For example, Word CNN performs well on sentence classification, which makes it suitable for processing sequencing data whose content is short, like that of the gene sentence. In addition, we build 4) a word-level bidirectional RNN (Word RNN) for comparison, where Word RNN contains one embedding layer and one bidirectional RNN layer, and concatenates all the outputs from the RNN layer to feed a final fully-connected layer. Moreover, we employ six traditional machine learning models as baselines, namely, Naive Bayes, Decision Tree, Random Forest, Adaboost, Neural Networks (NN), and Support Vector Machine (SVM). Thus, there are 10 baseline models in total.
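The macro-averaged metrics above can be sketched directly from per-type TP/FP/FN counts, following Eqs. (16)-(22). The label vectors below are toy values for illustration, not results from the dataset.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_types):
    """Accuracy plus macro-averaged precision, recall, and F-score:
    per-type metrics from TP/FP/FN counts, then an unweighted mean."""
    P, R, F = [], [], []
    for c in range(n_types):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        P.append(p)
        R.append(r)
        F.append(2 * p * r / (p + r) if p + r else 0.0)
    accuracy = np.mean(y_true == y_pred)
    return accuracy, np.mean(P), np.mean(R), np.mean(F)

y_true = np.array([0, 0, 0, 1, 1, 2])   # toy ground-truth cell types
y_pred = np.array([0, 0, 1, 1, 1, 2])   # toy predictions
print(macro_metrics(y_true, y_pred, 3))
```

Because every type contributes equally to the mean, a rare cell type can pull the macro scores down sharply even when overall accuracy stays high, which is exactly why MacroF is reported alongside accuracy on this imbalanced dataset.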
Note that the baseline models are built on all labeled cells from the original training datasets.

D. Experimental results
We evaluated the proposed model from two perspectives. One is to verify whether the data preprocessing of the cells can be employed to identify cell types effectively. The other is to validate the performance of the proposed model on cell type identification with a limited amount of labeled cells.
1) Data preprocessing:
Table II presents the comparison of identification performance between traditional machine learning (ML) models and deep learning (DL) models, where the ML models are built on the original gene values without data preprocessing, while the DL models are built on preprocessed data, that is, gene sentences with gene embeddings.

We can observe that most of the ML models do not perform well on cell identification because of the data sparsity. For example, Naive Bayes's accuracy and MacroF are not high, since it is sensitive to data sparsity and cell imbalance. The other four ML models, namely Decision Tree, Random Forest, Adaboost, and NN, identify cell types with high accuracy but low MacroF, since they cannot overcome the challenge of cell imbalance even if data sparsity does not affect their performance significantly. Only SVM performs well on both accuracy and MacroF. However, it costs almost one and a half hours to obtain a converged model when training on such a big scRNAseq dataset.

On the contrary, the different DL models built on preprocessed cell data can identify cell types with promising and consistent performance. For instance, compared to the ML models, all DL models are able to reach accuracy above 95%, which means they do not struggle with the data sparsity. Moreover, considering the MacroF values, the DL models obtain encouraging performance, since they can overcome cell imbalance to some extent. Specifically, the performance difference between RCNN and SVM is not significant regarding accuracy and MacroF, while, compared to SVM, building RCNN takes only about half an hour to converge. Based on these observations, we believe that the preprocessing is an effective step to prepare the data for deep learning based cell type identification.
2) Cell type identification:
In this section, we examine whether the proposed model is able to effectively identify the cell types by training on a very limited amount of annotated cells. Table III presents the comparison of identification performance between SVM, RCNN, and the proposed model, where the proposed model is built based on RCNN with different ratios of labeled training cells. Firstly, we observe that the performance of the proposed model is enhanced by increasing the ratio of annotated cells. In other words, the proposed model obtains stronger identification ability when learning on more labeled data. This is because the unsupervised path is able to enhance the data representation, which improves the cell identification implemented by the supervised path.

Compared to SVM and RCNN, the proposed model can identify the cell types even with an extremely small amount of annotated cells. For example, we can obtain encouraging performance with 1% annotated cells. Furthermore, the proposed model is robust, since we gain similar performance with different ratios of annotated cells. For instance, the differences in accuracy and MacroF between the cases of 1%, 5%, and

TABLE II
COMPARING PERFORMANCE BETWEEN TRADITIONAL MACHINE LEARNING (ML) AND DEEP LEARNING (DL).

Original Gene Expression
Machine Learning (ML)     Accuracy  MacroP  MacroR  MacroF  Training Time (s)
Naive Bayes               35.06%    36.96%  30.40%  35.48%  11
Random Forest             85.09%    55.44%  27.45%  31.03%  22
Neural Networks           86.72%    19.47%  23.77%  21.23%  187
Decision Tree             93.78%    86.60%  80.34%  82.69%  1,172
Adaboost                  74.07%    30.38%  26.88%  25.67%  1,767
Support Vector Machine    97.28%    98.24%  93.32%  95.50%  5,554

Gene Embedding
Deep Learning (DL)        Accuracy  MacroP  MacroR  MacroF  Training Time (s)
Word CNN [25]             96.30%    90.79%  77.22%  81.90%  295
Word RNN                  96.11%    86.69%  82.82%  84.17%  8,368
Attention RNN [26]        95.79%    88.18%  84.85%  85.85%  4,661
RCNN [6]                  96.56%    96.55%  92.70%  94.45%  2,383
TABLE III
COMPARING PERFORMANCE BETWEEN SVM, RCNN, AND OUR MODEL (SEMI-SUPERVISED RECURRENT CONVOLUTIONAL NEURAL NETWORKS, SSRCNN).

ML             Accuracy  MacroP  MacroR  MacroF
SVM            97.28%    98.24%  93.32%  95.50%

DL             Accuracy  MacroP  MacroR  MacroF
RCNN [6]       96.56%    96.55%  92.70%  94.45%

Our model      Accuracy  MacroP  MacroR  MacroF
SSRCNN (1%)    95.47%    91.73%  93.90%  92.64%
SSRCNN (3%)    95.76%    92.62%  94.21%  93.28%
SSRCNN (5%)    95.76%    93.12%  93.39%  93.18%
SSRCNN (10%)   95.70%    94.92%  93.18%  93.87%
SSRCNN (30%)   96.44%    96.53%  92.66%  94.46%
30% are about 1%. Specifically, the MacroP is improved significantly when increasing the ratio of labeled training cells, while the MacroR is stable. The reason for this observation is that enhancing the representation with unsupervised learning in the proposed model appears to be most useful for identifying cell types precisely.

In addition to examining the performance comparisons between the proposed models and the baselines, we investigate whether the proposed model is sensitive to hyper-parameters. Various hyper-parameters are involved in the learning procedure of the proposed model. Here, we examine the batch size, since different batch sizes involve different numbers of labeled cells when building the proposed model with the same ratio of labeled cells. Table IV shows the comparison results for two different batch sizes. We observe that there are no significant differences in performance. This means that the proposed model is not sensitive to the batch size, since the supervised and unsupervised RNNs in the proposed model can collaborate with each other to overcome the effects of different batch sizes.

To further investigate the detailed performance, we show the performance with confusion matrices. Figure 3 presents the confusion matrices generated with different ratios of annotated cells when using batch size 128 to build the proposed model. It is observed that for different cell types, the accuracy increases when more labeled cells are involved in building the model. Specifically, when we use different ratios of labeled cells to build the model, the error distributions do not change significantly. For instance, for a given cell type,

TABLE IV
COMPARING PERFORMANCE WITH DIFFERENT BATCH SIZES ON DIFFERENT LABELED RATIOS.

1% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.47%    91.73%  93.90%  92.64%
256         95.11%    89.99%  94.40%  91.88%

3% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.76%    92.62%  94.21%  93.28%
256         95.44%    91.76%  94.21%  92.79%

5% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.76%    93.12%  93.39%  93.18%
256         95.49%    91.34%  93.74%  92.31%

10% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.70%    94.92%  93.18%  93.87%
256         95.93%    95.13%  93.11%  94.00%

30% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         96.44%    96.53%  92.66%  94.46%
256         96.45%    96.58%  92.02%  94.10%

the majority of errors come from incorrectly classifying its cells into one other cell type.

Furthermore, considering the imbalanced cell distribution (see Figure 2), the results in Figure 3 reveal the model bias against the minority cell types. That is, the model obtains higher performance for the majority types, but lower performance for the minority types. For one cell type, compared to the case of 1% labeled cells, the accuracy is decreased because of the model bias when using 10% labeled cells for training.

On the other hand, although the overall prediction accuracy (see Table III) increases when increasing the ratio of labeled cells, it is not always true that the accuracy for each cell type is enhanced; this can be observed in Figure 3. For some cell types the prediction accuracy is not always increased when increasing the ratio of labeled cells, while for others the accuracy is improved whenever more labeled cells are involved in building the identification model.

Moreover, we compare the confusion matrices for the two batch sizes to check the effects of different hyper-parameters in detail, as shown in Figure 4.

Fig. 3. Confusion matrix on different cell types generated with batch size 128. There are 12 cell types, including c1 (Bipolar), c2 (Pericytes), c3 (Vascular endothelium), c4 (Retinal ganglion), c5 (Horizontal), c6 (Rods), c7 (Cones), c8 (Amacrine), c9 (Fibroblasts), c10 (Microglia), c11 (Astrocytes), and c12 (Muller glia).

To sum up, for the majority cell type, the performance is enhanced in the case of the larger batch size. For the minority cell types, when employing the larger batch size to build the model, the performance for some cell types is decreased, whereas for others the accuracy is increased.
This means that we have to choose an appropriate batch size to improve the performance for certain minority cell types.

On the other hand, compared to the cases with more labeled data, the case with a low ratio of labeled cells needs a larger batch size to improve the performance for the majority cell types. For instance, when we compare the confusion matrices for the case of 1% labeled cells, the confusion matrix with batch size 256 shows better performance than that of batch size 128. This is consistent with the intuition that a larger batch size yields a larger number of labeled samples to enhance the supervised path when using an extremely low ratio of labeled cells. In other words, to improve the performance of the proposed model in the case of extremely low ratios of labeled data, we should apply a larger batch size for the majority cell types.

V. RELATED WORK
Single-cell RNA-seq (scRNAseq) data is able to profile the gene expression levels of cells and to link the dynamics at the molecular level and the cellular level. Analyzing scRNAseq data is beneficial for obtaining knowledge on cancer drug resistance, gene regulation in embryonic development, and the mechanisms of stem cell differentiation and reprogramming [27]. In recent years, much progress has been made in applying bioinformatics techniques to scRNAseq data. However, many challenges still exist due to dropout events, batch effects, noise, high dimensionality, and scalability [8].

To overcome these challenges, deep learning techniques have been employed to build effective and efficient computational methods for scRNAseq data. For example, Shaham et al. proposed MMD-ResNet to remove batch effects in both mass cytometry and scRNAseq data by combining residual neural networks (ResNets) with the maximum mean discrepancy (MMD) [28]. To reduce the computational cost, Li et al. implemented batch effect removal and clustering in one step [29]; specifically, they built a stacked autoencoder [30] to enhance clustering performance. On the other hand, to remove fake zeros, autoencoder based methods such as "AutoImpute" [31] and "DCA" [32] have been proposed to implement imputation and denoising to address the issue of dropout. Moreover, autoencoder techniques such as the denoising autoencoder (DAE) [33] and the variational autoencoder (VAE) [34] have also been applied to reduce the dimensionality of scRNAseq data [28], [33], [35]. In addition, Lopez et al. developed an integrative pipeline called scVI (single-cell variational inference) to implement multiple tasks, including correcting batch effects, removing dropout, imputation, dimension reduction, clustering, and visualization [36].

Recently, Lieberman et al. employed transfer learning [37] to reuse a classification scheme learned from previous similar experiments for cell type classification [3]. However, it is challenging to interpret how transfer learning improves the identification performance in this case. There are several recent works on cell type identification using machine learning techniques, such as [4], [3]. However, these works rely on fully labeled cells to build the identification models, which cannot be applied when there is a large amount of unlabeled data.

Fig. 4. Comparison of confusion matrices on different cell types generated with batch sizes 128 and 256. The left column is for the case of 128, while the right column is for the case of 256.

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel framework of deep semi-supervised learning is proposed for cell type identification on scRNAseq data. As an emerging research area, implementing cell type identification automatically is extremely important for the downstream analysis of scRNAseq data. However, current methods using supervised learning rely on the availability of a large amount of labeled cells, which may not be available in practice. Hence, we propose a deep semi-supervised learning model based on recurrent convolutional neural networks (RCNN) that can utilize unlabeled cells to enhance identification performance. There are two paths in the model for obtaining the supervised cross-entropy loss and the unsupervised mean squared error loss, respectively. Training is then performed by jointly optimizing these two losses, which allows the proposed scheme to take advantage of both the information from the labeled cells and the information from the unlabeled cells. Furthermore, we introduce a preprocessing procedure to overcome the problem of data sparsity. Experimental results indicate that the proposed model can identify cell types effectively using very limited labeled cells and a large amount of unlabeled cells. In our future work, we plan to extend the proposed model to other tasks such as pathway network construction.

ACKNOWLEDGMENT
This research work is supported in part by the Texas A&M Chancellor's Research Initiative (CRI), the U.S. National Science Foundation (NSF) awards 1464387 and 1736196, and by the U.S. Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) under agreement number FA8750-15-2-0119. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. National Science Foundation (NSF) or the U.S. Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) or the U.S. Government.
EFERENCES[1] C. Trapnell, “Defining cell types and states with single-cell genomics,”
Genome research , vol. 25, no. 10, pp. 1491–1498, 2015.[2] A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija, “Integratingsingle-cell transcriptomic data across different conditions, technologies,and species,”
Nature biotechnology , vol. 36, no. 5, p. 411, 2018.[3] Y. Lieberman, L. Rokach, and T. Shay, “Castle–classification of singlecells by transfer learning: Harnessing the power of publicly availablesingle cell rna sequencing experiments to annotate new experiments,”
PloS one , vol. 13, no. 10, p. e0205499, 2018.[4] F. Ma and M. Pellegrini, “Actinn: automated identification of cell typesin single cell rna sequencing,”
Bioinformatics , 2019.[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,“Distributed representations of words and phrases and their composi-tionality,” in
Advances in neural information processing systems , 2013,pp. 3111–3119. [6] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neuralnetworks for text classification,” in
Twenty-ninth AAAI conference onartificial intelligence , 2015.[7] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman,I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck et al. , “Highlyparallel genome-wide expression profiling of individual cells usingnanoliter droplets,”
Cell , vol. 161, no. 5, pp. 1202–1214, 2015.[8] J. Zheng and K. Wang, “Emerging deep learning methods for single-cellrna-seq data analysis,”
Quantitative Biology , vol. 7, no. 4, pp. 247–254,2019.[9] D. L¨ahnemann, J. K¨oster, E. Szczurek, D. J. McCarthy, S. C. Hicks,M. D. Robinson, C. A. Vallejos, K. R. Campbell, N. Beerenwinkel,A. Mahfouz et al. , “Eleven grand challenges in single-cell data science,”
Genome Biology , vol. 21, no. 1, pp. 1–35, 2020.[10] Y. Goldberg and O. Levy, “word2vec explained: deriving mikolovet al.’s negative-sampling word-embedding method,” arXiv preprintarXiv:1402.3722 , 2014.[11] L. Tran, X. Liu, J. Zhou, and R. Jin, “Missing modalities imputation viacascaded residual autoencoder,” in
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , 2017, pp. 1405–1414.[12] F. Zhou, Q. Gao, G. Trajcevski, K. Zhang, T. Zhong, and F. Zhang,“Trajectory-user linking via variational autoencoder.” in
IJCAI , 2018,pp. 3212–3218.[13] D. B. Searls, “The language of genes,”
Nature , vol. 420, no. 6912, pp.211–217, 2002.[14] A. M. Lesk,
Introduction to genomics . Oxford University Press, 2017.[15] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation ofword representations in vector space,” arXiv preprint arXiv:1301.3781 ,2013.[16] N. Yang, S. Liu, M. Li, M. Zhou, and N. Yu, “Word alignment modelingwith context dependent deep neural network,” in
Proceedings of the51st Annual Meeting of the Association for Computational Linguistics(Volume 1: Long Papers) , 2013, pp. 166–175.[17] T. Schnabel, I. Labutov, D. Mimno, and T. Joachims, “Evaluationmethods for unsupervised word embeddings,” in
Proceedings of the 2015conference on empirical methods in natural language processing , 2015,pp. 298–307.[18] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection bydeep multi-task learning,” in
European conference on computer vision .Springer, 2014, pp. 94–108.[19] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098 , 2017.[20] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learn-ing,” arXiv preprint arXiv:1610.02242 , 2016.[21] T. Mikolov, Q. V. Le, and I. Sutskever, “Exploiting similarities amonglanguages for machine translation,” arXiv preprint arXiv:1309.4168 ,2013.[22] V. Van Asch, “Macro-and micro-averaged evaluation measures [[basicdraft]],”
Belgium: CLiPS , vol. 49, 2013.[23] Y. Yang, “A study of thresholding strategies for text categorization,” in
Proceedings of the 24th annual international ACM SIGIR conference onResearch and development in information retrieval , 2001, pp. 137–145.[24] N. V. Chawla, “Data mining for imbalanced datasets: An overview,” in
Data mining and knowledge discovery handbook . Springer, 2009, pp.875–886.[25] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882 , 2014.[26] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, “Attention-based bidirectional long short-term memory networks for relation classi-fication,” in
Proceedings of the 54th Annual Meeting of the Associationfor Computational Linguistics (Volume 2: Short Papers) , 2016, pp. 207–212.[27] F. Tang, K. Lao, and M. A. Surani, “Development and applications ofsingle-cell transcriptome analysis,”
Nature methods , vol. 8, no. 4s, p. S6,2011.[28] U. Shaham, K. P. Stanton, J. Zhao, H. Li, K. Raddassi, R. Montgomery,and Y. Kluger, “Removal of batch effects using distribution-matchingresidual networks,”
Bioinformatics , vol. 33, no. 16, pp. 2539–2546,2017.[29] X. Li, Y. Lyu, J. Park, J. Zhang, D. Stambolian, K. Susztak, G. Hu,and M. Li, “Deep learning enables accurate clustering and batch effectremoval in single-cell rna-seq analysis,” bioRxiv , p. 530378, 2019.[30] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality ofdata with neural networks,” science , vol. 313, no. 5786, pp. 504–507,2006. [31] D. Talwar, A. Mongia, D. Sengupta, and A. Majumdar, “Autoimpute:Autoencoder based imputation of single-cell rna-seq data,”
Scientificreports , vol. 8, no. 1, pp. 1–11, 2018.[32] G. Eraslan, L. M. Simon, M. Mircea, N. S. Mueller, and F. J. Theis,“Single-cell rna-seq denoising using a deep count autoencoder,”
Naturecommunications , vol. 10, no. 1, pp. 1–14, 2019.[33] C. Lin, S. Jain, H. Kim, and Z. Bar-Joseph, “Using neural networksfor reducing the dimensions of single-cell rna-seq data,”
Nucleic acidsresearch , vol. 45, no. 17, pp. e156–e156, 2017.[34] J. Ding, A. Condon, and S. P. Shah, “Interpretable dimensionalityreduction of single cell transcriptome data with deep generative models,”
Nature communications , vol. 9, no. 1, pp. 1–13, 2018.[35] H. Cho, B. Berger, and J. Peng, “Generalizable and scalable visualizationof single-cell data using neural networks,”
Cell systems , vol. 7, no. 2,pp. 185–191, 2018.[36] R. Lopez, J. Regier, M. B. Cole, M. I. Jordan, and N. Yosef, “Deepgenerative modeling for single-cell transcriptomics,”
Nature methods ,vol. 15, no. 12, p. 1053, 2018.[37] S. J. Pan and Q. Yang, “A survey on transfer learning,”