Cell Type Identification from Single-Cell Transcriptomic Data via Semi-supervised Learning
Xishuang Dong, Shanta Chowdhury, Uboho Victor, Xiangfang Li, Lijun Qian
Abstract—Cell type identification from single-cell transcriptomic data is a common goal of single-cell RNA sequencing (scRNAseq) data analysis. Neural networks have been employed to identify cell types from scRNAseq data with high performance. However, this requires a large amount of individual cells with accurately and unbiasedly annotated types to build the identification models. Unfortunately, labeling scRNAseq data is cumbersome and time-consuming, as it involves manual inspection of marker genes. To overcome this challenge, we propose a semi-supervised learning model that uses unlabeled scRNAseq cells together with a limited amount of labeled scRNAseq cells to implement cell identification. Firstly, we transform the scRNAseq cells into "gene sentences", which is inspired by similarities between the natural language system and the gene system. Then the genes in these sentences are represented as gene embeddings to reduce data sparsity. With these embeddings, we implement a semi-supervised learning model based on recurrent convolutional neural networks (RCNN), which includes a shared network, a supervised network, and an unsupervised network. The proposed model is evaluated on macosko2015, a large-scale single-cell transcriptomic dataset with ground truth of individual cell types. It is observed that the proposed model is able to achieve encouraging performance by learning on a very limited amount of labeled scRNAseq cells together with a large number of unlabeled scRNAseq cells.
Index Terms—Single-Cell Sequencing, Semi-supervised Learning, Recurrent Convolutional Neural Networks, Joint Optimization
I. INTRODUCTION
Single-cell RNA sequencing (scRNAseq) enables the profiling of the transcriptomes of individual cells, thus characterizing the heterogeneity of biological samples, since scRNAseq experiments are able to yield high volumes of data. For example, a single experiment can produce expression profiles for many thousands of cells, at the level of the single cell [1]. It is not possible for traditional bulk RNAseq [2] to examine biological samples at such high resolution.

One common goal of scRNAseq data analytics is to identify the cell type of each individual cell that has been profiled. Although labeling cells with known cell types is a supervised learning task, it is currently achieved by unsupervised methods with manual input [3]. To accomplish this, cells are first grouped into different clusters in an unsupervised manner [4], and the number of these clusters allows us to approximately determine how many distinct cell types are present.

X. Dong, S. Chowdhury, U. Victor, X. Li and L. Qian are with the Center of Excellence in Research and Education for Big Military Data Intelligence (CREDIT Center) and Center for Computational Systems Biology (CCSB), Department of Electrical and Computer Engineering, Prairie View A&M University, Texas A&M University System, Prairie View, TX 77446, USA. Email: [email protected], [email protected], [email protected], [email protected], [email protected]
Fig. 1. Framework of the proposed semi-supervised learning. Input x is the cell. Cell types are available only for the labeled inputs, and the associated cross-entropy loss component is evaluated only for those. z′ and z″ are the outputs from the supervised bidirectional LSTM RNN and the unsupervised bidirectional LSTM RNN, respectively. We jointly optimize the cross-entropy loss and the mean squared error loss for supervised learning and unsupervised learning with these outputs. ⊕ is the concatenation operation.

To interpret the identity of each cluster, marker genes are identified as those that are uniquely highly expressed in a cluster compared to all other clusters. These canonical markers are then used to assign cell types to the clusters by cross-referencing the markers with lists of previously characterized cell-type-specific markers. However, this approach has several limitations, including the fact that the clusters may not optimally separate single cell types, and certain cell types may not have previously characterized markers. Moreover, these methods are computationally intensive, especially when the number of cells becomes large.

Recently, novel computational methods based on neural networks have been proposed to overcome these limitations [3], [4], since cell type classification based on a large number of genes is much more robust to noise with machine learning models. For example, Ma et al. proposed ACTINN (Automated Cell Type Identification using Neural Networks) [4] with simple neural networks of three neuron layers, which trains on datasets with predefined cell types and predicts cell types for other datasets based on the trained model. It uses all the genes to capture the features for each cell type instead of relying on a limited number of canonical markers. Furthermore, it is much more computationally efficient than traditional approaches.
However, it requires a large amount of individual cells with accurate and unbiased type labels to build datasets for training and testing.

In this paper, we propose a novel deep semi-supervised learning model for the case when only a very limited number of cells are labeled and a large number of cells are unlabeled. The proposed framework is shown in Figure 1. It is trained on cells with predefined cell types and can then be used to predict cell types on new datasets. The cells in scRNAseq data are transformed into "gene sentences" by taking advantage of similarities between the natural language system and the gene system. Furthermore, to overcome data sparsity, we employ word embedding techniques [5] to represent the genes in these sentences as gene vectors. Then, these vectors are input into the proposed semi-supervised neural networks built on recurrent convolutional neural networks (RCNN) [6]. The model consists of three components, namely, a shared bidirectional Long Short-Term Memory Recurrent Neural Network (LSTM RNN), a supervised bidirectional LSTM RNN, and an unsupervised bidirectional LSTM RNN. One path is composed of the shared bidirectional LSTM RNN and the supervised bidirectional LSTM RNN, while the other path consists of the shared bidirectional LSTM RNN and the unsupervised bidirectional LSTM RNN. All data (labeled and unlabeled) are evaluated to generate the mean squared error loss, while only labeled data are evaluated to calculate the cross-entropy loss. Experimental results on macosko2015 [7] demonstrate the effectiveness of the proposed model even when training it with a very limited amount of labeled cells.

The contributions of this study are as follows.
• We represent cells in scRNAseq data via embedding techniques to reduce the sparsity of gene expression values.
• We propose semi-supervised deep learning models with RCNN by jointly training a supervised RCNN and an unsupervised RCNN.
It is shown that the proposed model can learn on unlabeled cells and labeled cells jointly to identify cell types with high performance.
• The proposed model is validated on a large-scale scRNAseq dataset. Experimental results indicate that the new representations of cells enable cell type identification with promising performance. Moreover, the proposed semi-supervised learning model is able to effectively identify cell types by learning on a very limited number of labeled cells and a large amount of unlabeled cells.

II. PROBLEM FORMULATION
Cell type identification on single-cell transcriptomic data is to classify the individual cells into predefined cell types, which is a supervised learning task from the machine learning point of view. Specifically, it is a multi-class classification problem with N cell types in the set C = {c_1, c_2, c_3, ..., c_N}, where N > 1. Each cell belongs to one of the N different types. The goal is to construct a function which, given a new individual cell, will correctly predict the cell type to which the new individual cell belongs. It is defined by

f(x; θ) → c, (1)

where x is an individual cell, θ denotes the parameters in f(·), and c ∈ C. For the scRNAseq data, x is composed of a sequence of gene expression values of the cell. Generally, we will have tens of thousands of gene expression values if we employ scRNAseq techniques to generate data [3], [4]. These gene expression values are input as features to build machine learning models to complete cell type identification. Due to the high dimensionality and data sparsity of the scRNAseq data [8], it is challenging to solve this problem.

III. PROPOSED METHODOLOGY
We propose a semi-supervised recurrent convolutional neural network (SSRCNN) to address the challenge of lacking labeled individual cells for cell type identification from scRNAseq data. The proposed model is based on RCNN [6], and the detailed architecture is shown in Figure 1. The first step is to preprocess the scRNAseq data to reduce the data sparsity [8], [9] by building "gene sentences" and representing the genes with word embedding techniques [5], [10]. Specifically, each cell in the scRNAseq data is composed of thousands of gene expression values. Unfortunately, most of these values are zeros because of the limitations of current single-cell sequencing techniques [9], which would reduce the performance of machine learning models significantly [11], [12]. Therefore, it is important to solve the data sparsity problem for cell type identification.

To overcome the data sparsity, we propose a new technique of "gene embedding" to represent the cells, based on similarities between the gene system and the natural language system, since gene sequences can be treated as a language when we regard the genome as the "book of life" [13]. For example, words can be combined with others to generate new functional "phrases", while different genes can form pathways to control protein generation [14]. With respect to these similarities, we build gene sentences by selecting k genes and employ word2vec [15] to represent these genes, where word2vec is a powerful technique to overcome data sparsity for natural language processing and understanding [15], [16], [17]. We rank the genes in terms of their expression values and select the top k genes to build the gene sentence. Then the genes in the gene sentence are represented as gene embeddings.
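As a concrete illustration, the ranking step above can be sketched in a few lines of Python. This is a minimal sketch; the gene names and expression values below are toy stand-ins for illustration, not values from the dataset.

```python
import numpy as np

def gene_sentence(expr, gene_names, k=50):
    """Rank genes by expression value (descending) and keep the top k,
    turning a long, sparse expression vector into a short 'gene sentence'."""
    order = np.argsort(expr)[::-1][:k]
    return [gene_names[i] for i in order]

# Toy example: six hypothetical genes for one cell (real cells have tens of thousands).
genes = ["Rho", "Gnat1", "Pde6b", "Sag", "Nrl", "Actb"]
expr = np.array([9.1, 0.0, 3.2, 7.5, 0.0, 5.8])

print(gene_sentence(expr, genes, k=4))  # ['Rho', 'Sag', 'Actb', 'Pde6b']
```

The resulting sentences, one per cell, can then be passed to a word2vec implementation (e.g. gensim's Word2Vec) to learn the gene embeddings.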
Learning of SSRCNN

Require: training samples x_i, the set of labeled samples S, labels y_i for x_i (i ∈ S)
for t in [1, num_epochs] do
    for each minibatch B do
        x′_{i∈B} ← f_r(x_{i∈B})                          ▷ preprocessing
        x″_{i∈B} ← f_e(x′_{i∈B})                         ▷ gene embedding
        z_{i∈B} ← f_{θ_shared}(x″_{i∈B})                 ▷ common feature extraction
        z′_{i∈B} ← f_{θ_sup}(z_{i∈B})                    ▷ supervised representation
        z″_{i∈B} ← f_{θ_unsup}(z_{i∈B})                  ▷ unsupervised representation
        l_CE ← −(1/|B|) Σ_{i∈B∩S} log φ(z′_i)[y_i]       ▷ supervised loss component
        l_MSE ← (1/(C|B|)) Σ_{i∈B} ||z′_i − z″_i||²      ▷ unsupervised loss component
        Loss ← l_CE + w(t) × l_MSE                       ▷ total loss
        update θ_shared, θ_sup, θ_unsup using, e.g., ADAM
return θ_shared, θ_sup, θ_unsup

Although the learning procedure resembles deep multi-task learning, there are significant differences. Compared to deep multi-task learning, the subtasks in the proposed model cover two categories of learning, namely, supervised learning and unsupervised learning, while there is only supervised learning in deep multi-task learning. On the other hand, instead of using a one-path neural network, we apply two independent RNNs to generate the supervised and unsupervised outputs. Furthermore, the proposed model is more flexible, as the two independent RNNs can be tuned in terms of specific goals.

IV. EXPERIMENT
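As a concrete reference point for the experiments, the SSRCNN joint objective described in Section III can be sketched numerically. This is a minimal NumPy sketch under stated assumptions: the linear ramp-up shape for w(t) and the convention of marking unlabeled cells with −1 are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ssrcnn_loss(z_sup, z_unsup, labels, t, ramp_epochs=80):
    """Cross-entropy on the labeled subset plus MSE between the two paths'
    outputs on every sample, weighted by a ramp-up schedule w(t)."""
    B, C = z_sup.shape
    labeled = labels >= 0                    # assumption: -1 marks unlabeled cells
    p = softmax(z_sup)
    l_ce = -np.log(p[labeled, labels[labeled]]).sum() / B
    l_mse = ((z_sup - z_unsup) ** 2).sum() / (C * B)
    w = min(1.0, t / ramp_epochs)            # hypothetical linear ramp-up for w(t)
    return l_ce + w * l_mse

rng = np.random.default_rng(1)
z_sup = rng.normal(size=(8, 3))      # supervised-path outputs for a minibatch of 8
z_unsup = rng.normal(size=(8, 3))    # unsupervised-path outputs
labels = np.array([0, -1, 2, -1, -1, 1, -1, -1])   # only 3 of 8 cells labeled
print(ssrcnn_loss(z_sup, z_unsup, labels, t=10))
```

Note how the cross-entropy term only sees the three labeled cells, while the MSE term consumes the whole minibatch, which is what lets the unlabeled cells contribute to training.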
A. Dataset
We evaluate our proposed method using macosko2015 [7], a retina scRNAseq dataset. It includes 44,825 mouse retinal cells with 39 transcriptionally distinct cell populations. The dataset, with 24,760 genes, contains 12 cell types, namely, rods, cones, muller glia, astrocytes, fibroblasts, vascular endothelium, pericytes, microglia, retinal ganglion, bipolar, horizontal, and amacrine. The cell type distribution is shown in Figure 2. It can be observed that the cell distribution is imbalanced across the different cell types. Therefore, machine learning models built on this data will be biased toward the majority classes. In other words, the models will tend to obtain high performance for identification of majority cell types, but low performance for identification of minority cell types. It is thus a challenge to implement cell type classification with high performance for all cell types.

B. Experimental settings
The dataset is available at https://github.com/olgabot/macosko2015.

In this experiment, our proposed model is employed to implement cell type identification. The key hyperparameters of the proposed model are: embedding size: 256, minibatch size: 128, number of epochs: 300, optimizer: Adam, learning rate: 0.001, learning rate decay: 0.9. They are determined by trial and error.
Fig. 2. Cell distribution in different types.

For the data preprocessing, we select the top 50 genes based on their expression values to build the gene sentence for each cell. Moreover, the details of the model architecture are illustrated in Table I. Specifically, the output of the proposed model contains two parts: the cell type φ(z′) and a new representation z″.

TABLE I
THE PROPOSED NETWORK ARCHITECTURE.

Name               Description
Input              Gene sentence
Gene Embedding     Mikolov model [15], [21]
Shared RNN         256 LSTM cells for each hidden layer; one forward hidden layer; one backward hidden layer
Supervised RNN     256 LSTM cells for each hidden layer; one forward hidden layer; one backward hidden layer; one dense layer with 256 neurons; one max-pooling layer
Unsupervised RNN   256 LSTM cells for each hidden layer; one forward hidden layer; one backward hidden layer; one dense layer with 256 neurons; one max-pooling layer
Output             Cell type φ(z′) and a new representation z″

C. Evaluation metric
We applied different evaluation metrics to evaluate the performance of our proposed model, including accuracy, macro-average Precision (MacroP), macro-average Recall (MacroR), and macro-average F-score (MacroF) [22]. Accuracy is calculated by dividing the number of cells identified correctly by the total number of testing cells:

Accuracy = N_correct / N_total. (16)

Macro-averaging [23] calculates metrics such as Precision, Recall, and F-score independently for each cell type and then takes the average of these metrics. It evaluates the overall performance of classifying cell types:

MacroF = (1/C) Σ_{c=1}^{C} Fscore_c, (17)

MacroP = (1/C) Σ_{c=1}^{C} Precision_c, (18)

MacroR = (1/C) Σ_{c=1}^{C} Recall_c, (19)

where C denotes the total number of cell types and Fscore_c, Precision_c, Recall_c are the F-score, Precision, and Recall values for the c-th cell type, which are defined by

Fscore = 2 × Precision × Recall / (Precision + Recall), (20)

where Precision measures the capability of a model to report only correct cell types and Recall measures its ability to recover all corresponding correct cell types:

Precision = TP / (TP + FP), (21)

Recall = TP / (TP + FN), (22)

where TP (True Positive) counts the total number of cells matched with their annotated types, FP (False Positive) counts the cells whose recognized type does not match the annotated type, and FN (False Negative) counts the annotated cells that are not matched by the predictions. The main goal of learning from imbalanced datasets such as macosko2015 [7] is to improve the recall without hurting the precision. However, recall and precision goals are often conflicting: when increasing the true positives (TP) for a minority class, the number of false positives (FP) can also increase, which reduces the precision [24].

In addition, we employ three deep supervised learning models as baselines: 1) word-level CNN (Word CNN) [25], 2) attention-based bidirectional RNN (Att RNN) [26], and 3) recurrent CNN (RCNN) [6]; these models perform well on similar problems such as text classification. For example, Word CNN performs well on sentence classification, which makes it suitable for processing sequencing data whose content is short, like that of the gene sentence. In addition, we build 4) a word-level bidirectional RNN (Word RNN) for comparison, where Word RNN contains one embedding layer and one bidirectional RNN layer, and concatenates all the outputs from the RNN layer to feed a final fully-connected layer. Moreover, we employ six traditional machine learning models as baselines, namely, Naive Bayes, Decision Tree, Random Forest, Adaboost, Neural Networks (NN), and Support Vector Machine (SVM). Thus, there are 10 baseline models in total.
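The macro-averaged metrics above can be sketched directly from per-type TP/FP/FN counts, following Eqs. (16)-(22). The label vectors below are toy values for illustration, not results from the dataset.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_types):
    """Accuracy plus macro-averaged precision, recall, and F-score:
    per-type metrics from TP/FP/FN counts, then an unweighted mean."""
    P, R, F = [], [], []
    for c in range(n_types):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        P.append(p)
        R.append(r)
        F.append(2 * p * r / (p + r) if p + r else 0.0)
    accuracy = np.mean(y_true == y_pred)
    return accuracy, np.mean(P), np.mean(R), np.mean(F)

y_true = np.array([0, 0, 0, 1, 1, 2])   # toy ground-truth cell types
y_pred = np.array([0, 0, 1, 1, 1, 2])   # toy predictions
print(macro_metrics(y_true, y_pred, 3))
```

Because every type contributes equally to the mean, a rare cell type can pull the macro scores down sharply even when overall accuracy stays high, which is exactly why MacroF is reported alongside accuracy on this imbalanced dataset.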
Note that the baseline models are built on all labeled cells from the original training datasets.

D. Experimental results
We evaluated the proposed model from two perspectives. One is to verify whether the data preprocessing of the cells can be employed to identify cell types effectively. The other is to validate the performance of the proposed model on cell type identification with a limited amount of labeled cells.
1) Data preprocessing:
Table II presents the comparison of identification performance between traditional machine learning (ML) models and deep learning (DL) models, where the ML models are built on the original gene values without data preprocessing, while the DL models are built on preprocessed data, that is, gene sentences with gene embeddings.

We can observe that most of the ML models do not perform well on cell identification because of the data sparsity. For example, Naive Bayes's accuracy and MacroF are not high, since it is sensitive to data sparsity and cell imbalance. The other four ML models, namely Decision Tree, Random Forest, Adaboost, and NN, identify cell types with high accuracy but low MacroF, since they cannot overcome the challenge of cell imbalance even if data sparsity does not affect their performance significantly. Only SVM performs well on both accuracy and MacroF. However, it costs almost one and a half hours to obtain a converged model when training on such a big scRNAseq dataset.

On the contrary, the different DL models built on preprocessed cell data can identify cell types with promising and consistent performance. For instance, compared to the ML models, all DL models are able to reach accuracy above 95%, which means they do not struggle with the data sparsity. Moreover, considering the MacroF values, the DL models obtain encouraging performance, since they can overcome cell imbalance to some extent. Specifically, the performance difference between RCNN and SVM is not significant regarding accuracy and MacroF, while, compared to SVM, building RCNN takes only about half an hour to converge. Based on these observations, we believe that the preprocessing is an effective step to prepare the data for deep learning based cell type identification.
2) Cell type identification:
In this section, we examine whether the proposed model is able to effectively identify the cell types by training on a very limited amount of annotated cells. Table III presents the comparison of identification performance between SVM, RCNN, and the proposed model, where the proposed model is built based on RCNN with different ratios of labeled training cells. Firstly, we observe that the performance of the proposed model is enhanced by increasing the ratio of annotated cells. In other words, the proposed model obtains stronger identification ability when learning on more labeled data. This is because the unsupervised path is able to enhance the data representation, which improves the cell identification implemented by the supervised path.

Compared to SVM and RCNN, the proposed model can identify the cell types even with an extremely small amount of annotated cells. For example, we can obtain encouraging performance with 1% annotated cells. Furthermore, the proposed model is robust, since we gain similar performance with different ratios of annotated cells. For instance, the differences in accuracy and MacroF between the cases of 1%, 5%, and

TABLE II
COMPARING PERFORMANCE BETWEEN TRADITIONAL MACHINE LEARNING (ML) AND DEEP LEARNING (DL).

Original Gene Expression
Machine Learning (ML)     Accuracy  MacroP  MacroR  MacroF  Training Time (s)
Naive Bayes               35.06%    36.96%  30.40%  35.48%  11
Random Forest             85.09%    55.44%  27.45%  31.03%  22
Neural Networks           86.72%    19.47%  23.77%  21.23%  187
Decision Tree             93.78%    86.60%  80.34%  82.69%  1,172
Adaboost                  74.07%    30.38%  26.88%  25.67%  1,767
Support Vector Machine    97.28%    98.24%  93.32%  95.50%  5,554

Gene Embedding
Deep Learning (DL)        Accuracy  MacroP  MacroR  MacroF  Training Time (s)
Word CNN [25]             96.30%    90.79%  77.22%  81.90%  295
Word RNN                  96.11%    86.69%  82.82%  84.17%  8,368
Attention RNN [26]        95.79%    88.18%  84.85%  85.85%  4,661
RCNN [6]                  96.56%    96.55%  92.70%  94.45%  2,383
TABLE III
COMPARING PERFORMANCE BETWEEN SVM, RCNN, AND OUR MODEL (SEMI-SUPERVISED RECURRENT CONVOLUTIONAL NEURAL NETWORKS, SSRCNN).

ML             Accuracy  MacroP  MacroR  MacroF
SVM            97.28%    98.24%  93.32%  95.50%

DL             Accuracy  MacroP  MacroR  MacroF
RCNN [6]       96.56%    96.55%  92.70%  94.45%

Our model      Accuracy  MacroP  MacroR  MacroF
SSRCNN (1%)    95.47%    91.73%  93.90%  92.64%
SSRCNN (3%)    95.76%    92.62%  94.21%  93.28%
SSRCNN (5%)    95.76%    93.12%  93.39%  93.18%
SSRCNN (10%)   95.70%    94.92%  93.18%  93.87%
SSRCNN (30%)   96.44%    96.53%  92.66%  94.46%
30% are about 1%. Specifically, the MacroP is improved significantly when increasing the ratio of labeled training cells, while the MacroR is stable. The reason for this observation is that enhancing the representation with unsupervised learning in the proposed model appears to be most useful for identifying cell types precisely.

In addition to examining the performance comparisons between the proposed models and the baselines, we investigate whether the proposed model is sensitive to hyper-parameters. Various hyper-parameters are involved in the learning procedure of the proposed model. Here, we examine the batch size, since different batch sizes involve different numbers of labeled cells when building the proposed model with the same ratio of labeled cells. Table IV shows the comparison results for two different batch sizes. We observe that there are no significant differences in performance. This means that the proposed model is not sensitive to the batch size, since the supervised and unsupervised RNNs in the proposed model can collaborate with each other to overcome the effects of different batch sizes.

To further investigate the detailed performance, we show the performance with confusion matrices. Figure 3 presents the confusion matrices generated with different ratios of annotated cells when using batch size 128 to build the proposed model. It is observed that for different cell types, the accuracy increases when more labeled cells are involved in building the model. Specifically, when we use different ratios of labeled cells to build the model, the error distributions do not change significantly. For instance, for a given cell type,

TABLE IV
COMPARING PERFORMANCE WITH DIFFERENT BATCH SIZES ON DIFFERENT LABELED RATIOS.

1% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.47%    91.73%  93.90%  92.64%
256         95.11%    89.99%  94.40%  91.88%

3% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.76%    92.62%  94.21%  93.28%
256         95.44%    91.76%  94.21%  92.79%

5% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.76%    93.12%  93.39%  93.18%
256         95.49%    91.34%  93.74%  92.31%

10% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         95.70%    94.92%  93.18%  93.87%
256         95.93%    95.13%  93.11%  94.00%

30% Labeled Data
Batch size  Accuracy  MacroP  MacroR  MacroF
128         96.44%    96.53%  92.66%  94.46%
256         96.45%    96.58%  92.02%  94.10%

the majority of errors come from incorrectly classifying its cells into one other cell type.

Furthermore, considering the imbalanced cell distribution (see Figure 2), the results in Figure 3 reveal the model bias against the minority cell types. That is, the model obtains higher performance for the majority types, but lower performance for the minority types. For one cell type, compared to the case of 1% labeled cells, the accuracy is decreased because of the model bias when using 10% labeled cells for training.

On the other hand, although the overall prediction accuracy (see Table III) increases when increasing the ratio of labeled cells, it is not always true that the accuracy for each cell type is enhanced; this can be observed in Figure 3. For some cell types the prediction accuracy is not always increased when increasing the ratio of labeled cells, while for others the accuracy is improved whenever more labeled cells are involved in building the identification model.

Moreover, we compare the confusion matrices for the two batch sizes to check the effects of different hyper-parameters in detail, as shown in Figure 4.

Fig. 3. Confusion matrix on different cell types generated with batch size 128. There are 12 cell types, including c1 (Bipolar), c2 (Pericytes), c3 (Vascular endothelium), c4 (Retinal ganglion), c5 (Horizontal), c6 (Rods), c7 (Cones), c8 (Amacrine), c9 (Fibroblasts), c10 (Microglia), c11 (Astrocytes), and c12 (Muller glia).

To sum up, for the majority cell type, the performance is enhanced in the case of the larger batch size. For the minority cell types, when employing the larger batch size to build the model, the performance for some cell types is decreased, whereas for others the accuracy is increased.
This means that we have to choose an appropriate batch size to improve the performance for certain minority cell types.

On the other hand, compared to the cases with more labeled data, the case with a low ratio of labeled cells needs a larger batch size to improve the performance for the majority cell types. For instance, when we compare the confusion matrices for the case of 1% labeled cells, the confusion matrix with batch size 256 shows better performance than that of batch size 128. This is consistent with the intuition that a larger batch size yields a larger number of labeled samples to enhance the supervised path when using an extremely low ratio of labeled cells. In other words, to improve the performance of the proposed model in the case of extremely low ratios of labeled data, we should apply a larger batch size for the majority cell types.

V. RELATED WORK
Single-cell RNA-seq (scRNAseq) data is able to profile the gene expression levels of cells and to link the dynamics at the molecular level and the cellular level. Analyzing scRNAseq data is beneficial for obtaining knowledge on cancer drug resistance, gene regulation in embryonic development, and the mechanisms of stem cell differentiation and reprogramming [27]. In recent years, much progress has been made in applying bioinformatics techniques to scRNAseq data. However, many challenges still exist due to dropout events, batch effects, noise, high dimensionality, and scalability [8].

To overcome these challenges, deep learning techniques have been employed to build effective and efficient computational methods for scRNAseq data. For example, Shaham et al. proposed MMD-ResNet to remove batch effects in both mass cytometry and scRNAseq data by combining residual neural networks (ResNets) with the maximum mean discrepancy (MMD) [28]. To reduce the computational cost, Li et al. implemented batch effect removal and clustering in one step [29]; specifically, they built a stacked autoencoder [30] to enhance clustering performance. On the other hand, to remove fake zeros, autoencoder based methods such as "AutoImpute" [31] and "DCA" [32] have been proposed to implement imputation and denoising to address the issue of dropout. Moreover, autoencoder techniques such as the denoising autoencoder (DAE) [33] and the variational autoencoder (VAE) [34] have also been applied to reduce the dimensionality of scRNAseq data [28], [33], [35]. In addition, Lopez et al. developed an integrative pipeline called scVI (single-cell variational inference) to implement multiple tasks, including correcting batch effects, removing dropout, imputation, dimension reduction, clustering, and visualization [36].

Recently, Lieberman et al. employed transfer learning [37] to reuse a classification scheme learned from previous similar experiments for cell type classification [3]. However, it is challenging to interpret how transfer learning improves the identification performance in this case. There are several recent works on cell type identification using machine learning techniques, such as [4], [3]. However, these works rely on fully labeled cells to build the identification models, which cannot be applied when there is a large amount of unlabeled data.

Fig. 4. Comparison of confusion matrices on different cell types generated with batch sizes 128 and 256. The left column is for the case of 128, while the right column is for the case of 256.

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel framework of deep semi-supervised learning is proposed for cell type identification on scRNAseq data. As an emerging research area, implementing cell type identification automatically is extremely important for the downstream analysis of scRNAseq data. However, current methods using supervised learning rely on the availability of a large amount of labeled cells, which may not be available in practice. Hence, we propose a deep semi-supervised learning model based on recurrent convolutional neural networks (RCNN) that can utilize unlabeled cells to enhance identification performance. There are two paths in the model for obtaining the supervised cross-entropy loss and the unsupervised mean squared error loss, respectively. Training is then performed by jointly optimizing these two losses, which allows the proposed scheme to take advantage of both the information from the labeled cells and the information from the unlabeled cells. Furthermore, we introduce a preprocessing procedure to overcome the problem of data sparsity. Experimental results indicate that the proposed model can identify cell types effectively using very limited labeled cells and a large amount of unlabeled cells. In our future work, we plan to extend the proposed model to other tasks such as pathway network construction.

ACKNOWLEDGMENT
This research work is supported in part by the Texas A&M Chancellor's Research Initiative (CRI), the U.S. National Science Foundation (NSF) awards 1464387 and 1736196, and by the U.S. Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) under agreement number FA8750-15-2-0119. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. National Science Foundation (NSF) or the U.S. Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) or the U.S. Government.
EFERENCES[1] C. Trapnell, “Defining cell types and states with single-cell genomics,”
Genome research , vol. 25, no. 10, pp. 1491–1498, 2015.[2] A. Butler, P. Hoffman, P. Smibert, E. Papalexi, and R. Satija, “Integratingsingle-cell transcriptomic data across different conditions, technologies,and species,”
Nature biotechnology , vol. 36, no. 5, p. 411, 2018.[3] Y. Lieberman, L. Rokach, and T. Shay, “Castle–classification of singlecells by transfer learning: Harnessing the power of publicly availablesingle cell rna sequencing experiments to annotate new experiments,”
PloS one , vol. 13, no. 10, p. e0205499, 2018.[4] F. Ma and M. Pellegrini, “Actinn: automated identification of cell typesin single cell rna sequencing,”
Bioinformatics , 2019.[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,“Distributed representations of words and phrases and their composi-tionality,” in
Advances in neural information processing systems , 2013,pp. 3111–3119. [6] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neuralnetworks for text classification,” in
Twenty-ninth AAAI conference onartificial intelligence , 2015.[7] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman,I. Tirosh, A. R. Bialas, N. Kamitaki, E. M. Martersteck et al. , “Highlyparallel genome-wide expression profiling of individual cells usingnanoliter droplets,”
Cell , vol. 161, no. 5, pp. 1202–1214, 2015.[8] J. Zheng and K. Wang, “Emerging deep learning methods for single-cellrna-seq data analysis,”
Quantitative Biology , vol. 7, no. 4, pp. 247–254,2019.[9] D. L¨ahnemann, J. K¨oster, E. Szczurek, D. J. McCarthy, S. C. Hicks,M. D. Robinson, C. A. Vallejos, K. R. Campbell, N. Beerenwinkel,A. Mahfouz et al. , “Eleven grand challenges in single-cell data science,”
Genome Biology , vol. 21, no. 1, pp. 1–35, 2020.[10] Y. Goldberg and O. Levy, “word2vec explained: deriving mikolovet al.’s negative-sampling word-embedding method,” arXiv preprintarXiv:1402.3722 , 2014.[11] L. Tran, X. Liu, J. Zhou, and R. Jin, “Missing modalities imputation viacascaded residual autoencoder,” in
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , 2017, pp. 1405–1414.[12] F. Zhou, Q. Gao, G. Trajcevski, K. Zhang, T. Zhong, and F. Zhang,“Trajectory-user linking via variational autoencoder.” in
IJCAI , 2018,pp. 3212–3218.[13] D. B. Searls, “The language of genes,”
Nature , vol. 420, no. 6912, pp.211–217, 2002.[14] A. M. Lesk,
Introduction to genomics . Oxford University Press, 2017.[15] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation ofword representations in vector space,” arXiv preprint arXiv:1301.3781 ,2013.[16] N. Yang, S. Liu, M. Li, M. Zhou, and N. Yu, “Word alignment modelingwith context dependent deep neural network,” in
Proceedings of the51st Annual Meeting of the Association for Computational Linguistics(Volume 1: Long Papers) , 2013, pp. 166–175.[17] T. Schnabel, I. Labutov, D. Mimno, and T. Joachims, “Evaluationmethods for unsupervised word embeddings,” in
Proceedings of the 2015conference on empirical methods in natural language processing , 2015,pp. 298–307.[18] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection bydeep multi-task learning,” in
European conference on computer vision .Springer, 2014, pp. 94–108.[19] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv preprint arXiv:1706.05098 , 2017.[20] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learn-ing,” arXiv preprint arXiv:1610.02242 , 2016.[21] T. Mikolov, Q. V. Le, and I. Sutskever, “Exploiting similarities amonglanguages for machine translation,” arXiv preprint arXiv:1309.4168 ,2013.[22] V. Van Asch, “Macro-and micro-averaged evaluation measures [[basicdraft]],”
Belgium: CLiPS , vol. 49, 2013.[23] Y. Yang, “A study of thresholding strategies for text categorization,” in
Proceedings of the 24th annual international ACM SIGIR conference onResearch and development in information retrieval , 2001, pp. 137–145.[24] N. V. Chawla, “Data mining for imbalanced datasets: An overview,” in
Data mining and knowledge discovery handbook . Springer, 2009, pp.875–886.[25] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882 , 2014.[26] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, “Attention-based bidirectional long short-term memory networks for relation classi-fication,” in
Proceedings of the 54th Annual Meeting of the Associationfor Computational Linguistics (Volume 2: Short Papers) , 2016, pp. 207–212.[27] F. Tang, K. Lao, and M. A. Surani, “Development and applications ofsingle-cell transcriptome analysis,”
Nature methods , vol. 8, no. 4s, p. S6,2011.[28] U. Shaham, K. P. Stanton, J. Zhao, H. Li, K. Raddassi, R. Montgomery,and Y. Kluger, “Removal of batch effects using distribution-matchingresidual networks,”
Bioinformatics , vol. 33, no. 16, pp. 2539–2546,2017.[29] X. Li, Y. Lyu, J. Park, J. Zhang, D. Stambolian, K. Susztak, G. Hu,and M. Li, “Deep learning enables accurate clustering and batch effectremoval in single-cell rna-seq analysis,” bioRxiv , p. 530378, 2019.[30] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality ofdata with neural networks,” science , vol. 313, no. 5786, pp. 504–507,2006. [31] D. Talwar, A. Mongia, D. Sengupta, and A. Majumdar, “Autoimpute:Autoencoder based imputation of single-cell rna-seq data,”
Scientificreports , vol. 8, no. 1, pp. 1–11, 2018.[32] G. Eraslan, L. M. Simon, M. Mircea, N. S. Mueller, and F. J. Theis,“Single-cell rna-seq denoising using a deep count autoencoder,”
Naturecommunications , vol. 10, no. 1, pp. 1–14, 2019.[33] C. Lin, S. Jain, H. Kim, and Z. Bar-Joseph, “Using neural networksfor reducing the dimensions of single-cell rna-seq data,”
Nucleic acidsresearch , vol. 45, no. 17, pp. e156–e156, 2017.[34] J. Ding, A. Condon, and S. P. Shah, “Interpretable dimensionalityreduction of single cell transcriptome data with deep generative models,”
Nature communications , vol. 9, no. 1, pp. 1–13, 2018.[35] H. Cho, B. Berger, and J. Peng, “Generalizable and scalable visualizationof single-cell data using neural networks,”
Cell systems , vol. 7, no. 2,pp. 185–191, 2018.[36] R. Lopez, J. Regier, M. B. Cole, M. I. Jordan, and N. Yosef, “Deepgenerative modeling for single-cell transcriptomics,”
Nature methods ,vol. 15, no. 12, p. 1053, 2018.[37] S. J. Pan and Q. Yang, “A survey on transfer learning,”