Factorized linear discriminant analysis and its application in computational biology
Factorized linear discriminant analysis for phenotype-guided representation learning of neuronal gene expression data
Mu Qiao∗ & Markus Meister
Division of Biology and Biological Engineering
California Institute of Technology
Pasadena, CA 91125
{muqiao, meister}@caltech.edu
Abstract
A central goal in neurobiology is to relate the expression of genes to the structural and functional properties of neuronal types, collectively called their phenotypes. Single-cell RNA sequencing can measure the expression of thousands of genes in thousands of neurons. How to interpret the data in the context of neuronal phenotypes? We propose a supervised learning approach that factorizes the gene expression data into components corresponding to individual phenotypic characteristics and their interactions. This new method, which we call factorized linear discriminant analysis (FLDA), seeks a linear transformation of gene expressions that varies highly with only one phenotypic factor and minimally with the others. We further leverage our approach with a sparsity-based regularization algorithm, which selects a few genes important to a specific phenotypic feature or feature combination. We applied this approach to a single-cell RNA-Seq dataset of Drosophila T4/T5 neurons, focusing on their dendritic and axonal phenotypes. The analysis confirms results obtained by conventional methods but also points to new genes related to the phenotypes and an intriguing hierarchy in the genetic organization of these cells.
The complexity of neural circuits is a result of many different types of neurons that specifically connect to each other. Each neuronal type has its own phenotypic traits, which together determine the role of the neuronal type in a neural circuit. Typical phenotypic descriptions of neurons include features such as dendritic and axonal laminations, electrophysiological properties, and connectivity [1-3]. However, the genetic programs behind these phenotypic characteristics are still poorly understood. Recent progress in characterizing neuronal cell types and investigating their gene expression, especially with advances in high-throughput single-cell RNA-Seq [2], provides an opportunity to address this challenge. With massive data generated from single-cell RNA-Seq, we now face a computational problem: how to factorize the high-dimensional data into gene expression modules that are meaningful to neuronal phenotypes? Specifically, given phenotypic descriptions of neuronal types, such as their dendritic stratification and axonal termination, can one project the original data into a low-dimensional space corresponding to these phenotypic features and their interactions, and further extract genes critical to each of these components?

∗ Contact for correspondence. Current address: Apple Inc., Cupertino, CA, 95014. Alternative email: [email protected]. Under review.

Figure 1: Illustration of our approach. (A,B) In the example, cell types are jointly represented by two phenotypic features, indexed with labels i and j respectively. If only some combinations of the two features are observed, one obtains a partial contingency table (B) instead of a complete one (A). (C) We seek linear projections of the data that separate the cell types in a factorized manner corresponding to the two features. Here u, v, and w are aligned with Feature 1, Feature 2, and the interaction of both features, with the projected coordinates y, z, and s respectively.

Here we propose a new analysis method named factorized linear discriminant analysis (FLDA). Inspired by multi-way analysis of variance (ANOVA) [4], this method factorizes data into components corresponding to phenotypic features and their interactions, and seeks a linear transformation that varies highly with one specific factor but not with the others. The linear nature of this approach makes it easy to interpret, as the weight coefficients directly inform the relative importance of each gene to each factor. We further introduce a sparse variant of the method, which constrains the number of genes contributing to each linear projection. We illustrate this approach by applying FLDA to a single-cell transcriptome dataset of T4/T5 neurons in Drosophila [5], focusing on two phenotypes: dendritic location and axonal lamination.

Suppose that we are given gene expression data of single neurons, which are typically very high-dimensional. These cells are classified into cell types, as a result of clustering in the high-dimensional space and annotations based on prior knowledge or verification outcome [6-10].
We know the phenotypic traits of each neuronal type; therefore, each type can also be jointly defined by its phenotypic features. We want to find an interpretable low-dimensional embedding in which certain dimensions represent factors of phenotypic features or their interactions. This requires that variation along one of the axes in the embedding space corresponds to variation of only one factor. In reality, this is hard to satisfy due to noise in the data, and we relax the constraint by letting data projected along one axis vary largely with one factor while minimally with the others. In addition, we ask that cells classified as the same type are still close to each other in the embedding space, while cells of different types are far apart.

As a start, let us consider only two phenotypic features of neurons, dendritic stratification and axonal termination, both of which can be described with discrete categories, such as different regions or layers in the brain [1, 5, 11, 12]. Suppose that each cell type can be jointly represented by its dendritic location indexed as i and axonal lamination indexed as j, with the number of cells within each cell type n_ij. This representation can be described using a contingency table (Figure 1A,B). Note here that we allow the table to be partially filled.

Let x_ijk (k = 1, 2, ..., n_ij) represent the expression values of g genes in each cell (x_ijk ∈ R^g). How to find linear projections y_ijk = u^T x_ijk and z_ijk = v^T x_ijk that are aligned with features i and j respectively (Figure 1C)? We first asked whether we could factorize, for example, y_ijk, with respect to components depending on features i and j. Indeed, motivated by the linear factor models used in multi-way ANOVA and the idea of partitioning variance, we constructed an objective function as the following, and found u* that maximizes the objective (see detailed analysis in Appendix A):

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T N_A u}{u^T M_e u} \qquad (1)$$

When we have a complete table, and there are a levels for the feature i and b levels for the feature j, we have

$$N_A = M_A - \lambda_1 M_B - \lambda_2 M_{AB} \qquad (2)$$

where M_A, M_B, and M_AB are the covariance matrices explained by the feature i, the feature j, and the interaction of them. λ_1 and λ_2 are hyper-parameters controlling the relative weights of M_B and M_AB with respect to M_A. M_e is the residual covariance matrix representing noise in gene expressions. Formal definitions of these terms are the following:

$$M_A = \frac{1}{a-1} \sum_{i=1}^{a} \sum_{j=1}^{b} (m_{i.} - m_{..})(m_{i.} - m_{..})^T \qquad (3)$$

$$M_B = \frac{1}{b-1} \sum_{i=1}^{a} \sum_{j=1}^{b} (m_{.j} - m_{..})(m_{.j} - m_{..})^T \qquad (4)$$

$$M_{AB} = \frac{1}{(a-1)(b-1)} \sum_{i=1}^{a} \sum_{j=1}^{b} (m_{ij} - m_{i.} - m_{.j} + m_{..})(m_{ij} - m_{i.} - m_{.j} + m_{..})^T \qquad (5)$$

$$M_e = \frac{1}{N-ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \left[ \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} (x_{ijk} - m_{ij})(x_{ijk} - m_{ij})^T \right] \qquad (6)$$

where

$$m_{..} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} m_{ij} \qquad (7)$$

$$m_{i.} = \frac{1}{b} \sum_{j=1}^{b} m_{ij} \qquad (8)$$

$$m_{.j} = \frac{1}{a} \sum_{i=1}^{a} m_{ij} \qquad (9)$$

in which

$$m_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} x_{ijk} \qquad (10)$$

An analogous expression provides the linear projection v* for the feature j, and w* for the interaction of both features i and j. Similar arguments can be applied to the scenario of a partial table to find u* or v* as the linear projection for the feature i or j (see Appendix B for mathematical details). Note that N_A is symmetric and M_e is positive definite. Therefore the optimization problem is a generalized eigenvalue problem [13].
When M_e is invertible, u* is the eigenvector associated with the largest eigenvalue of M_e^{-1} N_A. In general, if we want to embed x_ijk into a d-dimensional subspace aligned with the feature i (d < a), we can take the eigenvectors with the d largest eigenvalues of M_e^{-1} N_A, which we call the top d factorized linear discriminant components (FLDs). Since multi-way ANOVA can handle contingency tables with more than two dimensions, our analysis can be easily generalized to more than two features.
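To make the computation concrete, the following minimal Python/NumPy sketch builds the matrices of Equations (3)-(6) from a labeled expression matrix and solves the generalized eigenvalue problem of Equation (1) for a complete contingency table. The function name flda, the array layout, and the suggestion of an optional ridge are illustrative choices, not part of a released implementation.

import numpy as np
from scipy.linalg import eigh

def flda(X, labels_i, labels_j, lam1=1.0, lam2=1.0, d=1):
    """Factorized LDA for a complete two-feature contingency table.
    X: (cells, genes); labels_i, labels_j: per-cell levels of features i and j."""
    X = np.asarray(X, dtype=float)
    labels_i, labels_j = np.asarray(labels_i), np.asarray(labels_j)
    levels_i, levels_j = np.unique(labels_i), np.unique(labels_j)
    a, b, g, N = len(levels_i), len(levels_j), X.shape[1], X.shape[0]
    # cluster means m_ij (Eq. 10) and marginal means m_i., m_.j, m_.. (Eqs. 7-9)
    m = np.array([[X[(labels_i == li) & (labels_j == lj)].mean(axis=0)
                   for lj in levels_j] for li in levels_i])            # shape (a, b, g)
    m_i, m_j, m_all = m.mean(axis=1), m.mean(axis=0), m.mean(axis=(0, 1))
    # factor covariance matrices (Eqs. 3-5)
    M_A = b * (m_i - m_all).T @ (m_i - m_all) / (a - 1)
    M_B = a * (m_j - m_all).T @ (m_j - m_all) / (b - 1)
    inter = m - m_i[:, None, :] - m_j[None, :, :] + m_all
    M_AB = np.einsum('ijg,ijh->gh', inter, inter) / ((a - 1) * (b - 1))
    # residual (within-cluster) covariance (Eq. 6)
    M_e = np.zeros((g, g))
    for li in levels_i:
        for lj in levels_j:
            Xc = X[(labels_i == li) & (labels_j == lj)]
            R = Xc - Xc.mean(axis=0)
            M_e += R.T @ R / len(Xc)
    M_e /= (N - a * b)
    # if M_e is singular (e.g., without PCA preprocessing), a small ridge could be added here
    N_A = M_A - lam1 * M_B - lam2 * M_AB                               # Eq. 2
    evals, evecs = eigh(N_A, M_e)                                      # solves N_A u = rho M_e u
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:d]]                                         # top-d FLDs for feature i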
3 Sparsity-based regularization of FLDA

For this domain-specific application in neurobiology, there is particular interest in finding a small group of genes that best determine one of the phenotypic features. This leads to finding axes that have only a few non-zero components. To identify such a sparse solution, we solved the following optimization problem:

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T N_A u}{u^T M_e u} \quad \text{subject to} \quad \|u\|_0 \le l \qquad (11)$$

from which the number of non-zero elements of u* is less than or equal to l.

This is known as a sparse generalized eigenvalue problem, which has three unique challenges, as listed in [14]: first, when the data are very high-dimensional, M_e can be singular and non-invertible; second, because of the normalization term u^T M_e u, many solutions for sparse eigenvalue problems cannot be applied directly; finally, this problem involves maximizing a convex objective over a nonconvex set, which is NP-hard.

To solve it, we used truncated Rayleigh flow (Rifle), a method specifically developed to solve sparse generalized eigenvalue problems. The algorithm of Rifle is composed of two steps [14]: first, obtain an initial vector u_0 that is close to u*; we used the solution from the non-sparse FLDA as an initial estimate of u_0. Second, iteratively perform a gradient ascent step on the objective function, and then execute a truncation step that preserves the l entries of u with the largest values and sets the remaining entries to 0. Pseudo-code for this algorithm is presented below:

procedure RIFLE(N_A, M_e, u_0, l, η)    ▷ η is the step size
    t = 1    ▷ t indicates the iteration number
    while not converged do    ▷ converge when u_t ≃ u_{t-1}
        ρ_{t-1} ← (u_{t-1}^T N_A u_{t-1}) / (u_{t-1}^T M_e u_{t-1})
        C ← I + (η / ρ_{t-1}) (N_A − ρ_{t-1} M_e)
        u_t ← C u_{t-1} / ||C u_{t-1}||
        truncate u_t by keeping the top l entries of u_t with the largest values and setting the rest to 0
        u_t ← u_t / ||u_t||
        t ← t + 1
    end while
    return u_t
end procedure

As proved in [14], if there is a unique sparse leading generalized eigenvector, Rifle converges linearly to it with the optimal statistical rate of convergence. The computational complexity of the second step is O(lg + g) for each iteration; therefore, Rifle scales linearly with g, the dimensionality of the original data. Based on the theoretical proof, to guarantee convergence, the hyperparameter η was selected to be sufficiently small such that ηλ_max(M_e) < 1, where λ_max(M_e) is the largest eigenvalue of M_e. In our case, the other hyperparameter l, which indicates how many genes are preserved, was selected empirically based on the design of a follow-up experiment. As mentioned later in Results, we chose l to be 20, a reasonable number of candidate genes to be tested in a biological study.
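A direct transcription of this pseudo-code into Python/NumPy is sketched below; the function name rifle, the convergence tolerance, and the choice to rank entries by magnitude during truncation are illustrative assumptions rather than details fixed by the published algorithm.

import numpy as np

def rifle(N_A, M_e, u0, l, eta, tol=1e-6, max_iter=1000):
    """Truncated Rayleigh flow for max_u (u' N_A u)/(u' M_e u) s.t. ||u||_0 <= l."""
    u = np.asarray(u0, dtype=float)
    u = u / np.linalg.norm(u)
    g = u.size
    I = np.eye(g)
    for _ in range(max_iter):
        rho = (u @ N_A @ u) / (u @ M_e @ u)          # Rayleigh quotient, assumed positive with a good init
        C = I + (eta / rho) * (N_A - rho * M_e)      # gradient-ascent operator
        v = C @ u
        v = v / np.linalg.norm(v)
        keep = np.argsort(np.abs(v))[-l:]            # keep the l entries largest in magnitude
        u_new = np.zeros(g)
        u_new[keep] = v[keep]
        u_new = u_new / np.linalg.norm(u_new)
        if np.linalg.norm(u_new - u) < tol:          # u_t approximately equals u_{t-1}
            return u_new
        u = u_new
    return u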
Related work

FLDA is a method for linear dimensionality reduction [15]. Formally, linear dimensionality reduction is defined as follows: given n data points, each of g dimensions, X = [x_1, x_2, ..., x_n] ∈ R^{g×n}, and a choice of reduced dimensionality r < g, optimize an objective function f(·) to produce a linear projection P ∈ R^{r×g}, where Y = PX ∈ R^{r×n} is the low-dimensional transformed data.

State-of-the-art methods for linear dimensionality reduction include principal component analysis (PCA), factor analysis (FA), linear multidimensional scaling (MDS), linear discriminant analysis (LDA), canonical correlation analysis (CCA), maximum autocorrelation factors (MAF), slow feature analysis (SFA), sufficient dimensionality reduction (SDR), locality preserving projections (LPP), and independent component analysis (ICA) [15]. Some of them are obviously unsuitable for the problem we want to solve; for example, MAF and SFA are developed for data with temporal structures [16, 17], and LPP focuses on the local structures of the data instead of their global organization [18]. The remaining approaches can be roughly grouped into unsupervised or supervised linear dimensionality reduction, and we discuss them separately.

Unsupervised linear dimensionality reduction methods, including PCA [19], ICA [20], FA [21], and more, project data into a low-dimensional space without using supervision labels. They are not suitable for our problem, due to the specific characteristics of gene expression data. The dimensionality of gene expression data is usually very high, with tens of thousands of genes, and the expression of a fair number of them can be very noisy. These noisy genes cause large variance among individual cells, but in an unstructured way. Without supervision signals from the phenotypic features, unsupervised methods tend to select these genes to construct the low-dimensional space, which offers neither the desired alignment nor a good separation of cell type clusters. To illustrate this, we performed PCA on the gene expression data and compared it with FLDA. Briefly, we solved the following objective to find the linear projection:

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T X X^T u}{u^T u} \qquad (12)$$

The outcome of this comparison is shown in Results.

Supervised linear dimensionality reduction, represented by LDA [22, 23] and CCA [24, 25], can overcome the above issue. By including supervised signals of phenotypic features, we can devalue genes whose expressions are non-informative about the phenotypes.
We name our method FLDA because its objective function has a similar format to that of LDA. LDA also models the difference among data organized in pre-determined classes. Formally, LDA solves the following optimization problem:

$$u^* = \arg\max_{u \in \mathbb{R}^g} \frac{u^T \Sigma_b u}{u^T \Sigma_e u} \qquad (13)$$

where Σ_b and Σ_e are estimates of the between-class and within-class covariance matrices respectively. Different from FLDA, the representation of these classes is not explicitly formulated as a contingency table composed of multiple features. The consequence is that, when applied to the example problem in which neuronal types are organized into a two-dimensional contingency table with phenotypic features i and j, in general, axes from LDA are not aligned with these two phenotypic features.

However, in the example above, we can perform two separate LDAs for the two features. This allows the axes from each LDA to align with its specific feature. We call this approach "2LDAs". There are two limitations of this approach: first, it discards information about the component that depends on the interaction of the two features, which cannot be explained by a linear combination of them; second, it explicitly maximizes the segregation of cells with different feature levels, which sometimes is not consistent with a good separation of cell type clusters. Detailed comparisons between LDA, "2LDAs", and FLDA can be found in Results.
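For reference, the "2LDAs" baseline can be assembled from standard tools; the following sketch uses scikit-learn's LinearDiscriminantAnalysis, with the function and variable names chosen purely for illustration.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def two_ldas(X, labels_i, labels_j, d_i=1, d_j=1):
    """Fit one LDA per phenotypic feature and stack the resulting axes."""
    lda_i = LinearDiscriminantAnalysis(n_components=d_i).fit(X, labels_i)
    lda_j = LinearDiscriminantAnalysis(n_components=d_j).fit(X, labels_j)
    # each returned column is aligned with one feature; the interaction of the
    # two features is not modeled, and cell-type clusters are not used directly
    return np.hstack([lda_i.transform(X), lda_j.transform(X)])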
Figure 2: Quantitative comparison between FLDA and other models. (A) Illustration of data synthesis. See Appendix C for implementation details. The color bar indicates the expression values of the ten generated genes. (B) Normalized overall SNR metric of each analysis. The SNR values are normalized with respect to that of LDA. (C) Overall modularity score for each analysis. Error bars in (B,C) denote standard errors, each calculated from 10 repeated simulations.
CCA projects two datasets X_a ∈ R^{g×n} and X_b ∈ R^{d×n} to Y_a ∈ R^{r×n} and Y_b ∈ R^{r×n}, such that the correlation between Y_a and Y_b is maximized. Formally, it tries to maximize this objective:

$$(u^*, v^*) = \arg\max_{u \in \mathbb{R}^g,\ v \in \mathbb{R}^d} u^T (X_a X_a^T)^{-1/2}\, X_a X_b^T\, (X_b X_b^T)^{-1/2}\, v\, (u^T u)^{-1/2} (v^T v)^{-1/2} \qquad (14)$$

To apply CCA to our problem, we need to set X_a to be the gene expression matrix, and X_b to be the matrix of d phenotypic features (d = 2 for the two features illustrated later). In contrast with FLDA, CCA finds a transformation of gene expressions aligned with a linear combination of phenotypic features, instead of a factorization of gene expressions corresponding to individual phenotypic features. This difference is quantified and shown in Results.

In order to quantitatively compare FLDA with other linear dimensionality reduction methods, such as PCA, CCA, LDA, and the "2LDAs" approach, we created synthetic datasets. Four types of cells, each containing 25 examples, were generated from a Cartesian product of two features i and j, organized in a 2x2 complete contingency table. Expressions of 10 genes were generated for these cells, in which the levels of Genes 1-8 were correlated with either the feature i, the feature j, or their interactions, and the levels of the remaining 2 genes were purely driven by noise (Figure 2A). Details of generating the data can be found in Appendix C.

To illustrate FLDA in analyzing single-cell RNA-Seq datasets for real problems of neurobiology, and to demonstrate the merit of our approach in selecting a few important genes for each phenotype, we used a dataset of Drosophila T4/T5 neurons [5]. T4 and T5 neurons are very similar in terms of general morphology and physiological properties, but they differ by the location of their dendrites in the medulla and lobula, two distinct brain regions. T4 and T5 neurons each contain four subtypes, with each T4/T5 subtype pair laminating its axons in a specific layer of the lobula plate (Figure 3A). Therefore, we can use two phenotypic features to describe these neurons: the feature i indicates the dendritic location at the medulla or lobula; the feature j describes the axonal lamination at one of the four layers (a/b/c/d) (Figure 3B). In this experiment, we focused on the dataset containing expression data of 17492 genes from 3833 cells collected at a defined time during brain development.

Figure 3: FLDA on the dataset of T4/T5 neurons. (A) T4/T5 neuronal types and their dendritic and axonal phenotypes. (B) T4/T5 neurons can be organized in a complete contingency table. Here i indicates the dendritic location and j indicates the axonal termination. (C) SNR metric of each discriminant axis. (D) Projection of the data into the three-dimensional space consisting of the discriminant axis for the feature i (FLD_i) and the first and second discriminant axes for the feature j (FLD_j1 and FLD_j2). (E-G) Projection of the data into the two-dimensional space made of FLD_i and FLD_j1 (E), FLD_j1 and FLD_j2 (F), or FLD_j2 and FLD_j3 (the third discriminant axis for the feature j) (G). Different cell types are indicated by different colors as in (A) and (D).
The T4/T5 neuron dataset was preprocessed as previously reported [8, 10, 26]. Briefly, transcript counts within each column of the count matrix (genes × cells) were normalized to sum to the median number of transcripts per cell, resulting in the normalized counts Transcripts-per-median, or TPM_gc, for Gene g in Cell c. We used the log-transformed expression data E_gc = ln(TPM_gc + 1) for further analysis. We adopted a common approach in single-cell RNA-Seq studies, based on fitting a relationship between the mean and the coefficient of variation [27, 28], to select highly variable genes, and performed FLDA on the expression data with only these genes. We preprocessed the data with PCA and kept principal components (PCs) explaining ~99% of the total variance before running FLDA, but not before running the sparse version of the algorithm.
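The preprocessing pipeline can be summarized by the following Python/NumPy sketch. The dispersion-based gene selection shown here is a simplified stand-in for the mean-CV fitting procedure of [27, 28], and the cutoff values are illustrative assumptions.

import numpy as np

def preprocess(counts, n_top_genes=2000, var_frac=0.99):
    """counts: raw transcript counts, shape (genes, cells)."""
    counts = np.asarray(counts, dtype=float)
    # normalize each cell to the median number of transcripts per cell (TPM_gc)
    per_cell = counts.sum(axis=0)
    tpm = counts / per_cell * np.median(per_cell)
    E = np.log(tpm + 1.0)                                  # E_gc = ln(TPM_gc + 1)
    # crude highly-variable-gene selection: rank genes by coefficient of variation
    mean, std = E.mean(axis=1), E.std(axis=1)
    cv = np.divide(std, mean, out=np.zeros_like(std), where=mean > 0)
    hvg = np.argsort(cv)[-n_top_genes:]
    X = E[hvg].T                                           # (cells, selected genes)
    # PCA, keeping components that explain ~99% of the total variance
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(explained, var_frac)) + 1
    return Xc @ Vt[:k].T, hvg                              # PC scores and indices of kept genes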
In the experiments below, we set the hyper-parameters λ_1 and λ_2 in Equation (2) to 1.

We included the following metrics to evaluate our method. A signal-to-noise ratio (SNR) measures how well each discriminant axis separates cell types compared with noise estimated from the variance within cell type clusters. The explained variance (EV) for each discriminant axis measures how much variance of the feature i or j is explained among the total variance explained by that axis. The mutual information (MI) between each discriminant axis and each feature quantifies how "informative" an axis is about a specific feature. Building on the calculation of MI, we included the modularity score, which measures whether each discriminant axis depends on at most one feature [29]. The implementation details of these metrics can be found in Appendix D.
Results

To quantitatively compare FLDA with the alternative models, including PCA, CCA, LDA, and "2LDAs", we measured the proposed metrics on analyses of the synthesized datasets (Figure 2A). Given that the synthesized data were organized in a 2x2 contingency table, each LDA of the "2LDAs" approach could find only one dimension for the specific feature i or j. Therefore, as a fair comparison, we only included the corresponding dimensions in FLDA (FLD_i and FLD_j) and the top two components of PCA, CCA, and LDA. The overall SNR values normalized by that of LDA and the overall modularity scores were plotted for data generated with different noise levels (Figure 2B,C). The performance of PCA is the worst among all these models because the unsupervised approach cannot prevent the noise from contaminating the signal. The supervised approaches in general have good SNRs, but LDA and CCA suffer from low modularity scores. This is expected because LDA maximizes the separation of cell type clusters but overlooks the alignment of the axes with the feature i or j, and CCA maximizes the correlation to a linear combination of phenotypic features instead of individual ones. By contrast, "2LDAs" achieves the highest modularity scores but has the worst SNR among the supervised approaches, because it tries to maximize the separation of cells with different feature levels, which is not necessarily consistent with maximizing the segregation of cell types. Both the SNR value and the modularity score of FLDA are close to optimal, as it not only considers the alignment of axes to different features but also constrains the variance within cell types. A representative plot of the EV and MI metrics of these models is shown in Figure 5, reporting good alignment of axes to either the feature i or j in FLDA and "2LDAs", but not in the others.

A question of significance in neurobiology is whether the diverse phenotypes of neuronal cell types are generated by combinations of modular transcriptional programs, and if so, what the gene signature is for each of the programs. To illustrate the ability of our approach to address this problem, we applied FLDA to the dataset of Drosophila T4/T5 neurons. The T4/T5 neurons can be organized in a 2x4 contingency table; therefore, FLDA projects the expression data into a subspace of seven dimensions, with one FLD aligned with dendritic location i (FLD_i), three FLDs aligned with axonal termination j (FLD_j1-3), and the remaining three representing the interaction of both phenotypes (FLD_ij1-3). We ranked these axes based on their SNR metrics and found that FLD_j1, FLD_i, and FLD_j2 have much higher SNRs than the rest (Figure 3C). Indeed, data representations in the subspace consisting of these three dimensions show a clear separation of the eight neuronal types (Figure 3D). As expected, FLD_i teases apart T4 from T5 neurons, whose dendrites are located in different brain regions (Figure 3E). Interestingly, FLD_j1 separates T4/T5 neurons into two groups, a/b vs c/d, corresponding to the upper or lower lobula plate, and FLD_j2 divides them into another two, a/d vs b/c, indicating whether their axons laminate at the middle or lateral part of the lobula plate (Figure 3E,F). Unexpectedly, among these three dimensions, FLD_j1 has a much higher SNR than FLD_i and FLD_j2, whose SNR values are similar.
This suggests a hierarchical structure in the genetic organization of T4/T5 neurons: they are first separated into either a/b or c/d types, and subsequently divided into each of the eight subtypes. In fact, this exactly matches the sequence of their cell fate determination during development, as revealed in a previous genetic study [30]. Finally, the last discriminant axis of the axonal feature, FLD_j3, separates the group a/c from b/d, suggesting its role in fine-tuning the axonal depth within the upper or lower lobula plate (Figure 3G).

To seek gene signatures for the discriminant components in FLDA, we applied the sparsity-based regularization to constrain the number of genes with non-zero weight coefficients. Here we set the number to 20, a reasonable number of candidate genes that might be tested in a follow-up biological study. We extracted a list of 20 genes each for the axis of FLD_i or FLD_j. The relative importance of these genes to each axis is directly informed by their weight values (Figure 4A,C). Side by side, we plotted expression profiles of these genes in the eight neuronal types (Figure 4B,D). For both axes, the genes critical in separating cells with different feature levels are differentially expressed in the corresponding cell types. We compared our gene lists with those obtained using conventional methods, which were reported in [5]. Consistent with that report, we found indicator genes for dendritic location, such as TfAP-2, dpr, and two CG-numbered genes, and those for axonal lamination, such as klg, bi, and pros. In addition, we found genes that were not reported in this previous study. For example, our results suggest that the genes Thor and pHCl are important to the dendritic phenotype, and Lac and
Mip are critical to the axonal phenotype. These are promising genetic targets to be tested in biological experiments. Lastly, FLDA allowed us to examine the component that depends on the
interaction of both features and identify its gene signature, which provides clues to the transcriptional regulation of gene expression in the T4/T5 neuronal types (Figures 6 and 7).

Figure 4: Critical genes extracted from the sparse algorithm. (A) Weight vector of the 20 genes selected for the dendritic phenotype (FLD_i). The weight value is indicated in the color bar, with color indicating direction (red: positive and green: negative) and saturation indicating magnitude. (B) Expression patterns of the 20 genes from (A) in the eight types of T4/T5 neurons. Dot size indicates the percentage of cells in which the gene was expressed, and color represents average scaled expression. (C) Weight vector of the 20 genes selected for the axonal phenotype (FLD_j). Legend as in (A). (D) Expression patterns of the 20 genes from (C) in the eight types of T4/T5 neurons. Legend as in (B).

As a supervised approach, FLDA depends on correct phenotype labels to extract meaningful information. But if the phenotypes are annotated incorrectly, can we use FLDA to raise a flag? We propose a perturbation analysis of FLDA to address this question, built on the assumption that among possible phenotype annotations, the projection of gene expression data based on correct labels leads to better metric measurements than incorrect ones. As detailed in Appendix E, we generated three kinds of incorrect labels for the dataset of T4/T5 neurons, corresponding to three common scenarios of mislabeling: the phenotypes of one cell type were mislabeled with those of another type; a single phenotypic level was incorrectly split into two; or two phenotypic levels were incorrectly merged into one. FLDA was applied to gene expressions of T4/T5 neurons with these perturbed annotations. The proposed metrics, such as the SNR value and modularity score, are plotted in Figure 8. Indeed, the projection of gene expressions with the correct annotation leads to the best SNR value and modularity score compared with incorrect annotations. This implies that this type of perturbation analysis is a useful practice in general: it raises confidence that the original annotation is correct if FLDA on perturbed annotations produces lower metric scores.

Discussion

We developed FLDA, a novel supervised linear dimensionality reduction method for understanding the relationship between high-dimensional gene expression patterns and cellular phenotypes. We illustrate the power of FLDA by analyzing a gene expression dataset from Drosophila T4/T5 neurons that are labeled by two phenotypic features, each with multiple levels. The approach allowed us to identify new genes for each of the phenotypic features that were not apparent under conventional methods. Furthermore, we found a hierarchical relationship in the genetic organization of these cells. These findings point the way for new biological experiments.

The approach is motivated by multi-way ANOVA, and thus it generalizes easily to more than two features. Future applications in neurobiology include the analysis of phenotypic characteristics such as electrophysiology and connectivity [2, 3, 31]. More generally, FLDA can be applied to any labeled data set for which the labels form a Cartesian product of multiple features. For example, this would include face images that can be jointly labeled by the age, gender, and other features of a person [32, 33].

FLDA factorizes gene expression data into features and their interactions, and finds a linear projection of the data that varies with only one factor but not the others.
This provides a modular representation aligned with the factors [34]. Ridgeway and Mozer (2018) argued that modularity together with explicitness could define disentangled representations. Our approach is linear, which presents an explicit mapping between gene expressions and phenotypic features; therefore, our approach can potentially serve as a supervised approach to disentanglement [35-37].

Compared with other non-linear embedding methods for cell types [38-40], the linear nature of FLDA makes it extremely easy to interpret the low-dimensional representations, as the weight vector directly informs the relative importance of each gene. To allow the selection of a small set of critical genes, we leveraged our approach with sparse regularization. This makes FLDA especially useful to experimentalists, who can take the list of genes and test them in a subsequent round of experiments.

Finally, FLDA can be combined with other sequencing approaches to reveal insights into neuronal types. In one approach, one can measure a neuron's electrical properties and image its shape before drawing out the cellular contents for sequencing [41]. In still other methods, the sequencing is done in situ with the cell still in the tissue, allowing an assessment of its anatomy and other features [42]. FLDA applies to all these cases.
Acknowledgments

We thank Jialong Jiang, Yuxin Chen, Oisin Mac Aodha, Matt Thomson, Lior Pachter, Andrew MacMahon, Christof Koch, Yu-li Ni, Karthik Shekhar, and Joshua R. Sanes for helpful discussions. This work was supported by a grant from the Simons Foundation (SCGB 543015, M.M.), a postdoctoral fellowship from the Swartz Foundation, and an NVIDIA GPU grant (M.Q.).
Appendix A

Here we derive the objective functions used in our analysis. Again, if x_ijk (k = 1, 2, ..., n_ij) represents the expression values of g genes in each cell (x_ijk ∈ R^g), we seek a linear projection y_ijk = u^T x_ijk that is aligned with the feature i.

We asked what is the best way to factorize y_ijk. Inspired by multi-way ANOVA [4], we identified three components: one depending on the feature i, another depending on the feature j, and the last one depending on the interaction of both features. We therefore followed the procedures of ANOVA to partition sums of squares and factorize y_ijk into these three components.

Let us first assume that all cell types defined by i and j contain the same number of cells. With cell types represented by a complete contingency table (Figure 1A), y_ijk can be linearly factorized using the model of two crossed factors. Formally, the linear factorization is the following:

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk} \qquad (15)$$

where y_ijk represents the coordinate of the k-th cell in the category defined by i and j; µ is the average level of y; α_i is the component that depends on the feature i, and β_j is the component that depends on the feature j; (αβ)_ij describes the component that depends on the interaction of both features i and j; ε_ijk ∼ N(0, σ²) is the residual of this factorization.

Let us say that the features i and j fall into a and b discrete categories respectively. Then without loss of generality, we can require:

$$\sum_{i=1}^{a} \alpha_i = 0 \qquad (16)$$

$$\sum_{j=1}^{b} \beta_j = 0 \qquad (17)$$

$$\sum_{i=1}^{a} (\alpha\beta)_{ij} = \sum_{j=1}^{b} (\alpha\beta)_{ij} = 0 \qquad (18)$$

Corresponding to these, there are three null hypotheses:

$$H_1: \alpha_i = 0 \qquad (19)$$

$$H_2: \beta_j = 0 \qquad (20)$$

$$H_3: (\alpha\beta)_{ij} = 0 \qquad (21)$$

Here we want to reject H_1 while accepting H_2 and H_3, in order that y_ijk is aligned with the feature i.

Next, we partition the total sum of squares. If the number of cells within each cell type category is n, and the total number of cells is N, then we have

$$\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{...})^2 = bn \sum_{i=1}^{a} (\bar{y}_{i..} - \bar{y}_{...})^2 + an \sum_{j=1}^{b} (\bar{y}_{.j.} - \bar{y}_{...})^2 + n \sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n} (y_{ijk} - \bar{y}_{ij.})^2 \qquad (22)$$

where \bar{y} is the average of y_ijk over the indices indicated by the dots. Equation (22) can be written as

$$SS_T = SS_A + SS_B + SS_{AB} + SS_e \qquad (23)$$

with the terms having degrees of freedom N-1, a-1, b-1, (a-1)(b-1), and N-ab respectively. Here SS_A, SS_B, SS_AB, and SS_e are the partitioned sums of squares for the factors α_i, β_j, (αβ)_ij, and the residual.

ANOVA rejects or accepts a null hypothesis by comparing its mean square (the partitioned sum of squares normalized by the degrees of freedom) to that of the residual. This is done by constructing F-statistics for each factor as shown below:

$$F_A = \frac{MS_A}{MS_e} = \frac{SS_A/(a-1)}{SS_e/(N-ab)} \qquad (24)$$

$$F_B = \frac{MS_B}{MS_e} = \frac{SS_B/(b-1)}{SS_e/(N-ab)} \qquad (25)$$

$$F_{AB} = \frac{MS_{AB}}{MS_e} = \frac{SS_{AB}/((a-1)(b-1))}{SS_e/(N-ab)} \qquad (26)$$

Under the null hypotheses, the F-statistics follow the F-distribution. Therefore, a null hypothesis is rejected when we observe the value of an F-statistic above a certain threshold calculated from the F-distribution. Here we want F_A to be large enough so that we can reject H_1, but F_B and F_AB to be small enough for us to accept H_2 and H_3. In other words, we want to maximize F_A while minimizing F_B and F_AB.
Therefore, we propose maximizing an objective L:

$$L = F_A - \lambda_1 F_B - \lambda_2 F_{AB} \qquad (27)$$

where λ_1 and λ_2 are hyper-parameters determining the relative weights of F_B and F_AB compared with F_A.

When the numbers of cells within categories defined by i and j (n_ij) are not all the same, the total sum of squares cannot be partitioned as in Equation (22). However, if we only care about distinctions between cell types instead of individual cells, we can use the mean value of each cell type cluster (ȳ_ij.) to estimate the overall average value (ỹ_...), and the average value of each category i (ỹ_i..) or j (ỹ_.j.). Therefore, Equation (22) can be modified as the following:

$$\sum_{i=1}^{a}\sum_{j=1}^{b}\left[\frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} (y_{ijk} - \tilde{y}_{...})^2\right] = b\sum_{i=1}^{a} (\tilde{y}_{i..} - \tilde{y}_{...})^2 + a\sum_{j=1}^{b} (\tilde{y}_{.j.} - \tilde{y}_{...})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{y}_{ij.} - \tilde{y}_{i..} - \tilde{y}_{.j.} + \tilde{y}_{...})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\left[\frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij.})^2\right] \qquad (28)$$

where

$$\bar{y}_{ij.} = \frac{\sum_{k=1}^{n_{ij}} y_{ijk}}{n_{ij}} \qquad (29)$$

$$\tilde{y}_{i..} = \frac{\sum_{j=1}^{b} \bar{y}_{ij.}}{b} \qquad (30)$$

$$\tilde{y}_{.j.} = \frac{\sum_{i=1}^{a} \bar{y}_{ij.}}{a} \qquad (31)$$

$$\tilde{y}_{...} = \frac{\sum_{i=1}^{a}\sum_{j=1}^{b} \bar{y}_{ij.}}{ab} \qquad (32)$$

If we describe Equation (28) as:

$$\tilde{SS}_T = \tilde{SS}_A + \tilde{SS}_B + \tilde{SS}_{AB} + \tilde{SS}_e \qquad (33)$$

then following the same arguments, we want to maximize an objective function in the following format:

$$L = \frac{\frac{\tilde{SS}_A}{a-1} - \lambda_1 \frac{\tilde{SS}_B}{b-1} - \lambda_2 \frac{\tilde{SS}_{AB}}{(a-1)(b-1)}}{\frac{\tilde{SS}_e}{N-ab}} \qquad (34)$$

A.1.3 Objective functions under a partial contingency table

When we have a representation of a partial table, we can no longer separate out the component that depends on the interaction of both features. Therefore, we use another model, a linear model of two nested factors, to factorize y_ijk, which has the following format:

$$y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk} \qquad (35)$$

Note that we now have β_{j(i)} instead of β_j + (αβ)_ij. In this model, we identify a primary factor, for instance, the feature denoted by i which falls into a categories, and the other (indexed by j) becomes a secondary factor, the number of whose levels b_i depends on the level of the primary factor. We merge the component depending on the interaction of both features into that of the secondary factor as β_{j(i)}.

Similarly, we have

$$\sum_{i=1}^{a}\sum_{j=1}^{b_i}\left[\frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} (y_{ijk} - \tilde{y}_{...})^2\right] = \sum_{i=1}^{a}\left[\sum_{j=1}^{b_i} (\tilde{y}_{i..} - \tilde{y}_{...})^2\right] + \sum_{i=1}^{a}\sum_{j=1}^{b_i} (\bar{y}_{ij.} - \tilde{y}_{i..})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b_i}\left[\frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij.})^2\right] \qquad (36)$$

which can be written as

$$\tilde{SS}_T = \tilde{SS}_A + \tilde{SS}_B + \tilde{SS}_e \qquad (37)$$

with degrees of freedom N-1, a-1, M-a, and N-M for each of the terms, where M is:

$$M = \sum_{i=1}^{a} b_i \qquad (38)$$

Therefore, we want to maximize the following objective:

$$L = \frac{\frac{\tilde{SS}_A}{a-1} - \lambda \frac{\tilde{SS}_B}{M-a}}{\frac{\tilde{SS}_e}{N-M}} \qquad (39)$$

Appendix B

Here we provide the mathematical details of FLDA under the representation of a partial table. When we have a partial table, if the feature i is the primary feature with a levels, and the feature j is the secondary feature with b_i levels, then N_A in Equation (1) is defined as follows:

$$N_A = M_A - \lambda M_{B|A} \qquad (40)$$

where

$$M_A = \frac{1}{a-1} \sum_{i=1}^{a}\sum_{j=1}^{b_i} (m_{i.} - m_{..})(m_{i.} - m_{..})^T \qquad (41)$$

$$M_{B|A} = \frac{1}{M-a} \sum_{i=1}^{a}\sum_{j=1}^{b_i} (m_{ij} - m_{i.})(m_{ij} - m_{i.})^T \qquad (42)$$

and M is defined as in Equation (38). Correspondingly, M_e in Equation (1) is defined as:

$$M_e = \frac{1}{N-M} \sum_{i=1}^{a}\sum_{j=1}^{b_i}\left[\frac{1}{n_{ij}}\sum_{k=1}^{n_{ij}} (x_{ijk} - m_{ij})(x_{ijk} - m_{ij})^T\right] \qquad (43)$$

and
$$m_{..} = \frac{1}{M} \sum_{i=1}^{a}\sum_{j=1}^{b_i} m_{ij} \qquad (44)$$

$$m_{i.} = \frac{1}{b_i} \sum_{j=1}^{b_i} m_{ij} \qquad (45)$$

The remaining mathematical arguments are the same as those for the complete table. In this scenario, because we don't observe all possible combinations of features i and j, we cannot find the linear projection for the interaction of both features.

Appendix C

To quantitatively compare FLDA with alternative approaches, we synthesized data of four cell types, each of which contained 25 cells. The four cell types were generated from a Cartesian product of two features i and j, where i ∈ {0, 1} and j ∈ {0, 1}. Expressions of 10 genes were generated for each cell. The expression value of the h-th gene in the k-th cell of the cell type ij, g_hijk, was defined as the following:

$$g_{1ijk} = i + \epsilon_{1ijk} \qquad (46)$$

$$g_{2ijk} = j + \epsilon_{2ijk} \qquad (47)$$

$$g_{3ijk} = i \wedge j + \epsilon_{3ijk} \qquad (48)$$

$$g_{4ijk} = i \vee j + \epsilon_{4ijk} \qquad (49)$$

$$g_{5ijk} = 2i + \epsilon_{5ijk} \qquad (50)$$

$$g_{6ijk} = 2j + \epsilon_{6ijk} \qquad (51)$$

$$g_{7ijk} = 2(i \wedge j) + \epsilon_{7ijk} \qquad (52)$$

$$g_{8ijk} = 2(i \vee j) + \epsilon_{8ijk} \qquad (53)$$

$$g_{9ijk} = \epsilon_{9ijk} \qquad (54)$$

$$g_{10ijk} = 2 + \epsilon_{10ijk} \qquad (55)$$

where

$$i \wedge j = \begin{cases} 1, & \text{if } i = 1, j = 1 \\ 0, & \text{otherwise} \end{cases} \qquad (56)$$

and

$$i \vee j = \begin{cases} 0, & \text{if } i = 0, j = 0 \\ 1, & \text{otherwise} \end{cases} \qquad (57)$$

were interactions of the two features. Here ε_hijk was driven by Gaussian noise, namely,

$$\epsilon_{hijk} \sim N(0, \sigma^2) \qquad (58)$$

We synthesized datasets at 5 different values of σ. This was repeated 10 times, and metrics for each σ value were calculated as the average across the 10 repeats.
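The generative scheme of Equations (46)-(58) corresponds to the following Python/NumPy sketch; the default noise level and random seed shown here are illustrative assumptions.

import numpy as np

def synthesize(n_per_type=25, sigma=0.5, seed=0):
    """2x2 synthetic dataset: 4 cell types x n_per_type cells, 10 genes (Eqs. 46-55)."""
    rng = np.random.default_rng(seed)
    blocks, labels_i, labels_j = [], [], []
    for i in (0, 1):
        for j in (0, 1):
            i_and_j = int(i == 1 and j == 1)                # Eq. 56
            i_or_j = 0 if (i == 0 and j == 0) else 1        # Eq. 57
            signal = np.array([i, j, i_and_j, i_or_j,
                               2 * i, 2 * j, 2 * i_and_j, 2 * i_or_j,
                               0, 2], dtype=float)           # genes 1-10
            noise = rng.normal(0.0, sigma, size=(n_per_type, 10))    # Eq. 58
            blocks.append(signal + noise)
            labels_i += [i] * n_per_type
            labels_j += [j] * n_per_type
    return np.vstack(blocks), np.array(labels_i), np.array(labels_j)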
Appendix D

We measured the following metrics in our experiments.

Because we care about the separation of cell types, we define the SNR metric as the ratio of the variance between cell types over the variance of the noise, which is estimated from the within-cluster variance. For the entire embedding space, given q cell types, if the coordinate of each cell is indicated by c, then we define the overall SNR metric as the following:

$$SNR_{overall} = \frac{tr\left(\sum_{p=1}^{q} n_p (\bar{c}_{p.} - \bar{c}_{..})(\bar{c}_{p.} - \bar{c}_{..})^T\right)}{tr\left(\sum_{p=1}^{q} \sum_{k=1}^{n_p} (c_{pk} - \bar{c}_{p.})(c_{pk} - \bar{c}_{p.})^T\right)} \qquad (59)$$

where c̄_p. is the center of each cell type cluster, and c̄_.. is the center of all data points.

Let c denote the embedded coordinate along a specific dimension. The SNR metric for that axis is therefore:

$$SNR = \frac{\sum_{p=1}^{q} n_p (\bar{c}_{p.} - \bar{c}_{..})^2}{\sum_{p=1}^{q} \sum_{k=1}^{n_p} (c_{pk} - \bar{c}_{p.})^2} \qquad (60)$$

We want to know whether the variation of a specific dimension is strongly explained by that of a specific feature. Therefore, we measure, for each axis, how much of the total explained variance is explained by the variance of the feature i or j. Formally, given the embedded coordinate y_ijk, we calculate the EV as the following:

$$EV_i = \frac{\sum_{i=1}^{a}\sum_{j=1}^{b} n_{ij} (\bar{y}_{i..} - \bar{y}_{...})^2}{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{...})^2} \qquad (61)$$

$$EV_j = \frac{\sum_{i=1}^{a}\sum_{j=1}^{b} n_{ij} (\bar{y}_{.j.} - \bar{y}_{...})^2}{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{...})^2} \qquad (62)$$

where ȳ is the average of y_ijk over the indices indicated by the dots.

The MI between a discriminant axis u and a feature quantifies how much information about the feature is obtained by observing data projected along that axis. It is calculated as the MI between the data representations along the axis, y = u^T X, and the feature labels of the data, f, where X is the original gene expression matrix:

$$I(y, f) = H(y) + H(f) - H(y, f) = -\sum_{y \in Y} p(y) \log p(y) - \sum_{f \in F} p(f) \log p(f) + \sum_{y \in Y}\sum_{f \in F} p(y, f) \log p(y, f) \qquad (63)$$

Here H indicates entropy. To calculate H(y) and H(y, f), we discretize y into 10 bins.

Ridgeway and Mozer (2018) argued that in a modular representation, each axis should depend on at most a single feature. Following the arguments in their paper, the modularity score is computed as follows: we first calculate the MI between each feature and each axis (m_if denotes the MI between one axis i and one feature f). If an axis is perfectly modular, it will have high mutual information for only one feature and zeros for the others; we therefore compute a template t_if as the following:

$$t_{if} = \begin{cases} \theta_i, & \text{if } f = \arg\max_g m_{ig} \\ 0, & \text{otherwise} \end{cases} \qquad (64)$$

where θ_i = max_g m_ig. We then calculate the deviation from the template as:

$$\delta_i = \frac{\sum_f (m_{if} - t_{if})^2}{\theta_i^2 (N - 1)} \qquad (65)$$

where N is the number of features. The modularity score for the axis i is 1 - δ_i. The mean of 1 - δ_i over i is defined as the overall modularity score.
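For illustration, the per-axis SNR, MI, and modularity metrics can be computed as in the following Python/NumPy sketch; the function names are ours, and the MI estimate uses the simple 10-bin histogram discretization described above.

import numpy as np

def snr_per_axis(y, types):
    # Equation (60): between-cluster variance over within-cluster variance along one axis
    y, types = np.asarray(y, dtype=float), np.asarray(types)
    levels = np.unique(types)
    centers = np.array([y[types == t].mean() for t in levels])
    grand = y.mean()
    between = sum(np.sum(types == t) * (c - grand) ** 2 for t, c in zip(levels, centers))
    within = sum(np.sum((y[types == t] - c) ** 2) for t, c in zip(levels, centers))
    return between / within

def mutual_info(y, f, bins=10):
    # Equation (63): MI between one (discretized) embedding axis and one feature
    f_codes = np.unique(f, return_inverse=True)[1]
    joint, _, _ = np.histogram2d(np.asarray(y, dtype=float), f_codes,
                                 bins=(bins, len(np.unique(f_codes))))
    p_yf = joint / joint.sum()
    p_y = p_yf.sum(axis=1, keepdims=True)
    p_f = p_yf.sum(axis=0, keepdims=True)
    nz = p_yf > 0
    return float(np.sum(p_yf[nz] * np.log(p_yf[nz] / (p_y @ p_f)[nz])))

def modularity_scores(Y, features):
    # Equations (64)-(65): each axis should carry MI about at most one feature
    m = np.array([[mutual_info(Y[:, i], f) for f in features] for i in range(Y.shape[1])])
    theta = np.maximum(m.max(axis=1), 1e-12)           # guard against an all-zero-MI axis
    template = np.zeros_like(m)
    template[np.arange(m.shape[0]), m.argmax(axis=1)] = theta
    delta = ((m - template) ** 2).sum(axis=1) / (theta ** 2 * (m.shape[1] - 1))
    return 1.0 - delta                                  # one score per embedding axis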
Appendix E

To evaluate the effect of mislabeling phenotypic levels, we made use of the dataset of T4/T5 neurons and generated three kinds of perturbations of the original labels.

First, we switched the phenotype labels of T4a neurons with those of one of the seven other types (T4b, T4c, T4d, T5a, T5b, T5c, T5d). In this scenario, the phenotype labels of two cell types were incorrect, but the number of cell type clusters stayed the same. We had two levels of the dendritic phenotype (T4/T5) and four levels of the axonal phenotype (a/b/c/d). Therefore we kept one dimension for the dendritic feature and three dimensions for the axonal feature.

Second, we merged the axonal phenotypic level a with another level (b/c/d), producing an incorrect new level (a+b/a+c/a+d). In this scenario, we had three axonal phenotypes; therefore we kept two dimensions for the axonal feature.

Third, we randomly split each of the four axonal lamination labels (a/b/c/d) into two levels. For instance, among neurons with the original axonal level a, some were labeled with a level a1 and the others with a level a2. In this scenario, we had eight axonal phenotypes (a1/a2/b1/b2/c1/c2/d1/d2), and we kept seven dimensions for the axonal feature.

We performed FLDA on the dataset of T4/T5 neurons with these perturbed annotations. Metrics from each of the perturbed annotations were measured and compared with those from the original annotation.
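The following Python/NumPy sketch illustrates how such perturbed annotations could be generated; the particular swap partner (T5b) and merge partner (b) are arbitrary examples of the variants described above, not the full set evaluated.

import numpy as np

def perturbed_annotations(dendrite, axon, seed=0):
    """Return one example each of the swap, merge, and split perturbations."""
    rng = np.random.default_rng(seed)
    dendrite = np.asarray(dendrite).astype(object)
    axon = np.asarray(axon).astype(object)

    # 1) mislabel one type as another: here T4a cells get T5b's labels and vice versa
    swap_d, swap_a = dendrite.copy(), axon.copy()
    t4a = (dendrite == 'T4') & (axon == 'a')
    t5b = (dendrite == 'T5') & (axon == 'b')
    swap_d[t4a], swap_a[t4a] = 'T5', 'b'
    swap_d[t5b], swap_a[t5b] = 'T4', 'a'

    # 2) merge two axonal levels into one incorrect level (here a and b)
    merge_a = axon.copy()
    merge_a[(axon == 'a') | (axon == 'b')] = 'a+b'

    # 3) randomly split each axonal level into two spurious levels
    split_a = axon.copy()
    for level in np.unique(axon.astype(str)):
        idx = np.flatnonzero(axon == level)
        half = rng.choice(idx, size=len(idx) // 2, replace=False)
        split_a[idx] = level + '1'
        split_a[half] = level + '2'
    return (swap_d, swap_a), (dendrite, merge_a), (dendrite, split_a)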
References

[1] Joshua R. Sanes and Richard H. Masland. The types of retinal ganglion cells: Current status and implications for neuronal classification. Annu. Rev. Neurosci., 38:221–246, July 2015. [2] Hongkui Zeng and Joshua R. Sanes. Neuronal cell-type classification: Challenges, opportunities and the path forward.
Nature Reviews Neuroscience, 18(9):530–546, September 2017. [3] Nathan W. Gouwens, Staci A. Sorensen, Jim Berg, Changkyu Lee, Tim Jarsky, Jonathan Ting, Susan M. Sunkin, David Feng, Costas A. Anastassiou, Eliza Barkan, Kris Bickley, Nicole Blesie, Thomas Braun, Krissy Brouner, Agata Budzillo, Shiella Caldejon, Tamara Casper, Dan Castelli, Peter Chong, Kirsten Crichton, Christine Cuhaciyan, Tanya L. Daigle, Rachel Dalley, Nick Dee, Tsega Desta, Song-Lin Ding, Samuel Dingman, Alyse Doperalski, Nadezhda Dotson, Tom Egdorf, Michael Fisher, Rebecca A. de Frates, Emma Garren, Marissa Garwood, Amanda Gary, Nathalie Gaudreault, Keith Godfrey, Melissa Gorham, Hong Gu, Caroline Habel, Kristen Hadley, James Harrington, Julie A. Harris, Alex Henry, DiJon Hill, Sam Josephsen, Sara Kebede, Lisa Kim, Matthew Kroll, Brian Lee, Tracy Lemon, Katherine E. Link, Xiaoxiao Liu, Brian Long, Rusty Mann, Medea McGraw, Stefan Mihalas, Alice Mukora, Gabe J. Murphy, Lindsay Ng, Kiet Ngo, Thuc Nghi Nguyen, Philip R. Nicovich, Aaron Oldre, Daniel Park, Sheana Parry, Jed Perkins, Lydia Potekhina, David Reid, Miranda Robertson, David Sandman, Martin Schroedter, Cliff Slaughterbeck, Gilberto Soler-Llavina, Josef Sulc, Aaron Szafer, Bosiljka Tasic, Naz Taskin, Corinne Teeter, Nivretta Thatra, Herman Tung, Wayne Wakeman, Grace Williams, Rob Young, Zhi Zhou, Colin Farrell, Hanchuan Peng, Michael J. Hawrylycz, Ed Lein, Lydia Ng, Anton Arkhipov, Amy Bernard, John W. Phillips, Hongkui Zeng, and Christof Koch. Classification of electrophysiological and morphological neuron types in the mouse visual cortex.
Nat. Neurosci. , 22(7):1182–1195, July 2019.[4] Ronald Aylmer Fisher.
The Correlation between Relatives on the Supposition of Mendelian Inheritance. Royal Society of Edinburgh, 1918. [5] Yerbol Z Kurmangaliyev, Juyoun Yoo, Samuel A LoCascio, and S Lawrence Zipursky. Modular transcriptional programs separately define axon and dendrite connectivity. eLife, 8:e50822, November 2019. [6] Evan Z. Macosko, Anindita Basu, Rahul Satija, James Nemesh, Karthik Shekhar, Melissa Goldman, Itay Tirosh, Allison R. Bialas, Nolan Kamitaki, Emily M. Martersteck, John J. Trombetta, David A. Weitz, Joshua R. Sanes, Alex K. Shalek, Aviv Regev, and Steven A. McCarroll. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell, 161(5):1202–1214, May 2015. [7] Bosiljka Tasic, Vilas Menon, Thuc Nghi Nguyen, Tae Kyung Kim, Tim Jarsky, Zizhen Yao, Boaz Levi, Lucas T. Gray, Staci A. Sorensen, Tim Dolbeare, Darren Bertagnolli, Jeff Goldy, Nadiya Shapovalova, Sheana Parry, Changkyu Lee, Kimberly Smith, Amy Bernard, Linda Madisen, Susan M. Sunkin, Michael Hawrylycz, Christof Koch, and Hongkui Zeng. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci., 19(2):335–346, February 2016. [8] Karthik Shekhar, Sylvain W. Lapan, Irene E. Whitney, Nicholas M. Tran, Evan Z. Macosko, Monika Kowalczyk, Xian Adiconis, Joshua Z. Levin, James Nemesh, Melissa Goldman, Steven A. McCarroll, Constance L. Cepko, Aviv Regev, and Joshua R. Sanes. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics.
Cell , 166(5):1308–1323.e30, August 2016.[9] Bosiljka Tasic, Zizhen Yao, Lucas T. Graybuck, Kimberly A. Smith, Thuc Nghi Nguyen, DarrenBertagnolli, Jeff Goldy, Emma Garren, Michael N. Economo, Sarada Viswanathan, Osnat Penn,Trygve Bakken, Vilas Menon, Jeremy Miller, Olivia Fong, Karla E. Hirokawa, Kanan Lathia,Christine Rimorin, Michael Tieu, Rachael Larsen, Tamara Casper, Eliza Barkan, MatthewKroll, Sheana Parry, Nadiya V. Shapovalova, Daniel Hirschstein, Julie Pendergraft, Heather A.Sullivan, Tae Kyung Kim, Aaron Szafer, Nick Dee, Peter Groblewski, Ian Wickersham, AliCetin, Julie A. Harris, Boaz P. Levi, Susan M. Sunkin, Linda Madisen, Tanya L. Daigle, LorenLooger, Amy Bernard, John Phillips, Ed Lein, Michael Hawrylycz, Karel Svoboda, Allan R.Jones, Christof Koch, and Hongkui Zeng. Shared and distinct transcriptomic cell types acrossneocortical areas.
Nature, 563(7729):72–78, November 2018. [10] Yi-Rong Peng, Karthik Shekhar, Wenjun Yan, Dustin Herrmann, Anna Sappington, Gregory S. Bryman, Tavé van Zyl, Michael Tri. H. Do, Aviv Regev, and Joshua R. Sanes. Molecular Classification and Comparative Taxonomics of Foveal and Peripheral Cells in Primate Retina. Cell, 176(5):1222–1237.e22, February 2019. [11] Seung Wook Oh, Julie A. Harris, Lydia Ng, Brent Winslow, Nicholas Cain, Stefan Mihalas, Quanxin Wang, Chris Lau, Leonard Kuan, Alex M. Henry, Marty T. Mortrud, Benjamin Ouellette, Thuc Nghi Nguyen, Staci A. Sorensen, Clifford R. Slaughterbeck, Wayne Wakeman, Yang Li, David Feng, Anh Ho, Eric Nicholas, Karla E. Hirokawa, Phillip Bohn, Kevin M. Joines, Hanchuan Peng, Michael J. Hawrylycz, John W. Phillips, John G. Hohmann, Paul Wohnoutka, Charles R. Gerfen, Christof Koch, Amy Bernard, Chinh Dang, Allan R. Jones, and Hongkui Zeng. A mesoscale connectome of the mouse brain. Nature, 508(7495):207–214, April 2014. [12] Thomas Euler, Silke Haverkamp, Timm Schubert, and Tom Baden. Retinal bipolar cells: Elementary building blocks of vision. Nature Reviews Neuroscience, 15(8):507–519, August 2014. [13] Benyamin Ghojogh, Fakhri Karray, and Mark Crowley. Eigenvalue and Generalized Eigenvalue Problems: Tutorial. arXiv:1903.11240 [cs, stat], March 2019. [14] Kean Ming Tan, Zhaoran Wang, Han Liu, and Tong Zhang. Sparse generalized eigenvalue problem: Optimal statistical rates via truncated Rayleigh flow. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1057–1086, 2018. [15] John P. Cunningham and Zoubin Ghahramani. Linear Dimensionality Reduction: Survey, Insights, and Generalizations. Journal of Machine Learning Research, 16(89):2859–2900, 2015. [16] Rasmus Larsen. Decomposition using maximum autocorrelation factors. Journal of Chemometrics, 16(8-10):427–435, 2002. [17] Laurenz Wiskott and Terrence J. Sejnowski. Slow Feature Analysis: Unsupervised Learning of Invariances.
Neural Computation , 14(4):715–770, April 2002.[18] Xiaofei He and Partha Niyogi. Locality Preserving Projections. In
Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, pages 153–160, Cambridge, MA, USA, December 2003. MIT Press. [19] I. T. Jolliffe. Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, New York, second edition, 2002. [20] Aapo Hyvärinen, Juha Karhunen, and Erkki Oja. Independent Component Analysis. Wiley-Interscience, New York, 1st edition, May 2001. [21] C. Spearman. "General Intelligence," Objectively Determined and Measured. The American Journal of Psychology, 15(2):201–292, 1904. [22] R. A. Fisher. The Use of Multiple Measurements in Taxonomic Problems.
Annals of Eugenics ,7(2):179–188, 1936.[23] Geoffrey McLachlan.
Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience, Hoboken, NJ, August 2004. [24] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3-4):321–377, December 1936. [25] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On Deep Multi-View Representation Learning: Objectives and Optimization. arXiv:1602.01024 [cs], February 2016. [26] Nicholas M. Tran, Karthik Shekhar, Irene E. Whitney, Anne Jacobi, Inbal Benhar, Guosong Hong, Wenjun Yan, Xian Adiconis, McKinzie E. Arnold, Jung Min Lee, Joshua Z. Levin, Dingchang Lin, Chen Wang, Charles M. Lieber, Aviv Regev, Zhigang He, and Joshua R. Sanes. Single-cell profiles of retinal neurons differing in resilience to injury reveal neuroprotective genes. bioRxiv, page 711762, July 2019. [27] Hung-I. Harry Chen, Yufang Jin, Yufei Huang, and Yidong Chen. Detection of high variability in gene expression from single-cell RNA-seq profiling. BMC Genomics, 17 Suppl 7:508, August 2016. [28] Shristi Pandey, Karthik Shekhar, Aviv Regev, and Alexander F. Schier. Comprehensive Identification and Spatial Mapping of Habenular Neuronal Types Using Single-Cell RNA-Seq. Curr. Biol., 28(7):1052–1065.e7, April 2018. [29] Karl Ridgeway and Michael C Mozer. Learning Deep Disentangled Embeddings With the F-Statistic Loss. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 185–194. Curran Associates, Inc., 2018. [30] Filipe Pinto-Teixeira, Clara Koo, Anthony Michael Rossi, Nathalie Neriec, Claire Bertet, Xin Li, Alberto Del-Valle-Rodriguez, and Claude Desplan. Development of Concurrent Retinotopic Maps in the Fly Motion Detection Circuit.
Cell , 173(2):485–498.e11, April 2018.[31] Nathan W. Gouwens, Staci A. Sorensen, Fahimeh Baftizadeh, Agata Budzillo, Brian R. Lee,Tim Jarsky, Lauren Alfiler, Katherine Baker, Eliza Barkan, Kyla Berry, Darren Bertagnolli, KrisBickley, Jasmine Bomben, Thomas Braun, Krissy Brouner, Tamara Casper, Kirsten Crichton,Tanya L. Daigle, Rachel Dalley, Rebecca A. de Frates, Nick Dee, Tsega Desta, Samuel DingmanLee, Nadezhda Dotson, Tom Egdorf, Lauren Ellingwood, Rachel Enstrom, Luke Esposito, ColinFarrell, David Feng, Olivia Fong, Rohan Gala, Clare Gamlin, Amanda Gary, Alexandra Glandon,Jeff Goldy, Melissa Gorham, Lucas Graybuck, Hong Gu, Kristen Hadley, Michael J. Hawrylycz,Alex M. Henry, DiJon Hill, Madie Hupp, Sara Kebede, Tae Kyung Kim, Lisa Kim, MatthewKroll, Changkyu Lee, Katherine E. Link, Matthew Mallory, Rusty Mann, Michelle Maxwell,Medea McGraw, Delissa McMillen, Alice Mukora, Lindsay Ng, Lydia Ng, Kiet Ngo, Philip R.Nicovich, Aaron Oldre, Daniel Park, Hanchuan Peng, Osnat Penn, Thanh Pham, Alice Pom,Zoran Popovi´c, Lydia Potekhina, Ramkumar Rajanbabu, Shea Ransford, David Reid, ChristineRimorin, Miranda Robertson, Kara Ronellenfitch, Augustin Ruiz, David Sandman, KimberlySmith, Josef Sulc, Susan M. Sunkin, Aaron Szafer, Michael Tieu, Amy Torkelson, JessicaTrinh, Herman Tung, Wayne Wakeman, Katelyn Ward, Grace Williams, Zhi Zhou, Jonathan T.Ting, Anton Arkhipov, Uygar Sümbül, Ed S. Lein, Christof Koch, Zizhen Yao, Bosiljka Tasic,Jim Berg, Gabe J. Murphy, and Hongkui Zeng. Integrated Morphoelectric and TranscriptomicClassification of Cortical GABAergic Cells.
Cell , 183(4):935–953.e19, November 2020.[32] B. Moghaddam and Ming-Hsuan Yang. Learning gender with support faces.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):707–711, May 2002. [33] Zhifei Zhang, Yang Song, and Hairong Qi. Age Progression/Regression by Conditional Adversarial Autoencoder. arXiv:1702.08423 [cs], March 2017. [34] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. arXiv:1206.5538 [cs], June 2012. [35] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised Learning with Deep Generative Models. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3581–3589. Curran Associates, Inc., 2014. [36] Tejas D Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep Convolutional Inverse Graphics Network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2539–2547. Curran Associates, Inc., 2015. [37] Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. arXiv:1506.05011 [cs, stat], March 2016. [38] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, July 2006. [39] Fangxiang Feng, Xiaojie Wang, and Ruifan Li. Cross-modal Retrieval with Correspondence Autoencoder. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 7–16, Orlando, Florida, USA, November 2014. Association for Computing Machinery. [40] Rohan Gala, Nathan Gouwens, Zizhen Yao, Agata Budzillo, Osnat Penn, Bosiljka Tasic, Gabe Murphy, Hongkui Zeng, and Uygar Sümbül. A coupled autoencoder approach for multi-modal analysis of cell types. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 9267–9276. Curran Associates, Inc., 2019. [41] Cathryn R. Cadwell, Athanasia Palasantza, Xiaolong Jiang, Philipp Berens, Qiaolin Deng, Marlene Yilmaz, Jacob Reimer, Shan Shen, Matthias Bethge, Kimberley F. Tolias, Rickard Sandberg, and Andreas S. Tolias. Electrophysiological, transcriptomic and morphologic profiling of single neurons using Patch-seq. Nature Biotechnology, 34(2):199–203, February 2016. [42] Carina Strell, Markus M. Hilscher, Navya Laxman, Jessica Svedlund, Chenglin Wu, Chika Yokota, and Mats Nilsson. Placing RNA in context and space – methods for spatially resolved transcriptomics.
The FEBS Journal , 286(8):1468–1481, 2019.
10 Additional Information
Figure 5: Representative plots (at a representative value of σ) of EV and MI metrics for FLDA and other models. (A,B) EV (A) and MI (B) metrics of FLDA. FLD_i and FLD_j indicate the factorized linear discriminants for features i and j. (C,D) EV (C) and MI (D) metrics of 2LDAs. LD_i and LD_j indicate the linear discriminant components for features i and j. (E,F) EV (E) and MI (F) metrics of LDA. LD_1 and LD_2 indicate the first two linear discriminant components. (G,H) EV (G) and MI (H) metrics of CCA. CCA_1 and CCA_2 indicate the first two canonical correlation axes. (I,J) EV (I) and MI (J) metrics of PCA. PC_1 and PC_2 indicate the first two principal components. EV_i and EV_j are the explained variance of features i and j along an axis, and MI_i and MI_j indicate the mutual information between an axis and features i and j respectively. Values of EV and MI metrics are also indicated by the color bars on the right side.

Figure 6: Additional plots for FLDA on the dataset of T4/T5 neurons. (A,B) Projection of the original gene expression data into the two-dimensional space made of the first and second (FLD_ij1 and FLD_ij2) (A) or the second and third (FLD_ij2 and FLD_ij3) (B) discriminant axes for the component that depends on the combination of both features i and j. Different cell types are indicated in different colors as in (B).