Measuring the Stability of Learned Features
Kris Sankaran
Department of Statistics
University of Wisconsin - Madison
February 23, 2021
Abstract
Many modern datasets don't fit neatly into $n \times p$ matrices, but most techniques for measuring statistical stability expect rectangular data. We study methods for stability assessment on non-rectangular data, using statistical learning algorithms to extract rectangular latent features. We design controlled simulations to characterize the power and practicality of competing approaches. This motivates new strategies for visualizing feature stability. Our stability curves supplement the direct analysis, providing information about the reliability of inferences based on learned features. Finally, we illustrate our approach using a spatial proteomics dataset, where machine learning tools can augment the scientist's workflow, but where guarantees of statistical reproducibility are still central. Our raw data, packaged code, and experimental outputs are publicly available.

How should we perform statistical inference on nontabular data? One idea, discussed in [Bühlmann, 2019], is to first transform the data into tabular form, taking advantage of modern feature learning algorithms. The point is important enough to quote at length,

I would like to re-emphasize the importance of new sources of information. Indeed, images, videos and audios are typically cheap devices to record data. [The authors] do not mention recent progress with autoencoders [Hinton and Salakhutdinov, 2006, Vincent et al., 2010]: when using such techniques, one would again end up with numeric features which can then be used for further downstream analysis using techniques from high-dimensional statistics or statistical machine learning [Hastie et al., 2015, Bühlmann and Van De Geer, 2011].

Here, we explore this proposal in depth, illustrating how methods from high-dimensional statistics can be used to study the inferential properties of machine-generated features. Specifically, we present algorithms and visualizations that can be used to characterize the statistical stability of features learned from nontabular data sources. In the process, we also uncover novel challenges particular to this new setting.

To ground the discussion, consider the following examples,

• Spatial omics: In addition to measuring gene or protein expression at the cell level, it has become possible to study how expression varies spatially across a tissue [Lundberg and Borner, 2019]. A variety of spatial features are thought to influence biological processes. For example, for some types of cancer, it is thought that the infiltration of tumors by immune cells expressing particular proteins can influence disease prognosis [Keren et al., 2018].

• Ecological and public health monitoring: Satellite imagery is now routinely integrated into ecological and public health projects, since it can capture important environmental features and is readily available globally [Rolf et al., 2020, Bondi et al., 2020]. Such imagery has been found to be an effective proxy for otherwise resource-intensive collection methods, like on-the-ground surveys, opening up the possibility of more universal and easily updated monitoring.

In both cases, machine learning methods are key for extracting useful features from novel sources of data.
However, unlike many common machine learning applications, the learned features here are subject to critical examination, either to inform biological mechanisms or to ensure vulnerable populations are not put at risk.

This question of how to perform inference on learned features is not a new one — principal components can be bootstrapped [Diaconis and Efron, 1983], excess error can be estimated from selected features [Gong, 1986], and confidence regions are available for exploratory projection pursuit [Elguero and Holmes-Junca, 1988]. More recently, studies have investigated the use of black-box predictions as features in downstream analysis [Wang et al., 2020b]. However, for both the proteomics and satellite imagery examples presented above, these methods cannot be applied directly, because the raw data are not tabular. Instead, features are typically learned automatically using deep learning. What is new in this setting?

• Modern learned features are "distributed" [McClelland et al., 1986]. This means that any pattern observed by the algorithm will be encoded by a large number of more elementary features, not any single feature specialized to that pattern. A deep learning algorithm is able to recognize a highway in a satellite image because it can merge evidence across neurons (i.e., elementary learned features) that activate when particular color, shape, and edge features are present. This approach turns out to be much more effective than curating specialized features for many computational tasks, but it poses a challenge for human inspection.

• From one run to the next, the learned features can change. This is unlike principal components analysis, say, where the learned components are ordered in a natural way. If the deep learning features were more specialized, it might be possible to recognize the same feature across two runs, and then match them. However, because the features are distributed, it is not easy to say that any given neuron from the first run matches any other neuron(s) in the second.

• It is impractical to bootstrap methods that take hours to run, even if the runs could be done in parallel. Moreover, it is unclear what information should be compared across bootstraps — the model parameters, the learned features, or summary statistics about the features.

• Some form of sample splitting must take place, to ensure that features are not evaluated using the same data that were used to learn them. However, it is unclear how the splitting should be carried out. How much data should be used for feature learning, and how much for inference?

This work discusses these questions, proposing relevant definitions and algorithms, examining their behavior through simulation, and illustrating their use on a spatial proteomics dataset. Our basic takeaways are,

• While learned features are not interpretable when viewed in isolation, their associated feature subspaces often are. A feature learning algorithm may require a large number of elementary features in order to develop effective distributed representations, but the effective dimensionality of these representations is often small. These algorithms learn many, but highly correlated, features.

• Given enough training data, learned feature subspaces are often stable from one run to the next, and this can be quantified using a Procrustes analysis.
Unsupervised feature learning algorithms are typically more stable than supervised ones.

• For problems where unsupervised feature learning is effective, a fast approximation to full deep model training, called the random convolutional features (RCF) model, can suffice for a feature stability analysis.

• Inference is less data-greedy than feature learning, in the sense that when few samples are reserved for inference, stable features can still be identified. This is no longer the case when few samples are reserved for feature learning.

Section 1 provides a description of our problem setting. Section 2 summarizes the key technical tools used in this study. We present generic algorithms for measuring feature subspace stability in Section 3, and we study their properties through a simulation in Section 4. We conclude in Section 5 with an application to a spatial proteomics dataset.

Our goal is to characterize the stability of features that were algorithmically derived from $n$ samples $x_i \in \mathcal{X}$. For example, each $x_i$ might be a spatial proteomics measurement or a satellite image. They could also be more general data types: $x_i$ might be the audio recording for one of $n$ speakers, or the graph derived from one of $n$ molecules. Optionally, a vector $y \in \mathbb{R}^n$ of responses associated with each observation will be available. We denote the full set of available data by $\mathcal{D}$.

Definition 1.1. A feature learner is a parameterized mapping $T(\cdot; \theta): \mathcal{X} \to \mathbb{R}^L$ which takes data from the raw input domain $\mathcal{X}$ and represents it using a vector in $\mathbb{R}^L$.

For example, in the proteomics and satellite applications, we expect the learner to transform a set of raw pixel intensities into a vector of features reflecting biological or environmental properties of the imaged sample. The parameter $\theta$ is estimated from data, typically through an optimization problem,
$$\hat{\theta} := \arg\min_{\theta \in \Theta} \mathcal{L}\left(\mathcal{D}, T(\cdot; \theta)\right)$$
for some loss function $\mathcal{L}$. In an unsupervised feature learner, candidates $\theta \in \Theta$ are functions of $x_1, \dots, x_n$ alone. For a supervised feature learner, the class includes functions of both $x_1, \dots, x_n$ and $y$. To simplify notation, we will write $z_i = T(x_i; \hat{\theta}) \in \mathbb{R}^L$ to denote the learned features associated with the $i$th observation.

A first attempt might try to assign a stability score to each of the $L$ coordinates of $z$. An investigator would then have confidence that, if data were sampled again from the same population, and if features were extracted by the same black box, then the features with high stability scores would reappear in the new analysis. The essential challenge is that the learned features are not the same from one run to the next; the $l$th learned feature from run 1 need not have any relationship with the $l$th feature from run 2. Parallel problems are well known in clustering and factor analysis. Cluster labels can be permuted without changing the quality of the clustering, leading to the label-switching problem. Likewise, without any criteria for sorting or postprocessing factors, latent factors from a factor analysis are unidentifiable.

To address this issue, we distinguish between two notions of stability, which we call feature subspace and selection stability, respectively. The idea of subspace stability is that, even if there is no direct correspondence between learned features across runs, they may all reflect the same underlying latent features. Different runs of the feature learning algorithm return different bases for nearly the same subspace.
To make this more precise, suppose that the $b$th run of the feature learning algorithm produces
$$Z_b = \begin{pmatrix} z_{b1}^\top \\ \vdots \\ z_{bn}^\top \end{pmatrix} \in \mathbb{R}^{n \times L},$$
and define an alignment function $A$ which takes learned to aligned features,
$$A : (Z_1, \dots, Z_B) \mapsto \left(M, (\bar{Z}_1, \dots, \bar{Z}_B)\right).$$
We think of $M \in \mathbb{R}^{n \times K}$ as the average of all $B$ representations and $\bar{Z}_b \in \mathbb{R}^{n \times K}$ as the version of the $b$th learned representation after they have been put into a common coordinate system. In our case, $A$ is just a Procrustes analysis. With this notation, we can now define subspace stability.

Definition 1.2. The subspace stability of $B$ learned representations $Z_1, \dots, Z_B$ with respect to an alignment function $A$ is the average distance between the aligned features and $M$,
$$FSS_A(Z_1, \dots, Z_B) = \frac{1}{B} \sum_{b=1}^{B} \left\| \bar{Z}_b - M \right\|, \quad \text{where } \left(M, (\bar{Z}_1, \dots, \bar{Z}_B)\right) = A(Z_1, \dots, Z_B).$$

By selection stability, we mean that a given aligned feature is repeatedly discovered by a model selection procedure. At this point, the features can be thought of as fixed; we are back on more familiar inferential ground. That is, let $S$ be a selection function, which takes $\bar{Z}_b$ and a response $y$ and returns a subset $S_b \subseteq \{1, \dots, K\}$ of features important for predicting $y$.

Definition 1.3. The selection stability of the $k$th aligned feature with respect to the selection $S$ is the fraction
$$SS^k_S(\bar{Z}_1, \dots, \bar{Z}_B) = \frac{1}{B} \sum_{b=1}^{B} I\left(k \in S(\bar{Z}_b, y)\right).$$
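To make Definitions 1.2 and 1.3 concrete, the following sketch computes both scores from representations that have already been aligned and from selection sets that have already been computed (for instance by the Procrustes and stability selection procedures of Section 2). The array shapes and function names are illustrative only and are not part of any released package.

    import numpy as np

    def subspace_stability(Z_aligned):
        """Definition 1.2: average distance between aligned features and their mean.

        Z_aligned: array of shape (B, n, K) holding the B aligned representations.
        """
        M = Z_aligned.mean(axis=0)  # consensus representation M
        return np.mean([np.linalg.norm(Zb - M) for Zb in Z_aligned])

    def selection_stability(selected_sets, K):
        """Definition 1.3: fraction of runs in which each aligned feature is selected.

        selected_sets: list of B sets, each a subset of {0, ..., K - 1}.
        """
        B = len(selected_sets)
        return np.array([sum(k in Sb for Sb in selected_sets) / B for k in range(K)])

    # toy usage with B = 3 runs, n = 10 samples, K = 4 aligned dimensions
    rng = np.random.default_rng(0)
    Z_aligned = rng.normal(size=(3, 10, 4))
    print(subspace_stability(Z_aligned))
    print(selection_stability([{0, 1}, {0}, {0, 2}], K=4))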
We review techniques that are used in our algorithmic proposals and computational experiments.

2.1 Stability Selection

Many statistical quantities are based on the idea that meaningful conclusions should be stable to perturbations [Yu, 2013]. We will make use of stability selection, a technique for evaluating significance in high-dimensional linear models [Meinshausen and Bühlmann, 2010]. For a dataset $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$, stability selection proceeds as follows. First, $B$ subsamples of size $\lfloor n/2 \rfloor$ are drawn from the data, and for each a lasso regression is run over a grid of regularization parameters $\lambda$. Each of these regressions results in $p$ coefficient trajectories $\hat{\beta}_j(\lambda)$, and important features are expected to become nonzero earlier on the regularization path (that is, even with large regularization $\lambda$). For a given $\lambda$ and feature $j$, let $\hat{\Pi}_j(\lambda)$ measure the fraction of subsamples for which $|\hat{\beta}_j(\lambda)| > 0$. The paths $(\hat{\Pi}_j(\lambda))_{j=1}^{p}$ describe the importance of each of the $p$ regression features. For a given stringency threshold $\pi_{thr}$, the procedure selects the features $\hat{S} = \{j : \max_\lambda \hat{\Pi}_j(\lambda) \geq \pi_{thr}\}$.

Let $q_\Lambda = E|\hat{S}_\Lambda|$ denote the expected number of selected features. It can be shown that, for $\pi_{thr} \geq \frac{1}{2}$, and assuming a sufficiently well-behaved $\hat{S}$, the expected number of falsely selected features is bounded by $\frac{1}{2\pi_{thr} - 1} \frac{q_\Lambda^2}{p}$. The term $\frac{q_\Lambda}{p}$ is like the typical fraction of selected features; the coefficient $\frac{q_\Lambda}{2\pi_{thr} - 1}$ describes the fraction of those that are expected to be false positives. For instance, with $p = 1000$ candidate features, $q_\Lambda = 50$, and $\pi_{thr} = 0.9$, the bound guarantees no more than $50^2 / (0.8 \times 1000) \approx 3$ falsely selected features in expectation.

2.2 Feature Learning

We consider three particular feature learning algorithms. The first is the Variational Autoencoder (VAE). Like principal components analysis, this algorithm learns an $L$-dimensional representation of a dataset by optimizing a reconstruction objective. Formally, the algorithm first posits a generative model $p(z) p_\xi(x \mid z)$ of the data; $p(z)$ is a prior on latent features and $p_\xi(x \mid z)$ is a likelihood parameterized by $\xi$. The algorithm finds a pair $\xi, \varphi$ maximizing the lower bound
$$\log p_\xi(x) \geq E_{q_\varphi}\left[\log p_\xi(x \mid z)\right] - D_{KL}\left(q_\varphi(z \mid x) \,\|\, p(z)\right),$$
where $q_\varphi(z \mid x) = \mathcal{N}\left(\mu_\varphi(x), \sigma^2_\varphi(x)\right)$ maps raw data examples to distributions in a latent space of more meaningful features. This optimization problem is nonconvex, and is typically solved through a variant of stochastic gradient descent. There are many particular implementations of the VAE; our experiments are based on [Van Den Oord et al., 2017].

Second, we investigate features learned in a supervised way through a Convolutional Neural Network (CNN). For regression, a CNN optimizes the empirical estimate of the risk $E\left\|y - f_{W_{1:J}}(x)^\top \beta\right\|^2$ over $W_{1:J}$ and $\beta$. Here $f_{W_{1:J}}$ transforms the raw input into the "last layer" features, and is defined recursively according to
$$f^{j}_{W_{1:j}}(x) = \sigma\left(W_j f^{j-1}_{W_{1:j-1}}(x)\right), \qquad f^{0}(x) = x,$$
for $\sigma(x) := x \, I(x \geq 0)$ and matrices $W_j$ restricted to the set of convolution operators. Like the VAE objective, this optimization is nonconvex and is typically approached through first-order optimization methods. Our particular implementation is the CBR architecture described in [Raghu et al., 2017].
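To make the idea of "last layer" features tangible, here is a minimal convolution-batchnorm-ReLU (CBR) regression network in PyTorch. It is a sketch in the spirit of the architecture above, not the exact configuration used in our experiments; the layer sizes and class name are illustrative.

    import torch
    import torch.nn as nn

    class CBRRegressor(nn.Module):
        """Small CBR-style network: stacked conv-batchnorm-ReLU blocks, then a linear head."""

        def __init__(self, in_channels=3, n_features=64):
            super().__init__()
            def block(c_in, c_out):
                return nn.Sequential(
                    nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                    nn.BatchNorm2d(c_out),
                    nn.ReLU(),
                    nn.MaxPool2d(2),
                )
            self.backbone = nn.Sequential(
                block(in_channels, 32),
                block(32, n_features),
                nn.AdaptiveAvgPool2d(1),  # spatial pooling: one activation per channel
                nn.Flatten(),
            )
            self.head = nn.Linear(n_features, 1)  # plays the role of beta in the regression risk

        def features(self, x):
            """Learned features z = f(x), the quantities whose stability we study."""
            return self.backbone(x)

        def forward(self, x):
            return self.head(self.features(x)).squeeze(-1)

    # toy usage: a batch of 8 RGB images of size 64 x 64
    model = CBRRegressor()
    x = torch.randn(8, 3, 64, 64)
    print(model.features(x).shape)  # torch.Size([8, 64])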
Third, we consider a random convolutional features (RCF) model [Rahimi and Recht, 2008]. A random sample of $L$ training examples $x_{i_1}, \dots, x_{i_L} \in \mathbb{R}^{w \times h \times c}$ is selected; the $x_i$'s are assumed to be $c$-channel images with dimension $w \times h$. From each sampled example, a random $s \times s$ patch, which we call $w_l \in \mathbb{R}^{s \times s \times c}$, is extracted. For any $c$-channel image $x$, the $l$th feature $z_l$ is found by convolving $x$ with $w_l$ and spatially averaging over all activations.

To train an RCF, the training data are first featurized into $Z \in \mathbb{R}^{n \times L}$ using random image patches, as described above. Then, a ridge regression model is trained, solving
$$\hat{\beta} := \arg\min_{\beta} \|y - Z\beta\|^2 + \lambda \|\beta\|^2.$$
For a new example $x^*$, the same image patches $w_1, \dots, w_L$ are used to create a featurization $z^*$, and predictions are made with $z^{*\top} \hat{\beta}$. Unlike either the VAE or the CNN, this model does not require gradient-based training, and it can serve as a fast, and often effective, baseline.

2.3 Procrustes Analysis

Procrustes analysis gives a way of aligning multiple tables. Given two centered tables $X$ and $Y$, the Procrustes problem finds the rotation matrix $R$ solving the optimization
$$\min_{R \in \mathcal{O}(p, p)} \|X - YR\|_F^2,$$
where $\mathcal{O}(p, p)$ denotes the space of $p \times p$ orthonormal matrices. Introducing a Lagrange multiplier matrix $\Lambda$ for the constraint $R^\top R = I$, the solution can be shown to be $\hat{R} = V U^\top$, where $U$ and $V$ are obtained from the SVD $X^\top Y = U D V^\top$.

For $B$ matrices $X_1, \dots, X_B$, the generalized Procrustes problem finds $B$ rotations $R_1, \dots, R_B$ and a mean matrix $M$ solving
$$\min_{R_1, \dots, R_B \in \mathcal{O}(p, p), \, M} \sum_{b=1}^{B} \|X_b R_b - M\|_F^2.$$
While there is no closed-form solution, the optimization can be solved by cyclically updating each $R_b$ via standard Procrustes problems and then updating $M = \frac{1}{B} \sum_{b=1}^{B} X_b R_b$.

2.4 Singular Vector Canonical Correlation Analysis

Our approach is closely related to Singular Vector Canonical Correlation Analysis (SVCCA) [Raghu et al., 2017], a method used to compare features learned across deep learning models. In SVCCA, the two representations are identified with two matrices, $X \in \mathbb{R}^{n \times l_1}$ and $Y \in \mathbb{R}^{n \times l_2}$. The $ij$th cell in each matrix corresponds to the activation of neuron $j$ on sample $i$. A representation is a pattern of activations across a collection of neurons. To compare representations, SVCCA first computes singular value decompositions $X = U_X D_X V_X^\top$ and $Y = U_Y D_Y V_Y^\top$. The coordinates of these matrices with respect to the top $K$ singular vector directions are then extracted, $Z_X = U_{X, 1:K} D_{X, 1:K}$ and $Z_Y = U_{Y, 1:K} D_{Y, 1:K}$. Finally, a canonical correlation analysis is performed on these coordinate matrices. That is, the top CCA directions $u_1, v_1$ are found by optimizing
$$\text{maximize } u^\top \hat{\Sigma}_{Z_X}^{-1/2} \hat{\Sigma}_{Z_X Z_Y} \hat{\Sigma}_{Z_Y}^{-1/2} v \quad \text{subject to } \|u\|_2 = \|v\|_2 = 1,$$
and subsequent directions are found by solving the same problem after orthogonalizing $Z_X$ and $Z_Y$ to previously found directions. The value of the objective for each of the $K$ directions is denoted $\rho_k$, and the overall similarity between the two representations is taken to be the average of these canonical correlations, $\frac{1}{K} \sum_{k=1}^{K} \rho_k$.

Note that, while in principle it would be possible to perform a CCA on the activations $X$ and $Y$ directly, for representations with many neurons the dimensionality reduction step can simplify computation, because it avoids inverting a high-dimensional covariance matrix.
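Since the generalized Procrustes problem above is the alignment step used throughout this study, here is a minimal numpy sketch of its cyclic updates. It is written for this exposition rather than taken from our released code, and it simply alternates between solving standard orthogonal Procrustes problems for each table and refitting the mean matrix.

    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def generalized_procrustes(X_list, n_iter=20):
        """Align B centered n x p matrices by cyclically solving Procrustes problems.

        Returns the consensus matrix M and the list of aligned matrices X_b R_b.
        """
        aligned = [X.copy() for X in X_list]
        M = np.mean(aligned, axis=0)
        for _ in range(n_iter):
            for b, X in enumerate(X_list):
                R, _ = orthogonal_procrustes(X, M)  # rotation minimizing ||X R - M||_F
                aligned[b] = X @ R
            M = np.mean(aligned, axis=0)            # update the consensus representation
        return M, aligned

    # toy usage: three noisy rotations of the same underlying 100 x 5 table
    rng = np.random.default_rng(1)
    base = rng.normal(size=(100, 5))
    X_list = []
    for _ in range(3):
        Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
        X_list.append(base @ Q + 0.01 * rng.normal(size=base.shape))
    M, aligned = generalized_procrustes(X_list)
    print(np.mean([np.linalg.norm(Z - M) for Z in aligned]))  # subspace stability, Definition 1.2

In our experiments, this alignment is applied to the dimensionality-reduced features rather than to the raw learned features.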
2.5 Sparse Components Analysis

As in SVCCA, we reduce the dimensionality of learned features before comparing them. We use both PCA and a variant called Sparse Components Analysis (SCA) [Chen and Rohe, 2020]. For a given data matrix $X$, a number of components $K$, and a sparsity parameter $\gamma$, SCA solves the optimization
$$\text{minimize}_{Z, B, Y} \; \|X - Z B Y^\top\|_F \quad \text{subject to } Z \in \mathcal{O}(n, K), \; Y \in \mathcal{O}(p, K), \; \|Y\|_1 \leq \gamma.$$
The matrix $Y$ provides the $K$ SCA loadings, and the product $ZB$ provides the coordinates of each sample with respect to that basis. Note that if $B$ is forced to be diagonal, then this optimization reduces to sparse PCA. The optimization does not directly provide an ordering of the loadings. However, the proportion of variance explained by each dimension can still be computed, and this can be used to re-order the dimensions in a similar way to PCA.

Algorithms 1 through 3 give our approach to measuring feature stability. We first motivate the general setup, before explaining specific choices used in our experiments. The three algorithms do the following,

1. Train a feature learning algorithm on $B$ perturbed versions of the dataset $\mathcal{D}$. This yields $B$ sets of learned features $Z_b$.

2. Use an alignment strategy $A$ to represent the learned features in a shared coordinate system. These aligned features are called $\bar{Z}_b$.

3. Using the aligned features, evaluate the importance of each feature dimension using a selection mechanism $S$.

The only subtlety is that the feature learning and selection steps are performed on different subsets of $\mathcal{D}$, indexed by $I$ and $I^C$, respectively. This is needed to maintain validity of inference: if the same samples were used for selection and learning, features would appear more important than they are.

In our experiments, we use the bootstrap to perturb the data reserved for feature learning. That is, if $I^*_b$ resamples $I$ with replacement, then $\mathcal{D}[I^*_b]$ gives a draw from $\mathbb{P}(\mathcal{D})$. This lets us obtain $B$ sets of learned features by optimizing
$$\hat{\theta}_b := \arg\min_{\theta \in \Theta} \mathcal{L}\left(\mathcal{D}[I^*_b], T(\cdot; \theta)\right), \qquad Z_b = T\left(\mathcal{D}; \hat{\theta}_b\right).$$
This step is summarized by Algorithm 1.
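As a complement to the pseudocode in Algorithm 1 below, here is the same split-and-bootstrap feature-learning loop as a short sketch. The function fit_feature_learner is a hypothetical placeholder standing in for any of the VAE, CNN, or RCF learners of Section 2; it is not a function from our released packages.

    import numpy as np

    def learn_bootstrap_features(D, fit_feature_learner, B=20, prop_learn=0.5, seed=0):
        """Algorithm 1 sketch: split the samples, then learn features on B bootstrap resamples.

        D: sequence of n raw observations (e.g., images).
        fit_feature_learner: callable taking a list of observations and returning a
            featurization function T(.; theta_hat); a hypothetical placeholder here.
        """
        rng = np.random.default_rng(seed)
        n = len(D)
        perm = rng.permutation(n)
        I = perm[: int(prop_learn * n)]    # indices used for feature learning
        I_C = perm[int(prop_learn * n):]   # indices reserved for inference
        Z_list = []
        for b in range(B):
            I_star = rng.choice(I, size=len(I), replace=True)            # bootstrap resample of I
            T_hat = fit_feature_learner([D[i] for i in I_star])          # theta_hat_b
            Z_list.append(np.stack([T_hat(D[i]) for i in range(n)]))     # Z_b: features for all of D
        return Z_list, I, I_C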
Algorithm 1: Feature Learning
Result: Sets of learned features $(Z_b)_{b=1}^{B}$ for $\mathcal{D}$. Indices $I \subset [n]$ used for feature learning.
Inputs: Dataset $\mathcal{D}$ and perturbation process $\mathbb{P}$. Candidate feature learners $\{T(\cdot; \theta)\}_{\theta \in \Theta}$ and criterion $\mathcal{L}$.
1. Randomly split samples into disjoint subsets $I, I^C \subset [n]$.
2. Generate $B$ perturbed datasets: for $b = 1, \dots, B$, draw $\mathcal{D}_b \sim \mathbb{P}(\mathcal{D}[I])$.
3. Train $B$ feature learners: for $b = 1, \dots, B$, set $\hat{\theta}_b = \arg\min_{\theta \in \Theta} \mathcal{L}(\mathcal{D}_b, T(\cdot; \theta))$ and $Z_b = T(\mathcal{D}; \hat{\theta}_b)$.

For the alignment strategy $A$, we first reduce the dimensionality of each $Z_b$ to $K$ dimensions; call the result $\tilde{Z}_b$. Then, we solve a generalized Procrustes problem, finding $R_b$'s so that the $\bar{Z}_b := \tilde{Z}_b R_b$ have low Frobenius distance to $M$. For the dimensionality reduction step, we apply either PCA or SCA after centering and scaling. Given this $A$, we can compute a feature subspace stability score using Algorithm 2.
Algorithm 2: Feature Subspace Stability
Result: Aligned features $(\bar{Z}_1, \dots, \bar{Z}_B)$. Subspace stability score $FSS_A$.
Inputs: Learned features $(Z_1, \dots, Z_B)$. Alignment strategy $A$.
1. Align features: $\left(M, (\bar{Z}_1, \dots, \bar{Z}_B)\right) = A(Z_1, \dots, Z_B)$.
2. Compute feature subspace stability: $FSS_A(Z_1, \dots, Z_B) = \frac{1}{B} \sum_{b=1}^{B} \|\bar{Z}_b - M\|$.

For the selection mechanism, we use stability selection. This means that our selection mechanism $S$ is parameterized by the lasso regularization strength $\lambda$ and the selection stringency $\pi_{thr}$. In our experiments, we display the full selection curves $\hat{\Pi}_{bk}(\lambda)$ for each set of aligned features. From these curves, we can identify important features $S_b(\lambda, \pi_{thr})$ for any choice of $\lambda$ or $\pi_{thr}$. However, for clarity, we suppress this dependence on $\lambda$ and $\pi_{thr}$ in Algorithm 3.
Algorithm 3: Selection Stability
Result: Selection stability scores for all aligned dimensions, $SS^1_S, \dots, SS^K_S$.
Inputs: Reserved indices $I^C$. Selection mechanism $S$. Aligned features $(\bar{Z}_1, \dots, \bar{Z}_B)$, where each $\bar{Z}_b \in \mathbb{R}^{n \times K}$.
1. Define selection sets: for $b = 1, \dots, B$, set $S_b = S\left(\bar{Z}_b[I^C]\right)$.
2. Compute selection scores: for $k = 1, \dots, K$, set $SS^k_S = \frac{1}{B} \sum_{b=1}^{B} I(k \in S_b)$.

We use a simulation experiment to understand properties of the proposed algorithms. Our guiding questions are,

(G1) How different are the results obtained via supervised vs. unsupervised feature learning algorithms $T(\cdot; \theta)$?

(G2) When we vary the relative sizes of $I$ and $I^C$, we expect a trade-off between feature learning and inferential quality. Can we characterize this trade-off?

(G3) How does the dimensionality reduction approach used in $A$ affect downstream inferences?

(G4) Is it ever possible to substitute the RCF feature learner for either of the more time-consuming VAE or CNN models?

To answer these questions, we evaluate our proposal using a full factorial design with three factors,

1. Relative sizes of $I$ and $I^C$: We use $|I| \in \{0.15n, 0.5n, 0.9n\}$.

2. Dimensionality reduction procedure: We use both PCA and SCA.

3. Algorithm used: We train CNN, VAE, and RCF models.
We use $B = 20$ perturbations in each case. Hence, 3 splits $\times$ 3 models $\times$ 20 perturbations = 180 models are trained, from which $180 \times 2 = 360$ sets of dimensionality-reduced features are obtained.

4.1 Simulated Data

The key ingredient in this simulation is the construction of a dataset where all "true" features are directly controlled. To motivate the simulation, imagine studying a set of tumor pathology slides, with the hope of discovering features that are predictive of disease prognosis. Each pathology slide gives a view into a cellular ecosystem — there are several types of cells, with different sizes, densities, and affinities to one another. In the simulation, we first generate the latent properties of each slide. Prognosis is defined as a linear function of these properties. Then, images are generated that also reflect the latent properties. The essential challenge is that the investigator only has access to the images and patient prognoses, not the true properties behind each image. A good feature extractor $T(x; \hat{\theta})$ should recover important cell ecosystem properties from the images alone. Example images for varying values of $y$ are given in Figure 1. In total, our simulation generates 10,000 such RGB images.

We now give details. The locations of cells are governed by an intensity function drawn from a two-dimensional marked Log Cox Matern Process (LCMP) [Diggle et al., 2013]. Recall that a Matern process is a Gaussian process with covariance function
$$C_{\nu, \alpha}(\|x - y\|) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu} \frac{\|x - y\|}{\alpha}\right)^{\nu} K_{\nu}\left(\sqrt{2\nu} \frac{\|x - y\|}{\alpha}\right), \tag{1}$$
where $\alpha$ acts like a bandwidth parameter and $\nu$ controls the roughness of the simulated process.

Suppose we have $R$ types of cells. Then, our LCMP should have $R$ classes. This can be constructed as follows. First, a nonnegative process $\Lambda(x)$ is simulated along the image grid, $\Lambda(x) \sim \exp\left(\mathcal{N}(0, C_{\nu_\Lambda, \alpha_\Lambda})\right)$, where $C_{\nu_\Lambda, \alpha_\Lambda}$ is the covariance matrix induced by equation 1. This is a baseline intensity that determines the locations of cells, regardless of cell type. Then, $R$ further processes are simulated, $B_r(x) \sim \exp\left(\beta_r + \mathcal{N}(0, C_{\nu_B, \alpha_B})\right)$. These processes reflect the relative frequencies of the $R$ classes at any location $x$; the intercept $\beta_r$ makes a class either more or less frequent across all positions $x$.

Given these intensity functions, we can simulate $N$ cell locations by drawing from an inhomogeneous Poisson process with intensity $\Lambda(x)$. For a cell at location $x$, we assign it cell type $r$ with probability $\frac{B_r^{\tau}(x)}{\sum_{r'=1}^{R} B_{r'}^{\tau}(x)}$. Here, we have introduced a temperature parameter $\tau$ which controls the degree of mixedness between cell types at a given location.

To complete the procedure for simulating images, we add two last sources of variation — the number of cells and the cell sizes. The number of cells per image is drawn uniformly from 50 to 1000. The cells from class $r$ are drawn with a random radius from $\text{Gamma}(5, \lambda_r)$. A summary of all parameters used to generate each image is given in Table 1. Each parameter is drawn uniformly within its range, which has been chosen to provide sufficient variation in image appearance. These parameters are the "true" latent features associated with the simulated images; they give the most concise description of the variation observed across the images.
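The following sketch illustrates this generative mechanism for a single image on a coarse grid. It uses a Matern kernel and a multinomial stand-in for the inhomogeneous Poisson draw, so it is a simplified illustration of the process described above rather than the exact generator released with the paper; all parameter values are placeholders.

    import numpy as np
    from sklearn.gaussian_process.kernels import Matern

    rng = np.random.default_rng(0)
    grid_size, R, tau = 32, 3, 1.0
    xy = np.array([(i, j) for i in range(grid_size) for j in range(grid_size)]) / grid_size

    def gp_sample(nu, alpha, intercept=0.0):
        """Draw one Gaussian process realization on the grid with a Matern covariance."""
        C = Matern(length_scale=alpha, nu=nu)(xy)
        return intercept + rng.multivariate_normal(np.zeros(len(xy)), C + 1e-6 * np.eye(len(xy)))

    # baseline log-intensity Lambda(x) and R relative intensity processes B_r(x)
    Lam = np.exp(gp_sample(nu=1.5, alpha=0.2))
    B = np.stack([np.exp(gp_sample(nu=1.5, alpha=0.2, intercept=beta))
                  for beta in rng.uniform(-0.5, 0.5, size=R)])

    # place N cells with probability proportional to Lambda(x) (multinomial stand-in for
    # the inhomogeneous Poisson process), then assign types with a tempered softmax
    N = rng.integers(50, 1000)
    cells = rng.choice(len(xy), size=N, p=Lam / Lam.sum())
    type_probs = (B[:, cells] ** tau) / (B[:, cells] ** tau).sum(axis=0)
    types = np.array([rng.choice(R, p=type_probs[:, i]) for i in range(N)])
    print(np.bincount(types, minlength=R))  # cell counts per class for this image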
Figure 1: Example simulated images, for low (top), average (middle), and high (bottom) values of $y_i$. For each sample, three relative intensity functions $B_r(x)$ are generated, shown as greyscale heatmaps. Samples drawn from each process are overlaid as circles. The final image, displayed at the right of each group, combines points from the three processes, removing the original generating intensity function, which is not available to the feature learner. Small $y_i$ values are associated with smoother, less structured intensity functions.

Feature | Description | Influence | Range
$N_i$ | The total number of cells. | 0.5 | [50, 1000]
$\nu_{i,\Lambda}$ | The roughness of the overall intensity process. | -0.5 | [0, ·]
$\alpha_{i,\Lambda}$ | The bandwidth of the overall intensity process. | -0.5 | [0, ·]
$\beta_{ir}$ | The intercept controlling the frequency of class $r$. | 1 for $r = 1$, -1 otherwise | [−·, ·]
$\nu_{iB}$ | The roughness of the relative intensity processes. | -0.5 | [0, ·]
$\alpha_{iB}$ | The bandwidth of the relative intensity processes. | -0.5 | [0, ·]
$\tau_i$ | The temperature used in cell type assignment. | 0.5 | [0, ·]
$\lambda_{ir}$ | The shape parameter controlling the sizes of each cell type. | 1 for $r = 1$, 0 otherwise | [100, ·]

Table 1: Our simulation mechanism is governed by the above parameters. Parameters $N_i$ and $\lambda_{ir}$ control the number and sizes of imaginary cells. $\nu_{i,\Lambda}$, $\alpha_{i,\Lambda}$, $\beta_{ir}$, $\nu_{iB}$, $\alpha_{iB}$, and $\tau_i$ control the overall and relative intensities of the marked LCMP from which the cell positions are drawn. Example draws are displayed in Figure 1.

These features are the latent properties used to generate the prognosis for each patient $i$. Specifically, we generate
$$y_i = \sum_k \text{Influence}_k \times \left[\frac{\text{Feature}_{ik} - \overline{\text{Feature}}_k}{\text{SD}(\text{Feature}_k)}\right]$$
for the values of Influence given in Table 1. Note that there is no additional noise: if the $\text{Feature}_{ik}$'s were known for each sample, then the $y_i$'s could be predicted perfectly. Therefore, the simulation gauges the capacity of the feature learners to recover these known latent features.
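As a concrete reading of this formula, the response can be generated from a table of latent features in a few lines of numpy. The influence vector and feature matrix below are placeholders, not the exact values used in our released simulation.

    import numpy as np

    rng = np.random.default_rng(2)
    n_samples = 10000
    influences = np.array([0.5, -0.5, -0.5, 1.0, 0.5])        # placeholder influence values
    features = rng.uniform(size=(n_samples, len(influences)))  # stand-in latent features

    # standardize each latent feature, then take the influence-weighted sum (no added noise)
    standardized = (features - features.mean(axis=0)) / features.std(axis=0)
    y = standardized @ influences
    print(y[:5])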
We now summarize the findings of our factorial experiment. All simulated data, trained models, and aligned features have been posted publicly; links can be found in the appendix.

To compare the features learned by supervised and unsupervised approaches (G1), we first directly visualize example learned features. Figure 2 shows the activations of learned features across 2000 images for two perturbed versions of the training data, at a fixed value of $|I|$. For the three algorithms, the learned features correspond to,

• CNN: Activations from the final hidden layer of neurons, used directly as input for the regression. There are a total of 512 nonnegative features (they are nonnegative because they follow a ReLU layer).

• VAE: Spatially pooled activations from the middle, encoding layer of the variational autoencoder. There are a total of 64 real-valued features.

• RCF: The spatially pooled activations corresponding to each of 1048 random convolutional features.

Our first observation is that, across algorithms, there is no simple correspondence between learned and source features (i.e., parameters of the underlying simulation). For example, it is not the case that one set of features represents the number of cells $N$ while a different set maps to the roughness $\nu_\Lambda$. Rather, there appear to be clusters of learned features, and each cluster corresponds to a pattern of correlations across multiple source features. For example, in Run 1 of the CNN, a cluster of learned features is strongly negatively correlated with $y_i$, $\lambda_{i1}$, and $\tau_i$, and positively correlated with $N_i$. We also find large subsets of features, across all models, that are only weakly correlated with any source feature.

Next, note that certain source features are "easier" to represent than others, in the sense that more of the learned features are strongly correlated with them. Many features are correlated with $N_i$, the total number of cells in the image, and $\lambda_{i1}$, the size of the cells from Process 1. Depending on the model, the bandwidth $\alpha_{ir}$, roughness $\nu_{ir}$, and prevalence $\beta_{ir}$ parameters are either only weakly or not at all correlated with the learned features. Interestingly, the convolutional network learns to represent $\beta_1$ well, but not the other $\beta_r$, which is consistent with how each $\beta_r$ enters the response $y_i$. Even when features learn to detect variation in $\alpha_{ir}$ and $\nu_{ir}$, they cannot disambiguate between these two parameters.

Finally, consider differences between the feature learning algorithms. The CNN and VAE features tend to be more clustered, with strong correlation across several source features. In contrast, the RCF features show more gradual shifts in correlation strength. They also show relatively little variation in correlation strength across features other than $\lambda_{i1}$ and $N_i$.

Note that the features do not naturally map onto one another from one run to the next. This is not obvious from Figure 2, but a zoomed-in version in Supplementary Figure 12, showing only the first 15 features per run, makes this clear.

To consider the trade-offs between feature learning and inferential quality (G2) and dimensionality reduction strategy (G3), Figure 3 displays the top canonical correlations between learned features $\bar{Z}_b$ and the original source features, across bootstrap replicates and algorithms. Note that we calculate these scores separately for the training, development, and testing splits. The training and development splits are subsets of $I$. Training samples were used in the optimization for each feature learner; development samples were used to choose hyperparameters of the optimizer.
They are shown separately from the test samples $I^C$ to make it possible to recognize potential overfitting in the feature learning process.

Even after the initial dimensionality reduction, the CCA canonical correlations decay quickly. The dimensionality reduction method used does not have much effect at this stage. In general, the fraction of data belonging to $I$ also matters little; however, there is an exception in the CNN features. Here, the features learned when $|I|$ is only 15% of the data have noticeably lower canonical correlation in the second dimension. Note also that, from the point of view of these canonical correlations, the RCF and VAE have comparable feature learning behaviors.

The feature learning algorithms do not seem to overfit the simulation data. If they did, then the canonical correlations on the training and development data would be larger than those on the test data. That said, there are noticeable differences between the data splits, and this effect will be visible in later figures as well. We hypothesize that the differences are due to the alignment process. During the Procrustes rotation, one matrix $Z_b[I^C]$ may get "lucky" and learn an alignment with axes that better reflect variation in the source features. In this way, even though the test data may not have been used to learn features, they may have higher correlation with the source features than the original $Z_b[I]$.

To shed further light on G2 and G3, Figure 4 shows the median number of features selected by stability selection, $|S(\lambda, \pi_{thr})|$, across training sample sizes. The median is taken across all bootstrap iterations. We fix a regularization strength $\lambda$ where small perturbations in $\pi_{thr}$ lead to large changes in the number of selected features. We have run stability selection for each replicate, restricted to either the training, development, or test data, as indicated by the color of each line.

All the curves decrease because increasing the stringency $\pi_{thr}$ leads to fewer selected features. We find that the features learned by the CNN are more frequently selected. This is expected, considering that the CNN features are learned in a supervised fashion. More surprising is the fact that the RCF features are more frequently selected than the VAE features, suggesting that the simple baseline might already provide features useful for interpretation, giving an affirmative answer to G4.

Finally, it appears that $|S(\lambda, \pi_{thr})|$ is largest when using 50% of the data for feature learning. Though the features learned using 90% of the data may be of higher quality, it is not possible to select them as frequently, because stability selection will have low power when it can only subsample from the 10% of the data reserved for inference. This phenomenon appears for features learned by both supervised and unsupervised methods. For this reason, in the remainder of this study, we focus on results obtained using a 50% split between learning and inference, i.e., $|I| = |I^C|$, though results using different splits are available in the appendix.

Figure 3: The top canonical correlations between the source features and the aligned $\bar{Z}_b$. Rows correspond to the CNN, RCF, and VAE algorithms. Columns give the CCA dimension. Within each display, the canonical correlations across all bootstraps are shown. The $x$-axis within each cell gives the amount of data used for feature learning and the dimensionality reduction method used.
Canonical correlations are computed separately for samples within the training, development, and test splits, indicated by each point's color.

Figure 4: The median number of features selected by stability selection across bootstrap runs $\bar{Z}_b$, viewed as a function of $\pi_{thr}$. Rows and columns correspond to models and the fraction of the data used for training, respectively. Stability selection is run within training, development, and test splits (color) and using two dimensionality reduction procedures during alignment (solid vs. dashed lines).

4.2.3 Stability visualization

Figures 5 and 6 summarize the results of Algorithms 2 and 3 applied to several feature learners. Within a single panel, each star glyph corresponds to a single sample. The arms of the glyph link the bootstrap replicates for that sample, $z_{i1}, \dots, z_{iB}$. Each glyph is shaded by the true value of $y_i$ for that sample. Only the first two aligned feature dimensions are shown.

Both the model and the reduction strategy used influence the stability of the learned features. Based on the relative sizes of the glyphs, the CNN features are least stable, those from the RCF are most stable, and the VAE features are intermediate. Certain regions of the learned feature space appear more stable than others. Features learned with 15% of the data are highly unstable and show little association with the response. For larger $|I|$'s, the learned features are more stable. Supplementary Figure 14 shows that in this case, there are also stronger associations with $y$.

Figure 6 displays stability selection paths $\hat{\Pi}_{bk}(\lambda)$ for combinations of feature learning and dimensionality reduction procedures. As in Figure 3, we find that the largest variation is between the data splits, but that overfitting does not appear to be a problem. Indeed, for several features, the development and test splits have higher selection probabilities.

Revisiting the effect of the dimensionality reduction approach on selection (G3), we find the first major differences between dimensionality reduction strategies. The selection paths are not monotone for SCA, and they also vary substantially across runs. This is most likely a result of the fact that the matrix $B$ in the decomposition $X = ZBY^\top$ need not be diagonal, which correlates the resulting coordinates. Further, when using PCA, the top features are also the most selected ones; this is not always the case with SCA.

Next, compare the unsupervised, supervised, and accelerated methods (G1, G4) from the selection perspective. Between algorithms, we find that the selection paths for RCF features form a tighter band across all bootstrap samples, another mark in its favor. For all algorithms, this band seems to widen for later feature dimensions; the selection probabilities also generally rise more gradually. From Feature 3 onwards, the selection paths for the VAE rise more gradually than the corresponding paths for either the CNN or RCF dimensions, suggesting a lack of association with $y$.

In this section, we conduct a feature stability analysis on the spatial proteomics dataset reported in [Keren et al., 2018], where the authors found a relationship between the spatial organization of Triple Negative Breast Cancer (TNBC) tissues and disease progression. In a classical proteomics study, the expression levels for a set of proteins are measured for a collection of cells, but the cell locations are unknown.
In contrast, these data provide, for each patient, (1) an image delineating cell boundaries and (2) the protein expression levels associated with each cell in the image. (The data are publicly available; see also the appendix for preprocessed versions.)

Figure 5: Aligned features $\bar{Z}_b$ derived from several models with varying feature learning sample sizes $|I|$. For clarity, only a subset of samples is shown. For a version of this figure showing all samples, and collapsing each star to a point, refer to Supplementary Figure 14.

Figure 6: Stability selection paths $\hat{\Pi}_{bk}(\lambda)$ for 10 learned features, across models and alignment strategies. The top and bottom groups use PCA- and SCA-reduced features, respectively, and each row corresponds to a feature learning algorithm. Each column corresponds to one learned feature dimension. Each curve gives $\hat{\Pi}_{bk}(\lambda)$ for one bootstrap replicate $b$, restricted to samples from either the training, development, or test set. The $x$-axis is the regularization strength $\log \lambda$ and the $y$-axis is the probability of selecting that feature at the given regularization strength.

Figure 7: Example patches from the TNBC data. Panels are ordered by $y_i$, the (log) fraction of cells they contain that belong to the tumor. The goal of the prediction algorithm in this example is to correctly place patches from new patients along this gradient, after having observed many patches and their corresponding $y_i$'s.

We will work only with the spatial cell delineations, not the protein expression levels. This allows us to study the mechanics of feature learning within the images without having to worry about linking the expression and image data, which is in itself a complex integration problem. Though this means we lose some scientific depth, we gain substantial implementation simplicity, and the analysis serves as a clear illustration. Our complete data are 41 images, each taken from a separate patient. We associate each pixel with one of 7 categories of tumor and immune cell types.

To set up a prediction problem, we first split each image into $64 \times 64$ patches. These patches are our $x_i$. Patches from 32 of the patients are reserved for feature learning. Four among these 32 are used as a development split, to tune parameters of the feature learning algorithms. As a response variable, we use $y_i = \log\left(\frac{\#\{\text{Tumor cells in } x_i\}}{\#\{\text{Immune cells in } x_i\}}\right)$. Example cell patches are shown in Figure 7.

As a baseline, we compare against a ridge regression with pixelwise composition features. Specifically, we train a model with $y$ as a response and the average number of pixels belonging to each of the cell-type categories as a (length 7) feature vector. This helps us to determine whether the model has learned more interesting features useful for counting cells, like cell size and boundaries, rather than simply averaging across pixel values. Indeed, Figure 8 makes clear that, with the exception of the RCF-SCA combination, all feature learning and dimensionality reduction combinations outperform this manual baseline.

Stability curves associated with the learned features from the CNN, RCF, and VAE models are shown in Figure 9. Interestingly, across all algorithms, the top aligned features are not necessarily those with the highest selection probabilities. For example, CNN Feature 4 has higher selection than Feature 3, RCF Feature 6 has higher selection than Feature 4, and VAE Feature 5 is more frequently selected than Feature 3.
In contrast to the simulation, the CNN and VAE do not have substantial differences in their selection curves. Further, the RCF features seem to have high selection probabilities across more dimensions than either the CNN or VAE. This suggests that the most salient features in the $x_i$ are also relevant for predicting $y_i$, and that supervision is not as critical in this problem as it was in the simulation.

Figure 8: Relative performance of feature learning strategies on the TNBC data. Linked points come from the same bootstrap replicate. "Manual features" refers to the regression model that simply uses the raw pixel counts for each of the cell types. Predictions from the development split are omitted; this split has few patches, and the estimated MSEs have high variance.

Example aligned coordinates are given in Figure 10. Consistent with this conclusion, we find that the association with the response is clearly visible with respect to the learned feature dimensions, even when using unsupervised algorithms. The changes in the sizes of glyphs across regions of the learned feature space are especially pronounced in this application. For example, in the VAE, the representations of samples with a higher tumor-to-immune ratio $y_i$ are much more stable than those with a low ratio.

There is no consensus on how to best interpret automatically learned features [Doshi-Velez and Kim, 2017]. Nonetheless, we present one simple approach in Figure 11, overlaying example patches onto the aligned coordinates. For example, in the CNN, the first dimension distinguishes between the relative number of tumor and immune cells, while the second dimension reflects the density of cells. In the RCF, the second dimension captures the diversity of cell types, with more uniform samples on the left and more heterogeneous ones on the right. The second dimension of the VAE seems related to both cell density and the number of cell types. The analogous display from an SCA reduction is given in Supplementary Figure 24.

Figure 9: Selection curves for features learned from the TNBC data. 50% of the data were used for feature learning and PCA was used for dimensionality reduction. Each curve is read as in Figure 6. The analogous paths using SCA alignment are given in Supplementary Figure 25.

This study has investigated the stability of machine-generated, rather than hand-crafted, features. A better understanding of stability in this modern regime has consequences for how these methods can be used in real-world applications, especially those intended for scientific workflows.

Our results raise several questions for further study. It is natural to ask to what extent similar behaviors are exhibited across other data domains, model types, or training regimes. For example, it would not be unusual to represent the cell data in our case study using a marked graph linking neighboring cells. Do the features learned by a graph autoencoder have similar stability properties? In other domains, we may ask whether our methods can be adapted to text or audio data.

Further, there are questions that may guide us towards better instantiations of Algorithms 1 through 3. While we have relied on the bootstrap, our notation encompasses more general perturbations $\mathbb{P}$. For example, how might learned features change when discarding nonrandomly chosen subsets of training data? Perhaps learning features based on different temporal or spatial subsets could reveal a drift in the important features over time or space.
Alternatively, we could imagine perturbing the model training procedure, using different hyperparameters. It would be of interest to trace out the dependence of the learned features on the mechanics of the feature learner.

Similarly, using a Procrustes rotation for $A$ and stability selection for $S$ provides a reasonable point of departure, but more sophisticated approaches are possible. For example, dimensionality reduction could be optimized to support alignment; this could be accomplished using multiple canonical correlation analysis. Combinations of features could also be approximately matched across feature learners, using a form of optimal transport, identifying sets of features that all activate on similar input samples. This has been applied in a federated learning context, but not in the study of stability [Wang et al., 2020a].

Figure 10: Representation of samples according to some of the important learned feature dimensions. Glyphs are read as in Figure 5. To interpret these features, example patches are arranged according to these coordinates in Figure 11 and Supplementary Figure 24. Only a subsample of points is shown; the full dataset, with stars collapsed to points, is shown in Figure 21.

Figure 11: A version of the PCA panels of Figure 10 that overlays representative samples across the learned feature space. Cells are color coded as in Figure 7. Note that the overall shape of the region in which images are displayed mirrors the shape of the cloud of points in Figure 10.

It is also possible to propose alignments based on more refined measures of correlation [Josse and Holmes, 2016, Azadkia and Chatterjee, 2019]. We have found that using an even split between feature learning and inference gives reasonable results, but our results are admittedly coarse. Though we have concentrated on stability selection, the procedure $S$ could be any selective inference procedure. Finally, our approaches to summarizing and displaying the resulting representations may be improved through a more careful application of interactive visualization principles.

More broadly, this study is situated in the body of work seeking to bridge the rift between the two cultures [Breiman, 2001, Efron, 2020]. Across the selective inference, conformal prediction, and interpretability literatures, there is a growing understanding that the statistics and machine learning communities could benefit by learning to speak one another's languages [Angelopoulos et al., 2020, Ren and Candès, 2020]. Nonetheless, while the data sources and feature learning context we consider are novel, our basic motivation is an old statistical idea that still rings true — quoting from [Mosteller and Tukey, 1977],

One hallmark of the statistically conscious investigator is a firm belief that, however the survey, experiment, or observational program actually turned out, it could have turned out somewhat differently.

In order to accomplish this study, we have adapted tools from the representational analysis, dimensionality reduction, and high-dimensional inference communities. These tools give a window into the workings of modern feature extraction techniques, helping us view commonplace algorithms in a new way. In a sense, our intent is to do more than summarize data; it is to generate new data. In the same way that a standard error around a mean is a new piece of information that supports more nuanced reasoning, we hope to see the development of techniques that generate data characterizing the behavior of modern feature learning algorithms.

References
Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I. Jordan. Uncertainty sets for image classifiers using conformal prediction. arXiv preprint arXiv:2009.14193, 2020.

Mona Azadkia and Sourav Chatterjee. A simple measure of conditional dependence. arXiv preprint arXiv:1910.12327, 2019.

Rebecca L. Barter and Bin Yu. Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data. Journal of Computational and Graphical Statistics, 27(4):910-922, 2018.

Elizabeth Bondi, Andrew Perrault, Fei Fang, Benjamin L. Rice, Christopher D. Golden, and Milind Tambe. Mapping for public health: Initial plan for using satellite imagery for micronutrient deficiency prediction. 2020.

Leo Breiman. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science, 16(3):199-231, 2001.

Peter Bühlmann. Comments on: Data science, big data and statistics. TEST, 28(2):330-333, 2019.

Peter Bühlmann and Sara Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

Fan Chen and Karl Rohe. A new basis for sparse PCA. arXiv preprint arXiv:2007.00596, 2020.

Persi Diaconis and Bradley Efron. Computer-intensive methods in statistics. Scientific American, 248(5):116-131, 1983.

Peter J. Diggle, Paula Moraga, Barry Rowlingson, and Benjamin M. Taylor. Spatial and spatio-temporal log-Gaussian Cox processes: extending the geostatistical paradigm. Statistical Science, 28(4):542-563, 2013.

Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.

Bradley Efron. Prediction, estimation, and attribution. Journal of the American Statistical Association, 115(530):636-655, 2020.

Eric Elguero and Susan Holmes-Junca. Confidence regions for projection pursuit density estimates. In Compstat, pages 59-63. Springer, 1988.

Gail Gong. Cross-validation, the jackknife, and the bootstrap: Excess error estimation in forward logistic regression. Journal of the American Statistical Association, 81(393):108-113, 1986.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Julie Josse and Susan Holmes. Measuring multivariate association and beyond. Statistics Surveys, 10:132, 2016.

Leeat Keren, Marc Bosse, Diana Marquez, Roshan Angoshtari, Samir Jain, Sushama Varma, Soo-Ryum Yang, Allison Kurian, David Van Valen, Robert West, et al. A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell, 174(6):1373-1387, 2018.

Emma Lundberg and Georg H. H. Borner. Spatial proteomics: a powerful discovery tool for cell biology. Nature Reviews Molecular Cell Biology, 20(5):285-302, 2019.

James L. McClelland, David E. Rumelhart, PDP Research Group, et al. Parallel distributed processing. Explorations in the Microstructure of Cognition, 2:216-271, 1986.

Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417-473, 2010.

Frederick Mosteller and John Wilder Tukey. Data Analysis and Regression: A Second Course in Statistics. 1977.

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076-6085, 2017.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems, 21:1313-1320, 2008.

Zhimei Ren and Emmanuel Candès. Knockoffs with side information. arXiv preprint arXiv:2001.07835, 2020.

Esther Rolf, Jonathan Proctor, Tamma Carleton, Ian Bolliger, Vaishaal Shankar, Miyabi Ishihara, Benjamin Recht, and Solomon Hsiang. A generalizable and accessible approach to machine learning with global satellite imagery. NBER Working Paper, 2020.

Aaron Van Den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306-6315, 2017.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010.

Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni. Federated learning with matched averaging. arXiv preprint arXiv:2002.06440, 2020a.

Siruo Wang, Tyler H. McCormick, and Jeffrey T. Leek. Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48):30266-30275, 2020b.

Bin Yu. Stability. Bernoulli, 19(4):1484-1500, 2013.
Instructions to reproduce our simulations and data analysis example are available in a README on the study's GitHub page. Training split creation, the regression baseline, and feature learning can be reproduced in the IPython notebooks

• tnbc_splits.ipynb: Define training and test splits for the TNBC dataset.
• tnbc_baseline.ipynb: Train a ridge regression baseline on the TNBC dataset.
• model_training.ipynb: Train either a CNN, RCF, or VAE feature learner.

MIBI-ToF data preparation, generation of simulation data, feature stability analysis, and visualization of results are done within the R Markdown documents

• generate.Rmd: Simulate the LCGP cells dataset.
• stability.Rmd: Perform feature alignment and stability selection.
• visualize_features.Rmd: Visualize the aligned and selected features output by stability.Rmd.
• summary_plots.Rmd: Compute the CCA and selection summaries reported in Figures 3 and 4.

To support code reusability between experiments, two helper packages were prepared, stability and inference. These packages can be installed by calling

git clone https://github.com/krisrs1128/learnedinference.git
Rscript -e "devtools::install('learnedinference/inference')"
pip3 install learnedinference/stability

from the terminal. We have prepared a docker image with all necessary software pre-installed. For example, to rerun our stability analysis, the following commands may be used,

docker run -it krisrs1128/li:latest bash
git clone https://github.com/krisrs1128/learnedinference.git
source learnedinference/.env
Finally, we have released raw data and intermediate results from our analysis,

• sim_data.tar.gz: Our toy simulation dataset.
• tnbc_data.tar.gz: The preprocessed MIBI-ToF data, with each patient's data split into 64 × 64 patches and with the associated splits and response values stored in a metadata file.
• simulation_outputs.tar.gz: All the models trained in our simulation experiments.
• tnbc_outputs.tar.gz: All models trained in our illustration on the TNBC dataset.
• simulation_figure_data.tar.gz: The data written by stability.Rmd which was used to generate figures for our simulation.
• tnbc_figure_data.tar.gz: The data written by stability.Rmd which was used to generate figures for our data illustration.
• tnbc_raw.tar.gz: The original MIBI-ToF tiffs, before splitting into patches.

Figure 14: … associations with $y$ clearer, especially for larger $|I|$.

Figure 15: The analogous figure to Figure 5, but using SCA dimensionality reduction.

Figure 16: The coordinates of all samples in the first two dimensions after aligning the SCA-reduced features. A subset, along with the bootstrap glyphs, is shown in Figure 15.

Figure 17: Selection paths for all models when using 15% of the data for feature learning. Most features, especially those from the CNN, have lower selection probability than the analogous features at other split sizes. Other key characteristics of the selection curves, like the effect of SVD vs. SCA reduction, and the stability of RCF features, remain the same.

Figure 18: Selection paths for all models when using 90% of the data for feature learning. The top learned features across several methods are strongly related to the response.

Figure 19: Patches overlaid on the learned features in Figure 14, to guide interpretation of regions of the learned feature space. The figure is read in the same way as Figure 11.

Figure 20: The analogous figure to Figure 14, but using SCA dimensionality reduction instead.

Figure 21: The version of Figure 10 showing all samples, and collapsing all glyphs to a point.

Figure 22: Examples of features that are found to be less important in the TNBC dataset application, according to the selection curves in Figure 9. Note that the relationship between the displayed locations and the measured $y$ …