Active Learning to Classify Macromolecular Structures in situ for Less Supervision in Cryo-Electron Tomography
Xuefeng Du, Haohan Wang, Zhenxi Zhu, Xiangrui Zeng, Yi-Wei Chang, Jing Zhang, Eric Xing, Min Xu
Department of Computer Science, University of Wisconsin-Madison, Madison, 53706, USA; Language Technologies Institute, Carnegie Mellon University, Pittsburgh, 15213, USA; Computational Biology Department, Carnegie Mellon University, Pittsburgh, 15213, USA; Machine Learning Department, Carnegie Mellon University, Pittsburgh, 15213, USA; Department of Computer Science, Beijing University of Posts and Telecommunications, 100876, China; Department of Biochemistry and Biophysics, University of Pennsylvania, Philadelphia, 19104, USA; Department of Computer Science, University of California - Irvine, Irvine, 92697, USA
* Corresponding author
Abstract

Motivation:
Cryo-Electron Tomography (cryo-ET) is a 3D bioimaging tool that visualizes the structural and spatial organization of macromolecules at a near-native state in single cells, which has broad applications in life science. However, the systematic structural recognition and recovery of macromolecules captured by cryo-ET are difficult due to high structural complexity and imaging limits. Deep learning based subtomogram classification methods have played critical roles in such tasks. As supervised approaches, however, their performance relies on sufficient and laborious annotation of a large training dataset.
Results:
To alleviate this major labeling burden, we propose a Hybrid Active Learning (HAL) framework for querying subtomograms for labeling from a large unlabeled subtomogram pool. Firstly, HAL adopts uncertainty sampling to select the subtomograms that have the most uncertain predictions. This strategy forces the model to be aware of the inductive bias during classification and subtomogram selection, which satisfies the discriminativeness principle in the AL literature. Moreover, to mitigate the sampling bias caused by such a strategy, a discriminator is introduced to judge whether a certain subtomogram is labeled or unlabeled, and the model subsequently queries the subtomograms that have higher probabilities of being unlabeled. Such a query strategy encourages matching the data distribution between the labeled and unlabeled subtomogram samples, which essentially encodes the representativeness criterion into the subtomogram selection process. Additionally, HAL introduces a subset sampling strategy to improve the diversity of the query set, so that the information overlap between queried batches is decreased and the algorithmic efficiency is improved. Our experiments on subtomogram classification tasks using both simulated and real data demonstrate that we can achieve comparable testing performance (on average only a 3% accuracy drop) using less than 30% of the labeled subtomograms, which is a very promising result for the subtomogram classification task with limited labeling resources.
Availability: https://github.com/xulabs/aitom
Contact: [email protected]
Cellular processes are generally governed by macromolecules. To accurately understand these processes, Cryo-Electron Tomography (cryo-ET) has been developed recently to enable a systematic 3D visualization of subcellular structures in single cells at sub-molecular resolution and in a native state. However, due to the structural content complexity of the captured tomograms and imaging limitations, it is difficult to classify macromolecules in subtomograms (a subtomogram is a subvolume of a tomogram that is likely to contain a single macromolecule) for structural recovery via manual inspection. Given that subtomogram classification is essentially a 3D image classification problem, supervised deep learning has recently become a major approach thanks to its ability to extract complex image composition rules from big image data. However, even though different approaches have been developed on either 2D or 3D cryo-ET data (Che et al., 2018; Xu et al., 2017; Du et al., 2019; Liu et al., 2019), few of them address the labeling burden, which is very time-consuming and requires structural biology expertise. This situation impedes the off-the-shelf deployment of these algorithms. For instance, even the 1,000 real subtomograms used in (Liu et al., 2018b) already required time-consuming labeling work from domain experts. Under such circumstances, we resort to active learning, which selects a subset of subtomogram samples that, if labeled and used for training, will best improve the model's performance under the same labeling budget (Sener and Savarese, 2018; Gissin and Shalev-Shwartz, 2019) (Fig. 1). Two main principles for such unlabeled sample selection have been proposed (Dasgupta, 2011), and both have limitations: discriminativeness and representativeness.
The discriminativeness principle aims to find the most discriminative samples for the current classifier, which will shrink the space of candidate classifiers as rapidly as possible (Wang and Ye, 2013). Popular criteria include the uncertainty rule (Yang and Loog, 2018; Wang et al., 2018), expected error reduction (Huang et al., 2016) and query by committee (Seung et al., 1992; Gilad-Bachrach et al., 2005). In this case, the samples are selected based on a specific criterion instead of being sampled i.i.d. Such sampling bias prevents active learning from finding a classifier with good generalization performance and query efficiency (Wang and Ye, 2013), and it becomes even more severe for high-dimensional and complex 3D medical images. The representativeness principle aims to address this problem by querying samples that can represent the overall patterns or statistics of the unlabeled data, for example by clustering (Nguyen and Smeulders, 2004) or generative models (Zhu and Bento, 2017; Tran et al., 2019; Lee and Kim, 2019; Sinha et al., 2019; Kim et al., 2020). Such methods perform better when less initial labeled data is provided. However, they become inefficient as the number of queried classes increases, since they rely solely on data distributions and do not fully use the label information (Wang and Ye, 2013). Since using either type of principle alone is not enough to guarantee the optimal result, in this paper we approach this task by integrating discriminativeness and representativeness in one optimization formulation, namely the Hybrid Active Learning (HAL) framework. To satisfy the principle of data representativeness, we start with a small labeled set and a large unlabeled set and train a supervised Convolutional Neural Network (CNN) on the labeled set. We then extract the feature representations of both the labeled and unlabeled sets.
Inspired by distribution alignment techniques (Ganin and Lempitsky, 2015), in each iteration we train a discriminator on these representations to predict how likely each subtomogram sample is to be labeled or unlabeled. Then, we select and label those subtomogram samples in the unlabeled dataset which are predicted to have higher probabilities of coming from the unlabeled dataset. This alternating optimization scheme effectively improves the representativeness of the labeled training set. Moreover, since the subtomograms captured by cryo-ET are highly heterogeneous, a large selected batch is likely to contain redundant subtomogram samples, which leads to significant information overlap and thus an inefficient querying process. Therefore, we apply a sub-sampling strategy to enlarge the query batch without losing diversity. For the discriminativeness principle, we additionally introduce the label information by using the entropy of predictions as the selection criterion. Such a heuristic is a strong active learning baseline, namely uncertainty sampling (Yang and Loog, 2018). In each sampling iteration, we use both principles to score the current unlabeled subtomogram samples and then ensemble the two scores for final ranking and selection. We then add all queried subtomogram samples into the labeled dataset and repeat until the labeling budget is reached. The overall learning and querying steps are summarized in Figure 2. Note that hybrid querying heuristics have also been proposed in the literature (Yin et al., 2017; Ash et al., 2020); we defer the discussion to Section 2.2. The contributions of this paper are summarized as follows: 1) We propose a 3D HAL framework to query unlabeled subtomogram samples and expand the training dataset, such that deep models can be trained with significantly lower labeling cost while incurring minimal prediction accuracy drop. We provide a theoretical analysis of the expected classification risk of our framework (Equation 2).
2) HAL is the first active learning work to address the issue of labeling cost in cryo-ET analysis tasks, and it integrates two principled query heuristics in one optimization framework to make the queried subtomograms both representative and discriminative. 3) In HAL, we adopt several effective strategies to improve the performance, such as a convolutional discriminator that learns a comparative metric on representations from shallower layers, and sub-sampling to improve the diversity of every query batch. 4) The empirical results for subtomogram classification using both simulated and real data demonstrate that we are able to achieve comparable testing performance (on average only a <3% prediction accuracy drop) while significantly reducing the labeling burden by over 70%.

Figure 1: Exemplary illustration of the active learning approach in cryo-ET classification. (a) shows the existing passive recognition pipeline, where sufficient labeling by domain experts is required, while (b) demonstrates the recognition scenario guided by active learning.

Figure 2: The full active learning scheme for subtomogram classification in our HAL framework.
Since 2017, supervised deep learning has become popular for cryo-ET analysis. Chen et al. (2017) proposed a 2D CNN segmentation model for segmenting ultrastructures on 2D slices of a 3D tomogram. Supervised deep learning based methods have also been proposed for the 3D subtomogram classification task thanks to their high-throughput processing capability. Specifically, (Xu et al., 2017) was the first to propose a deep learning approach to separate different structures in subtomograms. Liu et al. (2018b) introduced multi-task learning to learn a feature space for simultaneous classification, segmentation, and structural recovery of macromolecules. Other techniques, such as domain adaptation (Yu et al., 2020), open-set recognition (Du et al., 2019) and semi-supervised learning (Liu et al., 2019), have also been applied for reliable and deployable analysis results, as has Liu et al. (2018a). However, few of these works take the labeling burden into account, which imposes time-consuming labeling work on domain experts and impedes their off-the-shelf usage.
Active learning was popular prior to deep learning, where it was usually applied to small models. Uncertainty-based methods typically measure uncertainty by the posterior probability of the predicted classes (Lewis and Catlett, 1994; Lewis, 1995), the margin between the first and the second most confidently predicted classes (Joshi et al., 2009; Roth and Small, 2006), the entropy of the predictions (Settles and Craven, 2008; Luo et al., 2013; Joshi et al., 2009), or the distance to the decision boundary for SVMs (Li and Guo, 2014; Tong and Koller, 2001; Vijayanarasimhan and Grauman, 2011). Another direction based on the discriminative principle, such as query by committee, develops multiple models as a committee and uses the disagreement between them as the criterion for uncertainty estimation (McCallum and Nigam, 1998; Seung et al., 1992). The representativeness principle selects samples that cover the distribution of the entire dataset by clustering (Nguyen and Smeulders, 2004) or discrete optimization (Yang et al., 2015; Guo, 2010; Elhamifar et al., 2013). Other popular methods focus on either the neighboring information between samples (Hasan and Roy-Chowdhury, 2015; Aodha et al., 2015; Xu et al., 2010) or the expected model change (Settles et al., 2007; Roy and McCallum, 2001; Freytag et al., 2014) for sample selection. With the advent of deep learning, several query heuristics have been proposed for larger models and datasets, such as uncertainty based methods (Lin et al., 2018; Wang et al., 2017; Gal et al., 2017; Beluch et al., 2018), core-set sampling (Sener and Savarese, 2018) and generative models (Sinha et al., 2019). However, some of them, especially core-set sampling, are not scalable to very large datasets compared to our HAL, because a large distance matrix over the unlabeled samples is needed, which makes the query procedure highly expensive. Some hybrid query approaches have also been proposed. For instance, Yin et al.
(2017) selected data by uncertainty sampling and random sampling, whereas we use a discriminator to query representative samples in order to reduce the sampling bias of uncertainty sampling. Ash et al. (2020) selected samples whose gradients span diverse directions, which does not explicitly consider different query heuristics. Shui et al. (2020) used the Wasserstein metric to measure representativeness, which is a hand-crafted metric compared to the learnable discriminator in HAL. Li and Guo (2013) proposed to combine uncertainty with density, but only queried one sample at a time, and the matrix inverse operation in their density estimation is expensive for large 3D images. Others integrated generative models, such as adversarial training (Zhu and Bento, 2017) or variational approaches (Sinha et al., 2019), with other heuristics, while HAL does not involve any adversarial learning procedure, which would lead to unstable training and data selection results. There are also several works that applied active learning to biomedical images (Smailagic et al., 2018; Yang et al., 2017; Zhou et al., 2017; Kuo et al., 2018), built upon conventional active learning approaches for natural images. They either do not consider the tradeoff between the two query principles or are not applicable to 3D subtomogram classification tasks with high noise and transformation variations, which leads to sub-optimal subtomogram selection and classification performance.
Our Hybrid Active Learning method integrates two principles. For the representativeness principle, we develop an alternating optimization scheme for training the multi-class subtomogram classification model and a discriminator. Specifically, given a small initial set of labeled subtomograms, the classification model is first trained in a supervised way. Then, the hidden representations of both the labeled and unlabeled subtomogram samples are extracted to train a binary classification model (i.e. the discriminator). Thus, the probability scores of the unlabeled subtomogram samples are obtained from the predictions of the discriminator. Meanwhile, the uncertainty score (i.e. the entropy of the predictions from the multi-class classification model) is further fused with the discriminator score to produce the final query metric for ranking the unlabeled subtomograms. Afterwards, the top subtomograms are selected and labeled for iterative training until the budget is reached.
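The alternating scheme above can be sketched end-to-end. The following is a minimal illustration, not the authors' implementation: `hal_round`, `train_classifier` and `train_discriminator` are hypothetical names, the two training routines are lightweight stand-ins (nearest-centroid scoring and a distance-based domain probability) for the 3D CNN and the convolutional discriminator, and the data are synthetic feature vectors rather than subtomograms.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_classifier(X_lab, y_lab, n_classes):
    # Stand-in for the supervised 3D CNN: nearest-centroid "logits".
    dim = X_lab.shape[1]
    centers = np.stack([
        X_lab[y_lab == c].mean(axis=0) if np.any(y_lab == c) else np.zeros(dim)
        for c in range(n_classes)
    ])
    def predict_proba(X):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return softmax(-d)
    return predict_proba

def train_discriminator(F_lab, F_unlab):
    # Stand-in for the convolutional discriminator: P(unlabeled | feature).
    mu_l, mu_u = F_lab.mean(axis=0), F_unlab.mean(axis=0)
    def p_unlab(F):
        gap = ((F - mu_u) ** 2).sum(axis=-1) - ((F - mu_l) ** 2).sum(axis=-1)
        return 1.0 / (1.0 + np.exp(np.clip(gap, -50, 50)))
    return p_unlab

def hal_round(X, y, labeled, K, lam=1.0, n_classes=3):
    # One HAL iteration: train both models, then rank by the fused score.
    lab, unlab = np.flatnonzero(labeled), np.flatnonzero(~labeled)
    clf = train_classifier(X[lab], y[lab], n_classes)
    disc = train_discriminator(X[lab], X[unlab])
    P = clf(X[unlab])
    entropy = -(P * np.log(P + 1e-12)).sum(axis=1)   # uncertainty score
    score = disc(X[unlab]) + lam * entropy           # fused ranking score
    return unlab[np.argsort(-score)[:K]]             # top-K query batch

# Toy pool: 3 Gaussian blobs standing in for extracted subtomogram features.
X = np.concatenate([rng.normal(c, 1.0, size=(50, 8)) for c in (-3.0, 0.0, 3.0)])
y = np.repeat(np.arange(3), 50)
labeled = np.zeros(len(X), dtype=bool)
labeled[rng.choice(len(X), 6, replace=False)] = True
picked = hal_round(X, y, labeled, K=10)
labeled[picked] = True   # the "expert annotation" step
```

Each outer call to `hal_round` corresponds to one query iteration in Figure 2; in practice the queried samples are annotated by domain experts and the loop repeats until the labeling budget is reached.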
In this part, the multi-class subtomogram classification model starts with a sparsely labeled dataset D. Within the dataset, we denote the labeled subtomograms at iteration t as LD(t) and the unlabeled subtomograms as UD(t), so that LD(t) ∪ UD(t) = D and LD(t) ∩ UD(t) = ∅. From the domain adaptation point of view, we treat LD(t) and UD(t) as two separate domains, namely the source domain L and the target domain U, respectively. M(·) is defined as the feature extractor and D(·) is the introduced discriminator, which aims to distinguish these two domains. At each iteration t, to enhance the sample representativeness, we first train the main classifier using the softmax cross-entropy loss. Then we extract the representations from the intermediate layers on both LD(t) and UD(t) and regard them as inputs to the discriminator. Next, we train this discriminator on a binary classification task so that it can discriminate the labeled and unlabeled subtomograms well. If we take the output of the discriminator D(·) to be 0 for the labeled class and 1 for the unlabeled class, then we select and label a batch of subtomogram samples B(t) which satisfy:

B(t) = argmax_{x ∈ UD(t)} Pr(D(M(x)) = 1 | M(x)),   (1)

where B(t) is the queried unlabeled batch at iteration t.

Why a discriminator?
The reason is that if we can determine with high probability that an unlabeled subtomogram is from UD, then it should be different from LD, and labeling it helps improve the information encoded in the labeled dataset, and thus the model's generalization on the remaining unlabeled subtomogram examples. Otherwise, if the subtomogram examples from UD are indistinguishable from LD, then we have successfully represented the distribution with LD. This motivates us to design a discriminator D for such probability estimation and alignment. Moreover, the introduced discriminator is expected to provide more flexibility during classification and subtomogram sample selection, since it has a learnable metric for separating the labeled and unlabeled subtomogram examples, which is better than hand-crafted metric designs (Shui et al., 2020; Tang and Huang, 2019).

In this section, we propose two task-specific designs to further refine the capacity of the representativeness principle, namely the convolutional discriminator and the subset sampling strategy. Commonly, the discriminator for unsupervised domain adaptation (Ganin and Lempitsky, 2015) regards the outputs of the fully connected layers as its input. The claim is that such a design focuses on more fine-grained information for feature adaptation, since these layers extract and propagate more specific features. However, these fine-grained features are more suitable for multi-class classification and are usually biased for the discriminator, especially for highly heterogeneous 3D cryo-ET data. Instead, we propose to use the output of the last max-pooling layer as the input of the discriminator and to enhance the discriminator with more flexible convolutional operations. Specifically, the convolutional discriminator consists of two convolutional layers followed by two fully connected layers (Fig. 3).
These convolution operations enable our model to learn a flexible representation space and a task-specific comparison metric for binary classification, which helps query valuable subtomograms more effectively.

In addition, recall that querying unlabeled subtomogram samples requires iteratively training the multi-class classification model and expert annotation in a loop, so querying efficiency is important. One simple solution is to query subtomograms in larger batches instead of one subtomogram at a time, which reduces the waiting time until the classifier finishes training (Azimi et al., 2012). However, since we select the data in a large batch, all of which are predicted by the discriminator with a high probability of being unlabeled, they tend to have a similar distribution, especially for subtomograms captured at a higher noise level. This causes significant information overlap. To mitigate this, we emphasize the diversity of the sampled subtomograms in a batch by assuming that consecutive mini-queries are less likely to contain similar instances. We split the original queried batch B(t) into m sub-batches. Suppose we desire to select K subtomograms at iteration t; we first train the discriminator on the representations until convergence and label the top K/m subtomograms. Then we repeat the process by interleaving discriminator training and subtomogram selection until K subtomogram samples are queried. During this process, we only train the main classifier once but train the discriminator m times, which is more efficient. The detailed architecture of our model is shown in Fig. 3.

The motivation of the representativeness principle is to label the most appropriate data from the unlabeled subset UD that represents the distribution of the training (or the entire) dataset as well as possible. In this case, a classifier trained on LD should perform similarly to one trained with the entire dataset D labeled.
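The subset sampling procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: `subset_sampling_query` and `train_discriminator` are hypothetical names, the distance-based discriminator is a stand-in for the convolutional discriminator, and the inputs are synthetic feature vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_discriminator(F_lab, F_unlab):
    # Stand-in discriminator: P(unlabeled | feature) from distances to the
    # two pool means; in the paper this is the convolutional discriminator.
    mu_l, mu_u = F_lab.mean(axis=0), F_unlab.mean(axis=0)
    def p_unlab(F):
        gap = ((F - mu_u) ** 2).sum(axis=-1) - ((F - mu_l) ** 2).sum(axis=-1)
        return 1.0 / (1.0 + np.exp(np.clip(gap, -50, 50)))
    return p_unlab

def subset_sampling_query(F, labeled, K, m):
    # Query K samples as m sub-batches of K/m, retraining the discriminator
    # after each mini-query; the main classifier is trained only once outside.
    labeled = labeled.copy()
    picked = []
    for _ in range(m):
        lab, unlab = np.flatnonzero(labeled), np.flatnonzero(~labeled)
        disc = train_discriminator(F[lab], F[unlab])   # retrain on updated split
        top = unlab[np.argsort(-disc(F[unlab]))[: K // m]]
        picked.extend(top.tolist())
        labeled[top] = True   # counted as labeled before the next mini-query
    return picked

F = rng.normal(size=(200, 16))          # stand-in feature vectors
labeled = np.zeros(200, dtype=bool)
labeled[:10] = True
picked = subset_sampling_query(F, labeled, K=40, m=4)
```

Because each mini-query conditions the discriminator on the previously selected samples, consecutive sub-batches are pushed away from one another, which is the diversity effect described above.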
Naturally, we are interested in how to measure the distribution difference between two observations x ∼ L and x ∼ U, and whether the design of a discriminator can achieve a lower classification error on U. Without loss of generality, we use the H∆H divergence (Kifer et al., 2004), d_{H∆H}(L, U), to estimate the distribution difference; it measures the maximum difference of the probabilities of inconsistent predictions. Denote ε_d, ε_U and ε_L to be the classification error of the discriminator and the errors of the multi-class classification model on the unlabeled and labeled subtomogram samples, respectively. We argue that ε_U is bounded by a term related to ε_L and ε_d in Theorem 1.

Theorem 1. Assume the complexity of the discriminator is more than a XOR function. Given a multi-class candidate classifier f(·), the classification error on the unlabeled dataset is bounded by: ε_U(f) ≤ ε_L(f) + ε_d + C, where C is an uncorrelated constant.

Proof sketch. Following the proof and assumptions made in (Ben-David et al., 2010) and substituting the source and target domains with L and U, we get

ε_U(f) ≤ ε_L(f) + (1/2) d_{H∆H}(L, U) + C.   (2)

Then, following the derivation in (Ganin and Lempitsky, 2015), we replace d_{H∆H}(L, U) with its upper bound

2 sup_{η ∈ H_d} | Pr_{z∼L}[η(z) = 1] + Pr_{z∼U}[η(z) = 0] − 1 |,   (3)

which can be seen as 2ε_d. Here H_d is the function space of the discriminator. After substitution, the theorem is proven. The assumption is easily satisfied, since the discriminator is implemented by a neural network, which is complex enough according to the Universal Approximation Theorem (Barron, 1993; Funahashi, 1989). Given such a guarantee, if ε_L(f) and ε_d are minimized, the classification error on the unlabeled dataset is bounded.
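To make the final identification with 2ε_d explicit, note that under the convention D(·) = 0 for labeled and 1 for unlabeled, the two probabilities appearing in the bound are exactly the discriminator's error rates on the two pools (a restatement in the theorem's notation, not an additional result):

```latex
\epsilon_U(f) \;\le\; \epsilon_L(f)
  + \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{L},\mathcal{U}) + C
\;\le\; \epsilon_L(f)
  + \sup_{\eta \in \mathcal{H}_d}
    \Big|\,
      \underbrace{\Pr_{z \sim \mathcal{L}}\!\big[\eta(z) = 1\big]}_{\text{error on } \mathcal{L}}
      + \underbrace{\Pr_{z \sim \mathcal{U}}\!\big[\eta(z) = 0\big]}_{\text{error on } \mathcal{U}}
      - 1
    \Big| + C .
```

Evaluating the supremum term at the trained discriminator and writing its total error as ε_d recovers the statement of Theorem 1, so the divergence term is controlled by the discriminator's empirical error, as the proof sketch states.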
[Fig. 3 layer configuration. Classification model: subtomogram input; Conv1-2 (32 filters, 3×3×3, stride 1) + ReLU; max-pooling (2-2); Conv3-4 (64 filters, 3×3×3, stride 1) + ReLU; max-pooling (2-2); Conv5-6 (128 filters, 3×3×3, stride 1) + ReLU; max-pooling (2-2); FC (1024) + ReLU + Dropout (0.7); FC (1024) + ReLU + Dropout (0.7); final FC + Softmax producing the classification result. Convolutional discriminator (image query module): Conv7 (128 filters, 3×3×3, stride 1) + ReLU; max-pooling (2-1); Conv8 (64 filters, 3×3×3, stride 1) + ReLU; max-pooling (2-1); FC (256) + ReLU + Dropout (0.5); FC (128) + ReLU; FC (2) + Softmax producing the domain profile.]
Figure 3: The model architecture of our subtomogram classifier with detailed layer configuration.

In addition to the introduced discriminator for improving sample representativeness, we argue that useful label information (i.e. inductive bias) is missing from the current query strategy, although it has been shown to be effective in the literature (Wang et al., 2017; Gal et al., 2017; Beluch et al., 2018). Thus, we propose a hybrid query method that selects discriminative subtomogram samples with uncertainty sampling (Yang and Loog, 2018). The intuition is as follows: the representativeness principle assumes the unlabeled pool is large enough to represent the true distribution. However, data from the sparse regions of the distribution will be sampled, because the unlabeled set gradually becomes unrepresentative as its size decreases. Conversely, uncertainty sampling can keep a balance between labeled and unlabeled subtomograms in the representation space by selecting subtomograms according to data density, so that the classification will not easily be biased by the sparse region of the manifold (Yang and Loog, 2018). On the other hand, uncertainty sampling is designed to sample the most uncertain instances, which are closest to the decision boundary. Since the number of labeled subtomograms in the initial stage is limited, the estimated decision boundary is far from the actual one. Therefore, it may select noisy instances and get stuck at sub-optimal solutions due to a lack of exploration.
In contrast, the semi-supervised setting in the discriminator-based query strategy avoids this drawback by observing the entire dataset. Therefore, during training, the representativeness principle realized by the discriminator-based query and the discriminativeness principle realized by uncertainty sampling assist each other and further enhance the stability of the query and the classification performance. Specifically, we use the entropy of the predictions to measure uncertainty, which is formulated as:

E(t) = argmax_{x ∈ UD(t)} [ − Σ_{y ∈ C} P(y|x) log P(y|x) ],   (4)

where C denotes the class space and P(y|x) denotes the conditional probability of y given x under the multi-class classifier. We evaluate the quality of the model prediction P(y|x) by comparing the class with the highest probability against the ground-truth labels given by domain experts. We implement this evaluation in the neural network using the softmax loss function, which is common practice in image classification. To trade off between the two principles, at each iteration we design a ranking score as the final selection criterion:

S(t) = Pr(D(M(x)) = 1 | M(x)) + λ E(t),   (5)

where λ is the weighting hyperparameter for balancing the different scores; the other notations have the same meaning as in Eqns. 1 and 4. While other score fusion methods exist, we argue our implementation is simple and effective enough for the application purposes.

We evaluate our method on two simulated and three real cryo-ET datasets. For the simulated datasets, we utilize the PDB2VOL program (Wriggers et al., 1999) to generate 23 classes of subtomograms, with the same class space as (Xu et al., 2017), at two Signal-to-Noise Ratio (SNR) levels: 0.03 (S1) and 0.05 (S2). These datasets are realistically simulated by approximating the true cryo-ET image reconstruction process over a limited tilt-angle range, including the Contrast Transfer Function and Modulation Transfer Function. Each class contains 1,000 subtomograms of size 40^3 voxels. These simulated datasets are used in our 23-class classification tasks. For the real datasets, we use a set of rat neuron tomograms from (Guo et al., 2018) (R1). For one tomogram, we manually select 1,800 subtomogram samples containing particles of 28^3 voxels from 5,424 subtomograms extracted by Difference of Gaussian (DoG) (Long et al., 2016) (R1a). On this set we evaluate the particle picking task of determining whether or not a sample contains a particle, formulated as a binary classification task for the multi-class classification model. We also extract 2,394 subtomograms of size 40^3 from the same tomogram set; these contain 6 classes detected and classified by template matching (R1b) (Guo et al., 2018), on which we evaluate a 6-class classification task. In addition, we process a 7-class dataset (Noble et al., 2017) (R2) from EMPIAR (Iudin et al., 2016), where each class contains 400 subtomograms of size 28^3; a 7-class classification task is evaluated on R2. Following common practice in the active learning literature, no data augmentation techniques are used during training; the effect of data augmentation remains to be further explored. The 2D x-z center slices of the 3D images and the iso-surfaces of the simulated datasets are shown in Fig. 4.

In this section, we report the subtomogram classification results for both the simulated and real data. We start with 3% of the entire dataset as the labeled subtomogram samples for the simulated datasets S1 and S2, and
4% for the real datasets R1a, R1b and R2, respectively. The effect of the number of initially labeled subtomogram samples is shown in the next section. The query batch size is empirically fixed to 800 for the simulated datasets and 32 for the real datasets, following the common active learning setting (Tran et al., 2019). In terms of subset sampling, we report the model performance with the number of subsets set to 20 for the simulated datasets and 4 for the real datasets R1a, R1b and R2; the effect of the number of subsets and the subset size is deferred to the next section. We report the classification results after 7, 5, 6, 5 and 8 query iterations on datasets S1, S2, R1a, R1b and R2, respectively, since we empirically found that more iterations do not bring significant further improvement to HAL. We run all the baselines under the same setting and report all metrics as the average of 10 runs with random seeds from 1 to 10.

Figure 4: Examples of used subtomograms. For the simulated datasets, 5 out of 23 classes of simulated subtomograms are shown for simplicity (PDB IDs 4V4Q, 2GHO, 1KP8, 1FNT, 1BXR), plotted as iso-surfaces (bottom row) and center-sliced density maps parallel to the x-z plane (the first two rows) at the two SNR levels, with the PDB ID below each image. For the real datasets, we visualize the center-sliced density maps parallel to the x-z plane with the name of each macromolecule class below each image (R1: double capped proteasome, mitochondrial membrane, none, ribosome, single capped proteasome, TRiC; R2: glutamate dehydrogenase, rabbit muscle aldolase, DNAB helicase-helicase, T20S proteasome, apoferritin, hemagglutinin, insulin receptor).

In Tab. 1, we first compare with supervised training on the entire labeled dataset. On dataset S2, we use 16.91% of the labeled training data to achieve 93.86% test accuracy, compared to 95.36% for fully supervised training. On dataset R1a, we use 11.89% of the training data to achieve 85.48% test accuracy, compared to 87.24% for fully supervised training. Moreover, we compare with 8 representative active learning baselines, including methods using a single query principle, namely Random Query (Woo and Park, 2012), Uncertainty Query (Joshi et al., 2009), CoreSet Query (Sener and Savarese, 2018), Bayesian Query (Gal et al., 2017) and Bayesian Generative Active Learning (BGAL) (Tran et al., 2019), and hybrid query heuristics, namely exploration-exploitation BMAL (EE-BMAL) (Yin et al., 2017), VAAL (Sinha et al., 2019) and BADGE (Ash et al., 2020). As shown in Tab. 1, our method achieves superior performance on all 5 datasets under the same labeling budget. Among the single-principle methods, BGAL performs best, especially on dataset R1a, where it achieves a 82.18% final accuracy. Surprisingly, even with a theoretical guarantee, core-set sampling performs the worst among the baselines, possibly because the complex data distribution makes it harder to cover the entire dataset with the constructed core-sets. Moreover, the baselines that adopt a hybrid query strategy usually perform better because of the mutual benefits of the different criteria. However, they still underperform our HAL, which explicitly trades off the representativeness and the discriminativeness principles.
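For concreteness, the entropy score of Eq. 4 and the fused ranking of Eq. 5 used throughout these comparisons can be computed directly from the classifier's softmax outputs. A minimal sketch; the class probabilities and discriminator outputs below are illustrative values, not taken from the experiments:

```python
import numpy as np

def entropy_scores(P):
    # Shannon entropy of each row of a class-probability matrix (Eq. 4).
    return -(P * np.log(P + 1e-12)).sum(axis=1)

# Two hypothetical subtomogram predictions over 4 classes.
P = np.array([
    [0.97, 0.01, 0.01, 0.01],   # confident prediction -> low entropy
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain  -> entropy log(4)
])
uncertainty = entropy_scores(P)

# Fuse with hypothetical discriminator outputs Pr(unlabeled | x) (Eq. 5).
lam = 1.0
p_unlab = np.array([0.2, 0.8])
fused = p_unlab + lam * uncertainty
```

The second sample dominates both terms of the fused score, so it would be ranked first for annotation, matching the behavior the hybrid criterion is designed to produce.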
To validate the effect of our task-specific designs and the hybrid query strategy, we conducted a controlled experiment that removes the convolutional layers in the discriminator (Variant 1), the subset sampling (Variant 2), and the uncertainty sampling strategy (Variant 3), as shown in Tab. 3. Note that we did not remove the representativeness principle because that degenerates the model to the Uncertainty Query baseline. The training setting is the same as in the previous section. According to Tab. 3, removing any of the three parts leads to a performance drop. For example, removing the subset sampling destabilizes the training, which decreases the accuracy from 74.80% to 63.50% on R1b.

Method/Dataset                              S1      S2      R1a     R1b     R2
Supervised Training                         83.–    –       –       –       –
HAL                                         –       –       –       –       –
Random Query (Woo and Park, 2012)           74.–    –       –       –       –
Uncertainty Query (Joshi et al., 2009)      77.–    –       –       –       –
Bayesian Query (Gal et al., 2017)           73.–    –       –       –       –
CoreSet Query (Sener and Savarese, 2018)    63.–    –       –       –       –
BGAL (Tran et al., 2019)                    78.–    –       –       –       –
VAAL (Sinha et al., 2019)                   75.–    –       –       –       –
EE-BMAL (Yin et al., 2017)                  79.–    –       –       –       –
BADGE (Ash et al., 2020)                    79.–    –       –       –       –
Labeled Percentage                          23.87%  16.91%  11.89%  9.35%   12.–%

Table 1: Comparison of HAL and the baseline AL methods on five different datasets (results are the classification accuracy in %; – marks digits lost in extraction). The same labeling budget is used across methods. The standard deviation is reported at the top right corner.
        S1                     S2                     R1a                    R1b                    R2
Config  Acc    T       Config  Acc    T       Config  Acc    T       Config  Acc    T       Config  Acc    T
400/2   75.96  1.9 h   400/2   83.62  1.6 h   32/1    80.25  0.2 h   32/1    63.50  0.2 h   32/1    90.86  0.4 h
80/10   77.96  3.8 h   80/10   85.20  2.7 h   16/2    81.23  0.4 h   16/2    71.74  0.4 h   16/2    90.69  0.4 h
40/20   –      –       –       –      –       –       –      –       –       –      –       –       –      –

Table 2: Comparative results of different subset configurations on HAL (in %; – marks entries lost in extraction). T refers to the overall training time; h denotes hours. Config is given as Subset Size/Number of Subsets.
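The Subset Size/Number of Subsets configurations compared above control how the unlabeled pool is partitioned before each query round. A hedged sketch of one such subset-sampling scheme, assuming random disjoint subsets with per-subset top-k selection (the exact HAL procedure may differ in detail):

```python
import numpy as np

def subset_query(scores, subset_size, n_subsets, per_subset=1, rng=None):
    """Select a diverse query batch by scoring within random subsets.

    Randomly partitions the unlabeled pool into `n_subsets` disjoint
    subsets of `subset_size` samples, then takes the `per_subset`
    top-scoring samples from each subset, so no single dense region of
    the pool can dominate the whole query batch.
    """
    rng = np.random.default_rng(rng)
    # Disjoint random subsets drawn from the pool.
    perm = rng.permutation(len(scores))[: subset_size * n_subsets]
    queried = []
    for subset in perm.reshape(n_subsets, subset_size):
        # Highest-scoring samples within this subset only.
        top = subset[np.argsort(-scores[subset])[:per_subset]]
        queried.extend(top.tolist())
    return queried
```

Smaller subsets mean more discriminator iterations per round, which matches the training-time growth reported in Tab. 2 for the finer configurations.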
Model/Dataset    S1     S2     R1a    R1b    R2
Variant 1 (V1)   78.45  91.87  83.65  72.27  94.–
Variant 2 (V2)   –.77   91.92  80.25  63.50  90.–
Variant 3 (V3)   –.32   88.63  79.99  70.40  91.–
HAL              –.23   93.96  85.48  74.80  95.–

Table 3: Ablation study results (in %; – marks digits lost in extraction) on the convolutional discriminator, the subset sampling, and the uncertainty sampling.
Ratio/Dataset   S1     S2     R1a    R1b    R2
0.01            77.24  89.99  72.13  43.61  88.–
0.02            –.01   93.36  81.62  47.77  93.–
0.03            –.23   93.96  85.–   –.95   94.–
0.04            –.96   94.01  84.–   –.80   95.–
0.05            –.11   93.99  86.05  74.95  95.–

Table 4: Comparative results (in %; – marks digits lost in extraction) on the number of initially labeled subtomograms for the five datasets.
To observe the effect of the number of initially labeled subtomogram samples, we test HAL under different ratios, from 1% to 5% at intervals of 1%, on the five datasets while keeping the other training configurations unchanged. The comparative results are shown in Tab. 4. As shown in Tab. 4, the number of initially labeled subtomogram samples has a considerable effect on the final classification accuracy. If this ratio is too small, the model is trained towards a sub-optimal direction, leading to much lower accuracy even though more subtomogram samples are labeled in the later stages. However, initially labeled subtomogram samples are practically expensive to obtain. Thus, observing that the final accuracy does not increase much as we keep increasing the number of initially labeled subtomogram samples, we empirically fix the ratio to 3% for datasets R1a, S1, and S2, and to 4% for datasets R1b and R2.

The effect of subset configuration

During subset sampling, it is important to determine the optimal combination of the number of subsets m and the subset size Km in order to balance the training time and the diversity of the subsets. Specifically, we test 6 different subset configurations on the five datasets; the results are summarized in Tab. 2. As can be observed, a balanced configuration of subset size and subset number helps achieve better performance. Meanwhile, the time complexity increases dramatically when the subset size is smaller, since many more iterations are trained on the discriminator. Therefore, to balance time and accuracy, a moderate subset size is preferable. For a comprehensive discussion, we plot the comparative query process in Fig. 5, showing the accuracy versus the number of labeled subtomogram samples during training. We can see that both the stability and the final classification accuracy of HAL are better, with no accuracy decrease or stagnation along the sampling procedure.

The effect of λ

HAL relies on the representative score predicted by a discriminator and the discriminative score calculated from the entropy of the classification to select examples for labeling. To determine a relatively optimal balance between the entropy and the representative scores, we test five different values of λ in Eqn. 5 and report the classification accuracy under the same labeling budget. From Tab. 5, we find that λ = 1 obtains consistently better performance on datasets S1 and R1b. Increasing or decreasing λ degenerates HAL towards a model trained with only one score, either the representative score or the discriminative score. This set of experiments illustrates the importance of selecting a proper λ in order to guarantee good test performance.

λ     Acc. on S1   Acc. on R1b
0.1   79.00        83.26
0.5   81.11        84.23
1     –            –

Table 5: The effect of λ (– marks entries lost in extraction). Classification accuracy on datasets S1 and R1b is reported.

Implementation details

Firstly, we normalize the data in every dataset.
For classification on both simulated and real data, we randomly split the entire dataset into training and test sets at a ratio of 3:1. We set the learning rate and batch size to 0.001 and 128 for the multi-class subtomogram classifier, and to 0.01 and 256 for the discriminator, without further fine-grained tuning. We train the discriminator with early stopping once its accuracy reaches 98% in order to prevent overfitting. We set the λ in Eqn. 5 to 1 because both scores have the same value range. Code is available at https://github.com/xulabs/aitom.

Figure 5: Comparative querying process with baselines (a) and ablations (b). The shaded area denotes the standard deviation. For simplicity, three of the five datasets are shown.

Conclusion

Computational analysis, and deep learning approaches in particular, has played an increasingly important role in obtaining molecular machinery insights from cryo-ET data. However, the heavy labeling work behind data-driven methods presents an obstacle for biologists to adopt them as assistant approaches. In this paper, we present a novel active learning tool for the cryo-ET domain designed for limited labeling resources, which approaches the active learning objective by querying both representative and discriminative subtomogram samples. Our experimental results on both simulated and real data demonstrate that it produces significantly improved test performance compared to baselines under the same labeling budget. Our method represents an important step towards fully utilizing deep learning for in situ recognition of macromolecules inside single cells captured by cryo-ET. It can potentially also be very useful for other biomedical research with limited labeling resources.
Funding
This work was supported in part by U.S. National Institutes of Health (NIH) grants P41GM103712, R01GM134020, and K01MH123896, U.S. National Science Foundation (NSF) grants DBI-1949629 and IIS-2007595, Mark Foundation For Cancer Research grant 19-044-ASP, and the AMD COVID-19 HPC Fund. XZ was supported by a fellowship from Carnegie Mellon University's Center for Machine Learning and Health.
References
Aodha, O. M. et al. (2015). Hierarchical subquery evaluation for active learning on a graph. arXiv preprint arXiv:1504.08219.
Ash, J. T. et al. (2020). Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations.
Azimi, J. et al. (2012). Batch active learning via coordinated matching. In International Conference on Machine Learning.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, (3), 930–945.
Beluch, W. H. et al. (2018). The power of ensembles for active learning in image classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9368–9377.
Ben-David, S. et al. (2010). A theory of learning from different domains. Machine Learning, (1-2), 151–175.
Che, C. et al. (2018). Improved deep learning-based macromolecules structure classification from electron cryo-tomograms. Machine Vision and Applications, (8), 1227–1236.
Chen, M. et al. (2017). Convolutional neural networks for automated annotation of cellular cryo-electron tomograms. Nature Methods, (10), 983.
Dasgupta, S. (2011). Two faces of active learning. Theoretical Computer Science, (19), 1767–1781.
Du, X. et al. (2019). Open-set recognition of unseen macromolecules in cellular electron cryo-tomograms by soft large margin centralized cosine loss. In British Machine Vision Conference, page 148.
Elhamifar, E. et al. (2013). A convex optimization framework for active learning. In International Conference on Computer Vision, pages 209–216.
Freytag, A. et al. (2014). Selecting influential examples: Active learning with expected model output changes. In European Conference on Computer Vision, pages 562–577.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, (3), 183–192.
Gal, Y. et al. (2017). Deep bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192.
Ganin, Y. and Lempitsky, V. S. (2015). Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pages 1180–1189.
Gilad-Bachrach, R. et al. (2005). Query by committee made real. In Advances in Neural Information Processing Systems, pages 443–450.
Gissin, D. and Shalev-Shwartz, S. (2019). Discriminative active learning. arXiv preprint arXiv:1907.06347.
Guo, Q. et al. (2018). In situ structure of neuronal c9orf72 poly-ga aggregates reveals proteasome recruitment. Cell, (4), 696.
Guo, Y. (2010). Active instance sampling via matrix partition. In Advances in Neural Information Processing Systems, pages 802–810.
Hasan, M. and Roy-Chowdhury, A. K. (2015). Context aware active learning of activity recognition models. In International Conference on Computer Vision, pages 4543–4551.
Huang, J. et al. (2016). Active learning for speech recognition: the power of gradients. arXiv preprint arXiv:1612.03226.
Iudin, A. et al. (2016). Empiar: a public archive for raw electron microscopy image data. Nature Methods, page 387.
Joshi, A. J. et al. (2009). Multi-class active learning for image classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2372–2379.
Kifer, D. et al. (2004). Detecting change in data streams. In International Conference on Very Large Data Bases, pages 180–191.
Kim, K. et al. (2020). Task-aware variational adversarial active learning. arXiv preprint arXiv:2002.04709.
Kuo, W. et al. (2018). Cost-sensitive active learning for intracranial hemorrhage detection. In Medical Image Computing and Computer Assisted Interventions, pages 715–723.
Lee, S.-K. and Kim, J.-H. (2019). Bald-vae: Generative active learning based on the uncertainties of both labeled and unlabeled data. In International Conference on Robot Intelligence Technology and Applications, pages 6–11.
Lewis, D. D. (1995). A sequential algorithm for training text classifiers: Corrigendum and additional data. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 13–19.
Lewis, D. D. and Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In International Conference on Machine Learning, pages 148–156.
Li, X. and Guo, Y. (2013). Adaptive active learning for image classification. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 859–866.
Li, X. and Guo, Y. (2014). Multi-level adaptive active learning for scene classification. In European Conference on Computer Vision, pages 234–249.
Lin, L. et al. (2018). Active self-paced learning for cost-effective and progressive face identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1), 7–19.
Liu, C. et al. (2018a). Deep learning based supervised semantic segmentation of electron cryo-subtomograms. In International Conference on Image Processing, pages 1578–1582.
Liu, C. et al. (2018b). Multi-task learning for macromolecule classification, segmentation and coarse structural recovery in cryo-tomography. In British Machine Vision Conference, page 271.
Liu, S. et al. (2019). Semi-supervised macromolecule structural classification in cellular electron cryo-tomograms using 3d autoencoding classifier. In British Machine Vision Conference, page 30.
Long, P. et al. (2016). Simulating cryo electron tomograms of crowded cell cytoplasm for assessment of automated particle picking. BMC Bioinformatics, (1), 405.
Luo, W. et al. (2013). Latent structured active learning. In Advances in Neural Information Processing Systems, pages 728–736.
McCallum, A. and Nigam, K. (1998). Employing EM and pool-based active learning for text classification. In International Conference on Machine Learning, pages 350–358.
Nguyen, H. T. and Smeulders, A. W. M. (2004). Active learning using pre-clustering. In International Conference on Machine Learning.
Noble, A. J. et al. (2017). Routine single particle cryoem sample and grid characterization by tomography. bioRxiv.
Roth, D. and Small, K. (2006). Margin-based active learning for structured output spaces. In European Conference on Machine Learning, pages 413–424.
Roy, N. and McCallum, A. (2001). Toward optimal active learning through sampling estimation of error reduction. In International Conference on Machine Learning, pages 441–448.
Sener, O. and Savarese, S. (2018). Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations.
Settles, B. and Craven, M. (2008). An analysis of active learning strategies for sequence labeling tasks. In Conference on Empirical Methods in Natural Language Processing, pages 1070–1079.
Settles, B. et al. (2007). Multiple-instance active learning. In Advances in Neural Information Processing Systems, pages 1289–1296.
Seung, H. S. et al. (1992). Query by committee. In Annual Conference on Learning Theory, pages 287–294.
Shui, C. et al. (2020). Deep active learning: Unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics, pages 1308–1318.
Sinha, S. et al. (2019). Variational adversarial active learning. In International Conference on Computer Vision, pages 5971–5980.
Smailagic, A. et al. (2018). Medal: Accurate and robust deep active learning for medical image analysis. In International Conference on Machine Learning and Applications, pages 481–488.
Tang, Y. and Huang, S. (2019). Self-paced active learning: Query the right thing at the right time. In AAAI Conference on Artificial Intelligence, pages 5117–5124.
Tong, S. and Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 45–66.
Tran, T. et al. (2019). Bayesian generative active deep learning. In International Conference on Machine Learning, pages 6295–6304.
Vijayanarasimhan, S. and Grauman, K. (2011). Large-scale live active learning: Training object detectors with crawled data and crowds. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1449–1456.
Wang, H. et al. (2018). Uncertainty sampling for action recognition via maximizing expected average precision. In International Joint Conferences on Artificial Intelligence, pages 964–970.
Wang, K. et al. (2017). Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, (12), 2591–2600.
Wang, Z. and Ye, J. (2013). Querying discriminative and representative samples for batch mode active learning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–166.
Woo, H. and Park, C. H. (2012). An efficient active learning method based on random sampling and backward deletion. In International Conference on Intelligence Science and Big Data Engineering, pages 683–691.
Wriggers, W. et al. (1999). Situs: A package for docking crystal structures into low-resolution maps from electron microscopy. Journal of Structural Biology, (2), 185–195.
Xu, M. et al. (2017). Deep learning-based subdivision approach for large scale macromolecules structure recovery from electron cryo tomograms. Bioinformatics, (14), i13–i22.
Xu, Z. et al. (2010). Fast active exploration for link-based preference learning using gaussian processes. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pages 499–514.
Yang, L. et al. (2017). Suggestive annotation: A deep active learning framework for biomedical image segmentation. In Medical Image Computing and Computer Assisted Interventions, pages 399–407.
Yang, Y. and Loog, M. (2018). A benchmark and comparison of active learning for logistic regression. Pattern Recognition, 401–415.
Yang, Y. et al. (2015). Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision, (2), 113–127.
Yin, C. et al. (2017). Deep similarity-based batch mode active learning with exploration-exploitation. In International Conference on Data Mining, pages 575–584.
Yu, L. et al. (2020). Few shot domain adaptation for in situ macromolecule structural classification in cryo-electron tomograms. arXiv preprint arXiv:2007.15422.
Zhou, Z. et al. (2017). Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4761–4772.
Zhu, J. and Bento, J. (2017). Generative adversarial active learning. arXiv preprint arXiv:1702.07956.