Learning active learning at the crossroads? evaluation and discussion
L. Desreumaux (SAP Labs, Paris, France) and V. Lemaire (Orange Labs, Lannion, France)
Abstract.
Active learning aims to reduce annotation cost by predicting which samples are useful for a human expert to label. Although this field is quite old, several important challenges to using active learning in real-world settings still remain unsolved. In particular, most selection strategies are hand-designed, and it has become clear that there is no best active learning strategy that consistently outperforms all others in all applications. This has motivated research into meta-learning algorithms for "learning how to actively learn". In this paper, we compare this kind of approach with the association of a Random Forest with the margin sampling strategy, reported in recent comparative studies as a very competitive heuristic. To this end, we present the results of a benchmark performed on 20 datasets that compares a strategy learned using a recent meta-learning algorithm with margin sampling. We also present some lessons learned and open future perspectives.
Modern supervised learning methods are known to require large amounts of training examples to reach their full potential. Since these examples are mainly obtained through human experts who manually label samples, the labelling process may have a high cost. Active learning (AL) is a field that covers all the selection strategies that iteratively build the training set of a model in interaction with a human expert, also called the oracle. The aim is to select the most informative examples so as to minimize the labelling cost.

In this article, we consider the selective sampling framework, in which the strategies manipulate a set of examples $\mathcal{D} = \mathcal{L} \cup \mathcal{U}$ of constant size, where $\mathcal{L} = \{(x_i, y_i)\}_{i=1}^{l}$ is the set of labelled examples and $\mathcal{U} = \{x_i\}_{i=l+1}^{n}$ is the set of unlabelled examples. In this framework, active learning is an iterative process that continues until a labelling budget is exhausted or a pre-defined performance threshold is reached. Each iteration begins with the selection of the most informative example $x^\star \in \mathcal{U}$. This selection is generally based on information collected during previous iterations (predictions of a classifier, density measures, etc.). The example $x^\star$ is then submitted to the oracle, which returns the corresponding class $y^\star$, and the pair $(x^\star, y^\star)$ is added to $\mathcal{L}$. The new learning set is then used to improve the model, and the new predictions are used in the next iteration. In this article, we limit ourselves to binary classification problems.

The utility measures defined by the active learning strategies in the literature [36] differ in their positioning with respect to a dilemma between the exploitation of the current classifier and the exploration of the training data. Selecting an unlabelled example in an unknown region of the observation space $\mathbb{R}^d$ helps to explore the data, so as to limit the risk of learning a hypothesis too specific to the current set $\mathcal{L}$. Conversely, selecting an example in an already sampled region of $\mathbb{R}^d$ locally refines the predictive model.

The active learning field comes from a parallel between active educational methods and machine learning theory. The learner is now a statistical model rather than a student. The interactions between the student and the teacher correspond to the interactions between the model and the oracle. The examples are situations used by the model to generate knowledge about the problem. The first AL algorithms were designed with the objective of transposing these "educational" methods to the machine learning domain. The easiest way was to keep the usual supervised learning methods and to add "strategies" relying on various heuristics to guide the selection of the most informative examples. From the first initiative up to now, a lot of strategies motivated by human intuitions have been suggested in the literature. The purpose of this paper is not to give an overview of the existing strategies, but the reader may find a lot of them in [36,1].

A careful reading of the experimental results published in the literature shows that there is no best AL strategy that consistently outperforms all others in all applications, and some strategies cater to specific classifiers or to specific applications. Based on this observation, several comprehensive benchmarks carried out on numerous datasets have highlighted the strategies which, on average, are the most suitable for several classification models [28,41,29].
They are given in Table 1. For example, the most appropriate strategy for logistic regression and random forest is an uncertainty-based sampling strategy, named margin sampling, which consists in selecting at each iteration the instance for which the difference between the probabilities of the two most likely classes is the smallest [34]. To produce this table, we purposefully omitted studies that have a restricted scope, such as focusing on too few datasets [4], specific tasks [37], an insufficient number of strategies [35,31], or variants of a single strategy [21]. The reader interested in the measures used to quantify the degree of uncertainty in the context of active learning may find in [25,18] an interesting view which advocates a distinction between two different types of uncertainty, referred to as epistemic and aleatoric.
Strategy       Reference(s)
Margin (a)     [29]
Entropy (b)    [41]
QBD (c)        [28], [28]
Density (d)    [29,28], [28]
OER (e)        [29], [29], [29]

Table 1.
Best model/strategy associations highlighted in the literature, as a guide to choosing the appropriate strategy for a given classifier. Strategies: (a) Margin sampling, (b) Entropy sampling, (c) Query by Disagreement, (d) Density sampling, (e) Optimistic Error Reduction. Classifiers considered: Random Forest (RF), Support Vector Machine (SVM), 5-Nearest Neighbors (5NN), Gaussian Naive Bayes (GNB), C4.5 Decision Tree, Logistic Regression (LR), Very Fast Decision Tree (VFDT).
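The selective sampling loop described above, combined with any of the strategies of Table 1, fits in a few lines of code. The sketch below is a minimal pool-based loop, assuming scikit-learn-style classifiers; the `oracle` callable (which returns the label of a queried example) and the `select` callable (which implements a utility-based strategy) are hypothetical placeholders for the human expert and the chosen strategy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labelled, y_labelled, X_unlabelled, oracle, select, budget=250):
    """Generic selective sampling loop: train, select x*, query the oracle, update L."""
    clf = LogisticRegression(max_iter=1000)
    for _ in range(budget):
        clf.fit(X_labelled, y_labelled)
        # The strategy returns the index of the most informative unlabelled example.
        i_star = select(clf, X_unlabelled)
        x_star = X_unlabelled[i_star]
        y_star = oracle(x_star)                      # the human expert provides the class
        X_labelled = np.vstack([X_labelled, x_star])
        y_labelled = np.append(y_labelled, y_star)
        X_unlabelled = np.delete(X_unlabelled, i_star, axis=0)
    return clf
```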
While the traditional AL strategies can achieve remarkable performance, it is often challenging to predict in advance which strategy is the most suitable in a particular situation. In recent years, meta-learning algorithms have been gaining in popularity [23]. Some of them have been proposed to tackle the problem of learning AL strategies instead of relying on manually designed strategies. Motivated by the success of methods that combine predictors, the first AL algorithms within this paradigm were designed to combine traditional AL strategies with bandit algorithms [3,12,17,8,10,26]. These algorithms learn how to select the best AL criterion for any given dataset and adapt it over time as the learner improves. However, all the learning must be achieved within a few examples to be helpful, and these algorithms suffer from a cold start issue. Moreover, these approaches are restricted to combining existing AL heuristic strategies. Within the meta-learning framework, some other algorithms have been developed to learn an AL strategy from scratch on multiple source datasets and transfer it to new target datasets [19,20,27]. Most of them are based on modern reinforcement learning methods. The key challenge consists in learning an AL strategy that is general enough to automatically control the exploitation/exploration trade-off when used on new unlabelled datasets, which is not possible when using heuristic strategies.
From the state of the art, it appears that meta-learned AL strategies can outperform the most widely used traditional AL strategies, like uncertainty sampling. However, most of the papers that introduce new meta-learning algorithms do not include comprehensive benchmarks that could ascertain the transferability of the learned strategies and demonstrate that these strategies can safely be used in real-world settings.
The objective of this article is thus to compare two possible options in the realization of an AL solution that could be used in an industrial context: using a traditional heuristic-based strategy (see Section 1.1) that, on average, is the best one for a given model and could be used as a strong baseline, easy to implement and not so easy to beat, or using a more sophisticated strategy learned in a data-driven fashion that comes from the very recent literature on meta-learning (see Section 1.2). To this end, we present the results of a benchmark performed on 20 datasets that compares a strategy learned using the meta-learning algorithm proposed in [20] with margin sampling [34], the models used being in both cases logistic regression and random forest. We evaluated the work of [20] since the authors claim to be able to learn a "general-purpose" AL strategy that can generalise across diverse problems and outperform the best heuristic and bandit approaches.

The rest of the paper is organized as follows. In Section 2, we explain all the aspects of the Learning Active Learning (LAL) method proposed in [20], namely the Deep Q-Learning algorithm and the modeling of active learning as a Markov decision process (MDP). In Section 3, we present the protocol used to carry out extensive comparative experiments on public datasets from various application areas. In Section 4, we give the results of our experimental study and make some observations. Finally, we present some lessons learned and open future perspectives in Section 5.
A Markov decision process is a formalism for modeling the interaction between an agent and its environment. This formalism uses the concepts of state, which describes the situation in which the environment finds itself, action, which describes the decision made by the agent, and reward, received by the agent when it performs an action. The procedure followed by the agent to select the action to be performed at time t is the policy. Given a policy π, the state-action table is the function $Q^{\pi}(s, a)$ which gives the expectation of the weighted sum of the rewards received from the state s if the agent first executes the action a and then follows the policy π.

Q-Learning is a reinforcement learning algorithm that estimates the optimal state-action table $Q^{\star} = \max_{\pi} Q^{\pi}$ from interactions between the agent and the environment. The state-action table Q is updated at any time from the current state s, the action a = π(s) where π is the policy derived from Q, the reward received r and the next state of the environment s':

$$ Q_{t+1}(s, a) = \big(1 - \alpha_t(s, a)\big)\, Q_t(s, a) + \alpha_t(s, a) \Big( r + \gamma \max_{a' \in \mathcal{A}} Q_t(s', a') \Big), \qquad (1) $$

where $\gamma \in [0, 1[$ is the weighting factor of the rewards and the $\alpha_t(s, a) \in\, ]0, 1[$ are the learning steps that determine the weight of the new experience in relation to the knowledge acquired at previous steps.
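As an illustration of update (1), a minimal tabular implementation could look as follows (a sketch only, assuming small discrete state and action spaces, which is not the case in the LAL setting described later):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One application of Equation (1) on a dictionary-based Q table."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# Q maps (state, action) pairs to values, initialised at 0.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=-1.0, s_next=2, actions=[0, 1])
```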
Assuming that all the state-action pairs are visited an infinite number of times and under some conditions on the learning steps, the resulting sequence of state-action tables converges to $Q^{\star}$ [40].

The goal of a reinforcement learning agent is to maximize the rewards received over the long term. To do this, in addition to actions that seem to lead to high rewards (exploitation), the agent must select potentially suboptimal actions that allow it to acquire new knowledge about the environment (exploration). For Q-Learning, the ε-greedy method is the most commonly used to manage this dilemma. It consists in exploring randomly with probability ε and acting according to a greedy strategy that chooses the best action with probability 1 − ε. It is also possible to decrease the probability ε at each transition to model the fact that exploration becomes less and less useful with time.

In the Q-Learning algorithm, if the state-action table is implemented as a two-input table, then it is impossible to deal with high-dimensional problems. It is necessary to use a parametric model that will be noted $Q(s, a; \theta)$. If it is a deep neural network, the method is called Deep Q-Learning.

The training of a neural network requires the prior definition of an error criterion to quantify the loss between the value returned by the network and the actual value. In the context of Q-Learning, the latter value does not exist: one can only use the reward obtained after the completion of an action to calculate a new value, and then estimate the error achieved by calculating the difference between the old value and the new one. A possible cost function would thus be the following:

$$ \mathcal{L}(s, a, r, s', \theta) = \Big( r + \gamma \max_{a' \in \mathcal{A}} Q(s', a'; \theta) - Q(s, a; \theta) \Big)^2. \qquad (2) $$

However, this poses an obvious problem: updating the parameters leads to updating the target. In practice, this means that the training procedure does not converge.

In 2013, a successful implementation of Deep Q-Learning introducing several new features was published [24]. The first novelty is the introduction of a target network, which is a copy of the first network that is regularly updated. This has the effect of stabilizing learning. The cost function becomes:

$$ \mathcal{L}(s, a, r, s', \theta, \theta^{-}) = \Big( r + \gamma \max_{a' \in \mathcal{A}} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \Big)^2, \qquad (3) $$

where $\theta^{-}$ is the vector of the target network parameters. The second novelty is experience replay. It consists in saving each experience of the agent $(s_i, a_i, r_i, s_{i+1})$ in a memory of size m and using random samples drawn from it to update the parameters by stochastic gradient descent. This random draw makes it possible to avoid selecting consecutive, potentially correlated experiences.
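A minimal sketch of the loss (3), assuming that the Q function and the target network are given as scalar-valued callables q_net(s, a) and target_net(s, a), and that each stored experience also carries its set of candidate next actions (this matches the LAL setting, where the action set changes over time; it is an illustration, not the authors' implementation):

```python
import numpy as np

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean squared TD error of Equation (3) over a minibatch of experiences.

    Each experience is a tuple (s, a, r, s_next, candidate_actions)."""
    losses = []
    for s, a, r, s_next, candidates in batch:
        # The target is computed with the frozen parameters of the target network.
        target = r + gamma * max(target_net(s_next, a_next) for a_next in candidates)
        losses.append((target - q_net(s, a)) ** 2)
    return np.mean(losses)
```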
Many improvements to Deep Q-Learning have been published since the article that introduced it. We present here the ones that are relevant to the study of the LAL method.
Double Deep Q-Learning.
A first improvement is the correction of the overestimation bias. It has indeed been empirically shown that Deep Q-Learning as presented in Section 2.2 can produce a positive bias that increases the convergence time and has a significant negative impact on the quality of the asymptotically obtained policy. The importance of this bias and its consequences have been verified in particular in the configurations that are the least favourable to its emergence, i.e. when the environment and rewards are deterministic. In addition, its value increases with the size of the set of actions. To correct this bias, the solution proposed in [15] consists in not using the parameters $\theta^{-}$ to both select and evaluate an action. The cost function then becomes:

$$ \mathcal{L}(s, a, r, s', \theta, \theta^{-}) = \Big( r + \gamma\, Q\big(s', \arg\max_{a' \in \mathcal{A}} Q(s', a'; \theta); \theta^{-}\big) - Q(s, a; \theta) \Big)^2. \qquad (4) $$
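The change from (3) to (4) only concerns how the next action is chosen: the online network selects it and the target network evaluates it. A sketch, with the same assumed q_net/target_net callables as in the previous example:

```python
def double_dqn_target(q_net, target_net, r, s_next, candidates, gamma=0.99):
    """Target of Equation (4): the action is selected with the online network,
    and its value is estimated with the target network."""
    a_star = max(candidates, key=lambda a_next: q_net(s_next, a_next))
    return r + gamma * target_net(s_next, a_star)
```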
Another improvement is the introduction of the notion of priority in experience replay. In its initial version, Deep Q-Learning considers that all the experiences can identically advance learning. However, reusing some experiences at the expense of others can reduce the learning time. This requires the ability to measure the acceleration potential of learning associated with an experience. The priority measure proposed in [33] is the absolute value of the temporal difference error:

$$ \delta_i = \Big| r_i + \gamma \max_{a' \in \mathcal{A}} Q(s_{i+1}, a'; \theta^{-}) - Q(s_i, a_i; \theta) \Big|. \qquad (5) $$

A maximum priority is assigned to each new experience, so that all the experiences are used at least once to update the parameters. However, the experiences that produce a small temporal difference error at first use may never be reused. To address this issue, a method was introduced in [33] to manage the trade-off between uniform sampling and sampling focusing on experiences producing a large error. It consists in defining the probability of selecting an experience i as follows:

$$ p_i = \frac{\rho_i^{\beta}}{\sum_{k=1}^{m} \rho_k^{\beta}}, \quad \text{with } \rho_i = \delta_i + e, \qquad (6) $$

where $\beta \in \mathbb{R}^{+}$ is a parameter that determines the shape of the distribution and e is a small positive constant that guarantees $p_i > 0$. The case where β = 0 is equivalent to uniform sampling.
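A minimal sketch of the sampling step of Equation (6) (NumPy; deltas are the absolute TD errors of Equation (5), eps plays the role of the constant e, and the default exponent matches the value used later in Table 2):

```python
import numpy as np

def sample_experiences(deltas, batch_size=32, beta=3.0, eps=1e-6):
    """Draw a minibatch of experience indices with the probabilities of Equation (6)."""
    rho = np.abs(deltas) + eps
    probs = rho ** beta / np.sum(rho ** beta)
    return np.random.choice(len(deltas), size=batch_size, p=probs)
```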
The formulation of active learning as a MDP is quite natural. In each MDP state, the agent performs an action, which is the selection of an instance to be labelled, and it receives a reward that depends on the quality of the model learned with the new instance. The active learning strategy becomes the MDP policy that associates an action with a state.

In this framework, the iteration t of the policy learning process from a dataset divided into a learning set $\mathcal{D} = \mathcal{L}_t \cup \mathcal{U}_t$ and a test set $\mathcal{D}'$ consists in the following steps:
1. A model $h^{(t)}$ is learned from $\mathcal{L}_t$. Associated with $\mathcal{L}_t$ and $\mathcal{U}_t$, it allows to characterize a state $s_t$.
2. The agent performs the action $a_t = \pi(s_t) \in \mathcal{A}_t$ which defines the instance $x^{(t)} \in \mathcal{U}_t$ to label.
3. The label $y^{(t)}$ associated with $x^{(t)}$ is retrieved and the training set is updated, i.e. $\mathcal{L}_{t+1} = \mathcal{L}_t \cup \{(x^{(t)}, y^{(t)})\}$ and $\mathcal{U}_{t+1} = \mathcal{U}_t \setminus \{x^{(t)}\}$.
4. The agent receives the reward $r_t$ associated with the performance $\ell_t$ on the test set $\mathcal{D}'$. This reward is used to update the policy (see Section 2.5).

The set of actions $\mathcal{A}_t$ depends on time because it is not possible to select the same instance several times. These steps are repeated until a terminal state $s_T$ is reached. Here, we consider that we are in a terminal state when all the instances have been labelled or when $\ell_t \ge q$, where q is a performance threshold that has been chosen as 98% of the performance obtained when the model is learned on all the training data.

The precise definition of the set of states, the set of actions and the reward function is not evident. To define a state, it has been proposed to use a vector whose components are the scores $\hat{y}_t(x) = P(Y = 0 \mid x)$ associated with the unlabelled instances of a subset $\mathcal{V}$ set aside. This is the simplest representation that can be used to characterize the uncertainty of a classifier on a dataset at a given time t.

The set of actions has been defined at iteration t as the set of vectors $a_i = [\hat{y}_t(x_i),\, g(x_i, \mathcal{L}_t),\, g(x_i, \mathcal{U}_t)]$, where $x_i \in \mathcal{U}_t$ and:

$$ g(x_i, \mathcal{L}_t) = \frac{1}{|\mathcal{L}_t|} \sum_{x_j \in \mathcal{L}_t} \mathrm{dist}(x_i, x_j), \qquad g(x_i, \mathcal{U}_t) = \frac{1}{|\mathcal{U}_t|} \sum_{x_j \in \mathcal{U}_t} \mathrm{dist}(x_i, x_j), \qquad (7) $$

where dist is the cosine distance. An action is therefore characterized by the uncertainty on the associated instance, as well as by two statistics related to the density of the neighbourhood of the instance.

The reward function has been chosen constant and negative until arrival in a terminal state ($r_t = -1$), so that maximizing the cumulative reward amounts to reaching the terminal state in as few labelling iterations as possible. Given that active learning is usually applied in cases where the test set is assumed to be small or very small, the performance evaluated on this test set could be a poor approximation. This issue and techniques for avoiding it are not examined in this paper.
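The state and action representations described above can be computed with standard scikit-learn utilities; a sketch under the assumptions of this section (X_V is the held-out subset V, X_labelled and X_unlabelled are the current sets L and U; this is an illustration, not the reference implementation):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def state_vector(clf, X_V):
    """State: predicted probabilities of class 0 on the held-out subset V."""
    return clf.predict_proba(X_V)[:, 0]

def action_vectors(clf, X_unlabelled, X_labelled):
    """Actions: [uncertainty score, mean cosine distance to L, mean cosine distance to U]
    for every candidate instance, as in Equation (7)."""
    scores = clf.predict_proba(X_unlabelled)[:, 0]
    dist_to_L = cosine_distances(X_unlabelled, X_labelled).mean(axis=1)
    dist_to_U = cosine_distances(X_unlabelled, X_unlabelled).mean(axis=1)
    return np.column_stack([scores, dist_to_L, dist_to_U])
```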
The Deep Q-Learning algorithm with the improvements presented in Section 2.3 is used to learn the optimal policy. To be able to process a state space that evolves with each iteration, the neural network architecture has been modified. The new architecture considers actions as inputs to the Q function in the same way as states. It then returns only one value, whereas the classical architecture takes only one state as input and returns the values associated with all the actions.

The learning procedure involves a collection of Z labelled datasets $\{\mathcal{Z}_i\}_{1 \le i \le Z}$. It consists in repeating the following steps (see Figure 1):
1. A dataset $\mathcal{Z} \in \{\mathcal{Z}_i\}$ is randomly selected and divided into a training set $\mathcal{D}$ and a test set $\mathcal{D}'$.
2. The policy π derived from the Deep Q-Network is used to simulate several active learning episodes on $\mathcal{Z}$ according to the procedure described in Section 2.4. Experiences $(s_t, a_t, r_t, s_{t+1})$ are collected in a finite size memory.
3. The Deep Q-Network parameters are updated several times from a minibatch of experiences extracted from the memory (according to the method described in Section 2.3).

To initialize the Deep Q-Network, some warm start episodes are simulated using a random sampling policy, followed by several parameter updates. Once the strategy is learned, its deployment is very simple. At each iteration of the sampling process, the classifier is re-trained, then the vector characterizing the process state and all the vectors associated with the actions are calculated. The vector $a^\star$ corresponding to the example to label $x^\star$ is then the one that satisfies $a^\star = \arg\max_{a \in \mathcal{A}} Q(s, a; \theta)$, the parameters θ being set at the end of the policy learning procedure.
Fig. 1.
Illustration of the different steps involved in an iteration of the policy learning phase using Deep Q-Learning (the arrows give intuitions about the main steps and data flows).
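The outer loop of Figure 1 can be summarized as follows. This is a high-level sketch only: simulate_episode (which collects experiences on one source dataset), update_network (which performs one stochastic gradient step on a minibatch) and replay_memory (which supports extend() and sample()) are hypothetical helpers standing for the components described above.

```python
import random

def train_lal_policy(datasets, simulate_episode, update_network, replay_memory,
                     training_iterations=1000, episodes_per_iteration=10,
                     updates_per_iteration=60, batch_size=32):
    """Policy learning loop: sample a source dataset, simulate episodes, update the network."""
    for _ in range(training_iterations):
        source = random.choice(datasets)             # step 1: pick a source dataset
        for _ in range(episodes_per_iteration):      # step 2: simulate AL episodes
            replay_memory.extend(simulate_episode(source))
        for _ in range(updates_per_iteration):       # step 3: minibatch updates
            update_network(replay_memory.sample(batch_size))
```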
In this section, we introduce the protocol of the comparative experimental study that we conducted.
To learn the strategy, we used the same code (available at https://github.com/ksenia-konyushkova/LAL-RL), the same hyperparameters and the same datasets as those used in [20]. The complete list of hyperparameters is given in Table 2, with the variable names from the code that represent them. The datasets from which the strategy is learned are given in Table 3.

The specification of the neural network architecture is very simple (all the layers are fully connected): (i) the first layer (linear + sigmoid) receives the vector s (i.e. |V| = 30 input neurons) and has 10 output neurons; (ii) the second layer (linear + sigmoid) concatenates the 10 output neurons of the first layer with the vector a (i.e. 13 neurons in total) and has 5 output neurons; (iii) finally, the last layer (linear) has only one output to estimate Q(s, a).
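A sketch of this architecture, assuming PyTorch (the layer sizes follow the description above; this is an illustration, not the code of [20]):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q(s, a): the state (30 scores on V) and the action (3 features) are both inputs."""
    def __init__(self, state_dim=30, action_dim=3):
        super().__init__()
        self.state_layer = nn.Sequential(nn.Linear(state_dim, 10), nn.Sigmoid())
        self.hidden = nn.Sequential(nn.Linear(10 + action_dim, 5), nn.Sigmoid())
        self.out = nn.Linear(5, 1)

    def forward(self, state, action):
        h = self.state_layer(state)
        h = self.hidden(torch.cat([h, action], dim=-1))
        return self.out(h)          # a single scalar estimate of Q(s, a)

q_net = QNetwork()
value = q_net(torch.rand(1, 30), torch.rand(1, 3))
```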
Hyperparameter                            Description
N STATE ESTIMATION = 30                   Size of V
REPLAY BUFFER SIZE = 10000                Experience replay memory size
PRIORITIZED REPLAY EXPONENT = 3           Exponent β involved in Equation (6)
BATCH SIZE = 32                           Minibatch size for stochastic gradient descent
LEARNING RATE = 0.0001                    Learning rate
TARGET COPY FACTOR = 0.01                 Value that sets the target network update
EPSILON START = 1                         Exploration probability at start
EPSILON END = 0.1                         Minimum exploration probability
EPSILON STEPS = 1000                      Number of updates of ε during the training
WARM START EPISODES = 100                 Number of warm start episodes
NN UPDATES PER WARM START = 100           Number of parameter updates after the warm start
TRAINING ITERATIONS = 1000                Number of training iterations
TRAINING EPISODES PER ITERATION = 10      Number of episodes per training iteration
NN UPDATES PER ITERATION = 60             Number of updates per training iteration

In this implementation, the target network parameters $\theta^{-}$ are updated each time the parameters θ are changed, as follows: $\theta^{-} \leftarrow (1 - \text{TARGET COPY FACTOR}) \cdot \theta^{-} + \text{TARGET COPY FACTOR} \cdot \theta$.

Table 2.
Hyperparameters involved in Deep Q-Learning.
Our objective is to compare the performance of a strategy learned using LAL with the performance of a heuristic strategy that, on average, is the best one for a given model.

Dataset         |D|    |Y|   #num  #cat  maj (%)  min (%)
australian       690     2      6     8    55.51    44.49
breast-cancer    272     2      0     9    70.22    29.78
diabetes         768     2      8     0    65.10    34.90
german          1000     2      7    13    70.00    30.00
heart            293     2     13     0    63.82    36.18
ionosphere       350     2     33     0    64.29    35.71
mushroom        8124     2      0    21    51.80    48.20
wdbc             569     2     30     0    62.74    37.26
Table 3.
Datasets used to learn the new strategy. Columns: number of examples, number of classes, numbers of numerical and categorical variables, proportions of examples in the majority and minority classes.

Several benchmarks conducted on numerous datasets have highlighted the fact that margin sampling is the best heuristic strategy for logistic regression (LR) and random forest (RF) [41,29]. Margin sampling consists in choosing the instance for which the difference (or margin) between the probabilities of the two most likely classes is the smallest:

$$ x^\star = \arg\min_{x \in \mathcal{U}} P(y_1 \mid x) - P(y_2 \mid x), \qquad (8) $$

where $y_1$ and $y_2$ are respectively the first and second most probable classes for x. The main advantage of this strategy is that it is easy to implement: at each iteration, a single training of the model and $|\mathcal{U}|$ predictions are sufficient to select an example to label. A major disadvantage, however, is its total lack of exploration, as it only exploits locally the hypothesis learned by the model.

We chose to evaluate the Margin/LR association because it is with logistic regression that the hyperparameters of Table 2 were optimized in [20]. In addition, in order to determine whether it is necessary to modify them when another model is used, we also evaluated the Margin/RF association. This last association is particularly interesting because it is the best association highlighted in a recent and large benchmark carried out on 73 datasets, including 5 classification models and 8 active learning strategies [29]. In addition, we evaluated random sampling (Rnd) for both models.

The datasets were selected so as to have a high diversity according to the following criteria: (i) number of examples; (ii) number of numerical variables; (iii) number of categorical variables; (iv) class imbalance. We have also taken care to exclude datasets that are too small and not representative of those used in an industrial context. The 20 selected datasets are described in Table 4. They all come from the UCI database [11], apart from the dataset "orange-fraud", which is a dataset on fraud detection. Four of the datasets have been used in a challenge on active learning that took place in 2010 [14], and the dataset "nomao" comes from another challenge on active learning [6].
Dataset                        |D|    |Y|   #num  #cat  maj (%)  min (%)
adult                        48790      2      6     8    76.06    23.94
banana                        5292      2      2     0    55.16    44.84
bank-marketing-full          45211      2      7     9    88.30    11.70
climate-simulation-craches     540      2     20     0    91.48     8.52
eeg-eye-state                14980      2     14     0    55.12    44.88
hiva                         40764      2   1617     0    96.50     3.50
ibn-sina                     13951      2     92     0    76.18    23.82
magic                        18905      2     10     0    65.23    34.77
musk                          6581      2    166     1    84.55    15.45
nomao                        32062      2     89    29    69.40    30.60
orange-fraud                  1680      2     16     0    63.75    36.25
ozone-onehr                   2528      2     72     0    97.11     2.89
qsar-biodegradation           1052      2     41     0    66.35    33.65
seismic-bumps                 2578      2     14     4    93.41     6.59
skin-segmentation            51444      2      3     0    71.51    28.49
statlog-german-credit         1000      2      7    13    70.00    30.00
thoracic-surgery               470      2      3    13    85.11    14.89
thyroid-hypothyroid           3086      2      7    18    95.43     4.57
wilt                          4819      2      5     0    94.67     5.33
zebra                        61488      2    154     0    95.42     4.58
Table 4.
Datasets used for the evaluation of the strategy learned by LAL. Columns: number of examples, number of classes, numbers of numerical and categorical variables, proportions of examples in the majority and minority classes.
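The margin criterion of Equation (8) is straightforward to implement with any classifier exposing predict_proba; a minimal sketch (scikit-learn assumed):

```python
import numpy as np

def margin_sampling(clf, X_unlabelled):
    """Return the index of the instance with the smallest margin (Equation (8))."""
    probs = clf.predict_proba(X_unlabelled)
    sorted_probs = np.sort(probs, axis=1)
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]   # P(y1|x) - P(y2|x)
    return int(np.argmin(margins))
```

Such a function can be plugged directly as the `select` callable of the generic loop sketched after Table 1.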
In our evaluation protocol, the active sampling process begins with the random selection of one instance in each class and ends when 250 instances are labelled. This value ensures that our results are comparable to other studies in the literature. For performance comparison, we used the area under the learning curve (ALC) based on the classification accuracy. We do not claim that the ALC is a "perfect metric", but it is the de facto standard evaluation criterion in active learning, and it has been chosen as part of a challenge [14]. There is literature on more expressive summary statistics of the active learning curve [39,30]; this could be a limitation of the current article, and other metrics could be tested in future versions of the experiments.

Our evaluation was carried out by cross-validation with 5 partitions, in which the class imbalance of the complete dataset was preserved. For each partition, the sampling process was repeated 5 times with different initializations to get a mean and a variance on the result. However, we have made sure that the initial instances are identical for all the strategy/model associations on each partition, so as to not introduce bias into the results. In addition, for Rnd, the random sequence of numbers was identical for all the models.
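The ALC can be computed directly from the sequence of test accuracies recorded after each labelling step; a minimal sketch, assuming the score is normalized over the labelling budget so that the optimum is 100 (the exact normalization used in the challenge [14] may differ):

```python
import numpy as np

def area_under_learning_curve(accuracies):
    """Mean test accuracy over the labelling budget, scaled to [0, 100]."""
    return 100.0 * float(np.mean(accuracies))

# accuracies[i] is the test accuracy after the i-th labelled instance.
alc = area_under_learning_curve([0.71, 0.74, 0.78, 0.80])
```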
The results of our experimental study are given in Table 5. The mean ALC values obtained for each dataset/classifier/strategy association are reported (the optimal score is 100). The left part of the table gives the results for logistic regression and the right part gives the results for random forest. The penultimate line corresponds to the averages calculated over all the datasets and the last line gives the number of times the strategy has won, tied or lost. The non-significant differences were established on the basis of a paired t-test at the 99% significance level (where H0 is the equality of the means between populations, the mean being estimated over the 5 repetitions x cross-validation with 5 partitions of each method).

Dataset   Rnd/LR  Margin/LR  LAL/LR  Rnd/RF  Margin/RF  LAL/RF  maj
adult      77.93      78.91   78.97   80.17
Table 5.
Results of the experimental study.
Several observations can be made. First of all, it should be noted that the choice of model is decisive: the results of random forest are all better than those of logistic regression. The random forest model indeed learns very well from few data, as highlighted in [32]. We can notice that even with random sampling, RF is almost always better than LR, regardless of the strategy used. In addition, using margin sampling with this model allows a significant performance improvement.
This model is very competitive in itself because, by its nature, it includes terms of exploration and exploitation (see the conclusion in Section 5 about this point).

In addition, the results of the learned strategy clearly show that a good active learning strategy has been learned, since it performs better than random sampling over a large number of datasets. However, the learned strategy is no better than margin sampling. These results are nevertheless very interesting since only 8 datasets were used in the learning procedure.

Finally, the results show a well-known fact about active learning: on very unbalanced datasets, it is difficult to achieve a better performance than random sampling, as shown in the last column of Table 5, in which the results obtained by always predicting the majority class are given. The "cold start" problem that occurs in active learning, i.e. the inability to make reliable predictions in early iterations (when training data is not sufficient), is indeed further aggravated when a dataset has highly imbalanced classes, since the selected samples are likely to belong to the majority class [38]. However, if the imbalance is known, it may be interesting to associate strategies with a model or criterion appropriate to this case, as illustrated in [13].

To investigate the "learning speed", we show results for different sizes of $\mathcal{L}$ in Table 6. They lead to similar conclusions, and our results for $|\mathcal{L}| = 32$ confirm the results of [32]. The reader may find all our experimental results on GitHub (https://github.com/ldesreumaux/lal_evaluation).

             |L| = 32               |L| = 64               |L| = 128              |L| = 250
Dataset      Rnd   Margin LAL      Rnd   Margin LAL      Rnd   Margin LAL      Rnd   Margin LAL
adult        77.95 77.88  78.16    79.72 80.51  81.05    81.13 82.79  82.48    82.12 83.55  83.40
banana       71.13 65.48  65.16    77.93 71.42  70.96    83.64 75.58  75.70    86.55 79.71  81.35
bank...      88.05 87.90  88.10    88.29 88.38  88.54    88.43 88.82  88.90    88.75 89.21  89.35
climate...   91.26 91.26  91.18    91.40 91.29  91.40    91.26 91.33  91.33    91.44 91.22  91.29
eeg...       58.28 58.94  57.34    62.07 63.17  60.79    66.77 69.38  65.35    72.55 75.08  72.46
hiva         96.36 96.52  96.49    96.36 96.55  96.54    96.46 96.57  96.56    96.49 96.65  96.65
ibn-sina     86.88 91.17  89.78    90.48 93.99  92.96    92.73 94.76  94.25    93.86 95.85  95.48
magic        71.99 75.63  72.95    76.85 80.20  77.26    80.15 82.71  82.01    82.42 84.53  84.43
musk         85.29 89.50  90.09    87.43 94.44  94.18    90.58 98.78  97.63    93.64 99.98  99.31
nomao        85.92 89.35  89.37    88.92 92.46  92.09    90.85 93.69  93.33    92.36 94.52  94.37
orange...    88.06 90.36  90.09    89.16 90.98  90.67    90.08 91.72  91.33    90.41 91.85  91.74
ozone...     96.36 96.97  97.01    96.74 97.04  97.10    96.93 97.08  97.11    97.02 97.03  97.05
qsar...      75.75 78.08  76.61    79.75 82.09  81.42    81.94 84.65  84.88    84.03 86.12  86.08
seismic...   92.39 93.21  93.19    92.42 93.28  93.19    92.52 93.26  93.20    93.14 93.08  93.28
skin...      86.42 89.19  89.46    90.80 96.19  96.06    93.70 98.86  98.65    95.85 99.56  99.49
statlog...   70.36 70.70  69.70    70.94 72.47  71.75    72.40 73.46  74.10    74.29 75.22  75.06
thoracic...  83.14 84.42  84.12    83.31 85.02  84.76    83.70 84.89  84.68    84.21 84.51  84.68
thyroid...   97.26 98.71  98.43    97.86 99.15  98.71    98.08 99.10  98.89    98.26 98.84  98.98
wilt         94.60 96.23  95.98    95.01 97.47  96.90    95.30 98.21  97.64    96.07 98.51  98.37
zebra        94.66 95.32  95.28    94.87 95.44  95.31    94.96 95.72  95.46    95.01 96.04  95.33
Mean         84.60
Table 6.
Mean test accuracy (%) for different sizes of $|\mathcal{L}|$ with the random forest model.

In this article, we evaluated a method representative of a recent orientation of active learning research towards meta-learning methods for "learning how to actively learn", which is at the top of the state of the art [20], against a traditional heuristic-based active learning approach (the association of Random Forest and Margin), which is one of the best methods reported in recent comparative studies [41,29]. The comparison is limited to just one representative of each of the two classes (meta-learning and traditional heuristic-based), but since each is at the top of the state of the art, several lessons can be drawn from our study.
Relevance of LAL.
First of all, the experiments carried out confirm the relevance of the LAL method, since it has enabled us to learn a strategy that achieves the performance of a very good heuristic, namely margin sampling; however, contrary to the results in [20], the strategy is not always better than random sampling. This method still raises many problems, including that of the transferability of the learned strategies. An active learning solution that can be used in an industrial context must perform well on real data of an unknown nature and must not involve parameters to be adjusted. With regard to the LAL method, a first major problem is therefore the constitution of a "dataset of datasets" large and varied enough to learn a strategy that is effective in very different contexts. Moreover, the learning procedure is sensitive to the performance criterion used, which in our view is a problem. Ideally, the learned strategy should be usable on new datasets with arbitrary performance criteria (AUC, F-score, etc.). From our point of view, the work of optimizing the many hyperparameters of the method (see Table 2) cannot be carried out by a user with no expertise in deep reinforcement learning.
About the Margin/RF association.
In addition to the evaluation of the LAL method, we confirmed a result of [29], namely that margin sampling, associated with a random forest, is a very competitive strategy. From an industrial point of view, regarding the computational complexity, the performances obtained and the absence of domain knowledge required to use it, the Margin/RF association remains a very strong baseline that is difficult to beat. However, it shares a major drawback with many active learning strategies, that is, its lack of reliability. Indeed, there is no strategy that is better than or equivalent to random sampling on all datasets and with all models. The literature on active learning is incomplete with regard to this problem, which is nevertheless a major obstacle to using active learning in real-world settings.

Another important problem in real-world applications, little studied in the literature, is the estimation of the generalization error without a test set. It would be interesting to check if the Out-Of-Bag samples of the random forests [5] can be used in an active learning context to estimate this error.

Concerning the exploitation/exploration dilemma, margin sampling clearly performs only exploitation. The good results of the Margin/RF association may suggest that the RF algorithm intrinsically contains a part of exploration due to the bagging paradigm. It could be interesting to add experiments in the future to test this point.

Still with regard to the random forests, an open question is to study whether a better strategy than margin sampling could be designed. Since the random forests are ensemble classifiers, a possible way of research to design this strategy is to check if they could be used in the credal uncertainty framework [2], which seeks to differentiate between the reducible and irreducible parts of the uncertainty in a prediction.
About error generalization.
In real-world applications, AL should most of the time be used in the absence of a test dataset. An open question could be to use another known result about RF: the possibility to have an estimate of the generalization error using the Out-Of-Bag (OOB) samples [16,5]. We did not present experiments on this topic in this paper, but an idea could be to analyze the convergence, as the number of labelled examples grows, between the OOB performance and the test performance, in order to check at which "moment" ($|\mathcal{L}|$) one could trust the OOB performance (OOB performance ≈ test performance). Indeed, when $|\mathcal{L}|$ is very low, the RF overfits, so its training performance is not a good indicator of the generalization error. The use of a "random uniform forest" [9], for which the OOB performance seems to be more reliable, could also be investigated.
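A minimal sketch of this idea, using the out-of-bag estimate exposed by scikit-learn (the comparison with the test accuracy is illustrative only; the analysis described above was not run in this paper):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def oob_vs_test_accuracy(X_labelled, y_labelled, X_test, y_test):
    """Compare the OOB accuracy estimate with the accuracy on a held-out test set."""
    rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
    rf.fit(X_labelled, y_labelled)
    return rf.oob_score_, accuracy_score(y_test, rf.predict(X_test))
```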
About the benchmarking methodology.
Recent benchmarks have highlighted the need for extensive experimentation to compare active learning strategies. The research community might benefit from a "reference" benchmark, as in the field of time series classification [7], so that new results can be rigorously compared to the state of the art on a same and large set of datasets. In this way, one would have comprehensive benchmarks that could ascertain the transferability of the learned strategies and demonstrate that these strategies can safely be used in real-world settings. If this reference benchmark is created, the second step would be to decide how to compare the AL strategies. This comparison could be made using not a single criterion but a "pool" of criteria. This pool may be chosen to reflect different "aspects" of the results [22].
References
1. Aggarwal, C.C., Kong, X., Gu, Q., Han, J., Yu, P.S.: Active Learning: A Survey. In: Aggarwal, C.C. (ed.) Data Classification: Algorithms and Applications, chap. 22, pp. 571–605. CRC Press (2014)
2. Antonucci, A., Corani, G., Bernaschina, S.: Active Learning by the Naive Credal Classifier. In: Proceedings of the Sixth European Workshop on Probabilistic Graphical Models (PGM). pp. 3–10 (2012)
3. Baram, Y., El-Yaniv, R., Luz, K.: Online Choice of Active Learning Algorithms. Journal of Machine Learning Research, 255–291 (2004)
4. Beyer, C., Krempl, G., Lemaire, V.: How to Select Information That Matters: A Comparative Study on Active Learning Strategies for Classification. In: Proceedings of the 15th International Conference on Knowledge Technologies and Data-driven Business. ACM (2015)
5. Breiman, L.: Out-of-bag estimation (1996), last visited 08/03/2020
6. Candillier, L., Lemaire, V.: Design and analysis of the nomao challenge: active learning in the real-world. In: Proceedings of the ALRA: Active Learning in Real-world Applications, Workshop ECML-PKDD (2012)
7. Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., Batista, G.: The UCR Time Series Classification Archive (2015)
8. Chu, H.M., Lin, H.T.: Can Active Learning Experience Be Transferred? In: 2016 IEEE 16th International Conference on Data Mining. pp. 841–846 (2016)
9. Ciss, S.: Generalization Error and Out-of-bag Bounds in Random (Uniform) Forests, working paper or preprint, https://hal.archives-ouvertes.fr/hal-01110524/document, last visited 06/03/2020
10. Collet, T.: Optimistic Methods in Active Learning for Classification. Ph.D. thesis, Université de Lorraine (2018)
11. Dua, D., Graff, C.: UCI Machine Learning Repository (2017), http://archive.ics.uci.edu/ml
12. Ebert, S., Fritz, M., Schiele, B.: RALF: A reinforced active learning formulation for object class recognition. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3626–3633 (2012)
13. Ertekin, S., Huang, J., Bottou, L., Giles, L.: Learning on the Border: Active Learning in Imbalanced Data Classification. In: Conference on Information and Knowledge Management. pp. 127–136. CIKM (2007)
14. Guyon, I., Cawley, G., Dror, G., Lemaire, V.: Results of the Active Learning Challenge. In: Proceedings of Machine Learning Research. vol. 16, pp. 19–45. PMLR (2011)
15. Hasselt, H.v., Guez, A., Silver, D.: Deep Reinforcement Learning with Double Q-Learning. In: AAAI Conference on Artificial Intelligence. pp. 2094–2100 (2016)
16. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edn. (2009)
17. Hsu, W.N., Lin, H.T.: Active Learning by Learning. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. pp. 2659–2665. AAAI Press (2015)
18. Hüllermeier, E., Waegeman, W.: Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods. arXiv:1910.09457 [cs.LG] (2019)
19. Konyushkova, K., Sznitman, R., Fua, P.: Learning Active Learning from Data. In: Advances in Neural Information Processing Systems 30, pp. 4225–4235 (2017)
20. Konyushkova, K., Sznitman, R., Fua, P.: Discovering General-Purpose Active Learning Strategies. arXiv:1810.04114 [cs.LG] (2019)
21. Körner, C., Wrobel, S.: Multi-class Ensemble-Based Active Learning. In: Proceedings of the 17th European Conference on Machine Learning. pp. 687–694. Springer-Verlag (2006)
22. Kottke, D., Calma, A., Huseljic, D., Krempl, G., Sick, B.: Challenges of Reliable, Realistic and Comparable Active Learning Evaluation. In: Proceedings of the Workshop and Tutorial on Interactive Adaptive Learning. pp. 2–14 (2017)
23. Lemke, C., Budka, M., Gabrys, B.: Metalearning: a survey of trends and technologies. Artificial Intelligence Review, 117–130 (2015)
24. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 [cs.LG] (2013)
25. Nguyen, V.L., Destercke, S., Hüllermeier, E.: Epistemic Uncertainty Sampling. In: Discovery Science (2019)
26. Pang, K., Dong, M., Wu, Y., Hospedales, T.M.: Dynamic Ensemble Active Learning: A Non-Stationary Bandit with Expert Advice. In: Proceedings of the 24th International Conference on Pattern Recognition. pp. 2269–2276 (2018)
27. Pang, K., Dong, M., Wu, Y., Hospedales, T.M.: Meta-Learning Transferable Active Learning Policies by Deep Reinforcement Learning. arXiv:1806.04798 [cs.LG] (2018)
28. Pereira-Santos, D., de Carvalho, A.C.: Comparison of Active Learning Strategies and Proposal of a Multiclass Hypothesis Space Search. In: Proceedings of the 9th International Conference on Hybrid Artificial Intelligence Systems – Volume 8480. pp. 618–629. Springer-Verlag (2014)
29. Pereira-Santos, D., Prudêncio, R.B.C., de Carvalho, A.C.: Empirical investigation of active learning strategies. Neurocomputing, 15–27 (2019)
30. Pupo, O.G.R., Altalhi, A.H., Ventura, S.: Statistical comparisons of active learning strategies over multiple datasets. Knowledge-Based Systems, 274–288 (2018)
31. Ramirez-Loaiza, M.E., Sharma, M., Kumar, G., Bilgic, M.: Active learning: an empirical study of common baselines. Data Mining and Knowledge Discovery (2), 287–313 (2017)
32. Salperwyck, C., Lemaire, V.: Learning with few examples: an empirical study on leading classifiers. In: Proceedings of the 2011 International Joint Conference on Neural Networks. pp. 1010–1019. IEEE (2011)
33. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized Experience Replay. arXiv:1511.05952 [cs.LG] (2016)
34. Scheffer, T., Decomain, C., Wrobel, S.: Active Hidden Markov Models for Information Extraction. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds.) Advances in Intelligent Data Analysis. pp. 309–318 (2001)
35. Schein, A.I., Ungar, L.H.: Active learning for logistic regression: an evaluation. Machine Learning, 235–265 (2007)
36. Settles, B.: Active Learning. Morgan & Claypool Publishers (2012)
37. Settles, B., Craven, M.: An Analysis of Active Learning Strategies for Sequence Labeling Tasks. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1070–1079. Association for Computational Linguistics (2008)
38. Shao, J., Wang, Q., Liu, F.: Learning to Sample: An Active Learning Framework. In: IEEE International Conference on Data Mining (ICDM). pp. 538–547 (2019)
39. Trittenbach, H., Englhardt, A., Böhm, K.: An overview and a benchmark of active learning for one-class classification. CoRR abs/1808.04759 (2018)
40. Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning (3), 279–292 (1992)
41. Yang, Y., Loog, M.: A benchmark and comparison of active learning for logistic regression. Pattern Recognition 83