Bounded Memory Active Learning through Enriched Queries
Max Hopkins∗, Daniel Kane†, Shachar Lovett‡, Michal Moshkovitz§

February 11, 2021

∗ Department of Computer Science and Engineering, UCSD, CA 92092. Email: [email protected]. Supported by NSF Award DGE-1650112.
† Department of Computer Science and Engineering / Department of Mathematics, UCSD, CA 92092. Email: [email protected]. Supported by NSF CAREER Award ID 1553288 and a Sloan fellowship.
‡ Department of Computer Science and Engineering, UCSD, CA 92092. Email: [email protected]. Supported by NSF CAREER award 1350481, CCF award 1614023 and a Sloan fellowship.
§ Qualcomm Institute, UCSD, California, CA 92092. Email: [email protected].
Abstract
The explosive growth of easily-accessible unlabeled data has led to growing interest in active learning, a paradigm in which data-hungry learning algorithms adaptively select informative examples in order to lower prohibitively expensive labeling costs. Unfortunately, in standard worst-case models of learning, the active setting often provides no improvement over non-adaptive algorithms. To combat this, a series of recent works have considered a model in which the learner may ask enriched queries beyond labels. While such models have seen success in drastically lowering label costs, they tend to come at the expense of requiring large amounts of memory. In this work, we study what families of classifiers can be learned in bounded memory. To this end, we introduce a novel streaming-variant of enriched-query active learning along with a natural combinatorial parameter called lossless sample compression that is sufficient for learning not only with bounded memory, but in a query-optimal and computationally efficient manner as well. Finally, we give three fundamental examples of classifier families with small, easy to compute lossless compression schemes when given access to basic enriched queries: axis-aligned rectangles, decision trees, and halfspaces in two dimensions.
Today's learning landscape is dominated mostly by data-hungry algorithms, each requiring a massive supply of labeled training samples in order to reach state of the art accuracy. Such algorithms are excellent when labeled data is cheap and plentiful, but in many important scenarios acquiring labels requires the use of human experts, making popular supervised methods like deep learning infeasible both in time and cost. In recent years, a framework meant to address this issue called active learning has gained traction in both theory and practice. Active learning posits that not all labeled samples are equal: some may be particularly informative, others useless. While a standard supervised (passive) learning algorithm receives a stream or pool of labeled training data, an active learner instead receives unlabeled data along with the ability to query an expert labeling oracle. By choosing only to query the most informative examples, the hope is that an active learner can achieve state of the art accuracy using only a small fraction of the labels required by passive techniques.

While active learning saw initial success with simple classifiers such as thresholds in one dimension, it quickly became clear that inherent structural barriers barred it from improving substantially over the passive case even for very basic examples such as halfspaces in two dimensions or axis-aligned rectangles [11]. A number of modifications to the model have been proposed to remedy this issue. One such strategy that has gained increasing traction in the past few years is empowering the learner to ask questions beyond simple label queries. One might ask the oracle, for instance, to compare two pieces of data rather than simply label them—the idea being that such additional information might break down the structural barriers inherent in standard lower bounds. Indeed, in 2017, Kane, Lovett, Moran, and Zhang (KLMZ) [32] showed not only how this was true for halfspaces (we note their work requires structural assumptions on the data to work beyond two dimensions), but introduced a combinatorial complexity parameter called inference dimension to characterize exactly when a family of classifiers is efficiently actively learnable with respect to some set of enriched queries.

The model proposed by KLMZ [32], however, is not without its downsides. One issue is that the model is pool-based, meaning the algorithm receives a large pool of unlabeled samples ahead of time, and can query or otherwise access any part of the sample at any time. This type of model can be infeasible in practice due to its unrealistic memory requirements—the learner is assumed to always have the full training data in storage. In this work, we aim to resolve this issue by studying when a family of classifiers can be efficiently actively learned in bounded memory, meaning the amount of memory used by the algorithm should remain constant regardless of the desired accuracy.
Such algorithms open up potential applications of active learning to scenarios where storage is severely limited, e.g. to smartphones and other mobile devices.

To this end, we introduce a new streaming variant of active learning with enriched queries in which the learner has access to a stream of unlabeled data and chooses one-by-one whether to store or forget points from the stream. Instead of having query access to the full training set at any time, our algorithm is then restricted to querying only points it has stored in memory. Along with this model, we introduce a natural strengthening of Littlestone and Warmuth's [38] sample compression, a standard learning technique for Valiant's Probably Approximately Correct (PAC) [56, 57] model, called lossless sample compression, and show that any class with such a scheme may be learned query-optimally and with constant memory. In doing so, we make the first non-trivial advance towards answering an open question posed by KLMZ [32] regarding the existence of a combinatorial characterization for bounded-memory active learning. Further, we show that lossless compression schemes imply learnability not only in the standard PAC-model, but also in a much stronger sense known as Reliable and Probably Useful Learning [48]. This model, which carries strong connections to the active learning paradigm [15, 32], demands that the learner be perfect (makes no errors) with the caveat that it may abstain from classifying a small fraction of examples.

Finally, we conclude by showing that a number of classifier families fundamental to machine learning exhibit small lossless compression schemes with respect to natural enriched queries. We focus in particular on three such classes: axis-aligned rectangles, decision trees, and halfspaces in two dimensions. In each of these three cases our lossless compression scheme is efficiently computable, resulting in computationally efficient as well as query-optimal and bounded memory learners. All three classes provide powerful examples of how natural enriched queries can turn fundamental learning problems from prohibitively expensive to surprisingly feasible.
We start with an informal overview of our contributions: namely the introduction of lossless sample compression, its implications for efficient, bounded memory RPU-learning, and three fundamental examples of classes which, while infeasible or even impossible in standard models, have small lossless compression schemes with respect to natural enriched queries. Before launching into such results, however, we give a brief introduction to enriched queries, followed by some intuition and background on our novel form of compression.

Given a set X and a binary classifier h : X → {0, 1}, standard models of learning generally aim to approximate h solely through the use of labeled samples from X. Since labels often cannot provide enough information to learn efficiently, we allow the learner to ask some specified set of additional questions, denoted by a "query set" Q (see Section 2.3 for a formal description). As an example, one well-studied notion of an enriched query is a "comparison" [29, 33, 60, 61, 26, 27, 9]. In such cases, along with asking for labels, the learner may additionally ask an expert to compare two pieces of data (e.g. asking a doctor "which of these patients do you think is sicker?"). Given a sample S ⊆ X, we let Q_h(S) then denote the responses to all possible queries in Q on S under hypothesis h. For the basic example of labels and comparisons, this would consist of |S| labels and all (|S| choose 2) pairwise comparisons.

With this notion in hand, we can discuss our extension of sample compression to the enriched query regime. Standard sample compression schemes posit the existence of a compression algorithm A and decompression scheme D such that for any sample S, A(S) is small, and D(A(S)) outputs a hypothesis that correctly labels all elements of S. Lossless sample compression strengthens this idea in two ways. First, the hypothesis output by D must be zero-error (but, like our learners, is allowed to abstain). Second, the output hypothesis must label not just S, but every point whose label can be inferred from Q_h(S). The label of x ∈ X is inferred by a subset S ⊆ X if all concepts h ∈ H consistent with queries on S share the same label for x (see Section 2.4 for details).

Definition 1.1 (Informal Definition 3.1). Let X be a set and H a family of binary classifiers on X. We say (X, H) has a lossless compression scheme (LCS) W of size k with respect to a set of enriched queries Q if for all classifiers h ∈ H and subsets S ⊂ X, there exists a subset W = W(Q_h(S)) ⊆ S such that |W| ≤ k, and any point in X whose label is inferred by Q_h(S) is also inferred by queries on Q_h(W).

Given the existence of a lossless compression scheme for some class (X, H), we prove in Section 3 that (a slight variant of) the following simple algorithm learns (X, H) query-optimally, in bounded memory, and with no errors.
Algorithm 1: Bounded Memory RPU-Learning via Lossless Compression
Result:
Returns a zero-error classifier that labels a 1 − ε fraction of X with probability 1 − δ.
Input:
Query set Q, hypothesis class (X, H), sample oracle O_X, and lossless compression scheme W.
Parameters:
• Size of LCS k
• Query cap T_1 = O(log(1/(εδ)))
• Sample cap T_2 = Õ(k log(1/δ)/ε)
Algorithm:
Initialize i = 0, j = 0, C_0 = ∅, X_0 = X;
While i ≤ T_1:
1. Sample a subset S_i ⊆ X_i of size k (via rejection sampling on O_X).
   (a) For each point drawn from X in this process, increment j.
   (b) If j reaches T_2, abort and return labels inferred by queries on C_i
2. Make all queries on S_i ∪ C_i, and compute C_{i+1} = W(Q_h(S_i ∪ C_i))
3. Remove all points in and queries on (S_i ∪ C_i) \ C_{i+1} from memory and increment i.
4. Set X_i ⊆ X to be the set of points uninferred by queries on C_i
Return labels inferred by queries on C_i

It is worth noting that in Algorithm 1, the set of remaining uninferred points X_i need not be kept in memory. Membership in X_i can be checked in an online fashion in step 1. We prove in Section 3 that Algorithm 1 is a query and computationally efficient, bounded memory RPU-learner (as mentioned previously, the learner is allowed to say "I don't know" on a small fraction of the space; see Section 2.1 for the formal definition).

Theorem 1.2 (Informal Theorem 3.5). Algorithm 1 actively RPU-learns (X, H) in only

q(ε, δ) ≤ O_k(log(1/ε)) queries, T(ε, δ) ≤ Õ_k(T_{X,H,W} log(1/δ)/ε) time, and M(X, H) ≤ O_k(1) memory,

where we have suppressed dependence on k (the size of the LCS), and T_{X,H,W} is a parameter dependent only on the class and compression scheme W. In all examples we study, T_{X,H,W} is small and the dependence on k is at worst quadratic.
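To make the control flow of Algorithm 1 concrete, the following Python sketch walks through the same loop. The helpers sample_point, make_queries, compress, and infers_label are placeholders of our own naming for the sampling oracle O_X, the query set Q, the LCS W, and label inference under the rule R; the constants hidden in the caps are likewise illustrative.

```python
import math

def bounded_memory_rpu_learn(sample_point, make_queries, compress, infers_label,
                             k, eps, delta):
    """Sketch of Algorithm 1 given a lossless compression scheme of size k.

    sample_point()            -- draws one unlabeled point from the distribution
    make_queries(points)      -- asks the expert all queries on `points`, returns the responses
    compress(points, answers) -- the LCS map W: returns (kept points, their responses), at most k points
    infers_label(answers, x)  -- label of x inferred from `answers`, or None if not inferred
    """
    round_cap = math.ceil(math.log(1.0 / (eps * delta)))      # T_1 = O(log(1/(eps*delta)))
    sample_cap = math.ceil(k * math.log(1.0 / delta) / eps)   # T_2 = O~(k log(1/delta) / eps)

    stored, answers = [], None   # C_i and the query responses on C_i (constant-size memory)
    drawn = 0

    def predictor(x):
        # Zero-error output: a label when it is inferred from the stored queries, else None ("don't know").
        return None if answers is None else infers_label(answers, x)

    for _ in range(round_cap):
        fresh = []
        # Step 1: rejection-sample k not-yet-inferred points, respecting the sample cap.
        while len(fresh) < k:
            x = sample_point()
            drawn += 1
            if drawn >= sample_cap:
                return predictor
            if predictor(x) is None:
                fresh.append(x)
        # Step 2: query everything currently in memory, then recompress losslessly.
        batch = stored + fresh
        stored, answers = compress(batch, make_queries(batch))
        # Step 3 is implicit: only the compressed set and its responses survive this round.
    return predictor
```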
It is further worth noting that O(log(1/ε)) query complexity is information-theoretically optimal for most non-trivial concept classes. As long as the class has poly(1/ε) concepts which are Ω(ε) separated, any active learner must make Ω(log(1/ε)) queries to distinguish between them [36].

We give three fundamental examples of classes which are either impossible or highly infeasible to RPU-learn with standard label queries, but have small, efficiently computable lossless compression schemes with respect to natural enriched queries: axis-aligned rectangles, decision trees, and halfspaces in two dimensions. As a result, these classes are all efficiently RPU-learnable with bounded memory. We briefly introduce each class, our proposed enriched queries, and discuss the implications on their learnability. We start our discussion with the simplest of the three, axis-aligned rectangles in R^d, which correspond to indicator functions for products of intervals over R:

R = [a_1, b_1] × · · · × [a_d, b_d], where a_i ≤ b_i.

While axis-aligned rectangles are impossible to RPU-learn in a finite number of label queries [34], we will show that the class has a small lossless compression scheme with respect to a natural query we call the "odd-one-out" oracle O_odd. Notice that axis-aligned rectangles essentially define a certain "acceptable" range for every feature—a point is labeled positive iff it lies in this range for all coordinates. Informally, given a point x ∈ R^d lying outside the rectangle, an "odd-one-out" query simply asks the user "why do you dislike x?". Concretely, one might imagine a chef is trying to cook a dish for a particularly picky patron. After each failed attempt, the chef may ask the patron what went wrong—perhaps the patron thinks the meat was overcooked! More formally, the "odd-one-out" query asks for a violated coordinate (i.e. a feature lying outside the acceptable range), and whether the coordinate was too large (in our example, overcooked) or too small (undercooked).

We prove that axis-aligned rectangles in R^d have an efficiently computable lossless compression scheme of size O(d), and thus are efficiently RPU-learnable in bounded memory with near-optimal query complexity.

Corollary 1.3 (Informal Corollary 4.2). The class of axis-aligned rectangles in R^d is RPU-learnable in only

q(ε, δ) = O(d log(1/(εδ)))

queries, O(d) memory, and time Õ(d² log(1/δ)/ε) when the learner has access to O_odd.
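As a concrete illustration of the informal description above, here is a minimal sketch of an odd-one-out oracle for a fixed rectangle. The list-of-intervals representation and the convention that a response (i, 1) means "coordinate i too large" and (i, 0) means "coordinate i too small" are our illustrative choices; a real expert would report just one violated coordinate rather than all of them.

```python
def odd_one_out(rect, x):
    """Odd-one-out oracle for an axis-aligned rectangle.

    rect: list of (a_i, b_i) intervals, one per coordinate.
    x:    point as a list of coordinates.
    Returns '*' if x lies inside the rectangle (positive label); otherwise the set
    of violated coordinates, tagged 0 for "too small" and 1 for "too large".
    """
    violations = set()
    for i, (a, b) in enumerate(rect):
        if x[i] < a:
            violations.add((i, 0))   # feature i too small
        elif x[i] > b:
            violations.add((i, 1))   # feature i too large
    return violations if violations else '*'

# Example: a "dish profile" rectangle over (saltiness, sourness).
rect = [(2.0, 5.0), (1.0, 3.0)]
print(odd_one_out(rect, [3.0, 2.0]))   # '*': inside the acceptable range
print(odd_one_out(rect, [6.0, 0.5]))   # {(0, 1), (1, 0)}: too salty, not sour enough
```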
While rectangles provide an excellent theoretical example of a class for which basic enriched queries break standard barriers in active and bounded memory learning, they are often too simple to be of much practical use. To this end, we next consider a broad generalization of the class of axis-aligned rectangles called decision trees. A decision tree over R^d is a binary tree where each node in the tree corresponds to an inequality:

x_i ≥ b  or  x_i ≤ b,

measuring the i-th feature (coordinate) of any x ∈ R^d. Each leaf in the tree is assigned a label, and the label of any x ∈ X is uniquely determined by the leaf resulting from following the decision tree from root to leaf, always taking the path determined by the inequality at each node. Informally, a decision tree may then be thought of as a partition of R^d into clusters (axis-aligned rectangles) given by the leaves. Our proposed enriched query for decision trees, the "same-leaf" oracle O_leaf, builds off this intuition. Given a decision tree T and two points x, x′ ∈ R^d, O_leaf(T, x, x′) determines whether x and x′ lie in the same leaf of the decision tree. Thinking of each leaf as a cluster, this query may be seen as a variant of the "same-cluster" query paradigm studied in many recent works [4, 58, 39, 1, 16, 13, 49]. For our scenario, think of asking a user whether two movies they like are of the same genre. We prove that the class of decision trees of size s (at most s leaves) has a small, efficiently computable lossless compression scheme and is therefore efficiently learnable in bounded memory.

Corollary 1.4 (Informal Corollary 4.4). The class of size-s decision trees over R^d is RPU-learnable in only

q(ε, δ) = O(ds log(1/(εδ)))

queries, O(ds) memory, and time Õ(d²s log(1/δ)/ε) when the learner has access to O_leaf.
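The same-leaf oracle just described is easy to picture in code. Below is a small sketch using a nested-dictionary tree representation of our own choosing; only the oracle's input/output behavior (two points in, one bit out) comes from the text.

```python
# A node is either a leaf {'leaf': id, 'label': 0 or 1} or an internal node
# {'feature': i, 'threshold': b, 'left': ..., 'right': ...} splitting on x[i] <= b.
# This representation is illustrative, not fixed by the paper.

def leaf_of(tree, x):
    """Follow the tree from root to leaf and return the leaf's identifier."""
    node = tree
    while 'leaf' not in node:
        node = node['left'] if x[node['feature']] <= node['threshold'] else node['right']
    return node['leaf']

def same_leaf(tree, x, y):
    """Same-leaf oracle O_leaf: do x and y end up in the same leaf of the tree?"""
    return leaf_of(tree, x) == leaf_of(tree, y)

# A size-3 tree over R^2: split on x[0] <= 0, then the right branch splits on x[1] <= 1.
tree = {'feature': 0, 'threshold': 0.0,
        'left': {'leaf': 0, 'label': 0},
        'right': {'feature': 1, 'threshold': 1.0,
                  'left': {'leaf': 1, 'label': 1},
                  'right': {'leaf': 2, 'label': 0}}}
print(same_leaf(tree, [1.0, 0.5], [2.0, 0.9]))   # True: both reach leaf 1
print(same_leaf(tree, [1.0, 0.5], [-1.0, 0.5]))  # False
```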
It is worth noting that while the class of decision trees of arbitrary size is not learnable (even in the PAC-setting [21]), we can bootstrap Theorem 1.2 and Corollary 1.4 to build an algorithm that learns decision trees attribute-efficiently. That is to say, an algorithm whose expected time, number of queries, and memory scale with the size of the unknown decision tree.
Corollary 1.5 (Informal Corollary 4.5). There exists an algorithm for RPU-learning decision trees over R^d which in expectation:
1. Makes O(ds log(s/(εδ))) queries
2. Runs in time poly(s, d, ε⁻¹, log(δ⁻¹))
3. Uses O(ds) memory,

where s is the size of the underlying decision tree.

Despite being vastly more expressive than axis-aligned rectangles, decision trees in R^d are still simplistic in the sense that they remain axis-aligned. For our final example, we study a fundamental class without any such restriction: (non-homogeneous) halfspaces in two dimensions. Recall that a halfspace in two dimensions is given by the sign of h = ⟨v, ·⟩ + b for some v ∈ R² and b ∈ R. Following a number of prior works [29, 33, 60, 61, 26, 27, 9], we study the learnability of halfspaces with comparison queries. Informally, given two points of the same sign, a comparison query simply asks which is further away from the separating hyperplane h = 0. This type of query is natural in scenarios like halfspaces where the class has an underlying ranking. One might ask a doctor, for instance, "which patient is sicker?". We prove that halfspaces in two dimensions have an efficiently computable lossless compression scheme of size O(1) with respect to comparison queries, and thus that they are efficiently RPU-learnable in bounded memory.

Corollary 1.6 (Informal Corollary 4.7). The class of halfspaces over R² is actively RPU-learnable in only

q(ε, δ) ≤ O(log(1/(εδ)))

queries, O(1) memory, and time O(log(1/(δε))/ε).
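For intuition, here is a minimal sketch of such a comparison oracle for a planar halfspace, assuming the expert compares (absolute) distances to the boundary; the representation and tie-breaking convention are illustrative choices, not fixed by the text.

```python
def comparison(v, b, x, y):
    """Comparison oracle for the halfspace sign(<v, .> + b) over R^2.

    For two points of the same sign, reports which lies further from the boundary
    <v, .> + b = 0 (ties broken toward the first point).
    """
    margin_x = abs(v[0] * x[0] + v[1] * x[1] + b)
    margin_y = abs(v[0] * y[0] + v[1] * y[1] + b)
    return 'x' if margin_x >= margin_y else 'y'

# Halfspace x_0 + x_1 - 1 >= 0; both points below are positive, and the first is further out.
print(comparison((1.0, 1.0), -1.0, (3.0, 3.0), (1.0, 1.0)))   # 'x'
```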
It should be noted that KLMZ [32] remark in their work that halfspaces in two dimensions should be learnable in bounded memory, but give no indication of how this might be done. This concludes the informal statement of our results, and we end the section with a roadmap of the remainder of our work, which formalizes and proves the above.

In Section 2, we discuss preliminaries, including our learning model and enriched queries, and related work. In Section 3, we formally introduce lossless sample compression and prove it is a sufficient condition for bounded memory active RPU-learnability. In Section 4, we provide three fundamental classifier families with efficiently computable lossless compression schemes with respect to basic enriched queries: axis-aligned rectangles, decision trees, and halfspaces in two dimensions. We conclude in Section 5 with a few comments on further directions.

We study a strong model of learning introduced by Rivest and Sloan [48] called
Reliable and Probably Useful Learning (RPU-Learning). Unlike the more standard PAC setting, RPU-Learning requires that the learner never makes an error. To compensate for this stringent requirement, the learner may respond "I don't know", denoted "⊥," on a small fraction of the space. In the standard, passive version of RPU-Learning, the learner has access to a sample oracle from an adversarially chosen distribution. The goal is to analyze the number of labeled samples required from this oracle to learn almost all inputs with high probability.

Definition 2.1 (RPU-Learning). A hypothesis class (X, H) is RPU-Learnable with sample complexity n(ε, δ) if for all ε, δ > 0 there exists a learning algorithm A such that for any choice of distribution D over X and h ∈ H, the learner is:

1. Probably useful: Pr_{S ∼ D^{n(ε,δ)}} [ Pr_{x ∼ D} [ A(S, h(S))(x) = ⊥ ] > ε ] < δ,
2. Reliable: ∀ S, x s.t. A(S, h(S))(x) ≠ ⊥, A(S, h(S))(x) = h(x),

where S, h(S) is shorthand for the set of labeled samples (x, h(x)) for x ∈ S.

In other words, the learner outputs a label with high probability, and never makes a mistake. This model of learning is substantially stronger than the more standard PAC model, which need only be approximately correct. In fact, it is known that RPU-learning with only labels often has infeasibly large (or even infinite) sample complexity [35, 34, 26]. Recently, KLMZ [32] proved that this barrier can be broken by allowing the learner to ask enriched questions, and gave an efficient algorithm for RPU-learning halfspaces in two dimensions (later extended in [26] to arbitrary dimensions with suitable distributional assumptions). While the algorithms in these works give a substantial improvement over previous impossibility results, they come with a practical caveat: reaching high accuracy guarantees requires an infeasible amount of storage. In this work we show not only how to build efficient RPU-learning algorithms for a broader range of queries and hypothesis classes, but also show that this strong model remains surprisingly feasible even in scenarios where memory is severely limited.
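The two conditions of Definition 2.1 are easy to check empirically on a finite test set. The sketch below is purely illustrative (it uses None as the stand-in for "⊥" and a toy 1D-threshold class of our own choosing); it is not part of the formal model.

```python
def empirical_rpu_check(predictor, target, test_points):
    """Empirically check the two conditions of Definition 2.1 on a finite test set.

    predictor(x) returns a label or None ("don't know"); target(x) is the true hypothesis h.
    Returns (reliable, abstain_rate): `reliable` is False if the predictor ever commits to a
    wrong label, and `abstain_rate` estimates the "not useful" mass Pr[predictor(x) = None].
    """
    abstentions, mistakes = 0, 0
    for x in test_points:
        guess = predictor(x)
        if guess is None:
            abstentions += 1
        elif guess != target(x):
            mistakes += 1
    return mistakes == 0, abstentions / len(test_points)

# A 1D-threshold example: the predictor only answers in regions it is sure about.
target = lambda x: 1 if x >= 0.5 else 0
predictor = lambda x: 1 if x >= 0.7 else (0 if x <= 0.3 else None)
print(empirical_rpu_check(predictor, target, [i / 10 for i in range(11)]))
# (True, ~0.27): never wrong, abstains on the uncertain middle region
```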
Unfortunately, even with the addition of enriched queries, it is not in general possible to RPU (or even PAC) learn in fewer than poly(1/ε) labeled samples. In cases where labels are prohibitively expensive (e.g. medical imagery), this creates a substantial barrier to learning even the simplest classes such as 1D-thresholds. To side-step this issue, we consider the well-studied model of active learning. Unlike the previously discussed (passive) models, an active learner receives unlabeled samples from the distribution and may choose whether or not to send each sample to a labeling oracle. The overall complexity of learning, called query complexity, is then measured not by the total number of samples, but by the number of calls to the labeling oracle required to achieve the guarantees laid out in Definition 2.1.

One might hope that the additional adaptivity allowed by active learning allows scenarios with high labeling cost to become feasible, and indeed it does in some basic or restricted scenarios, lowering the overall complexity from poly(1/ε) to poly(log(1/ε)) (see e.g. Settles' [50] or Dasgupta's [12] surveys, or Hanneke's book [23]). Unfortunately, it has long been known that active learning fails to give any substantial improvement for fundamental classes such as halfspaces, even in the weaker PAC-model [11]. We continue a recent line of work showing this barrier may be broken by the same technique that permits efficient RPU-learning: asking more informative questions.

Instead of having access only to a labeling oracle, the learners we discuss in this work will have the ability to ask a range of natural questions regarding the data. Enriched queries we discuss will be defined on a fixed input size we denote by k. A label query, for instance, is defined on a single point and thus has k = 1. A comparison between two points would have k = 2. While we will only consider binary labels, we will in general consider enriched queries of an arbitrary arity r denoting the total number of possible answers. Binary queries like labels or comparisons have r = 2, but some queries we consider like the "odd-one-out" oracle have larger arity. Finally, we will not assume that our queries have unique valid answers. Recalling again the "odd-one-out" query, an instance outside a rectangle may violate multiple constraints, and thus have multiple valid answers to the (informal) question "what is wrong with (example) x?"

To formalize these notions, we consider each type of query to be an oracle (function) of the form:

O : H × X^k → P([r]) \ {∅},

where P([r]) denotes the power set of [r] = {1, . . . , r}, and O(h, T) ⊆ [r] denotes the set of valid responses to the query represented by O on T under hypothesis h. Since in practice a user is unlikely to list all valid responses in O(h, T), we do not allow the learner direct access to the oracle response. Instead, when the learner queries T the adversary selects a valid response from O(h, T) to send back.

In this work, we consider hypothesis classes (X, H) endowed with a collection of oracles {O_i}_{i=1}^ℓ, which we collectively denote as the query set Q. While each oracle in Q is defined only on a certain fixed sample size, we will often wish to make every possible query associated to some larger sample S ⊂ X. In particular, given a hypothesis h ∈ H, we denote by Q_h(S) the set of all possible responses to queries on S (formally, this is the product space of valid responses to each possible query on S).
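For a small, hypothetical instance of this notation, take X = R, a 1D-threshold hypothesis, and the query set consisting of labels and pairwise comparisons of the underlying values; Q_h(S) then contains |S| label responses and (|S| choose 2) comparison responses. The function and key names below are illustrative.

```python
from itertools import combinations

def all_query_responses(points, threshold):
    """Q_h(S) for the 1D threshold h(x) = 1[x >= threshold] under labels + comparisons."""
    labels = {('label', x): int(x >= threshold) for x in points}
    comparisons = {('compare', x, y): (1 if x >= y else 0)   # 1 means "x is at least as large as y"
                   for x, y in combinations(points, 2)}
    return {**labels, **comparisons}

S = [0.2, 0.9, 0.5]
print(all_query_responses(S, threshold=0.4))
# 3 label responses and (3 choose 2) = 3 comparison responses
```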
We will generally think of the learner making queries like this in batches on a larger set S, and receiving some q(S) ∈ Q_h(S) from the adversary. Additionally, it will often be useful to consider the restriction of a given q(S) to queries on some subset S′ ⊂ S, which we denote by q(S)|_{S′}. For simplicity and when clear from context, we will write just q(S′) as shorthand for q(S)|_{S′}. While there may exist many q(S′) ∈ Q_h(S′) that are not equal to q(S)|_{S′}, we will generally be able to assume without loss of generality that our learners do not re-query anything in S′, which ensures the notation is well-defined.

Similar to the framework introduced in [32], we will often wish to analyze what information is inferred by a certain query response q(S) ∈ Q_h(S). Let (X, H) be a hypothesis class with associated query set Q = {O_i}. Given a sample S ⊂ X and query response q(S) ∈ Q_h(S), denote by H|_{q(S)} the set of hypotheses consistent with q(S). For any oracle O_i and appropriately sized subset T ⊂ X, we say that q(S) infers α ∈ O_i(h, T) if α is a valid query response with respect to every consistent hypothesis:

∀ h′ ∈ H|_{q(S)}, α ∈ O_i(h′, T).  (1)

It is worth noting that since O_i(h′, T) may contain multiple valid responses, it is possible that q(S) may infer several of them. As in the previous section, it will often be useful to consider this process in batches. In particular, given a query set Q, q(S) ∈ Q_h(S), and q(S′) ∈ Q_h(S′), we may wish to know when q(S) infers that q(S′) is a valid response in Q_h(S′) for all h ∈ H|_{q(S)}. In such cases, we say q(S) infers q(S′) and write:

q(S) → q(S′).

As in previous works [32, 31, 24, 27, 26], we pay particular attention to the case where O_i = L is the labeling oracle (notation we will use throughout). We introduce two important concepts for this special case. Given a sample S and query response q(S), it will be useful to analyze the set of points in X whose labels are inferred by q(S). We denote this set by I(q(S)). Similarly, it will be useful to analyze how much of X the set I(q(S)) covers (with respect to the distribution over X), which we call the coverage of q(S) and denote by Cov(q(S)). Finally, when dealing with labels we may wish to restrict the scope of our inference for the sake of computational efficiency. In such scenarios, we will define an inference rule R, which for each query response q(S) determines some subset S ⊆ I_R(q(S)) ⊆ I(q(S)). We let Cov_R denote the coverage with respect to the rule R, and T_{I_R}(n) denote the inference time under rule R—that is, the worst-case time across x ∈ X, S ⊂ X, and q(S) ∈ Q_h(S) to determine whether x ∈ I_R(q(S)). Finally, we call the rule R efficiently computable if T_{I_R}(n) is polynomial in n. When R is trivial (i.e. ∀ q(S) : I_R(q(S)) = I(q(S))), we drop it from all notation.

The main focus of our work lies in understanding not only when active learning can achieve exponential improvement over passive learning, but when this can be done by a learner with limited memory. Previous works studying active learning with enriched queries mostly focus on the pool-based model, where the learner receives a large batch of unlabeled samples rather than access to a sampling oracle.
In this case, the implicit assumption is that the learner may query any subset of samples from the pool, but this requires the learner to use a large amount of storage.

Adapting definitions from the passive learning literature [25, 17, 3], we define a new, more realistic model for active learning with enriched queries in which the learner may only store some finite number of points at a time. The learner's memory consists of two tapes, the query and work tapes, and two counters, the sample and query counters. The query tape stores points in the instance space X that the learner has saved and may wish to query. The work tape, on the other hand, stores bits which provide any extra information about these points needed for computation—typically this entails query responses, but we will see cases where other types of information are useful as well. The sample and query counters, true to name, track the total number of unlabeled samples drawn and queries made by the algorithm at any given step.

We note that the complexity of the query tape is measured in the number of points stored there at any given time, rather than in the total number of bits required to represent it. This matches early works on bounded-memory learning in the passive regime [25, 17, 3] and is necessary due to the fact that we are interested mainly in working over infinite instance spaces like the reals (where representing even a single point may take infinite bits). It is also worth noting that this avoids the fact that different representations of data may have different bit complexities—given a certain representation of the data (say with finite bit-complexity), it is easy to convert our memory bounds if desired to a model counting only bits.

We now discuss our model in greater depth. Given the query and work tapes, the learner may choose at each step from the following options:

1. Sample a point and add it to the query tape.
2. Remove a point from the query tape.
3. Query any subset of points on the query tape, writing the results on the work tape.
4. Write or remove a bit from the work tape.

Further, as in previous work [3], we allow the action taken by the algorithm at any step to depend on the contents of the query and sample counters. This can be formalized in one of two ways. The first, considered implicitly by Ameur et al. [3], is to think of the algorithm as governed by a non-uniform transition function that may depend on the entire content of the sample and query counters. We will generally take this view throughout the paper since it is simpler, but if one wishes to use a uniform model of computation, another method is to allow the algorithm to run "simple" randomized procedures that only take about loglog space in the size of the counter. Since in general our algorithms use at most n = poly(ε⁻¹, log(1/δ)) samples, the latter view essentially allocates a special block of O(log log(1/ε) + log log(1/δ)) memory to deal with the counter. (In a bit more detail, we can use approximate counting to probabilistically estimate the counter up to a small constant factor with probability at least 1 − δ in space O(log(log(n)) + log(log(1/δ))) [43], which explains the discrepancy in dependence on δ in these two expressions.) We note that because we allow this procedure to be randomized, this version of the model requires expected rather than worst-case bounds on query and computational complexity.

With this in mind, we say that a hypothesis class is RPU-learnable with bounded memory if there exists an RPU-learner for the class whose query and work tapes never exceed some constant M(X, H) length independent of the learning parameters ε and δ. At a finer grain level, we say that such a class is learnable with memory M(X, H).
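One illustrative way to organize the state just described, a query tape of points, a work tape of bits, and the two counters, together with the four allowed operations listed above, is the following sketch; the class and method names are ours and the oracles are assumed to be supplied by the environment.

```python
class BoundedMemoryLearner:
    """Illustrative container for the learner's state: a query tape holding points,
    a work tape holding bits, and the sample/query counters."""

    def __init__(self, sample_oracle, query_oracle):
        self.query_tape = []      # stored points (measured in number of points)
        self.work_tape = []       # stored bits (query responses, bookkeeping)
        self.sample_counter = 0   # total unlabeled samples drawn so far
        self.query_counter = 0    # total queries made so far
        self._sample = sample_oracle
        self._query = query_oracle

    def sample_point(self):                      # option 1
        x = self._sample()
        self.sample_counter += 1
        self.query_tape.append(x)
        return x

    def remove_point(self, index):               # option 2
        self.query_tape.pop(index)

    def query_points(self, indices):             # option 3
        points = [self.query_tape[i] for i in indices]
        response = self._query(points)           # assumed to return a tuple/list of bits
        self.query_counter += 1
        self.work_tape.extend(response)
        return response

    def write_bit(self, bit):                    # option 4
        self.work_tape.append(bit)
```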
We emphasize that as in [3], we do not measure the memory usage of the counter. Indeed, it is not hard to see that some sort of counter or memory scaling with ε and δ is necessary to give a stopping condition for the learner. Finally, we note that while previous techniques such as the inference dimension algorithm of [32] can certainly be modified to fit into the above framework, they do not result in bounded memory learners, requiring storage that scales with ε and δ in both the query and work tapes.

Bounded memory learning in the sense we consider was first introduced in an early work by Haussler [25], who showed the existence of passive PAC-learners for restricted classes of decision trees and basic functions on R such as finite unions of intervals with memory independent of the accuracy parameters ε and δ. Floyd [17], and later Ameur, Fischer, Hoffgen, and Meyer auf der Heide [3], extended Haussler's work to a more general setting. Since then, various notions of memory-bounded learning (e.g. memory independent of ε, δ, learning a finite stream, storing only bits, time-space tradeoffs, etc.) have seen a substantial amount of study [51, 54, 46, 40, 42, 41, 7, 47, 19, 52, 10, 20, 5]. To our knowledge, however, the subject has seen little to no work within the active learning literature, though a few do consider query settings beyond just labels including statistical [54, 20] and equivalence queries [2].

Active learning with enriched queries has become an increasingly popular alternative to the standard model in cases like halfspaces where strong lower bounds prevent adaptivity from providing a significant advantage over standard passive learning. While most prior works in the area consider specific examples of enriched queries such as comparisons [29, 33, 60, 61, 26, 27, 9], cluster-queries [4, 59], mistake queries [6], and separation queries [24], our work is more closely related to the general paradigm for enriched query active learning introduced by KLMZ [32]. In their work, KLMZ introduce inference dimension, a combinatorial parameter that exactly characterizes when a concept class is actively learnable in O(log(1/ε)) rather than O(1/ε) queries. Lossless sample compression can be seen as a strengthening of inference dimension (albeit extended to a richer regime of queries than that considered in [32]) which implies both O(log(1/ε)) query complexity and bounded memory. This partially answers an open question posed by KLMZ, who asked whether any class with finite inference dimension has a bounded memory learner.

In this section, we prove that lossless sample compression is sufficient for query-optimal, bounded memory RPU-learning. While we introduced the concept of lossless compression in Section 1, we are now in position to state the full formal definition.
Definition 3.1 (Lossless Compression Schemes). We say a hypothesis class (X, H) has a lossless compression scheme (LCS) of size k with respect to a query set Q and inference rule R if for all classifiers h ∈ H, monochromatic subsets S ⊂ X (a subset S is monochromatic with respect to a classifier h if it consists entirely of one label), and any q(S) ∈ Q_h(S), there exists W = W(q(S)) ⊆ S of size at most k such that:

I_R(q(S)) = I_R(q(S)|_W).

Let T_C(n) denote the worst-case time required to determine such a subset. We call the LCS efficiently computable if T_C(n) is polynomial in n. If no inference rule R is stated, it is assumed to be the trivial rule.

It is worth briefly noting the intuition behind requiring compression only for monochromatic sets. The general idea is that in RPU-learning it is sufficient to be able to learn X restricted to the set of positive and negative points. While this strategy fails in the weaker PAC-model (the learner could output the all-1's function for positive points, for instance), the fact that RPU-learners cannot make mistakes circumvents this issue.

We will begin by proving that lossless compression implies sample-efficient passive RPU-learning, and then show how this can be leveraged to give query-efficient and bounded-memory active learning. In fact, if all one is interested in is the former, lossless sample compression is needlessly strong. As a result, we start by analyzing a strictly weaker variant that lies in between standard and lossless sample compression, and bears close similarities to KLMZ's [32] theory of inference dimension.

Definition 3.2 (Perfect Compression Schemes (PCS)). We say a hypothesis class (X, H) with corresponding query set Q has a perfect compression scheme (PCS) of size k with respect to inference rule R if for all subsets S ⊂ X, h ∈ H, and any q(S) ∈ Q_h(S) there exists T = T(q(S)) ⊆ S of size at most k such that:

S ⊆ I_R(q(S)|_T).

In other words, a perfect compression scheme need only preserve the labels of the sample itself; unlike lossless compression, it is not required that all points inferred by queries on the original sample are also preserved. Rather than directly analyzing the effect of PCS on active learning, we first prove as an intermediate theorem that such schemes are sufficient for near-optimal passive RPU-learnability. Combining this fact with the basic boosting procedure of [32] gives query-optimal (but not bounded memory) active RPU-learnability. The argument for the passive case closely follows the seminal work of Floyd and Warmuth [18, Theorem 5] on learning via sample compression.

Theorem 3.3.
Let (X, H) have a perfect compression scheme of size k with respect to inference rule R. Then the sample complexity of passively RPU-learning (X, H) is at most

n(ε, δ) ≤ O((k log(1/ε) + log(1/δ))/ε).

Proof.
Let h ∈ H be an arbitrary classifier. Our goal is to upper bound by δ the probability that for a random sample S, |S| = m, there exists q(S) ∈ Q_h(S) such that Cov_R(q(S)) ≤ 1 − ε. Notice that it is equivalent to prove that for some fixed worst-case choice of q(S) (minimizing coverage across each sample):

Pr_S [ Cov_R(q(S)) ≤ 1 − ε ] ≤ δ.

Given this formulation, the argument proceeds along the lines of [18, Theorem 5]. It is enough to upper bound by δ the probability across samples S of size m that there exists T ⊂ S of size at most k such that:

1. S ⊆ I_R(q(S)|_T)
2. Cov_R(q(S)|_T) < 1 − ε

Since the PCS implies that every set S has a subset T satisfying the first condition, if both hold only with some small probability then the following basic algorithm A suffices to RPU-learn with sample complexity m: on input x ∈ X, A(S)(x) checks whether all h ∈ H consistent with q(S) satisfy L(h, x) = z for some z ∈ {0, 1} (note that this process is class-dependent). If this property holds, the algorithm outputs z. Otherwise, the algorithm outputs "⊥." Since with probability at least 1 − δ the coverage of q(S) is at least 1 − ε, this algorithm will label at least a 1 − ε fraction of the space while never making a mistake.

Proving this statement essentially boils down to a double sampling argument. The idea is to union bound over sets of indices in [m] of size up to k, noting that in each case the remaining m − k points can be treated independently. In greater detail, for I ⊂ [m], denote by B_I the set of samples S which are inferred by the subsample with indices given by I, that is:

B_I = { S = {s_1, . . . , s_m} : S ⊆ I_R(q(S)|_{{s_i}_{i ∈ I}}) }.

On the other hand, let U_I denote samples where the coverage of the subsample given by I is worse than 1 − ε:

U_I = { S = {s_1, . . . , s_m} : Cov_R(q(S)|_{{s_i}_{i ∈ I}}) < 1 − ε }.

Notice that the intersection of B_I and U_I is exactly what we are trying to avoid. By a union bound, the probability that we draw a sample such that there exists a subset T satisfying 1 and 2 is then at most:

Σ_{I ⊂ [m], |I| ≤ k} Pr_S [ S ∈ B_I ∩ U_I ].

It is left to bound the probability for fixed I ⊂ [m] of drawing a sample in B_I ∩ U_I. Since we are sampling i.i.d., we can think of independently sampling the coordinates in and outside of I. If the samples given by I have coverage at least 1 − ε, we are done. Otherwise, the probability that we draw remaining samples that are all in the coverage is at most (1 − ε)^{m−|I|}. Thus we get that the event is bounded by:

(1 − ε)^{m−k} Σ_{i=1}^{k} (m choose i),

which is at most δ for m = O((k log(1/ε) + log(1/δ))/ε). This mirrors the classical sample complexity of PAC-learning via sample compression (with the log(1/ε) factor recently removed by Hanneke [22] in the latter case).
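As a quick numeric sanity check (not part of the proof), one can compute the smallest m for which the displayed failure probability drops below δ and observe the predicted growth in k, 1/ε, and log(1/δ). The script below is illustrative only.

```python
from math import comb

def sample_size_for(k, eps, delta):
    """Smallest m with (1 - eps)^(m - k) * sum_{1 <= i <= k} C(m, i) <= delta,
    the failure bound from the proof of Theorem 3.3."""
    m = k + 1
    while (1 - eps) ** (m - k) * sum(comb(m, i) for i in range(1, k + 1)) > delta:
        m += 1
    return m

# The required m grows roughly like (k log(1/eps) + log(1/delta)) / eps:
for eps in (0.1, 0.05, 0.01):
    print(eps, sample_size_for(k=4, eps=eps, delta=0.05))
```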
Further, since perfect sample compression is preserved over subsets of the instance space (a PCS for X is also a PCS when restricted to S ⊂ X), a bound on passive RPU-learning immediately implies efficient active learning via a modification to the basic boosting strategy for finite instance spaces of [32, Theorem 3.2].

Theorem 3.4 (Active Learning). Let (X, H) have a perfect compression scheme of size k with respect to query set Q and inference rule R. Then (X, H) is actively RPU-learnable in only

q(ε, δ) ≤ O(b(6k) log(1/(εδ)))

queries, and

T(ε, δ) ≤ O( (t(6k) + k·T_{I_R}(60k log(1/(εδ)))·log(k/(εδ)))/ε · log(1/(εδ)) )

time, where b(n) is the worst-case number of queries needed to infer some valid q(S) ∈ Q_h(S) across all h ∈ H and |S| = n, and t(n) is the worst-case time.

Proof. For notational convenience, let X′ be a copy of X we use to track un-inferred points throughout our algorithm. Consider the following strategy:

1. Sample S of size 6k from X′ (note this may require taking many samples from X in later rounds)
2. Infer some q(S) ∈ Q_h(S) via queries on S
3. Restrict X′ to points whose labels are not inferred by q(S) (i.e. remove any x ∈ X′ s.t. q(S) → L(h, x)); in reality this is done by rejection sampling in step 1: when a point is drawn, we check whether it is inferred by queries on any previous sample.
4. Repeat O(log(1/(εδ))) times, or until n = O(k log(k/(εδ)) log(1/(εδ))/ε) total points have been drawn from X.

Notice that for either of these stopping conditions, one of two statements must hold:

1. The algorithm has performed at least O(log(1/(εδ))) rounds.
2. The algorithm has drawn O(log(n/δ)/ε) inferred samples in a row.

We argue both of these conditions imply that the coverage of the learner is at least 1 − ε with probability at least 1 − δ. For the first condition, notice that Theorem 3.3 implies the coverage of q(S) on a random sample of size 6k is at least 1/2 with probability at least 1/2. Call a round 'good' if it has coverage at least 1/2. It is sufficient to prove we have at least log(1/ε) good rounds with probability at least 1 − δ. Since each round can be thought of as an independent process, this follows easily from a Chernoff bound. For the latter condition, notice that the probability that O(log(n/δ)/ε) inferred samples appear in a row (at some fixed point) when the coverage is less than 1 − ε is at most O(δ/n). Union bounding over samples implies the algorithm has the desired coverage guarantees.

Finally, we compute the query and computational complexity. The former follows immediately from noting that we make at most b(6k) queries in each round, and run at most O(log(1/(εδ))) rounds. The latter bound stems from noting that each round takes at most t(6k) time to compute queries, and each sample requires at most T_{I_R}(ck log(1/(εδ))) time for inference for some 60 > c > 0.

A similar result may also be proved by defining a suitable extension of inference dimension to our generalized queries and noting that a perfect compression scheme implies finite inference dimension. It is also worth noting that in most cases of interest the above algorithm will also be computationally efficient, as the cost of compression, querying, rejection sampling, and inference tend to be fairly small (and are in all examples we consider). Finally, we show how Theorem 3.4 can be modified with the addition of an efficiently computable lossless compression scheme to give query-optimal, computationally efficient, and bounded memory RPU-learning.

Theorem 3.5 (Bounded Memory Active Learning). Let (X, H) have an LCS of size k with respect to query set Q and inference rule R, and let B(n) denote the maximum number of bits required to express any q(S) ∈ Q_h(S) for S ⊂ X of size n and h ∈ H. Then (X, H) is actively RPU-learnable in only:

q(ε, δ) ≤ O(b(6k) log(1/(εδ))) queries, M(X, H) ≤ O(B(7k)) memory, and

T(ε, δ) ≤ O( (t(7k) + T_C(7k) + k·T_{I_R}(k)·log(k/(εδ)))/ε · log(1/(εδ)) )

time.

Proof. We follow the overall strategy laid out in Theorem 3.4 with two main modifications. First, instead of storing subsamples and queries from all previous rounds to use for rejection sampling, each sub-sample will be merged in between rounds using the guarantees of lossless sample compression. Without this change, the strategy of Theorem 3.4 would require O(B(6k) log(1/(εδ))) memory from the buildup across rounds.
Second, we will learn the set of positive and negative points separately, since the LCS only guarantees compression for monochromatic subsamples.

More formally, assume for the moment that we draw samples from the distribution restricted to positive (or negative) labels. Following the strategy of Theorem 3.4, let the sample of size 6k drawn at step i be denoted by S_i, the points stored in the query tape at the start of step i by C_{i−1}, and their corresponding query response q(C_{i−1}). The existence of an LCS of size k implies that for any h ∈ H and q(S_i ∪ C_{i−1}) ∈ Q_h(S_i ∪ C_{i−1}), there exists a subset C_i ⊂ S_i ∪ C_{i−1} of size at most k such that:

I_R(q(C_i)) = I_R(q(S_i ∪ C_{i−1})),

where we recall q(C_i) is shorthand for q(S_i ∪ C_{i−1})|_{C_i}. Further, since restrictions have strictly less information (that is, for all S, S′ ⊆ X and q(S ∪ S′) ∈ Q_h(S ∪ S′), I_R(q(S)) ∪ I_R(q(S′)) ⊆ I_R(q(S ∪ S′))), we may write:

I_R(q(S_i)) ∪ I_R(q(C_{i−1})) ⊆ I_R(q(S_i ∪ C_{i−1})) = I_R(q(C_i)).

By induction on the step i, q(C_{i−1}) infers the label of every point in I_R(q(S_j)) for j < i, and therefore:

∪_{j=1}^{i} I_R(q(S_j)) ⊆ I_R(q(C_i)).

Since the coverage guarantees of Theorem 3.4 at a given step i rely only on this left-hand union, following the same strategy (plus compression) still results in at least 1 − ε coverage with probability at least 1 − δ in only O(log(1/(εδ))) rounds. Further, since we merge our stored information every round into C_i and q(C_i), we never exceed storage of 7k points and O(B(7k)) bits. The analysis of query and computational complexity follows the same lines as in Theorem 3.4 with the addition of compression time.

Finally, we argue that we may learn the general distribution by separately applying the above to the set of positive and negative samples. Namely, at step i we sample from the remaining un-inferred points until we receive either 6k positive or 6k negative samples, and apply the standard algorithm on whichever is reached first. Assume at step i that the measure of remaining positive points is at least that of the remaining negative (the opposite case will follow similarly). Then the probability that we draw 6k positive points before 6k negative points is at least 1/2. Recall that such a positive sample has coverage at least 1/2 over the remaining positive points with probability at least 1/2. Moreover, since the positive samples make up at least 1/2 of the space, the coverage of each round over the entire distribution is at least 1/4 with probability at least 1/4 (factoring in the probability we draw the majority sign). Achieving the same guarantees as above then simply requires running a small constant times as many rounds. Thus there is no asymptotic change to any of the complexity measures and we have the desired result.

We note that in all examples considered in this paper, b(n), T_{I_R}(n), and T_C(n) are at worst quadratic, resulting in computationally efficient algorithms. Finally, it is worth briefly discussing the fact that it is possible to remove the query counter in the proof of Theorem 3.5 if one is willing to measure expected rather than worst-case query-complexity. This mainly involves running the same algorithm using only the sample cutoff and analyzing the probability that the algorithm makes a large number of queries.
In this section we cover three examples of fundamental classes with small, efficiently computable lossless compression schemes: axis-aligned rectangles, decision trees, and halfspaces in two dimensions. In each of these cases standard lower bounds show that without additional queries, active learning provides no substantial benefit over its passive counterpart [11]. Furthermore, these classes cannot be actively RPU-learned at all (each requires an infinite number of queries) [27]. Thus we see that the introduction of natural enriched queries brings learning from an infeasible or even impossible state to one that is highly efficient, even on a very low memory device.
Despite being one of the most basic classes in machine learning, axis-aligned rectangles provide an excellent example of the failure of both standard active and RPU-learning. In this section we show that the introduction of a natural enriched query, the odd-one-out query, completely removes this barrier, providing a query-optimal, computationally efficient RPU-learner with bounded memory. Recall that axis-aligned rectangles over R^d are the indicator functions corresponding to products of intervals over R:

R = [a_1, b_1] × · · · × [a_d, b_d]

for a_i ≤ b_i. Additionally we allow a_i to be −∞, and b_i to be ∞ (though in this case the interval should be open). In other words, thinking of each coordinate in R^d as a feature, axis-aligned rectangles capture scenarios where each individual feature has an acceptable range, and the overall object is acceptable if and only if every feature is acceptable. For instance, one might measure a certain dish by flavor profile, including features such as saltiness, sourness, etc. If the dish is too salty, or too sour, the diner is unlikely to like it independent of its other features. In this context, the "odd-one-out" query asks the diner, given a negative example, to pick a specific feature they did not like, and further to specify whether the feature was either missing, or too present. Perhaps the dish was too sour, or needed more umami.

More formally, the odd-one-out oracle O_odd : H × X → P(([d] × {0, 1}) ∪ {∗}) on input h = [a_1, b_1] × · · · × [a_d, b_d] and x ∈ R^d outputs {∗} if L(h, x) = 1, and otherwise outputs the set of pairs (i, 1) such that x_i > b_i and (i, 0) such that x_i < a_i.

Proposition 4.1.
The class of axis-aligned rectangles over R^d has a lossless compression scheme of size at most 2d with respect to O_odd.

Proof. Recall that lossless sample compression separately examines the set of positively and negatively labeled points. We first analyze the positive samples—those which lie inside the axis-aligned rectangle. In this case, notice that we may simply select a subsample T of size at most 2d which contains points with the maximal and minimal value at every coordinate. Since it is clear that T infers all points inside the rectangle R(T) it spans, it is enough to argue that S cannot infer any point outside R(T). This follows from the fact that both R(T) and the all-1's function are consistent with queries on S.

For the case of a negatively labeled sample, notice that if any two x, x′ ∈ X have the same "odd-one-out" response (i, b), one must infer the response of the other. Informally, this is simply because the odd-one-out query measures whether some feature is too large or too small. If two points are too large in some coordinate, say, then the point with the smaller feature infers this response in the other point. More formally, assume without loss of generality that x_i ≤ x′_i. Then if the query response is (i, 0), q(S)|_{x′} → q(S)|_x. Likewise, if the response is (i, 1), then q(S)|_x → q(S)|_{x′}. Thus for each of the 2d query types we need only a single point to recover all query information for the original sample. Since no information is lost, this compressed set clearly satisfies the conditions of the desired LCS.
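A minimal sketch of the compression map from Proposition 4.1 follows. The function names, the list-of-points representation, and the assumption that each negative point comes with a single expert-chosen response (i, b) are our illustrative choices.

```python
def compress_positive(points):
    """Keep, for every coordinate, a point attaining the minimum and the maximum value;
    these at most 2d points span the same rectangle as the full positive sample."""
    d = len(points[0])
    keep = set()
    for i in range(d):
        keep.add(min(range(len(points)), key=lambda j: points[j][i]))
        keep.add(max(range(len(points)), key=lambda j: points[j][i]))
    return [points[j] for j in sorted(keep)]

def compress_negative(points, responses):
    """Keep one representative per distinct odd-one-out response (i, b): the point whose
    violated coordinate is least extreme, since its response infers that of every other
    point sharing the same response."""
    best = {}
    for x, (i, b) in zip(points, responses):
        if (i, b) not in best:
            best[(i, b)] = x
        elif (b == 1 and x[i] < best[(i, b)][i]) or (b == 0 and x[i] > best[(i, b)][i]):
            best[(i, b)] = x
    return list(best.values())

# Negative examples for the rectangle [2,5] x [1,3], with the expert's chosen responses:
pts  = [[6.0, 2.0], [8.0, 2.5], [3.0, 0.2]]
resp = [(0, 1),     (0, 1),     (1, 0)]     # "too salty", "too salty", "not sour enough"
print(compress_negative(pts, resp))          # keeps [6.0, 2.0] and [3.0, 0.2]
```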
As an immediate corollary, we get that axis-aligned rectangles are actively RPU-learnable with near-optimal query complexity and bounded memory. Since the compression scheme and inference rule are efficiently computable, the algorithm is additionally computationally efficient.

Corollary 4.2. The class of axis-aligned rectangles in R^d is RPU-learnable in only

q(ε, δ) = O(d log(1/(εδ)))

queries, O(d) memory, and time O(d² log(d/(εδ)) log(1/(εδ))/ε) when the learner has access to O_odd.

Proof. The first two statements follow immediately from Theorem 3.5 and noting that b(n) ≤ O(n). The runtime guarantee is slightly more subtle, and follows from noting that the compression time satisfies T_C(n) ≤ O(dn), and that with a suitable representation of the compressed set, inference for a point x ∈ R^d only requires checking the value of each coordinate (in essence, we may act as though T_I(d) ≤ O(d)).

Fixing parameters other than ε, we note that the optimal query complexity for axis-aligned rectangles (and all following examples) is Ω(log(1/ε)). This follows from standard arguments [36], and can be seen simply by noting that there exists a distribution with at least Ω(1/ε) ε-pairwise-separated concepts—the bound comes from noting that each query only provides O(1) bits of information.

While axis-aligned rectangles provide a fundamental example of a classifier for which enriched queries break the standard barriers of active RPU-learning and bounded memory, they are too simple a class in practice to model many situations of interest. In this section we consider a broad generalization of axis-aligned rectangles that is not only studied extensively in the learning literature [14, 37, 44, 30, 8], but also used frequently in practice [45, 55, 28, 53]: decision trees. In this setting we consider a natural enriched query which roughly falls into a paradigm known as same-cluster queries [4], which determine whether a given set of points lie in a "cluster" of some sort. Variants of this query have seen substantial study in the past few years in both clustering and learning [58, 39, 1, 16, 49] after their recent formal introduction by Ashtiani, Kushagra, and Ben-David [4].

More formally, recall that a decision tree over R^d is a binary tree where each node in the tree corresponds to an inequality:

x_i ≥ b  or  x_i ≤ b,

measuring the i-th feature (coordinate) of any x ∈ R^d. Each leaf in the tree is assigned a label, and the label of any x ∈ X is uniquely determined by the leaf resulting from following the decision tree from root to leaf, always taking the path determined by the inequality at each node. We measure the size of a decision tree by counting its leaves. We introduce a strong enriched query for decision trees in the same-cluster paradigm called the same-leaf oracle O_leaf. In particular, given a decision tree h and two points x, x′, the same-leaf query on x and x′ simply determines whether x and x′ lie in the same leaf of h. It is worth noting that the same-leaf oracle can be seen as a strengthening of Dasgupta, Dey, Roberts, and Sabato's [13] "discriminative features," a method of dividing a decision tree into clusters, each of which should have some discriminating feature. Same-leaf queries take this idea to its logical extreme where each leaf should be thought of as a separate cluster.

We will show that same-leaf queries have a small and efficiently computable lossless compression scheme with respect to an efficient inference rule R_rect, and therefore have query-optimal and computationally efficient bounded-memory RPU-learners.
The inference rule R_rect is a simple restriction where x ∈ I_{R_rect}(q(S)) if and only if x lies inside a rectangle spanned by a mono-leaf subset T ⊆ S, that is, a subset T for which ∀ x, x′ ∈ T, O_leaf(h, x, x′) = 1. In other words, we infer information independently for each leaf.

Proposition 4.3.
The class of size-s decision trees over R^d has an LCS of size at most 2ds with respect to O_leaf and R_rect.

Proof. Notice that any leaf of a decision tree corresponds to an axis-aligned rectangle in R^d. Further, for any S ⊂ R^d, q(S) allows us to partition S into subsamples sharing the same leaf. Fix one such subsample and denote it by S_L. By the same argument as Proposition 4.1, there exists T ⊆ S_L of size at most 2d such that S_L lies entirely within the rectangle spanned by T. Since these are exactly the points inferred by S in that leaf under inference rule R_rect, repeating this process over all s leaves gives the desired LCS.
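A sketch of the corresponding compression map: partition the sample by leaf using same-leaf queries, then keep the (at most 2d) spanning points of each group. Here same_leaf is assumed to be a two-argument closure over the hidden tree (e.g. lambda x, y: O_leaf(h, x, y)); all names are illustrative.

```python
def group_by_leaf(points, same_leaf):
    """Partition the sample into leaves using only same-leaf queries: each point is
    compared against one representative of every group found so far (O(n*s) queries)."""
    groups = []
    for x in points:
        for g in groups:
            if same_leaf(g[0], x):
                g.append(x)
                break
        else:
            groups.append([x])
    return groups

def spanning_points(points):
    """At most 2d points attaining the minimum and maximum in every coordinate."""
    d = len(points[0])
    keep = set()
    for i in range(d):
        keep.add(min(range(len(points)), key=lambda j: points[j][i]))
        keep.add(max(range(len(points)), key=lambda j: points[j][i]))
    return [points[j] for j in sorted(keep)]

def compress_by_leaf(points, same_leaf):
    """Lossless compression for R_rect: keep at most 2d spanning points per leaf."""
    return [x for group in group_by_leaf(points, same_leaf) for x in spanning_points(group)]

# Usage with the earlier same-leaf sketch: compress_by_leaf(sample, lambda x, y: same_leaf(tree, x, y))
```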
Corollary 4.4. The class of size-s decision trees is RPU-learnable in only q(ε, δ) = O(ds log(1/(εδ))) queries, O(ds) memory, and time O(ds log(ds/(εδ)) log(1/(εδ))/ε) when the learner has access to O_leaf.

Proof. The proof follows similarly to Corollary 4.2. It is not hard to see that T_C(n) is at worst quadratic. Determining queries on a set of size n reduces to grouping points into s buckets (each standing for some leaf), which can be done in b(n) ≤ O(ns) queries. While T_{I_{R_rect}}(n) would technically require finding the rectangles implied by the given sample and checking x against each one, the former can be thought of as pre-processing since it need only be done once per round. With this process removed, checking x takes only O(d) time, which gives the desired result.

While it is not possible to learn decision trees of arbitrary size in the standard learning models [21], one might hope that Corollary 4.4 can be used to build a learner for this class with nice guarantees. Indeed this is possible if we weaken our memory and computational complexity measures to be expected rather than worst-case. We will show that in this regime, Corollary 4.4 can be bootstrapped to build an attribute-efficient algorithm for RPU-learning decision trees, that is, one in which the expected memory, query, and computational complexity scale with the (unknown) size of the underlying decision tree. Corollary 4.5.
There exists an algorithm for RPU-learning decision trees over R^d which in expectation:
1. Makes O(ds log(s/(εδ))) queries,
2. Runs in time poly(s, d, ε^{-1}, log(δ^{-1})),
3. Uses O(ds) memory,
where s is the size of the underlying decision tree.

Proof. For simplicity, we use Corollary 4.4 as a blackbox for an increasing "guess" s′ on the size of the decision tree. After each application of Corollary 4.4, we test its coverage empirically. If too many points go un-inferred, we double our guess for s′ and continue the process. More formally, we start by setting our guess s′ to 2. The learner then applies Corollary 4.4 with accuracy parameters δ′ = δ/s′ and ε′ = O(ε/(s′ log(s′/δ))). For ease of analysis, we assume that each application of Corollary 4.4 uses some fixed number of samples N_{s′} (given by its corresponding sample complexity). This can be done by arbitrarily inflating the number of samples drawn if the query cutoff is reached before the sample cutoff (and likewise for the query counter if the sample cutoff is reached first). Fixing this quantity allows the algorithm to "know" what step it is in by checking the sample and query counters. (More formally, this just means there is a well-defined transition function for the algorithm which relies on the counters to move between potential tree sizes.)

After each application of Corollary 4.4, we draw M_{s′} = O(log(s′/δ)/ε) points. If a single point in this sample is un-inferred by the output of Corollary 4.4, we double s′ and continue the process. Otherwise, we output the classifier given by Corollary 4.4 in that round. Notice that tracking the existence of an un-inferred point in this sample takes only a single bit, and moreover that since the sample size of each round is fixed to be N_{s′} + M_{s′}, the algorithm can still use the counters to track its position at any step. Finally, we note that for a given s′, if the algorithm does not have coverage at least 1 − ε, it aborts with probability at most δ/s′ by our choice of M_{s′}. Since s′ starts at 2 and doubles each round, a union bound gives that the probability the algorithm ever aborts with coverage worse than 1 − ε is at most δ, as desired.
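The guess-and-double wrapper described above might look as follows (a rough Python sketch; `learn_fixed_size`, `draw_point`, and `is_inferred` are hypothetical stand-ins for the Corollary 4.4 learner, the example oracle, and the coverage check, and the exact parameter settings are only meant to mirror the ones chosen in the proof):

```python
import math

def rpu_learn_unknown_size(learn_fixed_size, draw_point, is_inferred, eps, delta):
    """Bootstrap a fixed-size learner into one whose expected cost scales
    with the (unknown) tree size: run it with a guessed size, test coverage
    on a fresh sample, and double the guess whenever a point goes un-inferred."""
    s_guess = 2
    while True:
        eps_prime = eps / (s_guess * math.log2(2 * s_guess / delta))
        delta_prime = delta / s_guess
        clf = learn_fixed_size(s_guess, eps_prime, delta_prime)
        # coverage test: a single un-inferred point triggers doubling,
        # so only one bit of state is needed beyond the counters
        m = math.ceil(math.log(s_guess / delta) / eps) + 1
        if all(is_inferred(clf, draw_point()) for _ in range(m)):
            return clf
        s_guess *= 2
```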
It is left to compute the expected memory usage, query complexity, and computational complexity of our algorithm. Notice that at any given step with guess s′, our algorithm runs in time O(ds′ log(ds′/(εδ)) log(s′/(εδ))/ε), uses O(ds′) memory, and makes at most O(ds′ log(s′/(εδ))) queries. Further, for every round in which s′ ≥ s, the samples drawn are sufficiently large that Corollary 4.4 is guaranteed to succeed with probability at least 1 − O(1/s′) by our chosen parameters. Since the coverage test also succeeds with probability at least 1 − O(1/s′) and each round is independent, for any s′ ≥ s the failure probability at that step is at most O(1/s′) (as the algorithm has at this point run at least four rounds with s′ ≥ s). Since all our complexity measures scale like o(s′), the expected contribution to time, query complexity, and memory usage at least halves in each step for which s′ > s. It is not hard to observe that these final steps then add no additional asymptotic complexity, which gives the desired result.

Much of the prior work on active learning with enriched queries centers around the class of halfspaces [32, 61, 27, 26, 9]. While it has long been known that the class cannot be efficiently actively learned in the standard model [11], KLMZ [32] showed that adding a natural enriched query called a comparison resolves this issue. In particular, recall that a 2D-halfspace is given by the sign of the inner product with some normal vector v ∈ S^1 plus some bias b ∈ R. A comparison query on two points x, x′ ∈ R^2 asks which point is further from the hyperplane defined by h, that is, whether ⟨x, v⟩ ≥ ⟨x′, v⟩. KLMZ proved that halfspaces in two dimensions have finite inference dimension, a combinatorial parameter slightly weaker than perfect sample compression that characterizes query-optimal active learning. In fact, we show that 2D-halfspaces with comparisons have substantially stronger structure: namely, an LCS of size 5.
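For concreteness, the label and comparison oracles for a known 2D-halfspace h(x) = ⟨v, x⟩ + b are trivial to simulate (a tiny Python sketch with illustrative names of our own; such simulated oracles are convenient for exercising the compression and inference sketches that follow):

```python
import numpy as np

def make_halfspace_oracles(v, b):
    """Simulated enriched queries for h(x) = <v, x> + b in R^2:
    `label(x)` answers the sign query, `compare(x, y)` answers whether
    h(x) >= h(y) (equivalently <x, v> >= <y, v>, since the bias cancels)."""
    v = np.asarray(v, dtype=float)
    label = lambda x: float(np.dot(v, x)) + b >= 0
    compare = lambda x, y: float(np.dot(v, x)) >= float(np.dot(v, y))
    return label, compare
```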
Proposition 4.6.
The class of halfspaces over R^2 has a size-5 lossless compression scheme with respect to the comparison oracle.

Proof. Given some (unknown) hyperplane h = ⟨v, ·⟩ + b, monochromatic set S, and labels and comparisons q_h(S), we must prove the existence of a subset T of size at most 5 such that I(q_h(T)) = I(q_h(S)). The idea behind our construction is to consider positive rays based on S and q_h(S). In other words, notice that for any point s ∈ S and pair (x_1, y_1) such that h(x_1) ≥ h(y_1), the ray r_1 = s + t(x_1 − y_1) is increasing with respect to h. Furthermore, any two such rays with the same base point, r_1 = s + t(x_1 − y_1) and r_2 = s + t(x_2 − y_2), form a cone

C(r_1, r_2) = { y ∈ R^2 : ∃ α_1, α_2 > 0 s.t. y = s + Σ_{i=1}^{2} α_i(x_i − y_i) }

such that for any y ∈ C(r_1, r_2), labels and comparisons on the set {s, x_1, x_2, y_1, y_2} infer that h(y) ≥ 0:

h(y) = ⟨v, s + Σ_{i=1}^{2} α_i(x_i − y_i)⟩ + b = (⟨v, s⟩ + b) + Σ_{i=1}^{2} α_i⟨v, x_i − y_i⟩ = h(s) + α_1(h(x_1) − h(y_1)) + α_2(h(x_2) − h(y_2)) ≥ 0.

Notice that the union of all such cones is itself a cone where the base point s is some minimal element in S (in the sense that for all s′ ∈ S, h(s) ≤ h(s′)), and r_1 and r_2 are rays stemming from s that make the greatest angle. We argue that y ∈ I(q_h(S)) if and only if y lies inside this cone. Since the cone may be defined by queries on the 5 points making up r_1 and r_2, this gives the desired compression scheme. See Figure 1 for a pictorial description of this process.

Figure 1 (panels (a)–(e): original sample; find increasing directions (blue); find minimal point (red); draw rays from minimum; select widest cone): A pictorial representation of the intuition behind our LCS for halfspaces. Given a (positive) monochromatic sample (diagram (a)), we find a minimal point (diagram (c)) and all directions of increase (diagram (b)). Combining these gives a set of positive rays which form cones (diagram (d)). Our LCS is given by the points which contribute to the widest cone, depicted in diagram (e) (here all 4 points).

Denote the cone given by this process by C. We have already proved that y ∈ I(q_h(S)) if y ∈ C, so it is sufficient to show that any y ∉ C cannot be inferred by queries on S. To see this, we examine the bounding hyperplanes of C: s + t(x_1 − y_1) and s + t(x_2 − y_2). Let H_1 and H_2 denote the corresponding halfspaces (defined such that C = H_1 ∩ H_2), and define h_i = ⟨v_i, ·⟩ + b_i such that H_i = sign(h_i). We first note that the labels and comparisons on S corresponding to each h_i are consistent with those on the true hypothesis h. This is obvious for label queries since S is assumed to be entirely positive, and C contains S by definition. For comparisons, assume for the sake of contradiction that there exists a query inconsistent with some h_i, that is, a pair x, y ∈ S such that h(x) ≥ h(y) but h_i(x) < h_i(y). If this is the case, notice that the ray extending from y through x is decreasing with respect to h_i, and therefore must eventually cross it. It follows that replacing (x_i, y_i) in the construction of T with (x, y) results in a wider angle, which gives the desired contradiction.

Given that h_1 and h_2 are both consistent with queries on S under the true hypothesis, consider a point y lying outside of C. By definition, either sign(h_1(y)) or sign(h_2(y)) is negative.
However, notice that since h_i = ⟨v_i, ·⟩ + b_i is consistent with queries on S, it must also be the case that any non-negative shift h_{i,b′} = ⟨v_i, ·⟩ + b_i + b′ for b′ ≥ 0 is consistent as well. Since for sufficiently large b′, sign(h_{i,b′}(y)) is positive, the true label sign(h(y)) cannot be inferred, as desired.
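A rough Python sketch of the compression step just described (the names and the brute-force search are our own; it assumes a positive monochromatic sample in general position with the simulated `compare` oracle from above, and ignores degenerate cases such as samples with fewer than two distinct increasing directions):

```python
import itertools
import numpy as np

def widest_cone_lcs(S, compare):
    """Compress a positive monochromatic 2D sample to (at most) 5 points:
    a minimal point s plus two comparison pairs whose increasing
    directions span the widest cone."""
    S = [np.asarray(p, dtype=float) for p in S]
    # minimal point: a sample point with the smallest value under h
    s = S[0]
    for p in S[1:]:
        if compare(s, p):          # h(s) >= h(p), so p is at least as small
            s = p
    # increasing directions x - y, one per ordered pair with h(x) >= h(y)
    pairs = [(x, y) for x, y in itertools.permutations(S, 2)
             if compare(x, y) and np.linalg.norm(x - y) > 0]
    dirs = [x - y for x, y in pairs]

    def angle(u, w):
        c = np.dot(u, w) / (np.linalg.norm(u) * np.linalg.norm(w))
        return np.arccos(np.clip(c, -1.0, 1.0))

    # all increasing directions lie in a common half-plane, so the pair
    # with the largest pairwise angle spans every other direction
    i, j = max(itertools.combinations(range(len(dirs)), 2),
               key=lambda ij: angle(dirs[ij[0]], dirs[ij[1]]))
    (x1, y1), (x2, y2) = pairs[i], pairs[j]
    return [s, x1, y1, x2, y2]     # queries on these 5 points define the cone
```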
Corollary 4.7. The class of halfspaces over R^2 is actively RPU-learnable in only q(ε, δ) ≤ O(log(1/(εδ))) queries, O(1) memory, and time O(log^2(1/(δε))/ε).

Proof. Inference is done through a linear program as in [32]. All computational parameters are O(1) due to being in O(1) dimensions, which gives the desired result.
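One way such LP-based inference could look (a sketch under our own formulation, using SciPy; the paper only states that inference reduces to a linear program, so the exact encoding below is an assumption): a point's positive label is forced exactly when no halfspace consistent with the recorded labels and comparisons assigns it a negative value.

```python
import numpy as np
from scipy.optimize import linprog

def positive_label_forced(pos_points, comparisons, z, tol=1e-9):
    """Return True if every h(x) = <v, x> + b consistent with the queries
    (pos_points all labeled positive, comparisons (x, y) meaning
    h(x) >= h(y)) satisfies h(z) >= 0. Variables are (v_1, v_2, b); the
    sign constraints are scale-invariant, so a box bound on the variables
    acts as a harmless normalization."""
    A_ub, b_ub = [], []
    for p in pos_points:                       # <v, p> + b >= 0
        A_ub.append([-p[0], -p[1], -1.0]); b_ub.append(0.0)
    for x, y in comparisons:                   # <v, x - y> >= 0
        d = np.subtract(x, y)
        A_ub.append([-d[0], -d[1], 0.0]); b_ub.append(0.0)
    c = [z[0], z[1], 1.0]                      # objective: minimize h(z)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(-1, 1)] * 3, method="highs")
    return res.status == 0 and res.fun >= -tol
```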
Further Directions

We end with a brief discussion of two natural directions suggested by our work.
In this work we prove that lossless sample compression is a sufficient condition for efficient, bounded memory active learning in the enriched query regime. KLMZ [32] prove that inference dimension, a strictly weaker combinatorial parameter for enriched queries, is necessary for efficient bounded memory active learning. Closing the gap between these two conditions remains an open problem: it is currently unknown whether inference dimension is sufficient or lossless sample compression is necessary. Similarly, the relation between inference dimension and lossless sample compression themselves remains unknown. Over finite spaces it is clear from arguments of [32] that inference dimension and lossless sample compression are equivalent up to a factor of log(|X|). However, since in the finite regime we are likely interested in learning all points of X rather than learning to some accuracy parameter, the relevant memory bound depends crucially on log(|X|), making the tightness of the above relation an important question as well.

On a related note, though halfspaces in dimensions greater than two have infinite inference dimension, KLMZ [32] do show that halfspaces with certain restricted structure (e.g. bounded bit complexity, margin) have finite inference dimension. Whether such classes have lossless compression schemes, or indeed are even learnable in bounded memory at all, remains an interesting open problem and a concrete step towards understanding the above.

In this work we study only realizable-case learning. It is reasonable to wonder to what extent our results hold in the agnostic model, where the adversary may choose any function (rather than being restricted to one from the concept class H), or in various models of noise in which the adversary may corrupt queries. While inference-based learning is difficult in such regimes, previous work has seen some success with particular classes of enriched queries such as comparisons [61, 27]. Proving the existence of bounded memory active learners even for simple noise and query regimes, such as random classification noise with comparisons, remains an important problem for bringing bounded memory active learning closer to practice.

References
[1] Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate correlation clustering using same-cluster queries. In Latin American Symposium on Theoretical Informatics, pages 14–27. Springer, 2018.
[2] Foued Ameur. A space-bounded learning algorithm for axis-parallel rectangles. In European Conference on Computational Learning Theory, pages 313–321. Springer, 1995.
[3] Foued Ameur, Paul Fischer, Klaus-Uwe Höffgen, and Friedhelm Meyer auf der Heide. Trial and error: A new approach to space-bounded learning. Acta Inf., 33:621–630, 1996. doi: 10.1007/s002360050062.
[4] Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. arXiv preprint arXiv:1606.02404, 2016.
[5] Sepehr Assadi and Ran Raz. Near-quadratic lower bounds for two-pass graph streaming algorithms. arXiv preprint arXiv:2009.01161, 2020.
[6] Maria Florina Balcan and Steve Hanneke. Robust interactive learning. In Conference on Learning Theory, pages 20–1, 2012.
[7] Paul Beame, Shayan Oveis Gharan, and Xin Yang. Time-space tradeoffs for learning finite functions from random evaluations, with applications to polynomials. In Conference on Learning Theory, pages 843–856, 2018.
[8] Guy Blanc, Neha Gupta, Jane Lange, and Li-Yang Tan. Universal guarantees for decision tree induction via a higher-order splitting criterion. arXiv preprint arXiv:2010.08633, 2020.
[9] Zhenghang Cui and Issei Sato. Active classification with uncertainty comparison queries. arXiv preprint arXiv:2008.00645, 2020.
[10] Yuval Dagan, Gil Kur, and Ohad Shamir. Space lower bounds for linear prediction in the streaming model. arXiv preprint arXiv:1902.03498, 2019.
[11] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344, 2005.
[12] Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.
[13] Sanjoy Dasgupta, Akansha Dey, Nicholas Roberts, and Sivan Sabato. Learning from discriminative feature feedback. Advances in Neural Information Processing Systems, 31:3955–3963, 2018.
[14] Andrzej Ehrenfeucht and David Haussler. Learning decision trees from random examples. Information and Computation, 82(3):231–246, 1989.
[15] Ran El-Yaniv and Yair Wiener. Active learning via perfect selective classification. Journal of Machine Learning Research, 13(Feb):255–279, 2012.
[16] Donatella Firmani, Sainyam Galhotra, Barna Saha, and Divesh Srivastava. Robust entity resolution using a crowd oracle. IEEE Data Eng. Bull., 41(2):91–103, 2018.
[17] Sally Floyd. Space-bounded learning and the Vapnik-Chervonenkis dimension. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 349–364, 1989.
[18] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269–304, 1995.
[19] Sumegha Garg, Ran Raz, and Avishay Tal. Extractor-based time-space lower bounds for learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 990–1002, 2018.
[20] Alon Gonen, Shachar Lovett, and Michal Moshkovitz. Towards a combinatorial characterization of bounded memory learning. arXiv preprint arXiv:2002.03123, 2020.
[21] Thomas Hancock, Tao Jiang, Ming Li, and John Tromp. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.
[22] Steve Hanneke. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1):1319–1333, 2016.
[23] Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
[24] Sariel Har-Peled, Mitchell Jones, and S. Rahul. Active learning a convex body in low dimensions. In ICALP, 2020.
[25] D. Haussler. Space efficient learning algorithms. Technical Report UCSC-CRL-88-2, University of Calif. Computer Research Laboratory, Santa Cruz, CA, 1988.
[26] Max Hopkins, Daniel Kane, and Shachar Lovett. The power of comparisons for actively learning linear classifiers. Advances in Neural Information Processing Systems, 33, 2020.
[27] Max Hopkins, Daniel Kane, Shachar Lovett, and Gaurav Mahajan. Noise-tolerant, reliable active classification with comparison queries. In Conference on Learning Theory, pages 1957–2006. PMLR, 2020.
[28] Badr Hssina, Abdelkarim Merbouha, Hanane Ezzikouri, and Mohammed Erritali. A comparative study of decision tree ID3 and C4.5. International Journal of Advanced Computer Science and Applications, 4(2):13–19, 2014.
[29] Kevin G Jamieson and Robert Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2240–2248, 2011.
[30] Adam Tauman Kalai and Shang-Hua Teng. Decision trees are PAC-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933, 2008.
[31] Daniel Kane, Shachar Lovett, and Shay Moran. Generalized comparison trees for point-location problems. In International Colloquium on Automata, Languages and Programming, 2018.
[32] Daniel M Kane, Shachar Lovett, Shay Moran, and Jiapeng Zhang. Active classification with comparison queries. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 355–366. IEEE, 2017.
[33] Amin Karbasi, Stratis Ioannidis, et al. Comparison-based learning with rank nets. arXiv preprint arXiv:1206.4674, 2012.
[34] J. Kivinen. Reliable and useful learning with uniform probability distributions. In Proceedings of the First International Workshop on Algorithmic Learning Theory, 1990.
[35] J. Kivinen. Learning reliably and with one-sided error. Mathematical Systems Theory, 28(2):141–172, 1995.
[36] Sanjeev R Kulkarni, Sanjoy K Mitter, and John N Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.
[37] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, 1993.
[38] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. 1986.
[39] Arya Mazumdar and Barna Saha. Clustering with noisy queries. In Advances in Neural Information Processing Systems, pages 5788–5799, 2017.
[40] Dana Moshkovitz and Michal Moshkovitz. Mixing implies lower bounds for space bounded learning. In Conference on Learning Theory, pages 1516–1566, 2017.
[41] Dana Moshkovitz and Michal Moshkovitz. Entropy samplers and strong generic lower bounds for space bounded learning. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
[42] Michal Moshkovitz and Naftali Tishby. A general memory-bounded learning algorithm. arXiv preprint arXiv:1712.03524, 2017.
[43] Jelani Nelson and Huacheng Yu. Optimal bounds for approximate counting. arXiv preprint arXiv:2010.02116, 2020.
[44] Ryan O'Donnell and Rocco A Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing, 37(3):827–844, 2007.
[45] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[46] Ran Raz. A time-space lower bound for a large class of learning problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 732–742. IEEE, 2017.
[47] Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. Journal of the ACM (JACM), 66(1):1–18, 2018.
[48] Ronald L Rivest and Robert H Sloan. Learning complicated concepts reliably and usefully. In AAAI, pages 635–640, 1988.
[49] Barna Saha and Sanjay Subramanian. Correlation clustering with same-cluster queries bounded by optimal cost. arXiv preprint arXiv:1908.04976, 2019.
[50] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[51] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. Advances in Neural Information Processing Systems, 27:163–171, 2014.
[52] Vatsal Sharan, Aaron Sidford, and Gregory Valiant. Memory-sample tradeoffs for linear regression with small error. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 890–901, 2019.
[53] Sonia Singh and Priyanka Gupta. Comparative study ID3, CART and C4.5 decision tree algorithm: a survey. International Journal of Advanced Information Science and Technology (IJAIST), 27(27):97–103, 2014.
[54] Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Conference on Learning Theory, pages 1490–1516, 2016.
[55] M Umanol, Hirotaka Okamoto, Itsuo Hatono, Hiroyuki Tamura, Fumio Kawachi, Sukehisa Umedzu, and Junichi Kinoshita. Fuzzy decision trees by fuzzy ID3 algorithm and its application to diagnosis systems. In Proceedings of 1994 IEEE 3rd International Fuzzy Systems Conference, pages 2113–2118. IEEE, 1994.
[56] Leslie G Valiant. A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM, 1984.
[57] Vladimir Vapnik and Alexey Chervonenkis. Theory of pattern recognition, 1974.
[58] Vasilis Verroios, Hector Garcia-Molina, and Yannis Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1133–1148, 2017.
[59] Sharad Vikram and Sanjoy Dasgupta. Interactive Bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081–2090, 2016.
[60] Fabian L Wauthier, Nebojsa Jojic, and Michael I Jordan. Active spectral clustering via iterative uncertainty reduction. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1339–1347, 2012.
[61] Yichong Xu, Hongyang Zhang, Kyle Miller, Aarti Singh, and Artur Dubrawski. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems, 2017.