Active Learning on Trees and Graphs
Nicolo Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Nicolò Cesa-Bianchi, Dipartimento di Informatica, Università degli Studi di Milano, Italy ([email protected])
Claudio Gentile, DiSTA, Università dell'Insubria, Italy ([email protected])
Fabio Vitale, Dipartimento di Informatica, Università degli Studi di Milano, Italy ([email protected])
Giovanni Zappella, Dipartimento di Matematica, Università degli Studi di Milano, Italy ([email protected])
August 21, 2018
Abstract
We investigate the problem of active learning on a given tree whose nodes are assigned binary labels in an adversarial way. Inspired by recent results by Guillory and Bilmes, we characterize (up to constant factors) the optimal placement of queries so as to minimize the mistakes made on the non-queried nodes. Our query selection algorithm is extremely efficient, and the optimal number of mistakes on the non-queried nodes is achieved by a simple and efficient mincut classifier. Through a simple modification of the query selection algorithm we also show optimality (up to constant factors) with respect to the trade-off between the number of queries and the number of mistakes on non-queried nodes. By using spanning trees, our algorithms can be efficiently applied to general graphs, although the problem of finding optimal and efficient active learning algorithms for general graphs remains open. Towards this end, we provide a lower bound on the number of mistakes made on arbitrary graphs by any active learning algorithm using a number of queries which is up to a constant fraction of the graph size.
1 Introduction

The abundance of networked data in various application domains (web, social networks, bioinformatics, etc.) motivates the development of scalable and accurate graph-based prediction algorithms. An important topic in this area is the graph binary classification problem: given a graph with unknown binary labels on its nodes, the learner receives the labels of a subset of the nodes (the training set) and must predict the labels of the remaining vertices. This is typically done by relying on some notion of label regularity depending on the graph topology, such as that nearby nodes are likely to be labeled similarly. Standard approaches to this problem predict with the assignment of labels minimizing the induced cutsize (e.g., [4, 5]), or by binarizing the assignment that minimizes certain real-valued extensions of the cutsize function (e.g., [14, 2, 3] and references therein).

In the active learning version of this problem the learner is allowed to choose the subset of training nodes. Similarly to standard feature-based learning, one expects active methods to provide a significant boost of predictive ability compared to an uninformed (e.g., random) draw of the training set. The following simple example provides some intuition of why this could happen when the labels are chosen by an adversary, which is the setting considered in this paper. Consider a "binary star system" of two star-shaped graphs whose centers are connected by a bridge, where one star is a constant fraction bigger than the other. The adversary draws two random binary labels and assigns the first label to all nodes of the first star graph, and the second label to all nodes of the second star graph. Assume that the training set size is two. If we choose the centers of the two stars and predict with a mincut strategy, we are guaranteed to make zero mistakes on all unseen vertices.
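To make the example concrete, here is a small sketch (node numbering and graph sizes are illustrative, not from the paper) in which the two centers are queried and every other node is predicted with the label of its nearest queried node, which realizes a mincut assignment on this instance:

```python
# Toy "binary star system": two stars whose centers are joined by a bridge.
from collections import deque

def star(center, leaves):
    return {(center, leaf) for leaf in leaves}

edges = star(0, range(2, 12)) | star(1, range(12, 17)) | {(0, 1)}
labels = {i: +1 for i in range(2, 12)} | {i: -1 for i in range(12, 17)}
labels[0], labels[1] = +1, -1          # the adversary's two random labels

adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

# Query the two centers; predict each node with the label of the nearest
# queried node (here this is a mincut labeling: only the bridge is cut).
queried = {0: +1, 1: -1}
pred, frontier = dict(queried), deque(queried)
while frontier:
    u = frontier.popleft()
    for v in adj[u]:
        if v not in pred:
            pred[v] = pred[u]
            frontier.append(v)

mistakes = sum(pred[v] != labels[v] for v in labels if v not in queried)
print(mistakes)  # 0: querying the centers makes mincut perfect here
```

A random query set, by contrast, would with constant probability miss the smaller star entirely, mislabeling all of its nodes.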
On the other hand, if we query two nodes at random, then with constant probability both of them will belong to the bigger star, and all the unseen labels of the smaller star will be mistaken. This simple example shows that the gap between the performance of passive and active learning on graphs can be made arbitrarily big.

In general, one would like to devise a strategy for placing a certain budget of queries on the vertices of a given graph. This should be done so as to minimize the number of mistakes made on the non-queried nodes by some reasonable classifier like mincut. This question has been investigated from a theoretical viewpoint by Guillory and Bilmes [6], and by Afshani et al. [1]. Our work is related to an elegant result from [6] which bounds the number of mistakes made by the mincut classifier on the worst-case assignment of labels in terms of Φ/Ψ(L). Here Φ is the cutsize induced by the unknown labeling, and Ψ(L) is a function of the query (or training) set L which depends on the structural properties of the (unlabeled) graph. For instance, in the above example of the binary star system, the value of Ψ(L) when the query set L includes just the two centers is 1. This implies that for the binary system graph, Guillory and Bilmes' bound on the mincut strategy is Φ mistakes in the worst case (note that in the above example Φ ≤ 1). Since Ψ(L) can be efficiently computed on any given graph and query set L, the learner's task might be reduced to finding a query set L that maximizes Ψ(L) given a certain query budget (size of L). Unfortunately, no feasible general algorithm for solving this maximization problem is known, and so one must resort to heuristic methods; see [6].

In this work we investigate the active learning problem on graphs in the important special case of trees. We exhibit a simple iterative algorithm which, combined with a mincut classifier, is optimal (up to constant factors) on any given labeled tree.
This holds even if the algorithm is not given information on the actual cutsize Φ. (A mincut strategy considers all labelings consistent with the labels observed so far, and chooses among them one that minimizes the resulting cutsize over the whole graph.) Our method is extremely efficient, requiring O(n ln Q) time for placing Q queries in an n-node tree, and space linear in n. As a byproduct of our analysis, we show that Ψ can be efficiently maximized over trees to within constant factors. Hence min_L Φ/Ψ(L) can be achieved efficiently, up to constant factors.

Another interesting question is what kind of trade-off between queries and mistakes can be achieved if the learner is not constrained by a given query budget. We show that a simple modification of our selection algorithm is able to trade off queries and mistakes in an optimal way up to constant factors.

Finally, we prove a general lower bound for predicting the labels of any given graph (not necessarily a tree) when the query set is up to a constant fraction of the number of vertices. Our lower bound establishes that the number of mistakes must then be at least a constant fraction of the cutsize weighted by the effective resistances. This lower bound apparently yields a contradiction to the results of Afshani et al. [1], who construct the query set adaptively. This apparent contradiction is also illustrated via a simple counterexample that we detail in Section 5.

2 Preliminaries

A labeled tree (T, y) is a tree T = (V, E) whose nodes V = {1, …, n} are assigned binary labels y = (y_1, …, y_n) ∈ {−1, +1}^n. We measure the label regularity of (T, y) by the cutsize Φ_T(y) induced by y on T, i.e., Φ_T(y) = |{(i, j) ∈ E : y_i ≠ y_j}|. We consider the following active learning protocol: given a tree T with unknown labeling y, the learner obtains all labels in a query set L ⊆ V, and is then required to predict the labels of the remaining nodes V \ L.
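The cutsize can be computed directly from the definition; a minimal sketch (names are ours, not the paper's):

```python
# Cutsize Φ_T(y): the number of edges of T whose endpoints receive
# disagreeing labels under the assignment y.
def cutsize(edges, y):
    return sum(y[i] != y[j] for i, j in edges)

# Path 1-2-3-4 labeled (+1, +1, -1, -1): only the edge (2, 3) disagrees.
path_edges = [(1, 2), (2, 3), (3, 4)]
y = {1: +1, 2: +1, 3: -1, 4: -1}
print(cutsize(path_edges, y))  # 1
```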
Active learning algorithms work in two phases: a selection phase, where a query set of given size is constructed, and a prediction phase, where the algorithm receives the labels of the query set and predicts the labels of the remaining nodes. Note that the only labels ever observed by the algorithm are those in the query set. In particular, no labels are revealed during the prediction phase. We measure the ability of the algorithm by the number of prediction mistakes made on V \ L, where it is reasonable to expect this number to depend on both the unknown cutsize Φ_T(y) and the number |L| of requested labels. A slightly different prediction measure is considered in Section 4.3.

Given a tree T and a query set L ⊆ V, a node i ∈ V \ L is a fork node generated by L if and only if there exist three distinct nodes i_1, i_2, i_3 ∈ L that are connected to i through edge-disjoint paths. Let FORK(L) be the set of all fork nodes generated by L. Then L^+ is the query set obtained by adding to L all the generated fork nodes, i.e., L^+ := L ∪ FORK(L). We say that L ⊆ V is 0-forked if and only if L^+ ≡ L. Note that L^+ is 0-forked, that is, FORK(L^+) ≡ ∅ for all L ⊆ V.

Given a node subset S ⊆ V, we use T \ S to denote the forest obtained by removing from the tree T all nodes in S and all edges incident to them. Moreover, given a second tree T′, we denote by T \ T′ the forest T \ V′, where V′ is the set of nodes of T′. Given a query set L ⊆ V, a hinge-tree is any connected component of T \ L^+. We call connection node of a hinge-tree a node of L^+ adjacent to any node of the hinge-tree. We distinguish between 1-hinge and 2-hinge trees: a 1-hinge-tree has one connection node only, whereas a 2-hinge-tree has two (note that a hinge-tree cannot have more than two connection nodes because L^+ is 0-forked, see Figure 1).

3 The active learning algorithm

We now describe the two phases of our active learning algorithm. For the sake of exposition, we call
SEL the selection phase and
PRED the prediction phase.
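The two-phase protocol can be summarized by the following skeleton (function names and the trivial stand-ins are illustrative only; the actual SEL and PRED procedures are described below):

```python
# Two-phase active learning on a tree: the selection phase sees only the
# unlabeled tree; the prediction phase sees only the queried labels.
def run_two_phase(adj, labels, budget, select, predict):
    queries = select(adj, budget)               # phase 1: no labels available
    revealed = {i: labels[i] for i in queries}  # labels of the query set only
    predictions = predict(adj, revealed)        # phase 2: no further labels
    return sum(predictions[i] != labels[i]      # mistakes on V \ L
               for i in adj if i not in revealed)

# Trivial stand-ins, just to exercise the interface:
naive_select = lambda adj, b: list(adj)[:b]
constant_predict = lambda adj, revealed: {i: +1 for i in adj}

path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
all_plus = {i: +1 for i in path}
print(run_two_phase(path, all_plus, 1, naive_select, constant_predict))  # 0
```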
SEL returns a 0-forked query set L^+_SEL ⊆ V of desired size. PRED takes as input the query set L^+_SEL and the set of labels y_i for all i ∈ L^+_SEL, and then returns a prediction for the labels of all the remaining nodes V \ L^+_SEL.

Figure 1: A tree T = (V, E) whose nodes are shaded (the query set L) or white (the set V \ L). The shaded nodes are also the connection nodes of the depicted hinge-trees (not all hinge-trees are contoured). The fork nodes generated by L are denoted by double circles. The thick black edges connect the nodes in L.

In order to see the way
SEL operates, we formally introduce the function Ψ*. This is the reciprocal of the Ψ function introduced in [6] and mentioned in Section 1.

Definition 1.
Given a tree T = (V, E) and a set of nodes L ⊆ V,

Ψ*(L) := max_{∅ ≠ V′ ⊆ V \ L} |V′| / |{(i, j) ∈ E : i ∈ V′, j ∈ V \ V′}|.

In words, Ψ*(L) measures the largest set of nodes not in L that share the least number of edges with nodes in L. From the adversary's viewpoint, Ψ*(L) can be described as the largest return in mistakes per unit of cutsize invested. We now move on to the description of the algorithms SEL and
PRED.

The selection algorithm SEL greedily computes a query set that minimizes Ψ* to within constant factors. To this end, SEL exploits Lemma 9 (a) (see Section 4.2), stating that, for any fixed query set L, the subset V′ ⊆ V maximizing the ratio |V′| / |{(i, j) ∈ E : i ∈ V′, j ∈ V \ V′}| is always included in a connected component of T \ L. Thus SEL places its queries so as to end up with a query set L^+_SEL such that the largest component of T \ L^+_SEL is as small as possible.
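A direct (and deliberately inefficient) sketch of this greedy rule, omitting SEL's fork-node bookkeeping and the fast implementation discussed later:

```python
def components(adj, nodes):
    # Connected components of the subgraph induced by the node set `nodes`.
    seen, comps = set(), []
    for s in nodes:
        if s not in seen:
            seen.add(s)
            stack, comp = [s], []
            while stack:
                u = stack.pop()
                comp.append(u)
                for v in adj[u]:
                    if v in nodes and v not in seen:
                        seen.add(v)
                        stack.append(v)
            comps.append(comp)
    return comps

def greedy_select(adj, budget):
    # Repeatedly query, inside the largest component of T \ L, the node
    # minimizing the size of the largest piece left by its removal (σ).
    L = set()
    for _ in range(budget):
        free = set(adj) - L
        comp = set(max(components(adj, free), key=len))
        sigma = lambda i: max(
            (len(c) for c in components(adj, comp - {i})), default=0)
        L.add(min(comp, key=sigma))
    return L

path = {i: [j for j in (i - 1, i + 1) if 1 <= j <= 7] for i in range(1, 8)}
print(greedy_select(path, 1))  # {4}: the centroid halves the path
```

On a 7-node path the first query lands on the centroid, and subsequent queries bisect the remaining halves, mirroring the recursive splitting analyzed below.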
SEL operates as follows. Let L_t ⊆ L_SEL be the set including the first t nodes chosen by SEL, let T^t_max be the largest connected component of T \ L_{t−1}, and let σ(T′, i) be the size (number of nodes) of the largest component of the forest T′ \ {i}, where T′ is any tree. At each step t = 1, 2, …, SEL simply picks the node i_t ∈ T^t_max that minimizes σ(T^t_max, i) over i and sets L_t = L_{t−1} ∪ {i_t}. During this iterative construction, SEL also maintains a set containing all fork nodes generated in each step by adding the nodes i_t to the sets L_{t−1}. (In Section 6 we will see that during each step L_{t−1} → L_t at most a single new fork node may be generated.) After the desired number of queries is reached (also counting the queries that would be caused by the stored fork nodes), SEL has terminated the construction of L_SEL. The final query set L^+_SEL, obtained by adding all stored fork nodes to L_SEL, is then returned. The
prediction algorithm PRED receives as input the labeled nodes of the 0-forked query set L^+_SEL and computes a mincut assignment. Since each component of T \ L^+_SEL is either a 1-hinge-tree or a 2-hinge-tree,
PRED is simple to describe and is also very efficient. The algorithm predicts all the nodes of a hinge-tree T using the same label ŷ_T. This label is chosen according to the following two cases:

1. If T is a 1-hinge-tree, then ŷ_T is set to the label of its unique connection node.

2. If T is a 2-hinge-tree and the labels of its two connection nodes are equal, then ŷ_T is set to the label of its connection nodes; otherwise ŷ_T is set to the label of the closer connection node (ties are broken arbitrarily).

In Section 6 we show that SEL requires overall O(|V| log Q) time and O(|V|) memory space for selecting Q query nodes. Also, we will see that the total running time taken by PRED for predicting all nodes in V \ L is linear in |V|.

4 Analysis

For a given tree T, we denote by m_A(L, y) the number of prediction mistakes that algorithm A makes on the labeled tree (T, y) when given the query set L. Introduce the function

m_A(L, K) = max_{y : Φ_T(y) ≤ K} m_A(L, y)

denoting the number of prediction mistakes made by A with query set L on all labeled trees with cutsize bounded by K. We will also find it useful to deal with the "lower bound" function LB(L, K). This is the maximum expected number of mistakes that any prediction algorithm A can be forced to make on the labeled tree (T, y) when the query set is L and the cutsize is not larger than K.

We show that the number of mistakes made by PRED on any labeled tree when using the query set L^+_SEL satisfies m_PRED(L^+_SEL, K) = O(LB(L, K)) for all query sets L ⊆ V of size up to a constant fraction of |L^+_SEL| (Theorem 7 gives the precise constants). Though neither SEL nor
PRED knows the actual cutsize of the labeled tree (T, y), the combined use of these procedures is competitive against any algorithm that knows the cutsize budget K beforehand. While this result implies the optimality (up to constant factors) of our algorithm, it does not relate the mistake bound to the cutsize, which is a clearly interpretable measure of the label regularity. In order to address this issue, we show that our algorithm also satisfies the bound m_PRED(L^+_SEL, y) = O(Ψ*(L) Φ_T(y)) for all query sets L ⊆ V of size up to a constant fraction of |L^+_SEL|. The proof of these results needs a number of preliminary lemmas.

Lemma 1. For any tree T = (V, E), it holds that min_{v ∈ V} σ(T, v) ≤ |V|/2.

Proof. Let i ∈ argmin_{v ∈ V} σ(T, v). For the sake of contradiction, assume there exists a component T_i = (V_i, E_i) of T \ {i} such that |V_i| > |V|/2. Let s be the sum of the sizes of all other components. Since |V_i| + s = |V| − 1, we know that s ≤ |V|/2 − 1. Now let j be the node adjacent to i which belongs to V_i, and let T_j = (V_j, E_j) be the largest component of T \ {j}. There are only two cases to consider: either V_j ⊂ V_i or V_j ∩ V_i ≡ ∅. In the first case, |V_j| < |V_i|. In the second case, V_j ⊆ {i} ∪ (T \ V_i), which implies |V_j| ≤ s + 1 ≤ |V|/2 < |V_i|. In both cases, i ∉ argmin_{v ∈ V} σ(T, v), which provides the desired contradiction.

Lemma 2.
For all subsets L ⊆ V of the nodes of a tree T = (V, E), we have |L^+| ≤ 2|L|.

Proof. Pick an arbitrary node of T and perform a depth-first visit of all nodes in T. This visit induces an ordering T_1, T_2, … of the connected components in T \ L based on the order of the nodes visited first in each component. Now let T′_1, T′_2, … be such that each T′_i is the component T_i extended to include all nodes of L adjacent to nodes in T_i. Then the ordering implies that, for i ≥ 2, T′_i shares exactly one node (which must be a leaf) with all previously visited trees. Since in any tree the number of nodes of degree larger than two must be strictly smaller than the number of leaves, we have |FORK(T′_i)| < |Λ_i|, where, with slight abuse of notation, we denote by FORK(T′_i) the set of all fork nodes in subtree T′_i, and we let Λ_i be the set of leaves of T′_i. This implies that, for i = 1, 2, …, each fork node in FORK(T′_i) can be injectively associated with one of the |Λ_i| − 1 leaves of T′_i that are not shared with any of the previously visited trees. Since |FORK(L)| is equal to the sum of |FORK(T′_i)| over all indices i, this implies |FORK(L)| ≤ |L|, and thus |L^+| ≤ 2|L|.

Lemma 3.
Let L_{t−1} ⊆ L_SEL be the set of the first t − 1 nodes chosen by SEL. Given any tree T = (V, E), the largest subtree of T \ L_{t−1} contains no more than (2/t)|V| nodes.

Proof. Recall that i_s denotes the s-th node selected by SEL during the incremental construction of the query set L_SEL, and that T^s_max is the largest component of T \ L_{s−1}. The first t steps of the recursive splitting procedure performed by SEL can be associated with a splitting tree T′ defined in the following way. The internal nodes of T′ are the trees T^s_max, for s = 1, …, t. The children of T^s_max are the connected components of T^s_max \ {i_s}, i.e., the subtrees of T^s_max created by the selection of i_s. Hence, each leaf of T′ is bijectively associated with a tree in T \ L_t.

Let T′_nol be the tree obtained from T′ by deleting all leaves. Each node of T′_nol is one of the t subtrees split by SEL during the construction of L_t. As T^t_max is split by i_t, it is a leaf in T′_nol. We now add a second child to each internal node s of T′_nol having a single child. This second child of s is obtained by merging all the subtrees belonging to leaves of T′ that are also children of s. Let T′′ be the resulting tree.

We now compare the cardinality of T^t_max to that of the subtrees associated with the leaves of T′′. Let Λ be the set of all leaves of T′′ and Λ_add = T′′ \ T′_nol ⊂ Λ be the set of all leaves added to T′_nol to obtain T′′. First of all, note that |T^t_max| is not larger than the number of nodes in any leaf of T′_nol. This is because the selection rule of SEL ensures that T^t_max cannot be larger than any subtree associated with a leaf in T′_nol, since it contains no node selected before time t. In what follows, we write |s| to denote the size of the forest or subtree associated with a node s of T′′. We now prove the following claim.

Claim.
For all ℓ ∈ Λ \ Λ_add, |T^t_max| ≤ |ℓ|, and for all ℓ ∈ Λ_add, |T^t_max| − 1 ≤ |ℓ|.

Proof of Claim. The first part just follows from the observation that any ℓ ∈ Λ \ Λ_add was split by SEL before time t. In order to prove the second part, pick a leaf ℓ ∈ Λ_add. Let ℓ′ be its unique sibling in T′′ and let p be the parent of ℓ and ℓ′, also in T′′. Lemma 1 applied to the subtree p implies |ℓ′| ≤ |p|/2. Moreover, since |ℓ| + |ℓ′| = |p| − 1, we obtain |ℓ| + 1 ≥ |p|/2 ≥ |ℓ′| ≥ |T^t_max|, the last inequality using the first part of the claim. This implies |T^t_max| − 1 ≤ |ℓ|, and the claim is proven.

Let now N(Λ) be the number of nodes in the subtrees and forests associated with the leaves of T′′. With each internal node of T′′ we can associate a node of L_SEL which does not belong to any leaf in Λ. Moreover, the number |T′′ \ Λ| of internal nodes in T′′ is not smaller than the number |Λ_add| of internal nodes of T′_nol to which a child has been added. Since these subtrees and forests are all distinct, we obtain N(Λ) + |Λ_add| ≤ N(Λ) + |T′′ \ Λ| ≤ |V|. Hence, using the above claim we can write N(Λ) ≥ (|Λ| − |Λ_add|)|T^t_max| + |Λ_add|(|T^t_max| − 1), which implies |T^t_max| ≤ (N(Λ) + |Λ_add|)/|Λ| ≤ |V|/|Λ|. Since each internal node of T′′ has at least two children, we have |Λ| ≥ |T′′|/2 ≥ |T′_nol|/2 ≥ t/2. Hence, we can conclude that |T^t_max| ≤ (2/t)|V|.

We now state and prove a lower bound on the number of mistakes that any prediction algorithm (even knowing the cutsize budget K) makes on any given tree, when the query set L is 0-forked. The bound depends on the following quantity: given a tree T = (V, E), a node subset L ⊆ V, and an integer K, the component function Υ(L, K) is the sum of the sizes of the K largest components of T \ L, or |V \ L| if T \ L has fewer than K components.

Theorem 4.
For all trees T = (V, E), for all 0-forked subsets L^+ ⊆ V, and for all cutsize budgets K = 0, 1, …, |V| − 1, we have that LB(L^+, K) ≥ ½ Υ(L^+, K).

Proof. We describe an adversarial strategy causing any algorithm to make at least Υ(L^+, K)/2 mistakes in expectation, even when the cutsize budget K is known beforehand. Since L^+ is 0-forked, each component of T \ L^+ is a hinge-tree. Let F_max be the set of the K largest hinge-trees of T \ L^+, and E(T) be the set of all edges in E incident to at least one node of a hinge-tree T. The adversary creates at most one φ-edge in each edge set E(T) for all 1-hinge-trees T ∈ F_max, exactly one φ-edge in each edge set E(T) for all 2-hinge-trees T ∈ F_max, and no φ-edges in the edge set E(T) of any remaining hinge-tree T ∉ F_max. This is done as follows. By performing a depth-first visit of T, the adversary can always assign disagreeing labels to the two connection nodes of each 2-hinge-tree in F_max, and agreeing labels to the two connection nodes of each 2-hinge-tree not in F_max. Then, for each hinge-tree T ∈ F_max, the adversary assigns a unique random label to all nodes of T, forcing |T|/2 mistakes in expectation. The labels of the remaining hinge-trees not in F_max are chosen in agreement with their connection nodes.

Remark 1.
Note that Theorem 4 holds for all query sets, not only those that are 0-forked, since any adversarial strategy for a query set L^+ can force at least the same number of mistakes on the subset L ⊆ L^+. Note also that it is not difficult to modify the adversarial strategy described in the proof of Theorem 4 in order to deal with algorithms that are allowed to adaptively choose the query nodes in L depending on the labels of the previously selected nodes. The adversary simply assigns the same label to each node in the query set and then forces, with the same method described in the proof (recall that a φ-edge (i, j) is one where y_i ≠ y_j), ½ Υ(L^+, K) mistakes in expectation on the K largest hinge-trees. Thus there are at most two φ-edges in each edge set E(T) for all hinge-trees T, yielding at most 2K φ-edges in total. The resulting (slightly weaker) bound is LB(L^+, 2K) ≥ ½ Υ(L^+, K). Theorem 7 and Corollary 8 can also be easily rewritten in order to extend the results in this direction.

We now bound the total number of mistakes that PRED makes on any labeled tree when the queries are decided by SEL. We use Lemmas 1 and 2, together with the two lemmas below, to prove that m_PRED(L^+_SEL, K) is within a constant factor of LB(L, K) for all cutsize budgets K and for all node subsets L ⊆ V whose size is at most a constant fraction of |L_SEL| (Theorem 7).

Lemma 5.
For all labeled trees (T, y) and for all 0-forked query sets L^+ ⊆ V, the number of mistakes made by PRED satisfies m_PRED(L^+, y) ≤ Υ(L^+, Φ_T(y)).

Proof. As in the proof of Theorem 4, we first observe that each component of T \ L^+ is a hinge-tree. Let E(T) be the set of all edges in E incident to nodes of a hinge-tree T, and let F_φ be the set of hinge-trees such that, for all T ∈ F_φ, at least one edge of E(T) is a φ-edge. Since E(T) ∩ E(T′) ≡ ∅ for all distinct T, T′ ∈ T \ L^+, we have that |F_φ| ≤ Φ_T(y). Moreover, since for any T ∉ F_φ there are no φ-edges in E(T), the nodes of T must be labeled as its connection nodes. This, together with the prediction rule of PRED, implies that PRED makes no mistakes over any of the hinge-trees T ∉ F_φ. Hence, the number of mistakes made by PRED is bounded by the sum of the sizes of all hinge-trees T ∈ F_φ, which (by definition of Υ) is bounded by Υ(L^+, Φ_T(y)).

The next lemma, whose proof is a bit involved, provides the relevant properties of the component function Υ(·, ·). Figure 3 helps visualize the main ingredients of the proof.

Lemma 6.
Given a tree T = (V, E), for all node subsets L ⊆ V such that 2|L| ≤ |L_SEL| and for all integers k, we have: (a) Υ(L_SEL, k) ≤ 5 Υ(L, k); (b) Υ(L_SEL, 1) ≤ Υ(L, 1).

Proof. We prove part (a) by constructing, via SEL, three bijective mappings µ1, µ2, µ3 : P_SEL → P_L, where P_SEL is a suitable partition of T \ L_SEL, P_L is a collection of node sets such that any S ∈ P_L is all contained in a single connected component of T \ L, and the union of the domains of the three mappings covers the whole set T \ L_SEL. The mappings µ1, µ2 and µ3 are shown to satisfy, for all forests F ∈ P_SEL,

|F| ≤ |µ1(F)|,  |F| ≤ 2|µ2(F)|,  |F| ≤ 2|µ3(F)|.

(In this proof, |µ(A)| denotes the number of nodes in the node set µ(A). Also, with a slight abuse of notation, for all forests F ∈ P_SEL we denote by |F| the sum of the number of nodes in all trees of F; finally, whenever F ∈ P_SEL contains a single tree, we refer to F as if it were a tree, rather than a singleton forest.) Since each S ∈ P_L is all contained in a connected component of T \ L, this will enable us to conclude that, for each tree T′ ∈ T \ L, the forest of all trees of T \ L_SEL mapped (via any of these mappings) to a node subset of T′ has at most five times the number of nodes of T′. This proves the statement in (a).

The construction of these mappings requires some auxiliary definitions. We call ζ-component each connected component of T \ L_SEL containing at least one node of L. Let i_t be the t-th node chosen by SEL during the incremental construction of the query set L_SEL. We distinguish between four kinds of nodes chosen by SEL (see Figure 3 for an example). Node i_t is:

1. a collision node if it belongs to L_SEL ∩ L;

2. a [0;0]-node if, at time t, the tree T^t_max does not contain any node of L;

3. a [0;≥1]-node if, at time t, the tree T^t_max contains k ≥ 1 nodes j_1, …, j_k ∈ L, all belonging to the same connected component of T^t_max \ {i_t};

4. a [≥1;≥2]-node if i_t ∉ L and, at time t, the tree T^t_max contains k ≥ 2 nodes j_1, …, j_k ∈ L which do not all belong to the same connected component of T^t_max \ {i_t}.

We now turn to building the three mappings. µ1 simply maps each tree T′ ∈ T \ L_SEL that is not a ζ-component to the node set of T′ itself. This immediately implies |F| ≤ |µ1(F)| for all forests F (which are actually single trees) in the domain of µ1. Mappings µ2 and µ3 deal with the ζ-components of T \ L_SEL. Let Z be the set of all such ζ-components, and denote by V1, V2, and V3 the set of all [0;0]-nodes, [0;≥1]-nodes, and [≥1;≥2]-nodes, respectively. Observe that |V3| < |L|. Combined with the assumption |L_SEL| ≥ 2|L|, this implies that |V1| + |V2| plus the total number of collision nodes must be larger than |L|; as a consequence, |V1| + |V2| > |Z|. Each node i_t ∈ V2 chosen by SEL splits the tree T^t_max into one component T_{i_t} containing at least one node of L and one or more components all contained in a single tree T′_{i_t} of T \ L. Now mapping µ2 can be constructed incrementally in the following way. For each [0;≥1]-node selected by SEL at time t, µ2 sequentially maps any ζ-component generated to the set of nodes in T^t_max \ T_{i_t}, the latter being just a subset of a component of T \ L. A future time step t′ > t might feature the selection of a new [0;≥1]-node within T_{i_t}, but mapping µ2 would cover a different subset of such a component of T \ L. Now, applying Lemma 1 to the tree T^t_max, we can see that |T^t_max \ T_{i_t}| ≥ |T^t_max|/2. Since the selection rule of SEL guarantees that the number of nodes in T^t_max is larger than the number of nodes of any ζ-component, we have |F| ≤ 2|µ2(F)| for any ζ-component F considered in the construction of µ2.

Mapping µ3 maps all the remaining ζ-components that are not mapped through µ2. Let ∼ be an equivalence relation over V1 defined as follows: i ∼ j iff i is connected to j by a path containing only [0;0]-nodes and nodes in V \ (L_SEL ∪ L). Let i_{t_1}, i_{t_2}, …, i_{t_k} be the sequence of nodes of any given equivalence class [C]_∼, sorted according to SEL's chronological selection. Lemma 3 applied to the tree T^{t_1}_max shows that |T^{t_k}_max| ≤ (2/k)|T^{t_1}_max|. Moreover, the selection rule of SEL guarantees that the number of nodes of T^{t_k}_max cannot be smaller than the number of nodes of any ζ-component. Hence, for each equivalence class [C]_∼ containing k nodes of type [0;0], we map through µ3 a set F_ζ of k arbitrarily chosen ζ-components to T^{t_1}_max. Since the size of each ζ-component is at most |T^{t_k}_max|, we can write |F_ζ| ≤ k|T^{t_k}_max| ≤ 2|T^{t_1}_max|, which implies |F_ζ| ≤ 2|µ3(F_ζ)| for all F_ζ in the domain of µ3. Finally, observe that the number of ζ-components that are not mapped through µ2 cannot be larger than |V1|, thus the union of mappings µ2 and µ3 does actually map all ζ-components. This, in turn, implies that the union of the domains of the three mappings covers the whole set T \ L_SEL, thereby concluding the proof of part (a).

The proof of (b) is built on the definitions of collision nodes, [0;0]-nodes, [0;≥1]-nodes and [≥1;≥2]-nodes given in part (a). Let L_t ⊆ L_SEL be the set of the first t nodes chosen by SEL. Here, we make a further distinction within the collision and [0;≥1]-nodes. We say that during the selection of node i_t ∈ V2, the nodes in L ∩ T^t_max are captured by i_t. This notion of capture extends to collision nodes by saying that a collision node i_t ∈ L ∩ L_SEL just captures itself. We say that i_t is an initial [0;≥1]-node (resp., initial collision node) if i_t is a [0;≥1]-node (resp., collision node) such that the set of nodes in L captured by i_t contains no nodes captured so far (see Figure 3 for reference).

The simple observation leading to the proof of part (b) is the following. If i_t is a [0;0]-node, then T^t_max cannot be larger than the component of T \ L that contains T^t_max, which in turn cannot be larger than Υ(L, 1). This would already imply Υ(L_{t−1}, 1) ≤ Υ(L, 1). Let now i_t be an initial [0;≥1]-node and T_{i_t} be the unique component of T^t_max \ {i_t} containing one or more nodes of L. Applying Lemma 1 to the tree T^t_max, we can see that |T_{i_t}| cannot be larger than |T^t_max \ T_{i_t}|, which in turn cannot be larger than Υ(L, 1). If at time t′ > t the procedure SEL selects i_{t′} ∈ T_{i_t}, then |T^{t′}_max| ≤ |T_{i_t}| ≤ Υ(L, 1). Hence, the maximum integer q such that Υ(L_q, 1) > Υ(L, 1) is bounded by the number of [≥1;≥2]-nodes plus the number of initial [0;≥1]-nodes plus the number of initial collision nodes. We now bound this sum as follows. The number of [≥1;≥2]-nodes is clearly bounded by |L| − 1. Also, any initial [0;≥1]-node or initial collision node selected by SEL captures at least one new node of L, thereby implying that the total number of initial [0;≥1]-nodes and initial collision nodes is at most |L|. Hence, after q = 2|L| − 1 rounds, we are sure that the size of the largest tree of T \ L_q is not larger than the size of the largest component of T \ L, i.e., Υ(L_SEL, 1) ≤ Υ(L, 1).

We now put the above lemmas together to prove our main result concerning the number of mistakes made by PRED on the query set chosen by
SEL.

Theorem 7.
For all trees T and all cutsize budgets K, the number of mistakes made by PRED on the query set L^+_SEL satisfies

m_PRED(L^+_SEL, K) ≤ 10 min_{L ⊆ V : 4|L| ≤ |L_SEL|} LB(L, K).
Pick any L ⊆ V such that 4|L| ≤ |L_SEL|. Then

m_PRED(L^+_SEL, K) ≤ Υ(L^+_SEL, K)  (Lemma 5)
≤ Υ(L_SEL, K)  (A)
≤ 5 Υ(L^+, K)  (Lemma 6 (a))
≤ 10 LB(L^+, K)  (Theorem 4)
≤ 10 LB(L, K).  (B)

Inequality (A) holds because L_SEL ⊆ L^+_SEL, and thus the connected components of T \ L^+_SEL cannot be larger than those of T \ L_SEL. In order to apply Lemma 6 (a) to the query set L^+, we need the condition 2|L^+| ≤ |L_SEL|. This condition is seen to hold after combining Lemma 2 with our assumption: 2|L^+| ≤ 4|L| ≤ |L_SEL|. Finally, inequality (B) holds because any adversarial strategy using query set L can also be used with the larger query set L^+ ⊇ L.

Note also that Theorem 4 and Lemma 5 imply the following statement about the optimality of PRED over 0-forked query sets.
Corollary 8.
For all trees T, for all cutsize budgets K, and for all 0-forked query sets L+ ⊆ V, the number of mistakes made by PRED satisfies m_PRED(L+, K) ≤ LB(L+, K).

In the rest of this section we derive a more interpretable bound on m_PRED(L+, y) based on the function Ψ* introduced in [6]. To this end, we prove that L_SEL minimizes Ψ* up to constant factors, and thus is an optimal query set according to the analysis of [6]. For any subset V' ⊆ V, let Γ(V', V \ V') be the number of edges between nodes of V' and nodes of V \ V'. Using this notation, we can write

Ψ*(L) = max_{∅ ≠ V' ⊆ V \ L} |V'| / Γ(V', V \ V').

Lemma 9.
For any tree T = (V, E) and any L ⊆ V the following holds.

(a) A maximizer of |V'| / Γ(V', V \ V') exists which is included in the node set of a single component of T \ L;

(b) Ψ*(L) ≤ Υ(L, 1).

Proof. Let V'_max be any maximizer of |V'| / Γ(V', V \ V'). Assume that the nodes of V'_max belong to k ≥ 2 components T_1, T_2, ..., T_k of T \ L. Let V'_i ⊂ V'_max be the subset of nodes included in the node set of T_i, for i = 1, ..., k. Then |V'_max| = Σ_{i ≤ k} |V'_i| and Γ(V'_max, V \ V'_max) = Σ_{i ≤ k} Γ(V'_i, V \ V'_i). Now let i* = argmax_{i ≤ k} |V'_i| / Γ(V'_i, V \ V'_i). Since (Σ_i a_i) / (Σ_i b_i) ≤ max_i a_i/b_i for all a_i ≥ 0 and b_i > 0, we immediately obtain |V'_{i*}| / Γ(V'_{i*}, V \ V'_{i*}) ≥ |V'_max| / Γ(V'_max, V \ V'_max), so that V'_{i*}, whose nodes lie in the single component T_{i*}, is also a maximizer. This proves (a). Part (b) is an immediate consequence of (a).

Lemma 10.
For any tree T = (V, E) and any 0-forked subset L+ ⊆ V we have Υ(L+, 1) ≤ 2 Ψ*(L+).

Proof. Let T_max be the largest component of T \ L+ and V_max be its node set. Since L+ is a 0-forked query set, T_max must be either a 1-hinge-tree or a 2-hinge-tree. Since the only edges that connect a hinge-tree to external nodes are the edges leading to connection nodes, we find that Γ(V_max, V \ V_max) ≤ 2. We can now write

Ψ*(L+) = max_{∅ ≠ V' ⊆ V \ L+} |V'| / Γ(V', V \ V') ≥ |V_max| / Γ(V_max, V \ V_max) ≥ |V_max| / 2 = Υ(L+, 1) / 2,

thereby concluding the proof.

Lemma 11.
For any tree T = (V, E) and any subset L ⊆ V we have Ψ*(L+) ≤ Ψ*(L).

Proof. Let V'_max be any set maximizing Ψ*(L+). Since V'_max ⊆ V \ L+, V'_max cannot contain any node of L ⊆ L+. Hence

Ψ*(L) = max_{∅ ≠ V' ⊆ V \ L} |V'| / Γ(V', V \ V') ≥ |V'_max| / Γ(V'_max, V \ V'_max) = Ψ*(L+),

which concludes the proof.

We now put together the previous lemmas to show that the query set L_SEL minimizes Ψ* up to constant factors.

Theorem 12.
For any tree T = (V, E) we have

Ψ*(L_SEL) ≤ 2 min_{L ⊆ V : |L| ≤ |L_SEL|/4} Ψ*(L).

Proof. Let L be a query set such that |L| ≤ |L_SEL|/4. Then we have the following chain of inequalities:

Ψ*(L_SEL) ≤ Υ(L_SEL, 1)   (Lemma 9(b))
  ≤ Υ(L+, 1)   (Lemma 6(b))
  ≤ 2 Ψ*(L+)   (Lemma 10)
  ≤ 2 Ψ*(L).   (Lemma 11)

In order to apply Lemma 6(b), we need the condition 2|L+| ≤ |L_SEL|. This condition holds because, by Lemma 2, 2|L+| ≤ 4|L| ≤ |L_SEL|.

Finally, as promised, the following corollary contains an interpretable mistake bound for PRED run with a query set returned by SEL.

Corollary 13.
For any labeled tree (T, y), the number of mistakes made by PRED when run with query set L+_SEL satisfies

m_PRED(L+_SEL, y) ≤ 2 min_{L ⊆ V : |L| ≤ |L+_SEL|/8} Ψ*(L) Φ_T(y).

Proof. Observe that PRED assigns labels to the nodes in V \ L+_SEL so as to minimize the resulting cutsize given the labels in the query set L+_SEL. We can then invoke [6, Lemma 1], which bounds the number of mistakes made by the mincut strategy in terms of the function Ψ* and the cutsize. This yields

m_PRED(L+_SEL, y) ≤ Ψ*(L+_SEL) Φ_T(y)   ([6, Lemma 1])
  ≤ Ψ*(L_SEL) Φ_T(y)   (A)
  ≤ 2 Ψ*(L) Φ_T(y).   (Theorem 12)

Inequality (A) holds because L_SEL ⊆ L+_SEL, and thus T \ L+_SEL has connected components of smaller size than those of T \ L_SEL. In order to apply Theorem 12, we need the condition 4|L| ≤ |L_SEL|, which follows from a simple combination of Lemma 2 and our assumptions: 4|L| ≤ |L+_SEL|/2 ≤ |L_SEL|.

Remark 2.
A mincut algorithm exists which efficiently predicts even when the query set L is not 0-forked (thereby gaining a factor of 2 in the cardinality of the competing query sets L; see Theorem 7 and Corollary 13). This algorithm is a "batch" variant of the TreeOpt algorithm analyzed in [7], and it can be implemented in such a way that the total time for predicting the |V| − |L| labels is O(|V|).

A key aspect of the query selection task is deciding when to stop asking queries. Since the more queries are asked, the fewer mistakes are made afterwards, a reasonable way to deal with this trade-off is to minimize the number of queries issued during the selection phase plus the number of mistakes made during the prediction phase. For a given pair A = ⟨S, P⟩ of selection and prediction algorithms, we denote by [q+m]_A the sum of the queries made by S and the prediction mistakes made by P. Similarly to m_A introduced in Section 4, [q+m]_A has to scale with the cutsize Φ_T(y) of the labeled tree (T, y) under consideration.

As a simple example of computing [q+m]_A, consider a line graph T = (V, E). Since each query set on T is 0-forked, Theorem 4 and Corollary 8 ensure that an optimal strategy for selecting the queries in T is choosing a sequence of nodes such that the distance between any pair of neighboring nodes in L is equal. The total number of mistakes that can be forced on V \ L is, up to a constant factor, (|V|/|L|) Φ_T(y). Hence, the optimal value of [q+m]_A is about

|L| + (|V|/|L|) Φ_T(y).   (1)

Minimizing the above expression over |L| clearly requires knowledge of Φ_T(y), which is typically unavailable. In this section we investigate a method for choosing the number of queries when the labeling is known to be sufficiently regular, that is, when a bound K is known on the cutsize Φ_T(y) induced by the adversarial labeling. We now show that when such a bound K is known, a simple modification of SEL (we call it
SEL⋆) exists which optimizes the [q+m]_A criterion. This means that the combination of SEL⋆ and PRED can trade off optimally (up to constant factors) queries against mistakes.

Given a selection algorithm S and a prediction algorithm P, define [q+m]_{⟨S,P⟩} by

[q+m]_{⟨S,P⟩} = min_{Q ≥ 1} ( Q + m_P(L_S(Q), K) ),

where L_S(Q) is the query set output by S given query budget Q, and m_P(L_S(Q), K) is the maximum number of mistakes made by P with query set L_S(Q) on any labeling y with Φ_T(y) ≤ K (see the definition in Section 4). Define also [q+m]_OPT = inf_{S,P} [q+m]_{⟨S,P⟩}, where OPT = ⟨S*, P*⟩ is an optimal pair of selection and prediction algorithms. If SEL knows the size of the query set L* selected by S*, so that SEL can choose a query budget Q = 8|L*|, then a direct application of Theorem 7 guarantees that |L+_SEL| + m_PRED(L+_SEL, K) ≤ 10 [q+m]_OPT. We now show that
SEL⋆, the announced modification of SEL, can efficiently search for a query set size Q such that Q + m_PRED(L+_SEL(Q), K) = O([q+m]_OPT) when only K, rather than |L*|, is known. In fact, Theorem 4 and Corollary 8 ensure that m_PRED(L+_SEL, K) = Θ(Υ(L+_SEL, K)). When K is given as side information, SEL⋆ can operate as follows. For each t ≤ |V|, the algorithm builds the query set L+_t and computes Υ(L+_t, K). Then it finds the smallest value t* minimizing t + Υ(L+_t, K) over all t ≤ |V|, and selects L_SEL⋆ ≡ L_{t*}. We stress that the above is only possible because the algorithm can estimate within constant factors its own future mistake bound (Theorem 4 and Corollary 8), and because the combination of SEL and PRED is competitive against all query sets whose size is a constant fraction of |L+_SEL| (see Theorem 7). Putting everything together, we have shown the following result.

Theorem 14.
For all trees T, all cutsize budgets K, and all labelings y such that Φ_T(y) ≤ K, the combination of SEL⋆ and PRED achieves |L_SEL⋆| + m_PRED(L+_SEL⋆, K) = O([q+m]_OPT) when K is given to SEL⋆ as input.

Just to give a few simple examples of how SEL⋆ works, consider a star graph. It is not difficult to see that in this case t* = 1 independent of K, i.e., SEL⋆ always selects the center of the star, which is intuitively the optimal choice. If T is the "binary star system" mentioned in the introduction, then t* = 2 and SEL⋆ always selects the centers of the two stars, again independent of K. At the other extreme, if T is a line graph, then SEL⋆ picks the query nodes in such a way that the distance between two consecutive nodes of L in T is (up to a constant factor) equal to √(|V|/K). Hence |L| = Θ(√(|V| K)), which is the minimizer of (1) over |L| when Φ_T(y) ≤ K.

(In [1] a labeling y of a graph G is said to be α-balanced if, after the elimination of all φ-edges, each connected component of G is not smaller than α|V| for some known constant α ∈ (0, 1). In the case of labeled trees, the α-balancing condition is stronger than our regularity assumption. This is because any α-balanced labeling y implies Φ_T(y) ≤ 1/α − 1. In fact, getting back to the line graph example, we immediately see that, if y is α-balanced, then the optimal number of queries |L| is of order √(|V|(1/α − 1)), which is also inf_A [q+m]_A.)

In this section we provide a general lower bound for prediction on arbitrary labeled graphs (G, y). We then contrast this lower bound with some results contained in Afshani et al. [1]. Let Φ^R_G(y) be the sum of the effective resistances (see, e.g., [12]) of the φ-edges of G = (V, E). The theorem below shows that any prediction algorithm using any query set L with |L| ≤ |V|/2 makes at least order of Φ^R_G(y) mistakes. This lower bound holds even if the algorithm is allowed to use a randomized adaptive strategy for choosing the query set L, that is, a randomized strategy where the next node of the query set is chosen after receiving the labels of all previously chosen nodes.

Theorem 15.
Given a labeled graph (G, y), for all K ≤ |V|/2 there exists a randomized labeling strategy such that, for all prediction algorithms A choosing a query set of size |L| ≤ |V|/2 via a possibly randomized adaptive strategy, the expected number of mistakes made by A on the remaining nodes V \ L is at least K/4, while Φ^R_G(y) ≤ K.

The above lower bound (whose proof is omitted) appears to contradict an argument by Afshani et al. [1, Section 5]. This argument establishes that for any ε > 0 there exists a randomized algorithm using at most K ln(3/ε) + K ln(|V|/K) + O(K) queries on any given graph G = (V, E) with cutsize K, and making at most ε|V| mistakes on the remaining vertices. This contradiction is easily obtained through the following simple counterexample: assume G is a line graph where all node labels are +1 except for K = o(|V|/ln |V|) randomly chosen nodes, which are also given random labels. For all ε = o(K/|V|), the above argument implies that order of K ln |V| = o(|V|) queries are sufficient to make at most ε|V| = o(K) mistakes on the remaining nodes, among which Ω(K) have random labels, which is clearly impossible.

In this section we describe an efficient implementation of
SEL and PRED. We will show that the total time needed for selecting Q queries is O(|V| log Q), the total time for predicting the remaining |V| − Q nodes is O(|V|), and the overall memory space is again O(|V|).

In order to locate the largest subtree of T \ L_{t−1}, the algorithm maintains a priority deque [11] D containing at most Q items (if t = 1, the priority deque D is empty). This data structure enables one to find and eliminate the item with the smallest (resp., largest) key in time O(1) (resp., O(log Q)). In addition, the insertion of a new element takes time O(log Q). Each item in D has two records: a reference to a node in T and the priority key associated with that node. Just before the selection of the t-th query node i_t, the references point to nodes of the Q largest subtrees in T \ L_{t−1}, while the corresponding keys are the sizes of such subtrees. Hence, at time t the item top of D having the largest key points to a node in T^t_max.

First, during an initialization step, SEL creates, for each edge (i, j) ∈ E, a directed edge [i, j] from i to j and the twin directed edge [j, i] from j to i. During the construction of L_SEL the algorithm also stores and maintains the current size σ(D) of D, i.e., the total number of items contained in D. We first describe the way SEL finds node i_t in T^t_max. Then we will see how SEL can efficiently augment the query set L_SEL to obtain L+_SEL.

Starting from the node r of T^t_max referred to by D, SEL performs a depth-first visit of T^t_max, followed by the elimination of the item with the largest key in D. For the sake of simplicity, consider T^t_max as rooted at node r. Given any edge (i, j), we let T_i and T_j be the two subtrees obtained from T^t_max after removing edge (i, j), where T_i contains node i and T_j contains node j. During each backtracking step of the depth-first visit from a node i to a node j, SEL stores the number of nodes |T_i| contained in T_i.
This number gets associated with the directed edge [j, i]. Observe that this task can be accomplished very efficiently, since |T_i| is equal to 1 plus the number of nodes in the union of the subtrees T_{c(i)} over all children c(i) of i. These numbers can be recursively calculated by summing the size values that SEL associated with the directed edges [i, c(i)] in the previous backtracking steps. Just after storing the value |T_i|, the algorithm also stores |T_j| = |T^t_max| − |T_i| and associates this value with the twin directed edge [i, j]. The size of T^t_max is then stored in D as the key record of the pointer to node r.

It is now important to observe that the quantity σ(T^t_max, i) used by SEL (see Section 3) is simply the largest key associated with the directed edges [i, j] over all j such that (i, j) is an edge of T^t_max. Hence, a new depth-first visit is enough to find in time O(|T^t_max|) the t-th node i_t = argmin_{i ∈ T^t_max} σ(T^t_max, i) selected by SEL. Let N(i_t) be the set of all nodes adjacent to node i_t in T^t_max. For all nodes i' ∈ N(i_t), SEL compares |T_{i'}| to the smallest key bottom stored in D. We have three cases:

1. If |T_{i'}| ≤ bottom and σ(D) ≥ Q − t, then the algorithm does nothing, since T_{i'} (or subtrees thereof) will never be largest in the subsequent steps of the construction of L_SEL, i.e., there will not exist any node i_{t'} with t' > t such that i_{t'} ∈ T_{i'}.

2. If |T_{i'}| ≤ bottom and σ(D) < Q − t, or if |T_{i'}| > bottom and σ(D) < Q, then SEL inserts a pointer to i' together with the associated key |T_{i'}|. Note that, since D is not full (i.e., σ(D) < Q), the algorithm need not eliminate any item in D.

3. If |T_{i'}| > bottom and σ(D) = Q, then SEL eliminates from D the item having the smallest key, and inserts a pointer to i', together with the associated key |T_{i'}|.

Finally, SEL eliminates node i_t and all edges (both undirected and directed) incident to it.
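The selection rule i_t = argmin_i σ(T^t_max, i) can be sketched for a single step as follows, where σ(T, i) is the size of the largest component of T \ {i}, computed from the subtree sizes obtained in one backtracking pass. The priority-deque bookkeeping across rounds is omitted, and all names are ours:

```python
from collections import defaultdict

def select_node(nodes, edges):
    """One SEL step (a sketch): in the current subtree, pick the node i
    minimizing sigma(T, i), the size of the largest component of T \\ {i}.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    n, root = len(nodes), nodes[0]
    parent, order, stack = {root: None}, [], [root]
    while stack:                       # iterative depth-first visit
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if w != parent[u]:
                parent[w] = u
                stack.append(w)
    size = {u: 1 for u in nodes}
    for u in reversed(order):          # "backtracking" pass: subtree sizes
        if parent[u] is not None:
            size[parent[u]] += size[u]

    def sigma(i):
        # Components of T \ {i}: the child subtrees plus the upward side.
        comps = [size[w] for w in adj[i] if parent[w] == i]
        if parent[i] is not None:
            comps.append(n - size[i])
        return max(comps) if comps else 0

    return min(nodes, key=sigma)

# On a path 0-1-2-3-4, the minimizer of sigma is the middle node.
print(select_node(list(range(5)), [(i, i + 1) for i in range(4)]))  # -> 2
```

Each call performs two linear passes, matching the O(|T^t_max|) cost per selection step claimed in the text.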
Note that this elimination implies that we can easily perform a depth-first visit within T^s_max for each s ≤ Q, since T^s_max is always completely disconnected from the rest of the tree T.

In order to turn L_SEL into L+_SEL, the algorithm proceeds incrementally, using a technique borrowed from [7]. Just after the selection of the first node i_1, a depth-first visit starting from i_1 is performed. During each backtracking step of this visit, the algorithm associates with each edge (i, j) the one of the two nodes i and j that is closer to i_1. In other words, SEL assigns a direction to each undirected edge (i, j) so as to be able to efficiently find the path connecting each node i to i_1. (In the initial step t = 1, i.e., when T^t_max ≡ T, the starting node r can be chosen arbitrarily.) When the t-th node i_t is selected, SEL follows these edge directions from i_t towards i_1. Let us denote by π(i, j) the path connecting node i to node j. During the traversal of π(i_1, i_t), the algorithm assigns a special mark to each visited node, until it reaches the first node j ∈ π(i_1, i_t) which has already been marked. Let η(i, L) be the maximum number of edge-disjoint paths connecting i to nodes in the query set L. Observe that all nodes i for which η(i, L_t) > η(i, L_{t−1}) must necessarily belong to π(i_t, j). We have η(i_t, L_t) = 1, and η(i, L_t) = 2 for all internal nodes i in the path π(i_t, j). Hence, j is the unique node that we may need to add as a new fork node (if j ∉ FORK(L_{t−1})). In fact, j is the unique node such that the number of edge-disjoint paths connecting it to query nodes may increase, and actually become larger than 2. Therefore, if j ∈ L+_{t−1} we need not add any fork node at this round. On the other hand, if j ∉ L+_{t−1} then η(j, L_{t−1}) = 2, which implies η(j, L_t) = 3. This is the case when SEL adds j as a new fork node to the query set under construction.

In order to bound the total time required by
SEL for selecting Q nodes, we rely on Lemma 3, showing that |T^t_max| ≤ |V|/t. The two depth-first visits performed for each node i_t take O(|T^t_max|) steps. Hence the overall running time spent on the depth-first visits is O(Σ_{t ≤ Q} |V|/t) = O(|V| log Q). The total time spent for incrementally finding the fork nodes of L_SEL is linear in the number of nodes marked by the algorithm, which is at most |V|. Finally, handling the priority deque D takes at most |V| times the worst-case time for eliminating an item with the smallest (or largest) key or for adding a new item. This is again O(|V| log Q).

We now turn to the implementation of the prediction phase. PRED operates in two phases. In the first phase, the algorithm performs a depth-first visit of each hinge-tree T', starting from each connection node (thereby visiting the nodes of each 1-hinge-tree once, and the nodes of each 2-hinge-tree twice). During these visits, we add to each node a tag containing (i) the label of the connection node i_{T'} from which the depth-first visit started, and (ii) the distance between i_{T'} and the currently visited node. In the second phase, we perform a second depth-first visit, this time on the whole tree T. During this visit, we predict each node i ∈ V \ L with the label coupled with the smaller distance stored in the tags of i. The total time of these visits is linear in |V|, since each node of T gets visited at most three times.

The results proven in this paper characterize, up to constant factors, the optimal algorithms for adversarial active learning on trees in two main settings. In the first setting the goal is to minimize the number of mistakes on the non-queried vertices under a given query budget. In the second setting the goal is to minimize the sum of queries and mistakes with no restriction on the number of queries.

An important open question is the extension of our results to the general case of active learning on graphs.
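One route to graphs, in the spirit of the spanning-tree reduction discussed below, is to first extract a spanning tree and then run the tree machinery on it. The following sketch builds a spanning tree by giving the edges random weights and keeping a minimum spanning forest; this is a cheap surrogate, not the uniformly random spanning tree used in the cited works, and all names are ours:

```python
import random

def random_spanning_tree(nodes, edges, seed=0):
    """Spanning tree via random edge weights + Kruskal with union-find.

    Note: this does NOT sample uniformly among spanning trees; it is
    only a simple stand-in for illustration.
    """
    rng = random.Random(seed)
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for u, v in sorted(edges, key=lambda _: rng.random()):
        ru, rv = find(u), find(v)
        if ru != rv:                  # edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Two 4-cliques joined by a bridge (nodes 0..3 and 4..7, bridge 3-4):
left = [(i, j) for i in range(4) for j in range(i + 1, 4)]
right = [(i + 4, j + 4) for (i, j) in left]
tree = random_spanning_tree(range(8), left + right + [(3, 4)])
print(len(tree))       # -> 7: a spanning tree of 8 nodes has 7 edges
print((3, 4) in tree)  # -> True: the bridge lies in every spanning tree
```

The resulting edge list can be fed directly to tree-based selection such as the single-step sketch above; as the text notes, however, the quality of the query sets so obtained depends heavily on which spanning tree is drawn.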
While a direct characterization of optimality on general graphs is likely to require new analytical tools, an alternative line of attack is reducing the graph learning problem to the tree learning problem via the use of spanning trees. Certain types of spanning trees, such as random spanning trees, are known to summarize well the graph structure relevant to passive learning (see, e.g., [7, 8, 13]). In the case of active learning, however, we want good query sets on the graph to remain good on the spanning tree, and this need not be the case. (As a footnote to the prediction phase described in the previous section: if node i belongs to a 1-hinge-tree, PRED simply predicts y_i with the unique label stored in its tag.) Consider, for instance, a graph made up of m cliques connected through bridges, so that each clique is connected to, say, k other cliques. The breadth-first spanning tree of this graph is a set of connected stars. This tree clearly reveals a query set (the star centers) which is good for regular labelings (cf. the binary star system example of Section 1). On the other hand, for certain choices of m and k a random spanning tree has a good probability of hiding the clustered nature of the original graph, thus leading to the selection of bad query sets.

In order to gain intuition about this phenomenon, we are currently running experiments on various real-world graphs using different types of spanning trees, where we measure the number of mistakes made by our algorithm (for various choices of the budget size) against common baselines. We also believe that an extension of our algorithm to general graphs does actually exist. However, the complexity of the methods employed in [6] suggests that techniques based on minimizing Ψ* on general graphs are computationally very expensive.

Finally, it would be interesting to combine active learning techniques on the nodes of a graph with those for predicting links (e.g., [9, 10]).

Acknowledgments.
This work was supported in part by Google Inc. through a Google ResearchAward and by the PASCAL2 Network of Excellence under EC grant no. 216886. This publicationonly reflects the authors’ views.
References

[1] P. Afshani, E. Chiniforooshan, R. Dorrigiv, A. Farzan, M. Mirzazadeh, N. Simjour, and H. Zarrabi-Zadeh. On the complexity of finding an unknown cut via vertex queries. In COCOON 2007, pages 459–469.

[2] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In COLT 2004, pages 624–638.

[3] Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. In Semi-Supervised Learning, pages 193–216. MIT Press, 2006.

[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML 2001, pages 19–26.

[5] A. Blum, J. Lafferty, R. Reddy, and M.R. Rwebangira. Semi-supervised learning using randomized mincuts. In ICML 2004.

[6] A. Guillory and J. Bilmes. Label selection on graphs. In NIPS 2009.

[7] N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction of a labeled tree. In COLT 2009.

[8] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of weighted graphs. In Proceedings of the 27th International Conference on Machine Learning. Omnipress, 2010.

[9] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. A linear time active learning algorithm for link classification. In Proc. of the 26th Conference on Neural Information Processing Systems (NIPS 2012).

[10] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. A correlation clustering approach to link classification in signed networks. In Proc. of the 25th Conference on Learning Theory (COLT 2012).

[11] J. Katajainen and F. Vitale. Navigation piles with applications to sorting, priority queues, and priority deques. Nordic Journal of Computing, 10(3):238–262, 2003.

[12] R. Lyons and Y. Peres. Probability on Trees and Networks. Manuscript, 2009.

[13] F. Vitale, N. Cesa-Bianchi, C. Gentile, and G. Zappella. See the tree through the lines: the Shazoo algorithm. In Proc. of the 25th Annual Conference on Neural Information Processing Systems, pages 1584–1592. Curran Associates, 2012.

[14] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003.
Figure 2: The SEL algorithm at work. The upper pane shows the initial tree T = T^1_max (in the box tagged with "1") and the subsequent subtrees T^2_max, T^3_max, T^4_max, and T^5_max. The left pane also shows the nodes selected by SEL in chronological order. The four lower panes show the connected components of T \ L_t resulting from this selection. Observe that at the end of round 3, SEL detects the generation of fork node 3'. This node gets stored, and is added to L_SEL at the end of the selection process.
Figure 3: The upper pane illustrates the different kinds of nodes chosen by SEL. Numbers in the square tags indicate the first six subtrees T^t_max, and their associated nodes i_t, selected by SEL. Node i_1 is a [≥1; ≥1]-node, i_2 is an initial [0; ≥1]-node, i_3 is a (noninitial) [0; ≥1]-node, i_4 is an initial collision node, i_5 is a (noninitial) collision node, and i_6 is a [0; 0]-node. As in Figure 2, we denote by 3' the fork node generated by the inclusion of i_3 into L_SEL. Note that node i_6 may be chosen arbitrarily among the four nodes in T^6_max. The two black nodes are the set of nodes we are competing against, i.e., the nodes in the query set L. Forest T \ L is made up of one large subtree and two small subtrees. In the lower panes we illustrate some steps of the proof of Lemma 6, with reference to the upper pane. Time t = 2: trees T^2_max and T_{i_2} are shown. As explained in the proof, |T_{i_2}| ≤ |T^2_max \ T_{i_2}|. The circled black node is captured by i_2. The nodes of tree T^2_max \ T_{i_2} are shaded, and can be used for mapping any ζ-component through µ. Time t = 3: trees T^3_max and T_{i_3} are shown. Again, one can easily verify that |T_{i_3}| ≤ |T^3_max \ T_{i_3}|. As before, the nodes of T^3_max \ T_{i_3} are shaded, and can be used for mapping any ζ-component via µ. The reader can see that, according to the injectivity of µ, these grey nodes are well separated from the ones in T^2_max \ T_{i_2}. Time t = 4: T^4_max and the initial collision node i_4 are depicted. The latter is enclosed in a circled black node since it captures itself. Time t = 5, 6: we depict trees T^5_max and T^6_max, together with nodes i_5 and i_6. Node i_5 is a collision node, which is not initial since it was already captured by the [0; ≥1]-node i_2. Node i_6 is a [0; 0]-node, so that the whole tree T^6_max is completely included in a component (the largest, in this case) of T \ L. Tree T^6_max can be used for mapping any ζ-component via µ. The resulting forest T \ L_6 includes several single-node trees and one two-node tree. If i_6 is the last node selected by SEL, then each component of T \ L_6 can be exploited by the mapping µ, since in this specific case none of these components contains nodes of L, i.e., there are no ζ-components.