From Weakly Supervised Learning to Biquality Learning: an Introduction
Pierre Nodet, Vincent Lemaire, Alexis Bondu, Antoine Cornuéjols, Adam Ouorou
Pierre Nodet (Orange Labs, Châtillon, France)
Vincent Lemaire (Orange Labs, Lannion, France)
Alexis Bondu (Orange Labs, Châtillon, France)
Antoine Cornuéjols (AgroParisTech, Paris, France)
Adam Ouorou (Orange Labs, Châtillon, France)
Abstract—The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of "supervision deficiencies". In WSL use cases, a variety of situations exists where the collected "information" is imperfect. The paradigm of WSL attempts to list and cover these problems with associated solutions. In this paper, we review the research progress on WSL with the aim of making it a brief introduction to this field. We present the three axes of the WSL cube and an overview of most of the elements of its facets. We propose three measurable quantities that act as coordinates in the previously defined cube, namely: Quality, Adaptability and Quantity of information. We then suggest that the Biquality Learning framework can be defined as a plane of the WSL cube and propose to re-discover previously unrelated patches of the WSL literature as a unified Biquality Learning literature.
Index Terms—weakly, supervised, classification, prediction, noisy labels, trusted and untrusted data, ...
I. INTRODUCTION
In the field of machine learning, the task of classification can be performed by different approaches depending on the level of supervision of the training data. As shown in Figure 1, unsupervised, weakly supervised and supervised approaches form a continuum of possible situations, starting from the absence of ground truth and ending with complete and perfect ground truth. For the most part, the accuracy of the learned models increases as the level of supervision of the data increases. Additionally, the level of supervision of a dataset can be increased in return for a labelling cost. In [1], the authors indicate that an interesting goal could be to obtain a high accuracy while spending a low labelling cost.
Fig. 1. Classification of Classification from [1]

In Weakly Supervised Learning (WSL) use cases (e.g. fraud detection), a variety of situations exist where the collected ground truth is imperfect. In this context, the collected labels may suffer from bad quality, non-adaptability (defined in Section IV) or even insufficient quantity. For instance, an automatic labeling system could be used without any real guarantee that the data is complete, exhaustive and trustworthy. Alternatively, manual labelling is also problematic in practice, as obtaining labels from an expert is costly and the availability of experts is often limited. Consequently, there are many real-life situations where imperfect ground truth must be used because of practical considerations such as cost optimization, expert availability, or the difficulty of choosing each label with certainty. This general problem of supervision deficiency has attracted recent attention in the literature. The paradigm of
Weakly Supervised Learning attempts to list and cover these problems with associated solutions. The work of Zhou in [2] is a first successful effort to synthesise this domain. In this paper, the objective is threefold: (i) to suggest another view of WSL, (ii) to propose a larger and updated taxonomy compared to [2], and (iii) to highlight a new emergent view of a part of WSL, namely biquality learning.

The rest of this paper is organized as follows. In Section II, we present the three axes of the Weakly Supervised Learning cube and an overview of most of the elements of their facets. Section III gives additional elements which have to be taken into consideration at the crossroads of these three axes or when dealing with weakly supervised learning problems. Section IV suggests 3 key concepts which help to summarize WSL: Quantity, Quality and Adaptability. In Section V, these 3 concepts are used to draw links between some learning frameworks jointly used in WSL and in Biquality Learning. Section VI then gives examples of existing works on Biquality Learning. Finally, Section VII concludes this paper.

II. THE DIFFERENT WAYS OF LOOKING AT WEAK SUPERVISION
The taxonomy proposed in this paper is organised in the form of a "cube" and is presented in Figure 2. This section progressively presents the differences between weakly supervised approaches by going through the axes of this cube. First of all, a distinction must be made between strong and weak supervision. On the one hand, strong supervision corresponds to the regular case in machine learning where the training examples are expected to be exhaustively labelled with true labels, i.e. without any kind of corruption or deficiency.

Fig. 2. Taxonomy: an attempt - The big picture

On the other hand, weak supervision means that the available ground truth is imperfect, or even corrupted. The WSL field aims to address a variety of supervision deficiencies which can be categorized in a "cube" along the following three axes, as illustrated in Figure 2: inaccurate labels (Axis 1), inexact labels (Axis 2), incomplete labels (Axis 3). These three axes are detailed in the rest of this section and constitute the proposed taxonomy. The reader may note that the boundaries between these axes are not hard: i.e. a part could be moved from one axis to another or belong to two axes; this is a suggestion.

A. Axis 1: Inaccurate Supervision - True Labels vs. Inaccurate Labels
Lack of confidence in data sources is a frequent issue when it comes to real-life use cases. The values used as learning targets, also called labels or classes, can be incorrect due to many factors. In practice, a variety of situations can lead to inaccurate labels: (i) a label can be assigned to a "bag of examples", such as a bunch of keys. In this case, at least one of the examples in the keychain actually belongs to the class indicated by the label. Multi-instance learning [3]–[6] is an appropriate technique to deal with this type of learning task. (ii) a label may not be "guaranteed" and may be noisy. In theory, the learning set should be labeled in a way that is unbiased with respect to the concept to be learned. However, the data used in real-world applications provide an imperfect ground truth that does not match the concept to be learned. As defined in [7], noise is "anything that obscures the relationship between the features of an instance and its class". According to this definition, every error or imprecision in an attribute or label is considered as noise, including human deficiency. Noise is not a trivial issue because its origin is never clearly obvious. In practical cases, this makes it difficult to evaluate the existence and strength of noise in a dataset. Frenay et al. in [8] provide a good overview of noise sources, the impact of labeling noise, types of noise and dependency to noise. Below is a non-exhaustive list of common ways to learn a model in the presence of labeling noise (note: the number of articles published on this topic has exploded in recent years):
• in case of marginal noise level, use a standard learning algorithm that is natively robust to label noise [9]–[12];
• use a loss function which solves theoretically (or empirically) the problem in case of (i) noise completely at random [13]; or (ii) class-dependent noise [14], [15]. In most cases, this type of approach is known in the literature as "Robust Learning to Label noise" (RLL);
• model the noise to assess the quality of each label (requires assumptions on the noise) [16];
• enforce consistency between the model's predictions and the proxy labels [17];
• clean the training set by filtering noisy examples [18]–[22];
• trust a subset of data provided by the user, in order to learn a model at once on trusted examples (without label noise) and untrusted ones [14], [23], [24].

Another kind of "noise" appears when each training example is not equipped with a single true label but with a set of candidate labels that contains the true label. To deal with this kind of training examples, Partial Label Learning (PLL) has been proposed [25] (also called ambiguously labeled learning). It has attracted attention, for example in the algorithms IPAL [26], PL-KNN [25], CLPL [27] and PL-SVM [28], or when suggesting semi-supervised partial label learning as in [29]. This setting is motivated, for example, by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partially-labeled training instances, and generalize to unseen data [30].

B. Axis 2: Inexact Supervision - Labels at the Right Proxy vs. not at the Right Proxy
The second axis describes inexact labeling, which is orthogonal to the first type of supervision deficiency, i.e. inexact labeling and noisy labeling may coexist. Here, the labels are provided not at the right proxy, which corresponds to one (or possibly a mixture) of the following situations:

• Proxy domain: the target domain differs between the training set and the test set. For instance, it could be learning to discriminate "panthers" from other savanna animals based on "cats" and "dogs" labels. Two cases can be distinguished: (i) training labels are available in another target domain than test labels, (ii) or training labels are available in a sub-domain that belongs to the original target domain. Domain transfer [31] or domain adaptation [32] are clearly suitable techniques to address these learning tasks.

• Proxy labels: some unlabeled examples are automatically labeled, either by a rule-based system or by a learned model, in order to increase the size of the training set. These labels are called proxy labels and can be considered as coming from a proxy concept close to the one to be learned; only the true labels stand for the ground truth, as defined in Section IV-B. The way proxy labels are used varies depending on their origin. In the case where proxy labels are provided by the classifier itself without any additional supervision, self-training (ST) [33], co-training (CT) and their variants attempt to improve the learned model by including proxy labels in the training set as regular labels. Other approaches exploit the confidence level of the classifier to produce soft proxy labels, and then exploit them as weighted training examples [34]. In the case where proxy labels are generated by a rule-based system, the quality of labels depends on the expert knowledge which is manually inputted into the rules. Ultimately, a classifier learned from such labels can be considered as a means of smoothing the set of rules, allowing the end-user to score any new example. Some recent automatic labeling systems offer an intermediate way that mixes rule-based systems and machine learning approaches (MIX) [35], [36].

• Proxy individuals: the statistical individuals are not equally defined between the training set and the test set. For instance, it could be learning to classify images based on labels that only describe parts of the images. Multi-instance learning (MIL) is another example, which consists in learning from labeled groups of individuals. In the literature, many algorithms have been adapted to work within this paradigm [3]–[6].

C. Axis 3: Incomplete Supervision - Few Labels vs. Numerous
The third axis describes incomplete supervision, which consists of processing a partially labeled training set. In this situation, labeled and unlabeled examples coexist within the training set, and it is assumed that there are not enough labeled examples to train a performing classifier. The objective is to use the entire training set, including the unlabeled examples, to achieve better classification performance than learning a classifier only from the labeled examples. In the literature, many techniques exist that are capable of processing partially labeled training data, i.e. active learning (AL), semi-supervised learning (SSL), positive unlabeled learning (PUL), self-training (ST) and co-training (CT). At the bottom of Figure 2, we suggest sorting these methods according to the quantity of labeled examples they require. All these approaches are detailed below.
1) Active Learning (AL) [37]:
Modern supervised learning approaches are known to require large amounts of training examples in order to achieve their best performance. These examples are mainly obtained from human experts who label them manually, making the labelling process costly in practice. Active learning (AL) is a field that includes all the selection strategies that allow one to iteratively build the training set of a model in interaction with a human expert (also called an oracle). The aim is to select the most informative examples to minimize the labelling cost.

Active learning is an iterative process that continues until a labelling budget is exhausted or a predefined performance threshold is reached. Each iteration begins with the selection of the most informative example. This selection is generally based on information collected during previous iterations (predictions of a classifier, density measures, etc.). The selected example is then submitted to the oracle, which returns the associated class, and the example is added to the training set (L). The new training set is then used to improve the model, and the new predictions are used to perform the next iteration. In conventional heuristics, the utility measures used by active learning strategies [37] differ in their positioning with respect to the trade-off between exploiting the current classifier and exploring the training data. Selecting an unlabelled example in an unknown region of the observation space helps to explore the data, so as to limit the risk of learning a hypothesis that is too specific to the set L of currently labeled examples. Conversely, selecting an example in an already sampled region allows one to locally refine the predictive model. We do not intend to provide an exhaustive overview of existing AL strategies and refer to [37], [38] for a detailed overview, [39]–[41] for some recent benchmarks, and [42] for a new way to treat uncertainty.

Another meta active learning paradigm exists, which combines conventional strategies using bandit algorithms [43]–[48]. These meta-learning algorithms intend to select online the best AL strategy according to the observed improvements of the classifier. These algorithms are capable of adapting their choice over time as the classifier improves. However, learning must be done using few examples to be useful, and these kinds of algorithms suffer from the cold start problem. In addition, these approaches are limited to combining existing heuristic AL strategies. Other meta-active-learning algorithms have been developed to learn an AL strategy from scratch, using multiple source datasets. These algorithms are used by transferring the learned AL strategy to new target datasets [49]–[51]. Most of them are based on modern reinforcement learning methods. The major challenge is to learn an AL strategy general enough to automatically control the exploitation/exploration trade-off when used on new unlabeled datasets (which is impossible using heuristic strategies). A recent evaluation of learning active learning can be found in [52].
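As a minimal illustration of the exploitation side of these heuristic strategies, least-confidence (uncertainty) sampling can be sketched in a few lines. The helper below is a generic sketch of ours, not an implementation from the cited works:

```python
import numpy as np

def least_confident(probas, n=1):
    """Least-confidence sampling: pick the n unlabeled examples whose
    highest predicted class probability is lowest, i.e. the examples
    the current classifier is least sure about."""
    confidence = probas.max(axis=1)    # top predicted probability per example
    return np.argsort(confidence)[:n]  # indices of the least confident examples

# Each AL iteration would submit these indices to the oracle, add the
# returned labels to L, retrain the classifier, and recompute `probas`.
```

With predicted probabilities `[[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]`, the strategy selects the second example first, since a top probability of 0.5 signals maximal uncertainty for a binary classifier.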
2) Semi Supervised Learning (SSL):
Early work on semi-supervised learning dates back to the 2000s; an overview of these pioneering papers can be found in [53]–[57]. In the literature, SSL approaches can be categorized into two groups:

• Algorithms that use unlabeled examples unchanged. In this case, the unlabeled examples are treated as unsupervised information added to the labeled examples. Four main categories exist: generative methods, graph-based methods, low-density separation methods, and disagreement-based methods [2].

• Semi-supervised learning algorithms that produce proxy labels on unlabeled examples, which are used as targets in addition to the labeled examples. These proxy labels are produced by the model itself or by its variants, without any additional supervision. They are not strictly ground truth, but may nevertheless be useful for learning. In the end, these inaccurate labels (see Section II-A) can be considered as noisy.

The rest of this section deals with particular cases of SSL and presents the Positive Unlabeled Learning, Self-Training and Co-Training approaches.
3) Positive Unlabeled Learning (PUL):
Learning from Positive and Unlabeled examples (PUL) is a special case of binary classification and SSL [58]. In this particular setting, the unlabeled examples may contain both positive and negative examples with hidden labels. These approaches differ from one-class classification [59] since they explicitly use the unlabeled examples in the learning process. In the literature, PUL approaches can be divided into three groups: (i) the two-step techniques, (ii) biased learning and (iii) class prior incorporation techniques. The two-step techniques [60] consist in: (1) identifying reliable negative examples and optionally generating additional positive examples [61]; (2) using supervised or semi-supervised learning approaches which process the positively labeled examples, the reliable negative examples, and the remaining unlabeled examples; (3) (when applicable) selecting the best classifier generated in Step 2. Biased learning approaches consider PU data as fully labeled examples with noisy negative labels. Finally, class prior incorporation approaches modify standard learning algorithms by applying the mathematics of the SCAR setup (see Section III-B).
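Step (1) of the two-step techniques can be illustrated with a deliberately simple heuristic. The sketch below is ours and is much cruder than the methods of [60], [61]: it treats the unlabeled examples farthest from the positive centroid as reliable negatives, which step (2) would then feed to any supervised learner:

```python
import numpy as np

def reliable_negatives(X_pos, X_unl, frac=0.2):
    """Step 1 of a two-step PU technique (simplified sketch):
    flag the unlabeled points farthest from the positive centroid
    as reliable negatives."""
    centroid = X_pos.mean(axis=0)
    dists = np.linalg.norm(X_unl - centroid, axis=1)
    n = max(1, int(frac * len(X_unl)))
    return np.argsort(dists)[-n:]  # indices of the n farthest unlabeled points
```

On toy data with positives clustered near the origin, an unlabeled point far from that cluster is the one selected as a reliable negative.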
4) Self Training (ST):
Self-training does not have a clear definition in the literature; it can be viewed as a "single-view weakly supervised algorithm". First, a classifier is trained from the available labeled examples, and this classifier is then used to make predictions and build new proxy labels. Only those examples whose confidence in the proxy labelling exceeds a certain threshold are added to the training set. Then, the classifier is retrained from the training set enriched with the proxy labels. This process is repeated in an iterative way [33].
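The iterative loop just described can be sketched generically, for any classifier exposing `fit` and `predict_proba`. Both the loop and the toy nearest-centroid classifier below are illustrative sketches of ours, not implementations from [33]:

```python
import numpy as np

class NearestCentroid:
    """Minimal demo classifier: predicts by distance to class centroids."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        e = np.exp(-d)  # softmax over negative distances as a crude confidence
        return e / e.sum(axis=1, keepdims=True)

def self_train(clf, X_lab, y_lab, X_unl, threshold=0.95, max_iter=10):
    """Self-training sketch: iteratively add the unlabeled examples whose
    top predicted probability exceeds `threshold` as proxy-labeled data."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    remaining = X_unl.copy()
    for _ in range(max_iter):
        clf.fit(X_lab, y_lab)
        if len(remaining) == 0:
            break
        probas = clf.predict_proba(remaining)
        keep = probas.max(axis=1) >= threshold
        if not keep.any():
            break  # no confident proxy labels left to add
        X_lab = np.vstack([X_lab, remaining[keep]])
        y_lab = np.concatenate([y_lab, probas[keep].argmax(axis=1)])
        remaining = remaining[~keep]
    return clf, X_lab, y_lab
```

On well-separated toy data, the two unlabeled points are absorbed with the proxy labels of their nearest centroid after a single iteration.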
5) Co-Training (CT) [62]–[65]:
Starting from a set of partially labeled examples, co-training algorithms aim to increase the amount of labeled examples by generating proxy labels. Co-training algorithms work by training several classifiers from the initial labeled examples. Then, these classifiers are used to make predictions and generate proxy labels on the unlabeled examples. The most confident of these proxy labels are then added to the set of labeled data for the next iteration.

One important aspect of co-training is the relationship between the views (the sets of explicative variables) used in learning the different models. The original co-training algorithm [62] states that the independence of the views is required to properly perform automatic labeling. More recent works [66]–[68] show that this assumption can be relaxed. Another requirement is to obtain at each iteration a "reasonable" classifier in terms of performance; this explains why we place co-training to the left of AL and SSL in Figure IV-A. In [69], a study is given on the optimal selection of the co-training parameters. Co-training can also be considered as a member of the "multi-view training" family, to which other members belong, such as: Democratic Co-learning [70], Tri-training [71], Asymmetric tri-training [72], and Multi-task tri-training [73], which are not described here.

III. OTHER KEY ELEMENTS - BEYOND THE AXES

A. Learning at the crossroads of the three axes
The use of a cube to describe the literature on Weakly Supervised Learning allows us to use not only the axes, but also the volume of the cube to position existing approaches. It is now easy to position more subtly the approaches that are related to several axes at once. For example, Partial Label Learning may be related to two supervision deficiencies: (i) inexact supervision, because multiple labels are provided for each training example; (ii) inaccurate supervision, because only one of the labels provided is correct. Positioning PLL on the plane defined by these two axes seems more relevant. Also, this representation allows us to highlight some interesting intersections, between two axes or between an axis and a plane. One of these points of interest is the origin of the three axes, which corresponds to the case where supervision is absolutely inaccurate, imprecise and incomplete, which ultimately amounts to unsupervised learning. Similarly, the point at the opposite end of the cube corresponds to perfectly precise, accurate and complete supervision, which equates to supervised learning. Finally, this representation could provide insights into the reasons why proven techniques from a particular subfield of Weakly Supervised Learning can be efficient in another one. For instance, DivideMix [74] chooses to reuse the efficient MixUp [75] approach from Semi-Supervised Learning to tackle the problem of Learning with Label Noise. This approach uses Data Augmentation [76] and Model Agreement [77] to estimate label probabilities and then discard or keep the provided labels. This section is not exhaustive; interested readers will be able to position the approaches of the literature in the cube themselves.
B. Deficiency Model
The deficiency model describes the nature of the supervision deficiency. It is usually described as a probability measure ρ : (x, y) ↦ ρ(x, y), indicating whether an example is corrupted or not. ρ can depend on the value of the explanatory variables x ∈ X, on the label value y ∈ Y, or on both (x, y). The different types of supervision deficiency described in this section are the following: (i) Completely At Random (CAR), (ii) At Random (AR) and (iii) Not At Random (NAR).

If the probability of being corrupted is the same for all training examples, ρ : (x, y) ↦ ρ_c with ρ_c ∈ [0, 1], then the supervision deficiency model is Completely At Random (CAR). This implies that the cause of the supervision deficiency is unrelated to the data. If the probability of being corrupted is the same within classes, ρ : (x, y) ↦ ρ_y, ∀y ∈ Y, ρ_y ∈ [0, 1], then the supervision deficiency model is At Random (AR). If neither CAR nor AR holds, then we speak of the Not At Random (NAR) model. Here the probability of being corrupted depends on both the sample and the label value, ρ : (x, y) ↦ ρ(x, y). These three deficiency models can be ranked in descending complexity: the NAR model is the most complex, as it depends on both the instance and the label value and requires a function to model it, down to the CAR model, where a single constant is enough to describe it. These models may help practitioners find links between supervision deficiencies. For example, PUL is SSL with only one class labeled, which means that the missingness of the label is linked to the label value, so PUL is an extreme case of SSL AR with ρ_+ = 1 − e and ρ_− = 1 (where e is called the propensity score). AL is another form of SSL where examples are labeled thanks to a strategy and to previously labeled instances, the ordered iterative process leading to non-iid labeled data. As such, AL is part of the SSL NAR family.
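The CAR and AR models above can be simulated directly, which is useful for benchmarking robust methods. The helper below is a hypothetical sketch of ours: it flips labels with a constant probability ρ_c (CAR) or a class-dependent probability ρ_y (AR), redrawing corrupted labels uniformly among the other classes:

```python
import numpy as np

def corrupt_labels(y, rho, n_classes, seed=None):
    """Inject label noise following the CAR or AR deficiency models.

    rho: a single float (CAR: same corruption probability for every
    example) or a dict mapping class -> probability (AR: class-dependent).
    A corrupted label is redrawn uniformly among the *other* classes.
    """
    rng = np.random.default_rng(seed)
    y = y.copy()
    probs = np.array([rho if np.isscalar(rho) else rho[c] for c in y])
    flip = rng.random(len(y)) < probs
    for i in np.flatnonzero(flip):
        others = [c for c in range(n_classes) if c != y[i]]
        y[i] = rng.choice(others)
    return y
```

With ρ = 0 the labels are untouched; with ρ = 1 on a binary problem every label is flipped; a dict such as `{0: 0.3, 1: 0.0}` yields AR noise touching only class 0. The NAR case would require replacing `probs` by a function of both x and y.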
We want to reiterate that the deficiency model can be applied to any supervision deficiency, even if it has mostly been featured in RLL and SSL.

C. Transductive Learning vs. Inductive Learning
In the WSL framework, one may be tempted to use the test set to guide the choice of the model. But in this case, we need to carefully decide whether a model that can predict on another test (deployment) dataset will be required in the future: two points of view can be considered, transductive learning vs. inductive learning, which is why we now add a note on them. Training a machine can take many forms: supervised learning, unsupervised learning, active learning, online learning, etc. The number of members in the family is large, and new members appear regularly, for example "federated learning". However, one may establish a separation between two classes based on the way the user would like to use the "learning machine" at the deployment stage. The user does not necessarily want a predictive model for subsequent use on new data, for example because they already have the complete data for the problem to be treated. It is therefore necessary to distinguish between inductive learning and transductive learning. On one side, the goal of inductive learning is, essentially, to learn a function (a model) which will later be used on new data to predict classes (classification) or numerical values (regression). The predictions may be seen as "by-products" of the model. Induction is reasoning from observed training cases to general rules, which are then applied to the test cases. On the other side, in transductive learning the goal is not to obtain a function or a model, but only to make predictions on a given test database, and only on this set of test instances.
Transduction was introduced by Vladimir Vapnik in the 1990s, motivated by the intuition that transduction is preferable to induction since induction requires solving a more general problem (inferring a function) before solving a more specific problem (computing outputs for new cases). However, the distinction between inductive and transductive learning can be a hazy border, for example in the case of semi-supervised learning. Knowing this, the view of Zhou in [2] about "pure semi-supervised learning" and transductive learning is interesting. The distinction between transductive and inductive learning concerns most of the learning forms included in Figure 2.

IV. THE COMMON CONCEPTS OF WSL

Until now, we have seen that many forms of learning and weakness are intertwined. A way to summarize their aspects was given in Figure 2. From this point of view, one may identify 3 common concepts, which are described now.
A. Quantity |L|

An insufficient quantity of labels or training examples occurs when many training examples are available but only a small portion is labeled, e.g. due to the cost of labelling. For instance, this occurs in the field of cyber security, where human forensics is needed to tag attacks. Usually, this issue is addressed by few-shot learning (FSL), active learning (AL) [37], semi-supervised learning (SSL) [55], self-training or co-training, which have been described briefly above in this paper. Another way to see the "quantity" could be the ratio between the number of labeled and unlabeled examples (p).

B. Quality q

In this case, all training examples are labeled but the labels may be corrupted. This usually happens when outsourcing labeling to crowd labeling. The Robust Learning to Label Noise (RLL) approaches tackle this problem [78], with three types of label noise identified: (i) the completely at random noise corresponds to a uniform probability of label change; (ii) the class-dependent label noise is when the probability of label change depends upon each class, with uniform label changes within each class; (iii) the instance-dependent label noise is when the probability of label change varies over the input space of the classifier. This last type of label noise is the most difficult to deal with, and typically requires making sometimes strong assumptions on the data.
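As an illustration of the loss-based RLL family mentioned above, a backward loss correction for class-dependent noise can be sketched as follows, assuming a noise transition matrix T (with T[i, j] = P(observed label j | true label i)) is known or estimated beforehand. The helper name and interface are ours, a sketch in the spirit of loss-correction approaches rather than an implementation from the cited papers:

```python
import numpy as np

def backward_corrected_loss(probs, noisy_label, T):
    """Backward loss correction for class-dependent label noise.

    probs: the model's predicted class probabilities, shape (K,)
    noisy_label: the observed (possibly corrupted) label index
    T: noise transition matrix, T[i, j] = P(observed j | true i)
    Returns a corrected loss whose expectation under the noise process
    matches the loss on clean labels.
    """
    # Per-class cross-entropy losses under the model's prediction.
    per_class_loss = -np.log(np.clip(probs, 1e-12, None))
    # Multiply by the inverse transition matrix, then read off the
    # entry of the observed label.
    corrected = np.linalg.inv(T) @ per_class_loss
    return corrected[noisy_label]
```

When T is the identity (no noise), the corrected loss reduces to the plain cross-entropy of the observed label.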
C. Adaptability a

This is the case, for instance, in Multi-Instance Learning (MIL) [3]–[6], in which there is one label for each bag of training examples, and each example has an uncertain label. Some scenarios in Transfer Learning (TL) [79] imply that only the labels in the source domain are provided while the target domain labels are not. Often, these non-adapted labels are associated with the existence of slightly different learning tasks (e.g. more precise and numerous classes dividing the original categories). Alternatively, non-adapted labels may characterize a differing statistical individual [80] (e.g. a subpart of an image instead of the entire image).

V. FROM WSL TO BIQUALITY LEARNING (WHEN a = 1)

All the types of supervision deficiencies presented above are addressed separately in the literature, leading to highly specialized approaches. In practice, it is very difficult to identify the type(s) of deficiencies with which a real dataset is associated. For this reason, it would be very useful to suggest another point of view as an attempt at a unified framework for (a part of) Weakly Supervised Learning, in order to design generic approaches capable of dealing with more than a single type of supervision deficiency. This is the purpose of this section, mainly given for cases where the data are adapted to the task to learn (a = 1).

Learning using biquality data has recently been put forward in [14], [81], [82] and consists in learning a classifier from two distinct training sets, one trusted and the other not. The initial motivation was to unify semi-supervised and robust learning through a combination of the two. We show in this paper that this scenario is not limited to this unification and that it can cover a larger range of supervision deficiencies, as demonstrated with the algorithms we suggest and their results. The trusted dataset D_T consists of pairs of labeled examples (x_i, y_i) where all labels y_i ∈ Y are supposed to be correct according to the true underlying conditional distribution P_T(Y|X). In the untrusted dataset D_U, examples x_i may be associated with incorrect labels. We note P_U(Y|X) the corresponding conditional distribution. At this stage, no assumption is made about the nature of the supervision deficiencies, which could be of any type, including label noise, missing labels, concept drift, non-adapted labels, and more generally a mixture of these supervision deficiencies. The difficulty of a learning task performed on biquality data can be characterised by two quantities.
First, the ratio of trusted data over the whole data set, denoted by p:

p = |D_T| / (|D_T| + |D_U|)    (1)

Second, a measure of the quality, denoted by q, which evaluates the usefulness of the untrusted data D_U to learn the trusted concept. For example, in [82] q is defined using a ratio of Kullback-Leibler divergences between P_T(Y|X) and P_U(Y|X).
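Equation (1) is straightforward to compute; q is harder, since it involves the unknown conditional distributions. The sketch below gives p exactly and a deliberately crude agreement-based proxy for q (NOT the KL-ratio definition of [82]); both helpers are illustrative names of ours:

```python
import numpy as np

def trusted_ratio(n_trusted, n_untrusted):
    """Equation (1): fraction of trusted examples in the whole dataset."""
    return n_trusted / (n_trusted + n_untrusted)

def label_agreement_quality(y_untrusted, y_pred_by_trusted_model):
    """A crude quality proxy (not the KL-based q of [82]): the rate at
    which untrusted labels agree with the predictions of a classifier
    trained on trusted data only."""
    y_untrusted = np.asarray(y_untrusted)
    y_pred = np.asarray(y_pred_by_trusted_model)
    return float((y_untrusted == y_pred).mean())
```

With 25 trusted and 75 untrusted examples, p = 0.25; if 3 out of 4 untrusted labels match the trusted model's predictions, the proxy quality is 0.75.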
Fig. 3. The different learning tasks covered by the biquality setting, repre-sented on a 2D representation.
The biquality setting covers a wide range of learning tasks by varying the quantities q and p, as represented in Figure 3.

• When (p = 1 OR q = 1), all examples can be trusted (note that p = 1 ⟹ D_U = ∅ ⟹ q = 1). This setting corresponds to a standard supervised learning (SL) task.

• When (p = 0 AND q = 0), there are no trusted examples and the untrusted labels are not informative. We are left with only the inputs {x_i}_{1≤i≤m}, as in unsupervised learning (UL).

• On the vertical axis defined by q = 0, except for the two points (p, q) = (0, 0) and (p, q) = (1, 0), the untrusted labels are not informative, and trusted examples are available. The learning task becomes semi-supervised learning (SSL), with the untrusted examples as unlabeled and the trusted ones as labeled.

• An upward move on this vertical axis, from a point (p, q) = (ε, 0), characterized by a low proportion of labeled examples p = ε, to a point (p′, 0), with p′ > p, corresponds to Active Learning, if an oracle can be called on unlabeled examples. The same upward move can also be realized in Self-Training and Co-Training, where unlabeled training examples are labeled using the predictions of the current classifier(s).

• On the horizontal axis defined by p = 0, except for the points (p, q) = (0, 0) and (p, q) = (0, 1), only untrusted examples are provided, which corresponds to the range of learning tasks typically addressed by Robust Learning to Label noise (RLL) approaches.

Only the edges of Figure 3 have been envisioned in previous works, i.e. the points mentioned above, and a whole new range of problems corresponding to the entire plane of the figure remains to be explored. Biquality learning may also be used to tackle particular tasks belonging to WSL, for instance:

• Positive Unlabeled Learning (PUL) [58], where the trusted examples are only positive and the untrusted examples are those from the unknown class.

• Self-Training and Co-Training [62]–[64] could be addressed at the end of their self-labeling process: the initial training set is the trusted dataset, and all examples labeled afterwards (during the self-labeling process) are the untrusted examples.

• Concept drift [83]: when a concept drift occurs, all the examples used before a drift detection may be considered as the untrusted examples, while the examples available after it are viewed as the trusted ones, assuming a perfect labeling process.

• Self-supervised learning systems such as Snorkel [35]: the small initial training set is the trusted dataset, and all examples automatically labeled using the labeling functions correspond to the untrusted examples.

As can be seen from the above list, the Biquality framework is quite general, and its investigation seems a promising avenue to unify different aspects of Weakly Supervised Learning.

VI. BIQUALITY LEARNING - EXISTING WORKS
In the previous section, we described how Weakly Supervised Learning subfields fit into the Biquality Learning setup. Here, we review three of these subfields and highlight preexisting Biquality Learning algorithms that either were designed for a different purpose but can still be used for WSL, or were designed directly for this setup.
A. Transfer Learning
Transfer Learning focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. Two datasets are at our disposal, a source dataset D_S and a target dataset D_T, which are related to a source domain D_S(X_S, P(X_S)) and a target domain D_T(X_T, P(X_T)), in order to solve the target task T_T(Y_T, P(Y_T | X_T)) with the help of the source task T_S(Y_S, P(Y_S | X_S)). We can draw a parallel between Biquality Learning notations and Transfer Learning notations mostly by substituting (source, S) with (untrusted, U) and (target, T) with (trusted, T).

Many different setups derive from the general Transfer Learning setup, such as Domain Adaptation, Transductive Transfer Learning, or Covariate Shift. Inductive Transfer Learning is the setup closest to Biquality Learning; indeed, most of the key assumptions are the same: X_T = X_U, Y_T = Y_U, P(X_T) = P(X_U), P(Y_T | X_T) ≠ P(Y_U | X_U).

For example, TrAdaBoost [84] is an extension of boosting to Inductive Transfer Learning. TrAdaBoost learns on both trusted and untrusted data at every iteration. It behaves exactly as AdaBoost [85], [86] on trusted data (mispredicted trusted samples get more attention) but does the opposite on untrusted data: mispredicted untrusted samples are discarded.

Multi-Task Learning [87] is another Inductive Transfer Learning approach that improves generalization by learning both tasks in parallel while using a shared representation; what is learned for the untrusted task can help the trusted task. The loss L_MTL is usually defined by a convex combination of the trusted loss L_T and the untrusted loss L_U of the model f (with 0 ≤ λ ≤ 1):

L_MTL(f(X), Y) = (1 − λ) L_U(f(X), Y) + λ L_T(f(X), Y) (2)

In Inductive Transfer Learning, as in Transfer Learning in general, we assume that the source task (i.e. the untrusted task) is relevant for the target task (i.e. the trusted task).
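Equation 2 can be sketched in a few lines. The instantiation of L_T and L_U as cross-entropy over predicted class probabilities is our own illustrative choice, not prescribed by the papers cited above:

```python
import numpy as np

def cross_entropy(probs, y):
    """Mean negative log-likelihood of integer labels y under the
    predicted class probabilities probs (shape [n, K])."""
    return float(-np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12)))

def mtl_loss(probs_t, y_t, probs_u, y_u, lam=0.5):
    """L_MTL of Equation 2: a convex combination (0 <= lam <= 1) of the
    trusted loss L_T and the untrusted loss L_U of a shared model f."""
    return (1 - lam) * cross_entropy(probs_u, y_u) + lam * cross_entropy(probs_t, y_t)
```

With lam = 1 the untrusted task is ignored and L_MTL reduces to the trusted loss alone; with lam = 0 only the untrusted task is optimized.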
Nonetheless, in the Biquality Data setup, the untrusted task may bring no information to the trusted task, or may even bring adversarial information. Thus, using Inductive Transfer Learning algorithms directly in the Biquality Data setup can lead to poor predictive performance.

For example, with Multi-Task Learning, the global loss term would be heavily perturbed, as the untrusted loss could never be optimized. For TrAdaBoost, the first model, learned on both trusted and untrusted samples, would not be able to learn the class boundaries correctly, and the weight updating schemes would not be efficient.

B. RLL and Transition Matrix
A family of Biquality Learning algorithms was pioneered by Patrini with [88], from the Robust Learning to Label noise literature. These algorithms try to estimate the per-class probabilities of a label flipping into another class (of the K classes), which define the Transition Matrix T:

∀(i, j) ∈ [K]², T_(i,j) = P(Y_U = j | Y_T = i) (3)

Patrini proposed in [88] to use the Transition Matrix T to adapt any supervised loss function L to learning with label noise. The two corrections proposed are: (i) the forward loss correction, L_→(f(X), Y) = L(T⊤ · f(X), Y), and (ii) the backward loss correction, L_←(f(X), Y) = T⁻¹ · L(f(X), Y).

When no trusted samples are available, as in [88], Patrini proposed to use anchor points in order to estimate T. An anchor point of the i-th class is the point of a given dataset with the highest probability of belonging to the i-th class:

∀i ∈ [K], A_i = argmax_x P(Y = i | X = x) (4)

Thanks to this definition, Patrini proposed an estimator of the Transition Matrix:

T̂_(i,j) = P(Y = j | X = A_i) (5)

Finally, the procedure to learn a model f that minimizes L on untrusted data with Patrini's approach has two steps. First, learn a model f on the untrusted data with the loss L, and estimate the Transition Matrix T̂ with Equation 5. Then, learn a model f with either L_→ or L_←.

This algorithm, designed for Robust Learning to Label noise, can easily be adapted to Biquality Learning. Hendrycks proposed one adaptation in [14], with some changes to Patrini's approach. As trusted data are available, there is no longer any need to use anchor points to represent the trusted concept. Instead, another estimator of the Transition Matrix is obtained by learning a model f_U on the untrusted data, making probabilistic predictions with f_U on the trusted dataset D_T, and comparing them to the trusted labels y_T:

T̂_(i,∗) = ( Σ_{x ∈ D_T^i} f_U(x) ) / ( Σ_{z ∈ D_T^i} ||f_U(z)|| ) (6)

where D_T^i = {(x, y) ∈ D_T | y = i}.
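A minimal sketch of the estimator of Equation 6, together with the forward correction of Equation 3's matrix that it feeds, is given below. We assume that f_U outputs softmax probabilities (so each row of T̂ sums to one) and that every class appears at least once in D_T; the function names are ours:

```python
import numpy as np

def estimate_transition_matrix(probs_u, y_t):
    """Estimator of Equation 6: row i of T is the normalized sum of
    f_U's predictions over the trusted examples of class i (D_T^i).
    probs_u: predictions of f_U on D_T, shape [n, K];
    y_t: trusted labels, shape [n]."""
    n, K = probs_u.shape
    T = np.zeros((K, K))
    for i in range(K):
        rows = probs_u[y_t == i]              # D_T^i
        T[i] = rows.sum(axis=0) / rows.sum()  # L1 norm of each softmax row is 1
    return T

def forward_corrected_probs(T, probs):
    """Forward correction L_->: the clean-class probabilities f(X) are
    mapped through T^T before the usual loss L is applied.
    Entry j of each output row is sum_i P(clean=i) * P(noisy=j | clean=i)."""
    return probs @ T
```

The corrected probabilities are then simply plugged into the original loss (e.g. cross-entropy against the untrusted labels).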
Then, for the final step, Hendrycks proposed to learn f with the forward-corrected loss L_→ on the untrusted data and the uncorrected loss L on the trusted data. Thus GLC (Gold Loss Correction) is an example of a Biquality Learning algorithm that has been demonstrated to be quite efficient on At Random supervision deficiencies.

C. Covariate Shift
The Covariate Shift literature has also inspired adaptations of its algorithms to Biquality Learning. The most influential algorithm in this regard is Importance Reweighting [89], whose aim is to give high weights to source samples that are similar to the target samples, and low weights when they are not. This objective fits well with Not At Random (or sample-dependent) corruptions, as the correction made to the untrusted dataset is per sample with this algorithm family. Multiple approaches have been inspired by this literature.

The key idea of this algorithm family is to define a loss function L̃ such that learning a model f on D_U that minimizes L̃ is equivalent to using the original loss function L on D_T in the risk estimate. The following equations show how L̃ appears in the risk estimate R:

R_{(X,Y)∼T, L}(f) = E_{(X,Y)∼T}[ L(f(X), Y) ]
                  = E_{(X,Y)∼U}[ (P_T(X, Y) / P_U(X, Y)) L(f(X), Y) ]
                  = E_{(X,Y)∼U}[ β L(f(X), Y) ]
                  = R_{(X,Y)∼U, L̃}(f) (7)

However, this newly defined loss function L̃ can be hard to estimate, and thus approaches have been proposed to further simplify the weight estimation.

For example, Importance Reweighting for Biquality Learning (IRBL) [24] uses the biquality hypothesis that the distribution P(X) is the same in the trusted and untrusted datasets. By using Bayes' formula, we obtain a new expression for β:

β_IRBL = P_T(X, Y) / P_U(X, Y) = (P_T(Y | X) P(X)) / (P_U(Y | X) P(X)) = P_T(Y | X) / P_U(Y | X) (8)

First, the vector of ratios between P_T(Y | X) and P_U(Y | X) is estimated by the term f_T(x_i) ⊘ f_U(x_i) (element-wise division), using the models f_T and f_U learned on D_T and D_U. For each untrusted example, the weight β̂_IRBL is the y_i-th element of this vector, while β̂_IRBL is fixed to 1 for the trusted examples.
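The IRBL weight computation of Equation 8 can be sketched as follows, given probabilistic classifiers f_T and f_U already trained on D_T and D_U. The clipping constant is our own safeguard against division by zero, not part of the original method:

```python
import numpy as np

def irbl_weights(probs_t, probs_u, y_u, clip=1e-6):
    """Per-sample importance weights beta_IRBL = P_T(y|x) / P_U(y|x)
    for the untrusted examples (Equation 8).
    probs_t, probs_u: predictions of f_T and f_U on D_U, shape [m, K];
    y_u: untrusted labels, shape [m].
    Trusted examples simply keep a weight of 1."""
    idx = np.arange(len(y_u))
    ratio = probs_t / np.clip(probs_u, clip, None)  # f_T(x) / f_U(x), element-wise
    return ratio[idx, y_u]                          # pick the y_i-th component
```

An untrusted example whose label is much more probable under the trusted concept than under the untrusted one receives a weight above 1, and vice versa.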
Then, the final classifier is learned from D_T ∪ D_U by minimizing L̃.

Another algorithm, named Dynamic Importance Reweighting (DIW), has been proposed in [90] by writing Equation 8 in a more traditional way with Bayes' formula:

β_DIW = P_T(X, Y) / P_U(X, Y) = (P_T(X | Y) P_T(Y)) / (P_U(X | Y) P_U(Y)) (9)

To estimate β_DIW, the trick is to select sub-samples of both D_T and D_U containing samples of the same class, and then use a Density Ratio Estimator [91] such as Kernel Mean Matching (KMM) [92], [93]. A final classifier is then learned on D_U by minimizing L̃. One particular issue of this algorithm is that KMM is learned by optimizing a quadratic program, K times per batch, which leads to a high algorithmic complexity, especially in the case of massive multiclass classification.

IRBL and DIW are two new Biquality Learning algorithms that work on NAR cases.

VII. CONCLUDING REMARKS