On the Power of Deep but Naive Partial Label Learning
Junghoon Seo† (GIST & SI Analytics, South Korea; [email protected])
Joon Suk Huh† (UW–Madison, USA; [email protected])
ABSTRACT
Partial label learning (PLL) is a class of weakly supervised learning where each training instance consists of a data point and a set of candidate labels containing a unique ground-truth label. To tackle this problem, most current state-of-the-art methods employ either label disambiguation or averaging strategies. So far, PLL methods without such techniques have been considered impractical. In this paper, we challenge this view by revealing the hidden power of the oldest and naivest PLL method when it is instantiated with deep neural networks. Specifically, we show that, with deep neural networks, the naive model can achieve competitive performance against the other state-of-the-art methods, suggesting it as a strong baseline for PLL. We also address the question of how and why such a naive model works well with deep neural networks. Our empirical results indicate that deep neural networks trained on partially labeled examples generalize very well even in the over-parametrized regime and without label disambiguation or regularization. We point out that existing learning theories on PLL are vacuous in the over-parametrized regime, hence they cannot explain why the deep naive method works. We propose an alternative theory on how deep learning generalizes in PLL problems.
Index Terms — classification, partial label learning, weakly supervised learning, deep neural network, empirical risk minimization
1. INTRODUCTION
State-of-the-art performance on the standard classification task is among the fastest-improving results in machine learning. In the standard classification setting, a learner requires an unambiguously labeled dataset. However, it is often hard or even impossible to obtain completely labeled datasets in the real world. Many lines of research have formulated problem settings under which classifiers are trainable with incompletely labeled datasets. These settings are often denoted as weakly supervised. Learning from similar vs. dissimilar pairs [1], learning from positive vs. unlabeled data [2, 3], and multiple instance learning [4, 5] are some examples of weakly supervised learning.

† Both authors contributed equally to this work.
In this paper, we focus on partial label learning (PLL) [6], which is one of the most classic examples of weakly supervised learning. In the PLL problem, classifiers are trained with a set of candidate labels, among which only one label is the ground truth. Web mining [7], ecoinformatics [8], and automatic image annotation [9] are notable real-world instantiations of the PLL problem.

The majority of state-of-the-art parametric methods for PLL involves two types of parameters. One is associated with the label confidence, and the other is the model parameters. These methods iteratively and alternately update the two types of parameters. Such methods are denoted as identification-based. On the other hand, average-based methods [10, 11] treat all the candidate labels equally, assuming they contribute equally to the trained classifier. Average-based methods do not require any label disambiguation process, so they are much simpler than identification-based methods. However, numerous works [6, 12, 13, 14, 15] pointed out that label disambiguation processes are essential to achieving high performance in PLL problems; hence, attempts to build a high-performance PLL model through the average-based scheme have been avoided.

Contrary to this common belief, we show that one of the naivest and oldest average-based methods can train accurate classifiers in real PLL problems. Specifically, our main contributions are two-fold:

1. We generalize the classic naive model of [6] to the modern deep learning setting. Specifically, we present a naive surrogate loss for deep PLL. We test our deep naive model's performance and show that it outperforms the existing state-of-the-art methods despite its simplicity.

2. We empirically analyze the unreasonable effectiveness of the naive loss with deep neural networks. Our experiments show closing generalization gaps in the over-parametrized regime, where bounds from existing learning theories are vacuous. We propose an alternative explanation of the workings of deep PLL based on observations of Valle-Perez et al. [16].

2. DEEP NAIVE MODEL FOR PLL

2.1. Problem Formulation

We denote x ∈ X as a data point, y ∈ Y = {1, . . . , K} as a label, and a set S ∈ S = 2^Y \ ∅ such that y ∈ S as a partial label. A partial label data distribution is defined by a joint data-label distribution p(x, y) and a partial label generating process p(S | x, y), where p(S | x, y) = 0 if y ∉ S. A learner's task is to output a model θ with small Err(θ) = E_{(x,y)∼p(x,y)}[I(h_θ(x) ≠ y)], given a finite number of partially labeled samples {(x_i, S_i)}_{i=1}^n, where each (x_i, S_i) is independently sampled from p(x, S).

The work of Jin and Ghahramani [6], which is the first pioneering work on PLL, proposed a simple baseline method for PLL denoted as the 'naive model'. It is defined as follows:

    \hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \frac{1}{|S_i|} \sum_{y \in S_i} \log p(y \mid x_i; \theta).    (1)

We denote the naive loss as the negative of the objective above. In [6], the authors proposed the disambiguation strategy as a better alternative to the naive model.
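To make the setup concrete, the following minimal NumPy sketch draws partially labeled examples from a synthetic generating process p(S | x, y) in which the ground-truth label is always contained in S and every other label enters the candidate set independently with a fixed probability. The flipping probability q, the toy Gaussian features, and the function name are our own illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def sample_partial_labels(y, num_classes, q=0.3, rng=None):
    """Draw candidate sets S with y in S; every wrong label joins S independently with prob. q (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    # Start from the ground-truth labels, which are always contained in S.
    S = np.zeros((n, num_classes), dtype=bool)
    S[np.arange(n), y] = True
    # Add each non-ground-truth label with probability q; |= only adds labels, so y stays in S.
    S |= rng.random((n, num_classes)) < q
    return S  # boolean indicator matrix of the candidate sets

# Toy usage: K = 5 classes with class-dependent Gaussian features (hypothetical data).
rng = np.random.default_rng(0)
K, n = 5, 1000
y = rng.integers(0, K, size=n)
x = rng.normal(size=(n, 2)) + y[:, None]
S = sample_partial_labels(y, K, q=0.3, rng=rng)
print(S.sum(axis=1).mean())  # average candidate-set size
```

Under this uniform flipping process, every distractor label co-occurs with the ground truth with probability q, which is exactly the ambiguity degree γ introduced in the learnability analysis below.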
Moreover, many works on PLL [12, 13, 14, 15] considered this naive model to be low-performing, and it is still commonly believed that label disambiguation processes are crucial to achieving high performance.

In this work, we propose the following differentiable loss to instantiate the naive loss with deep neural networks:

    \hat{\ell}_n(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log\left( \langle s_{\theta,i}, s_i \rangle \right),    (2)

    s_{\theta,i} = \mathrm{softmax}(f_\theta(x_i)),    (3)

where f_θ(x_i) ∈ R^K is the output of the neural network and s_i is the indicator vector of the candidate set S_i. The softmax layer is used to make the outputs of the neural network lie in the probability simplex. One can see that the above loss is almost identical to the naive loss in (1) up to constant factors; hence we denote (2) as the deep naive loss, while a model trained with it is denoted as a deep naive model.

The above loss can be identified as a surrogate of the partial label risk defined as follows:

    R_p(\theta) = E_{(x,S) \sim p(x,S)} \left[ I(h_\theta(x) \notin S) \right],    (4)

where I(·) is the indicator function. We denote \hat{R}_{p,n}(\theta) as an empirical estimator of R_p(\theta) over n samples. When h_θ(x) = arg max_i f_{θ,i}(x), one can easily see that the deep naive loss (2) is a surrogate of the partial-label risk (4).
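As an illustration, here is a minimal PyTorch sketch of the deep naive loss in Eqs. (2)-(3). The function and variable names (deep_naive_loss, logits, candidate_mask) are our own, and the snippet assumes the candidate sets are encoded as 0/1 indicator vectors.

```python
import torch
import torch.nn.functional as F

def deep_naive_loss(logits, candidate_mask, eps=1e-12):
    """Deep naive loss (Eqs. 2-3): -mean_i log <softmax(f(x_i)), s_i>.

    logits:         (n, K) raw network outputs f_theta(x_i)
    candidate_mask: (n, K) 0/1 indicators of the candidate sets S_i
    """
    probs = F.softmax(logits, dim=1)                      # s_{theta,i}, Eq. (3)
    candidate_prob = (probs * candidate_mask).sum(dim=1)  # <s_{theta,i}, s_i>
    return -torch.log(candidate_prob.clamp_min(eps)).mean()  # Eq. (2)

# Toy usage with random tensors (illustrative only).
logits = torch.randn(4, 5, requires_grad=True)
mask = torch.tensor([[1, 1, 0, 0, 0],
                     [0, 1, 0, 1, 0],
                     [1, 0, 0, 0, 1],
                     [0, 0, 1, 0, 0]], dtype=torch.float32)
loss = deep_naive_loss(logits, mask)
loss.backward()
```

The clamp merely guards against log(0) when all candidate probabilities underflow; it is not part of Eq. (2).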
We now review two existing learning theories and their implications, which may explain the effectiveness of deep naive models.

Under a mild assumption on data distributions, Liu and Dietterich [17] proved that minimizing an empirical partial label risk gives a correct classifier. Formally, they proved a finite sample complexity bound for the empirical partial risk minimizer (EPRM):

    \hat{\theta}_n = \arg\min_{\theta \in \Theta} \hat{R}_{p,n}(\theta),    (5)

under a mild distributional assumption called the small ambiguity degree condition. The ambiguity degree [11] quantifies the hardness of a PLL problem and is defined as

    \gamma = \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y},\; \bar{y} \in \mathcal{Y} :\; p(x,y) > 0,\; \bar{y} \neq y} \Pr_{S \sim p(S \mid x, y)} [\bar{y} \in S].    (6)

When γ is less than 1, we say the small ambiguity degree condition is satisfied. Intuitively, it measures how often a specific non-ground-truth label co-occurs with a specific ground-truth label. When such a distractor label co-occurs with a ground-truth label in every instance, it is impossible to disambiguate the label; hence PLL is not EPRM learnable. Under the mild assumption that γ < 1, Liu and Dietterich showed the following sample complexity bound for PLL.

Theorem 1 (PLL sample complexity bound [17]). Suppose the ambiguity degree of a PLL problem is small, 0 ≤ γ < 1. Let η = log(2/(1+γ)) and let d_H be the Natarajan dimension of the hypothesis space H. Define

    n_0(\mathcal{H}, \epsilon, \delta) = \frac{4}{\eta \epsilon} \left( d_{\mathcal{H}} \left( \log 4 d_{\mathcal{H}} + 2 \log K + \log \frac{1}{\eta \epsilon} \right) + \log \frac{1}{\delta} + 1 \right).

Then, when n > n_0(H, ε, δ), Err(\hat{\theta}_n) < ε with probability at least 1 − δ.

We denote this result as Empirical Partial Risk Minimization (EPRM) learnability.

A very recent work by Feng et al. [18] proposed new PLL risk estimators by viewing the partial label generation process as a multiple complementary label generation process [19, 20]. One of the proposed estimators is called the classifier-consistent (CC) risk R_cc(θ). For any multi-class loss function L : R^K × Y → R_+, it is defined as follows:

    R_{cc}(\theta) = E_{(x,S) \sim p(x,S)} \left[ L\left( Q^\top p(y \mid x; \theta),\, s \right) \right],    (7)

where Q ∈ R^{K×K} is a label transition matrix in the context of multiple complementary label learning, and s is a label chosen uniformly at random from S. We denote \hat{R}_{cc,n}(\theta) as the empirical risk of Eq. (7).

[Table 1. Benchmark results (mean accuracy ± std) on the real-world datasets Lost, MSRCv2, Soccer Player, and Yahoo! News, comparing DNPL (ours) against CLPL [11], CORD [13], ECOC [21], GM-PLL [22], IPAL [12], PL-BLC [15], PL-LE [23], PLKNN [10], PRODEN [24], SDIM [14], and SURE [25], together with each baseline's publication venue. • / ◦ indicates whether DNPL is better/worse than the comparing method with respect to an unpaired Welch t-test.]

Feng et al.'s main contribution is to prove an estimation error bound for the CC risk (7). Let \hat{\theta}_n = \arg\min_{\theta \in \Theta} \hat{R}_{cc,n}(\theta) and \theta^\star = \arg\min_{\theta \in \Theta} R_{cc}(\theta) denote the empirical and the true minimizer, respectively. Additionally, H_y refers to the model hypothesis space for label y. Then, the estimation error bound for the CC risk is given as

Theorem 2 (Estimation error bound for the CC risk [18]). Assume the loss function L(Q^⊤ p(y | x; θ), s) is ρ-Lipschitz with respect to the first argument in the 2-norm and upper-bounded by M.
Then, for any δ > 0, with probability at least 1 − δ,

    R_{cc}(\hat{\theta}_n) - R_{cc}(\theta^\star) \le \rho \sum_{y=1}^{K} \mathfrak{R}_n(\mathcal{H}_y) + 2M \sqrt{\frac{\log(2/\delta)}{2n}},

where \mathfrak{R}_n(\mathcal{H}_y) refers to the expected Rademacher complexity of the hypothesis space H_y for label y, with sample size n.

If the uniform label transition probability is assumed, i.e., Q is the transition matrix induced by uniformly sampling candidate sets containing the ground-truth label, Eq. (7) becomes equivalent to our deep naive loss (Eq. 2) up to constant factors. Hence, Theorems 1 and 2 give generalization bounds on the partial risk and on the CC risk (the same as Eq. 2), respectively.

Since the work of [26], the mystery of deep learning's generalization ability has been widely investigated in the standard supervised learning setting. While it is still not fully understood why over-parametrized deep neural networks generalize well, several studies suggest that deep learning models are inherently biased toward simple functions [16, 27]. In particular, Valle-Perez et al. [16] empirically observed that SGD outputs are biased toward neural networks with smaller complexity. They observed the following universal scaling behavior in the output distribution p(θ) of SGD:

    p(\theta) \lesssim e^{-a C(\theta) + b},    (8)

where C(θ) is a computable proxy of the (uncomputable) Kolmogorov complexity and a, b are θ-independent constants. One example of a complexity measure C(θ) is the Lempel-Ziv complexity [16], which is roughly the length of θ compressed with a ZIP compressor.

In deep naive PLL, the model parameter is a minimizer of the empirical partial label risk \hat{R}_{p,n}(\theta) (Eq. 4). The minima of \hat{R}_{p,n}(\theta) are wide because many model parameters perfectly fit the given partially labeled examples. The support of SGD's output distribution will lie in these wide minima. According to Eq. (8), this distribution is heavily biased toward parameters with small complexities. One crucial observation is that models fitting inconsistent labels will generally have large complexities, since they have to memorize each example. According to Eq. (8), such models are exponentially unlikely to be output by SGD. Hence the most likely output of the deep naive PLL method is a classifier with small error. As a result, the implications of both Theorems 1 and 2 appear to be empirically correct in spite of their vacuity at modern model complexities.
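To make the complexity proxy concrete, the sketch below estimates a Lempel-Ziv-style complexity C(θ) for a parameter vector by measuring its compressed length under a standard DEFLATE (zlib) compressor, following the ZIP-compression description above. The quantization step and the function name are our own illustrative choices rather than the exact procedure of Valle-Perez et al.

```python
import zlib
import numpy as np

def lz_complexity_proxy(theta, n_bits=8):
    """Proxy for C(theta): byte length of the quantized parameter vector after zlib compression."""
    theta = np.asarray(theta, dtype=np.float64).ravel()
    # Quantize parameters to 2**n_bits levels so the compressor sees a discrete symbol stream
    # (the quantization granularity is an illustrative assumption).
    lo, hi = theta.min(), theta.max()
    levels = (2 ** n_bits) - 1
    quantized = np.round((theta - lo) / (hi - lo + 1e-12) * levels).astype(np.uint8)
    return len(zlib.compress(quantized.tobytes(), level=9))

# A smoothly varying ("simple") parameter vector compresses far better than an i.i.d. random one.
simple_theta = np.sin(np.linspace(0, 20, 10000))
random_theta = np.random.default_rng(0).normal(size=10000)
print(lz_complexity_proxy(simple_theta), lz_complexity_proxy(random_theta))
```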
3. EXPERIMENTS
In this section, we make two points. First, deep neural network classifiers trained with the deep naive loss can achieve competitive performance on real-world PLL benchmark datasets. Second, the generalization gaps of the trained classifiers effectively decrease as the training set size increases.

[Fig. 1. Generalization gaps with respect to training set size for (a) the Yahoo! News dataset and (b) the Soccer Player dataset. Error bars represent standard deviations over 10 repeated experiments.]
We use four real-world datasets: Lost [28], MSRCv2 [8], Soccer Player [9], and Yahoo! News [29]. All of these datasets are available at http://palm.seu.edu.cn/zhangml/. We denote the proposed method as Deep Naive Partial Label Learning (DNPL). We compare DNPL with eleven baseline methods: eight parametric methods, CLPL [11], CORD [13], ECOC [21], PL-BLC [15], PL-LE [23], PRODEN [24], SDIM [14], and SURE [25], and three non-parametric methods, GM-PLL [22], IPAL [12], and PLKNN [10]. Note that both CORD and PL-BLC are deep learning-based PLL methods which include label identification or mean-teaching techniques.

We employ a neural network of architecture d_in−⋯−d_out, where d_in (d_out) is the input (output) dimension; the network has the same size as that of PL-BLC. Batch normalization [30] is applied after each layer, followed by an ELU activation [31]. The Yogi optimizer [32] is used with a fixed learning rate and its default momentum parameters.
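The following PyTorch sketch shows a network of the kind described above: fully connected layers with batch normalization and ELU activations, trained on the deep naive loss with the Yogi optimizer. The hidden width (hidden_dim) and the toy dimensions are placeholders because the exact layer sizes are not reproduced here, the Yogi import assumes the third-party torch-optimizer package, and deep_naive_loss refers to the function sketched in Section 2.

```python
import torch.nn as nn
import torch_optimizer  # third-party "torch-optimizer" package providing Yogi (assumption)

def make_dnpl_model(d_in, d_out, hidden_dim=512):
    """MLP with BatchNorm + ELU after each hidden layer; hidden_dim is a placeholder width."""
    return nn.Sequential(
        nn.Linear(d_in, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ELU(),
        nn.Linear(hidden_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ELU(),
        nn.Linear(hidden_dim, d_out),
    )

model = make_dnpl_model(d_in=100, d_out=10)                    # toy dimensions (illustrative)
optimizer = torch_optimizer.Yogi(model.parameters(), lr=1e-3)  # fixed learning rate, default momenta

# One hypothetical training step on a batch of partially labeled examples.
# features: (B, d_in) tensor, candidate_mask: (B, d_out) 0/1 tensor.
def train_step(features, candidate_mask):
    optimizer.zero_grad()
    loss = deep_naive_loss(model(features), candidate_mask)
    loss.backward()
    optimizer.step()
    return loss.item()
```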
Table 1 reports the means and standard deviations of the observed accuracies. Accuracies of the naive model are measured over 5 repetitions of 10-fold cross-validation, and accuracies of the others are measured over 10-fold cross-validation.

The benchmark results indicate that DNPL achieves state-of-the-art performance on all four datasets. In particular, DNPL outperforms PL-BLC, which uses a neural network of the same size as ours, on those datasets. Unlike PL-BLC or CORD, DNPL does not need computationally expensive processes such as label identification and mean-teaching. This means that simply plugging our surrogate loss into a deep learning classifier yields a sufficiently competitive PLL model.

For the Soccer Player and Yahoo! News datasets, DNPL outperforms almost all of the comparing methods. Given that the Soccer Player and Yahoo! News datasets are larger in scale and higher-dimensional than the other datasets, this observation suggests that DNPL is particularly advantageous on large-scale, high-dimensional datasets.
In this section, we empirically show that conventional learning theories (Theorems 1 and 2) cannot explain the learning behavior of DNPL. Figure 1 shows how the gap |Err(\hat{\theta}_n) − \hat{R}_{p,n}(\hat{\theta}_n)| and the CC risk R_cc(\hat{\theta}_n)¹ decrease as the dataset size n increases. We observe these gap-closing behaviors even though the neural networks are over-parametrized, i.e., the number of parameters far exceeds the training set size.
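As a sketch of how such a gap can be measured, the snippet below computes the empirical partial-label risk of Eq. (4), i.e., the fraction of training examples whose prediction falls outside the candidate set, and compares it with the ordinary test error. The function names and the evaluation scaffolding are our own illustrative choices.

```python
import torch

@torch.no_grad()
def partial_label_risk(model, features, candidate_mask):
    """Empirical R_p,n (Eq. 4): fraction of examples whose argmax prediction lies outside the candidate set."""
    preds = model(features).argmax(dim=1)
    in_candidates = candidate_mask[torch.arange(len(preds)), preds].bool()
    return (~in_candidates).float().mean().item()

@torch.no_grad()
def test_error(model, features, labels):
    """Ordinary zero-one error Err(theta) on fully labeled test data."""
    preds = model(features).argmax(dim=1)
    return (preds != labels).float().mean().item()

# Generalization gap |Err(theta_hat) - R_p,n(theta_hat)|, tracked as n grows:
# gap = abs(test_error(model, x_test, y_test) - partial_label_risk(model, x_train, S_train))
```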
4. CONCLUSIONS
This work showed that a simple naive loss is applicable to training high-performance deep classifiers with partially labeled examples. Moreover, this method does not require any label disambiguation or explicit regularization. Our observations indicate that the deep naive method's unreasonable effectiveness cannot be explained by existing learning theories. These raise interesting questions deserving further study: 1) To what extent does label disambiguation help learning with partial labels? 2) How does deep learning generalize in partial label learning?

¹ We have always observed that, with our over-parameterized neural networks, zero risk can be achieved for R_cc(θ⋆). Therefore, we omit this term.

5. REFERENCES

[1] Yen-Chang Hsu, Zhaoyang Lv, Joel Schlosser, Phillip Odom, and Zsolt Kira, "Multi-class classification without multi-class labels," in ICLR, 2019.

[2] Ryuichi Kiryo, Gang Niu, Marthinus C. du Plessis, and Masashi Sugiyama, "Positive-unlabeled learning with non-negative risk estimator," in NIPS, 2017.
[3] Hirotaka Kaji, Hayato Yamaguchi, and Masashi Sugiyama, "Multi-task learning with positive and unlabeled data and its application to mental state prediction," in ICASSP, 2018.

[4] Oded Maron and Tomás Lozano-Pérez, "A framework for multiple-instance learning," in NIPS, 1998.

[5] Yun Wang, Juncheng Li, and Florian Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," in ICASSP, 2019.

[6] Rong Jin and Zoubin Ghahramani, "Learning with multiple labels," in NIPS, 2003.

[7] Jie Luo and Francesco Orabona, "Learning from candidate labeling sets," in NIPS, 2010.

[8] Liping Liu and Thomas G. Dietterich, "A conditional multinomial mixture model for superset label learning," in NIPS, 2012.

[9] Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, Shenghua Gao, Dong Xu, and Yi Ma, "Learning by associating ambiguously labeled images," in CVPR, 2013.

[10] Eyke Hüllermeier and Jürgen Beringer, "Learning from ambiguously labeled examples," Intell Data Anal, 2006.

[11] Timothee Cour, Ben Sapp, and Ben Taskar, "Learning from partial labels," JMLR, 2011.

[12] Min-Ling Zhang and Fei Yu, "Solving the partial label learning problem: An instance-based approach," in AAAI, 2015.

[13] Cai-Zhi Tang and Min-Ling Zhang, "Confidence-rated discriminative partial label learning," in AAAI, 2017.

[14] Lei Feng and Bo An, "Partial label learning by semantic difference maximization," in IJCAI, 2019.

[15] Yan Yan and Yuhong Guo, "Partial label learning with batch label correction," in AAAI, 2020.

[16] Guillermo Valle-Perez, Chico Q. Camargo, and Ard A. Louis, "Deep learning generalizes because the parameter-function map is biased towards simple functions," in ICLR, 2018.

[17] Liping Liu and Thomas Dietterich, "Learnability of the superset label learning problem," in ICML, 2014.

[18] Lei Feng, Jiaqi Lv, Bo Han, Miao Xu, Gang Niu, Xin Geng, Bo An, and Masashi Sugiyama, "Provably consistent partial-label learning," in NeurIPS, 2020.

[19] Lei Feng and Bo An, "Learning from multiple complementary labels," in ICML, 2020.

[20] Yuzhou Cao and Yitian Xu, "Multi-complementary and unlabeled learning for arbitrary losses and models," in ICML, 2020.

[21] Min-Ling Zhang, Fei Yu, and Cai-Zhi Tang, "Disambiguation-free partial label learning," IEEE Trans Knowl Data Eng, 2017.

[22] Gengyu Lyu, Songhe Feng, Tao Wang, Congyan Lang, and Yidong Li, "GM-PLL: Graph matching based partial label learning," IEEE Trans Knowl Data Eng, 2019.

[23] Ning Xu, Jiaqi Lv, and Xin Geng, "Partial label learning via label enhancement," in AAAI, 2019.

[24] Jiaqi Lv, Miao Xu, Lei Feng, Gang Niu, Xin Geng, and Masashi Sugiyama, "Progressive identification of true labels for partial-label learning," in ICML, 2020.

[25] Lei Feng and Bo An, "Partial label learning with self-guided retraining," in AAAI, 2019.

[26] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, "Understanding deep learning requires rethinking generalization," in ICLR, 2017.

[27] Giacomo De Palma, Bobak Kiani, and Seth Lloyd, "Random deep neural networks are biased towards simple functions," in NeurIPS, 2019.

[28] Gabriel Panis, Andreas Lanitis, Nicholas Tsapatsoulis, and Timothy F. Cootes, "Overview of research on facial ageing using the FG-NET ageing database," IET Biometrics, 2016.

[29] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid, "Multiple instance metric learning from automatically labeled bags of faces," in ECCV, 2010.

[30] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.

[31] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in ICLR, 2016.
[32] Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar, "Adaptive methods for nonconvex optimization," in NeurIPS, 2018.