A Survey on Multi-output Learning
Donna Xu, Yaxin Shi, Ivor W. Tsang, Yew-Soon Ong, Chen Gong, Xiaobo Shen
Abstract—The aim of multi-output learning is to simultaneously predict multiple outputs given an input. It is an important learning problem for decision-making, since making decisions in the real world often involves multiple complex factors and criteria. In recent times, an increasing number of research studies have focused on ways to predict multiple outputs at once. Such efforts have transpired in different forms according to the particular multi-output learning problem under study. Classic cases of multi-output learning include multi-label learning, multi-dimensional learning, multi-target regression and others. From our survey of the topic, we were struck by a lack of studies that generalize the different forms of multi-output learning into a common framework. This paper fills that gap with a comprehensive review and analysis of the multi-output learning paradigm. In particular, we characterize the 4 Vs of multi-output learning, i.e., volume, velocity, variety, and veracity, and the ways in which the 4 Vs both benefit and bring challenges to multi-output learning, taking inspiration from big data. We analyze the life cycle of output labeling, present the main mathematical definitions of multi-output learning, and examine the field's key challenges and corresponding solutions as found in the literature. Several model evaluation metrics and popular data repositories are also discussed. Last but not least, we highlight some emerging challenges in multi-output learning from the perspective of the 4 Vs as potential research directions worthy of further study.
Index Terms—multi-output learning, structured output prediction, output label representation, crowdsourcing, label distribution, extreme classification.
I. INTRODUCTION

Traditional supervised learning is one of the most well-established and widely adopted machine learning paradigms. It offers fast and accurate predictions for today's real-world smart systems and applications. The goal of traditional supervised learning is to learn a function that maps each of the given inputs to a corresponding known output. For prediction tasks, the output comes in the form of a single label. For regression tasks, it is a single value. Traditional supervised learning has been shown to be good at solving these simple single-output problems, classical examples being binary classification, such as filtering spam in an email system, or a regression problem where the daily energy consumption of a machine needs to be predicted based on temperature, wind speed, humidity levels, etc. However, the traditional supervised learning paradigm is not coping well with the increasing needs of today's complex decision making. As a result, there is a pressing need for new machine learning paradigms. Here, multi-output learning has emerged as a solution.

[Footnote: D. Xu, Y. Shi and I. W. Tsang are with the Centre for Artificial Intelligence, FEIT, University of Technology Sydney, Ultimo, NSW 2007, Australia (email: [email protected], [email protected], [email protected]). Y.-S. Ong is with the Data Science & Artificial Intelligence Research Centre, SCSE, Nanyang Technological University, Singapore 639798 (email: [email protected]). C. Gong and X. Shen are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (email: [email protected], [email protected]). This research is supported by ARC grants LP150100671 and DP180100106, CSC (No: 201706330075), NRFS under its AI Singapore Programme (AISG-RP-2018-004), NSF of China (No: 61602246, 61973162), NSF of Jiangsu Province (No: BK20171430), the FRF for the Central Universities (No: 30918011319), the Summit of the Six Top Talents Program (No: DZXX-027), the Young Elite Scientists Sponsorship Program by Jiangsu Province, the Young Elite Scientists Sponsorship Program by CAST (No: 2018QNRC001), the NNSFC under Grant No. 61906091, the NSF of Jiangsu Province, China (Youth Fund Project) under Grant No. BK20190440, and the FRF for the Central Universities under Grant No. 30919011229.]
The aim is to simultaneously predict multiple outputs given a single input, which means it is possible to solve far more complex decision-making problems. Compared to traditional single-output learning, multi-output learning is multi-variate in nature, and the outputs may have complex interactions that can only be handled by structured inference. Additionally, the potentially diverse data types of the outputs have led to various categories of machine learning problems and corresponding subfields of study. For example, binary output values relate to multi-label classification problems [1], [2]; nominal output values relate to multi-dimensional classification problems [3]; ordinal output values are studied in label ranking problems [4]; and real-valued outputs are considered in multi-target regression problems [5]. Together, all these problems constitute the multi-output paradigm, and the body of literature surrounding this field has grown rapidly. Several works have been presented that provide a comprehensive review of the emerging challenges and learning algorithms in each subfield. For instance, Zhang and Zhou [1] studied the emerging area of multi-label learning; Borchani et al. [5] summarized the increasing problems in multi-target regression; and Vembu and Gärtner [4] presented a review on multi-label ranking. However, little attention has been paid to the global picture of multi-output learning and the importance of the output labels (Section II). In addition, although the problems in each subfield seem distinctive due to the differences in their output structures (Section III-A), they do share common traits (Section III-B) and encounter common challenges brought by the characteristics of the output labels. In this paper, we attempt to provide such a view.
A. The 4 Vs Challenges of Multiple Outputs
The popular 4 Vs, i.e., volume, velocity, variety and veracity, have been well established as the main characteristics of big data. When scholars discuss the 4 Vs in multi-output learning scenarios, they are usually referring to input data; however, the 4 Vs can also be used to describe output labels. Moreover, these 4 Vs bring with them a set of challenges to multi-output learning processes, explained as follows.

1) Volume refers to the explosive growth in output labels, which poses many challenges to multi-output learning. First, output label spaces can grow extremely large, which causes scalability issues. Second, the burden on label annotators is significantly increased, and still there are often insufficient annotations in a dataset to adequately train a model. In turn, this may lead to unseen outputs during testing. Third, volume may pose label imbalance issues, especially if not all the generated labels in a dataset have sufficient data instances (inputs).
2) Velocity refers to how rapidly output labels are acquired, which includes the phenomenon of concept drift [6]. Velocity can present challenges due to changes in output distributions, where the target outputs vary over time in unforeseen ways.
3) Variety refers to the heterogeneous nature of output labels. Output labels are gathered from multiple sources and are of various data formats with different structures. In particular, output labels with complex structures can create multiple challenges in multi-output learning, such as finding an appropriate method of modeling output dependencies, designing a multi-variate loss function, or designing efficient algorithms.
4) Veracity refers to differences in the quality of the output labels. Issues such as noise, missing values, abnormalities, or incomplete data are all characteristics of veracity.

B. Purpose and Organization of This Survey
The goal of this paper is to provide a comprehensive overview of the multi-output learning paradigm using the 4 Vs as a frame for the current and future challenges facing this field of study. Multi-output learning has attracted significant attention from many machine learning disciplines, such as part-of-speech sequence tagging, language translation and natural language processing, motion tracking and optical character recognition in computer vision, document categorization and ranking in information retrieval, and so on. We expect this survey to deliver a complete picture of multi-output learning and a summation of the different problems being tackled across multiple communities. Ultimately, we hope to promote further development in multi-output learning, and inspire researchers to pursue worthy and needed future research directions.

The remainder of this survey is structured as follows. Section II illustrates the life cycle of output labels to help understand the challenges presented by the 4 Vs. Section III provides an overview of the myriad output structures along with definitions for the common subproblems addressed in multi-output learning. This section also includes some brief details on the common metrics and publicly-available data used when evaluating models. Section IV presents the challenges in multi-output learning presented by the 4 Vs and their corresponding representative works. Section V concludes the survey.

II. LIFE CYCLE OF OUTPUT LABELS
Output labels play an important role in multi-output learning tasks in that how well a model performs a task relies heavily on the quality of those labels. Fig. 1 depicts the three stages of a label's life cycle: annotation, representation, and evaluation. A brief overview of each stage follows, along with the underlying issues that could potentially harm the effectiveness of multi-output learning systems.

Fig. 1. The life cycle of the output label.
A. How is Data Labeled
Label annotation requires a human to semantically annotate a piece of data and is a crucial step for training multi-output learning models. Data can be used directly with its basic annotations or, once labeled, aggregated into sets for further analysis. Depending on the application and the task, label annotations come in various types. For example, the images for an image classification task should be labeled with tags or keywords, whereas a segmentation task would require each object in the images to be localized with a mask. A captioning task would require the images to be labeled with some textual descriptions, and so on.

Typically, creating large annotated datasets from scratch is time-consuming and labor-intensive no matter the annotation requirement. There are multiple ways to acquire labeled data. Social media provides a platform for researchers to search for labeled datasets - for example, Facebook and Flickr, which allow users to post pictures and comments with tags. Open-source collections, such as WordNet and Wikipedia, can also be useful sources of labeled datasets.

Beyond directly obtaining labeled datasets, crowdsourcing platforms like Amazon Mechanical Turk help researchers solicit labels for unlabeled datasets by recruiting online workers. The annotation type depends on the modeling task and, due to the efficiency of crowdsourcing, this method has quickly become a popular way of obtaining labeled datasets. ImageNet [7] is a popular dataset that was labeled through a crowdsourcing platform. Its database of images is organized into a WordNet hierarchy, and it has been used to help researchers solve problems in a range of areas.

There are also many annotation tools that have been developed to annotate different types of data. LabelMe [8], a web-based tool, provides users with a convenient way to label every object in an image and also correct labels annotated by other users.
BRAT [9] is also web-based but is specifically designed for natural language processing tasks, such as named-entity recognition and POS-tagging (part-of-speech tagging).
TURKSENT [10] is an annotation tool to support sentiment analysis in social media posts.
B. Forms of Label Representations
There are many different types of label annotations for different tasks, such as tags, captions, masks, etc., and each type of annotation may have several representations, which are frequently expressed as vectors. For example, the most common is the binary vector, whose size equals the vocabulary size of the tags. Annotated samples, e.g., samples with tags, are assigned a value of 1 and the rest are given a 0. However, binary vectors are not optimal for more complex multi-output tasks because these representations do not preserve all useful information. Details like the semantics or the inherent structure are lost. To tackle this issue, alternative representation methods have been developed. For instance, real-valued vectors of tags [11] indicate the strength and degree of the annotated tags using real values. Binary vectors of the associations between a tag's attributes have been used to convey the characteristics of tags. Hierarchical label embedding vectors [12] capture the structure information in tags. Semantic word vectors, such as Word2Vec [13], can be used to represent the semantics and/or context of tags and text descriptions. What is key in real-world multi-output applications is to select the label representation that is most appropriate for the given task.
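As a minimal illustration (the vocabulary and weights below are invented for the example, not taken from any dataset in this survey), the two simplest representations above, a binary indicator vector and a real-valued relevance vector, can be sketched as:

```python
# Sketch of two common tag representations over a fixed tag vocabulary.

def binary_tag_vector(tags, vocabulary):
    """1 if the tag is annotated for the sample, 0 otherwise."""
    return [1 if t in tags else 0 for t in vocabulary]

def weighted_tag_vector(tag_weights, vocabulary):
    """Real-valued strength for each annotated tag, 0.0 for the rest."""
    return [float(tag_weights.get(t, 0.0)) for t in vocabulary]

vocab = ["people", "dinner", "table", "wine"]
b = binary_tag_vector({"people", "wine"}, vocab)            # [1, 0, 0, 1]
w = weighted_tag_vector({"people": 0.9, "wine": 0.4}, vocab)
```

The binary form records only presence or absence, while the weighted form preserves the degree information that richer representations (distributions, embeddings) build on.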
C. Label Evaluation and Challenges
Label evaluation is an essential step in guaranteeing the quality of labels and label representations. Thus, label evaluation plays a key role in the performance of multi-output tasks. Different models can be used to evaluate label quality; which to choose depends on the task. Generally, labels can be evaluated in three different respects: 1) whether the annotation is of good quality (Step A); 2) whether the chosen label representation represents the labels well (Step B); 3) whether the provided label set adequately covers the dataset (Label Set). After evaluation, a human expert is generally required to explore any underlying issues and provide feedback to improve different aspects of the labels if needed.
1) Issues of Label Annotation:
The aforementioned annotation methods, e.g., crowdsourcing, annotation tools, and social media, help researchers collect annotated data efficiently. But, without experts, these annotation methods are highly likely to result in the so-called noisy label problem, which includes both missing annotations and incorrect annotations. There are various reasons for noisy labels: for example, using crowdsourced workers that lack the required domain knowledge, social media users that include irrelevant tags with their image or post, or ambiguous text in a caption.
2) Issues of Label Representation:
Output labels can also have internal structures and, often, this structure information is critical to the performance of the multi-output learning task at hand. Tag-based information retrieval [14] and image captioning [15] are two examples where structure is crucial. However, incorporating this information into a representation as labels is a non-trivial undertaking, as the data are usually large in volume and domain knowledge is required to define their structure. In addition, the output label space might contain ambiguity. For example, a bag-of-words (BOW) is traditionally used as a representation of a label space in natural language processing tasks, but BOW contains word sense ambiguity, as two different words may have the same meaning and one word might refer to multiple meanings.
3) Issues of the Label Set:
Constructing a label set for data annotation requires a human expert with domain knowledge. Plus, it is common that the provided label set does not contain sufficient labels for the data, perhaps due to fast data growth or the low occurrence of some labels. Therefore, there are likely to be unseen labels in the test data, which leads to open-set [16], zero-shot [17] or concept drift [18] problems.

III. MULTI-OUTPUT LEARNING
In contrast to traditional single-output learning, multi-output learning can concurrently predict multiple outputs. The outputs can be of various types and structures, and the problems that can be solved are diverse. A summary of the subfields that use multi-output learning along with their corresponding output types, structures, and applications is presented in Table I.

We begin the section with an introduction to some of the output structures in multi-output learning problems. The different problem definitions common to various subfields are provided next, along with the different constraints placed on the output space. We also discuss some special cases of these problems and give a brief overview of some of the evaluation metrics that are specific to multi-output learning. The section concludes with some insights into the evolution of output dimensions through an analysis of several commonly used datasets.
A. Myriads of Output Structures
The increasing demand for sophisticated decision-making tasks has led to the creation of new outputs, some of which have complex structures. With social media, social networks, and various online services becoming ubiquitous, a wide range of output labels can be stored and then collected by researchers. Output labels can be anything; they could be text, images, audio, or video, or a combination as multimedia. For example, given a long document as input, the output might be a summary of the input in text format. Given some text fragments, the output might be an image with its contents described by the input text. Similarly, audio, such as music, and videos can be generated given different types of inputs. In addition to the different output types, there are also a number of different possible output structures. Here we present several typical output structures given an image as an input, using the example in Fig. 2 as a way to illustrate just how many output structures might be possible across all the different input types.
1) Independent Vector:
An independent vector is a vector with separate dimensions (features), where each dimension represents a particular label that does not necessarily depend on other labels. Binary vectors can be used to represent a given piece of data as tags, attributes, BOW, bag-of-visual-words, hash codes, etc. Real-valued vectors provide the weighted dimensions, where the real value represents the strength of the input data against the corresponding label. Applications include annotation or classification of text, images, or video with binary vectors [19]–[21], and demand or energy prediction with real-valued vectors [23]. An independent vector can be used to represent the tags of an image, as shown in Fig. 2 (1), where all the tags "people", "dinner", "table" and "wine" have equal weight.

TABLE I
A SUMMARY OF SUBFIELDS OF MULTI-OUTPUT LEARNING AND THEIR CORRESPONDING OUTPUT STRUCTURES, APPLICATIONS AND DISCIPLINES.

Subfield | Output Structure | Application | Discipline
Multi-label Learning | Independent Binary Vector | Document Categorization [19] | Natural Language Processing
 | | Semantic Scene Classification [20] | Computer Vision
 | | Automatic Video Annotation [21] | Computer Vision
Multi-target Regression | Independent Real-valued Vector | River Quality Prediction [22] | Ecology
 | | Natural Gas Demand Forecasting [23] | Energy Meteorology
 | | Drug Efficacy Prediction [24] | Medicine
Label Distribution Learning | Distribution | Head Pose Estimation [25] | Computer Vision
 | | Facial Age Estimation [26] | Computer Vision
 | | Text Mining [27] | Data Mining
Label Ranking | Ranking | Text Categorization Ranking [28] | Information Retrieval
 | | Question Answering [29] | Information Retrieval
 | | Visual Object Recognition [30] | Computer Vision
Sequence Alignment Learning | Sequence | Protein Function Prediction [31] | Bioinformatics
 | | Language Translation [32] | Natural Language Processing
 | | Named Entity Recognition [33] | Natural Language Processing
Network Analysis | Graph | Scene Graph [34] | Computer Vision
 | Tree | Natural Language Parsing [35] | Natural Language Processing
 | Link | Link Prediction [36] | Data Mining
Data Generation | Image | Super-resolution Image Reconstruction [37] | Computer Vision
 | Text | Language Generation | Natural Language Processing
 | Audio | Music Generation [38] | Signal Processing
Semantic Retrieval | Independent Real-valued Vector | Content-based Image Retrieval [39] | Computer Vision
 | | Microblog Retrieval [40] | Data Mining
 | | News Retrieval [41] | Data Mining
Time-series Prediction | Time Series | DNA Microarray Data Analysis [42] | Bioinformatics
 | | Energy Consumption Forecasting [43] | Energy Meteorology
 | | Video Surveillance [44] | Computer Vision

Fig. 2. An illustration of the myriad output structures given an input image from a social network.
2) Distribution:
Unlike independent vectors, distributions provide information about the probability that a particular dimension will be associated with a particular data sample. In Fig. 2 (2), the tag with the largest weight is "people", which is the main content of the image, while "dinner" and "table" have similar distributions. Applications for distribution outputs include head pose estimation [25], facial age estimation [26] and text mining [27].
3) Ranking:
Outputs might also be in the form of a ranking, which shows the tags ordered from the most to least important. The results from a distribution learning model can be converted into a ranking, but ranking models are not restricted to distribution learning. Text categorization [28], question answering [29] and visual object recognition [30] are applications where rankings are often used.
4) Text:
Text can be in the form of keywords, sentences, paragraphs, or even documents. Fig. 2 (4) illustrates an example of text output as a caption of the image: "People are having dinner". Other applications for text outputs are document summarization [45] and paragraph generation [46].
5) Sequence:
Sequence outputs refer to a series of elements selected from a label set. Each element is predicted depending on the input as well as the predicted output(s) from the preceding element. An output sequence often corresponds to an input sequence. For example, in speech recognition, we expect the output to be a sequence of text that corresponds to a given audio signal of speech [47]. In language translation, we expect the output to be a sentence transformed into the target language [32]. In the example shown in Fig. 2 (5), the input is an image caption, i.e., text, and the outputs are part-of-speech (POS) tags for each word in the sequence.
6) Tree:
Tree outputs are essentially outputs in the form of a hierarchy. The outputs, usually labels, have an internal structure where each output has a label that belongs to, or is connected to, its ancestors in the tree. For example, in syntactic parsing [35], as shown in Fig. 2 (6), each of the outputs for an input sentence is a POS tag and the entire output is a parsing tree. "people" is labeled as a noun N, but it is also part of a noun phrase NP as per the tree.
7) Image:
Images are a special form of output that consists of multiple pixel values, where each pixel is predicted depending on the input and the pixels around it. Fig. 2 (7) shows super-resolution construction [37] as one popular application where images are common outputs. Super-resolution construction means constructing a high-resolution image from a low-resolution image. Other image output applications include text-to-image synthesis [48], which generates images from natural language descriptions, and face generation [49].
8) Bounding Box:
Bounding boxes as outputs are often used to find the exact locations of an object or objects appearing in an image. This is a common task in object recognition and object detection [30]. In Fig. 2 (8), each of the faces is localized by a bounding box so that each person can be identified.
9) Link:
Links as outputs usually represent the association between two nodes in a network [36]. Fig. 2 (9) illustrates a task to predict whether two currently unlinked users will be friends in the future, given a partitioned social network where the edges represent friendships between users.
10) Graph:
Graphs are commonly used to model relationships between objects. They consist of a set of nodes and edges, where a node represents an object and an edge indicates a relationship between two objects. Scene graphs [50], for example, are often output as a way to describe the content of an image [34]. Fig. 2 (10) shows that, given an input image, the output is a graph definition where the nodes are the objects appearing in the image, i.e., "people", "dinner", "table", and "wine", and the edges are the relationships between these objects. Scene graphs are very useful as representations for tasks like image generation [51] and visual question answering [52].
11) Other Outputs:
Beyond these few types, there are still many other types of output structures. For example, contour and polygon outputs are similar to bounding boxes and can be used as labels for object localization. In information retrieval, the output(s) could be of the list type, say, of data objects that are similar to the given query. In image segmentation, the outputs are usually segmentation masks of different objects. In signal processing, outputs might be audio of speech or music. In addition, some real-world applications may require more sophisticated output structures relating to multiple tasks. For example, one may require that objects be recognized and localized at the same time, such as discovering the common saliency of multiple images in co-saliency [53], simultaneously segmenting similar objects given multiple images in co-segmentation [54], or detecting and identifying objects in multiple images in object co-detection [55].
B. Problem Definition of Multi-output Learning
Multi-output learning maps each input (instance) to multiple outputs. Assume X = R^d is a d-dimensional input space, and Y = R^m is an m-dimensional output label space. The aim of multi-output learning is to learn a function f : X → Y from the training set D = {(x_i, y_i) | 1 ≤ i ≤ n}. For each training example (x_i, y_i), x_i ∈ X is a d-dimensional feature vector, and y_i ∈ Y is the corresponding output associated with x_i. The general definition of multi-output learning is given as: finding a function F : X × Y → R based on the training sample of input-output pairs, where F(x, y) is a compatibility function that evaluates how compatible the input x and the output y are. Then, given an unseen instance x at the test stage, the output is predicted to be the one with the largest compatibility score, namely f(x) = ỹ = arg max_{y∈Y} F(x, y) [56].

This definition provides a general framework for multi-output learning problems. Although different multi-output learning subfields vary in their output structures, they can be defined within this framework given certain constraints on the output label space Y.

We selected several popular subfields and present the constraints of their output space in the following sections. Note that multi-output learning is not restricted to these particular scenarios; they are just examples for illustration.
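The compatibility-function view above can be sketched concretely. The bilinear form F(x, y) = Σ_i Σ_j x_i W_ij y_j used here is one simple assumed choice (and the candidate outputs are enumerated explicitly, which is only feasible for tiny output spaces); it is not the method of any particular work cited in this survey.

```python
# Sketch: predict via f(x) = argmax_y F(x, y) over an enumerable output space.

def compatibility(x, y, W):
    """F(x, y) = sum_i sum_j x[i] * W[i][j] * y[j]."""
    return sum(x[i] * W[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def predict(x, candidate_outputs, W):
    """f(x) = the candidate output with the largest compatibility score."""
    return max(candidate_outputs, key=lambda y: compatibility(x, y, W))

W = [[1.0, -1.0], [0.0, 2.0]]       # toy weights: d = 2 inputs, m = 2 outputs
x = [1.0, 1.0]
candidates = [(1, 0), (0, 1), (1, 1)]
best = predict(x, candidates, W)    # (1, 1) scores 2.0, the highest
```

In realistic structured output spaces the argmax cannot be enumerated; structured inference (e.g., dynamic programming) replaces the brute-force `max` here.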
1) Multi-label Learning:
The task of multi-label learning is to learn a function f(·) that predicts the proper label sets for unseen instances [1]. In this task, each instance is associated with a set of class labels/tags and is represented by a sparse binary label vector. A value of +1 denotes that the instance is labeled and −1 means unlabeled. Thus, y_i ∈ Y = {−1, +1}^m. Given an unseen instance x ∈ X, the learned multi-label classification function f(·) outputs f(x) ∈ Y, where the labels in the output vector with a value of +1 are used as the predicted labels for x.
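A common way to produce such a {−1, +1}^m vector, sketched below under the assumed decision rule of thresholding per-label real scores at zero (many multi-label methods use some variant of this):

```python
# Sketch: per-label scores -> {-1, +1}^m multi-label prediction by
# thresholding at 0 (assumed decision rule, for illustration only).

def multilabel_predict(scores):
    return [1 if s > 0 else -1 for s in scores]

scores = [0.8, -0.3, 1.2, -2.0]      # one real score per label
y_hat = multilabel_predict(scores)   # [1, -1, 1, -1]
```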
2) Multi-target Regression:
The aim of multi-target regression is to simultaneously predict multiple real-valued output variables for one instance [5], [57]. Here, multiple labels are associated with each instance, represented by a real-valued vector, where the values represent how strongly the instance corresponds to a label. Therefore, we have the constraint y_i ∈ Y = R^m. Given an unseen instance x ∈ X, the learned multi-target regression function f(·) predicts a real-valued vector f(x) ∈ Y as the output.
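The simplest baseline for this setting, sketched here with toy one-dimensional inputs, fits one independent least-squares regressor per output dimension; it ignores any dependencies between outputs, which is exactly the limitation that dedicated multi-target methods address.

```python
# Sketch: independent per-target least-squares regression as a
# multi-target baseline (toy data, single input feature).

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def fit_multi_target(xs, Y):
    """Y[i] is the m-dimensional output vector for input xs[i]."""
    m = len(Y[0])
    return [fit_line(xs, [y[j] for y in Y]) for j in range(m)]

xs = [0.0, 1.0, 2.0]
Y = [[0.0, 1.0], [2.0, 1.5], [4.0, 2.0]]   # m = 2 real-valued targets
models = fit_multi_target(xs, Y)            # slopes 2.0 and 0.5
```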
3) Label Distribution Learning:
Label distribution learning determines the relative importance of each label in the multi-label learning problem [58]. This is as opposed to multi-label learning, which simply learns to predict a set of labels. As illustrated in Fig. 2, the idea of label distribution learning is to predict multiple labels with a degree value that represents how well each label describes the instance. Therefore, the sum of the degree values for each instance is 1. Thus, the output space for label distribution learning satisfies y_i = (y_i^1, y_i^2, ..., y_i^m) ∈ Y = R^m with the constraints y_i^j ∈ [0, 1], 1 ≤ j ≤ m and Σ_{j=1}^m y_i^j = 1.
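A standard way to satisfy these simplex constraints is to map unnormalized per-label scores through a softmax, as sketched below (the scores are invented for illustration):

```python
import math

# Sketch: softmax turns arbitrary real scores into a valid label
# distribution: each degree in [0, 1], degrees summing to 1.

def label_distribution(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d = label_distribution([2.0, 1.0, 0.5, 0.5])
# d[0] is the largest degree; sum(d) == 1 up to float rounding
```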
4) Label Ranking:
The goal of label ranking is to map instances to a total order over a finite set of predefined labels [4]. In label ranking, each instance is associated with the rankings of multiple labels. Therefore, the outputs of the problem are the total order of all the labels for each instance. Let L = {λ_1, λ_2, ..., λ_m} denote the predefined label set. A ranking can be represented as a permutation π on {1, 2, ..., m}, such that π(j) = π(λ_j) is the position of the label λ_j in the ranking. Therefore, given an unseen instance x ∈ X, the learned label ranking function f(·) predicts a permutation f(x) = (y_i^{π(1)}, y_i^{π(2)}, ..., y_i^{π(m)}) ∈ Y as the output.
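One simple route to such a permutation, sketched below, scores each label and sorts the label indices by decreasing score (a reduction used by many score-based label ranking methods; the scores here are invented):

```python
# Sketch: per-label relevance scores -> a permutation of label indices;
# position 0 holds the top-ranked label.

def rank_labels(scores):
    return sorted(range(len(scores)), key=lambda j: -scores[j])

scores = [0.2, 0.9, 0.5]          # scores for labels lambda_1..lambda_3
ranking = rank_labels(scores)     # [1, 2, 0]: lambda_2 ranked first
```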
5) Sequence Alignment Learning:
Sequence alignment learning aims to identify the regions of relationships between two or more sequences. The outputs in this task are a sequence of multiple labels for the input instance. The output vector has the constraint y_i ∈ Y = {1, 2, ..., c}^m, where c denotes the total number of labels. In sequence alignment learning, m may vary depending on the input. Given an unseen instance x ∈ X, the learned sequence alignment function f(·) outputs f(x) ∈ Y, where all of the predicted labels in the output vector form the predicted sequence for x.
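Because each predicted label depends on its predecessor, sequence outputs are typically decoded with dynamic programming rather than per-position argmax. A minimal Viterbi sketch (all scores below are toy numbers assumed for illustration):

```python
# Sketch of Viterbi decoding for a sequence of dependent labels.
# emit[t][c]: score of label c at position t; trans[p][c]: score of
# moving from label p to label c.

def viterbi(emit, trans):
    T, C = len(emit), len(emit[0])
    best = [emit[0][:]]            # best[t][c]: best score ending in c at t
    back = []                      # backpointers for path recovery
    for t in range(1, T):
        row, ptr = [], []
        for c in range(C):
            scores = [best[-1][p] + trans[p][c] for p in range(C)]
            p_best = max(range(C), key=lambda p: scores[p])
            row.append(scores[p_best] + emit[t][c])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    c = max(range(C), key=lambda j: best[-1][j])
    path = [c]
    for ptr in reversed(back):     # walk backpointers from the end
        c = ptr[c]
        path.append(c)
    return path[::-1]

emit = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]   # 3 positions, 2 labels
trans = [[0.1, 0.0], [0.0, 0.1]]
labels = viterbi(emit, trans)                 # [0, 1, 0]
```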
6) Network Analysis:
Network analysis explores the relationships and interactions between objects and entities in a network structure, and link prediction is a common task within this subfield. Let G = (V, E) denote the graph of a network. V is the set of nodes, which represent objects, and E is the set of edges, which represent the relationships between objects. Given a snapshot of a network, the goal of link prediction is to infer whether a connection exists between two nodes. The output vector y_i ∈ Y = {−1, +1}^m is a binary vector whose values represent whether there will be an edge e = (u, v) between a pair of nodes u, v ∈ V with e ∉ E. m is the number of node pairs that do not appear in the current graph G, and each dimension in y_i represents a pair of nodes that are not currently connected.
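A classic unsupervised score for ranking such candidate edges is the number of common neighbors, sketched below on an invented toy graph (one of several standard heuristics, not the approach of [36] specifically):

```python
# Sketch: common-neighbors score for link prediction on an undirected
# graph given as an adjacency dict of node -> set of neighbors.

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
score = common_neighbors(adj, "a", "d")   # "a" and "d" share neighbor "c"
```

Candidate node pairs with higher scores are predicted as future links (+1), the rest as −1.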
7) Data Generation:
Data generation is a subfield of multi-output learning that aims to create and then output structured data of a certain distribution. Deep generative models are usually used to generate the data, which may be in the form of text, images, or audio. The multiple output labels in the problem become the different words in the vocabulary, the pixel values, the audio tones, etc. Take image generation as an example. The output vector has the constraint y_i ∈ Y = {0, 1, ..., 255}^{m_w × m_h × 3}, where m_w and m_h are the width and height of the image. Given an unseen instance x ∈ X, which is usually random noise or an embedding vector with some constraints, the learned GAN-based network f(·) outputs f(x) ∈ Y, where all of the predicted pixel values in the output vector form the generated image for x.
8) Semantic Retrieval:
Semantic retrieval means finding the meanings within some given information. Here, we consider semantic retrieval in a setting where each input instance has semantic labels that can be used to help retrieval [59]. Thus, each instance representation comprises semantic labels as the output y_i ∈ Y = R^m. Given an unseen instance x ∈ X as the query, the learned retrieval function f(·) predicts a real-valued vector f(x) ∈ Y as the intermediate output result. The intermediate output vector can then be used to retrieve a list of similar data instances from the database by using a proper distance-based retrieval method.
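The final distance-based step can be sketched as follows, assuming Euclidean distance in the label space (the item IDs and vectors are invented for the example; real systems would use approximate nearest-neighbor search at scale):

```python
# Sketch: retrieve database items nearest to the query's predicted
# label vector, sorted by Euclidean distance.

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def retrieve(query_vec, database):
    """database: list of (item_id, label_vector) pairs."""
    return sorted(database, key=lambda kv: euclidean(query_vec, kv[1]))

db = [("img1", [1.0, 0.0]), ("img2", [0.0, 1.0]), ("img3", [0.9, 0.1])]
nearest = retrieve([1.0, 0.0], db)[0][0]   # "img1" is the closest item
```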
9) Time-series Prediction:
The goal in time-series prediction is to predict the future values in a series based on previous observations [60]. The inputs are a series of data vectors for a period of time, and the output is a data vector for a future timestamp. Let t denote the time index. The output vector at time t is represented as y_i^t ∈ Y = R^m. Therefore, the outputs within a period of time from t = 0 to t = T are y_i = (y_i^0, ..., y_i^t, ..., y_i^T). Given previously observed values, the learned time-series function outputs predicted consecutive future values.

C. Special Cases of Multi-output Learning

1) Multi-class Classification:
Multi-class classification can be categorized as a traditional single-output learning paradigm if the output class is represented as either an integer encoding or a one-hot vector.
2) Fine-grained Classification:
Fine-grained classification is a challenging multi-class classification task where the categories may only have subtle visual differences [61]. Although the output of fine-grained classification shares the same vector representation as multi-class classification, the vectors have different internal structures. Also, in its label hierarchy, labels with the same parents tend to be more closely related than labels with different parents.
3) Multi-task Learning:
Multi-task learning (MTL) is the subfield that aims to improve generalization performance by learning multiple related tasks simultaneously [62], [63]. Each task in the problem outputs a single label or value. This can be thought of as part of the multi-output learning paradigm in that learning multiple tasks is similar to learning multiple outputs. MTL leverages the relatedness between tasks to improve the performance of learning models. One major difference between multi-task learning and multi-output learning is that, in multi-task learning, different tasks might be trained on different training sets or features, while, in multi-output learning, the output variables usually share the same training data or features.
D. Model Evaluation Metrics
In this section, we present the conventional evaluation metrics used to assess multi-output learning models with a test dataset. Let T = {(x_i, y_i) | 1 ≤ i ≤ N} be the test dataset, f(·) be the multi-output learning model, and ŷ_i = f(x_i) be the predicted output of f(·) for the testing example x_i. In addition, let Y_i and Ŷ_i denote the set of labels corresponding to y_i and ŷ_i, respectively. I is an indicator function, where I(g) = 1 if g is true, and 0 otherwise.
1) Classification-based Metrics:
Classification-based metrics evaluate the performance of multi-output learning with respect to classification problems, such as multi-label classification, semantic retrieval, image annotation, label ranking, etc. The outputs are usually discrete values. The conventional classification metrics fall into three groups: example-based, label-based, and ranking-based.

(a) Example-based Metrics: Example-based metrics [64] evaluate the performance of multi-output learning models with respect to each data instance. Performance is first evaluated on each test instance separately, and then the mean of all the individual results is used to reflect the overall performance of the model. Since the evaluation of multi-output classification tasks works under the same mechanism as binary classification (single-output) tasks, the classic metrics for binary classification can be extended to evaluate multi-output classification models [1]. The commonly used metrics are exact match ratio, accuracy, precision, recall, and F score.

Hamming loss is an example-based metric specifically designed for multi-output classification tasks. It computes the average difference between the predicted and actual outputs, considering both prediction and omission errors, i.e., when the prediction is incorrect or a label is not predicted at all. The Hamming loss averaged over all data instances is defined as:
\mathrm{HammingLoss} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{m} \lvert Y_i \, \Delta \, \hat{Y}_i \rvert

where m is the number of labels and Δ represents the symmetric difference between two sets. The lower the Hamming loss, the better the performance of the model.

(b) Label-based Metrics: Label-based metrics evaluate performance with respect to each output label. These metrics aggregate the contributions of all the labels to arrive at an averaged evaluation of the model. There are two techniques for obtaining label-based metrics: macro- and micro-averaging. Macro-based approaches compute the metrics for each label independently and then average over all the labels with equal weights. By contrast, micro-based approaches give equal weight to every data sample. Let
TP_l, FP_l, TN_l, and FN_l denote the number of true positives, false positives, true negatives, and false negatives for each label l, respectively. Let B be a binary evaluation metric (accuracy, precision, recall, or F score) for a particular label. The macro and micro approaches are therefore defined as

macro-averaging:

B_{macro} = \frac{1}{m} \sum_{l=1}^{m} B(TP_l, FP_l, TN_l, FN_l),

micro-averaging:

B_{micro} = B\Big(\sum_{l=1}^{m} TP_l, \sum_{l=1}^{m} FP_l, \sum_{l=1}^{m} TN_l, \sum_{l=1}^{m} FN_l\Big).

(c) Ranking-based Metrics: Ranking-based metrics evaluate the performance in terms of the ordering of the output labels.
One-error is the number of times the top-ranked label is not in the true label set. This approach only considers the most confident predicted label of the model. The averaged one-error over all data instances is computed as:
\mathrm{One\text{-}error} = \frac{1}{N} \sum_{i=1}^{N} I\Big(\arg\min_{\lambda \in \mathcal{L}} \pi_i(\lambda) \notin Y_i\Big)

where I is an indicator function, L denotes the label set, and π_i(λ) is the predicted rank of label λ for the test instance x_i. The smaller the one-error, the better the performance. Ranking loss indicates the average proportion of incorrectly ordered label pairs.
\mathrm{RankingLoss} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_i||\bar{Y}_i|} |E|, \quad \text{where } E = \{(\lambda_a, \lambda_b) : \pi_i(\lambda_a) > \pi_i(\lambda_b), (\lambda_a, \lambda_b) \in Y_i \times \bar{Y}_i\}

where \bar{Y}_i = \mathcal{L} \setminus Y_i. The smaller the ranking loss, the better the performance of the model. Average precision (AP) is the proportion of the labels ranked above a particular label in the true label set, averaged over all the true labels. The larger the value, the better the performance of the model. The averaged AP over all test data instances is defined as follows:

\mathrm{AP} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_i|} \sum_{\lambda \in Y_i} \frac{|\{\lambda' \in Y_i \mid \pi_i(\lambda') \le \pi_i(\lambda)\}|}{\pi_i(\lambda)}

Discussion:
The metrics listed above are those commonly used with classification-based multi-output learning problems, but the choice of metric varies according to the different considerations of each application. Take image annotation for example. If the aim of the task is to annotate each image correctly, example-based metrics are optimal for evaluating performance. However, if the objective is keyword-based image retrieval, the macro-averaging metric is preferable [64]. Further, some metrics are more suited to special cases of multi-output learning problems. For instance, in imbalanced learning tasks, the geometric mean [65] of some classification-based metrics, e.g., the errors, accuracy, or F1-scores, gives a more convincing evaluation. The minimum sensitivity [66] can help determine the classes that hinder performance in the imbalanced setting. We do not discuss these metrics in detail as they are not the focus here.
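As a concrete illustration, the example-, label-, and ranking-based metrics above can be computed in a few lines on toy predictions (all values below are invented; this is a minimal sketch, not a benchmark protocol):

```python
import numpy as np

Y_true = np.array([[1, 0, 1],           # N = 2 instances, m = 3 labels
                   [0, 1, 1]])
Y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])

# Example-based: Hamming loss = (1/N) sum_i |Y_i symdiff Yhat_i| / m.
hamming = np.mean(Y_true != Y_pred)

# Label-based: macro averages per-label precision; micro pools the counts.
tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)
fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0)
macro_prec = np.mean(tp / np.maximum(tp + fp, 1))
micro_prec = tp.sum() / max((tp + fp).sum(), 1)

# Ranking-based: one-error from per-label ranks pi_i (1 = most confident).
ranks = np.array([[1, 3, 2],
                  [1, 2, 3]])
one_error = np.mean([ranks[i].argmin() not in np.flatnonzero(Y_true[i])
                     for i in range(len(ranks))])
print(hamming, macro_prec, micro_prec, one_error)
```

Note how macro- and micro-averaged precision disagree here because the per-label counts differ: macro weights each label equally, micro weights each prediction equally.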
2) Regression-based Metrics:
Unsurprisingly, regression-based metrics evaluate multi-output learning performance on regression problems, e.g., object localization or image generation. The outputs are usually real values.
Mean absolute error (MAE) is a classic single-output regression metric that computes the absolute difference between the predicted and the actual outputs. It can be extended to evaluate multi-output regression models by simply averaging the metric over all the outputs.
\mathrm{MAE} = \frac{1}{mN} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert

Mean squared error (MSE) is a regression metric that computes the average squared difference between the predicted and the actual outputs. Like MAE, it can also be extended to the multi-output setting. However, MSE is more sensitive to outliers, as they contribute much higher errors than under MAE.
\mathrm{MSE} = \frac{1}{mN} \sum_{i=1}^{N} ( y_i - \hat{y}_i )^2

Average correlation coefficient (ACC) measures the degree of association between the actual and the predicted outputs.
\mathrm{ACC} = \frac{1}{m} \sum_{l=1}^{m} \frac{\sum_{i=1}^{N} (y_i^l - \bar{y}^l)(\hat{y}_i^l - \bar{\hat{y}}^l)}{\sqrt{\sum_{i=1}^{N} (y_i^l - \bar{y}^l)^2 \sum_{i=1}^{N} (\hat{y}_i^l - \bar{\hat{y}}^l)^2}}

where y_i^l and ŷ_i^l are the actual and predicted l-th output of x_i, respectively, and ȳ^l and the corresponding predicted mean are the averages of the actual and predicted outputs for label l over all samples. Intersection over union (IoU) is a specifically designed metric for assessing object localization or segmentation. It is computed as:
\mathrm{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}

where the area of overlap is the area of intersection between the predicted and the actual bounding boxes/segmentation masks. Similarly, the area of union is the area of the union of the actual and predicted boxes/masks.
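A minimal numeric sketch of MAE, MSE, and box IoU, using invented toy outputs (our own example, not from the survey):

```python
import numpy as np

Y = np.array([[1.0, 2.0], [3.0, 4.0]])      # actual outputs (N=2, m=2)
Yhat = np.array([[1.5, 2.0], [2.0, 4.0]])   # predicted outputs

mae = np.abs(Y - Yhat).sum() / Y.size        # averaged over all mN entries
mse = ((Y - Yhat) ** 2).sum() / Y.size       # penalizes large errors more

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter)

print(mae, mse, iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

The single outlying entry (3.0 predicted as 2.0) shows MSE's sensitivity: it contributes 1.0 squared error versus 1.0 absolute error, dominating the MSE but not the MAE.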
3) New Metrics:
Data generation is an emerging subfield of multi-output learning that uses generative models to output structured data with certain distributions. Based on the particulars of the task at hand, a model's performance is usually evaluated in two respects: 1) whether the generated data actually follows the desired real data distribution; and 2) the quality of the generated samples. Metrics like average log-likelihood [67], the coverage metric [68], maximum mean discrepancy (MMD) [69], and the geometry score [70] are frequently used to assess the veracity of the distribution. Quantifying the quality of the generated data remains challenging. The most commonly used metrics are the inception score (IS) [71], mode score (MS) [72], Fréchet inception distance (FID) [73], and kernel inception distance (KID) [74]. Precision, recall, and F1 score are also employed with GANs to quantify the degree of overfitting in the model [75].
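Of the distribution-level metrics above, MMD is simple enough to sketch directly from its definition. The following is a biased RBF-kernel estimate with an arbitrary bandwidth of our choosing, not the evaluation protocol of any cited work:

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy with an
    RBF kernel: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]. Small values
    suggest the generated sample Y follows the real distribution X."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 2))
same = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution as "real"
far = rng.normal(3.0, 1.0, size=(200, 2))    # shifted distribution
print(mmd2(real, same), mmd2(real, far))     # second value is much larger
```

In GAN evaluation, `real` would be held-out data and the second argument a batch of generated samples; the kernel bandwidth is typically chosen by a heuristic such as the median pairwise distance.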
E. Multi-output Learning Datasets
Most of the datasets used to experiment with multi-output learning problems have either been constructed or become popular because they reflect, and therefore test, a challenge that needs to be overcome. We have organized these datasets according to the challenges reflected in the 4 Vs. Table II lists the datasets, including their multi-output characteristics, the challenge that can be tested, and the application domain, plus the dataset name, source, and descriptive statistics.

The large-scale datasets, i.e., the datasets that can be used to test volume, are extremely large. The enormity of their corresponding statistics illustrates the pressing need to overcome the challenges caused by this particular V among the 4. Many studies that have focused on change in output distribution, e.g., concept drift/velocity, rely on synthetic streaming data or static databases in their experiments. We have also included some of the more popular real-world and/or dynamic databases that are used to experiment with these tasks. As shown in the table, the datasets come from various application domains, demonstrating the importance of this challenge. The datasets designed to test complex multi-output learning problems contain a mix of different output structures. For example, the image datasets listed in the table include both labels and bounding boxes for the objects. These datasets can be used to test the variety of data. Lastly, we come to veracity. Many efforts to deal with noisy labels evaluate their methods by beginning with a clean dataset to which artificial noise is then added. This helps researchers control and test different levels of noise. We have also listed several popular real-world datasets with some unknown level of errors in annotation. (An extreme classification repository is available at http://manikvarma.org/downloads/XC/XMLRepository.html.)

IV. THE CHALLENGES OF MULTI-OUTPUT LEARNING AND REPRESENTATIVE WORKS
The pressing need for complex prediction outputs and the explosive growth of output labels pose several challenges to multi-output learning and have exposed the inadequacies of many learning models that exist to date. In this section, we discuss each of these challenges and review several representative works on how they cope with these emerging phenomena. Further, given the success of artificial neural networks (ANNs), we also present several state-of-the-art examples of multi-output learning using an ANN for each challenge.
A. Volume - Extreme Output Dimensions
Large-scale datasets are ubiquitous in real-world applications. A dataset is defined to be large-scale if it meets one of three criteria: it has a large number of data instances, the input feature space has high dimensionality, or the output space has high dimensionality. Many studies have sought to solve the scalability issues caused by a large number of data instances, e.g., the instance selection method in [212], or by high-dimensional feature spaces, such as the feature selection method in [213]. However, the issues associated with high output dimensions have received much less attention. Consider, for example, that if the label for each dimension of an m-dimensional output vector can be selected from a label set with c different labels, then the number of output outcomes is c^m. Hence, these ultra-high output dimensions/labels result in an extremely large output space and, in turn, high computation costs. Therefore, it is crucial to design multi-output learning models that can handle the immense and ongoing growth in outputs.

An analysis of the current state-of-the-art research on ultra-high output dimensions revealed some interesting insights. Our analysis was based on the datasets used in studies across multiple disciplines, such as machine learning, computer vision, natural language processing, information retrieval, and data mining. We specifically focused on articles in three top journals and three top international conferences: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), IEEE Transactions on Neural Networks and Learning Systems (TNNLS), the Journal of Machine Learning Research (JMLR), the International Conference on Machine Learning (ICML), the Conference on Neural Information Processing Systems (NIPS), and the Conference on Knowledge Discovery and Data Mining (KDD). Fig. 3 and Fig. 4 summarize our review. From these two figures, it is evident that the output dimensionality of the algorithms under study has continued to increase over time.
In addition, the latest papers to address this issue in all selected titles are now dealing with more than a million output dimensions and, in some cases, are approaching billions of outputs. Moreover, the statistics for the conferences with shorter time-lags to publication demonstrate just how rapidly output dimensionality is increasing. From this analysis, we conclude that the explosion in output dimensionality is driving many developments in multi-output learning algorithms.

TABLE II
CHARACTERISTICS OF THE DATASETS OF MULTI-OUTPUT LEARNING TASKS

Volume / Extreme Output Dimension (statistic: output dimension)
  Review Text           AmazonCat-13K      13,330      [76]
  Review Text           AmazonCat-14K      14,588      [77], [78]
  Text                  Wiki10-31          30,938      [79], [80]
  Social Bookmarking    Delicious-200K     205,443     [79], [81]
  Text                  WikiLSHTC-325K     325,056     [82], [83]
  Text                  Wikipedia-500K     501,070     Wikipedia
  Product Network       Amazon-670K        670,091     [76], [79]
  Text                  Ads-1M             1,082,898   [82]
  Product Network       Amazon-3M          2,812,281   [77], [78]

Volume / Extreme Class Imbalance (statistic: largest class imbalance ratio)
  Scene Image           WIDER-Attribute          1:28      [84]
  Face Image            Celeb Faces Attributes   1:43      [85]
  Clothing Image        DeepFashion              1:733     [86]
  Clothing Image        X-Domain                 1:4,162   [87]

Volume / Unseen Outputs (statistic: seen / unseen labels)
  Image Attribute       Pascal and Yahoo         20 / 12       [88]
  Animal Image          Animal with Attributes   40 / 10       [88]
  Scene Image           HSUN                     80 / 27       [89]
  Music                 MagTag5K                 107 / 29      [90]
  Bird Image            Caltech-UCSD Birds 200   150 / 50      [91]
  Scene Image           SUN Attributes           645 / 72      [20]
  Health                MIMIC II                 3,228 / 355   [92]
  Health                MIMIC III                4,403 / 178   [93]

Velocity / Change of Output Distribution (statistic: time period)
  Text                  Reuters                                    365 days     [94]
  Route                 ECML/PKDD 15: Taxi Trajectory Prediction   365 days     [95]
  Route                 epfl/mobility                              30 days      [96]
  Electricity           Portuguese Electricity Consumption         365 days     [97]
  Traffic Video         MIT Traffic Data Set                       90 minutes   [44]
  Surveillance Video    VIRAT Video                                8.5 hours    [98]

Variety / Complex Structures (statistic: output structures)
  Image                          LabelMe                            Label, Bounding Box      [8]
  Image                          ImageNet                           Label, Bounding Box      [7]
  Image                          PASCAL VOC                         Label, Bounding Box      [99]
  Image                          CIFAR100                           Hierarchical Label       [100]
  Lexical Database               WordNet                            Hierarchy                [101]
  Wikipedia Network              Wikipedia                          Graph, Link              [102]
  Blog Network                   BlogCatalog                        Graph, Link              [103]
  Author Collaboration Network   arXiv-AstroPh                      Link                     [104]
  Author Collaboration Network   arXiv-GrQc                         Link                     [104]
  Text                           CoNLL-2000 Shared Task             Text Chunks              [105]
  Text                           Wall Street Journal (WSJ) corpus   POS Tags, Parsing Tree   -
  European Languages             Europarl corpus                    Sequence                 [32]

Veracity / Noisy Output Labels (statistic: noisy labeled samples)
  Dog Image             AMT          7,354   [106]
  Food Image            Food101N     310K    [107]
  Clothing Image        Clothing1M   1M      [108]
  Web Image             WebVision    2.4M    [109]
  Image and Video       YFCC100M     100M    [110]

The studies we reviewed tend to fall into two categories: qualitative and quantitative approaches. The qualitative approaches generally involve generative models, while the quantitative approaches generally involve discriminative models. The main difference between the two is that generative models focus on learning the joint probability P(x, y) of the input x and the label y, while discriminative models focus on the posterior P(y | x). Note that in a generative model, P(x, y) can be used to generate some data x; in this case, x is the generated output.
1) Qualitative Approaches/Generative Models:
The aim of image synthesis [48], [214] is to synthesize new images from textual descriptions of the image. Some pioneering researchers have synthesized images using a GAN with the image distribution as multiple outputs [215]. Early GANs, however, could only generate low-resolution images. Since these first forays, there has been progress in scaling up GANs to generate high-resolution images with sensible outputs. For example, Reed et al. [48] proposed a GAN architecture that generates visually plausible 64 x 64 pixel images given text descriptions. In a follow-up study, they presented GAWWN [214], which scales the synthesized image up to 128 x 128 resolution by leveraging additional annotations. Subsequently, StackGAN [216] was proposed, which is capable of generating photo-realistic images at a 256 x 256 resolution from text descriptions. HDGAN [217] is the current state-of-the-art in image synthesis. It models high-resolution images in an end-to-end fashion at 512 x 512 pixels. Inevitably, the future will see further increases in resolution.

MaskGAN [218] uses a GAN to generate text (i.e., meaningful word sequences). The label set size accords with the vocabulary size. The output dimension is the length of the word sequence that is generated, which, technically, can be unlimited. However, MaskGAN only handles sentence-level text generation. Document-level and book-level text generation are still challenging.

Fig. 3. Output dimension trends from papers published in the journals TPAMI, TNNLS, and JMLR since 2013 [111]–[165].
Fig. 4. Output dimension trends from papers published in the conferences ICML, NIPS, and KDD since 2013 [79], [82], [166]–[211].
2) Quantitative Approaches/Discriminative Models:
Like instance and feature selection methods, which reduce the number of input instances and, in turn, input dimensionality, it is natural to design models that similarly reduce output dimensionality. Embedding methods can be used to compress a space by projecting the original space onto a lower-dimensional space with the expected information, such as label correlations and neighborhood structure, preserved. Popular methods, such as random projections or canonical correlation analysis projections [219]–[222], can be adopted to reduce the dimensions of the output label space. As a result, the modeling tasks can be performed on a compressed output label space, and the predicted compressed label can then be projected back onto the original high-dimensional label space. Recently, several embedding methods have been proposed for extreme output dimensions. Mineiro and Karampatziakis [223] proposed a novel randomized embedding for extremely large output spaces. AnnexML [169] is another novel embedding method for graphs that captures graph structures in the embedding space. The embeddings are constructed from the k-nearest neighbors of the label vectors, and predictions are made efficiently through an approximate nearest neighbor search method. Two popular ANN methods for handling extreme output dimensions are fastText learn tree [224] and XML-CNN [225]. FastText learn tree [224] jointly learns the data representation and the tree structure, and the learned tree structure is then used for efficient hierarchical prediction. XML-CNN is a CNN-based model that incorporates a dynamic max pooling scheme to capture fine-grained features from regions of the input document. A hidden bottleneck layer is used to reduce the model size.

B. Variety - Complex Structures
With the increasing abundance of labels, there is a pressing need to understand their inherent structures. Complex output structures can lead to multiple challenges in multi-output learning. For instance, it is common for strong correlations and complex dependencies to exist between labels. Therefore, appropriately modeling output dependencies in the label representation is critical but non-trivial in multi-output learning. In addition, designing a multi-variate loss function and proposing an efficient algorithm to alleviate the high complexity caused by complex structures is also challenging.
1) Appropriate Modeling of Output Dependencies:
The simplest method of multi-output learning is to decompose the learning problem into m independent single-output problems, each corresponding to a single value in the output space. A representative approach is binary relevance (BR) [226], which independently learns binary classifiers for all the labels in the output space. Given an unseen instance x, BR predicts the output labels by evaluating each of the binary classifiers and then aggregating the predicted labels. However, such independent models do not consider the dependencies between outputs. A set of predicted output labels might be assigned to the testing instance even though these labels never co-occur in the training set. Hence, it is crucial to model the output dependencies appropriately to obtain better performance on multi-output tasks.

Many classic learning methods have been proposed to model multiple outputs with interdependencies. These include label powersets (LPs) [227], classifier chains (CC) [228], [229], structured SVMs (SSVMs) [230], conditional random fields (CRFs) [231], etc. LPs model the output dependencies by treating each different combination of labels in the output space as a single label, which transforms the problem into one of learning multiple single-label classifiers. The number of single-label classifiers to be trained is the number of label combinations, which grows exponentially with the number of labels. Therefore, LP has the drawback of high computation cost when training with a large number of output labels. Random k-labelsets [232], an ensemble of LP classifiers, is a variant of LP that alleviates the computational complexity problem by training each LP classifier on a different random subset of labels.

CC improves BR by taking the output correlations into account. It links all the binary classifiers from BR into a chain via a modified feature space. For the j-th label, the instance x_i is augmented with the 1st, 2nd, ..., (j−1)th labels, i.e., (x_i, l_1, l_2, ..., l_{j−1}), as the input to train the j-th classifier. Given an unseen instance, CC predicts the output using the 1st classifier, and then augments the instance with the prediction from the 1st classifier as the input to the 2nd classifier for predicting the next output. CC proceeds in this way from the 1st classifier to the last and so preserves the output correlations. However, a different order of chains leads to different results. ECC [228], an ensemble of CC, was proposed to solve this problem. It trains the classifiers over a set of random ordering chains and averages the results. Probabilistic classifier chains (PCCs) [233] provide a probabilistic interpretation of CC by estimating the joint distribution of the output labels to capture the output correlations. CCMC [114] is a classifier chain model that considers the order of label difficulties to reduce the degradation in performance caused by ambiguous labels. It is an easy-to-hard learning paradigm that identifies easy and hard labels and uses the predictions for easy labels to help solve the harder labels.

SSVM leverages the idea of large margins to deal with multiple interdependent outputs. The compatibility function is defined as F(x, y) = w^T Φ(x, y), where w is the weight vector and Φ :
X × Y → R^q is the joint feature map over input and output pairs. The SSVM aims to find the classifier h_w(x) = argmax_{y ∈ Y} ⟨w, Φ(x, y)⟩ with the following objective:

\min_{w \in R^q} \; \lambda \lVert w \rVert^2 + \frac{C}{n} \sum_{i=1}^{n} \underbrace{\max_{y \in \mathcal{Y}} \{\Delta(y_i, y) + w^T \Phi(x_i, y)\} - w^T \Phi(x_i, y_i)}_{\text{structured hinge loss}}

Constraining the structured hinge loss with \Delta(y_i, y) + w^T \Phi(x_i, y) - w^T \Phi(x_i, y_i) \le \xi_i for all y ∈ Y, the objective can be reformulated as

\min_{w \in R^q, \{\xi_i \ge 0\}_{i=1}^{n}} \; \lambda \lVert w \rVert^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T \Phi(x_i, y_i) - w^T \Phi(x_i, y) \ge \Delta(y_i, y) - \xi_i, \; \forall y \in \mathcal{Y} \setminus y_i, \; \forall i. \tag{1}

where Δ : Y × Y → R is a loss function, C is a positive constant that controls the trade-off between training error minimization and margin maximization [56], n is the number of training samples, and ξ_i is the slack variable. In practice, the SSVM is solved with the cutting-plane algorithm [234].

Apart from the classic models that learn the correlations between outputs, some state-of-the-art multi-output learning models are based on ANNs. For example, models based on convolutional neural networks typically focus on hierarchical multi-labels [235] or rankings [236]. Recurrent neural network (RNN) models generally focus on sequence-to-sequence learning [237] and time-series prediction [238]. Generative deep neural networks are used to generate output data, such as images, text, and audio [215].
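For a small enumerable output space, the structured hinge loss inside the SSVM objective can be computed directly. In the sketch below, `scores` stands in for the compatibility values w^T Φ(x, y), and the toy output space and margin are invented for illustration:

```python
def structured_hinge(scores, y_true, delta):
    """max_y [Delta(y_true, y) + score(y)] - score(y_true), where
    scores[y] stands in for w^T Phi(x, y) over an enumerable space."""
    augmented = [delta(y_true, y) + s for y, s in enumerate(scores)]
    return max(augmented) - scores[y_true]

# Toy output space of 3 structures with a Hamming-style margin Delta.
labels = [(0, 0), (0, 1), (1, 1)]
delta = lambda yt, y: sum(a != b for a, b in zip(labels[yt], labels[y]))
scores = [2.0, 1.5, 0.5]          # compatibility scores w^T Phi(x, y)
print(structured_hinge(scores, 0, delta))
```

Even though the correct structure scores highest here, the loss is positive: the margin Δ demands that wrong structures be beaten by at least their distance from the truth, which is exactly what the constraints in (1) encode.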
2) Multivariate Loss Functions:
Various loss functions have been defined to compute the difference between the ground truth and the predicted output. Different loss functions yield different errors given the same dataset, and they greatly affect the performance of the model.

0/1 loss is a standard loss function that is commonly used in classification [239]:

L_{0/1}(y, y') = I(y \ne y') \tag{2}

where I is the indicator function. In general, 0/1 loss counts the number of misclassified training examples. However, it is very restrictive and does not consider label dependency. Therefore, it is not suitable for large numbers of outputs or for outputs with complex structures. In addition, it is non-convex and non-differentiable, so it is difficult to minimize this loss with standard convex optimization methods. In practice, one typically uses a surrogate loss, which is a convex upper bound of the task loss. However, a surrogate loss in multi-output learning usually loses consistency when generalizing single-output methods to deal with multiple outputs [240]. Several works on subfields of multi-output learning study the consistency of different surrogate functions and show that they are consistent under some sufficient conditions [241], [242]. Yet this is still a challenging aspect of multi-output learning. More exploration of the theoretical consistency of different problems is required. Below, we describe four popular surrogate losses: hinge loss, negative log loss, perceptron loss, and softmax-margin loss.

Hinge loss is one of the most widely used surrogate losses and is usually used in structured SVMs [243]. It pushes the score of the correct output to be greater than that of the prediction:

L_{Hinge}(x, y, w) = \max_{y' \in \mathcal{Y}} [\Delta(y, y') + w^T \Phi(x, y')] - w^T \Phi(x, y) \tag{3}

The margin, Δ(y, y'), has different definitions based on the output structure and task.
For example, for sequence learning or outputs with equal weights, Δ(y, y') can simply be defined as the Hamming loss \sum_{j=1}^{m} I(y(j) \ne y'(j)). For taxonomic classification with a hierarchical output structure, Δ(y, y') can be defined as the tree distance between y and y' [19]. For ranking, Δ(y, y') can be defined as the mean average precision of a ranking y' compared to the optimal y [244]. In syntactic parsing, Δ(y, y') is defined as the number of labeled spans where y and y' do not agree [35]. Non-decomposable losses, such as the F measure, average precision (AP), or intersection over union (IoU), can also be defined as a margin.

Negative log loss is commonly used in CRFs [231]. Note that minimizing the negative log loss is the same as maximizing the log probability of the data.

L_{NegativeLog}(x, y, w) = \log \sum_{y' \in \mathcal{Y}} \exp[w^T \Phi(x, y')] - w^T \Phi(x, y) \tag{4}

Perceptron loss is usually adopted in structured perceptron tasks [245] and is the same as hinge loss without the margin.

L_{Perceptron}(x, y, w) = \max_{y' \in \mathcal{Y}} [w^T \Phi(x, y') - w^T \Phi(x, y)] \tag{5}

Softmax-margin loss is one of the most popular loss functions in multi-output learning models such as SSVMs [246] and CRFs [247].

L_{SoftmaxMargin}(x, y, w) = \log \sum_{y' \in \mathcal{Y}} \exp[\Delta(y, y') + w^T \Phi(x, y')] - w^T \Phi(x, y) \tag{6}

Squared loss is a popular and convenient loss function that quadratically penalizes the difference between the ground truth and the prediction. It is commonly used in traditional single-output learning and can easily be extended to multi-output learning by summing the squared differences over all the outputs:

L_{Squared}(y, y') = (y - y')^2 \tag{7}

In multi-output learning, it is usually used with continuous-valued outputs or continuous intermediate results before converting them into discrete-valued outputs.
It is also commonly used in neural networks and boosting.
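The log-based surrogates (4) and (6) can likewise be evaluated directly when Y is small enough to enumerate. The toy scores and margins below are ours, with scores standing in for w^T Φ(x, y'):

```python
import numpy as np

def negative_log_loss(scores, y_true):
    """log sum_y' exp(score(y')) - score(y_true), as in CRF training."""
    return np.log(np.exp(scores).sum()) - scores[y_true]

def softmax_margin_loss(scores, y_true, deltas):
    """log sum_y' exp(Delta(y_true, y') + score(y')) - score(y_true):
    negative log loss with a cost-augmented partition function."""
    return np.log(np.exp(scores + deltas).sum()) - scores[y_true]

scores = np.array([2.0, 1.0, 0.0])       # w^T Phi(x, y') per candidate y'
deltas = np.array([0.0, 1.0, 2.0])       # Delta(y_true, y'), zero at y_true
nll = negative_log_loss(scores, 0)
sml = softmax_margin_loss(scores, 0, deltas)
print(nll, sml)                          # softmax-margin upper-bounds it
```

Because the margin only inflates the scores of incorrect outputs inside the log-sum-exp, the softmax-margin loss is always at least the negative log loss, mirroring how hinge loss relates to perceptron loss.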
3) Efficient Algorithms:
Complex output structures significantly increase the burden on algorithms to formulate a model. Large-scale outputs, complex output dependencies, and/or complex loss functions can all be problematic. Therefore, several algorithms have been proposed specifically to tackle these challenges efficiently. Many leverage classic machine learning models so as to speed up the algorithms and alleviate the burden of complexity. The four most widely used classic models are based on k-nearest neighbors (kNN), decision trees, k-means, and hashing.

1) kNN-based methods are simple yet powerful machine learning models. Predictions are made based on the closest k instances to the test instance vector in terms of Euclidean distance. LMMO-kNN [248] is an SSVM-based model involving an exponential number of constraints w.r.t. the number of labels. This model imposes kNN constraints instantiated by the label vectors from neighboring examples to significantly reduce the training time and make rapid predictions.

2) Decision tree based methods [249], [250] learn a tree from the training data with a hierarchical output label space. They recursively partition the nodes until each leaf contains a small number of labels. Each novel data point is passed down the tree until it reaches a leaf. This method usually achieves logarithmic prediction time.

3) k-means based methods such as SLEEC [79] cluster the training data using k-means clustering. SLEEC learns a separate embedding per cluster and performs classification for a novel instance within its cluster alone. This significantly reduces the prediction time.

4) Hashing methods, such as co-hashing [251], [252] and DBPC [253], reduce the prediction time by using hashing on the input or the intermediate embedding space. Co-hashing learns an embedding space to preserve the semantic similarity structures between inputs and outputs. Compact binary representations are then generated for the learned embeddings for prediction efficiency.
DBPC jointly learnsa deep latent Hamming space and binary prototypeswhile capturing the latent nonlinear structures of thedata with an ANN. The learned Hamming space andbinary prototypes significantly decrease the predictioncomplexity and reduce memory/storage costs. C. Volume - Extreme Class Imbalances
Real-world multi-output applications rarely provide data with an equal number of training instances for all labels/classes. Data with far more instances in some classes than in others is imbalanced, and this is common in many applications. Traditional models learned from such data therefore tend to favor the majority classes. For example, in face generation, a trained model tends to generate the faces of famous people because there are many more images of celebrities than of other people. Although class imbalance has been studied extensively in the context of binary classification, it remains a challenge in multi-output learning, especially with extreme imbalances.

Many studies on multi-output learning either create a balanced dataset or ignore the problems introduced by imbalanced data. A natural way to balance class distributions is to resample the dataset. There are two main resampling techniques: undersampling and oversampling [254]. Undersampling methods down-size the majority classes; the NearMiss family of methods [255] is representative of this category. Oversampling methods, such as SMOTE and its variants [256], instead oversample the minority classes. However, all these resampling methods are mainly designed for single-output learning problems. Other techniques handle class imbalance in multi-output learning tasks with ANNs. For example, Dong et al. [257] combined incremental rectification of mini-batches with a deep neural network, in which a hard sample mining strategy minimizes the dominant effect of the majority classes by discovering the boundaries of sparsely-sampled minority classes. The methods in [258] and [259] both leveraged adversarial training with a re-weighting technique so that the majority classes have an impact similar to that of the minority classes.

D. Volume - Unseen Outputs
Traditional multi-output learning assumes that the output set in testing is the same as the one in training, i.e., the output labels of a testing instance have already appeared during training. However, this may not be true in real-world applications. For example, a newly emerging species cannot be detected by a classifier learned from existing species. Similarly, it is infeasible to recognize actions or events in a real-time video if no actions or events with the same labels appeared in the training video set. Nor could a coarse animal classifier provide details of the species of a detected animal, such as whether a dog is a labrador or a shepherd.

Depending on the complexity of the learning task, label annotation is usually very costly. In addition, the enormous growth in the number of labels not only leads to a high-dimensional output space, and hence computational inefficiency, but also makes supervised learning tasks challenging because some output labels are unseen until testing.
1) Zero-shot Multi-label Classification:
Multi-label classification is a typical multi-output learning problem. Multi-label classification problems can have various inputs, such as text, images, and video, depending on the application. The output for each input instance is usually a binary label vector indicating which labels are associated with the input, and the task is to learn a mapping from the input to the output. However, as the label space increases, it is common to encounter output labels during testing that never appeared in the training set. To study such cases, the zero-shot multi-class classification problem was first proposed in [17], [260]; most approaches leverage predefined semantic information, such as attributes [11] or word representations [13]. This technique was then extended to zero-shot multi-label classification, which assigns multiple unseen labels to an instance. Similarly, zero-shot multi-label learning leverages knowledge of the seen and unseen labels and models the relationships between the input features, label representations, and labels. For example, Gaure et al. [261] leverage the co-occurrence statistics of seen and unseen labels and model the label matrix and co-occurrence matrix jointly using a generative model. Rios and Kavuluru [262] and Lee et al. [263] incorporate knowledge graphs of the label relationships with neural networks.
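As a toy illustration of the embedding-based strategy above, unseen labels can be scored by the similarity between a projected input and label word vectors, so labels never seen in training remain predictable. The embeddings, label names, and threshold below are invented for illustration; real systems use learned projection functions and pretrained word vectors.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "word embeddings" for seen and unseen labels (hypothetical values).
label_emb = {
    "cat":   np.array([1.0, 0.1, 0.0, 0.2]),    # seen during training
    "dog":   np.array([0.9, 0.2, 0.1, 0.1]),    # seen during training
    "sea":   np.array([0.0, 1.0, 0.9, 0.0]),    # seen during training
    "tiger": np.array([0.95, 0.15, 0.05, 0.2]), # unseen at training time
}

def zero_shot_multilabel(x_emb, label_emb, threshold=0.9):
    """Assign every label whose embedding is close enough to the projected
    input, whether or not the label was seen during training."""
    return sorted(l for l, e in label_emb.items() if cosine(x_emb, e) >= threshold)

# An input whose projection lands in the cat/dog region of the space.
x = np.array([1.0, 0.15, 0.05, 0.15])
print(zero_shot_multilabel(x, label_emb))  # → ['cat', 'dog', 'tiger']
```

Note that the unseen label "tiger" is predicted purely because its embedding lies near the seen feline labels, which is exactly the transfer mechanism zero-shot methods rely on.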
2) Zero-shot Action Localization:
Similar to zero-shot classification, localizing human actions in videos without any training video examples is a challenging task. Inspired by zero-shot image classification, many studies on zero-shot action classification predict unseen actions from disjoint training actions based on prior knowledge of action-to-attribute mappings [264]–[266]. Such mappings are usually predefined, and the seen and unseen actions are linked through a description of the attributes. Thus, they can generalize to undefined actions but are unable to localize them. More recently, some works have been proposed to overcome this issue. Jain et al. [267] propose Objects2action, which uses no video data or action annotations; instead, it leverages the vast object annotations, images, and text descriptions available in open-source collections such as WordNet and ImageNet. Mettes and Snoek [268] subsequently enhanced Objects2action by considering the relationships between actors and objects.
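A much-simplified sketch of the object-based idea behind Objects2action: an action with no training videos is scored by how strongly its semantically related objects are detected in a segment. The action names, similarity values, and detector confidences below are all hypothetical; the real method derives affinities from word embeddings over WordNet terms.

```python
# Hypothetical semantic affinities between action names and object classes
# (in practice derived from word embeddings, not hand-set).
action_object_sim = {
    "kayaking":     {"paddle": 0.9, "water": 0.8, "ball": 0.0},
    "playing_golf": {"paddle": 0.1, "water": 0.1, "ball": 0.7},
}

def score_unseen_action(action, object_scores):
    """Score an action without any training videos by combining object
    detector confidences with action-object semantic affinity."""
    sims = action_object_sim[action]
    return sum(sims[obj] * s for obj, s in object_scores.items())

# Object detector confidences for one video segment (hypothetical).
detections = {"paddle": 0.8, "water": 0.9, "ball": 0.05}
best = max(action_object_sim, key=lambda a: score_unseen_action(a, detections))
print(best)  # → kayaking
```

Because the scoring is done per segment, sliding the same rule over a video gives a crude temporal localization of the unseen action as well.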
3) Open-set Recognition:
Traditional multi-output learning problems, including zero-shot multi-output learning, operate under a closed-set assumption, i.e., all the testing classes are known at training time, either through the training samples or because they are predefined in a semantic label space. However, Scheirer et al. [16] proposed a concept called open-set recognition to describe a scenario where unknown classes appear in testing. Open-set recognition presents a 1-vs-set machine to classify the known classes as well as deal with the unknown classes. In later studies [269], [270], they extended this idea to multi-class settings by formulating a compact abating probability model. Bendale and Boult [271] adapted ANNs for open-set recognition by proposing a new model layer that estimates the probability of an input belonging to an unknown class.

Fig. 5 illustrates the relationships between different levels of unseen outputs in multi-output learning. Open-set recognition is the most generalized problem of all. Few-shot and zero-shot learning have been studied with different multi-output learning problems, such as multi-label learning and event localization. However, open-set recognition has only been studied in conjunction with multi-class classification; other problems in the context of multi-output learning are still unexplored.

Fig. 5. Relationship among different levels of unseen outputs. All of these learning problems belong to multi-output learning.
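A minimal open-set-style decision rule (not the 1-vs-set machine or the compact abating probability model themselves): a closed-set classifier always picks a known class, whereas an open-set variant may answer "unknown" when confidence is too low. The classes, logits, and threshold are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def open_set_predict(logits, classes, threshold=0.7):
    """Reject the input as 'unknown' when the top softmax probability
    falls below a confidence threshold; otherwise behave closed-set."""
    p = softmax(np.asarray(logits, dtype=float))
    i = int(p.argmax())
    return classes[i] if p[i] >= threshold else "unknown"

classes = ["cat", "dog", "bird"]
print(open_set_predict([5.0, 0.2, 0.1], classes))  # confident → "cat"
print(open_set_predict([1.1, 1.0, 0.9], classes))  # ambiguous → "unknown"
```

Simple thresholding is known to be a weak baseline; the cited methods instead model how class probability should decay away from the training data, but the rejection interface is the same.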
E. Veracity - Noisy Output Labels
Almost all methods of label annotation introduce some amount of noise for various reasons: associations may be weak, the text may be ambiguous, or crowdsourced workers may not be domain experts, so labels may be incorrect [272]. Therefore, it is usually necessary to handle noisy outputs, such as missing, corrupt, incorrect, and/or partial labels, in real-world tasks.
1) Missing Labels:
Human annotators often annotate an image or document with the prominent labels but miss some of the less emphasized ones. Additionally, not all the objects in an image may be localized because there are, say, too many objects or the objects are too small. Social media platforms, such as Instagram, allow users to tag uploaded images, but the tags could relate to anything: the type of event, the person's mood, the weather. Moreover, no user is likely to tag every object or every aspect of an image. Directly using such labeled datasets in traditional multi-output learning models cannot guarantee the performance of the given tasks. Therefore, handling missing labels is necessary in real-world applications.

In early studies, missing labels were handled by treating them as negative labels [273]–[275], and the modeling tasks were then performed on the resulting fully-labeled dataset. However, this approach can introduce undesirable bias into the learning problem. A more widely-used method now is missing value imputation through matrix completion [186], [192], [276]. Most of these approaches are based on a low-rank assumption and, more recently, on label correlations, which improves learning performance [277], [278].
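A small sketch of the low-rank idea behind matrix-completion imputation, not any particular cited method: alternate between filling the missing entries from the current estimate and projecting onto the best rank-r approximation, keeping the observed labels fixed. The toy label matrix is invented so that the missing entry is recoverable from the rank-1 structure.

```python
import numpy as np

def complete_labels(Y, observed, rank=1, iters=100):
    """Impute missing entries of a label matrix under a low-rank assumption
    (hard-impute style alternation with a truncated SVD)."""
    X = np.where(observed, Y, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(observed, Y, low_rank)  # keep observed labels fixed
    return X

# Toy label matrix whose rows all follow one pattern (rank 1); the last
# entry is marked unobserved and is recovered from the shared structure.
Y = np.array([[1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])   # (2, 2) entry is actually missing, not negative
observed = np.array([[True, True, True],
                     [True, True, True],
                     [True, True, False]])
print(np.round(complete_labels(Y, observed), 2))
```

The missing cell converges to roughly 1.0, i.e., the imputation recovers the label that naive negative-label treatment would have gotten wrong.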
2) Incorrect Labels:
Many labels in a high-dimensional output space are non-informative or simply wrong [279]. This is especially common with annotations from crowdsourcing platforms that hire non-expert workers. Labeled datasets from social media networks are also often less than useful. A basic approach for handling incorrect labels is to simply remove those samples [280], [281]. That said, it is frequently difficult to detect which samples have been mislabeled. Therefore, designing multi-output learning algorithms that learn from noisy datasets is of great practical importance.

Existing multi-output learning methods handling noisy labels generally fall into two groups. The first group is based on building robust loss functions [282]–[284], which modify the labels in the loss function to alleviate the effect of noise. The second group models latent labels and learns the transition from the latent to the noisy labels [285]–[287].
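A sketch of the latent-label/transition idea in its simplest form (a forward-corrected loss, not any specific cited method): given a matrix T with T[i, j] = P(noisy label j | true label i), the cross-entropy is computed on the predicted noisy-label distribution rather than on the clean one. T and the logits below are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward_corrected_nll(logits, noisy_label, T):
    """Cross-entropy against the *noisy* label, computed on clean-class
    probabilities pushed through the transition matrix T."""
    p_clean = softmax(np.asarray(logits, dtype=float))
    p_noisy = p_clean @ np.asarray(T, dtype=float)  # distribution over noisy labels
    return float(-np.log(p_noisy[noisy_label]))

# Two classes; 30% of true class-0 samples are recorded as class 1.
T = [[0.7, 0.3],
     [0.0, 1.0]]

# A model confident in class 0 is punished far less for an observed label 1
# under the corrected loss, because T makes that flip plausible.
print(forward_corrected_nll([3.0, 0.0], 1, T))
print(-np.log(softmax(np.array([3.0, 0.0]))[1]))  # uncorrected loss is larger
```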
3) Partial Labels:
A special case of incorrect labels is partial labels [288]–[290], where each training instance is associated with a set of candidate labels, only one of which is correct. This is a common problem in real-world applications. For example, a photograph might contain many faces and a caption listing who is in the photo, but the names are not matched to the faces. Many methods for learning from partial labels have been developed to recover the ground-truth labels from a candidate set [291], [292]. However, most are based on the assumption of exactly one ground truth per instance, which may not hold under every label annotation method. When multiple workers on a crowdsourcing platform annotate a dataset, the final annotations are usually gathered as the union of the annotations of all the workers, so each instance might be associated with multiple relevant as well as irrelevant labels. Hence, Xie and Huang [293] developed a new learning framework, partial multi-label learning (PML), that relaxes this assumption by leveraging the data structure information to optimize a confidence-weighted rank loss. Fig. 6 summarizes all the scenarios with noisy output labels, including multi-label learning, missing labels, incorrect labels, partial label learning, and partial multi-label learning.
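A toy sketch of candidate-set disambiguation, not the PML method of [293]: each instance starts with uniform confidence over its candidate labels, and confidences are repeatedly averaged with those of nearby instances, re-masked to the candidate set, so that a label shared across neighbours accumulates mass. All data below are invented.

```python
import numpy as np

def disambiguate(X, candidates, n_labels, iters=50):
    """Partial-label disambiguation by confidence averaging among the two
    nearest neighbours of each instance (a minimal illustrative scheme)."""
    n = len(X)
    F = np.zeros((n, n_labels))
    for i, cs in enumerate(candidates):
        F[i, cs] = 1.0 / len(cs)              # uniform over candidate labels
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2) + np.eye(n) * 1e9
    nbrs = np.argsort(d, axis=1)[:, :2]       # two nearest neighbours
    for _ in range(iters):
        G = F.copy()
        for i, cs in enumerate(candidates):
            mixed = 0.5 * F[i] + 0.5 * F[nbrs[i]].mean(axis=0)
            mask = np.zeros(n_labels)
            mask[cs] = 1.0                    # confidence stays on candidates
            mixed *= mask
            G[i] = mixed / mixed.sum()
        F = G
    return F.argmax(axis=1)

# Three nearby instances; each candidate set holds the true label 0 plus a
# distractor, and 0 is the only label all three share.
X = np.array([[0.0], [0.1], [0.2]])
cands = [[0, 1], [0, 2], [0, 3]]
print(disambiguate(X, cands, n_labels=4))  # all three agree on label 0
```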
F. Velocity - Changes in Output Distribution
Many real-world applications must deal with data streams, where data arrives continuously and possibly endlessly. In these cases, the output distributions can change over time, or concept drift can occur. Streaming data is common in surveillance [98], driver route prediction [95], demand forecasting [97], and many other applications. Take visual tracking [294] in surveillance video as an example, where the video stream is potentially endless. Data streams arrive at high velocity as the video keeps generating consecutive frames, and the goal is to detect, identify, and locate events or objects in the video. The learning model must therefore adapt to possible concept drift while working with limited memory.

Existing multi-output learning methods model changes in the output distribution by updating the learning system each time new stream data arrives. The update method might be ensemble-based [295]–[299] or ANN-based [294], [300]. Other strategies to handle concept drift include assuming a fading effect on past data [298]; maintaining a change detector on predictive performance measurements and recalibrating models accordingly [297], [301]; and using stochastic gradient descent to update the network and accommodate new data streams with an ANN [294]. Notably, k nearest neighbor (kNN) is one of the most classic frameworks for handling multi-output problems, but its inefficiency prevents it from being directly adapted to changing output distributions. Many online hashing and online quantization based methods [302], [303] have been proposed to improve the efficiency of kNN while accommodating the changing output distribution.
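A minimal sketch of the SGD-style stream update mentioned above, not the ANN method of [294]: the model is refreshed on every arriving sample, so it tracks a target that drifts mid-stream instead of fitting one fixed distribution. The linear model and synthetic stream are illustrative.

```python
import numpy as np

def stream_sgd(stream, dim, lr=0.2):
    """Online least-squares update: adjust the weights after every
    incoming (x, y) pair so the model follows concept drift."""
    w = np.zeros(dim)
    for x, y in stream:
        w += lr * (y - w @ x) * x  # gradient step on (y - w.x)^2 / 2
    return w

rng = np.random.default_rng(0)
xs = rng.normal(size=(400, 2))
# Concept drift: the true mapping flips halfway through the stream.
ys = np.concatenate([xs[:200] @ np.array([1.0, -1.0]),
                     xs[200:] @ np.array([-2.0, 0.5])])
w = stream_sgd(zip(xs, ys), dim=2)
print(np.round(w, 2))  # tracks the post-drift concept, roughly [-2.0, 0.5]
```

A batch model fit once on the whole stream would instead average the two concepts; the per-sample update is what lets the learner forget the pre-drift mapping.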
G. Other Challenges
Any two of the aforementioned challenges can be combined to form a more complex challenge. For example, noisy labels and unseen outputs can be combined to form an open-set noisy label problem [304]. The combination of noisy labels and extreme output dimensions is also worthy of further study and exploration [206]. Changes in the output distribution together with noisy labels result in online time-series prediction problems with missing values [305], while changing distributions combined with dynamic label sets (unseen outputs) lead to open-world recognition problems with incremental labels [306]. A changing output distribution with extreme class imbalances creates the common problem of streaming data with simultaneous concept drift and class imbalance [18], [307]. Moreover, the combination of complex output structures with a changing output distribution is also frequent in real-world applications [308].

Fig. 6. Range of noisy labels in multi-label classification. Training may be: multi-label (a sample is associated with multiple labels), missing-label (a sample has an incomplete label assignment), incorrect-label (a sample has at least one incorrect label and possibly an incomplete label assignment), partial-label (each sample has multiple labels, only one of which is correct), or partial multi-label (each sample has multiple labels, at least one of which is correct). A line connecting a label with a sample indicates that the sample is associated with that label. A label in red represents a correct label for the sample; a label in a gray box represents an incorrect label.
H. Open Challenges

1) Output Label Interpretation:
There are different ways to represent output labels, and each expresses label information from a specific perspective. Taking label tags as an output, for example, binary attributed output embeddings represent which attributes the input relates to, hierarchical label output embeddings convey the hierarchical structure of the inputs, and semantic word output embeddings reflect the semantic relationships between the outputs. Each exhibits a certain level of human interpretability. Hence, an emerging approach to label embedding is to incorporate different label information from multiple perspectives and rich contexts to enhance interpretability [309]. This is a challenging undertaking because it is quite difficult to appropriately model the interdependencies between outputs in a way that humans can easily interpret and understand. For example, an image of a centaur is expected to be described with semantic labels like horse and person, and also with attributes like head, arm, tail, etc. As such, appropriately modeling the relationships between inputs and outputs with rich interpretations of the labels is an open challenge that should be explored in future studies.
2) Output Heterogeneity:
As the demand for sophisticated decision making increases, so does the demand for outputs with more complex structures. Returning to the example of surveillance, traditional approaches to people re-identification usually consist of two steps: people detection, followed by re-identification of the detected person. These are essentially two separate tasks that need to be learned together if performance is to be enhanced. Several researchers have recently attempted this demanding challenge, i.e., building a model that can simultaneously learn multiple tasks with different outputs. Mousavian et al. [310] undertook joint people detection in tandem with re-identification, while Van Ranst et al. [311] tackled image segmentation with depth estimation. However, more exploration and investigation is needed to overcome this challenge. As an example, one worthy undertaking would be to answer the question: can we simultaneously learn the representation of a new user in a social network as well as their potential links to existing users?

V. CONCLUSION
Multi-output learning has attracted significant attention over the last decade. This paper provides a comprehensive review of the study of multi-output learning using the 4 Vs as a frame. We explore the characteristics of the multi-output learning paradigm beginning with the life cycle of the output labels, and we emphasize the issues associated with each step of the learning process. In addition, we provide an overview of the types of outputs, their structures, selected problem definitions, common model evaluation metrics, and the popular data repositories used in experiments, with representative works referenced throughout. The paper concludes with a discussion of the challenges caused by the 4 Vs and some future research directions that are worthy of further study.

REFERENCES

[1] M. Zhang and Z. Zhou, "A review on multi-label learning algorithms," TKDE, vol. 26, no. 8, pp. 1819–1837, 2014.
[2] C. Gong, D. Tao, J. Yang, and W. Liu, "Teaching-to-learn and learning-to-teach for multi-label propagation," in AAAI, 2016, pp. 1610–1616.
[3] C. Bielza, G. Li, and P. Larrañaga, "Multi-dimensional classification with bayesian networks," Int. J. Approx. Reasoning, vol. 52, no. 6, pp. 705–727, 2011.
[4] S. Vembu and T. Gärtner, "Label ranking algorithms: A survey," in Preference Learning, 2010, pp. 45–64. [Online]. Available: https://doi.org/10.1007/978-3-642-14125-6_3
[5] H. Borchani, G. Varando, C. Bielza, and P. Larrañaga, "A survey on multi-output regression," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
[6] G. Widmer and M. Kubat, "Learning in the presence of concept drift and hidden contexts," Machine Learning, vol. 23, no. 1, pp. 69–101, 1996.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," IJCV, vol. 115, no. 3, pp. 211–252, 2015.
[8] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," IJCV, vol. 77, no. 1-3, pp. 157–173, 2008.
[9] P. Stenetorp, S. Pyysalo, G. Topic, T. Ohta, S. Ananiadou, and J. Tsujii, "brat: a web-based tool for NLP-assisted text annotation," in EACL, 2012.
[10] G. Eryigit, F. S. Çetin, M. Yanik, T. Temel, and I. Çiçekli, "TURKSENT: A sentiment annotation tool for social media," in LAW@ACL, 2013, pp. 131–134.
[11] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in CVPR, 2009, pp. 951–958.
[12] M. Rohrbach, M. Stark, and B. Schiele, "Evaluating knowledge transfer and zero-shot learning in a large-scale setting," in CVPR, 2011, pp. 1641–1648.
[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NIPS, 2013, pp. 3111–3119.
[14] S. C. Deerwester, S. T. Dumais, G. W. Furnas, R. A. Harshman, T. K. Landauer, K. E. Lochbaum, and L. A. Streeter, "Computer information retrieval using latent semantic structure," 1989.
[15] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015, pp. 2048–2057.
[16] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, "Toward open set recognition," TPAMI, vol. 35, no. 7, pp. 1757–1772, 2013.
[17] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell, "Zero-shot learning with semantic output codes," in NIPS, 2009, pp. 1410–1418.
[18] T. R. Hoens, R. Polikar, and N. V. Chawla, "Learning from streaming data with concept drift and imbalance: an overview," Progress in AI, vol. 1, no. 1, pp. 89–101, 2012.
[19] L. Cai and T. Hofmann, "Hierarchical document categorization with support vector machines," in CIKM, 2004, pp. 78–87.
[20] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in CVPR, 2010, pp. 3485–3492.
[21] G. Qi, X. Hua, Y. Rui, J. Tang, T. Mei, and H. Zhang, "Correlative multi-label video annotation," in ACM Multimedia, 2007, pp. 17–26.
[22] S. Dzeroski, D. Demsar, and J. Grbovic, "Predicting chemical parameters of river water quality from bioindicator data," Appl. Intell., vol. 13, no. 1, pp. 7–17, 2000.
[23] H. Aras and N. Aras, "Forecasting residential natural gas demand," Energy Sources, vol. 26, no. 5, pp. 463–472, 2004.
[24] H. Li, W. Zhang, Y. Chen, Y. Guo, G.-Z. Li, and X. Zhu, "A novel multi-target regression framework for time-series prediction of drug efficacy," Scientific Reports, vol. 7, p. 40652, 2017.
[25] X. Geng and Y. Xia, "Head pose estimation based on multivariate label distribution," in CVPR, 2014, pp. 1837–1842.
[26] X. Geng, K. Smith-Miles, and Z. Zhou, "Facial age estimation by learning from label distributions," in AAAI, 2010.
[27] D. Zhou, X. Zhang, Y. Zhou, Q. Zhao, and X. Geng, "Emotion distribution learning from texts," in EMNLP, 2016, pp. 638–647.
[28] K. Crammer and Y. Singer, "A family of additive online algorithms for category ranking," JMLR, vol. 3, pp. 1025–1058, 2003.
[29] J. Ko, E. Nyberg, and L. Si, "A probabilistic graphical model for joint answer ranking in question answering," in SIGIR, 2007, pp. 343–350.
[30] S. S. Bucak, P. K. Mallapragada, R. Jin, and A. K. Jain, "Efficient multi-label ranking for multi-class learning: Application to object recognition," in ICCV, 2009, pp. 2098–2105.
[31] Y. Liu, E. P. Xing, and J. G. Carbonell, "Predicting protein folds with structural repeats using a chain graph model," in ICML, 2005, pp. 513–520.
[32] P. Koehn, "Europarl: A parallel corpus for statistical machine translation," in MT Summit, vol. 5, 2005, pp. 79–86.
[33] K. Shaalan, "A survey of arabic named entity recognition and classification," Computational Linguistics, vol. 40, no. 2, pp. 469–510, 2014.
[34] A. Newell and J. Deng, "Pixels to graphs by associative embedding," in NIPS, 2017, pp. 2168–2177.
[35] B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, "Max-margin parsing," in EMNLP, 2004.
[36] D. Liben-Nowell and J. M. Kleinberg, "The link-prediction problem for social networks," JASIST, vol. 58, no. 7, pp. 1019–1031, 2007.
[37] S. C. Park, M. K. Park, and M. G. Kang, "Super-resolution image reconstruction: a technical overview," IEEE Signal Processing Magazine, vol. 20, no. 3, pp. 21–36, 2003.
[38] L. Yang, S. Chou, and Y. Yang, "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," in ISMIR, 2017, pp. 324–331.
[39] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. C. Jain, "Content-based image retrieval at the end of the early years," TPAMI, vol. 22, no. 12, pp. 1349–1380, 2000.
[40] C. H. Lau, Y. Li, and D. Tjondronegoro, "Microblog retrieval using topical features and query expansion," in TREC, 2011.
[41] N. Maria and M. J. Silva, "Theme-based retrieval of web news," in SIGIR, 2000, pp. 354–356.
[42] M. K. Choong, M. Charbit, and H. Yan, "Autoregressive-model-based missing value estimation for DNA microarray time series data," IEEE Transactions on Information Technology in Biomedicine, vol. 13, no. 1, pp. 131–137, 2009.
[43] A. Azadeh, S. Ghaderi, and S. Sohrabkhani, "Annual electricity consumption forecasting by neural network in high energy consuming industrial sectors," Energy Conversion and Management, vol. 49, no. 8, pp. 2272–2278, 2008.
[44] X. Wang, X. Ma, and W. E. L. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical bayesian models," TPAMI, vol. 31, no. 3, pp. 539–555, 2009.
[45] D. Shen, J. Sun, H. Li, Q. Yang, and Z. Chen, "Document summarization using conditional random fields," in IJCAI, 2007, pp. 2862–2867.
[46] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, "Recurrent topic-transition GAN for visual paragraph generation," in ICCV, 2017, pp. 3382–3391.
[47] A. Graves, A. Mohamed, and G. E. Hinton, "Speech recognition with deep recurrent neural networks," in ICASSP, 2013, pp. 6645–6649.
[48] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in ICML, 2016, pp. 1060–1069.
[49] J. Gauthier, "Conditional generative adversarial nets for convolutional face generation," Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, vol. 2014, no. 5, p. 2, 2014.
[50] J. Johnson, R. Krishna, M. Stark, L. Li, D. A. Shamma, M. S. Bernstein, and F. Li, "Image retrieval using scene graphs," in CVPR, 2015, pp. 3668–3678.
[51] J. Johnson, A. Gupta, and L. Fei-Fei, "Image generation from scene graphs," in CVPR, 2018, pp. 1219–1228.
[52] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei, "Visual genome: Connecting language and vision using crowdsourced dense image annotations," IJCV, vol. 123, no. 1, pp. 32–73, 2017.
[53] D. Zhang, H. Fu, J. Han, and F. Wu, "A review of co-saliency detection technique: Fundamentals, applications, and challenges," CoRR, vol. abs/1604.07090, 2016.
[54] A. Joulin, F. R. Bach, and J. Ponce, "Discriminative clustering for image co-segmentation," in CVPR, 2010, pp. 1943–1950.
[55] S. Y. Bao, Y. Xiang, and S. Savarese, "Object co-detection," in ECCV. Springer, 2012, pp. 86–101.
[56] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large margin methods for structured and interdependent output variables," JMLR, vol. 6, pp. 1453–1484, 2005.
[57] H. Liu, J. Cai, and Y. Ong, "Remarks on multi-output gaussian process regression," Knowledge-Based Systems, vol. 144, pp. 102–121, 2018.
[58] X. Geng, "Label distribution learning," TKDE, vol. 28, no. 7, pp. 1734–1748, 2016.
[59] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos, "Supervised learning of semantic classes for image annotation and retrieval," TPAMI, vol. 29, no. 3, pp. 394–410, 2007.
[60] A. S. Weigend, Time Series Prediction: Forecasting the Future and Understanding the Past. Routledge, 2018.
[61] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele, "Evaluation of output embeddings for fine-grained image classification," in CVPR, 2015, pp. 2927–2936.
[62] R. Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.
[63] S. Thrun and J. O'Sullivan, "Clustering learning tasks and the selective cross-task transfer of knowledge," in Learning to Learn. Springer, 1998, pp. 235–257.
[64] Q. Mao, I. W.-H. Tsang, and S. Gao, "Objective-guided image annotation," IEEE Transactions on Image Processing, vol. 22, no. 4, pp. 1585–1597, 2012.
[65] M. A. Tahir, J. Kittler, and F. Yan, "Inverse random under sampling for class imbalance problem and its application to multi-label classification," Pattern Recognition, vol. 45, no. 10, pp. 3738–3750, 2012.
[66] R. Alejo, V. García, and J. H. Pacheco-Sánchez, "An efficient over-sampling approach based on mean square error back-propagation for dealing with the multi-class imbalance problem," Neural Processing Letters, vol. 42, no. 3, pp. 603–617, 2015. [Online]. Available: https://doi.org/10.1007/s11063-014-9376-3
[67] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014, pp. 2672–2680.
[68] I. O. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf, "AdaGAN: Boosting generative models," in NIPS, 2017, pp. 5424–5433.
[69] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, "A kernel two-sample test," JMLR, vol. 13, pp. 723–773, 2012.
[70] V. Khrulkov and I. Oseledets, "Geometry score: A method for comparing generative adversarial networks," arXiv preprint arXiv:1802.02664, 2018.
[71] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in NIPS, 2016, pp. 2234–2242.
[72] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, "Mode regularized generative adversarial networks," arXiv preprint arXiv:1612.02136, 2016.
[73] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local nash equilibrium," in NIPS, 2017, pp. 6626–6637.
[74] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, "Demystifying MMD GANs," arXiv preprint arXiv:1801.01401, 2018.
[75] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, "Are GANs created equal? A large-scale study," in NeurIPS, 2018, pp. 698–707.
[76] J. J. McAuley and J. Leskovec, "Hidden factors and hidden topics: understanding rating dimensions with review text," in Seventh ACM Conference on Recommender Systems, 2013, pp. 165–172.
[77] J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, "Image-based recommendations on styles and substitutes," in SIGIR, 2015, pp. 43–52.
[78] J. J. McAuley, R. Pandey, and J. Leskovec, "Inferring networks of substitutable and complementary products," in KDD, 2015, pp. 785–794.
[79] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain, "Sparse local embeddings for extreme multi-label classification," in NIPS, 2015, pp. 730–738.
[80] A. Zubiaga, "Enhancing navigation on wikipedia with social tags," CoRR, vol. abs/1202.5469, 2012.
[81] R. Wetzker, C. Zimmermann, and C. Bauckhage, "Analyzing social bookmarking systems: A del.icio.us cookbook," in Proceedings of the ECAI 2008 Mining Social Data Workshop, 2008, pp. 26–30.
[82] Y. Prabhu and M. Varma, "FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning," in KDD, 2014, pp. 263–272.
[83] I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artières, G. Paliouras, É. Gaussier, I. Androutsopoulos, M. Amini, and P. Gallinari, "LSHTC: A benchmark for large-scale text classification," CoRR, vol. abs/1503.08581, 2015.
[84] Y. Li, C. Huang, C. C. Loy, and X. Tang, "Human attribute recognition by deep hierarchical contexts," in ECCV, 2016, pp. 684–700.
[85] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in ICCV, 2015, pp. 3730–3738.
[86] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in CVPR, 2016, pp. 1096–1104.
[87] Q. Chen, J. Huang, R. S. Feris, L. M. Brown, J. Dong, and S. Yan, "Deep domain adaptation for describing people based on fine-grained clothing attributes," in CVPR, 2015, pp. 5315–5324.
[88] A. Farhadi, I. Endres, D. Hoiem, and D. A. Forsyth, "Describing objects by their attributes," in CVPR, 2009, pp. 1778–1785.
[89] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky, "Exploiting hierarchical context on a large database of object categories," in CVPR, 2010, pp. 129–136.
[90] G. Marques, M. A. Domingues, T. Langlois, and F. Gouyon, "Three current issues in music autotagging," in ISMIR, 2011, pp. 795–800.
[91] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 dataset," Tech. Rep., 2011.
[92] V. Jouhet, G. Defossez, A. Burgun, P. Le Beux, P. Levillain, P. Ingrand, V. Claveau et al., "Automated classification of free-text pathology reports for registration of incident cases of cancer," Methods of Information in Medicine, vol. 51, no. 3, p. 242, 2012.
[93] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, p. 160035, 2016.
[94] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," JMLR, vol. 5, pp. 361–397, 2004.
[95] "Kaggle data set ECML/PKDD 15: Taxi trajectory prediction (1)."
[96] M. Piorkowski, N. Sarafijanovic-Djukic, and M. Grossglauser, "CRAWDAD data set epfl/mobility (v. 2009-02-24)," Feb. 2009. [Online]. Available: http://crawdad.org/epfl/mobility/
[97] A. Trindade, "UCI machine learning repository - ElectricityLoadDiagrams20112014 data set," 2016. [Online]. Available: http://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
[98] S. Oh, A. Hoogs, A. G. A. Perera, N. P. Cuntoor, C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. S. Davis, E. Swears, X. Wang, Q. Ji, K. K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. K. Roy-Chowdhury, and M. Desai, "A large-scale benchmark dataset for event recognition in surveillance video," in CVPR, 2011, pp. 3153–3160.
[99] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman, "The pascal visual object classes (VOC) challenge," IJCV, vol. 88, no. 2, pp. 303–338, 2010.
[100] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.
[101] G. A. Miller, "WordNet: A lexical database for English," Communications of the ACM
Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, 2000, pp. 127–132.
[106] D. Zhou, J. C. Platt, S. Basu, and Y. Mao, "Learning from the wisdom of crowds by minimax entropy," in NIPS, 2012, pp. 2204–2212.
[107] K. Lee, X. He, L. Zhang, and L. Yang, "CleanNet: Transfer learning for scalable image classifier training with label noise," in
CVPR , 2018,pp. 5447–5456.[108] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning frommassive noisy labeled data for image classification,” in
CVPR , 2015,pp. 2691–2699.[109] W. Li, L. Wang, W. Li, E. Agustsson, and L. V. Gool, “Webvisiondatabase: Visual learning and understanding from web data,”
CoRR ,vol. abs/1708.02862, 2017.[110] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland,D. Borth, and L. Li, “The new data and new challenges in multimediaresearch,”
CoRR , vol. abs/1503.01817, 2015.[111] C. K¨ummerle and J. Sigl, “Harmonic mean iteratively reweighted leastsquares for low-rank matrix recovery,”
JMLR , vol. 19, 2018.[112] W. Liu and I. W. Tsang, “Making decision trees feasible in ultrahighfeature and label dimensions,”
JMLR , vol. 18, pp. 81:1–81:36, 2017.[113] C. Dupuy and F. Bach, “Online but accurate inference for latent variablemodels with local gibbs sampling,”
JMLR , vol. 18, pp. 126:1–126:45,2017.[114] W. Liu, I. W. Tsang, and K. M¨uller, “An easy-to-hard learning paradigmfor multiple classes and multiple labels,”
JMLR , vol. 18, pp. 94:1–94:38, 2017.[115] C. Brouard, M. Szafranski, and F. d’Alch´e-Buc, “Input output kernelregression: Supervised and semi-supervised structured output predic-tion with operator-valued kernels,”
JMLR , vol. 17, pp. 176:1–176:48,2016.[116] H. Shin, L. Lu, L. Kim, A. Seff, J. Yao, and R. M. Summers,“Interleaved text/image deep mining on a large-scale radiology databasefor automated image interpretation,”
JMLR , vol. 17, pp. 107:1–107:31,2016.[117] R. Babbar, I. Partalas, ´E. Gaussier, M. Amini, and C. Amblard,“Learning taxonomy adaptation in large-scale classification,”
JMLR ,vol. 17, pp. 98:1–98:37, 2016. [118] X. Li, T. Zhao, X. Yuan, and H. Liu, “The flare package for highdimensional linear regression and precision matrix estimation in R,” JMLR , vol. 16, pp. 553–557, 2015.[119] F. Han, H. Lu, and H. Liu, “A direct estimation of high dimensionalstationary vector autoregressions,”
JMLR , vol. 16, pp. 3115–3150,2015.[120] J. R. Doppa, A. Fern, and P. Tadepalli, “Structured prediction via outputspace search,”
JMLR , vol. 15, no. 1, pp. 1317–1350, 2014.[121] D. Colombo and M. H. Maathuis, “Order-independent constraint-basedcausal structure learning,”
JMLR , vol. 15, no. 1, pp. 3741–3782, 2014.[122] C. Gentile and F. Orabona, “On multilabel classification and rankingwith bandit feedback,”
JMLR , vol. 15, no. 1, pp. 2451–2487, 2014.[123] P. Gong, J. Ye, and C. Zhang, “Multi-stage multi-task feature learning,”
JMLR , vol. 14, no. 1, pp. 2979–3010, 2013.[124] A. Talwalkar, S. Kumar, M. Mohri, and H. A. Rowley, “Large-scaleSVD and manifold learning,”
JMLR , vol. 14, no. 1, pp. 3129–3152,2013.[125] K. Fu, J. Li, J. Jin, and C. Zhang, “Image-text surgery: Efficient conceptlearning in image captioning by generating pseudopairs,”
TNNLS ,vol. 29, no. 12, pp. 5910–5921, 2018.[126] E. Protas, J. D. Bratti, J. F. O. Gaya, P. Drews, and S. S. C. Botelho,“Visualization methods for image transformation convolutional neuralnetworks,”
TNNLS , 2018.[127] H. Zhang, S. Wang, X. Xu, T. W. S. Chow, and Q. M. J. Wu,“Tree2vector: Learning a vectorial representation for tree-structureddata,”
TNNLS , vol. 29, no. 11, pp. 5304–5318, 2018.[128] Z. Lin, G. Ding, J. Han, and L. Shao, “End-to-end feature-awarelabel space encoding for multilabel classification with many classes,”
TNNLS , vol. 29, no. 6, pp. 2472–2487, 2018.[129] S. Fang, J. Li, Y. Tian, T. Huang, and X. Chen, “Learning discriminativesubspaces on random contrasts for image saliency analysis,”
TNNLS ,vol. 28, no. 5, pp. 1095–1108, 2017.[130] K. Zhang, D. Tao, X. Gao, X. Li, and J. Li, “Coarse-to-fine learning forsingle-image super-resolution,”
TNNLS , vol. 28, no. 5, pp. 1109–1122,2017.[131] M. Kim, “Mixtures of conditional random fields for improved struc-tured output prediction,”
TNNLS , vol. 28, no. 5, pp. 1233–1240, 2017.[132] Y. Cheung, M. Li, Q. Peng, and C. L. P. Chen, “A cooperative learning-based clustering approach to lip segmentation without knowing seg-ment number,”
TNNLS , vol. 28, no. 1, pp. 80–93, 2017.[133] L. Wang, L. Liu, and L. Zhou, “A graph-embedding approach tohierarchical visual word mergence,”
TNNLS , vol. 28, no. 2, pp. 308–320, 2017.[134] Z. Li, Z. Lai, Y. Xu, J. Yang, and D. Zhang, “A locality-constrained andlabel embedding dictionary learning algorithm for image classification,”
TNNLS , vol. 28, no. 2, pp. 278–293, 2017.[135] X. Luo, M. Zhou, S. Li, Z. You, Y. Xia, and Q. Zhu, “A nonnegativelatent factor model for large-scale sparse matrices in recommendersystems via alternating direction method,”
TNNLS , vol. 27, no. 3, pp.579–592, 2016.[136] C. Deng, J. Xu, K. Zhang, D. Tao, X. Gao, and X. Li, “Similarityconstraints-based structured output regression machine: An approachto image super-resolution,”
TNNLS , vol. 27, no. 12, pp. 2472–2485,2016.[137] A. Alush and J. Goldberger, “Hierarchical image segmentation usingcorrelation clustering,”
TNNLS , vol. 27, no. 6, pp. 1358–1367, 2016.[138] D. Tao, J. Cheng, M. Song, and X. Lin, “Manifold ranking-based matrixfactorization for saliency detection,”
TNNLS , vol. 27, no. 6, pp. 1122–1134, 2016.[139] F. Cao, M. Cai, Y. Tan, and J. Zhao, “Image super-resolution viaadaptive p (0 < p <
1) regularization and sparse representation,”
TNNLS ,vol. 27, no. 7, pp. 1550–1561, 2016.[140] Q. Zhu, L. Shao, X. Li, and L. Wang, “Targeting accurate objectextraction from an image: A comprehensive study of natural imagematting,”
TNNLS , vol. 26, no. 2, pp. 185–207, 2015.[141] Y. Chen, Y. Ma, D. H. Kim, and S. Park, “Region-based objectrecognition by color segmentation using a simplified PCNN,”
TNNLS ,vol. 26, no. 8, pp. 1682–1697, 2015.[142] M. Li, W. Bi, J. T. Kwok, and B. Lu, “Large-scale nystr¨om kernelmatrix approximation using randomized SVD,”
TNNLS , vol. 26, no. 1,pp. 152–164, 2015.[143] J. Yu, X. Gao, D. Tao, X. Li, and K. Zhang, “A unified learningframework for single image super-resolution,”
TNNLS , vol. 25, no. 4,pp. 780–792, 2014.[144] A. Bauer, N. G¨ornitz, F. Biegler, K. M¨uller, and M. Kloft, “Efficientalgorithms for exact inference in sequence labeling svms,”
TNNLS ,vol. 25, no. 5, pp. 870–881, 2014. [145] K. Tang, R. Liu, Z. Su, and J. Zhang, “Structure-constrained low-rankrepresentation,”
TNNLS , vol. 25, no. 12, pp. 2167–2179, 2014.[146] Y. Deng, Q. Dai, R. Liu, Z. Zhang, and S. Hu, “Low-rank structurelearning via nonconvex heuristic recovery,”
TNNLS , vol. 24, no. 3, pp.383–396, 2013.[147] H. Zhang, Q. M. J. Wu, and T. M. Nguyen, “Incorporating meantemplate into finite mixture model for image segmentation,”
TNNLS ,vol. 24, no. 2, pp. 328–335, 2013.[148] Y. Pang, Z. Ji, P. Jing, and X. Li, “Ranking graph embedding forlearning to rerank,”
TNNLS , vol. 24, no. 8, pp. 1292–1303, 2013.[149] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, and Y. Wen, “Multiview vector-valued manifold regularization for multilabel image classification,”
TNNLS , vol. 24, no. 5, pp. 709–722, 2013.[150] B. Zhang, D. Xiong, and J. Su, “Neural machine translation with deepattention,”
TPAMI , 2018.[151] S. Jeong, J. Lee, B. Kim, Y. Kim, and J. Noh, “Object segmentationensuring consistency across multi-viewpoint images,”
TPAMI , vol. 40,no. 10, pp. 2455–2468, 2018.[152] C. Raposo, M. Antunes, and J. P. Barreto, “Piecewise-planar stere-oscan: Sequential structure and motion using plane primitives,”
TPAMI ,vol. 40, no. 8, pp. 1918–1931, 2018.[153] M. Cordts, T. Rehfeld, M. Enzweiler, U. Franke, and S. Roth, “Tree-structured models for efficient multi-cue scene labeling,”
TPAMI ,vol. 39, no. 7, pp. 1444–1454, 2017.[154] Y. Xu, E. Carlinet, T. G´eraud, and L. Najman, “Hierarchical segmen-tation using tree-based shape spaces,”
TPAMI , vol. 39, no. 3, pp. 457–469, 2017.[155] K. Fu, J. Jin, R. Cui, F. Sha, and C. Zhang, “Aligning where to see andwhat to tell: Image captioning with region-based attention and scene-specific contexts,”
TPAMI , vol. 39, no. 12, pp. 2321–2334, 2017.[156] M. A. Hasnat, O. Alata, and A. Tr´emeau, “Joint color-spatial-directional clustering and region merging (JCSD-RM) for unsupervisedRGB-D image segmentation,”
TPAMI , vol. 38, no. 11, pp. 2255–2268,2016.[157] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-basedconvolutional networks for accurate object detection and segmentation,”
TPAMI , vol. 38, no. 1, pp. 142–158, 2016.[158] Z. Qin and C. R. Shelton, “Social grouping for multi-target trackingand head pose estimation in video,”
TPAMI , vol. 38, no. 10, pp. 2082–2095, 2016.[159] Y. Kwon, K. I. Kim, J. Tompkin, J. H. Kim, and C. Theobalt, “Efficientlearning of image super-resolution and compression artifact removalwith semi-local gaussian processes,”
TPAMI , vol. 37, no. 9, pp. 1792–1805, 2015.[160] A. Djelouah, J. Franco, E. Boyer, F. L. Clerc, and P. P´erez, “Sparsemulti-view consistency for object segmentation,”
TPAMI , vol. 37, no. 9,pp. 1890–1903, 2015.[161] S. Wang, Y. Wei, K. Long, X. Zeng, and M. Zheng, “Image super-resolution via self-similarity learning and conformal sparse represen-tation,”
TPAMI .[162] H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua, “Multi-commoditynetwork flow for tracking multiple people,”
TPAMI , vol. 36, no. 8, pp.1614–1627, 2014.[163] N. Zhou and J. Fan, “Jointly learning visually correlated dictionariesfor large-scale visual recognition applications,”
TPAMI , vol. 36, no. 4,pp. 715–730, 2014.[164] P. Perakis, G. Passalis, T. Theoharis, and I. A. Kakadiaris, “3d faciallandmark detection under large yaw and expression variations,”
TPAMI ,vol. 35, no. 7, pp. 1552–1564, 2013.[165] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantiza-tion: A procrustean approach to learning binary codes for large-scaleimage retrieval,”
TPAMI , vol. 35, no. 12, pp. 2916–2929, 2013.[166] K. G. Dizaji, X. Wang, and H. Huang, “Semi-supervised generativeadversarial network for gene expression inference,” in
KDD , 2018, pp.1435–1444.[167] M. Lee, B. Gao, and R. Zhang, “Rare query expansion throughgenerative adversarial networks in search advertising,” in
KDD , 2018,pp. 500–508.[168] I. E. Yen, X. Huang, W. Dai, P. Ravikumar, I. S. Dhillon, and E. P.Xing, “Ppdsparse: A parallel primal-dual sparse method for extremeclassification,” in
KDD , 2017, pp. 545–553.[169] Y. Tagami, “Annexml: Approximate nearest neighbor search for ex-treme multi-label classification,” in
KDD , 2017, pp. 455–464.[170] H. Jain, Y. Prabhu, and M. Varma, “Extreme multi-label loss functionsfor recommendation, tagging, ranking & other missing label applica-tions,” in
KDD , 2016, pp. 935–944. [171] C. Xu, D. Tao, and C. Xu, “Robust extreme multi-label learning,” in KDD , 2016, pp. 1275–1284.[172] C. Kuo, X. Wang, P. B. Walker, O. T. Carmichael, J. Ye, and I. David-son, “Unified and contrasting cuts in multiple graphs: Application tomedical imaging segmentation,” in
KDD , 2015, pp. 617–626.[173] C. Papagiannopoulou, G. Tsoumakas, and I. Tsamardinos, “Discov-ering and exploiting deterministic label relationships in multi-labellearning,” in
KDD , 2015, pp. 915–924.[174] B. Wu, E. Zhong, B. Tan, A. Horner, and Q. Yang, “Crowdsourcedtime-sync video tagging using temporal and personalized topic model-ing,” in
KDD , 2014, pp. 721–730.[175] S. Zhai, T. Xia, and S. Wang, “A multi-class boosting method withdirect optimization,” in
KDD , 2014, pp. 273–282.[176] X. Kong, B. Cao, and P. S. Yu, “Multi-label classification by mininglabel and instance correlations from heterogeneous information net-works,” in
KDD , 2013, pp. 614–622.[177] X. Wang and G. Sukthankar, “Multi-label relational neighbor classifi-cation using social context features,” in
KDD , 2013, pp. 464–472.[178] S. Hong, X. Yan, T. S. Huang, and H. Lee, “Learning hierarchicalsemantic image manipulation through structured representations,” in
NIPS , 2018, pp. 2713–2723.[179] M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, andK. Dembczynski, “A no-regret generalization of hierarchical softmaxto extreme multi-label classification,” in
NIPS , 2018, pp. 6358–6368.[180] B. Pan, Y. Yang, H. Li, Z. Zhao, Y. Zhuang, D. Cai, and X. He,“Macnet: Transferring knowledge from machine comprehension tosequence-to-sequence models,” in
NIPS , 2018, pp. 6095–6105.[181] E. Racah, C. Beckham, T. Maharaj, S. E. Kahou, Prabhat, and C. Pal,“Extremeweather: A large-scale climate dataset for semi-superviseddetection, localization, and understanding of extreme weather events,”in
NIPS , 2017, pp. 3405–3416.[182] Y. Hu, J. Huang, and A. G. Schwing, “Maskrnn: Instance level videoobject segmentation,” in
NIPS , 2017, pp. 324–333.[183] B. Joshi, M. Amini, I. Partalas, F. Iutzeler, and Y. Maximov, “Aggres-sive sampling for multi-class to binary reduction with applications totext classification,” in
NIPS , 2017, pp. 4162–4171.[184] J. Nam, E. Loza Menc´ıa, H. J. Kim, and J. F¨urnkranz, “Maximizingsubset accuracy with recurrent neural networks in multi-label classifi-cation,” in
NIPS , 2017, pp. 5419–5429.[185] N. Rosenfeld and A. Globerson, “Optimal tagging with markov chainoptimization,” in
NIPS , 2016, pp. 1307–1315.[186] H. Yu, N. Rao, and I. S. Dhillon, “Temporal regularized matrixfactorization for high-dimensional time series prediction,” in
NIPS ,2016, pp. 847–855.[187] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled samplingfor sequence prediction with recurrent neural networks,” in
NIPS , 2015,pp. 1171–1179.[188] P. Rai, C. Hu, R. Henao, and L. Carin, “Large-scale bayesian multi-label learning via topic-based label embeddings,” in
NIPS , 2015, pp.3222–3230.[189] A. Wu, M. Park, O. Koyejo, and J. W. Pillow, “Sparse bayesianstructure learning with dependent relevance determination priors,” in
NIPS , 2014, pp. 1628–1636.[190] V. Nguyen, J. L. Boyd-Graber, P. Resnik, and J. Chang, “Learning aconcept hierarchy from multi-labeled documents,” in
NIPS , 2014, pp.3671–3679.[191] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. B.Girshick, T. Darrell, and K. Saenko, “LSDA: large scale detectionthrough adaptation,” in
NIPS , 2014, pp. 3536–3544.[192] M. Xu, R. Jin, and Z. Zhou, “Speedup matrix completion with sideinformation: Application to multi-label learning,” in
NIPS , 2013, pp.2301–2309.[193] M. Ciss´e, N. Usunier, T. Arti`eres, and P. Gallinari, “Robust bloomfilters for large multilabel classification tasks,” in
NIPS , 2013, pp.1851–1859.[194] W. Siblini, F. Meyer, and P. Kuntz, “Craftml, an efficient clustering-based random forest for extreme multi-label learning,” in
ICML , 2018,pp. 4671–4680.[195] I. E. Yen, S. Kale, F. X. Yu, D. N. Holtmann-Rice, S. Kumar, andP. Ravikumar, “Loss decomposition for fast learning in large outputspaces,” in
ICML , 2018, pp. 5626–5635.[196] J. Wehrmann, R. Cerri, and R. C. Barros, “Hierarchical multi-labelclassification networks,” in
ICML , 2018, pp. 5225–5234.[197] S. Si, H. Zhang, S. S. Keerthi, D. Mahajan, I. S. Dhillon, and C. Hsieh,“Gradient boosted decision trees for high dimensional sparse output,”in
ICML , 2017, pp. 3182–3190. [198] V. Jain, N. Modhe, and P. Rai, “Scalable generative models for multi-label learning with missing labels,” in
ICML , 2017, pp. 1636–1644.[199] T. Zhang and Z. Zhou, “Multi-class optimal margin distribution ma-chine,” in
ICML , 2017, pp. 4063–4071.[200] C. Li, B. Wang, V. Pavlu, and J. A. Aslam, “Conditional bernoullimixtures for multi-label classification,” in
ICML , 2016, pp. 2482–2491.[201] I. E. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. S. Dhillon, “Pd-sparse : A primal and dual sparse approach to extreme multiclass andmultilabel classification,” in
ICML , 2016, pp. 3069–3077.[202] M. Ciss´e, M. Al-Shedivat, and S. Bengio, “ADIOS: architectures deepin output space,” in
ICML , 2016, pp. 2770–2779.[203] D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. S. Dhillon, “Pref-erence completion: Large-scale collaborative ranking from pairwisecomparisons,” in
ICML , 2015, pp. 1907–1916.[204] D. Hern´andez-Lobato, J. M. Hern´andez-Lobato, and Z. Ghahramani,“A probabilistic model for dirty multi-task feature selection,” in
ICML ,2015, pp. 1073–1082.[205] Z. Huang, R. Wang, S. Shan, X. Li, and X. Chen, “Log-euclidean met-ric learning on symmetric positive definite manifold with applicationto image set classification,” in
ICML , 2015, pp. 720–729.[206] H. Yu, P. Jain, P. Kar, and I. S. Dhillon, “Large-scale multi-labellearning with missing labels,” in
ICML , 2014, pp. 593–601.[207] Z. Lin, G. Ding, M. Hu, and J. Wang, “Multi-label classification viafeature-aware implicit label space encoding,” in
ICML , 2014, pp. 325–333.[208] Y. Li and R. S. Zemel, “High order regularization for semi-supervisedlearning of structured output problems,” in
ICML , 2014, pp. 1368–1376.[209] W. Bi and J. T. Kwok, “Efficient multi-label classification with manylabels,” in
ICML , 2013, pp. 405–413.[210] R. Takhanov and V. Kolmogorov, “Inference algorithms for pattern-based crfs on sequence data,” in
ICML , 2013, pp. 145–153.[211] M. Xiao and Y. Guo, “Domain adaptation for sequence labeling taskswith a probabilistic language adaptation model,” in
ICML , 2013, pp.293–301.[212] H. Brighton and C. Mellish, “Advances in instance selection forinstance-based learning algorithms,”
Data mining and knowledge dis-covery , vol. 6, no. 2, pp. 153–172, 2002.[213] Y. Zhai, Y. Ong, and I. W. Tsang, “The emerging ?big dimensionality?”
IEEE Computational Intelligence Magazine , vol. 9, no. 3, pp. 14–26,2014.[214] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee,“Learning what and where to draw,” in
NIPS , 2016, pp. 217–225.[215] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,”in
NIPS , pp. 2672–2680.[216] H. Zhang, T. Xu, and H. Li, “Stackgan: Text to photo-realistic imagesynthesis with stacked generative adversarial networks,” in
ICCV , 2017,pp. 5908–5916.[217] Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image syn-thesis with a hierarchically-nested adversarial network,”
CoRR , vol.abs/1802.09178, 2018.[218] W. Fedus, I. J. Goodfellow, and A. M. Dai, “Maskgan: Better textgeneration via filling in the ,”
CoRR , vol. abs/1801.07736, 2018.[219] Y. Chen and H. Lin, “Feature-aware label space dimension reductionfor multi-label classification,” in
NIPS , 2012, pp. 1538–1546.[220] D. J. Hsu, S. Kakade, J. Langford, and T. Zhang, “Multi-label predic-tion via compressed sensing,” in
NIPS , 2009, pp. 772–780.[221] F. Tai and H. Lin, “Multilabel classification with principal label spacetransformation,”
Neural Computation , vol. 24, no. 9, pp. 2508–2542,2012.[222] A. Kapoor, R. Viswanathan, and P. Jain, “Multilabel classification usingbayesian compressed sensing,” in
NIPS , 2012, pp. 2654–2662.[223] P. Mineiro and N. Karampatziakis, “Fast label embeddings via random-ized linear algebra,” in
Joint European conference on machine learningand knowledge discovery in databases , 2015, pp. 37–51.[224] Y. Jernite, A. Choromanska, and D. Sontag, “Simultaneous learningof trees and representations for extreme classification and densityestimation,” in
ICML , 2017, pp. 1665–1674.[225] J. Liu, W. Chang, Y. Wu, and Y. Yang, “Deep learning for extrememulti-label text classification,” in
SIGIR , 2017, pp. 115–124.[226] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-labelscene classification,”
Pattern Recognition , vol. 37, no. 9, pp. 1757–1771, 2004.[227] O. Maimon and L. Rokach, Eds.,
Data Mining and Knowledge Dis-covery Handbook, 2nd ed . Springer, 2010. [228] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chains formulti-label classification,” in ECML PKDD , 2009, pp. 254–269.[229] J. Read, B. Pfahringer, G. Holmes, and E. Frank, “Classifier chainsfor multi-label classification,”
Machine Learning , vol. 85, no. 3, pp.333–359, 2011.[230] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, “Largemargin methods for structured and interdependent output variables,”
JMLR , vol. 6, pp. 1453–1484, 2005.[231] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional randomfields: Probabilistic models for segmenting and labeling sequence data,”in
ICML , 2001, pp. 282–289.[232] G. Tsoumakas and I. P. Vlahavas, “Random k -labelsets: An ensemblemethod for multilabel classification,” in ECML , 2007, pp. 406–417.[233] K. Dembczynski, W. Cheng, and E. H¨ullermeier, “Bayes optimalmultilabel classification via probabilistic classifier chains,” in
ICML ,2010, pp. 279–286.[234] T. Joachims, T. Finley, and C. J. Yu, “Cutting-plane training ofstructural svms,”
Machine Learning , vol. 77, no. 1, pp. 27–59, 2009.[235] S. Baker and A. Korhonen, “Initializing neural networks for hierarchi-cal multi-label text classification,” in
BioNLP , 2017, pp. 307–315.[236] S. Chen, C. Zhang, M. Dong, J. Le, and M. Rao, “Using ranking-cnnfor age estimation,” in
CVPR , 2017, pp. 742–751.[237] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learningwith neural networks,” in
Advances in Neural Information ProcessingSystems 27: Annual Conference on Neural Information ProcessingSystems 2014, December 8-13 2014, Montreal, Quebec, Canada , 2014,pp. 3104–3112.[238] C. Smith and Y. Jin, “Evolutionary multi-objective generation ofrecurrent neural network ensembles for time series prediction,”
Neuro-computing , vol. 143, pp. 302–311, 2014.[239] F. J. Och, “Minimum error rate training in statistical machine transla-tion,” in
Association for Computational Linguistics , 2003, pp. 160–167.[240] W. Gao and Z.-H. Zhou, “On the consistency of multi-label learning,”in
Proceedings of the 24th annual conference on learning theory , 2011,pp. 341–358.[241] A. Tewari and P. L. Bartlett, “On the consistency of multiclassclassification methods,”
JMLR , vol. 8, pp. 1007–1025, 2007.[242] D. A. McAllester and J. Keshet, “Generalization bounds and consis-tency for latent structural probit and ramp loss,” in
NIPS , 2011, pp.2205–2212.[243] B. Taskar, C. Guestrin, and D. Koller, “Max-margin markov networks,”in
NIPS , 2003, pp. 25–32.[244] Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A support vectormethod for optimizing average precision,” in
SIGIR , 2007, pp. 271–278.[245] M. Collins, “Discriminative training methods for hidden markov mod-els: Theory and experiments with perceptron algorithms,” in
EmpiricalMethods in Natural Language Processing , 2002.[246] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon,and K. Visweswariah, “Boosted MMI for model and feature-spacediscriminative training,” in
ICASSP , 2008, pp. 4057–4060.[247] K. Gimpel and N. A. Smith, “Softmax-margin crfs: Training log-linearmodels with cost functions,” in
HLT-NAACL , 2010, pp. 733–736.[248] W. Liu, D. Xu, I. Tsang, and W. Zhang, “Metric learning for multi-output tasks,”
TPAMI , 2018, doi:10.1109/TPAMI.2018.2794976.[249] J. Deng, S. Satheesh, A. C. Berg, and F. Li, “Fast and balanced:Efficient label tree learning for large scale object recognition,” in
NIPS ,2011, pp. 567–575.[250] T. Gao and D. Koller, “Discriminative learning of relaxed hierarchy forlarge-scale visual recognition,” in
ICCV , 2011, pp. 2072–2079.[251] X. Shen, W. Liu, I. W. Tsang, Q. Sun, and Y. Ong, “Compact multi-label learning,” in
AAAI , 2018, pp. 4066–4073.[252] X. Shen, W. Liu, I. W. Tsang, Q. Sun, and Y. Ong, “Multilabelprediction via cross-view search,”
TNNLS , vol. 29, no. 9, pp. 4324–4338, 2018.[253] X. Shen, W. Liu, Y. Luo, Y. Ong, and I. W. Tsang, “Deep discreteprototype multilabel learning,” in
IJCAI , 2018, pp. 2675–2681.[254] A. More, “Survey of resampling techniques for improving classificationperformance in unbalanced datasets,” arXiv preprint arXiv:1608.06048 ,2016.[255] I. Mani and I. Zhang, “knn approach to unbalanced data distributions:a case study involving information extraction,” in
Proceedings ofworkshop on learning from imbalanced datasets , vol. 126, 2003.[256] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,“SMOTE: synthetic minority over-sampling technique,”
JAIR , vol. 16,pp. 321–357, 2002. [257] Q. Dong, S. Gong, and X. Zhu, “Imbalanced deep learning by minorityclass incremental rectification,”
CoRR , vol. abs/1804.10851, 2018.[258] M. Rezaei, H. Yang, and C. Meinel, “Multi-task generative adver-sarial network for handling imbalanced clinical data,”
CoRR , vol.abs/1811.10419, 2018.[259] E. Montahaei, M. Ghorbani, M. S. Baghshah, and H. R. Ra-biee, “Adversarial classifier for imbalanced problems,”
CoRR , vol.abs/1811.08812, 2018.[260] B. Romera-Paredes and P. H. S. Torr, “An embarrassingly simpleapproach to zero-shot learning,” in
ICML , 2015, pp. 2152–2161.[261] A. Gaure, A. Gupta, V. K. Verma, and P. Rai, “A probabilisticframework for zero-shot multi-label learning,” in
The Conference onUncertainty in Artificial Intelligence (UAI) , vol. 1, 2017, p. 3.[262] A. Rios and R. Kavuluru, “Few-shot and zero-shot multi-label learningfor structured label spaces,” in
Conference on Empirical Methods inNatural Language Processing , 2018, pp. 3132–3142.[263] C. Lee, W. Fang, C. Yeh, and Y. F. Wang, “Multi-label zero-shotlearning with structured knowledge graphs,” in
CVPR , 2018, pp. 1576–1585.[264] C. Gan, T. Yang, and B. Gong, “Learning attributes equals multi-sourcedomain generalization,” in
CVPR , 2016, pp. 87–97.[265] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions byattributes,” in
CVPR , 2011, pp. 3337–3344.[266] Z. Zhang, C. Wang, B. Xiao, W. Zhou, and S. Liu, “Robust relativeattributes for human action recognition,”
Pattern Analysis and Appli-cations , vol. 18, no. 1, pp. 157–171, 2015.[267] M. Jain, J. C. van Gemert, T. Mensink, and C. G. M. Snoek,“Objects2action: Classifying and localizing actions without any videoexample,” in
ICCV , 2015, pp. 4588–4596.[268] P. Mettes and C. G. M. Snoek, “Spatial-aware object embeddings forzero-shot localization and classification of actions,” in
ICCV , 2017, pp.4453–4462.[269] W. J. Scheirer, L. P. Jain, and T. E. Boult, “Probability models for openset recognition,”
TPAMI , vol. 36, no. 11, pp. 2317–2324, 2014.[270] L. P. Jain, W. J. Scheirer, and T. E. Boult, “Multi-class open setrecognition using probability of inclusion,” in
ECCV , 2014, pp. 393–409.[271] A. Bendale and T. E. Boult, “Towards open set deep networks,” in
CVPR , 2016, pp. 1563–1572.[272] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li, “Learning fromnoisy labels with distillation,” in
ICCV , 2017, pp. 1928–1936.[273] G. Chen, Y. Song, F. Wang, and C. Zhang, “Semi-supervised multi-label learning by solving a sylvester equation,” in
ICDM , 2008, pp.410–419.[274] Y. Sun, Y. Zhang, and Z. Zhou, “Multi-label learning with weak label,”in
AAAI , 2010.[275] S. S. Bucak, R. Jin, and A. K. Jain, “Multi-label learning withincomplete class assignments,” in
CVPR , 2011, pp. 2801–2808.[276] R. S. Cabral, F. D. la Torre, J. P. Costeira, and A. Bernardino, “Matrixcompletion for multi-label image classification,” in
NIPS , 2011, pp.190–198.[277] W. Bi and J. T. Kwok, “Multilabel classification with label correlationsand missing labels,” in
AAAI , 2014, pp. 1680–1686.[278] H. Yang, J. T. Zhou, and J. Cai, “Improving multi-label learning withmissing labels by structured semantic correlations,” in
ECCV , 2016,pp. 835–851.[279] C. Gong, H. Zhang, J. Yang, and D. Tao, “Learning with inadequateand incorrect supervision,” in
ICDM , 2017, pp. 889–894.[280] R. Barandela and E. Gasca, “Decontamination of training samples forsupervised pattern recognition methods,” in
Joint IAPR InternationalWorkshops on SPR and SSPR , 2000, pp. 621–630.[281] C. E. Brodley and M. A. Friedl, “Identifying mislabeled training data,”
JAIR , vol. 11, pp. 131–167, 1999.[282] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache, “Learningvisual features from large weakly supervised data,” in
ECCV , 2016,pp. 67–84.[283] S. E. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, andA. Rabinovich, “Training deep neural networks on noisy labels withbootstrapping,”
CoRR, vol. abs/1412.6596, 2014.
[284] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. J. Belongie, "Learning from noisy large-scale datasets with minimal supervision," in CVPR, 2017, pp. 6575–6583.
[285] V. Mnih and G. E. Hinton, "Learning to label aerial images from noisy data," in ICML, 2012.
[286] I. Jindal, M. S. Nokleby, and X. Chen, "Learning deep networks from noisy labels with dropout regularization," in ICDM, 2016, pp. 967–972.
[287] J. Yao, J. Wang, I. W. Tsang, Y. Zhang, J. Sun, C. Zhang, and R. Zhang, "Deep learning from noisy image labels with quality embedding," CoRR, vol. abs/1711.00583, 2017.
[288] Y. Mao, G. Cheung, C. Lin, and Y. Ji, "Joint learning of similarity graph and image classifier from partial labels," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2016, pp. 1–4.
[289] J. Chai, I. W. Tsang, and W.-J. Chen, "Large margin partial label machine," TNNLS, to appear.
[290] C. Gong, T. Liu, Y. Tang, J. Yang, J. Yang, and D. Tao, "A regularization approach for instance-based superset label learning," TCYB, vol. 48, no. 3, pp. 967–978, 2018.
[291] F. Yu and M. Zhang, "Maximum margin partial label learning," Machine Learning, vol. 106, no. 4, pp. 573–593, 2017.
[292] M. Zhang, F. Yu, and C. Tang, "Disambiguation-free partial label learning," TKDE, vol. 29, no. 10, pp. 2155–2167, 2017.
[293] M. Xie and S. Huang, "Partial multi-label learning," in AAAI, 2018, pp. 4302–4309.
[294] H. Nam and B. Han, "Learning multi-domain convolutional neural networks for visual tracking," in CVPR, 2016, pp. 4293–4302.
[295] S. Avidan, "Ensemble tracking," TPAMI, vol. 29, no. 2, 2007.
[296] W. Qu, Y. Zhang, J. Zhu, and Q. Qiu, "Mining multi-label concept-drifting data streams using dynamic classifier ensemble," in ACML, 2009, pp. 308–321.
[297] A. Bifet, G. Holmes, B. Pfahringer, R. Kirkby, and R. Gavaldà, "New ensemble methods for evolving data streams," in KDD, 2009, pp. 139–148.
[298] X. Kong and P. S. Yu, "An ensemble-based approach to fast classification of multi-label data streams," in International Conference on Collaborative Computing: Networking, Applications and Worksharing, 2011, pp. 95–104.
[299] A. Büyükçakir, H. R. Bonab, and F. Can, "A novel online stacked ensemble for multi-label stream classification," in CIKM, 2018, pp. 1063–1072.
[300] A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler, "Online multi-target tracking using recurrent neural networks," in AAAI, 2017, pp. 4225–4232.
[301] J. Read, A. Bifet, G. Holmes, and B. Pfahringer, "Scalable and efficient multi-label classification for evolving data streams," Machine Learning, vol. 88, no. 1-2, pp. 243–272, 2012.
[302] L. Huang, Q. Yang, and W. Zheng, "Online hashing," TNNLS, vol. 29, no. 6, pp. 2309–2322, 2018.
[303] D. Xu, I. W. Tsang, and Y. Zhang, "Online product quantization," TKDE, vol. 30, no. 11, pp. 2185–2198, 2018.
[304] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia, "Iterative learning with open-set noisy labels," in CVPR, 2018, pp. 8688–8696.
[305] O. Anava, E. Hazan, and A. Zeevi, "Online time series prediction with missing data," in ICML, 2015, pp. 2191–2199.
[306] A. Bendale and T. E. Boult, "Towards open world recognition," in CVPR, 2015, pp. 1893–1902.
[307] E. S. Xioufis, M. Spiliopoulou, G. Tsoumakas, and I. P. Vlahavas, "Dealing with concept drift and class imbalance in multi-label stream classification," in IJCAI, 2011, pp. 1583–1588.
[308] Z. Ren, M. Peetz, S. Liang, W. van Dolen, and M. de Rijke, "Hierarchical multi-label classification of social text streams," in SIGIR, 2014, pp. 213–222.
[309] Y. Shi, D. Xu, Y. Pan, I. W. Tsang, and S. Pan, "Label embedding with partial heterogeneous contexts," in AAAI, 2019, pp. 4926–4933.
[310] A. Mousavian, H. Pirsiavash, and J. Kosecka, "Joint semantic segmentation and depth estimation with deep convolutional networks," in , 2016, pp. 611–619.
[311] W. Van Ranst, F. De Smedt, J. Berte, T. Goedemé, and T.-Z. Robo-Vision, "Fast simultaneous people detection and re-identification in a single shot network," in AVSS, 2018.
Donna Xu received the BCST (Honours) in computer science from the University of Sydney in 2014, and the PhD degree from the Centre for Artificial Intelligence, FEIT, University of Technology Sydney, NSW, Australia. Her research interests include multiclass classification, online hashing and information retrieval.
Yaxin Shi received her M.E. degree in computer science from Ocean University of China in 2017. She is currently pursuing a Ph.D. degree under the supervision of Prof. Ivor W. Tsang at the Centre for Artificial Intelligence, University of Technology Sydney, Australia. Her research interests include multi-view learning, structure learning and deep generative networks.
Ivor W. Tsang received his PhD degree from the Hong Kong University of Science and Technology in 2007. He is a professor with the University of Technology Sydney. He is also the research director of the UTS Priority Research Centre for Artificial Intelligence. He was conferred the 2008 Natural Science Award (Class II), the Australian Research Council Future Fellowship in 2013, the IEEE Transactions on Neural Networks Outstanding 2004 Paper Award in 2007, the 2014 IEEE Transactions on Multimedia Prize Paper Award, and the Best Student Paper Award at CVPR 2010. He serves as AE for IEEE Transactions on Emerging Topics in Computational Intelligence, IEEE Transactions on Big Data and Neurocomputing. He also serves as Area Chair/SPC for NeurIPS, AAAI and IJCAI.
Yew Soon Ong received the PhD degree for his work on Artificial Intelligence in complex design from the University of Southampton, UK in 2003. He is a President Chair Professor of Computer Science at the Nanyang Technological University (NTU), and holds the position of Chief Artificial Intelligence Scientist at the Agency for Science, Technology and Research in Singapore. At NTU, he serves as Director of the Singtel-NTU Cognitive & Artificial Intelligence Joint Lab. His research interest lies in artificial and computational intelligence. He is the founding Editor-in-Chief of the IEEE Transactions on Emerging Topics in Computational Intelligence and AE of IEEE Transactions on Neural Networks & Learning Systems, the IEEE Transactions on Cybernetics, and others.
Chen Gong (M'16) received his dual doctoral degree from Shanghai Jiao Tong University (SJTU) and University of Technology Sydney (UTS) in 2016 and 2017, respectively. Currently, he is a full professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests mainly include machine learning, data mining, and learning-based vision problems. He has published more than 60 technical papers at prominent journals and conferences such as IEEE T-PAMI, IEEE T-NNLS, IEEE T-IP, IEEE T-CYB, IEEE T-CSVT, IEEE T-MM, IEEE T-ITS, ACM T-IST, NeurIPS, CVPR, AAAI, IJCAI, ICDM, etc. He also serves as a reviewer for more than 20 international journals such as AIJ, IEEE T-PAMI, IEEE T-NNLS, IEEE T-IP, and as an SPC/PC member of several top-tier conferences such as ICML, NeurIPS, AAAI, IJCAI, ICDM, AISTATS, etc. He received the Excellent Doctoral Dissertation award from Shanghai Jiao Tong University (SJTU) and the Chinese Association for Artificial Intelligence (CAAI). He was also enrolled in the Young Elite Scientists Sponsorship Program of Jiangsu Province and the China Association for Science and Technology.