Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs
Eleftherios Spyromitros-Xioufis, Grigorios Tsoumakas, William Groves, Ioannis Vlahavas
Machine Learning manuscript No. (will be inserted by the editor)
Accepted: 15 January 2016
Abstract
In many practical applications of supervised learning the task involves the prediction of multiple target variables from a common set of input variables. When the prediction targets are binary the task is called multi-label classification, while when the targets are continuous the task is called multi-target regression. In both tasks, target variables often exhibit statistical dependencies and exploiting them in order to improve predictive accuracy is a core challenge. A family of multi-label classification methods addresses this challenge by building a separate model for each target on an expanded input space where other targets are treated as additional input variables. Despite the success of these methods in the multi-label classification domain, their applicability and effectiveness in multi-target regression have not been studied until now. In this paper, we introduce two new methods for multi-target regression, called Stacked Single-Target and Ensemble of Regressor Chains, by adapting two popular multi-label classification methods of this family. Furthermore, we highlight an inherent problem of these methods, a discrepancy of the values of the additional input variables between training and prediction, and develop extensions that use out-of-sample estimates of the target variables during training in order to tackle this problem. The results of an extensive experimental evaluation carried out on a large and diverse collection of datasets show that, when the discrepancy is appropriately mitigated, the proposed methods attain consistent improvements over the independent regressions baseline. Moreover, two versions of Ensemble of Regressor Chains perform significantly better than four state-of-the-art methods including regularization-based multi-task learning methods and a multi-objective random forest approach.
Keywords
Multi-target Regression · Multi-label Classification · Stacking · Chaining
The final publication is available at Springer via http://dx.doi.org/10.1007/s10994-016-5546-z.
E. Spyromitros-Xioufis · G. Tsoumakas · I. Vlahavas
Department of Informatics, Aristotle University of Thessaloniki, Greece
E-mail: [email protected], [email protected], [email protected]

W. Groves
Department of Computer Science and Engineering, University of Minnesota, USA
E-mail: [email protected]
Multi-target regression (MTR), also known as multivariate or multi-output regression, refers to the task of predicting multiple continuous variables using a common set of input variables. Such problems arise in various fields including ecological modeling (Kocev et al 2009; Dzeroski et al 2000) (e.g. predicting the abundance of plant species using water quality measurements), economics (Ghosn and Bengio 1996) (e.g. predicting stock prices from econometric variables) and energy (e.g. predicting energy production in solar/wind farms using historical measurements and weather forecast information). Given the importance and diversity of its applications, it is not surprising that research on this topic started as early as 40 years ago in Statistics (Izenman 1975).

Recently, a closely related task called multi-label classification (MLC) (Tsoumakas et al 2010; Zhang and Zhou 2014) has received increased attention from Machine Learning researchers. Similarly to MTR, MLC deals with the prediction of multiple variables using a common set of input variables. However, prediction targets in MLC are binary. In fact, the two tasks can be thought of as instances of the more general learning task of multi-target prediction where targets can be continuous, binary, ordinal, categorical or even of mixed type. The baseline approach of learning a separate model for each target applies to both MTR and MLC. Moreover, they share the same core challenge of exploiting dependencies between targets (in addition to dependencies between targets and inputs) in order to improve prediction accuracy, as acknowledged by researchers working on both tasks (e.g. Izenman 2008; Dembczynski et al 2012).
Despite their commonalities, MTR and MLC have typically been treated in isolation and only a few works (Blockeel et al 1998; Weston et al 2002; Teh et al 2005; Balasubramanian and Lebanon 2012) have given a general formulation of their key ideas, recognizing the dual applicability of their approaches.

Motivated by the tight connection between the two tasks, this paper looks at a family of MLC methods that, despite being almost directly applicable to MTR problems, have not been applied so far in this domain. In particular, we consider methods that decompose the MLC task into a series of binary classification tasks, one for each label. This category includes the typical one-versus-all or Binary Relevance approach that assumes label independence but also approaches that model label dependencies by building models that treat other labels as additional input variables (meta-inputs). In this work we adapt two popular methods of this kind (Godbole and Sarawagi 2004; Read et al 2011) for MTR, contributing two new MTR methods: Stacked Single-Target (SST) and Ensemble of Regressor Chains (ERC). Both methods have been very successful in the MLC domain and provided inspiration for many subsequent works (Cheng and Hüllermeier 2009; Dembczynski et al 2010; Kumar et al 2012; Read et al 2014).

Although the adaptation is trivial (as it basically consists of employing a regression instead of a binary classification algorithm to solve each single-target prediction task), it widens the applicability of existing approaches and increases our understanding of challenges shared by both learning tasks, such as the modeling of target dependencies. This kind of abstraction of key ideas from solutions tailored to related problems can sometimes offer additional advantages, such as improving the modularity and conceptual simplicity of learning techniques and avoiding reinvention of the same solutions (see the NIPS'11 workshop on relations among machine learning problems at http://rml.anu.edu.au/).

In addition to evaluating the direct adaptations of the corresponding MLC methods in the MTR domain, we also take a careful look at the treatment of targets as additional input variables and spot a shortcoming that was overlooked in the original MLC formulations of both methods. Specifically, we notice that in both methods the values of the meta-inputs are generated differently between training and prediction, causing a discrepancy that is shown to drastically downgrade their performance. To tackle this problem, we develop extended versions of the two methods that manage to decrease the discrepancy by using out-of-sample estimates of the targets during training. These estimates are obtained via an internal cross-validation methodology.

The performance of the proposed methods is comprehensively analyzed based on a large experimental study that includes 18 diverse real-world datasets, 14 of which are used for the first time in this paper and are made publicly available for future benchmarks.
The experimental results reveal that, affected by the discrepancy problem, the direct adaptations of the corresponding MLC methods fail to obtain better accuracy than the baseline approach that performs independent regressions. On the other hand, the extended versions obtain consistent improvements over the baseline, confirming the effectiveness of the proposed solution. Furthermore, extended versions of ERC obtain significantly better accuracy than state-of-the-art methods, including a method based on ensembles of multi-objective decision trees (Kocev et al 2007) and a recent regularization-based multi-task learning method (Jalali et al 2010, 2013). Moreover, it is shown that, compared to the rest of the methods, the extended versions of ERC are associated with the smallest risk of decreasing the accuracy of the baseline, an appealing property.

The rest of the paper is organized as follows: Section 2 presents the SST and ERC methods and describes the discrepancy problem and the proposed solution. Section 3 discusses related work from the MTR field, including well-known statistical procedures and multi-task learning methods, and points out differences with previous work on the discrepancy problem. The details of the experimental setup (method configuration, evaluation methodology, datasets) are given in Section 4 and Section 5 presents and discusses the experimental results. Finally, Section 6 offers our conclusion and outlines future work directions.
We first formally describe the MTR task and provide the notation that will be used subsequently for the description of the methods. Let $\mathbf{X}$ and $\mathbf{Y}$ be two random vectors where $\mathbf{X}$ consists of $d$ input variables $X_1, \dots, X_d$ and $\mathbf{Y}$ consists of $m$ target variables $Y_1, \dots, Y_m$. We assume that samples of the form $(\mathbf{x}, \mathbf{y})$ are generated i.i.d. by some source according to a joint probability distribution $P(\mathbf{X}, \mathbf{Y})$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}^m$ are the domains of $\mathbf{X}$ and $\mathbf{Y}$ and are often referred to as the input and the output space ($\mathcal{X} = \mathbb{R}^d$ is used only for the sake of brevity; the domain of the input variables can also be discrete). In a sample $(\mathbf{x}, \mathbf{y})$, $\mathbf{x} = [x_1, \dots, x_d]$ is the input vector and $\mathbf{y} = [y_1, \dots, y_m]$ is the output vector, which are realizations of $\mathbf{X}$ and $\mathbf{Y}$ respectively. Given a set $D = \{(\mathbf{x}^1, \mathbf{y}^1), \dots, (\mathbf{x}^n, \mathbf{y}^n)\}$ of $n$ training examples, the goal in MTR is to learn a model $h: \mathcal{X} \to \mathcal{Y}$ that, given an input vector $\mathbf{x}$, is able to predict an output vector $\hat{\mathbf{y}} = h(\mathbf{x})$ that best approximates the true output vector $\mathbf{y}$.

In the baseline Single-Target (ST) method, a multi-target model $h$ is comprised of $m$ single-target models $h_j: \mathcal{X} \to \mathbb{R}$, where each model $h_j$ is trained on a transformed training set $D_j = \{(\mathbf{x}^1, y^1_j), \dots, (\mathbf{x}^n, y^n_j)\}$ to predict the value of a single target variable $Y_j$. This way, target variables are modeled independently and no attempt is made to exploit potential dependencies between them. Despite the simplicity of the ST approach, several empirical studies (e.g. Luaces et al 2012) have shown that Binary Relevance, its MLC counterpart, often obtains comparable performance with more sophisticated MLC methods that model label dependencies, especially in cases where the underlying single-target prediction model is well fitted to the data (Dembczynski et al 2012; Read and Hollmén 2014; Read and Hollmén 2015). A theoretical explanation of these results was offered by Dembczynski et al (2012) who showed that modeling the marginal conditional distributions $P(Y_i | \mathbf{x})$ of the labels (as done by Binary Relevance) can be sufficient for getting good results in multi-label losses whose risk minimizers can be expressed in terms of marginal distributions (e.g. Hamming loss).

2.1 Stacked Single-Target

Stacked Single-Target (SST) is inspired by the Stacked Binary Relevance method (Godbole and Sarawagi 2004) where the idea of stacked generalization (Wolpert 1992) was applied in an MLC context. The training of SST consists of two stages. In the first stage, $m$ independent single-target models $h_j: \mathcal{X} \to \mathbb{R}$ are learned as in ST. However, instead of directly using these models for prediction, SST involves an additional training stage where a second set of $m$ meta models $h'_j: \mathcal{X} \times \mathbb{R}^m \to \mathbb{R}$ are learned, one for each target $Y_j$. Each meta model $h'_j$ is learned on a transformed training set $D'_j = \{(\mathbf{x}'^{1}, y^1_j), \dots, (\mathbf{x}'^{n}, y^n_j)\}$, where the original input vectors of the training examples ($\mathbf{x}^i$) have been augmented by estimates of the values of their target variables ($\hat{y}^i_1, \dots, \hat{y}^i_m$) to form expanded input vectors $\mathbf{x}'^{i} = [\mathbf{x}^i, \hat{y}^i_1, \dots, \hat{y}^i_m]$. These estimates are obtained by applying the first stage models to the examples of the training set.

To obtain predictions for an unknown instance $\mathbf{x}^q$, the first stage models are first applied and an output vector $\hat{\mathbf{y}}^q = [h_1(\mathbf{x}^q), \dots, h_m(\mathbf{x}^q)]$ is obtained. Then, the second stage models are applied on the transformed input vector $\mathbf{x}'^{q} = [\mathbf{x}^q, \hat{\mathbf{y}}^q]$ to produce the final output vector $\tilde{\mathbf{y}}^q = [h'_1(\mathbf{x}'^{q}), \dots, h'_m(\mathbf{x}'^{q})]$.
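As a rough illustration of the two-stage procedure described above, the following Python sketch trains SST with in-sample first-stage estimates and applies it to a query instance. The k-nearest-neighbour base learner `fit_knn`, the helper names (`column`, `train_sst`, `predict_sst`) and the toy data are assumptions made for this example and are not taken from the paper; any single-target regression algorithm could take the place of `fit_knn`.

```python
# Sketch of Stacked Single-Target (SST) using in-sample first-stage estimates.
# The base learner and all names below are illustrative assumptions.

def fit_knn(X, y, k=2):
    """Hypothetical base regressor: mean target of the k nearest training rows."""
    def predict(x):
        order = sorted(range(len(X)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
        nearest = order[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict


def column(Y, j):
    """Values of target j for all training examples."""
    return [row[j] for row in Y]


def train_sst(X, Y, fit=fit_knn):
    m = len(Y[0])
    # First stage: one independent model per target (this alone is the ST baseline).
    first = [fit(X, column(Y, j)) for j in range(m)]
    # Meta-inputs: first-stage estimates computed on the training examples themselves.
    X_aug = [list(x) + [h(x) for h in first] for x in X]
    # Second stage: one meta model per target on the expanded input space.
    second = [fit(X_aug, column(Y, j)) for j in range(m)]
    return first, second


def predict_sst(x, first, second):
    y_hat = [h(x) for h in first]            # first-stage output vector
    x_aug = list(x) + y_hat                  # expanded input vector
    return [h2(x_aug) for h2 in second]      # final output vector


# Toy data: one input variable, two targets.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]
first, second = train_sst(X, Y)
prediction = predict_sst([1.5], first, second)
```

Using only the `first` models for prediction corresponds to the ST baseline, so the same sketch can be used to compare the two approaches on a held-out set.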
The training and prediction procedures of SST are graphically illustrated in Figure 1.

2.2 Ensemble of Regressor Chains

Regressor Chains (RC) is derived from Classifier Chains (Read et al 2011), a recently proposed MLC method based on the idea of chaining binary models. The training of RC consists of selecting a random chain (permutation) of the set of target variables and then building a separate regression model for each target. Assuming that the chain $C = \{Y_1, Y_2, \dots, Y_m\}$ ($C$ represents an ordered set) is selected, the first model concerns the prediction of $Y_1$, has the form $h_1: \mathcal{X} \to \mathbb{R}$ and is the same as the model built by the ST method for this target. The difference in RC is that subsequent models $h_j$, $j > 1$, are trained on transformed training sets $D'_j = \{(\mathbf{x}'^{1}_{j}, y^1_j), \dots, (\mathbf{x}'^{n}_{j}, y^n_j)\}$, where the original input vectors of the training examples have been augmented by the actual values of all previous targets of the chain to form expanded input vectors $\mathbf{x}'^{i}_{j} = [x^i_1, \dots, x^i_d, y^i_1, \dots, y^i_{j-1}]$. Thus, the models built for targets $Y_j$ have the form $h_j: \mathcal{X} \times \mathbb{R}^{j-1} \to \mathbb{R}$.

Given such a chain of models, the output vector $\hat{\mathbf{y}}^q$ of an unknown instance $\mathbf{x}^q$ is obtained by sequentially applying the models $h_j$, thus $\hat{\mathbf{y}}^q = [h_1(\mathbf{x}^q), h_2(\mathbf{x}'^{q}_{2}), \dots, h_m(\mathbf{x}'^{q}_{m})]$ where $\mathbf{x}'^{q}_{j} = [x^q_1, \dots, x^q_d, \hat{y}^q_1, \dots, \hat{y}^q_{j-1}]$. Note that since the true values $y^q_1, \dots, y^q_{j-1}$ of the target variables are not available at prediction time, the method relies on estimates of these values obtained by applying the models $h_1, \dots, h_{j-1}$. The training and prediction procedures of RC are graphically illustrated in Figure 2.

Fig. 1: Graphical illustration of SST's training and prediction procedures.

Fig. 2: Graphical illustration of RC's training and prediction procedures.

One notable property of RC is that it is sensitive to the selected chain ordering. To alleviate this issue, Read et al (2011) proposed an ensemble scheme called Ensemble of Classifier Chains where a set of $k$ Classifier Chains models with different random chains are built on bootstrap samples of the training set and the final predictions come from majority voting. This scheme has been shown to consistently improve the accuracy of a single Classifier Chain in the classification domain. We apply the same idea to RC and compute the final predictions by taking the mean of the $k$ estimates for each target. The resulting method is called Ensemble of Regressor Chains (ERC).
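The chaining and ensembling steps described in this subsection can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the k-nearest-neighbour base learner `fit_knn`, the function names, the seed handling and the toy data are assumptions, and the bootstrap sampling simply follows the description of Ensemble of Classifier Chains given above.

```python
import random

# Sketch of Regressor Chains (RC) trained on actual target values and of the
# ERC ensemble. All names and the base learner are illustrative assumptions.

def fit_knn(X, y, k=2):
    """Hypothetical base regressor: mean target of the k nearest training rows."""
    def predict(x):
        order = sorted(range(len(X)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
        nearest = order[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict


def train_rc(X, Y, chain, fit=fit_knn):
    """Train one chain; `chain` is a permutation of the target indices."""
    models = []
    X_aug = [list(x) for x in X]
    for j in chain:
        # Copy the rows so that later appends do not alter this model's data.
        models.append(fit([row[:] for row in X_aug], [y[j] for y in Y]))
        # Later links see the actual value of target j as an extra input.
        for row, y in zip(X_aug, Y):
            row.append(y[j])
    return models


def predict_rc(x, chain, models):
    x_aug = list(x)
    y_hat = [None] * len(chain)
    for j, h in zip(chain, models):
        y_hat[j] = h(x_aug)
        x_aug = x_aug + [y_hat[j]]   # estimates stand in for the unknown true values
    return y_hat


def train_erc(X, Y, k=3, seed=0, fit=fit_knn):
    rng = random.Random(seed)
    n, m = len(X), len(Y[0])
    ensemble = []
    for _ in range(k):
        chain = rng.sample(range(m), m)              # random permutation of targets
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        Xb, Yb = [X[i] for i in idx], [Y[i] for i in idx]
        ensemble.append((chain, train_rc(Xb, Yb, chain, fit)))
    return ensemble


def predict_erc(x, ensemble):
    preds = [predict_rc(x, chain, models) for chain, models in ensemble]
    m = len(preds[0])
    # Final prediction: mean of the k chain estimates for each target.
    return [sum(p[j] for p in preds) / len(preds) for j in range(m)]


# Toy data: one input variable, two targets.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]]
ensemble = train_erc(X, Y, k=3, seed=0)
erc_prediction = predict_erc([1.5], ensemble)
```

Note that `train_rc` augments the inputs with actual target values while `predict_rc` has to use estimates, which is exactly the asymmetry between training and prediction that the chain inherits from Classifier Chains.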
[...] $P(\mathbf{Y}) \neq \prod_{i=1}^{m} P(Y_i)$; and
- conditional, where $P(\mathbf{Y} | \mathbf{x}) \neq \prod_{i=1}^{m} P(Y_i | \mathbf{x})$,

and show that modeling them is important for improving generalization performance. According to this analysis, stacking is interpreted as a method that models unconditional label dependence and is more suitable for minimizing label-wise decomposable multi-label loss functions (note, however, that this analysis concerns a version of stacking that does not include the original input variables in the input space of the second stage models), while chaining is interpreted as a method that models conditional dependence and is more suitable for minimizing multi-label loss functions that cannot be decomposed label-wise.

Another interesting interpretation is offered by Read and Hollmén (2015) who show that Binary Relevance can (under certain conditions) achieve optimal performance in any dataset, and that improvements over the independent approach are often the result of using an inadequate base learner. Under this view, stacking and chaining can be considered as 'deep' independent learners that owe their improved performance over Binary Relevance (when the same base learner is used) to the use of labels as nodes in the inner layers of a deep neural network. These nodes represent readily available (in the training phase), high-level transformations of the original inputs. (This is in contrast with traditional deep learning where high-level feature representations are typically learned from the data in an unsupervised way.) This interpretation of stacking and chaining applies directly to the MTR versions of these methods that we present here.

From a bias-variance perspective, we observe that by introducing additional features to single-target models, SST and ERC have the effect of decreasing their bias at the expense of an increased variance. This suggests that whenever the increase in variance is outweighed by the decrease in bias, one should expect gains in generalization performance over ST. This also hints that both methods will probably benefit from being combined with a base regressor that includes a variance reduction mechanism like bagged (Breiman 1996) regression trees (an explicit feature selection could alternatively be applied as a means of variance reduction). As shown in (Munson and Caruana 2009), bagged trees not only ignore irrelevant features but can also exploit features that contain useful but noisy information. Both properties are very important in the context of SST and ERC because some of the extra features that they introduce might be irrelevant (e.g. whenever two target variables are statistically independent) and/or noisy (as discussed in the following subsection).

2.4 Generation of Meta-inputs

Both SST and ERC are based on the same core idea of treating other prediction targets as additional input variables. These meta-inputs differ from ordinary inputs in the sense that while their actual values are available at training time, they are missing during prediction. Thus, during prediction both methods have to rely on estimates of these values which come either from ST (in the case of SST) or from RC (in the case of ERC) models built on the training set. An important question that is answered differently by each method is the following:
What type of values should be used at training time for the meta-inputs?
SST uses estimates of the variables obtained by applying the first stage models on the training examples, while ERC uses their actual values. We observe that in both cases a core assumption of supervised learning is violated: that the training and testing data should be identically and independently distributed. In the SST case, the in-sample estimates that are used to form the training examples of the second stage models will typically be more accurate than the out-of-sample estimates used at prediction time. The situation is even more problematic in the case of ERC since the actual target values are used during training. In both cases, some of the input variables that are used by the underlying regression algorithm during model induction become noisy (or noisier in the case of SST) at prediction time and, as a result, the induced model might wrongly estimate (overestimate) their usefulness.

To mitigate this problem, we propose the use of out-of-sample estimates of the targets during training in order to increase the compatibility between the training values of the target variables and the values used during prediction. One way to obtain such estimates is to use a subset of the training set for building the first stage ST models (in the case of SST) or the RC models (in the case of ERC) and apply them to the held-out part. However, this approach would lead to reduced second stage training sets for SST as only the examples of the held-out set would be available for training the second stage models. The same holds for ERC where the chained RC models would be trained on training sets of decreasing size. The solution that we propose to this problem is the use of an internal $f$-fold cross-validation approach that allows obtaining out-of-sample estimates of the target variables for all the training examples.
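A small Python sketch may help to make the internal cross-validation idea concrete. It is written under several assumptions that are not part of the paper: the k-nearest-neighbour base learner `fit_knn`, the function names, the way folds are formed and the toy data are all hypothetical. With a 1-nearest-neighbour learner the in-sample estimates simply reproduce the actual target values, which makes the gap between in-sample and out-of-sample estimates easy to see.

```python
import random

# Sketch: out-of-sample meta-inputs via internal f-fold cross-validation,
# used for SST and for a single regressor chain. Names are illustrative.

def fit_knn(X, y, k=2):
    """Hypothetical base regressor: mean target of the k nearest training rows."""
    def predict(x):
        order = sorted(range(len(X)),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
        nearest = order[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict


def cv_estimates(X, y, f=3, seed=0, fit=fit_knn):
    """Out-of-sample estimate of one target for every training example."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::f] for i in range(f)]            # f disjoint parts
    estimates = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        rest = [i for i in idx if i not in held]
        h = fit([X[i] for i in rest], [y[i] for i in rest])
        for i in held_out:
            estimates[i] = h(X[i])                   # h was not trained on example i
    return estimates


def train_sst_cv(X, Y, f=3, fit=fit_knn):
    m = len(Y[0])
    targets = [[y[j] for y in Y] for j in range(m)]
    first = [fit(X, t) for t in targets]             # queried at prediction time
    meta = [cv_estimates(X, t, f, fit=fit) for t in targets]
    X_aug = [list(x) + [meta[j][i] for j in range(m)] for i, x in enumerate(X)]
    second = [fit(X_aug, t) for t in targets]        # trained on out-of-sample meta-inputs
    return first, second


def train_rc_cv(X, Y, chain, f=3, fit=fit_knn):
    models, X_aug = [], [list(x) for x in X]
    for pos, j in enumerate(chain):
        y_j = [y[j] for y in Y]
        models.append(fit(X_aug, y_j))
        if pos < len(chain) - 1:
            est = cv_estimates(X_aug, y_j, f, fit=fit)
            # Build new rows so that earlier models keep their own training data.
            X_aug = [row + [e] for row, e in zip(X_aug, est)]
    return models


# Toy comparison of in-sample and out-of-sample estimates with 1-NN.
X = [[float(i)] for i in range(6)]
Y = [[2.0 * i, 10.0 - i] for i in range(6)]
one_nn = lambda A, b: fit_knn(A, b, k=1)
y0 = [row[0] for row in Y]
in_sample = [one_nn(X, y0)(x) for x in X]
out_of_sample = cv_estimates(X, y0, f=3, fit=one_nn)
```

In this toy setting `in_sample` equals the actual values of the first target, whereas every entry of `out_of_sample` differs from them, which is closer to what the second stage models or later chain links would receive at prediction time. Prediction is unchanged with respect to the original methods.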
Compared to the actual target values or the in-sample estimates of the targets, the cross-validation estimates are expected to better resemble the values that are used during prediction. As a result, we expect that the contribution of the meta-inputs to the prediction of each target will be better estimated by the underlying regression algorithm.

The training procedures of the extended SST (denoted as SST$_{cv}$) and RC (denoted as RC$_{cv}$) methods are outlined in Algorithms 1 and 3. ERC$_{cv}$ consists of simply repeating the RC$_{cv}$ procedure $k$ times with different random chains. The corresponding prediction procedures are presented in Algorithms 2 and 4. Note that the prediction procedures of the original and the extended versions of each method coincide. In Section 5 we compare the performance of the extended versions of SST and ERC with the performance of the directly adapted variants, henceforth denoted as SST$_{train}$ and ERC$_{true}$. To better study the effects of the discrepancy problem, the comparison also includes SST using the actual target values (SST$_{true}$) and ERC using in-sample estimates of the target variables (ERC$_{train}$).

2.5 Discussion

Besides the type of values that each method uses for the meta-inputs at training time, SST and ERC have additional conceptual differences. A notable one is that the model built for each target $Y_j$ by SST uses all other targets as inputs while in RC each model involves only targets that precede $Y_j$ in a random chain. As a result, the model built for $Y_j$ by RC cannot benefit from statistical relationships with targets that appear later than $Y_j$ in the chain. This potential disadvantage of RC is partially overcome by ERC since each target is included in multiple random chains and, therefore, the probability that other targets will precede it is increased. At first glance, SST seems to represent a more straightforward way of including all the available information about other targets.
However, we should take into account that, since both methods rely on estimates of the meta-inputs at prediction time (as discussed in the previous subsection), the more meta-inputs are included in the input space, the higher the amount of error accumulation that is risked at prediction time. From this perspective, ERC seems to adopt a more cautious approach than SST. On the other hand, the estimates of the meta-inputs that are used by the second stage models in SST come from independent models, while the estimates of the meta-inputs used by each model in RC (and ERC) come from models that include information about other targets and thus involve a higher risk of becoming noisy. Overall, there seems to be a trade-off between using the additional information available in the targets and the noise that this information comes with. Which of the two methods (and which variant) achieves a better balance in this trade-off is revealed by the experimental analysis in Section 5.

Algorithm 1: SST$_{cv}$ training
  Input: Training set $D$, number of internal cross-validation folds $f$
  Output: 1st & 2nd stage models $h_j$ & $h'_j$, $j = 1..m$
  // Build 1st stage models
  for $j = 1$ to $m$ do
    $D_j = \{(\mathbf{x}^1, y^1_j), .., (\mathbf{x}^n, y^n_j)\}$   // transform $D$ to $D_j$
    $h_j : D_j \to \mathbb{R}$   // build model for $Y_j$ using $D_j$
    split $D_j$ randomly into $f$ disjoint parts $D^i_j$, $i = 1..f$
    for $i = 1$ to $f$ do
      $h^i_j : D_j \setminus D^i_j \to \mathbb{R}$   // build model for $Y_j$ using $D_j \setminus D^i_j$
  // Generate 2nd stage training sets
  for $j = 1$ to $m$ do
    $D'_j \leftarrow \emptyset$
    for $i = 1$ to $f$ do
      $D'^{i}_{j} \leftarrow \emptyset$
      foreach $\mathbf{x}^k \in D^i_j$ do
        $\hat{\mathbf{y}}^k = [h^i_1(\mathbf{x}^k), .., h^i_m(\mathbf{x}^k)]$
        $\mathbf{x}'^{k} = [\mathbf{x}^k, \hat{\mathbf{y}}^k]$   // concatenate $\mathbf{x}^k$ and $\hat{\mathbf{y}}^k$
        $D'^{i}_{j} = D'^{i}_{j} \cup (\mathbf{x}'^{k}, y^k_j)$
      $D'_j = D'_j \cup D'^{i}_{j}$
  // Build 2nd stage models
  for $j = 1$ to $m$ do
    $h'_j : D'_j \to \mathbb{R}$

Algorithm 2: SST prediction
  Input: Unknown instance $\mathbf{x}^q$, 1st & 2nd stage models $h_j$ & $h'_j$, $j = 1..m$
  Output: Output vector $\tilde{\mathbf{y}}^q$
  $\hat{\mathbf{y}}^q = \tilde{\mathbf{y}}^q = \mathbf{0}$
  // Apply the 1st stage models
  for $j = 1$ to $m$ do
    $\hat{y}^q_j = h_j(\mathbf{x}^q)$
  $\mathbf{x}'^{q} = [\mathbf{x}^q, \hat{\mathbf{y}}^q]$   // concatenate $\mathbf{x}^q$ and $\hat{\mathbf{y}}^q$
  // Apply the 2nd stage models
  for $j = 1$ to $m$ do
    $\tilde{y}^q_j = h'_j(\mathbf{x}'^{q})$

Algorithm 3: RC$_{cv}$ training
  Input: Training set $D$, number of internal cross-validation folds $f$
  Output: Chained models $h_j$, $j = 1..m$
  // Generate $D'_1$
  $D'_1 = \{(\mathbf{x}^1, y^1_1), .., (\mathbf{x}^n, y^n_1)\}$   // transform $D$ to $D'_1$
  for $j = 1$ to $m$ do
    $h_j : D'_j \to \mathbb{R}$   // build model for $Y_j$ using $D'_j$
    if $j < m$ then
      // Generate $D'_{j+1}$
      $D'_{j+1} \leftarrow \emptyset$
      split $D'_j$ randomly into $f$ disjoint parts $D'^{i}_{j}$, $i = 1..f$
      for $i = 1$ to $f$ do
        $h^i_j : D'_j \setminus D'^{i}_{j} \to \mathbb{R}$   // build model for $Y_j$ using $D'_j \setminus D'^{i}_{j}$
        foreach $\mathbf{x}'^{k}_{j} \in D'^{i}_{j}$ do
          $\mathbf{x}'^{k}_{j+1} = \mathbf{x}'^{k}_{j}$
          $\hat{y}^k_j = h^i_j(\mathbf{x}'^{k}_{j})$
          $\mathbf{x}'^{k}_{j+1} = [\mathbf{x}'^{k}_{j}, \hat{y}^k_j]$   // append $\mathbf{x}'^{k}_{j}$ with $\hat{y}^k_j$
          $D'_{j+1} = D'_{j+1} \cup (\mathbf{x}'^{k}_{j+1}, y^k_{j+1})$

Algorithm 4: RC prediction
  Input: Unknown instance $\mathbf{x}^q$, chain models $h_j$, $j = 1..m$
  Output: Output vector $\hat{\mathbf{y}}^q$
  $\hat{\mathbf{y}}^q = \mathbf{0}$
  $\mathbf{x}'^{q}_{1} = \mathbf{x}^q$
  for $j = 1..m$ do
    $\hat{y}^q_j = h_j(\mathbf{x}'^{q}_{j})$
    if $j < m$ then
      $\mathbf{x}'^{q}_{j+1} = [\mathbf{x}'^{q}_{j}, \hat{y}^q_j]$   // append $\mathbf{x}'^{q}_{j}$ with $\hat{y}^q_j$

Let us assume that the underlying single-target regression algorithm has training complexity $O(g_{tr}(n, d))$ and test complexity $O(g_{te}(n, d))$ for a dataset with $n$ examples and $d$ input variables. The training and test complexities of the ST method are $O(m \cdot g_{tr}(n, d))$ and $O(m \cdot g_{te}(n, d))$ respectively, as it involves training and querying $m$ independent single-target models.

Table 1: Training and test complexities of the proposed methods with single- and multi-core implementations. $n$, $d$ and $m$ denote the numbers of data points, inputs, and targets respectively. $k$ denotes the number of chains in ERC and $f$ the number of internal cross-validation folds in the cv variants of SST and ERC.

Method | Training, single-core | Training, multi-core | Test, single-core | Test, multi-core
SST$_{true}$ | $O(m \cdot g_{tr}(n, d))$ | $O(g_{tr}(n, d))$ | $O(m \cdot g_{te}(n, d))$ | $O(g_{te}(n, d))$
SST$_{train}$ | $O(m \cdot g_{tr}(n, d))$ | $O(g_{tr}(n, d))$ | $O(m \cdot g_{te}(n, d))$ | $O(g_{te}(n, d))$
SST$_{cv}$ | $O(f \cdot m \cdot g_{tr}(n, d))$ | $O(g_{tr}(n, d))$ | $O(m \cdot g_{te}(n, d))$ | $O(g_{te}(n, d))$
ERC$_{true}$ | $O(k \cdot m \cdot g_{tr}(n, d))$ | $O(g_{tr}(n, d))$ | $O(k \cdot m \cdot g_{te}(n, d))$ | $O(m \cdot g_{te}(n, d))$
ERC$_{train}$ | $O(k \cdot m \cdot g_{tr}(n, d))$ | $O(m \cdot g_{tr}(n, d))$ | $O(k \cdot m \cdot g_{te}(n, d))$ | $O(m \cdot g_{te}(n, d))$
ERC$_{cv}$ | $O(k \cdot f \cdot m \cdot g_{tr}(n, d))$ | $O(m \cdot g_{tr}(n, d))$ | $O(k \cdot m \cdot g_{te}(n, d))$ | $O(m \cdot g_{te}(n, d))$

With respect to SST, the method builds $2 \cdot m$ models at training time, all of which are queried at prediction time. In all variants of the method, half of the models are built on the original input space and half of the models are built on an input space augmented by $m$ meta-inputs. Thus, in the case of SST$_{true}$, where the meta-inputs are readily available, the training and test complexities are $O(m \cdot (g_{tr}(n, d) + g_{tr}(n, d + m)))$ and $O(m \cdot (g_{te}(n, d) + g_{te}(n, d + m)))$ respectively. Given that in most cases (see Table 3) the number of targets is much smaller than the number of inputs, i.e. $m \ll d$, the effective training and test complexities of SST$_{true}$ become $O(m \cdot g_{tr}(n, d))$ and $O(m \cdot g_{te}(n, d))$ respectively, thus the same as ST's complexities. SST$_{train}$ and SST$_{cv}$ have the same test complexity as SST$_{true}$ but a larger training complexity because of the process of generating estimates for the meta-inputs. In the SST$_{train}$ case, the training complexity is $O(m \cdot g_{tr}(n, d) + m \cdot g_{te}(n, d))$ because the $m$ first-stage models are applied to obtain estimates for all the training examples. For most regression algorithms (e.g. regression trees), the computational cost of making predictions for $n$ instances is much smaller than the cost of training on $n$ examples. For instance, the training complexity of a typical binary regression tree learner is $O(n \cdot d)$ (Su and Zhang 2006) while the test complexity is $O(n \cdot \log d)$. Thus, practically, the training complexity of SST$_{train}$ is similar to that of SST$_{true}$. When it comes to SST$_{cv}$, in addition to the $m$ first-stage models, $f$ additional models per target are built on $\frac{f-1}{f} \cdot n$ examples each.
Therefore, the training complexity of SST$_{cv}$ is $O(m \cdot g_{tr}(n, d) + m \cdot f \cdot g_{tr}(\frac{f-1}{f} \cdot n, d) + m \cdot g_{te}(n, d)) \approx O(f \cdot m \cdot g_{tr}(n, d) + m \cdot g_{te}(n, d))$. Given that $g_{te}(n, d) \ll g_{tr}(n, d)$, we conclude that the training complexity of SST$_{cv}$ is roughly $f$ times ST's training complexity. Also, note that SST$_{train}$ and SST$_{cv}$ can be parallelized stage-wise both at training and at prediction time, i.e. all single-target models within the same level can be trained and queried independently, while SST$_{true}$ is fully parallelizable at training time (all single-target models can be trained independently) and stage-wise parallelizable at test time.

In ERC, each RC model consists of a chain of $m$ models built on input spaces augmented by $\{0, 1, \dots, m-1\}$ meta-inputs, thus $\frac{m-1}{2}$ meta-inputs on average. In the case of ERC$_{true}$ and for an ensemble size of $k$ RC models, the training and test complexities are $O(k \cdot m \cdot g_{tr}(n, d + \frac{m-1}{2}))$ and $O(k \cdot m \cdot g_{te}(n, d + \frac{m-1}{2}))$ respectively. Given, as before, that $m \ll d$, the training complexity of ERC$_{true}$ becomes $O(k \cdot m \cdot g_{tr}(n, d))$ and its test complexity becomes $O(k \cdot m \cdot g_{te}(n, d))$, thus $k$ times ST's complexity in both cases. Following a similar reasoning as we did above for SST, we can show that the training complexity of ERC$_{train}$ is similar to that of ERC$_{true}$ and that the training complexity of ERC$_{cv}$ is $O(k \cdot f \cdot m \cdot g_{tr}(n, d))$, i.e. $k \cdot f$ times ST's training complexity. Obviously, the test complexities of both ERC$_{train}$ and ERC$_{cv}$ are the same as ERC$_{true}$'s test complexity. With respect to parallelization, we observe that each member of an ERC$_{train}$ or ERC$_{cv}$ ensemble can be trained independently, while ERC$_{true}$ is fully parallelizable at training time, i.e. all $k \cdot m$ single-target models can be trained independently.
For all ERC variants, test time parallelization is also possible since each ensemble member can be queried independently.

Table 1 summarizes the training and test complexities of each method assuming a single-core implementation as well as the minimum possible complexity when a multi-core implementation is used. Note that, as shown in the table, SST$_{cv}$ and ERC$_{cv}$ have the same multi-core complexity as SST$_{train}$ and ERC$_{train}$ respectively because their internal cross-validation procedure can also be parallelized.

[...] $\tilde{\mathbf{y}} = B\hat{\mathbf{y}}$, where $\hat{\mathbf{y}}$ are estimates obtained by applying ordinary least squares regression on the target variables and $B$ is a matrix that modifies these estimates in order to obtain a more accurate prediction $\tilde{\mathbf{y}}$, under the assumption that the targets are correlated. In all methods, $B$ can be expressed as $B = \hat{T}^{-1} D \hat{T}$, where $\hat{T}$ is the matrix of sample canonical co-ordinates and $D$ is a diagonal "shrinking" matrix that is obtained differently in each method. SST is highly similar to these methods but allows a more general formulation of the MTR problem. Firstly, SST does not impose any restriction on the family of models that generate the uncorrected (first stage) estimates, in contrast to these approaches that use estimates obtained from least squares regression. Secondly, the correction of the estimates applied by SST comes from a learning procedure that jointly considers target and input variables rather than target variables alone.

As shown by Breiman and Friedman (1997), the above methods can be described by an alternative but equivalent scheme. According to this, $\mathbf{y}$ is first transformed to the canonical co-ordinate system $\mathbf{y}' = \hat{T}\mathbf{y}$, then separate least squares regression is performed on each component of $\mathbf{y}'$ to obtain $\hat{\mathbf{y}}'$, these estimates are scaled by $D$ to obtain $\tilde{\mathbf{y}}' = D\hat{\mathbf{y}}'$ and finally transformed back to the original output space $\tilde{\mathbf{y}} = \hat{T}^{-1}\tilde{\mathbf{y}}'$.
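The equivalence of the two schemes can be checked in a few lines. The derivation below is only a sketch: it assumes that every least squares regression uses the same input variables and that $\hat{T}$ is treated as a fixed invertible matrix once it has been estimated. Under these assumptions, least squares fitted values are linear in the targets, so transforming the targets before fitting gives the same result as transforming the fitted values afterwards.

```latex
\begin{align*}
\hat{\mathbf{y}}' &= \hat{T}\hat{\mathbf{y}}
  && \text{(least squares is linear in the targets)} \\
\tilde{\mathbf{y}}' &= D\hat{\mathbf{y}}' = D\hat{T}\hat{\mathbf{y}}
  && \text{(shrinkage in canonical co-ordinates)} \\
\tilde{\mathbf{y}} &= \hat{T}^{-1}\tilde{\mathbf{y}}'
  = \hat{T}^{-1} D \hat{T}\hat{\mathbf{y}} = B\hat{\mathbf{y}}
  && \text{(back-transformation)}
\end{align*}
```

The last line recovers $B = \hat{T}^{-1} D \hat{T}$, so the two descriptions produce the same corrected prediction.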
As discussed by Dembczynski et al (2012), from this perspective, these methods fall under a more general scheme where the output space is first transformed, single-target regressors are then trained on the transformed output space and an inverse transformation is performed (possibly along with shrinkage/regularization) to obtain predictions for the original targets. Due to its generality, this scheme has been adopted by a number of recent methods in both MLC (Hsu et al 2009; Zhang and Schneider 2011, 2012; Tai and Lin 2012) and MTR (Balasubramanian and Lebanon 2012; Tsoumakas et al 2014).

A large number of MTR methods are derived from the predictive clustering tree (PCT) framework (Blockeel et al 1998). The main difference between the PCT algorithm and a standard decision tree is that the variance and the prototype functions are treated as parameters that can be instantiated to fit the given learning task. One such instantiation for MTR tasks is multi-objective decision trees (MODTs), where the variance function is computed as the sum of the variances of the targets, and the prototype function is the vector mean of the target vectors of the training examples falling in each leaf (Blockeel et al 1998, 1999). Bagging and random forest ensembles of MODTs were developed by Kocev et al (2007) and were found significantly more accurate than MODTs and equally good or better than ensembles of single-objective decision trees for both regression and classification tasks. In particular, multi-objective random forests yielded better performance than multi-objective bagging.

Methods that deal with the prediction of multiple target variables can be found in the literature of the related learning task of multi-task learning. According to Caruana (1997), multi-task learning is a form of inductive transfer (Pratt 1992) where the aim is to improve generalization accuracy on a set of related tasks by using a shared representation that exploits commonalities between them.
This definition implies that a multi-task method should be able to deal with problems where different prediction tasks do not necessarily share the same set of training examples or descriptive features and, moreover, each task can have a different data type. Thus, multi-task learning is actually a generalization of MTR.

Artificial neural networks (ANNs) are very well suited for multi-task problems because they can be naturally extended to support multiple outputs and offer flexibility in defining how inputs are shared between tasks. Thus, it is not surprising that most of the earliest multi-task methods were based on ANNs. Caruana (1994), for example, proposed a method where backpropagation is used to train a single ANN with multiple outputs (connected to the same hidden layers), and showed that it has better generalization performance compared to multiple single-task ANNs. A different architecture was used by Baxter (1995), where only the first hidden layers are shared and subsequent layers are specific to each task. The question of how much sharing is better when multi-task ANNs are applied for stock return prediction was explored by Ghosn and Bengio (1996), who concluded that a partial sharing of network parameters is preferable compared to full or no sharing.
More recently, Collobert and Weston (2008) applied a deep multi-task neural network architecture for natural language processing.

A large number of multi-task learning methods stem from a regularization perspective (a categorization of regularization-based multi-task methods can be found in Zhou et al 2012). Regularization-based multi-task methods minimize a penalized empirical loss of the form $\min_W L(W) + \Omega(W)$, where $W$ is a parameter matrix that has to be estimated, $L(W)$ is an empirical loss calculated on the training data and $\Omega(W)$ is a regularization term that takes a different form in each method depending on the underlying task relatedness assumption. Most methods assume that all tasks are related to each other (Evgeniou and Pontil 2004; Ando and Zhang 2005; Argyriou et al 2006, 2008; Chen et al 2009, 2010a; Obozinski et al 2010), while there are methods assuming that tasks are organized in structures such as clusters (Jacob et al 2008; Zhou et al 2011a), trees (Kim and Xing 2010) and graphs (Chen et al 2010b). A well-studied category of methods, which are particularly useful when dealing with high-dimensional input spaces, assume that models for different tasks share a common low-rank subspace and impose a trace-norm constraint on the parameter matrix (Argyriou et al 2006, 2008; Ji and Ye 2009). A similar category of methods constrain all models to share a common set of features (thus performing a joint feature selection), typically by applying $L_1/L_q$-norm ($q > 1$) regularization (Obozinski et al 2010). An approach that relaxes the above restrictive constraint, allowing models to leverage different extents of feature sharing, is proposed in (Jalali et al 2010, 2013).

Finally, we would like to mention that a number of MTR methods are based on the Gaussian Processes framework (e.g., Bonilla et al 2007; Álvarez and Lawrence 2011). These methods capture correlations between tasks by appropriate choices of covariance functions. A review of such methods as well as their relations to regularization-based multi-task approaches can be found in (Alvarez et al 2011).

3.2 Discrepancy in Meta-inputs

In the MLC domain, Senge et al (2013a) studied how the discrepancy issue affects the performance of Classifier Chains and showed that longer chains (i.e. multi-label problems with more labels to be predicted) lead to a higher performance deterioration. In an extension of that work (Senge et al 2013b), a “rectified” version of Classifier Chains (called Nested
Stacking) was presented that uses in-sample estimates of the label variables for training as in Stacked Binary Relevance. It was shown that this method performs better than the original Classifier Chains, especially when the label dependencies are strong. Following the opposite direction, Montañés et al (2011) proposed AID, a method similar to Stacked Binary Relevance, and found that using the actual label values instead of (in-sample) estimates leads to better results for most multi-label evaluation measures in both AID and Stacked Binary Relevance.

Our work is the first to study this issue in the MTR domain (an early version of this work, Spyromitros-Xioufis et al (2012), is the first to consider the discrepancy problem in the context of input space expansion methods). The issue is studied jointly for SST and ERC, thus allowing general conclusions to be drawn for this type of methods. Furthermore, Montañés et al (2011) and Senge et al (2013b) compared only the use of actual target values with the use of in-sample estimates while our comparison includes the use of out-of-sample estimates obtained by a cross-validation procedure. Finally, Senge et al (2013b) evaluate the use of estimates in Classifier Chains whereas we focus on the ensemble version of the corresponding MTR method (ERC) that is expected to offer more resilience to error propagation, as discussed in Section 2.5.

This section describes our experimental setup. We first present the participating methods and their parameters and provide details about their implementation in order to facilitate reproducibility of the experiments. Next, we describe the evaluation measure and explain the process that was followed for the statistical comparison of the methods. Finally, we present the datasets that we used and their main statistics.

4.1 Methods, Parameters and Implementation

The experimental evaluation includes all variants of the proposed SST and ERC methods, the ST baseline and the following state-of-the-art multiple prediction methods: a) multi-objective random forest (MORF) (Kocev et al 2007), b) trace norm regularization for multi-task learning (TNR) (Argyriou et al 2008), c) the DIRTY approach for multi-task learning (Jalali et al 2010, 2013) and d) a very recent multi-target method based on random linear combinations of the output space (RLC) (Tsoumakas et al 2014). For easy reference, Table 2 lists all methods included in the evaluation along with their abbreviations and citations where appropriate.

The proposed methods as well as ST and RLC transform the multi-target regression task into a series of single-target regression tasks which can be dealt with using any standard regression algorithm. For most of the experiments, we use bagged regression trees as the base regressor. This choice was motivated in Subsection 2.3 and is further discussed in Subsection 5.1 where we present results using a variety of well-known linear and non-linear regression algorithms. The ensemble size of all ERC variants is set to k = 10 RC models, each one trained using a different random chain. In datasets with fewer than 10 distinct chains, we create exactly as many RC models as the number of distinct chains. Furthermore, since the base regressor involves bootstrap sampling, we do not perform sampling in ERC, i.e. each RC model is trained using all training examples. In SST, we exclude the target being predicted by each second stage model from the input space of that model as we found that this choice slightly improves the performance of all variants of this method. f = 10 internal cross-validation folds are used in both SST_cv and ERC_cv.

Table 2: Methods used in experiments with abbreviations and citations.

Abbr.      Method                                              Citation
ST         Single Target
SST_true   Stacked ST, true values                             This paper
SST_train  Stacked ST, in-sample estimates                     This paper
SST_cv     Stacked ST, cv estimates                            This paper
ERC_true   Ensemble of Regressor Chains, true values           This paper
ERC_train  Ensemble of Regressor Chains, in-sample estimates   This paper
ERC_cv     Ensemble of Regressor Chains, cv estimates          This paper
MORF       Multi-Objective Random Forest                       Kocev et al (2007)
TNR        Trace Norm Regularization multi-task learning       Argyriou et al (2008)
DIRTY      A Dirty model for multi-task learning               Jalali et al (2010, 2013)
RLC        Random Linear target Combinations                   Tsoumakas et al (2014)

Concerning the parameter settings of the competing methods, in MORF we use an ensemble size of 100 trees and the values suggested by Kocev et al (2007) for the rest of its parameters. In RLC, we generate r = 100 new target variables by combining k = [ , ] interval). As shown in (Tsoumakas et al 2014), these values lead to near optimal results. In TNR, we minimize the squared loss function using the accelerated gradient method for trace norm minimization (Ji and Ye 2009). The regularization parameter is tuned by selecting among the values { r : r ∈ {− , ..., }} with internal 5-fold cross-validation. Before applying TNR, we apply z-score normalization and add a bias column as suggested in (Zhou et al 2011b). Finally, DIRTY is set up as suggested in (Jalali et al 2013): input variables are scaled to the $[-1, 1]$ range by dividing them by their maximum values. The regularization parameters $\lambda_b$ and $\lambda_s$ are tuned via internal 5-fold cross-validation (as in TNR). As suggested in (Jalali et al 2013), we set $\lambda_b = c\sqrt{\frac{m \log d}{n}}$, where c ∈ { r : r ∈ {− , ..., }} is a constant. Each distinct value of $\lambda_b$ is paired with five values of λ s = λ b + m − i , i ∈ { , , , , } , thus respecting the λ s λ b ∈ [ m , ] relationship dictated by the optimality conditions. In total, 25 different combinations of $\lambda_b$ and $\lambda_s$ are evaluated.

All the proposed methods and the evaluation framework were implemented in Java and integrated in Mulan (Tsoumakas et al 2011) by expanding its functionality to multi-target regression. The implementations of all single-target regression algorithms that were used to instantiate problem transformation methods are taken from Weka. With respect to the competing methods, RLC was already integrated in Mulan while for the purposes of this study we also integrated MORF (via a wrapper of the implementation offered in CLUS) as well as TNR and DIRTY (via wrappers of the implementations offered in MALSAR (Zhou et al 2011b)). Thus, all methods were evaluated under a common framework. In support of open science, we created a GitHub project that contains all our implementations, including code that facilitates easy replication of our experimental results.
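To make the role of the f internal cross-validation folds concrete, the following minimal Python sketch contrasts in-sample and out-of-sample meta-inputs and shows how a second-stage input space can be expanded while leaving out the target being predicted. It is not the Mulan implementation: the 1-nearest-neighbour base regressor, the toy data and all function names are hypothetical choices made only for illustration, and the fold assignment is a simple shuffled round-robin split.

```python
import random


def fit_1nn(X, y):
    """Toy base regressor (1-nearest neighbour); a hypothetical stand-in
    for the bagged regression trees used in the experiments."""
    def predict(x):
        nearest = min(range(len(X)),
                      key=lambda r: sum((a - b) ** 2 for a, b in zip(X[r], x)))
        return y[nearest]
    return predict


def in_sample_estimates(X, y, fit=fit_1nn):
    """Estimates from a model trained on the same examples it predicts."""
    model = fit(X, y)
    return [model(x) for x in X]


def out_of_sample_estimates(X, y, f=10, fit=fit_1nn, seed=0):
    """Estimates from an internal f-fold cross-validation (f >= 2): every
    example is predicted by a model that did not see it during training."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    estimates = [None] * len(X)
    for fold in range(f):
        held_out = indices[fold::f]
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            estimates[i] = model(X[i])
    return estimates


def expand_inputs(X, meta_columns, exclude=None):
    """Append the meta-inputs to each input vector, optionally leaving out
    the column of the target that the second-stage model will predict."""
    kept = [col for j, col in enumerate(meta_columns) if j != exclude]
    return [list(x) + [col[i] for col in kept] for i, x in enumerate(X)]


# Tiny illustration with one input variable and two targets.
X = [[float(i)] for i in range(6)]
y1 = [2.0 * i for i in range(6)]
y2 = [10.0 - i for i in range(6)]

train_meta = [in_sample_estimates(X, y) for y in (y1, y2)]
cv_meta = [out_of_sample_estimates(X, y, f=3) for y in (y1, y2)]

# Second-stage training inputs for target 0 (its own estimate is excluded).
X_stage2 = expand_inputs(X, cv_meta, exclude=0)
```

With a memorising base regressor such as 1-nearest neighbour, the in-sample estimates coincide with the true target values, whereas the cross-validated estimates carry errors closer to those expected at prediction time.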
http://mulan.sourceforge.net
http://dtai.cs.kuleuven.be/clus/
https://github.com/lefman/mulan-extended

[...] $h$ that has been induced from a train set $D_{train}$ is estimated based on a test set $D_{test}$ according to the following equation:

$$\mathrm{RRMSE}(h, D_{test}) = \sqrt{\frac{\sum_{(x,y) \in D_{test}} (\hat{y}_j - y_j)^2}{\sum_{(x,y) \in D_{test}} (\bar{Y}_j - y_j)^2}} \qquad (1)$$

where $\bar{Y}_j$ is the mean value of target variable $Y_j$ over $D_{train}$ and $\hat{y}_j$ is the estimation of the MTR model $h$ for $Y_j$. More intuitively, RRMSE for a target is equal to the Root Mean Squared Error (RMSE) for that target divided by the RMSE of predicting the average value of that target in the training set. RRMSE is estimated using $k$-fold cross-validation on all datasets, i.e. one RRMSE measurement is obtained on each fold and the final RRMSE is calculated as the average of those measurements. We use $k = 10$ on all datasets, except those with more than 9000 examples where for computational reasons we use either $k =$ [...] or $k =$ [...]. (The reliability of the estimates obtained using $k =$ [...])

To test the statistical significance of the observed differences between the methods, we follow the methodology suggested by Demsar (2006). To compare multiple methods on multiple datasets we use the Friedman test, the non-parametric alternative of the repeated-measures ANOVA. The Friedman test operates on the average ranks of the methods and checks the validity of the hypothesis (null-hypothesis) that all methods are equivalent. Here, we use an improved (less conservative) version of the test that uses the $F_F$ instead of the $\chi^2_F$ statistic (Iman and Davenport 1980). When the null-hypothesis of the Friedman test is rejected (p < [...]

[...] et al (2012)) where statistical tests are conducted using both aRRMSE (per dataset analysis) but also considering RRMSE per target as an independent performance measurement (per target analysis).

4.3 Datasets

Despite the numerous interesting applications of MTR, there are only a few publicly available datasets of this kind - perhaps because most applications are industrial - and most experimental evaluations of MTR methods are based on a limited number of datasets. For this study, much effort was made for the composition of a large and diverse collection of benchmark MTR datasets. In addition to 5 datasets that have been used in previous studies and are publicly available (edm, sf1, sf2, jura, wq), we also used 5 publicly available datasets (enb, slump, andro, osales, scpf) that have not been used for MTR benchmarking in the past. We also collected raw MTR data from a variety of interesting application domains and composed 8 new benchmark datasets (atp1d, atp7d, oes97, oes10, rf1, rf2, scm1d, scm20d). In total we collected 18 datasets and make them publicly available for future studies. To the best of our knowledge, this is the largest collection of benchmark MTR datasets to date. Table 3 reports the name (1st column), source (2nd column), number of examples (3rd column), number of input variables (4th column) and number of target variables (5th column) of each dataset.

Table 3: Name, source, number of examples, number of input variables (d) and number of target variables (m) of the datasets used in the evaluation. The datasets marked with an asterisk are first used for MTR benchmarking in this paper to the best of our knowledge.

Dataset   Source                      Examples   d     m
edm       Karalic and Bratko (1997)   154        16    2
sf1       Lichman (2013)              323        10    3
sf2       Lichman (2013)              1066       10    3
jura      Goovaerts (1997)            359        15    3
wq        Dzeroski et al (2000)       1060       16    14
*enb      Tsanas and Xifara (2012)    768        8     2
*slump    Yeh (2007)                  103        7     3
*andro    Hatzikos et al (2008)       49         30    6
*osales   Kaggle (2012)               639        413   12
*scpf     Kaggle (2013)               1137       23    3
*atp1d    This paper                  337        411   6
*atp7d    This paper                  296        411   6
*oes97    This paper                  334        263   16
*oes10    This paper                  403        298   16
*rf1      This paper                  9125       64    8
*rf2      This paper                  9125       576   8
*scm1d    This paper                  9803       280   16
*scm20d   This paper                  8966       61    16
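Returning to the evaluation measure of Equation (1), the computation can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in the study: the function names and the toy numbers are hypothetical, and aRRMSE is assumed here to be the plain average of the per-target RRMSE values.

```python
import math


def rrmse(y_true, y_pred, train_mean):
    """Relative RMSE of one target: the error of the model divided by the
    error of always predicting the training-set mean of that target."""
    squared_error = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    baseline_error = sum((train_mean - t) ** 2 for t in y_true)
    return math.sqrt(squared_error / baseline_error)


def arrmse(Y_true, Y_pred, train_means):
    """Average of the per-target RRMSE values (rows are test examples,
    columns are targets)."""
    n_targets = len(train_means)
    scores = [
        rrmse([row[j] for row in Y_true],
              [row[j] for row in Y_pred],
              train_means[j])
        for j in range(n_targets)
    ]
    return sum(scores) / n_targets


# Toy test set with two targets; the training-set means are assumed given.
Y_test = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
Y_hat = [[1.0, 12.0], [2.0, 20.0], [4.0, 30.0]]
train_means = [2.0, 20.0]
score = arrmse(Y_test, Y_hat, train_means)
```

A value of 1 corresponds to a model that is no better than predicting the training mean, and values below 1 indicate an improvement over that trivial predictor.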
Detailed descriptions of all datasets are provided in Appendix A. The datasets are available at http://mulan.sourceforge.net/datasets-mtr.html.

In this Section we present an extensive experimental analysis of the performance of the proposed methods. Subsection 5.1 is devoted to an exploration of the performance of ST using various well-known regression algorithms. The purpose of this investigation is to help us select an algorithm that works well on the studied datasets and use it as base regressor in all problem transformation methods (ST, SST, ERC and RLC) in subsequent experiments. At the same time, a challenging baseline performance level will be set for all multi-target methods. In Subsection 5.2 we evaluate SST_train and ERC_true, the direct adaptations of the corresponding MLC methods, in order to see whether these variants obtain a competitive performance compared to ST and state-of-the-art multi-target methods. Next, in Subsection 5.3 all three meta-input generation variants (true, train, cv) of SST and ERC are evaluated and compared to ST, shedding light on the impact of the discrepancy problem on each method. After the best performing variants of each method have been identified, Subsection 5.4 compares them with the state-of-the-art. The running times of all methods are reported and compared in Subsection 5.5, and finally, this section ends with a discussion of the main outcomes of the experimental results (Subsection 5.6).

5.1 Base Regressor Exploration

In this subsection we explore the performance of ST on the studied domains using a variety of regression algorithms. The goal of this exploration is to help us identify a regression algorithm that performs well across many domains, thus setting a challenging baseline performance level for the multi-target methods that we study next.
The algorithm that will emerge as the best performer will be used to instantiate all problem transformation methods (ST, SST, ERC and RLC) in the rest of the experiments, facilitating a fair comparison between these methods.

We selected five well-known linear and non-linear regression algorithms to couple ST with. In particular we use: ridge regression (Hoerl and Kennard 1970) (RIDGE), regression tree (Breiman et al 1984) (TREE), L2-regularized support vector regression (Drucker et al 1996) (SVR), bagged (Breiman 1996) regression trees (BAG) and stochastic gradient boosting (Friedman 2002) (SGB). In RIDGE and SVR, the regularization parameter was tuned (separately for each target) by applying internal 5-fold cross-validation and choosing the value that leads to the lowest root mean squared error among { r : r ∈ {− , ..., }}. In BAG we combine the predictions of 100 TREEs while in SGB we boost trees with four terminal nodes using a small shrinkage rate (0.1) and a large number of iterations (100), as suggested by Friedman et al (2001).

The detailed results obtained by each instantiation on each dataset and target are given in Appendix B.1. We observe that no algorithm is better in all domains (as dictated by the no free lunch theorems for supervised learning (Wolpert 1996, 2002)). However, ST-BAG stands out, obtaining the lowest aRRMSE in nine datasets. ST-SGB follows with five wins while ST-RIDGE and ST-SVR each obtain the lowest error in two datasets. Figure 3 shows the average ranks of the different instantiations along with the results of the Friedman and the Nemenyi tests for the analysis per dataset (left) and per target (right). In both analyses, the lowest average rank is obtained by ST-BAG, followed by ST-SGB and ST-RIDGE. In the per dataset analysis, the Nemenyi test finds that ST-BAG is significantly better than ST-TREE and ST-SVR while in the per target analysis, ST-BAG is found significantly better than all the other
instantiations. Therefore, we use BAG as the base regressor for all problem transformation methods in the rest of the experiments.

Fig. 3: Comparison of different ST instantiations using the Nemenyi test. Groups of methods that are not significantly different (at p = 0.05) are connected. (a) Per dataset analysis (Friedman p = 0.00017737, CD = 1.44). (b) Per target analysis (Friedman p = 9.8187e-43, CD = 0.51).

5.2 Evaluation of Direct Adaptations
In this subsection we focus on SST_train and ERC_true, the versions of SST and ERC that use the same type of values for the meta-inputs as their MLC counterparts, and compare their performance to that of ST, MORF, RLC, TNR and DIRTY to see where these methods stand with respect to the state-of-the-art.

Figure 4 shows the average ranks of the methods along with the results of the Friedman and the Nemenyi tests when the analysis is performed per dataset (left) and per target (right); the detailed results per dataset and target can be found in Appendix B.2. Several interesting remarks can be made based on these results. First, we see that both SST_train and ERC_true are competitive with state-of-the-art methods. SST_train obtains the lowest average rank in both the per dataset and the per target analysis. In the per dataset analysis, it is found significantly better than TNR and DIRTY and similar to MORF and RLC, and in the per target analysis it is found better than TNR, DIRTY and MORF and similar to RLC. ERC_true performs worse than SST_train but is still ranked above TNR, DIRTY and MORF in the per dataset analysis and above TNR and DIRTY in the per target analysis. In the per dataset analysis, ERC_true is found significantly better than TNR and DIRTY and similar to MORF and RLC, while in the per target analysis it is found better than TNR and DIRTY, similar to MORF and only RLC outperforms it significantly.

Fig. 4: Comparison of direct adaptations (with BAG as base regressor) using the Nemenyi test. Groups of methods that are not significantly different (at p = 0.05) are connected. (a) Per dataset analysis (Friedman p = 5.6842e-12, CD = 2.12). (b) Per target analysis (Friedman p = 5.3155e-121, CD = 0.75).

Interestingly, however, we see that according to both the per dataset and the per target analysis, SST_train and ERC_true are not significantly better than ST. This is an indication that the use of targets as meta-inputs as implemented by these variants of SST and ERC does not bring significant improvements. Actually, as can be seen from the detailed results, both SST_train and ERC_true perform worse than ST in several cases. This issue is studied in more detail in the following subsection.

Perhaps even more interestingly, none of the state-of-the-art multi-target methods participating in this comparison manages to significantly improve the performance of ST. In fact, ST is ranked second after SST_train in the per dataset analysis and third after SST_train and RLC in the per target analysis, and is found significantly better than TNR and DIRTY in both types of analyses. This exceptionally good performance of ST might seem a bit surprising given the results of previous studies (e.g. Kocev et al 2007; Tsoumakas et al 2014) but is in accordance with empirical and theoretical results for Binary Relevance (as discussed in Section 2) and is attributed to the use of a very strong base regressor.

To validate this, we instantiate all problem transformation methods with RIDGE, a base regressor that was found to perform worse than BAG in Subsection 5.1, and repeat the comparison. As shown in Figure 5, the situation is quite different compared to when BAG was used as base regressor. We observe that ST is now ranked below MORF, RLC and ERC_true in both the per dataset and the per target analysis and is found significantly worse than MORF according to the per target analysis. Clearly, as the strength of the base regressor increases, i.e. when the information provided by the features is well exploited, improving the performance of ST becomes more difficult. However, it is this challenging setting where performance improvements matter the most and it is thus interesting to see whether the proposed extensions of SST and ERC manage to obtain more consistent improvements over ST (compared to SST_train and ERC_true) under this setting.

Fig. 5: Comparison of direct adaptations (with RIDGE as base regressor) using the Nemenyi test. Groups of methods that are not significantly different (at p = 0.05) are connected. (a) Per dataset analysis (Friedman p = 4.5508e-08, CD = 2.12). (b) Per target analysis (Friedman p = 1.7672e-112, CD = 0.75).
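The rank-based comparisons used throughout this section can be sketched as follows. The critical difference (CD) of the Nemenyi test is computed with the formula given by Demsar (2006), $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ measurements; the function names are hypothetical, the table of critical values is reproduced from Demsar (2006) for $\alpha = 0.05$, and the sketch is not the code used to produce the figures.

```python
import math

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared methods (from Demsar 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def average_ranks(errors):
    """errors[i][j] is the error of method j on dataset (or target) i.
    Rank 1 is the lowest error; ties receive the average of their ranks."""
    n_methods = len(errors[0])
    totals = [0.0] * n_methods
    for row in errors:
        order = sorted(range(n_methods), key=lambda j: row[j])
        pos = 0
        while pos < n_methods:
            end = pos
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared_rank = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared_rank
            pos = end + 1
    return [t / len(errors) for t in totals]


def critical_difference(n_methods, n_measurements):
    """Two methods differ significantly (alpha = 0.05) when their average
    ranks differ by at least this amount."""
    q = Q_ALPHA_005[n_methods]
    return q * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_measurements))


# Seven methods compared over 18 datasets and over 143 targets.
cd_per_dataset = critical_difference(7, 18)
cd_per_target = critical_difference(7, 143)
```

For seven methods this gives a critical difference of roughly 2.12 over 18 datasets and 0.75 over 143 targets, which appears consistent with the CD values printed in the rank diagrams; the much smaller per target value explains why more differences are found significant in the per target analysis.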
5.3 Evaluation of Meta-input Generation Variants

In this subsection we evaluate the performance of SST and ERC when different types of values are used for the meta-inputs at training time. In particular, each method is evaluated using the actual target values (true variants), in-sample estimates (train variants) and out-of-sample estimates (cv variants) generated using the proposed internal cross-validation strategy. We want to see whether the cv variants (that according to the discussion of Subsection 2.4 are expected to be less affected by the discrepancy problem) can indeed perform better than the train and true variants and whether they manage to obtain more consistent improvements over ST. We also want to see how the SST variants compare to the ERC variants.

Fig. 6: Comparison of SST variants using the Nemenyi test. Groups of methods that are not significantly different (at p = 0.05) are connected. (a) SST, per dataset analysis (Friedman p = 0.00017837, CD = 0.78). (b) SST, per target analysis (Friedman p = 4.9255e-18, CD = 0.28).

Fig. 7: Comparison of ERC variants using the Nemenyi test. Groups of methods that are not significantly different (at p = 0.05) are connected. (a) ERC, per dataset analysis (Friedman p = 0.0036795, CD = 0.78). (b) ERC, per target analysis (Friedman p = 1.3287e-19, CD = 0.28).

Figures 6 and 7 show the average ranks and the results of the Friedman and Nemenyi tests for the three variants of SST and ERC, respectively, according to the per dataset (left) and the per target (right) analysis. First, we see that in both SST and ERC and in both types of analyses, the variants that use the actual values of the targets (true) obtain the worst average ranks and are found significantly worse than both variants that use estimates (train and cv). Since the variants of each method differ only with respect to the type of values that they use for the meta-inputs, it is clear that the discrepancy problem has a significant impact on the performance of both SST and ERC and that the use of estimates can ameliorate this problem.

With respect to the kind of estimates that should be used (in-sample or out-of-sample) the situation is slightly different for each method. In the case of SST, the cv variant obtains the best average rank in both the per dataset and the per target analysis and its difference from the train variant is found significant in the per target analysis.
In the case of ERC, while the cv variant is ranked higher than the train variant in the per target analysis, the train variant is ranked slightly higher in the per dataset analysis and in both cases the differences are not found significant. This suggests that using out-of-sample estimates is important for SST while ERC seems to be less affected by the discrepancy problem and, as a result, the use of in-sample estimates can be considered as a viable alternative.

A question that has not been answered yet is how the new variants of SST and ERC compare to ST and to each other. Figure 8 shows the results of the Friedman and the Nemenyi tests when all variants of SST and ERC are compared together with ST. We see that in both the per dataset (left) and the per target (right) analysis, the four variants that use estimates for the meta-inputs obtain better (lower) average ranks than ST while the true variants obtain worse average ranks. The differences with ST are not found significant according to the per dataset analysis but according to the per target analysis ERC_train and ERC_cv are found significantly better. Comparing the SST variants with the ERC variants, we see that each ERC variant is always ranked above the corresponding SST variant. This suggests that ERC's strategy for leveraging information from target variables is beneficial. Moreover, we see that ERC_train and ERC_cv are found significantly better than the rest of the methods according to the per target analysis.

Fig. 8: Comparison of SST_true/train/cv, ERC_true/train/cv and ST using the Nemenyi test. Groups of methods that are not significantly different (at p = 0.05) are connected. (a) Per dataset analysis (Friedman p = 8.5958e-09, CD = 2.12). (b) Per target analysis (Friedman p = 1.1513e-59, CD = 0.75).

So far, our analysis has focused on the average performance of the proposed methods (as quantified by their average ranks over datasets and targets) and we found that ERC_train and ERC_cv outperform the independent regressions baseline significantly. However, it is also important to see the consistency of these improvements across different datasets and targets. In particular, we would like to study the degree of cautiousness that each method exhibits, i.e.
how frequently and to what extent are the predictions produced by each method less accurate than the predictions of ST.

To facilitate a comparison of the methods in this regard, the following measures are defined:

$$R_d(M) = \frac{\mathrm{aRRMSE}(ST)}{\mathrm{aRRMSE}(M)}, \qquad R_t(M) = \frac{\mathrm{RRMSE}(ST)}{\mathrm{RRMSE}(M)}.$$

For each method $M$ and dataset $d$, $R_d$ quantifies the amount of improvement or degradation induced by $M$ compared to ST in terms of aRRMSE. Similarly, for each method $M$ and target $t$, $R_t$ quantifies the amount of improvement or degradation compared to ST in terms of RRMSE. Values of $R_d(M)$ and $R_t(M)$ < [...] $R_d(ST) = R_t(ST) =$ [...] $R_d$ over the 18 datasets included in the experimental study, i.e. each box plot summarizes the distribution of 18 values, while the lower part displays box plots of the values of $R_t$ over 143 targets, i.e. each box plot summarizes the distribution of 143 values.

We see that in both the per dataset and the per target analysis, the true variants are the ones exhibiting the most dispersed distributions with several cases of significant degradation of ST's performance. The train and cv variants are clearly more cautious with far fewer cases of degradation and even fewer cases of significant degradation. Looking at the distributions of $R_t$, we could say that the cv variants appear a bit more cautious than the train variants, especially in the case of SST. We also see that the ERC variants are always more cautious than the corresponding SST variants. Clearly, ERC_train and ERC_cv are the two most cautious methods since they obtain very similar or better performance than ST on all datasets and on about 75% of the targets. Even on targets where the two methods obtain a lower performance than ST, the reduction is less than about 5%.
This characteristic, along with the fact that they obtain the largest average improvements over ST, makes ERC_train and ERC_cv highly appealing.

5.4 Comparison with the State-of-the-art

In this section we compare the three best performing variants of the proposed methods, i.e. ERC_cv, ERC_train and SST_cv, with MORF, RLC, TNR and DIRTY to see how they compare to the state-of-the-art. Figure 10 shows the results of the Friedman and Nemenyi tests for the analysis per dataset (left) and per target (right). The per dataset analysis shows that all our methods perform significantly better than TNR and DIRTY while ERC_cv and ERC_train also perform significantly better than MORF. Moreover, all our methods obtain a lower average rank than RLC but according to this analysis the differences are not significant. According to the per target analysis, all our methods are found significantly better than TNR, DIRTY and MORF, and additionally, ERC_cv and ERC_train are found significantly better than RLC. In Figure 11 we compare the performance of the methods from a cautiousness perspective, as we did in Subsection 5.3. TNR, DIRTY and MORF are far less cautious than SST_cv, ERC_train and ERC_cv, with many instances of extreme degradation of ST’s performance. RLC is more cautious but not as much as SST_cv, ERC_train and ERC_cv, especially according to the per target analysis.

5.5 Running Times

In this subsection we compare the running times of the studied methods. Experiments were run on a 64-bit CentOS Linux machine with 80 Intel Xeon E7-4860 processors running at 2.27 GHz and 1 TB of main memory. The detailed results per method and dataset are shown in Table 4. For ST, RLC, SST and ERC we report times with BAG as base regressor. The number shown in parentheses next to the name of each dataset corresponds to the maximum number of processor threads that were available during the experiment. ST, SST, ERC and RLC made use of multiple threads through Weka’s multi-threaded implementation of Bagging. Thus, running times are directly comparable for these methods. Multi-threading was also partly used in TNR for the computation of the gradients.
DIRTY and MORF, on the other hand, always used a single processor thread.

[Figure 9, box plots for SST_true, SST_train, SST_cv, ERC_true, ERC_train and ERC_cv: (a) Distribution of R_d values for each method over 18 datasets. (b) Distribution of R_t values for each method over 143 targets.]
Fig. 9: Cautiousness analysis of SST and ERC variants. On each box, the whiskers extend to the most extreme data points still within 1.5 IQR of the lower quartile and outliers are plotted individually.
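The ratios R_d and R_t used in the cautiousness analysis can be illustrated with a minimal Python sketch. It assumes that aRRMSE is the average of the per-target RRMSE values of a dataset and that the RRMSE values have already been computed; the function name `improvement_ratios`, the dictionary layout and the toy numbers are hypothetical and are not taken from the experiments.

```python
def improvement_ratios(rrmse_st, rrmse_m):
    """Compute R_d (per dataset) and R_t (per target) of a method M against ST.

    Both arguments map a dataset name to a list of per-target RRMSE values
    (hypothetical layout). aRRMSE is assumed to be the mean over targets.
    """
    r_d, r_t = {}, {}
    for dataset, st_errors in rrmse_st.items():
        m_errors = rrmse_m[dataset]
        # R_t(M) = RRMSE(ST) / RRMSE(M), one value per target
        r_t[dataset] = [st / m for st, m in zip(st_errors, m_errors)]
        # R_d(M) = aRRMSE(ST) / aRRMSE(M), one value per dataset
        a_st = sum(st_errors) / len(st_errors)
        a_m = sum(m_errors) / len(m_errors)
        r_d[dataset] = a_st / a_m
    return r_d, r_t


# Toy values, invented for illustration only.
st = {"toy": [0.50, 0.80]}
m = {"toy": [0.40, 1.00]}
r_d, r_t = improvement_ratios(st, m)
print(r_t["toy"])  # first target improves (ratio > 1), second degrades (ratio < 1)
print(r_d["toy"])  # below 1: M is worse than ST on average for this dataset
```

A method can therefore improve some targets while still degrading the dataset-level ratio, which is why both views are reported.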
[Figure 10, critical difference diagrams for ERC-cv, ERC-train, SST-cv, RLC, MORF, Dirty and TNR: (a) Per dataset analysis (Friedman p = 5.1605e-16, Nemenyi p = 0.05, CD = 2.12). (b) Per target analysis (Friedman p = 5.3898e-134, Nemenyi p = 0.05, CD = 0.75).]
Fig. 10: Comparison of the best SST and ERC variants with the state-of-the-art using the Nemenyi test. Groups of methods that are not significantly different (at p = .05) are connected.

Looking at the aggregated running times, we see that MORF is by far the most efficient method, followed by ST, SST_true and SST_train which have similar running times. On the other hand,
DIRTY is the least efficient method, followed by ERC_cv. The running times of the rest of the methods lie in between. With respect to the SST and ERC variants, we see that their running times agree with the complexity analysis of Subsection 2.6. The total running time of SST_true is roughly twice the total running time of ST and similar to the total running time of SST_train. SST_cv is the least efficient among SST variants with a total running time that is about 5 times larger than that of SST_true and SST_train. With respect to the ERC variants, we see that ERC_true and ERC_train have similar total running times (which are also roughly similar to the total running time of SST_cv) while ERC_cv is about 7.5 times slower.

Overall, we see that the improvements achieved by ERC_cv and ERC_train over ST come with an increased computational cost. However, this cost is manageable, especially in the case of ERC_train. Furthermore, when better efficiency is needed, besides the use of parallelization one might consider reducing the ensemble size (k) or using a smaller number of folds (f) when applying internal cross-validation (in ERC_cv).

5.6 Discussion

Several interesting conclusions can be drawn from our experimental results. The experiments of Subsections 5.2 and 5.3 showed that while the directly adapted versions of SST and ERC have comparable or better performance than state-of-the-art methods, a careful handling of the discrepancy problem is crucial for obtaining consistent improvements over the independent regressions baseline and the state-of-the-art. In particular, as the experiments of Subsection 5.3 revealed, the use of estimates for the meta-inputs during training should clearly be preferred over using the actual target values. With regard to using in-sample versus out-of-sample estimates, the results indicate that while out-of-sample estimates are preferable in SST, ERC performs almost equally well using either type of estimates for the meta-inputs.
As discussed in Subsection 2.5, ERC’s models are built on input spaces which are expanded with fewer meta-inputs compared to SST’s models and, as a result, a smaller amount of error accumulation is risked at prediction time.

[Figure 11, box plots for MORF, TNR, Dirty, RLC, SST_cv, ERC_train and ERC_cv: (a) Distribution of R_d values for each method over 18 datasets. (b) Distribution of R_t values for each method over 143 targets.]
Fig. 11: Comparing the cautiousness of the best SST and ERC variants to that of state-of-the-art methods. On each box, the whiskers extend to the most extreme data points still within 1.5 IQR of the lower quartile and outliers are plotted individually.
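The critical difference (CD) reported alongside the Nemenyi tests can be sketched as follows, assuming the usual formulation from Demsar (2006), CD = q_alpha * sqrt(k(k+1) / (6N)), where k is the number of compared methods and N the number of datasets or targets. The table of critical values for alpha = 0.05 is quoted from that formulation, and the function name `nemenyi_cd` is a hypothetical helper, not part of the software used in the study.

```python
from math import sqrt

# Critical values q_alpha of the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared methods k (values as tabulated by Demsar 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def nemenyi_cd(k, n):
    """Critical difference in average rank for k methods compared over n cases."""
    return Q_ALPHA_005[k] * sqrt(k * (k + 1) / (6.0 * n))


# Seven methods compared over 18 datasets and over 143 targets.
print(round(nemenyi_cd(7, 18), 2))   # 2.12
print(round(nemenyi_cd(7, 143), 2))  # 0.75
```

Two methods are declared significantly different when their average ranks differ by at least the CD, which explains why the per target analysis, with many more cases, separates more pairs of methods.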
Table 4: Running times (in seconds) using BAG as base regressor. The number in parentheses next to the name of each dataset corresponds to the maximum number of processor threads that could be utilized during the experiment.
Dataset  ST  SST_true  SST_train  SST_cv  ERC_true  ERC_train  ERC_cv  MORF  RLC  TNR  DIRTY
edm (2)  4.3  3.3  3.2  14.5

Another interesting conclusion is that when a strong base regressor is employed, the task of improving the performance of ST becomes very difficult. As a result, multi-target methods which are considered state-of-the-art fail to improve ST’s performance and even perform significantly worse. This was particularly the case for the two multi-task methods, TNR and DIRTY, which were consistently found to be the worst performers. One explanation for their bad performance is the fact that both methods are based on a linear formulation of the problem that, as revealed by the base regressor exploration experiments, is not the most suitable hypothesis representation for the studied datasets (RIDGE and SVR performed worse than SGB and BAG, which are based on a non-linear hypothesis representation). Moreover, multi-task methods are expected to work better than single-task methods in cases where there is a lack of training data for some of the tasks (Alvarez et al 2011). This is not the case for most of the datasets that we used in this study as well as many recent multi-target prediction problems. In fact, the two datasets where TNR and DIRTY perform better than ST (sf1 and slump) are among those with the fewest training examples.

With respect to MORF, although it was found significantly more competitive than TNR and DIRTY, it also performed worse than ST on average. Nevertheless, we should point out that MORF achieved the best accuracy on three datasets (edm, wq, andro) and is the most computationally efficient of the compared methods. Similarly to TNR and DIRTY, MORF has the disadvantage of having a fixed hypothesis representation (trees), as opposed to the proposed methods that have the ability of adapting better to a specific domain by being instantiated with a more suitable base regressor. This advantage of the proposed methods is shared with RLC which, however, was not found as accurate.

Overall, our experimental results demonstrate that of the methods proposed in this paper, ERC_train and ERC_cv and, to a lesser extent, SST_train and SST_cv provide increased accuracy over doing a separate regression per target. In addition, ERC_train and ERC_cv are significantly more accurate than TNR, DIRTY, MORF and RLC (in the per target analysis). If caution is a further concern, then again ERC_train and ERC_cv compare favorably to the rest of the methods. With respect to the true variants of SST and ERC, we should stress that despite having a worse average performance, they are worthy of being considered by a practitioner as they obtain the highest performance in datasets (e.g., sf1 and scpf) where the discrepancy problem is not predominant.
Motivated by the similarity between the tasks of multi-label classification and multi-target regression, this paper introduced SST and ERC, two new multi-target regression techniques derived through a simple adaptation of two well-known multi-label classification methods. Both methods are based on the idea of treating other prediction targets as additional input variables, and represent a conceptually simple way of exploiting target dependencies in order to improve prediction accuracy.

A comprehensive experimental analysis that includes a multitude of real-world datasets and four existing state-of-the-art methods reveals that, despite being competitive with the state-of-the-art, the directly adapted versions of SST and ERC do not manage to obtain significant improvements or even degrade the performance of the independent regressions baseline. This degradation is attributed to an underestimation (in the original formulations of the methods) of the impact of the discrepancy of the values used for the additional input variables between training and prediction. Confirming our hypothesis, extended versions of the methods that attempt to mitigate the discrepancy using out-of-sample estimates of the targets during training manage to obtain consistent and significant improvements over the baseline approach and are found significantly better than four state-of-the-art methods. The fact that these impressive results were obtained by applying relatively simple adaptations of existing multi-label classification methods highlights the importance of exploiting relationships between similar machine learning tasks.

Concluding, let us point to some directions for future work. Although a mitigation of the discrepancy problem leads to significant performance improvements, a different amount of mitigation is ideal for each target. As a result, the use of in-sample estimates (or even the actual target values) gives better results for some targets. Thus, a promising direction for future work would be a deeper theoretical analysis of the different variants and the identification of problem characteristics that favor the use of one variant over the other. Finally, we should point out that SST and ERC can be viewed as strategies for leveraging variables that are available in the training phase but not in the prediction phase. This type of scenario is very common, for instance, in time series prediction. We believe that adapting SST and ERC for this type of problem is another valuable opportunity for future work.
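The difference between in-sample and out-of-sample estimates of the targets, which underlies the discrepancy problem summarized above, can be illustrated with a minimal pure-Python sketch. Several things here are assumptions made for illustration: a 1-nearest-neighbour regressor stands in for the base regressor (the experiments used other learners), folds are assigned by index modulo the number of folds, and the second stage expands the inputs with estimates of all targets. All function names are hypothetical.

```python
def knn1_fit(X, y):
    """'Train' a 1-nearest-neighbour regressor by storing the data."""
    return list(X), list(y)


def knn1_predict(model, x):
    Xs, ys = model
    nearest = min(range(len(Xs)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(Xs[i], x)))
    return ys[nearest]


def in_sample_estimates(X, y):
    """Estimates from a model that has seen every training example."""
    model = knn1_fit(X, y)
    return [knn1_predict(model, x) for x in X]


def out_of_sample_estimates(X, y, folds):
    """Each example is estimated by a model trained on the other folds."""
    est = [None] * len(X)
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        model = knn1_fit([X[i] for i in train], [y[i] for i in train])
        for i in range(len(X)):
            if i % folds == f:
                est[i] = knn1_predict(model, X[i])
    return est


def sst_cv_fit(X, Y, folds=2):
    """Two-stage stacking sketch: stage-2 inputs are X plus out-of-sample estimates."""
    m = len(Y[0])
    cols = [[row[j] for row in Y] for j in range(m)]
    stage1 = [knn1_fit(X, col) for col in cols]
    meta = [out_of_sample_estimates(X, col, folds) for col in cols]
    X_exp = [list(x) + [meta[j][i] for j in range(m)] for i, x in enumerate(X)]
    stage2 = [knn1_fit(X_exp, col) for col in cols]
    return stage1, stage2


def sst_predict(models, x):
    stage1, stage2 = models
    estimates = [knn1_predict(mdl, x) for mdl in stage1]
    return [knn1_predict(mdl, list(x) + estimates) for mdl in stage2]


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]
print(in_sample_estimates(X, y) == y)      # the model reproduces its training targets
print(out_of_sample_estimates(X, y, 2))    # estimates now carry realistic error
```

With a base regressor that fits the training set perfectly, in-sample estimates coincide with the true target values, so only the out-of-sample estimates resemble what the second-stage models will receive at prediction time.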
References
Aho T, Zenko B, Dzeroski S, Elomaa T (2012) Multi-target regression with rule ensembles. Journal of Machine Learning Research 13:2367–2407
Álvarez MA, Lawrence ND (2011) Computationally efficient convolved multiple output gaussian processes. Journal of Machine Learning Research 12:1459–1500
Alvarez MA, Rosasco L, Lawrence ND (2011) Kernels for vector-valued functions: A review. arXiv preprint arXiv:11066251
Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6:1817–1853
Argyriou A, Evgeniou T, Pontil M (2006) Multi-task feature learning. In: Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, pp 41–48
Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Machine Learning 73(3):243–272
Balasubramanian K, Lebanon G (2012) The landmark selection method for multiple output prediction. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012
Baxter J (1995) Learning internal representations. In: Proceedings of the Eighth Annual Conference on Computational Learning Theory, COLT 1995, Santa Cruz, California, USA, July 5-8, 1995, pp 311–320
Blockeel H, Raedt LD, Ramon J (1998) Top-down induction of clustering trees. In: Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), Madison, Wisconsin, USA, July 24-27, 1998, pp 55–63
Blockeel H, Dzeroski S, Grbovic J (1999) Simultaneous prediction of multiple chemical parameters of river water quality with TILDE. In: Principles of Data Mining and Knowledge Discovery, Third European Conference, PKDD ’99, Prague, Czech Republic, September 15-18, 1999, Proceedings, pp 32–40
Bonilla EV, Chai KMA, Williams CKI (2007) Multi-task gaussian process prediction. In: Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pp 153–160
Breiman L (1996) Bagging predictors. Machine Learning 24(2):123–140
Breiman L, Friedman JH (1997) Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(1):3–54
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Wadsworth
Caruana R (1994) Learning many related tasks at the same time with backpropagation. In: Advances in Neural Information Processing Systems 7, [NIPS Conference, Denver, Colorado, USA, 1994], pp 657–664
Caruana R (1997) Multitask learning. Machine learning 28(1):41–75
Chen J, Tang L, Liu J, Ye J (2009) A convex formulation for learning shared structures from multiple tasks. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pp 137–144
Chen J, Liu J, Ye J (2010a) Learning incoherent sparse and low-rank patterns from multiple tasks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010, pp 1179–1188
Chen X, Kim S, Lin Q, Carbonell JG, Xing EP (2010b) Graph-structured multi-task regression and an efficient optimization method for general fused lasso. arXiv preprint arXiv:10053579
Cheng W, Hüllermeier E (2009) Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76(2-3):211–225
Collobert R, Weston J (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In: Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pp 160–167
Dembczynski K, Cheng W, Hüllermeier E (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp 279–286
Dembczynski K, Waegeman W, Cheng W, Hüllermeier E (2012) On label dependence and loss minimization in multi-label classification. Machine Learning 88(1-2):5–45
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30
Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V (1996) Support vector regression machines. In: Advances in Neural Information Processing Systems 9, NIPS, Denver, CO, USA, December 2-5, 1996, pp 155–161
Dzeroski S, Demsar D, Grbovic J (2000) Predicting chemical parameters of river water quality from bioindicator data. Appl Intell 13(1):7–17
Evgeniou T, Pontil M (2004) Regularized multi–task learning. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, pp 109–117
Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning. Springer series in statistics, Springer, Berlin
Friedman JH (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4):367–378
Ghosn J, Bengio Y (1996) Multi-task learning for stock selection. In: Advances in Neural Information Processing Systems 9, NIPS, Denver, CO, USA, December 2-5, 1996, pp 946–952
Godbole S, Sarawagi S (2004) Discriminative methods for multi-labeled classification. In: Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, Proceedings, pp 22–30
Goovaerts P (1997) Geostatistics for natural resources evaluation. Oxford university press
Groves W, Gini ML (2011) Improving prediction in TAC SCM by integrating multivariate and temporal aspects via PLS regression. In: Agent-Mediated Electronic Commerce. Designing Trading Strategies and Mechanisms for Electronic Markets - AMEC 2011, Taipei, Taiwan, May 2, 2011, and TADA 2011, Barcelona, Spain, July 17, 2011, Revised Selected Papers, pp 28–43
Groves W, Gini ML (2015) On optimizing airline ticket purchase timing. ACM TIST 7(1):3
Hatzikos EV, Tsoumakas G, Tzanis G, Bassiliades N, Vlahavas IP (2008) An empirical study on sea water quality prediction. Knowl-Based Syst 21(6):471–478
Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67
Hsu D, Kakade S, Langford J, Zhang T (2009) Multi-label prediction via compressed sensing. In: Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, pp 772–780
Iman RL, Davenport JM (1980) Approximations of the critical region of the friedman statistic. Communications in Statistics-Theory and Methods 9(6):571–595
Izenman AJ (1975) Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis 5(2):248–264
Izenman AJ (2008) Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer New York
Jacob L, Bach FR, Vert J (2008) Clustered multi-task learning: A convex formulation. In: Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pp 745–752
Jalali A, Ravikumar PD, Sanghavi S, Ruan C (2010) A dirty model for multi-task learning. In: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, pp 964–972
Jalali A, Ravikumar PD, Sanghavi S (2013) A dirty model for multiple sparse regression. IEEE Transactions on Information Theory 59(12):7947–7968
Ji S, Ye J (2009) An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pp 457–464
Kaggle (2012) Kaggle competition: Online product sales. URL
Kaggle (2013) Kaggle competition: See click predict fix. URL
Karalic A, Bratko I (1997) First order regression. Machine Learning 26(2-3):147–176
Kim S, Xing EP (2010) Tree-guided group lasso for multi-task regression with structured sparsity. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp 543–550
Kocev D, Vens C, Struyf J, Dzeroski S (2007) Ensembles of multi-objective decision trees. In: Machine Learning: ECML 2007, 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007, Proceedings, pp 624–631
Kocev D, Džeroski S, White MD, Newell GR, Griffioen P (2009) Using single- and multi-target regression trees and ensembles to model a compound index of vegetation condition. Ecological Modelling 220(8):1159–1168
Kumar A, Vembu S, Menon AK, Elkan C (2012) Learning and inference in probabilistic classifier chains with beam search. In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part I, pp 665–680
Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
Luaces O, Díez J, Barranquero J, del Coz JJ, Bahamonde A (2012) Binary relevance efficacy for multilabel classification. Progress in AI 1(4):303–313
Montañés E, Quevedo JR, del Coz JJ (2011) Aggregating independent and dependent models to learn multi-label classifiers. In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part II, pp 484–500
Munson MA, Caruana R (2009) On feature selection, bias-variance, and bagging. In: Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2009, Bled, Slovenia, September 7-11, 2009, Proceedings, Part II, pp 144–159
Obozinski G, Taskar B, Jordan MI (2010) Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing 20(2):231–252
Pardoe D, Stone P (2008) The 2007 TAC SCM prediction challenge. In: AAAI 2008 Workshop on Trading Agent Design and Analysis
Pratt LY (1992) Discriminability-based transfer between neural networks. In: Advances in Neural Information Processing Systems 5, [NIPS Conference, Denver, Colorado, USA, November 30 - December 3, 1992], pp 204–211
Read J, Hollmén J (2014) A deep interpretation of classifier chains. In: Advances in Intelligent Data Analysis XIII - 13th International Symposium, IDA 2014, Leuven, Belgium, October 30 - November 1, 2014. Proceedings, pp 251–262
Read J, Hollmén J (2015) Multi-label classification using labels as hidden nodes. arXiv preprint arXiv:150309022
Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Machine Learning 85(3):333–359
Read J, Martino L, Luengo D (2014) Efficient monte carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition 47(3):1535–1546
Senge R, del Coz JJ, Hüllermeier E (2013a) On the problem of error propagation in classifier chains for multi-label classification. In: Proc. of the 36th Annual Conference of the German Classification Society
Senge R, del Coz JJ, Hüllermeier E (2013b) Rectifying classifier chains for multi-label classification. In: LWA 2013. Lernen, Wissen & Adaptivität, Workshop Proceedings Bamberg, 7.-9. October 2013, pp 151–158
Spyromitros-Xioufis E, Tsoumakas G, Groves W, Vlahavas I (2012) Multi-Label Classification Methods for Multi-Target Regression. ArXiv e-prints
Su J, Zhang H (2006) A fast decision tree learning algorithm. In: Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, July 16-20, 2006, Boston, Massachusetts, USA, pp 500–505
Tai F, Lin H (2012) Multilabel classification with principal label space transformation. Neural Computation 24(9):2508–2542
Teh YW, Seeger M, Jordan MI (2005) Semiparametric latent factor models. In: Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS 2005, Bridgetown, Barbados, January 6-8, 2005
Tsanas A, Xifara A (2012) Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy and Buildings 49:560–567
Tsoumakas G, Katakis I, Vlahavas IP (2010) Mining multi-label data. In: Data Mining and Knowledge Discovery Handbook, 2nd ed., Springer, pp 667–685
Tsoumakas G, Spyromitros-Xioufis E, Vilcek J, Vlahavas I (2011) Mulan: a java library for multi-label learning. Journal of Machine Learning Research 12:2411–2414
Tsoumakas G, Xioufis ES, Vrekou A, Vlahavas IP (2014) Multi-target regression via random linear target combinations. In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part III, pp 225–240
Van Der Merwe A, Zidek J (1980) Multivariate regression analysis and canonical variates. Canadian Journal of Statistics 8(1):27–39
Weston J, Chapelle O, Elisseeff A, Schölkopf B, Vapnik V (2002) Kernel dependency estimation. In: Advances in Neural Information Processing Systems 15 [Neural Information Processing Systems, NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada], pp 873–880
Wold H (1985) Partial least squares. Encyclopedia of statistical sciences
Wolpert DH (1992) Stacked generalization. Neural Networks 5(2):241–259
Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural computation 8(7):1341–1390
Wolpert DH (2002) The supervised learning no-free-lunch theorems. In: Soft Computing and Industry, Springer, pp 25–42
Yeh IC (2007) Modeling slump flow of concrete using second-order regressions and artificial neural networks. Cement and Concrete Composites 29(6):474–480
Zhang M, Zhou Z (2014) A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng 26(8):1819–1837
Zhang Y, Schneider JG (2011) Multi-label output codes using canonical correlation analysis. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pp 873–882
Zhang Y, Schneider JG (2012) Maximum margin output coding. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012
Zhou J, Chen J, Ye J (2011a) Clustered multi-task learning via alternating structure optimization. In: Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pp 702–710
Zhou J, Chen J, Ye J (2011b) Malsar: Multi-task learning via structural regularization. Arizona State University
Zhou J, Chen J, Ye J (2012) Multi-task learning: Theory, algorithms, and applications. URL
Appendix
A Datasets
A.1 Existing Datasets
EDM
The Electrical Discharge Machining dataset (Karalic and Bratko 1997) represents a two-target regression problem. The task is to shorten the machining time by reproducing the behaviour of a human operator that controls the values of two variables. Each of the target variables takes 3 distinct numeric values ({−1, 0, 1}) and there are 16 continuous input variables.
SF
The Solar Flare dataset (Lichman 2013) has 3 target variables that correspond to the number of times 3 types of solar flare (common, moderate, severe) are observed within 24 hours. There are two versions of this dataset. SF1 contains data from year 1969 and SF2 from year 1978.
JURA
The Jura (Goovaerts 1997) dataset consists of measurements of concentrations of seven heavy metals (cadmium, cobalt, chromium, copper, nickel, lead, and zinc), recorded at 359 locations in the topsoil of a region of the Swiss Jura. The type of land use (Forest, Pasture, Meadow, Tillage) and rock type (Argovian, Kimmeridgian, Sequanian, Portlandian, Quaternary) were also recorded for each location. In a typical scenario (Goovaerts 1997; Álvarez and Lawrence 2011), we are interested in the prediction of the concentration of metals that are more expensive to measure (primary variables) using measurements of metals that are cheaper to sample (secondary variables). In this study, cadmium, copper and lead are treated as target variables while the remaining metals along with land use type, rock type and the coordinates of each location are used as predictive features.
WQ
The Water Quality dataset (Dzeroski et al 2000) has 14 target attributes that refer to the relative representation of plant and animal species in Slovenian rivers and 16 input attributes that refer to physical and chemical water quality parameters.
A.2 New Datasets
ENB
The Energy Building dataset (Tsanas and Xifara 2012) concerns the prediction of the heating load and cooling load requirements of buildings (i.e. energy efficiency) as a function of eight building parameters such as glazing area, roof area, and overall height, amongst others.
SLUMP
The Concrete Slump dataset (Yeh 2007) concerns the prediction of three properties of concrete (slump, flow and compressive strength) as a function of the content of seven concrete ingredients: cement, fly ash, blast furnace slag, water, superplasticizer, coarse aggregate, and fine aggregate.
ANDRO
The Andromeda dataset (Hatzikos et al 2008) concerns the prediction of future values for six water quality variables (temperature, pH, conductivity, salinity, oxygen, turbidity) in Thermaikos Gulf of Thessaloniki, Greece. Measurements of the target variables are taken from under-water sensors with a sampling interval of 9 seconds and then averaged to get a single measurement for each variable over each day. The specific dataset that we use here corresponds to using a window of 5 days (i.e. feature attributes correspond to the values of the six water quality variables up to 5 days in the past) and a lead of 5 days (i.e. we predict the values of each variable 6 days ahead).
OSALES
This is a pre-processed version of the dataset used in Kaggle’s “Online Product Sales” competition (Kaggle 2012) that concerns the prediction of the online sales of consumer products. Each row in the dataset corresponds to a different product that is described by various product features as well as features of an advertising campaign. There are 12 target variables corresponding to the monthly sales for the first 12 months after the product launches. For the purposes of this study we removed examples with missing values in any target variable (112 out of 751) and attributes with one distinct value (145 out of 558).
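The two cleaning steps described for OSALES can be sketched in Python as follows. This is only an illustration under stated assumptions: rows are represented as dictionaries, `None` marks a missing value, and constant attributes are identified after the incomplete examples have been removed (the order of the two steps is not specified above). The function name `drop_incomplete_and_constant` and the toy rows are hypothetical.

```python
def drop_incomplete_and_constant(rows, target_cols):
    """Remove rows with a missing target, then attributes with one distinct value.

    rows: list of dicts mapping column name -> value (None = missing, assumed encoding).
    Returns the cleaned rows and the list of removed constant attributes.
    """
    kept = [r for r in rows if all(r[t] is not None for t in target_cols)]
    if not kept:
        return [], []
    feature_cols = [c for c in kept[0] if c not in target_cols]
    constant = [c for c in feature_cols if len({r[c] for r in kept}) == 1]
    cleaned = [{c: v for c, v in r.items() if c not in constant} for r in kept]
    return cleaned, constant


# Toy rows invented for illustration.
rows = [
    {"f1": 1, "f2": 5, "y1": 10.0},
    {"f1": 2, "f2": 5, "y1": None},   # missing target: removed
    {"f1": 3, "f2": 5, "y1": 7.0},
]
cleaned, removed = drop_incomplete_and_constant(rows, ["y1"])
print(len(cleaned), removed)  # 2 ['f2']
```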
SCPF
This is a pre-processed version of the dataset used in Kaggle’s “See Click Predict Fix” competition (Kaggle 2013). It concerns the prediction of three target variables that represent the number of views, clicks and comments that a specific 311 issue will receive. The issues have been collected from 4 cities (Oakland, Richmond, New Haven, Chicago) in the US and span a period of 12 months (01/2012 - 12/2012). The version of the dataset that we use here is a random 1% sample of the data. In terms of features we use the number of days that an issue stayed online, the source from where the issue was created (e.g. android, iphone, remote api, etc.), the type of the issue (e.g. graffiti, pothole, trash, etc.), the geographical coordinates of the issue, the city it was published from and the distance from the city center. All multi-valued nominal variables were first transformed to binary and then rare binary variables (being true for less than 1% of the cases) were removed.
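The last pre-processing step for SCPF, binarizing multi-valued nominal variables and then removing rare binary variables, might look like the following Python sketch. The data layout (a list of dictionaries), the naming of the indicator columns and the function name `binarize_and_prune` are assumptions made for illustration; only the 1% threshold comes from the description above.

```python
def binarize_and_prune(rows, nominal_cols, min_fraction=0.01):
    """One indicator column per (variable, value); drop indicators true in < min_fraction of rows."""
    n = len(rows)
    indicators = {}
    for col in nominal_cols:
        for value in sorted({r[col] for r in rows}):
            indicators[f"{col}={value}"] = [1 if r[col] == value else 0 for r in rows]
    # Keep an indicator only if it is true for at least min_fraction of the cases.
    return {name: vals for name, vals in indicators.items()
            if sum(vals) / n >= min_fraction}


# Toy data invented for illustration: "remote api" occurs in 1 of 200 rows (0.5%).
rows = ([{"source": "android"}] * 150
        + [{"source": "iphone"}] * 49
        + [{"source": "remote api"}])
binary = binarize_and_prune(rows, ["source"])
print(sorted(binary))  # ['source=android', 'source=iphone']
```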
OES
The Occupational Employment Survey datasets were obtained from years 1997 (OES97) and 2010 (OES10) of the annual Occupational Employment Survey compiled by the US Bureau of Labor Statistics. Each row provides the estimated number of full-time equivalent employees across many employment types for a specific metropolitan area. There are 334 and 403 cities in the 1997 and 2010 datasets, respectively. The input variables in these datasets are a randomly sequenced subset of employment types (e.g. doctor, dentist, car repair technician, etc.) observed in at least 50% of the cities (some categories had no values for particular cities). The targets for both years are randomly selected from the entire set of categories above the 50% threshold. Missing values in both the input and the target variables were replaced by sample means for these results. To our knowledge, this is the first use of the OES dataset for benchmarking of multi-target prediction algorithms.
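Replacing missing values by sample means, as described for the OES datasets, can be sketched as below. The representation (rows as lists with `None` for a missing value) and the function name `impute_with_column_means` are assumptions for illustration, and the sketch assumes every column has at least one observed value.

```python
def impute_with_column_means(rows):
    """Replace None entries by the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


# Toy matrix invented for illustration.
data = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(impute_with_column_means(data))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```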
ATP
The Airline Ticket Price dataset concerns the prediction of airline ticket prices. The rows are a sequence of time-ordered observations over several days. Each sample in this dataset represents a set of observations from a specific observation date and departure date pair. The input variables for each sample are values that may be useful for prediction of the airline ticket prices for a specific departure date. The target variables in these datasets are the next day (ATP1D) price or minimum price observed over the next 7 days (ATP7D) for 6 target flight preferences: 1) any airline with any number of stops, 2) any airline non-stop only, 3) Delta Airlines, 4) Continental Airlines, 5) Airtrain Airlines, and 6) United Airlines. The input variables include the following types: the number of days between the observation date and the departure date (1 feature), the boolean variables for day-of-the-week of the observation date (7 features), the complete enumeration of the following 4 values: 1) the minimum price, mean price, and number of quotes from 2) all airlines and from each airline quoting more than 50% of the observation days 3) for non-stop, one-stop, and two-stop flights, 4) for the current day, previous day, and two days previous. The result is a feature set of 411 variables. For specific details on how these datasets are constructed please consult Groves and Gini (2015). The nature of these datasets is heterogeneous with a mixture of several types of variables including boolean variables, prices, and counts.
RF
The river flow datasets concern the prediction of river network flows for 48 hours in the future at specific locations. The datasets contain hourly flow observations for 8 sites in the Mississippi River network in the United States, obtained from the US National Weather Service. Each row includes the most recent observation for each of the 8 sites as well as time-lagged observations from 6, 12, 18, 24, 36, 48 and 60 hours in the past. In RF1, each site contributes 8 attribute variables to facilitate prediction. There are a total of 64 variables plus 8 target variables. The RF2 dataset extends the RF1 data by adding precipitation forecast information for each of the 8 sites (expected rainfall reported as discrete values: 0.0, 0.01, 0.25, 1.0 inches). For each observation and gauge site, the precipitation forecast for 6 hour windows up to 48 hours in the future is added (6, 12, 18, 24, 30, 36, 42, and 48 hours). The two datasets both contain over 1 year of hourly observations ( >
SCM
The Supply Chain Management datasets are derived from the Trading Agent Competition in Supply Chain Management (TAC SCM) tournament from 2010. The precise methods for data preprocessing and normalization are described in detail by Groves and Gini (2011). Some benchmark values for prediction accuracy in this domain are available from the TAC SCM Prediction Challenge (Pardoe and Stone 2008); these datasets correspond only to the “Product Future” prediction type. Each row corresponds to an observation day in the tournament (there are 220 days in each game and 18 tournament games in a tournament). The input variables in this domain are observed prices for a specific tournament day. In addition, 4 time-delayed observations are included for each observed product and component (1, 2, 4 and 8 days delayed) to facilitate some anticipation of trends going forward. The datasets contain 16 regression targets; each target corresponds to the next day mean price (SCM1D) or mean price for 20 days in the future (SCM20D) for each product in the simulation. Days with no target values are excluded from the datasets (i.e. days with labels that are beyond the end of the game are excluded).
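The combination of time-delayed inputs and a next-day target can be sketched for a single price series as shown below. This is only an illustration under stated assumptions: the function name is hypothetical, the target is simplified to the next day's value of the same series (in the spirit of SCM1D), and dropping early days that lack the full 8-day history is a choice made here, not something the text specifies.

```python
from typing import List, Sequence, Tuple

DELAYS: Tuple[int, ...] = (1, 2, 4, 8)  # delays in days, as listed in the text


def build_next_day_examples(
    prices: List[float], delays: Sequence[int] = DELAYS
) -> List[Tuple[List[float], float]]:
    """Build (features, target) pairs from one price series of one game.

    features = [price[t], price[t-1], price[t-2], price[t-4], price[t-8]]
    target   = price[t+1]
    """
    examples = []
    max_delay = max(delays)
    # Stop one day before the end: the last day has no next-day target,
    # so it is excluded, mirroring the exclusion of days without labels.
    for t in range(max_delay, len(prices) - 1):
        features = [prices[t]] + [prices[t - d] for d in delays]
        examples.append((features, prices[t + 1]))
    return examples


# Toy series of 12 daily prices: 100.0, 101.0, ..., 111.0.
prices = [float(p) for p in range(100, 112)]
examples = build_next_day_examples(prices)
```

For this 12-day toy series, only days 8 to 10 have both a full history and a next-day label, so three examples are produced.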
B Detailed Experimental Results
B.1 Base Regressor Exploration Results
Table 5: Detailed results for ST using RIDGE, TREE, SVR, BAG and SGB as base regressors. For each dataset, we first report the average RRMSE over all targets (aRRMSE), and then the RRMSE per target. In each row, the lowest error is typeset in bold.
Dataset Target ST-RIDGE ST-TREE ST-SVR ST-BAG ST-SGB
edm 0.871 0.915 0.860 0.742
dflow 0.977 1.092 0.961 0.815
dgap 0.764 0.737 0.760 0.669
sf1 1.130 1.127
y1 0.293 0.062 0.587
slump 0.679 0.827 0.686 0.688
slump 0.889 0.974 0.895 0.795
flow 0.774 0.857 0.755 0.742
compressi.
m12 0.919 0.981 0.932
allminp0
auaminpa 0.349 0.229 0.392
allminpa 0.730 0.950 0.764 0.641
allminp0
acominpa 0.437 0.355 0.509
mtlp15a 0.633 0.613 0.727
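As a reminder of how the per-target and averaged errors reported in these tables relate, the sketch below computes RRMSE and aRRMSE. It assumes the usual definition of RRMSE as the model's root squared error relative to that of predicting the training-set mean of the target; the function names and the toy numbers are illustrative and are not taken from the tables.

```python
import math
from typing import List, Sequence


def rrmse(y_true: Sequence[float], y_pred: Sequence[float], train_mean: float) -> float:
    """Relative RMSE of one target: model error over the error of the training-mean predictor."""
    numerator = sum((p - y) ** 2 for y, p in zip(y_true, y_pred))
    denominator = sum((train_mean - y) ** 2 for y in y_true)
    return math.sqrt(numerator / denominator)


def arrmse(
    y_true: List[Sequence[float]],
    y_pred: List[Sequence[float]],
    train_means: Sequence[float],
) -> float:
    """Average RRMSE over all targets (one sequence of values per target)."""
    scores = [rrmse(t, p, m) for t, p, m in zip(y_true, y_pred, train_means)]
    return sum(scores) / len(scores)


# Toy test set with two targets and three examples (made-up values).
y_true = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
y_pred = [[1.0, 2.0, 4.0], [20.0, 20.0, 20.0]]
train_means = [2.0, 20.0]  # assumed training-set means of the two targets

per_target = [rrmse(t, p, m) for t, p, m in zip(y_true, y_pred, train_means)]
average = arrmse(y_true, y_pred, train_means)
```

An RRMSE below 1 means the model does better than predicting the training mean for that target; in the toy data the second target's predictions equal the training mean, so its RRMSE is exactly 1.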
B.2 Multi-target Regression Results
Table 6: Detailed results for all methods using BAG as base regressor in ST, SST, ERC and RLC (detailed results with additional base regressors can be found at http://users.auth.gr/espyromi/mtr/results.zip). For each dataset, we first report the average RRMSE over all targets (aRRMSE), and then the RRMSE per target. In each row, the lowest error is typeset in bold.
Dataset Target ST SST_true SST_train SST_cv ERC_true ERC_train ERC_cv MORF RLC TNR Dirty
edm 0.742 0.747 0.743 0.740 0.743 0.742 0.741
m-class 1.096
m-class 1.075
mtlp13a 0.481 0.515 0.453 0.431 0.516 0.426 0.386