Multi-Objective Learning to Predict Pareto Fronts Using Hypervolume Maximization
Timo M. Deist∗†, Monika Grewal∗, Frank J.W.M. Dankers, Tanja Alderliesten, and Peter A.N. Bosman

Centrum Wiskunde & Informatica, Life Sciences and Health Research Group, Amsterdam, The Netherlands
Leiden University Medical Center, Department of Radiation Oncology, Leiden, The Netherlands
Delft University of Technology, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft, The Netherlands

February 10, 2021
Abstract
Real-world problems are often multi-objective, with decision-makers unable to specify a priori which trade-off between the conflicting objectives is preferable. Intuitively, building machine learning solutions in such cases would entail providing multiple predictions that span and uniformly cover the Pareto front of all optimal trade-off solutions. We propose a novel learning approach to estimate the Pareto front by maximizing the dominated hypervolume (HV) of the average loss vectors corresponding to a set of learners, leveraging established multi-objective optimization methods. In our approach, the set of learners is trained multi-objectively with a dynamic loss function, wherein each learner's losses are weighted by their HV-maximizing gradients. Consequently, the learners get trained according to different trade-offs on the Pareto front, which otherwise is not guaranteed for fixed linear scalarizations or when optimizing for specific trade-offs per learner without knowing the shape of the Pareto front. Experiments on three different multi-objective tasks show that the outputs of the set of learners are indeed well-spread on the Pareto front. Further, the outputs corresponding to validation samples are also found to closely follow the trade-offs that were learned from training samples for our set of benchmark problems.

∗ Authors contributed equally.
† Corresponding author: [email protected]

Introduction
Machine learning, i.e., minimizing a loss function on a set of training samples to allow for data-driven inference on unseen samples, has become a crucial part of our day-to-day lives. As in traditional optimization-driven decision-making scenarios, the predictions from machine learning models are often required to meet multiple conflicting objectives. The most straightforward approach to tackling multi-objective (MO) decision-making problems is to formulate the MO problem as a single-objective problem by defining a trade-off between the different objectives. However, if the marginal benefit of one objective over the other or, alternatively, a preferred trade-off is unknown a priori (which is a common situation in real-world practice), this is not possible.

In the MO optimization literature, a posteriori MO decision-making processes are supported by computing finite-sized approximations of the Pareto front, i.e., the set of all Pareto optimal solutions. Subsequently, the decision-maker chooses their preferred solution from the approximation set. A popular metric to compare approximation sets during MO optimization is the hypervolume (HV), which, loosely speaking, measures the size of the objective space that is dominated by a given set of solutions. Theoretically, if the HV is maximal for a set of solutions, these solutions are on the Pareto front [9]. Additionally, sets of solutions with maximal HV are also spread across the front. Directly maximizing the HV has been a popular strategy for MO optimization, but the use of HV maximization for training machine learning models is still in its nascent stage.

In this paper, we show that training a set of machine learning models (learners) to predict an approximation of the Pareto front during inference is possible by maximizing the HV of their objective losses during training.
Moreover, we show that when using the gradient-based HV maximization strategy of [40], the set of learners is trained to minimize a dynamically weighted combination of multiple loss functions, wherein the weights of the losses are calculated in each learning iteration based on HV-maximizing gradients. Our main contributions are as follows:

• an HV-maximizing training strategy to generate fixed-size Pareto front approximations from a set of learners;
• a gradient-based realization of the proposed training strategy that can be directly used for training deep neural networks in a multi-objective fashion;
• experiments using neural networks on three applications (multi-objective regression, multi-observer medical image segmentation, and neural multi-style transfer).

(A Pareto optimal solution is never dominated by any other solution, meaning that no solution exists that is at least as good in all objectives and strictly better in at least one objective.)

Related work
Multi-task learning (MTL) [31], i.e., training a single network to perform well on multiple tasks, is related to MO learning in the sense that both approaches have multiple objectives that may conflict. MTL has been studied extensively, e.g., [28, 32, 44, 29, 23, 17].

[25] have described gradient-based HV maximization for single networks and formulated a dynamic loss function; [2] applied this concept to training generative adversarial networks. Our approach uses HV maximization for a set of learners, which is necessary to cover the entire Pareto front. In our approach, each learner's dynamic loss takes into account the other learners' positions in loss space (i.e., the space spanned by the co-domains of all loss functions in the MO learning formulation).

MO neural network training to predict Pareto fronts has been described earlier. [20, 19, 22] describe approaches with dynamic loss functions to train multiple networks with Pareto optimal performance following different trade-offs on the Pareto front. There, however, the trade-offs are required to be known in advance, whereas our proposed approach does not require knowing the set of trade-offs beforehand. Recent works [26, 18] propose to train a "hypernetwork" to generate network weights based on a user-specified trade-off. The specific trade-offs would still need to be known before such a hypernetwork could replicate a Pareto set of networks as produced by our HV-based training. Their approach could, however, relatively quickly approximate this Pareto set by iteratively sampling networks, computing their HVs, and adjusting trade-offs until a comparable HV is achieved. [21] describe how the Pareto set can be discovered starting from single Pareto optimal networks.
Their approach could be applied after attaining a diverse Pareto set with our proposed approach.

A Bayesian optimization approach to HV maximization by iteratively optimizing random scalarizations is described by [12] (and independently by [6], as mentioned by the authors). The key difference between their approach and ours is that we directly maximize the HV by using HV gradients for a fixed number of networks. Another MO Bayesian optimization approach is given by [35], which uses the Pareto frontier entropy metric to control optimization.

Other works determine sets of neural network parameters to estimate the Pareto front of error and sparsity [13] or of accuracy and energy consumption [14]. [24] train network layers with multiple regularizing losses using the Alternating Direction Method of Multipliers (ADMM). MO optimization to find Pareto fronts of model hyperparameters using the HV has been studied by [16, 36, 3]. HV maximization is also applied in reinforcement learning [37, 43].
Multi-objective learning

The traditional learning setup is to find a learner parameterized by a vector θ such that the loss L(θ, s_k) is minimal for a given set of samples S = {s_1, ..., s_k, ..., s_|S|}.

Figure 1: (a) Three Pareto optimal loss vectors L(θ_i, s_k) on the Pareto front (green) with dominated subspaces D_r(L(θ_i, s_k)) with respect to reference point r. The union of the dominated subspaces is the dominated hypervolume (HV). (b) Gray markings illustrate the computation of the HV gradients ∂HV(L(Θ, s_k))/∂L(θ_i, s_k) (gray arrows) in the three non-dominated solutions. (c) The same five solutions grouped into two domination-ranked fronts Θ_1 and Θ_2 with corresponding HVs (equal to their dominated subspaces D_r(L(θ_i, s_k))) and HV gradients.

In an MO learning setting, this can be formulated as minimizing a vector of n losses L(θ, s_k) = [L_1(θ, s_k), ..., L_n(θ, s_k)]. To learn multiple sets of parameters with loss vectors on the Pareto front, we replace θ by a set of parameters Θ = {θ_1, ..., θ_p}, where each parameter vector θ_i represents a learner. The corresponding set of loss vectors {L(θ_1, s_k), ..., L(θ_p, s_k)} is represented by a stacked loss vector L(Θ, s_k) = [L(θ_1, s_k), ..., L(θ_p, s_k)]. Our goal is to learn a set of p learners such that, for each sample s_k, the corresponding loss vectors in L(Θ, s_k) lie on and span the Pareto front of loss functions, i.e., each learner's loss vector is Pareto optimal and lies in a distinct subsection of the Pareto front.

The HV of a loss vector L(θ_i, s_k) for a sample s_k is the volume of the subspace D_r(L(θ_i, s_k)) in loss space dominated by L(θ_i, s_k); this is illustrated in Figure 1a. To keep this volume finite, the HV is computed with respect to a reference point r, which bounds the space to the region of interest.
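To make the HV concrete: for two losses it can be computed with a simple left-to-right sweep over the points sorted by the first loss. The sketch below is illustrative only (the paper's implementation relies on [10]'s algorithm); `hypervolume_2d`, the sample points, and the reference point are assumptions for this example.

```python
import numpy as np

def hypervolume_2d(losses, ref):
    """Dominated hypervolume of 2-D loss vectors w.r.t. reference point `ref`.

    Assumes minimization: a point contributes only if it dominates `ref`.
    Dominated points are skipped automatically by the sweep.
    """
    pts = np.asarray(losses, dtype=float)
    pts = pts[np.all(pts < ref, axis=1)]   # keep points inside the region of interest
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(pts[:, 0])]       # sweep in ascending first loss
    hv, best_l2 = 0.0, ref[1]
    for l1, l2 in pts:
        if l2 < best_l2:                   # non-dominated so far: add its strip
            hv += (ref[0] - l1) * (best_l2 - l2)
            best_l2 = l2
    return hv
```

For example, the three mutually non-dominated points (1, 3), (2, 2), (3, 1) with reference point (4, 4) dominate a hypervolume of 6, and adding a dominated point such as (2.5, 2.5) leaves the HV unchanged.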
Subsequently, the HV of multiple loss vectors L(Θ, s_k) is the HV of the union of the dominated subspaces D_r(L(θ_i, s_k)), ∀i ∈ {1, 2, ..., p}. (The reference point is generally set to large coordinates in loss space to ensure that it is always dominated by all loss vectors.)

Maximizing the HV is a popular approach for approximating Pareto fronts in the MO optimization literature because the HV encodes both solution quality and diversity of the set of solutions while being Pareto compliant: loss vectors that form a set with maximal HV are both Pareto optimal [9] and diversified. Therefore, we maximize the mean HV over the set of p loss vectors with the goal of finding Pareto optimal and diversified solutions for each sample s_k. The MO learning problem to maximize the mean HV over all |S| samples is:

\[
\text{maximize} \quad \frac{1}{|S|} \sum_{k=1}^{|S|} \mathrm{HV}\big(L(\Theta, s_k)\big) \tag{1}
\]

Concordantly, the update direction of gradient ascent for parameter vector θ_i of learner i is:

\[
\frac{\partial}{\partial \theta_i}\, \frac{1}{|S|} \sum_{k=1}^{|S|} \mathrm{HV}\big(L(\Theta, s_k)\big) \tag{2}
\]

By exploiting the chain-rule decomposition of HV gradients described in [8], the update direction in Equation (2) for parameter vector θ_i of learner i can be written as:

\[
\frac{1}{|S|} \sum_{k=1}^{|S|} \frac{\partial\, \mathrm{HV}\big(L(\Theta, s_k)\big)}{\partial L(\theta_i, s_k)} \cdot \frac{\partial L(\theta_i, s_k)}{\partial \theta_i} \quad \forall i \in \{1, \ldots, p\} \tag{3}
\]

The dot product of ∂HV(L(Θ, s_k))/∂L(θ_i, s_k) (the HV gradient with respect to loss vector L(θ_i, s_k)) in loss space and ∂L(θ_i, s_k)/∂θ_i (the matrix of loss vector gradients in learner i's parameters θ_i) in parameter space can be decomposed into

\[
\frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{j=1}^{n} \frac{\partial\, \mathrm{HV}\big(L(\Theta, s_k)\big)}{\partial L_j(\theta_i, s_k)}\, \frac{\partial L_j(\theta_i, s_k)}{\partial \theta_i} \quad \forall i \in \{1, \ldots, p\} \tag{4}
\]

where ∂HV(L(Θ, s_k))/∂L_j(θ_i, s_k) is the scalar HV gradient in the single loss function L_j(θ_i, s_k), and ∂L_j(θ_i, s_k)/∂θ_i contains the gradients used in gradient descent for single-objective training of learner i on loss L_j(θ_i, s_k).
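For n = 2, these scalar HV gradients have a simple closed form when the loss vectors are mutually non-dominated: each partial derivative is minus the side length of the point's exclusively dominated strip, bounded by the neighbouring points and the reference point (cf. Figure 1b). The sketch below is an illustrative assumption for the bi-objective case, not the general algorithm of [8] used in the paper, and it assumes every point strictly dominates the reference point.

```python
import numpy as np

def hv_gradient_2d(losses, ref):
    """dHV/dL_j for mutually non-dominated 2-D loss vectors (minimization).

    Increasing a loss shrinks the HV, so all entries are non-positive.
    Assumes no point dominates another and all points dominate `ref`.
    """
    pts = np.asarray(losses, dtype=float)
    order = np.argsort(pts[:, 0])          # ascending L1 implies descending L2
    sorted_pts = pts[order]
    grads = np.zeros_like(sorted_pts)
    # Neighbours bound each point's exclusive strip; `ref` closes the ends.
    upper_l2 = np.concatenate(([ref[1]], sorted_pts[:-1, 1]))  # previous point's L2
    right_l1 = np.concatenate((sorted_pts[1:, 0], [ref[0]]))   # next point's L1
    grads[:, 0] = -(upper_l2 - sorted_pts[:, 1])   # dHV / dL1
    grads[:, 1] = -(right_l1 - sorted_pts[:, 0])   # dHV / dL2
    out = np.empty_like(grads)
    out[order] = grads                              # restore the input order
    return out
```

For the points (1, 3), (2, 2), (3, 1) with reference point (4, 4), every exclusive strip has side length 1, so all partial derivatives equal -1.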
Based on Equation (4), one can observe that mean HV maximization of the loss vectors of a set of p learners over |S| samples can be achieved by weighting their gradient descent directions for the loss functions L_j(θ_i, s_k) with the corresponding HV gradients ∂HV(L(Θ, s_k))/∂L_j(θ_i, s_k) for all i, j. In other terms, MO learning of a set of p learners can be achieved by minimizing the following dynamic loss function for each learner i:

\[
\frac{1}{|S|} \sum_{k=1}^{|S|} \sum_{j=1}^{n} \frac{\partial\, \mathrm{HV}\big(L(\Theta, s_k)\big)}{\partial L_j(\theta_i, s_k)}\, L_j(\theta_i, s_k) \quad \forall i \in \{1, \ldots, p\} \tag{5}
\]

(Minimizing, instead of maximizing, the dynamic loss function maximizes the HV because the reference point r is in the positive quadrant, "to the right and above 0".)

The computation of the HV gradients ∂HV(L(Θ, s_k))/∂L_j(θ_i, s_k) is illustrated in Figure 1b: each gradient equals the marginal decrease in the subspace dominated only by L(θ_i, s_k) when increasing L_j(θ_i, s_k).

Training with Equation (5), however, requires |S| expensive HV gradient computations per update. To reduce the number of HV gradient computations from |S| to 1, we simplify the dynamic loss function to:

\[
\sum_{j=1}^{n} \frac{\partial\, \mathrm{HV}\big(\bar{L}(\Theta, S)\big)}{\partial \bar{L}_j(\theta_i, S)}\, \bar{L}_j(\theta_i, S) \quad \forall i \in \{1, \ldots, p\} \tag{6}
\]

where \(\bar{L}(\Theta, S) = [\bar{L}(\theta_1, S), \ldots, \bar{L}(\theta_p, S)]\), \(\bar{L}(\theta_i, S) = [\bar{L}_1(\theta_i, S), \ldots, \bar{L}_n(\theta_i, S)]\), and \(\bar{L}_j(\theta_i, S) = \frac{1}{|S|} \sum_{k=1}^{|S|} L_j(\theta_i, s_k)\). Note that the interpretation of Equation (5) is that the HV of all learners' loss vectors for one sample, averaged over all samples, is maximal.
Specifically, Equation (5) is agnostic to a single learner's behavior and considers the output of the set of learners as a whole. One learner is not necessarily trained exclusively for a specific loss trade-off; across different samples s_k, one learner could generate outputs corresponding to different trade-offs.

The interpretation of Equation (6), however, is that the HV of the set of average loss vectors (the average loss over all samples for each learner) is maximal. Consequently, each learner θ_i is trained for a different loss trade-off. While computationally more efficient, this deviates from the direct formulation because the loss is based on the average front, obtained by averaging the loss over all samples for each learner separately. This simplification might not yield good estimates of concave Pareto fronts: if single learners are able to produce predictions at opposing extremes of the Pareto front for different samples, the HV of all learners' averaged losses will be higher than the average HV over the Pareto front estimates of the individual concave fronts. Equation (6) might thus learn sets of predictions at the extremes of concave Pareto fronts and, therefore, the original Equation (5) could be preferred in settings with concave fronts.

A relevant caveat of gradient-based HV maximization is that the HV gradients ∂HV(L̄(Θ, S))/∂L̄_j(θ_i, S) of strongly dominated solutions, i.e., solutions in the interior of the dominated HV, are zero [8] because no movement direction affects the HV (Figure 1b). Further, the gradients of weakly dominated solutions are undefined [8]. As a consequence, HV gradients cannot be used for optimizing (weakly or strongly) dominated solutions. To resolve this issue, we follow [40]'s approach to gradient-based HV optimization. Other strategies to handle dominated solutions exist [41, 5], but [40] was selected because it only requires HV computation and non-dominated sorting, and a comparison had shown that it performs similarly to a competing approach [5].
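The non-dominated sorting that [40]'s approach requires partitions the loss vectors into fronts of mutually non-dominated points. A straightforward sketch with an O(p^2 n) sweep per front is shown below; it is illustrative only, as the paper's implementation is based on the more efficient book-keeping of [4].

```python
import numpy as np

def domination_ranks(losses):
    """Partition loss vectors into fronts of mutually non-dominated points.

    Returns one integer rank per point: 0 for the Pareto front of the set,
    1 for the front remaining after removing rank 0, and so on.
    """
    pts = np.asarray(losses, dtype=float)
    ranks = np.full(len(pts), -1)
    rank = 0
    while np.any(ranks < 0):
        active = np.where(ranks < 0)[0]    # points not yet assigned a front
        for i in active:
            dominated = False
            for j in active:
                # j dominates i: at least as good everywhere, better somewhere
                if j != i and np.all(pts[j] <= pts[i]) and np.any(pts[j] < pts[i]):
                    dominated = True
                    break
            if not dominated:
                ranks[i] = rank
        rank += 1
    return ranks
```

For example, among the points (1, 3), (2, 2), (3, 1), (2.5, 2.5), (3, 3), the first three form the rank-0 front, (2.5, 2.5) is pushed to rank 1, and (3, 3) to rank 2.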
The approach by [40] avoids the problem of dominated solutions by sorting all loss vectors into separate fronts Θ_l of mutually non-dominated loss vectors and optimizing each front separately (Figure 1c). Here, l is the domination rank and q(i) is the mapping of learner i to its domination rank. By maximizing the HV of each front, trailing fronts with domination rank l > 1 move towards the Pareto front, and the HV of a single front is maximized by determining optimal locations for each loss vector on the Pareto front.

Furthermore, we normalize the HV gradients ∂HV(L̄(Θ_q(i), S))/∂L̄(θ_i, S) as in [5] such that their length in loss space is 1. The dynamic loss function including the domination ranking of fronts by [40] and HV gradient normalization is:

\[
\sum_{j=1}^{n} w_i\, \frac{\partial\, \mathrm{HV}\big(\bar{L}(\Theta_{q(i)}, S)\big)}{\partial \bar{L}_j(\theta_i, S)}\, \bar{L}_j(\theta_i, S) \quad \forall i \in \{1, \ldots, p\} \tag{7}
\]

where \(w_i = \left\lVert \frac{\partial\, \mathrm{HV}(\bar{L}(\Theta_{q(i)}, S))}{\partial \bar{L}(\theta_i, S)} \right\rVert^{-1}\).

Algorithm 1 Training learners Θ for Pareto front estimation by HV maximization of domination-ranked fronts

  Initialize p learners Θ = {θ_1, ..., θ_p}
  for each batch S̃ do
      for each learner θ_i do
          Compute average loss vector L̄(θ_i, S̃)
      end for
      Stack the average loss vectors L̄(θ_i, S̃) into L̄(Θ, S̃)
      Sort L̄(Θ, S̃) into multiple fronts L̄(Θ_l, S̃) by domination ranking
      for each front l do
          Compute loss weights ∂HV(L̄(Θ_q(i), S̃))/∂L̄_j(θ_i, S̃) ∀i, j using the algorithm by [8]
      end for
      for each learner θ_i do
          Backpropagate on the joint loss from Equation (7)
      end for
      Update Θ by stepping in the gradient direction
  end for

We implemented the HV maximization of losses from multiple learners, as defined in Equation (7), in Python. We use [10]'s HV computation as reimplemented by Simon Wessing, available from [39]. The HV gradients ∂HV(L̄(Θ_q(i), S))/∂L̄_j(θ_i, S) are computed following the algorithm by [8]. Learners with identical losses are assigned the same HV gradients. For non-dominated learners with one or more identical losses (which can occur in training with three or more losses), the left- and right-sided limits of the HV function derivatives are not the same [8], and they are set to zero. Non-dominated sorting is implemented based on [4]. Source code is added to the supplementary material. We experimentally tested our approach for two and three objectives, but the algorithms for HV and HV gradient computations also extend to more objectives.

Figure 2: Multi-objective regression on two and three losses. (a) HV for sets of networks and losses over training iterations. (b) Network outputs for X ∈ [0, 2π]. (c) Generated Pareto front estimates for a selection of samples in loss space.
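To illustrate how the pieces fit together, the sketch below runs the HV-maximizing update on a toy bi-objective problem in which each of three learners is a single scalar parameter. It is an assumption-laden stand-in, not the paper's implementation: HV gradients are approximated by finite differences instead of [8]'s exact algorithm, and domination ranking is omitted because the three loss vectors stay mutually non-dominated here.

```python
import numpy as np

def hv_2d(pts, ref):
    # Dominated hypervolume of 2-D loss vectors (minimization), sweep in L1.
    pts = sorted(p for p in pts if p[0] < ref[0] and p[1] < ref[1])
    hv, best_l2 = 0.0, ref[1]
    for l1, l2 in pts:
        if l2 < best_l2:
            hv += (ref[0] - l1) * (best_l2 - l2)
            best_l2 = l2
    return hv

def hv_loss_weights(loss_vectors, ref, eps=1e-6):
    # Finite-difference stand-in for the exact HV gradient algorithm of [8]:
    # weight[i, j] = -dHV/dL_j(theta_i), normalized to unit length per learner.
    L = np.asarray(loss_vectors, dtype=float)
    w = np.zeros_like(L)
    base = hv_2d(L.tolist(), ref)
    for i in range(L.shape[0]):
        for j in range(L.shape[1]):
            pert = L.copy()
            pert[i, j] += eps
            w[i, j] = -(hv_2d(pert.tolist(), ref) - base) / eps
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return np.divide(w, norms, out=np.zeros_like(w), where=norms > 0)

# Toy problem: each learner is one scalar parameter theta_i with conflicting
# losses L1 = (theta - 1)^2 and L2 = (theta + 1)^2; the Pareto set is [-1, 1].
theta = np.array([-1.5, 0.0, 1.5])
for _ in range(300):
    L = np.stack([(theta - 1.0) ** 2, (theta + 1.0) ** 2], axis=1)
    w = hv_loss_weights(L, ref=(10.0, 10.0))
    # Gradient descent on the dynamic loss sum_j w_ij * L_j(theta_i).
    grad = w[:, 0] * 2.0 * (theta - 1.0) + w[:, 1] * 2.0 * (theta + 1.0)
    theta -= 0.02 * grad
```

After a few hundred steps the three parameters settle at distinct trade-offs inside the Pareto set [-1, 1], i.e., the learners spread out over the front instead of collapsing onto a single trade-off.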
The published time complexities of the different steps in calculating HV-maximizing gradients for n losses and p solutions are as follows: O(np^2) for non-dominated sorting [4]; O(p^(n-2) log p) for HV computation of p non-dominated solutions if n > 2, and O(p) for HV computation for n = 2 after sorting in one loss [10]; O(p log p) for calculating the HV gradients ∂HV(L̄(Θ_q(i), S))/∂L̄_j(θ_i, S) for two and three losses; and O(p^2) for HV gradient calculation for four losses [8]. Note that the latter two complexities assume specialized non-dominated sorting and HV computation subroutines that we did not implement. Overall, for moderate values of p and n ≤ 4, this means only little additional computational load compared to computing the loss gradients for neural network training, which gives an HV maximization-based approach an edge over competing approaches.

Experiments

In the following sections, we describe experiments on three different MO problems: a simple MO regression example, multi-observer medical image segmentation, and a neural style transfer optimization problem. The learners for the regression and segmentation problems were parameterized by neural network weights using the PyTorch [27] framework. In neural style transfer, the pixels of a target image are optimized.
Multi-objective regression

To illustrate our proposed approach for two and three losses, we begin with an artificial MO learning example. Consider three conflicting objectives: given a sample x_k from input variable X ∈ [0, 2π], predict the corresponding output z_k that matches y_k^(j) from target variable Y_j, where X and Y_j are related as follows:

Y_1 = cos(X),  Y_2 = sin(X),  Y_3 = sin(X + π)

The corresponding loss functions are L_j = MSE_j = (1/|S|) Σ_{k=1}^{|S|} (y_k^(j) - z_k)^2. We generated 200 samples of input and target variables for training and validation each. Validation samples were equally spaced in [0, 2π]. For both the two- and three-objective cases, we trained five neural networks for 20000 iterations, each with two fully connected linear layers of 100 neurons followed by ReLU nonlinearities. The reference point was set to (20, 20, 20) for sufficient distance to all networks in loss space.

Figure 2a shows the HV and losses over training iterations for the sets of networks. The HV stabilizes visibly and each network picks a loss trade-off. Figure 2b shows predictions for validation samples evenly sampled from [0, 2π]. The predictions from the five neural networks constitute the Pareto front approximations for each sampled x_k and correspond to precise estimates of cos(X), sin(X), sin(X + π) (in the case of three losses), and trade-offs between the target functions. Figure 2c shows these Pareto front estimates in loss space (only a selection of outputs is shown to simplify visualization). It becomes clear from Figures 2b and 2c that each x_k has a differently sized Pareto front, which the networks are able to estimate. The Pareto fronts for samples corresponding to x = π/4 (and x = 5π/4) reduce to a single point in the case of two losses because cos(X) and sin(X) are equal there. The Pareto fronts for the three losses in this example never reduce to a single point because the three target functions never coincide for any x.

Multi-observer medical image segmentation

Multi-observer medical image segmentation pertains to learning automatic segmentation based on delineations provided by multiple expert observers, which may be conflicting due to inter-observer variability [38, 42]. We applied our MO learning approach to the multi-observer medical image segmentation scenario mentioned in [7]. The dataset [34] contains Magnetic Resonance Imaging (MRI) scans of prostate regions of 32 patients. The original single-observer delineations are systematically perturbed to simulate different styles of delineation. We generate a bi-observer learning scenario from this dataset (Figure 3a), where the two observer delineations disagree in the extent of the prostate region. We trained five neural networks for 10000 iterations to minimize soft Dice losses with respect to the delineations provided by the two observers. The well-known U-Net [30] architecture was used for the neural networks.

Figure 3: Multi-observer medical image segmentation. (a) The delineations from Observer 2 consistently have a prostate region under-segmented by 10 pixels compared to Observer 1. (b) Predictions from two out of five neural networks follow one delineation style each; the rest of the predictions partially match both delineation styles. (c) Average Pareto front approximations on the training and validation data from 50 Monte-Carlo cross-validation runs.

The predictions from the five neural networks trained by our HV maximization approach on a representative validation sample (Figure 3b) visibly follow different trade-offs of agreement between the two delineation styles. The average Pareto front approximations (represented by mean soft Dice loss) for the validation data from 50 Monte-Carlo cross-validation runs with an 80:20 split are shown in Figure 3c. The results show that the proposed approach trains the neural networks according to fixed trade-offs distributed uniformly across the trade-off front. Further, the diversification of the trade-offs is also maintained on unseen data, as indicated by the cross-validation performance of each neural network.
Neural style transfer

We further apply our approach to the problem of style transfer, i.e., transferring the artistic style of an image onto a target image while preserving its semantic content. Users likely cannot provide their preferred trade-off between style and content without seeing the resulting images. Providing an estimated Pareto front is thus a useful tool in aiding decision-making.

We selected the problem definition by [11], where the pixels of an image are optimized to minimize a weighted combination of content loss (semantic similarity with the target image) and style loss (artistic similarity with the style image). The content loss and the style loss are computed from features of a pretrained VGG network [33]. We reused and adjusted PyTorch's neural style transfer implementation [15] of this two-objective problem into a three-objective optimization problem with three distinct style losses (using the content image only to initialize the optimized images). In the presented example, the pixels of six target images are optimized using the proposed HV maximization approach. We maximize the images' HV of style losses so that they approach the Pareto front, resulting in images with diverse trade-offs over the three style losses. To tune hyperparameters, a grid search is performed over the learning rate and the parameters of the Adam optimizer. Tuning is performed on three training image sets, each containing three style images and one content image. The images are mostly collected from WikiArt [1] and are in the public domain or available under fair use. The reference point is set to (10, 10, 10) based on preliminary experiments.

Figure 4: Neural style transfer of three styles. (a) Pareto front approximation of generated images. The T-shape approximately reflects the style loss ordering: images in the top left have the lowest style loss with Cole's View Across Frenchman's Bay, the top right corresponds to Picasso's Fanny Tellier, and the bottom corresponds to Hokusai's Kajikazawa in Kai Province. (b) HV of the set of images and style losses per image for each optimization iteration. (c) Pareto front approximation in loss space.

Figure 4 shows the Pareto front approximation with six images after HV optimization. The example was selected for its aesthetic appeal. Three solutions are close to the distinct artistic styles, and the others are mixes of different styles with trade-offs between the style losses. Viewing the images in loss space (Figure 4c) demonstrates that the images are diverse and clearly dispersed from each other.
Discussion

We adapted the gradient-based hypervolume maximization approach from multi-objective optimization for the goal of learning trade-offs in the presence of multiple losses. We experimented with our approach on two multi-objective learning cases with neural networks and one neural multi-style transfer optimization case.

The main added value of our proposed approach is the capability to automatically, and in a single run, configure a set of learners so that they jointly predict a trade-off curve for each sample, without prior need of user-specified preference vectors. In this way, the proposed approach is truly the machine learning version of a posteriori decision-making in the presence of multiple objectives. Furthermore, we demonstrated through experiments on different multi-objective problems that our HV maximization approach indeed finds well-spread solutions on the Pareto front.

In our current implementation, a separate learner is trained for each trade-off. This increases the computational load linearly if more options on the Pareto front are desired. We chose this setup for the sake of simplicity in experimentation and to demonstrate a proof-of-concept with clarity. We used neural networks in our experiments. It is expected that the HV maximization formulation would work similarly if the parameters of some of the neural network layers were shared, which would decrease the computational load.

Open questions are whether the simplification described in Equation (6) has significant limitations, e.g., for concave fronts, compared to the original problem formulation in Equation (5), and whether the optimal learners for Equation (6) always attain the maximal mean HV over the training set in Equation (1). Although it seems intuitive that each learner is fixed to a single trade-off, this might fail when the distribution of trade-offs needs to be different for each sample. The approach needs to be tested on a variety of other problems to gain further insights in this direction. Lastly, though the experiments in this paper focus on training neural networks multi-objectively, we believe that the scope of HV maximization to learn predictions on the Pareto front extends to a wide range of machine learning methods.
Acknowledgements
We would like to thank Dr. Marco Virgolin from Chalmers University of Technology for his valuable contributions and discussions on concept and code. The research is part of the Open Technology Programme with project number 15586, which is financed by the Dutch Research Council (NWO), Elekta, and Xomnia. Further, the work is co-funded by the public-private partnership allowance for top consortia for knowledge and innovation (TKIs) from the Ministry of Economic Affairs.
References

[1] WikiArt: Visual art encyclopedia. https://www.wikiart.org.
[2] arXiv preprint arXiv:1901.08680, 2019.
[3] Brendan Avent, Javier Gonzalez, Tom Diethe, Andrei Paleyes, and Borja Balle. Automatic discovery of privacy–utility Pareto fronts. Proceedings on Privacy Enhancing Technologies, 2020(4):5–23, 2020.
[4] Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2):182–197, 2002.
[5] Timo M. Deist, Stefanus C. Maree, Tanja Alderliesten, and Peter A. N. Bosman. Multi-objective optimization by uncrowded hypervolume gradient ascent. In International Conference on Parallel Problem Solving from Nature, pages 186–200. Springer, 2020.
[6] Jingda Deng and Qingfu Zhang. Approximating hypervolume and hypervolume contributions using polar coordinate. IEEE Transactions on Evolutionary Computation, 23(5):913–918, 2019.
[7] Arkadiy Dushatskiy, Adriënne M. Mendrik, Peter A. N. Bosman, and Tanja Alderliesten. Observer variation-aware medical image segmentation by combining deep learning and surrogate-assisted genetic algorithms. In Ivana Išgum and Bennett A. Landman, editors, Medical Imaging 2020: Image Processing, volume 11313, pages 296–306. International Society for Optics and Photonics, SPIE, 2020.
[8] Michael Emmerich and André Deutz. Time complexity and zeros of the hypervolume indicator gradient field. In EVOLVE-A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation III, pages 169–193. Springer, 2014.
[9] Mark Fleischer. The measure of Pareto optima: Applications to multi-objective metaheuristics. In International Conference on Evolutionary Multi-Criterion Optimization, pages 519–533. Springer, 2003.
[10] Carlos M. Fonseca, Luís Paquete, and Manuel López-Ibáñez. An improved dimension-sweep algorithm for the hypervolume indicator. In IEEE Congress on Evolutionary Computation, pages 1157–1163. IEEE, 2006.
[11] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[12] Daniel Golovin et al. Random hypervolume scalarizations for provable multi-objective black box optimization. arXiv preprint arXiv:2006.04655, 2020.
[13] Maoguo Gong, Jia Liu, Hao Li, Qing Cai, and Linzhi Su. A multiobjective sparse feature learning model for deep neural networks. IEEE Transactions on Neural Networks and Learning Systems, 26(12):3263–3277, 2015.
[14] Md Shahriar Iqbal, Jianhai Su, Lars Kotthoff, and Pooyan Jamshidi. FlexiBO: Cost-aware multi-objective optimization of deep neural networks. arXiv preprint arXiv:2001.06588, 2020.
[15] Alexis Jacq. Neural style transfer using PyTorch. https://pytorch.org/tutorials/advanced/neural_style_tutorial.html, 2017.
[16] Patrick Koch, Tobias Wagner, Michael T. M. Emmerich, Thomas Bäck, and Wolfgang Konen. Efficient multi-criteria optimization on noisy machine learning problems. Applied Soft Computing, 29:357–370, 2015.
[17] Sungjae Lee and Youngdoo Son. Multitask learning with single gradient step update for task balancing. arXiv preprint arXiv:2005.09910, 2020.
[18] Xi Lin, Zhiyuan Yang, Qingfu Zhang, and Sam Kwong. Controllable Pareto multi-task learning. arXiv preprint arXiv:2010.06313, 2020.
[19] Xi Lin, Hui-Ling Zhen, Zhenhua Li, Qing-Fu Zhang, and Sam Kwong. Pareto multi-task learning. In Advances in Neural Information Processing Systems, pages 12060–12070, 2019.
[20] Xiao Lin, Hongjie Chen, Changhua Pei, Fei Sun, Xuanji Xiao, Hanxiao Sun, Yongfeng Zhang, Wenwu Ou, and Peng Jiang. A Pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In Proceedings of the 13th ACM Conference on Recommender Systems, pages 20–28, 2019.
[21] Pingchuan Ma, Tao Du, and Wojciech Matusik. Efficient continuous Pareto exploration in multi-task learning. In International Conference on Machine Learning, pages 6522–6531. PMLR, 2020.
[22] Debabrata Mahapatra and Vaibhav Rajan. Multi-task learning with user preferences: Gradient descent with controlled ascent in Pareto optimization. In International Conference on Machine Learning, pages 6597–6607. PMLR, 2020.
[23] Yuren Mao, Shuang Yun, Weiwei Liu, and Bo Du. Tchebycheff procedure for multi-task text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4217–4226, 2020.
[24] Shaobo Min, Hantao Yao, Hongtao Xie, Zheng-Jun Zha, and Yongdong Zhang. Multi-objective matrix normalization for fine-grained visual recognition. IEEE Transactions on Image Processing, 29:4996–5009, 2020.
[25] Conrado S. Miranda and Fernando J. Von Zuben. Single-solution hypervolume maximization and its use for improving generalization of neural networks. arXiv preprint arXiv:1602.01164, 2016.
[26] Aviv Navon, Aviv Shamsian, Gal Chechik, and Ethan Fetaya. Learning the Pareto front with hypernetworks. arXiv preprint arXiv:2010.04104, 2020.
[27] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems-W, 2017.
[28] Fabrice Poirion, Quentin Mercier, and Jean-Antoine Désidéri. Descent algorithm for nonsmooth stochastic multiobjective optimization. Computational Optimization and Applications, 68(2):317–331, 2017.
[29] Salvatore D. Riccio, Deyan Dyankov, Giorgio Jansen, Giuseppe Di Fatta, and Giuseppe Nicosia. Pareto multi-task deep learning. In
InternationalConference on Artificial Neural Networks , pages 132–141. Springer, 2020.[30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolu-tional networks for biomedical image segmentation. In
International Con-ference on Medical Image Computing and Computer-Assisted Intervention ,pages 234–241. Springer, 2015.[31] Sebastian Ruder. An overview of multi-task learning in deep neural net-works. arXiv preprint arXiv:1706.05098 , 2017.[32] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objectiveoptimization.
Advances in Neural Information Processing Systems , 31:527–538, 2018. 1533] Karen Simonyan and Andrew Zisserman. Very deep convolutional networksfor large-scale image recognition. In
International Conference on LearningRepresentations , 2015.[34] Amber L Simpson, Michela Antonelli, Spyridon Bakas, Michel Bilello, Key-van Farahani, Bram Van Ginneken, Annette Kopp-Schneider, Bennett ALandman, Geert Litjens, Bjoern Menze, et al. A large annotated medicalimage dataset for the development and evaluation of segmentation algo-rithms. arXiv preprint arXiv:1902.09063 , 2019.[35] Shinya Suzuki, Shion Takeno, Tomoyuki Tamura, Kazuki Shitara, andMasayuki Karasuyama. Multi-objective Bayesian optimization usingPareto-frontier entropy. In
International Conference on Machine Learn-ing , pages 9279–9288. PMLR, 2020.[36] Sara Tari, Holger Hoos, Julie Jacques, Marie-El´eonore Kessaci, and LaetitiaJourdan. Automatic configuration of a multi-objective local search forimbalanced classification. In
International Conference on Parallel ProblemSolving from Nature , pages 65–77. Springer, 2020.[37] Kristof Van Moffaert and Ann Now´e. Multi-objective reinforcement learn-ing using sets of Pareto dominating policies.
The Journal of Machine Learn-ing Research , 15(1):3483–3512, 2014.[38] Geert M Villeirs, Koen Van Vaerenbergh, Luc Vakaet, Samuel Bral, FilipClaus, Wilfried J De Neve, Koenraad L Verstraete, and Gert O De Meerleer.Interobserver delineation variation using ct versus combined ct+ mri inintensity–modulated radiotherapy for prostate cancer.
Strahlentherapie undOnkologie , 181(7):424–430, 2005.[39] Hao Wang, Andr´e Deutz, Thomas B¨ack, and Michael Emmerich. Coderepository: Hypervolume indicator gradient ascent multi-objective opti-mization. https://github.com/wangronin/HIGA-MO.[40] Hao Wang, Andr´e Deutz, Thomas B¨ack, and Michael Emmerich. Hyper-volume indicator gradient ascent multi-objective optimization. In
Inter-national Conference on Evolutionary Multi-Criterion Optimization , pages654–669. Springer, 2017.[41] Hao Wang, Yiyi Ren, Andr´e Deutz, and Michael Emmerich. On steeringdominated points in hypervolume indicator gradient ascent for bi-objectiveoptimization. In
NEO 2015 , pages 175–203. Springer, 2017.[42] EA White, KK Brock, DA Jaffray, and CN Catton. Inter-observer variabil-ity of prostate delineation on cone beam computerised tomography images.
Clinical oncology , 21(1):32–38, 2009.[43] Jie Xu, Yunsheng Tian, Pingchuan Ma, Daniela Rus, Shinjiro Sueda, andWojciech Matusik. Prediction-guided multi-objective reinforcement learn-ing for continuous robot control. In
International Conference on MachineLearning , pages 10607–10616. PMLR, 2020.1644] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Haus-man, and Chelsea Finn. Gradient surgery for multi-task learning. arXivpreprint arXiv:2001.06782arXivpreprint arXiv:2001.06782