Finding teams that balance expert load and task coverage
Sofia Maria Nikolakaki, Mingxiang Cai, Evimaria Terzi
Department of Computer Science, Boston University
{smnikol,marcocai,evimaria}@bu.edu
Abstract
The rise of online labor markets (e.g., Freelancer, Guru and Upwork) has ignited a lot of research on team formation, where experts with different skills form teams to complete tasks. The core idea in this line of work has been the strict requirement that the team of experts assigned to complete a given task should contain a superset of the skills required by the task. However, in many applications the required skills are often a wishlist of the entity that posts the task, and not all of the skills are absolutely necessary. Thus, in our setting we relax the complete-coverage requirement and allow tasks to be partially covered by the formed teams, assuming that the quality of task completion is proportional to the fraction of covered skills per task. At the same time, we assume that when multiple tasks need to be performed, the smaller the load of an expert the better the performance. We combine these two high-level objectives into one and define the BalancedTA problem. We also consider a generalization of this problem where each task consists of required and optional skills. In this setting, our objective is the same, under the constraint that all required skills must be covered. From the technical point of view, we show that the BalancedTA problem (and its variant) is NP-hard and design efficient heuristics for solving it in practice. Using real datasets from three online labor markets, Freelancer, Guru and Upwork, we demonstrate the efficiency of our methods and the practical utility of our framework.
1 Introduction

Creating effective teams by combining experts with diverse skills allows organizations to successfully complete tasks from different domains, while also helping individual experts position themselves in highly competitive job markets. The rise of online labor markets, like Freelancer, Guru and Upwork, has motivated a significant amount of research on the problem of team formation [2, 4, 8, 9, 11, 16, 18]. Although many variants of this problem have been considered, they all rely on the same model of coverage of tasks by experts. According to this model, both tasks and experts are characterized by a set of skills. Each task consists of a set of skills required for its completion, and each expert possesses a set of skills. A team consisting of one or more experts completes a task if the union of the skills of the team's experts covers all the required skills of the task. In all existing work, complete coverage of tasks by the formed teams is required.

However, there are cases where tasks are described generically and require teams to have strong background in many areas, whereas clearly not all these areas can (or need to) be simultaneously covered. Oftentimes, the list of required skills is more like the wishful thinking of the entity that posts the task, and not all of the skills need to be covered for its completion. For example, think of a post that describes potential cluster hires for an academic institution. Moreover, when looking at job posts in online labor markets, oftentimes the skills required by the posted tasks are repetitive. An example of a job post in guru.com is: basic, oracle, html, java, javascript, mysql, css, sql, http, ajax, mvc, architecture, jquery, software, software development, web development, developer, web developer. Another example from freelancer.com is:
Advertising, Facebook Marketing, Internet Marketing, Marketing, Social Networking. In these examples the task descriptions could be repetitive; someone who is good at Marketing is also good at Facebook and/or Internet marketing. Therefore, in such cases not all skills of a task need to be covered.

Motivated by these settings, we relax the requirement of completely covering a task and we allow for partial task coverage by the formed teams. However, we assume that the quality of task completion is proportional to the fraction of covered skills.

Another important factor for the quality of task completion is the load of the experts. For example, an expert assigned to several tasks may be too busy to devote a lot of effort to each one of them and consequently underperform [1]. Thus, our second objective is to avoid loading the experts with many tasks.

We combine these two objectives as follows: given a set of tasks, form one team per task such that the tasks are partially covered, while at the same time, the maximum number of teams an expert participates in is kept as small as possible. Thus, our goal is to both fulfill the tasks as much as possible and not overload the experts involved with the formed teams.

Formally, our goal is the following: given a set of k tasks J and a set of experts P, form k teams Q = {Q_1, ..., Q_k}, one for each task, such that λL(Q) + C(Q, J) is minimized. In this expression, L(Q) denotes the maximum number of teams an expert participates in and C(Q, J) denotes the sum of the fractions of uncovered skills per task. Both these components are quantities that we aim to minimize, while λ is a trade-off parameter that controls the importance of each objective. We call the above problem the BalancedTA problem. 
Note that the problem definition combines the two seemingly unrelated objectives into a single objective, and asks the algorithm to find the right balance between the two, according to the trade-off parameter λ, without placing a hard constraint on either of the objectives.

For the BalancedTA problem we show that there are values of λ for which the problem is NP-hard, and we design algorithms for solving it efficiently in practice. The effectiveness and the efficiency of our algorithms is shown in our experimental evaluation with datasets from three major online labor markets.

We also note that in many applications, there may be tasks that have both absolutely required and optional skills. For instance, a task might require workers with expertise in Facebook Marketing and Advertising and optionally broader knowledge in the areas of Internet Marketing, Marketing and Social Networking. Thus, we also consider a generalization of BalancedTA where there is a hard constraint on the coverage of the required skills for every task. Not only do we show that this variant of the problem is also NP-hard, but also that the same algorithms we designed for BalancedTA can be used for this version of the problem as well, with minimal changes.

Our contributions are summarized as follows:

• We define the BalancedTA problem, which tries to find teams for a set of tasks such that both the coverage of task requirements (in terms of skills) is maximized and the load of every individual worker (in terms of the number of teams she participates in) is minimized. We combine these two requirements in a single objective and use a trade-off parameter λ to control the importance of each.

• We study the computational complexity of the BalancedTA problem and show that while there are cases for which it is trivial, there are also cases for which it is NP-hard.

• We design a set of heuristics for solving BalancedTA in practice.

• We study a variant of the BalancedTA problem where the set of skills of each task is split into required and optional. We show that this version is NP-hard and also demonstrate that our algorithms for BalancedTA can be applied to it with few modifications.

• Finally, using three datasets from real online labor markets, we test the practical utility and the efficiency of our methods.
Roadmap:
Section 2 reviews the related work. We present our problem in Section 3 and our algorithms in Section 4. In Section 5 we evaluate the performance of our methodology using real datasets. We conclude the paper in Section 6.
2 Related Work

Recent studies raise the importance of team formation in different settings [13, 21]. To the best of our knowledge, we are the first to introduce the BalancedTA problem, where we simultaneously optimize for the coverage of the task requirements and the load of the experts. However, our work is related to existing work on team formation, as described below.
Team formation in networks of experts: Lappas et al. [11] were the first to introduce the notion of team formation in the setting of a social network. Given a network of experts with skills, their goal is to find a team that collectively covers all the requirements of a single task, while establishing small communication cost (in terms of the network) between the team members. A series of subsequent works extended this work in different directions [2, 4, 8, 9, 10, 12, 14, 15, 16, 18, 22]. All the aforementioned works share two common assumptions: (i) the experts are organized in a network that quantifies how well they can work together and (ii) all the required skills of the tasks need to be covered by the formed teams. Our model does not assume the existence of a network among the experts, and the tasks need not be fully covered. Therefore, the computational problem that we are solving is different from the ones above.

Team formation with load balancing: Anagnostopoulos et al. [1] were the first to consider minimizing the load of experts in the online setting, where a stream of tasks arrives and experts form teams in order to cover all the required skills of each task. The offline version of this problem resembles the load-balancing requirement of our problem. However, our work allows partial task coverage, while Anagnostopoulos et al. [1] form teams that entirely cover the requirements of a task. Moreover, our framework also provides the flexibility of defining a desirable trade-off between the two costs and creates effective teams based on the importance of each.
Multiple-task coverage: A key characteristic of our work is that we consider the offline setting, where there are multiple tasks, known a priori, and a team is formed for each one of them. The offline versions of Anagnostopoulos et al. [1, 2], the work of Golshan et al. [6], as well as the recent work of Barnabò et al. [3] consider multiple tasks and multiple teams. However, contrary to our setting, all the above works require that all skills required by the sequence of tasks are completely covered.
Team formation with partial coverage: Probably the closest to our work is that of Dorn and Dustdar [5], which introduces a multi-objective team composition problem with two objectives: skill coverage and communication cost. Their goal is to identify the best balance between the two costs. For this purpose, they use a set of heuristics that self-adjust a trade-off parameter to decide team configurations. In our setting, we do not consider the communication cost, but the workload of the experts. Moreover, our algorithms focus on allocating experts to teams based on a user-defined trade-off between load and coverage, whereas Dorn and Dustdar focus on finding a "best" trade-off between connectivity and coverage, where the notion of "best" is defined in a rather ad hoc manner. Finally, although they touch upon the issue of partial coverage, they focus on data extraction rather than algorithm design.
3 Problem Definition

This section provides the notation used throughout the paper and presents the formal definition and the complexity analysis of the problem that we study.

Throughout the discussion, we consider a set of m skills S, a set of n experts P = {P_i : i = 1, ..., n} and a set of k tasks J = {J_j : j = 1, ..., k}. In this setting, every expert and every task is a subset of the skills, i.e., P_i ⊆ S and J_j ⊆ S, respectively. To complete a task we need to assign a team of experts to it. We let Q_j ⊆ P denote the team assigned to the j-th task. For k tasks, we form k teams Q = {Q_1, ..., Q_k}. We call Q the team assignment for tasks J. For each team Q_j we compute its skill profile Cov(Q_j), representing the union of the skills of its members. That is, Cov(Q_j) = ∪_{i ∈ Q_j} P_i.

Load cost (L): An important quantity is the load of a person, which is the number of tasks the person is assigned to. That is, for person P the load of P is L(P, Q) = |{j : P ∈ Q_j}|. We are interested in the maximum load among all experts, i.e., L(Q) = max_{P ∈ P} L(P, Q).

Incompleteness cost (C): Given a task J and a team Q ⊆ P assigned to it, we define the incompleteness cost of Q with respect to J to be the fraction of the required skills that are not covered by the team's skill profile. That is, F(Q, J) = |J \ Cov(Q)| / |J|. Intuitively, our goal is to minimize the incompleteness cost, since we want the assigned team to cover as many of the skills required by the corresponding task as possible. Thus, we define the total incompleteness cost of a team assignment Q to be C(Q, J) = Σ_j F(Q_j, J_j).

Team-assignment cost (B): Given a trade-off parameter λ, the cost of a team assignment Q for a set of tasks J is denoted by B(Q, J, λ) and is a linear combination of the maximum workload and the incompleteness cost defined above. That is, B(Q, J, λ) = λL(Q) + C(Q, J). The trade-off parameter λ provides an easy way to control the relative importance of each objective: λ = 0 ignores the workload, and conversely λ > k, where k is the number of tasks, ignores the incompleteness cost. In our experiments, we consider different values of λ ∈ R+ and discuss our findings.

The BalancedTA problem
We can now define the main problem addressed in this paper:
Problem 1 (BalancedTA). Given a set of k tasks J = {J_1, ..., J_k}, a set of n experts P = {P_1, ..., P_n} and a real non-negative value λ ∈ R+, find a team assignment Q = {Q_1, ..., Q_k} consisting of k teams, such that team Q_i is associated with task J_i and B(Q, J, λ) is minimized.

The case λ = 0 is trivial: when λ = 0 the optimal solution assigns all workers to all tasks, leaving this way the minimum number of non-covered skills. On the other hand, when λ > k, where k is the number of tasks, we prove that only the workload matters and thus the trivial solution of not assigning any experts to tasks is the best strategy. This is summarized in the following lemma.

Lemma 3.1. For a set of k tasks and λ > k, the optimal solution of the BalancedTA problem is the one that leaves all tasks completely uncovered; i.e., Q = ∅.

Proof. Assume for the sake of contradiction an optimal solution Q* ≠ ∅ with corresponding workload L(Q*) ≥ 1. By the definition of incompleteness we know that 0 ≤ C(Q*) ≤ k, and therefore B(Q*, J, λ) > k for λ > k. However, there exists a solution Q with corresponding workload L(Q) = 0 whose team-assignment cost is exactly k, i.e., B(Q, J, λ) = k. By the definition of load this solution can only be Q = ∅, which contradicts the initial assumption.

Therefore, we consider problem instances where λ takes values in the range (0, k], where k is the number of tasks. For a subset of these values of λ we can prove that the BalancedTA problem is NP-hard. More specifically, we have the following complexity result.
Theorem 3.2. For a set of k tasks, the BalancedTA problem is NP-hard for 0 < λ < 1/(kN), with N being the cardinality of the largest task.

Proof. For the rest of the proof, we refer to the BalancedTA problem for the case 0 < λ < 1/(kN). We reduce an instance of the NP-hard balanced task covering problem [1] to the BalancedTA problem.

An instance of balanced task covering consists of a pool of experts P and tasks J, and asks for a set of teams Z, one team for each task, such that the maximum workload of a worker is minimized and all tasks in J are completely covered. We transform an instance of balanced task covering into an instance of BalancedTA by setting P and J to be the experts and the tasks, respectively, of the BalancedTA problem. We now claim that for 0 < λ < 1/(kN), Q is a solution to the BalancedTA problem if and only if it is also a solution to the balanced task covering problem.

To see this, consider the following. If Q is the solution to the balanced task covering problem with load L(Q), then Q is also a solution for BalancedTA with load L(Q) and incompleteness C(Q) = 0; this is because Q covers all skill requirements in the balanced task covering problem.

Conversely, let Q be a solution of BalancedTA for 0 < λ < 1/(kN). We will show that Q is also a solution for the balanced task covering problem by claiming that for any 0 < λ < 1/(kN) the solution of BalancedTA always yields C(Q) = 0. Intuitively, adding more workload to the experts should always be preferred to leaving any of the required skills unsatisfied. This holds whenever the cost of the largest possible workload, which is assigning one or more experts to all tasks (L_max = k), multiplied by the trade-off parameter λ, is less than the smallest possible incompleteness cost, which is leaving one skill of the task with the largest cardinality uncovered, i.e., 1/N. This is true for λ < 1/(kN).

The R-BalancedTA problem
A natural variant of BalancedTA is one where some skills of a task are required while others are optional. In this variant, each task J_i has a set of required skills J_i^r and a set of optional skills J_i^o, such that J_i = J_i^r ∪ J_i^o and J_i^r ∩ J_i^o = ∅. The required skills have to be covered, while the optional skills behave as before. This problem variant is formally defined as follows:

Problem 2 (R-BalancedTA). Given a set of k tasks J = {J_1, ..., J_k}, with J^r = {J_1^r, ..., J_k^r} and J^o = {J_1^o, ..., J_k^o}, a set of n experts P = {P_1, ..., P_n} and a real non-negative value λ ∈ R+, find k teams Q = {Q_1, ..., Q_k}, such that B(Q, J^o, λ) is minimized and C(Q, J^r) = 0.

From the complexity viewpoint we have the following result.

Theorem 3.3. The R-BalancedTA problem is NP-hard.

This is because the problem is NP-hard even for the case where all skills of all tasks are required, by a reduction from the balanced task covering problem [1].
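To make the costs defined in this section concrete, the three cost functions can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are ours, not from the paper's implementation. Experts and tasks are modeled as sets of skills, and an assignment is a list mapping each task index to the set of expert indices forming its team.

```python
def load_cost(assignment, n_experts):
    """L(Q): maximum number of teams any single expert participates in."""
    loads = [0] * n_experts
    for team in assignment:
        for expert in team:
            loads[expert] += 1
    return max(loads) if loads else 0

def incompleteness_cost(assignment, tasks, experts):
    """C(Q, J): sum over tasks of the fraction of uncovered skills."""
    total = 0.0
    for team, task in zip(assignment, tasks):
        covered = set().union(*(experts[i] for i in team)) if team else set()
        total += len(task - covered) / len(task)
    return total

def team_assignment_cost(assignment, tasks, experts, lam):
    """B(Q, J, lambda) = lambda * L(Q) + C(Q, J)."""
    return lam * load_cost(assignment, len(experts)) + \
        incompleteness_cost(assignment, tasks, experts)
```

For instance, with experts {a, b} and {b, c} and a single task {a, b, c} covered by both experts, the load is 1 and the incompleteness is 0, so the total cost is λ.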
4 Algorithms

In this section, we describe the algorithms we designed for solving the BalancedTA problem.

Algorithm 1: The ExpertGreedy algorithm.
Input: Tasks J = {J_1, ..., J_k}, Experts P = {P_1, ..., P_n}, λ
Output: Teams Q = {Q_1, ..., Q_k}
 1: score ← ∞
 2: for ℓ ∈ {1, ..., |J|} do
 3:   J′ ← J, Q′ ← ∅
 4:   for P_i ∈ P do
 5:     L_i ← TopTasks(P_i, J′, ℓ)
 6:     Q′ ← UpdateTeams(P_i, L_i, Q′)
 7:     J′ ← UpdateTasks(P_i, L_i, J′)
 8:   end for
 9:   if score > B(Q′, J, λ) then
10:     score ← B(Q′, J, λ)
11:     Q ← Q′
12:   end if
13: end for
14: return Q, score

The ExpertGreedy algorithm:
ExpertGreedy finds k candidate solutions (team assignments), one for each maximum workload value ℓ = 1, ..., k = |J|, and at the end reports the solution with the best score. To do so, for each ℓ it finds, for each expert P_i, the ℓ tasks with the fewest uncovered skills when P_i is assigned to them, and assigns P_i to the teams that correspond to those tasks. The algorithm reports the solution of the ℓ value that resulted in the smallest team-assignment cost B(Q, J, λ).

The pseudocode of ExpertGreedy is shown in Algorithm 1. We draw attention to line 5 of this pseudocode. Routine TopTasks retrieves the indexes of the ℓ tasks with the smallest fraction of uncovered skills when expert P_i is assigned to them. To find these tasks we use a binary min-heap that keeps the incompleteness cost of all tasks in sorted order. Furthermore, lines 6 and 7 perform update operations. In particular, routine UpdateTeams (line 6) assigns expert P_i to the selected ℓ tasks, while routine UpdateTasks (line 7) removes from the selected tasks the skills that are covered by expert P_i.

A natural property of ExpertGreedy is that it assigns essentially the same amount of workload to every expert. Note that, when deciding which teams to select for an expert for a specific ℓ, the algorithm does not take into account the first part of the objective function, i.e., λL(Q), since it is equal to λℓ for all experts.

The runtime complexity of ExpertGreedy is O(k²n log k + k²nm). For each maximum load and for each expert, the algorithm sorts the tasks in ascending order based on the number of uncovered skills.

The TaskGreedy algorithm:
This algorithm also finds k candidate solutions (team assignments), one for each maximum workload value ℓ = 1, ..., k = |J|, and then selects the solution with the smallest cost. However, it differs from the previous algorithm: while ExpertGreedy greedily assigns tasks to experts, TaskGreedy finds a set of "good" candidate experts for a specific task. In particular, for each ℓ, the algorithm computes for each task J_j the value of the objective when expert P_i is assigned to team Q_j, for all i = 1, ..., n. The algorithm keeps these costs in a binary min-heap data structure for running-time efficiency. After computing the costs of all experts, it removes the root of the heap and assigns the corresponding expert to team Q_j, but only if her skillset overlaps with the uncovered skills of task J_j. If the expert is assigned to the team, then all covered skills of J_j are removed. This process continues until either all skills of J_j are covered, or the remaining skills do not overlap with the skills of any of the unassigned experts. After creating Q_j, the algorithm checks if there are any experts whose loads are equal to ℓ, and removes those experts from the pool. At the end of each ℓ loop, there is a team associated with every task, and the cost B(Q, J, λ) is computed. The algorithm reports the solution with the lowest team-assignment cost.

The pseudocode of TaskGreedy is presented in Algorithm 2.

Algorithm 2: The TaskGreedy algorithm.
Input: Tasks J = {J_1, ..., J_k}, Experts P = {P_1, ..., P_n}, λ
Output: Teams Q = {Q_1, ..., Q_k}
 1: score ← ∞
 2: for ℓ ∈ {1, ..., |J|} do
 3:   P′ ← P, Q′ ← ∅
 4:   for J_j ∈ J do
 5:     L_i ← TopExperts(J_j, P′, ℓ)
 6:     Q′ ← UpdateTeams(J_j, L_i, Q′)
 7:     P′ ← UpdateExperts(P′, Q′)
 8:   end for
 9:   if score > B(Q′, J, λ) then
10:     score ← B(Q′, J, λ)
11:     Q ← Q′
12:   end if
13: end for
14: return Q, score

Routine
TopExperts (line 5) computes and returns the indexes of those experts whose skillsets cover requirements of the given task and that have the smallest objective value. Routines UpdateTeams (line 6) and UpdateExperts (line 7) perform update operations, i.e., they assign the selected experts to the team of the current task and remove from the pool of experts those with load equal to ℓ, respectively.

In contrast to ExpertGreedy, TaskGreedy does not assign the same amount of workload to every expert. In fact, some experts might not be assigned to any team at all; this is the case when there are other experts whose skillsets overlap more with the tasks.

Another difference between ExpertGreedy and TaskGreedy is their running time. In particular, the running time of
TaskGreedy is O(k²n² + k³nm + k²n log n + k²m). For each ℓ value and for each task, the algorithm sorts the experts in ascending order, based on the cost obtained after considering each of them separately, and then traverses them in the same order to allocate a team, based on the experts' overlap with the task. We improve the running time by observing that computing the objective value when considering an expert for a specific task does not require finding the total incompleteness cost of all tasks, but only how much of the specific task is covered by the expert, since the incompleteness cost of the other tasks remains constant for all experts being evaluated for that task. This brings the runtime complexity down to O(k²n² + k²nm + k²n log n + k²m). Finally, keeping a variable that stores the overall maximum load during an ℓ loop decreases the runtime complexity to O(k²nm + k²n log n + k²m).

The BestLoad algorithm:
The BestLoad algorithm is a natural extension of the Load algorithm proposed by Anagnostopoulos et al. [1] for the offline setting of the balanced task covering problem. Recall that in that problem the goal is, given a set of tasks, to find an assignment of teams to tasks so as to minimize the maximum load of the workers, subject to the constraint that all skills of all tasks are covered.

The Load algorithm has two steps. The first step optimally solves the linear-programming relaxation of the ILP formulation of the above problem (see Theorem 2 of [1]). This creates a fractional solution X̂. The second step of Load performs R rounds, with R = O(ln(T/δ)), where T = max{mk, n}, m is the number of skills, k is the number of tasks and n is the number of experts. In each round, the algorithm assigns an expert P_j to the task J_i with probability X̂_ji, independently of other rounds and of other assignments within the same round. If expert P_j was assigned to task J_i in at least one round, the algorithm adds the expert to the team Q_i. The authors show that R rounds are required to achieve complete coverage of the skills required by the tasks.

The BestLoad algorithm we propose has the same first step as Load. To take into account the trade-off parameter λ, BestLoad modifies the second step. In particular, notice that as the number of rounds increases, more experts are assigned to tasks, i.e., the load increases and the incompleteness decreases. Therefore, for larger values of λ (load becomes more important than coverage) running fewer rounds leads to a better solution. Conversely, for smaller values of λ (coverage becomes more important than load) the algorithm needs to run a number of rounds closer to R. Based on this observation, BestLoad accommodates the different values of λ by creating R solutions, one after each assignment round. Then, given a specific value of λ, it returns the solution with the smallest cost.

The runtime complexity of the first step of BestLoad depends on the method used to solve the LP relaxation. State-of-the-art LP solvers require running time polynomial in the number of constraints of the problem [7, 19]. For the coverage of all skills, O(nkm) constraints are required. The second step of the algorithm requires O(Rnk) time.
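As an illustration, the rounding step of BestLoad can be sketched as follows. This is our own minimal rendering under stated assumptions, not the authors' implementation: x_hat stands for a precomputed fractional LP solution X̂ (expert-by-task probabilities), and cost_fn for the team-assignment cost B(·, J, λ) evaluated on a candidate assignment.

```python
import random

def best_load_rounding(x_hat, rounds, cost_fn):
    """Sketch of BestLoad's second step: perform up to `rounds` randomized
    rounding rounds over x_hat[j][i] (probability that expert j joins the
    team of task i), snapshot the assignment after every round, and return
    the snapshot minimizing cost_fn."""
    n_experts = len(x_hat)
    n_tasks = len(x_hat[0])
    teams = [set() for _ in range(n_tasks)]      # assignment built so far
    best = [set(t) for t in teams]
    best_cost = cost_fn(best)
    for _ in range(rounds):
        for j in range(n_experts):
            for i in range(n_tasks):
                if random.random() < x_hat[j][i]:
                    teams[i].add(j)              # expert j joins team of task i
        candidate = [set(t) for t in teams]      # solution after this round
        c = cost_fn(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best, best_cost
```

With a large λ, cost_fn penalizes load heavily, so an early, sparse snapshot tends to win; with a small λ, later snapshots with fuller coverage win, which mirrors the discussion above.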
Observation 1. The performance of BestLoad is at least as good as the performance of Load for the BalancedTA problem.
Clearly, Observation 1 holds, since one of the solutions considered by BestLoad is the one returned by Load.

Improving the running time of ExpertGreedy and TaskGreedy: For any value of the trade-off parameter λ, continually adding more workload to the experts will increase the value of B(Q, J, λ) in two cases: (i) when all task requirements have been covered, and (ii) when the benefit from decreasing the incompleteness cost is significantly smaller than the cost of increasing the maximum load. In these two cases, we expect the first part of the objective function to grow, while the second part remains approximately constant. This observation allows us to improve the runtime complexity of ExpertGreedy and TaskGreedy by setting a maximum possible value for ℓ, namely ℓ_max, with ℓ_max < |J|. The appropriate selection of ℓ_max is a trade-off between the running time and the quality of the results.

Solving the R-BalancedTA problem:
Here, we present how we can extend the above algorithms to solve the R-BalancedTA problem. This extension is based on a pre-processing stage that accounts for the required skills that need to be covered.

Solving the R-BalancedTA problem is essentially the same as adding a preprocessing step to the algorithms discussed above. This preprocessing step makes sure that all required skills of all tasks are covered, with a relatively small maximum load among the experts. More specifically, in this step we deploy the Load algorithm proposed in [1], with the set of experts P and the set of tasks J^r as inputs. Then, we remove from each task those skills that are covered by the corresponding team members; in this way we remove all required skills and some of the optional ones that are now covered. On this new input, we run the algorithms we designed to solve BalancedTA.

The running time of the preprocessing step is dominated by the method used to solve the linear-programming relaxation in the Load algorithm. As described above, state-of-the-art LP solvers require time polynomial in the number of constraints [7, 19].
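The preprocessing pipeline can be sketched as follows. This is an illustrative sketch: cover_required (standing in for Load run on the required skills) and solve_balanced_ta (standing in for any of the BalancedTA algorithms above) are hypothetical placeholders passed in as parameters, not routines from the paper's code.

```python
def solve_r_balanced_ta(experts, required, optional, lam,
                        cover_required, solve_balanced_ta):
    """Sketch of the R-BalancedTA reduction: first cover all required
    skills, then strip every already-covered skill from each task and run
    a BalancedTA solver on the residual skills."""
    # Step 1: teams that cover all required skills (e.g., via Load [1]).
    base_teams = cover_required(experts, required)
    # Step 2: remove from each task the skills its current team covers;
    # this removes all required skills and some optional ones.
    residual = []
    for j, team in enumerate(base_teams):
        covered = set().union(*(experts[i] for i in team)) if team else set()
        residual.append((required[j] | optional[j]) - covered)
    # Step 3: solve BalancedTA on the residual tasks and merge the teams.
    extra_teams = solve_balanced_ta(experts, residual, lam)
    return [b | e for b, e in zip(base_teams, extra_teams)]
```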
5 Experiments

This section explores the practicality of our algorithms using data from three major online labor markets. Specifically, (i) we evaluate and compare the performance of our three methods, ExpertGreedy, TaskGreedy and BestLoad, to multiple baselines for the BalancedTA and R-BalancedTA problems, (ii) we showcase the impact of the trade-off parameter λ on the load and incompleteness cost of the solution, and (iii) we provide a running-time analysis of our algorithms.
Table 1: A summary of the dataset statistics.

For all our experiments we use a single-process implementation of our algorithms on a 64-bit MacBook Pro with an Intel Core i7 CPU at 2.6GHz and 16 GB RAM. We use the Gurobi optimizer [17] for linear programming. We make the code, the datasets and the chosen parameters available online.

We use data from the online labor marketplaces freelancer.com, guru.com, and upwork.com. We refer to these datasets as Freelancer, Guru, and Upwork, respectively. Table 1 exhibits statistics on the different sizes and skill properties of these datasets. In all datasets, skills of experts that are never required by any task have been removed, since they are never used. Note that Freelancer (1212 experts, 993 tasks) and Guru (6120 experts, 3195 tasks) have more experts available than posted tasks, while the reverse is true for Upwork (1500 experts, 3000 tasks). An interesting observation is that the ratio of expert skills to task skills is different in each of the three datasets.
Task skills: The Freelancer and Guru datasets include a random sample from a large pool of real tasks posted by users in these marketplaces. The Upwork dataset is a synthetic dataset obtained through a data-generation procedure similar to one used in the past [2]; a small number of experts (10%) is removed from the pool of experts in the dataset, and then subsets of their skills are repeatedly sampled to create tasks, by interpreting the union of their skills as task requirements.

Expert skills: All expert datasets used in this work are acquired from anonymized profiles of members registered in the three marketplaces. A profile includes a self-defined set of skills.
We compare the performance of our algorithms to the following baselines:

SetCover: A simple variation of the well-known greedy algorithm for SetCover [20]; for each task, the algorithm iteratively assigns to the team the expert whose skills overlap the most with the uncovered skills of the task and then removes these skills from the task. The algorithm stops either when all skills have been covered, or when none of the experts overlap with the remaining uncovered skills. The running time of this algorithm is O(knm).

BestCostGreedy: This is a variant of SetCover that takes the workload into account. The difference is that, instead of selecting the expert overlapping the most with the task, BestCostGreedy assigns the expert that improves the objective function the most. The algorithm stops when the cost cannot be decreased further. The running time of this algorithm is O(knm).

PairGreedy: PairGreedy is another intuitive greedy algorithm, which finds in each iteration the (task, expert) pair that improves the objective the most, and assigns the expert to the corresponding team. The drawback of this baseline is its runtime complexity, O(k²n²(n + m)), which prohibits us from evaluating it on the real datasets. As such, we do not report its performance. Nevertheless, even when tested on smaller datasets, it is always outperformed by the proposed algorithms.

5.3 Performance evaluation for BalancedTA
This section demonstrates the performance of the proposed algorithms compared to the baselines for the BalancedTA problem. In these experiments, we vary the trade-off parameter λ to take values in {0, 2, . . . , 10}. We select this specific range of λ values because it makes the impact of the trade-off parameter clear application-wise. However, we also show the performance of our algorithms for values of λ for which we showed that BalancedTA is NP-hard. Furthermore, we set the parameter ℓmax (the maximum number of ℓ iterations) to 80 for all experiments, because we saw that in real applications it generally leads to reasonable solutions and runtime performance.

We present the results for BalancedTA for all three datasets in Figure 1. The y-axis represents the team-assignment cost (B) of each algorithm, and the x-axis corresponds to the value of the trade-off parameter λ. Smaller values of the cost correspond to better solutions.

We observe that the performances of our algorithms and the baselines follow a similar trend, which is consistent across the different datasets. Furthermore, the baseline algorithms are clearly outperformed by our proposed approaches, with SetCover performing the worst; the only exception is λ = 0, i.e., the case that completely ignores the load of the experts. This is because SetCover always returns a team that covers all the task requirements and ignores the load of the experts. BestCostGreedy is also outperformed by our proposed algorithms. Note that for λ = 0, BestCostGreedy is able to find solutions where no requirement is left uncovered. However, as λ increases, the algorithm continues covering all of the task requirements without decreasing the workload, which leads to the linear increase of the total cost. The only exception is the Upwork dataset, Figure 1(c), where for λ = 4 the algorithm begins trading incompleteness cost for lower workload, but the total cost remains significantly larger compared to the other algorithms.

The performance of BestCostGreedy is followed by BestLoad. One observation is that BestLoad performs significantly better than the original algorithm Load. We do not present these results because, as explained in Observation 1, BestLoad always performs at least as well as Load. Recall that Load finds a single solution that optimizes the workload while covering all the task requirements, and its final solution is completely independent of the trade-off parameter λ. Therefore, since the load of the solution is constant and the incompleteness cost is 0, the team-assignment cost increases linearly with the coefficient λ.

Now, we illustrate the performance of our algorithms for values of λ for which we showed that the BalancedTA problem is NP-hard. The corresponding performances for the three datasets can be seen in Figure 1 as subplots. As expected, the closer λ is to 0 the closer the algorithmic performances are, but as λ increases, the difference in performance between our proposed algorithms and the baselines also increases. Overall, we observe that our algorithms, namely ExpertGreedy, TaskGreedy, and BestLoad, outperform the baseline algorithms; we discuss their individual trade-off and efficiency differences below.
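To make the reported objective concrete, the team-assignment cost can be sketched as follows. This is a minimal illustration assuming B = C + λ·L, with C the total number of uncovered required skills and L the maximum expert load, as the linear behavior discussed above suggests; the formal definition is given earlier in the paper.

```python
def team_assignment_cost(tasks, teams, lam):
    """Sketch of the team-assignment cost B = C + lam * L.

    tasks: list of skill sets (the requirements of each task).
    teams: one team per task; each team is a list of (expert_id, skills) pairs.
    lam:   trade-off parameter between incompleteness and load.
    """
    incompleteness = 0  # C: uncovered required skills, summed over all tasks
    load = {}           # number of tasks each expert participates in
    for required, team in zip(tasks, teams):
        covered = set()
        for expert_id, skills in team:
            covered |= skills
            load[expert_id] = load.get(expert_id, 0) + 1
        incompleteness += len(required - covered)
    max_load = max(load.values(), default=0)  # L: maximum load of any expert
    return incompleteness + lam * max_load
```

For λ = 0 the cost reduces to the incompleteness alone, which matches the observation that full-coverage baselines are competitive only at that point.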
R-BalancedTA
We perform another set of experiments to demonstrate the performance of the algorithms for the R-BalancedTA problem. In these experiments, we vary the fraction of required skills in the tasks as follows: with probability p_s we independently define each skill of every task to be a required skill; otherwise, it is considered optional. In Figure 2, we study how the algorithms and baselines perform for a range of p_s values and a fixed λ = 4. We see that the observations from the algorithmic comparison are similar to the ones made in the previous experiment for the BalancedTA problem. Note that for p_s = 0 no skill is required, and therefore the algorithms perform exactly as in the BalancedTA problem, while for p_s = 1 all skills are required and the performance of all algorithms is the same and equal to the result of the pre-processing stage.
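The random required/optional split used in these experiments can be sketched as follows (a minimal illustration; the function name is ours):

```python
import random

def split_required_optional(tasks, p_s, seed=0):
    """Independently mark each skill of every task as required with
    probability p_s; the remaining skills are considered optional."""
    rng = random.Random(seed)
    split_tasks = []
    for task in tasks:
        required = {s for s in task if rng.random() < p_s}
        split_tasks.append((required, set(task) - required))  # (required, optional)
    return split_tasks
```

At p_s = 0 every task is all-optional and the instance coincides with BalancedTA; at p_s = 1 all skills are required and must be covered by the formed teams.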
Figure 1: Team-assignment cost (B) of algorithms and baselines for λ = {0, 2, . . . , 10} and ℓmax = 80. The subplots correspond to values of λ for which BalancedTA is NP-hard. Columns correspond to different datasets: (a) Freelancer; (b) Guru; (c) Upwork.
Figure 2: Team-assignment cost (B) of algorithms and baselines for p_s = {0, 0.25, 0.5, 0.75, 1}, λ = 4 and ℓmax = 80. Columns correspond to different datasets: (a) Freelancer; (b) Guru; (c) Upwork.

Varying the trade-off parameter λ

We study the behavior of our proposed algorithms for different trade-off values between the load and the incompleteness cost. We begin by setting λ = 0, i.e., we ignore the workload and ensure complete coverage (the incompleteness cost is 0), and increase λ to observe how the trade-off between load and incompleteness cost changes. The results are shown in Figure 3. The y-axis shows the load cost, and the x-axis the incompleteness cost for the specific load.

As expected, for λ close to 0 our algorithms yield solutions with low incompleteness cost and high workload, while increasing λ changes this balance accordingly. Note that BestLoad lacks trade-off capabilities compared to ExpertGreedy and TaskGreedy. This is because the first step of the algorithm, which creates the optimal fractional solution for the balanced task-covering problem [1], is oblivious to the parameter λ. Thus, even though the second step of the algorithm weighs the trade-off parameter λ, the trade-off capabilities are restricted by the assignment probabilities created in the first step. Therefore, what we see in Figure 3 is that, for our datasets and the examined range of λ, the values of load and incompleteness achieved by the algorithm are the same except for the solution at λ = 0.

A quality of ExpertGreedy and TaskGreedy is that improving the cost in one of the two components is achieved by paying a moderate price in the other component. For instance, assume a customer using guru.com: we can set λ = 2 to create teams that would satisfy both the customer and the experts, as for a reasonable maximum load ExpertGreedy and TaskGreedy induce a very small incompleteness cost of ∼5. Now, if another customer prefers hiring few people at the cost of incompleteness, we can set λ = 4 to achieve a load of ∼15 for an incompleteness cost of ∼30, thus weighing the two components differently, yet always reasonably.
Figure 3: Trade-off between load cost (L) and incompleteness cost (C) for ℓmax = 80. The labels next to the first and last data points correspond to λ = 0 and λ = 10, respectively. The in-between points shown in the curve correspond to λ = {2, 4, 6, 8}. Columns correspond to different datasets: (a) Freelancer; (b) Guru; (c) Upwork.
Note that the baseline algorithms are omitted from this plot. This is because BestCostGreedy always maintains a very low incompleteness cost, which requires a workload that is much larger than the ones induced by the proposed algorithms. On the other hand, SetCover lacks trade-off capabilities, since its solution is always independent of λ, with 0 incompleteness cost and constant load.

Figure 3 allows us to further investigate the properties of our algorithms for BalancedTA. A first observation is that all algorithms demonstrate a smooth transition in the load and incompleteness cost as the trade-off parameter changes. Note in Figures 3(a) and 3(b) that assigning a maximum workload of 80 is enough to achieve complete task coverage. In fact, for the same datasets, even if the load decreases to ∼20, the incompleteness cost increases only a little. However, this is not the case for Figure 3(c) (Upwork), where it is clear that the load should be more than 80 (ℓmax) to reach complete coverage (0 incompleteness cost). This occurs because, in this specific dataset, both the number of experts and the average number of skills they acquire are significantly smaller than the number of tasks and the number of skills the tasks require, respectively, which requires creating large teams and utilizing the same experts many times to achieve full coverage. Even the baseline Load, which guarantees complete coverage with minimum workload cost, needs a minimum load of 548 on this dataset to accomplish full coverage.

To showcase the differences among the algorithms as depicted in this experiment, we compare
TaskGreedy with
ExpertGreedy and
BestLoad, for the
Upwork dataset (Figure 3(c)). Recall that the
TaskGreedy algorithm assigns experts to tasks based on how suitable they are for the task individually, and not within a team. Therefore, for datasets such as Upwork, where there are fewer experts and expert skills compared to tasks and task requirements, the algorithm becomes less effective, as it cannot evaluate whether a newly-added person is the best option for the whole team. However, in a dataset such as Guru (Figure 3(b)), where the experts acquire on average more skills, and a larger variety of skills, than the tasks require, we observe that TaskGreedy performs slightly better than, or the same as, the other two algorithms. This is because the skill "surplus" leaves the algorithm room for seemingly wrong local choices, as it is able to compensate for them by using the skills of some of the remaining experts.

Figure 4: Average running time (sec) of algorithms and baselines over 5 runs, in logarithmic scale, for λ = 4 and ℓmax = 80. The three bar charts correspond to the datasets: Freelancer, Guru, Upwork.

Finally, we investigate the running-time efficiency of our algorithms. Figure 4 shows the average running times for all algorithms and datasets when setting the parameter λ = 4. The running-time complexities of the algorithms are independent of λ, so its selection does not affect the running-time results. The times are averaged over 5 runs for the BalancedTA problem; the results for
R-BalancedTA are similar and omitted. We use the baselines
SetCover and
BestCostGreedy as indicators of how well our algorithms perform in terms of running time, because they have the best runtime complexity. Even though their asymptotic complexity is the same,
BestCostGreedy is slower than
SetCover. This is because the two algorithms have different stopping criteria: the former depends on the improvement of the team-assignment cost, and the latter on the coverage of the skills. Note that simply comparing the asymptotic running times of the different algorithms (see Section 4) is not sufficient. In fact, there are multiple factors we need to consider, such as constants, dominating factors that depend on the properties of the datasets, the efficiency of the implementations, etc.

In Figure 4, the datasets
Freelancer and
Guru show that
ExpertGreedy is much faster than
TaskGreedy and
BestLoad for the datasets where k < n (the y-axis is in logarithmic scale). Yet, for the dataset Upwork, where n < k, we see that even though
ExpertGreedy remains the fastest algorithm, the running time of
TaskGreedy is also very close. One possible explanation is that having fewer experts than tasks, with fewer skills on average, allows
TaskGreedy to find teams faster in this dataset; yet
TaskGreedy is consistently slower than
ExpertGreedy for all datasets. Thus, we can conclude that overall
ExpertGreedy is the most efficient of our algorithms.
In this paper, we introduced
BalancedTA, a team-formation problem where, given a collection of tasks and a pool of experts, the goal is to form teams such that each team is associated with a task and covers it as well as possible, while at the same time the maximum workload of the chosen experts is minimized. We also considered a variant of this problem where each task has a set of required skills that must be covered by the formed teams. To the best of our knowledge, we are the first to combine the coverage of tasks and the workload of experts into a single objective. We showed that our problems are NP-hard and designed efficient heuristics for solving them. Our experiments with three real-world datasets from online labor markets demonstrate the efficiency and the efficacy of our algorithms, and their superiority compared to other heuristics.

References

[1] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Power in unity: forming teams in large-scale community systems. In CIKM, 2010.
[2] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Online team formation in social networks. In WWW, 2012.
[3] G. Barnabò, A. Fazzone, S. Leonardi, and C. Schwiegelshohn. Algorithms for fair team formation in online labour marketplaces. In WWW, 2019.
[4] A. Bhowmik, V. Borkar, D. Garg, and M. Pallan. Submodularity in team formation problem. In SDM, 2014.
[5] C. Dorn and S. Dustdar. Composing near-optimal expert teams: a trade-off between skills and connectivity. In CoopIS, 2010.
[6] B. Golshan, T. Lappas, and E. Terzi. Profit-maximizing cluster hires. In SIGKDD, 2014.
[7] J. Gondzio and T. Terlaky. A computational view of interior-point methods for linear programming. Citeseer, 1994.
[8] M. Kargar and A. An. Discovering top-k teams of experts with/without a leader in social networks. In CIKM, 2011.
[9] M. Kargar, A. An, and M. Zihayat. Efficient bi-objective team formation in social networks. In ECML PKDD, 2012.
[10] M. Kargar, M. Zihayat, and A. An. Finding affordable and collaborative teams from a network of experts. In SDM, 2013.
[11] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In KDD, 2009.
[12] C.-T. Li, M.-K. Shan, and S.-D. Lin. On team formation with expertise query in collaborative social networks. KAIS, 2015.
[13] L. Li and H. Tong. Network science of teams: Characterization, prediction, and optimization. In WSDM, 2018.
[14] L. Li, H. Tong, N. Cao, K. Ehrlich, Y.-R. Lin, and N. Buchler. Enhancing team composition in professional networks: Problem definitions and fast solutions. TKDE, 2017.
[15] L. Li, H. Tong, N. Cao, K. Ehrlich, Y.-R. Lin, and N. Buchler. Replacing the irreplaceable: Fast algorithms for team member recommendation. In WWW, 2015.
[16] A. Majumder, S. Datta, and K. Naidu. Capacitated team formation problem on social networks. In KDD, 2012.
[17] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015.
[18] S. S. Rangapuram, T. Bühler, and M. Hein. Towards realistic team formation in social networks based on densest subgraphs. In WWW, 2013.
[19] D. A. Spielman and S.-H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.
[20] V. V. Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.
[21] X. Wang, Z. Zhao, and W. Ng. A comparative study of team formation in social networks. In DASFAA, 2015.
[22] X. Yin, C. Qu, Q. Wang, F. Wu, B. Liu, F. Chen, X. Chen, and D. Fang. Social connection aware team formation for participatory tasks.