Finding teams that balance expert load and task coverage
Sofia Maria Nikolakaki, Mingxiang Cai, Evimaria Terzi
Department of Computer Science, Boston University
{smnikol,marcocai,evimaria}@bu.edu
Abstract
The rise of online labor markets (e.g., Freelancer, Guru and Upwork) has ignited a lot of research on team formation, where experts with different skills form teams to complete tasks. The core idea in this line of work has been the strict requirement that the team of experts assigned to complete a given task should contain a superset of the skills required by the task. However, in many applications the required skills are often a wishlist of the entity that posts the task, and not all of the skills are absolutely necessary. Thus, in our setting we relax the complete-coverage requirement and allow tasks to be partially covered by the formed teams, assuming that the quality of task completion is proportional to the fraction of covered skills per task. At the same time, we assume that when multiple tasks need to be performed, the smaller the load of an expert the better the performance. We combine these two high-level objectives into one and define the BalancedTA problem. We also consider a generalization of this problem where each task consists of required and optional skills. In this setting, our objective is the same, under the constraint that all required skills must be covered. From the technical point of view, we show that the BalancedTA problem (and its variant) is NP-hard and design efficient heuristics for solving it in practice. Using real datasets from three online labor markets, Freelancer, Guru and Upwork, we demonstrate the efficiency of our methods and the practical utility of our framework.
1 Introduction

Creating effective teams by combining experts with diverse skills allows organizations to successfully complete tasks from different domains, while also helping individual experts position themselves in highly competitive job markets. The rise of online labor markets, like Freelancer, Guru and Upwork, has motivated a significant amount of research on the problem of team formation [2, 4, 8, 9, 11, 16, 18]. Although many variants of this problem have been considered, they all rely on the same model of coverage of tasks by experts. According to this model, both tasks and experts are characterized by a set of skills. Each task consists of a set of skills required for its completion, and each expert possesses a set of skills. A team consisting of one or more experts completes a task if the union of the skills of the team's experts covers all the required skills of the task. In all existing work, complete coverage of tasks by the formed teams is required.

However, there are cases where tasks are described generically and require teams to have strong background in many areas, whereas clearly not all these areas can (or need to) be simultaneously covered. Oftentimes, the list of required skills is more like the wishful thinking of the entity that posts the task, and not all of the skills need to be covered for its completion. For example, think of a post that describes potential cluster hires for an academic institution. Moreover, when looking at job posts in online labor markets, oftentimes the skills required by the posted tasks are repetitive. An example of a job post in guru.com is: basic, oracle, html, java, javascript, mysql, css, sql, http, ajax, mvc, architecture, jquery, software, software development, web development, developer, web developer. Another example from freelancer.com is:
Advertising, Facebook Marketing, Internet Marketing, Marketing, Social Networking. In these examples the task descriptions could be repetitive; someone who is good at Marketing is also good at Facebook and/or Internet marketing. Therefore, in such cases not all skills of a task need to be covered.

Motivated by these settings, we relax the requirement of completely covering a task and we allow for partial task coverage by the formed teams. However, we assume that the quality of task completion is proportional to the fraction of covered skills.

Another important factor for the quality of task completion is the load of the experts. For example, an expert assigned to several tasks may be too busy to devote a lot of effort to each one of them and consequently underperform [1]. Thus, our second objective is to avoid loading the experts with many tasks.

We combine these two objectives as follows: given a set of tasks, form one team per task such that the tasks are partially covered, while at the same time, the maximum number of teams an expert participates in is kept as small as possible. Thus, our goal is to both fulfill the tasks as much as possible and not overload the experts involved with the formed teams.

Formally, our goal is the following: given a set of k tasks J and a set of experts P, form k teams Q = {Q_1, ..., Q_k}, one for each task, such that λL(Q) + C(Q, J) is minimized. In this expression, L(Q) denotes the maximum number of teams an expert participates in and C(Q, J) denotes the sum of the fractions of uncovered skills per task. Both these components are quantities that we aim to minimize, while λ is a trade-off parameter that controls the importance of each objective. We call the above problem the BalancedTA problem. 
Note that the problem definition combines the two seemingly unrelated objectives into a single objective, and asks the algorithm to find the right balance between the two, according to the trade-off parameter λ, without placing a hard constraint on either of the objectives.

For the BalancedTA problem we show that there are values of λ for which the problem is NP-hard, and we design algorithms for solving it efficiently in practice. The effectiveness and the efficiency of our algorithms is shown in our experimental evaluation with datasets from three major online labor markets.

We also note that in many applications, there may be tasks that have both absolutely required and optional skills. For instance, a task might require workers with expertise in Facebook Marketing and Advertising and optionally broader knowledge in the areas of Internet Marketing, Marketing and Social Networking. Thus, we also consider a generalization of BalancedTA where there is a hard constraint on the coverage of the required skills for every task. Not only do we show that this variant of the problem is also NP-hard, but also that the same algorithms we designed for BalancedTA can be used for this version of the problem as well, with minimal changes.

Our contributions are summarized as follows:

• We define the BalancedTA problem, which tries to find teams for a set of tasks such that both the coverage of task requirements (in terms of skills) is maximized and the load of every individual worker (in terms of the number of teams she participates in) is minimized. We combine these two requirements in a single objective and use a trade-off parameter λ to control the importance of each.

• We study the computational complexity of the BalancedTA problem and show that while there are cases for which it is trivial, there are also cases for which it is NP-hard.

• We design a set of heuristics for solving BalancedTA in practice.

• We study a variant of the BalancedTA problem where the set of skills of each task is split into required and optional. We show that this version is NP-hard and also demonstrate that our algorithms for BalancedTA can be applied to it with few modifications.

• Finally, using three datasets from real online labor markets, we test the practical utility and the efficiency of our methods.
Roadmap:
Section 2 reviews the related work. We present our problem in Section 3 and our algorithms in Section 4. In Section 5 we evaluate the performance of our methodology using real datasets. We conclude the paper in Section 6.
2 Related Work

Recent studies raise the importance of team formation in different settings [13, 21]. To the best of our knowledge, we are the first to introduce the BalancedTA problem, where we simultaneously optimize for the coverage of the task requirements and the load of the experts. However, our work is related to existing work on team formation, as described below.
Team formation in networks of experts: Lappas et al. [11] were the first to introduce the notion of team formation in the setting of a social network. Given a network of experts with skills, their goal is to find a team that collectively covers all the requirements of a single task, while establishing small communication cost (in terms of the network) between the team members. A series of subsequent works extended this work in different directions [2, 4, 8, 9, 10, 12, 14, 15, 16, 18, 22]. All the aforementioned works share two common assumptions: (i) the experts are organized in a network that quantifies how well they can work together and (ii) all the required skills of the tasks need to be covered by the formed teams. Our model does not assume the existence of a network among the experts, and the tasks need not be fully covered. Therefore, the computational problem that we are solving is different from the ones above.

Team formation with load balancing: Anagnostopoulos et al. [1] were the first to consider minimizing the load of experts in the online setting, where a stream of tasks arrives and experts form teams in order to cover all the required skills of each task. The offline version of this problem resembles the load-balancing requirement of our problem. However, our work allows partial task coverage, while Anagnostopoulos et al. [1] form teams that entirely cover the requirements of a task. Moreover, our framework also provides the flexibility of defining a desirable trade-off between the two costs and creates effective teams based on the importance of each.
Multiple-task coverage: A key characteristic of our work is that we consider the offline setting, where there are multiple tasks, known a priori, and a team is formed for each one of them. The offline versions of Anagnostopoulos et al. [1, 2], the work of Golshan et al. [6], as well as the recent work of Barnabò et al. [3] consider multiple tasks and multiple teams. However, contrary to our setting, all the above works require that all skills required by the sequence of tasks are completely covered.
Team formation with partial coverage: Probably the closest to our work is that of Dorn and Dustdar [5], which introduces a multi-objective team composition problem with two objectives: skill coverage and communication cost. Their goal is to identify the best balance between the two costs. For this purpose, they use a set of heuristics that self-adjust a trade-off parameter to decide team configurations. In our setting, we do not consider the communication cost, but the workload of the experts. Moreover, our algorithms focus on allocating experts to teams based on a user-defined trade-off between load and coverage, whereas Dorn and Dustdar focus on finding a "best" trade-off between connectivity and coverage, where the notion of "best" is defined in a rather ad hoc manner. Finally, although they touch upon the issue of partial coverage, they focus on data extraction rather than algorithm design.
3 Problem Definition

This section provides the notation used throughout the paper and presents the formal definition and the complexity analysis of the problem that we study.

Throughout the discussion, we consider a set of m skills S, a set of n experts P = {P_i : i = 1, ..., n} and a set of k tasks J = {J_j : j = 1, ..., k}. In this setting, every expert and every task is a subset of the skills, i.e., P_i ⊆ S and J_j ⊆ S, respectively. To complete a task we need to assign a team of experts to it. We let Q_j ⊆ P denote the team assigned to the j-th task. For k tasks, we form k teams Q = {Q_1, ..., Q_k}. We call Q the team assignment for tasks J. For each team Q_j we compute its skill profile Cov(Q_j), representing the union of the skills of its members. That is, Cov(Q_j) = ∪_{i ∈ Q_j} P_i.

Load cost (L): An important quantity is the load of a person, which is the number of tasks the person is assigned to. That is, for person P the load of P is L(P, Q) = |{j : P ∈ Q_j}|. We are interested in the maximum load among all experts, i.e., L(Q) = max_{P ∈ P} L(P, Q).

Incompleteness cost (C): Given a task J and a team Q ⊆ P assigned to it, we define the incompleteness cost of Q with respect to J to be the fraction of the required skills that are not covered by the team's skill profile. That is, F(Q, J) = |J \ Cov(Q)| / |J|. Intuitively, our goal is to minimize the incompleteness cost, since we want the assigned team to cover as many of the skills required by the corresponding task as possible. Thus, we define the total incompleteness cost of a team assignment Q to be C(Q, J) = Σ_j F(Q_j, J_j).

Team-assignment cost (B): Given a trade-off parameter λ, the cost of a team assignment Q for a set of tasks J is denoted by B(Q, J, λ) and is a linear combination of the maximum workload and the incompleteness cost defined above. That is, B(Q, J, λ) = λL(Q) + C(Q, J). The trade-off parameter λ provides an easy way to control the relative importance of each objective: λ = 0 ignores the workload, and conversely λ > k, where k is the number of tasks, ignores the incompleteness cost. In our experiments, we consider different values of λ ∈ R+ and discuss our findings.

The BalancedTA problem
We can now define the main problem addressed in this paper:
Problem 1 (BalancedTA). Given a set of k tasks J = {J_1, ..., J_k}, a set of n experts P = {P_1, ..., P_n} and a real non-negative value λ ∈ R+, find a team assignment Q = {Q_1, ..., Q_k} consisting of k teams, such that team Q_i is associated with task J_i and B(Q, J, λ) is minimized.

The case λ = 0 is trivial: when λ = 0 the optimal solution assigns all workers to all tasks, leaving this way the minimum number of non-covered skills. On the other hand, when λ > k, where k is the number of tasks, we prove that only the workload matters and thus the trivial solution of not assigning any experts to tasks is the best strategy. This is summarized in the following lemma.

Lemma 3.1. For a set of k tasks and λ > k, the optimal solution of the BalancedTA problem is the one that leaves all tasks completely uncovered; i.e., Q = ∅.

Proof. Assume for the sake of contradiction an optimal solution Q* ≠ ∅ with corresponding workload L(Q*) ≥ 1. By the definition of incompleteness we know that 0 ≤ C(Q*) ≤ k, and therefore B(Q*, J, λ) > k for λ > k. However, there exists a solution Q with corresponding workload L(Q) = 0 whose team-assignment cost is exactly k, i.e., B(Q, J, λ) = k. By the definition of load this solution can only be Q = ∅, which contradicts the initial assumption.

Therefore, we consider problem instances where λ takes values in the range (0, k], where k is the number of tasks. For a subset of these values of λ we can prove that the BalancedTA problem is NP-hard. More specifically, we have the following complexity result.
Theorem 3.2. For a set of k tasks, the BalancedTA problem is NP-hard for 0 < λ < 1/(kN), with N being the cardinality of the largest task.

Proof. For the rest of the proof, we refer to the BalancedTA problem for the case 0 < λ < 1/(kN). We reduce an instance of the NP-hard balanced task covering problem [1] to the BalancedTA problem.

An instance of balanced task covering consists of a pool of experts P and tasks J, and asks for a set of teams Z, one team for each task, such that the maximum workload of a worker is minimized and all tasks in J are completely covered. We transform an instance of balanced task covering into an instance of BalancedTA by setting P and J to be the experts and the tasks, respectively, of the BalancedTA problem. We now claim that for 0 < λ < 1/(kN), Q is a solution to the BalancedTA problem if and only if it is also a solution to the balanced task covering problem.

To see this, consider the following. If Q is the solution to the balanced task covering problem with load L(Q), then Q is also a solution for BalancedTA with load L(Q) and incompleteness C(Q) = 0; this is because Q covers all skill requirements in the balanced task covering problem.

Conversely, let Q be a solution of BalancedTA for 0 < λ < 1/(kN). We will show that Q is also a solution for the balanced task covering problem by claiming that for any 0 < λ < 1/(kN) the solution of BalancedTA always yields C(Q) = 0. Intuitively, adding more workload to the experts should always be preferred to leaving any of the required skills unsatisfied. This holds whenever the cost of the largest possible workload, which is assigning one or more experts to all tasks (L_max = k), multiplied by the trade-off parameter λ, is less than the smallest possible incompleteness cost, which is leaving one skill of the task with the largest cardinality uncovered, i.e., 1/N. This is true for λ < 1/(kN).

The R-BalancedTA problem
A natural variant of BalancedTA is one where some skills of a task are required while others are optional. In this variant, each task J_i has a set of required skills J_i^r and a set of optional skills J_i^o, such that J_i = J_i^r ∪ J_i^o and J_i^r ∩ J_i^o = ∅. The required skills have to be covered, while the optional skills behave as before. This problem variant is formally defined as follows:

Problem 2 (R-BalancedTA). Given a set of k tasks J = {J_1, ..., J_k}, with J^r = {J_1^r, ..., J_k^r} and J^o = {J_1^o, ..., J_k^o}, a set of n experts P = {P_1, ..., P_n} and a real non-negative value λ ∈ R+, find k teams Q = {Q_1, ..., Q_k}, such that B(Q, J^o, λ) is minimized and C(Q, J^r) = 0.

From the complexity viewpoint we have the following result.

Theorem 3.3. The R-BalancedTA problem is NP-hard.

This is because the problem is NP-hard even for the case where all skills of all tasks are required, by a reduction from the balanced task covering problem [1].
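To make the costs defined in this section concrete, the three cost functions can be sketched in a few lines of Python. This is an illustrative sketch only; the function and variable names are ours, not from the paper's implementation. Experts and tasks are modeled as sets of skills, and an assignment is a list mapping each task index to the set of expert indices forming its team.

```python
def load_cost(assignment, n_experts):
    """L(Q): maximum number of teams any single expert participates in."""
    loads = [0] * n_experts
    for team in assignment:
        for expert in team:
            loads[expert] += 1
    return max(loads) if loads else 0

def incompleteness_cost(assignment, tasks, experts):
    """C(Q, J): sum over tasks of the fraction of uncovered skills."""
    total = 0.0
    for team, task in zip(assignment, tasks):
        covered = set().union(*(experts[i] for i in team)) if team else set()
        total += len(task - covered) / len(task)
    return total

def team_assignment_cost(assignment, tasks, experts, lam):
    """B(Q, J, lambda) = lambda * L(Q) + C(Q, J)."""
    return lam * load_cost(assignment, len(experts)) + \
        incompleteness_cost(assignment, tasks, experts)
```

For instance, with experts {a, b} and {b, c} and a single task {a, b, c} covered by both experts, the load is 1 and the incompleteness is 0, so the total cost is λ.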
4 Algorithms

In this section, we describe the algorithms we designed for solving the BalancedTA problem.

Algorithm 1: The ExpertGreedy algorithm.
Input: Tasks J = {J_1, ..., J_k}, Experts P = {P_1, ..., P_n}, λ
Output: Teams Q = {Q_1, ..., Q_k}
 1: score ← ∞
 2: for ℓ ∈ {1, ..., |J|} do
 3:   J′ ← J, Q′ ← ∅
 4:   for P_i ∈ P do
 5:     L_i ← TopTasks(P_i, J′, ℓ)
 6:     Q′ ← UpdateTeams(P_i, L_i, Q′)
 7:     J′ ← UpdateTasks(P_i, L_i, J′)
 8:   end for
 9:   if score > B(Q′, J, λ) then
10:     score ← B(Q′, J, λ)
11:     Q ← Q′
12:   end if
13: end for
14: return Q, score

The ExpertGreedy algorithm:
ExpertGreedy finds k candidate solutions (team assignments), one for each maximum workload value ℓ = 1, ..., k = |J|, and at the end reports the solution with the best score. To do so, for each ℓ it finds, for each expert P_i, the ℓ tasks with the fewest uncovered skills when P_i is assigned to them, and assigns P_i to the teams that correspond to those tasks. The algorithm reports the solution of the ℓ value that resulted in the smallest team-assignment cost B(Q, J, λ).

The pseudocode of ExpertGreedy is shown in Algorithm 1. We draw attention to line 5 of this pseudocode. Routine TopTasks retrieves the indexes of the ℓ tasks with the smallest fraction of uncovered skills when expert P_i is assigned to them. To find these tasks we use a binary min-heap that keeps the incompleteness cost of all tasks in sorted order. Furthermore, lines 6 and 7 perform update operations. In particular, routine UpdateTeams (line 6) assigns expert P_i to the selected ℓ tasks, while routine UpdateTasks (line 7) removes from the selected tasks the skills that are covered by expert P_i.

A natural property of ExpertGreedy is that it assigns essentially the same amount of workload to every expert. Note that, when deciding which teams to select for an expert for a specific ℓ, the algorithm does not take into account the first part of the objective function, i.e., λL(Q), since it is equal to λℓ for all experts.

The runtime complexity of ExpertGreedy is O(k²n log k + k²nm). For each maximum load and for each expert, the algorithm sorts the tasks in ascending order based on the number of uncovered skills.

The TaskGreedy algorithm:
This algorithm also finds k candidate solutions (team assignments), one for each maximum workload value ℓ = 1, ..., k = |J|, and then selects the solution with the smallest cost. However, it differs from the previous algorithm: while ExpertGreedy greedily assigns tasks to experts, TaskGreedy finds a set of "good" candidate experts for a specific task. In particular, for each ℓ, the algorithm computes for each task J_j the value of the objective when expert P_i is assigned to team Q_j, for all i = 1, ..., n. The algorithm keeps these costs in a binary min-heap data structure for running-time efficiency. After computing the costs of all experts, it removes the root of the heap and assigns the corresponding expert to team Q_j, but only if her skillset overlaps with the uncovered skills of task J_j. If the expert is assigned to the team, then all covered skills of J_j are removed. This process continues until either all skills of J_j are covered, or the remaining skills do not overlap with the skills of any of the unassigned experts. After creating Q_j, the algorithm checks if there are any experts whose loads are equal to ℓ, and removes those experts from the pool. At the end of each ℓ loop, there is a team associated with every task, and the cost B(Q, J, λ) is computed. The algorithm reports the solution with the lowest team-assignment cost.

The pseudocode of TaskGreedy is presented in Algorithm 2.

Algorithm 2: The TaskGreedy algorithm.
Input: Tasks J = {J_1, ..., J_k}, Experts P = {P_1, ..., P_n}, λ
Output: Teams Q = {Q_1, ..., Q_k}
 1: score ← ∞
 2: for ℓ ∈ {1, ..., |J|} do
 3:   P′ ← P, Q′ ← ∅
 4:   for J_j ∈ J do
 5:     L_i ← TopExperts(J_j, P′, ℓ)
 6:     Q′ ← UpdateTeams(J_j, L_i, Q′)
 7:     P′ ← UpdateExperts(P′, Q′)
 8:   end for
 9:   if score > B(Q′, J, λ) then
10:     score ← B(Q′, J, λ)
11:     Q ← Q′
12:   end if
13: end for
14: return Q, score

Routine
TopExperts (line 5) computes and returns the indexes of those experts whose skillsets cover requirements of the given task and that have the smallest objective value. Routines UpdateTeams (line 6) and UpdateExperts (line 7) perform update operations, i.e., they assign the selected experts to the team of the current task and remove from the pool of experts those with load equal to ℓ, respectively.

In contrast to ExpertGreedy, TaskGreedy does not assign the same amount of workload to every expert. In fact, some experts might not be assigned to any team at all; this is the case when there are other experts whose skillsets overlap more with the tasks.

Another difference between ExpertGreedy and TaskGreedy is their running time. In particular, the running time of
TaskGreedy is O(k²n² + k³nm + k²n log n + k²m). For each ℓ value and for each task, the algorithm sorts the experts in ascending order, based on the cost obtained after considering each of them separately, and then traverses them in the same order to allocate a team, based on the experts' overlap with the task. We improve the running time by observing that computing the objective value when considering an expert for a specific task does not require finding the total incompleteness cost of all tasks, but only how much of the specific task is covered by the expert, since the incompleteness cost of the other tasks remains constant for all experts being evaluated for that task. This brings the runtime complexity down to O(k²n² + k²nm + k²n log n + k²m). Finally, keeping a variable that stores the overall maximum load during an ℓ loop decreases the runtime complexity to O(k²nm + k²n log n + k²m).

The BestLoad algorithm:
The BestLoad algorithm is a natural extension of the Load algorithm proposed by Anagnostopoulos et al. [1] for the offline setting of the balanced task covering problem. Recall that in that problem the goal is, given a set of tasks, to find an assignment of teams to tasks so as to minimize the maximum load of the workers, subject to the constraint that all skills of all tasks are covered.

The Load algorithm has two steps. The first step optimally solves the linear-programming relaxation of the ILP formulation of the above problem (see Theorem 2 of [1]). This creates a fractional solution X̂. The second step of Load performs R rounds, with R = O(ln(T/δ)), where T = max{mk, n}, m is the number of skills, k is the number of tasks and n is the number of experts. In each round, the algorithm assigns an expert P_j to the task J_i with probability X̂_ji, independently of other rounds and of other assignments within the same round. If expert P_j was assigned to task J_i in at least one round, the algorithm adds the expert to the team Q_i. The authors show that R rounds are required to achieve complete coverage of the skills required by the tasks.

The BestLoad algorithm we propose has the same first step as Load. To take into account the trade-off parameter λ, BestLoad modifies the second step. In particular, notice that as the number of rounds increases, more experts are assigned to tasks, i.e., the load increases and the incompleteness decreases. Therefore, for larger values of λ (load becomes more important than coverage) running fewer rounds leads to a better solution. Conversely, for smaller values of λ (coverage becomes more important than load) the algorithm needs to run a number of rounds closer to R. Based on this observation, BestLoad accommodates the different values of λ by creating R solutions, one after each assignment round. Then, given a specific value of λ, it returns the solution with the smallest cost.

The runtime complexity of the first step of BestLoad depends on the method used to solve the LP relaxation. State-of-the-art LP solvers require running time polynomial in the number of constraints of the problem [7, 19]. For the coverage of all skills, O(nkm) constraints are required. The second step of the algorithm requires O(Rnk) time.
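As an illustration, the rounding step of BestLoad can be sketched as follows. This is our own minimal rendering under stated assumptions, not the authors' implementation: x_hat stands for a precomputed fractional LP solution X̂ (expert-by-task probabilities), and cost_fn for the team-assignment cost B(·, J, λ) evaluated on a candidate assignment.

```python
import random

def best_load_rounding(x_hat, rounds, cost_fn):
    """Sketch of BestLoad's second step: perform up to `rounds` randomized
    rounding rounds over x_hat[j][i] (probability that expert j joins the
    team of task i), snapshot the assignment after every round, and return
    the snapshot minimizing cost_fn."""
    n_experts = len(x_hat)
    n_tasks = len(x_hat[0])
    teams = [set() for _ in range(n_tasks)]      # assignment built so far
    best = [set(t) for t in teams]
    best_cost = cost_fn(best)
    for _ in range(rounds):
        for j in range(n_experts):
            for i in range(n_tasks):
                if random.random() < x_hat[j][i]:
                    teams[i].add(j)              # expert j joins team of task i
        candidate = [set(t) for t in teams]      # solution after this round
        c = cost_fn(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best, best_cost
```

With a large λ, cost_fn penalizes load heavily, so an early, sparse snapshot tends to win; with a small λ, later snapshots with fuller coverage win, which mirrors the discussion above.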
Observation 1. The performance of BestLoad is at least as good as the performance of Load for the BalancedTA problem.
Clearly, Observation 1 holds, since one of the solutions considered by BestLoad is the one returned by Load.

Improving the running time of ExpertGreedy and TaskGreedy: For any value of the trade-off parameter λ, continually adding more workload to the experts will increase the value of B(Q, J, λ) in two cases: (i) when all task requirements have been covered, and (ii) when the benefit from decreasing the incompleteness cost is significantly smaller than the cost of increasing the maximum load. In these two cases, we expect the first part of the objective function to grow, while the second part remains approximately constant. This observation allows us to improve the runtime complexity of ExpertGreedy and TaskGreedy by setting a maximum possible value for ℓ, namely ℓ_max, with ℓ_max < |J|. The appropriate selection of ℓ_max is a trade-off between the running time and the quality of the results.

Solving the R-BalancedTA problem:
Here, we present how we can extend the above algorithms to solve the R-BalancedTA problem. This extension is based on a pre-processing stage that accounts for the required skills that need to be covered.

Solving the R-BalancedTA problem is essentially the same as adding a preprocessing step to the algorithms discussed above. This preprocessing step makes sure that all required skills of all tasks are covered, with a relatively small maximum load among the experts. More specifically, in this step we deploy the Load algorithm proposed in [1], with the set of experts P and the set of tasks J^r as inputs. Then, we remove from each task those skills that are covered by the corresponding team members; in this way we remove all required skills and some of the optional ones that are now covered. On this new input, we run the algorithms we designed to solve BalancedTA.

The running time of the preprocessing step is dominated by the method used to solve the linear-programming relaxation in the Load algorithm. As described above, state-of-the-art LP solvers require time polynomial in the number of constraints [7, 19].
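The preprocessing pipeline can be sketched as follows. This is an illustrative sketch: cover_required (standing in for Load run on the required skills) and solve_balanced_ta (standing in for any of the BalancedTA algorithms above) are hypothetical placeholders passed in as parameters, not routines from the paper's code.

```python
def solve_r_balanced_ta(experts, required, optional, lam,
                        cover_required, solve_balanced_ta):
    """Sketch of the R-BalancedTA reduction: first cover all required
    skills, then strip every already-covered skill from each task and run
    a BalancedTA solver on the residual skills."""
    # Step 1: teams that cover all required skills (e.g., via Load [1]).
    base_teams = cover_required(experts, required)
    # Step 2: remove from each task the skills its current team covers;
    # this removes all required skills and some optional ones.
    residual = []
    for j, team in enumerate(base_teams):
        covered = set().union(*(experts[i] for i in team)) if team else set()
        residual.append((required[j] | optional[j]) - covered)
    # Step 3: solve BalancedTA on the residual tasks and merge the teams.
    extra_teams = solve_balanced_ta(experts, residual, lam)
    return [b | e for b, e in zip(base_teams, extra_teams)]
```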
5 Experiments

This section explores the practicality of our algorithms using data from three major online labor markets. Specifically, (i) we evaluate and compare the performance of our three methods, ExpertGreedy, TaskGreedy and BestLoad, to multiple baselines for the BalancedTA and R-BalancedTA problems, (ii) we showcase the impact of the trade-off parameter λ on the load and incompleteness cost of the solution, and (iii) we provide a running-time analysis of our algorithms.
Table 1: A summary of the dataset statistics.

For all our experiments we use a single-process implementation of our algorithms on a 64-bit MacBook Pro with an Intel Core i7 CPU at 2.6GHz and 16 GB RAM. We use the Gurobi optimizer [17] for linear programming. We make the code, the datasets and the chosen parameters available online.

We use data from the online labor marketplaces freelancer.com, guru.com, and upwork.com. We refer to these datasets as Freelancer, Guru, and Upwork, respectively. Table 1 exhibits statistics on the different sizes and skill properties of these datasets. In all datasets, skills of experts that are never required by any task have been removed, since they are never used. Note that Freelancer (1212 experts, 993 tasks) and Guru (6120 experts, 3195 tasks) have more experts available than posted tasks, while the reverse is true for Upwork (1500 experts, 3000 tasks). An interesting observation is that the ratio of expert skills to task skills is different in each of the three datasets.
Task skills: The Freelancer and Guru datasets include a random sample from a large pool of real tasks posted by users in these marketplaces. The Upwork dataset is a synthetic dataset obtained through a data-generation procedure similar to one used in the past [2]; a small number of experts (10%) is removed from the pool of experts in the dataset, and then subsets of their skills are repeatedly sampled to create tasks, by interpreting the union of their skills as task requirements.

Expert skills: All expert datasets used in this work are acquired from anonymized profiles of members registered in the three marketplaces. A profile includes a self-defined set of skills.
We compare the performance of our algorithms to the following baselines:

SetCover: A simple variation of the well-known greedy algorithm for SetCover [20]; for each task, the algorithm iteratively assigns to the team the expert whose skills overlap the most with the uncovered skills of the task and then removes these skills from the task. The algorithm stops either when all skills have been covered, or when none of the experts overlap with the remaining uncovered skills. The running time of this algorithm is O(knm).

BestCostGreedy: This is a variant of SetCover that takes the workload into account. The difference is that, instead of selecting the expert overlapping the most with the task, BestCostGreedy assigns the expert that improves the objective function the most. The algorithm stops when the cost cannot be decreased further. The running time of this algorithm is O(knm).

PairGreedy: PairGreedy is another intuitive greedy algorithm, which finds in each iteration the (task, expert) pair that improves the objective the most, and assigns the expert to the corresponding team. The drawback of this baseline is its runtime complexity, O(k²n²(n + m)), which prohibits us from evaluating it on the real datasets. As such, we do not report its performance. Nevertheless, even when tested on smaller datasets, it is always outperformed by the proposed algorithms.

5.3 Performance evaluation for BalancedTA
This section demonstrates the performance of the proposed algorithms compared to the baselines for the BalancedTA problem. In these experiments, we vary the trade-off parameter λ to take values in {0, 2, . . . , 10}. We select this specific range of λ values because it makes the impact of the trade-off parameter clear application-wise. However, we also show the performance of our algorithms for values of λ for which we showed that BalancedTA is NP-hard. Furthermore, we set the parameter ℓmax (the maximum number of ℓ iterations) to 80 for all experiments, because we saw that in real applications it generally leads to reasonable solutions and runtime performance.

We present the results for BalancedTA for all three datasets in Figure 1. The y-axis represents the team-assignment cost (B) of each algorithm, and the x-axis corresponds to the value of the trade-off parameter λ. Smaller values of the cost correspond to better solutions.

We observe that the performances of our algorithms and the baselines follow a similar trend, which is consistent across the different datasets. Furthermore, the baseline algorithms are clearly outperformed by our proposed approaches, with SetCover performing the worst; the only exception is λ = 0, i.e., the case that completely ignores the load of the experts. This is because SetCover always returns a team that covers all the task requirements and ignores the load of the experts. BestCostGreedy is also outperformed by our proposed algorithms. Note that for λ = 0, BestCostGreedy is able to find solutions where no requirement is left uncovered. However, as λ increases, the algorithm continues covering all of the task requirements without decreasing the workload, which leads to the linear increase of the total cost. The only exception is the Upwork dataset, Figure 1(c), where for λ = 4 the algorithm begins trading incompleteness cost for lower workload, but the total cost remains significantly larger compared to the other algorithms.

The performance of BestCostGreedy is followed by BestLoad. One observation is that BestLoad performs significantly better than the original algorithm Load. We do not present these results because, as explained in Observation 1, BestLoad always performs at least as well as Load. Recall that Load finds a single solution that optimizes the workload while covering all the task requirements, and its final solution is completely independent of the trade-off parameter λ. Therefore, since the load of the solution is constant and the incompleteness cost is 0, the team-assignment cost increases linearly with the coefficient λ.

Now, we illustrate the performance of our algorithms for values of λ for which we showed that the BalancedTA problem is NP-hard. The corresponding performances for the three datasets can be seen in Figure 1 as subplots. As expected, the closer λ is to 0 the closer the algorithmic performances are, but as λ increases, the difference in performance between our proposed algorithms and the baselines also increases. Overall, we observe that our algorithms, namely ExpertGreedy, TaskGreedy, and BestLoad, outperform the baseline algorithms; we discuss their individual trade-off and efficiency differences below.
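To make the reported objective concrete, the team-assignment cost can be sketched as follows. This is a minimal illustration assuming B = C + λ·L, with C the total number of uncovered required skills and L the maximum expert load, as the linear behavior discussed above suggests; the formal definition is given earlier in the paper.

```python
def team_assignment_cost(tasks, teams, lam):
    """Sketch of the team-assignment cost B = C + lam * L.

    tasks: list of skill sets (the requirements of each task).
    teams: one team per task; each team is a list of (expert_id, skills) pairs.
    lam:   trade-off parameter between incompleteness and load.
    """
    incompleteness = 0  # C: uncovered required skills, summed over all tasks
    load = {}           # number of tasks each expert participates in
    for required, team in zip(tasks, teams):
        covered = set()
        for expert_id, skills in team:
            covered |= skills
            load[expert_id] = load.get(expert_id, 0) + 1
        incompleteness += len(required - covered)
    max_load = max(load.values(), default=0)  # L: maximum load of any expert
    return incompleteness + lam * max_load
```

For λ = 0 the cost reduces to the incompleteness alone, which matches the observation that full-coverage baselines are competitive only at that point.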
R-BalancedTA
We perform another set of experiments to demonstrate the performance of the algorithms for the R-BalancedTA problem. In these experiments, we vary the fraction of required skills in the tasks as follows: with probability p_s we independently define each skill of every task to be a required skill; otherwise, it is considered optional. In Figure 2, we study how the algorithms and baselines perform for a range of p_s values and a fixed λ = 4. We see that the observations from the algorithmic comparison are similar to the ones made in the previous experiment for the BalancedTA problem. Note that for p_s = 0 no skill is required, and therefore the algorithms perform exactly as in the BalancedTA problem, while for p_s = 1 all skills are required and the performance of all algorithms is the same and equal to the result of the pre-processing stage.
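The random required/optional split used in these experiments can be sketched as follows (a minimal illustration; the function name is ours):

```python
import random

def split_required_optional(tasks, p_s, seed=0):
    """Independently mark each skill of every task as required with
    probability p_s; the remaining skills are considered optional."""
    rng = random.Random(seed)
    split_tasks = []
    for task in tasks:
        required = {s for s in task if rng.random() < p_s}
        split_tasks.append((required, set(task) - required))  # (required, optional)
    return split_tasks
```

At p_s = 0 every task is all-optional and the instance coincides with BalancedTA; at p_s = 1 all skills are required and must be covered by the formed teams.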
Figure 1: Team-assignment cost (B) of algorithms and baselines for λ = {0, 2, . . . , 10} and ℓmax = 80. The subplots correspond to values of λ for which BalancedTA is NP-hard. Columns correspond to different datasets: (a) Freelancer; (b) Guru; (c) Upwork.
Figure 2: Team-assignment cost (B) of algorithms and baselines for p_s = {0, 0.25, 0.5, 0.75, 1}, λ = 4 and ℓmax = 80. Columns correspond to different datasets: (a) Freelancer; (b) Guru; (c) Upwork.

Varying the trade-off parameter λ

We study the behavior of our proposed algorithms for different trade-off values between the load and the incompleteness cost. We begin by setting λ = 0, i.e., we ignore the workload and ensure complete coverage (the incompleteness cost is 0), and increase λ to observe how the trade-off between load and incompleteness cost changes. The results are shown in Figure 3. The y-axis shows the load cost, and the x-axis the incompleteness cost for the specific load.

As expected, for λ close to 0 our algorithms yield solutions with low incompleteness cost and high workload, while increasing λ changes this balance accordingly. Note that BestLoad lacks trade-off capabilities compared to ExpertGreedy and TaskGreedy. This is because the first step of the algorithm, which creates the optimal fractional solution for the balanced task-covering problem [1], is oblivious to the parameter λ. Thus, even though the second step of the algorithm weighs the trade-off parameter λ, the trade-off capabilities are restricted by the assignment probabilities created in the first step. Therefore, what we see in Figure 3 is that, for our datasets and the examined range of λ, the values of load and incompleteness achieved by the algorithm are the same except for the solution at λ = 0.

A quality of ExpertGreedy and TaskGreedy is that improving the cost in one of the two components is achieved by paying a moderate price in the other component. For instance, assume a customer using guru.com: we can set λ = 2 to create teams that would satisfy both the customer and the experts, as for a reasonable maximum load ExpertGreedy and TaskGreedy induce a very small incompleteness cost of ∼5. Now, if another customer prefers hiring few people at the cost of incompleteness, we can set λ = 4 to achieve a load of ∼15 for an incompleteness cost of ∼30, thus weighing the two components differently, yet always reasonably.
Figure 3: Trade-off between load cost (L) and incompleteness cost (C) for ℓmax = 80. The labels next to the first and last data points correspond to λ = 0 and λ = 10, respectively. The in-between points shown in the curve correspond to λ = {2, 4, 6, 8}. Columns correspond to different datasets: (a) Freelancer; (b) Guru; (c) Upwork.
Note that the baseline algorithms are omitted from this plot. This is because BestCostGreedy always maintains a very low incompleteness cost, which requires a workload that is much larger than the ones induced by the proposed algorithms. On the other hand, SetCover lacks trade-off capabilities, since its solution is always independent of λ, with 0 incompleteness cost and constant load.

Figure 3 allows us to further investigate the properties of our algorithms for BalancedTA. A first observation is that all algorithms demonstrate a smooth transition in the load and incompleteness cost as the trade-off parameter changes. Note in Figures 3(a) and 3(b) that assigning a maximum workload of 80 is enough to achieve complete task coverage. In fact, for the same datasets, even if the load decreases to ∼20, the incompleteness cost increases only a little. However, this is not the case for Figure 3(c) (Upwork), where it is clear that the load should be more than 80 (ℓmax) to reach complete coverage (0 incompleteness cost). This occurs because, in this specific dataset, both the number of experts and the average number of skills they acquire are significantly smaller than the number of tasks and the number of skills the tasks require, respectively, which requires creating large teams and utilizing the same experts many times to achieve full coverage. Even the baseline Load, which guarantees complete coverage with minimum workload cost, needs a minimum load of 548 on this dataset to accomplish full coverage.

To showcase the differences among the algorithms as depicted in this experiment, we compare
TaskGreedy with
ExpertGreedy and
BestLoad, for the
Upwork dataset (Figure 3(c)). Recall that the
TaskGreedy algorithm assigns experts to tasks based on how suitable they are for the task individually, and not within a team. Therefore, for datasets such as Upwork, where there are fewer experts and expert skills compared to tasks and task requirements, the algorithm becomes less effective, as it cannot evaluate whether a newly-added person is the best option for the whole team. However, in a dataset such as Guru (Figure 3(b)), where the experts acquire on average more skills, and a larger variety of skills, than the tasks require, we observe that TaskGreedy performs slightly better than, or the same as, the other two algorithms. This is because the skill "surplus" leaves the algorithm room for seemingly wrong local choices, as it is able to compensate for them by using the skills of some of the remaining experts.

Figure 4: Average running time (sec) of algorithms and baselines over 5 runs, in logarithmic scale, for λ = 4 and ℓmax = 80. The three bar charts correspond to the datasets: Freelancer, Guru, Upwork.

Finally, we investigate the running-time efficiency of our algorithms. Figure 4 shows the average running times for all algorithms and datasets when setting the parameter λ = 4. The running-time complexities of the algorithms are independent of λ, so its selection does not affect the running-time results. The times are averaged over 5 runs for the BalancedTA problem; the results for
R-BalancedTA are similar and omitted. We use the baselines
SetCover and
BestCostGreedy as indicators of how well our algorithms perform in terms of running time, because they have the best runtime complexity. Even though their asymptotic complexity is the same,
BestCostGreedy is slower than
SetCover. This is because the two algorithms have different stopping criteria: the former depends on the improvement of the team-assignment cost, and the latter on the coverage of the skills. Note that simply comparing the asymptotic running times of the different algorithms (see Section 4) is not sufficient. In fact, there are multiple factors we need to consider, such as constants, dominating factors that depend on the properties of the datasets, the efficiency of the implementations, etc.

In Figure 4, the datasets
Freelancer and
Guru show that
ExpertGreedy is much faster than
TaskGreedy and
BestLoad for the datasets where k < n (the y-axis is in logarithmic scale). Yet, for the dataset Upwork, where n < k, we see that even though
ExpertGreedy remains the fastest algorithm, the running time of
TaskGreedy is also very close. One possible explanation is that having fewer experts than tasks, with fewer skills on average, allows
TaskGreedy to find teams faster in this dataset; yet
TaskGreedy is consistently slower than
ExpertGreedy for all datasets. Thus, we can conclude that overall
ExpertGreedy is the most efficient of our algorithms.
In this paper, we introduced
BalancedTA, a team-formation problem where, given a collection of tasks and a pool of experts, the goal is to form teams such that each team is associated with a task and covers it as well as possible, while at the same time the maximum workload of the chosen experts is minimized. We also considered a variant of this problem where each task has a set of required skills that must be covered by the formed teams. To the best of our knowledge, we are the first to combine the coverage of tasks and the workload of experts into a single objective. We showed that our problems are NP-hard and designed efficient heuristics for solving them. Our experiments with three real-world datasets from online labor markets demonstrate the efficiency and the efficacy of our algorithms, and their superiority compared to other heuristics.

References

[1] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Power in unity: forming teams in large-scale community systems. In CIKM, 2010.
[2] A. Anagnostopoulos, L. Becchetti, C. Castillo, A. Gionis, and S. Leonardi. Online team formation in social networks. In WWW, 2012.
[3] G. Barnabò, A. Fazzone, S. Leonardi, and C. Schwiegelshohn. Algorithms for fair team formation in online labour marketplaces. In WWW, 2019.
[4] A. Bhowmik, V. Borkar, D. Garg, and M. Pallan. Submodularity in team formation problem. In SDM, 2014.
[5] C. Dorn and S. Dustdar. Composing near-optimal expert teams: a trade-off between skills and connectivity. In CoopIS, 2010.
[6] B. Golshan, T. Lappas, and E. Terzi. Profit-maximizing cluster hires. In SIGKDD, 2014.
[7] J. Gondzio and T. Terlaky. A computational view of interior-point methods for linear programming. Citeseer, 1994.
[8] M. Kargar and A. An. Discovering top-k teams of experts with/without a leader in social networks. In CIKM, 2011.
[9] M. Kargar, A. An, and M. Zihayat. Efficient bi-objective team formation in social networks. In ECML PKDD, 2012.
[10] M. Kargar, M. Zihayat, and A. An. Finding affordable and collaborative teams from a network of experts. In SDM, 2013.
[11] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In KDD, 2009.
[12] C.-T. Li, M.-K. Shan, and S.-D. Lin. On team formation with expertise query in collaborative social networks. KAIS, 2015.
[13] L. Li and H. Tong. Network science of teams: Characterization, prediction, and optimization. In WSDM, 2018.
[14] L. Li, H. Tong, N. Cao, K. Ehrlich, Y.-R. Lin, and N. Buchler. Enhancing team composition in professional networks: Problem definitions and fast solutions. TKDE, 2017.
[15] L. Li, H. Tong, N. Cao, K. Ehrlich, Y.-R. Lin, and N. Buchler. Replacing the irreplaceable: Fast algorithms for team member recommendation. In WWW, 2015.
[16] A. Majumder, S. Datta, and K. Naidu. Capacitated team formation problem on social networks. In KDD, 2012.
[17] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015.
[18] S. S. Rangapuram, T. Bühler, and M. Hein. Towards realistic team formation in social networks based on densest subgraphs. In WWW, 2013.
[19] D. A. Spielman and S.-H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.
[20] V. V. Vazirani. Approximation algorithms. Springer Science & Business Media, 2013.
[21] X. Wang, Z. Zhao, and W. Ng. A comparative study of team formation in social networks. In DASFAA, 2015.
[22] X. Yin, C. Qu, Q. Wang, F. Wu, B. Liu, F. Chen, X. Chen, and D. Fang. Social connection aware team formation for participatory tasks.