Learning to Generate Fair Clusters from Demonstrations
Sainyam Galhotra, Sandhya Saisubramanian, Shlomo Zilberstein
University of Massachusetts Amherst
Abstract
Fair clustering is the process of grouping similar entities together, while satisfying a mathematically well-defined fairness metric as a constraint. Due to the practical challenges in precise model specification, the prescribed fairness constraints are often incomplete and act as proxies to the intended fairness requirement, leading to biased outcomes when the system is deployed. We examine how to identify the intended fairness constraint for a problem based on limited demonstrations from an expert. Each demonstration is a clustering over a subset of the data. We present an algorithm to identify the fairness metric from demonstrations and generate clusters using existing off-the-shelf clustering techniques, and analyze its theoretical properties. To extend our approach to novel fairness metrics for which clustering algorithms do not currently exist, we present a greedy method for clustering. Additionally, we investigate how to generate interpretable solutions using our approach. Empirical evaluation on three real-world datasets demonstrates the effectiveness of our approach in quickly identifying the underlying fairness and interpretability constraints, which are then used to generate fair and interpretable clusters.
Introduction
Graph clustering is increasingly used for decision making in high-impact applications such as infrastructure development (Hospers, Desrochers, and Sautet 2009), health care (Haraty, Dimishkieh, and Masud 2015), and criminal justice (Aljrees et al. 2016). These domains involve highly consequential decisions and it is important to ensure that the generated solutions are unbiased. Fair clustering is the process by which similar nodes are grouped together, while satisfying a given fairness constraint (Chierichetti et al. 2017). Prior works on fair clustering focus on designing efficient algorithms to satisfy a given fairness metric (Anderson et al. 2020; Ahmadian et al. 2019; Chierichetti et al. 2017; Galhotra, Brun, and Meliou 2017; Kleindessner, Awasthi, and Morgenstern 2019). These approaches assume that the specified fairness metric is complete and accurate. With the increased growth in the number of ways to define and measure fairness, a key challenge for system designers is to accurately specify the fairness metric for a problem.

Due to the practical challenges in the precise specification of fairness metrics and the complexity of machine learning models, the system's objective function and constraints are often tweaked during the training phase until it produces the desired behavior on a small subset of the data. As a result, the system may be deployed with an incompletely specified fairness metric that acts as a proxy to the intended metric. Clustering with incompletely specified fairness metrics may lead to undesirable consequences when deployed. It is challenging to identify the proxies during system design due to the nuances in the fairness definitions and unstated assumptions.

Figure 1: An illustration of incomplete specification of fairness metric leading to biased output—unequal distribution of green and blue nodes in each cluster—when deployed.
Two similar fairness metrics that produce similar solutions during the design and initial testing may generate different solutions that are unfair in different ways to different groups, when deployed. For example, in Figure 1, the designer inadvertently specifies an incomplete fairness metric and assumes the system will behave as intended when deployed. This unintentional incomplete specification is not discovered during the initial testing since the generated results align with those of the intended metric on the training data, such as sample data from California. Consequently, the system may generate biased solutions when deployed in Texas, due to demographic shift. Thus, design decisions that seem innocuous during initial testing may have harmful impacts when the system is widely deployed. While the difficulty in selecting a fairness metric for a given problem is acknowledged (Knight 2019), there exists no principled approach to address this meta-problem.
How to correctly identify the fairness metric that the designer intends to optimize for a problem?
We present an approach that generates fair clusters by learning to identify the intended fairness metric using limited demonstrations from an oracle. It is assumed that there exists a true clustering with the intended fairness metrics, which are initially unknown. Each demonstration is a sample from the true clusters, providing information about a subset of the nodes in the dataset. Given a finite number of expert demonstrations, our solution approach first clusters the demonstrations to infer the likelihood of each candidate constraint and then generates clusters using the most likely constraint. By maintaining a distribution over the candidate metrics and updating it based on the demonstrations, the intended clusters can be recovered since demonstrations are i.i.d. The nodes in a demonstration are selected by the expert, abstracted as an oracle. This is in contrast to querying an oracle, where the algorithm selects the nodes to query and the oracle responds whether they belong to the same cluster or not. When the oracle is a human, demonstrations are easier to collect than queries over pairs of nodes, which require constant oversight.

While inferring the intended fairness metric is critical to minimize the undesirable behavior of the system, the ability of an end user to evaluate a deployed system for fairness and identify when to trust the system hinges on the interpretability of the results. Though clustering results are expected to be inherently interpretable, no clear patterns may be easy to identify when clustering with a large number of features (Saisubramanian, Galhotra, and Zilberstein 2020). While the existing literature has studied fair clustering and interpretable clustering independently (Chierichetti et al. 2017; Saisubramanian, Galhotra, and Zilberstein 2020), to the best of our knowledge, there exists no approach to generate clusters that are both fair and interpretable.
We show that our solution approach can generate fair and interpretable clusters by inferring both fairness and interpretability constraints, based on limited demonstrations.

Our primary contributions are as follows: (1) formalizing the problem of learning to generate fair clusters from demonstrations; (2) presenting two algorithms to identify the fairness constraints for clustering, generate fair clusters, and analyzing their theoretical guarantees; and (3) empirically demonstrating the effectiveness of our approach in identifying the clustering constraints on three data sets, and using our approach to generate fair and interpretable clusters.

Background and Related Work
K-center Clustering
It is one of the most widely studied objectives in the literature (Vazirani 2013). Let H = G(V, d) be a graph with V = {v_1, v_2, ..., v_n} denoting a set of n nodes, along with a pairwise distance metric d : V × V → R. The nodes are described by feature values, F. Given a graph instance H and an integer k, the goal is to identify k nodes as cluster centers (say S, |S| = k) and assign each node to a cluster center such that the maximum distance of any node from its cluster center is minimized. The output is a set of clusters C = {C_1, C_2, ..., C_k}. The clustering assignment function is defined by γ : V → [k] and the nodes assigned to a cluster C_i are {v ∈ V | γ(v) = i}. The objective value is calculated as: o_kC(H, C) = max_{v ∈ V} min_{s ∈ S} d(v, s). A simple greedy algorithm provides a 2-approximation for the k-center problem and it is NP-hard to find a better approximation factor (Vazirani 2013).
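The greedy 2-approximation mentioned above can be sketched as follows; the point representation and the distance function passed in are illustrative assumptions, not tied to the paper's datasets.

```python
def greedy_k_center(points, k, dist):
    """Greedy 2-approximation for k-center: start from an arbitrary point,
    then repeatedly add the point farthest from the current centers."""
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    # Assign every node to its nearest center and compute the k-center objective,
    # i.e., the maximum distance of any node from its assigned center.
    assignment = {p: min(range(k), key=lambda i: dist(p, centers[i])) for p in points}
    objective = max(min(dist(p, c) for c in centers) for p in points)
    return centers, assignment, objective
```

For instance, on the one-dimensional points [0, 1, 10, 11] with k = 2 and absolute difference as the distance, the algorithm picks 0 and 11 as centers and achieves objective value 1.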
Fairness in Machine Learning
The existing literature on fairness in machine learning can be broadly categorized into two lines of work: defining notions of fairness and designing fair algorithms. Various notions of fairness have been studied by researchers in different fields such as AI, Economics, Law, Philosophy, and Public Policy (Bera, Chakrabarty, and Negahbani 2019; Brams and Taylor 1996; Binns 2018; Chierichetti et al. 2017; Galhotra, Brun, and Meliou 2017; Mehrabi et al. 2019; Thomson 1983; Verma and Rubin 2018). The two commonly studied fairness criteria are as follows.

• Group fairness ensures the outcome distribution is the same for all groups of interest (Chierichetti et al. 2017; Galhotra, Brun, and Meliou 2017; Verma and Rubin 2018). This is measured using metrics such as disparate impact (Feldman et al. 2015) and statistical parity (Kamishima et al. 2012; Verma and Rubin 2018), including conditional statistical parity, predictive parity, false positive error rates, and false negative error rates.

• Individual fairness ensures that any two individuals with the same attributes are not discriminated against (Anderson et al. 2020; Dwork et al. 2012; Ilvento 2019). This is often measured using metrics such as causal discrimination (Dwork et al. 2012; Verma and Rubin 2018).

Given a mathematically well-defined fairness criterion, a fair algorithm produces outputs that are aligned with the given fairness definition. Examples include fair clustering (Anderson et al. 2020; Ahmadian et al. 2019; Chierichetti et al. 2017; Kleindessner, Awasthi, and Morgenstern 2019), fair ranking (Celis, Straszak, and Vishnoi 2018), and fair voting (Celis, Huang, and Vishnoi 2018). Although these works have laid vital groundwork to assure fairness in some settings, much of the effort in designing fair algorithms has focused on the algorithm's performance—efficiency, scalability, and providing theoretical guarantees. There is very little effort, if any, at the meta-level: designing algorithms that can identify a suitable fairness metric for a clustering problem, given a set of candidate metrics. There has been recent focus on learning a metric (Ilvento 2019) or a representation that ensures fairness with respect to classification tasks (Hilgard et al. 2019; Gillen et al. 2018). It is not straightforward to extend these fair classification techniques to fair clustering because they have different objectives. This is further complicated by the lack of ground truth and the NP-hardness of clustering. Therefore, it is critical to develop techniques to infer metrics for fair clustering.
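As a concrete illustration of the group-fairness style criteria above, a minimal check of whether each cluster's per-group fractions stay within given bounds might look as follows; the clusters, color map, and bounds are made-up inputs, not the paper's notation or data.

```python
from collections import Counter

def satisfies_group_fairness(clusters, color_of, alpha, beta):
    """Disparate-impact style check: in every cluster, the fraction of
    nodes carrying each value of the sensitive feature ("color") must
    lie within [alpha, beta]."""
    colors = set(color_of.values())
    for nodes in clusters.values():
        counts = Counter(color_of[v] for v in nodes)
        for c in colors:
            frac = counts.get(c, 0) / len(nodes)
            if not alpha <= frac <= beta:
                return False
    return True
```

For example, two clusters that are each half red and half blue satisfy the check with bounds [0.4, 0.6] but violate tighter bounds such as [0.6, 1.0].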
Fair Clustering
Fair clustering approaches generate clusters that maximize the clustering objective value, while satisfying the given fairness requirement (Anderson et al. 2020; Ahmadian et al. 2019; Bera, Chakrabarty, and Negahbani 2019; Chierichetti et al. 2017; Kleindessner, Awasthi, and Morgenstern 2019). The commonly considered fairness metrics in clustering are group fairness (Chierichetti et al. 2017), individual fairness (Ilvento 2019; Mahabadi and Vakilian 2020), and distributional fairness (Anderson et al. 2020). These approaches require exact specification of the fairness metrics a priori. They generate fair clusters either by modifying the input graph or by using the fairness metrics as constraints and solving a linear optimization.
Interpretable Clustering
Interpretable clustering is the process of generating clusters such that it is easy to identify patterns in the data for the end user. A recent approach to generate interpretable clusters maximizes the homogeneity of the nodes in each cluster, with respect to predefined features of interest to the user (Saisubramanian, Galhotra, and Zilberstein 2020). The problem is solved as a multi-objective clustering problem where both interpretability and the k-center objective value are optimized. While both fairness and interpretability are typically investigated independently, the ability to evaluate the system for fairness violations often relies on its interpretability.
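The homogeneity measure used by such approaches can be sketched directly; the node ids and feature map below are made-up inputs rather than the cited paper's implementation.

```python
from collections import Counter

def homogeneity(cluster_nodes, feature_of):
    """Fraction of a cluster's nodes that share the most frequent value
    of the feature of interest; 1.0 means a fully homogeneous cluster."""
    counts = Counter(feature_of[v] for v in cluster_nodes)
    return counts.most_common(1)[0][1] / len(cluster_nodes)
```

A cluster in which three of five nodes share the same color thus has homogeneity 0.6 with respect to that feature.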
Clustering with an Oracle
Prior works that use additional knowledge from an oracle for clustering typically involve queries of the form 'do nodes u and v belong to the same cluster?' (Ashtiani, Kushagra, and Ben-David 2016; Mazumdar and Saha 2017a,b; Galhotra et al. 2018; Firmani, Saha, and Srivastava 2016; Vesdapunt, Bellare, and Dalvi 2014). Our approach is different from oracle-based clustering in the following manner. First, our approach involves the oracle selecting the nodes and determining what information is revealed. Second, the oracle provides information potentially about a subset of nodes, instead of pairwise relationships.

Learning from Demonstration
Learning from demonstration is a type of apprenticeship learning, where the learner learns by observing an expert (typically a human) performing the task (Abbeel and Ng 2004). The learner tries to mimic the expert's behavior by observing the demonstrations and generalizing it to unseen situations. Learning from demonstration is a popular approach used to teach robots to complete a task (Abbeel and Ng 2004) or avoid the negative side effects of their actions (Saisubramanian, Kamar, and Zilberstein 2020).
Likelihood Estimation
Maximum likelihood estimation (MLE) is a statistical method to estimate the parameters of a probability distribution by maximizing the likelihood function, such that the observed data are most probable under the assumed model (White 1982). Intuitively, it is a search in the parameter space to identify a set of parameters, for the model, that best fit the observed data. The maximum likelihood estimate is the point in the parameter space that maximizes the likelihood function.
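MLE over a finite candidate set can be sketched as below; a simple Bernoulli model stands in here for the clustering-constraint likelihoods used later in the paper, and the candidate values are arbitrary.

```python
import math

def log_likelihood(p, data):
    """Log-likelihood of i.i.d. binary observations under Bernoulli(p), 0 < p < 1."""
    return sum(math.log(p) if x else math.log(1.0 - p) for x in data)

def mle_over_candidates(candidates, data):
    """Return the candidate parameter under which the observed data are
    most probable, i.e., the maximizer of the likelihood function."""
    return max(candidates, key=lambda p: log_likelihood(p, data))
```

With observations [1, 1, 1, 0] and candidates {0.25, 0.5, 0.75}, the estimate is 0.75, the candidate closest to the empirical frequency.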
Problem Formulation
Problem Statement:
Let G = ⟨V, d⟩ be the input graph with vertices V and distance metric d, and let o denote the clustering objective. Given a finite set of candidate fairness metrics, denoted by Ω, and a finite set of clustering demonstrations, denoted by Λ, the goal is to identify a fairness metric ω_F ∈ Ω required to be satisfied by the clusters when optimizing objective o.

We present learning to cluster from demonstrations (LCD), an approach to infer ω_F using Λ. LCD is introduced and discussed in the context of fair clustering but it is a generic approach that can be used to infer any clustering constraint. LCD can also handle the case of clustering with multiple fairness metrics by simply considering Ω to be the power set over possible candidate metrics.

Clustering demonstrations:
LCD relies on the availability of clustering demonstrations by an expert. It is relatively easier to gather demonstrations from a human expert than querying for pairs of nodes, which requires constant oversight or availability to answer the queries.
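Concretely, a clustering demonstration can be represented as a grouping over the sampled nodes. The sketch below (an illustration under assumed data structures, not the paper's implementation) extracts the intra- and inter-cluster pair labels a demonstration reveals, and measures how well a candidate clustering conforms to them, in the pairwise-accuracy style used later for likelihood estimation.

```python
from itertools import combinations

def pairwise_labels(demo):
    """demo: list of groups (each a list of node ids) from one demonstration.
    Returns {(u, v): True} for intra-cluster pairs and False for inter-cluster
    pairs, over all pairs of sampled nodes."""
    labels = {}
    for i, group in enumerate(demo):
        for u, v in combinations(sorted(group), 2):
            labels[(u, v)] = True          # grouped together in the demo
        for other in demo[i + 1:]:
            for u in group:
                for v in other:
                    labels[tuple(sorted((u, v)))] = False  # separated in the demo
    return labels

def conformance(clustering, demo):
    """Fraction of the demonstration's pairs whose intra/inter label agrees
    with a candidate clustering given as {node: cluster id}."""
    labels = pairwise_labels(demo)
    agree = sum((clustering[u] == clustering[v]) == same
                for (u, v), same in labels.items())
    return agree / len(labels)
```

A clustering that groups exactly the nodes the demonstration groups scores 1.0; disagreements lower the score proportionally.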
Definition 1. A clustering demonstration λ provides the inter-cluster and intra-cluster links for a subset of nodes from the dataset, S ⊆ V with |S| ≥ 2, by grouping them according to the underlying objective function and constraints, λ = {C_1, ..., C_t}, with each C_i denoting a cluster and t ≤ k.

To generate a demonstration, the oracle selects a subset of nodes and then clusters it, in accordance with the true clusters. The following assumption ensures that demonstrations are i.i.d. and the expert is not acting as an adversary.
Assumption 1.
The nodes in each demonstration are randomly selected and clustered according to the ground-truth fairness constraints.
Therefore, a demonstration λ is a sample of the underlying clustering, revealing the relationship between a subset of the nodes. However, the relationship between the nodes in successive demonstrations is unknown when the nodes are distinct in each demonstration. We illustrate this with an example. Consider seven nodes {u_1, ..., u_7} whose true but initially unknown clustering is C*_1 = {u_1, u_2, u_3}, C*_2 = {u_4, u_5}, and C*_3 = {u_6, u_7}. Let λ_1 = {(u_1, u_2), (u_4)} and λ_2 = {(u_3), (u_5), (u_6)} denote two successive demonstrations. Demonstration λ_1 shows that u_1, u_2 are in the same cluster and u_4 is in a separate cluster. Demonstration λ_2 shows that u_3, u_5 and u_6 are in different clusters. At the end of λ_1 and λ_2, it is not clear whether u_1, u_2 and u_3 belong to the same cluster or not.

Definition 2.
A globally informative demonstration provides the true cluster affiliation of a subset of nodes, S ⊆ V, and is denoted by λ_g = {⟨u_1, γ(u_1)⟩, ..., ⟨u_s, γ(u_s)⟩}, ∀u_i ∈ S, with γ(u) indicating the cluster affiliation of node u.

A globally informative demonstration provides information about the true cluster affiliation (cluster ID) of the nodes, which is used to retrieve the inter-cluster and intra-cluster links between the nodes and form clusters {C_1, ..., C_t} with t ≤ k. The information provided by a single globally informative demonstration is the same as a regular clustering demonstration. However, globally informative demonstrations facilitate cross-referencing the cluster affiliations across demonstrations, overcoming the drawback of general clustering demonstrations. Consider the earlier example with global demonstrations λ_1 = {⟨u_1, 1⟩, ⟨u_2, 1⟩, ⟨u_4, 2⟩} and λ_2 = {⟨u_3, 1⟩, ⟨u_5, 2⟩, ⟨u_6, 3⟩}. Then we know that C*_1 = {u_1, u_2, u_3}. This subtle but important distinction accelerates the identification of fairness constraints.

Symbol | Formula | Parameter | Reference
ω_GF | Ratio of each feature value ∈ [α, β] | α, β | (Bera, Chakrabarty, and Negahbani 2019; Chierichetti et al. 2017)
ω_EQ | Relative distribution of a specific feature value | β | (Ding 2020; Galhotra, Saisubramanian, and Zilberstein 2019)
ω_IC | Homogeneity of clusters | β | (Saisubramanian, Galhotra, and Zilberstein 2020)

Table 1: Candidate fairness and interpretable constraints (Ω).

Fairness and Interpretability Constraints
In the rest of the paper, we focus on inferring the following constraints, with constraint thresholds defined below.
Disparate impact or group fairness (ω_GF). This commonly studied fairness metric requires the fraction of nodes belonging to all groups, characterized by the sensitive feature, to have a fair representation in each cluster. Suppose the sensitive feature takes two values, Red or Blue, with each node assigned one of the two colors. This constraint requires the fraction of red and blue nodes in a cluster to be within [α, β], where α, β ∈ [0, 1] are called constraint thresholds (Bera, Chakrabarty, and Negahbani 2019; Chierichetti et al. 2017).

Equal representation (ω_EQ). This fairness constraint enforces equal distribution of nodes with a specific feature value, across clusters. An example is requiring all clusters to have an equal number of nodes with the feature value 'Red'. This clustering constraint has been particularly useful in team formation settings, where the resources are fixed and certain colored nodes need to be distributed equally among teams (clusters). More formally, let α_i denote the number of nodes with feature value α in cluster C_i. Constraint ω_EQ requires α_i = α_j. Restricting all nodes of feature value α to be distributed equally may be very strict for some applications. A generalization of this constraint requires the distribution ratio to be greater than a pre-defined threshold β, α_i/α_j > β, for every pair of clusters (Ding 2020; Galhotra, Saisubramanian, and Zilberstein 2019). This ratio captures the relative distribution of α-valued nodes across the clusters.

Interpretability (ω_IC). This constraint considers a specific feature of interest (say 'Color') and requires that all clusters are homogenized according to the considered feature. The homogeneity of a cluster with respect to a feature f is characterized by the fraction of nodes of a cluster that have the same value for the input feature. For example, consider a cluster with 6 blue nodes, 3 red nodes and 1 green node. Then the homogeneity of the cluster with respect to the feature 'Color' and feature value 'blue' is 0.6. Generating interpretable clusters requires satisfying a homogeneity threshold β—each cluster is required to have at least a β fraction of nodes with respect to f (Saisubramanian, Galhotra, and Zilberstein 2020).

These constraints, described by a feature f and a threshold β, are summarized in Table 1. Given the set of candidate constraints Ω and demonstrations Λ, LCD aims to identify the constraint, along with its feature and corresponding threshold, that has the maximum likelihood.

Solution Approach
We begin by describing a naive approach to infer the constraint thresholds and discuss its limitations. We then propose an algorithm that infers the constraint threshold and generates clusters using existing clustering algorithms. To extend our approach to handle fairness metrics that are not currently supported by existing algorithms, we present a greedy clustering approach.
Naive algorithm
A naive approach to infer the clustering constraint from a given set of demonstrations Λ is to exhaustively generate all possible clusterings for each type of constraint, its corresponding feature, and threshold. Among these clusterings, the most likely set of clusters corresponds to the one having maximum conformance with the demonstrations Λ. This approach is highly effective in identifying the desired set of clusters but does not scale, given that the fairness constraint threshold can take infinitely many values. For example, the disparate impact constraint ω_GF takes two parameters α, β as input, which can take any value in the range [0, 1]. To efficiently infer the constraint, we build on the following observations.

• k-center clustering (and centroid-based clustering in general) aims to minimize the maximum distance of any node from the cluster center. Therefore, it is very unlikely that a particular node is assigned to the farthest center.

• Our problem can be modeled as a likelihood estimation problem, where the most likely constraint is expected to correspond to the ground truth constraint.

Given a cluster C, we can estimate the most likely threshold of C with respect to a constraint, by following the procedure discussed in the previous section. For example, if a cluster has 3 red nodes and 2 blue nodes, we can infer that the fraction of nodes of each color is at least min(3/5, 2/5). Using this constraint threshold estimation, a simple approach is to estimate the likelihood of different clustering constraints by considering each demonstration as an independent set of clusters and calculating the threshold with respect to each constraint over these clusters. A major drawback of this approach is that a single clustering demonstration generally does not contain representation from all k clusters and feature values for the considered feature. This may mislead the likelihood estimation when a demonstration is considered in isolation.

Example 1.
Consider an optimal clustering for ω_GF, denoted by C_1 = {r_1, r_2, b_1, b_2} and C_2 = {r_3, b_3}, where r_1, r_2, r_3 are the red nodes and b_1, b_2, b_3 are blue colored nodes. Suppose one of the demonstrations is λ = {(r_1, r_2), (r_3)}. Based on this demonstration, the inferred constraint is ω_IC with β = 1, which incorrectly indicates that all the nodes in a cluster have the same color.

Figure 2: Overview of solution approach.

Algorithm 1: Maximum Likelihood Constraint
Input: Demos Λ, Nodes V, Features of interest F
Output: Clusters C
1: for v ∈ Λ do
2:    C ← C ∪ {v}
3: C ← ConstructClusters(Λ)
4: while |C| > k do
5:    C ← MergeClosest(C)
6: T(ω, f) ← 0, ∀ω ∈ Ω, f ∈ F
7: for ω ∈ Ω, f ∈ F do
8:    T(ω, f) ← CalculateThreshold(C, ω, f)
9: for (ω, f) ∈ T do
10:   C_{ω,f} ← Cluster(ω, f, V)
11:   L_{ω,f} ← Likelihood(C_{ω,f}, Λ)
12: (ω*, f*) ← arg max L_{ω,f}; return C_{ω*,f*}

Proposed Algorithm
We present Algorithm 1, which clusters the given demonstrations and processes these clusters to infer the most likely constraint and its parameter values (feature and threshold). Figure 2 presents the high-level architecture of our proposed technique. Given a collection of demonstrations generated by an expert, our algorithm greedily merges them to generate k clusters. These clusters are then used to calculate the likelihood of each fairness constraint, and the clustering with maximum likelihood is inferred.

Algorithm 1 proceeds in two phases. In the first phase (Lines 1-5), the algorithm forms k clusters of the demonstrations Λ. This phase initializes a clustering C over the set of nodes in demonstrations Λ (ConstructClusters(Λ)), which corresponds to the different clusters identified by the expert. Note that the set C may contain more than k clusters. In that case, we greedily merge the closest pair of clusters until k clusters have been identified. The distance between any pair of clusters C_i, C_j ∈ C is measured as the maximum distance between any pair of nodes in C_i and C_j: d(C_i, C_j) = max_{u ∈ C_i, v ∈ C_j} d(u, v).

In the second phase (Lines 6-12), the identified clusters C are processed to calculate the most likely threshold with respect to each feature and constraint (denoted by T). The identified threshold is used to generate a set of k clusters on the original dataset V for each ⟨constraint, feature⟩ pair. At the end of this step, there are |F| × |Ω| clusterings, with one of them corresponding to the intended set of clusters. To identify the set of clusters with maximum likelihood (L), we calculate the accuracy of each clustering with respect to the input demonstrations and return the set of clusters that has the highest accuracy. The accuracy of a set of clusters C is calculated by labeling each pair of nodes as intra-cluster or inter-cluster, and then measuring the fraction of pairs that have the same labels according to C and Λ. The accuracy estimate of the clusters C captures the likelihood of a particular constraint.

Algorithm 2: Greedy Algorithm for Novel Metrics
Input: Demos Λ, Nodes V, Features of interest F
Output: Clusters C
1: for v ∈ V do
2:    C ← C ∪ {v}
3: C ← ConstructClusters(Λ)
4: while |C| > k do
5:    C ← MergeClosest(C)
6: for (ω, f) ∈ Ω × F do
7:    T(ω, f) ← CalculateThreshold(C, ω, f)
8: for (ω, f) ∈ Ω × F do
9:    C_{ω,f} ← greedy adjustment of C to satisfy (ω, f) with threshold T(ω, f)
10:   L_{ω,f} ← Likelihood(C_{ω,f}, Λ)
11: (ω*, f*) ← arg max L_{ω,f}
12: return the clustering corresponding to (ω*, f*)

Complexity.
The first phase of Algorithm 1 is initialized with O(|Λ|) demonstrations, which are iteratively reduced to k clusters. In each iteration, it calculates the distance between pairs of clusters, resulting in O(|Λ|) run time. The second phase considers all combinations of constraints and features, thereby performing clustering |F| × |Ω| times, where F denotes the set of features for each node. Therefore, the run time of Algorithm 1 to calculate clusters over the demonstrations is O(log n), and it takes O(n|F||Ω|) to construct clusters and calculate likelihood.

Algorithm 1 identifies the optimal set of clusters and the maximum likelihood constraints for a given set of demonstrations, assuming that a clustering technique exists for an input constraint. We now present a greedy algorithm that does not rely on such a clustering technique and greedily generates the set of clusters with maximum likelihood.

Greedy Algorithm for Novel Metrics
To handle fairness objectives for which fair clustering algorithms do not currently exist, we present a greedy algorithm that generates k clusters without assuming any knowledge about the clustering algorithm for the input constraints. Our approach is outlined in Algorithm 2. Given a collection of demonstrations Λ and vertices V, the algorithm proceeds in two phases. The first phase of Algorithm 2 (Lines 1-5) is similar to that of Algorithm 1: all nodes are initialized as singleton clusters and all nodes that are grouped together in Λ are merged. The closest pairs of clusters in C are sequentially merged until k clusters have been identified. Let C denote the final set of k clusters.

The second phase (Lines 6-12) begins with estimating the constraint threshold (T), as in Algorithm 1. The estimated threshold is used to greedily post-process the clusters according to each constraint. This greedy processing transfers nodes from one cluster to another, following the constraint requirements, and is similar to local search techniques that move nodes between clusters to satisfy a constraint. At the end of this phase, there are |F| × |Ω| different sets of clusters, each optimizing a different fairness constraint. The clustering that has the highest likelihood with the input demonstrations is returned as the final set of clusters. The likelihood is estimated in terms of the accuracy of pairwise intra-cluster and inter-cluster labels.

Theoretical Analysis
In this section, we analyze the effectiveness of Algorithm 1 in identifying the constraints even when the oracle presents Θ(log n) demonstrations, where n = |V|. We first show that the estimated constraint is accurate with a high probability under the assumption that the oracle chooses nodes uniformly at random. We then extend the analysis to settings where the presented demonstrations are biased towards specific clusters. This analysis assumes that each demonstration λ ∈ Λ has constant size.

Let Ṽ denote the set of nodes that have been clustered in at least one of the demonstrations. Lemma 3 shows that the sample Ṽ contains Θ(log n) nodes from a cluster C* whenever |C*| ≥ n/k.

Lemma 3.
Consider a random sample Ṽ ⊆ V such that |Ṽ| ≥ 10k log n and each node in Ṽ is chosen uniformly at random. Then |Ṽ ∩ C*| ≥ 5 log n for every cluster C* with |C*| ≥ n/k.

Proof. Let X_v be a binary indicator variable such that X_v = 1 if v ∈ Ṽ and 0 otherwise. Since each record v is chosen uniformly at random, Pr[v is chosen] = |Ṽ|/n. Therefore, E[|Ṽ ∩ C*|] ≥ (|Ṽ|/n)·|C*| ≥ 10 log n. Using a Chernoff bound, |Ṽ ∩ C*| ≥ 5 log n with a probability of 1 − 1/n.

Consider a set of ground truth clusters, C* = {C*_1, ..., C*_k}, such that the clusters satisfy one of the clustering constraints ω ∈ Ω. This means that ∃i such that |C*_i| ≥ n/k. For the next part of the proof, we will consider this C*_i to analyze the quality of the estimated constraint threshold.

Lemma 4.
Suppose the optimal cluster C*_i satisfies the constraint ω_GF with parameters [α, β] and |Ṽ ∩ C*_i| = Θ(log n). Then the estimated threshold on processing Ṽ ∩ C*_i is [α(1 − ε), β(1 + ε)] with a high probability.

Proof. Suppose the optimal fairness constraint ω_GF considers a feature f with parameters [α, β]. Let A = {a_1, ..., a_t} denote the domain of values for the feature f. According to the fairness constraint, the fraction of C*_i that has feature value a_j, for every j, is within [α, β]. Let the fraction of nodes with feature value a_j be α_j. We claim that the fraction of nodes with feature value a_j in the sample Ṽ ∩ C*_i is within [α_j(1 − ε), α_j(1 + ε)] with a high probability, where ε is a small constant. Let X_v denote a binary random variable such that X_v is one if v is present in the sample Ṽ and zero otherwise. The expected number of nodes that have feature value a_j and belong to the set Ṽ ∩ C*_i is α_j·|C*_i|·|Ṽ|/n = Θ(log n). Following the proof of Lemma 3 and using a Chernoff bound, we get that the number of nodes with value a_j is within a factor of [1 − ε/2, 1 + ε/2] of its expected value with a high probability. Additionally, the expected number of nodes that belong to the sample is |C*_i ∩ Ṽ| = |C*_i|·|Ṽ|/n, and the number of such nodes is within a factor of [1 − ε/2, 1 + ε/2] with a high probability. Therefore, the ratio of nodes that have feature value a_j and belong to the sample Ṽ ∩ C*_i is always within a factor of [(1 − ε/2)/(1 + ε/2), (1 + ε/2)/(1 − ε/2)] ≈ [1 − ε, 1 + ε] for small values of ε. Taking a union bound over all feature values, we guarantee that the estimated parameter is within a factor of [1 − ε, 1 + ε] with a high probability.

(Our proofs extend to the setting where the demonstration size is Ω(1) too.)

Lemma 5.
Suppose the optimal cluster C∗_i satisfies the constraint ω_IC with parameter β (some constant) and |˜V ∩ C∗_i| = Θ(log n). Then the estimated threshold on processing ˜V ∩ C∗_i is within [β(1 − ε), β(1 + ε)] with a high probability.

Proof. Suppose the optimal cluster C∗_i satisfies ω_IC with parameter β with respect to a feature value a. Therefore, a β fraction of the nodes in C∗_i have the feature value a. To analyze the fraction of nodes with feature value a, we define a binary random variable X_v for each v such that X_v = 1 if v ∈ ˜V and X_v = 0 otherwise. The expected number of nodes with feature value a in the sample ˜V ∩ C∗_i is β |C∗_i| |˜V|/n. Following the analysis of Lemma 4, the fraction of nodes with feature value a is within [β(1 − ε), β(1 + ε)] with probability 1 − 1/n.

Lemma 6.
Suppose the optimal cluster C∗_i satisfies the constraint ω_EQ with parameter β and |˜V ∩ C∗_i| = Θ(log n). Then the estimated threshold on processing ˜V ∩ C∗_i is within [β(1 − ε), β(1 + ε)] with a high probability.

Proof. The analysis is similar to that of Lemma 5.

Lemmas 4, 5 and 6 show that the parameter estimated from a cluster C∗_i with respect to the considered fairness constraints is within a factor [1 − ε, 1 + ε] of the true constraint threshold with a high probability. Using these results, we prove the following theorem.

Theorem 7.
Given a collection of nodes V and Θ(log n) randomly chosen globally informative demonstrations Λ, each of which reveals the true cluster affiliation of a constant number of records, the optimal cluster constraint is identified within a multiplicative factor [1 − ε, 1 + ε] with a high probability.

Proof. Let Λ denote a collection of globally informative demonstrations such that |Λ| = Θ(log n) and let ˜V = ∪_{λ_g ∈ Λ} λ_g. Using Lemma 3, we know that |˜V ∩ C∗_i| = Θ(log n) for all C∗_i containing Θ(n/k) nodes, and therefore, using Lemmas 4, 5 and 6, we are guaranteed to estimate the correct threshold for the cluster C∗_i. Hence, Algorithm 1 correctly estimates the constraint with maximum likelihood with Θ(log n) globally informative demonstrations.

Remark 8.
In this section we do not optimize for the constants in the Θ notation because Algorithm 1 empirically converges in fewer than n demonstrations.

We extend the proof of Theorem 7 to the setting where the demonstrations are not globally informative but the ground truth clusters satisfy an interesting property, similar to the γ-margin property studied in prior work (Ashtiani, Kushagra, and Ben-David 2016). We first define the margin property. Let ˜V denote a subset of nodes and C∗ denote the set of clusters corresponding to the optimal constraint. The set ˜V is said to satisfy the margin property if d(u, x) > d(u, v) whenever u, v ∈ C∗_i ∩ ˜V and x ∈ ˜V \ C∗_i.

Theorem 9.
Given a collection of nodes V and Θ(log n) randomly chosen demonstrations Λ, each of which reveals the clustering over a subset of nodes, the optimal cluster constraint is identified within a multiplicative factor [1 − ε, 1 + ε] with a high probability if the sampled nodes ∪_{λ ∈ Λ} λ satisfy the margin property.

Proof. Let Λ denote a collection of demonstrations such that |Λ| = Θ(log n) and let ˜V = ∪_{λ ∈ Λ} λ. Using Lemma 3, we know that |˜V ∩ C∗_i| = Θ(log n) for all C∗_i containing Θ(n/k) nodes. This guarantees that Θ(log n) nodes are sampled from C∗_i, but we may not have merged all of these nodes into a single cluster. To show that the nodes present in a merged cluster (after Line 5 of Algorithm 1) belong to the same optimal cluster, we use the margin property, which ensures that nodes belonging to the same cluster are closer to each other than to nodes of any other cluster. Therefore, MergeClosest always merges a pair of clusters that belong to the same optimal cluster C∗_i, guaranteeing its correctness. Since C∗_i has been constructed correctly, the rest of the proof is the same as that of Theorem 7.

Theorem 10.
Given a collection of nodes V and Θ(log n) randomly chosen demonstrations Λ, each of which reveals the clustering over a subset of nodes, Algorithm 2 recovers the ground truth clusters with a high probability if the nodes V satisfy the margin property.

Proof. The analysis is similar to that of Theorem 9.
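The margin property check and a MergeClosest-style step can be sketched as follows. This is an illustrative reconstruction: the paper's exact MergeClosest subroutine and data representation are not shown in this excerpt, so the function names, input formats, and the single-linkage merge rule below are our own assumptions.

```python
def satisfies_margin(nodes, label, dist):
    """Margin property: every same-cluster pair (u, v) must be strictly
    closer than any cross-cluster pair (u, x). `label` maps a node to its
    ground-truth cluster id and `dist` is a distance function."""
    for u in nodes:
        same = [v for v in nodes if v != u and label(v) == label(u)]
        other = [x for x in nodes if label(x) != label(u)]
        if any(dist(u, x) <= dist(u, v) for v in same for x in other):
            return False
    return True


def merge_closest(clusters, dist):
    """One MergeClosest-style step: merge the pair of partial clusters with
    the smallest single-linkage distance. Under the margin property this
    never combines nodes drawn from two different optimal clusters while
    same-cluster merges remain available."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(dist(u, v) for u in clusters[i] for v in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    merged = clusters[i] + clusters[j]
    rest = [c for t, c in enumerate(clusters) if t not in (i, j)]
    return rest + [merged]


# Two well-separated 1-D clusters satisfy the margin property.
pts = [0.0, 0.1, 0.2, 5.0, 5.1]
label = lambda p: 0 if p < 2 else 1
d = lambda a, b: abs(a - b)
```

Here `satisfies_margin(pts, label, d)` returns True, and `merge_closest([[0.0], [0.2], [5.0]], d)` merges the two points from the same underlying cluster first, as the proof of Theorem 9 requires.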
Discussion.
The analysis of Theorem 9 assumed the margin property over the sampled nodes. In most real-world datasets, clusters are often well separated, which automatically implies the margin property. Additionally, even if the margin property does not hold over the full clusters, the expert can choose samples for the demonstration such that samples from different clusters are sufficiently far apart. The proof of Theorem 9 can be extended to settings where a constant fraction of the sampled nodes do not obey the margin property.

Another assumption that is crucial in the analysis presented above is the randomness of the sampled nodes. Theorems 7 and 9 assume that every node is chosen uniformly at random. These assumptions can be relaxed, and our proofs extend to settings where the samples are biased towards a specific cluster. For example, the number of samples from a specific cluster (say C∗_i) may be much higher than Θ(log n) while the samples from other clusters are much fewer. In this case, Algorithm 1 will correctly estimate the threshold from C∗_i with fewer demonstrations, but it may require more demonstrations to achieve accurate estimates for the other clusters.

Experiment Setup
In this section, we evaluate the effectiveness of LCD on three real-world datasets. We show that our techniques efficiently calculate the true likelihood of each constraint and that the generated set of clusters is closer to the desired output than those of the other baselines.
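The likelihood calculation that drives this evaluation can be sketched as follows. This is a hypothetical simplification: the excerpt does not spell out Algorithm 1's likelihood model, so the function name, the input format (a list of demonstrations, each a list of clusters), and the scoring rule (fraction of demonstrated clusters consistent with a candidate constraint) are our own assumptions.

```python
def constraint_likelihood(demonstrations, candidate_constraints):
    """Score each candidate constraint by the fraction of demonstrated
    clusters that satisfy it, and return the maximum-likelihood candidate.
    `candidate_constraints` maps a constraint name to a predicate that
    tests a single cluster."""
    clusters = [c for demo in demonstrations for c in demo]
    scores = {
        name: sum(1 for c in clusters if satisfies(c)) / len(clusters)
        for name, satisfies in candidate_constraints.items()
    }
    return max(scores, key=scores.get), scores


# Example: every demonstrated cluster balances the two groups, so a
# balance-style constraint scores 1.0 and a homogeneity constraint 0.0.
demos = [[["r", "b"], ["r", "b"]], [["r", "r", "b", "b"]]]
candidates = {
    "gf_balanced": lambda c: c.count("r") == c.count("b"),
    "homogeneous": lambda c: len(set(c)) == 1,
}
best, scores = constraint_likelihood(demos, candidates)
```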
Datasets
We evaluate our approach on three datasets, which are borrowed from prior work that experiments with the metrics of interest.

• Bank dataset (Bera, Chakrabarty, and Negahbani 2019), containing 4521 data nodes corresponding to phone calls from a marketing campaign by a Portuguese banking institution. The marital status of the records is considered as the sensitive feature for the ω_GF constraint.

• Adult dataset (Saisubramanian, Galhotra, and Zilberstein 2020), containing records with the income information of individuals along with their demographic attributes. ‘Age’, ‘occupation’, and ‘income’ are considered as the features of interest. The fairness constraint ω_EQ is optimized with respect to ‘occupation’ and ω_IC with respect to ‘age’ and ‘income’.

• Crime dataset (Saisubramanian, Galhotra, and Zilberstein 2020), containing crime information about different communities in the United States, where ‘number of crimes per 100K population’ is used for the ω_IC fairness constraint.

The features in these datasets are used to calculate the distance between every pair of nodes: Euclidean distance between numerical attributes and Jaccard distance between categorical attributes. Please refer to Bera, Chakrabarty, and Negahbani (2019) and Saisubramanian, Galhotra, and Zilberstein (2020) for more details.

Baselines
We compare the results of our techniques with the following baselines:

• B1 calculates the likelihood by considering each demonstration as a separate set of clusters;

• B2 merges the different clusters in the demonstrations to identify k clusters and infers the likelihood over the identified clusters; and

• B3 performs a grid search over all possible fairness constraints and identifies the clustering that conforms with the generated demonstrations.

Figure 3: Comparison of estimated constraints for different datasets: (a) Bank, ω_GF; (b) Adult, ω_EQ; (c) Crime, ω_IC; (d) Adult, ω_IC.

Algorithm 1 is referred to as Alg1 and Algorithm 2 as Alg2 in all the plots in this section. The unconstrained k-center clustering technique is labeled kC.

Setup
We use open-source implementations of ω_GF and ω_IC, and contacted the authors of Galhotra, Saisubramanian, and Zilberstein (2019) for ω_EQ. Their code base was used to generate ground truth clusters for an input constraint requirement. All algorithms were implemented in Python and tested on an Intel Core i5 computer with 16GB of RAM.

Our experiments compare the fairness parameter identified by our algorithm and by each baseline. To compare the quality of the identified clusters, we compute the F-score of the identified intra-cluster pairs of nodes. F-score denotes the harmonic mean of precision and recall, where precision refers to the fraction of identified intra-cluster pairs that are correct and recall refers to the fraction of true intra-cluster pairs that are identified by our algorithm. In all experiments, we report results with k = 5. We execute the code of the constrained clustering techniques with the specified parameters to generate the ground truth clustering. Each demonstration is generated by sampling a subset of five nodes randomly from these clusters. Unless otherwise specified, we consider n demonstrations as input, and these demonstrations do not reveal the true cluster affiliation of the considered nodes. In case there are multiple constraints that generate the same set of demonstrations, the algorithm output is considered correct if it identifies any one of those constraints.

Results and Discussion
Effectiveness of Algorithm 1
The effectiveness of Algorithm 1 is measured based on the constraint threshold and the quality of the generated clusters. Figure 3 compares the estimated threshold of the most-likely constraint, calculated by Algorithm 1, with the ground truth and the other baselines. Across all datasets, Algorithm 1 estimates the optimal threshold for every considered constraint, matching the ground truth. This validates the effectiveness of Algorithm 1 in correctly estimating the most likely constraint and its corresponding threshold.

Among the baselines, B3 achieves a similar performance. This is expected behavior, since B3 performs a naive grid search to explore all threshold values. (Among the considered constraints, the situation where multiple constraints generate the same set of demonstrations does not arise whenever |Λ| is sufficiently large.)

Figure 4: F-score comparison for different datasets: (a) Adult, ω_IC, β = 1; (b) Adult, ω_IC.

Although B3 is effective in inferring the threshold, it is orders of magnitude less efficient due to the exhaustive enumeration of clusters over the different sets of constraints, features, and their respective thresholds. It is therefore practically infeasible to use for problems with large input graphs and a large Ω.

The other baselines, B1 and B2, consistently show poor performance. Baseline B1 does not identify any fairness constraint in settings where the demonstrations obey ω_GF and ω_EQ (Figures 3(a) and 3(b), respectively); it identifies the optimal clustering constraint only in the case of ω_IC. Given that each demonstration contains only a few nodes, the information available in a single demonstration is not sufficient for B1 to infer the true fairness constraint. On the other hand, B2 overcomes this limitation of B1 by merging the demonstrations randomly in order to capture constraint information over all demonstrations collectively.
This approach has better performance than B1 but does not identify the true clustering constraint in the majority of cases. It does not identify the fairness constraint ω_EQ (Figure 3(b)), and the identified constraint threshold in all other cases is sub-optimal.

Figure 4 compares the quality of the returned clusters by comparing the F-score of the clustering output of each technique against the ground truth clusters. In this experiment, Algorithm 1 and B3 achieve optimal performance, as they identify the true ground truth clusters across all parameter settings. The other baselines did not identify the clusters correctly and achieved low F-scores. In particular, in the case of ω_EQ and ω_GF, the baselines B1 and B2 did not identify the optimal constraint threshold and generated biased clusters.

Table 2 compares the running time of Alg1 and the other baselines (B1, B2, B3) on the Bank, Adult, and Crime datasets for the different clustering constraints.

Figure 5: Effect of the number of demonstrations on Alg2 performance: (a) Bank, ω_GF; (b) Adult, ω_IC.

Across all datasets, Alg1 is orders of magnitude faster than B3. In the worst case, Alg1 generates O(|Ω| × |F|) sets of clusters, whereas B3 generates clusters exhaustively for every value of the constraint threshold. The running time of Alg1 is comparable with that of B1 and B2.

Effectiveness of Algorithm 2
Alg2 identifies k clusters such that the returned output obeys the fairness constraint reflected in the demonstrations Λ. Figure 5 plots the F-score of Alg2 for two datasets, and the results are compared with those of Alg1. This allows us to compare the performance of our greedy Alg2 with that of an existing efficient solver. In Figure 5(a), we employed the approach of Bera, Chakrabarty, and Negahbani (2019) to generate the ground truth clusters according to ω_GF and tested the effectiveness of Alg2 in recovering the ground truth clusters for a varying number of demonstrations. Similarly, in Figure 5(b), the ground truth is generated using ω_IC.

When the number of demonstrations is small, the F-score of the generated clusters is low for both domains. As we increase the number of demonstrations, the performance of Alg2 improves and approaches that of Alg1, achieving a high F-score with relatively few demonstrations. This continuous improvement in accuracy demonstrates the effectiveness of Alg2 in recovering clusters without relying on a clustering algorithm.
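The pairwise F-score used throughout these comparisons can be computed as in this sketch; the input format (clusterings as lists of node-id lists) is our assumption.

```python
from itertools import combinations


def pair_f_score(predicted, truth):
    """F-score over intra-cluster pairs, as described in the setup:
    precision is the fraction of predicted same-cluster pairs that are
    truly same-cluster, and recall is the fraction of true same-cluster
    pairs that are recovered."""
    def intra_pairs(clusters):
        pairs = set()
        for c in clusters:
            pairs.update(frozenset(p) for p in combinations(sorted(c), 2))
        return pairs

    pred, true = intra_pairs(predicted), intra_pairs(truth)
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a clustering that exactly matches the ground truth scores 1.0, while `pair_f_score([[1, 2, 3], [4]], [[1, 2], [3, 4]])` has precision 1/3 and recall 1/2, giving an F-score of 0.4.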
Effect of Demonstrations
We now investigate the effect of the number of demonstrations on the performance of our techniques in identifying the optimal constraint threshold. We varied the number of demonstrations in multiples of log n. Figure 6 compares the constraint threshold and the F-score of the clusters identified by Alg1, with a varying number of demonstrations, on the Bank and Adult datasets. In the case of ω_GF, the ground truth constraint requires equal representation of the different groups in each cluster. Algorithm 1 correctly identifies the fairness constraint and achieves a perfect F-score even when Λ contains as few as four demonstrations.

Figure 6: Effect of the number of demonstrations on Alg1 performance: (a) Bank, ω_GF; (b) Adult, ω_IC.

Figure 7: Effect of sampling bias on Alg1 performance: (a) Bank, ω_GF; (b) Adult, ω_IC.

Increasing the number of demonstrations does not improve its performance further, as the constraint likelihood has already converged. In the case of ω_IC, the ground truth clusters are generated according to a fixed threshold β. When the number of input demonstrations |Λ| is low (|Λ| = 4), the estimated interpretability constraint threshold is inaccurate, and the estimate improves as the number of demonstrations is increased. Algorithm 1 achieves a high F-score with just ten demonstrations, and the quality of the final clusters improves monotonically with increasing demonstrations. It converges to the accurate constraint threshold once |Λ| is large enough, and therefore achieves a perfect F-score.

In Figure 3, the input demonstrations do not reveal the true cluster affiliation of any of the nodes. We ran an additional experiment with globally informative demonstrations (Definition 2), which reveal the ground truth cluster affiliation of each node in the demonstration. With this additional information, we observe that Algorithm 1 converges to the optimal constraint threshold in fewer than ten demonstrations. This experiment validates that Algorithm 1 is able to leverage the extra information provided by globally informative demonstrations to converge faster.

Next, we evaluate the effect of the number of demonstrations on the performance of Algorithm 2. Figure 5 shows that as we increase the number of demonstrations, Alg2 matches the F-score of Alg1.

Ablation Study
To test the effectiveness of our constraint estimation techniques, we varied the size of the demonstrations for the different constraints. (We do not consider very small demonstrations because clustering only a handful of nodes does not reveal information about the underlying clusters.) As expected, the number of required demonstrations reduces linearly with an increase in demonstration size; larger demonstrations therefore help Alg1 converge faster.

We tested the robustness of our constraint threshold estimation techniques by generating demonstrations according to a biased distribution. In the first experiment (Figure 7), we employed a biased sampling procedure where each demonstration is biased in favor of some specific clusters, but all nodes within those chosen clusters are equally likely to be chosen for the demonstration. Specifically, we follow a two-step procedure: we first sample a cluster C_i with probability p_i, and then the nodes from the sampled cluster are chosen randomly. This introduction of bias did not affect the quality of our techniques, and Alg1 was able to recover the ground truth clusters with Θ(log n) demonstrations. The second experiment considered a biased sampling procedure where the expert samples fewer nodes from the marginalized groups; for example, a ‘red’ node is chosen with a much lower probability than a ‘blue’ node. In such a setting, the returned demonstrations are biased against the marginalized groups, and the inferred clustering threshold is not accurate. We observe that this bias translates into the constraint threshold estimation procedure of Alg1. This experiment justifies the requirement of an unbiased expert annotator who chooses nodes randomly, without considering their sensitive attributes.

To further study the effect of k, we varied the number of clusters on the Adult dataset and calculated the number of demonstrations required to identify the true clustering constraint. For all values of k, Alg1 identified the optimal set of clusters, and the number of required demonstrations increases sub-linearly with k. This increase is justified because Alg1 tries to merge the presented demonstrations into k clusters: if the number of clusters in the presented demonstrations is smaller than k, it might end up partitioning some clusters, which may introduce noise into the likelihood estimation procedure. However, when the input demonstrations are globally informative, the number of required demonstrations does not increase with k (we therefore do not include those plots); Alg1 converges to the optimal clustering constraint as soon as there are Θ(log n) nodes from any of the clusters.

Fair and Interpretable Clusters

To further evaluate the effectiveness of generating fair and interpretable clusters, we ran the interpretable clustering algorithm of Saisubramanian, Galhotra, and Zilberstein (2020) with β = 1 on the Adult dataset. The generated clusters were then post-processed to satisfy ω_EQ. Since none of the current clustering algorithms optimize for both fairness and interpretability, we implemented a greedy technique that processes the output of interpretable clustering to satisfy the fairness constraint. We considered this output as the ground truth, generated globally informative demonstrations Λ, and ran Alg2 to calculate the set of clusters with maximum likelihood.

Figure 8: F-score of fair and interpretable clusters generated by different techniques.
Alg2 achieved a high F-score (Figure 8) with a small number of demonstrations of a few nodes each. Any baseline that optimizes ω_IC or ω_EQ alone achieves sub-optimal performance. This experiment demonstrates the ability of Alg2 to generate clusters even when a constraint optimization algorithm is not known. Additionally, Alg2 requires the expert to label only a small fraction of the dataset to generate fair and interpretable clusters.
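The greedy post-processing step mentioned above is not detailed in the text, so the following is a hypothetical sketch: it enforces an equal-representation (ω_EQ-style) constraint by moving members of over-represented groups into clusters where the group is under-represented. The function name, input format, and move rule are our own assumptions, not the paper's procedure.

```python
def greedy_equalize(clusters, group):
    """Greedily rebalance clusters so that each group has (as close as
    possible to) equal counts per cluster: repeatedly move one node of an
    over-represented group from the cluster with the most members of that
    group to the cluster with the fewest. `group` maps a node to its
    sensitive group."""
    clusters = [list(c) for c in clusters]
    groups = sorted({group(v) for c in clusters for v in c})
    for g in groups:
        counts = [sum(1 for v in c if group(v) == g) for c in clusters]
        target = sum(counts) // len(clusters)
        while max(counts) > target and min(counts) < target:
            src = counts.index(max(counts))
            dst = counts.index(min(counts))
            v = next(v for v in clusters[src] if group(v) == g)
            clusters[src].remove(v)
            clusters[dst].append(v)
            counts[src] -= 1
            counts[dst] += 1
    return clusters
```

For example, starting from the skewed clusters `[["r1","r2","r3","b1"], ["r4","b2","b3","b4"]]` with the group given by the first character, the sketch moves one ‘r’ node and one ‘b’ node so that each cluster ends up with two members of each group.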
Summary and Future Work
With the availability of many nuanced fairness definitions, it is non-trivial to specify a fairness metric that captures what we intend. As a result, systems may be deployed with an incomplete specification of the fairness metric, which leads to biased outcomes. We formalize the problem of inferring the complete specification of the fairness metric that the designer intends to optimize for a given problem. We present an algorithm that generates fair clusters by inferring the fairness constraint from expert demonstrations, and analyze its theoretical guarantees. We also present a greedy approach to generate fair clusters for objectives that are not currently supported by the existing suite of fair clustering algorithms. To the best of our knowledge, our algorithm is the first to combine graph clustering and learning from demonstrations, particularly to improve fairness. We empirically demonstrate the effectiveness of our approach in inferring fairness and interpretability metrics, and then generating clusters that are fair and interpretable. Although we discuss the framework in the context of fair clustering, our proposed framework can be used to infer any clustering constraint, as shown in the experiments.

In the future, we plan to conduct a human subjects study to evaluate our approach and to design robust algorithms that infer the intended metrics in the presence of noise. Developing robust techniques to handle bias in demonstrations is another interesting question for future work. Extending our algorithm to handle other fairness and interpretability metrics will broaden the scope of problems that can be handled by our approach.

References
Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning.

Ahmadian, S.; Epasto, A.; Kumar, R.; and Mahdian, M. 2019. Clustering without over-representation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Aljrees, T.; Shi, D.; Windridge, D.; and Wong, W. 2016. Criminal Pattern Identification Based on Modified K-means Clustering. In IEEE International Conference on Machine Learning and Cybernetics, volume 2, 799–806.

Anderson, N.; Bera, S. K.; Das, S.; and Liu, Y. 2020. Distributional Individual Fairness in Clustering. arXiv preprint arXiv:2006.12589.

Ashtiani, H.; Kushagra, S.; and Ben-David, S. 2016. Clustering with same-cluster queries. In Advances in Neural Information Processing Systems, 3216–3224.

Bera, S. K.; Chakrabarty, D.; and Negahbani, M. 2019. Fair algorithms for clustering. arXiv preprint arXiv:1901.02393.

Binns, R. 2018. Fairness in machine learning: Lessons from political philosophy. In Conference on Fairness, Accountability and Transparency.

Brams, S. J.; and Taylor, A. D. 1996. Fair Division: From Cake-cutting to Dispute Resolution. Cambridge University Press.

Celis, L. E.; Huang, L.; and Vishnoi, N. K. 2018. Multiwinner voting with fairness constraints. In Proceedings of the 27th International Joint Conference on Artificial Intelligence.

Celis, L. E.; Straszak, D.; and Vishnoi, N. K. 2018. Ranking with Fairness Constraints. In Proceedings of the 45th International Colloquium on Automata, Languages, and Programming.

Chierichetti, F.; Kumar, R.; Lattanzi, S.; and Vassilvitskii, S. 2017. Fair clustering through fairlets. In Advances in Neural Information Processing Systems.

Ding, H. 2020. Faster balanced clusterings in high dimension. Theoretical Computer Science.

Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; and Zemel, R. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214–226.

Feldman, M.; Friedler, S. A.; Moeller, J.; Scheidegger, C.; and Venkatasubramanian, S. 2015. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Firmani, D.; Saha, B.; and Srivastava, D. 2016. Online entity resolution using an oracle. Proceedings of the VLDB Endowment.

Galhotra, S.; Brun, Y.; and Meliou, A. 2017. Fairness testing: Testing software for discrimination. In Proceedings of the 11th Joint Meeting on Foundations of Software Engineering.

Galhotra, S.; Firmani, D.; Saha, B.; and Srivastava, D. 2018. Robust entity resolution using random graphs. In Proceedings of the International Conference on Management of Data, 3–18.

Galhotra, S.; Saisubramanian, S.; and Zilberstein, S. 2019. Lexicographically Ordered Multi-Objective Clustering. arXiv preprint arXiv:1903.00750.

Gillen, S.; Jung, C.; Kearns, M.; and Roth, A. 2018. Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems, 2600–2609.

Haraty, R. A.; Dimishkieh, M.; and Masud, M. 2015. An Enhanced K-means Clustering Algorithm for Pattern Discovery in Healthcare Data. International Journal of Distributed Sensor Networks.

Hilgard, S.; Rosenfeld, N.; Banaji, M. R.; Cao, J.; and Parkes, D. C. 2019. Learning representations by humans, for humans. arXiv preprint arXiv:1905.12686.

Hospers, G.-J.; Desrochers, P.; and Sautet, F. 2009. The Next Silicon Valley? On the Relationship Between Geographical Clustering and Public Policy. International Entrepreneurship and Management Journal.

arXiv preprint arXiv:1906.00250.

Kamishima, T.; Akaho, S.; Asoh, H.; and Sakuma, J. 2012. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases.

Kleindessner, M.; Awasthi, P.; and Morgenstern, J. 2019. Fair k-center clustering for data summarization. In International Conference on Machine Learning.

In Proceedings of the 37th International Conference on Machine Learning.

Mazumdar, A.; and Saha, B. 2017a. Clustering with noisy queries. In Advances in Neural Information Processing Systems, 5788–5799.

Mazumdar, A.; and Saha, B. 2017b. Query complexity of clustering with side information. In Advances in Neural Information Processing Systems, 4682–4693.

Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; and Galstyan, A. 2019. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635.

Saisubramanian, S.; Galhotra, S.; and Zilberstein, S. 2020. Balancing the Tradeoff Between Clustering Value and Interpretability. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, 351–357.

Saisubramanian, S.; Kamar, E.; and Zilberstein, S. 2020. A Multi-Objective Approach to Mitigate Negative Side Effects. In Proceedings of the 29th International Joint Conference on Artificial Intelligence.

Thomson, W. 1983. Problems of fair division and the egalitarian solution. Journal of Economic Theory.

Vazirani, V. V. 2013. Approximation Algorithms. Springer Science & Business Media.

Verma, S.; and Rubin, J. 2018. Fairness definitions explained. In IEEE/ACM International Workshop on Software Fairness.

Vesdapunt, N.; Bellare, K.; and Dalvi, N. 2014. Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment.